This repository was archived by the owner on Dec 22, 2023. It is now read-only.
Add Image-scrapper script #91
Closed
Changes from 5 commits
Commits (13, all by rohitjoshi6):
9db5463 Create requirements.txt
ff354cb Add files via upload
148efa0 Create README.md
11ef164 Update Scraper.py
db685a4 Update README.md
db4c64f Merge branch 'master' of https://github.com/Python-World/Python_and_t…
40eb676 Added versions of modules
b29362d Update requirements.txt
a6a7e25 Merge branch 'master' of https://github.com/Python-World/Python_and_t…
79910cd Update README.md
9f124ec Merge remote-tracking branch 'origin' into Scrapping
decaa6b Adding py script
f543223 Add py script
README.md (new file)
@@ -0,0 +1,23 @@
# Images Scraper

#### This script scrapes images from a URL and stores them in your local folder.

# Pre-requisites:

#### Run the following command:
```bash
pip install -r requirements.txt
```
# Instructions to run the script:

#### Run the command:
```bash
python Scraper.py
```
# Screenshot (images saved in local folder):



# Author Name:

[Rohit Joshi](https://github.com/rohitjoshi6)
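
Reviewer note (context only, not part of the diff): the README describes fetching a page and saving the images it links to into a local folder. A minimal sketch of that flow, assuming a placeholder page URL and that the images are exposed through ordinary <img src=...> tags, might look like this:

```python
# Hedged sketch of the flow the README describes; PAGE_URL is a placeholder
# and the <img src=...> assumption will not hold for every site.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://example.com/gallery"  # placeholder, not from the PR
SAVE_FOLDER = "images"

os.makedirs(SAVE_FOLDER, exist_ok=True)

html = requests.get(PAGE_URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for i, img in enumerate(soup.find_all("img"), start=1):
    src = img.get("src")
    if not src:
        continue
    image_url = urljoin(PAGE_URL, src)  # resolve relative src paths
    data = requests.get(image_url, timeout=10).content
    with open(os.path.join(SAVE_FOLDER, f"{i}.jpg"), "wb") as f:
        f.write(data)
```

The urljoin call matters because src attributes are frequently relative paths rather than absolute URLs.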
Scraper.py
@@ -1,25 +1,75 @@
-import requests
-from bs4 import BeautifulSoup
-import urllib.request
-import random
+import os
+import json
+import requests # to send GET requests
+from bs4 import BeautifulSoup # to parse HTML

-url="https://www.creativeshrimp.com/top-30-artworks-of-beeple.html"
+# user can input a topic and a number
+# download first n images from google image search

-source_code=requests.get(url)
+GOOGLE_IMAGE = \
+    'https://www.google.com/search?site=&tbm=isch&source=hp&biw=1873&bih=990&'

-plain_text=source_code.text
+# The User-Agent request header contains a characteristic string
+# that allows the network protocol peers to identify the application type,
+# operating system, and software version of the requesting software user agent.
+# needed for google search
+usr_agent = {
+    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
+    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
+    'Accept-Encoding': 'none',
+    'Accept-Language': 'en-US,en;q=0.8',
+    'Connection': 'keep-alive',
+}

-soup=BeautifulSoup(plain_text)
+SAVE_FOLDER = 'images'


-for link in soup.find_all("a",{"class":"lightbox"}):
-    href=link.get('href')
-    print(href)
+def main():
+    if not os.path.exists(SAVE_FOLDER):
+        os.mkdir(SAVE_FOLDER)
+    download_images()

-    img_name=random.randrange(1,500)
+def download_images():
+    # ask for user input
+    data = input('What are you looking for? ')
+    n_images = int(input('How many images do you want? '))

+    print('Start searching...')

-    full_name=str(img_name)+".jpg"
+    # get url query string
+    searchurl = GOOGLE_IMAGE + 'q=' + data
+    print(searchurl)

+    # request url, without usr_agent the permission gets denied
+    response = requests.get(searchurl, headers=usr_agent)
+    html = response.text

+    # find all divs where class='rg_meta'
+    soup = BeautifulSoup(html, 'html.parser')
+    results = soup.findAll('div', {'class': 'rg_meta'}, limit=n_images)

-    urllib.request.urlretrieve(href,full_name)
-    print("Loop Break")
+    # extract the link from the div tag
+    imagelinks= []
+    for re in results:
+        text = re.text # this is a valid json string
+        text_dict= json.loads(text) # deserialize json to a Python dict
+        link = text_dict['ou']
+        # image_type = text_dict['ity']
+        imagelinks.append(link)

+    print(f'found {len(imagelinks)} images')
+    print('Start downloading...')

+    for i, imagelink in enumerate(imagelinks):
+        # open image link and save as file
+        response = requests.get(imagelink)

+        imagename = SAVE_FOLDER + '/' + data + str(i+1) + '.jpg'
+        with open(imagename, 'wb') as file:
+            file.write(response.content)

+    print('Done')


+if __name__ == '__main__':
+    main()
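
Reviewer note (not part of the diff): the parsing above relies on Google's 'rg_meta' div class and the 'ou' JSON key, which are tied to Google's result markup and can change at any time, and the download loop hard-codes a .jpg extension with no error handling. A hedged sketch of a more defensive saving loop, assuming an imagelinks list like the one built in download_images():

```python
# Hedged alternative download loop (assumption: imagelinks is a list of image
# URLs, as produced above); not part of the PR itself.
import os

import requests


def save_images(imagelinks, save_folder="images", prefix="image"):
    os.makedirs(save_folder, exist_ok=True)
    for i, url in enumerate(imagelinks, start=1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException as exc:
            print(f"skipping {url}: {exc}")  # one bad link should not stop the run
            continue
        # pick the extension from the Content-Type instead of hard-coding .jpg
        ext = ".png" if "png" in resp.headers.get("Content-Type", "") else ".jpg"
        with open(os.path.join(save_folder, f"{prefix}{i}{ext}"), "wb") as f:
            f.write(resp.content)
```

requests.RequestException covers both connection failures and the HTTP errors raised by raise_for_status(), so individual broken links are skipped rather than crashing the script.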
requirements.txt (new file)
@@ -0,0 +1,5 @@
requests
bs4
BeautifulSoup
urllib.request
random

Review comments on requirements.txt:
- Reviewer: "add appropriate module version using command"
- rohitjoshi6: "You mean I should add the versions of the modules used in the requirements.txt file, right?"
- rohitjoshi6: "It's showing that there's a merge conflict. How can I solve this issue?"
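
Reviewer note on the version question: urllib.request and random are part of the Python standard library, so they cannot be installed from PyPI and do not belong in requirements.txt, and the BeautifulSoup import in Scraper.py comes from the beautifulsoup4 distribution (imported as bs4). Versions are normally pinned with pip freeze > requirements.txt; a small hedged sketch for printing pinned lines for just the third-party packages the script uses (assumes Python 3.8+ for importlib.metadata):

```python
# Print pinned requirement lines for the third-party packages Scraper.py imports.
# The distribution names (requests, beautifulsoup4) are assumptions based on the imports.
from importlib.metadata import version

for dist in ("requests", "beautifulsoup4"):
    print(f"{dist}=={version(dist)}")
```

The output (for example requests==<installed version>) can be pasted directly into requirements.txt.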