<a href="https://colab.research.google.com/github/Ruqyai/Web-Scraping/blob/master/Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping

<h1>Basic Steps:</h1>

* Find the URL that you want to scrape.
* Inspect the elements of the page.
* Fetch the HTML page using the URL.
* Find the data you want to extract.
* Write the code for scraping the specific HTML tag that you want.
* Parse it if that's necessary.
* Run the code and extract the data.
* Store the data in the required format, e.g. CSV.
* You can use pandas to display the result before or after saving it.



## Find the URL that you want to scrape.

The URL that we will use:

http://www.pythonscraping.com/pages/page3.html

Remember to check the site's robots.txt file before scraping:

http://www.pythonscraping.com/robots.txt
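
Python's standard library can check a robots.txt policy before you scrape. The sketch below parses a made-up policy so that it runs without network access: the `SAMPLE_ROBOTS_TXT` rules and the `is_allowed` helper are illustrative assumptions, not the contents of the real file at the URL above. For a live site, you could call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path the sample policy does not mention, then one it disallows.
print(is_allowed(SAMPLE_ROBOTS_TXT, "http://www.pythonscraping.com/pages/page3.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "http://www.pythonscraping.com/private/data.html"))
```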


You need basic knowledge of HTML and CSS, and you’ll need to understand the site's structure to extract the information you’re interested in.

![alt text](https://github.com/Ruqyai/Web-Scraper/blob/master/web/html.png?raw=true)

## Inspecting the elements of the page

<img src="https://3.bp.blogspot.com/-AD-DM0S6tPc/VJE3ZPA9PUI/AAAAAAAADtw/-FTCawVS1_s/s1600/inspect-element.gif" width=100% />


<img src="https://lh3.googleusercontent.com/proxy/8oXttju3H-nUClaQGPSfDihbVFbxzGv70G0dRST1kqVW2NKLe788t6MyIdpa6_ETppPDivHaKDefGProsotCm-E52FFHiQpO_cZgAE6cgna4ECVu8Tv66wf2DpYf6Q31og" width=100% />


<img src="https://iamvdo.me/content/01-blog/32-le-debug-css-est-difficile/2-apple-inspecting.gif" width=100% />


# Example 1

## Fetch the HTML page using the URL.

In [26]:
from urllib.request import urlopen  # opens URLs
from bs4 import BeautifulSoup  # pulls data out of HTML and XML files
import re  # regular expressions


html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, "html.parser")
print(bs)

<html>
<head>
<style>
img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p></div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) frien
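
The fetch at the start of this example raises an exception if the page is unreachable, and it never closes the connection explicitly. A slightly more defensive version might look like the sketch below; the `fetch_html` name and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the decoded body at url, or None on an HTTP or URL error.

    fetch_html and the 10-second default are illustrative assumptions.
    """
    try:
        # The with block closes the connection even if reading fails.
        with urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (HTTPError, URLError) as error:
        print(f"Could not fetch {url}: {error}")
        return None
```

A call such as `fetch_html('http://www.pythonscraping.com/pages/page3.html')` returns a string that can be passed to `BeautifulSoup(..., "html.parser")` once you have checked that it is not `None`.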

## Find the data you want to extract.

## Write the code for scraping the specific HTML tag that you want.


In [27]:
# first <div> of the page (bs.div returns only the first match)
print(bs.div)

<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p></div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls


In [28]:
# first paragraph <p> of the page
print(bs.p)

<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p>


In [29]:
# Find every paragraph tag <p> inside <body> with the .find_all() method
body = bs.body
for paragraph in body.find_all('p'):
    print(paragraph.text)


We haven't figured out how to make online shopping carts yet, but you can send us a check to:
123 Main St.
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.


In [30]:
# first <h1> header of the page
print(bs.h1)

<h1>Totally Normal Gifts</h1>


In [31]:
# List of all header tags, from <h1> to <h6>
titles = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(list(titles))


[<h1>Totally Normal Gifts</h1>]


### Find the tags, get the text, and filter by CSS class or id

In [32]:
# Several tags may share a name but differ in class or id, so filter on the attribute
divList = bs.find_all('div', {'id': 'content'})
for tag in divList:
    print(tag.get_text())

Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
We haven't figured out how to make online shopping carts yet, but you can send us a check to:
123 Main St.
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.


In [33]:
# every <span> with the class 'excitingNote'
boldText = bs.find_all('span', {'class':'excitingNote'})
print(list(boldText))

[<span class="excitingNote">Now with super-colorful bell peppers!</span>, <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>, <span class="excitingNote">Also hand-painted by trained monkeys!</span>, <span class="excitingNote">Or maybe he's only resting?</span>, <span class="excitingNote">Keep your friends guessing!</span>]


In [34]:
# combine filters: the tag whose id is 'gift5' and whose class is 'gift'
row = bs.find_all(id='gift5', class_='gift')
print(list(row))

[<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>]


## Write a CSV File

![alt text](https://github.com/Ruqyai/Web-Scraper/blob/master/web/csv.jpg?raw=true/)

In [35]:
# name of the output file to write to local disk
out_filename = "file.csv"
# header row of the CSV file
headers = "exciting Note, image  \n"
# every <span class="excitingNote"> on the page
allRows = bs.find_all('span', {'class':'excitingNote'})
print(allRows)
# every gift image; the raw string keeps the regex backslashes intact
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
print(images)
# collect the src attribute of each image
img = [image['src'] for image in images]
print(img)

[<span class="excitingNote">Now with super-colorful bell peppers!</span>, <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>, <span class="excitingNote">Also hand-painted by trained monkeys!</span>, <span class="excitingNote">Or maybe he's only resting?</span>, <span class="excitingNote">Keep your friends guessing!</span>]
[<img src="../img/gifts/img1.jpg"/>, <img src="../img/gifts/img2.jpg"/>, <img src="../img/gifts/img3.jpg"/>, <img src="../img/gifts/img4.jpg"/>, <img src="../img/gifts/img6.jpg"/>]
['../img/gifts/img1.jpg', '../img/gifts/img2.jpg', '../img/gifts/img3.jpg', '../img/gifts/img4.jpg', '../img/gifts/img6.jpg']


In [0]:
# open the file and write the header; the with block closes the file when it ends
with open(out_filename, "w") as f:
    f.write(headers)
    # pair each note with its image path
    for value, imgs in zip(allRows, img):
        row = value.get_text()
        # Parse it if that's necessary: drop commas so they don't split the field
        row = row.replace(',', '')
        # write one line per note
        f.write(row + ", " + imgs + "\n")
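
The loop above removes commas from each note so that they do not break the comma-separated layout. As an alternative sketch, the standard-library `csv` module quotes such fields instead, so the text can stay intact. The `notes` and `image_paths` values below are stand-ins for the scraped lists, and `write_rows` is a hypothetical helper.

```python
import csv
import io


def write_rows(out_file, header, rows):
    """Write a header row and data rows to an open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(header)
    writer.writerows(rows)


# Stand-in values for the scraped notes and image paths.
notes = ["Keep your friends guessing!", "Bright, bold, and cheerful"]
image_paths = ["../img/gifts/img6.jpg", "../img/gifts/img1.jpg"]

# An in-memory buffer keeps the sketch self-contained;
# for a real file use open("file.csv", "w", newline="").
buffer = io.StringIO()
write_rows(buffer, ["exciting Note", "image"], zip(notes, image_paths))
print(buffer.getvalue())
```

The second note contains commas, so the writer wraps it in double quotes and `csv.reader` or `pd.read_csv` reads it back as a single field.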

In [37]:
#show the file
!cat file.csv

exciting Note, image  
Now with super-colorful bell peppers!, ../img/gifts/img1.jpg
8 entire dolls per set! Octuple the presents!, ../img/gifts/img2.jpg
Also hand-painted by trained monkeys!, ../img/gifts/img3.jpg
Or maybe he's only resting?, ../img/gifts/img4.jpg
Keep your friends guessing!, ../img/gifts/img6.jpg


## Show The CSV File

In [38]:
import pandas as pd 
df = pd.read_csv(out_filename)
df.head()

Unnamed: 0,exciting Note,image
0,Now with super-colorful bell peppers!,../img/gifts/img1.jpg
1,8 entire dolls per set! Octuple the presents!,../img/gifts/img2.jpg
2,Also hand-painted by trained monkeys!,../img/gifts/img3.jpg
3,Or maybe he's only resting?,../img/gifts/img4.jpg
4,Keep your friends guessing!,../img/gifts/img6.jpg




---



# Example 2

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd  # used below to build the DataFrame

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

In [0]:
# Locate the table and read its header cells
table = bs.find("table")
headings = [th.get_text() for th in table.find_all("th")]
rows = table.find_all("tr")
# Collect the cell text of every row after the header row
data = []
for row in rows[1:]:
    data.append([td.get_text() for td in row.find_all("td")])

In [41]:
print(data)

[['\nVegetable Basket\n', '\nThis vegetable basket is the perfect gift for your health conscious (or overweight) friends!\nNow with super-colorful bell peppers!\n', '\n$15.00\n', '\n\n'], ['\nRussian Nesting Dolls\n', '\nHand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!\n', '\n$10,000.52\n', '\n\n'], ['\nFish Painting\n', "\nIf something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!\n", '\n$10,005.00\n', '\n\n'], ['\nDead Parrot\n', "\nThis is an ex-parrot! Or maybe he's only resting?\n", '\n$0.50\n', '\n\n'], ['\nMystery Box\n', '\nIf you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!\n', '\n$1.50\n', '\n\n']]
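
The `\n` characters in this output come from the line breaks inside each cell of the HTML. One way to avoid them at extraction time is `get_text(strip=True)`. The sketch below uses a small inline fragment modeled on the gift table (`SAMPLE_HTML` is an assumption made so that the code runs offline).

```python
from bs4 import BeautifulSoup

# A small fragment modeled on the gift table, so the sketch runs offline.
SAMPLE_HTML = """
<table id="giftList">
<tr><th>
Item Title
</th><th>
Cost
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
$15.00
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table")

# strip=True trims the whitespace around each piece of text.
headings = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]
print(headings)
print(rows)
```

When a cell contains nested tags, such as the `<span class="excitingNote">` in the Description column, `get_text(" ", strip=True)` joins the pieces with a space instead of running them together.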


In [42]:
# Create a DataFrame
data = pd.DataFrame(data, columns=headings)
data.head() # show how it looks

Unnamed: 0,\nItem Title\n,\nDescription\n,\nCost\n,\nImage\n
0,\nVegetable Basket\n,\nThis vegetable basket is the perfect gift fo...,\n$15.00\n,\n\n
1,\nRussian Nesting Dolls\n,"\nHand-painted by trained monkeys, these exqui...","\n$10,000.52\n",\n\n
2,\nFish Painting\n,\nIf something seems fishy about this painting...,"\n$10,005.00\n",\n\n
3,\nDead Parrot\n,\nThis is an ex-parrot! Or maybe he's only res...,\n$0.50\n,\n\n
4,\nMystery Box\n,"\nIf you love suprises, this mystery box is fo...",\n$1.50\n,\n\n


## Parse it if that's necessary


Here are a few useful Regex resources:  

https://regexr.com — Learn, build and test Regex  
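
As a small illustration, Example 1 filtered the gift images with a regular expression on their `src` paths. The sketch below applies the same kind of pattern to a few sample paths; the `GIFT_IMAGE` name and the `paths` list are made up for the example. Writing the pattern as a raw string (`r"..."`) stops Python from treating `\.` as a string escape.

```python
import re

# Matches paths such as ../img/gifts/img1.jpg; the r prefix keeps "\." literal.
GIFT_IMAGE = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Sample paths: the logo should be skipped, the gift images kept.
paths = ["../img/gifts/logo.jpg", "../img/gifts/img1.jpg", "../img/gifts/img6.jpg"]
matches = [path for path in paths if GIFT_IMAGE.search(path)]
print(matches)
```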



In [43]:
# clean the column headings: remove newlines and commas
data = data.rename(columns=lambda x: x.replace('\n','').replace(',','')) 
data.head() 

Unnamed: 0,Item Title,Description,Cost,Image
0,\nVegetable Basket\n,\nThis vegetable basket is the perfect gift fo...,\n$15.00\n,\n\n
1,\nRussian Nesting Dolls\n,"\nHand-painted by trained monkeys, these exqui...","\n$10,000.52\n",\n\n
2,\nFish Painting\n,\nIf something seems fishy about this painting...,"\n$10,005.00\n",\n\n
3,\nDead Parrot\n,\nThis is an ex-parrot! Or maybe he's only res...,\n$0.50\n,\n\n
4,\nMystery Box\n,"\nIf you love suprises, this mystery box is fo...",\n$1.50\n,\n\n


In [44]:
# clean the cell values: remove newlines, commas, and double spaces
data = data.replace('\n', '', regex=True)
data = data.replace(',', '', regex=True)
data = data.replace('  ', '', regex=True)

# drop the Image column, which holds no text
data = data.drop('Image',axis=1)
data.head()

Unnamed: 0,Item Title,Description,Cost
0,Vegetable Basket,This vegetable basket is the perfect gift for ...,$15.00
1,Russian Nesting Dolls,Hand-painted by trained monkeys these exquisit...,$10000.52
2,Fish Painting,If something seems fishy about this painting i...,$10005.00
3,Dead Parrot,This is an ex-parrot! Or maybe he's only resting?,$0.50
4,Mystery Box,If you love suprises this mystery box is for y...,$1.50
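
The Cost column still holds text such as `$10000.52`. If numeric prices are needed, for sorting or summing, one option is a small converter like the sketch below. `parse_price` is a hypothetical helper, and using `Decimal` rather than `float` is a design choice for currency values.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '$10,000.52' to a Decimal."""
    # Keep only digits and the decimal point; drop '$', commas, and whitespace.
    cleaned = re.sub(r"[^\d.]", "", text)
    return Decimal(cleaned)


prices = [parse_price(cost) for cost in ["\n$15.00\n", "\n$10,000.52\n", "$0.50"]]
print(prices)
```

With the DataFrame above, `data['Cost'].map(parse_price)` would apply the same conversion to the whole column.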


## Save to CSV File

In [0]:
# Write the DataFrame to a CSV file
data.to_csv('file2.csv', index=False)

In [46]:
#show the file
!cat file2.csv

Item Title,Description,Cost
Vegetable Basket,This vegetable basket is the perfect gift for your health conscious (or overweight) friends!Now with super-colorful bell peppers!,$15.00
Russian Nesting Dolls,"Hand-painted by trained monkeys these exquisite dolls are priceless! And by ""priceless"" we mean ""extremely expensive""! 8 entire dolls per set! Octuple the presents!",$10000.52
Fish Painting,If something seems fishy about this painting it's because it's a fish! Also hand-painted by trained monkeys!,$10005.00
Dead Parrot,This is an ex-parrot! Or maybe he's only resting?,$0.50
Mystery Box,If you love suprises this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!,$1.50




---



# Example 3

## Query and condition

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

In [48]:
# reach the price from the image next to it: go up to the image's <td>,
# then step back to the previous <td>, which holds the cost
price = bs.find('img', {'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text()
print(price)


$15.00
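
`.previous_sibling` returns whatever node sits immediately before the tag. On this page the `<td>` tags touch each other, so that node is the price cell, but when the markup has line breaks between cells it is a whitespace string instead. The sketch below shows `find_previous_sibling('td')`, which skips text nodes. `SAMPLE_HTML` is a hypothetical fragment written for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with line breaks between cells, unlike the page above.
SAMPLE_HTML = """
<table>
<tr>
  <td>$15.00</td>
  <td><img src="../img/gifts/img1.jpg"/></td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
image_cell = soup.find("img", {"src": "../img/gifts/img1.jpg"}).parent

# The node directly before the cell is the whitespace between the two <td> tags.
print(repr(image_cell.previous_sibling))

# find_previous_sibling skips text nodes and returns the nearest matching tag.
price = image_cell.find_previous_sibling("td").get_text(strip=True)
print(price)
```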



In [49]:
# get the tags whose text is exactly this string
bs.find_all(lambda tag: tag.get_text() == "Or maybe he's only resting?")

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [50]:
# Search for a certain text (Beautiful Soup 4.4.0 and later also accept string= in place of text=)
bs.find_all('', text="Or maybe he's only resting?")

["Or maybe he's only resting?"]

In [51]:
# Count how many times this text appears in the page
textList = bs.find_all(text='Now with super-colorful bell peppers!')
print(len(textList))

1
