<a href="https://colab.research.google.com/github/Ruqyai/Web-Scraper/blob/master/Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping

<h1>Basic Steps:</h1>

* Find the URL that you want to scrape.
* Inspect the elements of the page.
* Fetch the HTML page using the URL.
* Find the data you want to extract.
* Write the code to scrape the specific HTML tags that you want.
* Parse the data if necessary.
* Run the code and extract the data.
* Store the data in the required format, e.g. CSV.
* Optionally, use pandas to display the result before or after saving it.



## Find the URL that you want to scrape

The URL that we will use:

http://www.pythonscraping.com/pages/page3.html

Remember to check the site's robots.txt file before scraping:

http://www.pythonscraping.com/robots.txt
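
One way to check such rules programmatically is Python's built-in `urllib.robotparser`. The sketch below parses a made-up set of rules so that it runs without a network connection; the `sample_rules` list and the `is_allowed` helper are illustrative assumptions, not the actual contents of the site's robots.txt. For a live site, you could call `set_url()` and `read()` on the parser instead.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (not the real file of any site).
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

page = "http://www.pythonscraping.com/pages/page3.html"
print(is_allowed(sample_rules, page))
print(is_allowed(sample_rules, "http://www.pythonscraping.com/private/data.html"))
```

With these sample rules, the first URL is allowed and the second is blocked because its path starts with `/private/`.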


You need a basic knowledge of HTML and CSS. You also need to understand the structure of the site to extract the information you’re interested in.

![alt text](https://github.com/Ruqyai/Web-Scraper/blob/master/web/html.png?raw=true)

## Inspect the elements of the page

<img src="https://3.bp.blogspot.com/-AD-DM0S6tPc/VJE3ZPA9PUI/AAAAAAAADtw/-FTCawVS1_s/s1600/inspect-element.gif" width=100% />


<img src="https://lh3.googleusercontent.com/proxy/8oXttju3H-nUClaQGPSfDihbVFbxzGv70G0dRST1kqVW2NKLe788t6MyIdpa6_ETppPDivHaKDefGProsotCm-E52FFHiQpO_cZgAE6cgna4ECVu8Tv66wf2DpYf6Q31og" width=100% />


<img src="https://iamvdo.me/content/01-blog/32-le-debug-css-est-difficile/2-apple-inspecting.gif" width=100% />


# Example 1

## Fetch the HTML page using the URL

In [293]:
from urllib.request import urlopen  # opens URLs
from bs4 import BeautifulSoup  # parses HTML and XML and pulls data out of them

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, "html.parser")
print(bs)

<html>
<head>
<style>
img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p></div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) frien
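
Network requests can fail, so it may help to wrap the fetch in a small helper. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout; it returns `None` instead of raising when the page cannot be retrieved.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at url, or None if the request fails.

    fetch_html is a hypothetical helper for illustration, not part of urllib.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # The resource could not be reached (bad host name, no connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

The returned bytes can be passed to `BeautifulSoup(raw, "html.parser")` once you have checked that the result is not `None`. `HTTPError` is caught first because it is a subclass of `URLError`.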

## Find the data you want to extract

## Write the code for scraping the specific HTML tag that you want


In [294]:
# the first <div> on the page (bs.div returns only the first match)
print(bs.div)

<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p></div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls


In [295]:
# the first <p> on the page (bs.p returns only the first match)
print(bs.p)

<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p>


In [296]:
# Another way to find <p> tags: .find_all() returns every match, not just the first
body = bs.body
for paragraph in body.find_all('p'):
    print(paragraph.text)


We haven't figured out how to make online shopping carts yet, but you can send us a check to:
123 Main St.
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.


In [297]:
# the first <h1> on the page
print(bs.h1)

<h1>Totally Normal Gifts</h1>


In [298]:
# list of all heading tags (<h1> to <h6>)
titles = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(titles)


[<h1>Totally Normal Gifts</h1>]


### Find tags by class or id and get their text

In [299]:
# Several tags may share a name but differ in class or id, so filter by attribute
divList = bs.find_all('div', {'id': 'content'})
for tag in divList:
    print(tag.get_text())

Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
We haven't figured out how to make online shopping carts yet, but you can send us a check to:
123 Main St.
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.


In [300]:
# all <span> tags with the class "excitingNote"
boldText = bs.find_all('span', {'class': 'excitingNote'})
print(boldText)

[<span class="excitingNote">Now with super-colorful bell peppers!</span>, <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>, <span class="excitingNote">Also hand-painted by trained monkeys!</span>, <span class="excitingNote">Or maybe he's only resting?</span>, <span class="excitingNote">Keep your friends guessing!</span>]


In [301]:
# filter on id and class at once (class_ has a trailing underscore because class is a Python keyword)
row = bs.find_all(id='gift5', class_='gift')
print(row)

[<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>]
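
BeautifulSoup also accepts CSS selectors through `select()` and `select_one()`, which can be shorter than passing attribute dictionaries. The sketch below runs on a small inline HTML string (`sample_html`, made up to mirror the structure of the gift table) rather than on the live page.

```python
from bs4 import BeautifulSoup

# Made-up snippet that mirrors the structure of the gift table above.
sample_html = """
<table id="giftList">
  <tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
  <tr class="gift" id="gift2"><td>Russian Nesting Dolls
    <span class="excitingNote">8 entire dolls per set!</span></td><td>$10,000.52</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# "tr.gift" matches every <tr> whose class is "gift".
gift_ids = [row["id"] for row in soup.select("tr.gift")]

# "#gift2 span.excitingNote" matches the note inside the row with id="gift2".
note = soup.select_one("#gift2 span.excitingNote").get_text()

print(gift_ids)
print(note)
```

`select()` always returns a list, while `select_one()` returns the first match or `None`, so check for `None` before calling `get_text()` on real pages.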


## Write a CSV File

![alt text](https://github.com/Ruqyai/Web-Scraper/blob/master/web/csv.jpg?raw=true/)

In [302]:
# name the output file to write to local disk
out_filename = "file.csv"
# header of csv file to be written
headers = "exciting Note1, exciting Note2  \n"
row_s = bs.find_all('span', {'class':'excitingNote'})
print(row_s)

[<span class="excitingNote">Now with super-colorful bell peppers!</span>, <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>, <span class="excitingNote">Also hand-painted by trained monkeys!</span>, <span class="excitingNote">Or maybe he's only resting?</span>, <span class="excitingNote">Keep your friends guessing!</span>]


In [0]:
# open the file (closed automatically at the end of the with block) and write the headers
with open(out_filename, "w") as f:
    f.write(headers)
    # write one line per scraped note
    for value in row_s:
        row = value.get_text()
        # parse it if necessary: remove commas so they cannot split a field
        row = row.replace(',', '')
        # write the same note to both columns
        f.write(row + ", " + row + "\n")

In [320]:
!cat file.csv

exciting Note1, exciting Note2  
Now with super-colorful bell peppers!, Now with super-colorful bell peppers!
8 entire dolls per set! Octuple the presents!, 8 entire dolls per set! Octuple the presents!
Also hand-painted by trained monkeys!, Also hand-painted by trained monkeys!
Or maybe he's only resting?, Or maybe he's only resting?
Keep your friends guessing!, Keep your friends guessing!
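
Removing commas by hand changes the data. As an alternative, the standard-library `csv` module quotes any field that contains a comma, so the text can be written unchanged. A minimal sketch, assuming a hypothetical `notes` list in place of the scraped spans and an output file named `notes.csv`:

```python
import csv

# Hypothetical stand-in for the text of the scraped <span> tags.
notes = [
    "Now with super-colorful bell peppers!",
    "Hand-painted, signed, and numbered!",  # made-up text with commas on purpose
]

with open("notes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["note", "length"])     # header row
    for note in notes:
        writer.writerow([note, len(note)])  # fields with commas are quoted automatically

# Read the file back to confirm the commas survived the round trip.
with open("notes.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows[2][0])
```

Opening the file with `newline=""` lets the `csv` module control line endings itself, which avoids blank lines between rows on some platforms.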


## Show The CSV File

In [323]:
import pandas as pd
df = pd.read_csv(out_filename)
df.head()

|   | exciting Note1 | exciting Note2 |
|---|---|---|
| 0 | Now with super-colorful bell peppers! | Now with super-colorful bell peppers! |
| 1 | 8 entire dolls per set! Octuple the presents! | 8 entire dolls per set! Octuple the presents! |
| 2 | Also hand-painted by trained monkeys! | Also hand-painted by trained monkeys! |
| 3 | Or maybe he's only resting? | Or maybe he's only resting? |
| 4 | Keep your friends guessing! | Keep your friends guessing! |




---






# Example 2

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

In [0]:
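table = bs.find("table")
# column names come from the <th> cells of the header row
headings = [th.get_text() for th in table.find_all("th")]
rows = table.find_all("tr")
data = []
# skip the header row, then collect the text of every <td> cell
for row in rows[1:]:
    data.append([td.get_text() for td in row.find_all("td")])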

In [308]:
print(data)

[['\nVegetable Basket\n', '\nThis vegetable basket is the perfect gift for your health conscious (or overweight) friends!\nNow with super-colorful bell peppers!\n', '\n$15.00\n', '\n\n'], ['\nRussian Nesting Dolls\n', '\nHand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!\n', '\n$10,000.52\n', '\n\n'], ['\nFish Painting\n', "\nIf something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!\n", '\n$10,005.00\n', '\n\n'], ['\nDead Parrot\n', "\nThis is an ex-parrot! Or maybe he's only resting?\n", '\n$0.50\n', '\n\n'], ['\nMystery Box\n', '\nIf you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!\n', '\n$1.50\n', '\n\n']]


In [309]:
import pandas as pd

data = pd.DataFrame(data, columns=headings)
data.head()

|   | \nItem Title\n | \nDescription\n | \nCost\n | \nImage\n |
|---|---|---|---|---|
| 0 | \nVegetable Basket\n | \nThis vegetable basket is the perfect gift fo... | \n$15.00\n | \n\n |
| 1 | \nRussian Nesting Dolls\n | \nHand-painted by trained monkeys, these exqui... | \n$10,000.52\n | \n\n |
| 2 | \nFish Painting\n | \nIf something seems fishy about this painting... | \n$10,005.00\n | \n\n |
| 3 | \nDead Parrot\n | \nThis is an ex-parrot! Or maybe he's only res... | \n$0.50\n | \n\n |
| 4 | \nMystery Box\n | \nIf you love suprises, this mystery box is fo... | \n$1.50\n | \n\n |
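
The `\n` characters come from the whitespace inside each cell. One option is to strip them at extraction time with `get_text(" ", strip=True)`. The sketch below uses a made-up two-row table (`sample_html`) with the same layout as the gift list so that it runs offline.

```python
from bs4 import BeautifulSoup

# Made-up table with the same layout as the gift list (header row, then data rows).
sample_html = """
<table>
<tr><th>
Item Title
</th><th>
Cost
</th></tr>
<tr><td>
Dead Parrot
</td><td>
$0.50
</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table")

# strip=True trims whitespace from each piece of text; " " joins the pieces.
headings = [th.get_text(" ", strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(" ", strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]  # skip the header row
]

print(headings)
print(rows)
```

Passing `" "` as the separator keeps a space between the text of nested tags, such as a description followed by a `<span>` note.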


## Parse it if necessary


A useful regex resource:  

https://regexr.com (learn, build, and test regular expressions)  



In [310]:
# clean the column names: remove newlines and commas
data = data.rename(columns=lambda x: x.replace('\n', '').replace(',', ''))
data.head()

|   | Item Title | Description | Cost | Image |
|---|---|---|---|---|
| 0 | \nVegetable Basket\n | \nThis vegetable basket is the perfect gift fo... | \n$15.00\n | \n\n |
| 1 | \nRussian Nesting Dolls\n | \nHand-painted by trained monkeys, these exqui... | \n$10,000.52\n | \n\n |
| 2 | \nFish Painting\n | \nIf something seems fishy about this painting... | \n$10,005.00\n | \n\n |
| 3 | \nDead Parrot\n | \nThis is an ex-parrot! Or maybe he's only res... | \n$0.50\n | \n\n |
| 4 | \nMystery Box\n | \nIf you love suprises, this mystery box is fo... | \n$1.50\n | \n\n |


In [311]:
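# clean the cell values: remove newlines, commas, and double spaces
data = data.replace('\n', '', regex=True)
data = data.replace(',', '', regex=True)
data = data.replace('  ', '', regex=True)

# the Image column holds no text, so drop it
data = data.drop('Image', axis=1)
data.head()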

|   | Item Title | Description | Cost |
|---|---|---|---|
| 0 | Vegetable Basket | This vegetable basket is the perfect gift for ... | $15.00 |
| 1 | Russian Nesting Dolls | Hand-painted by trained monkeys these exquisit... | $10000.52 |
| 2 | Fish Painting | If something seems fishy about this painting i... | $10005.00 |
| 3 | Dead Parrot | This is an ex-parrot! Or maybe he's only resting? | $0.50 |
| 4 | Mystery Box | If you love suprises this mystery box is for y... | $1.50 |
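
The `Cost` column is still text. If numeric prices are needed, a regular expression can drop everything except digits and the decimal point. A small sketch with the standard-library `re` module follows; `parse_price` is a hypothetical helper, and it assumes prices formatted like the ones on this page (a dollar sign, optional thousands commas, and a decimal point).

```python
import re


def parse_price(text):
    """Convert a price string such as '$10,000.52' to a float.

    Hypothetical helper: assumes US-style formatting with '.' as the decimal point.
    """
    digits = re.sub(r"[^\d.]", "", text)  # keep only digits and the decimal point
    return float(digits)


raw_costs = ["\n$15.00\n", "\n$10,000.52\n", "\n$0.50\n"]  # values as scraped above
prices = [parse_price(cost) for cost in raw_costs]
print(prices)
```

The same pattern could be applied to a whole pandas column with `Series.str.replace(..., regex=True)` followed by `astype(float)`.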


## Save to CSV File

In [0]:
data.to_csv('file2.csv', index=False)

In [313]:
!cat file2.csv

Item Title,Description,Cost
Vegetable Basket,This vegetable basket is the perfect gift for your health conscious (or overweight) friends!Now with super-colorful bell peppers!,$15.00
Russian Nesting Dolls,"Hand-painted by trained monkeys these exquisite dolls are priceless! And by ""priceless"" we mean ""extremely expensive""! 8 entire dolls per set! Octuple the presents!",$10000.52
Fish Painting,If something seems fishy about this painting it's because it's a fish! Also hand-painted by trained monkeys!,$10005.00
Dead Parrot,This is an ex-parrot! Or maybe he's only resting?,$0.50
Mystery Box,If you love suprises this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!,$1.50




---






# Example 3

## Query and condition

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

In [315]:
# navigate from an image to its price: the <img> sits in a <td>,
# and the previous sibling of that <td> is the Cost cell
price = bs.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text()
print(price)


$15.00



In [316]:
# find the tags whose text exactly matches a string
bs.find_all(lambda tag: tag.get_text() == "Or maybe he's only resting?")

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [317]:
# search for the text itself (string= replaces the older text= argument)
bs.find_all(string="Or maybe he's only resting?")

["Or maybe he's only resting?"]

In [318]:
# count how many times this text appears on the page
textList = bs.find_all(string='Now with super-colorful bell peppers!')
print(len(textList))

1
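
Exact matching misses text that differs by even one character. `find_all` also accepts a compiled regular expression for `string`, which allows partial matches. A sketch on a made-up snippet (`sample_html`, reusing a few sentences from the page):

```python
import re

from bs4 import BeautifulSoup

# Made-up snippet reusing a few sentences from the page.
sample_html = """
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
<span class="excitingNote">Keep your friends guessing!</span>
<p>Hand-painted by trained monkeys, these exquisite dolls are priceless!</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Match any text node containing "hand-painted", ignoring case.
pattern = re.compile(r"hand-painted", re.IGNORECASE)
matches = soup.find_all(string=pattern)

print(len(matches))
```

Each match is a text node, so `match.parent` gives the tag that contains it.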
