
# Web Scraping

<h1>Steps:</h1>


* Fetch the HTML page.
* Scrape the specific HTML tags you want.
* Parse the result if necessary.
* Save the result to a CSV file.
* Optionally, use pandas to display the result before or after saving.



The URLs we will use:

http://www.pythonscraping.com/pages/warandpeace.html  

http://www.pythonscraping.com/pages/page3.html

Remember to check the site's robots.txt file before scraping:

http://www.pythonscraping.com/robots.txt
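
A robots.txt check can also be done in code. The sketch below uses the standard library's `urllib.robotparser`; the helper name `can_scrape` and the sample rules are assumptions made for illustration and are not the real contents of the robots.txt file linked above.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already in memory
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(can_scrape(sample_rules, "http://www.example.com/pages/page3.html"))   # True
print(can_scrape(sample_rules, "http://www.example.com/private/data.html"))  # False
```

To check a live site instead, call `set_url()` with the robots.txt address and then `read()` on the parser before calling `can_fetch()`.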


# Example 1

## Fetch the HTML page

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# download the page and parse it with Python's built-in HTML parser
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, "html.parser")

![alt text](https://github.com/Ruqyai/Web-Scraper/blob/master/web/html.png?raw=true)
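
Fetching a page can fail, for example when the server returns an error or the host cannot be reached. The following is a minimal sketch of one way to handle that with the standard library; the function name `fetch_html` and the default timeout are assumptions, not part of the original notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes at `url`, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Server returned an error for {url}: {err.code}")
    except URLError as err:
        print(f"Could not reach {url}: {err.reason}")
    return None
```

The returned bytes can be passed straight to `BeautifulSoup(..., "html.parser")`, for example after calling `fetch_html('http://www.pythonscraping.com/pages/warandpeace.html')` and checking that the result is not `None`.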

## Scrape the specific HTML tags you want


### Find the tags by CSS class and get their text

In [2]:
# find every <span class="green"> tag (find_all is the current name for findAll)
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
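
The same kind of class-based lookup can also be written as a CSS selector with Beautiful Soup's `select()` method. The sketch below runs on a small hypothetical HTML snippet rather than the real page, so it works without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the span structure of the page.
sample_html = """
<div id="text">
  <span class="red">A quoted line of dialogue</span>
  <span class="green">Anna Pavlovna</span>
  <span class="green">the prince</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# 'span.green' is the CSS selector for <span> elements with class "green".
green_names = [span.get_text() for span in soup.select("span.green")]
print(green_names)  # ['Anna Pavlovna', 'the prince']

# The same result using find_all with the class_ keyword argument.
same_names = [span.get_text() for span in soup.find_all("span", class_="green")]
print(same_names == green_names)  # True
```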


In [0]:
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
#print([title for title in titles])


In [0]:

allText = bs.find_all('span', {'class':{'green', 'red'}})
#print([text for text in allText])

In [5]:
# count the text nodes that are exactly 'the prince' (string= replaces the older text= argument)
nameList = bs.find_all(string='the prince')
print(len(nameList))

7


In [0]:
# match tags that have both id="title" and class "text"
title = bs.find_all(id='title', class_='text')
#print([text for text in title])

## Write a CSV File

![alt text](https://github.com/Ruqyai/Web-Scraper/blob/master/web/csv.jpg?raw=true/)

In [0]:
# name of the output file to write to local disk
out_filename = "file.csv"
# header row of the CSV file (no trailing space after the last column name)
headers = "col1,col2,col3\n"
nameList = bs.find_all('span', {'class': 'green'})


In [0]:
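# open the file (closed automatically at the end of the with block) and write the header
with open(out_filename, "w") as f:
    f.write(headers)
    # write one row per name, repeating the name in all three columns
    for name in nameList:
        row = name.get_text()
        # note: plain string concatenation does no quoting, so a name that
        # contains a comma or a line break will spill into extra fields or rows
        f.write(row + ", " + row + ", " + row + "\n")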

## Show The CSV File

In [9]:
import pandas as pd
df = pd.read_csv(out_filename)
df.head(15)

Unnamed: 0,col1,col2,col3
0,Anna,,
1,Pavlovna Scherer,Anna,
2,Pavlovna Scherer,Anna,
3,Pavlovna Scherer,,
4,Empress Marya,,
5,Fedorovna,Empress Marya,
6,Fedorovna,Empress Marya,
7,Fedorovna,,
8,Prince Vasili Kuragin,Prince Vasili Kuragin,Prince Vasili Kuragin
9,Anna Pavlovna,Anna Pavlovna,Anna Pavlovna


# Example 2

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

# uncomment to print each direct child of the gift table:
#for child in bs.find('table', {'id': 'giftList'}).children:
    #print(child)

In [11]:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find("table")
# column names come from the <th> cells of the header row
headings = [th.get_text() for th in table.find_all("th")]
rows = table.find_all("tr")
data = []
# skip the header row and collect the text of every <td> cell
for row in rows[1:]:
    data.append([td.get_text() for td in row.find_all("td")])

data = pd.DataFrame(data, columns=headings)
data.head()

Unnamed: 0,\nItem Title\n,\nDescription\n,\nCost\n,\nImage\n
0,\nVegetable Basket\n,\nThis vegetable basket is the perfect gift fo...,\n$15.00\n,\n\n
1,\nRussian Nesting Dolls\n,"\nHand-painted by trained monkeys, these exqui...","\n$10,000.52\n",\n\n
2,\nFish Painting\n,\nIf something seems fishy about this painting...,"\n$10,005.00\n",\n\n
3,\nDead Parrot\n,\nThis is an ex-parrot! Or maybe he's only res...,\n$0.50\n,\n\n
4,\nMystery Box\n,"\nIf you love suprises, this mystery box is fo...",\n$1.50\n,\n\n
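
The `\n` characters in the output above come from whitespace inside the table cells. One way to avoid them at extraction time, rather than cleaning afterwards, is `get_text(strip=True)`. The sketch below uses a hypothetical table fragment, not the real page3.html, which has more columns and rows.

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment with line breaks inside the cells.
sample_html = """
<table id="giftList">
  <tr><th>
Item Title
</th><th>
Cost
</th></tr>
  <tr><td>
Vegetable Basket
</td><td>
$15.00
</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table")

# strip=True trims leading and trailing whitespace from each text fragment.
headings = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]
print(headings)  # ['Item Title', 'Cost']
print(rows)      # [['Vegetable Basket', '$15.00']]
```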


## Parse it if necessary


Here are a few useful XPath/CSS/Regex resources:  

https://regexr.com — Learn, build and test Regex  

https://www.w3schools.com/xml/xpath_intro.asp


In [12]:
# strip newlines and commas from the column names
data = data.rename(columns=lambda x: x.replace('\n','').replace(',',''))
data.head()

Unnamed: 0,Item Title,Description,Cost,Image
0,\nVegetable Basket\n,\nThis vegetable basket is the perfect gift fo...,\n$15.00\n,\n\n
1,\nRussian Nesting Dolls\n,"\nHand-painted by trained monkeys, these exqui...","\n$10,000.52\n",\n\n
2,\nFish Painting\n,\nIf something seems fishy about this painting...,"\n$10,005.00\n",\n\n
3,\nDead Parrot\n,\nThis is an ex-parrot! Or maybe he's only res...,\n$0.50\n,\n\n
4,\nMystery Box\n,"\nIf you love suprises, this mystery box is fo...",\n$1.50\n,\n\n


In [13]:
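# remove newlines and commas from every cell (regex=True also matches inside longer strings)
data = data.replace('\n', '', regex=True)
data = data.replace(',', '', regex=True)
# the Image column contains only whitespace, so drop it
data = data.drop('Image', axis=1)
data.head()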

Unnamed: 0,Item Title,Description,Cost
0,Vegetable Basket,This vegetable basket is the perfect gift for ...,$15.00
1,Russian Nesting Dolls,Hand-painted by trained monkeys these exquisit...,$10000.52
2,Fish Painting,If something seems fishy about this painting i...,$10005.00
3,Dead Parrot,This is an ex-parrot! Or maybe he's only resting?,$0.50
4,Mystery Box,If you love suprises this mystery box is for y...,$1.50
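
The Cost column is still text at this point. As an example of the regex-based parsing this step refers to, the sketch below pulls the numeric amount out of a price string. The pattern, the helper name `parse_cost`, and the choice of `Decimal` are assumptions for illustration; the sample strings mirror the raw cell values shown earlier.

```python
import re
from decimal import Decimal

# A dollar sign, optional spaces, then digits with optional commas and decimals.
PRICE_PATTERN = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def parse_cost(text):
    """Extract a dollar amount such as '$10,000.52' and return it as a Decimal."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


raw_costs = ["\n$15.00\n", "\n$10,000.52\n", "\n$0.50\n"]
print([parse_cost(cost) for cost in raw_costs])
# [Decimal('15.00'), Decimal('10000.52'), Decimal('0.50')]
```

`Decimal` keeps currency values exact; `float(match.group(1).replace(",", ""))` would also work if small rounding differences are acceptable.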


## Save to CSV File

In [0]:
data.to_csv('file2.csv', index=False)

In [15]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# the image's parent is its <td> cell; that cell's previous sibling is the Cost cell
print(bs.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())


$15.00
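
Navigating with `previous_sibling` depends on the exact markup: if there is whitespace between two cells, the previous sibling is that whitespace text node rather than the neighbouring tag. The sketch below, on a hypothetical table row, shows `find_previous_sibling("td")` as a possibly more robust alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the whitespace between the cells is deliberate.
sample_html = """
<table>
  <tr>
    <td>Vegetable Basket</td>
    <td>$15.00</td>
    <td><img src="../img/gifts/img1.jpg"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
image_cell = soup.find("img", {"src": "../img/gifts/img1.jpg"}).parent

# previous_sibling returns the whitespace text node between the two cells here.
print(repr(image_cell.previous_sibling))

# find_previous_sibling("td") skips text nodes and returns the nearest <td> tag.
cost_cell = image_cell.find_previous_sibling("td")
print(cost_cell.get_text(strip=True))  # $15.00
```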



# Example 3

## Query with a condition

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [18]:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [19]:
bs.find_all(string='Or maybe he\'s only resting?')

["Or maybe he's only resting?"]
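
The two queries above differ in what they return: the lambda filter returns the matching tag, while the string search returns the text node itself. A filter function can express other conditions as well. The sketch below, on a hypothetical snippet, selects tags by attribute count and by a combination of tag name and text.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
sample_html = """
<div>
  <p id="intro" class="text">Welcome</p>
  <span class="excitingNote">Or maybe he's only resting?</span>
  <img src="a.jpg" alt="photo">
  <p>No attributes here</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A filter function receives each tag and returns True to keep it.
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
print([tag.name for tag in two_attr_tags])  # ['p', 'img']

# Conditions can be combined: <span> tags whose text mentions "resting".
resting = soup.find_all(lambda tag: tag.name == "span" and "resting" in tag.get_text())
print([tag.get_text() for tag in resting])  # ["Or maybe he's only resting?"]
```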