# Web Scraping

## Summary
This is a tutorial on web scraping in Python.

In this tutorial, you will learn to scrape data from an Amazon web page using the requests, 
urllib and Beautiful Soup libraries, and to export that data to a structured CSV text file 
using the pandas library.

## Examining the Web Page

In this tutorial, we'll extract product information from the Amazon retail site and store it in a structured dataset.

'Web scraping' is a technique for automating this process.

In [126]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
import numpy as np

In [2]:
## URL of the web page I want to work with
## (Amazon search results for laptops)
myUrl="https://www.amazon.fr/s/ref=nb_sb_noss?__mk_fr_FR=%C3%85M%C3%85%C5%BD%C3%95%C3%91&url=search-alias%3Daps&field-keywords=Laptop"

Here is how Amazon presents the products and their information:

In [3]:
%%html
<img src="capture.png" width="60" height="60">
Products shown on our web page

We can extract the information for each product into a **record** with the following fields:

* The name of the brand.
* The name of the product.
* The price.
* The shipping terms.
* The star rating.
* The number of customer reviews.

## Importing HTML into Python

### First method: the requests library

The first thing we need to do is read the HTML of our web page into Python, which we'll do using the requests library.
* Installation: pip install requests from the command line.

In [11]:
headers = {
    'User-Agent': 'python-requests/2.8.1',
    'Content-Type': 'text/html',
}

request=requests.get(myUrl,headers=headers)

In [12]:
## print the first 300 characters of the HTML page
request.text[0:300]

'\n<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"><style>\n[class*=scx-line-clamp-]{overflow:hidden}.scx-offscreen-truncate{position:relative;left:-1000000px}.scx-line-clamp-1{max-height:16.75px}.scx-trunc'

### Second method: the urllib library

We can also read the HTML of our web page into Python using the urllib library.
* No installation is needed: urllib is part of the Python standard library.

In [6]:
## open the URL
r_urllib=urlopen(myUrl)

In [7]:
## read the raw bytes of the HTML page
request_url=r_urllib.read()

In [8]:
request_url.strip()[0:300]

b'<!doctype html><html class="a-no-js" data-19ax5a9jf="dingo"><head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"><style>\n[class*=scx-line-clamp-]{overflow:hidden}.scx-offscreen-truncate{position:relative;left:-1000000px}.scx-line-clamp-1{max-height:16.75px}.scx-trunca'

In [9]:
## Close the url page 
r_urllib.close()
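
The urlopen call above sends urllib's default headers and has to be closed by hand. As a minimal sketch, assuming the site accepts a browser-like User-Agent, the request can carry custom headers and the response can be decoded to text inside a with block. The helper names build_request and fetch_html, the header value and the example.com URL are illustrative assumptions, not part of the original notebook.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent="Mozilla/5.0 (tutorial example)"):
    """Return a Request carrying a custom User-Agent (the value is an assumption)."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download a page and decode it to str; the with block closes the connection."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Building the request does not touch the network
req = build_request("https://www.example.com/")
print(req.full_url, req.get_header("User-agent"))
```

Calling fetch_html(myUrl) would then return a string rather than the bytes object shown above, so it can be passed to a parser directly.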

### Parsing the HTML using the Beautiful Soup library

We're going to parse the HTML using the Beautiful Soup 4 library, which is a popular Python library for web scraping. 
* installation via : pip install beautifulsoup4 from the command line.

In [13]:
soup = BeautifulSoup(request.text, 'html.parser')

## Collecting the records

I noticed that each record starts with markup of the following form:


                        <div class="s-item-container">
                            <div class="a-row a-spacing-micro">
                                <div class="a-row sx-badge-region sx-pinned-top-badge">
                                    <div class="a-row a-badge-region"><a id="AMAZONS_CHOICE_B07B5965VG" href="https:/ 

BeautifulSoup will find all the records that share this format for us.

In [14]:
products=soup.findAll('div',class_='s-item-container')

This returns a **ResultSet** containing all the records found.

products behaves like a list, so we can check its length.

In [15]:
len(products)

17
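
To see the class-based search in isolation, here is a small sketch on a hand-written HTML fragment. The fragment and the names demo_html, demo_soup and containers are made up for illustration; only the s-item-container class comes from the page above. find_all is the current spelling of findAll, and both do the same thing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two product containers and one unrelated div
demo_html = """
<div class="s-item-container"><h2>Laptop A</h2></div>
<div class="s-item-container"><h2>Laptop B</h2></div>
<div class="other"><h2>Not a product</h2></div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# class_ (with the trailing underscore) filters on the CSS class
containers = demo_soup.find_all("div", class_="s-item-container")
titles = [container.find("h2").text for container in containers]

print(len(containers))  # 2
print(titles)           # ['Laptop A', 'Laptop B']
```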

In [16]:
first_product=products[0]

Let's look at the first product:

In [17]:
first_product

<div class="s-item-container"><div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:218px"><div class="a-fixed-left-grid-col a-col-left" style="width:218px;margin-left:-218px;float:left;"><div class="a-row"><div aria-hidden="true" class="a-column a-span12 a-text-center"><a class="a-link-normal a-text-normal" href="https://www.amazon.fr/Microsoft-Surface-Ordinateur-Portable-i7/dp/B0713RZCBV"><img alt='Microsoft Surface Laptop Ordinateur Portable 13.5" tactile (Core i7, RAM 8 Go, SSD 256 Go, Windows 10S) - Platine' class="s-access-image cfMarker" data-search-image-load="" height="218" onload="uet('cf');P.when('search-page-utilities').execute(function(spUtils){setTimeout(function(){spUtils.triggerATFEvent(1)}, 0);});" src="https://images-eu.ssl-images-amazon.com/images/I/31hpgf5sykL._AC_US218_.jpg" srcset="https://images-eu.ssl-images-amazon.com/images/I/31hpgf5sykL._AC_US218_.jpg 1x, https://images-eu.ssl-images-amazon.com/images/I/31hpgf5sykL._AC_US327_

## Extracting data

### Extracting the first product

To simplify and verify our work, we'll start by working only with the first record in products.

In [18]:
first_product=products[0]

#### The name of the product

The **find** method returns the first \<h2\> tag inside first_product as a Beautiful Soup "Tag" object.

In [19]:
Name = first_product.find('h2')

To extract the text inside the tag, we use the ***text*** attribute, which returns a string.

In [20]:
Name.text

'Microsoft Surface Laptop Ordinateur Portable 13.5" tactile (Core i7, RAM 8 Go, SSD 256 Go, Windows 10S) - Platine'
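
One thing to keep in mind is that find returns None when no matching tag exists, so reading .text directly raises an AttributeError for products that lack the tag. A possible guard, sketched here with a hypothetical helper called text_or_none and a made-up fragment, looks like this:

```python
from bs4 import BeautifulSoup

def text_or_none(tag):
    """Return the stripped text of a Tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None

# Hypothetical fragment with an h2 but no h3
snippet = BeautifulSoup("<div><h2> Laptop A </h2></div>", "html.parser")

found = text_or_none(snippet.find("h2"))
missing = text_or_none(snippet.find("h3"))

print(found)    # Laptop A
print(missing)  # None
```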

#### The brand of the product

The ***findAll*** method returns every matching \<span\> tag in an iterable object; the brand is the second match (index 1).

In [21]:
brand=first_product.findAll('span',attrs={'class':"a-size-small a-color-secondary"})[1]

Again, we use the text attribute to extract the string inside the tag.

In [22]:
brand.text

'Microsoft'

#### The price of the product

We extract the price from the soup and convert it to a float.

In [23]:
price=first_product.findAll('span',attrs={'class':"a-size-base a-color-price a-text-bold"})[0]


Remove the non-breaking space \xa0 and drop the first four characters (the currency prefix), so that only the number remains

In [24]:
price=price.text.replace(u'\xa0', u'')[4:]

Converting the price to float format

In [25]:
float(price.replace(',','.'))

1249.0
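
Slicing off a fixed four characters only works while the prefix keeps the same length. As a hedged alternative, a regular expression can pull out the number wherever it sits in the string. parse_price is a hypothetical helper, and the sample strings are assumed to follow the French format seen above (comma as decimal mark, \xa0 as thousands separator).

```python
import re

def parse_price(text):
    """Convert a French-formatted price string to a float, or None if no number is found."""
    compact = re.sub(r"\s", "", text)              # removes spaces, including \xa0
    match = re.search(r"\d+(?:,\d+)?", compact)    # digits with an optional decimal part
    if match is None:
        return None
    return float(match.group().replace(",", "."))

print(parse_price("EUR 1\xa0249,56"))  # 1249.56
print(parse_price("EUR 189,99"))       # 189.99
print(parse_price("indisponible"))     # None
```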

#### The shipping of the product

In [28]:
shipping=first_product.findAll('span',attrs={'class':"a-size-small a-color-secondary"})[3]


In [29]:
shipping.text

'Livraison GRATUITE'

#### The number of stars of the product

In [64]:
## the second <i> tag of the product holds the star rating
stars = (first_product.findAll("i"))[1]

In [75]:
stars.span.text

'4,1 étoiles sur 5'
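
The rating comes back as a French sentence rather than a number. If a numeric value is wanted later, one way to get it (a sketch with a hypothetical helper, parse_rating, assuming the text always starts with the score) is:

```python
import re

def parse_rating(text):
    """Extract the leading score from a string like '4,1 étoiles sur 5' as a float."""
    match = re.match(r"\s*(\d+(?:,\d+)?)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))

print(parse_rating("4,1 étoiles sur 5"))  # 4.1
print(parse_rating("3 étoiles sur 5"))    # 3.0
```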

#### The number of comments on the product

In [None]:
<div class="a-row a-spacing-mini"><span name="B07K69PCRK">
    <span class="a-declarative" data-action="a-popover" data-a-popover="{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot;/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=B07K69PCRK&amp;contextId=search&amp;ref=acr_search__popover&quot;}"><a href="javascript:void(0)" class="a-popover-trigger a-declarative"><i class="a-icon a-icon-star a-star-3"><span class="a-icon-alt">3 étoiles sur 5</span></i><i class="a-icon a-icon-popover"></i></a></span></span>

<a class="a-size-small a-link-normal a-text-normal" href="https://www.amazon.fr/PC-Ordinateur-Portable-Windows-10-14-Pouces-Laptop/dp/B07K69PCRK/ref=sr_1_4?ie=UTF8&amp;qid=1550677476&amp;sr=8-4&amp;keywords=Laptop#customerReviews">12</a></div>

In [99]:
number_comments=(products[1].findAll('div',attrs={'class':'a-row a-spacing-mini'}))

In [114]:
number_comments[1].findAll('a')[1].text

'12'

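In [83]:
## number_comments is a ResultSet, so select the link first and convert its text
int(number_comments[1].findAll('a')[1].text)

12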

### Creating the dataset

Now that we have extracted every field from the first product, we can repeat the process for all the other products with a **loop** over products, building a list of tuples called records.


In [None]:
products[0]

In [136]:
records=[]

for product in products:
    ## Reset every field so a failed lookup cannot reuse the previous product's value
    Name=Brand=Price=Shipping=Customers_Reviews=Number_Reviews=None

    try:
        Name =(product.find('h2')).text
    except (AttributeError,IndexError):
        pass

    try:
        Brand=(product.findAll('span',attrs={'class':"a-size-small a-color-secondary"})[1]).text
    except (AttributeError,IndexError):
        pass

    try:
        Price=(product.findAll('span',attrs={'class':"a-size-base a-color-price s-price a-text-bold"})[0]).text.replace(u'\xa0', u'')[4:]
        Price=Price.replace(',','.')
    except (AttributeError,IndexError):
        pass

    try:
        Shipping=(product.findAll('span',attrs={'class':"a-size-small a-color-secondary"})[3]).text
    except (AttributeError,IndexError):
        pass

    try:
        Customers_Reviews=(product.findAll("i")[1]).span.text
    except (AttributeError,IndexError):
        pass

    try:
        Number_Reviews=(product.findAll('div',attrs={'class':'a-row a-spacing-mini'}))[1].findAll('a')[1].text
    except (AttributeError,IndexError):
        pass

    ## One tuple per product; missing fields stay None
    records.append((Name,Brand,Price,Shipping,Customers_Reviews,Number_Reviews))
    

Let's check the first 3 rows of records

In [133]:
records[0:3]

[('Microsoft Surface Laptop Ordinateur Portable 13.5" tactile (Core i7, RAM 8 Go, SSD 256 Go, Windows 10S) - Platine',
  'Microsoft',
  '1249,56',
  'Livraison GRATUITE',
  '3,7 étoiles sur 5',
  '27'),
 ('PC-Ordinateur Portable Windows-10 14-Pouces Laptop - Winnovo V146 Notebook 4 Go RAM+32 Go Stockage Intel Atom Quad Core Resolution de 1920x1080 FHD WiFi Bluetooth Mini HDMI Intel HD Graphics (Argent)',
  'Winnovo',
  '189,99',
  'Livraison gratuite possible (voir fiche produit).',
  '3 étoiles sur 5',
  '12'),
 ('Microsoft Surface Laptop Ordinateur Portable 13.5" tactile (Core i5, RAM 8 Go, SSD 128 Go, Windows 10) - Platine - Clavier AZERTY français',
  'Microsoft',
  '1049,99',
  'Livraison gratuite possible (voir fiche produit).',
  '4,1 étoiles sur 5',
  '26')]
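
Plain tuples rely on remembering which position holds which field. A possible refinement, not used in the original notebook, is a namedtuple from the standard library: it still behaves like a tuple but gives every field a name. The type name Laptop and the sample values below are illustrative assumptions.

```python
from collections import namedtuple

# Field names mirror the order used in the records list
Laptop = namedtuple(
    "Laptop",
    ["Name", "Brand", "Price", "Shipping", "Customers_Reviews", "Number_Reviews"],
)

# Hypothetical record with made-up values
row = Laptop("Example laptop", "ExampleBrand", "189.99",
             "Livraison GRATUITE", "3 étoiles sur 5", "12")

print(row.Brand)                         # ExampleBrand
print(row[2])                            # 189.99 (positional access still works)
print(row._asdict()["Number_Reviews"])   # 12
```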

## Converting the dataset into a **DataFrame**

After creating our list of tuples, we can turn it into a DataFrame using the pandas library.

In [137]:
import pandas as pd
data = pd.DataFrame(records, columns=['Name','Brand','Price','Shipping','Customers_Reviews','Number_Reviews'])

The head of our DataFrame

In [138]:
data.head()


| | Name | Brand | Price | Shipping | Customers_Reviews | Number_Reviews |
|---|---|---|---|---|---|---|
| 0 | Microsoft Surface Laptop Ordinateur Portable 1... | Microsoft | 1249.56 | Livraison GRATUITE | 3,7 étoiles sur 5 | 27 |
| 1 | PC-Ordinateur Portable Windows-10 14-Pouces La... | Winnovo | 189.99 | Livraison gratuite possible (voir fiche produit). | 3 étoiles sur 5 | 12 |
| 2 | Microsoft Surface Laptop Ordinateur Portable 1... | Microsoft | 1049.99 | Livraison gratuite possible (voir fiche produit). | 4,1 étoiles sur 5 | 26 |
| 3 | Acer Aspire ES1-732-P6XT PC Portable 17,3" HD ... | Acer | 386.0 | Livraison gratuite possible (voir fiche produit). | 3,3 étoiles sur 5 | 34 |
| 4 | Lenovo Legion Y520-15IKBN Ordinateur Portable ... | Lenovo | 849.0 | Livraison gratuite possible (voir fiche produit). | 4,1 étoiles sur 5 | 34 |


Converting the price and the number of reviews to float for later comparison

In [140]:
## np.float was removed from recent NumPy versions; the built-in float does the same job
data['Price']=data['Price'].astype(float)
data['Number_Reviews']=data['Number_Reviews'].astype(float)
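
astype raises an error as soon as one value cannot be converted. When the scraped columns may contain stray text, pd.to_numeric with errors="coerce" is a more forgiving option: unparseable values become NaN instead of stopping the conversion. The small frame below is made-up demonstration data, not the scraped dataset.

```python
import pandas as pd

# Hypothetical columns with one missing price and one unparseable review count
demo = pd.DataFrame({
    "Price": ["1249.56", "189.99", None],
    "Number_Reviews": ["27", "12", "n/a"],
})

demo["Price"] = pd.to_numeric(demo["Price"], errors="coerce")
demo["Number_Reviews"] = pd.to_numeric(demo["Number_Reviews"], errors="coerce")

print(demo.dtypes)
print(demo)
```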

In [141]:
data.head()

| | Name | Brand | Price | Shipping | Customers_Reviews | Number_Reviews |
|---|---|---|---|---|---|---|
| 0 | Microsoft Surface Laptop Ordinateur Portable 1... | Microsoft | 1249.56 | Livraison GRATUITE | 3,7 étoiles sur 5 | 27.0 |
| 1 | PC-Ordinateur Portable Windows-10 14-Pouces La... | Winnovo | 189.99 | Livraison gratuite possible (voir fiche produit). | 3 étoiles sur 5 | 12.0 |
| 2 | Microsoft Surface Laptop Ordinateur Portable 1... | Microsoft | 1049.99 | Livraison gratuite possible (voir fiche produit). | 4,1 étoiles sur 5 | 26.0 |
| 3 | Acer Aspire ES1-732-P6XT PC Portable 17,3" HD ... | Acer | 386.0 | Livraison gratuite possible (voir fiche produit). | 3,3 étoiles sur 5 | 34.0 |
| 4 | Lenovo Legion Y520-15IKBN Ordinateur Portable ... | Lenovo | 849.0 | Livraison gratuite possible (voir fiche produit). | 4,1 étoiles sur 5 | 34.0 |


## Creating and storing the **CSV** file

Finally, we export our dataset to a CSV file so we can store it and reuse it for later analysis.

In [143]:
data.to_csv("amazon_laptop",index=False,encoding='utf-8')

Reading the CSV file

In [144]:
data = pd.read_csv("amazon_laptop")

We get the same DataFrame as before

In [146]:
data.head()

| | Name | Brand | Price | Shipping | Customers_Reviews | Number_Reviews |
|---|---|---|---|---|---|---|
| 0 | Microsoft Surface Laptop Ordinateur Portable 1... | Microsoft | 1249.56 | Livraison GRATUITE | 3,7 étoiles sur 5 | 27.0 |
| 1 | PC-Ordinateur Portable Windows-10 14-Pouces La... | Winnovo | 189.99 | Livraison gratuite possible (voir fiche produit). | 3 étoiles sur 5 | 12.0 |
| 2 | Microsoft Surface Laptop Ordinateur Portable 1... | Microsoft | 1049.99 | Livraison gratuite possible (voir fiche produit). | 4,1 étoiles sur 5 | 26.0 |
| 3 | Acer Aspire ES1-732-P6XT PC Portable 17,3" HD ... | Acer | 386.0 | Livraison gratuite possible (voir fiche produit). | 3,3 étoiles sur 5 | 34.0 |
| 4 | Lenovo Legion Y520-15IKBN Ordinateur Portable ... | Lenovo | 849.0 | Livraison gratuite possible (voir fiche produit). | 4,1 étoiles sur 5 | 34.0 |
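
The index=False argument matters for this round trip. The sketch below uses an in-memory buffer and a made-up two-row frame to compare both settings: without index=False, the row index is written as an extra first column, which read_csv then loads as "Unnamed: 0".

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the scraped data
demo = pd.DataFrame({"Brand": ["Acer", "Lenovo"], "Price": [386.0, 849.0]})

# Round trip without the index: the columns come back unchanged
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Round trip with the index: an extra unnamed column appears
buffer_with_index = io.StringIO()
demo.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(restored.columns.tolist())             # ['Brand', 'Price']
print(restored_with_index.columns.tolist())  # ['Unnamed: 0', 'Brand', 'Price']
```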


## Analysis and Visualization (Optional)

In [189]:
data['Price'].groupby(data['Brand']).max()

Brand
Acer         1102.94
Asus          939.00
CHUWI         399.00
Dell          989.95
HP            274.00
Lenovo        849.00
MOERUN         39.99
Microsoft    1449.00
OYYU          252.00
Winnovo       189.99
Name: Price, dtype: float64

In [227]:

%matplotlib notebook
import matplotlib.pyplot as plt




We will analyze and visualize the highest laptop price per brand in our DataFrame.

### Simple Way

In [236]:
X=data['Price'].groupby(data['Brand']).max().index

Y=data['Price'].groupby(data['Brand']).max()

plt.figure()

bars=plt.bar(X, Y,width=0.6,linewidth=0, color='lightslategrey')


In [238]:
%%html
<img src="Figure_2.png" width="60" height="60">
Visualization of the highest laptop price per brand on Amazon

### Dejunked way

In [234]:
## Highest price per brand (groupby sorts the brands alphabetically)
Y=data['Price'].groupby(data['Brand']).max()
X=Y.index

plt.figure()

bars=plt.bar(X, Y,width=0.6,linewidth=0, color='lightslategrey')

pos= np.arange(len(X))
## Reuse the axes created by plt.bar instead of creating new ones
ax = plt.gca()

## Remove the y ticks and every tick mark, keeping only the x labels
## (recent matplotlib versions expect booleans here, not 'on'/'off')
plt.yticks([])
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=True)

plt.subplots_adjust(bottom=0.15)

## Hide the frame around the plot
for side in ['top','right','bottom','left']:
    ax.spines[side].set_visible(False)

plt.ylabel('Price (Eur)',alpha=0.6)
plt.title("The Highest Laptop's Price by Brand on Amazon", alpha=0.8)
plt.xticks(pos, X,alpha=0.8,rotation=35)

## Write each price inside its bar
for bar in bars:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() - 40, str(int(bar.get_height())),
            ha='center', color='w', fontsize=8)

## Highlight the bar at position 7 (Microsoft, the highest price in the output above)
bars[7].set_color('#1F77B4')






In [235]:
%%html
<img src="Figure_1.png" width="60" height="60">
Visualization of the highest laptop price per brand on Amazon
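
The highlighted bar is selected with the hardcoded position 7, which only holds for this particular set of brands. As a hedged sketch on made-up data, the position of the most expensive brand can instead be computed from the grouped result. The names demo, highest and top_position are illustrative.

```python
import pandas as pd

# Hypothetical prices; groupby sorts the brands alphabetically
demo = pd.DataFrame({
    "Brand": ["Acer", "Microsoft", "Acer", "HP"],
    "Price": [386.0, 1449.0, 1102.94, 274.0],
})

highest = demo.groupby("Brand")["Price"].max()

# idxmax gives the brand label, get_loc turns it into a bar position
top_brand = highest.idxmax()
top_position = highest.index.get_loc(top_brand)

print(top_brand, top_position)  # Microsoft 2
```

In the plotting cell, bars[top_position].set_color('#1F77B4') would then highlight the right bar even if the list of brands changes.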