# Web Scraping

Find a web page to scrape and decide which content you want to extract from it. 
Blogs and news sites are typically good candidates for scraping text content, 
and https://www.wikipedia.org/ is usually a good source of HTML tables (search for "list of...").

Scrape the HTML from your chosen page, parse it to extract the necessary information, and save 
the results to a text (txt) file if they are text or to a CSV file if they are tabular data.

Break the project down into steps: note the steps covered in the web scraping lesson,
try to follow them, and make adjustments as you encounter obstacles.

The result should be a file containing the output of your web page scrape,
saved in a folder named output.
Use comments to describe the steps taken and the thought process for obtaining data from the API and web page.

In [1]:
import requests
from bs4 import BeautifulSoup

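url = 'http://www.espace-1789.com/film/affiche'
# Download the raw HTML of the listings page
html = requests.get(url).content
# Parse it with the lxml parser so the tree can be searched
response = BeautifulSoup(html, "lxml")
response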

<!DOCTYPE html>
<html class="no-js" lang="fr"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title> A l'affiche | espace 1789</title>
<meta content="" name="description"/>
<meta content="initial-scale=1.0" name="viewport"/>
<link href="/sites/all/themes/espace1789_responsive/favico.png" rel="icon" type="image/png"/>
<style media="all" type="text/css">
@import url("http://www.espace-1789.com/modules/system/system.base.css?pv91yr");
@import url("http://www.espace-1789.com/modules/system/system.menus.css?pv91yr");
@import url("http://www.espace-1789.com/modules/system/system.messages.css?pv91yr");
@import url("http://www.espace-1789.com/modules/system/system.theme.css?pv91yr");
</style>
<style media="all" type="text/css">
@import url("http://www.espace-1789.com/sites/all/modules/date/date_api/date.css?pv91yr");
@import url("http://www.espace-1789.com/sites/all/modules/date/date_popup/themes/datepicker.1.7.css?pv91yr");
@import url("http://www.espace-1789.com/sites/all/modules/epail_178

In [17]:
import re 

string = 'jeu. 22 aoû18:3018:3020:3020:30'
# \g<0> is the whole match; raw strings keep the backslashes intact
new = re.sub(r'\d\d:\d\d\d\d:\d\d', r' \g<0> ', string)
new


'jeu. 22 aoû 18:3018:30  20:3020:30 '
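The substitution above only pads each doubled time with spaces. If the aim is one copy of each showtime, a minimal sketch could look like the following. It assumes the page prints every time twice in a row, as the sample string suggests, and `split_showtimes` is a hypothetical helper name, not part of the notebook.

```python
import re

def split_showtimes(raw):
    """Split a fused string such as 'jeu. 22 aoû18:3018:30' into (day, times)."""
    # Every HH:MM occurrence, in the order it appears
    times = re.findall(r'\d\d:\d\d', raw)
    # Everything before the first time is the day label
    day = re.split(r'\d\d:\d\d', raw, maxsplit=1)[0].strip()
    # dict.fromkeys keeps first-seen order while dropping repeats
    unique_times = list(dict.fromkeys(times))
    return day, unique_times

day, times = split_showtimes('jeu. 22 aoû18:3018:3020:3020:30')
print(day, times)
```

One caveat with this approach: two genuinely separate screenings at the same hour on the same day would be collapsed into one entry.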

In [32]:
content = response.find_all('li', {'class':'affiche-film-item'})

def get_title(affiche_film):
    titles = affiche_film.find('span', {'class':'title'})
    return titles.text

def get_actor(affiche_film):
    # The <h5> holds the credited name (on this page it appears to be the director)
    actor_main = affiche_film.find('h5')
    return actor_main.text

def get_dates(affiche_film):
    dates = affiche_film.find_all('li')
    dates = [d.text for d in dates]
    # Insert a space before every HH:MM so the fused times become readable
    new = [re.sub(r'\d\d:\d\d', r' \g<0>', date) for date in dates]
    new_dates = ", ".join(new)
    return new_dates

def get_data(content):  
    title = get_title(content)
    actor = get_actor(content)
    dates = get_dates(content)
    return {"title":title, "actor": actor, "dates": dates}

results = []
for c in content:
    results.append(get_data(c))

results

[{'title': 'YESTERDAY',
  'actor': 'Danny Boyle ',
  'dates': 'jeu. 22 aoû 18:00 18:00, ven. 23 aoû 16:10 16:10, sam. 24 aoû 14:50 14:50, lun. 26 aoû 20:30 20:30'},
 {'title': "JE PROMETS D'ÊTRE SAGE",
  'actor': ' Ronan Le Page ',
  'dates': 'mer. 21 aoû 14:40 14:40 18:10 18:10 20:40 20:40, jeu. 22 aoû 18:30 18:30 20:30 20:30, ven. 23 aoû 14:10 14:10 18:30 18:30, sam. 24 aoû 17:00 17:00 21:00 21:00, dim. 25 aoû 18:50 18:50 20:50 20:50, lun. 26 aoû 14:10 14:10 18:30 18:30, mar. 27 aoû 13:50 13:50 20:50 20:50'},
 {'title': 'Rêves de jeunesse',
  'actor': 'Alain Raoust ',
  'dates': 'mer. 21 aoû 18:40 18:40, jeu. 22 aoû 20:10 20:10, ven. 23 aoû 18:20 18:20, sam. 24 aoû 19:00 19:00, dim. 25 aoû 14:30 14:30, lun. 26 aoû 16:10 16:10 16:10 16:10 20:50 20:50, mar. 27 aoû 17:10 17:10'},
 {'title': 'DIEGO MARADONA',
  'actor': 'Asif Kapadia ',
  'dates': 'jeu. 22 aoû 14:10 14:10, ven. 23 aoû 20:20 20:20, dim. 25 aoû 17:40 17:40, lun. 26 aoû 18:10 18:10, mar. 27 aoû 20:10 20:10'},
 {'title': 'An

In [33]:
import pandas as pd

movies = pd.DataFrame(results, columns=results[0].keys())
movies

Unnamed: 0,title,actor,dates
0,YESTERDAY,Danny Boyle,"jeu. 22 aoû 18:00 18:00, ven. 23 aoû 16:10 16:..."
1,JE PROMETS D'ÊTRE SAGE,Ronan Le Page,mer. 21 aoû 14:40 14:40 18:10 18:10 20:40 20:4...
2,Rêves de jeunesse,Alain Raoust,"mer. 21 aoû 18:40 18:40, jeu. 22 aoû 20:10 20:..."
3,DIEGO MARADONA,Asif Kapadia,"jeu. 22 aoû 14:10 14:10, ven. 23 aoû 20:20 20:..."
4,Anna,Luc Besson,"mer. 21 aoû 20:10 20:10, ven. 23 aoû 20:30 20:..."
5,LE ROI LION,Jon Favreau,"mer. 21 aoû 14:30 14:30, jeu. 22 aoû 16:00 16:..."
6,COMME DES BÊTES 2,"Chris Renaud, Jonathan Del Val","mer. 21 aoû 16:40 16:40, jeu. 22 aoû 14:00 14:..."
7,LE QUATUOR À CORNES,Programme de courts métrages,"mer. 21 aoû 17:00 17:00, jeu. 22 aoû 16:50 16:..."


In [34]:
#Export csv:
movies.to_csv('../output/movies.csv', index=False)
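The export above writes into `../output/`, and that folder has to exist before the file can be created. A minimal sketch of creating it first is shown below. It uses the standard `csv` module so it runs without pandas; `write_rows`, the temporary folder, and the sample row are illustrative assumptions, not part of the notebook.

```python
import csv
import tempfile
from pathlib import Path

def write_rows(rows, path):
    """Write a list of dicts to a CSV file, creating parent folders if needed."""
    path = Path(path)
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path

# Sample row for demonstration only
demo_rows = [{'title': 'YESTERDAY', 'actor': 'Danny Boyle', 'dates': 'jeu. 22 aoû 18:00'}]
out = write_rows(demo_rows, Path(tempfile.mkdtemp()) / 'output' / 'movies.csv')
print(out.exists())
```

With pandas, the same `mkdir` call can simply run before `movies.to_csv(...)`.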

In [35]:
#Other url:

url = 'https://www.vinted.fr/enfants/accessoires'
html = requests.get(url).content
response = BeautifulSoup(html, "lxml")
response



In [36]:
#Create one function per field to extract, then call them all from one general function:

def get_title(bs4_tag):
    # The item title is stored in the alt text of the thumbnail image
    result = bs4_tag.find('img', {'class':'js-item-thumbnail item-thumbnail lazy-thumbnail __act_as_lazy'})
    return result.get('alt')

def get_brand(bs4_tag):
    result = bs4_tag.find("a", {"class":"item-box__brand"})
    # Not every item has a brand link
    if result is None:
        return "None"
    return result.text

def get_price(bs4_tag):
    # The price is read from the first <span> inside the item container
    result = bs4_tag.find("span")
    return result.text

def get_description(bs4_tag):
    result = bs4_tag.find("div", {"class":"media__placeholder"})
    return result.text.replace('\n', ' ')

def get_data(content):
    
    title = get_title(content)
    brand = get_brand(content)
    price = get_price(content)
    description = get_description(content)
    
    return {"title":title, "price": price, "brand": brand, "description": description}
            

In [37]:
#Add pagination to scrape several pages:

results = []
for i in range(10):
    # f-string so that {i} is replaced by the page number
    url = f"https://www.vinted.fr/enfants/accessoires?page={i}"
    html = requests.get(url).content
    # Parse the HTML of the current page
    response = BeautifulSoup(html, "lxml")
    contents = response.find_all('div', {'class':'is-visible item-box__container'})
    for content in contents:
        results.append(get_data(content))

results

[{'title': "chapeau bébé 18- 23 mois p'tit bisous (taille 3)",
  'price': '1,00 €',
  'brand': 'None',
  'description': '             très bon état              '},
 {'title': 'Bonnet petit bateau ',
  'price': '2,00 €',
  'brand': 'Petit Bateau',
  'description': '   Petit Bateau             Bonnet petit bateau a rayures  43 - 45 Cm               '},
 {'title': 'Set bonnet écharpe garçon ',
  'price': '3,00 €',
  'brand': 'ORCHESTRA',
  'description': '   ORCHESTRA             Orchestra Très bon état Une écharpe Un bonnet doublé Bonnet taille 53 cm Convient à enfant 4/5 ans               '},
 {'title': 'Set bonnet et écharpe garçon ',
  'price': '3,00 €',
  'brand': 'Hema',
  'description': '   Hema             Très bon etat Un bonnet doublé Une écharpe doublée Taille 86/92 noté dans le bonnet  Convient pour enfant de 2/3 ans pas plus              '},
 {'title': 'Casquette ',
  'price': '1,00 €',
  'brand': 'None',
  'description': '             Taille 46 6 mois environ               
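The pagination step depends on each request carrying a different `page` value in its URL. The sketch below builds the list of page URLs up front so they can be inspected before any request is sent. `page_urls` is a hypothetical helper, and starting the numbering at 1 is an assumption about the site rather than something the notebook confirms.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.vinted.fr/enfants/accessoires'

def page_urls(base_url, first_page, last_page):
    """Return one URL per page number, first_page through last_page inclusive."""
    return [
        f"{base_url}?{urlencode({'page': n})}"
        for n in range(first_page, last_page + 1)
    ]

urls = page_urls(BASE_URL, 1, 3)
for u in urls:
    print(u)
```

Printing the URLs first is a cheap way to confirm that the page number is really being substituted before looping over the requests.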

In [38]:
#Create a pandas dataframe:

import pandas as pd

df = pd.DataFrame(results, columns=results[0].keys())
df

Unnamed: 0,title,price,brand,description
0,chapeau bébé 18- 23 mois p'tit bisous (taille 3),"1,00 €",,très bon état
1,Bonnet petit bateau,"2,00 €",Petit Bateau,Petit Bateau Bonnet petit batea...
2,Set bonnet écharpe garçon,"3,00 €",ORCHESTRA,ORCHESTRA Orchestra Très bon ét...
3,Set bonnet et écharpe garçon,"3,00 €",Hema,Hema Très bon etat Un bonnet do...
4,Casquette,"1,00 €",,Taille 46 6 mois environ ...
5,Bonnet,"2,00 €",Du pareil au même,Du pareil au même Bonnet rigolo...
6,Bonnet d.aviateur «car's»,"2,00 €",,Doublure chaude En bon état ...
7,écharpe et bonnet,"2,00 €",okaïdi,okaïdi ensemble chaud 😍 ...
8,Bonnet taille 50 et gants,"2,00 €",Disney,Disney très bon état ...
9,Bonnet,"2,00 €",,Bonnet taille 3mois
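The price and description values are still raw strings: prices look like `'1,00 €'` and descriptions carry long runs of spaces. A minimal cleaning sketch follows. `clean_text` and `parse_price` are hypothetical helpers, and the price parser assumes a comma decimal separator and no thousands separator, as in the rows shown.

```python
from decimal import Decimal

def clean_text(text):
    """Collapse any run of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())

def parse_price(text):
    """Turn a string such as '1,00 €' into Decimal('1.00')."""
    # Drop the currency sign and both regular and non-breaking spaces
    number = text.replace('€', '').replace('\xa0', '').replace(' ', '')
    # Assumes a French-style comma as the decimal separator
    return Decimal(number.replace(',', '.'))

print(clean_text('             très bon état              '))
print(parse_price('1,00 €'))
```

Applied to the dataframe, these could be used as `df['price'].map(parse_price)` and `df['description'].map(clean_text)` before exporting.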


In [39]:
#Export csv:
df.to_csv('../output/vinted.csv', index=False)