# Web Scraping

* Find a web page to scrape and decide which content you want to extract from it. Blogs and news sites are typically good candidates for text content, and https://www.wikipedia.org/ is usually a good source of HTML tables (search for "list of...").
* Scrape the HTML from your chosen page, parse it to extract the information you need, and save the results to a text (txt) file if they are text, or to a CSV file if they are tabular data.

Break the project down into steps: review the steps covered in the web scraping lesson,
try to follow them, and make adjustments as you encounter obstacles.

The deliverable is a file containing the results of your web page scrape,
saved in a folder named output.
Use comments to describe the steps taken and thought process for obtaining data from the API and web page.
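
The saving step described above could be sketched as follows, using only the standard library. The helper names `save_text` and `save_table` and their parameters are hypothetical and only illustrate the idea; the output folder is created first so that writing does not fail when it is missing.

```python
import csv
from pathlib import Path

def save_text(text, filename, folder="output"):
    """Write scraped text to <folder>/<filename> and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path = out_dir / filename
    path.write_text(text, encoding="utf-8")
    return path

def save_table(rows, filename, folder="output"):
    """Write a list of dicts (one dict per row) to a CSV file and return the path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("w", newline="", encoding="utf-8") as f:
        # Column names are taken from the keys of the first row
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With a list of dictionaries like the ones built in the cells below, `save_table(results, "movies.csv")` would write `output/movies.csv` relative to the current working directory.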

In [2]:
import requests
from bs4 import BeautifulSoup

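url = 'http://www.espace-1789.com/film/affiche'
# Download the raw HTML of the page
html = requests.get(url).content
# Parse it with the lxml parser; `response` holds the parsed BeautifulSoup tree
response = BeautifulSoup(html, "lxml")
response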

<!DOCTYPE html>
<html class="no-js" lang="fr"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title> A l'affiche | espace 1789</title>
<meta content="" name="description"/>
<meta content="initial-scale=1.0" name="viewport"/>
<link href="/sites/all/themes/espace1789_responsive/favico.png" rel="icon" type="image/png"/>
<style media="all" type="text/css">
@import url("http://www.espace-1789.com/modules/system/system.base.css?pxpzmv");
@import url("http://www.espace-1789.com/modules/system/system.menus.css?pxpzmv");
@import url("http://www.espace-1789.com/modules/system/system.messages.css?pxpzmv");
@import url("http://www.espace-1789.com/modules/system/system.theme.css?pxpzmv");
</style>
<style media="all" type="text/css">
@import url("http://www.espace-1789.com/sites/all/modules/date/date_api/date.css?pxpzmv");
@import url("http://www.espace-1789.com/sites/all/modules/date/date_popup/themes/datepicker.1.7.css?pxpzmv");
@import url("http://www.espace-1789.com/sites/all/modules/epail_178

In [3]:
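import re

# Sample of the raw showtime text: each time appears twice with no separator
string = 'jeu. 22 aoû18:3018:3020:3020:30'
# Surround each doubled time (e.g. 18:3018:30) with spaces;
# the replacement is a raw string so that \g<0> is not treated as a string escape
new = re.sub(r'\d\d:\d\d\d\d:\d\d', r' \g<0> ', string)
new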


'jeu. 22 aoû 18:3018:30  20:3020:30 '
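
Each showtime appears twice in the raw text (for example `18:3018:30`). As an alternative sketch, not what the cells in this notebook do, a backreference can match a time immediately followed by the same time and keep a single copy. The name `dedupe_times` is hypothetical.

```python
import re

def dedupe_times(text):
    """Collapse a doubled HH:MM (e.g. '18:3018:30') into one, preceded by a space."""
    # (\d\d:\d\d) captures a time; \1 requires the same time to follow immediately
    return re.sub(r'(\d\d:\d\d)\1', r' \1', text)

print(dedupe_times('jeu. 22 aoû18:3018:3020:3020:30'))
# prints: jeu. 22 aoû 18:30 20:30
```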

In [4]:
content = response.find_all('li', {'class':'affiche-film-item'})

def get_title(movie):
    titles = movie.find('span', {'class':'title'})
    return(titles.text)

def get_actor(movie):
    actor_main = movie.find('h5') 
    return(actor_main.text)

def get_dates(movie):
    dates = movie.find_all('li')
    dates = [d.text for d in dates]
    # Insert a space before every HH:MM so the fused times become readable.
    # (An earlier substitution on the doubled times was overwritten by this one
    # and had no effect, so it has been dropped.)
    new = [re.sub(r'\d\d:\d\d', r' \g<0>', date) for date in dates]
    new_dates = ", ".join(new)
    return(new_dates)

def get_data(content):  
    title = get_title(content)
    actor = get_actor(content)
    dates = get_dates(content)
    return {"title":title, "actor": actor, "dates": dates}

results = []
for c in content:
    results.append(get_data(c))

results

[{'title': 'LE DAIM ',
  'actor': 'Quentin Dupieux ',
  'dates': 'jeu. 19 sep 15:50 15:50, ven. 20 sep 19:00 19:00, dim. 22 sep 18:10 18:10, lun. 23 sep 20:20 20:20'},
 {'title': 'MA FOLLE SEMAINE AVEC TESS',
  'actor': 'De Steven Wouterlood ',
  'dates': 'mer. 18 sep 10:00 10:00 14:40 14:40, dim. 22 sep 14:20 14:20 16:20 16:20, lun. 23 sep 18:30 18:30vo, mar. 24 sep 16:30 16:30vo'},
 {'title': 'UN PETIT AIR DE FAMILLE ',
  'actor': 'Collectif',
  'dates': 'mer. 18 sep 10:30 10:30 16:40 16:40, dim. 22 sep 16:10 16:10'},
 {'title': 'PORTRAIT DE LA JEUNE FILLE EN FEU ',
  'actor': 'Céline Sciamma ',
  'dates': 'mer. 18 sep 14:20 14:20 17:50 17:50 20:10 20:10, jeu. 19 sep 17:30 17:30 20:10 20:10, ven. 20 sep 14:10 14:10 16:40 16:40 21:00 21:00, dim. 22 sep 17:20 17:20 20:00 20:00rencontres, lun. 23 sep 14:00 14:00 16:10 16:10 18:10 18:10, mar. 24 sep 14:00 14:00 18:10 18:10 20:10 20:10'},
 {'title': 'LES HIRONDELLES DE KABOUL',
  'actor': 'Zabou Breitman et Eléa Gobbé-Mévellec',
  'dates'

In [5]:
import pandas as pd

movies = pd.DataFrame(results, columns=results[0].keys())
movies

Unnamed: 0,title,actor,dates
0,LE DAIM,Quentin Dupieux,"jeu. 19 sep 15:50 15:50, ven. 20 sep 19:00 19:..."
1,MA FOLLE SEMAINE AVEC TESS,De Steven Wouterlood,"mer. 18 sep 10:00 10:00 14:40 14:40, dim. 22 s..."
2,UN PETIT AIR DE FAMILLE,Collectif,"mer. 18 sep 10:30 10:30 16:40 16:40, dim. 22 s..."
3,PORTRAIT DE LA JEUNE FILLE EN FEU,Céline Sciamma,mer. 18 sep 14:20 14:20 17:50 17:50 20:10 20:1...
4,LES HIRONDELLES DE KABOUL,Zabou Breitman et Eléa Gobbé-Mévellec,"mer. 18 sep 16:30 16:30 20:30 20:30, jeu. 19 s..."
5,VIENDRA LE FEU,Oliver Laxe,"mer. 18 sep 18:30 18:30, jeu. 19 sep 14:00 14:..."
6,SWEETIE,Jane Campion,jeu. 19 sep 20:00 20:00vo
7,Retour vers le futur II,Robert Zemeckis,sam. 21 sep 15:00 15:00vf
8,FEMMES AU BORD DE LA CRISE DE NERFS,Pedro Almodóvar,sam. 21 sep 15:10 15:10vo
9,INDIANA JONES ET LA DERNIÈRE CROISADE,Steven Spielberg,sam. 21 sep 19:15 19:15vf 21:30 21:30vo


In [6]:
#Export csv:
movies.to_csv('../output/movies.csv', index=False)

In [7]:
#Other url:

url = 'https://www.vinted.fr/enfants/accessoires'
html = requests.get(url).content
# Pass the downloaded HTML and the parser to BeautifulSoup
response = BeautifulSoup(html, "lxml")
response



In [8]:
#create a function for each item to find and then call one general function for all:

def get_title(bs4_tag):
    result = bs4_tag.find('img', {'class':'js-item-thumbnail item-thumbnail lazy-thumbnail __act_as_lazy'})
    result = result.get('alt')
    return(result)   
    
def get_brand(bs4_tag):
    result = bs4_tag.find("a", {"class":"item-box__brand"})
    # Some listings have no brand link; return a placeholder string instead
    if result is None:
        return "None"
    else:
        return(result.text)

def get_price(bs4_tag):
    result = bs4_tag.find("span")
    return(result.text)

def get_description(bs4_tag):
    result = bs4_tag.find("div", {"class":"media__placeholder"})
    return(result.text.replace('\n', ' '))    

def get_data(content):
    
    title = get_title(content)
    brand = get_brand(content)
    price = get_price(content)
    description = get_description(content)
    
    return {"title":title, "price": price, "brand": brand, "description": description}
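
`get_brand` checks for a missing element, but the other helpers would raise `AttributeError` if `find` returned `None` for a listing. One possible way to apply the same guard everywhere is a small helper. `safe_text` is a hypothetical name, and the stand-in object below only mimics the `.text` attribute of a BeautifulSoup tag so that the sketch runs without downloading a page.

```python
from types import SimpleNamespace

def safe_text(tag, default="None"):
    """Return tag.text, or `default` when the element was not found (tag is None)."""
    if tag is None:
        return default
    return tag.text

# Stand-in for a tag returned by bs4_tag.find(...); a real tag also exposes .text
found = SimpleNamespace(text="16,00 €")
print(safe_text(found))  # prints: 16,00 €
print(safe_text(None))   # prints: None
```

Inside `get_price`, for instance, this would read `return safe_text(bs4_tag.find("span"))`.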
            

In [9]:
#Add pagination to scrape several pages:

results = []
for i in range(10):
    # f-string so that {i} is replaced by the page number
    url = f"https://www.vinted.fr/enfants/accessoires?page={i}"
    html = requests.get(url).content
    # Parse the page that was just downloaded, not the HTML from the earlier cell
    response = BeautifulSoup(html, "lxml")
    contents = response.find_all('div', {'class':'is-visible item-box__container'})
    for content in contents:
        results.append(get_data(content))

results

[{'title': 'Gants Lacoste enfant Neufs 6-8 ans ',
  'price': '16,00 €',
  'brand': 'Lacoste',
  'description': '   Lacoste             Gants LACOSTE  Tout NEUFS  Erreur de taille malheureusement, c’est pour cela que je les vends ... ils sont magnifiques.                '},
 {'title': 'Mouffles fourrées 12 à 24 mois état neuf',
  'price': '2,00 €',
  'brand': 'None',
  'description': '             Bien chaudes              '},
 {'title': 'Gants imperméables bébé ',
  'price': '3,00 €',
  'brand': 'H&M',
  'description': '   H&M             moufles pour la main entière  taille 2 à 6 mois h et m  neuves               '},
 {'title': 'Casquette adidas bleue',
  'price': '6,00 €',
  'brand': 'Adidas',
  'description': '   Adidas             Portée une fois              '},
 {'title': 'Bonnet Disney Jack le pirate',
  'price': '6,00 €',
  'brand': 'Disney',
  'description': '   Disney             Au top pour l hiver 💟  Bonnet Disney Jack le pirate  Couleur gris et blanc  Taille 52  Neuf      
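
The pagination loop builds one URL per page. A sketch of that step in isolation, using `urllib.parse.urlencode` to attach the query string, is shown here. `page_urls` is a hypothetical helper, and numbering pages from 1 is an assumption about the site; the `page` parameter name is taken from the URL used in the loop.

```python
from urllib.parse import urlencode

def page_urls(base_url, n_pages, first_page=1):
    """Return the URLs for n_pages consecutive pages of a listing."""
    urls = []
    for page in range(first_page, first_page + n_pages):
        query = urlencode({"page": page})  # e.g. "page=1"
        urls.append(f"{base_url}?{query}")
    return urls

urls = page_urls("https://www.vinted.fr/enfants/accessoires", 3)
print(urls[0])   # prints: https://www.vinted.fr/enfants/accessoires?page=1
print(urls[-1])  # prints: https://www.vinted.fr/enfants/accessoires?page=3
```

Building the list up front makes it easy to check the URLs before sending any request.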

In [10]:
#Create a pandas dataframe:

import pandas as pd

df = pd.DataFrame(results, columns=results[0].keys())
df

Unnamed: 0,title,price,brand,description
0,Gants Lacoste enfant Neufs 6-8 ans,"16,00 €",Lacoste,Lacoste Gants LACOSTE Tout NEU...
1,Mouffles fourrées 12 à 24 mois état neuf,"2,00 €",,Bien chaudes
2,Gants imperméables bébé,"3,00 €",H&M,H&M moufles pour la main entièr...
3,Casquette adidas bleue,"6,00 €",Adidas,Adidas Portée une fois ...
4,Bonnet Disney Jack le pirate,"6,00 €",Disney,Disney Au top pour l hiver 💟 B...
5,Bonnet Natalys,"2,00 €",natalys,natalys bon état général Taill...
6,"Casquette hello kitty, 54","2,00 €",Hello Kitty,"Hello Kitty Blanche et bleue, t..."
7,Gorritos de algodón nuevos de h&m,"1,50 €",H&M,H&M Están nuevos con etiqueta. ...
8,Uv pet,"1,00 €",Hema,"Hema Ongedragen, maat 86/92 ..."
9,Lot de 2 ceintures ado,"6,00 €",Kaporal,Kaporal Très bon état ...


In [11]:
#Export csv:
df.to_csv('../output/vinted.csv', index=False)