## 🎥 Créez un bot de web scraping pour collecter des données sur les films

**L'objectif** de ce projet est de créer un bot capable d'extraire des données du site IMDb et d'effectuer des analyses sur les films.

**Contexte du projet :**

Vous travaillez dans une agence web spécialisé dans l'analyse de données et le web scraping.

Vous devez travailler dans un projet dans lequel le client souhaite connaître les facteurs qui déterminent le succès d'un film. Pour cela, vous devez créer une base de données de films à partir d'informations recueillies sur différents sites Web, à commencer par le top 250 films d'IMDb. Pour cela, vous devez créer un programme en Python en utilisant Beautiful Soup pour récupérer les données et les stocker dans un fichier.

Puis, vous pouvez alimenter votre base de données en utilisant d'autres sites web (par exemple, Rotten tomatoes). Vous devez travailler en équipe pour pouvoir effectuer ce travail et rendre un projet en github avec un fichier scrapy.py qui contient les fonctions qui permettent de récuper un fichier csv avec les données.

🎬 Import des librairies

In [1]:
import pandas as pd
import numpy as np

import csv #exporter les données scrappées dans un fichier CSV 
import requests #charger la page et stocker son contenu dans une variable
from bs4 import BeautifulSoup

## 🟨 IMDb movies

🎬 Création d'une liste

In [2]:
titles = []
years = []
directors = []
time = []
genres=[]
imdb_ratings = []
metascores = []
votes = []
dollar = []

🎬 Toutes les pages web

In [3]:
pages = np.arange(1, 251, 50)
pages

array([  1,  51, 101, 151, 201])

🎬 Langue

In [4]:
headers = {'Accept-Language': 'en-US, en;q=0.5'}

🎬 Données structurées de la page 

In [5]:
# Stocker chacune des urls de 50 films
for page in pages:
    # Récupérer le contenu de chaque url
    page = requests.get('https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start' + str(page) + '&ref_=adv_nxt', headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # Méthode pour extraire tous les conteneurs qui ont un attribut de : find_all() - div - classlister-item mode-advanced
    movie_div = soup.find_all('div', class_='lister-item mode-advanced')
    
    for container in movie_div:
        # Le nom du film
        name = container.h3.a.text
        titles.append(name)
        
        # L'année du film
        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)
        
        # La durée du film
        runtime = container.find('span', class_='runtime').text if container.p.find('span', class_='runtime') else '-'
        time.append(runtime)
        
        # Directeur du film
        director = container.find('p',class_='').find_all('a')[0].text
        directors.append(director)
        
        # Genre du film
        genre = container.find('span', class_="genre").text
        genres.append(genre)
        
        # La note du film
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        
        # Le metascore du film
        m_score = container.find('span', class_='metascore').text if container.find('span', class_='metascore') else '-'
        metascores.append(m_score)
        
        # Nombre de vote sur le film
        nv = container.find_all('span', attrs={'name':'nv'})
        vote = nv[0].text
        votes.append(vote)
        
        # Prix du film
        grosses = nv[1].text if len(nv) > 1 else '-'
        dollar.append(grosses)

In [6]:
page

<Response [200]>

🎬 Liens du site

In [7]:
# https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&ref_=adv_prv
# https://www.imdb.com/search/title/?groups=top_250&sort=user_rating,desc&start=51&ref_=adv_nxt

🎬 Viser la partie du html dont on veut obtenir l'information

In [8]:
#print(type(movie_div))
#print(len(movie_div))

🎬 Affichage du premier film en HTML

In [9]:
first_movie = movie_div[0]
first_movie

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0111161"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" class="loadlate" data-tconst="tt0111161" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
<p class="text-muted">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">142 min</span>
<span class="ghost">|</span>
<span class="genre">
Dram

🎬 Création du DataFrame

In [10]:
movies = pd.DataFrame({'Films':titles,
                       'Année':years,
                       'Directeur':directors,
                       'Durée':time,
                       'Genre':genres,
                       'Note':imdb_ratings,
                       'Metascore':metascores,
                       'Votes':votes,
                       'Prix':dollar}).replace("\n","", regex=True)
movies.head()

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,(1994),Frank Darabont,142 min,Drama,9.3,80,2480727,$28.34M
1,The Godfather,(1972),Francis Ford Coppola,175 min,"Crime, Drama",9.2,100,1713388,$134.97M
2,The Dark Knight,(2008),Christopher Nolan,152 min,"Action, Crime, Drama",9.0,84,2435080,$534.86M
3,The Godfather: Part II,(1974),Francis Ford Coppola,202 min,"Crime, Drama",9.0,90,1190161,$57.30M
4,12 Angry Men,(1957),Sidney Lumet,96 min,"Crime, Drama",9.0,96,734230,$4.36M


🎬 Data type

In [11]:
movies.dtypes

Films         object
Année         object
Directeur     object
Durée         object
Genre         object
Note         float64
Metascore     object
Votes         object
Prix          object
dtype: object

🎬 Suppression des parenthèses sur l'année

In [12]:
movies['Année'] = movies['Année'].str.extract('(\d+)').astype(int)
movies.head()

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,1994,Frank Darabont,142 min,Drama,9.3,80,2480727,$28.34M
1,The Godfather,1972,Francis Ford Coppola,175 min,"Crime, Drama",9.2,100,1713388,$134.97M
2,The Dark Knight,2008,Christopher Nolan,152 min,"Action, Crime, Drama",9.0,84,2435080,$534.86M
3,The Godfather: Part II,1974,Francis Ford Coppola,202 min,"Crime, Drama",9.0,90,1190161,$57.30M
4,12 Angry Men,1957,Sidney Lumet,96 min,"Crime, Drama",9.0,96,734230,$4.36M


🎬 Suppression de min sur Durée

In [13]:
movies['Durée'] = movies['Durée'].str.extract('(\d+)').astype(int)
movies.head()

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,1994,Frank Darabont,142,Drama,9.3,80,2480727,$28.34M
1,The Godfather,1972,Francis Ford Coppola,175,"Crime, Drama",9.2,100,1713388,$134.97M
2,The Dark Knight,2008,Christopher Nolan,152,"Action, Crime, Drama",9.0,84,2435080,$534.86M
3,The Godfather: Part II,1974,Francis Ford Coppola,202,"Crime, Drama",9.0,90,1190161,$57.30M
4,12 Angry Men,1957,Sidney Lumet,96,"Crime, Drama",9.0,96,734230,$4.36M


🎬 Suppression de la virgule sur Votes

In [14]:
movies['Votes'] = movies['Votes'].str.replace(',', '').astype(int)
movies.head()

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,1994,Frank Darabont,142,Drama,9.3,80,2480727,$28.34M
1,The Godfather,1972,Francis Ford Coppola,175,"Crime, Drama",9.2,100,1713388,$134.97M
2,The Dark Knight,2008,Christopher Nolan,152,"Action, Crime, Drama",9.0,84,2435080,$534.86M
3,The Godfather: Part II,1974,Francis Ford Coppola,202,"Crime, Drama",9.0,90,1190161,$57.30M
4,12 Angry Men,1957,Sidney Lumet,96,"Crime, Drama",9.0,96,734230,$4.36M


🎬 Metascore en float

In [15]:
movies['Metascore'] = movies['Metascore'].str.extract('(\d+)')
movies['Metascore'] = pd.to_numeric(movies['Metascore'], errors='coerce')
movies.head()

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,1994,Frank Darabont,142,Drama,9.3,80.0,2480727,$28.34M
1,The Godfather,1972,Francis Ford Coppola,175,"Crime, Drama",9.2,100.0,1713388,$134.97M
2,The Dark Knight,2008,Christopher Nolan,152,"Action, Crime, Drama",9.0,84.0,2435080,$534.86M
3,The Godfather: Part II,1974,Francis Ford Coppola,202,"Crime, Drama",9.0,90.0,1190161,$57.30M
4,12 Angry Men,1957,Sidney Lumet,96,"Crime, Drama",9.0,96.0,734230,$4.36M


🎬 Nettoyage de la colonne Prix

In [16]:
movies['Prix'] = movies['Prix'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['Prix'] = pd.to_numeric(movies['Prix'], errors='coerce')
movies

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,1994,Frank Darabont,142,Drama,9.3,80.0,2480727,28.34
1,The Godfather,1972,Francis Ford Coppola,175,"Crime, Drama",9.2,100.0,1713388,134.97
2,The Dark Knight,2008,Christopher Nolan,152,"Action, Crime, Drama",9.0,84.0,2435080,534.86
3,The Godfather: Part II,1974,Francis Ford Coppola,202,"Crime, Drama",9.0,90.0,1190161,57.30
4,12 Angry Men,1957,Sidney Lumet,96,"Crime, Drama",9.0,96.0,734230,4.36
...,...,...,...,...,...,...,...,...,...
245,Once Upon a Time in the West,1968,Sergio Leone,165,Western,8.5,80.0,315851,5.32
246,Psycho,1960,Alfred Hitchcock,109,"Horror, Mystery, Thriller",8.5,97.0,634922,32.00
247,Pather Panchali,1955,Satyajit Ray,125,Drama,8.5,,28989,0.54
248,Rear Window,1954,Alfred Hitchcock,112,"Mystery, Thriller",8.5,100.0,468018,36.76


### 🟨 DataFrame final IMDb movies !

In [17]:
movies

Unnamed: 0,Films,Année,Directeur,Durée,Genre,Note,Metascore,Votes,Prix
0,The Shawshank Redemption,1994,Frank Darabont,142,Drama,9.3,80.0,2480727,28.34
1,The Godfather,1972,Francis Ford Coppola,175,"Crime, Drama",9.2,100.0,1713388,134.97
2,The Dark Knight,2008,Christopher Nolan,152,"Action, Crime, Drama",9.0,84.0,2435080,534.86
3,The Godfather: Part II,1974,Francis Ford Coppola,202,"Crime, Drama",9.0,90.0,1190161,57.30
4,12 Angry Men,1957,Sidney Lumet,96,"Crime, Drama",9.0,96.0,734230,4.36
...,...,...,...,...,...,...,...,...,...
245,Once Upon a Time in the West,1968,Sergio Leone,165,Western,8.5,80.0,315851,5.32
246,Psycho,1960,Alfred Hitchcock,109,"Horror, Mystery, Thriller",8.5,97.0,634922,32.00
247,Pather Panchali,1955,Satyajit Ray,125,Drama,8.5,,28989,0.54
248,Rear Window,1954,Alfred Hitchcock,112,"Mystery, Thriller",8.5,100.0,468018,36.76


🎬 Data types

In [18]:
movies.dtypes

Films         object
Année          int32
Directeur     object
Durée          int32
Genre         object
Note         float64
Metascore    float64
Votes          int32
Prix         float64
dtype: object

🎬 Nombres de types

In [19]:
movies_dtype = movies.dtypes
movies_dtype.value_counts()

int32      3
object     3
float64    3
dtype: int64

🎬 Valeurs Nan

In [20]:
df_nan = pd.DataFrame({'Nan':movies.isna().sum()})
df_nan['%nan'] = df_nan['Nan']/movies.shape[0]*100
round(df_nan,2).sort_values(by='%nan' , ascending=False)

Unnamed: 0,Nan,%nan
Prix,15,6.0
Metascore,5,2.0
Films,0,0.0
Année,0,0.0
Directeur,0,0.0
Durée,0,0.0
Genre,0,0.0
Note,0,0.0
Votes,0,0.0


🎬 Suppression des valeurs Nan

In [21]:
movies = movies.dropna(axis=0)

In [22]:
df_nan = pd.DataFrame({'Nan':movies.isna().sum()})
df_nan['%nan'] = df_nan['Nan']/movies.shape[0]*100
round(df_nan,2).sort_values(by='%nan' , ascending=False)

Unnamed: 0,Nan,%nan
Films,0,0.0
Année,0,0.0
Directeur,0,0.0
Durée,0,0.0
Genre,0,0.0
Note,0,0.0
Metascore,0,0.0
Votes,0,0.0
Prix,0,0.0


### 💡 Nouveau csv

In [23]:
movies.to_csv('movies.csv')

## 🍅 Tomatoes movies

📼 url site tomatoes

In [24]:
Turl = "https://www.rottentomatoes.com/m/"

In [25]:
titres = movies['Films'].replace(' ','_', regex=True)

list_URL = []

for i in titres:
    response = Turl + i
    list_URL.append(response)

In [26]:
# urilist = []
# for movie in ["titre"]:
#     r = requests.get("https://www.rottentomatoes.com/m/" + str(movie))
#     if r.status_code == 200:
#         urilist.append(response)
#         movie = str(movie)+"_"+str(year)
#         r = requests.get("https://www.rottentomatoes.com/m/" + movie)
        
#     urilist

In [27]:
list_URL

['https://www.rottentomatoes.com/m/The_Shawshank_Redemption',
 'https://www.rottentomatoes.com/m/The_Godfather',
 'https://www.rottentomatoes.com/m/The_Dark_Knight',
 'https://www.rottentomatoes.com/m/The_Godfather:_Part_II',
 'https://www.rottentomatoes.com/m/12_Angry_Men',
 'https://www.rottentomatoes.com/m/The_Lord_of_the_Rings:_The_Return_of_the_King',
 'https://www.rottentomatoes.com/m/Pulp_Fiction',
 "https://www.rottentomatoes.com/m/Schindler's_List",
 'https://www.rottentomatoes.com/m/Inception',
 'https://www.rottentomatoes.com/m/Fight_Club',
 'https://www.rottentomatoes.com/m/The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring',
 'https://www.rottentomatoes.com/m/Forrest_Gump',
 'https://www.rottentomatoes.com/m/The_Good,_the_Bad_and_the_Ugly',
 'https://www.rottentomatoes.com/m/The_Lord_of_the_Rings:_The_Two_Towers',
 'https://www.rottentomatoes.com/m/The_Matrix',
 'https://www.rottentomatoes.com/m/Goodfellas',
 'https://www.rottentomatoes.com/m/Star_Wars:_Episode_V_-_The_Empi

In [28]:
first_movie = movie_div[0]
first_movie

<div class="lister-item mode-advanced">
<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0111161"></div>
</div>
<div class="lister-item-image float-left">
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" class="loadlate" data-tconst="tt0111161" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a> </div>
<div class="lister-item-content">
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0111161/">The Shawshank Redemption</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
<p class="text-muted">
<span class="certificate">R</span>
<span class="ghost">|</span>
<span class="runtime">142 min</span>
<span class="ghost">|</span>
<span class="genre">
Dram