# Web Scraping Data From AlloCiné.fr

This script builds a DataFrame by scraping data from AlloCiné, a company that provides information on French cinema. Because scraping every movie page takes a long time, we split the work into two steps:
- First, we scrape the URL of each movie with `getMoviesUrl()`.
- Then, we use the URL list to scrape the data for each movie with `ScrapeURL()`.

*Note: we use the popular BeautifulSoup package.*

## Functions :

### `getMoviesUrl(start_page, end_page=None, nb_pages=1)` :

Saves the URL list as a CSV file named `movie_url.csv`. `start_page` and `end_page` must be integers and select the range of pages to scrape; `end_page` is not included. If `end_page` is omitted (or smaller than `start_page`), `nb_pages` pages are scraped starting from `start_page`.
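
To make the page-range rule concrete, here is a minimal sketch of how the arguments resolve to a list of pages. `resolve_page_range` is a hypothetical helper written for illustration only; it mirrors the checks done inside `getMoviesUrl()` but sends no request.

```python
def resolve_page_range(start_page, end_page=None, nb_pages=1):
    """Return the range of listing pages to request (end_page is excluded)."""
    if start_page <= 0:
        raise ValueError("start_page must be positive")
    # Fall back to nb_pages when end_page is missing or smaller than start_page
    if end_page is None or end_page < start_page:
        end_page = start_page + max(nb_pages, 1)
    return range(start_page, end_page)


print(list(resolve_page_range(1, nb_pages=2)))  # [1, 2]
print(list(resolve_page_range(3, end_page=5)))  # [3, 4]
```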

### `ScrapeURL(movie_url)` :

Iterates over the list of URLs generated by `getMoviesUrl()` and scrapes the data for each movie. In the process, we extract:

- `id`: AlloCiné movie ID
- `title`: the movie's title (in French)
- `release_date`: the original release date
- `duration`: the movie's length (in minutes)
- `genres`: the movie's genres (as a comma-separated string, up to three genres)
- `directors`: the movie's directors (as a comma-separated string)
- `actors`: the movie's main actors (as a comma-separated string)
- `nationality`: the movie's nationality (as a comma-separated string)
- `press_rating`: press rating (from 0 to 5 stars)
- `nb_press_rating`: number of press ratings
- `spec_rating`: AlloCiné user rating (from 0 to 5 stars)
- `nb_spec_rating`: number of user ratings
- `summary`: short summary of the movie (in French)

`ScrapeURL()` returns two objects: the data as a DataFrame and the list of URLs that raised an error. Both are also saved, as `allocine_movies.csv` and `allocine_errors.csv`. You can pass the list of errors back into `ScrapeURL()` to retry those URLs.
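
The retry idea can be sketched as a small loop. Everything here is an assumption made for illustration: `scrape_with_retries` and `scrape_fn` are hypothetical names, and `scrape_fn` stands in for `ScrapeURL()` but returns a plain list of rows rather than a DataFrame, so that the example runs without any network access.

```python
def scrape_with_retries(urls, scrape_fn, max_rounds=3):
    """Call scrape_fn on urls, then again on the failed URLs, up to max_rounds times.

    scrape_fn is a hypothetical stand-in for ScrapeURL(): it takes a list of
    URLs and returns (rows, errors), where errors lists the URLs that failed.
    """
    all_rows = []
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        # Only the URLs that failed in the previous round are requested again
        rows, pending = scrape_fn(pending)
        all_rows.extend(rows)
    return all_rows, list(pending)
```

With the real `ScrapeURL()`, the per-round DataFrames would be combined with `pd.concat` instead of `extend`. Note also that each call rewrites `allocine_movies.csv`, so the first result should be kept in memory or saved under another name before retrying.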

### Import libs

In [2]:
# Import libs
import pandas as pd
import numpy as np
import re
import unicodedata
from time import time
from time import sleep
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup
import dateparser

from warnings import warn
from IPython.display import clear_output
import traceback


### Function: `getMoviesUrl(start_page, end_page=None, nb_pages=1)`

In [3]:
# Function to scrape the movies urls from http://www.allocine.fr/films/
# Choose the page range with the two parameters start_page and end_page.
# The url list is saved as a csv file: movie_url.csv
def getMoviesUrl(start_page, end_page=None, nb_pages=1):
    if start_page <= 0:
        raise ValueError('start_page must be positive !')

    # Set the list
    movie_url = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = start_page
    if end_page is None or end_page < start_page:
        if nb_pages < 1:
            nb_pages = 1
        end_page = start_page + nb_pages
    m_requests = 0
        
    for p in range(start_page, end_page):

        # Get request
        url = f'http://www.allocine.fr/films/?page={p}'
        response = urlopen(url)
        
        # Pause the loop
        sleep(randint(1,2))
            
        # Monitoring the requests
        elapsed_time = time() - start_time
        print(f'Page Request: {p_requests}; Frequency: {(p - start_page + 1)/elapsed_time} requests/s')
        clear_output(wait = True)
            
        # Warning for non-200 status codes
        if response.status != 200:
            warn(f'Page Request: {p_requests}; Status code: {response.status}')

        # Break the loop if the number of requests is greater than expected
        if p_requests > end_page:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        html_text = response.read().decode("utf-8")
        html_soup = BeautifulSoup(html_text, 'html.parser')

        # Select all the movies url from a single page
        movies = html_soup.find_all('h2', 'meta-title')
        m_requests += len(movies)
        
        # Monitoring the requests
        print(f'Page Request: {p_requests}; Movie Request: {m_requests}')
        clear_output(wait = True)
        
        # Pause the loop
        sleep(1)
        
        for movie in movies:
            movie_url.append(f'http://www.allocine.fr{movie.a["href"]}')
        
        p_requests += 1
        
    
    # Saving the files
    r = np.asarray(movie_url)
    np.savetxt("movie_url.csv", r, delimiter=",", fmt='%s')
    

### Functions: Getting movie infos

#### Get movie ID: `get_movie_ID(movie_soup)`

In [4]:
# Scrape the movie ID
def get_movie_ID(movie_soup):
    # Get the movie ID
    movie_ID = re.sub(
        r"\D", "", movie_soup.find("nav", {"class": "third-nav"}).a["href"]
        )
    return movie_ID
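
As a quick illustration of the digit extraction, the sketch below applies the same `re.sub` call to a path shaped like the movie URL used later in this notebook; that the navigation link has the same shape is an assumption. `extract_movie_id` is a hypothetical, stricter variant and not the notebook's method: it keeps only the digits that follow `cfilm=`, which avoids gluing together unrelated digits if the link ever contains more than one number.

```python
import re

href = "/film/fichefilm_gen_cfilm=211012.html"

# Approach used above: drop every non-digit character
print(re.sub(r"\D", "", href))  # 211012


def extract_movie_id(href):
    """Hypothetical variant: capture only the digits after 'cfilm='."""
    match = re.search(r"cfilm=(\d+)", href)
    return match.group(1) if match else None


print(extract_movie_id(href))  # 211012
```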

#### Get movie title: `get_movie_title(movie_soup)`

In [5]:
# Scrape the movie title
def get_movie_title(movie_soup):
    # Get the movie title
    movie_title = movie_soup.find("div", {"class": "titlebar-title"}).text.strip()
    return movie_title

#### Get movie release date: `get_movie_release_date(movie_soup)`

##### Convert French months to English months: `convert_month(month)`

In [6]:
# Convert French months to English months
def convert_month(month):
    if month == "janvier":
        return "January"
    elif month == "février":
        return "February"
    elif month == "mars":
        return "March"
    elif month == "avril":
        return "April"
    elif month == "mai":
        return "May"
    elif month == "juin":
        return "June"
    elif month == "juillet":
        return "July"
    elif month == "août":
        return "August"
    elif month == "septembre":
        return "September"
    elif month == "octobre":
        return "October"
    elif month == "novembre":
        return "November"
    elif month == "décembre":
        return "December"
    else:
        return "Unknown"
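
The same mapping can also be written as a dictionary lookup. This is only a sketch of an alternative: `FRENCH_TO_ENGLISH_MONTHS` and `convert_month_lookup` are hypothetical names, and the call to `lower()` is an addition that the function above does not make.

```python
FRENCH_TO_ENGLISH_MONTHS = {
    "janvier": "January",
    "février": "February",
    "mars": "March",
    "avril": "April",
    "mai": "May",
    "juin": "June",
    "juillet": "July",
    "août": "August",
    "septembre": "September",
    "octobre": "October",
    "novembre": "November",
    "décembre": "December",
}


def convert_month_lookup(month):
    """Return the English month name, or 'Unknown' for anything else."""
    return FRENCH_TO_ENGLISH_MONTHS.get(month.lower(), "Unknown")


print(convert_month_lookup("août"))  # August
```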


##### Main function: `get_movie_release_date(movie_soup)`

In [7]:
# Scrape the movie release date
def get_movie_release_date(movie_soup):
    # Get the movie release date
    movie_release_date = movie_soup.find("span", {"class": "date"})
    if movie_release_date:
        movie_release_date = movie_release_date.text.strip()
        month = movie_release_date.split(' ')[1]
        movie_release_date = movie_release_date.replace(month, convert_month(month))
        movie_release_date = dateparser.parse(movie_release_date, date_formats=["%d %B %Y"])        
    return movie_release_date
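
The function above expects a date made of three parts (day, month, year) and relies on `dateparser`. For reference, here is a minimal standard-library sketch of the same parsing step. `parse_french_date` and `MONTH_NUMBERS` are hypothetical names, and returning `None` for any unexpected shape is a design choice of this sketch, not the notebook's behavior.

```python
from datetime import date

MONTH_NUMBERS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(text):
    """Parse a date such as '2 mars 2022'; return None when the shape is unexpected."""
    parts = text.strip().split()
    if len(parts) != 3:
        return None
    day, month, year = parts
    month_number = MONTH_NUMBERS.get(month.lower())
    if month_number is None or not (day.isdigit() and year.isdigit()):
        return None
    try:
        return date(int(year), month_number, int(day))
    except ValueError:
        # Day out of range for the month, e.g. '31 février 2022'
        return None


print(parse_french_date("2 mars 2022"))  # 2022-03-02
```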

#### Get movie duration: `get_movie_duration(movie_soup)`

In [8]:
# Scrape the movie duration
def get_movie_duration(movie_soup):
    movie_duration = movie_soup.find("span", {"class": "spacer"}).next_sibling.strip()
    if movie_duration != "":
        duration_timedelta = pd.to_timedelta(movie_duration).components
        movie_duration = duration_timedelta.hours * 60 + duration_timedelta.minutes
    return movie_duration    
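
To show what the conversion to minutes amounts to, here is a regex-based sketch that does not need pandas. It assumes the page shows the duration in a form like `2h 57min`; that format is an assumption and is not confirmed by this notebook. `duration_to_minutes` is a hypothetical name, and unlike the function above it returns `None` rather than an empty string when nothing can be parsed.

```python
import re


def duration_to_minutes(text):
    """Convert a string such as '2h 57min' into a number of minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    # Reject strings with another shape, and strings with neither hours nor minutes
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


print(duration_to_minutes("2h 57min"))  # 177
```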

#### Get movie genres: `get_movie_genres(movie_soup)`

In [9]:
# Scrape the movie genres
def get_movie_genres(movie_soup):
    div_genres = movie_soup.find("div", {"class": "meta-body-item meta-body-info"})
    if div_genres:
        movie_genres = [
            genre.text
            for genre in div_genres.find_all("span", class_=re.compile(r".*==$"))
            if "\n" not in genre.text
        ]
        return ", ".join(movie_genres)
    return None

#### Get movie directors: `get_movie_directors(movie_soup)`

In [10]:
# Scrape the movie directors
def get_movie_directors(movie_soup):
    div_directors = movie_soup.find_all(
            "div", {"class": "meta-body-item meta-body-direction"}
        )[1]
    if div_directors:
        movie_directors = [
            link.text
            for link in div_directors.find_all(
                ["a", "span"], class_=re.compile(r".*blue-link$")
            )
        ]
        return ", ".join(movie_directors)
    return None

#### Get movie actors: `get_movie_actors(movie_soup)`

In [11]:
# Scrape the movie actors
def get_movie_actors(movie_soup):
    div_actors = movie_soup.find("div", {"class": "meta-body-item meta-body-actor"})
    if div_actors:
        movie_actors = [actor.text for actor in div_actors.find_all(["a", "span"])][1:]
        return ", ".join(movie_actors)
    return None

#### Get movie nationality: `get_movie_nationality(movie_soup)`

In [12]:
# Scrape the movie nationality
def get_movie_nationality(movie_soup):
    movie_nationality = [
            nationality.text.strip()
            for nationality in movie_soup.find_all("span", class_="nationality")
        ]
    return ", ".join(movie_nationality)

#### Get movie press ratings: `get_movie_press_rating(movie_soup)`

In [13]:
# Scrape the movie press ratings
def get_movie_press_rating(movie_soup):
    # get all the available ratings
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Presse" in ratings.text:
            return float(
                re.sub(
                    ",", ".", ratings.find("span", {"class": "stareval-note"}).text
                )
            )
    return None

#### Get movie number of press ratings: `get_movie_press_rating_count(movie_soup)`

In [14]:
# Scrape the movie number of press ratings
def get_movie_press_rating_count(movie_soup):
    # get all the available ratings
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Presse" in ratings.text:
            return float(
                re.match(
                    r"\s\d+",
                    ratings.find("span", {"class": "stareval-review"}).text,
                ).group()
            )
    return None

#### Get movie spectator ratings: `get_movie_spec_rating(movie_soup)`

In [15]:
# Scrape the movie spec ratings
def get_movie_spec_rating(movie_soup):
    # get all the available ratings
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Spectateurs" in ratings.text:
            return float(
                re.sub(
                    ",", ".", ratings.find("span", {"class": "stareval-note"}).text
                )
            )
    return None

#### Get movie number of spec ratings: `get_movie_spec_rating_count(movie_soup)`

In [16]:
# Scrape the number of spec ratings
def get_movie_spec_rating_count(movie_soup):
    # get all the available ratings
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Spectateurs" in ratings.text:
            return float(
                re.match(
                    r"\s\d+",
                    ratings.find("span", {"class": "stareval-review"}).text
                ).group()  
            )
    return None
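
The four rating functions share two text-parsing steps: turning a French decimal such as `3,9` into a float, and pulling a count out of a label. The sketch below isolates those steps. `parse_rating` and `parse_review_count` are hypothetical helpers, and the label text `' 37 critiques'` is an assumed example. Unlike `re.match(r"\s\d+", ...).group()`, which raises an `AttributeError` when the label does not start with whitespace followed by digits, this version returns `None`.

```python
import re


def parse_rating(text):
    """Turn a French-formatted rating such as '3,9' into a float."""
    return float(text.strip().replace(",", "."))


def parse_review_count(text):
    """Return the first integer found in a label such as ' 37 critiques', or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_rating("3,9"))                  # 3.9
print(parse_review_count(" 37 critiques"))  # 37
```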

#### Get movie summary: `get_movie_summary(movie_soup)`

In [17]:
# Scrape the movie summary
def get_movie_summary(movie_soup):
    movie_summary = movie_soup.find(
            "section", {"class": "section ovw ovw-synopsis"}
        ).find("div", {"class": "content-txt"})
    if movie_summary:
        movie_summary = movie_summary.text.strip()
        return unicodedata.normalize("NFKC", movie_summary)
    return None
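
For readers unfamiliar with `unicodedata.normalize`, here is a small sketch of what the `NFKC` form does to scraped text: compatibility characters such as the non-breaking space become their plain equivalents, and decomposed accents are recombined. The sample string is made up for illustration, and `normalize_summary` is a hypothetical name.

```python
import unicodedata


def normalize_summary(text):
    """Strip surrounding whitespace and apply NFKC normalization."""
    return unicodedata.normalize("NFKC", text.strip())


# "\u00a0" is a non-breaking space; "e\u0301" is an 'e' followed by a combining acute accent
raw = " Deux\u00a0anne\u0301es "
print(normalize_summary(raw) == "Deux ann\u00e9es")  # True
```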

### Function: `ScrapeURL(movie_url)`

In [18]:
# Function to scrape the data from the movies urls
# The function returns a dataframe and a list of the urls that raised an error,
# and saves them into csv files (allocine_movies.csv and allocine_errors.csv)
def ScrapeURL(movie_url: list):
    
    # init the dataframe
    c = ["id",
        "title",
        "release_date",
        "duration",
        "genres",
        "directors",
        "actors",
        "nationality",
        "press_rating",
        "nb_press_rating",
        "spec_rating",
        "nb_spec_rating",
        "summary",
    ]
    df = pd.DataFrame(columns=c)
    
    # preparing the setting and monitoring loop
    start_time = time()
    n_request = 0
    
    # init list to save errors
    errors = []
    
    # request loop
    for url in movie_url:
        try :
            response = urlopen(url)

            # Pause the loop
            sleep(randint(1,2))

            # Monitoring the requests
            n_request += 1
            
            elapsed_time = time() - start_time
            print(f'Request #{n_request}; Frequency: {n_request/elapsed_time} requests/s')
            clear_output(wait = True)

            # Pause the loop
            sleep(randint(1,2))

            # Warning for non-200 status codes
            if response.status != 200:
                warn('Request #{}; Status code: {}'.format(n_request, response.status))
                errors.append(url)

            # Parse the content of the request with BeautifulSoup
            html_text = response.read().decode("utf-8")
            movie_html_soup = BeautifulSoup(html_text, 'html.parser')
            
            if movie_html_soup.find('div', 'titlebar-title'):
                # Scrape the movie ID 
                tp_id = get_movie_ID(movie_html_soup)
                # Scrape the title
                tp_title = get_movie_title(movie_html_soup)
                # Scrape the release date
                tp_release_dt = get_movie_release_date(movie_html_soup)
                # Scrape the duration
                tp_duration = get_movie_duration(movie_html_soup)
                # Scrape the directors
                tp_director = get_movie_directors(movie_html_soup)
                # Scrape the actors
                tp_actor = get_movie_actors(movie_html_soup)
                # Scrape the genres
                tp_genre = get_movie_genres(movie_html_soup)
                # Scrape the nationality
                tp_nation = get_movie_nationality(movie_html_soup)
                # Scrape the press ratings
                tp_press_rating = get_movie_press_rating(movie_html_soup)
                # Scrape the number of press ratings
                tp_nb_press_rating = get_movie_press_rating_count(movie_html_soup)
                # Scrape the spec ratings
                tp_spec_rating = get_movie_spec_rating(movie_html_soup)
                # Scrape the number of spec ratings
                tp_nb_spec_rating = get_movie_spec_rating_count(movie_html_soup)
                # Scrape the summary
                tp_summary = get_movie_summary(movie_html_soup)
                
                # Append the data
                df_tmp = pd.DataFrame({'id': [tp_id],
                                       'title': [tp_title],
                                       'release_date': [tp_release_dt],
                                       'duration': [tp_duration],
                                       'genres': [tp_genre],
                                       'directors': [tp_director],
                                       'actors': [tp_actor],
                                       'nationality': [tp_nation],
                                       'press_rating': [tp_press_rating],
                                       'nb_press_rating': [tp_nb_press_rating],
                                       'spec_rating': [tp_spec_rating],
                                       'nb_spec_rating': [tp_nb_spec_rating],
                                       'summary': [tp_summary]})
                
                df = pd.concat([df, df_tmp], ignore_index=True)
                
        except Exception:
            errors.append(url)
            warn('Request #{} fail; Total errors : {}'.format(n_request, len(errors)))
            traceback.print_exc()
            
    # monitoring 
    elapsed_time = time() - start_time
    print('Done; {} requests in {} seconds with {} errors'.format(n_request, round(elapsed_time, 0), len(errors)))
    clear_output(wait = True)
    df.to_csv("allocine_movies.csv")
    # list to dataframe
    errors_df = pd.DataFrame(errors, columns=['url'])
    errors_df.to_csv("allocine_errors.csv")
    # return dataframe and errors
    return df, errors

In [19]:
response = urlopen("http://www.allocine.fr/film/fichefilm_gen_cfilm=211012.html")
html_text = response.read().decode("utf-8")
movie_html_soup = BeautifulSoup(html_text, 'html.parser')
movie_html_soup.find('div', 'titlebar-title')
tp_title = movie_html_soup.find('div', 'titlebar-title').text

the_movie = movie_html_soup.section.div
#movie_info = the_movie.select('.meta-body-item')
movie_info = movie_html_soup.find_all('div', 'meta-body-item')
rating_info = the_movie.select('.rating-item')

### Getting movies urls

In [20]:
# Scrape the pages from start_page to end_page (excluded), or nb_pages pages when end_page is None
start_page = 1
end_page = 1
nb_pages = 2
getMoviesUrl(start_page, end_page=None, nb_pages=nb_pages)

Page Request: 2; Movie Request: 30


### Loading the list of urls

In [21]:
# Load the list of urls 
m_url = pd.read_csv("movie_url.csv",names=['url'])
m_url = m_url['url'].tolist()

### Scraping the data

In [22]:
# Scrape the data 
d, e = ScrapeURL(m_url)

Done; 30 requests in 104.0 seconds with 0 errors


In [23]:
d

Unnamed: 0,id,title,release_date,duration,genres,directors,actors,nationality,press_rating,nb_press_rating,spec_rating,nb_spec_rating,summary
0,211012,The Batman,2022-03-02,177,"Action, Policier, Thriller","Matt Reeves, Peter Craig","Robert Pattinson, Zoë Kravitz, Paul Dano",américain,3.9,37.0,4.2,9776.0,Deux années à arpenter les rues en tant que Ba...
1,281976,Goliath,2022-03-09,122,Thriller,"Frédéric Tellier, Simon Moutaïrou","Gilles Lellouche, Pierre Niney, Emmanuelle Bercot",français,3.6,29.0,4.0,2837.0,"France, professeure de sport le jour, ouvrière..."
2,42303,Permis de construire,2022-03-09,93,Comédie,"Eric Fraticelli, Didier Bourdon","Didier Bourdon, Eric Fraticelli, Anne Consigny",français,1.8,4.0,3.1,660.0,"Dentiste à Paris, Romain vient de perdre son p..."
3,287738,En corps,2022-03-30,120,"Comédie dramatique, Drame, Comédie","Cédric Klapisch, Santiago Amigorena","Marion Barbeau, Hofesh Shechter, Denis Podalydès",français,3.4,32.0,4.2,1234.0,"Elise, 26 ans est une grande danseuse classiqu..."
4,284864,Notre-Dame brûle,2022-03-16,110,Drame,"Jean-Jacques Annaud, Thomas Bidegain","Samuel Labarthe, Jean-Paul Bordes, Mikaël Chir...",français,3.5,32.0,4.0,1711.0,"Le long métrage de Jean-Jacques Annaud, recons..."
5,260627,Morbius,2022-03-30,105,"Action, Fantastique, Aventure","Burk Sharpless, Matt Sazama","Jared Leto, Matt Smith (XI), Adria Arjona",américain,2.3,11.0,2.3,1131.0,"Découvrez pour la première fois au cinéma, le ..."
6,281401,Le Temps des secrets,2022-03-23,108,Famille,"Christophe Barratier, Marcel Pagnol","Léo Campion (II), Guillaume De Tonquédec, Méla...",français,2.6,20.0,3.8,482.0,Adaptation du Temps des secrets (troisième tom...
7,186437,Uncharted,2022-02-16,116,"Aventure, Action","Rafe Judkins, Art Marcum","Tom Holland, Mark Wahlberg, Sophia Ali",américain,2.6,23.0,3.6,3883.0,"Nathan Drake, voleur astucieux et intrépide, e..."
8,240617,Ambulance,2022-03-23,136,"Thriller, Action","Chris Fedak, Laurits Munch-Petersen","Jake Gyllenhaal, Yahya Abdul-Mateen II, Eiza G...",américain,3.1,21.0,3.1,1006.0,"Will Sharp, un vétéran décoré fait appel à la ..."
9,281709,À plein temps,2022-03-16,85,Drame,Eric Gravel,"Laure Calamy, Anne Suarez, Geneviève Mnich",français,4.0,29.0,4.1,1167.0,Julie se démène seule pour élever ses deux enf...
