# 🎦Web Scraping Movies Data From AlloCiné.fr🎬

This script builds a DataFrame by scraping data from AlloCiné, a company that provides information on French cinema. Because the full scrape takes a long time, we scrape the data in two steps:
- First, we scrape the URL of each movie with `getMoviesUrl()`
- Then, we use the URL list to scrape the data for each movie with `ScrapeURL()`

*📝Note : We use the popular BeautifulSoup package*

## Functions :

### `getMoviesUrl(start_page, end_page)` :

Saves the list of URLs as a CSV file at `../Movies/Data/movie_url.csv`. The arguments must be integers; they select the range of pages to scrape. `end_page` is included in the range.

### `ScrapeURL(movie_url, dwld_poster)` :

Iterates over the list of URLs generated by `getMoviesUrl()` and scrapes the data for each movie. In the process, we extract :

- `id` : AlloCiné movie id
- `title` : the movie's title (in French)
- `release_date`: the release date
- `duration`: the movie's length (in minutes)
- `genres` : the movie's genres (as a comma-separated string)
- `directors` : the movie's directors (as a comma-separated string)
- `actors` : the main actors (as a comma-separated string)
- `nationality`: the movie's nationalities (as a comma-separated string)
- `press_rating`: average press rating (from 0.5 to 5 stars ⭐⭐⭐⭐⭐)
- `nb_press_rating`: number of ratings made by the press
- `spect_rating`: average AlloCiné users rating (from 0.5 to 5 stars ⭐⭐⭐⭐⭐)
- `nb_spect_rating`: number of ratings made by the users/spectators
- `summary`: the movie's summary
- `poster_link`: URL of the movie's poster

*📝Note : intermediate functions were created to retrieve each individual feature of the dataframe, which makes debugging easier.*

The function `ScrapeURL()` returns two objects : the data as a dataframe and the URLs that raised an error as a list. Both are also saved, as `../Movies/Data/allocine_movies.csv` and `../Movies/Data/allocine_errors.csv`. For now, the URLs that failed are neither retried nor merged into the dataframe, because not all of the underlying issues have been identified yet.

---
## Import libs

In [23]:
# Import libs
import pandas as pd
import numpy as np
import re
import unicodedata
from time import time
from time import sleep
from datetime import timedelta, datetime
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup
import dateparser
import os

from warnings import warn
from IPython.display import clear_output
import traceback


## Functions: Getting movie info

### Get movie ID: `get_movie_ID(movie_soup)`

In [7]:
def get_movie_ID(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie ID from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie ID.'''
    movie_ID = re.sub(
        r"\D", "", movie_soup.find("nav", {"class": "third-nav"}).a["href"]
        )
    return movie_ID

### Get movie title: `get_movie_title(movie_soup)`

In [8]:
def get_movie_title(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie title from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie title.'''
    movie_title = movie_soup.find("div", {"class": "titlebar-title"}).text.strip()
    return movie_title

### Get movie release date: `get_movie_release_date(movie_soup)`

#### Convert French months to English months: `convert_month(month)`

In [9]:
def convert_month(month: str) -> str:
    '''
    Convert French months into English months for the dateparser to work.
    :param month: French month.
    :return: English month, or "Unknown" if the month is not recognized.'''
    # Lookup table instead of a long if/elif chain
    months = {
        "janvier": "January",
        "février": "February",
        "mars": "March",
        "avril": "April",
        "mai": "May",
        "juin": "June",
        "juillet": "July",
        "août": "August",
        "septembre": "September",
        "octobre": "October",
        "novembre": "November",
        "décembre": "December",
    }
    return months.get(month, "Unknown")
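
As an offline sanity check of the month conversion, the sketch below parses a French date string with the standard library only. The names `FRENCH_MONTHS` and `parse_french_date` are illustrative and not part of this notebook, and the input shape (`"2 mars 2022"`) is assumed from the `%d %B %Y` pattern used by the release-date function.

```python
from datetime import date
from typing import Optional

# French month names mapped to month numbers (illustrative helper, not in the notebook)
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(text: str) -> Optional[date]:
    """Parse a 'day month year' French date; return None if it cannot be parsed."""
    parts = text.strip().split()
    if len(parts) != 3:
        return None
    day, month, year = parts
    month_number = FRENCH_MONTHS.get(month.lower())
    if month_number is None or not (day.isdigit() and year.isdigit()):
        return None
    try:
        return date(int(year), month_number, int(day))
    except ValueError:
        # Raised for impossible dates such as "30 février 2022"
        return None


print(parse_french_date("2 mars 2022"))
```

Returning `None` for anything unexpected mirrors how the scraping helpers in this notebook signal a missing value.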


#### Main function: `get_movie_release_date(movie_soup)`

In [20]:
def get_movie_release_date(movie_soup: BeautifulSoup) -> datetime:
    '''
    Scrape the movie release date from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie release date as a date object. None if no date is found.'''
    movie_release_date = movie_soup.find("span", {"class": "date"})
    if movie_release_date:
        movie_release_date = movie_release_date.text.strip()
        month = movie_release_date.split(' ')[1]
        movie_release_date = movie_release_date.replace(month, convert_month(month))
        movie_release_date = dateparser.parse(movie_release_date, date_formats=["%d %B %Y"]).date()        
    return movie_release_date

In [21]:
rd = get_movie_release_date(movie_html_soup)
rd

datetime.date(2022, 3, 2)

In [31]:
rd >= datetime.now().date()

False

### Get movie duration: `get_movie_duration(movie_soup)`

In [8]:
def get_movie_duration(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie duration from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie duration in minutes.'''
    movie_duration = movie_soup.find("span", {"class": "spacer"}).next_sibling.strip()
    if movie_duration != "":
        duration_timedelta = pd.to_timedelta(movie_duration).components
        movie_duration = duration_timedelta.hours * 60 + duration_timedelta.minutes
    return movie_duration    
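
For testing the duration conversion without pandas or a live page, here is a minimal sketch based on a regular expression. It assumes the duration text looks like `"1h 58min"`; that format is an assumption about the page, and `duration_to_minutes` is a hypothetical helper, not a function of this notebook.

```python
import re
from typing import Optional


def duration_to_minutes(text: str) -> Optional[int]:
    """Convert a string such as '1h 58min' to minutes; None if no duration is found."""
    # Both the hours part and the minutes part are optional
    match = re.fullmatch(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?", text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


print(duration_to_minutes("1h 58min"))
```

Unlike the function above, this sketch returns `None` rather than an empty string when the duration is missing, which keeps the column numeric.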

### Get movie genres: `get_movie_genres(movie_soup)`

In [9]:
def get_movie_genres(movie_soup: BeautifulSoup) -> str:
    ''' 
    Scrape the movie genres from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie genres as a CSV string. None if no genres are found.'''
    div_genres = movie_soup.find("div", {"class": "meta-body-item meta-body-info"})
    if div_genres:
        movie_genres = [
            genre.text
            for genre in div_genres.find_all("span", class_=re.compile(r".*==$"))
            if "\n" not in genre.text
        ]
        return ", ".join(movie_genres)
    return None

### Get movie directors: `get_movie_directors(movie_soup)`

In [10]:
def get_movie_directors(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie directors from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie directors as a CSV string. None if no directors are found.'''
    div_directors = movie_soup.find_all(
            "div", {"class": "meta-body-item meta-body-direction"}
        )
    # We retrieve the people next to the "Par" and "De" keywords, and we keep only one instance of each person.
    if div_directors:
        movie_directors = [
            link.text
            for directors in div_directors
            for link in directors.find_all(
                ["a", "span"], class_=re.compile(r".*blue-link$")
            )
        ]
        # dict.fromkeys() removes duplicates while keeping the order found on the page
        return ", ".join(dict.fromkeys(movie_directors))
    return None

### Get movie actors: `get_movie_actors(movie_soup)`

In [11]:
def get_movie_actors(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie actors from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie actors as a CSV string. None if no actors are found.'''
    div_actors = movie_soup.find("div", {"class": "meta-body-item meta-body-actor"})
    if div_actors:
        movie_actors = [actor.text for actor in div_actors.find_all(["a", "span"])][1:]
        return ", ".join(movie_actors)
    return None

### Get movie nationality: `get_movie_nationality(movie_soup)`

In [12]:
def get_movie_nationality(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie nationality from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie nationality as a CSV string.'''
    movie_nationality = [
            nationality.text.strip()
            for nationality in movie_soup.find_all("span", class_="nationality")
        ]
    return ", ".join(movie_nationality)

### Get movie press ratings: `get_movie_press_rating(movie_soup)`

In [13]:
def get_movie_press_rating(movie_soup: BeautifulSoup) -> float:
    '''
    Scrape the movie average press rating from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie average press rating. None if no rating is found.'''
    # get all the available ratings
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Presse" in ratings.text:
            return float(
                re.sub(
                    ",", ".", ratings.find("span", {"class": "stareval-note"}).text
                )
            )
    return None

### Get movie number of press ratings: `get_movie_press_rating_count(movie_soup)`

In [14]:
def get_movie_press_rating_count(movie_soup: BeautifulSoup) -> int:
    '''
    Scrape the movie number of press ratings from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie number of press ratings. None if no rating is found.'''
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Presse" in ratings.text:
            # We keep only the number of ratings, and we leave the number of reviews out.
            # (eg: re.match("\s\d+", " 10154 notes dont 1327 critiques").group() returns " 10154")
            return int(
                re.match(
                    r"\s\d+",
                    ratings.find("span", {"class": "stareval-review"}).text,
                ).group()
            )
    return None

### Get movie spectator ratings: `get_movie_spec_rating(movie_soup)`

In [15]:
def get_movie_spec_rating(movie_soup: BeautifulSoup) -> float:
    '''
    Scrape the movie average spectators' rating from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie average spec rating. None if no rating is found.'''
    # get all the available ratings
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Spectateurs" in ratings.text:
            return float(
                re.sub(
                    ",", ".", ratings.find("span", {"class": "stareval-note"}).text
                )
            )
    return None

### Get movie number of spec ratings: `get_movie_spec_rating_count(movie_soup)`

In [16]:
def get_movie_spec_rating_count(movie_soup: BeautifulSoup) -> int:
    '''
    Scrape the movie number of spec ratings from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie number of spec ratings. None if no rating is found.'''
    movie_ratings = movie_soup.find_all("div", class_="rating-item")
    for ratings in movie_ratings:
        if "Spectateurs" in ratings.text:
            # We keep only the number of ratings, and we leave the number of reviews out.
            # (eg: re.match("\s\d+", " 10154 notes dont 1327 critiques").group() returns " 10154")
            return int(
                re.match(
                    r"\s\d+",
                    ratings.find("span", {"class": "stareval-review"}).text
                ).group()  
            )
    return None
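
The two text conversions used by the rating functions can be checked in isolation. The sketch below reuses the sample label from the comments above; `parse_rating` and `parse_rating_count` are illustrative names, not functions of this notebook.

```python
import re
from typing import Optional


def parse_rating(text: str) -> float:
    """Convert a rating written with a decimal comma ('3,5') to a float."""
    return float(text.strip().replace(",", "."))


def parse_rating_count(text: str) -> Optional[int]:
    """Return the first integer found in a review label, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(parse_rating("3,5"))
print(parse_rating_count(" 10154 notes dont 1327 critiques"))
```

One design difference: `re.search` does not require the leading whitespace that `re.match(r"\s\d+", ...)` expects, and the explicit `None` check avoids an `AttributeError` when the label contains no number.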

### Get movie summary: `get_movie_summary(movie_soup)`

In [17]:
def get_movie_summary(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie summary from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie summary. None if no summary is found.'''
    # The synopsis section may be missing, so we check it before the second lookup
    section = movie_soup.find("section", {"class": "section ovw ovw-synopsis"})
    movie_summary = section.find("div", {"class": "content-txt"}) if section else None
    if movie_summary:
        movie_summary = movie_summary.text.strip()
        return unicodedata.normalize("NFKC", movie_summary)
    return None

### Get movie poster: `get_movie_poster(movie_soup)`

In [18]:
def get_movie_poster(movie_soup: BeautifulSoup) -> str:
    '''
    Scrape the movie poster link from the movie page.
    :param movie_soup: BeautifulSoup object of the movie page.
    :return: The movie poster link.'''
    return movie_soup.find("img", {"class": "thumbnail-img"})["src"]

### Download movie poster: `download_movie_poster(poster_link, movie_name)`

In [19]:
def download_movie_poster(poster_link: str, movie_name: str) -> None:
    '''
    Download the movie poster from the poster link.
    :param poster_link: The poster link.
    :param movie_name: The movie name.
    :return: nothing but saves the movie poster as a jpg file.'''
    poster = urlopen(poster_link)
    poster_path = "../Movies/Posters/"
    # We create the folder if not exists
    os.makedirs(os.path.dirname(poster_path), exist_ok=True) 
    save_path = f"{poster_path}{movie_name.replace(' ','_').replace('_:_','_').replace(':_','_').replace(',','')}_poster.jpg"
    # The with block closes the file, so no explicit close() is needed
    with open(save_path, "wb") as f:
        f.write(poster.read())
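
The chained `replace()` calls cover spaces, colons and commas, but a title containing another special character (a slash, for instance) would still produce an invalid path. As a possible alternative, the sketch below builds the file name with a single regular expression. `poster_filename` is a hypothetical helper and the exact naming rule is an assumption, not the one used in this notebook.

```python
import re


def poster_filename(title: str) -> str:
    """Build a file-system-friendly poster file name from a movie title."""
    # Replace every run of characters that are not word characters or hyphens with "_"
    slug = re.sub(r"[^\w-]+", "_", title.strip()).strip("_")
    return f"{slug}_poster.jpg"


print(poster_filename("Spider-Man : No Way Home"))
```

Because `\w` matches Unicode letters in Python 3, accented French titles keep their accents.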

#### Download all movie posters: `download_movie_posters(movies_df)`

In [None]:
def download_movie_posters(movies_df: pd.DataFrame) -> None:
    '''
    Download all the movie posters from the poster links in the movies dataframe.
    :param movies_df: The movies dataframe.
    :return: nothing but saves the movie posters as jpg files.'''
    for index, row in movies_df.iterrows():
        download_movie_poster(row["poster_link"], row["title"])

---
## Function: `getMoviesUrl(start_page, end_page)`

In [34]:
def getMoviesUrl(start_page: int, end_page: int=None, nb_pages: int=1) -> None:
    '''
    Scrape the movies urls from the AlloCine website's movie page (http://www.allocine.fr/films/).
    It will ignore the movies that have not been released yet.
    :param start_page: The first page to scrape.
    :param end_page: The last page to scrape (included) (optional).
    :param nb_pages: The number of pages to scrape (default 1).
    :return: Nothing but saves the list of movies urls in a csv file.'''
    if start_page <= 0:
        raise ValueError('start_page must be positive!')

    # Set the list
    movie_url = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = start_page
    # We will scrape by default at least 1 page if end_page is not specified, 
    # or if it is lower than start_page,
    # or if the number of pages to scrape is negative.
    if nb_pages < 1:
        nb_pages = 1
    if end_page is None or end_page < start_page:
        end_page = start_page + nb_pages - 1
    
    # Number of movie requests
    m_requests = 0
    # Number of movies not yet released, counted across all pages
    nr_movies = 0
        
    for p in range(start_page, end_page + 1):

        # Get request
        url = f'http://www.allocine.fr/films/?page={p}'
        response = urlopen(url)
        
        # Pause the loop
        sleep(randint(1,2))
            
        # Monitoring the requests
        elapsed_time = time() - start_time
        print(f'>Page Request: {p_requests}; Frequency: {p_requests/elapsed_time} requests/s')
        clear_output(wait = True)
            
        # Warning for non-200 status codes (urlopen responses expose `status`, not `status_code`)
        if response.status != 200:
            warn(f'>Page Request: {p_requests}; Status code: {response.status}')

        # Break the loop if the number of requests is greater than expected
        if p_requests > end_page:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        html_text = response.read().decode("utf-8")
        html_soup = BeautifulSoup(html_text, 'html.parser')

        # Select all the movies url from a single page
        movies = html_soup.find_all('h2', 'meta-title')
        m_requests += len(movies)
       
        # Monitoring the requests
        print(f'>Page Request: {p_requests}; Movie Request: {m_requests}')
        clear_output(wait = True)
        
        # Pause the loop
        sleep(1)
        
        for movie in movies:
            m_url = f'http://www.allocine.fr{movie.a["href"]}'
            # get_movie_release_date() expects the parsed movie page, not its url
            m_soup = BeautifulSoup(urlopen(m_url).read().decode("utf-8"), 'html.parser')
            m_rd = get_movie_release_date(m_soup)
            # We keep the movie url only if the movie has a release date that is not in the future
            if m_rd and m_rd <= datetime.now().date():
                movie_url.append(m_url) 
            else:
                nr_movies += 1
        p_requests += 1

    # Saving the files
    movie_path = '../Movies/Data/'
    # We create the folder if not exists
    os.makedirs(os.path.dirname(movie_path), exist_ok=True) 
    print(f'--> Done; {p_requests-1} Page Requests and {m_requests-nr_movies}/{m_requests} Movie Requests in {timedelta(seconds=time()-start_time)}')
    r = np.asarray(movie_url)
    np.savetxt(f"{movie_path}movie_url.csv", r, delimiter=",", fmt='%s')
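
The defaulting rules for `start_page`, `end_page` and `nb_pages` are easy to get wrong, so here is a sketch that isolates them in a small function that can be tested without any network access. `resolve_page_range` is an illustrative name; the rules themselves are the ones described in the comments of `getMoviesUrl()`.

```python
from typing import Optional


def resolve_page_range(start_page: int, end_page: Optional[int] = None, nb_pages: int = 1) -> range:
    """Return the pages to scrape, with end_page included."""
    if start_page <= 0:
        raise ValueError("start_page must be positive")
    # At least one page is always scraped
    nb_pages = max(nb_pages, 1)
    # Fall back to nb_pages when end_page is missing or lower than start_page
    if end_page is None or end_page < start_page:
        end_page = start_page + nb_pages - 1
    return range(start_page, end_page + 1)


print(list(resolve_page_range(3, end_page=5)))
```

With `start_page=1`, `end_page=None` and `nb_pages=20`, the range covers pages 1 to 20, which matches the 20 page requests made in the launch section.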

## Function: `ScrapeURL(movie_url, dwld_poster)`

In [20]:
def ScrapeURL(movie_url: list, dwld_poster: bool = False) -> tuple[pd.DataFrame, list]:
    '''
    Scrape the data from the movies urls.
    :param movie_url: The list of movies urls.
    :param dwld_poster: Boolean to download the movie poster (default False).
    :return: The dataframe of the movies and the list of the urls that returned an error. Both are saved in a csv file.'''
    # init the dataframe
    c = ["id",
        "title",
        "release_date",
        "duration",
        "genres",
        "directors",
        "actors",
        "nationality",
        "press_rating",
        "nb_press_rating",
        "spect_rating",
        "nb_spect_rating",
        "summary",
        "poster_link"
    ]
    df = pd.DataFrame(columns=c)
    
    # preparing the setting and monitoring loop
    start_time = time()
    n_request = 0
    
    # init list to save errors
    errors = []
    
    # request loop
    for url in movie_url:
        try :
            response = urlopen(url)

            # Pause the loop
            sleep(randint(1,2))

            # Monitoring the requests
            n_request += 1
            
            elapsed_time = time() - start_time
            print(f'Request #{n_request}; Frequency: {n_request/elapsed_time} requests/s')
            clear_output(wait = True)

            # Pause the loop
            sleep(randint(1,2))

            # Warning for non-200 status codes (urlopen responses expose `status`, not `status_code`)
            if response.status != 200:
                warn('Request #{}; Status code: {}'.format(n_request, response.status))
                errors.append(url)

            # Parse the content of the request with BeautifulSoup
            html_text = response.read().decode("utf-8")
            movie_html_soup = BeautifulSoup(html_text, 'html.parser')
            
            # Get the movie data
            if movie_html_soup.find('div', 'titlebar-title'):
                # Scrape the movie ID 
                tp_id = get_movie_ID(movie_html_soup)
                # Scrape the title
                tp_title = get_movie_title(movie_html_soup)
                # Scrape the release date
                tp_release_dt = get_movie_release_date(movie_html_soup)
                # Scrape the duration
                tp_duration = get_movie_duration(movie_html_soup)
                # Scrape the directors
                tp_director = get_movie_directors(movie_html_soup)
                # Scrape the actors
                tp_actor = get_movie_actors(movie_html_soup)
                # Scrape the genres
                tp_genre = get_movie_genres(movie_html_soup)
                # Scrape the nationality
                tp_nation = get_movie_nationality(movie_html_soup)
                # Scrape the press ratings
                tp_press_rating = get_movie_press_rating(movie_html_soup)
                # Scrape the number of press ratings
                tp_nb_press_rating = get_movie_press_rating_count(movie_html_soup)
                # Scrape the spec ratings
                tp_spec_rating = get_movie_spec_rating(movie_html_soup)
                # Scrape the number of spec ratings
                tp_nb_spec_rating = get_movie_spec_rating_count(movie_html_soup)
                # Scrape the summary
                tp_summary = get_movie_summary(movie_html_soup)
                # Scrape the poster
                tp_poster = get_movie_poster(movie_html_soup)
                # Download the poster (optional)
                if dwld_poster:
                    download_movie_poster(tp_poster, tp_title)
                
                # Append the data
                df_tmp = pd.DataFrame({'id': [tp_id],
                                       'title': [tp_title],
                                       'release_date': [tp_release_dt],
                                       'duration': [tp_duration],
                                       'genres': [tp_genre],
                                       'directors': [tp_director],
                                       'actors': [tp_actor],
                                       'nationality': [tp_nation],
                                       'press_rating': [tp_press_rating],
                                       'nb_press_rating': [tp_nb_press_rating],
                                       'spect_rating': [tp_spec_rating],
                                       'nb_spect_rating': [tp_nb_spec_rating],
                                       'summary': [tp_summary],
                                       'poster_link': [tp_poster]})
                
                df = pd.concat([df, df_tmp], ignore_index=True)                
        except Exception:
            errors.append(url)
            warn(f'Request #{n_request} fail; Total errors : {len(errors)}')
            traceback.print_exc()
            
    # monitoring 
    movie_path = '../Movies/Data/'
    # We create the folder if not exists
    os.makedirs(os.path.dirname(movie_path), exist_ok=True) 
    elapsed_time = time() - start_time
    print(f'Done; {n_request} requests in {timedelta(seconds=elapsed_time)} with {len(errors)} errors')
    clear_output(wait = True)
    # Saving files
    df.to_csv(f"{movie_path}allocine_movies.csv", index=False)
    # list to dataframe
    errors_df = pd.DataFrame(errors, columns=['url'])
    errors_df.to_csv(f"{movie_path}allocine_errors.csv")
    # return dataframe and errors
    return df, errors
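
The urls returned in `errors` are not handled yet. One possible way to deal with transient failures is a second pass over those urls, sketched below. `retry_failed` and its `scrape_one` callback are hypothetical: `scrape_one` stands for any function that scrapes a single url and raises an exception on failure, which this notebook does not define as a standalone function.

```python
from typing import Callable, Dict, List, Tuple


def retry_failed(
    urls: List[str],
    scrape_one: Callable[[str], dict],
    max_attempts: int = 2,
) -> Tuple[Dict[str, dict], List[str]]:
    """Try each url up to max_attempts times; return the results and the urls still failing."""
    results: Dict[str, dict] = {}
    pending = list(urls)
    for _ in range(max_attempts):
        still_failing = []
        for url in pending:
            try:
                results[url] = scrape_one(url)
            except Exception:
                # Keep the url for the next attempt
                still_failing.append(url)
        pending = still_failing
        if not pending:
            break
    return results, pending
```

Urls that still fail after every attempt are returned as a list, so they can be saved and inspected in the same way as `allocine_errors.csv`.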

---
## Launching the script

### Getting movies urls

In [21]:
# Scrape the page from start_page to end_page (included) or with nb_pages
start_page = 1
end_page = None
nb_pages = 20
getMoviesUrl(start_page, end_page=end_page, nb_pages=nb_pages)

--> Done; 20 Page Requests and 300 Movie Requests in 0:00:58.376464


### Loading the list of urls

In [22]:
# Load the list of urls 
m_url = pd.read_csv("../Movies/Data/movie_url.csv",names=['url'])
m_url = m_url['url'].tolist()

### Scraping the data

In [23]:
# Scrape the data 
movies_df, movies_errors = ScrapeURL(m_url)

Done; 300 requests in 0:17:16.264138 with 2 errors
