# 🎦**Web Scraping Series Data From AlloCiné.fr**📺

This notebook builds a DataFrame by scraping **series** data from AlloCiné, a company that provides information on French cinema. Because scraping every page takes a long time, the work is split into two steps:
- First, we scrape the URL of each series with `getSeriesUrl(...)`.
- Then, we use that URL list to scrape the data for each series with `ScrapeURL(...)`.

*📝Note 1: We use the popular BeautifulSoup package*

## **Functions :**

### `getSeriesUrl(start_page, end_page, nb_pages)` :

Saves the list of series URLs as a CSV file, `../Series/Data/series_url.csv`. The arguments must be integers; they select the range of pages to scrape (or the number of pages, default 1). `end_page` is included in the range. If `end_page` is valid (not `None` and >= `start_page`), the `nb_pages` argument is ignored.
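
The page-selection rule can be checked in isolation. The sketch below mirrors the argument handling of `getSeriesUrl(...)` without making any request; `resolve_page_range` is a hypothetical helper name used only for this illustration, not a function of the notebook.

```python
from typing import Optional


def resolve_page_range(start_page: int,
                       end_page: Optional[int] = None,
                       nb_pages: int = 1) -> range:
    """Return the pages to scrape, with end_page included."""
    if start_page <= 0:
        raise ValueError("start_page must be positive")
    # A page count below 1 falls back to a single page
    if nb_pages < 1:
        nb_pages = 1
    # nb_pages is only used when end_page is missing or lower than start_page
    if end_page is None or end_page < start_page:
        end_page = start_page + nb_pages - 1
    return range(start_page, end_page + 1)


print(list(resolve_page_range(1)))                          # [1]
print(list(resolve_page_range(3, end_page=5, nb_pages=9)))  # [3, 4, 5]
print(list(resolve_page_range(3, end_page=2, nb_pages=4)))  # [3, 4, 5, 6]
```

In the second call `nb_pages` is ignored because `end_page` is valid; in the third, `end_page` is lower than `start_page`, so `nb_pages` takes over.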

### `ScrapeURL(series_url, dwld_poster)` :

Iterates over the list of URLs generated by `getSeriesUrl(...)` and scrapes the data for each series. In the process, we extract:

- `id` : Allocine series id
- `title` : the series title (in French)
- `status` : the series status (in French) (En cours|Terminée|Annulée)
- `release_date`: the series release date
- `duration`: the series average episode length (in minutes)
- `nb_seasons`: the number of seasons
- `nb_episodes`: the number of episodes
- `genres` : the series genres (as a CSV string)
- `directors` : series directors (as a CSV string)
- `actors` : main actors of the series (as a CSV string)
- `nationality`: nationality of the series (as a CSV string)
- `press_rating`: average press rating (from 0.5 to 5 stars ⭐⭐⭐⭐⭐)
- `nb_press_rating`: number of ratings made by the press
- `spect_rating`: average AlloCiné users rating (from 0.5 to 5 stars ⭐⭐⭐⭐⭐)
- `nb_spect_rating`: number of ratings made by the users/spectators
- `summary`: the series summary
- `poster_link`: url of the series poster

*📝Note 2: We can choose to download the poster image with the `dwld_poster` argument. If `True`, the poster image is downloaded and saved in the `../Series/Posters/` folder.*

*📝Note 3: Intermediate functions were created to retrieve each individual feature of the dataframe, which makes debugging easier.*

The function `ScrapeURL(...)` returns two objects: the data as a dataframe and the URLs that raised an error as a list. Both are also saved, as `../Series/Data/allocine_series.csv` and `../Series/Data/allocine_errors.csv`. For now, the failing URLs are not retried or merged into the dataframe, because not all of the underlying issues have been identified yet.
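
One possible way to use the returned error list is to run the failing URLs through the scraper again. The sketch below is only an illustration of that idea: `retry_failed` and `flaky_scraper` are hypothetical names, and the stand-in scraper replaces the real network calls so the example runs offline.

```python
from typing import Callable, Iterable, List, Tuple


def retry_failed(urls: Iterable[str],
                 scrape_one: Callable[[str], dict],
                 max_rounds: int = 2) -> Tuple[List[dict], List[str]]:
    """Try each URL up to max_rounds times; return (rows, urls still failing)."""
    rows: List[dict] = []
    pending = list(urls)
    for _ in range(max_rounds):
        still_failing = []
        for url in pending:
            try:
                rows.append(scrape_one(url))
            except Exception:
                # Keep the URL for the next round instead of stopping
                still_failing.append(url)
        pending = still_failing
        if not pending:
            break
    return rows, pending


# Stand-in scraper: fails the first time it sees a URL, succeeds afterwards
_seen = set()


def flaky_scraper(url: str) -> dict:
    if url not in _seen:
        _seen.add(url)
        raise RuntimeError("temporary failure")
    return {"url": url}


rows, remaining = retry_failed(["u1", "u2"], flaky_scraper)
print(rows)       # [{'url': 'u1'}, {'url': 'u2'}]
print(remaining)  # []
```

A retry only helps with transient failures such as timeouts; URLs that fail because the page layout differs would stay in the second list.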

(⚠️ **Warning** ⚠️: the process can take a long time, depending on the number of pages you choose to scrape. Running it on a dedicated computer is recommended.)

---
## **Import libs**

In [1]:
# Import libs
import pandas as pd
import numpy as np
import re
import unicodedata
from time import time
from time import sleep
from datetime import timedelta
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup
import os

from warnings import warn
from IPython.display import clear_output
import traceback


## **Functions:** Getting series infos

### Get series ID: `get_series_ID(series_soup)`

In [3]:
def get_series_ID(series_soup: BeautifulSoup) -> int:
    '''
    Scrape the series ID from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series ID.'''
    series_ID = int(re.sub(
        r"\D", "", series_soup.find("nav", {"class": "third-nav"}).a["href"]
        ))
    return series_ID
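
The ID extraction relies on `re.sub(r"\D", "", ...)`, which deletes every non-digit character from the link target. The sketch below shows that step alone on two made-up `href` values (their exact shape is an assumption, not taken from the site); `id_from_href` is a hypothetical helper name.

```python
import re


def id_from_href(href: str) -> int:
    """Keep only the digits of the link target and read them as one integer."""
    return int(re.sub(r"\D", "", href))


# Hypothetical href with a single group of digits
print(id_from_href("/series/ficheserie_gen_cserie=7157.html"))  # 7157

# Caveat: digits anywhere else in the href are concatenated too
print(id_from_href("/series/ficheserie-7157/saison-2/"))  # 71572
```

The approach is therefore only safe as long as the link contains no digits other than the ID.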

### Get series title: `get_series_title(series_soup)`

In [4]:
def get_series_title(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series title from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series title.'''
    series_title = series_soup.find("div", {"class": "titlebar-title"}).text.strip()
    return series_title

### Get series status: `get_series_status(series_soup)`

In [5]:
def get_series_status(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series status from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series status (En cours|Terminée|Annulée).'''
    series_status = series_soup.find("div", {"class": "label-status"}).text.strip()
    return series_status

### Get series release date: `get_series_release_date(series_soup)`

In [6]:
def get_series_release_date(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series release date from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series release date.'''
    series_release_date = series_soup.find("div", {"class": "meta-body-item meta-body-info"})
    if series_release_date:
        series_release_date = series_release_date.text.strip().split("\n")[0]
        if get_series_status(series_soup) == "Terminée":
            series_release_date = series_release_date.replace(" ", "")      
    return series_release_date

### Get series duration: `get_series_duration(series_soup)`

In [7]:
def get_series_duration(series_soup: BeautifulSoup) -> int:
    '''
    Scrape the series duration from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series duration in minutes. None if not found.'''
    series_duration = series_soup.find("span", {"class": "spacer"}).next_sibling.strip()
    if series_duration == "":
        # No duration on the page: return None instead of failing on int("")
        return None
    duration_timedelta = pd.to_timedelta(series_duration).components
    return int(duration_timedelta.hours * 60 + duration_timedelta.minutes)
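
For testing the hours-to-minutes conversion without pandas, a regex-based stand-in can help. This is a swapped-in technique, not the notebook's method, and it assumes duration strings shaped like `"42min"` or `"1h 30min"`; `duration_to_minutes` is a hypothetical name.

```python
import re
from typing import Optional

# Optional hours part, optional minutes part (assumed formats: "42min", "1h 30min")
_DURATION_RE = re.compile(r"^(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?$")


def duration_to_minutes(text: str) -> Optional[int]:
    """Convert a duration string to minutes; None if empty or unrecognised."""
    text = text.strip()
    if not text:
        return None
    match = _DURATION_RE.match(text)
    if match is None or (match.group(1) is None and match.group(2) is None):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


print(duration_to_minutes("42min"))     # 42
print(duration_to_minutes("1h 30min"))  # 90
print(duration_to_minutes(""))          # None
```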

### Get series nb of seasons and episodes: `get_series_nb_seasons_episodes(series_soup)`

In [8]:
def get_series_nb_seasons_episodes(series_soup: BeautifulSoup) -> tuple[int,int]:
    '''
    Scrape the series number of seasons and episodes from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series number of seasons and episodes. None if not found.'''
    series_nb_seasons_episodes = series_soup.find_all("div", {"class": "stats-numbers-row-item"})
    if series_nb_seasons_episodes:
        series_nb_seasons = series_nb_seasons_episodes[0].div.text
        series_nb_episodes = series_nb_seasons_episodes[1].div.text
        return int(series_nb_seasons), int(series_nb_episodes)
    return None, None

### Get series genres: `get_series_genres(series_soup)`

In [9]:
def get_series_genres(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series genres from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series genres as a CSV string. None if not found.'''
    div_genres = series_soup.find("div", {"class": "meta-body-item meta-body-info"})
    if div_genres:
        series_genres = [
            genre.text
            for genre in div_genres.find_all("span")
            if "\n" not in genre.text and "/" not in genre.text
        ]
        return ", ".join(series_genres)
    return None
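
The genre filter is meant to skip spans whose text contains a newline or a slash. In Python, each forbidden character needs its own membership test: an expression such as `"\n" and "/" not in text` only tests the slash, because the non-empty string `"\n"` is always truthy. The sketch below compares both forms on made-up span texts; the function names are hypothetical.

```python
def keeps_span_chained(text: str) -> bool:
    # "\n" is a non-empty string, hence always truthy: only the "/" test counts
    return bool("\n" and "/" not in text)


def keeps_span_explicit(text: str) -> bool:
    # One membership test per forbidden character
    return "\n" not in text and "/" not in text


samples = ["Drame", "Comédie / Drame", "2019\n"]  # hypothetical span texts
print([keeps_span_chained(s) for s in samples])   # [True, False, True]
print([keeps_span_explicit(s) for s in samples])  # [True, False, False]
```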

### Get series directors: `get_series_directors(series_soup)`

In [10]:
def get_series_directors(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series directors from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series directors as a CSV string. None if not found.'''
    div_directors = series_soup.find(
            "div", {"class": "meta-body-item meta-body-direction"}
        )
    if div_directors:
        # We retrieve the people next to the "Créée par" keyword
        series_directors = [
            link.text
            for link in div_directors.find_all(
                ["a", "span"], class_=re.compile(r".*blue-link$")
            )
        ]
        return ", ".join(series_directors)
    return None

### Get series actors: `get_series_actors(series_soup)`

In [11]:
def get_series_actors(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series actors from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series actors as a CSV string. None if not found.'''
    div_actors = series_soup.find("div", {"class": "meta-body-item meta-body-actor"})
    if div_actors:
        # We retrieve the people next to the "Avec" keyword
        series_actors = [actor.text for actor in div_actors.find_all(["a", "span"])][1:]
        return ", ".join(series_actors)
    return None

### Get series nationality: `get_series_nationality(series_soup)`


In [12]:
def get_series_nationality(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series nationality from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series nationality as a CSV string.'''
    series_nationality = [
            nationality.text.strip()
            for nationality in series_soup.find_all("span", class_="nationality")
        ]
    return ", ".join(series_nationality)

### Get series press ratings: `get_series_press_rating(series_soup)`

In [13]:
def get_series_press_rating(series_soup: BeautifulSoup) -> float:
    '''
    Scrape the series average press rating from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series average press rating. None if not found.'''
    series_ratings = series_soup.find_all("div", class_="rating-item")
    for ratings in series_ratings:
        if "Presse" in ratings.text:
            # eg: Change from 4,5 to 4.5.
            return float(
                re.sub(
                    ",", ".", ratings.find("span", {"class": "stareval-note"}).text
                )
            )
    return None

### Get series number of press ratings: `get_series_press_rating_count(series_soup)`

In [14]:
def get_series_press_rating_count(series_soup: BeautifulSoup) -> int:
    '''
    Scrape the series number of press ratings from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series number of press ratings. None if not found.'''
    series_ratings = series_soup.find_all("div", class_="rating-item")
    for ratings in series_ratings:
        if "Presse" in ratings.text:
            # We keep only the number of ratings, and we leave the number of reviews out.
            # (eg: re.match("\s\d+", " 10154 notes dont 1327 critiques").group() returns " 10154")
            return int(
                re.match(
                    r"\s\d+",
                    ratings.find("span", {"class": "stareval-review"}).text,
                ).group()
            )
    return None
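
The two string conversions used by the rating functions can be tried without a page. The sketch below applies them to literal strings; `parse_rating` and `parse_rating_count` are hypothetical helper names, and the count parser deliberately uses `\s*` and a `None` fallback, a small variation on the `r"\s\d+"` pattern above, which requires exactly one leading whitespace character.

```python
import re
from typing import Optional


def parse_rating(text: str) -> float:
    """Turn a French-formatted rating such as '4,5' into 4.5."""
    return float(text.strip().replace(",", "."))


def parse_rating_count(text: str) -> Optional[int]:
    """Return the first number of the review text, or None if it has none."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


print(parse_rating("4,5"))                                     # 4.5
print(parse_rating_count(" 10154 notes dont 1327 critiques"))  # 10154
print(parse_rating_count("Aucune note"))                       # None
```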

### Get series spectator ratings: `get_series_spec_rating(series_soup)`

In [15]:
def get_series_spec_rating(series_soup: BeautifulSoup) -> float:
    '''
    Scrape the series average spectator rating from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series average spectator rating. None if not found.'''
    series_ratings = series_soup.find_all("div", class_="rating-item")
    for ratings in series_ratings:
        if "Spectateurs" in ratings.text:
            # eg: Change from 4,5 to 4.5.
            return float(
                re.sub(
                    ",", ".", ratings.find("span", {"class": "stareval-note"}).text
                )
            )
    return None

### Get series number of spec ratings: `get_series_spec_rating_count(series_soup)`

In [16]:
def get_series_spec_rating_count(series_soup: BeautifulSoup) -> int:
    '''
    Scrape the series number of spectator ratings from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series number of spectator ratings. None if not found.'''
    series_ratings = series_soup.find_all("div", class_="rating-item")
    for ratings in series_ratings:
        if "Spectateurs" in ratings.text:
            # We keep only the number of ratings, and we leave the number of reviews out.
            # (eg: re.match("\s\d+", " 10154 notes dont 1327 critiques").group() returns " 10154")
            return int(
                re.match(
                    r"\s\d+",
                    ratings.find("span", {"class": "stareval-review"}).text
                ).group()  
            )
    return None

### Get series summary: `get_series_summary(series_soup)`

In [17]:
def get_series_summary(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series summary from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series summary. None if not found.'''
    series_summary = series_soup.find(
            "section", {"class": "section ovw ovw-synopsis"}
        ).find("div", {"class": "content-txt"})
    if series_summary:
        series_summary = series_summary.text.strip()
        return unicodedata.normalize("NFKC", series_summary)
    return None

### Get series poster: `get_series_poster(series_soup)`

In [18]:
def get_series_poster(series_soup: BeautifulSoup) -> str:
    '''
    Scrape the series poster link from the series page.
    :param series_soup: BeautifulSoup object of the series page.
    :return: The series poster link.'''
    return series_soup.find("img", {"class": "thumbnail-img"})["src"]   

### Download series poster: `download_series_poster(poster_link, series_name)`

In [19]:
def download_series_poster(poster_link, series_name: str) -> None:
    '''
    Download the series poster from the series poster link.
    :param poster_link: The series poster link.
    :param series_name: The series name.
    :return: Nothing but saves the series poster as a jpg file.'''
    poster = urlopen(poster_link)
    poster_path = "../Series/Posters/"
    # We create the folder if not exists
    os.makedirs(os.path.dirname(poster_path), exist_ok=True) 
    save_path = f"{poster_path}{series_name.replace(' ','_').replace('_:_','_').replace(':_','_').replace(',','')}_poster.jpg"
    # The with statement closes the file on exit, so no explicit close() is needed
    with open(save_path, "wb") as f:
        f.write(poster.read())
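
The file name is derived from the series title through a chain of `replace` calls, which covers spaces, colons and commas. Titles containing other characters, for example `/` or `?`, could still produce an invalid path. The sketch below is one possible alternative, not the notebook's method: `poster_filename` is a hypothetical helper that collapses every run of characters other than letters, digits, underscores and hyphens into a single underscore. Unlike the original chain, it turns a comma into an underscore instead of dropping it.

```python
import re


def poster_filename(series_name: str) -> str:
    """Build a file name that keeps only word characters and hyphens."""
    # Each run of other characters (spaces, ':', '/', '?', ...) becomes one "_"
    stem = re.sub(r"[^\w\-]+", "_", series_name).strip("_")
    return f"{stem}_poster.jpg"


print(poster_filename("Star Wars : The Clone Wars"))  # Star_Wars_The_Clone_Wars_poster.jpg
print(poster_filename("9-1-1: Lone Star"))            # 9-1-1_Lone_Star_poster.jpg
print(poster_filename("Who/What?"))                   # Who_What_poster.jpg
```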

#### Download all series posters: `download_series_posters(series_df)`

In [20]:
def download_series_posters(series_df: pd.DataFrame) -> None:
    '''
    Download all the series posters from the poster links in the series dataframe.
    :param series_df: The series dataframe.
    :return: nothing but saves the series posters as jpg files.'''
    for index, row in series_df.iterrows():
        download_series_poster(row["poster_link"], row["title"])

---
## **Function:** `getSeriesUrl(start_page, end_page, nb_pages)`

In [21]:
def getSeriesUrl(start_page: int, end_page: int=None, nb_pages: int=1) -> None:
    '''
    Scrape the series urls from the AlloCine website's series page (http://www.allocine.fr/series-tv/).
    The range of pages to scrape goes from start_page to end_page (included) if end_page is not None and >= start_page.
    Else, the range of pages to scrape goes from start_page to start_page + nb_pages - 1.
    It will ignore the series that have not been released yet.
    :param start_page: The first page to scrape.
    :param end_page: The last page to scrape (included) (optional).
    :param nb_pages: The number of pages to scrape (default 1).
    :return:  Nothing but saves the list of series urls in a csv file.'''
    if start_page <= 0:
        raise ValueError('start_page must be positive !')

    # Set the list
    series_url = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = start_page
    # We will scrape by default at least 1 page if end_page is not specified, 
    # or if it is lower than start_page,
    # or if the number of pages to scrape is negative.
    if nb_pages < 1:
        nb_pages = 1
    if end_page is None or end_page < start_page:
        end_page = start_page + nb_pages - 1
    
    # Number of series requests
    s_requests = 0
        
    # Loop over the pages
    for p in range(start_page, end_page + 1):

        # Get request
        url = f'https://www.allocine.fr/series-tv/?page={p}'
        response = urlopen(url)
        
        # Pause the loop
        sleep(randint(1,2))
            
        # Monitoring the requests
        elapsed_time = time() - start_time
        print(f'>Page Request: {p_requests}; Frequency: {p_requests/elapsed_time} requests/s')
        clear_output(wait = True)
            
        # Warning for non-200 status codes
        if response.status != 200:
            warn(f'>Page Request: {p_requests}; Status code: {response.status}')

        # Break the loop if the number of requests is greater than expected
        if p_requests > end_page:
            warn('Number of requests was greater than expected.')
            break

        # Parse the content of the request with BeautifulSoup
        html_text = response.read().decode("utf-8")
        html_soup = BeautifulSoup(html_text, 'html.parser')

        # Select all the series url from a single page
        series = html_soup.find_all('h2', 'meta-title')
        s_requests += len(series)
        # Count the number of series not yet released 
        nr_series = 0
       
        # Monitoring the requests
        print(f'>Page Request: {p_requests}; Series Request: {s_requests}')
        clear_output(wait = True)
        
        # Pause the loop
        sleep(1)
        
        for serie in series:
            s_url = f'http://www.allocine.fr{serie.a["href"]}'
            s_soup = BeautifulSoup(urlopen(s_url), 'html.parser')
            s_status = get_series_status(s_soup)
            # We keep the series url only if the series has already been released.
            if s_status != "À venir":
                series_url.append(s_url)
            else:
                nr_series += 1              
        p_requests += 1

    # Saving the files
    series_path = '../Series/Data/'
    # We create the folder if not exists
    os.makedirs(os.path.dirname(series_path), exist_ok=True)
    # len(series_url) counts the released series kept across all pages
    print(f'--> Done; {p_requests-start_page} Page Requests and {len(series_url)}/{s_requests} Series Requests in {timedelta(seconds=time()-start_time)}')
    r = np.asarray(series_url)
    np.savetxt(f"{series_path}series_url.csv", r, delimiter=",", fmt='%s')

## **Function:** `ScrapeURL(series_url, dwld_poster)`

In [22]:
def ScrapeURL(series_url: list, dwld_poster: bool = False) -> tuple[pd.DataFrame, list]:
    '''
    Scrape the data from the series page url.
    :param series_url: The list of series page url.
    :param dwld_poster: Boolean to download the series poster (default False).
    :return: The dataframe of the series data and the list of urls that returned an error.
    Both are saved into csv files.'''
    # init the dataframe
    c = ["id",
        "title",
        "status",
        "release_date",
        "duration",
        "nb_seasons",
        "nb_episodes",
        "genres",
        "directors",
        "actors",
        "nationality",
        "press_rating",
        "nb_press_rating",
        "spect_rating",
        "nb_spect_rating",
        "summary",
        "poster_link"
    ]
    df = pd.DataFrame(columns=c)
    
    # preparing the setting and monitoring loop
    start_time = time()
    n_request = 0
    
    # init list to save errors
    errors = []
    
    # request loop
    for url in series_url:
        try :
            response = urlopen(url)

            # Pause the loop
            sleep(randint(1,2))

            # Monitoring the requests
            n_request += 1
            
            elapsed_time = time() - start_time
            print(f'Request #{n_request}; Frequency: {n_request/elapsed_time} requests/s')
            clear_output(wait = True)

            # Pause the loop
            sleep(randint(1,2))

            # Warning for non-200 status codes
            if response.status != 200:
                warn(f'Request #{n_request}; Status code: {response.status}')
                errors.append(url)

            # Parse the content of the request with BeautifulSoup
            html_text = response.read().decode("utf-8")
            series_html_soup = BeautifulSoup(html_text, 'html.parser')
            
            if series_html_soup.find('div', 'titlebar-title'):
                # Scrape the series ID 
                tp_id = get_series_ID(series_html_soup)
                # Scrape the title
                tp_title = get_series_title(series_html_soup)
                # Scrape the status
                tp_status = get_series_status(series_html_soup)
                # Scrape the release date
                tp_release_dt = get_series_release_date(series_html_soup)
                # Scrape the duration
                tp_duration = get_series_duration(series_html_soup)
                # Scrape the number of seasons and episodes
                tp_nb_seasons, tp_nb_episodes = get_series_nb_seasons_episodes(series_html_soup)
                # Scrape the genres
                tp_genre = get_series_genres(series_html_soup)
                # Scrape the directors
                tp_director = get_series_directors(series_html_soup)
                # Scrape the actors
                tp_actor = get_series_actors(series_html_soup)
                # Scrape the nationality
                tp_nation = get_series_nationality(series_html_soup)
                # Scrape the press ratings
                tp_press_rating = get_series_press_rating(series_html_soup)
                # Scrape the number of press ratings
                tp_nb_press_rating = get_series_press_rating_count(series_html_soup)
                # Scrape the spec ratings
                tp_spec_rating = get_series_spec_rating(series_html_soup)
                # Scrape the number of spec ratings
                tp_nb_spec_rating = get_series_spec_rating_count(series_html_soup)
                # Scrape the summary
                tp_summary = get_series_summary(series_html_soup)
                # Scrape the poster
                tp_poster = get_series_poster(series_html_soup)
                # Download the poster (optional)
                if dwld_poster:
                    download_series_poster(tp_poster, tp_title)
                
                # Append the data
                df_tmp = pd.DataFrame({'id': [tp_id],
                                       'title': [tp_title],
                                       'status': [tp_status],
                                       'release_date': [tp_release_dt],
                                       'duration': [tp_duration],
                                       'nb_seasons': [tp_nb_seasons],
                                       'nb_episodes': [tp_nb_episodes],
                                       'genres': [tp_genre],
                                       'directors': [tp_director],
                                       'actors': [tp_actor],
                                       'nationality': [tp_nation],
                                       'press_rating': [tp_press_rating],
                                       'nb_press_rating': [tp_nb_press_rating],
                                       'spect_rating': [tp_spec_rating],
                                       'nb_spect_rating': [tp_nb_spec_rating],
                                       'summary': [tp_summary],
                                       'poster_link': [tp_poster]})
                
                df = pd.concat([df, df_tmp], ignore_index=True)                
        except Exception:
            errors.append(url)
            warn(f'Request #{n_request} fail; Total errors : {len(errors)}')
            traceback.print_exc()
            
    # monitoring     
    series_path = '../Series/Data/'
    # We create the folder if not exists
    os.makedirs(os.path.dirname(series_path), exist_ok=True)
    elapsed_time = time() - start_time
    print(f'Done; {n_request} requests in {timedelta(seconds=elapsed_time)} with {len(errors)} errors')
    clear_output(wait = True)
    df.to_csv(f"{series_path}allocine_series.csv", index=False)
    # list to dataframe
    errors_df = pd.DataFrame(errors, columns=['url'])
    errors_df.to_csv(f"{series_path}allocine_errors.csv")
    # return dataframe and errors
    return df, errors

---
## **Launching the script**

### Getting series urls

In [23]:
# Scrape the page from start_page to end_page (included) or with nb_pages
start_page = 1
end_page = None
nb_pages = 1
getSeriesUrl(start_page, end_page=end_page, nb_pages=nb_pages)

--> Done; 1 Page Requests and 15/15 Series Requests in 0:00:07.512769


### Loading the list of urls

In [24]:
# Load the list of urls 
s_url = pd.read_csv("../Series/Data/series_url.csv",names=['url'])
s_url = s_url['url'].tolist()

### Scraping the data

In [25]:
# Scrape the data 
series_df, series_errors = ScrapeURL(s_url)

Done; 15 requests in 0:00:50.214055 with 0 errors
