# 🎦**Web Scraping User Ratings From AlloCiné.fr**🏆⭐⭐⭐⭐⭐

After building our movies and series dataframes in the previous notebooks (see [Webscraping Movies](https://github.com/Bastien-LDC/Allocine-Recommender-System/blob/master/Webscraping/Webscraping_Movies_From_AlloCine.ipynb) and [Webscraping Series](https://github.com/Bastien-LDC/Allocine-Recommender-System/blob/master/Webscraping/Webscraping_Series_From_AlloCine.ipynb)), we now retrieve the **user and press ratings** of movies and series from the AlloCiné website. We proceed as follows:
- We reconstruct the press and spectator comments-section URLs for each movie and/or series with the function `getCommentsUrl(...)`.
- From there, we scrape the rating data from the press and/or user page of each movie and/or series with `ScrapeURL(...)`.
- A rater can be either a person (user) or a press outlet. Separate files are generated for each type of rater, for both movies and series.

*📝Note 1: We use the popular BeautifulSoup package*

## **Functions :**

### `getCommentsUrl(movies_df, series_df, spect, press)` :

This function calls two sub-functions: `getMoviesCommentsUrl(movies_df, spect, press)` and `getSeriesCommentsUrl(series_df, spect, press)`. They retrieve the list of movie IDs (respectively series IDs) from the dataframes generated by `getMoviesUrl(start_page, end_page, nb_pages)` and `getSeriesUrl(start_page, end_page, nb_pages)` in the previous notebooks. We then use the IDs to reconstruct the comments-section URLs of the press and the spectators (users). For each video type, we can choose to get the comments section of the users, of the press, or both.
We then store the lists of URLs in a CSV file named `user_comments_urls.csv` (resp. `press_comments_urls.csv`), in both the movies and series directories (`../Movies/Comments/` and `../Series/Comments/`).
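
As a rough illustration of the URL reconstruction described above, the sketch below builds the four kinds of URLs from a list of IDs. The helper name `build_comment_urls`, the `PATTERNS` table and the sample ID are hypothetical; the URL patterns themselves are the ones used by `getMoviesCommentsUrl` and `getSeriesCommentsUrl` in this notebook.

```python
# Hypothetical helper (not part of the notebook): rebuild comments-section
# URLs from AlloCiné IDs, using the same URL patterns as the functions below.
BASE = "https://www.allocine.fr"

PATTERNS = {
    ("movie", "user"): BASE + "/film/fichefilm-{id}/critiques/spectateurs/membres-critiques/",
    ("movie", "press"): BASE + "/film/fichefilm-{id}/critiques/presse/",
    ("series", "user"): BASE + "/series/ficheserie-{id}/critiques/membres-critiques/",
    ("series", "press"): BASE + "/series/ficheserie-{id}/critiques/presse/",
}


def build_comment_urls(ids, video_type, section):
    """Return one comments-section URL per ID."""
    pattern = PATTERNS[(video_type, section)]
    return [pattern.format(id=i) for i in ids]


# Illustrative ID only
print(build_comment_urls([18529], "series", "user"))
```

Keeping the patterns in one table would let a single function serve both video types, at the cost of departing from the two-function layout used here.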

### `ScrapeURL(press_series_urls, press_movies_urls, user_series_urls, user_movies_urls, nb_users)` :

This function iterates over the lists of movies or series comments-section URLs generated by `getCommentsUrl(movies_df, series_df, spect, press)` and scrapes the press and user rating data for each video type, by calling `ScrapePressURL(series_urls, movies_urls)` and `ScrapeUserURL(series_urls, movies_urls)` respectively. The arguments passed to `ScrapeURL(...)` determine what is scraped: press ratings, user ratings, or both, for movies, series, or both.
We use the press comments urls for movies and series to get the press ratings with `getPressRatings(press_soup)`.
We use the user comments urls for movies and series to get the user ratings with `getUserRatings(user_rating_url, last_page, nb_users)`.

In the process, we extract, if available :
- `user_id`: AlloCiné user id (unavailable for the press)
- `(user/press)_name`: AlloCiné user/press name
- `(movie/series)_id`: AlloCiné movie/series id
- `(user/press)_rating`: AlloCiné user/press rating (from 0.5 to 5 stars ⭐⭐⭐⭐⭐)
- `date`: date of the rating (unavailable for the press)

*📝Note 2: Intermediate functions retrieve each individual feature of the dataframe, which makes debugging easier.*

The function `ScrapeURL(...)` returns two objects: a list containing the four generated ratings dataframes, in the order press series, press movies, user series, user movies, and a list with the corresponding errors (the URLs that could not be scraped), in the same order. A ratings entry is `None` when the matching list of URLs was not provided. In addition, the ratings are saved as `press_ratings_series.csv` and `user_ratings_series.csv` in `../Series/Ratings/` (respectively `press_ratings_movies.csv` and `user_ratings_movies.csv` in `../Movies/Ratings/`), and the errors as `press_ratings_errors.csv` and `user_ratings_errors.csv` in both directories.
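
Because the two returned lists are positional, it can help to attach names to their entries. The sketch below is one possible way to do so; `RESULT_NAMES` and `label_results` are hypothetical names, and the order assumes the one in which `ScrapeURL` appends its results (press series, press movies, user series, user movies). The values are stand-ins, not real dataframes.

```python
# Hypothetical helper: give names to the positional results of ScrapeURL(...).
# The order below follows the order in which ScrapeURL appends its results.
RESULT_NAMES = ("press_series", "press_movies", "user_series", "user_movies")


def label_results(ratings, errors):
    """Map each result name to its (ratings, errors) pair, skipping None ratings."""
    return {
        name: (rating, error)
        for name, rating, error in zip(RESULT_NAMES, ratings, errors)
        if rating is not None
    }


# Stand-in values; in the notebook these would be dataframes and URL lists.
ratings = [None, "press_movies_df", None, "user_movies_df"]
errors = [None, [], None, ["some_url"]]
labeled = label_results(ratings, errors)
print(sorted(labeled))
```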


(⚠️ **Warning** ⚠️: the process can take a long time, depending on the number of movies or series and the number of user pages you choose to scrape. Running it on a dedicated computer is recommended.)

---
## **Import libs**

In [1]:
# Import libs
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm_notebook as tqdm
from time import time
from time import sleep
import re
from datetime import timedelta, date
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup
import os
import dateparser

from warnings import warn, filterwarnings
from IPython.display import clear_output
import traceback


In [2]:
# TO BE REMOVED AFTER TESTING
response = urlopen("https://www.allocine.fr/series/ficheserie-18529/critiques/membres-critiques/")
html_text = response.read().decode("utf-8")
spect_html_soup = BeautifulSoup(html_text, 'html.parser')

---
## **Function:** Getting the comments section urls

### Get Movie Comments Section: `getMoviesCommentsUrl(movies_df, spect, press)`

In [3]:
def getMoviesCommentsUrl(movies_df: pd.DataFrame, spect: bool=False, press: bool=False) -> None:
    '''
    Get the comments section url for each movie
    You can select the spectateurs or presse section, or both.
    :param movies_df: DataFrame of movies
    :param spect: Boolean, True if you want to get the spectators section
    :param press: Boolean, True if you want to get the press section
    :return: nothing but saves urls in csv files
    '''    
    # Get the list of movies_id from the movies_df
    movies_id_list = movies_df['id'].tolist()

    # Set the list
    press_url_list = []
    user_url_list = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_spect_requests, p_press_requests = 0, 0
        
    for v_id in movies_id_list:
        if spect:
            # Url for the spectators/user section sorted by the descending number of reviews per user
            url_spect = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/spectateurs/membres-critiques/'    
            user_url_list.append(url_spect)
            p_spect_requests += 1
        if press:
            url_press = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/presse/'   
            press_url_list.append(url_press)    
            p_press_requests += 1

    # Saving the files
    comments_path = '../Movies/Comments/'
    # We create the folder if not exists
    os.makedirs(os.path.dirname(comments_path), exist_ok=True)
    if spect:
        print(f'--> Done; {p_spect_requests} Movies Users Comments Page Requests in {timedelta(seconds=time()-start_time)}')
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        print(f'--> Done; {p_press_requests} Movies Press Comments Page Requests in {timedelta(seconds=time()-start_time)}')
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

### Get Series Comments Section: `getSeriesCommentsUrl(series_df, spect, press)`

In [4]:
def getSeriesCommentsUrl(series_df: pd.DataFrame, spect: bool=False, press: bool=False) -> None:
    '''
    Get the comments section url for each series
    You can select the spectateurs or presse section, or both.
    :param series_df: DataFrame of series
    :param spect: Boolean, True if you want to get the spectators section
    :param press: Boolean, True if you want to get the press section
    :return: nothing but saves urls in csv files
    '''
    # Get the list of series_id from the series_df
    series_id_list = series_df['id'].tolist()

    # Set the list
    press_url_list = []
    user_url_list = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_spect_requests, p_press_requests = 0, 0
        
    for v_id in series_id_list:
        if spect:
            # Url for the spectators/user section sorted by the descending number of reviews per user
            url_spect = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/membres-critiques/'    
            user_url_list.append(url_spect)
            p_spect_requests += 1
        if press:
            url_press = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/presse/'   
            press_url_list.append(url_press)                         
            p_press_requests += 1

    # Saving the files
    comments_path = '../Series/Comments/'
    # We create the folder if not exists
    os.makedirs(os.path.dirname(comments_path), exist_ok=True)
    if spect:
        print(f'--> Done; {p_spect_requests} Series Users Comments Page Requests in {timedelta(seconds=time()-start_time)}')
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        print(f'--> Done; {p_press_requests} Series Press Comments Page Requests in {timedelta(seconds=time()-start_time)}')
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

### Main function: `getCommentsUrl(movies_df, series_df, spect, press)`

In [5]:
def getCommentsUrl(movies_df: pd.DataFrame=None, series_df: pd.DataFrame=None, spect: bool=False, press: bool=False) -> None:
    '''
    Get the comments section url for each movie and series
    You can select the spectateurs or presse section, or both.
    :param movies_df: DataFrame of movies
    :param series_df: DataFrame of series
    :param spect: Boolean, True if you want to get the spectateurs section
    :param press: Boolean, True if you want to get the presse section
    :return: nothing but saves urls in a csv file
    '''
    try:
        if movies_df is not None:
            getMoviesCommentsUrl(movies_df, spect, press)
        if series_df is not None:
            getSeriesCommentsUrl(series_df, spect, press)
    except:
        print('Error in getCommentsUrl function!')
        traceback.print_exc()

## **Functions:** Getting ratings infos

### Get ID: `get_ID(url)`

In [6]:
def get_ID(url: str) -> int:
    '''
    Get the movie or series ID from a url.
    :param url: url containing the movie or series ID
    :return: The movie or series ID.
    '''
    # Keep only the digits of the url; the ID is assumed to be the only number in it
    video_id = int(re.sub(r"\D", "", url))
    return video_id
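
Stripping every non-digit works as long as the ID is the only number in the URL. If a URL also carried a page number (for instance a `?page=2` suffix), the digits would be merged. The following is a minimal sketch of a stricter variant; `extract_id` and `ID_PATTERN` are hypothetical names, and the pattern assumes the `fichefilm-<id>` / `ficheserie-<id>` URL shapes used in this notebook.

```python
import re

# Hypothetical variant of get_ID: match the ID right after "fichefilm-" or
# "ficheserie-" so that other digits in the URL (e.g. "?page=2") are ignored.
ID_PATTERN = re.compile(r"fiche(?:film|serie)-(\d+)")


def extract_id(url: str) -> int:
    match = ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No movie or series ID found in: {url}")
    return int(match.group(1))


print(extract_id("https://www.allocine.fr/series/ficheserie-18529/critiques/membres-critiques/?page=2"))
```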

### Get url: `get_url(url)`

In [7]:
def get_url(url: str):
    '''
    Try to get and open the page url.
    :param url: url to get the content from.
    :return: the content of the url
    '''
    page = ''
    while page == '':
        try:
            page = urlopen(url)
            break
        except:
            print(f"Error occurred when opening the url.\nURL: {url}")
            print("\nConnection refused by the server...")
            print("Let's wait for 5 seconds")
            print("ZZzzzz...")
            # `time` is the function imported from the time module, so call sleep() directly
            sleep(5)
            print("Ok, now let's try again...\n")
            continue
    return page
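
Note that `get_url` keeps retrying for as long as the server refuses the connection. A bounded variant is sketched below under stated assumptions: `retry`, `max_attempts`, `base_delay` and `wait` are hypothetical names, the delay doubles after each failure, and `flaky` is a stand-in for `urlopen(url)` so that the example runs without network access.

```python
from time import sleep


def retry(fetch, max_attempts=3, base_delay=5.0, wait=sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The delay doubles after each failure (base_delay, 2*base_delay, ...).
    The last exception is re-raised when every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise
            wait(base_delay * 2 ** (attempt - 1))


# Usage with a stand-in for urlopen(url) that fails twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection refused")
    return "page content"


result = retry(flaky, wait=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the notebook, the call would look like `retry(lambda: urlopen(url))`, and a URL that still fails after the last attempt would end up in the errors list of the calling function.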

### Convert French months to English months: `convert_month(month)`

In [8]:
def convert_month(month: str) -> str:
    '''
    Convert French months into English months for the dateparser to work.
    :param month: French month.
    :return: English month.'''
    if month == "janvier":
        return "January"
    elif month == "février":
        return "February"
    elif month == "mars":
        return "March"
    elif month == "avril":
        return "April"
    elif month == "mai":
        return "May"
    elif month == "juin":
        return "June"
    elif month == "juillet":
        return "July"
    elif month == "août":
        return "August"
    elif month == "septembre":
        return "September"
    elif month == "octobre":
        return "October"
    elif month == "novembre":
        return "November"
    elif month == "décembre":
        return "December"
    else:
        return "Unknown"
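
The same conversion can be written as a dictionary lookup. The sketch below goes one step further and parses the whole date with the standard library only, assuming the text always has the form `<day> <French month> <year>` once the `Publiée le ` prefix has been removed (as `getUserRatings` expects). `FRENCH_MONTHS` and `parse_french_date` are hypothetical names.

```python
from datetime import date

# French month names mapped to month numbers (same months as convert_month).
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}


def parse_french_date(text: str) -> date:
    """Parse a date written as '<day> <French month> <year>'."""
    day, month, year = text.strip().split(" ")
    return date(int(year), FRENCH_MONTHS[month.lower()], int(day))


print(parse_french_date("12 mars 2021"))
```

An unknown month raises a `KeyError` here, whereas `convert_month` returns `"Unknown"`; which behavior is preferable depends on how errors should surface in the scraping loop.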

### Convert text to rating: `convert_text_to_rating(text)`

In [9]:
def convert_text_to_rating(text: str) -> float:
    '''
    Convert the press evaluation text into a rating.
    :param text: Evaluation of the movie as a string 
    :return: corresponding float rating value
    '''
    if text=="Nul":
        return 0.5
    elif text=="Très mauvais":
        return 1.0
    elif text=="Mauvais":
        return 1.5
    elif text=="Pas terrible":
        return 2.0
    elif text=="Moyen":
        return 2.5  
    elif text=="Pas mal":
        return 3.0
    elif text=="Bien":
        return 3.5
    elif text=="Très bien":
        return 4.0
    elif text=="Excellent":
        return 4.5
    elif text=="Chef-d'oeuvre":
        return 5.0
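
Since the labels form a regular scale in half-star steps, the mapping can also be derived from an ordered list. This is only a sketch of that idea: `LABELS`, `TEXT_TO_RATING` and `text_to_rating` are hypothetical names, and the labels are copied from `convert_text_to_rating`.

```python
# Labels ordered from worst to best, as in convert_text_to_rating.
LABELS = ["Nul", "Très mauvais", "Mauvais", "Pas terrible", "Moyen",
          "Pas mal", "Bien", "Très bien", "Excellent", "Chef-d'oeuvre"]

# Each step up the scale adds half a star: 0.5, 1.0, ..., 5.0
TEXT_TO_RATING = {label: 0.5 * (rank + 1) for rank, label in enumerate(LABELS)}


def text_to_rating(text: str):
    """Return the rating for a label, or None when the label is unknown."""
    return TEXT_TO_RATING.get(text.strip())


print(text_to_rating("Très bien"))
```

As in the original function, an unrecognized label yields `None` rather than raising an error.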

### Get press-ratings: `getPressRatings(press_soup)`

In [10]:
def getPressRatings(press_soup: BeautifulSoup) -> dict[str, float]:
    '''
    Get press_ratings from the comments section url for each movie or series.
    :param press_soup: BeautifulSoup object of the press comments section url
    :return: a dictionary of ratings (key: press name, value: press rating) if the ratings are available, else None.
    '''
    press_ratings = {}
    div_ratings = press_soup.find_all('li', {'class': 'item'})
    if div_ratings:
        press_ratings = {div.text.strip():convert_text_to_rating(div.find('span')['title']) for div in div_ratings}
        return press_ratings
    return None    

### Get user-ratings: `getUserRatings(user_rating_url, last_page, nb_users)`

In [11]:
def getUserRatings(user_rating_url: str, last_page: int=1, nb_users: int=5) -> dict[tuple[str, str], tuple[float, date]]:
    '''
    Get user_ratings from the comments section url for each movie or series.
    We will keep only the nb_users first users who have posted the most reviews.
    :param user_rating_url : url of the user ratings section.
    :param last_page: last page of the user ratings section (default 1).
    :param nb_users: Number of users to keep (default 5).
    :return: a dictionary of ratings {key=(user_id, user_name): value=(user rating, date)}; it is empty if no ratings are available.
    '''
    user_ratings = {}
    for n_page in range(1,last_page+1):
        # If the number of users is reached, stop the loop.
        # Else, get the next page of the user ratings section until the last page is reached.
        if len(user_ratings) == nb_users:       
            break

        user_rating_url_page = user_rating_url + f'?page={n_page}'        
        try :
            response = get_url(user_rating_url_page)
            # Pause the loop
            sleep(randint(1,2))
            # Parse the content of the request with BeautifulSoup
            html_text = response.read().decode("utf-8")
            user_soup = BeautifulSoup(html_text, 'html.parser')
            # Collect only the reviews infos from the page in a list
            div_ratings = user_soup.find_all('div', {'class': 'review-card'})
            if div_ratings:
                for item in div_ratings:
                    # If the number of users is reached, stop the loop.
                    if len(user_ratings) == nb_users:       
                        break   
                    # If the user is not a visitor...
                    if item.find('span', {'class': 'item-profil'}):
                        # ...Get the user id, name, rating and date of the rating   
                        user_id = item.find('span', {'class': 'item-profil'})['data-targetuserid']
                        user_name = item.span['title']
                        user_rating = float(re.sub(",", ".", item.find("span", {"class": "stareval-note"}).text))
                        user_date = item.find('span', {'class': 'review-card-meta-date'}).text.strip().replace('Publiée le ', '')
                        # Convert the date from French to English
                        month = user_date.split(' ')[1]
                        user_date = dateparser.parse(user_date.replace(month, convert_month(month)), date_formats=["%d %B %Y"]).date()
                        # Append to the dictionary
                        user_ratings[(user_id, user_name)] = (user_rating, user_date)
        except:
            print(f'Error in getUserRatings function!')
            traceback.print_exc()
    return user_ratings
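
The core of `getUserRatings` is an early-stopping loop: pages are read in order and reading stops as soon as `nb_users` users have been collected. The sketch below isolates that logic without any network access. `iter_reviews`, `first_users` and `fake_pages` are hypothetical names, and the reviews are made-up stand-ins for what would be parsed from each `?page=n` URL.

```python
def iter_reviews(pages):
    """Yield reviews one by one from an iterable of pages (lists of reviews)."""
    for page in pages:
        yield from page


def first_users(pages, nb_users=5):
    """Keep the first nb_users distinct users, in page order."""
    ratings = {}
    if nb_users <= 0:
        return ratings
    for user_id, user_name, rating in iter_reviews(pages):
        ratings[(user_id, user_name)] = rating
        # Stop right after the quota is reached so no further page is read
        if len(ratings) >= nb_users:
            break
    return ratings


fetched = []


def fake_pages():
    # Stand-in for downloading and parsing each "?page=n" URL
    data = [[("1", "alice", 4.0), ("2", "bob", 3.5)],
            [("3", "carol", 5.0), ("4", "dave", 2.0)],
            [("5", "erin", 1.5)]]
    for n, page in enumerate(data, start=1):
        fetched.append(n)
        yield page


top = first_users(fake_pages(), nb_users=3)
print(top, fetched)
```

Because the pages come from a generator, the third page is never requested once three users have been collected.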

## **Function:** Scraping the ratings

### Scraping Press URLs: `ScrapePressURL(series_urls, movies_urls)`

In [12]:
def ScrapePressURL(series_url: list=None, movies_url: list=None) -> tuple[pd.DataFrame,pd.DataFrame,pd.DataFrame,pd.DataFrame]:   
    '''
    Scrape the press_ratings from the series and/or movies press comments section urls (if available).
    :param series_url: list of series press comments section urls.
    :param movies_url: list of movies press comments section urls.
    :return: dataframes of series and movies press ratings and the errors.
    '''
    # init the dataframes    
    df_series, df_movies, errors_series, errors_movies = None, None, None, None
    if series_url is None and movies_url is None:
        print('No url to scrape!')
        return df_series, df_movies, errors_series, errors_movies
    if series_url is not None:
        c = ["press_name",
        "series_id",
        "press_rating",]
        df_series = pd.DataFrame(columns=c)
    if movies_url is not None:
        c = ["press_name",
        "movie_id",
        "press_rating",]
        df_movies = pd.DataFrame(columns=c)
    
    # preparing the setting and monitoring loop
    start_time = time()
    ns_request = 0
    nm_request = 0
    
    # init list to save errors
    errors_series = []
    errors_movies = []
    
    # -----------------------------------
    # Scraping the series press ratings
    # -----------------------------------
    if series_url is not None:
        # request loop
        clear_output(wait=True)
        print("------ Scraping series press ratings ------\n")
        # Monitoring with tqdm_notebook() progress bar
        for url in tqdm(series_url,desc='Fetching Series Press Ratings'):
            try :
                response = get_url(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                ns_request += 1
                
                elapsed_time = time() - start_time
                # print(f'Series Press Request #{ns_request}; Frequency: {ns_request/elapsed_time} requests/s')
                # clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Series Press Request #{}; Status code: {}'.format(ns_request, response.status))
                    errors_series.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                press_html_soup = BeautifulSoup(html_text, 'html.parser')

                # Get the series_id from the url
                series_id = get_ID(url)            
                # Get the press_ratings
                press_ratings = getPressRatings(press_html_soup)
                for id, rating in press_ratings.items():               
                    # Append the data
                    df_series_tmp = pd.DataFrame({'press_name': [id],
                                        'series_id': [series_id],
                                        'press_rating': [rating]
                                        })                                    
                    df_series = pd.concat([df_series, df_series_tmp], ignore_index=True)                
            except:
                errors_series.append(url)
                warn(f'Series Press Request #{ns_request} fail; Press rating does not exist! Total errors : {len(errors_series)}')
                #traceback.print_exc()

        # Monitoring 
        series_path = '../Series/Ratings/'
        # We create the folder if not exists
        os.makedirs(os.path.dirname(series_path), exist_ok=True)
        elapsed_time = time() - start_time
        print(f'--> Done; {ns_request} Series Press Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_series)} errors (series with no press ratings)')
        # Saving files
        df_series.to_csv(f"{series_path}press_ratings_series.csv", index=False)
        # list to dataframe
        errors_df_series = pd.DataFrame(errors_series, columns=['url'])
        errors_df_series.to_csv(f"{series_path}press_ratings_errors.csv",index=False,header=False)

    # -----------------------------------
    # Scraping the movies press ratings
    # -----------------------------------
    if movies_url is not None:
        # request loop
        clear_output(wait=True)
        print("------ Scraping movies press ratings ------\n")
        # Monitoring with tqdm_notebook() progress bar
        for url in tqdm(movies_url,desc='Fetching Movies Press Ratings'):
            try :
                response = get_url(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                nm_request += 1
                
                elapsed_time = time() - start_time
                #print(f'Movie Press Request #{nm_request}; Frequency: {nm_request/elapsed_time} requests/s')
                #clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Movie Press Request #{}; Status code: {}'.format(nm_request, response.status))
                    errors_movies.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                press_html_soup = BeautifulSoup(html_text, 'html.parser')

                # Get the movie_id from the url
                movie_id = get_ID(url)           
                # Get the press_ratings
                press_ratings = getPressRatings(press_html_soup)
                for id, rating in press_ratings.items():               
                    # Append the data
                    df_movies_tmp = pd.DataFrame({'press_name': [id],
                                        'movie_id': [movie_id],
                                        'press_rating': [rating]
                                        })                                    
                    df_movies = pd.concat([df_movies, df_movies_tmp], ignore_index=True)                
            except:
                errors_movies.append(url)
                warn(f'Movie Press Request #{nm_request} fail; Press rating does not exist! Total errors : {len(errors_movies)}')
                #traceback.print_exc()
            
        # Monitoring
        movies_path = '../Movies/Ratings/'
        # We create the folder if not exists
        os.makedirs(os.path.dirname(movies_path), exist_ok=True)
        elapsed_time = time() - start_time
        print(f'--> Done; {nm_request} Movies Press Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_movies)} errors (movies with no press ratings)')
        # Saving files
        df_movies.to_csv(f"{movies_path}press_ratings_movies.csv", index=False)
        # list to dataframe
        errors_df_movies = pd.DataFrame(errors_movies, columns=['url'])
        errors_df_movies.to_csv(f"{movies_path}press_ratings_errors.csv",index=False,header=False)
    
    # return dataframe and errors
    return df_series, df_movies, errors_series, errors_movies
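
Calling `pd.concat` once per rating builds a new dataframe each time. A common alternative is to collect plain row dictionaries in a list and build the dataframe once after the loop. The sketch below shows only the row-building part, in pure Python; `press_rows` is a hypothetical helper and the press names and ratings are made up. Unlike the function above, it treats a missing ratings dictionary as "no rows" instead of recording an error.

```python
def press_rows(video_id, press_ratings, id_column="movie_id"):
    """Turn {press_name: rating} into a list of row dictionaries."""
    return [
        {"press_name": name, id_column: video_id, "press_rating": rating}
        for name, rating in (press_ratings or {}).items()
    ]


# Stand-in data (the press names and ratings are made up)
rows = []
rows += press_rows(101, {"Press A": 4.0, "Press B": 2.5})
rows += press_rows(102, None)  # a page without press ratings adds nothing
print(rows)
# In the notebook, a single pd.DataFrame(rows) after the loop would then
# replace the repeated pd.concat calls.
```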

### Scraping User URLs: `ScrapeUserURL(series_urls, movies_urls)`

In [13]:
def ScrapeUserURL(series_url: list=None, movies_url: list=None, nb_users: int=5) -> tuple[pd.DataFrame,pd.DataFrame,pd.DataFrame,pd.DataFrame]:
    '''
    Scrape the user_ratings from the series and/or movies user comments section urls (if available).
    :param series_url: list of series user comments urls
    :param movies_url: list of movies user comments urls
    :param nb_users: number of users to scrape (default: 5)
    :return: dataframes of series and movies user ratings and the errors.  
    '''
    # init the dataframes    
    df_series, df_movies, errors_series, errors_movies = None, None, None, None
    if series_url is None and movies_url is None:
        print('No url to scrape!')
        return df_series, df_movies, errors_series, errors_movies
    if series_url is not None:
        c = ["user_id",
        "user_name",
        "series_id",
        "user_rating",
        "date",
    ]
        df_series = pd.DataFrame(columns=c)
    if movies_url is not None:
        c = ["user_id",
        "user_name",
        "movie_id",
        "user_rating",
        "date",
    ]
        df_movies = pd.DataFrame(columns=c)
    
    # preparing the setting and monitoring loop
    start_time = time()
    ns_request = 0
    nm_request = 0
    
    # init list to save errors
    errors_series = []
    errors_movies = []
    
    # -----------------------------------
    # Scraping the series user ratings
    # -----------------------------------
    if series_url is not None:
        # request loop
        clear_output(wait=True)
        print("------ Scraping series user ratings ------\n")
        # Monitoring with tqdm_notebook() progress bar
        for url in tqdm(series_url,desc='Fetching Series User Ratings'):
            try :
                response = get_url(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                ns_request += 1
                
                elapsed_time = time() - start_time
                #print(f'Series User Request #{ns_request}; Frequency: {ns_request/elapsed_time} requests/s')
                #clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Series User Request #{}; Status code: {}'.format(ns_request, response.status))
                    errors_series.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                user_html_soup = BeautifulSoup(html_text, 'html.parser')

                # Get the series_id from the url
                series_id = get_ID(url)
                # Get the url last page if exists. Default 1.
                last_page = int(user_html_soup.find_all('span', {'class': 'button-md'})[-1].text) if user_html_soup.find_all('span', {'class': 'button-md'}) else 1
                # Get the user_ratings
                user_ratings = getUserRatings(url,last_page,nb_users=nb_users)
                
                for id, rating in user_ratings.items():               
                    # Append the data
                    df_series_tmp = pd.DataFrame({'user_id': [id[0]],
                                        'user_name': [id[1]],
                                        'series_id': [series_id],
                                        'user_rating': [rating[0]],
                                        'date': [rating[1]]
                                        })                                    
                    df_series = pd.concat([df_series, df_series_tmp], ignore_index=True)                
            except:
                errors_series.append(url)
                warn(f'Series User Request #{ns_request} fail; User rating does not exist! Total errors : {len(errors_series)}')
                #traceback.print_exc()
        
        # Monitoring
        series_path = '../Series/Ratings/'
        # We create the folder if not exists
        os.makedirs(os.path.dirname(series_path), exist_ok=True)
        elapsed_time = time() - start_time
        print(f'--> Done; {ns_request} Series User Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_series)} errors (series with no user ratings)')
        # Saving files
        df_series.to_csv(f"{series_path}user_ratings_series.csv", index=False)
        # list to dataframe
        errors_df_series = pd.DataFrame(errors_series, columns=['url'])
        errors_df_series.to_csv(f"{series_path}user_ratings_errors.csv",index=False,header=False)

    # -----------------------------------
    # Scraping the movies user ratings
    # -----------------------------------
    if movies_url is not None:
        # request loop
        clear_output(wait=True)
        print("------ Scraping movies user ratings ------\n")
        # Monitoring with tqdm_notebook() progress bar
        for url in tqdm(movies_url,desc='Fetching Movies User Ratings'):
            try :
                response = get_url(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                nm_request += 1
                
                elapsed_time = time() - start_time
                #print(f'Movie User Request #{nm_request}; Frequency: {nm_request/elapsed_time} requests/s')
                #clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Movie User Request #{}; Status code: {}'.format(nm_request, response.status))
                    errors_movies.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                user_html_soup = BeautifulSoup(html_text, 'html.parser')

                # Get the movie_id from the url
                movie_id = get_ID(url)          
                # Get the url last page if exists. Default 1.
                last_page = int(user_html_soup.find_all('span', {'class': 'button-md'})[-1].text) if user_html_soup.find_all('span', {'class': 'button-md'}) else 1
                # Get the user_ratings
                user_ratings = getUserRatings(url,last_page,nb_users=nb_users)
                
                for id, rating in user_ratings.items():               
                    # Append the data
                    df_movies_tmp = pd.DataFrame({'user_id': [id[0]],
                                        'user_name': [id[1]],
                                        'movie_id': [movie_id],
                                        'user_rating': [rating[0]],
                                        'date': [rating[1]]
                                        })                                    
                    df_movies = pd.concat([df_movies, df_movies_tmp], ignore_index=True)                             
            except:
                errors_movies.append(url)
                warn(f'Movie User Request #{nm_request} fail; User rating does not exist! Total errors : {len(errors_movies)}')
                #traceback.print_exc()
            
        # Monitoring
        movies_path = '../Movies/Ratings/'
        # We create the folder if not exists
        os.makedirs(os.path.dirname(movies_path), exist_ok=True) 
        elapsed_time = time() - start_time
        print(f'--> Done; {nm_request} Movies User Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_movies)} errors (movies with no user ratings)')
        # Saving files
        df_movies.to_csv(f"{movies_path}user_ratings_movies.csv", index=False)
        # list to dataframe
        errors_df_movies = pd.DataFrame(errors_movies, columns=['url'])
        errors_df_movies.to_csv(f"{movies_path}user_ratings_errors.csv",index=False,header=False)
    
    # return dataframe and errors
    return df_series, df_movies, errors_series, errors_movies

### Main function: `ScrapeURL(press_series_urls, press_movies_urls, user_series_urls, user_movies_urls, nb_users)`

In [14]:
def ScrapeURL(press_series_urls: list=None, press_movies_urls: list=None, user_series_urls: list=None, user_movies_urls: list=None, nb_users: int=5) -> tuple[list,list]:
    '''
    Scrape the user and press ratings from the movie and series pages.
    :param press_series_urls: list of press series urls
    :param press_movies_urls: list of press movies urls
    :param user_series_urls: list of user series urls
    :param user_movies_urls: list of user movies urls
    :param nb_users: value forwarded to ScrapeUserURL to control how many users are scraped (default 5)
    :return: 2 lists, one for the ratings and one for the errors, both ordered as
    [press series, press movies, user series, user movies].
    '''
    # Ignore the time zone warning raised during date parsing
    filterwarnings("ignore",message="The localize method is no longer necessary, as this time zone supports the fold attribute")

    # Monitoring the press scraping
    start_time_press = time()    
    # Init the ratings and errors lists
    ratings,errors = [],[]
    # Scraping the press ratings
    press_series_ratings, press_movies_ratings, press_errors_series, press_errors_movies = ScrapePressURL(press_series_urls, press_movies_urls)
    elapsed_time_press = time() - start_time_press
    delta_press = timedelta(seconds=elapsed_time_press)
    ratings.append(press_series_ratings)
    ratings.append(press_movies_ratings)
    errors.append(press_errors_series)
    errors.append(press_errors_movies)   
    # Monitoring the users scraping
    start_time_users = time() 
    # Scraping the user ratings
    user_series_ratings, user_movies_ratings, user_errors_series, user_errors_movies = ScrapeUserURL(user_series_urls, user_movies_urls, nb_users=nb_users)
    elapsed_time_users = time() - start_time_users
    delta_users = timedelta(seconds=elapsed_time_users)
    ratings.append(user_series_ratings)
    ratings.append(user_movies_ratings)
    errors.append(user_errors_series)
    errors.append(user_errors_movies)
    # Monitoring the results (len(...) counts the collected rating rows, not the HTTP requests)
    print("\n    --- RECAP ---")
    if press_series_ratings is not None and press_errors_series is not None:
        print(f'--> Done; {len(press_series_ratings)} Series Press Ratings requests in {delta_press} with {len(press_errors_series)} errors (series with no press ratings)')
    if press_movies_ratings is not None and press_errors_movies is not None:
        print(f'--> Done; {len(press_movies_ratings)} Movies Press Ratings requests in {delta_press} with {len(press_errors_movies)} errors (movies with no press ratings)')
    if user_series_ratings is not None and user_errors_series is not None:
        print(f'--> Done; {len(user_series_ratings)} Series User Ratings requests in {delta_users} with {len(user_errors_series)} errors (series with no user ratings)')
    if user_movies_ratings is not None and user_errors_movies is not None:
        print(f'--> Done; {len(user_movies_ratings)} Movies User Ratings requests in {delta_users} with {len(user_errors_movies)} errors (movies with no user ratings)')
    return ratings, errors
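
Since `ScrapeURL` returns positional lists, a caller has to remember that index 0 holds the press series, 1 the press movies, 2 the user series and 3 the user movies. Below is a minimal sketch of a hypothetical helper (`label_results`, not part of the notebook) that names the entries and drops the empty ones. It assumes that skipped categories come back as `None`, as the `is not None` checks in the recap suggest.

```python
# Order in which ScrapeURL appends its results
RESULT_KEYS = ("press_series", "press_movies", "user_series", "user_movies")

def label_results(results: list) -> dict:
    """Map a positional ScrapeURL result list to named entries, skipping None."""
    if len(results) != len(RESULT_KEYS):
        raise ValueError(f"expected {len(RESULT_KEYS)} entries, got {len(results)}")
    return {key: value for key, value in zip(RESULT_KEYS, results) if value is not None}

# Hypothetical stand-in for the `ratings` list when only movie user ratings were scraped
example = [None, None, None, ["rating row 1", "rating row 2"]]
labelled = label_results(example)
print(list(labelled))
```

The same helper applies to the `errors` list, since both lists share the same order.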

---
## **Launching the script**

### Loading the movies and series dataframes

In [15]:
# Load the movies and series dataframes
def loadDataFrames(nrows: int=None) -> tuple[pd.DataFrame,pd.DataFrame]:
    '''
    Load the movies and series dataframes
    :param nrows: number of rows to load
    :return: movies and series dataframes
    '''
    movies_df = pd.read_csv('../Movies/Data/allocine_movies_100p.csv', nrows=nrows)
    series_df = pd.read_csv('../Series/Data/allocine_series_100p.csv', nrows=nrows)
    return movies_df, series_df

# Load every row of both files (nrows=None)
movies_df, series_df = loadDataFrames()

### Getting the comments section urls from spectators and press for movies and series

In [16]:
getCommentsUrl(movies_df=movies_df, series_df=series_df, spect=True, press=True)

--> Done; 1314 Movies Users Comments Page Requests in 0:00:00.001252
--> Done; 1314 Movies Press Comments Page Requests in 0:00:00.006182
--> Done; 1417 Series Users Comments Page Requests in 0:00:00.001018
--> Done; 1417 Series Press Comments Page Requests in 0:00:00.011786


### Loading the comments section urls

In [17]:
def loadingURL(movies: bool=False, series: bool=False, spect: bool=False, press: bool=False) -> tuple[list,list,list,list]:
    '''
    Load the urls of the movies and series pages.
    :param movies: Boolean, if True, load the movies urls
    :param series: Boolean, if True, load the series urls
    :param spect: Boolean, if True, load the spectator urls for movies and/or series
    :param press: Boolean, if True, load the press urls for movies and/or series
    :return: urls lists of the movies and series press and user rating pages    
    '''
    user_series_urls, user_movies_urls, press_series_urls, press_movies_urls = None, None, None, None
    if movies:
        if spect:
            user_movies_urls = pd.read_csv("../Movies/Comments/user_comments_urls.csv",names=['url'])['url'].tolist()
            print(f"--> {len(user_movies_urls)} Movies Spectators URLs successfully loaded!")
        if press:
            press_movies_urls = pd.read_csv("../Movies/Comments/press_comments_urls.csv",names=['url'])['url'].tolist()
            print(f"--> {len(press_movies_urls)} Movies Press URLs successfully loaded!")
    if series:
        if spect:
            user_series_urls = pd.read_csv("../Series/Comments/user_comments_urls.csv",names=['url'])['url'].tolist()
            print(f"--> {len(user_series_urls)} Series Spectators URLs successfully loaded!")
        if press:
            press_series_urls = pd.read_csv("../Series/Comments/press_comments_urls.csv",names=['url'])['url'].tolist()
            print(f"--> {len(press_series_urls)} Series Press URLs successfully loaded!")
    return user_series_urls, user_movies_urls, press_series_urls, press_movies_urls

# Only the movies spectator urls are loaded here; the three other lists stay None
user_series_urls, user_movies_urls, press_series_urls, press_movies_urls = loadingURL(movies=True, series=False, spect=True, press=False)

--> 1314 Movies Spectators URLs successfully loaded!


### Scraping the data

In [18]:
# Scraping the user and/or press ratings from series and/or movies
ratings, errors = ScrapeURL(press_series_urls=press_series_urls, 
                            press_movies_urls=press_movies_urls, 
                            user_series_urls=user_series_urls, 
                            user_movies_urls=user_movies_urls,
                            nb_users=100,
                            )
press_series_ratings = ratings[0]
press_movies_ratings = ratings[1]
user_series_ratings = ratings[2]
user_movies_ratings = ratings[3]

------ Scraping movies user ratings ------



--> Done; 1314 Movies User Ratings requests in 6:12:28.403621 with 0 errors (movies with no user ratings)

    --- RECAP ---
--> Done; 105711 Movies User Ratings requests in 6:12:28.812086 with 0 errors (movies with no user ratings)
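
The recap line reports the number of rating rows (105711), while the line before it reports the number of requests (1314), so it can be useful to inspect the saved file before using it. Below is a sketch of a quick check, assuming the columns written by `ScrapeUserURL` (`user_id`, `user_name`, `movie_id`, `user_rating`, `date`). The helper `summarize_ratings` and the sample rows are hypothetical.

```python
import pandas as pd

def summarize_ratings(df: pd.DataFrame) -> dict:
    """Return basic counts for a user ratings dataframe."""
    per_movie = df.groupby("movie_id").size()
    return {
        "n_ratings": len(df),
        "n_movies": len(per_movie),
        "max_ratings_per_movie": int(per_movie.max()) if len(per_movie) else 0,
        # Rows repeating the same (user, movie) pair
        "n_duplicates": int(df.duplicated(subset=["user_id", "movie_id"]).sum()),
    }

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "user_id": ["Z1", "Z2", "Z1", "Z1"],
    "user_name": ["alice", "bob", "alice", "alice"],
    "movie_id": [10, 10, 20, 20],
    "user_rating": [4.0, 3.5, 5.0, 5.0],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-03"],
})
summary = summarize_ratings(sample)
print(summary)
```

On the real data, the same function could be applied to `pd.read_csv('../Movies/Ratings/user_ratings_movies.csv')`.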
