# Web Scraping User Ratings From AlloCiné.fr

This script builds DataFrames by scraping AlloCiné. Its purpose is to retrieve the user and press ratings of movies and series from the AlloCiné website. We proceed as follows:
- We use the series and movies url lists generated in the other scripts to get the urls of the press and spectator (user) comments sections with `getCommentsUrl()`.
- We use the press comments urls for movies and series to get the press ratings with `getPressRatings()`.
- We use the user comments urls for movies and series to get the user ratings with `getUserRatings()`.
- From there, we scrape the user ID and the rating given to each movie or series with `ScrapeURL()`.
- A "user" can be either a person or a press outlet. Separate files are generated for each type of user.

*Note: We use the popular BeautifulSoup package.*

## Functions

### `getCommentsUrl(movies_df, series_df, spect, press)`

This function calls two sub-functions: `getMoviesCommentsUrl(movies_df, spect, press)` and `getSeriesCommentsUrl(series_df, spect, press)`. They iterate over the movies and series collected in the previous scripts (starting from the url lists generated by `getMoviesUrl()`) and build the comments section url of each one. We can choose to get the comments section from the users, from the press, or both, for each video type.
We then store the lists of urls in a csv file named `user_comments_urls.csv` (resp. `press_comments_urls.csv`), in both the movies and series directories (`../Movies/Comments/` and `../Series/Comments/`).

### `ScrapeURL(urls)`

Iterate over the list of movies or series comments section urls generated by `getCommentsUrl()` and scrape the ratings of each movie or series. In the process, we extract:

- `user_id`: AlloCiné user id (person or press outlet)
- `id`: AlloCiné movie or series id
- `user_rating`: rating given by that user or press outlet (from 0.5 to 5 stars)


`ScrapeURL()` returns the user ratings and the press ratings as two DataFrames. Both are also saved to disk, as `user_ratings_movies.csv` and `press_ratings_movies.csv` (respectively `user_ratings_series.csv` and `press_ratings_series.csv`).

-------------------------------------------------------------------
## Import libs

In [10]:
# Import libs
import pandas as pd
import numpy as np
from time import time
from time import sleep
from datetime import timedelta
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup
import os

from warnings import warn
from IPython.display import clear_output  # IPython.core.display is the deprecated import path
import traceback



## Functions

In [20]:
# TO BE REMOVED AFTER TESTING
response = urlopen("https://www.allocine.fr/film/fichefilm-260627/critiques/spectateurs/membres-critiques/")
html_text = response.read().decode("utf-8")
press_html_soup = BeautifulSoup(html_text, 'html.parser')
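
The cell above fetches a single page with `urlopen()` and no error handling. Note that `urlopen()` raises `urllib.error.HTTPError` for HTTP error statuses rather than returning a response whose status could be checked afterwards. The following is a minimal sketch of a fetch helper that accounts for this; the name `fetch_html` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
from typing import Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except HTTPError as err:
        # Raised by urlopen for HTTP error statuses such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {err.code} for {url}")
    except URLError as err:
        # Raised for connection-level problems (DNS failure, refused connection, ...).
        print(f"Request failed for {url}: {err.reason}")
    return None
```

A caller can then skip a page when the helper returns `None` and record the url in its error list, instead of wrapping the whole loop body in a broad `try` block.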

### Function: `getCommentsUrl(movies_df, series_df, spect, press)`

#### Function: `getMoviesCommentsUrl(movies_df, spect, press)`

In [12]:
def getMoviesCommentsUrl(movies_df: pd.DataFrame, spect=False, press=False):
    '''
    Get the comments section url for each movie
    You can select the spectateurs or presse section, or both.
    :param movies_df: DataFrame of movies
    :param spect: Boolean, True if you want to get the spectateurs section
    :param press: Boolean, True if you want to get the presse section
    :return: nothing but saves urls in a csv file
    '''    
    # Get the list of movies_id from the movies_df
    movies_id_list = movies_df['id'].tolist()

    # Set the list
    press_url_list = []
    user_url_list = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = 1
        
    for v_id in movies_id_list:
        if spect:
            # Url for the spectators/user section sorted by the descending number of reviews per user
            url_spect = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/spectateurs/membres-critiques/'    
            user_url_list.append(url_spect)
        if press:
            url_press = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/presse/'   
            press_url_list.append(url_press)                         
        p_requests += 1

    # Saving the files
    comments_path = '../Movies/Comments/'
    os.makedirs(os.path.dirname(comments_path), exist_ok=True) #create folders if not exists
    print(f'--> Done; {p_requests-1} Movies Comments Page Requests in {timedelta(seconds=time()-start_time)}')
    if spect:
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

#### Function: `getSeriesCommentsUrl(series_df, spect, press)`

In [13]:
def getSeriesCommentsUrl(series_df: pd.DataFrame, spect=False, press=False):
    '''
    Get the comments section url for each series
    You can select the spectateurs or presse section, or both.
    :param series_df: DataFrame of series
    :param spect: Boolean, True if you want to get the spectateurs section
    :param press: Boolean, True if you want to get the presse section
    :return: nothing but saves urls in a csv file
    '''
    # Get the list of series_id from the series_df
    series_id_list = series_df['id'].tolist()

    # Set the list
    press_url_list = []
    user_url_list = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = 1
        
    for v_id in series_id_list:
        if spect:
            # Url for the spectators/user section sorted by the descending number of reviews per user
            url_spect = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/membres-critiques/'    
            user_url_list.append(url_spect)
        if press:
            url_press = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/presse/'   
            press_url_list.append(url_press)                         
        p_requests += 1

    # Saving the files
    comments_path = '../Series/Comments/'
    os.makedirs(os.path.dirname(comments_path), exist_ok=True) #create folders if not exists
    print(f'--> Done; {p_requests-1} Series Comments Page Requests in {timedelta(seconds=time()-start_time)}')
    if spect:
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

#### Main function: `getCommentsUrl(movies_df, series_df, spect, press)`

In [14]:
def getCommentsUrl(movies_df=None, series_df=None, spect=False, press=False):
    '''
    Get the comments section url for each movie and each series
    You can select the spectateurs or presse section, or both.
    :param movies_df: DataFrame of movies
    :param series_df: DataFrame of series
    :param spect: Boolean, True if you want to get the spectateurs section
    :param press: Boolean, True if you want to get the presse section
    :return: nothing but saves urls in a csv file
    '''
    try:
        if movies_df is not None:
            getMoviesCommentsUrl(movies_df, spect, press)
        if series_df is not None:
            getSeriesCommentsUrl(series_df, spect, press)
    except Exception:
        print('Error in getCommentsUrl function!')
        traceback.print_exc()

### Get press ratings: `getPressRatings(press_soup)`

#### Convert text to rating: `convert_text_to_rating(text)`

In [15]:
def convert_text_to_rating(text):
    '''
    Convert text to rating
    :param text: Evaluation of the movie as a string 
    :return: corresponding float rating value
    '''
    if text=="Nul":
        return 0.5
    elif text=="Très mauvais":
        return 1
    elif text=="Mauvais":
        return 1.5
    elif text=="Pas terrible":
        return 2
    elif text=="Moyen":
        return 2.5  
    elif text=="Pas mal":
        return 3
    elif text=="Bien":
        return 3.5
    elif text=="Très bien":
        return 4
    elif text=="Excellent":
        return 4.5
    elif text=="Chef d'oeuvre":
        return 5
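
The `if`/`elif` chain above is a fixed label-to-score mapping, and it returns `None` implicitly for any label it does not know. As a sketch of an equivalent formulation, the same pairs can be kept in a dictionary. The names `RATING_BY_LABEL` and `label_to_rating` are hypothetical, and stripping surrounding whitespace is an added assumption that the original function does not make.

```python
from typing import Optional

# Same label-to-score pairs as the if/elif chain in convert_text_to_rating().
RATING_BY_LABEL = {
    "Nul": 0.5,
    "Très mauvais": 1.0,
    "Mauvais": 1.5,
    "Pas terrible": 2.0,
    "Moyen": 2.5,
    "Pas mal": 3.0,
    "Bien": 3.5,
    "Très bien": 4.0,
    "Excellent": 4.5,
    "Chef d'oeuvre": 5.0,
}


def label_to_rating(text: str) -> Optional[float]:
    """Return the numeric rating for a label, or None for an unknown label."""
    # Assumption: surrounding whitespace is not significant.
    return RATING_BY_LABEL.get(text.strip())
```

Keeping the mapping as data makes it easy to spot a missing label and to log unknown labels explicitly.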

#### Main function: `getPressRatings(press_soup)`

In [16]:
def getPressRatings(press_soup):
    '''
    Get press_ratings from the comments section url for each movie or series.
    :param press_soup: BeautifulSoup object of the press comments section url
    :return: a dictionary of ratings (key: user_id, value: press rating) if the ratings are available, else None.
    '''
    press_ratings = {}
    div_ratings = press_soup.find_all('li', {'class': 'item'})
    if div_ratings:
        press_ratings = {item.text.strip(): convert_text_to_rating(item.find('span')['title']) for item in div_ratings}
        return press_ratings
    return None
    

### Get user ratings: `getUserRatings(user_soup, nb_users, min_nb_reviews)`

In [17]:
def getUserRatings(user_soup, nb_users: int, min_nb_reviews: int):
    '''
    Get user_ratings from the comments section url for each movie or series.
    We will keep only the nb_users first users who have posted at least min_nb_reviews number of reviews.
    :param user_soup: BeautifulSoup object of the user comments section url.
    :param nb_users: Number of users to keep.
    :param min_nb_reviews: Minimum number of reviews per user.
    :return: a dictionary of ratings (key: user_id, value: user rating) if the ratings are available, else None.
    '''
    user_ratings = {}
    div_ratings = user_soup.find_all('div', {'class': 'review-card'})
    if div_ratings:
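        # NOTE: the value stored for each user is still the whole review-card element, not a numeric rating,
        # and nb_users / min_nb_reviews are not applied yet.
        user_ratings = {card.find_all('span')[3]['data-targetuserid']: card for card in div_ratings}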
        return user_ratings
    return None
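
The docstring of `getUserRatings()` describes keeping only the first `nb_users` users who have posted at least `min_nb_reviews` reviews, but the function does not use either parameter. The sketch below shows one way that selection could work once the per-user data has been extracted. The helper name `keep_active_users` and the input shape, a list of `(user_id, nb_reviews, rating)` tuples in page order, are assumptions for illustration only; the values in the usage note are placeholders, not scraped data.

```python
from typing import Dict, Iterable, Tuple


def keep_active_users(
    rows: Iterable[Tuple[str, int, float]],
    nb_users: int,
    min_nb_reviews: int,
) -> Dict[str, float]:
    """Keep the first nb_users users having at least min_nb_reviews reviews.

    rows: hypothetical (user_id, nb_reviews, rating) tuples in page order.
    Returns a {user_id: rating} dictionary.
    """
    kept: Dict[str, float] = {}
    for user_id, nb_reviews, rating in rows:
        if len(kept) >= nb_users:
            break  # enough users collected, stop reading the page
        if nb_reviews >= min_nb_reviews:
            kept[user_id] = rating
    return kept
```

For example, `keep_active_users([("u1", 120, 4.0), ("u2", 3, 2.5), ("u3", 45, 3.5)], nb_users=2, min_nb_reviews=10)` keeps `u1` and `u3` and drops `u2`.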

In [34]:
press_html_soup.find_all('div', {'class': 'review-card'})[0]

<div class="hred review-card cf" id="review_1019631083">
<div class="review-card-aside">
<div class="review-card-user-infos cf">
<figure class="thumbnail">
<span class="ACrL21ACrlbWJyZS1aMjAwMzA5MTYxMTUzMTA5NTM3MDEzMjQv thumbnail-container thumbnail-link" title="selenie">

</span>
</figure>
<div class="meta">
<div class="meta-title">
<span class="ACrL21ACrlbWJyZS1aMjAwMzA5MTYxMTUzMTA5NTM3MDEzMjQv">selenie</span>
</div>
<span class="ACrL2NACrsdWIzMDAv"></span>
<p class="m

### Function: `ScrapeURL(urls)`

#### Function: `ScrapePressURL(series_url, movies_url)`

In [23]:
# Scrape the press ratings from the press comments section urls
# Returns the DataFrames and saves them as 'press_ratings_series.csv' and 'press_ratings_movies.csv'
def ScrapePressURL(series_url: list=None, movies_url: list=None):        
    # init the dataframes
    c = ["user_id",
        "id",
        "user_rating",
    ]
    if series_url is None and movies_url is None:
        print('No url to scrape!')
        return None
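    # Use None for a video type that was not requested, so the final return never hits an undefined name
    df_series = pd.DataFrame(columns=c) if series_url is not None else None
    df_movies = pd.DataFrame(columns=c) if movies_url is not None else None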
    
    # preparing the setting and monitoring loop
    start_time = time()
    ns_request = 0
    nm_request = 0
    
    # init list to save errors
    errors_series = []
    errors_movies = []
    
    if series_url is not None:
        # request loop
        for url in series_url:
            try :
                response = urlopen(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                ns_request += 1
                
                elapsed_time = time() - start_time
                print(f'Series Request #{ns_request}; Frequency: {ns_request/elapsed_time} requests/s')
                clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Series Request #{}; Status code: {}'.format(ns_request, response.status))
                    errors_series.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                press_html_soup = BeautifulSoup(html_text, 'html.parser')

                # Get the series_id
                series_id = url.split('/')[-4].split('-')[-1]            
                # Get the press_ratings
                press_ratings = getPressRatings(press_html_soup)
                for id, rating in press_ratings.items():               
                    # Append the data
                    df_series_tmp = pd.DataFrame({'user_id': [id],
                                        'id': [series_id],
                                        'user_rating': [rating]
                                        })                                    
                    df_series = pd.concat([df_series, df_series_tmp], ignore_index=True)                
            except Exception:
                errors_series.append(url)
                warn(f'Series Request #{ns_request} fail; Press rating does not exist! Total errors : {len(errors_series)}')
                traceback.print_exc()
    if movies_url is not None:
        # request loop
        for url in movies_url:
            try :
                response = urlopen(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                nm_request += 1
                
                elapsed_time = time() - start_time
                print(f'Movie Request #{nm_request}; Frequency: {nm_request/elapsed_time} requests/s')
                clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Movie Request #{}; Status code: {}'.format(nm_request, response.status))
                    errors_movies.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                press_html_soup = BeautifulSoup(html_text, 'html.parser')

                # Get the movie_id
                movie_id = url.split('/')[-4].split('-')[-1]            
                # Get the press_ratings
                press_ratings = getPressRatings(press_html_soup)
                for id, rating in press_ratings.items():               
                    # Append the data
                    df_movies_tmp = pd.DataFrame({'user_id': [id],
                                        'id': [movie_id],
                                        'user_rating': [rating]
                                        })                                    
                    df_movies = pd.concat([df_movies, df_movies_tmp], ignore_index=True)                
            except Exception:
                errors_movies.append(url)
                warn(f'Movie Request #{nm_request} fail; Press rating does not exist! Total errors : {len(errors_movies)}')
                traceback.print_exc()
            
    # monitoring 
    if series_url is not None:
        series_path = '../Series/Ratings/'
        os.makedirs(os.path.dirname(series_path), exist_ok=True) #create folders if not exists
        elapsed_time = time() - start_time
        print(f'Done; {ns_request} Series Press Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_series)} errors (series with no press ratings)')
        df_series.to_csv(f"{series_path}press_ratings_series.csv", index=False)
        # list to dataframe
        errors_df_series = pd.DataFrame(errors_series, columns=['url'])
        errors_df_series.to_csv(f"{series_path}press_ratings_errors.csv")
    if movies_url is not None:
        movies_path = '../Movies/Ratings/'
        os.makedirs(os.path.dirname(movies_path), exist_ok=True) #create folders if not exists
        elapsed_time = time() - start_time
        print(f'Done; {nm_request} Movies Press Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_movies)} errors (movies with no press ratings)')
        df_movies.to_csv(f"{movies_path}press_ratings_movies.csv", index=False)
        # list to dataframe
        errors_df_movies = pd.DataFrame(errors_movies, columns=['url'])
        errors_df_movies.to_csv(f"{movies_path}press_ratings_errors.csv")
    
    # return dataframe and errors
    return df_series, df_movies, errors_series, errors_movies
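
The loops above call `pd.concat` once per rating, which copies the growing DataFrame on every call. A common alternative is to collect plain rows first and build the DataFrame once at the end. The following is a sketch under assumptions: the helper name `ratings_to_frame` and its input shape (`{title_id: {user_id: rating}}`, with `None` meaning that no ratings were found) are hypothetical and not part of the original notebook.

```python
import pandas as pd


def ratings_to_frame(ratings_by_title: dict) -> pd.DataFrame:
    """Flatten {title_id: {user_id: rating}} into a user_id / id / user_rating table."""
    rows = [
        {"user_id": user_id, "id": title_id, "user_rating": rating}
        for title_id, ratings in ratings_by_title.items()
        # A value of None means "no ratings found" for that title.
        for user_id, rating in (ratings or {}).items()
    ]
    # Passing columns explicitly keeps the schema stable even when rows is empty.
    return pd.DataFrame(rows, columns=["user_id", "id", "user_rating"])


# Placeholder ids and values, for illustration only.
example = ratings_to_frame({"111": {"Source A": 4.0, "Source B": 2.5}, "222": None})
print(example)
```

With this shape, the scraping loop only needs to fill a dictionary, and the conversion to a DataFrame happens once before `to_csv()`.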

#### Function: `ScrapeUserURL(series_url, movies_url)`

In [None]:
def ScrapeUserURL(series_url: list, movies_url: list):
    # init the dataframes
    c = ["user_id",
        "id",
        "user_rating",
    ]
    if series_url is None and movies_url is None:
        print('No url to scrape!')
        return None
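    # Use None for a video type that was not requested, so the final return never hits an undefined name
    df_series = pd.DataFrame(columns=c) if series_url is not None else None
    df_movies = pd.DataFrame(columns=c) if movies_url is not None else None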
    
    # preparing the setting and monitoring loop
    start_time = time()
    ns_request = 0
    nm_request = 0
    
    # init list to save errors
    errors_series = []
    errors_movies = []
    
    if series_url is not None:
        # request loop
        for url in series_url:
            try :
                response = urlopen(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                ns_request += 1
                
                elapsed_time = time() - start_time
                print(f'Series Request #{ns_request}; Frequency: {ns_request/elapsed_time} requests/s')
                clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Series Request #{}; Status code: {}'.format(ns_request, response.status))
                    errors_series.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                press_html_soup = BeautifulSoup(html_text, 'html.parser')

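                # Get the series_id (the 4th segment from the end of .../ficheserie-<id>/critiques/membres-critiques/)
                series_id = url.split('/')[-4].split('-')[-1]
                # Get the user_ratings
                # nb_users and min_nb_reviews are not used by getUserRatings() yet, so None is passed as a placeholder
                user_ratings = getUserRatings(press_html_soup, nb_users=None, min_nb_reviews=None)

                for id, rating in user_ratings.items():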
                    # Append the data
                    df_series_tmp = pd.DataFrame({'user_id': [id],
                                        'id': [series_id],
                                        'user_rating': [rating]
                                        })                                    
                    df_series = pd.concat([df_series, df_series_tmp], ignore_index=True)                
            except Exception:
                errors_series.append(url)
                warn(f'Series Request #{ns_request} fail; User rating does not exist! Total errors : {len(errors_series)}')
                traceback.print_exc()
    if movies_url is not None:
        # request loop
        for url in movies_url:
            try :
                response = urlopen(url)

                # Pause the loop
                sleep(randint(1,2))

                # Monitoring the requests
                nm_request += 1
                
                elapsed_time = time() - start_time
                print(f'Movie Request #{nm_request}; Frequency: {nm_request/elapsed_time} requests/s')
                clear_output(wait = True)

                # Pause the loop
                sleep(randint(1,2))

                # Warning for non-200 status codes
                if response.status != 200:
                    warn('Movie Request #{}; Status code: {}'.format(nm_request, response.status))
                    errors_movies.append(url)

                # Parse the content of the request with BeautifulSoup
                html_text = response.read().decode("utf-8")
                press_html_soup = BeautifulSoup(html_text, 'html.parser')

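                # Get the movie_id (the 5th segment from the end of .../fichefilm-<id>/critiques/spectateurs/membres-critiques/)
                movie_id = url.split('/')[-5].split('-')[-1]
                # Get the user_ratings
                # nb_users and min_nb_reviews are not used by getUserRatings() yet, so None is passed as a placeholder
                user_ratings = getUserRatings(press_html_soup, nb_users=None, min_nb_reviews=None)
                for id, rating in user_ratings.items():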
                    # Append the data
                    df_movies_tmp = pd.DataFrame({'user_id': [id],
                                        'id': [movie_id],
                                        'user_rating': [rating]
                                        })                                    
                    df_movies = pd.concat([df_movies, df_movies_tmp], ignore_index=True)                
            except Exception:
                errors_movies.append(url)
                warn(f'Movie Request #{nm_request} fail; User rating does not exist! Total errors : {len(errors_movies)}')
                traceback.print_exc()
            
    # monitoring 
    if series_url is not None:
        series_path = '../Series/Ratings/'
        os.makedirs(os.path.dirname(series_path), exist_ok=True) #create folders if not exists
        elapsed_time = time() - start_time
        print(f'Done; {ns_request} Series User Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_series)} errors (series with no user ratings)')
        df_series.to_csv(f"{series_path}user_ratings_series.csv", index=False)
        # list to dataframe
        errors_df_series = pd.DataFrame(errors_series, columns=['url'])
        errors_df_series.to_csv(f"{series_path}user_ratings_errors.csv")
    if movies_url is not None:
        movies_path = '../Movies/Ratings/'
        os.makedirs(os.path.dirname(movies_path), exist_ok=True) #create folders if not exists
        elapsed_time = time() - start_time
        print(f'Done; {nm_request} Movies User Ratings requests in {timedelta(seconds=elapsed_time)} with {len(errors_movies)} errors (movies with no user ratings)')
        df_movies.to_csv(f"{movies_path}user_ratings_movies.csv", index=False)
        # list to dataframe
        errors_df_movies = pd.DataFrame(errors_movies, columns=['url'])
        errors_df_movies.to_csv(f"{movies_path}user_ratings_errors.csv")
    
    # return dataframe and errors
    return df_series, df_movies, errors_series, errors_movies
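
The scraping functions recover the movie or series id by splitting the url on `/` and counting segments from the end. That is fragile, because the user comments url for movies has one more path segment than the one for series. As a sketch of a position-independent alternative, a regular expression can match the `fichefilm-<id>` or `ficheserie-<id>` segment directly. The helper name `extract_title_id` is hypothetical, and the pattern assumes ids are purely numeric, as in the example url used earlier in this notebook.

```python
import re
from typing import Optional

# Matches the id in ".../fichefilm-<id>/..." or ".../ficheserie-<id>/...".
# Assumption: ids consist of digits only.
_ID_PATTERN = re.compile(r"/fiche(?:film|serie)-(\d+)/")


def extract_title_id(url: str) -> Optional[str]:
    """Return the movie or series id embedded in a comments url, or None."""
    match = _ID_PATTERN.search(url)
    return match.group(1) if match else None


movie_url = "https://www.allocine.fr/film/fichefilm-260627/critiques/spectateurs/membres-critiques/"
print(extract_title_id(movie_url))
```

The same call works for press and user urls of both video types, so the id extraction no longer depends on the depth of the path.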

## Loading the movies and series dataframes

In [24]:
# Load the movies and series dataframes
def loadDataFrames():
    '''
    Load the movies and series dataframes
    :return: movies and series dataframes
    '''
    movies_df = pd.read_csv('../Movies/Data/allocine_movies.csv')
    series_df = pd.read_csv('../Series/Data/allocine_series.csv')
    return movies_df, series_df
movies_df, series_df = loadDataFrames()

## Getting the comments section urls from spectators and press for movies and series

In [25]:
getCommentsUrl(movies_df=movies_df, series_df=series_df, spect=True, press=True)

--> Done; 298 Movies Comments Page Requests in 0:00:00.001084
--> Done; 296 Series Comments Page Requests in 0:00:00


## Loading the comments section urls

### For Series

#### From the press

In [26]:
# Getting the press_comments_urls for series
press_series_url = pd.read_csv("../Series/Comments/press_comments_urls.csv",names=['url'])
press_series_url = press_series_url['url'].tolist()

#### From the users

In [27]:
# Getting the user_comments_urls for series
user_series_url = pd.read_csv("../Series/Comments/user_comments_urls.csv",names=['url'])
user_series_url = user_series_url['url'].tolist()

### For Movies

#### From the press

In [28]:
# Getting the press_comments_urls for movies
press_movies_url = pd.read_csv("../Movies/Comments/press_comments_urls.csv",names=['url'])
press_movies_url = press_movies_url['url'].tolist()


#### From the users

In [29]:
# Getting the users_comments_urls for movies
user_movies_url = pd.read_csv("../Movies/Comments/user_comments_urls.csv",names=['url'])
user_movies_url = user_movies_url['url'].tolist()

## Scraping the data

In [31]:
# Scraping the press ratings from series and/or movies
press_series_ratings, press_movies_ratings, errors_series_ratings, errors_movies_ratings = ScrapePressURL(series_url=press_series_url, movies_url=press_movies_url)

Done; 296 Series Press Ratings requests in 0:33:11.566262 with 134 errors
Done; 288 Movies Press Ratings requests in 0:33:11.584518 with 70 errors


In [33]:
# Sort the series press ratings by user_id and count the ratings per press outlet
press_series_ratings = press_series_ratings.sort_values(by=['user_id'],ignore_index=True)
press_series_ratings.user_id.value_counts()

Variety                    75
Télérama                   74
The Hollywood Reporter     57
Le Parisien                57
Le Monde                   51
                           ..
Season One                  1
Nice-Matin                  1
Philadelphia Daily News     1
Red Eye                     1
The Salt Lake Tribune       1
Name: user_id, Length: 120, dtype: int64