# Web Scraping User Ratings From AlloCiné.fr

This script builds a DataFrame by scraping AlloCiné. This time, its purpose is to retrieve the user and press ratings of movies and series from the AlloCiné website. We proceed as follows:
- We use the movie and series URL lists generated by the other scripts to build the comments section URLs with `getCommentsUrl()`.
- From there, we scrape the user ID and the rating for each movie or series with `ScrapeURL()`.
- A user can be either a person or a press outlet. Separate files are generated for each type of user.

*Note: We use the popular BeautifulSoup package.*

## Functions :

### `getCommentsUrl(movies_df, series_df, spect, press)`

This function calls two sub-functions: `getMoviesCommentsUrl(movies_df, spect, press)` and `getSeriesCommentsUrl(series_df, spect, press)`. Each iterates over the list of movies (resp. series) generated by `getMoviesUrl()` in the previous scripts and builds the comments section URL for every entry. For each video type, we can choose to get the comments section from the users, from the press, or from both.
We then store the lists of URLs in CSV files named `user_comments_urls.csv` and `press_comments_urls.csv`, in the movies and series directories (`../Movies/Comments/` and `../Series/Comments/`).
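
To make the URL construction concrete, here is a minimal sketch. The four URL patterns are the ones used by `getMoviesCommentsUrl()` and `getSeriesCommentsUrl()` in this notebook; the helper name `build_comments_urls` and the `URL_TEMPLATES` table are hypothetical and only serve the illustration.

```python
# Hypothetical helper, not part of the notebook: it only illustrates how
# a comments section URL is derived from an AlloCiné id.
URL_TEMPLATES = {
    ("movies", "spect"): "https://www.allocine.fr/film/fichefilm-{id}/critiques/spectateurs/",
    ("movies", "press"): "https://www.allocine.fr/film/fichefilm-{id}/critiques/presse/",
    ("series", "spect"): "https://www.allocine.fr/series/ficheserie-{id}/critiques/",
    ("series", "press"): "https://www.allocine.fr/series/ficheserie-{id}/critiques/presse/",
}


def build_comments_urls(ids, kind, audience):
    """Return one comments section URL per id for the given kind and audience."""
    template = URL_TEMPLATES[(kind, audience)]
    return [template.format(id=v_id) for v_id in ids]


# Two series ids that appear in the press ratings output of this notebook
urls = build_comments_urls([22373, 26113], "series", "press")
for url in urls:
    print(url)
```

Note that no HTTP request is needed at this stage: the URLs are derived from the ids alone.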

### `ScrapeURL(urls)` :

Iterates over the list of movie or series comments section URLs generated by `getCommentsUrl()` and scrapes the ratings for each movie or series. In the process, we extract:

- `user_id`: AlloCiné user id (person or press)
- `id`: AlloCiné movie or series id
- `user_rating`: AlloCiné user rating (from 0.5 to 5 stars)


The function `ScrapeURL()` returns two objects: the user ratings and the press ratings, as two DataFrames. In addition, both are saved as `user_ratings_movies.csv` and `press_ratings_movies.csv` for movies (respectively `user_ratings_series.csv` and `press_ratings_series.csv` for series).
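
As a rough illustration of the file layout described above, the sketch below writes two rows with the three columns to an in-memory CSV. It uses the standard `csv` module rather than pandas to stay dependency-free, so it is not the notebook's own saving code; the two sample rows are taken from the press ratings output shown at the end of this notebook.

```python
import csv
import io

# Column layout described above
FIELDS = ["user_id", "id", "user_rating"]

# Two sample rows, used here only to show the layout
rows = [
    {"user_id": "20 Minutes", "id": "22373", "user_rating": 4.0},
    {"user_id": "Cosmopolitan", "id": "22373", "user_rating": 4.0},
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```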

### Import libs

In [2]:
# Import libs
import pandas as pd
import numpy as np
import re
from time import time
from time import sleep
from datetime import timedelta
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup

from warnings import warn
from IPython.display import clear_output
import traceback



## Functions

In [3]:
# TO BE REMOVED AFTER TESTING
response = urlopen("https://www.allocine.fr/film/fichefilm-260627/critiques/presse/")
html_text = response.read().decode("utf-8")
press_html_soup = BeautifulSoup(html_text, 'html.parser')

### Function: `getCommentsUrl(movies_df, series_df, spect, press)`

#### Function: `getMoviesCommentsUrl(movies_df, spect, press)`

In [4]:
# Get the comments section url for each movie
# You can select the spectateurs or presse section, or both
def getMoviesCommentsUrl(movies_df: pd.DataFrame, spect=False, press=False):
    
    # Get the list of movies_id from the movies_df
    movies_id_list = movies_df['id'].tolist()

    # Set the list
    press_url_list = []
    user_url_list = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = 1
        
    for v_id in movies_id_list:
        if spect:
            url_spect = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/spectateurs/'    
            user_url_list.append(url_spect)
        if press:
            url_press = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/presse/'   
            press_url_list.append(url_press)                         
        p_requests += 1

    # Saving the files
    comments_path = '../Movies/Comments/'
    print(f'--> Done; {p_requests-1} Movies Comments Page Requests in {timedelta(seconds=time()-start_time)}')
    if spect:
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

#### Function: `getSeriesCommentsUrl(series_df, spect, press)`

In [5]:
# Get the comments section url for each series
# You can select the spectateurs or presse section, or both
def getSeriesCommentsUrl(series_df: pd.DataFrame, spect=False, press=False):
    
    # Get the list of series_id from the series_df
    series_id_list = series_df['id'].tolist()

    # Set the list
    press_url_list = []
    user_url_list = []

    # Preparing the setting and monitoring of the loop
    start_time = time()
    p_requests = 1
        
    for v_id in series_id_list:
        if spect:
            url_spect = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/'    
            user_url_list.append(url_spect)
        if press:
            url_press = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/presse/'   
            press_url_list.append(url_press)                         
        p_requests += 1

    # Saving the files
    comments_path = '../Series/Comments/'
    print(f'--> Done; {p_requests-1} Series Comments Page Requests in {timedelta(seconds=time()-start_time)}')
    if spect:
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

#### Main function: `getCommentsUrl(movies_df, series_df, spect, press)`

In [6]:
# Function to build the comments section urls from the previously retrieved movies and series dataframes.
# The url lists are saved as csv files: user_comments_urls.csv and press_comments_urls.csv
def getCommentsUrl(movies_df=None, series_df=None, spect=False, press=False):
    
    try:
        if movies_df is not None:
            getMoviesCommentsUrl(movies_df, spect, press)
        if series_df is not None:
            getSeriesCommentsUrl(series_df, spect, press)
    except Exception:
        print('Error in getCommentsUrl function!')
        traceback.print_exc()
        

### Get press ratings: `getPressRatings(press_soup)`

#### Convert text to rating: `convert_text_to_rating(text)`

In [7]:
# Convert text to rating
def convert_text_to_rating(text):
    if text=="Nul":
        return 0.5
    elif text=="Très mauvais":
        return 1
    elif text=="Mauvais":
        return 1.5
    elif text=="Pas terrible":
        return 2
    elif text=="Moyen":
        return 2.5  
    elif text=="Pas mal":
        return 3
    elif text=="Bien":
        return 3.5
    elif text=="Très bien":
        return 4
    elif text=="Excellent":
        return 4.5
    elif text=="Chef d'oeuvre":
        return 5
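
The same mapping can also be expressed as a dictionary lookup. The sketch below is an alternative formulation, not the notebook's implementation: the names `RATING_BY_LABEL` and `label_to_rating` are hypothetical, the labels are copied from `convert_text_to_rating()` above, and all ratings are stored as floats (the original returns integers for whole-star values).

```python
# Labels copied from convert_text_to_rating(); values stored as floats
RATING_BY_LABEL = {
    "Nul": 0.5,
    "Très mauvais": 1.0,
    "Mauvais": 1.5,
    "Pas terrible": 2.0,
    "Moyen": 2.5,
    "Pas mal": 3.0,
    "Bien": 3.5,
    "Très bien": 4.0,
    "Excellent": 4.5,
    "Chef d'oeuvre": 5.0,
}


def label_to_rating(text):
    """Return the star rating for a French label, or None if the label is unknown."""
    return RATING_BY_LABEL.get(text)


print(label_to_rating("Très bien"))
print(label_to_rating("Inconnu"))
```

Like the `if`/`elif` chain, the lookup yields `None` for a label that is not in the table, so callers still need to handle missing ratings.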

#### Main function: `getPressRatings(press_soup)`

In [8]:
# Get the press ratings from the parsed press comments page of a movie or series
# Returns a dict {press outlet name: rating}, or None if no rating item is found
def getPressRatings(press_soup):
    div_ratings = press_soup.find_all('li', {'class': 'item'})
    if div_ratings:
        # Each list item holds the outlet name as text and the rating label in the span title
        press_ratings = {item.text.strip(): convert_text_to_rating(item.find('span')['title']) for item in div_ratings}
        return press_ratings
    return None
    

### Get user-ratings dataframe: `getUserRatings(urls)`

### Function `ScrapePressURL(press_url)`

In [9]:
# Scrape the press ratings from the press comments section urls
# Returns a dataframe and the list of failed urls; the dataframe is saved as 'press_ratings_series.csv'
def ScrapePressURL(press_url: list, movies=False, series=False):        
    # init the dataframe
    c = ["user_id",
        "id",
        "user_rating",
    ]
    df = pd.DataFrame(columns=c)
    
    # preparing the setting and monitoring loop
    start_time = time()
    n_request = 0
    
    # init list to save errors
    errors = []
    
    # request loop
    for url in press_url:
        try :
            response = urlopen(url)

            # Pause the loop
            sleep(randint(1,2))

            # Monitoring the requests
            n_request += 1
            
            elapsed_time = time() - start_time
            print(f'Request #{n_request}; Frequency: {n_request/elapsed_time} requests/s')
            clear_output(wait = True)

            # Pause the loop
            sleep(randint(1,2))

            # Warning for non-200 status codes
            if response.status != 200:
                warn('Request #{}; Status code: {}'.format(n_request, response.status))
                errors.append(url)

            # Parse the content of the request with BeautifulSoup
            html_text = response.read().decode("utf-8")
            press_html_soup = BeautifulSoup(html_text, 'html.parser')

            # Get the series_id
            series_id = url.split('/')[-4].split('-')[-1]            
            # Get the press_ratings
            press_ratings = getPressRatings(press_html_soup)
            for id, rating in press_ratings.items():               
                # Append the data
                df_tmp = pd.DataFrame({'user_id': [id],
                                    'id': [series_id],
                                    'user_rating': [rating]
                                    })
                                    
                df = pd.concat([df, df_tmp], ignore_index=True)
                
        except Exception:
            errors.append(url)
            warn(f'Request #{n_request} fail; Total errors : {len(errors)}')
            traceback.print_exc()
            
    # monitoring 
    series_path = '../Series/Ratings/'
    elapsed_time = time() - start_time
    print(f'Done; {n_request} requests in {timedelta(seconds=elapsed_time)} with {len(errors)} errors')
    clear_output(wait = True)
    df.to_csv(f"{series_path}press_ratings_series.csv", index=False)
    # list to dataframe
    errors_df = pd.DataFrame(errors, columns=['url'])
    errors_df.to_csv(f"{series_path}press_ratings_errors.csv")
    # return dataframe and errors
    return df, errors
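
The id lookup `url.split('/')[-4]` in `ScrapePressURL()` relies on the number of path segments, which holds for the press URLs but not for the series spectator URLs, where the fourth segment from the end is `series`. A pattern-based extraction avoids that dependency. The following is a minimal sketch under that assumption; `ID_PATTERN` and `extract_allocine_id` are hypothetical names, not part of the notebook.

```python
import re

# Matches the numeric id in both "fichefilm-<id>" and "ficheserie-<id>"
ID_PATTERN = re.compile(r"fiche(?:film|serie)-(\d+)")


def extract_allocine_id(url):
    """Return the AlloCiné id found in a comments section URL, or None."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None


press_url = "https://www.allocine.fr/series/ficheserie-22373/critiques/presse/"
user_url = "https://www.allocine.fr/series/ficheserie-22373/critiques/"

# Positional lookup as used in ScrapePressURL()
print(press_url.split('/')[-4].split('-')[-1])
print(user_url.split('/')[-4].split('-')[-1])

# Pattern-based lookup
print(extract_allocine_id(press_url))
print(extract_allocine_id(user_url))
```

The id is returned as a string, as in the original code; convert it with `int()` if a numeric column is preferred.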

### Loading the movies and series dataframes

In [10]:
# Load the movies and series dataframes
def loadDataFrames():
    movies_df = pd.read_csv('../Movies/Data/allocine_movies.csv')
    series_df = pd.read_csv('../Series/Data/allocine_series.csv')
    return movies_df, series_df
movies_df, series_df = loadDataFrames()

### Getting the comments section urls from spectators and press for movies and series

In [11]:
getCommentsUrl(movies_df=None, series_df=series_df, spect=True, press=True)

--> Done; 15 Series Comments Page Requests in 0:00:00


### Loading the comments section urls

#### For Series

##### From the press

In [12]:
# Getting the press_comments_urls for series
press_series_url = pd.read_csv("../Series/Comments/press_comments_urls.csv",names=['url'])
press_series_url = press_series_url['url'].tolist()

##### From the users

In [13]:
# Getting the user_comments_urls for series
user_series_url = pd.read_csv("../Series/Comments/user_comments_urls.csv",names=['url'])
user_series_url = user_series_url['url'].tolist()

#### For Movies

##### From the press

In [14]:
# Getting the press_comments_urls for movies
# press_movies_url = pd.read_csv("../Movies/Comments/press_comments_urls.csv",names=['url'])
# press_movies_url = press_movies_url['url'].tolist()

##### From the users

In [15]:
# Getting the user_comments_urls for movies
# user_movies_url = pd.read_csv("../Movies/Comments/user_comments_urls.csv",names=['url'])
# user_movies_url = user_movies_url['url'].tolist()

### Scraping the data

In [16]:
# Scraping the press ratings
df_press_ratings, errors_press_ratings = ScrapePressURL(press_series_url)

Done; 15 requests in 0:00:57.089078 with 5 errors


In [17]:
# Sort df_press_ratings by user_id
df_press_ratings = df_press_ratings.sort_values(by=['user_id'],ignore_index=True)
df_press_ratings

| | user_id | id | user_rating |
|---|---|---|---|
| 0 | 20 Minutes | 22373 | 4.0 |
| 1 | Cosmopolitan | 22373 | 4.0 |
| 2 | Critictoo | 22373 | 4.5 |
| 3 | Ecran Large | 26113 | 3.0 |
| 4 | Ecran Large | 22373 | 1.0 |
| ... | ... | ... | ... |
| 80 | Variety | 26062 | 3.5 |
| 81 | Variety | 18529 | 4.0 |
| 82 | Variety | 25546 | 4.0 |
| 83 | Washington Post | 18529 | 3.5 |
