# Web Scraping User Ratings From AlloCiné.fr

This script builds a DataFrame by scraping AlloCiné. Its purpose is to retrieve the user ratings of movies and series from the AlloCiné website. We proceed as follows:
- We use the series and movies URL lists generated in the other scripts to get the comments section URLs with `getCommentsUrl()`.
- From there, we scrape the user ID and their rating for each movie or series with `ScrapeURL()`.
- The user can be either a person or a press outlet. Separate files are generated for each type of user.

*Note: We use the popular BeautifulSoup package.*

## Functions

### `getCommentsUrl()`

Iterate over the list of movies or series generated by `getMoviesUrl()` in the previous scripts and build the comments section URL for each ID.
The URLs are then stored in CSV files named `user_comments_urls.csv` and `press_comments_urls.csv`, under `Movies/Comments/` (resp. `Series/Comments/`).
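
As a rough illustration of this step, the sketch below builds the two comments URLs for a single movie ID. The URL patterns are the ones used by `getMoviesCommentsUrl()` in this notebook; the helper name `build_movie_comments_urls` and the ID `12345` are hypothetical and only serve the example.

```python
def build_movie_comments_urls(movie_id):
    """Return the (spectators, press) comments URLs for one movie ID."""
    base = f'https://www.allocine.fr/film/fichefilm-{movie_id}/critiques'
    return f'{base}/spectateurs/', f'{base}/presse/'


# 12345 is a placeholder ID, not a reference to a real movie
user_url, press_url = build_movie_comments_urls(12345)
print(user_url)
print(press_url)
```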

### `ScrapeURL(urls)`

Iterate over the list of movies or series comments section URLs generated by `getCommentsUrl()` and scrape the ratings for each movie or series. In the process, we extract:

- `user_id`: AlloCiné user ID (person or press)
- `id`: AlloCiné movie or series ID
- `user_rating`: AlloCiné user rating (from 0.5 to 5 stars)


The function `ScrapeURL()` returns two objects: the user ratings and the press ratings, as two DataFrames. In addition, both are saved as `user_ratings_movies.csv` and `press_ratings_movies.csv` (respectively `user_ratings_series.csv` and `press_ratings_series.csv`).
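
To make the expected output concrete, here is a minimal sketch of the table layout described above, assuming one row per (user, title) pair. The rows are made-up placeholders, and `is_valid_rating` is a hypothetical helper that checks the half-star scale; neither comes from `ScrapeURL()` itself.

```python
import pandas as pd


def is_valid_rating(rating):
    """True if the rating is on the half-star scale from 0.5 to 5."""
    return 0.5 <= rating <= 5 and (rating * 2) == int(rating * 2)


# Placeholder rows: the IDs and ratings are invented for illustration
ratings = pd.DataFrame(
    [('user_a', 1001, 4.5), ('user_b', 1001, 2.0), ('user_a', 1002, 5.0)],
    columns=['user_id', 'id', 'user_rating'],
)

# Keep only the rows whose rating is on the expected scale
ratings = ratings[ratings['user_rating'].apply(is_valid_rating)]
print(ratings)
```

A check of this kind can help catch parsing mistakes early, for example a rating read as `45` instead of `4.5`.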

### Import libs

In [11]:
# Import libs
import pandas as pd
import numpy as np
import re
import unicodedata
from time import time
from time import sleep
from datetime import timedelta
from urllib.request import urlopen
from random import randint
from bs4 import BeautifulSoup
import dateparser

from warnings import warn
from IPython.display import clear_output
import traceback



## Functions

### Function: `getCommentsUrl(movies_df, series_df, spect, press)`

#### Function: `getMoviesCommentsUrl(movies_df, spect, press)`

In [12]:
# Get the comments section url for each movie
# You can select the spectateurs or presse section, or both
def getMoviesCommentsUrl(movies_df: pd.DataFrame, spect=False, press=False):
    
    # Get the list of movies_id from the movies_df
    movies_id_list = movies_df['id'].tolist()

    # Initialize the URL lists
    press_url_list = []
    user_url_list = []

    # Start the timer and count the processed movies for the final report
    start_time = time()
    n_movies = 0

    for v_id in movies_id_list:
        if spect:
            url_spect = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/spectateurs/'
            user_url_list.append(url_spect)
        if press:
            url_press = f'https://www.allocine.fr/film/fichefilm-{v_id}/critiques/presse/'
            press_url_list.append(url_press)
        n_movies += 1

    # Saving the files
    comments_path = 'Movies/Comments/'
    print(f'--> Done; {n_movies} Movies Comments Page Requests in {timedelta(seconds=time()-start_time)}')
    if spect:
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')

#### Function: `getSeriesCommentsUrl(series_df, spect, press)`

In [13]:
# Get the comments section url for each series
# You can select the spectateurs or presse section, or both
def getSeriesCommentsUrl(series_df: pd.DataFrame, spect=False, press=False):
    
    # Get the list of series_id from the series_df
    series_id_list = series_df['id'].tolist()

    # Initialize the URL lists
    press_url_list = []
    user_url_list = []

    # Start the timer and count the processed series for the final report
    start_time = time()
    n_series = 0

    for v_id in series_id_list:
        if spect:
            url_spect = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/'
            user_url_list.append(url_spect)
        if press:
            url_press = f'https://www.allocine.fr/series/ficheserie-{v_id}/critiques/presse/'
            press_url_list.append(url_press)
        n_series += 1

    # Saving the files
    comments_path = 'Series/Comments/'
    print(f'--> Done; {n_series} Series Comments Page Requests in {timedelta(seconds=time()-start_time)}')
    if spect:
        r = np.asarray(user_url_list)
        np.savetxt(f"{comments_path}user_comments_urls.csv", r, delimiter=",", fmt='%s')
    if press:
        r = np.asarray(press_url_list)
        np.savetxt(f"{comments_path}press_comments_urls.csv", r, delimiter=",", fmt='%s')
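
Both functions write into `Movies/Comments/` or `Series/Comments/` and assume that the folder already exists; `np.savetxt` fails when the target directory is missing. A minimal sketch of a guard follows, where `save_url_list` is a hypothetical helper and not part of the original script. The example writes to a temporary folder and uses a placeholder ID.

```python
import os
import tempfile

import numpy as np


def save_url_list(urls, folder, filename):
    """Write one URL per line, creating the folder first if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    np.savetxt(path, np.asarray(urls), delimiter=',', fmt='%s')
    return path


# Demonstration in a temporary directory; 12345 is a placeholder ID
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'Series', 'Comments')
    urls = ['https://www.allocine.fr/series/ficheserie-12345/critiques/']
    saved_path = save_url_list(urls, folder, 'user_comments_urls.csv')
    with open(saved_path, encoding='utf-8') as f:
        saved_lines = f.read().splitlines()
    print(saved_lines)
```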

#### Main function: `getCommentsUrl(movies_df, series_df, spect, press)`

In [24]:
# Build the comments section URLs for the movies and series retrieved previously.
# The URL lists are saved as CSV files: user_comments_urls.csv and press_comments_urls.csv
def getCommentsUrl(movies_df=None, series_df=None, spect=False, press=False):
    
    try:
        if movies_df is not None:
            getMoviesCommentsUrl(movies_df, spect, press)
        if series_df is not None:
            getSeriesCommentsUrl(series_df, spect, press)
    except Exception:
        # Report the failure without stopping the notebook
        print('Error in getCommentsUrl function!')
        traceback.print_exc()
        

### Loading the movies and series dataframes

In [23]:
# Load the movies and series dataframes
def loadDataFrames():
    movies_df = pd.read_csv('Movies/allocine_movies.csv')
    series_df = pd.read_csv('Series/allocine_series.csv')
    return movies_df, series_df
movies_df, series_df = loadDataFrames()

In [27]:
getCommentsUrl(movies_df=None, series_df=series_df, spect=True, press=True)

--> Done; 15 Series Comments Page Requests in 0:00:00
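
The files written by this call contain one URL per line and no header row. A minimal sketch for reading such a file back into a Python list before scraping is given here; `load_url_list` is a hypothetical helper name, and the path in the commented example assumes the folder layout used in this notebook.

```python
from pathlib import Path


def load_url_list(path):
    """Read a file with one URL per line and return the non-empty lines."""
    lines = Path(path).read_text(encoding='utf-8').splitlines()
    return [line.strip() for line in lines if line.strip()]


# Example (requires the file produced by getCommentsUrl):
# urls = load_url_list('Series/Comments/user_comments_urls.csv')
```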


### Scraping the data