# <center> **Webscraping** </center>

## **Overview**


- Web scraping of [www.allocine.fr](https://www.allocine.fr/films/) with Beautiful Soup and Selenium.

We first check the site's "robots.txt" file to confirm that we are allowed to scrape the film pages.

![robots.txt](images/robots.txt.png)

Nothing in it restricts our task, since we only work under the URL: <code>https://www.allocine.fr/film/</code><br>
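The same check can be done programmatically. The sketch below uses the standard-library `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` content and the `is_allowed` helper are hypothetical and only illustrate the idea; they are not Allocine's real rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (not Allocine's real file)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /recherche/
Disallow: /membre/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The film pages are not listed under Disallow in the sample rules
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.allocine.fr/film/"))
# The search pages are disallowed in the sample rules
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.allocine.fr/recherche/?q=test"))
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the rules before `can_fetch()` is called.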

Next we go to the **films** category of the Allocine site and scrape the information exposed by the drop-down menus on the left: first the film categories and the countries, then the films year by year.

![filtres](images/filtresSMALL.png)
### Sources:
**Beautiful Soup**:<br>
[beautiful-soup-4](https://beautiful-soup-4.readthedocs.io/en/latest/)<br>
[beautiful-soup-4.readthedocs.io](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree)<br>

**Selenium**:<br>
[selenium-python.readthedocs.io](https://selenium-python.readthedocs.io/locating-elements.html)<br>
[selenium.dev/documentation](https://www.selenium.dev/documentation/webdriver/elements/information/)<br>
[selenium.dev/documentation/finders/](https://www.selenium.dev/documentation/webdriver/elements/finders/)<br>
[geeksforgeeks.org/get_property-selenium/](https://www.geeksforgeeks.org/get_property-element-method-selenium-python/)<br>

The links are probably generated dynamically, so XPath can be used to locate them with Selenium,<br>
or perhaps with lxml (an open question).<br>

In this article, https://medium.com/swlh/web-scraping-using-selenium-and-beautifulsoup-adfc8810240a, Selenium is used to scrape the URLs, and Beautiful Soup is then used to scrape the page behind each URL.

## **Imports**

In [182]:
%reset

In [183]:
import os
import re
import io
import math
import copy
import httpx
import requests
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
from collections import namedtuple

from bs4 import BeautifulSoup
from IPython.display import display
from tqdm import tqdm

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# One way to set the driver options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

def _options():
    ''' Another way to set the options '''
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--test-type')
    options.add_argument('--headless')
    options.add_argument('--incognito')
    if os.name == 'nt':  # Windows workaround
        options.add_argument('--disable-gpu')
    options.add_argument('--verbose')
    return options

%config IPCompleter.greedy = True

url_site  = 'https://www.allocine.fr/'
url_films = 'https://www.allocine.fr/films/'

## **Scraping the movies**

### **Scraping the list of film genres**

In [None]:
# Scrap all categories
def scrap_categories():
    ''' scrap all categories from 'Allocine.fr' list
        Return a Pandas series with all categories.
    '''
    r = requests.get(url_films, auth=('user', 'pass'))
    if r.status_code != 200:
        print("url_site error")

    soup = BeautifulSoup(r.content, 'html.parser')
    print(type(soup))

    categories = []
    elt_categories = soup.find('div', class_='filter-entity-section')
    for elt in elt_categories.find_all('li'):
        categories.append(elt.a.text)

    return pd.Series(categories)

ds_categories = scrap_categories()
print("Nb categories :", ds_categories.shape[0])

<class 'bs4.BeautifulSoup'>
Nb categories : 37


### **Scraping the list of the films' countries of origin**

In [None]:
def scrap_countries():
    ''' Scrape all countries from the 'Allocine.fr' list.
        Return a Pandas series with all countries.
    '''
    r = requests.get(url_films, auth=('user', 'pass'))
    if r.status_code != 200:
        print("url_site error")

    soup = BeautifulSoup(r.content, 'html.parser')
    elt_categories = soup.find('div', class_='filter-entity-section')
    elt_countries = elt_categories.find_next_sibling().find_next_sibling()
    elts_items = elt_countries.find_all('li', class_ = 'filter-entity-item')

    countries = []
    for elt_item in elts_items:
        countries.append(elt_item.find('span').text.strip())

    # Some countries are missing from the filter list although a movie carries that nationality
    # (for example 'Botswana': https://www.allocine.fr/film/fichefilm_gen_cfilm=2577.html),
    # so we add them manually.

    for country in ('Botswana', 'Namibie', 'Liechtenstein', 'Monaco'):
        if country not in countries:
            countries.append(country)

    return pd.Series(countries)

ds_countries = scrap_countries()
print("Nb pays :", ds_countries.shape[0])

['France', 'U.S.A.', 'Afrique du Sud', 'Albanie', 'Algérie', 'Allemagne', "Allemagne de l'Est", "Allemagne de l'Ouest", 'Arabie Saoudite', 'Argentine', 'Arménie', 'Australie', 'Autriche', 'Belgique', 'Bengladesh', 'Bolivie', 'Bosnie-Herzégovine', 'Brésil', 'Bulgarie', 'Burkina Faso', 'Cambodge', 'Cameroun', 'Canada', 'Chili', 'Chine', 'Chypre', 'Colombie', 'Corée', 'Corée du Sud', 'Croatie', 'Cuba', "Côte-d'Ivoire", 'Danemark', 'Egypte', 'Emirats Arabes Unis', 'Espagne', 'Estonie', 'Finlande', 'Grande-Bretagne', 'Grèce', 'Géorgie', 'Hong-Kong', 'Hongrie', 'Inde', 'Indonésie', 'Irak', 'Iran', 'Irlande', 'Islande', 'Israël', 'Italie', 'Japon', 'Jordanie', 'kazakhstan', 'Kenya', 'Kosovo', 'Lettonie', 'Liban', 'Lituanie', 'Luxembourg', 'Macédoine', 'Malaisie', 'Maroc', 'Mexique', 'Monténégro', 'Nigéria', 'Norvège', 'Nouvelle-Zélande', 'Pakistan', 'Palestine', 'Pays-Bas', 'Philippines', 'Pologne', 'Portugal', 'Pérou', 'Qatar', 'Roumanie', 'Russie', 'République dominicaine', 'République tchè

### **Scraping the links to the per-year film pages**

**We use Selenium**<br>
With Beautiful Soup alone, some elements are **decorated** and some links are **invisible**, so they cannot be scraped directly.<br>
The workaround we found is to use Selenium, which among other things makes it possible:
- to use XPath expressions,
- to retrieve every element in its undecorated form.

We start from a list of years, for example [1980, ... 2000]

[1980 - 1989] then [1990 - 1999] ... up to [2000 - 2009].<br>
(this amounts to more than 40000 films and XXX reviews).
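As a small illustration of how a list of years maps onto decades, the sketch below reuses the `10 * (year // 10)` rule that `get_year_links` applies. The helper names `years_to_decades` and `decade_range` are hypothetical and exist only for this example.

```python
def years_to_decades(years):
    """Map each year to the first year of its decade; return the sorted unique decades."""
    return sorted({10 * (year // 10) for year in years})

def decade_range(decade):
    """Return the (first, last) years covered by a decade, e.g. 1980 -> (1980, 1989)."""
    return decade, decade + 9

decades = years_to_decades(range(1980, 2001))
print(decades)                             # [1980, 1990, 2000]
print([decade_range(d) for d in decades])  # [(1980, 1989), (1990, 1999), (2000, 2009)]
```

Because the year 2000 belongs to the decade 2000 - 2009, a list ending in 2000 still requires visiting that decade's page.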

**Links as shown in the Chrome HTML inspector**

![links_decades_inspector](images/link_decades_inspector.png)


**Links as returned by Beautiful Soup**

![link](images/link_decades_bs4.png)

In [None]:
# To show the limit of Beautiful Soup
r = requests.get(url_films, auth=('user', 'pass'))
if r.status_code != 200:
    print("url_site error")
    
soup = BeautifulSoup(r.content, 'html.parser')
# print(soup.prettify())

elt_categories = soup.find('div', class_='filter-entity-section')
elt_decades = elt_categories.find_next_sibling()
print(elt_decades.prettify())

In [198]:
def get_year_links(lst_years_to_scrap):
    ''' Scrap the links of years in the left panel of Allocine.fr '''

    if isinstance(lst_years_to_scrap, list):
        assert all([isinstance(item, int) for item in lst_years_to_scrap])
    else:
        assert isinstance(lst_years_to_scrap, int)
        lst_years_to_scrap = [lst_years_to_scrap]

    lst_decades_to_scrap = list(set([10 * (year // 10) for year in lst_years_to_scrap]))
    lst_years_to_scrap = [str(year) for year in lst_years_to_scrap]
    lst_decades_to_scrap = [str(decade) for decade in lst_decades_to_scrap]

    driver = webdriver.Chrome(options = _options())
    driver.get(url_films)
    elts_decades = driver.find_elements(By.XPATH, '/html/body/div[2]/main/section[4]/div[1]/div/div[3]/div[2]/ul/li')

    dict_year_link = {}
    for elt_decade in tqdm(elts_decades):
        elt_a = elt_decade.find_element(By.TAG_NAME, 'a')
        if not(elt_a.get_attribute('title')[:4] in lst_decades_to_scrap):
            continue

        driver2 = webdriver.Chrome(options = options)
        url_decade = elt_a.get_attribute('href').strip()

        driver2.get(url_decade)
        elts_years = driver2.find_elements(By.XPATH, '/html/body/div[2]/main/section[4]/div[1]/div/div[3]/div[3]/ul/li')

        for elt_year in elts_years:
            year = elt_year.find_element(By.TAG_NAME, 'a').get_attribute('title').strip()
            if year in lst_years_to_scrap:
                link = elt_year.find_element(By.TAG_NAME, 'a').get_attribute('href').strip()
                dict_year_link[year] = link
        driver2.quit()  # quit() ends the whole WebDriver session, not just the current window

    for year, url_year in dict_year_link.items():
        print("year", year, '  ----  link', url_year)

    driver.quit()  # quit() ends the whole WebDriver session, not just the current window
    return dict_year_link

list_years_to_scrap = list(range(2022, 2025))
dict_year_link = get_year_links(list_years_to_scrap)

100%|██████████| 15/15 [00:03<00:00,  4.48it/s]


year 2024   ----  link https://www.allocine.fr/films/decennie-2020/annee-2024/
year 2023   ----  link https://www.allocine.fr/films/decennie-2020/annee-2023/
year 2022   ----  link https://www.allocine.fr/films/decennie-2020/annee-2022/


### **Scraping the films from the year links**


To scrape the directors and actors, we go to the film's **casting** page, scrape the main actors shown in the mosaic, and then scrape the list of supporting actors.

![all_actors](images/scraping_all_actors.png)

The URL of the similar-films page can be derived from the URL of the film page:<br>
https://www.allocine.fr/film/fichefilm_gen_cfilm=180.html<br>
https://www.allocine.fr/film/fichefilm-180/similaire/<br>

The pattern is always the same, so we can build these URLs automatically instead of scraping them.
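A minimal sketch of that URL derivation, using the same string replacement that `scrap_movie` applies further down. The helper name `derived_urls` is hypothetical; the URL patterns are the ones observed above.

```python
def derived_urls(url_movie):
    """Build the spectator-reviews and similar-films URLs from a film page URL."""
    suffix = '.html'
    if url_movie.endswith(suffix):
        url_movie = url_movie[:-len(suffix)]
    # 'fichefilm_gen_cfilm=180' -> 'fichefilm-180'
    base = url_movie.replace('_gen_cfilm=', '-')
    return base + '/critiques/spectateurs/', base + '/similaire/'

url_reviews, url_similar = derived_urls('https://www.allocine.fr/film/fichefilm_gen_cfilm=180.html')
print(url_reviews)  # https://www.allocine.fr/film/fichefilm-180/critiques/spectateurs/
print(url_similar)  # https://www.allocine.fr/film/fichefilm-180/similaire/
```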

In [200]:
month_FR = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet', 'août', 'septembre', 'octobre', 'novembre', 'décembre']
month_EN = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']

def number_pages_per_year(soup_year):
    ''' Return the number of pages for one year'''
    pagination = soup_year.find('div', class_='pagination-item-holder')
    if pagination is None:
        # No pagination block: we assume the year fits on a single page
        return 1
    return int(pagination.find_all('span')[-1].text)

def get_month(date):
    ''' Extract month from a date '''
    return ''

def delete_thumbnails():
    '''Delete all files in thumbnail directory'''
    try:
        folder_name = os.path.join(os.getcwd(), 'thumbnails')  # portable path (not Windows-only)
        files = os.listdir(folder_name)
        for file in files:
            file_path = os.path.join(folder_name, file)
            if os.path.isfile(file_path):
                os.remove(file_path)
        print("All files deleted successfully.")
    except OSError:
        print("Error occurred while deleting files.")

def get_title(soup_movie):
    ''' Return the title of the movie '''

    title = soup_movie.find('div', class_ = "titlebar-title titlebar-title-xl").text
    elts = soup_movie.find_all('div', class_ = 'meta-body-item')
    for elt in elts:
        elts_span = elt.find_all('span')
        for elt_span in elts_span:
            if "Titre original" in elt_span.get_text(strip = True):
                return title, elt_span.find_next_sibling().get_text(strip = True)
    return title, title

def get_date_duration_categories(soup_movie):
    ''' Return date, duration and categories (as string) of the movie '''
    elt = soup_movie.find('div', class_="meta-body-item meta-body-info")
    date, duration, categories = '', '', ''
    
    if False: # Not really accurate
        text = elt.get_text(strip=True)
        # print(text)
        if text.count('|') == 1:
            s1, s2 = text.split('|')
            categories = s2.strip()
        elif text.count('|') == 2:
            s1, s2, s3 = text.split('|')
            date = s1[:-8].strip()
            duration = s2.strip()
            categories = s3.strip()
        return date, duration, categories

    text = elt.get_text(strip = True)
    for elt_span in elt.find_all("span"):
        classes = elt_span.get('class') or []  # a <span> without a class attribute returns None
        if 'date' in classes:
            date = elt_span.get_text(strip = True)
            text = text.replace(date, '')

        elif 'meta-release-type' in classes:
            text = text.replace(elt_span.get_text(strip = True), '')

        elif 'dark-grey-link' in classes:
            categories += elt_span.get_text(strip = True) + ','
            for item in elt_span.get_text(strip = True).split():
                text = text.replace(item, '')

    text = text.replace('|', '')
    text = text.replace(',', '')

    # Drop the trailing comma; rstrip is safe when no category was found (empty string)
    categories = categories.rstrip(',')

    if len(text.strip()):
        duration = text.strip()

    return date, duration, categories

def get_country(soup_movie):
    ''' Return country of the movie '''
    elts_section_title = soup_movie.find_all('div', class_ = 'section-title')
    for elt_section_title in elts_section_title:
        elt_h2 = elt_section_title.find('h2')

        if elt_h2 and 'Infos techniques' in elt_h2.text.strip():
            elt_country = elt_section_title.find_next_sibling()
            assert "Nationalité" in elt_country.find("span", class_ = "what light").text.strip()
            elt_span_that = elt_country.find("span", class_ = "that")
            elts_span_country = elt_span_that.find_all("span")
            lst_countries = []
            for elt_span_country in elts_span_country:
                lst_countries.append(elt_span_country.text.strip())
            return ','.join(lst_countries)
    return ''

def get_directors(soup_casting):
    ''' Return list of directors '''
    elt_director_section = soup_casting.find('section', class_='section casting-director')
    if elt_director_section:
        elt_temp = elt_director_section.find_next()
        elts_directors = elt_temp.find_next_sibling().find_all('div', class_ = 'card person-card person-card-col')
        lst_directors = []
        for elt_director in elts_directors:
            if elt_director.find('a'):
                lst_directors.append(elt_director.text.strip())
        return ','.join(lst_directors)
    return ''

def get_actors(soup_casting):
    ''' Return list of actors (maximum 30) '''
    elt_actor_section = soup_casting.find('section', class_ = 'section casting-actor')
    if elt_actor_section:
        elt_temp = elt_actor_section.find_next()

        # scrap main actors (maximum eight actors in the mosaic, see image above)
        elts_actors = elt_temp.find_next_sibling().find_all('div', class_ = 'card person-card person-card-col')
        lst_actors = []
        for elt_actor in elts_actors:
            elt_figure = elt_actor.find("figure")
            if elt_figure:
                elt_span = elt_figure.find('span')
                actor_name = elt_span.get('title') if elt_span else None  # guard against a figure without <span>
                if actor_name and actor_name.strip():
                    lst_actors.append(actor_name.strip())

        # scrap list of actors below the mosaic (we scrap maximum of (8 + 22) 30 actors in total)
        elts_actors = elt_actor_section.find_all('div', class_ = 'md-table-row')
        lst_actors.extend([elt_actor.find('a').text for elt_actor in elts_actors[:22] if elt_actor.find('a')])
        return ','.join(lst_actors)
    return ''

def get_composers(soup_casting):
    ''' Scrap the name(s) of the music composer(s) '''
    elts_sections = soup_casting.find_all("div", class_ = "section casting-list-gql")
    for elt_section in elts_sections:
        elt_title = elt_section.find('div', class_ = 'titlebar section-title').find('h2')
        if 'Soundtrack' in elt_title.text:
            lst_composers = []
            elts_composers = elt_section.find_all('div', class_ = 'md-table-row')
            for elt_composer in elts_composers:
                elts_span = elt_composer.find_all('span')
                if len(elts_span) > 1 and 'Compositeur' in elts_span[1].text.strip():
                    lst_composers.append(elts_span[0].text.strip())
            return ','.join(lst_composers)
    return ''

def get_summary(soup_movie):
    ''' Return the synopsis of the movie ('' if the page has none) '''
    elt_synopsis = soup_movie.find('section', class_ = "section ovw ovw-synopsis")
    if elt_synopsis:
        elt_content = elt_synopsis.find('p', class_ = 'bo-p')
        if elt_content:
            return elt_content.text.strip()
    return ''

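def get_thumbnail(soup_movie):
    ''' Return the URL of the movie thumbnail ('' if the page has none) '''
    elt = soup_movie.find('figure', class_ = 'thumbnail')
    if elt and elt.span and elt.span.img:
        return elt.span.img.get('src', '')
    return ''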

def save_thumbnail(title, url_thumbnail):
    '''Save the thumbnail as image file in directory "thumbnails"'''
    folder_name = os.path.join(os.getcwd(), 'thumbnails')
    title2 = title.replace('-', '')
    image_name = f"thumbnail-{title2}.jpg"
    try:
        image = httpx.get(url_thumbnail)
        # The context manager closes the file even if the write fails,
        # and avoids calling close() on a file that was never opened
        with open(os.path.join(folder_name, image_name), "wb") as file:
            file.write(image.content)
        # Display thumbnail in Jupyter / console
        img = Image.open(io.BytesIO(image.content))
        plt.imshow(img)
        plt.axis('off')
        plt.show()
        # To change resolution: https://www.geeksforgeeks.org/change-image-resolution-using-pillow-in-python/
    except IOError:
        print("Cannot read or write the thumbnail file")

def get_similar_movies(url_similar_movies):
    ''' return list of similar movies '''

    lst_similar_movies = []
    # print('url_similar_movies:', url_similar_movies)

    # get the 'similar movies page' soup
    r = requests.get(url_similar_movies, auth=('user', 'pass'))
    soup_similar_movie = BeautifulSoup(r.content, 'html.parser')
    if r.status_code != 200:
        return lst_similar_movies
    
    elts_section = soup_similar_movie.find_all('ul', class_ = "section")
    if elts_section:
        elts_similar_movies = elts_section[0].find_all('li', class_ = 'mdl')
        if elts_similar_movies:
            for elt_similar_movie in elts_similar_movies:
                elt_title = elt_similar_movie.find('h2', class_ = 'meta-title')
                lst_similar_movies.append(elt_title.find('a').text.strip())

    return lst_similar_movies

def scrap_movie(elt_movie, options_scrapping):
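    ''' Scrape all information about one movie.

        Return: a (status, movie) tuple where status is 'OK', 'Reviews' or 'Category',
                and movie is the tuple of scraped fields (None unless status is 'OK').
        Args:
         - elt_movie: <li> element of the movie, taken from a listing page,
         - options_scrapping: scraping options used to decide whether to keep a movie / a year.
    '''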
    
    # get the movie soup
    url_movie = url_site + elt_movie.h2.a.get('href')[1:]
    r = requests.get(url_movie, auth=('user', 'pass'))
    soup_movie = BeautifulSoup(r.content, 'html.parser')
    
    # ------------- #
    #     Title     #
    # ------------- #
    title, original_title = get_title(soup_movie)

    # ------------ #
    #    Ratings   #
    # ------------ #
    star_rating, nb_notes, nb_reviews = get_ratings(soup_movie, options_scrapping)
    nb_reviews = convert_to_integer(nb_reviews)
    if nb_reviews < options_scrapping.nb_minimum_critics:
        print("Not enough reviews, Do not scrape the movie:" , title)
        return 'Reviews', None
    
    # --------------------------------- #
    #   Date, duration and categories   #
    # --------------------------------- #
    date, duration, categories = get_date_duration_categories(soup_movie)

    # We do not scrape 'Documentaries' or movie with only category : Divers
    if 'Documentaire' in categories or categories.strip() == 'Divers':
        print('We do not scrape those category film:', title)
        return 'Category', None
    
    print('Title:', title)

    # ---------------- #
    #     Countries    #
    # ---------------- #
    countries = get_country(soup_movie)

    # ---------------------------------- #
    #   Directors / Actors / Composers   #
    # ---------------------------------- #
    directors, actors, composers = '', '', ''
    is_casting_section = False
    elts_end_section = soup_movie.find_all('a', class_ = 'end-section-link')

    if elts_end_section:
        for elt_end_section in elts_end_section:

            if 'Casting' in elt_end_section['title']:
                # If there is a link to the casting section
                is_casting_section = True
                link_casting = elt_end_section['href']
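                # Strip the leading '/' so the joined URL has no double slash (as done for url_movie above)
                r = requests.get(url_site + link_casting.lstrip('/'), auth=('user', 'pass'))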
                soup_casting = BeautifulSoup(r.content, 'html.parser')

                # Get directors' list
                directors = get_directors(soup_casting)
                # Get actors' list
                actors = get_actors(soup_casting)
                # Composers' list
                composers = get_composers(soup_casting)
                break 

    if not(is_casting_section):
        # No casting section:
        # for example, animation movies do not have a casting section,
        # and neither do some other movies: https://www.allocine.fr/film/fichefilm_gen_cfilm=27635.html

        # Get directors' list
        elt_director = soup_movie.find('div', class_ = "meta-body-item meta-body-direction")
        if elt_director:
            elts_span = elt_director.find_all('span')
            assert len(elts_span) >= 2
            directors =  elts_span[1].get_text().strip()

        # Get actors' list
        elt_actor = soup_movie.find('div', class_ = "meta-body-item meta-body-actor")
        if elt_actor:
            lst_actors = []
            for elt_a in elt_actor.find_all('a'):
                lst_actors.append(elt_a.text.strip())
            actors = ','.join(lst_actors)

    # ------------ #
    #   Summary    #
    # ------------ #
    summary = get_summary(soup_movie)[:180]

    # ------------ #
    #   Thumbnail  #
    # ------------ #
    url_thumbnail = get_thumbnail(soup_movie)
    # save_thumbnail(title, url_thumbnail)
    # It is not memory efficient to store the images so we just store the url toward the image.

    # ------------------- #
    #     url_reviews     #
    # ------------------- #
    url_reviews = url_movie.replace('_gen_cfilm=', '-')[:-5] + '/critiques/spectateurs/'

    # ------------------- #
    #    Similar Movies   #
    # ------------------- #
    url_similar_movies = url_movie.replace('_gen_cfilm=', '-')[:-5] + '/similaire/'
    # soup_similar_movies
    # lst_similar_movies = get_similar_movies(url_similar_movies)
    # print(lst_similar_movies)

    return 'OK', (title, original_title, date, duration, categories, countries, star_rating, \
                  nb_notes, nb_reviews, directors, actors, composers,\
                  summary, url_thumbnail, url_reviews, url_similar_movies)


def get_ratings(soup_movie, options_scrapping):
    ''' Scrap the ratings of the movie.

        Return:
         - stareval:   star rating (0.5 to 5),
         - nb_notes:   number of votes,
         - nb_reviews: number of reviews (written reviews),

        Args:
         - soup_movie:   object BeautifulSoup of the movie,
         - options_scrapping: Scrapping options for example:
                              options_scrapping.use_Selenium: boolean to choose if we use Selenium or not
                                        True:  Selenium's method,      (SLOWER)
                                        False: Beautifulsoup's method. (FASTER)
        '''

    star_rating, nb_notes, nb_reviews = None, None, None

    if options_scrapping.use_Selenium:
        elts_ratings = options_scrapping.driver.find_elements(By.CLASS_NAME, 'rating-item')

        for elt_rating in elts_ratings:
            elt = None
            try:
                elt = elt_rating.find_element(By.TAG_NAME, 'a')
                if 'Spectateurs' in elt.text.strip():
                    elt_stareval_note = elt_rating.find_element(By.CLASS_NAME, 'stareval-note')
                    star_rating = elt_stareval_note.text.strip()
                    elt_stareval_review = elt_rating.find_element(By.CLASS_NAME, 'stareval-review')
                    stareval_review = elt_stareval_review.text.strip()
                    if stareval_review.count(',') == 1:
                        nb_notes, nb_reviews = stareval_review.split(',')
                        nb_notes = nb_notes.split()[0]
                        nb_reviews = nb_reviews.split()[0]
                    elif 'note' in stareval_review:
                        nb_notes = stareval_review.strip()
                    elif 'critique' in stareval_review:
                        nb_reviews = stareval_review
                    else:
                        assert False

            except:
                print('no tag "a" in elt_rating')

    else: # use beautiful soup
        elts_ratings = soup_movie.find_all('div', class_ = 'rating-item-content')
        for elt_rating in elts_ratings:
            if 'Spectateurs' in elt_rating.find("span").text.strip():
                elt_stareval_note = elt_rating.find("span", class_ = "stareval-note")
                star_rating = elt_stareval_note.text.strip()
                elt_stareval_review = elt_rating.find("span", class_ = "stareval-review")
                stareval_review = elt_stareval_review.text.strip()
                if stareval_review.count(',') == 1:
                    nb_notes, nb_reviews = stareval_review.split(',')
                    nb_notes = nb_notes.split()[0]
                    nb_reviews = nb_reviews.split()[0]
                elif 'note' in stareval_review:
                    nb_notes = stareval_review.strip()
                elif 'critique' in stareval_review:
                    nb_reviews = stareval_review
                else:
                    assert False

    return star_rating, nb_notes, nb_reviews

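def convert_to_integer(str_nb_reviews):
    ''' Convert the string str_nb_reviews into integer '''
    if not(str_nb_reviews):
        return 0
    match = re.search(r'\d+', str_nb_reviews)
    # match.string would return the whole input; group() returns only the matched digits
    return int(match.group()) if match else 0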

def scrap_years(dict_year_link, options_scrapping):
    ''' Scrap the movies of the years listed in the provided dictionary

        Return: Pandas Dataframe with all informations about scrapped movies.

        Args:
         - dict_year_link: dictionary with list of years to scrap,
         - options_scrapping: options to stop scraping a year or to scrape or not a movie.
    '''

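    if options_scrapping.use_Selenium and options_scrapping.driver is None:
        # A namedtuple is immutable, so assigning to .driver raises AttributeError;
        # build an updated copy instead (the caller's tuple is left unchanged).
        options_scrapping = options_scrapping._replace(driver = webdriver.Chrome(options = _options()))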

    counter_movies                          = 0
    counter_scraped_movies                  = 0
    counter_not_scraped_not_enough_reviews  = 0
    counter_not_scraped_categories          = 0
    movies = []

    for year, url_year in dict_year_link.items():
        
        r = requests.get(url_year, auth=('user', 'pass'))
        if r.status_code != 200:
            print("url_site error")

        soup_year = BeautifulSoup(r.content, 'html.parser')
        nb_pages = number_pages_per_year(soup_year)
        consecutive_number_of_unpopular_movies  = 0
        counter_movies_per_year                 = 0

        for i in range(nb_pages):
            url_year_page = url_year + f'?page={i+1}'
            r = requests.get(url_year_page, auth=('user', 'pass'))
            if r.status_code != 200:
                print("url_site error")

            print(f"***  Year {year}  ---  Page {i+1}  ***")
            soup_movies = BeautifulSoup(r.content, 'html.parser')
            elt_movies = soup_movies.find_all('li', class_='mdl')

            for elt_movie in elt_movies:
                # print('---------------------------------------------------------------- ')
                status, movie = scrap_movie(elt_movie, options_scrapping)
                counter_movies += 1
                
                if status == 'Reviews':
                    counter_not_scraped_not_enough_reviews += 1
                    consecutive_number_of_unpopular_movies += 1
                    if consecutive_number_of_unpopular_movies == options_scrapping.nb_consecutives_unpopular_movies_to_break:
                        # Reached the number of consecutives "unpopular" movies so we stop scrapping this year.
                        print(f"Reached {options_scrapping.nb_consecutives_unpopular_movies_to_break} consecutives 'unpopular' movies: BREAK")
                        break

                else:
                    consecutive_number_of_unpopular_movies = 0 # Reset the number of unpopular movies

                    if status == 'OK':
                        movies.append(movie)
                        counter_scraped_movies  += 1
                        counter_movies_per_year += 1
                        if counter_movies_per_year == options_scrapping.nb_maximum_movies_per_year:
                            # Reached the maximum number to scrap per year
                            break
                    else:
                        assert status == 'Category'
                        counter_not_scraped_categories += 1
            
            if counter_movies_per_year == options_scrapping.nb_maximum_movies_per_year or \
               consecutive_number_of_unpopular_movies == options_scrapping.nb_consecutives_unpopular_movies_to_break:
                print('Stop scrapping for this year')
                break

    # Display some infos
    print("\n*** Scrapping summary ***")
    print("Nb movies scanned: ", counter_movies)
    print("Nb movies scrapped:", counter_scraped_movies)

    if counter_not_scraped_not_enough_reviews:
        print("Not scrapped Reviews: ", counter_not_scraped_not_enough_reviews)
    if counter_not_scraped_categories:
        print("Not scrapped Category:  ", counter_not_scraped_categories)

    df_movies = pd.DataFrame(movies, columns = ['title', 'original_title', 'date', 'duration', 'categories', \
                                                'countries', 'star_rating', 'notes', 'reviews', \
                                                'directors', 'actors', 'composers', 'summary', \
                                                'url_thumbnail', 'url_reviews', 'url_similar_movies'])
    return df_movies


def scrap_new_release(options_scrapping):
    ''' Scrape all new releases and return them as a Pandas DataFrame '''
    url_new_release = 'https://www.allocine.fr/film/sorties-semaine/'
    print('new release', url_new_release)
    r = requests.get(url_new_release, auth=('user', 'pass'))
    if r.status_code != 200:
        print("url_site error")

    soup_new_releases = BeautifulSoup(r.content, 'html.parser')
    elts_movies = soup_new_releases.find_all('li', class_='mdl')
    counter_scraped_movies = 0
    counter_movies = 0

    movies = []
    for elt_movie in elts_movies:
        status, movie = scrap_movie(elt_movie, options_scrapping)
        counter_movies += 1  # count every scanned movie so the summary below is not always 0
        if status == 'OK':
            movies.append(movie)
            counter_scraped_movies += 1

    # Display some infos
    print("\n*** Scrapping summary ***")
    print("Nb movies scanned: ", counter_movies)
    print("Nb movies scrapped:", counter_scraped_movies)

    df_movies = pd.DataFrame(movies, columns = ['title', 'original_title', 'date', 'duration', 'categories', \
                                                'countries', 'star_rating', 'notes', 'reviews', \
                                                'directors', 'actors', 'composers', 'summary', \
                                                'url_thumbnail', 'url_reviews', 'url_similar_movies'])
    return df_movies


# --------------------- #
#      Some options     #
# --------------------- #

# We cannot scrape every movie:
# in the 1960s there are roughly 500 movies per year,
# in the 2020s about 3000 movies per year.
# Some movies are totally unknown, with very little information available.

# delete_thumbnails()

# Object to send options altogether
Options_Scrapping = namedtuple('Options', (['use_Selenium', 'driver', 'nb_minimum_critics',\
                                          'nb_consecutives_unpopular_movies_to_break',\
                                          'nb_maximum_movies_per_year']))
use_Selenium = False
driver = None
nb_minimum_critics = 30
nb_consecutives_unpopular_movies_to_break = 20
nb_maximum_movies_per_year = 250

options_scrapping = Options_Scrapping(use_Selenium, driver, nb_minimum_critics, \
                    nb_consecutives_unpopular_movies_to_break, nb_maximum_movies_per_year )

df_movies = scrap_years(dict_year_link, options_scrapping)

# nb_minimum_critics = -1
# options_scrapping = Options_Scrapping(use_Selenium, driver, nb_minimum_critics, \
#                     nb_consecutives_unpopular_movies_to_break, nb_maximum_movies_per_year )

# df_movies = scrap_new_release(options_scrapping)

if options_scrapping.driver:
    # quit() ends the whole WebDriver session, close() only closes the current window
    options_scrapping.driver.quit()

***  Year 2024  ---  Page 1  ***
Title: Un parfait inconnu
Title: Better Man
Title: Je suis toujours là
Title: Jouer avec le feu
Title: Un ours dans le jura
Title: Babygirl
Title: La Chambre d’à côté
Title: En fanfare
Title: The Brutalist
Not enough reviews, Do not scrape the movie: Toutes pour une
Title: Companion
Title: 5 septembre
Title: Maria
Title: La Pampa
Title: Vol à haut risque
***  Year 2024  ---  Page 2  ***
Title: L'Amour au présent
Title: La Pie voleuse
Title: Le Quatrième mur
Title: Jane Austen a gâché ma vie
Title: Emilia Pérez
Title: Mufasa : Le Roi Lion
Title: Wolf Man
Title: Le Dossier Maldoror
Title: Vingt dieux
Title: Conclave
Title: Mémoires d’un escargot
Title: Sing Sing
Title: Nosferatu
Title: Presence
Title: L'Amour ouf
***  Year 2024  ---  Page 3  ***
Title: Le Comte de Monte-Cristo
Title: Brûle le sang
Title: Mon gâteau préféré
Not enough reviews, Do not scrape the movie: Daffy et Porky sauvent le monde
Title: Une nuit au zoo
Title: The Substance
Title: Hiver 

## **Some scraping results**

**Scraping the year 1980**:<br>
We only keep movies with at least 20 reviews:

![scrapping_year_1980_more_20_critics](images/scrapping_year_1980_critics_more_20.png)

We only keep movies with at least 10 reviews:

![scrapping_year_1980_more_10_critics](images/scrapping_year_1980_critics_more_10.png)

**Scraping the 1980s**:<br>
We only keep movies with at least 20 reviews:

![scrapping_decade_80_more_20_critics](images/scrapping_decade_80_critics_more_20.png)

**Scraping the years 1990 to 1995**:<br>
We only keep movies with at least 20 reviews:

![scrapping_years_90_95_critics](images/scrapping_year_1990_1995_critics_more_20.png)
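
The effect of the thresholds used for these runs can be illustrated with a small, self-contained sketch. It is not the notebook's `scrap_years` implementation: `select_movies`, `Limits` and the `(title, nb_reviews)` input format are hypothetical, and only the three option names are taken from the options cell. It assumes that movies are listed from most to least popular, so that a long run of movies with too few reviews suggests the rest of the year can be skipped.

```python
from collections import namedtuple

# Hypothetical, reduced version of the notebook's options object
Limits = namedtuple('Limits', ['nb_minimum_critics',
                               'nb_consecutives_unpopular_movies_to_break',
                               'nb_maximum_movies_per_year'])


def select_movies(movies, limits):
    """Return the titles kept for one year.

    movies: iterable of (title, nb_reviews) pairs, assumed sorted by popularity.
    """
    kept = []
    consecutive_unpopular = 0
    for title, nb_reviews in movies:
        if nb_reviews < limits.nb_minimum_critics:
            # Too few reviews: skip the movie and track the current streak
            consecutive_unpopular += 1
            if consecutive_unpopular >= limits.nb_consecutives_unpopular_movies_to_break:
                break
            continue
        # A popular movie resets the streak of unpopular ones
        consecutive_unpopular = 0
        kept.append(title)
        if len(kept) >= limits.nb_maximum_movies_per_year:
            break
    return kept


limits = Limits(nb_minimum_critics=20,
                nb_consecutives_unpopular_movies_to_break=2,
                nb_maximum_movies_per_year=3)
year = [('A', 120), ('B', 5), ('C', 40), ('D', 3), ('E', 1), ('F', 90)]
print(select_movies(year, limits))  # ['A', 'C']
```

In this toy run, `F` is never reached because `D` and `E` form two consecutive unpopular movies. Lowering `nb_minimum_critics` keeps more movies, at the cost of more requests per year.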

In [201]:
df_movies

Unnamed: 0,title,original_title,date,duration,categories,countries,star_rating,notes,reviews,directors,actors,composers,summary,url_thumbnail,url_reviews,url_similar_movies
0,Un parfait inconnu,A Complete Unknown,29 janvier 2025,2h 20min,"Biopic,Drame,Musical",U.S.A.,41,2978,358,James Mangold,"Timothée Chalamet,Edward Norton,Elle Fanning,M...",,"New York, 1961. Alors que la scène musicale es...",https://fr.web.img6.acsta.net/c_310_420/img/7b...,https://www.allocine.fr/film/fichefilm-280195/...,https://www.allocine.fr/film/fichefilm-280195/...
1,Better Man,Better Man,22 janvier 2025,2h 16min,"Biopic,Musical",Grande-Bretagne,42,1881,310,Michael Gracey,"Robbie Williams,Jonno Davies,Steve Pemberton,D...",Batu Sener,L'ascension du célèbre chanteur/compositeur br...,https://fr.web.img6.acsta.net/c_310_420/img/3d...,https://www.allocine.fr/film/fichefilm-290583/...,https://www.allocine.fr/film/fichefilm-290583/...
2,Je suis toujours là,Ainda Estou Aqui,15 janvier 2025,2h 15min,"Drame,Thriller","Brésil,France",42,2211,207,Walter Salles,"Fernanda Torres,Fernanda Montenegro,Selton Mel...",Warren Ellis,"Rio, 1971, sous la dictature militaire. La gra...",https://fr.web.img6.acsta.net/c_310_420/img/f1...,https://www.allocine.fr/film/fichefilm-265940/...,https://www.allocine.fr/film/fichefilm-265940/...
3,Jouer avec le feu,Jouer avec le feu,22 janvier 2025,1h 58min,Drame,"France,Belgique",35,1847,285,"Delphine Coulin,Muriel Coulin","Vincent Lindon,Benjamin Voisin,Stefan Crepon,M...",Pawel Mykietyn,"Pierre élève seul ses deux fils. Louis, le cad...",https://fr.web.img6.acsta.net/c_310_420/img/02...,https://www.allocine.fr/film/fichefilm-313778/...,https://www.allocine.fr/film/fichefilm-313778/...
4,Un ours dans le jura,Un ours dans le jura,1 janvier 2025,1h 53min,"Comédie,Thriller",France,38,4677,965,Franck Dubosc,"Franck Dubosc,Laure Calamy,Benoît Poelvoorde,J...",Sylvain Goldberg,"Michel et Cathy, un couple usé par le temps et...",https://fr.web.img5.acsta.net/c_310_420/img/17...,https://www.allocine.fr/film/fichefilm-323570/...,https://www.allocine.fr/film/fichefilm-323570/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,Enola Holmes 2,Enola Holmes 2,4 novembre 2022,2h 09min,"Action,Aventure,Policier","Grande-Bretagne,U.S.A.",36,2134,102,Harry Bradbeer,"Millie Bobby Brown,Henry Cavill,Helena Bonham ...",Daniel Pemberton,"Marchant dans les pas de son célèbre frère, En...",https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-292822/...,https://www.allocine.fr/film/fichefilm-292822/...
746,Hellraiser,Hellraiser,25 octobre 2023,2h 01min,"Fantastique,Epouvante-horreur",U.S.A.,23,373,45,David Bruckner,"Odessa A’zion,Jamie Clayton,Adam Faison,Drew S...",Ben Lovett,Une jeune femme tombe sur des forces surnature...,https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-186886/...,https://www.allocine.fr/film/fichefilm-186886/...
747,Falcon Lake,Falcon Lake,7 décembre 2022,1h 40min,"Comédie,Comédie dramatique,Drame,Romance","France,Canada",36,1004,86,Charlotte Le Bon,"Joseph Engel,Sara Montpetit,Monia Chokri,Arthu...",Shida Shahabi,Une histoire d'amour et de fantômes.,https://fr.web.img4.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-289988/...,https://www.allocine.fr/film/fichefilm-289988/...
748,Nostalgia,Nostalgia,4 janvier 2023,1h 58min,"Drame,Thriller","Italie,France",38,1387,142,Mario Martone,"Pierfrancesco Favino,Francesco Di Leva,Tommaso...",,"Après 40 ans d'absence, Felice retourne dans s...",https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-303339/...,https://www.allocine.fr/film/fichefilm-303339/...


## **Saving the data to CSV files**

In [202]:
# np.save("csv/categories.npy", ds_categories.array.tolist())
# np.save("csv/countries.npy", ds_countries.array.tolist())
# ds_categories.to_csv('csv/categories.csv', sep = ',', index=False)
# ds_countries.to_csv('csv/countries.csv', sep = ',', index=False)
# df_movies.to_csv('csv/movies_year_2015_to_2019.csv', sep=',', index = False)
# df_movies.to_csv('csv/movies_year_2019_to_2022.csv', sep=',', index = False)
df_movies.to_csv('csv/movies_year_2022_to_2025.csv', sep=',', index = False)

In [203]:
# df_movies_2 = pd.read_csv('csv/movies_year_2015_to_2019.csv', sep = ',')
# df_movies_2 = pd.read_csv('csv/movies_year_2019_to_2022.csv', sep = ',')
df_movies_2 = pd.read_csv('csv/movies_year_2022_to_2025.csv', sep = ',')
df_movies_2

Unnamed: 0,title,original_title,date,duration,categories,countries,star_rating,notes,reviews,directors,actors,composers,summary,url_thumbnail,url_reviews,url_similar_movies
0,Un parfait inconnu,A Complete Unknown,29 janvier 2025,2h 20min,"Biopic,Drame,Musical",U.S.A.,41,2978,358,James Mangold,"Timothée Chalamet,Edward Norton,Elle Fanning,M...",,"New York, 1961. Alors que la scène musicale es...",https://fr.web.img6.acsta.net/c_310_420/img/7b...,https://www.allocine.fr/film/fichefilm-280195/...,https://www.allocine.fr/film/fichefilm-280195/...
1,Better Man,Better Man,22 janvier 2025,2h 16min,"Biopic,Musical",Grande-Bretagne,42,1881,310,Michael Gracey,"Robbie Williams,Jonno Davies,Steve Pemberton,D...",Batu Sener,L'ascension du célèbre chanteur/compositeur br...,https://fr.web.img6.acsta.net/c_310_420/img/3d...,https://www.allocine.fr/film/fichefilm-290583/...,https://www.allocine.fr/film/fichefilm-290583/...
2,Je suis toujours là,Ainda Estou Aqui,15 janvier 2025,2h 15min,"Drame,Thriller","Brésil,France",42,2211,207,Walter Salles,"Fernanda Torres,Fernanda Montenegro,Selton Mel...",Warren Ellis,"Rio, 1971, sous la dictature militaire. La gra...",https://fr.web.img6.acsta.net/c_310_420/img/f1...,https://www.allocine.fr/film/fichefilm-265940/...,https://www.allocine.fr/film/fichefilm-265940/...
3,Jouer avec le feu,Jouer avec le feu,22 janvier 2025,1h 58min,Drame,"France,Belgique",35,1847,285,"Delphine Coulin,Muriel Coulin","Vincent Lindon,Benjamin Voisin,Stefan Crepon,M...",Pawel Mykietyn,"Pierre élève seul ses deux fils. Louis, le cad...",https://fr.web.img6.acsta.net/c_310_420/img/02...,https://www.allocine.fr/film/fichefilm-313778/...,https://www.allocine.fr/film/fichefilm-313778/...
4,Un ours dans le jura,Un ours dans le jura,1 janvier 2025,1h 53min,"Comédie,Thriller",France,38,4677,965,Franck Dubosc,"Franck Dubosc,Laure Calamy,Benoît Poelvoorde,J...",Sylvain Goldberg,"Michel et Cathy, un couple usé par le temps et...",https://fr.web.img5.acsta.net/c_310_420/img/17...,https://www.allocine.fr/film/fichefilm-323570/...,https://www.allocine.fr/film/fichefilm-323570/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,Enola Holmes 2,Enola Holmes 2,4 novembre 2022,2h 09min,"Action,Aventure,Policier","Grande-Bretagne,U.S.A.",36,2134,102,Harry Bradbeer,"Millie Bobby Brown,Henry Cavill,Helena Bonham ...",Daniel Pemberton,"Marchant dans les pas de son célèbre frère, En...",https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-292822/...,https://www.allocine.fr/film/fichefilm-292822/...
746,Hellraiser,Hellraiser,25 octobre 2023,2h 01min,"Fantastique,Epouvante-horreur",U.S.A.,23,373,45,David Bruckner,"Odessa A’zion,Jamie Clayton,Adam Faison,Drew S...",Ben Lovett,Une jeune femme tombe sur des forces surnature...,https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-186886/...,https://www.allocine.fr/film/fichefilm-186886/...
747,Falcon Lake,Falcon Lake,7 décembre 2022,1h 40min,"Comédie,Comédie dramatique,Drame,Romance","France,Canada",36,1004,86,Charlotte Le Bon,"Joseph Engel,Sara Montpetit,Monia Chokri,Arthu...",Shida Shahabi,Une histoire d'amour et de fantômes.,https://fr.web.img4.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-289988/...,https://www.allocine.fr/film/fichefilm-289988/...
748,Nostalgia,Nostalgia,4 janvier 2023,1h 58min,"Drame,Thriller","Italie,France",38,1387,142,Mario Martone,"Pierfrancesco Favino,Francesco Di Leva,Tommaso...",,"Après 40 ans d'absence, Felice retourne dans s...",https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-303339/...,https://www.allocine.fr/film/fichefilm-303339/...
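
Multi-valued columns such as `categories`, `countries` or `actors` appear to be stored as comma-separated strings, so after `pd.read_csv` they come back as plain strings, and empty cells (for example a missing composer) come back as NaN. The sketch below shows one way such columns could be turned back into Python lists. `split_multivalued` is a hypothetical helper, and the small in-memory CSV only stands in for the real file.

```python
import io
import pandas as pd

# Tiny stand-in for 'csv/movies_year_2022_to_2025.csv' (reduced set of columns)
csv_text = (
    'title,categories,composers\n'
    'Un parfait inconnu,"Biopic,Drame,Musical",\n'
    'Better Man,"Biopic,Musical",Batu Sener\n'
)


def split_multivalued(df, columns, sep=','):
    """Return a copy of df where each listed column holds a list of strings."""
    out = df.copy()
    for col in columns:
        # Missing values become empty lists, other values are split on sep
        out[col] = out[col].apply(
            lambda v: [] if pd.isna(v) else [s.strip() for s in str(v).split(sep)]
        )
    return out


df = pd.read_csv(io.StringIO(csv_text))
df_lists = split_multivalued(df, ['categories', 'composers'])
print(df_lists.loc[0, 'categories'])  # ['Biopic', 'Drame', 'Musical']
print(df_lists.loc[0, 'composers'])   # []
```

Splitting on a comma is only safe as long as no individual value contains a comma itself. If that cannot be guaranteed, another separator could be chosen when the lists are joined.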


### **Some difficulties encountered**



For the "similar movies" scraping, we simply retrieve the list of similar movies (their titles). Only once all movies have been scraped, stored in DataFrames and assigned Ids (database table keys) will we be able to associate a movie with the Ids of its similar movies.
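
A minimal sketch of this later association step, under stated assumptions: the column `similar_titles`, the `movie_id` numbering and the sample rows are hypothetical, and titles are assumed to identify a movie uniquely.

```python
import pandas as pd

# Hypothetical data: 'similar_titles' stands for the scraped list of similar movies
df = pd.DataFrame({
    'title': ['Conclave', 'Maria', 'Nosferatu'],
    'similar_titles': [['Maria', 'Unknown movie'], ['Conclave'], []],
})

# Assign an Id to every scraped movie (here simply the row position)
df['movie_id'] = range(len(df))
title_to_id = dict(zip(df['title'], df['movie_id']))

# Replace each similar title by its Id; titles that were never scraped are dropped
df['similar_ids'] = df['similar_titles'].apply(
    lambda titles: [title_to_id[t] for t in titles if t in title_to_id]
)
print(df[['movie_id', 'title', 'similar_ids']])
```

Matching on titles can be ambiguous when two movies share the same title (remakes, for instance). The numeric identifier visible in the `fichefilm-...` URLs might be a more robust key, if it is available for the similar movies as well.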

Note:
We could not find any information on the method used by allocine.fr to build the list of movies similar to a given movie. There is no clear link between the movie categories, nor between the actors; the selection may be driven by an AI algorithm.