# <center> **Webscraping**

## **Overview**


- Web scraping of [www.allocine.fr](https://www.allocine.fr/films/) with Beautiful Soup and Selenium.

We first check the "robots.txt" file to see whether we are allowed to scrape the movie pages.

![robots.txt](images/robots.txt.png)

Nothing there restricts our task, since we only work under the URL <code>https://www.allocine.fr/film/</code><br>
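
The same check can be automated with the standard library. The sketch below is only an illustration: the `robots.txt` body is a made-up sample (not the file actually served by www.allocine.fr), and the helper name `is_allowed` is hypothetical and does not appear elsewhere in this notebook.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
# (NOT the real file served by www.allocine.fr).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /_/
Disallow: /recherche/
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules from text, no network access
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.allocine.fr/film/fichefilm_gen_cfilm=180.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.allocine.fr/recherche/?q=test"))
```

To check the live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which download and parse the rules in one step.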

Next, we go to the **films** section of the AlloCiné site and scrape the information from the drop-down menus on the left (the movie categories and the countries); after that we scrape the movies year by year.

![filtres](images/filtresSMALL.png)
### Sources:
**Beautiful Soup**:
[beautiful-soup-4](https://beautiful-soup-4.readthedocs.io/en/latest/)<br>
[beautiful-soup-4.readthedocs.io](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree)<br>

**Selenium**:<br>
[selenium-python.readthedocs.io](https://selenium-python.readthedocs.io/locating-elements.html)<br>
[selenium.dev/documentation](https://www.selenium.dev/documentation/webdriver/elements/information/)<br>
[selenium.dev/documentation/finders/](https://www.selenium.dev/documentation/webdriver/elements/finders/)<br>
[geeksforgeeks.org/get_property-selenium/](https://www.geeksforgeeks.org/get_property-element-method-selenium-python/)<br>

The links are probably generated dynamically, so we can use XPath with Selenium,<br>
or possibly with lxml.<br>

In this article, https://medium.com/swlh/web-scraping-using-selenium-and-beautifulsoup-adfc8810240a, Selenium is used to scrape the URLs, and Beautiful Soup is then used to scrape the page behind each URL.

## **Imports**

In [68]:
%reset

In [69]:
import os
import re
import io
import math
import copy
import httpx
import requests
import numpy as np
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
from IPython.display import display
from tqdm import tqdm

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# One way to set the driver options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

def _options():
    ''' Another way to set the options '''
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--test-type')
    options.add_argument('--headless')
    options.add_argument('--incognito')
    if os.name == 'nt':  # Windows workaround
        options.add_argument('--disable-gpu')
    options.add_argument('--verbose')
    return options

%config IPCompleter.greedy = True

url_site  = 'https://www.allocine.fr/'
url_films = 'https://www.allocine.fr/films/'

## **Scraping the movies**

### **Scraping the list of movie genres**

In [70]:
# Scrap all categories
r = requests.get(url_films, auth=('user', 'pass'))
if r.status_code != 200:
    print("url_site error")

soup = BeautifulSoup(r.content, 'html.parser')
print(type(soup))

categories = []
elt_categories = soup.find('div', class_='filter-entity-section')
for elt in elt_categories.find_all('li'):
    #print(elt.prettify())
    categories.append(elt.a.text)

print("Nb categories :", len(categories))
ds_categories = pd.Series(categories)

<class 'bs4.BeautifulSoup'>
Nb categories : 37


### **Scraping the list of the movies' countries of origin**

In [71]:
# Scrap all countries
elt_countries = elt_categories.find_next_sibling().find_next_sibling()
elts_items = elt_countries.find_all('li', class_ = 'filter-entity-item')

countries = []
for elt_item in elts_items:
    countries.append(elt_item.find('span').text.strip())

# 'Botswana' is not in the country list but there is a movie with nationality 'Botswana':
#  https://www.allocine.fr/film/fichefilm_gen_cfilm=2577.html
# so we add it manually, along with a few other countries missing from the list.
for extra_country in ('Botswana', 'Namibie', 'Liechtenstein', 'Monaco'):
    if extra_country not in countries:
        countries.append(extra_country)

print(countries)

print("Nb pays :", len(countries))
ds_countries = pd.Series(countries)

['France', 'U.S.A.', 'Afrique du Sud', 'Albanie', 'Algérie', 'Allemagne', "Allemagne de l'Est", "Allemagne de l'Ouest", 'Arabie Saoudite', 'Argentine', 'Arménie', 'Australie', 'Autriche', 'Belgique', 'Bengladesh', 'Bolivie', 'Bosnie-Herzégovine', 'Brésil', 'Bulgarie', 'Burkina Faso', 'Cambodge', 'Cameroun', 'Canada', 'Chili', 'Chine', 'Chypre', 'Colombie', 'Corée', 'Corée du Sud', 'Croatie', 'Cuba', "Côte-d'Ivoire", 'Danemark', 'Egypte', 'Emirats Arabes Unis', 'Espagne', 'Estonie', 'Finlande', 'Grande-Bretagne', 'Grèce', 'Géorgie', 'Hong-Kong', 'Hongrie', 'Inde', 'Indonésie', 'Irak', 'Iran', 'Irlande', 'Islande', 'Israël', 'Italie', 'Japon', 'Jordanie', 'kazakhstan', 'Kenya', 'Kosovo', 'Lettonie', 'Liban', 'Lituanie', 'Luxembourg', 'Macédoine', 'Malaisie', 'Maroc', 'Mexique', 'Monténégro', 'Nigéria', 'Norvège', 'Nouvelle-Zélande', 'Pakistan', 'Palestine', 'Pays-Bas', 'Philippines', 'Pologne', 'Portugal', 'Pérou', 'Qatar', 'Roumanie', 'Russie', 'République dominicaine', 'République tchè

### **Scraping the links to the per-year movie pages**

**We use Selenium**<br>
With Beautiful Soup, some elements are **decorated** and some links are **invisible**, so they cannot be scraped directly.<br>
The workaround we found is to use Selenium, which among other things lets us:
- use XPath expressions,
- retrieve all the elements, undecorated.

We start from a list of years, for example [1980, ... 2000]

[1980 - 1989] then [1990 - 1999] ... up to [2000 - 2009].<br>
(this amounts to more than 40000 movies and XXX reviews).
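
Because the site lists years under decade pages, the requested years first have to be mapped to their decades. A minimal sketch of that grouping follows; the helper name `group_years_by_decade` is hypothetical and not part of the notebook, which computes the decades inline.

```python
from collections import defaultdict

def group_years_by_decade(years):
    """Map each decade (e.g. 1980) to the sorted list of requested years in it."""
    decades = defaultdict(list)
    for year in sorted(set(years)):
        decades[10 * (year // 10)].append(year)  # 1988 -> 1980, 1991 -> 1990
    return dict(decades)

print(group_years_by_decade(range(1988, 1992)))
# {1980: [1988, 1989], 1990: [1990, 1991]}
```

Keeping the years grouped per decade means each decade page only has to be opened once, whatever the number of years requested in it.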

**Links as seen in the Chrome HTML inspector**

![links_decades_inspector](images/link_decades_inspector.png)


**Links as seen by Beautiful Soup**

![link](images/link_decades_bs4.png)

In [None]:
# To show the limit of Beautiful Soup
r = requests.get(url_films, auth=('user', 'pass'))
if r.status_code != 200:
    print("url_site error")
    
soup = BeautifulSoup(r.content, 'html.parser')
# print(soup.prettify())

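# Use the freshly parsed soup rather than the element kept from the previous cell
elt_decades = soup.find('div', class_='filter-entity-section').find_next_sibling()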
print(elt_decades.prettify())

In [87]:
# Scrap the links of the years

# Input: list of years to scrap
lst_years_to_scrap = list(range(1990, 1995))

# Alternatively
# (two functions to write)
month_to_scrap = 3 # March (current year)

lst_decades_to_scrap = list(set([10 * (year // 10) for year in lst_years_to_scrap]))
lst_years_to_scrap = [str(year) for year in lst_years_to_scrap]
lst_decades_to_scrap = [str(decade) for decade in lst_decades_to_scrap]

driver = webdriver.Chrome(options = _options())
driver.get(url_films)
elts_decades = driver.find_elements(By.XPATH, '/html/body/div[2]/main/section[4]/div[1]/div/div[3]/div[2]/ul/li')

dict_year_link = {}
for elt_decade in tqdm(elts_decades):
    elt_a = elt_decade.find_element(By.TAG_NAME, 'a')
    if not(elt_a.get_attribute('title')[:4] in lst_decades_to_scrap):
        continue

    driver2 = webdriver.Chrome(options = options)
    url_decade = elt_a.get_attribute('href').strip()

    driver2.get(url_decade)
    elts_years = driver2.find_elements(By.XPATH, '/html/body/div[2]/main/section[4]/div[1]/div/div[3]/div[3]/ul/li')

    for elt_year in elts_years:
        year = elt_year.find_element(By.TAG_NAME, 'a').get_attribute('title').strip()
        if year in lst_years_to_scrap:
            link = elt_year.find_element(By.TAG_NAME, 'a').get_attribute('href').strip()
            dict_year_link[year] = link
    driver2.quit()  # quit() also stops the chromedriver process, close() only closes the window

for year, url_year in dict_year_link.items():
    print("year", year, '  ----  link', url_year)

driver.quit()

100%|██████████| 15/15 [00:06<00:00,  2.30it/s]


year 1994   ----  link https://www.allocine.fr/films/decennie-1990/annee-1994/
year 1993   ----  link https://www.allocine.fr/films/decennie-1990/annee-1993/
year 1992   ----  link https://www.allocine.fr/films/decennie-1990/annee-1992/
year 1991   ----  link https://www.allocine.fr/films/decennie-1990/annee-1991/
year 1990   ----  link https://www.allocine.fr/films/decennie-1990/annee-1990/


### **Scraping the movies from the year links**


To scrape the directors and actors, we go to the movie's **casting** page, scrape the main actors shown in the mosaic, then scrape the list of supporting actors.

![all_actors](images/scraping_all_actors.png)

Note how the URL of the similar-movies page can be derived from the URL of the movie page:<br>
https://www.allocine.fr/film/fichefilm_gen_cfilm=180.html<br>
https://www.allocine.fr/film/fichefilm-180/similaire/<br>

This always follows the same pattern, so we can automate it without scraping the URLs.
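
As an illustration of this pattern, here is a small sketch that derives the related URLs from a movie URL. The function name `derived_urls` is hypothetical; the notebook itself does the same thing inline with `str.replace` and slicing. The sketch assumes the movie URL always ends with `fichefilm_gen_cfilm=<id>.html`.

```python
import re

def derived_urls(url_movie):
    """Build the reviews and similar-movies URLs from a movie page URL.

    Assumes the movie URL ends with 'fichefilm_gen_cfilm=<id>.html'.
    """
    match = re.search(r"fichefilm_gen_cfilm=(\d+)\.html$", url_movie)
    if match is None:
        raise ValueError(f"Unexpected movie URL: {url_movie}")
    # Keep everything before 'fichefilm_gen_cfilm=' and switch to the 'fichefilm-<id>' form
    base = url_movie[:match.start()] + f"fichefilm-{match.group(1)}"
    return {
        "reviews": base + "/critiques/spectateurs/",
        "similar": base + "/similaire/",
    }

urls = derived_urls("https://www.allocine.fr/film/fichefilm_gen_cfilm=180.html")
print(urls["similar"])
# https://www.allocine.fr/film/fichefilm-180/similaire/
```

Compared with slicing off the last five characters, the regular expression fails loudly when a URL does not match the expected shape instead of silently producing a wrong link.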

In [88]:
month_FR = ['janvier', 'février', 'mars', 'avril', 'mai', 'juin', 'juillet', 'août', 'septembre', 'octobre', 'novembre', 'décembre']
month_EN = ['january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december']

def number_pages_per_year(soup_year):
    ''' Return the number of pages for one year (assumed to be 1 if there is no pagination block) '''
    pagination = soup_year.find('div', class_='pagination-item-holder')
    if pagination is None:
        return 1
    return int(pagination.find_all('span')[-1].text)

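def get_month(date):
    ''' Extract the month number (1-12) from a French date such as "28 février 1990"; return None if no month is found '''
    for word in date.lower().split():
        if word in month_FR:
            return month_FR.index(word) + 1
    return None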

def delete_thumbnails():
    '''Delete all files in thumbnail directory'''
    try:
        folder_name = os.path.join(os.getcwd(), 'thumbnails')  # portable path (not Windows-only)
        files = os.listdir(folder_name)
        for file in files:
            file_path = os.path.join(folder_name, file)
            if os.path.isfile(file_path):
                os.remove(file_path)
        print("All files deleted successfully.")
    except OSError:
        print("Error occurred while deleting files.")

def get_title(soup_movie):
    ''' Return the title of the movie '''

    title = soup_movie.find('div', class_ = "titlebar-title titlebar-title-xl").text
    elts = soup_movie.find_all('div', class_ = 'meta-body-item')
    for elt in elts:
        elts_span = elt.find_all('span')
        for elt_span in elts_span:
            if "Titre original" in elt_span.get_text(strip = True):
                return title, elt_span.find_next_sibling().get_text(strip = True)
    return title, title

def get_date_duration_categories(soup_movie):
    ''' Return date, duration and categories (as string) of the movie '''
    elt = soup_movie.find('div', class_="meta-body-item meta-body-info")
    date, duration, categories = '', '', ''
    
    if False: # Not really accurate
        text = elt.get_text(strip=True)
        # print(text)
        if text.count('|') == 1:
            s1, s2 = text.split('|')
            categories = s2.strip()
        elif text.count('|') == 2:
            s1, s2, s3 = text.split('|')
            date = s1[:-8].strip()
            duration = s2.strip()
            categories = s3.strip()
        return date, duration, categories

    text = elt.get_text(strip = True)
    for elt_span in elt.find_all("span"):
        classes = elt_span.get('class') or []  # a <span> without a class attribute returns None
        if 'date' in classes:
            date = elt_span.get_text(strip = True)
            text = text.replace(date, '')

        elif 'meta-release-type' in classes:
            text = text.replace(elt_span.get_text(strip = True), '')

        elif 'dark-grey-link' in classes:
            categories += elt_span.get_text(strip = True) + ','
            for item in elt_span.get_text(strip = True).split():
                text = text.replace(item, '')

    text = text.replace('|', '')
    text = text.replace(',', '')

    # Drop the trailing comma; rstrip is safe even when no category was found (empty string)
    categories = categories.rstrip(',')

    if len(text.strip()):
        duration = text.strip()

    return date, duration, categories

def get_country(soup_movie):
    ''' Return country of the movie '''
    elts_section_title = soup_movie.find_all('div', class_ = 'section-title')
    for elt_section_title in elts_section_title:
        elt_h2 = elt_section_title.find('h2')

        if elt_h2 and 'Infos techniques' in elt_h2.text.strip():
            elt_country = elt_section_title.find_next_sibling()
            assert "Nationalité" in elt_country.find("span", class_ = "what light").text.strip()
            elt_span_that = elt_country.find("span", class_ = "that")
            elts_span_country = elt_span_that.find_all("span")
            lst_countries = []
            for elt_span_country in elts_span_country:
                lst_countries.append(elt_span_country.text.strip())
            return ','.join(lst_countries)
    return ''

def get_directors(soup_casting):
    ''' Return list of directors '''
    elt_director_section = soup_casting.find('section', class_='section casting-director')
    if elt_director_section:
        elt_temp = elt_director_section.find_next()
        elts_directors = elt_temp.find_next_sibling().find_all('div', class_ = 'card person-card person-card-col')
        lst_directors = [elt_director.find('a').text for elt_director in elts_directors]
        return ','.join(lst_directors)
    return ''

def get_actors(soup_casting):
    ''' Return list of actors (maximum 30) '''
    elt_actor_section = soup_casting.find('section', class_ = 'section casting-actor')
    if elt_actor_section:
        elt_temp = elt_actor_section.find_next()
        # scrap main actors (maximum eight actors in the mosaic, see image above)
        elts_actors = elt_temp.find_next_sibling().find_all('div', class_ = 'card person-card person-card-col')
        lst_actors = [elt_actor.find('figure').find('span')['title'] for elt_actor in elts_actors]
        elts_actors = elt_actor_section.find_all('div', class_ = 'md-table-row')
        # scrap list of actors below the mosaic (we scrap maximum of (8 + 22) 30 actors in total)
        lst_actors.extend([elt_actor.find('a').text for elt_actor in elts_actors[:22] if elt_actor.find('a')])
        return ','.join(lst_actors)
    return ''

def get_composers(soup_casting):
    ''' Scrap the name(s) of the music composer(s) '''
    elts_sections = soup_casting.find_all("div", class_ = "section casting-list-gql")
    for elt_section in elts_sections:
        elt_title = elt_section.find('div', class_ = 'titlebar section-title').find('h2')
        if 'Soundtrack' in elt_title.text:
            lst_composers = []
            elts_composers = elt_section.find_all('div', class_ = 'md-table-row')
            for elt_composer in elts_composers:
                elts_span = elt_composer.find_all('span')
                if len(elts_span) > 1 and 'Compositeur' in elts_span[1].text.strip():
                    lst_composers.append(elts_span[0].text.strip())
            return ','.join(lst_composers)
    return ''

def get_summary(soup_movie):
    elt_synopsis = soup_movie.find('section', class_ = "section ovw ovw-synopsis")
    if elt_synopsis:
        elt_content = elt_synopsis.find('p', class_ = 'bo-p')
        if elt_content:
            return elt_content.text.strip()
    return ''

def get_thumbnail(soup_movie):
    elt = soup_movie.find('figure', class_ = 'thumbnail')
    return elt.span.img['src']

def save_thumbnail(title, url_thumbnail):
    '''Save the thumbnail as image file in directory "thumbnails"'''
    folder_name = os.path.join(os.getcwd(), 'thumbnails')  # portable path (not Windows-only)
    title2 = title.replace('-', '')
    image_name = f"thumbnail-{title2}.jpg"
    try:
        image = httpx.get(url_thumbnail)
        # "with" closes the file even if writing fails, and avoids a NameError
        # in a "finally" block when open() itself raises
        with open(os.path.join(folder_name, image_name), "wb") as file:
            file.write(image.content)
        # Display thumbnail in Jupyter / console
        img = Image.open(io.BytesIO(image.content))
        plt.imshow(img)
        plt.axis('off')
        plt.show()
        # To change resolution: https://www.geeksforgeeks.org/change-image-resolution-using-pillow-in-python/
    except IOError:
        print("Cannot write or read the thumbnail file")

def get_similar_movies(url_similar_movies):
    ''' return list of similar movies '''

    lst_similar_movies = []
    # print('url_similar_movies:', url_similar_movies)

    # get the 'similar movies page' soup
    r = requests.get(url_similar_movies, auth=('user', 'pass'))
    soup_similar_movie = BeautifulSoup(r.content, 'html.parser')
    if r.status_code != 200:
        return lst_similar_movies
    
    elts_section = soup_similar_movie.find_all('ul', class_ = "section")
    if elts_section:
        elts_similar_movies = elts_section[0].find_all('li', class_ = 'mdl')
        if elts_similar_movies:
            for elt_similar_movie in elts_similar_movies:
                elt_title = elt_similar_movie.find('h2', class_ = 'meta-title')
                lst_similar_movies.append(elt_title.find('a').text.strip())

    return lst_similar_movies

def scrap_movie(elt_movie, use_Selenium):
    ''' scrap all movie informations '''
    
    # get the movie soup
    url_movie = url_site + elt_movie.h2.a.get('href')[1:]
    r = requests.get(url_movie, auth=('user', 'pass'))
    soup_movie = BeautifulSoup(r.content, 'html.parser')
    
    # ------------- #
    #     Title     #
    # ------------- #
    title, original_title = get_title(soup_movie)

    # ------------ #
    #    Ratings   #
    # ------------ #
    star_rating, nb_notes, nb_reviews = get_ratings(soup_movie, use_Selenium)
    nb_reviews = convert_to_integer(nb_reviews)
    if nb_reviews < nb_minimum_critics:
        print("Not enough reviews, Do not scrape the movie:" , title)
        return 'Reviews', None
    
    # --------------------------------- #
    #   Date, duration and categories   #
    # --------------------------------- #
    date, duration, categories = get_date_duration_categories(soup_movie)

    # if not(get_month(date) == month):
    #     return 'Month', None

    # We do not scrape 'Documentaries' or movie with only category : Divers
    if 'Documentaire' in categories or categories.strip() == 'Divers':
        print('We do not scrape those category film:', title)
        return 'Category', None
    
    print('Title:', title)

    # ---------------- #
    #     Countries    #
    # ---------------- #
    countries = get_country(soup_movie)

    # ---------------------------------- #
    #   Directors / Actors / Composers   #
    # ---------------------------------- #
    directors, actors, composers = '', '', ''
    is_casting_section = False
    elts_end_section = soup_movie.find_all('a', class_ = 'end-section-link')

    if elts_end_section:
        for elt_end_section in elts_end_section:

            if 'Casting' in elt_end_section['title']:
                # If there is a link to the casting section
                is_casting_section = True
                link_casting = elt_end_section['href']
                # url_site already ends with '/', so strip any leading '/' from the link
                r = requests.get(url_site + link_casting.lstrip('/'), auth=('user', 'pass'))
                soup_casting = BeautifulSoup(r.content, 'html.parser')

                # Get directors' list
                directors = get_directors(soup_casting)
                # Get actors' list
                actors = get_actors(soup_casting)
                # Composers' list
                composers = get_composers(soup_casting)
                break 

    if not(is_casting_section):
        # No casting section
        # for example, animated movies do not have a casting section,
        # and neither do some other movies: https://www.allocine.fr/film/fichefilm_gen_cfilm=27635.html

        # Get directors' list
        elt_director = soup_movie.find('div', class_ = "meta-body-item meta-body-direction")
        if elt_director:
            elts_span = elt_director.find_all('span')
            assert len(elts_span) >= 2
            directors =  elts_span[1].get_text().strip()
            # directors.append(elts_span[1].get_text().strip())

        # Get actors' list
        elt_actor = soup_movie.find('div', class_ = "meta-body-item meta-body-actor")
        if elt_actor:
            lst_actors = []
            for elt_a in elt_actor.find_all('a'):
                lst_actors.append(elt_a.text.strip())
            actors = ','.join(lst_actors)

    # ------------ #
    #   Summary    #
    # ------------ #
    summary = get_summary(soup_movie)[:180]

    # ------------ #
    #   Thumbnail  #
    # ------------ #
    url_thumbnail = get_thumbnail(soup_movie)
    # save_thumbnail(title, url_thumbnail)

    # ------------------- #
    #     url_reviews     #
    # ------------------- #
    url_reviews = url_movie.replace('_gen_cfilm=', '-')[:-5] + '/critiques/spectateurs/'

    # ------------------- #
    #    Similar Movies   #
    # ------------------- #
    url_similar_movies = url_movie.replace('_gen_cfilm=', '-')[:-5] + '/similaire/'
    # soup_similar_movies
    # lst_similar_movies = get_similar_movies(url_similar_movies)
    # print(lst_similar_movies)

    return 'OK', (title, original_title, date, duration, categories, countries, star_rating, \
                  nb_notes, nb_reviews, directors, actors, composers,\
                  summary, url_thumbnail, url_reviews, url_similar_movies)

def get_ratings(soup_movie, use_Selenium):
    ''' Scrap the ratings of the movie.

        Return:
         - star_rating: star rating (0.5 to 5),
         - nb_notes:   number of votes,
         - nb_reviews: number of reviews (written reviews),

        Args:
         - soup_movie:   object BeautifulSoup of the movie,
         - use_Selenium: boolean to choose the method to scrape,
                         True:  Selenium's method,      (SLOWER)
                         False: Beautifulsoup's method. (FASTER)
    '''

    star_rating, nb_notes, nb_reviews = None, None, None

    if use_Selenium:
        elts_ratings = driver.find_elements(By.CLASS_NAME, 'rating-item')

        for elt_rating in elts_ratings:
            elt = None
            try:
                elt = elt_rating.find_element(By.TAG_NAME, 'a')
                if 'Spectateurs' in elt.text.strip():
                    elt_stareval_note = elt_rating.find_element(By.CLASS_NAME, 'stareval-note')
                    star_rating = elt_stareval_note.text.strip()
                    elt_stareval_review = elt_rating.find_element(By.CLASS_NAME, 'stareval-review')
                    stareval_review = elt_stareval_review.text.strip()
                    if stareval_review.count(',') == 1:
                        nb_notes, nb_reviews = stareval_review.split(',')
                        nb_notes = nb_notes.split()[0]
                        nb_reviews = nb_reviews.split()[0]
                    elif 'note' in stareval_review:
                        nb_notes = stareval_review.strip()
                    elif 'critique' in stareval_review:
                        nb_reviews = stareval_review
                    else:
                        assert False

            except:
                print('no tag "a" in elt_rating')

    else: # use beautiful soup
        elts_ratings = soup_movie.find_all('div', class_ = 'rating-item-content')
        for elt_rating in elts_ratings:
            if 'Spectateurs' in elt_rating.find("span").text.strip():
                elt_stareval_note = elt_rating.find("span", class_ = "stareval-note")
                star_rating = elt_stareval_note.text.strip()
                elt_stareval_review = elt_rating.find("span", class_ = "stareval-review")
                stareval_review = elt_stareval_review.text.strip()
                if stareval_review.count(',') == 1:
                    nb_notes, nb_reviews = stareval_review.split(',')
                    nb_notes = nb_notes.split()[0]
                    nb_reviews = nb_reviews.split()[0]
                elif 'note' in stareval_review:
                    nb_notes = stareval_review.strip()
                elif 'critique' in stareval_review:
                    nb_reviews = stareval_review
                else:
                    assert False

    return star_rating, nb_notes, nb_reviews

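def convert_to_integer(str_nb_reviews):
    ''' Convert the string str_nb_reviews (e.g. "67" or "67 critiques") into an integer; return 0 if it contains no digits '''
    if not str_nb_reviews:
        return 0
    match = re.search(r'\d+', str_nb_reviews)
    # match.group() is the matched digits; match.string would be the whole input string
    return int(match.group()) if match else 0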


# ---------------------------------- #
#                                    #
#             Main loop              #
#                                    #
# ---------------------------------- #
        
# loop on all years to scrap (through links previously scrapped)
# then loop on all pages of the year
# then loop on all movies on one page

# delete_thumbnails()

nb_minimum_critics = 20
nb_consecutives_unpopular_movies_to_break = 20
use_Selenium = False

# Create Selenium driver
driver = None
if use_Selenium:
    driver = webdriver.Chrome(options = _options())

counter_movies                          = 0
counter_scraped_movies                  = 0
counter_not_scraped_not_enough_reviews  = 0
counter_not_scraped_categories          = 0
movies = []

for year, url_year in dict_year_link.items():
    
    r = requests.get(url_year, auth=('user', 'pass'))
    if r.status_code != 200:
        print("url_site error")

    soup_year = BeautifulSoup(r.content, 'html.parser')
    nb_pages = number_pages_per_year(soup_year)
    consecutive_number_of_unpopular_movies = 0

    for i in range(nb_pages): # Should be reduced, as some movies are totally unknown and have very little information
        url_year_page = url_year + f'?page={i+1}'
        r = requests.get(url_year_page, auth=('user', 'pass'))
        if r.status_code != 200:
            print("url_site error")

        print(f"***  Year {year}  ---  Page {i+1}  ***")
        soup_movies = BeautifulSoup(r.content, 'html.parser')
        elt_movies = soup_movies.find_all('li', class_='mdl')

        for elt_movie in elt_movies:
            # print('---------------------------------------------------------------- ')
            status, movie = scrap_movie(elt_movie, use_Selenium)
            counter_movies += 1
            
            if status == 'Reviews':
                counter_not_scraped_not_enough_reviews += 1
                consecutive_number_of_unpopular_movies += 1
                if consecutive_number_of_unpopular_movies == nb_consecutives_unpopular_movies_to_break:
                    # Reached the number of consecutive "unpopular" movies, so we stop scraping this year.
                    print(f"Reached {nb_consecutives_unpopular_movies_to_break} consecutive 'unpopular' movies: BREAK")
                    break

            else:
                consecutive_number_of_unpopular_movies = 0 # Reset the number of unpopular movies

                if status == 'OK':
                    counter_scraped_movies += 1
                    movies.append(movie)
                else:
                    assert status == 'Category'
                    counter_not_scraped_categories += 1

        if consecutive_number_of_unpopular_movies == nb_consecutives_unpopular_movies_to_break:
            # Stop scrapping for this year
            break

if driver:
    driver.quit()  # quit() also stops the chromedriver process

df_movies = pd.DataFrame(movies, columns = ['title', 'original_title', 'date', 'duration', 'categories', \
                                            'countries', 'star_rating', 'notes', 'reviews', \
                                            'directors', 'actors', 'composers', 'summary', \
                                            'url_thumbnail', 'url_reviews', 'url_similar_movies'])

# Display some infos
print("\n*** Scraping summary ***")
print("Nb movies scanned:", counter_movies)
print("Nb movies scraped:", counter_scraped_movies)

if counter_not_scraped_not_enough_reviews:
    print("Not scraped (not enough reviews):", counter_not_scraped_not_enough_reviews)
if counter_not_scraped_categories:
    print("Not scraped (excluded category):", counter_not_scraped_categories)

***  Year 1994  ---  Page 1  ***
Title: Forrest Gump
Title: Pulp Fiction
Title: Les Evadés
Title: Léon
Title: Entretien avec un vampire
Title: Le Roi Lion
Title: Blown Away
Title: Quatre mariages et un enterrement
Title: La Cité de la peur
Title: The Mask
Title: Nell
Title: Highlander III
Title: Un Indien dans la ville
Title: Chungking Express
Title: Dumb and Dumber
***  Year 1994  ---  Page 2  ***
Not enough reviews, Do not scrape the movie: Le voyeur
Title: Stargate, la porte des étoiles
Title: Speed
Title: Le Péril jeune
Title: Richie Rich
Title: Wyatt Earp
Title: Ace Ventura, détective chiens et chats
Title: Tueurs nés
Title: Richard au pays des livres magiques
Title: Gazon maudit
Title: True Lies
Not enough reviews, Do not scrape the movie: Vanya, 42e rue
Title: Wolf
Title: Grosse fatigue
Title: The Crow
***  Year 1994  ---  Page 3  ***
Title: Pompoko
Title: Harcèlement
Title: Fresh
Title: Junior
Title: Bébé part en vadrouille
Title: Farinelli
Title: Muriel
Title: Swimming With Sh

## **Some scraping results**

**Scraping the year 1980**:<br>
We only keep movies with at least 20 reviews:

![scrapping_year_1980_more_20_critics](images/scrapping_year_1980_critics_more_20.png)

We only keep movies with at least 10 reviews:

![scrapping_year_1980_more_10_critics](images/scrapping_year_1980_critics_more_10.png)

**Scraping the 1980s decade**:<br>
We only keep movies with at least 20 reviews:

![scrapping_decade_80_more_20_critics](images/scrapping_decade_80_critics_more_20.png)

In [None]:
df_movies

Unnamed: 0,title,original_title,date,duration,categories,countries,star_rating,notes,reviews,directors,actors,composers,summary,url_thumbnail,url_reviews,url_similar_movies
0,Music Box,Music Box,28 février 1990,2h 05min,Drame,U.S.A.,39,725,67,Costa-Gavras,"Jessica Lange,Armin Mueller-Stahl,Frederic For...",Philippe Sarde,"Ann Talbot, brillante avocate de Chicago, est ...",https://fr.web.img3.acsta.net/c_310_420/medias...,https://www.allocine.fr/film/fichefilm-5424/cr...,https://www.allocine.fr/film/fichefilm-5424/si...
1,Le Cercle des poètes disparus,Dead Poets Society,17 janvier 1990,2h 08min,"Comédie,Comédie dramatique,Drame",U.S.A.,43,53586,777,Peter Weir,"Robin Williams,Ethan Hawke,Robert Sean Leonard...",Maurice Jarre,"Todd Anderson, un garçon plutôt timide, est en...",https://fr.web.img2.acsta.net/c_310_420/medias...,https://www.allocine.fr/film/fichefilm-5280/cr...,https://www.allocine.fr/film/fichefilm-5280/si...
2,Trop belle pour toi,Trop belle pour toi,12 mai 1989,1h 32min,Comédie dramatique,France,31,988,85,Bertrand Blier,"Josiane Balasko,Gérard Depardieu,Carole Bouque...",Francis Lai,"Bernard, chef d'entreprise, est marié à une tr...",https://fr.web.img5.acsta.net/c_310_420/medias...,https://www.allocine.fr/film/fichefilm-4735/cr...,https://www.allocine.fr/film/fichefilm-4735/si...
3,Blue Steel,Blue Steel,25 avril 1990,1h 40min,"Action,Policier,Thriller",U.S.A.,24,554,70,Kathryn Bigelow,"Jamie Lee Curtis,Ron Silver,Clancy Brown,Tom S...",Brad Fiedel,"Jeune recrue de la police, Megan Turner abat u...",https://fr.web.img3.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-5597/cr...,https://www.allocine.fr/film/fichefilm-5597/si...
4,Retour vers le futur II,Back to the Future Part II,20 décembre 1989,1h 48min,"Aventure,Comédie,Science Fiction",U.S.A.,43,66752,769,Robert Zemeckis,"Michael J. Fox,Christopher Lloyd,Lea Thompson,...",Alan Silvestri,"Lors de son premier voyage en 1985, Marty a co...",https://fr.web.img6.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-5247/cr...,https://www.allocine.fr/film/fichefilm-5247/si...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
864,Permanent Vacation,Permanent Vacation,25 avril 1984,1h 15min,Comédie dramatique,U.S.A.,29,281,33,Jim Jarmusch,"Chris Parker,Leila Gastil,John Lurie,Richard B...","Jim Jarmusch,John Lurie",Deux jours et demi de la vie d’Aloysious Parke...,https://fr.web.img5.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-15240/c...,https://www.allocine.fr/film/fichefilm-15240/s...
865,Les Charlots contre Dracula,Les Charlots contre Dracula,8 avril 2019,1h 25min,"Comédie,Fantastique",France,25,233,31,"Jean-Pierre Desagnat,Jean-Pierre Vergne","Gérard Jugnot,Gérard Rinaldi,Gérard Filipelli,...",Les Charlots,Dracula Junior est bien embêté. Seule une femm...,https://fr.web.img5.acsta.net/c_310_420/pictur...,https://www.allocine.fr/film/fichefilm-33282/c...,https://www.allocine.fr/film/fichefilm-33282/s...
866,La Maison au fond du parc,La casa sperduta nel parco,25 mars 1981,1h 31min,"Epouvante-horreur,Thriller",Italie,19,68,23,Ruggero Deodato,"David Hess,Annie Belle,Christian Borromeo,Giov...",Riz Ortolani,"Deux truands, responsables d'un trafic de voit...",https://fr.web.img6.acsta.net/c_310_420/medias...,https://www.allocine.fr/film/fichefilm-129849/...,https://www.allocine.fr/film/fichefilm-129849/...
867,Retour à la 36ème chambre,Shao Lin da peng da shi,1 octobre 2008,1h 39min,"Action,Comédie,Historique,Arts Martiaux","Chine,Hong-Kong",37,221,25,Chia-Liang Liu,"Szu-Chia Chen,Kara Hui,Lung Wei Wang,Yeong-mun...",Eddie Wang,Des ouvriers exploités engagent un acteur pour...,https://fr.web.img4.acsta.net/c_310_420/medias...,https://www.allocine.fr/film/fichefilm-57602/c...,https://www.allocine.fr/film/fichefilm-57602/s...


## **Saving the data to CSV files**

In [None]:
# np.save("csv/categories.npy", ds_categories.array.tolist())
# np.save("csv/countries.npy", ds_countries.array.tolist())
# ds_categories.to_csv('csv/categories.csv', sep = ',', index=False)
# ds_countries.to_csv('csv/countries.csv', sep = ',', index=False)

df_movies.to_csv('csv/movies_year_1990_to_1995.csv', sep=',', index = False)
# df_movies.to_csv('csv/movies_year_1982.csv', sep = ',', index=False)
# df_movies.to_csv('csv/movies_year_1981.csv', sep = ',', index = False)
# df_movies.to_csv('csv/movies_decade_80.csv', sep = ',', index = False)
# df_movies.to_csv('csv/movies_decade_90.csv', sep = ',', index = False)

### **Some difficulties encountered**



When scraping the "similar movies", we simply collect the list of similar movies (their titles). Only once all the movies have been scraped, stored in DataFrames, and assigned Ids (database table keys) can we associate a movie with the Ids of its similar movies.
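
This deferred linking step could look roughly like the sketch below. It is only an illustration under assumptions: the function name `link_similar_movies`, the dictionary keys `title` and `similar_titles`, and the choice of using the list position as Id are all hypothetical, and the sample movies are made up.

```python
def link_similar_movies(movies):
    """Replace similar-movie titles with the Ids of the scraped movies.

    `movies` is a list of dicts with the (hypothetical) keys 'title' and
    'similar_titles'; the Id of a movie is simply its position in the list.
    Similar titles that were never scraped are skipped.
    """
    title_to_id = {movie["title"]: movie_id for movie_id, movie in enumerate(movies)}
    return {
        movie_id: [title_to_id[t] for t in movie["similar_titles"] if t in title_to_id]
        for movie_id, movie in enumerate(movies)
    }

# Made-up sample data, for illustration only
sample_movies = [
    {"title": "Movie A", "similar_titles": ["Movie B", "Movie not scraped"]},
    {"title": "Movie B", "similar_titles": ["Movie A"]},
]
print(link_similar_movies(sample_movies))
# {0: [1], 1: [0]}
```

Note that matching on the title alone is ambiguous when two movies share the same title; in this sketch the last one scraped wins, so a real pipeline would probably need a more reliable key.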

Note:
We could not find any information on the method used by "allocine.com" to build the list of movies similar to a given movie. There is no obvious link between the movie categories or between the actors; an AI algorithm may be behind this selection.