# <div style='color:white;background: #005792;text-align: center;padding: 15px 0'>Recommendations - Web Scraping on MovieLens</div>

## Participants
* Samantha
* Rachelle
* Andrew

## <div style='background: #005792;text-align: center;padding: 15px 0'> <a style= 'color:white;' >Global variable configuration</a></div>

### Importing libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import logging
import time

### File paths

In [None]:
transformed_dir = '/home/dstrec/dstrec/010_data/001_transformed'

file_movielens_final = f"{transformed_dir}/movielens.csv"

### Loading the files

In [2]:
movielens_final = pd.read_csv(file_movielens_final, index_col='movieId')

## <div style='background: #005792;text-align: center;padding: 15px 0'> <a style= 'color:white;' >Data preparation</a></div>

### Displaying the dataset

In [3]:
movielens_final.head()

movieId,title,genres,imdbId,average_rating,most_common_tag
1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,tt0114709,3.9,Pixar
2,Jumanji,Adventure|Children|Fantasy,tt0113497,3.2,Robin Williams
3,Grumpier Old Men,Comedy|Romance,tt0113228,3.2,moldy
4,Waiting to Exhale,Comedy|Drama|Romance,tt0114885,2.9,chick flick
5,Father of the Bride Part II,Comedy,tt0113041,3.1,steve martin


### Web scraping the directors

In [4]:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

def fetch_director(imdb_id):
    """Return the director found on the IMDb title page, or 'N/A' on failure."""
    url = f"https://www.imdb.com/title/{imdb_id}/"
    try:
        # A timeout keeps a stalled connection from blocking a worker thread indefinitely
        response = requests.get(url, headers=header, timeout=10)
        if response.status_code != 200:
            return 'N/A'
        soup = BeautifulSoup(response.content, 'html.parser')
        # Take the first link whose href contains 'tt_ov_dr'
        director_tag = soup.find('a', href=lambda x: x and 'tt_ov_dr' in x)
        return director_tag.text.strip() if director_tag else 'N/A'
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for IMDb ID {imdb_id}: {e}")
        return 'N/A'

# Fetch the pages concurrently; executor.map returns results in input order
with ThreadPoolExecutor(max_workers=10) as executor:
    directors = list(executor.map(fetch_director, movielens_final['imdbId']))

movielens_final['Director'] = directors
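
The scraping function above returns `'N/A'` as soon as a request fails or a non-200 status comes back, so a transient error is recorded as a missing director. A minimal sketch of a retry wrapper with a short pause between attempts follows. The helper `with_retries` and its parameters `retries`, `delay` and `placeholder` are hypothetical names introduced for illustration, and the demonstration uses a stand-in fetcher rather than a real request.

```python
import time

def with_retries(fetch, retries=3, delay=1.0, placeholder='N/A'):
    """Wrap `fetch` so that a placeholder result is retried a few times.

    Hypothetical helper for this sketch; not part of the notebook.
    """
    def wrapped(key):
        result = placeholder
        for attempt in range(retries):
            result = fetch(key)
            if result != placeholder:
                return result
            # Pause a little longer after each failure, except after the last attempt
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))
        return result
    return wrapped

# Offline demonstration: a stand-in fetcher that fails twice, then succeeds
calls = {'count': 0}

def flaky_fetch(imdb_id):
    calls['count'] += 1
    return 'N/A' if calls['count'] < 3 else 'John Lasseter'

fetch_with_retries = with_retries(flaky_fetch, retries=3, delay=0.01)
director = fetch_with_retries('tt0114709')
print(director, calls['count'])
```

In the notebook, the wrapped function could be passed to the executor in place of `fetch_director`, for example `executor.map(with_retries(fetch_director), movielens_final['imdbId'])`. One trade-off: a page that genuinely has no director link is also retried, which costs extra requests.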

### Displaying the results

In [None]:
movielens_final.head()

movieId,title,genres,imdbId,average_rating,most_common_tag,Director
1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,tt0114709,3.9,Pixar,John Lasseter
2,Jumanji,Adventure|Children|Fantasy,tt0113497,3.2,Robin Williams,Joe Johnston
3,Grumpier Old Men,Comedy|Romance,tt0113228,3.2,moldy,Howard Deutch
4,Waiting to Exhale,Comedy|Drama|Romance,tt0114885,2.9,chick flick,Forest Whitaker
5,Father of the Bride Part II,Comedy,tt0113041,3.1,steve martin,Charles Shyer


### Renaming the director column

In [None]:
movielens_final.rename(columns={'Director': 'director'}, inplace=True)

### Missing values

In [None]:
print(movielens_final.isnull().sum())

title              0
genres             0
imdbId             0
average_rating     0
most_common_tag    0
director           0
dtype: int64
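
Because `fetch_director` returns the string `'N/A'` rather than a real missing value, `isnull()` cannot see failed lookups: a count of 0 for `director` does not by itself show that every film was matched. A minimal sketch of how the placeholder could be counted and converted follows. The small frame is made-up sample data, not a result from the notebook.

```python
import pandas as pd

# Illustrative frame; the second row mimics a failed lookup
sample = pd.DataFrame(
    {'title': ['Toy Story', 'Jumanji'], 'director': ['John Lasseter', 'N/A']},
    index=pd.Index([1, 2], name='movieId'),
)

# isnull() only counts real missing values, so the placeholder goes unnoticed
nulls_before = int(sample['director'].isnull().sum())

# Count the placeholder string explicitly
placeholders = int((sample['director'] == 'N/A').sum())

# Turn the placeholder into a real missing value so isnull() reports it
sample['director'] = sample['director'].mask(sample['director'] == 'N/A')
nulls_after = int(sample['director'].isnull().sum())

print(nulls_before, placeholders, nulls_after)
```

Applied to `movielens_final`, the same comparison `(movielens_final['director'] == 'N/A').sum()` would give the number of films for which scraping found no director.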


### Creating a CSV file

In [None]:
dest_dir = '/home/dstrec/dstrec/010_data/001_transformed'

output_file_movielens_clean = f"{dest_dir}/movielens_with_directors.csv"

# movieId is the DataFrame index (see index_col above); keep it so the identifier is written to the file
movielens_final.to_csv(output_file_movielens_clean)
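
Since the frame was loaded with `index_col='movieId'`, the identifier lives in the index, and whether it reaches the CSV depends on the `index` argument of `to_csv`. A minimal sketch comparing both settings on made-up sample rows, written to an in-memory string instead of the notebook's output path:

```python
import io
import pandas as pd

# Illustrative frame indexed by movieId, like movielens_final
sample = pd.DataFrame(
    {'title': ['Toy Story', 'Jumanji'], 'director': ['John Lasseter', 'Joe Johnston']},
    index=pd.Index([1, 2], name='movieId'),
)

# With no path argument, to_csv returns the CSV text
without_index = sample.to_csv(index=False)
with_index = sample.to_csv()

header_without = without_index.splitlines()[0]
header_with = with_index.splitlines()[0]
print(header_without)
print(header_with)

# Only the version written with the index can be reloaded with movieId as index
reloaded = pd.read_csv(io.StringIO(with_index), index_col='movieId')
print(list(reloaded.index))
```

Keeping the index also means the file can be read back with the same `pd.read_csv(..., index_col='movieId')` call used at the start of the notebook.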