# <div style='color:white;background: #005792;text-align: center;padding: 15px 0'>Recommendations - Evaluation of the Title principals data source</div>

## Participants
* Samantha
* Rachelle
* Andrew


## Summary of observations

This dataset has 6 columns and 85,883,465 lines according to `wc -l` (a count that includes the header line). No duplicated rows were found in the rows analysed below.
However, some columns contain missing values.
In particular, `job` has about 81% missing values and `characters` about 50.5%.


### Columns to drop

* `ordering` adds no value to the ML model
* `job` has more than 80% missing values
* `characters` has more than 50% missing values

### Columns to keep

* `tconst` identifies the title (film)
* `nconst` identifies the person
* `category` identifies the person's role

### Transformation

* `category`: the column should be reshaped (pivoted) so that each role becomes its own column

## <div style='background: #005792;text-align: center;padding: 15px 0'> <a style= 'color:white;' >Global variables configuration</a></div>

### Installing the libraries

In [25]:
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install seaborn
# !pip install plotly-express
# !pip install plotly

### Importing the libraries

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import re

### Loading the files

In [2]:
source_dir= '/home/dstrec/dstrec/010_data/000_source/imdb_datasets'
name_file= 'title.principals.tsv'
file_path= f"{source_dir}/{name_file}"

### Loading the datasets

In [3]:
!wc -l '/home/dstrec/dstrec/010_data/000_source/imdb_datasets/title.principals.tsv'

85883465 /home/dstrec/dstrec/00_data_source/imdb_datasets/title.principals.tsv


In [3]:
df= pd.read_csv(file_path, sep='\t', na_values='\\N', low_memory=False, nrows=5000000)
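Because `nrows` loads only part of the file, one possible way to cover every row without holding them all in memory is to stream the file in chunks and aggregate as you go. The sketch below is an assumption, not part of the original notebook: `sample_tsv` is a tiny in-memory stand-in for the real file, and `count_categories` is a hypothetical helper name.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory stand-in for title.principals.tsv (illustrative rows only).
sample_tsv = (
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    "tt0000001\t2\tnm0005690\tdirector\t\\N\t\\N\n"
    "tt0000001\t3\tnm0005690\tproducer\tproducer\t\\N\n"
    "tt0000001\t4\tnm0374658\tcinematographer\tdirector of photography\t\\N\n"
    "tt0000002\t1\tnm0721526\tdirector\t\\N\t\\N\n"
)


def count_categories(source, chunksize):
    """Count rows per category by streaming the source chunk by chunk."""
    counts = Counter()
    reader = pd.read_csv(
        source,
        sep="\t",
        na_values="\\N",
        # Read only the columns that are kept after the evaluation.
        usecols=["tconst", "nconst", "category"],
        chunksize=chunksize,
    )
    for chunk in reader:
        counts.update(chunk["category"].value_counts().to_dict())
    return counts


# With the real file, pass file_path instead of the StringIO buffer
# and use a much larger chunksize.
category_counts = count_categories(io.StringIO(sample_tsv), chunksize=2)
print(category_counts)
```

Restricting `usecols` to the three retained columns also reduces the memory needed per chunk.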

### Variables configuration

### Functions configuration

## <div style='background: #005792;text-align: center;padding: 15px 0'> <a style= 'color:white;' >Data evaluation</a></div>

### Displaying the dataset

In [5]:
df.head()

| | tconst | ordering | nconst | category | job | characters |
|---|---|---|---|---|---|---|
| 0 | tt0000001 | 1 | nm1588970 | self | NaN | ["Self"] |
| 1 | tt0000001 | 2 | nm0005690 | director | NaN | NaN |
| 2 | tt0000001 | 3 | nm0005690 | producer | producer | NaN |
| 3 | tt0000001 | 4 | nm0374658 | cinematographer | director of photography | NaN |
| 4 | tt0000002 | 1 | nm0721526 | director | NaN | NaN |


In [6]:
print(f"This dataset has {df.shape[1]} columns and {df.shape[0]} rows.")

This dataset has 6 columns and 20000000 rows.


### EDA

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000000 entries, 0 to 19999999
Data columns (total 6 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   tconst      object
 1   ordering    int64 
 2   nconst      object
 3   category    object
 4   job         object
 5   characters  object
dtypes: int64(1), object(5)
memory usage: 915.5+ MB


### Descriptive statistics

In [8]:
df.describe(include='all')

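| | tconst | ordering | nconst | category | job | characters |
|---|---|---|---|---|---|---|
| count | 20000000 | 20000000.0 | 20000000 | 20000000 | 3801298 | 9898477 |
| unique | 1903951 | NaN | 2088796 | 13 | 25250 | 1866785 |
| top | tt0398022 | NaN | nm8467983 | actor | producer | ["Self"] |
| freq | 75 | NaN | 11354 | 5578000 | 1348309 | 1015737 |
| mean | NaN | 7.728353 | NaN | NaN | NaN | NaN |
| std | NaN | 5.437122 | NaN | NaN | NaN | NaN |
| min | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 25% | NaN | 3.0 | NaN | NaN | NaN | NaN |
| 50% | NaN | 7.0 | NaN | NaN | NaN | NaN |
| 75% | NaN | 11.0 | NaN | NaN | NaN | NaN |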


### Missing values

In [9]:
df.isna().sum()  / len(df) * 100

tconst         0.000000
ordering       0.000000
nconst         0.000000
category       0.000000
job           80.993510
characters    50.507615
dtype: float64

<span style="font-size: 16px;"><b>Remarks:</b><br><br>
The "job" column has about 81% missing values and "characters" about 50.5%. These two columns could therefore be dropped from the DataFrame.
</span>
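The same check can be turned into a small reusable helper that lists the columns above a chosen share of missing values. This is only a sketch on toy data: `columns_above_missing_threshold`, the `toy` frame and the 50% threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(frame, threshold_pct):
    """Return the missing-value percentage of columns above threshold_pct."""
    missing_pct = frame.isna().mean() * 100
    return missing_pct[missing_pct > threshold_pct].sort_values(ascending=False)


# Toy frame standing in for df (illustrative values only).
toy = pd.DataFrame(
    {
        "tconst": ["tt1", "tt1", "tt2", "tt3", "tt3"],
        "category": ["actor", "director", "actor", "writer", "self"],
        "job": [np.nan, np.nan, np.nan, "screenplay", np.nan],
        "characters": ['["A"]', np.nan, '["B"]', np.nan, '["Self"]'],
    }
)

# Columns whose share of missing values exceeds the threshold.
sparse = columns_above_missing_threshold(toy, threshold_pct=50)
print(sparse)

# Drop those columns from the frame.
cleaned = toy.drop(columns=sparse.index)
```

Keeping the threshold as a parameter makes the cut-off explicit and easy to revisit.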

### Duplicated values

In [10]:
print(f"This dataset has {df.duplicated().values.sum()} duplicated row(s).")

This dataset has 0 duplicated row(s).
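`df.duplicated()` compares whole rows, so rows that differ only in `ordering`, `job` or `characters` can become identical once those columns are dropped. The sketch below, on made-up toy data, shows how one might check duplicates on the retained key columns only; `key_columns` is an illustrative name, not something defined in the notebook.

```python
import pandas as pd

# Toy rows (illustrative): the same person is credited twice as writer on tt1.
toy = pd.DataFrame(
    {
        "tconst": ["tt1", "tt1", "tt1", "tt2"],
        "ordering": [1, 2, 3, 1],
        "nconst": ["nm1", "nm1", "nm2", "nm1"],
        "category": ["writer", "writer", "director", "writer"],
        "job": ["story", "screenplay", None, None],
    }
)

key_columns = ["tconst", "nconst", "category"]

# Duplicates over full rows versus over the key columns only.
full_row_duplicates = int(toy.duplicated().sum())
key_duplicates = int(toy.duplicated(subset=key_columns).sum())

# Keep one row per (tconst, nconst, category) combination.
deduplicated = toy.drop_duplicates(subset=key_columns)
print(full_row_duplicates, key_duplicates, len(deduplicated))
```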


### Unique values

In [14]:
df['category'].unique()

array(['self', 'director', 'producer', 'cinematographer', 'composer',
       'editor', 'actor', 'actress', 'writer', 'production_designer',
       'archive_footage', 'casting_director', 'archive_sound'],
      dtype=object)
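Beyond the list of distinct values, the frequency of each category helps anticipate how sparse the pivoted columns will be. A minimal sketch on a toy Series (the values are made up; on the real data the same calls would apply to `df['category']`):

```python
import pandas as pd

# Toy stand-in for df['category'] (illustrative values only).
toy_category = pd.Series(
    ["actor", "actor", "actress", "director", "self", "actor"],
    name="category",
)

# Absolute counts and percentage share of each category.
counts = toy_category.value_counts()
shares = toy_category.value_counts(normalize=True).mul(100).round(1)

summary = pd.DataFrame({"rows": counts, "share_pct": shares})
print(summary)
```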

### Reshaping the `category` column

In [4]:

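# Keep only the identifier columns and the role
df2 = df.drop(columns=['ordering','job','characters'])
# One row per title, one column per category; 'first' keeps only the first nconst of each (tconst, category) pair
df_pivot = df2.pivot_table(index='tconst', columns='category', values='nconst', aggfunc='first')
df_pivot.reset_index(inplace=True)
# Replace any missing column label with 'none'
df_pivot.columns = [col if col is not None else 'none' for col in df_pivot.columns]
print(df_pivot)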


           tconst      actor    actress archive_footage archive_sound  \
0       tt0000001        NaN        NaN             NaN           NaN   
1       tt0000002        NaN        NaN             NaN           NaN   
2       tt0000003        NaN        NaN             NaN           NaN   
3       tt0000004        NaN        NaN             NaN           NaN   
4       tt0000005  nm0443482        NaN             NaN           NaN   
...           ...        ...        ...             ...           ...   
433168  tt0461217  nm1374670  nm1385417             NaN           NaN   
433169  tt0461218  nm0209629  nm1028648             NaN           NaN   
433170  tt0461221  nm1920750  nm1918226             NaN           NaN   
433171  tt0461222  nm0284557  nm0648650             NaN           NaN   
433172  tt0461223  nm0131952  nm0637662             NaN           NaN   

       casting_director cinematographer   composer   director     editor  \
0                   NaN       nm0374658        

In [11]:
df_pivot.head(50)

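| | tconst | actor | actress | director | editor | producer | writer |
|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | NaN | NaN | nm0005690 | NaN | nm0005690 | NaN |
| 1 | tt0000002 | NaN | NaN | nm0721526 | NaN | NaN | NaN |
| 2 | tt0000003 | NaN | NaN | nm0721526 | nm5442200 | nm1770680 | NaN |
| 3 | tt0000004 | NaN | NaN | nm0721526 | NaN | NaN | NaN |
| 4 | tt0000005 | nm0443482 | NaN | NaN | NaN | nm0249379 | NaN |
| 5 | tt0000007 | nm0179163 | NaN | nm0005690 | NaN | nm0005690 | NaN |
| 6 | tt0000008 | nm0653028 | NaN | nm0005690 | NaN | nm0005690 | NaN |
| 7 | tt0000009 | nm0183823 | nm0063086 | nm0085156 | NaN | nm0085156 | nm0085156 |
| 8 | tt0000010 | NaN | NaN | nm0525910 | NaN | nm0525910 | NaN |
| 9 | tt0000011 | nm3692297 | NaN | nm0804434 | NaN | nm0804434 | NaN |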
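With `aggfunc='first'`, a title with several people in the same category keeps only one of them. If every identifier is needed, one possible alternative is to join them per (title, category) pair before unstacking. The sketch below uses made-up toy data and a comma separator chosen for illustration; it is not the notebook's method.

```python
import pandas as pd

# Toy data (illustrative): tt1 has two actors.
toy = pd.DataFrame(
    {
        "tconst": ["tt1", "tt1", "tt1", "tt2"],
        "nconst": ["nm1", "nm2", "nm3", "nm4"],
        "category": ["actor", "actor", "director", "director"],
    }
)

# Same call shape as the notebook: only the first actor of tt1 survives.
first_only = toy.pivot_table(
    index="tconst", columns="category", values="nconst", aggfunc="first"
)

# Alternative: join every identifier of a (title, category) pair.
all_people = (
    toy.groupby(["tconst", "category"])["nconst"]
    .agg(lambda ids: ",".join(ids))
    .unstack("category")
    .reset_index()
)
all_people.columns.name = None  # drop the leftover 'category' axis label

print(first_only)
print(all_people)
```

The joined strings can be split again later, at the cost of a less model-friendly column type.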


In [6]:
df_pivot.isna().sum()  / len(df_pivot) * 100

tconst                  0.000000
actor                  22.277935
actress                28.043761
archive_footage        98.245966
archive_sound          99.951290
casting_director       92.457055
cinematographer        44.057686
composer               59.073396
director               16.184988
editor                 53.495947
producer               51.361004
production_designer    81.271917
self                   89.936123
writer                 33.931939
dtype: float64

<span style="font-size: 16px;"><b>Remarks:</b><br><br>
Given their share of missing values, the following columns could be dropped from the DataFrame:
* archive_footage
* archive_sound
* casting_director
* cinematographer
* composer
* production_designer
* self

</span>
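Dropping the columns listed above could look like the following sketch. The `toy_pivot` frame is made up for illustration, and `errors="ignore"` is an assumption that lets the same list be applied to a frame where some of these columns are absent.

```python
import pandas as pd

# Columns named in the remarks above.
columns_to_drop = [
    "archive_footage",
    "archive_sound",
    "casting_director",
    "cinematographer",
    "composer",
    "production_designer",
    "self",
]

# Toy stand-in for df_pivot containing only some of the columns.
toy_pivot = pd.DataFrame(
    {
        "tconst": ["tt1", "tt2"],
        "actor": ["nm1", None],
        "director": ["nm2", "nm3"],
        "composer": [None, None],
        "self": [None, "nm4"],
    }
)

# errors="ignore" skips listed columns that are not present in the frame.
reduced = toy_pivot.drop(columns=columns_to_drop, errors="ignore")
print(reduced.columns.tolist())
```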