# Preprocessing and Data Visualization

In this project, our goal is to develop a system for information retrieval in a collection of film descriptions published on Allociné. Specifically, we will focus on automatically classifying films by their genre based on the text of their synopsis and title.

## Problem Understanding

The project has two steps. The first step trains an automated genre classifier on the synopsis and title of each film, using two CSV files of French-language film descriptions provided on Moodle. Different classification algorithms and text features will be compared and evaluated using best practices such as cross-validation. The second step applies the selected model to a new dataset to produce data labeled with predicted genres.
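As a rough illustration of how classifiers might be compared with cross-validation, here is a minimal sketch. It assumes scikit-learn is available and uses TF-IDF features with two candidate models; the toy data, the variable names, and the choice of models are assumptions made for illustration, not the project's final setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the real data (invented rows, illustrative only).
toy = pd.DataFrame({
    "titre": [f"Film {i}" for i in range(12)],
    "synopsis": [
        "Une histoire d'amour à Paris .",
        "Un vaisseau spatial explore une planète .",
        "Deux amants se retrouvent après la guerre .",
        "Des robots envahissent la Terre .",
        "Un mariage réunit deux familles .",
        "Un voyage dans le temps tourne mal .",
        "Une rencontre amoureuse dans un train .",
        "Une colonie sur Mars se révolte .",
        "Un amour impossible entre deux voisins .",
        "Un androïde cherche son créateur .",
        "Une lettre d'amour arrive trop tard .",
        "Des extraterrestres contactent la Terre .",
    ],
    "genre": ["romance", "science fiction"] * 6,
})

# Combine title and synopsis into a single text input per film.
text = toy["titre"] + " " + toy["synopsis"]
labels = toy["genre"]

# Stratified folds keep the genre proportions similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Candidate pipelines: same features, different classifiers.
candidates = {
    "tfidf + logistic regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "tfidf + naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

# One accuracy score per fold for each candidate.
scores = {
    name: cross_val_score(model, text, labels, cv=cv, scoring="accuracy")
    for name, model in candidates.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is fitted on the training folds only, which avoids leaking information from the held-out fold.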

## Data Understanding

The data is provided in two CSV files, one for training and one for testing. The training data contains 719 films with 22 columns: an unnamed index column (`Unnamed: 0`) and the following 21 fields:

- acteur_1
- acteur_2
- acteur_3
- allocine_id
- annee_prod
- annee_sortie
- box_office_fr
- couleur
- duree
- langues
- nationalite
- nb_critiques_presse
- nb_critiques_spectateurs
- nb_notes_spectateurs
- note_presse
- note_spectateurs
- realisateurs
- synopsis
- titre
- type_film
- genre

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import tensorflow as tf
import plotly.graph_objs as go

#### Read the data


In [2]:
df = pd.read_csv('../data/allocine_genres_test.csv')

df.head()

,Unnamed: 0,acteur_1,acteur_2,acteur_3,allocine_id,annee_prod,annee_sortie,box_office_fr,couleur,duree,...,nb_critiques_presse,nb_critiques_spectateurs,nb_notes_spectateurs,note_presse,note_spectateurs,realisateurs,synopsis,titre,type_film,genre
0,2790,Johnny Depp,Leonardo DiCaprio,Juliette Lewis,9835,1993,1994.0,,Couleur,118.0,...,4.0,216.0,3862.0,3.2,3.9,Lasse Hallström,"Gilbert Grape vit à Endora dans l' Iowa , avec...",Gilbert Grape,,romance
1,502,Jackie Earle Haley,Patrick Wilson,Malin Åkerman,57769,2009,2009.0,753747.0,Couleur,162.0,...,19.0,2344.0,28543.0,3.7,3.8,Zack Snyder,Aventure à la fois complexe et mystérieuse sur...,Watchmen - Les Gardiens,Long-métrage,science fiction
2,5382,Benoît Magimel,Mélanie Thierry,Nicolas Duvauchelle,207967,2013,2013.0,194140.0,Couleur,110.0,...,18.0,139.0,1055.0,3.2,3.5,Diane Kurys,"A la mort de sa mère , Anne fait une découvert...",Pour une femme,Long-métrage,romance
3,4803,Natja Brunckhorst,Thomas Haustein,Jens Kuphal,57005,1981,1981.0,2887938.0,Couleur,138.0,...,,57.0,527.0,,3.3,Uli Edel,"Christiane , une jeune berlinoise de treize an...","Moi , Christiane F. .. 13 ans , droguée et pro...",Long-métrage,biopic
4,2887,Michael J. Fox,Christopher Lloyd,Thomas F. Wilson,29289,1990,1990.0,1677333.0,Couleur,119.0,...,,610.0,44927.0,,4.1,Robert Zemeckis,"Après son voyage mouvementé entre passé , prés...",Retour vers le futur III,Long-métrage,science fiction


In [3]:
# check the size of the dataset and the type of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 719 entries, 0 to 718
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                719 non-null    int64  
 1   acteur_1                  719 non-null    object 
 2   acteur_2                  691 non-null    object 
 3   acteur_3                  686 non-null    object 
 4   allocine_id               719 non-null    int64  
 5   annee_prod                719 non-null    int64  
 6   annee_sortie              672 non-null    float64
 7   box_office_fr             438 non-null    float64
 8   couleur                   673 non-null    object 
 9   duree                     706 non-null    float64
 10  langues                   591 non-null    object 
 11  nationalite               719 non-null    object 
 12  nb_critiques_presse       514 non-null    float64
 13  nb_critiques_spectateurs  688 non-null    float64
 14  nb_notes_s

In [18]:
# percentage of missing values per column
df.isnull().sum() / df.shape[0] * 100

Unnamed: 0                   0.000000
acteur_1                     0.000000
acteur_2                     3.894298
acteur_3                     4.589708
allocine_id                  0.000000
annee_prod                   0.000000
annee_sortie                 6.536857
box_office_fr               39.082058
couleur                      6.397775
duree                        1.808067
langues                     17.802503
nationalite                  0.000000
nb_critiques_presse         28.511822
nb_critiques_spectateurs     4.311544
nb_notes_spectateurs         3.616134
note_presse                 27.538248
note_spectateurs             3.616134
realisateurs                 0.000000
synopsis                     0.000000
titre                        0.000000
type_film                   24.478442
genre                        0.000000
dtype: float64

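Fortunately, the three columns we will use, `titre`, `synopsis`, and `genre`, have no missing values, so they need no imputation. We will keep these columns; the others, several of which have substantial gaps (for example, about 39% of `box_office_fr` is missing), are not needed for text-based genre classification and can be dropped.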
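The selection described above could look like the following minimal sketch. It runs on a small toy DataFrame rather than the real `df`; the rows, the `text_columns` list, and the combined `texte` column are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` (invented rows, illustrative only).
toy = pd.DataFrame({
    "titre": ["Film A", "Film B", "Film C"],
    "synopsis": ["Un résumé .", "Un autre résumé .", "Encore un résumé ."],
    "genre": ["romance", "biopic", "romance"],
    "box_office_fr": [None, 753747.0, None],
})

# Keep only the columns used for text-based genre classification.
text_columns = ["titre", "synopsis", "genre"]
subset = toy[text_columns].copy()

# Percentage of missing values per kept column.
missing_pct = subset.isnull().mean() * 100

# Fail early if the "no missing values" assumption stops holding.
assert (missing_pct == 0).all(), "unexpected missing values in text columns"

# Hypothetical combined text feature: title followed by synopsis.
subset["texte"] = subset["titre"] + " " + subset["synopsis"]

print(subset.head())
```

Checking the assumption with an assertion, rather than relying on the earlier printout, keeps the notebook safe if a different file is loaded later.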

In [19]:
df.describe()

,Unnamed: 0,allocine_id,annee_prod,annee_sortie,box_office_fr,duree,nb_critiques_presse,nb_critiques_spectateurs,nb_notes_spectateurs,note_presse,note_spectateurs
count,719.0,719.0,719.0,672.0,438.0,706.0,514.0,688.0,693.0,521.0,693.0
mean,2989.849791,160094.88178,2007.628651,2007.971726,1270099.0,108.274788,17.867704,368.627907,7200.007215,3.192706,3.204473
std,1682.909016,96361.789306,15.224954,14.9544,1897099.0,20.739611,8.264276,604.606334,17541.807283,0.733868,0.753926
min,6.0,29.0,1921.0,1921.0,27236.0,60.0,2.0,2.0,3.0,1.4,0.9
25%,1459.5,54773.5,2004.0,2004.0,240294.2,94.0,12.0,70.0,591.0,2.6,2.7
50%,3014.0,192934.0,2013.0,2013.0,560326.5,105.0,18.0,206.5,2138.0,3.2,3.3
75%,4455.5,248303.0,2018.0,2018.0,1561026.0,117.0,23.0,436.5,6070.0,3.7,3.8
max,5778.0,279750.0,2022.0,2021.0,17273060.0,251.0,44.0,10775.0,217689.0,5.0,4.6
