# Overview of the datasets

In [2]:
import pandas as pd

## "movie" dataset

3 variables:
- movieId
- title
- genres

In [3]:
movie = pd.read_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\Movielens1M\movies.dat", 
                     sep="::", 
                     engine="python", 
                     names=["movieId", "title", "genres"], 
                     encoding = "ISO-8859-1")

movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
len(movie["title"].unique())

3883

This table describes the films in the dataset: their ID, their title (which includes the release year in parentheses), and their genres.

### *Title bis: title without the release year* ###

In [5]:
movie["title_bis"] = movie["title"].str[: -7].str.strip()
movie["title_bis"].head()

0                      Toy Story
1                        Jumanji
2               Grumpier Old Men
3              Waiting to Exhale
4    Father of the Bride Part II
Name: title_bis, dtype: object
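
The fixed-width slice above assumes that every title ends with a space followed by a four-digit year in parentheses. As a hedged alternative, the minimal sketch below makes that assumption explicit with a regular expression. The sample titles and the group names `name` and `year` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Illustrative sample in the same "Title (YYYY)" layout as movies.dat
sample = pd.DataFrame({"title": ["Toy Story (1995)", "American President, The (1995)"]})

# Capture everything before the final " (YYYY)" and the four-digit year itself
parts = sample["title"].str.extract(r"^(?P<name>.*)\s\((?P<year>\d{4})\)\s*$")

sample["title_bis"] = parts["name"].str.strip()
sample["date"] = parts["year"]
print(sample)
```

A title that does not follow the pattern yields NaN in both columns, which makes malformed rows easy to spot with `isna()`.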

In [6]:
liste_films_verif = movie["title_bis"].unique().tolist()
[f for f in liste_films_verif if f.endswith("The")]
# Titles ending with "The" have a formatting problem: the article has been moved to the end

['American President, The',
 'City of Lost Children, The',
 'Usual Suspects, The',
 'Big Green, The',
 'Indian in the Cupboard, The',
 'Crossing Guard, The',
 'Juror, The',
 'Journey of August King, The',
 'Bridges of Madison County, The',
 "Young Poisoner's Handbook, The",
 'Boys of St. Vincent, The',
 'NeverEnding Story III, The',
 'Neon Bible, The',
 'Birdcage, The',
 'Brothers McMullen, The',
 'Amazing Panda Adventure, The',
 'Basketball Diaries, The',
 'Addiction, The',
 'Doom Generation, The',
 'Net, The',
 'Prophecy, The',
 'Scarlet Letter, The',
 'Show, The',
 'Stars Fell on Henrietta, The',
 'Tie That Binds, The',
 'Browning Version, The',
 'Babysitter, The',
 'Cure, The',
 'Glass Shield, The',
 'Hunted, The',
 'Jerky Boys, The',
 'Madness of King George, The',
 'Perez Family, The',
 'Quick and the Dead, The',
 'Swan Princess, The',
 'Secret of Roan Inish, The',
 'Specialist, The',
 'Santa Clause, The',
 'Shawshank Redemption, The',
 'Sum of Us, The',
 'Underneath, The',
 'Wal

In [7]:
# Define a function that moves a trailing article back to the front of the title

def corriger_articles(titre):
    for article in ["The", "An", "A"] :
        if titre.endswith(f", {article}") :
            return f"{article} " + titre[: -(len(article)+2)]
    return titre

movie["title_clean"] = movie["title_bis"].apply(corriger_articles)
movie["title_clean"].head()

0                      Toy Story
1                        Jumanji
2               Grumpier Old Men
3              Waiting to Exhale
4    Father of the Bride Part II
Name: title_clean, dtype: object
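
None of the first five rows has a trailing article, so the preview does not exercise the correction. The small check below applies the function to a few titles taken from the outputs of this notebook. The function is repeated so that the snippet runs on its own, and the names `exemples` and `corriges` are introduced only for this illustration.

```python
def corriger_articles(titre):
    # Move a trailing ", The" / ", An" / ", A" to the front of the title
    for article in ["The", "An", "A"]:
        if titre.endswith(f", {article}"):
            return f"{article} " + titre[: -(len(article) + 2)]
    return titre

exemples = ["American President, The", "Farewell to Arms, A", "Toy Story"]
corriges = [corriger_articles(t) for t in exemples]
print(corriges)
```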

In [8]:
# No title ends with "The" any more

liste_films_verif = movie["title_clean"].unique().tolist()
[f for f in liste_films_verif if f.endswith("The")]

[]

In [None]:
# Checking the corrected format: method 1

liste_articles = ["The", "An", "A"]

test = []
for i in liste_articles : 
    # Accumulate matches for every article instead of overwriting `test` on each pass
    test += [f for f in liste_films_verif if f.endswith(i)]
    
print(test)
# No title ends with "The", "An" or "A" any more.

[]


In [10]:
# Checking the corrected format: method 2

films_rates = []

for i in liste_films_verif : 
    for j in liste_articles : 
        if i.endswith(j) : 
            films_rates.append(i)

films_rates
# All titles are in the expected format

[]
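
The two loops can also be written as a single vectorised test. The sketch below is one possible variant, run on illustrative titles rather than on the real table. It looks for the ", Article" suffix specifically, which is stricter than a bare `endswith("The")`.

```python
import pandas as pd

# Illustrative titles: two are still in the "Title, Article" layout
titres = pd.Series(["The Usual Suspects", "Net, The", "Officer and a Gentleman, An", "Toy Story"])

# True where a title still ends with ", The", ", An" or ", A"
mal_formates = titres.str.contains(r", (?:The|An|A)$", regex=True)
print(titres[mal_formates].tolist())
```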

### *Date: release year* ###

In [11]:
movie["date"] = movie["title"].str[-5 : -1].str.strip()
movie["date"].head()

0    1995
1    1995
2    1995
3    1995
4    1995
Name: date, dtype: object

### *Title clean: required for the merges that follow* ###

In [12]:
movie["title_clean"] = movie["title_clean"].str.strip() + " (" + movie["date"] + ")"
movie["title_clean"]

0                         Toy Story (1995)
1                           Jumanji (1995)
2                  Grumpier Old Men (1995)
3                 Waiting to Exhale (1995)
4       Father of the Bride Part II (1995)
                       ...                
3878               Meet the Parents (2000)
3879            Requiem for a Dream (2000)
3880                      Tigerland (2000)
3881               Two Family House (2000)
3882                  The Contender (2000)
Name: title_clean, Length: 3883, dtype: object

## "movies_metadata" dataset

This dataset provides additional information about the films.

In [13]:
movies_metadata = pd.read_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\Movielens1M\movies_metadata.csv")
movies_metadata.head(1)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [14]:
len(movies_metadata)

45466

Goal: put the title in the same format as the title in the "movie" table, to see whether a merge is worthwhile. 
If so, keep the films the two tables have in common.

In [15]:
movies_metadata["title_clean"] = movies_metadata["title"] + " (" + movies_metadata["release_date"].str[:4].str.strip() + ")"
movies_metadata["title_clean"]

0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
45461                                   NaN
45462            Century of Birthing (2011)
45463                       Betrayal (2003)
45464               Satan Triumphant (1917)
45465                       Queerama (2017)
Name: title_clean, Length: 45466, dtype: object

In [16]:
movies_metadata["title_clean"].isnull().sum()

90

In [18]:
# 45296 unique values (unique() counts NaN as one of them)
print(len(movies_metadata["title_clean"].unique()))

# 45376 non-null observations
print(movies_metadata["title_clean"].notnull().sum())

45296
45376


In [None]:
liste_movies_metadata = set(movies_metadata["title_clean"].to_list()) # Set of unique titles in 'movies_metadata'
liste_movies_movies = set(movie["title_clean"]) # Set of unique titles in 'movie'

print(len(liste_movies_metadata.intersection(liste_movies_movies)))
print(len(liste_movies_metadata.intersection(liste_movies_movies)) / len(movie) * 100)

# 3027 films are common to both tables.
# 78% of the films in movie are also present in metadata

3027
77.95518928663404


About 78% of the films can therefore be recovered. We keep these films because they come with the synopses we will need. 
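
The overlap can also be inspected row by row with the `indicator` option of `pd.merge`, which records where each key was found. The sketch below uses two small made-up frames (`left`, `right`) purely for illustration. It is not the notebook's method.

```python
import pandas as pd

left = pd.DataFrame({"title_clean": ["Toy Story (1995)", "Jumanji (1995)", "Obscure Film (1970)"]})
right = pd.DataFrame({"title_clean": ["Toy Story (1995)", "Jumanji (1995)", "Queerama (2017)"],
                      "overview": ["...", "...", "..."]})

# how="left" keeps every row of `left`; the _merge column says whether a match was found
check = left.merge(right, on="title_clean", how="left", indicator=True)
taux = (check["_merge"] == "both").mean() * 100

print(check["_merge"].value_counts())
print(f"{taux:.1f}% of titles matched")
```

Rows flagged `left_only` are the titles without a synopsis, which helps when deciding whether the key format needs more cleaning.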

## "movies_final" dataset ##

The goal of this section is to build a final table containing the films we will use, together with their SYNOPSES and other information. 
To do so, we use "movie" and "movies_metadata".

In [20]:
movies_final = pd.merge(movie, 
                  movies_metadata, 
                  on = "title_clean", 
                  how = "inner")

movies_final.drop(["adult", "belongs_to_collection", "genres_y", "homepage", "id", "imdb_id", 
                   "original_language", "original_title", "status", "video", "vote_count", "vote_average", 
                   "title_x", "title_y", "popularity", "poster_path", "production_companies", 
                   "production_countries", "runtime", "spoken_languages"], 
                  axis = 1, 
                  inplace = True)

movies_final

Unnamed: 0,movieId,genres_x,title_bis,title_clean,date,budget,overview,release_date,revenue,tagline
0,1,Animation|Children's|Comedy,Toy Story,Toy Story (1995),1995,30000000,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,373554033.0,
1,2,Adventure|Children's|Fantasy,Jumanji,Jumanji (1995),1995,65000000,When siblings Judy and Peter discover an encha...,1995-12-15,262797249.0,Roll the dice and unleash the excitement!
2,3,Comedy|Romance,Grumpier Old Men,Grumpier Old Men (1995),1995,0,A family wedding reignites the ancient feud be...,1995-12-22,0.0,Still Yelling. Still Fighting. Still Ready for...
3,4,Comedy|Drama,Waiting to Exhale,Waiting to Exhale (1995),1995,16000000,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,81452156.0,Friends are the people who let you be yourself...
4,5,Comedy,Father of the Bride Part II,Father of the Bride Part II (1995),1995,0,Just when George Banks has recovered from his ...,1995-02-10,76578911.0,Just When His World Is Back To Normal... He's ...
...,...,...,...,...,...,...,...,...,...,...
3028,3948,Comedy,Meet the Parents,Meet the Parents (2000),2000,55000000,"Greg Focker is ready to marry his girlfriend, ...",2000-10-06,330444045.0,First comes love. Then comes the interrogation.
3029,3949,Drama,Requiem for a Dream,Requiem for a Dream (2000),2000,4500000,The hopes and dreams of four ambitious people ...,2000-10-27,7390108.0,
3030,3950,Drama,Tigerland,Tigerland (2000),2000,10000000,A group of recruits go through Advanced Infant...,2000-09-22,0.0,The system wanted them to become soldiers. One...
3031,3951,Drama,Two Family House,Two Family House (2000),2000,0,Buddy Visalo (Michael Rispoli) is a factory wo...,2000-01-21,0.0,The only way to find out what you love is to r...


6 titles are still duplicated. 

In [21]:
doublons = movies_final["title_clean"].value_counts()
doublons = doublons[doublons == 2]
doublons

title_clean
Gossip (2000)                2
A Farewell to Arms (1932)    2
Soldier (1998)               2
Hamlet (2000)                2
Emma (1996)                  2
The Castle (1997)            2
Name: count, dtype: int64

In [22]:
doublons_movies = doublons.index.tolist()
doublons_movies

['Gossip (2000)',
 'A Farewell to Arms (1932)',
 'Soldier (1998)',
 'Hamlet (2000)',
 'Emma (1996)',
 'The Castle (1997)']

In [None]:
# `doublons` is only a Series of counts, so filter movies_final on the duplicated titles first
doublons_keep = movies_final[movies_final["title_clean"].isin(doublons_movies)].drop_duplicates(
    subset = "title_clean", 
    keep   = "first")

doublons_keep
# Inspect the duplicates to decide which one to keep. Here we keep the first occurrence because it carries 
# the revenue information. These are the rows we keep: 

Unnamed: 0,movieId,genres_x,title_bis,title_clean,date,budget,overview,release_date,revenue,tagline
638,838,Comedy|Drama|Romance,Emma,Emma (1996),1996,6000000,Emma Woodhouse is a congenial young lady who d...,1996-08-02,22231658.0,Cupid is armed and dangerous!
754,976,Romance|War,"Farewell to Arms, A",A Farewell to Arms (1932),1932,4,British nurse Catherine Barkley (Helen Hayes) ...,1932-12-08,25.0,Every woman who has loved will understand
1776,2322,Action|Adventure|Sci-Fi|Thriller|War,Soldier,Soldier (1998),1998,75000000,Sergeant Todd is a veteran soldier for an elit...,1998-10-23,14567883.0,Left for dead on a remote planet for obsolete ...
2015,2618,Comedy,"Castle, The",The Castle (1997),1997,786675,A Melbourne family is very happy living near t...,1997-04-10,861789.0,Ordinary Family. Extraordinary Story.
2734,3553,Drama|Thriller,Gossip,Gossip (2000),2000,14000000,"On a beautiful college campus, something ugly ...",2000-04-21,5108820.0,"It can turn you on, or turn on you."
2764,3598,Drama,Hamlet,Hamlet (2000),2000,2000000,Modern day adaptation of Shakespeare's immorta...,2000-05-12,1568749.0,"Passion, Betrayal, Revenge, A hostile takeover..."


In [25]:
movies_final = movies_final.drop_duplicates(subset = "title_clean", 
                                            keep   = "first")

len(movies_final)

3027

In [26]:
movies_final.to_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\movies_final.csv")

"movies_final" is now clean. 

## "rating" dataset

In [27]:
rating = pd.read_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\Movielens1M\ratings.dat", 
                    sep="::", 
                    engine="python", 
                    names=["userId", "movieId", "rating", "timestamp"],
                    encoding="ISO-8859-1")
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [28]:
len(rating)

1000209

In [29]:
len(rating["movieId"].unique())
# 3706 films originally

3706

4 variables: 
- userId: user who rated the film
- movieId: unique identifier of the film
- rating: rating given to the film
- timestamp: time of the rating, in seconds since the Unix epoch; its usefulness here is unclear
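
According to the MovieLens documentation, the timestamp counts seconds since the Unix epoch, so it can be converted to a date if it is ever needed. A minimal sketch using three of the values printed above (the names `ts` and `dates` are illustrative):

```python
import pandas as pd

ts = pd.Series([978300760, 978302109, 978824291], name="timestamp")

# Interpret the integers as seconds since 1970-01-01 (UTC)
dates = pd.to_datetime(ts, unit="s")
print(dates)
```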

Using a unique identifier per user and per film, this table gives the rating a given user assigned to a given film. However, one important piece of information is missing: the title, which is needed to link the ratings to "movies_metadata" and build the "all" table. 
We obtain it from the "movies_final" table. 
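
Since each `movieId` should appear at most once in "movies_final", this join is many-to-one. The sketch below shows how pandas can check that during the merge. It runs on toy data, and the `validate` argument is an optional safeguard that the notebook itself does not use.

```python
import pandas as pd

ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [1193, 661, 1193], "rating": [5, 3, 4]})
films = pd.DataFrame({"movieId": [1193, 914],
                      "title_clean": ["One Flew Over the Cuckoo's Nest (1975)", "Another Film (1964)"]})

# validate="many_to_one" raises a MergeError if movieId is duplicated in `films`
merged = pd.merge(ratings, films, on="movieId", how="inner", validate="many_to_one")
print(merged)
```

With `how="inner"`, ratings whose `movieId` is missing from `films` (661 here) are dropped, which is why the number of ratings shrinks after the merge.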

In [30]:
rating = pd.merge(rating, 
                  movies_final, 
                  left_on = "movieId", 
                  right_on = "movieId", 
                  how = "inner")

In [31]:
rating.head(1)

Unnamed: 0,userId,movieId,rating,timestamp,genres_x,title_bis,title_clean,date,budget,overview,release_date,revenue,tagline
0,1,1193,5,978300760,Drama,One Flew Over the Cuckoo's Nest,One Flew Over the Cuckoo's Nest (1975),1975,3000000,While serving time for insanity at a state men...,1975-11-18,108981275.0,"If he's crazy, what does that make you?"


In [None]:
print(len(rating))
print(len(rating["movieId"].unique()))

898688
2954


In [None]:
2954 / 3706
# About 80% of the films were kept. 

0.7970858067997841
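
The same ratio can be computed for the ratings themselves, using the row counts printed above. The variable names are illustrative.

```python
# Counts taken from the outputs of this notebook
n_ratings_before, n_ratings_after = 1_000_209, 898_688
n_films_before, n_films_after = 3706, 2954

part_ratings = n_ratings_after / n_ratings_before
part_films = n_films_after / n_films_before
print(f"ratings kept: {part_ratings:.1%}, films kept: {part_films:.1%}")
```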

## FINAL dataset --> "all"

Merge the other tables with the "rating" table to obtain more details about the films.

In [34]:
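# Note: the name `all` shadows Python's built-in all() for the rest of the session.
# `rating` already contains the movies_final columns (merge above), so the shared
# columns are duplicated here with _x / _y suffixes.
all = pd.merge(rating, 
         movies_final, 
         on = "movieId", 
         how = "inner")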

In [35]:
len(all)
# The length matches rating, so the merge neither dropped nor duplicated any rows. 

898688

In [36]:
# Save the table to a CSV file
# all.to_csv(r"C:\Users\nolle\OneDrive\Documents\Université\M2\Mémoire\All.csv",
#           index = False)

In [39]:
all['userId'].isna().sum()

0

In [40]:
len(all)

898688