# <center>MOVIE RECOMMENDATION SYSTEM</center>

### OBJECTIVE:
The goal of this project is to build a system that suggests films ranked by their similarity to the film entered by the user.

### What is a recommendation system?

A recommendation system in artificial intelligence is a software application that uses algorithms to analyze a user's preferences, behavior, and habits in order to recommend relevant items. These items can include products, services, digital content, social connections, and more, depending on the context of use.

### Technical implementation of the project

In [1]:
import pandas as pd
import numpy as np
import re
from nltk.tokenize import word_tokenize
from unidecode import unidecode
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from warnings import filterwarnings
filterwarnings("ignore")

In [2]:
df = pd.read_csv("movies.csv")

In [3]:
df.head()

Unnamed: 0,name,rating,genre,year,released,score,votes,director,writer,star,country,budget,gross,company,runtime
0,The Shining,R,Drama,1980,"June 13, 1980 (United States)",8.4,927000.0,Stanley Kubrick,Stephen King,Jack Nicholson,United Kingdom,19000000.0,46998772.0,Warner Bros.,146.0
1,The Blue Lagoon,R,Adventure,1980,"July 2, 1980 (United States)",5.8,65000.0,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,4500000.0,58853106.0,Columbia Pictures,104.0
2,Star Wars: Episode V - The Empire Strikes Back,PG,Action,1980,"June 20, 1980 (United States)",8.7,1200000.0,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,18000000.0,538375067.0,Lucasfilm,124.0
3,Airplane!,PG,Comedy,1980,"July 2, 1980 (United States)",7.7,221000.0,Jim Abrahams,Jim Abrahams,Robert Hays,United States,3500000.0,83453539.0,Paramount Pictures,88.0
4,Caddyshack,R,Comedy,1980,"July 25, 1980 (United States)",7.3,108000.0,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,6000000.0,39846344.0,Orion Pictures,98.0


In [4]:
df["country"] = df["released"].apply(lambda x : " ".join(str(x).split("(")[-1:])[:-1])
df["country"]

0       United States
1       United States
2       United States
3       United States
4       United States
            ...      
7663    United States
7664    United States
7665         Cameroon
7666    United States
7667    United States
Name: country, Length: 7668, dtype: object
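
The slicing expression above would turn a missing `released` value into the string `"na"` (because `str(NaN)` is `"nan"` and the last character is cut off), so such a row would look populated. A minimal sketch of a regex-based alternative, assuming the country always sits in the final pair of parentheses; `extract_country` is a hypothetical helper, not part of the original notebook:

```python
import re

# Matches the text inside the last pair of parentheses at the end of the string.
_COUNTRY_RE = re.compile(r"\(([^()]*)\)\s*$")

def extract_country(released):
    """Return the country from e.g. 'June 13, 1980 (United States)', else None."""
    if not isinstance(released, str):
        return None  # missing values (NaN, None) stay missing
    match = _COUNTRY_RE.search(released)
    return match.group(1).strip() if match else None

print(extract_country("June 13, 1980 (United States)"))
print(extract_country(float("nan")))
```

It could be applied with `df["released"].apply(extract_country)`; rows without a country would then show up in `df.isna().sum()` instead of carrying a placeholder string.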

In [5]:
df = df[["name","genre","year","director","writer","star","country"]]
df

Unnamed: 0,name,genre,year,director,writer,star,country
0,The Shining,Drama,1980,Stanley Kubrick,Stephen King,Jack Nicholson,United States
1,The Blue Lagoon,Adventure,1980,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States
2,Star Wars: Episode V - The Empire Strikes Back,Action,1980,Irvin Kershner,Leigh Brackett,Mark Hamill,United States
3,Airplane!,Comedy,1980,Jim Abrahams,Jim Abrahams,Robert Hays,United States
4,Caddyshack,Comedy,1980,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States
...,...,...,...,...,...,...,...
7663,More to Life,Drama,2020,Joseph Ebanks,Joseph Ebanks,Shannon Bond,United States
7664,Dream Round,Comedy,2020,Dusty Dukatz,Lisa Huston,Michael Saquella,United States
7665,Saving Mbango,Drama,2020,Nkanya Nkwai,Lynno Lovert,Onyama Laura,Cameroon
7666,It's Just Us,Drama,2020,James Randall,James Randall,Christina Roz,United States


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7668 entries, 0 to 7667
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   name      7668 non-null   object
 1   genre     7668 non-null   object
 2   year      7668 non-null   int64 
 3   director  7668 non-null   object
 4   writer    7665 non-null   object
 5   star      7667 non-null   object
 6   country   7668 non-null   object
dtypes: int64(1), object(6)
memory usage: 419.5+ KB


### Data cleaning

In [7]:
df.isna().sum()

name        0
genre       0
year        0
director    0
writer      3
star        1
country     0
dtype: int64

In [8]:
df.shape

(7668, 7)

In [9]:
df = df.dropna()

In [10]:
df.isna().sum()

name        0
genre       0
year        0
director    0
writer      0
star        0
country     0
dtype: int64

In [11]:
df[df.duplicated()]

Unnamed: 0,name,genre,year,director,writer,star,country


In [12]:
df.shape

(7664, 7)
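
Note that `dropna()` keeps the original index labels: the frame now has 7664 rows, yet its labels still run up to 7667. Any later step that uses a label as a row position in a NumPy array can then point at the wrong row or fall out of bounds. A small sketch on toy data (the frame `toy` is invented for illustration) showing the mismatch and one way to avoid it with `reset_index(drop=True)`:

```python
import pandas as pd

# Toy frame (hypothetical data): row 1 has a missing writer.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "writer": ["w1", None, "w3", "w4"],
})

cleaned = toy.dropna()
print(list(cleaned.index))   # labels keep a gap where the row was dropped

# Renumber the rows so that labels and positions coincide again.
aligned = cleaned.reset_index(drop=True)
print(list(aligned.index))
```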

In [13]:
df["global"] = df["name"] +" "+ df["genre"] +" "+df["year"].astype(str) +" "+ df["director"]+" "+df["writer"]+" "+df["star"]+" "+df["country"]

In [14]:
df["global"]

0       The Shining Drama 1980 Stanley Kubrick Stephen...
1       The Blue Lagoon Adventure 1980 Randal Kleiser ...
2       Star Wars: Episode V - The Empire Strikes Back...
3       Airplane! Comedy 1980 Jim Abrahams Jim Abraham...
4       Caddyshack Comedy 1980 Harold Ramis Brian Doyl...
                              ...                        
7663    More to Life Drama 2020 Joseph Ebanks Joseph E...
7664    Dream Round Comedy 2020 Dusty Dukatz Lisa Hust...
7665    Saving Mbango Drama 2020 Nkanya Nkwai Lynno Lo...
7666    It's Just Us Drama 2020 James Randall James Ra...
7667    Tee em el Horror 2020 Pereko Mosia Pereko Mosi...
Name: global, Length: 7664, dtype: object

In [15]:
df.head()

Unnamed: 0,name,genre,year,director,writer,star,country,global
0,The Shining,Drama,1980,Stanley Kubrick,Stephen King,Jack Nicholson,United States,The Shining Drama 1980 Stanley Kubrick Stephen...
1,The Blue Lagoon,Adventure,1980,Randal Kleiser,Henry De Vere Stacpoole,Brooke Shields,United States,The Blue Lagoon Adventure 1980 Randal Kleiser ...
2,Star Wars: Episode V - The Empire Strikes Back,Action,1980,Irvin Kershner,Leigh Brackett,Mark Hamill,United States,Star Wars: Episode V - The Empire Strikes Back...
3,Airplane!,Comedy,1980,Jim Abrahams,Jim Abrahams,Robert Hays,United States,Airplane! Comedy 1980 Jim Abrahams Jim Abraham...
4,Caddyshack,Comedy,1980,Harold Ramis,Brian Doyle-Murray,Chevy Chase,United States,Caddyshack Comedy 1980 Harold Ramis Brian Doyl...


### Preprocessing

In [16]:
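stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

def pretraitement(text):
    # Replace punctuation with spaces (raw string avoids invalid-escape warnings)
    text = re.sub(r"[^\w\s]", " ", text)
    # Tokenize, drop English stop words, lowercase and transliterate accents to ASCII
    text = ' '.join([unidecode(word.lower()) for word in word_tokenize(text) if word.lower() not in stop_words])
    return text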

In [17]:
df["global"] = df["global"].apply(pretraitement)

In [18]:
df["global"]

0       shining drama 1980 stanley kubrick stephen kin...
1       blue lagoon adventure 1980 randal kleiser henr...
2       star wars episode v empire strikes back action...
3       airplane comedy 1980 jim abrahams jim abrahams...
4       caddyshack comedy 1980 harold ramis brian doyl...
                              ...                        
7663    life drama 2020 joseph ebanks joseph ebanks sh...
7664    dream round comedy 2020 dusty dukatz lisa hust...
7665    saving mbango drama 2020 nkanya nkwai lynno lo...
7666    us drama 2020 james randall james randall chri...
7667    tee em el horror 2020 pereko mosia pereko mosi...
Name: global, Length: 7664, dtype: object
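
To make the steps of `pretraitement` concrete (strip punctuation, drop stop words, lowercase, remove accents), here is a dependency-free sketch. It is not the notebook's implementation: it assumes a tiny hand-picked stop-word set instead of the NLTK list, splits on whitespace instead of using `word_tokenize`, and uses `unicodedata` in place of `unidecode`. The names `simple_preprocess`, `strip_accents` and `STOP_WORDS` are hypothetical.

```python
import re
import unicodedata

# Tiny illustrative stop-word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and"}

def strip_accents(word):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def simple_preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text)        # punctuation -> spaces
    words = [w.lower() for w in text.split()]    # whitespace tokenization
    return " ".join(strip_accents(w) for w in words if w not in STOP_WORDS)

print(simple_preprocess("The Shining Drama 1980 Stanley Kubrick"))
print(simple_preprocess("Amélie, Comedy 2001"))
```

`unidecode` covers more characters than this decomposition trick (for example letters with no accent-free decomposition), which is one reason to keep it in the real pipeline.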

### VECTORIZATION

In [19]:
tfidf_vectorizer = TfidfVectorizer(min_df=20)
tfidf_matrix = tfidf_vectorizer.fit_transform(df["global"])
tfidf_matrix.shape

(7664, 516)
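
The shape `(7664, 516)` means that only 516 distinct tokens survive `min_df=20`, that is, tokens present in at least 20 of the 7664 descriptions. Rarer tokens, which likely include most film titles and many surnames, are dropped. The sketch below reproduces by hand the weighting scikit-learn applies by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization) on an invented three-document corpus. `tfidf` is a hypothetical helper, and tokenization is simplified to whitespace splitting, whereas the library's default pattern also discards single-character tokens.

```python
import math
from collections import Counter

def tfidf(docs, min_df=1):
    """Return (vocabulary, rows) with L2-normalized TF-IDF rows."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vocab = sorted(tok for tok, count in df.items() if count >= min_df)
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        rows.append([x / norm for x in row])
    return vocab, rows

docs = [
    "shining drama 1980 stanley kubrick",
    "airplane comedy 1980 jim abrahams",
    "caddyshack comedy 1980 harold ramis",
]
vocab, rows = tfidf(docs, min_df=2)
print(vocab)
print([[round(x, 3) for x in row] for row in rows])
```

With `min_df=2` only `1980` and `comedy` remain in the toy vocabulary, which mirrors how a high `min_df` makes the real model rely on shared genres, years, countries and common first names rather than on distinctive words.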

### SIMILARITY

In [20]:
sim = cosine_similarity(tfidf_matrix)
sim, type(sim)

(array([[1.        , 0.19939885, 0.18319076, ..., 0.07383379, 0.05511535,
         0.02395477],
        [0.19939885, 1.        , 0.19035892, ..., 0.        , 0.0152399 ,
         0.02489211],
        [0.18319076, 0.19035892, 1.        , ..., 0.        , 0.01400112,
         0.02286876],
        ...,
        [0.07383379, 0.        , 0.        , ..., 1.        , 0.54784297,
         0.77702632],
        [0.05511535, 0.0152399 , 0.01400112, ..., 0.54784297, 1.        ,
         0.4490865 ],
        [0.02395477, 0.02489211, 0.02286876, ..., 0.77702632, 0.4490865 ,
         1.        ]]),
 numpy.ndarray)
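
Each entry `sim[i, j]` is the cosine of the angle between the TF-IDF vectors of films `i` and `j`: 1 on the diagonal (a film compared with itself) and 0 when two films share no retained token. A minimal pure-Python sketch of the formula on invented vectors; `cosine` is a hypothetical helper, not the scikit-learn function:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||), or 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented vectors: the first two are parallel, the third is orthogonal to both.
vectors = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
toy_sim = [[cosine(u, v) for v in vectors] for u in vectors]
for row in toy_sim:
    print([round(x, 3) for x in row])
```

The resulting matrix is symmetric, which is also visible in the output above (`sim[0, 1]` equals `sim[1, 0]`).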

### SAVING

In [21]:
np.save("matrice_de_similarité.npy", sim)
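
Because the similarity matrix is dense, the saved file grows quadratically with the number of films. A back-of-the-envelope estimate for the 7664 films here, assuming 8-byte `float64` values and ignoring the small `.npy` header:

```python
n_films = 7664                      # rows of the TF-IDF matrix above
bytes_float64 = n_films * n_films * 8
bytes_float32 = n_films * n_films * 4

print(f"float64: {bytes_float64 / 1e6:.0f} MB")
print(f"float32: {bytes_float32 / 1e6:.0f} MB")
```

Casting with `sim.astype(np.float32)` before `np.save` would roughly halve the file, at the cost of some precision that a ranking task can probably tolerate.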

### TEST

In [22]:
matrix = np.load("matrice_de_similarité.npy")
matrix

array([[1.        , 0.19939885, 0.18319076, ..., 0.07383379, 0.05511535,
        0.02395477],
       [0.19939885, 1.        , 0.19035892, ..., 0.        , 0.0152399 ,
        0.02489211],
       [0.18319076, 0.19035892, 1.        , ..., 0.        , 0.01400112,
        0.02286876],
       ...,
       [0.07383379, 0.        , 0.        , ..., 1.        , 0.54784297,
        0.77702632],
       [0.05511535, 0.0152399 , 0.01400112, ..., 0.54784297, 1.        ,
        0.4490865 ],
       [0.02395477, 0.02489211, 0.02286876, ..., 0.77702632, 0.4490865 ,
        1.        ]])

In [23]:
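def film(namef, df=df, t=7):
    # Use the row position, not the index label: after dropna() the labels
    # no longer line up with the rows of the similarity matrix.
    pos = np.flatnonzero(df["name"].to_numpy() == namef)[0]
    # assign() returns a new frame, so the caller's DataFrame is left unchanged.
    df = df.assign(sim=100 * matrix[pos])
    df = df.sort_values(by='sim', ascending=False)
    # Skip the first row, which is the queried film itself.
    return df[1:t+1]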

In [24]:
R = film(namef = "American Gigolo")

In [25]:
R

Unnamed: 0,name,genre,year,director,writer,star,country,global,sim
4477,Crash,Crime,2004,Paul Haggis,Paul Haggis,Don Cheadle,United States,crash crime 2004 paul haggis paul haggis chead...,58.78742
4390,American Splendor,Biography,2003,Shari Springer Berman,Harvey Pekar,Paul Giamatti,United States,american splendor biography 2003 shari springe...,53.937856
2181,Light Sleeper,Crime,1992,Paul Schrader,Paul Schrader,Willem Dafoe,United Kingdom,light sleeper crime 1992 paul schrader paul sc...,53.631549
6780,Grandma,Comedy,2015,Paul Weitz,Paul Weitz,Lily Tomlin,United States,grandma comedy 2015 paul weitz paul weitz lily...,52.633512
1602,Scenes from the Class Struggle in Beverly Hills,Comedy,1989,Paul Bartel,Paul Bartel,Jacqueline Bisset,United States,scenes class struggle beverly hills comedy 198...,52.633512
791,DEFCON-4,Action,1985,Paul Donovan,Paul Donovan,Lenore Zann,United States,defcon 4 action 1985 paul donovan paul donovan...,52.327585
1747,Almost an Angel,Comedy,1990,John Cornell,Paul Hogan,Paul Hogan,United States,almost angel comedy 1990 john cornell paul hog...,50.030298
