In this notebook, I use two methods, **Content-Based Filtering and Collaborative Filtering**, to build a recommender system for movies.<br>
Both methods work from two CSV files: one lists the movies and their genres, and the other holds the ratings that **imdb** site users gave to these movies.

# Import libraries

We need the pandas and NumPy libraries for this notebook.

In [1]:
import pandas as pd
import numpy as np

# Insert data 

In [2]:
#Paths of the CSV files we have
path_movies = 'movies-i.csv'
path_ratings = 'ratings.csv'

#Load the data into pandas dataframes with read_csv
Movies = pd.read_csv(path_movies)
Ratings = pd.read_csv(path_ratings)

We look at the **head** of each CSV file to learn more about the data.

In [3]:
Movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
Ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


# Preprocessing

In this step, we explore and pre-process the two dataframes.<br>
First we go through the operations required for the Movies dataframe, and then those for the Ratings dataframe.

**Let's get to know our Movies dataframe better.**

I use the following commands, in order, to learn more about this dataframe and its data.
Then I clean the data, fixing any problems one by one.

In [5]:
Movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [6]:
Movies.describe(include ='all')

Unnamed: 0,movieId,title,genres
count,9742.0,9742,9742
unique,,9737,951
top,,Emma (1996),Drama
freq,,2,1053
mean,42200.353623,,
std,52160.494854,,
min,1.0,,
25%,3248.25,,
50%,7300.0,,
75%,76232.0,,


In [7]:
Movies.value_counts()

movieId  title                                                  genres                                     
1        Toy Story (1995)                                       Adventure|Animation|Children|Comedy|Fantasy    1
53322    Ocean's Thirteen (2007)                                Crime|Thriller                                 1
53129    Mr. Brooks (2007)                                      Crime|Drama|Thriller                           1
53138    Librarian: Return to King Solomon's Mines, The (2006)  Action|Adventure|Fantasy                       1
53140    Librarian: Quest for the Spear, The (2004)             Action|Adventure|Comedy|Fantasy|Romance        1
                                                                                                              ..
4390     Rape Me (Baise-moi) (2000)                             Crime|Drama|Thriller                           1
4392     Alice (1990)                                           Comedy|Drama|Fantasy|Romance         

In [8]:
Movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [9]:
print(Movies.shape)
#duplicated() only flags repeated rows; drop_duplicates() actually removes them
Movies = Movies.drop_duplicates()
print(Movies.shape)

(9742, 3)
(9742, 3)
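
The shape check above only covers rows that are identical in every column. The `describe` output shows fewer unique titles than rows, so it can also help to look at repeated titles separately. The following is a minimal sketch on a toy dataframe; the movie IDs are made up for illustration.

```python
import pandas as pd

# Toy data: the movieId values here are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'movieId': [101, 102, 1],
    'title': ['Emma (1996)', 'Emma (1996)', 'Toy Story (1995)'],
})

# Rows that are identical in every column.
full_duplicates = int(toy.duplicated().sum())

# Rows whose title repeats even though the movieId differs.
repeated_titles = toy[toy.duplicated(subset='title', keep=False)]

print(full_duplicates)
print(repeated_titles)
```

Repeated titles are not necessarily errors, but they matter later when movies are matched by title.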


In this part, I start splitting, adding, and removing columns.<br>
Let's walk through the preprocessing code below and look at the result of each step.

We add a new column named **year**.<br>
The same title can belong to movies produced in different years, so we keep the year in its own column.<br>
We extract the value for each row from the **title** column.<br>
This column becomes useful at the end, when we want to give more detailed information about a recommended movie.

In [10]:
#We use regex to find the year and the parentheses around it in the title column.
Movies['year'] = Movies['title'].str.extract(r'(\(\d\d\d\d\))', expand=False)

#In the year column we created, we remove the parentheses around the year
Movies['year'] = Movies['year'].str.extract(r'(\d\d\d\d)', expand=False)

In [11]:
#We remove the year and the parentheses around it from the title column
Movies['title'] = Movies['title'].str.replace(r'(\(\d\d\d\d\))', '', regex=True)

#We strip the leading and trailing whitespace left in the title column
Movies['title'] = Movies['title'].str.strip()
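
As a compact illustration of the year extraction, here is a sketch on a toy dataframe. It is a variation, not the notebook's exact code: it assumes the year always sits at the end of the title, so the patterns are anchored with `$`, and the toy titles are chosen only for illustration.

```python
import pandas as pd

# Toy titles in the same "Title (YYYY)" shape as the Movies dataframe.
toy = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Rape Me (Baise-moi) (2000)', 'No Year Here'],
})

# Capture only the four digits of a trailing "(YYYY)"; titles without one get NaN.
toy['year'] = toy['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)

# Remove that trailing "(YYYY)" and any surrounding whitespace from the title.
toy['title'] = toy['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True).str.strip()

print(toy)
```

Note that a parenthesised alternative title such as "(Baise-moi)" stays in the title, which affects any later exact match on the title.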



Next, we make the genre information of each movie accessible.<br>
The genres of each movie are separated by the delimiter **(|)**.<br>
We want to create a list of genres for each movie.<br>
We do this in the following steps.

In [12]:
#We want to have a list of genres of each movie, so...
Movies['genres']=Movies['genres'].str.split('|')

In [13]:
#We see the changes
Movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


In [14]:
Moviesgenres =Movies.copy()

We build the genre matrix by setting the value 1 for each genre that a movie has.<br>
The cells that are never set are left as NaN.
Using fillna, we replace these NaN values with 0.

In [15]:
for index, row in Movies.iterrows():
    for genre in row['genres']:
        Moviesgenres.at[index, genre] =1


Moviesgenres =Moviesgenres.fillna(0)
Moviesgenres.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
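
The same kind of genre matrix can also be built without an explicit loop. This is a sketch on toy data, assuming the genres are still in their original pipe-delimited string form; if they have already been split into lists, `.str.join('|')` would bring them back to that form first.

```python
import pandas as pd

# Toy movies with pipe-delimited genres, as in the raw CSV.
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Children|Fantasy', 'Comedy|Romance', 'Comedy'],
})

# One column per genre: 1 where the movie has the genre, 0 otherwise.
genre_matrix = toy['genres'].str.get_dummies(sep='|')

# Put the movieId next to the genre columns.
toy_genres = pd.concat([toy[['movieId']], genre_matrix], axis=1)
print(toy_genres)
```

Unlike the loop, this produces the genre columns in alphabetical order and fills the zeros directly, so no `fillna` is needed.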


Now we preprocess the **Ratings dataframe**.

In [16]:
Ratings.value_counts()

userId  movieId  rating  timestamp 
1       1        4.0     964982703     1
434     4993     5.0     1270604133    1
        4963     4.0     1270604560    1
        4896     2.5     1270604915    1
        4886     4.5     1270604658    1
                                      ..
227     58303    4.0     1447210409    1
        56782    4.5     1447210013    1
        56367    4.5     1447210824    1
        55820    4.0     1447209881    1
610     170875   3.0     1493846415    1
Length: 100836, dtype: int64

In [17]:
print(Ratings.shape)
#duplicated() only flags repeated rows; drop_duplicates() actually removes them
Ratings = Ratings.drop_duplicates()
print(Ratings.shape)

(100836, 4)
(100836, 4)


In [18]:
Ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [19]:
Ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [20]:
Ratings.describe(include ='all')

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


We make the dataframe lighter.<br>
The **timestamp** column is not used in this notebook, so we delete it.

In [21]:
Ratings.drop('timestamp',axis =1,inplace =True)
Ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


<hr>

<a id="ref1"></a>

# Content-Based 

We start with the **Content-Based recommendation method**: we score the genres according to the user's ratings, then find the movies whose genres are closest to that profile and recommend them to the user.

We create the **userinput dataframe** holding the movies the user has rated and the ratings.

In [22]:
userinput =[
            {'title':'Rape Me', 'rating':4},
            {'title':'Toy Story', 'rating':2.5},
            {'title':'Pulp Fiction', 'rating':5},
            {'title':'Alice', 'rating':1.5},
            {'title':'Another Woman', 'rating':1},
            {'title':'Jumanji', 'rating':2},] 

inputMovies =pd.DataFrame(userinput)
inputMovies.head()

Unnamed: 0,title,rating
0,Rape Me,4.0
1,Toy Story,2.5
2,Pulp Fiction,5.0
3,Alice,1.5
4,Another Woman,1.0


In this section, we find the movieId of each movie that the user has rated.
We match on titles.<br>
We **look for the same titles** in the user's dataframe and in the main Movies dataframe.<br>
The **result of this search is a dataframe** containing the rows of Movies whose titles appear in the user's dataframe.<br>
We then merge these two dataframes to have the **ratings and the titles** together, and remove the extra year and genres columns.

In [23]:
#Apply a filter to the Movies data frame to get the data frame that has the same titles as the user frame
Idsinput =Movies[Movies['title'].isin(inputMovies['title'].tolist())]

Idsinput

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
257,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
3249,4392,Alice,"[Comedy, Drama, Fantasy, Romance]",1990
3250,4393,Another Woman,[Drama],1988
7211,72982,Alice,"[Action, Adventure, Fantasy]",2009


<br>

In [24]:
#Merge the two dataframes on title to have movie IDs and ratings together
inputMovies = pd.merge(Idsinput, inputMovies)

#Delete the columns we don't need
inputMovies = inputMovies.drop(columns=['genres', 'year'])
inputMovies.head()



Unnamed: 0,movieId,title,rating
0,1,Toy Story,2.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,4392,Alice,1.5
4,72982,Alice,1.5
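
Matching by title alone has two visible effects in the outputs above: "Alice" matches two different movies, and "Rape Me" matches none, because the stored title is "Rape Me (Baise-moi)". The sketch below shows one way to surface both cases on toy data. The variable names are hypothetical, and using a left merge with `indicator=True` is an assumption for this sketch, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the Movies and user-input dataframes.
movies = pd.DataFrame({
    'movieId': [1, 4392, 72982, 4390],
    'title': ['Toy Story', 'Alice', 'Alice', 'Rape Me (Baise-moi)'],
    'year': ['1995', '1990', '2009', '2000'],
})
user_input = pd.DataFrame({
    'title': ['Toy Story', 'Alice', 'Rape Me'],
    'rating': [2.5, 1.5, 4.0],
})

# A left merge keeps every user title; the _merge column marks titles with no match.
matched = user_input.merge(movies, on='title', how='left', indicator=True)

unmatched = matched.loc[matched['_merge'] == 'left_only', 'title'].tolist()
counts = matched.loc[matched['_merge'] == 'both', 'title'].value_counts()
ambiguous = sorted(counts[counts > 1].index)

print('No match:', unmatched)
print('More than one match:', ambiguous)
```

Asking the user for the year as well, and merging on both title and year, would be one way to resolve the ambiguous titles.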


<br>
<br>

In the same way as above, we do this for the genre matrix of each movie.

In [25]:
userMoviesG =Moviesgenres[Moviesgenres['movieId'].isin(inputMovies['movieId'].tolist())]

userMoviesG = userMoviesG.drop(columns=['title', 'movieId', 'year', 'genres'])
userMoviesG=userMoviesG.reset_index(drop=True)
userMoviesG



Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We compute the weight of each genre for the user by multiplying the user's genre matrix by the user's ratings.

In [26]:
UserProfile =userMoviesG.transpose().dot(inputMovies['rating'])
UserProfile

Adventure             5.5
Animation             2.5
Children              4.5
Comedy                9.0
Fantasy               7.0
Romance               1.5
Drama                 8.0
Action                1.0
Crime                 5.0
Thriller              5.0
Horror                0.0
Mystery               0.0
Sci-Fi                0.0
War                   0.0
Musical               0.0
Documentary           0.0
IMAX                  0.0
Western               0.0
Film-Noir             0.0
(no genres listed)    0.0
dtype: float64
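
One thing to watch in this step: `dot` pairs the rows of the genre matrix with the ratings through their index labels, and both objects here carry a fresh 0..n index. In the outputs above, row 4 of `userMoviesG` is a Drama-only movie, while row 4 of `inputMovies` is movieId 72982, so the two row orders may not refer to the same movies. A minimal sketch of a safer variant, assuming both objects are indexed by movieId (toy values, hypothetical names):

```python
import pandas as pd

# Toy genre matrix indexed by movieId.
genre_matrix = pd.DataFrame(
    {'Drama': [1, 0, 1], 'Action': [0, 1, 0]},
    index=pd.Index([4392, 72982, 4393], name='movieId'),
)

# Toy ratings, also indexed by movieId but listed in a different order.
ratings = pd.Series(
    [1.0, 1.5, 1.5],
    index=pd.Index([4393, 4392, 72982], name='movieId'),
)

# Reorder the ratings to follow the matrix rows, then weight each genre by the rating.
aligned = ratings.reindex(genre_matrix.index)
user_profile = genre_matrix.T.dot(aligned)
print(user_profile)
```

With movieId as the index on both sides, each rating is multiplied with the genres of its own movie regardless of row order.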

We index the genre matrix by movieId and keep only the genre columns.

In [27]:
#set_index('movieId') moves the column into the index, so it no longer needs to be dropped
Tablegenres = Moviesgenres.set_index('movieId')

Tablegenres = Tablegenres.drop(columns=['title', 'year', 'genres'])
Tablegenres



movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<br>
<hr>

Now the main work starts: we want to score every movie in the movie list against the user profile.<br>
For each movie, we sum the profile weights of its genres and divide by the sum of all profile weights.

In [28]:
recommendationTable =((Tablegenres*UserProfile).sum(axis=1))/(UserProfile.sum())
recommendationTable

movieId
1         0.581633
2         0.346939
3         0.214286
4         0.377551
5         0.183673
            ...   
193581    0.397959
193583    0.377551
193585    0.163265
193587    0.071429
193609    0.183673
Length: 9742, dtype: float64
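
To make the scoring step concrete, here is a small sketch on toy data. The profile weights and movie IDs are illustrative, and the final step of dropping movies the user has already rated is an assumed extra, not something the notebook does.

```python
import pandas as pd

# Toy user profile (genre weights) and genre table indexed by movieId.
user_profile = pd.Series({'Comedy': 9.0, 'Drama': 8.0, 'Action': 1.0})
genre_table = pd.DataFrame(
    {'Comedy': [1, 0, 1], 'Drama': [1, 1, 0], 'Action': [0, 1, 0]},
    index=pd.Index([10, 20, 30], name='movieId'),
)

# Score = sum of the profile weights of the movie's genres / sum of all weights.
scores = genre_table.mul(user_profile, axis=1).sum(axis=1) / user_profile.sum()

# Hypothetical extra step: leave out movies the user has already rated.
already_rated = [30]
scores = scores.drop(index=already_rated, errors='ignore').sort_values(ascending=False)
print(scores)
```

Because the score is a sum over genres, movies that list many genres tend to score higher than movies with a single well-matched genre.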

We sort the output series in descending order.

In [29]:
recommendationTable =recommendationTable.sort_values(ascending=False)
recommendationTable

movieId
134853    0.744898
117646    0.724490
6902      0.704082
148775    0.693878
81132     0.683673
            ...   
134524    0.000000
5288      0.000000
3604      0.000000
134796    0.000000
27667     0.000000
Length: 9742, dtype: float64

Using **movieId**, we look up the titles and details of the 10 movies with the highest scores and suggest them to the user.

In [30]:
Movies.loc[Movies['movieId'].isin(recommendationTable.head(10).keys())]

Unnamed: 0,movieId,title,genres,year
1584,2123,All Dogs Go to Heaven,"[Animation, Children, Comedy, Drama, Fantasy]",1989
2250,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
3460,4719,Osmosis Jones,"[Action, Animation, Comedy, Crime, Drama, Roma...",2001
4631,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
5700,27790,Millions,"[Children, Comedy, Crime, Drama, Fantasy]",2004
5808,31921,"Seven-Per-Cent Solution, The","[Adventure, Comedy, Crime, Drama, Mystery, Thr...",1976
7441,81132,Rubber,"[Action, Adventure, Comedy, Crime, Drama, Film...",2010
8597,117646,Dragonheart 2: A New Beginning,"[Action, Adventure, Comedy, Drama, Fantasy, Th...",2000
8900,134853,Inside Out,"[Adventure, Animation, Children, Comedy, Drama...",2015
9169,148775,Wizards of Waverly Place: The Movie,"[Adventure, Children, Comedy, Drama, Fantasy, ...",2009
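
Filtering with `isin` returns the rows in the order of the Movies dataframe, so the table above is sorted by position rather than by score. If the ranking should be visible, one option is to merge the top scores with the movie table. A sketch on toy data with hypothetical names:

```python
import pandas as pd

# Toy scores indexed by movieId, plus a toy movie table.
scores = pd.Series({3: 0.2, 1: 0.9, 2: 0.5}, name='score')
scores.index.name = 'movieId'
movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})

# Take the top scores first, then attach the titles.
# A left merge keeps the row order of the left table, i.e. the ranking.
top = scores.sort_values(ascending=False).head(2).reset_index()
ranked = top.merge(movies, on='movieId', how='left')
print(ranked)
```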


<br>
<br>
<hr>

# Collaborative Filtering

In the second part of the notebook, we build a recommender system using **collaborative filtering**.<br>
This approach has two variants, user-based and item-based; we use the user-based variant.<br>
Here, based on the similarity between users' tastes, we suggest movies that the user has not seen.<br>
I liked the previous method more, and I think that if we suggest the genres the user likes, the user will be more likely to click on the suggestions.<br>
However, we will test this method on the data, look at its results, and then investigate further.

<br>

Like the previous method, we start from the beginning and perform the necessary operations on the data.<br>
We go step by step to see the output of this method.

# Import libraries

In [31]:
import pandas as pd
import numpy as np

# Insert data

In [32]:
path_movie ='movies-i.csv'
path_ratings ='ratings.csv'

Movies =pd.read_csv(path_movie)
Ratings =pd.read_csv(path_ratings)

We look at the **head** of the two dataframes.

In [33]:
Movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [34]:
Ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


<br>

# Preprocessing

At this stage, we take a general look at each dataframe and fix any problems in the data.<br>
Finally, we make each dataframe tidier and more readable for the model we are building.

The first steps are the same as in the previous part, but the later ones differ.

In [35]:
Movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [36]:
Movies.describe(include ='all')

Unnamed: 0,movieId,title,genres
count,9742.0,9742,9742
unique,,9737,951
top,,Emma (1996),Drama
freq,,2,1053
mean,42200.353623,,
std,52160.494854,,
min,1.0,,
25%,3248.25,,
50%,7300.0,,
75%,76232.0,,


In [37]:
Movies.value_counts()

movieId  title                                                  genres                                     
1        Toy Story (1995)                                       Adventure|Animation|Children|Comedy|Fantasy    1
53322    Ocean's Thirteen (2007)                                Crime|Thriller                                 1
53129    Mr. Brooks (2007)                                      Crime|Drama|Thriller                           1
53138    Librarian: Return to King Solomon's Mines, The (2006)  Action|Adventure|Fantasy                       1
53140    Librarian: Quest for the Spear, The (2004)             Action|Adventure|Comedy|Fantasy|Romance        1
                                                                                                              ..
4390     Rape Me (Baise-moi) (2000)                             Crime|Drama|Thriller                           1
4392     Alice (1990)                                           Comedy|Drama|Fantasy|Romance         

In [38]:
Movies.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

In [39]:
print(Movies.shape)
#duplicated() only flags repeated rows; drop_duplicates() actually removes them
Movies = Movies.drop_duplicates()
print(Movies.shape)

(9742, 3)
(9742, 3)


In [40]:
#We use regex to find the year and the parentheses around it in the title column.
Movies['year'] = Movies['title'].str.extract(r'(\(\d\d\d\d\))', expand=False)

#In the year column we created, we remove the parentheses around the year
Movies['year'] = Movies['year'].str.extract(r'(\d\d\d\d)', expand=False)

In [41]:
#We remove the year and the parentheses around it from the title column
Movies['title'] = Movies['title'].str.replace(r'(\(\d\d\d\d\))', '', regex=True)

#We strip the leading and trailing whitespace left in the title column
Movies['title'] = Movies['title'].str.strip()



In this model, we don't need the **genres** column for the recommender system we want to build.<br>
We delete this column from the Movies dataframe.

In [42]:
Movies=Movies.drop('genres',axis=1)

In [43]:
#We see the changes in this data frame
Movies.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


<br>

We start checking the **Ratings dataframe**.

In [44]:
Ratings.value_counts()

userId  movieId  rating  timestamp 
1       1        4.0     964982703     1
434     4993     5.0     1270604133    1
        4963     4.0     1270604560    1
        4896     2.5     1270604915    1
        4886     4.5     1270604658    1
                                      ..
227     58303    4.0     1447210409    1
        56782    4.5     1447210013    1
        56367    4.5     1447210824    1
        55820    4.0     1447209881    1
610     170875   3.0     1493846415    1
Length: 100836, dtype: int64

In [45]:
print(Ratings.shape)
#duplicated() only flags repeated rows; drop_duplicates() actually removes them
Ratings = Ratings.drop_duplicates()
print(Ratings.shape)

(100836, 4)
(100836, 4)


In [46]:
Ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [47]:
Ratings.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [48]:
Ratings.describe(include ='all')

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


As in the previous method, we do not need the **timestamp** column in this notebook, so we delete it.<br>
Below we can see the final state of this dataframe.

In [49]:
Ratings.drop('timestamp',axis =1,inplace =True)
Ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


<br>
<br>
<hr>

# Collaborative Filtering

In this method, we first create the **user dataframe**.<br>
It consists of the movies the user has seen and the ratings the user gave them.

The initial steps are the same as in the previous system.

In [50]:
userinput =[
            {'title':'Rape Me', 'rating':4},
            {'title':'Toy Story', 'rating':2.5},
            {'title':'Pulp Fiction', 'rating':5},
            {'title':'Alice', 'rating':1.5},
            {'title':'Another Woman', 'rating':1},
            {'title':'Jumanji', 'rating':2},
           ] 

inputMovies =pd.DataFrame(userinput)
inputMovies.head()

Unnamed: 0,title,rating
0,Rape Me,4.0
1,Toy Story,2.5
2,Pulp Fiction,5.0
3,Alice,1.5
4,Another Woman,1.0


In [51]:
#Apply a filter to the Movies data frame to get the data frame that has the same titles as the user frame
Idsinput =Movies[Movies['title'].isin(inputMovies['title'].tolist())]

Idsinput

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
257,296,Pulp Fiction,1994
3249,4392,Alice,1990
3250,4393,Another Woman,1988
7211,72982,Alice,2009


In [52]:
#Merge the two dataframes on title to have movie IDs and ratings together
inputMovies = pd.merge(Idsinput, inputMovies)

#Delete the columns we don't need
inputMovies = inputMovies.drop(columns='year')
inputMovies.head()



Unnamed: 0,movieId,title,rating
0,1,Toy Story,2.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,4392,Alice,1.5
4,72982,Alice,1.5


We select the ratings of the users who have seen at least one of the movies that the input user has seen.<br>
We keep them in a new dataframe **(Subset)**.<br>
For each of these users, the number of shared movies may be smaller than, or exactly equal to, the number of movies of the input user.

In [53]:
Subset =Ratings[Ratings['movieId'].isin(inputMovies['movieId'].tolist())]
Subset.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
16,1,296,3.0
320,4,296,1.0
516,5,1,4.0
533,5,296,5.0


We group the rows by userId.<br>
Each group then holds the movies from the input list that this user has rated.

In [54]:
SubsetGroup =Subset.groupby('userId')
SubsetGroup

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F34E3B76A0>

Now we sort these groups by the number of movies they share with the input user, in descending order.<br>
The user who has the most movies in common with the input user comes first.

In [55]:
SubsetGroup =sorted(SubsetGroup,  key =lambda x: len(x[1]), reverse =True)
SubsetGroup

[(599,
         userId  movieId  rating
  92623     599        1     3.0
  92624     599        2     2.5
  92742     599      296     5.0
  93861     599     4392     2.5),
 (605,
         userId  movieId  rating
  97143     605        1     4.0
  97144     605        2     3.5
  97151     605      296     2.0
  97357     605    72982     4.0),
 (18,
        userId  movieId  rating
  1772      18        1     3.5
  1773      18        2     3.0
  1796      18      296     4.0),
 (21,
        userId  movieId  rating
  3219      21        1     3.5
  3220      21        2     3.5
  3231      21      296     3.5),
 (68,
         userId  movieId  rating
  10360      68        1     2.5
  10361      68        2     2.5
  10419      68      296     2.0),
 (91,
         userId  movieId  rating
  14121      91        1     4.0
  14122      91        2     3.0
  14173      91      296     4.5),
 (103,
         userId  movieId  rating
  15565     103        1     4.0
  15566     103        2   

In [56]:
print(len(SubsetGroup))

399


As we go down this sorted list, users share fewer and fewer movies with the input user, so we keep only the first 100 of the users who have at least one movie in common.<br>
We then build the system from these users and their data.

In [57]:
SubsetGroup =SubsetGroup[0:100]
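
The selection of neighbours can also be expressed through counts instead of sorting the groups themselves. This is a sketch on toy data; the thresholds `min_common` and `max_users` are hypothetical names. Requiring at least two shared movies is an added assumption, motivated by the fact that a Pearson correlation is not defined for a single pair of ratings.

```python
import pandas as pd

# Toy ratings restricted to the input user's movies (like Subset above).
subset = pd.DataFrame({
    'userId':  [7, 7, 7, 8, 9, 9],
    'movieId': [1, 2, 296, 1, 1, 296],
    'rating':  [3.0, 2.5, 5.0, 4.0, 4.0, 3.0],
})

# Number of movies each user shares with the input user, largest first.
common_counts = subset.groupby('userId').size().sort_values(ascending=False)

# Hypothetical thresholds: at least 2 shared movies, at most 100 users.
min_common, max_users = 2, 100
selected_users = common_counts[common_counts >= min_common].head(max_users).index.tolist()
print(selected_users)
```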

<br>
<br>

Now we compute the Pearson correlation between the input user and each user in this subgroup, using the movies they have in common.<br>
We save these values in a dictionary.
Each key of this dictionary is a user ID, and its value is that user's correlation coefficient.

In [58]:
import math
from scipy import stats

CorrelationDict = {}

#We sort the input movies once so both rating lists line up by movie ID.
inputMovies = inputMovies.sort_values(by='movieId')

for name, group in SubsetGroup:
    #We sort each group by movie ID as well.
    group = group.sort_values(by='movieId')

    #We keep only the input movies that this user has also rated.
    temp = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]

    tempRatingList = temp['rating'].tolist()
    tempGroupList = group['rating'].tolist()

    #Pearson correlation between the two rating lists;
    #a NaN result (for example, when one list is constant) is stored as 0.
    corr = stats.pearsonr(tempRatingList, tempGroupList)[0]
    CorrelationDict[name] = 0 if math.isnan(corr) else corr



In [59]:
CorrelationDict.items()

dict_items([(599, 0.9908301680442991), (605, -0.9345030284511886), (18, 0.9332565252573826), (21, 0), (68, -0.9878291611472618), (91, 0.8485552916276632), (103, 0.9878291611472618), (107, -0.9332565252573826), (135, 0.6286185570937121), (140, 0.7777137710478189), (144, 0.9332565252573826), (153, 0.9878291611472618), (160, 0.9878291611472618), (177, 0.6286185570937122), (202, 0), (217, -0.3592106040535498), (219, 0.8485552916276632), (226, 0.9843241382880894), (232, 0.8824975032927698), (240, -0.9878291611472622), (249, 0), (274, 0.9843241382880894), (288, 0.7419354838709677), (298, 0.9750002110024923), (304, 0.6286185570937121), (307, 0.7970167702187487), (322, 0.9962709627734359), (323, 0.35921060405354976), (330, 0.6286185570937122), (353, 0.6286185570937121), (357, 0.42341515917871025), (359, 0.6286185570937121), (373, 0.987829161147262), (380, 0), (387, 0.6758453353343745), (411, 0.6286185570937121), (414, 0.9332565252573826), (432, -0.6286185570937122), (434, 0.8858920666876039), 
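
To show what the correlation value means, here is a sketch that computes the Pearson coefficient directly with NumPy. The helper `pearson_or_zero` is a hypothetical name, and returning 0 for undefined cases mirrors the NaN handling above. The first call uses the ratings of user 18 from the sorted groups shown earlier, paired with the input user's ratings for the same movies.

```python
import numpy as np

def pearson_or_zero(x, y):
    """Pearson correlation of two equal-length rating lists, or 0.0 when undefined."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if len(x) != len(y):
        raise ValueError('rating lists must have the same length')
    if len(x) < 2:
        return 0.0
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    # A constant list has zero variance, so the correlation is undefined.
    if denom == 0:
        return 0.0
    return float((dx * dy).sum() / denom)

# Input user vs. user 18 on movies 1, 2 and 296.
print(pearson_or_zero([2.5, 2.0, 5.0], [3.5, 3.0, 4.0]))
# A user who gives the same rating to every shared movie.
print(pearson_or_zero([2.5, 2.0, 5.0], [3.5, 3.5, 3.5]))
```

With only three or four shared movies, coefficients close to 1 or -1 are easy to reach, so these values are a rough signal rather than a precise similarity measure.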

From the similarity indices obtained above, we create the similarity index data frame.<br>
Then we sort this data frame in descending order.

In [60]:
pearsonDF =pd.DataFrame.from_dict(CorrelationDict, orient='index')
pearsonDF.columns =['similarityIndex']
pearsonDF['userId'] =pearsonDF.index
pearsonDF.index =range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.99083,599
1,-0.934503,605
2,0.933257,18
3,0.0,21
4,-0.987829,68


In [61]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)
topUsers.head()

Unnamed: 0,similarityIndex,userId
99,1.0,178
73,1.0,66
55,1.0,5
58,1.0,15
59,1.0,17


Now we get the movie IDs and ratings of each selected user from the Ratings dataframe and store them in the topUsersRating dataframe.<br>
(In other words, we merge these two dataframes on the userId column that they share.)<br>
This gives us every movie rated by each of the 100 selected users.

In [62]:
topUsersRating =topUsers.merge(Ratings, left_on ='userId', right_on ='userId', how ='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,178,1,4.0
1,1.0,178,10,4.0
2,1.0,178,25,4.5
3,1.0,178,47,4.5
4,1.0,178,50,4.5


We weight each rating by the similarity between its user and the input user.

In [63]:
topUsersRating['weightedRating'] =topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,178,1,4.0,4.0
1,1.0,178,10,4.0,4.0
2,1.0,178,25,4.5,4.5
3,1.0,178,47,4.5,4.5
4,1.0,178,50,4.5,4.5


We group by movie ID and, for each movie, sum the similarity indices and the weighted ratings.<br>


In [64]:
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()


movieId,sum_similarityIndex,sum_weightedRating
1,34.148255,114.124637
2,27.257213,77.961882
3,9.631751,26.699377
4,0.201474,-0.883479
5,7.973322,19.766889


In this step, we get the weighted average for each movie.<br>
Now we have a score for each movie, and by sorting the scores we can show the highest-scoring movies to the user to choose from.

In [65]:
recommendation =pd.DataFrame()
recommendation['weighted average recommendation score'] =tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation['movieId'] =tempTopUsersRating.index
recommendation.head()

movieId,weighted average recommendation score,movieId
1,3.342034,1
2,2.860229,2
3,2.772017,3
4,-4.38507,4
5,2.479128,5


In [66]:
recommendation =recommendation.sort_values(by='weighted average recommendation score', ascending=False)
recommendation.head(10)

movieId,weighted average recommendation score,movieId
80860,308.5,80860
79,235.724104,79
57951,193.0,57951
61,166.079879,61
240,165.079879,240
270,165.079879,270
6686,160.480866,6686
888,159.980866,888
93721,118.5,93721
460,111.053253,460
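
The scores above reach values such as 308.5 and -4.38507, far outside the 0.5 to 5.0 rating scale. This can happen when positive and negative similarity indices nearly cancel in the denominator. The sketch below reproduces the effect on toy data and shows one possible variation, keeping only neighbours with a positive similarity. That variation is an assumption for illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy version of topUsersRating: movie 10 is rated by a similar and a dissimilar user.
toy = pd.DataFrame({
    'similarityIndex': [0.9, -0.8, 0.5],
    'movieId': [10, 10, 20],
    'rating': [4.0, 5.0, 3.0],
})

# Plain weighted average: the weights 0.9 and -0.8 nearly cancel in the denominator.
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']
sums = toy.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()
raw_score = sums['weightedRating'] / sums['similarityIndex']

# Assumed variation: keep only neighbours with a positive similarity.
positive = toy[toy['similarityIndex'] > 0]
pos_sums = positive.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()
bounded_score = pos_sums['weightedRating'] / pos_sums['similarityIndex']

print(raw_score)
print(bounded_score)
```

With only positive weights, each score is a true weighted average of ratings and therefore stays within the rating scale.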


The movies suggested at the end are chosen for the user based on the similarity between users.

In [68]:
Movies =pd.read_csv(path_movie)

In [69]:
Movies.loc[Movies['movieId'].isin(recommendation.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,genres
54,61,Eye for an Eye (1996),Drama|Thriller
71,79,"Juror, The (1996)",Drama|Thriller
206,240,Hideaway (1995),Thriller
232,270,Love Affair (1994),Drama|Romance
401,460,Getting Even with Dad (1994),Comedy
673,888,Land Before Time III: The Time of the Great Gi...,Adventure|Animation|Children|Musical
4513,6686,"Medallion, The (2003)",Action|Comedy|Crime|Fantasy
6680,57951,Fool's Gold (2008),Action|Adventure|Comedy|Romance
7432,80860,Life as We Know It (2010),Comedy|Romance
7850,93721,Jiro Dreams of Sushi (2011),Documentary


In [70]:
Movies.loc[Movies['movieId'].isin(Idsinput.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
3249,4392,Alice (1990),Comedy|Drama|Fantasy|Romance
3250,4393,Another Woman (1988),Drama
7211,72982,Alice (2009),Action|Adventure|Fantasy


Now let's look at how this system performed.<br>
As you can see, the genres that the user paid the most attention to and liked appear less often in these recommendations.<br>
Still, it is not possible to say exactly which recommender system performs better.<br>
**Maybe it depends on the product and goods that we sell!**