# Week 11

## EXERCISE 1: Collaborative Filtering

Metadata:
This dataset, developed by GroupLens Research for their MovieLens Movie Recommender System Project, consists of four CSV files: movies.csv, ratings.csv, links.csv, and tags.csv. The data was collected over a period of 22 years. It includes only randomly selected, anonymized users who rated at least 20 movies, and only movies that received at least 1 rating or tag.

The columns in the dataset are as follows:

1. UserId: Random, anonymized IDs that identify users. The IDs form a complete sequence from 1 to 610.
2. MovieId: IDs assigned to identify movies. Not a complete sequence, as only movies with a rating or tag are included.

movies.csv:

- Title: Movie title along with the year of release.
- Genre: Tags representing genres, including options like action, adventure, animation, children's, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, western, and (no genre listed).

ratings.csv:

- Rating: User ratings for movies, ranging from 0.5 to 5 in steps of 0.5.
- Timestamp: UTC timestamp of when the movie was rated.
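
The timestamps are stored as plain integers. The sketch below assumes they are Unix epoch seconds, which is consistent with the ten-digit values in ratings.csv; the value used is the first timestamp shown in the ratings.csv preview in this notebook, and the variable names are illustrative only.

```python
import pandas as pd

# First rating timestamp from the ratings.csv preview (assumed: seconds since the Unix epoch).
raw_timestamp = 964982703

# unit="s" interprets the integer as seconds; utc=True returns a timezone-aware value.
rated_at = pd.to_datetime(raw_timestamp, unit="s", utc=True)
print(rated_at)
```

The same call accepts a whole column, for example `pd.to_datetime(df["timestamp"], unit="s", utc=True)`, if readable dates are needed.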

links.csv:

- ImdbId: Identifier that links to the movie's IMDb page.
- TmdbId: Identifier that links to the movie's page on The Movie Database (TMDB).

tags.csv:

- Tag: Word or short phrase describing the user's impressions about the movie.
- Timestamp: UTC timestamp of when the tag was assigned.

### User-Based Similarity

In [1]:
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv("ratings.csv")
df

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


# 2. Read the “ratings.csv” file and create a pivot table with index=“userId”, columns=“movieId”, values=“rating”.

In [3]:
pt=pd.pivot_table(index="userId", columns="movieId",values = "rating",data=df)
pt

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,
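
To make the shape of this table easier to see, here is a minimal sketch on a hypothetical five-row ratings frame (the values are invented for illustration and are not taken from ratings.csv). It shows that user-movie pairs without a rating come out as NaN, and it computes the share of cells that actually hold a rating.

```python
import pandas as pd

# Hypothetical miniature ratings table (not taken from ratings.csv).
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_pt = toy.pivot_table(index="userId", columns="movieId", values="rating")
print(toy_pt)

# Share of user-movie cells that hold an actual rating.
density = toy_pt.notna().to_numpy().mean()
print(f"density: {density:.2f}")
```

Note that pivot_table averages duplicate (userId, movieId) pairs by default; with one rating per pair, as in this toy frame, the values pass through unchanged.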


In [4]:
pt.fillna(0,inplace=True)
pt

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
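
Filling with 0 treats "not rated" the same as a rating of 0, which is lower than any real rating. One commonly used alternative, shown here only as a sketch and not as what this notebook does, is to subtract each user's mean rating first and then fill the gaps with 0, so that a missing rating means "no deviation from this user's average". The toy table and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie table with missing ratings.
toy_pt = pd.DataFrame(
    {10: [4.0, 5.0, np.nan], 20: [3.0, np.nan, 2.0], 30: [np.nan, 4.0, 5.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Subtract each user's own mean rating (NaN cells are ignored by mean),
# then treat unrated cells as 0, i.e. no deviation from that user's average.
centered = toy_pt.sub(toy_pt.mean(axis=1), axis=0).fillna(0)
print(centered)
```

Whether this helps depends on the data; the rest of the notebook continues with the zero-filled table.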


# 3. sklearn.metrics.pairwise_distances can be used to compute the distance between all pairs of users. pairwise_distances() takes a metric parameter that selects the distance measure. Use cosine similarity to measure similarity among users. Use the following packages.

# 4. from sklearn.metrics import pairwise_distances

# 5. from scipy.spatial.distance import cosine, correlation

# 6. Find the 5 users most similar to the user with userId 10.

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
user_similarity=cosine_similarity(pt)
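
The exercise text mentions pairwise_distances, while the cell above uses cosine_similarity. The two are related: cosine distance is 1 minus cosine similarity. The sketch below checks this on a small hypothetical matrix (the array X is invented for illustration).

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical rating vectors: rows 0 and 2 point in the same direction,
# row 1 shares no rated movie with them.
X = np.array([
    [4.0, 0.0, 4.0],
    [0.0, 5.0, 0.0],
    [2.0, 0.0, 2.0],
])

sim = cosine_similarity(X)                      # higher means more similar
dist = pairwise_distances(X, metric="cosine")   # lower means more similar

print(np.round(sim, 3))
print(np.allclose(sim, 1 - dist))
```

With a similarity matrix the best matches are the largest values; with a distance matrix they are the smallest, so the sort direction has to match the matrix in use.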


In [48]:
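def users_similar(user_id):
    # Row position of this user in the pivot table (and in user_similarity).
    index=np.where(pt.index==user_id)[0][0]
    # Sort by similarity, highest first; position 0 is normally the user itself, so skip it.
    similar_users=sorted(list(enumerate(user_similarity[index])),key=lambda x:x[1],reverse=True)[1:6]
    print("Top 5 user Ids Similar to",user_id)
    for i in similar_users:
        print(pt.index[i[0]])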

In [49]:
users_similar(10)

Top 5 user Ids Similar to 10
159
143
563
177
189
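
Skipping the first sorted entry relies on the user itself landing in position 0. If another user happens to have a similarity of exactly 1.0 and sits earlier in the table, the slice can drop that user and keep the target instead. A minimal sketch of a variant that removes the target by label is shown below; the helper name top_similar and the 4x4 matrix are hypothetical.

```python
import numpy as np
import pandas as pd

def top_similar(sim_matrix, labels, target, n=5):
    """Return the n labels most similar to target, excluding target itself."""
    pos = labels.get_loc(target)
    # Attach labels to the similarity row, then drop the target by label.
    scores = pd.Series(sim_matrix[pos], index=labels).drop(target)
    return scores.nlargest(n)

# Hypothetical similarity matrix in which users 30 and 40 are tied at 1.0.
labels = pd.Index([10, 20, 30, 40], name="userId")
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 1.0],
    [0.5, 0.3, 1.0, 1.0],
])

result = top_similar(sim, labels, target=40, n=2)
print(result)
```

Here user 30 is reported as the closest match to user 40, which a positional slice would have discarded.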


# 7. Use the “movies” dataset to find the names of the movies that user 2 and user 338 have both watched, and how each of them rated those movies.

In [9]:
movies=pd.read_csv("movies.csv")
movies

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [10]:
new_df=df.merge(movies,on="movieId",how="left")
new_df

,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,Logan (2017),Action|Sci-Fi


In [11]:
def two_users_sim(user_1, user_2):
    # Titles rated by each user.
    b_titles = set(new_df[new_df["userId"] == user_1]["title"])
    c_titles = set(new_df[new_df["userId"] == user_2]["title"])

    # Titles rated by both users.
    common_titles = list(b_titles.intersection(c_titles))
    print("The movies both rated are: \n", common_titles)

    b_df = new_df[(new_df['title'].isin(common_titles)) & (new_df["userId"] == user_1)]
    c_df = new_df[(new_df['title'].isin(common_titles)) & (new_df["userId"] == user_2)]

    print("\nDataFrame for user", user_1)
    print(b_df)
    
    print("\nDataFrame for user", user_2)
    print(c_df)


In [12]:
two_users_sim(2,338)

The movies both rated are: 
 ['Shawshank Redemption, The (1994)', 'Kill Bill: Vol. 1 (2003)']

DataFrame for user 2
     userId  movieId  rating   timestamp                             title  \
232       2      318     3.0  1445714835  Shawshank Redemption, The (1994)   
236       2     6874     4.0  1445714952          Kill Bill: Vol. 1 (2003)   

                    genres  
232            Crime|Drama  
236  Action|Crime|Thriller  

DataFrame for user 338
       userId  movieId  rating   timestamp                             title  \
51972     338      318     5.0  1530148309  Shawshank Redemption, The (1994)   
51982     338     6874     4.5  1530148453          Kill Bill: Vol. 1 (2003)   

                      genres  
51972            Crime|Drama  
51982  Action|Crime|Thriller  
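
The two printed frames can also be combined into one side-by-side table with an inner merge on the title. This is only a sketch of that idea: the helper name common_ratings, the placeholder titles "A", "B", "C", and the ratings are hypothetical.

```python
import pandas as pd

# Hypothetical ratings for two users; titles are placeholders.
toy = pd.DataFrame({
    "userId": [2, 2, 2, 338, 338],
    "title":  ["A", "B", "C", "B", "C"],
    "rating": [3.0, 4.0, 2.5, 5.0, 4.5],
})

def common_ratings(frame, user_1, user_2):
    """Return one row per title rated by both users, with both ratings."""
    left = frame.loc[frame["userId"] == user_1, ["title", "rating"]]
    right = frame.loc[frame["userId"] == user_2, ["title", "rating"]]
    # The default inner merge keeps only titles present on both sides.
    return left.merge(right, on="title",
                      suffixes=(f"_user{user_1}", f"_user{user_2}"))

common = common_ratings(toy, 2, 338)
print(common)
```

A threshold such as "both rated at least 4.0" then becomes a simple row filter on the two rating columns.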


# 8. Use the “movies” dataset to find the names of the movies that user 2 and user 338 both rated at least 4.0.


In [13]:
def two_users_movies(user_1, user_2):
    b_titles = set(new_df[(new_df["userId"] == user_1) & (new_df["rating"] >= 4.0)]["title"])
    c_titles = set(new_df[(new_df["userId"] == user_2) & (new_df["rating"] >= 4.0)]["title"])

    common_titles = list(b_titles.intersection(c_titles))
    print("The movies both watched and rated 4.0 or above are: \n", common_titles)


In [14]:
two_users_movies(2,338)

The movies both watched and rated 4.0 or above are: 
 ['Kill Bill: Vol. 1 (2003)']


### Item-Based Similarity

# 9. Create a pivot table of movies by users, then use correlation to measure the similarity among movies.

In [15]:
pt1 =pd.pivot_table(columns='userId', index='title', values='rating',data=new_df)
pt1 = pt1.fillna(0)
pt1


userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,,,,,,,,,,,,,,,,,,,,,
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
eXistenZ (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0
xXx (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,2.0
xXx: State of the Union (2005),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.5
¡Three Amigos! (1986),4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
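from sklearn.metrics import pairwise_distances
# metric='correlation' returns a distance (1 - Pearson correlation): 0 means identical rating patterns.
movie_similarity=pairwise_distances(pt1,metric='correlation')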
movie_similarity

array([[0.        , 1.00164204, 1.0023241 , ..., 0.67471266, 1.00818543,
        1.00164204],
       [1.00164204, 0.        , 0.293474  , ..., 1.00359434, 1.00818543,
        1.00164204],
       [1.0023241 , 0.293474  , 0.        , ..., 1.00508734, 1.01158546,
        1.0023241 ],
       ...,
       [0.67471266, 1.00359434, 1.00508734, ..., 0.        , 1.0179175 ,
        1.00359434],
       [1.00818543, 1.00818543, 1.01158546, ..., 1.0179175 , 0.        ,
        1.00818543],
       [1.00164204, 1.00164204, 1.0023241 , ..., 1.00359434, 1.00818543,
        0.        ]])
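
The values above are distances, not similarities: the correlation metric returns 1 minus the Pearson correlation between two rows, so 0 means perfectly correlated and values near 1 mean uncorrelated. The sketch below checks that relationship against np.corrcoef on a small hypothetical matrix.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical movie x user ratings: rows 0 and 1 are rated alike,
# row 2 has the opposite pattern.
X = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

dist = pairwise_distances(X, metric="correlation")
corr = np.corrcoef(X)   # Pearson correlation between rows

print(np.round(dist, 3))
print(np.allclose(dist, 1 - corr))
```

Because smaller values mean closer movies, the nearest neighbours are found by sorting in ascending order.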

# 10. Find the top 5 movies which are similar to the movie “Godfather”

In [50]:
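def movies_similar(movie_name):
    if movie_name in pt1.index:
        index = pt1.index.get_loc(movie_name)
        # movie_similarity holds distances, so sort ascending; position 0 is normally the movie itself.
        similar_movies = sorted(list(enumerate(movie_similarity[index])), key=lambda x: x[1])[1:6]
        print("Top 5 Movies Similar to",movie_name)
        for i in similar_movies:
            print(pt1.index[i[0]])
    else:
        # The title must match pt1.index exactly, e.g. "Godfather, The (1972)".
        print("Movie not found:",movie_name)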


In [51]:
movies_similar("Godfather, The (1972)")

Top 5 Movies Similar to Godfather, The (1972)
Godfather: Part II, The (1974)
Goodfellas (1990)
One Flew Over the Cuckoo's Nest (1975)
Reservoir Dogs (1992)
Fargo (1996)
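
The exercise asks about “Godfather”, but the lookup needs the exact title as stored in the index, "Godfather, The (1972)". A small helper for finding the exact spelling from a fragment could look like the sketch below; the function name find_titles is hypothetical, and the index contains only a few titles that appear in this notebook.

```python
import pandas as pd

# A handful of titles from this notebook, standing in for pt1.index.
titles = pd.Index(
    ["Godfather, The (1972)", "Godfather: Part II, The (1974)",
     "Goodfellas (1990)", "Fargo (1996)"],
    name="title",
)

def find_titles(index, fragment):
    """Return all titles that contain the fragment, ignoring case."""
    # regex=False treats the fragment as plain text rather than a pattern.
    mask = index.str.contains(fragment, case=False, regex=False)
    return list(index[mask])

matches = find_titles(titles, "godfather")
print(matches)
```

Any of the returned strings can then be passed unchanged to movies_similar.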
