## Movie Ratings Data Analysis

### Introduction
This notebook analyzes movie ratings data with the pandas and joblib libraries. It ranks movies by a popularity score and finds similar movies from correlations between user ratings.

### Data Loading
The code begins by loading two datasets: **movies.csv**, which contains movie details (ID, title, genres), and **ratings.csv**, which contains user ratings (user ID, movie ID, rating, timestamp).
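The loading step can be tried without the real files. The following is a minimal sketch that reads a few rows copied from the notebook output out of in-memory buffers; the `io.StringIO` stand-ins and the sanity checks are illustrative additions, not part of the original notebook.

```python
import io

import pandas as pd

# In-memory stand-ins for movies.csv and ratings.csv (assumption: the real
# files have the same columns as the samples shown in the notebook output).
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    '3753,"Patriot, The (2000)",Action|Drama|War\n'
    "719,Multiplicity (1996),Comedy\n"
    "1519,Broken English (1996),Drama\n"
)
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "462,1288,3.5,1124158003\n"
    "365,93510,4.5,1488333289\n"
    "140,1214,4.0,1080845712\n"
)

movies = pd.read_csv(movies_csv)
ratings = pd.read_csv(ratings_csv)

# Basic sanity checks before any aggregation.
assert {"movieId", "title", "genres"} <= set(movies.columns)
assert {"userId", "movieId", "rating"} <= set(ratings.columns)
assert movies["movieId"].is_unique  # one row per movie
```

Checking that `movieId` is unique in `movies` guards the later merge against silently duplicating rows.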

### Popularity Score Calculation
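Next, the code computes a popularity score (`popscore3`) for each movie. It aggregates the ratings per movie into a count and a mean, min-max scales the count to the range 0 to 1, and multiplies it by the cube of the mean rating, so the score rewards movies that are both frequently and highly rated.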
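As a worked example of the score, the sketch below applies the same formula used in the notebook, `(count - min_count) / (max_count - min_count) * mean ** 3`, to a toy table. The helper name `popularity_score`, its `power` parameter, and the toy values are hypothetical and only serve the illustration.

```python
import pandas as pd


def popularity_score(df, power=3):
    """Min-max scale the rating count and weight it by mean ** power."""
    min_count = df["count"].min()
    span = df["count"].max() - min_count
    scaled_count = (df["count"] - min_count) / span
    return (scaled_count * df["mean"] ** power).round(2)


# Toy aggregates (illustrative values, not taken from the dataset).
toy = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "count": [1, 51, 101],
        "mean": [5.0, 4.0, 3.0],
    }
)
toy["popscore3"] = popularity_score(toy)
print(toy)
```

Because the count is min-max scaled, the least-rated movie always scores 0 regardless of its mean: here the movie with a single 5.0 rating ends up below the movie with 101 ratings averaging 3.0.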

### User-Film Matrix Creation
A user-film matrix is created with a pivot table, where rows represent users, columns represent movies, and cells contain the corresponding ratings. Movies a user has not rated are filled with 0. This matrix is the input to the similarity analysis.
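A small sketch of the pivot on toy ratings shows what the matrix looks like and what `fill_value=0` does. The toy data and the `sparse_matrix` variant without filling are illustrative additions.

```python
import pandas as pd

# Toy ratings (illustrative values only).
ratings = pd.DataFrame(
    {
        "userId": [1, 1, 2, 2, 3],
        "movieId": [10, 20, 10, 30, 20],
        "rating": [4.0, 5.0, 3.0, 2.0, 4.5],
    }
)

# Same call as in the notebook: unrated cells become 0.
user_film_matrix = pd.pivot_table(
    data=ratings,
    values="rating",
    index="userId",
    columns="movieId",
    fill_value=0,
)

# Variant without fill_value: unrated cells stay NaN.
sparse_matrix = pd.pivot_table(
    data=ratings,
    values="rating",
    index="userId",
    columns="movieId",
)

print(user_film_matrix)
```

With `fill_value=0`, an unrated movie is treated like a rating of 0 in any later computation on the matrix, which is worth keeping in mind when interpreting the correlations.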

### Movie Similarity Analysis
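The code then measures movie similarity from user ratings. It computes the pairwise correlation between the movie columns of the user-film matrix (`DataFrame.corr()`, which uses Pearson correlation by default) and defines a function, `getSimilar`, that returns the most highly correlated movies for a given movie ID.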
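The idea can be sketched on a four-user toy matrix. The helper `most_similar` below is a simplified, hypothetical stand-in for `getSimilar`: it omits the threshold filter and the merge with movie titles.

```python
import pandas as pd

# Toy user-film matrix: rows are users, columns are movie IDs (illustrative values).
user_film_matrix = pd.DataFrame(
    {
        10: [5.0, 4.0, 1.0, 0.0],
        20: [4.0, 5.0, 0.0, 1.0],
        30: [1.0, 0.0, 5.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)
user_film_matrix.columns.name = "movieId"

# Pairwise Pearson correlation between movie columns.
corr = user_film_matrix.corr()


def most_similar(mid, n=2):
    """Return the n movies most correlated with `mid`, excluding `mid` itself."""
    return corr[mid].drop(mid).sort_values(ascending=False).head(n)


print(most_similar(10))
```

Movies 10 and 20 are liked by the same users and correlate positively, while movie 30 is liked by the other users and correlates negatively with both.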

### Results Storage
Finally, the code saves the results with joblib for later use: a dictionary that maps each movie ID to the IDs of its most highly correlated movies, and a table of the 10 movies with the highest popularity scores.
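A round trip through `joblib.dump` and `joblib.load` might look like the sketch below. The toy objects and the temporary directory are assumptions for illustration; the notebook itself writes `matrix10` and `ten` to the working directory.

```python
import os
import tempfile

import joblib
import pandas as pd

# Toy stand-ins for the notebook's two saved objects (illustrative values).
similar = {10: pd.Index([20, 30]), 20: pd.Index([10, 30])}
top = pd.DataFrame(
    {"Movie": ["A", "B"], "movieId": [10, 20], "Rating": [32.0, 27.0]}
)

with tempfile.TemporaryDirectory() as tmp:
    path_similar = os.path.join(tmp, "matrix10")
    path_top = os.path.join(tmp, "ten")

    # Serialize both objects, then read them back.
    joblib.dump(similar, path_similar)
    joblib.dump(top, path_top)

    similar_loaded = joblib.load(path_similar)
    top_loaded = joblib.load(path_top)

print(list(similar_loaded[10]))
```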

### Conclusion
In short, the notebook covers two tasks on movie ratings data: ranking movies by popularity and identifying similar movies from user ratings.


In [1]:
import pandas as pd
import joblib

In [2]:
movies = pd.read_csv("movies.csv")

In [3]:
ratings = pd.read_csv("ratings.csv")

In [4]:
movies.sample(3)

Unnamed: 0,movieId,title,genres
2808,3753,"Patriot, The (2000)",Action|Drama|War
584,719,Multiplicity (1996),Comedy
1156,1519,Broken English (1996),Drama


In [5]:
ratings.sample(3)

Unnamed: 0,userId,movieId,rating,timestamp
71647,462,1288,3.5,1124158003
55079,365,93510,4.5,1488333289
21218,140,1214,4.0,1080845712


In [6]:
temp_rating = ratings.groupby("movieId").agg({"rating":["count", "mean"]}).reset_index()
temp_rating.columns = [tup[1] if tup[1] else tup[0] for tup in temp_rating.columns]

temp_rating.sample()

Unnamed: 0,movieId,count,mean
7225,73876,1,4.0


In [7]:
mo_rating = movies.merge(temp_rating, how="left", on="movieId")

In [8]:
min_count = mo_rating["count"].min()

In [9]:
max_count = mo_rating["count"].max()

In [10]:
count_all = max_count - min_count

In [11]:
mo_rating["popscore3"] = round((mo_rating["count"]-min_count)/count_all*(mo_rating["mean"]**3), 2)

In [12]:
mo_rating.sort_values("popscore3", ascending=False).head(3)

Unnamed: 0,movieId,title,genres,count,mean,popscore3
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022,83.7
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134,72.21
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068,68.97


In [13]:
def popularN(n=5):
    return mo_rating.sort_values("popscore3", ascending=False).head(n)

In [14]:
popularN(3)

Unnamed: 0,movieId,title,genres,count,mean,popscore3
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,317.0,4.429022,83.7
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,329.0,4.164134,72.21
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,307.0,4.197068,68.97


In [15]:
user_film_matrix = pd.pivot_table(data=ratings,
                                  values='rating',
                                  index='userId',
                                  columns='movieId',
                                  fill_value=0)

In [16]:
%%time
film_correlations_matrix = user_film_matrix.corr()

CPU times: user 2min 41s, sys: 1.58 s, total: 2min 42s
Wall time: 2min 48s


In [17]:
def getSimilar(mid, n=10, matrix=film_correlations_matrix, thr = 10):
    film_df = pd.DataFrame(matrix[mid])

    film_df = film_df[film_df.index != mid]
    
    film_df = film_df.sort_values(mid, ascending=False)

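    # NOTE: `matrix` defaults to the film-by-film correlation matrix, so this counts
    # the films that correlate positively with both `mid` and `mid2`,
    # not the users who rated both films (that would require user_film_matrix).
    no_of_users_rated_both_films = [sum((matrix[mid] > 0) & (matrix[mid2] > 0)) for mid2 in film_df.index]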

    film_df['users_who_rated_both_films'] = no_of_users_rated_both_films

    film_df = film_df[film_df["users_who_rated_both_films"] > thr]

    film_df = film_df.reset_index()
    film_df = film_df.drop(columns=["users_who_rated_both_films"])
    
    film_df = film_df.merge(movies, how="left", on="movieId")
    return film_df.head(n)   
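In `getSimilar`, the comparison `matrix[mid] > 0` is evaluated on the correlation matrix, so the resulting count is over films rather than users. If the intent is to count users who rated both films, as the variable name suggests, one possible sketch is a matrix product of the 0/1 "has rated" indicator built from the user-film matrix. The toy data and the names `rated` and `co_ratings` are hypothetical, and applying such a filter would change the results shown in this notebook.

```python
import pandas as pd

# Toy user-film matrix with 0 meaning "not rated" (illustrative values).
user_film_matrix = pd.DataFrame(
    {
        10: [5.0, 4.0, 0.0, 3.0],
        20: [4.0, 0.0, 0.0, 2.0],
        30: [0.0, 1.0, 5.0, 4.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# 1 where a user rated the film, 0 otherwise.
rated = (user_film_matrix > 0).astype(int)

# Entry (a, b) is the number of users who rated both film a and film b.
co_ratings = rated.T @ rated

print(co_ratings)
```

A threshold such as `co_ratings[mid] > thr` could then select films with enough shared raters in a single vectorized step instead of a Python loop.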

In [18]:
%%time
getSimilar(5816, thr=100)

CPU times: user 6.54 s, sys: 456 ms, total: 7 s
Wall time: 7.37 s


Unnamed: 0,movieId,5816,title,genres
0,4896,0.736992,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Adventure|Children|Fantasy
1,8368,0.727898,Harry Potter and the Prisoner of Azkaban (2004),Adventure|Fantasy|IMAX
2,40815,0.673761,Harry Potter and the Goblet of Fire (2005),Adventure|Fantasy|Thriller|IMAX
3,54001,0.630222,Harry Potter and the Order of the Phoenix (2007),Adventure|Drama|Fantasy|IMAX
4,45722,0.52433,Pirates of the Caribbean: Dead Man's Chest (2006),Action|Adventure|Fantasy
5,69844,0.518326,Harry Potter and the Half-Blood Prince (2009),Adventure|Fantasy|Mystery|Romance|IMAX
6,41566,0.508577,"Chronicles of Narnia: The Lion, the Witch and ...",Adventure|Children|Fantasy
7,88125,0.497847,Harry Potter and the Deathly Hallows: Part 2 (...,Action|Adventure|Drama|Fantasy|Mystery|IMAX
8,5349,0.496195,Spider-Man (2002),Action|Adventure|Sci-Fi|Thriller
9,6539,0.48216,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy


In [19]:
test_m = film_correlations_matrix.copy()

In [20]:
%%time
matrix_in_list = {}
for step in test_m.columns:
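    # Sort by correlation, skip position 0 (normally the movie itself), keep the next 10 IDs.
    temp = test_m[step].sort_values(ascending=False).index[1:11]
    # Store the 10 IDs directly; slicing [1:11] a second time would drop the best match.
    matrix_in_list[step] = temp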


CPU times: user 8.69 s, sys: 1 s, total: 9.69 s
Wall time: 10.1 s


### Saving the data

In [21]:
joblib.dump(matrix_in_list, "matrix10")

['matrix10']

In [22]:
ten = popularN(10)[["title", "movieId", "popscore3"]]

In [23]:
ten = ten.rename(columns={"title":"Movie", "popscore3":"Rating"})

In [24]:
joblib.dump(ten, "ten")

['ten']