# Problem Statement:

- Create a Recommender System to show personalized movie recommendations based on ratings given by a user and other users similar to them in order to improve user experience.

# Data Dictionary:

### RATINGS FILE DESCRIPTION

- All ratings are contained in the file "ratings.dat" and are in the following format:

    - UserID::MovieID::Rating::Timestamp

    - UserIDs range between 1 and 6040

    - MovieIDs range between 1 and 3952

    - Ratings are made on a 5-star scale (whole-star ratings only)

    - Timestamp is represented in seconds

    - Each user has at least 20 ratings
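As a minimal sketch of how this layout can be parsed, the block below reads a few `::`-separated rating rows with `pandas.read_csv` and converts the timestamp from seconds. The in-memory `sample` and the `RatedAt` column name are illustrative assumptions, and the sample has no header row; a file that does carry a header line would need `header=0` instead.

```python
import io
import pandas as pd

# Three rows in the UserID::MovieID::Rating::Timestamp layout described above.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)

# A multi-character separator such as "::" requires the Python parsing engine.
ratings = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    header=None,  # assumption: this sample has no header row
    names=["UserID", "MovieID", "Rating", "Timestamp"],
)

# The timestamp is stored in seconds, so unit="s" yields proper datetimes.
ratings["RatedAt"] = pd.to_datetime(ratings["Timestamp"], unit="s")
print(ratings)
```

Reading the file this way gives numeric dtypes for the IDs and ratings directly, so no later string-to-integer conversion is needed.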


### USERS FILE DESCRIPTION

- User information is in the file "users.dat" and is in the following format:

    - UserID::Gender::Age::Occupation::Zip-code

- All demographic information is provided voluntarily by the users and is not checked for accuracy.
Only users who have provided some demographic information are included in this data set.

- Gender is denoted by a "M" for male and "F" for female

- Age is chosen from the following ranges:

    1: "Under 18"
    18: "18-24"
    25: "25-34"
    35: "35-44"
    45: "45-49"
    50: "50-55"
    56: "56+"

- Occupation is chosen from the following choices:

0: "other" or not specified
1: "academic/educator"
2: "artist"
3: "clerical/admin"
4: "college/grad student"
5: "customer service"
6: "doctor/health care"
7: "executive/managerial"
8: "farmer"
9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"

### MOVIES FILE DESCRIPTION

- Movie information is in the file "movies.dat" and is in the following format:

     - MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including year of release)

- Genres are pipe-separated and are selected from the following genres:

Action
Adventure
Animation
Children's
Comedy
Crime
Documentary
Drama
Fantasy
Film-Noir
Horror
Musical
Mystery
Romance
Sci-Fi
Thriller
War
Western
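Because genres are pipe-separated, one possible way to make them usable for analysis is to expand them into one indicator column per genre. The sketch below uses `Series.str.get_dummies` on three hand-typed rows; the frame and variable names are illustrative, not part of the dataset files.

```python
import pandas as pd

# Three movies in the MovieID::Title::Genres style, typed in by hand.
movies = pd.DataFrame({
    "Title": [
        "Toy Story (1995)",
        "Jumanji (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "Genres": [
        "Animation|Children's|Comedy",
        "Adventure|Children's|Fantasy",
        "Comedy",
    ],
})

# One 0/1 column per genre; a movie can have several genres set to 1.
genre_flags = movies["Genres"].str.get_dummies(sep="|")
movies_wide = pd.concat([movies[["Title"]], genre_flags], axis=1)

# Number of movies per genre in this small sample.
genre_counts = genre_flags.sum().sort_values(ascending=False)
print(movies_wide)
print(genre_counts)
```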


## Concepts Tested:
- Recommender Engine
- Collaborative Filtering (Item-based & User-based Approach)
- Pearson Correlation

- Nearest Neighbors using Cosine Similarity
- Matrix Factorization
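To make the relation between two of these similarity measures concrete, here is a small sketch with made-up rating vectors. It shows that Pearson correlation is cosine similarity applied to mean-centered vectors; the function names and the ratings are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    """Pearson correlation, computed as cosine similarity of centered vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return cosine_similarity(a - a.mean(), b - b.mean())

# Ratings given to two movies by the same four (hypothetical) users.
movie_a = [5, 4, 1, 2]
movie_b = [4, 5, 2, 1]

print("cosine :", round(cosine_similarity(movie_a, movie_b), 3))
print("pearson:", round(pearson_correlation(movie_a, movie_b), 3))
```

With non-negative ratings the raw cosine score cannot drop below zero, while the centered version can, which is one reason the two measures can rank neighbours differently.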

## Required to do:

- Reading the data files, formatting them into a proper workable format and merging the data files into one single dataframe


- Performing exploratory data analysis like checking the structure & characteristics of the dataset and cleaning the data

- Performing feature engineering steps such as type conversions and deriving new features like ‘Release Year’

- Visualizing the data with respect to different categories to get a better understanding of the underlying distribution

- Grouping the data in terms of Average Rating and No. of Ratings given

- Creating a pivot table of movie titles & user id and imputing the NaN values with a suitable value

- Follow the Item-based approach and Pearson Correlation

- Take a movie name as input from the user and recommend 5 similar movies based on Pearson Correlation

- Cosine Similarity

- Print the item similarity matrix and user similarity matrix

- Example: A user-user similarity matrix just for demonstration.

- Create a CSR matrix using the pivot table. [Optional: this is an extended approach; link to example implementation]

- Write a function to return top 5 recommendations for a given item

- [sklearn optional] Take a movie name as user input and use KNN algorithm to recommend 5 similar movies based on Cosine Similarity. [link to sklearn Nearest Neighbor documentation]


- Matrix Factorization

- Use cmfrec/Surprise library to run matrix factorization. (Show results with d=4).

- Evaluate the model’s performance using RMSE and MAPE.

- Bonus - how can you do a train test split for MF?

- Embeddings for item-item and user-user similarity

- Re-design the item-item similarity function to use MF embeddings (d=4) instead of raw features

- Similarly, do this for user-user similarity

- Bonus: Get d=2 embeddings, and plot the results. Write down your analysis from this visualisation. (Compare with other visualization techniques)

- Follow the User-based approach (Optional)

- Ask the user to rate a few movies and create a dataframe of the user’s choices.

- Find other users who’ve watched the same movies as the new user.

- Sort the old users by the number of movies they have in common with the new user.

- Take the top 100 users and calculate a Similarity Score for each user using the Pearson Correlation function.

- Get the top 10 users with the highest similarity indices and all the movies rated by these users, then compute weighted movie ratings by multiplying each rating by the user's similarity index.

- Calculate the average recommendation score by dividing the summed weighted rating by the summed similarity index, and select the movies with the highest score, i.e., 5.

- Now, recommend 10 movies based on the ratings given by old users who are similar to the new user.
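The user-based steps at the end of this list can be sketched on toy data as follows. The ratings, the single-letter titles and the choice to keep only positively correlated neighbours are assumptions made for illustration; the cut-offs of 100 and 10 users follow the steps above, although the toy data has only three users.

```python
import pandas as pd

# Existing ("old") users and their ratings; titles are placeholders.
ratings = pd.DataFrame({
    "UserID": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    "Title":  ["A", "B", "C", "D", "A", "B", "C", "D", "E", "A", "B", "D"],
    "Rating": [5, 4, 1, 5, 4, 5, 2, 3, 3, 1, 2, 2],
})
# Ratings entered by the new user.
new_user = pd.DataFrame({"Title": ["A", "B", "C"], "Rating": [5, 4, 1]})

# Old users who rated the same movies, sorted by number of movies in common.
common = ratings.merge(new_user, on="Title", suffixes=("", "_new"))
overlap = common.groupby("UserID").size().sort_values(ascending=False)
candidates = overlap.head(100).index

# Pearson correlation between each candidate and the new user.
similarity = pd.Series({
    user_id: group["Rating"].corr(group["Rating_new"])
    for user_id, group in common[common["UserID"].isin(candidates)].groupby("UserID")
}).dropna()

# Assumption: keep only positively correlated users, then take the top 10.
neighbours = similarity[similarity > 0].nlargest(10)

# Movies the new user has not rated, weighted by neighbour similarity.
unseen = ratings[
    ratings["UserID"].isin(neighbours.index)
    & ~ratings["Title"].isin(new_user["Title"])
].copy()
unseen["Similarity"] = unseen["UserID"].map(neighbours)
unseen["Weighted"] = unseen["Rating"] * unseen["Similarity"]

# Average recommendation score = sum of weighted ratings / sum of similarities.
totals = unseen.groupby("Title")[["Weighted", "Similarity"]].sum()
scores = (totals["Weighted"] / totals["Similarity"]).sort_values(ascending=False)
print(scores.head(10))
```

Dropping users with zero or negative similarity keeps the denominator positive; without that filter the summed similarity can cancel out and make the score undefined.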


### Questionnaire:

- Users of which age group have watched and rated the most movies?

- Users belonging to which profession have watched and rated the most movies?

- Most of the users in our dataset who’ve rated the movies are Male. (T/F)

- Most of the movies present in our dataset were released in which decade?

    a. 70s  b. 90s  c. 50s  d. 80s

- The movie with maximum no. of ratings is ___.

- Name the top 3 movies similar to ‘Liar Liar’ on the item-based approach.

- On the basis of approach, Collaborative Filtering methods can be classified into ___-based and ___-based.

- Pearson Correlation ranges from ___ to ___, whereas Cosine Similarity lies in the interval from ___ to ___.

- Mention the RMSE and MAPE that you got while evaluating the Matrix Factorization model.

- Give the sparse ‘row’ matrix representation for the following dense matrix -

[[1 0]
[3 7]]
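For readers unfamiliar with the compressed sparse row (CSR) layout, the sketch below builds its three arrays by hand for a different toy matrix. The helper name `dense_to_csr` is hypothetical; `scipy.sparse.csr_matrix` exposes the same three arrays as `.data`, `.indices` and `.indptr`.

```python
def dense_to_csr(dense):
    """Return (data, indices, indptr) for a dense matrix given as a list of rows."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

# A toy matrix with an all-zero middle row.
example = [
    [0, 2, 0],
    [0, 0, 0],
    [4, 0, 5],
]

data, indices, indptr = dense_to_csr(example)
print("data   :", data)
print("indices:", indices)
print("indptr :", indptr)
```

Row `i` of the matrix occupies the slice `data[indptr[i]:indptr[i + 1]]`, so an empty row simply repeats the previous pointer.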

# New Section

In [129]:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12,8)
import warnings
warnings.filterwarnings("ignore")


In [130]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [131]:
zee_movies = pd.read_fwf("/content/drive/Othercomputers/My Laptop/Data Science Studies/GitHub_Desktop/BusinessCase_Data_Exploration-/Recommender System for OTT /zee-movies.dat",encoding="ISO-8859-1")

In [132]:
zee_ratings =pd.read_fwf("/content/drive/Othercomputers/My Laptop/Data Science Studies/GitHub_Desktop/BusinessCase_Data_Exploration-/Recommender System for OTT /zee-ratings.dat",encoding="ISO-8859-1")

In [133]:
zee_users = pd.read_fwf("/content/drive/Othercomputers/My Laptop/Data Science Studies/GitHub_Desktop/BusinessCase_Data_Exploration-/Recommender System for OTT /zee-users.dat",encoding="ISO-8859-1")

In [134]:
delimiter ="::"

zee_users = zee_users["UserID::Gender::Age::Occupation::Zip-code"].str.split(delimiter,expand = True)
zee_users.columns = ["UserID","Gender","Age","Occupation","Zipcode"]


In [135]:
# zee_users

In [136]:
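# Map the coded age values (strings after the split) to their labels.
# Assigning the result back avoids relying on inplace=True on a column selection.
zee_users["Age"] = zee_users["Age"].replace({"1": "Under 18","18": "18-24","25": "25-34",
                          "35": "35-44","45": "45-49","50": "50-55","56": "56+"})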

In [137]:
occupation_labels = {
    0: "other", 1: "academic/educator", 2: "artist", 3: "clerical/admin",
    4: "college/grad student", 5: "customer service", 6: "doctor/health care",
    7: "executive/managerial", 8: "farmer", 9: "homemaker", 10: "K-12 student",
    11: "lawyer", 12: "programmer", 13: "retired", 14: "sales/marketing",
    15: "scientist", 16: "self-employed", 17: "technician/engineer",
    18: "tradesman/craftsman", 19: "unemployed", 20: "writer",
}
zee_users["Occupation"] = zee_users["Occupation"].astype(int).replace(occupation_labels)

In [138]:
zee_users

Unnamed: 0,UserID,Gender,Age,Occupation,Zipcode
0,1,F,Under 18,K-12 student,48067
1,2,M,56+,self-employed,70072
2,3,M,25-34,scientist,55117
3,4,M,45-49,executive/managerial,02460
4,5,M,25-34,writer,55455
...,...,...,...,...,...
6035,6036,F,25-34,scientist,32603
6036,6037,F,45-49,academic/educator,76006
6037,6038,F,56+,academic/educator,14706
6038,6039,F,45-49,other,01060


In [139]:
# zee_ratings

In [140]:
delimiter ="::"

zee_ratings = zee_ratings["UserID::MovieID::Rating::Timestamp"].str.split(delimiter,expand = True)
zee_ratings.columns = ["UserID","MovieID","Rating","Timestamp"]



In [141]:
# zee_ratings

In [142]:
zee_movies.drop(["Unnamed: 1","Unnamed: 2"],axis = 1,inplace=True)
# zee_movies.sample(2)

In [143]:
delimiter ="::"

zee_movies = zee_movies["Movie ID::Title::Genres"].str.split(delimiter,expand = True)
zee_movies.columns = ["MovieID","Title","Genres"]



In [144]:
zee_movies.head(5)

Unnamed: 0,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [145]:
zee_movies.shape

(3883, 3)

In [146]:
zee_ratings.head(5)

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [147]:
zee_ratings

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [148]:
zee_users.shape

(6040, 5)

In [149]:
zee_users.sample(5)

Unnamed: 0,UserID,Gender,Age,Occupation,Zipcode
1619,1620,M,25-34,writer,48009
4154,4155,M,50-55,technician/engineer,85712
4898,4899,M,25-34,programmer,94402
5190,5191,F,25-34,college/grad student,55455
2931,2932,M,25-34,programmer,55346


In [150]:
zee_movies.shape,zee_ratings.shape,zee_users.shape

((3883, 3), (1000209, 4), (6040, 5))

In [151]:
df = (
    zee_users.merge(zee_ratings, how="outer", on="UserID")
             .merge(zee_movies, how="outer", on="MovieID")
)

In [152]:
df

Unnamed: 0,UserID,Gender,Age,Occupation,Zipcode,MovieID,Rating,Timestamp,Title,Genres
0,1,F,Under 18,K-12 student,48067,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,M,56+,self-employed,70072,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,M,25-34,programmer,32793,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,M,25-34,executive/managerial,22903,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,M,50-55,academic/educator,95350,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
1000381,,,,,,3650,,,Anguish (Angustia) (1986),Horror
1000382,,,,,,3750,,,Boricua's Bond (2000),Drama
1000383,,,,,,3829,,,Mad About Mambo (2000),Comedy|Romance
1000384,,,,,,3856,,,Autumn Heart (1999),Drama


In [153]:
df.shape

(1000386, 10)

In [154]:
df.isna().sum()

UserID         177
Gender         177
Age            177
Occupation     177
Zipcode        177
MovieID          0
Rating         177
Timestamp      177
Title            0
Genres        4066
dtype: int64

In [155]:
df["Title"]

0          One Flew Over the Cuckoo's Nest (1975)
1          One Flew Over the Cuckoo's Nest (1975)
2          One Flew Over the Cuckoo's Nest (1975)
3          One Flew Over the Cuckoo's Nest (1975)
4          One Flew Over the Cuckoo's Nest (1975)
                            ...                  
1000381                 Anguish (Angustia) (1986)
1000382                     Boricua's Bond (2000)
1000383                    Mad About Mambo (2000)
1000384                       Autumn Heart (1999)
1000385        Prince of Central Park, The (1999)
Name: Title, Length: 1000386, dtype: object

In [156]:
import re

In [157]:
# Raw string so that \s and \( are passed to the regex engine unchanged
df["Release_year"] = df["Title"].str.extract(r'^(.+)\s\(([0-9]*)\)$', expand=True)[1]
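As a small, self-contained sketch of the same idea, the block below pulls a four-digit year from the end of a title and derives a decade from it. The last title is a made-up, cut-off example to show what happens when no year is found; the names `release_year` and `decade` are illustrative.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "Anguish (Angustia) (1986)",
    "Prince of Central Park, The (1999)",
    "Title cut off before the ye",  # hypothetical truncated title
])

# Capture a four-digit year in parentheses at the very end of the title.
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Nullable integers keep missing years as <NA> instead of turning into floats.
release_year = pd.to_numeric(year, errors="coerce").astype("Int64")
decade = (release_year // 10) * 10

print(pd.DataFrame({"Title": titles, "Release_year": release_year, "Decade": decade}))
```

Anchoring the pattern at the end of the string means a parenthesised alternative title such as "(Angustia)" is not mistaken for the year.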

In [158]:
df

Unnamed: 0,UserID,Gender,Age,Occupation,Zipcode,MovieID,Rating,Timestamp,Title,Genres,Release_year
0,1,F,Under 18,K-12 student,48067,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama,1975
1,2,M,56+,self-employed,70072,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama,1975
2,12,M,25-34,programmer,32793,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama,1975
3,15,M,25-34,executive/managerial,22903,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama,1975
4,17,M,50-55,academic/educator,95350,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama,1975
...,...,...,...,...,...,...,...,...,...,...,...
1000381,,,,,,3650,,,Anguish (Angustia) (1986),Horror,1986
1000382,,,,,,3750,,,Boricua's Bond (2000),Drama,2000
1000383,,,,,,3829,,,Mad About Mambo (2000),Comedy|Romance,2000
1000384,,,,,,3856,,,Autumn Heart (1999),Drama,1999


In [159]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000386 entries, 0 to 1000385
Data columns (total 11 columns):
 #   Column        Non-Null Count    Dtype 
---  ------        --------------    ----- 
 0   UserID        1000209 non-null  object
 1   Gender        1000209 non-null  object
 2   Age           1000209 non-null  object
 3   Occupation    1000209 non-null  object
 4   Zipcode       1000209 non-null  object
 5   MovieID       1000386 non-null  object
 6   Rating        1000209 non-null  object
 7   Timestamp     1000209 non-null  object
 8   Title         1000386 non-null  object
 9   Genres        996320 non-null   object
 10  Release_year  996606 non-null   object
dtypes: object(11)
memory usage: 91.6+ MB


In [160]:
df.nunique()

UserID            6040
Gender               2
Age                  7
Occupation          21
Zipcode           3439
MovieID           3883
Rating               5
Timestamp       458455
Title             3883
Genres             360
Release_year        81
dtype: int64

In [161]:
# 6040 unique users (UserID)
# 7 different age groups
# 21 occupations
# 3439 different user zip codes
# 3883 unique movies


In [162]:
df.isna().sum()

UserID           177
Gender           177
Age              177
Occupation       177
Zipcode          177
MovieID            0
Rating           177
Timestamp        177
Title              0
Genres          4066
Release_year    3780
dtype: int64

In [163]:
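# 177 movies in the database were never rated by any user.
# After the outer merge these rows have NaN in every user and rating column (177 each).
# The larger NaN counts in Genres and Release_year are a separate issue; they probably come
# from titles that were not parsed cleanly when the .dat file was read.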
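One way to confirm where such rows come from is the `indicator=True` option of `DataFrame.merge`, which labels each row by the side it originated from. The sketch below uses tiny hand-made frames rather than the notebook's data, so the names and values are illustrative only.

```python
import pandas as pd

movies = pd.DataFrame({
    "MovieID": ["1", "2", "3"],
    "Title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
ratings = pd.DataFrame({
    "UserID": ["1", "2", "1"],
    "MovieID": ["1", "1", "2"],
    "Rating": ["5", "4", "3"],
})

# The outer merge keeps movies without ratings; _merge marks them "right_only".
merged = ratings.merge(movies, how="outer", on="MovieID", indicator=True)
never_rated = merged.loc[merged["_merge"] == "right_only", "Title"]

# An inner merge keeps only rated movies, so no NaN rows are introduced.
rated_only = ratings.merge(movies, how="inner", on="MovieID")

print(never_rated.tolist())
print(merged.shape, rated_only.shape)
```

For building a rating-based recommender, the inner merge (or dropping the `right_only` rows) is often sufficient, since a movie without any rating contributes nothing to a similarity matrix.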