In [315]:
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
from fuzzywuzzy import process
import seaborn as sns
import matplotlib.pyplot as plt
import plotly_express as px
import re

- Some of the code in this file has already been explained in the exploratory analysis file.

- I will explain wherever I add a new piece of code and how I came to any conclusions.

In [316]:
movies, ratings = pd.read_csv('../data/movies.csv'), pd.read_csv('../data/ratings.csv')


In [317]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [318]:
movies['year']  = movies['title'].str.extract(r'\((\d{4})\)')

movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [319]:
movies.describe()

Unnamed: 0,movieId
count,58098.0
mean,111919.516197
std,59862.660956
min,1.0
25%,72437.75
50%,126549.0
75%,161449.5
max,193886.0


In [320]:
movies.loc[:, 'title_no_year'] = movies['title'].apply(lambda x: x.split("(")[0].rstrip())

---

## 1.3) Recommender system

- The answers below are explained in file 1_3.

In [321]:
ratings['movieId'].nunique()

53889

In [322]:
movies['movieId'].nunique()

58098

In [323]:
new_ratings = ratings[ratings['movieId'].isin(movies['movieId'])]
new_ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27753444 entries, 0 to 27753443
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 1.0 GB


In [324]:
# Convert movieId & userId to Categoricals so their codes can serve as row/column indices for csr_matrix
movieIds = pd.Categorical(new_ratings['movieId'])
userIds = pd.Categorical(new_ratings['userId'])

# Create the csr matrix
mati_movies_users = csr_matrix((new_ratings['rating'], (movieIds.codes, userIds.codes)))

mati_movies_users.shape

(53889, 283228)
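
To see what the category codes do, here is a minimal sketch on a hypothetical three-rating table (the toy values are not from the dataset). `pd.Categorical` assigns each distinct id a consecutive code in sorted order, and those codes become the row and column positions in the sparse matrix.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings; the ids are deliberately non-consecutive
toy = pd.DataFrame({
    'userId':  [10, 10, 42],
    'movieId': [5, 900, 900],
    'rating':  [4.0, 3.5, 5.0],
})

toy_movie_ids = pd.Categorical(toy['movieId'])
toy_user_ids = pd.Categorical(toy['userId'])

# codes are 0-based positions into the sorted unique ids
toy_matrix = csr_matrix(
    (toy['rating'].to_numpy(), (toy_movie_ids.codes, toy_user_ids.codes))
)

print(toy_matrix.shape)                 # one row per rated movie, one column per user
print(toy_matrix.toarray())
print(list(toy_movie_ids.categories))   # row i of the matrix belongs to categories[i]
```

This is also why the matrix above has 53889 rows rather than 58098: only movies that appear in the ratings get a row.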

In [325]:
model_nearest_neighbor = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=30)
model_nearest_neighbor.fit(mati_movies_users)

In [326]:
def recommender_system(movie_name, dataframe, model, number_recommendations):
    # On a Series, extractOne returns (matched title, match score, index label),
    # so a single call gives everything needed
    title, match_score, movie_idx = process.extractOne(movie_name, movies['title'])
    print('Movie Selected: ', title, 'Match score: ', match_score)
    print('Searching for recommendation....')

    # NOTE: movie_idx is a row label of `movies`, used here as a matrix row.
    # This assumes the matrix rows line up with the rows of `movies`.
    distances, indices = model.kneighbors(dataframe[movie_idx], n_neighbors=number_recommendations)
    # Drop the first neighbour, which is the selected movie itself
    indix = indices.flatten()[1:]

    return movies.loc[indix]

In [327]:
recommendations = recommender_system('the fellowship of the ring', mati_movies_users, model_nearest_neighbor, 20)
recommendations

Movie Selected:  Lord of the Rings: The Fellowship of the Ring, The (2001) Match score:  90
Searching for recommendation....


Unnamed: 0,movieId,title,genres,year,title_no_year
5854,5952,"Lord of the Rings: The Two Towers, The (2002)",Adventure|Fantasy,2002,"Lord of the Rings: The Two Towers, The"
7042,7153,"Lord of the Rings: The Return of the King, The...",Action|Adventure|Drama|Fantasy,2003,"Lord of the Rings: The Return of the King, The"
2487,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,1999,"Matrix, The"
6430,6539,Pirates of the Caribbean: The Curse of the Bla...,Action|Adventure|Comedy|Fantasy,2003,Pirates of the Caribbean: The Curse of the Bla...
4212,4306,Shrek (2001),Adventure|Animation|Children|Comedy|Fantasy|Ro...,2001,Shrek
3488,3578,Gladiator (2000),Action|Adventure|Drama,2000,Gladiator
2874,2959,Fight Club (1999),Action|Crime|Drama|Thriller,1999,Fight Club
5253,5349,Spider-Man (2002),Action|Adventure|Sci-Fi|Thriller,2002,Spider-Man
1171,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,1980,Star Wars: Episode V - The Empire Strikes Back
6272,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,2003,Finding Nemo
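
One thing to keep in mind when reading results like these: the matrix rows follow the category codes of the rated movieIds, while `movies.loc` uses the row labels of `movies`. The two only coincide as long as no unrated movie appears earlier in `movies`. Below is a minimal sketch of an explicit mapping between the two, on hypothetical toy data (`toy_movies`, `rated_ids` and both helper functions are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical movies table; movieId 20 has no ratings at all
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30],
    'title': ['A (1990)', 'B (1991)', 'C (1992)'],
})
# Rated ids as they would appear in the ratings table
rated_ids = pd.Categorical([30, 10, 30, 10])

def movie_id_to_row(movie_id):
    """Matrix row for a movieId (raises KeyError if the movie has no ratings)."""
    return rated_ids.categories.get_loc(movie_id)

def rows_to_movies(rows):
    """Look up movies through the movieIds behind the given matrix rows."""
    ids = rated_ids.categories[list(rows)].tolist()
    return toy_movies.set_index('movieId').loc[ids].reset_index()

print(movie_id_to_row(30))     # position among rated ids, not among all movies
print(rows_to_movies([1]))
```

In the toy table, movieId 30 sits at label 2 in `toy_movies` but at row 1 of the matrix, because movieId 20 was never rated.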


## 1.3a&b How my system works

- My goal in this exercise is to recommend movies to a user based on an input movie, for example Toy Story.

- First, I use fuzzywuzzy to return the closest match to the title entered by the user. This reduces the potential for errors, since otherwise the input would have to match a title in the dataframe exactly.

- Once I have a close enough title, I reduce the size of my datasets. The original movies dataset has over 58,000 rows and the ratings dataset has over 27,000,000 rows, which is too large for my computer to handle. So the best option is to clean these datasets and keep only what I need.

- I do this by filtering the movies dataset on at most 2 genres taken from the input movie's genres.

- I then reduce the ratings so they only contain ratings for the movies found in my cleaned/processed movies dataset.

- I create a pivot dataframe from the cleaned ratings dataset.

- I create a csr_matrix from the pivot dataframe. This helps save memory by storing only the non-zero values. The csr matrix is then used for my recommender algorithm.
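
The steps above can be sketched end to end on a toy dataset. This is only an illustration under assumptions: the tables are hypothetical, `difflib.get_close_matches` from the standard library stands in for fuzzywuzzy, and the genre filter keeps movies that share at least one of the first two genres of the input movie, which is one possible reading of the genre step.

```python
import difflib
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy tables in the same layout as movies.csv / ratings.csv
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Toy Story (1995)', 'Jumanji (1995)',
              'Grumpier Old Men (1995)', 'Heat (1995)'],
    'genres': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy',
               'Comedy|Romance', 'Action|Crime|Thriller'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [1, 2, 1, 3, 4],
    'rating':  [5.0, 4.0, 4.5, 3.0, 4.0],
})

# 1) Fuzzy-match the user's input to a title (difflib as a stand-in)
query = 'toy storry'
lowered = toy_movies['title'].str.lower().tolist()
best = difflib.get_close_matches(query, lowered, n=1, cutoff=0.0)[0]
match = toy_movies.iloc[lowered.index(best)]

# 2) Keep movies that share one of the first two genres of the match
wanted = set(match['genres'].split('|')[:2])
shares_genre = toy_movies['genres'].apply(lambda g: bool(wanted & set(g.split('|'))))
small_movies = toy_movies[shares_genre]

# 3) Keep only the ratings for the remaining movies
small_ratings = toy_ratings[toy_ratings['movieId'].isin(small_movies['movieId'])]

# 4) Pivot to a movie x user table; missing ratings become 0
pivot = small_ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)

# 5) Store it sparsely: only the non-zero ratings are kept
sparse = csr_matrix(pivot.to_numpy())
print(pivot)
print(sparse.nnz, 'stored values out of', pivot.size)
```

On real data the pivot step is the expensive one, since the dense movie x user table is materialised before it is converted to sparse form.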

#### How KNN works here
- In the csr matrix, each row is a vector, and each vector represents a movie, since the movieIds are the rows of the pivot dataframe. These vectors live in a high-dimensional space with one dimension per user. The KNN recommendation works by computing the cosine of the angle between the input movie and every other vector, and returning the K vectors with the smallest angle to the input movie.

<img src="../assets/cos.webp" alt="description of the image" width="300" height="200">

A good example is the image above. Joao Felix and Messi are similar: Joao has played for fewer years and doesn't have as many ratings, but his profile is very similar to Messi's. Cristiano, on the other hand, is quite different but has a similar amount of ratings to Messi. Euclidean distance would have paired Messi with Ronaldo, whereas cosine similarity picks Joao as the one similar to Messi.

Cosine similarity measures the similarity between two vectors or data points in multidimensional space. It is measured by the cosine of the angle between two vectors or data points. It determines whether these two vectors are pointing in the same direction. It is often used to measure similarity in text analysis.
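
As a small worked illustration of the difference between the two measures, the sketch below uses made-up rating vectors. The numbers are hypothetical and only chosen to mirror the example above: `felix` has the same pattern as `messi` at a smaller scale, while `ronaldo` has a similar magnitude but a different pattern.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical vectors, not taken from the figure
messi = [10.0, 8.0, 2.0]
felix = [1.0, 0.8, 0.2]      # same direction as messi, much shorter
ronaldo = [8.0, 10.0, 6.0]   # similar length, different direction

for name, vec in [('felix', felix), ('ronaldo', ronaldo)]:
    print(name,
          'cosine similarity:', round(cosine_similarity(messi, vec), 3),
          'euclidean distance:', round(euclidean_distance(messi, vec), 3))
```

With these vectors, cosine similarity ranks `felix` closest to `messi`, while Euclidean distance ranks `ronaldo` closest, because cosine ignores the length of the vectors and only compares their direction.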

When KNN makes an inference about a movie, it calculates the “distance” between the target movie and every other movie in its database, then ranks these distances and returns the top K nearest neighbour movies as the most similar recommendations.
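
The ranking step can be sketched in a few lines of plain Python. This is a simplified stand-in for a brute-force nearest-neighbour search, not scikit-learn's implementation, and the movie names and vectors are hypothetical. Cosine distance is taken here as 1 minus cosine similarity.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def top_k_neighbours(target, vectors, k):
    """Rank every other item by cosine distance to `target`, nearest first."""
    distances = [
        (cosine_distance(vectors[target], vec), name)
        for name, vec in vectors.items()
        if name != target  # skip the target, which is at distance 0 from itself
    ]
    return [name for _, name in sorted(distances)[:k]]

# Hypothetical movie -> user-rating vectors (0 means "not rated")
toy_vectors = {
    'Movie A': [5.0, 4.0, 0.0, 0.0],
    'Movie B': [4.0, 5.0, 0.0, 1.0],
    'Movie C': [0.0, 0.0, 5.0, 4.0],
    'Movie D': [5.0, 3.0, 1.0, 0.0],
}

print(top_k_neighbours('Movie A', toy_vectors, k=2))
```

The sketch skips the target explicitly. In the notebook, the selected movie is part of the fitted data and so comes back among its own neighbours, which is why the first index is dropped with `[1:]`.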


- **[cosine similarity](https://www.kipi.bi/post/basics-to-knn-algorithm)**
- **[recommender system towardsdatascience](https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea)**

*The article below is very detailed: it looks into the types of recommendation systems and eventually walks through a recommendation system similar to this one, but for books.*
- **[recommender system medium.com](https://aman-makwana101932.medium.com/understanding-recommendation-system-and-knn-with-project-book-recommendation-system-c648e47ff4f6)**


- **Also used ChatGPT**
