# Content Based Filtering

## Outline
1. [Packages](#intro)<br>
2. [Dataset](#dataset)<br>
3. [Preparing the data](#prep)<br>

## 1 - Packages
### Libraries and Modules
- **NumPy (np):** NumPy is a widely-used library in Python for numerical and mathematical operations. It provides support for working with arrays and matrices, making it essential for scientific and data analysis tasks.
- **NumPy Masked Array (ma):** NumPy's Masked Array module is used for handling arrays with missing or masked data. It allows you to work with data that may contain invalid or missing values without raising errors.
- **genfromtxt:** This is a function provided by NumPy for reading data from text files and creating NumPy arrays from the data. It's particularly useful for loading data from CSV files and other text-based formats.
- **collections.defaultdict:** The defaultdict class from the collections module is a specialized dictionary-like data structure that provides default values for missing keys. It's useful for various data manipulation tasks, such as counting occurrences.
- **Pandas (pd):** Pandas is a powerful library for data manipulation and analysis. It offers data structures like DataFrames and Series, making it easier to work with structured data.
- **TensorFlow (tf):** TensorFlow is an open-source machine learning framework developed by Google. It's widely used for building and training deep learning models, including neural networks.
- **Keras:** Keras is a high-level neural networks API that runs on top of TensorFlow (or other backend engines). It provides a user-friendly interface for building and training neural networks.
- **StandardScaler:** StandardScaler is a preprocessing technique provided by scikit-learn (sklearn) for standardizing features in your data. It removes the mean and scales features to have unit variance.
- **MinMaxScaler:** MinMaxScaler, also from scikit-learn, is used to scale features to a specified range, commonly [0, 1]. It's helpful when you want to normalize your data to a specific interval.
- **train_test_split:** This function, available in scikit-learn, is used to split your dataset into separate training and testing sets. It's a crucial step in machine learning for evaluating model performance.
- **tabulate:** Tabulate is a Python library for neatly formatting and pretty-printing tabular data. It helps make data presentation more readable and user-friendly.
- **pd.set_option("display.precision", 1):** This line of code configures the Pandas library to display floating-point numbers with a precision of one decimal place by default. It affects how data is presented when you print or display Pandas DataFrames or Series.

#### Terminal Instructions:
1. Go to the Jupyter notebook directory.
2. Run `pip install --upgrade numpy scipy`.
3. Restart the kernel if needed.

In [1]:
import datetime
import numpy as np
import numpy.ma as ma
from numpy import genfromtxt
from collections import defaultdict
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
pd.set_option("display.precision", 1)

from IPython.display import display
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

%run preprocessing_functions.ipynb

## 2 - Dataset
The data comes from the MovieLens `ml-latest-small` release, available at https://grouplens.org/datasets/movielens/latest/

#### ratings.csv
- n<sub>u</sub> = 610 users
- 100,836 individual user ratings
- ratings are out of 5

#### movies.csv
- n<sub>m</sub> = 9742 movies

Below, load the raw data sets:

In [2]:
# Load each MovieLens CSV file into its own DataFrame (adjust the paths to your local copy)
df_user = pd.read_csv(r'.\datasets\ml-latest-small\ratings.csv')
df_movie = pd.read_csv(r'.\datasets\ml-latest-small\movies.csv')
df_links = pd.read_csv(r'.\datasets\ml-latest-small\links.csv')
df_tags = pd.read_csv(r'.\datasets\ml-latest-small\tags.csv')
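
The paths above use Windows-style backslashes, so they would not resolve on Linux or macOS. A minimal sketch of a portable alternative, assuming the same `datasets/ml-latest-small` folder layout; `DATA_DIR` and `dataset_path` are hypothetical names that are not part of the notebook.

```python
from pathlib import Path

# Assumed location of the extracted MovieLens files, relative to the notebook
DATA_DIR = Path("datasets") / "ml-latest-small"

def dataset_path(name: str) -> Path:
    """Return the path of one MovieLens CSV file, e.g. dataset_path("ratings")."""
    return DATA_DIR / f"{name}.csv"

# pd.read_csv accepts Path objects, so a cell could use, for example:
# df_user = pd.read_csv(dataset_path("ratings"))
print(dataset_path("ratings").as_posix())
```

`pathlib` joins path segments with the separator of the current operating system, which is why the same cell would then run unchanged on any platform.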

### Display Dataset

#### Description of the df_user dataset:

- **userId:** This column represents the unique identifier for each user who has provided movie ratings. Users are numbered from 1 to 610, indicating there are 610 users in the dataset.
- **movieId:** This column represents the unique identifier for each movie that users have rated. The movieId values correspond to specific movies.
- **rating:** This column represents the rating given by a user to a particular movie. Ratings are on a 5-star scale in half-star increments, from 0.5 to 5.0. Higher values indicate a stronger preference for the movie.
- **timestamp:** This column represents the timestamp when the rating was recorded. It's a Unix timestamp, which is a numerical representation of a specific date and time.

The output below shows the first and last rows of the dataset as examples. Each row represents a single user's rating of a specific movie. For example, the first row indicates that user 1 gave a rating of 4.0 to the movie with movieId 1 (Toy Story (1995)), and the rating was recorded at the specified timestamp.
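
Because the `timestamp` column stores seconds since the Unix epoch, it is not human-readable as loaded. A minimal sketch of converting it with pandas, using two rows copied from the output of this section; the `rated_at` column name is an assumption made for illustration.

```python
import pandas as pd

# Two rating rows taken from the df_user output shown in this section
ratings = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 3],
    "rating": [4.0, 4.0],
    "timestamp": [964982703, 964981247],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 00:00:00 UTC
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")
print(ratings[["userId", "movieId", "rated_at"]])
```

The conversion is vectorized, so the same call would apply to the full `timestamp` column in one step.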

In [3]:
display(df_user)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


#### Description of the df_movie dataset:

- **movieId:** This column represents the unique identifier for each movie in the dataset. The movieId values are used to uniquely identify each movie.
- **title:** This column contains the titles of the movies. Each row provides the title of a specific movie, along with the release year in parentheses. For example, "Toy Story (1995)" is the title of the first movie in the dataset, released in 1995.
- **genres:** This column contains information about the genres associated with each movie. Genres are typically separated by the "|" character, indicating that a movie can belong to multiple genres. For example, the first movie "Toy Story (1995)" is associated with the following genres: Adventure, Animation, Children, Comedy, and Fantasy.

The dataset contains multiple rows, each representing a different movie. In the example, there are 5 rows, each corresponding to a different movie title and associated genres.
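
Both the release year and the genre list are embedded in strings, so they need to be parsed before they can serve as features. A small sketch of one way to do this, assuming every title ends with a four-digit year in parentheses; the `year` and `genre_list` column names are illustrative.

```python
import pandas as pd

# Two rows copied from the df_movie output in this section
movies = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance"],
})

# Capture a four-digit year in parentheses at the end of the title
year_text = movies["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
movies["year"] = pd.to_numeric(year_text)

# Split the pipe-separated genre string into a Python list per movie
movies["genre_list"] = movies["genres"].str.split("|")
print(movies[["movieId", "year", "genre_list"]])
```

Titles that do not follow the assumed pattern would produce a missing year, so the result is worth checking before relying on it.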

In [4]:
display(df_movie.head())

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#### Description of the df_links dataset:
- **movieId:** This column represents a unique identifier for each movie in the dataset. The movieId values are used to uniquely identify each movie.
- **imdbId:** This column contains the IMDb (Internet Movie Database) identifier for each movie. IMDb is a popular online database of movies and television shows.
- **tmdbId:** This column contains the TMDb (The Movie Database) identifier for each movie. TMDb is another online database that provides information about movies, including cast, crew, and user ratings.

The dataset contains multiple rows, with each row corresponding to a different movie. For example, the first row indicates that movieId 1 corresponds to the IMDb ID 114709 and the TMDb ID 862.0.
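
The `tmdbId` values appear as floats (862.0) even though they are identifiers. In pandas this usually happens when an integer column contains missing entries, which is assumed to be the cause here. A hedged sketch of converting such a column to the nullable `Int64` dtype; the missing value in the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, 8844.0, np.nan],  # the NaN is invented to show the effect
})

# The nullable Int64 dtype keeps whole-number IDs and represents gaps as <NA>
links["tmdbId"] = links["tmdbId"].astype("Int64")
print(links.dtypes)
```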

In [5]:
display(df_links.head())

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


#### Description of the df_tags dataset:
- **userId:** This column represents the unique identifier for each user who has tagged movies. Users are numbered, and each userId corresponds to a specific user.
- **movieId:** This column represents the unique identifier for each movie that users have tagged. The movieId values correspond to specific movies.
- **tag:** This column contains descriptive tags or keywords associated with the movies. Users can assign tags to movies based on their content, themes, or other characteristics.
- **timestamp:** This column represents the timestamp when the tagging was recorded. It's a Unix timestamp, which is a numerical representation of a specific date and time.

The dataset contains multiple rows, with each row representing one tag that a user applied to a specific movie. For example, the first three rows indicate that userId 2 tagged movieId 60756 with "funny," "Highly quotable," and "will ferrell," each recorded at its own timestamp.

In [6]:
display(df_tags)

timestamp_first_row = df_tags.loc[0, 'timestamp']
date_time = datetime.datetime.utcfromtimestamp(timestamp_first_row)
print(f'user2 tagged movie60756(Step Brothers) as "funny" at: {date_time}')

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


user2 tagged movie60756(Step Brothers) as "funny" at: 2015-10-24 19:29:54


## 3 - Preprocessing the Data

### One-Hot Encode the df_movie Dataset
**One-Hot Encoding** is a technique used in machine learning and data analysis to convert categorical data, such as genres in your movie dataset, into a binary matrix format. It's a way to represent categorical variables as binary vectors, where each category is represented by a binary value (0 or 1).

Here's a brief explanation and an example for the movie dataset:

The movie dataset has a column called genres, which contains one or more movie genres separated by the "|" character. To apply one-hot encoding to this categorical data:

- Identify Unique Categories: First, you identify all unique genres present in the dataset. For example, unique genres include "Action," "Comedy," "Romance," and "Sci-Fi."
- Create Binary Columns: For each unique genre, you create a new binary column in the dataset. Each column represents one genre. If a movie belongs to a particular genre, the corresponding column gets a value of 1; otherwise, it gets a value of 0.
![df_movies_onehot](./images/movie_one_hot_encoding.jpg)
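
The two steps above can be sketched with pandas alone. This is not the `onehot_encode_movies` helper from `preprocessing_functions.ipynb`, whose implementation is not shown here; it is a minimal illustration on three rows, with `genre_onehot` and `movies_encoded` as illustrative names.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per unique genre; a movie can have several 1s in its row
genre_onehot = movies["genres"].str.get_dummies(sep="|")

# Keep movieId next to the genre columns so rows stay identifiable
movies_encoded = pd.concat([movies[["movieId"]], genre_onehot], axis=1)
print(movies_encoded)
```

`str.get_dummies` both finds the unique genres and creates the binary columns, returned in alphabetical order.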

df_movie_encoded:

In [7]:
df_movie_copy = df_movie.copy()

df_movie_final, df_movie_encoded = onehot_encode_movies(df_movie_copy)

display(df_movie_encoded)
display(df_movie_final)

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9738,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
9739,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
9740,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,movieId,year,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1995,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,1995,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,1995,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,1995,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,1995,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,2017,1,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9738,193583,2017,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9739,193585,2017,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9740,193587,2018,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Create the User Profile
1. **Collect User Ratings:** You start by gathering the ratings provided by a user for various movies. Each rating reflects how much the user liked a specific movie.
2. **Encode Movie Genres:** The one-hot encoded movie dataset marks each movie's genres with 1s and 0s. For each movie the user has rated, replace the 1s in that movie's genre columns with the user's rating, so that every genre the movie belongs to carries that rating. Repeat this for all movies the user has rated.
3. **Calculate Genre Ratings:** For each genre, you sum up the ratings given by the user to all the movies that belong to that genre. This gives you the total rating the user has given to all movies within each genre.
4. **Average Genre Ratings:** To create a more balanced view, you calculate the average rating for each genre by dividing the total genre rating by the number of movies the user has rated within that genre. This helps account for cases where the user may have rated more or fewer movies in certain genres.
5. **Non-Scaled User Profile:** The result of these calculations is a non-scaled user profile. It represents the average ratings the user has given to different movie genres based on their past ratings. This user profile helps identify the user's preferences and can be used for various purposes, such as recommending movies of similar genres that align with their tastes.

![df_user_profile](./images/aggregate_average_user_ratings.jpg)
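
The five steps above can be sketched in vectorized pandas on toy data. This is an illustrative alternative, not the `aggregate_average_user_ratings` function used below, whose implementation is not shown; all names and values in the sketch are assumptions.

```python
import pandas as pd

ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2],
    "movieId": [1, 2, 3, 1],
    "rating":  [4.0, 2.0, 5.0, 3.0],
})
movie_genres = pd.DataFrame({
    "movieId":   [1, 2, 3],
    "Adventure": [1, 1, 0],
    "Comedy":    [1, 0, 1],
    "Romance":   [0, 0, 1],
})
genre_cols = ["Adventure", "Comedy", "Romance"]

# Steps 1-2: attach each rated movie's genre flags to the rating row
rated = ratings.merge(movie_genres, on="movieId", how="left")

# Step 3: credit the rating to every genre of the movie, then sum per user
weighted = rated[genre_cols].mul(rated["rating"], axis=0)
genre_sum = weighted.groupby(rated["userId"]).sum()

# Step 4: count rated movies per genre and divide; unrated genres become NaN
genre_count = rated[genre_cols].groupby(rated["userId"]).sum()
user_profile = genre_sum / genre_count.where(genre_count > 0)
print(user_profile)
```

In this toy data, user 1 rated two Comedy movies with 4.0 and 5.0, so the Comedy entry of that profile is 4.5, while user 2 rated no Romance movie and gets a missing value there.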

Aggregate and Average to get the user profile (Takes a while):

In [None]:
df_user_copy = df_user.copy()

# Pass the copy so the helper cannot modify the original df_user
df_aggregate_rating, df_genre_rating_count, df_user_ratings_total, df_user_final = aggregate_average_user_ratings(df_user_copy, df_movie_final, df_movie_encoded)

display(df_user_ratings_total)
display(df_aggregate_rating)
display(df_genre_rating_count)
display(df_user_final)

  0%|          | 0/100836 [00:00<?, ?it/s]

### Scaling User Profile 
![df_user_profile](./images/scaling_data.jpg)

In [None]:
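# StandardScaler standardizes each column, i.e. each genre across all users.
# Open question: should the profile be normalized per row (per user) instead?

df_user_final_copy = df_user_final.copy()

# Scale every column except the first one (assumed to be userId)
columns_to_scale = df_user_final_copy.columns[1:]
scaler = StandardScaler()  # worked better than MinMaxScaler in earlier runs
# Other scalers considered (QuantileTransformer and MaxAbsScaler would also need importing):
# scaler = MinMaxScaler()
# scaler = QuantileTransformer(output_distribution='normal')
# scaler = MaxAbsScaler()

df_user_scaled = df_user_final_copy.copy()
df_user_scaled[columns_to_scale] = scaler.fit_transform(df_user_final_copy[columns_to_scale])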

pd.set_option('display.float_format', '{:.2f}'.format)
display(df_user_scaled)
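
`StandardScaler` works column by column, so each genre is standardized across users. If the intent is instead to standardize each user's own ratings across genres, the statistics have to be computed per row. A minimal sketch contrasting the two on made-up values; `by_genre` and `by_user` are illustrative names, and the row-wise variant is an alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

profile = pd.DataFrame(
    {"Action": [4.0, 2.0, 3.0], "Comedy": [5.0, 1.0, 3.0], "Drama": [3.0, 3.0, 3.5]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Column-wise: each genre ends up with mean 0 and unit variance across users
by_genre = pd.DataFrame(
    StandardScaler().fit_transform(profile),
    index=profile.index,
    columns=profile.columns,
)

# Row-wise: subtract each user's own mean and divide by their own standard deviation
row_mean = profile.mean(axis=1)
row_std = profile.std(axis=1, ddof=0)  # ddof=0 matches StandardScaler's population std
by_user = profile.sub(row_mean, axis=0).div(row_std, axis=0)

print(by_genre.round(2))
print(by_user.round(2))
```

Column-wise scaling compares a user with other users per genre, while row-wise scaling expresses which genres a user prefers relative to their own average. A row with identical values would have zero standard deviation and needs separate handling.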

## 4 - Model
### Recommendation Matrix

In [None]:
# Finding the recommendation
df_movie_matrix = df_movie_final.copy()
df_user_matrix = df_user_scaled.copy()

# Keep the identifier columns so they can be restored as labels after the product
movie_ids = df_movie_matrix['movieId']
user_ids = df_user_matrix['userId']

# Drop the non-genre columns so only genre features enter the matrix product
df_movie_matrix = df_movie_matrix.drop(columns=['movieId', 'year'])
df_user_matrix = df_user_matrix.drop(columns=['userId'])

# (movies x genres) @ (genres x users) -> (movies x users) score matrix
df_user_matrix = df_user_matrix.T
user_item_matrix = df_movie_matrix @ df_user_matrix

# Label rows by movieId and columns by userId
# (the original cell assigned movie_ids to the userId labels by mistake)
user_item_matrix.index = pd.Index(movie_ids.to_numpy(), name='movieId')
user_item_matrix.columns = pd.Index(user_ids.to_numpy(), name='userId')



display(df_movie_matrix)
display(df_user_matrix)
display(user_item_matrix)
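
The score for one movie and one user is the dot product of the movie's genre flags and the user's scaled genre profile. A toy sketch with two genres makes the arithmetic visible; the IDs and profile values are made up for illustration.

```python
import pandas as pd

genres = ["Action", "Comedy"]

# Rows are movies, columns are genre flags
movie_features = pd.DataFrame(
    [[1, 0], [0, 1], [1, 1]],
    index=pd.Index([10, 20, 30], name="movieId"),
    columns=genres,
)

# Rows are users, columns are scaled genre preferences
user_profiles = pd.DataFrame(
    [[1.0, -0.5], [-1.0, 2.0]],
    index=pd.Index([1, 2], name="userId"),
    columns=genres,
)

# (movies x genres) @ (genres x users) -> one score per movie/user pair
scores = movie_features @ user_profiles.T
print(scores)
```

Movie 30 belongs to both genres, so its score for user 1 is 1.0 + (-0.5) = 0.5. Because the flags are summed, movies with many genres can accumulate larger absolute scores than single-genre movies.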

### Top 10 Suggested Movies

In [None]:
df_movie_reference = df_movie.copy()

# Select the first 100 user columns of user_item_matrix
user_columns = user_item_matrix.columns[:100]

# Store the 10 highest-scoring movies for each selected user
top_movies_per_user = {}
for col in user_columns:
    # .copy() avoids modifying a slice of user_item_matrix when columns are added below
    top_movies_per_user[col] = user_item_matrix.nlargest(10, col)[[col]].copy()

# Attach the title and genres of each recommended movie, then display the table
movie_lookup = df_movie_reference.set_index('movieId')
for col, top_movies_df in top_movies_per_user.items():
    top_movies_df['title'] = movie_lookup.loc[top_movies_df.index, 'title'].values
    top_movies_df['genres'] = movie_lookup.loc[top_movies_df.index, 'genres'].values
    display(top_movies_df)
