# Content Based Filtering

## Outline
1. [Packages](#intro)<br>
2. [Dataset](#dataset)<br>
    2.1 [Content-based filtering with a neural network](#cbf)<br>
    2.2 [Preparing the training data](#prep)<br>
3. [Neural Network for content-based filtering](#neuralnetwork)<br>
    3.1 [Predictions](#predictions)<br>

## 1 - Packages
### Libraries and Modules
- **NumPy (np):** NumPy is a widely-used library in Python for numerical and mathematical operations. It provides support for working with arrays and matrices, making it essential for scientific and data analysis tasks.
- **NumPy Masked Array (ma):** NumPy's Masked Array module is used for handling arrays with missing or masked data. It allows you to work with data that may contain invalid or missing values without raising errors.
- **genfromtxt:** This is a function provided by NumPy for reading data from text files and creating NumPy arrays from the data. It's particularly useful for loading data from CSV files and other text-based formats.
- **collections.defaultdict:** The defaultdict class from the collections module is a specialized dictionary-like data structure that provides default values for missing keys. It's useful for various data manipulation tasks, such as counting occurrences.
- **Pandas (pd):** Pandas is a powerful library for data manipulation and analysis. It offers data structures like DataFrames and Series, making it easier to work with structured data.
- **TensorFlow (tf):** TensorFlow is an open-source machine learning framework developed by Google. It's widely used for building and training deep learning models, including neural networks.
- **Keras:** Keras is a high-level neural networks API that runs on top of TensorFlow (or other backend engines). It provides a user-friendly interface for building and training neural networks.
- **StandardScaler:** StandardScaler is a preprocessing technique provided by scikit-learn (sklearn) for standardizing features in your data. It removes the mean and scales features to have unit variance.
- **MinMaxScaler:** MinMaxScaler, also from scikit-learn, is used to scale features to a specified range, commonly [0, 1]. It's helpful when you want to normalize your data to a specific interval.
- **train_test_split:** This function, available in scikit-learn, is used to split your dataset into separate training and testing sets. It's a crucial step in machine learning for evaluating model performance.
- **tabulate:** Tabulate is a Python library for neatly formatting and pretty-printing tabular data. It helps make data presentation more readable and user-friendly.
- **pd.set_option("display.precision", 1):** This line of code configures the Pandas library to display floating-point numbers with a precision of one decimal place by default. It affects how data is presented when you print or display Pandas DataFrames or Series.

#### Terminal Instructions:
1. go to jupyter notebook directory 
2. run 'pip install tensorflow'
3. run 'pip install --upgrade numpy scipy' 
4. Restart Kernel if needed.

In [1]:
import numpy as np
import numpy.ma as ma
from numpy import genfromtxt
from collections import defaultdict
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import tabulate
pd.set_option("display.precision", 1)

from IPython.display import display
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

## 2 - Dataset
Data set used is from https://grouplens.org/datasets/movielens/latest/

#### ratings.csv
- n<sub>u</sub> = 610 users
- 100,000 individual user ratings
- ratings are out of 5

#### movies.csv
- n<sub>m</sub> = 9743 movies

Below, load the raw data sets:

In [2]:
# Load the CSV file into a DataFrame (replace 'your_file.csv' with the actual file path)
df_user = pd.read_csv(r'.\datasets\ml-latest-small\ratings.csv')
df_movie = pd.read_csv(r'.\datasets\ml-latest-small\movies.csv')

### Display Dataset

#### description of the df_user dataset:

- **userId:** This column represents the unique identifier for each user who has provided movie ratings. Users are numbered from 1 to 610, indicating there are 610 users in the dataset.
- **movieId:** This column represents the unique identifier for each movie that users have rated. The movieId values correspond to specific movies.
- **rating:** This column represents the rating given by a user to a particular movie. Ratings are typically on a scale, and in this dataset, ratings vary from 0.0 to 5.0. Higher values indicate a higher rating or preference for the movie.
- **timestamp:** This column represents the timestamp when the rating was recorded. It's a Unix timestamp, which is a numerical representation of a specific date and time.

The dataset contains a total of 5 rows at the beginning and 5 rows at the end as examples. Each row represents a single user's rating for a specific movie. For example, the first row indicates that User 1 gave a rating of 4.0 to the movie with movieId 1 (Toy Story(1995)), and the rating was recorded at the specified timestamp.

In [3]:
# Display the first 5 rows and the last 5 rows of the DataFrame
display(df_user)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


#### description of the df_movie dataset:

- **movieId:** This column represents the unique identifier for each movie in the dataset. The movieId values are used to uniquely identify each movie.
- **title:** This column contains the titles of the movies. Each row provides the title of a specific movie, along with the release year in parentheses. For example, "Toy Story (1995)" is the title of the first movie in the dataset, released in 1995.
- **genres:** This column contains information about the genres associated with each movie. Genres are typically separated by the "|" character, indicating that a movie can belong to multiple genres. For example, the first movie "Toy Story (1995)" is associated with the following genres: Adventure, Animation, Children, Comedy, and Fantasy.

The dataset contains multiple rows, each representing a different movie. In the example, there are 5 rows, each corresponding to a different movie title and associated genres.

In [4]:
display(df_movie.head())

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


### Preprocessing The Data

In [5]:
df_movie_copy = df_movie.copy()

# Perform one-hot encoding of genres
df_movie_encoded = df_movie_copy['genres'].str.get_dummies('|')
df_movie_copy[['title', 'year']] = df_movie_copy['title'].str.extract(r'^(.*?)\s\((\d{4})\)$')

df_movie_copy = df_movie_copy.drop("genres", axis=1)
df_movie_copy = df_movie_copy.drop("title", axis=1)

# Filter out the "(no genres listed)" column
if "(no genres listed)" in df_movie_encoded:
    df_movie_encoded = df_movie_encoded.drop("(no genres listed)", axis=1);

# Join the one-hot encoded genres back to the original DataFrame
df_movie_final = df_movie_copy.join(df_movie_encoded)
display(df_movie_final)

Unnamed: 0,movieId,year,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1995,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,1995,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,1995,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,1995,0,0,0,0,1,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,5,1995,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,2017,1,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9738,193583,2017,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9739,193585,2017,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9740,193587,2018,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
df_user_copy = df_user.copy()

# This gets rid of duplicates - keep here just in case
user_ratings = df_user_copy.groupby('userId').agg({'rating': ['count', 'mean']})
user_ratings.columns = ['rating count', 'rating avg']

# Flatten the hierarchical column index
user_ratings = user_ratings.reset_index()

df_user_copy = df_user_copy.drop("movieId", axis=1)
df_user_copy = df_user_copy.drop("rating", axis=1)
df_user_copy = df_user_copy.drop("timestamp", axis=1)

for target_user_id in range(1, len(user_ratings) + 1):
    #display(user_ratings['rating count'][target_user_id - 1])
    df_user_copy.loc[df_user_copy['userId'] == target_user_id, 'rating count'] = user_ratings['rating count'][target_user_id - 1]
    df_user_copy.loc[df_user_copy['userId'] == target_user_id, 'rating avg'] = user_ratings['rating avg'][target_user_id - 1]

df_user_final = df_user_copy.copy()
df_user_copy = df_user.copy()

# Get the column names from df_movie_encoded
column_names = df_movie_encoded.columns.tolist()
# Add these columns to df_user_copy with NaN values
for column in column_names:
    df_user_final[column] = None

# use df_user_copy to get all the movies the user rated, then reference df_movie_final (maybe make copy) to add up the genres      
# Create an empty dictionary to store user genre ratings
user_genre_ratings = {}

num_rows,_ = user_ratings.shape
_,num_cols = df_user_final.shape
num_cols -= 3
genre_rating_count = np.zeros((num_rows,num_cols), dtype=int)
display(genre_rating_count.shape)

total_rows = len(df_user_copy)
genre_count_index = 0
newUser = 1
# Initialize the tqdm progress bar with the total number of rows
progress_bar = tqdm(total=len(df_user_copy), position=0, leave=True)
# Iterate through each row in ratings.csv
for index, row in df_user_copy.iterrows():
    userId = row['userId']
    movieId = row['movieId']
    rating = row['rating']
        
    # Get the genres associated with the movie from df_movie_final
    movie_info = df_movie_final[df_movie_final['movieId'] == movieId]
    # Multiply the rating by 1 or 0 for each genre
    genre_ratings = rating * movie_info.iloc[:, 2:]
    
    if userId == newUser:
        if userId > 1:
            genre_rating_count[int(userId)-2] = genre_count_row
        genre_count_row = np.zeros((1,num_cols))
        newUser += 1
    else:
        genre_count_row += movie_info.iloc[:, 2:].values
    
    genre_count_index = int(userId) - 1

    if userId in user_genre_ratings:     
        user_genre_ratings[userId] = user_genre_ratings[userId].add(genre_ratings, fill_value=0)
        user_genre_ratings[userId] = user_genre_ratings[userId].sum().to_frame().T
    else:
        user_genre_ratings[userId] = genre_ratings
        
    if index == (len(df_user_copy)-1):
        genre_rating_count[int(userId)-1] = genre_count_row
    progress_bar.update(1)
    
# Convert the dictionary to a list of DataFrames
dfs = [df.assign(userId=user_id) for user_id, df in user_genre_ratings.items()]
# Concatenate the list of DataFrames into a single DataFrame
final_df = pd.concat(dfs, ignore_index=True)
# Reorder columns to move the last column to the first position
final_df = final_df[['userId'] + [col for col in final_df.columns if col != 'userId']]
# Fill NaN values with 0
final_df.fillna(0, inplace=True)

# Close the progress bar
progress_bar.close()








(610, 19)

  0%|          | 0/100836 [00:00<?, ?it/s]

In [7]:
# NEED TO DO THIS TO FINAL_DF
# Calculate the average genre ratings for each user
#for userId, genre_ratings in final_df.items():
    #display("genre_ratings: " + str(genre_ratings))
    #total_ratings = np.sum(genre_ratings)
    #display("total_ratings: " + str(total_ratings))
    #total_ratings = df_user_copy[df_user_copy['userId'] == userId]['rating'].sum()
    #final_df[userId] = genre_ratings / total_ratings  

# for genre_name, genre_ratings in final_df.items():
#     display(genre_name)
#     print()
#     display(genre_ratings)
#     print()
#     print()
#     print()
#     print()
#     print()
#     print()

# display
# display(df_user_final)
# display(df_movie_final)
# display(df_movie)
# display(df_user_copy)
display(final_df)
display(genre_rating_count)
display(user_ratings)

final_df_copy = final_df.copy()

final_df_values = final_df_copy.values
# Set the "divide" warning to "ignore"
np.seterr(divide='ignore', invalid='ignore')    
final_df_values[:, 1:] = np.where(genre_rating_count != 0, final_df_values[:, 1:] / genre_rating_count, 0)
df_user_final = pd.DataFrame(final_df_values, columns=final_df.columns)
display(df_user_final)

Unnamed: 0,userId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1.0,389.0,373.0,136.0,191.0,355.0,196.0,0.0,308.0,202.0,5.0,59.0,0.0,103.0,75.0,112.0,169.0,228.0,99.0,30.0
1,2.0,43.5,12.5,0.0,0.0,28.0,38.0,13.0,66.0,0.0,0.0,3.0,15.0,0.0,8.0,4.5,15.5,37.0,4.5,3.5
2,3.0,50.0,30.0,2.0,2.5,9.0,1.0,0.0,12.0,13.5,0.0,37.5,0.0,0.5,5.0,2.5,63.0,29.0,2.5,0.0
3,4.0,83.0,106.0,24.0,38.0,365.0,103.0,8.0,418.0,70.0,16.0,17.0,3.0,64.0,80.0,196.0,34.0,135.0,25.0,38.0
4,5.0,28.0,26.0,26.0,37.0,52.0,46.0,0.0,95.0,29.0,0.0,3.0,11.0,22.0,4.0,34.0,5.0,32.0,10.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606.0,480.0,515.0,156.0,169.0,1501.0,486.0,19.0,2644.0,349.0,30.5,174.0,49.0,164.0,345.0,1328.0,281.0,701.5,246.5,58.0
606,607.0,268.0,156.0,20.0,65.0,183.0,103.0,0.0,329.0,75.0,0.0,144.0,5.0,18.0,79.0,102.0,117.0,251.0,25.0,8.0
607,608.0,922.5,583.0,171.5,216.5,971.5,527.5,18.0,962.5,333.0,15.0,322.0,48.0,91.0,245.0,306.0,550.5,916.0,68.0,29.0
608,609.0,34.0,32.0,3.0,6.0,23.0,21.0,6.0,64.0,3.0,0.0,7.0,3.0,0.0,0.0,16.0,15.0,46.0,14.0,4.0


array([[ 90,  84,  28, ...,  55,  22,   7],
       [ 11,   3,   0, ...,  10,   1,   1],
       [ 14,  11,   4, ...,   7,   5,   0],
       ...,
       [277, 180,  54, ..., 259,  19,  11],
       [ 11,   9,   0, ...,  14,   4,   1],
       [517, 266,  65, ..., 510,  47,  33]])

Unnamed: 0,userId,rating count,rating avg
0,1,232,4.4
1,2,29,3.9
2,3,39,2.4
3,4,216,3.6
4,5,44,3.6
...,...,...,...
605,606,1115,3.7
606,607,187,3.8
607,608,831,3.1
608,609,37,3.3


Unnamed: 0,userId,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1.0,4.3,4.4,4.9,4.7,4.3,4.4,0.0,4.5,4.4,5.0,3.5,0.0,4.7,4.2,4.3,4.2,4.1,4.5,4.3
1,2.0,4.0,4.2,0.0,0.0,4.0,4.2,4.3,4.1,0.0,0.0,3.0,3.8,0.0,4.0,4.5,3.9,3.7,4.5,3.5
2,3.0,3.6,2.7,0.5,0.5,1.0,0.5,0.0,0.8,3.4,0.0,4.7,0.0,0.5,5.0,0.5,4.2,4.1,0.5,0.0
3,4.0,3.3,3.7,4.0,3.8,3.5,4.0,4.0,3.5,3.7,4.0,4.2,3.0,4.0,3.5,3.4,2.8,3.6,3.6,3.8
4,5.0,3.1,3.7,5.2,4.6,3.7,3.8,0.0,3.8,4.8,0.0,3.0,3.7,4.4,4.0,3.1,2.5,3.6,3.3,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606.0,3.2,3.5,3.8,3.5,3.6,3.7,3.8,3.8,3.6,3.8,3.3,3.1,3.7,3.8,3.7,3.6,3.5,3.8,3.4
606,607.0,3.7,3.5,4.0,3.6,3.4,3.8,0.0,4.0,3.8,0.0,4.1,5.0,3.6,4.6,3.5,3.2,4.1,4.2,4.0
607,608.0,3.3,3.2,3.2,2.5,2.7,3.6,3.0,3.4,3.0,3.8,3.3,4.0,2.8,3.6,2.9,3.3,3.5,3.6,2.6
608,609.0,3.1,3.6,0.0,6.0,3.8,3.5,3.0,3.4,0.0,0.0,3.5,3.0,0.0,0.0,3.2,3.0,3.3,3.5,4.0
