# Content-Based Filtering Recommendation System

## Introduction
This notebook implements a content-based filtering recommendation system using neural networks. It trains two separate networks, one on user features and one on item (anime) features, and combines their outputs with a dot product followed by a final dense layer to predict the rating a user would give an item.

## How Content-Based Filtering Works with Neural Networks

### Step-by-Step Process

1. **Data Preparation**:
   - Collect and preprocess user and item (movie) data.
   - Encode user attributes (e.g., age, gender, preferences) and item attributes (e.g., genre, director, cast) into feature vectors.

2. **Neural Network Architecture**:
   - **User Network**: A neural network that processes user feature vectors. It consists of multiple dense layers that learn a user representation.
   - **Item Network**: A neural network that processes item feature vectors. It also consists of multiple dense layers that learn an item representation.

3. **Combining User and Item Representations**:
   - The outputs of the user network and the item network are combined using a dot product, which captures the interaction between user and item features as a single score.
   - A final dense layer maps that score to the predicted rating.

4. **Training the Model**:
   - The model is trained with a loss function that measures the difference between the predicted and actual ratings (mean squared error in this notebook).
   - The model parameters are optimized with a gradient-based optimizer (Adam in this notebook).

5. **Making Predictions**:
   - After training, the model can predict ratings for unseen user-item pairs by feeding their feature vectors through the respective networks and combining the outputs.
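
The steps above can be sketched as a plain NumPy forward pass. This is only an illustration with randomly initialized, untrained weights; the helper names (`init_tower`, `tower`, `predict`) and the toy batch are assumptions made for the example, not part of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_tower(n_in, sizes):
    """Random weights and zero biases for a stack of dense layers."""
    params, prev = [], n_in
    for n_out in sizes:
        params.append((rng.normal(0.0, 0.1, size=(prev, n_out)), np.zeros(n_out)))
        prev = n_out
    return params

def tower(x, params):
    """Apply dense layers with ReLU activations to a batch of feature vectors."""
    for W, b in params:
        x = np.maximum(x @ W + b, 0.0)
    return x

def predict(x_user, x_item, user_params, item_params, w=1.0, b=0.0):
    """Dot product of the two representations, then a 1-unit linear layer."""
    u = tower(x_user, user_params)              # (batch, 32)
    v = tower(x_item, item_params)              # (batch, 32)
    dot = np.sum(u * v, axis=1, keepdims=True)  # (batch, 1)
    return w * dot + b

# Hypothetical toy batch: 4 users with 3 features, 4 items with 8 features.
user_params = init_tower(3, [128, 64, 32])
item_params = init_tower(8, [128, 64, 32])
x_user = rng.random((4, 3))
x_item = rng.random((4, 8))
ratings = predict(x_user, x_item, user_params, item_params)
print(ratings.shape)  # (4, 1)
```

Because the dot product is a single number per user-item pair, the final layer only has one weight and one bias to learn.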

### Example Neural Network Architecture

1. **User Network**:
   - Input: User feature vector
   - Dense Layer 1: `units=128`, `activation='relu'`
   - Dense Layer 2: `units=64`, `activation='relu'`
   - Dense Layer 3: `units=32`, `activation='relu'`
   - Output: User representation vector

2. **Item Network**:
   - Input: Item feature vector
   - Dense Layer 1: `units=128`, `activation='relu'`
   - Dense Layer 2: `units=64`, `activation='relu'`
   - Dense Layer 3: `units=32`, `activation='relu'`
   - Output: Item representation vector

3. **Combining and Predicting**:
   - Dot Product: Combine user and item representations
   - Final Dense Layer: `units=1`, `activation='linear'`
   - Output: Predicted rating
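
To get a feel for the size of this architecture, the trainable parameters can be counted by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below assumes the feature counts used later in this notebook (3 user features, 8 item features); the function names are made up for the example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def tower_params(n_in, sizes=(128, 64, 32)):
    """Total trainable parameters of a stack of dense layers."""
    total, prev = 0, n_in
    for n_out in sizes:
        total += dense_params(prev, n_out)
        prev = n_out
    return total

user_total = tower_params(3)     # 3 user features: age, gender, nationality
item_total = tower_params(8)     # 8 item features
head_total = dense_params(1, 1)  # final layer sees the scalar dot product
print(user_total, item_total, head_total, user_total + item_total + head_total)
# 10848 11488 2 22338
```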

## Prepare the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
anime = pd.read_csv('/kaggle/input/anime-recommendation-database-2020/anime.csv')
rating = pd.read_csv('/kaggle/input/anime-recommendation-database-2020/rating_complete.csv')

In [3]:
# Create user dataframe
user=pd.DataFrame()
user['user_id']=rating['user_id'].unique()

In [4]:
# Generate random placeholder values for age, gender, and nationality
user['age'] = np.random.randint(18, 66, size=len(user))
user['gender'] = np.random.choice(['Male', 'Female'], size=len(user))
nationalities = ['American', 'Canadian', 'British', 'Australian', 'Indian', 'Chinese', 'German', 'French', 'Japanese', 'Brazilian']
user['nationality'] = np.random.choice(nationalities, size=len(user))

In [5]:
user.head()

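|   | user_id | age | gender | nationality |
|---|---------|-----|--------|-------------|
| 0 | 0 | 42 | Female | French |
| 1 | 1 | 54 | Male | French |
| 2 | 2 | 26 | Female | Canadian |
| 3 | 3 | 37 | Female | German |
| 4 | 4 | 45 | Female | Chinese |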


In [6]:
anime.head()

|   | MAL_ID | Name | Score | Genres | English name | Japanese name | Type | Episodes | Aired | Premiered | ... | Score-10 | Score-9 | Score-8 | Score-7 | Score-6 | Score-5 | Score-4 | Score-3 | Score-2 | Score-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Cowboy Bebop | 8.78 | Action, Adventure, Comedy, Drama, Sci-Fi, Space | Cowboy Bebop | カウボーイビバップ | TV | 26 | Apr 3, 1998 to Apr 24, 1999 | Spring 1998 | ... | 229170.0 | 182126.0 | 131625.0 | 62330.0 | 20688.0 | 8904.0 | 3184.0 | 1357.0 | 741.0 | 1580.0 |
| 1 | 5 | Cowboy Bebop: Tengoku no Tobira | 8.39 | Action, Drama, Mystery, Sci-Fi, Space | Cowboy Bebop:The Movie | カウボーイビバップ 天国の扉 | Movie | 1 | Sep 1, 2001 | Unknown | ... | 30043.0 | 49201.0 | 49505.0 | 22632.0 | 5805.0 | 1877.0 | 577.0 | 221.0 | 109.0 | 379.0 |
| 2 | 6 | Trigun | 8.24 | Action, Sci-Fi, Adventure, Comedy, Drama, Shounen | Trigun | トライガン | TV | 26 | Apr 1, 1998 to Sep 30, 1998 | Spring 1998 | ... | 50229.0 | 75651.0 | 86142.0 | 49432.0 | 15376.0 | 5838.0 | 1965.0 | 664.0 | 316.0 | 533.0 |
| 3 | 7 | Witch Hunter Robin | 7.27 | Action, Mystery, Police, Supernatural, Drama, ... | Witch Hunter Robin | Witch Hunter ROBIN (ウイッチハンターロビン) | TV | 26 | Jul 2, 2002 to Dec 24, 2002 | Summer 2002 | ... | 2182.0 | 4806.0 | 10128.0 | 11618.0 | 5709.0 | 2920.0 | 1083.0 | 353.0 | 164.0 | 131.0 |
| 4 | 8 | Bouken Ou Beet | 6.98 | Adventure, Fantasy, Shounen, Supernatural | Beet the Vandel Buster | 冒険王ビィト | TV | 52 | Sep 30, 2004 to Sep 29, 2005 | Fall 2004 | ... | 312.0 | 529.0 | 1242.0 | 1713.0 | 1068.0 | 634.0 | 265.0 | 83.0 | 50.0 | 27.0 |


## Preprocessing

The Genres column can hold several comma-separated values per title. We split these values and one-hot encode the 4 most frequent genres. We also encode the Type column as integer category codes.

### Preprocess anime data

In [7]:
# Split genres and explode the dataframe
df_exploded = anime.assign(Genres=anime['Genres'].str.split(', ')).explode('Genres')

# Get the top 4 most frequent genres
top_genres = df_exploded['Genres'].value_counts().head(4).index

# One-hot encode top genres
for genre in top_genres:
    anime[genre] = anime['Genres'].apply(lambda x: 1 if genre in x else 0)
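
The `genre in x` test above is a substring match on the raw string, so in principle a genre name that is contained in a longer one would also match. A minimal sketch of an exact-token variant on a hypothetical toy frame (the rows and the `toy` name are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real anime table.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Genres": ["Action, Comedy", "Shounen Ai, Drama", "Shounen, Action"],
})

# Split once into lists of genre tokens, then test list membership.
tokens = toy["Genres"].str.split(", ")
for genre in ["Shounen", "Action"]:
    toy[genre] = tokens.apply(lambda ts, g=genre: int(g in ts))

# The substring test would flag row 1 as "Shounen" because of "Shounen Ai".
substring_hit = "Shounen" in toy.loc[1, "Genres"]
print(substring_hit, toy["Shounen"].tolist())
```

For the four genres selected in this notebook the two approaches should agree; the difference only matters when one genre name is a prefix or part of another.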

In [8]:
anime['Type'] = anime['Type'].astype('category').cat.codes
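
The cell above maps each Type value to an integer code, which gives the categories an arbitrary numeric order. If a true one-hot encoding is preferred, one possible alternative is `pd.get_dummies`; the toy values below are assumptions for illustration, not the full set of types in the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"Type": ["TV", "Movie", "OVA", "TV"]})

# Integer codes: categories are sorted alphabetically, then numbered.
codes = toy["Type"].astype("category").cat.codes

# One-hot alternative: one 0/1 column per category.
one_hot = pd.get_dummies(toy["Type"], prefix="Type", dtype=int)

print(codes.tolist())  # [2, 0, 1, 2]
print(one_hot)
```

One-hot columns would change the number of anime input features, so the feature lists used later would need to be adjusted accordingly.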


In [9]:
anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17562 entries, 0 to 17561
Data columns (total 39 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MAL_ID         17562 non-null  int64 
 1   Name           17562 non-null  object
 2   Score          17562 non-null  object
 3   Genres         17562 non-null  object
 4   English name   17562 non-null  object
 5   Japanese name  17562 non-null  object
 6   Type           17562 non-null  int8  
 7   Episodes       17562 non-null  object
 8   Aired          17562 non-null  object
 9   Premiered      17562 non-null  object
 10  Producers      17562 non-null  object
 11  Licensors      17562 non-null  object
 12  Studios        17562 non-null  object
 13  Source         17562 non-null  object
 14  Duration       17562 non-null  object
 15  Rating         17562 non-null  object
 16  Ranked         17562 non-null  object
 17  Popularity     17562 non-null  int64 
 18  Members        17562 non-n

In [10]:
# Convert from object to numeric
anime['Score'] = pd.to_numeric(anime['Score'], errors='coerce').fillna(0.0)
anime['Score-1'] = pd.to_numeric(anime['Score-1'], errors='coerce').fillna(0.0)
anime['Score-2'] = pd.to_numeric(anime['Score-2'], errors='coerce').fillna(0.0)
anime['MAL_ID'] = pd.to_numeric(anime['MAL_ID'], errors='coerce').fillna(0.0)
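
With `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN`, and `fillna(0.0)` then turns it into a score of 0. The sketch below shows that behaviour on a toy series and, as one possible alternative (not what the notebook does), fills with the column mean instead. The placeholder string `"Unknown"` is an assumption for the example.

```python
import pandas as pd

raw = pd.Series(["8.78", "Unknown", "7.27"])

# Same pattern as the cell above: unparseable values end up as 0.0.
as_zero = pd.to_numeric(raw, errors="coerce").fillna(0.0)

# Alternative: replace missing scores with the mean of the known ones.
as_mean = pd.to_numeric(raw, errors="coerce")
as_mean = as_mean.fillna(as_mean.mean())

print(as_zero.tolist())
print(as_mean.tolist())
```

Filling with 0 makes an unscored title look like a very poorly rated one to the model, which is worth keeping in mind when reading the predictions for titles whose Score is 0.00.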

### Preprocess user data

In [11]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler



In [12]:
# Encode categorical variables
le_gender = LabelEncoder()
le_nationality = LabelEncoder()

user['gender'] = le_gender.fit_transform(user['gender'])
user['nationality'] = le_nationality.fit_transform(user['nationality'])

# Normalize age
scaler = MinMaxScaler()
user['age'] = scaler.fit_transform(user[['age']])
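
Min-max scaling maps each value to `(x - min) / (max - min)`, so the youngest age becomes 0 and the oldest becomes 1. A small NumPy sketch of the same arithmetic, using three example ages from the 18 to 65 range generated earlier:

```python
import numpy as np

ages = np.array([18, 42, 65], dtype=float)

# Scale to [0, 1] the same way MinMaxScaler does with its default range.
scaled = (ages - ages.min()) / (ages.max() - ages.min())

# Invert the scaling to recover the original ages.
restored = scaled * (ages.max() - ages.min()) + ages.min()

print(scaled)
print(restored)
```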


### Merge the data

In [13]:
# Prepare user and anime features
user_features = user[['user_id', 'age', 'gender', 'nationality']]
anime_features = anime[['MAL_ID','Score','Score-1','Score-2','Type','Comedy','Action','Fantasy','Adventure']] 

In [14]:
# Merge data for training
data = pd.merge(rating, user_features, on='user_id')
data = pd.merge(data, anime_features, left_on='anime_id', right_on='MAL_ID')

# Sample a fraction of the data to reduce size
sample_fraction = 0.005  # Use 0.5% of the data
data_sampled = data.sample(frac=sample_fraction, random_state=42)

# Prepare input data
X_user = data_sampled[[ 'age', 'gender', 'nationality']].values
X_anime = data_sampled[['Score','Score-1','Score-2','Type','Comedy','Action','Fantasy','Adventure']].values
y = data_sampled['rating'].values

In [15]:
# Split data
X_user_train, X_user_test, X_anime_train, X_anime_test, y_train, y_test = train_test_split(
    X_user, X_anime, y, test_size=0.2, random_state=42)

## Create Model

In [16]:
# Combined network input
user_input = tf.keras.layers.Input(shape=(X_user_train.shape[1],), name='user_input')
anime_input = tf.keras.layers.Input(shape=(X_anime_train.shape[1],), name='anime_input')

# User network
user_dense_1 = tf.keras.layers.Dense(128, activation='relu')(user_input)
user_dense_2 = tf.keras.layers.Dense(64, activation='relu')(user_dense_1)
user_output = tf.keras.layers.Dense(32, activation='relu')(user_dense_2)

# Anime network
anime_dense_1 = tf.keras.layers.Dense(128, activation='relu')(anime_input)
anime_dense_2 = tf.keras.layers.Dense(64, activation='relu')(anime_dense_1)
anime_output = tf.keras.layers.Dense(32, activation='relu')(anime_dense_2)

# Dot product
dot_product = tf.keras.layers.Dot(axes=1)([user_output, anime_output])
output = tf.keras.layers.Dense(1)(dot_product)

# Clip the predictions to be within the range of 0-10
clipped_output = tf.keras.layers.Lambda(lambda x: tf.clip_by_value(x, 0, 10))(output)

# Compile model
model = tf.keras.Model(inputs=[user_input, anime_input], outputs=clipped_output)
model.compile(optimizer='adam', loss='mean_squared_error')

# Train model (the test split is also used as validation data here)
model.fit([X_user_train, X_anime_train], y_train, epochs=2, batch_size=32, validation_data=([X_user_test, X_anime_test], y_test))

# Make predictions
predictions = model.predict([X_user_test, X_anime_test])

# Evaluate model
loss = model.evaluate([X_user_test, X_anime_test], y_test)
print(f'Test Loss: {loss}')

Epoch 1/2
7205/7205 ━━━━━━━━━━━━━━━━━━━━ 20s 2ms/step - loss: 5.5911 - val_loss: 2.9214
Epoch 2/2
7205/7205 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - loss: 2.9149 - val_loss: 2.5013
1802/1802 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step
1802/1802 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - loss: 2.4659
Test Loss: 2.5013487339019775
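
Since the loss is mean squared error, its square root is easier to interpret because it is expressed in rating points. A quick conversion of the test loss printed above:

```python
import math

test_mse = 2.5013487339019775  # "Test Loss" printed above
test_rmse = math.sqrt(test_mse)
print(f"RMSE: {test_rmse:.2f} rating points")  # RMSE: 1.58 rating points
```

In other words, on this sample the predictions are off by roughly 1.6 points on the 10-point scale, in the root-mean-square sense.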


In [17]:
model.save('model.h5')

## Recommendation

In [18]:
user_0=user[user.user_id==0].drop('user_id',axis=1)
user_0 = np.tile(user_0, (len(anime),1))
rec=model.predict([user_0, np.array(anime[['Score','Score-1','Score-2','Type','Comedy','Action','Fantasy','Adventure']])])

549/549 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step


In [19]:
rec_merged=np.hstack([ np.array(anime[['MAL_ID','Score']]),rec])
rec_merged_df = pd.DataFrame(rec_merged, columns=['MAL_ID','Score','Pred_Rating'])
rec_merged_df

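|   | MAL_ID | Score | Pred_Rating |
|---|--------|-------|-------------|
| 0 | 1.0 | 8.78 | 8.900802 |
| 1 | 5.0 | 8.39 | 8.254961 |
| 2 | 6.0 | 8.24 | 8.626988 |
| 3 | 7.0 | 7.27 | 7.545678 |
| 4 | 8.0 | 6.98 | 7.121479 |
| ... | ... | ... | ... |
| 17557 | 48481.0 | 0.00 | 4.875769 |
| 17558 | 48483.0 | 0.00 | 6.346796 |
| 17559 | 48488.0 | 0.00 | 6.239069 |
| 17560 | 48491.0 | 0.00 | 6.358071 |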


In [20]:
result_df = rec_merged_df.merge(anime, on='MAL_ID', how='inner')
result_df=result_df[['MAL_ID','English name','Pred_Rating']]

In [21]:
result_df.sort_values(by='Pred_Rating',ascending=False).head(10)

|   | MAL_ID | English name | Pred_Rating |
|---|--------|--------------|-------------|
| 8551 | 21881.0 | Sword Art Online II | 10.0 |
| 15926 | 40028.0 | Attack on Titan Final Season | 10.0 |
| 6614 | 11757.0 | Sword Art Online | 10.0 |
| 3971 | 5114.0 | Fullmetal Alchemist:Brotherhood | 10.0 |
| 1490 | 1639.0 | Unknown | 10.0 |
| 8058 | 19315.0 | Pupa | 10.0 |
| 11 | 21.0 | One Piece | 9.674823 |
| 9664 | 27899.0 | Tokyo Ghoul √A | 9.661354 |
| 14963 | 38524.0 | Attack on Titan Season 3 Part 2 | 9.410432 |
| 387 | 413.0 | Mars of Destruction | 9.170363 |
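
Several titles above are tied at exactly 10.0 because the model clips its output, so any raw score above the cap collapses to the same value and the ordering among those titles is lost. One possible alternative, shown here only as a NumPy sketch and not as the notebook's method, is a scaled sigmoid that squashes scores smoothly into the rating range. The `low=1.0` bound and the raw scores are assumptions for the example.

```python
import numpy as np

def clip_rating(z, low=0.0, high=10.0):
    """Hard clipping, as in the model above."""
    return np.clip(z, low, high)

def sigmoid_rating(z, low=1.0, high=10.0):
    """Smooth squashing into (low, high); strictly increasing in z."""
    return low + (high - low) / (1.0 + np.exp(-z))

raw = np.array([10.5, 12.0, 15.0])  # hypothetical raw scores above the cap
clipped = clip_rating(raw)
squashed = sigmoid_rating(raw)
print(clipped)             # every entry equals 10.0
print(np.diff(squashed))   # all positive: the ordering is preserved
```

A model trained with such an output layer would learn raw scores on a different scale, so this only illustrates the monotonicity argument, not the values it would actually predict.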


In [22]:
actual_rating=rating[rating.user_id==0]
actual_rating

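|   | user_id | anime_id | rating |
|---|---------|----------|--------|
| 0 | 0 | 430 | 9 |
| 1 | 0 | 1004 | 5 |
| 2 | 0 | 3010 | 7 |
| 3 | 0 | 570 | 7 |
| 4 | 0 | 2762 | 9 |
| 5 | 0 | 431 | 8 |
| 6 | 0 | 578 | 10 |
| 7 | 0 | 433 | 6 |
| 8 | 0 | 1571 | 10 |
| 9 | 0 | 121 | 9 |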
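
The ratings above also tell us which titles the user has already seen, and a recommender would usually leave those out of the final list. A minimal sketch of that filtering step on toy data; the helper name `top_n_unseen` and the toy frame are hypothetical.

```python
import pandas as pd

def top_n_unseen(pred_df, rated_ids, n=10):
    """Return the n highest predicted ratings among items the user has not rated."""
    unseen = pred_df[~pred_df["MAL_ID"].isin(rated_ids)]
    return unseen.nlargest(n, "Pred_Rating")

# Hypothetical toy predictions; in the notebook this would be result_df.
toy_pred = pd.DataFrame({
    "MAL_ID": [1, 5, 6, 7],
    "Pred_Rating": [8.9, 8.3, 8.6, 7.5],
})
toy_rated = {1, 7}  # items the user already rated

top_unseen = top_n_unseen(toy_pred, toy_rated, n=2)
print(top_unseen)
```

With the notebook's own frames, a call along the lines of `top_n_unseen(result_df, actual_rating['anime_id'])` should work the same way.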


In [23]:
full_rating = pd.merge(result_df, actual_rating, left_on='MAL_ID',right_on='anime_id', how='inner')
full_rating = full_rating[['English name', 'Pred_Rating', 'rating']]
full_rating

|   | English name | Pred_Rating | rating |
|---|--------------|-------------|--------|
| 0 | Black Cat | 7.530514 | 6 |
| 1 | Fullmetal Alchemist | 8.59576 | 9 |
| 2 | Princess Mononoke | 8.593861 | 8 |
| 3 | Lunar Legend Tsukihime | 7.467288 | 7 |
| 4 | Tenjho Tenge | 7.302282 | 4 |
| 5 | Spirited Away | 8.568235 | 8 |
| 6 | Fate/stay night | 7.833812 | 9 |
| 7 | My Neighbors the Yamadas | 7.270204 | 10 |
| 8 | Samurai Deeper Kyo | 7.204611 | 8 |
| 9 | Fullmetal Alchemist:The Movie - Conqueror of S... | 7.271306 | 9 |
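
To put a number on how close these predictions are, the mean absolute error and root-mean-square error can be computed over the ten rows shown above. The values are copied from that output; ten rows for one user are far too few to draw conclusions from, so treat this as a worked example of the arithmetic only.

```python
pred = [7.530514, 8.59576, 8.593861, 7.467288, 7.302282,
        8.568235, 7.833812, 7.270204, 7.204611, 7.271306]
actual = [6, 9, 8, 7, 4, 8, 9, 10, 8, 9]

errors = [p - a for p, a in zip(pred, actual)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```

The mean absolute error comes out at about 1.3 rating points, with the largest misses on the titles the user rated 4 and 10.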


## Conclusion
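Content-based filtering generates personalized recommendations from item attributes and user attributes. Because it scores a user-item pair from features rather than from past interactions alone, it can still produce recommendations for new users or new items, a cold-start setting where collaborative filtering tends to struggle. Note that the user attributes in this notebook are randomly generated, so they carry no real preference signal; with genuine user data, the user network would have meaningful features to learn from.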