**Unsupervised_Learning_Project: Team JB3**

### **Building an Anime Recommendation System**

<div align="center" style="font-size: 40%; text-align: center; margin: 0 auto">
    <img src="https://mcdn.wallpapersafari.com/medium/67/98/JKSuGa.jpg" style="display: block; margin-left: auto; margin-right: auto; width: 800px; height: 200px;" />
</div>


### **Project Overview**

**Introduction**

Anime has become a global phenomenon, captivating audiences with its unique storytelling, diverse genres, and vibrant characters. With an ever-growing collection of titles available, it can be hard for viewers to discover new series that match their tastes.

This project addresses that problem by developing an anime recommendation system that combines collaborative filtering and content-based filtering to predict how a user will rate a title they have not yet viewed.

**Objective**

The primary objective of this project is to create an end-to-end recommendation system capable of providing personalized anime recommendations to users. This involves:

- **Data Loading and Preprocessing**: Cleaning and preparing the datasets for analysis.
- **Collaborative Filtering Model**: Using user ratings to recommend anime titles.
- **Content-Based Filtering Model**: Utilizing anime metadata to find similar titles.
- **Hybrid Recommender System**: Combining collaborative and content-based models for enhanced accuracy.
- **Model Evaluation**: Assessing the performance of the models using relevant metrics.
- **Deployment**: Deploying the recommendation system as a web application for easy user access.
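
To make the content-based idea in the list above concrete, here is a minimal sketch that compares titles by their genre tags using cosine similarity. The toy titles, the `most_similar` helper, and the choice of cosine similarity are assumptions for illustration, not the project's final model.

```python
import numpy as np
import pandas as pd

# Toy metadata; the titles and genres are illustrative placeholders
toy_anime = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Action, Comedy, Parody", "Drama, Romance"],
})

# Multi-hot genre matrix: one 0/1 column per individual genre
genre_matrix = toy_anime["genre"].str.get_dummies(sep=", ").to_numpy(dtype=float)

# Cosine similarity between every pair of titles
norms = np.linalg.norm(genre_matrix, axis=1, keepdims=True)
unit = genre_matrix / norms
similarity = unit @ unit.T


def most_similar(title_index: int) -> str:
    """Return the name of the title most similar to the given row, excluding itself."""
    scores = similarity[title_index].copy()
    scores[title_index] = -1.0  # mask the title's similarity with itself
    return toy_anime["name"].iloc[int(np.argmax(scores))]
```

Calling `most_similar(0)` looks up the nearest neighbour of title "A". A collaborative model would instead derive similarity from the user ratings, and a hybrid would blend the two scores.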

### **Loading Packages**

In [2]:
import numpy as np
import pandas as pd
import cufflinks as cf
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestRegressor

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### **Data Loading**

**Data Sources**

The project uses three primary datasets:

- **anime.csv**: Contains information about anime titles, including genres, type, number of episodes, average rating, and number of members.
- **train.csv**: Contains rating data supplied by individual users for individual anime titles: the user_id, the anime_id of the title watched, and the rating given (if applicable).
- **test.csv**: Used to create the final submission. It contains only the user_id and anime_id columns; the rating for each pair is what the model must predict.

In [3]:
anime_df = pd.read_csv('anime.csv')
anime_df.head()

Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [4]:
train_df = pd.read_csv('train.csv')
train_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,11617,10
1,1,11757,10
2,1,15451,10
3,2,11771,10
4,3,20,8


In [5]:
test_df = pd.read_csv('test.csv')
test_df.head()

Unnamed: 0,user_id,anime_id
0,40763,21405
1,68791,10504
2,40487,1281
3,55290,165
4,72323,11111


### **Initial Data Inspection**

Initial data inspection is a crucial step in any data science project. It establishes the structure, quality, and characteristics of the data before any analysis or modeling. The steps used here are:


***View the Data Structure***

- Use df.shape to get the number of rows and columns.
- Use df.head() and df.tail() to inspect the first and last few rows.


**Anime dataset**

In [6]:
# Display the first few rows of the dataframe
anime_df.head()


Unnamed: 0,anime_id,name,genre,type,episodes,rating,members
0,32281,Kimi no Na wa.,"Drama, Romance, School, Supernatural",Movie,1,9.37,200630
1,5114,Fullmetal Alchemist: Brotherhood,"Action, Adventure, Drama, Fantasy, Magic, Mili...",TV,64,9.26,793665
2,28977,Gintama°,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.25,114262
3,9253,Steins;Gate,"Sci-Fi, Thriller",TV,24,9.17,673572
4,9969,Gintama&#039;,"Action, Comedy, Historical, Parody, Samurai, S...",TV,51,9.16,151266


In [13]:
# Getting the shape of the DataFrame
anime_df.shape

(12294, 7)

In [None]:
# Display basic information about the dataframe
anime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12232 non-null  object 
 3   type      12269 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12064 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB


**Analysis**

The DataFrame anime_df has 12,294 rows and 7 columns. Three columns contain missing values: genre (62), type (25), and rating (230). The episodes column is stored as object rather than a numeric type, so it needs conversion before it can be used as a numeric feature.
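
The info() output above reports non-null counts below the row total for some columns and an object dtype for episodes. A minimal sketch of how both observations could be quantified is shown below. It uses a toy stand-in frame, and the placeholder string "Unknown" in episodes is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for anime_df; the values are illustrative, not taken from anime.csv
toy_anime = pd.DataFrame({
    "genre": ["Drama", None, "Action", "Comedy"],
    "episodes": ["1", "64", "Unknown", "24"],
    "rating": [9.37, 9.26, None, None],
})

# Share of missing values per column, as a percentage
missing_pct = toy_anime.isnull().mean() * 100

# Entries in 'episodes' that cannot be parsed as numbers
parsed = pd.to_numeric(toy_anime["episodes"], errors="coerce")
non_numeric = toy_anime.loc[parsed.isna(), "episodes"].unique().tolist()
```

Running the same two steps on anime_df would show which values force episodes to be read as object.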

In [None]:
# Display summary statistics
anime_df.describe()

Unnamed: 0,anime_id,rating,members
count,12294.0,12064.0,12294.0
mean,14058.221653,6.473902,18071.34
std,11455.294701,1.026746,54820.68
min,1.0,1.67,5.0
25%,3484.25,5.88,225.0
50%,10260.5,6.57,1550.0
75%,24794.5,7.18,9437.0
max,34527.0,10.0,1013917.0


**Train and Test datasets**

In [7]:
train_df.head()

Unnamed: 0,user_id,anime_id,rating
0,1,11617,10
1,1,11757,10
2,1,15451,10
3,2,11771,10
4,3,20,8


In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016656 entries, 0 to 1016655
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   user_id   1016656 non-null  int64
 1   anime_id  1016656 non-null  int64
 2   rating    1016656 non-null  int64
dtypes: int64(3)
memory usage: 23.3 MB


In [9]:
train_df.describe()

Unnamed: 0,user_id,anime_id,rating
count,1016656.0,1016656.0,1016656.0
mean,6641.019,8658.027,7.785153
std,3692.919,8946.583,1.578809
min,1.0,1.0,1.0
25%,3599.0,1055.0,7.0
50%,6576.0,5420.0,8.0
75%,9980.0,13759.0,9.0
max,13214.0,34325.0,10.0


In [10]:
test_df.head()

Unnamed: 0,user_id,anime_id
0,40763,21405
1,68791,10504
2,40487,1281
3,55290,165
4,72323,11111


In [11]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633686 entries, 0 to 633685
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   633686 non-null  int64
 1   anime_id  633686 non-null  int64
dtypes: int64(2)
memory usage: 9.7 MB


In [12]:
test_df.describe()

Unnamed: 0,user_id,anime_id
count,633686.0,633686.0
mean,36777.752605,8909.389543
std,21028.33097,8880.430436
min,1.0,1.0
25%,18974.0,1240.0
50%,36919.0,6213.0
75%,54908.0,14131.0
max,73516.0,34367.0


***Making Copies of the Datasets***
- Making copies of datasets can be important to ensure the original data remains unchanged during various preprocessing steps, analysis, or experimentation.
- In Python, especially when using pandas, this can be done using the .copy() method.
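
As a small illustration of why .copy() matters, the sketch below contrasts a plain assignment, which only creates a second name for the same object, with an independent copy. The frame and variable names are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({"rating": [10, 8, 7]})

alias = original               # plain assignment: both names refer to the same object
independent = original.copy()  # deep copy by default: the data is duplicated

# Modify the copy; the original stays unchanged
independent.loc[0, "rating"] = 1

# Modify through the alias; the original changes too
alias.loc[1, "rating"] = 2
```

After these two writes, `original` reflects only the change made through `alias`.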

In [14]:
anime_df_copy = anime_df.copy()
train_df_copy = train_df.copy()
test_df_copy = test_df.copy()

### **Data Cleaning**

***Anime Dataset***

In [16]:
# 1. Handle missing values
# Check for missing values
missing_values = anime_df_copy.isnull().sum()
print("Missing values in each column:\n", missing_values)


Missing values in each column:
 anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64


In [17]:
# Fill missing values in 'genre' and 'type' with 'Unknown'
# (assigning back avoids chained-assignment warnings from inplace=True on a column)
anime_df_copy['genre'] = anime_df_copy['genre'].fillna('Unknown')
anime_df_copy['type'] = anime_df_copy['type'].fillna('Unknown')

# Fill missing values in 'rating' with the column mean
anime_df_copy['rating'] = anime_df_copy['rating'].fillna(anime_df_copy['rating'].mean())


In [18]:
# 2. Normalize text data
# Strip leading/trailing whitespace from text columns
anime_df_copy['name'] = anime_df_copy['name'].str.strip()
anime_df_copy['genre'] = anime_df_copy['genre'].str.strip()
anime_df_copy['type'] = anime_df_copy['type'].str.strip()

In [19]:
# 3. Convert data types
# Convert 'episodes' column to numeric, coerce errors to handle non-numeric values
anime_df_copy['episodes'] = pd.to_numeric(anime_df_copy['episodes'], errors='coerce')

In [20]:
# Convert 'rating' and 'members' columns to numeric
anime_df_copy['rating'] = pd.to_numeric(anime_df_copy['rating'], errors='coerce')
anime_df_copy['members'] = pd.to_numeric(anime_df_copy['members'], errors='coerce')


In [21]:
# 4. Remove duplicates
anime_df_cleaned = anime_df_copy.drop_duplicates()

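In [36]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode 'type' and the full 'genre' string with separate encoders,
# so each keeps its own fitted categories.
# Note: sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False).
type_encoder = OneHotEncoder(sparse_output=False)
genre_encoder = OneHotEncoder(sparse_output=False)
type_encoded = type_encoder.fit_transform(anime_df_cleaned[['type']])
genre_encoded = genre_encoder.fit_transform(anime_df_cleaned[['genre']])

# Convert to DataFrames that share the index of anime_df_cleaned, then concatenate.
# get_feature_names_out() prefixes columns (e.g. 'type_TV'), which avoids name clashes.
type_df = pd.DataFrame(type_encoded, columns=type_encoder.get_feature_names_out(), index=anime_df_cleaned.index)
genre_df = pd.DataFrame(genre_encoded, columns=genre_encoder.get_feature_names_out(), index=anime_df_cleaned.index)
anime_df_encoded = pd.concat([anime_df_cleaned, type_df, genre_df], axis=1)

# Drop the original 'type' and 'genre' columns from the encoded frame only;
# anime_df_cleaned keeps them for the feature-extraction step that follows
anime_df_encoded = anime_df_encoded.drop(['type', 'genre'], axis=1)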

In [None]:
# 5. Feature extraction
# Split 'genre' into separate columns (this will create a new column for each unique genre)
genre_columns = anime_df_cleaned['genre'].str.get_dummies(sep=', ')

In [None]:
# Combine the original dataframe with the new genre columns
anime_df_cleaned = pd.concat([anime_df_cleaned, genre_columns], axis=1)

# Display the cleaned dataframe
anime_df_cleaned.head()
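
The two genre encodings used above behave differently. One-hot encoding the whole genre string yields one column per distinct combination of genres, while str.get_dummies(sep=', ') yields one column per individual genre. The sketch below shows the difference on a made-up four-row frame.

```python
import pandas as pd

toy = pd.DataFrame({"genre": ["Action, Comedy", "Comedy", "Action, Comedy", "Drama"]})

# One column per distinct *string*: "Action, Comedy" is a single category
combo_dummies = pd.get_dummies(toy["genre"])

# One column per individual genre: a row can carry several 1s
multi_hot = toy["genre"].str.get_dummies(sep=", ")
```

The multi-hot form is usually the more useful one for comparing titles, because two titles that share some but not all genres still overlap in their columns.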

For unsupervised learning, the cleaning steps should ensure the data is well-prepared for clustering or dimensionality reduction techniques. Here are some additional considerations for unsupervised learning:

- Normalization: Scale the numeric features so they have a similar range, which can be important for distance-based algorithms.
- Feature selection/engineering: Ensure the features used are meaningful and capture the variability in the data.
- Dimensionality reduction: Consider using techniques like PCA to reduce the number of features if necessary.
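
The normalization and dimensionality-reduction points can be sketched together as follows. The feature matrix is synthetic, and the choice of two components is an assumption for illustration rather than a tuned setting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 200 rows, 3 columns on very different scales.
# The third column is a noisy multiple of the first, so the data is close to 2-dimensional.
rng = np.random.default_rng(0)
episodes = rng.normal(12, 5, size=200)
members = rng.normal(20000, 8000, size=200)
X = np.column_stack([episodes, members, 3 * episodes + rng.normal(0, 0.1, size=200)])

# Standardize each column to zero mean and unit variance, then project onto 2 components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance retained by the two components
explained = pca.explained_variance_ratio_.sum()
```

Without the scaling step, the members column would dominate the principal components simply because its values are numerically larger.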

In [None]:
from sklearn.preprocessing import StandardScaler
# Drop the original 'genre' column as it's now encoded
anime_df_cleaned = anime_df_cleaned.drop(columns=['genre'])

# 6. Normalization
# Scale numeric features
scaler = StandardScaler()
numeric_features = ['episodes', 'rating', 'members']
anime_df_cleaned[numeric_features] = scaler.fit_transform(anime_df_cleaned[numeric_features])

# Display the cleaned dataframe
anime_df_cleaned.head()

Unnamed: 0,anime_id,name,type,episodes,rating,members,Action,Adventure,Cars,Comedy,...,Shounen Ai,Slice of Life,Space,Sports,Super Power,Supernatural,Thriller,Vampire,Yaoi,Yuri
0,32281,Kimi no Na wa.,Movie,-0.243905,2.824474,3.292044,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,5114,Fullmetal Alchemist: Brotherhood,TV,1.093813,2.717032,14.00241,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,28977,Gintama°,TV,0.817776,2.707265,1.732216,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,9253,Steins;Gate,TV,0.244468,2.629126,11.833499,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,9969,Gintama&#039;,TV,0.817776,2.619358,2.400518,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0


***Train & Test Dataset***

In an unsupervised learning project, keeping separate training and test datasets matters for two reasons:

**Model Training and Evaluation**

- Training Data: Used to train the model, which learns patterns and relationships from it.
- Test Data: Used to evaluate the model's performance on unseen data, showing how well it generalizes to new examples.

**Avoiding Overfitting**

- Training on one dataset and testing on another checks that the model is not just memorizing the training data (overfitting) but is learning patterns that generalize.
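
Because test.csv carries no ratings, any local estimate of generalization has to come from holding out part of the labelled data. The sketch below splits a toy ratings frame with train_test_split and scores a global-mean baseline with RMSE. The 80/20 split, the baseline, and the metric are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings in the same three-column layout as train.csv (values are made up)
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "anime_id": [20, 30, 20, 40, 30, 40, 20, 30, 40, 50],
    "rating":   [8, 9, 7, 10, 6, 8, 9, 7, 8, 10],
})

# Hold out 20% of the labelled rows for validation
fit_part, valid_part = train_test_split(ratings, test_size=0.2, random_state=42)

# Baseline: predict the mean rating of the fitting portion for every held-out row
global_mean = fit_part["rating"].mean()
rmse = np.sqrt(((valid_part["rating"] - global_mean) ** 2).mean())
```

Any model trained later should beat this baseline on the held-out rows before it is used to predict ratings for test.csv.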


**Train Dataset**

In [23]:
# Clean and preprocess the train_df dataframe
# Check for missing values
missing_values_train = train_df_copy.isnull().sum()
print("Missing values in each column:\n", missing_values_train)


Missing values in each column:
 user_id     0
anime_id    0
rating      0
dtype: int64


In [25]:
# Drop rows with missing values (working from the copy, not the original)
train_df_cleaned = train_df_copy.dropna()

In [26]:
# Convert 'user_id', 'anime_id', and 'rating' to numeric
train_df_cleaned['user_id'] = pd.to_numeric(train_df_cleaned['user_id'], errors='coerce')
train_df_cleaned['anime_id'] = pd.to_numeric(train_df_cleaned['anime_id'], errors='coerce')
train_df_cleaned['rating'] = pd.to_numeric(train_df_cleaned['rating'], errors='coerce')

In [27]:
# Remove duplicates
train_df_cleaned = train_df_cleaned.drop_duplicates()
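
Note that drop_duplicates() without arguments removes only rows that match in every column. If the same user rated the same title more than once with different ratings, those rows remain. A minimal sketch of how to check for this is shown below, using made-up data. Keeping the last rating per pair is one possible policy, not the one the project necessarily uses.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2],
    "anime_id": [20, 20, 20, 20],
    "rating":   [8, 8, 9, 7],
})

# Exact duplicates: every column matches an earlier row
exact_dupes = toy_ratings.duplicated().sum()

# Repeated (user, title) pairs, regardless of the rating given
pair_dupes = toy_ratings.duplicated(subset=["user_id", "anime_id"]).sum()

# Keep the last rating per pair (averaging the ratings is another option)
deduped = toy_ratings.drop_duplicates(subset=["user_id", "anime_id"], keep="last")
```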

In [29]:
# Display the cleaned ratings dataframe
train_df_cleaned.head()
train_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016656 entries, 0 to 1016655
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   user_id   1016656 non-null  int64
 1   anime_id  1016656 non-null  int64
 2   rating    1016656 non-null  int64
dtypes: int64(3)
memory usage: 23.3 MB


**Test Dataset**

In [31]:
# Clean and preprocess the test_df dataframe
# Check for missing values
missing_values_test = test_df_copy.isnull().sum()
print("Missing values in each column:\n", missing_values_test)


Missing values in each column:
 user_id     0
anime_id    0
dtype: int64


In [33]:
# Remove duplicates
test_df_cleaned = test_df_copy.drop_duplicates()

In [34]:
# Convert 'user_id' and 'anime_id' to numeric (the test set has no 'rating' column)
test_df_cleaned['user_id'] = pd.to_numeric(test_df_cleaned['user_id'], errors='coerce')
test_df_cleaned['anime_id'] = pd.to_numeric(test_df_cleaned['anime_id'], errors='coerce')


In [35]:
# Display basic information about the cleaned test dataframe
test_df_cleaned.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633686 entries, 0 to 633685
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   633686 non-null  int64
 1   anime_id  633686 non-null  int64
dtypes: int64(2)
memory usage: 9.7 MB


***Cleaned Datasets***

In [None]:

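# Only the last expression in a cell is rendered, so display both frames explicitly
display(train_df_cleaned)
display(test_df_cleaned)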

### **Exploratory Data Analysis**

### **Unsupervised Learning Models**

### **Conclusions and Insights**