# **Anime Recommender System on Markdown Notebook**



## Introduction

This notebook outlines the steps taken to build a collaborative and content-based recommender system for a collection of anime titles. The goal is to predict how a user will rate an anime title they have not yet viewed, based on their historical preferences and the characteristics of the anime.


## Project Motivation and Importance

In the age of digital content, recommender systems have become essential tools for personalized content delivery. These systems enhance user experience by suggesting relevant items based on user behavior and item attributes. For anime enthusiasts, discovering new shows that match their tastes can be a delightful experience, and an efficient recommender system can significantly enhance user satisfaction and engagement. 

Building an anime recommender system can:

1. **Improve User Experience**: By suggesting anime titles that align with users' preferences, the system can help users find content they are likely to enjoy, reducing the time spent searching for new shows.
2. **Increase Engagement**: Personalized recommendations can increase user interaction and retention, as users are more likely to return to the platform that consistently provides content they like.
3. **Drive Business Value**: For platforms hosting anime content, effective recommendation systems can lead to higher user satisfaction, more views, and ultimately, increased revenue.
4. **Learn User Preferences**: By analyzing user ratings and behavior, the system can gain insights into trends and preferences, which can be valuable for content creators and distributors.

## Steps Covered:
1. Data Loading 
2. Preprocessing
3. Exploratory Data Analysis (EDA)
4. Model Training
5. Model Evaluation
6. Conclusion


## 1. Data Loading and Preprocessing

### Importing Necessary Packages


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


### Loading the dataset

In [2]:
# Load the dataset
anime_data = pd.read_csv('anime.csv')
submission_data = pd.read_csv('submission.csv')
test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

In [3]:

# Display the first few rows of the training dataset
train_data.head()

Unnamed: 0,user_id,anime_id,rating
0,1,11617,10
1,1,11757,10
2,1,15451,10
3,2,11771,10
4,3,20,8


### Making Copies of Original Datasets

To minimize the risk of accidental data loss or irreversible changes during cleaning, we create deep copies of the original datasets. This precaution ensures that the original data remains unchanged and can be referred back to if necessary.

In [4]:
train_data = train_data.copy(deep=True)
test_data = test_data.copy(deep=True)
anime_data = anime_data.copy(deep=True)  # copy the anime metadata, not test_data
submission_data_df = submission_data.copy(deep=True)  # copy the submission template, not test_data
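
A deep copy only protects the original if the original stays reachable under its own name. The following minimal sketch illustrates this on a small made-up frame; the names `original` and `working` and the values are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Toy ratings frame standing in for train_data (values are illustrative)
original = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [20, 24, 20],
    "rating": [8, 10, 7],
})

# Keep the copy under a different name so the original remains accessible
working = original.copy(deep=True)

# Simulate an accidental overwrite during cleaning
working.loc[0, "rating"] = -1

print(original.loc[0, "rating"])  # value in the untouched original
print(working.loc[0, "rating"])   # value in the modified copy
```

Assigning the copy back to the same variable name, by contrast, leaves only one reachable object, so a separate name is probably the safer pattern when the raw data may be needed again.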

## 2. Data Cleaning and Preprocessing

### Data inspection

Data inspection gives a summary of the essential information about each dataset: its shape, column names, data types, and non-null counts. This metadata helps us understand the structure, composition, and characteristics of the data before cleaning it.

In [82]:
anime_data.shape

(12294, 7)

In [58]:
# Inspect the anime_data
print("Anime Data:")
print(anime_data.info())

# Inspect the test_data
print("\nTest Data:")
print(test_data.info())

# Inspect the train_data
print("\nTrain Data:")
print(train_data.info())

Anime Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12294 entries, 0 to 12293
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   anime_id  12294 non-null  int64  
 1   name      12294 non-null  object 
 2   genre     12294 non-null  object 
 3   type      12294 non-null  object 
 4   episodes  12294 non-null  object 
 5   rating    12294 non-null  float64
 6   members   12294 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 672.5+ KB
None

Test Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633686 entries, 0 to 633685
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   user_id   633686 non-null  int64
 1   anime_id  633686 non-null  int64
dtypes: int64(2)
memory usage: 9.7 MB
None

Train Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5703555 entries, 0 to 5703554
Data columns (total 3 columns):
 #   Column    Dty

In [57]:
# Get a summary of each dataset (describe must be called with parentheses;
# printing the bare attribute only shows the bound method, not the statistics)
print(anime_data.describe())
print(test_data.describe())
print(train_data.describe())

### Handling Missing Values

Handling missing values means managing data points that are absent or undefined in a dataset. Missing values can arise from data collection errors or from data corruption during storage or transmission.

In [38]:
# Check for missing values in anime_data
print("Missing values in anime_data:")
print(anime_data.isnull().sum())

# Check for missing values in test_data
print("\nMissing values in test_data:")
print(test_data.isnull().sum())

# Check for missing values in train_data
print("\nMissing values in train_data:")
print(train_data.isnull().sum())


Missing values in anime_data:
anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Missing values in test_data:
user_id     0
anime_id    0
dtype: int64

Missing values in train_data:
user_id     0
anime_id    0
dtype: int64


The next cell reports the number of null values in each column of `anime_data`, drops every row that contains a null value, and confirms that none remain.

In [63]:
# Display the number of null values before cleaning
print("Null values count before cleaning:")
print(anime_data.isnull().sum())

# Remove rows with any null values
anime_data_cleaned = anime_data.dropna()

# Display the number of null values after cleaning
print("\nNull values count after cleaning:")
print(anime_data_cleaned.isnull().sum())

anime_data_cleaned.reset_index(drop=True, inplace=True)

# Print the shape of the cleaned dataset
print(f"\nShape of cleaned dataset: {anime_data_cleaned.shape}")


Null values count before cleaning:
anime_id      0
name          0
genre        62
type         25
episodes      0
rating      230
members       0
dtype: int64

Null values count after cleaning:
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Shape of cleaned dataset: (12017, 7)
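
Dropping rows is the simplest option, but it discards every title with even one missing field. As a hedged illustration of the trade-off, the sketch below compares dropping with imputing on a small made-up frame; the column values and the `"Unknown"` placeholder are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing genre and one missing rating (illustrative values)
toy = pd.DataFrame({
    "genre": ["Action", None, "Comedy", "Drama"],
    "rating": [7.5, 6.0, np.nan, 8.5],
})

# Option 1: drop every row that has any missing value
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps instead
imputed = toy.copy()
imputed["rating"] = imputed["rating"].fillna(imputed["rating"].median())
imputed["genre"] = imputed["genre"].fillna("Unknown")

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print("rows after imputation:", len(imputed))
```

Which option is preferable depends on how the column is used later; a title without a rating, for instance, may still be useful for genre-based similarity.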


### Removing Duplicates

Removing duplicates means identifying and eliminating identical or redundant rows within a dataset.

In [13]:
# Check for duplicates in anime_data
print("Duplicates in anime_data:", anime_data.duplicated().sum())

# Check for duplicates in submission_data
print("Duplicates in submission_data:", submission_data.duplicated().sum())

# Check for duplicates in test_data
print("Duplicates in test_data:", test_data.duplicated().sum())

# Check for duplicates in train_data
print("Duplicates in train_data:", train_data.duplicated().sum())


Duplicates in anime_data: 0
Duplicates in submission_data: 0
Duplicates in test_data: 0
Duplicates in train_data: 1


In [18]:
# Remove duplicates in train_data
train_data.drop_duplicates(inplace=True)
print('removed duplicates in train_data')

removed duplicates in train_data
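
`duplicated()` with no arguments only flags rows that match in every column. For ratings data it may also be worth checking for repeated `(user_id, anime_id)` pairs whose ratings differ. The sketch below shows the difference on a made-up frame; keeping the last rating per pair is an assumption made for the example, not a rule taken from the dataset.

```python
import pandas as pd

# Toy ratings: user 1 rated anime 20 three times, once with a different score
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "anime_id": [20, 20, 20, 20],
    "rating": [8, 8, 9, 7],
})

# Rows identical in every column
exact = toy.duplicated().sum()

# Rows that repeat an earlier (user_id, anime_id) pair, whatever the rating
by_key = toy.duplicated(subset=["user_id", "anime_id"]).sum()

# Keep one rating per pair (here: the last one seen)
deduped = toy.drop_duplicates(subset=["user_id", "anime_id"], keep="last")

print("exact duplicates:", exact)
print("duplicate (user, anime) pairs:", by_key)
print(deduped)
```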


### Validation and Testing

Here we verify that every row containing a missing (NaN) value has been removed and that no duplicate rows remain, so that subsequent analyses are based on complete, unique cases.

In [68]:
# Remove rows with any null values
anime_data_cleaned = anime_data.dropna()

# Display the number of null values after cleaning
print("\nNull values count after cleaning:")
print(anime_data_cleaned.isnull().sum())

# Verify duplicates removal after cleaning
print(f"\nDuplicates count after cleaning: {anime_data_cleaned.duplicated().sum()}")



Null values count after cleaning:
anime_id    0
name        0
genre       0
type        0
episodes    0
rating      0
members     0
dtype: int64

Duplicates count after cleaning: 0


### Encode Categorical Variables

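This cell standardizes the column names, fills missing numerical values with the column mean, and converts categorical (object) columns into numerical form using label encoding followed by one-hot encoding.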

In [22]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Load the dataset
anime_data = pd.read_csv('anime.csv')
submission_data = pd.read_csv('submission.csv')
test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Standardize column names for each dataset
def standardize_column_names(df):
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace(r'[^a-zA-Z0-9_]', '', regex=True)
    return df

anime_data = standardize_column_names(anime_data)
submission_data = standardize_column_names(submission_data)
test_data = standardize_column_names(test_data)
train_data = standardize_column_names(train_data)

# Handling missing values: Fill missing values with mean for numerical columns
def fill_missing_values(df):
    for column in df.select_dtypes(include=[np.number]).columns:
        # Assign the result back instead of calling fillna(..., inplace=True) on a
        # column selection, which pandas deprecates because it may act on a copy
        df[column] = df[column].fillna(df[column].mean())
    return df

anime_data = fill_missing_values(anime_data)
submission_data = fill_missing_values(submission_data)
test_data = fill_missing_values(test_data)
train_data = fill_missing_values(train_data)

# Encoding categorical variables using LabelEncoder and OneHotEncoder
def encode_categorical_variables(df):
    categorical_columns = df.select_dtypes(include=['object']).columns
    le = LabelEncoder()
    # scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`
    ohe = OneHotEncoder(sparse_output=False, drop='first')

    for column in categorical_columns:
        # Label Encoding (cast to str so missing values get their own label)
        df[column] = le.fit_transform(df[column].astype(str))
        # One-Hot Encoding
        encoded_columns = ohe.fit_transform(df[[column]])
        encoded_df = pd.DataFrame(encoded_columns, columns=ohe.get_feature_names_out([column]), index=df.index)
        df = df.drop([column], axis=1).join(encoded_df)
    return df


### Preprocessing

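Preprocessing transforms the data so that it is better suited to the modeling approaches used later. In this cell, missing values are imputed (median for numerical columns, mode for categorical columns), categorical columns are label-encoded, and numerical columns are standardized. Note that `numerical_cols` includes `anime_id`, so the identifier is standardized along with the real features; exclude it if the IDs are needed later for joins.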

In [70]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load your dataset
anime_data = pd.read_csv('anime.csv')

# Handling missing values (filling with median for numerical columns and mode for categorical columns)
numerical_cols = anime_data.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = anime_data.select_dtypes(include=['object']).columns

anime_data[numerical_cols] = anime_data[numerical_cols].fillna(anime_data[numerical_cols].median())
anime_data[categorical_cols] = anime_data[categorical_cols].fillna(anime_data[categorical_cols].mode().iloc[0])

# Encoding categorical variables (Label Encoding)
label_encoder = LabelEncoder()
for col in categorical_cols:
    anime_data[col] = label_encoder.fit_transform(anime_data[col])

# Scaling numerical features (Standardization)
scaler = StandardScaler()
anime_data[numerical_cols] = scaler.fit_transform(anime_data[numerical_cols])

# Display the preprocessed dataset
print(anime_data.head())


   anime_id   name  genre  type  episodes    rating    members
0  1.590838   5412   2686     0         0  2.845534   3.330241
1 -0.780825   2848    161     5       147  2.737388  14.148406
2  1.302401   3346    534     5       132  2.727556   1.754713
3 -0.419493  10259   3240     5        84  2.648904  11.957666
4 -0.356987   3337    534     5       132  2.639073   2.429742
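
Label encoding assigns each distinct `genre` string an arbitrary integer, so two titles that share most of their genres can end up with unrelated codes. Because `genre` holds comma-separated tags (for example `Drama, Romance, School`), one alternative is multi-hot encoding, sketched below on a made-up frame. The toy titles are assumptions, and the sketch assumes the separator is a comma followed by a space.

```python
import pandas as pd

# Toy frame mimicking the comma-separated genre column (illustrative titles)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "genre": ["Action, Comedy", "Comedy", "Drama, Romance, School"],
})

# One 0/1 column per individual genre tag
genre_flags = toy["genre"].str.get_dummies(sep=", ")

print(genre_flags.columns.tolist())
print(genre_flags)
```

Each row now marks every genre it belongs to, so overlap between titles is preserved and can be compared directly.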


### Descriptive Statistics

Descriptive statistics summarize and organize the data to describe its basic features, such as central tendency, spread, and the number of distinct values per column.

In [33]:
import pandas as pd

anime_data = pd.read_csv('anime.csv')

# Inspect data types
print("Data Types:")
print(anime_data.dtypes)

# Descriptive statistics for numerical columns
print("\nDescriptive Statistics for Numerical Columns:")
print(anime_data.describe())

# Descriptive statistics for categorical columns
print("\nDescriptive Statistics for Categorical Columns:")
print(anime_data.describe(include=['object']))

# Unique value count for each column
print("\nUnique Value Count for Each Column:")
print(anime_data.nunique())


Data Types:
anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object

Descriptive Statistics for Numerical Columns:
           anime_id        rating       members
count  12294.000000  12064.000000  1.229400e+04
mean   14058.221653      6.473902  1.807134e+04
std    11455.294701      1.026746  5.482068e+04
min        1.000000      1.670000  5.000000e+00
25%     3484.250000      5.880000  2.250000e+02
50%    10260.500000      6.570000  1.550000e+03
75%    24794.500000      7.180000  9.437000e+03
max    34527.000000     10.000000  1.013917e+06

Descriptive Statistics for Categorical Columns:
                           name   genre   type episodes
count                     12294   12232  12269    12294
unique                    12292    3264      6      187
top     Shi Wan Ge Leng Xiaohua  Hentai     TV        1
freq                          2     823   3787     5677

Unique Value Count for Ea

In [32]:
import pandas as pd

# Load the test set under its own name so anime_data is not overwritten
test_data = pd.read_csv('test.csv')

# Inspect data types
print("Data Types:")
print(test_data.dtypes)

# Descriptive statistics for numerical columns
print("\nDescriptive Statistics for Numerical Columns:")
print(test_data.describe())


# Unique value count for each column
print("\nUnique Value Count for Each Column:")
print(test_data.nunique())

print('no descriptive statistics for categorical columns')

Data Types:
user_id     int64
anime_id    int64
dtype: object

Descriptive Statistics for Numerical Columns:
             user_id       anime_id
count  633686.000000  633686.000000
mean    36777.752605    8909.389543
std     21028.330970    8880.430436
min         1.000000       1.000000
25%     18974.000000    1240.000000
50%     36919.000000    6213.000000
75%     54908.000000   14131.000000
max     73516.000000   34367.000000

Unique Value Count for Each Column:
user_id     57053
anime_id     7785
dtype: int64
no descriptive statistics for categorical columns


In [42]:
import pandas as pd

# Load the training set under its own name so anime_data is not overwritten
train_data = pd.read_csv('train.csv')

# Inspect data types
print("Data Types:")
print(train_data.dtypes)

# Descriptive statistics for numerical columns
print("\nDescriptive Statistics for Numerical Columns:")
print(train_data.describe())

# Unique value count for each column
print("\nUnique Value Count for Each Column:")
print(train_data.nunique())

print('no descriptive statistics for categorical columns')


Data Types:
user_id     int64
anime_id    int64
rating      int64
dtype: object

Descriptive Statistics for Numerical Columns:
            user_id      anime_id        rating
count  5.703555e+06  5.703555e+06  5.703555e+06
mean   3.674460e+04  8.902142e+03  7.808691e+00
std    2.101174e+04  8.882174e+03  1.572449e+00
min    1.000000e+00  1.000000e+00  1.000000e+00
25%    1.898500e+04  1.239000e+03  7.000000e+00
50%    3.680200e+04  6.213000e+03  8.000000e+00
75%    5.487300e+04  1.407500e+04  9.000000e+00
max    7.351600e+04  3.447500e+04  1.000000e+01

Unique Value Count for Each Column:
user_id     69481
anime_id     9838
rating         10
dtype: int64
no descriptive statistics for categorical columns
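
The counts reported for the training set give a rough sense of how sparse the user-item rating matrix is. The short calculation below uses only the figures printed above (5,703,555 ratings, 69,481 users, 9,838 titles); the variable names are illustrative.

```python
# Figures taken from the training-set summary
n_ratings = 5_703_555
n_users = 69_481
n_items = 9_838

# Fraction of all possible (user, anime) pairs that actually have a rating
possible_pairs = n_users * n_items
density = n_ratings / possible_pairs

print(f"Possible pairs: {possible_pairs:,}")
print(f"Density: {density:.4%}")
print(f"Average ratings per user: {n_ratings / n_users:.1f}")
```

Fewer than one in a hundred possible pairs carries a rating, which is typical for recommendation data and is worth keeping in mind when choosing a collaborative filtering method.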


## 3. EDA

## 4. Model Training
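
This section is not yet filled in. As a placeholder, here is a minimal sketch of what a content-based component could look like, using the `TfidfVectorizer` and `linear_kernel` imports from the first cell. It runs on a made-up frame: the titles, genres, and the helper `most_similar` are hypothetical, and this is not the notebook's trained model.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue standing in for anime_data (hypothetical titles and genres)
toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Gamma", "Delta"],
    "genre": [
        "Action, Adventure",
        "Action, Adventure, Fantasy",
        "Romance, School",
        "Romance, Drama",
    ],
})

# TF-IDF vectors over the genre text; rows are L2-normalized by default,
# so the linear kernel between them equals cosine similarity
tfidf = TfidfVectorizer()
genre_matrix = tfidf.fit_transform(toy["genre"])
similarity = linear_kernel(genre_matrix, genre_matrix)

def most_similar(title, top_n=2):
    """Return the top_n titles whose genres are most similar to `title`."""
    idx = toy.index[toy["name"] == title][0]
    ranked = similarity[idx].argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:top_n]
    return toy["name"].iloc[ranked].tolist()

print(most_similar("Alpha", top_n=1))
print(most_similar("Gamma", top_n=1))
```

A sketch like this only ranks similar titles. Predicting a rating for a given user, as the introduction describes, would additionally need the user's rating history from `train_data`.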

## 5. Model Evaluation


## 6. Conclusion