<div style="font-size: 34px">
<font color='blue'> <b>Anime Recommender System Project 2024 Competition</b></font>
</div>

<a id="cont"></a>

## CONTENTS
* <b>[1. Project Overview](#chapter1)</b>
    * [1.1. Aim](#sub_section_1_1_2)
    * [1.2. Objectives](#section_1_1_3)
* <b>[2. Importing Packages](#chapter2)</b>
* <b>[3. Data Preprocessing](#chapter3)</b>
* <b>[4. Data Cleaning](#chapter4)</b>
* <b>[5. Exploratory Data Analysis](#chapter5)</b>
* <b>[6. Modeling](#chapter7)</b>
* <b>[7. Conclusions](#chapter8)</b>
* <b>[8. Recommendations](#chapter9)</b>

### 1. **Project Overview**

### 2. **Importing Packages**

In [1]:
import re
import pickle
from html import unescape

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# `surprise` is installed from PyPI as scikit-surprise (pip install scikit-surprise)
from surprise import SVD, KNNBasic, BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate
import mlflow
import mlflow.sklearn

ModuleNotFoundError: No module named 'surprise'
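
The error above means the `surprise` import name could not be resolved in this environment. A minimal sketch of a pre-flight check is shown below; the `PACKAGES` mapping and the `missing_packages` helper are illustrative names, not part of the original notebook. The only library-specific fact it relies on is that `surprise` is published on PyPI as `scikit-surprise`.

```python
import importlib.util

# Map import names to the PyPI packages that provide them.
# "surprise" is distributed on PyPI as "scikit-surprise".
PACKAGES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "surprise": "scikit-surprise",
    "mlflow": "mlflow",
}

def missing_packages(packages):
    """Return the PyPI names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in packages.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages(PACKAGES)
if missing:
    print("Install before running the notebook: pip install " + " ".join(missing))
```

Running this cell before the imports lists everything that still needs installing in one go, rather than failing on the first missing module.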

### 3. **Data Preprocessing** 


#### 3.1. Data Overview <a class="anchor" id="dataset"></a>

This dataset contains information on anime content (movies, television series, music, specials, OVA, and ONA*), split between a file describing the titles (anime.csv) and a file of user ratings for those titles (training.csv). Rating predictions are generated for the entries in test.csv and submitted for grading. The submissions.csv file illustrates the expected submission format.


#### 3.2. Data Loading <a class="anchor" id="data preprocessing"></a>



In [None]:
# Reading the anime.csv file into a DataFrame
anime_df = pd.read_csv('anime.csv')
print(anime_df.head(3))

In [None]:
# Reading the train.zip (csv) file into a DataFrame
train_df = pd.read_csv('train.zip')
print(train_df.head(3))

In [None]:
# Reading the test.csv file into a DataFrame
test_df = pd.read_csv('test.csv')
print(test_df.head(3))

#### 3.3. Exploring the Dataset

In this section, we explore the dataset to understand its structure and the information it contains. The code snippets below provide an overview of the dataset's shape, the number of columns, and metadata information about the dataset.

In [None]:
# shape of the dataset
print(anime_df.shape)

# information about metadata
anime_df.info()

In [None]:
# shape of the dataset
print(train_df.shape)

# information about metadata
train_df.info()

In [None]:
# shape of the dataset
print(test_df.shape)

# information about metadata
test_df.info()
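
The three cells above repeat the same inspection for each DataFrame. As a possible compact alternative, the sketch below gathers dtype, missing count, and distinct-value count per column into one table. The `summarize` helper and the toy frame are assumptions made for illustration; the values are made up and do not come from the competition files.

```python
import pandas as pd

def summarize(df):
    """Return one row per column: dtype, missing count, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
    })

# Toy frame with made-up values, standing in for anime_df
toy = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "genre": ["Action, Comedy", None, "Drama"],
    "rating": [8.5, 7.0, None],
})
print(summarize(toy))
```

The same call could then be applied to `anime_df`, `train_df`, and `test_df` in turn.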

### 4. **Data Cleaning**

#### 4.1. Missing values <a class="anchor" id="dataset"></a>

In [None]:
# Check for missing values
print("Missing values in anime_df:")
print(anime_df.isnull().sum())
print("\nMissing values in train_df:")
print(train_df.isnull().sum())
print("\nMissing values in test_df:")
print(test_df.isnull().sum())

#### 4.2. Drop Duplicates <a class="anchor" id="dataset"></a>

In [None]:
# Remove duplicates if any
anime_df.drop_duplicates(inplace=True)
train_df.drop_duplicates(inplace=True)
test_df.drop_duplicates(inplace=True)
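
One side effect of dropping rows is worth keeping in mind: `drop_duplicates` keeps the original index labels, so the index can have gaps afterwards. The sketch below uses a made-up three-row frame to show this, and how `reset_index(drop=True)` restores contiguous labels. This matters wherever a label is later used as a row position.

```python
import pandas as pd

# Made-up rows; the second row duplicates the first
toy = pd.DataFrame({"user_id": [1, 1, 2], "anime_id": [10, 10, 20]})

gapped = toy.drop_duplicates()
print(gapped.index.tolist())       # labels keep a gap where the duplicate was

contiguous = gapped.reset_index(drop=True)
print(contiguous.index.tolist())   # labels run 0..n-1 again
```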

#### 4.3. Cleaning anime_df <a class="anchor" id="dataset"></a>

In [None]:
# Define a function to clean anime_df
def clean_anime_df(df):
    # Ensure anime_id is unique and non-null
    assert df['anime_id'].is_unique, "anime_id column has duplicate values."
    assert df['anime_id'].notnull().all(), "anime_id column has null values."
    df['anime_id'] = df['anime_id'].astype(int)

    # Function to clean names and unescape HTML entities
    def clean_name(name):
        name = unescape(name)  # Convert HTML entities to characters
        name = name.lower().strip()  # Convert to lowercase and strip whitespace
        # Replace specific known HTML entities and problematic characters
        name = name.replace("&#039;", "'").replace("°", "")
        # Remove any other unwanted special characters but allow meaningful ones
        name = re.sub(r'[^a-zA-Z0-9\s\.\,\-\&\:\;\']', '', name)
        return name

    # Apply the cleaning function to the 'name' column
    df['name'] = df['name'].apply(clean_name)

    # Handle missing values in 'genre' and split into lists
    df['genre'] = df['genre'].fillna('')
    df['genre'] = df['genre'].apply(lambda x: x.split(', '))

    # Function to standardize genre lists
    def standardize_genres(genres):
        return [genre.strip().lower() for genre in genres]

    # Apply the function to the 'genre' column
    df['genre'] = df['genre'].apply(standardize_genres)

    # Standardize the 'type' column
    df['type'] = df['type'].str.lower().str.strip().fillna('unknown')

    # Ensure episodes is numeric and handle missing values
    df['episodes'] = pd.to_numeric(df['episodes'], errors='coerce').fillna(-1).astype(int)

    # Ensure rating is numeric (should remain float) and handle missing values
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce').fillna(-1.0)

    # Ensure members is numeric and handle missing values
    df['members'] = pd.to_numeric(df['members'], errors='coerce').fillna(0).astype(int)

    return df

In [None]:
# Clean the dataframes
anime_df = clean_anime_df(anime_df)
# Example of converting genre to a string for TF-IDF
anime_df['genre_str'] = anime_df['genre'].apply(lambda x: ' '.join(x))

# Encoding type variable
anime_df['type_encoded'] = anime_df['type'].astype('category').cat.codes

# Display the first few rows of the cleaned dataframes for verification
print(f'Cleaned anime_df:\n{anime_df.head()}')
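
The cleaning function fills unparseable or missing `episodes` and `rating` values with the sentinel -1. The sketch below shows the same `pd.to_numeric(..., errors='coerce')` pattern on a made-up series, and why the sentinel should be filtered out before computing averages. The input strings are illustrative only.

```python
import pandas as pd

# Made-up raw values: two valid counts, one non-numeric string, one missing entry
raw = pd.Series(["12", "Unknown", None, "1"])

# Non-numeric and missing entries become NaN, then the sentinel -1
episodes = pd.to_numeric(raw, errors="coerce").fillna(-1).astype(int)
print(episodes.tolist())

# Averages should skip the sentinel, otherwise it drags the mean down
naive_mean = episodes.mean()
valid_mean = episodes[episodes >= 0].mean()
print(naive_mean, valid_mean)
```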

#### 4.4. Cleaning train data <a class="anchor" id="dataset"></a>

In [None]:
# Define function to clean train_df
def clean_train_df(df):
    # Work on a copy so the column assignments below do not hit a slice of the caller's frame
    df = df.dropna(subset=['user_id', 'anime_id']).copy()

    # Ensure user_id, anime_id and rating are numeric; unparseable values become NaN
    df['user_id'] = pd.to_numeric(df['user_id'], errors='coerce')
    df['anime_id'] = pd.to_numeric(df['anime_id'], errors='coerce')
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

    # Remove rows with values that could not be converted
    df = df.dropna(subset=['user_id', 'anime_id', 'rating'])

    # Keep only ratings in the valid 1-10 range; this also drops rows where rating is -1
    df = df[(df['rating'] >= 1) & (df['rating'] <= 10)].copy()
    df['rating'] = df['rating'].astype(int)  # Ensure rating column is of integer type

    return df


In [None]:
# Clean the dataframes
train_df = clean_train_df(train_df)

# Display the first few rows of the cleaned dataframes for verification
print(f'Cleaned train_df:\n{train_df.head()}')

#### 4.5. Cleaning test data <a class="anchor" id="dataset"></a>

In [None]:
# Define function to clean test_df
def clean_test_df(df):
    # Work on a copy so the column assignments below do not hit a slice of the caller's frame
    df = df.dropna(subset=['user_id', 'anime_id']).copy()

    # Ensure user_id and anime_id are numeric; unparseable values become NaN
    df['user_id'] = pd.to_numeric(df['user_id'], errors='coerce')
    df['anime_id'] = pd.to_numeric(df['anime_id'], errors='coerce')

    # Remove rows with values that could not be converted
    df = df.dropna(subset=['user_id', 'anime_id'])

    return df


In [None]:
# Clean the dataframes
test_df = clean_test_df(test_df)

# Display the first few rows of the cleaned dataframes for verification
print(f'Cleaned test_df:\n{test_df.head()}')

## **6. Models**



### **6.1 Collaborative Filtering Model (SVD)**


In [None]:
# Use 50% of the training data
train_df = train_df.sample(frac=0.5, random_state=42)

# genre_str and type_encoded were already added to anime_df in section 4.3

# Data splitting
from sklearn.model_selection import train_test_split
train_set, validation_set = train_test_split(train_df, test_size=0.2, random_state=42)


In [None]:
from surprise import Dataset, Reader, SVD, NMF, accuracy
# Alias Surprise's splitter so it does not shadow scikit-learn's train_test_split
from surprise.model_selection import cross_validate, train_test_split as surprise_train_test_split
import mlflow
import mlflow.sklearn

# Prepare the data for Surprise
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(train_set[['user_id', 'anime_id', 'rating']], reader)

# Split the dataset into train and test sets
trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

# Initialize models
svd = SVD()
nmf = NMF()

# Function to train, evaluate, and log models using MLflow
def train_evaluate_log(model, model_name):
    with mlflow.start_run(run_name=model_name):
        # Cross-validate first: cross_validate refits the model on every fold
        cv_results = cross_validate(model, data, measures=['RMSE'], cv=5, verbose=True)
        mean_rmse = np.mean(cv_results['test_rmse'])

        # Train on the training split only, so the held-out test set stays unseen
        model.fit(trainset)

        # Evaluate the model on the test set
        predictions = model.test(testset)
        test_rmse = accuracy.rmse(predictions, verbose=True)

        # Log parameters and metrics
        mlflow.log_param("model_name", model_name)
        mlflow.log_metric("cv_rmse", mean_rmse)
        mlflow.log_metric("test_rmse", test_rmse)

        # Log the model
        mlflow.sklearn.log_model(model, model_name)

        print(f'{model_name} Model logged with CV RMSE: {mean_rmse} and Test RMSE: {test_rmse}')

# Train, evaluate, and log SVD and NMF models
train_evaluate_log(svd, "SVD")
train_evaluate_log(nmf, "NMF")


In [None]:

# svd has the best performance (based on MLflow tracking results)
best_model = svd

# Save the best model as a pickle file
with open('best_model.pkl', 'wb') as file:
    pickle.dump(best_model, file)

print("Best model has been pickled successfully.")
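
For completeness, a pickled model is restored with `pickle.load`. The sketch below round-trips a stand-in dictionary through a temporary file, since loading the real `best_model.pkl` requires the `surprise` package to be importable. The stand-in object and the temporary path are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; the real object would be the SVD instance
stand_in_model = {"algo": "SVD", "note": "placeholder object"}

path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")

# Write the object to disk in binary mode
with open(path, "wb") as file:
    pickle.dump(stand_in_model, file)

# Read it back; only unpickle files from a trusted source
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == stand_in_model)
```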

### **6.2 Content-Based Filtering Model**

* **Content-Based Filtering using TF-IDF and Cosine Similarity:**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit TF-IDF vectorizer on genres
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(anime_df['genre_str'])

# Calculate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

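# Function to get content-based recommendations
def get_content_based_recommendations(anime_id, cosine_sim=cosine_sim):
    # Rows of cosine_sim are positional, so look up the row position rather than the index label
    matches = np.flatnonzero(anime_df['anime_id'].to_numpy() == anime_id)
    if len(matches) == 0:
        raise KeyError(f"anime_id {anime_id} not found in anime_df")
    idx = matches[0]
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Exclude the title itself explicitly; with ties at similarity 1.0 it is not always first
    anime_indices = [i for i, _ in sim_scores if i != idx][:10]  # Get top 10 recommendations
    return anime_df.iloc[anime_indices]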

# Example of getting content-based recommendations
recommendations = get_content_based_recommendations(anime_id=1)
print(recommendations[['anime_id', 'name', 'genre_str']])
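
To make the ranking step concrete, the sketch below computes cosine similarity by hand on four made-up titles. It is a simplification: it uses binary genre vectors instead of the TF-IDF weights used above, and the titles, genres, and the `most_similar` helper are all hypothetical.

```python
import numpy as np

# Made-up titles and genres, for illustration only
genres = {
    "title_a": ["action", "comedy"],
    "title_b": ["action", "comedy"],
    "title_c": ["action", "drama"],
    "title_d": ["romance"],
}
titles = list(genres)
vocab = sorted({g for gs in genres.values() for g in gs})

# Binary bag-of-genres matrix: one row per title, one column per genre
X = np.array([[1.0 if g in genres[t] else 0.0 for g in vocab] for t in titles])

# Cosine similarity is the dot product of L2-normalised rows
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X_norm @ X_norm.T

def most_similar(title, k=2):
    """Return the k titles most similar to `title`, excluding the title itself."""
    i = titles.index(title)
    order = [j for j in np.argsort(-sim[i], kind="stable") if j != i]
    return [titles[j] for j in order[:k]]

print(most_similar("title_a"))
```

Titles that share every genre score 1.0, titles that share none score 0.0, which is why the query title is excluded by position rather than by assuming it sorts first.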


### **6.3 Hybrid Model**
* **Combining SVD predictions and content-based similarity scores:**

In [None]:
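def hybrid_recommendation(user_id, anime_id, svd_model, cosine_sim=cosine_sim, threshold=5):
    svd_prediction = svd_model.predict(user_id, anime_id).est
    if svd_prediction >= threshold:
        return svd_prediction
    content_based_recommendations = get_content_based_recommendations(anime_id, cosine_sim)
    # Missing ratings were filled with -1.0 during cleaning; exclude them from the average
    ratings = content_based_recommendations['rating']
    ratings = ratings[ratings >= 0]
    # Return the average rating of the top content-based recommendations,
    # or the SVD estimate if none of them has a usable rating
    return ratings.mean() if not ratings.empty else svd_prediction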

# Example of hybrid recommendation
hybrid_rating = hybrid_recommendation(user_id=1, anime_id=1, svd_model=svd)
print(f"Hybrid Rating: {hybrid_rating}")
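
The decision rule itself can be isolated from the models. The sketch below restates it as a pure function over a collaborative estimate and a list of neighbour ratings; `hybrid_score` and its inputs are hypothetical names, and the handling of -1 sentinels and empty neighbour lists is an assumption about sensible behaviour rather than something the notebook specifies.

```python
def hybrid_score(cf_estimate, neighbour_ratings, threshold=5.0):
    """Keep the collaborative estimate if it clears the threshold; otherwise
    fall back to the mean rating of the content-based neighbours."""
    if cf_estimate >= threshold:
        return cf_estimate
    # Skip -1 sentinels that mark missing ratings
    valid = [r for r in neighbour_ratings if r >= 0]
    if not valid:
        # Nothing usable to average, so keep the collaborative estimate
        return cf_estimate
    return sum(valid) / len(valid)

print(hybrid_score(7.2, [6.0, 8.0]))        # estimate clears the threshold
print(hybrid_score(3.0, [6.0, 8.0, -1.0]))  # falls back to the neighbour mean
```

Separating the rule this way makes the threshold easy to test without fitting any model.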


### **6.4 Submission Preparation**
* **Preparing the submission file:**

In [None]:
# Predict ratings for the test set using the hybrid model
predicted_ratings = []
for i, row in test_df.iterrows():
    user_id = row['user_id']
    anime_id = row['anime_id']
    # best_model is the SVD model selected in section 6.1
    hybrid_rating = hybrid_recommendation(user_id, anime_id, svd_model=best_model)
    predicted_ratings.append(hybrid_rating)

# Create the submission DataFrame
submission_df = test_df.copy()
# Assign by position: a default-indexed Series would be aligned on index labels and can leave NaNs
submission_df['rating'] = np.asarray(predicted_ratings, dtype=float)
submission_df['ID'] = submission_df.apply(lambda x: f"{int(x['user_id'])}_{int(x['anime_id'])}", axis=1)
submission_df = submission_df[['ID', 'rating']]

# Save the submission file
submission_df.to_csv('submission_hybrid.csv', index=False, float_format='%.5f')
print(f'Submission file created successfully.')
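
One detail worth checking when attaching predictions to a frame: pandas aligns a Series on index labels, whereas a plain list is assigned by position. The sketch below uses a made-up test frame whose index has gaps (as happens after rows are dropped) to show the difference, and builds the `ID` column with vectorised string operations. All values are illustrative.

```python
import pandas as pd

# Toy test frame whose index has gaps, as happens after dropping rows
toy_test = pd.DataFrame(
    {"user_id": [1, 1, 2], "anime_id": [20, 30, 20]},
    index=[0, 2, 5],
)
preds = [7.5, 6.0, 8.25]  # made-up predictions, one per row, in row order

# Assigning a list matches rows by position
submission = toy_test.copy()
submission["rating"] = preds
submission["ID"] = (
    submission["user_id"].astype(int).astype(str)
    + "_"
    + submission["anime_id"].astype(int).astype(str)
)
submission = submission[["ID", "rating"]]
print(submission)

# Assigning a default-indexed Series matches rows by label instead
aligned_by_label = toy_test.assign(rating=pd.Series(preds))
print(aligned_by_label)
```

In the second frame the row labelled 5 has no matching label in the Series and ends up as NaN, and the row labelled 2 receives the third prediction rather than the second.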


### SVD Hyperparameter Tuning


In [None]:

import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle

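# Use 50% of the training data. Sample before building the Surprise dataset,
# otherwise the subsample has no effect on the data the model is trained on.
# Note: train_df was already subsampled in section 6.1 if that cell has been run.
train_df = train_df.sample(frac=0.5, random_state=42)

# Prepare the data for Surprise
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(train_df[['user_id', 'anime_id', 'rating']], reader)

# Split the dataset into train and test sets
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)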

# Parameter grid for GridSearchCV
param_grid_svd = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1]
}

# Initialize MLflow experiment
mlflow.set_experiment("SVD Hyperparameter Tuning")

with mlflow.start_run(run_name="GridSearch_SVD"):
    # GridSearchCV for SVD
    gs_svd = GridSearchCV(SVD, param_grid_svd, measures=['rmse'], cv=3, n_jobs=-1, joblib_verbose=2)
    gs_svd.fit(data)

    # Log best parameters and RMSE
    mlflow.log_params(gs_svd.best_params['rmse'])
    mlflow.log_metric("Best RMSE", gs_svd.best_score['rmse'])

    print("Best SVD parameters (GridSearchCV): ", gs_svd.best_params['rmse'])
    print("Best SVD RMSE (GridSearchCV): ", gs_svd.best_score['rmse'])

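    # Evaluate the best configuration with cross-validation first, because
    # cross_validate refits the model on every fold
    best_svd = SVD(**gs_svd.best_params['rmse'])
    cv_results = cross_validate(best_svd, data, measures=['RMSE'], cv=3, verbose=True)

    # Then train the best SVD model on the training split only,
    # so the held-out test set stays unseen
    best_svd.fit(trainset)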

    # Log cross-validation results
    for key, values in cv_results.items():
        mlflow.log_metric(key, np.mean(values))

    # Evaluate the best SVD model on the test set
    predictions = best_svd.test(testset)
    rmse = accuracy.rmse(predictions)
    mlflow.log_metric("Test RMSE", rmse)
    print(f'Best SVD Test RMSE: {rmse}')

    # Save the best model as a pickled file
    model_filename = "best_svd_model.pkl"
    with open(model_filename, 'wb') as f:
        pickle.dump(best_svd, f)
    mlflow.log_artifact(model_filename)

    # Predict ratings for each entry in test_df using the trained best SVD model
    predicted_ratings = []
    for i, row in test_df.iterrows():
        user_id = row['user_id']
        anime_id = row['anime_id']
        
        # Ensure the prediction is within the valid range
        try:
            pred = best_svd.predict(user_id, anime_id).est
            pred = max(1.0, min(pred, 10.0))  # Ensure the prediction is within the rating scale
        except ValueError as e:
            print(f"Error predicting for user_id: {user_id} and anime_id: {anime_id} - {e}")
            pred = np.mean(train_df['rating'])  # Use the mean rating as a default value

        predicted_ratings.append(pred)

    # Create a copy of test_df to hold the results
    results_df = test_df.copy()

    # Add predictions to the test dataframe. Assign by position: a default-indexed
    # Series would be aligned on index labels and could introduce NaNs.
    results_df['rating'] = np.asarray(predicted_ratings, dtype=float)

    # Ensure no NaN values in predictions and update the column directly
    results_df['rating'] = results_df['rating'].fillna(np.mean(train_df['rating']))

    # Create the ID column by concatenating user_id and anime_id
    results_df['ID'] = results_df.apply(lambda x: f"{int(x['user_id'])}_{int(x['anime_id'])}", axis=1)

    # Prepare the submission file
    submission_df = results_df[['ID', 'rating']]

    # Debugging: Ensure predictions and format look correct
    print(f"Sample predictions with ID and rating for SVD:")
    print(submission_df.head())

    # Save the submission file ensuring float format
    submission_file = "submission_svd.csv"
    submission_df.to_csv(submission_file, index=False, float_format='%.5f')
    print(f'Submission file {submission_file} created successfully.')

    # Log the submission file
    mlflow.log_artifact(submission_file)

mlflow.end_run()
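
The cost of the grid search is easy to underestimate. The sketch below counts the parameter combinations in `param_grid_svd` and multiplies by the number of folds used above (`cv=3`); it uses only the standard library and the grid values from the cell above.

```python
from itertools import product

# Same grid as in the tuning cell above
param_grid_svd = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30, 40],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.02, 0.05, 0.1],
}

# Every combination of one value per parameter
n_candidates = len(list(product(*param_grid_svd.values())))
cv_folds = 3
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} model fits")
```

Trimming one value from each list would cut the candidate count from 81 to 16, which may be a reasonable first pass before a finer search.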



In [2]:
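# Prepare data for plotting
# SVD_rmse, baseline_rmse and NMF_rmse must hold the test RMSE of each model;
# they are not assigned anywhere above, so set them from the evaluation results first
models = ['SVD', 'Baseline', 'NMF']
rmse_values = [SVD_rmse, baseline_rmse, NMF_rmse]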

# Create a DataFrame for plotting
results_df = pd.DataFrame({
    'Model': models,
    'RMSE': rmse_values
})

# Plot the RMSE values
plt.figure(figsize=(10, 6))
ax = sns.barplot(x='Model', y='RMSE', data=results_df, palette='coolwarm')

# Set y-axis limit to start from 1 and extend a bit beyond the max RMSE
plt.ylim(1, max(rmse_values) * 1.1)

# Add data labels to the bars
for p in ax.patches:
    height = p.get_height()
    ax.annotate(f'{height:.2f}',
                (p.get_x() + p.get_width() / 2., height),
                ha='center', va='center',
                xytext=(0, 5),  # 5 points vertical offset
                textcoords='offset points')

# Add titles and labels
plt.title('Comparison of RMSE Values for Different Models')
plt.xlabel('Model')
plt.ylabel('RMSE')

# Show the plot
plt.show()


NameError: name 'SVD_rmse' is not defined
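
The plot expects one RMSE value per model. As a reminder of what that number measures, the sketch below computes RMSE by hand over (true rating, predicted rating) pairs. The `rmse` helper and the three pairs are made up for illustration and are not results from the models above.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

# Made-up (true, predicted) pairs, not results from the notebook
toy_pairs = [(8, 7.5), (6, 6.5), (9, 8.0)]
print(round(rmse(toy_pairs), 4))
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.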