# Hybrid Anime Recommendation System

This notebook implements a hybrid recommendation engine (content + collaborative filtering) for anime.  
We also integrate **MLflow** to track experiments, metrics, and models for better reproducibility and MLOps practices.


In [1]:
import pandas as pd 
import numpy as np
import mlflow
import mlflow.keras
import sqlite3




In [4]:
mlflow.set_tracking_uri("http://127.0.0.1:8080")

In [7]:
model_name = "hybrid_anime_recommendation"
model_version_alias = "champion"

# Get the model version using a model URI
model_uri = f"models:/{model_name}@{model_version_alias}"
model = mlflow.pyfunc.load_model(model_uri)

Downloading artifacts: 100%|██████████| 12/12 [00:17<00:00,  1.46s/it] 


In [5]:
import mlflow.pyfunc

model_name = "hybrid_anime_recommendation"
model_version = 3

model = mlflow.pyfunc.load_model(
    model_uri=f"models:/{model_name}/{model_version}"
)


  from .autonotebook import tqdm as notebook_tqdm
Downloading artifacts: 100%|██████████| 12/12 [00:19<00:00,  1.64s/it]  






In [6]:
model

mlflow.pyfunc.loaded_model:
  artifact_path: mlflow-artifacts:/994902288046821146/models/m-4b00d2b1eaa044258f78872ee0b4426f/artifacts
  flavor: mlflow.tensorflow
  run_id: a2ad3c4395774a699e7f97044c63eccf

In [2]:
# Create or connect to SQLite DB
conn = sqlite3.connect("../data/anime_recommendation.db")

In [4]:
import pandas as pd 
import numpy as np
import mlflow
import mlflow.keras
import sqlite3
mlflow.set_tracking_uri("http://127.0.0.1:8080")  # logs stored in local file
mlflow.set_experiment("anime_recommendation")

# Create or connect to SQLite DB
conn = sqlite3.connect("../data/anime_recommendation.db")

2025/09/24 23:32:12 INFO mlflow.tracking.fluent: Experiment with name 'anime_recommendation' does not exist. Creating a new experiment.


In [6]:
# Configure Pandas display options for better readability
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)

Load Datasets

In [7]:
df = pd.read_csv('../data/jikan_final.csv')

In [8]:
user = pd.read_csv("../data/userratings.csv")

EDA

In [9]:
df.head(3)

Unnamed: 0,mal_id,url,images,trailer,approved,titles,title,title_english,title_japanese,title_synonyms,type,source,episodes,status,airing,aired,duration,rating,score,scored_by,rank,popularity,members,favorites,synopsis,background,season,year,broadcast,producers,licensors,studios,genres,explicit_genres,themes,demographics
0,1,https://myanimelist.net/anime/1/Cowboy_Bebop,{'jpg': {'image_url': 'https://cdn.myanimelist...,"{'youtube_id': 'gY5nDXOtv_o', 'url': 'https://...",True,"[{'type': 'Default', 'title': 'Cowboy Bebop'},...",Cowboy Bebop,Cowboy Bebop,カウボーイビバップ,[],TV,Original,26.0,Finished Airing,False,"{'from': '1998-04-03T00:00:00+00:00', 'to': '1...",24 min per ep,R - 17+ (violence & profanity),8.75,965324.0,46.0,43,1866337,82455,"Crime is timeless. By the year 2071, humanity ...",When Cowboy Bebop first aired in spring of 199...,spring,1998.0,"{'day': 'Saturdays', 'time': '01:00', 'timezon...","[{'mal_id': 23, 'type': 'anime', 'name': 'Band...","[{'mal_id': 102, 'type': 'anime', 'name': 'Fun...","[{'mal_id': 14, 'type': 'anime', 'name': 'Sunr...","[{'mal_id': 1, 'type': 'anime', 'name': 'Actio...",[],"[{'mal_id': 50, 'type': 'anime', 'name': 'Adul...",[]
1,5,https://myanimelist.net/anime/5/Cowboy_Bebop__...,{'jpg': {'image_url': 'https://cdn.myanimelist...,"{'youtube_id': None, 'url': None, 'embed_url':...",True,"[{'type': 'Default', 'title': 'Cowboy Bebop: T...",Cowboy Bebop: Tengoku no Tobira,Cowboy Bebop: The Movie,カウボーイビバップ 天国の扉,"[""Cowboy Bebop: Knockin' on Heaven's Door""]",Movie,Original,1.0,Finished Airing,False,"{'from': '2001-09-01T00:00:00+00:00', 'to': No...",1 hr 55 min,R - 17+ (violence & profanity),8.38,215590.0,194.0,619,378478,1582,"Another day, another bounty—such is the life o...",,,,"{'day': None, 'time': None, 'timezone': None, ...","[{'mal_id': 14, 'type': 'anime', 'name': 'Sunr...","[{'mal_id': 15, 'type': 'anime', 'name': 'Sony...","[{'mal_id': 4, 'type': 'anime', 'name': 'Bones...","[{'mal_id': 1, 'type': 'anime', 'name': 'Actio...",[],"[{'mal_id': 50, 'type': 'anime', 'name': 'Adul...",[]
2,6,https://myanimelist.net/anime/6/Trigun,{'jpg': {'image_url': 'https://cdn.myanimelist...,"{'youtube_id': 'bJVyIXeUznY', 'url': 'https://...",True,"[{'type': 'Default', 'title': 'Trigun'}, {'typ...",Trigun,Trigun,トライガン,[],TV,Manga,26.0,Finished Airing,False,"{'from': '1998-04-01T00:00:00+00:00', 'to': '1...",24 min per ep,PG-13 - Teens 13 or older,8.22,373517.0,342.0,252,763911,16027,"Vash the Stampede is the man with a $$60,000,0...",The Japanese release by Victor Entertainment h...,spring,1998.0,"{'day': 'Thursdays', 'time': '01:15', 'timezon...","[{'mal_id': 123, 'type': 'anime', 'name': 'Vic...","[{'mal_id': 102, 'type': 'anime', 'name': 'Fun...","[{'mal_id': 11, 'type': 'anime', 'name': 'Madh...","[{'mal_id': 1, 'type': 'anime', 'name': 'Actio...",[],"[{'mal_id': 50, 'type': 'anime', 'name': 'Adul...","[{'mal_id': 27, 'type': 'anime', 'name': 'Shou..."


In [10]:
# Shape of the anime dataset
print("Anime dataset shape:", df.shape)

Anime dataset shape: (26720, 36)


In [11]:
# Number of unique values per column in anime dataset
print("\nUnique values in anime dataset:\n", df.nunique())


Unique values in anime dataset:
 mal_id             26564
url                26564
images             26364
trailer             4920
approved               1
titles             26564
title              26563
title_english      10998
title_japanese     25462
title_synonyms     12463
type                   9
source                17
episodes             250
status                 3
airing                 2
aired              16116
duration             333
rating                 6
score                559
scored_by           8712
rank               16055
popularity         20364
members            11508
favorites           1901
synopsis           21510
background          2556
season                 4
year                  65
broadcast            623
producers           4701
licensors            265
studios             1681
genres               962
explicit_genres        1
themes               948
demographics           8
dtype: int64


In [12]:
user.head(3)

Unnamed: 0,User ID,Username,Anime ID,Anime Title,Score
0,104748,JHaytko,889,Black Lagoon,9
1,104748,JHaytko,27,Trinity Blood,7
2,104750,A-n-i-m-e,50,Aa! Megami-sama! (TV),10


In [13]:
# Shape of the user ratings dataset
print("\nUser dataset shape:", user.shape)


User dataset shape: (5279841, 5)


In [14]:
# Number of unique values per column in user ratings dataset
print("\nUnique values in user dataset:\n", user.nunique())


Unique values in user dataset:
 User ID        52375
Username       52373
Anime ID       14464
Anime Title    14510
Score             10
dtype: int64


Data Cleaning

In [15]:
# Drop duplicate rows from anime dataset
df.drop_duplicates(inplace=True)

In [16]:
# Remove duplicate rows in user dataset
user.drop_duplicates(inplace=True)

In [17]:
# Count how many ratings each Anime ID has received
counts1 = user['Anime ID'].value_counts()

# Keep only those users/anime pairs where the Anime has at least 5 ratings
filtered_user = user[user["Anime ID"].isin(counts1[counts1>=5].index)]

In [18]:
# These shows won't be useful for recommendations
not_yet_aired = df[df.status == "Not yet aired"]

In [19]:
# Keep only anime that exist in the filtered user dataset
df1 = df[df['mal_id'].isin(filtered_user['Anime ID'])]

In [20]:
# missing values check
df1.isna().sum()

mal_id                0
url                   0
images                0
trailer               0
approved              0
titles                0
title                 0
title_english      4161
title_japanese       12
title_synonyms        0
type                  0
source                0
episodes             29
status                0
airing                0
aired                 0
duration              0
rating               15
score                 7
scored_by             7
rank               1572
popularity            0
members               0
favorites             0
synopsis             83
background         9020
season             6753
year               6753
broadcast             0
producers             0
licensors             0
studios               0
genres                0
explicit_genres       0
themes                0
demographics          0
dtype: int64

In [21]:
df1.dropna(subset=['synopsis','rating'],inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1.dropna(subset=['synopsis','rating'],inplace=True)


In [None]:
import ast 

# Many columns are stored as stringified dictionaries/lists (from JSON).
# We parse them into proper Python objects.

df1.producers = df1.producers.apply(ast.literal_eval)
df1.images = df1.images.apply(ast.literal_eval)
df1.trailer = df1.trailer.apply(ast.literal_eval)
df1.titles = df1.titles.apply(ast.literal_eval)
df1.aired = df1.aired.apply(ast.literal_eval)
df1.broadcast = df1.broadcast.apply(ast.literal_eval)
df1.licensors = df1.licensors.apply(ast.literal_eval)
df1.studios = df1.studios.apply(ast.literal_eval)
df1.genres = df1.genres.apply(ast.literal_eval)
df1.themes = df1.themes.apply(ast.literal_eval)
df1.demographics = df1.demographics.apply(ast.literal_eval)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1.producers = df1.producers.apply(ast.literal_eval)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1.images = df1.images.apply(ast.literal_eval)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1.trailer = df1.trailer.apply(ast.literal_eval)
A value is trying to be set on a copy of a slice from

In [None]:
def extract_info(row):
    """
    Extracts nested metadata (producers, genres, images, etc.) 
    from the anime dataset row.
    """
    producer_names = [producer['name'] for producer in row['producers']]
    licensors_names = [licensor['name'] for licensor in row['licensors']]
    studios_names = [studio['name'] for studio in row['studios']]
    genres = [genre['name'] for genre in row['genres']]
    themes = [theme['name'] for theme in row['themes']]
    demographics = [dg['name'] for dg in row['demographics']]
    
    # Trailer URL (if available)
    embed_url = row['trailer']['embed_url'] if row['trailer'] else None
    # Aired date (string format)
    aired = row['aired']['string'] if row['aired'] else None
    # Image URL (large cover image)
    large_image_url = row['images']['jpg']['large_image_url'] if row['images'] else None
    
    return pd.Series([producer_names, licensors_names,studios_names,genres,themes,demographics,embed_url,aired, large_image_url])

# Apply the function to each row of the DataFrame
df1[['producers','licensors','studios','genres','themes','demographics','trailer','aired','image']] = df1.apply(extract_info, axis=1)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1[['producers','licensors','studios','genres','themes','demographics','trailer','aired','image']] = df1.apply(extract_info, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1[['producers','licensors','studios','genres','themes','demographics','trailer','aired','image']] = df1.apply(extract_info, axis=1)


In [None]:
# Drop anime with no genre information
df1 = df1[~df1['genres'].apply(lambda x: x == [])]

In [25]:
df1 = df1.reset_index(drop=True)

In [None]:
import re

# Pattern to remove annotations like [Written by MAL Rewrite] or (Source info)
pattern = r"\[Written by MAL Rewrite\]|\(.*Source:.*\)" 

# Removing the pattern using regular expressions
df1['synopsis'] = df1['synopsis'].str.replace(pattern, '', regex=True).values


In [None]:
def remove_newline_numbers(text):
    """
    Cleans anime synopsis text by:
    - Removing newlines
    - Removing digits
    - Removing punctuation
    - Lowercasing
    """
    text = text.replace('\n', ' ')
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    return text.lower()

In [None]:
# Apply text cleaning
df1['synopsis_cleaned'] = df1.synopsis.apply(remove_newline_numbers)

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Lemmatize and remove stop words
df1['synopsis_cleaned'] = df1['synopsis_cleaned'].apply(lambda x: " ".join([token.lemma_ for token in nlp(x) if not token.is_stop]))

In [None]:
# Map verbose rating strings into simpler categories
rating_map = {
    "PG-13 - Teens 13 or older": "PG-13",
    "R - 17+ (violence & profanity)": "R17",
    "Rx - Hentai": "Rx",
    "R+ - Mild Nudity": "R+",
    "G - All Ages": "G",
    "PG - Children": "PG"
}

# Use the map to replace the values in the 'rating' column
df1['rating'] = df1['rating'].replace(rating_map)

In [None]:
# Replace empty lists with placeholder values
df1['themes'] = df1['themes'].apply(lambda x:["unknown_theme"] if x == [] else x )
df1['demographics'] = df1['demographics'].apply(lambda x:["unknown_demographics"] if x == [] else x )

# Fill missing season values
df1['season'] = df1['season'].fillna("unknownseason")


In [None]:
def get_season(x):
    """
    Derive season (spring/summer/fall/winter) from the month in the aired date.
    """
    spring = ["Mar","Apr","May"]
    summer = ["Jun","Jul","Aug"]
    fall = ["Sep","Oct","Nov"]
    winter = ["Dec","Jan","Feb"]
    y = x[:3]  # Extract first 3 letters (month abbreviation)
    if y in spring:
        return "spring"
    elif y in winter:
        return "winter"
    elif y in fall:
        return "fall"
    elif y in summer:
        return "summer"

df1.season = df1.aired.apply(get_season)

In [None]:
# Split aired string and extract year part
df1.year = df1.aired.str.split(',').str[1].str[1:5]

In [None]:
def fill_na(row):
    """
    Fills missing 'year' values based on aired string format.
    """
    if pd.isna(row['year']):
        if len(row['aired']) == 4:
            return row['aired']
        elif len(row['aired']) == 12:
            return row['aired'][:4]
        else:
            return row['aired'][4:8]
    else:
        return row['year']

df1['year'] = df1.apply(fill_na, axis=1)

In [None]:
# Select relevant columns for recommendation system
data = df1[['mal_id', 'url', 'trailer', 'title',
       'title_english', 'type', 'source',
       'episodes', 'status', 'aired', 'duration', 'rating', 'score',
       'scored_by', 'rank', 'popularity', 'members', 'favorites', 'synopsis','synopsis_cleaned',
       'background', 'season', 'year', 'producers', 'licensors',
       'studios', 'genres', 'themes', 'demographics',
       'image']] 

In [None]:
# Convert list-like fields into comma-separated strings for easier processing
for col in ["producers", "licensors", "genres", "studios", "themes", "demographics"]:
    data[col] = data[col].apply(lambda x: ",".join(x))

In [None]:
# Remove NSFW or less relevant genres
data = data[~(data.genres.str.contains("Hentai")|data.genres.str.contains("Erotica")|data.genres.str.contains("Boys Love")|data.genres.str.contains("Girls Love"))]

In [None]:
# Count how many times each genre appears
genre_counts = {}
for row in data['genres']:
    for genre in row.split(','):
        if genre in genre_counts:
            genre_counts[genre] += 1
        else:
            genre_counts[genre] = 1

print("Genre frequency counts:\n", genre_counts)

{'Action': 3339, 'Award Winning': 200, 'Sci-Fi': 2088, 'Adventure': 2086, 'Drama': 1757, 'Mystery': 681, 'Supernatural': 930, 'Fantasy': 2542, 'Sports': 411, 'Comedy': 3645, 'Romance': 1527, 'Slice of Life': 683, 'Suspense': 281, 'Ecchi': 709, 'Gourmet': 85, 'Avant Garde': 155, 'Horror': 335}


In [None]:
# Drop entries where favorites = 0 (less popular)
data = data[data.favorites != 0]


In [None]:
# Reset index after filtering
data = data.reset_index(drop=True)

CONTENT BASED FILTERING

In [None]:
# One-hot encode categorical columns
genres_df = data.genres.str.get_dummies(sep=',')
studios_df = data.studios.str.get_dummies(sep=',')
themes_df = data.themes.str.get_dummies(sep=',')
demographics_df = data.demographics.str.get_dummies(sep=',')
status_df = data.status.str.get_dummies()
season_df = data.season.str.get_dummies()
type_df = data.type.str.get_dummies()
source_df = data.source.str.get_dummies()
rating_df = data.rating.str.get_dummies()


In [None]:
# Ensure year is integer
data.year = data.year.astype('int')

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF on cleaned synopsis
vectorizer = TfidfVectorizer() 
overview_matrix = vectorizer.fit_transform(data['synopsis_cleaned'])

In [45]:
overview_matrix.shape

(8606, 30640)

In [None]:
# Convert sparse matrix to dense DataFrame
overview_matrix = overview_matrix.toarray()
overview_df = pd.DataFrame(overview_matrix)

In [None]:
from sklearn.decomposition import PCA
num_components = 1000

# Dimensionality reduction with PCA
pca = PCA(n_components=num_components)
pca_data = pca.fit_transform(overview_df)

In [48]:
pca_data = pd.DataFrame(pca_data)

In [49]:
pca_data.shape

(8606, 1000)

In [None]:
# Combine PCA-transformed synopsis features with categorical encodings
combined_features = pd.concat([pca_data,source_df,type_df,genres_df,demographics_df,themes_df],axis=1)

In [51]:
combined_features.shape

(8606, 1100)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
# Compute cosine similarity between all anime entries
similarity_matrix = cosine_similarity(combined_features)
print("Similarity matrix shape:", similarity_matrix.shape)

(8606, 8606)

In [None]:
def recommend(anime: str, top_n: int = 20):
    """
    Recommend similar anime using content-based filtering.

    Parameters:
    - anime (str): Title or English title of the anime
    - top_n (int): Number of recommendations to return (default = 20)
    """
    # Get index of the given anime
    index = data[(data['title'] == anime) | (data['title_english'] == anime)].index[0]
    
    # Sort similarities in descending order (excluding the anime itself)
    distances = sorted(list(enumerate(similarity_matrix[index])),reverse=True,key= lambda x:x[1])
    
    # Print top recommendations
    for i in distances[1:top_n]:
        
        print(data.iloc[i[0]].title,"---",i[1])

In [None]:
# Example usage
recommend("Kimetsu no Yaiba")

Kimetsu no Yaiba: Katanakaji no Sato-hen --- 0.9035888230945653
Kimetsu no Yaiba: Yuukaku-hen --- 0.899188506470277
Kimetsu no Yaiba: Mugen Ressha-hen --- 0.8931762252414965
Kimetsu no Yaiba: Hashira Geiko-hen --- 0.88715760396355
Nokemono-tachi no Yoru --- 0.8147014887809835
Senkaiden Houshin Engi --- 0.8078576510509909
Jujutsu Kaisen --- 0.8063498300831085
Kuroshitsuji II --- 0.76841715118485
Vanitas no Karte Part 2 --- 0.7665458808883701
Orient: Awajishima Gekitou-hen --- 0.7644520160907416
Orient --- 0.7618520270021595
Vanitas no Karte --- 0.7605607336980846
Kuroshitsuji: Book of Circus --- 0.7601304177475053
Sengoku Youko: Yonaoshi Kyoudai-hen --- 0.7558232072483551
Kuroshitsuji --- 0.7527823547322182
Kimetsu no Yaiba Movie: Mugen Ressha-hen --- 0.7520176329929539
Ragna Crimson --- 0.7372769737117499
Chainsaw Man --- 0.734025442552698
Yu☆Gi☆Oh! Zexal Second --- 0.7332565149351102


In [None]:
# Save cleaned anime dataset to SQLite
data.to_sql("anime", conn, if_exists="replace", index=False)

8606

In [None]:
# Save to CSV for external use
data.to_csv("../data/cleaned_data.csv",index=False)

In [None]:
# Save similarity matrix as pickle
import pickle
pickle.dump(similarity_matrix,open('../model/similarity_matrix.pkl','wb'))

MATRIX FACTORIZATION

In [None]:
# Align user ratings with filtered anime dataset
filtered_user = filtered_user[filtered_user['Anime ID'].isin(data.mal_id)]

In [None]:
# Keep only active users (those with >50 ratings)
counts = filtered_user['User ID'].value_counts()
filtered_user = filtered_user[filtered_user["User ID"].isin(counts[counts>50].index)]

In [None]:
print("Unique values after filtering:\n", filtered_user.nunique())

User ID        24008
Username       24007
Anime ID        8606
Anime Title     8649
Score             10
dtype: int64

In [None]:
# Reset index and keep only relevant columns
filtered_user = filtered_user.reset_index(drop=True)
filtered_user = filtered_user.iloc[:, [0,2,3,4]]
filtered_user.rename(columns={'User ID':'user_id','Anime ID':'anime_id'},inplace=True)

In [None]:
# Encode user_id and anime_id to integer codes
user_ids = pd.Categorical(filtered_user["user_id"])
filtered_user["user_id_encoded"] = user_ids.codes

anime_ids = pd.Categorical(filtered_user["anime_id"])
filtered_user["anime_id_encoded"] = anime_ids.codes

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Normalize scores to [0, 1] for better training stability
minmax = MinMaxScaler()
filtered_user["Score_scaled"] = minmax.fit_transform(filtered_user[["Score"]])

In [101]:
print("Final filtered dataset shape:", filtered_user.shape)

Final filtered dataset shape: (4640149, 7)


In [74]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    filtered_user[["user_id_encoded", "anime_id_encoded"]], filtered_user["Score_scaled"], test_size=0.2, random_state=4 , shuffle=True
)


In [75]:
filtered_user.anime_id_encoded.nunique()

8606

In [76]:
X_train.anime_id_encoded.nunique()


8606

In [78]:
import tensorflow as tf
from tensorflow import keras

In [None]:
# Define embedding sizes
num_users = len(set(X_train["user_id_encoded"]))  
num_animes = len(set(X_train["anime_id_encoded"]))  
embedding_dim = 64  # Latent factor dimensionality

In [None]:
# User & Anime input layers
user_input = keras.layers.Input(name='user_encoded',shape=(1,))
anime_input = keras.layers.Input(name='anime_encoded',shape=(1,))

# Embedding layers
user_embeddings = keras.layers.Embedding(num_users, embedding_dim, name='user_embedding')(user_input)
anime_embeddings = keras.layers.Embedding(num_animes, embedding_dim,name='anime_embedding')(anime_input)

# Dot product of embeddings
dot_product = keras.layers.Dot(name='dot_product',axes=2)([user_embeddings, anime_embeddings])
flattened = keras.layers.Flatten()(dot_product)

# Dense layers for learning non-linear interactions
dense = keras.layers.Dense(32, activation='relu')(flattened)
output = keras.layers.Dense(1, activation="sigmoid")(dense)  # Optional bias can be added before this layer

# Build and compile model
model = keras.Model(
    inputs=[user_input, anime_input], outputs=output
)

model.compile(
    optimizer="adam", loss="mse", metrics=["mse", "mae"]  # Regression metrics
)

model.summary()



Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 user_encoded (InputLayer)   [(None, 1)]                  0         []                            
                                                                                                  
 anime_encoded (InputLayer)  [(None, 1)]                  0         []                            
                                                                                                  
 user_embedding (Embedding)  (None, 1, 64)                1536512   ['user_encoded[0][0]']        
                                                                                                  
 anime_embedding (Embedding  (None, 1, 64)                550784    ['anime_encoded[0][0]']       
 )                                                                                          

 MODEL TRAINING WITH MLFLOW AUTLOGGING

In [None]:
mlflow.autolog()

# model training
history = model.fit(
    [X_train['user_id_encoded'], X_train['anime_id_encoded']],  # Separate user and anime IDs
    y_train,
    epochs=3,  # Adjust as needed
    batch_size=64,  # Adjust as needed
    validation_data=([X_val['user_id_encoded'], X_val['anime_id_encoded']], y_val),
)

# Log model into MLflow Model Registry
mlflow.keras.log_model(
    model=model,
    name="anime_recommender",
    registered_model_name="hybrid_anime_recommendation"
)

2025/09/24 23:36:33 INFO mlflow.tracking.fluent: Autologging successfully enabled for keras.
2025/09/24 23:36:36 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2025/09/24 23:36:36 INFO mlflow.tracking.fluent: Autologging successfully enabled for tensorflow.
2025/09/24 23:36:37 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'a2ad3c4395774a699e7f97044c63eccf', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current tensorflow workflow


Epoch 1/3



  saving_api.save_model(


Epoch 2/3
Epoch 3/3




INFO:tensorflow:Assets written to: C:\Users\JUNAID~1\AppData\Local\Temp\tmpjks6g0q8\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\JUNAID~1\AppData\Local\Temp\tmpjks6g0q8\model\data\model\assets


🏃 View run gregarious-frog-661 at: http://127.0.0.1:8080/#/experiments/994902288046821146/runs/a2ad3c4395774a699e7f97044c63eccf
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/994902288046821146
INFO:tensorflow:Assets written to: C:\Users\JUNAID~1\AppData\Local\Temp\tmpb65220r9\model\data\model\assets


INFO:tensorflow:Assets written to: C:\Users\JUNAID~1\AppData\Local\Temp\tmpb65220r9\model\data\model\assets
Successfully registered model 'hybrid_anime_recommendation'.
2025/09/24 23:55:40 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: hybrid_anime_recommendation, version 1
Created version '1' of model 'hybrid_anime_recommendation'.


<mlflow.models.model.ModelInfo at 0x207053dfdf0>

In [97]:
# test data
X_test_user = X_val['user_id_encoded']
X_test_item = X_val['anime_id_encoded']

# Make predictions
predictions = model.predict([X_test_user, X_test_item])




In [None]:
# Save in Keras format for reuse
model.save('../model/model.keras')

USER-SPECIFIC RECOMMENDATIONS

In [None]:
# Example: get recommendations for a given user

user_id = 909      # Example user ID
anime_ids = np.array(list(set(filtered_user.anime_id_encoded)))
anime_size = anime_ids.shape[0]

# Repeat user_id for all anime entries (for batch prediction)
user_ids = np.array([user_id]*anime_size)

# Predict ratings for all anime for this user
predictions = model.predict([user_ids, anime_ids])

# Get top 20 recommended anime indices
top_anime_index = predictions.flatten().argsort()[-20:][::-1]

# Map encoded anime IDs back to actual anime dataset
a = filtered_user[filtered_user.anime_id_encoded.isin(top_anime_index)][['anime_id']]
rec_anime = a.anime_id.unique()
print("Top 20 recommendations for user:", user_id)
print(data[data.mal_id.isin(rec_anime)]["title"])




189                                           Elfen Lied
191                                        Jigoku Shoujo
664                                    Hellsing Ultimate
1324                     Code Geass: Hangyaku no Lelouch
2074               Kara no Kyoukai Movie 1: Fukan Fuukei
2268                  Code Geass: Hangyaku no Lelouch R2
2664    Kara no Kyoukai Movie 2: Satsujin Kousatsu (Zen)
2665           Kara no Kyoukai Movie 3: Tsuukaku Zanryuu
2817                                Clannad: After Story
2848               Kara no Kyoukai Movie 4: Garan no Dou
2849                Kara no Kyoukai Movie 5: Mujun Rasen
3090     Kara no Kyoukai Movie 7: Satsujin Kousatsu (Go)
3146                                       Tsumiki no Ie
3914                                         Steins;Gate
4326          Steins;Gate Movie: Fuka Ryouiki no Déjà vu
4699                                 Gintama': Enchousen
6022                                      Kimi no Na wa.
6945    Gintama.: Shirogane no 

ANIME-TO-ANIME SIMILARITY

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Assume 'anime_id' is 21
anime_id = 10

# Extract learned embeddings from the model
anime_embedding = model.get_layer('anime_embedding').get_weights()[0]
target_anime_embedding = anime_embedding[anime_id]

# Compute similarity scores between this anime and all others
similarities = cosine_similarity([target_anime_embedding], anime_embedding)

# Get top 10 most similar animes
top_10_indices = similarities[0].argsort()[-10:][::-1]
top_10_anime_ids = anime_ids[top_10_indices]

# Map encoded IDs back to dataset
a = filtered_user[filtered_user.anime_id_encoded.isin(top_10_anime_ids)][['anime_id']]
rec_anime = a.anime_id.unique()
data[data.mal_id.isin(rec_anime)]['title']

10                                                 Naruto
231                                                Bleach
393     Naruto Movie 1: Dai Katsugeki!! Yuki Hime Shin...
531     Naruto: Takigakure no Shitou - Ore ga Eiyuu Da...
796     Naruto Movie 2: Dai Gekitotsu! Maboroshi no Ch...
1452                                   Naruto: Shippuuden
1781    Naruto Movie 3: Dai Koufun! Mikazuki Jima no A...
1905                                   Akakichi no Eleven
1984                           Naruto: Shippuuden Movie 1
2881                  Naruto: Shippuuden Movie 2 - Kizuna
Name: title, dtype: object

In [None]:
# Save processed user ratings to SQLite and CSV
filtered_user.to_sql("users", conn, if_exists="replace", index=False)
filtered_user.to_csv("../data/cleaned_user_data.csv",index=False)

4640149

HYBRID RECOMMENDATION

In [103]:
def user_anime_recommendations(user_id,anime_id,model,similarity_matrix,filtered_user,data):
    """
    Hybrid recommendation system combining collaborative filtering (matrix factorization)
    and content-based similarity.

    Parameters:
    - user_id (int): Encoded user ID
    - anime_id (int): Encoded anime ID
    - model (keras.Model): Trained MF model
    - similarity_matrix (np.array): Content-based similarity matrix
    - filtered_user (pd.DataFrame): User ratings dataset
    - data (pd.DataFrame): Anime dataset
    - top_n (int): Number of recommendations to return

    Returns:
    - List of recommended anime titles
    """
    # Predict user-anime ratings
    anime_ids = np.array(list(set(filtered_user.anime_id_encoded)))
    anime_size = anime_ids.shape[0]

    user_ids = np.array([user_id]*anime_size)
    predictions = model.predict([user_ids,anime_ids]) 
    p = predictions.flatten()

    # Get content-based similarity for target anime
    s = similarity_matrix[anime_id]

    # Hybrid score = weighted average of CB + CF
    ratings = 0.5*s + 0.5*p

    # Get top-N recommendations 
    top_anime_index = ratings.argsort()[-30:][::-1]
    
    # Exclude already watched animes
    watched_anime = filtered_user[filtered_user.user_id_encoded == user_id]['anime_id_encoded']
    mask = np.isin(top_anime_index, watched_anime)
    top_unwatched_anime_index = top_anime_index[~mask]
    
    # Collect recommended titles
    recommended_animes = []
    for i in top_unwatched_anime_index:
        anime_data = data.iloc[i]
        recommended_animes.append(anime_data['title'])
        
    return recommended_animes
    
    


In [104]:
# Example hybrid recommendation
print("\nHybrid recommendations:")
print(user_anime_recommendations(60, 243, model, similarity_matrix, filtered_user, data))


Hybrid recommendations:
['Akane Maniax', 'School Days: Valentine Days', 'Pia Carrot e Youkoso!! 2 DX', 'To Heart 2 Adnext', 'Kud Wafter', 'Shingeki no Kyojin: Kuinaki Sentaku', 'Tenchi Muyou! Ryououki: Omatsuri Zenjitsu no Yoru!', 'Hoshi no Koe', 'Tteotda Keunyeo!!', 'Yahari Ore no Seishun Love Comedy wa Machigatteiru. Zoku OVA', 'Tengen Toppa Gurren Lagann Movie Zenyasai: Viral no Amai Yume', 'School Rumble: Ichi Gakki Hoshuu', 'Tokyo Marble Chocolate', 'Clannad: After Story', 'Kidou Senshi Gundam: Dai 08 MS Shoutai', 'Ginga Ojousama Densetsu Yuna: Shinen no Fairy', 'Ranma ½: Yomigaeru Kioku', 'Mahoutsukai Tai!', 'Nekopara OVA', 'Toradora! Recap', 'Chou Hatsumei Boy Kanipan', 'Comic Party Revolution OVA', 'Angel Densetsu', 'Top wo Nerae 2! Diebuster', 'Steins;Gate: Oukoubakko no Poriomania', 'Lime-iro Senkitan: Nankoku Yume Roman', 'Shinpi no Sekai El-Hazard', 'Papa no Iukoto wo Kikinasai! OVA', 'Sweat Punch', 'Kimi ga Nozomu Eien']


# Notebook Complete

This notebook builds a **Hybrid Anime Recommendation System** combining:
- Content-based filtering (TF-IDF + PCA + cosine similarity)
- Collaborative filtering (Matrix Factorization with embeddings in Keras)
- Hybrid approach blending both methods

Outputs:
- Cleaned datasets saved to SQLite/CSV
- Trained recommendation model saved as `.keras`
- Similarity matrix saved as `.pkl`

Use the `user_anime_recommendations()` function for hybrid recommendations.
