🧩 Project on Anime Dataset

Anime scores are usually derived from simple averages.
I wanted to predict them from the actual synopsis text combined with show-level statistics such as popularity, members, and episodes.

📊 Dataset used --> anime_dataset.csv

Columns included --> title, synopsis, genres, episodes, popularity, members, studios, score
✔ Dropped column: year
✔ Target variable: score
✔ Text column (synopsis) processed as a token sequence

🧰 Tools used -->
Python, Pandas, NumPy, TensorFlow/Keras, Scikit-learn, Jupyter Notebook

In [1]:
#Importing Basic Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#Loading CSV file
df = pd.read_csv('C:/Users/USER/Desktop/Kaggle/Anime Dataset/anime_dataset.csv')
df.head()

Unnamed: 0,title,score,genres,episodes,synopsis,popularity,members,studios,year
0,Attack on Titan,8.57,"['Action', 'Award Winning', 'Drama', 'Suspense...",25.0,"Centuries ago, mankind was slaughtered to near...",1,4245518,['Wit Studio'],2013.0
1,Death Note,8.62,"['Supernatural', 'Suspense', 'Psychological', ...",37.0,"Brutal murders, petty thefts, and senseless vi...",2,4186098,['Madhouse'],2006.0
2,Fullmetal Alchemist: Brotherhood,9.1,"['Action', 'Adventure', 'Drama', 'Fantasy', 'M...",64.0,After a horrific alchemy experiment goes wrong...,3,3588803,['Bones'],2009.0
3,One-Punch Man,8.48,"['Action', 'Comedy', 'Adult Cast', 'Parody', '...",12.0,The seemingly unimpressive Saitama has a rathe...,4,3443899,['Madhouse'],2015.0
4,Demon Slayer: Kimetsu no Yaiba,8.42,"['Action', 'Award Winning', 'Supernatural', 'H...",26.0,"Ever since the death of his father, the burden...",5,3340293,['ufotable'],2019.0


In [3]:
#Filling missing values
df['episodes'] = df['episodes'].fillna(df['episodes'].mean())
df.isnull().sum()

title           0
score           0
genres          0
episodes        0
synopsis        0
popularity      0
members         0
studios         0
year          168
dtype: int64

In [4]:
#Deleting the column with most null values
df=df.drop('year',axis=1)
df.dtypes

title          object
score         float64
genres         object
episodes      float64
synopsis       object
popularity      int64
members         int64
studios        object
dtype: object

🔥 Techniques used on this dataset -->
✔ Tokenization of synopsis text
✔ Sequence padding (max length = 200)
✔ Scaling numeric features
✔ Multi-input deep learning model
✔ ANN for numeric data
✔ SimpleRNN for text data
✔ Merging features for final prediction

🤖 Model architecture used -->
✔ Embedding layer
✔ SimpleRNN layer for text
✔ Dense layers for numeric features
✔ Concatenation of text + numeric
✔ Final Dense layer predicting score
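
To show how data flows through this architecture, here is a plain-NumPy sketch of one forward pass. It is only an illustration under assumptions: the weights are random stand-ins for trained parameters, dense-layer biases are left out for brevity, and the helper names (`rand`, `relu`) are made up. The outputs are meaningless as scores; only the shapes and the order of operations matter.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim, rnn_units, maxlen, n_num, batch = 15000, 64, 32, 200, 3, 2

def rand(*shape):
    """Small random weights standing in for trained parameters (illustrative only)."""
    return rng.normal(0.0, 0.05, shape)

def relu(x):
    return np.maximum(x, 0.0)

# Toy inputs: random token ids and random scaled numeric features
tokens = rng.integers(0, vocab, size=(batch, maxlen))
numeric = rng.random((batch, n_num))

# Text branch: embedding lookup, then a simple recurrence over the time steps
E = rand(vocab, emb_dim)
Wx, Wh, bh = rand(emb_dim, rnn_units), rand(rnn_units, rnn_units), np.zeros(rnn_units)
x = E[tokens]                                  # (batch, maxlen, emb_dim)
h = np.zeros((batch, rnn_units))
for t in range(maxlen):
    h = np.tanh(x[:, t] @ Wx + h @ Wh + bh)    # final h summarises the whole synopsis

# Numeric branch: one dense layer with ReLU (bias omitted)
num_out = relu(numeric @ rand(n_num, 32))

# Merge both branches and regress to a single score
merged = np.concatenate([h, num_out], axis=1)  # (batch, 64)
hidden = relu(relu(merged @ rand(64, 64)) @ rand(64, 32))
score = hidden @ rand(32, 1)                   # (batch, 1), linear output

print(x.shape, h.shape, merged.shape, score.shape)
```

Only the last hidden state of the recurrence is passed on, which is also what `SimpleRNN(32)` returns by default in Keras.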

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#TEXT PROCESSING
text_col = df['synopsis'].astype('str')
tokenizer = Tokenizer(num_words=15000)
tokenizer.fit_on_texts(text_col)
seq = tokenizer.texts_to_sequences(text_col)
maxlen=200
seq = pad_sequences(seq,maxlen=maxlen)
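
To make the padding step concrete, here is a minimal NumPy sketch of what `pad_sequences(seq, maxlen=maxlen)` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long ones keep only their last `maxlen` tokens. The helper `pad_pre` and the toy token ids are made up for illustration and are not part of the notebook.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper mimicking the pad_sequences defaults (pre-pad, pre-truncate)."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, s in enumerate(sequences):
        if len(s) == 0:
            continue                      # an empty sequence stays all padding
        trunc = s[-maxlen:]               # keep the LAST maxlen tokens
        out[i, -len(trunc):] = trunc      # right-align; padding fills the left
    return out

toy = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]     # toy token ids, not from the dataset
padded = pad_pre(toy, maxlen=4)
print(padded)
```

With these defaults, a synopsis longer than 200 tokens loses its opening words. If the start of the synopsis matters more, `truncating='post'` is an option worth trying.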

In [6]:
from sklearn.preprocessing import MinMaxScaler
#NUMERIC PROCESSING
numeric_cols = ['episodes','popularity','members']
X_num = df[numeric_cols].fillna(0)
scaler = MinMaxScaler()
X_num = scaler.fit_transform(X_num)

In [7]:
y=df['score']
#Splitting data
from sklearn.model_selection import train_test_split
Xseq_train,Xseq_test,Xnum_train,Xnum_test,y_train,y_test=train_test_split(seq,X_num,y,test_size=0.2,random_state=42)
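
One caveat with the cells above: the scaler (and the tokenizer) is fitted on the full dataset before the split, so test-set statistics leak into preprocessing. A minimal NumPy sketch of the alternative, fitting the min-max statistics on the training rows only, is shown below. The toy array is made up. In the notebook this would roughly correspond to splitting the unscaled features first, then calling `scaler.fit_transform(Xnum_train)` and `scaler.transform(Xnum_test)`.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for the [episodes, popularity, members] columns (made-up values)
X = rng.integers(1, 1000, size=(10, 3)).astype(float)

# Split first: shuffle the row indices and hold out 20% for testing
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
X_train, X_test = X[train_idx], X[test_idx]

# Fit the min-max statistics on the training rows only
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns

# Apply the same statistics to both splits
X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Scaled test values can fall slightly outside [0, 1] with this approach. That is expected, since the test rows played no part in choosing the range.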

In [8]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input,Embedding,SimpleRNN,Dense,Concatenate
#Building Model
#RNN branch: token ids -> 64-dim embeddings -> 32-unit SimpleRNN
input_text = Input(shape=(maxlen,))
embed = Embedding(15000,64)(input_text)
rnn = SimpleRNN(32)(embed)

In [9]:
#ANN branch: scaled numeric features -> 32-unit dense layer
input_num = Input(shape=(X_num.shape[1],))
dense_num = Dense(32,activation='relu')(input_num)

In [10]:
#Main: concatenate both branches, then regress to a single score
merged = Concatenate()([rnn,dense_num])
dense = Dense(64,activation='relu')(merged)
dense = Dense(32,activation='relu')(dense)
output = Dense(1)(dense)
model = Model([input_text,input_num],output)
model.compile(optimizer='adam',loss='mse',metrics=['mae'])
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 200)]        0           []                               
                                                                                                  
 embedding (Embedding)          (None, 200, 64)      960000      ['input_1[0][0]']                
                                                                                                  
 input_2 (InputLayer)           [(None, 3)]          0           []                               
                                                                                                  
 simple_rnn (SimpleRNN)         (None, 32)           3104        ['embedding[0][0]']              
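
The parameter counts in the summary above can be checked by hand. The short sketch below recomputes the two visible values from the layer sizes used in the notebook; the variable names are chosen here for illustration.

```python
vocab_size, embed_dim, rnn_units = 15000, 64, 32

# Embedding: one embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# SimpleRNN: input weights + recurrent weights + bias
rnn_params = embed_dim * rnn_units + rnn_units * rnn_units + rnn_units

print(embedding_params)  # 960000
print(rnn_params)        # 3104
```

Nearly all of the parameters shown sit in the embedding table, so the vocabulary size and embedding width dominate the model size.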
                                                                                              

In [11]:
#Training Model
history = model.fit([Xseq_train,Xnum_train],y_train,validation_data=([Xseq_test,Xnum_test],y_test),epochs=10,batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [12]:
#Model Prediction
y_pred = model.predict([Xseq_test,Xnum_test])



In [17]:
#Checking Error
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test,y_pred)
print(f"Error : {mae:.2f}")

Error : 0.69


📈 Results --> MAE: 0.69, MSE: 1.24

1.) The model uses both synopsis content and popularity metrics to predict the score.
2.) On the test set, predictions were off by about 0.69 score points on average (MAE).
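
One way to judge whether an MAE of 0.69 is good is to compare it against a trivial baseline that always predicts the mean score. The sketch below shows how MAE, MSE, and such a baseline could be computed with NumPy. The numbers are made up for illustration and are not results from this notebook; in practice the baseline mean would come from `y_train`.

```python
import numpy as np

# Made-up scores for illustration; not results from the notebook
y_true = np.array([8.5, 7.2, 6.9, 7.8])
y_hat = np.array([[8.0], [7.0], [7.5], [7.4]])   # shape (n, 1), like model.predict output

# Flatten to (n,) so the subtraction is element-wise rather than broadcast to (n, n)
y_hat = y_hat.ravel()

mae = np.mean(np.abs(y_true - y_hat))
mse = np.mean((y_true - y_hat) ** 2)

# Baseline: always predict the mean score (taken from y_true here for brevity)
baseline = np.full_like(y_true, y_true.mean())
baseline_mae = np.mean(np.abs(y_true - baseline))

print(f"MAE: {mae:.2f}  MSE: {mse:.2f}  baseline MAE: {baseline_mae:.2f}")
```

The `ravel()` call matters when computing metrics by hand: subtracting an `(n, 1)` array from an `(n,)` array silently produces an `(n, n)` matrix.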

✅ Conclusion:

This project demonstrates how text-based features can be combined with numeric attributes 
to predict anime scores using a hybrid ANN + SimpleRNN model.
By processing the synopsis through a SimpleRNN and merging it with normalized numeric features, 
the model learned from both inputs and reached an MAE of 0.69 on the test set.
Overall, this project strengthened my understanding of handling mixed data types, text preprocessing, 
and designing multi-input neural network architectures.