# This notebook attempts to solve a business problem: can a studio or publisher predict the approximate score a show will receive from the show's initial features?
## We use an XGBoost regressor: a gradient-boosted ensemble of decision trees, which handle one-hot encoded categorical features well

In [20]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
import xgboost as xgb

Reading data

In [21]:
data = pd.read_csv('Data/animelist_filt.csv')

One-hot encoding additional features

In [22]:
data['season'] = data['premiered'].str.extract(r'(\w+)\s\d{4}')
seasons = data['season'].str.get_dummies(sep=', ')
seasons = seasons.add_prefix('season_')
data = pd.concat([data, seasons], axis=1).drop('season', axis=1)
prefixes = ["genre_", "studio_", "season_", "source_"]
X = data[[col for col in data.columns if any(col.startswith(prefix) for prefix in prefixes) or col == 'episodes']] 
# Episode count is kept as a numeric feature: longer shows are assumed to have higher mean ratings
y = data['score']
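
To illustrate what the season encoding in the cell above produces, here is a minimal sketch on made-up `premiered` values (the toy strings and variable names are assumptions, not rows from `animelist_filt.csv`). It uses `expand=False` so that `str.extract` returns a Series, and relies on `get_dummies` leaving missing values as all-zero rows.

```python
import pandas as pd

# Toy values standing in for the 'premiered' column (hypothetical data)
premiered = pd.Series(["Spring 2012", "Fall 1998", None])

# Capture the season word that precedes a four-digit year
season = premiered.str.extract(r'(\w+)\s\d{4}', expand=False)

# One indicator column per season; a missing season becomes a row of zeros
season_dummies = season.str.get_dummies().add_prefix('season_')
print(season_dummies)
```

Rows without a recognizable `premiered` value therefore carry no season signal at all, which is worth keeping in mind when reading the model's predictions for such shows.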

Splitting the data

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

Training the model, then testing and cross-validating it

In [24]:
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.7, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=100)
xg_reg.fit(X_train, y_train)
y_pred = xg_reg.predict(X_test)
print(f"R^2 on the test set: {xg_reg.score(X_test, y_test)}")
# cross_val_score returns negated MSE per fold, so the sign is flipped to report MSE
cv_scores = cross_val_score(xg_reg, X, y, cv=5, scoring='neg_mean_squared_error')
mean_cv_score = -cv_scores.mean()
std_cv_score = cv_scores.std()
print(f"Mean cross-validation MSE: {mean_cv_score}")
print(f"Standard deviation of cross-validation MSE: {std_cv_score}")

R^2 on the test set: 0.7214451914723337
Mean cross-validation MSE: 0.7369025099704366
Standard deviation of cross-validation MSE: 0.15534254298301092


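The 0.72 returned by `xg_reg.score` is the coefficient of determination (R²) rather than an accuracy: the model explains about 72% of the variance in scores on the held-out test set. The cross-validation figure of about 0.74 is a mean squared error, which corresponds to a root mean squared error of roughly 0.86 score points. The 0.155 value is the standard deviation of that MSE across the five folds, so it describes how stable the error is between folds, not a ±0.15 margin of error on individual predictions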
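
For reference, `XGBRegressor.score` returns R², while the `neg_mean_squared_error` scoring yields (negated) MSE values. The sketch below, in plain Python with made-up toy numbers, shows how each quantity is computed and converts the cross-validation MSE reported above into an RMSE. The helper names `mse` and `r2` are illustrative, not part of any library.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy scores (hypothetical), not taken from the dataset
toy_true = [7.0, 8.0, 6.0, 9.0]
toy_pred = [7.5, 7.5, 6.5, 8.5]

toy_mse = mse(toy_true, toy_pred)
toy_rmse = math.sqrt(toy_mse)   # same units as the score itself
toy_r2 = r2(toy_true, toy_pred)

# RMSE implied by the mean cross-validation MSE printed above
reported_cv_rmse = math.sqrt(0.7369025099704366)

print(toy_mse, toy_rmse, toy_r2, reported_cv_rmse)
```

RMSE is usually the easier number to communicate to a studio or publisher, since it is expressed in score points rather than squared score points.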

Given the amount of data, the model performs relatively well, although having additional information from the publisher could help improve the scores further

After trying different feature subsets, the chosen subset of [genre, studio, release season, material source, episode count] performed best

Saving the model so that it can be reused later without retraining (the xgboost package is still needed to load it)

In [25]:
xg_reg.save_model('xgboost_model.ubj')

Testing on the provided sample

In [26]:
loaded = xgb.XGBRegressor()
loaded.load_model('xgboost_model.ubj')
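
When features are rebuilt for new data, as in the next cell, the resulting columns may differ from the training matrix: a small sample might not contain every season, genre or studio, and the column order can change. One possible safeguard, sketched here under assumptions, is to reindex the new feature frame against the training column list. The names `train_columns` and `sample_X` and their values are hypothetical; in this notebook the list could be taken from the training `X.columns`.

```python
import pandas as pd

# Hypothetical training-time column order (in practice, e.g. list(X.columns) from training)
train_columns = ["episodes", "season_Fall", "season_Spring", "season_Summer", "season_Winter"]

# Hypothetical sample in which only two seasons occur and the order differs
sample_X = pd.DataFrame({
    "season_Spring": [1, 0],
    "episodes": [12, 24],
    "season_Fall": [0, 1],
})

# Add missing indicator columns as zeros and restore the training order
aligned = sample_X.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Columns present in the sample but absent from `train_columns` would be dropped by this call, which matches what the trained model can actually use.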

In [27]:
sample = pd.read_csv('animelist_sample.csv')
sample['season'] = sample['premiered'].str.extract(r'(\w+)\s\d{4}')
seasons = sample['season'].str.get_dummies(sep=', ')
seasons = seasons.add_prefix('season_')
sample = pd.concat([sample, seasons], axis=1).drop('season', axis=1)
prefixes = ["genre_", "studio_", "season_", "source_"]
X = sample[[col for col in sample.columns if any(col.startswith(prefix) for prefix in prefixes) or col == 'episodes']] 
# Episode count is kept as a numeric feature: longer shows are assumed to have higher mean ratings
y = sample['score']
print(f"Predicted vs. actual: {loaded.predict(X), y.values}")

Predicted vs. actual: (array([7.271096 , 7.517318 , 7.270844 , 7.202426 , 7.8390884],
      dtype=float32), array([7.63, 7.89, 7.55, 8.21, 8.67]))
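
The two arrays printed above can be compared directly. A minimal sketch in plain Python, with the five predicted and actual values copied from that output (the variable names are illustrative):

```python
import math

# Values copied from the output above
predicted = [7.271096, 7.517318, 7.270844, 7.202426, 7.8390884]
actual = [7.63, 7.89, 7.55, 8.21, 8.67]

# Positive residual means the model predicted below the real score
residuals = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

print(f"Residuals: {[round(r, 3) for r in residuals]}")
print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```

On these five shows every prediction falls below the actual score, with a mean absolute error of roughly 0.57 score points. Five samples are too few to draw conclusions from, but a consistent sign in the residuals is something to check on a larger held-out set.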
