# Anime Watching Status

The data set that I am using for this project is the *MyAnimeList Database 2020* from Kaggle.com

```
https://www.kaggle.com/hernan4444/anime-recommendation-database-2020?select=animelist.csv
```

Using this dataset, I will be attempting to predict the watching status of a given user.  There are 6 different watching statuses, making this a classification problem.  The reason I am choosing watching status as a target is that watching status was easier to isolate and reduce the risk of data leakage.  To further limit the risk of Data leakage, I removed features that might suggest a users opinion of an anime series/movie--with the exception of episodes watched.

The baseline I am using is the max value of the normalized value counts in watching status; or 0.601.

The metric I have chosen to measure the performance of my models is accuracy.  Although, given the distribution of the categories in watching status, it would give a better picture to use a confusion plot instead.  The reason I have chosen to stick to accuracy, is that it is the most functional metric for the end use.

The best use case for this model would be for general entertainment.  Though prediction of watching status may have better uses, I believe it would not accurately represent the data without using the rest of the dataset--which would require a more advanced approach, such as nlp and market segmentation.

## Imports

In [1]:
# General Imports
import pandas as pd
import numpy as np
import re

## sklearn ##
from sklearn.model_selection import train_test_split,\
                                    RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# XGBoost
from xgboost import XGBClassifier

# Category Encoders
from category_encoders import OrdinalEncoder

## Function Declaration

In [2]:

def wrangle(anime_filepath, animelist_filepath=None):
    '''Takes anime.csv path, and animelist.csv path; merges and wrangles them'''
    
    def search_and_return(feature_column, search_pattern):
        '''Takes a pandas Series and and search parttern.  Iterates over 
        Series and searches pattern with re.search(), concatinating search 
        results to a list.  This list is returned.'''
    
        temp_list = []
        
        for i in feature_column:
            
            match = re.findall(search_pattern, str(i))
            
            if match:
                temp_list += [match[0]]
            else:
                temp_list += [np.NaN]
                
        return temp_list
    
    
    ## Anime ##
    
    # dictionary of explicit data types for the anime_df dataframe as read in from anime.csv
    anime_dtypes_dict = {
        'MAL_ID': 'int64',
        'Name': 'object',
        'Score': 'float16',
        'Genres': 'object',
        'English name': 'object',
        'Japanese name': 'object',
        'Type': 'category',
        'Episodes': 'float16',
        'Aired': 'object',
        'Premiered': 'object',
        'Producers': 'object',
        'Licensors': 'object',
        'Studios': 'object',
        'Source': 'object',
        'Duration': 'object',
        'Rating': 'object',
        'Ranked': 'float32',
        'Popularity': 'float32',
        'Members': 'float32',
        'Favorites': 'float32',
        'Watching': 'float32',
        'Completed': 'float32',
        'On-Hold': 'float32',
        'Dropped': 'float32',
        'Plan to Watch': 'float32',
        'Score-10': 'float32',
        'Score-9': 'float32',
        'Score-8': 'float32',
        'Score-7': 'float32',
        'Score-6': 'float32',
        'Score-5': 'float32',
        'Score-4': 'float32',
        'Score-3': 'float32',
        'Score-2': 'float32',
        'Score-1': 'float32'
        }
    
    # reading anime.csv into anime_df
    anime_df = pd.read_csv(anime_filepath, na_values='Unknown', dtype=anime_dtypes_dict)

    # drop columns from anime_df, first pass
    anime_df = anime_df.drop(columns=[
        'Name', 
        'Japanese name',
        'Ranked',
        'Popularity',
        'Watching',
        'Completed',
        'On-Hold',
        'Dropped',
        'Plan to Watch',
        'Score-10', 
        'Score-9', 
        'Score-8', 
        'Score-7', 
        'Score-6', 
        'Score-5', 
        'Score-4', 
        'Score-3', 
        'Score-2', 
        'Score-1',
        'Aired'
        ])


    # removing adult/explicit anime titles
    # iterate through anime_df['Genres'] and return True for all titles who's genre does not contain 
    # the word 'hentai', then subset the anime_df dataframe with only 'True'/'safe' titles.
    no_hentai_mask = [False if re.search(r'[Hh]entai', str(i)) else True for i in anime_df['Genres']]
    anime_df = anime_df[no_hentai_mask]
    
    # renaming columns
    anime_df = anime_df.rename(columns={
        'MAL_ID': 'anime_id',
        'Score': 'avg_score',
        'English name': 'english_name',
        'Type': 'media_type',
        'Episodes': 'total_episodes',
        'Premiered': 'premier_date',
        'Producers': 'producers',
        'Licensors': 'licensors',
        'Studios': 'studios',
        'Source': 'source_material',
        'Duration': 'episode_duration',
        'Rating': 'age_rating',
        'Members': 'groups_member_count',
        'Favorites': 'favorited_by_users'
        })
    
    
    ## Anime List ##
    
    # dictionary of explicit data types for the animelist_df dataframe as read in from animelist.csv
    animelist_dtypes_list = {
        'user_id': 'int32',
        'anime_id': 'int32',
        'rating': 'int8',
        'watching_status': 'int8',
        'watched_episodes': 'int32'
        }
    
    # reading animelist.csv into animelist_df
    animelist_df = pd.read_csv(animelist_filepath, dtype=animelist_dtypes_list)
    
    
    # sample to reduce size so it is easier on RAM
    animelist_df = animelist_df.sample(frac=0.01, random_state=42)
    
    # sort both dataframes and merge them
    anime_df = anime_df.sort_values(by='anime_id')
    animelist_df = animelist_df.sort_values(by='anime_id')
    anime = animelist_df.merge(anime_df, how='outer', on='anime_id')
    
    
    ## Feature Engineering ##
    anime['premier_year'] = search_and_return(anime['premier_date'], r'\d\d\d\d')
    anime['premier_season'] = search_and_return(anime['premier_date'], r'(.+?)\s')
    anime['episode_minutes'] = search_and_return(anime['episode_duration'], r'\d\d?\d?')
   
    
    # final round of column dropping (dropping columns no longer needed after wrangling)
    anime = anime.drop(columns=[
        'anime_id', 
        'user_id', 
        'producers',
        'licensors',
        'studios',
        'premier_date',
        'english_name',
        'age_rating',
        'episode_duration',
        'Genres',
        'rating'
        ])
    
    # drop all NaNs
    anime = anime.dropna()
    
    # type casting
    anime['premier_year'] = anime['premier_year'].astype(np.int16)
    anime['episode_minutes'] = anime['episode_minutes'].astype(np.int16)
    
    
    return anime

## Importing Data

In [3]:
# set File paths for csv files
anime_filepath = r'C:\Users\DmgProne\Desktop\VSCode\Lambda\Unit 2\Sprint 3\Anime Dataset\anime.csv'
animelist_filepath = r'C:\Users\DmgProne\Desktop\VSCode\Lambda\Unit 2\Sprint 3\Anime Dataset\animelist.csv'

# import and wrangle data with wrangle function
anime = wrangle(anime_filepath, animelist_filepath)

In [4]:
# display sample of the dataset to help 
# visualize the structure of the data
anime.sample(frac=0.001).head(10)

Unnamed: 0,watching_status,watched_episodes,avg_score,media_type,total_episodes,source_material,groups_member_count,favorited_by_users,premier_year,premier_season,episode_minutes
654212,6.0,0.0,7.648438,TV,20.0,Manga,857824.0,12713.0,2014,Winter,24
342481,3.0,1.0,7.089844,TV,26.0,Novel,2616.0,18.0,1975,Spring,24
631246,4.0,1.0,6.699219,TV,52.0,Original,2557.0,16.0,2013,Spring,24
1016431,6.0,0.0,7.519531,TV,12.0,Original,317337.0,2119.0,2018,Fall,23
973179,2.0,12.0,6.480469,TV,12.0,Manga,152026.0,844.0,2018,Spring,24
607187,4.0,0.0,7.351562,TV,12.0,Light novel,190201.0,879.0,2013,Summer,23
706937,2.0,12.0,7.808594,TV,12.0,Manga,1895488.0,45519.0,2014,Summer,24
728338,2.0,24.0,7.890625,TV,24.0,Manga,1394358.0,22304.0,2014,Fall,24
1058641,6.0,0.0,6.828125,TV,12.0,Manga,17915.0,36.0,2020,Summer,23
948996,2.0,7.0,8.929688,TV,7.0,Light novel,270878.0,6113.0,2017,Summer,22


## Train, Val, and Test Splitting

In [5]:
# split data into X and y where the target is watching_status
X = anime.drop(columns='watching_status')
y = anime['watching_status']

# create Train, Validation, and Test sets (70:15:15)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X, y, train_size=0.5)

## Baseline Accuracy

In [6]:
# Getting and reporting Baseline Accuracy
baseline_acc = anime['watching_status'].value_counts(normalize=True).max()

print('\n###################################################################\n')
print('Baseline Accuracy: ', baseline_acc)
print('\n###################################################################')


###################################################################

Baseline Accuracy:  0.6009775041797615

###################################################################


## Logistic Classifier

In [7]:
# defining parameter distribution for RandomSearchCV
log_params = {
    'C': np.arange(0.1, 1.0, 0.1),
    'fit_intercept': [True, False]
}

# establishing pipeline
# randomize search is set to take only 3 samples to save on time, but 
# would ideally be increased to find better parameters
model_log = make_pipeline(
    OrdinalEncoder(),
    RandomizedSearchCV(
        LogisticRegression(
            random_state=42,
            n_jobs=3
            ),
        param_distributions=log_params,
        n_jobs=3,
        n_iter=3
        )
)

# fit the model with training data
model_log.fit(X_train, y_train)

# get accuracy score for the fit model, for both train and val data
train_score = accuracy_score(y_train, model_log.predict(X_train))
val_score = accuracy_score(y_val, model_log.predict(X_val))

# display the results
print('\n###################################################################\n')
print('Logistic Regression Accuracy Scores:')
print('Train Score:', train_score)
print('Val Score:', val_score)
print('\n###################################################################')


###################################################################

Logistic Regression Accuracy Scores:
Train Score: 0.6022238115157812
Val Score: 0.6002995668425385

###################################################################


## XGBoost Classifier

In [8]:
# defining parameter distribution for RandomSearchCV
boost_params = {
    'max_depth': [6, 12, 18, 24],
    'alpha': np.arange(0.0, 2.0, 0.1)
}

# establishing pipeline
# randomize search is set to take only 3 samples to save on time, but 
# would ideally be increased to find better parameters
model_boost = make_pipeline(
    OrdinalEncoder(),
    RandomizedSearchCV(
        XGBClassifier(
            random_state=42,
            n_jobs=3
            ),
        param_distributions=boost_params,
        n_jobs=3,
        n_iter=3
        )
)

# fit the model with training data
model_boost.fit(X_train, y_train)

# get accuracy score for the fit model, for both train and val data
train_score = accuracy_score(y_train, model_boost.predict(X_train))
val_score = accuracy_score(y_val, model_boost.predict(X_val))

# display the results
print('\n###################################################################\n')
print('XGBoost Accuracy Scores:')
print('Train Score:', train_score)
print('Val Score:', val_score)
print('\n###################################################################')




###################################################################

XGBoost Accuracy Scores:
Train Score: 0.9088210290525861
Val Score: 0.9015342680178661

###################################################################


## Test Accuracy of Best Model

In [9]:
# Getting test accuracy and reporting
test_score = accuracy_score(y_test, model_boost.predict(X_test))

print('\n###################################################################\n')
print('Test Score:', test_score)
print('\n###################################################################')


###################################################################

Test Score: 0.9019069672042662

###################################################################
