## Data preparation

The two dataframes used to build the recommendation systems are prepared here.

### DataFrame 1

This dataframe will be used to build a TensorFlow multi-task recommender and a Jaccard-similarity item-based recommender. The `User_ID` column will be dropped because I want to build a simple recommender that anyone can use, even if they aren't part of the original dataset.

In [None]:
# importing packages
import pandas as pd
import numpy as np
import re

In [None]:
# import data
history = pd.read_csv("../data/processed/anime_history.dat", sep="\t")
info = pd.read_csv("../data/processed/anime_info.dat", sep="\t")
ratings = pd.read_csv("../data/processed/anime_ratings.dat", sep="\t")

In [None]:
history['Feedback'].value_counts()

The `Feedback` feature in `history` has only one value, so it will be dropped.

In [None]:
# dropping feedback
history.drop('Feedback', axis=1, inplace=True)
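The same check can be generalized to any column. The sketch below is illustrative only: the `toy` frame and its values are made up and simply stand in for `history`.

```python
import pandas as pd

# Toy frame standing in for `history`; the values are made up.
toy = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Anime_ID": [10, 11, 12],
    "Feedback": [1, 1, 1],
})

# A column with a single distinct value carries no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Drop every constant column in one step.
toy_reduced = toy.drop(columns=constant_cols)
```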

In [None]:
# merging anime info and user ratings
df1 = pd.merge(left=info, right=ratings, left_on='anime_ids', right_on='Anime_ID', how='right')
df1.head(2)

In [None]:
# merging df1 and viewing history
df = pd.merge(left=df1, right=history, on=['User_ID', 'Anime_ID'], how='left')
df.head(2)

In [None]:
# dropping irrelevant columns
df.drop(['episodes', 'anime_ids', 'Anime_ID', 'User_ID', 'rating'], axis=1, inplace=True)

The columns above were dropped because they add no value for the purpose of this project.

`rating` is the average IMDb/Rotten Tomatoes rating, while `Feedback` is the user-assigned rating. (The names are confusing, I know.)

In [None]:
# dropping na
df.dropna(inplace=True)
df.head(2)
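Rows that have no match in a left merge end up with `NaN` in the right-hand columns, so `dropna` removes them. A minimal sketch on made-up data follows; the `Status` column is hypothetical and is not a column of the real `history` file.

```python
import pandas as pd

# Toy stand-ins for df1 and history; `Status` is a hypothetical column.
left = pd.DataFrame({"User_ID": [1, 2], "Anime_ID": [10, 11], "name": ["A", "B"]})
right = pd.DataFrame({"User_ID": [1], "Anime_ID": [10], "Status": ["watched"]})

# how="left" keeps every row of `left`; unmatched rows get NaN in `Status`.
merged = pd.merge(left=left, right=right, on=["User_ID", "Anime_ID"], how="left")
rows_before = len(merged)

# dropna removes the rows that found no partner in `right`.
merged_clean = merged.dropna()
rows_after = len(merged_clean)
```

Comparing `rows_before` and `rows_after` on the real data is a quick way to see how many rows the `dropna` step discards.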

In [None]:
# unique anime name
df['name'].nunique()

There are over 7000 unique anime in the dataset. Because a user interface will be built for inference, we will limit the data to the top 1000 anime, ranked by the number of times they appear in the dataset.

In [None]:
# selecting top 1000 animes by value counts
top_1000 = df['name'].value_counts()[:1000].index
df = df[df['name'].isin(top_1000)].reset_index(drop=True)
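The frequency-based filter can be seen on a small example. This is a sketch with made-up names, keeping the top 2 instead of the top 1000.

```python
import pandas as pd

# Toy data: "A" appears 3 times, "B" twice, "C" once.
toy = pd.DataFrame({"name": ["A", "A", "A", "B", "B", "C"]})

# value_counts sorts by frequency (descending), so head(2) gives the 2 most frequent names.
top_n = toy["name"].value_counts().head(2).index

# Keep only rows whose name is among the most frequent ones.
filtered = toy[toy["name"].isin(top_n)].reset_index(drop=True)
```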

In [None]:
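# function to clean the anime name
def text_cleaning(text):
    text = re.sub(r'&quot;', '', text)
    # escape the dot so it matches a literal '.' rather than any character
    text = re.sub(r'\.hack//', '', text)
    # the specific apostrophe patterns must run before the generic '&#039;' removal,
    # otherwise they can never match
    text = re.sub(r'A&#039;s', '', text)
    text = re.sub(r'I&#039;', 'I\'', text)
    text = re.sub(r'&#039;', '', text)
    text = re.sub(r'&amp;', 'and', text)

    return text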

In [None]:
# cleaning name
df['name'] = df['name'].apply(text_cleaning)
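As an alternative to hand-written patterns, Python's standard-library `html.unescape` decodes HTML entities such as `&quot;`, `&#039;` and `&amp;`. Note that it restores the original characters rather than stripping them, so it is not a drop-in replacement for `text_cleaning`. The sample strings below are made up for illustration.

```python
import html

# Made-up sample strings containing HTML entities.
samples = [
    "Gintama&#039;",
    "&quot;Bungaku Shoujo&quot; Memoire",
    "Tom &amp; Jerry",
]

# html.unescape turns each entity back into the character it encodes.
decoded = [html.unescape(s) for s in samples]
```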

The `members` feature denotes the number of fans each anime has. The feature could be left as is, but for this project it will be binned into popularity categories. This will allow users to get recommendations based on how popular an anime is.

In [None]:
# categories for audience
df['Audience'] = pd.qcut(df['members'], q=[0, .33, .66, 1.], labels=["Niche", "Universal", "Spectacle"])
df["Audience"] = df["Audience"].astype(str)

# dropping members
df.drop('members', axis=1, inplace=True)
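To see what the quantile binning does, here is a small sketch on nine made-up `members` values, using the same cut points and labels as above.

```python
import pandas as pd

# Made-up member counts, evenly spread for readability.
members = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

# qcut splits at the 33rd and 66th percentiles of the data itself,
# so each label covers roughly a third of the rows.
audience = pd.qcut(
    members,
    q=[0, .33, .66, 1.],
    labels=["Niche", "Universal", "Spectacle"],
).astype(str)
```

Because the cut points come from the data, the boundaries shift whenever the dataset changes; fixed thresholds with `pd.cut` would be the choice if stable category boundaries mattered.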

In [None]:
# saving the cleaned data -> will be used for training model
df.to_csv("../data/processed/clean_data.csv", index=False)

### DataFrame 2

This dataframe will be used to build a simple cosine similarity recommender based on user-item ratings.
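For orientation, this is roughly how data with `User_ID`, `name` and `Feedback` columns could feed a cosine similarity computation. It is a minimal sketch on made-up ratings, not the recommender built later; comparing users (rather than items) and filling missing ratings with 0 are both assumptions.

```python
import numpy as np
import pandas as pd

# Made-up ratings in the same three-column layout.
toy = pd.DataFrame({
    "User_ID": [1, 1, 2, 2, 3],
    "name": ["A", "B", "A", "B", "C"],
    "Feedback": [5, 4, 5, 4, 3],
})

# Rows are users, columns are anime; unrated anime become 0 (an assumption).
matrix = toy.pivot_table(index="User_ID", columns="name", values="Feedback", fill_value=0)

# Normalize each user vector to unit length; every user here has at least one rating,
# so no norm is zero.
vectors = matrix.to_numpy(dtype=float)
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# The dot product of unit vectors is the cosine similarity.
similarity = pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)
```

Users 1 and 2 rated the same anime identically, so their similarity is 1; user 3 shares no rated anime with them, so the similarity is 0.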

In [None]:
# merging df1 and viewing history
df2 = pd.merge(left=df1, right=history, on=['User_ID', 'Anime_ID'], how='left')

# dropping irrelevant columns
df2.drop(['episodes', 'anime_ids', 'Anime_ID', 'rating'], axis=1, inplace=True)

df2.dropna(inplace=True)

In [None]:
df2.head(2)

In [None]:
# cleaning anime names
df2['name'] = df2['name'].apply(text_cleaning)

# selecting relevant features
df2 = df2[['User_ID', 'name', 'Feedback']]

In [None]:
df2.to_csv('../data/processed/cosine_data.csv', index=False)