**Recommending Animes Using Clustering and Nearest Neighbors**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
anime = pd.read_csv("../input/anime.csv")
ratings = pd.read_csv("../input/rating.csv")

# Any results you write to the current directory are saved as output.

In [None]:
import warnings

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

# Note: sklearn.mixture.GMM was removed from scikit-learn and was unused here,
# so it is no longer imported.
pd.set_option('display.max_columns', 500)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
anime.head()

**Feature Engineering**

The **genre** attribute can tell a lot about how different animes are related. It is the first attribute anyone would look at to find out what type of anime it is, and when it comes to viewing, at least I'm very picky about what type of anime I'll watch.

We need to split the entries in the **genre** column into individual features and drop the original column.

In [None]:
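# Genres are separated by a comma plus a space; splitting on "," alone would
# create duplicate columns such as "Adventure" and " Adventure".
genre_dummies = anime["genre"].str.get_dummies(sep=", ")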
anime = pd.concat([anime, genre_dummies], axis=1)
anime.drop(["genre"], axis=1, inplace=True)
anime.head()
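
The separator passed to `get_dummies` matters. As a minimal sketch on hypothetical toy data (not the real `anime.csv`), and assuming genres are written as a comma followed by a space, this shows how splitting on a bare comma leaves a leading space in some column names:

```python
import pandas as pd

# Hypothetical toy data standing in for the genre column.
toy_genres = pd.Series(["Action, Adventure", "Adventure, Comedy"])

# Splitting on "," keeps the space, so "Adventure" and " Adventure" differ.
by_comma = toy_genres.str.get_dummies(sep=",")

# Splitting on ", " gives exactly one column per genre.
by_comma_space = toy_genres.str.get_dummies(sep=", ")

print(sorted(by_comma.columns))
print(sorted(by_comma_space.columns))
```

It is worth printing the resulting column names once to confirm that no genre appears twice.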

Next we need to check which columns have missing values and decide how to impute them.

In [None]:
null_columns=anime.columns[anime.isnull().any()]
anime[null_columns].isnull().sum()

Going first with **type**.

The easiest way for now to fill in the missing values in **type** is to use the most frequently occurring value in that column. Of course, for better results, "business knowledge" comes into the picture.

In [None]:
anime["type"].value_counts()

**TV** is the most common type, so let's use that.

In [None]:
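# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated and may not modify the DataFrame in recent pandas versions.
anime["type"] = anime["type"].fillna("TV")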

Next is **rating**.

The easiest way to fill missing ratings is to use the median of all ratings.

A better way is to compute the median rating for each type of anime and fill a missing rating with the median of that anime's type. To do that, we group the data on the **type** attribute and calculate the median for each group.

In [None]:
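grouped_type = anime.groupby(["type"])
# numeric_only=True skips text columns such as name, which recent pandas
# versions refuse to take a median of.
grouped_type_median = grouped_type.median(numeric_only=True)
grouped_type_median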

In [None]:
# helper function to find median rating of a particular anime type
def fillRatings(row, grouped_median):
    return grouped_median.loc[row["type"]]["rating"]

Iterate over all rows and fill missing ratings.

In [None]:
anime.rating = anime.apply(lambda row: fillRatings(row, grouped_type_median) if np.isnan(row['rating'])  else row["rating"], axis=1)
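
The row-wise `apply` works, but the same per-type median fill can also be written without a helper function. A minimal sketch on hypothetical toy data, using `groupby(...).transform("median")` as an alternative to the approach above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the anime frame.
toy = pd.DataFrame({
    "type": ["TV", "TV", "TV", "Movie", "Movie"],
    "rating": [7.0, 9.0, np.nan, 6.0, np.nan],
})

# Median rating of each type, broadcast back to every row of that type.
type_median = toy.groupby("type")["rating"].transform("median")

# Keep existing ratings and fill only the gaps.
toy["rating"] = toy["rating"].fillna(type_median)
print(toy)
```

Here the missing TV rating becomes 8.0 (the median of 7.0 and 9.0) and the missing Movie rating becomes 6.0.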

After looking at the data, I found out that not all animes have their **number of episodes** listed. **Unknown** is used to represent a missing number of episodes.

In [None]:
anime[anime["episodes"] == "Unknown"]

We need to replace **Unknown** with a number so that the models can work on the **episodes** attribute. The simplest option for now is to use **0**.

In [None]:
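# "Unknown" becomes NaN and then 0; the column is also converted from strings to integers
anime["episodes"] = pd.to_numeric(anime["episodes"], errors="coerce").fillna(0).astype(int)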

Now we need to encode the **type** attribute so that it is useful for the models.

In [None]:
le = preprocessing.LabelEncoder()
anime["type"] = le.fit_transform(list(anime["type"].values))
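
To see what the encoder actually does, here is a small sketch on a hypothetical list of types (not the real column). `LabelEncoder` sorts the distinct labels and assigns each one its position in that sorted order:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values standing in for the type column.
toy_types = ["TV", "Movie", "OVA", "TV"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_types)

# classes_ holds the sorted labels; a label's position is its integer code.
print(list(encoder.classes_))
print(codes.tolist())

# The codes can be mapped back to the original labels.
print(list(encoder.inverse_transform(codes)))
```

One thing to keep in mind: the integer codes imply an ordering (here Movie < OVA < TV) that distance-based models such as k-means will treat as meaningful, even though the types have no natural order.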

We are done with feature engineering. Now time for some analysis and recommendations.

Let's see if some features are correlated.

I'm also dropping the **members** attribute here, because the same people watch many different types of anime, so the member counts overlap heavily across types and will not help in recommendation.

PS. - I'm aware of the term **collaborative filtering** and that is not what I'm doing in this post so please no hate comments :)

In [None]:
anime_corr_df = anime.copy(deep=True)
anime_corr_df.drop(["anime_id", "name", "members"], axis=1, inplace=True)

In [None]:
k = 20 #number of variables for heatmap
corr = anime_corr_df.corr()
cols = corr.nlargest(k, 'rating')['rating'].index
cm = np.corrcoef(anime_corr_df[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(12, 12))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

The heatmap shows the 20 features most correlated with **rating**.

There isn't much correlation between any two features. Action and Adventure are correlated more than the others, but the correlation is still too low to justify PCA.

**Finding optimum number of clusters**

Depending on the problem, the number of clusters that you expect to be in the data may already be known. When the number of clusters is not known, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data, if any. However, we can quantify the "goodness" of a clustering by calculating each data point's **[Silhouette Coefficient](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)**. The silhouette coefficient for a data point measures how similar it is to its assigned cluster compared with the nearest other cluster, on a scale from -1 (dissimilar) to 1 (similar). The mean silhouette coefficient over all points provides a simple score for a given clustering.
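
To make the score concrete, here is a minimal sketch on hypothetical toy data. For one point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the points of the nearest other cluster, and the coefficient is `(b - a) / max(a, b)`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical toy data: two tight, well separated groups on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# Silhouette coefficient of the first point, computed by hand.
a = 1.0                  # distance to the only other point in its cluster
b = (10.0 + 11.0) / 2    # mean distance to the points of the other cluster
by_hand = (b - a) / max(a, b)

# The same quantity from scikit-learn, plus the mean over all points.
per_point = silhouette_samples(X, labels)
mean_score = silhouette_score(X, labels)

print(by_hand, per_point[0])
print(mean_score)
```

Because the two groups are far apart relative to their spread, every point scores close to 1.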

In [None]:
def getScores(num_clusters):
    clusterer = KMeans(n_clusters=num_clusters, random_state=42).fit(anime_corr_df)

    # Predict the cluster for each data point
    preds = clusterer.predict(anime_corr_df)

    # Calculate the mean silhouette coefficient for the number of clusters chosen
    score = silhouette_score(anime_corr_df, preds)
    return score

# DataFrame.append was removed in pandas 2.0, so build the table in one step
cluster_range = list(range(2, 10))
scores = pd.DataFrame({'Silhouette Score': [getScores(i) for i in cluster_range]}, index=cluster_range)
scores.columns.name = 'Number of Clusters'

display(scores)

The maximum score is achieved when the number of clusters is 2, so let's use that.

In [None]:
clusterer = KMeans(n_clusters=2, random_state=42).fit(anime_corr_df)

# Predict the cluster for each data point
preds = clusterer.predict(anime_corr_df)

# Find the cluster centers
centers = clusterer.cluster_centers_

# Calculate the mean silhouette coefficient for the number of clusters chosen
score = silhouette_score(anime_corr_df, preds)

print(score)

Let's see what names are present in the 2 clusters.

In [None]:
clusters = clusterer.labels_.tolist()
animes = { 'name': np.array(anime.name), 'cluster': clusters}

In [None]:
# Index the frame by cluster label (a flat index rather than a one-level MultiIndex)
frame = pd.DataFrame(animes, index=clusters, columns=['name', 'cluster'])

In [None]:
frame['cluster'].value_counts()

In [None]:
def showClusters(clusterer, frame, num_clusters):
    print("Top names per cluster:")
    print()

    for i in range(num_clusters):
        print("Cluster %d names:" % (i+1), end='')
        # .ix was removed from pandas; select the rows of cluster i with a boolean mask
        for title in frame.loc[frame['cluster'] == i, 'name'].tolist()[:50]:
            print(' %s,' % title, end='')
        print()
        print()

In [None]:
showClusters(clusterer, frame, 2)

**Observations**

* Having only 2 clusters is too coarse to be useful. We'll fix that.
* Googling the names present in the 2nd cluster, I found out that they are intended for children; **Doraemon** and **Ninja Hatori** confirm that.

**Ding, ding, ding**

**Another way of finding optimal number of clusters**

We'll be using the famous **[Elbow Method](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_elbow_method)** here. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k, and for each value of k calculate the sum of squared errors (SSE). Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then the "elbow" on the arm is the best value of k. We want a small SSE, but the SSE tends to decrease toward 0 as we increase k (the SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster). So our goal is to choose a small value of k that still has a low SSE, and the elbow usually marks the point where increasing k starts to give diminishing returns.

Let's hope this will be more "separating". 
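
Before running this on the full dataset, the behaviour of the SSE described above can be checked on a tiny example. This is a minimal sketch on hypothetical toy data (three obvious pairs of points), using the `inertia_` attribute of `KMeans` as the SSE:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well separated pairs of points.
X = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [5.0, 5.0], [5.0, 6.0],
    [10.0, 0.0], [10.0, 1.0],
])

# Fit k-means for every k from 1 up to the number of points and record the SSE.
sse = {}
for k in range(1, len(X) + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 3))
```

The SSE drops sharply until k = 3, the true number of groups, then flattens out and reaches 0 when every point is its own cluster. That sharp bend is the elbow.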

In [None]:
res = []
for k in range(2, 20):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(anime_corr_df)
    # inertia_ is the within-cluster sum of squared errors (SSE)
    res.append((k, kmeans.inertia_))

In [None]:
plt.plot(*zip(*res))
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE")
plt.show()

The elbow is somewhat visible at 6, so let's use that. Anything greater than 2 will do for now!

In [None]:
clusterer = KMeans(n_clusters=6, random_state=42).fit(anime_corr_df)

# Predict the cluster for each data point
preds = clusterer.predict(anime_corr_df)

# Find the cluster centers
centers = clusterer.cluster_centers_

clusters = clusterer.labels_.tolist()
animes = {'name': np.array(anime.name), 'cluster': clusters}
# Index the frame by cluster label (a flat index rather than a one-level MultiIndex)
frame = pd.DataFrame(animes, index=clusters, columns=['name', 'cluster'])

In [None]:
frame['cluster'].value_counts()

Looks like a better separation.

In [None]:
showClusters(clusterer, frame, 6)

OK!

Now **Doraemon** and **Ninja Hatori** are in different clusters.

**Ding, ding ding??**

**Recommendation and validation with clustering**

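I'm using a K-Nearest Neighbours model to give recommendations, with a [Ball Tree](http://scikit-learn.org/stable/modules/neighbors.html#ball-tree) as the search structure.

We ask for 6 neighbours per anime. The closest match is normally the anime itself, so that leaves the top 5 similar animes as recommendations.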

In [None]:
neighbours = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(anime_corr_df)

In [None]:
distances, indices = neighbours.kneighbors(anime_corr_df)
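
To show what `kneighbors` returns when it is queried with the same data it was fitted on, here is a minimal sketch on hypothetical one-dimensional toy data. Each row of `indices` lists the nearest points in order of distance, starting with the query point itself at distance 0:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: four distinct points on a line.
X = np.array([[0.0], [1.0], [3.0], [7.0]])

nn = NearestNeighbors(n_neighbors=3, algorithm="ball_tree").fit(X)
distances, indices = nn.kneighbors(X)

# Column 0 is the query point itself; the remaining columns are its neighbours.
print(indices)
print(distances)
```

This is why the first entry of each row has to be skipped when printing recommendations, and why asking for 6 neighbours yields 5 usable results.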

In [None]:
def get_index_from_name(name):
    return anime[anime["name"]==name].index.tolist()[0]

In [None]:
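# method to find the similar animes
def print_similar_animes(query):
    anime_idx = get_index_from_name(query)
    # the first neighbour is the query itself, so skip it
    for idx in indices[anime_idx][1:]:
        # .ix was removed from pandas; indices holds row positions, so use .iloc
        print(anime.iloc[idx]["name"])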

Time to see if our clusters are good enough for recommending!

In [None]:
print_similar_animes("Hunter x Hunter (2011)")

In [None]:
print_similar_animes("Doraemon (1979)")

In [None]:
print_similar_animes("Naruto")

Not bad :)

I was expecting to get **Naruto: Shippuuden** when querying here, but it's not shown, which means there is plenty of scope for improvement.

Maybe some other day.