# Course: Machine Learning1 - kNN
<div class="alert alert-block alert-info">
Project: 01 <br> <br>
Team members
<ul>
<li>Mauro Travieso Pena</li>
<li>Quoc Huy Luong</li>
<li>Ngoc Bao Tran</li>
</ul>
</div>

## Exploratory Data Analysis


### Context
This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.


### Content

#### Anime.csv

**anime_id** - myanimelist.net's unique id identifying an anime.

**name** - full name of anime.

**genre** - comma separated list of genres for this anime.

**type** - movie, TV, OVA, etc.

**episodes** - how many episodes in this show. (1 if movie).

**rating** - average rating out of 10 for this anime.

**members** - number of community members that are in this anime's "group".

##### Dataset reference: https://www.kaggle.com/CooperUnion/anime-recommendations-database?fbclid=IwAR3sXr48_xQHp8NgF9AyXuVf0RGwTkFw8bfkRoXda6zix9rQsevpya8JDOM#rating.csv

## Step 1 - Importing the DataFrame (CSV to DataFrame)

### Libraries

In [None]:
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn import preprocessing

#### MyPCA

In [None]:
def myPCA(data,n):
    pca = PCA(n_components=n)
    pca.fit(data)
    df = pca.transform(data)
    PCA_Data = pd.DataFrame(df)
    return PCA_Data

#### myNormalize

In [None]:
def myNormalize(data):
    min_max_scaler = preprocessing.MinMaxScaler()
    Normalized_Data = min_max_scaler.fit_transform(data)
    Normalized_Data = pd.DataFrame(Normalized_Data)
    return Normalized_Data

#### myEncode

In [None]:
def myEncode(data,col): 
    NewData_Encode = data.copy()
    NewData_Encode = pd.get_dummies(NewData_Encode, columns=col, prefix = col)
    return NewData_Encode


#### myCleanAndTransformData

In [None]:
def myCleanAndTransformData(data):
    
    #Drop null rows
    NewData = data.dropna()
    #Remove rows with an unknown number of episodes (copy to avoid chained-assignment warnings)
    NewData = NewData[NewData['episodes']!='Unknown'].copy()
    #Add a new column with the rating class
    NewData['Class']=1
    # 1: High
    # or 0: Low based on rating
    NewData.loc[NewData['rating'] >= NewData['rating'].mean(), 'Class'] = 1
    NewData.loc[NewData['rating'] < NewData['rating'].mean(), 'Class'] = 0
    
    #Split genre values into rows
    NewData = pd.DataFrame(NewData.genre.str.split(',').tolist(), index=[NewData.anime_id,NewData.type,NewData.episodes,NewData.rating,NewData.members,NewData.Class]).stack()
    NewData = NewData.reset_index([0,'anime_id','type','episodes','rating','members','Class'])
    NewData.columns=['anime_id','type','episodes','rating','members','Class','genre']
    
    #Encode type feature: 6 unique values
    NewData = myEncode(NewData,['type'])
 
    #Encode genre feature: 82 unique values
    NewData = myEncode(NewData,['genre'])
 
    #Drop anime_id and rating (Class is kept as the target)
    NewData = NewData.drop(['rating'],axis=1)
    NewData = NewData.drop(columns=['anime_id'])
    #NewData = NewData.drop(columns=['episodes'])  
    
    return NewData
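
The genre-splitting and encoding steps of the function above can be illustrated on a toy frame. This is only a sketch: the toy data, the variable names and the use of `explode` and `str.strip` are assumptions made here, not part of the original function. In particular, the function above does not strip whitespace, so if the genres are separated by a comma followed by a space, ' Comedy' and 'Comedy' would end up as two different encoded columns.

```python
import pandas as pd

# Toy stand-in for the anime table (hypothetical values)
toy = pd.DataFrame({
    'anime_id': [1, 2],
    'genre': ['Action, Comedy', 'Comedy'],
})

# Split the comma-separated genres and create one row per (anime, genre) pair
long_form = toy.assign(genre=toy['genre'].str.split(',')).explode('genre')

# Strip the leading space left by the ', ' separator
long_form['genre'] = long_form['genre'].str.strip()

# One-hot encode the genre column, as myEncode does with pd.get_dummies
encoded = pd.get_dummies(long_form, columns=['genre'], prefix='genre')
print(encoded)
```

Each anime then appears once per genre, which is also what happens to the rows of the real dataset after this transformation.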


#### mySplitData

In [None]:
def mySplitData(X_Data,Y_Data,test_size,random_state):
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_Data, Y_Data, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

def mySplitDataByTrainSize(X_Data,Y_Data,train_size,random_state):
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X_Data, Y_Data, train_size=train_size, random_state=random_state)
    X_train, X_test, y_train, y_test = mySplitData(X_train,y_train,0.33,random_state)
    return X_train, X_test, y_train, y_test

### Datasets


In [None]:
df = pd.read_csv('../data/anime.csv')
# Keep an untouched copy of the raw data for the exploration below
RawData = df.copy()
df.head()

## Checking the structure of the columns


### Visualizing the number of columns and features' names associated to the dataset

In [None]:
RawData.columns

In [None]:
print("Dimensions of DataFrame: {}".format(RawData.shape))

### The dataset has 12,294 rows, grouped in 7 columns.

### Obtaining a sample of the associated data 

In [None]:
RawData.sample(5)

### The sample shows the nature of the columns (categorical and numerical).

## Descriptive summary of the dataset features

In [None]:
print(RawData.dtypes)

### Number of unique values per feature

In [None]:
RawData.nunique()

### Number of total values per feature 

In [None]:
RawData.info()

#### When complete, every feature should contain 12,294 values. However, some features do not reach that count, which indicates the presence of missing values.

## Exploring the missing values

In [None]:
missing = RawData.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
# Draw a single bar chart and label it through the returned Axes
ax = missing.plot.bar()
ax.set_title("Missing Values by Feature")
ax.set_xlabel('Dataset Features')
ax.set_ylabel('Missing Values')
missing.head()

#### The feature with the most missing values is 'rating', followed by 'genre' and 'type' respectively.

## Exploring the whole dataset statistically.

In [None]:
RawData.describe()

In [None]:
RawData['anime_id'].describe()

In [None]:
RawData['anime_id'].unique()

In [None]:
RawData['name'].describe()

In [None]:
RawData['name'].unique()

RawData['genre'].describe()

In [None]:
RawData['genre'].unique()

In [None]:
RawData['type'].describe()

RawData['type'].unique()

RawData['episodes'].describe()

In [None]:
RawData['episodes'].unique()

In [None]:
RawData['rating'].describe()

In [None]:
RawData['rating'].unique()

In [None]:
RawData['members'].describe()

In [None]:
RawData['members'].unique()

## Dropping the rows with missing data ('NaN')

In [None]:
df = df.dropna()

df['rating'].describe()

## Exploring the Numerical Features

### Feature: **rating**

In [None]:
RawData['rating'].describe()

In [None]:
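# distplot is deprecated in recent seaborn releases; histplot with a KDE curve is the replacement
sns.histplot(df['rating'], kde=True);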

In [None]:
bplot = sns.boxplot(x='rating', 
                 data=df, 
                 width=0.5,
                 palette="colorblind")

### Feature: **members**

In [None]:
df['members'].describe()

In [None]:
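# distplot is deprecated in recent seaborn releases; histplot with a KDE curve is the replacement
sns.histplot(df['members'], kde=True);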

In [None]:
bplot = sns.boxplot(x='members', 
                 data=df, 
                 width=0.5,
                 palette="colorblind")

### Feature: **episodes**

In [None]:
df['episodes'].describe()

### The numerical feature 'episodes' is stored as an object and also contains the categorical value 'Unknown', so the rows with that value must be removed from the dataset.

In [None]:
df = df[df['episodes']!='Unknown']

In [None]:
df['episodes'].unique()

### Now the feature can be converted from object (categorical) to int (numerical) and explored.
https://stackoverflow.com/questions/48094854/python-convert-object-to-float
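
The conversion itself is not shown in the notebook. A minimal sketch of how it could be done is given below, assuming the column holds numeric strings once the 'Unknown' rows are removed; the toy frame and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the 'episodes' column: numeric strings plus the 'Unknown' placeholder
toy = pd.DataFrame({'episodes': ['1', '26', 'Unknown', '12']})

# Keep only the rows with a known episode count, then cast the strings to integers
toy = toy[toy['episodes'] != 'Unknown'].copy()
toy['episodes'] = toy['episodes'].astype(int)

print(toy['episodes'].dtype)
print(toy['episodes'].tolist())
```

Applied to the notebook's DataFrame, the same cast would be `df['episodes'] = df['episodes'].astype(int)`.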

In [None]:
print("Kurtosis Members: %f" % df['members'].kurt())

### Kurtosis measures how heavy the tails of a distribution are compared with a normal distribution; a high value often goes together with a sharp peak (a very narrow distribution with most of the responses in the center).

### In other words, it reflects how much the tails contribute to the whole distribution.

### Again, rating shows a value relatively close to zero, which means that the distribution of this feature resembles a balanced Gaussian bell curve.

### The features episodes and members show a significant amount of kurtosis.
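
To make the comparison concrete, the sketch below computes the kurtosis of a bell-shaped and a heavy-tailed column. The synthetic data and column names are assumptions for illustration only. pandas' `kurt()` returns the excess kurtosis, so a normal distribution scores close to zero.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one roughly Gaussian, one with a long right tail
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'bell_shaped': rng.normal(size=10_000),
    'heavy_tailed': rng.lognormal(sigma=1.0, size=10_000),
})

# Excess kurtosis per column (0 corresponds to a normal distribution)
kurt = toy.kurt()
print(kurt)
```

On the notebook's data, the equivalent call would be `df[['rating', 'episodes', 'members']].kurt()`, provided 'episodes' has already been converted to a numeric type.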

In [None]:
print("Dimensions of DataFrame: {}".format(df.shape))

In [None]:
#ndf = df.drop(columns=['anime_id','name','genre','type'])
#ndf = ndf.pivot("episodes","rating","members")
#ax = sns.heatmap(ndf)

#flights = sns.load_dataset("flights")
#flights = flights.pivot("month", "year", "passengers")
#ax = sns.heatmap(flights)

## Removing the Outliers.

### From the boxplots above, outliers are present in every numerical feature. In the following lines of code, they are identified so that their effect on the distribution can be assessed.

### We will use the Z-score function defined in the scipy library to detect the outliers. The Z-score expresses each value as its distance from the feature mean in standard deviations; values with an absolute Z-score greater than three are treated as outliers.

https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
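
The Z-score rule described above can be sketched on synthetic data as follows. The toy frame, the injected outlier and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Two roughly Gaussian columns with one obvious outlier injected in the first row
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'a': rng.normal(size=200),
    'b': rng.normal(size=200),
})
toy.loc[0, 'a'] = 50.0

# Absolute Z-score of every value, computed column by column
z = np.abs(stats.zscore(toy.to_numpy()))

# Keep only the rows in which every feature lies within three standard deviations
mask = (z < 3).all(axis=1)
filtered = toy[mask]

print(len(toy), len(filtered))
```

A row is dropped as soon as one of its features exceeds the threshold, so the filter acts on the features jointly rather than one at a time.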

In [None]:
from scipy import stats
import numpy as np
df = df.drop(columns=['anime_id','name','genre','type'])
# Cast to float first so that 'episodes' (read as strings) can be standardized
z = np.abs(stats.zscore(df.astype(float)))
print(z)

### With this line of code, every row that contains an outlier in at least one numerical feature is removed from the dataset.

In [None]:
df = df[(z < 3).all(axis=1)]
df.head()

### As a result, the new dataset loses only a few rows in this operation (-3.25%).

In [None]:
print("Dimensions of DataFrame: {}".format(df.shape))


### Scatter plots - Relationships between the numerical variables.

In [None]:
plt.scatter( df['episodes'], df['rating'], marker='o')
plt.title('Rating Vs. Number of Episodes')
plt.xlabel('Number of Episodes')
plt.ylabel('Rating')

In [None]:
plt.scatter(df['members'], df['rating'],marker='o')
plt.title('Rating Vs. Number of Members')
plt.xlabel('Number of Members')
plt.ylabel('Rating')

## Applying the kNN algorithm to model the dataset.

In [None]:
#### Clean and Transform Data

Cleaned_Data = myCleanAndTransformData(RawData)
Y_Data = Cleaned_Data['Class']
X_Data = Cleaned_Data.drop(columns=['Class'])

#### Normalize  Data

Normalized_Data = myNormalize(X_Data)

In [None]:
#### PCA

In [None]:
n_components=40
PCA_Data = myPCA(Normalized_Data,n_components)
PCA_Data.head()


In [None]:
####----------------------------------------------------------------
#### Split  PCA_Data
####----------------------------------------------------------------

PCA_X_train, PCA_X_test, PCA_y_train, PCA_y_test  = mySplitData(PCA_Data,Y_Data,0.33,42)

PCA_X_train.head()

In [None]:
PCA_X_test.head()

In [None]:
PCA_y_train.head()

PCA_y_test.head()

#### https://towardsdatascience.com/building-a-k-nearest-neighbors-k-nn-model-with-scikit-learn-51209555453a

### To apply the k-Nearest Neighbors classification algorithm to the dataset, a class label is required because kNN is a supervised learning method.

### "A supervised learning model takes in a set of input objects and output values. The model then trains on that data to learn how to map the inputs to the desired output so it can learn to make predictions on unseen data."

In [None]:
### The feature selected for prediction in our dataset is the variable 'rating', so it is treated as the class label for the model (binarized below into high and low).

# Add a new column rating class 
df['Class']=1 #df['rating']
df.sample(5)

In [None]:
# 1: High
# or 0: Low based on rating
df.loc[df['rating'] >= df['rating'].mean(), 'Class'] = 1
df.loc[df['rating'] < df['rating'].mean(), 'Class'] = 0

df.sample(5)

In [None]:
print(df.dtypes)

In [None]:
### Split up the dataset into inputs (X) and target (y)

### Drop the Categorical features as well as the class label (rating)

In [None]:
#X = df.drop(columns=['anime_id','name','genre','type','rating','Class'])
X = df.drop(columns=['rating','Class'])

X.head()

In [None]:
## Normalization

In [None]:
### kNN, like other distance-based algorithms, is very sensitive to the ranges of the feature values, so the numerical features must be normalized.

### Since the feature values are positive in real life, it makes sense to use min-max normalization.

#https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame
#X=(X-X.mean())/X.std()
X=(X-X.min())/(X.max()-X.min())
X.head()
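
As a sanity check, the manual min-max formula used above can be compared with scikit-learn's `MinMaxScaler` (the scaler used in `myNormalize`). The toy values below are hypothetical; this is only a sketch of the equivalence.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numerical features on very different scales
toy = pd.DataFrame({
    'episodes': [1, 12, 26, 100],
    'members': [50, 2000, 150000, 1000000],
})

# Manual min-max normalization: every column is mapped to the range [0, 1]
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```

After scaling, both features contribute on a comparable scale to the distances that kNN computes.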

In [None]:
### Separate the target values (class associated)

In [None]:
y = df['Class']

In [None]:
### Inspect the target values nature

In [None]:
df['Class'].hist(bins=3,figsize=(9,7),grid=False)

y.sample(5)

In [None]:
y.describe()

## Split the dataset into train and test data

from sklearn.model_selection import train_test_split

In [None]:
#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

#### This means that 20% of all the data will be used for testing, which leaves 80% of the data as training data for the model to learn from. Setting ‘random_state’ to 1 ensures that we get the same split each time so we can reproduce our results.

#### Setting ‘stratify’ to y makes both splits preserve the proportion of each value in the y variable. For example, if 25% of ratings were high and 75% were low, setting ‘stratify’ to y would ensure that each split also has 25% high and 75% low ratings. In our case, the split point is the mean rating, so the two classes are close to 50% each.
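
The effect of ‘stratify’ can be checked on a small synthetic example. The arrays below are hypothetical and use the 25% / 75% proportions from the explanation above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples: 25 labelled high (1) and 75 labelled low (0)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 25 + [0] * 75)

# Stratified 80/20 split, with the same arguments as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

# Share of the high class in each split
print(y_tr.mean(), y_te.mean())
```

Both splits keep the class proportions of the full label vector, which is what makes the test accuracy comparable with the training conditions.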

## Building and training the model

In [None]:
#### Clean and Transform Data

Cleaned_Data = myCleanAndTransformData(RawData)
Y_Data = Cleaned_Data['Class']
X_Data = Cleaned_Data.drop(columns=['Class'])

In [None]:
#### Normalize  Data

Normalized_Data = myNormalize(X_Data)

#### PCA

n_components=40
PCA_Data = myPCA(Normalized_Data,n_components)
PCA_Data.head()


In [None]:
####----------------------------------------------------------------
#### Split  PCA_Data
####----------------------------------------------------------------

In [None]:
X_train, X_test, y_train, y_test  = mySplitData(PCA_Data,Y_Data,0.33,42)
X = X_train #PCA_Data
y = y_train

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors = 3)
# Fit the classifier to the data
knn.fit(X_train,y_train)

## Testing the model

#show first 5 model predictions on the test data
knn.predict(X_test)[0:5]

In [None]:
### We can see that the model predicted ‘high rating’ for the first three and the fifth anime in the test set and ‘low rating’ for the fourth anime.

## Check accuracy of our model on the test data

knn.score(X_test, y_test)

In [None]:
### Our model has an accuracy of approximately 78.21%, which gives us a good start, but we will see how we can increase model performance following the steps below.

## k-Fold Cross-Validation

from sklearn.model_selection import cross_val_score
import numpy as np

In [None]:
#create a new KNN model
knn_cv = KNeighborsClassifier(n_neighbors=3)

## CV score (accuracy) and the average of them

#train model with cv of 5, then print each cv score (accuracy) and average them
cv_scores = cross_val_score(knn_cv, X, y, cv=5)
print(cv_scores)
print('cv_scores mean:{}'.format(np.mean(cv_scores)))

In [None]:
### Using cross-validation, our mean score is about 77.77%. This is a more accurate representation of how our model will perform on **unseen data** than our earlier testing using the holdout method.
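
For readers who want to try the procedure in isolation, here is a minimal sketch of 5-fold cross-validation with kNN. It runs on synthetic data from `make_classification`, which is an assumption of this example and not the anime dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem standing in for the real features
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# The data is split into 5 folds; each fold serves once as the validation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_toy, y_toy, cv=5)

print(scores)
print('mean accuracy: {}'.format(np.mean(scores)))
```

Every sample is used for validation exactly once, so the mean of the five scores depends less on one particular split than a single holdout score does.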

## Hyperparameter tuning using GridSearchCV

### Hyperparameter tuning is the process of searching for the parameter values that give your model the best accuracy. In our case, we will use GridSearchCV to find the optimal value for ‘n_neighbors’.

from sklearn.model_selection import GridSearchCV

In [None]:
#create a new knn model
knn2 = KNeighborsClassifier()

In [None]:
#create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1, 40)}

#use gridsearch to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)

#fit model to data
knn_gscv.fit(X, y)

In [None]:
## Check top performing n_neighbors value

In [None]:
### After training, we can check which of our values for ‘n_neighbors’ that we tested performed the best. To do this, we will call ‘best_params_’ on our model.

knn_gscv.best_params_

### As we can see, 9 is the optimal value for ‘n_neighbors’. 

### We can use the ‘best_score_’ attribute to check the accuracy of our model when ‘n_neighbors’ is 9.

### ‘best_score_’ outputs the mean accuracy of the scores obtained through cross-validation.
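
The relation between ‘best_params_’, ‘best_score_’ and the per-candidate cross-validation results can be sketched as follows. The synthetic data and the smaller grid (1 to 15) are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem standing in for the real features
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

# Try every n_neighbors value in the grid with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': np.arange(1, 16)}, cv=5)
search.fit(X_toy, y_toy)

# Mean validation accuracy of each candidate, in grid order
mean_scores = search.cv_results_['mean_test_score']
best_k = search.best_params_['n_neighbors']

print(best_k, search.best_score_)
```

‘best_score_’ is simply the highest entry of `mean_test_score`, and ‘best_params_’ holds the grid value that produced it.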

In [None]:
## Check mean score for the top performing value of n_neighbors

knn_gscv.best_score_


### Our model now gives us a mean cross-validated accuracy of 79.18%, which estimates how well it will predict on new data.

In [None]:
## Confusion Matrix (get the predictions using the classifier which was fitted above)