# Tutorial 03: Classification I

In this notebook, we classify robot movements in an environment with walls using a real dataset.

Install the necessary libraries on your PC or in a virtual environment using the provided Requirements.txt.

## Import Important Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from functions.fisher_score import fisher_index_calc
from sklearn.metrics import accuracy_score, make_scorer
from functions.plot_confusion_matrix import plot_confusion_matrix
import math, random

## Task 1: Data Preprocessing

Load the data from the 'Data' folder, analyse it, and answer the following questions.

1. What is the sample size?
2. Is the data labeled? If yes, print the labels.
3. Check the features and their data types.
4. Plot the distribution of the data with respect to the labels.

In [None]:
# Load the data from csv to Pandas Dataframe

data = np.loadtxt("Data/sensor_readings_24.csv", delimiter=',', dtype=str)

df = pd.DataFrame(data[:,:24], dtype=np.float64)
df = pd.concat([df, pd.DataFrame(data[:, 24], columns=['Label'])], axis=1)

In [None]:
# Verify the data by printing a sample of it and its shape
print("Size of the Data:", ...)
print("Sample Data:\n", ...)
...
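
The checks asked for in Task 1 could look roughly like the sketch below. The small DataFrame `toy` and its label values are made up for illustration and only stand in for `df`, which has 24 sensor columns plus `Label`.

```python
import pandas as pd

# Toy stand-in for df: two numeric features plus a string label column.
toy = pd.DataFrame({
    0: [0.4, 1.2, 0.9, 2.5],
    1: [1.1, 0.7, 3.0, 0.2],
    "Label": ["Move-Forward", "Sharp-Right-Turn", "Move-Forward", "Move-Forward"],
})

n_samples, n_columns = toy.shape          # rows are samples, columns are features + label
labels = sorted(toy["Label"].unique())    # distinct class names

print("Number of samples:", n_samples)
print("Number of columns:", n_columns)
print("Labels:", labels)
print(toy.dtypes)                         # numeric feature columns vs. the label column
print(toy.head())                         # first rows as a quick sanity check
```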

### Verify the data distribution and check the following.

1. Is normalization required?
2. What do you observe about the data distribution? Assess whether the data is balanced.
3. Based on your analysis, what further processing do you think is needed?

In [None]:
# Summarise the dataset with the describe function and analyse the result.
...
...

In [None]:
# Group the data by 'Label'
...
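
One possible way to approach the normalization and balance questions is sketched below. The DataFrame `toy`, its column names, and the class names `A` and `B` are hypothetical and not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "f0": [0.5, 0.6, 4.8, 5.0, 0.4, 4.9],
    "f1": [100.0, 120.0, 90.0, 110.0, 95.0, 105.0],
    "Label": ["A", "A", "B", "B", "A", "A"],
})

# Per-feature summary statistics (describe skips the non-numeric Label column).
summary = toy.describe()
ranges = summary.loc["max"] - summary.loc["min"]

# Samples per class: strongly unequal counts indicate an imbalanced dataset.
class_counts = toy.groupby("Label").size()
class_share = class_counts / len(toy)

print(ranges)
print(class_counts)
print(class_share.round(2))
```

Features whose ranges differ by orders of magnitude are a hint that scaling may help, since distance-based methods such as KNN are sensitive to feature scale.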

In [None]:
# Plot the data distribution using SNS countplot

fig = plt.figure(figsize=(15,5))
axis = ...

# Task 2: Feature Selection

## A Little Theory on Feature Selection
Different types of Feature Selection methods:  
<img src="figures/feature_selection_methods.png">  
Source: Medium.com

#### Correlation Statistics  
The scikit-learn library provides an implementation of most of the useful statistical measures.  
For example:  
1. Pearson’s Correlation Coefficient: f_regression()  
2. ANOVA: f_classif()  
3. Chi-Squared: chi2()  
4. Mutual Information: mutual_info_classif() and mutual_info_regression()  

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).

#### Selection Method
The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.
Two of the more popular methods include:
1. Select the top k variables: SelectKBest
2. Select the top percentile variables: SelectPercentile
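
As an illustration of how a statistic and a selection method combine, here is a minimal sketch using `SelectKBest` with `f_classif` on synthetic data. The data, the choice `k=2`, and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)          # three classes, 30 samples each
X = rng.normal(size=(90, 6))          # six noise features
X[:, 1] += 3.0 * y                    # feature 1 depends strongly on the class
X[:, 4] += 2.0 * y                    # feature 4 depends on the class as well

# Keep the 2 features with the highest ANOVA F-scores.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                      # (90, 2)
print(selector.get_support(indices=True))    # indices of the kept columns
print(selector.scores_.round(1))             # F-score of every input feature
```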

### Answer the following:
1. Which types of feature selection methods are applicable to the given dataset?
2. What do you think about the data size, and how does it influence learning?
3. Do we need a large dataset to train the models for better results?
4. What is meant by a large dataset: a large number of samples or a large number of features?

In [None]:
# Simple way of feature selection.
# Apply a suitable feature selection method from the description above and extract the data.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# generate dataset

# define feature selection
fs = ...
# apply feature selection
df_selected_1 = ...
print(df_selected_1.shape)

In [None]:
# Similarity Based Feature Selection using Fisher Score 

training_set = ... # Extract Training set without labels
label_set = ... # Extract labels

# Get the fisher scores
fisher_scores = ...

# Plot the fisher scores (DataFrame.plot.bar creates its own figure)
df_fisher = pd.DataFrame({'Fisher Scores of the Features': fisher_scores})
ax = df_fisher.plot.bar(figsize=(20, 10))
plt.show()
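
For intuition about what such a score measures, the sketch below computes one common form of the Fisher score: between-class scatter of a feature divided by its within-class scatter. The function `fisher_score` is a hypothetical helper written for this example; the provided `fisher_index_calc` may use a different signature or normalisation.

```python
import numpy as np

def fisher_score(X, y):
    """Return one Fisher score per column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]                                  # samples of class c
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / within                             # assumes within > 0

X = np.array([[1.0, 5.0], [2.0, 1.0], [9.0, 4.0], [10.0, 2.0]])
y = np.array(["a", "a", "b", "b"])
scores = fisher_score(X, y)
print(scores)   # [64.  0.]
```

The first feature separates the two classes well and gets a high score; the second has identical class means and scores zero.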

### What do you infer from the above plot?

In [None]:
# Perform feature selection by analysing the above plot
# Remove the features that are not significant according to your analysis.

to_remove = []
for i in range(len(fisher_scores)):
    if ...: # Condition for data filtering
        # we mark for removal
        to_remove.append(i)

df_selected_2 = ... # Delete the data to be removed from training_set
df_selected_2.shape
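
A threshold-based filter of this kind could be sketched as follows. The scores and the cut-off `0.05` are invented for the example; a real threshold should come from your reading of the plot.

```python
import numpy as np

scores = np.array([0.02, 0.35, 0.01, 0.20])     # made-up Fisher scores
X = np.arange(12, dtype=float).reshape(3, 4)    # 3 samples, 4 features

threshold = 0.05                                # assumed cut-off
to_remove = [i for i, s in enumerate(scores) if s < threshold]
X_reduced = np.delete(X, to_remove, axis=1)     # drop the marked columns

print(to_remove)          # [0, 2]
print(X_reduced.shape)    # (3, 2)
```

For a pandas DataFrame, `df.drop(columns=df.columns[to_remove])` achieves the same result.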

# Task 3: Model Learning for Classification

Data Preparation for Learning

In [None]:
# Test and Train data splitting
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


labelEn = LabelEncoder()
encoded_labels = ... # fit the label encoder to encode the labels as numerical values
class_names = labelEn.classes_

X_train, X_test, y_train, y_test = ... # perform train test split with test size 0.3
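
The sketch below shows how label encoding and a 70/30 split behave on toy data. The arrays, the label names, and the use of `random_state` and `stratify` are assumptions for this example and are not required by the task.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

X = np.arange(40, dtype=float).reshape(20, 2)
labels = np.array(["left", "right"] * 10)

encoder = LabelEncoder()
y = encoder.fit_transform(labels)     # classes are sorted: "left" -> 0, "right" -> 1

# stratify=y keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(encoder.classes_)
print(X_tr.shape, X_te.shape)         # (14, 2) (6, 2)
```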


## KNN Classifier

### Implement your own KNN Classifier

In [None]:
# Implement KNN classifier with Euclidean distance
def mode(values):  # parameter renamed so it does not shadow the built-in list
    return ... # Formula for mode
def euclidean(point, data):
    return ... # Formula for Euclidean Distance

class KNeighborsClassifier:
    def __init__(self, k=5, dist_metric=euclidean):
        self.k = k
        self.dist_metric = dist_metric
    def fit(self, X_train, y_train):
        ... # Logic to fit the data
    def predict(self, X_test):
        neighbors = []
        for x in X_test:
            distances = self.dist_metric(x, self.X_train)
            y_sorted = ... # Sort the values
            neighbors.append(y_sorted[:self.k])
        return list(map(mode, neighbors))
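
To make the idea concrete, here is a minimal function-based sketch of k-nearest neighbours on toy data. It is one possible formulation, not the reference solution for the class above; `most_common` and `knn_predict` are names invented for this example.

```python
import numpy as np
from collections import Counter

def euclidean(point, data):
    """Distance from one point to every row of data."""
    return np.sqrt(np.sum((data - point) ** 2, axis=1))

def most_common(values):
    """Most frequent element; ties go to the element seen first."""
    return Counter(values).most_common(1)[0][0]

def knn_predict(X_train, y_train, X_test, k=3):
    predictions = []
    for x in X_test:
        distances = euclidean(x, X_train)
        nearest = np.argsort(distances)[:k]      # indices of the k closest rows
        predictions.append(most_common(y_train[nearest].tolist()))
    return np.array(predictions)

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])

print(knn_predict(X_train, y_train, X_test, k=3))   # [0 1]
```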

In [None]:
model = ... # define a model from the above class with k=3
... # fit the model 
y_pred = ... # predict the value from the model for  X_test data.

In [None]:
# Plot The confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, title='Confusion matrix For KNN Classification')

In [None]:
print("Performance - " + str(100 * accuracy_score(y_test, y_pred)) + "%")
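
To see how accuracy relates to the confusion matrix, the following sketch builds the matrix by hand with NumPy. The function `confusion_matrix` here is a hypothetical helper with rows as true classes and columns as predicted classes; the provided `plot_confusion_matrix` may orient or normalise its matrix differently.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()     # correct predictions lie on the diagonal

print(cm)
print(accuracy)
```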

## Answer the following
1. Can we choose any value for K?
2. What will happen if we keep on increasing K?
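
A small experiment can help with these questions. In the sketch below, which uses invented one-dimensional data and a hypothetical helper `knn_predict_one`, the query point lies inside the smaller class, yet a large enough K makes the majority class win.

```python
import numpy as np
from collections import Counter

def knn_predict_one(x, X_train, y_train, k):
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Class 0 has four training points, class 1 only two.
X_train = np.array([[0.0], [0.2], [0.4], [0.6], [5.0], [5.2]])
y_train = np.array([0, 0, 0, 0, 1, 1])
x = np.array([5.1])                    # clearly inside the class 1 cluster

results = [knn_predict_one(x, X_train, y_train, k) for k in (1, 3, 6)]
print(results)   # [1, 1, 0]
```

With K equal to the size of the training set, every query receives the label of the largest class regardless of where it lies.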

In [None]:
# Comparison of accuracy for different values of k.
# Fit the model and plot the accuracies for different values of k
ks = range(1, 10)
accuracies = []
for k in ks:
    # Fit a model with this k and append its accuracy to accuracies
    ...
    ...


# Accuracy vs. K
fig, ax = plt.subplots()
ax.plot(ks, accuracies)
ax.set(xlabel="k",
       ylabel="Accuracy",
       title="Performance of KNN")
plt.show()

### Bonus

Implement other distance metrics and compare the results.
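
A few alternative metrics are sketched below with the same `(point, data)` signature that `dist_metric` uses in the class above, so that any of them could be passed as `dist_metric`. The function names and the default `p=3` are choices made for this example.

```python
import numpy as np

def manhattan(point, data):
    """Sum of absolute coordinate differences (L1 distance)."""
    return np.sum(np.abs(data - point), axis=1)

def chebyshev(point, data):
    """Largest absolute coordinate difference (L-infinity distance)."""
    return np.max(np.abs(data - point), axis=1)

def minkowski(point, data, p=3):
    """General L_p distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(data - point) ** p, axis=1) ** (1.0 / p)

point = np.array([0.0, 0.0])
data = np.array([[3.0, 4.0], [1.0, 1.0]])

print(manhattan(point, data))        # [7. 2.]
print(chebyshev(point, data))        # [4. 1.]
print(minkowski(point, data, p=2))
```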

### Implementation Using standard library

Implement the classifier using the scikit-learn library and compare the results.

In [None]:
# Import Libraries and define Model
from sklearn.neighbors import KNeighborsClassifier
model = ... # define a model from the imported class with k=3
... # fit the model 
y_pred = ... # predict the value from the model for  X_test data.
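
For reference, a minimal sketch of the scikit-learn API on toy data is shown below. The arrays and the choice `metric="euclidean"` are assumptions for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])
y_test = np.array([0, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(y_pred)                            # [0 1]
print(accuracy_score(y_test, y_pred))    # 1.0
```

Note that importing `KNeighborsClassifier` from scikit-learn replaces a hand-written class of the same name in the notebook's namespace, so the earlier cell has to be re-run to get the custom class back.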

In [None]:
# Plot The confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, title='Confusion matrix For KNN Classification')

In [None]:
print("Performance - " + str(100 * accuracy_score(y_test, y_pred)) + "%")


## Decision Tree Classifier

### Answer the Following
What is a decision tree and how do you implement it?   
What is entropy in a decision tree?  
What is information gain in a decision tree?  
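
As a worked illustration of the two quantities, the sketch below computes Shannon entropy, H(y) = -sum_c p_c log2(p_c), and the information gain of a threshold split, defined as the parent entropy minus the size-weighted entropy of the two children. The standalone functions and the toy arrays are assumptions for this example.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(y, feature, threshold):
    left = y[feature <= threshold]
    right = y[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0                      # a split that separates nothing gains nothing
    n = len(y)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - children

y = np.array([0, 0, 1, 1])
feature = np.array([1.0, 2.0, 3.0, 4.0])

print(entropy(y))                          # 1.0
print(information_gain(y, feature, 2.0))   # 1.0
print(information_gain(y, feature, 1.0))
```

The split at 2.0 produces two pure children and therefore removes all uncertainty, while the split at 1.0 leaves a mixed right child and gains much less.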

### Implement your own Decision Tree Classifier

In [None]:
import numpy as np
from collections import Counter

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, *, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
        
    def is_leaf_node(self):
        return self.value is not None


class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=100, n_features=None):
        self.min_samples_split=min_samples_split
        self.max_depth=max_depth
        self.n_features=n_features
        self.root=None

    def fit(self, X, y):
        self.n_features = X.shape[1] if not self.n_features else min(X.shape[1],self.n_features)
        self.root = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
        n_samples, n_feats = X.shape
        n_labels = len(np.unique(y))

        # check the stopping criteria
        if (...): # Condition for Stopping Criteria
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)

        feat_idxs = np.random.choice(n_feats, self.n_features, replace=False)

        # find the best split
        best_feature, best_thresh = self._best_split(X, y, feat_idxs)

        # create child nodes
        left_idxs, right_idxs = self._split(X[:, best_feature], best_thresh)
        left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth+1)
        right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth+1)
        return Node(best_feature, best_thresh, left, right)


    def _best_split(self, X, y, feat_idxs):
        ...

        return split_idx, split_threshold


    def _information_gain(self, y, X_column, threshold):
        # parent entropy
        parent_entropy = self._entropy(y)

        # create children
        left_idxs, right_idxs = self._split(X_column, threshold)

        if len(left_idxs) == 0 or len(right_idxs) == 0:
            return 0
        
        # calculate the weighted avg. entropy of children
        ...

        # calculate the IG
        information_gain = ...
        return information_gain

    def _split(self, X_column, split_thresh):
        left_idxs = np.argwhere(X_column <= split_thresh).flatten()
        right_idxs = np.argwhere(X_column > split_thresh).flatten()
        return left_idxs, right_idxs

    def _entropy(self, y):
        ...


    def _most_common_label(self, y):
        counter = Counter(y)
        value = counter.most_common(1)[0][0]
        return value

    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])

    def _traverse_tree(self, x, node):
        if node.is_leaf_node():
            return node.value

        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)
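
The search for the best split can be sketched as a standalone function: try every observed value of every candidate feature as a threshold and keep the pair with the highest information gain. This is one straightforward, exhaustive formulation under the assumptions of this example, not necessarily the intended solution for `_best_split`.

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(X, y, feat_idxs):
    """Return (feature index, threshold) with the highest information gain."""
    best_gain, split_idx, split_threshold = -1.0, None, None
    parent = entropy(y)
    for feat_idx in feat_idxs:
        column = X[:, feat_idx]
        for threshold in np.unique(column):
            left, right = y[column <= threshold], y[column > threshold]
            if len(left) == 0 or len(right) == 0:
                continue                       # skip splits with an empty side
            w_left = len(left) / len(y)
            gain = parent - (w_left * entropy(left) + (1 - w_left) * entropy(right))
            if gain > best_gain:
                best_gain, split_idx, split_threshold = gain, feat_idx, threshold
    return split_idx, split_threshold

X = np.array([[2.0, 7.0], [3.0, 1.0], [8.0, 6.0], [9.0, 2.0]])
y = np.array([0, 0, 1, 1])
idx, thr = best_split(X, y, feat_idxs=[0, 1])
print(idx, thr)
```

On this toy data the first feature separates the classes perfectly at the threshold 3.0, so it is chosen over the second feature.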

In [None]:
# Define the model
model = ... # define a model from the DecisionTree class above
... # fit the model 
y_pred = ... # predict the value from the model for  X_test data.

In [None]:
# Plot The confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, title='Confusion matrix For Decision Tree Classifier')

In [None]:
print("Performance - " + str(100 * accuracy_score(y_test, y_pred)) + "%")

### Implementation Using standard library

Implement the classifier using the scikit-learn library and compare the results.

In [None]:
# Import Libraries and define Model
from sklearn.tree import DecisionTreeClassifier
model = ... # define a model from the imported DecisionTreeClassifier class
... # fit the model 
y_pred = ... # predict the value from the model for  X_test data.
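
A minimal sketch of the scikit-learn decision tree on toy data follows. The arrays and the settings `criterion="entropy"`, `max_depth=3`, and `random_state=0` are assumptions chosen to mirror the entropy-based tree above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train = np.array([[2.0, 7.0], [3.0, 1.0], [8.0, 6.0], [9.0, 2.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1.0, 5.0], [10.0, 5.0]])
y_test = np.array([0, 1])

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

print(y_pred)                            # [0 1]
print(accuracy_score(y_test, y_pred))    # 1.0
```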

In [None]:
# Plot The confusion matrix
plot_confusion_matrix(y_test, y_pred, classes=class_names, title='Confusion matrix For Decision Tree Classifier')

In [None]:
print("Performance - " + str(100 * accuracy_score(y_test, y_pred)) + "%")