# Multiclass Classifier to Predict Feet Items 
The objective of this notebook is to build a machine learning model that predicts feet items (shoe and sock combinations) using a random forest classifier.

## 1. Importing required modules and loading data file
In the first section of this notebook, we import all necessary libraries, modules and packages. After running this cell, they are available in all subsequent cells of the notebook.

In [None]:
%matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # for displaying plots
import seaborn as sns; sns.set() # plotting package for histograms
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier
from sklearn import metrics
feets = pd.read_csv('input_data.csv') #provide the path to the data file

## 2. Exploratory data analysis (EDA) and data cleaning
In this section, we conduct initial investigations of the data to discover patterns, identify outliers and spot noise with the help of summary statistics and graphical representations. The goal of EDA is to summarize the important characteristics of the data in order to gain a better understanding of the dataset and prepare it for machine learning.

### The first step in EDA is to get familiar with the dataset
After importing the dataset, we take a quick look at the data and check the shape of the dataset.

In [None]:
#see the first view of a dataset & check if data has been read into a dataframe object
feets.head(5)

In [2]:
#find out the total number of rows and columns in the dataset 
feets.shape 

In [None]:
#find out the columns & their corresponding data types & checking if they contain null values
feets.info()

In [4]:
#gain basic overview of columns in a dataset
feets.columns

In [2]:
# change data types if necessary (the column names and target types are placeholders)
feets = feets.astype({"Column 1": "float64", "Column 2": "category"})

### The second step is to get rid of redundant columns (if any).

In [None]:
feets = feets.drop (['column_1', 'column_2'], axis = 1)
feets.head()

In [None]:
# check if there are duplicates and handle them accordingly
print(feets.duplicated().sum())
feets.drop_duplicates(inplace=True)

Use [this tutorial](https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e) for handling missing values.
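
As a starting point for handling missing values, the sketch below counts them per column and applies a simple imputation (median for a numeric column, most frequent value for a categorical one). The toy frame and its column names are assumptions for illustration; whether imputation is appropriate at all depends on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
toy = pd.DataFrame({
    "temperature": [21.0, np.nan, 15.5, 30.0],
    "sex": ["f", "m", None, "m"],
})

# Count missing values per column before any imputation
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Numeric column: fill with the median; categorical column: fill with the mode
toy["temperature"] = toy["temperature"].fillna(toy["temperature"].median())
toy["sex"] = toy["sex"].fillna(toy["sex"].mode()[0])
```

Dropping incomplete rows with dropna() is the simpler alternative when only a few rows are affected.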

### The third step is to compute descriptive statistics of the dataset.
We call the describe() function to summarize each numerical attribute: count, mean, standard deviation, minimum, quartiles (including the median) and maximum.

In [1]:
feets.describe() #for the summary of numerical data

In [None]:
feets['column_name'].value_counts()#for the summary of categorical data 

### The fourth step is to create a bar chart.
Plotting a bar chart shows the distribution of a categorical variable (x axis: distinct items, y axis: frequency).

In [None]:
#bar charts - for categorical data
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # [left, bottom, width, height] as fractions of the figure
variable_a = ['category_1', 'category_2', 'category_3', 'category_4']
variable_b = [32, 23, 3, 6]#should be replaced
ax.bar (variable_a, variable_b)
plt.show()
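
Rather than typing categories and frequencies by hand, they can be derived from the data with value_counts(). The following is a minimal sketch under the assumption that the column of interest is categorical; the labels are invented stand-ins, not values from the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a plain script; omit inside the notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical labels standing in for a categorical column of the dataset
labels = pd.Series(["sneakers", "boots", "sneakers", "sandals", "sneakers", "boots"])

# Frequency of each distinct item, most frequent first
counts = labels.value_counts()

fig, ax = plt.subplots()
ax.bar(list(counts.index), counts.values)
ax.set_xlabel("category")
ax.set_ylabel("frequency")
# call plt.show() here when working interactively
```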

### The fifth step is to plot histograms/kernel density plots for each variable.
This visualization technique will enable us to see how frequently data in each class occur in the dataset. Namely, it will graphically show the frequency of different data points in the dataset, location of the center of data, the spread of dataset, skewness of dataset and presence of outliers.

In [None]:
#histograms - for quantitative data
# histplot is used here in place of distplot, which is deprecated in recent seaborn versions
sns.histplot(feets['column_name'], kde=False).set_title('Title')
#kde - set kde=True to overlay a kernel density estimate; keep it False if the curve is visually distracting.
plt.show()

### The sixth step in this section is to create a boxplot for input features.
This way, we get an even better idea of the centre of the distribution and can verify the potential outliers detected in the histograms.

In [None]:
#create a box plot for input features
sns.boxplot(x = 'column_name', y = 'column_name', data = feets)
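
Outliers seen in a box plot can also be listed numerically. The sketch below applies the 1.5 × IQR rule that box plots conventionally use for their whiskers; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as potential outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Flagged points are candidates for inspection, not automatic removal.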

### The last step of this section checks the correlation between features in a dataset by calling df.corr()

 
It calculates the pairwise correlation between features, excluding null values. This shows which attributes might depend on, or move together with, other attributes. Feature correlations matter when predicting one attribute from another.

The cell computes a correlation matrix (Max's note: be mindful of the type of correlation and the scale level of the variable, Pearson vs. Spearman correlation).

In [None]:
feets.corr()# to reveal whether correlation is positive, negative
#or non-existent. 
#Read more on correlation matrix to interpret Max's note above. 
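
To illustrate the difference between the two correlation types, the sketch below uses invented data in which y grows monotonically but non-linearly with x. Pearson measures linear association, Spearman measures monotonic (rank) association, so only Spearman reaches 1 here. The frame and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y grows monotonically but non-linearly with x
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3
toy["label"] = "a"  # non-numeric column, excluded below

# Correlation is only defined for numeric columns
numeric = toy.select_dtypes(include="number")

pearson = numeric.corr(method="pearson").loc["x", "y"]
spearman = numeric.corr(method="spearman").loc["x", "y"]
print(round(pearson, 3), round(spearman, 3))
```

For ordinal variables or clearly non-linear relationships, Spearman is usually the safer choice.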


## 3. Data pre-processing
In this section we will prepare the machine learning data by splitting it into three datasets: a training set, a validation set and a test set. The test set is needed to measure the final performance of the model. It is used only on the fine-tuned model; once it has been used, it loses its "value" as an unbiased estimate.

### In the first step of this section, we will create the machine learning datasets.

Create a train/validation/test split (e.g., 50%, 25%, 25% according to Hastie et al. (2009, p. 222)).

In [None]:
# for our model we use input features such as temperature, sex, heat perception and weather state
X = feets[['temp','temperature', 'sex', 'heat perception', 'weather state']]
y = feets['feet_label']

# default is 75% / 25% train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
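
A single call to train_test_split yields only two sets. One way to obtain a 50% / 25% / 25% train/validation/test split is to call it twice, as sketched below on synthetic data. The array names and the use of stratification are assumptions, not part of the original plan.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, 4 balanced classes
X_all = np.arange(300).reshape(100, 3)
y_all = np.repeat([0, 1, 2, 3], 25)

# Step 1: keep 50% for training, set the other 50% aside
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_all, y_all, test_size=0.5, stratify=y_all, random_state=0)

# Step 2: split the remainder in half: 25% validation, 25% test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(X_tr), len(X_val), len(X_te))
```

Stratifying keeps the class proportions similar across the three sets, which matters when some feet items are rare.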

### Select the algorithms and create a classifier object

In [None]:
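#create a classifier object for each of the following algorithms
#(the hyperparameter values are placeholders to be tuned later)
knn = KNeighborsClassifier(n_neighbors=5)
dummy_classifier = DummyClassifier(strategy='most_frequent')
random_forest_classifier = RandomForestClassifier(random_state=0)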

## 4. Model training
In this section, we will feed the ML algorithms with data so that they can learn good values for all attributes involved.

### Dummy Classifiers
DummyClassifier is a classifier that makes predictions using simple rules, which can be useful as a baseline for comparison against actual classifiers, especially with imbalanced classes.

In [None]:
# The 'most_frequent' strategy learns the most frequent class in y_train
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
# Therefore the dummy classifier always predicts that class, whatever the input
y_dummy_predictions = dummy_majority.predict(X_test)

y_dummy_predictions

### K-Nearest Neighbours Classification

In [None]:
#fit k-NN with 1, 3 and 11 neighbours (uniform weights) and compare accuracy on the test split
#assumes the columns of X are numeric; categorical columns must be encoded first
for k in [1, 3, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k, weights='uniform').fit(X_train, y_train)
    print(k, knn_k.score(X_test, y_test))

### Random Forest Classifier
Adapt the following example code from the scikit-learn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). The underlying algorithm is described in Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# synthetic demo data; named X_demo/y_demo so the X and y defined above are not overwritten
X_demo, y_demo = make_classification(n_samples=1000, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_demo, y_demo)
print(clf.predict([[0, 0, 0, 0]]))

### Rule-based algorithm 
The rule-based model file will be incorporated here once it is finalized.

### Training the classifiers

In [6]:
# algorithm.fit(X_train, y_train)
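
The fit call has the same form for every scikit-learn classifier, so the models can be trained in one loop. The sketch below uses synthetic 3-class data in place of the real features; the variable names and hyperparameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the real features and labels
X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_syn, y_syn, random_state=0)

# Placeholder hyperparameters; tune them on the validation set
classifiers = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train every classifier on the same training data
for name, clf in classifiers.items():
    clf.fit(X_fit, y_fit)
    print(name, "trained on", len(X_fit), "samples")
```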

## 5. Model evaluation 
#### (more research is needed to define the relevant metrics)
In this step we will use suitable metrics to test the multi-class classifier: we compare the performance of the different models and then analyse the best performing model by tuning its parameters.

### Use the trained classifier model to classify new objects

In [None]:
#new object classification on the "test set" as defined by Hastie et al. above
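
Classifying new objects comes down to calling predict (or predict_proba for class probabilities) on data the model has not seen. A minimal sketch on synthetic data follows; the names and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_syn, y_syn, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# One predicted class per held-out sample
predicted = model.predict(X_hold)

# One probability per class and sample; each row sums to 1
probabilities = model.predict_proba(X_hold)
print(predicted[:5], probabilities[:1])
```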

### Accuracy
This metric will be directly computed from the confusion matrix

In [None]:
#algorithm.score(X_test, y_test)
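
Accuracy follows from the confusion matrix as the sum of its diagonal (correct predictions) divided by the sum of all cells. The sketch below checks this against accuracy_score on made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]  # hypothetical actual classes
y_pred = [0, 1, 1, 1, 2, 0, 2]  # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)

# Diagonal cells are correct predictions; all cells together are all predictions
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```

Both values are 5/7 here, since five of the seven predictions are correct.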

### Multi-class confusion matrix
A [confusion matrix](https://www.analyticsvidhya.com/blog/2021/06/confusion-matrix-for-multi-class-classification/#:~:text=Confusion%20Matrix%20is%20used%20to,between%20Actual%20and%20predicted%20values.&text=Confusion%20Matrix%20has%204%20terms,and%20False%20Negative(FN).) summarizes the performance of a machine learning classifier in matrix form by comparing actual and predicted values: each cell counts how often a given actual class was predicted as a given class.

In [7]:
#some code
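
A possible starting point for this cell is sketched below. The class names are invented placeholders, since the real feet item labels are not defined yet. In scikit-learn's convention, rows are actual classes and columns are predicted classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical class names and predictions
labels = ["boots", "sandals", "sneakers"]
y_true = ["boots", "boots", "sandals", "sandals", "sandals",
          "sneakers", "sneakers", "sneakers"]
y_pred = ["boots", "sneakers", "sandals", "sandals", "sneakers",
          "sneakers", "sneakers", "boots"]

# Fix the label order so rows and columns are predictable
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Wrap in a DataFrame to make rows (actual) and columns (predicted) readable
cm_table = pd.DataFrame(cm,
                        index=[f"actual {l}" for l in labels],
                        columns=[f"predicted {l}" for l in labels])
print(cm_table)
```

Off-diagonal cells show which items get confused with each other, for example one pair of boots predicted as sneakers in this toy data.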

### Precision and recall
Precision is the fraction of predictions for a class that actually belong to that class. Recall is the fraction of the actual members of a class that the model correctly identifies. In the multi-class case both are computed per class and can then be averaged. [For more read the following blog](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2).

In [None]:
#some code
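
As a sketch of what this cell could contain, the code below computes per-class and macro-averaged precision and recall on made-up labels. The choice of macro averaging is an assumption; weighted averaging may suit imbalanced classes better.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2, 2]  # hypothetical actual classes
y_pred = [0, 1, 1, 1, 2, 0, 2]  # hypothetical predictions

# average=None returns one value per class, in sorted label order
precision_per_class = precision_score(y_true, y_pred, average=None)
recall_per_class = recall_score(y_true, y_pred, average=None)

# Macro average: unweighted mean over classes
macro_precision = precision_score(y_true, y_pred, average="macro")
macro_recall = recall_score(y_true, y_pred, average="macro")

print(precision_per_class, recall_per_class)
print(macro_precision, macro_recall)
```

For class 1, three samples are predicted as class 1 and two of them are correct, so precision is 2/3, while both actual class 1 samples are found, so recall is 1.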

### K-fold cross-validation
In this step, we will use k-fold cross-validation to estimate the skill of the model on new data.

In [None]:
#some code
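
A minimal sketch of 5-fold cross-validation with scikit-learn follows. The synthetic data, the number of folds and the use of stratified folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the real dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=0)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy score per fold; the model is refitted for each fold
scores = cross_val_score(model, X_syn, y_syn, cv=cv)
print(scores.mean(), scores.std())
```

The mean gives the performance estimate and the standard deviation indicates how much it varies between folds.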

## 6. Model deployment

### Deploying the output predictions for the digital product

In [None]:
#some code (look this up)

Further steps: save the model as a file, and create an .xls file with the test set (as defined by Hastie et al., 2009) and the corresponding predictions in an additional column. Details are to be aligned with the WD team (e.g., Flask interface).
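
One possible way to carry out these steps is sketched below: the model is persisted with joblib and the test set is exported with an extra prediction column. File names, column names and the temporary output directory are hypothetical. CSV is used here because DataFrame.to_excel needs an additional Excel writer engine such as openpyxl to be installed.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real dataset
X_syn, y_syn = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_syn, y_syn, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Save the trained model to a file and load it back (hypothetical file name)
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "feet_model.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# Export the test set with actual labels and predictions in additional columns
export = pd.DataFrame(X_hold, columns=[f"feature_{i}" for i in range(X_hold.shape[1])])
export["actual"] = y_hold
export["prediction"] = restored.predict(X_hold)
csv_path = os.path.join(out_dir, "test_set_predictions.csv")
export.to_csv(csv_path, index=False)
```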