# Tutorial 2: Baseline Calculation

---

### Introduction

Before we try any feature engineering methods, we need a basic statistical model to serve as a baseline against which more complicated techniques and pipelines can be compared. In this notebook, we use logistic regression (LR) on the whole wavelength spectrum, without any feature selection or reduction; this model is our baseline.

LR is a simple statistical method that separates classes with a linear decision boundary. In this notebook, we will use LR to classify the apples as Sound (`S`) or Bruised (`B`). Follow this [link](https://developers.google.com/machine-learning/crash-course/logistic-regression/video-lecture) to learn more about LR.
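To make the idea of a linear classifier concrete, here is a minimal sketch of how LR turns a weighted sum of features into a class label. The weights `w`, bias `b`, and the two samples are made up for illustration; they are not values fitted to the apple data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a 3-feature toy problem (not fitted values)
w = np.array([0.8, -1.2, 0.5])
b = -0.1

# Two made-up samples, one per row
x = np.array([[1.0, 0.2, 0.5],
              [0.1, 1.5, 0.3]])

scores = x @ w + b                   # linear part: one score per sample
probs = sigmoid(scores)              # probability of class 1
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to get a class label
print(labels)
```

A probability of 0.5 corresponds to a score of exactly zero, which is why the decision boundary is linear in the features.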

If you are unfamiliar with the classification evaluation metrics (accuracy, precision and recall), you can look [here](https://developers.google.com/machine-learning/crash-course/classification/accuracy) and [here](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall).

First we do the necessary imports.

In [6]:
# ___Cell no. 1___

# Python packages 
import pandas as pd # for importing data into data frame format
import seaborn as sns # For drawing useful graphs, such as bar graphs
import numpy as np
import matplotlib.pyplot as plt

import sys
sys.path.append("..")
from source.utils import split #  a pre-defined function to split the data into training and testing


Next, let us recall the data. The `%store -r` magic restores variables that were previously saved with `%store`.

In [7]:
# ___Cell no. 2___

%store -r X
%store -r Y
%store -r df
print(X.shape) # print the shape of the dataframe X

(503, 2074)


In [9]:
%store -r Xgd
%store -r Ygd
%store -r gd
print(Xgd.shape)

(547, 2074)


In [13]:
%store -r Xrg
%store -r Yrg
%store -r rg
print(Xrg.shape)

(562, 2074)


Convert the labels from `S` and `B` to `1` and `0`, so that Sound is the positive class:

In [16]:
# ___Cell no. 3___

Y = Y.map({'S': 1, 'B': 0})
Y

0      0
1      0
2      1
3      1
4      0
      ..
498    1
499    1
500    0
501    1
502    1
Name: Condition, Length: 503, dtype: int64

In [14]:
Ygd = Ygd.map({'S': 1, 'B': 0})
Ygd

0      0
1      0
2      1
3      1
4      0
      ..
542    1
543    0
544    0
545    1
546    1
Name: Condition, Length: 547, dtype: int64

In [15]:
Yrg = Yrg.map({'S': 1, 'B': 0})
Yrg

0      0
1      0
2      1
3      1
4      0
      ..
557    1
558    0
559    0
560    1
561    1
Name: Condition, Length: 562, dtype: int64
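One thing worth checking after a dictionary-based `map` is that every label was actually covered, because pandas returns `NaN` for any value missing from the dictionary. The following is a minimal sketch on made-up labels (the lowercase `'b'` is a deliberate, hypothetical typo), not on the apple data:

```python
import pandas as pd

# Made-up labels; the lowercase 'b' is not a key of the mapping below
labels = pd.Series(["S", "B", "S", "b"], name="Condition")

encoded = labels.map({"S": 1, "B": 0})

# Values absent from the dictionary become NaN, so count them explicitly
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

The outputs above end with `dtype: int64`, which already indicates that no label in `Y`, `Ygd`, or `Yrg` went unmapped: a column containing `NaN` would have been converted to a float dtype.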

---

### Split the data into train and test

In [17]:
# ___Cell no. 4___

Xtrain, Xtest, Ytrain, Ytest  = split(X, Y)

In [18]:
Xgdtrain, Xgdtest, Ygdtrain, Ygdtest  = split(Xgd, Ygd)

In [19]:
Xrgtrain, Xrgtest, Yrgtrain, Yrgtest  = split(Xrg, Yrg)

In [20]:
# ___Cell no. 5___

print(Xtrain.shape)
print(Ytrain.shape)

(352, 2074)
(352,)


In [21]:
print(Xgdtrain.shape)
print(Ygdtrain.shape)

(382, 2074)
(382,)


In [22]:
print(Xrgtrain.shape)
print(Yrgtrain.shape)

(393, 2074)
(393,)
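The shapes above (352 of 503, 382 of 547, and 393 of 562 rows) are consistent with a 70/30 train/test split. The implementation of `source.utils.split` is not shown in this notebook, so the sketch below is only an assumption of what such a helper could look like: `split_sketch`, its `test_size`, and its `random_state` are hypothetical, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(X, Y, test_size=0.3, random_state=0):
    """Hypothetical stand-in for source.utils.split; the real helper may differ."""
    return train_test_split(X, Y, test_size=test_size, random_state=random_state)

# Random demo data with the same number of rows as X (503), but only 5 columns
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(503, 5)))
Y_demo = pd.Series(rng.integers(0, 2, size=503))

Xtr, Xte, Ytr, Yte = split_sketch(X_demo, Y_demo)
print(Xtr.shape, Xte.shape)  # (352, 5) (151, 5)
```

With 503 rows, a 30% test set is rounded up to 151 samples, which leaves the 352 training rows reported above.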


---

### Train the baseline model (LR) 

In [23]:
# ___Cell no. 6___

from sklearn.linear_model import LogisticRegression


In [24]:
# ___Cell no. 7___

LR = LogisticRegression(random_state=0, solver="newton-cg") #defining the model
LR.fit(Xtrain.values, Ytrain) # training the machine learning model

LogisticRegression(random_state=0, solver='newton-cg')

In [25]:
LR.fit(Xgdtrain.values, Ygdtrain) # refits the same LR object on the gd data (overwrites the previous fit)

LogisticRegression(random_state=0, solver='newton-cg')

In [26]:
LR.fit(Xrgtrain.values, Yrgtrain) # refits the same LR object on the rg data (overwrites the previous fit)

LogisticRegression(random_state=0, solver='newton-cg')
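Calling `fit` again on the same estimator discards the previously learned coefficients, so a single `LR` object only remembers the last dataset it was trained on. If one model per dataset is wanted, a possible pattern is to keep the fitted estimators in a dictionary. This is a minimal sketch on random toy data; `make_toy` and the `models` dictionary are hypothetical names, not part of the tutorial code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_toy(n, shift):
    """Create a random toy dataset whose labels depend on the first two features."""
    X_toy = rng.normal(size=(n, 4))
    y_toy = (X_toy[:, 0] + shift * X_toy[:, 1] > 0).astype(int)
    return X_toy, y_toy

# Toy stand-ins for the three training sets used in this notebook
datasets = {"X": make_toy(60, 0.0), "Xgd": make_toy(60, 1.0), "Xrg": make_toy(60, -1.0)}

# One independent estimator per dataset, so no fit overwrites another
models = {}
for name, (X_toy, y_toy) in datasets.items():
    models[name] = LogisticRegression(random_state=0, solver="newton-cg").fit(X_toy, y_toy)

print({name: model.coef_.shape for name, model in models.items()})
```

Each test set can then be scored with the model trained on the matching training set, for example `models["Xgd"].predict(...)`.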

### Testing the baseline model

In [28]:
# ___Cell no. 8___

from sklearn.metrics import accuracy_score, precision_score

In [29]:
# ___Cell no. 9___
y_pred = LR.predict(Xtest.values) # pass .values to match the array input used in fit
print("Accuracy score is", accuracy_score(Ytest.values, y_pred))

Accuracy score is 0.7682119205298014




In [31]:
ygd_pred = LR.predict(Xgdtest.values) # pass .values to match the array input used in fit
print("Accuracy score is", accuracy_score(Ygdtest.values, ygd_pred))

Accuracy score is 0.8181818181818182




In [33]:
yrg_pred = LR.predict(Xrgtest.values) # pass .values to match the array input used in fit
print("Accuracy score is", accuracy_score(Yrgtest.values, yrg_pred))

Accuracy score is 0.8284023668639053




We have calculated the classification accuracy, which ranges from about 77% to 83% across the three datasets (about 80% on average); not bad for a baseline. However, accuracy is not the best measure of performance in this case. Because Sound is the positive class (`1`), a false positive is a bruised apple classified as sound, which is much worse than a false negative (a sound apple classified as bruised). Precision, the fraction of apples predicted sound that really are sound, is therefore a better way to score the algorithm.
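As a worked illustration of the difference between the two metrics, precision is TP / (TP + FP), while accuracy is (TP + TN) divided by the total number of samples. The labels below are made up for the example (1 = sound, 0 = bruised) and are not taken from the apple data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Made-up ground truth and predictions: 4 sound apples, 6 bruised apples
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = accuracy_score(y_true_demo, y_pred_demo)     # (3 + 4) / 10 = 0.7
precision = precision_score(y_true_demo, y_pred_demo)   # 3 / (3 + 2) = 0.6
print(tn, fp, fn, tp, accuracy, precision)
```

Here two bruised apples slip through as sound, which lowers precision to 0.6 even though accuracy is 0.7.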

In [36]:
# ___Cell no. 10___
print("Precision score is", precision_score(Ytest.values, y_pred))

Precision score is 0.7204301075268817


In [34]:
print("Precision score is", precision_score(Ygdtest.values, ygd_pred))

Precision score is 0.7789473684210526


In [35]:
print("Precision score is", precision_score(Yrgtest.values, yrg_pred))

Precision score is 0.7777777777777778


The model has a precision of about 76% on average (72% to 78% across the three datasets). We will use this score to compare against other methods in the upcoming tutorials.
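For later comparison, it can be convenient to keep the baseline scores in one table. The sketch below simply copies the accuracy and precision values printed in this notebook into a DataFrame and averages them; the row labels `X`, `Xgd`, and `Xrg` reuse the variable names above, and the table layout itself is an assumption.

```python
import pandas as pd

# Scores copied from the outputs reported in this notebook
scores = pd.DataFrame(
    {
        "accuracy": [0.7682119205298014, 0.8181818181818182, 0.8284023668639053],
        "precision": [0.7204301075268817, 0.7789473684210526, 0.7777777777777778],
    },
    index=["X", "Xgd", "Xrg"],
)

# Column-wise mean over the three datasets
mean_scores = scores.mean()
print(scores.round(3))
print(mean_scores.round(3))
```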

**Exercise 1:** Perform LR on the other two data sets.

---