# Tutorial 3: Baseline Calculation

Before trying more sophisticated feature selection methods, we need a basic machine learning model against which to measure their effectiveness on the testing set. In this notebook, we train a linear method, logistic regression (LR), on the whole wavelength spectrum without any feature selection or reduction; such a model is called a baseline.

LR is a simple linear statistical method for classification: it separates the classes with a linear decision boundary. In this notebook, we will use LR to classify apples as Sound (`S`) or Bruised (`B`).
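
To make the idea of a linear classifier concrete, here is a minimal sketch of how a fitted logistic regression turns features into a class label. The feature values, weights, and intercept below are made up for illustration; they are not taken from the apple data or from the model trained later in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 3 samples with 2 features each
X_toy = np.array([[0.2, 1.0],
                  [1.5, -0.5],
                  [-1.0, 0.3]])
w = np.array([1.0, -2.0])  # illustrative weights (assumed, not learned)
b = 0.5                    # illustrative intercept (assumed, not learned)

scores = X_toy @ w + b               # linear part: one score per sample
probs = sigmoid(scores)              # probability of the positive class
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to get 0/1 labels

print(labels)
```

Training consists of finding the `w` and `b` that best fit the training labels; the prediction step itself stays this simple.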

If you are not familiar with classification metrics, you can take a look at the following [link](https://developers.google.com/machine-learning/crash-course/classification/video-lecture).

---

In [2]:
# Python packages 
import pandas as pd # for importing data into data frame format
import seaborn as sns # For drawing useful graphs, such as bar graphs
import numpy as np
import matplotlib.pyplot as plt

import sys
sys.path.append("..")
from source.utils import split #  a pre-defined function to split the data into training and testing


First, let us load the stored data.

In [13]:
%store -r X
%store -r Y
%store -r df
print(X.shape) # printing the shape of the dataframe X

(503, 2074)


In [31]:
Y = Y.map({'S': 1, 'B': 0}) # encode Sound as 1 (positive class) and Bruised as 0
Y

0      0
1      0
2      1
3      1
4      0
      ..
498    1
499    1
500    0
501    1
502    1
Name: Condition, Length: 503, dtype: int64

---

### Data splitting

In [32]:
Xtrain, Xtest, Ytrain, Ytest  = split(X, Y)

In [33]:
print(Xtrain.shape)
print(Ytrain.shape)

(352, 2074)
(352,)
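
The `split` helper comes from `source.utils` and its implementation is not shown here. As a rough sketch only, a helper with similar behaviour could wrap scikit-learn's `train_test_split`. The name `split_sketch`, the `test_size=0.3` value (chosen because it reproduces the 352 training rows seen above), and the `random_state` are assumptions, not the actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(X, Y, test_size=0.3, random_state=0):
    """Hypothetical stand-in for source.utils.split (assumed behaviour)."""
    return train_test_split(X, Y, test_size=test_size, random_state=random_state)

# Synthetic stand-in data with the same number of rows as the apple dataset
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(503, 5)))
Y_demo = pd.Series(rng.integers(0, 2, size=503))

Xtr, Xte, Ytr, Yte = split_sketch(X_demo, Y_demo)
print(Xtr.shape, Xte.shape)
```

Fixing `random_state` keeps the split reproducible, so later models can be compared with the baseline on the same testing set.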


---

### Baseline (LR) training

In [34]:
from sklearn.linear_model import LogisticRegression


In [35]:
LR = LogisticRegression(random_state=0, solver="newton-cg") #defining the model
LR.fit(Xtrain.values, Ytrain) # training the machine learning model

LogisticRegression(random_state=0, solver='newton-cg')

### Testing the machine learning model

In [40]:
from sklearn.metrics import accuracy_score, precision_score

In [37]:
y_pred = LR.predict(Xtest.values) # use .values to match the array format used in fit
accuracy_score(Ytest.values, y_pred)



0.7947019867549668

The classification accuracy is about 79%, which is not bad for a baseline. However, we are more interested in how well the model limits false positives, that is, bruised apples that the model predicted as sound. Since Sound is encoded as the positive class (1), the precision score, TP / (TP + FP), measures exactly this: the fraction of apples predicted as sound that are actually sound.
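
As a small illustration of how precision relates to false positives, the sketch below computes it by hand from a confusion matrix and compares it with `precision_score`. The labels are a made-up toy example (1 = Sound, 0 = Bruised), not the apple testing set.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Toy labels for illustration only: 1 = Sound, 0 = Bruised
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 1, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: of all apples predicted Sound, how many really are Sound?
manual_precision = tp / (tp + fp)

print(tp, fp, manual_precision)
print(precision_score(y_true_toy, y_pred_toy))
```

Every false positive (a bruised apple predicted as sound) increases the denominator without increasing the numerator, so precision drops as false positives grow.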

In [41]:
precision_score(Ytest.values, y_pred)

0.7647058823529411

The model achieves a precision of about 76%.

Now let us see how more complicated models perform against this baseline in the following tutorial.