## Classification

* Classification is an area of *supervised learning* that addresses how to systematically assign novel, unlabeled data (**classes** unknown) to labels (**classes**, groups, or types) using knowledge of their **features** (characteristics or attributes) obtained from observation and/or measurement.
* A classifier algorithm is a specific technique or method for performing classification.
* To learn to classify, the classifier algorithm first uses labeled (classes are known) training data to train a model (i.e., fit parameters), and then it uses a function known as its classification rule (or for short, the **classifier**) to assign a label to each new data point given the feature values.
* A simple measure of classification performance is **accuracy**: the fraction of the test data that is labeled correctly.
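
As a minimal sketch of the accuracy measure, assuming a handful of made-up labels (the arrays `true_labels` and `predicted_labels` are hypothetical and not from any dataset used here):

```python
import numpy as np

# Hypothetical true labels and classifier predictions for six test points
true_labels = np.array([1, 0, 1, 1, 0, 0])
predicted_labels = np.array([1, 0, 0, 1, 0, 1])

# Accuracy: fraction of test points whose predicted label matches the true label
accuracy = np.mean(predicted_labels == true_labels)
print(accuracy)
```

Four of the six predictions match here, so the accuracy is 4/6.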

The automated checkout problem 

![](../images/pepperfeature.png)
![](../images/peppertest.png)

## The intuitive importance of Classification
* In conventional statistics courses, and in experimental psychology or neuroscience courses, emphasis is placed on finding a *significant* difference between two (or more) subject groups or experimental conditions.
* For example, in clinical research we ask questions such as: **Is the patient data different from the control data?**
* But in our minds (and definitely in the patient's and the physician's minds) perhaps we should ask a different question:
* **Based on the characteristics of the patient's data, can we determine whether the data comes from a patient or a control?**
* The first approach is built around hypothesis testing for differences, the second approach is classification of data.

## What is a Classifier?
* At the simplest level, a classifier is a decision rule, as in Signal Detection Theory (SDT), that allows us to categorize data. In fact, many of the ideas we will discuss closely mirror SDT.
* For example, when you go to the doctor they take your blood pressure, and you get a pair of numbers like 120/80 for the systolic/diastolic pressure. 
* The doctor has a decision rule.  If the systolic pressure is above 130, the patient receives a stern lecture about diet and exercise, and if the systolic pressure is above 140, medication is prescribed to lower blood pressure.  
* Thus, there are 3 classes of patients based on the systolic blood pressure reading.
1.  below 130 - healthy
1.  130-140 - borderline 
1.  above 140 - medication 
* **This is a 3-class classifier** 
* How were these critical values found?  Hopefully, huge amounts of data were collected on patient cardiovascular health and blood pressure, and the data show that if blood pressure remains above 140, the heart walls thicken and secondary cardiovascular diseases can emerge. (There are other bad effects too.)
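
The blood pressure rule above can be sketched as a few lines of code. This is only an illustration: the function name `classify_systolic` is hypothetical, and treating a reading of exactly 130 or exactly 140 as belonging to the lower class is an assumption, since the rule only says "above".

```python
def classify_systolic(systolic):
    """Assign one of three classes from a systolic reading (mmHg)."""
    if systolic > 140:
        return "medication"
    elif systolic > 130:
        return "borderline"
    else:
        return "healthy"

# Three hypothetical readings, one for each class
readings = [118, 135, 152]
labels = [classify_systolic(r) for r in readings]
print(labels)
```

Here the critical values are fixed by hand. The classifiers in the rest of these notes learn their decision rule from labeled training data instead.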

## Examples in Cognitive Science/Cognitive Neuroscience
1. Categorization 
2. Automatic Speech Recognition
3. Face Recognition
4. Brain-Computer Interfaces
5. Biomarkers for Mental Health 
6. Single-trial analysis of Neural Signals 

* In data science/machine learning applications, we are solely interested in making classifiers work as accurately as possible. 
* In scientific applications, we want to know how the classifier was able to work.  We want to know what features of the data were useful and what transformations of the data produced the accurate classification. 

## Logistic Regression 

* The first classifier we will discuss in this class is **Logistic Regression**.
* In Linear Regression, we fit a line to data.
* In a simple (two-class) Logistic Regression, we fit a curve to the probability that the data come from one **class**.

![](../images/Exam_pass_logistic_curve.png)
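
The S-shaped curve being fit is the logistic (sigmoid) function applied to a linear function of the feature. A rough sketch for a single feature follows. The intercept and slope are made-up values for illustration, not fitted to any data, and the "hours studied" framing is an assumption.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature model: hours studied -> probability of passing.
# The intercept and slope below are invented for illustration, not fitted.
intercept, slope = -4.0, 1.5
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Linear function of the feature, squashed into a probability
p_pass = logistic(intercept + slope * hours)
print(np.round(p_pass, 3))
```

The output always stays between 0 and 1 and rises smoothly with the feature, which is what makes it usable as a class probability.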

## Diabetes Prediction Example 
[Pima Indians Diabetes Study](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)


In [60]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
##NEW IMPORTS
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [61]:
pima = pd.read_csv("../data/diabetes.csv")

In [None]:
pima.head()

In [None]:
pima.info()

In [64]:
#I grabbed a list of all the columns 
cols = pima.columns

In [None]:
#Examine how many of each outcome
pima["Outcome"].value_counts()

In [None]:
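# discrete=True draws one group of bars centered on each integer count
sns.histplot(pima, x="Pregnancies", discrete=True, hue="Outcome", multiple="dodge")
# one tick per integer value, starting at 0
plt.xticks(np.arange(pima["Pregnancies"].max() + 1))
plt.show()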

In [67]:
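#Split the data into the predictors and the outcome variable
#Note: cols[1:8] selects columns 1 through 7, so the first column and the Outcome column are left out

diabetes = pima['Outcome']
predictors = pima[cols[1:8]]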

## Correlations among predictors

In [None]:
sns.heatmap(predictors.corr(), vmin=-1, vmax=1, cmap= "jet",annot=True)
plt.show()

In [None]:
sns.pairplot(pima, hue="Outcome", height=3);

## Training and test sets
* Here I made the decision to make the test size 25% of the data and the training size 75%.
* The split produces four objects:
    * `predictors_train` has the training data features
    * `predictors_test` has the testing data features
    * `diabetes_train` has the training data outcomes (targets)
    * `diabetes_test` has the testing data outcomes (targets)

In [70]:

from sklearn.model_selection import train_test_split

predictors_train, predictors_test, diabetes_train, diabetes_test = train_test_split(predictors, diabetes, test_size=0.25,random_state=16)

In [None]:
diabetes.value_counts()

In [None]:
diabetes_train.value_counts()

In [None]:
diabetes_test.value_counts()

In [74]:
# instantiate the model (using the default parameters, except random_state and max_iter)
logreg = LogisticRegression(random_state=16,max_iter = 5000)

# fit the model with data
logreg.fit(predictors_train, diabetes_train)

diabetes_pred = logreg.predict(predictors_test)

In [None]:
print(diabetes_pred)

In [None]:
#Let's see if the predictions match the expected outcomes
correct = (diabetes_pred==diabetes_test)
ncorrect = np.sum(correct)
pctcorrect = 100*ncorrect/len(diabetes_test)
print(pctcorrect)

## Confusion Matrix 

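A confusion matrix is a really nice way to summarize the performance of a classifier. In scikit-learn's layout, each row corresponds to an actual class and each column to a predicted class, so the diagonal counts the correct predictions.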

In [None]:
cnf_matrix = metrics.confusion_matrix(diabetes_test, diabetes_pred)
print(cnf_matrix)

In [None]:
#Never say "Healthy" or "Normal", just say "Undiagnosed"
class_names=['Undiagnosed','Diabetes'] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="jet" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.xticks(tick_marks+0.5, class_names)
plt.yticks(tick_marks+0.5, class_names)
plt.show()


In [None]:
## you can also get this report 
print(metrics.classification_report(diabetes_test,diabetes_pred,target_names=class_names))


### Precision and Recall
* TP = True Positive
* FP = False Positive
* TN = True Negative
* FN = False Negative
* Precision - What proportion of positive identifications were actually correct?
  $$ Precision = \frac{TP}{TP+FP}$$
* Recall - What proportion of actual positives were identified correctly?
  $$ Recall = \frac{TP}{TP+FN}$$
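
To make the two formulas concrete, here is a small sketch that computes precision and recall directly from the four counts. The label arrays are made up for illustration. Unpacking `confusion_matrix(...).ravel()` in the order TN, FP, FN, TP applies to binary 0/1 labels.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found
print(precision, recall)
```

With 3 true positives, 2 false positives, and 1 false negative, precision is 3/5 and recall is 3/4. These agree with `metrics.precision_score` and `metrics.recall_score` on the same arrays.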




## Diagnostic Information 

* The strength of linear methods like logistic regression is that they can provide rich insight into the performance of the model. 
 

In [80]:
diabetes_pprob = logreg.predict_proba(predictors_test)

In [None]:
sns.histplot(diabetes_pprob[:,1])
plt.show()

The prediction probability is a confidence estimate on the prediction.  

![](../images/PrecisionVsRecallBase.png)

In [None]:
plt.plot(np.sort(diabetes_pprob[:,1]),'ro')
plt.xlabel('Samples (sorted by probability)')
plt.ylabel('Probability of Diabetes')
plt.show()

## ROC Curve 
* A Receiver Operating Characteristic (ROC) curve is a graph showing the performance of a classification model at all classification thresholds. It shows the tradeoff between sensitivity and specificity.
* The curve plots two quantities, the true positive rate against the false positive rate:

* True Positive Rate (TPR) is the same as recall in metrics and is therefore defined as follows:

$$TPR = \frac{TP}{TP+FN}$$

* False Positive Rate (FPR) is defined as follows:

$$FPR = \frac{FP}{FP+TN}$$

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
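
The effect of the threshold can be checked by hand. The sketch below uses made-up labels and predicted probabilities (`y_true`, `p_hat`, and the helper `rates` are all hypothetical) and computes TPR and FPR at three thresholds.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_hat = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.55, 0.70, 0.85, 0.95])

def rates(threshold):
    """Return (TPR, FPR) when points with p_hat >= threshold are called positive."""
    y_pred = p_hat >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Lower the threshold step by step and watch both rates rise
for threshold in [0.75, 0.50, 0.25]:
    tpr, fpr = rates(threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each threshold gives one point on the ROC curve. Sweeping the threshold from high to low traces the curve from the lower left to the upper right.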

In [None]:
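# FPR and TPR at every candidate threshold, plus the area under the curve
fpr, tpr, _ = metrics.roc_curve(diabetes_test,  diabetes_pprob[:,1])
auc = metrics.roc_auc_score(diabetes_test, diabetes_pprob[:,1])
plt.plot(fpr,tpr,label=f"logistic regression, AUC = {auc:.3f}")
# the diagonal is the chance level (a classifier that guesses)
plt.plot([0,1],[0,1],'r-')
plt.legend(loc=4)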
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.show()

* The actual choice of threshold is somewhat arbitrary and depends on the importance of TPR and FPR for your classification problem. Typically, if there is no external concern (like death!), one option is to maximize TPR-FPR.  
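
One way to carry out the "maximize TPR-FPR" option is sketched below, again on made-up labels and probabilities rather than the diabetes data. It relies on `metrics.roc_curve` returning the thresholds alongside the two rates. The names `best` and `best_threshold` are only illustrative.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
p_hat = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.55, 0.70, 0.85, 0.95])

# roc_curve returns one (FPR, TPR) pair per candidate threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, p_hat)

# Pick the threshold where TPR - FPR is largest
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]
print(best_threshold, tpr[best], fpr[best])

# Apply the chosen threshold instead of the default 0.5
y_pred = (p_hat >= best_threshold).astype(int)
```

This criterion weights a missed positive and a false alarm equally. When one kind of error is far more costly than the other, a different threshold may be the better choice.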

* We can also (potentially) learn from these models which features were most useful in making the prediction by examining the coefficients of the model. 

In [84]:
model = pd.DataFrame(logreg.coef_,columns = cols[1:8])

In [None]:
model.head()

* The logistic regression model can be written as: 

$$\hat{p}= \dfrac{e^{w^T x}}{1+e^{w^T x}}$$

$$\hat{p}= \dfrac{1}{1+e^{-w^T x}}$$

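where $w$ are the weights returned by the fitted model and $x$ are our features. (scikit-learn also fits an intercept by default; it can be absorbed into $w$ by appending a constant 1 to $x$.)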

The decision boundaries are exactly at the position where the two classes are equiprobable. The boundary decision probability is exactly 0.5. Solving our sigmoid function for $p=0.5$:

$$\hat{p}= \dfrac{1}{1+e^{-w^T x}} = 0.5 =  \dfrac{1}{1+1} $$

$$ e^{-w^T x} = 1$$

$$ -w^T x = 0$$

$$ w^T x = 0$$
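
The derivation can be checked numerically. The sketch below uses synthetic data from `make_classification` as a stand-in for real features (an assumption made only to keep the example self-contained). Because scikit-learn fits an intercept by default, the linear score is written as $w^T x + b$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for real features (illustration only)
X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=0, random_state=16)
clf = LogisticRegression().fit(X, y)

# Linear score w^T x + b, computed by hand from the fitted weights and intercept
z = X @ clf.coef_.ravel() + clf.intercept_[0]

# Passing the score through the sigmoid reproduces predict_proba for class 1
p_hat = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(p_hat, clf.predict_proba(X)[:, 1]))

# Predicting class 1 exactly when the score is positive reproduces predict
print(np.array_equal((z > 0).astype(int), clf.predict(X)))
```

So the default prediction is a threshold of 0.5 on the probability, which is the same as asking on which side of the boundary $w^T x + b = 0$ a point falls.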
 

In [86]:
predictors_train, predictors_test, diabetes_train, diabetes_test = train_test_split(predictors, diabetes, test_size=0.25, random_state=16)

In [87]:
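from sklearn.preprocessing import StandardScaler
ss = StandardScaler() #This initializes the StandardScaler
#Learn the mean and standard deviation of each feature from the training data only
ss.fit(predictors_train)
#Apply the same training-set scaling to both sets; refitting on the test set would leak test statistics
predictors_train = pd.DataFrame(ss.transform(predictors_train), columns = predictors.columns, index = predictors_train.index)
predictors_test = pd.DataFrame(ss.transform(predictors_test), columns = predictors.columns, index = predictors_test.index)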

In [None]:
predictors_train.head()

In [89]:
# instantiate the model (using the default parameters, except random_state and max_iter)
logreg = LogisticRegression(random_state=16,max_iter = 5000)

# fit the model with data
logreg.fit(predictors_train, diabetes_train)

diabetes_pred = logreg.predict(predictors_test)

In [None]:
print(metrics.classification_report(diabetes_test,diabetes_pred,target_names=class_names))


* Now that the predictors are on a common scale, I can examine and compare the coefficients.

In [91]:
model = pd.DataFrame(logreg.coef_,columns = cols[1:8])

In [None]:
model.head()
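
One simple way to read standardized coefficients is to rank them by absolute value. A minimal sketch on synthetic data follows. The feature names `f0` to `f3` are hypothetical and stand in for the real columns, and the `summary` table is just one possible layout.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names (not the Pima columns)
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=16)
features = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

# Standardize so each coefficient is per one standard deviation of its feature
scaled = StandardScaler().fit_transform(features)
clf = LogisticRegression().fit(scaled, y)

# One row per feature: signed coefficient and its magnitude, largest first
summary = pd.DataFrame({
    "coef": clf.coef_.ravel(),
    "abs_coef": np.abs(clf.coef_.ravel()),
}, index=features.columns).sort_values("abs_coef", ascending=False)
print(summary)
```

The sign shows the direction in which a feature pushes the predicted probability, and the magnitude gives a rough sense of its influence. The ranking is only a guide, especially when predictors are correlated with each other.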