# Supervised learning lab day 1


## Exercise 1:

Adapt the code from the **Day1-Block2-Notebook1** to train a logistic regression and a perceptron classifier on the new data set below.

This data set contains person-level data derived from a digitized image of a fine needle aspirate of a breast mass, taken from people with suspected breast cancer. Each record has 11 attributes: a sample ID, nine features scored from 1 to 10, and the diagnosis class. The aim is to predict the patient's diagnosis. The attributes of the data set are:

| #  | Attribute                   | Domain                          |
|----|-----------------------------|---------------------------------|
| 1  | Sample code number          | id number                       |
| 2  | Clump Thickness             | 1 - 10                          |
| 3  | Uniformity of Cell Size     | 1 - 10                          |
| 4  | Uniformity of Cell Shape    | 1 - 10                          |
| 5  | Marginal Adhesion           | 1 - 10                          |
| 6  | Single Epithelial Cell Size | 1 - 10                          |
| 7  | Bare Nuclei                 | 1 - 10                          |
| 8  | Bland Chromatin             | 1 - 10                          |
| 9  | Normal Nucleoli             | 1 - 10                          |
| 10 | Mitoses                     | 1 - 10                          |
| 11 | Class                       | 2 for benign, 4 for malignant   |

More details of the dataset are available at: https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+diagnostic

**Questions (add your answers below each question):**

2.1 How many classes are you trying to predict?
**2 - benign or malignant, denoted by 2 or 4 in the output label**

2.2 Using a train/test split of 70/30, calculate the mean and standard deviation of each feature that will be used in the standard scaler.

2.3 What is the sensitivity and specificity of your classifiers?

2.4 [optional] For the logistic regression, vary the decision threshold to generate an ROC curve. 
- What is the sensitivity when the threshold is 0? 
- What is the sensitivity when the threshold is 1?
- If we want our classifier, a priori, to have a sensitivity of 95%, what is the corresponding specificity?

# Data Import 

In [1]:
# data import 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = pd.read_csv('breast-cancer-wisconsin.csv', header = None, na_values='?')
data.head()

        0  1  2  3  4  5     6  7  8  9  10
0  1000025  5  1  1  1  2   1.0  3  1  1   2
1  1002945  5  4  4  5  7  10.0  3  2  1   2
2  1015425  3  1  1  1  2   2.0  3  1  1   2
3  1016277  6  8  8  1  3   4.0  3  7  1   2
4  1017023  4  1  1  3  2   1.0  3  1  1   2


In [2]:
# separate the features from the labels
y = data.iloc[:,10]
X = data.iloc[:,1:10].copy()  # copy, so the later imputation does not write into a view of data
X.head()

   1  2  3  4  5     6  7  8  9
0  5  1  1  1  2   1.0  3  1  1
1  5  4  4  5  7  10.0  3  2  1
2  3  1  1  1  2   2.0  3  1  1
3  6  8  8  1  3   4.0  3  7  1
4  4  1  1  3  2   1.0  3  1  1


In [3]:
# Imputation

In [4]:
X.isna().sum()

1     0
2     0
3     0
4     0
5     0
6    16
7     0
8     0
9     0
dtype: int64

In [5]:
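cols = [6]  # column 6 (Bare Nuclei) is the only one with missing values
# X.mode() returns a DataFrame of per-column modes; .iloc[0] takes its first row as a Series
X[cols] = X[cols].fillna(X.mode().iloc[0])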

In [6]:
X.isna().sum()

1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64

In [7]:
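# Impute missing data: another way to do it.
# Note: this uses X_train, so it would have to run after the train/test split in the next cell.
#from sklearn.impute import SimpleImputer
#imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
#imp_mean.fit(X_train)
#X_train = imp_mean.transform(X_train)
#X_test = imp_mean.transform(X_test)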


In [8]:
#split into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

In [9]:
#Normalise
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
type(X_train)

numpy.ndarray

In [10]:
scaler.mean_ #show the means for each column

array([4.31288344, 3.05521472, 3.18404908, 2.7607362 , 3.18609407,
       3.49284254, 3.39263804, 2.84253579, 1.55010225])

In [11]:
scaler.var_ #show the variances for each column

array([ 7.93686878,  9.04194111,  8.71459219,  7.84663831,  4.86720949,
       13.27244366,  5.69450613,  9.30649337,  2.70147749])
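Question 2.2 asks for standard deviations, whereas `var_` holds variances. The short sketch below, run on made-up integer data rather than the lab data (the names `X_demo` and `scaler_demo` are illustrative), shows that the fitted scaler also exposes the standard deviations as `scale_`, which equal the square root of `var_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data: 100 samples, 3 features scored 1-10
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 3)).astype(float)

scaler_demo = StandardScaler().fit(X_demo)

feature_means = scaler_demo.mean_   # per-feature mean
feature_stds = scaler_demo.scale_   # per-feature standard deviation

# scale_ is the square root of var_ (population standard deviation, ddof=0)
print(feature_means)
print(feature_stds)
print(np.sqrt(scaler_demo.var_))
```

On the lab data, `scaler.scale_` or `np.sqrt(scaler.var_)` would give the standard deviations that go with `scaler.mean_`.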

In [12]:
# Just to check the normalisation

In [13]:
np.mean(X_train,axis=0)

array([-1.94345789e-16, -9.80810525e-17,  1.08978947e-16, -1.45305263e-17,
       -2.54284210e-17, -2.17957894e-17,  2.90610526e-17,  1.12611579e-16,
       -5.44894736e-18])

In [14]:
np.std(X_train,axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [15]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=10000)
logreg.fit(X_train, y_train)
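The exercise also asks for a perceptron classifier. A minimal sketch of the analogous fit follows; it is self-contained and uses synthetic data from `make_classification` as a stand-in for the lab data, so the names `X_demo`, `y_demo`, `Xtr`, `Xte` and `perc` are illustrative and the printed accuracy says nothing about the breast cancer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 9 features, two classes relabelled 2 and 4 as in the lab data
X_demo, y01 = make_classification(
    n_samples=600, n_features=9, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
y_demo = np.where(y01 == 1, 4, 2)

# Same pipeline as for logistic regression: 70/30 split, then standardise
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)
scaler_demo = StandardScaler().fit(Xtr)

perc = Perceptron(max_iter=1000, tol=1e-3, random_state=0)
perc.fit(scaler_demo.transform(Xtr), ytr)

perc_accuracy = perc.score(scaler_demo.transform(Xte), yte)
print(perc_accuracy)
```

On the lab data the equivalent would be `Perceptron().fit(X_train, y_train)` after scaling, followed by the same `score` and `classification_report` calls used for `logreg`.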

In [16]:
score = logreg.score(X_test,y_test)
print(score)

0.9619047619047619


In [17]:
from sklearn.metrics import classification_report
y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           2       0.97      0.97      0.97       134
           4       0.95      0.95      0.95        76

    accuracy                           0.96       210
   macro avg       0.96      0.96      0.96       210
weighted avg       0.96      0.96      0.96       210



In [18]:
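# In binary classification, recall of the positive class is also known as "sensitivity"; recall of the negative class is "specificity".
# Taking malignant (4) as the positive class, the logistic regression report above gives a sensitivity of 0.95 and a specificity of 0.97.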
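Sensitivity and specificity can also be computed directly from the confusion matrix. The sketch below uses small hand-made label arrays (`y_true_demo` and `y_pred_demo` are illustrative, not lab results) and assumes malignant (4) is the positive class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hand-made labels: 6 benign (2) and 4 malignant (4) samples
y_true_demo = np.array([2, 2, 2, 2, 2, 2, 4, 4, 4, 4])
y_pred_demo = np.array([2, 2, 2, 2, 2, 4, 4, 4, 4, 2])

# With labels=[2, 4], rows/columns are ordered (negative, positive)
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[2, 4]).ravel()

sensitivity = tp / (tp + fn)   # recall of the positive (malignant) class
specificity = tn / (tn + fp)   # recall of the negative (benign) class
print(sensitivity, specificity)

# The same numbers via recall_score, switching the positive label
print(recall_score(y_true_demo, y_pred_demo, pos_label=4))
print(recall_score(y_true_demo, y_pred_demo, pos_label=2))
```

For the lab classifiers, `y_test` and `y_pred` would take the place of the demo arrays.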

In [19]:
## ROC analysis
y_probs = logreg.predict_proba(X_test)

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(logreg, X_test, y_test)
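Question 2.4 can be explored by thresholding the predicted probability of the malignant class directly. The sketch below is not run on the lab data: it uses a synthetic stand-in from `make_classification`, and the names `X_demo`, `y_demo`, `clf` and `sens_spec` are illustrative. It assumes malignant (4) is the positive class and that a sample is flagged positive when its probability is strictly greater than the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data, with classes relabelled 2 and 4
X_demo, y01 = make_classification(
    n_samples=500, n_features=9, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
y_demo = np.where(y01 == 1, 4, 2)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
# Column of predict_proba that corresponds to the malignant class
p_malignant = clf.predict_proba(Xte)[:, list(clf.classes_).index(4)]

def sens_spec(threshold):
    """Sensitivity and specificity when predicting malignant if p > threshold."""
    pred_pos = p_malignant > threshold
    is_pos = yte == 4
    return np.mean(pred_pos[is_pos]), np.mean(~pred_pos[~is_pos])

print(sens_spec(0.0))   # every sample flagged as malignant
print(sens_spec(1.0))   # no sample flagged as malignant

# Specificity at the first ROC point whose sensitivity reaches 95%
fpr, tpr, thresholds = roc_curve(yte, p_malignant, pos_label=4)
idx = np.argmax(tpr >= 0.95)
spec_at_95 = 1.0 - fpr[idx]
print(tpr[idx], spec_at_95)
```

Provided no predicted probability is exactly 0, a threshold of 0 flags every sample as malignant (sensitivity 1, specificity 0), while a threshold of 1 flags none (sensitivity 0, specificity 1). On the lab data, the same steps would apply to `logreg.predict_proba(X_test)` and `y_test`.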


## Exercise 2 (optional):
By adapting the code from **Day1-Block2-Notebook2**, consider the following:

1.1 Generate noisy, simulated data for a model where an input $T$ gives output $Z$ via the relationship $Z = c\sin(T + d)$ (use $c = 0.5$ and $d = -0.25$). **Plot $Z$ vs $T$.**


1.2 Next, using your simulated data, write a stochastic gradient descent procedure to learn the parameters $(c, d)$ from the simulated values of $(T, Z)$, and **plot the training loss over multiple steps**.

In [None]:
### This exercise is for you to attempt yourself. Note that the data-generating model differs from the one in the practice notebook.
### You can adapt the code from the practice notebook and experiment with the parameters, the range of the independent variable (T), the noise variance, etc.
### You can try a simple linear model, as in the practice notebook, but also consider other strategies.
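One possible starting point for Exercise 2 is sketched below in plain NumPy. It is not the practice notebook's code: the noise level, the range of $T$, the initial guess, the learning rate and the number of epochs are all assumed values chosen for illustration. It fits $Z = c\sin(T + d)$ by stochastic gradient descent on the squared error, updating $(c, d)$ one sample at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate noisy data from Z = c*sin(T + d) (noise level and T range are assumptions)
c_true, d_true = 0.5, -0.25
n_samples = 200
T = rng.uniform(-np.pi, np.pi, size=n_samples)
Z = c_true * np.sin(T + d_true) + rng.normal(0.0, 0.05, size=n_samples)

def mse(c, d):
    """Mean squared error of the model over the whole simulated data set."""
    return np.mean((c * np.sin(T + d) - Z) ** 2)

# Initial guess and hyperparameters (assumed values)
c, d = 1.0, 0.0
learning_rate = 0.05
n_epochs = 50

losses = [mse(c, d)]  # loss before training, then one entry per epoch
for epoch in range(n_epochs):
    for i in rng.permutation(n_samples):      # visit samples in random order
        residual = c * np.sin(T[i] + d) - Z[i]
        # Gradients of the per-sample squared error residual**2
        grad_c = 2.0 * residual * np.sin(T[i] + d)
        grad_d = 2.0 * residual * c * np.cos(T[i] + d)
        c -= learning_rate * grad_c
        d -= learning_rate * grad_d
    losses.append(mse(c, d))

print(c, d)
print(losses[0], losses[-1])
```

The plots the exercise asks for can then be drawn with `plt.scatter(T, Z)` for the data and `plt.plot(losses)` for the training loss. Because $-c\sin(T + d + \pi) = c\sin(T + d)$, other starting points may converge to an equivalent parameter pair rather than to $(0.5, -0.25)$.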