## Workshop Week 6

- Student ID **46461019**
- Student Name **Nathan Ho**

## Logistic Regression
The Breast Cancer data from [the UCI repository](http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29) contains one record for each 
observed tumor case. Each record holds a number of measurements and a categorisation in the `class` column: 2 for benign (good), 4 for malignant (bad). Your task is to build a logistic regression model to classify these cases. 

The data is provided as a CSV file. In a small number of cases no value is available; these are indicated in the data with `?`. I have used the `na_values` keyword for `read_csv` to have these interpreted as `NaN` (Not a Number). Your first task is to decide what to do with these rows. You could just drop them or you could [impute the missing values from the other data](http://scikit-learn.org/stable/modules/preprocessing.html#imputation-of-missing-values).

You then need to follow the procedure outlined in the lecture for generating a train/test set, building and evaluating a model. Your goal is to build the best model possible over this data.   Your first step should be to build a logistic regression model using all of the features that are available.
  

In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.feature_selection import RFE

In [18]:
bcancer = pd.read_csv("files/breast-cancer-wisconsin.csv", na_values="?")
bcancer.head(50)
#bcancer = bcancer.drop(columns = ['sample_code_number'])
#the first column is irrelevant
#the column of interest is class

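| | sample_code_number | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10.0 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2.0 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4.0 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10.0 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10.0 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1.0 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1.0 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1.0 | 2 | 1 | 1 | 2 |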


In [19]:
# Examine the data: check number of rows and number of columns
bcancer.shape

(699, 11)

In [20]:
# Look at the statistical summary of the dataframe
bcancer.describe()

| | sample_code_number | clump_thickness | uniformity_cell_size | uniformity_cell_shape | marginal_adhesion | single_epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 683.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | 3.544656 | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | 3.643857 | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 6.0 | 5.0 | 4.0 | 1.0 | 4.0 |
| max | 13454350.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 4.0 |


In [21]:
# Check how many classes we do have from the "class" column
set(bcancer['class'])

{2, 4}

In [22]:
# Check the number of samples for each class and comment on whether the dataset is balanced
classTwo = bcancer[bcancer['class'] == 2]
classFour = bcancer[bcancer['class'] == 4]
print(classTwo.shape)
print(classFour.shape)
print("The dataset is not balanced: there are almost twice as many class 2 samples as class 4 samples")

(458, 11)
(241, 11)
The dataset is not balanced: there are almost twice as many class 2 samples as class 4 samples
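
The class counts above also give a reference point for judging accuracy later: a model that always predicts the majority class would already be right for roughly two thirds of the cases. A minimal sketch using the counts printed above (the variable names are illustrative, not part of the workshop code):

```python
# Class counts taken from the output above: 458 benign (class 2), 241 malignant (class 4).
counts = {2: 458, 4: 241}
total = sum(counts.values())

# Proportion of each class in the full dataset.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(proportions)
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained classifier should comfortably beat this baseline before its accuracy is considered meaningful.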


In [23]:
bcancer.isna().sum() #returns the sum of all NaN values in each column

sample_code_number              0
clump_thickness                 0
uniformity_cell_size            0
uniformity_cell_shape           0
marginal_adhesion               0
single_epithelial_cell_size     0
bare_nuclei                    16
bland_chromatin                 0
normal_nucleoli                 0
mitoses                         0
class                           0
dtype: int64

In [24]:
print(classTwo.isna().sum())
print(classFour.isna().sum())
#we can for each dataset get the mean and fill in the null values

sample_code_number              0
clump_thickness                 0
uniformity_cell_size            0
uniformity_cell_shape           0
marginal_adhesion               0
single_epithelial_cell_size     0
bare_nuclei                    14
bland_chromatin                 0
normal_nucleoli                 0
mitoses                         0
class                           0
dtype: int64
sample_code_number             0
clump_thickness                0
uniformity_cell_size           0
uniformity_cell_shape          0
marginal_adhesion              0
single_epithelial_cell_size    0
bare_nuclei                    2
bland_chromatin                0
normal_nucleoli                0
mitoses                        0
class                          0
dtype: int64


In [25]:
bclean = bcancer.dropna()
bclean.shape

(683, 11)

In [26]:
# Deal with the NaN values in the data

#from sklearn.impute import SimpleImputer
#imp = SimpleImputer(missing_values=np.nan, strategy='mean') #replacing null values with mean
#imp.fit(bcancer[['bare_nuclei']])
#bcancer[['bare_nuclei']] = imp.transform(bcancer[['bare_nuclei']])
#bcancer.head()
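
The two options discussed for the missing `bare_nuclei` values (dropping rows or imputing) can be compared on a small example. This is only a sketch on a toy frame, assuming median imputation as one reasonable choice; the `toy` data and the choice of median rather than mean are illustrative assumptions, not the workshop's prescribed method.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for bcancer; only bare_nuclei has gaps, as in the real data.
toy = pd.DataFrame({
    "bare_nuclei": [1.0, 10.0, np.nan, 2.0, np.nan, 1.0],
    "class": [2, 4, 2, 2, 4, 2],
})

# Option 1: drop incomplete rows (what the notebook does with dropna()).
dropped = toy.dropna()

# Option 2: fill each gap with the column median, which is less sensitive
# to a few extreme scores than the mean.
median_value = toy["bare_nuclei"].median()
imputed = toy.fillna({"bare_nuclei": median_value})

print(dropped.shape)                      # incomplete rows removed
print(imputed["bare_nuclei"].tolist())    # gaps replaced by the median
```

Dropping is simple and loses only 16 of 699 rows here. If imputation is used instead, the fill value is usually computed from the training split only so that no information from the test set leaks into the model.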


In [34]:
# Split your data into training(80%) and testing data (20%) and use random_state=142
train, test = train_test_split(bclean, test_size=0.2, random_state=142)
print(train.shape)
print(test.shape)

(546, 11)
(137, 11)
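
Because the classes are imbalanced, a plain random split can leave the test set with a slightly different class mix from the training set. One possible refinement, sketched here on toy labels rather than the workshop data, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same 2:1 imbalance as the real data.
y = np.array([2] * 60 + [4] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=142, stratify=y
)

# With stratify=y, both splits keep the same share of class 4.
print(len(y_tr), len(y_te))
print((y_tr == 4).mean(), (y_te == 4).mean())
```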


In [37]:
# Features: every column except the ID (sample_code_number) and the target (class)
X_train = train.drop(columns=['sample_code_number', 'class'])
X_test = test.drop(columns=['sample_code_number', 'class'])  # dropping the unwanted columns is simpler than listing all the wanted ones
y_train = train['class']  # training labels
y_test = test['class']  # test labels
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(546, 9)
(137, 9)
(546,)
(137,)


In [38]:
# Build your Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
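
The printed parameters show `solver='warn'`, which suggests that no solver was chosen explicitly and the library fell back to its default for this release. A minimal sketch of passing the solver explicitly, on synthetic data; the `lbfgs` solver, the `max_iter` value and the toy arrays are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Synthetic stand-in: two features on a 1-10 scale; class 4 tends to score higher.
X_benign = rng.randint(1, 5, size=(40, 2))      # values 1..4
X_malignant = rng.randint(6, 11, size=(40, 2))  # values 6..10
X = np.vstack([X_benign, X_malignant])
y = np.array([2] * 40 + [4] * 40)

# Naming the solver avoids relying on a version-dependent default.
model = LogisticRegression(solver="lbfgs", max_iter=1000)
model.fit(X, y)

print(model.classes_)                                   # column order used by predict_proba
print(model.predict([[2, 2], [9, 9]]))                  # hard class labels
print(model.predict_proba([[2, 2], [9, 9]]).round(3))   # per-class probabilities
```

`predict_proba` returns one column per entry of `classes_`, so the second column is the estimated probability of class 4 (malignant).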

In [54]:
# Do predictions on test set
y_test_hat = logreg.predict(X_test) #you will get 137 values
y_train_hat = logreg.predict(X_train)
print(y_test_hat) #predicted values of y
print(y_test) #actual values of y

[2 4 2 2 4 4 2 4 2 4 4 2 2 4 2 2 2 2 2 2 2 2 4 2 2 2 4 2 2 4 4 2 4 4 4 2 2
 4 2 2 2 4 2 4 2 4 2 4 2 2 2 4 2 2 4 2 4 4 4 2 2 4 2 2 4 2 4 2 2 4 2 4 2 2
 2 2 2 4 2 2 4 4 2 2 4 2 2 4 2 4 2 2 2 2 2 4 2 2 2 2 4 4 4 2 2 2 2 2 4 2 2
 4 2 4 2 2 4 4 2 2 2 2 4 4 2 2 4 2 4 2 2 2 2 4 2 4 4]
280    2
232    2
369    2
563    2
491    4
320    4
327    2
270    4
63     4
187    4
50     4
632    2
45     2
304    4
573    2
561    2
679    2
135    2
13     2
384    2
169    2
527    2
167    4
198    2
671    2
502    2
255    4
538    2
470    2
105    4
      ..
589    2
523    4
143    2
602    2
425    4
609    2
218    4
72     2
281    2
279    4
124    4
672    2
510    2
347    2
4      2
247    4
122    4
354    2
667    2
177    4
365    2
233    4
615    2
419    2
432    2
645    2
353    4
307    2
126    4
67     4
Name: class, Length: 137, dtype: int64


### Evaluation

To evaluate a classification model we want to look at how many cases were correctly classified and how many
were in error. In this case we have two outcomes: benign and malignant. scikit-learn has some useful tools: the 
`accuracy_score` function gives a score from 0 to 1 for the proportion correct, and the 
[confusion_matrix](http://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) function 
shows how many were classified correctly and what errors were made. Use these to summarise the performance of 
your model (these functions have already been imported above).
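
As a quick illustration of what these two functions return, here is a small sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical). In the matrix, rows are the actual classes and columns are the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 2 = benign, 4 = malignant.
y_true = [2, 2, 2, 4, 4]
y_pred = [2, 4, 2, 4, 2]

acc = accuracy_score(y_true, y_pred)                 # fraction of matching labels
cm = confusion_matrix(y_true, y_pred, labels=[2, 4]) # rows: actual, columns: predicted

print(acc)
print(cm)
```

Three of the five predictions match, and the off-diagonal entries of the matrix show one benign case predicted malignant and one malignant case predicted benign.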

In [53]:
# Evaluate the performance of your trained model
print("Accuracy of model on training dataset: ", accuracy_score(y_train, y_train_hat)) #accuracy of model on training dataset
print("Accuracy of model on testing dataset: ", accuracy_score(y_test, y_test_hat)) #accuracy of model on testing dataset

Accuracy of model on training dataset:  0.967032967032967
Accuracy of model on testing dataset:  0.9635036496350365


In [63]:
print("Confusion matrix on training dataset: ")
print(confusion_matrix(y_train, y_train_hat))  # rows: actual class, columns: predicted class
print("Confusion matrix on testing dataset: ")
print(confusion_matrix(y_test, y_test_hat))

Confusion matrix on training dataset: 
[[351   8]
 [ 10 177]]
Confusion matrix on testing dataset: 
[[83  2]
 [ 3 49]]


In [None]:
On the test set the confusion matrix is:
83, 2
3, 49
The sum of all entries equals the number of test samples (137).
Sum of the diagonal (83 + 49) / total = accuracy of the model.
The size of the matrix depends on the number of distinct class values.
Rows and columns are ordered by class (2, then 4): columns are predicted, rows are actual.
83 do not have cancer and the model predicts benign.
2 do not have cancer but the model predicts malignant.
3 do have cancer but the model predicts benign.
49 have cancer and the model predicts malignant.

83 5 0 49

83 0 5 49

**This is the checkpoint mark for this week's workshop. You need to report `Accuracy Score` on test set and also show `confusion matrix`. You also need to provide analysis based on the results you got.**
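
For the analysis, the test-set confusion matrix reported above can be turned into a few standard rates. The sketch below assumes malignant (class 4) is treated as the "positive" class; that labelling is a convention chosen here, not something fixed by the workshop.

```python
# Test-set confusion matrix reported above (rows: actual, columns: predicted;
# label order is 2 = benign, 4 = malignant). Malignant is assumed to be "positive".
tn, fp = 83, 2   # actual benign: correctly cleared, wrongly flagged
fn, tp = 3, 49   # actual malignant: missed, correctly detected

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)           # share of malignant cases detected
specificity = tn / (tn + fp)           # share of benign cases correctly cleared
false_negative_rate = fn / (tp + fn)   # share of malignant cases missed
false_positive_rate = fp / (tn + fp)   # share of benign cases wrongly flagged

print(f"accuracy={accuracy:.4f}")
print(f"sensitivity={sensitivity:.4f}, specificity={specificity:.4f}")
print(f"FNR={false_negative_rate:.4f}, FPR={false_positive_rate:.4f}")
```

The accuracy computed this way matches the `accuracy_score` output for the test set, and the separate false negative and false positive rates show which kind of error the model makes more often.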

### Feature Selection

Since you have many features available, one part of building the best model will be to select which features to use as input to the classifier. Your initial model used all of the features but it is possible that a better model can 
be built by leaving some of them out.   Test this by building a few models with subsets of the features - how do your models perform? 

This process can be automated. The [sklearn RFE class](http://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination) implements __Recursive Feature Elimination__, which repeatedly fits the model and removes 
the least important feature until the target number of features remains. Use RFE to select features for a model with 3, 4 and 5 features - can you build a model that is as good or better than your initial model?
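
Besides predicting, a fitted `RFE` object records which features it kept through its `support_` and `ranking_` attributes. The sketch below uses synthetic data with hypothetical column names (`info_a`, `noise_a`, and so on) to show how the selected features can be listed; it is an illustration of the API, not the workshop solution.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 200
y = np.array([2] * (n // 2) + [4] * (n // 2))

# Two informative columns whose mean shifts with the class, three pure-noise columns.
informative = rng.normal(loc=(y == 4)[:, None] * 3.0, scale=1.0, size=(n, 2))
noise = rng.normal(size=(n, 3))
X = np.hstack([informative, noise])
names = np.array(["info_a", "info_b", "noise_a", "noise_b", "noise_c"])

selector = RFE(estimator=LogisticRegression(solver="lbfgs"),
               n_features_to_select=2, step=1)
selector.fit(X, y)

print(names[selector.support_])  # names of the features that were kept
print(selector.ranking_)         # 1 = kept; larger numbers were eliminated earlier
```

With a DataFrame, the same boolean mask can be applied to `X_train.columns` to see which measurements survive for each target number of features.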

In [89]:
clf = LogisticRegression()
for i in range(1, 10):
    # Keep the i highest-ranked features and score the resulting model on the test set
    rfe = RFE(estimator=clf, n_features_to_select=i, step=1)
    rfe.fit(X_train, y_train)
    y_test_hat = rfe.predict(X_test)
    print("Accuracy of model with", i, "features on testing dataset: ", accuracy_score(y_test, y_test_hat))


Accuracy of model with 1 features on testing dataset:  0.9124087591240876
Accuracy of model with 2 features on testing dataset:  0.948905109489051
Accuracy of model with 3 features on testing dataset:  0.9562043795620438
Accuracy of model with 4 features on testing dataset:  0.9635036496350365
Accuracy of model with 5 features on testing dataset:  0.9562043795620438
Accuracy of model with 6 features on testing dataset:  0.9562043795620438
Accuracy of model with 7 features on testing dataset:  0.9562043795620438
Accuracy of model with 8 features on testing dataset:  0.9635036496350365
Accuracy of model with 9 features on testing dataset:  0.9635036496350365




## Conclusion

Write a brief conclusion to your experiment.  You might comment on the proportion of __false positive__ and __false negative__ classifications your model makes.  How useful would this model be in a clinical diagnostic setting? 
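
When weighing false negatives against false positives, it can help to see how the two move as the decision threshold changes. The sketch below uses made-up probabilities and labels (both lists are hypothetical) to count each error type at two thresholds; with a real model the probabilities would come from something like `predict_proba`.

```python
# Hypothetical predicted probabilities of malignancy and true labels (1 = malignant).
probs  = [0.05, 0.10, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95]
actual = [0,    0,    1,    0,    1,    0,    1,    1]

def count_errors(threshold):
    """Return (false_negatives, false_positives) at the given decision threshold."""
    fn = sum(1 for p, a in zip(probs, actual) if a == 1 and p < threshold)
    fp = sum(1 for p, a in zip(probs, actual) if a == 0 and p >= threshold)
    return fn, fp

for threshold in (0.5, 0.25):
    fn, fp = count_errors(threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

In this toy example, lowering the threshold removes the missed malignant case at the cost of one more benign case being flagged, which is the kind of trade-off a clinical setting has to decide on.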

## Commit your finished work on Github
Here is the list of steps you need to follow to commit your work on Github to get the checkpoint mark for this week.

Once you have finished all the above questions, save the notebook by clicking the 'save' button in the toolbar.

You need to follow the same instructions to commit your work on your Github repository.

Step 1: Change your current directory to `practical-workshops-yourName` using the `cd` command. You can type:
                    `cd practical-workshops-yourName`
                    
Step 2: Add your Workshop Week 6.ipynb using:
                 `git add "Workshop Week 6.ipynb"`
                 
Step 3: Commit your work:
                `git commit -m "Finished Workshop 6"`
                
Step 4: Push your changes:
                `git push origin master`
                
Step 5: Confirm whether your finished work is now on Github repository by signing into your Github account and clicking on your repository. You can see your added `Workshop Week 6.ipynb` file as well as your `commit message` and `time` of your commit.

Step 6: Well done! You have now finished your Practical Workshop Week 6. Appreciate yourself.