<h1 align='center'> COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security</h1>

<h2 align='center'> Lab 05 - Data Analysis: Classification </h2>
<h5 align='center'><sub> Author: Afzal Ahmad, 2020; Modified by: Cheng Xue and Taylor Qin, 2022. </sub></h5>

*****
## Aim
Our aims in this lab are to:
- Understand and implement a logistic regression model for classification
- Understand and implement a k-Nearest Neighbour model for classification
- Compare the two classification techniques and understand the capabilities and pitfalls of each

*****

## Learning Outcomes
- L03: Demonstrate basic knowledge and understanding of descriptive and predictive data analysis methods, optimization and search, and knowledge representation.
- L04: Formulate and extract descriptive and predictive statistics from data
- L05: Analyse and interpret results from descriptive and predictive data analysis
- L06: Apply their knowledge to a given problem domain and articulate potential data analysis problems

*****

## Preparation

Before starting this lab, we suggest you complete the following:
- Watch the lectures this week
- Complete Lab04 in particular and become familiar with Scikit-Learn's modules


The following functions may be useful for this lab:

| Function                     | Description |
| ---:                         | :---        |
| `LogisticRegression()`, `KNeighborsClassifier()` | create an instance of a classification module |
| `LabelEncoder()`, `StandardScaler()` | create an instance of a pre-processing module |

We have not included functions described in previous labs (especially those used to fit, predict and score models) as we expect you to be familiar with those.

*****

In [1]:
# imports
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression     # Logistic Regression
from sklearn.neighbors import KNeighborsClassifier      # k-Nearest Neighbours
from sklearn.preprocessing import LabelEncoder          # encoding variables
from sklearn.preprocessing import StandardScaler        # scaling variables
from sklearn.model_selection import train_test_split    # testing our models
from sklearn.metrics import confusion_matrix            # scoring

import matplotlib.pyplot as plt    # plotting, if you need it
import seaborn as sns
plt.style.use('seaborn-v0_8')    # the plain 'seaborn' style name was deprecated in Matplotlib 3.6


### Exercise 1: Not-So-Linear Regression

In 1912, the British passenger liner *RMS Titanic* hit an iceberg and sank. Many of the passengers died, and the event is considered to be one of the deadliest marine disasters. Today, we'll be analysing the statistics of the passengers to understand the factors that led to their survivability. We'd like to **predict (or rather, classify) whether a passenger would live or die** depending on factors such as age, gender and passenger class. 

We will use the data collected from <a href="https://www.kaggle.com/c/titanic">Kaggle</a>. The table below summarises the columns within the data:

| Name           | Description |
| ---:           | :---        |
| `PassengerId`  | an arbitrary ID assigned to each passenger |
| `Survived`     | status of passenger's survival<br>(`0`=No, `1`=Yes) |
| `Pclass`       | passenger's ticket class<br>(`1`=Upper, `2`=Middle, `3`=Lower) |
| `Name`         | full title and name of passenger |
| `Sex`          | gender of passenger |
| `Age`          | age of passenger<br>fractional if less than 1, xx.5 if estimated |
| `SibSp`        | number of siblings and spouses aboard<br>brother / sister / stepbrother / stepsister / husband / wife |
| `Parch`        | number of parents and children aboard<br>mother / father / daughter / son / stepdaughter / stepson |
| `Ticket`       | ticket ID |
| `Fare`         | passenger fare ($) |
| `Cabin`        | cabin number |
| `Embarked`     | port of embarkation<br>(`C`=Cherbourg, `Q`=Queenstown, `S`=Southampton) |

In previous labs, we've given you a lot of guidance on how to deal with data - missing values, choosing your columns, etc. This time we'll give you the freedom (and responsibility) of deciding this for yourself. In making these decisions, feel free to consult classmates, tutors, previous labs and lectures, and online research as necessary.

#### 1.1 Preparing the Data
First, we'll need to **import the data**. The data is located in the file `data/titanic.csv`. Your task is to save it as an object called `titanic` and inspect the first ten rows.

In [2]:
def import_data(url):
    """ 
    Import data from an address.
            Parameters:
                    url (string): File path for the data.
            Returns:
                    data (DataFrame): A dataframe of the data.
    """
    #TODO
    data = pd.read_csv(url, index_col=0)
    return data

def first_ten_rows_inspection(data):
    """ 
    Inspect the first ten rows. 
            Parameters:
                    data (DataFrame): A dataframe of the data.
            Returns:
                    None.
    """
    #TODO
    print(data.head(10))  # a bare expression inside a function displays nothing
    #raise NotImplementedError
    
    
titanic = import_data("data/titanic.csv")
first_ten_rows_inspection(titanic)

What are your first impressions from this data? You may wish to do some further **data exploration and pre-processing** (for example, finding missing values, the distribution of data, descriptive statistics) to help you understand what you're dealing with.

**Note**: Keeping in mind that the goal here is to classify survival, process the data to make it useful for a classification model. If you're not sure what to do, remember that this is very similar to prediction (as you did in the last lab), so the same preparation steps are a good starting point.

Try to consider the following:

- Which columns should you drop?
- What should you do when you encounter an entry with a missing value? 
- Do you need to recode any columns?

We have written a small script to check for the missing values in the data. Feel free to check for other aspects yourself.

In [3]:
titanic.isna().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
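
As one possible starting point, the sketch below shows two common ways of handling gaps like the ones reported above, and one way of recoding a text column. It uses a tiny hand-made DataFrame (`toy`, a hypothetical stand-in, not the real Titanic data), and the choice of median imputation is an assumption made for illustration rather than the required answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with the same kinds of problems as the Titanic table
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Option 1: drop a mostly-empty column, then drop rows that still have gaps
dropped = toy.drop(columns=["Cabin"]).dropna()

# Option 2: keep every row by filling missing ages with the median age
filled = toy.drop(columns=["Cabin"]).copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Recode the text column as integers (classes are sorted alphabetically,
# so 'female' becomes 0 and 'male' becomes 1)
encoder = LabelEncoder()
filled["Sex"] = encoder.fit_transform(filled["Sex"])

print(dropped.shape, filled.shape)
print(filled)
```

Dropping rows is simpler but throws data away; imputing keeps every passenger at the cost of inventing values. Which trade-off is better depends on how many rows are affected.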


In [5]:
def convert_to_num(sex):
    if sex.strip() == 'male':
        return 0
    return 1

In [6]:
def data_preprocessing(data):
    """ 
    Prepare your data - drop unnecessary columns, deal with entries with a missing value, etc.
            Parameters:
                    Original data.
            Returns:
                    Preprocessed data.
    """
    #TODO
    # .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning
    data = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Fare']].copy()
    data['Sex'] = data['Sex'].apply(convert_to_num)
    data = data.dropna()
    return data
    #raise NotImplementedError
    
titanic = data_preprocessing(titanic)
if titanic.isnull().sum().sum() == 0:
    print('Yeah! You have successfully preprocessed your data.')
else:
    print('Not yet! There are some missing values in the data.')

Yeah! You have successfully preprocessed your data.


In [7]:

titanic

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Fare |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 0 | 3 | 0 | 22.0 | 1 | 7.2500 |
| 2 | 1 | 1 | 1 | 38.0 | 1 | 71.2833 |
| 3 | 1 | 3 | 1 | 26.0 | 0 | 7.9250 |
| 4 | 1 | 1 | 1 | 35.0 | 1 | 53.1000 |
| 5 | 0 | 3 | 0 | 35.0 | 0 | 8.0500 |
| ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 3 | 1 | 39.0 | 0 | 29.1250 |
| 887 | 0 | 2 | 0 | 27.0 | 0 | 13.0000 |
| 888 | 1 | 1 | 1 | 19.0 | 0 | 30.0000 |
| 890 | 1 | 1 | 0 | 26.0 | 0 | 30.0000 |


#### 1.2 Logistic Regression Implementation
We'll be using two classification techniques in this lab. The first is **logistic regression** - which is different from the linear regression in the previous lab. It is a powerful tool especially for binary classification. It's perfect for this exercise, because survivability can take either 0 or 1. Have a look at <a href="https://www.youtube.com/watch?v=yIYKR4sgzI8"> this video</a> if you want to learn more about logistic regression.

Have a look at the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">documentation for Scikit-Learn's Logistic Regression module</a>; you'll need to refer to it for this exercise. Alternatively, you can run `help(LogisticRegression)` to view the documentation through Jupyter (which would be useful for in-lab examinations).

You will need to implement the following tasks:

1. First **split your data** into training and testing (with 80% training and 20% testing). 
2. Then **create an instance of the `LogisticRegression()` tool**, 
3. and **fit the data** using the instance and save it to an object called `logres_model`. When creating the instance, use `solver='lbfgs'` and specify `max_iter=1000`. This specifies the method used for optimisation of the model, and allows more iterations for the model to converge.
4. After creating the model, use `logres_model.intercept_` and `logres_model.coef_` to **get the coefficients assigned to each column**. You'll need to match the order of the coefficients to the order of the predictors when you fit the model.
5. Of course, no machine learning model is useful if you can't make predictions with it. Using the test set that you created earlier, **calculate the train and test scores** of the model (rounding to two decimal places). To increase the score, try adding or removing predictors and compare with classmates to see what they got. Note that the scores here are no longer $R^2$, but **mean accuracy**. We'll explain this in more detail later in this lab.

In [8]:
def data_split(data):
    """ 
    Split your data with 80% training and 20% testing.
            Parameters:
                    Original Data.
            Returns:
                    Train data;
                    Test data.
    """
    train, test = train_test_split(data, test_size = 0.2) # TODO
    return train, test

def logistic_regression(data):
    """ 
    Split your data; Create an instance of the LogisticRegression() tool; fit the data;
            Parameters:
                    Original Data.
            Returns:
                    Logistic Regression Instance;
                    Intercept;
                    Coef_dict (dict): A dictionary with the keys to be attribute names, 
                    and the values to be the corresponding coefficients from your model;
                    Train_score (rounding to two decimal places);
                    Test_score (rounding to two decimal places).
    """
    train, test = data_split(data)
    # TODO
    train_x = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']]
    train_y = train['Survived']
    test_x = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']]
    test_y = test['Survived']
    
    lr = LogisticRegression(solver='lbfgs',max_iter=1000)
    logres_model = lr.fit(train_x, train_y)
    intercept = logres_model.intercept_
    coef = {k : co for k, co in zip(train_x.columns, logres_model.coef_[0])}
    train_score = round(logres_model.score(train_x, train_y), 2)
    test_score = round(logres_model.score(test_x, test_y), 2)
    #raise NotImplementedError
    return logres_model, intercept, coef, train_score, test_score

logres_model, intercept, coef, train_score, test_score = logistic_regression(titanic)
print("Intercept :", intercept)
print("Attributes Coefficients Dictionary: ", coef)
print(f"Train Score: {train_score}; Test Score: {test_score}")

Intercept : [2.31996983]
Attributes Coefficients Dictionary:  {'Pclass': -1.1295891552174606, 'Sex': 2.450271366438237, 'Age': -0.03478781492180386, 'SibSp': -0.3773814886948353, 'Fare': 0.0008916027778872573}
Train Score: 0.8; Test Score: 0.81


#### 1.3 Result Analysis

As with prediction, a positive coefficient indicates that a higher predictor value leads to a higher probability of the target variable being 1. For example, you might find that the coefficient for `Pclass` is negative - this is because a lower `Pclass` value (e.g. First Class) leads to a higher chance of survival. **Unlike linear regression, this doesn't translate directly**; a coefficient of 1.5 does not mean a probability increase of 150%. Instead, the model applies a **transformation** to the original linear regression formula: the weighted sum of the predictors is passed through the logistic (sigmoid) function, which squashes it into a probability between 0 and 1. If you're interested in learning more, we encourage you to do some online research. As a starting point, try <a href="https://machinelearningmastery.com/logistic-regression-for-machine-learning/">this link</a>. It's likely that you'll study logistic regression in much further detail in future courses at ANU.
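
To make the transformation concrete, here is a minimal sketch that turns coefficients into a probability by hand. The intercept and coefficients are copied (rounded) from the example output in Section 1.2, so your own numbers will differ; the helper names `sigmoid` and `survival_probability` and the example passenger are hypothetical.

```python
import numpy as np

# Intercept and coefficients copied from the example output in Section 1.2
# (your own values will differ because the train/test split is random)
intercept = 2.31996983
coefs = {"Pclass": -1.1295891552, "Sex": 2.4502713664, "Age": -0.0347878149,
         "SibSp": -0.3773814887, "Fare": 0.0008916028}

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def survival_probability(passenger):
    """Weighted sum of the predictors, passed through the sigmoid."""
    z = intercept + sum(coefs[name] * value for name, value in passenger.items())
    return sigmoid(z)

# Hypothetical passenger: first class, male (Sex=0), aged 22, one sibling/spouse, fare 52
passenger = {"Pclass": 1, "Sex": 0, "Age": 22, "SibSp": 1, "Fare": 52}
p = survival_probability(passenger)

# exp(coefficient) is the multiplicative change in the odds for a one-unit increase
odds_ratio_sex = np.exp(coefs["Sex"])

print(f"P(Survived=1) = {p:.3f}")
print(f"odds ratio for Sex = {odds_ratio_sex:.2f}")
```

The model predicts class 1 whenever this probability exceeds 0.5. Note that a coefficient changes the odds multiplicatively, which is why it cannot be read as a direct change in probability.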

Please answer the following questions in the text box:
1. **Find the coefficient for each predictor and describe its effect** (positive, negative, or insignificant) on survivability. You can (and should) also compare coefficients between predictors (eg. age has a stronger effect than class on survivability).
2. Do you think that Logistic Regression is a suitable model for the titanic data? Is it overfitting or underfitting? Why? You should consider looking at the training and test scores.

#### 1.4 What About Me?
Now here's the important question - would you have survived on the Titanic?

For each predictor in your model, decide what the value would be for you, pretending that you time travelled to 1912. For `Age` and `Sex` this would be easy, but you'll have to guess what your passenger class would be. If you'd instead prefer to predict the survivability of someone else (or in addition to yourself), consider your favourite TV show, movie or game character.

Then, use the `logres_model.predict()` function to find out what your survivability would be. You'll likely run into errors for this function; ensure that the data you give it is in the correct format.
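
As a hint about the expected format, the sketch below fits a throwaway model on a handful of made-up rows (`toy_x`, `toy_y` and the three-column subset are hypothetical, not the lab data) and then predicts for a single new passenger. The point it illustrates is that `predict()` expects a two-dimensional input, with one row per passenger and the columns in the same order as during fitting.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_cols = ["Pclass", "Sex", "Age"]  # hypothetical subset of predictors

# Made-up training rows, for illustration only
toy_x = pd.DataFrame(
    [[1, 1, 30], [3, 0, 40], [2, 1, 25], [3, 0, 20], [1, 1, 50], [3, 0, 33]],
    columns=feature_cols,
)
toy_y = [1, 0, 1, 0, 1, 0]
toy_model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(toy_x, toy_y)

# One passenger is still a table with one row: note the double brackets
me = pd.DataFrame([[1, 1, 28]], columns=feature_cols)

pred = toy_model.predict(me)          # array with one label per row
proba = toy_model.predict_proba(me)   # one row of [P(class 0), P(class 1)]
print(pred, proba)
```

Passing a flat list such as `[1, 1, 28]` raises an error because scikit-learn cannot tell whether it is one passenger with three features or three passengers with one feature each.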


In [9]:
titanic.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Fare |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 0 | 3 | 0 | 22.0 | 1 | 7.25 |
| 2 | 1 | 1 | 1 | 38.0 | 1 | 71.2833 |
| 3 | 1 | 3 | 1 | 26.0 | 0 | 7.925 |
| 4 | 1 | 1 | 1 | 35.0 | 1 | 53.1 |
| 5 | 0 | 3 | 0 | 35.0 | 0 | 8.05 |


In [10]:
pred = logres_model.predict([[1,0,22,1,52]]) # TODO: predict survivability using our logres_model for yourself
print("You could survive! Yeah :)" if pred==1 else "It seemed that you couldn't survive :(")

You could survive! Yeah :)




### Exercise 2: In The Neighbourhood
The second classification technique we'll learn is **k-Nearest Neighbour**, often shortened to kNN. The general idea (at least, for 1-Nearest Neighbour), is that you make the model memorise all the training data, and when you get a new point for prediction, you match it to the "most similar" point in the training set and give it the same label. For kNN, we compare it to the k most similar training points and give it the most common label amongst those points.
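
The idea can be sketched in a few lines of NumPy. This is a bare-bones illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are hypothetical, and Scikit-Learn's own implementation is considerably more sophisticated.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, new_point, k):
    """Label new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(train_x - new_point, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two small clusters: label 0 near the origin, label 1 near (1, 1)
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 0, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.15, 0.15]), k=1))
print(knn_predict(train_x, train_y, np.array([0.95, 0.9]), k=3))
```

There is no training step beyond storing the data; all the work happens at prediction time, when distances to every stored point are computed.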

#### 2.1 Scaling Data, Not Fish
Consider two features, `Pclass` and `Age`, and two points:
1. `Pclass`=1, `Age`=40, `Survived`=1
2. `Pclass`=3, `Age`=20, `Survived`=0

You've likely found that `Pclass` is a far more important predictor than `Age` - passengers with First Class tickets were more likely to board lifeboats, and thus had a higher chance of surviving. However, `Pclass` has a range of 1-3 while `Age` has a range of 0-80. For k-Nearest Neighbours, this means that when we compare a new point such as `Pclass`=1, `Age`=20 (which we would expect to have `Survived`=1), the first point above is at a distance of 20 while the second point is at a distance of only 2. The new point is therefore matched to the second point and labelled `Survived`=0, purely because `Age` spans a wider range.
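
The distances in this example can be checked numerically. The sketch below computes them before and after standardising; fitting a scaler on only two points is done purely for illustration and is not something you would do with real data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns are (Pclass, Age); rows are the two training points from the text
train = np.array([[1.0, 40.0], [3.0, 20.0]])
new_point = np.array([[1.0, 20.0]])

# Distances on the raw values: Age dominates because of its larger range
raw = np.linalg.norm(train - new_point, axis=1)

# Standardise each column using the training points only, then measure again
scaler = StandardScaler().fit(train)
scaled = np.linalg.norm(scaler.transform(train) - scaler.transform(new_point), axis=1)

print("raw distances:   ", raw)
print("scaled distances:", scaled)
```

After scaling, a difference in `Pclass` counts for as much as a difference in `Age`, so neither training point is favoured simply because of the units its features are measured in.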

This is why we need to **scale data**, so that the range of a predictor doesn't affect its distance. To do this, we can use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler">StandardScaler module in Scikit learn</a>.

Because we don't know what the testing data looks like, it would be improper to scale depending on the range of the testing data. So, implement the following:
1. Using a StandardScaler instance, **fit and transform only the training data**, naming the transformed data `train_scaled`. 
2. Then, using the same instance, **transform the testing data separately** (without re-fitting) and name it `test_scaled`.

We print the mean and variance of `train_scaled` and `test_scaled` for you. Even if you get this working properly, you might find that, for the training set, they aren't *exactly* 0 and 1, but any difference is insignificant. You should find a different mean and variance for the scaled testing set; this is because we used the distribution of the training set to scale the testing set.

In [11]:
# TODO: scale data
def data_scaling(train, test):
    """ 
    fit and transform the given data.
            Parameters:
                    Train data;
                    Test data.
            Returns:
                    Scaled train data;
                    Scaled test data.
    """
    # TODO
    train_x = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']]
    train_y = train['Survived']
    test_x = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']]
    test_y = test['Survived']
    
    ss = StandardScaler()
    ss_model = ss.fit(train_x)
    
    train_x_scaled = ss_model.transform(train_x)
    test_x_scaled = ss_model.transform(test_x)
    
    # df = pd.DataFrame(train_scaled)
    return train_x_scaled, test_x_scaled
    #raise NotImplementedError

train, test = data_split(titanic)
train_x_scaled, test_x_scaled = data_scaling(train, test)
print('Scaled train data mean: ', train_x_scaled.mean())
print('Scaled train data variance: ', train_x_scaled.var())
print('Scaled test data mean: ', test_x_scaled.mean())
print('Scaled test data variance: ', test_x_scaled.var())

Scaled train data mean:  -2.1154512273067783e-17
Scaled train data variance:  0.9999999999999999
Scaled test data mean:  0.03159452902937269
Scaled test data variance:  1.0536880766120307


Find the types of `train_scaled` and `test_scaled` - you'll notice that the scaling module doesn't return a Pandas DataFrame. So that we can apply the same machine learning modules as we have before, convert both of these objects back to Pandas DataFrames, and ensure that their columns are named appropriately. (Hint: renaming columns can be done without explicitly typing out each column name.)
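
One way to follow the hint, sketched on a small made-up frame (`toy_x` is hypothetical), is to reuse the column labels of the DataFrame that was passed to the scaler. This assumes Scikit-Learn's default output setting, under which `fit_transform()` returns a NumPy array.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up predictors with a non-default row index, as in the Titanic data
toy_x = pd.DataFrame(
    {"Pclass": [1, 3, 2, 3], "Age": [40.0, 20.0, 31.0, 25.0]},
    index=[10, 11, 12, 13],
)

# The scaler returns a plain NumPy array, so the labels are lost
scaled_array = StandardScaler().fit_transform(toy_x)

# Reuse the original column names (and row index) instead of typing them out
toy_x_scaled = pd.DataFrame(scaled_array, columns=toy_x.columns, index=toy_x.index)

print(type(scaled_array).__name__)
print(toy_x_scaled)
```

Keeping the original index as well as the column names makes it easier to line the scaled predictors up with the unscaled target column later.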

In [12]:
#TODO: convert to DataFrames and name columns
train_x_scaled = pd.DataFrame(train_x_scaled, columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])
test_x_scaled = pd.DataFrame(test_x_scaled, columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])
train_x_scaled

|  | Pclass | Sex | Age | SibSp | Fare |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 0 | -1.505391 | -0.734223 | 1.400824 | 1.566351 | 1.951751 |
| 1 | 0.895666 | -0.734223 | 0.308881 | 0.506651 | -0.382463 |
| 2 | 0.895666 | -0.734223 | -0.407706 | -0.553048 | -0.522825 |
| 3 | 0.895666 | -0.734223 | 1.059592 | -0.553048 | -0.506758 |
| 4 | -0.304863 | -0.734223 | -0.919554 | -0.553048 | -0.155403 |
| ... | ... | ... | ... | ... | ... |
| 566 | -0.304863 | 1.361984 | -0.032351 | 0.506651 | -0.155403 |
| 567 | 0.895666 | 1.361984 | -0.032351 | -0.553048 | -0.251805 |
| 568 | -1.505391 | 1.361984 | -0.441829 | 0.506651 | 1.552928 |
| 569 | -0.304863 | 1.361984 | 1.059592 | 0.506651 | -0.150509 |


#### 2.2 Getting To Know Your Neighbours
Look at the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html">documentation for Scikit-Learn's kNN Classifier</a>, or use `help(KNeighborsClassifier)`.

Let's continue using the Titanic dataset for predicting survival. Just like you did for logistic regression, 
1. **Create an instance of the KNeighborsClassifier**. For now, set `n_neighbors=5` (i.e. $k=5$).
2. Then fit the model and name it `knn_model`. As the instance expects the target variable to have integer values, give it the non-scaled target column for the `y` argument.
3. Now find the **training and testing scores** for this model (rounding to two decimal places). Compare this testing score to the testing score you obtained for logistic regression earlier, and also compare your score with other students.

In [13]:
# TODO: fit KNN classifier
def knn(data):
    """ 
    Split and scale your data using what you wrote before; Create an instance of the KNeighborsClassifier() tool; fit the data;
            Parameters:
                    Original Data.
            Returns:
                    KNN Instance;
                    Train_score (rounding to two decimal places);
                    Test_score (rounding to two decimal places).
    """
    train,test = data_split(data)
    train_x, test_x = data_scaling(train,test)
    
    train_x_scaled = pd.DataFrame(train_x, columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])
    test_x_scaled = pd.DataFrame(test_x, columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])
    
    train_y = train['Survived']
    test_y = test['Survived']
    
    kl = KNeighborsClassifier(n_neighbors=5)
    knn_model = kl.fit(train_x_scaled, train_y)
    
    train_score = round(knn_model.score(train_x_scaled, train_y), 2)
    test_score = round(knn_model.score(test_x_scaled, test_y), 2)
    #raise NotImplementedError
    return knn_model, train_score, test_score

knn_model, train_score_knn, test_score_knn = knn(titanic) 
print("Training Score:", train_score_knn)
print("Test score: ", test_score_knn)

Training Score: 0.85
Test score:  0.78


#### 2.3 How Big Should Our Neighbourhood Be?
Earlier, we used `n_neighbors=5` when creating the kNN instance. Try increasing or decreasing this parameter and see how it affects the model performance. Note that `k` is a hyperparameter of kNN, so to avoid overfitting to the test set, we need to:
1. First **create a validation set** (held out from the training data).
2. **Find the best `k` on the validation set and evaluate the model with the best `k` on the test set**. You can either adjust the code you wrote previously, or copy it here and adjust it. 

In [15]:
# Split into train/test first, then scale using the training distribution only
train, test = data_split(titanic)
train_x, test_x = data_scaling(train, test)

train_x_scaled = pd.DataFrame(train_x, columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])
test_x_scaled = pd.DataFrame(test_x, columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Fare'])

train_y = train['Survived']
test_y = test['Survived']

# Hold out 20% of the training data as a validation set for choosing k
train_x, val_x, train_y, val_y = train_test_split(train_x_scaled, train_y, test_size=0.2)

In [20]:
for i in [1,2,3,4,5,6,7,8,10,20,30,40, train_x.shape[0]]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn_model = knn.fit(train_x, train_y)
    
    train_score = knn_model.score(train_x, train_y)
    val_score = knn_model.score(val_x, val_y)
    
    print(f'for n_neigh = {i} train_score = {train_score} val_score = {val_score}')

for n_neigh = 1 train_score = 0.9824561403508771 val_score = 0.808695652173913
for n_neigh = 2 train_score = 0.8881578947368421 val_score = 0.7739130434782608
for n_neigh = 3 train_score = 0.8706140350877193 val_score = 0.8260869565217391
for n_neigh = 4 train_score = 0.8399122807017544 val_score = 0.7913043478260869
for n_neigh = 5 train_score = 0.8442982456140351 val_score = 0.7913043478260869
for n_neigh = 6 train_score = 0.8333333333333334 val_score = 0.7478260869565218
for n_neigh = 7 train_score = 0.8333333333333334 val_score = 0.782608695652174
for n_neigh = 8 train_score = 0.831140350877193 val_score = 0.7565217391304347
for n_neigh = 10 train_score = 0.8289473684210527 val_score = 0.7565217391304347
for n_neigh = 20 train_score = 0.8179824561403509 val_score = 0.808695652173913
for n_neigh = 30 train_score = 0.7982456140350878 val_score = 0.7913043478260869
for n_neigh = 40 train_score = 0.7828947368421053 val_score = 0.7739130434782608
for n_neigh = 456 train_score = 0.611842

What's the best choice of parameter value? Try NOT to fine-tune it too much (as this can lead to overfitting in your model, and you shouldn't be using the testing score to adjust your model). 

What would happen when we set `n_neighbors=N`, where `N` is the number of entries in the training set? Alternatively, what about `n_neighbors=1`? (Hint: think about what a small difference in the predictor values of a new point would cause.)
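
If you want to check your reasoning about the two extremes, the sketch below tries both on synthetic data generated with `make_classification` (an assumption made so that the example is self-contained; it is not the Titanic data, and the class balance is chosen arbitrarily).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced two-class data (roughly 70% class 0)
X, y = make_classification(n_samples=200, n_features=4,
                           weights=[0.7, 0.3], random_state=0)

# k = 1: each training point is its own nearest neighbour,
# so the model simply memorises the training set
k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("k=1 training accuracy:", k1.score(X, y))

# k = N: every prediction is a vote over the whole training set,
# so the model can only ever answer with the majority class
kN = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)
majority = np.bincount(y).argmax()
print("k=N predicts only:", np.unique(kN.predict(X)), "majority class:", majority)
```

A perfect training score at `k=1` says nothing about how the model behaves on unseen points, which is why the validation score is the one to watch when choosing `k`.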

### Exercise 3: Classifying Flowers
It's likely that you found a higher testing score, and a training score much closer to the testing score, for logistic regression than for kNN. However, as we've mentioned before, (binary) logistic regression has a major pitfall: it can only classify two-class variables.

Of course, we can use an advanced form of logistic regression, called Multinomial Logistic Regression (not to be confused with Multiple Linear Regression), but the theory for that technique is beyond the scope of this course. Instead, we'll simply use **kNN** here for multi-class classification.

Let's go back to the Iris dataset. Your tasks are as follows:

1. **Import and explore the data** (`data/IRIS.csv`) so that you're familiar with it (if you're not already).
2. As you did with the Titanic dataset, **transform the data** as necessary,
3. **split the data** into training and testing (80-20, ensuring that each set is representative of the whole dataset), 
4. **scale the data** according to the training set distribution, 
5. **fit a new model** using a new instance of the kNN classifier, and finally **find the training and testing scores** of the model with this data.

Feel free to reuse the helper functions you wrote previously for these tasks.

In [26]:
def convert_to_num(species):
    if species.strip() == 'Iris-setosa':
        return 0
    elif species.strip() == 'Iris-virginica':
        return 1
    return 2

In [37]:
def data_preprocessing(data):
    """ 
    Prepare your data - drop unnecessary columns, deal with entries with a missing value, etc.
            Parameters:
                    Original data.
            Returns:
                    Preprocessed data.
    """
    #TODO
    # .copy() avoids pandas' SettingWithCopyWarning when the column is recoded
    data = data[['sepal_length','sepal_width', 'petal_length', 'petal_width', 'species']].copy()
    data['species'] = data['species'].apply(convert_to_num)
    data = data.dropna()
    return data
    #raise NotImplementedError
    
# Keep the Iris data in its own variable so the Titanic data is not overwritten
iris = data_preprocessing(pd.read_csv('data/IRIS.csv'))
if iris.isnull().sum().sum() == 0:
    print('Yeah! You have successfully preprocessed your data.')
else:
    print('Not yet! There are some missing values in the data.')

Yeah! You have successfully preprocessed your data.


In [41]:
def data_scaling(train, test):
    """ 
    fit and transform the given data.
            Parameters:
                    Train data;
                    Test data.
            Returns:
                    Scaled train data;
                    Scaled test data.
    """
    # TODO
    train_x = train[['sepal_length','sepal_width', 'petal_length', 'petal_width']]
    test_x = test[['sepal_length','sepal_width', 'petal_length', 'petal_width']]
    
    # Fit on the training predictors only, then apply the same scaling to both sets
    ss = StandardScaler()
    ss_model = ss.fit(train_x)
    
    train_x_scaled = ss_model.transform(train_x)
    test_x_scaled = ss_model.transform(test_x)
    
    return train_x_scaled, test_x_scaled
    #raise NotImplementedError

train, test = data_split(iris)
train_x_scaled, test_x_scaled = data_scaling(train, test)
print('Scaled train data mean: ', train_x_scaled.mean())
print('Scaled train data variance: ', train_x_scaled.var())
print('Scaled test data mean: ', test_x_scaled.mean())
print('Scaled test data variance: ', test_x_scaled.var())

Scaled train data mean:  -3.293661639721298e-16
Scaled train data variance:  1.0000000000000002
Scaled test data mean:  -0.055440940607275536
Scaled test data variance:  1.1119771079625986


In [None]:
# TODO: fit KNN classifier
def knn(data):
    """ 
    Split and scale your data using what you wrote before; Create an instance of the KNeighborsClassifier() tool; fit the data;
            Parameters:
                    Original Data.
            Returns:
                    KNN Instance;
                    Train_score (rounding to two decimal places);
                    Test_score (rounding to two decimal places).
    """
    feature_cols = ['sepal_length','sepal_width', 'petal_length', 'petal_width']
    train,test = data_split(data)
    train_x, test_x = data_scaling(train,test)
    
    # Convert the scaled arrays back to DataFrames with the Iris column names
    train_x_scaled = pd.DataFrame(train_x, columns = feature_cols)
    test_x_scaled = pd.DataFrame(test_x, columns = feature_cols)
    
    train_y = train['species']
    test_y = test['species']
    
    kl = KNeighborsClassifier(n_neighbors=5)
    knn_model = kl.fit(train_x_scaled, train_y)
    
    train_score = round(knn_model.score(train_x_scaled, train_y), 2)
    test_score = round(knn_model.score(test_x_scaled, test_y), 2)
    #raise NotImplementedError
    return knn_model, train_score, test_score

knn_model, train_score_knn, test_score_knn = knn(iris) 
print("Training Score:", train_score_knn)
print("Test score: ", test_score_knn)

In [78]:
# TODO: your code here
data = pd.read_csv('data/IRIS.csv')

data = data[['sepal_length','sepal_width', 'petal_length', 'petal_width', 'species']]
data['species'] = data['species'].apply(convert_to_num)
data = data.dropna()

train, test = train_test_split(data, test_size=0.2, stratify=data['species'])  # keep class proportions the same in both sets

train_x = train[['sepal_length','sepal_width', 'petal_length', 'petal_width']]
train_y = train['species']
test_x = test[['sepal_length','sepal_width', 'petal_length', 'petal_width']]
test_y = test['species']

ss = StandardScaler()
ss_model = ss.fit(train_x)
train_x_scaled = ss_model.transform(train_x)
test_x_scaled = ss_model.transform(test_x)

train_x_scaled = pd.DataFrame(train_x_scaled, columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width'])
test_x_scaled = pd.DataFrame(test_x_scaled, columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width'])

knn = KNeighborsClassifier(n_neighbors=5)
knn_model = knn.fit(train_x_scaled, train_y)
    
train_score = knn_model.score(train_x_scaled, train_y)
test_score = knn_model.score(test_x_scaled, test_y)
print("Training Score:", train_score)
print("Test score: ", test_score)



Training Score: 0.9583333333333334
Test score:  1.0


If you did everything right, you should be getting fairly high scores. Run the code a few times using different, random train-test partitions to get a better understanding of the average score.
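
A rough sketch of what "running it a few times" might look like is given below. It assumes Scikit-Learn's bundled copy of the Iris data (`load_iris`) as a stand-in for `data/IRIS.csv`, and the choice of 20 repetitions is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # stand-in for data/IRIS.csv

test_scores = []
for seed in range(20):
    # A different random, stratified 80-20 partition on every pass
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    # Scale using the training distribution only
    scaler = StandardScaler().fit(train_x)
    model = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(train_x), train_y)
    test_scores.append(model.score(scaler.transform(test_x), test_y))

print(f"mean test accuracy over {len(test_scores)} splits: {np.mean(test_scores):.2f}")
print(f"standard deviation: {np.std(test_scores):.2f}")
```

With only 30 test flowers per split, a single misclassification moves the score by more than three percentage points, so the spread across splits is as informative as the mean.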

### Exercise 4: Confusing You Some More
We've explored the default mean accuracy score (using `model.score()`), but classification also has other important scoring techniques that are useful for diagnosing your model. To start off, let's **create a confusion matrix** using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html">Scikit-Learn's Confusion Matrix module</a>. Using the testing data for the Titanic dataset and the logistic regression model you created, produce a confusion matrix. Ensure that you give the function the right parameter values (use `help(confusion_matrix)` if needed). (Hint: you'll need to use the `model.predict()` function to create `y_pred`.)

In [None]:
# TODO: create confusion matrix


That's a pretty confusing (pun intended) set of numbers there. What do they mean?

The confusion matrix is made up of $n$ columns and $n$ rows, where $n$ is the number of target levels you have (2 for the Titanic dataset). The rows indicate the observations, or actuals, while the columns indicate the predicted, starting from the lowest level. Specifically, if $C$ is the confusion matrix, then $C_{0,0}$ is the **true negatives** (predicted negative, actually negative), $C_{0,1}$ is the **false positives** (predicted positive, actually negative), $C_{1,0}$ is the **false negatives** (predicted negative, actually positive) and $C_{1,1}$ is the **true positives** (predicted positive, actually positive). As we want correct predictions, we want $C_{0,0}$ and $C_{1,1}$ (i.e. values on the main diagonal) to be as large as possible. The documentation for this module also explains this.
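
The layout described above can be seen on a small hand-made example. The labels below are invented for illustration (they are not model output), and the first argument to `confusion_matrix()` is the actual labels while the second is the predictions.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration (0 = did not survive, 1 = survived)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are the actual classes, columns are the predicted classes
C = confusion_matrix(y_true, y_pred)
print(C)

# ravel() flattens the 2x2 matrix row by row: C[0,0], C[0,1], C[1,0], C[1,1]
tn, fp, fn, tp = C.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Swapping the two arguments transposes the matrix, which silently exchanges the false positives and false negatives, so it is worth double-checking the order.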

Re-produce the confusion matrix, but this time save it to four new objects by using `tn, fp, fn, tp = ...` and using the `ravel()` function (the `confusion_matrix()` documentation has an example of this). This will keep a record of each of the four numbers described above. We've provided a (crude) way of showing the confusion matrix counts along with their labels and row/column sums; if you've completed the previous steps correctly you should be able to just run this.

In [80]:
# TODO: save confusion matrix counts

print("                 PREDICTION")
print("                __0_____1__")
print("OBSERVATION  0 |", str(tn).rjust(2), "  ", str(fp).rjust(2), "|", tn+fp)
print("             1 |", str(fn).rjust(2), "  ", str(tp).rjust(2), "|", fn+tp)
print("               ------------")
print("                ", tn+fn, "  ", fp+tp, " ", tn+fp+fn+tp)

                 PREDICTION
                __0_____1__


NameError: name 'tn' is not defined

Now you can compare the predictions and observations. If your model is behaving unexpectedly, you can use the confusion matrix to easily determine whether the model is only predicting one label. Confusion matrices are also important if you especially want to avoid a particular type of incorrect prediction. For example, a cancer screening that incorrectly classifies a person as not having cancer when they do have cancer is life-threatening, so you'd want to alter your model to avoid that.

Now let's calculate a few different scoring metrics:
- **Recall**: TP / (TP + FN). This is the proportion of actual-positive observations that were correctly classified.
- **Precision**: TP / (TP + FP). This is the proportion of positive-predicted observations that were correctly classified.
- **Accuracy**: (TP + TN) / (TP + FP + FN + TN). This is the proportion of all observations that were correctly classified. It is the same value that the `model.score()` function returned earlier.
- **F1**: (2 * Recall * Precision) / (Recall + Precision). This is the harmonic mean of recall and precision, and is generally a more informative metric than accuracy for data that is unbalanced with respect to its target labels.
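As a sanity check on the formulas above, the sketch below computes each metric by hand from made-up counts and compares the results with the corresponding functions in `sklearn.metrics`. The toy labels are hypothetical and the variable names are just for illustration; you should still compute the metrics for your own Titanic model.

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score
)

# Hypothetical toy labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 1, 1]

# Count each outcome directly from the label pairs
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

# Apply the formulas from the list above
recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = (2 * recall * precision) / (recall + precision)

# Compare the hand-computed values with scikit-learn's
print("Recall:   ", recall, recall_score(y_true, y_pred))
print("Precision:", precision, precision_score(y_true, y_pred))
print("Accuracy: ", accuracy, accuracy_score(y_true, y_pred))
print("F1 score: ", f1, f1_score(y_true, y_pred))
```

Note that these formulas divide by zero if the model never predicts the positive class or if there are no actual positives, which is worth keeping in mind for very unbalanced data.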

**Find each of the scores above** for your Titanic logistic regression model.

In [None]:
# TODO: calculate recall, precision, accuracy and F1 scores
recall = None # TODO
prec = None # TODO
acc = None # TODO
f1 = None # TODO
print("Recall:   ", recall,
    "\nPrecision:", prec,
    "\nAccuracy: ", acc,
    "\nF1 Score: ", f1)

Have a look through these scores and understand what they mean for your model. Are these scores fairly similar? If not, how come?

Re-fit the kNN model for the Titanic dataset (especially if you've played around with the `n_neighbors` parameter), and repeat the above steps to **calculate the four metrics for the kNN model**. Compare the two models' metrics.

In [None]:
# TODO: repeat for kNN model


What happens to the confusion matrix if you change the `n_neighbors` parameter to be equal to the size of the training data? Fit the kNN model with `n_neighbors=N`, where `N` is the size of the training set, and view the output of the confusion matrix.

In [None]:
# TODO: change n_neighbors and look at confusion matrix


Check that the output for the confusion matrix matches with your answer to the previous question when you were adjusting the `n_neighbors` parameter.

*****

## Homework & Extension Questions
You will need to complete previous exercises before starting these exercises.

### Exercise 4: Scaled or Un-Scaled?
In an earlier exercise, we showed that scaling was necessary for the kNN classifier. Now, for both the Titanic and Iris datasets, **fit new models using un-scaled data** and compare the predictive scores. What do you find?

In [None]:
# TODO: fit un-scaled kNN model for Titanic and compare


In [None]:
# TODO: fit un-scaled kNN model for Iris and compare


For one of these datasets, you'll find a noticeable difference in performance between the scaled and un-scaled models, while for the other the scores might be more or less the same either way. Investigate the datasets and **figure out why scaling has a larger impact on one model**. *Hint: look at the descriptive statistics for both datasets. Which statistic(s) are most relevant?*

### Exercise 5: Looking In The Grey Area
Logistic regression has another advantage that we haven't mentioned: it can produce a "reliable" *probability* of success, rather than a black-and-white "success or fail" output. While the kNN classifier can also do this (by comparing the targets of its nearest neighbours), this isn't as reliable and it depends heavily on the `n_neighbors` parameter.

Have a look at the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">documentation for the Logistic Regression module</a> again (or use `help(LogisticRegression)`) and find out which function can be used to calculate the probability of success.

Then, repeat the prediction for yourself and/or a character, and **report the probability of survival**. If you had to guess some of the predictor values for that person, try altering them slightly and see how it affects the probability.

In [None]:
# TODO: find probability of survival for yourself and/or a character
