<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week5/Classification_Exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining and Machine Learning - Week 5
# Classification - Exercises

Classification is part of **supervised learning**. The objective is to correctly assign objects to different, predefined categories or labels. An easy-to-understand example is classifying emails as “spam” or “not spam.”

In [1]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

%matplotlib inline

### Load Data

In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week5/data/Sample-Data-Titanic-Survival.csv")
data.head()

Unnamed: 0,Class,Age,Sex,SurvivalStatus
0,1st,"Quantity[29., ""Years""]",female,survived
1,1st,"Quantity[0.9167, ""Years""]",male,survived
2,1st,"Quantity[2., ""Years""]",female,died
3,1st,"Quantity[30., ""Years""]",male,died
4,1st,"Quantity[25., ""Years""]",female,died


In [3]:
# Clean data
data["Age"] = data["Age"].map(lambda x: float(x.strip('Quantity[').split(",")[0].replace('Missing["Not Available"]', "-1.")))
data = data.replace(-1.0, np.nan)
data.head()

Unnamed: 0,Class,Age,Sex,SurvivalStatus
0,1st,29.0,female,survived
1,1st,0.9167,male,survived
2,1st,2.0,female,died
3,1st,30.0,male,died
4,1st,25.0,female,died
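
Note that `str.strip('Quantity[')` removes any of the listed characters from both ends of the string rather than the literal prefix, so the cleaning cell works here mostly because the remaining text starts with a digit. As a hedged alternative, the sketch below extracts the number with a regular expression. The helper name `parse_age` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import re

import numpy as np
import pandas as pd


def parse_age(raw):
    """Return the number inside 'Quantity[<number>, "Years"]', or NaN otherwise."""
    match = re.match(r"Quantity\[([0-9.]+),", raw)
    return float(match.group(1)) if match else np.nan


# Hypothetical sample covering the two formats seen in the raw `Age` column
raw_ages = pd.Series([
    'Quantity[29., "Years"]',
    'Quantity[0.9167, "Years"]',
    'Missing["Not Available"]',
])
ages = raw_ages.map(parse_age)
print(ages)
```

With this approach missing ages become `NaN` directly, so no sentinel value such as `-1.0` has to be replaced afterwards.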


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Class           1309 non-null   object 
 1   Age             1046 non-null   float64
 2   Sex             1309 non-null   object 
 3   SurvivalStatus  1309 non-null   object 
dtypes: float64(1), object(3)
memory usage: 41.0+ KB


In [5]:
data = data.dropna().reset_index(drop=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1046 entries, 0 to 1045
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Class           1046 non-null   object 
 1   Age             1046 non-null   float64
 2   Sex             1046 non-null   object 
 3   SurvivalStatus  1046 non-null   object 
dtypes: float64(1), object(3)
memory usage: 32.8+ KB


## In what follows, try to answer the questions. The results are provided; complete the code (`# [YOUR CODE HERE]` or `...`) to arrive at the same results.

### 1. Create a new DataFrame where you encode the different categorical features as follows:
* Use one-hot encoding for `Class`
* Use label encoding for `Sex` and `SurvivalStatus`
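
To illustrate the difference between the two encodings, here is a minimal sketch on a small, made-up DataFrame (the `toy` data and all variable names are assumptions for illustration; it is not the Titanic data and not necessarily the intended solution).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Class": ["1st", "3rd", "2nd", "1st"],
    "Sex": ["female", "male", "male", "female"],
})

# One-hot encoding: one 0/1 column per category of `Class`
ohe = OneHotEncoder()
class_matrix = ohe.fit_transform(toy[["Class"]]).toarray()
class_onehot = pd.DataFrame(class_matrix, columns=ohe.categories_[0])

# Label encoding: a single integer column, categories sorted alphabetically
le = LabelEncoder()
sex_code = pd.Series(le.fit_transform(toy["Sex"]), name="SexCode")

toy_encoded = pd.concat([toy, class_onehot, sex_code], axis=1)
print(toy_encoded)
```

Because `LabelEncoder` sorts the categories, `female` maps to 0 and `male` to 1. The column labels in this sketch are plain strings; the expected output further down shows tuple-like labels such as `(1st,)`, so the exact labels depend on how the columns are named.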

In [6]:
# One-hot encoding
# [YOUR CODE HERE]

Unnamed: 0,1st,2nd,3rd
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
...,...,...,...
1041,0.0,0.0,1.0
1042,0.0,0.0,1.0
1043,0.0,0.0,1.0
1044,0.0,0.0,1.0


In [None]:
# Label encoding of `Sex`
# [YOUR CODE HERE]

0       0
1       1
2       0
3       1
4       0
       ..
1041    1
1042    0
1043    1
1044    1
1045    1
Name: SexCode, Length: 1046, dtype: int64

In [None]:
# Label encoding of `SurvivalStatus`
# [YOUR CODE HERE]

0       1
1       1
2       0
3       0
4       0
       ..
1041    0
1042    0
1043    0
1044    0
1045    0
Name: SurvivalStatusCode, Length: 1046, dtype: int64

In [None]:
# Concatenate all your DataFrames
data = pd.concat([data, ..., ..., ...], axis=1)
data.head()

Unnamed: 0,Class,Age,Sex,SurvivalStatus,"(1st,)","(2nd,)","(3rd,)",SexCode,SurvivalStatusCode
0,1st,29.0,female,survived,1.0,0.0,0.0,0,1
1,1st,0.9167,male,survived,1.0,0.0,0.0,1,1
2,1st,2.0,female,died,1.0,0.0,0.0,0,0
3,1st,30.0,male,died,1.0,0.0,0.0,1,0
4,1st,25.0,female,died,1.0,0.0,0.0,0,0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1046 entries, 0 to 1045
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Class               1046 non-null   object 
 1   Age                 1046 non-null   float64
 2   Sex                 1046 non-null   object 
 3   SurvivalStatus      1046 non-null   object 
 4   (1st,)              1046 non-null   float64
 5   (2nd,)              1046 non-null   float64
 6   (3rd,)              1046 non-null   float64
 7   SexCode             1046 non-null   int64  
 8   SurvivalStatusCode  1046 non-null   int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 73.7+ KB


### 2. Logistic Regression: part 1

#### 2.1 What is the base rate in this case?
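
Assuming the base rate means the accuracy obtained by always predicting the most frequent class, it can be sketched on a made-up label vector as follows (`y_toy` is a hypothetical example, not the exercise data).

```python
import pandas as pd

# Hypothetical labels: three observations of class 0, two of class 1
y_toy = pd.Series([0, 0, 0, 1, 1])

# Share of each class, then keep the largest share
base_rate = y_toy.value_counts(normalize=True).max()
print(base_rate)
```

Any classifier worth using should reach an accuracy above this value.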

In [11]:
# Base rate
# [YOUR CODE HERE]

0.5917782026768642

#### 2.2 Use logistic regression to predict the `SurvivalStatus` based on `Age` and `Sex`. Display the confusion matrix and the other accuracy measures seen in class.

In [None]:
X = # [YOUR CODE HERE]
y = # [YOUR CODE HERE]

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Logistic regression with 5-fold cross-validation
LR_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=100)

In [None]:
# Fit the model on the training set
LR_cv.fit(..., ...)

In [None]:
# Train accuracy
LR_cv.score(..., ...)

0.7930622009569378

In [None]:
# Test accuracy 
LR_cv.score(..., ...)

0.7238095238095238

In [None]:
# Accuracy measures
y_pred = LR_cv.predict(...)

def evaluate(true, pred):
    # Use the function arguments rather than the global y_test / y_pred
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

evaluate(..., ...)

CONFUSION MATRIX:
[[95 25]
 [33 57]]
ACCURACY SCORE:
0.7238
CLASSIFICATION REPORT:
	Precision: 0.6951
	Recall: 0.6333
	F1_Score: 0.6628
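
The reported scores can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and that class 1 (`survived`) is the positive class.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[95, 25],
               [33, 57]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions over all predictions
precision = tp / (tp + fp)             # predicted survivors who really survived
recall = tp / (tp + fn)                # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

The four values agree with the accuracy, precision, recall and F1 score printed by `evaluate`.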


#### 2.3 What is the prediction for a man aged 50? What is the probability of each class?

In [None]:
# Prediction
# [YOUR CODE HERE]

array([0])

In [None]:
# Probabilities
# [YOUR CODE HERE]

array([[0.76366655, 0.23633345]])
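
To show how `predict` and `predict_proba` relate, here is a minimal sketch on synthetic data. The arrays `X_toy` and `y_toy`, the column order `[Age, SexCode]`, and the plain `LogisticRegression` model are assumptions for illustration, so the numbers will differ from the expected output above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, synthetic data: columns are [Age, SexCode]
X_toy = np.array([[22, 1], [35, 1], [58, 1], [41, 1],
                  [19, 0], [30, 0], [47, 0], [26, 0]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# A single new observation must be 2D: one row, same column order as X_toy
new_passenger = np.array([[50, 1]])

proba = model.predict_proba(new_passenger)  # shape (1, 2), columns follow model.classes_
label = model.predict(new_passenger)        # class with the highest probability

print(model.classes_)
print(proba)
print(label)
```

Each row of `predict_proba` sums to 1, and `predict` returns the class whose probability is largest.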

#### 2.4 What is the prediction for a woman aged 30? What is the probability of each class?


In [None]:
# Prediction
# [YOUR CODE HERE]

array([1])

In [None]:
# Probabilities
# [YOUR CODE HERE]

array([[0.35053728, 0.64946272]])

### 3. Logistic Regression: part 2

#### 3.1 Use logistic regression to predict the `SurvivalStatus` based on all other variables (test size = 0.2). Display the confusion matrix and the other accuracy measures seen in class.

In [None]:
X = # [YOUR CODE HERE]
y = # [YOUR CODE HERE]

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
LR_cv = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=100)
LR_cv.fit(..., ...)

# Accuracy measures
y_pred = LR_cv.predict(...)

def evaluate(true, pred):
    # Use the function arguments rather than the global y_test / y_pred
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")

evaluate(..., ...)

CONFUSION MATRIX:
[[100  20]
 [ 33  57]]
ACCURACY SCORE:
0.7476
CLASSIFICATION REPORT:
	Precision: 0.7403
	Recall: 0.6333
	F1_Score: 0.6826


#### 3.2 What is the prediction for a man aged 50 of the 2nd class? What is the probability of each class?

In [None]:
# Prediction
# [YOUR CODE HERE]

array([0])

In [None]:
# Probabilities
# [YOUR CODE HERE]

array([[0.80274313, 0.19725687]])

#### 3.3 What is the prediction for a woman aged 30 of the 1st class? What is the probability of each class?

In [None]:
# Prediction
# [YOUR CODE HERE]

array([1])

In [None]:
# Probabilities
# [YOUR CODE HERE]

array([[0.19924215, 0.80075785]])