# Classification

---

Classification is closely related to regression, but the target takes values in a discrete set rather than on a continuous scale. In classification, we assign objects, events, or even people to categories based on their similarities or some common criteria. In a regression task we look for the line that best fits the data, whereas in a classification task we look for a line (a decision boundary) that best separates the categories or groups. Unlike regression, the target in classification is not continuous but discrete, taking a finite number of values, such as survived or not survived, defaulted or not defaulted, spam or not spam, and so on. There are three types of classification in the wild:


* **Binary Classification**


* **Multi-Class Classification**


* **Multi-Label Classification**


In this class, we focus only on binary classification. However, you will have a solid foundation to go further down the rabbit hole.


$$
$$


![alt text](images/classification.png "Title")


$$
$$



### Lecture outline

---

* Problem Statement


* Data Description


* EDA - Exploratory Data Analysis


* Data Processing


* Logistic Regression


* Decision Tree Classification


* Random Forest Classification


* Model Performance Assessment

#### Reference


[Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic)


[Discrete Choice Models](https://www.statsmodels.org/stable/examples/index.html#discrete-choice-models)


[sklearn - Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)


[sklearn - Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)


[sklearn - Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [2]:
# For data processing
import pandas as pd
import numpy as np

# For data viz
import matplotlib.pyplot as plt
import seaborn as sns

# For modeling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# For model performance assessment
from sklearn import metrics

In [3]:
sns.set_theme() # Set plotting style (the "seaborn" style name was deprecated in matplotlib 3.6)

## Problem Statement


---


The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Hence the nickname **DieTanic**. It remains one of the most widely remembered disasters in the world.

It cost about $7.5 million to build the Titanic, and it sank after the collision. The Titanic dataset is a very good starting point for beginners in data science and machine learning.

The objective of today's class is to find the factors/features that had a relatively higher or lower impact on people's deaths on the Titanic and to build a model that predicts the probability of dying.


$$
$$


> **Sometimes life has a cruel sense of humor, giving you the thing you always wanted at the worst time possible.**

[Lisa Kleypas](https://en.wikipedia.org/wiki/Lisa_Kleypas)

## Data Description


---

We are given information about a subset of the Titanic passengers and asked to build a predictive model that tells us whether or not a given passenger survived the shipwreck. We are given 10 basic explanatory variables, including passenger gender, age, and fare, among others.


$$
$$


* **PassengerId**: Unique identifier for a passenger


* **Survived**: Survival. 0 = No, 1 = Yes. This is our target variable


* **Pclass**: Ticket class. A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower


* **Name**: Name of the passenger


* **Sex**: Gender of the passenger


* **Age**: Age in years. Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5


* **SibSp**: Number of siblings / spouses aboard the Titanic. The dataset defines family relations in this way:


    * Sibling = brother, sister, stepbrother, stepsister

    * Spouse = husband, wife (mistresses and fiancés were ignored)


* **Parch**: Number of parents / children aboard the Titanic. The dataset defines family relations in this way:
    
    
    * Parent = mother, father
    
    * Child = daughter, son, stepdaughter, stepson
    
    * Some children travelled only with a nanny, therefore parch=0 for them.


* **Ticket**: Ticket number


* **Fare**: Passenger fare


* **Cabin**: Cabin number


* **Embarked**: Port of Embarkation. C - Cherbourg, Q - Queenstown, S - Southampton

# See also


https://www.kaggle.com/ash316/eda-to-prediction-dietanic


https://www.kaggle.com/mnassrib/titanic-logistic-regression-with-python


https://www.kaggle.com/zlatankr/titanic-random-forest-82-78



https://www.statsmodels.org/stable/discretemod.html



https://www.statsmodels.org/stable/examples/notebooks/generated/discrete_choice_overview.html



https://www.statsmodels.org/stable/examples/notebooks/generated/discrete_choice_example.html

In [4]:
df = pd.read_csv("data/train.csv")

In [5]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## EDA - Exploratory Data Analysis


---


The goal of this section is to gain an understanding of our data in order to do proper feature engineering and modeling.

In [12]:
df.describe().iloc[:, 1:].round(2)

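|   | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.38 | 2.31 | 29.7 | 0.52 | 0.38 | 32.2 |
| std | 0.49 | 0.84 | 14.53 | 1.1 | 0.81 | 49.69 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.12 | 0.0 | 0.0 | 7.91 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.45 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.33 |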


In [7]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [10]:
df.shape

(891, 12)

### Checking Missing Values


---

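The `df.info()` output above already hints at missing values: `Age`, `Cabin`, and `Embarked` have fewer than 891 non-null entries. Let's count them per column.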

In [14]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
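
Raw counts are easier to judge as a share of all rows. The sketch below is only an illustration: the `missing_counts` Series is re-typed by hand from the output above so that the snippet runs on its own. In the notebook itself, `df.isnull().mean() * 100` should give the same shares directly.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output above
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of rows reported by df.shape

# Share of missing values per column, in percent
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

Roughly a fifth of the `Age` values and more than three quarters of the `Cabin` values are missing, which matters when deciding whether to impute a column or drop it.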

### Univariate Analysis


---

As before, we investigate each variable one by one to check for anomalies.
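
One possible way to look at single variables is sketched below. To keep the snippet runnable on its own, `sample` re-types the five rows shown by `df.head()`; in the notebook you would use `df` instead. The 1.5 * IQR rule is an assumed, common convention for flagging outliers, not something prescribed by this lecture.

```python
import pandas as pd

# The five rows shown by df.head(), re-typed so the snippet runs alone
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Categorical variable: frequency of each category
sex_counts = sample["Sex"].value_counts()
print(sex_counts)

# Numeric variable: flag values outside the 1.5 * IQR fences
q1, q3 = sample["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (sample["Fare"] < q1 - 1.5 * iqr) | (sample["Fare"] > q3 + 1.5 * iqr)
outliers = sample[is_outlier]
print(outliers)
```

On the full data, `sns.histplot` for numeric columns and `sns.countplot` for categorical ones give the same information visually.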

### Bivariate Analysis

---

Here we look at how each variable relates to the target variable, `Survived`.
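
A minimal sketch of relating a feature to the target, using group means and a cross-tabulation. As an assumption made only for self-containment, `sample` re-types three columns of the five rows shown by `df.head()`; five rows are far too few to draw conclusions, so treat the numbers as a demonstration of the calls, not as findings.

```python
import pandas as pd

# Three columns of the five rows shown by df.head(), re-typed by hand
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Because Survived is coded 0/1, its group mean is the survival rate
survival_by_sex = sample.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# Counts of survivors and non-survivors within each ticket class
class_vs_survival = pd.crosstab(sample["Pclass"], sample["Survived"])
print(class_vs_survival)
```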

## Data Processing


---

Here we apply some variable transformations to prepare the data for modeling.
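
The transformations below are one plausible set, not necessarily the ones this lecture ends up using: median imputation for `Age`, mode imputation for `Embarked`, dropping the mostly empty `Cabin`, and numeric encoding of the categorical columns. The helper name `preprocess` and the rows in `toy` are hypothetical and exist only to make the sketch runnable.

```python
import numpy as np
import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a model-ready copy of a Titanic-like frame (illustrative choices)."""
    out = raw.copy()
    # Fill missing ages with the median age
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Fill missing ports with the most frequent port
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Cabin is mostly missing, so drop it
    out = out.drop(columns=["Cabin"])
    # Encode Sex as 0/1 and one-hot encode Embarked
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], drop_first=True)
    return out

# Made-up rows (not taken from the dataset), only to exercise the function
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})
clean = preprocess(toy)
print(clean)
```

In the notebook the call would be `preprocess(df)`. Strictly speaking, imputation values should be computed on the training split only, to avoid leaking information from the test split.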

## Logistic Regression


---


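Binary logistic regression models the probability that a binary target equals 1 (here, `Survived` = 1) as a logistic (sigmoid) function of a linear combination of the features. The predicted probability is turned into a class label by comparing it with a threshold, commonly 0.5.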
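
For reference, the model can be written compactly. The notation is assumed here rather than taken from the lecture: $x$ is the feature vector, $\beta_0$ the intercept, $\beta$ the coefficient vector, and $y \in \{0, 1\}$ the target.

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}
\quad\Longleftrightarrow\quad
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta^\top x
$$

The second form says that the log-odds are linear in the features, which is why each coefficient is read as the change in log-odds per unit change of its feature.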




$$
$$

![alt text](images/logistic_regression.jpeg "Title")

### Sklearn

---

Sklearn implementation of logistic regression
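
A minimal sketch of the scikit-learn workflow. Note the assumptions: the data are synthetic and generated from a made-up process, standing in for the processed Titanic features, so the fitted numbers say nothing about the real passengers. The column names and the settings `max_iter=1000` and `test_size=0.25` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed features (NOT the real dataset)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=n),     # 0 = male, 1 = female
    "Pclass": rng.integers(1, 4, size=n),  # 1, 2 or 3
    "Age": rng.uniform(1, 80, size=n),
})
# Assumed data-generating process, chosen only for illustration
log_odds = -0.5 + 2.5 * X["Sex"] - 0.8 * (X["Pclass"] - 2) - 0.01 * X["Age"]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Hold out a test set, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # estimated P(Survived = 1)
accuracy = model.score(X_test, y_test)     # share of correct labels
print(dict(zip(X.columns, model.coef_[0].round(2))), round(accuracy, 3))
```

Keep in mind that `LogisticRegression` applies L2 regularization by default, so its coefficients are generally not identical to the unregularized maximum-likelihood estimates reported by statsmodels.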

### Statsmodels


---

Statsmodels implementation of the logistic regression

### Interpretation of the results


---

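Logistic regression coefficients are on the log-odds scale, which makes them hard to read directly. Marginal effects express instead how the predicted probability changes when a feature changes by one unit.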

## Decision Tree Classification


---


Decision Trees can also help a lot when we need to understand the data. A good example is the traditional problem of classifying Iris flowers included in the sklearn documentation, where we can learn about the characteristics of each flower type from the resulting tree. Given their transparency and relatively low computational cost, Decision Trees are also very useful for exploring your data before applying other algorithms. They're helpful for checking the quality of engineered features and identifying the most relevant ones by visualising the resulting tree.

The main downsides of Decision Trees are their tendency to over-fit, their difficulty capturing some relationships between features (for example, simple linear ones, which axis-aligned splits can only approximate in steps), and the use of greedy learning algorithms (not guaranteed to find the globally optimal model). Using them in a Random Forest helps mitigate some of these issues.

After this short introduction to Decision Trees and their place in Machine Learning, let's see how to apply them to the Titanic challenge. First, we're going to prepare the dataset and discuss the most relevant features. We'll then find the best tree depth to avoid over-fitting, generate the final model, and explain how to visualise the resulting tree.
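
The depth search described above could be sketched as follows. Assumptions: the data are synthetic (a made-up process standing in for the processed Titanic features), the depth grid `range(1, 8)` and 5-fold cross-validation are arbitrary illustrative choices, and `export_text` is used as a plain-text alternative to a graphical tree plot.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the processed features (NOT the real dataset)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=n),
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
})
# Assumed data-generating process, chosen only for illustration
log_odds = -0.5 + 2.5 * X["Sex"] - 0.8 * (X["Pclass"] - 2) - 0.01 * X["Age"]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Mean cross-validated accuracy for each candidate depth
depths = range(1, 8)
cv_scores = {
    d: cross_val_score(
        DecisionTreeClassifier(max_depth=d, random_state=0), X, y, cv=5
    ).mean()
    for d in depths
}
best_depth = max(cv_scores, key=cv_scores.get)

# Refit on all data with the selected depth and print the split rules
tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
print(best_depth)
print(export_text(tree, feature_names=list(X.columns)))
```

Limiting `max_depth` is only one way to control over-fitting; `min_samples_leaf` and cost-complexity pruning (`ccp_alpha`) are alternatives available in the same class.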



$$
$$

![alt text](images/decision_tree.png "Title")

https://www.kaggle.com/masumrumi/decision-tree-with-titanic-dataset

https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset

## Random Forest Classification


---

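A Random Forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the data and considers only a random subset of the features at each split; the forest then combines the trees' predictions, which reduces the over-fitting of any single tree.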
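
A minimal sketch of fitting a random forest with scikit-learn. As before, the data are synthetic and generated from a made-up process, and the hyperparameters (`n_estimators=200`, `max_depth=4`) are illustrative assumptions rather than tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the processed features (NOT the real dataset)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=n),
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
})
# Assumed data-generating process, chosen only for illustration
log_odds = -0.5 + 2.5 * X["Sex"] - 0.8 * (X["Pclass"] - 2) - 0.01 * X["Age"]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# oob_score=True evaluates each tree on the rows left out of its bootstrap sample
forest = RandomForestClassifier(
    n_estimators=200, max_depth=4, oob_score=True, random_state=0
)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
print(round(forest.oob_score_, 3))
```

The out-of-bag score gives a rough estimate of generalization accuracy without a separate validation split, which is convenient for a dataset as small as this one.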




$$
$$

![alt text](images/random_forest.png "Title")

## Model Performance Assessment


---

Confusion matrix, ROC-AUC, and some other metrics
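
A small worked example of these metrics with `sklearn.metrics`. The labels and scores below are made up by hand so that the results can be checked on paper; in the notebook, `y_true` would be the test labels and `y_score` the output of `predict_proba(...)[:, 1]`. The 0.5 threshold is an assumed default.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities (not model output)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.70, 0.90, 0.30])

# Turn probabilities into labels with a 0.5 threshold
y_pred = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [2 2]]

accuracy = metrics.accuracy_score(y_true, y_pred)    # (TN + TP) / all = 5/8
precision = metrics.precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = metrics.recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/4

# ROC-AUC uses the scores, not the thresholded labels: it is the share of
# (positive, negative) pairs in which the positive gets the higher score
auc = metrics.roc_auc_score(y_true, y_score)         # 11/16 = 0.6875
print(accuracy, precision, recall, auc)
```

Accuracy, precision, and recall depend on the chosen threshold, whereas ROC-AUC summarizes the ranking quality over all thresholds.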

## Summary

---
