## Titanic Classification model
## Decision Tree Example

The titanic.csv file contains data for 887 real Titanic passengers. Each row represents one person. The columns describe attributes of that person, including whether they survived (`Survived`), their age (`Age`), their passenger class (`Pclass`), their sex (`Sex`), and the fare they paid (`Fare`).

https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html

## Part 1 : Load data

In [None]:
url = 'https://raw.githubusercontent.com/ketnas/homework/main/AI/titanic.csv'

In [None]:
import pandas as pd
titanic = pd.read_csv(url)
titanic.head()

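|   | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 |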


## Part 2: Preprocess data

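Preprocess the dataset before training. Categorical columns can be encoded with a label encoder (if necessary) or a one-hot encoder; the article below compares the two.

Label Encoder vs. One Hot Encoder in Machine Learning: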

https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621
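
As a rough illustration of the difference, the sketch below encodes a small made-up column both ways with pandas. The toy values and the variable names (`toy`, `label_codes`, `one_hot`) are assumptions for illustration and are not rows from the Titanic data.

```python
import pandas as pd

# Toy column, made up for illustration (not rows from titanic.csv)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Label encoding: each category becomes one integer code.
# Categories are sorted, so 'female' -> 0 and 'male' -> 1.
label_codes = toy['Sex'].astype('category').cat.codes

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(toy['Sex'], prefix='Sex', dtype=int)

print(label_codes.tolist())
print(one_hot)
```

One-hot encoding avoids implying an order among categories, at the cost of extra columns.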

### Convert categorical variables into dummy columns

In [None]:
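# One-hot encode Pclass into Pclass_1..Pclass_3, then drop the original column
titanic = pd.concat([titanic, pd.get_dummies(titanic['Pclass'], prefix='Pclass')], axis=1)
titanic = titanic.drop(columns=['Pclass'])
# One-hot encode Sex into Sex_female/Sex_male, then drop the original column
titanic = pd.concat([titanic, pd.get_dummies(titanic['Sex'], prefix='Sex')], axis=1)
titanic = titanic.drop(columns=['Sex'])
# Optional: one-hot encode the siblings/spouses count as well
# titanic = pd.concat([titanic,pd.get_dummies(titanic['Siblings/Spouses Aboard'],prefix='Sib')],axis=1)
# titanic = titanic.drop(columns=['Siblings/Spouses Aboard'])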
titanic.head()

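|   | Survived | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 7.925 | 0 | 0 | 1 | 1 | 0 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 1 |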


In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Survived                 887 non-null    int64  
 1   Age                      887 non-null    float64
 2   Siblings/Spouses Aboard  887 non-null    int64  
 3   Parents/Children Aboard  887 non-null    int64  
 4   Fare                     887 non-null    float64
 5   Pclass_1                 887 non-null    uint8  
 6   Pclass_2                 887 non-null    uint8  
 7   Pclass_3                 887 non-null    uint8  
 8   Sex_female               887 non-null    uint8  
 9   Sex_male                 887 non-null    uint8  
dtypes: float64(2), int64(3), uint8(5)
memory usage: 39.1 KB


### Train/Test separation

Split the data using the hold-out method:
- 60% training set
- 40% testing set

In [None]:
titanic_train = titanic.sample(frac = 0.6)
titanic_test = titanic.drop(titanic_train.index)
print(pd.crosstab(titanic_train['Survived'],columns = 'count'))
print(pd.crosstab(titanic_test['Survived'],columns = 'count'))

col_0     count
Survived       
0           318
1           214
col_0     count
Survived       
0           227
1           128
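
Because `sample` is called without a seed, the counts above change on every run, and the share of survivors can differ between the two sets (here roughly 40% in training versus 36% in testing). One possible variant, sketched on a made-up frame, fixes `random_state` and samples within each `Survived` group so that both sets keep the same class ratio. The toy frame, the seed value, and the names `toy`, `toy_train`, and `toy_test` are assumptions for illustration.

```python
import pandas as pd

# Made-up frame: 30 non-survivors and 20 survivors (not the real data)
toy = pd.DataFrame({'Survived': [0] * 30 + [1] * 20,
                    'Fare': range(50)})

# Sample 60% of each class separately; the seed makes the split repeatable
toy_train = toy.groupby('Survived').sample(frac=0.6, random_state=0)
toy_test = toy.drop(toy_train.index)

print(toy_train['Survived'].value_counts().sort_index())
print(toy_test['Survived'].value_counts().sort_index())
```

With this approach the survivor share is 40% in both the 30-row training set and the 20-row test set of the toy frame.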


##### X/Y separation

In [None]:
titanic_train_y = titanic_train['Survived']
titanic_train_X = titanic_train.copy()
del titanic_train_X['Survived']

titanic_test_y = titanic_test['Survived']
titanic_test_X = titanic_test.copy()
del titanic_test_X['Survived']

## Part 3: Train a decision tree model

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_leaf=30, max_depth=5)
clf = clf.fit(titanic_train_X, titanic_train_y)
print(clf)

DecisionTreeClassifier(max_depth=5, min_samples_leaf=30)


### Tree Visualization

You must install Graphviz before running the following code: both the Graphviz system package (which provides the `dot` executable) and the `graphviz` Python package.

In [None]:
import graphviz
dot_data = tree.export_graphviz(clf, out_file=None,
                              feature_names=titanic_train_X.columns,
                              class_names=['0','1'],
                              filled=True, rounded=True,
                              special_characters=True, rotate=True)
graph = graphviz.Source(dot_data)
graph.render('dtree_render')

'dtree_render.pdf'
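
If Graphviz is not available, a plain-text view of the tree is one alternative. The sketch below uses `sklearn.tree.export_text` on a tiny made-up dataset; the data, the feature names (`is_female`, `age`), and the variable names are assumptions for illustration.

```python
from sklearn import tree

# Made-up data: two features, and the label simply copies the first one
toy_X = [[0, 30], [0, 45], [1, 22], [1, 60], [0, 18], [1, 35]]
toy_y = [0, 0, 1, 1, 0, 1]

toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
toy_clf.fit(toy_X, toy_y)

# export_text returns the learned rules as an indented string
rules = tree.export_text(toy_clf, feature_names=['is_female', 'age'])
print(rules)
```

The same call should work on `clf` from the cell above by passing `feature_names=list(titanic_train_X.columns)`.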

### Variable importance

In [None]:
tree_feature = pd.DataFrame({'feature':titanic_train_X.columns,
                             'Score':clf.feature_importances_})

tree_feature.sort_values(by = 'Score', ascending=False)

|   | feature | Score |
|---|---|---|
| 8 | Sex_male | 0.670901 |
| 6 | Pclass_3 | 0.177864 |
| 3 | Fare | 0.116662 |
| 0 | Age | 0.034573 |
| 1 | Siblings/Spouses Aboard | 0.0 |
| 2 | Parents/Children Aboard | 0.0 |
| 4 | Pclass_1 | 0.0 |
| 5 | Pclass_2 | 0.0 |
| 7 | Sex_female | 0.0 |


## Part 4: Model Evaluation

Evaluation metrics

- confusion matrix
- accuracy
- precision, recall, f1-score

In [None]:
# confusion matrix
res = clf.predict(titanic_test_X)
pd.crosstab(titanic_test_y, res)

| Survived (actual) / col_0 (predicted) | 0 | 1 |
|---|---|---|
| 0 | 209 | 18 |
| 1 | 55 | 73 |


In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print("Accuracy:\t %.3f" %accuracy_score(titanic_test_y, res))
print(classification_report(titanic_test_y, res))

Accuracy:	 0.794
              precision    recall  f1-score   support

           0       0.79      0.92      0.85       227
           1       0.80      0.57      0.67       128

    accuracy                           0.79       355
   macro avg       0.80      0.75      0.76       355
weighted avg       0.80      0.79      0.78       355
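
The figures in this report can be reproduced by hand from the confusion matrix shown earlier (209, 18, 55, 73). The following sketch does that arithmetic in plain Python for class 1 (survived); the variable names are chosen here for illustration.

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 209, 18   # actual 0: predicted 0, predicted 1
fn, tp = 55, 73    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.794 precision=0.80 recall=0.57 f1=0.67
```

These values match the accuracy line and the class 1 row of the report.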




## Part 5: Model tuning

#### Note:

After building the decision tree, try answering the following questions:

1. What is the accuracy score?
2. If we change the preprocessing steps, do we get better results?
3. If we change some of the model's default parameters, do we get better results?

#### Examples of parameters we can tune
- max_leaf_nodes
    - Reduce the number of leaf nodes
- min_samples_leaf
    - Restrict the minimum number of samples in a leaf
    - The minimum sample size in terminal nodes can be fixed to 30, 100, 300, or 5% of the total
- max_depth
    - Reduce the depth of the tree to build a more general tree
    - Set the depth of the tree to 3, 5, or 10, depending on verification against test data
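
One way to try several combinations of these parameters systematically is a cross-validated grid search on the training data. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Titanic set, and the grid values are picked arbitrarily from the ranges mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic set)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Candidate values taken from the ranges listed above
param_grid = {
    'max_depth': [3, 5, 10],
    'min_samples_leaf': [10, 30, 100],
    'max_leaf_nodes': [None, 8, 16],
}

# 5-fold cross-validation on the training data, scored by accuracy
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Tuning by cross-validation on the training set keeps the test set untouched for the final evaluation.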

In [None]:
clf = tree.DecisionTreeClassifier(min_samples_leaf=15, max_depth=10)
clf = clf.fit(titanic_train_X, titanic_train_y)
print(clf)

DecisionTreeClassifier(max_depth=10, min_samples_leaf=15)


In [None]:
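# confusion matrix
res = clf.predict(titanic_test_X)
# Note: this table is not displayed because it is not the cell's last expression;
# wrap it in print() to show it.
pd.crosstab(titanic_test_y, res)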

print("Accuracy:\t %.3f" %accuracy_score(titanic_test_y, res))
print(classification_report(titanic_test_y, res))

Accuracy:	 0.803
              precision    recall  f1-score   support

           0       0.84      0.85      0.85       227
           1       0.73      0.71      0.72       128

    accuracy                           0.80       355
   macro avg       0.79      0.78      0.78       355
weighted avg       0.80      0.80      0.80       355



## Activity 1:
Try setting the parameters as follows:

* min_samples_leaf = 10
* max_depth = 5

What result do you get?

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(min_samples_leaf=10, max_depth=5)
clf = clf.fit(titanic_train_X, titanic_train_y)
print(clf)

DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)


In [None]:
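# confusion matrix
res = clf.predict(titanic_test_X)
# Note: this table is not displayed because it is not the cell's last expression;
# wrap it in print() to show it.
pd.crosstab(titanic_test_y, res)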

print("Accuracy:\t %.3f" %accuracy_score(titanic_test_y, res))
print(classification_report(titanic_test_y, res))

Accuracy:	 0.831
              precision    recall  f1-score   support

           0       0.85      0.89      0.87       227
           1       0.79      0.72      0.75       128

    accuracy                           0.83       355
   macro avg       0.82      0.81      0.81       355
weighted avg       0.83      0.83      0.83       355



# Random Forest Example

Parts 1 and 2 are the same as in the decision tree example, so we go straight to Part 3.

## Part 3: Train a random forest model

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_random = RandomForestClassifier(random_state=0)
clf_random = clf_random.fit(titanic_train_X, titanic_train_y)

print(clf_random)

RandomForestClassifier(random_state=0)


### Variable importance

In [None]:
tree_feature = pd.DataFrame({'feature':titanic_train_X.columns,
                             'Score':clf_random.feature_importances_})

tree_feature.sort_values(by = 'Score', ascending=False)

|   | feature | Score |
|---|---|---|
| 3 | Fare | 0.277696 |
| 0 | Age | 0.260231 |
| 7 | Sex_female | 0.14974 |
| 8 | Sex_male | 0.139641 |
| 1 | Siblings/Spouses Aboard | 0.052536 |
| 6 | Pclass_3 | 0.052128 |
| 2 | Parents/Children Aboard | 0.030895 |
| 4 | Pclass_1 | 0.024617 |
| 5 | Pclass_2 | 0.012517 |


## Part 4: Model Evaluation

Evaluation metrics

- confusion matrix
- accuracy
- precision, recall, f1-score

In [None]:
# confusion matrix
res = clf_random.predict(titanic_test_X)
pd.crosstab(titanic_test_y, res)

| Survived (actual) / col_0 (predicted) | 0 | 1 |
|---|---|---|
| 0 | 202 | 25 |
| 1 | 36 | 92 |


In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print("Accuracy:\t %.3f" %accuracy_score(titanic_test_y, res))
print(classification_report(titanic_test_y, res))

Accuracy:	 0.828
              precision    recall  f1-score   support

           0       0.85      0.89      0.87       227
           1       0.79      0.72      0.75       128

    accuracy                           0.83       355
   macro avg       0.82      0.80      0.81       355
weighted avg       0.83      0.83      0.83       355



# Assignment 8

## Part 5: Model tuning
Try two new parameter settings, as listed below, and compare the results.

##### Tuning Parameters 1
- max_depth = 10
- criterion = 'entropy'

##### Tuning Parameters 2
- max_depth = 5
- criterion = 'gini'

The results to show are the confusion matrix and the classification report.
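
For reference, the two `criterion` options measure node impurity differently. The following sketch computes both for a node's class counts in plain Python; the function names `gini` and `entropy` are chosen here for illustration and are not scikit-learn functions.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# A pure node has zero impurity under both measures
print(gini([10, 0]), entropy([10, 0]))
# A 50/50 node is maximally impure for two classes
print(gini([5, 5]), entropy([5, 5]))
# Class counts of the training set shown earlier (318 vs 214)
print(round(gini([318, 214]), 3), round(entropy([318, 214]), 3))
```

Both measures are zero for a pure node and largest for an even class mix, so they often lead to similar trees; comparing the two settings on the test set shows whether the choice matters for this data.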