**This is my first journey into Machine Learning!**

By tradition, I will solve the classic "Titanic survival" problem by following four steps:

1. Import the required modules and datasets
2. Review the datasets and preprocess them
3. Build models and select the best performer
4. Use the "best model" to make predictions




*Submission File Format:*

*You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns 
(beyond PassengerId and Survived) or rows.*

*The file should have exactly 2 columns:*

*1. PassengerId (sorted in any order)*

*2. Survived (contains your binary predictions: 1 for survived, 0 for deceased)*
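
The format rules above can be checked before uploading. The following is a minimal sketch and not part of the original notebook: the helper name `check_submission` is hypothetical, the demo frame is made up, and the default of 418 rows comes from the rules quoted above.

```python
import pandas as pd

def check_submission(sub, expected_rows=418):
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    # The file must contain exactly these two columns.
    if set(sub.columns) != {"PassengerId", "Survived"} or len(sub.columns) != 2:
        problems.append("columns must be exactly PassengerId and Survived")
        return problems
    # One prediction per test passenger, no extra rows.
    if len(sub) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(sub)}")
    # Predictions must be binary.
    if not sub["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Toy example: 418 made-up ids with all-zero predictions.
demo = pd.DataFrame({"PassengerId": range(892, 892 + 418), "Survived": 0})
print(check_submission(demo))
```

On a real run, the same function could be applied to the frame that is written to `submission.csv`.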


# Import the required modules and datasets #

In [1]:
# Basic APIs
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from scipy.stats import randint as sp_randint # discrete uniform distribution over integers

# Sklearn support
from sklearn import preprocessing # dataset preprocess
from sklearn.impute import KNNImputer
from sklearn.feature_selection import SelectKBest, chi2, f_classif # feature selection
from sklearn.model_selection import StratifiedKFold # k fold cross-validation
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import RandomizedSearchCV
# from sklearn.metrics import accuracy_score, f1_score # model metrics
from sklearn.linear_model import LogisticRegression # LR model
from sklearn.svm import SVC # SVC model
from sklearn.neighbors import KNeighborsClassifier # KNC model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis # LDA model
from sklearn.naive_bayes import GaussianNB # GNB model
from sklearn.tree import DecisionTreeClassifier # DTC model
from sklearn.ensemble import RandomForestClassifier # RFC model

# Import the Titanic survival datasets
train_df = pd.read_csv('../input/titanic/train.csv')
pred_df = pd.read_csv('../input/titanic/test.csv')

### Tips for new Kagglers ###

1. Before loading and reading the datasets, you must complete your phone verification; otherwise you will get the error "No such file in directory".

2. The path of each file (train.csv, test.csv, etc.) can be copied.
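
One way to get the exact path is to walk the input directory and print every file it contains. This is a minimal sketch, assuming datasets are mounted under `../input` as in the paths used in this notebook; the helper name `list_input_files` is hypothetical.

```python
import os

def list_input_files(root="../input"):
    """Return the full path of every file found below `root`."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)

# Prints nothing if the directory does not exist.
for path in list_input_files():
    print(path)
```

Each printed path can be passed directly to `pd.read_csv`.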


# Review the datasets and preprocess them #

In [2]:
train_df.info()
train_df.head()

pred_df.info()
pred_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


### Overview: ###

1. The features "Name", "Sex", "Ticket", "Cabin", and "Embarked" are non-numerical.

2. In train_df, the feature "Cabin" has a serious missing-value problem (only 204 of 891 values are non-null), while "Age" and "Embarked" have a milder one (714 and 889 non-null).

3. In pred_df, the feature "Cabin" has a serious missing-value problem (only 91 of 418 values are non-null), while "Age" and "Fare" have a milder one (332 and 417 non-null).
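
The non-null counts above can also be read directly as missing-value counts. This is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the Titanic files); on the real data the same call would be `train_df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the kinds of gaps described above.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [None, "C85", None, None],
})

missing = toy.isna().sum()              # number of missing values per column
share = (missing / len(toy)).round(2)   # fraction of rows that are missing
print(pd.DataFrame({"missing": missing, "share": share}))
```

Sorting this summary by `share` makes it easy to see which columns are candidates for dropping and which for imputation.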

### Preprocess: ###

1. Drop the features "PassengerId", "Name", "Ticket", "Cabin", and "Embarked", because "Name", "Ticket", and "Embarked" are almost unrelated to whether a passenger survived, and "Cabin" has a serious missing-value problem in both datasets.

2. Fill in the missing values of "Age", and the only missing value of "Fare" in pred_df, using k-nearest-neighbors imputation (KNNImputer).

3. Recode "Sex" as a number (female -> 0, male -> 1).
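
The imputation step can be illustrated on a toy array. The sketch below uses scikit-learn's `KNNImputer`, the class used in the code cell of this notebook; the numbers and the choice of `n_neighbors=2` are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: Age, Fare (made-up values); np.nan marks a missing Age.
X = np.array([
    [20.0, 10.0],
    [30.0, 12.0],
    [np.nan, 11.0],
    [60.0, 80.0],
])

# Each missing value is replaced by the mean of that column over the
# nearest rows, where distance is measured on the observed columns.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2])
```

Here the two rows with the closest fares (10.0 and 12.0) supply the age, so the gap is filled with the mean of 20 and 30 rather than being pulled toward the distant fourth row.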

In [3]:
# Drop feature
train_df = train_df.drop(labels = ["PassengerId","Name","Ticket","Cabin","Embarked"], axis = 1)
pred_df = pred_df.drop(labels = ["PassengerId","Name","Ticket","Cabin","Embarked"], axis = 1)

# Recode feature
lab_encoder = preprocessing.LabelEncoder().fit(["male","female"]) # "female -> 0 & male -> 1"
train_df["Sex"] = lab_encoder.transform(train_df["Sex"])
pred_df["Sex"] = lab_encoder.transform(pred_df["Sex"])

# Impute missing values with KNN, fitted on the features of both datasets
combine_df = pd.concat([train_df.iloc[:,1:], pred_df.iloc[:,:]], axis = 0)
imputer = KNNImputer(n_neighbors=3).fit(combine_df)
train_df.iloc[:,1:] = imputer.transform(train_df.iloc[:,1:])
pred_df.iloc[:,:] = imputer.transform(pred_df.iloc[:,:])

# Change the dtypes of the features
train_df["Survived"] = train_df["Survived"].astype("int")
train_df["Pclass"] = train_df["Pclass"].astype("object")
train_df["Sex"] = train_df["Sex"].astype("object")
train_df["Parch"] = train_df["Parch"].astype("object")

pred_df["Pclass"] = pred_df["Pclass"].astype("object")
pred_df["Sex"] = pred_df["Sex"].astype("object")
pred_df["Parch"] = pred_df["Parch"].astype("object")

# Build models and select the best performer #

In [4]:
Model = [LogisticRegression(), SVC(), KNeighborsClassifier(), LinearDiscriminantAnalysis(), GaussianNB(), DecisionTreeClassifier(),  RandomForestClassifier()]
Res_acc = []; Res_f1 = []
X_df = train_df.iloc[:,1:] # dataset of features (Pclass,Sex,...)
Y_df = train_df.iloc[:,0] # dataset of label (Survived)

for model in Model:
    kfold = StratifiedKFold(n_splits = 10, random_state = 1, shuffle=True)
    acc_results = cross_val_score(model, X_df, Y_df, cv = kfold, scoring = 'accuracy')
    Res_acc.append(round(acc_results.mean(),2))
    f1_results = cross_val_score(model, X_df, Y_df, cv = kfold, scoring = 'f1')
    Res_f1.append(round(f1_results.mean(),2))

names = [type(m).__name__ for m in Model]  # class names, e.g. "LogisticRegression"

print("accuracy for models:")
for name, acc in zip(names, Res_acc):
    print(name + ": " + str(acc))
print()

print("f1-score for models:")
for name, f1 in zip(names, Res_f1):
    print(name + ": " + str(f1))

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py", line 1417, in fit
    for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/conda/lib/python3.7/site-packages/joblib/parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 572

accuracy for models:
LogisticRegression: nan
SVC: 0.68
KNeighborsClassifier: 0.69
LinearDiscriminantAnalysis: 0.79
GaussianNB: 0.79
DecisionTreeClassifier: 0.77
RandomForestClassifier: 0.82

f1-score for models:
LogisticRegression: nan
SVC: 0.42
KNeighborsClassifier: 0.58
LinearDiscriminantAnalysis: 0.72
GaussianNB: 0.72
DecisionTreeClassifier: 0.7
RandomForestClassifier: 0.74


### Results of the model evaluation: ###

**accuracy for models:**

LogisticRegression: nan 

SVC: 0.68

KNeighborsClassifier: 0.69

LinearDiscriminantAnalysis: 0.79

GaussianNB: 0.79

DecisionTreeClassifier: 0.77

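RandomForestClassifier: 0.81 (0.82 in the output above; the forest is fitted without a fixed `random_state`, so its score can vary slightly between runs)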

**f1-score for models:**

LogisticRegression: nan

SVC: 0.42

KNeighborsClassifier: 0.58

LinearDiscriminantAnalysis: 0.72

GaussianNB: 0.72

DecisionTreeClassifier: 0.7

RandomForestClassifier: 0.74

***(The accuracy and F1-score of LogisticRegression could not be computed, and I have not been able to solve this yet.)***
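
The traceback above is cut off, so the actual cause is not visible. Two things may be worth trying, sketched below on synthetic data (the data and the name `pipe` are assumptions for illustration). The first is passing `error_score="raise"`, so that `cross_val_score` raises the underlying exception instead of recording `nan`. The second is standardizing the features and raising `max_iter`, a common remedy when the solver has trouble converging on unscaled features. Whether this resolves the failure seen here is not confirmed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_df / Y_df (six features, binary label).
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Scale the features, then fit logistic regression with a larger iteration budget.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X, y, cv=kfold, scoring="accuracy",
                         error_score="raise")  # raise instead of returning nan
print(round(scores.mean(), 2))
```

Putting the scaler inside the pipeline means it is refitted on the training part of every fold, so no information from the held-out fold leaks into the scaling.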

# Select the best parameters of RandomForestClassifier #

In [None]:
param_dist1 = {"n_estimators": sp_randint(1, 51),
              "max_depth": [3, 4, 5, None],
              "max_features": sp_randint(1, X_df.shape[1] + 1),  # must be between 1 and the number of features
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# n_iter: number of random parameter settings to try; cv: number of cross-validation folds
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist1, n_iter=50, cv=10)
random_search.fit(X_df, Y_df)
print('best parameters:',random_search.best_params_,'\n','best score:', random_search.best_score_)
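
A note on the search space: `scipy.stats.randint(low, high)` draws integers from `low` up to but not including `high`, and an integer `max_features` must lie between 1 and the number of feature columns. The sketch below runs a small search on synthetic data with six features (an assumption that mirrors the six columns left after preprocessing) and derives the upper bound from the data instead of hard-coding it. The small `n_iter` and `cv` values are chosen only to keep the example fast.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_df / Y_df.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
n_features = X.shape[1]

param_dist = {
    "n_estimators": sp_randint(5, 31),
    "max_depth": [3, 4, 5, None],
    "max_features": sp_randint(1, n_features + 1),  # 1 .. n_features inclusive
    "min_samples_split": sp_randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,         # kept small so the sketch runs quickly
    cv=3,
    random_state=0,   # makes the sampled settings reproducible
)
search.fit(X, y)
print(search.best_params_)
```

Fixing `random_state` on both the forest and the search makes the reported best parameters repeatable from run to run.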

# Use the "best model" to make predictions #

In [5]:
RFC = RandomForestClassifier(
                        bootstrap = True,
                        criterion = 'gini',
                        n_estimators = 20,
                        max_depth = None,
                        max_features = 2,
                        min_samples_split = 10
).fit(X_df, Y_df)
pred_y = RFC.predict(pred_df)

print(pred_y)

df = pd.read_csv('../input/titanic/test.csv')
submission = pd.DataFrame({
        "PassengerId": df["PassengerId"],
        "Survived": pred_y
    })
submission.to_csv('./submission.csv', index=False)

[0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 1 0 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0
 0 1 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
 0 0 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 0
 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]
