<a href="https://www.kaggle.com/code/khoshbayani/titanic-comparing-scores-of-different-classifiers?scriptVersionId=139772306" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer
from sklearn.preprocessing import minmax_scale
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV,ShuffleSplit

# Preprocessing train dataset

In [2]:
train_dataset = pd.read_csv('/kaggle/input/titanic/train.csv')

In [3]:
train_dataset.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The columns ['Ticket', 'Name', 'PassengerId'] are not useful for this analysis.
In addition, the 'Cabin' column has many NaN values (only 204 of 891 entries are non-null) that cannot be replaced easily.
So we drop these four columns.

In [5]:
train_dataset.drop(columns=['PassengerId','Cabin','Name','Ticket'],inplace=True)
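
To make the "many NaN values" argument concrete, the share of missing values per column can be computed before dropping anything. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `train_dataset`); on the real data, the `info()` output above implies that roughly 77% of 'Cabin' is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_dataset
toy = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 53.1],
})

# isna() gives a boolean frame; its column mean is the fraction of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```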

Divide the train data into numeric and categorical columns

In [6]:
train_numeric_data= train_dataset.select_dtypes(include=[int,float])
train_categorical_data = train_dataset.select_dtypes(include=object)

Preprocessing: converting the categorical data to numeric

In [7]:
train_categorical_data.head()

,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


In [8]:
train_categorical_data.isna().sum()

Sex         0
Embarked    2
dtype: int64

Then we replace the two NaN values in 'Embarked' with randomly chosen categories.

In [9]:
def RandomEmbarked():
    return np.random.choice(['S','C','Q'])


indexs_of_rows_with_nan_value= train_categorical_data[train_categorical_data['Embarked'].isna()].index
train_categorical_data.loc[indexs_of_rows_with_nan_value,'Embarked'] = [RandomEmbarked(),RandomEmbarked()]

train_categorical_data.isna().sum()

Sex         0
Embarked    0
dtype: int64
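
Because the random fill is not seeded, the two replaced values can differ between runs. A common alternative, shown here only as a sketch and not what this notebook does, is to fill with the most frequent category. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ports
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q', np.nan]})

# mode() returns the most frequent value(s); take the first one
most_frequent = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_frequent)
print(toy['Embarked'].tolist())
```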

In [10]:
train_categorical_data.head()

,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


Let's binarize the 'Sex' column:

In [11]:
binarizer = LabelBinarizer()
train_categorical_data['Sex'] = binarizer.fit_transform(train_categorical_data['Sex'])

In [12]:
train_categorical_data.head()

,Sex,Embarked
0,1,S
1,0,C
2,0,S
3,0,S
4,1,S


Let's one-hot encode the 'Embarked' column:
(Why don't we binarize it like 'Sex'? Because this column contains three unique values, while the 'Sex' column contains just two.)

In [13]:
oneHot = OneHotEncoder()

In [14]:
Embarked_onHotEncoded_array= oneHot.fit_transform(train_categorical_data['Embarked'].values.reshape(-1,1)).toarray()
Embarked_onHotEncoded_df = pd.DataFrame(Embarked_onHotEncoded_array,columns=['Embarked0','Embarked1','Embarked2'])

train_categorical_data.drop(columns=['Embarked'],inplace=True)

#Now we have numerized_df instead of train_categorical_data
train_numerized_df = pd.concat([train_categorical_data,Embarked_onHotEncoded_df],axis=1)

In [15]:
train_numerized_df.head()

,Sex,Embarked0,Embarked1,Embarked2
0,1,0.0,0.0,1.0
1,0,1.0,0.0,0.0
2,0,0.0,0.0,1.0
3,0,0.0,0.0,1.0
4,1,0.0,0.0,1.0
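
The names 'Embarked0', 'Embarked1', 'Embarked2' do not say which port each column stands for. As a sketch on made-up data, the fitted encoder's `categories_` attribute can be used to build descriptive names. The encoder sorts the categories, so the order is C, Q, S, which matches the rows shown above. The `Embarked_` prefix is my own choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with the three ports
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[['Embarked']]).toarray()

# categories_ holds one array of sorted labels per input column
column_names = ['Embarked_' + label for label in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=column_names)
print(encoded_df)
```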


In [16]:
num_train_df = pd.concat([train_numeric_data,train_numerized_df],axis=1)
num_train_df.head()

,Survived,Pclass,Age,SibSp,Parch,Fare,Sex,Embarked0,Embarked1,Embarked2
0,0,3,22.0,1,0,7.25,1,0.0,0.0,1.0
1,1,1,38.0,1,0,71.2833,0,1.0,0.0,0.0
2,1,3,26.0,0,0,7.925,0,0.0,0.0,1.0
3,1,1,35.0,1,0,53.1,0,0.0,0.0,1.0
4,0,3,35.0,0,0,8.05,1,0.0,0.0,1.0


Split X and y

In [17]:
data_X_train= num_train_df.iloc[:,1:]
data_y_train= num_train_df.iloc[:,0]

Fill na values with KNNImputer

In [18]:
knn_imputer = KNNImputer()
data_X_train = pd.DataFrame(knn_imputer.fit_transform(data_X_train),columns=data_X_train.columns)

In [19]:
data_X_train.isna().sum()

Pclass       0
Age          0
SibSp        0
Parch        0
Fare         0
Sex          0
Embarked0    0
Embarked1    0
Embarked2    0
dtype: int64
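
For intuition about what `KNNImputer` does: each missing entry is replaced by the mean of that feature over the nearest rows, where distance is measured on the features both rows have. The small array below is illustrative only, and it uses `n_neighbors=2` so the result is easy to check by hand; the notebook itself keeps the default of 5 neighbors.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X = np.array([[1, 2, np.nan],
              [3, 4, 3],
              [np.nan, 6, 5],
              [8, 8, 7]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```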

MinMax Scaler:

In [20]:
data_X_train = pd.DataFrame(minmax_scale(data_X_train), columns=data_X_train.columns)

In [21]:
data_X_train.head()

,Pclass,Age,SibSp,Parch,Fare,Sex,Embarked0,Embarked1,Embarked2
0,1.0,0.271174,0.125,0.0,0.014151,1.0,0.0,0.0,1.0
1,0.0,0.472229,0.125,0.0,0.139136,0.0,1.0,0.0,0.0
2,1.0,0.321438,0.0,0.0,0.015469,0.0,0.0,0.0,1.0
3,0.0,0.434531,0.125,0.0,0.103644,0.0,0.0,0.0,1.0
4,1.0,0.434531,0.0,0.0,0.015713,1.0,0.0,0.0,1.0
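
Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). The sketch below checks that this formula agrees with `minmax_scale` on a few made-up ages.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Hypothetical single-column feature
ages = np.array([[22.0], [38.0], [26.0], [35.0], [54.0]])

# Manual min-max scaling, computed per column
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))
scaled = minmax_scale(ages)
print(np.allclose(manual, scaled))
```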


# Preprocessing test dataset:

Now let's preprocess the test dataset (the steps are very similar to those for the train dataset):

In [22]:
test_dataset= pd.read_csv('/kaggle/input/titanic/test.csv')

In [23]:
test_dataset.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [24]:
test_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [25]:
passengerId_test_data = test_dataset['PassengerId']

In [26]:
test_dataset.drop(columns=['PassengerId','Cabin','Name','Ticket'],inplace=True)

Divide the test data into numeric and categorical columns

In [27]:
test_numeric_data= test_dataset.select_dtypes(include=[int,float])
test_categorical_data = test_dataset.select_dtypes(include=object)

Preprocessing: converting the test categorical data to numeric

In [28]:
test_categorical_data.head()

,Sex,Embarked
0,male,Q
1,female,S
2,male,Q
3,male,S
4,female,S


In [29]:
test_categorical_data.isna().sum()

Sex         0
Embarked    0
dtype: int64

Let's binarize the 'Sex' column:

In [30]:
# Reuse the binarizer fitted on the train data so both datasets share the same encoding
test_categorical_data['Sex'] = binarizer.transform(test_categorical_data['Sex'])

In [31]:
test_categorical_data.head()

,Sex,Embarked
0,1,Q
1,0,S
2,1,Q
3,1,S
4,0,S


Let's one-hot encode the 'Embarked' column:
(Why don't we binarize it like 'Sex'? Because this column contains three unique values, while the 'Sex' column contains just two.)

In [32]:
# Reuse the encoder fitted on the train data so the column order matches the train columns
Embarked_onHotEncoded_array= oneHot.transform(test_categorical_data['Embarked'].values.reshape(-1,1)).toarray()
Embarked_onHotEncoded_df = pd.DataFrame(Embarked_onHotEncoded_array,columns=['Embarked0','Embarked1','Embarked2'])

test_categorical_data.drop(columns=['Embarked'],inplace=True)

#Now we have numerized_df instead of test_categorical_data
test_numerized_df = pd.concat([test_categorical_data,Embarked_onHotEncoded_df],axis=1)

In [33]:
test_numerized_df.head()

,Sex,Embarked0,Embarked1,Embarked2
0,1,0.0,1.0,0.0
1,0,0.0,0.0,1.0
2,1,0.0,1.0,0.0
3,1,0.0,0.0,1.0
4,0,0.0,0.0,1.0


In [34]:
num_test_df = pd.concat([test_numeric_data,test_numerized_df],axis=1)
num_test_df.head()

,Pclass,Age,SibSp,Parch,Fare,Sex,Embarked0,Embarked1,Embarked2
0,3,34.5,0,0,7.8292,1,0.0,1.0,0.0
1,3,47.0,1,0,7.0,0,0.0,0.0,1.0
2,2,62.0,0,0,9.6875,1,0.0,1.0,0.0
3,3,27.0,0,0,8.6625,1,0.0,0.0,1.0
4,3,22.0,1,1,12.2875,0,0.0,0.0,1.0


data_X_test is num_test_df

In [35]:
data_X_test= num_test_df

Fill na values with KNNImputer

In [36]:
knn_imputer = KNNImputer()
data_X_test = pd.DataFrame(knn_imputer.fit_transform(data_X_test),columns=data_X_test.columns)

In [37]:
data_X_test.isna().sum()

Pclass       0
Age          0
SibSp        0
Parch        0
Fare         0
Sex          0
Embarked0    0
Embarked1    0
Embarked2    0
dtype: int64

MinMax Scaler:

In [38]:
data_X_test = pd.DataFrame(minmax_scale(data_X_test), columns=data_X_test.columns)

In [39]:
data_X_test.head()

,Pclass,Age,SibSp,Parch,Fare,Sex,Embarked0,Embarked1,Embarked2
0,1.0,0.452723,0.0,0.0,0.015282,1.0,0.0,1.0,0.0
1,1.0,0.617566,0.125,0.0,0.013663,0.0,0.0,0.0,1.0
2,0.5,0.815377,0.0,0.0,0.018909,1.0,0.0,1.0,0.0
3,1.0,0.353818,0.0,0.0,0.016908,1.0,0.0,0.0,1.0
4,1.0,0.287881,0.125,0.111111,0.023984,0.0,0.0,0.0,1.0
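
One caveat about the cells above: the imputer and the scaler are applied to the test data on their own, so the same raw value can be mapped differently in train and test. A common alternative is to fit on the train data and only transform the test data. The sketch below shows this for scaling with `MinMaxScaler` on made-up fares. It is not what this notebook does, and the same fit/transform pattern would apply to `KNNImputer`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, minmax_scale

# Hypothetical fare columns
train_fare = np.array([[0.0], [10.0], [100.0]])
test_fare = np.array([[5.0], [50.0]])

# Fit on train only, then apply the same min and max to the test data
scaler = MinMaxScaler().fit(train_fare)
test_consistent = scaler.transform(test_fare)

# Scaling the test data on its own uses the test min and max instead
test_independent = minmax_scale(test_fare)

print(test_consistent.ravel())
print(test_independent.ravel())
```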


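Importing reference labels for the test dataset. Note that `gender_submission.csv` is the example submission shipped with the competition (it assumes that all and only female passengers survive), not the true test labels, so scores computed against it measure agreement with that baseline: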

In [40]:
data_y_test = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')['Survived']

We have finished the preprocessing part,
so let's take a short break...🧋

Let's split the train dataset into train and validation parts (named X_test and y_test in the code below):

In [41]:
X_train, X_test, y_train, y_test= train_test_split(data_X_train,data_y_train,test_size=0.3)
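
The split above is random, so the scores that follow change from run to run. As a sketch on made-up data, passing `random_state` makes the split repeatable and `stratify` keeps the class ratio similar in both parts. The seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 12 of class 0 and 8 of class 1
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# The same seed gives the same split; stratify=y preserves the class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_tr2, X_val2, y_tr2, y_val2 = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(np.array_equal(X_val, X_val2))
```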

# Models

Let's try the models with default parameters:

# Modeling with default parameters

In [42]:
knn = KNeighborsClassifier()
dtree = DecisionTreeClassifier()
randForest = RandomForestClassifier()
svc = SVC()

for model in [knn, dtree, randForest, svc]:
    model.fit(X_train,y_train)
    print(model,"Classifier:")
    print('   score on train :',model.score(X_train,y_train))
    print('   score on test  :',model.score(X_test,y_test))
    print()
    print()

KNeighborsClassifier() Classifier:
   score on train : 0.8539325842696629
   score on test  : 0.7985074626865671


DecisionTreeClassifier() Classifier:
   score on train : 0.9807383627608347
   score on test  : 0.7947761194029851


RandomForestClassifier() Classifier:
   score on train : 0.9807383627608347
   score on test  : 0.8283582089552238


SVC() Classifier:
   score on train : 0.8089887640449438
   score on test  : 0.8208955223880597




# Modeling using GridSearch

In [43]:
knn_class = KNeighborsClassifier(n_jobs=-1)
params = {
    'n_neighbors' : [1,3,5,7,11,13,14,15,16,17,18,19,21,22,23,27],
    'p': [1,2,3,5],
    'weights': ['uniform','distance']
}
cv= ShuffleSplit(n_splits=5,test_size=0.25,random_state=123)
knn_clf = GridSearchCV(knn_class, param_grid=params, scoring='accuracy', cv=cv , verbose=0,return_train_score=False)
knn_clf_r = knn_clf.fit(X_train,y_train)
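
After fitting, a `GridSearchCV` object exposes the winning combination and its mean cross-validated score, which are worth printing before comparing models. The sketch below uses synthetic data from `make_classification` and a much smaller grid than the one above, both chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

params = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=123)
search = GridSearchCV(KNeighborsClassifier(), param_grid=params,
                      scoring='accuracy', cv=cv)
search.fit(X, y)

print(search.best_params_)   # best parameter combination found
print(search.best_score_)    # its mean accuracy over the 5 splits
best_model = search.best_estimator_  # refitted on all of X, y
```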

In [44]:
rand_forest = RandomForestClassifier(n_jobs=-1)
params = {
    'max_depth': range(1,10)
}
cv= ShuffleSplit(n_splits=5,test_size=0.25,random_state=123)
randdorest_clf = GridSearchCV(rand_forest, param_grid=params, scoring='accuracy', cv=cv , verbose=0,return_train_score=False)
randdorest_clf_r = randdorest_clf.fit(X_train,y_train)

In [45]:
svm_class = SVC()
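params = {
    'C': [0.00001,0.005,0.05,0.8,1,5,8,9 ,10, 25, 50],
    'kernel': ['linear', 'rbf', 'sigmoid'],
    # Note: 'degree' is only used by the 'poly' kernel, which is not searched here,
    # so these 7 values repeat the same fits and could be dropped to save time.
    'degree': np.arange(0,7,1),
    'gamma': ['auto','scale']
}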
cv= ShuffleSplit(n_splits=5,test_size=0.25,random_state=123)
svm_clf = GridSearchCV(svm_class, param_grid=params, scoring='accuracy', cv=cv , verbose=0,return_train_score=False)
print('This process takes a bit more time. Thanks for your patience')
svm_clf_r = svm_clf.fit(X_train,y_train)

This process takes a bit more time. Thanks for your patience
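
Part of the waiting time comes from combinations that cannot change the model: `degree` only matters for the 'poly' kernel, and `gamma` is ignored by the 'linear' kernel. As a sketch, `ParameterGrid` can count the combinations of the grid above and of a reduced grid written as a list of dicts. The reduced grid is my own suggestion, not the author's.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = [0.00001, 0.005, 0.05, 0.8, 1, 5, 8, 9, 10, 25, 50]

# Grid as used in the notebook
full_grid = {
    'C': C_values,
    'kernel': ['linear', 'rbf', 'sigmoid'],
    'degree': np.arange(0, 7, 1),
    'gamma': ['auto', 'scale'],
}

# Reduced grid: each kernel is paired only with parameters it actually uses
reduced_grid = [
    {'C': C_values, 'kernel': ['linear']},
    {'C': C_values, 'kernel': ['rbf', 'sigmoid'], 'gamma': ['auto', 'scale']},
]

n_full = len(ParameterGrid(full_grid))
n_reduced = len(ParameterGrid(reduced_grid))
print(n_full, n_reduced)
```

With 5 shuffle splits, each combination is fitted 5 times, so the reduced grid cuts the number of fits by the same ratio.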


# Evaluation

# Evaluating models on the train dataset:

In [46]:
data_train_predicted_by_knn = knn_clf_r.predict(data_X_train)
data_train_predicted_by_randomForest = randdorest_clf_r.predict(data_X_train)
data_train_predicted_by_svm = svm_clf_r.predict(data_X_train)


score_of_data_train_predicted_by_knn = accuracy_score(data_y_train,data_train_predicted_by_knn)
score_of_data_train_predicted_by_randomForest = accuracy_score(data_y_train,data_train_predicted_by_randomForest)
score_of_data_train_predicted_by_svm = accuracy_score(data_y_train,data_train_predicted_by_svm)

# Evaluating models on the test dataset:

In [47]:
data_test_predicted_by_knn = knn_clf_r.predict(data_X_test)
data_test_predicted_by_randomForest = randdorest_clf_r.predict(data_X_test)
data_test_predicted_by_svm = svm_clf_r.predict(data_X_test)


score_of_data_test_predicted_by_knn = accuracy_score(data_y_test,data_test_predicted_by_knn)
score_of_data_test_predicted_by_randomForest = accuracy_score(data_y_test,data_test_predicted_by_randomForest)
score_of_data_test_predicted_by_svm = accuracy_score(data_y_test,data_test_predicted_by_svm)
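
Accuracy alone does not show which kind of mistakes a model makes. As a sketch on made-up labels, `confusion_matrix` breaks the same predictions down by true class (rows) and predicted class (columns).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```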

# Result

In [48]:
print("KNN classifier accuracy score result:")
print('   ',"train:",score_of_data_train_predicted_by_knn)
print('   ',"test:",score_of_data_test_predicted_by_knn)
print()

print("RandomForest classifier accuracy score result:")
print('   ',"train:",score_of_data_train_predicted_by_randomForest)
print('   ',"test:",score_of_data_test_predicted_by_randomForest)
print()

print("SVM classifier accuracy score result:")
print('   ',"train:",score_of_data_train_predicted_by_svm)
print('   ',"test:",score_of_data_test_predicted_by_svm)

KNN classifier accuracy score result:
    train: 0.8462401795735129
    test: 0.8277511961722488

RandomForest classifier accuracy score result:
    train: 0.8742985409652076
    test: 0.8827751196172249

SVM classifier accuracy score result:
    train: 0.8361391694725028
    test: 0.8516746411483254


# Getting output
Getting the output as a CSV file:

The submission below uses the KNN predictions. Note that in the run shown above, RandomForest (0.883) rather than KNN (0.828) has the highest score on the test dataset.

In [49]:
submission_df = pd.DataFrame(data=data_test_predicted_by_knn,columns=['Survived'],index=passengerId_test_data)
submission_df.to_csv('submission.csv',index=True)
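
The competition expects a CSV with exactly two columns, `PassengerId` and `Survived`. An equivalent way to build it, sketched here on made-up ids and predictions, is to keep the id as an ordinary column and write the file without the index.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for the real test data
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Survived': predictions.astype(int),
})

# to_csv without a path returns the CSV text, which is handy for a quick check
csv_text = submission.to_csv(index=False)
print(csv_text)
```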

If it was helpful don't forget 👍  UPVOTE 👍  please✅✅✅
And don't forget to help me make it better with your guidance.🔍
I'm waiting for you...❤️