# Titanic - Kaggle project

## The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

In [2]:
import xgboost as xgb

from sklearn.model_selection import RandomizedSearchCV

from xgboost.sklearn import XGBClassifier

In [3]:
# Lux is a newer library for data visualization and data exploration
import lux
from lux.vis.VisList import VisList

In [4]:
datafile_train=r"C:\Users\surendran\Desktop\python\project\titanic\train.csv"
datafile_test=r"C:\Users\surendran\Desktop\python\project\titanic\test.csv"
ti_train=pd.read_csv(datafile_train)
ti_test=pd.read_csv(datafile_test)

In [5]:
ti_train


### Target: Survived

In [92]:
ti_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [8]:
VisList(["Embarked=?","Survived"],ti_train)

[<Vis  (x: COUNT(Record), y: Survived  -- [Embarked=S]   ) mark: bar, score: 0.00 >,
 <Vis  (x: COUNT(Record), y: Survived  -- [Embarked=C]   ) mark: bar, score: 0.00 >,
 <Vis  (x: COUNT(Record), y: Survived  -- [Embarked=Q]   ) mark: bar, score: 0.00 >,
 <Vis  (x: COUNT(Record), y: Survived  -- [Embarked=nan] ) mark: bar, score: 0.00 >]

In [9]:
VisList(["Embarked","Survived"],ti_train)

[<Vis  (x: COUNT(Record), y: Embarked, color: Survived) mark: bar, score: 0.00 >]

In [10]:
from lux.vis.Vis import Vis

In [11]:
Vis(["Embarked","Survived"],ti_train)

<Vis  (x: COUNT(Record), y: Embarked, color: Survived) mark: bar, score: 0.0 >

In [12]:
ti_train.shape

(891, 12)

In [13]:
ti_test.shape

(418, 11)

In [14]:
ti_train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [15]:
ti_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S




In [16]:
ti_test['Survived']=np.nan

In [17]:
ti_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,




In [18]:
ti_train['data']='train'
ti_test['data']='test'

In [19]:
ti_test=ti_test[ti_train.columns]  # the columns in the two data frames should be in the same order to enable concatenation

In [20]:
ti_all=pd.concat([ti_train,ti_test],axis=0)
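The tag-and-stack pattern used in the cells above can be checked on a pair of toy frames. This is a minimal sketch, not the notebook's own code: `toy_train`, `toy_test` and their values are made up for illustration, and `ignore_index=True` is an addition (the notebook keeps the original row labels, so its combined index runs 0 to 890 and then 0 to 417).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical data).
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Age": [22.0, 38.0]})
toy_test = pd.DataFrame({"PassengerId": [3], "Age": [34.5]})

# Give the test frame the missing target column and tag each frame's origin.
toy_test["Survived"] = np.nan
toy_train["data"] = "train"
toy_test["data"] = "test"

# Align column order, then stack; ignore_index avoids duplicate row labels.
toy_all = pd.concat([toy_train, toy_test[toy_train.columns]], axis=0, ignore_index=True)

# The marker column lets us recover the original split later.
recovered_train = toy_all[toy_all["data"] == "train"]
recovered_test = toy_all[toy_all["data"] == "test"]
```

Stacking both frames means every feature transformation is applied once and identically to train and test rows; the `data` column is what makes the later split possible.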

In [21]:
ti_all.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,data
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,train
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,train
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,train
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,train
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,train




In [22]:
ti_all.info()

<class 'lux.core.frame.LuxDataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
 12  data         1309 non-null   object 
dtypes: float64(3), int64(4), object(6)
memory usage: 143.2+ KB


In [23]:
ti_all.shape

(1309, 13)

In [24]:
list(zip(ti_all.columns,ti_all.dtypes,ti_all.nunique()))

[('PassengerId', dtype('int64'), 1309),
 ('Survived', dtype('float64'), 2),
 ('Pclass', dtype('int64'), 3),
 ('Name', dtype('O'), 1307),
 ('Sex', dtype('O'), 2),
 ('Age', dtype('float64'), 98),
 ('SibSp', dtype('int64'), 7),
 ('Parch', dtype('int64'), 8),
 ('Ticket', dtype('O'), 929),
 ('Fare', dtype('float64'), 281),
 ('Cabin', dtype('O'), 186),
 ('Embarked', dtype('O'), 3),
 ('data', dtype('O'), 2)]

In [33]:
import lux

In [31]:
! jupyter nbextension install --py luxwidget
! jupyter nbextension enable --py luxwidget

Installing C:\Users\surendran\anaconda3\lib\site-packages\luxwidget\nbextension/static -> luxwidget
Up to date: C:\ProgramData\jupyter\nbextensions\luxwidget\extension.js
Up to date: C:\ProgramData\jupyter\nbextensions\luxwidget\index.js
Up to date: C:\ProgramData\jupyter\nbextensions\luxwidget\index.js.map
- Validating: ok

    To initialize this nbextension in the browser every time the notebook (or other app) loads:
    
          jupyter nbextension enable luxwidget --py
    
Enabling notebook extension luxwidget/extension...
      - Validating: ok


In [25]:
ti_all

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,data
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,train
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,train
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,train
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,train
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,train
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,test
414,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,test
415,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,test
416,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,test




### 1. Pclass

In [26]:
ti_all.Pclass.value_counts()

3    709
1    323
2    277
Name: Pclass, dtype: int64

In [27]:

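# build 0/1 indicator columns for the frequent levels of each column
cols = ['Pclass']
for col in cols:
    k = ti_all[col].value_counts()   # level counts, most frequent first
    k = k[k>100]                     # keep only levels seen more than 100 times
    indexes = k.index[:-1]           # skip the least frequent remaining level (reference level)
    print(indexes)
    for i in indexes:
        print(i)
        ti_all[col+'_'+str(i)] = (ti_all[col]==i).astype(int)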

Int64Index([3, 1], dtype='int64')
3
1
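The same loop is repeated below for `SibSp`, `Parch` and `Embarked`, so it could be wrapped in a helper. The following is a sketch under that assumption; `add_frequent_dummies` and its `min_count` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd

def add_frequent_dummies(df, col, min_count=100):
    """Add 0/1 columns for levels of `col` seen more than `min_count` times.

    The least frequent of the qualifying levels is skipped, mirroring the
    `k.index[:-1]` step in the loop above.
    """
    counts = df[col].value_counts()          # sorted by descending frequency
    frequent = counts[counts > min_count]
    for level in frequent.index[:-1]:
        df[f"{col}_{level}"] = (df[col] == level).astype(int)
    return df

# Toy frame (hypothetical data) with a lowered threshold so every level qualifies.
toy = pd.DataFrame({"Pclass": [3] * 5 + [1] * 3 + [2] * 2})
toy = add_frequent_dummies(toy, "Pclass", min_count=1)
```

One consequence of skipping the last qualifying level is worth keeping in mind: when some levels fall below the threshold, the skipped level ends up sharing the all-zeros pattern with those rare levels rather than acting as a clean reference category.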


### 2. Drop Name and PassengerId

In [28]:
# drop both columns: they act as row identifiers

### 3. Sex

In [29]:
ti_all.Sex.value_counts()

male      843
female    466
Name: Sex, dtype: int64

In [30]:
ti_all['Sex.converted'] = np.where(ti_all['Sex']=='male',1,0)

In [31]:
ti_all.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,data,Pclass_3,Pclass_1,Sex.converted
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,train,1,0,1
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,train,0,1,0




### 4. SibSp

In [32]:
ti_all.SibSp.value_counts()

0    891
1    319
2     42
4     22
3     20
8      9
5      6
Name: SibSp, dtype: int64

In [33]:

cols = ['SibSp']
for col in cols:
    k = ti_all[col].value_counts()
    k = k[k>100]
    indexes = k.index[:-1]
    print(indexes)
    for i in indexes:
        print(i)
        ti_all[col+'_'+str(i)] = (ti_all[col]==i).astype(int)

Int64Index([0], dtype='int64')
0


### 5. Parch

In [34]:
ti_all.Parch.value_counts()

0    1002
1     170
2     113
3       8
5       6
4       6
9       2
6       2
Name: Parch, dtype: int64

In [35]:

cols = ['Parch']
for col in cols:
    k = ti_all[col].value_counts()
    k = k[k>100]
    indexes = k.index[:-1]
    print(indexes)
    for i in indexes:
        print(i)
        ti_all[col+'_'+str(i)] = (ti_all[col]==i).astype(int)

Int64Index([0, 1], dtype='int64')
0
1


### 6. Ticket

In [36]:
ti_all.Ticket.value_counts()

CA. 2343        11
CA 2144          8
1601             8
S.O.C. 14879     7
347077           7
                ..
226593           1
C.A. 24580       1
113800           1
C.A. 29178       1
7598             1
Name: Ticket, Length: 929, dtype: int64

In [37]:
# drop Ticket: 929 distinct values across 1309 rows

### 7. Fare

In [38]:
ti_all.Fare.value_counts()

8.0500     60
13.0000    59
7.7500     55
26.0000    50
7.8958     49
           ..
33.5000     1
7.8000      1
26.3875     1
15.5792     1
7.1417      1
Name: Fare, Length: 281, dtype: int64

In [39]:
# keep the column

### 8. Cabin

In [40]:
ti_all.Cabin.value_counts()

C23 C25 C27        6
B57 B59 B63 B66    5
G6                 5
F4                 4
F33                4
                  ..
A9                 1
D38                1
F                  1
A23                1
C111               1
Name: Cabin, Length: 186, dtype: int64

In [41]:
# drop Cabin: only 295 of 1309 rows have a value

### 9. Embarked

In [42]:
ti_all.Embarked.value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [43]:

cols = ['Embarked']
for col in cols:
    k = ti_all[col].value_counts()
    k = k[k>100]
    indexes = k.index[:-1]
    print(indexes)
    for i in indexes:
        print(i)
        ti_all[col+'_'+str(i)] = (ti_all[col]==i).astype(int)

Index(['S', 'C'], dtype='object')
S
C


In [44]:
ti_all.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Embarked,data,Pclass_3,Pclass_1,Sex.converted,SibSp_0,Parch_0,Parch_1,Embarked_S,Embarked_C
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,...,S,train,1,0,1,0,1,0,1,0
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,C,train,0,1,0,0,1,0,0,1
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,...,S,train,1,0,0,1,1,0,1,0
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,S,train,0,1,0,0,1,0,1,0
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,...,S,train,1,0,1,1,1,0,1,0




### Copy the data

In [45]:
ti_all.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'data', 'Pclass_3',
       'Pclass_1', 'Sex.converted', 'SibSp_0', 'Parch_0', 'Parch_1',
       'Embarked_S', 'Embarked_C'],
      dtype='object')

In [46]:
ti_selected_cols=[ 'Survived', 'Age',
       'Fare', 'Pclass_3',
       'Pclass_1', 'Sex.converted', 'SibSp_0', 'Parch_0', 'Parch_1',
       'Embarked_S', 'Embarked_C','data']

In [47]:
# select the modelling columns and take an explicit copy, so later assignments do not touch ti_all
ti_copy=ti_all[ti_selected_cols].copy()

In [49]:
ti_copy.head()




### Working on null values

In [50]:
list(zip(ti_copy.columns,ti_copy.dtypes,ti_copy.isnull().sum()))

[('Survived', dtype('float64'), 418),
 ('Age', dtype('float64'), 263),
 ('Fare', dtype('float64'), 1),
 ('Pclass_3', dtype('int32'), 0),
 ('Pclass_1', dtype('int32'), 0),
 ('Sex.converted', dtype('int32'), 0),
 ('SibSp_0', dtype('int32'), 0),
 ('Parch_0', dtype('int32'), 0),
 ('Parch_1', dtype('int32'), 0),
 ('Embarked_S', dtype('int32'), 0),
 ('Embarked_C', dtype('int32'), 0),
 ('data', dtype('O'), 0)]

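In [51]:
# numeric columns with missing values
num_cols=['Age','Fare']

In [52]:
# impute with the median of the training rows only, so test values do not influence the fill value
for col in num_cols:
    ti_copy[col]=ti_copy[col].fillna(ti_copy.loc[ti_copy['data']=='train',col].median())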
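The imputation step can be traced on a toy column to confirm that only training rows contribute to the fill value. This is a small sketch with made-up ages; the column names match the notebook, the data does not.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one gap in train, one gap in test.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 80.0, np.nan],
    "data": ["train", "train", "train", "test", "test"],
})

# Median computed on training rows only, so test values do not leak into it.
train_median = toy.loc[toy["data"] == "train", "Age"].median()

# Fill gaps in both train and test rows with that single training statistic.
toy["Age"] = toy["Age"].fillna(train_median)
```

Here the training median is 25.0 (the median of 20 and 30), and the test-row age of 80 has no effect on it.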

In [53]:
list(zip(ti_copy.columns,ti_copy.dtypes,ti_copy.isnull().sum()))

[('Survived', dtype('float64'), 418),
 ('Age', dtype('float64'), 0),
 ('Fare', dtype('float64'), 0),
 ('Pclass_3', dtype('int32'), 0),
 ('Pclass_1', dtype('int32'), 0),
 ('Sex.converted', dtype('int32'), 0),
 ('SibSp_0', dtype('int32'), 0),
 ('Parch_0', dtype('int32'), 0),
 ('Parch_1', dtype('int32'), 0),
 ('Embarked_S', dtype('int32'), 0),
 ('Embarked_C', dtype('int32'), 0),
 ('data', dtype('O'), 0)]

# ML part

In [54]:
target='Survived'

In [55]:
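# split back into train and test using the 'data' marker; pass columns= by keyword,
# since the positional axis argument of drop() is no longer accepted by recent pandas
x_train=ti_copy.drop(columns=[target,'data'])[ti_copy['data']=='train']
y_train=ti_copy[target][ti_copy['data']=='train']
x_test=ti_copy.drop(columns=[target,'data'])[ti_copy['data']=='test']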

In [56]:
x_train.head()

Unnamed: 0,Age,Fare,Pclass_3,Pclass_1,Sex.converted,SibSp_0,Parch_0,Parch_1,Embarked_S,Embarked_C
0,22.0,7.25,1,0,1,0,1,0,1,0
1,38.0,71.2833,0,1,0,0,1,0,0,1
2,26.0,7.925,1,0,0,1,1,0,1,0
3,35.0,53.1,0,1,0,0,1,0,1,0
4,35.0,8.05,1,0,1,1,1,0,1,0




In [57]:
y_train.head()




In [58]:
x_test.head()

Unnamed: 0,Age,Fare,Pclass_3,Pclass_1,Sex.converted,SibSp_0,Parch_0,Parch_1,Embarked_S,Embarked_C
0,34.5,7.8292,1,0,1,1,1,0,0,0
1,47.0,7.0,1,0,0,0,1,0,1,0
2,62.0,9.6875,0,0,1,1,1,0,0,0
3,27.0,8.6625,1,0,1,1,1,0,1,0
4,22.0,12.2875,1,0,0,0,0,1,1,0




### XGB

In [59]:
param_dist = {
    "max_depth": [2,3,4,5,6],
    "learning_rate": [0.01,0.05,0.1,0.3,0.5],
    "min_child_weight": [4,5,6],
    "subsample": [i/10.0 for i in range(6,10)],
    "colsample_bytree": [i/10.0 for i in range(6,10)],
    "reg_alpha": [1e-5, 1e-2, 0.1, 1, 100],
    "gamma": [i/10.0 for i in range(0,5)],
    "n_estimators": [100,500,700,1000],
    "scale_pos_weight": [2,3,4,5,6,7,8,9]
}

In [60]:
clf=XGBClassifier(objective='binary:logistic')

In [61]:
n_iter=25

random_search=RandomizedSearchCV(clf,n_jobs=-1,verbose=3,cv=10,n_iter=n_iter,scoring='roc_auc',
                                 param_distributions=param_dist)
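The search above does not set `random_state`, so the 25 sampled candidates can differ from run to run. The sketch below shows how fixing the seed makes a randomized search repeatable. It is an illustration under assumptions: a `DecisionTreeClassifier` on synthetic data stands in for `XGBClassifier` to keep it light, and `toy_param_dist` and `run_search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

toy_param_dist = {"max_depth": [2, 3, 4, 5, 6], "min_samples_leaf": [1, 5, 10, 20]}

def run_search(seed):
    """Run a small randomized search whose candidate sampling is fixed by `seed`."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_distributions=toy_param_dist,
        n_iter=5,
        cv=5,
        scoring="roc_auc",
        random_state=seed,
    )
    search.fit(X, y)
    return search

# Two runs with the same seed sample the same candidates and pick the same winner.
first = run_search(seed=42)
second = run_search(seed=42)
```

Passing `random_state` to `RandomizedSearchCV` in the notebook cell would have the same effect on which parameter combinations are tried.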

In [62]:
random_search.fit(x_train,y_train)

Fitting 10 folds for each of 25 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:   56.7s
[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:  2.2min finished


RandomizedSearchCV(cv=10,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=100...
                   n_iter=25, n_jobs=-1,
                   param_distributions={'colsample_bytree': [0.6, 0.7, 0.8,
                                                             0.9

In [63]:
def report(results,n_top=3):
    for i in range(1,n_top+1):
        candidates = np.flatnonzero(results['rank_test_score']==i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean Validation Score: {0:.8f} (std:{1:.3f})".format(
                results['mean_test_score'][candidate],
                results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [64]:
report(random_search.cv_results_,5)

Model with rank: 1
Mean Validation Score: 0.87121320 (std:0.053)
Parameters: {'subsample': 0.7, 'scale_pos_weight': 9, 'reg_alpha': 0.1, 'n_estimators': 1000, 'min_child_weight': 4, 'max_depth': 5, 'learning_rate': 0.01, 'gamma': 0.3, 'colsample_bytree': 0.9}

Model with rank: 2
Mean Validation Score: 0.87036624 (std:0.049)
Parameters: {'subsample': 0.7, 'scale_pos_weight': 5, 'reg_alpha': 0.1, 'n_estimators': 1000, 'min_child_weight': 4, 'max_depth': 4, 'learning_rate': 0.01, 'gamma': 0.4, 'colsample_bytree': 0.8}

Model with rank: 3
Mean Validation Score: 0.86699641 (std:0.047)
Parameters: {'subsample': 0.6, 'scale_pos_weight': 5, 'reg_alpha': 0.1, 'n_estimators': 100, 'min_child_weight': 5, 'max_depth': 3, 'learning_rate': 0.05, 'gamma': 0.4, 'colsample_bytree': 0.8}

Model with rank: 4
Mean Validation Score: 0.86643267 (std:0.052)
Parameters: {'subsample': 0.9, 'scale_pos_weight': 3, 'reg_alpha': 1e-05, 'n_estimators': 700, 'min_child_weight': 6, 'max_depth': 6, 'learning_rate': 0.

In [None]:
# Choose the rank-2 model: its mean score is within 0.001 of rank 1 and its standard deviation is lower (0.049 vs 0.053)

### Model with rank: 2
Mean Validation Score: 0.87036624 (std:0.049)
Parameters: {'subsample': 0.7, 'scale_pos_weight': 5, 'reg_alpha': 0.1, 'n_estimators': 1000, 'min_child_weight': 4, 'max_depth': 4, 'learning_rate': 0.01, 'gamma': 0.4, 'colsample_bytree': 0.8}
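One way to make this kind of trade-off explicit is to penalise each mean score by its standard deviation before ranking. This is a hypothetical heuristic offered for illustration, not the rule used in the notebook. The numbers are the four candidates fully printed in the report above; the fifth is cut off in the output and is left out.

```python
import numpy as np

# Mean and std of the validation AUC for the four candidates printed above.
mean_auc = np.array([0.87121320, 0.87036624, 0.86699641, 0.86643267])
std_auc = np.array([0.053, 0.049, 0.047, 0.052])

# Hypothetical heuristic: penalise each mean by one standard deviation.
penalised = mean_auc - std_auc

# Index of the best penalised score, converted to the 1-based rank used in the report.
best_index = int(np.argmax(penalised))
chosen_rank = best_index + 1
```

Under this particular penalty the rank-2 candidate comes out on top, which agrees with the choice made here; a different penalty weight could give a different answer.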

In [65]:
xgb_best=XGBClassifier(subsample=0.7,scale_pos_weight=5,reg_alpha=0.1,n_estimators=1000,min_child_weight=4,
                       max_depth=4,learning_rate=0.01,gamma=0.4,colsample_bytree=0.8,n_jobs=-1)

In [66]:
xgb_best.fit(x_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0.4, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=4,
              min_child_weight=4, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=-1, num_parallel_tree=1, random_state=0,
              reg_alpha=0.1, reg_lambda=1, scale_pos_weight=5, subsample=0.7,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [67]:
train_score=xgb_best.predict_proba(x_train)[:,1]

In [68]:
train_score[:20]

array([0.1609495 , 0.9919526 , 0.85611296, 0.99550736, 0.33558646,
       0.30636412, 0.2353305 , 0.744551  , 0.8939948 , 0.97698545,
       0.94672817, 0.96390414, 0.41868582, 0.1992932 , 0.87572515,
       0.9518224 , 0.44952434, 0.3986364 , 0.85128534, 0.90493315],
      dtype=float32)

In [69]:
test_score=xgb_best.predict_proba(x_test)[:,1]

In [70]:
test_score[:20]


array([0.18305437, 0.28122428, 0.31413072, 0.46160424, 0.680035  ,
       0.31808135, 0.783679  , 0.27302924, 0.9428824 , 0.10897081,
       0.13265455, 0.39036718, 0.9782517 , 0.20000306, 0.9785463 ,
       0.95673   , 0.35248896, 0.6039993 , 0.86853653, 0.617003  ],
      dtype=float32)

## We need hard class labels

In [71]:
cutoffs=np.linspace(0.01,0.99,99)
cutoffs

array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11,
       0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22,
       0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33,
       0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44,
       0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55,
       0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66,
       0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
       0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88,
       0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99])

In [72]:
# train_score=xgb_best.predict_proba(x_train)[:,1]
real=y_train


In [73]:
from sklearn.metrics import fbeta_score

In [74]:
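fbeta_all=[]
KS_all=[]
Beta=2   # beta=2 weights recall more heavily than precision

for cutoff in cutoffs:
    # hard labels at this cutoff
    predicted=(train_score>cutoff).astype(int)

    # confusion-matrix counts
    TP=((predicted==1) & (real==1)).sum()
    TN=((predicted==0) & (real==0)).sum()
    FP=((predicted==1) & (real==0)).sum()
    FN=((predicted==0) & (real==1)).sum()

    P=TP+FN   # actual positives
    N=TN+FP   # actual negatives

    # KS statistic: true positive rate minus false positive rate
    KS=(TP/P)-(FP/N)
    KS_all.append(KS)

    # precision is undefined when nothing is predicted positive; treat it as 0
    Pr=TP/(TP+FP) if (TP+FP)>0 else 0.0
    Recall=TP/P

    # F-beta = (1+beta^2)*precision*recall / (beta^2*precision + recall)
    denominator=((Beta**2)*Pr)+Recall
    F2_Score=((1+(Beta**2))*Pr*Recall)/denominator if denominator>0 else 0.0
    fbeta_all.append(F2_Score)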
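The cutoff sweep can be checked by hand on a tiny example. The sketch below reimplements the same F-beta and KS formulas in a function; `sweep_cutoffs` and the toy scores and labels are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np

def sweep_cutoffs(scores, labels, cutoffs, beta=2.0):
    """Return (f_beta, ks) arrays with one entry per cutoff."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(int)
    f_beta, ks = [], []
    for cutoff in cutoffs:
        predicted = (scores > cutoff).astype(int)
        tp = int(((predicted == 1) & (labels == 1)).sum())
        tn = int(((predicted == 0) & (labels == 0)).sum())
        fp = int(((predicted == 1) & (labels == 0)).sum())
        fn = int(((predicted == 0) & (labels == 1)).sum())
        # Guard every ratio against an empty denominator.
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        denom = beta ** 2 * precision + recall
        f_beta.append((1 + beta ** 2) * precision * recall / denom if denom else 0.0)
        ks.append(recall - fpr)   # KS: true positive rate minus false positive rate
    return np.array(f_beta), np.array(ks)

# Toy data (hypothetical): two negatives followed by two positives.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_cutoffs = np.array([0.2, 0.5])

f2, ks = sweep_cutoffs(toy_scores, toy_labels, toy_cutoffs)
best_f2_cutoff = toy_cutoffs[np.argmax(f2)]
```

At cutoff 0.2 the counts are TP=2, FP=1, FN=0, so precision is 2/3, recall is 1 and F2 is 10/11; at cutoff 0.5 recall drops to 0.5 and F2 falls to 5/9. The `fbeta_score` function imported earlier from `sklearn.metrics` computes the same quantity from hard labels when called with `beta=2`.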
      
    

In [75]:
list(zip(cutoffs,fbeta_all))

[(0.01, 0.7569721115537847),
 (0.02, 0.7569721115537847),
 (0.03, 0.7579787234042553),
 (0.04, 0.76),
 (0.05, 0.7616926503340757),
 (0.060000000000000005, 0.7640750670241288),
 (0.06999999999999999, 0.7640750670241288),
 (0.08, 0.7644166294143943),
 (0.09, 0.7647584973166368),
 (0.09999999999999999, 0.7702702702702703),
 (0.11, 0.7737556561085974),
 (0.12, 0.776566757493188),
 (0.13, 0.7811786203746003),
 (0.14, 0.7894736842105262),
 (0.15000000000000002, 0.7913003239241092),
 (0.16, 0.7942405945192754),
 (0.17, 0.7986921999065858),
 (0.18000000000000002, 0.8016877637130803),
 (0.19, 0.804705882352941),
 (0.2, 0.8104265402843603),
 (0.21000000000000002, 0.8158396946564885),
 (0.22, 0.819357930043124),
 (0.23, 0.821720326765978),
 (0.24000000000000002, 0.8268858800773694),
 (0.25, 0.8337396392003901),
 (0.26, 0.8349609375),
 (0.27, 0.8378245957863792),
 (0.28, 0.8407079646017699),
 (0.29000000000000004, 0.8436112481499752),
 (0.3, 0.852017937219731),
 (0.31, 0.8554277138569285),
 (0.32,

### 1. Consider F2 score

In [77]:
# first cutoff with the highest F2 score; nanargmax skips undefined (NaN) entries
mycutoff=cutoffs[np.nanargmax(fbeta_all)]
mycutoff

0.46

In [78]:
# test_score=xgb_best.predict_proba(x_test)[:,1]
test_score[:20]

array([0.18305437, 0.28122428, 0.31413072, 0.46160424, 0.680035  ,
       0.31808135, 0.783679  , 0.27302924, 0.9428824 , 0.10897081,
       0.13265455, 0.39036718, 0.9782517 , 0.20000306, 0.9785463 ,
       0.95673   , 0.35248896, 0.6039993 , 0.86853653, 0.617003  ],
      dtype=float32)

In [79]:
pred=(test_score>mycutoff).astype(int)

In [80]:
pred.dtype

dtype('int32')

In [81]:
pd.Series(pred).value_counts()

1    232
0    186
dtype: int64

# 1. Submission

In [82]:
submission=pd.DataFrame()

In [83]:
submission['PassengerId']=ti_test['PassengerId']

In [84]:
submission['Survived']=pred

In [85]:
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,1
4,896,1




In [86]:
submission.info()

<class 'lux.core.frame.LuxDataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int32
dtypes: int32(1), int64(1)
memory usage: 5.0 KB


In [87]:
submission.to_csv('Titanic_XGboost_F2_Score_submission1.csv',index=False)
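Before uploading, it can help to confirm that the frame has the shape the competition expects: a `PassengerId` column, a `Survived` column of 0/1 values, and one row per test passenger (418 here, matching `ti_test.shape`). The helper below is a sketch; `check_submission` is a hypothetical name and the toy frame is made up.

```python
import pandas as pd

def check_submission(frame, expected_rows=418):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(frame.columns) == ["PassengerId", "Survived"]
    assert len(frame) == expected_rows
    assert frame["PassengerId"].is_unique
    assert frame["Survived"].isin([0, 1]).all()
    return True

# Toy submission (hypothetical data) with three rows instead of 418.
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 0, 1]})
ok = check_submission(toy_submission, expected_rows=3)
```

Calling `check_submission(submission)` on the real frame just before `to_csv` would catch a misaligned index or a leftover float column early.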

### 2. Consider KS score

In [88]:
# first cutoff with the highest KS statistic; nanargmax skips undefined (NaN) entries
mycutoff=cutoffs[np.nanargmax(KS_all)]
mycutoff

0.7100000000000001

In [89]:
# test_score=xgb_best.predict_proba(x_test)[:,1]
test_score[:20]

array([0.18305437, 0.28122428, 0.31413072, 0.46160424, 0.680035  ,
       0.31808135, 0.783679  , 0.27302924, 0.9428824 , 0.10897081,
       0.13265455, 0.39036718, 0.9782517 , 0.20000306, 0.9785463 ,
       0.95673   , 0.35248896, 0.6039993 , 0.86853653, 0.617003  ],
      dtype=float32)

In [90]:
pred=(test_score>mycutoff).astype(int)

In [91]:
pred.dtype

dtype('int32')

In [92]:
pd.Series(pred).value_counts()

0    257
1    161
dtype: int64

# 2. Submission

In [93]:
submission=pd.DataFrame()

In [94]:
submission['PassengerId']=ti_test['PassengerId']

In [95]:
submission['Survived']=pred

In [96]:
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0




In [97]:
submission.info()

<class 'lux.core.frame.LuxDataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int32
dtypes: int32(1), int64(1)
memory usage: 5.0 KB


In [98]:
submission.to_csv('Titanic_XGboost_KS_Score_submission2.csv',index=False)