## Predicting the Survival of Titanic Passengers

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

In this contest, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict whether someone survived or did not survive.

To complete this project, you will need to implement several conditional predictions and answer the questions below. Your project submission will be evaluated based on the completion of the code and your responses to the questions.

In [1]:
import pandas as pd 
import numpy as np

In [2]:
train_file=r'E:/Edvancer/Python/titanic/t_train.csv'
test_file=r'E:/Edvancer/Python/titanic/t_test.csv'

ld_train=pd.read_csv(train_file)
ld_test=pd.read_csv(test_file)     

In [3]:
ld_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Let's combine the train and test data for data prep

ld_test['Survived']=np.nan
ld_train['data']='train'
ld_test['data']='test'

#After adding Survived as NaN to the test data, its columns must be in the same order as in the train data.
#Hence we use ld_test[ld_train.columns], which ensures that the test columns are ordered like the train columns

ld_test=ld_test[ld_train.columns]
ld_all=pd.concat([ld_train,ld_test],axis=0)
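
The combine-then-split pattern used in the cell above can be sketched on a toy example. The frames and values below are made up for illustration; only the column names `Survived` and `data` follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train and test files (values are illustrative only)
toy_train = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1], "Age": [22.0, 38.0]})
toy_test = pd.DataFrame({"Pclass": [3], "Age": [34.5]})

# Give the test frame a placeholder target and tag each row with its origin
toy_test["Survived"] = np.nan
toy_train["data"] = "train"
toy_test["data"] = "test"

# Reorder the test columns to match train, then stack the rows
toy_test = toy_test[toy_train.columns]
toy_all = pd.concat([toy_train, toy_test], axis=0)

print(toy_all)
```

Preparing both sets in one frame means every transformation (dummies, imputation) is applied identically to train and test; the `data` column is what allows them to be separated again afterwards.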

In [5]:
ld_all.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,data
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,train
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,train
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,train
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,train
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,train


In [6]:
ld_all.dtypes

PassengerId      int64
Survived       float64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
data            object
dtype: object

In [7]:
#How many unique values does each column take?

list(zip(ld_all.columns,ld_all.dtypes,ld_all.nunique()))

[('PassengerId', dtype('int64'), 1309),
 ('Survived', dtype('float64'), 2),
 ('Pclass', dtype('int64'), 3),
 ('Name', dtype('O'), 1307),
 ('Sex', dtype('O'), 2),
 ('Age', dtype('float64'), 98),
 ('SibSp', dtype('int64'), 7),
 ('Parch', dtype('int64'), 8),
 ('Ticket', dtype('O'), 929),
 ('Fare', dtype('float64'), 281),
 ('Cabin', dtype('O'), 186),
 ('Embarked', dtype('O'), 3),
 ('data', dtype('O'), 2)]

In [8]:
ld_all["Survived"].value_counts()

0.0    549
1.0    342
Name: Survived, dtype: int64

In [9]:
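#Note: test rows have Survived=NaN, and NaN==1 is False, so they are coded as 0 here (the column is dropped from the test set later)
ld_all['Survived']=(ld_all['Survived']==1).astype(int)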

In [10]:
ld_all["Survived"].value_counts()

0    967
1    342
Name: Survived, dtype: int64
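
The jump from 549 to 967 zeros deserves a note. A minimal sketch, using a made-up four-row series, of why the unlabelled test rows end up counted as 0 after the conversion:

```python
import numpy as np
import pandas as pd

# Toy target column: two labelled rows followed by two unlabelled (test) rows
survived = pd.Series([0.0, 1.0, np.nan, np.nan])

# NaN == 1 evaluates to False, so missing labels silently become 0
as_int = (survived == 1).astype(int)
print(as_int.tolist())

# Counts reported in the notebook: 549 zeros before the conversion, 967 after
extra_zeros = 967 - 549
print(extra_zeros)
```

The 418 extra zeros match the number of test rows (1309 passengers in total minus 891 training rows), so the second `value_counts` should not be read as the class balance. Because `Survived` is dropped from the test set later, the model itself is unaffected.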

In [11]:
ld_all.drop(['PassengerId','Name'],axis=1,inplace=True)

In [12]:
cat_cols=ld_all.select_dtypes([object]).columns
cat_cols

Index(['Sex', 'Ticket', 'Cabin', 'Embarked', 'data'], dtype='object')

In [13]:
cat_cols=cat_cols[:-1]
cat_cols

Index(['Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

In [14]:
for col in cat_cols:
    freqs=ld_all[col].value_counts()
    k=freqs.index[freqs>20][:-1]    #keep categories with more than 20 rows; [:-1] drops the last (least frequent) of them, giving n-1 dummies
    for cat in k:
        name=col+'_'+cat
        ld_all[name]=(ld_all[col]==cat).astype(int)
    del ld_all[col]
    print(col)
    

Sex
Ticket
Cabin
Embarked
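
The frequency-thresholded dummy creation in the loop above can be sketched on a toy column. The values and the lower threshold are illustrative assumptions; the notebook itself uses a threshold of 20.

```python
import pandas as pd

# Toy categorical column (illustrative values only)
toy = pd.DataFrame({"Embarked": ["S", "S", "S", "S", "C", "C", "C", "Q", "Q", "X"]})

min_count = 1  # the notebook uses 20; a smaller threshold suits this tiny frame

freqs = toy["Embarked"].value_counts()        # sorted by frequency, descending
keep = freqs[freqs > min_count].index[:-1]    # frequent categories, minus the last one

for cat in keep:
    toy["Embarked_" + cat] = (toy["Embarked"] == cat).astype(int)
del toy["Embarked"]

print(list(toy.columns))
```

Rare categories (here `X`) and the dropped last category (here `Q`) both fall into the all-zeros baseline. This also appears to be why `Ticket` and `Cabin` contribute no columns in the notebook: judging by the columns that remain afterwards, no single ticket or cabin value occurs more than 20 times.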


In [15]:
ld_all.isnull().sum()

Survived        0
Pclass          0
Age           263
SibSp           0
Parch           0
Fare            1
data            0
Sex_male        0
Embarked_S      0
Embarked_C      0
dtype: int64

In [16]:
#Fill missing values in each predictor with the mean computed on the training rows only
for col in ld_all.columns:
    if (col not in ['Survived','data']) and (ld_all[col].isnull().sum()>0):
        ld_all.loc[ld_all[col].isnull(),col]=ld_all.loc[ld_all['data']=='train',col].mean()

In [17]:
ld_all.isnull().sum()

Survived      0
Pclass        0
Age           0
SibSp         0
Parch         0
Fare          0
data          0
Sex_male      0
Embarked_S    0
Embarked_C    0
dtype: int64
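
A minimal sketch of the imputation step, on a made-up frame: the mean is computed from the training rows only and then used to fill missing values in both train and test rows.

```python
import numpy as np
import pandas as pd

# Toy combined frame (illustrative values only)
toy = pd.DataFrame({
    "Age": [20.0, 40.0, np.nan, np.nan],
    "data": ["train", "train", "train", "test"],
})

# Mean computed from training rows only, then applied to every missing value
train_mean = toy.loc[toy["data"] == "train", "Age"].mean()
toy["Age"] = toy["Age"].fillna(train_mean)

print(toy["Age"].tolist())
```

Using only the training rows for the statistic keeps information from the test set out of the features the model is trained on.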

In [18]:
ld_train=ld_all[ld_all['data']=='train'].copy()
del ld_train['data']
ld_test=ld_all[ld_all['data']=='test'].copy()   #.copy() avoids pandas' SettingWithCopyWarning on the in-place drop below
ld_test.drop(['Survived','data'],axis=1,inplace=True)


In [19]:
del ld_all

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
ld_train1,ld_train2=train_test_split(ld_train,test_size=0.2,random_state=2)

In [22]:
ld_train.shape

(891, 9)

In [23]:
ld_train1.shape

(712, 9)

In [24]:
ld_train2.shape

(179, 9)
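
The split sizes can be checked by hand. This sketch assumes the held-out count is the ceiling of `test_size` times the number of rows, which is consistent with the shapes shown above.

```python
import math

n_rows = 891        # rows in ld_train, from the shape shown above
test_size = 0.2

n_valid = math.ceil(test_size * n_rows)   # 178.2 rounds up to 179
n_train = n_rows - n_valid

print(n_train, n_valid)
```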

In [25]:
#For any modelling in scikit-learn, you need to pass the predictors and the response separately:
#all predictors go in x_train and the response goes in y_train

#axis=1 means we drop a column
#We are not using inplace=True because we do not want to remove that column from the data itself.
#drop returns a copy that keeps all other variables except Survived


#Build model on ld_train1 using x_train1 and y_train1
x_train1=ld_train1.drop('Survived',axis=1) 
y_train1=ld_train1['Survived']             

#Check performance on ld_train2 using x_train2 and y_train2
x_train2=ld_train2.drop('Survived',axis=1) #Predictors for the validation rows
y_train2=ld_train2['Survived']  #Actual values

# LogisticRegression

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [158]:
logr=LogisticRegression()

In [159]:
logr.fit(x_train1,y_train1)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [160]:
prediction  = logr.predict(x_train2)
#predict() internally uses a cutoff of 0.5, which is not what we are interested in
#We want to look at the probabilities, so we will use the predict_proba function


In [161]:
from sklearn.metrics import classification_report

In [162]:
print(classification_report(y_train2 ,prediction))

              precision    recall  f1-score   support

           0       0.75      0.88      0.81       100
           1       0.80      0.62      0.70        79

   micro avg       0.77      0.77      0.77       179
   macro avg       0.77      0.75      0.75       179
weighted avg       0.77      0.77      0.76       179

In [163]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train2 ,prediction)

#Here,
#TN= 88
#TP=49
#FP=12
#FN=30

array([[88, 12],
       [30, 49]], dtype=int64)
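
As a cross-check, the headline metrics can be recomputed by hand from the four counts in the confusion matrix. This is a small worked example using only the numbers shown above.

```python
# Counts taken from the confusion matrix above: rows are actual, columns are predicted
tn, fp = 88, 12
fn, tp = 30, 49

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted survivors, how many actually survived
recall = tp / (tp + fn)      # of actual survivors, how many were found

print(round(accuracy, 4), round(precision, 2), round(recall, 2))
```

The precision and recall agree with the class-1 row of the classification report (0.80 and 0.62).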

In [164]:
from sklearn.metrics import accuracy_score

accuracy_score(y_train2 ,prediction)

0.7653631284916201

In [165]:
logr.predict_proba(x_train2)

array([[0.71340202, 0.28659798],
       [0.8684658 , 0.1315342 ],
       [0.2353462 , 0.7646538 ],
       [0.86620755, 0.13379245],
       [0.68019006, 0.31980994],
       [0.89772567, 0.10227433],
       [0.89017768, 0.10982232],
       [0.88263734, 0.11736266],
       [0.79617926, 0.20382074],
       [0.82000214, 0.17999786],
       [0.81249563, 0.18750437],
       [0.14174429, 0.85825571],
       [0.4026941 , 0.5973059 ],
       [0.87229976, 0.12770024],
       [0.87229976, 0.12770024],
       [0.41329629, 0.58670371],
       [0.89964586, 0.10035414],
       [0.94326276, 0.05673724],
       [0.25373827, 0.74626173],
       [0.96476576, 0.03523424],
       [0.44418253, 0.55581747],
       [0.4053359 , 0.5946641 ],
       [0.91769794, 0.08230206],
       [0.4262831 , 0.5737169 ],
       [0.86569076, 0.13430924],
       [0.28930925, 0.71069075],
       [0.37175045, 0.62824955],
       [0.72843153, 0.27156847],
       [0.89101351, 0.10898649],
       [0.89334668, 0.10665332],
       [0.

In [166]:
logr.classes_

array([0, 1])

In [167]:
#Extract only the second column (probability of class 1)

predicted_prob=logr.predict_proba(x_train2)[:,1]
predicted_prob

array([0.28659798, 0.1315342 , 0.7646538 , 0.13379245, 0.31980994,
       0.10227433, 0.10982232, 0.11736266, 0.20382074, 0.17999786,
       0.18750437, 0.85825571, 0.5973059 , 0.12770024, 0.12770024,
       0.58670371, 0.10035414, 0.05673724, 0.74626173, 0.03523424,
       0.55581747, 0.5946641 , 0.08230206, 0.5737169 , 0.13430924,
       0.71069075, 0.62824955, 0.27156847, 0.10898649, 0.10665332,
       0.13510818, 0.08311764, 0.4276015 , 0.3998437 , 0.37963471,
       0.54250149, 0.91226458, 0.049118  , 0.10314929, 0.17606261,
       0.08100049, 0.06855675, 0.65039231, 0.10283602, 0.13175172,
       0.69201026, 0.92746796, 0.10350193, 0.16956053, 0.77737582,
       0.10322013, 0.23243001, 0.22330031, 0.26715516, 0.19021786,
       0.20763745, 0.73284715, 0.13188193, 0.05525264, 0.0966366 ,
       0.68773496, 0.30438811, 0.08873281, 0.10285309, 0.7737212 ,
       0.09458438, 0.80817657, 0.21395565, 0.06929736, 0.10604635,
       0.55781094, 0.02762159, 0.51526682, 0.10816765, 0.62129

In [169]:
# score model performance on the validation data (ld_train2)
roc_auc_score(y_train2 ,predicted_prob)

0.8340506329113924
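
One way to read this number: ROC AUC is the probability that a randomly chosen survivor receives a higher score than a randomly chosen non-survivor. The sketch below computes it by brute force over all pairs; the helper name `pairwise_auc` and the toy data are illustrative and not part of the notebook.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores (illustrative only, not notebook output)
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_toy, s_toy))
```

Because only the ranking of the scores matters, AUC does not depend on any particular cutoff, which is why it is computed from `predict_proba` output rather than from hard predictions.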

# Random Forest

In [27]:
from sklearn.ensemble import RandomForestClassifier

In [30]:
RF=RandomForestClassifier(criterion="entropy",
                            max_leaf_nodes=10,
                            class_weight="balanced")
RF

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=None, max_features='auto',
            max_leaf_nodes=10, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators='warn', n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [31]:
RF.fit(x_train1,y_train1)



RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=None, max_features='auto',
            max_leaf_nodes=10, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [32]:
p=RF.predict_proba(x_train2)[:,1]
p

array([0.4873706 , 0.19753376, 0.84819392, 0.34739877, 0.29143959,
       0.11132037, 0.43092328, 0.11132037, 0.47372671, 0.35379676,
       0.33690563, 0.85122877, 0.80280048, 0.34739877, 0.34739877,
       0.60480296, 0.43462269, 0.11970401, 0.7160233 , 0.14210697,
       0.51131716, 0.69037147, 0.35227494, 0.60827549, 0.12340904,
       0.68992096, 0.69303248, 0.52788808, 0.23938844, 0.33690563,
       0.19753376, 0.11970401, 0.42503414, 0.4873706 , 0.42637835,
       0.68064286, 0.83367524, 0.57192456, 0.11132037, 0.23280045,
       0.11970401, 0.23502155, 0.67249519, 0.11954177, 0.19753376,
       0.44934229, 0.80700539, 0.18931236, 0.19017844, 0.69303248,
       0.18931236, 0.43092328, 0.38360624, 0.43092328, 0.24969158,
       0.44847682, 0.45686819, 0.19753376, 0.11970401, 0.21458712,
       0.80280048, 0.65648397, 0.197696  , 0.11132037, 0.71286677,
       0.23938844, 0.8154236 , 0.43092328, 0.57192456, 0.11954177,
       0.68064286, 0.11970401, 0.63455021, 0.11132037, 0.66427

In [35]:
roc_auc_score(y_train2,p)

0.8417088607594937

# The RF score is slightly better than logistic regression on the validation set (AUC 0.842 vs 0.834). We will still try to improve the score using parameter tuning

# Hyperparameter Tuning with RF

In [38]:
#On Entire train data 

x_train=ld_train.drop('Survived',axis=1)  #Predictor (Except response variable)
y_train=ld_train['Survived']  #Response 

In [36]:
from sklearn.ensemble import RandomForestClassifier

In [40]:
#Total number of predictors / Features   

x_train.shape

(891, 8)

In [41]:

param_dist = {"n_estimators":[100,200,300,500,700,1000],
              "max_features": [5,6,7],   #It should not exceed the number of predictors (8 here)
              "bootstrap": [True, False],
              'class_weight':[None,'balanced'], 
                'criterion':['entropy','gini'],
                'max_depth':[None,5,10,15,20,30,50,70],
                'min_samples_leaf':[1,2,5,10,15,20], 
                'min_samples_split':[2,5,10,15,20]
                  }
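
To see why a randomized search is used here, it helps to count the full grid. The sketch below multiplies out the list lengths of the same search space and then draws 10 settings at random. It is a simplification: the sampling uses Python's `random` module with replacement, whereas scikit-learn, as far as I know, samples without replacement when every parameter is given as a list.

```python
import math
import random

# Same search space as param_dist above
param_dist = {
    "n_estimators": [100, 200, 300, 500, 700, 1000],
    "max_features": [5, 6, 7],
    "bootstrap": [True, False],
    "class_weight": [None, "balanced"],
    "criterion": ["entropy", "gini"],
    "max_depth": [None, 5, 10, 15, 20, 30, 50, 70],
    "min_samples_leaf": [1, 2, 5, 10, 15, 20],
    "min_samples_split": [2, 5, 10, 15, 20],
}

# Size of the exhaustive grid: product of the number of options per parameter
n_combinations = math.prod(len(v) for v in param_dist.values())
print(n_combinations)

# Draw 10 settings at random (seed chosen arbitrarily for reproducibility)
rng = random.Random(0)
samples = [{k: rng.choice(v) for k, v in param_dist.items()} for _ in range(10)]
print(samples[0])
```

With 34,560 combinations and 5-fold cross-validation, an exhaustive grid would need 172,800 model fits, while 10 random draws need only 50.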

In [42]:
clf = RandomForestClassifier()


In [45]:
from sklearn.model_selection import RandomizedSearchCV

# run randomized search
n_iter_search = 10

random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search,scoring='roc_auc',cv=5)


random_search.fit(x_train, y_train) 

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'n_estimators': [100, 200, 300, 500, 700, 1000], 'max_features': [5, 6, 7], 'bootstrap': [True, False], 'class_weight': [None, 'balanced'], 'criterion': ['entropy', 'gini'], 'max_depth': [None, 5, 10, 15, 20, 30, 50, 70], 'min_samples_leaf': [1, 2, 5, 10, 15, 20], 'min_samples_split': [2, 5, 10, 15, 20]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_sco

In [46]:
random_search.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=15, max_features=7,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=5,
            min_samples_split=15, min_weight_fraction_leaf=0.0,
            n_estimators=300, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [47]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.5f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [48]:
#Tentative performance

report(random_search.cv_results_,5)

Model with rank: 1
Mean validation score: 0.871 (std: 0.03424)
Parameters: {'n_estimators': 300, 'min_samples_split': 15, 'min_samples_leaf': 5, 'max_features': 7, 'max_depth': 15, 'criterion': 'entropy', 'class_weight': 'balanced', 'bootstrap': True}

Model with rank: 2
Mean validation score: 0.867 (std: 0.03934)
Parameters: {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 7, 'max_depth': 20, 'criterion': 'gini', 'class_weight': 'balanced', 'bootstrap': True}

Model with rank: 3
Mean validation score: 0.867 (std: 0.03006)
Parameters: {'n_estimators': 1000, 'min_samples_split': 10, 'min_samples_leaf': 10, 'max_features': 6, 'max_depth': None, 'criterion': 'gini', 'class_weight': 'balanced', 'bootstrap': True}

Model with rank: 4
Mean validation score: 0.867 (std: 0.03487)
Parameters: {'n_estimators': 700, 'min_samples_split': 5, 'min_samples_leaf': 10, 'max_features': 6, 'max_depth': 20, 'criterion': 'gini', 'class_weight': None, 'bootstrap': False}


# Our highest score is 0.87 with RF

# Hyperparameter Tuning with Logistic Regression

We now know the tentative performance. Let's build the model on the entire training data to make predictions on the test/production data.

In [52]:
#On Entire train data 

x_train=ld_train.drop('Survived',axis=1)  #Predictor (Except response variable)
y_train=ld_train['Survived']  #Response 

In [174]:
params={'class_weight':['balanced',None],
        'penalty':['l1','l2'],
        'C':np.linspace(0.1,1000,10)}

In [176]:
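#Note: the 'l1' penalty needs a solver that supports it (e.g. solver='liblinear'); newer scikit-learn versions default to 'lbfgs', which does not
model=LogisticRegression(fit_intercept=True)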

In [177]:
from sklearn.model_selection import GridSearchCV

grid_search=GridSearchCV(model,param_grid=params,cv=5,scoring="roc_auc")

In [178]:
grid_search.fit(x_train,y_train)













GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'class_weight': ['balanced', None], 'penalty': ['l1', 'l2'], 'C': array([1.000e-01, 1.112e+02, 2.223e+02, 3.334e+02, 4.445e+02, 5.556e+02,
       6.667e+02, 7.778e+02, 8.889e+02, 1.000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

In [180]:
grid_search.best_estimator_

LogisticRegression(C=333.4, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [181]:
logr=grid_search.best_estimator_

Using the report function defined below, you can also see the CV performance of the top few models; that is the tentative performance.

In [182]:
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

In [183]:
#Tentative performance

report(grid_search.cv_results_,5)

Model with rank: 1
Mean validation score: 0.849 (std: 0.014)
Parameters: {'C': 333.4, 'class_weight': None, 'penalty': 'l1'}

Model with rank: 1
Mean validation score: 0.849 (std: 0.014)
Parameters: {'C': 888.9, 'class_weight': None, 'penalty': 'l1'}

Model with rank: 3
Mean validation score: 0.849 (std: 0.014)
Parameters: {'C': 111.19999999999999, 'class_weight': None, 'penalty': 'l1'}

Model with rank: 4
Mean validation score: 0.849 (std: 0.014)
Parameters: {'C': 777.8, 'class_weight': None, 'penalty': 'l1'}

Model with rank: 5
Mean validation score: 0.849 (std: 0.014)
Parameters: {'C': 444.5, 'class_weight': None, 'penalty': 'l2'}



# Highest score is 0.849 with Logistic Regression

# Hence we will use RF to predict on the entire test data

In [53]:
Rf_final=random_search.best_estimator_

In [56]:
#logr.fit(x_train,y_train)
Rf_final.fit(x_train,y_train)


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='entropy', max_depth=15, max_features=7,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=5,
            min_samples_split=15, min_weight_fraction_leaf=0.0,
            n_estimators=300, n_jobs=None, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [57]:
#Predict on test data

#test_pred=logr.predict_proba(ld_test)[:,1]
test_pred=Rf_final.predict_proba(ld_test)[:,1]

In [None]:
pd.DataFrame(test_pred).to_csv("mysubmission.csv",index=False)

In [58]:
cutoffs=np.linspace(0.01,0.99,99)
cutoffs

array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11,
       0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22,
       0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31, 0.32, 0.33,
       0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44,
       0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55,
       0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63, 0.64, 0.65, 0.66,
       0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77,
       0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88,
       0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99])

In [60]:
#[:,1] extracts only the second column (probability of class 1)

#train_score=logr.predict_proba(x_train)[:,1]

train_score=Rf_final.predict_proba(x_train)[:,1]
real=y_train



In [61]:
KS_all=[]

for cutoff in cutoffs:
    
    predicted=(train_score>cutoff).astype(int)

    TP=((predicted==1) & (real==1)).sum()
    TN=((predicted==0) & (real==0)).sum()
    FP=((predicted==1) & (real==0)).sum()
    FN=((predicted==0) & (real==1)).sum()
    
    P=TP+FN
    N=TN+FP
      
    KS=(TP/P)-(FP/N)
    
    
    KS_all.append(KS)

In [62]:
list(zip(cutoffs,KS_all))

#How do we find where KS takes its maximum value,
#i.e. for which cutoff is KS maximum?

[(0.01, 0.00910746812386154),
 (0.02, 0.063752276867031),
 (0.03, 0.08743169398907102),
 (0.04, 0.12568306010928965),
 (0.05, 0.1839708561020036),
 (0.060000000000000005, 0.2295081967213115),
 (0.06999999999999999, 0.27322404371584696),
 (0.08, 0.30601092896174864),
 (0.09, 0.3351548269581056),
 (0.09999999999999999, 0.3570127504553734),
 (0.11, 0.39344262295081966),
 (0.12, 0.42258652094717664),
 (0.13, 0.45355191256830596),
 (0.14, 0.47540983606557374),
 (0.15000000000000002, 0.5125587192023775),
 (0.16, 0.5406001342153197),
 (0.17, 0.5588150704630428),
 (0.18000000000000002, 0.5832134982264403),
 (0.19, 0.6003259514907487),
 (0.2, 0.6167193941136995),
 (0.21000000000000002, 0.6258268622375611),
 (0.22, 0.6411178218770972),
 (0.23, 0.6553062985332183),
 (0.24000000000000002, 0.663311283673665),
 (0.25, 0.6749592560636564),
 (0.26, 0.6836832518454606),
 (0.27, 0.682916307161346),
 (0.28, 0.6945642795513374),
 (0.29000000000000004, 0.6960023008340523),
 (0.3, 0.7105742498322308),
 (0.3

In [63]:
#Cutoff at which KS is maximum (np.argmax returns the index of the first maximum)
mycutoff=cutoffs[np.argmax(KS_all)]
mycutoff

0.53
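
The cutoff search can be sketched on a small example. KS at a given cutoff is the true positive rate minus the false positive rate, and the chosen cutoff is the one where that gap is widest. The scores and labels below are made up for illustration.

```python
import numpy as np

# Toy scores and labels (illustrative only)
scores = np.array([0.12, 0.33, 0.47, 0.55, 0.68, 0.78, 0.92])
labels = np.array([0, 0, 1, 0, 1, 1, 1])

cutoffs = np.linspace(0.1, 0.9, 9)
ks_values = []
for cutoff in cutoffs:
    predicted = (scores > cutoff).astype(int)
    tpr = ((predicted == 1) & (labels == 1)).sum() / (labels == 1).sum()
    fpr = ((predicted == 1) & (labels == 0)).sum() / (labels == 0).sum()
    ks_values.append(tpr - fpr)

best = int(np.argmax(ks_values))   # index of the first maximum
print(round(float(cutoffs[best]), 2), round(float(ks_values[best]), 3))
```

In the notebook the KS values are computed from scores on the same data the model was fit on, which tends to be optimistic; a cutoff chosen on held-out predictions may generalize better.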

In [None]:
#Rf_final.intercept_

In [196]:
list(zip(x_train.columns,logr.coef_[0]))

[('Pclass', -1.1008878513502827),
 ('Age', -0.03965323763742151),
 ('SibSp', -0.3259235687089192),
 ('Parch', -0.0931922446649076),
 ('Fare', 0.001960051861875102),
 ('Sex_male', -2.7220497084800686),
 ('Embarked_S', -0.3990108105155722),
 ('Embarked_C', 0.01791075266829576)]
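
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives the multiplicative change in the odds of survival for a one-unit increase in that feature, holding the others fixed. A small worked example using two of the coefficients printed above:

```python
import math

# Coefficients copied from the output above
coefs = {"Pclass": -1.1008878513502827, "Sex_male": -2.7220497084800686}

# exp(beta) is the odds ratio for a one-unit increase in the feature
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

Under this model, being male multiplies the odds of survival by roughly 0.07, and each step down in passenger class (1st to 2nd, 2nd to 3rd) multiplies them by roughly 0.33. These coefficients come from a penalized fit, so they are best read as indicative rather than exact effect sizes.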

## If you simply had to submit probability scores, you could do this

In [66]:
#Note: logr above holds grid_search.best_estimator_

#test_score=logr.predict_proba(ld_test)[:,1]
test_score=Rf_final.predict_proba(ld_test)[:,1]


In [67]:
#classes_ gives the order of the classes: 0, then 1,
#so the first column is the probability of the outcome being 0 and the second of it being 1

Rf_final.classes_

array([0, 1])

In [None]:
pd.DataFrame(test_score).to_csv("mysubmission.csv",index=False)

# If you had to submit hard classes, you can apply the cutoff obtained above and then submit

In [68]:
#test_score>mycutoff

In [69]:
(test_score>mycutoff).sum()

153

In [70]:
#The booleans above are converted into 1 and 0 using astype(int)

test_classes=(test_score>mycutoff).astype(int)

In [71]:
test_classes

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [None]:
pd.DataFrame(test_classes).to_csv("mysubmission.csv",index=False)