# This notebook is an extension of my first project on the Titanic dataset

1. The goal of this notebook is to improve my score on the Kaggle leaderboard for the Titanic project, where the task is to predict whether a passenger survived from a given set of independent variables (features).
2. The training and test datasets have already been cleaned of missing values and are ready to be preprocessed for a machine learning model.
3. In this notebook I want to try a different approach and see whether it improves the score.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# For visualization
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt

mpl.rcParams["patch.force_edgecolor"]=True
plt.style.use("seaborn-darkgrid")
%matplotlib inline


# Preprocessing
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Metrics
from sklearn.metrics import accuracy_score

# Models
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier, Pool, cv

# Evaluation
from sklearn.model_selection import GridSearchCV, cross_val_predict, cross_val_score

Let's import the datasets and check the first rows of each.

In [2]:
train = pd.read_csv("data/train_no_missing_values.csv")
train2 = train.copy()  # copy used for the second preprocessing approach
train.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Title,Person
0,0,3,22.0,1,0,7.25,S,Mr,male
1,1,1,38.0,1,0,71.2833,C,Mrs,female
2,1,3,26.0,0,0,7.925,S,Miss,female
3,1,1,35.0,1,0,53.1,S,Mrs,female
4,0,3,35.0,0,0,8.05,S,Mr,male


In [3]:
test = pd.read_csv("data/test_no_missing_values.csv")
test2 = test.copy()  # copy used for the second preprocessing approach
test.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Embarked,Title,Person
0,892,3,34.5,0,0,7.8292,Q,Mr,male
1,893,3,47.0,1,0,7.0,S,Mrs,female
2,894,2,62.0,0,0,9.6875,Q,Mr,male
3,895,3,27.0,0,0,8.6625,S,Mr,male
4,896,3,22.0,1,1,12.2875,S,Mrs,female


In [4]:
# Make a copy of Passenger column for submission
passengerId = test["PassengerId"]
# Drop Passenger column from dataframe
test.drop(["PassengerId"], axis=1, inplace=True)
# Create copy of dataframe to the X_test variable
X_test = test.copy()
# Check the head of dataframe
X_test.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Embarked,Title,Person
0,3,34.5,0,0,7.8292,Q,Mr,male
1,3,47.0,1,0,7.0,S,Mrs,female
2,2,62.0,0,0,9.6875,Q,Mr,male
3,3,27.0,0,0,8.6625,S,Mr,male
4,3,22.0,1,1,12.2875,S,Mrs,female


**Create X_train and y_train**

In [None]:
X_train = train.drop(["Survived"], axis=1)
y_train = train["Survived"]  # our true label

When I did EDA on the Titanic dataset in the previous notebook, I found that the distribution of the Fare column is skewed to the right because of a few outliers. I want to check whether normalizing this column improves the model.

Because X_train and X_test must have identical features, I'll apply every preprocessing step to both.

In [None]:
train.Fare.plot.hist();
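
One thing worth checking before normalizing: min-max scaling is a linear rescaling, so it changes the range of Fare but not the shape of its distribution. The sketch below uses a few illustrative fares (the five values shown in `train.head()` plus one made-up large fare standing in for an outlier) and compares the skewness after min-max scaling and after a log transform. The log transform is only a suggestion for comparison; this notebook does not apply it.

```python
import numpy as np
import pandas as pd

# Illustrative fares: five values from train.head() plus one hypothetical outlier
fare_demo = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.0])

# Min-max scaling by hand: (x - min) / (max - min)
fare_minmax = (fare_demo - fare_demo.min()) / (fare_demo.max() - fare_demo.min())

# log1p compresses large values and is still defined for a fare of 0
fare_log = np.log1p(fare_demo)

print(f"Skew raw:     {fare_demo.skew():.2f}")
print(f"Skew min-max: {fare_minmax.skew():.2f}")
print(f"Skew log1p:   {fare_log.skew():.2f}")
```

The raw and min-max skewness come out the same, while the log-transformed values are less skewed. Tree-based models such as the ones used here are largely insensitive to either transform, so any change in score may be small.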

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
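# Normalize the Fare and Age columns in X_train.
# Each scaler is fit on the training data only and kept for reuse on X_test.
fare_scaler = MinMaxScaler()
age_scaler = MinMaxScaler()
X_train["Fare"] = fare_scaler.fit_transform(X_train[["Fare"]])
X_train["Age"] = age_scaler.fit_transform(X_train[["Age"]])

In [None]:
# Normalize Fare and Age in X_test with the scalers fit on X_train,
# so that both dataframes share the same scale
X_test["Fare"] = fare_scaler.transform(X_test[["Fare"]])
X_test["Age"] = age_scaler.transform(X_test[["Age"]])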
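
A minimal sketch of how test-set values depend on where the scaler is fit. With a scaler fit on the test set alone, a fare of 50 becomes 1.0. With a scaler fit on the training set, the same fare becomes 0.5, which is the value the model would associate with a fare of 50 during training. The fares below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical fares, chosen so the two sets cover different ranges
train_fares = np.array([[0.0], [10.0], [100.0]])
test_fares = np.array([[5.0], [50.0]])

# Option 1: a separate scaler fit on the test set itself
separate = MinMaxScaler().fit_transform(test_fares)

# Option 2: one scaler fit on the training set and reused on the test set
shared_scaler = MinMaxScaler().fit(train_fares)
shared = shared_scaler.transform(test_fares)

print("Separate scaler:", separate.ravel())
print("Shared scaler:  ", shared.ravel())
```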

In [None]:
X_train.head()

In [None]:
# Sum SibSp and Parch into a new Family column
X_train["Family"] = X_train["SibSp"] + X_train["Parch"]
X_test["Family"] = X_test["SibSp"] + X_test["Parch"]

In [None]:
X_train.head()

In [None]:
# Drop the SibSp and Parch columns
X_train = X_train.drop(["SibSp", "Parch"], axis=1)
X_test = X_test.drop(["SibSp", "Parch"], axis=1)

In [None]:
X_train.head()

In [None]:
X_train.dtypes

In [None]:
col_list = ["Pclass","Title","Family","Embarked","Person"]

Now, let's turn the categorical columns into numbers.

In [None]:
pd.get_dummies(data=X_test,
               prefix=col_list,
               columns=col_list,drop_first=True)

In [None]:
# For column in X_train dataframe
X_train_dm = pd.get_dummies(data=X_train,
                            prefix=col_list,
                            columns=col_list,
                            drop_first=True)
# For column in X_test dataframe
X_test_dm = pd.get_dummies(data=X_test,
                           prefix=col_list,
                           columns=col_list,
                           drop_first=True)
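
One caveat with calling `pd.get_dummies` on each dataframe separately: if a category appears in one dataframe but not the other, the two results get different columns, and `drop_first=True` may drop a different category in each. The sketch below shows the effect on a hypothetical Title column and one possible fix, which is to declare the categories from the training data before encoding. The demo frames and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the test set is missing two of the training titles
train_demo = pd.DataFrame({"Title": ["Mr", "Mrs", "Miss", "Master"]})
test_demo = pd.DataFrame({"Title": ["Mr", "Mrs"]})

# Encoding each frame on its own produces different columns
naive_train = pd.get_dummies(train_demo, columns=["Title"], drop_first=True)
naive_test = pd.get_dummies(test_demo, columns=["Title"], drop_first=True)
print(list(naive_train.columns))
print(list(naive_test.columns))

# Fix the category set from the training data, then encode both frames
categories = sorted(train_demo["Title"].unique())
for frame in (train_demo, test_demo):
    frame["Title"] = pd.Categorical(frame["Title"], categories=categories)

train_dm_demo = pd.get_dummies(train_demo, columns=["Title"], drop_first=True)
test_dm_demo = pd.get_dummies(test_demo, columns=["Title"], drop_first=True)
print(list(train_dm_demo.columns))
print(list(test_dm_demo.columns))
```

A title that appears only in the test set becomes a missing value under this scheme and ends up as an all-zero row, so it is worth checking for such values first.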

In [None]:
X_train_dm.head()

In [None]:
# X_test = pd.get_dummies(X_test, drop_first=True)

In [None]:
# Create Base model
clf_model = GradientBoostingClassifier(random_state=45).fit(X_train_dm, y_train)

# Accuracy
acc_clf = round(clf_model.score(X_train_dm, y_train)*100,2)

# Cross Validation with 10-folds
train_pred = cross_val_predict(clf_model,
                               X_train_dm,
                               y_train,
                               n_jobs=-1,
                               cv=10)

acc_cv_clf = round(accuracy_score(y_train, train_pred)*100,2)

print(f"Base Accuracy: {acc_clf}")
print(f"Base Accuracy with cv=10: {acc_cv_clf} ")
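
The first number above is measured on the same rows the model was trained on, so it tends to be optimistic; the cross-validated number is the better estimate of leaderboard performance. As a rough sketch, `cross_val_score` (already imported above) also gives the per-fold scores, which shows how much the estimate varies between folds. The data here is synthetic, generated with `make_classification`, and only stands in for the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_dm / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=45)

demo_model = GradientBoostingClassifier(random_state=45)

# One accuracy value per fold
fold_scores = cross_val_score(demo_model, X_demo, y_demo, scoring="accuracy", cv=10)

# Accuracy on the training rows themselves, for comparison
train_score = demo_model.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"Training accuracy: {train_score * 100:.2f}")
print(f"CV accuracy: {fold_scores.mean() * 100:.2f} (std {fold_scores.std() * 100:.2f})")
```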

In [None]:
clf_model.get_params()

Let's improve our model by tuning its hyperparameters. First, use GridSearchCV to find the best parameters.

In [None]:
# Parameters
clf_params_gs = {"n_estimators": np.arange(10,120, 10),
                 "min_samples_split": np.arange(2,20, 2),
                 "min_samples_leaf": [1,2,4],
                 "max_depth": [2,3,4],
                 "max_features": ["auto", "sqrt","log2"]}

# Instantiate GridSearchCV
clf_grid = GridSearchCV(clf_model,
                        param_grid=clf_params_gs,
                        scoring="accuracy",
                        verbose=1,
                        n_jobs=-1,
                        cv=3)
# Fit GridSearchCV
clf_grid.fit(X_train_dm, y_train)

**Best hyperparameters found with GridSearchCV**

In [None]:
clf_grid.best_params_

In [None]:
clf_grid.best_score_
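
Instead of copying the best parameters by hand into a new model, the fitted search object can be used directly. With the default `refit=True`, GridSearchCV refits the best parameter combination on the whole training set and exposes it as `best_estimator_`. A small sketch on synthetic data, with a deliberately tiny grid so it runs quickly (the grid values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train_dm / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=45)

# Tiny illustrative grid
small_grid = {"n_estimators": [10, 30], "max_depth": [2, 3]}

demo_grid = GridSearchCV(GradientBoostingClassifier(random_state=45),
                         param_grid=small_grid,
                         scoring="accuracy",
                         cv=3)
demo_grid.fit(X_demo, y_demo)

# Already refit on all of X_demo with the best parameters
tuned_model = demo_grid.best_estimator_
demo_predictions = tuned_model.predict(X_demo)

print(demo_grid.best_params_)
```

This avoids typos when transferring values and keeps the model in sync if the grid changes.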

In [None]:
best_clf_model = GradientBoostingClassifier(n_estimators=100,
                                            max_features="auto",
                                            min_samples_split=16,
                                            min_samples_leaf=2,
                                            random_state=45,
                                            max_depth=3)

# Fit model
best_clf_model.fit(X_train_dm, y_train)
# Predict on the test set with the tuned model
gbc_predictions = best_clf_model.predict(X_test_dm)

# Cross Validation 
best_cv_pred = cross_val_predict(best_clf_model,
                                 X_train_dm,
                                 y_train,
                                 n_jobs=-1,
                                 cv=10)
# Accuracy score
clf_cv_acc = round(accuracy_score(y_train, best_cv_pred)*100, 2)
clf_cv_acc

**Get predicted probabilities and set a threshold**

In [None]:
gbc_cv_proba = cross_val_predict(best_clf_model,
                                 X_train_dm,
                                 y_train,
                                 method="predict_proba",
                                 n_jobs=-1,
                                 cv=10)

In [None]:
# Classify from the predicted probability of the positive class
gbc_y_prob = gbc_cv_proba[:,1]

threshold = 0.7

# 1 where the probability exceeds the threshold, otherwise 0
y_new_pred = (gbc_y_prob > threshold).astype(int)

In [None]:
new_acc_gbc = round(accuracy_score(y_train, y_new_pred)*100,2)
new_acc_gbc
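
Rather than trying a single threshold, it may help to compare several. The helper below is a hypothetical sketch (the function name and the demo values are made up) that computes the accuracy for each threshold in a list, using the same rule as above: predict 1 when the probability exceeds the threshold.

```python
import numpy as np

def accuracy_by_threshold(y_true, y_prob, thresholds):
    """Return {threshold: accuracy} for predictions made as y_prob > threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    return {t: float(np.mean((y_prob > t).astype(int) == y_true))
            for t in thresholds}

# Hypothetical labels and predicted probabilities
y_demo = [0, 0, 1, 1, 1, 0]
prob_demo = [0.10, 0.40, 0.35, 0.80, 0.65, 0.75]

scores = accuracy_by_threshold(y_demo, prob_demo, [0.3, 0.5, 0.7])
for t, acc in scores.items():
    print(f"threshold={t}: accuracy={acc:.2f}")
```

In the notebook this would be called with `y_train` and `gbc_y_prob`. A threshold picked this way is tuned on the same predictions it is scored on, so the resulting accuracy is somewhat optimistic.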

**Random Forest Classifier**

In [None]:
# Base model
rfc_model = RandomForestClassifier(random_state=45).fit(X_train_dm, y_train)
# Accuracy for the model fit once on the full training set
rfc_acc = round(rfc_model.score(X_train_dm, y_train)*100,2)
# Cross validation
rfc_pred = cross_val_predict(rfc_model,
                             X_train_dm,
                             y_train,
                             n_jobs=-1,
                             cv=10)

# Accuracy for model with cross validation 10
rfc_cv_acc = round(accuracy_score(y_train, rfc_pred)*100,2)


print(f"Base Random Forest accuracy: {rfc_acc}%")
print(f"Base Random Forest accuracy with cv=10: {rfc_cv_acc}")

Let's see if we can improve the model.

In [None]:
rfc_params = {"n_estimators": np.arange(10,120, 10),
                 "min_samples_split": np.arange(2,20, 2),
                 "min_samples_leaf": [1,2,4],
                 "max_depth": [2,3,4],
                 "max_features": ["auto", "sqrt","log2"]}

# Instantiate GridSearchCV
rfc_grid = GridSearchCV(rfc_model,
                        param_grid=rfc_params,
                        scoring="accuracy",
                        verbose=1,
                        n_jobs=-1,
                        cv=3)
# Fit GridsearchCV
rfc_grid.fit(X_train_dm, y_train)

In [None]:
rfc_grid.best_params_

In [None]:
rfc_grid.best_score_

Best hyperparameters for Random Forest Classifier found with GridSearchCV

In [None]:
# Instantiate model
best_rfc_model = RandomForestClassifier(n_estimators=30,
                                        # class_weight="balanced",
                                        max_features="auto",
                                        min_samples_split=12,
                                        min_samples_leaf=2,
                                        random_state=45,
                                        max_depth=4)
# Fit data
best_rfc_model.fit(X_train_dm, y_train)

# Training accuracy with best parameters
best_acc = round(best_rfc_model.score(X_train_dm, y_train)*100,2)

# Accuracy with cross validation
rfc_pred = cross_val_predict(best_rfc_model,
                             X_train_dm,
                             y_train,
                             n_jobs=-1,
                             cv=10)

# Cross validation score
rfc_cv_acc = round(accuracy_score(y_train, rfc_pred)*100,2)

print(f"Best training accuracy: {best_acc}%")
print(f"Best accuracy with cv=10: {rfc_cv_acc}%")

**CatBoost**

In [None]:
# np.float was removed from recent NumPy versions; the builtin float is equivalent here
cat_feature = np.where(X_train_dm.dtypes != float)[0]

train_pool = Pool(X_train_dm, 
                  y_train,
                  cat_feature)

catboost_model = CatBoostClassifier(iterations=1000,
                                    custom_loss=["Accuracy"],
                                    loss_function="Logloss")

catboost_model.fit(train_pool, plot=True)

acc_catboost = round(catboost_model.score(X_train_dm, y_train)*100, 2)
acc_catboost

In [None]:
acc_catboost

In [None]:
# Run Cross validation for 10 folds
cv_params = catboost_model.get_params()

cv_data = cv(train_pool, 
             cv_params,
             fold_count=10,
             plot=False)

# Accuracy score
cat_cv_acc = round(np.max(cv_data["test-Accuracy-mean"])*100, 2)

In [None]:
# Create a table with all scores
models = pd.DataFrame({
    "Model": ["Gradient Boost","Random Forest", "CatBoost"],
    "Score": [clf_cv_acc, 
              rfc_cv_acc, 
              cat_cv_acc]
})

models.sort_values(by="Score", 
                   ascending=False,
                   ignore_index=True)

In [None]:
def plot_feature_importance(models, data):
    """
    Function creates plots for given list of models
    """
    for model in models:
        # Build a DataFrame pairing each column with its feature importance
        feat_imp = pd.DataFrame({"imp": model.feature_importances_,
                                 "col": data.columns})
        # Sort by importance and keep the 30 most important features
        feat_imp = feat_imp.sort_values(["imp","col"],
                                        ascending=[True, False]).iloc[-30:]
        
        # Create plot
        _ = feat_imp.plot(kind="barh",
                          x="col",
                          y="imp",
                          figsize=(20,10))      

In [None]:
models_list = [best_clf_model, best_rfc_model, catboost_model]

plot_feature_importance(models_list, X_train_dm)

# Let's try preprocessing our data with LabelEncoder and see if that changes anything

In [5]:
train2.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,Title,Person
0,0,3,22.0,1,0,7.25,S,Mr,male
1,1,1,38.0,1,0,71.2833,C,Mrs,female
2,1,3,26.0,0,0,7.925,S,Miss,female
3,1,1,35.0,1,0,53.1,S,Mrs,female
4,0,3,35.0,0,0,8.05,S,Mr,male


In [6]:
test2.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Embarked,Title,Person
0,892,3,34.5,0,0,7.8292,Q,Mr,male
1,893,3,47.0,1,0,7.0,S,Mrs,female
2,894,2,62.0,0,0,9.6875,Q,Mr,male
3,895,3,27.0,0,0,8.6625,S,Mr,male
4,896,3,22.0,1,1,12.2875,S,Mrs,female


**One way to do it:**

In [7]:
train2["Family"] = train2["SibSp"] + train2["Parch"]
test2["Family"] = test2["SibSp"] + test2["Parch"]

train2.drop(["SibSp", "Parch"], axis=1, inplace=True)
test2.drop(["SibSp", "Parch"], axis=1, inplace=True)

In [8]:
train2_col = ['Pclass', 'Embarked', 'Title', 'Person',
       'Family']

In [9]:
train2_con = train2[train2_col]

train2_con = train2_con.apply(LabelEncoder().fit_transform)

In [10]:
train2_con["Age"] = train2["Age"]
train2_con["Fare"] = train2["Fare"]

# train2_con["Age"] = MinMaxScaler().fit_transform(train2[["Age"]])
# train2_con["Fare"] = MinMaxScaler().fit_transform(train2[["Fare"]])

In [11]:
test2_col = ['Pclass', 'Embarked', 'Title', 'Person',
       'Family']

In [12]:
test2_con = test2[test2_col]

test2_con = test2_con.apply(LabelEncoder().fit_transform)

test2_con["Age"] = test2["Age"]
test2_con["Fare"] = test2["Fare"]

# test2_con["Age"] = MinMaxScaler().fit_transform(test2[["Age"]])
# test2_con["Fare"] = MinMaxScaler().fit_transform(test2[["Fare"]])
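
One thing to watch with the cells above: a LabelEncoder fit separately on the training and test frames only assigns the same integer to the same category if both frames contain exactly the same set of values in that column. A possible alternative, sketched below on hypothetical data, is to fit a single `OrdinalEncoder` on the training frame and reuse it on the test frame. The `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data: the test set contains only two of the four training titles
train_demo = pd.DataFrame({"Title": ["Master", "Miss", "Mr", "Mrs"]})
test_demo = pd.DataFrame({"Title": ["Mr", "Mrs"]})

# Separate fit on the test frame: codes restart from 0
separate_codes = LabelEncoder().fit_transform(test_demo["Title"])

# Shared encoder fit on the training frame; unseen categories map to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_demo[["Title"]])
shared_codes = encoder.transform(test_demo[["Title"]]).ravel().astype(int)

print("Separate fit:", separate_codes.tolist())
print("Shared fit:  ", shared_codes.tolist())
```

With the separate fit, "Mr" is encoded as 0 in the test frame although the model saw it as 2 during training; the shared encoder keeps the codes consistent.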

In [13]:
test2_con.head()

Unnamed: 0,Pclass,Embarked,Title,Person,Family,Age,Fare
0,2,1,2,2,0,34.5,7.8292
1,2,2,3,1,1,47.0,7.0
2,1,1,2,2,0,62.0,9.6875
3,2,2,2,2,0,27.0,8.6625
4,2,2,3,1,2,22.0,12.2875


In [14]:
train2_con.head()

Unnamed: 0,Survived,Pclass,Embarked,Title,Person,Family,Age,Fare
0,0,2,2,2,2,1,22.0,7.25
1,1,0,0,3,1,1,38.0,71.2833
2,1,2,2,1,1,0,26.0,7.925
3,1,0,2,3,1,1,35.0,53.1
4,0,2,2,2,2,0,35.0,8.05


In [15]:
# Build X_train2, y_train2 and X_test2 from the encoded dataframes
X_train2 = train2_con
y_train2 = train2["Survived"]
X_test2 = test2_con

In [16]:
X_test2.shape

(418, 7)

In [17]:
X_train2.shape

(891, 7)

In [18]:
y_train2.shape

(891,)

In [19]:
# Instantiate model
rfc_model2 = RandomForestClassifier(n_estimators=10,
                                   max_features="auto",
                                   min_samples_split=16,
                                   min_samples_leaf=1,
                                   random_state=45,
                                   max_depth=4)
# Fit data
rfc_model2.fit(X_train2, y_train2)

# Accuracy with best parameters
rfc_acc2 = round(rfc_model2.score(X_train2, y_train2)*100,2)

# Accuracy with cross validation
rfc_pred2 = cross_val_predict(rfc_model2,
                              X_train2,
                              y_train2,
                              n_jobs=-1,
                              cv=10)

# Cross validation score
rfc_cv_acc2 = round(accuracy_score(y_train2, rfc_pred2)*100,2)

print(f"Best Accuracy: {rfc_acc2}%")
print(f"Best Accuracy: {rfc_cv_acc2}%")

Best Accuracy: 85.3%
Best Accuracy: 82.94%


In [20]:
rfc_params = {"n_estimators": np.arange(10,120, 10),
                 "min_samples_split": np.arange(2,20, 2),
                 "min_samples_leaf": [1,2,4],
                 "max_depth": [2,3,4],
                 "max_features": ["auto", "sqrt","log2"]}

# Instantiate GridSearchCV
rfc_grid2 = GridSearchCV(rfc_model2,
                        param_grid=rfc_params,
                        scoring="accuracy",
                        verbose=1,
                        n_jobs=-1,
                        cv=3)
# Fit GridsearchCV
rfc_grid2.fit(X_train2, y_train2)

Fitting 3 folds for each of 2673 candidates, totalling 8019 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 376 tasks      | elapsed:   13.5s
[Parallel(n_jobs=-1)]: Done 876 tasks      | elapsed:   31.5s
[Parallel(n_jobs=-1)]: Done 1576 tasks      | elapsed:   58.7s
[Parallel(n_jobs=-1)]: Done 2476 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 3576 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 4876 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 6376 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 8012 out of 8019 | elapsed:  5.8min remaining:    0.3s
[Parallel(n_jobs=-1)]: Done 8019 out of 8019 | elapsed:  5.8min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=4,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=16,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=10, n_jobs=None,
                                              oob...
                                              verbose=0, warm_st

In [21]:
rfc_grid2.best_params_

{'max_depth': 4,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 12,
 'n_estimators': 10}

In [22]:
rfc_grid2.best_score_

0.8271604938271605

In [25]:
# Instantiate model
best_rfc_model2 = RandomForestClassifier(n_estimators=10,
                                        # class_weight="balanced",
                                        max_features="auto",
                                        min_samples_split=12,
                                        min_samples_leaf=1,
                                        random_state=45,
                                        max_depth=4)
# Fit data
best_rfc_model2.fit(X_train2, y_train2)

# Accuracy with best parameters
best_acc2 = round(best_rfc_model2.score(X_train2, y_train2)*100,2)

# Accuracy with cross validation
rfc_pred2 = cross_val_predict(best_rfc_model2,
                              X_train2,
                              y_train2,
                              n_jobs=-1,
                              cv=10)

# Cross validation score
rfc_cv_acc2 = round(accuracy_score(y_train2, rfc_pred2)*100,2)

print(f"Best Accuracy: {best_acc2}%")
print(f"Best Accuracy: {rfc_cv_acc2}%")

Best Accuracy: 84.74%
Best Accuracy: 83.05%


In [26]:
# Create Base model
clf_model2 = GradientBoostingClassifier(random_state=45).fit(X_train2, y_train2)

# Accuracy
acc_clf = round(clf_model2.score(X_train2, y_train2)*100,2)

# Cross Validation with 10-folds
train_pred2 = cross_val_predict(clf_model2,
                                X_train2,
                                y_train2,
                                n_jobs=-1,
                                cv=10)

acc_cv_clf = round(accuracy_score(y_train2, train_pred2)*100,2)

print(f"Base Accuracy: {acc_clf}")
print(f"Base Accuracy with cv=10: {acc_cv_clf} ")

Base Accuracy: 90.35
Base Accuracy with cv=10: 82.94 


In [27]:
# Parameters
clf_params_gs = {"n_estimators": np.arange(10,120, 10),
                 "min_samples_split": np.arange(2,20, 2),
                 "min_samples_leaf": [1,2,4],
                 "max_depth": [2,3,4],
                 "max_features": ["auto", "sqrt","log2"]}

# Instantiate GridSearchCV
clf_grid2 = GridSearchCV(clf_model2,
                         param_grid=clf_params_gs,
                         scoring="accuracy",
                         verbose=1,
                         n_jobs=-1,
                         cv=3)
# Fit GridSearchCV
clf_grid2.fit(X_train2, y_train2)

Fitting 3 folds for each of 2673 candidates, totalling 8019 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 128 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 728 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done 1728 tasks      | elapsed:   27.7s
[Parallel(n_jobs=-1)]: Done 3128 tasks      | elapsed:   54.3s
[Parallel(n_jobs=-1)]: Done 4928 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 7128 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done 8019 out of 8019 | elapsed:  3.1min finished


GridSearchCV(cv=3, error_score=nan,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_c...
                 

In [None]:
clf_grid2.best_params_

In [None]:
clf_grid2.best_score_

In [None]:
best_clf_model2 = GradientBoostingClassifier(n_estimators=10,
                                             max_features="sqrt",
                                             min_samples_split=18,
                                             min_samples_leaf=2,
                                             random_state=45,
                                             max_depth=4)

# Fit model
best_clf_model2.fit(X_train2, y_train2)
# Predict on the test set with the tuned model
gbc_predictions2 = best_clf_model2.predict(X_test2)

# Cross Validation 
best_cv_pred2 = cross_val_predict(best_clf_model2,
                                 X_train2,
                                 y_train2,
                                 n_jobs=-1,
                                 cv=10)
# Accuracy score
clf_cv_acc2 = round(accuracy_score(y_train2, best_cv_pred2)*100, 2)
clf_cv_acc2

**CatBoost again, but with the second preprocessing option**

In [None]:
# np.float was removed from recent NumPy versions; the builtin float is equivalent here
cat_feature = np.where(X_train2.dtypes != float)[0]

train_pool = Pool(X_train2, 
                  y_train2,
                  cat_feature)

catboost_model = CatBoostClassifier(iterations=1000,
                                    custom_loss=["Accuracy"],
                                    loss_function="Logloss")

catboost_model.fit(train_pool, plot=True)

acc_catboost2 = round(catboost_model.score(X_train2, y_train2)*100, 2)

In [None]:
acc_catboost2

In [None]:
# Run Cross validation for 10 folds
cv_params = catboost_model.get_params()

cv_data = cv(train_pool, 
             cv_params,
             fold_count=10,
             plot=True)

# Accuracy score
cat_cv_acc2 = round(np.max(cv_data["test-Accuracy-mean"])*100, 2)

In [None]:
# Create a table with all scores
models = pd.DataFrame({
    "Model": ["Gradient Boost","Random Forest2", "CatBoost"],
    "Score": [clf_cv_acc2, 
              rfc_cv_acc2, 
              cat_cv_acc2]
})

models.sort_values(by="Score", 
                   ascending=False,
                   ignore_index=True)

**But I want to use LabelEncoder on specific columns**

In [None]:
train.head()

In [None]:
# gbc_predictions

In [None]:
new_submission = pd.DataFrame()
new_submission["PassengerId"] = passengerId
new_submission["Survived"] = gbc_predictions
new_submission["Survived"] = new_submission["Survived"].astype(int)
new_submission.to_csv("kaggle _submissions/new_gbc_submission.csv", index=False)

> Think about how to do this with OneHotEncoder
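
A possible starting point, as a sketch only: `ColumnTransformer` and `OneHotEncoder` are both imported at the top of this notebook, and together they can one-hot encode the categorical columns while passing Age and Fare through unchanged. The demo frames below reuse the rows shown in the `head()` outputs (with SibSp and Parch left out for brevity), and `handle_unknown="ignore"` is my assumption for dealing with categories that only appear in the test set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_cols = ["Pclass", "Embarked", "Title", "Person"]

# Small demo frames built from the head() outputs shown in this notebook
train_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr"],
    "Person": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})
test_demo = pd.DataFrame({
    "Pclass": [3, 3, 2, 3, 3],
    "Embarked": ["Q", "S", "Q", "S", "S"],
    "Title": ["Mr", "Mrs", "Mr", "Mr", "Mrs"],
    "Person": ["male", "female", "male", "male", "female"],
    "Age": [34.5, 47.0, 62.0, 27.0, 22.0],
    "Fare": [7.8292, 7.0, 9.6875, 8.6625, 12.2875],
})

encoder = ColumnTransformer(
    # One-hot encode the categorical columns; unseen categories become all zeros
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    # Keep Age and Fare as they are
    remainder="passthrough",
    # Always return a dense array
    sparse_threshold=0,
)

# Fit on the training frame only, then apply the same encoding to the test frame
X_train_ohe = encoder.fit_transform(train_demo)
X_test_ohe = encoder.transform(test_demo)

print(X_train_ohe.shape, X_test_ohe.shape)
```

Because the encoder is fit once on the training frame, both outputs have the same columns in the same order, even though the test demo frame contains a Pclass and an Embarked value that the training demo frame does not.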