# Tuning
Here we tune the hyperparameters of the model we are using for the time being.
**Note** that the models, parameters and methods are all subject to change as I study other machine learning models.

For now we go with Logistic Regression. Let's try to tune it and see where it gets us!

First, let's use GridSearchCV; if it turns out to be too time-consuming, we will switch to RandomizedSearchCV.
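
For reference, here is a minimal sketch of what the RandomizedSearchCV fallback could look like. It is not part of the original run: the data is a synthetic stand-in from `make_classification`, and the `C` range, `n_iter` and `cv` values are assumptions chosen only for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# Instead of enumerating a fixed grid, describe where to sample from.
param_distributions = {
    'C': loguniform(1e-3, 1e2),        # continuous, log-uniform range
    'solver': ['liblinear', 'lbfgs'],  # lists are sampled uniformly
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of sampled candidates; this bounds the run time
    cv=5,
    scoring='accuracy',
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)
print("Best score:", search.best_score_)
print("Best params:", search.best_params_)
```

The run time is controlled by `n_iter` rather than by the size of the grid, which is the main reason to prefer it when a full grid gets too slow.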

## Retrieving data

In [44]:
import pandas as pd
# 1. Read data and form X and y
train_df = pd.read_csv('train.csv')
X = train_df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y = train_df['Survived']
# 2. Convert male, female and C, Q, S into numeric category codes
mapping = {'male': 0, 'female': 1, 'C': 0, 'Q': 1, 'S': 2}
X.replace({'Sex': mapping, 'Embarked': mapping}, inplace=True)
# 3. Replace NaNs with the mean Age (a scalar fillna fills NaNs in every column, not only Age)
X.fillna(X['Age'].mean(), inplace=True)
X.head()

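| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | 2.0 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0.0 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 2.0 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 2.0 |
| 4 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | 2.0 |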
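
One thing to keep in mind about the cell above: calling `fillna` with a single scalar fills missing values in every column, so any gap outside `Age` also receives the mean age. A possible alternative, sketched here on a tiny made-up frame (the values are hypothetical, not taken from `train.csv`), is to pass a dict with one fill value per column.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with gaps in two columns.
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': [2.0, 2.0, np.nan, 0.0],
})

# A dict maps each column to its own fill value:
# the mean for numeric Age, the most frequent code for categorical Embarked.
fill_values = {
    'Age': demo['Age'].mean(),
    'Embarked': demo['Embarked'].mode()[0],
}
filled = demo.fillna(fill_values)
print(filled)
```

Whether the mode is the right choice for `Embarked` is a judgment call; the point is only that each column gets a value that makes sense for it.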


## Tuning hyperparameters

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

log_regression = LogisticRegression(penalty='l2',  # Supported by most solvers
                                    dual=False,    # Since n_samples > n_features
                                    fit_intercept=True, # Why not use intercept?
                                    )
solver_options = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
param_grid = {'solver': solver_options}
grid = GridSearchCV(log_regression, param_grid, cv=30, scoring='accuracy')
grid.fit(X, y)
print("Best score:", grid.best_score_)
print("Best params:", grid.best_params_)
print("Best model:", grid.best_estimator_, "\n---------------")




Best score: 0.799102132435
Best params: {'solver': 'liblinear'}
Best model: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 
---------------




Playing with various cv values, we see that the best score is **~0.79** and the best solver for this problem is **liblinear**. We also get a lot of ConvergenceWarnings telling us that convergence was not reached.
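
A common way to reduce ConvergenceWarnings is to standardize the features and raise `max_iter`, since solvers such as `sag` and `saga` are sensitive to feature scale. The sketch below shows the idea on synthetic data; the pipeline, the `max_iter=1000` value and the inflated feature are assumptions for illustration, not something that was run on the Titanic data here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; one feature is inflated to mimic a column like Fare.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=0)
X_demo[:, 0] *= 1000.0

# Scaling happens inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', max_iter=1000),
)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print("Mean accuracy:", scores.mean())
```

The same pipeline object can be passed to GridSearchCV in place of the bare estimator, with parameter names prefixed by the step name (for example `logisticregression__solver`).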
Let's try to tune various parameters for the liblinear solver and see what happens:

In [19]:
log_regression = LogisticRegression(penalty='l2',
                                    fit_intercept=True, # Why not use intercept?
                                    solver='liblinear'
                                    )
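# Assumption: the original definition of penalty_options is not shown.
# 'l2' is used here because it is the only penalty liblinear supports with dual=True.
penalty_options = ['l2']
dual_options = [False, True]
intercept_scaling_options = [0.8, 1.0, 1.2]
param_grid = {'penalty': penalty_options, 'dual': dual_options, 'intercept_scaling': intercept_scaling_options}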

grid = GridSearchCV(log_regression, param_grid, cv=30, scoring='accuracy')
grid.fit(X, y)
print("Best score:", grid.best_score_)
print("Best params:", grid.best_params_)
print("Best model:", grid.best_estimator_, "\n---------------")

Best score: 0.799102132435
Best params: {'dual': False, 'intercept_scaling': 0.8, 'penalty': 'l2'}
Best model: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=0.8, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 
---------------


This still gives the same **~0.79** accuracy score.
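
To see whether the candidates actually differ or are all within noise of each other, it can help to look at the whole `cv_results_` table rather than only the winner. A minimal sketch on synthetic stand-in data (the data and the small grid are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

demo_grid = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'intercept_scaling': [0.8, 1.0, 1.2]},
    cv=5,
    scoring='accuracy',
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the score across folds.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[['param_intercept_scaling', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

If the differences in `mean_test_score` are much smaller than `std_test_score`, the "best" parameters are probably not meaningfully better than the defaults.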

For now, let's submit it and check our score. This will be the initial benchmark that we will seek to improve!

In [55]:
# Submitting

# Reading test dataset
test_df = pd.read_csv('test.csv')
X_test = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
# Convert male, female and C, Q, S into numeric category codes
mapping = {'male': 0, 'female': 1, 'C': 0, 'Q': 1, 'S': 2}
X_test.replace({'Sex': mapping, 'Embarked': mapping}, inplace=True)
# Replace NaNs in Age
X_test.fillna(X['Age'].mean(), inplace=True)

# Getting a prediction with the best parameters found above
# (parameters left at their default values are omitted)
log_regression = LogisticRegression(penalty='l2', dual=False, fit_intercept=True,
                                    intercept_scaling=0.8, solver='liblinear')
log_regression.fit(X, y)
y_pred = log_regression.predict(X_test)

submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": y_pred})
submission.to_csv('submission.csv', index=False)
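
As an aside, the parameters do not have to be copied by hand. With the default `refit=True`, GridSearchCV refits the best candidate on all of the data it was given, so the fitted search object can predict directly. A minimal sketch on synthetic stand-in data (the names `X_fit`, `X_new` and `demo_grid` are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: X_fit plays the role of train.csv, X_new of test.csv.
X_all, y_all = make_classification(n_samples=300, n_features=7, random_state=0)
X_fit, X_new, y_fit, _ = train_test_split(X_all, y_all, test_size=0.3, random_state=0)

demo_grid = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'intercept_scaling': [0.8, 1.0, 1.2]},
    cv=5,
    scoring='accuracy',
)
demo_grid.fit(X_fit, y_fit)

# predict() delegates to best_estimator_, already refit on all of X_fit.
y_pred_demo = demo_grid.predict(X_new)
same = (y_pred_demo == demo_grid.best_estimator_.predict(X_new)).all()
print("Predictions match best_estimator_:", same)
```

This avoids the parameters in the submission cell drifting out of sync with what the search actually found.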

# FUTURE
**Not taken into consideration for now!**

In [61]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# random_forest = RandomForestClassifier(n_estimators=100)
# random_forest.fit(X, y)
# #Y_pred = random_forest.predict(X_test)
# random_forest.score(X, y)
# acc_random_forest = round(random_forest.score(X, y) * 100, 2)
# print(acc_random_forest)

# random_forest = RandomForestClassifier(n_estimators=100)
# scores = cross_val_score(random_forest, X, y, scoring='accuracy', cv=30)
# print("Mean cross-validated score is:", scores.mean())

# Maybe tune n_estimators?
random_forest = RandomForestClassifier()
n_estimators_range = range(1, 35)
param_grid = {'n_estimators': n_estimators_range}
grid = GridSearchCV(random_forest, param_grid, cv=15, scoring='accuracy')
grid.fit(X, y)
print("Best score:", grid.best_score_)
print("Best params:", grid.best_params_)
print("Best model:", grid.best_estimator_, "\n---------------")


Best score: 0.822671156004
Best params: {'n_estimators': 14}
Best model: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=14, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False) 
---------------
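
Note that `RandomForestClassifier()` without a `random_state` builds different trees on every run, so the best `n_estimators` above may change from one run to the next. A minimal sketch of making such a search repeatable, on synthetic stand-in data (the data, the small grid and the helper `run_search` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

def run_search():
    """Run a small grid search with a fixed seed and return (best params, best score)."""
    forest = RandomForestClassifier(random_state=0)  # fixed seed for the trees
    grid = GridSearchCV(forest, {'n_estimators': [5, 10, 20]},
                        cv=5, scoring='accuracy')
    grid.fit(X_demo, y_demo)
    return grid.best_params_, grid.best_score_

first = run_search()
second = run_search()
print("Same result on both runs:", first == second)
```

With the seed fixed, differences between runs come only from changes to the data or the grid, which makes comparisons against the logistic regression benchmark easier to trust.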
