This kernel builds on ideas from the following tutorial:

https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/

In this kernel we use lasso (L1-penalized) regression:

https://en.wikipedia.org/wiki/Lasso_(statistics)

In [1]:
# Loading the packages
import numpy as np
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import make_scorer 
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

import warnings
warnings.filterwarnings('ignore')


In [2]:
# Loading the training dataset
df_train = pd.read_csv("../input/train.csv")

In [3]:
y = df_train["target"]
# We exclude the target and id columns from the training dataset
df_train.pop("target");
df_train.pop("id")
colnames1 = df_train.columns

We standardize the explanatory variables by removing the mean and scaling to unit variance. This matters for penalized logistic regression because the L1 penalty acts on the coefficient magnitudes, so variables measured on different scales would be penalized unevenly. The standard score of a variable $X$ is calculated as follows:

$$ z=\frac{X-\mu}{s} $$

where $\mu$ is the mean and $s$ is the standard deviation.

In [4]:
scaler = StandardScaler()
scaler.fit(df_train)
X = scaler.transform(df_train)
df_train = pd.DataFrame(data = X, columns=colnames1)   # df_train is standardized 
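
As a quick check of the formula, the sketch below compares a manual z-score with `StandardScaler` on a tiny made-up array (`X_toy` is illustrative, not competition data). `StandardScaler` divides by the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the training columns (values are made up)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same transformation through StandardScaler
z_scaler = StandardScaler().fit_transform(X_toy)

print(np.allclose(z_manual, z_scaler))
```

After the transformation each column has mean 0 and unit variance, so the penalty treats all variables on the same footing.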

In this kernel:

https://www.kaggle.com/ricardorios/random-forests-don-t-overfit

we found that the following variables are related to the target variable: 33, 279, 272, 83, 237, 241, 91, 199, 216, 19, 65, 141, 70, 243, 137, 26, 90. However, in this kernel:

https://www.kaggle.com/ricardorios/logistic-regression-with-lasso-don-t-overfit

the best predictors turned out to be 33, 272, 237, 91, 199, 65, and 90. This result was obtained with lasso regression, which can be seen as a method of variable selection:

https://courses.cs.washington.edu/courses/cse446/13sp/slides/lasso-annotated.pdf

In [5]:
random_forest_predictors = ["33", "279", "272", 
                           "83", "237", "241", 
                           "91", "199", "216", 
                           "19", "65", "141", "70", "243", "137", "26", "90"]

selected_predictors = [0, 2, 4, 6, 7, 10, 16]
new_predictors = []

for i in selected_predictors: 
    new_predictors.append(random_forest_predictors[i])

df_train = df_train[new_predictors]


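Next we add polynomial features, which is a kind of feature engineering. With degree 2 and `interaction_only=True`, the expansion keeps the original variables and adds a constant column plus the pairwise products of the variables (no squared terms). For background on feature engineering, we suggest reading the following: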

https://towardsdatascience.com/feature-engineering-what-powers-machine-learning-93ab191bcc2d

In [6]:
poly = PolynomialFeatures(2, interaction_only=True)
poly.fit(df_train)
X = poly.transform(df_train)
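
To make the size of the expanded matrix concrete, here is a small sketch on made-up data with seven columns named like the selected predictors (`df_toy` and `poly_toy` are illustrative names). Seven inputs give 1 constant column, 7 original columns, and 21 pairwise products, 29 columns in total.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Seven made-up columns named like the selected predictors
cols = ["33", "272", "237", "91", "199", "65", "90"]
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(rng.normal(size=(5, len(cols))), columns=cols)

poly_toy = PolynomialFeatures(2, interaction_only=True)
X_toy = poly_toy.fit_transform(df_toy)

# 1 constant + 7 original columns + C(7, 2) = 21 pairwise products
print(X_toy.shape)

# The first interaction column is the product of the first two inputs
first_interaction = X_toy[:, 8]
```

This count matches the 29 coefficients the fitted model reports later in the kernel.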

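To regularize the model we use [Lasso Regression](https://www.statisticshowto.datasciencecentral.com/lasso-regression/), here in the form of an L1 penalty on the logistic regression coefficients. One advantage of this approach is that the fitted model is sparse, so the non-zero coefficients point to the most useful predictors. Next, we perform a grid search over the parameter C, the inverse of the regularization strength (smaller values mean stronger regularization).  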
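
The sketch below illustrates how C affects sparsity under an L1 penalty. It uses synthetic data from `make_classification` rather than the competition data, and the C values are picked only for illustration, so the counts it prints say nothing about this kernel's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the competition data): 250 rows, 20 features, 4 informative
X_syn, y_syn = make_classification(n_samples=250, n_features=20, n_informative=4,
                                   n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

C_values = [0.001, 0.01, 0.1, 1, 100]
nonzero_counts = []
for C in C_values:
    # liblinear is one of the solvers that supports the L1 penalty
    clf_syn = LogisticRegression(penalty="l1", C=C, solver="liblinear", random_state=0)
    clf_syn.fit(X_syn, y_syn)
    nonzero_counts.append(int(np.count_nonzero(clf_syn.coef_)))

for C, n in zip(C_values, nonzero_counts):
    print(f"C={C}: {n} non-zero coefficients")
```

Small C values push more coefficients to exactly zero, which is why the grid search covers several orders of magnitude.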

In [7]:
# We adapt code from this kernel: 
# https://www.kaggle.com/vincentlugat/logistic-regression-rfe

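# Find best hyperparameters (roc_auc)
random_state = 0
# liblinear supports the L1 penalty; the default solver in recent
# scikit-learn releases (lbfgs) does not, so we set it explicitly
clf = LogisticRegression(solver = 'liblinear', random_state = random_state)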
param_grid = {'class_weight' : ['balanced'], 
              'penalty' : ['l1'],  
              'C' : [0.0001, 0.0005, 0.001, 
                     0.005, 0.01, 0.05, 0.1, 0.5, 1, 
                     10, 100, 1000, 1500, 2000 
                     ], 
              'max_iter' : [100, 1000] }

# Make an roc_auc scoring object using make_scorer()
scorer = make_scorer(roc_auc_score)

grid = GridSearchCV(estimator = clf, param_grid = param_grid , 
                    scoring = scorer, verbose = 10, cv=20,
                    n_jobs = -1)



grid.fit(X,y)

print("Best Score:" + str(grid.best_score_))
print("Best Parameters: " + str(grid.best_params_))

best_parameters = grid.best_params_

Fitting 20 folds for each of 28 candidates, totalling 560 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    2.7s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1864s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0701s.) Setting batch_size=10.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 239 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:    4.3s


Best Score:0.7441500000000001
Best Parameters: {'C': 0.1, 'class_weight': 'balanced', 'max_iter': 100, 'penalty': 'l1'}


[Parallel(n_jobs=-1)]: Done 479 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 560 out of 560 | elapsed:    5.1s finished
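
A note on the scorer: with its default settings, `make_scorer(roc_auc_score)` passes the output of `predict` (hard 0/1 labels) to the metric, so the best score printed above is an AUC computed from class labels rather than from predicted probabilities. The built-in `scoring='roc_auc'` option works on continuous scores instead. The sketch below uses synthetic data to show the two quantities side by side; it is not a rerun of the grid search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (not the competition data)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.5, random_state=0, stratify=y_syn)

clf_syn = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", random_state=0)
clf_syn.fit(X_tr, y_tr)

# AUC from hard class labels: what make_scorer(roc_auc_score) computes by default
auc_from_labels = roc_auc_score(y_te, clf_syn.predict(X_te))

# AUC from predicted probabilities: the kind of input scoring="roc_auc" relies on
auc_from_proba = roc_auc_score(y_te, clf_syn.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_proba)
```

If the competition metric is AUC on probabilities, passing `scoring='roc_auc'` to `GridSearchCV` may track it more closely, at the cost of changing the scores reported here.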


In [8]:
# We get the best model 
best_clf = grid.best_estimator_
print(best_clf)

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l1', random_state=0,
          solver='warn', tol=0.0001, verbose=0, warm_start=False)


The best model is obtained with C=0.1. Next, we fit a model with these parameters on the whole training dataset.

In [9]:
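# Same settings as the best estimator printed above. The solver is named
# explicitly (liblinear supports the L1 penalty) and the 'warn' placeholders
# are dropped, since recent scikit-learn releases reject them.
model = LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          penalty='l1', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)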

model.fit(X, y);



The coefficients of the fitted model are shown below.


In [10]:
print(model.coef_)

[[ 0.          0.66988478  0.04135936 -0.1308607  -0.21900114  0.23835914
   0.47937183 -0.07259055  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.00232618  0.          0.11266933 -0.0468331   0.        ]]


Finally, we generate the submission file.

In [11]:
df_test = pd.read_csv("../input/test.csv")
df_test.pop("id");
X = df_test 
X = scaler.transform(X)
df_test = pd.DataFrame(data = X, columns=colnames1)   # df_test is standardized with the scaler fitted on the training data
df_test = df_test[new_predictors]

X = poly.transform(df_test)
y_pred = model.predict_proba(X)
y_pred = y_pred[:,1]    
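
The test set goes through the `scaler` and `poly` objects that were fitted on the training data, and nothing is refitted on the test data. One way to make that hard to get wrong is to chain the steps in a scikit-learn `Pipeline`. The sketch below is an alternative arrangement on synthetic data, not the code this kernel used; the names `pipe`, `X_tr`, and `X_te` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for seven selected predictors (not the competition data)
X_syn, y_syn = make_classification(n_samples=300, n_features=7, n_informative=4,
                                   n_redundant=0, random_state=0)
X_tr, y_tr = X_syn[:250], y_syn[:250]
X_te = X_syn[250:]

# Scaling, interaction features, and the L1 logistic model in one estimator
pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(2, interaction_only=True),
    LogisticRegression(penalty="l1", C=0.1, class_weight="balanced",
                       solver="liblinear", random_state=0),
)

# fit() learns the scaling statistics from the training rows only;
# predict_proba() reuses them on the held-out rows
pipe.fit(X_tr, y_tr)
proba_te = pipe.predict_proba(X_te)[:, 1]
print(proba_te[:5])
```

A pipeline can also be passed to `GridSearchCV`, in which case the scaler is refitted inside each fold.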

In [12]:
# submit prediction
smpsb_df = pd.read_csv("../input/sample_submission.csv")
smpsb_df["target"] = y_pred
smpsb_df.to_csv("logistic_regression_l2_v2.csv", index=None)