### Logistic Regression Model for Kepler Space Telescope Data
The purpose of this project is to reinforce my data science and machine learning knowledge and to help others on their own journeys. I'm not an expert on these topics, and I know there are many tools, concepts and techniques I still need to master. That is why I developed this project: to share with the community what I have learned along the way. I welcome comments, criticism and feedback that help me learn best practices and correct any mistakes I have made.

#### About Logistic Regression
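- It is a classification model
- It predicts one of a discrete set of categories, such as Yes/No, Young/Old or Cold/Hot
- It is a linear classifier, so it works best when the classes can be separated by a roughly linear boundary
- It works only with numbers, so every feature must be numerical (categorical data has to be encoded first)
- In its basic form it is a binary classifier. Tasks with more than two targets need an extension such as one-vs-rest or multinomial (softmax) regression, which scikit-learn's `LogisticRegression` applies automatically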
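
The last point matters here because the Kepler target has three classes. As a minimal sketch on synthetic data (the dataset sizes, seed and `demo_` names are assumptions for illustration, not part of this project), scikit-learn's `LogisticRegression` fits one coefficient row per class and returns one probability per class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data; sizes and seed are arbitrary choices for illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row of coefficients per class, one column per feature.
print(demo_model.coef_.shape)

# Each row of predict_proba holds one probability per class and sums to 1.
demo_proba = demo_model.predict_proba(X_demo[:5])
print(demo_proba.round(3))
print(demo_proba.sum(axis=1))
```

The predicted class is simply the column with the highest probability in each row.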

In [68]:
#import warnings
#warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

In [69]:
kepler_df = pd.read_csv('../../data/exoplanet_data.csv')
kepler_df.head()

Unnamed: 0,koi_disposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,...,koi_steff_err2,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag
0,CONFIRMED,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,...,-81,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347
1,FALSE POSITIVE,0,1,0,0,19.89914,1.49e-05,-1.49e-05,175.850252,0.000581,...,-176,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436
2,FALSE POSITIVE,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,...,-174,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597
3,CONFIRMED,0,0,0,0,2.525592,3.76e-06,-3.76e-06,171.59555,0.00113,...,-211,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509
4,CONFIRMED,0,0,0,0,4.134435,1.05e-05,-1.05e-05,172.97937,0.0019,...,-232,4.486,0.054,-0.229,0.972,0.315,-0.105,296.28613,48.22467,15.714


In [70]:
kepler_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6991 entries, 0 to 6990
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    6991 non-null   object 
 1   koi_fpflag_nt      6991 non-null   int64  
 2   koi_fpflag_ss      6991 non-null   int64  
 3   koi_fpflag_co      6991 non-null   int64  
 4   koi_fpflag_ec      6991 non-null   int64  
 5   koi_period         6991 non-null   float64
 6   koi_period_err1    6991 non-null   float64
 7   koi_period_err2    6991 non-null   float64
 8   koi_time0bk        6991 non-null   float64
 9   koi_time0bk_err1   6991 non-null   float64
 10  koi_time0bk_err2   6991 non-null   float64
 11  koi_impact         6991 non-null   float64
 12  koi_impact_err1    6991 non-null   float64
 13  koi_impact_err2    6991 non-null   float64
 14  koi_duration       6991 non-null   float64
 15  koi_duration_err1  6991 non-null   float64
 16  koi_duration_err2  6991 

In [71]:
print(f"The dataframe has {len(kepler_df.columns)} columns.")
print(f"There are {kepler_df['koi_disposition'].nunique()} target classes: {set(kepler_df['koi_disposition'])}")

The dataframe has 41 columns.
There are 3 target classes: {'CANDIDATE', 'CONFIRMED', 'FALSE POSITIVE'}


In [72]:
kepler_df['koi_disposition'] = kepler_df['koi_disposition'].replace({'CONFIRMED': 0, 'FALSE POSITIVE': 1, 'CANDIDATE': 2})
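
The encoding step above turns the three text labels into integer codes. A small sketch of the same idea that also catches unmapped labels and keeps a reverse lookup for later is shown here. It runs on a hypothetical four-row sample (`demo_labels`), not on the real dataframe, and the helper names are my own:

```python
import pandas as pd

# Hypothetical mini-sample standing in for kepler_df['koi_disposition'].
demo_labels = pd.Series(['CONFIRMED', 'FALSE POSITIVE', 'CANDIDATE', 'CONFIRMED'])

label_to_code = {'CONFIRMED': 0, 'FALSE POSITIVE': 1, 'CANDIDATE': 2}
code_to_label = {code: label for label, code in label_to_code.items()}

# map() yields NaN for any label missing from the dictionary,
# so a typo or an unexpected category is easy to detect.
demo_codes = demo_labels.map(label_to_code)
if demo_codes.isna().any():
    raise ValueError("found a label that is not in label_to_code")
demo_codes = demo_codes.astype(int)

print(demo_codes.tolist())                      # [0, 1, 2, 0]
print(demo_codes.map(code_to_label).tolist())   # back to the original text labels
```

Keep in mind that the codes are only names for the classes: 2 is not "more" than 1, so statistics that treat the codes as quantities should be read with care.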

In [73]:
correlation = kepler_df.corr()
correlation['koi_disposition']

koi_disposition      1.000000
koi_fpflag_nt        0.000416
koi_fpflag_ss        0.013503
koi_fpflag_co        0.008531
koi_fpflag_ec        0.008041
koi_period           0.124647
koi_period_err1      0.099048
koi_period_err2     -0.099048
koi_time0bk          0.070445
koi_time0bk_err1     0.147719
koi_time0bk_err2    -0.147719
koi_impact           0.010607
koi_impact_err1      0.058572
koi_impact_err2     -0.013980
koi_duration         0.029554
koi_duration_err1    0.156587
koi_duration_err2   -0.156587
koi_depth            0.008694
koi_depth_err1       0.001797
koi_depth_err2      -0.001797
koi_prad             0.001485
koi_prad_err1        0.003135
koi_prad_err2       -0.000998
koi_teq              0.021275
koi_insol            0.012070
koi_insol_err1       0.014604
koi_insol_err2      -0.014159
koi_model_snr       -0.016351
koi_tce_plnt_num    -0.095550
koi_steff            0.071048
koi_steff_err1       0.173227
koi_steff_err2      -0.148902
koi_slogg           -0.071437
koi_slogg_

In [74]:
import plotly
import plotly.graph_objs as go
import plotly.figure_factory as figfac
from plotly.offline import iplot

# Plot only the first 10 rows and columns so the heatmap stays readable
corr_subset = correlation.iloc[:10, :10]
fig = go.Figure(data=go.Heatmap(z=corr_subset.round(3).values.tolist(),
                                x=corr_subset.columns.to_list(),
                                y=corr_subset.index.to_list(),
                                colorscale='bluered',
                                text=corr_subset.round(4).values,
                                texttemplate="%{text}",
                                textfont={"size": 12}))
fig.show()

In [75]:
corr_subset = correlation.iloc[:10, :10]  # first 10 rows and columns only
iplot(figfac.create_annotated_heatmap(corr_subset.round(3).values, x=corr_subset.columns.to_list(),
                                      y=corr_subset.index.to_list(),
                                      annotation_text=corr_subset.round(4).values, colorscale='bluered'))



In [76]:
features = kepler_df.drop('koi_disposition', axis = 1)
target = kepler_df['koi_disposition']

print(features.shape)
print(target.shape)

(6991, 40)
(6991,)


In [77]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter = 1000)


In [78]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [79]:
logreg.fit(X_train, y_train)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



LogisticRegression(max_iter=1000)
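
The warning above offers two remedies: raise `max_iter` or scale the data. The following sketch tries the second one on synthetic data whose columns I deliberately stretched to very different ranges (the data, the scale factors and the choice of `StandardScaler` are assumptions for illustration). It compares how many iterations `lbfgs` uses with and without scaling:

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; each column is multiplied by a different factor (assumed values).
X_raw, y_raw = make_classification(n_samples=500, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)
X_raw = X_raw * np.array([1.0, 1e3, 1e-3, 1e4, 10.0])

# Unscaled fit: lbfgs may need many iterations or stop at max_iter.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    unscaled = LogisticRegression(max_iter=1000).fit(X_raw, y_raw)

# Scaled fit: the scaler is fitted inside the pipeline, on the same data as the model.
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_raw, y_raw)

unscaled_iters = unscaled.n_iter_[0]
scaled_iters = scaled.named_steps['logisticregression'].n_iter_[0]
print("iterations without scaling:", unscaled_iters)
print("iterations with scaling:   ", scaled_iters)
```

On the real Kepler features the numbers will differ, but the same comparison shows quickly whether scaling removes the warning.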

In [80]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=10, step=1)
features_selected = rfe.fit(X_train, y_train)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



In [81]:
relevant_features = features.loc[:,features_selected.support_]
relevant_features.head()

Unnamed: 0,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_impact,koi_impact_err2,koi_duration_err1,koi_duration_err2,koi_slogg,koi_srad_err1
0,0,0,0,0,0.586,-0.443,0.116,-0.116,4.467,0.105
1,0,1,0,0,0.969,-0.077,0.0341,-0.0341,4.544,0.233
2,0,1,0,0,1.276,-0.092,0.00537,-0.00537,4.564,0.201
3,0,0,0,0,0.701,-0.478,0.042,-0.042,4.438,0.334
4,0,0,0,0,0.762,-0.532,0.0673,-0.0673,4.486,0.315
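
The column selection above relies on `support_`, the boolean mask that RFE produces. A minimal sketch on synthetic, pre-scaled data (dataset, sizes and `demo_` names are assumptions for illustration) shows how `support_` and `ranking_` relate:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data with 8 features; scaled up front only to keep the example short.
X_sel, y_sel = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   n_redundant=0, random_state=0)
X_sel = StandardScaler().fit_transform(X_sel)

# Drop one feature per round (step=1) until 3 remain.
demo_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
demo_rfe.fit(X_sel, y_sel)

print(demo_rfe.support_)   # boolean mask, True for the kept features
print(demo_rfe.ranking_)   # 1 for kept features; larger numbers were dropped earlier
```

RFE ranks features by the size of the model coefficients, so features on very different scales can be ranked unfairly. Scaling before RFE is one way to reduce that effect.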


In [82]:
# Put the target next to the selected features; both share the same index
relevant_corr_df = pd.concat([target, relevant_features], axis=1)
relevant_corr_df.head()

Unnamed: 0,koi_disposition,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_impact,koi_impact_err2,koi_duration_err1,koi_duration_err2,koi_slogg,koi_srad_err1
0,0,0,0,0,0,0.586,-0.443,0.116,-0.116,4.467,0.105
1,1,0,1,0,0,0.969,-0.077,0.0341,-0.0341,4.544,0.233
2,1,0,1,0,0,1.276,-0.092,0.00537,-0.00537,4.564,0.201
3,0,0,0,0,0,0.701,-0.478,0.042,-0.042,4.438,0.334
4,0,0,0,0,0,0.762,-0.532,0.0673,-0.0673,4.486,0.315


In [83]:
relevant_corr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6991 entries, 0 to 6990
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   koi_disposition    6991 non-null   int64  
 1   koi_fpflag_nt      6991 non-null   int64  
 2   koi_fpflag_ss      6991 non-null   int64  
 3   koi_fpflag_co      6991 non-null   int64  
 4   koi_fpflag_ec      6991 non-null   int64  
 5   koi_impact         6991 non-null   float64
 6   koi_impact_err2    6991 non-null   float64
 7   koi_duration_err1  6991 non-null   float64
 8   koi_duration_err2  6991 non-null   float64
 9   koi_slogg          6991 non-null   float64
 10  koi_srad_err1      6991 non-null   float64
dtypes: float64(6), int64(5)
memory usage: 600.9 KB


In [84]:
rel_correlation = relevant_corr_df.corr()
rel_correlation['koi_disposition']

koi_disposition      1.000000
koi_fpflag_nt        0.000416
koi_fpflag_ss        0.013503
koi_fpflag_co        0.008531
koi_fpflag_ec        0.008041
koi_impact           0.010607
koi_impact_err2     -0.013980
koi_duration_err1    0.156587
koi_duration_err2   -0.156587
koi_slogg           -0.071437
koi_srad_err1        0.069335
Name: koi_disposition, dtype: float64

In [85]:
fig = go.Figure(data=go.Heatmap(z = rel_correlation.round(3).values.tolist(),
                                x = rel_correlation.columns.to_list(),
                                y = rel_correlation.index.to_list(),
                                colorscale = 'ylgnbu',
                                text = rel_correlation.round(4).values,
                                texttemplate = "%{text}",
                                textfont = {"size":12}))
fig.show()

In [86]:
iplot(figfac.create_annotated_heatmap(rel_correlation.round(3).values, x=rel_correlation.columns.to_list(), 
                                  y=rel_correlation.index.to_list(), annotation_text=rel_correlation.round(4).values, colorscale= 'ylgnbu'))

In [87]:
from sklearn.preprocessing import MinMaxScaler
steps = [('scaler', MinMaxScaler()),
         ('logistic_regression', logreg)]

In [88]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps)
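
The pipeline above places `MinMaxScaler` in front of the model, so the scaler learns its minimum and maximum only from the data passed to `fit`. A tiny sketch with made-up numbers (the arrays are assumptions for illustration) shows what that means for unseen data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[2.0], [4.0], [6.0]])
test_col = np.array([[3.0], [8.0]])

# The scaler learns min=2 and max=6 from the training column only.
scaler = MinMaxScaler().fit(train_col)

print(scaler.transform(train_col).ravel())  # [0.  0.5 1. ]
print(scaler.transform(test_col).ravel())   # [0.25 1.5 ]
```

The value 8 maps to 1.5 because it lies above the training maximum. This is expected behavior: the test set is transformed with the training statistics, which is what keeps information from the test set out of the fit.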

In [101]:
c_space = np.logspace(-5, 8, 15)
parameters = {'logistic_regression__C':c_space,
              'logistic_regression__penalty':['l1','l2'],
              'logistic_regression__solver': ['liblinear']}
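
In the grid above, `C` is the inverse of the regularization strength (smaller values regularize more), and the `logistic_regression__` prefix routes each parameter to the pipeline step with that name. As a quick sketch, `ParameterGrid` can count the combinations before running the search (the `_demo` names are my own; the grid values are copied from the cell above):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# 15 values from 1e-5 to 1e8, evenly spaced on a log10 scale.
c_space_demo = np.logspace(-5, 8, 15)
grid_demo = {
    'logistic_regression__C': c_space_demo,
    'logistic_regression__penalty': ['l1', 'l2'],
    'logistic_regression__solver': ['liblinear'],
}

n_candidates = len(ParameterGrid(grid_demo))
print(c_space_demo[0], c_space_demo[-1])
print(n_candidates)        # 15 * 2 * 1 = 30 combinations
print(n_candidates * 5)    # 150 fits with 5-fold cross-validation
```

`liblinear` is fixed as the solver here because it supports both the `l1` and `l2` penalties.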

In [102]:
X_train, X_test, y_train, y_test = train_test_split(relevant_features, target, test_size=0.3, random_state=42)

In [103]:
from sklearn.model_selection import GridSearchCV
logreg_cv = GridSearchCV(pipeline, param_grid = parameters, verbose = 3)

In [104]:
logreg_cv.fit(X_train, y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[CV 1/5] END logistic_regression__C=1e-05, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.250 total time=   0.0s
[CV 2/5] END logistic_regression__C=1e-05, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.250 total time=   0.0s
[CV 3/5] END logistic_regression__C=1e-05, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.251 total time=   0.0s
[CV 4/5] END logistic_regression__C=1e-05, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.251 total time=   0.0s
[CV 5/5] END logistic_regression__C=1e-05, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.251 total time=   0.0s
[CV 1/5] END logistic_regression__C=1e-05, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.508 total time=   0.0s
[CV 2/5] END logistic_regression__C=1e-05, logistic_regr

[CV 4/5] END logistic_regression__C=0.4393970560760795, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.840 total time=   0.0s
[CV 5/5] END logistic_regression__C=0.4393970560760795, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.853 total time=   0.0s
[CV 1/5] END logistic_regression__C=0.4393970560760795, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.856 total time=   0.0s
[CV 2/5] END logistic_regression__C=0.4393970560760795, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.835 total time=   0.0s
[CV 3/5] END logistic_regression__C=0.4393970560760795, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.827 total time=   0.0s
[CV 4/5] END logistic_regression__C=0.4393970560760795, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.822 total time=   0.0s
[CV 5/5] END logistic_regression__C=0.43

[CV 4/5] END logistic_regression__C=19306.977288832535, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.847 total time=   0.0s
[CV 5/5] END logistic_regression__C=19306.977288832535, logistic_regression__penalty=l2, logistic_regression__solver=liblinear;, score=0.854 total time=   0.0s
[CV 1/5] END logistic_regression__C=163789.3706954068, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.875 total time=  10.0s
[CV 2/5] END logistic_regression__C=163789.3706954068, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.859 total time=   5.5s
[CV 3/5] END logistic_regression__C=163789.3706954068, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.872 total time=  13.1s
[CV 4/5] END logistic_regression__C=163789.3706954068, logistic_regression__penalty=l1, logistic_regression__solver=liblinear;, score=0.846 total time=   9.9s
[CV 5/5] END logistic_regression__C=163789.3

GridSearchCV(estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('logistic_regression',
                                        LogisticRegression(max_iter=1000))]),
             param_grid={'logistic_regression__C': array([1.00000000e-05, 8.48342898e-05, 7.19685673e-04, 6.10540230e-03,
       5.17947468e-02, 4.39397056e-01, 3.72759372e+00, 3.16227766e+01,
       2.68269580e+02, 2.27584593e+03, 1.93069773e+04, 1.63789371e+05,
       1.38949549e+06, 1.17876863e+07, 1.00000000e+08]),
                         'logistic_regression__penalty': ['l1', 'l2'],
                         'logistic_regression__solver': ['liblinear']},
             verbose=3)

In [105]:
predict = logreg_cv.predict(X_test)

In [109]:
from sklearn.metrics import classification_report, confusion_matrix
print("Accuracy: {}".format(logreg_cv.score(X_test, y_test)))
print(classification_report(y_test, predict))
print("Tuned Model Parameters: {}".format(logreg_cv.best_params_))
print(confusion_matrix(y_test, predict))

Accuracy: 0.8517635843660629
              precision    recall  f1-score   support

           0       0.70      0.83      0.76       574
           1       0.98      1.00      0.99      1020
           2       0.76      0.59      0.66       504

    accuracy                           0.85      2098
   macro avg       0.82      0.80      0.80      2098
weighted avg       0.85      0.85      0.85      2098

Tuned Model Parameters: {'logistic_regression__C': 2275.845926074791, 'logistic_regression__penalty': 'l2', 'logistic_regression__solver': 'liblinear'}
[[ 474   12   88]
 [   0 1017    3]
 [ 202    6  296]]
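
The numbers in the classification report can be recomputed by hand from the confusion matrix above, which helps to see where the model goes wrong. This sketch copies the matrix from the output (rows are actual classes, columns are predictions); the variable names are my own:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[474, 12, 88],
               [0, 1017, 3],
               [202, 6, 296]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # correct / everything predicted as that class
recall = true_positives / cm.sum(axis=1)      # correct / everything actually in that class
accuracy = true_positives.sum() / cm.sum()

for code, name in enumerate(['CONFIRMED', 'FALSE POSITIVE', 'CANDIDATE']):
    print(f"{name:15s} precision={precision[code]:.2f} recall={recall[code]:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Almost all errors sit between CONFIRMED and CANDIDATE: 202 actual candidates were predicted as confirmed and 88 confirmed planets were predicted as candidates, while FALSE POSITIVE is separated nearly perfectly.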


In [111]:
predictor_df = pd.DataFrame({'Prediction':predict, 'Actual':y_test}).reset_index(drop=True)
predictor_df[['Prediction', 'Actual']] = predictor_df[['Prediction', 'Actual']].replace({0: 'CONFIRMED', 1: 'FALSE POSITIVE', 2 : 'CANDIDATE'})
predictor_df

Unnamed: 0,Prediction,Actual
0,FALSE POSITIVE,FALSE POSITIVE
1,CANDIDATE,CANDIDATE
2,FALSE POSITIVE,FALSE POSITIVE
3,FALSE POSITIVE,FALSE POSITIVE
4,FALSE POSITIVE,FALSE POSITIVE
...,...,...
2093,FALSE POSITIVE,FALSE POSITIVE
2094,CANDIDATE,CONFIRMED
2095,CONFIRMED,CANDIDATE
2096,CONFIRMED,CONFIRMED
