-----------------------------------------------------------------------------
                             Python Script
-----------------------------------------------------------------------------
Author: Dr. Hyunglok Kim
Affiliation: School of Earth Sciences and Environmental Engineering,
             Gwangju Institute of Science and Technology (GIST)

Date: 2023

Version: 1.0

Course: EN5422/EV4238 - Applied Machine Learning for Environmental Data Analysis

-----------------------------------------------------------------------------
                            COPYRIGHT NOTICE
-----------------------------------------------------------------------------
© 2023 Dr. Hyunglok Kim, Gwangju Institute of Science and Technology.
All Rights Reserved.

Permission is granted to any individual or institution to use, copy, or
redistribute this software and documentation, under the following
conditions:

1. The software and documentation must not be distributed for profit,
   and must retain this copyright notice.

2. Any modifications to the software must be documented and those
   modifications must be released under the same terms as this license.

3. This software and documentation is provided "as is". The author(s)
   disclaim all warranties, whether express or implied, including but
   not limited to implied warranties of merchantability and fitness
   for a particular purpose.

-----------------------------------------------------------------------------
                               DESCRIPTION
-----------------------------------------------------------------------------
This script is written as part of the teaching materials for the
EN5422/EV4238 course "Applied Machine Learning for Environmental Data Analysis"
at Gwangju Institute of Science and Technology (GIST).

For detailed usage, please refer to the accompanying documentation
or course materials.

For questions, feedback, or further information, please contact:

Dr. Hyunglok Kim
Email: hyunglokkim@gist.ac.kr

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()
# -y answers the confirmation prompt so the install does not wait for input
!conda install -y pandas numpy matplotlib glmnet

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.1.0-1/Mambaforge-23.1.0-1-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:26
🔁 Restarting kernel...
Collecting package metadata (current_repodata.json): ...

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from glmnet import ElasticNet
from glmnet import LogitNet
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier

# The URL of the xlsx file
url = 'https://github.com/JWarmenhoven/ISLR-python/raw/4100e941914519eea18385daadc9b3dab99ca8e2/Notebooks/Data/Default.xlsx'

# Load the xlsx data
Default = pd.read_excel(url, engine='openpyxl')

# Display the first few rows of the dataframe
print(Default.head())

  warn("Workbook contains no default style, apply openpyxl's default")


   Unnamed: 0 default student      balance        income
0           1      No      No   729.526495  44361.625074
1           2      No     Yes   817.180407  12106.134700
2           3      No      No  1073.549164  31767.138947
3           4      No      No   529.250605  35704.493935
4           5      No      No   785.655883  38463.495879


In [None]:
# Convert 'default' column to numeric values: 'No' -> 0 and 'Yes' -> 1
Default['default'] = Default['default'].map({'No': 0, 'Yes': 1})

# Convert 'student' column to numeric values: 'No' -> 0 and 'Yes' -> 1
Default['student'] = Default['student'].map({'No': 0, 'Yes': 1})

# Build the feature matrix (student, balance, income) and the binary response as NumPy arrays
X = Default[['student', 'balance', 'income']].values
y = Default['default'].values
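
The Yes/No encoding above can be sanity-checked on a small example. The sketch below uses a hypothetical toy frame (`toy`, not the Default data) and counts unmapped values, since `Series.map` returns NaN for any label that is missing from the dictionary.

```python
import pandas as pd

# Hypothetical toy frame that mimics the 'default' and 'student' columns
toy = pd.DataFrame({
    "default": ["No", "Yes", "No", "No"],
    "student": ["No", "Yes", "Yes", "No"],
})

yes_no = {"No": 0, "Yes": 1}
encoded = toy.copy()
for col in ["default", "student"]:
    encoded[col] = encoded[col].map(yes_no)

# Series.map yields NaN for labels that are not keys of the dictionary,
# so a NaN count catches unexpected spellings such as 'yes' or 'N'
n_missing = int(encoded.isna().sum().sum())
default_rate = float(encoded["default"].mean())
print("unmapped values:", n_missing)
print("share of defaults:", default_rate)
```

One practical consequence: running the mapping cell a second time on columns that are already 0/1 turns every value into NaN, because 0 and 1 are not keys of the dictionary. Reloading the data before re-running the cell avoids this.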

In [None]:
X

array([[0.00000000e+00, 7.29526495e+02, 4.43616251e+04],
       [1.00000000e+00, 8.17180407e+02, 1.21061347e+04],
       [0.00000000e+00, 1.07354916e+03, 3.17671389e+04],
       ...,
       [0.00000000e+00, 8.45411989e+02, 5.86361570e+04],
       [0.00000000e+00, 1.56900905e+03, 3.66691124e+04],
       [1.00000000e+00, 2.00922183e+02, 1.68629523e+04]])

In [None]:
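# Fit an ordinary least squares model with statsmodels.
# sm.OLS does not add an intercept automatically, so prepend a constant column.
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()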

# Print the summary of the regression
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.124
Model:                            OLS   Adj. R-squared:                  0.124
Method:                 Least Squares   F-statistic:                     471.7
Date:                Wed, 04 Oct 2023   Prob (F-statistic):          1.09e-286
Time:                        05:23:46   Log-Likelihood:                 3653.0
No. Observations:               10000   AIC:                            -7298.
Df Residuals:                    9996   BIC:                            -7269.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0812      0.008     -9.685      0.0
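
The negative intercept in the summary above hints at a known limitation of fitting OLS to a 0/1 response: the fitted values are not restricted to the interval [0, 1]. The sketch below illustrates this on synthetic data. The data-generating numbers and the names `balance`, `y_sim`, and `fitted` are assumptions for illustration, not the Default dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the Default dataset)
n = 2000
balance = rng.uniform(0, 2500, size=n)
true_p = 1.0 / (1.0 + np.exp(-(-10.0 + 0.0055 * balance)))
y_sim = (rng.uniform(size=n) < true_p).astype(float)

# Ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(n), balance])
beta, *_ = np.linalg.lstsq(A, y_sim, rcond=None)
fitted = A @ beta

# Count fitted "probabilities" that fall below zero
n_negative = int((fitted < 0).sum())
print("intercept, slope:", beta)
print("fitted values below 0:", n_negative)
```

Fitted values below zero cannot be read as probabilities, which is one motivation for the logistic regression fitted later in this notebook.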

In [None]:
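# Fit a linear regression model using glmnet

# Note: alpha=0 selects the ridge (L2) penalty with no lasso (L1) term. This keeps the
# model close to a plain linear regression, although the coefficients are still shrunk
# by the L2 penalty, so they will not match the OLS estimates exactly.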
X = Default[['student', 'balance', 'income']].values
y = Default['default'].values

m = ElasticNet(alpha=0, fit_intercept=True)
m = m.fit(X, y)
# Print the coefficients and intercept
print("Intercept:", m.intercept_)
print("Coefficients:", m.coef_)

Intercept: -0.062397715192901146
Coefficients: [-4.75394908e-03  1.09145729e-04  1.76617254e-07]
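
The glmnet coefficients differ from the OLS ones because a ridge penalty shrinks them. The sketch below shows that shrinkage with the textbook closed-form ridge solution on synthetic data. It is a simplified illustration: the helper `ridge_coefs` and the penalty values are hypothetical, and glmnet scales its penalty differently and picks lambda by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (hypothetical, for illustration only)
n, p = 200, 3
X_sim = rng.normal(size=(n, p))
y_sim = X_sim @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

def ridge_coefs(X, y, lam):
    """Closed-form ridge slopes on centered data; the intercept is unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)

# lam = 0 reproduces the OLS slopes; larger lam shrinks them toward zero
ols = ridge_coefs(X_sim, y_sim, lam=0.0)
lambdas = [0.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    b = ridge_coefs(X_sim, y_sim, lam)
    norms.append(float(np.linalg.norm(b)))
    print(f"lambda={lam:7.1f}  coefs={np.round(b, 3)}  norm={norms[-1]:.3f}")
```

The coefficient norm decreases as the penalty grows, which is the same qualitative effect seen when comparing the glmnet fit with the unpenalized OLS fit.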


In [None]:
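# Fit logistic regression model
# Note: 'y' is not a column of Default; the formula interface resolves it from the
# notebook namespace (the array defined above). Writing 'default ~ ...' would use
# the DataFrame column directly.
model_formula = 'y ~ student + balance + income'
fit_lr = smf.logit(formula=model_formula, data=Default).fit()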

# Print the summary
print(fit_lr.summary())

Optimization terminated successfully.
         Current function value: 0.078577
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9996
Method:                           MLE   Df Model:                            3
Date:                Wed, 04 Oct 2023   Pseudo R-squ.:                  0.4619
Time:                        05:06:43   Log-Likelihood:                -785.77
converged:                       True   LL-Null:                       -1460.3
Covariance Type:            nonrobust   LLR p-value:                3.257e-292
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -10.8690      0.492    -22.079      0.000     -11.834      -9.904
student       -0.6468      0
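
Logit coefficients are on the log-odds scale, so exponentiating one gives an odds ratio. The short sketch below does this for the student coefficient only, because the summary above is cut off after that value. The variable names are illustrative.

```python
import math

# Coefficient copied from the Logit summary above
beta_student = -0.6468

# exp(coefficient) is the multiplicative change in the odds of default
# for a student versus a non-student, holding balance and income fixed
odds_ratio = math.exp(beta_student)
print(f"odds ratio for student vs. non-student: {odds_ratio:.3f}")
```

An odds ratio below 1 means that, at the same balance and income, the estimated odds of default are lower for students than for non-students.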

In [None]:
# Fit the logistic regression model using glmnet (alpha=1 is the lasso penalty)
from glmnet import LogitNet
m = LogitNet(alpha=1, standardize=True, fit_intercept=True)
m = m.fit(X, y)
# m.intercept_ and m.coef_ hold the fit at the lambda selected by cross-validation.
# m.coef_ has shape (1, n_features), so m.coef_[:, -1] is the coefficient of the
# last feature (income), not the fit at the last lambda value.
print("Intercept:", m.intercept_)
print("Coefficient for the last feature (income):", m.coef_[:,-1])
print("All coefficients:")
print(m.coef_)

Intercept: -10.263835991714245
Coefficient for the last feature (income): [6.57081728e-07]
All coefficients:
[[-5.28277354e-01  5.37956888e-03  6.57081728e-07]]
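
To see what these numbers mean in practice, the fitted intercept and coefficients can be passed through the logistic function by hand. The sketch below copies the values from the LogitNet output above and assumes they are on the original feature scale in the order student, balance, income. The helper `predict_default_probability` and the two example customers are hypothetical.

```python
import math

# Values copied from the LogitNet output above
intercept = -10.263835991714245
coefs = [-5.28277354e-01, 5.37956888e-03, 6.57081728e-07]  # student, balance, income

def predict_default_probability(student, balance, income):
    """Apply the logistic function to the linear predictor."""
    z = intercept + coefs[0] * student + coefs[1] * balance + coefs[2] * income
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical customers (illustrative inputs, not rows from the dataset)
p_non_student = predict_default_probability(student=0, balance=1500.0, income=40000.0)
p_student = predict_default_probability(student=1, balance=1500.0, income=40000.0)
print(f"non-student: {p_non_student:.4f}")
print(f"student:     {p_student:.4f}")
```

With the same balance and income, the negative student coefficient gives the student a lower predicted probability of default than the non-student.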
