# LOGISTIC REGRESSION

# Using sklearn

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
data = pd.read_csv("datasets/Social_Network_Ads.csv")

In [3]:
X = data.drop(columns='Purchased')
y = data['Purchased']

train test split

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.3, random_state=42)

feature scaling

In [5]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
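
As a rough illustration of why the scaler is fitted on the training set only: `fit_transform` learns each column's mean and standard deviation from the training data, and `transform` reuses those same statistics on the test data. Below is a minimal pure-Python sketch for a single column. The toy values and the helper name `standardize` are hypothetical and not taken from the dataset.

```python
import statistics

def standardize(values, mean, std):
    """Scale values with statistics learned elsewhere (here: the training set)."""
    return [(v - mean) / std for v in values]

# Hypothetical toy column, not from Social_Network_Ads.csv
train_age = [20.0, 30.0, 40.0, 50.0]
test_age = [35.0, 60.0]

# "fit": learn the statistics from the training data only.
# StandardScaler uses the population standard deviation (ddof=0).
mean = statistics.fmean(train_age)
std = statistics.pstdev(train_age)

train_scaled = standardize(train_age, mean, std)  # like sc.fit_transform(X_train)
test_scaled = standardize(test_age, mean, std)    # like sc.transform(X_test)

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, because it is scaled with the training statistics, which keeps information about the test set out of the fitting step.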

train the Logistic regression model

In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=0)

model.fit(X_train,y_train)

predict train set results

In [7]:
y_pred_train = model.predict(X_train)

train accuracy

In [8]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_train,y_pred_train)
print(cm)
print("training accuracy:",accuracy_score(y_train, y_pred_train))

[[170  14]
 [ 34  62]]
training accuracy: 0.8285714285714286


predict test set results

In [9]:
y_pred_test = model.predict(X_test)

making confusion matrix

In [10]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred_test)
print(cm)
print("testing accuracy:",accuracy_score(y_test, y_pred_test))

[[71  2]
 [16 31]]
testing accuracy: 0.85
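
The accuracy above can be recovered by hand from the confusion matrix. sklearn puts actual classes in rows and predicted classes in columns, so the diagonal holds the correct predictions. The sketch below repeats that arithmetic on the printed test matrix and also derives precision and recall for class 1, which the notebook itself does not report (variable names are illustrative).

```python
# Test-set confusion matrix printed above: rows = actual, columns = predicted
cm = [[71, 2],
      [16, 31]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the predicted buyers, how many actually bought
recall = tp / (tp + fn)        # of the actual buyers, how many were found

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

Recall comes out noticeably lower than precision: the 16 false negatives mean the model misses roughly a third of the actual buyers, which the accuracy figure alone hides.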


# Using statsmodels

In [11]:
import statsmodels.api as sm

regression

In [12]:
X_train_sm = sm.add_constant(X_train)

In [13]:
reg_log = sm.Logit(y_train,X_train_sm)
results_log = reg_log.fit()

Optimization terminated successfully.
         Current function value: 0.373125
         Iterations 7


summary table

In [14]:
results_log.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Purchased | No. Observations: | 280 |
| Model: | Logit | Df Residuals: | 277 |
| Method: | MLE | Df Model: | 2 |
| Date: | Sun, 14 Apr 2024 | Pseudo R-squ.: | 0.4196 |
| Time: | 13:38:07 | Log-Likelihood: | -104.48 |
| converged: | True | LL-Null: | -180.02 |
| Covariance Type: | nonrobust | LLR p-value: | 1.56e-33 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.1435 | 0.192 | -5.943 | 0.000 | -1.521 | -0.766 |
| x1 | 2.0126 | 0.269 | 7.474 | 0.000 | 1.485 | 2.540 |
| x2 | 1.1230 | 0.201 | 5.599 | 0.000 | 0.730 | 1.516 |


*Insights:*
- MLE -> Maximum Likelihood Estimation -> finds the coefficients that maximize the likelihood of the observed data (the logistic regression counterpart of a best-fit line)
- bigger likelihood -> the observed data are more probable under the fitted model
- log-likelihood -> never positive for logistic regression (it sums logs of probabilities) -> the closer to 0, the better the fit
- LL-Null -> log-likelihood of a model with no independent variables (intercept only)
- LLR -> log-likelihood ratio test -> tests whether our model differs significantly from the LL-Null model, a.k.a. a useless model
- the LLR p-value here is very small (1.56e-33) -> hence our model is significant
- pseudo R2 -> McFadden's R2 -> **good pseudo R2 - b/w 0.2 - 0.4**; ours is 0.4196
- pseudo R2 -> useful for comparing variations of the same model
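
The pseudo R2 and the LLR p-value in the summary can be checked by hand from the two log-likelihoods. The sketch below copies those numbers from the table. The closed-form tail probability it uses is valid only for 2 degrees of freedom, which matches `Df Model` here; treat it as a sanity check, not as how statsmodels computes the value internally.

```python
import math

# Values copied from the summary table above
ll_model = -104.48   # Log-Likelihood
ll_null = -180.02    # LL-Null
df_model = 2         # Df Model

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic: 2 * (LL_model - LL_null)
llr = 2 * (ll_model - ll_null)

# Under the null the statistic is chi-squared with df_model degrees of freedom.
# For exactly 2 degrees of freedom the upper-tail probability is exp(-x / 2);
# for other df use scipy.stats.chi2.sf(llr, df_model) instead.
assert df_model == 2
llr_p_value = math.exp(-llr / 2)

print(f"pseudo R2: {pseudo_r2:.4f}")
print(f"LLR statistic: {llr:.2f}")
print(f"LLR p-value: {llr_p_value:.2e}")
```

The printed values should agree with `Pseudo R-squ.` and `LLR p-value` in the summary, up to the rounding of the log-likelihoods copied from the table.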

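train accuracy (`pred_table` classifies with a 0.5 threshold by default; the small gap from the sklearn result is likely because `LogisticRegression` applies L2 regularization by default, whereas `sm.Logit` fits an unpenalized model)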

In [15]:
results_log.pred_table()

array([[169.,  15.],
       [ 32.,  64.]])

In [16]:
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0','Predicted 1']
cm_df = cm_df.rename(index={0:'Actual 0',1:'Actual 1'})
cm_df

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 169.0 | 15.0 |
| Actual 1 | 32.0 | 64.0 |


In [17]:
cm = np.array(cm_df)
accuracy_train = (cm[0,0]+cm[1,1])/cm.sum()
accuracy_train

0.8321428571428572

test accuracy

In [18]:
X_test_sm = sm.add_constant(X_test)

In [19]:
# Predict probabilities for test data
y_pred_test_prob = results_log.predict(X_test_sm)

# Convert probabilities into binary predictions
y_pred_test = (y_pred_test_prob >= 0.5).astype(int)

# Calculate accuracy score
accuracy = (y_pred_test == y_test).mean()
print("Test Accuracy Score:", accuracy)


Test Accuracy Score: 0.85
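
`results_log.predict` returns probabilities obtained by passing the linear predictor through the logistic (sigmoid) function; the 0.5 cut-off then turns them into class labels. Below is a minimal sketch using the coefficients from the summary table, assuming `x1` and `x2` are the two standardized features in column order. The input points are made up for illustration and are not rows from the dataset.

```python
import math

# Coefficients copied from the summary table above
CONST, B1, B2 = -1.1435, 2.0126, 1.1230

def predict_proba(x1, x2):
    """Probability of Purchased = 1 for standardized inputs x1, x2."""
    z = CONST + B1 * x1 + B2 * x2          # linear predictor (log-odds)
    return 1.0 / (1.0 + math.exp(-z))      # logistic function

def predict_label(x1, x2, threshold=0.5):
    """Apply the same 0.5 cut-off used in the cell above."""
    return int(predict_proba(x1, x2) >= threshold)

# Hypothetical standardized inputs, not rows from the dataset
for point in [(0.0, 0.0), (1.0, 0.0), (-1.0, 1.0)]:
    print(point, round(predict_proba(*point), 3), predict_label(*point))
```

A predicted probability of 0.5 corresponds to a linear predictor of 0, so the 0.5 threshold is equivalent to checking the sign of the log-odds.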
