# Logistic Regression

**Definition**  
Logistic Regression is a statistical and machine learning method used for classification tasks, particularly binary classification. It predicts the probability of a dependent variable belonging to a specific class based on one or more independent variables.

---

### Objective  
Logistic regression predicts a categorical outcome (e.g., **yes/no**, **0/1**) rather than a continuous value, as linear regression does. It estimates the probability \( P(y=1|X) \), where \( y \) is the target variable and \( X \) is the vector of input features.

---

### Logistic (Sigmoid) Function  
The logistic regression model passes a linear combination of the input features (X) through a sigmoid function, so the output is a probability between 0 and 1:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Here, z is a linear combination of inputs and their weights:

$$
z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n
$$

- \( \beta_0 \): Intercept term.
- \( \beta_1, \beta_2, \dots, \beta_n \): Coefficients (weights) for the input features.
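The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `sigmoid` and `predict_proba`, the coefficients, and the input values are hypothetical and are not taken from the notebook's fitted models.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps a real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, betas):
    """P(y=1 | x) for one observation, given an intercept and weights."""
    # z = beta_0 + beta_1 * x_1 + ... + beta_n * x_n
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients and inputs, chosen only for illustration.
p = predict_proba(x=[2.0, 1.0], beta0=-1.0, betas=[0.5, 0.25])
print(p)
```

Here z = -1.0 + 0.5·2.0 + 0.25·1.0 = 0.25, so the predicted probability is a little above 0.5.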


In [3]:
# Import libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Raw string so the backslashes in the Windows path are not treated as escapes
df=pd.read_csv(r"D:\Downloads\ex2data1.txt",header=None)

In [5]:
df=df.rename(columns={0:'X1',1:'X2',2:'y'})

In [6]:
df.head()

|   | X1 | X2 | y |
|---|---|---|---|
| 0 | 34.62366 | 78.024693 | 0 |
| 1 | 30.286711 | 43.894998 | 0 |
| 2 | 35.847409 | 72.902198 | 0 |
| 3 | 60.182599 | 86.308552 | 1 |
| 4 | 79.032736 | 75.344376 | 1 |


In [7]:
df.shape

(100, 3)

### (i) Randomly split the data into an 80:20 training-testing hold-out (random state: 30).

In [8]:
# Features are every column except the target; `columns=` already implies axis=1
X=df.drop(columns='y')
y=df['y']

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=30)
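To make the hold-out idea concrete, here is a minimal standard-library sketch of a seeded 80:20 split over row indices. The function `holdout_split` is a hypothetical helper written for this illustration; it does not reproduce the exact rows that `train_test_split` selects, because scikit-learn uses a different random number generator.

```python
import random

def holdout_split(n_rows: int, test_size: float, seed: int):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same shuffle
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

# 100 rows, as reported by df.shape, with 20% held out for testing.
train_idx, test_idx = holdout_split(n_rows=100, test_size=0.2, seed=30)
print(len(train_idx), len(test_idx))  # 80 20
```

Fixing the seed is what makes the split repeatable, which is the role `random_state=30` plays in the cell above.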

### (ii) Train a logistic regression model with logit and probit link. Which model is better? Comment.
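The two links named in the question differ only in the function that turns the linear predictor z into a probability: the logit link uses the logistic CDF, and the probit link uses the standard normal CDF. The sketch below compares them at a few values of z using only the standard library; the function names `logit_inverse` and `probit_inverse` are hypothetical and chosen for this illustration.

```python
import math

def logit_inverse(z: float) -> float:
    """Logit link: P(y=1) is the logistic CDF evaluated at z."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_inverse(z: float) -> float:
    """Probit link: P(y=1) is the standard normal CDF evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both curves pass through 0.5 at z = 0, but the normal CDF has thinner tails.
for z in (-2.0, 0.0, 2.0):
    print(z, round(logit_inverse(z), 4), round(probit_inverse(z), 4))
```

Because the normal CDF rises more steeply in z, probit coefficients are usually smaller in magnitude than logit coefficients fitted to the same data, so the two sets of coefficients should not be compared directly.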

In [13]:
from statsmodels.discrete.discrete_model import Logit,Probit,Poisson
import statsmodels.api as sm

In [14]:
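# statsmodels does not add an intercept automatically, so add a constant column
Logit_LR=sm.add_constant(X_train)

# Fit the logit model by maximum likelihood
logit_model=sm.Logit(y_train,Logit_LR).fit()

print(logit_model.summary())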

Optimization terminated successfully.
         Current function value: 0.202483
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                   80
Model:                          Logit   Df Residuals:                       77
Method:                           MLE   Df Model:                            2
Date:                Sat, 30 Nov 2024   Pseudo R-squ.:                  0.6991
Time:                        01:36:47   Log-Likelihood:                -16.199
converged:                       True   LL-Null:                       -53.841
Covariance Type:            nonrobust   LLR p-value:                 4.489e-17
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -25.3823      6.674     -3.803      0.000     -38.463     -12.302
X1             0.2198      0.

In [60]:
probit_LR=sm.add_constant(X_train)

probit_model=sm.Probit(y_train,probit_LR).fit()

print(probit_model.summary())

Optimization terminated successfully.
         Current function value: 0.199628
         Iterations 9
                          Probit Regression Results                           
Dep. Variable:                      y   No. Observations:                   80
Model:                         Probit   Df Residuals:                       77
Method:                           MLE   Df Model:                            2
Date:                Wed, 27 Nov 2024   Pseudo R-squ.:                  0.7034
Time:                        22:48:53   Log-Likelihood:                -15.970
converged:                       True   LL-Null:                       -53.841
Covariance Type:            nonrobust   LLR p-value:                 3.573e-17
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -14.6933      3.636     -4.042      0.000     -21.819      -7.568
X1             0.1260      0.

In [61]:
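# Same design matrix as before: features plus an intercept column
log_log_LR=sm.add_constant(X_train)

# Note: sm.Poisson fits a Poisson count model with a log link;
# it is not a complementary log-log (cloglog) binary model
LL_model=sm.Poisson(y_train,log_log_LR).fit()

print(LL_model.summary())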

Optimization terminated successfully.
         Current function value: 0.780460
         Iterations 6
                          Poisson Regression Results                          
Dep. Variable:                      y   No. Observations:                   80
Model:                        Poisson   Df Residuals:                       77
Method:                           MLE   Df Model:                            2
Date:                Wed, 27 Nov 2024   Pseudo R-squ.:                  0.1390
Time:                        22:48:53   Log-Likelihood:                -62.437
converged:                       True   LL-Null:                       -72.520
Covariance Type:            nonrobust   LLR p-value:                 4.179e-05
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -4.1035      0.907     -4.523      0.000      -5.882      -2.325
X1             0.0278      0.

**Logit Model (0.6991):** The summary reports McFadden's pseudo R-squared, \( 1 - LL / LL_{\text{null}} \), which measures how much the fitted model improves the log-likelihood over an intercept-only model. It is not the share of variance explained, but a value of **0.6991** is high and indicates a good fit.

**Probit Model (0.7034):** The Probit model's pseudo R-squared of **0.7034** is slightly higher than the Logit model's, reflecting its marginally higher log-likelihood (-15.970 vs. -16.199).

**Poisson Model (0.1390):** The third model is fitted with `sm.Poisson`, which is a Poisson count regression with a log link rather than a complementary log-log (cloglog) binary model. Its pseudo R-squared of **0.1390** is far lower, so it fits this binary outcome much worse than the Logit and Probit models.

The Probit model is slightly better than the Logit model based on the pseudo R-squared values **(0.7034 vs. 0.6991)**. Both models have the same number of parameters, and the gap is small on 80 training observations, so the two fits are close to equivalent in practice.

Of the three models, Probit has the highest pseudo R-squared and is the best fit, narrowly ahead of Logit.
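As a check on the comparison, the `Pseudo R-squ.` field in each summary table can be recomputed from the `Log-Likelihood` and `LL-Null` fields printed alongside it, since statsmodels reports McFadden's pseudo R-squared. The helper name `mcfadden_r2` is hypothetical; the log-likelihood values are copied from the three summary tables.

```python
def mcfadden_r2(ll_model: float, ll_null: float) -> float:
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only)."""
    return 1.0 - ll_model / ll_null

# (Log-Likelihood, LL-Null) pairs copied from the three summary tables.
fits = {
    "Logit": (-16.199, -53.841),
    "Probit": (-15.970, -53.841),
    "Poisson": (-62.437, -72.520),
}

r2 = {name: mcfadden_r2(ll, ll_null) for name, (ll, ll_null) in fits.items()}
for name, value in r2.items():
    print(f"{name}: {value:.4f}")
```

The Logit and Probit fits share the same null log-likelihood, so their values are directly comparable. The Poisson fit uses a different likelihood with a different null value, so its figure is a rougher point of comparison.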

## Method II: scikit-learn `LogisticRegression`

In [62]:
from sklearn.linear_model import LogisticRegression

In [63]:
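# Default settings apply L2 regularization (C=1.0), so the fitted coefficients
# can differ from the unpenalized statsmodels Logit fit above
model_LR=LogisticRegression()
model_LR.fit(X_train,y_train)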

In [64]:
prediction=model_LR.predict(X_test)
print(prediction)

[0 1 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1]


In [65]:
from sklearn.metrics import accuracy_score,classification_report

In [66]:
model_acc=accuracy_score(y_test,prediction)
print(model_acc)

0.95


In [67]:
model_report=classification_report(y_test,prediction)
print(model_report)

              precision    recall  f1-score   support

           0       1.00      0.88      0.93         8
           1       0.92      1.00      0.96        12

    accuracy                           0.95        20
   macro avg       0.96      0.94      0.95        20
weighted avg       0.95      0.95      0.95        20



### Interpretation
- The model performs very well on the 20 test rows, achieving 95% accuracy (19 of 20 correct) and high precision, recall, and F1-scores for both classes.  
- **Class 1 (positive class)** is slightly favored: it achieves perfect recall (1.00) and the higher F1-score (0.96), so no actual positive was missed.  
- **Class 0 (negative class)** has a slightly lower recall (0.88): with a support of 8, this means 1 of the 8 actual negatives (12.5%) was misclassified as positive.

**Weighted Precision (0.95)**
- This is the support-weighted average of the per-class precisions (1.00 for class 0, 0.92 for class 1): on average, about 95% of the predictions made for a class are correct.  
- The score is pulled toward the precision of the larger class (class 1, with support = 12).

**Weighted Recall (0.95)**
- This is the support-weighted average of the per-class recalls (0.88 and 1.00): about 95% of all actual instances are identified correctly, which is the same quantity as overall accuracy.  
- Greater weight is assigned to class 1 due to its larger support.

**Weighted F1-Score (0.95)**
- The model balances precision and recall well across both classes.  
- The more frequent class (class 1) has a slightly greater influence on this metric.
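The figures in the classification report can be reproduced by hand from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an assumption inferred from the report: 8 actual negatives with recall 0.88 implies 7 correct and 1 missed, and 12 actual positives with recall 1.00 implies 12 correct. The helper `prf` is hypothetical.

```python
# Confusion counts inferred from the report (assumption, not notebook output).
tn, fp, fn, tp = 7, 1, 0, 12

def prf(correct: int, predicted: int, actual: int):
    """Precision, recall and F1 for one class."""
    precision = correct / predicted
    recall = correct / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

class0 = prf(correct=tn, predicted=tn + fn, actual=tn + fp)
class1 = prf(correct=tp, predicted=tp + fp, actual=tp + fn)
support = (tn + fp, tp + fn)  # (8, 12)
total = sum(support)

accuracy = (tn + tp) / total
# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro = [(a + b) / 2 for a, b in zip(class0, class1)]
weighted = [(support[0] * a + support[1] * b) / total for a, b in zip(class0, class1)]

print("accuracy:", accuracy)
print("class 0 (P, R, F1):", class0)
print("class 1 (P, R, F1):", class1)
print("macro avg:", macro)
print("weighted avg:", weighted)
```

Rounded to two decimals, these values agree with the report, and the weighted recall equals the accuracy because both reduce to the number of correct predictions divided by the number of test rows.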


In [69]:
df.head()

|   | X1 | X2 | y |
|---|---|---|---|
| 0 | 34.62366 | 78.024693 | 0 |
| 1 | 30.286711 | 43.894998 | 0 |
| 2 | 35.847409 | 72.902198 | 0 |
| 3 | 60.182599 | 86.308552 | 1 |
| 4 | 79.032736 | 75.344376 | 1 |


In [73]:
# New observation with X1 = 90 and X2 = 90, using the training column names
new_x = pd.DataFrame([[90.0, 90.0]], columns=['X1', 'X2'])

# Predict the class label for the new observation
predictions = model_LR.predict(new_x)
print("prediction of y:-",predictions)


prediction of y:- [1]


In [77]:

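# Row layout must match the training design matrix: [const, X1, X2]
new_x = np.array([1,90,90], dtype=float)

new_x_reshaped = new_x.reshape(1, -1)

# The intercept column is already included, so add_constant is not needed
predictions = probit_model.predict(new_x_reshaped)

print(predictions)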



[1.]


20    0.086560
91    1.000000
34    0.057409
52    0.947329
8     0.999999
74    0.902696
21    0.999974
88    1.000000
80    0.999995
89    0.109887
82    0.906065
38    0.284015
0     0.065701
77    0.593129
42    0.999998
67    0.004089
68    1.000000
92    0.000116
48    0.999976
10    0.955892
dtype: float64
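The notebook does not show which cell produced the probabilities listed above. Assuming they are predicted values of P(y=1) for the 20 test rows, a minimal sketch of turning them into class labels with a 0.5 cut-off looks like this. The helper `to_labels` and the threshold are assumptions made for illustration; the probabilities are copied from the output in the same order.

```python
# Predicted probabilities copied from the output above, in the same order.
probabilities = [
    0.086560, 1.000000, 0.057409, 0.947329, 0.999999,
    0.902696, 0.999974, 1.000000, 0.999995, 0.109887,
    0.906065, 0.284015, 0.065701, 0.593129, 0.999998,
    0.004089, 1.000000, 0.000116, 0.999976, 0.955892,
]

def to_labels(probs, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

labels = to_labels(probabilities)
print(labels)
```

With this threshold the labels coincide with the vector printed earlier by `model_LR.predict(X_test)`. The row with probability 0.593129 is the one closest to the cut-off, so it is the prediction most sensitive to the choice of threshold.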
