
# Logistic Regression: Explanatory vs Predictive Models
Logistic regression is a statistical method for modeling an outcome that takes only two possible values (a dichotomous, or binary, variable) as a function of one or more independent variables. It is widely used to estimate the probability of a binary response given the values of the predictor variables.

## Explanatory Logistic Regression Model
**Objective:** To identify factors that influence a binary outcome.

**Example:** A telecom company wants to understand what factors affect a customer's decision to churn (leave the service).

**Scenario:** The telecom company collects data on various customer attributes such as monthly charges, contract type, tenure, payment method, and whether the customer has opted for tech support. They use logistic regression to analyze which of these factors significantly impact the likelihood of a customer churning.

**Interpretation:**

_Coefficients:_

The logistic regression model provides a coefficient for each predictor variable. Each coefficient represents the change in the log odds of the outcome (churn) for a one-unit increase in that predictor, holding all other variables constant.

For example, if the coefficient for "Monthly Charges" is positive, it indicates that higher monthly charges are associated with an increased likelihood of churn.

_Odds Ratios:_

The exponentiated coefficients (odds ratios) give the multiplicative change in the odds of the outcome for a one-unit increase in the predictor variable: a value above 1 raises the odds, a value below 1 lowers them.
For instance, an odds ratio greater than 1 for "Contract Type: Month-to-Month" might suggest that customers with month-to-month contracts are more likely to churn compared to those with longer-term contracts.
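
As a rough numerical sketch of the relationship between coefficients and odds ratios, the snippet below exponentiates a few log-odds coefficients. The coefficient values and the predictor names are hypothetical, chosen only for illustration; they are not estimated from any data.

```python
import math

# Hypothetical coefficients on the log-odds scale (illustration only)
coefficients = {
    "MonthlyCharges": 0.03,        # per one-unit increase in monthly charges
    "Contract_MonthToMonth": 1.2,  # month-to-month vs. longer-term contract
    "TechSupport": -0.7,           # has tech support vs. does not
}

# Exponentiating a coefficient gives the corresponding odds ratio
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of churn)")
```

A positive coefficient always maps to an odds ratio above 1 and a negative coefficient to one below 1, which is why the two views agree on direction.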

_P-Values:_

The significance of each predictor is tested using p-values. Predictors with p-values less than the significance level (e.g., 0.05) are considered statistically significant.

For example, if "Tech Support" has a p-value less than 0.05, it suggests that opting for tech support has a statistically significant association with the likelihood of churn.

## Predictive Logistic Regression Model

**Objective:** To predict a binary outcome.

**Example:** The telecom company wants to predict whether a particular customer will churn.

**Scenario:** Using historical data, the telecom company builds a logistic regression model to predict the probability of churn for each customer. The model is trained on a dataset containing the same predictor variables used in the explanatory model.

**Interpretation:**

_Predicted Probabilities:_

For each customer, the logistic regression model provides a predicted probability of churn. This probability ranges from 0 to 1.

For example, if a customer has a predicted probability of 0.8, it indicates an 80% chance that the customer will churn.
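
To make the link between the fitted model and a predicted probability concrete, here is a minimal sketch of the logistic (sigmoid) transform. The intercept, the coefficients, and the helper name `predicted_probability` are hypothetical and picked only so that the result lands near the 0.8 used in the example above.

```python
import math

def predicted_probability(intercept, coefficients, features):
    """Apply the logistic (sigmoid) function to the linear predictor."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical intercept and coefficients for [MonthlyCharges, Tenure]
intercept = -2.0
coefficients = [0.05, -0.1]

customer = [80, 6]  # monthly charges of 80, tenure of 6
prob = predicted_probability(intercept, coefficients, customer)
print(f"Predicted churn probability: {prob:.2f}")
```

Whatever the value of the linear predictor, the sigmoid keeps the output strictly between 0 and 1.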

_Decision Threshold:_

The company sets a threshold (e.g., 0.5) to classify customers as likely to churn or not. Customers with predicted probabilities above the threshold are classified as likely to churn.

For instance, if the threshold is 0.5, and a customer has a predicted probability of 0.8, the model classifies this customer as likely to churn.
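
The thresholding step can be sketched in a few lines. The probabilities below are made up for illustration, and `classify` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical predicted churn probabilities for five customers
probabilities = np.array([0.80, 0.35, 0.50, 0.62, 0.10])

def classify(probabilities, threshold=0.5):
    """Return 1 (likely to churn) where the probability exceeds the threshold."""
    return (probabilities > threshold).astype(int)

print(classify(probabilities))       # default threshold of 0.5
print(classify(probabilities, 0.3))  # a lower threshold flags more customers
```

Lowering the threshold trades precision for recall: more true churners are caught, at the cost of more false alarms.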

_Model Performance:_

The performance of the predictive model is evaluated using metrics such as accuracy, precision, recall, and the area under the ROC curve (AUC).

For example, an AUC close to 1 indicates that the model is good at distinguishing between customers who churn and those who do not, while an AUC of 0.5 is no better than random guessing.
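
As an illustration of how these metrics are computed with scikit-learn, the sketch below scores a small set of hypothetical labels and predicted probabilities. The numbers are invented for the example and do not come from the churn data.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical true labels (1 = churn) and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.4, 0.65, 0.3, 0.2, 0.6, 0.8, 0.1]

# Hard class labels at a 0.5 threshold
y_pred = [1 if p > 0.5 else 0 for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # AUC uses the probabilities, not the labels

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"ROC AUC:   {auc:.3f}")
```

Accuracy, precision, and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.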


import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Sample data
data = pd.DataFrame({
    'MonthlyCharges': [70, 80, 50, 90, 60],
    'Tenure': [10, 20, 5, 15, 8],
    'Contract': [1, 0, 1, 0, 1],  # 1: Month-to-Month, 0: Long-term
    'TechSupport': [0, 1, 0, 1, 0],  # 1: Yes, 0: No
    'Churn': [1, 0, 1, 0, 1]  # 1: Churn, 0: No Churn
})

# Split data into predictors (X) and outcome (y)
X = data[['MonthlyCharges', 'Tenure', 'Contract', 'TechSupport']]
y = data['Churn']

# Add a constant to the model
X = sm.add_constant(X)

# Fit the logistic regression model (explanatory)
model = sm.Logit(y, X).fit()

# Summary of the model
print(model.summary())

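# Stratified K-Fold keeps both classes in every train and test split.
# Note: the loop only assigns the split, so the model below is trained and
# evaluated on the last fold alone, not averaged across folds.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

# Fit the logistic regression model (predictive).
# X still contains the 'const' column added for statsmodels; scikit-learn
# fits its own intercept, so that column is redundant here.
clf = LogisticRegression()
clf.fit(X_train, y_train)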

# Predictions
y_pred = clf.predict(X_test)
y_pred_prob = clf.predict_proba(X_test)[:, 1]

# Performance metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)

print(f'Accuracy: {accuracy}')
print(f'ROC AUC: {roc_auc}')


         Current function value: 0.000000
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:                  Churn   No. Observations:                    5
Model:                          Logit   Df Residuals:                        1
Method:                           MLE   Df Model:                            3
Date:                Wed, 29 May 2024   Pseudo R-squ.:                   1.000
Time:                        01:25:10   Log-Likelihood:            -1.4260e-06
converged:                      False   LL-Null:                       -3.3651
Covariance Type:            nonrobust   LLR p-value:                   0.08102
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
const           -164.7245        nan        nan        nan         nan         nan
MonthlyCharges     6.4538   2.16e+09   2.99e-09      



## Interpretation

**Explanatory Model:**

*Coefficients:*

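The coefficients indicate the direction and magnitude of the relationship between each predictor and the log odds of churn. In this run, however, the fit did not converge (`converged: False`), and the standard errors are either `nan` or extremely large (2.16e+09 for MonthlyCharges). This is the typical signature of perfect separation: in the five sample rows, `Contract` alone predicts `Churn` exactly. The estimates should therefore not be interpreted.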

*P-Values:*

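The output above is cut off before the p-values, but with standard errors this large the z statistics are essentially zero (2.99e-09 for MonthlyCharges), so no predictor can be shown to be significantly associated with churn at the 0.05 level. With only five observations, this says more about the sample size than about the predictors.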

_Pseudo R-squared:_

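The value indicates how much better the fitted model explains the outcome than an intercept-only model. In this case the summary reports 1.000. That is not evidence of an excellent model: it is a consequence of the model separating the five observations perfectly, which drives the log-likelihood to essentially zero.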
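
The reported value can be checked by hand. `statsmodels` reports McFadden's pseudo R-squared, and the sketch below recomputes it from the `Log-Likelihood` and `LL-Null` figures in the regression summary; it also reproduces `LL-Null` from the class counts in the sample data (3 churners out of 5).

```python
import math

# Values copied from the regression summary
log_likelihood = -1.4260e-06  # Log-Likelihood of the fitted model
ll_null = -3.3651             # LL-Null (intercept-only model)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1 - log_likelihood / ll_null
print(f"Pseudo R-squared: {pseudo_r2:.3f}")

# LL-Null depends only on the class counts: 3 churners out of 5 customers
n_churn, n_total = 3, 5
p_churn = n_churn / n_total
ll_null_check = (n_churn * math.log(p_churn)
                 + (n_total - n_churn) * math.log(1 - p_churn))
print(f"LL-Null from class counts: {ll_null_check:.4f}")
```

Because the fitted log-likelihood is almost exactly zero, the ratio vanishes and the pseudo R-squared rounds to 1.000.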

**Predictive Model:**

_Accuracy:_

The accuracy of the model on the test set is 1.0, indicating perfect classification. However, with five customers split into two folds, the test set holds only two or three observations, so this result says little about the model's performance on a larger, more representative dataset.

_ROC AUC:_

The ROC AUC score of 1.0 also indicates perfect discrimination between churn and non-churn cases in this small test set.


It's important to note that with a larger and more representative dataset, the performance metrics might vary, and the significance of predictors might change.
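
As a sketch of what evaluation on a larger dataset might look like, the snippet below runs 5-fold stratified cross-validation and reports a score for every fold. The data comes from scikit-learn's `make_classification` and is only a synthetic stand-in; the sample size, feature counts, and label-noise level are assumptions made for illustration, not properties of real churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a larger churn dataset (not real customer data)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.1, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)

# One score per fold, each computed on data the model did not see in training
acc = cross_val_score(clf, X, y, cv=skf, scoring="accuracy")
auc = cross_val_score(clf, X, y, cv=skf, scoring="roc_auc")

print(f"Accuracy per fold: {acc.round(3)}, mean = {acc.mean():.3f}")
print(f"ROC AUC per fold:  {auc.round(3)}, mean = {auc.mean():.3f}")
```

Reporting the spread across folds, and not a single split, gives a more honest picture of how stable the metrics are.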