
# 9: Regression Modelling

Regression modeling is a statistical approach for quantifying the relationship between one variable (the "dependent" or "outcome" variable) and one or more other variables (the "independent" or "predictor" variables). The goal is a mathematical equation that predicts or explains the dependent variable from the values of the independent variables.

Here's a breakdown:

Dependent Variable: This is what you're trying to predict or understand. For example, it could be sales, temperature, exam scores, etc.

Independent Variables: These are the factors that you believe may influence or explain changes in the dependent variable. They could be things like time, price, age, etc.

Regression Equation: The model finds the equation that best fits the data. For linear regression it has the form $y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$, which shows how changes in the independent variables relate to changes in the dependent variable.

Parameter Estimation: During modeling, the algorithm estimates parameters (coefficients) for each independent variable. These coefficients tell us the strength and direction of the relationship.

Prediction: Once the model is trained, it can be used to make predictions. For example, if we know the values of the independent variables, we can use the model to predict the value of the dependent variable.
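
The breakdown above can be sketched on a small synthetic dataset. This is a minimal illustration, not the bank example used later: the data, the "true" coefficients (intercept 3, slope 2), and all variable names are assumptions made up for the demonstration. It estimates the parameters of a one-predictor linear model by least squares with NumPy and then uses them for a prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one independent variable x and a dependent variable y
# generated as y = 3 + 2*x + noise (values chosen only for illustration)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so the model includes an intercept
X = np.column_stack([np.ones_like(x), x])

# Parameter estimation: least-squares solution for [intercept, slope]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# Prediction: plug a new value of x into the fitted equation
x_new = 4.0
y_hat = intercept + slope * x_new

print("Estimated intercept:", intercept)
print("Estimated slope:", slope)
print("Prediction at x = 4:", y_hat)
```

The estimates should land close to the values used to generate the data, and the sign and size of the slope describe the direction and strength of the relationship.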

Regression modeling is therefore a tool for exploring and quantifying relationships between variables. It is widely used in fields such as economics, finance, and biology, and it is a fundamental part of data analysis and predictive modeling.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Load the dataset (replace 'bank.csv' with the actual path to your dataset)
bank_data = pd.read_csv('/content/bank.csv')

# Replace these columns with your actual predictor and response variable names
predictor_columns = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]

response_column = "deposit"

# Only numeric predictors are used here. To include categorical columns,
# one-hot encode them first and add the new columns to predictor_columns, e.g.
# bank_data = pd.get_dummies(bank_data, columns=["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"])

# Convert 'yes' and 'no' to 1 and 0
bank_data[response_column] = bank_data[response_column].map({'yes': 1, 'no': 0})

# Split the data into training and test sets
bank_train, bank_test = train_test_split(bank_data, test_size=0.2, random_state=42)

# Create predictor and response variables for the training set
X_train = bank_train[predictor_columns]
y_train = bank_train[response_column]

# Create predictor and response variables for the test set
X_test = bank_test[predictor_columns]
y_test = bank_test[response_column]

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Example customer to score. These are placeholder values; replace them with
# the values for the customer you want to predict.
age_value = 1
balance_value = 0
day_value = 333
duration_value = 1000
campaign_value = 2
pdays_value = -1
previous_value = 0

# Build a one-row DataFrame whose columns match predictor_columns, so the
# features line up with the names the model was trained on
cust01 = pd.DataFrame(
    [[age_value, balance_value, day_value, duration_value,
      campaign_value, pdays_value, previous_value]],
    columns=predictor_columns,
)

# 'deposit' is coded 0/1, so the fitted value is a score for subscribing,
# not a sales figure; it is not guaranteed to lie in [0, 1]
predicted_deposit = model.predict(cust01)
print("Predicted deposit score for the first customer:", predicted_deposit[0])

# Predict the deposit score for all customers in the test set
ypred = model.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = metrics.mean_absolute_error(y_test, ypred)
print("Mean Absolute Error (MAE):", mae)

# Calculate R-squared on the test set
rsquared = model.score(X_test, y_test)
print("R-squared (R^2):", rsquared)


Predicted deposit score for the first customer: 0.4413253722522675
Mean Absolute Error (MAE): 0.3917367069586419
R-squared (R^2): 0.24242210357043603
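
To make the two metrics reported above concrete, here is a small sketch that computes MAE and R-squared by hand. The four labels and predictions are hypothetical values chosen for easy arithmetic; they are not taken from `bank.csv`. MAE is the mean of the absolute errors, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical true 0/1 labels and model outputs (illustration only)
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.8, 0.3, 0.6, 0.1])

# Mean Absolute Error: average size of the prediction errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE:", mae_manual)
print("R-squared:", r2_manual)
```

For these values the absolute errors are 0.2, 0.3, 0.4, and 0.1, so the MAE is 0.25; the residual sum of squares is 0.30 against a total of 1.0, so R-squared is about 0.7.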




# 9.2 Demonstrate Stepwise Regression Using R/Python

In [None]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
from tqdm import tqdm

# Load the dataset (replace 'bank.csv' with the actual path to your dataset)
bank_data = pd.read_csv('/content/bank.csv')

# Replace these columns with your actual predictor and response variable names
predictor_columns = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
response_column = "deposit"

# Only numeric predictors are used here. To include categorical columns,
# one-hot encode them first and add the new columns to predictor_columns, e.g.
# bank_data = pd.get_dummies(bank_data, columns=["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"])

# Convert 'yes' and 'no' to 1 and 0
bank_data[response_column] = bank_data[response_column].map({'yes': 1, 'no': 0})

# Create predictor and response variables
X = bank_data[predictor_columns]
y = bank_data[response_column]

# Forward stepwise selection: at each step, add the predictor that lowers the
# AIC the most, and stop as soon as no remaining predictor improves it
def forward_select(X, y):
    selected_vars = []
    remaining_vars = list(X.columns)
    best_model = None
    best_aic = np.inf

    for _ in tqdm(range(len(X.columns))):
        best_var = None  # reset each step so a stale choice is never reused
        for var in remaining_vars:
            predictors = selected_vars + [var]
            X_subset = sm.add_constant(X[predictors])
            model = sm.OLS(y, X_subset).fit()

            # Keep the candidate only if it beats the best AIC seen so far
            if model.aic < best_aic:
                best_aic = model.aic
                best_model = model
                best_var = var

        # No candidate lowered the AIC in this step: selection is finished
        if best_var is None:
            break

        selected_vars.append(best_var)
        remaining_vars.remove(best_var)

    return best_model

# Run forward stepwise selection
final_model = forward_select(X, y)

# Display the summary of the final model
print(final_model.summary())

100%|██████████| 7/7 [00:00<00:00, 11.45it/s]

                            OLS Regression Results                            
Dep. Variable:                deposit   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.251
Method:                 Least Squares   F-statistic:                     535.5
Date:                Fri, 08 Dec 2023   Prob (F-statistic):               0.00
Time:                        08:48:32   Log-Likelihood:                -6469.1
No. Observations:               11162   AIC:                         1.295e+04
Df Residuals:                   11154   BIC:                         1.301e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1950      0.018     11.012      0.0
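
The AIC that drives the selection can be checked against the summary header. The sketch below assumes the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, with k counting the seven predictors plus the intercept; the input numbers are copied from the printed summary, so the result is only as precise as that rounded log-likelihood.

```python
import math

# Values read from the OLS summary printed above
log_likelihood = -6469.1   # "Log-Likelihood"
n_obs = 11162              # "No. Observations"
k_params = 7 + 1           # "Df Model" plus the intercept (const)

# Information criteria: lower is better, and each extra parameter is penalized
aic = 2 * k_params - 2 * log_likelihood
bic = k_params * math.log(n_obs) - 2 * log_likelihood

print(f"AIC: {aic:.1f}")
print(f"BIC: {bic:.1f}")
```

Both results agree with the rounded values in the summary (1.295e+04 and 1.301e+04). Because BIC's per-parameter penalty ln(n) is larger than AIC's penalty of 2 at this sample size, selecting on BIC would tend to favor smaller models.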


