# Multiple Linear Regression

## Task:
Predict the Total Amount spent by a customer based on features like Quantity, Price per Unit, Age, Gender, and Product Category.

### Q1: How do we set up a multiple linear regression model to predict Total Amount?
Instructions:

Use Quantity, Price per Unit, Age, Gender, and Product Category to predict the Total Amount.

Perform regression using Python and interpret the coefficients.

In [68]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
# Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/Data-Navigators/Statistical_Concept_Excercise/main/data/Retail_sales_dataset.csv")


In [71]:
df['Date'] = pd.to_datetime(df['Date'])

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Rename 'Product Category' to 'Product_Category'
df = df.rename(columns={'Product Category': 'Product_Category'})

# One-hot encode Gender and Product_Category. drop_first=False keeps every level,
# so each set of dummy columns sums to 1 and is collinear with the model constant.
df = pd.get_dummies(df, columns=['Gender', 'Product_Category'], drop_first=False, dtype=int)


Missing values:
Transaction ID      0
Date                0
Customer ID         0
Gender              0
Age                 0
Product Category    0
Quantity            0
Price per Unit      0
Total Amount        0
dtype: int64


In [72]:
# Define all possible categories for Gender and Product_Category
possible_genders = ['Male', 'Female']
possible_categories = ['Clothing', 'Electronics', 'Beauty']



# Add missing columns if they don't exist (set them to 0)
for gender in possible_genders:
    col_name = f"Gender_{gender}"
    if col_name not in df.columns:
        df[col_name] = 0

for category in possible_categories:
    col_name = f"Product_Category_{category}"
    if col_name not in df.columns:
        df[col_name] = 0

# The dataset now has a dummy column for every expected category
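
If the goal of the loops above is only to guarantee a fixed set of dummy columns, `DataFrame.reindex` with `fill_value=0` is a more compact alternative. The sketch below uses a hypothetical toy frame; the names `toy`, `encoded`, `expected_cols`, and `aligned` are illustrative and not part of the notebook. Note that `reindex` also drops any column that is not listed, so the list has to name every column you want to keep.

```python
import pandas as pd

# Hypothetical toy frame: no 'Beauty' rows, so get_dummies alone omits that column
toy = pd.DataFrame({
    "Quantity": [1, 2, 3],
    "Product_Category": ["Clothing", "Electronics", "Clothing"],
})
encoded = pd.get_dummies(toy, columns=["Product_Category"], dtype=int)

expected_cols = ["Quantity"] + [
    f"Product_Category_{c}" for c in ["Clothing", "Electronics", "Beauty"]
]

# reindex adds each missing expected column filled with 0 and fixes the column order
aligned = encoded.reindex(columns=expected_cols, fill_value=0)
print(aligned)
```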

In [73]:
X = df[['Quantity', 'Price per Unit', 'Age', 'Gender_Male', 'Gender_Female', 
        'Product_Category_Clothing', 'Product_Category_Electronics', 
        'Product_Category_Beauty']]

y = df['Total Amount']

# Add constant to the model
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X).fit()

# Print model summary
print(model.summary())

# Print coefficients with more readable format
print("\nCoefficients:")
for name, coef in zip(X.columns, model.params):
    print(f"{name}: {coef:.4f}")

                            OLS Regression Results                            
Dep. Variable:           Total Amount   R-squared:                       0.855
Model:                            OLS   Adj. R-squared:                  0.854
Method:                 Least Squares   F-statistic:                     976.4
Date:                Sun, 20 Oct 2024   Prob (F-statistic):               0.00
Time:                        01:41:51   Log-Likelihood:                -6780.6
No. Observations:                1000   AIC:                         1.358e+04
Df Residuals:                     993   BIC:                         1.361e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
const           

### Q2: How do we interpret the regression coefficients?

### Q3: What does Adjusted R-Squared tell us?
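
Adjusted R-squared rescales R-squared by the degrees of freedom, so adding a predictor raises it only if the predictor improves the fit by more than chance alone would. As a rough check, the value in the OLS summary can be reproduced from the other numbers printed there. The inputs below are the rounded figures from that summary, so the result is approximate.

```python
# Values read from the OLS summary printed earlier in this notebook
r_squared = 0.855   # R-squared (rounded in the summary)
n_obs = 1000        # No. Observations
df_model = 6        # Df Model (estimated slopes, excluding the constant)

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - df_model - 1)
print(f"Adjusted R-squared: {adj_r_squared:.3f}")  # prints 0.854
```

The gap between 0.855 and 0.854 is small because 1000 observations are large relative to the 6 model degrees of freedom.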

### Q4: How can we calculate RMSE and MAE to evaluate the model’s performance?

Instructions:

Use the model to make predictions.

Calculate Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) to measure prediction error.

In [74]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Predict Total Amount
y_pred = model.predict(X)

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(f"RMSE: {rmse:.2f}")

# Calculate MAE
mae = mean_absolute_error(y, y_pred)
print(f"MAE: {mae:.2f}")


RMSE: 213.08
MAE: 175.51
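
Both metrics above are computed on the same rows used to fit the model, so they measure in-sample error. To make the definitions concrete, the sketch below computes RMSE and MAE by hand with NumPy on hypothetical toy values (`y_true` and `y_hat` are made up for illustration). RMSE squares the residuals before averaging, so it weights large errors more heavily and is never smaller than MAE.

```python
import numpy as np

# Hypothetical toy values, chosen so the arithmetic is easy to follow
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

residuals = y_true - y_hat                       # [-10, 10, -30, 0]
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # sqrt of the mean squared residual
mae_manual = np.mean(np.abs(residuals))          # mean absolute residual

print(f"RMSE: {rmse_manual:.2f}")
print(f"MAE: {mae_manual:.2f}")
```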


### Q5: How do we check for multicollinearity in this dataset?

In [75]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

                        Feature       VIF
0                         const  0.000000
1                      Quantity  1.002227
2                Price per Unit  1.002204
3                           Age  1.004486
4                   Gender_Male       inf
5                 Gender_Female       inf
6     Product_Category_Clothing       inf
7  Product_Category_Electronics       inf
8       Product_Category_Beauty       inf


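Interpretation: the `inf` values indicate perfect multicollinearity. `Gender_Male + Gender_Female` equals 1 in every row, as does the sum of the three `Product_Category_*` columns, so each dummy is an exact linear combination of the other columns (including the constant). The auxiliary regression behind each of these VIFs has R² = 1, and `vif = 1. / (1. - r_squared_i)` divides by zero; the warning output that statsmodels printed here came from such divisions. The 0.000000 reported for `const` is a by-product of the same problem and is not a meaningful collinearity measure. The three numeric features have VIFs close to 1, so they are nearly uncorrelated with the other predictors.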
