In [1]:
# Run the code below and answer the following questions:

# Which model has the lowest Mean Squared Error (MSE), and what does this indicate about the model's performance?
# Interpretation: The MSE is a measure of the average squared difference between the predicted and actual values. The model with the lowest MSE is typically considered the best at predicting the output variable because it has the smallest prediction error. However, it's essential to consider both MSE and R-squared when evaluating models.

# How do the R-squared values compare across the models, and what can you infer from these values about the models' explanatory power?
# Interpretation: The R-squared value indicates the proportion of the variance in the dependent variable that is predictable from the independent variables. A higher R-squared value generally suggests that the model explains a greater proportion of the variance, implying better performance. However, caution is needed: a high R-squared on training data can result from overfitting, which is why it is computed here on the held-out test set, and an R-squared estimated from a very small test set is noisy. Ridge, Lasso, and ElasticNet add a penalty precisely to reduce overfitting relative to plain linear regression.

# Based on the coefficients and intercepts provided, how do the different models handle the features, and what might this indicate about the importance of each feature in predicting sales?
# Interpretation: The coefficients in a regression model indicate the expected change in the dependent variable (Sales) for a one-unit change in the independent variable, holding all other variables constant. Comparing coefficients across Ridge, Lasso, and ElasticNet can reveal which features are consistently influential. Lasso and ElasticNet can shrink some coefficients to exactly zero, effectively selecting a subset of features, which might indicate the most important features for prediction. Ridge shrinks coefficients toward zero but does not eliminate them, so every feature keeps some influence. Because the features here are on very different scales (dollars versus counts) and are not standardized, coefficient magnitudes are not directly comparable across features, and the penalty affects them unevenly.

import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

# Create the dataset
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Dealership_Size': [2500, 3000, 1800, 2200, 2700, 3200, 1900, 2300, 2600, 3100, 2000, 2400, 2800, 3300, 2100, 2500],
    'Marketing_Spend': [50000, 60000, 45000, 48000, 52000, 62000, 47000, 49000, 51000, 61000, 46000, 50000, 53000, 63000, 48000, 52000],
    'Customer_Interactions': [300, 400, 280, 310, 330, 420, 290, 320, 350, 430, 300, 340, 370, 440, 310, 360],
    'Sales': [120, 150, 100, 110, 130, 160, 105, 115, 125, 155, 110, 120, 140, 170, 115, 130]
}

# Convert to DataFrame
df = pd.DataFrame(data)
print(df)

# Data Cleaning Tasks

# 1. Check for missing values (this hand-built dataset has none, so no imputation is needed)
print("Missing values in each column:")
print(df.isnull().sum())

# 2. Remove duplicate rows if any
df = df.drop_duplicates()

# 3. Ensure consistent data formats
df['Marketing_Spend'] = df['Marketing_Spend'].astype(float)

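# Convert the categorical variable to dummy variables.
# drop_first=True drops the first category alphabetically ('East'), which becomes the baseline.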
df = pd.get_dummies(df, columns=['Region'], drop_first=True)

# Define the dependent variable (DV) and independent variables (IVs)
X = df[['Dealership_Size', 'Marketing_Spend', 'Customer_Interactions', 'Region_North', 'Region_West', 'Region_South']]
y = df['Sales']

# Split the data into training and testing sets (with 16 rows, test_size=0.2 holds out 4 rows)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'ElasticNet Regression': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {
        'Coefficients': model.coef_,
        'Intercept': model.intercept_,
        'Mean squared error': mse,
        'R-squared value': r2
    }
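
# Sanity check (illustrative sketch): recompute MSE and R-squared by hand.
# The four actual/predicted values below are hypothetical, chosen only to make
# the arithmetic easy to follow; they are not output from the models above.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

demo_actual = np.array([120.0, 150.0, 100.0, 110.0])
demo_predicted = np.array([118.0, 153.0, 101.0, 108.0])

# MSE: mean of the squared residuals -> (4 + 9 + 1 + 4) / 4 = 4.5
demo_residuals = demo_actual - demo_predicted
demo_mse = np.mean(demo_residuals ** 2)

# R-squared: 1 - SS_res / SS_tot, where SS_tot is the spread around the mean
demo_ss_res = np.sum(demo_residuals ** 2)                      # 18
demo_ss_tot = np.sum((demo_actual - demo_actual.mean()) ** 2)  # 1400
demo_r2 = 1 - demo_ss_res / demo_ss_tot                        # about 0.987

# Both hand-computed values should agree with scikit-learn's functions
assert np.isclose(demo_mse, mean_squared_error(demo_actual, demo_predicted))
assert np.isclose(demo_r2, r2_score(demo_actual, demo_predicted))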

# Output the results for each model
for name, result in results.items():
    print(f"\n{name}:")
    print(f"Coefficients: {result['Coefficients']}")
    print(f"Intercept: {result['Intercept']}")
    print(f"Mean squared error: {result['Mean squared error']}")
    print(f"R-squared value: {result['R-squared value']}")


   Region  Dealership_Size  Marketing_Spend  Customer_Interactions  Sales
0   North             2500            50000                    300    120
1   South             3000            60000                    400    150
2    East             1800            45000                    280    100
3    West             2200            48000                    310    110
4   North             2700            52000                    330    130
5   South             3200            62000                    420    160
6    East             1900            47000                    290    105
7    West             2300            49000                    320    115
8   North             2600            51000                    350    125
9   South             3100            61000                    430    155
10   East             2000            46000                    300    110
11   West             2400            50000                    340    120
12  North             2800            