In a retail experiment, we want to understand how advertising expenditure, store
location, and competition affect sales revenue. Using synthetic data, implement multiple linear
regression in Python to analyse these factors. Interpret the coefficients, perform an F-test to assess
overall model significance, and conduct t-tests to evaluate the significance of individual
coefficients.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

In [20]:
# Generate synthetic data
np.random.seed(42)
n_samples = 100

# Features
advertising_expenditure = np.random.uniform(1000, 5000, n_samples)  # in dollars
store_location = np.random.choice([0, 1], size=n_samples)  # 0: Suburban, 1: Urban
competition = np.random.uniform(0, 10, n_samples)  # scale from 0 (low) to 10 (high)

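# True relationship (synthetic): revenue effect per advertising dollar,
# urban-location premium, and effect per competition point
true_coefficients = [5, 1000, -200]
noise = np.random.normal(0, 500, n_samples)  # Gaussian noise, standard deviation 500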

In [19]:
sales_revenue = (
    true_coefficients[0] * advertising_expenditure +
    true_coefficients[1] * store_location +
    true_coefficients[2] * competition +
    noise
)

# Create DataFrame
data = pd.DataFrame({
    'AdvertisingExpenditure': advertising_expenditure,
    'StoreLocation': store_location,
    'Competition': competition,
    'SalesRevenue': sales_revenue
})
data.head()

,AdvertisingExpenditure,StoreLocation,Competition,SalesRevenue
0,4844.762255,0,3.764634,23414.720518
1,4621.402568,0,8.105533,21375.421378
2,1783.164539,1,9.872761,8248.353787
3,1277.445204,0,1.504169,6465.14609
4,1403.112006,1,5.941307,6562.048023


In [25]:
# Define independent and dependent variables
X = data.drop(columns=['SalesRevenue'])
y = data['SalesRevenue']

# Add a constant term for the intercept
X = sm.add_constant(X)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the multiple linear regression model
model = sm.OLS(y_train, X_train).fit()

# Model summary
summary = model.summary()
print(summary)

                            OLS Regression Results                            
Dep. Variable:           SalesRevenue   R-squared:                       0.994
Model:                            OLS   Adj. R-squared:                  0.994
Method:                 Least Squares   F-statistic:                     4124.
Date:                Thu, 09 Jan 2025   Prob (F-statistic):           5.06e-84
Time:                        01:33:30   Log-Likelihood:                -607.11
No. Observations:                  80   AIC:                             1222.
Df Residuals:                      76   BIC:                             1232.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                    -50

In [13]:
# Interpret coefficients
coefficients = model.params
print("\nInterpretation of coefficients:")
print(f"Intercept: {coefficients['const']:.2f} (expected revenue when all predictors are zero, outside the observed advertising range)")
print(f"Advertising Expenditure: {coefficients['AdvertisingExpenditure']:.2f} (change in revenue per additional dollar of advertising, other predictors held constant)")
print(f"Store Location: {coefficients['StoreLocation']:.2f} (revenue difference, urban vs. suburban, other predictors held constant)")
print(f"Competition: {coefficients['Competition']:.2f} (change in revenue per one-point increase in competition, other predictors held constant)")


Interpretation of coefficients:
Intercept: -137.09 (expected revenue when all predictors are zero, outside the observed advertising range)
Advertising Expenditure: 5.03 (change in revenue per additional dollar of advertising, other predictors held constant)
Store Location: 911.68 (revenue difference, urban vs. suburban, other predictors held constant)
Competition: -183.53 (change in revenue per one-point increase in competition, other predictors held constant)


In [14]:
# Perform F-test for overall model significance
f_pvalue = model.f_pvalue
print(f"\nOverall model significance (F-test p-value): {f_pvalue:.4f}")
if f_pvalue < 0.05:
    print("The model is significant.")
else:
    print("The model is not significant.")


Overall model significance (F-test p-value): 0.0000
The model is significant.


In [15]:
# Perform t-tests for individual coefficients
t_values = model.tvalues
p_values = model.pvalues
print("\nT-tests for individual coefficients:")
for predictor in coefficients.index:
    print(f"{predictor}: t-value = {t_values[predictor]:.2f}, p-value = {p_values[predictor]:.4f}")
    if p_values[predictor] < 0.05:
        print(f"    {predictor} is significant.")
    else:
        print(f"    {predictor} is not significant.")


T-tests for individual coefficients:
const: t-value = -0.54, p-value = 0.5894
    const is not significant.
AdvertisingExpenditure: t-value = 86.53, p-value = 0.0000
    AdvertisingExpenditure is significant.
StoreLocation: t-value = 7.39, p-value = 0.0000
    StoreLocation is significant.
Competition: t-value = -8.28, p-value = 0.0000
    Competition is significant.
