In [1]:
# Q1
# Simple Linear Regression uses one predictor (independent variable) to model the relationship with the response (dependent variable), producing a model with a single coefficient and intercept.
# Multiple Linear Regression involves multiple predictors, allowing the model to capture more complex relationships and interactions among variables. The benefit of using multiple predictors is a potentially more accurate model, as it accounts for additional factors that can influence the response.

# A continuous variable represents a range of values and is used directly in regression, creating a model that predicts changes in the response proportionally to the changes in this variable.
# An indicator variable (or dummy variable) takes binary values (0 or 1) to represent categories. It shifts the intercept of the regression line depending on the category but doesn’t change the slope.
# In Simple Linear Regression, using a continuous variable yields a line that adjusts based on the continuous predictor, while using an indicator variable creates a model that shifts up or down depending on the indicator's value.

# Introducing an indicator variable in Multiple Linear Regression alongside a continuous predictor allows the model to have separate intercepts for each category (indicated by the indicator variable) but a shared slope for the continuous predictor.
# In Simple Linear Regression, the model has a single line (intercept and slope) for all observations, but with Multiple Linear Regression that includes an indicator, there are two lines with different intercepts, reflecting category-based shifts in the response.

# Adding an interaction term between a continuous and an indicator variable allows each category to have its own slope, meaning the model can capture different rates of change for each group.
# This results in a Multiple Linear Regression model with distinct lines for each category that differ in both intercept and slope, providing a more flexible fit to the data based on category-specific trends.

# When a model uses only indicator variables for a non-binary categorical predictor, it applies one-hot encoding to convert categories into binary variables (one for each category except a reference group).
# This Multiple Linear Regression form models category-specific intercepts (with no slopes for continuous predictors). Each category has a separate expected outcome, and the response varies between categories but remains constant within each category.
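# A minimal sketch of the model forms described above, fitted to simulated data. The column names (y, x, group, cat) and the coefficients used to generate the data are assumptions made only for illustration; they are not taken from any dataset in this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x": rng.normal(size=n),                     # continuous predictor
    "group": rng.integers(0, 2, size=n),         # binary indicator (0 or 1)
    "cat": rng.choice(["a", "b", "c"], size=n),  # non-binary categorical predictor
})
# Simulated response: the slope on x differs between the two groups
df["y"] = (1 + 2 * df["x"] + 3 * df["group"]
           + 1.5 * df["x"] * df["group"] + rng.normal(size=n))

simple = smf.ols("y ~ x", data=df).fit()               # one line for all observations
additive = smf.ols("y ~ x + group", data=df).fit()     # two parallel lines
interaction = smf.ols("y ~ x * group", data=df).fit()  # two lines with different slopes
categorical = smf.ols("y ~ C(cat)", data=df).fit()     # one expected value per category

# Print the coefficient names each specification estimates
for name, fit in [("simple", simple), ("additive", additive),
                  ("interaction", interaction), ("categorical", categorical)]:
    print(name, list(fit.params.index))
```

# The categorical fit estimates an intercept (the reference category) plus one indicator coefficient for each remaining category, which matches the "one for each category except a reference group" description.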

In [None]:
# Q2
# Step one: Identify the predictor values
#   In the continuous case, we use the actual TV and online ad spending amounts. In the binary case, we classify each ad budget as either "high" or "low," assigning 1 for high and 0 for low.
# Step two: Plug the values into the formula
# Without Interaction:
#   Continuous:
#     Sales=β0+β1⋅TV Spend+β2⋅Online Spend
# Insert the exact spending values for TV and online into the formula to predict sales based on their independent effects.
#   Binary:
#     Sales=β0+β1⋅TV High+β2⋅Online High
# Use 1 or 0 for each variable, depending on whether the budget is high or low.

# With Interaction Formula:
# Continuous Case:
#   Sales=β0+β1⋅TV Spend+β2⋅Online Spend+β3⋅(TV Spend×Online Spend)
# Here, we still input the spending amounts, but we also calculate the interaction term(TV Spend×Online Spend), capturing the combined effect of both types of ad spending.
# Binary Case:
#   Sales=β0+β1⋅TV High+β2⋅Online High+β3⋅(TV High×Online High)
# In this binary model, we input 1 or 0 for each of TV High and Online High and calculate the interaction term (TV High×Online High), which captures any combined effect when both budgets are high.

# Without Interaction Model: Assumes that TV and online ad spending independently affect sales. We input each spending amount into the formula, adding their effects without any synergy. This model is suitable when the ads work independently of each other.
# With Interaction Model: Includes an interaction term, capturing synergy between the two ad types. When we input spending values, the interaction term accounts for any extra effect when both TV and online ad spending are high. This model may provide more accurate predictions if one ad type enhances the effect of the other.
# Difference: The interaction model reflects possible synergy between ad spending, offering more flexible predictions, while the non-interaction model assumes their effects are independent, providing a simpler prediction.

# Without Interaction:
#  Sales=β0+β1⋅TV High+β2⋅Online High
# Here, TV High and Online High are binary indicators (1 for high spend, 0 for low). The coefficients β1 and β2 capture the shift in sales when spending is high versus low for each advertising type.
# With Interaction:
#  Sales=β0+β1·TV High+β2⋅Online High+β3⋅(TV High×Online High)
# Here, TV High×Online High is the interaction between the two binary indicators; it equals 1 only when both ad budgets are high. Its coefficient β3 indicates the added effect on sales when both budgets are high, capturing any potential synergy between the two high-budget categories.
# Making Predictions with Binary Variables:
#   Without Interaction: Predict sales by inputting the binary values for high or low spending on each advertising medium.
#   With Interaction: Use the interaction model to predict sales by considering not just whether spending is high or low but also if both spending levels are high together, potentially yielding different predictions that account for combined high-spending impacts.
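# The prediction steps above can be sketched as a small function. The coefficient values below are hypothetical and chosen only to make the arithmetic easy to follow; in practice they would come from a fitted model.

```python
def predict_sales(tv, online, b0, b1, b2, b3=0.0):
    """Predicted sales; leaving b3 at 0 gives the model without interaction."""
    return b0 + b1 * tv + b2 * online + b3 * (tv * online)

# Hypothetical coefficients (assumptions, not estimates from data)
b0, b1, b2, b3 = 100.0, 20.0, 30.0, 15.0

# Binary case: 1 = high budget, 0 = low budget
for tv_high in (0, 1):
    for online_high in (0, 1):
        without = predict_sales(tv_high, online_high, b0, b1, b2)
        with_int = predict_sales(tv_high, online_high, b0, b1, b2, b3)
        print(tv_high, online_high, without, with_int)
```

# In the binary case the two models give different predictions only when both indicators equal 1, because that is the only combination where the product TV High×Online High is nonzero. The same function works for the continuous case by passing spending amounts instead of 0/1 values.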

In [None]:
# Q3
# For an additive model, we select a few predictor variables:
#   Continuous Variable: Age
#   Binary Variable: Gender (e.g., Male = 1, Female = 0)
#   Categorical Variable: Education Level (e.g., High School, Bachelor, Master)
# Because Education Level has three categories, it enters the model as two indicator variables (Bachelor and Master), with High School as the reference group. The linear form for this additive model can be expressed as:
#   Outcome=β0+β1⋅Age+β2⋅Gender+β3⋅Bachelor+β4⋅Master
# This model assumes that the effect of each predictor on the outcome is independent of the others.

# For the interaction model, we incorporate an interaction term between age and gender. This allows us to examine if the relationship between age and the outcome differs based on gender.
#   The linear form with the interaction term is:
#   Outcome=β0+β1⋅Age+β2⋅Gender+β3⋅Bachelor+β4⋅Master+β5⋅(Age×Gender)
# This model captures potential synergies between age and gender, reflecting if their combined effect is different from their individual effects.

# Additive Model: Each predictor has an independent effect on the outcome. To predict using this model, input values for age, gender, and education level, then sum the effects according to the formula. This provides a straightforward estimate assuming no interaction between predictors.
# Interaction Model: The interaction model includes an additional term for age and gender. When making predictions, input the values for age and gender and calculate the interaction term (Age × Gender). The interaction allows the effect of age on the outcome to vary based on gender, providing a more nuanced prediction.

# Additive Model: Examine the p-values for each predictor. A small p-value (typically <0.05) is evidence against the null hypothesis that the predictor's coefficient is zero once the other predictors are accounted for. For example, if age and education level have low p-values, there is statistical evidence that they are associated with the outcome.
# Interaction Model: Besides the main effects, pay special attention to the interaction term's p-value. A significant interaction term suggests that the combined effect of age and gender is important for predicting the outcome. If not significant, it indicates that adding the interaction term may not provide additional insight.

# Using Plotly, we create two scatter plots with best-fit lines to compare the additive and interaction models visually.
# Additive Model Plot: This plot shows outcome predictions based on age and gender, without interaction. Each gender group has its own line, but the lines are parallel (same slope, different intercepts), implying the effect of age is the same across genders.
# Interaction Model Plot: This plot includes an interaction term, allowing different slopes for each gender. If the interaction is significant, we expect different trends between genders; otherwise, the lines may appear similar, suggesting the interaction term is unnecessary.
# Comparing the plots:
#   If the lines for each gender in the interaction plot differ significantly, this suggests the interaction term captures an important variation in the data, meaning the interaction model may be necessary.
#   If the lines are similar in both plots, the additive model likely provides sufficient explanatory power, and the interaction term may be unnecessary.
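# A minimal sketch of fitting and comparing the two models, using simulated data. The column names (age, gender, education, outcome) and the data-generating coefficients are assumptions for illustration; education is simulated with no effect, and the age slope is deliberately made steeper when gender equals 1.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "age": rng.uniform(20, 60, size=n),
    "gender": rng.integers(0, 2, size=n),  # 1 = male, 0 = female
    "education": rng.choice(["HighSchool", "Bachelor", "Master"], size=n),
})
# Simulated outcome with a built-in age-by-gender interaction
df["outcome"] = (5 + 0.5 * df["age"] + 2 * df["gender"]
                 + 0.4 * df["age"] * df["gender"]
                 + rng.normal(scale=3, size=n))

additive_fit = smf.ols("outcome ~ age + gender + C(education)", data=df).fit()
interaction_fit = smf.ols("outcome ~ age * gender + C(education)", data=df).fit()

# p-value of the interaction coefficient and the implied age slope per gender
p_interaction = interaction_fit.pvalues["age:gender"]
slope_female = interaction_fit.params["age"]
slope_male = interaction_fit.params["age"] + interaction_fit.params["age:gender"]
print(p_interaction, slope_female, slope_male)
```

# The two slopes are what the interaction plot would show as non-parallel lines; in the additive fit both groups share the single coefficient on age.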

In [None]:
# Q4
# Model Fit (R-squared):
#   When we say "the model only explains 17.6% of the variability in the data," this refers to the R-squared value, a measure of how well the model as a whole captures the variability in the outcome. An R-squared of 17.6% indicates that 82.4% of the outcome's variability remains unexplained by the model, suggesting a weak fit overall.
# Individual Coefficients and Significance:
#   "Many coefficients are larger than 10 with strong evidence against the null hypothesis" indicates that specific predictors have substantial and statistically significant effects on the outcome. In other words, these variables are likely to influence the outcome individually, which is why their p-values are low (indicating strong evidence against the null hypothesis of "no effect").

# Missing Variables: The model may omit important predictors, so while the included variables have significant effects, they only partially explain the outcome.
# Complex Relationships: If interactions, non-linear relationships, or unmeasured external influences are present, a simple linear model with significant coefficients might still fail to capture the full variability in the data.
# Thus, significant coefficients with a low R-squared suggest that while some predictors have strong effects, they are not enough on their own to fully explain the variability in the outcome.
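# The point above can be illustrated with a small simulation, sketched here under assumed values: a true slope of 10 and noise with standard deviation 20. The coefficient is estimated with strong evidence against the null hypothesis, yet most of the variability in y is noise the model cannot explain.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({"x": rng.normal(size=n)})
# True slope is 10, but the noise (sd 20) is much larger than the signal
df["y"] = 10 * df["x"] + rng.normal(scale=20, size=n)

fit = smf.ols("y ~ x", data=df).fit()
# Estimated slope, its p-value, and the share of variance explained
print(fit.params["x"], fit.pvalues["x"], fit.rsquared)
```

# R-squared describes how much of the outcome's spread the whole model accounts for, while the p-value describes how precisely one coefficient is estimated, so the two can disagree like this.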

In [None]:
# Q7
# From model3/4 to model5:
#   Rationale: Add more predictors to capture relevant features and explain more variance in the outcome.
#   Principle: Improve model accuracy by including important characteristics while monitoring for multicollinearity.
# From model5 to model6:
#   Rationale: Simplify the model by retaining only statistically significant predictors.
#   Principle: Focus on impactful predictors to reduce overfitting, making the model more interpretable and stable.
# From model6 to model7:
#   Rationale: Capture complex relationships by adding interaction terms and stabilize the model with centering and scaling.
#   Principle: Enhance predictive power with interactions, while controlling multicollinearity to maintain generalizability.
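# A sketch of why centering and scaling help when interaction terms are present. The predictors x1 and x2 and their scales are hypothetical, and the condition number of the design matrix is used here as one common multicollinearity indicator; the exact transformation applied in model7 is not shown in this cell, so this is an illustration rather than a reproduction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Hypothetical predictors on very different scales (assumed for illustration)
x1 = rng.normal(loc=70, scale=10, size=n)
x2 = rng.normal(loc=500, scale=100, size=n)

def design(a, b):
    """Design matrix with intercept, both predictors, and their interaction."""
    return np.column_stack([np.ones_like(a), a, b, a * b])

def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()

cond_raw = np.linalg.cond(design(x1, x2))
cond_scaled = np.linalg.cond(design(standardize(x1), standardize(x2)))
print(cond_raw, cond_scaled)
```

# With uncentered predictors the product x1*x2 is strongly correlated with x1 and x2 themselves, which inflates the condition number; after standardizing, the columns are close to uncorrelated.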

In [None]:
# Q8
# Re-running the full code to ensure proper execution and visualization

# Re-importing necessary libraries in case of any session reset
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

# Simulating the dataset with relevant columns for the example
n_samples = 100
songs = pd.DataFrame({
    'danceability': np.random.rand(n_samples),
    'energy': np.random.rand(n_samples),
    'loudness': np.random.rand(n_samples) * 100 - 50,  # ranging from -50 to +50
    'mode': np.random.choice([0, 1], n_samples)  # binary variable
})

# Model formula for linear regression
linear_form = 'danceability ~ energy * loudness + energy * mode'

# Arrays to store R-squared values for in-sample and out-of-sample performance
reps = 100  # Number of repetitions
in_sample_Rsquared = np.zeros(reps)
out_of_sample_Rsquared = np.zeros(reps)

# Running the loop without a fixed random seed for train-test splitting
for i in range(reps):
    # Split the data (50-50 split)
    songs_training_data, songs_testing_data = train_test_split(songs, train_size=0.5)
    
    # Fit the model on training data
    final_model_fit = smf.ols(formula=linear_form, data=songs_training_data).fit()
    
    # Calculate in-sample R-squared
    in_sample_Rsquared[i] = final_model_fit.rsquared
    
    # Calculate out-of-sample R-squared
    out_of_sample_Rsquared[i] = np.corrcoef(
        songs_testing_data['danceability'],
        final_model_fit.predict(songs_testing_data)
    )[0, 1]**2

# Create a DataFrame to store the results for visualization
df_results = pd.DataFrame({
    "In Sample Performance (Rsquared)": in_sample_Rsquared,
    "Out of Sample Performance (Rsquared)": out_of_sample_Rsquared
})

# Scatter plot of in-sample vs. out-of-sample R-squared values
fig = px.scatter(
    df_results, 
    x="In Sample Performance (Rsquared)", 
    y="Out of Sample Performance (Rsquared)", 
    title="In-Sample vs Out-of-Sample R-squared Performance Across 100 Splits",
    labels={
        "In Sample Performance (Rsquared)": "In-Sample R-squared",
        "Out of Sample Performance (Rsquared)": "Out-of-Sample R-squared"
    }
)

# Adding a reference line for y=x to visualize agreement between in-sample and out-of-sample R-squared values
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", name="y=x", line_shape="linear"))

fig.show()

# Loop for Repeated Splitting:
#   The code splits the dataset into training and testing sets 100 times, without using a fixed random seed. This creates variation in how the data is divided each time, allowing us to observe different potential outcomes for model performance on unseen data.
# In-Sample and Out-of-Sample R-squared Calculation:
#   For each split, the model is trained on the training data, and both in-sample (training) and out-of-sample (testing) R-squared values are calculated.
#   In-Sample R-squared measures the model’s ability to explain variance in the training data.
#   Out-of-Sample R-squared assesses the model's generalization by evaluating its performance on the test data.
# Visualization:
#   The scatter plot of in-sample vs. out-of-sample R-squared values across all splits helps visualize the relationship between these metrics.
#   A reference line y=x is added, which represents ideal agreement between in-sample and out-of-sample performance.

# Close Agreement: If the points are close to the y=x line, it indicates that the model’s training performance closely matches its testing performance, suggesting a good generalization.
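# Deviation from Line: Points well below the line suggest overfitting (high training performance, low testing performance). Points above the line mean the model happened to score better on the test half than on the training half, which reflects chance variation in the split rather than underfitting; underfitting would show up as both R-squared values being low.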
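# As a numeric companion to the scatter plot, the gap between the two R-squared values can be summarized directly. The helper name summarize_gap and the three example values are made up for illustration.

```python
import numpy as np

def summarize_gap(in_r2, out_r2):
    """Return the mean (in-sample minus out-of-sample) gap and the share of splits below y = x."""
    in_r2 = np.asarray(in_r2, dtype=float)
    out_r2 = np.asarray(out_r2, dtype=float)
    gap = in_r2 - out_r2
    return gap.mean(), np.mean(out_r2 < in_r2)

# Made-up values for three splits, for illustration only
mean_gap, share_below = summarize_gap([0.30, 0.25, 0.20], [0.10, 0.30, 0.15])
print(mean_gap, share_below)
```

# With the arrays from the loop above, the call would be summarize_gap(in_sample_Rsquared, out_of_sample_Rsquared). A large positive mean gap with most points below the line is the pattern associated with overfitting.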

In [1]:
# Q9
# model6_fit is simpler, with fewer interactions and predictors, making it easier to understand and interpret.
# model7_fit is more complex, with high-order interaction terms (e.g., four-way interactions). This added complexity aims to capture more detailed relationships in the data but makes the model harder to interpret.

# Generalizability refers to the model’s ability to perform well on new, unseen data.
# model7_fit may perform slightly better on random test data splits, suggesting it captures specific patterns in the training data.
# However, this model can struggle when applied to data that comes from a different time or context (e.g., predicting future generations of data), which shows potential overfitting—relying too heavily on specific patterns from the training set.

# model6_fit is more interpretable, meaning it’s easier to understand how each predictor affects the outcome, making it more useful when interpretability is essential.
# model7_fit, despite sometimes showing better test performance, is harder to explain and may not always provide meaningful improvements in predictions for new data.

# In real-world scenarios, data often arrives sequentially (e.g., over time or as new groups appear). This means models should ideally generalize well to data from new contexts.
# The fact that model7_fit doesn’t perform as consistently when tested on future data indicates it may be less reliable for such sequential data, whereas the simpler model6_fit performs more steadily.
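# A minimal sketch of evaluating a model on sequentially arriving data: train on earlier rows and test on later ones instead of splitting at random. The period column, the helper sequential_split, and the simulated data are assumptions for illustration; they stand in for whatever ordering (time, generation, batch) the real data has.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "period": np.repeat([1, 2, 3, 4], n // 4),  # hypothetical arrival order
    "x": rng.normal(size=n),
})
df["y"] = 2 * df["x"] + rng.normal(size=n)

def sequential_split(data, order_col, cutoff):
    """Train on rows up to and including the cutoff, test on everything later."""
    train = data[data[order_col] <= cutoff]
    test = data[data[order_col] > cutoff]
    return train, test

train, test = sequential_split(df, "period", cutoff=2)
fit = smf.ols("y ~ x", data=train).fit()
# Out-of-sample R-squared as a squared correlation, as in the Q8 code
out_r2 = np.corrcoef(test["y"], fit.predict(test))[0, 1] ** 2
print(len(train), len(test), fit.rsquared, out_r2)
```

# Comparing a simpler and a more complex model under this kind of split, rather than only under random splits, is one way to check which of them holds up when new data comes from a later context.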

In [None]:
# https://chatgpt.com/c/67366617-6c30-8013-9507-9460d6ae67b8