## 1

- Simple Linear Regression (SLR) uses a single independent variable to predict the outcome, giving a single trend line. In contrast, Multiple Linear Regression (MLR) includes several independent variables, so it can account for more influences on the outcome. MLR can therefore predict more accurately when the added predictors carry real information about the outcome, although extra predictors also raise the risk of overfitting.
- In SLR, a continuous predictor takes values across a range (like age), and the predicted outcome changes steadily along the trend line as the predictor changes. An indicator (dummy) variable instead codes membership in one of two categories (like gender) as 0 or 1, so the model predicts just two levels: the intercept is the mean outcome of the baseline group, and the coefficient is the difference between the two group means. A continuous variable shows gradual change, while an indicator variable captures a difference between categories.
- Adding an indicator variable to MLR shifts the intercept between categories, creating parallel lines with a shared slope across both groups. This allows the model to distinguish between groups while assuming the effect of the continuous variable remains the same across these groups.
- By adding an interaction between a continuous and an indicator variable in MLR, each group can have its own slope. This means the effect of the continuous variable varies between groups, leading to distinct lines that better capture group-specific trends.
- In MLR, using only indicator variables derived from a non-binary categorical variable models differences in mean outcomes across groups rather than a continuous trend. A variable with k categories is coded as k − 1 binary (dummy) indicators, each comparing one category to a baseline. The intercept is the baseline group's mean and each coefficient is that category's offset from it, so the model has different levels for each category but no slope.
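
The parallel-lines and separate-slopes specifications described in this list can be sketched with a small least-squares fit. This is a minimal illustration on synthetic, noise-free data; the names `x` and `group` and all coefficient values are made up for the example and do not come from any dataset used in this notebook.

```python
import numpy as np

# Synthetic, noise-free data (illustrative values only)
x = np.tile(np.arange(10.0), 2)        # continuous predictor, repeated for both groups
group = np.repeat([0.0, 1.0], 10)      # indicator: 0 = baseline group, 1 = other group
# Group 1 differs from the baseline in both intercept (+3) and slope (+0.5)
y = 1.0 + 2.0 * x + 3.0 * group + 0.5 * x * group

ones = np.ones_like(x)

# Additive model: y ~ 1 + x + group (parallel lines, one shared slope)
X_additive = np.column_stack([ones, x, group])
coef_additive, *_ = np.linalg.lstsq(X_additive, y, rcond=None)

# Interaction model: y ~ 1 + x + group + x:group (each group gets its own slope)
X_interaction = np.column_stack([ones, x, group, x * group])
coef_interaction, *_ = np.linalg.lstsq(X_interaction, y, rcond=None)

slope_group0 = coef_interaction[1]
slope_group1 = coef_interaction[1] + coef_interaction[3]

print("additive coefficients:   ", np.round(coef_additive, 3))
print("interaction coefficients:", np.round(coef_interaction, 3))
print("slopes by group:", round(slope_group0, 3), round(slope_group1, 3))
```

The additive fit has a single slope coefficient that both groups must share, whereas the interaction fit recovers a separate slope for each group as the sum of the main slope and the interaction coefficient.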

## 2

- The outcome variable is the effectiveness of the ads, and the predictor variables are the TV ad budget and the online ad budget. There may be an interaction effect, where the effectiveness of TV ads depends on the online ad budget. The model without interaction is: Effectiveness = 𝑏0 + 𝑏1 ⋅ TV_Budget + 𝑏2 ⋅ Online_Budget + 𝜖, and the model with interaction is: Effectiveness = 𝑏0 + 𝑏1 ⋅ TV_Budget + 𝑏2 ⋅ Online_Budget + 𝑏3 ⋅ (TV_Budget × Online_Budget) + 𝜖. The model without interaction assumes additive effects: each extra unit of TV budget changes effectiveness by 𝑏1 regardless of the online budget. In the model with interaction, the effect of an extra unit of TV budget is 𝑏1 + 𝑏3 ⋅ Online_Budget, so it depends on how much is spent online (and vice versa).

- If budgets are categorized as “high” and “low”, let TV_High and Online_High be indicators equal to 1 for a high budget and 0 for a low one. The model without interaction is: Effectiveness = 𝑏0 + 𝑏1 ⋅ TV_High + 𝑏2 ⋅ Online_High + 𝜖, and the model with interaction is: Effectiveness = 𝑏0 + 𝑏1 ⋅ TV_High + 𝑏2 ⋅ Online_High + 𝑏3 ⋅ (TV_High × Online_High) + 𝜖. The interaction coefficient 𝑏3 captures the additional effect, beyond 𝑏1 + 𝑏2, when both budgets are high.
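
To make the indicator version concrete, the sketch below evaluates both models for the four budget combinations. The function name `predicted_effectiveness` and the coefficient values are hypothetical, chosen only for illustration; no real advertising data is involved.

```python
def predicted_effectiveness(tv_high, online_high, b0, b1, b2, b3=0.0):
    """Prediction from the indicator model; b3=0 gives the model without interaction."""
    return b0 + b1 * tv_high + b2 * online_high + b3 * tv_high * online_high


# Hypothetical coefficients chosen only for illustration
b0, b1, b2, b3 = 10.0, 4.0, 3.0, 2.0

for tv_high in (0, 1):
    for online_high in (0, 1):
        additive = predicted_effectiveness(tv_high, online_high, b0, b1, b2)
        synergistic = predicted_effectiveness(tv_high, online_high, b0, b1, b2, b3)
        print(f"TV_High={tv_high}, Online_High={online_high}: "
              f"additive={additive}, with interaction={synergistic}")
```

The two models agree in every cell except the one where both indicators equal 1, which is the only place the product TV_High × Online_High is non-zero.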

## 3

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go

# Load the data with low_memory=False to avoid dtype warnings
cscs_data = pd.read_csv('CSCS_data.csv', low_memory=False)

# Filter relevant columns and remove rows with NaN in 'COVID_vaccinated'
cscs_data_subset = cscs_data[['DEMO_age', 'COVID_vaccinated', 'PSYCH_big_five_inventory_extraverted_score']].dropna(subset=['COVID_vaccinated'])

# Convert 'COVID_vaccinated' to binary: 1 if 'Yes', 0 if 'No'
cscs_data_subset['COVID_vaccinated'] = cscs_data_subset['COVID_vaccinated'].apply(lambda x: 1 if 'Yes' in str(x) else 0)

# Remove any remaining NaN values in other columns to ensure clean data for modeling
cscs_data_subset.dropna(inplace=True)

# Standardize 'DEMO_age' and 'PSYCH_big_five_inventory_extraverted_score' to improve model performance
scaler = StandardScaler()
cscs_data_subset[['DEMO_age', 'PSYCH_big_five_inventory_extraverted_score']] = scaler.fit_transform(
    cscs_data_subset[['DEMO_age', 'PSYCH_big_five_inventory_extraverted_score']]
)

# Separate predictors (X) and target variable (y)
X_additive = cscs_data_subset[['DEMO_age', 'PSYCH_big_five_inventory_extraverted_score']]
y = cscs_data_subset['COVID_vaccinated']

# Fit a regularized logistic regression model with additive terms only (Ridge regularization)
log_reg_additive = LogisticRegression(penalty='l2', solver='liblinear', max_iter=100, random_state=0)
log_reg_additive.fit(X_additive, y)

# Retrieve additive model coefficients
print("Additive Model Coefficients:")
print("Intercept:", log_reg_additive.intercept_)
print("Coefficients:", log_reg_additive.coef_)

# Adding an interaction term for a synergistic model
cscs_data_subset['interaction'] = cscs_data_subset['DEMO_age'] * cscs_data_subset['PSYCH_big_five_inventory_extraverted_score']
X_synergistic = cscs_data_subset[['DEMO_age', 'PSYCH_big_five_inventory_extraverted_score', 'interaction']]

# Fit the synergistic model with the interaction term
log_reg_synergistic = LogisticRegression(penalty='l2', solver='liblinear', max_iter=100, random_state=0)
log_reg_synergistic.fit(X_synergistic, y)

# Retrieve synergistic model coefficients
print("\nSynergistic Model Coefficients:")
print("Intercept:", log_reg_synergistic.intercept_)
print("Coefficients:", log_reg_synergistic.coef_)

# Visualization using Plotly
# Define age range for plotting
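age_range = np.linspace(cscs_data_subset['DEMO_age'].min(), cscs_data_subset['DEMO_age'].max(), 100)
# After standardizing, the mean score is exactly 0 (the median need not be)
extraversion_mean = 0

# Build prediction grids as DataFrames so the column names match those used in fitting
grid_additive = pd.DataFrame({
    'DEMO_age': age_range,
    'PSYCH_big_five_inventory_extraverted_score': np.full(100, extraversion_mean),
})
# With extraversion held at 0, the interaction column is 0 along this slice
grid_synergistic = grid_additive.assign(interaction=age_range * extraversion_mean)

# Predict probabilities for additive model
additive_preds = log_reg_additive.predict_proba(grid_additive)[:, 1]

# Predict probabilities for synergistic model
synergistic_preds = log_reg_synergistic.predict_proba(grid_synergistic)[:, 1]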

# Plot: Additive Model
fig_additive = go.Figure()
fig_additive.add_trace(go.Scatter(x=age_range, y=additive_preds, mode='lines', name='Additive Fit'))
fig_additive.add_trace(go.Scatter(
    x=cscs_data_subset['DEMO_age'],
    y=cscs_data_subset['COVID_vaccinated'] + np.random.normal(0, 0.05, len(cscs_data_subset)),
    mode='markers',
    name='Observed Data',
    opacity=0.5
))
fig_additive.update_layout(title='Regularized Logistic Regression (Additive Specification)',
                           xaxis_title='Age (Standardized)',
                           yaxis_title='Probability of COVID Vaccination')
fig_additive.show()

# Plot: Synergistic Model
fig_synergistic = go.Figure()
fig_synergistic.add_trace(go.Scatter(x=age_range, y=synergistic_preds, mode='lines', name='Synergistic Fit'))
fig_synergistic.add_trace(go.Scatter(
    x=cscs_data_subset['DEMO_age'],
    y=cscs_data_subset['COVID_vaccinated'] + np.random.normal(0, 0.05, len(cscs_data_subset)),
    mode='markers',
    name='Observed Data',
    opacity=0.5
))
fig_synergistic.update_layout(title='Regularized Logistic Regression (Synergistic Specification)',
                              xaxis_title='Age (Standardized)',
                              yaxis_title='Probability of COVID Vaccination')
fig_synergistic.show()


Additive Model Coefficients:
Intercept: [3.09240255]
Coefficients: [[0.01861329 0.05882134]]

Synergistic Model Coefficients:
Intercept: [3.09607642]
Coefficients: [[ 0.01707661  0.05803514 -0.02457871]]
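
The printed coefficients are on the log-odds scale, so a short conversion helps with interpretation. The sketch below copies the additive model's values from the output above and converts them to probabilities and an odds ratio; the helper name `sigmoid` is introduced here for illustration.

```python
import math


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Values copied from the additive model output above
intercept = 3.09240255
coef_age = 0.01861329
coef_extraversion = 0.05882134

# Both predictors are standardized, so 0 means "at the sample mean"
p_at_means = sigmoid(intercept)
p_age_plus_1sd = sigmoid(intercept + coef_age)
odds_ratio_age = math.exp(coef_age)

print(f"P(vaccinated) at mean age and extraversion: {p_at_means:.4f}")
print(f"P(vaccinated) one SD above mean age:        {p_age_plus_1sd:.4f}")
print(f"Odds ratio per SD of age:                   {odds_ratio_age:.4f}")
```

Because the intercept is large and the slopes are small, the predicted probability stays close to its value at the means across the range of age. Note that these are regularized estimates, so the coefficients are shrunk toward zero relative to an unpenalized fit.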






## 4

The two statements are not contradictory, because R-squared and coefficient significance measure different things. An R-squared of 17.6% means the model explains only a small share of the variability in the outcome, so most of the variation comes from factors the model does not include or from noise. A small p-value, in contrast, is evidence against the null hypothesis that a coefficient is zero: it says the average outcome changes reliably with that predictor, not that individual outcomes are predicted precisely. A large coefficient says that this average change is sizeable. So the model as a whole can have low explanatory power while some predictors are still clearly related to the outcome.
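
A small simulation can make this concrete. The sketch below uses synthetic data (not the dataset from the question) in which the true slope is 1 but the noise is large, and computes R-squared and the slope's t-statistic by hand with NumPy. The sample size, noise level, and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible
n = 2000
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(scale=2.0, size=n)   # true slope 1, noisy outcome

# Ordinary least squares fit of y ~ 1 + x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# R-squared: share of outcome variance explained by the model
r_squared = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

# t-statistic for the slope: estimate divided by its standard error
sigma2 = residuals @ residuals / (n - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / np.sqrt(cov_beta[1, 1])

print(f"R-squared: {r_squared:.3f}, slope: {beta[1]:.3f}, t-statistic: {t_slope:.1f}")
```

With a large sample the slope is estimated precisely, so its t-statistic is large even though most of the outcome's variance is noise that no predictor in the model explains.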


Here’s a summary of our conversation:

1. Logistic Regression Modeling:
- We discussed fitting logistic regression models to data from the Canadian Social Connection Survey (CSCS) to predict COVID-19 vaccination status (COVID_vaccinated) using predictors like age (DEMO_age) and extraversion scores (PSYCH_big_five_inventory_extraverted_score).
- The binary outcome COVID_vaccinated was encoded as 1 for "Yes" and 0 for "No," and NaNs were filtered out.
2. Handling Modeling Issues:
- Initial logistic regression attempts encountered complete separation and non-convergence issues, meaning some predictors could perfectly predict outcomes, making coefficient estimation problematic.
- To address this, we:
    - Simplified the model by starting with an additive specification.
    - Standardized predictors to improve convergence.
    - Used Ridge regularization (L2 penalty) to stabilize coefficient estimates and reduce overfitting.
3. Additive vs. Synergistic Models:
- We fit two models:
    - An additive model with only main effects of age and extraversion score.
    - A synergistic model with an interaction term between age and extraversion score to explore if the effect of one variable depended on the other.
- Coefficients were printed to interpret the influence of each predictor on the probability of vaccination.
4. Visualization:
- We used Plotly to visualize the predicted probabilities across age for both models, showing how the probabilities vary based on the predictors.
5. Explaining Contradictory Model Results:
- We clarified why a model can have both a low percentage of variability explained (e.g., only 17.6% of total variance) and large, statistically significant coefficients for some predictors.
- The low model fit suggests the model doesn’t capture much of the outcome’s total variability, while large, significant coefficients indicate that some predictors have a strong, reliable effect on the outcome but may not capture the full complexity of influencing factors.


https://chatgpt.com/share/673645bc-4a8c-8000-b807-d49becba11e3

## 5

- Data Cleaning and Splitting: In this step, missing values in the "Type 2" column are replaced with 'None', and the dataset is split 50/50 into training and test sets. A random seed is also set to ensure reproducibility of results.
- Basic Linear Regression Model (Model 3): A linear regression model is created using the formula HP ~ Attack + Defense, where HP is the dependent variable and Attack and Defense are predictors. With two predictors and no interaction, this is an additive multiple linear regression rather than a simple linear regression. It lets us examine how HP relates to these two basic variables.
- Evaluating Model 3: Model 3’s performance is evaluated by calculating its in-sample (training set) and out-of-sample (test set) R-squared values. These metrics assess how well the model fits the training data and how it generalizes to new data.
- Complex Interaction Model (Model 4): This step defines a more complex model that includes multiple variables and interaction terms (such as Attack, Defense, and Speed). This complex formula captures more intricate relationships, but it also significantly increases computational complexity.
- Evaluating and Comparing Model 4: Model 4 is fitted to the training data, and both in-sample and out-of-sample R-squared values are computed and compared to those of Model 3. This comparison helps evaluate whether the additional complexity improves predictive performance.


These five cells demonstrate a comprehensive approach to data preparation, building simple and complex models, and evaluating them. By comparing the in-sample and out-of-sample R-squared values of Model 3 and Model 4, we can observe the trade-off between model complexity and generalization ability. A simpler model (Model 3) may have stronger generalizability, while a more complex model (Model 4) might achieve better fit on the training data but risks overfitting, potentially performing worse on unseen data. 
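
The trade-off described above can be sketched without the Pokémon data. The example below uses synthetic data in which only two predictors matter, then compares a simple model with a "complex" one that also includes 40 irrelevant predictors. All names, sizes, and coefficients are assumptions made for illustration, and R-squared is computed as 1 − SSR/SST.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_noise = 120, 40
signal = rng.normal(size=(n, 2))              # two predictors that truly matter
noise_cols = rng.normal(size=(n, n_noise))    # predictors unrelated to the outcome
y = 5.0 + signal @ np.array([1.5, -1.0]) + rng.normal(size=n)

# 50/50 split into training and test rows
train = np.arange(n) < n // 2
test = ~train


def design(*blocks):
    """Stack an intercept column with the given predictor blocks."""
    return np.column_stack([np.ones(len(blocks[0])), *blocks])


def r2(y_true, y_pred):
    """R-squared as 1 - SSR/SST."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)


results = {}
models = {"simple": design(signal), "complex": design(signal, noise_cols)}
for name, X in models.items():
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    results[name] = (r2(y[train], X[train] @ beta), r2(y[test], X[test] @ beta))
    print(f"{name:8s} in-sample R2 = {results[name][0]:.3f}, "
          f"out-of-sample R2 = {results[name][1]:.3f}")
```

Because the simple model is nested in the complex one, the complex model's in-sample R-squared can never be lower; whether that advantage survives on the test rows is what the out-of-sample value checks.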

## 6

The model4_linear_form formula creates a design matrix with a very large number of interaction columns, many of which are highly correlated with one another (multicollinearity). This makes the coefficient estimates unstable and lets the model fit noise in the training data, so it performs well in-sample but generalizes poorly, resulting in a low out-of-sample R-squared.
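
The instability caused by multicollinearity can be illustrated on synthetic data. The sketch below refits the same two-predictor model on bootstrap resamples and compares how much one coefficient varies when the second predictor is independent versus when it is almost a copy of the first. The helper `slope_std` and all settings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_resamples = 200, 200


def slope_std(x1, x2):
    """Std. dev. of the x1 coefficient across bootstrap refits of y ~ 1 + x1 + x2."""
    y = 1.0 + x1 + x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    slopes = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        slopes.append(beta[1])
    return np.std(slopes)


x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1

std_independent = slope_std(x1, x2_independent)
std_collinear = slope_std(x1, x2_collinear)

print(f"coefficient std, independent predictors: {std_independent:.3f}")
print(f"coefficient std, collinear predictors:   {std_collinear:.3f}")
```

When the two predictors are nearly identical, the data cannot tell their effects apart, so the individual coefficient swings widely from one resample to the next.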

## 7

From model3_fit and model4_fit to model5, model6, and finally model7, each step aims to balance model complexity and stability to improve generalization:
- model5 finds a middle ground between the simplicity of model3 and the complexity of model4, including main variables and important categorical variables but avoiding excessive interactions to keep the model manageable.
- model6 builds on model5 by further simplifying, keeping only key predictors and significant indicator variables to reduce multicollinearity and enhance stability.
- model7 reintroduces interactions among core variables, while centering and scaling helps address multicollinearity, making the model more stable and improving generalization.

Each model version gradually optimizes complexity and variable selection to create a more stable, generalizable model.
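
The effect of centering and scaling on an interaction term can be checked numerically. The sketch below uses two synthetic predictors with non-zero means (values chosen arbitrarily) and compares the correlation between a main effect and the interaction column, and the condition number of the design matrix, before and after standardizing.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(loc=50.0, scale=10.0, size=n)   # predictors with non-zero means
z = rng.normal(loc=30.0, scale=5.0, size=n)


def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()


def design_with_interaction(a, b):
    """Design matrix for y ~ 1 + a + b + a:b."""
    return np.column_stack([np.ones(len(a)), a, b, a * b])


xs, zs = standardize(x), standardize(z)

corr_raw = np.corrcoef(x, x * z)[0, 1]         # main effect vs. interaction, raw scale
corr_scaled = np.corrcoef(xs, xs * zs)[0, 1]   # same comparison after standardizing

cond_raw = np.linalg.cond(design_with_interaction(x, z))
cond_scaled = np.linalg.cond(design_with_interaction(xs, zs))

print(f"corr(x, x*z): raw = {corr_raw:.3f}, standardized = {corr_scaled:.3f}")
print(f"condition number: raw = {cond_raw:.1f}, standardized = {cond_scaled:.1f}")
```

On the raw scale the product column mostly tracks its factors, which inflates the condition number; after centering, the product is much less correlated with the main effects.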

## 8

import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Define number of iterations
num_iterations = 100

# Lists to store performance metrics
in_sample_r2 = []
out_sample_r2 = []

# Loop to perform multiple model evaluations
for i in range(num_iterations):
    # Split data into training and test sets without setting a random seed
    pokeaman_train, pokeaman_test = train_test_split(pokeaman, train_size=0.5)
    
    # Define the model formula
    model_formula = 'HP ~ Attack + Speed + Q("Sp. Def") + Q("Sp. Atk") + I(Q("Type 1")=="Normal") + I(Q("Type 1")=="Water") + I(Generation==2) + I(Generation==5)'
    
    # Fit the model on the training set
    model_spec = smf.ols(formula=model_formula, data=pokeaman_train)
    model_fit = model_spec.fit()
    
    # Calculate in-sample R-squared
    in_sample_r2.append(model_fit.rsquared)
    
    # Predict on the test set and calculate out-of-sample R-squared
    yhat_test = model_fit.predict(pokeaman_test)
    y_test = pokeaman_test.HP
    out_sample_r2.append(np.corrcoef(y_test, yhat_test)[0, 1]**2)

# Convert results to arrays for easier plotting
in_sample_r2 = np.array(in_sample_r2)
out_sample_r2 = np.array(out_sample_r2)

# Plotting the results using fig.show()
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(range(num_iterations), in_sample_r2, label="In-Sample R²", marker='o')
ax.plot(range(num_iterations), out_sample_r2, label="Out-of-Sample R²", marker='x')
ax.set_xlabel("Iteration")
ax.set_ylabel("R-squared Value")
ax.set_title("In-Sample vs. Out-of-Sample R² Across Iterations")
ax.legend()
fig.show()


The results show how the model performs on both the training data (in-sample) and the test data (out-of-sample). In-sample R-squared measures how much variation the model explains in the data it was fitted to, and is typically the higher of the two. Out-of-sample R-squared reflects how well the model generalizes to new data. Because no random seed is set, each iteration uses a different 50/50 split, so the spread of the two R-squared values across iterations shows how sensitive the model is to the particular sample; an in-sample value that sits consistently well above the out-of-sample value suggests overfitting. The demonstration emphasizes that out-of-sample performance, not a high training R-squared, is the realistic guide to how the model will behave on new data, which is why robust validation matters.
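
The loop above measures out-of-sample R-squared as the squared correlation between observed and predicted HP. Another common definition is 1 − SSR/SST, and the two can disagree on test data. The sketch below compares them on a tiny made-up example; the function names are introduced here for illustration.

```python
import numpy as np


def r2_squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between outcomes and predictions."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2


def r2_one_minus_ratio(y_true, y_pred):
    """1 - SSR/SST; negative when predictions are worse than predicting the mean."""
    ssr = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ssr / sst


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_biased = y_true + 2.0     # perfectly correlated, but every prediction is 2 too high

print("squared correlation:", r2_squared_correlation(y_true, y_biased))
print("1 - SSR/SST:        ", r2_one_minus_ratio(y_true, y_biased))
```

The squared correlation ignores systematic bias or mis-scaling in the predictions, while 1 − SSR/SST penalizes it, so the latter is the stricter measure of out-of-sample accuracy.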


## 9

This illustration looks at how well two models (model6 and model7) predict Pokémon HP when trained on early generations and tested on future generations. In-sample R-squared shows how well the models fit their training data, while out-of-sample R-squared indicates how well they generalize to new data. Models trained only on Gen 1 tend to fit Gen 1 well but struggle with later generations, showing they overfit to specific patterns in Gen 1. On the other hand, models trained on Gen 1-5 generally perform better on Gen 6, suggesting that training on a more diverse dataset helps the model generalize to unseen data. This highlights the importance of using varied training data to build models that work well in real-world situations.


Here's a summary of our discussion:

1. Regression Models for Pokémon Data: You explored different regression models (model3, model4, model5, model6, and model7) to predict Pokémon HP based on various predictors. Starting with simple models, you progressively added more predictors and interactions, then applied centering and scaling to reduce multicollinearity.

2. Model Performance Variability: Using a loop to evaluate the "in-sample" and "out-of-sample" R-squared values across multiple iterations, you examined model stability. This demonstration showed how performance metrics fluctuate with different random train-test splits, revealing the importance of generalizability over mere in-sample accuracy.

3. Evaluating Generalizability Across Generations: You set up models trained on subsets of Pokémon generations (e.g., Gen 1 only and Gen 1-5) to assess how well they predicted HP for unseen generations, such as Gen 6. By examining both in-sample and out-of-sample R-squared values, this setup illustrated how models trained on more generations typically generalize better to future data.

4. Insights Gained:
- Training on a broader, more representative dataset improves a model's generalizability.
- Models trained on limited data (e.g., Gen 1 only) tend to overfit to that subset and perform poorly on new generations.
- Centering and scaling were useful for reducing multicollinearity, especially in models with many interaction terms.

Overall, our conversation emphasized the importance of using diverse and representative data for predictive stability and highlighted techniques like centering and scaling for handling multicollinearity. This process helps ensure that models are robust and perform well on unseen data, a crucial aspect of real-world applications.

https://chatgpt.com/share/673680a7-fc1c-8000-a2f4-a79d0cae9582