#Question 1:
1. Difference between simple linear regression and multiple linear regression
Simple linear regression is when only one predictor variable is used to predict an outcome. It models the relationship between an outcome and a single predictor variable.
Multiple linear regression, on the other hand, uses multiple predictor variables to build a model. It can capture more complex relationships than simple regression because multiple factors affecting the outcome are considered. Multiple regression has the advantage of being able to explain changes in outcomes more fully.
2. Difference between continuous and indicator variables in simple linear regression
Continuous variables are variables that can take on any value, such as age, income, temperature, etc. These types of variables change gradually and can take any value within a certain range. When used in a model, continuous variables can help predict results as the value of the variable changes.
An indicator variable is a variable that has only two possible values, usually 0 and 1. It is used to indicate categorical information, such as gender (0 for males and 1 for females). Indicator variables enable the inclusion of categorical information in the regression model and are used to differentiate the effects of different categories on the results.
3. Changes after adding indicator variables to the model
When we use only continuous variables in the model, the model predicts a straight line that applies to all data.
When we add an indicator variable to the model, the model changes to multiple regression. The indicator variable changes the intercept of the model, which means that for different categories of data, the model will have different starting points. For example, the model may produce different results when gender is male and female, meaning that different categories of data will have different predictive values.
4. Effect of adding interaction terms for continuous and indicator variables in multiple regression
When we include an interaction term in a model, the model considers not only the separate effect of each variable on the outcome, but also how the effect of one variable depends on the value of another. Here the interaction is between a continuous variable and an indicator variable.
With the inclusion of an interaction term, the model can provide different slopes for different categories of data. For example, assuming that we are analyzing the effects of gender and age on salary, the addition of an interaction term will allow the model to consider not only the effect of gender on salary and the effect of age on salary, but also whether age affects different genders differently.
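The intercept shift from the indicator and the slope shift from the interaction can be seen in a minimal sketch like the one below. All of the data is simulated and every number (the 30, 1.0, 5, and 0.5 used to generate salary) is an assumption made up for illustration. Plain NumPy least squares is used so that the design matrix columns are visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic data; every number here is invented for illustration.
age = rng.uniform(20, 60, n)
female = rng.integers(0, 2, n)  # indicator: 0 = male, 1 = female
salary = 30 + 1.0 * age + 5 * female + 0.5 * age * female + rng.normal(0, 2, n)

# Design matrix: intercept, age, indicator, and the age x indicator interaction.
X = np.column_stack([np.ones(n), age, female, age * female])
b0, b_age, b_female, b_inter = np.linalg.lstsq(X, salary, rcond=None)[0]

# The indicator shifts the intercept; the interaction shifts the slope.
male_line = (b0, b_age)
female_line = (b0 + b_female, b_age + b_inter)
print("male   intercept, slope:", np.round(male_line, 2))
print("female intercept, slope:", np.round(female_line, 2))
```

Without the `age * female` column, both groups would be forced to share one slope and could differ only in their intercepts.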
5. Behavior of multiple regression models when using indicator variables for non-dichotomous category variables
When we have a categorical variable with more than two categories, such as education (high school, bachelor's degree, master's degree), it cannot be represented by a single indicator variable. Instead, we use multiple indicator variables to code it: if there are k categories, we need k-1 indicator variables. The category left without an indicator serves as the baseline, and each indicator's coefficient is the difference between its category and that baseline.
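The k-1 coding can be sketched by hand as follows. The six education values are hypothetical, and treating "high school" as the baseline is an arbitrary choice for this example.

```python
import numpy as np

# Hypothetical education values for six respondents.
education = np.array(["high school", "bachelor", "master",
                      "bachelor", "high school", "master"])

levels = ["high school", "bachelor", "master"]   # k = 3 categories
baseline, coded_levels = levels[0], levels[1:]   # the baseline gets no column

# One 0/1 column per non-baseline level: k - 1 = 2 columns.
indicators = np.column_stack(
    [(education == level).astype(int) for level in coded_levels]
)
print(coded_levels)
print(indicators)
```

A row of all zeros identifies the baseline category, which is why a k-th column would add no new information alongside an intercept.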


###
CHATBOT SUMMARY:
To summarize, multiple linear regression, by introducing indicator variables, interaction terms, and multiple category coding, is able to flexibly handle various types of data, capture more complex relationships, and provide more accurate analyses and predictions for real-world problems.


#Question 2:
Outcome variable:
- Advertising effectiveness, such as sales, click-through rates, or conversion rates.
Predictor variables:
- TV ad spend.
- Online ad spend.
A model that does not include an interaction term:
In this model, it is assumed that TV ads and online ads have independent effects on effectiveness. The formula is:
Ad Effect = Intercept + Coefficient of TV Ad Spend × TV Ad Spend + Coefficient of Online Ad Spend × Online Ad Spend

Interpretation:
The model assumes that TV ads and online ads have separate effects on effectiveness and that the effects of both are **additive**. In other words, the effect of increasing TV ad spending is not affected by how much is spent on online ads.

A model that includes an interaction term:
In this model, an interaction term is included to account for the fact that TV ads and online ads may affect each other's effectiveness. The formula is:
Ad Effect = Intercept + Coefficient of TV Ad Spend × TV Ad Spend + Coefficient of Online Ad Spend × Online Ad Spend + Coefficient of Interaction × (TV Ad Spend × Online Ad Spend)

Interpretation:
The interaction term indicates the synergistic effect between the two. For example, if the coefficient of the interaction term is positive, it means that TV ads and online ads work better when they are increased simultaneously than when either ad spend is increased separately.

Difference Summary:
The model without an interaction term is appropriate when the effect of one type of ad spending does not depend on the level of the other.
The model with an interaction term is appropriate when the effect of one type of ad spending changes with the level of the other, for example when the two channels reinforce each other.
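To make the difference concrete, here is a small sketch of the two formulas. The four coefficient values are hypothetical and chosen only so the arithmetic is easy to follow; they do not come from any fitted model.

```python
# Hypothetical coefficients, chosen only to illustrate the two formulas.
INTERCEPT, B_TV, B_ONLINE, B_INTERACTION = 10.0, 2.0, 3.0, 0.5

def ad_effect_additive(tv, online):
    return INTERCEPT + B_TV * tv + B_ONLINE * online

def ad_effect_interaction(tv, online):
    return ad_effect_additive(tv, online) + B_INTERACTION * tv * online

def gain_from_one_more_tv_unit(model, tv, online):
    """Change in the predicted effect when TV spend rises by one unit."""
    return model(tv + 1, online) - model(tv, online)

# Columns: online spend, TV gain (additive model), TV gain (interaction model).
for online in (0, 4):
    print(online,
          gain_from_one_more_tv_unit(ad_effect_additive, 1, online),
          gain_from_one_more_tv_unit(ad_effect_interaction, 1, online))
```

In the additive model the gain from one more unit of TV spend is the same at every level of online spend. In the interaction model it grows with online spend because the interaction coefficient is positive here.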

2. How to update the model if the advertising budget is simplified to “high” or “low”.

Let's assume that instead of using continuous advertising expenditure, we use indicator variables to represent the level of each advertising budget:
TV advertising budget: high budget = 1, low budget = 0.
Online advertising budget: high budget = 1, low budget = 0.

Models that do not include interaction terms:
Ad Effect = Intercept + Coefficient of TV ad budget × Indicator variable of TV ad budget + Coefficient of Online ad budget × Indicator variable of Online ad budget

Interpretation:
- This model gives each high budget its own fixed effect on the outcome: the change from moving the TV budget from low to high is the same whether the online budget is low or high, and vice versa.

Models that include interaction terms:
Ad Effect = Intercept + Coefficient of TV ad budget × Indicator variable of TV ad budget + Coefficient of Online ad budget × Indicator variable of Online ad budget + Coefficient of Interaction × (Indicator variable of TV ad budget × Indicator variable of Online ad budget)

Interpretation:
- This model considers the combined effect of TV and online advertising. The interaction term applies only when both budgets are high; if its coefficient is positive, the high-budget combination performs better than the sum of the two separate effects, suggesting synergy.
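With two indicator variables there are only four possible budget combinations, so the predictions can be listed in full. The sketch below uses hypothetical coefficients (10, 4, 6, and 3) that are assumptions for illustration only.

```python
from itertools import product

# Hypothetical coefficients for the indicator-variable models.
INTERCEPT, B_TV_HIGH, B_ONLINE_HIGH, B_BOTH_HIGH = 10.0, 4.0, 6.0, 3.0

def predicted_effect(tv_high, online_high, with_interaction=True):
    effect = INTERCEPT + B_TV_HIGH * tv_high + B_ONLINE_HIGH * online_high
    if with_interaction:
        # Non-zero only when both indicators equal 1.
        effect += B_BOTH_HIGH * tv_high * online_high
    return effect

# Columns: TV high?, online high?, additive prediction, interaction prediction.
for tv_high, online_high in product((0, 1), repeat=2):
    print(tv_high, online_high,
          predicted_effect(tv_high, online_high, with_interaction=False),
          predicted_effect(tv_high, online_high, with_interaction=True))
```

The two models agree in three of the four cells and differ only in the high/high cell, where the interaction coefficient is added.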

###
CHATBOT SUMMARY:
The model without the interaction term assumes that the two ad effects are independent and suits cases where each budget affects the outcome on its own.
Models with interaction terms allow for synergistic effects and suit situations where the effects of the two advertising channels may reinforce each other.
The model is simpler when using binary variables, but still captures the difference between high and low budgets, as well as possible synergies when high budgets are combined.

In [2]:
#Question 3:

import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file name; the survey data is not actually loaded in this notebook.
data = pd.read_csv('canadian_social_connection_survey.csv')

# Additive model: each predictor shifts the log-odds on its own.
formula_additive = 'high_social_connection ~ age + income + C(gender) + C(education)'
model_additive = smf.logit(formula=formula_additive, data=data).fit()
print(model_additive.summary())

# Interaction model: the age effect may differ by gender, the income effect by education.
formula_interaction = 'high_social_connection ~ age * C(gender) + income * C(education)'
model_interaction = smf.logit(formula=formula_interaction, data=data).fit()
print(model_interaction.summary())


FileNotFoundError: [Errno 2] No such file or directory: '/path/to/your/file/canadian_social_connection_survey.csv'

The code above won't run because I only pretend to load the data; the file does not actually exist here.

Summary:
Additive models are suitable for situations where predictor variables independently influence the outcome and have a simple structure that is easy to interpret.
Interaction (synergistic) models, on the other hand, take into account the interaction between predictor variables and are suitable for situations where multiple variables may jointly influence the outcome.
Logistic regression is used for binary outcome variables. Its coefficients are on the log-odds scale, so interpretation should attend to their sign and significance, and converting them to changes in predicted probability aids understanding.
Visualization helps to show the predictive effects of different models, especially when examining the effects of interaction terms.
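Since logistic regression coefficients are on the log-odds scale, a short sketch of the conversion to probabilities may help. The intercept and age coefficient below are hypothetical values, not output from the models above (which were never fitted).

```python
import math

def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical fitted values, not taken from any real model output.
intercept = -1.0   # log-odds when age = 0
coef_age = 0.04    # change in log-odds per additional year of age

for age in (20, 40, 60):
    log_odds = intercept + coef_age * age
    print(age, round(sigmoid(log_odds), 3))

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratio = math.exp(coef_age)
print("odds ratio per year:", round(odds_ratio, 3))
```

A positive coefficient always raises the predicted probability, but the size of the change in probability depends on the starting point, because the sigmoid curve is not a straight line.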

#Question 4

1. Difference between model explanatory power (R²) and p-value
R² (Coefficient of Determination): R² indicates the percentage of variability in the outcome variable that is explained by the model. For example, an R² of 0.176 means that the model explains only 17.6% of the variability in the data, implying that the model has a weak ability to explain the results.

R² measures the overall model fit, i.e., the total variability explained by the model.
A low R² indicates that most of the variability cannot be explained by the current model and that there may be other important predictor variables not included in the model.
p-value: The p-value is used to test whether each predictor variable is associated with the outcome.

If the p-value is less than the significance level (usually 0.05), we have evidence to reject the null hypothesis (i.e., that the predictor variable has no effect on the outcome), which suggests a statistically significant association between the predictor and the outcome.
The p-value reflects the strength of evidence against the null hypothesis for each predictor variable, controlling for the other variables; it does not measure the size of the effect.
2. Why is a low R² not inconsistent with a significant coefficient?
A low R² indicates that the model is not explaining things well overall, i.e., the model is not capturing most of the data variability. This may be because:

There are many influences in the data that are not included in the model (i.e., omitted variables).
There is a large amount of noise or unexplained variability in the data.
Significant coefficients: A significant coefficient indicates a statistically detectable relationship between a predictor variable and the outcome, meaning that:

Although the model explains only a small portion of the variability in the results, the data give strong evidence that these predictor variables are related to the outcome.
The effect of these variables may be small, but their pattern of change is reliably associated with the outcome.


Summary:

R² and p-values measure different aspects of the model: the R² assesses the explanatory power of the model as a whole, while the p-value assesses the impact of each predictor variable.
There is no contradiction between a low R² and a significant coefficient. Even if the model cannot explain most of the variability, individual predictor variables can still have a significant impact on the results.
This situation usually indicates that although some variables do contribute significantly to the results, the model may be missing other key variables or the outcome variable is subject to complex multifactorial influences.
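A small simulation can show a low R² and a highly significant coefficient at the same time. The data below is entirely simulated (a true slope of 0.3 with noise of standard deviation 1 are assumptions), and the p-value uses a normal approximation to the t distribution, which is reasonable only because the sample is large.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated data: a real but weak effect buried in a lot of noise.
x = rng.normal(0, 1, n)
y = 0.3 * x + rng.normal(0, 1, n)

# Ordinary least squares for y = b0 + b1 * x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Share of the variability in y explained by the model.
r_squared = 1 - residuals.var() / y.var()

# Standard error of the slope and its test statistic.
sigma2 = residuals @ residuals / (n - 2)
se_slope = math.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_stat = beta[1] / se_slope

# Two-sided p-value from the normal approximation (large n only).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"R^2 = {r_squared:.3f}, slope = {beta[1]:.3f}, p = {p_value:.2e}")
```

The noise keeps R² low, while the large sample makes the slope estimate precise enough for the p-value to be very small.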



#Question 5:

What the Five Cells Illustrate:

Data Splitting (Cell 1): The importance of splitting data into training and testing sets to evaluate model performance objectively and assess generalizability.
Baseline Model Fitting (Cell 2): Establishing a simple linear regression model as a baseline for comparison against more complex models.
Performance Evaluation (Cell 3): Demonstrating the difference between in-sample and out-of-sample performance, highlighting the potential for overfitting if the model performs well on training data but poorly on testing data.
Complex Model Fitting (Cell 4): Illustrating the process of fitting a more complex model with interaction terms, capturing more detailed relationships at the risk of overfitting.
Generalization Assessment (Cell 5): Emphasizing the importance of evaluating both in-sample and out-of-sample R-squared values to determine whether the model generalizes well to new data or is merely overfitting the training data.
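The five steps above can be sketched on simulated data as follows. This is not the code from the five cells: the data, the 50/50 split, and the degree-8 polynomial standing in for the "complex" model are all assumptions for illustration. Out-of-sample R² is computed as a squared correlation between observed and predicted values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60

# Simulated data with a truly linear relationship plus noise.
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# Random 50/50 split into training and testing rows.
order = rng.permutation(n)
train, test = order[: n // 2], order[n // 2:]

def design(values, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.column_stack([values ** d for d in range(degree + 1)])

def r_squared(y_true, y_pred):
    """Squared correlation between observed and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

results = {}
for name, degree in [("baseline", 1), ("complex", 8)]:
    # Fit on the training rows only, then score on both halves.
    beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
    results[name] = (
        r_squared(y[train], design(x[train], degree) @ beta),
        r_squared(y[test], design(x[test], degree) @ beta),
    )
    print(name, "in-sample:", round(results[name][0], 3),
          "out-of-sample:", round(results[name][1], 3))
```

The complex model can never have a lower in-sample R² than the baseline, because it contains the baseline's columns. Whether it does better or worse out of sample is what the testing rows are for.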

#Question 6:
In model4, the linear model contains multiple variables and interaction terms to construct a complex design matrix. The design matrix contains all predictor variables and their interaction terms, with each column representing a new predictive feature. While these interaction terms capture more variable relationships, they also introduce the problem of multicollinearity.

The main issue:
Multicollinearity: some variables in the design matrix are highly correlated, resulting in unstable model coefficients and larger standard errors. This collinearity is often caused by too many interaction terms.
Poor generalization: the model performs well on the training data (high in-sample R²) but poorly on the test data (low out-of-sample R²), indicating that the model is overfitted and does not generalize to new data. Multicollinearity contributes to this, because unstable coefficients fitted on one sample transfer poorly to another.
Condition number: the condition number (CN) is used to measure multicollinearity. Even after centering and scaling the variables, the condition number is still high, indicating that the problem persists.
Suggested solutions:
Simplify the model: Reduce the interaction terms to reduce model complexity.
Use regularization: Use Ridge regression or Lasso regression to limit the model coefficients and alleviate the multicollinearity problem.
Feature selection: remove predictor variables that are highly collinear and keep only those variables that have a significant impact on the results.
Conclusion:
Multicollinearity in the design matrix is the main cause of poor model generalization. Overly complex models may perform well on training data but fail to adapt to new data. By simplifying the model, regularization and feature selection, the robustness and generalization ability of the model can be improved so that it also performs well on test data.
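As one illustration of the regularization suggestion, here is a minimal sketch of ridge regression using its closed-form solution. The data is simulated with two nearly identical predictors, the penalty value of 1.0 is arbitrary, and no intercept is fitted because the simulated variables are centered near zero. This is not model4 itself.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Two nearly identical simulated predictors make the design matrix ill-conditioned.
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.01, n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 1, n)

def ridge_coefficients(X, y, penalty):
    """Closed-form ridge solution; penalty = 0 gives ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(p), X.T @ y)

beta_ols = ridge_coefficients(X, y, penalty=0.0)
beta_ridge = ridge_coefficients(X, y, penalty=1.0)  # penalty value is arbitrary
print("OLS:  ", np.round(beta_ols, 2))
print("ridge:", np.round(beta_ridge, 2))
```

With collinear predictors, least squares can assign large offsetting coefficients to the two columns. The ridge penalty shrinks the coefficient vector, which tends to make the individual estimates more stable.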

#Question 7:

Condition number and multicollinearity:
Condition numbers are used to measure multicollinearity problems in design matrices.
As a rough rule of thumb, a condition number:
Less than 30 is not a problem.
Between 30 and 300 suggests that a multicollinearity problem may exist.
Above 1000 usually indicates a serious multicollinearity problem.
In model4 and model5, multicollinearity leads to unstable coefficients and affects the generalization ability of the model.
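The condition number of a design matrix can be computed directly with `np.linalg.cond`. The sketch below uses simulated predictors (their means, scales, and the near-duplicate column `x3` are assumptions) to separate two sources of a high value: differing scales, which centering and standardizing remove, and genuine collinearity, which they do not.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

def with_intercept(*columns):
    """Design matrix with a leading column of ones."""
    return np.column_stack([np.ones(n), *columns])

def standardize(column):
    """Center to mean 0 and scale to standard deviation 1."""
    return (column - column.mean()) / column.std()

# Simulated predictors on very different scales.
x1 = rng.normal(50, 10, n)
x2 = rng.normal(5000, 1000, n)
# A near-copy of x1: genuinely collinear, whatever the scale.
x3 = x1 + rng.normal(0, 0.1, n)

cond_raw = np.linalg.cond(with_intercept(x1, x2))
cond_scaled = np.linalg.cond(with_intercept(standardize(x1), standardize(x2)))
cond_collinear = np.linalg.cond(with_intercept(standardize(x1), standardize(x3)))

print("raw scales:            ", round(cond_raw, 1))
print("standardized:          ", round(cond_scaled, 1))
print("standardized, collinear:", round(cond_collinear, 1))
```

Standardizing brings the first matrix down to a small condition number, but the matrix containing the near-duplicate column stays high even after standardizing.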

Principles of Model Extension and Improvement:
In the process of model extension from model3 to model7, more predictor variables and interaction terms are gradually introduced to improve the explanatory power and predictive ability of the model.
Use centering and standardization to reduce collinearity between variables and improve the robustness of the model.
Evaluate the generalization ability of the model by comparing in-sample and out-of-sample performance (R² values) to avoid overfitting or underfitting.

Evidence-based modeling:
The significance of each predictor variable is assessed through hypothesis testing (using p-values of the coefficients) to ensure that the introduced variables contribute substantially to the model.
Use out-of-sample validation to test the model's ability to generalize, rather than relying solely on in-sample fitting effects.
If the model fails to fully utilize the information in the predictor variables, it may still underfit even after more predictor variables are added.

Goals of model improvement:
Reduce multicollinearity: reduce the effect of collinearity on the model through methods such as feature selection, centering, and standardization.
Improve generalization: build models that maintain good predictive performance on new data, rather than just fitting training data.
Optimize the prediction effect: on the basis of hypothesis testing and out-of-sample validation, gradually expand and adjust the model structure to improve the predictive power of the model.

Conclusion:
The extension from model3 to model7 demonstrates the gradual improvement process of linear regression models. Despite the increase in model complexity and the possible introduction of the multicollinearity problem, we were able to optimize the model structure and improve its robustness and generalization ability through centering, standardization, and evidence-based approaches. The ultimate goal is to construct a model that performs well on both training and test data, enabling reliable prediction and interpretation.

In [9]:
#Question 8:
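import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split

# Load the data and take a quick look at the first rows.
songs = pd.read_csv('songs.csv')
print(songs.head())

# One initial 50/50 split; the loop below draws a fresh split on every repetition.
songs_training_data, songs_testing_data = train_test_split(songs, train_size=0.5)

reps = 100  # number of random train/test splits

# One R² value per split, in-sample and out-of-sample.
in_sample_Rsquared = np.zeros(reps)
out_of_sample_Rsquared = np.zeros(reps)

# Model specification: energy interacts with both loudness and mode.
linear_form = 'danceability ~ energy * loudness + energy * mode'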

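for i in range(reps):
    # No random seed is set, so every repetition uses a different split.
    songs_training_data, songs_testing_data = train_test_split(songs, train_size=0.5)

    # Fit on the training half only.
    final_model_fit = smf.ols(formula=linear_form, data=songs_training_data).fit()

    in_sample_Rsquared[i] = final_model_fit.rsquared

    # Out-of-sample R²: squared correlation between observed and predicted values on the testing half.
    y_pred = final_model_fit.predict(songs_testing_data)
    y_true = songs_testing_data['danceability']
    out_of_sample_Rsquared[i] = np.corrcoef(y_true, y_pred)[0, 1]**2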

df = pd.DataFrame({
    "In Sample Performance (R²)": in_sample_Rsquared,
    "Out of Sample Performance (R²)": out_of_sample_Rsquared
})


fig = px.scatter(
    df,
    x="In Sample Performance (R²)",
    y="Out of Sample Performance (R²)",
    title="Comparison of In-Sample and Out-of-Sample R² Performance"
)

fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", line_shape="linear", name="y=x"))

fig.show()


FileNotFoundError: [Errno 2] No such file or directory: 'songs.csv'

Scatterplot analysis:
Each point in the plot represents the result of one model evaluation, with in-sample R² on the horizontal axis and out-of-sample R² on the vertical axis.
If most of the points fall near the y=x line, the in-sample and out-of-sample performance are similar and the model has good generalization.
If most of the points fall below the y=x line (the in-sample R² is higher than the out-of-sample R²), the model may have an overfitting problem, i.e., the model performs well on the training set but poorly on the test set.
Difference between in-sample and out-of-sample performance:
In-sample performance reflects how well the model fits the training data and is usually higher.
Out-of-sample performance reflects how well the model predicts new data and tends to be lower, especially if the model is overfitting.
Significance of repeated experiments:
By running many random splits without fixing a random seed, the variability and robustness of the model's performance are assessed more realistically.
If the out-of-sample R² value varies significantly across splits, it indicates that the model performance is unstable and further simplification of the model or use of regularization may be required.
Conclusions and Implications:
With multiple random splits and evaluations, we can get a more complete picture of the model's ability to generalize, rather than relying solely on the results of a single split.
Overfitting Detection: If the in-sample R² is significantly higher than the out-of-sample R², the model may be overfitting and the model structure or feature selection needs to be adjusted.
Model robustness: The stability of the out-of-sample R² value can help assess whether the model is able to maintain consistent predictions across different training-testing splits, thus determining the robustness and reliability of the model.
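Because `songs.csv` is not available here, the summary statistics described above can be sketched on simulated data instead. The three predictors, the single true effect of 0.5, and the sample size are all assumptions for illustration, and NumPy least squares stands in for `smf.ols`.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 100

# Simulated stand-in for a real dataset: three predictors, only the first matters.
x = rng.normal(0, 1, (n, 3))
y = 0.5 * x[:, 0] + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])

in_sample = np.zeros(reps)
out_of_sample = np.zeros(reps)
for i in range(reps):
    order = rng.permutation(n)  # a fresh random 50/50 split each time
    train, test = order[: n // 2], order[n // 2:]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    in_sample[i] = np.corrcoef(y[train], X[train] @ beta)[0, 1] ** 2
    out_of_sample[i] = np.corrcoef(y[test], X[test] @ beta)[0, 1] ** 2

gap = in_sample - out_of_sample
print("mean gap (in - out):", round(gap.mean(), 3))
print("spread of out-of-sample R^2 (std):", round(out_of_sample.std(), 3))
print("share of splits below y = x:", (gap > 0).mean())
```

The mean gap summarizes how far the points sit below the y=x line on average, and the standard deviation of the out-of-sample values summarizes how much performance varies from split to split.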

#Question 9:

Model Complexity and Interpretation:

model7 uses a complex linear form that incorporates multi-way interactions (e.g., Attack:Speed:Q("Sp. Def"):Q("Sp. Atk")). This complex model captures more relationships among variables and may perform better in-sample, but is harder to interpret.
model6 is simpler: it contains fewer interaction terms and has a clearer structure, and is therefore superior in terms of interpretability.
Comparison of in-sample and out-of-sample performance:

The code snippet shows the R² values of model7 and model6 on different training and test sets, specifically for different Pokémon generations (Generation 1 and Generation 6).
High in-sample R², low out-of-sample R²: model7 has a high in-sample R², indicating that it performs well on the training set, but a low out-of-sample R² on the test set, which suggests that the model is overfitted and does not generalize well.
More robust out-of-sample performance: In contrast, model6 has a smaller difference between in-sample and out-of-sample R², suggesting that it generalizes better and predicts more consistently, even on new data (different Generation).
Tradeoffs in model selection:

While model7 exhibits higher in-sample performance, its complexity and poorer out-of-sample performance make model6 the better choice in terms of generalization and interpretability.
This emphasizes an important principle in model construction: simple, interpretable models are often preferable to complex models that are difficult to explain. Simple models are more likely to capture real patterns rather than noise, especially when they suffer little from multicollinearity.
Conclusion:
Interpretability vs. complexity: model7 is overly complex, difficult to interpret, and overfits the training data, whereas model6 is simpler, easier to interpret, and generalizes better to different test data.
Validation of generalization ability: the out-of-sample R² comparison shows each model's performance on new data and supports the robustness of model6.
Suggestion for model selection: models with strong explanatory power and moderate complexity should be chosen according to the data and requirements, rather than pursuing in-sample performance at the expense of performance on new data.