## Pre-lecture hw

### 1.

1. Simple Linear Regression vs. Multiple Linear Regression
Simple Linear Regression involves modeling the relationship between a single predictor variable and the outcome:

outcome = 𝛽0 + 𝛽1 (predictorA)

Where:
- 𝛽0 is the intercept
- 𝛽1 is the slope for the predictor 

Multiple Linear Regression extends this by including multiple predictors (continuous or categorical), allowing for more complex models. This enables us to account for the influence of multiple variables simultaneously:

outcome = 𝛽0 + 𝛽1 (predictorA) + 𝛽2 (predictorB) + ...

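The benefit of Multiple Linear Regression over Simple Linear Regression is that it accounts for several predictors at once, so each coefficient describes the association between its predictor and the outcome while the other predictors are held fixed. It can also include interaction terms, which often gives a more realistic representation of real-world phenomena than a single-predictor model.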
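
To make the comparison concrete, here is a minimal sketch that fits both forms by least squares. The data are synthetic and the helper `fit_ols` is a hypothetical name introduced only for this illustration; plain numpy is used instead of statsmodels to keep the example self-contained.

```python
import numpy as np

# Synthetic data: an assumption for illustration, not the homework dataset
rng = np.random.default_rng(0)
n = 200
predictor_a = rng.normal(size=n)
predictor_b = rng.normal(size=n)
outcome = 1.0 + 2.0 * predictor_a - 3.0 * predictor_b + rng.normal(scale=0.5, size=n)


def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return beta, r_squared


# Simple linear regression: one predictor
beta_simple, r2_simple = fit_ols([predictor_a], outcome)
# Multiple linear regression: two predictors
beta_multiple, r2_multiple = fit_ols([predictor_a, predictor_b], outcome)

print("simple   (b0, b1):    ", beta_simple, "R-squared:", r2_simple)
print("multiple (b0, b1, b2):", beta_multiple, "R-squared:", r2_multiple)
```

On the data it was fitted to, the larger model can never have a lower R-squared than the smaller one nested inside it, which is why in-sample fit alone is not enough to justify adding predictors.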

2. Continuous Variable vs. Indicator Variable in Simple Linear Regression
Continuous Variable: A continuous predictor variable can take any value within a range, like age, height, or time. In Simple Linear Regression, its effect is modeled directly as a linear relationship with the outcome.

outcome = 𝛽0 + 𝛽1 (continuous predictorA)

Indicator Variable (Categorical): An indicator variable is used to represent categories or groups. It is binary (0 or 1) and captures the presence or absence of a particular category. If we use an indicator for gender (e.g., 0 for female and 1 for male):

outcome = 𝛽0 + 𝛽1 (1[gender = male])

Here, 1[gender = male] is the indicator variable.
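
A small sketch of what the indicator coefficients mean, using synthetic data (the variable names and numbers are assumptions for illustration): with a single 0/1 predictor, the fitted intercept is the mean of the group coded 0 and the slope is the difference between the two group means.

```python
import numpy as np

# Synthetic data: hypothetical indicator (1 = male, 0 = female) and outcome
rng = np.random.default_rng(1)
n = 100
is_male = rng.integers(0, 2, size=n)
outcome = 10.0 + 4.0 * is_male + rng.normal(size=n)

# Fit outcome = b0 + b1 * 1[gender = male] by least squares
X = np.column_stack([np.ones(n), is_male])
beta0, beta1 = np.linalg.lstsq(X, outcome, rcond=None)[0]

mean_female = outcome[is_male == 0].mean()
mean_male = outcome[is_male == 1].mean()

print("b0:", beta0, "mean of females:", mean_female)
print("b1:", beta1, "difference in means:", mean_male - mean_female)
```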

3. Introducing an Indicator Variable in Multiple Linear Regression
When you introduce an indicator variable alongside a continuous variable in a Multiple Linear Regression, the model becomes capable of accounting for both the linear relationship of the continuous variable and the difference between the categories represented by the indicator variable.

For example, with a continuous variable age and a categorical indicator variable gender (where 1 = male, 0 = female):

outcome = 𝛽0 + 𝛽1 (age) + 𝛽2 (1[gender = male])

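The inclusion of the indicator variable shifts the intercept depending on the category: females have baseline 𝛽0 and males have baseline 𝛽0 + 𝛽2. In this model both groups share the same slope 𝛽1 for age, so the two fitted lines are parallel; allowing different slopes requires an interaction term.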

4. Adding Interaction Between Continuous and Indicator Variables
In Multiple Linear Regression, an interaction between a continuous variable and an indicator variable tests whether the effect of the continuous variable differs across categories. The interaction term represents this combined effect:

outcome = 𝛽0 + 𝛽1 (age) + 𝛽2 (1[gender = male]) + 𝛽3 (age × 1[gender = male])

The interaction coefficient 𝛽3 tells us how the relationship between age and the outcome differs for males vs. females: the slope for age is 𝛽1 for females and 𝛽1 + 𝛽3 for males.
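
The following sketch, on synthetic data with assumed variable names, checks this interpretation numerically: a model with the indicator, the continuous variable, and their interaction reproduces the two lines one would get by fitting each group separately.

```python
import numpy as np

# Synthetic data: hypothetical age, indicator (1 = male), and outcome
rng = np.random.default_rng(2)
n = 300
age = rng.uniform(20, 60, size=n)
is_male = rng.integers(0, 2, size=n)
outcome = 5.0 + 0.5 * age + 2.0 * is_male + 0.3 * age * is_male + rng.normal(size=n)

# outcome = b0 + b1*age + b2*1[male] + b3*(age * 1[male])
X = np.column_stack([np.ones(n), age, is_male, age * is_male])
b0, b1, b2, b3 = np.linalg.lstsq(X, outcome, rcond=None)[0]

# Separate straight-line fits within each group, for comparison
slope_f, intercept_f = np.polyfit(age[is_male == 0], outcome[is_male == 0], 1)
slope_m, intercept_m = np.polyfit(age[is_male == 1], outcome[is_male == 1], 1)

print("female line:", (b0, b1), "vs separate fit:", (intercept_f, slope_f))
print("male line:  ", (b0 + b2, b1 + b3), "vs separate fit:", (intercept_m, slope_m))
```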

5. Multiple Linear Regression with Indicator Variables for Non-Binary Categories
When using indicator variables for a non-binary categorical variable with more than two categories, we create "dummy variables" (i.e., indicator variables) for all but one category. The excluded category serves as the baseline group.

For example, if we have a categorical variable color with categories: Red, Blue, and Green, we need two indicator variables:
- 1[color = Blue]
- 1[color = Green]

The model would look like this:

outcome = 𝛽0 + 𝛽1 (1[color = Blue]) + 𝛽2 (1[color = Green])

- 𝛽0 represents the mean outcome for the baseline category (Red).
- 𝛽1 measures the difference in mean outcome between Blue and Red.
- 𝛽2 measures the difference in mean outcome between Green and Red.

In this case, the baseline group (Red) doesn't require an explicit indicator variable because it is assumed whenever both indicator variables are 0. This is why a variable with three categories needs only two indicator variables: the number of indicators is the number of categories minus one.
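
A tiny worked example with made-up numbers shows both points: the coefficients are differences from the baseline mean, and adding a third indicator alongside the intercept would make one column redundant.

```python
import numpy as np

# Made-up data for illustration only
colors = np.array(["Red", "Blue", "Green", "Blue", "Red", "Green", "Green", "Red"])
outcome = np.array([1.0, 4.0, 7.0, 5.0, 2.0, 9.0, 8.0, 3.0])

is_red = (colors == "Red").astype(float)
is_blue = (colors == "Blue").astype(float)
is_green = (colors == "Green").astype(float)

# Two indicators plus an intercept; Red is the baseline
X = np.column_stack([np.ones(len(colors)), is_blue, is_green])
b0, b1, b2 = np.linalg.lstsq(X, outcome, rcond=None)[0]
print("b0 (Red mean):   ", b0)
print("b1 (Blue - Red): ", b1)
print("b2 (Green - Red):", b2)

# With all three indicators the columns sum to the intercept column,
# so the design matrix has 4 columns but only rank 3
X_full = np.column_stack([np.ones(len(colors)), is_red, is_blue, is_green])
rank_full = np.linalg.matrix_rank(X_full)
print("columns:", X_full.shape[1], "rank:", rank_full)
```

Here the group means are 2 (Red), 4.5 (Blue), and 8 (Green), so the fit gives 𝛽0 = 2, 𝛽1 = 2.5, and 𝛽2 = 6.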

### 2.

1. Difference between Simple and Multiple Linear Regression:

Simple Linear Regression involves predicting an outcome using one predictor variable. For example, predicting sales based on TV advertising spend:

Sales = 𝛽0 + 𝛽1 × TV Spend

Multiple Linear Regression involves predicting an outcome using multiple predictor variables. It can include several predictors (e.g., both TV and online advertising spends) and their combined effects on the outcome. For example, predicting sales based on both TV and online ad spends:

Sales = 𝛽0 + 𝛽1 × TV Spend + 𝛽2 × Online Spend

2. Interaction Between Predictors:

In cases where the effectiveness of TV ads depends on online ad spending (or vice versa), the interaction effect means the relationship between one variable and the outcome changes depending on the value of the other variable. For example, spending more on TV ads may have a different effect on sales depending on how much is spent on online ads.

Without Interaction (Additive Model):

Sales = 𝛽0 + 𝛽1 × TV Spend + 𝛽2 × Online Spend

This assumes that the effect of TV spend on sales is the same at every level of online spend, and vice versa.

With Interaction:

Sales = 𝛽0 + 𝛽1 × TV Spend + 𝛽2 × Online Spend + 𝛽3 × (TV Spend × Online Spend)

This includes the interaction term (𝛽3), indicating that the effect of TV ads on sales might depend on the level of online ad spending.

Using These Models:

With Continuous Variables: Both models predict the outcome from the individual effects of TV and online ad spending; the interaction model additionally lets the effect of each kind of spending depend on the level of the other.

With Binary Variables: If TV and online ad budgets are categorized as "high" or "low" (binary variables), the model changes to reflect these categories. The interaction term in the binary model looks at how different combinations of high and low budgets (e.g., high TV + low online, high TV + high online) affect the outcome.

Without Interaction:

Sales = 𝛽0 + 𝛽1 × TV (High/Low) + 𝛽2 × Online (High/Low)


With Interaction:

Sales = 𝛽0 + 𝛽1 × TV (High/Low) + 𝛽2 × Online (High/Low) + 𝛽3 × (TV (High/Low) × Online (High/Low))
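
As a quick arithmetic sketch of the binary interaction model, assume "High" is coded as 1 and "Low" as 0, and take made-up coefficient values (both are assumptions for illustration, not fitted results). The four budget combinations then give four predictions, and the effect of a high TV budget differs by 𝛽3 depending on the online budget.

```python
# Hypothetical coefficients, chosen only to make the arithmetic easy to follow
b0, b1, b2, b3 = 100.0, 20.0, 10.0, 5.0


def predict_sales(tv_high, online_high):
    """Prediction from the binary interaction model (1 = High, 0 = Low)."""
    return b0 + b1 * tv_high + b2 * online_high + b3 * tv_high * online_high


for tv_high in (0, 1):
    for online_high in (0, 1):
        print("TV high:", tv_high, "Online high:", online_high,
              "predicted sales:", predict_sales(tv_high, online_high))

# Effect of moving TV from Low to High, at each level of the online budget
tv_effect_online_low = predict_sales(1, 0) - predict_sales(0, 0)    # b1
tv_effect_online_high = predict_sales(1, 1) - predict_sales(0, 1)   # b1 + b3
print("TV effect when online is low: ", tv_effect_online_low)
print("TV effect when online is high:", tv_effect_online_high)
```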

### 3.

In [2]:
import pandas as pd
import statsmodels.formula.api as smf

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
pokeaman = pd.read_csv(url).fillna('None')

pokeaman['str8fyre'] = (pokeaman['Type 1']=='Fire').astype(int)
linear_model_specification_formula = \
'str8fyre ~ Attack*Legendary + Defense*I(Q("Type 2")=="None") + C(Generation)'
log_reg_fit = smf.logit(linear_model_specification_formula, data=pokeaman).fit()
log_reg_fit.summary()

Optimization terminated successfully.
         Current function value: 0.228109
         Iterations 8


Dep. Variable:,str8fyre,No. Observations:,800.0
Model:,Logit,Df Residuals:,788.0
Method:,MLE,Df Model:,11.0
Date:,"Sun, 08 Dec 2024",Pseudo R-squ.:,0.05156
Time:,03:33:54,Log-Likelihood:,-182.49
converged:,True,LL-Null:,-192.41
Covariance Type:,nonrobust,LLR p-value:,0.04757

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.2644,0.714,-4.572,0.000,-4.664,-1.865
Legendary[T.True],4.3478,2.179,1.996,0.046,0.078,8.618
"I(Q(""Type 2"") == ""None"")[T.True]",1.5432,0.853,1.810,0.070,-0.128,3.215
C(Generation)[T.2],-0.0574,0.468,-0.123,0.902,-0.975,0.861
C(Generation)[T.3],-0.6480,0.466,-1.390,0.164,-1.561,0.265
C(Generation)[T.4],-0.8255,0.545,-1.516,0.130,-1.893,0.242
C(Generation)[T.5],-0.5375,0.449,-1.198,0.231,-1.417,0.342
C(Generation)[T.6],0.3213,0.477,0.673,0.501,-0.614,1.257
Attack,0.0172,0.006,3.086,0.002,0.006,0.028


### 4.

In [3]:
import pandas as pd

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
# Note: the GitHub page URL https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv fails with pd.read_csv; use the raw URL above
pokeaman = pd.read_csv(url) 
pokeaman

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [4]:
import statsmodels.formula.api as smf

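# Model 1: additive, one shared Sp. Def slope plus a separate intercept shift per Generation
model1_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation)', data=pokeaman)
# Model 2: also lets the Sp. Def slope differ by Generation (explicit interaction term)
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation) + Q("Sp. Def"):C(Generation)', data=pokeaman)
# Equivalent shorthand: a * b expands to a + b + a:b, so this re-assignment specifies the same model
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") * C(Generation)', data=pokeaman)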

model2_fit = model2_spec.fit()
model2_fit.summary()

Dep. Variable:,HP,R-squared:,0.176
Model:,OLS,Adj. R-squared:,0.164
Method:,Least Squares,F-statistic:,15.27
Date:,"Sun, 08 Dec 2024",Prob (F-statistic):,3.5e-27
Time:,03:34:44,Log-Likelihood:,-3649.4
No. Observations:,800,AIC:,7323.0
Df Residuals:,788,BIC:,7379.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,26.8971,5.246,5.127,0.000,16.599,37.195
C(Generation)[T.2],20.0449,7.821,2.563,0.011,4.692,35.398
C(Generation)[T.3],21.3662,6.998,3.053,0.002,7.629,35.103
C(Generation)[T.4],31.9575,8.235,3.881,0.000,15.793,48.122
C(Generation)[T.5],9.4926,7.883,1.204,0.229,-5.982,24.968
C(Generation)[T.6],22.2693,8.709,2.557,0.011,5.173,39.366
"Q(""Sp. Def"")",0.5634,0.071,7.906,0.000,0.423,0.703
"Q(""Sp. Def""):C(Generation)[T.2]",-0.2350,0.101,-2.316,0.021,-0.434,-0.036
"Q(""Sp. Def""):C(Generation)[T.3]",-0.3067,0.093,-3.300,0.001,-0.489,-0.124

Omnibus:,337.229,Durbin-Watson:,1.505
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2871.522
Skew:,1.684,Prob(JB):,0.0
Kurtosis:,11.649,Cond. No.,1400.0


The two metrics, R-squared and p-values, do not conflict because they assess different things:
- R-squared tells us how much of the variability in the outcome the model as a whole explains (here 0.176, so about 18% of the variation in HP).
- P-values tell us how strong the evidence is against the null hypothesis that an individual coefficient is zero. They say nothing about how much of the variation that predictor explains.

A low R-squared value indicates that the model does not explain much of the variability in the outcome, but it is still possible for individual predictors to show strong evidence of an association, as indicated by low p-values. A large coefficient is not evidence by itself, because its size depends on the units of the predictor.
This is common in situations where the predictors have real effects but where other factors (unobserved or not included in the model) contribute most of the variability in the outcome.
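
A minimal simulation can illustrate this. The data below are synthetic (an assumption for illustration), with a weak but real slope buried in noise; the p-value uses a normal approximation to the t distribution, which is a simplification that should be reasonable for a sample this large.

```python
import math
import numpy as np

# Synthetic data: a weak but real effect with a lot of unexplained noise
rng = np.random.default_rng(4)
n = 2000
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Standard error of the slope from the usual OLS formula
sigma2 = (residuals @ residuals) / (n - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / math.sqrt(cov_beta[1, 1])

# Two-sided p-value, normal approximation to the t distribution
p_slope = math.erfc(abs(t_slope) / math.sqrt(2))

print("R-squared:", r_squared)
print("slope t statistic:", t_slope, "approximate p-value:", p_slope)
```

The R-squared stays small because most of the variation in `y` is noise, while the t statistic is large because 2000 observations pin down the slope precisely.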

## Post-lecture hw

### 5.

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split

fifty_fifty_split_size = int(pokeaman.shape[0]*0.5)

# Replace NaN (in the "Type 2" column) with "None"
pokeaman.fillna('None', inplace=True)

np.random.seed(130)
pokeaman_train,pokeaman_test = \
  train_test_split(pokeaman, train_size=fifty_fifty_split_size)
pokeaman_train

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
370,338,Solrock,Rock,Psychic,70,95,85,55,65,70,3,False
6,6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
242,224,Octillery,Water,,75,105,75,105,75,45,2,False
661,600,Klang,Steel,,60,80,95,70,85,50,5,False
288,265,Wurmple,Bug,,45,45,35,20,30,20,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...
522,471,Glaceon,Ice,,65,60,110,130,95,65,4,False
243,225,Delibird,Ice,Flying,45,55,45,65,45,75,2,False
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
117,109,Koffing,Poison,,40,65,95,60,45,35,1,False


Cell 1: Prepares the dataset for training and testing.
- fillna: The NaN values in the "Type 2" column are replaced with the string 'None', so rows without a second type are kept as their own category rather than treated as missing data.
- np.random.seed(130): Fixes the random seed so that the same split is produced each time the cell is run.
- train_test_split: Randomly splits the data into two subsets: one for training the model (pokeaman_train) and one for testing it (pokeaman_test). Here train_size is set to half the number of rows, giving a 50-50 split of 400 rows each.
- Output: The training set (pokeaman_train) is displayed after the split; it will be used for model fitting.

This step sets up the data for model fitting and evaluation, ensuring that no missing values will interfere with the modeling process.

In [6]:
model_spec3 = smf.ols(formula='HP ~ Attack + Defense', 
                      data=pokeaman_train)
model3_fit = model_spec3.fit()
model3_fit.summary()

Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Sun, 08 Dec 2024",Prob (F-statistic):,1.66e-14
Time:,03:39:56,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,42.5882,3.580,11.897,0.000,35.551,49.626
Attack,0.2472,0.041,6.051,0.000,0.167,0.327
Defense,0.1001,0.045,2.201,0.028,0.011,0.190

Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,343.0


Cell 2: Fits a multiple linear regression model with HP (Hit Points) as the dependent variable and Attack and Defense as the independent variables.
- Formula: 'HP ~ Attack + Defense' means that we're modeling the relationship between HP and both Attack and Defense, assuming a linear relationship.
- fit(): The model is fitted to the training data (pokeaman_train).
- summary(): This outputs the summary of the fitted model, including coefficients, p-values, R-squared, etc.

This step illustrates a small additive model with two predictors (Attack and Defense) to predict HP. The summary() helps understand how well the model fits the training data and the statistical significance of the predictors.

In [7]:
yhat_model3 = model3_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model3_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model3)[0,1]**2)

'In sample' R-squared:     0.14771558304519894
'Out of sample' R-squared: 0.21208501873920738


Cell 3: Evaluates the model's performance both on the training set ("in-sample") and on the test set ("out-of-sample").
- In-sample R-squared: The R-squared value is directly retrieved from the fitted model. This measures how well the model explains the variance in the training data.
- Out-of-sample R-squared: Using the test data (pokeaman_test), it calculates the correlation between the actual and predicted HP values (yhat_model3). Squaring this correlation gives the out-of-sample R-squared, which is a measure of how well the model generalizes to unseen data.

This step is crucial for understanding how well the model fits the data both during training and when applied to new, unseen data. The comparison of in-sample and out-of-sample R-squared shows how well the model generalizes.
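
The squared-correlation definition used in the cell above is one of two common ways to compute an out-of-sample R-squared; the other is 1 - SSE/SST. The sketch below compares them on synthetic stand-ins for the test outcomes and predictions (the arrays and the helper names `r2_corr` and `r2_sse` are assumptions for illustration). It shows that the squared correlation ignores a constant bias in the predictions, while 1 - SSE/SST penalizes it.

```python
import numpy as np

# Synthetic stand-ins for test-set outcomes and model predictions
rng = np.random.default_rng(3)
y_test = rng.normal(loc=70, scale=25, size=400)
yhat = 70 + 0.5 * (y_test - 70) + rng.normal(scale=10, size=400)


def r2_corr(y, yhat):
    """Squared correlation between outcomes and predictions."""
    return np.corrcoef(y, yhat)[0, 1] ** 2


def r2_sse(y, yhat):
    """1 - SSE/SST, which can be negative for very poor predictions."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)


# The same predictions, shifted upward by a constant
yhat_biased = yhat + 30

print("squared correlation:", r2_corr(y_test, yhat), r2_corr(y_test, yhat_biased))
print("1 - SSE/SST:        ", r2_sse(y_test, yhat), r2_sse(y_test, yhat_biased))
```

So the squared correlation measures how well the predictions track the outcomes up to a linear rescaling, which is a somewhat more forgiving criterion than 1 - SSE/SST.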

In [8]:
model4_linear_form = 'HP ~ Attack * Defense * Speed * Legendary'
model4_linear_form += ' * Q("Sp. Def") * Q("Sp. Atk")'
# DO NOT try adding '* C(Generation) * C(Q("Type 1")) * C(Q("Type 2"))'
# That's 6*18*19 = 2052 possible interaction combinations...
# ...a huge number that will blow up your computer

model4_spec = smf.ols(formula=model4_linear_form, data=pokeaman_train)
model4_fit = model4_spec.fit()
model4_fit.summary()

Dep. Variable:,HP,R-squared:,0.467
Model:,OLS,Adj. R-squared:,0.369
Method:,Least Squares,F-statistic:,4.764
Date:,"Sun, 08 Dec 2024",Prob (F-statistic):,4.23e-21
Time:,03:40:27,Log-Likelihood:,-1738.6
No. Observations:,400,AIC:,3603.0
Df Residuals:,337,BIC:,3855.0
Df Model:,62,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,521.5715,130.273,4.004,0.000,265.322,777.821
Legendary[T.True],-6.1179,2.846,-2.150,0.032,-11.716,-0.520
Attack,-8.1938,2.329,-3.518,0.000,-12.775,-3.612
Attack:Legendary[T.True],-1224.9610,545.105,-2.247,0.025,-2297.199,-152.723
Defense,-6.1989,2.174,-2.851,0.005,-10.475,-1.923
Defense:Legendary[T.True],-102.4030,96.565,-1.060,0.290,-292.350,87.544
Attack:Defense,0.0985,0.033,2.982,0.003,0.034,0.164
Attack:Defense:Legendary[T.True],14.6361,6.267,2.336,0.020,2.310,26.963
Speed,-7.2261,2.178,-3.318,0.001,-11.511,-2.942

Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.664
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.2e+16


Cell 4: Defines and fits a much more complex linear regression model compared to Cell 2.
- Formula: The model includes multiple predictors (Attack, Defense, Speed, Legendary status, Special Defense (Sp. Def), and Special Attack (Sp. Atk)), along with interactions between them.
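- `*` in the formula: This operator requests the main effects plus all of their interactions. For example, Attack * Defense includes both the main effects of Attack and Defense, as well as their interaction effect. Chaining six predictors with `*` therefore includes every interaction up to the six-way one.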
- Q(): This is used for handling column names with spaces or special characters, ensuring correct interpretation of "Sp. Def" and "Sp. Atk".
- fit(): The model is fitted to the training data.
- summary(): Outputs the summary of the fitted model, showing coefficients, statistical significance, and other key metrics.

This cell illustrates the complexity of building a more elaborate model with multiple predictors and interactions. The output summary will provide insights into the individual and combined effects of these predictors on the outcome (HP).
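
To get a feel for how large this model is, the sketch below enumerates the terms that chaining six predictors with `*` asks for: every non-empty subset of the predictors. It only counts terms by name and does not call the formula parser itself.

```python
from itertools import combinations

predictors = ["Attack", "Defense", "Speed", "Legendary",
              'Q("Sp. Def")', 'Q("Sp. Atk")']

# Every non-empty subset of the predictors is one term: main effects (size 1),
# two-way interactions (size 2), and so on up to the six-way interaction
terms = [":".join(subset)
         for size in range(1, len(predictors) + 1)
         for subset in combinations(predictors, size)]

print(len(terms))   # 2**6 - 1 = 63 terms, not counting the intercept
print(terms[:8])
```

The summary above reports Df Model: 62, one fewer than 63. A likely explanation is that one column is linearly redundant in the training data, which would also fit the very large condition number, though this is an inference rather than something the output states directly.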

In [9]:
yhat_model4 = model4_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model4_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model4)[0,1]**2)

'In sample' R-squared:     0.46709442115833855
'Out of sample' R-squared: 0.002485342598992873


Cell 5: Evaluates the more complex model's performance, similar to Cell 3.
- In-sample R-squared: This measures how well the more complex model explains the variance in the training data.
- Out-of-sample R-squared: This measures how well the model generalizes to the test set, using the predicted values from the model.

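By comparing the R-squared values from the simpler model (Cell 3) and the more complex model (Cell 5), we can assess whether the added complexity (more predictors and interactions) provides better explanatory power. Here the in-sample R-squared rises from about 0.148 to 0.467, but the out-of-sample R-squared falls from about 0.212 to 0.002, so the complex model fits the training data better while predicting the test data far worse. This is overfitting, and it illustrates the trade-off between model complexity and generalizability.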

### 6.

In [10]:
# "Cond. No." WAS 343.0 WITHOUT centering and scaling
model3_fit.summary() 

Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Sun, 08 Dec 2024",Prob (F-statistic):,1.66e-14
Time:,03:45:43,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,42.5882,3.580,11.897,0.000,35.551,49.626
Attack,0.2472,0.041,6.051,0.000,0.167,0.327
Defense,0.1001,0.045,2.201,0.028,0.011,0.190

Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,343.0


In [11]:
from patsy import center, scale

model3_linear_form_center_scale = \
  'HP ~ scale(center(Attack)) + scale(center(Defense))' 
model_spec3_center_scale = smf.ols(formula=model3_linear_form_center_scale,
                                   data=pokeaman_train)
model3_center_scale_fit = model_spec3_center_scale.fit()
model3_center_scale_fit.summary()
# "Cond. No." is NOW 1.66 due to centering and scaling

Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Sun, 08 Dec 2024",Prob (F-statistic):,1.66e-14
Time:,03:45:53,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,69.3025,1.186,58.439,0.000,66.971,71.634
scale(center(Attack)),8.1099,1.340,6.051,0.000,5.475,10.745
scale(center(Defense)),2.9496,1.340,2.201,0.028,0.315,5.585

Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,1.66


In [12]:
model4_linear_form_CS = 'HP ~ scale(center(Attack)) * scale(center(Defense))'
model4_linear_form_CS += ' * scale(center(Speed)) * Legendary' 
model4_linear_form_CS += ' * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))'
# Legendary is an indicator, so we don't center and scale that

model4_CS_spec = smf.ols(formula=model4_linear_form_CS, data=pokeaman_train)
model4_CS_fit = model4_CS_spec.fit()
model4_CS_fit.summary().tables[-1]  # Cond. No. is 15,400,000,000,000,000

# The condition number is still bad even after centering and scaling

Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.663
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.54e+16


In [13]:
# Just as the condition number was very bad to start with
model4_fit.summary().tables[-1]  # Cond. No. is 12,000,000,000,000,000

Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.664
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.2e+16


In model4, the addition of many interaction terms creates new predictors that are highly correlated with each other, leading to multicollinearity in the design matrix. This causes instability in the model's coefficients and makes the model prone to overfitting. Even after centering and scaling the data, the condition number remains very large, indicating persistent multicollinearity. As a result, the model's predictions generalize poorly to new, unseen data, as evidenced by the poor out-of-sample performance.
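
The difference between the two situations can be sketched with numpy on synthetic stand-ins for Attack and Defense (the data and the helper `standardize` are assumptions for illustration, and `np.linalg.cond` is used here as a stand-in for the "Cond. No." in the summary tables). Centering and scaling fixes a condition number that is large only because of the units and offsets of the predictors, but it cannot fix one caused by a column that is redundant with the others.

```python
import numpy as np

# Synthetic stand-ins for two predictor columns
rng = np.random.default_rng(5)
n = 400
attack = rng.normal(loc=80, scale=30, size=n)
defense = rng.normal(loc=75, scale=30, size=n)


def standardize(v):
    """Center to mean 0 and scale to standard deviation 1."""
    return (v - v.mean()) / v.std()


X_raw = np.column_stack([np.ones(n), attack, defense])
X_std = np.column_stack([np.ones(n), standardize(attack), standardize(defense)])
# Add a column that is an exact linear combination of two existing columns
X_redundant = np.column_stack([X_std, X_std[:, 1] + X_std[:, 2]])

cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)
cond_redundant = np.linalg.cond(X_redundant)

print("raw predictors:          ", cond_raw)
print("centered and scaled:     ", cond_std)
print("with a redundant column: ", cond_redundant)
```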

### 7.

- Model 5 starts with a reasonable set of predictors (both continuous and categorical) to explain variation in HP.
- Model 6 refines Model 5 by removing less significant predictors and adding important indicators based on the previous model's results, making it more focused and efficient.
- Model 7 introduces interaction terms to capture more complex relationships between predictors, improving the model's ability to explain variability in HP.
- Model 7 with centering and scaling applies transformations to continuous variables to address multicollinearity and improve numerical stability, making the model more robust and interpretable.

### 8. 

In [15]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

# Create a placeholder for performance metrics
reps = 100  # Number of repetitions
in_sample_Rsquared = np.array([0.0] * reps)
out_of_sample_Rsquared = np.array([0.0] * reps)

# Linear form specification based on model3_fit
linear_form = 'HP ~ Attack + Defense'

# Iterate over multiple repetitions
for i in range(reps):
    # Randomly split the data
    pokeaman_train, pokeaman_test = train_test_split(pokeaman, train_size=0.5)

    # Fit the model on the training data
    final_model_fit = smf.ols(formula=linear_form, data=pokeaman_train).fit()
    
    # Store the 'in-sample' R-squared
    in_sample_Rsquared[i] = final_model_fit.rsquared
    
    # Compute and store the 'out-of-sample' R-squared
    yhat = final_model_fit.predict(pokeaman_test)
    out_of_sample_Rsquared[i] = np.corrcoef(pokeaman_test.HP, yhat)[0, 1] ** 2

# Create a DataFrame to store the results
df = pd.DataFrame({"In Sample Performance (Rsquared)": in_sample_Rsquared,
                   "Out of Sample Performance (Rsquared)": out_of_sample_Rsquared})

# Plot the results
fig = px.scatter(df, x="In Sample Performance (Rsquared)", y="Out of Sample Performance (Rsquared)")
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], name="y=x", line_shape='linear'))

fig.show()

The goal of this analysis is to understand how the model’s performance varies with different random splits of the data. By running multiple repetitions, we can observe:
- In-sample performance (R-squared) shows how well the model fits the data it was trained on. It tends to be optimistic because the model's coefficients are chosen specifically to fit that data.
- Out-of-sample performance (R-squared) shows how well the model generalizes to new, unseen data. This metric is critical because it tells us how the model might perform on real-world data that wasn't included in the training set.

Key Observations:

- If the in-sample R-squared is much higher than the out-of-sample R-squared, this suggests overfitting. The model is fitting the training data very well but does not generalize well to new data.
- If both R-squared values are close, it suggests the model is likely generalizing well to new data.

The purpose of this demonstration is to highlight the variability in model performance across different splits of the data and illustrate the importance of validating the model's ability to generalize (out-of-sample performance). It emphasizes that a good model should not only fit the training data well (high in-sample R-squared) but also perform well on new, unseen data (high out-of-sample R-squared).
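
The scatter plot can be complemented with a few summary numbers. The sketch below repeats the same kind of experiment on synthetic data (an assumption for illustration, with `rng.permutation` standing in for `train_test_split` and plain least squares standing in for `smf.ols`) and then summarizes the gap between in-sample and out-of-sample R-squared across splits.

```python
import numpy as np

# Synthetic stand-in for the dataset: two predictors and a noisy outcome
rng = np.random.default_rng(6)
n, reps = 800, 100
x = rng.normal(size=(n, 2))
y = 70 + 8 * x[:, 0] + 3 * x[:, 1] + rng.normal(scale=25, size=n)
X = np.column_stack([np.ones(n), x])

in_sample = np.zeros(reps)
out_of_sample = np.zeros(reps)
for i in range(reps):
    # Random 50-50 split of the row indices
    order = rng.permutation(n)
    train, test = order[: n // 2], order[n // 2:]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    in_sample[i] = np.corrcoef(y[train], X[train] @ beta)[0, 1] ** 2
    out_of_sample[i] = np.corrcoef(y[test], X[test] @ beta)[0, 1] ** 2

gap = in_sample - out_of_sample
print("mean in-sample R-squared:    ", in_sample.mean())
print("mean out-of-sample R-squared:", out_of_sample.mean())
print("mean gap (in - out):         ", gap.mean())
print("share of splits where out-of-sample is higher:", np.mean(gap < 0))
```

For a small model like this one, the gap is expected to scatter around a small value, and in some splits the out-of-sample value can even be the higher of the two, as happened with the single split in question 5.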

### 9.
Yes