# 1

#### Simple Linear Regression versus Multiple Linear Regression

Simple linear regression models the relationship between a single predictor (independent variable) and the outcome (dependent variable).

The model has the form $Y = B_0 + B_1X + \epsilon$

Where:
- $Y$ = dependent variable (outcome)
- $B_0$ = intercept (coefficient)
- $B_1$ = slope (coefficient)
- $X$ = independent variable (predictor)
- $\epsilon$ = error term

This assumes that $Y$ can be explained by a linear relationship with just one variable, $X$.
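As a minimal sketch of how the intercept and slope of this form can be estimated, the snippet below uses the standard closed-form least-squares formulas on made-up, noise-free data. The function name `fit_simple_linear_regression` and the example values are illustrative assumptions, not part of the assignment.

```python
# Minimal sketch: closed-form least squares for Y = B0 + B1*X + error.
# The data below are made up for illustration only.

def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # The fitted line passes through the point (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

x_example = [1.0, 2.0, 3.0, 4.0, 5.0]
y_example = [2.0 + 3.0 * xi for xi in x_example]  # exactly linear, no error term
b0_hat, b1_hat = fit_simple_linear_regression(x_example, y_example)
print(b0_hat, b1_hat)
```

Because the example data contain no error term, the fit recovers the intercept and slope used to generate them; with real data the estimates would only approximate the underlying values.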

#### Multiple Linear Regression

Multiple linear regression models the relationship between multiple predictors (independent variables) and the outcome (dependent variable).

The model has the form $Y = B_0 + B_1X_1 + B_2X_2 + \dots + B_nX_n + \epsilon$

$X_1, X_2, \dots, X_n$ = predictors

Each predictor $X_i$ has its own coefficient $B_i$ indicating its contribution to $Y$.


#### Continuous variable vs. indicator variable in Simple Linear Regression, and their two linear forms

Continuous variable: a numeric variable such as age, height, or time.

$Y = B_0 + B_1X + \epsilon$

Indicator variable: a binary variable that takes on the values 0 and 1.

Use an indicator variable when you are interested in comparing two distinct groups in terms of their average outcome.

$Y = B_0 + B_1D + \epsilon$
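To make the "comparing two groups" reading concrete, here is a small sketch on made-up data. It fits the indicator form by ordinary least squares and compares the result with the two group averages; the helper name `ols_intercept_slope` and the numbers are assumptions for illustration.

```python
# Sketch: with a 0/1 indicator D, least squares gives
#   B0 = average outcome when D = 0
#   B1 = (average outcome when D = 1) - (average outcome when D = 0)
# The data are made up for illustration only.

def ols_intercept_slope(x, y):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

d_indicator = [0, 0, 0, 1, 1, 1]
outcome = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]

b0_ind, b1_ind = ols_intercept_slope(d_indicator, outcome)

# Group averages computed directly, for comparison
mean_d0 = sum(y for d, y in zip(d_indicator, outcome) if d == 0) / 3
mean_d1 = sum(y for d, y in zip(d_indicator, outcome) if d == 1) / 3
print(b0_ind, b1_ind, mean_d0, mean_d1 - mean_d0)
```

The intercept matches the average of the $D = 0$ group and the slope matches the difference between the two group averages.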


#### Multiple Linear Regression with one continuous variable and one indicator variable.

$Y = B_0 + B_1X + B_2D + \epsilon$

The model represents two parallel lines: one for $D = 1$ and another for $D = 0$. Both lines share the slope $B_1$, and their intercepts differ by $B_2$.

#### Multiple Linear Regression with an interaction term. 

$Y = B_0 + B_1X + B_2D + B_3(X \times D) + \epsilon$

The model represents two non-parallel lines: the slope is $B_1$ when $D = 0$ and $B_1 + B_3$ when $D = 1$.
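The following sketch illustrates how an interaction term lets each group have its own slope. It builds the design matrix by hand and solves it with `numpy.linalg.lstsq` on made-up, noise-free data; the two generating lines ($1 + 2X$ and $4 + 5X$) are assumptions chosen for illustration.

```python
import numpy as np

# Made-up, noise-free data: group D=0 follows Y = 1 + 2X, group D=1 follows Y = 4 + 5X
x = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
d = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
y = np.where(d == 0, 1 + 2 * x, 4 + 5 * x)

# Design matrix columns: intercept, X, D, and the interaction X*D
design = np.column_stack([np.ones_like(x), x, d, x * d])
interaction_coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2, b3 = interaction_coefs

# Each group's line, read off the coefficients
intercept_d0, slope_d0 = b0, b1
intercept_d1, slope_d1 = b0 + b2, b1 + b3
print(intercept_d0, slope_d0, intercept_d1, slope_d1)
```

Dropping the `x * d` column would force both groups to share one slope, which gives the two parallel lines of the model without an interaction term.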

#### Multiple Linear Regression with Categorical Data.

$Y = B_0 + B_1D_b + B_2D_c + \epsilon$

Category $a$ is the baseline: it corresponds to $D_b = 0$ and $D_c = 0$, so no separate indicator $D_a$ is needed.
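A short sketch of this indicator coding, assuming a made-up three-level category with levels `a`, `b`, and `c`. With only indicators in the model, the intercept estimates the baseline group's average and each indicator coefficient estimates that group's difference from the baseline.

```python
import numpy as np

# Made-up data: group averages are a = 10, b = 20, c = 5
categories = np.array(["a", "a", "b", "b", "c", "c"])
y_cat = np.array([9.0, 11.0, 19.0, 21.0, 4.0, 6.0])

# Indicators for the non-baseline levels; level "a" is the baseline (both are 0)
d_b = (categories == "b").astype(float)
d_c = (categories == "c").astype(float)

design_cat = np.column_stack([np.ones(len(y_cat)), d_b, d_c])
cat_coefs, *_ = np.linalg.lstsq(design_cat, y_cat, rcond=None)
b0_cat, b1_cat, b2_cat = cat_coefs

print(b0_cat)            # average of baseline group a
print(b0_cat + b1_cat)   # average of group b
print(b0_cat + b2_cat)   # average of group c
```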


# 3 Logistic Regression

In [91]:
# Here's an example of how you can do this
import pandas as pd
import statsmodels.formula.api as smf

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
pokeaman = pd.read_csv(url).fillna('None')

pokeaman['str8fyre'] = (pokeaman['Type 1']=='Fire').astype(int)
linear_model_specification_formula = \
'str8fyre ~ Attack*Legendary + Defense*I(Q("Type 2")=="None") + C(Generation)'
log_reg_fit = smf.logit(linear_model_specification_formula, data=pokeaman).fit()
log_reg_fit.summary()

Optimization terminated successfully.
         Current function value: 0.228109
         Iterations 8


0,1,2,3
Dep. Variable:,str8fyre,No. Observations:,800.0
Model:,Logit,Df Residuals:,788.0
Method:,MLE,Df Model:,11.0
Date:,"Thu, 14 Nov 2024",Pseudo R-squ.:,0.05156
Time:,21:15:24,Log-Likelihood:,-182.49
converged:,True,LL-Null:,-192.41
Covariance Type:,nonrobust,LLR p-value:,0.04757

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.2644,0.714,-4.572,0.000,-4.664,-1.865
Legendary[T.True],4.3478,2.179,1.996,0.046,0.078,8.618
"I(Q(""Type 2"") == ""None"")[T.True]",1.5432,0.853,1.810,0.070,-0.128,3.215
C(Generation)[T.2],-0.0574,0.468,-0.123,0.902,-0.975,0.861
C(Generation)[T.3],-0.6480,0.466,-1.390,0.164,-1.561,0.265
C(Generation)[T.4],-0.8255,0.545,-1.516,0.130,-1.893,0.242
C(Generation)[T.5],-0.5375,0.449,-1.198,0.231,-1.417,0.342
C(Generation)[T.6],0.3213,0.477,0.673,0.501,-0.614,1.257
Attack,0.0172,0.006,3.086,0.002,0.006,0.028


# 4

We have a low $R^2$ value of 0.176, but many of the coefficients have low p-values, which is strong evidence against their null hypotheses. This can be explained by the difference between statistical significance and explanatory power.

Statistical significance and explanatory power do not contradict each other; they offer different perspectives on the model.

Statistical significance tells us that there is evidence of a relationship between certain predictors and the outcome variable. The coefficient estimates how much the outcome changes, on average, per unit change in the predictor (so its size depends on the predictor's units), and a low p-value means we can reject the null hypothesis that the predictor has no association with the outcome.

On the other hand, explanatory power, measured by the coefficient of determination $R^2$, describes how well the model captures the variability in the data. A low $R^2$ value means that only a small portion of the variability in the outcome is explained by the predictors. This can happen because important factors are missing or because $Y$ is influenced by more complex patterns than the model allows.

In the example below, there is strong evidence that Sp. Def and Generation are associated with HP; however, other variables may also affect HP, or the relationship of Sp. Def and Generation to HP might not be exactly linear.
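The contrast between the two ideas can be reproduced on synthetic data. The sketch below assumes a weak true slope of 0.2, unit-variance noise, and 2000 observations (all made-up choices), then computes $R^2$ and the slope's t-statistic by hand with NumPy rather than through statsmodels.

```python
import numpy as np

# Synthetic data (assumption): weak but real linear effect buried in noise
rng = np.random.default_rng(0)
n_obs = 2000
x_weak = rng.normal(size=n_obs)
y_weak = 0.2 * x_weak + rng.normal(size=n_obs)

# Ordinary least squares with an intercept
design_weak = np.column_stack([np.ones(n_obs), x_weak])
weak_coefs, *_ = np.linalg.lstsq(design_weak, y_weak, rcond=None)
residuals = y_weak - design_weak @ weak_coefs

# Explanatory power: share of outcome variance captured by the model
r_squared = 1 - residuals.var() / y_weak.var()

# Evidence for the slope: estimate divided by its standard error
sigma2 = residuals @ residuals / (n_obs - 2)
coef_cov = sigma2 * np.linalg.inv(design_weak.T @ design_weak)
t_slope = weak_coefs[1] / np.sqrt(coef_cov[1, 1])

print("R-squared:", r_squared)
print("t-statistic for slope:", t_slope)
```

With many observations the slope is estimated precisely, so its t-statistic is large even though most of the variation in the outcome remains unexplained.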

In [93]:
import pandas as pd

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
# fail https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv
pokeaman = pd.read_csv(url) 
pokeaman

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [94]:
import statsmodels.formula.api as smf

model1_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation)', data=pokeaman)
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation) + Q("Sp. Def"):C(Generation)', data=pokeaman)

model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") * C(Generation)', data=pokeaman)

model2_fit = model2_spec.fit()
model2_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.176
Model:,OLS,Adj. R-squared:,0.164
Method:,Least Squares,F-statistic:,15.27
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,3.5e-27
Time:,21:15:24,Log-Likelihood:,-3649.4
No. Observations:,800,AIC:,7323.0
Df Residuals:,788,BIC:,7379.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,26.8971,5.246,5.127,0.000,16.599,37.195
C(Generation)[T.2],20.0449,7.821,2.563,0.011,4.692,35.398
C(Generation)[T.3],21.3662,6.998,3.053,0.002,7.629,35.103
C(Generation)[T.4],31.9575,8.235,3.881,0.000,15.793,48.122
C(Generation)[T.5],9.4926,7.883,1.204,0.229,-5.982,24.968
C(Generation)[T.6],22.2693,8.709,2.557,0.011,5.173,39.366
"Q(""Sp. Def"")",0.5634,0.071,7.906,0.000,0.423,0.703
"Q(""Sp. Def""):C(Generation)[T.2]",-0.2350,0.101,-2.316,0.021,-0.434,-0.036
"Q(""Sp. Def""):C(Generation)[T.3]",-0.3067,0.093,-3.300,0.001,-0.489,-0.124

0,1,2,3
Omnibus:,337.229,Durbin-Watson:,1.505
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2871.522
Skew:,1.684,Prob(JB):,0.0
Kurtosis:,11.649,Cond. No.,1400.0


# 5

In [96]:
import numpy as np
from sklearn.model_selection import train_test_split

fifty_fifty_split_size = int(pokeaman.shape[0]*0.5)

# Replace "NaN" (in the "Type 2" column with "None")
pokeaman.fillna('None', inplace=True)

np.random.seed(130)
pokeaman_train,pokeaman_test = \
  train_test_split(pokeaman, train_size=fifty_fifty_split_size)
pokeaman_train

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
370,338,Solrock,Rock,Psychic,70,95,85,55,65,70,3,False
6,6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
242,224,Octillery,Water,,75,105,75,105,75,45,2,False
661,600,Klang,Steel,,60,80,95,70,85,50,5,False
288,265,Wurmple,Bug,,45,45,35,20,30,20,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...
522,471,Glaceon,Ice,,65,60,110,130,95,65,4,False
243,225,Delibird,Ice,Flying,45,55,45,65,45,75,2,False
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
117,109,Koffing,Poison,,40,65,95,60,45,35,1,False


In [97]:
model_spec3 = smf.ols(formula='HP ~ Attack + Defense', 
                      data=pokeaman_train)
model3_fit = model_spec3.fit()
model3_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,1.66e-14
Time:,21:15:24,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,42.5882,3.580,11.897,0.000,35.551,49.626
Attack,0.2472,0.041,6.051,0.000,0.167,0.327
Defense,0.1001,0.045,2.201,0.028,0.011,0.190

0,1,2,3
Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,343.0


In [98]:
yhat_model3 = model3_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model3_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model3)[0,1]**2)

'In sample' R-squared:     0.14771558304519838
'Out of sample' R-squared: 0.21208501873920707


In [99]:
model4_linear_form = 'HP ~ Attack * Defense * Speed * Legendary'
model4_linear_form += ' * Q("Sp. Def") * Q("Sp. Atk")'
# DO NOT try adding '* C(Generation) * C(Q("Type 1")) * C(Q("Type 2"))'
# That's 6*18*19 = 2052 possible interaction combinations...
# ...a huge number that will blow up your computer

model4_spec = smf.ols(formula=model4_linear_form, data=pokeaman_train)
model4_fit = model4_spec.fit()
model4_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.467
Model:,OLS,Adj. R-squared:,0.369
Method:,Least Squares,F-statistic:,4.764
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,4.230000000000001e-21
Time:,21:15:24,Log-Likelihood:,-1738.6
No. Observations:,400,AIC:,3603.0
Df Residuals:,337,BIC:,3855.0
Df Model:,62,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,521.5715,130.273,4.004,0.000,265.322,777.821
Legendary[T.True],-6.1181,2.846,-2.150,0.032,-11.716,-0.520
Attack,-8.1938,2.329,-3.518,0.000,-12.775,-3.612
Attack:Legendary[T.True],-1224.9615,545.105,-2.247,0.025,-2297.199,-152.724
Defense,-6.1989,2.174,-2.851,0.005,-10.475,-1.923
Defense:Legendary[T.True],-102.4031,96.565,-1.060,0.290,-292.350,87.544
Attack:Defense,0.0985,0.033,2.982,0.003,0.034,0.164
Attack:Defense:Legendary[T.True],14.6362,6.267,2.336,0.020,2.310,26.963
Speed,-7.2261,2.178,-3.318,0.001,-11.511,-2.942

0,1,2,3
Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.663
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.2e+16


In [100]:
yhat_model4 = model4_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model4_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model4)[0,1]**2)

'In sample' R-squared:     0.46709442115857047
'Out of sample' R-squared: 0.002485341422605343
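The gap between the in-sample and out-of-sample values above is the typical signature of overfitting. As a hedged illustration on synthetic data (not the Pokémon data), the sketch below compares a one-predictor model with a model that also includes 60 pure-noise predictors. The helper names `squared_correlation` and `fit_and_score` and all sizes are assumptions; the scoring mirrors the `np.corrcoef(...)[0,1]**2` calculation used in the cells above.

```python
import numpy as np

def squared_correlation(y, yhat):
    """Squared correlation between observed and predicted values."""
    return np.corrcoef(y, yhat)[0, 1] ** 2

def with_intercept(features):
    """Prepend a column of ones to a 2-D feature array."""
    return np.column_stack([np.ones(len(features)), features])

def fit_and_score(x_train, y_train, x_test, y_test):
    """Fit OLS on the training rows; return (in-sample, out-of-sample) scores."""
    coefs, *_ = np.linalg.lstsq(with_intercept(x_train), y_train, rcond=None)
    in_sample = squared_correlation(y_train, with_intercept(x_train) @ coefs)
    out_sample = squared_correlation(y_test, with_intercept(x_test) @ coefs)
    return in_sample, out_sample

# Synthetic data (assumption): one real predictor plus 60 irrelevant ones
rng = np.random.default_rng(130)
n_train, n_test, n_noise = 100, 100, 60
signal = rng.normal(size=(n_train + n_test, 1))
noise_features = rng.normal(size=(n_train + n_test, n_noise))
y_all = signal[:, 0] + 2.0 * rng.normal(size=n_train + n_test)

x_simple = signal
x_flexible = np.column_stack([signal, noise_features])

simple_scores = fit_and_score(x_simple[:n_train], y_all[:n_train],
                              x_simple[n_train:], y_all[n_train:])
flexible_scores = fit_and_score(x_flexible[:n_train], y_all[:n_train],
                                x_flexible[n_train:], y_all[n_train:])
print("simple   (in, out):", simple_scores)
print("flexible (in, out):", flexible_scores)
```

Adding predictors can never lower the in-sample value for nested least-squares fits, which is why only the out-of-sample value reveals whether the extra terms captured real structure or noise.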


# 6

In [102]:
# "Cond. No." WAS 343.0 WITHOUT centering and scaling
model3_fit.summary() 

0,1,2,3
Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,1.66e-14
Time:,21:15:25,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,42.5882,3.580,11.897,0.000,35.551,49.626
Attack,0.2472,0.041,6.051,0.000,0.167,0.327
Defense,0.1001,0.045,2.201,0.028,0.011,0.190

0,1,2,3
Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,343.0


In [103]:
from patsy import center, scale

model3_linear_form_center_scale = \
  'HP ~ scale(center(Attack)) + scale(center(Defense))' 
model_spec3_center_scale = smf.ols(formula=model3_linear_form_center_scale,
                                   data=pokeaman_train)
model3_center_scale_fit = model_spec3_center_scale.fit()
model3_center_scale_fit.summary()
# "Cond. No." is NOW 1.66 due to centering and scaling

0,1,2,3
Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,1.66e-14
Time:,21:15:25,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,69.3025,1.186,58.439,0.000,66.971,71.634
scale(center(Attack)),8.1099,1.340,6.051,0.000,5.475,10.745
scale(center(Defense)),2.9496,1.340,2.201,0.028,0.315,5.585

0,1,2,3
Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,1.66


In [104]:
model4_linear_form_CS = 'HP ~ scale(center(Attack)) * scale(center(Defense))'
model4_linear_form_CS += ' * scale(center(Speed)) * Legendary' 
model4_linear_form_CS += ' * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))'
# Legendary is an indicator, so we don't center and scale that

model4_CS_spec = smf.ols(formula=model4_linear_form_CS, data=pokeaman_train)
model4_CS_fit = model4_CS_spec.fit()
model4_CS_fit.summary().tables[-1]  # Cond. No. is 2,250,000,000,000,000

# The condition number is still bad even after centering and scaling

0,1,2,3
Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.663
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,2250000000000000.0


In [105]:
# Just as the condition number was very bad to start with
model4_fit.summary().tables[-1]  # Cond. No. is 12,000,000,000,000,000

0,1,2,3
Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.663
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.2e+16


# 7

Models 3, 4, and 7 have also been fit with their continuous predictors centered around 0 and scaled.

#### Model 3 to model 4. 

Model 4 introduces interaction terms between predictors, such as Attack * Defense and Speed * Legendary, to model the more complex relationships between variables and how they jointly influence HP.

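The very high condition number in Model 4 (about $1.2 \times 10^{16}$, and still about $2.25 \times 10^{15}$ after centering and scaling) indicates severe multicollinearity, suggesting that predictors are highly correlated with each other, leading to potential instability in the estimates.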

#### Model 4 to model 5

Model 5 drops Model 4's interaction terms, keeping the same six predictors as additive terms, and adds the predictors Generation, Type 1, and Type 2. These are categorical variables that provide more information about the Pokémon's characteristics and their potential influence on HP.

#### Model 5 to model 6

Model 6 refines Model 5 by focusing on predictors with the lowest p-values. Binary variables are created for Type 1 = 'Water' and Type 1 = 'Normal', as well as for Generation 2 and Generation 5, because these categories were found to be significant predictors with strong evidence against the null hypothesis (p < 0.05). The goal is to simplify the model by removing predictors that don't significantly contribute to explaining HP.

#### Model 6 to model 7
Model 7 adds interaction terms between the continuous variables Attack, Speed, Sp. Def, and Sp. Atk. This allows the model to account for more complex relationships between the predictors: if the effect of one predictor on the outcome depends on other predictors, this type of model will capture those effects more accurately. The centered and scaled version of the model (model7_linear_form_CS) is meant to address potential multicollinearity and ensure that each predictor is on a comparable scale, which can improve model stability and interpretability.

Note we don't center and scale indicator variables.

The large condition number in model7_fit before centering and scaling (2,340,000,000) indicates potential multicollinearity problems: predictors may be highly correlated with each other, leading to unstable estimates. After centering and scaling, the condition number is reduced to a much smaller value (15.4), suggesting that multicollinearity is less of a concern, improving the model's reliability.

The condition number is a key tool for identifying multicollinearity. Large values (e.g., greater than 1000) indicate that predictors are highly correlated, which can lead to unstable and unreliable estimates. Centering and scaling lowered the condition number substantially for Model 3 (343 to 1.66) and Model 7 (2,340,000,000 to 15.4), but not for Model 4, where it remained around $2.25 \times 10^{15}$.
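The effect of centering and scaling on the condition number can be sketched with NumPy alone. The example below uses synthetic `attack` and `defense` columns (made-up values, not the Pokémon data) and `numpy.linalg.cond` on the design matrix; it assumes, to the best of my understanding, that this ratio of largest to smallest singular value corresponds to the "Cond. No." that statsmodels reports.

```python
import numpy as np

# Synthetic predictors (assumption): positive, far from zero, moderately correlated
rng = np.random.default_rng(0)
n_rows = 400
attack = rng.normal(80, 30, size=n_rows)
defense = 30 + 0.5 * attack + rng.normal(0, 25, size=n_rows)

def standardize(values):
    """Center at 0 and scale to unit standard deviation."""
    return (values - values.mean()) / values.std()

raw_design = np.column_stack([np.ones(n_rows), attack, defense])
scaled_design = np.column_stack(
    [np.ones(n_rows), standardize(attack), standardize(defense)]
)

cond_raw = np.linalg.cond(raw_design)
cond_scaled = np.linalg.cond(scaled_design)
sample_corr = np.corrcoef(attack, defense)[0, 1]

print("condition number, raw:   ", cond_raw)
print("condition number, scaled:", cond_scaled)
print("correlation between predictors:", sample_corr)
```

In the raw design much of the large condition number comes from the predictors sitting far from zero and on large scales, which makes them nearly collinear with the intercept column. After standardizing, what remains reflects only the correlation between the predictors themselves, which is why centering and scaling cannot rescue a model whose terms are genuinely redundant.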



In [107]:
# Here's something a little more reasonable...
model5_linear_form = 'HP ~ Attack + Defense + Speed + Legendary'
model5_linear_form += ' + Q("Sp. Def") + Q("Sp. Atk")'
model5_linear_form += ' + C(Generation) + C(Q("Type 1")) + C(Q("Type 2"))'

model5_spec = smf.ols(formula=model5_linear_form, data=pokeaman_train)
model5_fit = model5_spec.fit()
model5_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.392
Model:,OLS,Adj. R-squared:,0.313
Method:,Least Squares,F-statistic:,4.948
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,9.48e-19
Time:,21:15:25,Log-Likelihood:,-1765.0
No. Observations:,400,AIC:,3624.0
Df Residuals:,353,BIC:,3812.0
Df Model:,46,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,10.1046,14.957,0.676,0.500,-19.312,39.521
Legendary[T.True],-3.2717,4.943,-0.662,0.508,-12.992,6.449
C(Generation)[T.2],9.2938,4.015,2.315,0.021,1.398,17.189
C(Generation)[T.3],2.3150,3.915,0.591,0.555,-5.385,10.015
C(Generation)[T.4],4.8353,4.149,1.165,0.245,-3.325,12.995
C(Generation)[T.5],11.4838,3.960,2.900,0.004,3.696,19.272
C(Generation)[T.6],4.9206,4.746,1.037,0.300,-4.413,14.254
"C(Q(""Type 1""))[T.Dark]",-1.4155,6.936,-0.204,0.838,-15.057,12.226
"C(Q(""Type 1""))[T.Dragon]",0.8509,6.900,0.123,0.902,-12.720,14.422

0,1,2,3
Omnibus:,286.476,Durbin-Watson:,1.917
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5187.327
Skew:,2.807,Prob(JB):,0.0
Kurtosis:,19.725,Cond. No.,9210.0


In [108]:
yhat_model5 = model5_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model5_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model5)[0,1]**2)

'In sample' R-squared:     0.39201340835318943
'Out of sample' R-squared: 0.3001561448865205


In [109]:
# Here's something a little more reasonable...
model6_linear_form = 'HP ~ Attack + Speed + Q("Sp. Def") + Q("Sp. Atk")'
# And here we'll add the significant indicators from the previous model
# https://chatgpt.com/share/81ab88df-4f07-49f9-a44a-de0cfd89c67c
model6_linear_form += ' + I(Q("Type 1")=="Normal")'
model6_linear_form += ' + I(Q("Type 1")=="Water")'
model6_linear_form += ' + I(Generation==2)'
model6_linear_form += ' + I(Generation==5)'

model6_spec = smf.ols(formula=model6_linear_form, data=pokeaman_train)
model6_fit = model6_spec.fit()
model6_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.333
Model:,OLS,Adj. R-squared:,0.319
Method:,Least Squares,F-statistic:,24.36
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,2.25e-30
Time:,21:15:25,Log-Likelihood:,-1783.6
No. Observations:,400,AIC:,3585.0
Df Residuals:,391,BIC:,3621.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,22.8587,3.876,5.897,0.000,15.238,30.479
"I(Q(""Type 1"") == ""Normal"")[T.True]",17.5594,3.339,5.258,0.000,10.994,24.125
"I(Q(""Type 1"") == ""Water"")[T.True]",9.0301,3.172,2.847,0.005,2.794,15.266
I(Generation == 2)[T.True],6.5293,2.949,2.214,0.027,0.732,12.327
I(Generation == 5)[T.True],8.4406,2.711,3.114,0.002,3.112,13.770
Attack,0.2454,0.037,6.639,0.000,0.173,0.318
Speed,-0.1370,0.045,-3.028,0.003,-0.226,-0.048
"Q(""Sp. Def"")",0.3002,0.045,6.662,0.000,0.212,0.389
"Q(""Sp. Atk"")",0.1192,0.042,2.828,0.005,0.036,0.202

0,1,2,3
Omnibus:,271.29,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4238.692
Skew:,2.651,Prob(JB):,0.0
Kurtosis:,18.04,Cond. No.,618.0


In [110]:
yhat_model6 = model6_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model6_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2)

'In sample' R-squared:     0.3326310334310908
'Out of sample' R-squared: 0.2957246042708003


In [111]:
# And here's a slight change that seems to perhaps improve prediction...
model7_linear_form = 'HP ~ Attack * Speed * Q("Sp. Def") * Q("Sp. Atk")'
model7_linear_form += ' + I(Q("Type 1")=="Normal")'
model7_linear_form += ' + I(Q("Type 1")=="Water")'
model7_linear_form += ' + I(Generation==2)'
model7_linear_form += ' + I(Generation==5)'

model7_spec = smf.ols(formula=model7_linear_form, data=pokeaman_train)
model7_fit = model7_spec.fit()
model7_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.378
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,12.16
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,4.2000000000000004e-29
Time:,21:15:25,Log-Likelihood:,-1769.5
No. Observations:,400,AIC:,3579.0
Df Residuals:,380,BIC:,3659.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,95.1698,34.781,2.736,0.007,26.783,163.556
"I(Q(""Type 1"") == ""Normal"")[T.True]",18.3653,3.373,5.445,0.000,11.733,24.997
"I(Q(""Type 1"") == ""Water"")[T.True]",9.2913,3.140,2.959,0.003,3.117,15.466
I(Generation == 2)[T.True],7.0711,2.950,2.397,0.017,1.271,12.871
I(Generation == 5)[T.True],7.8557,2.687,2.923,0.004,2.572,13.140
Attack,-0.6975,0.458,-1.523,0.129,-1.598,0.203
Speed,-1.8147,0.554,-3.274,0.001,-2.905,-0.725
Attack:Speed,0.0189,0.007,2.882,0.004,0.006,0.032
"Q(""Sp. Def"")",-0.5532,0.546,-1.013,0.312,-1.627,0.521

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,2340000000.0


In [112]:
yhat_model7 = model7_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model7_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2)

'In sample' R-squared:     0.37818209127432423
'Out of sample' R-squared: 0.35055389262255354


In [113]:
# And here's a slight change that seems to perhaps improve prediction...
model7_linear_form_CS = 'HP ~ scale(center(Attack)) * scale(center(Speed))'
model7_linear_form_CS += ' * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))'
# We DO NOT center and scale indicator variables
model7_linear_form_CS += ' + I(Q("Type 1")=="Normal")'
model7_linear_form_CS += ' + I(Q("Type 1")=="Water")'
model7_linear_form_CS += ' + I(Generation==2)'
model7_linear_form_CS += ' + I(Generation==5)'

model7_CS_spec = smf.ols(formula=model7_linear_form_CS, data=pokeaman_train)
model7_CS_fit = model7_CS_spec.fit()
model7_CS_fit.summary().tables[-1] 
# "Cond. No." is NOW 15.4 due to centering and scaling

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,15.4


In [114]:
# "Cond. No." WAS 2,340,000,000 WITHOUT centering and scaling
model7_fit.summary().tables[-1]

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,2340000000.0


In [115]:
model7_CS_fit.summary()


0,1,2,3
Dep. Variable:,HP,R-squared:,0.378
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,12.16
Date:,"Thu, 14 Nov 2024",Prob (F-statistic):,4.2000000000000004e-29
Time:,21:15:25,Log-Likelihood:,-1769.5
No. Observations:,400,AIC:,3579.0
Df Residuals:,380,BIC:,3659.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,64.3312,1.653,38.915,0.000,61.081,67.582
"I(Q(""Type 1"") == ""Normal"")[T.True]",18.3653,3.373,5.445,0.000,11.733,24.997
"I(Q(""Type 1"") == ""Water"")[T.True]",9.2913,3.140,2.959,0.003,3.117,15.466
I(Generation == 2)[T.True],7.0711,2.950,2.397,0.017,1.271,12.871
I(Generation == 5)[T.True],7.8557,2.687,2.923,0.004,2.572,13.140
scale(center(Attack)),9.6058,1.400,6.860,0.000,6.853,12.359
scale(center(Speed)),-3.0657,1.439,-2.131,0.034,-5.895,-0.237
scale(center(Attack)):scale(center(Speed)),1.3257,1.469,0.903,0.367,-1.562,4.213
"scale(center(Q(""Sp. Def"")))",8.0176,1.512,5.304,0.000,5.046,10.990

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,15.4


# 8

# 9
Model 7 is much more complex than Model 6 (from question 7) because of its interaction terms, especially the four-way interaction term Attack:Speed:Q("Sp. Def"):Q("Sp. Atk"), which is hard to interpret. Although Model 7 has better 'out of sample' performance on the random split (0.351 versus 0.296), which suggests it might predict better, its complexity raises the concern that some of the fitted associations reflect random noise in the training data. In Model 7 several coefficients have high p-values, meaning weaker evidence for those relationships, whereas Model 6's coefficients all have low p-values, indicating stronger evidence. Model 7 is harder to trust than Model 6 because it is harder to understand.

Parsimony Principle:
Simpler models, like Model 6, often have the benefit of being more generalizable and interpretable. While Model 7 might have higher predictive performance, its complexity makes it harder to trust for real-world application.
The simpler, more parsimonious model (Model 6) is preferable when the difference in predictive performance is not drastic and the goal is to have a model that is easier to understand, explain, and maintain.

The train-test approach can identify generalizability issues: a model is fit on earlier data and then evaluated on later data as it becomes available. Model 6 might actually be better at predicting future data because it is more parsimonious, whereas Model 7 is too complex.

You can see in the results that the 'out of sample' R-squared values for the future-prediction splits are lower for Model 7 than for Model 6: 0.112 versus 0.193 when training on Generation 1, and 0.234 versus 0.263 when training on Generations 1 to 5. This suggests that Model 6 is the better model for predicting future data.

This reasoning aligns with the core principle in modeling: prioritize simplicity and generalizability over complexity unless the latter results in a significant improvement.

Although Model 7 might perform better with new data, its complexity and risk of overfitting make Model 6 a more reliable choice for general use and understanding. It's better to choose a simpler model unless a more complex one clearly improves predictions. When moving from random splits to time-based predictions, the risk of overfitting becomes more serious, especially for complex models like Model 7. So, even though Model 7 performs better, it may not be as dependable for future predictions as Model 6. This follows the principle of favoring simpler, more generalizable models unless complexity offers a big advantage.
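The time-based evaluation described here can be written as a small reusable pattern. The sketch below is an illustration on synthetic data, not the notebook's statsmodels workflow: the helper names `sequential_masks` and `out_of_sample_r2`, the six periods of 50 rows each, and the generating line are all assumptions.

```python
import numpy as np

def sequential_masks(period, last_train_period):
    """Boolean masks: train on periods up to the cutoff, test on later ones."""
    period = np.asarray(period)
    train_mask = period <= last_train_period
    return train_mask, ~train_mask

def out_of_sample_r2(x, y, train_mask, test_mask):
    """Fit OLS on the training rows and score predictions on the test rows."""
    design = np.column_stack([np.ones(len(x)), x])
    coefs, *_ = np.linalg.lstsq(design[train_mask], y[train_mask], rcond=None)
    yhat = design[test_mask] @ coefs
    return np.corrcoef(y[test_mask], yhat)[0, 1] ** 2

# Synthetic data (assumption): six "generations" with 50 rows each
rng = np.random.default_rng(1)
generation = np.repeat(np.arange(1, 7), 50)
x_seq = rng.normal(size=generation.size)
y_seq = 1.0 + 0.8 * x_seq + rng.normal(size=generation.size)

# Train on generations 1 to 5, then predict generation 6
train_mask, test_mask = sequential_masks(generation, last_train_period=5)
future_score = out_of_sample_r2(x_seq, y_seq, train_mask, test_mask)
print(train_mask.sum(), test_mask.sum(), future_score)
```

Unlike a random split, the cutoff guarantees that no information from later periods leaks into the fit, which is closer to how a model would actually be used to predict future data.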


In [118]:
model7_gen1_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model7_gen1_predict_future_fit = model7_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model7_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")

'In sample' R-squared:     0.37818209127432423 (original)
'Out of sample' R-squared: 0.35055389262255354 (original)
'In sample' R-squared:     0.5726118179916575 (gen1_predict_future)
'Out of sample' R-squared: 0.11151363459149807 (gen1_predict_future)


In [119]:
model7_gen1to5_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model7_gen1to5_predict_future_fit = model7_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model7_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.37818209127432423 (original)
'Out of sample' R-squared: 0.35055389262255354 (original)
'In sample' R-squared:     0.3904756578094536 (gen1to5_predict_future)
'Out of sample' R-squared: 0.23394915352156728 (gen1to5_predict_future)


In [120]:
model6_gen1_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model6_gen1_predict_future_fit = model6_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model6_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")

'In sample' R-squared:     0.3326310334310908 (original)
'Out of sample' R-squared: 0.2957246042708003 (original)
'In sample' R-squared:     0.4433880517727281 (gen1_predict_future)
'Out of sample' R-squared: 0.1932858534276197 (gen1_predict_future)


In [121]:
model6_gen1to5_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model6_gen1to5_predict_future_fit = model6_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model6_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.3326310334310908 (original)
'Out of sample' R-squared: 0.2957246042708003 (original)
'In sample' R-squared:     0.3351727982411482 (gen1to5_predict_future)
'Out of sample' R-squared: 0.2626269017879998 (gen1to5_predict_future)


# Chatbot summaries
https://chatgpt.com/share/6736ae9a-e430-8011-9da5-fdaac4676b5e

Here's a summary of our conversation:

Context and Main Points:
Model Building: You were working on understanding and comparing multiple linear regression models (model3, model4, model5, model6, model7) and exploring how their complexity, interpretability, and generalizability change as new predictors and interaction terms are added.
Model Evolution:
Model 3 to Model 4: Model 4 introduces interaction terms, such as Attack * Defense, which increases model complexity and introduces potential multicollinearity, as seen in the high condition number.
Model 4 to Model 5: Model 5 adds new predictors like generation and Pokémon types to improve the model.
Model 5 to Model 6: Model 6 refines the model by focusing on significant predictors, creating binary variables for specific types and generations that had the lowest p-values.
Model 6 to Model 7: Model 7 adds more interaction terms among continuous predictors (e.g., Attack:Speed:Sp. Def:Sp. Atk) and is centered and scaled to address potential multicollinearity issues. However, the complexity increases, making interpretation more challenging.
Multicollinearity and Model Stability:
We discussed how high multicollinearity can lead to unstable coefficient estimates, making model interpretation difficult. Although Model 7 had better out-of-sample performance, it was more complex, which could lead to overfitting. The condition number dropped after centering and scaling, suggesting that multicollinearity was less of a concern in Model 7 compared to earlier models.
Model Performance and Generalizability:
Despite better "out of sample" performance in Model 7, the model's complexity and weak p-values raised concerns about its generalizability and interpretability. In contrast, Model 6 had stronger evidence for its coefficients and was more interpretable, making it preferable in terms of generalizability.
We discussed the importance of model parsimony, where simpler models are often preferred unless a more complex model provides a significant improvement in predictive performance and generalizability.
Sequential Data and Future Predictions:
The conversation also delved into the issue of generalizability when predicting future data. By applying a sequential data approach, where data from future generations is used for testing, we saw that Model 7 might struggle more with generalizability than Model 6. Overfitting to the training data can reduce the model’s ability to predict unseen data effectively.
Conclusion:
The key takeaway is that simpler, more interpretable models (like Model 6) are often better for generalization and interpretability. Although Model 7 showed improved predictive performance in some cases, its complexity and weak evidence for many coefficients made it less reliable for future predictions.
In summary, while Model 7 may show better raw predictive performance, Model 6 is the preferable model when considering generalizability, interpretability, and avoiding overfitting. The simpler model tends to generalize better, especially when handling real-world sequential data.