# Model Outputs

## Model Evaluations

- ### Improving Model Fit

    - #### Stepwise Selection

        - ##### Forward Step Selection

        - ##### Backward Step Selection


## Model Accuracy

**Regression Example**

In this example, we will use a used car dataset. The dependent variable is 'selling_price' with all other variables being potential independent variables.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

car_df = pd.read_csv("UsedCars2.csv")
car_df.head()

Unnamed: 0,selling_price,year,km_driven,owners,kmpl,engine_cc,power_bhp,seats
0,130000,2007,120000,1,16.1,1298,88.2,5
1,778000,2016,70000,2,24.52,1248,88.5,7
2,500000,2012,53000,2,23.0,1396,90.0,5
3,600000,2012,72000,1,21.5,1248,88.8,5
4,1149000,2019,5000,1,17.0,1591,121.3,5


## Model Evaluations

Throughout the process of **building models**, we constantly evaluate them to find the one that best fits our data.

There are two stepwise methods for doing so: **Forward** and **Backward** selection.


##### Forward Step Selection

- Forward Stepwise begins with a model containing no x-variables (no predictors) and adds predictors to the model one at a time.
- At each step, the variable that gives the greatest additional improvement to the fit is added to the permanent model.
- After each round, a new set of candidate models is built, looking for the combination with the highest adjusted R-squared value and the lowest p-values.
- Process
    - Build all models that contain only one independent variable.
    - Identify the best model (maybe highest adj. R2 value) and then add one more variable at a time.
    - Find the new best model.
    - Repeat until you hit some stopping criteria 
        - (maybe adj. R2 stops increasing or you get p-values > 0.05). 


In [5]:
import statsmodels.api as sm

y_dependent_variable  = car_df['selling_price']
x_independet_variables_dataframe = car_df.drop(columns=['selling_price'])

x_independet_variables_dataframe.columns

Index(['year', 'km_driven', 'owners', 'kmpl', 'engine_cc', 'power_bhp',
       'seats'],
      dtype='object')

**List Comprehension**

In **Forward Step Selection**, we build one design matrix per predictor, each holding a single **x_independent_variable with a constant**, so that each individual x-variable (predictor) can be assessed against the **y_dependent_variable** in its own model.

In [6]:
# start with an empty list
X=[]

# loop over each column in x_independet_variables_dataframe
for column in x_independet_variables_dataframe.columns:

    # append that column, plus a constant column for the intercept, to our list (X)
    X.append(sm.add_constant(x_independet_variables_dataframe[column]))

In [7]:
# Same result as the loop above, written as a list comprehension:
# X = [sm.add_constant(df['year']), sm.add_constant(df['km_driven']), ...]
# where df is x_independet_variables_dataframe ('selling_price' is not included)

# Create one design matrix (predictor plus constant) per independent variable
X = [sm.add_constant(x_independet_variables_dataframe[column]) for column in x_independet_variables_dataframe.columns]

In [9]:
# Example: the design matrix at index 0 of X ('year' with its own constant)
X[0]

Unnamed: 0,const,year
0,1.0,2007
1,1.0,2016
2,1.0,2012
3,1.0,2012
4,1.0,2019
...,...,...
1590,1.0,2017
1591,1.0,2014
1592,1.0,2010
1593,1.0,1997


In [17]:
# Fit a model for X[0] as an example

# Regression model using the Ordinary Least Squares (OLS) method to produce the best-fit line
model_selling_price = sm.OLS(y_dependent_variable, X[0])

# Fit the OLS model to the data
results_selling_price = model_selling_price.fit()

# Adjusted R-squared (penalizes extra predictors, so it stays comparable as variables are added)
adj_r2_selling_price = results_selling_price.rsquared_adj

# p-values for the constant and the x-variable
pvalues_selling_price = results_selling_price.pvalues

print('R-Squared Value for X_Variable "Year"')
print(adj_r2_selling_price)
print('')
print('P-Values for constant (y-intercept) and X_Variable "Year"')
print(pvalues_selling_price)

R-Squared Value for X_Variable "Year"
0.17367044629471773

P-Values for constant (y-intercept) and X_Variable "Year"
const    7.813010e-68
year     2.981846e-68
dtype: float64
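For reference, the value returned by `rsquared_adj` follows the standard adjusted R-squared formula, which shrinks R-squared according to the number of predictors. The sketch below is illustrative only: the function name `adjusted_r2` is hypothetical and the inputs are made-up numbers, not values from the used-car dataset.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Illustrative inputs: the penalty grows with the number of predictors
print(adjusted_r2(0.5, 101, 1))
print(adjusted_r2(0.5, 101, 4))

# A predictor with almost no explanatory power can yield a slightly negative value
print(adjusted_r2(0.0003, 1595, 1))
```

This is why a single-predictor model can report an adjusted R-squared just below zero when the predictor explains essentially nothing.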


In [18]:
print(results_selling_price.summary())

                            OLS Regression Results                            
Dep. Variable:          selling_price   R-squared:                       0.174
Model:                            OLS   Adj. R-squared:                  0.174
Method:                 Least Squares   F-statistic:                     336.0
Date:                Thu, 24 Jul 2025   Prob (F-statistic):           2.98e-68
Time:                        13:45:47   Log-Likelihood:                -23950.
No. Observations:                1595   AIC:                         4.790e+04
Df Residuals:                    1593   BIC:                         4.791e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2.004e+08    1.1e+07    -18.267      0.0

**With a single-predictor model working, we can iterate through every X_variable (each with its own constant) to see its individual impact on the y_dependent_variable.**

In [None]:

# Build one OLS (Ordinary Least Squares) model per single-predictor design matrix in X
Models = [sm.OLS(y_dependent_variable, x) for x in X]

# Fit each model
Results = [model.fit() for model in Models]

# Collect the adjusted R-squared value of each fitted model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Collect the p-values of each fitted model
Pval = [results.pvalues for results in Results]

# Print one summary line per model
for i in range(len(Adj_Rsquared)):

    # P-values are printed as a tuple: (constant, predictor)
    print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {x_independet_variables_dataframe.columns[i]}')

adj_R2: 0.174, P-values: (7.813010353971015e-68, 2.9818464289444125e-68), column: year
adj_R2: 0.070, P-values: (1.913408757157036e-142, 3.5013116609690504e-27), column: km_driven
adj_R2: 0.057, P-values: (1.5186541212405294e-102, 2.433141313618206e-22), column: owners
adj_R2: 0.013, P-values: (8.935450361424721e-27, 2.088158819752608e-06), column: kmpl
adj_R2: 0.199, P-values: (9.805584670006712e-14, 5.868115795938417e-79), column: engine_cc
adj_R2: 0.579, P-values: (9.412782556137207e-130, 3.7110127230227896e-302), column: power_bhp
adj_R2: -0.000, P-values: (2.922458353967729e-06, 0.4359248539986683), column: seats


In [24]:
Results[0].pvalues

const    7.813010e-68
year     2.981846e-68
dtype: float64

From this output, we can see that the model with '**power_bhp**' had the highest adj. R-squared value. 
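Picking the winner by eye works for seven models, but the same choice can be made programmatically. The sketch below copies the adjusted R-squared values from the output above into a pandas Series; the variable names `adj_r2_by_column` and `best_column` are illustrative.

```python
import pandas as pd

# Adjusted R-squared values copied from the single-predictor output above
adj_r2_by_column = pd.Series({
    "year": 0.174,
    "km_driven": 0.070,
    "owners": 0.057,
    "kmpl": 0.013,
    "engine_cc": 0.199,
    "power_bhp": 0.579,
    "seats": -0.000,
})

# Label of the largest value, i.e. the best single predictor by adjusted R-squared
best_column = adj_r2_by_column.idxmax()
print(best_column)  # power_bhp
```

In the notebook itself, the equivalent would be `pd.Series(Adj_Rsquared, index=x_independet_variables_dataframe.columns).idxmax()`.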

Now let's try all models that consist of 'power_bhp' and another variable.

In [27]:
# Remove the chosen X_variable, which had the highest adjusted R-squared value against the y_dependent_variable ('selling_price')
remaining_variables = x_independet_variables_dataframe.drop(['power_bhp'], axis=1)

# Keep this chosen X_variable as a 'permanent' variable in the model, to be combined with each remaining X_variable
included_df = x_independet_variables_dataframe[['power_bhp']]

In [30]:
# For each remaining X_variable, build a design matrix holding the permanent 'power_bhp' column, that variable, and a constant

X = [sm.add_constant(pd.merge(included_df, remaining_variables[column], right_index=True, left_index=True))
     for column in remaining_variables.columns]
X[0]

Unnamed: 0,const,power_bhp,year
0,1.0,88.20,2007
1,1.0,88.50,2016
2,1.0,90.00,2012
3,1.0,88.80,2012
4,1.0,121.30,2019
...,...,...,...
1590,1.0,67.04,2017
1591,1.0,67.06,2014
1592,1.0,102.00,2010
1593,1.0,37.00,1997


In [None]:
# Build one OLS (Ordinary Least Squares) model per two-predictor design matrix in X
Models = [sm.OLS(y_dependent_variable,x) for x in X]

# Fit each model
Results = [model.fit() for model in Models]

# Collect the adjusted R-squared value of each fitted model
Adj_Rsquared = [results.rsquared_adj for results in Results]

# Collect the p-values of each fitted model
Pval = [results.pvalues for results in Results]

# Print a header, then one summary line per model
print('Printing a combination of the permanently chosen "X_variable" and each iterated X_variable"s statistical insights : ') 
print( ' | R-Squared | constant p-value (y-intercept) | X_variable combiation p-value')
print('')
for i in range(len(Adj_Rsquared)):

    # P-values are printed as a tuple: (constant, power_bhp, added variable)
    print(f'adj_R2: {Adj_Rsquared[i]:.3f}, P-values: {*Pval[i],}, column: {remaining_variables.columns[i]}')

Printing a combination of the permanently chosen "X_variable" and each iterated X_variable"s statistical insights : 
 | R-Squared | constant p-value (y-intercept) | X_variable combiation p-value

adj_R2: 0.639, P-values: (1.001605189519802e-55, 1.2162697843804209e-288, 7.611907554019364e-55), column: year
adj_R2: 0.630, P-values: (2.978237375054589e-67, 2.737e-321, 1.3575094351912819e-46), column: km_driven
adj_R2: 0.605, P-values: (2.0502564956273112e-46, 5.118641374948136e-303, 2.1911838330016582e-23), column: owners
adj_R2: 0.618, P-values: (2.9293658745854524e-99, 0.0, 3.620556649188509e-35), column: kmpl
adj_R2: 0.596, P-values: (1.7284162995805068e-80, 2.1729150764950284e-239, 5.051132565768442e-16), column: engine_cc
adj_R2: 0.595, P-values: (3.082726485672466e-09, 1.012762537e-314, 1.3612693239727301e-14), column: seats



It looks like 'power_bhp' with 'year' is now the "best" model (adjusted R-squared of 0.639, up from 0.579 for 'power_bhp' alone). Keep repeating this process until the adjusted R-squared stops increasing by a meaningful amount, or until a newly added independent variable turns out to be insignificant.


##### Successful Forward Stepwise Selection
Keep appending X-variables in this way until you find the combination with the best adjusted R-squared value and acceptable p-values (for example, all below 0.05).

##### Backward Stepwise Selection

- Backward Stepwise begins with a model containing all x-variables (all predictors) and removes them one at a time.
- At each step, the least significant variable (the one with the highest p-value) is removed. We then review the new model's adjusted R-squared value and the p-values of the remaining x-variables.
- Process
    - Build a model that contains all independent x-variables.
    - Identify the x-variable that has the highest p-value and remove it from the model.
    - Fit a new model and compare its adjusted R-squared value against the previous model(s).
    - Repeat until you hit some stopping criterion
        - (maybe adj. R2 stops increasing or starts to drop, or all remaining p-values are < 0.05).



In [41]:
# Run the full model with all X-variables from the DataFrame
y = car_df['selling_price']
X = car_df.drop('selling_price', axis=1)

# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)

# Build one OLS (Ordinary Least Squares) multiple regression model with all X_variables and the single constant
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          selling_price   R-squared:                       0.665
Model:                            OLS   Adj. R-squared:                  0.663
Method:                 Least Squares   F-statistic:                     449.6
Date:                Thu, 24 Jul 2025   Prob (F-statistic):               0.00
Time:                        14:22:29   Log-Likelihood:                -23231.
No. Observations:                1595   AIC:                         4.648e+04
Df Residuals:                    1587   BIC:                         4.652e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -6.48e+07   1.02e+07     -6.354      0.0

Notice the Adj. R-squared value and the p-value for each of the coefficients.

**'owners' seems to be the only variable whose coefficient has a p-value > 0.05.** Let's try removing it and running another model.

In [43]:
# Run the model again, dropping the X-variable diagnosed above ('owners') from the DataFrame
y = car_df['selling_price']
X = car_df.drop(['selling_price', 'owners'], axis=1)

# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)


# Build one OLS (Ordinary Least Squares) multiple regression model with the remaining X_variables and the single constant
model = sm.OLS(y, X)

# Fit the model
results = model.fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          selling_price   R-squared:                       0.664
Model:                            OLS   Adj. R-squared:                  0.663
Method:                 Least Squares   F-statistic:                     523.0
Date:                Thu, 24 Jul 2025   Prob (F-statistic):               0.00
Time:                        14:24:02   Log-Likelihood:                -23233.
No. Observations:                1595   AIC:                         4.648e+04
Df Residuals:                    1588   BIC:                         4.652e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -7.258e+07   9.38e+06     -7.735      0.0

The **adjusted R-squared value did not change** (it is still 0.663); however, another independent predictor variable now has a high(er) p-value.

We will remove this variable next ('engine_cc', as dropped in the following cell) to assess the change in the model's output.

In [45]:
# Run the model again, dropping both X-variables diagnosed above from the DataFrame
y = car_df['selling_price']
X = car_df.drop(['selling_price', 'owners', 'engine_cc'], axis=1)

# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)

# Build one OLS (Ordinary Least Squares) multiple regression model with the remaining X_variables and the single constant
model = sm.OLS(y, X)

# Fit the model
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          selling_price   R-squared:                       0.663
Model:                            OLS   Adj. R-squared:                  0.662
Method:                 Least Squares   F-statistic:                     625.0
Date:                Thu, 24 Jul 2025   Prob (F-statistic):               0.00
Time:                        14:26:37   Log-Likelihood:                -23235.
No. Observations:                1595   AIC:                         4.648e+04
Df Residuals:                    1589   BIC:                         4.651e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -7.221e+07   9.39e+06     -7.688      0.0

In [46]:
# Again

# Run the model once more, now dropping three X-variables ('owners', 'engine_cc', 'seats') from the DataFrame
y = car_df['selling_price']
X = car_df.drop(['selling_price', 'owners', 'engine_cc', 'seats'], axis=1)

# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)

# Build one OLS (Ordinary Least Squares) multiple regression model with the remaining X_variables and the single constant
model = sm.OLS(y, X)

# Fit the model
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          selling_price   R-squared:                       0.662
Model:                            OLS   Adj. R-squared:                  0.661
Method:                 Least Squares   F-statistic:                     777.5
Date:                Thu, 24 Jul 2025   Prob (F-statistic):               0.00
Time:                        14:28:09   Log-Likelihood:                -23238.
No. Observations:                1595   AIC:                         4.649e+04
Df Residuals:                    1590   BIC:                         4.651e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -6.609e+07   9.05e+06     -7.300      0.0

##### Successful Backward Stepwise Selection 

At this point, we'd probably just stick with this model. All coefficients have significant p-values, and the adjusted R-squared value has barely moved across the four models, slipping only from 0.663 to 0.661.

### Classification Example
#### Backward Stepwise Selection

In [47]:
car_df.head()

Unnamed: 0,selling_price,year,km_driven,owners,kmpl,engine_cc,power_bhp,seats
0,130000,2007,120000,1,16.1,1298,88.2,5
1,778000,2016,70000,2,24.52,1248,88.5,7
2,500000,2012,53000,2,23.0,1396,90.0,5
3,600000,2012,72000,1,21.5,1248,88.8,5
4,1149000,2019,5000,1,17.0,1591,121.3,5


In [None]:
import numpy as np

# Build a binary target for the classification model

# Binary dependent y-variable: 1 if selling_price is above 500,000, otherwise 0
car_df['expensive'] = np.where(car_df['selling_price'] > 500000, 1, 0)

# Keep the x-variables and the new target; drop the original price
car_exp_df = car_df.drop(['selling_price'], axis=1)

car_exp_df

Unnamed: 0,year,km_driven,owners,kmpl,engine_cc,power_bhp,seats,expensive
0,2007,120000,1,16.10,1298,88.20,5,0
1,2016,70000,2,24.52,1248,88.50,7,1
2,2012,53000,2,23.00,1396,90.00,5,0
3,2012,72000,1,21.50,1248,88.80,5,1
4,2019,5000,1,17.00,1591,121.30,5,1
...,...,...,...,...,...,...,...,...
1590,2017,12000,1,23.10,998,67.04,5,0
1591,2014,50000,1,23.59,1364,67.06,5,0
1592,2010,129000,1,12.80,2494,102.00,8,0
1593,1997,120000,1,16.10,796,37.00,4,0


In [51]:

# Run the full model
y = car_exp_df['expensive']
X  = car_exp_df.drop(['expensive'], axis=1)

# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)


# Build one logistic regression (Logit) classification model with all X_variables and the single constant
model = sm.Logit(y,X)

# Fit the Logit model
results = model.fit()


print(results.summary())

Optimization terminated successfully.
         Current function value: 0.304869
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:              expensive   No. Observations:                 1595
Model:                          Logit   Df Residuals:                     1587
Method:                           MLE   Df Model:                            7
Date:                Thu, 24 Jul 2025   Pseudo R-squ.:                  0.5592
Time:                        14:50:52   Log-Likelihood:                -486.27
converged:                       True   LL-Null:                       -1103.1
Covariance Type:            nonrobust   LLR p-value:                3.753e-262
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1405.5027    100.711    -13.956      0.000   -1602.893   -1208.113
year           0.6897      0

In [52]:

# Optional check: AIC lets us compare this model against other models fitted to the same data (lower is better)
results.aic

np.float64(988.532437024925)
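The AIC value can be reconstructed from the log-likelihood shown in each summary using the standard definition AIC = 2k - 2 ln(L), where k counts the estimated parameters (Df Model plus the constant). The sketch below applies that definition to the rounded log-likelihoods of the three Logit summaries in this section; the function name `aic` and the dictionary labels are illustrative, and small differences from `results.aic` come from that rounding.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Log-likelihoods as printed (rounded) in the three Logit summaries;
# n_params = Df Model + 1 for the constant
aic_by_model = {
    "full model": aic(-486.27, 8),
    "without km_driven": aic(-486.31, 7),
    "without km_driven and owners": aic(-490.59, 6),
}

for name, value in aic_by_model.items():
    print(f"{name}: {value:.2f}")

# The model with the lowest AIC is the preferred one by this criterion
print(min(aic_by_model, key=aic_by_model.get))
```

Because each extra parameter adds 2 to the score, a variable has to raise the log-likelihood by more than 1 to be worth keeping under AIC.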

The p-value for the **km_driven coefficient is > 0.05.**

Let's try removing that from the model.

In [53]:
# Run the model again, dropping the X-variable diagnosed above ('km_driven') from the DataFrame
y = car_exp_df['expensive']
X  = car_exp_df.drop(['expensive', 'km_driven'], axis=1)


# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)

# Build one logistic regression (Logit) classification model with the remaining X_variables and the single constant
model = sm.Logit(y,X)

# Fit the Logit model
results = model.fit()


print(results.summary())

Optimization terminated successfully.
         Current function value: 0.304897
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:              expensive   No. Observations:                 1595
Model:                          Logit   Df Residuals:                     1588
Method:                           MLE   Df Model:                            6
Date:                Thu, 24 Jul 2025   Pseudo R-squ.:                  0.5591
Time:                        14:52:33   Log-Likelihood:                -486.31
converged:                       True   LL-Null:                       -1103.1
Covariance Type:            nonrobust   LLR p-value:                2.625e-263
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1394.8604     94.087    -14.825      0.000   -1579.267   -1210.454
year           0.6844      0.

In [54]:

# Optional check: AIC lets us compare this model against other models fitted to the same data (lower is better)
results.aic

np.float64(986.6227064282698)

In [55]:
# Run the model again, now dropping both 'km_driven' and 'owners' from the DataFrame
y = car_exp_df['expensive']
X  = car_exp_df.drop(['expensive', 'km_driven', 'owners'], axis=1)


# Add a single constant column of 1's to the whole design matrix
# so the model will contain an intercept
X = sm.add_constant(X)

# Build one logistic regression (Logit) classification model with the remaining X_variables and the single constant
model = sm.Logit(y,X)

# Fit the Logit model
results = model.fit()


print(results.summary())

Optimization terminated successfully.
         Current function value: 0.307579
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:              expensive   No. Observations:                 1595
Model:                          Logit   Df Residuals:                     1589
Method:                           MLE   Df Model:                            5
Date:                Thu, 24 Jul 2025   Pseudo R-squ.:                  0.5553
Time:                        14:53:08   Log-Likelihood:                -490.59
converged:                       True   LL-Null:                       -1103.1
Covariance Type:            nonrobust   LLR p-value:                1.133e-262
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const      -1489.2419     90.123    -16.525      0.000   -1665.880   -1312.604
year           0.7311      0.

In [56]:

# Optional check: AIC lets us compare this model against other models fitted to the same data (lower is better)
results.aic

np.float64(993.1779163651544)

##### Successful Classification Backward Stepwise Selection 

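At this point, we'd probably just stick with this model. All coefficients have significant p-values, and the pseudo R-squared value (Logit reports a pseudo R-squared rather than an adjusted one) has barely changed, sitting at 0.5553 versus 0.5592 for the full model. Note that this model also has our highest **results.aic output of 993.178**; because a lower AIC is better, the AIC on its own favors the previous model (986.623, without 'km_driven').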
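The 0.5553 figure comes from the "Pseudo R-squ." field of the Logit summary, which statsmodels computes as McFadden's pseudo R-squared: one minus the ratio of the model's log-likelihood to the log-likelihood of an intercept-only model (LL-Null). A minimal sketch using the rounded values printed in the summaries above; the function name `mcfadden_pseudo_r2` is illustrative.

```python
def mcfadden_pseudo_r2(log_likelihood, null_log_likelihood):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    return 1 - log_likelihood / null_log_likelihood


# Log-Likelihood and LL-Null as printed in the Logit summaries
full_model = mcfadden_pseudo_r2(-486.27, -1103.1)
final_model = mcfadden_pseudo_r2(-490.59, -1103.1)

print(round(full_model, 4))   # 0.5592
print(round(final_model, 4))  # 0.5553
```

Unlike adjusted R-squared, this measure carries no penalty for the number of predictors, which is one reason to read it alongside AIC.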