# Modeling Your Data - Lab

## Introduction 

In this lab, you'll perform a full linear regression analysis on the Lego dataset. You'll implement the process demonstrated in the previous lesson, taking a stepwise approach to analyze and improve the model along the way.

## Objectives
You will be able to:

* Remove predictors whose p-values are too high and refit the model
* Examine and interpret the model results
* Split data into training and testing sets
* Fit a regression model to the dataset using the statsmodels library


## Build an Initial Regression Model

To start, perform a train-test split and create an initial regression model to model the `list_price` using all of your available features.

> **Note:** To write the model formula, you'll have to do some tedious manipulation of your column names. Statsmodels formulas will not accept spaces, apostrophes, or arithmetic symbols (such as `+`) in column names. Preview the names and refine them as you go.  
> **If you receive an error such as "PatsyError: error tokenizing input (maybe an unclosed string?)", you need to preprocess your column names further.**
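
One rough way to preview which names still need work is to flag every column that is not a plain Python identifier. This is a minimal sketch, not part of the original lab: `invalid_formula_names` and the `demo` frame are hypothetical, and the check assumes that any name Python accepts as an identifier is also safe inside a formula (reserved words such as `class` would still cause trouble).

```python
import pandas as pd


def invalid_formula_names(columns):
    """Return the column names that are not plain Python identifiers."""
    return [col for col in columns if not str(col).isidentifier()]


# Hypothetical miniature frame that reuses a few column names from this dataset
demo = pd.DataFrame(
    columns=[
        "piece_count",
        "ages_10+",
        "ages_1½-3",
        "theme_name_Blue's Helicopter Pursuit",
    ]
)

bad_names = invalid_formula_names(demo.columns)
print(bad_names)
```

Rerunning the check after each round of replacements shows which names are left to clean.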

In [32]:
#Your code here
!ls


CONTRIBUTING.md  index.ipynb		   LICENSE.md
index_files	 Lego_dataset_cleaned.csv  README.md


In [33]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style('darkgrid')

from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns',200)

In [34]:
df = pd.read_csv('Lego_dataset_cleaned.csv')
df.head()

Unnamed: 0,piece_count,list_price,num_reviews,play_star_rating,star_rating,val_star_rating,ages_10+,ages_10-14,ages_10-16,ages_10-21,ages_11-16,ages_12+,ages_12-16,ages_14+,ages_16+,ages_1½-3,ages_1½-5,ages_2-5,ages_4+,ages_4-7,ages_4-99,ages_5+,ages_5-12,ages_5-8,ages_6+,ages_6-12,ages_6-14,ages_7+,ages_7-12,ages_7-14,ages_8+,ages_8-12,ages_8-14,ages_9+,ages_9-12,ages_9-14,ages_9-16,theme_name_Angry Birds™,theme_name_Architecture,theme_name_BOOST,theme_name_Blue's Helicopter Pursuit,theme_name_BrickHeadz,theme_name_Carnotaurus Gyrosphere Escape,theme_name_City,theme_name_Classic,theme_name_Creator 3-in-1,theme_name_Creator Expert,theme_name_DC Comics™ Super Heroes,theme_name_DC Super Hero Girls,theme_name_DIMENSIONS™,theme_name_DUPLO®,theme_name_Dilophosaurus Outpost Attack,theme_name_Disney™,theme_name_Elves,theme_name_Friends,theme_name_Ghostbusters™,theme_name_Ideas,theme_name_Indoraptor Rampage at Lockwood Estate,theme_name_Juniors,theme_name_Jurassic Park Velociraptor Chase,theme_name_MINDSTORMS®,theme_name_Marvel Super Heroes,theme_name_Minecraft™,theme_name_Minifigures,theme_name_NEXO KNIGHTS™,theme_name_NINJAGO®,theme_name_Power Functions,theme_name_Pteranodon Chase,theme_name_SERIOUS PLAY®,theme_name_Speed Champions,theme_name_Star Wars™,theme_name_Stygimoloch Breakout,theme_name_T. rex Transport,theme_name_THE LEGO® BATMAN MOVIE,theme_name_THE LEGO® NINJAGO® MOVIE™,theme_name_Technic,country_AT,country_AU,country_BE,country_CA,country_CH,country_CZ,country_DE,country_DN,country_ES,country_FI,country_FR,country_GB,country_IE,country_IT,country_LU,country_NL,country_NO,country_NZ,country_PL,country_PT,country_US,review_difficulty_Average,review_difficulty_Challenging,review_difficulty_Easy,review_difficulty_Very Challenging,review_difficulty_Very Easy,review_difficulty_unknown
0,-0.27302,29.99,-0.398512,-0.655279,-0.045687,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
1,-0.404154,19.99,-0.398512,-0.655279,0.990651,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
2,-0.517242,12.99,-0.147162,-0.132473,-0.460222,-0.204063,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
3,0.635296,99.99,0.187972,-1.352353,0.161581,0.11783,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
4,0.288812,79.99,-0.063378,-2.049427,0.161581,-0.204063,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0


In [36]:
subs = [(' ', '_'),('.',''),("'",""),('™', ''), ('®',''),
        ('+','plus'), ('½','half'), ('-','_')
       ]
def col_formatting(col):
    for old, new in subs:
        col = col.replace(old,new)
    return col

df.columns = [col_formatting(col) for col in df.columns]

df.head()

Unnamed: 0,piece_count,list_price,num_reviews,play_star_rating,star_rating,val_star_rating,ages_10plus,ages_10_14,ages_10_16,ages_10_21,ages_11_16,ages_12plus,ages_12_16,ages_14plus,ages_16plus,ages_1half_3,ages_1half_5,ages_2_5,ages_4plus,ages_4_7,ages_4_99,ages_5plus,ages_5_12,ages_5_8,ages_6plus,ages_6_12,ages_6_14,ages_7plus,ages_7_12,ages_7_14,ages_8plus,ages_8_12,ages_8_14,ages_9plus,ages_9_12,ages_9_14,ages_9_16,theme_name_Angry_Birds,theme_name_Architecture,theme_name_BOOST,theme_name_Blues_Helicopter_Pursuit,theme_name_BrickHeadz,theme_name_Carnotaurus_Gyrosphere_Escape,theme_name_City,theme_name_Classic,theme_name_Creator_3_in_1,theme_name_Creator_Expert,theme_name_DC_Comics_Super_Heroes,theme_name_DC_Super_Hero_Girls,theme_name_DIMENSIONS,theme_name_DUPLO,theme_name_Dilophosaurus_Outpost_Attack,theme_name_Disney,theme_name_Elves,theme_name_Friends,theme_name_Ghostbusters,theme_name_Ideas,theme_name_Indoraptor_Rampage_at_Lockwood_Estate,theme_name_Juniors,theme_name_Jurassic_Park_Velociraptor_Chase,theme_name_MINDSTORMS,theme_name_Marvel_Super_Heroes,theme_name_Minecraft,theme_name_Minifigures,theme_name_NEXO_KNIGHTS,theme_name_NINJAGO,theme_name_Power_Functions,theme_name_Pteranodon_Chase,theme_name_SERIOUS_PLAY,theme_name_Speed_Champions,theme_name_Star_Wars,theme_name_Stygimoloch_Breakout,theme_name_T_rex_Transport,theme_name_THE_LEGO_BATMAN_MOVIE,theme_name_THE_LEGO_NINJAGO_MOVIE,theme_name_Technic,country_AT,country_AU,country_BE,country_CA,country_CH,country_CZ,country_DE,country_DN,country_ES,country_FI,country_FR,country_GB,country_IE,country_IT,country_LU,country_NL,country_NO,country_NZ,country_PL,country_PT,country_US,review_difficulty_Average,review_difficulty_Challenging,review_difficulty_Easy,review_difficulty_Very_Challenging,review_difficulty_Very_Easy,review_difficulty_unknown
0,-0.27302,29.99,-0.398512,-0.655279,-0.045687,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
1,-0.404154,19.99,-0.398512,-0.655279,0.990651,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
2,-0.517242,12.99,-0.147162,-0.132473,-0.460222,-0.204063,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
3,0.635296,99.99,0.187972,-1.352353,0.161581,0.11783,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
4,0.288812,79.99,-0.063378,-2.049427,0.161581,-0.204063,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0


In [24]:
# Standardize the continuous features to z-scores (mean 0, standard deviation 1)
outcome = 'list_price'
x_cols = ['piece_count', 'num_reviews', 'play_star_rating', 'star_rating', 'val_star_rating']
for col in x_cols:
    df[col] = (df[col] - df[col].mean())/df[col].std()
df.head()

Unnamed: 0,piece_count,list_price,num_reviews,play_star_rating,star_rating,val_star_rating,ages_10plus,ages_10-14,ages_10-16,ages_10-21,ages_11-16,ages_12plus,ages_12-16,ages_14plus,ages_16plus,ages_1½-3,ages_1½-5,ages_2-5,ages_4plus,ages_4-7,ages_4-99,ages_5plus,ages_5-12,ages_5-8,ages_6plus,ages_6-12,ages_6-14,ages_7plus,ages_7-12,ages_7-14,ages_8plus,ages_8-12,ages_8-14,ages_9plus,ages_9-12,ages_9-14,ages_9-16,theme_name_Angry_Birds™,theme_name_Architecture,theme_name_BOOST,theme_name_Blue's_Helicopter_Pursuit,theme_name_BrickHeadz,theme_name_Carnotaurus_Gyrosphere_Escape,theme_name_City,theme_name_Classic,theme_name_Creator_3-in-1,theme_name_Creator_Expert,theme_name_DC_Comics™_Super_Heroes,theme_name_DC_Super_Hero_Girls,theme_name_DIMENSIONS™,theme_name_DUPLO®,theme_name_Dilophosaurus_Outpost_Attack,theme_name_Disney™,theme_name_Elves,theme_name_Friends,theme_name_Ghostbusters™,theme_name_Ideas,theme_name_Indoraptor_Rampage_at_Lockwood_Estate,theme_name_Juniors,theme_name_Jurassic_Park_Velociraptor_Chase,theme_name_MINDSTORMS®,theme_name_Marvel_Super_Heroes,theme_name_Minecraft™,theme_name_Minifigures,theme_name_NEXO_KNIGHTS™,theme_name_NINJAGO®,theme_name_Power_Functions,theme_name_Pteranodon_Chase,theme_name_SERIOUS_PLAY®,theme_name_Speed_Champions,theme_name_Star_Wars™,theme_name_Stygimoloch_Breakout,theme_name_T._rex_Transport,theme_name_THE_LEGO®_BATMAN_MOVIE,theme_name_THE_LEGO®_NINJAGO®_MOVIE™,theme_name_Technic,country_AT,country_AU,country_BE,country_CA,country_CH,country_CZ,country_DE,country_DN,country_ES,country_FI,country_FR,country_GB,country_IE,country_IT,country_LU,country_NL,country_NO,country_NZ,country_PL,country_PT,country_US,review_difficulty_Average,review_difficulty_Challenging,review_difficulty_Easy,review_difficulty_Very_Challenging,review_difficulty_Very_Easy,review_difficulty_unknown
0,-0.27302,29.99,-0.398512,-0.655279,-0.045687,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
1,-0.404154,19.99,-0.398512,-0.655279,0.990651,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
2,-0.517242,12.99,-0.147162,-0.132473,-0.460222,-0.204063,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
3,0.635296,99.99,0.187972,-1.352353,0.161581,0.11783,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
4,0.288812,79.99,-0.063378,-2.049427,0.161581,-0.204063,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0


In [37]:
#Defining the problem
outcome = 'list_price'
x_cols = list(df.columns)
x_cols.remove(outcome)

In [38]:
train, test = train_test_split(df)

In [39]:
print(len(train), len(test))
train.head()

8152 2718


Unnamed: 0,piece_count,list_price,num_reviews,play_star_rating,star_rating,val_star_rating,ages_10plus,ages_10_14,ages_10_16,ages_10_21,ages_11_16,ages_12plus,ages_12_16,ages_14plus,ages_16plus,ages_1half_3,ages_1half_5,ages_2_5,ages_4plus,ages_4_7,ages_4_99,ages_5plus,ages_5_12,ages_5_8,ages_6plus,ages_6_12,ages_6_14,ages_7plus,ages_7_12,ages_7_14,ages_8plus,ages_8_12,ages_8_14,ages_9plus,ages_9_12,ages_9_14,ages_9_16,theme_name_Angry_Birds,theme_name_Architecture,theme_name_BOOST,theme_name_Blues_Helicopter_Pursuit,theme_name_BrickHeadz,theme_name_Carnotaurus_Gyrosphere_Escape,theme_name_City,theme_name_Classic,theme_name_Creator_3_in_1,theme_name_Creator_Expert,theme_name_DC_Comics_Super_Heroes,theme_name_DC_Super_Hero_Girls,theme_name_DIMENSIONS,theme_name_DUPLO,theme_name_Dilophosaurus_Outpost_Attack,theme_name_Disney,theme_name_Elves,theme_name_Friends,theme_name_Ghostbusters,theme_name_Ideas,theme_name_Indoraptor_Rampage_at_Lockwood_Estate,theme_name_Juniors,theme_name_Jurassic_Park_Velociraptor_Chase,theme_name_MINDSTORMS,theme_name_Marvel_Super_Heroes,theme_name_Minecraft,theme_name_Minifigures,theme_name_NEXO_KNIGHTS,theme_name_NINJAGO,theme_name_Power_Functions,theme_name_Pteranodon_Chase,theme_name_SERIOUS_PLAY,theme_name_Speed_Champions,theme_name_Star_Wars,theme_name_Stygimoloch_Breakout,theme_name_T_rex_Transport,theme_name_THE_LEGO_BATMAN_MOVIE,theme_name_THE_LEGO_NINJAGO_MOVIE,theme_name_Technic,country_AT,country_AU,country_BE,country_CA,country_CH,country_CZ,country_DE,country_DN,country_ES,country_FI,country_FR,country_GB,country_IE,country_IT,country_LU,country_NL,country_NO,country_NZ,country_PL,country_PT,country_US,review_difficulty_Average,review_difficulty_Challenging,review_difficulty_Easy,review_difficulty_Very_Challenging,review_difficulty_Very_Easy,review_difficulty_unknown
10400,-0.484759,12.1878,-0.370585,-0.655279,0.368848,0.439724,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
9364,-0.433027,42.6878,-0.342657,0.73887,0.576116,-0.36501,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
4226,-0.392123,48.7878,-0.286801,0.216064,0.161581,0.11783,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1509,1.762569,146.3878,9.795146,0.390333,0.783383,1.083511,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1725,1.145396,158.5878,0.048333,0.390333,0.161581,0.600671,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
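
Because `train_test_split` shuffles the rows, each run produces a different split unless you fix the seed. The sketch below shows a reproducible split; `demo_df` is a hypothetical stand-in with the same number of rows as the lab data, and `random_state=42` is an arbitrary seed chosen for illustration. The 8152/2718 split printed above is consistent with scikit-learn's default test share of 25%.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lab's DataFrame, with the same number of rows
demo_df = pd.DataFrame(
    {"piece_count": np.arange(10870), "list_price": np.arange(10870) * 0.5}
)

# Fixing random_state makes the shuffle repeatable; 42 is an arbitrary choice
train_a, test_a = train_test_split(demo_df, test_size=0.25, random_state=42)
train_b, test_b = train_test_split(demo_df, test_size=0.25, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))  # same seed, same rows
```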


In [40]:
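#Fitting the actual model
predictors = '+'.join(x_cols)
formula = outcome + "~" + predictors
# Note: data=df fits on every row (hence the 10870 observations reported below);
# pass data=train instead to keep the test set held out.
model = ols(formula=formula, data=df).fit()
model.summary()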

0,1,2,3
Dep. Variable:,list_price,R-squared:,0.864
Model:,OLS,Adj. R-squared:,0.862
Method:,Least Squares,F-statistic:,726.2
Date:,"Tue, 11 Jun 2019",Prob (F-statistic):,0.0
Time:,04:19:25,Log-Likelihood:,-54056.0
No. Observations:,10870,AIC:,108300.0
Df Residuals:,10775,BIC:,109000.0
Df Model:,94,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,64.2821,1.551,41.435,0.000,61.241,67.323
piece_count,75.7184,0.776,97.605,0.000,74.198,77.239
num_reviews,6.4270,0.590,10.888,0.000,5.270,7.584
play_star_rating,5.2682,0.542,9.717,0.000,4.205,6.331
star_rating,-1.4380,0.617,-2.331,0.020,-2.647,-0.229
val_star_rating,-8.5504,0.550,-15.545,0.000,-9.628,-7.472
ages_10plus,122.9923,5.753,21.378,0.000,111.715,134.270
ages_10_14,-23.1648,7.788,-2.975,0.003,-38.430,-7.899
ages_10_16,-11.7969,3.528,-3.343,0.001,-18.713,-4.881

0,1,2,3
Omnibus:,5896.308,Durbin-Watson:,1.468
Prob(Omnibus):,0.0,Jarque-Bera (JB):,606905.535
Skew:,1.674,Prob(JB):,0.0
Kurtosis:,39.453,Cond. No.,1.83e+16


## Remove the Uninfluential Features

Based on the initial model, remove the features that do not appear to be statistically significant (here, those with a p-value of 0.05 or higher) and rerun the model.

In [41]:
#Your code here
#Extract the p-value table from the summary and use it to subset our features
summary = model.summary()
p_table = summary.tables[1]
p_table = pd.DataFrame(p_table.data)
p_table.columns = p_table.iloc[0]
p_table = p_table.drop(0)
p_table = p_table.set_index(p_table.columns[0])
p_table['P>|t|'] = p_table['P>|t|'].astype(float)
x_cols = list(p_table[p_table['P>|t|']<0.05].index)
x_cols.remove('Intercept')
print(len(p_table), len(x_cols))
print(x_cols[:5])
p_table.head()

103 76
['piece_count', 'num_reviews', 'play_star_rating', 'star_rating', 'val_star_rating']


Unnamed: 0,coef,std err,t,P>|t|,[0.025,0.975]
,,,,,,
Intercept,64.2821,1.551,41.435,0.0,61.241,67.323
piece_count,75.7184,0.776,97.605,0.0,74.198,77.239
num_reviews,6.427,0.59,10.888,0.0,5.27,7.584
play_star_rating,5.2682,0.542,9.717,0.0,4.205,6.331
star_rating,-1.438,0.617,-2.331,0.02,-2.647,-0.229
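
Parsing the summary table works, but a fitted statsmodels result also exposes its p-values directly through the `pvalues` attribute, unrounded. The following is a minimal sketch on synthetic data rather than the Lego dataset; the `demo` frame and the `signal`, `noise`, and `target` columns are made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: `signal` drives the target, `noise` is unrelated to it
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({"signal": rng.normal(size=n), "noise": rng.normal(size=n)})
demo["target"] = 3.0 * demo["signal"] + rng.normal(size=n)

demo_model = ols("target ~ signal + noise", data=demo).fit()

# pvalues is a pandas Series indexed by term name, including 'Intercept'
pvalues = demo_model.pvalues.drop("Intercept")
kept = list(pvalues[pvalues < 0.05].index)

print(pvalues)
print(kept)
```

The same two lines that build `pvalues` and `kept` could replace the table-parsing steps when selecting `x_cols`.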


> **Comment:** After refitting, you should see that the model performance is nearly identical. Additionally, observe that the refit model flags further features as insignificant. Continue to refine the model accordingly.

In [42]:
#Your code here
#Refit model with subset features
predictors = '+'.join(x_cols)
formula = outcome + "~" + predictors
model = ols(formula=formula, data=train).fit()
model.summary()

0,1,2,3
Dep. Variable:,list_price,R-squared:,0.866
Model:,OLS,Adj. R-squared:,0.865
Method:,Least Squares,F-statistic:,697.6
Date:,"Tue, 11 Jun 2019",Prob (F-statistic):,0.0
Time:,04:22:19,Log-Likelihood:,-40357.0
No. Observations:,8152,AIC:,80870.0
Df Residuals:,8076,BIC:,81400.0
Df Model:,75,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,65.3873,2.370,27.584,0.000,60.741,70.034
piece_count,76.9834,0.869,88.597,0.000,75.280,78.687
num_reviews,7.1596,0.682,10.504,0.000,5.823,8.496
play_star_rating,5.2470,0.600,8.746,0.000,4.071,6.423
star_rating,-1.4135,0.685,-2.063,0.039,-2.756,-0.071
val_star_rating,-7.9266,0.598,-13.256,0.000,-9.099,-6.754
ages_10plus,130.4573,7.376,17.687,0.000,115.999,144.916
ages_10_14,-16.9757,8.546,-1.986,0.047,-33.729,-0.222
ages_10_16,-9.2797,4.146,-2.238,0.025,-17.407,-1.153

0,1,2,3
Omnibus:,4408.763,Durbin-Watson:,2.031
Prob(Omnibus):,0.0,Jarque-Bera (JB):,436218.301
Skew:,1.671,Prob(JB):,0.0
Kurtosis:,38.68,Cond. No.,1.53e+16


## Investigate Multicollinearity

There are still a lot of features in the current model! Chances are that some of them are strongly collinear. Begin to investigate the extent of this problem.

In [None]:
#Your code here

## Perform Another Round of Feature Selection

Subset your features again based on your findings above, then rerun the model.

In [None]:
#Your code here

## Check the Normality Assumption

Check whether the normality assumption holds for your model.

In [None]:
# Your code here

## Check the Homoscedasticity Assumption

Check whether the model's errors are indeed homoscedastic or if they violate this principle and display heteroscedasticity.

In [None]:
#Your code here
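
A scatter plot of residuals against fitted values is the usual visual check, and the Breusch-Pagan test from statsmodels gives a numerical counterpart. The sketch below is a minimal example on synthetic data whose noise grows with the predictor; the `demo` frame and its columns are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan

# Synthetic data: the noise spread grows with x, which produces a funnel shape
rng = np.random.default_rng(2)
n = 2000
demo = pd.DataFrame({"x": rng.uniform(1, 10, size=n)})
demo["y"] = 5.0 * demo["x"] + rng.normal(scale=demo["x"].to_numpy(), size=n)

demo_model = ols("y ~ x", data=demo).fit()

# Residuals vs. fitted values: a widening band indicates heteroscedasticity
fig, ax = plt.subplots()
ax.scatter(demo_model.fittedvalues, demo_model.resid, s=5, alpha=0.5)
ax.axhline(0, color="black", linewidth=1)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")

# Breusch-Pagan test: a small p-value is evidence against constant error variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    demo_model.resid, demo_model.model.exog
)
print("Breusch-Pagan p-value:", lm_pvalue)
```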

> **Comment:** This displays a fairly pronounced 'funnel' shape: errors appear to increase as the `list_price` increases. This doesn't bode well for our model. Subsetting the data to remove outliers and confining the model to this restricted domain may be necessary. A log transformation or something equivalent may also be appropriate.

## Make Additional Refinements

From here, make additional refinements to your model based on the above analysis. As you progress, continue to go back and check the assumptions for the updated model. Be sure to attempt at least 2 additional model refinements.

> **Comment:** Based on the above plots, it seems as though outliers are having a substantial impact on the model. As such, removing outliers may be appropriate. Investigating the impact of a log transformation is also worthwhile.

In [None]:
#Your code here
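
The two refinements suggested above can be sketched as follows: restrict the data to a price range that excludes extreme values, then model the logarithm of the price. This runs on synthetic data; the `demo` frame, the 95th-percentile cutoff, and the `log_price` column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic right-skewed prices that are log-linear in a single feature
rng = np.random.default_rng(3)
n = 2000
demo = pd.DataFrame({"piece_count": rng.normal(size=n)})
demo["list_price"] = np.exp(
    3.0 + 0.8 * demo["piece_count"] + rng.normal(scale=0.3, size=n)
)

# Refinement 1: keep only rows at or below the 95th percentile of price
cutoff = demo["list_price"].quantile(0.95)
subset = demo[demo["list_price"] <= cutoff].copy()

# Refinement 2: model the log of the price instead of the raw price
subset["log_price"] = np.log(subset["list_price"])

raw_model = ols("list_price ~ piece_count", data=subset).fit()
log_model = ols("log_price ~ piece_count", data=subset).fit()

print(len(demo), len(subset))
print(round(raw_model.rsquared, 3), round(log_model.rsquared, 3))
```

Keep in mind that the R-squared values of the two models describe different response variables, so rechecking the residual plots for each refinement is more informative than comparing those numbers directly.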

## Summary

Well done! As you can see, regression can be a challenging task that requires you to make decisions along the way, try alternative approaches and make ongoing refinements. These choices depend on the context and specific use cases. 