# Feature addition and reduction

As the inferential statistics notebook showed, many of our variables are strongly collinear. In this notebook I remove some redundant features and add a few new ones, informed by the EDA, that I think might help our predictions.

In [9]:
# Load necessary modules
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import seaborn as sns

In [4]:
df = pd.read_csv(r'C:\Users\songs\Desktop\Springboard Files\Capstone 2\data\Interim\train1.csv', index_col=0)
df.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,...,MarkDown5,CPI,Unemployment,Size,Type_A,Type_B,Type_C,log_revenue,IsMarkDown,Type_barplot
0,26,92,2011-08-26,87235.57,False,61.1,3.796,0.0,0.0,0.0,...,0.0,136.213613,7.767,152513,1,0,0,11.431992,False,A
1,34,22,2011-03-25,5945.97,False,53.11,3.48,0.0,0.0,0.0,...,0.0,128.616064,10.398,158114,1,0,0,9.299807,False,A
2,21,28,2010-12-03,1219.89,False,50.43,2.708,0.0,0.0,0.0,...,0.0,211.265543,8.163,140167,0,1,0,8.733889,False,B
3,8,9,2010-09-17,11972.71,False,75.32,2.582,0.0,0.0,0.0,...,0.0,214.878556,6.315,155078,1,0,0,9.738769,False,A
4,19,55,2012-05-18,8271.82,False,58.81,4.029,12613.98,0.0,11.5,...,3600.79,138.106581,8.15,203819,1,0,0,9.49264,False,A


In [6]:
df.describe()

Unnamed: 0,Store,Dept,Weekly_Sales,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,Size,Type_A,Type_B,Type_C,log_revenue
count,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0,282450.0
mean,22.193146,44.286274,15983.503944,60.113598,3.360301,2578.77754,872.126294,459.273032,1077.245617,1659.4853,171.207961,7.968076,136729.826904,0.5118,0.387371,0.100828,9.571147
std,12.782156,30.50361,22661.09825,18.446505,0.458603,6023.733269,5077.429516,5475.263454,3874.121495,4252.38421,39.160786,1.868035,61002.286891,0.499862,0.48715,0.301102,0.822145
min,1.0,1.0,-1750.0,-2.06,2.472,0.0,-265.76,-29.1,0.0,0.0,126.064,3.879,34875.0,0.0,0.0,0.0,8.08331
25%,11.0,18.0,2079.3375,46.78,2.932,0.0,0.0,0.0,0.0,0.0,132.022667,6.891,93638.0,0.0,0.0,0.0,8.863514
50%,22.0,38.0,7616.62,62.15,3.452,0.0,0.0,0.0,0.0,0.0,182.350989,7.866,140167.0,1.0,0.0,0.0,9.441973
75%,33.0,74.0,20245.8125,74.29,3.737,2789.03,1.91,4.42,423.22,2152.4,212.464799,8.572,202505.0,1.0,1.0,0.0,10.136017
max,45.0,99.0,693099.36,100.14,4.468,88646.76,104519.54,141630.61,67474.85,108519.28,227.232807,14.313,219622.0,1.0,1.0,1.0,13.456102


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 282450 entries, 0 to 282450
Data columns (total 21 columns):
Store           282450 non-null int64
Dept            282450 non-null int64
Date            282450 non-null object
Weekly_Sales    282450 non-null float64
IsHoliday       282450 non-null bool
Temperature     282450 non-null float64
Fuel_Price      282450 non-null float64
MarkDown1       282450 non-null float64
MarkDown2       282450 non-null float64
MarkDown3       282450 non-null float64
MarkDown4       282450 non-null float64
MarkDown5       282450 non-null float64
CPI             282450 non-null float64
Unemployment    282450 non-null float64
Size            282450 non-null int64
Type_A          282450 non-null int64
Type_B          282450 non-null int64
Type_C          282450 non-null int64
log_revenue     282450 non-null float64
IsMarkDown      282450 non-null bool
Type_barplot    282450 non-null object
dtypes: bool(2), float64(11), int64(6), object(2)
memory usage: 43.6+

## Combine the Markdown Columns

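As seen in the EDA notebook, the markdown columns are highly correlated with each other and are a large source of multicollinearity. Combining them into a single "Total_MarkDown" column reduces that collinearity. We can also drop the IsMarkDown column, since a total markdown of 0 generally means there was no markdown that week. One caveat: MarkDown2 and MarkDown3 contain negative values (see the minimums in the summary above), so this holds only as long as negative and positive markdowns in a week do not cancel exactly.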

In [8]:
# Sum the five markdown columns into one feature; drop the originals and the now-redundant flag
markdown_cols = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
df_totalmd = df.drop(markdown_cols + ['IsMarkDown'], axis=1)
df_totalmd['Total_MarkDown'] = df[markdown_cols].sum(axis=1)
df_totalmd.head()

Unnamed: 0,Store,Dept,Date,Weekly_Sales,IsHoliday,Temperature,Fuel_Price,CPI,Unemployment,Size,Type_A,Type_B,Type_C,log_revenue,Type_barplot,Total_MarkDown
0,26,92,2011-08-26,87235.57,False,61.1,3.796,136.213613,7.767,152513,1,0,0,11.431992,A,0.0
1,34,22,2011-03-25,5945.97,False,53.11,3.48,128.616064,10.398,158114,1,0,0,9.299807,A,0.0
2,21,28,2010-12-03,1219.89,False,50.43,2.708,211.265543,8.163,140167,0,1,0,8.733889,B,0.0
3,8,9,2010-09-17,11972.71,False,75.32,2.582,214.878556,6.315,155078,1,0,0,9.738769,A,0.0
4,19,55,2012-05-18,8271.82,False,58.81,4.029,138.106581,8.15,203819,1,0,0,9.49264,A,17931.55
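
Before discarding IsMarkDown for good, it may be worth confirming that the flag really is recoverable from the total. The sketch below is one way to run that check. It uses a small toy frame with made-up values rather than `train1.csv`, and it assumes IsMarkDown was defined as "any markdown column is nonzero", which this notebook does not confirm. The helper name `flag_mismatches` is hypothetical.

```python
import pandas as pd

MARKDOWN_COLS = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]

# Toy rows with made-up values (not taken from train1.csv).
# The third row has a negative markdown that cancels a positive one.
toy = pd.DataFrame({
    "MarkDown1": [0.0, 100.0, 50.0],
    "MarkDown2": [0.0, 0.0, -50.0],
    "MarkDown3": [0.0, 10.0, 0.0],
    "MarkDown4": [0.0, 0.0, 0.0],
    "MarkDown5": [0.0, 5.5, 0.0],
})

# Assumption: the flag means "at least one markdown column is nonzero".
toy["IsMarkDown"] = (toy[MARKDOWN_COLS] != 0).any(axis=1)
toy["Total_MarkDown"] = toy[MARKDOWN_COLS].sum(axis=1)


def flag_mismatches(frame):
    """Return rows where IsMarkDown disagrees with Total_MarkDown != 0."""
    implied_flag = frame["Total_MarkDown"] != 0
    return frame[implied_flag != frame["IsMarkDown"]]


mismatches = flag_mismatches(toy)
print(f"{len(mismatches)} row(s) where the total does not reproduce the flag")
```

If the same check on the real data returns no rows, dropping IsMarkDown loses nothing. If it does return rows, the flag carries a little information that the total alone does not.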


## Delete Type_barplot

I originally created this column to make plotting in seaborn easier. Since we no longer need those plots, we can delete it.
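
A minimal sketch of that step, shown on a toy frame with made-up rows so it runs on its own. It first checks that Type_barplot only repeats what the Type_A, Type_B and Type_C dummies already encode (which is what the sample rows above suggest, though I have not verified it for the full data), and drops the column only if so. In this notebook the same `drop` call would presumably be applied to `df_totalmd`; the names `toy`, `rebuilt` and `reduced` are illustrative.

```python
import pandas as pd

# Toy rows with made-up values, mirroring the Type columns in the data
toy = pd.DataFrame({
    "Type_A": [1, 0, 0],
    "Type_B": [0, 1, 0],
    "Type_C": [0, 0, 1],
    "Type_barplot": ["A", "B", "C"],
})

dummy_cols = ["Type_A", "Type_B", "Type_C"]

# Rebuild the letter label from the one-hot columns:
# idxmax picks the column holding the 1, then the "Type_" prefix is stripped.
rebuilt = toy[dummy_cols].idxmax(axis=1).str.replace("Type_", "", regex=False)

# The plotting column is redundant if it matches the rebuilt label everywhere
is_redundant = bool((rebuilt == toy["Type_barplot"]).all())

# Drop the column only when nothing is lost by doing so
reduced = toy.drop(columns=["Type_barplot"]) if is_redundant else toy
print(list(reduced.columns))
```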