# Feature addition and reduction

As seen in the inferential statistics notebook, many of our variables are multicollinear. In this notebook I will drop some redundant features and, drawing on the EDA, add some features that I think might be helpful for our prediction.

In [None]:
#Load necessary modules
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import seaborn as sns

In [None]:
df = pd.read_csv(r'C:\Users\songs\Desktop\Springboard Files\Capstone 2\data\Interim\train1.csv',index_col=0)
df.head()

In [None]:
df.describe()

In [None]:
df.info()

## Combine the Markdown Columns

As seen in the EDA notebook, the markdown columns are highly correlated with each other and are a large source of multicollinearity. Combining them into a single "Total_Markdown" column reduces that collinearity. We can also drop the IsMarkDown column, since a total markdown of 0 already means that there was no markdown that week.
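A minimal sketch of how the combination could look, on a toy frame rather than the real data. The column names `MarkDown1` to `MarkDown3` and the variable names here are assumptions for illustration; the real dataset's markdown columns may be named differently.

```python
import pandas as pd

# Toy frame; the MarkDown1..MarkDown3 column names are assumed for illustration.
toy = pd.DataFrame({
    "MarkDown1": [100.0, 0.0, 50.0],
    "MarkDown2": [20.0, 0.0, 0.0],
    "MarkDown3": [5.0, 0.0, 25.0],
    "IsMarkDown": [True, False, True],
})

markdown_cols = ["MarkDown1", "MarkDown2", "MarkDown3"]

# Sum the individual markdown columns row-wise into one feature.
toy["Total_Markdown"] = toy[markdown_cols].sum(axis=1)

# Drop the now-redundant columns: Total_Markdown == 0 already encodes "no markdown".
toy_totalmd = toy.drop(columns=markdown_cols + ["IsMarkDown"])
print(toy_totalmd)
```

The row-wise sum skips missing values by default, so a week with no recorded markdowns would end up with a total of 0.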

In [None]:
df_totalmd = df

## Delete Type_barplot

I had originally created this column for easier plotting in seaborn, but now that we're done plotting, we can delete the column.

Type of store is a categorical variable, and when doing regression, it's important to have a baseline for categorical variables so that we can interpret the effect of the other options in the category. I chose Type B as the baseline because it is in the middle in terms of size, so it'll be easy to see positive effects (Type A) or negative effects (Type C) relative to it.
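To illustrate the baseline idea, here is a small sketch on toy data. It assumes the `Type_A`, `Type_B`, and `Type_C` columns are one-hot indicators of a `Type` column (as `pd.get_dummies` would produce); the toy frame and its `Size` values are made up.

```python
import pandas as pd

# Toy stores; values are invented for illustration only.
toy = pd.DataFrame({"Type": ["A", "B", "C", "B"], "Size": [200, 100, 40, 110]})

# One-hot encode the store type into Type_A, Type_B, Type_C.
dummies = pd.get_dummies(toy["Type"], prefix="Type", dtype=int)

# Drop Type_B so that B becomes the baseline category.
encoded = pd.concat(
    [toy.drop(columns="Type"), dummies.drop(columns="Type_B")], axis=1
)
print(encoded)
```

A Type B store is now the row where both `Type_A` and `Type_C` are 0, so the regression coefficients on those two columns read as differences from Type B.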

In [None]:
df_nob = df_totalmd.drop(['Type_barplot','Type_B'],axis=1)
df_nob.head()

## Deciphering Year, Month, Week of Year, and Day

The year, month, week, and day could each have an impact on the revenue made. As it stands, the Date column won't work with a lot of models, so I'm going to parse it into year, month, week-of-year, and day columns.

In [None]:
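df_parsedate = df_nob.copy()
# pd.to_datetime replaces astype('datetime64'), which newer pandas versions reject without a unit
df_parsedate['Date'] = pd.to_datetime(df_parsedate['Date'])
df_parsedate['Year'] = df_parsedate['Date'].dt.year
df_parsedate['Month'] = df_parsedate['Date'].dt.month
# Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_parsedate['Week_of_year'] = df_parsedate['Date'].dt.isocalendar().week.astype(int)
df_parsedate['Day'] = df_parsedate['Date'].dt.day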
df_parsedate.head()

In [None]:
df_parsedate.info()

## Adding the Yearly Median value of each Store-Dept combination

This provides a baseline value for each store-department combination that the models can adjust for each week given the other attributes. I'm using the median instead of the mean because the mean can be skewed by extreme values, and thus won't give as clear an indicator of each store-department's typical performance.
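As a quick illustration of why the median is the more robust choice, here is a sketch on made-up numbers for a single store-department-year with one extreme week. The `Mean_Sales` column exists only for this comparison.

```python
import pandas as pd

# One store-department-year with a single holiday-like spike (invented values).
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Dept": [1, 1, 1, 1],
    "Year": [2010, 2010, 2010, 2010],
    "Weekly_Sales": [10.0, 12.0, 11.0, 100.0],
})

grouped = toy.groupby(["Store", "Dept", "Year"])["Weekly_Sales"]

# transform() broadcasts the group statistic back onto every row of the group.
toy["Mean_Sales"] = grouped.transform("mean")
toy["Median_Sales"] = grouped.transform("median")
print(toy)
```

The single spike pulls the mean up to 33.25, while the median stays at 11.5, close to what a typical week looks like.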

In [None]:
# Median weekly sales per store-department-year, named Median_Sales up front to avoid _x/_y suffixes
df_groupby = df_parsedate.groupby(['Store','Dept','Year'])['Weekly_Sales'].median().rename('Median_Sales').reset_index()
# A left merge keeps every original row, in its original order
df_median = df_parsedate.merge(df_groupby, on=['Store','Dept','Year'], how='left')
df_median.head()

In [None]:
df_median.info()

## Adding some Date variables

Some models don't work well with raw dates, so I'm going to flag the specific dates that appear to have a larger impact than others.

From the >$300,000 table in the EDA notebook, it appears that a significant portion of those sales were made in Thanksgiving week, so we'll make a new feature called "IsThanksgiving" to flag the rows from Thanksgiving weeks. Looking at the average sales per week, it appears that sales are also higher near Christmas, so I'll create a new feature called "IsChristmas" as well. After this, we can delete the Date column.
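One possible way to build such a flag is with `isin` against a list of week-ending dates, which is easy to extend if more years are added. This is a sketch on a toy frame; the `thanksgiving_weeks` name is mine, and the two dates are the Thanksgiving weeks used in this notebook.

```python
import pandas as pd

# Toy frame with a few week-ending dates.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2010-11-19", "2010-11-26", "2011-11-25", "2011-12-30"]
    )
})

# Weeks treated as Thanksgiving weeks in this notebook.
thanksgiving_weeks = pd.to_datetime(["2010-11-26", "2011-11-25"])

# True only for rows whose Date is one of the listed weeks.
toy["IsThanksgiving"] = toy["Date"].isin(thanksgiving_weeks)
print(toy)
```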

In [None]:
df_bigholidays = df_median
df_bigholidays['IsThanksgiving'] = (df_bigholidays['Date'] == '2011-11-25') | (df_bigholidays['Date'] == '2010-11-26')
df_bigholidays['IsChristmas'] = (df_bigholidays['Date'] == '2010-12-31') | (df_bigholidays['Date'] == '2011-12-30')
df_bigholidays = df_bigholidays.drop('Date',axis=1)
df_bigholidays.head()

In [None]:
df_bigholidays.info()

## Adding department 72

Looking at the EDA, department 72 appears to have higher sales than the other departments, so I'll create an "IsDept72" column.

In [None]:
df_dept72 = df_bigholidays
df_dept72['IsDept72'] = (df_bigholidays['Dept'] == 72)
df_dept72.head()

In [None]:
def fit_model(df):
    #Creating Features and Target
    X = df.drop(['Weekly_Sales','log_revenue'], axis=1).values
    y = df['Weekly_Sales'].values

    #Train Test Split (fixed random_state so the reported MAE is reproducible between runs)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    #Create the Regression model:
    linear = LinearRegression()
    linear.fit(X_train, y_train)

    #Making predictions
    y_pred = linear.predict(X_test)
    
    y_true = y_test
    
    print(mean_absolute_error(y_true, y_pred))
    
fit_model(df_dept72)

In [None]:
df_dept72.to_csv(r'C:\Users\songs\Desktop\Springboard Files\Capstone 2\data\Interim\train_all_features.csv')