# Pre-Processing and Training Data Development

I will begin by loading the necessary packages and the cleaned data.

In [1]:
# Import necessary packages

import pandas as pd
import numpy as np

In [2]:
# Load data

df = pd.read_csv(r'C:\Users\bronc\Downloads\Capstone 3\sales_data_sample(clean).csv')
df.head()

,Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size
0,0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,Small
1,1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,Small
2,2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,Medium
3,3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium
4,4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium


In [3]:
# Drop faulty Unnamed column

df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,Small
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,Small
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,Medium
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium
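An `Unnamed: 0` column typically appears when a dataframe is saved with `to_csv` while the index is still being written, and is then read back. A minimal sketch of two ways to avoid it, using an in-memory CSV and a hypothetical two-row frame rather than the actual sales file:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned sales data (illustrative values)
toy = pd.DataFrame({"Order_Number": [10107, 10121], "Sales": [2871.0, 2765.9]})

# Writing with the default index=True stores the index as an unnamed first column
csv_with_index = toy.to_csv()

# Reading it back naively produces an 'Unnamed: 0' column
naive = pd.read_csv(io.StringIO(csv_with_index))

# Option 1: treat the first column as the index when reading
as_index = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index in the first place
csv_no_index = toy.to_csv(index=False)
clean = pd.read_csv(io.StringIO(csv_no_index))

print(list(naive.columns))
print(list(as_index.columns))
print(list(clean.columns))
```

Either option would make the explicit `drop` unnecessary, assuming the extra column really is a saved index.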


### Create dummy variables

Now that the data is loaded and properly formatted, the first pre-processing step is to create dummy variables for the categorical variables that will be included in the model.

In [4]:
# Check value counts of variables to add
df['QTR_ID'].value_counts()

4    1056
1     634
3     503
2     424
Name: QTR_ID, dtype: int64

In [5]:
df['Month_ID'].value_counts()

11    583
10    293
2     224
1     216
3     194
8     191
12    180
5     172
9     171
4     151
7     141
6     101
Name: Month_ID, dtype: int64

In [6]:
df['Year_ID'].value_counts()

2004    1287
2003     976
2005     354
Name: Year_ID, dtype: int64

In [7]:
df['Product_Line'].value_counts()

Classic Cars        914
Vintage Cars        557
Motorcycles         324
Trucks and Buses    281
Planes              271
Ships               195
Trains               75
Name: Product_Line, dtype: int64

In [8]:
df['Product_Code'].value_counts()

S18_3232    49
S32_2509    27
S24_2840    27
S50_1392    27
S10_1949    27
            ..
S18_3029    20
S18_2248    19
S18_4409    19
S18_4933    19
S18_1749    19
Name: Product_Code, Length: 109, dtype: int64

In [9]:
df['Customer_Name'].value_counts()

Euro Shopping Channel           213
Mini Gifts Distributors Ltd.    178
Australian Collectors, Co.       55
AV Stores, Co.                   51
Muscle Machine Inc               48
                               ... 
Auto-Moto Classics Inc.           8
Royale Belge                      8
Atelier graphique                 7
Mini Auto Werke                   7
Boards & Toys Co.                 3
Name: Customer_Name, Length: 92, dtype: int64

In [10]:
df['City'].value_counts()

Madrid        258
San Rafael    178
NYC           138
Singapore      79
Paris          70
             ... 
Burbank        13
Lule           13
Newark          9
Charleroi       8
Graz            7
Name: City, Length: 73, dtype: int64

In [11]:
df['Country'].value_counts()

USA            935
France         301
Spain          296
Australia      167
UK             130
Italy          113
Finland         92
Norway          85
Singapore       79
Canada          70
Germany         62
Denmark         52
Japan           52
Austria         47
Sweden          35
Switzerland     31
Belgium         28
Philippines     26
Ireland         16
Name: Country, dtype: int64

Some of these variables have many distinct values and will therefore produce many dummy columns. I will keep every dummy for all of them except City, where I will look for a sensible cutoff below which cities are grouped into an "Other" category instead.
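The number of dummy columns each variable would create can also be read off in one call rather than one `value_counts()` per column. A small sketch on a hypothetical toy frame (the values are illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in for a few of the categorical columns
toy = pd.DataFrame({
    "QTR_ID": [1, 2, 3, 4, 4, 1],
    "Country": ["USA", "France", "France", "USA", "Spain", "USA"],
    "City": ["NYC", "Reims", "Paris", "Pasadena", "Madrid", "NYC"],
})

# Number of distinct levels per column = number of dummy columns each would create
levels = toy.nunique().sort_values(ascending=False)
print(levels)
```

On the real dataframe the same call would be applied to the list of candidate columns.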

In [12]:
City = df['City'].value_counts() > 15

In [13]:
City.sum()

62

Grouping the cities with 15 or fewer sales removes the bottom 11 cities (73 - 62 = 11). I am comfortable folding these into an "Other" category, so the next step is to relabel them as "Other".

In [14]:
# Cities with 15 or fewer sales, relabelled as 'Other' in a single call
rare_cities = ['Lule', 'Burbank', 'Newark', 'Charleroi', 'Graz', 'South Brisbane',
               'Sevilla', 'Liverpool', 'Munich', 'Los Angeles', 'Brisbane']
df['City'] = df['City'].replace(rare_cities, 'Other')
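The list of rare cities can also be derived from the counts instead of being typed by hand. A minimal sketch, assuming a toy city column and a hypothetical cutoff of 2 (the notebook's cutoff is 15):

```python
import pandas as pd

# Toy city column (illustrative counts, not the real data)
city = pd.Series(["Madrid"] * 5 + ["NYC"] * 4 + ["Graz"] * 2 + ["Newark"] * 1)

threshold = 2  # hypothetical cutoff for this toy example

counts = city.value_counts()
rare = counts[counts <= threshold].index  # labels at or below the cutoff

# Keep labels that are not rare; replace every rare label with 'Other'
city_grouped = city.where(~city.isin(rare), "Other")

print(city_grouped.value_counts())
```

This keeps the relabelling in sync with the cutoff if the threshold is changed later.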

In [15]:
df['City'].value_counts()

Madrid          258
San Rafael      178
Other           138
NYC             138
Singapore        79
               ... 
Versailles       18
Glen Waverly     18
New Haven        17
Pasadena         17
Dublin           16
Name: City, Length: 63, dtype: int64

Perfect! The City column now has 62 cities plus one "Other" category.

Now it's time to create and integrate the dummy variables. I create the City and Customer_Name dummies last because I am not sure I will use these columns in the models, so they are left out of the baseline model and the train/test split. Creating the dummies now means I can always add them in later.

In [16]:
dummy_Q = pd.get_dummies(df['QTR_ID'])
dummy_M = pd.get_dummies(df['Month_ID'])
dummy_Y = pd.get_dummies(df['Year_ID'])
dummy_PL = pd.get_dummies(df['Product_Line'])
dummy_PC = pd.get_dummies(df['Product_Code'])
dummy_Co = pd.get_dummies(df['Country'])
dummy_Ci = pd.get_dummies(df['City'])
dummy_CN = pd.get_dummies(df['Customer_Name'])
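One thing to watch with the calls above: QTR_ID and Month_ID both produce dummy columns named with bare integers, so the labels 1 to 4 end up appearing twice once both are concatenated. A sketch of how a `prefix` avoids this, on a hypothetical three-row frame (the prefixes `QTR` and `Month` are my own choice):

```python
import pandas as pd

toy = pd.DataFrame({"QTR_ID": [1, 2, 4], "Month_ID": [2, 5, 10]})

# Without a prefix, both sets of dummies use bare integers as column names
q_plain = pd.get_dummies(toy["QTR_ID"])
m_plain = pd.get_dummies(toy["Month_ID"])
overlap = set(q_plain.columns) & set(m_plain.columns)

# With a prefix every dummy column gets a unique, self-describing name
q = pd.get_dummies(toy["QTR_ID"], prefix="QTR", dtype=int)
m = pd.get_dummies(toy["Month_ID"], prefix="Month", dtype=int)
combined = pd.concat([toy, q, m], axis=1)

print(overlap)
print(list(combined.columns))
```

Passing `dtype=int` keeps the 0/1 display used in this notebook, since newer pandas versions return boolean dummies by default.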

In [17]:
dummy_Q.sample(5)

,1,2,3,4
2219,1,0,0,0
708,1,0,0,0
596,0,0,0,1
2575,0,1,0,0
1219,0,1,0,0


In [18]:
dummy_M.sample(5)

,1,2,3,4,5,6,7,8,9,10,11,12
1664,0,0,0,0,0,0,0,0,1,0,0,0
2494,0,0,0,0,1,0,0,0,0,0,0,0
387,0,1,0,0,0,0,0,0,0,0,0,0
1393,0,0,1,0,0,0,0,0,0,0,0,0
2380,0,0,0,0,0,0,0,1,0,0,0,0


In [19]:
dummy_Y.sample(5)

,2003,2004,2005
16,0,1,0
2193,0,0,1
1544,0,1,0
88,0,1,0
514,0,1,0


In [20]:
dummy_PL.sample(5)

,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
388,1,0,0,0,0,0,0
714,0,0,0,0,0,1,0
560,0,0,0,0,0,0,1
299,1,0,0,0,0,0,0
125,1,0,0,0,0,0,0


In [21]:
dummy_PC.sample(5)

,S10_1678,S10_1949,S10_2016,S10_4698,S10_4757,S10_4962,S12_1099,S12_1108,S12_1666,S12_2823,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
690,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
174,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
603,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1617,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1721,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
dummy_Co.sample(5)

,Australia,Austria,Belgium,Canada,Denmark,Finland,France,Germany,Ireland,Italy,Japan,Norway,Philippines,Singapore,Spain,Sweden,Switzerland,UK,USA
1312,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
429,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
393,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
1813,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
382,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [23]:
dummy_Ci.sample(5)

,Aaarhus,Allentown,Barcelona,Bergamo,Bergen,Boras,Boston,Brickhaven,Bridgewater,Bruxelles,...,San Rafael,Singapore,Stavern,Strasbourg,Torino,Toulouse,Tsawassen,Vancouver,Versailles,White Plains
137,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2130,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1060,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1664,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2504,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
dummy_CN.sample(5)

,"AV Stores, Co.",Alpha Cognac,Amica Models & Co.,"Anna's Decorations, Ltd",Atelier graphique,"Australian Collectables, Ltd","Australian Collectors, Co.","Australian Gift Network, Co",Auto Assoc. & Cie.,Auto Canal Petit,...,"Tokyo Collectables, Ltd","Toms Spezialitten, Ltd","Toys of Finland, Co.",Toys4GrownUps.com,"UK Collectables, Ltd.","Vida Sport, Ltd",Vitachrome Inc.,"Volvo Model Replicas, Co",West Coast Collectables Co.,giftsbymail.co.uk
1700,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
746,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1246,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2118,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


All of these appear to have run successfully. Next, they will be concatenated onto the original dataframe.

In [25]:
df = pd.concat([df, dummy_Q], axis = 1)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,...,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size,1,2,3,4
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,95,S10_1678,Land of Toys Inc.,NYC,USA,Small,1,0,0,0
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,95,S10_1678,Reims Collectables,Reims,France,Small,0,1,0,0
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,95,S10_1678,Lyon Souveniers,Paris,France,Medium,0,0,1,0
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium,0,0,1,0
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,...,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium,0,0,0,1


In [26]:
df = pd.concat([df, dummy_M], axis = 1)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,...,3,4,5,6,7,8,9,10,11,12
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,0,0,0,0,0,0,0,0,0,0
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,0,0,1,0,0,0,0,0,0,0
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,0,0,0,0,1,0,0,0,0,0
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,0,0,0,0,0,1,0,0,0,0
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,...,0,0,0,0,0,0,0,1,0,0


In [27]:
df = pd.concat([df, dummy_Y], axis = 1)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,...,6,7,8,9,10,11,12,2003,2004,2005
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,0,0,0,0,0,0,0,1,0,0
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,0,0,0,0,0,0,0,1,0,0
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,0,1,0,0,0,0,0,1,0,0
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,0,0,1,0,0,0,0,1,0,0
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,...,0,0,0,0,1,0,0,1,0,0


In [28]:
df = pd.concat([df, dummy_PL], axis = 1)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,...,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,1,0,0,0,1,0,0,0,0,0
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,1,0,0,0,1,0,0,0,0,0
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,1,0,0,0,1,0,0,0,0,0
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,1,0,0,0,1,0,0,0,0,0
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,...,1,0,0,0,1,0,0,0,0,0


In [29]:
df = pd.concat([df, dummy_PC], axis = 1)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,0,0,0,0,0,0,0,0,0,0
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,0,0,0,0,0,0,0,0,0,0
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,0,0,0,0,0,0,0,0,0,0
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,0,0,0,0,0,0,0,0,0,0
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,...,0,0,0,0,0,0,0,0,0,0


In [30]:
df = pd.concat([df, dummy_Co], axis = 1)
df.head()

,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,Status,QTR_ID,Month_ID,Year_ID,...,Italy,Japan,Norway,Philippines,Singapore,Spain,Sweden,Switzerland,UK,USA
0,10107,30,95.7,2,2871.0,2003-02-24,Shipped,1,2,2003,...,0,0,0,0,0,0,0,0,0,1
1,10121,34,81.35,5,2765.9,2003-05-07,Shipped,2,5,2003,...,0,0,0,0,0,0,0,0,0,0
2,10134,41,94.74,2,3884.34,2003-07-01,Shipped,3,7,2003,...,0,0,0,0,0,0,0,0,0,0
3,10145,45,83.26,6,3746.7,2003-08-25,Shipped,3,8,2003,...,0,0,0,0,0,0,0,0,0,1
4,10159,49,106.23,14,5205.27,2003-10-10,Shipped,4,10,2003,...,0,0,0,0,0,0,0,0,0,1
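The six concatenations can also be done in one `pd.concat` call, and it is worth checking afterwards whether any column label occurs more than once. A minimal sketch with hypothetical three-row frames standing in for the real dummy frames:

```python
import pandas as pd

base = pd.DataFrame({"Sales": [2871.0, 2765.9, 3884.34]})
dummy_a = pd.DataFrame({1: [1, 0, 0], 2: [0, 1, 0]})   # e.g. quarter dummies
dummy_b = pd.DataFrame({2: [1, 0, 0], 5: [0, 1, 0]})   # e.g. month dummies

# One concat call instead of one per dummy frame
wide = pd.concat([base, dummy_a, dummy_b], axis=1)

# Labels that occur more than once; selecting wide[2] would return two columns
dupes = wide.columns[wide.columns.duplicated()].tolist()
print(wide.shape, dupes)
```

Duplicate labels do not break positional selection with `iloc`, but they make label-based selection ambiguous.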


As stated above, the City and Customer_Name dummies will be added later if they appear to be necessary for the models. I would like to try the baseline model without them, so I'll hold off for now.

### Scaling

For this part I'll be using sklearn's StandardScaler

In [31]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

Next I'll create the subset of our data that the model will be using and assign it to X

In [32]:
X1 = df[['Price_Each']]
X = df.iloc[:, 7:172]

Here I chose only Price_Each because Price_Each * Quantity_Ordered = Sales. Including both would hand the model the two factors that define the target, so I need to use only one of the two. Price_Each has a higher correlation with Sales, so I will try that first.
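The identity can be checked directly. The sketch below uses only the five rows displayed by `df.head()` earlier in the notebook, so the correlations it prints say nothing reliable about the full dataset; on the real data the same two lines would be run on `df`.

```python
import numpy as np
import pandas as pd

# First five rows as displayed by df.head() earlier in the notebook
head = pd.DataFrame({
    "Quantity_Ordered": [30, 34, 41, 45, 49],
    "Price_Each": [95.7, 81.35, 94.74, 83.26, 106.23],
    "Sales": [2871.0, 2765.9, 3884.34, 3746.7, 5205.27],
})

# Sales should equal price times quantity, up to floating-point rounding
product = head["Price_Each"] * head["Quantity_Ordered"]
identity_holds = np.allclose(product, head["Sales"])

# Correlation of each candidate feature with the target (five rows only)
corr = head[["Price_Each", "Quantity_Ordered"]].corrwith(head["Sales"])
print(identity_holds)
print(corr)
```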

In [33]:
X = pd.concat([X1, X], axis = 1)
X.head()

,Price_Each,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,...,Italy,Japan,Norway,Philippines,Singapore,Spain,Sweden,Switzerland,UK,USA
0,95.7,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,...,0,0,0,0,0,0,0,0,0,1
1,81.35,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,...,0,0,0,0,0,0,0,0,0,0
2,94.74,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,...,0,0,0,0,0,0,0,0,0,0
3,83.26,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,...,0,0,0,0,0,0,0,0,0,1
4,106.23,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,...,0,0,0,0,0,0,0,0,0,1


Next, the original categorical columns must be dropped so that only numeric columns remain for scaling.

In [34]:
# Drop the original categorical (text) columns in one call
X.drop(columns = ['Product_Line', 'Product_Code', 'Customer_Name', 'City', 'Country', 'Deal_Size'], inplace = True)
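An alternative to naming each text column is to keep only the numeric ones. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "Price_Each": [95.7, 81.35],
    "Product_Line": ["Motorcycles", "Motorcycles"],
    "Country": ["USA", "France"],
    "USA": [1, 0],
})

# Keep only numeric columns; text columns are dropped without being listed by name
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```

If the dummies were stored as booleans rather than 0/1 integers, I believe `include=["number", "bool"]` would be needed to keep them.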

Now that this is all set up, it's time to scale.

In [35]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [36]:
X_scaled = scaler.transform(X)
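The cells above fit the scaler on every row, so the test rows contribute to the means and standard deviations. A common alternative, shown here only as a sketch on synthetic data and not as what this notebook does, is to split first and fit the scaler on the training rows alone:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.normal(size=200)                             # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# Learn means and standard deviations from the training rows only
demo_scaler = StandardScaler().fit(X_tr)

# Apply the same transformation to both splits
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to, but not exactly, zero
```

All variable names in this sketch are hypothetical and chosen so they do not clash with the notebook's own `X`, `scaler`, or `X_scaled`.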

Since the first model will be a linear regression fitted with statsmodels, which does not add an intercept automatically, a constant column must be added.

In [37]:
import statsmodels.api as sm
X_scaled = sm.add_constant(X_scaled)
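To see what the constant column does, here is a plain NumPy sketch on synthetic, centred features (the coefficients 3500, 1500 and -20 are made up for illustration). With centred features the fitted intercept equals the mean of the target, and leaving the constant out pushes that mean into the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
X_c = rng.normal(size=(100, 2))
X_c = X_c - X_c.mean(axis=0)            # centre the features, as scaling does
y_c = 3500.0 + X_c @ np.array([1500.0, -20.0]) + rng.normal(scale=5.0, size=100)

# Prepend a column of ones: this is the role of the constant
X_const = np.column_stack([np.ones(len(X_c)), X_c])

beta_with, *_ = np.linalg.lstsq(X_const, y_c, rcond=None)
beta_without, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)

print(beta_with[0], y_c.mean())          # intercept matches the mean of y
resid_without = y_c - X_c @ beta_without
print(resid_without.mean())              # residuals are offset by the mean of y
```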

### Train Test Split

Next, it's time to split the data for our baseline model. First we need to define y as the Sales column.

In [38]:
y = df[['Sales']]

Now to perform the actual split. I'll be using a 70/30 split of training to testing size

In [39]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

Let's check that this worked by looking at the sizes of the training and test splits.

In [40]:
X_train.shape

(1831, 160)

In [41]:
X_test.shape

(786, 160)

In [42]:
786/(786+1831)

0.30034390523500193

This rounds to 30% for the test split, so the train split is right around 70%.
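The split is not exactly 30% because 0.3 * 2617 is not a whole number. As far as I understand, `train_test_split` rounds the test size up and gives the remainder to the training set. A quick check on a dummy array of the same length:

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1831 + 786          # total rows, from the two shapes above
n_test = math.ceil(0.3 * n_rows)
n_train = n_rows - n_test

# Cross-check against train_test_split on a dummy array of the same length
tr, te = train_test_split(np.arange(n_rows), test_size=0.3, random_state=123)
print(n_train, n_test, len(tr), len(te))
```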

### 1st Model

Now all that's left is to run a preliminary model and see how it performs. I won't do any tuning yet; that is saved for the next models as I try to find the best one for this dataset. I will define success as an R-squared of at least 0.8.

In [43]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

Model1 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model1_results = Model1.fit()

In [44]:
Model1_results.summary()

Dep. Variable:,Sales,R-squared:,0.696
Model:,OLS,Adj. R-squared:,0.671
Method:,Least Squares,F-statistic:,27.64
Date:,"Thu, 06 May 2021",Prob (F-statistic):,0.0
Time:,20:11:17,Log-Likelihood:,-15293.0
No. Observations:,1831,AIC:,30870.0
Df Residuals:,1690,BIC:,31650.0
Df Model:,140,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,3573.9714,25.288,141.332,0.000,3524.373,3623.570
x1,1520.5496,40.405,37.633,0.000,1441.301,1599.798
x2,-2.0287,6.743,-0.301,0.764,-15.254,11.197
x3,0.8058,7.338,0.110,0.913,-13.587,15.199
x4,32.2801,12.895,2.503,0.012,6.988,57.572
x5,3.7544,19.534,0.192,0.848,-34.560,42.068
x6,-12.8572,9.104,-1.412,0.158,-30.713,4.999
x7,28.4936,11.351,2.510,0.012,6.231,50.756
x8,-5.0948,11.644,-0.438,0.662,-27.933,17.743

Omnibus:,81.305,Durbin-Watson:,2.051
Prob(Omnibus):,0.0,Jarque-Bera (JB):,178.001
Skew:,0.274,Prob(JB):,2.23e-39
Kurtosis:,4.426,Cond. No.,4.39e+16


This model can definitely be improved upon. The R-squared is currently 0.696 (adjusted 0.671), which isn't terrible but falls short of the 0.8 target. In addition, many of the p-values for individual variables are well above 0.05, which means those coefficients are not statistically significant at the 5% level.
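Two numbers in the summary are also worth a look: Df Model is 140 although the design matrix has 159 predictors plus the constant, and the condition number is 4.39e+16. Both are consistent with some columns being exact linear combinations of others, which is what happens when every dummy level is kept alongside a constant. In that situation individual coefficients and p-values are not uniquely determined, so they may be an unreliable basis for dropping variables. A small sketch of the effect on a hypothetical three-level category:

```python
import numpy as np
import pandas as pd

cat = pd.Series(["a", "b", "c", "a", "b", "c", "a", "b"])

# All levels plus a constant: the dummy columns sum to the constant column
full = pd.get_dummies(cat, dtype=float)
X_full = np.column_stack([np.ones(len(cat)), full.to_numpy()])

# Dropping one level per variable removes the exact linear dependence
reduced = pd.get_dummies(cat, drop_first=True, dtype=float)
X_reduced = np.column_stack([np.ones(len(cat)), reduced.to_numpy()])

print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 4 columns, rank 3
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 3 columns, rank 3
```

Using `drop_first=True` is one possible remedy; keeping QTR_ID alongside the month dummies, or Product_Line alongside Product_Code, may introduce similar dependencies.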

### Extended Modeling Plan

Models:

1) Rerun the first model after removing variables with a p-value over 0.05
2) See results with a Random Forest Regressor model
3) Use a Ridge regression model (L2 regularization)
4) For each of the above models, experiment with different combinations of variables and parameter tuning to try to achieve the desired R-squared
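A rough sketch of how the candidate models in this plan could be compared on equal terms, using cross-validated R-squared. It runs on synthetic data from `make_regression`, and the hyperparameters (`alpha=1.0`, `n_estimators=50`) are placeholders rather than tuned choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the scaled sales features
X_syn, y_syn = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated R-squared for each candidate
scores = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On the real data, `X_syn` and `y_syn` would be replaced by the training features and Sales, and the 0.8 threshold applied to the resulting scores.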