# Pre-Processing and Training Data Development

I will begin by loading the necessary packages and the cleaned data.

In [1]:
# Import necessary packages

import pandas as pd
import numpy as np

In [2]:
# Load data

df = pd.read_csv(r'C:\Users\bronc\Downloads\Capstone 3\sales_data_sample(clean).csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size
0,0,10107,30,95.7,2,2871.0,2003-02-24,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,Small
1,1,10121,34,81.35,5,2765.9,2003-05-07,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,Small
2,2,10134,41,94.74,2,3884.34,2003-07-01,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,Medium
3,3,10145,45,83.26,6,3746.7,2003-08-25,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium
4,4,10159,49,106.23,14,5205.27,2003-10-10,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium


In [3]:
# Drop faulty Unnamed column

df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size
0,10107,30,95.7,2,2871.0,2003-02-24,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,Small
1,10121,34,81.35,5,2765.9,2003-05-07,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,Small
2,10134,41,94.74,2,3884.34,2003-07-01,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,Medium
3,10145,45,83.26,6,3746.7,2003-08-25,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium
4,10159,49,106.23,14,5205.27,2003-10-10,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium


### Aggregate Data

The first step of pre-processing is to aggregate the data. Since I want to predict sales for a new quarter, I will aggregate total sales by quarter and product code. I will therefore drop the Customer_Name, City, Country and Deal_Size columns, since they cannot be aggregated meaningfully at that level, and Month_ID, since I am focusing on quarterly data.

In [4]:
# Drop unnecessary columns
df.drop(columns = ['Month_ID', 'Customer_Name', 'City', 'Country', 'Deal_Size'], inplace = True)

In [5]:
#Verify change
df.head()

Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Year_ID,Product_Line,MSRP,Product_Code
0,10107,30,95.7,2,2871.0,2003-02-24,1,2003,Motorcycles,95,S10_1678
1,10121,34,81.35,5,2765.9,2003-05-07,2,2003,Motorcycles,95,S10_1678
2,10134,41,94.74,2,3884.34,2003-07-01,3,2003,Motorcycles,95,S10_1678
3,10145,45,83.26,6,3746.7,2003-08-25,3,2003,Motorcycles,95,S10_1678
4,10159,49,106.23,14,5205.27,2003-10-10,4,2003,Motorcycles,95,S10_1678


Next I'll delete the data from Q2 of 2005, as it was discovered to be incomplete during EDA. With it removed, I can use my model to predict the entire quarter's sales.

In [6]:
df = df.drop(df[(df['Year_ID'] == 2005) & (df['QTR_ID'] == 2)].index)
df

Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Year_ID,Product_Line,MSRP,Product_Code
0,10107,30,95.70,2,2871.00,2003-02-24,1,2003,Motorcycles,95,S10_1678
1,10121,34,81.35,5,2765.90,2003-05-07,2,2003,Motorcycles,95,S10_1678
2,10134,41,94.74,2,3884.34,2003-07-01,3,2003,Motorcycles,95,S10_1678
3,10145,45,83.26,6,3746.70,2003-08-25,3,2003,Motorcycles,95,S10_1678
4,10159,49,106.23,14,5205.27,2003-10-10,4,2003,Motorcycles,95,S10_1678
...,...,...,...,...,...,...,...,...,...,...,...
2612,10315,40,55.69,5,2227.60,2004-10-29,4,2004,Ships,54,S72_3212
2613,10337,42,97.16,5,4080.72,2004-11-21,4,2004,Ships,54,S72_3212
2614,10350,20,112.22,15,2244.40,2004-12-02,4,2004,Ships,54,S72_3212
2615,10373,29,137.19,1,3978.51,2005-01-31,1,2005,Ships,54,S72_3212


Next I will perform the aggregation. I am purposely leaving out Quantity_Ordered: since Quantity_Ordered * Price_Each = Sales, keeping it alongside Price_Each would hand the model the target almost directly. I am also leaving out Order_Number, Order_Line_Number and Order_Date, as they are not needed for the model.

In [7]:
df_agg = df.groupby(['Year_ID', 'QTR_ID', 'Product_Code'],as_index=False).agg({'Price_Each' : 'mean', 'Sales' : 'sum', 'MSRP' : 'mean', 'Product_Line' : 'min'})
df_agg

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,1,S10_1678,95.700,2871.00,95,Motorcycles
1,2003,1,S10_1949,228.230,12613.73,214,Classic Cars
2,2003,1,S10_2016,99.910,3896.49,118,Motorcycles
3,2003,1,S10_4698,224.650,6065.55,193,Motorcycles
4,2003,1,S10_4757,144.160,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
971,2005,1,S700_3505,102.260,8468.20,100,Ships
972,2005,1,S700_3962,107.035,7815.32,99,Ships
973,2005,1,S700_4002,71.490,5647.86,74,Planes
974,2005,1,S72_1253,52.595,2991.73,49,Planes


Now that the data is aggregated I want to add a column for previous quarter's sales. This column will give the model data on the previous quarter to help it identify trends and strengthen its predictions. First I will create separate dataframes for each quarter

In [8]:
Year = [2003]
Quarter = [1]
df_2003_Q1 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2003_Q1.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,1,S10_1678,95.7,2871.0,95,Motorcycles
1,2003,1,S10_1949,228.23,12613.73,214,Classic Cars
2,2003,1,S10_2016,99.91,3896.49,118,Motorcycles
3,2003,1,S10_4698,224.65,6065.55,193,Motorcycles
4,2003,1,S10_4757,144.16,7208.0,136,Classic Cars


In [9]:
Year = [2003]
Quarter = [2]
df_2003_Q2 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2003_Q2.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
109,2003,2,S10_1678,81.35,2765.9,95,Motorcycles
110,2003,2,S10_1949,192.87,7329.06,214,Classic Cars
111,2003,2,S10_2016,96.34,2793.86,118,Motorcycles
112,2003,2,S10_4698,201.41,9264.86,193,Motorcycles
113,2003,2,S10_4757,121.04,9403.04,136,Classic Cars


In [10]:
Year = [2003]
Quarter = [3]
df_2003_Q3 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2003_Q3.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
218,2003,3,S10_1678,89.0,7631.04,95,Motorcycles
219,2003,3,S10_1949,221.8,18367.6,214,Classic Cars
220,2003,3,S10_2016,131.43,8500.72,118,Motorcycles
221,2003,3,S10_4698,191.72,12200.36,193,Motorcycles
222,2003,3,S10_4757,114.24,5597.76,136,Classic Cars


In [11]:
Year = [2003]
Quarter = [4]
df_2003_Q4 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2003_Q4.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
327,2003,4,S10_1678,100.486,18863.66,95,Motorcycles
328,2003,4,S10_1949,213.442,34602.88,214,Classic Cars
329,2003,4,S10_2016,121.08,20060.2,118,Motorcycles
330,2003,4,S10_4698,194.434,33989.19,193,Motorcycles
331,2003,4,S10_4757,138.38,17197.2,136,Classic Cars


In [12]:
Year = [2004]
Quarter = [1]
df_2004_Q1 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2004_Q1.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
436,2004,1,S10_1678,111.01,8674.1,95,Motorcycles
437,2004,1,S10_1949,198.225,12538.53,214,Classic Cars
438,2004,1,S10_2016,123.1,8431.48,118,Motorcycles
439,2004,1,S10_4698,189.785,15897.43,193,Motorcycles
440,2004,1,S10_4757,127.84,11195.52,136,Classic Cars


In [13]:
Year = [2004]
Quarter = [2]
df_2004_Q2 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2004_Q2.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
545,2004,2,S10_1678,107.82,9198.52,95,Motorcycles
546,2004,2,S10_1949,210.015,13800.98,214,Classic Cars
547,2004,2,S10_2016,124.09,13080.6,118,Motorcycles
548,2004,2,S10_4698,182.683333,22439.07,193,Motorcycles
549,2004,2,S10_4757,125.12,3378.24,136,Classic Cars


In [14]:
Year = [2004]
Quarter = [3]
df_2004_Q3 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2004_Q3.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
649,2004,3,S10_1678,106.546667,10874.42,95,Motorcycles
650,2004,3,S10_1949,220.73,20056.4,214,Classic Cars
651,2004,3,S10_2016,122.11,13146.29,118,Motorcycles
652,2004,3,S10_4698,207.863333,19023.33,193,Motorcycles
653,2004,3,S10_4757,126.48,9928.0,136,Classic Cars


In [15]:
Year = [2004]
Quarter = [4]
df_2004_Q4 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2004_Q4.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
758,2004,4,S10_1678,113.574,22337.49,95,Motorcycles
759,2004,4,S10_1949,172.24514,40436.34,214,Classic Cars
760,2004,4,S10_2016,104.302,16873.16,118,Motorcycles
761,2004,4,S10_4698,184.5275,21095.87,193,Motorcycles
762,2004,4,S10_4757,81.775,15661.58,136,Classic Cars


In [16]:
Year = [2005]
Quarter = [1]
df_2005_Q1 = df_agg[df_agg.Year_ID.isin(Year) & df_agg.QTR_ID.isin(Quarter)]
df_2005_Q1.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
867,2005,1,S10_1678,55.635,3940.23,95,Motorcycles
868,2005,1,S10_1949,146.703333,15186.28,214,Classic Cars
869,2005,1,S10_2016,90.05,7513.51,118,Motorcycles
870,2005,1,S10_4698,142.536667,9320.65,193,Motorcycles
871,2005,1,S10_4757,117.21,12263.51,136,Classic Cars


Next I'll edit the quarter and year values as necessary so that when I join this data to the aggregated data I have the desired match of previous quarter with current quarter data. I will not be doing this for df_2005_Q1 as I will be using this in my final prediction

In [17]:
df_2003_Q1.loc[:,'QTR_ID'] = 2
df_2003_Q2.loc[:,'QTR_ID'] = 3
df_2003_Q3.loc[:,'QTR_ID'] = 4
df_2003_Q4.loc[:,'QTR_ID'] = 1
df_2003_Q4.loc[:,'Year_ID'] = 2004
df_2004_Q1.loc[:,'QTR_ID'] = 2
df_2004_Q2.loc[:,'QTR_ID'] = 3
df_2004_Q3.loc[:,'QTR_ID'] = 4
df_2004_Q4.loc[:,'QTR_ID'] = 1
df_2004_Q4.loc[:,'Year_ID'] = 2005

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)
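
The warning above appears because each per-quarter dataframe was created by filtering df_agg, so pandas cannot tell whether the assignment is meant to affect df_agg as well. A minimal sketch of one way to make the intent explicit, using a toy frame with made-up values (toy_agg and toy_q1 are illustrative names, not part of this notebook):

```python
import pandas as pd

# Toy stand-in for df_agg; the values are illustrative only
toy_agg = pd.DataFrame({
    'Year_ID': [2003, 2003, 2003, 2003],
    'QTR_ID': [1, 1, 2, 2],
    'Product_Code': ['A', 'B', 'A', 'B'],
    'Sales': [100.0, 200.0, 150.0, 250.0],
})

# .copy() makes the filtered rows an independent frame, so the later
# assignment is unambiguous and pandas has no reason to warn
toy_q1 = toy_agg[(toy_agg['Year_ID'] == 2003) & (toy_agg['QTR_ID'] == 1)].copy()
toy_q1['QTR_ID'] = 2

print(toy_q1)
print(toy_agg)  # the source frame keeps its original quarter labels
```

The same pattern would apply to each of the per-quarter frames created in cells 8 through 16.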


In [18]:
# Verify that this worked on a table where Quarter and Year were changed
df_2003_Q4.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
327,2004,1,S10_1678,100.486,18863.66,95,Motorcycles
328,2004,1,S10_1949,213.442,34602.88,214,Classic Cars
329,2004,1,S10_2016,121.08,20060.2,118,Motorcycles
330,2004,1,S10_4698,194.434,33989.19,193,Motorcycles
331,2004,1,S10_4757,138.38,17197.2,136,Classic Cars


Now I will concatenate the tables together in order so that the index will match the aggregated data

In [19]:
df_pqs = pd.concat([df_2003_Q1, df_2003_Q2])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.700,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.230,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.910,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.650,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.160,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
213,2003,3,S700_3505,104.175,8047.50,100,Ships
214,2003,3,S700_3962,84.415,6316.20,99,Ships
215,2003,3,S700_4002,72.175,3753.10,74,Planes
216,2003,3,S72_1253,44.940,2224.67,49,Planes


In [20]:
df_pqs = pd.concat([df_pqs, df_2003_Q3])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.700,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.230,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.910,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.650,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.160,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
322,2003,4,S700_3505,99.670,6742.49,100,Ships
323,2003,4,S700_3962,97.820,5450.18,99,Ships
324,2003,4,S700_4002,85.870,2919.58,74,Planes
325,2003,4,S72_1253,50.650,1874.05,49,Planes


In [21]:
df_pqs = pd.concat([df_pqs, df_2003_Q4])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.700000,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.230000,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.910000,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.650000,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.160000,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
431,2004,1,S700_3505,96.500000,10435.02,100,Ships
432,2004,1,S700_3962,92.690000,7541.63,99,Ships
433,2004,1,S700_4002,77.175000,12989.82,74,Planes
434,2004,1,S72_1253,48.540000,6642.70,49,Planes


In [22]:
df_pqs = pd.concat([df_pqs, df_2004_Q1])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.700000,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.230000,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.910000,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.650000,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.160000,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
540,2004,2,S700_3505,108.850000,13250.30,100,Ships
541,2004,2,S700_3962,87.726667,8659.12,99,Ships
542,2004,2,S700_4002,72.180000,6137.15,74,Planes
543,2004,2,S72_1253,45.190000,3561.51,49,Planes


In [23]:
df_pqs = pd.concat([df_pqs, df_2004_Q2])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.70,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.23,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.91,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.65,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.16,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
644,2004,3,S700_3505,88.15,2203.75,100,Ships
645,2004,3,S700_3962,81.43,4071.50,99,Ships
646,2004,3,S700_4002,67.37,4829.92,74,Planes
647,2004,3,S72_1253,57.61,1843.52,49,Planes


In [24]:
df_pqs = pd.concat([df_pqs, df_2004_Q3])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.700000,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.230000,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.910000,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.650000,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.160000,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
753,2004,4,S700_3505,92.156667,8859.08,100,Ships
754,2004,4,S700_3962,102.623333,11058.50,99,Ships
755,2004,4,S700_4002,74.526667,8381.31,74,Planes
756,2004,4,S72_1253,53.800000,5301.42,49,Planes


In [25]:
df_pqs = pd.concat([df_pqs, df_2004_Q4])
df_pqs

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,2,S10_1678,95.700,2871.00,95,Motorcycles
1,2003,2,S10_1949,228.230,12613.73,214,Classic Cars
2,2003,2,S10_2016,99.910,3896.49,118,Motorcycles
3,2003,2,S10_4698,224.650,6065.55,193,Motorcycles
4,2003,2,S10_4757,144.160,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
862,2005,1,S700_3505,85.755,10723.77,100,Ships
863,2005,1,S700_3962,98.660,12339.18,99,Ships
864,2005,1,S700_4002,78.625,17573.15,74,Planes
865,2005,1,S72_1253,67.935,13928.27,49,Planes


Because I don't have data for 2002 Q4, the 2003 Q1 rows have no previous-quarter sales, so I will drop them from the aggregated dataset. With 2005 Q1 included I still have two full years (eight quarters) of data, so each quarter of the year is still represented equally.

In [26]:
# Identify indexes to drop
Q1_2003 = df_agg[(df_agg['Year_ID'] == 2003) & (df_agg['QTR_ID'] == 1)].index

In [27]:
# Drop 2003 Q1
df_agg.drop(Q1_2003, inplace = True)

In [28]:
# Reset index
df_agg.reset_index(inplace = True)
df_agg

Unnamed: 0,index,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,109,2003,2,S10_1678,81.350,2765.90,95,Motorcycles
1,110,2003,2,S10_1949,192.870,7329.06,214,Classic Cars
2,111,2003,2,S10_2016,96.340,2793.86,118,Motorcycles
3,112,2003,2,S10_4698,201.410,9264.86,193,Motorcycles
4,113,2003,2,S10_4757,121.040,9403.04,136,Classic Cars
...,...,...,...,...,...,...,...,...
862,971,2005,1,S700_3505,102.260,8468.20,100,Ships
863,972,2005,1,S700_3962,107.035,7815.32,99,Ships
864,973,2005,1,S700_4002,71.490,5647.86,74,Planes
865,974,2005,1,S72_1253,52.595,2991.73,49,Planes


In [29]:
# Join data
df_agg = df_agg.join(df_pqs, how = 'outer', rsuffix = '_PQ')

In [30]:
# Verify join
df_agg.head()

Unnamed: 0,index,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,Year_ID_PQ,QTR_ID_PQ,Product_Code_PQ,Price_Each_PQ,Sales_PQ,MSRP_PQ,Product_Line_PQ
0,109,2003,2,S10_1678,81.35,2765.9,95,Motorcycles,2003,2,S10_1678,95.7,2871.0,95,Motorcycles
1,110,2003,2,S10_1949,192.87,7329.06,214,Classic Cars,2003,2,S10_1949,228.23,12613.73,214,Classic Cars
2,111,2003,2,S10_2016,96.34,2793.86,118,Motorcycles,2003,2,S10_2016,99.91,3896.49,118,Motorcycles
3,112,2003,2,S10_4698,201.41,9264.86,193,Motorcycles,2003,2,S10_4698,224.65,6065.55,193,Motorcycles
4,113,2003,2,S10_4757,121.04,9403.04,136,Classic Cars,2003,2,S10_4757,144.16,7208.0,136,Classic Cars
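
The join above pairs rows by index position, which only lines up if every quarter contains the same product codes in the same order. The slices shown earlier suggest this may not always hold: 2003 Q2 spans index 109 to 217 (109 rows), while 2004 Q2 spans 545 to 648 (104 rows). As a hedged alternative, here is a minimal sketch that matches on year, quarter and product code instead. The toy data and the names toy_agg, prev and toy_lagged are assumptions for illustration only:

```python
import pandas as pd

# Toy aggregated data; product 'B' is missing in 2004 Q2 on purpose
toy_agg = pd.DataFrame({
    'Year_ID':      [2004, 2004, 2004, 2004, 2004],
    'QTR_ID':       [1, 1, 2, 3, 3],
    'Product_Code': ['A', 'B', 'A', 'A', 'B'],
    'Sales':        [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Build the lookup table: relabel every row with the quarter that follows it
prev = toy_agg[['Year_ID', 'QTR_ID', 'Product_Code', 'Sales']].copy()
prev = prev.rename(columns={'Sales': 'Sales_PQ'})
rolls_over = prev['QTR_ID'] == 4           # Q4 rolls into Q1 of the next year
prev['Year_ID'] = prev['Year_ID'] + rolls_over.astype(int)
prev['QTR_ID'] = prev['QTR_ID'] % 4 + 1    # 1->2, 2->3, 3->4, 4->1

# Match on year, quarter and product code rather than on row position
toy_lagged = toy_agg.merge(
    prev, on=['Year_ID', 'QTR_ID', 'Product_Code'], how='left'
)
print(toy_lagged)
```

With a key-based merge, a product that was not sold in the previous quarter gets a missing Sales_PQ value rather than another product's figure, and those rows can then be dropped or filled deliberately.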


In [31]:
# Drop duplicated columns
df_agg.drop(columns = ['index', 'Year_ID_PQ', 'QTR_ID_PQ', 'Product_Code_PQ', 'Price_Each_PQ', 'MSRP_PQ', 'Product_Line_PQ'], inplace = True)

In [32]:
# Verify drop
df_agg

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,Sales_PQ
0,2003,2,S10_1678,81.350,2765.90,95,Motorcycles,2871.00
1,2003,2,S10_1949,192.870,7329.06,214,Classic Cars,12613.73
2,2003,2,S10_2016,96.340,2793.86,118,Motorcycles,3896.49
3,2003,2,S10_4698,201.410,9264.86,193,Motorcycles,6065.55
4,2003,2,S10_4757,121.040,9403.04,136,Classic Cars,7208.00
...,...,...,...,...,...,...,...,...
862,2005,1,S700_3505,102.260,8468.20,100,Ships,10723.77
863,2005,1,S700_3962,107.035,7815.32,99,Ships,12339.18
864,2005,1,S700_4002,71.490,5647.86,74,Planes,17573.15
865,2005,1,S72_1253,52.595,2991.73,49,Planes,13928.27


### Create dummy variables

Now that the data is aggregated and properly formatted, the next step of pre-processing is to create dummy variables for the categorical variables that will be included in the model. I will leave out Product_Code because it has a very high number of distinct values; I can still identify the best product code at the end by looking up its index in the aggregated table.

In [33]:
# Check value counts of variables to add
df_agg['QTR_ID'].value_counts()

1    218
3    218
4    218
2    213
Name: QTR_ID, dtype: int64

In [34]:
df_agg['Year_ID'].value_counts()

2004    431
2003    327
2005    109
Name: Year_ID, dtype: int64

In [35]:
df_agg['Product_Line'].value_counts()

Classic Cars        296
Vintage Cars        190
Motorcycles         104
Planes               96
Trucks and Buses     88
Ships                70
Trains               23
Name: Product_Line, dtype: int64

Now it's time to create and integrate the dummy variables

In [36]:
dummy_Q = pd.get_dummies(df_agg['QTR_ID'])
dummy_Y = pd.get_dummies(df_agg['Year_ID'])
dummy_PL = pd.get_dummies(df_agg['Product_Line'])

In [37]:
dummy_Q.sample(5)

Unnamed: 0,1,2,3,4
235,0,0,0,1
114,0,0,1,0
734,0,0,0,1
649,0,0,0,1
697,0,0,0,1


In [38]:
dummy_Y.sample(5)

Unnamed: 0,2003,2004,2005
474,0,1,0
458,0,1,0
2,1,0,0
475,0,1,0
813,0,0,1


In [39]:
dummy_PL.sample(5)

Unnamed: 0,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
447,1,0,0,0,0,0,0
768,1,0,0,0,0,0,0
485,1,0,0,0,0,0,0
563,1,0,0,0,0,0,0
181,1,0,0,0,0,0,0


All of these seem to have run successfully. Now they will be added to the aggregated dataframe

In [40]:
df = pd.concat([df_agg, dummy_Q], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,Sales_PQ,1,2,3,4
0,2003,2,S10_1678,81.35,2765.9,95,Motorcycles,2871.0,0,1,0,0
1,2003,2,S10_1949,192.87,7329.06,214,Classic Cars,12613.73,0,1,0,0
2,2003,2,S10_2016,96.34,2793.86,118,Motorcycles,3896.49,0,1,0,0
3,2003,2,S10_4698,201.41,9264.86,193,Motorcycles,6065.55,0,1,0,0
4,2003,2,S10_4757,121.04,9403.04,136,Classic Cars,7208.0,0,1,0,0


In [41]:
df = pd.concat([df, dummy_Y], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,Sales_PQ,1,2,3,4,2003,2004,2005
0,2003,2,S10_1678,81.35,2765.9,95,Motorcycles,2871.0,0,1,0,0,1,0,0
1,2003,2,S10_1949,192.87,7329.06,214,Classic Cars,12613.73,0,1,0,0,1,0,0
2,2003,2,S10_2016,96.34,2793.86,118,Motorcycles,3896.49,0,1,0,0,1,0,0
3,2003,2,S10_4698,201.41,9264.86,193,Motorcycles,6065.55,0,1,0,0,1,0,0
4,2003,2,S10_4757,121.04,9403.04,136,Classic Cars,7208.0,0,1,0,0,1,0,0


In [42]:
df = pd.concat([df, dummy_PL], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,Sales_PQ,1,2,...,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
0,2003,2,S10_1678,81.35,2765.9,95,Motorcycles,2871.0,0,1,...,1,0,0,0,1,0,0,0,0,0
1,2003,2,S10_1949,192.87,7329.06,214,Classic Cars,12613.73,0,1,...,1,0,0,1,0,0,0,0,0,0
2,2003,2,S10_2016,96.34,2793.86,118,Motorcycles,3896.49,0,1,...,1,0,0,0,1,0,0,0,0,0
3,2003,2,S10_4698,201.41,9264.86,193,Motorcycles,6065.55,0,1,...,1,0,0,0,1,0,0,0,0,0
4,2003,2,S10_4757,121.04,9403.04,136,Classic Cars,7208.0,0,1,...,1,0,0,1,0,0,0,0,0,0
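
The dummy columns above end up with bare integer names such as 1 or 2003, which are easy to confuse with each other. A minimal sketch of an alternative, assuming a toy frame with made-up rows: pd.get_dummies can encode several columns in one call and add a prefix to each group. The prefixes 'QTR', 'Year' and 'Line' are my own choice for illustration:

```python
import pandas as pd

# Toy rows with the three categorical columns used in this notebook
toy = pd.DataFrame({
    'QTR_ID': [1, 2, 3, 4],
    'Year_ID': [2003, 2003, 2004, 2005],
    'Product_Line': ['Ships', 'Planes', 'Ships', 'Trains'],
})

# Encode all three columns in one call; the prefix gives string column
# names such as 'QTR_2' instead of the bare integer 2, and dtype=int
# keeps the values as 0/1
toy_encoded = pd.get_dummies(
    toy,
    columns=['QTR_ID', 'Year_ID', 'Product_Line'],
    prefix=['QTR', 'Year', 'Line'],
    dtype=int,
)
print(toy_encoded.columns.tolist())
```

Note that with the columns argument the original categorical columns are replaced by their dummies, so they would not need to be dropped separately afterwards.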


### Scaling

For this part I'll be using sklearn's StandardScaler

In [43]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

Next I'll create the subset of our data that the model will be using and assign it to X

In [44]:
# Set up X
X = df.drop('Sales', axis=1)

In [45]:
# Verify X
X.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,MSRP,Product_Line,Sales_PQ,1,2,3,...,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
0,2003,2,S10_1678,81.35,95,Motorcycles,2871.0,0,1,0,...,1,0,0,0,1,0,0,0,0,0
1,2003,2,S10_1949,192.87,214,Classic Cars,12613.73,0,1,0,...,1,0,0,1,0,0,0,0,0,0
2,2003,2,S10_2016,96.34,118,Motorcycles,3896.49,0,1,0,...,1,0,0,0,1,0,0,0,0,0
3,2003,2,S10_4698,201.41,193,Motorcycles,6065.55,0,1,0,...,1,0,0,0,1,0,0,0,0,0
4,2003,2,S10_4757,121.04,136,Classic Cars,7208.0,0,1,0,...,1,0,0,1,0,0,0,0,0,0


Next we must drop the categorical variables so that we can scale

In [46]:
X.drop(columns = ['Year_ID', 'QTR_ID', 'Product_Line', 'Product_Code'], inplace = True)

Now that this is all set up it's time to scale

In [47]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [48]:
X_scaled = scaler.transform(X)
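
In the cells above the scaler is fitted on every row, including rows that will later land in the test split. A common alternative is to split first and fit the scaler on the training rows only, so that the test data has no influence on the scaling statistics. Here is a minimal sketch on synthetic data (X_demo, y_demo and demo_scaler are illustrative names, not the sales data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

# Split first, then fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))
```

The training columns come out with mean zero, while the test columns are only approximately centred, which is expected because they are scaled with the training statistics.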

### Train Test Split

Next, it's time to split the data for our baseline model. First we need to define y as the Sales column

In [49]:
y = df[['Sales']]

Now to perform the actual split. I'll be using a 70/30 split of training to testing size

In [50]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

Let's check that that worked by looking at the sizes of the training and test splits

In [51]:
X_train.shape

(606, 17)

In [52]:
X_test.shape

(261, 17)

In [53]:
261/(261+606)

0.30103806228373703

This rounds to 30% for our test split so conversely the train split is right around 70%

### 1st Model

Now all that's left is to run a preliminary model and see how it performs. I won't be doing any kind of tuning yet but will save that for the next models as I try to find the best one for this dataset. I will be defining success as having an r-squared of at least 0.8

In [54]:
# Run first model
from sklearn import linear_model
Model1 = linear_model.LinearRegression()
Model1.fit(X_train, y_train)
y_pred = Model1.predict(X_test)

In [55]:
# Check r-squared
from sklearn.metrics import r2_score
r2_score(y_pred, y_test)

0.6676967297879066

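This model falls short of the target. With an r-squared of about 0.67 it is well below the desired threshold of 0.8, although it is still better than always predicting the mean, which would give an r-squared of 0. Note that r2_score expects the true values first, r2_score(y_true, y_pred), and the call above passes them in the opposite order, so this figure is not strictly comparable with the scores of the later models. Hopefully some new models and parameter tuning can bring the score over 0.8.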
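
For reference, r-squared compares the model's squared errors with those of a constant prediction at the mean of the true values: R^2 = 1 - SS_res / SS_tot. A small illustration on made-up numbers (not the sales data) shows two properties that matter when reading these scores: always predicting the mean scores 0, and the metric is not symmetric, so the argument order of r2_score(y_true, y_pred) matters:

```python
from sklearn.metrics import r2_score

y_true_demo = [1.0, 2.0, 3.0]
y_pred_demo = [1.0, 2.0, 4.0]

# Conventional order: true values first, predictions second
print(r2_score(y_true_demo, y_pred_demo))   # 1 - 1/2

# Swapping the arguments changes SS_tot, so the score changes too
print(r2_score(y_pred_demo, y_true_demo))   # 1 - 1/(14/3)

# Always predicting the mean of the true values scores exactly 0
print(r2_score(y_true_demo, [2.0, 2.0, 2.0]))
```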

### Extended Modeling Plan

Models:

1) See results with a Random Forest Regressor model
2) Use a Ridge Regression model with Regularization
3) Use a Lasso Regression model with Regularization
4) For all of the above models, experiment with different combinations of variables and parameter tuning or cross validation to try and achieve desired R-squared

### 2nd Model

Next I'll try a Random Forest model, first with the default parameters and then with tuned parameters in later runs

In [56]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

In [57]:
# Fit model with default parameters to training data
Model2 = RandomForestRegressor(random_state = 123)
Model2.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

In [58]:
# Predict data on test set
y_pred = Model2.predict(X_test)

In [59]:
# Check r-squared
r2_score(y_test, y_pred)

0.7497509358906669

Wow! That made a huge difference. Now I'll see whether tuning the parameters can push the r-squared higher. The next cell is the final result of multiple runs: with min_samples_split set to 10, a min_samples_leaf of 7 gave the best r-squared

In [60]:
# Tune parameters and fit model to training data
Model2 = RandomForestRegressor(min_samples_leaf = 7, min_samples_split = 10, random_state = 123)
Model2.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=7,
                      min_samples_split=10, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

In [61]:
# Predict data on test set
y_pred = Model2.predict(X_test)

In [62]:
# Check r-squared
r2_score(y_test, y_pred)

0.8241842714526736

These appear to be the optimal parameters and I have achieved an r-squared over 0.8. If I wanted to I could stop here but I'll see if I can get a better r-squared using other models with cross validation
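
The manual search described above compares parameter values by their score on the test set. A hedged alternative is to let cross validation on the training data pick the values, which keeps the test set untouched until the final check. Below is a minimal sketch with scikit-learn's GridSearchCV on synthetic data; the candidate values in param_grid are illustrative, not the ones I actually tried:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train / y_train
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(120, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

# Candidate values are illustrative only
param_grid = {
    'min_samples_leaf': [1, 3, 7],
    'min_samples_split': [2, 10],
}

# Every combination is scored by 3-fold cross-validated r-squared
search = GridSearchCV(
    RandomForestRegressor(n_estimators=25, random_state=123),
    param_grid,
    scoring='r2',
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refitted on all of the data passed to fit with the winning parameters, and it can then be scored once on the held-out test split.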

### 3rd Model

For this Ridge Regression model I will use the same features and train/test split as before. This time I will also use cross validation to choose the regularization strength, alpha

In [63]:
# Create range of alphas for cross validation
alpha_range = 10.**np.arange(-2,3)
alpha_range

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])

In [64]:
# Initiate Model
from sklearn.linear_model import RidgeCV
Model3 = RidgeCV(alphas = alpha_range, normalize = True, scoring = 'r2')

In [65]:
# Fit model to training data
Model3.fit(X_train, y_train)

RidgeCV(alphas=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), cv=None,
        fit_intercept=True, gcv_mode=None, normalize=True, scoring='r2',
        store_cv_values=False)

In [66]:
# Check optimal alpha from cross validation
Model3.alpha_

0.01
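
The selected alpha, 0.01, is the smallest value in alpha_range. When the winner sits on the edge of the grid, the best value may lie outside the range searched, so it can be worth widening the grid. A minimal sketch on synthetic data, assuming a wider grid built with np.logspace (the bounds and the name wide_alphas are my own choices for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic regression data, not the sales data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 5))
coefs = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y_demo = X_demo @ coefs + rng.normal(scale=0.3, size=80)

# 13 values from 1e-4 to 1e2, two per decade
wide_alphas = np.logspace(-4, 2, 13)
ridge_demo = RidgeCV(alphas=wide_alphas, scoring='r2', cv=5).fit(X_demo, y_demo)

# Flag the case where the chosen alpha is the first or last grid value
at_edge = ridge_demo.alpha_ in (wide_alphas[0], wide_alphas[-1])
print(ridge_demo.alpha_, at_edge)
```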

In [67]:
# Predict data on test set
y_pred = Model3.predict(X_test)

In [68]:
# Check r-squared
r2_score(y_test, y_pred)

0.7529874682737588

Unfortunately even with optimal settings our score is lower than the Random Forest model and is therefore too low

### 4th Model

Next I will try a Lasso Regression model also using cross validation

In [69]:
# Initiate Model
from sklearn.linear_model import LassoCV
Model4 = LassoCV(alphas = alpha_range, normalize = True, random_state = 123)

In [70]:
# Fit model to training data
Model4.fit(X_train, y_train)

LassoCV(alphas=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), copy_X=True,
        cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100,
        n_jobs=None, normalize=True, positive=False, precompute='auto',
        random_state=123, selection='cyclic', tol=0.0001, verbose=False)

In [71]:
# Check optimal alpha from cross validation
Model4.alpha_

1.0

In [72]:
# Predict data on test set
y_pred = Model4.predict(X_test)

In [73]:
# Check r-squared
r2_score(y_test, y_pred)

0.7509981393404845

Unfortunately this model scores slightly lower than Ridge (0.7510 against 0.7530), so it is also under 0.8 and worse than Random Forest with respect to r-squared

### Model Selection and Implementation

After running through multiple models it appears that nothing is improving the r-squared score over Random Forest and I was able to get an r-squared over 0.8 after tuning some parameters. Now I will use the data that I have available to predict the next quarter using the Random Forest model

First I need to create a subset of data for Q2 of 2005 that includes all the columns I'll need for my model and is missing the sales column that the model will predict

In [74]:
# Create subset of data from 2005 Q1
Year = [2005]
Quarter = [1]
df_final = df[df.Year_ID.isin(Year) & df.QTR_ID.isin(Quarter)]

In [75]:
# Verify dataset
df_final.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,Sales_PQ,1,2,...,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
758,2005,1,S10_1678,55.635,3940.23,95,Motorcycles,22337.49,1,0,...,0,0,1,0,1,0,0,0,0,0
759,2005,1,S10_1949,146.703333,15186.28,214,Classic Cars,40436.34,1,0,...,0,0,1,1,0,0,0,0,0,0
760,2005,1,S10_2016,90.05,7513.51,118,Motorcycles,16873.16,1,0,...,0,0,1,0,1,0,0,0,0,0
761,2005,1,S10_4698,142.536667,9320.65,193,Motorcycles,21095.87,1,0,...,0,0,1,0,1,0,0,0,0,0
762,2005,1,S10_4757,117.21,12263.51,136,Classic Cars,15661.58,1,0,...,0,0,1,1,0,0,0,0,0,0


In [76]:
# Change appropriate data variables to proper quarter
df_final.loc[:,'QTR_ID'] = 2
df_final.loc[:,1] = 0
df_final.loc[:,2] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [77]:
# Drop Sales_PQ column as the current sales column must be used as previous sales with quarter change
df_final.drop(columns = 'Sales_PQ', inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [78]:
# Rename Sales column
df_final.rename(columns = {'Sales' : 'Sales_PQ'}, inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [79]:
# Drop unnecessary columns
df_final.drop(columns = ['Year_ID', 'QTR_ID', 'Product_Line', 'Product_Code'], inplace = True)

In [80]:
# Verify Q2 data
df_final.head()

Unnamed: 0,Price_Each,Sales_PQ,MSRP,1,2,3,4,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
758,55.635,3940.23,95,0,1,0,0,0,0,1,0,1,0,0,0,0,0
759,146.703333,15186.28,214,0,1,0,0,0,0,1,1,0,0,0,0,0,0
760,90.05,7513.51,118,0,1,0,0,0,0,1,0,1,0,0,0,0,0
761,142.536667,9320.65,193,0,1,0,0,0,0,1,0,1,0,0,0,0,0
762,117.21,12263.51,136,0,1,0,0,0,0,1,1,0,0,0,0,0,0


In [81]:
# Fit scaler
scaler.fit(df_final)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [82]:
# Scale data
df_final_scaled = scaler.transform(df_final)
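
Two details in the cells above may be worth checking. First, the scaler is refitted on df_final, so the Q2 rows are scaled with their own mean and standard deviation rather than with those of the training data. Second, the columns of df_final run Price_Each, Sales_PQ, MSRP, while X was built as Price_Each, MSRP, Sales_PQ. A model trained on arrays has no way to notice that difference. A minimal sketch of reusing the fitted scaler and restoring the training column order, on toy values (train_X, new_X and demo_scaler are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training features; this column order is what the model would learn on
train_X = pd.DataFrame({
    'Price_Each': [80.0, 100.0, 120.0],
    'MSRP': [90.0, 110.0, 130.0],
    'Sales_PQ': [2000.0, 4000.0, 6000.0],
})
demo_scaler = StandardScaler().fit(train_X)

# New rows arrive with the same columns in a different order
new_X = pd.DataFrame({
    'Price_Each': [100.0],
    'Sales_PQ': [4000.0],
    'MSRP': [110.0],
})

# Reorder to the training layout, then reuse the already-fitted scaler
new_X = new_X[train_X.columns]
new_scaled = demo_scaler.transform(new_X)
print(new_scaled)
```

In this toy case the new row equals the training means, so every scaled value is zero. In the notebook the equivalent step would be to reorder df_final to X.columns and call transform with the scaler that was fitted before the train/test split, without calling fit again.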

In [83]:
# Predict data with Random Forest model
y_pred = Model2.predict(df_final_scaled)

In [84]:
# Ensure predictions make sense
y_pred

array([ 2986.17693331, 15722.28059309,  7313.26429038, 10640.32282975,
       10531.05176056,  7563.58767035,  6322.81663999,  7717.21753755,
       11962.27405916, 11827.77902696, 11114.51786046,  7420.51915837,
       15514.70816772,  4501.76661418,  2894.58882902,  8961.58641765,
        9791.98675142, 10058.58644607, 13653.86292716,  6779.70108323,
        2870.46456456,  7346.70506229, 10547.2335921 ,  8640.16113807,
        7713.22024901,  7408.42765623,  2219.43207321, 12131.77242375,
        9338.22913032,  2662.33112376,  4928.3657547 ,  7122.92432097,
        3149.45488552, 12210.31265306,  6407.78820161,  2568.27092986,
        5601.70606185,  7393.0817318 , 10915.42705728, 15338.67709664,
        7635.87408606,  8424.18971533, 11910.45306718, 14640.68144182,
        4874.34087148, 11024.37383557,  6170.21512768, 15547.71388327,
        8621.9482037 ,  6171.54193135,  7110.01955097, 11454.01849542,
        7361.99895657, 11560.19159498,  7258.35191799,  5052.7448202 ,
      

In [85]:
# Add predicted data to final dataset
df_final['Sales'] = y_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [86]:
# Verify few if any duplicate predicted values
df_final['Sales'].value_counts()

11024.373836    1
9943.152048     1
6171.541931     1
10915.427057    1
6322.816640     1
               ..
2870.464565     1
10237.919923    1
7409.877710     1
6401.362116     1
9535.410025     1
Name: Sales, Length: 109, dtype: int64

In [87]:
# Verify new column has been added
df_final.head()

Unnamed: 0,Price_Each,Sales_PQ,MSRP,1,2,3,4,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars,Sales
758,55.635,3940.23,95,0,1,0,0,0,0,1,0,1,0,0,0,0,0,2986.176933
759,146.703333,15186.28,214,0,1,0,0,0,0,1,1,0,0,0,0,0,0,15722.280593
760,90.05,7513.51,118,0,1,0,0,0,0,1,0,1,0,0,0,0,0,7313.26429
761,142.536667,9320.65,193,0,1,0,0,0,0,1,0,1,0,0,0,0,0,10640.32283
762,117.21,12263.51,136,0,1,0,0,0,0,1,1,0,0,0,0,0,0,10531.051761


In [88]:
# Identify highest Sales figure and index
df_final.loc[df_final['Sales'].idxmax()].sort_values(ascending = False)

Sales               15722.280593
Sales_PQ            15186.280000
MSRP                  214.000000
Price_Each            146.703333
2                       1.000000
Classic Cars            1.000000
2005                    1.000000
3                       0.000000
4                       0.000000
2003                    0.000000
2004                    0.000000
1                       0.000000
Motorcycles             0.000000
Planes                  0.000000
Ships                   0.000000
Trains                  0.000000
Trucks and Buses        0.000000
Vintage Cars            0.000000
Name: 759, dtype: float64

In [89]:
# Use original aggregated dataframe to find product code
df_agg.iloc[759, :]

Year_ID                 2005
QTR_ID                     1
Product_Code        S10_1949
Price_Each        146.703333
Sales           15186.280000
MSRP                     214
Product_Line    Classic Cars
Sales_PQ        40436.340000
Name: 759, dtype: object

### Conclusion and Model Justification

After running the data through the model, the product with the highest predicted revenue is product code S10_1949, with a predicted revenue of $15,722.28 for Quarter 2 of 2005. Below is a breakdown of the models and their r-squared scores:

    Model          Linear Regression      Random Forest      Ridge        Lasso
    R-Squared            0.6677               0.8242         0.7530       0.7510

Since the Random Forest model had the best r-squared I chose to use it for my prediction