# Pre-Processing and Training Data Development

I will begin by loading the necessary packages and the cleaned data.

In [1]:
# Import necessary packages

import pandas as pd
import numpy as np

In [2]:
# Load data

df = pd.read_csv(r'C:\Users\bronc\Downloads\Capstone 3\sales_data_sample(clean).csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size
0,0,10107,30,95.7,2,2871.0,2003-02-24,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,Small
1,1,10121,34,81.35,5,2765.9,2003-05-07,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,Small
2,2,10134,41,94.74,2,3884.34,2003-07-01,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,Medium
3,3,10145,45,83.26,6,3746.7,2003-08-25,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium
4,4,10159,49,106.23,14,5205.27,2003-10-10,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium


In [3]:
# Drop faulty Unnamed column

df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Month_ID,Year_ID,Product_Line,MSRP,Product_Code,Customer_Name,City,Country,Deal_Size
0,10107,30,95.7,2,2871.0,2003-02-24,1,2,2003,Motorcycles,95,S10_1678,Land of Toys Inc.,NYC,USA,Small
1,10121,34,81.35,5,2765.9,2003-05-07,2,5,2003,Motorcycles,95,S10_1678,Reims Collectables,Reims,France,Small
2,10134,41,94.74,2,3884.34,2003-07-01,3,7,2003,Motorcycles,95,S10_1678,Lyon Souveniers,Paris,France,Medium
3,10145,45,83.26,6,3746.7,2003-08-25,3,8,2003,Motorcycles,95,S10_1678,Toys4GrownUps.com,Pasadena,USA,Medium
4,10159,49,106.23,14,5205.27,2003-10-10,4,10,2003,Motorcycles,95,S10_1678,Corporate Gift Ideas Co.,San Francisco,USA,Medium


### Aggregate Data

The first step of pre-processing is to aggregate the data. Since I want to predict sales for a new quarter, I will aggregate total sales by quarter for each product code. For that reason I will drop the Customer_Name, City, Country and Deal_Size columns, which do not aggregate cleanly, and I will also drop Month_ID since I am focusing on quarterly data.

In [4]:
# Drop unnecessary columns
df.drop(columns = ['Month_ID', 'Customer_Name', 'City', 'Country', 'Deal_Size'], inplace = True)

In [5]:
#Verify change
df.head()

Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Year_ID,Product_Line,MSRP,Product_Code
0,10107,30,95.7,2,2871.0,2003-02-24,1,2003,Motorcycles,95,S10_1678
1,10121,34,81.35,5,2765.9,2003-05-07,2,2003,Motorcycles,95,S10_1678
2,10134,41,94.74,2,3884.34,2003-07-01,3,2003,Motorcycles,95,S10_1678
3,10145,45,83.26,6,3746.7,2003-08-25,3,2003,Motorcycles,95,S10_1678
4,10159,49,106.23,14,5205.27,2003-10-10,4,2003,Motorcycles,95,S10_1678


Next I'll delete the data from Q2 of 2005 as it was discovered to be incomplete during EDA. With this removed I will be able to attempt to predict the entire quarter's data with my model

In [6]:
df = df.drop(df[(df['Year_ID'] == 2005) & (df['QTR_ID'] == 2)].index)
df

Unnamed: 0,Order_Number,Quantity_Ordered,Price_Each,Order_Line_Number,Sales,Order_Date,QTR_ID,Year_ID,Product_Line,MSRP,Product_Code
0,10107,30,95.70,2,2871.00,2003-02-24,1,2003,Motorcycles,95,S10_1678
1,10121,34,81.35,5,2765.90,2003-05-07,2,2003,Motorcycles,95,S10_1678
2,10134,41,94.74,2,3884.34,2003-07-01,3,2003,Motorcycles,95,S10_1678
3,10145,45,83.26,6,3746.70,2003-08-25,3,2003,Motorcycles,95,S10_1678
4,10159,49,106.23,14,5205.27,2003-10-10,4,2003,Motorcycles,95,S10_1678
...,...,...,...,...,...,...,...,...,...,...,...
2612,10315,40,55.69,5,2227.60,2004-10-29,4,2004,Ships,54,S72_3212
2613,10337,42,97.16,5,4080.72,2004-11-21,4,2004,Ships,54,S72_3212
2614,10350,20,112.22,15,2244.40,2004-12-02,4,2004,Ships,54,S72_3212
2615,10373,29,137.19,1,3978.51,2005-01-31,1,2005,Ships,54,S72_3212


Next I will perform the aggregation. Here I am purposely leaving out Quantity_Ordered: since Quantity_Ordered * Price_Each = Sales, including it alongside Price_Each would let the model reconstruct the target directly. I am also leaving out Order_Number, Order_Line_Number and Order_Date as they are not necessary for the model.

In [7]:
df_agg = df.groupby(['Year_ID', 'QTR_ID', 'Product_Code'],as_index=False).agg({'Price_Each' : 'mean', 'Sales' : 'sum', 'MSRP' : 'mean', 'Product_Line' : 'min'})
df_agg

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line
0,2003,1,S10_1678,95.700,2871.00,95,Motorcycles
1,2003,1,S10_1949,228.230,12613.73,214,Classic Cars
2,2003,1,S10_2016,99.910,3896.49,118,Motorcycles
3,2003,1,S10_4698,224.650,6065.55,193,Motorcycles
4,2003,1,S10_4757,144.160,7208.00,136,Classic Cars
...,...,...,...,...,...,...,...
971,2005,1,S700_3505,102.260,8468.20,100,Ships
972,2005,1,S700_3962,107.035,7815.32,99,Ships
973,2005,1,S700_4002,71.490,5647.86,74,Planes
974,2005,1,S72_1253,52.595,2991.73,49,Planes
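
The aggregation above can be sketched on a toy frame. The rows and values below are made up for illustration, and the named-aggregation syntax is an alternative spelling of the dictionary form used in the notebook, not a change to it.

```python
import pandas as pd

# Toy order lines; the values are invented purely for illustration.
orders = pd.DataFrame({
    "Year_ID": [2003, 2003, 2003, 2003],
    "QTR_ID": [1, 1, 1, 2],
    "Product_Code": ["A", "A", "B", "A"],
    "Price_Each": [10.0, 20.0, 5.0, 30.0],
    "Sales": [100.0, 400.0, 50.0, 300.0],
})

# One row per (year, quarter, product): mean unit price and total sales.
quarterly = orders.groupby(
    ["Year_ID", "QTR_ID", "Product_Code"], as_index=False
).agg(
    Price_Each=("Price_Each", "mean"),
    Sales=("Sales", "sum"),
)
print(quarterly)
```

The two Q1 2003 order lines for product A collapse into a single row, which is the same shape the model is trained on.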


### Create dummy variables

Now that the data is loaded and aggregated, the next step of pre-processing is to create dummy variables for the categorical variables that will be included in the model.

In [8]:
# Check value counts of variables to add
df_agg['QTR_ID'].value_counts()

1    327
3    218
4    218
2    213
Name: QTR_ID, dtype: int64

In [9]:
df_agg['Year_ID'].value_counts()

2003    436
2004    431
2005    109
Name: Year_ID, dtype: int64

In [10]:
df_agg['Product_Line'].value_counts()

Classic Cars        333
Vintage Cars        214
Motorcycles         117
Planes              108
Trucks and Buses     99
Ships                79
Trains               26
Name: Product_Line, dtype: int64

In [11]:
df_agg['Product_Code'].value_counts()

S24_3856     9
S24_2841     9
S24_2840     9
S24_4048     9
S700_1138    9
            ..
S700_1938    8
S18_3029     8
S18_3259     8
S24_3816     8
S18_3140     8
Name: Product_Code, Length: 109, dtype: int64

Now it's time to create and integrate the dummy variables.

In [12]:
dummy_Q = pd.get_dummies(df_agg['QTR_ID'])
dummy_Y = pd.get_dummies(df_agg['Year_ID'])
dummy_PL = pd.get_dummies(df_agg['Product_Line'])
dummy_PC = pd.get_dummies(df_agg['Product_Code'])

In [13]:
dummy_Q.sample(5)

Unnamed: 0,1,2,3,4
268,0,0,1,0
833,0,0,0,1
203,0,1,0,0
917,1,0,0,0
871,1,0,0,0


In [14]:
dummy_Y.sample(5)

Unnamed: 0,2003,2004,2005
177,1,0,0
633,0,1,0
63,1,0,0
836,0,1,0
321,1,0,0


In [15]:
dummy_PL.sample(5)

Unnamed: 0,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
610,1,0,0,0,0,0,0
724,0,0,0,0,0,0,1
173,0,1,0,0,0,0,0
445,0,1,0,0,0,0,0
794,0,0,0,1,0,0,0


In [16]:
dummy_PC.sample(5)

Unnamed: 0,S10_1678,S10_1949,S10_2016,S10_4698,S10_4757,S10_4962,S12_1099,S12_1108,S12_1666,S12_2823,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
177,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
661,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
914,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


All of these seem to have run successfully. Now they will be joined onto the aggregated dataframe.

In [17]:
df = pd.concat([df_agg, dummy_Q], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,1,2,3,4
0,2003,1,S10_1678,95.7,2871.0,95,Motorcycles,1,0,0,0
1,2003,1,S10_1949,228.23,12613.73,214,Classic Cars,1,0,0,0
2,2003,1,S10_2016,99.91,3896.49,118,Motorcycles,1,0,0,0
3,2003,1,S10_4698,224.65,6065.55,193,Motorcycles,1,0,0,0
4,2003,1,S10_4757,144.16,7208.0,136,Classic Cars,1,0,0,0


In [18]:
df = pd.concat([df, dummy_Y], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,1,2,3,4,2003,2004,2005
0,2003,1,S10_1678,95.7,2871.0,95,Motorcycles,1,0,0,0,1,0,0
1,2003,1,S10_1949,228.23,12613.73,214,Classic Cars,1,0,0,0,1,0,0
2,2003,1,S10_2016,99.91,3896.49,118,Motorcycles,1,0,0,0,1,0,0
3,2003,1,S10_4698,224.65,6065.55,193,Motorcycles,1,0,0,0,1,0,0
4,2003,1,S10_4757,144.16,7208.0,136,Classic Cars,1,0,0,0,1,0,0


In [19]:
df = pd.concat([df, dummy_PL], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,1,2,3,...,2003,2004,2005,Classic Cars,Motorcycles,Planes,Ships,Trains,Trucks and Buses,Vintage Cars
0,2003,1,S10_1678,95.7,2871.0,95,Motorcycles,1,0,0,...,1,0,0,0,1,0,0,0,0,0
1,2003,1,S10_1949,228.23,12613.73,214,Classic Cars,1,0,0,...,1,0,0,1,0,0,0,0,0,0
2,2003,1,S10_2016,99.91,3896.49,118,Motorcycles,1,0,0,...,1,0,0,0,1,0,0,0,0,0
3,2003,1,S10_4698,224.65,6065.55,193,Motorcycles,1,0,0,...,1,0,0,0,1,0,0,0,0,0
4,2003,1,S10_4757,144.16,7208.0,136,Classic Cars,1,0,0,...,1,0,0,1,0,0,0,0,0,0


In [20]:
df = pd.concat([df, dummy_PC], axis = 1)
df.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,1,2,3,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
0,2003,1,S10_1678,95.7,2871.0,95,Motorcycles,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2003,1,S10_1949,228.23,12613.73,214,Classic Cars,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2003,1,S10_2016,99.91,3896.49,118,Motorcycles,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2003,1,S10_4698,224.65,6065.55,193,Motorcycles,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2003,1,S10_4757,144.16,7208.0,136,Classic Cars,1,0,0,...,0,0,0,0,0,0,0,0,0,0
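
As a point of comparison, the create-then-concatenate steps above could also be written as a single get_dummies call. This is a minimal sketch on a toy frame, not the notebook's method: the prefixes "Q" and "PL" are hypothetical names chosen here, and unlike the cells above this form removes the original categorical columns in the same step.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "QTR_ID": [1, 2, 1],
    "Product_Line": ["Ships", "Planes", "Ships"],
    "Sales": [10.0, 20.0, 30.0],
})

# Encode both categorical columns at once. The prefixes give string column
# names such as "Q_1" instead of bare integers like 1 or 2003.
encoded = pd.get_dummies(
    toy,
    columns=["QTR_ID", "Product_Line"],
    prefix=["Q", "PL"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Prefixed names make it harder to confuse a quarter column with a year column later on, although the rest of this notebook relies on the unprefixed integer names.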


### Scaling

For this part I'll be using sklearn's StandardScaler

In [21]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

Next I'll create the subset of our data that the model will be using and assign it to X

In [22]:
# Set up X
X = df.drop('Sales', axis=1)

In [23]:
# Verify X
X.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,MSRP,Product_Line,1,2,3,4,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
0,2003,1,S10_1678,95.7,95,Motorcycles,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2003,1,S10_1949,228.23,214,Classic Cars,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2003,1,S10_2016,99.91,118,Motorcycles,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2003,1,S10_4698,224.65,193,Motorcycles,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2003,1,S10_4757,144.16,136,Classic Cars,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Next we must drop the original categorical columns, which are now represented by the dummy variables, so that only numeric columns remain for scaling.

In [24]:
X.drop(columns = ['Year_ID', 'QTR_ID', 'Product_Line', 'Product_Code'], inplace = True)

Now that this is all set up it's time to scale.

In [25]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [26]:
X_scaled = scaler.transform(X)
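
To make the transformation concrete, here is a small check of what StandardScaler computes: each column has its mean subtracted and is divided by its population standard deviation. The array and the name scaler_demo are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
values = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

scaler_demo = StandardScaler().fit(values)
scaled = scaler_demo.transform(values)

# Manual z-score per column; np.std defaults to ddof=0, matching StandardScaler.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

Because the fitted scaler stores the column means and standard deviations, any rows passed to a model trained on the scaled data need to go through the same transform.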

### Train Test Split

Next, it's time to split the data for our baseline model. First we need to define y as the Sales column.

In [27]:
# Use a 1-D Series for the target; a single-column DataFrame makes some estimators warn
y = df['Sales']

Now to perform the actual split. I'll be using a 70/30 split of training to testing size

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

Let's check that that worked by looking at the sizes of the training and test splits

In [29]:
X_train.shape

(683, 125)

In [30]:
X_test.shape

(293, 125)

In [31]:
293/(293+683)

0.30020491803278687

This rounds to 30% for our test split so conversely the train split is right around 70%
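
The split sizes can be reproduced by hand. The sketch below assumes that train_test_split rounds the test count up when test_size is a fraction and gives the remainder to the training set, which is consistent with the shapes shown above.

```python
import math

n_samples = 976   # rows in the aggregated data (683 + 293)
test_size = 0.3

# Assumed rounding rule: the test count is rounded up, the rest is training data.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train, n_test / n_samples)
```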

### 1st Model

Now all that's left is to run a preliminary model and see how it performs. I won't be doing any kind of tuning yet but will save that for the next models as I try to find the best one for this dataset. I will be defining success as having an r-squared of at least 0.8

In [32]:
from sklearn import linear_model
Model1 = linear_model.LinearRegression()
Model1.fit(X_train, y_train)
y_pred = Model1.predict(X_test)

In [33]:
from sklearn.metrics import r2_score
r2_score(y_pred, y_test)

0.6311023696188257

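This model performs poorly: with a score of 0.63 it falls well short of the 0.8 target. Note that r2_score expects its arguments in the order (y_true, y_pred), and the call above passes them reversed, so this figure is not directly comparable with the later scores, which use r2_score(y_test, y_pred). Hopefully some new models and parameter tuning can bring the score up over our desired threshold of 0.8.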
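
R-squared is not symmetric in its two arguments, so the order in which the true and predicted values are passed matters. The helper below is a plain-Python sketch of the usual definition, 1 - SS_res / SS_tot; the function name r_squared and the numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 2.0, 3.0, 3.0]

print(r_squared(actual, predicted))   # true values first
print(r_squared(predicted, actual))   # arguments swapped
```

The residual sum of squares is the same either way, but the total sum of squares is taken around the mean of whichever series is treated as the truth, so the two calls give different scores.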

### Extended Modeling Plan

Models:

1) See results with a Random Forest Regressor model
2) Use a Ridge Regression model with Regularization
3) For both of the above models, experiment with different combinations of variables and parameter tuning to try and achieve desired R-squared

### 2nd Model

Next I'll try a Random Forest model. First I'll use the default parameters, then I'll see what tuning them can do.

In [34]:
from sklearn.ensemble import RandomForestRegressor

In [35]:
Model2 = RandomForestRegressor(random_state = 123)
Model2.fit(X_train, y_train)

  


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

In [36]:
y_pred = Model2.predict(X_test)

In [37]:
r2_score(y_test, y_pred)

0.7642832344472683

Wow! That made a huge difference. Now I'll see what happens if I tune the parameters a little

In [38]:
Model2 = RandomForestRegressor(min_samples_leaf = 10, min_samples_split = 10, random_state = 123)
Model2.fit(X_train, y_train)

  


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=10,
                      min_samples_split=10, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

In [39]:
y_pred = Model2.predict(X_test)

In [40]:
r2_score(y_test, y_pred)

0.7683536276413202

Hmm, that made it only slightly better (0.7684 versus 0.7643). I'll try some more tuning but it's looking like this model won't quite do the trick.

In [62]:
Model2 = RandomForestRegressor(min_samples_leaf = 9, min_samples_split = 2, random_state = 123)
Model2.fit(X_train, y_train)

  


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=9,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

In [63]:
y_pred = Model2.predict(X_test)

In [64]:
r2_score(y_test, y_pred)

0.7688152439963077

These are the best parameters of the settings I tried, and yet the r-squared of 0.7688 still isn't quite high enough. Next I'll give a Ridge Regression model a shot.
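
The manual trial-and-error above could also be automated with a cross-validated grid search. This is only a sketch on synthetic data: make_regression stands in for the sales features, and the grid values and the names X_demo, y_demo and search are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=123
)

# Hypothetical grid over the same two parameters tuned by hand above.
param_grid = {
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=123),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring each combination by cross-validation on the training data, rather than on the test set, keeps the test score usable as an honest final check.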

### 3rd Model

For this Ridge Regression model I will use the same features and the same train/test split. However, I will also be using cross-validation to choose the regularization strength alpha.

In [65]:
alpha_range = 10.**np.arange(-2,3)
alpha_range

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])

In [66]:
from sklearn.linear_model import RidgeCV
Model3 = RidgeCV(alphas = alpha_range, normalize = True, scoring = 'r2')

In [67]:
Model3.fit(X_train, y_train)

RidgeCV(alphas=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), cv=None,
        fit_intercept=True, gcv_mode=None, normalize=True, scoring='r2',
        store_cv_values=False)

In [68]:
Model3.alpha_

0.1

In [69]:
y_pred = Model3.predict(X_test)

In [70]:
r2_score(y_test, y_pred)

0.665472203310578

Unfortunately, even with the cross-validated alpha, the score of 0.665 is lower than the Random Forest's and is therefore too low.

### 4th Model

Next I will try a Lasso Regression model also using cross validation

In [71]:
from sklearn.linear_model import LassoCV
Model4 = LassoCV(alphas = alpha_range, normalize = True, random_state = 123)

In [72]:
Model4.fit(X_train, y_train)



LassoCV(alphas=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), copy_X=True,
        cv=None, eps=0.001, fit_intercept=True, max_iter=1000, n_alphas=100,
        n_jobs=None, normalize=True, positive=False, precompute='auto',
        random_state=123, selection='cyclic', tol=0.0001, verbose=False)

In [73]:
Model4.alpha_

10.0

In [74]:
y_pred = Model4.predict(X_test)

In [75]:
r2_score(y_test, y_pred)

0.7327076392021159

While this was slightly better than Ridge it is a worse model than Random Forest and still has too low of an r-squared value

### 5th Model

Lastly I will attempt to use an Elastic Net model also using cross validation

In [76]:
from sklearn.linear_model import ElasticNetCV
Model5 = ElasticNetCV(l1_ratio = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], alphas = alpha_range, normalize = True, random_state = 123)

In [77]:
Model5.fit(X_train, y_train)



ElasticNetCV(alphas=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
             copy_X=True, cv=None, eps=0.001, fit_intercept=True,
             l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1], max_iter=1000,
             n_alphas=100, n_jobs=None, normalize=True, positive=False,
             precompute='auto', random_state=123, selection='cyclic',
             tol=0.0001, verbose=0)

In [78]:
Model5.alpha_

10.0

In [79]:
Model5.l1_ratio_

1.0

In [80]:
y_pred = Model5.predict(X_test)

In [81]:
r2_score(y_test, y_pred)

0.7327076392021159

Unfortunately the Elastic Net model selected an l1_ratio of 1.0, which makes it a pure Lasso model, so it reproduces the Lasso results exactly. This indicates that a mix of the two penalties doesn't work better than just using a Lasso model.
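
The identical scores are expected: with l1_ratio equal to 1 the Elastic Net penalty has no L2 component left, so the objective is the Lasso objective. A minimal check on synthetic data, where the dataset and the alpha value are arbitrary choices for the example:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem used only for this comparison.
X_demo, y_demo = make_regression(
    n_samples=100, n_features=5, noise=5.0, random_state=123
)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
enet = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)

# With l1_ratio=1.0 both estimators solve the same problem.
print(np.allclose(lasso.coef_, enet.coef_))
```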

### Model Selection and Implementation

After running through multiple models, nothing improves on the Random Forest's r-squared. I also experimented with removing and adding variables, but adding variables greatly lowered r-squared while removing them made it slightly worse. Given these results I think it is best to go forward with the Random Forest model and its r-squared of 0.7688. Since I am not looking to predict the exact value of revenue but rather which Product_Code generates the highest one, I think this value should be sufficient to give us meaningful results.

First I need to create a dataset for Q2 of 2005 that includes all the columns the model needs and omits the Sales column that the model will predict. Since the Q2 2005 rows were removed earlier, I will take the Q1 2005 rows as a template and relabel them as Q2.

In [82]:
# Use the Q1 2005 rows as a template; .copy() avoids modifying a slice of df
df_final = df[(df['Year_ID'] == 2005) & (df['QTR_ID'] == 1)].copy()

In [83]:
df_final.head()

Unnamed: 0,Year_ID,QTR_ID,Product_Code,Price_Each,Sales,MSRP,Product_Line,1,2,3,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
867,2005,1,S10_1678,55.635,3940.23,95,Motorcycles,1,0,0,...,0,0,0,0,0,0,0,0,0,0
868,2005,1,S10_1949,146.703333,15186.28,214,Classic Cars,1,0,0,...,0,0,0,0,0,0,0,0,0,0
869,2005,1,S10_2016,90.05,7513.51,118,Motorcycles,1,0,0,...,0,0,0,0,0,0,0,0,0,0
870,2005,1,S10_4698,142.536667,9320.65,193,Motorcycles,1,0,0,...,0,0,0,0,0,0,0,0,0,0
871,2005,1,S10_4757,117.21,12263.51,136,Classic Cars,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [84]:
df_final.loc[:,'QTR_ID'] = 2
df_final.loc[:,1] = 0
df_final.loc[:,2] = 1



In [85]:
df_final.drop(columns = ['Year_ID', 'QTR_ID', 'Product_Line', 'Product_Code', 'Sales'], inplace = True)



In [86]:
df_final.head()

Unnamed: 0,Price_Each,MSRP,1,2,3,4,2003,2004,2005,Classic Cars,...,S700_2466,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212
867,55.635,95,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
868,146.703333,214,0,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
869,90.05,118,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
870,142.536667,193,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
871,117.21,136,0,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [87]:
y_pred = Model2.predict(df_final)

In [88]:
y_pred

array([11581.57126411, 11720.40917162, 11581.57126411, 11581.57126411,
       11720.40917162, 11720.40917162, 11720.40917162, 11720.40917162,
       11581.57126411, 11581.57126411, 11720.40917162, 11720.40917162,
       11720.40917162, 11720.40917162, 11581.57126411, 11720.40917162,
       11581.57126411, 11720.40917162, 11494.36405989, 11494.36405989,
       11720.40917162, 11581.57126411, 11494.36405989, 11720.40917162,
       11720.40917162, 11720.40917162, 11494.36405989, 11581.57126411,
       11494.36405989, 11581.57126411, 11581.57126411, 11581.57126411,
       11494.36405989, 11720.40917162, 11494.36405989, 11494.36405989,
       11581.57126411, 11494.36405989, 11494.36405989, 11720.40917162,
       11581.57126411, 11720.40917162, 11494.36405989, 11720.40917162,
       11720.40917162, 11581.57126411, 11494.36405989, 11720.40917162,
       11494.36405989, 11494.36405989, 11581.57126411, 11494.36405989,
       11720.40917162, 11720.40917162, 11720.40917162, 11720.40917162,
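
The predictions above take only a handful of distinct values. One likely explanation is that Model2 was fitted on standardized features (X_scaled) while df_final is passed in unscaled, so every row falls on the same side of most tree splits. The sketch below reproduces that effect on synthetic data. The data, the names scaler_demo and forest, and the forest settings are assumptions for illustration, not the notebook's actual model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)

# Synthetic price-like features and a target driven by the first column.
raw = rng.uniform(50, 250, size=(300, 2))
target = 40 * raw[:, 0] + rng.normal(0, 100, size=300)

# Fit the scaler and the forest on standardized features, as in the notebook.
scaler_demo = StandardScaler().fit(raw)
forest = RandomForestRegressor(
    n_estimators=50, min_samples_leaf=9, random_state=123
)
forest.fit(scaler_demo.transform(raw), target)

new_rows = rng.uniform(50, 250, size=(20, 2))

# Skipping the scaler: the raw values lie far outside the training range.
unscaled_pred = forest.predict(new_rows)
# Applying the same fitted scaler that was used for training.
scaled_pred = forest.predict(scaler_demo.transform(new_rows))

print(len(np.unique(unscaled_pred)), len(np.unique(scaled_pred)))
```

In the notebook's terms, the analogous step would be to pass scaler.transform(df_final) to Model2.predict, with the columns in the same order as X.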
      

In [89]:
df_final['Sales'] = y_pred



In [90]:
df_final.head()

Unnamed: 0,Price_Each,MSRP,1,2,3,4,2003,2004,2005,Classic Cars,...,S700_2610,S700_2824,S700_2834,S700_3167,S700_3505,S700_3962,S700_4002,S72_1253,S72_3212,Sales
867,55.635,95,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,11581.571264
868,146.703333,214,0,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,11720.409172
869,90.05,118,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,11581.571264
870,142.536667,193,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,11581.571264
871,117.21,136,0,1,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,11720.409172


In [91]:
df_final.loc[df_final['Sales'].idxmax()].sort_values(ascending = False)

Sales         11720.409172
MSRP            214.000000
Price_Each      146.703333
2                 1.000000
S10_1949          1.000000
                  ...     
S18_2238          0.000000
S18_1984          0.000000
S18_1889          0.000000
S18_1749          0.000000
S18_4027          0.000000
Name: 868, Length: 126, dtype: float64
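
Since Product_Code was dropped before predicting, the winning product has to be read off the dummy columns in the output above. A sketch of a more direct lookup, assuming the product codes are kept in a separate Series that shares the prediction frame's index; the three rows reuse the first predictions shown earlier, and the names product_codes, predicted_sales and ranking are hypothetical.

```python
import pandas as pd

# Product codes set aside before the column was dropped (first three rows only).
product_codes = pd.Series(
    ["S10_1678", "S10_1949", "S10_2016"], index=[867, 868, 869]
)
# Matching predictions, rounded from the values shown above.
predicted_sales = pd.Series(
    [11581.57, 11720.41, 11581.57], index=product_codes.index
)

# Full ranking, highest predicted sales first.
ranking = pd.DataFrame(
    {"Product_Code": product_codes, "Predicted_Sales": predicted_sales}
).sort_values("Predicted_Sales", ascending=False)

# idxmax returns the index label of the first maximum.
top_code = product_codes.loc[predicted_sales.idxmax()]
print(top_code)
```

Because idxmax returns only the first maximum, products that tie on predicted sales will not show up unless the full ranking is inspected.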