# Advanced Linear Regression - Predict
## Predicting the average price per kilogram of Golden Delicious Apples

#### Predict Requirements:
Imagine you are in the fresh produce industry. How much stock should you keep on hand? Not so little that you run out when customers want to buy more, and not so much that food goes to waste. How do you set your prices? Yields from farms fluctuate by season; should your prices fluctuate by season too? With this context, EDSA is challenging you to construct a regression algorithm capable of accurately predicting how much a kilogram of Golden Delicious Apples will cost, given certain parameters.

Importing the necessary libraries and loading the data:

In [1]:
# Regression models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Model selection
from sklearn.model_selection import train_test_split

# Preprocessing
from sklearn.preprocessing import StandardScaler

# Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Visualisations
%matplotlib notebook
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib inline



# Other
from scipy import stats
import math
import pickle
import numpy as np
import pandas as pd
#from mpl_toolkits.mplot3d import Axes3D (state 'mpld3.enable_notebook()' in every cell you plot something), if we going to do 3D visual

In [2]:
train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')
test_ID = test['Index']
test = test.drop('Index', axis=1)

## Confirmatory Data Analysis

First, we look at the first few rows of our training dataset.

In [3]:
train.head()

Unnamed: 0,Province,Container,Size_Grade,Weight_Kg,Commodities,Date,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand,avg_price_per_kg
0,CAPE,EC120,1L,12.0,APPLE GRANNY SMITH,2020-03-10,108.0,112.0,3236.0,29,348.0,0,9.3
1,CAPE,M4183,1L,18.3,APPLE GOLDEN DELICIOUS,2020-09-09,150.0,170.0,51710.0,332,6075.6,822,8.51
2,GAUTENG,AT200,1L,20.0,AVOCADO PINKERTON,2020-05-05,70.0,80.0,4860.0,66,1320.0,50,3.68
3,TRANSVAAL,BJ090,1L,9.0,TOMATOES-LONG LIFE,2020-01-20,60.0,60.0,600.0,10,90.0,0,6.67
4,WESTERN FREESTATE,PP100,1R,10.0,POTATO SIFRA (WASHED),2020-07-14,40.0,45.0,41530.0,927,9270.0,393,4.48


We can take a look at the dimensions of the dataframe to get an idea of the number of rows, n, and the number of predictors, p, which is equal to one less than the number of columns.

In [4]:
train.shape

(64376, 13)

The shape command shows us that we have 64376 rows of data and 13 variables. We will try and model the price per kilogram of Golden Delicious Apples using the other 12 variables.

In the above dataframe, there appears to be no sign of missing values. The Pandas library can help us investigate this further, using the `info()` method. This method tells us which columns are in the dataframe, how many non-null values each one has, and what datatype it is. We will also use `isnull().sum()` to confirm the number of missing values in each column.

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64376 entries, 0 to 64375
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Province          64376 non-null  object 
 1   Container         64376 non-null  object 
 2   Size_Grade        64376 non-null  object 
 3   Weight_Kg         64376 non-null  float64
 4   Commodities       64376 non-null  object 
 5   Date              64376 non-null  object 
 6   Low_Price         64376 non-null  float64
 7   High_Price        64376 non-null  float64
 8   Sales_Total       64376 non-null  float64
 9   Total_Qty_Sold    64376 non-null  int64  
 10  Total_Kg_Sold     64376 non-null  float64
 11  Stock_On_Hand     64376 non-null  int64  
 12  avg_price_per_kg  64376 non-null  float64
dtypes: float64(6), int64(2), object(5)
memory usage: 6.4+ MB


In [6]:
train.isnull().sum()

Province            0
Container           0
Size_Grade          0
Weight_Kg           0
Commodities         0
Date                0
Low_Price           0
High_Price          0
Sales_Total         0
Total_Qty_Sold      0
Total_Kg_Sold       0
Stock_On_Hand       0
avg_price_per_kg    0
dtype: int64

From the information generated, we can see that all columns contain 64376 entries, and have no missing values.

## Extracting the Relevant Data
Data relating to Golden Delicious Apples

Why? After dummy encoding the full dataset, we found no significant correlation between the response variable (average price per kg) and the other commodities; instead we ended up with a large number of columns, which led to model inefficiencies and errors. Date was removed to improve accuracy, since none of the dates were significantly correlated with the response variable.

In [7]:
df_train= train[train['Commodities']=='APPLE GOLDEN DELICIOUS'].drop('Date', axis=1)
df_train.head()

Unnamed: 0,Province,Container,Size_Grade,Weight_Kg,Commodities,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand,avg_price_per_kg
1,CAPE,M4183,1L,18.3,APPLE GOLDEN DELICIOUS,150.0,170.0,51710.0,332,6075.6,822,8.51
7,CAPE,JG110,2M,11.0,APPLE GOLDEN DELICIOUS,50.0,50.0,16000.0,320,3520.0,0,4.55
24,W.CAPE-BERGRIVER ETC,JE090,2S,9.0,APPLE GOLDEN DELICIOUS,55.0,55.0,990.0,18,162.0,1506,6.11
40,CAPE,M4183,1S,18.3,APPLE GOLDEN DELICIOUS,80.0,120.0,32020.0,388,7100.4,443,4.51
69,EASTERN CAPE,IA400,1S,400.0,APPLE GOLDEN DELICIOUS,1800.0,1800.0,1800.0,1,400.0,2,4.5


In [8]:
df_test= test[test['Commodities']=='APPLE GOLDEN DELICIOUS'].drop('Date', axis=1)
df_test.head()

Unnamed: 0,Province,Container,Size_Grade,Weight_Kg,Commodities,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand
0,W.CAPE-BERGRIVER ETC,EC120,1M,12.0,APPLE GOLDEN DELICIOUS,128.0,136.0,5008.0,38,456.0,0
1,W.CAPE-BERGRIVER ETC,M4183,1X,18.3,APPLE GOLDEN DELICIOUS,220.0,220.0,1760.0,8,146.4,2
2,W.CAPE-BERGRIVER ETC,EC120,1S,12.0,APPLE GOLDEN DELICIOUS,120.0,120.0,720.0,6,72.0,45
3,W.CAPE-BERGRIVER ETC,M4183,1M,18.3,APPLE GOLDEN DELICIOUS,160.0,160.0,160.0,1,18.3,8
4,W.CAPE-BERGRIVER ETC,M4183,1L,18.3,APPLE GOLDEN DELICIOUS,140.0,160.0,14140.0,100,1830.0,19


In [9]:
df_test.shape

(685, 11)

In [10]:
df_train.shape

(1952, 12)

We now have 1952 rows of data and 12 variables (the original 13 minus `Date`).

Using the `describe()` method from pandas, we get the summary statistics for our data:

In [11]:
df_train.describe()


Unnamed: 0,Weight_Kg,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand,avg_price_per_kg
count,1952.0,1952.0,1952.0,1952.0,1952.0,1952.0,1952.0,1952.0
mean,40.460912,174.307377,215.648053,20053.533811,174.510758,2960.176332,408.393955,6.778893
std,99.655169,373.553578,433.546159,39005.069445,308.810797,6097.416527,724.450582,2.248744
min,3.0,2.0,5.0,5.0,1.0,3.0,0.0,0.25
25%,9.0,50.0,60.0,1325.0,12.0,219.6,9.0,5.46
50%,12.0,80.0,108.0,5495.0,64.0,853.5,126.5,6.67
75%,18.3,127.25,160.0,21082.5,200.0,3093.525,468.0,8.28
max,400.0,2300.0,3300.0,369464.0,4237.0,74000.0,6400.0,21.24


# Feature Selection


Feature selection is the process of choosing the most relevant features in your data when developing a predictive model. "Most relevant" depends on many factors. Here we consider the correlation of the features with the target variable, as well as the variance of the features. We look for the highest correlation with the target, and the features with the most variance. During this process, we remove features that do not maximize model performance.



Before we begin the process, let us take a look at the distribution of our target variable (avg_price_per_kg):

In [12]:
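# target distribution
# Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True, stat='density') is the closest replacement
sns.distplot(df_train['avg_price_per_kg'],kde=True)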




<AxesSubplot:xlabel='avg_price_per_kg', ylabel='Density'>

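The target looks roughly bell-shaped around the mean, with a slight skew to the right: in the summary statistics above, the mean (6.78) sits just above the median (6.67), and the maximum (21.24) lies far out in the right tail.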
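
The skew can also be checked numerically rather than by eye. The sketch below is a minimal illustration on made-up prices (the `prices` Series is an assumption standing in for `df_train['avg_price_per_kg']`); it uses the pandas `Series.skew()` method, where a positive value indicates a longer right tail.

```python
import pandas as pd

# Toy prices standing in for df_train['avg_price_per_kg'] (illustrative values only)
prices = pd.Series([4.5, 5.0, 5.5, 6.0, 6.5, 6.7, 7.0, 8.3, 12.0, 21.0])

# Sample skewness: > 0 means a longer right tail, < 0 a longer left tail
skewness = prices.skew()

# A mean above the median is another quick sign of right skew
mean_above_median = prices.mean() > prices.median()

print(f"skewness = {skewness:.2f}, mean > median: {mean_above_median}")
```

If the skew were strong, a log transform of the target would be one option to consider, though nothing in this notebook requires it.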

### Dummy Encoding Variables

To train a regression model, all variables need to be numeric. As we've observed, our data contains some categorical text columns (Province, Container, Size_Grade, Commodities), which we need to transform into numbers before we can train our model. To do this, we use a Pandas function called `get_dummies()`. It transforms each categorical text column into numbers by adding one indicator column for each distinct category.

In [13]:
dummy_df_train = pd.get_dummies(df_train)
dummy_df_train.head()

Unnamed: 0,Weight_Kg,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand,avg_price_per_kg,Province_CAPE,Province_EASTERN CAPE,...,Size_Grade_1M,Size_Grade_1S,Size_Grade_1U,Size_Grade_1X,Size_Grade_2L,Size_Grade_2M,Size_Grade_2S,Size_Grade_2U,Size_Grade_2X,Commodities_APPLE GOLDEN DELICIOUS
1,18.3,150.0,170.0,51710.0,332,6075.6,822,8.51,1,0,...,0,0,0,0,0,0,0,0,0,1
7,11.0,50.0,50.0,16000.0,320,3520.0,0,4.55,1,0,...,0,0,0,0,0,1,0,0,0,1
24,9.0,55.0,55.0,990.0,18,162.0,1506,6.11,0,0,...,0,0,0,0,0,0,1,0,0,1
40,18.3,80.0,120.0,32020.0,388,7100.4,443,4.51,1,0,...,0,1,0,0,0,0,0,0,0,1
69,400.0,1800.0,1800.0,1800.0,1,400.0,2,4.5,0,1,...,0,1,0,0,0,0,0,0,0,1
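
To make the encoding concrete, here is a small sketch on a toy frame (the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset). It also shows the optional `drop_first=True` argument of `pd.get_dummies`, which drops one level per categorical column; this is a common choice for linear models because a full set of indicators sums to 1 and is therefore collinear with the intercept.

```python
import pandas as pd

# Toy frame with one categorical column (values are illustrative)
toy = pd.DataFrame({
    "Size_Grade": ["1L", "2M", "2S", "1L"],
    "Weight_Kg": [18.3, 11.0, 9.0, 18.3],
})

# One indicator column per category
full = pd.get_dummies(toy, columns=["Size_Grade"])

# drop_first=True drops one level per categorical column, so the remaining
# indicators are not perfectly collinear with the intercept
reduced = pd.get_dummies(toy, columns=["Size_Grade"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

By the same reasoning, the `Commodities_APPLE GOLDEN DELICIOUS` column in the output above is constant (always 1) after filtering, so it carries no information for the model.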


In [14]:
# split data into predictors and response
X = dummy_df_train.drop('avg_price_per_kg', axis=1)
y = dummy_df_train['avg_price_per_kg']

In [15]:
# import scaler method from sklearn
from sklearn.preprocessing import StandardScaler

In [16]:
# create scaler object
scaler = StandardScaler()

In [17]:
# create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

In [18]:
# convert the scaled predictor values into a dataframe
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

Unnamed: 0,Weight_Kg,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand,Province_CAPE,Province_EASTERN CAPE,Province_NATAL,...,Size_Grade_1M,Size_Grade_1S,Size_Grade_1U,Size_Grade_1X,Size_Grade_2L,Size_Grade_2M,Size_Grade_2S,Size_Grade_2U,Size_Grade_2X,Commodities_APPLE GOLDEN DELICIOUS
0,-0.222433,-0.065087,-0.105317,0.811807,0.510117,0.511073,0.57107,0.931634,-0.343488,-0.245547,...,-0.44198,-0.612085,-0.032026,-0.30986,-0.255934,-0.367265,-0.432837,-0.093731,-0.116187,0.0
1,-0.295704,-0.332855,-0.382175,-0.10395,0.471248,0.091837,-0.563874,0.931634,-0.343488,-0.245547,...,-0.44198,-0.612085,-0.032026,-0.30986,-0.255934,2.722828,-0.432837,-0.093731,-0.116187,0.0
2,-0.315779,-0.319467,-0.370639,-0.48887,-0.506948,-0.459029,1.515476,-1.073382,-0.343488,-0.245547,...,-0.44198,-0.612085,-0.032026,-0.30986,-0.255934,-0.367265,2.310338,-0.093731,-0.116187,0.0
3,-0.222433,-0.252525,-0.220674,0.306871,0.691504,0.679187,0.047781,0.931634,-0.343488,-0.245547,...,-0.44198,1.63376,-0.032026,-0.30986,-0.255934,-0.367265,-0.432837,-0.093731,-0.116187,0.0
4,3.608756,4.353082,3.655338,-0.468098,-0.562012,-0.419986,-0.561112,-1.073382,2.91131,-0.245547,...,-0.44198,1.63376,-0.032026,-0.30986,-0.255934,-0.367265,-0.432837,-0.093731,-0.116187,0.0
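
Two details are worth noting here. The scaler above is fitted on all rows before any train/test split, and the model cells later in this notebook read from `dummy_df_train` rather than `X_standardise`. If scaling is wanted inside the modelling step, one hedged option is a scikit-learn pipeline, which fits the scaler on the training rows only. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions, not the apple data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the apple data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# The pipeline fits the scaler on the training rows only, then reuses
# those training means and standard deviations when transforming test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print(f"held-out R^2: {pipe.score(X_te, y_te):.3f}")
```

For plain least-squares regression and for decision trees, scaling does not change the predictions, so this mainly matters if a regularised or distance-based model is added later.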


### Variable Selection by Correlation and Significance
Using the dummy variable dataframe, we can build a model that predicts Average Apple Price per Kilogram (our dependent variable) as a function of 183 different independent variables (predictor variables).

Before we do this, however, we reorder columns so that our dependent variable is the last column of the dataframe. This will make a heatmap visualisation representing a correlation matrix of our data easier to interpret.

In [19]:
column_titles = [col for col in dummy_df_train.columns if col!= 'avg_price_per_kg'] + ['avg_price_per_kg']
dummy_df_train=dummy_df_train.reindex(columns=column_titles)

We need a way to guide our choice of predictors. One way is to look at the correlation between the average price per kg and each variable in our DataFrame, and select those with the strongest correlations (both positive and negative).

We also need to consider how significant those features are.

The code below will create a new DataFrame and store the correlation coefficients and p-values in that DataFrame for reference.

In [20]:
# Calculate correlations between predictor variables and the response variable
corrs = dummy_df_train.corr()['avg_price_per_kg'].sort_values(ascending=False)

In [21]:
corrs1 = pd.DataFrame(dummy_df_train.corr()['avg_price_per_kg']).rename(columns = {'avg_price_per_kg':'Correlation'})
corrs1.sort_values(by='Correlation',ascending=False, inplace=True)
corrs1

Unnamed: 0,Correlation
avg_price_per_kg,1.0
Container_M4183,0.403229
Size_Grade_1L,0.280966
Province_W.CAPE-BERGRIVER ETC,0.262051
Size_Grade_1X,0.251451
Container_EC120,0.188162
Size_Grade_1M,0.175779
Container_EF120,0.114297
Sales_Total,0.108473
Stock_On_Hand,0.105028


Using [Pearson correlation](http://sites.utexas.edu/sos/guided/inferential/numeric/bivariate/cor/) from SciPy:

In [22]:
from scipy.stats import pearsonr

# Build a dictionary of correlation coefficients and p-values
dict_cp = {}

column_titles = [col for col in corrs.index if col!= 'avg_price_per_kg']
for col in column_titles:
    p_val = round(pearsonr(dummy_df_train[col], dummy_df_train['avg_price_per_kg'])[1],6)
    dict_cp[col] = {'Correlation_Coefficient':corrs[col],
                    'P_Value':p_val}
    
df_cp = pd.DataFrame(dict_cp).T
df_cp_sorted = df_cp.sort_values('P_Value')
df_cp_sorted[df_cp_sorted['P_Value']<0.1] #significance level



Unnamed: 0,Correlation_Coefficient,P_Value
Container_M4183,0.403229,0.0
Size_Grade_2S,-0.352996,0.0
Weight_Kg,-0.337886,0.0
Container_JE090,-0.322235,0.0
Province_EASTERN CAPE,-0.178531,0.0
High_Price,-0.164496,0.0
Size_Grade_2M,-0.153372,0.0
Container_AC030,-0.144427,0.0
Low_Price,-0.14174,0.0
Container_JG110,-0.140148,0.0


Now, we've got a sorted list of the p-values and correlation coefficients for each of the features, when considered on their own.  

If we were to use a logic test with a significance value of 5% (p-value < 0.05), we could infer that the features shown above are statistically significant, including:

* Container_M4183
* Size_Grade_2S
* Weight_Kg
* Container_JE090
* Province_EASTERN CAPE
* High_Price
* Size_Grade_2M
* Container_AC030
* Low_Price
* Container_JG110

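Ideally we would keep only the variables that have a significant correlation with the dependent variable. In the cell below, the independent-variable DataFrame `X_data` still holds every predictor; we first use it to check the predictors for collinearity.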

In [23]:
# The dependent variable remains the same:
y_data = dummy_df_train['avg_price_per_kg']  # y_name = ''


X_data = dummy_df_train.drop('avg_price_per_kg', axis=1)

In [24]:
# As before, we create the correlation matrix
# and find rows and columns  where correlation coefficients > 0.9 or <-0.9
corr1 = X_data.corr()
r, c = np.where(np.abs(corr1) > 0.9)

# We are only interested in the off diagonal entries:
off_diagonal = np.where(r != c)

# Show the correlation matrix rows and columns where we have highly correlated off diagonal entries:
corr1.iloc[r[off_diagonal], c[off_diagonal]]

Unnamed: 0,High_Price,Container_IA400,High_Price.1,Weight_Kg,Low_Price,Container_IA400.1,Total_Kg_Sold,Sales_Total,Weight_Kg.1,High_Price.2
Weight_Kg,0.905852,0.999231,0.905852,1.0,0.863182,0.999231,0.294117,0.180518,1.0,0.905852
Weight_Kg,0.905852,0.999231,0.905852,1.0,0.863182,0.999231,0.294117,0.180518,1.0,0.905852
Low_Price,0.93814,0.860219,0.93814,0.863182,1.0,0.860219,0.269744,0.18323,0.863182,0.93814
High_Price,1.0,0.902518,1.0,0.905852,0.93814,0.902518,0.372282,0.265672,0.905852,1.0
High_Price,1.0,0.902518,1.0,0.905852,0.93814,0.902518,0.372282,0.265672,0.905852,1.0
High_Price,1.0,0.902518,1.0,0.905852,0.93814,0.902518,0.372282,0.265672,0.905852,1.0
Sales_Total,0.265672,0.172753,0.265672,0.180518,0.18323,0.172753,0.962338,1.0,0.180518,0.265672
Total_Kg_Sold,0.372282,0.288659,0.372282,0.294117,0.269744,0.288659,1.0,0.962338,0.294117,0.372282
Container_IA400,0.902518,1.0,0.902518,0.999231,0.860219,1.0,0.288659,0.172753,0.999231,0.902518
Container_IA400,0.902518,1.0,0.902518,0.999231,0.860219,1.0,0.288659,0.172753,0.999231,0.902518


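The output above shows that some predictors are highly correlated with one another: `Weight_Kg` with `Container_IA400` (0.999) and `High_Price` (0.906), `High_Price` with `Low_Price` (0.938) and `Container_IA400` (0.903), and `Sales_Total` with `Total_Kg_Sold` (0.962). These pairs carry largely redundant information, which can make linear regression coefficients unstable. We will now build a few models to see which performs best. The models below are trained on all of the dummy-encoded features (without Date).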

### Variable Selection by Variance Thresholds
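
This section has no worked cell yet. As a rough sketch of what a variance threshold could look like, assuming scikit-learn's `VarianceThreshold` selector is the intended tool, the example below removes zero-variance columns from a toy frame (the `toy` DataFrame and its `constant` column are hypothetical).

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy dummy-encoded frame (illustrative values): 'constant' never changes
toy = pd.DataFrame({
    "Weight_Kg": [18.3, 11.0, 9.0, 18.3, 400.0],
    "Size_Grade_1S": [0, 0, 0, 1, 1],
    "constant": [1, 1, 1, 1, 1],
})

# threshold=0.0 removes only zero-variance columns; larger values are a judgment call
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.columns[selector.get_support()].tolist()
print(kept)
```

Variance depends on the scale of each feature, so a non-zero threshold is only comparable across columns that share a scale (for example, 0/1 indicator columns).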



### Making Predictions

In [25]:
from sklearn.tree import DecisionTreeRegressor

In [26]:
# Function to fit data, make predictions, and evaluate model
def rmse(y_test, y_pred):
    return np.sqrt(mean_squared_error(y_test, y_pred))

def r_squared(y_test, y_pred):
    return r2_score(y_test, y_pred)
    
# Takes in a model, trains the model, and evaluates the model on the test set
def fit_and_evaluate(model,X,y):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1,random_state=42)
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions and evaluate
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    
    train_rmse = rmse(y_train, train_pred)
    test_rmse = rmse(y_test, test_pred)
    
    train_r2 = r_squared(y_train, train_pred)
    test_r2 = r_squared(y_test, test_pred)
    
    df = {'Train RMSE': train_rmse,'Train R^2':train_r2,'Test RMSE': test_rmse,'Test R^2':test_r2}
    
    return df
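
For reference, the two metrics reported by `fit_and_evaluate` can be reproduced by hand. The sketch below uses a small made-up example (`y_true` and `y_hat` are assumptions) to show that RMSE is the square root of the mean squared residual and that R^2 is one minus the ratio of the residual to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example
y_true = np.array([6.0, 7.0, 8.0, 9.0])
y_hat = np.array([6.5, 6.5, 8.5, 8.5])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both should agree with the scikit-learn helpers used in the notebook
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, r2_manual)
```

RMSE is expressed in the units of the target (here, price per kilogram), while R^2 is unitless, which is why the notebook reports both.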

In [27]:
# Our model will be the same throughout, so will our y
model = LinearRegression(n_jobs=-1)


In [28]:
# fit the model to the data and make predictions - all features (Without date)
X = dummy_df_train.drop('avg_price_per_kg',axis=1).values
y = dummy_df_train['avg_price_per_kg']

fit_and_evaluate(model,X,y)

{'Train RMSE': 1.3971396533150298,
 'Train R^2': 0.6113269540747905,
 'Test RMSE': 1.4621001727242349,
 'Test R^2': 0.5996181268612967}

In [29]:
# Instantiate regression tree model
regr_tree = DecisionTreeRegressor(max_depth= 11, min_samples_leaf= 7,random_state=5) # Pruned via max_depth and min_samples_leaf (hyperparameter tuning)


# fit the model to the data and make predictions - all features, Date excluded (including Date gave a higher RMSE)
X = dummy_df_train.drop('avg_price_per_kg',axis=1).values
y = dummy_df_train['avg_price_per_kg']

fit_and_evaluate(regr_tree,X,y)

{'Train RMSE': 0.5538967778902949,
 'Train R^2': 0.9389110776958863,
 'Test RMSE': 0.6015015390875093,
 'Test R^2': 0.932236842215795}

In [30]:
# fit the linear model - intended as all features without Date and Province; as written, X still includes the Province dummies, so the results match In [28]
X = dummy_df_train.drop('avg_price_per_kg',axis=1).values
y = dummy_df_train['avg_price_per_kg']

fit_and_evaluate(model,X,y)

{'Train RMSE': 1.3971396533150298,
 'Train R^2': 0.6113269540747905,
 'Test RMSE': 1.4621001727242349,
 'Test R^2': 0.5996181268612967}

In [31]:
df_train.shape,df_test.shape

((1952, 12), (685, 11))

In [32]:
#df_test= test[test['Commodities']=='APPLE GOLDEN DELICIOUS'].drop('Date', axis=1)

In [33]:
X_real = pd.get_dummies(df_test)
X_real.head()

Unnamed: 0,Weight_Kg,Low_Price,High_Price,Sales_Total,Total_Qty_Sold,Total_Kg_Sold,Stock_On_Hand,Province_CAPE,Province_EASTERN CAPE,Province_NATAL,...,Size_Grade_1M,Size_Grade_1S,Size_Grade_1U,Size_Grade_1X,Size_Grade_2L,Size_Grade_2M,Size_Grade_2S,Size_Grade_2U,Size_Grade_2X,Commodities_APPLE GOLDEN DELICIOUS
0,12.0,128.0,136.0,5008.0,38,456.0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,18.3,220.0,220.0,1760.0,8,146.4,2,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,12.0,120.0,120.0,720.0,6,72.0,45,0,0,0,...,0,1,0,0,0,0,0,0,0,1
3,18.3,160.0,160.0,160.0,1,18.3,8,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,18.3,140.0,160.0,14140.0,100,1830.0,19,0,0,0,...,0,0,0,0,0,0,0,0,0,1
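
One caveat with encoding the test set separately: `get_dummies` only creates columns for categories that actually occur, so the test matrix can end up with different columns, or a different column order, from the training matrix. Since the tree was fitted on a plain array, columns are matched by position. A minimal sketch of one way to guard against this, using toy frames (`train_toy`, `test_toy` are assumptions), is to reindex the test columns to the training columns.

```python
import pandas as pd

# Toy train/test frames (illustrative): test lacks category 'B' and has unseen 'C'
train_toy = pd.DataFrame({"Container": ["A", "B", "A"], "Weight_Kg": [12.0, 18.3, 9.0]})
test_toy = pd.DataFrame({"Container": ["A", "C"], "Weight_Kg": [12.0, 18.3]})

X_train_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)

# Force the test matrix to have exactly the training columns, in the same order:
# columns missing from test are filled with 0, unseen test-only columns are dropped
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(X_test_aligned.columns.tolist())
```

In this notebook the equivalent step would reindex `X_real` to the predictor columns of `dummy_df_train`; whether it changes anything depends on whether the two sets of categories already match.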


In [34]:
regr_tree_preds = regr_tree.predict(X_real) # based on db without date

In [35]:
regr_tree_preds

array([10.615     , 13.06384615,  9.99583333,  8.76      ,  8.36714286,
        4.01      ,  3.41454545,  7.61428571,  8.59666667,  9.29      ,
        4.752     ,  5.67181818,  8.94714286,  4.752     ,  4.497     ,
        6.11      ,  5.50307692,  4.47333333, 10.17555556,  5.80727273,
        5.97857143,  4.7       ,  5.467     ,  7.78      ,  5.56      ,
        6.418     ,  7.43416667,  7.89875   ,  7.19      ,  7.19      ,
        7.89875   ,  5.66076923, 10.03285714,  7.65      ,  4.66142857,
        9.83727273, 10.84666667, 11.53272727,  6.39428571,  6.45571429,
        6.64083333,  6.79153846,  5.50307692,  6.19461538,  5.56      ,
        5.97857143,  5.        ,  5.65428571,  6.11      ,  4.66142857,
        2.10153846,  5.94285714,  7.542     , 13.06384615,  8.48375   ,
        6.19375   , 12.22857143,  8.94714286,  7.56538462,  5.9575    ,
        7.82142857, 11.53272727,  6.19461538,  9.04823529,  6.79153846,
        5.95416667,  7.01076923,  4.79142857,  5.67181818,  6.65

In [36]:
# create submission dataframe
# Create DataFrame of Index and predicted avg_price_per_kg
submission = pd.DataFrame(
    {'Index': test_ID.loc[df_test.index],  # keep IDs aligned with the filtered test rows
     'avg_price_per_kg':regr_tree_preds
    })

In [37]:
submission


Unnamed: 0,Index,avg_price_per_kg
0,1,10.615000
1,2,13.063846
2,3,9.995833
3,4,8.760000
4,5,8.367143
...,...,...
680,681,4.430909
681,682,8.797143
682,683,6.110000
683,684,7.630000


In [38]:
submission.to_csv("testing(DecTree).csv", index=False)

In [None]:
save_path = r'C:\Users\27732\regression-apples-predict-api-template\assets\trained-models\regr_tree.pkl'
print(f"Training completed. Saving model to: {save_path}")
# Use a context manager so the file is closed even if pickling fails
with open(save_path, 'wb') as f:
    pickle.dump(regr_tree, f)
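
Before handing a pickled model to an API, it can be worth checking that it loads and predicts as expected. The sketch below is a self-contained round trip on a tiny stand-in model; `demo_tree`, `demo_path`, and the toy data are hypothetical names used only for illustration, and a temporary directory replaces the notebook's `save_path`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny stand-in model (the notebook's regr_tree would be used instead)
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = X_demo.ravel() * 2.0
demo_tree = DecisionTreeRegressor(max_depth=3, random_state=5).fit(X_demo, y_demo)

# Hypothetical temporary path, used here instead of the notebook's save_path
demo_path = os.path.join(tempfile.mkdtemp(), "demo_tree.pkl")

with open(demo_path, "wb") as f:
    pickle.dump(demo_tree, f)

with open(demo_path, "rb") as f:
    loaded_tree = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_tree.predict(X_demo), loaded_tree.predict(X_demo))
print(same_predictions)
```

Pickled scikit-learn models are generally tied to the library version that produced them, so the loading environment should use a matching scikit-learn version.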