
**Problem Statement:**

Avocados are consumed heavily in the United States.

**Content**
This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV. 

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. 

Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. 

The Product Lookup codes (PLU’s) ('4046','4225','4770') in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

Some relevant columns in the dataset:

![](data_image.PNG)

# Importing Libraries

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from sklearn.decomposition import PCA
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# GridSearchCV is needed by the hyperparameter tuning cells further down
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

%matplotlib inline
plt.style.use("ggplot")
warnings.filterwarnings('ignore')  # hide library warnings in the notebook output

# Importing Datasets

In [None]:
adf = pd.read_csv('avocado.csv')
adf.head(10)

# Data Analysis

In [None]:
adf.shape

In [None]:
adf.isnull().sum()

Many rows contain null values, so we will drop every row that has at least one missing value.

In [None]:
adf = adf.dropna(how = 'any' , axis = 0)
adf.isnull().sum()

In [None]:
adf.shape

In [None]:
adf

In [None]:
adf.columns

Renaming some columns for our further processing.

In [None]:
adf = adf.rename(columns = {'4046' : 'Small_Medium' , '4225' : 'Large' , '4770' : 'Extra_Large' })
adf.columns

In [None]:
adf['Unnamed: 0'].value_counts()

In [None]:
adf.info()

Describing the dataset to obtain the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

In [None]:
adf.describe()

# Data Preprocessing

Dropping the `Unnamed: 0` column, as it is only a leftover index and carries no information.

In [None]:
adf = adf.drop('Unnamed: 0', axis = 1)

In [None]:
print(adf.shape)
print(adf.columns)

Extracting Information about the dataset 

In [None]:
adf.info()

Here we can see that Date, type and region are of object type, so we need to convert them before modelling.

**Date**

The date is stored as an object (string), so we will convert it to datetime and extract the year, month and day from it.

In [None]:
adf['Date'] = pd.to_datetime(adf['Date'])
# The .dt accessor extracts date parts without a row-by-row apply
adf['Sold_Year'] = adf['Date'].dt.year
adf['Sold_Month'] = adf['Date'].dt.month
adf['Sold_Day'] = adf['Date'].dt.day


In [None]:
adf.info()

Plotting a graph to visualise how **Average Price** varies with **Date**.

In [None]:
# Average only the price column; 'type' and 'region' are still strings at this point
dategroup = adf.groupby('Date')['AveragePrice'].mean()
plt.figure(figsize=(12,5))
dategroup.plot(kind = 'line' , colormap = 'PuBuGn_r')
plt.title('Average Price')

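Plotting the **Average price** for each **region**

In [None]:
# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the same plot type
g = sns.catplot(x='AveragePrice', y='region', data=adf,
                kind='point',
                hue='year',
                height=10,
                aspect=0.9,
                palette='GnBu_r',
                join=False,  # on seaborn >= 0.13 use linestyle='none' instead
               )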

Dropping the Date column, as we have already extracted the year, month and day from it. We will not use the Date column for our predictions.

In [None]:
adf = adf.drop('Date', axis = 1)

In [None]:
adf

In [None]:
adf.dtypes

**Type**

In [None]:
adf['type']

In [None]:
adf['type'].value_counts()

In the type column all the avocados are conventional, so we can drop this column because every value is the same.

In [None]:
adf = adf.drop('type' , axis = 1)

In [None]:
adf

**Region**

In [None]:
adf.columns

In [None]:
adf['region']

In [None]:
adf['region'].value_counts()

The region column holds the names of various regions, so we need to encode it.

We will use Label Encoder instead of One-Hot Encoder, because one-hot encoding would create a separate column for every region.

In [None]:
adf['region']

In [None]:
encoder = LabelEncoder()
adf['region'] = encoder.fit_transform(adf['region'])

In [None]:
adf['region'].unique()

In [None]:
adf_copy = adf["region"]
reg_dup = adf_copy.drop_duplicates()
reg_dup

The output above has two columns: the first is the row index at which each region first appears, and the second is the integer code that the region name was converted to.
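To see which region name each integer stands for, the fitted encoder can be queried directly. The following is a minimal sketch on made-up region names (the values are illustrative and not taken from the avocado data); it relies only on `LabelEncoder.classes_` and `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy region names, invented for illustration only.
regions = pd.Series(["Boston", "Albany", "Chicago", "Albany"])

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the original labels in sorted order; position i is the label encoded as i.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original names.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

In the notebook the same calls would apply to the `encoder` object fitted on `adf['region']`.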

In [None]:
adf['region'].value_counts()

In [None]:
adf['region']

In [None]:
adf.info()

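Counting the number of non-zero values in each column of the dataframe. `astype(bool)` turns every zero into `False`, so any count below the number of rows means the column contains zeros.

In [None]:
adf.astype(bool).sum(axis=0)

From the above we can see that the columns Extra_Large, Large Bags and XLarge Bags have some zero values, so we need to deal with them.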
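Because `astype(bool)` maps 0 to `False` and every other number to `True`, summing the result counts non-zero entries rather than zeros. A minimal sketch on a made-up frame (column names and values are illustrative only) contrasts that count with a direct zero count:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"a": [0, 1, 2, 0], "b": [5, 0, 7, 8]})

nonzero_counts = toy.astype(bool).sum(axis=0)  # non-zero entries per column
zero_counts = (toy == 0).sum(axis=0)           # zero entries per column
zero_share = zero_counts / len(toy) * 100      # percentage of zeros per column

print(nonzero_counts.to_dict())
print(zero_counts.to_dict())
print(zero_share.to_dict())
```

The two counts always add up to the number of rows, so either one can be used as long as it is labelled correctly.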

In [None]:
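# astype(bool) maps 0 to False, so this sum is the number of NON-zero entries
nonzero_in_XL = adf['XLarge Bags'].astype(bool).sum(axis = 0)

percentage_nonzero = (nonzero_in_XL/len(adf))*100
percentage_nonzero

From the above we can observe that only about 47% of the XLarge Bags values are non-zero, i.e. more than half of the column is zeros, so we will drop it.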

In [None]:
adf = adf.drop('XLarge Bags' , axis = 1)

In [None]:
adf.head()

**Exploratory Data Analysis**

Plotting a boxplot to visualise outliers in Average Price for each year.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10,6))
sns.boxplot(x='Sold_Year',y='AveragePrice',data=adf,color='darkcyan')

*LMPLOT*

This function combines regplot() and FacetGrid. It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.

Plotting lmplot to visualise the effect of each predictor variable on the target variable ('Average Price').

In [None]:
adf.columns

In [None]:
for i in adf:
    sns.lmplot(x = i , y ='AveragePrice' , data = adf )

**Handling Outliers**

![](Boxplot.jpg)

Plotting Boxplot for visualising Outliers.

In [None]:
collist = adf.columns.values
ncol = 4
nrows = 4
plt.figure(figsize=(16, 9))
for i, col in enumerate(collist):
    plt.subplot(nrows, ncol, i + 1)
    sns.boxplot(x=adf[col], color='darkcyan', orient='h')
plt.tight_layout()  # adjust spacing once, after all subplots are drawn

Finding the absolute z-score of every value in our dataframe.

In [None]:
from scipy.stats import zscore
z = np.abs(zscore(adf))
z

Keeping only the rows whose absolute z-score is below 3 in every column.

In [None]:
adf_wo = adf[(z<3).all(axis = 1)]
adf_wo

Here we can see that the number of rows has been reduced from 1517 to 1471.
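The filter works column by column: a value's z-score is its distance from the column mean measured in standard deviations, and a row survives only if every one of its values is within 3 standard deviations. A minimal sketch on synthetic data (the frame and the injected outlier are invented for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic data, invented for illustration: 99 ordinary rows plus one extreme row.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(0, 1, 99), "y": rng.normal(10, 2, 99)})
toy.loc[len(toy)] = [50.0, 10.0]  # an obvious outlier in column "x", stored at index 99

z = np.abs(zscore(toy))       # |value - column mean| / column std
mask = (z < 3).all(axis=1)    # True only if every column of the row is within 3 std
filtered = toy[mask]

print(len(toy), len(filtered))
```

Note that a single extreme value inflates the column's standard deviation, so the threshold is applied relative to statistics that the outliers themselves influence.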

In [None]:
adf_wo.describe()

**Feature Selection** 

Earlier we dropped three columns: Date, type and XLarge Bags. We can also observe that there are now two year columns in our dataset (`year` and `Sold_Year`), so we need to drop one of them.

In [None]:
adf_wo = adf_wo.drop('year' , axis = 1)
adf_wo

**Skewness**

In [None]:
x_predictor = adf_wo.drop('AveragePrice', axis = 1)
x_predictor

In [None]:
x_predictor.skew()

Plotting distplot to visualise the skewness in every column.

*Distplot*

This function combines the matplotlib hist function (with automatic calculation of a good default bin size) with the seaborn kdeplot() and rugplot() functions.

In [None]:
for feature in x_predictor :
    sns.distplot(x_predictor[feature] , kde = True , color = 'darkcyan' )
    plt.xlabel(feature)
    plt.ylabel("count")
    plt.title(feature)
    plt.show()

In [None]:
from sklearn.preprocessing import PowerTransformer
powert = PowerTransformer( method = 'yeo-johnson' , standardize = False)
x_t = powert.fit_transform(x_predictor)

In [None]:
x_t

In [None]:
x_trans = pd.DataFrame(x_t , columns = x_predictor.columns)
x_trans

In [None]:
for feature in x_trans :
    sns.distplot(x_trans[feature] , kde = True , color = 'darkcyan' )
    plt.xlabel(feature)
    plt.ylabel("count")
    plt.title(feature)
    plt.show()

In [None]:
x_trans.describe()

In [None]:
x_trans.skew()

A common rule of thumb: if the skewness is between -0.5 and 0.5, the data are fairly symmetrical. If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness is less than -1 or greater than 1, the data are highly skewed.
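The rule of thumb can be turned into a small helper and used to check what the Yeo-Johnson transform does to a skewed column. This is a minimal sketch: `skew_label` is a hypothetical helper name, and the log-normal data is synthetic, not the avocado data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

def skew_label(value):
    """Classify a skewness value using the rule of thumb quoted above."""
    if abs(value) <= 0.5:
        return "fairly symmetrical"
    if abs(value) <= 1:
        return "moderately skewed"
    return "highly skewed"

# Synthetic right-skewed data, invented for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"volume": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

pt = PowerTransformer(method="yeo-johnson", standardize=False)
transformed = pd.DataFrame(pt.fit_transform(toy), columns=toy.columns)

before = toy["volume"].skew()
after = transformed["volume"].skew()
print(before, skew_label(before))
print(after, skew_label(after))
```

The same helper could be applied to `x_predictor.skew()` and `x_trans.skew()` to summarise the effect per column.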

**Standard Scaler**

Checking the range (maximum minus minimum) of each column

In [None]:
for i in x_trans :
    print(i , max(x_trans[i]) - min(x_trans[i]))

Standard scaling rescales each column to zero mean and unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_s = scaler.fit_transform(x_trans)
x_s

In [None]:
x_sc = pd.DataFrame(x_s , columns = x_trans.columns)
x_sc
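Standard scaling subtracts each column's mean and divides by its standard deviation. A minimal sketch on a made-up matrix (values are illustrative only) shows that the manual formula matches `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix, invented for illustration only.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

A column whose values are all identical has zero standard deviation; `StandardScaler` leaves such a column centred at zero instead of dividing by zero.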

**From the above we can observe that some of the columns are scaled to zero. Even so, the cells below continue with the scaled frame `x_sc` for the correlation heatmap and PCA.**

**PCA**

In [None]:
plt.figure(figsize = (16,9))
sns.heatmap(x_sc.corr() , annot = True , fmt = '.2%' , cmap = 'YlGnBu_r') 
plt.show()

Here we can see that many of the columns are strongly correlated with each other, so we will use PCA (principal component analysis).

In [None]:
pca = PCA(n_components = 'mle' , svd_solver = 'full' )
xpca = pca.fit_transform(x_sc)

In [None]:
xpca

In [None]:
x_f = pd.DataFrame(xpca )
x_f

In [None]:
print(pca.components_)
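PCA replaces correlated columns with uncorrelated components ordered by how much variance they explain. The sketch below uses synthetic data and a fixed `n_components=2` for readability (an assumption; the notebook lets `'mle'` choose the number), and inspects `explained_variance_ratio_`:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic features, invented for illustration: columns 0 and 1 are strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(200, 1)),
               rng.normal(size=(200, 1))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # share of total variance kept by each component

# The resulting components are uncorrelated with each other.
component_corr = np.corrcoef(X_reduced, rowvar=False)
print(np.round(component_corr, 3))
```

On the notebook's fitted `pca` object, `pca.n_components_` and `pca.explained_variance_ratio_` would show how many components `'mle'` kept and how much variance they retain.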

In [None]:
y = adf_wo['AveragePrice']  # target column (the first column of adf_wo)
y

# Finding the best Random State

In [None]:
best_score=0
randomState=None  # stays None if no split scores above 0
for i in range(2200):
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, train_size = 0.7 , test_size = 0.3, random_state = i)
    lr = LinearRegression()
    lr.fit(X_train,y_train)
    y_predicted = lr.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
    
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f , y, train_size=0.7, test_size=0.3, random_state= 1117)

model_reg = [LinearRegression,RandomForestRegressor, SVR, DecisionTreeRegressor,KNeighborsRegressor, GradientBoostingRegressor,
             ExtraTreesRegressor ,AdaBoostRegressor , Lasso , Ridge , ElasticNet ]


for model in model_reg:
    m = model()
    print('\nModel: ', m)
    m.fit(X_train, y_train)
    train_score = m.score(X_train, y_train)   # R2 on the training split
    test_score = m.score(X_test, y_test)      # R2 on the held-out split
    print('\n-->Score:', test_score)
    scr_cross = cross_val_score(m, x_f, y, cv=5)  # 5-fold R2 on the full data
    scr_mean = scr_cross.mean()
    print('Cross validation score: ', scr_mean)
    print('Difference between training score and cross validation score: ', train_score - scr_mean)
    y_predicted = m.predict(X_test)
    print('Mean Absolute Error: ', mean_absolute_error(y_test, y_predicted))
    print('R2 Score', r2_score(y_test, y_predicted))
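The metrics printed in this notebook (MAE, MSE, RMSE and R2) can each be computed by hand from the residuals. A minimal sketch on made-up targets and predictions (values are illustrative only), checked against the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions, invented for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mae, mse, rmse, r2)
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

For regressors, `m.score(X, y)` returns the same R2 as `r2_score(y, m.predict(X))`.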

We will use the **Random Forest Regressor**, **SVR**, **Decision Tree Regressor**, **K Nearest Neighbors Regressor**, **Gradient Boost Regressor**, **Extra Trees Regressor** and **AdaBoost Regressor** for our further predictions.

**Random Forest Regressor**

In [None]:
best_score=0
for i in range(2200):
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, train_size = 0.7 , test_size = 0.3, random_state = i)
    rf = RandomForestRegressor()
    rf.fit(X_train,y_train)
    y_predicted = rf.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
    
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.3, random_state = 40)

rf = RandomForestRegressor()
rf.fit(X_train,y_train)
rf_pred=rf.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, rf_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, rf_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, rf_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, rf_pred))))
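The tuning cells in this notebook pass a `BlockingTimeSeriesSplit` object to `GridSearchCV`, but that class is not part of scikit-learn and is not defined anywhere in the notebook. The following is a minimal sketch of what such a splitter might look like, under the assumption that it cuts the rows into `n_splits` consecutive blocks and uses the first part of each block for training and the rest for validation. The `train_fraction` parameter is hypothetical and the author's actual definition may differ.

```python
import numpy as np

class BlockingTimeSeriesSplit:
    """Hypothetical splitter: consecutive, non-overlapping blocks of rows."""

    def __init__(self, n_splits=5, train_fraction=0.8):
        self.n_splits = n_splits
        self.train_fraction = train_fraction  # assumed share of each block used for training

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        block_size = n_samples // self.n_splits
        indices = np.arange(n_samples)
        for i in range(self.n_splits):
            start = i * block_size
            stop = start + block_size
            mid = start + int(self.train_fraction * block_size)
            # Train on the early part of the block, validate on the late part.
            yield indices[start:mid], indices[mid:stop]

# Usage on 20 dummy rows: five blocks of four rows each.
splits = list(BlockingTimeSeriesSplit(n_splits=5).split(np.zeros((20, 2))))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Any object with `split` and `get_n_splits` methods can be passed as `cv`. Since `train_test_split` shuffles the rows, the blocks here are no longer in time order, so a plain `cv=5` may serve equally well.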

In [None]:
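# RandomForest Regressor Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor()
# BlockingTimeSeriesSplit is not part of scikit-learn; it is a custom splitter
# that must be defined in the session before this cell runs.
btss = BlockingTimeSeriesSplit(n_splits=5)

rfr_params = {"n_estimators":[100,200],
              "criterion":['squared_error'],  # called 'mse' before scikit-learn 1.0
              "max_depth":[6,8,10,20],
             }
gsRFC = GridSearchCV(rf,param_grid = rfr_params, cv=btss, scoring="r2", n_jobs= -1, verbose = 1)

gsRFC.fit(X_train,y_train)

RFC_best = gsRFC.best_estimator_

gsRFC.best_params_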

In [None]:
sns.regplot(x=y_test, y=rf_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

**SVR**

In [None]:
best_score=0
for i in range(2200):
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = i)
    sv = SVR()
    sv.fit(X_train,y_train)
    y_predicted = sv.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
        
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = 40)

sv = SVR()
sv.fit(X_train , y_train)
sv_pred = sv.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, sv_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, sv_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, sv_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, sv_pred))))

In [None]:
sns.regplot(x=y_test, y=sv_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

**Decision Tree Regressor**

In [None]:
best_score=0
for i in range(2200):
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = i)
    dtr = DecisionTreeRegressor()
    dtr.fit(X_train,y_train)
    y_predicted = dtr.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
        
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = 98)

dtr = DecisionTreeRegressor()
dtr.fit(X_train , y_train)
dtr_pred = dtr.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, dtr_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, dtr_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, dtr_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, dtr_pred))))

In [None]:
sns.regplot(x=y_test, y=dtr_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

**K Nearest Neighbours(KNN)**

In [None]:
best_score=0
for i in range(2200):
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, train_size = 0.7 , test_size = 0.3, random_state = i)
    kn = KNeighborsRegressor()
    kn.fit(X_train,y_train)
    y_predicted = kn.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
    
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = 29)

knn = KNeighborsRegressor()
knn.fit(X_train , y_train)
knn_pred = knn.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, knn_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, knn_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, knn_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, knn_pred))))

In [None]:
sns.regplot(x=y_test, y=knn_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

**GradientBoostRegressor**

In [None]:
best_score=0
for i in range(2200):
    
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, train_size = 0.7 , test_size = 0.3, random_state = i)
    
    gbr = GradientBoostingRegressor()
    gbr.fit(X_train,y_train)
    y_predicted = gbr.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
    
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = 40)

gbr = GradientBoostingRegressor()
gbr.fit(X_train , y_train)
gbr_pred = gbr.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, gbr_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, gbr_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, gbr_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, gbr_pred))))

In [None]:
sns.regplot(x=y_test, y=gbr_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

**ExtraTreesRegressor**

In [None]:
best_score=0
for i in range(2200):
    
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, train_size = 0.7 , test_size = 0.3, random_state = i)
    
    etr = ExtraTreesRegressor()
    etr.fit(X_train,y_train)
    y_predicted = etr.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
    
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = 26)

etr = ExtraTreesRegressor()
etr.fit(X_train , y_train)
etr_pred = etr.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, etr_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, etr_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, etr_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, etr_pred))))

In [None]:
# Extra Trees Regressor Hyperparameter Tuning

etr = ExtraTreesRegressor()
# BlockingTimeSeriesSplit is a custom splitter, not a scikit-learn class;
# it must be defined in the session before this cell runs.
btss = BlockingTimeSeriesSplit(n_splits=5)

ETR_params = {"n_estimators":[100,200],
              "criterion":['squared_error'],  # called 'mse' before scikit-learn 1.0
              "max_depth":[6,8,10,20],
             }
gsETR = GridSearchCV(etr,param_grid = ETR_params, cv=btss, scoring="r2", n_jobs= -1, verbose = 1)

gsETR.fit(X_train,y_train)

ETR_best = gsETR.best_estimator_

gsETR.best_params_

In [None]:
sns.regplot(x=y_test, y=etr_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

**AdaBoostRegressor**

In [None]:
best_score=0
for i in range(2200):
    
    X_train, X_test, y_train, y_test = train_test_split(x_f, y, train_size = 0.7 , test_size = 0.3, random_state = i)
    
    abr = AdaBoostRegressor()
    abr.fit(X_train,y_train)
    y_predicted = abr.predict(X_test)
    b_score= r2_score(y_test ,y_predicted )
    if b_score>best_score:
        best_score=b_score
        randomState=i
    
print('Best Score = {} For Random state = {}'.format(best_score*100,randomState))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x_f, y, test_size = 0.2, random_state = 26)

abr = AdaBoostRegressor()
abr.fit(X_train , y_train)
abr_pred = abr.predict(X_test)
print('The R2 score={}'.format(r2_score(y_test, abr_pred)*100))
print('The MSE ={}'.format(mean_squared_error(y_test, abr_pred)))
print('The MAE ={}'.format(mean_absolute_error(y_test, abr_pred)))
print('The RMSE ={}'.format(np.sqrt(mean_squared_error(y_test, abr_pred))))

In [None]:
# AdaBoost Regressor Hyperparameter Tuning

# The keyword was named base_estimator before scikit-learn 1.2
abr = AdaBoostRegressor(estimator=DecisionTreeRegressor())

ABR_params = {"n_estimators":[10,50,100,200],
              "learning_rate":[0.1,0.5,1,2],
              "loss":["linear","square","exponential"],
             }
# btss is the splitter created in the Extra Trees tuning cell
gsAB = GridSearchCV(abr,param_grid = ABR_params, cv=btss, scoring="r2", n_jobs= -1, verbose = 1)

gsAB.fit(X_train,y_train)

AB_best = gsAB.best_estimator_
gsAB.best_params_  # the attribute carries a trailing underscore

In [None]:
sns.regplot(x=y_test, y=abr_pred , scatter = True , color = 'darkcyan')  # actual vs predicted price

From the above we can see that the **Extra Trees Regressor** has the best scores,
so we will use this algorithm for our predictions.

In [None]:
import joblib
joblib.dump(etr,'ExtraTreesRegression.pkl')
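A saved model can be restored later with `joblib.load`. The sketch below is a round trip on synthetic data, with a temporary directory and a hypothetical file name standing in for `ExtraTreesRegression.pkl`:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic regression data, invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The restored model reproduces the original predictions.
same = np.allclose(model.predict(X), loaded.predict(X))
print(same)
```

New inputs would need to pass through the same fitted `PowerTransformer`, `StandardScaler` and `PCA` objects before being given to the loaded model, so those objects may be worth saving as well.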