# AVOCADO
The avocado (Persea americana) is a tree originating in the Americas which is likely native to the highland regions of south-central Mexico to Guatemala. It is classified as a member of the flowering plant family Lauraceae. The fruit of the plant, also called an avocado (or avocado pear or alligator pear), is botanically a large berry containing a single large seed. Avocado trees are partially self-pollinating and are often propagated through grafting to maintain predictable fruit quality and quantity.

Avocados are cultivated in tropical and Mediterranean climates of many countries, with Mexico as the leading producer of avocados in 2019, supplying 32% of the world total.

The fruit of domestic varieties has a buttery flesh when ripe. Depending on the variety, avocados have green, brown, purplish, or black skin when ripe, and may be pear-shaped, egg-shaped, or spherical. Commercially, the fruits are picked while immature, and ripened after harvesting.

## Data Description
This data was downloaded from the Hass Avocado Board website in May of 2018 & compiled into a single CSV.

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados.

Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags.

The Product Lookup codes (PLUs) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

# The Data Set

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('avocado.csv')

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.columns

In [None]:
#Dropping the unnamed column

df.drop('Unnamed: 0',axis=1,inplace=True)

## About the Columns
Date
The date of the observation.

AveragePrice
The average price of a single avocado.

Total Volume
The total number of avocados sold.

4046
The total number of avocados with PLU 4046 sold.

4225
The total number of avocados with PLU 4225 sold.

4770
The total number of avocados with PLU 4770 sold.

Total Bags
The total number of bags sold.

Small Bags
The total number of small bags sold.

Large Bags
The total number of large bags sold.

XLarge Bags
The total number of extra-large bags sold.

type
Whether the avocados are conventional or organic.

year
The year of the observation.

region
The place (city or region) of the observation.

In [None]:
#Null values check

In [None]:
df.isnull().sum()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull())
plt.title("Null Values")
plt.show()

We can see that there are no null, blank, or empty values in the dataset.
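The same check can also be done without a plot. A minimal sketch on a small made-up frame (the column names mirror the data set, but the values and the missing entry are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; one price is missing on purpose
toy = pd.DataFrame({
    "AveragePrice": [1.33, np.nan, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
})

null_counts = toy.isnull().sum()             # missing values per column
has_nulls = bool(toy.isnull().values.any())  # True if any cell is missing

print(null_counts)
print("any nulls:", has_nulls)
```

On the real `df`, a `has_nulls` value of `False` would confirm what the heatmap shows.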

## Checking the Column Data Types and Converting Them to Numeric Types for Analysis

In [None]:
df.info()

In [None]:
df.describe(include=['O'])

In [None]:
from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()

list1=['type','region']
for val in list1:
  df[val]=le.fit_transform(df[val].astype(str))
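Because the loop above reuses a single encoder, only the mapping of the last column fitted remains available afterwards. If the original labels are needed later, one option is to keep one encoder per column. A minimal sketch on invented rows (the `encoders` dict is a hypothetical helper, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; the labels are only examples
toy = pd.DataFrame({
    "type": ["conventional", "organic", "conventional"],
    "region": ["Albany", "Boston", "Albany"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["type", "region"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# LabelEncoder assigns integer codes in sorted order of the labels
print(toy)
print(encoders["region"].inverse_transform(toy["region"]))
```

`inverse_transform` turns the integer codes back into the original strings, which helps when reading predictions for `region`.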

In [None]:
da=pd.to_datetime(df['Date'],errors='coerce')
df['Date']=da.dt.strftime("%Y%m%d").astype(int)
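To see what this encoding produces, here is a small sketch on two invented dates. It also shows that differences between the `YYYYMMDD` integers are not day counts, which is worth keeping in mind when the column is used as a numeric feature:

```python
import pandas as pd

# Two invented weekly observations, seven days apart
dates = pd.to_datetime(pd.Series(["2015-12-27", "2016-01-03"]), errors="coerce")

as_int = dates.dt.strftime("%Y%m%d").astype(int)  # same encoding as in the notebook
month = dates.dt.month                            # a possible alternative feature

int_gap = int(as_int.iloc[1] - as_int.iloc[0])    # gap between the encoded integers
day_gap = (dates.iloc[1] - dates.iloc[0]).days    # real gap in days

print(as_int.tolist(), month.tolist(), int_gap, day_gap)
```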

In [None]:
df

# EDA

In [None]:
df.plot.scatter(x='AveragePrice',y='region')

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x='type',data=df)

In [None]:
df.type.value_counts()

In [None]:
sns.barplot(data=df, x="AveragePrice", y="region")

In [None]:
df.region.value_counts()

In [None]:
sns.lineplot(x='AveragePrice',y='Total Volume',data=df)

In [None]:
sns.barplot(data=df, x="AveragePrice", y="4046")

In [None]:
df.plot.hexbin(x='AveragePrice', y='4225', gridsize=15)

In [None]:
df.plot.scatter(x='AveragePrice',y='4770')

In [None]:
sns.barplot(data=df, x="AveragePrice", y="Total Bags")

In [None]:
sns.lineplot(x='AveragePrice',y='Small Bags',data=df)

In [None]:
sns.lineplot(x='year',y='type',data=df)

In [None]:
df.groupby('year')['type'].value_counts()

In [None]:
df.plot.hexbin(x='AveragePrice', y='Large Bags', gridsize=15)

In [None]:
sns.countplot(x='year',data=df)


In [None]:
df.year.value_counts()

In [None]:
sns.displot(df['region'])

In [None]:
sns.displot(df['type'])

In [None]:
df.hist(figsize=(15,30),edgecolor='red',layout=(9,4),bins=15,legend=True)
plt.show()

In [None]:
## Correlation
df.corr()

In [None]:
df.corr()['AveragePrice'].sort_values()

In [None]:
df.corr()['region'].sort_values()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(), annot=True, linewidths=0.5,linecolor="black", fmt='.2f')


In [None]:
##Descriptive Statistics
df.describe()

In [None]:
plt.figure(figsize=(15,7))
sns.heatmap(round(df.describe()[1:].transpose(),2), annot=True, linewidths=0.5,linecolor="black", fmt='f')


In [None]:
df.info()

In [None]:
# Checking the skewness of every column except region before transforming
df.iloc[:,:-1].skew()

In [None]:
from sklearn.preprocessing import power_transform
x_new=power_transform(df.iloc[:,:-1],method='yeo-johnson')

df.iloc[:,:-1]=pd.DataFrame(x_new,columns=df.iloc[:,:-1].columns)


In [None]:
df.iloc[:,:-1].skew()
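The effect of the Yeo-Johnson transform is easier to see on a single, strongly skewed column. A minimal sketch on synthetic log-normal data (the column name `volume` and the data are invented, not taken from the avocado file):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import power_transform

# Synthetic right-skewed data standing in for a volume-like column
rng = np.random.default_rng(0)
toy = pd.DataFrame({"volume": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

skew_before = toy["volume"].skew()

# power_transform expects a 2-D input and returns a NumPy array
transformed = power_transform(toy[["volume"]], method="yeo-johnson")
skew_after = pd.Series(transformed[:, 0]).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skew close to zero after the transform indicates a roughly symmetric distribution.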

# Checking for Outliers

In [None]:
import warnings
warnings.filterwarnings('ignore')
df.plot(kind='box',subplots=True, layout=(3,5), figsize=[20,8])


# IQR Proximity Rule
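The notebook does not apply the IQR rule to the data. For reference, a minimal sketch of the rule on a short invented series, using the usual 1.5 × IQR fences:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an invented outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points outside [q1 - 1.5*iqr, q3 + 1.5*iqr] are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = values[(values >= lower) & (values <= upper)]

print(lower, upper)
print(kept.tolist())
```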
# Z-Score Technique

In [None]:
from scipy.stats import zscore
import numpy as np
z=np.abs(zscore(df))
z.shape

In [None]:
threshold=3
print(np.where(z>threshold))

In [None]:
len(np.where(z>3)[0])

In [None]:
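# Note: the result of this drop is not assigned, so df itself is unchanged here;
# the rows are actually removed by the boolean filter in the next cell.
df.drop([ 1716,  2699,  5462,  5475,  5476,  5477,  5478,  5479,  5480,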
        5481,  5482,  5483,  5484,  5485,  5486,  5487,  5488,  5489,
        5490,  5491,  5492,  5493,  5494,  5495,  5496,  5497,  5506,
        5506,  7412,  8319,  8322,  8344,  8344,  8345,  8345,  8346,
        8346,  8347,  8347,  8348,  8348,  8349,  8349,  8350,  8351,
        8352,  8352,  8353,  8353,  8354,  8354,  8355,  8356,  8357,
        8358,  8359,  8360,  8361,  8362,  8363,  8364,  8365,  8365,
        8366,  8366,  8366,  8367,  8367,  8368,  8369,  8370,  8371,
        9090,  9090,  9091,  9091,  9092,  9092,  9093,  9093,  9094,
        9094,  9095,  9096,  9096,  9097,  9097,  9097,  9097,  9098,
        9098,  9099,  9099,  9100,  9101,  9212,  9894, 10381, 11024,
       11320, 11321, 11322, 11325, 11326, 11332, 11333, 11336, 11338,
       11340, 11342, 11347, 11348, 11349, 11350, 11354, 11387, 11388,
       11594, 11595, 11596, 11597, 11614, 11662, 12132, 14124, 14125,
       14404, 15261, 15262, 15473, 16055, 16720, 17428],axis=0)

In [None]:
df=df[(z<3).all(axis=1)]

In [None]:
df.shape
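The filter `(z<3).all(axis=1)` keeps a row only if every column of that row lies within three standard deviations of its column mean. A minimal sketch on synthetic data with one planted extreme value (the columns `a` and `b` are invented):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy.loc[0, "a"] = 25.0  # planted extreme value in row 0

z = np.abs(zscore(toy))       # absolute z-score per cell, same shape as toy
mask = (z < 3).all(axis=1)    # True for rows with no extreme cell
cleaned = toy[mask]

print(toy.shape, cleaned.shape)
```

Row 0 is dropped because a single column exceeds the threshold, even though its other column is unremarkable.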

# Feature Engineering (Variance Inflation Factor)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
df.corr()

In [None]:
import seaborn as sns
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(),cmap="Blues",annot=True)

In [None]:
x1=df.drop('AveragePrice',axis=1)
y1=df['AveragePrice']


x2=df.drop('region',axis=1)
y2=df['region']

In [None]:
x1

In [None]:
y1

In [None]:
x2

In [None]:
y2

In [None]:
def vif_calc1():
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x1.values,i) for i in range(x1.shape[1])]
  vif["features"]=x1.columns
  print(vif)
vif_calc1()

In [None]:
x1.drop(['Date','Total Volume'],axis=1,inplace=True)

In [None]:
vif_calc1()

In [None]:
def vif_calc2():
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x2.values,i) for i in range(x2.shape[1])]
  vif["features"]=x2.columns
  print(vif)

In [None]:
vif_calc2()

In [None]:
x2.drop(['Date','Total Volume'],axis=1,inplace=True)
vif_calc2()
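The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others, so a feature that is nearly a sum of other features gets a very large value. The sketch below computes this with plain NumPy on synthetic columns. `vif_manual` is a hypothetical helper that adds an intercept to each regression, so its numbers need not match `variance_inflation_factor` from statsmodels, which uses the design matrix exactly as passed:

```python
import numpy as np

def vif_manual(X):
    """Hypothetical helper: VIF_j = 1 / (1 - R_j^2), regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic columns: "total" is almost exactly small + large
rng = np.random.default_rng(0)
small = rng.normal(size=500)
large = rng.normal(size=500)
total = small + large + rng.normal(scale=0.05, size=500)
independent = rng.normal(size=500)

vifs = vif_manual(np.column_stack([small, large, total, independent]))
print(vifs.round(1))
```

The first three columns show large VIFs because each can be reconstructed from the other two, while the independent column stays near 1.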

# Scaling the Data


In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x1=sc.fit_transform(x1)
x1

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x2=sc.fit_transform(x2)
x2
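The cells above fit the scaler on the full feature matrix before splitting. A common alternative is to fit it on the training rows only and reuse those statistics for the test rows. A minimal sketch on invented features, offered as an option rather than as the notebook's method:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))  # invented features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_s = scaler.transform(X_test)        # apply the training mean and std

print(X_train_s.mean(axis=0).round(6), X_train_s.std(axis=0).round(6))
```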

# Using Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.metrics import r2_score


for i in range(1,100):
  x_train,x_test,y_train,y_test=train_test_split(x1,y1,test_size=.20,random_state=i)
  lr=LinearRegression()
  lr.fit(x_train,y_train)
  pred_train=lr.predict(x_train)
  pred_test=lr.predict(x_test)
  print(f"At random state {i}, the training R2 score (%) is: {r2_score(y_train,pred_train)*100}")
  print(f"At random state {i}, the testing R2 score (%) is: {r2_score(y_test,pred_test)*100}")
  print("\n")
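Scanning `random_state` values shows how much the score depends on the particular split. Cross-validation summarises that variation in one call. A minimal sketch on synthetic data from `make_regression` (not the avocado data):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Five folds: each fold is held out once and scored with R2
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print("R2 per fold:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

The mean over folds is usually a steadier estimate than the best score from a single lucky split.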

In [None]:
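x_train,x_test,y_train,y_test=train_test_split(x1,y1,test_size=.20,random_state=41)
# Refit on this split: the lr left over from the loop was trained on a different split
lr=LinearRegression()
lr.fit(x_train,y_train)
pred=lr.predict(x_test)  # predict on the test rows so pred lines up with y_test
print("predicted result ",pred)
print('actual result',y_test)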

# Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr=DecisionTreeRegressor()
dtr.fit(x_train,y_train)

In [None]:
pred=dtr.predict(x_test)
print("predicted result ",pred)
print('actual result',y_test)

In [None]:
print('Error:')
print('Mean Absolute Error :',mean_absolute_error(y_test,pred))
print('Mean Squared Error :',mean_squared_error(y_test,pred))
print('Root mean Squared Error',np.sqrt(mean_squared_error(y_test,pred)))
print('r2 score :',r2_score(y_test,pred)*100)
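The four metrics printed above follow directly from the prediction errors. A small worked example with invented numbers, computed by hand with NumPy:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])  # invented predictions

err = y_true - y_pred                     # [-0.5, 0.0, 0.5, 0.0]

mae = np.mean(np.abs(err))                # mean absolute error
mse = np.mean(err ** 2)                   # mean squared error
rmse = np.sqrt(mse)                       # root mean squared error

ss_res = np.sum(err ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mae, mse, rmse, r2)
```

Here MAE is 0.25, MSE is 0.125, and R² is 0.9, since the residual sum of squares (0.5) is one tenth of the total sum of squares (5.0).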

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

parameters = {'alpha':[.0001,.001,.01,.1,1,10],'random_state':list(range(0,10))}
ls=Lasso()
clf=GridSearchCV(ls,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
ls=Lasso(alpha=0.0001,random_state=0)
ls.fit(x_train,y_train)
ls_score_training=ls.score(x_train,y_train)
pred_ls = ls.predict(x_test)
ls_score_training*100

In [None]:
from sklearn.ensemble import RandomForestRegressor
Rrf=RandomForestRegressor()

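# Note: scikit-learn 1.2+ names these criteria 'squared_error' and 'absolute_error',
# and 1.3 removed max_features='auto'; adjust the values to the installed version.
parameters = {'criterion':['mse','mae'],'max_features':["auto","sqrt","log2"]}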
clf = GridSearchCV(Rrf,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
from sklearn.model_selection import cross_val_score

Rrf= RandomForestRegressor(criterion="mse",max_features="log2")
Rrf.fit(x_train,y_train)
Rrf.score(x_train,y_train)
pred_decession = Rrf.predict(x_test)

rfs = r2_score(y_test,pred_decession)
print('R2 Score :',rfs*100)

rfscore = cross_val_score(Rrf,x1,y1,cv=5)
rfc=rfscore.mean()
print('cross Val Score :',rfc*100)

Saving the Best Regression Model
The random forest has the highest R2 score (82%), so we treat it as the best model.

import pickle
filename = 'Avocado Average Price.pkl'
with open(filename,'wb') as f:  # the with block closes the file after writing
  pickle.dump(Rrf,f)
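A saved model can be read back with `pickle.load` and used for prediction. A minimal round-trip sketch with a small model trained on synthetic data (the file name `toy_model.pkl` and the data are invented):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem, only to have a fitted model to save
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original
same = bool(np.allclose(model.predict(X), restored.predict(X)))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.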

# Classification Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split

maxAcc=0
maxRs=0

for i in range(1,200):
  x_train,x_test,y_train,y_test=train_test_split(x2,y2,test_size=.20,random_state=i)
  lr=LogisticRegression()
  lr.fit(x_train,y_train)
  predrf=lr.predict(x_test)
  acc=accuracy_score(y_test,predrf)
  if acc>maxAcc:
    maxAcc=acc
    maxRs=i

print("Best Accuracy is",maxAcc*100,"on Random State",maxRs)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x2,y2,test_size=.20,random_state=44)


# Logistic Regression

In [None]:

Lr=LogisticRegression()
Lr.fit(x_train,y_train)
predlr=Lr.predict(x_test)
print("Accuracy",accuracy_score(y_test,predlr)*100)
print(confusion_matrix(y_test,predlr))
print(classification_report(y_test,predlr))
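With many region classes the confusion matrix is large, so it helps to read a small one first. A minimal sketch with three invented classes; rows are true labels and columns are predicted labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]  # invented predictions

cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

The diagonal holds the correct predictions (1 + 2 + 1 = 4), so the accuracy is 4 out of 6.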

# Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtr=DecisionTreeClassifier()
dtr.fit(x_train,y_train)
preddtr=dtr.predict(x_test)
print("Accuracy",accuracy_score(y_test,preddtr)*100)
print(confusion_matrix(y_test,preddtr))
print(classification_report(y_test,preddtr))

# Support Vector Classifier

In [None]:
from sklearn.svm import SVC

svc=SVC()
svc.fit(x_train,y_train)

ad_pred=svc.predict(x_test)

print("Accuracy",accuracy_score(y_test,ad_pred)*100)
print(confusion_matrix(y_test,ad_pred))
print(classification_report(y_test,ad_pred))

# Random Forest Classifier


In [None]:
from sklearn.ensemble import RandomForestClassifier


Rrf=RandomForestClassifier()
Rrf.fit(x_train,y_train)

Rrf_pred=Rrf.predict(x_test)

print("Accuracy",accuracy_score(y_test,Rrf_pred)*100)
print(confusion_matrix(y_test,Rrf_pred))
print(classification_report(y_test,Rrf_pred))

In [None]:
# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Creating Parameter List to pass in Grid SearchCV

parameters = {'max_features': ['auto','sqrt','log2'],
              'max_depth' : [4,5,6,7,8],
              'criterion' :['gini','entropy']}

In [None]:
gcv=GridSearchCV(RandomForestClassifier(),parameters,cv=5,scoring="accuracy")
gcv.fit(x_train,y_train)
gcv.best_params_

In [None]:
gcv_pred=gcv.best_estimator_.predict(x_test)
accuracy_score(y_test,gcv_pred)
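After fitting, a grid search exposes the best parameter combination, its mean cross-validated score, and an estimator refitted on the full training set. A minimal sketch on synthetic data from `make_classification`, with a deliberately small grid so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic classification problem used only for illustration
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"max_depth": [2, 4], "criterion": ["gini", "entropy"]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

test_acc = search.best_estimator_.score(X_test, y_test)  # accuracy on held-out rows

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best combination
print(test_acc)
```

`best_estimator_` is the tuned model, so it is the object to keep if the tuned version is the one to be saved.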

In [None]:
## Saving the Best Model
# The random forest classifier reaches an accuracy of 92%, so we save it

import pickle
filename = 'Avocado Region.pkl'
with open(filename,'wb') as f:  # the with block closes the file after writing
  pickle.dump(Rrf,f)