# Avocado Mini Project

Essential Purpose

Skills tested:
- Using Pandas to access and explore the dataset.
- Using Pandas to cleanse columns and choose features.
- Using Scikit-Learn to preprocess the data before training.
- Using a K-Nearest Neighbors regressor for regression modeling and testing data.
- Using Scikit-Learn for regression evaluation and enhancement.

Requirements

Avocados are many people's favorite fruit. They have excellent nutritional value and they taste great. It is time for you to learn about their prices!
Submission of your project on GitHub is optional. If you choose to manage your project using GitHub, find guidelines for using GitHub here. Write your code in a Jupyter Notebook; it will be uploaded to GitHub when you run `git push`.
License: The dataset is an open database, and it is publicly available online.

Expected Output

By the end of this mini project, you will need to deliver within your code:
- Multiple R-squared scores, one for each value of k tried when training your K-Nearest Neighbors (KNN) regressor.
- The R-squared score of one additional regression modeling technique, such as linear regression.
You are expected to write around 25 lines of code to complete this project.


# Download the dataset
# Read the dataset and drop NaN

In [1]:
import pandas as pd 
df=pd.read_csv("avocado.csv")
df = df.dropna()  # dropna() returns a new DataFrame, so keep the result
df.head(5)

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany
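
By default `DataFrame.dropna()` returns a new DataFrame and leaves the original untouched, so the result has to be assigned for the rows to actually disappear. A minimal sketch on a made-up frame (the `toy` data is hypothetical and only borrows two column names from `avocado.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical data, not the avocado dataset)
toy = pd.DataFrame({
    "AveragePrice": [1.33, np.nan, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
})

toy.dropna()                  # result discarded: toy is unchanged
rows_unassigned = len(toy)

cleaned = toy.dropna()        # keep the returned frame instead
rows_cleaned = len(cleaned)

print(rows_unassigned, rows_cleaned)
print(toy.isna().sum())       # per-column count of missing values
```

Checking `df.isna().sum()` before and after the drop is a quick way to confirm whether the dataset contains missing values at all.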


# Extract Features
Exclude the region and date from the considered features.


In [2]:
df.columns

Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
       '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
       'year', 'region'],
      dtype='object')

In [3]:
df.drop(['Unnamed: 0', 'Date',"region"], axis=1, inplace=True)

In [4]:
df.head(5)

Unnamed: 0,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year
0,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015
1,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015
2,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015
3,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015
4,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015


# Perform Preprocessing
Perform any needed pre-processing on the chosen features including:
- Scaling;
- Encoding; and
- Dealing with NaN values.
Hint:
Use only the preprocessing steps for this mini project.

In [5]:
# Encode the "type" column as integer labels
from sklearn.preprocessing import LabelEncoder

df["type"] = LabelEncoder().fit_transform(df["type"])

In [6]:
df.head(5)

Unnamed: 0,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year
0,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,0,2015
1,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,0,2015
2,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,0,2015
3,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,0,2015
4,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,0,2015
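
`LabelEncoder` assigns integer codes in the sorted order of the labels it sees, which is why `conventional` shows up as 0 in the rows above. A small sketch of how to inspect that mapping, assuming a second label `organic` purely for illustration (the rows shown here only contain `conventional`):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels used only to illustrate the encoding order
labels = ["organic", "conventional", "organic", "conventional"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, so the position of a label in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(codes.tolist())
```

Keeping a reference to the fitted encoder (rather than creating it inline) makes it possible to call `inverse_transform` later and recover the original strings.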


In [7]:
# Assign the target (AveragePrice, column 0) and the input features (all remaining columns)
y = df.iloc[:, 0:1].values
X = df.iloc[:, 1:].values

In [8]:
# scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X) 

# Split the Data
Split your data as follows:
- 80% training set
- 10% validation set
- 10% test set

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size = 0.2)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size = 0.5)
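
The notebook scales the full dataset before splitting it. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only, so that statistics from the validation and test rows do not leak into training. All names ending in `_demo`, the short split names, and the `random_state` values are assumptions made for this illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # hypothetical features
y_demo = rng.normal(size=100)                            # hypothetical target

# 80% train, then split the remaining 20% in half: 10% validation, 10% test
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Fit the scaler on the training rows only, then reuse it for the other sets
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_val_s = scaler_demo.transform(X_val)
X_te_s = scaler_demo.transform(X_te)

print(len(X_tr), len(X_val), len(X_te))
```

Passing a fixed `random_state` also makes the split, and therefore the reported scores, reproducible between runs.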

# Train KNN Regression
Use a KNN regressor model to train your data.
Choose the best k for the KNN algorithm by trying different values and validating performance on the validation set.
Regression Metrics
Print the R-squared score of your final KNN regressor.
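
For reference, the R-squared value returned by a regressor's `score` method is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch that computes it by hand and compares it with `sklearn.metrics.r2_score`; the predictions in `y_hat` are hypothetical values chosen for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.33, 1.35, 0.93, 1.08, 1.28])   # prices from the rows shown earlier
y_hat = np.array([1.30, 1.30, 1.00, 1.10, 1.20])    # hypothetical predictions

ss_res = np.sum((y_true - y_hat) ** 2)              # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_hat)
print(r2_manual, r2_sklearn)
```

A value of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for models that do worse than that.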

In [10]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors = 8).fit(X_train, y_train)
model.score(X_test, y_test)

0.7627288427884398

In [None]:
# Choose the best k using the validation set
scores = []
best_score = float("-inf")  # R-squared can be negative, so do not start at 0
best_k = None
best_model = None
neighbors = range(1, 15)

for k in neighbors:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    result = knn.score(X_validate, y_validate)  # R-squared on the validation set
    scores.append(round(result, 2))

    if result > best_score:
        best_score = result
        best_k = k
        best_model = knn

print(scores)
print(best_k)

In [None]:
import matplotlib.pyplot as plt
plt.plot(neighbors, scores)
plt.xlabel("k (number of neighbors)")
plt.ylabel("Validation R-squared")

In [None]:
# Print the R-squared score of the final KNN regressor on the held-out test set
r2 = best_model.score(X_test, y_test)
print("The best model has an R-squared of: ", round(r2, 2))

# Challenge Yourself (Optional)
Repeat step 6 (Train KNN Regression) for a different regression modeling technique.

In [None]:
# Linear Regression
from sklearn.linear_model import LinearRegression 
modelLR = LinearRegression().fit(X_train, y_train)
score = modelLR.score(X_test, y_test)
print(score)

In [None]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
modelRFR = RandomForestRegressor(criterion = "squared_error", max_leaf_nodes = 100).fit(X_train, y_train.ravel())
score = modelRFR.score(X_test, y_test)
y_pred = modelRFR.predict(X_test)

print(score)

In [None]:
from sklearn.metrics import mean_squared_error

# score() on a regressor returns R-squared, not classification accuracy
print("R-squared: ", score)
print("MSE: ", mean_squared_error(y_test, y_pred))
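
MSE is expressed in squared price units, so its square root (RMSE) is often easier to read next to R-squared. A minimal sketch with hypothetical values; `y_true_demo` and `y_pred_demo` are illustrative names and not part of the notebook:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([1.33, 1.35, 0.93, 1.08, 1.28])   # prices from the rows shown earlier
y_pred_demo = np.array([1.30, 1.30, 1.00, 1.10, 1.20])   # hypothetical predictions

mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)   # same units as AveragePrice

print("MSE: ", mse)
print("RMSE:", rmse)
```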