# Avocado Price Prediction and Classification Project

This project involves analyzing and modeling retail scan data of Hass avocados to accomplish two primary tasks:

1. **Regression Task**: Predict the average price of avocados (`AveragePrice`) from the features `Total Volume`, `4046`, `4225`, and `4770`.
2. **Classification Task**: Classify the type of avocado (`conventional` or `organic`) from the same features.

## Project Workflow

### 1. **Data Loading**
   - The dataset was sourced from the Hass Avocado Board website and includes weekly retail scan data for 2018.
   - The data is loaded into a Pandas DataFrame for further analysis.
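
As a minimal sketch of the loading step, the snippet below reads a zip archive that holds a single CSV file. The helper name `load_zipped_csv`, the file name `sample.csv.zip`, and the two data rows are assumptions made for illustration; they are not the real avocado data.

```python
import os
import tempfile
import zipfile

import pandas as pd


def load_zipped_csv(path):
    """Read a zip archive that contains exactly one CSV file (hypothetical helper)."""
    return pd.read_csv(path, compression="zip")


# Build a tiny stand-in archive with made-up values so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "sample.csv.zip")
csv_text = (
    "AveragePrice,Total Volume,type\n"
    "1.10,1000,conventional\n"
    "1.85,200,organic\n"
)
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("sample.csv", csv_text)

sample_df = load_zipped_csv(zip_path)
print(sample_df.shape)
print(sample_df.dtypes)
```

The same call accepts a URL in place of a local path, provided the URL resolves to the archive itself.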

### 2. **Data Exploration**
   - Initial exploration of the dataset to understand its structure, data types, and summary statistics.
   - Visualizations are used to identify trends and distributions in the data.

### 3. **Data Preprocessing**
   - Missing values are checked, the categorical `type` column is encoded, and the numerical features are scaled to prepare the data for modeling.
   - Label encoding converts the categorical `type` column into integers; this column is the target of the classification task.
   - Standard scaling standardizes the numerical features to zero mean and unit variance. Tree-based models such as Random Forests are largely insensitive to feature scale, so this step matters mainly if a scale-sensitive model is substituted.
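
The preprocessing steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the notebook's own code: the frame `toy_df` and its values are made up, and the scaler is fitted on the training split only, which is one common way to keep test-set statistics out of training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for the avocado data (values are invented).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Total Volume": rng.uniform(1e3, 1e6, size=50),
    "4046": rng.uniform(0, 1e5, size=50),
    "type": np.tile(["conventional", "organic"], 25),
})

# LabelEncoder assigns integers in sorted class order.
encoder = LabelEncoder()
toy_df["type"] = encoder.fit_transform(toy_df["type"])

features = ["Total Volume", "4046"]
X_train, X_test = train_test_split(
    toy_df[features], test_size=0.2, random_state=42
)

# Fit the scaler on the training split, then reuse its statistics on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(encoder.classes_)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After fitting, the training columns have mean 0 and standard deviation 1 by construction; the test columns will be close to that but not exact, because they reuse the training statistics.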

### 4. **Regression Task**
   - Predicting the average price of avocados using a Random Forest Regressor.
   - The data is split into training and testing sets, and the model is trained on the training data.
   - Model performance is evaluated using metrics like R-squared and Mean Squared Error.
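
To make the two regression metrics concrete, here is a small worked example that computes Mean Squared Error and R-squared by hand and checks them against scikit-learn. The four price values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices.
y_true = np.array([1.0, 1.5, 2.0, 2.5])
y_pred = np.array([1.1, 1.4, 2.2, 2.3])

# MSE: mean of the squared residuals.
residuals = y_true - y_pred
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("R2 :", r2_manual, r2_score(y_true, y_pred))
```

Here the squared residuals sum to 0.10, so the MSE is 0.025; the total sum of squares is 1.25, so R-squared is 0.92.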

### 5. **Classification Task**
   - Classifying the avocado type as `conventional` or `organic` using a Random Forest Classifier.
   - Similar to the regression task, the data is split into training and testing sets, and the model is trained and evaluated.
   - Performance is measured using accuracy and a classification report.
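
The classification metrics can be illustrated the same way. The labels below are invented; `0` stands for `conventional` and `1` for `organic`, which matches the sorted order a label encoder would assign to those two names.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up true and predicted labels (0 = conventional, 1 = organic).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)

# output_dict=True returns the report as a nested dict instead of a string.
report = classification_report(
    y_true,
    y_pred,
    target_names=["conventional", "organic"],
    output_dict=True,
)

print("Accuracy:", accuracy)
print("Organic precision:", report["organic"]["precision"])
print("Organic recall:", report["organic"]["recall"])
```

Four of the six predictions are correct, so accuracy is 4/6. For the `organic` class, two of the three predicted positives are correct and two of the three actual positives are found, so precision and recall are both 2/3.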

### 6. **Visualization**
   - Visualizations are created to better understand the data distribution and the relationship between features.
   - Graphs include the distribution of average prices and the count of each avocado type.

### 7. **Conclusion**
   - The project successfully demonstrates the application of both regression and classification techniques on the avocado dataset.
   - Insights into avocado pricing and type classification are drawn from the model evaluations.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score, classification_report, accuracy_score


In [2]:
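# Load the zipped CSV directly from GitHub.
# NOTE: the last recorded run of this cell failed with HTTP 404 (see the output below),
# so the URL may no longer be valid; every later cell depends on avocado_df.
url = 'https://github.com/FlipRoboTechnologies/ML_-Datasets/blob/main/Avocado/avocado.csv.zip?raw=true'
avocado_df = pd.read_csv(url, compression='zip')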


HTTPError: HTTP Error 404: Not Found

In [None]:
print(avocado_df.head())
print(avocado_df.info())
print(avocado_df.describe())


In [None]:
avocado_df.isnull().sum()  # If any, consider filling or dropping them


In [None]:
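# Encode `type` as integers. LabelEncoder assigns codes in sorted class order,
# so with the two expected classes: conventional -> 0, organic -> 1.
label_encoder = LabelEncoder()
avocado_df['type'] = label_encoder.fit_transform(avocado_df['type'])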


In [None]:
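# Standardize the numerical features to zero mean and unit variance.
# NOTE: the scaler is fitted on the full dataset before the train/test split,
# so test-set statistics influence the scaled training data.
scaler = StandardScaler()
numerical_features = ['Total Volume', '4046', '4225', '4770']
avocado_df[numerical_features] = scaler.fit_transform(avocado_df[numerical_features])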


In [None]:
X_reg = avocado_df[['Total Volume', '4046', '4225', '4770']]
y_reg = avocado_df['AveragePrice']


In [None]:
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)


In [None]:
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train_reg, y_train_reg)


In [None]:
y_pred_reg = regressor.predict(X_test_reg)
print("R2 Score:", r2_score(y_test_reg, y_pred_reg))
print("MSE:", mean_squared_error(y_test_reg, y_pred_reg))


In [None]:
X_clf = avocado_df[['Total Volume', '4046', '4225', '4770']]
y_clf = avocado_df['type']


In [None]:
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)


In [None]:
classifier = RandomForestClassifier(n_estimators=100, random_state=42)
classifier.fit(X_train_clf, y_train_clf)


In [None]:
y_pred_clf = classifier.predict(X_test_clf)
print("Accuracy:", accuracy_score(y_test_clf, y_pred_clf))
print(classification_report(y_test_clf, y_pred_clf))


In [None]:
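# sns.distplot is deprecated; histplot with kde=True is the usual replacement.
sns.histplot(avocado_df['AveragePrice'], kde=True)
plt.show()

# Pass the column by keyword; recent seaborn versions require x= here.
sns.countplot(x=avocado_df['type'])
plt.show()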


