# Problem Statement:

**Avocados are consumed heavily in the United States.**

**Content:**

This data was downloaded from the Hass Avocado Board website in May 2018 and compiled into a single CSV.

The table below represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. 

Starting in 2013, the table below reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) in the table reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. 

The Product Lookup codes (PLUs) in the table are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this table.

Some relevant columns in the dataset:

- Date - the date of the observation
- AveragePrice - the average price of a single avocado
- type - conventional or organic
- year - the year of the observation
- region - the city or region of the observation
- Total Volume - total number of avocados sold
- 4046 - total number of avocados with PLU 4046 sold
- 4225 - total number of avocados with PLU 4225 sold
- 4770 - total number of avocados with PLU 4770 sold


**Inspiration/Label:**

The dataset can be approached from two angles: predicting the region and predicting the average price.

Task: one classification task (predict `region`) and one regression task (predict `AveragePrice`).

Do both tasks in the same .ipynb file and submit it as a single file.
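
As a minimal sketch of how the two tasks could be framed, the snippet below separates features from the target for each task. The helper name `split_targets` is hypothetical, and the three rows are copied from the first rows of the dataset (a subset of columns only), so this illustrates the framing rather than the modelling itself.

```python
import pandas as pd

# Three rows copied from the first rows of the dataset (subset of columns).
sample = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
    "type": ["conventional", "conventional", "conventional"],
    "region": ["Albany", "Albany", "Albany"],
})


def split_targets(df):
    """Hypothetical helper: return (X, y) pairs for the two tasks."""
    # Classification: the label is the region.
    X_clf = df.drop(columns=["region"])
    y_clf = df["region"]
    # Regression: the label is the average price.
    X_reg = df.drop(columns=["AveragePrice"])
    y_reg = df["AveragePrice"]
    return (X_clf, y_clf), (X_reg, y_reg)


(X_clf, y_clf), (X_reg, y_reg) = split_targets(sample)
print(list(X_clf.columns))
print(list(X_reg.columns))
```

Whether the other task's target should stay in the feature set (for example `region` as a feature for the price regression) is a modelling choice this sketch leaves open.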

# 1. Importing the necessary libraries:

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# 2. Importing the dataset:

In [2]:
# Show all rows when displaying a DataFrame (no truncation)
pd.set_option('display.max_rows',None) 

# Show all columns when displaying a DataFrame (no truncation)
pd.set_option('display.max_columns',None) 

data=pd.read_csv('Avocado.csv')

In [3]:
# First five rows of the dataset
data.head() 

| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 27-12-2015 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015.0 | Albany |
| 1 | 1.0 | 20-12-2015 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015.0 | Albany |
| 2 | 2.0 | 13-12-2015 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015.0 | Albany |
| 3 | 3.0 | 06-12-2015 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015.0 | Albany |
| 4 | 4.0 | 29-11-2015 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015.0 | Albany |
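
The five rows shown by `data.head()` suggest that the volume columns are related: `Total Bags` looks like the sum of the three bag sizes, and `Total Volume` looks like the sum of the three PLU columns plus `Total Bags`. The sketch below checks this on those five rows only, with the values copied from the output above; whether the relationship holds across the whole file is an assumption that would need the same check on `data`.

```python
import numpy as np
import pandas as pd

# Rows copied from the data.head() output above.
head = pd.DataFrame({
    "Total Volume": [64236.62, 54876.98, 118220.22, 78992.15, 51039.60],
    "4046": [1036.74, 674.28, 794.70, 1132.00, 941.48],
    "4225": [54454.85, 44638.81, 109149.67, 71976.41, 43838.39],
    "4770": [48.16, 58.33, 130.50, 72.58, 75.78],
    "Total Bags": [8696.87, 9505.56, 8145.35, 5811.16, 6183.95],
    "Small Bags": [8603.62, 9408.07, 8042.21, 5677.40, 5986.26],
    "Large Bags": [93.25, 97.49, 103.14, 133.76, 197.69],
    "XLarge Bags": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Total Bags vs. the sum of the three bag sizes.
bag_sum = head[["Small Bags", "Large Bags", "XLarge Bags"]].sum(axis=1)
bags_ok = np.isclose(bag_sum, head["Total Bags"], atol=0.01)

# Total Volume vs. the three PLU columns plus Total Bags.
volume_sum = head[["4046", "4225", "4770", "Total Bags"]].sum(axis=1)
volume_ok = np.isclose(volume_sum, head["Total Volume"], atol=0.01)

print(bags_ok.all(), volume_ok.all())
```

If the relationship also holds on the full data, the total columns are linear combinations of the others, which is worth keeping in mind when selecting features for the regression task.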


In [15]:
# last five rows of the dataset
data.tail() 

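| | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16463 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16464 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16465 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16466 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 16467 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |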


# 3. Checking the attributes:

In [10]:
# Shape of the dataset
data.shape

(16468, 14)

In [11]:
# Brief information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16468 entries, 0 to 16467
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    1517 non-null   float64
 1   Date          1517 non-null   object 
 2   AveragePrice  1517 non-null   float64
 3   Total Volume  1517 non-null   float64
 4   4046          1517 non-null   float64
 5   4225          1517 non-null   float64
 6   4770          1517 non-null   float64
 7   Total Bags    1517 non-null   float64
 8   Small Bags    1517 non-null   float64
 9   Large Bags    1517 non-null   float64
 10  XLarge Bags   1517 non-null   float64
 11  type          1517 non-null   object 
 12  year          1517 non-null   float64
 13  region        1517 non-null   object 
dtypes: float64(11), object(3)
memory usage: 1.8+ MB
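
The `info()` output shows `Date` stored as `object` and `year` as `float64`. A possible cleanup is sketched below, under stated assumptions: the date strings follow the day-month-year pattern seen in the first rows (for example `27-12-2015`), and `Unnamed: 0` is a leftover index column that can be dropped. The helper name `tidy_types` and the small stand-in frame are hypothetical.

```python
import pandas as pd

# Stand-in frame: two rows copied from the first rows of the dataset,
# plus one empty row to show that missing values survive the conversion.
sample = pd.DataFrame({
    "Unnamed: 0": [0.0, 1.0, None],
    "Date": ["27-12-2015", "20-12-2015", None],
    "year": [2015.0, 2015.0, None],
})


def tidy_types(df):
    """Hypothetical helper: drop the leftover index column and fix dtypes."""
    # Assumption: "Unnamed: 0" carries no information needed for modelling.
    out = df.drop(columns=["Unnamed: 0"])
    # Assumption: dates are day-month-year strings such as "27-12-2015".
    out["Date"] = pd.to_datetime(out["Date"], format="%d-%m-%Y")
    # Nullable integer dtype keeps missing years as <NA> instead of failing.
    out["year"] = out["year"].astype("Int64")
    return out


clean = tidy_types(sample)
print(clean.dtypes)
```

The nullable `Int64` dtype is used here because a plain `int` cast raises an error while the column still contains missing values.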


In [12]:
# Statistical summary of the dataset
data.describe()

| | Unnamed: 0 | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 | 1517.0 |
| mean | 26.995386 | 1.07499 | 1601879.0 | 646438.7 | 611437.5 | 50405.5 | 293597.4 | 248773.6 | 42642.05 | 2181.771074 | 2015.162821 |
| std | 14.848287 | 0.188891 | 4433143.0 | 1947614.0 | 1672906.0 | 137781.2 | 757976.5 | 647476.5 | 118215.7 | 7455.712144 | 0.369324 |
| min | 0.0 | 0.49 | 38750.74 | 467.72 | 1783.77 | 0.0 | 3311.77 | 3311.77 | 0.0 | 0.0 | 2015.0 |
| 25% | 14.0 | 0.98 | 147470.0 | 20400.34 | 41476.06 | 911.25 | 36206.89 | 29727.22 | 540.74 | 0.0 | 2015.0 |
| 50% | 29.0 | 1.08 | 402791.9 | 81751.17 | 118664.9 | 7688.17 | 73979.06 | 62375.69 | 5044.35 | 0.0 | 2015.0 |
| 75% | 39.0 | 1.19 | 981975.1 | 377578.5 | 485150.3 | 29167.3 | 157609.7 | 146199.4 | 29267.67 | 401.48 | 2015.0 |
| max | 51.0 | 1.68 | 44655460.0 | 18933040.0 | 18956480.0 | 1381516.0 | 6736304.0 | 5893642.0 | 1121076.0 | 108072.79 | 2016.0 |


# 4. Filling the Null values:

In [14]:
# Checking the null values
data.isnull().sum()   

Unnamed: 0      14951
Date            14951
AveragePrice    14951
Total Volume    14951
4046            14951
4225            14951
4770            14951
Total Bags      14951
Small Bags      14951
Large Bags      14951
XLarge Bags     14951
type            14951
year            14951
region          14951
dtype: int64
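
Every column reports the same 14951 nulls out of 16468 rows, which leaves 16468 - 14951 = 1517 populated rows and matches the non-null count from `data.info()`. Together with the all-empty rows in `data.tail()`, this suggests the nulls are wholly blank rows rather than scattered missing values. If that is the case, dropping those rows is probably more appropriate than filling them. The sketch below shows the idea on a small stand-in frame (two rows copied from the first rows of the dataset plus two blank rows); it is an assumption about the file, not a verified property of it.

```python
import numpy as np
import pandas as pd

# Stand-in frame: two populated rows followed by two completely empty rows.
sample = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, np.nan, np.nan],
    "type": ["conventional", "conventional", None, None],
    "region": ["Albany", "Albany", None, None],
})

# how="all" removes only rows in which every column is missing,
# so a row with a single missing cell would be kept.
cleaned = sample.dropna(how="all").reset_index(drop=True)

print(cleaned.shape)
print(cleaned.isnull().sum())
```

On the real data the equivalent call would be `data.dropna(how="all")`. Re-running `isnull().sum()` afterwards would show whether any partially missing rows remain and still need filling.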