# QCTO - Workplace Module

### Project Title: Please Insert your Project Title Here
#### Done By: Name and Surname

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [20]:
import sklearn as sk
import seaborn as sn
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

Weekly avocado sales and pricing data, collected by the Hass Avocado Board between 2015 and 2023, was used for this project. The data was obtained from Kaggle at this [link](https://www.kaggle.com/datasets/vakhariapujan/avocado-prices-and-sales-volume-2015-2023/data) and is stored in the `Data/Avocado_HassAvocadoBoard_20152023.csv` file. 

The data consists of 12 features and 53,415 observations.
The features are as follows:

* `Date` (datetime): The date of data recording.
* `AveragePrice` (float): The average selling price of a single avocado.
* `TotalVolume` (float): The total number of units sold.
* `plu4046`: The number of units of small/medium Hass avocados (~3-5oz) sold in the week.
* `plu4225`: The number of units of large Hass avocados (~8-10oz) sold in the week.
* `plu4770`: The number of units of extra large Hass avocados (~10-15oz) sold in the week.
* `TotalBags`: The total number of bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `SmallBags`: The total number of small bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `LargeBags`: The total number of large bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `XLargeBags`: The total number of extra large bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `type`: The type of avocado (conventional/organic).
* `region`: Regions and sub-regions in the US in which the avocados were sold. Total US sales are also included.

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [21]:
avo_data = pd.read_csv(r'Data/Avocado_HassAvocadoBoard_20152023.csv', sep=',')
avo_data.head(5)

| | Date | AveragePrice | TotalVolume | plu4046 | plu4225 | plu4770 | TotalBags | SmallBags | LargeBags | XLargeBags | type | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 1.22 | 40873.28 | 2819.5 | 28287.42 | 49.9 | 9716.46 | 9186.93 | 529.53 | 0.0 | conventional | Albany |
| 1 | 2015-01-04 | 1.79 | 1373.95 | 57.42 | 153.88 | 0.0 | 1162.65 | 1162.65 | 0.0 | 0.0 | organic | Albany |
| 2 | 2015-01-04 | 1.0 | 435021.49 | 364302.39 | 23821.16 | 82.15 | 46815.79 | 16707.15 | 30108.64 | 0.0 | conventional | Atlanta |
| 3 | 2015-01-04 | 1.76 | 3846.69 | 1500.15 | 938.35 | 0.0 | 1408.19 | 1071.35 | 336.84 | 0.0 | organic | Atlanta |
| 4 | 2015-01-04 | 1.08 | 788025.06 | 53987.31 | 552906.04 | 39995.03 | 141136.68 | 137146.07 | 3990.61 | 0.0 | conventional | BaltimoreWashington |


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

Remove the features that are not needed for this analysis: the aggregate `TotalVolume` column and the four bag-count columns.

In [22]:
avo_data = avo_data.drop(
    columns=[
        "TotalVolume",
        "TotalBags",
        "SmallBags",
        "LargeBags",
        "XLargeBags"
    ]
)
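Before dropping the aggregate columns, it can help to check how they relate to the columns that remain. The following is a minimal sketch, not part of the original workflow: it uses only the five rows displayed in the Loading Data section, held in a hypothetical `sample` frame, and tests whether the bag columns sum to `TotalBags` and whether the PLU columns plus `TotalBags` sum to `TotalVolume`. Whether these relationships hold across the full file is an assumption that would need to be checked on `avo_data` itself.

```python
import numpy as np
import pandas as pd

# Five rows copied from the avo_data.head(5) output (illustrative sample only).
sample = pd.DataFrame(
    {
        "TotalVolume": [40873.28, 1373.95, 435021.49, 3846.69, 788025.06],
        "plu4046": [2819.50, 57.42, 364302.39, 1500.15, 53987.31],
        "plu4225": [28287.42, 153.88, 23821.16, 938.35, 552906.04],
        "plu4770": [49.90, 0.00, 82.15, 0.00, 39995.03],
        "TotalBags": [9716.46, 1162.65, 46815.79, 1408.19, 141136.68],
        "SmallBags": [9186.93, 1162.65, 16707.15, 1071.35, 137146.07],
        "LargeBags": [529.53, 0.00, 30108.64, 336.84, 3990.61],
        "XLargeBags": [0.0, 0.0, 0.0, 0.0, 0.0],
    }
)

# Does SmallBags + LargeBags + XLargeBags reproduce TotalBags (to the cent)?
bag_sum = sample[["SmallBags", "LargeBags", "XLargeBags"]].sum(axis=1)
bags_consistent = np.isclose(bag_sum, sample["TotalBags"], atol=0.01)

# Does the sum of the PLU columns and TotalBags reproduce TotalVolume?
volume_sum = sample[["plu4046", "plu4225", "plu4770", "TotalBags"]].sum(axis=1)
volume_consistent = np.isclose(volume_sum, sample["TotalVolume"], atol=0.01)

print(bags_consistent.all(), volume_consistent.all())
```

On these five rows both checks hold, which suggests that `TotalVolume` also counts bagged sales, so the three PLU columns kept after the drop cover only part of the total volume.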

To assist with spotting errors in the data, print summary statistics for the numerical columns.

In [23]:
avo_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AveragePrice | 53415.0 | 1.42891 | 0.393116 | 0.44 | 1.119091 | 1.4 | 1.69 | 3.44083 |
| plu4046 | 53415.0 | 298270.749448 | 1307669.0 | 0.0 | 694.725 | 14580.58 | 128792.38 | 25447200.0 |
| plu4225 | 53415.0 | 222217.037654 | 955462.4 | 0.0 | 2120.8 | 17516.63 | 93515.6 | 20470570.0 |
| plu4770 | 53415.0 | 20531.954686 | 104097.7 | 0.0 | 0.0 | 90.05 | 3599.735 | 2860025.0 |


Determine if there are any missing values.

In [24]:
avo_data.isnull().sum(axis=0)

Date            0
AveragePrice    0
plu4046         0
plu4225         0
plu4770         0
type            0
region          0
dtype: int64

Verify that the column data types are correct.

In [25]:
avo_data.dtypes

Date             object
AveragePrice    float64
plu4046         float64
plu4225         float64
plu4770         float64
type             object
region           object
dtype: object
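The output above shows `Date` stored as `object` (plain strings) rather than as a datetime type. One way to convert it is sketched below on a hypothetical `sample` frame built from values shown in this notebook. The `%Y-%m-%d` format is an assumption based on the rows displayed so far; the full file may need a check for entries in other formats.

```python
import pandas as pd

# Small stand-in for avo_data, using values shown in the notebook output.
sample = pd.DataFrame(
    {
        "Date": ["2015-01-04", "2015-01-04", "2023-12-03"],
        "AveragePrice": [1.22, 1.79, 1.550513],
    }
)

# Parse the ISO-style date strings; an unexpected format raises an error
# instead of being silently misread.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

print(sample.dtypes)
```

Passing an explicit `format` keeps the parse strict, which is useful when the goal of the step is to verify the data rather than just coerce it.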

In [26]:
avo_data

| | Date | AveragePrice | plu4046 | plu4225 | plu4770 | type | region |
|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 1.220000 | 2819.50 | 28287.42 | 49.90 | conventional | Albany |
| 1 | 2015-01-04 | 1.790000 | 57.42 | 153.88 | 0.00 | organic | Albany |
| 2 | 2015-01-04 | 1.000000 | 364302.39 | 23821.16 | 82.15 | conventional | Atlanta |
| 3 | 2015-01-04 | 1.760000 | 1500.15 | 938.35 | 0.00 | organic | Atlanta |
| 4 | 2015-01-04 | 1.080000 | 53987.31 | 552906.04 | 39995.03 | conventional | BaltimoreWashington |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 53410 | 2023-12-03 | 1.550513 | 204.64 | 1211.25 | 0.00 | organic | Toledo |
| 53411 | 2023-12-03 | 1.703920 | 66808.44 | 132075.11 | 58.65 | organic | West |
| 53412 | 2023-12-03 | 1.618931 | 15182.42 | 1211.38 | 0.00 | organic | WestTexNewMexico |
| 53413 | 2023-12-03 | 1.245406 | 1058.54 | 7.46 | 0.00 | organic | Wichita |


Rename the `Date` and `AveragePrice` columns to snake_case for ease of analysis.

In [27]:
avo_data = avo_data.rename(
    columns={
        "Date": "date",
        "AveragePrice": "average_unit_price"  
    }
)

Extract year and month data into separate columns.

In [28]:
avo_data['year'] = pd.DatetimeIndex(avo_data['date']).year
avo_data['month'] = pd.DatetimeIndex(avo_data['date']).month
avo_data = avo_data.drop(columns=['date'])
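As an alternative to wrapping the column in `pd.DatetimeIndex`, the same year and month columns can be derived with the `.dt` accessor. This is a sketch on a hypothetical two-row `sample` frame using dates that appear in this notebook; it is an equivalent formulation, not a replacement for the cell above.

```python
import pandas as pd

# Stand-in for the renamed `date` column, which is still stored as strings.
sample = pd.DataFrame({"date": ["2015-01-04", "2023-12-03"]})

# Convert once, then read the calendar components from the same Series.
dates = pd.to_datetime(sample["date"])
sample["year"] = dates.dt.year
sample["month"] = dates.dt.month

# Drop the original column, mirroring the notebook cell.
sample = sample.drop(columns=["date"])

print(sample)
```

Note that once `date` is dropped, the weekly records within a month all share the same `year` and `month` values, so the week-level ordering is no longer recoverable from the frame.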

Combine the three per-PLU sales columns into a single `units_sold` column, with the PLU code recorded in a new `plu_code` column.

In [30]:
avo_data = avo_data.melt(
    id_vars=['average_unit_price', 'type', 'region', 'year', 'month'],
    value_vars=['plu4046', 'plu4225', 'plu4770'],
    var_name='plu_code',
    value_name='units_sold',
)

View cleaned data.

In [31]:
avo_data.head(5)

| | average_unit_price | type | region | year | month | plu_code | units_sold |
|---|---|---|---|---|---|---|---|
| 0 | 1.22 | conventional | Albany | 2015 | 1 | plu4046 | 2819.5 |
| 1 | 1.79 | organic | Albany | 2015 | 1 | plu4046 | 57.42 |
| 2 | 1.0 | conventional | Atlanta | 2015 | 1 | plu4046 | 364302.39 |
| 3 | 1.76 | organic | Atlanta | 2015 | 1 | plu4046 | 1500.15 |
| 4 | 1.08 | conventional | BaltimoreWashington | 2015 | 1 | plu4046 | 53987.31 |
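To make the effect of the wide-to-long reshape concrete, the sketch below applies the same `melt` call to a hypothetical two-row frame named `wide`, built from the first two Albany rows shown earlier. It is an illustration of the technique on a small sample, not a rerun of the full pipeline.

```python
import pandas as pd

# Two rows in the wide layout, as they look just before the melt step.
wide = pd.DataFrame(
    {
        "average_unit_price": [1.22, 1.79],
        "type": ["conventional", "organic"],
        "region": ["Albany", "Albany"],
        "year": [2015, 2015],
        "month": [1, 1],
        "plu4046": [2819.50, 57.42],
        "plu4225": [28287.42, 153.88],
        "plu4770": [49.90, 0.00],
    }
)

# Each PLU column becomes its own block of rows, labelled by plu_code.
long = wide.melt(
    id_vars=["average_unit_price", "type", "region", "year", "month"],
    value_vars=["plu4046", "plu4225", "plu4770"],
    var_name="plu_code",
    value_name="units_sold",
)

print(long)
```

Every wide row yields three long rows, one per PLU code, so the row count triples. The `average_unit_price` value is repeated on each of those rows because the source records one average price per row rather than one per PLU code, which is worth keeping in mind when aggregating prices later.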


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [10]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [11]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [12]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [13]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [14]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [15]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* **Appendix:** For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.
* **Contributors:** If this is a group project, list the contributors and their roles or contributions to the project.
