# QCTO - Workplace Module

### Project Title: Please Insert your Project Title Here
#### Done By: Name and Surname

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [1]:
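# Modeling
import sklearn as sk
# Visualization (plotting functions live in matplotlib.pyplot, not the top-level package)
import seaborn as sn
import matplotlib.pyplot as plt
# Data manipulation
import pandas as pd
import numpy as np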

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

This project uses weekly avocado sales and pricing data collected by the Hass Avocado Board between 2015 and 2023. The data was obtained from [Kaggle](https://www.kaggle.com/datasets/vakhariapujan/avocado-prices-and-sales-volume-2015-2023/data) and is stored in the `Data/Avocado_HassAvocadoBoard_20152023.csv` file.

The data consists of 12 features and approximately 53,400 observations.
The features are as follows:

* `Date` (datetime): The date of data recording for that week.
* `AveragePrice` (float): The average selling price per unit.
* `TotalVolume` (float): The total volume by weight in pounds.
* `plu4046`: The number of bulk units of small/medium Hass avocados (~3-5oz) sold in the week.
* `plu4225`: The number of bulk units of large Hass avocados (~8-10oz) sold in the week.
* `plu4770`: The number of bulk units of extra large Hass avocados (~10-15oz) sold in the week.
* `TotalBags`: The total number of bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `SmallBags`: The total number of small bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `LargeBags`: The total number of large bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `XLargeBags`: The total number of extra large bags sold in the week. Bags consist of a variable number of avocados of mixed PLU type.
* `type`: The type of avocado (conventional/organic).
* `region`: Regions in the US in which the avocados were sold. TotalUS sales are also included.
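
One way to confirm that a loaded file matches this description is to compare its column names against the expected list. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and a one-row frame stands in for the real CSV.

```python
import pandas as pd

# Column names taken from the feature list above.
EXPECTED_COLUMNS = [
    "Date", "AveragePrice", "TotalVolume", "plu4046", "plu4225", "plu4770",
    "TotalBags", "SmallBags", "LargeBags", "XLargeBags", "type", "region",
]


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Hypothetical one-row frame standing in for the real file.
sample = pd.DataFrame([{
    "Date": "2015-01-04", "AveragePrice": 1.22, "TotalVolume": 40873.28,
    "plu4046": 2819.5, "plu4225": 28287.42, "plu4770": 49.9,
    "TotalBags": 9716.46, "SmallBags": 9186.93, "LargeBags": 529.53,
    "XLargeBags": 0.0, "type": "conventional", "region": "Albany",
}])

# An empty list means every expected column is present.
print(missing_columns(sample))
# Dropping a column makes it show up in the result.
print(missing_columns(sample.drop(columns=["region"])))
```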

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [2]:
dairy_data = pd.read_csv(r'Data/Avocado_HassAvocadoBoard_20152023.csv', sep=',')
dairy_data.head(5)

| | Date | AveragePrice | TotalVolume | plu4046 | plu4225 | plu4770 | TotalBags | SmallBags | LargeBags | XLargeBags | type | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2015-01-04 | 1.22 | 40873.28 | 2819.5 | 28287.42 | 49.9 | 9716.46 | 9186.93 | 529.53 | 0.0 | conventional | Albany |
| 1 | 2015-01-04 | 1.79 | 1373.95 | 57.42 | 153.88 | 0.0 | 1162.65 | 1162.65 | 0.0 | 0.0 | organic | Albany |
| 2 | 2015-01-04 | 1.0 | 435021.49 | 364302.39 | 23821.16 | 82.15 | 46815.79 | 16707.15 | 30108.64 | 0.0 | conventional | Atlanta |
| 3 | 2015-01-04 | 1.76 | 3846.69 | 1500.15 | 938.35 | 0.0 | 1408.19 | 1071.35 | 336.84 | 0.0 | organic | Atlanta |
| 4 | 2015-01-04 | 1.08 | 788025.06 | 53987.31 | 552906.04 | 39995.03 | 141136.68 | 137146.07 | 3990.61 | 0.0 | conventional | BaltimoreWashington |
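
Dates and categorical columns can also be typed at load time instead of afterwards. The following is a minimal sketch, assuming the file has the columns shown in the preview; it reads two preview rows from an in-memory string rather than the real CSV, and the choice of `category` for `type` and `region` is an assumption, not a project requirement.

```python
import io

import pandas as pd

# Two rows copied from the preview, standing in for the real file.
csv_text = """Date,AveragePrice,TotalVolume,plu4046,plu4225,plu4770,TotalBags,SmallBags,LargeBags,XLargeBags,type,region
2015-01-04,1.22,40873.28,2819.5,28287.42,49.9,9716.46,9186.93,529.53,0.0,conventional,Albany
2015-01-04,1.79,1373.95,57.42,153.88,0.0,1162.65,1162.65,0.0,0.0,organic,Albany
"""

preview = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],  # parse the Date column into datetime64 values
    dtype={"type": "category", "region": "category"},  # low-cardinality text columns
)

print(preview.dtypes)
```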


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

Remove irrelevant features.

In [3]:
dairy_data = dairy_data.drop(
    columns=[
        "Total Land Area (acres)",
        "Number of Cows",
        "Farm Size",
        "Product ID",
        "Total Value",
        "Shelf Life (days)",
        "Storage Condition",
        "Production Date",
        "Expiration Date",
        "Price per Unit",
        "Approx. Total Revenue(INR)",
        "Quantity in Stock (liters/kg)",
        "Reorder Quantity (liters/kg)",
    ]
)

To assist with spotting errors in the data, print summary statistics for the numerical columns.

In [4]:
dairy_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quantity (liters/kg) | 4325.0 | 500.652657 | 288.975915 | 1.17 | 254.17 | 497.55 | 749.78 | 999.93 |
| Quantity Sold (liters/kg) | 4325.0 | 248.095029 | 217.024182 | 1.0 | 69.0 | 189.0 | 374.0 | 960.0 |
| Price per Unit (sold) | 4325.0 | 54.77914 | 26.19279 | 5.21 | 32.64 | 54.14 | 77.46 | 104.51 |
| Minimum Stock Threshold (liters/kg) | 4325.0 | 55.826143 | 26.30145 | 10.02 | 32.91 | 56.46 | 79.01 | 99.99 |
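
The quartiles in such a summary can also be used to flag outliers. The sketch below applies the common 1.5 × IQR rule to a small made-up series; the helper name `iqr_outlier_mask` and the sample values are hypothetical, and the rule itself is one convention among several rather than a step this notebook performs.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up values: five similar readings and one extreme one.
values = pd.Series([10, 12, 11, 13, 12, 95])
mask = iqr_outlier_mask(values)

print(values[mask])   # the flagged values
print(values[~mask])  # the values that would be kept
```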


Determine if there are any missing values.

In [5]:
dairy_data.isnull().sum(axis=0)

Location                               0
Date                                   0
Product Name                           0
Brand                                  0
Quantity (liters/kg)                   0
Quantity Sold (liters/kg)              0
Price per Unit (sold)                  0
Customer Location                      0
Sales Channel                          0
Minimum Stock Threshold (liters/kg)    0
dtype: int64
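
All counts are zero, so no imputation is needed here. For data where the same check does find gaps, one possible approach is sketched below, assuming numerical gaps are filled with the column median and rows with missing categorical values are dropped. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing number and one missing category.
toy = pd.DataFrame({
    "price": [1.2, np.nan, 1.0, 1.8],
    "region": ["Albany", "Atlanta", None, "Albany"],
})

print(toy.isnull().sum())

cleaned = toy.copy()

# Fill numerical gaps with each column's median.
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

# Drop any rows that still contain missing (categorical) values.
cleaned = cleaned.dropna().reset_index(drop=True)

print(cleaned)
```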

Verify that the column data types are correct.

In [6]:
dairy_data.dtypes

Location                                object
Date                                    object
Product Name                            object
Brand                                   object
Quantity (liters/kg)                   float64
Quantity Sold (liters/kg)                int64
Price per Unit (sold)                  float64
Customer Location                       object
Sales Channel                           object
Minimum Stock Threshold (liters/kg)    float64
dtype: object

Rename the columns to snake_case so they are easier to reference during analysis.

In [7]:
dairy_data = dairy_data.rename(
    columns={
        "Location": "location",
        "Date": "date",
        "Product Name": "product_name",
        "Brand": "brand",
        "Quantity (liters/kg)": "quantity",
        "Quantity Sold (liters/kg)": "quantity_sold",
        "Price per Unit (sold)": "price_per_unit_sold",
        "Customer Location": "customer_location",
        "Sales Channel": "sales_channel",
        "Minimum Stock Threshold (liters/kg)": "min_stock_threshold"        
    }
)
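
For wider tables, the same renaming can be done programmatically. The sketch below uses a hypothetical helper, `to_snake_case`, that lower-cases each name and replaces runs of non-alphanumeric characters with underscores. Unlike the hand-written mapping, it keeps unit suffixes and does not abbreviate, so the resulting names differ slightly.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a column name and join its alphanumeric runs with underscores."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()


# Hypothetical empty frame that only carries a few of the original column names.
frame = pd.DataFrame(
    columns=["Product Name", "Price per Unit (sold)", "Quantity (liters/kg)"]
)

# DataFrame.rename accepts a function and applies it to every column label.
frame = frame.rename(columns=to_snake_case)

print(list(frame.columns))
```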

Extract year and month data into separate columns.

In [8]:
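# Parse the date strings once, then derive the year and month columns.
dates = pd.to_datetime(dairy_data['date'])
dairy_data['year'] = dates.dt.year
dairy_data['month'] = dates.dt.month
# The original date column is no longer needed.
dairy_data = dairy_data.drop(columns=['date'])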

View cleaned data.

In [9]:
dairy_data.head(5)

| | location | product_name | brand | quantity | quantity_sold | price_per_unit_sold | customer_location | sales_channel | min_stock_threshold | year | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Telangana | Ice Cream | Dodla Dairy | 222.4 | 7 | 82.24 | Madhya Pradesh | Wholesale | 19.55 | 2022 | 2 |
| 1 | Uttar Pradesh | Milk | Amul | 687.48 | 558 | 39.24 | Kerala | Wholesale | 43.17 | 2021 | 12 |
| 2 | Tamil Nadu | Yogurt | Dodla Dairy | 503.48 | 256 | 33.81 | Madhya Pradesh | Online | 15.1 | 2022 | 2 |
| 3 | Telangana | Cheese | Britannia Industries | 823.36 | 601 | 28.92 | Rajasthan | Online | 74.5 | 2019 | 6 |
| 4 | Maharashtra | Buttermilk | Mother Dairy | 147.77 | 145 | 83.07 | Jharkhand | Retail | 76.02 | 2020 | 12 |


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [10]:
#Please use code cells to code in and do not forget to comment your code.
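
As a starting point for this section, the sketch below computes a correlation matrix and a grouped average. It is illustrative only: the frame holds five rows copied from the cleaned-data preview, which is far too few to support any conclusion, and the choice of columns is an assumption.

```python
import pandas as pd

# Five rows copied from the cleaned-data preview.
sample = pd.DataFrame({
    "product_name": ["Ice Cream", "Milk", "Yogurt", "Cheese", "Buttermilk"],
    "sales_channel": ["Wholesale", "Wholesale", "Online", "Online", "Retail"],
    "quantity": [222.4, 687.48, 503.48, 823.36, 147.77],
    "quantity_sold": [7, 558, 256, 601, 145],
    "price_per_unit_sold": [82.24, 39.24, 33.81, 28.92, 83.07],
})

# Pairwise Pearson correlations between the numerical columns.
corr = sample[["quantity", "quantity_sold", "price_per_unit_sold"]].corr()
print(corr.round(2))

# Average quantity sold per sales channel.
mean_sold = sample.groupby("sales_channel")["quantity_sold"].mean()
print(mean_sold)
```

On the full dataset, the same `corr` result could be passed to a heatmap and the grouped averages to a bar plot.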

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [11]:
#Please use code cells to code in and do not forget to comment your code.
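
A minimal sketch of the train-and-fit workflow in scikit-learn follows. It uses synthetic data with a known linear relationship instead of the project data, and linear regression is chosen purely for illustration, not as the model this project settles on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): one feature, y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(seed=42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# Hold out 20% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit an ordinary least squares model on the training split.
model = LinearRegression()
model.fit(X_train, y_train)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2 on held-out data:", model.score(X_test, y_test))
```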

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [12]:
#Please use code cells to code in and do not forget to comment your code.
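
The classification metrics named above can be computed with `sklearn.metrics`. The sketch below uses made-up label vectors, not model output from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 is the positive class, 0 the negative class.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

With three true positives, three true negatives, one false positive, and one false negative, all four metrics come to 0.75. For a continuous target such as price, regression metrics like `mean_absolute_error` or `r2_score` from the same module would be the analogous choices.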

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [13]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [14]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [15]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
