# QCTO - Workplace Module

### Project Title: Evaluation of Vegetable Prices between January 2023 and January 2024
#### Done By: Tshepho Mabusela

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

This project evaluates the prices of ten vegetables between January 2023 and January 2024 to determine whether prices vary by season. The goal is to help consumers and vegetable retailers identify the best season in which to buy each vegetable.

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [2]:
import pandas as pd # used for data manipulation and analysis
import matplotlib.pyplot as plt # used for static, animated, and interactive visualizations.
import seaborn as sns # used for drawing attractive and informative statistical graphics.

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

The dataset was retrieved from kaggle.com. It contains the prices of 10 different vegetables on 287 dates between 1 January 2023 and 1 January 2024.
All price columns are numeric and there are no missing values.

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [3]:
# Load the vegetable price data and preview the first five rows
data = pd.read_csv('./prices.csv')
data.head()

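| | Price Dates | Bhindi (Ladies finger) | Tomato | Onion | Potato | Brinjal | Garlic | Peas | Methi | Green Chilli | Elephant Yam (Suran) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01-01-2023 | 35.0 | 18 | 22.0 | 20 | 30 | 50 | 25 | 8 | 45.0 | 25 |
| 1 | 02-01-2023 | 35.0 | 16 | 22.0 | 20 | 30 | 55 | 25 | 7 | 40.0 | 25 |
| 2 | 03-01-2023 | 35.0 | 16 | 21.0 | 20 | 30 | 55 | 25 | 7 | 40.0 | 25 |
| 3 | 04-01-2023 | 30.0 | 16 | 21.0 | 22 | 25 | 55 | 25 | 7 | 40.0 | 25 |
| 4 | 08-01-2023 | 35.0 | 16 | 20.0 | 21 | 25 | 55 | 22 | 6 | 35.0 | 25 |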
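
The description of the dataset (size and column types) can be checked directly in code. The following is a minimal sketch, assuming a small sample built from the first rows and a few of the columns previewed above; the name `sample` is illustrative and stands in for the full `data` dataframe.

```python
import pandas as pd

# A few columns from the first five rows of the preview above.
sample = pd.DataFrame({
    'Price Dates': ['01-01-2023', '02-01-2023', '03-01-2023', '04-01-2023', '08-01-2023'],
    'Bhindi (Ladies finger)': [35.0, 35.0, 35.0, 30.0, 35.0],
    'Tomato': [18, 16, 16, 16, 16],
    'Onion': [22.0, 22.0, 21.0, 21.0, 20.0],
})

# shape gives (number of rows, number of columns).
n_rows, n_cols = sample.shape

# Split the columns into numeric and non-numeric ones.
numeric_columns = sample.select_dtypes(include='number').columns.tolist()
non_numeric_columns = sample.select_dtypes(exclude='number').columns.tolist()

print(n_rows, n_cols)
print(numeric_columns)
print(non_numeric_columns)
```

On the raw data, the date column is the only non-numeric column because the dates are still stored as text at this point.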


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [4]:
# Work on a copy so that the data in the original dataframe is preserved.
df_copy = data.copy()

# Convert the 'Price Dates' column from object (text) to datetime, reading the values as day-month-year.
df_copy['Price Dates'] = pd.to_datetime(df_copy['Price Dates'], format='%d-%m-%Y')
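
The explicit format matters because a value such as `02-01-2023` is ambiguous without it. The following is a minimal sketch of the same conversion on a few date strings from the preview above, plus one deliberately malformed entry that is hypothetical and not part of the dataset. The `errors='coerce'` option is an addition for illustration and is not used in the notebook.

```python
import pandas as pd

# Three dates from the 'Price Dates' column and one hypothetical malformed value.
raw_dates = pd.Series(['01-01-2023', '02-01-2023', '08-01-2023', 'not a date'])

# The day-first format reads '02-01-2023' as 2 January, not 1 February.
# errors='coerce' turns values that do not match the format into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format='%d-%m-%Y', errors='coerce')

print(parsed)
print("Unparseable dates:", parsed.isna().sum())
```

Counting the `NaT` values after conversion is a quick way to confirm that every date in the file followed the expected format.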


In [5]:
"""
Ensure consistent data types for all variables 

We have 
    Bhindi (Ladies finger)
    Onion and
    Green Chilli
with inconsistent data types for some cases in these variables
    
""" 

df_copy['Bhindi (Ladies finger)'] = df_copy['Bhindi (Ladies finger)'].astype('int64')
df_copy['Onion'] = df_copy['Onion'].astype('int64')
df_copy['Green Chilli'] = df_copy['Green Chilli'].astype('int64')
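
Casting a float column with `astype('int64')` drops any fractional part without warning, so it can be worth confirming first that every value is a whole number. The following is a minimal sketch of that check, assuming a sample built from the first five rows of the three float columns; the guard itself is an illustrative addition and not part of the notebook.

```python
import pandas as pd

# First five rows of the three float columns from the preview above.
sample = pd.DataFrame({
    'Bhindi (Ladies finger)': [35.0, 35.0, 35.0, 30.0, 35.0],
    'Onion': [22.0, 22.0, 21.0, 21.0, 20.0],
    'Green Chilli': [45.0, 40.0, 40.0, 40.0, 35.0],
})

for column in sample.columns:
    # Cast only when every value is a whole number, so nothing is truncated.
    if (sample[column] % 1 == 0).all():
        sample[column] = sample[column].astype('int64')

print(sample.dtypes)
```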

In [6]:
missing_values = df_copy.isnull().sum()
print(missing_values)

Price Dates               0
Bhindi (Ladies finger)    0
Tomato                    0
Onion                     0
Potato                    0
Brinjal                   0
Garlic                    0
Peas                      0
Methi                     0
Green Chilli              0
Elephant Yam (Suran)      0
dtype: int64


In [7]:
# Identify duplicate rows
duplicates = df_copy.duplicated()
print("Duplicate Rows:\n", df_copy[duplicates])

Duplicate Rows:
 Empty DataFrame
Columns: [Price Dates, Bhindi (Ladies finger), Tomato, Onion, Potato, Brinjal, Garlic, Peas, Methi, Green Chilli, Elephant Yam (Suran)]
Index: []
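
`duplicated()` with no arguments only flags rows that match in every column. For dated prices, a repeated date with different prices would not be caught by that check. The following is a minimal sketch of checking the date column on its own, using hypothetical rows that are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: the same date appears twice with different prices.
sample = pd.DataFrame({
    'Price Dates': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-02']),
    'Tomato': [18, 16, 17],
})

# Rows that are identical in every column.
full_row_duplicates = sample.duplicated().sum()

# Rows that share a date with another row; keep=False marks every occurrence.
repeated_dates = sample.duplicated(subset='Price Dates', keep=False).sum()

print("Full-row duplicates:", full_row_duplicates)
print("Rows with a repeated date:", repeated_dates)
```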


As observed, there are no missing values in our columns and no duplicated rows in our data.
If there were missing values, we would replace them with a default value, such as the column mean.

The following is how we would handle columns with missing values

In [8]:
# If Tomato column had missing values
tomato_mean = df_copy['Tomato'].mean()
df_copy['Tomato'] = df_copy['Tomato'].fillna(value=tomato_mean)
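
Because the real Tomato column has no gaps, the cell above leaves it unchanged. The following is a minimal sketch of the same mean fill on a hypothetical series with one missing value, so the effect is visible; the values are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical Tomato prices with one missing value.
tomato = pd.Series([18.0, 16.0, np.nan, 16.0, 16.0], name='Tomato')

missing_before = tomato.isna().sum()

# mean() skips NaN by default, so the fill value is the mean of the four known prices.
filled = tomato.fillna(tomato.mean())

print("Missing before:", missing_before)
print(filled)
```

The median could be used in the same way and may be a safer default when a column contains extreme values.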

The following is how we would handle malformed records when loading the data

In [None]:
# If some rows in the file are malformed (for example, they have too many fields), we can skip them while loading.
# on_bad_lines='warn' skips each malformed row and emits a warning that identifies it.
# It replaces error_bad_lines and warn_bad_lines, which were deprecated in pandas 1.3 and removed in pandas 2.0.
test_df = pd.read_csv('./prices.csv', on_bad_lines='warn')
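
The following is a minimal sketch of skipping malformed rows, assuming pandas 1.3 or later, where `read_csv` accepts the `on_bad_lines` parameter. The CSV text is hypothetical and is read from memory so that the example does not depend on `prices.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text: the third data row has an extra field and is malformed.
csv_text = (
    "Price Dates,Tomato,Onion\n"
    "01-01-2023,18,22\n"
    "02-01-2023,16,22\n"
    "03-01-2023,16,21,99\n"
    "04-01-2023,16,21\n"
)

# on_bad_lines='skip' drops rows with too many fields without raising an error.
clean = pd.read_csv(io.StringIO(csv_text), on_bad_lines='skip')

print(clean)
```

Using `on_bad_lines='warn'` instead would skip the same row and also report it, which is useful for auditing how many records were lost.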

The following is how we would filter data in our data frame

In [10]:
filter_condition = (df_copy['Brinjal'] > 30)
df_copy[filter_condition]

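| | Price Dates | Bhindi (Ladies finger) | Tomato | Onion | Potato | Brinjal | Garlic | Peas | Methi | Green Chilli | Elephant Yam (Suran) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 2023-01-22 | 45 | 16 | 18 | 22 | 40 | 65 | 25 | 9 | 40 | 35 |
| 10 | 2023-01-23 | 42 | 16 | 18 | 22 | 50 | 65 | 25 | 8 | 40 | 35 |
| 11 | 2023-01-24 | 45 | 16 | 16 | 22 | 60 | 65 | 28 | 15 | 35 | 30 |
| 12 | 2023-01-25 | 40 | 16 | 16 | 20 | 70 | 65 | 25 | 15 | 35 | 25 |
| 13 | 2023-01-27 | 35 | 16 | 17 | 17 | 50 | 65 | 25 | 15 | 35 | 25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 282 | 2023-12-27 | 45 | 16 | 30 | 20 | 70 | 260 | 40 | 16 | 40 | 25 |
| 283 | 2023-12-28 | 45 | 16 | 30 | 20 | 70 | 260 | 30 | 20 | 45 | 25 |
| 284 | 2023-12-29 | 45 | 16 | 30 | 22 | 80 | 260 | 30 | 18 | 50 | 25 |
| 285 | 2023-12-31 | 45 | 16 | 26 | 20 | 60 | 250 | 40 | 16 | 50 | 40 |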
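
Filters can also be combined, for example a price threshold together with a date range. The following is a minimal sketch, assuming a sample built from rows 9 to 13 of the output above; the thresholds are chosen only for illustration.

```python
import pandas as pd

# Rows 9 to 13 of the filtered output above, restricted to two columns.
sample = pd.DataFrame({
    'Price Dates': pd.to_datetime(
        ['2023-01-22', '2023-01-23', '2023-01-24', '2023-01-25', '2023-01-27']),
    'Brinjal': [40, 50, 60, 70, 50],
})

# Build each condition separately, then combine them with & (and) or | (or).
expensive = sample['Brinjal'] > 45
after_23rd = sample['Price Dates'] > '2023-01-23'
subset = sample[expensive & after_23rd]

print(subset)
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses because `&` binds more tightly than the comparison operators.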


In [11]:
df_copy.describe()

| | Price Dates | Bhindi (Ladies finger) | Tomato | Onion | Potato | Brinjal | Garlic | Peas | Methi | Green Chilli | Elephant Yam (Suran) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 287 | 287.0 | 287.0 | 287.0 | 287.0 | 287.0 | 287.0 | 287.0 | 287.0 | 287.0 | 287.0 |
| mean | 2023-07-04 21:54:33.867595776 | 29.379791 | 16.006969 | 20.637631 | 18.585366 | 31.655052 | 133.101045 | 66.658537 | 20.383275 | 44.121951 | 28.797909 |
| min | 2023-01-01 00:00:00 | 17.0 | 16.0 | 8.0 | 12.0 | 14.0 | 50.0 | 22.0 | 5.0 | 0.0 | 12.0 |
| 25% | 2023-04-06 12:00:00 | 22.0 | 16.0 | 12.0 | 16.0 | 25.0 | 85.0 | 40.0 | 8.0 | 35.0 | 25.0 |
| 50% | 2023-07-04 00:00:00 | 27.0 | 16.0 | 16.0 | 20.0 | 30.0 | 120.0 | 60.0 | 12.0 | 40.0 | 30.0 |
| 75% | 2023-10-01 12:00:00 | 33.0 | 16.0 | 25.0 | 20.0 | 35.0 | 165.0 | 80.0 | 16.0 | 50.0 | 30.0 |
| max | 2024-01-01 00:00:00 | 60.0 | 18.0 | 57.0 | 24.0 | 80.0 | 290.0 | 150.0 | 2000.0 | 90.0 | 50.0 |
| std | | 8.137284 | 0.118056 | 11.722358 | 2.726238 | 11.725421 | 60.078331 | 33.302415 | 117.428417 | 12.798155 | 6.607973 |
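
The summary statistics show a Methi maximum of 2000 against a 75th percentile of 16, and a Green Chilli minimum of 0.0. These may be data entry errors and are worth checking before any analysis. The following is a minimal sketch of flagging such values with the interquartile range (IQR) rule, which is one common convention and not a method used in the notebook. The series is assembled from Methi values visible in the outputs above plus the reported maximum; its ordering is hypothetical.

```python
import pandas as pd

# Methi values visible in the outputs above, plus the reported maximum of 2000.
methi = pd.Series(
    [8, 7, 7, 7, 6, 9, 8, 15, 15, 15, 16, 20, 18, 16, 2000], name='Methi')

# The IQR rule treats values beyond 1.5 * IQR from the quartiles as outliers.
q1 = methi.quantile(0.25)
q3 = methi.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = methi[(methi < lower) | (methi > upper)]

print(f"Bounds: {lower} to {upper}")
print(outliers)
```

Whether a flagged value should be corrected, removed, or kept depends on what the source data shows, so the rule is best used to shortlist rows for inspection.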


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---
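
One way to start looking for seasonality is to average each vegetable's price by calendar month. The following is a minimal sketch, assuming a sample built from the January and December rows shown in the earlier outputs; the names `sample` and `monthly_mean` are illustrative, and the full analysis would apply the same grouping to `df_copy`.

```python
import pandas as pd

# Garlic and Potato prices from the first five rows and the last four rows shown earlier.
sample = pd.DataFrame({
    'Price Dates': pd.to_datetime([
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-08',
        '2023-12-27', '2023-12-28', '2023-12-29', '2023-12-31']),
    'Garlic': [50, 55, 55, 55, 55, 260, 260, 260, 250],
    'Potato': [20, 20, 20, 22, 21, 20, 20, 22, 20],
})

# Group by the month number of each date and average the prices within each month.
monthly_mean = (
    sample
    .groupby(sample['Price Dates'].dt.month)[['Garlic', 'Potato']]
    .mean()
)
monthly_mean.index.name = 'Month'

print(monthly_mean)
```

In this small sample the Garlic average rises sharply between January and December while Potato barely moves. With only one year of data, a difference like this cannot by itself separate a seasonal pattern from a general price trend.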


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
