# QCTO - Workplace Module

### Project Title: Exploring Vegetable Prices
#### Done By: Lebogang Malata

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

**Introduction**

This project focuses on the analysis of vegetable prices across different regions, utilizing an extensive dataset sourced from the authorized platform Agmarknet. As vegetable pricing plays a crucial role in the agricultural sector, understanding its trends is essential for both producers and consumers. The dataset provides a comprehensive view of price fluctuations over time for various vegetables, offering valuable insights into regional price differences and trends.

**Goals and Significance**

The primary goal of this project is to analyze the factors influencing vegetable prices and identify patterns or anomalies across different regions and timeframes. The analysis will help in answering key questions such as:

* How do vegetable prices vary by region and season?
* Are there any recurring trends or significant shifts over the years?
* What external factors might be influencing these price changes (e.g., climate, transportation, or supply chain disruptions)?

By addressing these questions, the project aims to offer data-driven insights to policymakers, farmers, and market participants to enhance decision-making, optimize supply chains, and improve price stability.

**Problem Domain**

Agricultural markets are known for their volatility, particularly with vegetable prices that can fluctuate due to numerous factors, including supply chain issues, weather conditions, regional demand, and government policies. This project seeks to explore the dynamics of vegetable pricing and how these fluctuations affect both local and national economies. Additionally, it will highlight the challenges faced by farmers, traders, and consumers due to unpredictable price changes, and suggest ways data can be used to mitigate these issues.

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>


In [1]:
# Importing pandas for data manipulation and analysis
import pandas as pd

# Importing numpy for numerical operations and handling arrays
import numpy as np

# Importing seaborn for statistical data visualization
import seaborn as sns

# Importing matplotlib for plotting graphs and charts
import matplotlib.pyplot as plt

# Importing LinearRegression from sklearn.linear_model for creating and training linear regression models
from sklearn.linear_model import LinearRegression

# Importing metrics from sklearn for evaluating model performance
from sklearn import metrics

# Importing rc from matplotlib to control the default settings of plots
from matplotlib import rc

# Importing warnings to suppress warning messages in the notebook output
import warnings
warnings.filterwarnings('ignore')

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

The vegetable price dataset was collected from the Agmarknet portal and records vegetable prices in India, in Indian rupees (INR). The source is described as covering timestamps, region/market details, vegetable types, prices, and available quantities, and as being collected through official APIs. The `prices.csv` file loaded in this notebook is narrower: it has 287 rows and 11 columns, made up of one `Price Dates` column and one price column for each of ten vegetables. It still supports analysis of pricing trends over time, which makes it a useful resource for researchers and analysts studying vegetable market dynamics.


---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>


In [2]:
# loading dataset
prices_df = pd.read_csv('prices.csv')
prices_df.head()

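| | Price Dates | Bhindi (Ladies finger) | Tomato | Onion | Potato | Brinjal | Garlic | Peas | Methi | Green Chilli | Elephant Yam (Suran) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01-01-2023 | 35.0 | 18 | 22.0 | 20 | 30 | 50 | 25 | 8 | 45.0 | 25 |
| 1 | 02-01-2023 | 35.0 | 16 | 22.0 | 20 | 30 | 55 | 25 | 7 | 40.0 | 25 |
| 2 | 03-01-2023 | 35.0 | 16 | 21.0 | 20 | 30 | 55 | 25 | 7 | 40.0 | 25 |
| 3 | 04-01-2023 | 30.0 | 16 | 21.0 | 22 | 25 | 55 | 25 | 7 | 40.0 | 25 |
| 4 | 08-01-2023 | 35.0 | 16 | 20.0 | 21 | 25 | 55 | 22 | 6 | 35.0 | 25 |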


<div class="alert alert-block alert-danger">
<b>To protect the original data from unintended changes</b>, a copy of the DataFrame is made with the prices_df.copy()
method and referred to as `prices_copy_df`. All cleaning steps are applied to the copy.
</div>

In [3]:
# The copy of the dataframe
prices_copy_df = prices_df.copy()
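
As a minimal sketch of why the copy matters, the block below uses a tiny stand-in frame (the names `original` and `working` are illustrative and not part of the notebook) to show that edits to a copy leave the source frame untouched.

```python
import pandas as pd

# Tiny stand-in frame; the two values are the first Tomato prices shown above
original = pd.DataFrame({"Tomato": [18, 16]})

# DataFrame.copy() defaults to deep=True, so the copy owns its own data
working = original.copy()

# Overwrite a value in the copy only
working.loc[0, "Tomato"] = 0

print(original.loc[0, "Tomato"])  # the original value is unchanged
print(working.loc[0, "Tomato"])   # the copy holds the new value
```

A plain assignment such as `working = original` would only create a second name for the same object, so changes made through either name would affect both.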

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [4]:
# Displays the number of rows and columns
prices_copy_df.shape

(287, 11)

The dataset consists of 287 rows (observations) and 11 columns: one date column and ten vegetable price columns.

In [5]:
# Display summary information about the DataFrame
prices_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Price Dates             287 non-null    object 
 1   Bhindi (Ladies finger)  287 non-null    float64
 2   Tomato                  287 non-null    int64  
 3   Onion                   287 non-null    float64
 4   Potato                  287 non-null    int64  
 5   Brinjal                 287 non-null    int64  
 6   Garlic                  287 non-null    int64  
 7   Peas                    287 non-null    int64  
 8   Methi                   287 non-null    int64  
 9   Green Chilli            287 non-null    float64
 10  Elephant Yam (Suran)    287 non-null    int64  
dtypes: float64(3), int64(7), object(1)
memory usage: 24.8+ KB


In [6]:
# Replace spaces in column names with underscores and remove parentheses for consistency
prices_copy_df.columns = [col.replace(" ", "_").replace("(", "").replace(")", "") for col in prices_copy_df.columns]
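
The same renaming can also be written with vectorized string methods. The sketch below is an alternative, not the notebook's method, and applies it to four of the original column names; the variable names `columns` and `cleaned` are illustrative.

```python
import pandas as pd

# Four of the original column names from prices.csv
columns = pd.Index([
    "Price Dates",
    "Bhindi (Ladies finger)",
    "Green Chilli",
    "Elephant Yam (Suran)",
])

# Remove parentheses first, then replace the remaining spaces with underscores
cleaned = (
    columns
    .str.replace(r"[()]", "", regex=True)
    .str.replace(" ", "_", regex=False)
)

print(list(cleaned))
# ['Price_Dates', 'Bhindi_Ladies_finger', 'Green_Chilli', 'Elephant_Yam_Suran']
```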


In [8]:
# Convert 'Price_Dates' column to datetime format
prices_copy_df['Price_Dates'] = pd.to_datetime(prices_copy_df['Price_Dates'], format='%d-%m-%Y')
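
The explicit `format='%d-%m-%Y'` matters because the raw strings are day-first. The sketch below parses three of the date strings shown earlier with the correct format and, for contrast, with a month-first format; `raw`, `day_first`, and `month_first` are illustrative names.

```python
import pandas as pd

# Three raw date strings as they appear in the Price Dates column
raw = pd.Series(["01-01-2023", "04-01-2023", "08-01-2023"])

# Correct interpretation: day-month-year, so every date falls in January
day_first = pd.to_datetime(raw, format="%d-%m-%Y")

# Wrong interpretation: month-day-year, which spreads the dates over the year
month_first = pd.to_datetime(raw, format="%m-%d-%Y")

print(day_first.dt.month.tolist())    # [1, 1, 1]
print(month_first.dt.month.tolist())  # [1, 4, 8]
```

Both calls succeed without an error, which is why a wrong format can go unnoticed unless the parsed dates are checked.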

In [9]:
# Identify and count the duplicates
print("Number of Duplicates:", prices_copy_df.duplicated().sum())

Number of Duplicates: 0
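
Alongside the duplicate check, the missing-value and outlier steps named in this section's details could be sketched as follows. This is one common approach (the 1.5 × IQR rule), not a method the notebook prescribes. The helper `iqr_outlier_mask` and the `demo` frame are hypothetical, and the prices in `demo` are made up for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical prices, made up for illustration; 60 is a deliberate spike
demo = pd.DataFrame({"Tomato": [20, 21, 22, 20, 21, 60]})

# Count missing values per column
missing_per_column = demo.isna().sum()

# Flag rows whose Tomato price falls outside the IQR fences
outliers = iqr_outlier_mask(demo["Tomato"])

print(missing_per_column)
print(demo[outliers])
```

In the notebook, the same two calls could be applied to each price column of `prices_copy_df`. Whether a flagged price is an error or a genuine market spike is a judgment call, so flagged rows are better inspected than dropped automatically.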


In [12]:
# Final Cleaned DataFrame
prices_copy_df.head()

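| | Price_Dates | Bhindi_Ladies_finger | Tomato | Onion | Potato | Brinjal | Garlic | Peas | Methi | Green_Chilli | Elephant_Yam_Suran |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 35.0 | 18 | 22.0 | 20 | 30 | 50 | 25 | 8 | 45.0 | 25 |
| 1 | 2023-01-02 | 35.0 | 16 | 22.0 | 20 | 30 | 55 | 25 | 7 | 40.0 | 25 |
| 2 | 2023-01-03 | 35.0 | 16 | 21.0 | 20 | 30 | 55 | 25 | 7 | 40.0 | 25 |
| 3 | 2023-01-04 | 30.0 | 16 | 21.0 | 22 | 25 | 55 | 25 | 7 | 40.0 | 25 |
| 4 | 2023-01-08 | 35.0 | 16 | 20.0 | 21 | 25 | 55 | 22 | 6 | 35.0 | 25 |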


In [13]:
# Confirm the column names and data types after cleaning
prices_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Price_Dates           287 non-null    datetime64[ns]
 1   Bhindi_Ladies_finger  287 non-null    float64       
 2   Tomato                287 non-null    int64         
 3   Onion                 287 non-null    float64       
 4   Potato                287 non-null    int64         
 5   Brinjal               287 non-null    int64         
 6   Garlic                287 non-null    int64         
 7   Peas                  287 non-null    int64         
 8   Methi                 287 non-null    int64         
 9   Green_Chilli          287 non-null    float64       
 10  Elephant_Yam_Suran    287 non-null    int64         
dtypes: datetime64[ns](1), float64(3), int64(7)
memory usage: 24.8 KB


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---
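
As a minimal sketch of the exploration described above, the block below builds a small frame from three columns of the first five rows shown earlier in this notebook, computes summary statistics, reshapes the data to long format, and draws one line per vegetable. The names `sample`, `long_df`, and `mean_price` are illustrative; in the notebook, `prices_copy_df` would take the place of `sample`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Three columns from the first five rows of the cleaned data shown earlier
sample = pd.DataFrame({
    "Price_Dates": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-08"]
    ),
    "Tomato": [18, 16, 16, 16, 16],
    "Onion": [22.0, 22.0, 21.0, 21.0, 20.0],
    "Potato": [20, 20, 20, 22, 21],
})

# Summary statistics (count, mean, std, quartiles) for each price column
summary = sample.drop(columns="Price_Dates").describe()
print(summary)

# Long format: one row per (date, vegetable) pair, convenient for grouping and plotting
long_df = sample.melt(id_vars="Price_Dates", var_name="Vegetable", value_name="Price")

# Average price per vegetable over the sample period
mean_price = long_df.groupby("Vegetable")["Price"].mean()
print(mean_price)

# One line per vegetable to compare price movements over time
fig, ax = plt.subplots(figsize=(8, 4))
for name, group in long_df.groupby("Vegetable"):
    ax.plot(group["Price_Dates"], group["Price"], marker="o", label=name)
ax.set_xlabel("Date")
ax.set_ylabel("Price (INR)")
ax.set_title("Vegetable prices over time (sample rows)")
ax.legend()
fig.autofmt_xdate()
plt.close(fig)  # in a notebook, plt.show() would display the figure instead
```

The long format also suits seaborn, which this notebook imports: functions such as `sns.lineplot` and `sns.boxplot` accept a long-format frame with `x`, `y`, and `hue` column names.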



---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---
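
Since the notebook imports `LinearRegression`, one possible starting point is a simple trend model that regresses a vegetable's price on the number of days since the first observation. The sketch below is an assumption about how such a model could be set up, not the project's chosen model. It uses the Onion prices from the first five rows shown earlier, and `days_since_start` is a hypothetical engineered feature.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Onion prices from the first five rows of the cleaned data shown earlier
sample = pd.DataFrame({
    "Price_Dates": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-08"]
    ),
    "Onion": [22.0, 22.0, 21.0, 21.0, 20.0],
})

# Feature engineering: days elapsed since the first observation (hypothetical feature)
sample["days_since_start"] = (
    sample["Price_Dates"] - sample["Price_Dates"].min()
).dt.days

X = sample[["days_since_start"]]  # 2-D feature matrix expected by scikit-learn
y = sample["Onion"]               # 1-D target vector

# Ordinary least squares with an intercept (the default, fit_intercept=True)
model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
print(f"Estimated price change per day: {slope:.3f} INR")
print(f"Estimated price on day 0: {intercept:.3f} INR")
```

Five rows are far too few to draw conclusions from; the point is only the shape of the inputs (`X` as a 2-D frame, `y` as a 1-D series) and where the fitted parameters are stored.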



---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
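* **Details:** Present the metrics used to evaluate the models. For a regression task such as price prediction, suitable metrics include mean absolute error (MAE), root mean squared error (RMSE), and R²; accuracy, precision, recall, and F1-score apply to classification tasks. Discuss validation techniques employed, such as cross-validation or train/test split.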
---
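
Because the notebook imports a regression model and `sklearn.metrics`, regression metrics are the natural fit for evaluation here. The sketch below computes MAE, RMSE, and R² on hypothetical actual and predicted prices; the arrays `y_true` and `y_pred` are made up for illustration and are not results from this project.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted prices (INR), made up for illustration only
y_true = np.array([20.0, 22.0, 21.0, 23.0])
y_pred = np.array([20.5, 21.5, 21.5, 22.5])

# Mean absolute error: average size of the prediction errors, in INR
mae = metrics.mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalizes large errors more heavily, also in INR
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_pred))

# R²: share of the variance in y_true explained by the predictions
r2 = metrics.r2_score(y_true, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.2f}")
```

For time-ordered prices, a chronological split (training on earlier dates and testing on later ones) avoids evaluating the model on dates that precede its training data.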


---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---



---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---



---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---


## Additional Sections to Consider

* **Appendix:** For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.
* **Contributors:** If this is a group project, list the contributors and their roles or contributions to the project.
