# QCTO - Workplace Module

### Project Title: Please Insert your Project Title Here
#### Done By: LINDOKUHLE MHLONGO

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

This study analyses and forecasts avocado prices in the United States from 2015 to 2023. Price fluctuations affect customer demand, and the economic crisis during this period must also be taken into account when interpreting them. The analysis is based on a cleaned dataset and is presented through visualisations such as pie charts and bar graphs, which help the reader follow the investigation of avocado pricing in the United States. Bagged avocado sales are recorded in three categories: small bags, large bags, and extra-large bags.

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [20]:
# Libraries for data loading, manipulation and analysis

import numpy as np
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

# Displays output inline
%matplotlib inline

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

Our dataset is "Avocado Prices and Sales Volume 2015-2023", covering the USA. It was obtained from Kaggle (https://www.kaggle.com), where it appears to have been uploaded by the user 'vakhariapujan'. The data is provided as a CSV file and had last been updated about four months before it was accessed.

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [21]:
Avo_df = pd.read_csv('\\Users\\lindo\\OneDrive\\Desktop\\Workplace_Lindo\\Avocado_HassAvocadoBoard_20152023v1.0.1.csv')
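
The hard-coded Windows path in the cell above only resolves on the author's machine. A minimal sketch of a more portable loader follows. `load_avocado_csv` and its `data_dir` argument are hypothetical names introduced for illustration, and parsing `Date` as a datetime assumes the column holds parseable date strings.

```python
from pathlib import Path

import pandas as pd

# File name taken from the loading cell above.
CSV_NAME = "Avocado_HassAvocadoBoard_20152023v1.0.1.csv"


def load_avocado_csv(data_dir, csv_name=CSV_NAME):
    """Load the avocado CSV from data_dir (hypothetical helper, not the author's code)."""
    csv_path = Path(data_dir) / csv_name
    # Fail early with a clear message if the file is not where we expect it.
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected the dataset at {csv_path}")
    # Assumption: 'Date' contains date strings that pandas can parse.
    return pd.read_csv(csv_path, parse_dates=["Date"])
```

With the CSV in the notebook's working directory, `Avo_df = load_avocado_csv(".")` followed by `Avo_df.head()` would load the data and display the first few rows.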

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [22]:
def check_null_values(Avo_df_copy):
    """
    Print the count of null values for each column in a DataFrame.

    This function iterates through each column in the DataFrame to check for the presence of null values.
    If a column contains null values, it prints the column name along with the number of null values.

    Parameters:
    Avo_df_copy (DataFrame): The pandas DataFrame to check for null values.

    Returns:
    None: This function does not return a value; it only prints information.
    """
    for column in Avo_df_copy:
        if Avo_df_copy[column].isnull().any():
            print('{0} has {1} null values'.format(column, Avo_df_copy[column].isnull().sum()))

In [24]:
Avo_df_copy = Avo_df.copy()

In [25]:
check_null_values(Avo_df_copy)

SmallBags has 12390 null values
LargeBags has 12390 null values
XLargeBags has 12390 null values
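
The same null check can be done without a loop, which also makes it easy to report the share of missing rows. The sketch below is one possible approach, not the notebook's own method. `null_summary` is a hypothetical helper, and the toy frame uses made-up values purely for illustration.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Return null counts and percentages for columns that contain nulls."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]  # keep only columns with missing values
    return pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })


# Toy frame with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "AveragePrice": [1.2, 1.4, 1.1, 1.3],
    "SmallBags": [10.0, np.nan, 30.0, np.nan],
})
summary = null_summary(toy)
print(summary)
```

On the real data this would be called as `null_summary(Avo_df_copy)`.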


In [26]:
def count_duplicate_rows(Avo_df_copy):
    """
    Count the number of duplicate rows in a DataFrame.

    This function calculates the total number of duplicate rows in the DataFrame by calling the `duplicated` method,
    which marks duplicates as `True`, and then sums these cases.

    Parameters:
    Avo_df_copy (pandas.DataFrame): The DataFrame to check for duplicates.

    Returns:
    int: The count of duplicate rows.
    """
    duplicate_count = Avo_df_copy.duplicated().sum()
    return duplicate_count

In [27]:
count_duplicate_rows(Avo_df_copy)

0

In [28]:
print(f'Is there missing data in the columns?\n{Avo_df_copy.isna().any()}')

Is there missing data in the columns?
Date            False
AveragePrice    False
TotalVolume     False
plu4046         False
plu4225         False
plu4770         False
TotalBags       False
SmallBags        True
LargeBags        True
XLargeBags       True
type            False
region          False
dtype: bool


In [29]:
# Identifying outliers based on IQR in 'AveragePrice' column
Q1 = Avo_df_copy['AveragePrice'].quantile(0.25)
Q3 = Avo_df_copy['AveragePrice'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_price = (Avo_df_copy['AveragePrice'] < lower_bound) | (Avo_df_copy['AveragePrice'] > upper_bound)

# Print the number of outliers
print(f"Number of outliers in AveragePrice: {outliers_price.sum()}")

Number of outliers in AveragePrice: 358
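
The IQR rule used above flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. A minimal sketch that wraps the same rule in a reusable function is shown here, so it can be applied to other numeric columns. `iqr_outlier_mask` and the multiplier argument `k` are hypothetical names, and the toy prices are made up for illustration.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value is an IQR outlier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return (series < lower_bound) | (series > upper_bound)


# Toy prices (made-up values): only the last one lies far from the rest.
toy_prices = pd.Series([1.0, 1.1, 1.2, 1.3, 1.4, 5.0])
mask = iqr_outlier_mask(toy_prices)
print(f"Number of outliers: {mask.sum()}")
```

On the real data, `iqr_outlier_mask(Avo_df_copy['AveragePrice']).sum()` should reproduce the count printed above, since it applies the same bounds.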


In [30]:
# Fill missing bag counts with the mean of each column.
# fillna computes each mean once and only touches the null cells,
# instead of recomputing the means for every row in a loop.
bag_columns = ['SmallBags', 'LargeBags', 'XLargeBags']
Avo_df_copy[bag_columns] = Avo_df_copy[bag_columns].fillna(Avo_df_copy[bag_columns].mean())

# Verify that there are no more missing values
print(Avo_df_copy[['TotalBags', 'SmallBags', 'LargeBags', 'XLargeBags']].isnull().sum())

TotalBags     0
SmallBags     0
LargeBags     0
XLargeBags    0
dtype: int64


In [31]:
print(f'Is there missing data in the columns?\n{Avo_df_copy.isna().any()}')

Is there missing data in the columns?
Date            False
AveragePrice    False
TotalVolume     False
plu4046         False
plu4225         False
plu4770         False
TotalBags       False
SmallBags       False
LargeBags       False
XLargeBags      False
type            False
region          False
dtype: bool


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
