# IS6061_PYTHON_ASSIGNMENT_GROUP_PROJECT

## GROUP 26  
### GROUP MEMBERS  
<div style="text-align: left;">
    
| Name                   | ID         | 
|:-----------------------|:-----------|
| **Praveen Mangawa**  | 124114478  | 
| **Mrunal Howale**      | 124112093  | 
| **Aida George**        | 124109805  | 
| **Shakirat Ezinne Muibi** | 123102416  | 
| **Dion Vincent Chettiar** | 124112586  | 
| **Dyuti Paresh Kamdar**   | 124101898  | 

</div>

## Project Approach and Workflow

In this group project, the **quarterly_waste_generation.csv** and **quarterly_waste_treatment.csv** datasets are analyzed in depth. The objective is to merge them on their common fields (**Quarter**, **County**, **Waste Type**, **Waste Category**, and **Waste Amount (tonnes)**) to create a single, cohesive dataset. The project follows the data analytics lifecycle: data import, cleansing and aggregation, **Exploratory Data Analysis (EDA)**, and insight generation.

We will investigate different Python modules to process the data, deal with outliers and missing values, generate calculated fields, and display the data throughout the project. The EDA phase will fix any problems that arose during the cleaning and merging process and assist in identifying trends or patterns in the data. This report will give a thorough explanation of each phase, including the justification for the methodologies used, the difficulties encountered, and the analysis's ultimate conclusions.


### Importing Necessary Libraries   
The libraries required for managing, manipulating, and visualizing data are imported in this part.

   NumPy: (Numpy, 2024)    
   Pandas: (Pandas, 2018)  
   Seaborn: (seaborn, 2012)  
   Matplotlib: (Matplotlib, 2012)  
   CSV (Python module): (Python Software Foundation, 2020)  
   Regular Expressions (re module): (Python, 2009)

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import csv
import re
import os

In [44]:
# Read the CSV file containing quarterly waste generation data into a Pandas DataFrame
wgen=pd.read_csv("quarterly_waste_generation.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'quarterly_waste_generation.csv'

In [None]:
# Read the CSV file containing quarterly waste treatment data into a Pandas DataFrame
wtreat=pd.read_csv("quarterly_waste_treatment.csv")

# 1. Dataset Description and Issues

## Dataset Description
The two datasets, **quarterly_waste_generation.csv** and **quarterly_waste_treatment.csv**, were merged on the columns `Quarter`, `County`, `Waste Type`, `Waste Category`, and `Waste Amount (tonnes)` to create a unified dataset.
(Pandas.merge — Pandas 1.2.3 Documentation)

In [None]:
merge_data = pd.merge(wgen, wtreat, on=['Quarter', 'County','Waste Type', 'Waste Category','Waste Amount (tonnes)'], how='outer')
merge_data.head(100)
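
Because the merge keys are still raw, uncleaned text at this point, an outer join keeps every row but only pairs rows whose key values match exactly. The sketch below uses made-up toy data (not the project files) and the `indicator=True` option of `pd.merge` to count how many rows actually matched. The frame names `gen` and `treat` and all values are illustrative assumptions.

```python
import pandas as pd

# Toy frames with illustrative values (not taken from the project files)
gen = pd.DataFrame({
    "Quarter": ["2004Q1", "2004Q2"],
    "County": ["Cork", "Kerry"],
    "Waste Amount (tonnes)": ["100", "250"],
})
treat = pd.DataFrame({
    "Quarter": ["2004Q1", "2004Q2"],
    "County": ["Cork", "Kerry"],
    "Waste Amount (tonnes)": ["100", "250.0"],  # same amount, different text
    "Treatment Method": ["Recovery - Recycling", "Disposal - Landfill"],
})

keys = ["Quarter", "County", "Waste Amount (tonnes)"]
merged = pd.merge(gen, treat, on=keys, how="outer", indicator=True)

# 'both' = key found in both frames; 'left_only' / 'right_only' = unmatched rows
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

A large share of `left_only` and `right_only` rows would suggest that formatting differences in the key columns are preventing matches, which is one possible explanation for many missing values after an outer merge.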

The columns of the merged dataset were renamed for consistency and clarity (e.g., `Waste Amount (tonnes)` to `waste_amount_in_tonnes`).

### New Column Names
1. **Quarter:** Time period for waste generation/treatment.
2. **county:** Geographical region in Ireland where waste was recorded.
3. **waste_type:** Classification as Hazardous or Non-Hazardous.
4. **waste_category:** Specific type of waste (e.g., Chemical Waste, Agricultural Waste).
5. **waste_amount_in_tonnes:** Quantity of waste generated or treated.
6. **treatment_method:** Process used for treating waste.
7. **price_of_treatment:** Cost per tonne for treatment.

(Pandas.DataFrame.rename — Pandas 1.4.2 Documentation, n.d.)  
(Pandas.DataFrame.nunique — Pandas 1.3.4 Documentation, n.d.)

In [None]:
merge_data.rename(columns={'Waste Amount (tonnes)': 'waste_amount_in_tonnes', 'Quarter': 'Quarter', 'County':'county', 'Waste Type' : 'waste_type', 'Waste Category': 'waste_category', 'Treatment Method':'treatment_method','Price of Treatment (€ per tonne)':'price_of_treatment'},inplace=True)
merge_data.nunique()

In [None]:
merge_data.info()

### Observed Issues
1. **Date Inconsistencies:** The `Quarter` column contains inconsistent date formats.
2. **Irregular Text Formats:** Variations in capitalization and special characters were present in `treatment_method`, `county` and `waste_category`.
3. **Missing Data:** Null values were present in all the columns.
4. **Potential Duplicates:** There were duplicate rows in the dataset.
5. **Outliers:** Possible extreme values in `waste_amount_in_tonnes` and `price_of_treatment`.

# 2. Data Cleaning

### A. Standardizing Date Formats
- A custom function `normalize_date` was implemented to unify date formats in the `Quarter` column.
  
- Formats handled include:
  - `"YYYY Qn"` (e.g., "2004 Q1")
  - `"YYYYQn"` (e.g., "2004Q1")
  - `"dd-MM-YYYY"` (e.g., "01-01-2004")

(Pandas.to_datetime — Pandas 1.3.4 Documentation, n.d.)  
(Pandas.NaT — Pandas 2.2.3 Documentation, 2024)  
(GeeksforGeeks, 2018)

In [None]:
# Function to normalize date formats
def normalize_date(date_str):
    if isinstance(date_str, str):
        date_str = date_str.strip()  # Remove any leading/trailing whitespace

        # Explicitly check for 'dd-Mon-yy' format (e.g., '01-Jan-04')
        try:
            return pd.to_datetime(date_str, format='%d-%b-%y', dayfirst=True)
        except ValueError:
            pass  # If it fails, continue to other formats

        # Check for 'YYYYQn', 'YYYY Qn' or 'Qn YYYY' formats (e.g., '2004Q1', '2004 Q1' or 'Q1 2004')
        match = re.match(r'(\d{4})\s*Q([1-4])|Q([1-4])\s*(\d{4})', date_str)
        if match:
            year = int(match.group(1) or match.group(4))
            quarter = int(match.group(2) or match.group(3))
            # Map to the first day of the quarter (Q1 -> January, Q2 -> April, Q3 -> July, Q4 -> October)
            return pd.Timestamp(year=year, month=3 * (quarter - 1) + 1, day=1)

        # Check for 'Month dd, yyyy' format (e.g., 'January 01, 2004')
        try:
            return pd.to_datetime(date_str, format='%B %d, %Y', dayfirst=True)
        except ValueError:
            pass

        # Check for 'Month dd yyyy' format (e.g., 'January 01 2004')
        try:
            return pd.to_datetime(date_str, format='%B %d %Y', dayfirst=True)
        except ValueError:
            pass

        # Check for 'dd-Mon-yyyy' format (e.g., '01-Jan-2004')
        try:
            return pd.to_datetime(date_str, format='%d-%b-%Y', dayfirst=True)
        except ValueError:
            pass

        # Check for 'dd-mm-yyyy' format (e.g., '01-01-2004')
        try:
            return pd.to_datetime(date_str, format='%d-%m-%Y', dayfirst=True)
        except ValueError:
            pass

        # Check for 'dd/mm/yyyy' format
        try:
            return pd.to_datetime(date_str, format='%d/%m/%Y', dayfirst=True)
        except ValueError:
            pass

        # Check for 'mm/dd/yyyy' format
        try:
            return pd.to_datetime(date_str, format='%m/%d/%Y', dayfirst=False)
        except ValueError:
            pass

        # Fallback for any remaining date strings
        try:
            return pd.to_datetime(date_str)
        except ValueError:
            return pd.NaT  # Return NaT if all parsing attempts fail
    else:
        return pd.NaT  # Return NaT for non-string entries

# Normalize the 'Quarter' column to a single date format
merge_data['Quarter'] = merge_data['Quarter'].apply(normalize_date)

# Ensure the 'Quarter' column contains only datetime objects
merge_data['Quarter'] = pd.to_datetime(merge_data['Quarter'], errors='coerce')

# Replace specific dates with corresponding quarters
def replace_with_quarters(date):
    if pd.isnull(date):  # Handle NaT values
        return date
    elif date.day == 1 and date.month == 1:
        return f"Q1 {date.year}"
    elif date.day == 1 and date.month == 4:
        return f"Q2 {date.year}"
    elif date.day == 1 and date.month == 7:
        return f"Q3 {date.year}"
    elif date.day == 1 and date.month == 10:
        return f"Q4 {date.year}"
    else:
        return date.strftime('%d/%m/%Y')  # Retain other dates in 'dd/mm/yyyy' format

# Apply the quarter replacement function
merge_data['Quarter'] = merge_data['Quarter'].apply(replace_with_quarters)

# Display the updated DataFrame
print("\nNormalized Data:")
print(merge_data.head(10))
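
The two-step approach above (parse to a timestamp, then relabel quarter starts) can also be sketched as a single helper that returns the quarter label directly. This is a minimal sketch, not the function used in this notebook: the name `to_quarter_label` is hypothetical, and unlike `replace_with_quarters` it assigns every parsable date to its quarter instead of keeping non-quarter-start dates in `dd/mm/yyyy` form.

```python
import re
import pandas as pd

# Matches 'YYYYQn', 'YYYY Qn' and 'Qn YYYY' (case-insensitive)
QUARTER_RE = re.compile(r"^(?:(\d{4})\s*Q([1-4])|Q([1-4])\s*(\d{4}))$", re.IGNORECASE)

def to_quarter_label(value):
    """Return 'Qn YYYY' for a quarter string or a date string, else None."""
    if not isinstance(value, str):
        return None
    text = value.strip()
    match = QUARTER_RE.match(text)
    if match:
        year = int(match.group(1) or match.group(4))
        quarter = int(match.group(2) or match.group(3))
        return f"Q{quarter} {year}"
    # Assumption: remaining values are day-first dates such as '01-04-2004'
    ts = pd.to_datetime(text, dayfirst=True, errors="coerce")
    if pd.isna(ts):
        return None
    return f"Q{ts.quarter} {ts.year}"

samples = ["2004 Q1", "2004Q3", "Q4 2005", "01-04-2004", None]
labels = [to_quarter_label(v) for v in samples]
print(labels)
```

Whether collapsing mid-quarter dates into a quarter label is acceptable depends on the data, so treat this as one option and not a drop-in replacement.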

In [None]:
merge_data.head(50)

### B. Standardizing Text Fields
- Removed redundant prefixes like "county" and "co." from the `county` column using regex.
  
- Capitalized the values in `county`
  
- Listed the unique values in `waste_category`, `waste_type` and `county` with the `unique()` method. This shows the range of values in each categorical column, reveals missing values (NaN appears in the output), and exposes problems in the values themselves (e.g. inconsistent wording, or unwanted characters such as "." between words).
- Removed the characters "." and "_", stripped leading and trailing spaces, and converted `waste_category` to lower case.
- Used a mapping dictionary to correct incomplete and inconsistent entries in `waste_category`, such as "Waste Agricultural" to "Agricultural Waste".

(Pandas.Series.str.replace — Pandas 2.0.3 Documentation, n.d.)  
(Pandas.Series.str.strip — Pandas 2.0.0 Documentation, n.d.)

In [None]:
# Remove 'county' and 'co' (case-insensitive) from all values in the 'County' column
merge_data['county'] = merge_data['county']\
    .str.replace(r'\bcounty\b|\bco\b|\.', '', case=False, regex=True)\
    .str.strip()

In [None]:
merge_data.head(50)

In [None]:
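# Remove '_' and '.' from 'waste_category', then trim whitespace and convert to lower case
# regex=False treats '.' as a literal dot (as a regex pattern it would match every character)
merge_data['waste_category'] = merge_data['waste_category'].str.replace('_', '', regex=False).str.replace('.', '', regex=False).str.strip().str.lower()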
merge_data.head(25)

In [None]:
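# Drop the 'hazardous', 'non-' and '(e-waste)' qualifiers (values are already lower case)
# regex=False keeps the parentheses in '(e-waste)' literal instead of a regex group
merge_data['waste_category'] = merge_data['waste_category'].str.replace('hazardous', '', regex=False).str.replace('non-', '', regex=False).str.replace('(e-waste)', '', regex=False).str.strip()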

In [None]:
merge_data.head(20)

In [None]:
# Unique values in each column individually
unique_county = merge_data['county'].unique()
unique_waste_type = merge_data['waste_type'].unique()
unique_waste_category = merge_data['waste_category'].unique()

# Print the results
print("Unique values in 'County':", unique_county)
print("Unique values in 'Waste Type':", unique_waste_type)
print("Unique values in 'Waste Category':", unique_waste_category)


(Pandas.DataFrame.replace — Pandas 1.2.4 Documentation, n.d.)  
(Working with Missing Data — Pandas 1.5.1 Documentation, n.d.)  
(Pandas.DataFrame.map — Pandas 2.2.2 Documentation, n.d.)

In [None]:
category_mapping = {
    'waste industrial': 'Industrial Waste',
    'i n d u s t r i a l  h a z a r d o u s  w a s t e': 'Industrial Waste',
    'm e d i c a l  w a s t e': 'Medical Waste',
    'medicalwaste': 'Medical Waste',
    np.nan: np.nan,  # Keep NaN as is
    'organic waste': 'Organic Waste',
    'industrial  waste': 'Industrial Waste',
    'chemical waste': 'Chemical Waste',
    'municipal solid waste': 'Municipal Solid Waste',
    'industrialwaste': 'Industrial Waste',
    'electronicwaste': 'Electronic Waste',
    'medical waste': 'Medical Waste',
    'm u n i c i p a l  s o l i d  w a s t e': 'Municipal Solid Waste',
    'industrial waste': 'Industrial Waste',
    'construction and demolition waste': 'Construction And Demolition Waste',
    'a g r i c u l t u r a l  w a s t e': 'Agricultural Waste',
    'c h e m i c a l  w a s t e': 'Chemical Waste',
    'i n d u s t r i a l  n o n -h a z a r d o u s  w a s t e': 'Industrial Waste',
    'agricultural waste': 'Agricultural Waste',
    'organicwaste': 'Organic Waste',
    'chemicalwaste': 'Chemical Waste',
    'and waste construction demolition': 'Construction And Demolition Waste',
    'electronic waste': 'Electronic Waste',
    'agriculturalwaste': 'Agricultural Waste',
    'waste and construction demolition': 'Construction And Demolition Waste',
    'c o n s t r u c t i o n  a n d  d e m o l i t i o n  w a s t e': 'Construction And Demolition Waste',
    'municipal waste solid': 'Municipal Solid Waste',
    'waste and demolition construction': 'Construction And Demolition Waste',
    'constructionanddemolitionwaste': 'Construction And Demolition Waste',
    'waste demolition and construction': 'Construction And Demolition Waste',
    'waste organic': 'Organic Waste',
    'electronic  waste': 'Electronic Waste',
    'waste medical': 'Medical Waste',
    'e l e c t r o n i c  w a s t e  (e -w a s t e )': 'Electronic Waste',
    'waste agricultural': 'Agricultural Waste',
    'waste electronic': 'Electronic Waste',
    'waste municipal solid': 'Municipal Solid Waste',
    'municipalsolidwaste': 'Municipal Solid Waste',
    'waste chemical': 'Chemical Waste',
    'waste construction demolition and': 'Construction And Demolition Waste',
    'o r g a n i c  w a s t e': 'Organic Waste',
    'demolition waste and construction': 'Construction And Demolition Waste',
    'waste solid municipal': 'Municipal Solid Waste',
    'construction waste and demolition': 'Construction And Demolition Waste',
    'solid waste municipal': 'Municipal Solid Waste',
    'and construction demolition waste': 'Construction And Demolition Waste',
    'construction demolition waste and': 'Construction And Demolition Waste',
    'and waste demolition construction': 'Construction And Demolition Waste',
    'solid municipal waste': 'Municipal Solid Waste',
    'construction waste demolition and': 'Construction And Demolition Waste',
    'and demolition waste construction': 'Construction And Demolition Waste',
    'and construction waste demolition': 'Construction And Demolition Waste',
    'demolition construction waste and': 'Construction And Demolition Waste',
    'demolition waste construction and': 'Construction And Demolition Waste',
    'waste  electronic': 'Electronic Waste',
    'construction and waste demolition': 'Construction And Demolition Waste',
    'construction demolition and waste': 'Construction And Demolition Waste',
    'waste demolition construction and': 'Construction And Demolition Waste',
    'demolition and waste construction': 'Construction And Demolition Waste',
    'demolition and construction waste': 'Construction And Demolition Waste',
    'demolition construction and waste': 'Construction And Demolition Waste',
    'waste construction and demolition': 'Construction And Demolition Waste',
    'and demolition construction waste': 'Construction And Demolition Waste'
}

In [None]:
merge_data['waste_category'] = merge_data['waste_category'].replace(category_mapping)
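
Most entries in the mapping above differ from the target name only in word order or spacing. As an alternative to listing every variant, the sketch below matches values by their sorted letters, which ignores case, spacing, punctuation and word order. This is a swapped-in technique, not the method used in this notebook. The names `letter_key` and `canonical_category` are hypothetical, and the approach assumes that no two category names are anagrams of each other.

```python
import re
import pandas as pd

# Target names taken from the values of the mapping dictionary above
CANONICAL = [
    "Agricultural Waste", "Chemical Waste", "Construction And Demolition Waste",
    "Electronic Waste", "Industrial Waste", "Medical Waste",
    "Municipal Solid Waste", "Organic Waste",
]

def letter_key(text):
    """Sorted letters of the text, ignoring case, spacing and punctuation."""
    return "".join(sorted(re.sub(r"[^a-z]", "", text.lower())))

KEY_TO_NAME = {letter_key(name): name for name in CANONICAL}

def canonical_category(value):
    """Return the canonical name, or None when the value matches no category."""
    if not isinstance(value, str):
        return None
    return KEY_TO_NAME.get(letter_key(value))

raw = pd.Series([
    "waste agricultural", "medicalwaste", "m e d i c a l  w a s t e",
    "and waste construction demolition", "unknown sludge",
])
cleaned = raw.map(canonical_category)
print(cleaned)
```

Values that match no canonical name come back as missing and would still need manual review or an explicit mapping entry, for example variants that keep an extra word such as "hazardous".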

In [None]:
county_mapping = {
    'nan': np.nan,  # String 'nan'
    np.nan: np.nan  # Explicit NaN
}

waste_type_mapping = {
    np.nan: np.nan  # Explicit NaN
}

waste_category_mapping = {
    'nan': np.nan,          # String 'nan'
    'n a n': np.nan,        # String 'n a n'
}

# Apply mappings
merge_data['county'] = merge_data['county'].replace(county_mapping)
merge_data['waste_type'] = merge_data['waste_type'].replace(waste_type_mapping)
merge_data['waste_category'] = merge_data['waste_category'].replace(waste_category_mapping)

In [None]:
merge_data.head(50)

In [None]:
# Unique values in each column individually
unique_county = merge_data['county'].unique()
unique_waste_type = merge_data['waste_type'].unique()
unique_waste_category = merge_data['waste_category'].unique()

# Print the results
print("Unique values in 'County':", unique_county)
print("Unique values in 'Waste Type':", unique_waste_type)
print("Unique values in 'Waste Category':", unique_waste_category)

### C. Handling Numeric Data
- Converted `waste_amount_in_tonnes` to numeric formats.  

- Removed non-numeric characters (e.g., commas) since the  `waste_amount_in_tonnes` column is a numerical field.

- Invalid entries, which could cause errors or inaccuracies during analysis, were converted to NaN (Not a Number) so that missing or invalid values can be located and handled easily in pandas.
(Pandas.to_numeric — Pandas 1.4.2 Documentation, n.d.)  
(Pandas.Series.str.replace — Pandas 2.0.3 Documentation, n.d.)


In [None]:
merge_data['waste_amount_in_tonnes'] = pd.to_numeric(
    merge_data['waste_amount_in_tonnes']
    .str.replace(r'[^\d.,]', '', regex=True) # Remove non-numeric characters
    .str.replace(',', ''), # Remove commas
    errors='coerce'  # Convert invalid entries to NaN
)
merge_data
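
To illustrate what the conversion above does to individual values, here is a small sketch on a made-up Series (the sample strings are assumptions, not values from the dataset). It applies the same two replacements and then `pd.to_numeric` with `errors='coerce'`.

```python
import pandas as pd

# Made-up examples of messy amount strings
raw_amounts = pd.Series(["1,234.5", " 12 tonnes", "abc", None])

cleaned_amounts = pd.to_numeric(
    raw_amounts
    .str.replace(r"[^\d.,]", "", regex=True)  # keep digits, dots and commas only
    .str.replace(",", "", regex=False),       # drop thousands separators
    errors="coerce",                          # anything unparsable becomes NaN
)
print(cleaned_amounts)
```

The entry "abc" is reduced to an empty string and therefore becomes NaN, as does the missing value. Note that this treats every comma as a thousands separator, so a decimal comma such as "12,5" would be read as 125.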

- Removed the characters "." and "_" and stripped leading and trailing spaces in `treatment_method`.  
- Converted the values in `treatment_method` to lower case for uniformity.
(Pandas.Series.str.lower — Pandas 2.2.2 Documentation, n.d.) 

In [None]:
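# Remove '_' and '.' from 'treatment_method', then trim whitespace and convert to lower case
# regex=False treats '.' as a literal dot (as a regex pattern it would match every character)
merge_data['treatment_method'] = merge_data['treatment_method'].str.replace('_', '', regex=False).str.replace('.', '', regex=False).str.strip().str.lower()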
merge_data

- Listed the unique values in the `treatment_method` column to detect anomalies, unexpected values and inconsistent formatting (e.g. "disposal - other").
- Cleaned and standardized the values in the `treatment_method` column using a dictionary mapping to remove inconsistencies (e.g. "- landfill disposal" to "Disposal-Landfill").

In [None]:
merge_data['treatment_method'].unique()

In [None]:
treatment_mapping = {
    'disposal - other': 'Disposal-Other',
    'd i s p o s a l  - o t h e r': 'Disposal-Other',
    'other - disposal': 'Disposal-Other',
    '- other disposal': 'Disposal-Other',
    'disposal other -': 'Disposal-Other',
    '- disposal other': 'Disposal-Other',
    'other disposal -': 'Disposal-Other',
    'disposal-other': 'Disposal-Other',
    'disposal - incineration': 'Disposal-Incineration',
    'incineration disposal -': 'Disposal-Incineration',
    'd i s p o s a l  - i n c i n e r a t i o n': 'Disposal-Incineration',
    '- incineration disposal': 'Disposal-Incineration',
    'disposal-incineration': 'Disposal-Incineration',
    'incineration - disposal': 'Disposal-Incineration',
    '- disposal incineration': 'Disposal-Incineration',
    'disposal incineration -': 'Disposal-Incineration',
    'disposal-landfill': 'Disposal-Landfill',
    'landfill - disposal': 'Disposal-Landfill',
    'disposal - landfill': 'Disposal-Landfill',
    '- landfill disposal': 'Disposal-Landfill',
    'landfill disposal -': 'Disposal-Landfill',
    'd i s p o s a l  - l a n d f i l l': 'Disposal-Landfill',
    '- disposal landfill': 'Disposal-Landfill',
    'disposal landfill -': 'Disposal-Landfill',
    'recovery - recycling': 'Recovery-Recycling',
    'recovery-recycling': 'Recovery-Recycling',
    '- recovery recycling': 'Recovery-Recycling',
    'r e c o v e r y  - r e c y c l i n g': 'Recovery-Recycling',
    'recycling - recovery': 'Recovery-Recycling',
    '- recycling recovery': 'Recovery-Recycling',
    'recovery recycling -': 'Recovery-Recycling',
    'recycling recovery -': 'Recovery-Recycling',
    'recovery - composting': 'Recovery-Composting',
    '- composting recovery': 'Recovery-Composting',
    'r e c o v e r y  - c o m p o s t i n g': 'Recovery-Composting',
    'recovery-composting': 'Recovery-Composting',
    'composting - recovery': 'Recovery-Composting',
    'recovery composting -': 'Recovery-Composting',
    '- recovery composting': 'Recovery-Composting',
    'composting recovery -': 'Recovery-Composting',
    'recovery - energy recovery': 'Recovery-Energy Recovery',
    'recovery energy recovery -': 'Recovery-Energy Recovery',
    'r e c o v e r y  - e n e r g y  r e c o v e r y': 'Recovery-Energy Recovery',
    'recovery - recovery energy': 'Recovery-Energy Recovery',
    'recovery recovery energy -': 'Recovery-Energy Recovery',
    '- energy recovery recovery': 'Recovery-Energy Recovery',
    '- recovery recovery energy': 'Recovery-Energy Recovery',
    '- recovery energy recovery': 'Recovery-Energy Recovery',
    'recovery recovery - energy': 'Recovery-Energy Recovery',
    'energy recovery recovery -': 'Recovery-Energy Recovery',
    'energy - recovery recovery': 'Recovery-Energy Recovery',
    'energy recovery - recovery': 'Recovery-Energy Recovery',
    'recovery-energyrecovery': 'Recovery-Energy Recovery',
    'recovery energy - recovery': 'Recovery-Energy Recovery',
    np.nan: np.nan,  # Keep NaN values as NaN
    'nan': np.nan,
    'n a n': np.nan}

In [None]:
merge_data['treatment_method'] = merge_data['treatment_method'].replace(treatment_mapping)
merge_data.head(50)

In [None]:
unique_entries=merge_data['treatment_method'].unique()
for entry in unique_entries:
    print(entry)

(Pandas.Series.value_counts — Pandas 1.3.4 Documentation, n.d.)

In [None]:
value=merge_data['treatment_method'].value_counts()
print(value)


- Converted `price_of_treatment` to numeric formats using a custom function `convert_to_float` to clean the column.

- Removed non-numeric characters (e.g., commas, € signs)
  
### Outcome
- Uniform text values across categories for consistent grouping which will make analysis easier.
- Accurate numeric values for analysis.
- Removed redundancy in text format.

(Pandas.DataFrame.round — Pandas 1.3.4 Documentation, n.d.)

In [None]:
# Function to extract and convert price to float
def convert_to_float(value):
    if isinstance(value, str):  # Process only if the value is a string
        # Extract numeric value from the string
        numeric_value = re.findall(r'[\d\.,]+', value)
        
        if numeric_value:
            # Clean the numeric string (remove commas and extra spaces)
            cleaned_value = numeric_value[0].replace(',', '').replace('€', '').strip()
            
            # Convert the cleaned string to float
            return float(cleaned_value)
        else:
            return None  # If no valid numeric value is found
    elif isinstance(value, (int, float)):  # If it's already numeric, return it as is
        return value
    else:
        return None  # Handle any other unexpected types

# Apply the function to the "price_of_treatment" column
merge_data['price_of_treatment'] = merge_data['price_of_treatment'].apply(convert_to_float)

merge_data['price_of_treatment'] = merge_data['price_of_treatment'].round(2)

# Display the updated DataFrame
print(merge_data)



# 3. Handling Missing Values

### A. Identifying Columns with Missing Values:
1. **treatment_method**: 21,950 missing values.
2. **price_of_treatment**: 19,656 missing values.
3. **waste_amount_in_tonnes**: 3,277 missing values.
4. **Quarter**: 3,342 missing values.
5. **county**: 3,269 missing values.
6. **waste_type**: 3,312 missing values.
7. **waste_category**: 3,242 missing values.

(Pandas DataFrame Isnull() Method, n.d.)  
(Pandas DataFrame Sum() Method, n.d.)

In [None]:
print(merge_data.isnull().sum())

### B. Strategies Adopted to deal with the Missing Values:
- **Forward Fill**: Filled missing values in categorical fields such as `Quarter`, `county`, `waste_type` and `waste_category` using `ffill`.
- **Forward Fill** and **Backward Fill**: Filled the missing values in `treatment_method` with both methods because missing values were present in the first and last rows. A forward fill cannot fill a gap at the start of the column, so a backward fill covers it.
  
- **Imputation:**
  - Median for numerical fields (e.g., `price_of_treatment` and `waste_amount_in_tonnes`) to prevent influence from extreme values.
 

(Pandas DataFrame Fillna() Method, n.d.)  
(Python Statistics.median() Method, n.d.)

In [None]:
merge_data['Quarter'] = merge_data['Quarter'].ffill()  # Forward fill
merge_data

In [None]:
# Forward fill the categorical columns
merge_data['county'] = merge_data['county'].ffill()
merge_data['waste_type'] = merge_data['waste_type'].ffill()
merge_data['waste_category'] = merge_data['waste_category'].ffill()
# Forward fill, then backward fill to cover a gap in the first row
merge_data['treatment_method'] = merge_data['treatment_method'].ffill()
merge_data['treatment_method'] = merge_data['treatment_method'].bfill()
merge_data

In [None]:
merge_data['waste_amount_in_tonnes'] = merge_data['waste_amount_in_tonnes'].fillna(merge_data['waste_amount_in_tonnes'].median())
merge_data['price_of_treatment'] = merge_data['price_of_treatment'].fillna(merge_data['price_of_treatment'].median())
merge_data

In [None]:
print(merge_data.isnull().sum())
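
The effect of the three fill strategies can be seen on a small made-up example (the values below are illustrative assumptions, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Categorical column with gaps at the start, middle and end
methods = pd.Series([np.nan, "Recovery-Recycling", np.nan, "Disposal-Landfill", np.nan])
forward_only = methods.ffill()       # the leading gap stays empty
both = methods.ffill().bfill()       # the backward fill covers the leading gap

# Numeric column with one extreme value
prices = pd.Series([10.0, np.nan, 30.0, 1000.0])
filled_prices = prices.fillna(prices.median())  # median is 30.0, mean would be about 346.67

print(forward_only.isna().sum(), both.isna().sum())
print(filled_prices.tolist())
```

Forward and backward fills copy values from neighbouring rows, so they implicitly assume that row order carries meaning. For a merged dataset that assumption is worth checking before relying on the filled values.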

### C. Removing Duplicates  
- Duplicate rows were removed because they can distort the analysis by giving redundant data extra weight.
- Removing them also improves the reliability of the dataset and of the visualizations built on it.

(Pandas DataFrame Duplicated() Method | Pandas Method, 2018) 
(Python | Pandas Dataframe.drop_duplicates(), 2018)


In [None]:
print(merge_data.duplicated().sum())

In [None]:
# Remove duplicate rows from the dataset
merge_data= merge_data.drop_duplicates()

In [None]:
print(merge_data.duplicated().sum())
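
As a brief illustration of the two calls used above, the sketch below runs them on a made-up three-row frame. `duplicated()` flags the second and later occurrences of a fully identical row, and `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Made-up frame: the first two rows are identical in every column
df = pd.DataFrame({
    "county": ["Cork", "Cork", "Kerry"],
    "waste_amount_in_tonnes": [100.0, 100.0, 250.0],
})

n_duplicates = int(df.duplicated().sum())                   # counts repeats, not the first copy
deduplicated = df.drop_duplicates().reset_index(drop=True)  # renumber rows after dropping

print(n_duplicates, len(deduplicated))
```

Because imputation was applied before this step, some rows may have become identical only after their gaps were filled, so the duplicate count can include rows that were distinct in the raw files.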

In [None]:
merge_data['waste_amount_in_tonnes'].describe()   # Summary statistics (including the mean) before handling outliers

# 4. Detecting Outliers in the Dataset
 Data points that substantially differ from the rest of the observations in a dataset are known as outliers. 
 Finding outliers is essential to preserving the integrity of the data and creating reliable models.

 ---
 
## A. Finding Outliers for Waste Amount in Tonnes

The following steps were used to identify outliers in the `waste_amount_in_tonnes` data.

---

### Steps to Identify Outliers

1. **Calculate Quartiles (Q1 and Q3)**:  
   The 25th percentile (Q1) and 75th percentile (Q3) were computed for the `waste_amount_in_tonnes` column.
   
2. **Calculate IQR**:  
   The **Interquartile Range (IQR)** was determined as the difference between Q3 and Q1.
   
3. **Determine Outlier Bounds**:  
   The outlier bounds were defined as:
   - **Lower Bound**: $Q1 - 1.5 \times IQR$
   - **Upper Bound**: $Q3 + 1.5 \times IQR$
   
4. **Identify Outliers**:  
   The outliers were identified as values below the lower bound or above the upper bound.

5. **Visualize Outliers**:  
   The outliers in the `waste_amount_in_tonnes` data were displayed using a box plot and a histogram.
   

---

(Pandas.DataFrame.quantile — Pandas 2.1.1 Documentation, n.d.)  
(Waskom, n.d.)  
(Matplotlib.pyplot.hist — Matplotlib 3.5.1 Documentation, n.d.)  
(GeeksforGeeks, 2021)  
(Zach, 2020) 


In [None]:
#Finding outliers for waste amount in tonnes
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = merge_data['waste_amount_in_tonnes'].quantile(0.25)
Q3 = merge_data['waste_amount_in_tonnes'].quantile(0.75)

# Calculate IQR
IQR_1 = Q3 - Q1

# Determine outlier bounds
lower_bound_1 = Q1 - 1.5 * IQR_1
upper_bound_1 = Q3 + 1.5 * IQR_1

# Identify outliers: values strictly below the lower bound or above the upper bound
outliers_1 = merge_data[(merge_data['waste_amount_in_tonnes'] < lower_bound_1) | (merge_data['waste_amount_in_tonnes'] > upper_bound_1)]

# Print outliers
print("Identified Outliers:")
print(outliers_1)
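
The IQR rule can also be worked through on a handful of numbers. The sketch below uses made-up values and a hypothetical helper named `iqr_bounds`. It relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one extreme value
values = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])

lower, upper = iqr_bounds(values)
# Q1 = 12.5, Q3 = 17.5, IQR = 5.0, so the bounds are 5.0 and 25.0
outliers = values[(values < lower) | (values > upper)]

print(lower, upper, outliers.tolist())
```

Only the value 100.0 falls outside the bounds here. The multiplier 1.5 is the conventional choice, and a larger `k` would flag fewer points.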

In [None]:
# Boxplot
sns.boxplot(data=merge_data, x='waste_amount_in_tonnes')
plt.title('Boxplot for Identifying Outliers')
plt.xlabel('Waste amount (tonnes)')
plt.show()

# Plot histogram
plt.hist(merge_data['waste_amount_in_tonnes'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram Before Handling Outliers')
plt.xlabel('Waste amount (tonnes)')
plt.ylabel('Frequency')
plt.show()
 

In [None]:
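# Filter data to exclude outliers based on specified bounds
# .copy() makes the result an independent DataFrame, avoiding SettingWithCopyWarning on later column assignments
merge_data = merge_data[(merge_data['waste_amount_in_tonnes'] >= lower_bound_1) & (merge_data['waste_amount_in_tonnes'] <= upper_bound_1)].copy()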
merge_data

(Pandas.DataFrame.describe — Pandas 0.20.2 Documentation, n.d.)  

In [None]:
merge_data['waste_amount_in_tonnes'].describe()

## B. Visualizations  
 


### Box Plot and Histogram for Waste Amount in Tonnes After Handling Outliers

Below are the visualizations showing the `waste_amount_in_tonnes` data after addressing outliers.

---

### Box Plot

The box plot displays the distribution of the waste amounts, indicating if any outliers remain after handling them. It visually shows the data's spread and identifies outliers as points outside the whiskers.

---

### Histogram

The histogram illustrates the frequency distribution of waste amounts, helping us understand how the data is distributed after handling outliers. It shows the number of occurrences within each bin and gives insights into the data's spread and central tendency.

---

These visualizations confirm that outliers have been addressed and provide a clearer view of the data's distribution. 

(Waskom, n.d.)  
(Matplotlib.pyplot.hist — Matplotlib 3.5.1 Documentation, n.d.)

In [None]:
# Plot the boxplot
sns.boxplot(data=merge_data, x='waste_amount_in_tonnes')
plt.title('Boxplot after fixing Outliers')
plt.xlabel('Waste amount (tonnes)')
plt.show()

# Plot histogram
plt.hist(merge_data['waste_amount_in_tonnes'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram after Handling Outliers')
plt.xlabel('Waste amount (tonnes)')
plt.ylabel('Frequency')
plt.show()
merge_data

In [None]:
merge_data['price_of_treatment'].describe()

## Finding Outliers for Price of Treatment

The following steps were used to identify outliers in the `price_of_treatment` data.

---

## A. Steps to Identify Outliers

1. **Calculate Quartiles (Q1 and Q3)**:  
   The 25th percentile (Q1) and 75th percentile (Q3) were computed for the `price_of_treatment` column.
   
2. **Calculate IQR**:  
   The **Interquartile Range (IQR)** was determined as the difference between Q3 and Q1.
   
3. **Determine Outlier Bounds**:  
   The outlier bounds were defined as:
   - **Lower Bound**: $Q1 - 1.5 \times IQR$
   - **Upper Bound**: $Q3 + 1.5 \times IQR$
   
4. **Identify Outliers**:  
   The outliers were identified as values below the lower bound or above the upper bound.

5. **Print Identified Outliers**:  
   The outliers in the `price_of_treatment` data were displayed.

---

In [None]:
#Finding outliers for price of treatment
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Qt1 = merge_data['price_of_treatment'].quantile(0.25)
Qt3 = merge_data['price_of_treatment'].quantile(0.75)

# Calculate IQR
IQR_2 = Qt3 - Qt1

# Determine outlier bounds
lower_bound_2 = Qt1 - 1.5 * IQR_2
upper_bound_2 = Qt3 + 1.5 * IQR_2

# Identify outliers: values strictly below the lower bound or above the upper bound
outliers_2 = merge_data[(merge_data['price_of_treatment'] < lower_bound_2) | (merge_data['price_of_treatment'] > upper_bound_2)]

# Print outliers
print("Identified Outliers:")
print(outliers_2)

# Boxplot
sns.boxplot(data=merge_data, x='price_of_treatment')
plt.title('Boxplot for Identifying Outliers')
plt.xlabel('Price of treatment (€ per tonne)')
plt.show()

# Plot histogram
plt.hist(merge_data['price_of_treatment'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram before Handling Outliers')
plt.xlabel('Price of treatment (€ per tonne)')
plt.ylabel('Frequency')
plt.show()

(Pandas.DataFrame.clip — Pandas 2.1.4 Documentation, n.d.)

In [None]:
## Clip price values to be within the specified bounds
merge_data['price_of_treatment'] = merge_data['price_of_treatment'].clip(lower=lower_bound_2, upper=upper_bound_2)
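
The two outlier treatments used in this notebook behave differently: filtering (used for `waste_amount_in_tonnes`) removes rows, while clipping (used for `price_of_treatment`) keeps every row and caps extreme values at the bounds. The sketch below contrasts them on made-up values, with bounds of 5 and 25 assumed for illustration.

```python
import pandas as pd

# Made-up prices with one extreme value; bounds are assumed for illustration
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])
lower, upper = 5.0, 25.0

filtered = prices[prices.between(lower, upper)]  # drops the extreme row
clipped = prices.clip(lower=lower, upper=upper)  # keeps the row, caps the value

print(len(filtered), filtered.max())
print(len(clipped), clipped.max())
```

Clipping preserves the row count, which matters when other columns in the same row are still useful. The capped values pile up at the bounds, so a histogram of clipped data can show a taller bar at either edge.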

## B. Visualizations  

### Box Plot and Histogram for Price of Treatment After Handling Outliers

Below are the visualizations showing the `price_of_treatment` data after addressing outliers.

---

### Box Plot

The box plot below visualizes the `price_of_treatment` column after clipping and makes it easy to check whether any values still fall outside the whiskers.

---

### Histogram

The histogram shows the frequency distribution of `price_of_treatment` after handling outliers, which helps in understanding how the clipped data is distributed.

Compared with the earlier plots, these visualizations help confirm that the outliers were handled and provide insight into the data's distribution.

In [None]:
# Plot the boxplot after handling outliers
sns.boxplot(data=merge_data, x='price_of_treatment')
plt.title('Boxplot after Handling Outliers')
plt.xlabel('Price of Treatment')
plt.show()

# Plot histogram after handling outliers
plt.hist(merge_data['price_of_treatment'], bins=10, color='skyblue', edgecolor='black')
plt.title('Histogram after Handling Outliers')
plt.xlabel('Price of Treatment')
plt.ylabel('Frequency')
plt.show()
merge_data

In [None]:
merge_data['price_of_treatment'].describe()

These visuals make the data before and after outlier treatment easy to compare. The box plots illustrate the presence and magnitude of outliers in the `waste_amount_in_tonnes` and `price_of_treatment` data, while the histograms display the distribution after treatment. Capping outliers gives a more balanced dataset with lower skewness, which supports more reliable analysis for decision-making.
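As a minimal sketch of the IQR-and-clip procedure described in this section, the example below applies the same steps to a hypothetical `prices` series. The values are invented for illustration and are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy prices; 500.0 is deliberately extreme
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 500.0],
                   name="price_of_treatment")

# Quartiles and interquartile range
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows flagged as outliers (strictly outside the fences)
outliers = prices[(prices < lower) | (prices > upper)]

# Clipping caps the extreme values instead of dropping the rows
clipped = prices.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}]")  # bounds: [8.5, 16.5]
print(outliers.tolist())              # [500.0]
print(clipped.tolist())
```

Clipping keeps every row, so the dataset size is unchanged; only the values beyond the fences are pulled back to the nearest bound.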

## 5. Description of the calculated fields

The **Total Treatment Cost** and **Treatment Efficiency** calculated fields were created to help organize and interpret the data. These measures provide a clearer picture of trends and comparisons in the data, which streamlined the study. During the **Exploratory Data Analysis (EDA)**, various visualizations were used to find patterns, identify anomalies, and understand relationships within the data. Bar charts made cross-category comparisons simple, box plots revealed outliers, and histograms showed how the data was distributed. These visual aids and calculations helped preserve the integrity of the data, and significant conclusions were drawn in support of the project's objectives.

### Total Treatment Cost Calculation
Large amounts are made easier to handle and understand by calculating the total treatment cost in millions. `waste_amount_in_tonnes` is multiplied by `price_of_treatment` and then divided by 1,000,000 to convert the cost into millions, which facilitates the analysis and interpretation of the data.
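To make the arithmetic concrete, here is a small sketch on a hypothetical two-row frame. The numbers are invented, and the price is assumed to be a cost per tonne, which the dataset description does not confirm.

```python
import pandas as pd

# Hypothetical rows; column names follow the merged dataset
toy = pd.DataFrame({
    "waste_amount_in_tonnes": [8000.0, 10500.0],
    "price_of_treatment": [150.0, 220.0],
})

# amount * price gives the raw cost; dividing by 1,000,000 expresses it in millions
toy["total_treat_cost_millions"] = (
    toy["waste_amount_in_tonnes"] * toy["price_of_treatment"] / 1_000_000
).round(2)

print(toy)
```

For the first row, 8,000 × 150 = 1,200,000, which is stored as 1.2 (millions).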


In [None]:
# total_treat_cost_millions is expressed in millions
merge_data['total_treat_cost_millions'] = (merge_data['waste_amount_in_tonnes'] * merge_data['price_of_treatment']) / 1000000
merge_data['total_treat_cost_millions'] = merge_data['total_treat_cost_millions'].round(2)
merge_data

### Treatment Efficiency Calculation
Treatment efficiency is each row's `waste_amount_in_tonnes` expressed as a percentage of the dataset's maximum waste amount, rounded to two decimal places.
(Pandas.DataFrame — Pandas 0.25.3 Documentation, 2020)
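A short sketch of this ratio on hypothetical amounts (invented values, not project data) shows how the field behaves:

```python
import pandas as pd

# Hypothetical waste amounts in tonnes
amounts = pd.Series([2500.0, 5000.0, 10000.0], name="waste_amount_in_tonnes")

# Each value as a percentage of the largest value in the column
efficiency = (amounts / amounts.max() * 100).round(2)

print(efficiency.tolist())  # [25.0, 50.0, 100.0]
```

By construction, the row holding the maximum amount always scores 100, and every other row is scaled relative to it.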

In [None]:
# Waste amount as a percentage of the dataset's maximum waste amount
merge_data['treatment_efficiency'] = (merge_data['waste_amount_in_tonnes'] / merge_data['waste_amount_in_tonnes'].max()) * 100
merge_data['treatment_efficiency'] = merge_data['treatment_efficiency'].round(2)
merge_data

## 6. Exploring the data visually using various graphs.

### A. Quarterly Waste Amount Trends by Waste Type
This line plot shows the trends in waste amounts across quarters, broken down by waste type. The data is displayed using **sns.lineplot**, with lines colored according to waste type for convenient comparison. The plot includes a title, axis labels, and a legend placed outside the axes for clarity. The layout is adjusted to accommodate the plot elements, and the x-axis labels are rotated for improved readability.
(Seaborn.lineplot — Seaborn 0.11.2 Documentation, n.d.)

In [None]:
# Set figure size
plt.figure(figsize=(14, 6))

# Plot waste trends by quarter and type
sns.lineplot(data=merge_data, x='Quarter', y='waste_amount_in_tonnes', hue='waste_type', errorbar=None)

# Title and labels
plt.title('Quarterly Waste Amount Trends by Waste Type', fontsize=16)
plt.xlabel('Quarter', fontsize=12)
plt.ylabel('Waste Amount (Tonnes)', fontsize=12)

# Adjust legend and ticks
plt.legend(title='Waste Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=90)

# Fit layout and show plot
plt.tight_layout()
plt.show()


Hazardous waste consistently exceeds non-hazardous waste, fluctuating around 9,500–10,500 tonnes, while non-hazardous waste ranges from 8,000–8,500 tonnes. Both types show irregular patterns without a clear trend, likely due to seasonal or operational factors.
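One point worth keeping in mind when reading this plot: when several rows share the same x value, `sns.lineplot` aggregates them with the mean by default, so each point is an average per quarter and waste type rather than a quarterly total. The sketch below reproduces that aggregation with pandas on hypothetical rows; the quarter labels and amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical records: several rows can share a quarter and waste type
toy = pd.DataFrame({
    "Quarter": ["2023Q1", "2023Q1", "2023Q1", "2023Q2", "2023Q2"],
    "waste_type": ["Hazardous", "Hazardous", "Non-Hazardous",
                   "Hazardous", "Non-Hazardous"],
    "waste_amount_in_tonnes": [9000.0, 11000.0, 8200.0, 9800.0, 8400.0],
})

# Mean per quarter and waste type, i.e. the values a default line plot would draw
quarterly_mean = (
    toy.groupby(["Quarter", "waste_type"])["waste_amount_in_tonnes"]
    .mean()
    .unstack("waste_type")
)

print(quarterly_mean)
```

Swapping `.mean()` for `.sum()` would give quarterly totals instead, which can rank the waste types differently when the number of records per type differs.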

### B. Waste Amount by Category and Type

This bar plot visualizes the waste amounts by category, differentiated by waste type. It uses **sns.barplot** to display the data, with the bars representing waste amounts for each category, and the legend indicating the waste types. The x-axis labels are rotated for readability, and the layout is adjusted for better spacing. 
(Seaborn.barplot — Seaborn 0.11.1 Documentation, n.d.) 

In [None]:
# Set the figure size for the plot
plt.figure(figsize=(14, 6))

# Create a barplot showing waste amount by category and type
sns.barplot(data=merge_data, x='waste_category', y='waste_amount_in_tonnes', hue='waste_type', errorbar=None)

# Set the title for the plot
plt.title('Waste Amount by Category and Type', fontsize=16)

# Label the x-axis and y-axis
plt.xlabel('Waste Category', fontsize=12)
plt.ylabel('Waste Amount (Tonnes)', fontsize=12)

# Adjust the legend position
plt.legend(title='Waste Type', bbox_to_anchor=(1.05, 1), loc='upper left')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Ensure layout fits into the figure area
plt.tight_layout()

# Show the plot
plt.show()


Hazardous waste dominates most categories, especially in chemical, industrial, and electronic waste, while non-hazardous waste closely matches in organic and construction waste. Overall, waste generation is balanced across both types with no drastic variations.

### C. Top 10 Counties by Total Waste Amount

This bar plot displays the ten counties with the highest total waste amounts. The data is grouped by county and sorted in descending order. The plot uses rotated x-axis labels for readability and the `plasma` palette for visual distinction, and the layout is adjusted for better spacing.
(Seaborn.barplot — Seaborn 0.11.1 Documentation, n.d.) 

In [None]:
# Get top 10 counties by waste amount
top_counties = merge_data.groupby('county')['waste_amount_in_tonnes'].sum().sort_values(ascending=False).head(10)

# Plot bar chart of top counties
plt.figure(figsize=(14, 6))
sns.barplot(x= top_counties.index, y=top_counties.values, palette='plasma',hue= top_counties, legend= False)

# Add title and labels
plt.title('Top 10 Counties by Total Waste Amount', fontsize=16)
plt.xlabel('County', fontsize=12)
plt.ylabel('Total Waste Amount (Tonnes)', fontsize=12)

# Rotate x-axis labels
plt.xticks(rotation=45)
plt.tight_layout()

# Show plot
plt.show()

The graph shows a relatively consistent waste generation among the top counties, with Wicklow slightly ahead. No sharp spikes or drastic drops are observed, indicating a similar waste production pattern across these counties.
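The ranking behind this chart is a group, sum, sort, and truncate pattern. A minimal sketch on hypothetical rows (the amounts are invented; only the column names follow the merged dataset):

```python
import pandas as pd

# Hypothetical records with repeated counties
toy = pd.DataFrame({
    "county": ["Cork", "Dublin", "Cork", "Wicklow", "Dublin", "Wicklow"],
    "waste_amount_in_tonnes": [100.0, 300.0, 150.0, 500.0, 50.0, 25.0],
})

# Total per county, then keep the two largest totals
totals = toy.groupby("county")["waste_amount_in_tonnes"].sum()
top = totals.nlargest(2)

print(top)
```

`nlargest(n)` is used here as a compact alternative to `sort_values(ascending=False).head(n)`; either form returns the totals in descending order.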

### D. Waste Distribution by Type

This pie chart depicts the distribution of total waste by type and displays the percentage share of each type. The y-axis label is hidden for a cleaner look, and the chart uses a pastel color palette for visual clarity.

(Pandas.DataFrame.plot.pie — Pandas 1.3.2 Documentation, n.d.)  
(seaborn, 2013)

In [None]:
# Group and sum waste amounts by type
waste_by_type = merge_data.groupby('waste_type')['waste_amount_in_tonnes'].sum()

# Plot pie chart for waste distribution by type
plt.figure(figsize=(8, 8))
waste_by_type.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('Pastel1'))

# Title, remove y-label, and adjust layout
plt.title('Pie Chart: Waste Distribution by Type', fontsize=16)
plt.ylabel('')  # Hide y-label for better visual
plt.tight_layout()

# Show plot
plt.show()

The pie chart depicts the split between **Hazardous** and **Non-Hazardous** waste: **42.7%** of the total waste is **Hazardous** and **57.3%** is **Non-Hazardous**. Most of the waste handled in the dataset is therefore non-hazardous, and the two types may call for different management approaches.
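The percentages printed on a pie chart are each group's total divided by the grand total. The sketch below computes such shares directly on hypothetical rows; the amounts are invented and do not reproduce the figures reported above.

```python
import pandas as pd

# Hypothetical records
toy = pd.DataFrame({
    "waste_type": ["Hazardous", "Non-Hazardous", "Hazardous", "Non-Hazardous"],
    "waste_amount_in_tonnes": [100.0, 300.0, 200.0, 200.0],
})

# Total per type, then each total as a percentage of the overall sum
waste_by_type = toy.groupby("waste_type")["waste_amount_in_tonnes"].sum()
share = (waste_by_type / waste_by_type.sum() * 100).round(1)

print(share)
```

Printing the shares alongside the chart gives exact values that can be quoted in the text.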

### E. Distribution of Waste Amounts
Using a kernel density estimate (KDE) to visualize the general structure of the data, this histogram displays the distribution of waste quantities. It aids in comprehending how frequently various waste quantities occur within the dataset. 

(Seaborn.histplot — Seaborn 0.11.2 Documentation, n.d.)  

In [None]:
# Create histogram with KDE for waste amount distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=merge_data, x='waste_amount_in_tonnes', bins=30, kde=True, color='orange')

# Set title and labels
plt.title('Distribution of Waste Amounts', fontsize=16)
plt.xlabel('Waste Amount (Tonnes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)

# Adjust layout and show plot
plt.tight_layout()
plt.show()

With most of the data concentrated between 0 and 10,000 tonnes, this histogram shows a positively skewed distribution of waste amounts, suggesting that smaller waste amounts are more common. The steady drop toward higher values points to possible extreme values or infrequent instances of very large waste volumes, both of which merit further investigation.
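A visual impression of skew can be backed by a number. As a hedged sketch, pandas offers `Series.skew()`, and comparing the mean with the median is a quick second check; the values below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical amounts with one large value in the right tail
amounts = pd.Series([1000.0, 1500.0, 2000.0, 2500.0, 3000.0, 20000.0])

skewness = amounts.skew()  # positive for a right-skewed sample
mean_value = amounts.mean()
median_value = amounts.median()

print(f"skewness > 0: {skewness > 0}")
print(f"mean {mean_value} vs median {median_value}")
```

A positive skewness together with a mean above the median is consistent with a long right tail.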

# 7. Analysis of the findings based on EDA

The **Exploratory Data Analysis (EDA)** highlighted waste treatment trends, costs, and efficiency. Expressing costs in millions simplified the financial analysis, and comparing each waste amount to the dataset's maximum value helped flag areas for improvement. Key findings include: hazardous waste consistently exceeding non-hazardous waste in quarterly trends, with fluctuations likely due to seasonal factors; dominance of hazardous waste in categories like industrial and chemical waste, while non-hazardous waste is significant in organic and construction waste; and Wicklow leading in total waste generation, with consistent patterns across the top counties. The pie chart further showed each waste type's share of the total, which can guide targeted reduction initiatives and improvements in treatment efficiency.
 

# 8. Conclusion

The **quarterly_waste_generation.csv** and **quarterly_waste_treatment.csv** datasets were successfully combined after significant issues, including missing values and inconsistent date formats, were fixed. The exploratory data analysis (EDA) revealed noteworthy trends, including major differences in waste generation between counties, with some producing far more waste than others. We also saw that waste treatment costs differed by treatment method, underscoring the possible influence of the treatment approach on total expenses. This analysis offered useful insight into waste management practices. In future work, we plan to investigate predictive models to better understand the factors influencing treatment costs and waste management outcomes, and to determine the most economical waste treatment techniques.

# 9. Reflections and Insights.

## A. Lessons

## Group Member 1: Dyuti Kamdar
**Most enjoyable part:** I enjoyed exploring Python's data visualization libraries like seaborn and using them to create insightful graphs. It was fun to learn how to plot bar graphs and histograms.  
**Most challenging aspect:** The most challenging part for me was handling and cleaning a large dataset, especially the waste category and treatment method columns, which had many missing and inconsistent values.  
**Most valuable thing learned:** I learned the importance of thorough data cleaning and preprocessing before visualization.

## Group Member 2: Mrunal Howale
**Most enjoyable part:** Finding trends in waste treatment across various categories and time periods through data analysis, along with automating repetitive tasks using Python, was the most enjoyable part of the project.  
**Most challenging aspect:**  The most challenging aspect was cleaning the dataset of quarterly waste treatment, which involved addressing inconsistencies and missing values, along with struggling to understand and implement advanced statistical methods in Python for data analysis.  
**Most valuable thing learned:** The most valuable thing learned through this project was the importance of effective data preparation and visualization for drawing meaningful insights, along with gaining a solid understanding of using pandas and NumPy for data manipulation and the power of Python for data analytics.

## Group Member 3: Praveen Mangawa
**Most enjoyable part:** I loved working with Python libraries like Matplotlib and Seaborn to visually communicate data trends and insights.  
**Most challenging aspect:**  The biggest challenge for me was debugging code errors and ensuring the accuracy of calculations in the project.  
**Most valuable thing learned:** The most valuable thing I learned while working on this project was what not to do when coding in Python, along with coding ethics.

## Group Member 4: Shakirat Muibi
**Most enjoyable part:** I gained satisfaction from exploring and cleaning an inconsistent dataset, as well as handling missing values, appreciating how small corrections enhanced analysis insights.  
**Most challenging aspect:** I found normalizing inconsistent date and text formats challenging, as it required time and meticulous effort.  
**Most valuable thing learned:** I learned essential skills in cleaning and merging datasets and in handling missing data and outliers, which are crucial for tackling real-world data complexities.

## Group Member 5: Aida George
**Most enjoyable part:** I loved creating interactive Jupyter Notebooks to showcase the project results in an engaging and professional format.  
**Most challenging aspect:**  The most challenging aspect was understanding and implementing the logic for outlier detection and correction.  
**Most valuable thing learned:** I learned how to recognize and manage outliers using the IQR technique, which can greatly influence the conclusions derived from the data.

## Group Member 6: Dion Chettiar
**Most enjoyable part:**  I enjoyed learning how to merge and transform datasets into meaningful formats, which brought clarity to the data analysis process.  
**Most challenging aspect:** I found working with string manipulation and correcting inconsistent entries in the dataset particularly difficult.  
**Most valuable thing learned:** I learned the importance of data normalization and how to use Python's re and pandas libraries to handle text-based inconsistencies effectively.


## B. Issues Encountered during the project

We ran into a number of problems when cleaning the data. One major obstacle was managing missing entries in the `treatment_method` column, where the initial value was blank. Forward filling alone left a missing-value count of one instead of zero, so we combined it with backfilling. Another problem surfaced during data transformation: dates were stored as strings and costs were specified in different units (such as thousands and millions). To fix this, we converted the dates into the appropriate format and standardized all values to a common unit (millions), ensuring consistency and enabling accurate analysis.
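The forward-fill issue can be reproduced in a few lines. This is a minimal sketch on a hypothetical `treatment_method` series whose first entry is blank; the method names are invented for illustration.

```python
import pandas as pd

# Hypothetical column with a blank first entry and gaps further down
methods = pd.Series([None, "Landfill", None, "Recycling", None],
                    name="treatment_method")

# Forward fill copies the previous value down, so a leading blank has nothing to copy
after_ffill = methods.ffill()

# Backfilling afterwards fills the leading blank from the next valid value
after_both = methods.ffill().bfill()

print(after_ffill.isna().sum())  # 1
print(after_both.isna().sum())   # 0
```

Running the forward fill first means most gaps inherit the preceding value, and the backfill only touches the entries that had no predecessor.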

# 10. References

1.	Numpy. (2024). NumPy. Numpy.org. https://numpy.org/

2.	Pandas. (2018). Python Data Analysis Library. Pydata.org. https://pandas.pydata.org/

3.	seaborn. (2012). seaborn: statistical data visualization — seaborn 0.9.0 documentation. Pydata.org. https://seaborn.pydata.org/

4.	Matplotlib. (2012). Matplotlib: Python plotting — Matplotlib 3.1.1 documentation. Matplotlib.org. https://matplotlib.org/

5.	Python Software Foundation. (2020). csv — CSV File Reading and Writing — Python 3.8.1 documentation. Python.org. https://docs.python.org/3/library/csv.html

6.	Python. (2009). re — Regular expression operations — Python 3.7.2 documentation. Python.org. https://docs.python.org/3/library/re.html

7.	pandas.merge — pandas 1.2.3 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.merge.html

8.	pandas.DataFrame.rename — pandas 1.4.2 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html

9.	pandas.DataFrame.nunique — pandas 1.3.4 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html

10.	pandas.to_datetime — pandas 1.3.4 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

11.	pandas.NaT — pandas 2.2.3 documentation. (2024). Pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.NaT.html

12.	GeeksforGeeks. (2018, July 3). Python | Pandas.apply(). GeeksforGeeks; GeeksforGeeks. https://www.geeksforgeeks.org/python-pandas-apply/

13.	pandas.Series.str.replace — pandas 2.0.3 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html

14.	pandas.Series.str.strip — pandas 2.0.0 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html

15.	pandas.DataFrame.replace — pandas 1.2.4 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

16.	Working with missing data — pandas 1.5.1 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/user_guide/missing_data.html

17.	pandas.DataFrame.map — pandas 2.2.2 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html

18.	pandas.to_numeric — pandas 1.4.2 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html

19.	pandas.Series.str.replace — pandas 2.0.3 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html

20.	pandas.Series.str.lower — pandas 2.2.2 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html

21.	pandas.Series.value_counts — pandas 1.3.4 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

22.	pandas.DataFrame.round — pandas 1.3.4 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html

23.	Pandas DataFrame isnull() Method. (n.d.). www.w3schools.com. https://www.w3schools.com/python/pandas/ref_df_isnull.asp

24.	Pandas DataFrame sum() Method. (n.d.). www.w3schools.com. https://www.w3schools.com/python/pandas/ref_df_sum.asp

25.	Pandas DataFrame fillna() Method. (n.d.). Www.w3schools.com. https://www.w3schools.com/python/pandas/ref_df_fillna.asp

26.	Python statistics.median() Method. (n.d.). Www.w3schools.com. https://www.w3schools.com/python/ref_stat_median.asp

27.	Pandas DataFrame duplicated() Method | Pandas Method. (2018, July 23). GeeksforGeeks. https://www.geeksforgeeks.org/pandas-dataframe-duplicated/

28.	Python | Pandas dataframe.drop_duplicates(). (2018, August 2). GeeksforGeeks. https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/

29.	pandas.DataFrame.quantile — pandas 2.1.1 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html

30.	Waskom, M. (n.d.). seaborn.boxplot — seaborn 0.11.1 documentation. Seaborn.pydata.org. https://seaborn.pydata.org/generated/seaborn.boxplot.html

31.	matplotlib.pyplot.hist — Matplotlib 3.5.1 documentation. (n.d.). Matplotlib.org. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

32.	GeeksforGeeks. (2021, February 23). Detect and Remove the Outliers using Python. GeeksforGeeks. https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/

33.	Zach. (2020, July 6). How to Remove Outliers in Python. Statology. https://www.statology.org/remove-outliers-python/

34.	pandas.DataFrame.describe — pandas 0.20.2 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/pandas-docs/version/0.20.2/generated/pandas.DataFrame.describe.html

35.	Waskom, M. (n.d.). seaborn.boxplot — seaborn 0.11.1 documentation. Seaborn.pydata.org. https://seaborn.pydata.org/generated/seaborn.boxplot.html

36.	matplotlib.pyplot.hist — Matplotlib 3.5.1 documentation. (n.d.). Matplotlib.org. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

37.	pandas.DataFrame.clip — pandas 2.1.4 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.clip.html

38.	pandas.DataFrame — pandas 0.25.3 documentation. (2020). Pydata.org. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

39.	seaborn.lineplot — seaborn 0.11.2 documentation. (n.d.). Seaborn.pydata.org. https://seaborn.pydata.org/generated/seaborn.lineplot.html

40.	seaborn.barplot — seaborn 0.11.1 documentation. (n.d.). Seaborn.pydata.org. https://seaborn.pydata.org/generated/seaborn.barplot.html

41.	pandas.DataFrame.plot.pie — pandas 1.3.2 documentation. (n.d.). Pandas.pydata.org. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.pie.html

42.	seaborn. (2013). Choosing color palettes — seaborn 0.9.0 documentation. Pydata.org. https://seaborn.pydata.org/tutorial/color_palettes.html

43.	seaborn.histplot — seaborn 0.11.2 documentation. (n.d.). Seaborn.pydata.org. https://seaborn.pydata.org/generated/seaborn.histplot.html

44.	‌ Grammarly. (2). Grammarly. Grammarly.com. [AI-based proofreading tool] https://www.grammarly.com
