# __Impact Analysis of Monkeypox Case Study__

___

## **Business Understanding**

**The Monkeypox outbreak**, though not as widespread as other pandemics, poses significant public health challenges globally, especially in regions where the virus is most prevalent. As governments and health organizations strive to contain the spread, there is a need to **analyze Monkeypox data to extract actionable insights** that can aid in **public health responses and policy formation**. The objective of this project is to analyze global Monkeypox data, particularly focusing on regional trends, severity, and demographic impacts to support **strategic interventions**. The key aspects of this analysis will include:

1. **Epidemiologic Trends**: Investigating the spread of Monkeypox across different regions, focusing on factors such as new cases, total cases, and mortality rates.

2. **Regional Comparisons**: Comparing various countries and regions to understand how Monkeypox affects different areas, identifying regions with high transmission and high mortality rates.

3. **Demographic Trends**: Analyzing the impact of Monkeypox based on available demographic data (e.g., countries or regions) to highlight vulnerable groups or areas requiring urgent attention.

4. **Temporal Analysis**: Examining how the outbreak has evolved over time, identifying any patterns or spikes in infections and deaths that could guide future preventive measures.

5. **Identification of High-risk Regions**: Identifying "hotspots" can support public health officials in prioritizing these areas for immediate attention and interventions.

### **Problem Points**

Although Monkeypox is not as widely known as other pandemics, its spread still poses significant challenges to global public health, especially in countries where the virus is more prevalent. The key questions this analysis addresses are:

1. Identification of Spread Trends: How is Monkeypox spreading in different regions and countries? Are there regions that are more susceptible to this spread?
2. Correlation between Cases and Deaths: Is there a significant relationship between the number of new cases and the number of deaths in each country?
3. Regional Comparison: Which countries or regions have the highest number of cases and deaths, and how has this evolved over time?
4. Temporal Analysis: Are there any patterns or spikes in the spread of Monkeypox over time, e.g. in certain seasons or periods of the year?
5. Case Fatality Ratio Analysis: Which regions have the highest case fatality ratios? Do these regions also report a high number of cases, or do they have fewer cases but a higher share of deaths?
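
Question 2 can be explored with a simple correlation coefficient. The sketch below is illustrative only: the `sample` frame holds made-up numbers rather than values from the dataset, and Pearson correlation on per-country totals is one possible choice, not necessarily the method this project ends up using.

```python
import pandas as pd

# Made-up illustrative values, NOT taken from the Monkeypox dataset
sample = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_cases": [100, 200, 300, 400],
    "total_deaths": [1, 2, 2, 5],
})

# Pearson correlation between cumulative cases and cumulative deaths
case_death_corr = sample["total_cases"].corr(sample["total_deaths"])
print(f"Pearson correlation: {case_death_corr:.2f}")
```

A coefficient close to 1 would suggest that countries with more cases also tend to report more deaths, but it says nothing about causation or about differences in reporting quality.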

## **Data Understanding**

**Data Description**

1. location: The name of the country or region that reported the data.
2. date: The date the data was reported in YYYY-MM-DD format.
3. new_cases: The number of new cases of Monkeypox reported on that date in the country/region.
4. new_deaths: The number of new deaths reported on that date in a country/region.
5. total_cases: The cumulative number of Monkeypox cases recorded in a country/region up to that date.
6. total_deaths: The cumulative number of deaths recorded in a country/region up to that date.
7. new_cases_per_million: The number of new cases per one million population in the region as of the given date.
8. total_cases_per_million: The cumulative number of cases per one million population up to the given date.
9. new_deaths_per_million: The number of new deaths per one million population in the region as of the given date.
10. total_deaths_per_million: The cumulative number of deaths per one million population up to the given date.
11. new_cases_smoothed: The smoothed average daily number of new cases over the given time period.
12. new_deaths_smoothed: The smoothed average daily number of new deaths over the given time period.
13. new_cases_smoothed_per_million: Average daily smoothed number of new cases per one million population.
14. new_deaths_smoothed_per_million: Average daily smoothed number of new deaths per one million population.
15. suspected_cases_cumulative: Number of suspected Monkeypox cases up to a certain date (if data is available).
16. annotation: Additional notes or information related to the data report on a specific date (for example, data revisions or corrections).
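
To make the derived columns more concrete, here is a minimal sketch of how a per-million rate and a smoothed series could be computed from raw daily counts. The `population` value and the counts are hypothetical, and the 7-day rolling window is an assumption: the dataset description above does not state which window was used for the `*_smoothed` columns.

```python
import pandas as pd

# Made-up daily counts for one hypothetical location
daily = pd.DataFrame({
    "date": pd.date_range("2022-06-01", periods=10, freq="D"),
    "new_cases": [0, 7, 14, 7, 0, 7, 14, 21, 7, 0],
})
population = 2_000_000  # hypothetical population, for illustration only

# New cases per one million inhabitants
daily["new_cases_per_million"] = daily["new_cases"] / population * 1_000_000

# 7-day rolling mean (window length is an assumption, not taken from the dataset)
daily["new_cases_smoothed"] = daily["new_cases"].rolling(window=7).mean()

print(daily.tail(4))
```

With a full 7-day window, the first six rows of the smoothed column are `NaN` because there is not yet enough history to average over.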

**Data Grouping**

1. Total Cases and Deaths per Country/Region: Count the total number of Monkeypox cases and deaths in each country or region.
2. Case Progression by Day: Group the data by date to see the daily trend of the spread.
3. Distribution of New Cases by Region: View the distribution of new cases by location and time to understand the most severely affected regions in a given period.
4. Case Fatality Ratio Analysis: Calculate the case fatality ratio (CFR) as the total number of deaths divided by the total number of cases in each country/region to identify areas with high fatality rates.
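
The groupings above could be sketched with pandas `groupby` as follows. The `records` frame contains made-up values for two hypothetical locations, and the variable names are illustrative; only the column names come from the data description.

```python
import pandas as pd

# Made-up records for two hypothetical locations
records = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "date": pd.to_datetime(["2022-06-01", "2022-06-02", "2022-06-01", "2022-06-02"]),
    "new_cases": [10, 30, 5, 5],
    "new_deaths": [0, 2, 0, 0],
})

# Totals per location, obtained by summing the daily counts
totals = records.groupby("location")[["new_cases", "new_deaths"]].sum()

# CFR = total deaths / total cases, expressed as a percentage
totals["cfr_percent"] = totals["new_deaths"] / totals["new_cases"] * 100

# Daily progression across all locations
daily_cases = records.groupby("date")["new_cases"].sum()

print(totals)
print(daily_cases)
```

Locations with zero total cases would produce a division by zero (`NaN` or `inf`) in the CFR column, so they may need to be filtered out before the ratio is interpreted.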

## **Data Preparation**

### Import Libraries

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Gathering Data (Import File)

In [None]:
# Load the dataset
while True:
    try:
        start_year = int(input("Enter the start year (example: 2022): "))
        start_month = int(input("Enter the start month (1-12): "))

        end_year = int(input("Enter the end year (example: 2024): "))
        end_month = int(input("Enter the end month (1-12): "))

        # Input Validation
        if start_month < 1 or start_month > 12 or end_month < 1 or end_month > 12:
            print("Month must be between 1 and 12. Please try again.")
        elif start_year > end_year or (start_year == end_year and start_month > end_month):
            print("The start date cannot be later than the end date. Please try again.")
        else:
            break
    except ValueError:
        print("Invalid input. Please enter valid year and month numbers (example: 2022 and 5 for May).")

# Construct the file name based on the input
output_folder = 'data/raw/filtered'

# Format the file name according to the selected year and month range
file_name = f"monkeypox_{start_year}_{start_month}_to_{end_year}_{end_month}_filtered.csv"
file_path = os.path.join(output_folder, file_name)

# Load the file, or stop here so that later cells do not fail with a NameError on df
if os.path.exists(file_path):
    df = pd.read_csv(file_path)
    print(f"Data successfully loaded from {file_path}")
else:
    raise FileNotFoundError(f"File {file_path} not found.")

### Check Data

In [None]:
# Count rows of dataset
jumlah_data = len(df)
print("Total data:", jumlah_data)

In [None]:
# View the first 5 rows of the dataset
print("First 5 rows of the dataset:")
df.head()

### Assessing Data

In [None]:
# Counting the number of duplicate entries
# Counting the number of null values in each column
print("Number of duplications: ", df.duplicated().sum())
print("\n")

print("Null Data:")
for key, data in df.isnull().sum().items():
    print(f"{key}: {data}")

In [None]:
# Checking dataset dimensions (number of rows and columns)
print("\nShape of the dataset:")
df.shape

In [None]:
# Checking data type, column, and missing values information
print("\nInfo of the dataset:")
df.info()

In [None]:
# Checking the number of missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())

### Cleaning Data

#### Invalid Date

In [None]:
# Convert the 'date' column to datetime type
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Show rows whose date could not be parsed (coerced to NaT)
invalid_dates = df[df['date'].isna()]
print("\nInvalid date entries (rows with missing dates after conversion):")
print(invalid_dates)
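
The cell above only reports unparseable dates. If such rows should also be excluded from the analysis, one option is to drop them after the conversion. The following is a minimal sketch on a made-up frame (`raw` and `valid` are illustrative names), not a step the notebook currently performs.

```python
import pandas as pd

# Made-up frame with one unparseable date
raw = pd.DataFrame({
    "date": ["2022-06-01", "not-a-date", "2022-06-03"],
    "new_cases": [1, 2, 3],
})

# Unparseable entries become NaT instead of raising an error
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

# Keep only rows whose date was parsed successfully
valid = raw.dropna(subset=["date"]).reset_index(drop=True)
print(f"Dropped {len(raw) - len(valid)} row(s) with invalid dates")
```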

#### Missing Values

In [None]:
# Handle missing values
# Remove rows with missing values in the new_cases, new_deaths, total_cases, or total_deaths columns
data_cleaned = df.dropna(subset=['new_cases', 'new_deaths', 'total_cases', 'total_deaths'])

# Check the remaining missing values (columns outside the subset above may still contain some)
print("\nMissing values after cleaning:")
print(data_cleaned.isnull().sum())

#### Duplicates

In [None]:
# Checking if there are duplicate values
duplicates = data_cleaned.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

# If there are duplicates, we will remove them
if duplicates > 0:
    data_cleaned = data_cleaned.drop_duplicates()

# Verify the data dimension after cleaning
print(f"\nShape of the dataset after cleaning: {data_cleaned.shape}")

#### Values Outside the Normal Range

In [None]:
# Checking for strange or out-of-bounds values (e.g. negative cases)
negative_cases = data_cleaned[(data_cleaned['new_cases'] < 0) | (data_cleaned['new_deaths'] < 0)]
print("\nRows with negative case values (if any):")
print(negative_cases)

# If there are invalid negative values, they can be removed
data_cleaned = data_cleaned[(data_cleaned['new_cases'] >= 0) & (data_cleaned['new_deaths'] >= 0)]

#### Outliers

In [None]:
# Checking for outliers in the new_cases column with the IQR method
Q1 = data_cleaned['new_cases'].quantile(0.25)
Q3 = data_cleaned['new_cases'].quantile(0.75)
IQR = Q3 - Q1

outliers = data_cleaned[(data_cleaned['new_cases'] < (Q1 - 1.5 * IQR)) | (data_cleaned['new_cases'] > (Q3 + 1.5 * IQR))]
print("\nPotential outliers based on new_cases:")
outliers.head()
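
The same IQR rule can be wrapped in a small helper so that it can be reused for other columns such as `new_deaths`. This is a sketch under the assumption that the conventional multiplier of 1.5 is appropriate; `iqr_bounds` is a hypothetical helper name and the `values` series is made up.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious extreme
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower, upper = iqr_bounds(values)

# Flag values outside the fences instead of deleting them
flagged = values[(values < lower) | (values > upper)]
print(flagged.tolist())
```

In outbreak data, very high daily counts may be genuine spikes rather than errors, which is a reason to flag potential outliers for review rather than remove them automatically.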

#### Cleaned

In [None]:
# Preview the first 5 rows of the cleaned dataset
print("\nCleaned data preview:")
data_cleaned.head()

In [None]:
# Checking data type and column of dataset
data_cleaned.info()

## **Export to File**

In [None]:
# Ask for the year and month range used in the output file name
while True:
    try:
        start_year = int(input("Enter the start year (example: 2022): "))
        start_month = int(input("Enter the start month (1-12): "))

        end_year = int(input("Enter the end year (example: 2024): "))
        end_month = int(input("Enter the end month (1-12): "))

        # Input validation
        if start_month < 1 or start_month > 12 or end_month < 1 or end_month > 12:
            print("Month must be between 1 and 12. Please try again.")
        elif start_year > end_year or (start_year == end_year and start_month > end_month):
            print("The start date cannot be later than the end date. Please try again.")
        else:
            break
    except ValueError:
        print("Invalid input. Please enter valid year and month numbers (example: 2022 and 5 for May).")

In [None]:
# Path to save the processed file
output_folder = 'data/data_processed'
os.makedirs(output_folder, exist_ok=True)  # Ensure folder exists

# Construct the file name based on the year and month range
output_file_path = os.path.join(
    output_folder, f'monkeypox_{start_year}_{start_month}_to_{end_year}_{end_month}_processed.csv')

# Save the processed data to a CSV file
data_cleaned.to_csv(output_file_path, index=False)

print(f"The file has been saved to: {output_file_path}")