# __Impact Analysis of Monkeypox Case Study__

___

## **Business Understanding**

**The Monkeypox outbreak**, though not as widespread as other pandemics, poses significant public health challenges globally, especially in regions where the virus is most prevalent. As governments and health organizations strive to contain the spread, there is a need to **analyze Monkeypox data to extract actionable insights** that can aid in **public health responses and policy formation**. The objective of this project is to analyze global Monkeypox data, particularly focusing on regional trends, severity, and demographic impacts to support **strategic interventions**. The key aspects of this analysis will include:

1. **Epidemiologic Trends**: Investigating the spread of Monkeypox across different regions, focusing on factors such as new cases, total cases, and mortality rates.

2. **Regional Comparisons**: Comparing various countries and regions to understand how Monkeypox affects different areas, identifying regions with high transmission and high mortality rates.

3. **Demographic Trends**: Analyzing the impact of Monkeypox based on available demographic data (e.g., population or regions) to highlight vulnerable groups or areas requiring urgent attention.

4. **Temporal Analysis**: Examining how the outbreak has evolved over time, identifying any patterns or spikes in infections and deaths that could guide future preventive measures.

5. **Identification of High-risk Regions**: Identifying "hotspots" can support public health officials in prioritizing these areas for immediate attention and interventions.

### **Problem Points**

Although Monkeypox has drawn less attention than other recent outbreaks, its spread still poses significant challenges to global public health, especially in countries where the virus is more prevalent. The key questions this analysis addresses are:

1. Identification of Spread Trends: How is Monkeypox spreading across regions and countries? Are some regions more susceptible than others?
2. Correlation between Cases and Deaths: Is there a significant relationship between the number of new cases and the number of deaths in each country?
3. Regional Comparison: Which countries or regions have the highest number of cases and deaths, and how has this evolved over time?
4. Temporal Analysis: Are there patterns or spikes in the spread of Monkeypox over time, e.g. in certain seasons or periods of the year?
5. Case Fatality Ratio Analysis: Which regions have the highest case fatality ratios? Do these regions also have high case counts, or fewer cases with a higher share of deaths?


## **Data Understanding**

**Data Description**

1. location: The name of the country or region that reported the data.
2. date: The date the data was reported in YYYY-MM-DD format.
3. new_cases: The number of new cases of Monkeypox reported on that date in the country/region.
4. new_deaths: The number of new deaths reported on that date in a country/region.
5. total_cases: The cumulative number of Monkeypox cases recorded in a country/region up to that date.
6. total_deaths: The cumulative number of deaths recorded in a country/region up to that date.
7. new_cases_per_million: The number of new cases per one million population in the region as of the given date.
8. total_cases_per_million: The cumulative number of cases per one million population up to the given date.
9. new_deaths_per_million: The number of new deaths per one million population in the region as of the given date.
10. total_deaths_per_million: The cumulative number of deaths per one million population up to the given date.
11. new_cases_smoothed: The smoothed average daily number of new cases over the given time period.
12. new_deaths_smoothed: The smoothed average daily number of new deaths over the given time period.
13. new_cases_smoothed_per_million: Average daily smoothed number of new cases per one million population.
14. new_deaths_smoothed_per_million: Average daily smoothed number of new deaths per one million population.
15. suspected_cases_cumulative: Number of suspected Monkeypox cases up to a certain date (if data is available).
16. annotation: Additional notes or information related to the data report on a specific date (for example, data revisions or corrections).
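
To make the derived columns more concrete, the sketch below shows how a per-million rate and a smoothed series could be computed from daily counts. The population figure, the daily values, and the 7-day window are all assumptions made for illustration; the dataset description does not state which smoothing period was used.

```python
import pandas as pd

# Hypothetical daily counts and population (illustrative values only)
population = 2_000_000
daily = pd.DataFrame({"new_cases": [0, 4, 8, 6, 2, 10, 12, 14]})

# Cases per one million inhabitants
daily["new_cases_per_million"] = daily["new_cases"] / population * 1_000_000

# Rolling mean; the 7-day window is an assumption, the source does not state the period
daily["new_cases_smoothed"] = daily["new_cases"].rolling(window=7).mean()

print(daily)
```

The first six smoothed values are missing because a full window is not yet available, which is one reason smoothed columns can contain more nulls than the raw counts.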

**Data Grouping**

1. Total Cases and Deaths per Country/Region: Count the total number of Monkeypox cases and deaths in each country or region.
2. Case Progression by Day: Group the data by date to see the trend of daily spread.
3. Distribution of New Cases by Region: View the distribution of new cases by location and time to understand which regions were most severely affected in a given period.
4. Case Fatality Ratio Analysis: Calculate the case fatality ratio (CFR) as the total number of deaths divided by the total number of cases in each country/region to identify areas with high fatality rates.
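
As a minimal illustration of the CFR definition in item 4, the sketch below applies it to made-up totals. The location names and numbers are hypothetical and not taken from the dataset; leaving the ratio undefined where there are no cases is a design choice of this sketch.

```python
import pandas as pd

# Hypothetical cumulative totals per location (illustrative values only)
totals = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [1000, 50, 0],
    "total_deaths": [10, 5, 0],
})

# CFR (%) = total deaths / total cases * 100
# Locations without cases get a missing value instead of a division by zero
cases = totals["total_cases"].where(totals["total_cases"] > 0)
totals["CFR"] = totals["total_deaths"] / cases * 100

print(totals)
```

Note that a location with very few cases can show a high CFR from a handful of deaths, so the ratio is best read alongside the case count.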

## **Data Preparation**

### Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

### Gathering Data (Import File)

In [None]:
# Load the dataset
file_path = 'data/raw_data/filtered/monkeypox_2022_to_2024.csv' # filename: monkeypox_{start_year}_to_{end_year}.csv
df = pd.read_csv(file_path)

### Check Data

In [None]:
# Count rows of dataset
jumlah_data = len(df)
print("Total data:", jumlah_data)

In [None]:
# View the first 5 rows of the dataset
print("First 5 rows of the dataset:")
df.head()

### Assessing Data

In [None]:
# Counting the number of duplicate entries
# Counting the number of null values in each column
print("Number of duplications: ", df.duplicated().sum())
print("\n")

print("Null Data:")
for key, data in df.isnull().sum().items():
    print(f"{key}: {data}")

In [None]:
# Checking dataset dimensions (number of rows and columns)
print("\nShape of the dataset:")
df.shape

In [None]:
# Checking data type, column, and missing values information
print("\nInfo of the dataset:")
df.info()

In [None]:
# Checking the number of missing values per column
print("\nMissing values per column:")
print(df.isnull().sum())

### Cleaning Data

#### Invalid Date

In [None]:
# Convert the 'date' column to datetime type
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Check for rows whose date could not be parsed (NaT after conversion)
invalid_dates = df[df['date'].isna()]
print("\nInvalid date entries (rows with missing dates after conversion):")
print(invalid_dates)
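
The effect of `errors='coerce'` can be seen on a small example. The date strings below are made up for illustration: values that cannot be parsed become `NaT` instead of raising an error, which is what the check above relies on.

```python
import pandas as pd

# Hypothetical raw date strings, two of them invalid (illustrative values only)
raw = pd.Series(["2022-05-01", "2022-13-01", "not a date", "2024-02-29"])

# Invalid entries are turned into NaT rather than stopping the conversion
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print("Unparseable entries:", parsed.isna().sum())
```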

#### Missing Values

In [None]:
# Addressing missing values
# For rows that contain missing values in the new_cases, new_deaths, total_cases, or total_deaths columns, we will remove them
data_cleaned = df.dropna(subset=['new_cases', 'new_deaths', 'total_cases', 'total_deaths'])

# Check the remaining missing values per column (the four columns above should now show zero)
print("\nMissing values after cleaning:")
print(data_cleaned.isnull().sum())

#### Duplicates

In [None]:
# Checking if there are duplicate values
duplicates = data_cleaned.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

# If there are duplicates, we will remove them
if duplicates > 0:
    data_cleaned = data_cleaned.drop_duplicates()

# Verify the data dimension after cleaning
print(f"\nShape of the dataset after cleaning: {data_cleaned.shape}")

#### Out-of-Range Values

In [None]:
# Checking for strange or out-of-bounds values (e.g. negative cases)
negative_cases = data_cleaned[(data_cleaned['new_cases'] < 0) | (data_cleaned['new_deaths'] < 0)]
print("\nRows with negative case values (if any):")
print(negative_cases)

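# Keep only rows with non-negative values; .copy() makes data_cleaned an independent
# DataFrame so that later column assignments do not raise SettingWithCopyWarning
data_cleaned = data_cleaned[(data_cleaned['new_cases'] >= 0) & (data_cleaned['new_deaths'] >= 0)].copy()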

#### Outliers

In [None]:
# Checking for outliers in the new_cases column using the IQR rule
Q1 = data_cleaned['new_cases'].quantile(0.25)
Q3 = data_cleaned['new_cases'].quantile(0.75)
IQR = Q3 - Q1

outliers = data_cleaned[(data_cleaned['new_cases'] < (Q1 - 1.5 * IQR)) | (data_cleaned['new_cases'] > (Q3 + 1.5 * IQR))]
print("\nPotential outliers based on new_cases:")
outliers.head()
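
The IQR rule used above can be followed step by step on a small series. The values below are made up for illustration; in outbreak data with many low-count days, genuine surges tend to be flagged by this rule, so flagged rows are candidates for inspection rather than automatic removal.

```python
import pandas as pd

# Hypothetical daily counts with one obvious spike (illustrative values only)
new_cases = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 120])

# Quartiles and interquartile range
q1 = new_cases.quantile(0.25)
q3 = new_cases.quantile(0.75)
iqr = q3 - q1

# Values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = new_cases[(new_cases < lower) | (new_cases > upper)]

print(f"bounds: [{lower}, {upper}], flagged: {flagged.tolist()}")
```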

#### Cleaned

In [None]:
# Preview the first rows of the cleaned dataset
print("\nCleaned data preview:")
data_cleaned.head()

In [None]:
# Checking data type and column of dataset
data_cleaned.info()

## **Exploratory Data Analysis (EDA)**

In [None]:
# Classification of Countries and Regions/Continents
# List of regions/continents that are not countries
regions_or_continents = ['World', 'Asia', 'Europe', 'Africa', 'North America', 'South America', 'Oceania']  # Extend this list if needed

# Add a new column 'location_type' to classify each row as 'Country' or 'Region/Continent'
data_cleaned['location_type'] = data_cleaned['location'].apply(lambda x: 'Region/Continent' if x in regions_or_continents else 'Country')

# Check that the classification worked
print("\nLocation types (Countries vs Regions/Continents):")
print(data_cleaned['location_type'].value_counts())

In [None]:
# Grouping country and non-country data for separate analysis
countries_data = data_cleaned[data_cleaned['location_type'] == 'Country']
regions_data = data_cleaned[data_cleaned['location_type'] == 'Region/Continent']

# Optional: display the number of countries and regions identified
print(f"\nNumber of Countries: {countries_data['location'].nunique()}")
print(f"Number of Regions/Continents: {regions_data['location'].nunique()}")

In [None]:
# View the dataset after being classified between countries and non-countries
print("\nClassified dataset:")
data_cleaned.head()

**Epidemiologic Trends: Investigating the Spread of Monkeypox**

Looking at trends in the spread of Monkeypox with a focus on the factors of new cases, total cases, and mortality rates.

In [None]:
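# EDA: derive the year from the 'date' column and aggregate per year
data_cleaned.loc[:, 'year'] = pd.to_datetime(data_cleaned['date']).dt.year

# Use country rows only so that aggregate rows (World, continents) are not counted twice
country_rows = data_cleaned[data_cleaned['location_type'] == 'Country']

# total_cases and total_deaths are cumulative: take each country's highest value
# within the year, then add the countries together
yearly_cumulative = (
    country_rows.groupby(['year', 'location'])[['total_cases', 'total_deaths']].max()
    .groupby(level='year').sum()
)

# Yearly sum of new_cases plus the cumulative total_cases reached in that year
data_cases_yearly_sum = (
    country_rows.groupby('year')[['new_cases']].sum()
    .join(yearly_cumulative['total_cases'])
    .reset_index()
)

# Yearly sum of new_deaths plus the cumulative total_deaths reached in that year
data_deaths_yearly_sum = (
    country_rows.groupby('year')[['new_deaths']].sum()
    .join(yearly_cumulative['total_deaths'])
    .reset_index()
)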

**Regional Comparisons: Country/Region Comparisons**

Comparing countries to understand how Monkeypox affects different regions.


In [None]:
# Total cases and deaths in each country
# (total_* columns are cumulative, so the highest value per location is used instead of a sum)
data_grouped_by_location_countries = countries_data.groupby('location').agg({
    'total_cases': 'max',
    'total_deaths': 'max'
}).reset_index()

# Total cases and deaths in each region/continent
data_grouped_by_location_regions = regions_data.groupby('location').agg({
    'total_cases': 'max',
    'total_deaths': 'max'
}).reset_index()
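
The `total_cases` and `total_deaths` columns are cumulative, so adding them up across dates counts earlier cases again on every later day. The toy example below, with made-up numbers for a hypothetical location `A`, contrasts summing the cumulative column with taking its latest value. It is a sketch for illustration, not output from the dataset.

```python
import pandas as pd

# Hypothetical three-day series for one location (illustrative values only)
toy = pd.DataFrame({
    "location": ["A", "A", "A"],
    "date": pd.to_datetime(["2022-06-01", "2022-06-02", "2022-06-03"]),
    "new_cases": [2, 3, 5],
})
# Build the cumulative column the same way the dataset defines it: 2, 5, 10
toy["total_cases"] = toy.groupby("location")["new_cases"].cumsum()

# Summing daily counts gives the true total: 2 + 3 + 5 = 10
summed_new = toy.groupby("location")["new_cases"].sum()

# Summing the cumulative column overcounts: 2 + 5 + 10 = 17
summed_cumulative = toy.groupby("location")["total_cases"].sum()

# The latest cumulative value matches the true total: 10
latest_cumulative = toy.sort_values("date").groupby("location")["total_cases"].last()

print(summed_new["A"], summed_cumulative["A"], latest_cumulative["A"])
```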

**Demographic Trends: Impact by Region**

Analyze the impact of Monkeypox by population size or regional area.

In [None]:
# A look at the countries with the highest total cases
top_countries = data_grouped_by_location_countries.nlargest(10, 'total_cases')

**Temporal Analysis: Spread Over Time**

Analyzing how the spread of Monkeypox changes over time.


In [None]:
# Make sure the 'date' field is in datetime format
data_cleaned.loc[:, 'date'] = pd.to_datetime(data_cleaned['date'], errors='coerce')

# Create a 'month' column in Year-Month format
data_cleaned.loc[:, 'month'] = data_cleaned['date'].dt.strftime('%Y-%m')

# Convert the 'new_cases' and 'new_deaths' columns to numeric
data_cleaned.loc[:, 'new_cases'] = pd.to_numeric(data_cleaned['new_cases'], errors='coerce')
data_cleaned.loc[:, 'new_deaths'] = pd.to_numeric(data_cleaned['new_deaths'], errors='coerce')

# Group data by 'month' and calculate total new cases and deaths by month
cases_per_month = data_cleaned.groupby('month').agg({
    'new_cases': 'sum',
    'new_deaths': 'sum'
}).reset_index()
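
One simple way to look for spikes in a monthly series such as `cases_per_month` is to compare each month with the previous one. The sketch below uses made-up monthly totals (the `monthly` frame and its values are hypothetical) and is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical monthly totals (illustrative values only)
monthly = pd.DataFrame({
    "month": ["2022-05", "2022-06", "2022-07", "2022-08"],
    "new_cases": [100, 400, 1200, 900],
})

# Absolute change compared with the previous month (the first month has no predecessor)
monthly["change"] = monthly["new_cases"].diff()

# Month with the largest increase over its predecessor
peak_rise = monthly.loc[monthly["change"].idxmax(), "month"]

print(monthly)
print("Largest month-over-month increase:", peak_rise)
```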


**Identification of High-risk Regions: Top Locations by Case Fatality Ratio**

Identify high-risk areas based on case fatality ratios.


In [None]:
# Calculating CFR
data_grouped_by_location_countries['CFR'] = data_grouped_by_location_countries['total_deaths'] / data_grouped_by_location_countries['total_cases'] * 100  # in percent

# Identify areas with high CFR
high_cfr_locations = data_grouped_by_location_countries.nlargest(10, 'CFR')

## **Data Visualization**

**Epidemiologic Trends: Investigating the Spread of Monkeypox**

In [None]:
# Visualize the trend of new cases and total cases over time
# Create two subplots (one for new_cases and one for total_cases)
fig, axes = plt.subplots(2, 1, figsize=(10, 6))

# new_cases trend visualization
sns.lineplot(x='year', y='new_cases', data=data_cases_yearly_sum, label='New Cases - Countries', color='blue', ax=axes[0], linewidth=1)
axes[0].set_title('Trends in New Cases')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Number of New Cases')

# Adding labels to data points for new_cases (yearly totals)
for year in sorted(data_cases_yearly_sum['year']):
    new_cases_value = data_cases_yearly_sum[data_cases_yearly_sum['year'] == year]['new_cases'].sum()  # Yearly total
    axes[0].text(year, new_cases_value, f'{new_cases_value:,.0f}', color='blue', ha='center', va='bottom', fontsize=9)

# total_cases trend visualization
sns.lineplot(x='year', y='total_cases', data=data_cases_yearly_sum, label='Total Cases - Countries', color='orange', ax=axes[1], linewidth=1)
axes[1].set_title('Trends in Total Cases')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Number of Total Cases')

# Add a label to the data point for total_cases (grand total for each year)
for year in sorted(data_cases_yearly_sum['year']):
    total_cases_value = data_cases_yearly_sum[data_cases_yearly_sum['year'] == year]['total_cases'].sum()  # Yearly total
    axes[1].text(year, total_cases_value, f'{total_cases_value:,.0f}', color='orange', ha='center', va='bottom', fontsize=9)

# Set one tick per year and rotate the x-axis labels
for ax in axes:
    ax.set_xticks(sorted(data_cases_yearly_sum['year'].unique()))
    ax.set_xticklabels(sorted(data_cases_yearly_sum['year'].unique()), rotation=45)

# Tighter layout settings
plt.tight_layout()

# Show graph
plt.show()

In [None]:
# Visualize the trend of new deaths and total deaths over time
# Create two subplots (one for new_deaths and one for total_deaths)
fig, axes = plt.subplots(2, 1, figsize=(10, 6))

# new_deaths trend visualization
sns.lineplot(x='year', y='new_deaths', data=data_deaths_yearly_sum, label='New Deaths', color='blue', ax=axes[0], linewidth=1)
axes[0].set_title('Trends in New Deaths')
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Number of New Deaths')

# Adding labels to data points for new_deaths (yearly totals)
for year in sorted(data_deaths_yearly_sum['year']):
    new_deaths_value = data_deaths_yearly_sum[data_deaths_yearly_sum['year'] == year]['new_deaths'].sum()  # Yearly total
    axes[0].text(year, new_deaths_value, f'{new_deaths_value:,.0f}', color='blue', ha='center', va='bottom', fontsize=9)

# total_deaths trend visualization
sns.lineplot(x='year', y='total_deaths', data=data_deaths_yearly_sum, label='Total Deaths', color='orange', ax=axes[1], linewidth=1)
axes[1].set_title('Trends in Total Deaths')
axes[1].set_xlabel('Year')
axes[1].set_ylabel('Number of Total Deaths')

# Adding labels to data points for total_deaths (grand total for each year)
for year in sorted(data_deaths_yearly_sum['year']):
    total_deaths_value = data_deaths_yearly_sum[data_deaths_yearly_sum['year'] == year]['total_deaths'].sum()  # Yearly total
    axes[1].text(year, total_deaths_value, f'{total_deaths_value:,.0f}', color='orange', ha='center', va='bottom', fontsize=9)

# Set one tick per year and rotate the x-axis labels
for ax in axes:
    ax.set_xticks(sorted(data_deaths_yearly_sum['year'].unique()))
    ax.set_xticklabels(sorted(data_deaths_yearly_sum['year'].unique()), rotation=45)

# Tighter layout settings
plt.tight_layout()

# Show graph
plt.show()

**Regional Comparisons: Country/Region Comparisons**

In [None]:
# Sort data by total_cases in descending order (total_deaths is carried along with each row)
data_sorted = data_grouped_by_location_countries.sort_values('total_cases', ascending=False)

# Display the table with the 'location', 'total_cases', and 'total_deaths' columns
pd.set_option('display.max_rows', None)  # Optional, if you want to display all rows
data_sorted[['location', 'total_cases', 'total_deaths']]

In [None]:
# Sort regional data by total_cases in descending order (total_deaths is carried along with each row)
data_sorted_by_region_cases = data_grouped_by_location_regions.sort_values('total_cases', ascending=False)

# Display a table containing location, total_cases, and total_deaths columns by region
pd.set_option('display.max_rows', None)  # Optional, if you want to display all rows
data_sorted_by_region_cases[['location', 'total_cases', 'total_deaths']]

**Demographic Trends: Impact by Region**

In [None]:
# Visualization: Countries with the Highest Total Cases
plt.figure(figsize=(10, 6))
sns.barplot(x='total_cases', y='location', data=top_countries)
plt.title('Top 10 Countries with the Highest Total Cases')
plt.xlabel('Total Cases')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

**Temporal Analysis: Spread Over Time**

In [None]:
# Visualization: New Case Development by Month
# Create a figure with two subplots
fig, axes = plt.subplots(2, 1, figsize=(10, 12))

# First plot: New Cases
sns.lineplot(x='month', y='new_cases', data=cases_per_month, label='New Cases', ax=axes[0], color='blue')
axes[0].set_title('Development of New Cases per Month')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Number of New Cases')
axes[0].tick_params(axis='x', rotation=45)

# Annotate each data point for New Cases
for i in range(len(cases_per_month)):
    axes[0].annotate(
        cases_per_month['new_cases'].iloc[i], 
        (cases_per_month['month'].iloc[i], cases_per_month['new_cases'].iloc[i]), 
        textcoords="offset points", xytext=(0,5), ha='center', fontsize=9, color='blue'
    )

# Second plot: New Deaths
sns.lineplot(x='month', y='new_deaths', data=cases_per_month, label='New Deaths', ax=axes[1], color='red')
axes[1].set_title('Development of New Deaths per Month')
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Number of New Deaths')
axes[1].tick_params(axis='x', rotation=45)

# Annotate each data point for New Deaths
for i in range(len(cases_per_month)):
    axes[1].annotate(
        cases_per_month['new_deaths'].iloc[i], 
        (cases_per_month['month'].iloc[i], cases_per_month['new_deaths'].iloc[i]), 
        textcoords="offset points", xytext=(0,5), ha='center', fontsize=9, color='red'
    )

# Adjust the layout for better spacing
plt.tight_layout()

# Show the plots
plt.show()

**Identification of High-risk Regions: Top Locations by Case Fatality Ratio**

In [None]:
# CFR visualization
plt.figure(figsize=(10, 6))
sns.barplot(x='CFR', y='location', data=high_cfr_locations, palette='rocket')
plt.title('Top Countries by Case Fatality Ratio (CFR)')
plt.xlabel('CFR (%)')
plt.ylabel('Location')
plt.show()

## **Export to File**

In [None]:
# Ask for the year range used in the output file name
start_year = input("Enter the start year for the output file name (example: 2022): ")
end_year = input("Enter the end year for the output file name (example: 2024): ")

In [None]:
# Path of the processed CSV file inside the 'data' folder
output_file_path = f'data/data_processed/monkeypox_{start_year}_to_{end_year}_processed.csv'

# Save the cleaned data as a CSV file; index=False leaves out the pandas row index
data_cleaned.to_csv(output_file_path, index=False)

print(f"The file has been saved to: {output_file_path}")