# Analysis Notebook

In this notebook, we:
1. Connect to MongoDB and read the four cleaned datasets.
2. Perform various analysis steps:
   - Summaries and data checks
   - Aggregations and group-bys for insights
   - Merging or joining data if beneficial for certain analyses
   - Computing correlations or other statistical measures

After this analysis, we will have a better understanding of the data and can proceed
to create visualizations in the next step.

## 1. Load Libraries and Set Environment Variables

In this cell, we import the Python libraries needed for the analysis (pandas, numpy, pymongo, python-dotenv) and load the MongoDB URI from the project's `.env` file.

In [1]:
import pandas as pd  # For data manipulation
import numpy as np  # For numeric operations
from pymongo import MongoClient  # For connecting to MongoDB
from dotenv import load_dotenv  # For loading environment variables
import os  # For environment

# ----------------------------------------------------
# Load environment variables for MongoDB URI
# ----------------------------------------------------

# Explicitly specify the .env file path
dotenv_path = os.path.abspath(os.path.join(os.getcwd(), "..", ".env"))
print("Looking for .env at:", dotenv_path)

# Load the .env file
load_dotenv(dotenv_path)

# Retrieve the MongoDB URI
MONGO_URI = os.getenv("MONGO_URI")

if not MONGO_URI:
    print("MONGO_URI is not set in the .env file.")
else:
    # Mask the password part for display purposes, if the URI format is typical (mongodb+srv://user:pass@...)
    try:
        # Attempt to split out the password portion
        # e.g. "mongodb+srv://username:password@..."
        password_part = MONGO_URI.split(":")[2].split("@")[0]
        masked_uri = MONGO_URI.replace(password_part, "*****")
        print(f"MONGO_URI loaded successfully: {masked_uri}")
    except IndexError:
        # If splitting fails, do not print the URI at all: it may contain credentials
        print("MONGO_URI loaded successfully (could not mask password, so the URI is not shown).")


Looking for .env at: /Users/dr.sam/Desktop/CodeGenesis-TEAM/.env
MONGO_URI loaded successfully: mongodb+srv://koyluoglucem:*****@codegenesis.dupu0.mongodb.net/
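
The masking logic above splits the URI on `:`, which can misfire when the URI carries no password or includes a port. A minimal alternative sketch using only the standard library's `urllib.parse` is shown here. The helper name `mask_mongo_uri` and the example URI are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_mongo_uri(uri: str, mask: str = "*****") -> str:
    """Return the URI with its password replaced by a mask (hypothetical helper)."""
    parts = urlsplit(uri)
    # netloc looks like "user:password@host[:port]"; split off the credentials
    userinfo, at, hostinfo = parts.netloc.rpartition("@")
    if not at or ":" not in userinfo:
        return uri  # no password present, nothing to hide
    username = userinfo.split(":", 1)[0]
    masked_netloc = f"{username}:{mask}@{hostinfo}"
    return urlunsplit(
        (parts.scheme, masked_netloc, parts.path, parts.query, parts.fragment)
    )


# Made-up URI for illustration only
print(mask_mongo_uri("mongodb+srv://alice:s3cret@cluster.example.net/"))
```

Because the credentials are taken from the parsed network location rather than located by string search, a password that also appears elsewhere in the URI is not replaced twice.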


## 2. Connect to MongoDB and Load Data

Here we connect to MongoDB using the provided URI and load the four cleaned datasets into Pandas DataFrames.

- **covid_vacc_death_rate**: COVID-19 vaccination and death rate data
- **covid_vacc_manufacturer**: Vaccination data broken down by manufacturer
- **oecd_health_expenditure**: OECD health expenditure data
- **us_death_rates**: US death rates by age group and vaccination status

In [2]:
from pymongo import MongoClient
import pandas as pd
from dotenv import load_dotenv
import os

# 1. Load environment variables
load_dotenv()

# 2. Load MongoDB URI from .env file
MONGO_URI = os.getenv("MONGO_URI")
DATABASE_NAME = os.getenv("DATABASE_NAME", "my_database")  # Use environment variable or default

# 3. Validate MongoDB URI
if not MONGO_URI:
    raise ValueError("MONGO_URI is not set in the .env file.")

# 4. Connect to MongoDB
try:
    client = MongoClient(MONGO_URI)
    client.server_info()  # Test the connection
    print("Successfully connected to MongoDB.")
except Exception as e:
    print(f"Failed to connect to MongoDB: {e}")
    raise

# 5. Access the database
db = client[DATABASE_NAME]


# 6. Function to load data from MongoDB into Pandas DataFrame
def mongo_to_df(collection_name):
    try:
        # Fetch all documents from the collection
        collection = db[collection_name]
        data = list(collection.find({}, {"_id": 0}))  # Exclude _id for clarity
        if not data:
            print(f"Warning: The collection '{collection_name}' is empty.")
            return pd.DataFrame()
        df = pd.DataFrame(data)
        print(f"Loaded {len(df)} records from collection '{collection_name}'")
        return df
    except Exception as e:
        print(f"Error loading collection '{collection_name}': {e}")
        return pd.DataFrame()


# 7. Define the collection names
collections = {
    "covid_vacc_death_rate": "covid_vacc_death_rate",
    "covid_vacc_manufacturer": "covid_vacc_manufacturer",
    "oecd_health_expenditure": "oecd_health_expenditure",
    "us_death_rates": "us_death_rates"
}

# 8. Load the data into DataFrames
dataframes = {}
for key, collection_name in collections.items():
    print(f"Loading data from collection: {collection_name}")
    dataframes[key] = mongo_to_df(collection_name)

# 9. Print summary of each DataFrame
for name, df in dataframes.items():
    print(f"\nSummary of DataFrame '{name}':")
    if df.empty:
        print(f"The DataFrame '{name}' is empty.")
    else:
        print(f"First few rows of {name}:")
        print(df.head())
        print(f"Shape: {df.shape}")

Successfully connected to MongoDB.
Loading data from collection: covid_vacc_death_rate
Loaded 447729 records from collection 'covid_vacc_death_rate'
Loading data from collection: covid_vacc_manufacturer
Loaded 59224 records from collection 'covid_vacc_manufacturer'
Loading data from collection: oecd_health_expenditure
Loaded 439 records from collection 'oecd_health_expenditure'
Loading data from collection: us_death_rates
Loaded 650 records from collection 'us_death_rates'

Summary of DataFrame 'covid_vacc_death_rate':
First few rows of covid_vacc_death_rate:
        Entity Code  year         Day  \
0  Afghanistan  AFG  2020  2020-01-09   
1  Afghanistan  AFG  2020  2020-01-10   
2  Afghanistan  AFG  2020  2020-01-11   
3  Afghanistan  AFG  2020  2020-01-12   
4  Afghanistan  AFG  2020  2020-01-13   

   Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)  \
0                                                0.0                            

## 3. Initial Exploratory Data Analysis (EDA)

We examine the first few rows (`head`), data types (`info`), and basic statistics (`describe`) of each DataFrame to understand the data structure and contents.

In [3]:
# ----------------------------------------------------
# Initial Exploration
# ----------------------------------------------------
def explore_dataframe(df, name):
    print(f"\n### {name} ###")
    print(f"\nFirst few rows of {name}:")
    print(df.head())
    print(f"\nDataFrame info for {name}:")
    print(df.info())
    print(f"\nBasic statistics for {name}:")
    print(df.describe(include='all'))
    print("\n" + "=" * 50 + "\n")


# Explore each DataFrame
for name, df in dataframes.items():
    if not df.empty:
        explore_dataframe(df, name)
    else:
        print(f"\n### {name} ###")
        print(f"The DataFrame '{name}' is empty.")
        print("\n" + "=" * 50 + "\n")


### covid_vacc_death_rate ###

First few rows of covid_vacc_death_rate:
        Entity Code  year         Day  \
0  Afghanistan  AFG  2020  2020-01-09   
1  Afghanistan  AFG  2020  2020-01-10   
2  Afghanistan  AFG  2020  2020-01-11   
3  Afghanistan  AFG  2020  2020-01-12   
4  Afghanistan  AFG  2020  2020-01-13   

   Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)  \
0                                                0.0                                                      
1                                                0.0                                                      
2                                                0.0                                                      
3                                                0.0                                                      
4                                                0.0                                                      

   COVID-19 doses (cumulative, per hu

## 4. Data Cleaning, Type Conversions, and Checking for Missing Values

In this step:
- Convert date columns to datetime format (if needed).
- Convert numeric columns to appropriate types.
- Check for null values and consider potential cleaning strategies.
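
The null-value check in the last step can be sketched as a small per-column report. This is an illustrative helper, not code from the original notebook: the name `missing_value_report` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    missing = df.isna().sum()
    report = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Columns with the most missing values come first
    return report.sort_values("missing", ascending=False)


# Toy data for illustration only; not taken from the real collections
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "D"],
    "Code": ["AAA", None, "CCC", None],
    "doses": [1.0, np.nan, 3.0, 4.0],
})
print(missing_value_report(toy))
```

Running such a report before and after cleaning makes it easier to judge whether filling with a constant such as 0 or "Unknown" is appropriate for a given column.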

In [17]:
#####
# Dataset: covid_vacc_death_rate (MongoDB)
#
# Objective:
# - Load the dataset from MongoDB using the `mongo_to_df` function.
# - Perform data cleaning and transformation:
#     - Replace missing values in `Code` column with "Unknown".
#     - Replace missing values in `COVID-19 doses (cumulative, per hundred)` column with 0.
#     - Convert date columns to datetime format.
#     - Validate data types and consistency.
#     - Save the cleaned dataset for further analysis.
#####

import pandas as pd
import os
import logging
from pymongo import MongoClient
from dotenv import load_dotenv

# Logging setup
logging.basicConfig(
    filename="covid_vacc_death_rate_cleaning.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load environment variables
load_dotenv()
MONGO_URI = os.getenv("MONGO_URI")
DATABASE_NAME = os.getenv("DATABASE_NAME", "my_database")  # Default database name if not specified
output_dir = "/Users/dr.sam/Desktop/CodeGenesis-TEAM/data/processed"

# Connect to MongoDB
try:
    client = MongoClient(MONGO_URI)
    db = client[DATABASE_NAME]
    print("Successfully connected to MongoDB.")
except Exception as e:
    logging.error(f"Error connecting to MongoDB: {e}")
    raise


# Function to load MongoDB collection into Pandas DataFrame
def mongo_to_df(collection_name):
    try:
        collection = db[collection_name]
        data = list(collection.find({}, {"_id": 0}))  # Exclude _id field
        if not data:
            print(f"Warning: The collection '{collection_name}' is empty.")
            return pd.DataFrame()
        df = pd.DataFrame(data)
        print(f"Loaded {len(df)} records from collection '{collection_name}'")
        return df
    except Exception as e:
        logging.error(f"Error loading collection '{collection_name}': {e}")
        return pd.DataFrame()


# Load the data
collection_name = "covid_vacc_death_rate"
df = mongo_to_df(collection_name)

# Verify data load
if df.empty:
    raise ValueError("The loaded DataFrame is empty. Please check the MongoDB collection.")

# Data cleaning and transformation
try:
    # Convert 'Day' to datetime
    if 'Day' in df.columns:
        df['Day'] = pd.to_datetime(df['Day'], errors='coerce')
        logging.info("Converted 'Day' column to datetime format.")

    # Replace missing values in 'Code' with 'Unknown'
    if 'Code' in df.columns:
        df['Code'] = df['Code'].fillna("Unknown")
        logging.info("Replaced missing values in 'Code' column with 'Unknown'.")

    # Replace missing values in 'COVID-19 doses (cumulative, per hundred)' with 0
    covid_doses_col = 'COVID-19 doses (cumulative, per hundred)'
    if covid_doses_col in df.columns:
        df[covid_doses_col] = df[covid_doses_col].fillna(0)
        logging.info(f"Replaced missing values in '{covid_doses_col}' column with 0.")

    # Ensure 'Daily new confirmed deaths...' remains numeric
    deaths_col = 'Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)'
    if deaths_col in df.columns:
        df[deaths_col] = pd.to_numeric(df[deaths_col], errors='coerce')
        logging.info(f"Ensured '{deaths_col}' remains numeric.")

    print("Data cleaning and transformation completed successfully.")
except Exception as e:
    logging.error(f"Error during data cleaning and transformation: {e}")
    raise

# Final dataset summary
print("\nSummary of the cleaned dataset:")
print(df.info())

# Save the cleaned dataset
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, "covid_vacc_death_rate_cleaned.csv")
try:
    df.to_csv(output_path, index=False)
    logging.info(f"Cleaned dataset saved to: {output_path}")
    print(f"Cleaned dataset saved to: {output_path}")
except Exception as e:
    logging.error(f"Error saving cleaned dataset: {e}")
    raise

Successfully connected to MongoDB.
Loaded 447729 records from collection 'covid_vacc_death_rate'
Data cleaning and transformation completed successfully.

Summary of the cleaned dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 447729 entries, 0 to 447728
Data columns (total 7 columns):
 #   Column                                                                                                Non-Null Count   Dtype         
---  ------                                                                                                --------------   -----         
 0   Entity                                                                                                447729 non-null  object        
 1   Code                                                                                                  447729 non-null  object        
 2   year                                                                                                  447729 non-null  int64         
 3   Day

In [19]:
#####
# Dataset: covid_vacc_manufacturer (MongoDB)
#
# Objective:
# - Load the dataset from MongoDB using the `mongo_to_df` function.
# - Perform data cleaning and transformation:
#     - Convert the `Day` column to datetime format.
#     - Replace missing values in `Code` column with "Unknown".
#     - Ensure all manufacturer columns are numeric.
#     - Save the cleaned dataset for further analysis.
#####

import pandas as pd
import os
import logging

# Logging setup
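logging.basicConfig(
    filename="covid_vacc_manufacturer_cleaning.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    force=True  # Without this, a repeated basicConfig() call in the same session is ignored (Python 3.8+)
)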

# Directory to save the processed data
output_dir = "/Users/dr.sam/Desktop/CodeGenesis-TEAM/data/processed"
os.makedirs(output_dir, exist_ok=True)

# Load dataset from MongoDB
try:
    logging.info("Loading dataset: covid_vacc_manufacturer from MongoDB")
    df = mongo_to_df("covid_vacc_manufacturer")  # MongoDB data-fetching function
    if df.empty:
        raise ValueError("Dataset is empty. Please check the MongoDB collection.")
    print(f"Dataset loaded successfully. Shape: {df.shape}")
except Exception as e:
    logging.error(f"Error loading dataset: {e}")
    raise

# Display first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Data Cleaning and Transformation
try:
    # Step 1: Convert date columns to datetime format
    logging.info("Converting date columns to datetime format.")
    if 'Day' in df.columns:
        df['Day'] = pd.to_datetime(df['Day'], errors='coerce')
        logging.info("Converted 'Day' column to datetime format.")

    # Step 2: Handle missing values
    logging.info("Handling missing values.")

    # Replace missing values in 'Code' column with 'Unknown'
    if 'Code' in df.columns:
        df['Code'] = df['Code'].fillna("Unknown")
        logging.info("Replaced missing values in 'Code' column with 'Unknown'.")

    # Ensure all manufacturer columns are numeric
    manufacturer_columns = [col for col in df.columns if "Manufacturer" in col]
    for col in manufacturer_columns:
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
        logging.info(f"Ensured column '{col}' is numeric and filled NaN with 0.")

    print("\nData cleaning and transformation completed successfully.")

except Exception as e:
    logging.error(f"Error during data cleaning and transformation: {e}")
    raise

# Final dataset summary
print("\nSummary of the cleaned dataset:")
print(df.info())

# Save the cleaned dataset for further analysis
output_path = os.path.join(output_dir, "covid_vacc_manufacturer_cleaned.csv")
try:
    df.to_csv(output_path, index=False)
    logging.info(f"Cleaned dataset saved to: {output_path}")
    print(f"Cleaned dataset saved to: {output_path}")
except Exception as e:
    logging.error(f"Error saving cleaned dataset: {e}")
    raise

Loaded 59224 records from collection 'covid_vacc_manufacturer'
Dataset loaded successfully. Shape: (59224, 17)
First few rows of the dataset:
      Entity Code         Day  \
0  Argentina  ARG  2020-12-04   
1  Argentina  ARG  2020-12-05   
2  Argentina  ARG  2020-12-06   
3  Argentina  ARG  2020-12-07   
4  Argentina  ARG  2020-12-08   

   COVID-19 doses (cumulative) - Manufacturer Pfizer/BioNTech  \
0                                                  1            
1                                                  1            
2                                                  1            
3                                                  1            
4                                                  1            

   COVID-19 doses (cumulative) - Manufacturer Moderna  \
0                                                  1    
1                                                  1    
2                                                  1    
3                                       

In [21]:
#####
# Dataset: us_death_rates (MongoDB)
#
# Objective:
# - Load the dataset from MongoDB using the `mongo_to_df` function.
# - Perform data cleaning and transformation:
#     - Convert the `Day` column to datetime format.
#     - Replace missing values in `Code` column with "Unknown".
#     - Ensure numeric columns are properly formatted.
#     - Save the cleaned dataset for further analysis.
#####

import pandas as pd
import os
import logging

# Logging setup
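logging.basicConfig(
    filename="us_death_rates_cleaning.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    force=True  # Without this, a repeated basicConfig() call in the same session is ignored (Python 3.8+)
)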

# Directory to save the processed data
output_dir = "/Users/dr.sam/Desktop/CodeGenesis-TEAM/data/processed"
os.makedirs(output_dir, exist_ok=True)

# Load dataset from MongoDB
try:
    logging.info("Loading dataset: us_death_rates from MongoDB")
    df = mongo_to_df("us_death_rates")  # MongoDB data-fetching function
    if df.empty:
        raise ValueError("Dataset is empty. Please check the MongoDB collection.")
    print(f"Dataset loaded successfully. Shape: {df.shape}")
except Exception as e:
    logging.error(f"Error loading dataset: {e}")
    raise

# Display first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Data Cleaning and Transformation
try:
    # Step 1: Convert date columns to datetime format
    logging.info("Converting date columns to datetime format.")
    if 'Day' in df.columns:
        df['Day'] = pd.to_datetime(df['Day'], errors='coerce')
        logging.info("Converted 'Day' column to datetime format.")

    # Step 2: Handle missing values
    logging.info("Handling missing values.")

    # Replace missing values in 'Code' column with 'Unknown'
    if 'Code' in df.columns:
        df['Code'] = df['Code'].fillna("Unknown")
        logging.info("Replaced missing values in 'Code' column with 'Unknown'.")

    # Ensure numeric columns are properly formatted
    numeric_columns = [col for col in df.columns if df[col].dtype in ['float64', 'int64']]
    for col in numeric_columns:
        df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
        logging.info(f"Ensured column '{col}' is numeric and filled NaN with 0.")

    print("\nData cleaning and transformation completed successfully.")

except Exception as e:
    logging.error(f"Error during data cleaning and transformation: {e}")
    raise

# Final dataset summary
print("\nSummary of the cleaned dataset:")
print(df.info())

# Save the cleaned dataset for further analysis
output_path = os.path.join(output_dir, "us_death_rates_cleaned.csv")
try:
    df.to_csv(output_path, index=False)
    logging.info(f"Cleaned dataset saved to: {output_path}")
    print(f"Cleaned dataset saved to: {output_path}")
except Exception as e:
    logging.error(f"Error saving cleaned dataset: {e}")
    raise

Loaded 650 records from collection 'us_death_rates'
Dataset loaded successfully. Shape: (650, 6)
First few rows of the dataset:
  Entity  Code         Day  \
0  0.5-4   NaN  2022-08-06   
1  0.5-4   NaN  2022-08-13   
2  0.5-4   NaN  2022-08-20   
3  0.5-4   NaN  2022-08-27   
4  0.5-4   NaN  2022-09-03   

   Death rate (weekly) of unvaccinated people - United States, by age  \
0                                           0.096528                    
1                                           0.019468                    
2                                           0.000000                    
3                                           0.079043                    
4                                           0.039777                    

   Death rate (weekly) of fully vaccinated people (without bivalent booster) - United States, by age  \
0                                                0.0                                                   
1                                           
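
The three cleaning cells above repeat the same steps: parse `Day`, fill missing `Code` values, and coerce numeric columns. One way to share that logic is a single function, sketched below. The name `clean_frame`, its parameters, and the toy data are assumptions made for illustration; the notebook itself keeps the steps inline.

```python
import pandas as pd


def clean_frame(df, date_col="Day", code_col="Code", numeric_cols=(), fill_value=0):
    """Apply the shared cleaning steps to a copy of df (hypothetical helper)."""
    out = df.copy()  # leave the caller's DataFrame untouched
    if date_col in out.columns:
        # Unparseable dates become NaT instead of raising
        out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    if code_col in out.columns:
        out[code_col] = out[code_col].fillna("Unknown")
    for col in numeric_cols:
        if col in out.columns:
            # Non-numeric entries become NaN, then are filled
            out[col] = pd.to_numeric(out[col], errors="coerce").fillna(fill_value)
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "Entity": ["0.5-4", "0.5-4"],
    "Code": [None, None],
    "Day": ["2022-08-06", "not a date"],
    "rate": ["0.1", None],
})
cleaned = clean_frame(toy, numeric_cols=["rate"])
print(cleaned.dtypes)
```

Working on a copy keeps the raw frame available for comparison, which helps when checking how many values the coercion steps changed.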

In [22]:
#####
# Project: COVID-19 Data Cleaning and Upload
#
# Objective:
# - Load cleaned datasets from local CSV files.
# - Save cleaned datasets back to MongoDB with 'cleaned_' prefix.
# - Log actions, successes, and errors in JSON format for tracking.
#####

import os
import json
from dotenv import load_dotenv
from pymongo import MongoClient
import pandas as pd
from datetime import datetime

# Load environment variables
load_dotenv()
MONGO_URI = os.getenv("MONGO_URI")
DATABASE_NAME = os.getenv("DATABASE_NAME")
CLEANED_DATA_DIR = os.getenv("OUTPUT_DIR", "processed")  # Directory where cleaned data is stored
# Assuming the notebook is run from the 'reports' directory.
# `__file__` is not defined inside a notebook, so use the working directory instead.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
LOG_DIR = os.path.join(PROJECT_ROOT, "reports")
os.makedirs(LOG_DIR, exist_ok=True)
LOG_FILE_PATH = os.path.join(LOG_DIR, "mongodb_upload_log.json")

# Initialize log storage
log_entries = []

# Connect to MongoDB
try:
    client = MongoClient(MONGO_URI)
    db = client[DATABASE_NAME]
    log_entries.append({
        "timestamp": datetime.now().isoformat(),
        "level": "INFO",
        "message": "Successfully connected to MongoDB."
    })
except Exception as e:
    log_entries.append({
        "timestamp": datetime.now().isoformat(),
        "level": "ERROR",
        "message": f"Error connecting to MongoDB: {str(e)}"
    })
    # Save log and exit
    with open(LOG_FILE_PATH, "w") as log_file:
        json.dump(log_entries, log_file, indent=4)
    raise


# Function to save DataFrame to MongoDB
def save_df_to_mongodb(df, collection_name):
    """
    Save a Pandas DataFrame to MongoDB.
    """
    try:
        db[collection_name].insert_many(df.to_dict('records'))
        log_entries.append({
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": f"Successfully saved {len(df)} records to collection '{collection_name}'."
        })
    except Exception as e:
        log_entries.append({
            "timestamp": datetime.now().isoformat(),
            "level": "ERROR",
            "message": f"Error saving to collection '{collection_name}': {str(e)}"
        })
        raise


# Process and upload cleaned datasets
cleaned_files = {
    "covid_vacc_death_rate_cleaned.csv": "cleaned_covid_vacc_death_rate",
    "covid_vacc_manufacturer_cleaned.csv": "cleaned_covid_vacc_manufacturer",
    "us_death_rates_cleaned.csv": "cleaned_us_death_rates"
}

for file_name, collection_name in cleaned_files.items():
    try:
        file_path = os.path.join(CLEANED_DATA_DIR, file_name)
        if not os.path.exists(file_path):
            log_entries.append({
                "timestamp": datetime.now().isoformat(),
                "level": "WARNING",
                "message": f"File '{file_name}' not found. Skipping."
            })
            continue

        # Load cleaned dataset
        df = pd.read_csv(file_path)
        log_entries.append({
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": f"Loaded cleaned dataset from file '{file_name}' with shape {df.shape}."
        })

        # Save to MongoDB
        save_df_to_mongodb(df, collection_name)

    except Exception as e:
        log_entries.append({
            "timestamp": datetime.now().isoformat(),
            "level": "ERROR",
            "message": f"Error processing file '{file_name}': {str(e)}"
        })

# Save log file
with open(LOG_FILE_PATH, "w") as log_file:
    json.dump(log_entries, log_file, indent=4)

print(f"Cleaned datasets have been uploaded to MongoDB. Logs saved to {LOG_FILE_PATH}.")

Cleaned datasets have been uploaded to MongoDB. Logs saved to reports/logs/mongodb_upload_log.json.
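
`pd.read_csv` turns empty cells into `NaN`, and `df.to_dict('records')` hands those values to `insert_many` unchanged. If null values are preferred in MongoDB (an assumption about the desired schema, not something the notebook states), the records could be converted first. The helper name `df_to_mongo_records` is hypothetical, and the sketch assumes every cell holds a scalar.

```python
import pandas as pd


def df_to_mongo_records(df: pd.DataFrame) -> list:
    """Convert a DataFrame to a list of dicts, mapping missing values to None."""
    records = df.to_dict("records")
    return [
        {key: (None if pd.isna(value) else value) for key, value in record.items()}
        for record in records
    ]


# Toy data for illustration only
toy = pd.DataFrame({"Entity": ["A", "B"], "rate": [0.5, float("nan")]})
records = df_to_mongo_records(toy)
print(records)
```

The resulting list could be passed to `insert_many` in place of `df.to_dict('records')`.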


## 5. Basic Statistical Analyses

Perform some example analyses:
- Identify entities with the highest vaccination rates (df_vdr)
- Check correlation between vaccination and death rates (df_vdr)
- Summarize total doses by manufacturer (df_vm)
- Compute average health expenditures by reference area (df_oecd)
- Examine death rates over time for a specific age group (e.g., 80+) in the US (df_us)

In [30]:
# === Data Retrieval ===
def get_dataframe_from_mongodb(collection_name):
    # Exclude MongoDB's internal _id field, as mongo_to_df does
    return pd.DataFrame(list(db[collection_name].find({}, {"_id": 0})))

df_vdr = get_dataframe_from_mongodb("cleaned_covid_vacc_death_rate")
df_vm = get_dataframe_from_mongodb("cleaned_covid_vacc_manufacturer")
df_us = get_dataframe_from_mongodb("cleaned_us_death_rates")

# === Analysis 1: Identify entities with the highest vaccination rates ===
def highest_vaccination_rates(df, n=10):
    latest_data = df.sort_values('Day').groupby('Entity').last()
    top_vaccinated = latest_data.nlargest(n, 'COVID-19 doses (cumulative, per hundred)')
    return top_vaccinated[['COVID-19 doses (cumulative, per hundred)']]

top_vaccinated = highest_vaccination_rates(df_vdr)
print("Top 10 entities with highest vaccination rates:")
print(top_vaccinated)

# === Analysis 2: Check correlation between vaccination and death rates ===
def vaccination_death_correlation(df):
    latest_data = df.sort_values('Day').groupby('Entity').last()
    return latest_data['COVID-19 doses (cumulative, per hundred)'].corr(
        latest_data['Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)']
    )

correlation = vaccination_death_correlation(df_vdr)
print(f"\nCorrelation between vaccination and death rates: {correlation:.4f}")

# === Analysis 3: Summarize total doses by manufacturer ===
def total_doses_by_manufacturer(df):
    manufacturer_columns = [col for col in df.columns if col.startswith('COVID-19 doses (cumulative) - Manufacturer')]
    return df[manufacturer_columns].sum().sort_values(ascending=False)

total_doses = total_doses_by_manufacturer(df_vm)
print("\nTotal doses by manufacturer:")
print(total_doses)

# === Analysis 4: Examine death rates over time for a specific age group (e.g., 80+) in the US ===
def death_rates_over_time(df, age_group='80+'):
    df['Day'] = pd.to_datetime(df['Day'])
    age_group_data = df[df['Entity'] == age_group].set_index('Day')
    return age_group_data[['Death rate (weekly) of unvaccinated people - United States, by age',
                           'Death rate (weekly) of fully vaccinated people (without bivalent booster) - United States, by age',
                           'Death rate (weekly) of fully vaccinated people (with bivalent booster) - United States, by age']]

death_rates_80plus = death_rates_over_time(df_us)
print("\nDeath rates over time for 80+ age group:")
print(death_rates_80plus.head())

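# === Visualization ===
import matplotlib.pyplot as plt  # Not imported earlier in the notebook

# Create the axes explicitly and pass them to DataFrame.plot(). Calling plt.figure()
# first leaves that figure empty, because DataFrame.plot() opens a new one.
fig, ax = plt.subplots(figsize=(12, 6))
death_rates_80plus.plot(ax=ax)
ax.set_title('Death Rates Over Time for 80+ Age Group in the US')
ax.set_xlabel('Date')
ax.set_ylabel('Death Rate (weekly)')
ax.legend(labels=['Unvaccinated', 'Fully Vaccinated (without booster)', 'Fully Vaccinated (with booster)'], loc='upper right')
fig.tight_layout()
fig.savefig('death_rates_80plus.png')
plt.close(fig)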

logging.info("Basic statistical analyses completed successfully.")

Top 10 entities with highest vaccination rates:
             COVID-19 doses (cumulative, per hundred)
Entity                                               
Taiwan                                      291.02765
Hong Kong                                   281.44434
Macao                                       255.97530
Spain                                       249.28398
France                                      238.48257
Germany                                     229.81345
Afghanistan                                   0.00000
Africa                                        0.00000
Albania                                       0.00000
Algeria                                       0.00000

Correlation between vaccination and death rates: -0.0203

Total doses by manufacturer:
COVID-19 doses (cumulative) - Manufacturer Pfizer/BioNTech       2422723541541
COVID-19 doses (cumulative) - Manufacturer Moderna                755122752063
COVID-19 doses (cumulative) - Manufacturer Oxford/AstraZen

<Figure size 1200x600 with 0 Axes>
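
The top-10 table above includes entities at 0.0. A likely cause is that missing dose values were filled with 0 during cleaning, so `.last()` picks up a filled row whenever an entity's most recent dates have no reported value. One possible workaround, sketched under the assumption that the cumulative series never decreases, is to take each entity's maximum instead of its last row. The function name and toy data are illustrative only.

```python
import pandas as pd

DOSES_COL = "COVID-19 doses (cumulative, per hundred)"


def highest_reported_vaccination(df, n=10, col=DOSES_COL):
    """Top-n entities by their highest reported cumulative doses (hypothetical helper)."""
    # For a non-decreasing cumulative series, the maximum equals the latest reported value
    return df.groupby("Entity")[col].max().nlargest(n)


# Toy data: entity X has a trailing 0 that stands in for a filled missing value
toy = pd.DataFrame({
    "Entity": ["X", "X", "X", "Y", "Y"],
    "Day": pd.to_datetime(
        ["2021-01-01", "2021-06-01", "2022-01-01", "2021-01-01", "2021-06-01"]
    ),
    DOSES_COL: [10.0, 150.0, 0.0, 5.0, 20.0],
})
print(highest_reported_vaccination(toy, n=2))
```

The same filled zeros also enter the correlation in Analysis 2, so that figure may change if the latest reported values are used instead.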

## 6. (Optional) Data Merging

If needed, the datasets can be merged for more comprehensive analyses. The code below is only an example and is commented out, because a merge is meaningful only when the key columns of the two datasets actually match.

In [None]:
# ----------------------------------------------------
# Potential Data Integration for Analysis
# ----------------------------------------------------
# If needed, you can attempt to integrate datasets. For example, if you have a common country code:
# This depends heavily on whether keys match. If not, skip this step.

# Example: If 'Entity' in df_vdr and 'reference_area' in df_oecd refer to the same countries (just an example),
# you could merge on Entity/year against reference_area/time_period.
# Note: An inner merge silently drops rows whose keys don't align.

# Only do this if it makes sense for your analysis:
# df_oecd = dataframes["oecd_health_expenditure"]
# if 'time_period' in df_oecd.columns:
#     # df_vdr already has a 'year' column, so it does not need to be derived from 'Day'
#     merged_analysis = pd.merge(df_vdr, df_oecd,
#                                left_on=['Entity', 'year'],
#                                right_on=['reference_area', 'time_period'],
#                                how='inner')
#     print("\nMerged dataset shape:", merged_analysis.shape)
#     # From here, you could check correlation between health expenditure and vaccination/death rates.
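
Before committing to a merge, it can help to see how many keys actually match. The sketch below uses a left merge with `indicator=True` on toy frames. The OECD column names (`reference_area`, `time_period`, `expenditure`) and all values are assumptions made up for illustration.

```python
import pandas as pd

# Toy frames for illustration only; values are invented
vdr = pd.DataFrame({
    "Entity": ["France", "Spain", "Taiwan"],
    "year": [2021, 2021, 2021],
    "doses": [120.0, 130.0, 110.0],
})
oecd = pd.DataFrame({
    "reference_area": ["France", "Spain"],
    "time_period": [2021, 2021],
    "expenditure": [12.3, 10.7],
})

# how="left" keeps every row of vdr; the _merge column records whether a match was found
merged = pd.merge(
    vdr,
    oecd,
    left_on=["Entity", "year"],
    right_on=["reference_area", "time_period"],
    how="left",
    indicator=True,
)
print(merged["_merge"].value_counts())

unmatched = merged.loc[merged["_merge"] == "left_only", "Entity"].tolist()
print("Entities without an OECD match:", unmatched)
```

If the unmatched list is long, the key columns probably need harmonizing (for example, country names versus codes) before an inner merge is useful.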


## 7. Correlation Matrices and Additional Statistical Measures

Display correlation matrices for numeric columns in each DataFrame to understand relationships between variables.

In [None]:
# ----------------------------------------------------
# Correlations and Statistical Measures
# ----------------------------------------------------
# Depending on your numeric columns, try a correlation matrix:
print("\nCorrelation matrix for df_vdr numeric columns:")
print(df_vdr.select_dtypes(include=[np.number]).corr())

# Similar for df_vm, df_oecd, df_us:
print("\nCorrelation matrix for df_vm numeric columns:")
print(df_vm.select_dtypes(include=[np.number]).corr())

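# df_oecd was not created above; take it from the frames loaded in Section 2
df_oecd = dataframes["oecd_health_expenditure"]
print("\nCorrelation matrix for df_oecd numeric columns:")
print(df_oecd.select_dtypes(include=[np.number]).corr())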

print("\nCorrelation matrix for df_us numeric columns:")
print(df_us.select_dtypes(include=[np.number]).corr())
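
Full correlation matrices are hard to read when column names are long. As a possible complement, the sketch below lists the column pairs with the largest absolute Pearson correlation. The helper name `strongest_correlations` and the toy data are assumptions, not part of the original notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def strongest_correlations(df, top=5):
    """Return (col_a, col_b, r) tuples sorted by |r|, largest first (hypothetical helper)."""
    corr = df.select_dtypes(include=[np.number]).corr()
    # Visit each unordered pair of numeric columns exactly once
    pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
    # Drop pairs whose correlation is undefined (e.g. a constant column)
    pairs = [p for p in pairs if not np.isnan(p[2])]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:top]


# Toy data for illustration only; the non-numeric 'label' column is ignored
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
    "label": list("abcde"),
})
for a, b, r in strongest_correlations(toy):
    print(f"{a} vs {b}: {r:+.3f}")
```

A call such as `strongest_correlations(df_vdr)` would apply the same idea to the real frames.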

## 8. Summary and Next Steps

At this point, we have:
- Loaded and inspected the data.
- Performed basic cleaning and type conversions.
- Conducted initial statistical analyses and correlations.

Next steps could involve:
- Advanced analytics or modeling.
- Data visualization.
- Further data enrichment or merging with external data sources.

In [None]:
# ----------------------------------------------------
# Summary and Next Steps
# ----------------------------------------------------
# By now we have:
# - Basic stats and shapes
# - Grouped averages and correlations
# - Checked if merging is feasible
#
# Next steps:
# 1. Identify interesting comparisons or patterns to visualize.
# 2. Prepare subsets of data for plotting in Visualization.ipynb.

print("\nAnalysis complete. You can now proceed to Visualization steps.")