# Analysis Notebook

In this notebook, we:
1. Connect to MongoDB and read the four cleaned datasets.
2. Perform various analysis steps:
   - Summaries and data checks
   - Aggregations and group-bys for insights
   - Merging or joining data if beneficial for certain analyses
   - Computing correlations or other statistical measures

After this analysis, we will have a better understanding of the data and can proceed
to create visualizations in the next step.

## 1. Load Libraries and Set Environment Variables

In this cell, we import the Python libraries needed for the analysis (pandas, numpy, pymongo, python-dotenv) and load the MongoDB URI from the environment variables in the `.env` file.

In [7]:
import pandas as pd # For data manipulation
import numpy as np # For numeric operations
from pymongo import MongoClient # For connecting to MongoDB
from dotenv import load_dotenv # For loading environment variables
import os # For environment variables and file paths


# ----------------------------------------------------
# Load environment variables for MongoDB URI
# ----------------------------------------------------

# Explicitly specify the .env file path
dotenv_path = os.path.join(os.getcwd(), ".env")
load_dotenv(dotenv_path)

MONGO_URI = os.getenv("MONGO_URI")

import re # For masking the password in the printed URI

if not MONGO_URI:
    print("MONGO_URI is not set in the .env file.")
else:
    # Mask the password for display purposes; a URI without credentials is left unchanged
    masked_uri = re.sub(r"(://[^:/@]+:)[^@]*@", r"\1*****@", MONGO_URI)
    print(f"MONGO_URI loaded successfully: {masked_uri}")
# The goal here is to confirm that MONGO_URI was read from the right place, so that configuration errors surface early.



MONGO_URI loaded successfully: mongodb+srv://koyluoglucem:*****@codegenesis.dupu0.mongodb.net/


## 2. Connect to MongoDB and Load Data

Here we connect to MongoDB using the provided URI and load the four cleaned datasets into Pandas DataFrames.

- **covid_vacc_death_rate**: COVID-19 vaccination and death rate data
- **covid_vacc_manufacturer**: Vaccination data broken down by manufacturer
- **oecd_health_expenditure**: OECD health expenditure data
- **us_death_rates**: US death rates by age group and vaccination status
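
To make the loading step concrete, here is a minimal sketch of what `pd.DataFrame` does with a list of MongoDB-style documents. The `docs` list is made-up sample data standing in for the result of `find({}, {"_id": 0})`, so no database connection is needed. A document that lacks a field gets a missing value in that column, which is one reason the null checks later in the notebook matter.

```python
import pandas as pd

# Hypothetical documents, shaped like what a pymongo cursor would yield
docs = [
    {"Entity": "Afghanistan", "Code": "AFG", "year": 2020},
    {"Entity": "Albania", "Code": "ALB"},  # 'year' is missing in this document
]

# Columns are the union of all keys; absent fields become NaN
df_sample = pd.DataFrame(docs)
print(df_sample)
print(df_sample.isnull().sum())
```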

In [8]:
from pymongo import MongoClient
import pandas as pd
from dotenv import load_dotenv
import os

# 1. Load environment variables
load_dotenv()

# 2. Load MongoDB URI from .env file
MONGO_URI = os.getenv("MONGO_URI")
DATABASE_NAME = "my_database"

# 3. Validate MongoDB URI
if not MONGO_URI:
    raise ValueError("MONGO_URI is not set in the .env file.")

# 4. Connect to MongoDB
try:
    client = MongoClient(MONGO_URI)
    client.server_info()  # Test the connection
    print("Successfully connected to MongoDB.")
except Exception as e:
    print(f"Failed to connect to MongoDB: {e}")
    raise

# 5. Access the database
db = client[DATABASE_NAME]

# 6. Function to load data from MongoDB into Pandas DataFrame
def mongo_to_df(collection_name):
    try:
        # Fetch all documents from the collection
        data = list(db[collection_name].find({}, {"_id": 0}))  # Exclude _id for clarity
        df = pd.DataFrame(data)
        print(f"Loaded {len(df)} records from collection '{collection_name}'")
        return df
    except Exception as e:
        print(f"Error loading collection '{collection_name}': {e}")
        return pd.DataFrame()

# 7. Define the collection names
collections = {
    "covid_vacc_death_rate": "covid_vacc_death_rate",
    "covid_vacc_manufacturer": "covid_vacc_manufacturer",
    "oecd_health_expenditure": "oecd_health_expenditure",
    "us_death_rates": "us_death_rates"
}

# 8. Load the data into DataFrames
df_vdr = mongo_to_df(collections["covid_vacc_death_rate"])
df_vm = mongo_to_df(collections["covid_vacc_manufacturer"])
df_oecd = mongo_to_df(collections["oecd_health_expenditure"])
df_us = mongo_to_df(collections["us_death_rates"])

# 9. Print the first few rows of each DataFrame
for name, df in [("Vaccinations vs Death Rate", df_vdr),
                 ("Vaccine Manufacturer Data", df_vm),
                 ("OECD Health Expenditure", df_oecd),
                 ("US Death Rates", df_us)]:
    print(f"\nFirst few rows of {name}:")
    print(df.head())
    print(f"Shape: {df.shape}")

Successfully connected to MongoDB.
Loaded 447729 records from collection 'covid_vacc_death_rate'
Loaded 59224 records from collection 'covid_vacc_manufacturer'
Loaded 439 records from collection 'oecd_health_expenditure'
Loaded 650 records from collection 'us_death_rates'

First few rows of Vaccinations vs Death Rate:
        Entity Code  year         Day  \
0  Afghanistan  AFG  2020  2020-01-09   
1  Afghanistan  AFG  2020  2020-01-10   
2  Afghanistan  AFG  2020  2020-01-11   
3  Afghanistan  AFG  2020  2020-01-12   
4  Afghanistan  AFG  2020  2020-01-13   

   Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)  \
0                                                0.0                                                      
1                                                0.0                                                      
2                                                0.0                                                      
3    

## 3. Initial Exploratory Data Analysis (EDA)

We examine the first few rows (`head`), data types (`info`), and basic statistics (`describe`) of each DataFrame to understand the data structure and contents.
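
For wide frames, the combined output of `head`, `info`, and `describe` can be hard to scan. As a possible complement, the sketch below condenses dtype, null count, and unique count into one small table. The helper name `quick_overview` and the `toy` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def quick_overview(df):
    """Return one row per column: dtype, null count, and number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "unique": df.nunique(),  # NaN is not counted as a unique value
    })

# Toy data with one missing date
toy = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Day": ["2020-01-09", "2020-01-10", None],
    "value": [0.0, 1.5, 2.5],
})
print(quick_overview(toy))
```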

In [9]:
# ----------------------------------------------------
# Initial Exploration
# ----------------------------------------------------
print("### COVID Vaccinations vs Death Rate ###")
print(df_vdr.head())
df_vdr.info()  # info() prints its report directly and returns None
print(df_vdr.describe(include='all'))

print("\n### COVID Vaccine Manufacturer ###")
print(df_vm.head())
df_vm.info()
print(df_vm.describe(include='all'))

print("\n### OECD Health Expenditure ###")
print(df_oecd.head())
df_oecd.info()
print(df_oecd.describe(include='all'))

print("\n### US Death Rates ###")
print(df_us.head())
df_us.info()
print(df_us.describe(include='all'))

### COVID Vaccinations vs Death Rate ###
        Entity Code  year         Day  \
0  Afghanistan  AFG  2020  2020-01-09   
1  Afghanistan  AFG  2020  2020-01-10   
2  Afghanistan  AFG  2020  2020-01-11   
3  Afghanistan  AFG  2020  2020-01-12   
4  Afghanistan  AFG  2020  2020-01-13   

   Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)  \
0                                                0.0                                                      
1                                                0.0                                                      
2                                                0.0                                                      
3                                                0.0                                                      
4                                                0.0                                                      

   COVID-19 doses (cumulative, per hundred) World regions according t

## 4. Data Cleaning, Type Conversions, and Checking for Missing Values

In this step:
- Convert date columns to datetime format (if needed).
- Convert numeric columns to appropriate types.
- Check for null values and consider potential cleaning strategies.
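
The conversions listed above rely on `errors='coerce'`, which turns unparseable entries into missing values instead of raising. The sketch below shows that behavior on a small made-up frame; the column names mirror ones used in this notebook, but the values are invented for illustration.

```python
import pandas as pd

# Toy data with one bad date and one bad number
raw = pd.DataFrame({
    "Day": ["2020-01-09", "not a date", "2020-01-11"],
    "time_period": ["2019", "2020", "n/a"],
})

converted = raw.copy()
# Unparseable dates become NaT, unparseable numbers become NaN
converted["Day"] = pd.to_datetime(converted["Day"], errors="coerce")
converted["time_period"] = pd.to_numeric(converted["time_period"], errors="coerce")

print(converted.dtypes)
print(converted.isnull().sum())
```

Because coercion hides bad input, it is worth comparing the null counts before and after conversion.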

In [12]:
# ----------------------------------------------------
# Data Cleaning and Validation Checks
# ----------------------------------------------------

def clean_and_validate(df_dict):
    """
    Cleans and validates a dictionary of DataFrames by:
    - Converting date columns to datetime.
    - Ensuring numeric columns are properly formatted.
    - Checking and reporting null values.
    - Filling missing numeric values with the column mean and missing
      categorical values with 'Unknown'.

    Note: the DataFrames are modified in place.

    Args:
        df_dict (dict): Dictionary of DataFrames to process.

    Returns:
        dict: Dictionary of cleaned and validated DataFrames.
    """
    for name, df in df_dict.items():
        print(f"\nProcessing DataFrame: {name}")

        # Convert the date column to datetime if it exists
        # (matched case-insensitively, since the loaded frames use 'Day')
        for col in df.columns:
            if str(col).lower() == 'day':
                df[col] = pd.to_datetime(df[col], errors='coerce')
                print(f"Converted '{col}' column to datetime in {name}.")

        # Convert specific numeric columns to numeric
        if name == "df_oecd" and 'time_period' in df.columns:
            df['time_period'] = pd.to_numeric(df['time_period'], errors='coerce')
            print(f"Converted 'time_period' column to numeric in {name}.")

        # Report null values
        print(f"\nNull value counts in {name}:\n{df.isnull().sum()}")

        # Fill missing numeric data with the mean
        for col in df.select_dtypes(include=["float", "int"]).columns:
            if df[col].isnull().any():
                df[col] = df[col].fillna(df[col].mean())
                print(f"Filled missing values in numeric column '{col}' with mean in {name}.")

        # Fill missing categorical data with 'Unknown'
        for col in df.select_dtypes(include=["object"]).columns:
            if df[col].isnull().any():
                df[col] = df[col].fillna("Unknown")
                print(f"Filled missing values in categorical column '{col}' with 'Unknown' in {name}.")

    return df_dict


# Example usage
df_dict = {
    "df_vdr": df_vdr,
    "df_vm": df_vm,
    "df_oecd": df_oecd,
    "df_us": df_us
}

# Clean and validate the dataframes
df_dict_cleaned = clean_and_validate(df_dict)

# Access cleaned DataFrames
df_vdr_cleaned = df_dict_cleaned["df_vdr"]
df_vm_cleaned = df_dict_cleaned["df_vm"]
df_oecd_cleaned = df_dict_cleaned["df_oecd"]
df_us_cleaned = df_dict_cleaned["df_us"]

# Print summary of cleaned DataFrames
for name, df in df_dict_cleaned.items():
    print(f"\n{name} summary after cleaning:")
    df.info()  # info() prints its report directly and returns None
    print(df.head())


Processing DataFrame: df_vdr

Null value counts in df_vdr:
Entity                                                                                                  0
Code                                                                                                    0
year                                                                                                    0
Day                                                                                                     0
Daily new confirmed deaths due to COVID-19 per million people (rolling 7-day average, right-aligned)    0
COVID-19 doses (cumulative, per hundred)                                                                0
World regions according to OWID                                                                         0
dtype: int64

Processing DataFrame: df_vm

Null value counts in df_vm:
Entity                                                           0
Code                                                        

## 5. Basic Statistical Analyses

Perform some example analyses:
- Identify entities with the highest vaccination rates (df_vdr)
- Check correlation between vaccination and death rates (df_vdr)
- Summarize total doses by manufacturer (df_vm)
- Compute average health expenditures by reference area (df_oecd)
- Examine death rates over time for a specific age group (e.g., 80+) in the US (df_us)
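
The checks in the next cell look for snake_case names such as `entity` and `covid_19_doses_cumulative,_per_hundred`, while the frames printed earlier show names like `Entity` and `COVID-19 doses (cumulative, per hundred)`. If the loaded columns have not already been normalized, every `if` guard is skipped silently. The sketch below shows one possible normalization. The helper `normalize_columns` and its exact rules are assumptions, chosen because they reproduce the two long `df_vdr` column names used in this notebook; other columns may need different rules.

```python
import re
import pandas as pd

def normalize_columns(df):
    """Return a copy of df with lowercased, underscore-separated column names."""
    def _norm(name):
        name = str(name).strip().lower()
        name = re.sub(r"[()]", "", name)        # drop parentheses
        return re.sub(r"[\s\-]+", "_", name)    # spaces and hyphens -> underscore
    return df.rename(columns=_norm)

# Column names copied from the df_vdr output shown earlier; values are made up
toy = pd.DataFrame(columns=[
    "Entity",
    "Day",
    "COVID-19 doses (cumulative, per hundred)",
])
print(list(normalize_columns(toy).columns))
```

Applying such a helper right after loading (for example `df_vdr = normalize_columns(df_vdr)`) would let the guarded analyses run.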

In [None]:

# ----------------------------------------------------
# Example Analyses
# ----------------------------------------------------

# 1. COVID Vaccinations vs Death Rate (df_vdr)
# Rank entities (countries) by their average cumulative vaccination rate over the observed period.
if 'covid_19_doses_cumulative,_per_hundred' in df_vdr.columns and 'entity' in df_vdr.columns:
    avg_vacc = (df_vdr.groupby('entity')['covid_19_doses_cumulative,_per_hundred']
                .mean()
                .sort_values(ascending=False))
    print("\nTop 10 entities by average cumulative vaccination (per hundred):")
    print(avg_vacc.head(10))

# Check correlation between vaccination and death rate if both columns exist:
if 'covid_19_doses_cumulative,_per_hundred' in df_vdr.columns and 'daily_new_confirmed_deaths_due_to_covid_19_per_million_people_rolling_7_day_average,_right_aligned' in df_vdr.columns:
    corr_value = df_vdr[['covid_19_doses_cumulative,_per_hundred',
                         'daily_new_confirmed_deaths_due_to_covid_19_per_million_people_rolling_7_day_average,_right_aligned']].corr().iloc[0,1]
    print(f"\nCorrelation between vaccine doses and daily death rate: {corr_value}")

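# 2. COVID Vaccine Manufacturer (df_vm)
# Identify which vaccine manufacturer has the highest cumulative doses globally.
# Assuming the matched columns hold cumulative counts, take each entity's maximum
# (its latest total) before summing; summing every daily row would count the
# same doses repeatedly.
manufacturer_cols = [c for c in df_vm.columns if 'manufacturer' in c]
if manufacturer_cols and 'entity' in df_vm.columns:
    global_sums = (df_vm.groupby('entity')[manufacturer_cols]
                   .max()
                   .sum()
                   .sort_values(ascending=False))
    print("\nTotal cumulative doses by manufacturer (across all entities):")
    print(global_sums)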

# 3. OECD Health Expenditure (df_oecd)
# Average health expenditure (obs_value) per reference area, taken across all time periods.
if 'reference_area' in df_oecd.columns and 'obs_value' in df_oecd.columns:
    avg_expenditure = df_oecd.groupby('reference_area')['obs_value'].mean().sort_values(ascending=False)
    print("\nTop 10 reference areas by average health expenditure (obs_value):")
    print(avg_expenditure.head(10))

# 4. US Death Rates (df_us)
# Check a particular age group's trends over time. Let's say "80+" if it exists.
if 'entity' in df_us.columns and df_us['entity'].eq('80+').any():
    # Filter data for 80+
    df_80plus = df_us[df_us['entity'] == '80+'].copy()
    df_80plus = df_80plus.sort_values('day')
    if 'death_rate_weekly_of_unvaccinated_people__united_states,_by_age' in df_80plus.columns:
        print("\nFirst 5 rows for 80+ age group death rates over time:")
        print(df_80plus[['day', 'death_rate_weekly_of_unvaccinated_people__united_states,_by_age']].head())
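
The first analysis above ranks entities by the mean of a cumulative series. Since a cumulative value only grows, its mean depends on when a rollout started as much as on how far it got. The sketch below, on invented data with hypothetical column names, contrasts that mean with the latest value per entity to show that the two rankings can differ.

```python
import pandas as pd

# Toy cumulative series: A starts late but ends higher, B starts early
toy = pd.DataFrame({
    "entity": ["A", "A", "A", "B", "B", "B"],
    "day": pd.to_datetime(["2021-01-01", "2021-06-01", "2021-12-01"] * 2),
    "doses_per_hundred": [0.0, 50.0, 150.0, 100.0, 110.0, 120.0],
})

# Mean over time, as in the analysis above
by_mean = toy.groupby("entity")["doses_per_hundred"].mean()

# Latest observation per entity
latest = (toy.sort_values("day")
             .groupby("entity")["doses_per_hundred"]
             .last())

print(by_mean.sort_values(ascending=False))
print(latest.sort_values(ascending=False))
```

Which measure is appropriate depends on the question; the latest value answers "who vaccinated the most", while the mean mixes in timing.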


## 6. (Optional) Data Merging

If needed, data sets can be merged for more comprehensive analyses. The code below is an example and is commented out since it depends on matching keys.
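
Before committing to a merge, it can help to measure how well the keys line up. The sketch below uses invented frames whose column names follow the commented example (`entity`/`year` on one side, `reference_area`/`time_period` on the other); all values and the placeholder country names are assumptions. It first aggregates the daily data to one row per entity and year, then uses `indicator=True` to count matched and unmatched keys.

```python
import pandas as pd

# Toy daily data (several rows per entity and year)
daily = pd.DataFrame({
    "entity": ["Country A", "Country A", "Country B", "Country X"],
    "year": [2020, 2020, 2020, 2020],
    "deaths_per_million": [1.0, 3.0, 2.0, 5.0],
})
# Toy yearly data (one row per reference area and period)
expenditure = pd.DataFrame({
    "reference_area": ["Country A", "Country B", "Country C"],
    "time_period": [2020, 2020, 2020],
    "obs_value": [10.0, 20.0, 30.0],
})

# Aggregate to yearly granularity so the merge is one-to-one
yearly = daily.groupby(["entity", "year"], as_index=False)["deaths_per_million"].mean()

# An outer merge with indicator=True labels each row as both/left_only/right_only
merged = pd.merge(yearly, expenditure,
                  left_on=["entity", "year"],
                  right_on=["reference_area", "time_period"],
                  how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

A large share of `left_only` or `right_only` rows would suggest the key columns need harmonizing (for example, by merging on country codes instead of names).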

In [None]:
# ----------------------------------------------------
# Potential Data Integration for Analysis
# ----------------------------------------------------
# If needed, you can attempt to integrate datasets. For example, if you have a common country code:
# This depends heavily on whether keys match. If not, skip this step.

# Example: If 'entity' in df_vdr and 'reference_area' in df_oecd correspond to the same countries (just an example),
# you could try merging on a year-from-day and reference_area/time_period basis.
# Note: This might result in NaNs if keys don't align well.

# Only do this if it makes sense for your analysis:
# if 'day' in df_vdr.columns:
#     df_vdr['year'] = df_vdr['day'].dt.year
# if 'time_period' in df_oecd.columns:
#     # Attempt a merge (just as an example)
#     merged_analysis = pd.merge(df_vdr, df_oecd,
#                                left_on=['entity','year'],
#                                right_on=['reference_area','time_period'],
#                                how='inner')
#     print("\nMerged dataset shape:", merged_analysis.shape)
#     # From here, you could check correlation between health expenditure and vaccination/death rates.


## 7. Correlation Matrices and Additional Statistical Measures

Display correlation matrices for numeric columns in each DataFrame to understand relationships between variables.
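
A full correlation matrix lists every pair twice and includes the trivial diagonal. As one way to read it more quickly, the sketch below flattens the matrix into unique pairs ranked by absolute correlation. The `toy` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with x
    "z": [5.0, 3.0, 4.0, 1.0, 2.0],
    "label": list("abcde"),           # non-numeric, excluded below
})

corr = toy.select_dtypes(include=[np.number]).corr()

# Collect each unordered pair once (upper triangle, no diagonal)
cols = list(corr.columns)
pairs = pd.Series({
    (a, b): corr.loc[a, b]
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
})

# Rank by strength of association, regardless of sign
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```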

In [None]:
# ----------------------------------------------------
# Correlations and Statistical Measures
# ----------------------------------------------------
# Depending on your numeric columns, try a correlation matrix:
print("\nCorrelation matrix for df_vdr numeric columns:")
print(df_vdr.select_dtypes(include=[np.number]).corr())

# Similar for df_vm, df_oecd, df_us:
print("\nCorrelation matrix for df_vm numeric columns:")
print(df_vm.select_dtypes(include=[np.number]).corr())

print("\nCorrelation matrix for df_oecd numeric columns:")
print(df_oecd.select_dtypes(include=[np.number]).corr())

print("\nCorrelation matrix for df_us numeric columns:")
print(df_us.select_dtypes(include=[np.number]).corr())

## 8. Summary and Next Steps

At this point, we have:
- Loaded and inspected the data.
- Performed basic cleaning and type conversions.
- Conducted initial statistical analyses and correlations.

Next steps could involve:
- Advanced analytics or modeling.
- Data visualization.
- Further data enrichment or merging with external data sources.

In [None]:
# ----------------------------------------------------
# Summary and Next Steps
# ----------------------------------------------------
# By now we have:
# - Basic stats and shapes
# - Grouped averages and correlations
# - Checked if merging is feasible
#
# Next steps:
# 1. Identify interesting comparisons or patterns to visualize.
# 2. Prepare subsets of data for plotting in Visualization.ipynb.

print("\nAnalysis complete. You can now proceed to Visualization steps.")