<a href="https://colab.research.google.com/github/Chriskugu/Chriskugu/blob/main/final_projet_part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Real-world Dataset:

In [None]:
# API base URL: https://covid-api.com/api/

In [None]:
# Regions endpoint (20 regions per page, sorted by ISO code):
# https://covid-api.com/api/regions?per_page=20&order=iso&sort=asc

In [None]:
# Adequacy check

import requests
import pandas as pd

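# Fetch data
url = "https://covid-api.com/api/regions?per_page=20&order=iso&sort=asc"  # 20 regions per page, sorted by ISO code
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
data = response.json()['data']
df = pd.DataFrame(data)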

# Check adequacy
print(f"Total regions: {len(df)}")
print("Variables:\n", df.columns)
print("Missing values:\n", df.isnull().sum())
print(df.columns.tolist())

Total regions: 20
Variables:
 Index(['iso', 'name'], dtype='object')
Missing values:
 iso     0
name    0
dtype: int64
['iso', 'name']
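
The output above shows only two identifier columns, while the later steps look for case metrics. A minimal sketch of an explicit adequacy check follows. The helper name `missing_required_columns` and the list of required columns are assumptions chosen for illustration, and the toy frame stands in for the API response.

```python
import pandas as pd

# Hypothetical helper (not part of the notebook): report which required
# columns are absent from a DataFrame.
def missing_required_columns(df, required):
    """Return the required column names that df does not contain."""
    return [col for col in required if col not in df.columns]

# Toy frame mirroring the two columns returned by the regions request
regions = pd.DataFrame({"iso": ["ABW", "AFG"], "name": ["Aruba", "Afghanistan"]})

# Assumed requirement list: identifiers plus the metrics used in later steps
required = ["iso", "name", "confirmed", "deaths", "recovered"]
missing = missing_required_columns(regions, required)
print("Missing required columns:", missing)
```

Running a check like this right after loading makes it visible early that the later metric-based steps will have nothing to work on.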


# Data Integrity

Source: COVID-19 API aggregates data from WHO, Johns Hopkins, and government reports.

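Potential Biases:

- Underreporting in low-income regions.
- Delays in last_update (e.g., some countries report weekly).

Variable Clarity:

- iso: Standardized country codes (reliable).
- active: May exclude asymptomatic cases (potential underestimation).

Note: last_update and active are not among the columns returned by the regions request above, which contains only iso and name. These caveats apply once case-level metrics are loaded.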
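
The reliability of the iso column can be spot-checked rather than assumed. The sketch below tests only the format of ISO 3166-1 alpha-3 codes (three uppercase letters) on a toy frame with two deliberately malformed entries. It does not confirm that a code is actually assigned to a country.

```python
import pandas as pd

# Toy data; the last two codes are deliberately malformed for illustration
regions = pd.DataFrame({
    "iso": ["ABW", "AFG", "ago", "AL"],
    "name": ["Aruba", "Afghanistan", "Angola", "Albania"],
})

# ISO 3166-1 alpha-3 codes consist of exactly three uppercase ASCII letters
is_valid_format = regions["iso"].str.fullmatch(r"[A-Z]{3}")

# Rows whose code fails the format check
invalid_rows = regions[~is_valid_format]
print(invalid_rows)
```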

## Data Cleaning and Preparation

In [None]:
import pandas as pd
import requests

# Fetch the data from the API
url = "https://covid-api.com/api/regions?per_page=20&order=iso&sort=asc"
response = requests.get(url)
data = response.json()

# Convert to DataFrame
df = pd.DataFrame(data['data'])

# Initial inspection
print("Initial data shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

# Basic cleaning
# Convert date fields to datetime
if 'last_update' in df.columns:
    df['last_update'] = pd.to_datetime(df['last_update'])

# Handle missing values
df.dropna(subset=['iso', 'name'], inplace=True)  # Essential identifier fields

# Type conversion for numeric fields
numeric_cols = ['confirmed', 'deaths', 'recovered', 'confirmed_diff', 'deaths_diff', 'recovered_diff']
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Calculate derived metrics
# Treat zero confirmed cases as missing so the rates become NaN instead of inf
if all(col in df.columns for col in ['deaths', 'confirmed']):
    df['fatality_rate'] = df['deaths'] / df['confirmed'].replace(0, float('nan'))

if all(col in df.columns for col in ['recovered', 'confirmed']):
    df['recovery_rate'] = df['recovered'] / df['confirmed'].replace(0, float('nan'))

# Final cleaning
# Remove completely empty columns
df.dropna(axis=1, how='all', inplace=True)

# Validate cleaned data
print("\nCleaned data shape:", df.shape)
print("\nCleaned data sample:")
print(df.head())
print("\nMissing values after cleaning:")
print(df.isnull().sum())

# Save cleaned data
df.to_csv('cleaned_covid_regions_data.csv', index=False)

Initial data shape: (20, 2)

First few rows:
   iso         name
0  ABW        Aruba
1  AFG  Afghanistan
2  AGO       Angola
3  ALB      Albania
4  AND      Andorra

Data types:
iso     object
name    object
dtype: object

Missing values:
iso     0
name    0
dtype: int64

Cleaned data shape: (20, 2)

Cleaned data sample:
   iso         name
0  ABW        Aruba
1  AFG  Afghanistan
2  AGO       Angola
3  ALB      Albania
4  AND      Andorra

Missing values after cleaning:
iso     0
name    0
dtype: int64


## Handling Missing Values & Outliers

## Step 1: Missing Values

In [None]:
## Check missing values percentage
missing_percent = (df.isnull().sum() / len(df)) * 100
print("Missing values percentage:\n", missing_percent)

### Strategy 1: Drop columns with high missingness
# Drop columns with >70% missing values
threshold = 70
cols_to_drop = missing_percent[missing_percent > threshold].index.tolist()
df.drop(columns=cols_to_drop, inplace=True)
print(f"Dropped columns with >{threshold}% missing: {cols_to_drop}")

### Strategy 2: Impute remaining missing values
# For numerical columns
num_cols = df.select_dtypes(include=['number']).columns.tolist()
for col in num_cols:
    if df[col].isnull().any():
        # Use median for numerical columns (less sensitive to outliers)
        df[col] = df[col].fillna(df[col].median())
        print(f"Imputed {col} with median: {df[col].median()}")

# For categorical columns
cat_cols = df.select_dtypes(include=['object']).columns.tolist()
for col in cat_cols:
    if df[col].isnull().any():
        # Use mode for categorical columns
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)
        print(f"Imputed {col} with mode: {mode_val}")

# Verify no missing values remain
print("\nMissing values after treatment:\n", df.isnull().sum())




Missing values percentage:
 iso     0.0
name    0.0
dtype: float64
Dropped columns with >70% missing: []

Missing values after treatment:
 iso     0
name    0
dtype: int64


## Step 2: Outliers

In [None]:


# Step 2: Handling Outliers
import numpy as np
import matplotlib.pyplot as plt

### Step 2.1: Detect outliers - Only for columns that exist
# First identify which COVID metrics actually exist in our DataFrame
existing_metrics = [col for col in ['confirmed', 'deaths', 'recovered',
                   'confirmed_diff', 'deaths_diff', 'recovered_diff',
                   'fatality_rate'] if col in df.columns]

if not existing_metrics:
    print("Warning: No COVID metrics columns found for outlier detection")
else:
    # Visualize outliers only for existing columns
    plt.figure(figsize=(15, 8))
    df[existing_metrics].boxplot()
    plt.title('Boxplot of COVID-19 Metrics (Before Outlier Treatment)')
    plt.xticks(rotation=45)
    plt.show()

    ### Step 2.2: Treat outliers
    def iqr_bounds(series):
        """Return the lower and upper IQR fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        return Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

    # Apply to existing numerical COVID metrics
    for metric in existing_metrics:
        original_median = df[metric].median()

        # Compute the bounds and count outliers on the original values;
        # after capping, no value lies outside the bounds any more
        lower_bound, upper_bound = iqr_bounds(df[metric])
        n_capped = int(((df[metric] < lower_bound) | (df[metric] > upper_bound)).sum())
        df[metric] = df[metric].clip(lower_bound, upper_bound)

        print(f"\nOutlier treatment for {metric}:")
        print(f"  - Median before: {original_median}")
        print(f"  - Median after: {df[metric].median()}")
        print(f"  - Lower bound: {lower_bound}")
        print(f"  - Upper bound: {upper_bound}")
        print(f"  - Values capped: {n_capped}")

    # Visualize after treatment
    plt.figure(figsize=(15, 8))
    df[existing_metrics].boxplot()
    plt.title('Boxplot of COVID-19 Metrics (After Outlier Treatment)')
    plt.xticks(rotation=45)
    plt.show()

    ### Step 2.3: Special handling for rates and counts
    # For fatality_rate (ensure between 0 and 1)
    if 'fatality_rate' in df.columns:
        df['fatality_rate'] = df['fatality_rate'].clip(0, 1)
        print("\nClipped fatality_rate to [0, 1] range")

    # For counts (ensure non-negative)
    count_cols = ['confirmed', 'deaths', 'recovered']
    for col in count_cols:
        if col in df.columns:
            df[col] = df[col].clip(lower=0)
            print(f"Ensured {col} values are non-negative")
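
Since none of the metric columns exist in the regions data, the capping logic above is never exercised. Here is a small worked example of IQR capping on a toy series, with the bounds computed before clipping so the outlier count is meaningful.

```python
import pandas as pd

# Toy series with one obvious outlier
values = pd.Series([1, 2, 3, 4, 100])

q1, q3 = values.quantile(0.25), values.quantile(0.75)   # 2.0 and 4.0
iqr = q3 - q1                                            # 2.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # -1.0 and 7.0

# Count outliers on the original values, then cap them at the bounds
n_outliers = int(((values < lower) | (values > upper)).sum())
capped = values.clip(lower, upper)

print(capped.tolist(), n_outliers)
```

Only the value 100 lies outside the bounds, so it is replaced by the upper bound 7 and the other values are unchanged.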



## Data Validation

In [None]:
def validate_covid_data(df):
    """Robust validation function for COVID-19 data"""

    print("\n=== DATA VALIDATION REPORT ===")

    # 1. Basic Structure
    print("\n[1] Data Structure:")
    print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
    print("Columns:", df.columns.tolist())

    # 2. COVID Metrics Check
    covid_metrics = {
        'cases': ['confirmed', 'cases', 'total_cases'],
        'deaths': ['deaths', 'fatalities', 'total_deaths'],
        'recovered': ['recovered', 'total_recovered']
    }

    found_metrics = {}
    for metric, aliases in covid_metrics.items():
        for alias in aliases:
            if alias in df.columns:
                found_metrics[metric] = alias
                break

    print("\n[2] COVID Metrics Found:")
    for metric, col_name in found_metrics.items():
        print(f"- {metric}: {col_name} (dtype: {df[col_name].dtype})")

    if not found_metrics:
        print("No standard COVID metrics found in dataset")

    # 3. Data Quality Metrics
    print("\n[3] Data Quality:")
    print("Missing Values:")
    print(df.isnull().sum())

    if found_metrics:
        print("\nBasic Statistics:")
        print(df[list(found_metrics.values())].describe())

    # 4. Save Validated Data
    output_file = 'validated_covid_data.csv'
    df.to_csv(output_file, index=False)
    print(f"\n[4] Validation Complete - Data saved to {output_file}")

    return df

# Run the validation
df_validated = validate_covid_data(df)


=== DATA VALIDATION REPORT ===

[1] Data Structure:
Rows: 219, Columns: 2
Columns: ['iso', 'name']

[2] COVID Metrics Found:
No standard COVID metrics found in dataset

[3] Data Quality:
Missing Values:
iso     0
name    0
dtype: int64

[4] Validation Complete - Data saved to validated_covid_data.csv


## Recoding & Encoding Variables

## Recoding Variables

In [None]:
# First examine categorical variables
cat_vars = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical variables:", cat_vars)

### Example: Recode region names to standardized format
if 'name' in df.columns:
    df['name'] = df['name'].str.title().str.strip()
    print("\nStandardized region names")

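### Example: Create binary indicators for specific regions
# Note: the names below are cities/provinces used as placeholders; they must match
# values in the 'name' column (country names here), otherwise the flag is 0 everywhere
if 'name' in df.columns:
    df['is_high_risk'] = df['name'].isin(['Wuhan', 'Lombardy', 'Madrid']).astype(int)
    print("\nCreated high-risk region indicator")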

### Example: Recode date into pandemic waves
if 'last_update' in df.columns:
    df['last_update'] = pd.to_datetime(df['last_update'])

    # Define pandemic wave periods (adjust dates as needed)
    wave_periods = [
        ('Initial Wave', '2019-12-01', '2020-06-30'),
        ('Delta Wave', '2021-04-01', '2021-10-31'),
        ('Omicron Wave', '2021-11-01', '2022-03-31'),
        ('Recent Period', '2022-04-01', '2023-12-31')
    ]

    df['pandemic_wave'] = 'Other'
    for wave, start, end in wave_periods:
        mask = (df['last_update'] >= start) & (df['last_update'] <= end)
        df.loc[mask, 'pandemic_wave'] = wave

    print("\nRecoded dates into pandemic waves")

Categorical variables: ['iso', 'name']

Standardized region names

Created high-risk region indicator
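
The wave recoding did not run above because the data has no last_update column. The sketch below applies the same period-matching idea to four invented dates, using `Series.between`, which is inclusive at both ends by default.

```python
import pandas as pd

# Invented timestamps for illustration
toy = pd.DataFrame({"last_update": pd.to_datetime(
    ["2020-03-15", "2020-09-01", "2021-07-04", "2022-01-10"])})

# Subset of the wave periods defined in the cell above
wave_periods = [
    ("Initial Wave", "2019-12-01", "2020-06-30"),
    ("Delta Wave", "2021-04-01", "2021-10-31"),
    ("Omicron Wave", "2021-11-01", "2022-03-31"),
]

# Dates outside every period keep the default label
toy["pandemic_wave"] = "Other"
for wave, start, end in wave_periods:
    mask = toy["last_update"].between(pd.Timestamp(start), pd.Timestamp(end))
    toy.loc[mask, "pandemic_wave"] = wave

print(toy)
```

The second date falls in the gap between the first two periods and therefore stays "Other". An end date such as 2020-06-30 is interpreted as midnight, so timestamps later on that day would also fall outside the period.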


## Encoding Variables

In [None]:
### Option 1: Ordinal Encoding (for ordered categories)
# LabelEncoder sorts its classes alphabetically, so it would not preserve the
# chronological order below; map each wave to its position explicitly instead
if 'pandemic_wave' in df.columns:
    wave_order = ['Initial Wave', 'Delta Wave', 'Omicron Wave', 'Recent Period', 'Other']
    wave_codes = {wave: code for code, wave in enumerate(wave_order)}
    df['pandemic_wave_encoded'] = df['pandemic_wave'].map(wave_codes)
    print("\nOrdinal encoded pandemic waves:")
    print(wave_codes)

### Option 2: One-Hot Encoding (for nominal categories)
if 'name' in df.columns:
    # Get top n regions to avoid too many dummy variables
    top_regions = df['name'].value_counts().nlargest(10).index.tolist()
    df['region_group'] = df['name'].where(df['name'].isin(top_regions), 'Other')

    # Perform one-hot encoding
    dummies = pd.get_dummies(df['region_group'], prefix='region')
    df = pd.concat([df, dummies], axis=1)
    print("\nOne-hot encoded top regions:")
    print(dummies.columns.tolist())

### Option 3: Target Encoding (for high-cardinality categories)
if 'name' in df.columns and 'fatality_rate' in df.columns:
    # Calculate mean fatality rate by region
    region_means = df.groupby('name')['fatality_rate'].mean().to_dict()
    df['region_fatality_encoded'] = df['name'].map(region_means)
    print("\nTarget encoded regions by fatality rate")


One-hot encoded top regions:
['region_China', 'region_Japan', 'region_Korea, South', 'region_Malaysia', 'region_Other', 'region_Philippines', 'region_Singapore', 'region_Taipei And Environs', 'region_Thailand', 'region_Us', 'region_Vietnam']
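
For ordered categories such as the pandemic waves, the integer codes should follow the intended order rather than alphabetical order. This sketch uses an ordered `pd.Categorical` as one possible approach and contrasts it with the ranks that alphabetical sorting would assign. The sample values are invented.

```python
import pandas as pd

wave_order = ["Initial Wave", "Delta Wave", "Omicron Wave", "Recent Period", "Other"]
waves = pd.Series(["Omicron Wave", "Initial Wave", "Other", "Delta Wave"])

# Ordered categorical: each code is the position of the label in wave_order
ordered = pd.Categorical(waves, categories=wave_order, ordered=True)
codes = pd.Series(ordered.codes, index=waves.index)

# For comparison: the rank each label gets when classes are sorted alphabetically
alphabetical_rank = {name: i for i, name in enumerate(sorted(wave_order))}

print(codes.tolist())
print(alphabetical_rank)
```

With alphabetical sorting, "Delta Wave" would receive code 0 and "Initial Wave" code 1, which reverses their chronological order.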


## Data Transformation for Skewed COVID-19 Variables

## Step 1: Identify Skewed Variables

In [None]:
import numpy as np

# Select numeric COVID metrics (excluding binary indicators and encoded columns)
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
covid_metrics = [col for col in numeric_cols
                 if col != 'is_high_risk' and not col.endswith('_encoded')]

print("Skewness before transformation:")
skew_before = df[covid_metrics].skew()
print(skew_before)

# Visualize distributions
import math
import matplotlib.pyplot as plt
import seaborn as sns

# Size the subplot grid to the number of metrics (a fixed 3x3 grid fails beyond 9)
n_plot_cols = 3
n_plot_rows = max(1, math.ceil(len(covid_metrics) / n_plot_cols))

plt.figure(figsize=(15, 10))
for i, col in enumerate(covid_metrics, 1):
    plt.subplot(n_plot_rows, n_plot_cols, i)
    sns.histplot(df[col], kde=True)
    plt.title(f'{col} (Skew: {skew_before[col]:.2f})')
plt.tight_layout()
plt.show()

Skewness before transformation:
Series([], dtype: float64)


<Figure size 1500x1000 with 0 Axes>

## Step 2: Apply Appropriate Transformations

In [None]:
# Logarithmic Transformation (for right-skewed data)

right_skewed = skew_before[skew_before > 1].index.tolist()

for col in right_skewed:
    # Add 1 to handle zeros (log(0) is undefined)
    df[f'log_{col}'] = np.log1p(df[col])

print("\nApplied log1p transformation to:", right_skewed)

# Square Root Transformation (moderate right skew)

moderate_skew = skew_before[(skew_before > 0.5) & (skew_before <= 1)].index.tolist()

for col in moderate_skew:
    df[f'sqrt_{col}'] = np.sqrt(df[col])

print("Applied sqrt transformation to:", moderate_skew)

# Box-Cox Transformation (for positive values with varying skewness)

from scipy import stats

for col in covid_metrics:
    if df[col].min() > 0:  # Box-Cox requires positive values
        transformed, _ = stats.boxcox(df[col])
        df[f'boxcox_{col}'] = transformed
        print(f"Applied Box-Cox to {col}")


# Yeo-Johnson Transformation (handles zero/negative values)

for col in covid_metrics:
    transformed, _ = stats.yeojohnson(df[col])
    df[f'yeojohnson_{col}'] = transformed
    print(f"Applied Yeo-Johnson to {col}")




Applied log1p transformation to: []
Applied sqrt transformation to: []
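
Both lists above are empty because the data contains no numeric metrics. To illustrate what the log transform is meant to achieve, here is a sketch on an invented right-skewed series spanning several orders of magnitude.

```python
import numpy as np
import pandas as pd

# Invented counts, each ten times the previous one
counts = pd.Series([1, 10, 100, 1000, 10000])

skew_raw = counts.skew()             # strongly right-skewed
skew_log = np.log1p(counts).skew()   # close to symmetric after log1p

print(f"Skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

The raw series exceeds the skewness threshold of 1 used above, while the transformed series is nearly symmetric. `np.log1p` computes log(1 + x), so zero counts map to 0 instead of negative infinity.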


## Step 3: Validate Transformation Results

In [None]:
# Calculate skewness after transformation
transformed_cols = [f'log_{col}' for col in right_skewed] + \
                  [f'sqrt_{col}' for col in moderate_skew] + \
                  [f'boxcox_{col}' for col in covid_metrics if f'boxcox_{col}' in df] + \
                  [f'yeojohnson_{col}' for col in covid_metrics]

print("\nSkewness after transformation:")
print(df[transformed_cols].skew())

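# Visualize transformed distributions
import math

# Size the subplot grid to the number of columns (a fixed 4x4 grid fails beyond 16)
n_plot_cols = 4
n_plot_rows = max(1, math.ceil(len(transformed_cols) / n_plot_cols))

plt.figure(figsize=(15, 10))
for i, col in enumerate(transformed_cols, 1):
    plt.subplot(n_plot_rows, n_plot_cols, i)
    sns.histplot(df[col], kde=True)
    plt.title(col)
plt.tight_layout()
plt.show()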


Skewness after transformation:
Series([], dtype: float64)


<Figure size 1500x1000 with 0 Axes>

## Normalization and Standardization for COVID-19 Data Analysis

In [None]:
# First identify which variables need scaling
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()

# Remove columns that shouldn't be scaled:
# - Binary variables (0/1)
# - Variables already on comparable scales (e.g., percentages)
# - Identifier variables (region codes, etc.)
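# Note: nunique() == 2 only catches indicators in which both values occur; a constant
# column (e.g., a flag that is 0 for every row) is not excluded by this test
to_exclude = [col for col in numeric_cols if
              df[col].nunique() == 2 or  # Binary
              col.endswith('_encoded') or  # Already encoded
              col in ['iso', 'region_code']]  # Identifiers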

features_to_scale = [col for col in numeric_cols if col not in to_exclude]

print("Variables to scale:", features_to_scale)

Variables to scale: ['is_high_risk']


## Standardization (Z-score Normalization)

Use when:

- Algorithms are sensitive to feature scale or work best with centered, roughly Gaussian features (PCA, SVM, linear regression)
- Outliers have been properly handled
- Features have different units but comparable ranges

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[features_to_scale] = scaler.fit_transform(df[features_to_scale])

# Verify standardization
print("\nAfter standardization (mean ~0, std ~1):")
print(df_scaled[features_to_scale].describe().loc[['mean', 'std']])

# Save standardized data
df_scaled.to_csv('covid_data_standardized.csv', index=False)


After standardization (mean ~0, std ~1):
      is_high_risk
mean           0.0
std            0.0
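
The reported standard deviation of 0 reflects a constant column, for which a z-score is undefined. StandardScaler centers such a column and leaves its scale untouched. The sketch below reproduces the z-score formula by hand with NumPy on invented values. It uses the population standard deviation (`ddof=0`), which is the estimator StandardScaler uses.

```python
import numpy as np
import pandas as pd

# Invented values with mean 5 and population standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# z-score: subtract the mean, divide by the population std (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)
print(z)

# pandas reports the sample std (ddof=1), so it comes out slightly above 1
print(pd.Series(z).std())

# A constant column has zero spread, so the division is undefined for it
constant = np.zeros(5)
print(constant.std(ddof=0))
```

This difference in estimators is why a `describe()` check after standardization shows a standard deviation near 1 rather than exactly 1.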


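## Min-Max Normalization

Use when:

- Training neural networks
- Algorithms benefit from a bounded [0,1] range (e.g., distance-based methods such as KNN)
- Zero entries in sparse, non-negative data should stay zero (this holds when the feature minimum is 0)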

In [None]:
from sklearn.preprocessing import MinMaxScaler

mmscaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[features_to_scale] = mmscaler.fit_transform(df[features_to_scale])

# Verify normalization
print("\nAfter normalization (range [0,1]):")
print(df_normalized[features_to_scale].describe().loc[['min', 'max']])

# Save normalized data
df_normalized.to_csv('covid_data_normalized.csv', index=False)


After normalization (range [0,1]):
     is_high_risk
min           0.0
max           0.0
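
A maximum of 0 after normalization again points to a constant column, where the min-max formula would divide by zero. The following is a hand-written sketch of the formula with an explicit guard for that case. The function name `min_max_scale` is an assumption for illustration and is not a scikit-learn API.

```python
import numpy as np

# Hypothetical helper implementing (x - min) / (max - min) with a zero-range guard
def min_max_scale(x):
    x = np.asarray(x, dtype=float)
    value_range = x.max() - x.min()
    if value_range == 0:
        # Constant input: every value equals the minimum, so map everything to 0
        return np.zeros_like(x)
    return (x - x.min()) / value_range

print(min_max_scale([10, 20, 30, 50]))
print(min_max_scale([0, 0, 0]))
```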


## Robust Scaling (for COVID-19 Data with Outliers)

In [None]:
from sklearn.preprocessing import RobustScaler

rscaler = RobustScaler()
df_robust = df.copy()
df_robust[features_to_scale] = rscaler.fit_transform(df[features_to_scale])

# Verify robust scaling
print("\nAfter robust scaling (median=0, IQR=1):")
print(pd.DataFrame({
    'median': df_robust[features_to_scale].median(),
    'IQR': df_robust[features_to_scale].quantile(0.75) - df_robust[features_to_scale].quantile(0.25)
}))

# Save robust scaled data
df_robust.to_csv('covid_data_robust_scaled.csv', index=False)


After robust scaling (median=0, IQR=1):
              median  IQR
is_high_risk     0.0  0.0
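
Robust scaling centers on the median and divides by the interquartile range, so a single extreme value does not distort the scale of the remaining points. Below is a minimal NumPy sketch of that formula on invented data. It mirrors the idea behind RobustScaler with its default quartile range, not its implementation.

```python
import numpy as np

# Invented values with one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)                  # 3.0
q1, q3 = np.percentile(x, [25, 75])    # 2.0 and 4.0

# Center on the median and scale by the IQR
robust = (x - median) / (q3 - q1)
print(robust)
```

The four regular points land between -1 and 0.5, while the outlier remains clearly separated. With z-scores, the outlier would inflate the standard deviation and compress the regular points into a narrow band.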


## COVID-Specific Scaling Considerations

In [None]:
# Per-capita features need the population and both count columns
if all(col in df.columns for col in ['population', 'confirmed', 'deaths']):
    # Create per-capita features before scaling
    df['cases_per_100k'] = (df['confirmed'] / df['population']) * 100000
    df['deaths_per_100k'] = (df['deaths'] / df['population']) * 100000
    features_to_scale.extend(['cases_per_100k', 'deaths_per_100k'])
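
A short worked example shows why per-capita features matter for regional comparisons. The two regions and their numbers are hypothetical.

```python
import pandas as pd

# Two hypothetical regions with identical case counts but different populations
toy = pd.DataFrame({
    "name": ["Region A", "Region B"],
    "confirmed": [5000, 5000],
    "population": [2_000_000, 50_000],
})

# Cases per 100,000 inhabitants
toy["cases_per_100k"] = toy["confirmed"] / toy["population"] * 100_000
print(toy)
```

Both regions report 5000 cases, but the rate is 250 per 100k in Region A and 10000 per 100k in Region B, a difference the raw counts hide.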

## Time-Based Normalization

In [None]:
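if 'date' in df.columns:
    # Normalize within each calendar month (year-month, so different years stay separate)
    df['date'] = pd.to_datetime(df['date'])
    df['month'] = df['date'].dt.to_period('M')

    def min_max(x):
        value_range = x.max() - x.min()
        # A group with a single distinct value has zero range; return 0 instead of NaN
        return (x - x.min()) / value_range if value_range != 0 else x * 0.0

    for col in ['confirmed', 'deaths']:
        if col in df.columns:
            df[f'{col}_norm_by_month'] = df.groupby('month')[col].transform(min_max)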

## Choosing the Right Method

In [None]:
scaling_recommendations = {
    'kmeans': 'StandardScaler',
    'logistic_regression': 'StandardScaler',
    'svm': 'StandardScaler',
    'neural_network': 'MinMaxScaler',
    'knn': 'MinMaxScaler',
    'decision_tree': 'None needed',
    'random_forest': 'None needed'
}

print("\nScaling recommendations by algorithm:")
for algo, scaler in scaling_recommendations.items():
    print(f"{algo:>20}: {scaler}")


Scaling recommendations by algorithm:
              kmeans: StandardScaler
 logistic_regression: StandardScaler
                 svm: StandardScaler
      neural_network: MinMaxScaler
                 knn: MinMaxScaler
       decision_tree: None needed
       random_forest: None needed


## Pipeline Implementation

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example pipeline
if 'outcome' in df.columns:  # Replace with your target variable
    X = df[features_to_scale]
    y = df['outcome']

    # Split data (fixed seed so the split is reproducible)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create pipeline with scaling + classifier; the scaler is fit on the
    # training split only, which avoids leaking test-set statistics
    pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Can swap with other scalers
        ('classifier', RandomForestClassifier())
    ])

    pipeline.fit(X_train, y_train)
    print(f"\nModel accuracy: {pipeline.score(X_test, y_test):.2f}")

COVID-Specific Factors:

- Case counts often follow power-law distributions (consider log transform first)
- Regional comparisons benefit from population-normalized features
- Time-dependent normalization accounts for pandemic waves

When to Scale:

- Always: Distance-based algorithms (KNN, SVM, K-means)
- Usually: Neural networks, linear models
- Rarely: Tree-based methods

Validation:

- Always check descriptive statistics after scaling
- Verify no information leakage (fit scalers on training data only)
- Document which features were scaled and which method was used

Special Cases:

- For sparse data (many zeros), MinMax may be better than Standard
- For datasets with extreme outliers, Robust scaling is preferred
- For compositional data (percentages), consider isometric log-ratio transforms

Remember that the choice between normalization and standardization depends on your specific machine learning task and the nature of your COVID-19 data features. Always validate the impact of scaling on your model performance.
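
The leakage point can be made concrete with a few lines of NumPy. The sketch below uses synthetic data and a simple positional split, both chosen only for illustration. The scaling statistics are computed from the training rows alone and then reused for the test rows.

```python
import numpy as np

# Synthetic feature matrix: 100 rows, 2 features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# Simple positional split for illustration: first 80 rows train, last 20 test
X_train, X_test = X[:80], X[80:]

# Fit: statistics come from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Transform: apply the same training statistics to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

The scaled training data has mean 0 and unit variance by construction, while the test data will generally be close to, but not exactly at, those values. That small mismatch is expected and is the price of keeping test information out of the fit.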