
### **Data Cleaning**
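- Dropped rows that contain missing values and removed duplicate rows. Irrelevant columns (e.g., URLs, metadata) are excluded later, in the preprocessing step.
- Converted "YEAR (DISPLAY)" to a numeric type and standardized the numeric columns. The "Value" column, which contains the health indicator measurements, is still stored as text (`object`) at this stage.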

The cleaned dataset is saved as health_indicators_clean.csv and used for model training.
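
The dtype listing printed by the cleaning cell shows `Value` as `object`, so it is text rather than a number. Below is a minimal sketch of one way to coerce such a column. The toy rows and the bracketed-interval format (`"45.2 [40.1-50.3]"`) are assumptions made for illustration, not taken from the dataset, and `Value_numeric` is a hypothetical column name.

```python
import pandas as pd

# Toy rows standing in for the raw file; the string formats are assumed, not observed.
toy = pd.DataFrame({"Value": ["45.2 [40.1-50.3]", "12", "No data", None]})

# Extract the leading number from each entry; rows without one become NaN.
leading_number = toy["Value"].str.extract(r"^\s*(-?\d+(?:\.\d+)?)", expand=False)

# Convert the extracted text to a numeric dtype.
toy["Value_numeric"] = pd.to_numeric(leading_number, errors="coerce")

print(toy)
```

Keeping the original `Value` column next to the converted one makes it easy to inspect which entries failed to parse before deciding whether to drop or impute them.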

In [1]:
# data_cleaning.ipynb

# Import necessary libraries
import pandas as pd
import numpy as np

# Load the raw dataset
data = pd.read_csv('health_indicators_bfa.csv')

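# Display the first few rows to inspect the dataset
# (print is needed because a bare expression is only shown when it is the last line of a cell)
print(data.head())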

# Step 1: Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Drop rows with missing values (data.fillna() would be the alternative if imputing).
# .copy() gives an independent DataFrame, so the column assignment below does not
# trigger pandas' SettingWithCopyWarning
data_clean = data.dropna().copy()

# Step 2: Check data types and convert if necessary
print("Data types:\n", data_clean.dtypes)

# Convert the year column to numeric; errors='coerce' turns unparseable entries into NaN,
# which the dropna() in the preprocessing step later removes
data_clean['YEAR (DISPLAY)'] = pd.to_numeric(data_clean['YEAR (DISPLAY)'], errors='coerce')

# Step 3: Remove duplicates
data_clean = data_clean.drop_duplicates()

# Step 4: Normalize features if needed (for model training)
from sklearn.preprocessing import StandardScaler

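# Select the columns that are already stored as numbers; per the dtypes printed above,
# every other column is still `object`, so only 'YEAR (DISPLAY)' is scaled here
numerical_cols = data_clean.select_dtypes(include=[np.number]).columns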
scaler = StandardScaler()

# Scaling the numerical features
data_clean[numerical_cols] = scaler.fit_transform(data_clean[numerical_cols])

# Optional: Save the cleaned and scaled dataset for later use
data_clean.to_csv('health_indicators_clean.csv', index=False)


Missing values in each column:
 GHO (CODE)              0
GHO (DISPLAY)           0
GHO (URL)               0
YEAR (DISPLAY)          1
STARTYEAR               1
ENDYEAR                 1
REGION (CODE)           1
REGION (DISPLAY)        1
COUNTRY (CODE)          1
COUNTRY (DISPLAY)       1
DIMENSION (TYPE)     1667
DIMENSION (CODE)     1667
DIMENSION (NAME)     1685
Numeric               773
Value                  20
Low                  3097
High                 3097
dtype: int64
Data types:
 GHO (CODE)           object
GHO (DISPLAY)        object
GHO (URL)            object
YEAR (DISPLAY)       object
STARTYEAR            object
ENDYEAR              object
REGION (CODE)        object
REGION (DISPLAY)     object
COUNTRY (CODE)       object
COUNTRY (DISPLAY)    object
DIMENSION (TYPE)     object
DIMENSION (CODE)     object
DIMENSION (NAME)     object
Numeric              object
Value                object
Low                  object
High                 object
dtype: object




In [None]:
from google.colab import files
files.download('/content/health_indicators_clean.csv')



## **Machine Learning Models**
### **Vanilla Model (Baseline Model)**
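The first model is a **vanilla machine learning classifier** that serves as the baseline for this project. It is a simple model trained with scikit-learn's default settings and no hyperparameter tuning. Note that the defaults of `LogisticRegression` still apply L2 regularization (`C=1.0`), so the baseline is untuned rather than unregularized. It establishes the reference performance on the dataset.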

- **Model Type**: Logistic regression (scikit-learn's `LogisticRegression`).
- **Training**: The model is trained with default hyperparameters, without any tuning or advanced techniques.
- **Purpose**: The vanilla model is the reference point against which improvements made by the optimized model are measured.
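
The baseline-versus-optimized comparison described above can be sketched on synthetic data. This is only an illustration: the data comes from `make_classification`, and the "candidate" model with `C=0.1` is a hypothetical stand-in for whatever the optimized model turns out to be.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the health indicator dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: default hyperparameters, no tuning.
baseline = LogisticRegression().fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# Hypothetical candidate: same model family with a stronger penalty.
candidate = LogisticRegression(C=0.1).fit(X_train, y_train)
candidate_accuracy = candidate.score(X_test, y_test)

print(f"Baseline accuracy:  {baseline_accuracy:.3f}")
print(f"Candidate accuracy: {candidate_accuracy:.3f}")
```

Both models are scored on the same held-out split, which is what makes the two numbers comparable.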

In [2]:
# data_preprocessing.ipynb

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the cleaned dataset
data = pd.read_csv('health_indicators_clean.csv')

# Check for missing values
missing_values = data.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Remove missing values (if any); .copy() avoids pandas' SettingWithCopyWarning
# on the column assignment below
data_clean = data.dropna().copy()

# Convert 'YEAR (DISPLAY)' to numeric if not already
data_clean['YEAR (DISPLAY)'] = pd.to_numeric(data_clean['YEAR (DISPLAY)'], errors='coerce')

# Columns to exclude from the features: the target ('Value') and the
# indicator identifier, label and URL columns
columns_to_drop = ['Value', 'GHO (CODE)', 'GHO (DISPLAY)', 'GHO (URL)']
if 'SpatialDimValueCode' in data_clean.columns:
    columns_to_drop.append('SpatialDimValueCode')

# Define features (X) and target (y)
X = data_clean.drop(columns_to_drop, axis=1)
y = data_clean['Value']  # Target: The 'Value' column

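# Perform one-hot encoding for any remaining categorical columns.
# Columns stored as text (e.g. STARTYEAR, Low, High) are encoded one column per
# distinct value, which is why the feature count below is so large
categorical_cols = X.select_dtypes(include=['object']).columns
X = pd.get_dummies(X, columns=categorical_cols)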

# Verify that all columns are now numeric
print("Data types after encoding:\n", X.dtypes)

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the processed data
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns)
X_train_scaled_df.to_csv('X_train_scaled.csv', index=False)
X_test_scaled_df.to_csv('X_test_scaled.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)


Missing values in each column:
 GHO (CODE)           0
GHO (DISPLAY)        0
GHO (URL)            0
YEAR (DISPLAY)       1
STARTYEAR            0
ENDYEAR              0
REGION (CODE)        0
REGION (DISPLAY)     0
COUNTRY (CODE)       0
COUNTRY (DISPLAY)    0
DIMENSION (TYPE)     0
DIMENSION (CODE)     0
DIMENSION (NAME)     0
Numeric              0
Value                0
Low                  0
High                 0
dtype: int64
Data types after encoding:
 YEAR (DISPLAY)    float64
STARTYEAR_1945       bool
STARTYEAR_1946       bool
STARTYEAR_1947       bool
STARTYEAR_1948       bool
                   ...   
High_99.8            bool
High_99.9            bool
High_9918.828        bool
High_9930.274        bool
High_9958.0          bool
Length: 12409, dtype: object
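
The output above lists 12409 feature columns, with names such as `STARTYEAR_1945` and `High_99.8`, which suggests that number-like text columns were one-hot encoded value by value. A minimal sketch of converting such columns before encoding follows. The toy frame and the `numeric_like` list are assumptions for illustration; which columns qualify in the real data would need to be checked.

```python
import pandas as pd

# Toy frame: every column is text, mirroring the `object` dtypes in the raw data.
raw = pd.DataFrame({
    "STARTYEAR": ["2000", "2005", "2010"],
    "High": ["99.8", "9918.828", "9958.0"],
    "REGION (CODE)": ["AFR", "AFR", "AFR"],
})

# Encoding everything as-is creates one column per distinct string.
naive = pd.get_dummies(raw)

# Assumed list of columns that hold numbers stored as text.
numeric_like = ["STARTYEAR", "High"]
converted = raw.copy()
converted[numeric_like] = converted[numeric_like].apply(pd.to_numeric, errors="coerce")

# One-hot encode only what is still non-numeric after the conversion.
categorical_cols = converted.select_dtypes(exclude="number").columns
encoded = pd.get_dummies(converted, columns=categorical_cols)

print(naive.shape[1], "columns without conversion")
print(encoded.shape[1], "columns with conversion")
```

It is also worth checking whether columns such as `Numeric`, `Low` and `High` restate the measurement in `Value`; if they do, keeping them as features would leak the target into the inputs.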




In [None]:
# vanilla_model_training.ipynb

# Import necessary libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Load the preprocessed and scaled training and test datasets
X_train = pd.read_csv('X_train_scaled.csv')
X_test = pd.read_csv('X_test_scaled.csv')
y_train = pd.read_csv('y_train.csv')
y_test = pd.read_csv('y_test.csv')

# Initialize the vanilla model (Logistic Regression)
vanilla_model = LogisticRegression()

# Train the model on the training data
vanilla_model.fit(X_train, y_train.values.ravel())

# Make predictions on the test set
y_pred = vanilla_model.predict(X_test)

# Evaluate the model
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Save the trained model
joblib.dump(vanilla_model, 'vanilla_model.pkl')
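
`LogisticRegression` expects discrete class labels, while `Value` holds indicator measurements. One possible way to obtain a classification target is to bin the measurements into classes first. The sketch below assumes `Value` has already been converted to numbers; the three quantile bins and the labels `low`, `medium` and `high` are hypothetical choices, not something the notebook defines.

```python
import pandas as pd

# Toy measurements standing in for a numeric 'Value' column.
values = pd.Series([3.1, 12.4, 25.0, 47.8, 63.2, 88.9], name="Value")

# Split into three equally populated bins and name them.
risk_class = pd.qcut(values, q=3, labels=["low", "medium", "high"])

print(risk_class.value_counts())
```

With a target like `risk_class`, `classification_report` and `confusion_matrix` summarize three classes instead of one class per distinct measurement. For real data the bin edges should be computed on the training split only and then reused on the test split.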
