### Project: Life Expectancy (WHO) Prediction with Streamlit
We are working with a **World Health Organisation (WHO)** dataset, which contains indicators on global health trends, such as life expectancy, mortality rates, access to sanitation, etc. 

### Objective
- Explore health-related statistics for countries over time (e.g., life expectancy, fertility rates, or other health indicators).
- Train a Linear Regression model and use it to predict life expectancy.
- Display visualizations such as predicted vs. actual values and other charts.
- Deploy the model with Streamlit so the user can input health data (e.g., health expenditure, mortality rate) and get a prediction.

### Steps to build the Project:

1. **Obtain the WHO Dataset**.
2. **Preprocess and Clean the Data**.
3. **Visualize the Data**.
4. **Train a Model** to predict life expectancy (or another health indicator).
5. **Deploy the Model using Streamlit**.

### Dataset
We'll use the Life Expectancy (WHO) dataset from Kaggle, which contains data for statistical analysis of the factors influencing life expectancy.

Dataset URL: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who

### Load and Preprocess the Data

In [5]:
import pandas as pd
import numpy as np
import streamlit as st
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Load life expectancy dataset
This cell loads the dataset from a local file path; change the path to wherever the CSV is saved on your machine.

In [15]:
df = pd.read_csv(r"C:\Users\Oguntuga\Downloads\Life Expectancy Data.csv")

# Explore data

In [17]:
print(df.head())

       Country  Year      Status  Life expectancy   Adult Mortality  \
0  Afghanistan  2015  Developing              65.0            263.0   
1  Afghanistan  2014  Developing              59.9            271.0   
2  Afghanistan  2013  Developing              59.9            268.0   
3  Afghanistan  2012  Developing              59.5            272.0   
4  Afghanistan  2011  Developing              59.2            275.0   

   infant deaths  Alcohol  percentage expenditure  Hepatitis B  Measles   ...  \
0             62     0.01               71.279624         65.0      1154  ...   
1             64     0.01               73.523582         62.0       492  ...   
2             66     0.01               73.219243         64.0       430  ...   
3             69     0.01               78.184215         67.0      2787  ...   
4             71     0.01                7.097109         68.0      3013  ...   

   Polio  Total expenditure  Diphtheria    HIV/AIDS         GDP  Population  \
0    6.

In [19]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life expectancy                  2928 non-null   float64
 4   Adult Mortality                  2928 non-null   float64
 5   infant deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   percentage expenditure           2938 non-null   float64
 8   Hepatitis B                      2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10   BMI                             2904 non-null   float64
 11  under-five deaths                2938 non-null   int64  
 12  Polio               
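
The non-null counts above show that several columns have gaps. A minimal sketch of how those gaps could be counted directly, and how many rows a blanket `dropna()` would discard, is shown below. The `toy` frame and its values are made up for illustration; with the real data you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (made-up values) with gaps, mimicking a few real columns
toy = pd.DataFrame({
    "Life expectancy": [65.0, np.nan, 59.9, 59.5],
    "Alcohol": [0.01, 0.01, np.nan, np.nan],
    "Hepatitis B": [65.0, np.nan, np.nan, np.nan],
    "Polio": [6.0, 58.0, 62.0, 67.0],
})

# Number of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)

# Rows that survive a blanket dropna() versus dropping only on the target
rows_all = len(toy.dropna())
rows_target = len(toy.dropna(subset=["Life expectancy"]))
print(rows_all, rows_target)
```

Comparing the two row counts gives a quick sense of how much data is lost when every incomplete row is removed.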

In [53]:
#### Summary Statistics
print(df.describe())

              Year  Life expectancy   Adult Mortality  infant deaths  \
count  2938.000000       2928.000000      2928.000000    2938.000000   
mean   2007.518720         69.224932       164.796448      30.303948   
std       4.613841          9.523867       124.292079     117.926501   
min    2000.000000         36.300000         1.000000       0.000000   
25%    2004.000000         63.100000        74.000000       0.000000   
50%    2008.000000         72.100000       144.000000       3.000000   
75%    2012.000000         75.700000       228.000000      22.000000   
max    2015.000000         89.000000       723.000000    1800.000000   

           Alcohol  percentage expenditure  Hepatitis B       Measles   \
count  2744.000000             2938.000000  2385.000000    2938.000000   
mean      4.602861              738.251295    80.940461    2419.592240   
std       4.052413             1987.914858    25.070016   11467.272489   
min       0.010000                0.000000     1.000000

# Preprocess the dataset
This function preprocesses the dataset by dropping missing values
and separating the target and feature variables.

In [42]:
def preprocess_data(df):
    # Column names in this CSV carry stray spaces (e.g. 'Life expectancy '),
    # so strip them before looking anything up
    df = df.copy()
    df.columns = df.columns.str.strip()
    target = 'Life expectancy'

    # Check that the target column exists in the dataframe
    if target not in df.columns:
        st.error(f"Dataset does not contain '{target}' column.")
        return None, None  # Return None if the column is missing

    # Drop rows with a missing target or missing feature values
    df = df.dropna()

    # Drop identifier columns that are not used as features
    df = df.drop(columns=['Country', 'Year'], errors='ignore')

    # Encode the text column 'Status' as 0/1 so every feature is numeric
    # and can be passed to StandardScaler
    if 'Status' in df.columns:
        df = df.assign(Status=(df['Status'] == 'Developing').astype(int))

    # Separate the target variable and features
    X = df.drop(columns=target)
    y = df[target]

    # Ensure X and y are not empty
    if X.empty or y.empty:
        st.error("Dataframe preprocessing failed: Empty feature or target.")
        return None, None

    return X, y
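
Dropping every row with a missing value is the approach used in this project. As a hedged alternative, not part of the original notebook, the sketch below fills gaps with each column's median using scikit-learn's `SimpleImputer`. The `features` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up feature values with one gap per column
features = pd.DataFrame({
    "Alcohol": [0.5, np.nan, 4.0, 8.0],
    "Hepatitis B": [65.0, 80.0, np.nan, 95.0],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
print(filled)
```

In a real pipeline the imputer would be fitted on the training split only and then applied to the test split, in the same way as the scaler.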
    

In [45]:
print(df.columns)

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')
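
The output above shows that several column names contain leading or trailing spaces (for example `'Life expectancy '` and `' BMI '`), so a lookup by the clean name fails. A minimal sketch of one way to normalise the names with pandas follows; the empty `toy` frame is only a stand-in that reuses a few of the printed names.

```python
import pandas as pd

# A few of the column names as printed above, including the stray spaces
raw_columns = ["Country", "Life expectancy ", "Measles ", " BMI ", " HIV/AIDS"]
toy = pd.DataFrame(columns=raw_columns)

# Lookup by the clean name fails before stripping
found_before = "Life expectancy" in toy.columns

# Trim leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()
found_after = "Life expectancy" in toy.columns

print(found_before, found_after)
print(list(toy.columns))
```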


# Train a Linear Regression model
This function trains the model using linear regression.

In [74]:
def train_model(X_train, y_train):
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model
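
To illustrate what the fitted model contains, the sketch below trains `LinearRegression` on synthetic data generated from a known linear rule and reads back the intercept and coefficients. The data and the rule `y = 70 + 3*x0 - 2*x1` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features: 200 rows, 2 columns
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))

# Synthetic, noise-free target built from a known linear rule
y_toy = 70 + 3 * X_toy[:, 0] - 2 * X_toy[:, 1]

# Fit the model and inspect what it learned
toy_model = LinearRegression().fit(X_toy, y_toy)
print(toy_model.intercept_, toy_model.coef_)
```

Because the features in this project are standardised before training, each coefficient can be read as the change in predicted life expectancy per one standard deviation of that feature.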

# Evaluate the model
This function evaluates the trained model and returns the mean squared error, the R-squared value, and the predictions on the test set.

In [77]:

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2, y_pred
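
As a worked example of the two metrics, the sketch below computes the mean squared error, MSE = mean((y - ŷ)²), and R² = 1 - SS_res / SS_tot by hand and compares them with the scikit-learn functions. The four values in `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted life expectancies
y_true = np.array([60.0, 65.0, 70.0, 75.0])
y_hat = np.array([62.0, 64.0, 71.0, 73.0])

# Mean squared error: average of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Root mean squared error is in the same unit as the target (years)
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse)
```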


In [79]:
# Call the method with parentheses; print(df.mean) only prints the bound method object
print(df.mean(numeric_only=True))


# Feature scaling for model input
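This function scales the features using StandardScaler. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training.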


In [84]:
def scale_features(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler
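
A small sketch of what this scaling does is shown below, using made-up arrays. The scaler learns the mean and standard deviation from the training rows, so the scaled training columns have mean 0 and standard deviation 1, while the test row is transformed with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with two features on different scales
X_train_toy = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])
X_test_toy = np.array([[400.0, 4.0]])

toy_scaler = StandardScaler()
X_train_s = toy_scaler.fit_transform(X_train_toy)  # learn mean/std, then scale
X_test_s = toy_scaler.transform(X_test_toy)        # reuse the training mean/std

print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
print(X_test_s)
```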

# Streamlit interface
The main function that runs the Streamlit app.

In [94]:
# Streamlit interface
def main():

    st.title('WHO Life Expectancy Prediction')

    # Load the dataset (change the file path as needed)
    df = pd.read_csv(r"C:\Users\Oguntuga\Downloads\Life Expectancy Data.csv")
    
    st.write("Dataset Preview:")
    st.write(df.head())

    # Visualize correlation matrix (Only numeric columns)
    st.subheader('Correlation Matrix')

    # Select only numeric columns for correlation matrix
    numeric_cols = df.select_dtypes(include=[np.number])  # Select only numeric columns
    corr = numeric_cols.corr()  # Calculate correlation matrix on numeric columns
    
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(corr, annot=True, cmap='coolwarm', ax=ax)
    st.pyplot(fig)

    # Preprocess the data
    X, y = preprocess_data(df)
    if X is None or y is None:  # Check if preprocessing returned None
        return  # Exit the function if preprocessing failed

    # Train-Test Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Feature Scaling
    X_train_scaled, X_test_scaled, scaler = scale_features(X_train, X_test)

    # Train model
    model = train_model(X_train_scaled, y_train)

    # Evaluate model
    mse, r2, y_pred = evaluate_model(model, X_test_scaled, y_test)
    st.write(f'Mean Squared Error: {mse:.2f}')
    st.write(f'R²: {r2:.2f}')

    # Plot True vs Predicted Life Expectancy
    st.subheader('True vs Predicted Life Expectancy')
    fig, ax = plt.subplots()
    ax.scatter(y_test, y_pred, color='blue')
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--')
    ax.set_xlabel('True Life Expectancy')
    ax.set_ylabel('Predicted Life Expectancy')
    st.pyplot(fig)

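    # User input for new data prediction
    st.sidebar.header("Enter health data for prediction")

    # Sample input fields (only two of the features are exposed as sliders)
    health_expenditure = st.sidebar.slider('Health Expenditure', 0, 5000, 100)
    mortality_rate = st.sidebar.slider('Adult Mortality', 0, 750, 150)

    # Prepare input data for prediction. The scaler and model were fitted on
    # every feature column, so start from the training-set mean of each
    # feature and override only the two values the user controls.
    user_data = X_train.mean().to_frame().T
    user_data['percentage expenditure'] = health_expenditure
    user_data['Adult Mortality'] = mortality_rate
    user_data_scaled = scaler.transform(user_data)

    # Predict life expectancy
    predicted_life_expectancy = model.predict(user_data_scaled)
    st.write(f'Predicted Life Expectancy: {predicted_life_expectancy[0]:.2f}')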

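# Run the Streamlit app: save this code as a script (for example app.py)
# and launch it from a terminal with `streamlit run app.py`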
if __name__ == '__main__':
    main()


