# Life Expectancy: Exploratory Data Analysis

    The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The datasets are made available to the public for the purpose of health data analysis. The dataset related to life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the critical, more representative factors were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years.

    Goal: Find a set of features that affect Life Expectancy.
1. Data Cleaning
    Q1. Detect and Deal with the Missing values
    Q2. Detect and handle the outliers
2. Data Exploration and Visualization
    Q3. What is the Life Expectancy Country-wise?
    Q4. How different diseases affect life expectancy in developed and developing countries
    Q5. What effect do Schooling and Alcohol have on Life Expectancy?
3. Summary



# Imports and Dataset Load


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
import os
%matplotlib inline

In [None]:
df = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv')

In [None]:
df.head()

# Section 1: Data Cleaning

In order to properly clean the data, it is important to understand the variables presented in the data. There are a number of things important to know about each variable:

1. What does the variable mean and what type of variable is it (Nominal/Ordinal/Interval/Ratio)?
2. Does the variable have missing values? If so, what should be done about them?
3. Does the variable have outliers? If so, what should be done about them?

# Dataset Description

This dataset is made up of data collected by the World Health Organization (WHO) from nations all over the world. It is a compilation of several indicators for a given country and year; in essence, it is a time series of several metrics broken down by country.

The column names themselves aren't particularly clean, so here is a little cleanup of the column names before going into the variable descriptions.


In [None]:
orig_cols = list(df.columns)
new_cols = []
for col in orig_cols:
    new_cols.append(col.strip().replace('  ', ' ').replace(' ', '_').lower())
df.columns = new_cols
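
As a quick illustration of what this cleaning does, the sketch below applies the same chain of string operations to a few sample headers. The values in `raw_cols` are assumed examples of the kind of untidy names involved (stray spaces, doubled spaces, mixed case), not a verified copy of the CSV header.

```python
# Hypothetical sample headers (assumed for illustration, not read from the CSV).
raw_cols = ['Country', 'Life expectancy ', ' BMI ', ' thinness  1-19 years']

# Same steps as the loop above: trim, collapse double spaces, use underscores, lowercase.
clean_cols = [c.strip().replace('  ', ' ').replace(' ', '_').lower() for c in raw_cols]
print(clean_cols)  # ['country', 'life_expectancy', 'bmi', 'thinness_1-19_years']
```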
                    

# Variable Descriptions

1. country (Nominal) - the country the indicators are from (e.g. United States of America or Congo)
2. year (Ordinal) - the calendar year the indicators are from (ranging from 2000 to 2015)
3. status (Nominal) - whether a country is considered to be 'Developing' or 'Developed' by WHO standards
4. life_expectancy (Ratio) - the life expectancy of people in years for a particular country and year
5. adult_mortality (Ratio) - the adult mortality rate per 1000 population (i.e. number of people dying between 15 and 60 years per 1000 population); if the rate is 263 then that means 263 people will die out of 1000 between the ages of 15 and 60; another way to think of this is that the chance an individual will die between 15 and 60 is 26.3%

6. infant_deaths (Ratio) - number of infant deaths per 1000 population; similar to above, but for infants
7. alcohol (Ratio) - a country's alcohol consumption rate measured as liters of pure alcohol consumption per capita
8. percentage_expenditure (Ratio) - expenditure on health as a percentage of Gross Domestic Product (gdp)
9. hepatitis_b (Ratio) - number of 1 year olds with Hepatitis B immunization over all 1 year olds in population
10. measles (Ratio) - number of reported Measles cases per 1000 population
11. bmi (Interval/Ordinal) - average Body Mass Index (BMI) of a country's total population
12. under-five_deaths (Ratio) - number of deaths of children under the age of five per 1000 population
13. polio (Ratio) - number of 1 year olds with Polio immunization over the number of all 1 year olds in population
14. total_expenditure (Ratio) - government expenditure on health as a percentage of total government expenditure
15. diphtheria (Ratio) - Diphtheria tetanus toxoid and pertussis (DTP3) immunization rate of 1 year olds
16. hiv/aids (Ratio) - deaths per 1000 live births caused by HIV/AIDS for people under 5; number of people under 5 who die due to HIV/AIDS per 1000 births
17. gdp (Ratio) - Gross Domestic Product per capita
18. population (Ratio) - population of a country
19. thinness_1-19_years (Ratio) - rate of thinness among people aged 10-19 (Note: variable should be renamed to thinness_10-19_years to more accurately represent the variable)
20. thinness_5-9_years (Ratio) - rate of thinness among people aged 5-9
21. income_composition_of_resources (Ratio) - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
22. schooling (Ratio) - average number of years of schooling of a population

As noted above, renaming the variable thinness_1-19_years to thinness_10-19_years is worthwhile because it more accurately represents what the variable measures.

In [None]:
df.rename(columns={'thinness_1-19_years':'thinness_10-19_years'}, inplace=True)

In [None]:
df.info()

In [None]:
df.size

In [None]:
df.shape

In [None]:
df.columns

# Missing Values
There are a few things that must be done concerning missing values:

1. Detection of missing values
    Find nulls
    Could a null be signified by anything other than null? Zero values perhaps?
2. Dealing with missing values
    Fill nulls? Impute or Interpolate
    Eliminate nulls?

# Missing Values Detection
The simplest and quickest way here is to do a quick df.describe() and examine each variable individually to check whether the values make sense given the variable's description.


In [None]:
df.describe().iloc[:, 1:]

Things that may not make sense from the perspective of the above:

1. An adult mortality rate of 1? This is most likely a measurement error, but which values would make sense here? It may be necessary to convert values below a certain threshold to null.
2. Infant deaths as low as 0 per 1000? That is not plausible, so those values will be treated as null.
3. On the other hand, a maximum of 1800 infant deaths is most likely an outlier, although it is feasible in a country with extremely high birth rates and a relatively small total population; this can be addressed later.
4. A BMI ranging from 1 to 87.3? A BMI of 15 or below is considered severely underweight and a BMI of 40 or above is considered morbidly obese, so many of these values appear unrealistic; this variable may not be worth investigating further.
5. As with infant deaths, values of zero for Under Five Deaths are unlikely (if not impossible).
6. Is a GDP per capita as low as $1.68 (USD) feasible?
7. A population of 34 for an entire country?

In [None]:
plt.figure(figsize=(15,10))
for i, col in enumerate(['adult_mortality', 'infant_deaths', 'bmi', 'under-five_deaths', 'gdp', 'population'], start=1):
    plt.subplot(2, 3, i)
    df.boxplot(col)

There are a few of the above that could simply be outliers, but there are some that almost certainly have to be errors of some sort. Of the above variables, changes to null will be made for the following since these numbers don't make any sense:

1. Adult mortality rates lower than the 5th percentile
2. Infant deaths of 0
3. BMI less than 10 and greater than 50
4. Under Five deaths of 0

In [None]:
# Adult mortality below the 5th percentile is treated as missing.
mort_5_percentile = np.percentile(df.adult_mortality.dropna(), 5)
df['adult_mortality'] = df['adult_mortality'].mask(df['adult_mortality'] < mort_5_percentile)
# Zero infant deaths are treated as missing.
df['infant_deaths'] = df['infant_deaths'].replace(0, np.nan)
# BMI values below 10 or above 50 are treated as missing.
df['bmi'] = df['bmi'].mask((df['bmi'] < 10) | (df['bmi'] > 50))
# Zero under-five deaths are treated as missing.
df['under-five_deaths'] = df['under-five_deaths'].replace(0, np.nan)

Check the missing values in each column:

In [None]:
df.info()

Because there appear to be a significant number of null values, it may be more useful to break down the data into those that contain nulls in order to investigate further.

The function below tries to accomplish this by only returning columns with (explicit) nulls, keeping a running total of those columns with nulls as well as their location in the dataframe, and returning the count of nulls in a given column as well as the percent of nulls out of all the values in the column.

In [None]:
def nulls_breakdown(df=df):
    df_cols = list(df.columns)
    cols_total_count = len(list(df.columns))
    cols_count = 0
    for loc, col in enumerate(df_cols):
        null_count = df[col].isnull().sum()
        total_count = df[col].isnull().count()
        percent_null = round(null_count/total_count*100, 2)
        if null_count > 0:
            cols_count += 1
            print('[iloc = {}] {} has {} null values: {}% null'.format(loc, col, null_count, percent_null))
    cols_percent_null = round(cols_count/cols_total_count*100, 2)
    print('Out of {} total columns, {} contain null values; {}% columns contain null values.'.format(cols_total_count, cols_count, cols_percent_null))

In [None]:
nulls_breakdown()
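
The same breakdown can also be sketched with vectorized pandas calls instead of an explicit loop. This is only an alternative illustration: the `toy` frame and its values are hypothetical stand-ins for the cleaned dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the cleaned dataset.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'alcohol': [1.2, np.nan, 3.4, np.nan],
    'gdp': [500.0, 520.0, np.nan, 900.0],
})

# Count and percentage of nulls per column, keeping only columns that have any.
null_summary = pd.DataFrame({
    'null_count': toy.isnull().sum(),
    'percent_null': (toy.isnull().mean() * 100).round(2),
})
null_summary = null_summary[null_summary['null_count'] > 0]
print(null_summary)
```

On the real data, `toy` would be replaced by `df`; the result is a small table that can be sorted or filtered like any other DataFrame.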

# Dealing with Missing Values

Since nearly half of the BMI variable's values are null, it is likely best to remove this variable altogether.

In [None]:
df.drop(columns='bmi', inplace=True)

In [None]:
nulls_breakdown()

Because many columns contain null values, and because this is time series data organized by country, the natural course of action would be to interpolate the data by country. However, interpolating by country fills in no values, because a country that has nulls in a variable is null for that variable in every year. Imputation by year may therefore be the best option: each null is filled with the mean of that variable for the same year, computed as follows.

In [None]:
imputed_data = []
for year in list(df.year.unique()):
    year_data = df[df.year == year].copy()
    for col in list(year_data.columns)[3:]:
        year_data[col] = year_data[col].fillna(year_data[col].dropna().mean()).copy()
    imputed_data.append(year_data)
df = pd.concat(imputed_data).copy()
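
To make the reasoning concrete, the sketch below contrasts the two strategies on a toy frame in which one country never reports a variable. The data and the names `toy`, `by_country` and `by_year` are hypothetical; the by-year step uses `groupby(...).transform('mean')`, which is assumed here to be an equivalent way of expressing the loop above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): country B never reports 'alcohol'.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'year':    [2000, 2001, 2000, 2001, 2000, 2001],
    'alcohol': [2.0, 4.0, np.nan, np.nan, 6.0, 8.0],
})

# Interpolating within each country cannot fill B: there is nothing to interpolate from.
by_country = toy.groupby('country')['alcohol'].transform(lambda s: s.interpolate())

# Filling with the mean of the same year borrows information from the other countries.
by_year = toy['alcohol'].fillna(toy.groupby('year')['alcohol'].transform('mean'))

print(by_country.isnull().sum())  # 2 nulls remain after per-country interpolation
print(by_year.tolist())           # [2.0, 4.0, 4.0, 6.0, 6.0, 8.0]
```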

All the missing values have now been filled by the imputation above.

In [None]:
nulls_breakdown(df)

# Outliers

1. Detect the outliers
    Boxplots/histograms
    Tukey's Method
2. Deal with outliers
    Drop outliers
    Limit/Winsorize outliers

# Outlier Detection

    To visually see if there are any outliers, a boxplot and histogram will be constructed for each continuous variable.

In [None]:
cont_vars = list(df.columns)[3:]
def outliers_visual(data):
    plt.figure(figsize=(15, 40))
    i = 0
    for col in cont_vars:
        i += 1
        plt.subplot(9, 4, i)
        plt.boxplot(data[col])
        plt.title('{} boxplot'.format(col))
        i += 1
        plt.subplot(9, 4, i)
        plt.hist(data[col])
        plt.title('{} histogram'.format(col))
    plt.show()
outliers_visual(df)

    Visually, there are a lot of outliers for all of these variables, including the target variable, life expectancy. The same check will now be done statistically using Tukey's method, with outliers defined as anything more than 1.5 times the IQR below the first quartile or above the third quartile.

In [None]:
def outlier_count(col, data=df):
    print(15*'-' + col + 15*'-')
    q75, q25 = np.percentile(data[col], [75, 25])
    iqr = q75 - q25
    min_val = q25 - (iqr*1.5)
    max_val = q75 + (iqr*1.5)
    outlier_count = len(np.where((data[col] > max_val) | (data[col] < min_val))[0])
    outlier_percent = round(outlier_count/len(data[col])*100, 2)
    print('Number of outliers: {}'.format(outlier_count))
    print('Percent of data that is outlier: {}%'.format(outlier_percent))

In [None]:
for col in cont_vars:
    outlier_count(col)
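
As a small worked example of Tukey's method, the sketch below computes the fences for a hypothetical ten-value sample (the numbers are made up for illustration and are not taken from the dataset).

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical sample

q75, q25 = np.percentile(values, [75, 25])
iqr = q75 - q25
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr

# Anything outside the fences counts as an outlier.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q25, q75, iqr)             # 3.25 7.75 4.5
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers.tolist())         # [100]
```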

    It appears there is a fair number of outliers in this dataset.

# Dealing with Outliers

1. Remove the outliers (best avoided in order to keep as much information as possible)
2. Set upper and/or lower boundaries for values (Winsorize the data)
3. Transform the data (logarithm, inverse, square root, and so on).
    advantage: can 'normalise' data and remove outliers 
    disadvantage: can't be applied to variables with values of 0 or less

    Because each variable has a different number of outliers and outliers on both sides of the data, the ideal approach is to winsorize (limit) the values for each variable separately until no outliers remain. By going variable by variable, with the flexibility to select a lower and/or upper limit for winsorization, the code below allows me to do just that.
    
    By default, the function will display two boxplots for the variable side by side (one boxplot of the original data, and one of the winsorized data). The winsorized data will be kept in the wins_dict dictionary so that it can be conveniently retrieved later, once a suitable limit has been determined by visual examination.

In [None]:
def test_wins(col, lower_limit=0, upper_limit=0, show_plot=True):
    wins_data = winsorize(df[col], limits=(lower_limit, upper_limit))
    wins_dict[col] = wins_data
    if show_plot == True:
        plt.figure(figsize=(15,5))
        plt.subplot(121)
        plt.boxplot(df[col])
        plt.title('original {}'.format(col))
        plt.subplot(122)
        plt.boxplot(wins_data)
        plt.title('wins=({},{}) {}'.format(lower_limit, upper_limit, col))
        plt.show()

In [None]:
wins_dict = {}
test_wins(cont_vars[0], lower_limit=.01, show_plot=True)
test_wins(cont_vars[1], upper_limit=.04, show_plot=True)
test_wins(cont_vars[2], upper_limit=.05, show_plot=False)
test_wins(cont_vars[3], upper_limit=.0025, show_plot=False)
test_wins(cont_vars[4], upper_limit=.135, show_plot=False)
test_wins(cont_vars[5], lower_limit=.1, show_plot=False)
test_wins(cont_vars[6], upper_limit=.19, show_plot=False)
test_wins(cont_vars[7], upper_limit=.05, show_plot=False)
test_wins(cont_vars[8], lower_limit=.1, show_plot=False)
test_wins(cont_vars[9], upper_limit=.02, show_plot=False)
test_wins(cont_vars[10], lower_limit=.105, show_plot=False)
test_wins(cont_vars[11], upper_limit=.185, show_plot=False)
test_wins(cont_vars[12], upper_limit=.105, show_plot=False)
test_wins(cont_vars[13], upper_limit=.07, show_plot=False)
test_wins(cont_vars[14], upper_limit=.035, show_plot=False)
test_wins(cont_vars[15], upper_limit=.035, show_plot=False)
test_wins(cont_vars[16], lower_limit=.05, show_plot=False)
test_wins(cont_vars[17], lower_limit=.025, upper_limit=.005, show_plot=False)
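
To show what the limits passed to `winsorize` actually do, the sketch below applies it to a hypothetical ten-value sample. With `limits=(0.1, 0.1)`, the lowest 10% and highest 10% of the points (one value at each end here) are replaced by the nearest remaining value, so the length of the data is unchanged.

```python
import numpy as np
from scipy.stats.mstats import winsorize

values = np.array([-50, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # hypothetical sample

# Extreme values are pulled in to the nearest remaining value rather than dropped.
clipped = np.asarray(winsorize(values, limits=(0.1, 0.1)))

print(clipped.tolist())  # [2, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```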

    All of the variables have now been winsorized as little as possible in order to preserve as much data as feasible while still removing outliers. Finally, small boxplots of each variable's winsorized data are shown to demonstrate that the outliers have been dealt with.

In [None]:
plt.figure(figsize=(15,5))
for i, col in enumerate(cont_vars, 1):
    plt.subplot(2, 9, i)
    plt.boxplot(wins_dict[col])
plt.tight_layout()
plt.show()

    The data cleaning section is now complete: both the missing values and the outliers have been dealt with.

# Data Exploration

In [None]:
wins_df = df.iloc[:, 0:3].copy()  # copy so new columns are not assigned to a slice of df
for col in cont_vars:
    wins_df[col] = wins_dict[col]

In [None]:
wins_df.describe()

# VISUAL DISTRIBUTIONS

In [None]:
plt.figure(figsize=(15, 20))
for i, col in enumerate(cont_vars, 1):
    plt.subplot(5, 4, i)
    plt.hist(wins_df[col])
    plt.title(col)

In [None]:
wins_df[cont_vars].corr()

In [None]:
mask = np.triu(wins_df[cont_vars].corr())
plt.figure(figsize=(15,6))
sns.heatmap(wins_df[cont_vars].corr(), annot=True, fmt='.2g', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=mask)
plt.ylim(18, 0)
plt.title('Correlation Matrix Heatmap')
plt.show()

Some general takeaways from the graphic above:

1. Life Expectancy (target variable) appears to be relatively highly correlated (negatively or positively) with:
    Adult Mortality (negative)
    HIV/AIDS (negative)
    Income Composition of Resources (positive)
    Schooling (positive)
2. Life expectancy (target variable) is very weakly correlated with population (nearly no correlation at all)
3. Infant deaths and Under Five deaths are extremely highly correlated
4. Percentage Expenditure and GDP are relatively highly correlated
5. Hepatitis B vaccine rate is relatively positively correlated with Polio and Diphtheria vaccine rates
6. Polio vaccine rate and Diphtheria vaccine rate are very positively correlated
7. HIV/AIDS is relatively negatively correlated with Income Composition of Resources
8. Thinness of 5-9 year olds rate and Thinness of 10-19 year olds rate are extremely highly correlated
9. Income Composition of Resources and Schooling are very highly correlated
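
Takeaways like these can also be read off programmatically by ranking each variable's correlation with the target. The sketch below does this on a toy frame; the values are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook.
toy = pd.DataFrame({
    'life_expectancy': [55.0, 60.0, 65.0, 70.0, 75.0],
    'adult_mortality': [300.0, 250.0, 200.0, 150.0, 100.0],
    'schooling':       [6.0, 8.0, 9.0, 12.0, 14.0],
    'population':      [5.0, 1.0, 9.0, 1.0, 5.0],
})

# Pearson correlation of every other variable with the target, strongest first.
target_corr = toy.corr()['life_expectancy'].drop('life_expectancy')
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

On the real data, `toy` would be replaced by `wins_df[cont_vars]`.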

# Life Expectancy Country-wise

In [None]:
# Mean life expectancy per country, sorted from lowest to highest.
life_country = wins_df.groupby('country')['life_expectancy'].mean().sort_values(ascending=True)
my_colors = list('rgbkymc')
life_country.plot(kind='bar', figsize=(50,15), fontsize=25,color=my_colors)
plt.title("Life_Expectancy Country wise",fontsize=40)
plt.xlabel("Country",fontsize=35)
plt.ylabel("Average Life expectancy",fontsize=35)
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()

    Japan has the highest life expectancy, followed by Sweden, while Sierra Leone has the lowest.

In [None]:
plt.figure(figsize=(10, 5))
plt.subplot(121)
wins_df.status.value_counts().plot(kind='bar')
plt.title('Count of Rows by Country Status')
plt.xlabel('Country Status')
plt.ylabel('Count of Rows')
plt.xticks(rotation=0)

plt.subplot(122)
wins_df.status.value_counts().plot(kind='pie', autopct='%.2f')
plt.ylabel('')
plt.title('Country Status Pie Chart')

plt.show()

#  Life Expectancy Comparison in Developed and Developing Countries

    First, looking at how life expectancy has changed over the years may be helpful.
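
A minimal sketch of that comparison is given below, assuming the mean life expectancy per year and status is the quantity of interest. The `toy` frame and its values are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Toy data (hypothetical values) using the notebook's column names.
toy = pd.DataFrame({
    'year':   [2000, 2000, 2000, 2015, 2015, 2015],
    'status': ['Developed', 'Developing', 'Developing'] * 2,
    'life_expectancy': [76.0, 60.0, 64.0, 80.0, 68.0, 70.0],
})

# Mean life expectancy per year, with one column per country status.
trend = toy.groupby(['year', 'status'])['life_expectancy'].mean().unstack('status')
print(trend)

# With matplotlib available, trend.plot(marker='o') would draw one line per status.
```

On the real data, `toy` would be replaced by `wins_df`.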

# How different diseases affect life expectancy in developed and developing countries

In [None]:
sns.pairplot(wins_df, x_vars=["hepatitis_b"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=8, kind="reg");





    For developed countries, life expectancy decreases slightly as hepatitis B immunization coverage rises, whereas for developing countries it rises gradually, which suggests that developing countries are taking measures to roll out hepatitis B vaccination.

In [None]:
sns.pairplot(wins_df, x_vars=["measles"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=8, kind="reg");

    In the case of measles, according to the graph, developed countries seem to have vaccines available to tackle the disease, whereas in developing countries life expectancy decreases as measles cases rise, perhaps because of a lack of resources to handle measles.

In [None]:
sns.pairplot(wins_df, x_vars=["polio"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=8, kind="reg");

    Developed countries seem to have successfully eradicated polio through vaccination, whereas in developing countries life expectancy is low at low immunization coverage and rises gradually as coverage increases, perhaps because proper doses are being given.

In [None]:
sns.pairplot(wins_df, x_vars=["diphtheria"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=8, kind="reg");

    Developed countries seem to have successfully eradicated diphtheria through vaccination, whereas in developing countries life expectancy is low at low immunization coverage and rises gradually as coverage increases, perhaps because proper doses are being given.

In [None]:
sns.pairplot(wins_df, x_vars=["hiv/aids"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=8, kind="reg");

    The graph shows that developing countries have not yet been able to control HIV/AIDS: life expectancy falls at a rapid rate as HIV/AIDS deaths increase. This may be due to a rising population and a lack of education.

# What effect do Schooling and Alcohol have on Life Expectancy

In [None]:
sns.pairplot(wins_df, x_vars=["schooling"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=10, kind="reg")

    Schooling appears to affect life expectancy more in developing countries than in developed countries. This may be because education is more established and prevalent in wealthier countries.
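
One way to put a number on this comparison is to fit a straight line per status and compare the slopes. The sketch below uses `np.polyfit` on a toy frame; the values are hypothetical and chosen only to illustrate the calculation, not to reproduce the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) using the notebook's column names.
toy = pd.DataFrame({
    'status': ['Developing'] * 4 + ['Developed'] * 4,
    'schooling': [4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0],
    'life_expectancy': [50.0, 56.0, 62.0, 68.0, 78.0, 79.0, 80.0, 81.0],
})

# Least-squares slope per status: years of life expectancy per extra year of schooling.
slopes = {
    status: np.polyfit(group['schooling'], group['life_expectancy'], deg=1)[0]
    for status, group in toy.groupby('status')
}
print({status: round(slope, 2) for status, slope in slopes.items()})
```

A steeper slope for one group means that each additional year of schooling is associated with a larger gain in life expectancy for that group; it does not by itself establish causation.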

In [None]:
sns.pairplot(wins_df, x_vars=["alcohol"], y_vars=["life_expectancy"],
             hue="status",markers=["o", "x"], height=8, kind="reg")

    Alcohol consumption is positively related to life expectancy in developing countries and negatively related in developed countries. I’m guessing that the positive relation arises because only wealthier countries can afford alcohol, or because the consumption of alcohol is more prevalent among wealthier populations.

# CONCLUSION

1. Although collected by WHO, the dataset contained a lot of missing values, and most of the missing values came from countries with very small populations, where data collection is a very tedious task.
2. A lot of outliers were detected, which were dealt with by Winsorization.
3. Japan, despite being hit badly by World War II, came back very strong and is currently the country with the highest life expectancy, followed by Sweden, which is a big achievement.
4. Developing countries have much lower life expectancy where diseases like HIV/AIDS and polio are concerned, and Schooling plays a big role in increasing life expectancy in developing countries, as people become more educated and help improve the country's welfare and healthcare along with its economy.
5. Alcoholism is a big issue in developed countries, where people have a good amount of money to spend, and this shows how careless people can be about their health when it comes to alcohol.

# Recommendation

1. Developed countries should help developing countries eradicate the diseases that are affecting people's lives by providing vaccinations.
2. Governments should focus more on the schooling of children, who will become the face of the country in the future, and provide them with good food and a proper education.
3. The governments of developing countries should launch schemes to motivate people to send their children to school.
4. Governments should organize free healthcare camps to provide free vaccinations for needy and poor people, so that they don't have to spend their precious money and can stay healthy to look after their families.
5. Governments should increase the tax on liquor and increase healthcare and welfare camps to raise awareness of how bad overdrinking is and how it affects the body.
6. WHO, with the help of developed nations, should help the governments of developing countries provide free food and education and organize healthcare camps.