# **Background**

An investigation into the socio-economic factors that affect life expectancy.

# **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')

# **Get the Data**

In [None]:
df = pd.read_csv('../input/life-expectancy-who/Life Expectancy Data.csv')

**Check the head, info, describe and shape of the data.**

In [None]:
df.head(2)

In [None]:
df.describe().transpose()

In [None]:
df.info()

In [None]:
df.shape

# **Exploratory Data Analysis and Feature Engineering**

Check the number of NULL values in each column, first as a raw count and then as a percentage of the total number of rows in the dataframe.

In [None]:
df.isnull().sum()

In [None]:
100*df.isnull().sum()/df.shape[0]

**Working with missing data.**

Use the pandas DataFrame.interpolate() function to fill NA values in the dataframe, since we are dealing with time series data over a period of successive years.

In [None]:
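# Interpolate numeric columns only; text columns such as Country and Status cannot be interpolated
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].interpolate(method='linear', limit_direction='both')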
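
A column-wide linear interpolation treats the whole column as one series, so a gap at the boundary between two countries is filled from both neighbours. The sketch below uses a made-up two-country frame (the names `toy`, `flat` and `grouped` are illustrative, not part of the notebook) to compare that with interpolating inside each country. It assumes rows are ordered by country and year.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, three years each, with gaps in GDP
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2000, 2001, 2002, 2000, 2001, 2002],
    'GDP': [100.0, np.nan, 300.0, np.nan, 20.0, 30.0],
})

# Column-wide interpolation: the gap in country B's first row is filled
# using country A's last value as its left neighbour
flat = toy['GDP'].interpolate(method='linear', limit_direction='both')

# Per-country interpolation: each country's series is filled on its own
grouped = toy.groupby('Country')['GDP'].transform(
    lambda s: s.interpolate(method='linear', limit_direction='both')
)

print(pd.DataFrame({'flat': flat, 'grouped': grouped}))
```

Whether the per-country variant is preferable depends on how many gaps fall at country boundaries in the real data.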

In [None]:
df.isnull().sum()

# **Detecting & Filtering Outliers.**

Discover outliers with the IQR score. The interquartile range (IQR), also called the midspread, the middle 50%, or technically the H-spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, that is, between the upper and lower quartiles: IQR = Q3 − Q1.
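
As a small illustration, the sketch below applies the 1.5 × IQR fences to a made-up series. The values and variable names are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print('IQR:', iqr)
print('Fences:', lower, upper)
print('Outliers:', outliers.tolist())
```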

In [None]:
df.columns

In [None]:
# One boxplot per column, laid out on a 7x2 grid (filled row by row)
box_cols = ['Population', 'Schooling',
            'Income composition of resources', 'GDP',
            'Total expenditure', 'Polio',
            'Adult Mortality', 'Alcohol',
            'Hepatitis B', ' thinness 5-9 years',
            ' BMI ', 'under-five deaths ',
            ' HIV/AIDS', 'Diphtheria ']
fig, axes = plt.subplots(7, 2, figsize=(5, 25))
for ax, col in zip(axes.flatten(), box_cols):
    df.boxplot(column=col, ax=ax)

In [None]:
# Quartiles are only defined for numeric columns
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)

In [None]:
num = df[IQR.index]  # numeric columns only; drop any row with at least one outlier
df_clean = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)].copy()

In [None]:
df_clean.shape

**Insights**

In [None]:
df_clean.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(df_clean.corr(numeric_only=True), cmap='viridis', annot=True)

In [None]:
sns.pairplot(df_clean)

In [None]:
df_clean.corr(numeric_only=True)['Life expectancy '].sort_values(ascending=False)

In [None]:
df_clean.corr(numeric_only=True)['Life expectancy '].sort_values(ascending=False).plot(kind='bar')

The chief predictors of life expectancy appear to be Income composition of resources, Schooling, Alcohol, BMI, Diphtheria, Polio, percentage expenditure, Total expenditure, Hepatitis B and GDP.
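
Sorting signed correlations places strongly negative factors at the bottom of the list, where they are easy to overlook. One possible way to surface them, sketched here on made-up numbers (the frame `toy` and its values are hypothetical), is to rank by absolute correlation while keeping the sign.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values do not
toy = pd.DataFrame({
    'Life expectancy ': [50, 55, 60, 65, 70, 75],
    'Schooling': [5, 7, 8, 10, 12, 13],
    'Adult Mortality': [400, 350, 300, 220, 150, 100],
    'Noise': [3, 1, 4, 1, 5, 2],
})

# Correlation of every feature with the target, excluding the target itself
corr = toy.corr()['Life expectancy '].drop('Life expectancy ')

# Order by strength (absolute value) but keep the original signed values
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```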

In [None]:
less_than_65 = df_clean[df_clean['Life expectancy '] < 65]
sns.lmplot(x='percentage expenditure',y='Life expectancy ',data=less_than_65)

Total expenditure and percentage expenditure are only weakly correlated with life expectancy: beyond a certain threshold, an increase in expenditure does not translate into an increase in life expectancy. Other socio-economic factors that are strongly correlated with life expectancy need to be improved instead.

In [None]:
sns.lmplot(x='Schooling',y='Life expectancy ',data=df_clean)

Schooling shows a fairly strong relationship with life expectancy. This is expected: education raises awareness and encourages conscious choices about diet, drug and substance use, and similar habits.

In [None]:
df_clean['Status'].unique()

In [None]:
dmap = {'Developed':1,'Developing':0}
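# assign() returns a new frame, which avoids pandas' SettingWithCopyWarning on a filtered frame
df_clean = df_clean.assign(Status=df_clean['Status'].map(dmap))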

In [None]:
df_clean.head(4)

# **Training and Testing Data**

In [None]:
y = df_clean['Life expectancy '].values
X = df_clean.drop(['Country','Life expectancy '],axis=1).values # Too many countries to create dummy variables on

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# **Training the Model**

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
lm.fit(X_train,y_train)

**Print out the coefficient and intercept of the model.**

In [None]:
print('Coefficients:\n',lm.coef_)
print('\n')
print('Intercept:\n',lm.intercept_)
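
Because the model is fitted on a plain NumPy array, the printed coefficients carry no feature names. A minimal sketch of one way to label them, using a made-up frame (`toy_X`, `toy_y` and `coef_table` are hypothetical names, and the target is constructed so that the true coefficients are known):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features; the target is built as 40 + 2*Schooling - 0.5*Alcohol
toy_X = pd.DataFrame({
    'Schooling': [8, 10, 12, 14, 16],
    'Alcohol': [1.0, 3.0, 2.0, 5.0, 4.0],
})
toy_y = 40 + 2.0 * toy_X['Schooling'] - 0.5 * toy_X['Alcohol']

model = LinearRegression().fit(toy_X.values, toy_y.values)

# Pair each coefficient with the column it belongs to
coef_table = pd.Series(model.coef_, index=toy_X.columns, name='coefficient')
print(coef_table)
print('Intercept:', model.intercept_)
```

In the notebook itself the same idea would need the column list saved before calling `.values`, since `X` no longer carries it.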

# **Predicting Test Data**

In [None]:
predictions = lm.predict(X_test)

In [None]:
plt.figure(figsize=(12,8))
plt.scatter(y_test,predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')

# **Evaluating the Model**

In [None]:
from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
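
For reference, the three metrics can be reproduced by hand. The sketch below uses made-up values (`y_true` and `y_pred` are illustrative, not model output) and compares the manual formulas with the sklearn functions used above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted life expectancies
y_true = np.array([60.0, 70.0, 80.0])
y_pred = np.array([62.0, 69.0, 77.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the target's units

print('MAE:', mae, 'sklearn:', metrics.mean_absolute_error(y_true, y_pred))
print('MSE:', mse, 'sklearn:', metrics.mean_squared_error(y_true, y_pred))
print('RMSE:', rmse)
```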

# **Residuals**

In [None]:
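# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(y_test - predictions, bins=50, kde=True);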