Your goal in this challenge is to find the factors that affect life expectancy. Specifically, you need to find out which factors increase life expectancy in a country and which factors decrease it.

First, load the dataset.

* Detect problems with the data, such as missing values and outliers. Are there any nonsensical values that seem to stem from data collection? For the missing values, discuss which technique would be most suitable for filling them in. For the outliers, discuss their potential effects on your analysis and select an appropriate method to deal with them.

* Explore the data using univariate and multivariate exploration techniques. You should pay special attention to your target variable. In this regard, your focus should be on finding the relevant variables that may affect life expectancy.

* In the feature engineering step, select a set of variables that you think would be ideal for the modeling phase. More concretely, you may discard variables that are highly correlated with others, or that you think are irrelevant to life expectancy.

* Summarize your findings. One of the most important skills of a data scientist is conveying ideas and findings to nontechnical people in understandable language. One of the most effective ways to do this is through clear visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import warnings

pd.set_option('display.max_columns', None)

warnings.filterwarnings('ignore')
sns.set(style="whitegrid")

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'lifeexpectancy'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

life_df_source = pd.read_sql_query('select * from lifeexpectancy',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [3]:
life_df = life_df_source.copy()

# Strip whitespace from the column names and lowercase them
life_df.columns = life_df.columns.str.strip().str.lower()
# Renaming columns
better_column_names = {'life expectancy': 'life_expect',
                      'adult mortality': 'adult_mort',
                      'infant deaths': 'infant_deaths',
                      'percentage expenditure': 'perc_expend',
                      'hepatitis b': 'hep_b',
                      'under-five deaths': '<5_deaths',
                      'total expenditure': 'total_expend',
                      'population': 'pop',
                      'thinness  1-19 years': 'thin<20',
                      'thinness 5-9 years': 'thin<10',
                      'income composition of resources': 'income'}
life_df.rename(better_column_names, axis=1, inplace=True)

print(life_df.shape)
life_df.head()

(2938, 22)


Unnamed: 0,country,year,status,life_expect,adult_mort,infant_deaths,alcohol,perc_expend,hep_b,measles,bmi,<5_deaths,polio,total_expend,diphtheria,hiv/aids,gdp,pop,thin<20,thin<10,income,schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,19.1,83,6.0,8.16,65.0,0.1,584.25921,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,18.6,86,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,18.1,89,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.47,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,17.6,93,67.0,8.52,67.0,0.1,669.959,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,17.2,97,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5


## Data Cleaning

### Variable Types

In [4]:
life_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2938 entries, 0 to 2937
Data columns (total 22 columns):
country          2938 non-null object
year             2938 non-null int64
status           2938 non-null object
life_expect      2928 non-null float64
adult_mort       2928 non-null float64
infant_deaths    2938 non-null int64
alcohol          2744 non-null float64
perc_expend      2938 non-null float64
hep_b            2385 non-null float64
measles          2938 non-null int64
bmi              2904 non-null float64
<5_deaths        2938 non-null int64
polio            2919 non-null float64
total_expend     2712 non-null float64
diphtheria       2919 non-null float64
hiv/aids         2938 non-null float64
gdp              2490 non-null float64
pop              2286 non-null float64
thin<20          2904 non-null float64
thin<10          2904 non-null float64
income           2771 non-null float64
schooling        2775 non-null float64
dtypes: float64(16), int64(4), object(2)
memo

I don't notice anything inherently wrong with the data types, so I'm moving on. The column names were already cleaned up and renamed in the previous cell.

### Missing Values

Several columns contain missing values, which pandas reports as `NaN`.

In [5]:
life_df.isnull().sum()

country            0
year               0
status             0
life_expect       10
adult_mort        10
infant_deaths      0
alcohol          194
perc_expend        0
hep_b            553
measles            0
bmi               34
<5_deaths          0
polio             19
total_expend     226
diphtheria        19
hiv/aids           0
gdp              448
pop              652
thin<20           34
thin<10           34
income           167
schooling        163
dtype: int64

I'll try to fill in the missing data by interpolating along each country's own trend. First, a look at a few rows before interpolating:

In [6]:
life_df.iloc[95:100]

Unnamed: 0,country,year,status,life_expect,adult_mort,infant_deaths,alcohol,perc_expend,hep_b,measles,bmi,<5_deaths,polio,total_expend,diphtheria,hiv/aids,gdp,pop,thin<20,thin<10,income,schooling
95,Argentina,2001,Developing,74.0,138.0,12,7.76,123.012867,,0,54.6,13,85.0,9.38,83.0,0.1,717.694674,3747159.0,1.2,1.1,0.771,15.6
96,Argentina,2000,Developing,74.1,137.0,12,7.68,1349.025282,,6,54.0,14,88.0,9.21,83.0,0.1,7669.273916,3757452.0,1.2,1.1,0.764,15.0
97,Armenia,2015,Developing,74.8,118.0,1,,0.0,94.0,33,54.9,1,96.0,,94.0,0.1,369.654776,291695.0,2.1,2.2,0.741,12.7
98,Armenia,2014,Developing,74.6,12.0,1,3.91,295.608714,93.0,13,54.1,1,95.0,4.48,93.0,0.1,3994.712355,29622.0,2.1,2.1,0.739,12.7
99,Armenia,2013,Developing,74.4,123.0,1,3.79,279.429081,95.0,10,53.3,1,96.0,4.55,95.0,0.1,3843.591213,289359.0,2.1,2.1,0.736,12.7


In [7]:
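# Interpolate linearly within each country; limit_direction='both' also fills leading and trailing gaps
life_df = life_df.groupby(['country'], sort=False, group_keys=False).apply(lambda group: group.interpolate(method='linear', limit_direction='both'))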
life_df.iloc[95:100]

Unnamed: 0,country,year,status,life_expect,adult_mort,infant_deaths,alcohol,perc_expend,hep_b,measles,bmi,<5_deaths,polio,total_expend,diphtheria,hiv/aids,gdp,pop,thin<20,thin<10,income,schooling
95,Argentina,2001,Developing,74.0,138.0,12,7.76,123.012867,66.0,0,54.6,13,85.0,9.38,83.0,0.1,717.694674,3747159.0,1.2,1.1,0.771,15.6
96,Argentina,2000,Developing,74.1,137.0,12,7.68,1349.025282,66.0,6,54.0,14,88.0,9.21,83.0,0.1,7669.273916,3757452.0,1.2,1.1,0.764,15.0
97,Armenia,2015,Developing,74.8,118.0,1,3.91,0.0,94.0,33,54.9,1,96.0,4.48,94.0,0.1,369.654776,291695.0,2.1,2.2,0.741,12.7
98,Armenia,2014,Developing,74.6,12.0,1,3.91,295.608714,93.0,13,54.1,1,95.0,4.48,93.0,0.1,3994.712355,29622.0,2.1,2.1,0.739,12.7
99,Armenia,2013,Developing,74.4,123.0,1,3.79,279.429081,95.0,10,53.3,1,96.0,4.55,95.0,0.1,3843.591213,289359.0,2.1,2.1,0.736,12.7


Notice how row 97 no longer has `NaN` for `alcohol` and `total_expend`: both were filled with the values from the neighboring 2014 row for Armenia (row 98).
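To make the behavior of the per-country interpolation easier to see, here is a minimal sketch on a made-up frame. The `toy` data, the country labels `A` and `B`, and the `alcohol_filled` column are illustrative assumptions and not part of the dataset. It suggests that a gap between two known values gets the value halfway between them, while a gap at the start or end of a country's rows is simply filled with the nearest known value, which matches what happened to Armenia's 2015 row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, a few years each, some gaps in `alcohol`
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B"],
    "year": [2015, 2014, 2013, 2012, 2015, 2014, 2013],
    "alcohol": [np.nan, 4.0, np.nan, 2.0, 1.0, np.nan, np.nan],
})

# Interpolate within each country only, so values never leak across countries
toy["alcohol_filled"] = toy.groupby("country")["alcohol"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

print(toy)
```

In this sketch, country `A`'s interior gap (2013) becomes 3.0, its leading gap (2015) is filled with 4.0, and country `B`'s trailing gaps are both filled with 1.0.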

In [8]:
life_df.isnull().sum()

country            0
year               0
status             0
life_expect       10
adult_mort        10
infant_deaths      0
alcohol           17
perc_expend        0
hep_b            144
measles            0
bmi               34
<5_deaths          0
polio              0
total_expend      32
diphtheria         0
hiv/aids           0
gdp              405
pop              648
thin<20           34
thin<10           34
income           167
schooling        163
dtype: int64

Then why is there still missing data?
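One plausible explanation, offered here as an assumption rather than something verified against the dataset, is that some countries have no observations at all for a given column. Interpolation needs at least one known value to work from, so those groups stay empty. The sketch below uses a made-up frame (`toy`, countries `A` and `B`, and a `gdp` column are illustrative) to show the effect and one way to list the affected groups.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: country B has no gdp observations at all
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "gdp": [100.0, np.nan, 300.0, np.nan, np.nan, np.nan],
})

# Same per-country interpolation as in the notebook, applied to one column
filled = toy.groupby("country")["gdp"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)

# count() returns the number of non-null values per group;
# a count of zero means there was nothing to interpolate from
observed = toy.groupby("country")["gdp"].count()
no_data = observed[observed == 0].index.tolist()

print(filled.isna().sum(), "values still missing")
print("countries with no gdp data:", no_data)
```

The same `groupby(...).count()` check could be run on `life_df` for each column to see whether the remaining gaps are concentrated in a handful of countries.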

### Outliers