## Categorical Feature Encoding Challenge WITH PYTHON
[Crislânio Macêdo](https://medium.com/sapere-aude-tech) - January 4th, 2020

 [ 👩👨 Mortality Among Children  ](https://www.kaggle.com/caesarlupum/mortality-among-children)

----------

# Context
Global and regional probability of dying among children aged 5-14 (10q5) and number of deaths, by UNICEF region.
Estimates were generated by the UN Inter-agency Group for Child Mortality Estimation (UN IGME) in 2019 and downloaded from http://www.childmortality.org

Notes:
* 10q5 is the probability of dying between ages 5 and 14, expressed per 1,000 children aged 5.
* Lower and Upper refer to the lower and upper bounds of the 90% uncertainty intervals.
* Regional classifications refer to UNICEF's regional classification.
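
To make the units concrete, the short sketch below converts a 10q5 value expressed per 1,000 into a plain probability and computes the width of its 90% uncertainty interval. The numbers and the helper name `per_thousand_to_probability` are hypothetical illustrations, not values or code from the UN IGME data.

```python
def per_thousand_to_probability(rate_per_1000):
    """Convert a rate expressed per 1,000 children into a probability between 0 and 1."""
    return rate_per_1000 / 1000.0

# Hypothetical values, for illustration only (not taken from the UN IGME estimates)
median, lower, upper = 8.0, 6.0, 10.0

probability = per_thousand_to_probability(median)
interval_width = upper - lower

print(f"Probability of dying between ages 5 and 14: {probability:.4f}")
print(f"Width of the 90% uncertainty interval: {interval_width:.1f} per 1,000")
```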


![](https://imgk.timesnownews.com/story/1537239356-About-802000-infant-deaths-reported-in-India-in-2017-UN.jpg?tr=w-600,h-450)

### Content
Child Mortality Estimates. Last update: 19 September 2019. Contact: childmortality@unicef.org

For further details please refer to http://data.unicef.org/regionalclassifications/

#### Acknowledgements
http://www.childmortality.org

Photo by Heather Mount on Unsplash

#### Inspiration
> Abhijit Banerjee, Esther Duflo and Michael Kremer were awarded the Nobel Prize in Economics 2019, for their "experimental approach to alleviating global poverty." With their new approach to getting reliable answers on the best ways to combat global poverty, maybe some children's lives could be saved.

## Imports

> We are using a typical data science stack: `numpy`, `pandas`, `matplotlib`, `seaborn`.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')
import gc


# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import matplotlib.patches as patches


In [None]:
# Suppress warnings 
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from IPython.display import HTML

## Read in Data 

First, we can list the files available in the input directory. The analysis below reads a single file, `deaths.csv`.

In [None]:
import os
print(os.listdir("../input/cusersmarildownloadsdeathscsv/"))

In [None]:
%%time
df = pd.read_csv('../input/cusersmarildownloadsdeathscsv/deaths.csv', delimiter=';', encoding = "ISO-8859-1")
df.dataframeName = 'deaths.csv'

# Glimpse of Data

In [None]:
print('Size of df data', df.shape)

> Data head

In [None]:
df.head()

# Examine Missing Values

Next we can look at the number and percentage of missing values in each column. 


### Checking missing data for `df`

In [None]:
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(8)
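
The same calculation can be wrapped in a small reusable function. The sketch below is one possible way to do it, shown on a hypothetical toy frame; the function name `missing_values_table` and the toy columns are assumptions for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def missing_values_table(frame):
    """Return the count and percentage of missing values per column, largest first."""
    total = frame.isnull().sum()
    percent = 100 * total / len(frame)
    table = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return table.sort_values('Total', ascending=False)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'region': ['A', 'B', None, 'D'],
    'rate': [1.0, np.nan, np.nan, 4.0],
    'year': [2000, 2001, 2002, 2003],
})

summary = missing_values_table(toy)
print(summary)
```

Calling `missing_values_table(df)` on the real data should give the same `Total` and `Percent` columns as the cell above.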

## Column Types

Let's look at the number of columns of each data type. `int64` and `float64` are numeric variables ([which can be either discrete or continuous](https://stats.stackexchange.com/questions/206/what-is-the-difference-between-discrete-data-and-continuous-data)). `object` columns contain strings and are [categorical features](http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/what-are-categorical-discrete-and-continuous-variables/).

In [None]:
# Number of each type of column
df.dtypes.value_counts()

In [None]:
# Number of unique classes in each object column
df.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

### Correlations

Let's continue with the EDA. One way to try to understand the data is to look for correlations between the numeric columns. We can calculate the Pearson correlation coefficient between every pair of numeric columns using the `.corr` dataframe method.

The correlation coefficient is not the best way to represent the "relevance" of a feature, but it does give us an idea of possible relationships within the data. Some [general interpretations of the absolute value of the correlation coefficient](http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf) are:


* .00-.19 “very weak”
* .20-.39 “weak”
* .40-.59 “moderate”
* .60-.79 “strong”
* .80-1.0 “very strong”
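
The bands above can be turned into a small lookup helper. This is a minimal sketch: the function name `correlation_strength` is hypothetical, and treating each band as a half-open interval (for example, 0.195 counts as "very weak") is an assumption, since the list leaves the values between bands undefined.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the verbal labels listed above."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude < 0.20:
        return "very weak"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.60:
        return "moderate"
    if magnitude < 0.80:
        return "strong"
    return "very strong"

# The sign is ignored: only the absolute value determines the label
for r in (0.05, -0.35, 0.55, -0.72, 0.93):
    print(f"{r:+.2f} -> {correlation_strength(r)}")
```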


In [None]:
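# Restrict to numeric columns: recent pandas versions raise an error when .corr() meets string columns
corrs = df.select_dtypes('number').corr()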
corrs

In [None]:
plt.figure(figsize = (20, 8))

# Heatmap of correlations
sns.heatmap(corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');
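
A heatmap can be hard to read when there are many columns, so it can help to list the strongest pairs directly. The sketch below is one possible approach on a hypothetical toy frame; the function name `top_correlated_pairs` and the columns `x`, `y`, `z` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.select_dtypes('number').corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value so strong negative correlations rank highly too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'z': [5, 3, 4, 1, 2],
})

top = top_correlated_pairs(toy)
print(top)
```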

## Under-five mortality rate (per 1,000) in West Africa

In [None]:
HTML('<iframe width="980" height="520" src="https://www.youtube.com/embed/PL2cXmFL-OM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

### Child Mortality Rate in India

In [None]:
HTML('<iframe width="980" height="520" src="https://www.youtube.com/embed/onFx-7b2M3I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

### Top 15 Country population Ranking (1900-2019) 

In [None]:
HTML('<iframe width="980" height="520" src="https://www.youtube.com/embed/VSyj_OC4QO4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

# Final