# Lab 2:  Feature Engineering

Feature Engineering is the process of transforming raw data into features/input variables that are easily digested by algorithms.  New Data Scientists often spend all of their time testing out various algorithms; however, the majority of accuracy gains generally stem from well crafted features.  In this Lab we will introduce the folling types of feature engineering:

1. [Feature pruning](#prune)
1. [Temporal Features (month, year, etc)](#temporal)   
2. [Extracting features from strings](#strings)
3. [One-hot encoding / dummy variables](#onehote)
4. [Scaling, Normalizing, Log transform](#scaling)  
5. [Geo Encoding](#meta)
6. [Data Imputation / cleaning](#imputation) 


While preforming Feature Engineering, it is critical to keep in mind the question that you are trying to answer.  For the purposes of this excercise, we will be using the KIVA dataset and will be trying to answering the following question:

*What drives the loan amount requested by KIVA borrowers? * 

In the language of Module 1, our outcome feature is **loan_amount**. In the next notebook, we will formalize this research question as a machine learning task. Our machine learning task will be to predict the loan amount that a borrower requests from KIVA using all the features we explore in this notebook.


We may not end up using all the features we create, but the process is an important extension of exploratory analysis. The key difference between feature engineering and exploratory analysis is that we now have a defined question in mind: "What drives the loan amount requested by KIVA lenders?"

In [218]:
import pandas as pd
import numpy as np
import re
from geopy.geocoders import Nominatim

In [219]:
# the command below tells jupyter to display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)
df = pd.read_csv("../data/df.csv", low_memory=False)
df.head()

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrowers,currency_exchange_loss_amount,description.languages,description.texts.en,funded_amount,funded_date,id,image.template_id,journal_totals.bulkEntries,journal_totals.entries,lender_count,loan_amount,location.country,location.country_code,location.geo.level,location.geo.pairs,location.geo.type,location.town,name,partner_id,payments,planned_expiration_date,posted_date,sector,status,tags,terms.disbursal_amount,terms.disbursal_currency,terms.disbursal_date,terms.loan_amount,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.currency_exchange_coverage_rate,terms.loss_liability.nonpayment,terms.repayment_interval,terms.repayment_term,terms.scheduled_payments,themes,translator.byline,translator.image,use,video.thumbnailImageId,video.title,video.youtubeId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_charges_fees_and_interest,partner_countries,partner_currency_exchange_loss_rate,partner_default_rate,partner_default_rate_note,partner_delinquency_rate,partner_delinquency_rate_note,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_name,partner_portfolio_yield,partner_portfolio_yield_note,partner_profitability,partner_rating,partner_social_performance_strengths,partner_start_date,partner_status,partner_total_amount_raised,partner_url
0,Farming,0.0,False,"[{'first_name': 'Evaline', 'last_name': '', 'g...",,['en'],Evaline is a married lady aged 44 years old an...,0,,1291548,1,0,0,0,500,Kenya,KE,town,-0.583333 35.183333,point,litein,Evaline,386.0,[],2017-06-08T00:40:03Z,2017-05-09T00:40:03Z,Agriculture,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Pare...",50000.0,KES,2017-04-03T07:00:00Z,500,"[{'due_date': '2017-05-10T07:00:00Z', 'amount'...",shared,0.1,lender,Monthly,14,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Julie Keaton,892591.0,to purchase more tea leaves to sell to the tea...,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/
1,Furniture Making,0.0,False,"[{'first_name': 'Julias', 'last_name': '', 'ge...",,['en'],Aged 42 years is a man by the name of Julias. ...,0,,1291532,1,0,0,0,500,Kenya,KE,town,0.566667 34.566667,point,Bungoma,Julias,386.0,[],2017-06-08T00:30:05Z,2017-05-09T00:30:05Z,Manufacturing,fundraising,[],50000.0,KES,2017-04-03T07:00:00Z,500,"[{'due_date': '2017-05-09T07:00:00Z', 'amount'...",shared,0.1,lender,Monthly,14,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Morena Calvo,1832928.0,to buy timber to make more furniture for his e...,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/
2,Home Energy,0.0,False,"[{'first_name': 'Rose', 'last_name': '', 'gend...",,['en'],"Hello Kiva Community! <br /><br />Meet Rose, w...",50,,1291530,1,0,0,2,75,Kenya,KE,town,0.516667 35.283333,point,Eldoret,Rose,156.0,[],2017-06-08T00:30:03Z,2017-05-09T00:30:04Z,Personal Use,fundraising,"[{'name': '#Eco-friendly'}, {'name': '#Technol...",6000.0,KES,2017-04-28T07:00:00Z,75,"[{'due_date': '2017-05-14T07:00:00Z', 'amount'...",shared,0.1,lender,Monthly,14,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...","['Green', 'Earth Day Campaign']",Julie Keaton,892591.0,to buy a solar lantern.,,,,1,49.6,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.431935,2.575299,,2.536684,,1.0,24.200354,18150.0,Juhudi Kilimo,33.0,,-7.1,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2010-01-15T20:20:17Z,active,7705925.0,http://www.juhudikilimo.com/
3,Used Clothing,0.0,False,"[{'first_name': 'Jane', 'last_name': '', 'gend...",,['en'],"Jane was born in the 1980, and she is happily ...",0,,1291525,1,0,0,0,500,Kenya,KE,town,0.566667 34.566667,point,Bungoma,Jane,386.0,[],2017-06-08T00:20:04Z,2017-05-09T00:20:04Z,Clothing,fundraising,[{'name': '#Eco-friendly'}],50000.0,KES,2017-04-03T07:00:00Z,500,"[{'due_date': '2017-05-08T07:00:00Z', 'amount'...",shared,0.1,lender,Monthly,14,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Julie Keaton,892591.0,to buy more clothes to meet the needs and tast...,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/
4,Farming,0.0,False,"[{'first_name': 'Alice', 'last_name': '', 'gen...",,['en'],Alice (the woman pictured above in her small s...,0,,1291518,1,0,0,0,400,Kenya,KE,town,1 38,point,Nandi Hills,Alice,156.0,[],2017-06-08T00:20:03Z,2017-05-09T00:20:03Z,Agriculture,fundraising,[{'name': '#Woman Owned Biz'}],40000.0,KES,2017-05-27T07:00:00Z,400,"[{'due_date': '2017-05-27T07:00:00Z', 'amount'...",shared,0.1,lender,Monthly,13,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,,"to buy farming inputs (fertilizers, pesticides...",,,,1,49.6,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.431935,2.575299,,2.536684,,1.0,24.200354,18150.0,Juhudi Kilimo,33.0,,-7.1,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2010-01-15T20:20:17Z,active,7705925.0,http://www.juhudikilimo.com/


## 1. Feature Pruning
<a id='prune'></a>

There is absolutely no point in keeping data that has no variation.  Algorithms can only provide meaningful insights when there is variation in the featuers.  Given that we are preforming feature engineering for the purporse of feeding thse features into a machine learning algorithm, lets go ahead and remove all columns that only consist 1 or less unique values.

In [220]:
for col in df.columns:
    if df[col].unique().size==1:
        df = df.drop(col, axis=1)

## 2. Temporal Features
<a id='temporal'></a>
Time trends are very significant, and should not be neglected.  Most algorithms will not be able to make use of raw datetimes; however, will be able to find patterns in the data if they are informed which observations occur in a given year, on a weekday vs weekend, on a holiday, etc.

Before we are able to extract this meta data, let's convert the strings in the pandas dataframe to datetime objects.

see a list of methods that can be applied to a pandas datetime. https://pandas.pydata.org/pandas-docs/version/0.21/api.html#id34

In [221]:
# luckily for us all time fields in this dataset have "_date" in their name
# pandas is really adept at time series, and we will use pd.to_datetime to create pandas timestamps.
# for more information, check out https://pandas.pydata.org/pandas-docs/stable/timeseries.html
for col in [c for c in df.columns if "_date" in c]:
    if "_date" in col:
        df[col] = pd.to_datetime(df[col])

In [222]:
##  posted date features
df['posted_date'] = pd.to_datetime(df['posted_date'])
df['posted_year']=df['posted_date'].dt.year
df['posted_month']=df['posted_date'].dt.month

## Time to fund is the funded date minus the posted date
## we add these fields because the homework question in the next notebook involves predicting time to fund
df['time_to_fund'] =pd.to_datetime(df['funded_date']) - pd.to_datetime(df['posted_date'])
df['days_to_fund'] = df['time_to_fund'].apply(lambda x: x.seconds//3600)

# expiration date features
## Time to expiration is the expiration date minus the Posted Date
df['planned_expiration_date'] = pd.to_datetime(df['planned_expiration_date'])
df['time_to_expire_date'] =df['planned_expiration_date'] - df['posted_date']
df['days_to_expire'] = df['time_to_expire_date'].dt.days

# TODO - Where were we getting the dispersal_date from???? - Jack 11/6/2017
## Time to dispursement is the Disbursed date minus the Posted Date
# df['time_to_dispersal'] =pd.to_datetime(df['dispersal_date']) - pd.to_datetime(df['posted_date'])
# df['days_to_dispersal'] = df.time_to_dispersal.dt.days

## 3. Extracting features from strings
<a id='strings'></a>

String variables by themselves are generally not good inputs to algorithms; howevever, it is often possible to extract meaningful features from encoding the information that they contain.  Let's first find out which of our variables are string variables.  From there, let's review some of the variables and see if we can construct new features from the contents of these string variables.

To discover which of our DataFrame columns are string variables, we will utilize pandas dtypes method.  In pandas there are the following types:



|       dtype        |        Description        |
|--------------------|---------------------------|
|      float         | Numeric value with a decimal point.  If NaNs exist in col, pandas will default to float|
|        int         | Numerica values without decimal points. |
|       bool         | Column consisting of True and False|
| datetime64[ns, tz] | Objects which contain specific date and time |
|   timedelta[ns]    | Object which indicates time elapsed between two datetimes |
|     category       | Variables that can only have specified values |
|      object        | Pandas representation of string variables |

Let's now use pandas method get_dtype_counts to see what data types exist in the DataFrame, and then select_dtypes to view all columns with dtype == object

In [223]:
df.get_dtype_counts()

bool                1
datetime64[ns]      5
float64            19
int64               9
object             29
timedelta64[ns]     2
dtype: int64

In [224]:
df.select_dtypes(include=[object])

Unnamed: 0,activity,borrowers,description.languages,description.texts.en,location.geo.level,location.geo.pairs,location.town,name,sector,status,tags,terms.disbursal_currency,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.nonpayment,terms.repayment_interval,terms.scheduled_payments,themes,translator.byline,use,video.title,video.youtubeId,partner_charges_fees_and_interest,partner_countries,partner_name,partner_rating,partner_social_performance_strengths,partner_status,partner_url
0,Farming,"[{'first_name': 'Evaline', 'last_name': '', 'g...",['en'],Evaline is a married lady aged 44 years old an...,town,-0.583333 35.183333,litein,Evaline,Agriculture,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Pare...",KES,"[{'due_date': '2017-05-10T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Julie Keaton,to purchase more tea leaves to sell to the tea...,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/
1,Furniture Making,"[{'first_name': 'Julias', 'last_name': '', 'ge...",['en'],Aged 42 years is a man by the name of Julias. ...,town,0.566667 34.566667,Bungoma,Julias,Manufacturing,fundraising,[],KES,"[{'due_date': '2017-05-09T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Morena Calvo,to buy timber to make more furniture for his e...,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/
2,Home Energy,"[{'first_name': 'Rose', 'last_name': '', 'gend...",['en'],"Hello Kiva Community! <br /><br />Meet Rose, w...",town,0.516667 35.283333,Eldoret,Rose,Personal Use,fundraising,"[{'name': '#Eco-friendly'}, {'name': '#Technol...",KES,"[{'due_date': '2017-05-14T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...","['Green', 'Earth Day Campaign']",Julie Keaton,to buy a solar lantern.,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/
3,Used Clothing,"[{'first_name': 'Jane', 'last_name': '', 'gend...",['en'],"Jane was born in the 1980, and she is happily ...",town,0.566667 34.566667,Bungoma,Jane,Clothing,fundraising,[{'name': '#Eco-friendly'}],KES,"[{'due_date': '2017-05-08T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Julie Keaton,to buy more clothes to meet the needs and tast...,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/
4,Farming,"[{'first_name': 'Alice', 'last_name': '', 'gen...",['en'],Alice (the woman pictured above in her small s...,town,1 38,Nandi Hills,Alice,Agriculture,fundraising,[{'name': '#Woman Owned Biz'}],KES,"[{'due_date': '2017-05-27T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,"to buy farming inputs (fertilizers, pesticides...",,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/
5,Used Clothing,"[{'first_name': 'Clare', 'last_name': '', 'gen...",['en'],Clare is a married woman who is blessed with 2...,town,0.416667 34.25,Busia,Clare,Clothing,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Eco-...",KES,"[{'due_date': '2017-05-11T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,,to buy more bales of clothes to grow her busin...,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/
6,Farming,"[{'first_name': 'Mary', 'last_name': '', 'gend...",['en'],"Wonderful Kiva community, meet Mary (pictured ...",town,1 38,Kerugoya,Mary,Agriculture,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Pare...",KES,"[{'due_date': '2017-05-27T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,to buy seeds so that she can begin horticultur...,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/
7,Pigs,"[{'first_name': 'James', 'last_name': '', 'gen...",['en'],James is a happily married man and is blessed ...,town,1 38,Limuru,James,Agriculture,fundraising,[{'name': '#Animals'}],KES,"[{'due_date': '2017-06-06T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-08-01T07:00:00Z', 'amount'...",,,"to buy pig feeds and logs to burn charcoal, so...",,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/
8,Farming,"[{'first_name': 'Jacinta ', 'last_name': '', '...",['en'],Jacinta is 34 years old. She has four children...,town,-0.283333 36.066667,Nakuru,Jacinta,Agriculture,fundraising,[],KES,"[{'due_date': '2017-05-24T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Lynn Cerra,to purchase farm inputs.,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",VisionFund Kenya,2.5,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.visionfundkenya.co.ke/
9,Cereals,"[{'first_name': 'Emily ', 'last_name': '', 'ge...",['en'],"Meet this enterprising woman, Emily. She resid...",town,1 38,Bomet,Emily,Food,fundraising,[{'name': '#Woman Owned Biz'}],KES,"[{'due_date': '2017-05-26T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,to buy cereals to sell at her local market.,,,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/


The borrowers column looks like it may have some interesting information, but it is hard to tell since the string is cropped in the displayed DataFrame.  Lets take a look at an example value.

In [225]:
df['borrowers'][0]

"[{'first_name': 'Evaline', 'last_name': '', 'gender': 'F', 'pictured': True}]"

A very simple feature we can create is count for the number of borrowers listed.  In order to accomplish this we will leverage pandas [apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) method, which allows us to apply a lambda function to a specific column or collection of columns in order to create a new vector.  The provided lambda function is applied to each row in order to calculate the value of the corresponding row in the new vector.

In [226]:
df['num_borrowers'] = df['borrowers'].apply(lambda x: x.count("{"))
print(df[df['num_borrowers']>1]['num_borrowers'].iloc[0])
print(df[df['num_borrowers']>1]['borrowers'].iloc[0])

4
[{'first_name': 'Florence ', 'last_name': '', 'gender': 'F', 'pictured': True}, {'first_name': 'Wanjiru', 'last_name': '', 'gender': 'F', 'pictured': True}, {'first_name': 'Jane ', 'last_name': '', 'gender': 'F', 'pictured': True}, {'first_name': 'Pauline ', 'last_name': '', 'gender': 'F', 'pictured': True}]


Keeping in mind that the question that we are trying to answer is "What drives the loan amount requested by KIVA borrowers?" let's create a few variables that encode the information on the gender of the listed borrowers.

In order to do this, we will once again use pandas' `apply` method, but this time will we introduce a if-else statement inside the lambda function.  This will enable us to change the value of the resulting column vector based on whether the conditional returns True or False for each row.

In [227]:
df['gender']=df['borrowers'].apply(lambda x: "Male" if x.split("gender': '")[1][0]=='M' else "Female")
df['num_male'] = df['borrowers'].apply(lambda x: x.count('''M'''))
df['num_female'] = df['borrowers'].apply(lambda x: x.count('''F'''))
df['pct_female']=100.00*df['num_female']/(df['num_male']+df['num_female'])

Beyond gender, age and marital status could be key explanatory features for the requested loan amount.  These featuers will all be booleans, and in order to construct them we will use panda's [str.contains](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html).  This is a handy feature because it allows us to utilize a [regex](https://docs.python.org/2/library/re.html).

In [228]:
## Whether or not the borrower is widowed
#  Note the str.contains function
df['widowed'] = df['description.texts.en'].str.contains("widowed|widow", na=0)
## Whether or not the borrower is married
df['married'] = np.where(df['description.texts.en'].str.contains("married|husband|wife", na=0)==True, 1, 0)
## Whether or not the borrower has children, notice we look for many variants of the word.
df['kids'] = df['description.texts.en'].str.contains("kids|child|children|kid|son|daughter|mother|father|parents", na=0)

## 4. One-hot encoding
<a id='onehote'></a>

One-hot encoding is the process of converting either categorical or string data into a binary.  Let's practice one-hot encoding by converting the "tags" column into a set of binary features indicating whether or not a particular tag appears in a given row. 

In order to do this we will first need to convert the "tags" column into a list of strings, and then we will utilize pandas `get_dummies` method to create the binary features.  Binary features are often referred to in the statistics world as dummy features.

In [229]:
df['tag_list'] = df['tags'].apply(lambda x: [elem['name'] for elem in eval(x)])
tag_df = pd.get_dummies(df['tag_list'].apply(pd.Series).stack()).sum(level=0)
# TODO - Explain how merges work or better yet figure a way to avoid merging. - Jack 11/10/17
df = df.merge(tag_df, left_index = True, right_index = True, how = 'outer')

In [232]:
df[tag_df.columns] = df[tag_df.columns].fillna(0)

In [233]:
df.columns

Index(['activity', 'basket_amount', 'bonus_credit_eligibility', 'borrowers',
       'currency_exchange_loss_amount', 'description.languages',
       'description.texts.en', 'funded_amount', 'funded_date', 'id',
       ...
       '#Technology', '#Tourism', '#Trees', '#Unique', '#Vegan', '#Widowed',
       '#Woman Owned Biz', 'user_favorite', 'volunteer_like',
       'volunteer_pick'],
      dtype='object', length=110)

## 5. Scaling, Normalizing, Log transform
<a id='scaling'></a>


TODO -- 11/10/2017

## 6. Geo Encoding
<a id='Geo Encoding'></a>


TODO -- Should we leave this in the notebook?? 11/10/2017

Location is probably very predictive of loan amount because intuitively there are differences in the cost of living and the type of sector between different regions. The cost of living in London, UK is very different from the cost of living in Mombasa, Kenya for example. You can also imagine that within Kenya there are likely differences in the cost of living between provinces or even counties. 

Location is a feature we want to include! However, our current location data is really messy. There is an issue with of geo-coordinates field where most coordinates are from a single location. Instead, we have to rely on location.town but this appears to be entered by hand and there are many spelling mistakes, variations of the same entry and incomplete addresses that prevent us from aggregating this data in a useful way. In order to use location, we somehow need to pull the province or county associated with each town. To do this we turn to the Google Geocode API. You can read more about this api [here](http://geopy.readthedocs.io/en/1.10.0/).

The api is very sensitive to how clean (standardized) the input is. Because our location field appears to be a non standardized field (it is the result of data written by hand), we have a lot of cleaning to do before we can call the api. However, even after this cleaning the api call tends to break frequently so we set up a [recursive function](https://www.programiz.com/python-programming/recursion), which means if there is an exception because of the quality of text or because the internet connection is weak, it calls the api again within the loop.

In [234]:
df['location']=df['location.town'].astype(str).map(lambda x: re.sub(r'[^a-zA-Z0-9 ]',r'',x).lower().rstrip().lstrip())
df['location']=df['location'].map(lambda x: re.sub(r'kenya', r'', x))
df['location']=df['location'].map(lambda x: re.sub(r' +', r' ', x))

df['location']=df['location'] + ' ' + 'kenya'

unique_location = df['location'].unique()
unique_location.sort()
len(unique_location)

1207

In [235]:
lookup={}
geolocator = Nominatim()

def location_match(unique_location, lookup):
    #creates empty dictionary
    y=0
    print(len(unique_location))
    for x in unique_location:
        if x in lookup.keys():
            print ('already added, pct complete %d' % (100.00*y/len(unique_location)))
        else:
            print ('adding %s, %d out of %d, pct complete %d' % (x, y, len(unique_location), 100.00*y/len(unique_location)))
            try:
                lookup[x]=geolocator.geocode(x, timeout=10)
            except GeocoderTimedOut:
                return location_match(unique_location, lookup)
                return lookup
        y=y+1
            
    return lookup

In [236]:
location = location_match(unique_location, lookup)

1207
adding aa estate nairobi kenya, 0 out of 1207, pct complete 0
adding aa kenya, 1 out of 1207, pct complete 0
adding adongosi teso kenya, 2 out of 1207, pct complete 0
adding adumai moding division teso district kenya, 3 out of 1207, pct complete 0
adding ahero kenya, 4 out of 1207, pct complete 0
adding akiliametteso district kenya, 5 out of 1207, pct complete 0
adding akites chakol division teso district kenya, 6 out of 1207, pct complete 0
adding aldina jomvu kenya, 7 out of 1207, pct complete 0
adding aldinajomvu  kenya, 8 out of 1207, pct complete 0
adding aldinajomvu kenya, 9 out of 1207, pct complete 0
adding amagoro teso district kenya, 10 out of 1207, pct complete 0
adding amagoroteso district kenya, 11 out of 1207, pct complete 0
adding amalemba kakamega kenya, 12 out of 1207, pct complete 0
adding amoyo central migori kenya, 13 out of 1207, pct complete 1
adding angawa avenue kisumu city kenya, 14 out of 1207, pct complete 1
adding angurai teso  kenya, 15 out of 1207, pc

adding diani mombasa kenya, 137 out of 1207, pct complete 11
adding dida kenya, 138 out of 1207, pct complete 11
adding docks mombasa  kenya, 139 out of 1207, pct complete 11
adding donholm nairobi kenya, 140 out of 1207, pct complete 11
adding drivein thika road kenya, 141 out of 1207, pct complete 11
adding dunga kisumu kenya, 142 out of 1207, pct complete 11
adding dungakisumu kenya, 143 out of 1207, pct complete 11
adding dungicha kenya, 144 out of 1207, pct complete 11
adding east nairob kenya, 145 out of 1207, pct complete 12
adding east nairobi kenya, 146 out of 1207, pct complete 12
adding eastland kenya, 147 out of 1207, pct complete 12
adding eastlands kenya, 148 out of 1207, pct complete 12
adding eastleigh market nairobi kenya, 149 out of 1207, pct complete 12
adding eastleigh nairobi kenya, 150 out of 1207, pct complete 12
adding ebenezer kenya, 151 out of 1207, pct complete 12
adding ekwanda kisumu kenya, 152 out of 1207, pct complete 12
adding eldama ravine kenya, 153 ou

adding ivakale vihiga kenya, 275 out of 1207, pct complete 22
adding jairos amagoro div teso district keny kenya, 276 out of 1207, pct complete 22
adding jairos teso district kenya, 277 out of 1207, pct complete 22
adding jaribuni kenya, 278 out of 1207, pct complete 23
adding jenga jamii kenya, 279 out of 1207, pct complete 23
adding jericho nairobi kenya, 280 out of 1207, pct complete 23
adding jerusalem nairobi kenya, 281 out of 1207, pct complete 23
adding jitahidi kenya, 282 out of 1207, pct complete 23
adding jitoni mombasa kenya, 283 out of 1207, pct complete 23
adding jogoo road kenya, 284 out of 1207, pct complete 23
adding jomvu mombasa kenya, 285 out of 1207, pct complete 23
adding jua kali kenya, 286 out of 1207, pct complete 23
adding jubilee market kisumu kenya, 287 out of 1207, pct complete 23
adding juhudi kenya, 288 out of 1207, pct complete 23
adding juja kenya, 289 out of 1207, pct complete 23
adding juja thika kenya, 290 out of 1207, pct complete 24
adding jujathika

adding kariokor nairobi kenya, 409 out of 1207, pct complete 33
adding karugia thika kenya, 410 out of 1207, pct complete 33
adding karunga kakamega kenya, 411 out of 1207, pct complete 34
adding kasarani kenya, 412 out of 1207, pct complete 34
adding kasikey kenya, 413 out of 1207, pct complete 34
adding katakwa angurai location teso  kenya, 414 out of 1207, pct complete 34
adding katakwa teso kenya, 415 out of 1207, pct complete 34
adding kathiani machakos kenya, 416 out of 1207, pct complete 34
adding kathonzeni makueni kenya, 417 out of 1207, pct complete 34
adding katingani emali kenya, 418 out of 1207, pct complete 34
adding katito kenya, 419 out of 1207, pct complete 34
adding kativanikibwezi kenya, 420 out of 1207, pct complete 34
adding katulyeemali kenya, 421 out of 1207, pct complete 34
adding katuo nyalenda kisumu kenya, 422 out of 1207, pct complete 34
adding kauma kenya, 423 out of 1207, pct complete 35
adding kavetetala kenya, 424 out of 1207, pct complete 35
adding kavi

NameError: name 'GeocoderTimedOut' is not defined

In [None]:
df['location_detail'] = df['location'].map(location)
df['location_str']=df['location_detail'].astype(str).map(lambda x: re.sub(r'Kenya',r'',x).lower().rstrip().lstrip())

In [None]:
lookup_tags=pd.read_csv(data_path+'/province_counties_KE.csv')

## THE END!

That is all for our feature engineering module!  Now that we have finished creating all of our features we can go ahead and explore them with some EDA!  The last step of this module is to save our results into a new csv

In [None]:
df.to_csv("../data/data.csv")

## 7. Data Imputation / cleaning
<a id='imputation'></a>


TODO -- 11/10/2017

Now let's investigate how much missing data we have in our dataset.  In pandas, missing data is either represented as NaN (Not a Number), or NaT (Not a Time).  While we look at our missing data, let's look at strings, numeric, and time objects separateley.

First, let's have a quick refresher on dyptes in our DataFrame

In [184]:
df.get_dtype_counts()

bool                1
datetime64[ns]      5
float64            20
int64              13
object             32
timedelta64[ns]     2
dtype: int64

Now let's create lists of all of the columns for the specific data types that we care about.

In [196]:
time_columns = df.select_dtypes(include=['datetime64','timedelta64']).columns
str_columns = df.select_dtypes(include=[object]).columns
numeric_columns = df.select_dtypes(exclude=[object,'datetime64','timedelta64']).columns

Now, let's use pandas `isnull` and `sum` functions to see how many observations of each column are missing.
Since there are a lot of columns in this DataFrame, let's restrict our returned DataFrame to columns which have 
missing data

In [187]:
df[time_columns].isnull().sum()[df[time_columns].isnull().sum()>0]

funded_date                 5627
planned_expiration_date    24913
terms.disbursal_date          15
partner_start_date          9642
time_to_fund                5627
time_to_expire_date        24913
dtype: int64

With missing data, you should always check to see if there is a systemic difference between observations with and without missing data.

In [189]:
df[df['funded_date'].isnull()].describe()

Unnamed: 0,basket_amount,currency_exchange_loss_amount,funded_amount,id,lender_count,loan_amount,partner_id,terms.disbursal_amount,terms.loan_amount,terms.loss_liability.currency_exchange_coverage_rate,terms.repayment_term,translator.image,video.thumbnailImageId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_currency_exchange_loss_rate,partner_default_rate,partner_delinquency_rate,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_portfolio_yield,partner_profitability,partner_total_amount_raised,posted_year,posted_month,time_to_fund,days_to_fund,time_to_expire_date,days_to_expire,num_borrowers,num_male,num_female,pct_female,married
count,944.0,0.0,5627.0,5627.0,5627.0,5627.0,5612.0,5627.0,5627.0,5611.0,5627.0,2924.0,0.0,5627.0,5612.0,5612.0,5612.0,5612.0,5612.0,5612.0,5612.0,5582.0,5486.0,5612.0,5627.0,5627.0,0,0.0,5627,5627.0,5627.0,5627.0,5627.0,5627.0,5627.0
mean,0.185381,,384.717434,1008749.0,10.135596,841.834015,184.409658,79078.46,841.834015,0.100749,13.637818,1253407.0,,3.518216,26.192926,0.253438,1.847719,4.896113,1.0,13.065587,15744.028689,25.528449,6.707499,7234675.0,2015.468811,6.165275,NaT,,30 days 20:44:27.949529,30.779278,3.518216,2.197263,1.851786,42.798956,0.637462
std,2.145937,,530.835503,235397.4,14.101314,979.687591,74.224811,65052.77,979.687591,0.00862,5.194782,644365.3,,4.567557,20.324661,0.129567,1.904024,4.69257,0.0,9.904441,8396.494091,14.564497,13.916036,2828987.0,1.368951,3.203,NaT,,4 days 04:19:27.110027,4.186589,4.567557,3.054635,2.801541,41.707068,0.480776
min,0.0,,0.0,389427.0,0.0,75.0,133.0,3600.0,75.0,0.1,3.0,28733.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,-9.8,0.0,2012.0,1.0,,,29 days 22:59:59,29.0,1.0,0.0,0.0,0.0,0.0
25%,0.0,,125.0,873913.5,4.0,500.0,138.0,50000.0,500.0,0.1,11.0,812309.0,,1.0,0.0,0.120642,1.126878,0.123017,1.0,0.123017,9546.0,29.0,-1.7,6764500.0,2015.0,4.0,NaT,,30 days 00:00:00,30.0,1.0,0.0,0.0,0.0,0.0
50%,0.0,,275.0,1064941.0,8.0,700.0,156.0,69630.0,700.0,0.1,14.0,1324922.0,,1.0,34.9,0.21468,1.48389,2.536684,1.0,18.498507,17262.0,33.0,0.0,7705925.0,2016.0,5.0,NaT,,30 days 00:00:00,30.0,1.0,1.0,1.0,40.0,1.0
75%,0.0,,525.0,1196226.0,13.0,1000.0,202.0,99025.0,1000.0,0.1,14.0,1632475.0,,4.0,40.1,0.364948,2.575299,8.017062,1.0,21.165398,18150.0,36.0,29.1,8133425.0,2016.0,9.0,NaT,,30 days 00:00:00,30.0,4.0,2.0,2.0,100.0,1.0
max,25.0,,21725.0,1291548.0,658.0,50000.0,520.0,1596948.0,50000.0,0.2,122.0,2473963.0,,17.0,54.8,0.431935,16.580365,75.834468,1.0,100.0,30794.0,41.0,30.3,11366980.0,2017.0,12.0,NaT,,60 days 00:00:00,60.0,17.0,16.0,18.0,100.0,1.0


In [190]:
df[~df['funded_date'].isnull()].describe()

Unnamed: 0,basket_amount,currency_exchange_loss_amount,funded_amount,id,lender_count,loan_amount,partner_id,terms.disbursal_amount,terms.loan_amount,terms.loss_liability.currency_exchange_coverage_rate,terms.repayment_term,translator.image,video.thumbnailImageId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_currency_exchange_loss_rate,partner_default_rate,partner_delinquency_rate,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_portfolio_yield,partner_profitability,partner_total_amount_raised,posted_year,posted_month,time_to_fund,days_to_fund,time_to_expire_date,days_to_expire,num_borrowers,num_male,num_female,pct_female,married
count,0.0,24808.0,122331.0,122331.0,122331.0,122331.0,112704.0,122331.0,122331.0,108836.0,122331.0,60093.0,76.0,122331.0,112704.0,112704.0,112704.0,112704.0,112704.0,112704.0,112704.0,106335.0,101404.0,112704.0,122331.0,122331.0,122331,122331.0,97418,97418.0,122331.0,122331.0,122331.0,122331.0,122331.0
mean,,5.733943,455.280142,725344.7,14.564174,455.280346,164.072934,38008.24,455.280346,0.123258,12.741733,1161574.0,615530.5,1.784307,30.217649,0.210486,3.919706,4.331362,1.0,11.452212,18295.78826,31.344201,2.127901,7313707.0,2013.518961,6.450417,8 days 02:26:26.764883,9.579493,36 days 23:18:31.642858,36.828009,1.784241,0.841079,1.243986,63.67334,0.615756
std,,12.986843,660.649621,341743.0,19.822752,660.649631,65.416106,41525.43,660.649631,0.042248,8.325936,706158.3,462245.0,2.803524,16.900695,0.267191,10.717564,5.506963,0.0,10.934934,9455.86292,10.099698,11.091623,3310323.0,2.265506,3.555518,12 days 10:25:49.877589,7.36907,76 days 01:29:10.498510,76.029403,2.803528,1.623172,1.975189,42.736448,0.486418
min,,0.01,25.0,251.0,1.0,25.0,6.0,25.0,25.0,0.1,1.0,23924.0,297574.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,-117.79,3950.0,2006.0,1.0,-442 days +13:27:55,0.0,1 days 21:42:53,1.0,1.0,0.0,0.0,0.0,0.0
25%,,0.95,225.0,430698.5,7.0,225.0,133.0,20000.0,225.0,0.1,9.0,505996.0,324494.8,1.0,24.3,0.089354,0.085473,0.0,1.0,0.0,9546.0,33.0,-1.7,6764500.0,2012.0,3.0,0 days 07:07:53,3.0,30 days 00:00:00,30.0,1.0,0.0,0.0,0.0,0.0
50%,,2.56,350.0,737589.0,11.0,350.0,156.0,30000.0,350.0,0.1,13.0,1186147.0,336108.0,1.0,34.9,0.164711,1.48389,2.536684,1.0,16.058249,18150.0,33.1,0.0,7646925.0,2014.0,6.0,2 days 08:26:06,8.0,30 days 00:00:00,30.0,1.0,0.0,1.0,100.0,1.0
75%,,6.54,575.0,1049704.0,18.0,575.0,164.0,50000.0,575.0,0.1,14.0,1669010.0,624790.0,1.0,40.1,0.217001,3.652283,8.017062,1.0,18.498507,21415.0,36.0,2.23,8133425.0,2015.0,10.0,13 days 12:43:25,16.0,30 days 00:00:00,30.0,1.0,1.0,1.0,100.0,1.0
max,,1285.51,50000.0,1292273.0,1589.0,50000.0,526.0,1579072.0,50000.0,0.2,122.0,2499150.0,1754457.0,46.0,54.8,7.513861,94.939083,100.0,1.0,100.0,30794.0,41.0,30.3,11366980.0,2017.0,12.0,127 days 21:15:58,23.0,1673 days 23:37:55,1673.0,46.0,24.0,43.0,100.0,1.0


In [191]:
df[str_columns].isnull().sum()[df[str_columns].isnull().sum()>0]

description.texts.en                      4328
location.town                            17551
terms.repayment_interval                127014
themes                                   98944
translator.byline                        45596
use                                       4327
video.title                             127882
video.youtubeId                         127882
partner_charges_fees_and_interest         9642
partner_countries                         9642
partner_name                              9642
partner_rating                            9642
partner_social_performance_strengths     14204
partner_status                            9642
partner_url                              13709
dtype: int64

In [192]:
df[df['partner_social_performance_strengths'].isnull()].describe()

Unnamed: 0,basket_amount,currency_exchange_loss_amount,funded_amount,id,lender_count,loan_amount,partner_id,terms.disbursal_amount,terms.loan_amount,terms.loss_liability.currency_exchange_coverage_rate,terms.repayment_term,translator.image,video.thumbnailImageId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_currency_exchange_loss_rate,partner_default_rate,partner_delinquency_rate,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_portfolio_yield,partner_profitability,partner_total_amount_raised,posted_year,posted_month,time_to_fund,days_to_fund,time_to_expire_date,days_to_expire,num_borrowers,num_male,num_female,pct_female,married
count,5.0,9209.0,14204.0,14204.0,14204.0,14204.0,4562.0,14204.0,14204.0,1222.0,14204.0,869.0,46.0,14204.0,4562.0,4562.0,4562.0,4562.0,4562.0,4562.0,4562.0,5.0,565.0,4562.0,14204.0,14204.0,14167,14167.0,10503,10503.0,14204.0,14204.0,14204.0,14204.0,14204.0
mean,0.0,3.623674,356.308434,805105.8,13.689102,359.565967,94.015344,18796.35,359.565967,0.131833,8.230006,1140871.0,350261.3,1.060054,0.0,0.027849,24.936646,3.602258,1.0,5.87701,1080.888865,2.6,10.158159,347980.266331,2012.479794,6.331386,6 days 08:25:26.543728,8.975436,83 days 18:28:00.393697,83.126345,1.059842,0.290834,0.929879,81.155335,0.317375
std,0.0,5.739521,1178.702776,459793.5,33.371853,1238.572765,155.105302,38430.46,1238.572765,0.046602,7.374702,784973.5,172571.7,0.903098,0.0,0.160107,21.819607,15.702985,0.0,22.711207,735.031509,4.722288,14.70989,226580.67298,3.165162,3.345457,21 days 14:49:43.088653,6.997425,225 days 17:10:03.480230,225.725302,0.902995,0.706413,0.805274,35.201562,0.465471
min,0.0,0.01,20.0,251.0,1.0,25.0,6.0,25.0,25.0,0.1,1.0,23924.0,297574.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-20.6,0.0,2006.0,1.0,-442 days +13:27:55,0.0,1 days 21:42:53,1.0,1.0,0.0,0.0,0.0,0.0
25%,0.0,0.78,125.0,175088.0,5.0,125.0,24.0,10000.0,125.0,0.1,3.0,371733.0,315651.8,1.0,0.0,0.0,0.0,0.0,1.0,0.0,446.0,0.0,-12.1,120950.0,2010.0,3.0,0 days 05:12:57,3.0,44 days 15:59:58,44.0,1.0,0.0,1.0,100.0,0.0
50%,0.0,1.64,225.0,1081506.0,10.0,225.0,25.0,10000.0,225.0,0.1,6.0,1186147.0,327750.0,1.0,0.0,0.0,19.741479,0.0,1.0,0.0,913.0,1.0,15.7,361650.0,2014.0,6.0,1 days 00:41:57,8.0,44 days 16:59:56,44.0,1.0,0.0,1.0,100.0,0.0
75%,0.0,3.77,350.0,1085576.0,16.0,350.0,32.0,20000.0,350.0,0.2,12.0,1940938.0,329004.5,1.0,0.0,0.0,38.001789,0.0,1.0,0.0,1666.0,1.0,17.7,472200.0,2015.0,9.0,8 days 09:14:12.500000,15.0,44 days 16:59:59,44.0,1.0,0.0,1.0,100.0,1.0
max,0.0,79.34,50000.0,1287031.0,1589.0,50000.0,526.0,1322500.0,50000.0,0.2,67.0,2473963.0,1490613.0,45.0,0.0,1.827921,93.341275,100.0,1.0,100.0,2031.0,11.0,30.3,697975.0,2017.0,12.0,127 days 21:15:58,23.0,1673 days 23:37:55,1673.0,45.0,15.0,43.0,100.0,1.0


In [193]:
df[~df['partner_social_performance_strengths'].isnull()].describe()

Unnamed: 0,basket_amount,currency_exchange_loss_amount,funded_amount,id,lender_count,loan_amount,partner_id,terms.disbursal_amount,terms.loan_amount,terms.loss_liability.currency_exchange_coverage_rate,terms.repayment_term,translator.image,video.thumbnailImageId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_currency_exchange_loss_rate,partner_default_rate,partner_delinquency_rate,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_portfolio_yield,partner_profitability,partner_total_amount_raised,posted_year,posted_month,time_to_fund,days_to_fund,time_to_expire_date,days_to_expire,num_borrowers,num_male,num_female,pct_female,married
count,939.0,15599.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113225.0,113754.0,62148.0,30.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,113754.0,111912.0,106325.0,113754.0,113754.0,113754.0,108164,108164.0,92542,92542.0,113754.0,113754.0,113754.0,113754.0,113754.0
mean,0.186368,6.979759,464.147854,729404.2,14.454375,486.353227,167.885833,42438.75,486.353227,0.12205,13.349421,1166184.0,1022276.0,1.960511,31.230942,0.21993,2.97462,4.388464,1.0,11.755396,18860.286179,31.055404,2.32152,7589162.0,2013.745169,6.451175,8 days 07:56:38.380144,9.658611,31 days 06:55:28.596756,31.205615,1.960467,0.976871,1.313273,60.457854,0.654087
std,2.151606,15.639856,555.67751,324192.3,17.152674,574.959016,57.930511,43554.52,574.959016,0.041458,8.137618,702468.9,472883.8,3.071569,16.325622,0.263052,8.746392,4.608373,0.0,10.070576,8953.291074,10.443428,11.263862,3037812.0,2.088873,3.564705,10 days 15:21:42.391957,7.412754,5 days 13:33:13.763375,5.583132,3.071572,1.807555,2.12224,43.225938,0.475667
min,0.0,0.01,0.0,108274.0,0.0,25.0,133.0,248.0,25.0,0.1,2.0,23924.0,344400.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,7.0,0.0,-117.79,16300.0,2009.0,1.0,-1 days +20:25:41,0.0,29 days 22:59:49,29.0,1.0,0.0,0.0,0.0,0.0
25%,0.0,1.22,250.0,443949.5,7.0,250.0,133.0,20000.0,250.0,0.1,11.0,518671.0,613509.0,1.0,24.3,0.089354,0.619151,0.0,1.0,0.0,17262.0,29.0,-1.7,6764500.0,2012.0,3.0,0 days 07:31:32,3.0,30 days 00:00:00,30.0,1.0,0.0,0.0,0.0,0.0
50%,0.0,3.48,350.0,717381.5,11.0,375.0,156.0,30000.0,375.0,0.1,14.0,1186147.0,956991.5,1.0,34.9,0.164711,1.48389,2.536684,1.0,16.058249,18150.0,33.1,0.0,7705925.0,2014.0,6.0,2 days 14:56:22,8.0,30 days 00:00:00,30.0,1.0,1.0,1.0,76.923077,1.0
75%,0.0,8.24,600.0,1007022.0,18.0,600.0,164.0,50000.0,600.0,0.1,14.0,1668411.0,1496371.0,1.0,40.1,0.364948,3.652283,8.017062,1.0,18.498507,30794.0,36.0,2.23,11366980.0,2016.0,10.0,14 days 03:06:37.750000,16.0,30 days 00:00:00,30.0,1.0,1.0,1.0,100.0,1.0
max,25.0,1285.51,50000.0,1292273.0,1491.0,50000.0,397.0,1596948.0,50000.0,0.2,122.0,2499150.0,1754457.0,46.0,54.8,7.513861,94.939083,100.0,1.0,100.0,30794.0,41.0,29.1,11366980.0,2017.0,12.0,62 days 01:59:03,23.0,60 days 00:00:00,60.0,46.0,24.0,33.0,100.0,1.0


In [194]:
df[numeric_columns].isnull().sum()[df[numeric_columns].isnull().sum()>0]

basket_amount                                           127014
currency_exchange_loss_amount                           103150
partner_id                                                9642
terms.loss_liability.currency_exchange_coverage_rate     13511
translator.image                                         64941
video.thumbnailImageId                                  127882
partner_average_loan_size_percent_per_capita_income       9642
partner_currency_exchange_loss_rate                       9642
partner_default_rate                                      9642
partner_delinquency_rate                                  9642
partner_image.template_id                                 9642
partner_loans_at_risk_rate                                9642
partner_loans_posted                                      9642
partner_portfolio_yield                                  16041
partner_profitability                                    21068
partner_total_amount_raised                            