# Lab 2:  Feature Engineering

Feature engineering is the process of transforming raw data into features (input variables) that algorithms can readily use.  New data scientists often spend most of their time testing different algorithms; however, most accuracy gains generally come from well-crafted features.  In this lab we will introduce the following types of feature engineering:

1. Feature pruning
2. Temporal Features (month, year, etc)
3. One-hot encoding / dummy variables
4. Extracting features from strings
5. Metadata
6. Feature scaling
7. Data Imputation / cleaning


While performing feature engineering, it is critical to keep in mind the question that you are trying to answer.  For the purposes of this exercise, we will use the KIVA dataset and try to answer the following question:

*What drives the loan amount requested by KIVA borrowers?*

In the language of Module 1, our outcome feature is **loan_amount**. In the next notebook, we will formalize this research question as a machine learning task. Our machine learning task will be to predict the loan amount that a borrower requests from KIVA using all the features we explore in this notebook.


We may not end up using all the features we create, but the process is an important extension of exploratory analysis. The key difference between feature engineering and exploratory analysis is that we now have a defined question in mind: "What drives the loan amount requested by KIVA borrowers?"

In [1]:
from sklearn import preprocessing
import pandas as pd
import numpy as np
import re

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (15, 8)
sns.set()
sns.set(font_scale=1.5)

In [2]:
# the command below tells jupyter to display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)
df = pd.read_csv("../data/raw_data.csv", low_memory=False)
df.head()

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrowers,currency_exchange_loss_amount,description.languages,description.texts.en,description.texts.es,description.texts.fr,description.texts.ru,funded_amount,funded_date,id,image.id,image.template_id,journal_totals.bulkEntries,journal_totals.entries,lender_count,loan_amount,location.country,location.country_code,location.geo.level,location.geo.pairs,location.geo.type,location.town,name,partner_id,payments,planned_expiration_date,posted_date,sector,status,tags,terms.disbursal_amount,terms.disbursal_currency,terms.disbursal_date,terms.loan_amount,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.currency_exchange_coverage_rate,...,themes,translator.byline,translator.image,use,video.id,video.thumbnailImageId,video.title,video.youtubeId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_charges_fees_and_interest,partner_countries,partner_currency_exchange_loss_rate,partner_default_rate,partner_default_rate_note,partner_delinquency_rate,partner_delinquency_rate_note,partner_image.id,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_name,partner_portfolio_yield,partner_portfolio_yield_note,partner_profitability,partner_rating,partner_social_performance_strengths,partner_start_date,partner_status,partner_total_amount_raised,partner_url,posted_datetime,funded_datetime,planned_expiration_datetime,dispursal_datetime,number_of_loans,dispersal_date,posted_year,posted_month,time_to_fund
0,Farming,0.0,False,"[{'first_name': 'Evaline', 'last_name': '', 'g...",,['en'],Evaline is a married lady aged 44 years old an...,,,,0,,1291548,2516002,1,0,0,0,500,Kenya,KE,town,-0.583333 35.183333,point,litein,Evaline,386.0,[],2017-06-08,2017-05-09,Agriculture,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Pare...",50000.0,KES,2017-04-03T07:00:00Z,500,"[{'due_date': '2017-05-10T07:00:00Z', 'amount'...",shared,0.1,...,,Julie Keaton,892591.0,to purchase more tea leaves to sell to the tea...,,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1592272.0,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/,2017-05-09 00:40:03,,2017-06-08 00:40:03,2017-04-03 07:00:00,1,2017-04-03,2017,5,
1,Furniture Making,0.0,False,"[{'first_name': 'Julias', 'last_name': '', 'ge...",,['en'],Aged 42 years is a man by the name of Julias. ...,,,,0,,1291532,2515992,1,0,0,0,500,Kenya,KE,town,0.566667 34.566667,point,Bungoma,Julias,386.0,[],2017-06-08,2017-05-09,Manufacturing,fundraising,[],50000.0,KES,2017-04-03T07:00:00Z,500,"[{'due_date': '2017-05-09T07:00:00Z', 'amount'...",shared,0.1,...,,Morena Calvo,1832928.0,to buy timber to make more furniture for his e...,,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1592272.0,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/,2017-05-09 00:30:05,,2017-06-08 00:30:05,2017-04-03 07:00:00,1,2017-04-03,2017,5,
2,Home Energy,0.0,False,"[{'first_name': 'Rose', 'last_name': '', 'gend...",,['en'],"Hello Kiva Community! <br /><br />Meet Rose, w...",,,,50,,1291530,2515991,1,0,0,2,75,Kenya,KE,town,0.516667 35.283333,point,Eldoret,Rose,156.0,[],2017-06-08,2017-05-09,Personal Use,fundraising,"[{'name': '#Eco-friendly'}, {'name': '#Technol...",6000.0,KES,2017-04-28T07:00:00Z,75,"[{'due_date': '2017-05-14T07:00:00Z', 'amount'...",shared,0.1,...,"['Green', 'Earth Day Campaign']",Julie Keaton,892591.0,to buy a solar lantern.,,,,,1,49.6,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.431935,2.575299,,2.536684,,1834079.0,1.0,24.200354,18150.0,Juhudi Kilimo,33.0,,-7.1,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2010-01-15T20:20:17Z,active,7705925.0,http://www.juhudikilimo.com/,2017-05-09 00:30:04,,2017-06-08 00:30:03,2017-04-28 07:00:00,1,2017-04-28,2017,5,
3,Used Clothing,0.0,False,"[{'first_name': 'Jane', 'last_name': '', 'gend...",,['en'],"Jane was born in the 1980, and she is happily ...",,,,0,,1291525,2515986,1,0,0,0,500,Kenya,KE,town,0.566667 34.566667,point,Bungoma,Jane,386.0,[],2017-06-08,2017-05-09,Clothing,fundraising,[{'name': '#Eco-friendly'}],50000.0,KES,2017-04-03T07:00:00Z,500,"[{'due_date': '2017-05-08T07:00:00Z', 'amount'...",shared,0.1,...,,Julie Keaton,892591.0,to buy more clothes to meet the needs and tast...,,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1592272.0,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/,2017-05-09 00:20:04,,2017-06-08 00:20:04,2017-04-03 07:00:00,1,2017-04-03,2017,5,
4,Farming,0.0,False,"[{'first_name': 'Alice', 'last_name': '', 'gen...",,['en'],Alice (the woman pictured above in her small s...,,,,0,,1291518,2515975,1,0,0,0,400,Kenya,KE,town,1 38,point,Nandi Hills,Alice,156.0,[],2017-06-08,2017-05-09,Agriculture,fundraising,[{'name': '#Woman Owned Biz'}],40000.0,KES,2017-05-27T07:00:00Z,400,"[{'due_date': '2017-05-27T07:00:00Z', 'amount'...",shared,0.1,...,['Rural Exclusion'],,,"to buy farming inputs (fertilizers, pesticides...",,,,,1,49.6,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.431935,2.575299,,2.536684,,1834079.0,1.0,24.200354,18150.0,Juhudi Kilimo,33.0,,-7.1,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2010-01-15T20:20:17Z,active,7705925.0,http://www.juhudikilimo.com/,2017-05-09 00:20:03,,2017-06-08 00:20:03,2017-05-27 07:00:00,1,2017-05-27,2017,5,


## 1. Feature Pruning
There is no need to keep features that have zero variation: algorithms can only provide meaningful insights when there is variation in the features.  Since we are performing feature engineering in order to feed these features into a machine learning algorithm, let's go ahead and remove all columns that contain only a single unique value.

In [3]:
for col in df.columns:
    if df[col].unique().size==1:
        print("Dropping column: {0}".format(col))
        df = df.drop(col, axis=1)

Dropping column: image.template_id
Dropping column: journal_totals.bulkEntries
Dropping column: journal_totals.entries
Dropping column: location.country
Dropping column: location.country_code
Dropping column: location.geo.type
Dropping column: payments
Dropping column: partner_default_rate_note
Dropping column: partner_delinquency_rate_note
Dropping column: partner_image.template_id
Dropping column: partner_portfolio_yield_note
Dropping column: number_of_loans
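
The pruning rule can also be tried on a tiny frame. The sketch below is illustrative only: the `toy` DataFrame and its values are made up, and `nunique(dropna=False)` is used as an alternative way of counting unique values that treats NaN as a value, so an entirely missing column is flagged as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "country" is constant and "note" is entirely missing.
toy = pd.DataFrame({
    "loan_amount": [500, 75, 400],
    "country": ["Kenya", "Kenya", "Kenya"],
    "note": [np.nan, np.nan, np.nan],
})

# nunique(dropna=False) counts NaN as a value, so an all-NaN column scores 1.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
pruned = toy.drop(columns=constant_cols)

print(constant_cols)
print(pruned.columns.tolist())
```

Collecting the column names first and dropping them in one call avoids rebuilding the DataFrame once per dropped column.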


## 2. Temporal Features
Time trends are often significant and should not be neglected.  Most algorithms cannot make use of raw datetimes; however, they can find patterns in the data if they are told which observations occur in a given year, on a weekday vs. a weekend, on a holiday, etc.

Before we can extract this metadata, let's convert the strings in the pandas DataFrame to datetime objects. Luckily for us, all time fields in this dataset have "_date" in their name.

Pandas is really adept at time series, and we will use `pd.to_datetime` to create pandas timestamps.
For a list of methods that can be applied to a pandas datetime, see https://pandas.pydata.org/pandas-docs/version/0.21/api.html#id34

In [4]:
# convert every column whose name contains "_date" to pandas datetimes
for col in [c for c in df.columns if "_date" in c]:
    df[col] = pd.to_datetime(df[col])

### .dt accessor
The pandas `.dt` accessor makes it easy to construct additional features from these datetimes.

In [5]:
##  posted date features
df['posted_year']=df['posted_date'].dt.year
df['posted_month']=df['posted_date'].dt.month

## Time to fund is the funded date minus the posted date
## we add these fields because the homework question in the next notebook involves predicting time to fund
df['time_to_fund'] =df['funded_date'] - df['posted_date']
df['days_to_fund'] = df['time_to_fund'].dt.days

# expiration date features
## Time to expiration is the expiration date minus the Posted Date
df['time_to_expire_date'] =df['planned_expiration_date'] - df['posted_date']
df['days_to_expire'] = df['time_to_expire_date'].dt.days
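
The weekday vs. weekend idea mentioned at the start of this section is not built in the cell above. A minimal sketch of how it might look is shown here; the `toy` frame and the column names `posted_dayofweek` and `posted_weekend` are hypothetical and not part of the lab's dataset.

```python
import pandas as pd

# Hypothetical posting timestamps; the column name mirrors the lab's data.
toy = pd.DataFrame({"posted_date": pd.to_datetime(
    ["2017-05-09 00:40:03", "2017-05-13 10:00:00", "2017-05-14 18:30:00"])})

# dayofweek runs from 0 (Monday) to 6 (Sunday).
toy["posted_dayofweek"] = toy["posted_date"].dt.dayofweek
# Saturday (5) and Sunday (6) are flagged as weekend.
toy["posted_weekend"] = (toy["posted_dayofweek"] >= 5).astype(int)

print(toy)
```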

## 3. One-hot encoding
One-hot encoding is the process of converting categorical or string data into a set of binary (0/1) features, one per category.  Let's practice one-hot encoding by converting the "tags" column into a set of binary features indicating whether or not a particular tag appears in a given row.

In order to do this we will first need to convert the "tags" column into a list of strings, and then we will use the pandas `get_dummies` function to create the binary features.  In statistics, binary features are often referred to as dummy variables.



In [6]:
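import ast

# "tags" is stored as a string; ast.literal_eval parses it without the risks of eval
df['tag_list'] = df['tags'].apply(lambda x: [elem['name'] for elem in ast.literal_eval(x)])
# one row per (loan, tag), one-hot encode, then sum back to one row per loan
tag_df = pd.get_dummies(df['tag_list'].apply(pd.Series).stack()).groupby(level=0).sum()
# TODO - Explain how merges work or better yet figure a way to avoid merging. - Jack 11/10/17
df = df.merge(tag_df, left_index = True, right_index = True, how = 'outer')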

In [7]:
df[tag_df.columns] = df[tag_df.columns].fillna(0)

In [8]:
df.columns

Index(['activity', 'basket_amount', 'bonus_credit_eligibility', 'borrowers',
       'currency_exchange_loss_amount', 'description.languages',
       'description.texts.en', 'description.texts.es', 'description.texts.fr',
       'description.texts.ru',
       ...
       '#Technology', '#Tourism', '#Trees', '#Unique', '#Vegan', '#Widowed',
       '#Woman Owned Biz', 'user_favorite', 'volunteer_like',
       'volunteer_pick'],
      dtype='object', length=112)
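
One possible way to one-hot encode a list-valued column without a merge is sketched below. It is an alternative to the cells above, not the lab's method: the `toy` rows are made up, and the approach assumes that no tag name contains the `|` separator.

```python
import ast
import pandas as pd

# Hypothetical rows shaped like the lab's "tags" column (strings, not lists).
toy = pd.DataFrame({"tags": [
    "[{'name': '#Woman Owned Biz'}, {'name': '#Parent'}]",
    "[]",
    "[{'name': '#Parent'}]",
]})

# literal_eval only parses Python literals, so it is safer than eval.
tag_lists = toy["tags"].apply(lambda s: [d["name"] for d in ast.literal_eval(s)])

# Join each list with a separator, then split it back into indicator columns.
# Rows with no tags become all zeros, so no merge or fillna is needed.
tag_dummies = tag_lists.str.join("|").str.get_dummies(sep="|")
toy = pd.concat([toy, tag_dummies], axis=1)

print(tag_dummies)
```

Because `tag_dummies` keeps the original index, `pd.concat(..., axis=1)` lines the indicator columns up with the rows they came from.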

## 4. Extracting features from strings

String variables by themselves are generally not good inputs to algorithms; however, it is often possible to extract meaningful features by encoding the information that they contain.  Let's first find out which of our variables are string variables.  From there, let's review some of the variables and see if we can construct new features from the contents of these string variables.

To discover which of our DataFrame columns are string variables, we will use the pandas `dtypes` attribute.  Pandas has the following types:



|       dtype        |        Description        |
|--------------------|---------------------------|
|      float         | Numeric values with a decimal point.  If NaNs exist in an otherwise numeric column, pandas will default to float |
|        int         | Numeric values without decimal points |
|       bool         | Column consisting of True and False |
| datetime64[ns, tz] | Objects which contain a specific date and time |
|  timedelta64[ns]   | Objects which indicate the time elapsed between two datetimes |
|     category       | Variables that can only take specified values |
|      object        | Pandas representation of string variables (and of other Python objects such as lists) |

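Let's now count how many columns there are of each data type, and then use `select_dtypes` to view all columns with dtype == object.  Older pandas versions offered `df.get_dtype_counts()` for the first step; it was removed in pandas 1.0, and `df.dtypes.value_counts()` gives the same information.

In [9]:
df.dtypes.value_counts()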

bool                2
datetime64[ns]     10
float64            56
int64              10
object             32
timedelta64[ns]     2
dtype: int64

In [10]:
df.select_dtypes(include=[object])

Unnamed: 0,activity,borrowers,description.languages,description.texts.en,description.texts.es,description.texts.fr,description.texts.ru,location.geo.level,location.geo.pairs,location.town,name,sector,status,tags,terms.disbursal_currency,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.nonpayment,terms.repayment_interval,terms.scheduled_payments,themes,translator.byline,use,video.title,video.youtubeId,partner_countries,partner_name,partner_rating,partner_social_performance_strengths,partner_status,partner_url,tag_list
0,Farming,"[{'first_name': 'Evaline', 'last_name': '', 'g...",['en'],Evaline is a married lady aged 44 years old an...,,,,town,-0.583333 35.183333,litein,Evaline,Agriculture,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Pare...",KES,"[{'due_date': '2017-05-10T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Julie Keaton,to purchase more tea leaves to sell to the tea...,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/,"[#Woman Owned Biz, #Parent]"
1,Furniture Making,"[{'first_name': 'Julias', 'last_name': '', 'ge...",['en'],Aged 42 years is a man by the name of Julias. ...,,,,town,0.566667 34.566667,Bungoma,Julias,Manufacturing,fundraising,[],KES,"[{'due_date': '2017-05-09T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Morena Calvo,to buy timber to make more furniture for his e...,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/,[]
2,Home Energy,"[{'first_name': 'Rose', 'last_name': '', 'gend...",['en'],"Hello Kiva Community! <br /><br />Meet Rose, w...",,,,town,0.516667 35.283333,Eldoret,Rose,Personal Use,fundraising,"[{'name': '#Eco-friendly'}, {'name': '#Technol...",KES,"[{'due_date': '2017-05-14T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...","['Green', 'Earth Day Campaign']",Julie Keaton,to buy a solar lantern.,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/,"[#Eco-friendly, #Technology]"
3,Used Clothing,"[{'first_name': 'Jane', 'last_name': '', 'gend...",['en'],"Jane was born in the 1980, and she is happily ...",,,,town,0.566667 34.566667,Bungoma,Jane,Clothing,fundraising,[{'name': '#Eco-friendly'}],KES,"[{'due_date': '2017-05-08T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Julie Keaton,to buy more clothes to meet the needs and tast...,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/,[#Eco-friendly]
4,Farming,"[{'first_name': 'Alice', 'last_name': '', 'gen...",['en'],Alice (the woman pictured above in her small s...,,,,town,1 38,Nandi Hills,Alice,Agriculture,fundraising,[{'name': '#Woman Owned Biz'}],KES,"[{'due_date': '2017-05-27T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,"to buy farming inputs (fertilizers, pesticides...",,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/,[#Woman Owned Biz]
5,Used Clothing,"[{'first_name': 'Clare', 'last_name': '', 'gen...",['en'],Clare is a married woman who is blessed with 2...,,,,town,0.416667 34.25,Busia,Clare,Clothing,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Eco-...",KES,"[{'due_date': '2017-05-11T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,,to buy more bales of clothes to grow her busin...,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/,"[#Woman Owned Biz, #Eco-friendly]"
6,Farming,"[{'first_name': 'Mary', 'last_name': '', 'gend...",['en'],"Wonderful Kiva community, meet Mary (pictured ...",,,,town,1 38,Kerugoya,Mary,Agriculture,fundraising,"[{'name': '#Woman Owned Biz'}, {'name': '#Pare...",KES,"[{'due_date': '2017-05-27T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,to buy seeds so that she can begin horticultur...,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/,"[#Woman Owned Biz, #Parent]"
7,Pigs,"[{'first_name': 'James', 'last_name': '', 'gen...",['en'],James is a happily married man and is blessed ...,,,,town,1 38,Limuru,James,Agriculture,fundraising,[{'name': '#Animals'}],KES,"[{'due_date': '2017-06-06T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-08-01T07:00:00Z', 'amount'...",,,"to buy pig feeds and logs to burn charcoal, so...",,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Kenya ECLOF,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",active,http://www.eclof-kenya.org/,[#Animals]
8,Farming,"[{'first_name': 'Jacinta ', 'last_name': '', '...",['en'],Jacinta is 34 years old. She has four children...,,,,town,-0.283333 36.066667,Nakuru,Jacinta,Agriculture,fundraising,[],KES,"[{'due_date': '2017-05-24T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",,Lynn Cerra,to purchase farm inputs.,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",VisionFund Kenya,2.5,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.visionfundkenya.co.ke/,[]
9,Cereals,"[{'first_name': 'Emily ', 'last_name': '', 'ge...",['en'],"Meet this enterprising woman, Emily. She resid...",,,,town,1 38,Bomet,Emily,Food,fundraising,[{'name': '#Woman Owned Biz'}],KES,"[{'due_date': '2017-05-26T07:00:00Z', 'amount'...",shared,lender,Monthly,"[{'due_date': '2017-07-01T07:00:00Z', 'amount'...",['Rural Exclusion'],,to buy cereals to sell at her local market.,,,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",Juhudi Kilimo,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",active,http://www.juhudikilimo.com/,[#Woman Owned Biz]


The borrowers column looks like it may have some interesting information, but it is hard to tell since the string is cropped in the displayed DataFrame.  Let's take a look at an example value.

In [11]:
df['borrowers'][0]

"[{'first_name': 'Evaline', 'last_name': '', 'gender': 'F', 'pictured': True}]"

A very simple feature we can create is a count of the number of borrowers listed.  To do this we will use the pandas [apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) method, which allows us to apply a lambda function to a specific column or collection of columns in order to create a new vector.  The lambda function is applied to each row to calculate the value of the corresponding row in the new vector.

In [12]:
# each borrower / tag is stored as one {...} dict, so counting "{" counts the entries
df['num_borrowers'] = df['borrowers'].apply(lambda x: x.count("{"))
df['num_tags'] = df['tags'].apply(lambda x: x.count("{"))
print(df[df['num_borrowers']>1]['num_borrowers'].iloc[0])
print(df[df['num_borrowers']>1]['borrowers'].iloc[0])

4
[{'first_name': 'Florence ', 'last_name': '', 'gender': 'F', 'pictured': True}, {'first_name': 'Wanjiru', 'last_name': '', 'gender': 'F', 'pictured': True}, {'first_name': 'Jane ', 'last_name': '', 'gender': 'F', 'pictured': True}, {'first_name': 'Pauline ', 'last_name': '', 'gender': 'F', 'pictured': True}]


Keeping in mind that the question that we are trying to answer is "What drives the loan amount requested by KIVA borrowers?" let's create a few variables that encode the information on the gender of the listed borrowers.

In order to do this, we will once again use pandas' `apply` method, but this time we will introduce an if-else expression inside the lambda function.  This will enable us to change the value of the resulting column vector based on whether the conditional returns True or False for each row.

In [13]:
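# 1 unless the first listed borrower is male
df['female']=df['borrowers'].apply(lambda x: 0 if x.split("gender': '")[1][0]=='M' else 1)
# match the whole gender field; counting a bare 'M' or 'F' would also count letters in names
df['num_male'] = df['borrowers'].apply(lambda x: x.count("'gender': 'M'"))
df['num_female'] = df['borrowers'].apply(lambda x: x.count("'gender': 'F'"))
df['pct_female']=100.00*df['num_female']/(df['num_male']+df['num_female'])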
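
Counting characters in the raw string is fragile because first names can contain the same letters as the gender codes. As a hedged alternative, the sketch below parses the string into real Python objects first; the `toy` rows are invented but follow the format of the `borrowers` value shown earlier.

```python
import ast
import pandas as pd

# Hypothetical values in the same format as the lab's "borrowers" column.
toy = pd.DataFrame({"borrowers": [
    "[{'first_name': 'Mary', 'last_name': '', 'gender': 'F', 'pictured': True}]",
    "[{'first_name': 'Felix', 'last_name': '', 'gender': 'M', 'pictured': True}, "
    "{'first_name': 'Jane', 'last_name': '', 'gender': 'F', 'pictured': True}]",
]})

# Turn each string into a list of dicts, one dict per borrower.
parsed = toy["borrowers"].apply(ast.literal_eval)

toy["num_borrowers"] = parsed.apply(len)
toy["num_female"] = parsed.apply(lambda people: sum(p["gender"] == "F" for p in people))
toy["num_male"] = parsed.apply(lambda people: sum(p["gender"] == "M" for p in people))
toy["pct_female"] = 100.0 * toy["num_female"] / toy["num_borrowers"]

print(toy[["num_borrowers", "num_female", "num_male", "pct_female"]])
```

Here "Mary" contains an "M" and "Felix" contains an "F", yet neither affects the counts because only the `gender` field is inspected.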

Next up are marital status and a flag for whether or not the borrower has kids.  These features will all be binary, and in order to construct them we will use pandas' [str.contains](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html).  This is a handy method because it allows us to use a [regex](https://docs.python.org/2/library/re.html).

In [14]:
## Whether or not the borrower is widowed
##  Note the str.contains function
df['widowed'] = df['description.texts.en'].str.contains("widowed|widow", na=False) * 1.0
## Whether or not the borrower is married
df['married'] = np.where(df['description.texts.en'].str.contains("married|husband|wife", na=False)==True, 1, 0) * 1.0
## Whether or not the borrower has children, notice we look for many variants of the word.
df['kids'] = df['description.texts.en'].str.contains("kids|child|children|kid|son|daughter|mother|father|parents", na=False) * 1.0
df['parent'] = np.where(df['#Parent']==1, df['#Parent'], df['kids'])
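
One caveat with plain substring patterns is that short words also match inside longer ones, for example "son" inside "person". The sketch below, on made-up descriptions, shows how word boundaries (`\b`) could tighten such a pattern; whether the stricter rule is preferable for the Kiva text is a judgment call.

```python
import pandas as pd

# Hypothetical descriptions, including a missing one.
descriptions = pd.Series([
    "She is a hardworking person who sells vegetables.",
    "She has one son and two daughters.",
    None,
])

# Substring match: "person" triggers the "son" alternative.
loose = descriptions.str.contains("son|daughter", na=False)

# Whole-word match: \b requires a word boundary on both sides.
strict = descriptions.str.contains(r"\b(?:sons?|daughters?)\b", case=False, na=False)

print(pd.DataFrame({"loose": loose, "strict": strict}))
```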

### Age and number of children

Creating variables for age will be a bit tricky. In the cells below, we parse out the age of the borrower by doing the following:

1. Import a CSV that maps age strings appearing in the Kiva description field (e.g. "2 years") to their integer counterparts (e.g. 2).
2. Define a function that checks for each of these string values within the description.texts.en field of our main dataframe. If a match is found, we append that string value to a new list, which we create at the start of the function. If no match is found, we append a blank string. When the function has completed, we have a list the same length as our main dataframe, holding the age string for each observation (e.g. "2 years", or "" if no age value is available).
    1. We use functions from the regular expression package to perform the string searches within description.texts.en. Specifically, we use re.compile to combine all age strings of interest into a single pattern, and re.findall to find all occurrences of them.
3. Create a new column in our main dataframe, "age", which is simply the list we created in step 2.
4. Finally, use the CSV mapping to translate the string versions of age into their integer counterparts.

In [15]:
lookup_tags = pd.read_csv('../data/tags.csv')
lookup_tags.head(2)

Unnamed: 0,age,age_int,children_1,children_2,children_int
0,1 years,1,one child,1 child,1.0
1,2 years,2,two children,2 children,2.0


In [None]:
## Age of borrower and number of children
#  Define a function that searches every description for the phrases in `tag`
#  and returns a list holding the first match per row ('' when nothing matches).
def text_search(tag):
    # list that is filled in the loop, one entry per row of df
    number = []
    phrases = tag.dropna().astype(str).tolist()
    match = re.compile(r'\b(?:%s)\b' % '|'.join(phrases))
    for descr in df['description.texts.en']:
        # missing descriptions are NaN (a float); they get '' so the list stays aligned with df
        found = re.findall(match, descr) if isinstance(descr, str) else []
        number.append(found[0] if found else '')
    return number

In the cell below we write a small loop to go through each feature and search. Running this loop is fairly computationally expensive since it is doing a string match against every row of the data. You can expect it to take a few minutes to run. You can add other lists to the tags csv to extend the features you search for.

In [None]:
features=['age','children_1', 'children_2'] 
for feature in features:
    number= text_search(lookup_tags[feature])
    df[feature]=pd.DataFrame(number)

In [None]:
df.head(2)
len(df.index)

Finally, we map the integer fields onto our dataframe. That way we can decide whether to use number of children as a str feature or an int feature.

In [None]:
mydict = dict(zip(lookup_tags.children_1, lookup_tags.children_int))
df['children_int_1'] = df['children_1'].map(mydict)

mydict = dict(zip(lookup_tags.children_2, lookup_tags.children_int))
df['children_int_2'] = df['children_2'].map(mydict)

mydict = dict(zip(lookup_tags.age, lookup_tags.age_int))
df['age_int'] = df['age'].map(mydict)

In [None]:
df['children_int'] = df['children_int_1'].fillna(df['children_int_2'])
df['children_int'] = df['children_int'].fillna(0)
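
The search-then-map steps of this section can also be sketched with pandas' vectorized string methods, which may run faster than a Python loop. This is an illustration under assumptions rather than the lab's implementation: the two-row `lookup` table and the `descriptions` Series are invented stand-ins for `tags.csv` and `description.texts.en`.

```python
import re
import pandas as pd

# Hypothetical stand-in for the lookup CSV (age string -> integer).
lookup = pd.DataFrame({"age": ["34 years", "44 years"], "age_int": [34, 44]})

# Hypothetical descriptions, including one without an age and one missing.
descriptions = pd.Series([
    "Evaline is a married lady aged 44 years old.",
    "Jacinta is 34 years old. She has four children.",
    "to buy a solar lantern.",
    None,
])

# One pattern with every age phrase as an alternative, bounded by \b.
pattern = r"\b(%s)\b" % "|".join(re.escape(p) for p in lookup["age"])

# str.extract returns the first match per row, or NaN when nothing matches.
age_str = descriptions.str.extract(pattern, expand=False)

# Map the matched phrase to its integer counterpart.
age_int = age_str.map(dict(zip(lookup["age"], lookup["age_int"])))

print(pd.DataFrame({"age": age_str, "age_int": age_int}))
```

Unmatched rows come back as NaN instead of an empty string, which lets them flow straight into the missing-data handling later in the lab.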

## 5. Metadata

We have data specifying which partner provided the loan for each row; however, this information alone is not that helpful.  Let's try to extract some metadata from the dataset to learn how impactful partners are.

In [None]:
print("Number of unique partners: {0} \n".format(len(df['partner_name'].unique())))
print("Top 15 partners: \n{0}\n".format(df['partner_name'].value_counts().head(15)))
print("Bottom 15 partners: \n{0}".format(df['partner_name'].value_counts().tail(15)))

There is a huge disparity in the number of loans provided per partner.  This information could be informative.

In [None]:
# let's only include those that have > 1000 obs (top 10)
top_partners = df['partner_name'].value_counts().index[:10]
top_partner_ids = df['partner_id'].value_counts().index[:10]
df['top_partners'] = df['partner_name'].apply(lambda x: x if x in top_partners else "Other")
df['top_partner_id'] = df['partner_id'].apply(lambda x: x if x in top_partner_ids else -1)

In [None]:
ax = sns.boxplot(x=df['top_partners'], y=df['loan_amount'],showfliers=False)
ax.set_xticklabels(labels=ax.get_xticklabels(), rotation = 90)

We know from Kiva that an exploratory partner who does not have a proven track record can be tested using a seed sum of $50,000. Let's create a boolean feature for exploratory partner in case we want to remove or otherwise treat these partners differently.

In [None]:
# select loan_amount before summing so that non-numeric columns are never aggregated
partner_dollar_amount = pd.DataFrame(df[(df['borrower_count'] == 1)].groupby(['partner_name','posted_year'])['loan_amount'].sum())
partner_dollar_amount.columns = ['partner_dollar_amount']
df = df.merge(partner_dollar_amount, left_on=['partner_name','posted_year'], right_index=True, how='outer')

In [None]:
df['exploratory_partner']=np.where(df['partner_dollar_amount']>50000,0,1)

In [None]:
df[(df['borrower_count'] == 1)]['exploratory_partner'].value_counts()
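
A group total can also be attached to every row without a merge by using `groupby(...).transform`. The sketch below uses an invented `toy` frame and, unlike the cells above, does not filter on `borrower_count`; it is meant only to show the mechanics of the per-partner, per-year total and the $50,000 threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical loans from two partners across two years.
toy = pd.DataFrame({
    "partner_name": ["A", "A", "B", "B", "B"],
    "posted_year": [2016, 2016, 2016, 2017, 2017],
    "loan_amount": [30000, 40000, 500, 20000, 25000],
})

# transform("sum") returns one value per original row, aligned on the index.
toy["partner_dollar_amount"] = (
    toy.groupby(["partner_name", "posted_year"])["loan_amount"].transform("sum"))

# Partners at or below the seed sum in a given year are flagged as exploratory.
toy["exploratory_partner"] = np.where(toy["partner_dollar_amount"] > 50000, 0, 1)

print(toy)
```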

## 6. Feature Scaling

We will not overwrite our dataframe with scaled values because the appropriate scaling technique depends on the algorithm.  These are the three most common feature scaling techniques:
1. Normalization
2. Standardization
3. Log-transform

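Normalization is the process of rescaling the data to a fixed range, by default [0, 1].  The formula for this approach is:

$$X_{std} = \dfrac{X - X_{min}}{X_{max} - X_{min}}$$
$$X_{scaled} = X_{std} \cdot (max - min) + min$$

Here $X_{min}$ and $X_{max}$ are the smallest and largest values of the feature, and $(min, max)$ is the target range, which is (0, 1) unless specified otherwise.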

In [None]:
# MinMaxScaler rescales to the range (0, 1) by default
min_max_scaler = preprocessing.MinMaxScaler()
normalized = min_max_scaler.fit_transform(df['loan_amount'].astype(np.float64).values.reshape(-1,1))[:,0]
print("Pre Scaling\tMin: {0}\t\t Max: {1}\tMean: {2}".format(df['loan_amount'].min(),df['loan_amount'].max(),df['loan_amount'].mean()))
print("Post Scaling\tMin: {0}\t Max: {1}\tMean: {2}".format(np.min(normalized),np.max(normalized),np.mean(normalized)))
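
The normalization formula can be checked by hand on a few numbers. The sketch below uses made-up loan amounts and plain NumPy; `feature_min` and `feature_max` are hypothetical names for the target range.

```python
import numpy as np

# Hypothetical loan amounts.
x = np.array([75.0, 400.0, 500.0])

# Target range; (0, 1) matches the MinMaxScaler default.
feature_min, feature_max = 0.0, 1.0

# Step 1: shift and divide so the smallest value is 0 and the largest is 1.
x_std = (x - x.min()) / (x.max() - x.min())
# Step 2: stretch to the target range (a no-op for the default range).
x_scaled = x_std * (feature_max - feature_min) + feature_min

print(x_scaled)
```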

Standardization rescales the data so that it has zero mean and unit variance.  It is most meaningful when the data is roughly normally distributed (i.e. Gaussian).  Below is the formula:
$${\dfrac{x - \bar x}{\sigma}}$$

In [None]:
standardized = preprocessing.scale(df['loan_amount'].astype(np.float64))
print("Post Scaling\tMin: {0}\t Max: {1}\tMean: {2}".format(np.min(standardized),np.max(standardized),np.mean(standardized)))
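
As a quick check of the standardization formula, the same calculation can be done by hand with NumPy on made-up values. Note that NumPy's `std` uses the population standard deviation (`ddof=0`) by default, which is also what scikit-learn's scaler uses, whereas the pandas `Series.std` default is the sample standard deviation (`ddof=1`).

```python
import numpy as np

# Hypothetical loan amounts.
x = np.array([75.0, 400.0, 500.0])

# Subtract the mean and divide by the population standard deviation.
z = (x - x.mean()) / x.std()

print(z)
print(z.mean(), z.std())
```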

From these values, it appears that our data has a skewed distribution, which makes it a good candidate for a log transform.

In [None]:
plt.hist(df['loan_amount'])
plt.show()
log_loan_amount = np.log(df['loan_amount'])
plt.hist(log_loan_amount)
plt.show()
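
Skewness can also be compared numerically before and after the transform. The sketch below uses invented amounts and `np.log1p`, which computes log(1 + x) and therefore stays finite if a column happens to contain zeros; that substitution is an assumption for illustration and is not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical, strongly right-skewed amounts, including a zero.
amounts = pd.Series([0, 75, 400, 500, 500, 1200, 25000], dtype=float)

# log1p(x) = log(1 + x), so a zero maps to zero instead of -inf.
log_amounts = np.log1p(amounts)

# A skew close to 0 indicates a more symmetric distribution.
print(amounts.skew(), log_amounts.skew())
```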

## 7. Data Imputation / cleaning

Missing data can be informative, but it will also prevent many algorithms from training.  In order to enable our models to train while preserving the fact that some data is missing, we are going to:
1. Create a new column that indicates whether or not that column had missing data.
2. Impute missing data with the column's mean.

In pandas, missing data is represented either as NaN (Not a Number) or as NaT (Not a Time).  While we look at our missing data, let's look at string, numeric, and time objects separately.

First, let's have a quick refresher on the dtypes in our DataFrame and create lists of all of the columns for specific data types.

In [None]:
# replaces df.get_dtype_counts(), which was removed in pandas 1.0
df.dtypes.value_counts()

In [None]:
time_columns = df.select_dtypes(include=['datetime64','timedelta64']).columns
str_columns = df.select_dtypes(include=[object]).columns
numeric_columns = df.select_dtypes(exclude=[object,'datetime64','timedelta64']).columns

Now, let's use the pandas `isnull` and `sum` methods to see how many observations of each column are missing. Since there are a lot of columns in this DataFrame, let's restrict the output to columns which have missing data.

In [None]:
df[time_columns].isnull().sum()[df[time_columns].isnull().sum()>0]

In [None]:
df[str_columns].isnull().sum()[df[str_columns].isnull().sum()>0]

In [None]:
df[numeric_columns].isnull().sum()[df[numeric_columns].isnull().sum()>0]

With missing data, you should always check to see if there is a systematic difference between observations with and without missing data.

In [None]:
df[df['funded_date'].isnull()].describe()

In [None]:
df[~df['funded_date'].isnull()].describe()

Create columns that indicate whether or not data is missing.

In [None]:
for col in numeric_columns:
    df[col+'_na'] = pd.isnull(df[col])

Impute missing data with the mean.

In [None]:
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
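
The two imputation steps can be seen end to end on a single small column. The values in `toy` below are made up; the column name is borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
toy = pd.DataFrame({"partner_default_rate": [0.0, np.nan, 2.5, np.nan, 3.5]})
col = "partner_default_rate"

# Step 1: record which rows were missing before anything is filled in.
toy[col + "_na"] = toy[col].isnull()

# Step 2: fill the gaps with the mean of the observed values (NaN is skipped by mean()).
toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

The order matters: once the gaps are filled, the indicator can no longer be derived from the column itself.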

## THE END!

That is all for our feature engineering module!  Now that we have finished creating all of our features, we can go ahead and explore them with some EDA!  The last step of this module is to save our results to a new CSV file.

In [None]:
df.to_csv("../data/clean_data.csv", index=False)