# Module 1: Introduction to Exploratory Analysis 

What we'll be doing in this notebook:
-----

 1.  Checking variable type
 2.  Checking for missing values
 3.  Number of observations in the dataset
 4.  Descriptive statistics

### Import packages

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
import dateutil.parser

# Display the output of every statement in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Tell Jupyter to display up to 80 columns so that everything stays visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

### Import dataset

We read in our merged dataset below. Don't forget to update the name with your own! 

In [4]:
data_path = '../data/'
df = pd.read_csv(data_path+'raw_data.csv.zip', low_memory=False)

In the cell below, we take a random sample of 2 rows to get a feel for the data.

In [5]:
df.sample(2)

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrowers,currency_exchange_loss_amount,description.languages,description.texts.en,description.texts.es,description.texts.fr,description.texts.ru,funded_amount,funded_date,id,image.id,image.template_id,journal_totals.bulkEntries,journal_totals.entries,lender_count,loan_amount,location.country,location.country_code,location.geo.level,location.geo.pairs,location.geo.type,location.town,name,partner_id,payments,planned_expiration_date,posted_date,sector,status,tags,terms.disbursal_amount,terms.disbursal_currency,terms.disbursal_date,terms.loan_amount,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.currency_exchange_coverage_rate,...,themes,translator.byline,translator.image,use,video.id,video.thumbnailImageId,video.title,video.youtubeId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_charges_fees_and_interest,partner_countries,partner_currency_exchange_loss_rate,partner_default_rate,partner_default_rate_note,partner_delinquency_rate,partner_delinquency_rate_note,partner_image.id,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_name,partner_portfolio_yield,partner_portfolio_yield_note,partner_profitability,partner_rating,partner_social_performance_strengths,partner_start_date,partner_status,partner_total_amount_raised,partner_url,posted_datetime,funded_datetime,planned_expiration_datetime,dispursal_datetime,number_of_loans,dispersal_date,posted_year,posted_month,time_to_fund
74052,General Store,,True,"[{'first_name': 'Rashid', 'last_name': '', 'ge...",,['en'],Rashid has been operating a retail shop busine...,,,,600,2013-04-25,548506,1331882,1,0,0,24,600,Kenya,KE,town,1 38,point,Tiribe,Rashid,164.0,[],2013-05-16,2013-04-16,Retail,funded,[],50000.0,KES,2013-03-22T07:00:00Z,600,[],shared,0.1,...,,,,"to boost his kiosk, by purchasing bundles of w...",,,,,1,24.3,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.164711,0.085473,,0.0,,2081417.0,1.0,0.0,21415.0,Yehu Microfinance Trust,33.1,,2.23,3.5,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2010-03-30T20:50:03Z,active,7646925.0,http://www.yehu.org,2013-04-16 20:20:03,2013-04-25 16:56:30,2013-05-16 20:20:03,2013-03-22 07:00:00,1,2013-03-22,2013,4,8.0
33902,Farming,,False,"[{'first_name': 'Anonymous', 'last_name': '', ...",,['en'],,,,,200,2015-10-18,947817,726677,1,0,0,8,200,Kenya,KE,country,1 38,point,,Anonymous,156.0,[],2015-11-16,2015-10-17,Agriculture,funded,[],20000.0,KES,2015-09-01T07:00:00Z,200,[],shared,0.1,...,['Rural Exclusion'],,,,,,,,1,49.6,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.431935,2.575299,,2.536684,,1834079.0,1.0,24.200354,18150.0,Juhudi Kilimo,33.0,,-7.1,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2010-01-15T20:20:17Z,active,7705925.0,http://www.juhudikilimo.com/,2015-10-17 16:20:04,2015-10-18 17:07:51,2015-11-16 16:20:04,2015-09-01 07:00:00,1,2015-09-01,2015,10,1.0


### 1) Type Checking
<a id='type_check'></a>

Type is very important in Python programming, because it determines which functions you can apply to a series. There are a few different types of data you will see regularly (see [this](https://en.wikibooks.org/wiki/Python_Programming/Data_Types) link for more detail):
* **int** - a number with no decimal places. Example: loan_amount field
* **float** - a number with decimal places. Example: partner_id field
* **str** - str is short for string. This type is formally defined as a sequence of unicode characters. More simply, string means that the data is treated as a word, not a number. Example: sector
* **boolean** - can only be True or False. Example: the bonus_credit_eligibility field holds True/False values in the sample above. We will also be creating a gender field shortly.
* **datetime** - values meant to hold time data. Example: posted_date
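To make the list above concrete, here is a minimal sketch that builds a tiny DataFrame with one column of each type and asks pandas which type it assigned. The values are made up for illustration, and `is_funded` is a hypothetical column that does not exist in the Kiva data.

```python
import pandas as pd

# Toy data (hypothetical values), one column per type discussed above
toy = pd.DataFrame({
    "loan_amount": [900, 350, 425],                     # int
    "partner_id": [156.0, 164.0, 156.0],                # float
    "sector": ["Agriculture", "Food", "Agriculture"],   # str
    "is_funded": [True, False, True],                   # boolean (made-up column)
    "posted_date": pd.to_datetime(
        ["2010-05-24", "2014-06-28", "2013-06-16"]      # datetime
    ),
})

# .dtypes lists the type pandas assigned to every column
print(toy.dtypes)
```

Note that pandas uses its own type names (for example int64 instead of int), and in many pandas versions string columns are reported as object.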

Let's check the type of our variables using the examples we saw in the cell above.

In [6]:
## Select variables by name
type_example = df[['loan_amount','partner_id', 'sector','posted_date']]
## Pull (3) random rows
type_example.sample(3)

Unnamed: 0,loan_amount,partner_id,sector,posted_date
111475,900,156.0,Agriculture,2010-05-24
56824,350,164.0,Food,2014-06-28
71264,425,156.0,Agriculture,2013-06-16


In [7]:
## Check the first cell for a column
df['posted_datetime'].head(1)
## Check the datatype for a single column
df['posted_datetime'].dtype

0    2017-05-09 00:40:03
Name: posted_datetime, dtype: object

dtype('O')

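The dtype 'O' stands for object, which is how pandas typically stores strings. So although posted_datetime holds dates, it has been read in as text rather than as a true datetime.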
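A column stored as object is handled as text, so date operations are not available until it is converted. The sketch below shows one way to do that with `pd.to_datetime`, using a small stand-in Series rather than the real column. Whether the notebook converts posted_datetime this way elsewhere is not shown here, so treat this as an illustration only.

```python
import pandas as pd

# Toy stand-in for the posted_datetime column, stored as text
posted = pd.Series(
    ["2017-05-09 00:40:03", "2013-04-16 20:20:03"],
    name="posted_datetime",
)

# Parse the strings into real timestamps
posted_dt = pd.to_datetime(posted)
print(posted_dt.dtype)

# The .dt accessor (year, month, ...) only works on datetime columns
print(posted_dt.dt.year.tolist())
```

On the real data the same idea would be `df['posted_datetime'] = pd.to_datetime(df['posted_datetime'])`.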

### 2) Do I have missing values?

<a id='missing_check'></a>

If we have missing data, we need to ask whether it is missing at random or not at random. If data is missing at random, the remaining data is still representative of the population, and you can probably treat the missing values as an inconvenience. However, if data is missing systematically, any modeling you do may be biased. You should carefully consider the best way to clean the data, which may involve dropping some of it. See [here](https://en.wikipedia.org/wiki/Missing_data) for additional information.
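One rough way to probe whether values are missing systematically is to compare the share of missing values across groups, such as years. The sketch below uses made-up data in which partner_id is missing only in the earlier year. A missing rate that differs sharply between groups is a hint (not proof) that the data is not missing at random.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): partner_id is missing only in the earlier year
toy = pd.DataFrame({
    "posted_year": [2014, 2014, 2014, 2016, 2016, 2016],
    "partner_id": [np.nan, np.nan, 156.0, 164.0, 156.0, 386.0],
})

# Share of missing partner_id values within each year
# (the mean of a True/False column is the fraction of True values)
missing_rate = toy["partner_id"].isnull().groupby(toy["posted_year"]).mean()
print(missing_rate)
```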

We want to see how many values are missing in certain variable columns. One way to do this is to count the number of null observations. 

For this, we wrote a short function to apply to the dataframe. 

We print only the first 20 columns that have missing values, but you can remove the .head(20) to print them all.

In [8]:
# Create a new function that counts the missing values in a Series:
def num_missing(x):
    return x.isnull().sum()

# Apply it to each column (axis=0 applies the function column by column):
print("Missing values per column:")
# Keep only the columns that have at least one missing value
print(df.apply(num_missing, axis=0).where(lambda x: x != 0).dropna().head(20))

Missing values per column:
basket_amount                                           117257.0
currency_exchange_loss_amount                           102524.0
description.texts.en                                      4326.0
description.texts.es                                    118196.0
description.texts.fr                                    118196.0
description.texts.ru                                    118195.0
funded_date                                               5605.0
location.town                                             7894.0
planned_expiration_date                                  24913.0
terms.loss_liability.currency_exchange_coverage_rate      3849.0
terms.repayment_interval                                117257.0
themes                                                   89300.0
translator.byline                                        35887.0
translator.image                                         55223.0
use                                                       4325.0
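As an aside, pandas can produce the same counts without a custom function: `isnull()` followed by `sum()` works on a whole DataFrame. The sketch below uses a small made-up frame. The column names echo the Kiva data but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "loan_amount": [600, 200, 150],
    "location.town": ["Tiribe", np.nan, "Bomet"],
    "themes": [np.nan, "['Rural Exclusion']", np.nan],
})

# Missing values per column, keeping only columns with at least one
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)

# Missing values per row (axis=1 sums across the columns of each row)
missing_by_row = toy.isnull().sum(axis=1)
print(missing_by_row)
```

Filtering with `missing[missing > 0]` keeps the counts as integers, whereas the `where(...).dropna()` chain converts them to floats, which is why the output above shows values like 117257.0.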

In [9]:
#Applying per row:
print("\nMissing values per row:")
missing_by_row = df.apply(num_missing, axis=1) 
#axis=1 defines that function is to be applied on each row
print(missing_by_row.head()) 


Missing values per row:
0    15
1    15
2    14
3    15
4    16
dtype: int64


Remember that we used a left join to merge in the partner data, which means we could have null (missing) values in our partner_id field. The cell below shows how to pull out the rows where a field is null, using basket_amount as the example, and finds 117,257 loans with a missing value. Let's investigate and try to understand whether the data is missing at random or systematically missing.

In [10]:
null_basket_amount = df.loc[df['basket_amount'].isnull()]
len(null_basket_amount)
null_basket_amount.head(2)

117257

Unnamed: 0,activity,basket_amount,bonus_credit_eligibility,borrowers,currency_exchange_loss_amount,description.languages,description.texts.en,description.texts.es,description.texts.fr,description.texts.ru,funded_amount,funded_date,id,image.id,image.template_id,journal_totals.bulkEntries,journal_totals.entries,lender_count,loan_amount,location.country,location.country_code,location.geo.level,location.geo.pairs,location.geo.type,location.town,name,partner_id,payments,planned_expiration_date,posted_date,sector,status,tags,terms.disbursal_amount,terms.disbursal_currency,terms.disbursal_date,terms.loan_amount,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.currency_exchange_coverage_rate,...,themes,translator.byline,translator.image,use,video.id,video.thumbnailImageId,video.title,video.youtubeId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_charges_fees_and_interest,partner_countries,partner_currency_exchange_loss_rate,partner_default_rate,partner_default_rate_note,partner_delinquency_rate,partner_delinquency_rate_note,partner_image.id,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_name,partner_portfolio_yield,partner_portfolio_yield_note,partner_profitability,partner_rating,partner_social_performance_strengths,partner_start_date,partner_status,partner_total_amount_raised,partner_url,posted_datetime,funded_datetime,planned_expiration_datetime,dispursal_datetime,number_of_loans,dispersal_date,posted_year,posted_month,time_to_fund
16,Primary/secondary school costs,,False,"[{'first_name': 'Sally ', 'last_name': '', 'ge...",,['en'],"Sally is an ambitious woman from Bomet, a maiz...",,,,150,2017-05-09,1291449,2515878,1,0,0,6,150,Kenya,KE,town,1 38,point,Bomet,Sally,156.0,[],2017-06-07,2017-05-08,Education,funded,"[{'name': '#Parent'}, {'name': '#Schooling'}]",15000.0,KES,2017-05-28T07:00:00Z,150,[],shared,0.1,...,,,,to pay school fees for her children.,,,,,1,49.6,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.431935,2.575299,,2.536684,,1834079.0,1.0,24.200354,18150.0,Juhudi Kilimo,33.0,,-7.1,2.0,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2010-01-15T20:20:17Z,active,7705925.0,http://www.juhudikilimo.com/,2017-05-08 22:30:06,2017-05-09 00:37:46,2017-06-07 22:30:06,2017-05-28 07:00:00,1,2017-05-28,2017,5,0.0
21,Celebrations,,False,"[{'first_name': 'Naomi', 'last_name': '', 'gen...",,['en'],Naomi is a single mother of 2 children and she...,,,,100,2017-05-08,1291404,2515811,1,0,0,4,100,Kenya,KE,town,1 38,point,nyeri,Naomi,386.0,[],2017-06-07,2017-05-08,Personal Use,funded,"[{'name': '#Animals'}, {'name': '#Parent'}, {'...",10000.0,KES,2017-04-03T07:00:00Z,100,[],shared,0.1,...,,Cheryl Strecker,1412668.0,to buy a goat for a celebration during the Eas...,,,,,1,0.0,True,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",0.120642,0.0,,7.017031,,1592272.0,1.0,21.165398,1948.0,Kenya ECLOF,40.3,,2.54,2.0,"[{'id': 3, 'name': 'Client Voice', 'descriptio...",2014-05-29T13:30:02Z,active,863375.0,http://www.eclof-kenya.org/,2017-05-08 22:10:05,2017-05-08 23:01:45,2017-06-07 22:10:05,2017-04-03 07:00:00,1,2017-04-03,2017,5,0.0


It seems that the number of loans peaked in 2014 and has decreased since. All of the loans issued in the second half of 2015 and in 2016 and 2017 have a partner. This tells us these values are not missing at random: they are more likely to occur in data before June 2015, and are most likely to occur in the first half of 2014. We should probably just drop the rows where partner_id is missing. We do so below using the notnull() method, which checks whether the field is populated.

In [11]:
df = df.loc[df['partner_id'].notnull()]

As a sanity check that our filtering was done correctly, we check the new number of rows. It makes sense!

In [12]:
len(df.index)

118199
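A slightly stricter version of this sanity check compares the row count before and after the filter with the number of nulls that should have been removed. The sketch below demonstrates the idea on a small made-up frame. The names `rows_before`, `n_missing` and `filtered` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the four loans have no partner_id
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "partner_id": [156.0, np.nan, 164.0, np.nan],
})

rows_before = len(toy)
n_missing = toy["partner_id"].isnull().sum()

# Same filter as in the notebook, applied to the toy frame
filtered = toy.loc[toy["partner_id"].notnull()]

# The filter should remove exactly the rows that were missing a partner_id
assert len(filtered) == rows_before - n_missing
assert filtered["partner_id"].notnull().all()
print(rows_before, n_missing, len(filtered))
```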

### 3) Sanity Checks
<a id='obs_check'></a>

**Does the dataset match what you expected to find?**
- Is the range of values what you would expect? For example, are all loan_amounts above 0?
- Do you have the number of rows you would expect?
- Does your data cover the date range you would expect? For example, is there a strange year in the data, like 1880?
- Are there unexpected spikes when you plot the data over time?
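Several of the questions above can be turned into quick counts of suspicious rows. The sketch below runs three such checks on made-up data. The 2006 to 2017 year range is an assumption based on the minimum and maximum posted_year reported in the summary statistics in this notebook, and the `checks` dictionary is purely illustrative.

```python
import pandas as pd

# Toy data (hypothetical): the last row is deliberately suspicious
toy = pd.DataFrame({
    "loan_amount": [600, 200, 0],
    "posted_year": [2013, 2015, 1880],
    "time_to_fund": [8.0, 1.0, -442.0],
})

# Count the rows that violate each expectation
checks = {
    "non_positive_loan_amount": int((toy["loan_amount"] <= 0).sum()),
    # Assumed valid range: 2006 to 2017 inclusive
    "year_outside_2006_2017": int((~toy["posted_year"].between(2006, 2017)).sum()),
    "negative_time_to_fund": int((toy["time_to_fund"] < 0).sum()),
}
print(checks)
```

A non-zero count does not automatically mean the data is wrong, but it tells you where to look. For instance, the summary statistics in this notebook report a minimum time_to_fund of -442, which is the kind of value a check like this would surface.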


In the cell below we find the number of loans (rows) and the number of columns using the shape attribute. You can also use len(df.index) to find the number of rows.

In [13]:
print('There are %d observations and %d features' % (df.shape[0],df.shape[1]))

There are 118199 observations and 84 features


This tells us there are 118,199 observations and 84 features. We learnt in the theory lesson that each row is an observation and each column is a potential feature. Roughly 118,000 matches what we expect based upon our conversations with Kiva. This is a very healthy sample size for applying machine learning algorithms.

### 4) Descriptive statistics of the dataset

<a id='desc_stats'></a>

In Module 1, we learned about mean, frequency and percentiles as a powerful way to understand the distribution of the data. If you are unfamiliar with these terms or need a refresher, [this](https://www.mathsisfun.com/data/frequency-grouped-mean-median-mode.html) overview should be helpful. The describe() method below provides key summary statistics for each numeric column.

In [14]:
df.describe()

Unnamed: 0,basket_amount,currency_exchange_loss_amount,funded_amount,id,image.id,image.template_id,journal_totals.bulkEntries,journal_totals.entries,lender_count,loan_amount,partner_id,terms.disbursal_amount,terms.loan_amount,terms.loss_liability.currency_exchange_coverage_rate,terms.repayment_term,translator.image,video.id,video.thumbnailImageId,borrower_count,partner_average_loan_size_percent_per_capita_income,partner_currency_exchange_loss_rate,partner_default_rate,partner_default_rate_note,partner_delinquency_rate,partner_delinquency_rate_note,partner_image.id,partner_image.template_id,partner_loans_at_risk_rate,partner_loans_posted,partner_portfolio_yield,partner_portfolio_yield_note,partner_profitability,partner_total_amount_raised,number_of_loans,posted_year,posted_month,time_to_fund
count,942.0,15675.0,118199.0,118199.0,118199.0,118199.0,118199.0,118199.0,118199.0,118199.0,118199.0,118199.0,118199.0,114350.0,118199.0,62976.0,66.0,66.0,118199.0,118199.0,118199.0,118199.0,0.0,118199.0,0.0,118199.0,118199.0,118199.0,118199.0,111894.0,0.0,106811.0,118199.0,118199.0,118199.0,118199.0,112594.0
mean,0.185775,6.792162,460.031811,709588.4,1540860.0,1.0,0.0,0.0,14.222303,481.271838,164.943231,41104.558639,481.271838,0.122171,13.404386,1165682.0,1038.878788,463981.4,1.929306,30.053112,0.2125,3.823391,,4.346358,,1496747.0,1.0,11.51772,18190.648254,31.057951,,2.363798,7315619.0,1.0,2013.558516,6.424894,7.57146
std,2.148199,9.857512,394.928783,341379.2,604507.6,0.0,0.0,0.0,12.654955,417.259618,65.911965,35795.447209,417.259618,0.04154,7.785041,703680.4,851.989988,260069.2,3.028876,17.083295,0.262511,10.482077,,5.366385,,602296.2,0.0,10.82505,9413.644577,10.44039,,11.297412,3284701.0,0.0,2.344082,3.559694,12.300088
min,0.0,0.01,0.0,251.0,409.0,1.0,0.0,0.0,0.0,25.0,6.0,25.0,25.0,0.1,2.0,23924.0,150.0,297574.0,1.0,0.0,0.0,0.0,,0.0,,356.0,1.0,0.0,7.0,0.0,,-117.79,3950.0,1.0,2006.0,1.0,-442.0
25%,0.0,1.22,250.0,419862.0,1012860.0,1.0,0.0,0.0,7.0,250.0,133.0,20000.0,250.0,0.1,11.0,518671.0,470.25,323503.8,1.0,24.3,0.089354,0.085473,,0.0,,1495190.0,1.0,0.0,9546.0,29.0,,-1.7,6764500.0,1.0,2012.0,3.0,0.0
50%,0.0,3.49,350.0,697638.0,1575848.0,1.0,0.0,0.0,11.0,375.0,156.0,30000.0,375.0,0.1,14.0,1186147.0,552.5,328222.5,1.0,34.9,0.164711,1.48389,,2.536684,,1592689.0,1.0,16.058249,18150.0,33.1,,0.0,7646925.0,1.0,2014.0,6.0,2.0
75%,0.0,8.26,600.0,1003973.0,2053477.0,1.0,0.0,0.0,18.0,600.0,164.0,50000.0,600.0,0.1,14.0,1668411.0,2038.75,573510.0,1.0,40.1,0.364948,3.652283,,8.017062,,2081410.0,1.0,18.498507,21415.0,36.0,,2.23,8133425.0,1.0,2016.0,10.0,13.0
max,25.0,181.27,6000.0,1292273.0,2516905.0,1.0,0.0,0.0,218.0,6000.0,473.0,624390.0,6000.0,0.2,122.0,2499150.0,2816.0,1256913.0,46.0,54.8,7.513861,94.939083,,100.0,,2520600.0,1.0,100.0,30794.0,41.0,,30.3,11366980.0,1.0,2017.0,12.0,62.0
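The same statistics can be computed for a single column when you do not need the whole table. The sketch below uses a short made-up Series standing in for loan_amount, and also shows the `percentiles` argument of `describe()`, which lets you request percentiles other than the default 25%, 50% and 75%.

```python
import pandas as pd

# Toy stand-in for the loan_amount column (hypothetical values)
loan_amount = pd.Series([250, 375, 600, 900, 150], name="loan_amount")

# Individual statistics
mean_value = loan_amount.mean()
median_value = loan_amount.median()
quartiles = loan_amount.quantile([0.25, 0.5, 0.75])
print(mean_value, median_value)
print(quartiles)

# describe() with custom percentiles (10th and 90th)
summary = loan_amount.describe(percentiles=[0.1, 0.9])
print(summary)
```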


In order to get the same summary statistics for categorical (string) columns, we need to do a little data wrangling. The first line of code selects all columns whose data type is object. As we saw earlier, this means they are treated as strings. The last line of code provides summary statistics for these text fields.

In [15]:
categorical = df.dtypes[df.dtypes == "object"].index
df[categorical].describe()

Unnamed: 0,activity,borrowers,description.languages,description.texts.en,description.texts.es,description.texts.fr,description.texts.ru,funded_date,location.country,location.country_code,location.geo.level,location.geo.pairs,location.geo.type,location.town,name,payments,planned_expiration_date,posted_date,sector,status,tags,terms.disbursal_currency,terms.disbursal_date,terms.local_payments,terms.loss_liability.currency_exchange,terms.loss_liability.nonpayment,terms.repayment_interval,terms.scheduled_payments,themes,translator.byline,use,video.title,video.youtubeId,partner_countries,partner_name,partner_rating,partner_social_performance_strengths,partner_start_date,partner_status,partner_url,posted_datetime,funded_datetime,planned_expiration_datetime,dispursal_datetime,dispersal_date
count,118199,118199,118199,113873,3,3,4,112594,118199,118199,118199,118199,118199,110305,118199,118199,93286,118199,118199,118199,118199,118199,118199,118199,118199,118199,942,118199,28899,82312,113874,66,66,118199,118199,118199.0,113668,118199,118199,114140,118199,112594,93286,118199,118199
unique,148,29915,4,113744,3,3,4,3452,1,1,2,47,1,1246,15993,1,1860,3173,15,3,5386,2,5409,734,3,2,2,275,19,425,58791,62,64,8,40,9.0,16,40,3,29,80868,106740,64078,5409,3077
top,Farming,"[{'first_name': 'Anonymous', 'last_name': '', ...",['en'],"Hello Kiva Community! <br /><br />Meet Jane, w...",The person appearing in the photo is Agnes. Sh...,Irine has a small farm in Sigowet village wher...,David is a married man. He has 7 children. He ...,2016-03-08,Kenya,KE,town,1 38,point,Likoni,Anonymous,[],2014-03-26,2014-02-24,Agriculture,funded,[],KES,2017-02-01T08:00:00Z,[],shared,lender,Monthly,[],['Rural Exclusion'],Tim Gibson,to purchase a solar light and gain access to c...,Kiva Borrower SANITA from Kenya,6dWFtYShzBk,"[{'iso_code': 'KE', 'region': 'Africa', 'name'...",VisionFund Kenya,3.5,"[{'id': 1, 'name': 'Anti-Poverty Focus', 'desc...",2009-05-29T11:35:11Z,active,http://www.visionfundkenya.co.ke/,2011-01-01 08:00:08,2005-03-31 06:27:55,2014-01-01 01:44:31,2017-02-01 08:00:00,2017-02-01
freq,26227,2420,118189,5,1,1,1,413,118199,118199,110305,77214,118199,5035,3679,118199,620,620,45612,112594,65296,114991,2498,117257,113700,115618,622,117257,15653,6914,1405,2,2,106537,28570,39681.0,46681,28570,104158,28570,52,24,21,2498,2498


In the table above, there are 4 really useful rows:

1) **count** - the total number of populated (non-empty) values in the column.

2) **unique** - how many distinct values the column contains. For example, 4 in description.languages tells us there are 4 different values for the description languages.

3) **top** - the most common value. For example, the top activity in this dataset is Farming, which tells us that more loans are for Farming than for any other activity.

4) **freq** - how often the most common value appears in our dataset. For example, ['en'] (English) is the language almost all descriptions (description.languages) are written in (118,189 out of 118,199).
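To see where these four numbers come from, the sketch below rebuilds them by hand for a single made-up column and compares them with what `describe()` reports. The `summary` dictionary is illustrative only.

```python
import pandas as pd

# Toy stand-in for the sector column (hypothetical values, one missing)
sector = pd.Series(
    ["Agriculture", "Food", "Agriculture", "Retail", None, "Agriculture"],
    name="sector",
)

# value_counts() sorts values from most to least common and skips missing ones
counts = sector.value_counts()

summary = {
    "count": int(sector.count()),      # populated values
    "unique": int(sector.nunique()),   # distinct values
    "top": counts.index[0],            # most common value
    "freq": int(counts.iloc[0]),       # how often the top value appears
}
print(summary)

# describe() reports the same four numbers
print(sector.describe())
```

Calling `value_counts()` on a column is also a quick way to see every category and its frequency, not just the top one.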

### Moving on

Next we move on to exploratory data analysis, where we will examine common plotting methods! 