# Module 1: Introduction to Exploratory Analysis 

What we'll be doing in this notebook:
-----

 1.  Checking variable type
 2.  Checking for missing values
 3.  Number of observations in the dataset
 4.  Descriptive statistics

### Import packages

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import dateutil.parser

# The setting below displays the result of every statement in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# The command below tells Jupyter to display up to 80 columns, which keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

# Show figures in notebook
%matplotlib inline

### Import dataset

We read in the three Kenya datasets as pandas dataframes.

In [2]:
kiva_locations = pd.read_csv("/home/anilla/DataScience/DataScience1/data/Kenya/kiva_ke_locations.csv")
kiva_loans = pd.read_csv("/home/anilla/DataScience/DataScience1/data/Kenya/kiva_loans_ke.csv")
loan_theme = pd.read_csv("/home/anilla/DataScience/DataScience1/data/Kenya/loan_theme_ke.csv")

In the cell below, we take a random sample of 10 rows to get a feel for the data.

In [3]:
kiva_loans.sample(n=10)

Unnamed: 0,id,funded_amount,loan_amount,activity,sector,use,country_code,country,region,currency,partner_id,posted_time,disbursed_time,funded_time,term_in_months,lender_count,tags,borrower_genders,repayment_interval,date
3357,681643,250,250,Food,Food,to prepare food to sell.,KE,Kenya,Mombasa,KES,138.0,2014-03-11 07:58:07+00:00,2014-02-28 08:00:00+00:00,2014-03-12 14:45:20+00:00,14,7,"#First Loan, #Parent, #Woman Owned Biz",female,monthly,2014-03-11
65961,1235075,500,500,Fruits & Vegetables,Food,to buy more stock of fruits,KE,Kenya,Busia,KES,138.0,2017-02-09 06:40:55+00:00,2017-01-30 08:00:00+00:00,2017-03-06 01:13:58+00:00,14,6,"#Woman Owned Biz, #Repeat Borrower, #Vegan, us...",female,monthly,2017-02-09
8093,729244,175,175,Fish Selling,Food,to purchase bundles of fish for resale,KE,Kenya,Tiribe,KES,164.0,2014-06-23 12:00:14+00:00,2014-06-23 07:00:00+00:00,2014-06-30 09:38:06+00:00,14,7,user_favorite,female,irregular,2014-06-23
49024,1056381,200,200,Food Production/Sales,Food,"to purchase bundles of wheat flour, maize flou...",KE,Kenya,Mwambalazi,KES,164.0,2016-04-19 07:31:48+00:00,2016-03-22 07:00:00+00:00,2016-04-24 09:40:33+00:00,14,6,"#Parent, #Woman Owned Biz",female,irregular,2016-04-19
71463,1285766,500,500,Water Distribution,Services,to purchase a water tank and materials for mak...,KE,Kenya,Likoni,KES,164.0,2017-04-25 11:53:14+00:00,2017-03-29 07:00:00+00:00,2017-05-02 07:06:15+00:00,12,20,"#Parent, #Woman Owned Biz, user_favorite",female,irregular,2017-04-25
4,1080150,125,125,Energy,Services,purchase solar lanterns for resale.,KE,Kenya,,KES,,2014-01-02 08:48:38+00:00,2014-01-30 01:42:21+00:00,2014-01-23 13:35:59+00:00,3,6,,male,irregular,2014-01-02
60518,1191997,675,675,Farming,Agriculture,to purchase a solar light and gain access to c...,KE,Kenya,Sirisia,KES,202.0,2016-11-24 06:37:28+00:00,2017-02-01 08:00:00+00:00,2016-12-28 14:09:52+00:00,11,8,"#Sustainable Ag, #Eco-friendly, #Schooling, #E...","male, female, male, male, female, male, male",bullet,2016-11-24
650,659059,600,600,Food,Food,"to add stock of tomatoes, onions, beans, and g...",KE,Kenya,Kericho,KES,138.0,2014-01-18 20:34:00+00:00,2014-01-17 08:00:00+00:00,2014-01-19 02:54:50+00:00,14,21,"#Woman Owned Biz, #Vegan, #Single Parent",female,monthly,2014-01-18
56448,1150170,500,500,Retail,Retail,"to buy more stock of flour, soda, sugar, bread...",KE,Kenya,Maua,KES,138.0,2016-09-14 09:37:57+00:00,2016-09-09 07:00:00+00:00,2016-10-11 21:11:07+00:00,14,19,"#Woman Owned Biz, #Parent, #First Loan, #Schoo...",female,monthly,2016-09-14
17988,805187,1275,1275,Farming,Agriculture,,KE,Kenya,,KES,202.0,2014-11-25 11:45:39+00:00,2015-02-01 08:00:00+00:00,2014-11-30 05:07:26+00:00,2,46,user_favorite,,bullet,2014-11-25


### 1) Type Checking
<a id='type_check'></a>

Type matters in Python programming because it determines which functions you can apply to a series. These are the data types you will see most often (see [this](https://en.wikibooks.org/wiki/Python_Programming/Data_Types) link for more detail):
* **int** - a number with no decimal places. Example: the loan_amount field
* **float** - a number with decimal places. Example: the partner_id field
* **str** - short for string, formally defined as a sequence of Unicode characters. More simply, the data is treated as text, not as a number. Example: the sector field
* **boolean** - can only be True or False. There is not currently an example in the data, but we will be creating a gender field shortly.
* **datetime** - values meant to hold time data. Example: the posted_time field

Let's check the type of our variables using the examples we saw in the cell above.

In [5]:
# Here are all of the columns
kiva_loans.columns.tolist()

['id',
 'funded_amount',
 'loan_amount',
 'activity',
 'sector',
 'use',
 'country_code',
 'country',
 'region',
 'currency',
 'partner_id',
 'posted_time',
 'disbursed_time',
 'funded_time',
 'term_in_months',
 'lender_count',
 'tags',
 'borrower_genders',
 'repayment_interval',
 'date']

In [56]:
# Find the dtype, aka datatype, for a column
kiva_loans['id'].dtype

dtype('int64')
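
To check every column at once, the `dtypes` attribute can be used instead of asking one column at a time. The sketch below runs on a small made-up dataframe (`toy_loans` and its values are hypothetical stand-ins, not rows from `kiva_loans`). It also illustrates two things that are easy to miss: a numeric column containing a missing value is stored as a float, and timestamps read from text only become a datetime type after explicit parsing.

```python
import pandas as pd

# Hypothetical three-row stand-in for kiva_loans (values are made up)
toy_loans = pd.DataFrame({
    "loan_amount": [250, 500, 175],
    "partner_id": [138.0, None, 164.0],   # the missing value forces a float column
    "sector": ["Food", "Food", "Services"],
    "posted_time": [
        "2014-03-11 07:58:07+00:00",
        "2017-02-09 06:40:55+00:00",
        "2014-06-23 12:00:14+00:00",
    ],
})

# dtype of every column at once
print(toy_loans.dtypes)

# Text timestamps need to be parsed before they behave like dates
toy_loans["posted_time"] = pd.to_datetime(toy_loans["posted_time"])
print(toy_loans["posted_time"].dtype)
```

Once a column is parsed this way, date-specific accessors such as `.dt.year` become available on it.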

In [57]:
# Try this - Pick a couple of columns and check their type on your own


### 2) Do I have missing values?

<a id='missing_check'></a>

Missing data in the training data set can reduce the power or fit of a model, or bias it, because the behavior of a variable and its relationships with other variables are not captured correctly. This can lead to wrong predictions or classifications.

If we have missing data, is it missing at random or not? If data is missing at random, the data distribution is still representative of the population, and you can probably treat the missing values as an inconvenience. However, if data is systematically missing, your analysis may be biased. Consider carefully how best to clean the data; this may involve dropping some of it.
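
One rough way to probe whether values are missing at random is to compare the missing rate across groups of another variable. The sketch below uses a made-up dataframe (`toy` and its values are hypothetical) and computes the share of missing `region` values within each `sector`. A large gap between groups hints at systematic missingness, although it does not prove it.

```python
import pandas as pd

# Hypothetical data: region is missing only for some Services loans
toy = pd.DataFrame({
    "sector": ["Food", "Food", "Food", "Services", "Services", "Services"],
    "region": ["Mombasa", "Busia", "Kericho", None, None, "Likoni"],
})

# Share of missing `region` values within each sector
missing_rate = toy["region"].isnull().groupby(toy["sector"]).mean()
print(missing_rate)
```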

We want to see how many values are missing in each column. One way to do this is to count the number of null observations.

For this, we wrote a short function to apply to the dataframe.

The cell below prints the count for every column; append .head() to the result to show only the first few.

In [9]:
# Count the missing (null) values in a column
def num_missing(x):
    return x.isnull().sum()

# To list only the columns that have missing values:
# print(kiva_loans.apply(num_missing, axis=0).where(lambda x: x != 0).dropna())

# axis=0 means the function is applied to each column
print(kiva_loans.apply(num_missing, axis=0))

id                        0
funded_amount             0
loan_amount               0
activity                  0
sector                    0
use                     713
country_code              0
country                   0
region                 8752
currency                  0
partner_id             8372
posted_time               0
disbursed_time          381
funded_time            5446
term_in_months            0
lender_count              0
tags                  23186
borrower_genders        712
repayment_interval        0
date                      0
dtype: int64
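
When a dataframe has many columns, it can help to keep only the columns that actually contain missing values and sort them by count. The following is a minimal sketch on made-up data (`toy` and its values are hypothetical); it uses the built-in `isnull().sum()` rather than a custom function.

```python
import pandas as pd

# Hypothetical data with missing values in two of three columns
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "use": ["to buy stock", None, "to buy seeds", "to buy fish"],
    "tags": [None, None, "#Parent", None],
})

missing_counts = toy.isnull().sum()

# Keep only columns with at least one missing value, largest count first
has_missing = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(has_missing)
```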


In [59]:
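# Checking the share of missing values
# Note: `df` is not defined in this notebook and the output below comes from a
# different loans table; substitute kiva_loans to run this cell here
total = df.isnull().sum().sort_values(ascending=False)

# Fraction (0 to 1) of rows that are missing in each column
percent = df.isnull().mean().sort_values(ascending=False)

missing_kiva_loans_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_kiva_loans_data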

Unnamed: 0,Total,Percent
funded_date,937,0.155674
use,342,0.05682
description,342,0.05682
location_country_code,17,0.002824
sector,0,0.0
repayment_term,0,0.0
funded_amount,0,0.0
status,0,0.0
lender_count,0,0.0
loan_amount,0,0.0


### 3) Sanity Checks
<a id='obs_check'></a>

**Does the dataset match what you expected to find?**
- Is the range of values what you would expect? For example, are all loan_amounts above 0?
- Do you have the number of rows you would expect?
- Does your data cover the date range you would expect? For example, is there a strange year in the data, like 1880?
- Are there unexpected spikes when you plot the data over time?
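
The checks listed above can be sketched in a few lines. The example below runs on a made-up dataframe (`toy` and its values are hypothetical, with one zero loan amount and one 1880 date planted on purpose); it is one possible way to run these checks, not the only one.

```python
import pandas as pd

# Hypothetical data with two deliberate problems: a 0 amount and an 1880 date
toy = pd.DataFrame({
    "loan_amount": [250, 500, 175, 0],
    "posted_time": pd.to_datetime(
        ["2014-03-11", "2017-02-09", "1880-01-01", "2016-04-19"]
    ),
})

# Range check: rows whose loan_amount is not above 0
non_positive = toy[toy["loan_amount"] <= 0]

# Date-range check: earliest and latest posting dates
earliest, latest = toy["posted_time"].min(), toy["posted_time"].max()

# Spike check: number of loans posted per year
loans_per_year = toy["posted_time"].dt.year.value_counts().sort_index()

print(len(toy), "rows;", len(non_positive), "with non-positive loan_amount")
print("dates run from", earliest.date(), "to", latest.date())
print(loans_per_year)
```

Plotting `loans_per_year` (for example with `loans_per_year.plot()`) makes unexpected spikes easier to spot than reading the counts.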


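In the command below we find the number of rows and columns using the shape attribute, which returns a (rows, columns) tuple. You can also use len(df.index) to find the number of rows. Here df stands for the dataframe being inspected; the output shown comes from a loans table with 6019 rows, not from kiva_loans.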

In [60]:
print(f'There are {df.shape[0]} observations and {df.shape[1]} features')

There are 6019 observations and 11 features


Remember that each row is an observation and each column is a potential feature.

Remember also that machine learning generally needs a large number of observations.

### 4) Descriptive statistics of the dataset

<a id='desc_stats'></a>

In [61]:
#Try out - Write code that provides key summary statistics for each numeric column.

To get similar summary statistics for categorical (string) columns, we need to do a little data wrangling. The first line of code selects all columns whose data type is object, which is how pandas stores the string columns here. The final line of code provides summary statistics for these text fields.

In [10]:
categorical = kiva_loans.dtypes[kiva_loans.dtypes == "object"].index
kiva_loans[categorical].describe()

Unnamed: 0,activity,sector,use,country_code,country,region,currency,posted_time,disbursed_time,funded_time,tags,borrower_genders,repayment_interval,date
count,75825,75825,75112,75825,75825,67073,75825,75825,75444,70379,52639,75113,75825,75825
unique,143,15,38572,1,1,393,2,75737,2375,63848,12928,5349,4,1117
top,Farming,Agriculture,to buy a solar lantern.,KE,Kenya,Kisii,KES,2017-04-27 11:53:12+00:00,2017-02-01 08:00:00+00:00,2015-03-18 06:46:14+00:00,"#Parent, #Woman Owned Biz",female,monthly,2016-11-18
freq,20555,33644,880,75825,75825,3546,75311,2,2496,14,2977,49719,46230,456


In [63]:
#Try out - What's the other way one can obtain the information above?


In the table above, there are 4 really useful rows:

1) **count** - the number of populated (non-empty) values in the column.

2) **unique** - the number of distinct values in the column.

3) **top** - the most common value.

4) **freq** - how many times the most common value appears in our dataset.
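
To make these four numbers concrete, the sketch below recomputes them by hand for a single made-up column (the `sector` series and its values are hypothetical). It assumes that missing values are excluded from every figure, which matches how `describe()` treats them.

```python
import pandas as pd

# Hypothetical column with one missing value
sector = pd.Series(["Food", "Food", "Services", None, "Retail", "Food"])

# Frequency of each value; missing values are excluded by default
counts = sector.value_counts()

summary = {
    "count": int(sector.notnull().sum()),  # populated values
    "unique": sector.nunique(),            # distinct values
    "top": counts.idxmax(),                # most common value
    "freq": int(counts.max()),             # how often it appears
}
print(summary)
```

Comparing `summary` with `sector.describe()` is a quick way to confirm that the two agree.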

What is next
-----

In the next section, we move on to exploratory data analysis (EDA).
