Our goal for exploratory data analysis (EDA) is to explore the data
and find insights. The purpose of EDA is to:
1. maximise insight into a dataset
2. uncover any underlying structure
3. identify the important variables or features
4. detect outliers
5. test your underlying assumptions

#### What we'll be doing in this notebook
1. check variable types
2. check for missing values
3. look at the number and sensibility of observations in the dataset
4. describe the data
5. investigate some plotting techniques

#### Import packages

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import dateutil.parser

# set Jupyter to display up to 50 columns, so we can see everything
pd.set_option('display.max_columns', 50)
pd.set_option('expand_frame_repr', True)

#show figures in notebook
%matplotlib inline 

Read in your dataset 

In [9]:
df = pd.read_csv('loans.csv')
df

Unnamed: 0,id_number,loan_amount,lender_count,status,funded_date,funded_amount,repayment_term,location_country_code,sector,description,use
0,736066,4825,60,funded,2014-08-03T17:51:50Z,4825,8,BJ,Retail,,
1,743090,975,34,funded,2014-08-18T09:10:54Z,975,12,BJ,Food,,
2,743120,950,25,funded,2014-08-09T17:46:35Z,950,14,BJ,Services,,
3,743121,825,28,funded,2014-08-24T17:00:38Z,825,14,BJ,Retail,,
4,743124,725,21,funded,2014-08-25T03:24:54Z,725,13,BJ,Retail,,
...,...,...,...,...,...,...,...,...,...,...,...
6014,1568871,200,8,funded,2018-07-19T15:14:35Z,200,14,ZW,Food,Sethukelo is a 19-year-old entrepreneur who li...,to purchase goods for starting a grocery store.
6015,1568880,200,8,funded,2018-07-19T19:22:43Z,200,14,ZW,Food,Hlanjiwe is a 20-year-old entrepreneur who liv...,to buy grocery goods for her business.
6016,1568883,200,6,funded,2018-07-19T20:18:53Z,200,14,ZW,Clothing,Lebuhani is a 21-year-old entrepreneur who liv...,to buy clothes for her business.
6017,1568887,200,8,funded,2018-07-18T23:38:44Z,200,14,ZW,Food,Jacqueline is a 23-year-old entrepreneur who l...,her to buy goods to sell in her store.


In [10]:
# let's get a random sample of our data
# let's get 3 rows

df.sample(n=3)

Unnamed: 0,id_number,loan_amount,lender_count,status,funded_date,funded_amount,repayment_term,location_country_code,sector,description,use
2045,1454103,775,26,funded,2018-01-20T05:41:32Z,775,11,LS,Personal Use,'Malipuo is a 42-year-old woman married to a 4...,to pay for a stove.
5575,1456104,1025,35,funded,2018-01-24T10:55:51Z,1025,8,ZM,Services,Muller is a young emerging entrepreneur who ha...,to boost her working capital and ensure that s...
1875,1572272,300,0,fundraising,,0,14,KE,Agriculture,Rael is characterized by her sense of responsi...,"to purchase farming inputs, such as high-nutri..."


#### 1) Type checking
Why is this important?
- The type of a feature determines what you can do with that column, that is, which functions you can apply to it.
- The common data types you will see are:
1. int
2. float 
3. str
4. boolean
5. datetime
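
As a small illustration of why the type matters, the sketch below builds a toy frame (the values are made up for illustration and are not read from loans.csv) and shows that arithmetic, string methods, and date accessors each depend on the column's dtype. Converting date strings with `pd.to_datetime` is one possible approach, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy data for illustration only; not taken from loans.csv
toy = pd.DataFrame({
    "loan_amount": [500, 1200, 75],
    "sector": ["Food", "Retail", "Food"],
    "funded_date": ["2014-08-03T17:51:50Z", "2018-07-19T15:14:35Z", None],
})

# Numeric columns support arithmetic and aggregation
total = toy["loan_amount"].sum()

# String columns support the .str accessor
sector_upper = toy["sector"].str.upper()

# Date strings are not dates until converted;
# after conversion the .dt accessor becomes available
toy["funded_date"] = pd.to_datetime(toy["funded_date"], utc=True)
funded_year = toy["funded_date"].dt.year

print(toy.dtypes)
print(total, list(sector_upper), list(funded_year))
```

The missing date becomes `NaT` after conversion, so the year for that row is missing as well.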


Let's check the types of our variables.

In [14]:
#list all the columns
df.columns.to_list()

['id_number',
 'loan_amount',
 'lender_count',
 'status',
 'funded_date',
 'funded_amount',
 'repayment_term',
 'location_country_code',
 'sector',
 'description',
 'use']

In [15]:
#checking for specific column
df['id_number'].dtype

dtype('int64')

In [16]:
df.id_number.dtype

dtype('int64')

In [17]:
df.status.dtype

dtype('O')

In [18]:
df['status'].dtype

dtype('O')

In [29]:
#checking for the whole data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6019 entries, 0 to 6018
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id_number              6019 non-null   int64 
 1   loan_amount            6019 non-null   int64 
 2   lender_count           6019 non-null   int64 
 3   status                 6019 non-null   object
 4   funded_date            5082 non-null   object
 5   funded_amount          6019 non-null   int64 
 6   repayment_term         6019 non-null   int64 
 7   location_country_code  6002 non-null   object
 8   sector                 6019 non-null   object
 9   description            5677 non-null   object
 10  use                    5677 non-null   object
dtypes: int64(5), object(6)
memory usage: 517.4+ KB


In [35]:
df.description[6014]

'Sethukelo is a 19-year-old entrepreneur who lives with her mother and siblings in the Lupane District of Zimbabwe.  She is requesting a Kiva loan in order to start her own grocery business.  <br /><br />Sethukelo plans to use the profit she earns to support herself, her mother, and her siblings.  She will repay the ‘social interest’ on her loan by volunteering 2.5 hours per week as a Camfed Transition Guide.  Her responsibilities include supporting other young women who are just entering Camfed’s alumnae organization, CAMA.  She will also deliver weekly sessions of an orientation course on topics such as health, financial literacy, and career guidance.'

In [28]:
#df.info

In [23]:
df['status'].unique()

array(['funded', 'fundraising', 'expired'], dtype=object)

In [21]:
df.repayment_term

0        8
1       12
2       14
3       14
4       13
        ..
6014    14
6015    14
6016    14
6017    14
6018    14
Name: repayment_term, Length: 6019, dtype: int64

#### 2) Do we have missing values?
If we have missing data, is it missing at random?
If yes, the distribution is still representative of the population,
and in this case we may be able to ignore the missing values.
If the data is missing systematically, you have to be more careful about
dealing with it to avoid bias. Think about how to clean the data carefully
(dropping data points, filling by group, etc.).
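
One way to probe whether values are missing systematically is to compare the missing rate across groups. The sketch below uses made-up rows, not loans.csv; grouping by `status` is an assumption about where a pattern might show up, not a finding.

```python
import pandas as pd

# Toy data for illustration only; not taken from loans.csv
toy = pd.DataFrame({
    "status": ["funded", "funded", "fundraising", "expired", "funded"],
    "funded_date": ["2018-01-20", "2018-01-24", None, None, "2018-07-19"],
})

# Share of missing funded_date values within each status group
missing_share = toy["funded_date"].isna().groupby(toy["status"]).mean()
print(missing_share)
```

If the share is close to 0 in some groups and close to 1 in others, the data is probably not missing at random, and dropping those rows would remove whole groups from the analysis.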

In [48]:
# checking for missing values
df.isnull().sum()

id_number                  0
loan_amount                0
lender_count               0
status                     0
funded_date              937
funded_amount              0
repayment_term             0
location_country_code     17
sector                     0
description              342
use                      342
dtype: int64

In [None]:
# lambda is an anonymous function
# the syntax is
# lambda arguments : expression

In [55]:
# let's use a function
def missing_num(x):
    # count the missing values in a column
    return x.isna().sum()

print('Missing values per column')
# apply our function to each column (axis=0), then keep only
# the columns where the count of missing values is not zero

print(df.apply(missing_num, axis = 0).where(lambda x: x!=0).dropna())

Missing values per column
funded_date              937.0
location_country_code     17.0
description              342.0
use                      342.0
dtype: float64
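
The counts above print as floats because `where` replaces the zero entries with NaN before `dropna` removes them. A possible alternative, sketched here on made-up data, filters with a boolean mask so the counts stay integers.

```python
import pandas as pd

# Toy data for illustration only; not taken from loans.csv
toy = pd.DataFrame({
    "id_number": [1, 2, 3],
    "funded_date": ["2018-01-20", None, None],
    "use": ["to pay for a stove.", None, "to buy clothes."],
})

# Count missing values per column, then keep only the non-zero counts
missing = toy.isna().sum()
missing_nonzero = missing[missing > 0]
print(missing_nonzero)
```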


In [46]:
#lambda example
x = lambda b: b+10
print(x(5))

15


#### 3) Sanity checks
###### Does the data make sense? Does it match what you expect to find?

- Is the range of values what you would expect? E.g., all loan amounts must be
greater than zero, and the funded amount cannot be negative.
- Do you have the number of rows you expected?
- Do you have reasonable dates? Dates cannot be in the future and should not be too far in the past,
e.g. 1900 or 1889.
- Do you have unexpected spikes in any of the int columns, or even over time?
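
The checks in the list above can be sketched as boolean columns, one per rule. The rows are made up for illustration, and the `earliest_plausible` cut-off is a hypothetical bound; pick one that suits your own data.

```python
import pandas as pd

# Toy data for illustration only; not taken from loans.csv
toy = pd.DataFrame({
    "loan_amount": [500, 0, 1200],
    "funded_amount": [500, 0, -50],
    "funded_date": [
        "2014-08-03T17:51:50Z",
        "1900-01-01T00:00:00Z",
        "2018-07-19T15:14:35Z",
    ],
})
toy["funded_date"] = pd.to_datetime(toy["funded_date"], utc=True)

# Hypothetical bounds for plausible dates
earliest_plausible = pd.Timestamp("2000-01-01", tz="UTC")
now = pd.Timestamp.now(tz="UTC")

# One boolean column per sanity rule
checks = pd.DataFrame({
    "loan_amount_positive": toy["loan_amount"] > 0,
    "funded_amount_non_negative": toy["funded_amount"] >= 0,
    "date_not_in_future": toy["funded_date"] <= now,
    "date_not_too_old": toy["funded_date"] >= earliest_plausible,
})

# Rows that fail at least one check
failed = toy[~checks.all(axis=1)]
print(checks)
print(failed)
```

Keeping each rule as its own column makes it easy to see which rule a suspicious row breaks.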

In [59]:
# e.g. checking the min value of the loan amount
df.loan_amount.min()

50

In [60]:
df.funded_amount.min()

0

#### 4) Descriptive statistics of the dataset

In [61]:
#we can check the key summaries of the dataset (numerical column)
#using the describe method

df.describe()

Unnamed: 0,id_number,loan_amount,lender_count,funded_amount,repayment_term
count,6019.0,6019.0,6019.0,6019.0,6019.0
mean,1359770.0,1499.011464,35.661406,1325.07061,11.80329
std,371931.6,2512.51728,73.420256,2444.726815,9.114948
min,13772.0,50.0,0.0,0.0,3.0
25%,1425188.0,300.0,7.0,200.0,8.0
50%,1550673.0,625.0,16.0,525.0,10.0
75%,1566204.0,1825.0,41.0,1525.0,14.0
max,1573593.0,80000.0,2665.0,80000.0,133.0


In [66]:
#to get the summary of the object columns
#first, let's create a variable that holds the categorical or object data type
#then we can get the description using describe method

categorical = df.dtypes[df.dtypes=='object'].index
df[categorical].describe()

Unnamed: 0,status,funded_date,location_country_code,sector,description,use
count,6019,5082,6002,6019,5677,5677
unique,3,4453,30,14,5277,4325
top,funded,2018-07-22T15:54:41Z,TZ,Food,Zaina is 19-year-old entrepreneur who lives wi...,to pay for a stove.
freq,5082,9,400,1738,2,80


In [64]:
categorical

Index(['status', 'funded_date', 'location_country_code', 'sector',
       'description', 'use'],
      dtype='object')

In the table above, there are 4 really useful fields:

1) count - the number of populated (non-empty) values in the column.

2) unique - tells us how many distinct values the column contains. For example, 3 in status tells us there are 3 different loan statuses.

3) top - tells us the most common value. For example, the top sector in this dataset is Food, which tells us more loans are in Food than in any other sector.

4) freq - tells us how often the most common value appears in our dataset. For example, 'funded' is the status of most loans (5,082 out of 6,019).
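
To make the four fields concrete, the sketch below recomputes them by hand for a small made-up column and prints `describe()` for comparison. The values are illustrative and are not taken from loans.csv.

```python
import pandas as pd

# Toy column for illustration only; not taken from loans.csv
sector = pd.Series(["Food", "Retail", "Food", None, "Services", "Food"])

# value_counts sorts by frequency and excludes missing values by default
counts = sector.value_counts()

summary = {
    "count": int(sector.count()),     # non-missing values
    "unique": int(sector.nunique()),  # distinct non-missing values
    "top": counts.index[0],           # most common value
    "freq": int(counts.iloc[0]),      # how often the most common value occurs
}
print(summary)
print(sector.describe())
```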

from https://github.com/AISaturdaysLagos/cohort7_classes/blob/main/Week4__Visualization_Data_Exploration/1_module_introduction_pandas/1_4_loading_and_understanding_data.ipynb