### The Data Analysis Process

#### The data analysis process can be generalized into the steps below (Data Analysis with Python, freeCodeCamp, 2020):

1. Data Extraction
    1. SQL, csv, xlsx, scraping, APIs.
2. Data Cleaning
    1. Missing values, incorrect types, invalid values, outliers.
3. Data Wrangling
    1. Reshape data frame, merging and joining, create classes.
4. Analysis
    1. Data exploration, visualization, correlation, hypothesis testing.
5. Action
    1. Machine Learning, Feature Engineering, production, dashboard.
6. Decision making
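
#### As a rough illustration of how the first four steps fit together, here is a minimal sketch on made-up data. The table 'toy', its columns and the age cut-off of 30 are assumptions for this example only; they are not part of the course material or of the dataset used later.

```python
import io

import numpy as np
import pandas as pd

# 1. Extraction: a tiny in-memory csv stands in for a real source (SQL, file, API)
raw_csv = io.StringIO(
    "customer,age,income\n"
    "a,25,3000\n"
    "b,,4200\n"
    "c,41,5100\n"
)
toy = pd.read_csv(raw_csv)

# 2. Cleaning: replace the missing age with the median of the observed ages
toy["age"] = toy["age"].fillna(toy["age"].median())

# 3. Wrangling: create a class variable from a numeric one
toy["age_class"] = np.where(toy["age"] >= 30, "30+", "under 30")

# 4. Analysis: average income per class
summary = toy.groupby("age_class")["income"].mean()
print(summary)
```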

#### Two Python libraries are widely used for data analysis: **numpy** and **pandas**.

#### **numpy** stands for 'numerical Python' and provides many resources for numerical operations. **pandas**, in turn, gives us powerful tools to work with data frames for data analysis. Here is how we import them:

In [1]:
import numpy  as np
import pandas as pd

In [2]:
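# show floats with two decimal places when displaying data frames
pd.options.display.float_format = '{:.2f}'.format

In [3]:
# display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"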

#### We use _import_ to load a library (e.g. numpy and pandas), and 'as' np/pd just gives it a shorter alias so that our code is simpler to write. As we advance in our data analysis, we will learn how to use some of the handy functions these libraries provide.

### German Credit Analysis

#### For our first analysis, we will use a credit risk dataset available on [Kaggle](https://www.kaggle.com/laotse/credit-risk-dataset/version/1).

#### We will read the csv file into a DataFrame object called 'dados' (Portuguese for 'data') and do some basic inspection to see what kind of information we are dealing with.

In [4]:
#help(pd.read_csv)
dados = pd.read_csv('credit_risk_dataset.csv')

In [5]:
type(dados)

pandas.core.frame.DataFrame

In [6]:
# Get column names
dados.columns

Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

In [7]:
# number of rows and columns
dados.shape

(32581, 12)

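#### We can get a general description of our dataset with the method 'info()'. The type of the values stored in each column is given by 'Dtype', where 'int64' means integers, 'float64' means decimal numbers and 'object' usually means strings. The 'Non-Null Count' column also shows that 'person_emp_length' and 'loan_int_rate' have missing values, since their counts are below the 32581 rows of the dataset.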

In [8]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB
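
#### To see where these types come from, here is a small sketch with a made-up data frame. The name 'exemplo' and its columns are assumptions for illustration, not part of the credit dataset.

```python
import pandas as pd

exemplo = pd.DataFrame({
    "age": [22, 35, 41],                             # whole numbers
    "rate": [11.5, 7.9, 13.2],                       # decimal numbers
    "emp_length": [5, None, 1],                      # whole numbers with one missing value
    "intent": ["PERSONAL", "MEDICAL", "EDUCATION"],  # text
})

# 'age' is stored as int64 and 'rate' as float64.
# 'emp_length' becomes float64 because the missing value is stored as NaN.
# 'intent' is reported as 'object' in the pandas version used in this notebook;
# newer pandas versions may report a dedicated string type instead.
tipos = exemplo.dtypes.astype(str)
print(tipos)
```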


In [9]:
# display first 'n' observations
dados.head(5)

| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |


In [10]:
# general measures on quantitative variables
dados.describe()

| | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|
| count | 32581.0 | 32581.0 | 31686.0 | 32581.0 | 29465.0 | 32581.0 | 32581.0 | 32581.0 |
| mean | 27.73 | 66074.85 | 4.79 | 9589.37 | 11.01 | 0.22 | 0.17 | 5.8 |
| std | 6.35 | 61983.12 | 4.14 | 6322.09 | 3.24 | 0.41 | 0.11 | 4.06 |
| min | 20.0 | 4000.0 | 0.0 | 500.0 | 5.42 | 0.0 | 0.0 | 2.0 |
| 25% | 23.0 | 38500.0 | 2.0 | 5000.0 | 7.9 | 0.0 | 0.09 | 3.0 |
| 50% | 26.0 | 55000.0 | 4.0 | 8000.0 | 10.99 | 0.0 | 0.15 | 4.0 |
| 75% | 30.0 | 79200.0 | 7.0 | 12200.0 | 13.47 | 0.0 | 0.23 | 8.0 |
| max | 144.0 | 6000000.0 | 123.0 | 35000.0 | 23.22 | 1.0 | 0.83 | 30.0 |


#### We can also look at class frequencies in a qualitative variable. With the argument **normalize=True**, the result is given as proportions (adding up to 1) instead of counts.

In [10]:
# Frequency
dados['loan_intent'].value_counts()
# Percentage
dados.loan_intent.value_counts(normalize=True)

EDUCATION            6453
MEDICAL              6071
VENTURE              5719
PERSONAL             5521
DEBTCONSOLIDATION    5212
HOMEIMPROVEMENT      3605
Name: loan_intent, dtype: int64

EDUCATION           0.20
MEDICAL             0.19
VENTURE             0.18
PERSONAL            0.17
DEBTCONSOLIDATION   0.16
HOMEIMPROVEMENT     0.11
Name: loan_intent, dtype: float64
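
#### The relation between counts and proportions can be checked on a small made-up series. The values in 'intents' are an assumption for illustration; multiplying the proportions by 100 gives percentages.

```python
import pandas as pd

intents = pd.Series(["EDUCATION", "MEDICAL", "EDUCATION", "VENTURE"])

counts = intents.value_counts()                # absolute frequencies
shares = intents.value_counts(normalize=True)  # proportions between 0 and 1
percent = (shares * 100).round(1)              # the same information as percentages

# put the three results side by side, one column per measure
print(pd.concat([counts, shares, percent], axis=1, keys=["count", "share", "percent"]))
```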

#### It is also interesting to look at some cross frequencies. Note that the first variable is placed in the rows and the second in the columns.

In [24]:
print('\nFrequency count\n')
pd.crosstab(dados.loan_intent, dados.person_home_ownership)


Frequency count



| loan_intent / person_home_ownership | MORTGAGE | OTHER | OWN | RENT |
|---|---|---|---|---|
| DEBTCONSOLIDATION | 2312 | 17 | 72 | 2811 |
| EDUCATION | 2627 | 17 | 528 | 3281 |
| HOMEIMPROVEMENT | 1741 | 12 | 318 | 1534 |
| MEDICAL | 2190 | 17 | 434 | 3430 |
| PERSONAL | 2340 | 18 | 446 | 2717 |
| VENTURE | 2234 | 26 | 786 | 2673 |
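
#### The role of the argument order can be seen on a small made-up table. The data frame 'toy' below is an assumption for illustration: swapping the two arguments of 'pd.crosstab' simply swaps rows and columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "intent": ["MEDICAL", "MEDICAL", "EDUCATION", "EDUCATION", "EDUCATION"],
    "home":   ["RENT",    "OWN",     "RENT",      "RENT",      "OWN"],
})

# first argument -> rows, second argument -> columns
by_intent = pd.crosstab(toy["intent"], toy["home"])
# swapping the arguments gives the transposed table
by_home = pd.crosstab(toy["home"], toy["intent"])

print(by_intent)
print(by_home)
```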


#### When normalizing a contingency table, we can get proportions by row, by column or relative to the overall sum.

In [26]:
print('\nPercentages by row\n')
pd.crosstab(dados.loan_intent, dados.person_home_ownership, normalize='index')

print('\nPercentages by column\n')
pd.crosstab(dados.loan_intent, dados.person_home_ownership, normalize='columns')

print('\nPercentages by overall sum\n')
pd.crosstab(dados.loan_intent, dados.person_home_ownership, normalize='all')


Percentages by row



| loan_intent / person_home_ownership | MORTGAGE | OTHER | OWN | RENT |
|---|---|---|---|---|
| DEBTCONSOLIDATION | 0.44 | 0.0 | 0.01 | 0.54 |
| EDUCATION | 0.41 | 0.0 | 0.08 | 0.51 |
| HOMEIMPROVEMENT | 0.48 | 0.0 | 0.09 | 0.43 |
| MEDICAL | 0.36 | 0.0 | 0.07 | 0.56 |
| PERSONAL | 0.42 | 0.0 | 0.08 | 0.49 |
| VENTURE | 0.39 | 0.0 | 0.14 | 0.47 |



Percentages by column



| loan_intent / person_home_ownership | MORTGAGE | OTHER | OWN | RENT |
|---|---|---|---|---|
| DEBTCONSOLIDATION | 0.17 | 0.16 | 0.03 | 0.17 |
| EDUCATION | 0.2 | 0.16 | 0.2 | 0.2 |
| HOMEIMPROVEMENT | 0.13 | 0.11 | 0.12 | 0.09 |
| MEDICAL | 0.16 | 0.16 | 0.17 | 0.21 |
| PERSONAL | 0.17 | 0.17 | 0.17 | 0.17 |
| VENTURE | 0.17 | 0.24 | 0.3 | 0.16 |



Percentages by overall sum



| loan_intent / person_home_ownership | MORTGAGE | OTHER | OWN | RENT |
|---|---|---|---|---|
| DEBTCONSOLIDATION | 0.07 | 0.0 | 0.0 | 0.09 |
| EDUCATION | 0.08 | 0.0 | 0.02 | 0.1 |
| HOMEIMPROVEMENT | 0.05 | 0.0 | 0.01 | 0.05 |
| MEDICAL | 0.07 | 0.0 | 0.01 | 0.11 |
| PERSONAL | 0.07 | 0.0 | 0.01 | 0.08 |
| VENTURE | 0.07 | 0.0 | 0.02 | 0.08 |
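
#### A quick way to convince ourselves of what each option does is to check which sums equal 1. The sketch below uses a small made-up data frame ('toy' is an assumption for illustration, not the credit dataset).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "intent": ["MEDICAL", "MEDICAL", "EDUCATION", "EDUCATION", "EDUCATION"],
    "home":   ["RENT",    "OWN",     "RENT",      "RENT",      "OWN"],
})

row_pct = pd.crosstab(toy["intent"], toy["home"], normalize="index")
col_pct = pd.crosstab(toy["intent"], toy["home"], normalize="columns")
all_pct = pd.crosstab(toy["intent"], toy["home"], normalize="all")

print(row_pct.sum(axis=1))       # 'index': every row adds up to 1
print(col_pct.sum(axis=0))       # 'columns': every column adds up to 1
print(all_pct.to_numpy().sum())  # 'all': the whole table adds up to 1
```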


#### Calculating summary statistics for a specific feature.

In [31]:
dados['person_income'].mean()
dados['person_income'].std()

66074.84846996715

61983.119168159064
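
#### One detail worth keeping in mind: the pandas method 'std()' computes the sample standard deviation (it divides by n - 1, i.e. 'ddof=1'), while NumPy's 'np.std' divides by n by default ('ddof=0'). The sketch below shows the difference on a small made-up series; 'valores' is an assumption for illustration.

```python
import numpy as np
import pandas as pd

valores = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = valores.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = valores.std(ddof=0)     # divide by n instead
numpy_std = np.std(valores.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```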

#### Simple data frame with income mean and standard deviation.

In [42]:
media  = dados['person_income'].mean()
desvio = dados['person_income'].std()

pd.DataFrame([ ['income', media, desvio] ], columns=['Variable', 'Mean', 'Std'])

| | Variable | Mean | Std |
|---|---|---|---|
| 0 | income | 66074.85 | 61983.12 |


#### From an exploratory perspective, it's also helpful to look at summary statistics by classes of interest. For example, we can check whether the average interest rate seems to change across loan grades.

In [41]:
dados.groupby(['loan_grade'])['loan_int_rate'].mean()
dados.groupby(['loan_grade'])['loan_int_rate'].std()

loan_grade
A    7.33
B   11.00
C   13.46
D   15.36
E   17.01
F   18.61
G   20.25
Name: loan_int_rate, dtype: float64

loan_grade
A   1.04
B   0.91
C   0.96
D   1.11
E   1.32
F   1.38
G   1.07
Name: loan_int_rate, dtype: float64
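
#### Recall from 'info()' that 'loan_int_rate' has missing values. By default, 'mean()' and 'std()' skip them, so each group statistic is based only on the observed rates. A minimal sketch with made-up data ('toy' is an assumption for illustration) shows this, together with 'count()' and 'size()' to see how many values were actually used.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "loan_grade":    ["A", "A",    "A", "B",  "B"],
    "loan_int_rate": [7.0, np.nan, 8.0, 11.0, 13.0],
})

por_nota = toy.groupby("loan_grade")["loan_int_rate"]

media = por_nota.mean()       # the NaN in grade A is skipped: (7 + 8) / 2
n_validos = por_nota.count()  # non-missing values per grade
n_total = por_nota.size()     # all rows per grade, missing or not

print(pd.concat([media, n_validos, n_total], axis=1, keys=["mean", "count", "size"]))
```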

#### For a better visualization we can build a simple data frame, just like the 'person_income' example above.

In [51]:
juros_nota_me = dados.groupby(['loan_grade'])['loan_int_rate'].mean()
juros_nota_sd = dados.groupby(['loan_grade'])['loan_int_rate'].std()

pd.DataFrame([juros_nota_me, juros_nota_sd])

| loan_grade | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| loan_int_rate | 7.33 | 11.0 | 13.46 | 15.36 | 17.01 | 18.61 | 20.25 |
| loan_int_rate | 1.04 | 0.91 | 0.96 | 1.11 | 1.32 | 1.38 | 1.07 |
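
#### Since both rows carry the same label, it is not obvious which one is the mean and which is the standard deviation. One possible alternative, sketched here on made-up data, is to ask 'groupby' for both statistics at once with 'agg', which labels each column with the name of the statistic. The data frame 'toy' is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_grade":    ["A", "A", "B",  "B"],
    "loan_int_rate": [7.0, 8.0, 11.0, 13.0],
})

# one row per grade, one column per statistic
resumo = toy.groupby("loan_grade")["loan_int_rate"].agg(["mean", "std"])
print(resumo)
```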


#### Another option is to use 'pd.concat' to concatenate them. The argument 'axis=1' tells pandas to bind them as columns. What's the difference between the two data frames below?

In [65]:
pd.concat([juros_nota_me, juros_nota_sd], axis=1)
pd.concat([juros_nota_me, juros_nota_sd], axis=1).reset_index()

| loan_grade | loan_int_rate | loan_int_rate |
|---|---|---|
| A | 7.33 | 1.04 |
| B | 11.0 | 0.91 |
| C | 13.46 | 0.96 |
| D | 15.36 | 1.11 |
| E | 17.01 | 1.32 |
| F | 18.61 | 1.38 |
| G | 20.25 | 1.07 |


| | loan_grade | loan_int_rate | loan_int_rate |
|---|---|---|---|
| 0 | A | 7.33 | 1.04 |
| 1 | B | 11.0 | 0.91 |
| 2 | C | 13.46 | 0.96 |
| 3 | D | 15.36 | 1.11 |
| 4 | E | 17.01 | 1.32 |
| 5 | F | 18.61 | 1.38 |
| 6 | G | 20.25 | 1.07 |
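
#### In both results the two columns share the name 'loan_int_rate'. A possible refinement, sketched below with made-up series, is to pass the argument 'keys' to 'pd.concat' so that each column gets its own label. The series 'me' and 'sd' and their values are assumptions for illustration.

```python
import pandas as pd

grades = pd.Index(["A", "B"], name="loan_grade")
me = pd.Series([7.5, 12.0], index=grades, name="loan_int_rate")  # made-up means
sd = pd.Series([0.7, 1.4], index=grades, name="loan_int_rate")   # made-up standard deviations

# 'keys' replaces the repeated series name with one label per column
tabela = pd.concat([me, sd], axis=1, keys=["mean", "std"])
print(tabela)

# reset_index() turns the 'loan_grade' index into an ordinary column
print(tabela.reset_index())
```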
