In [53]:
import pandas as pd
import numpy as np

from analyze_src.basic_data_inspection import DataInspection, DatatypesInspectionStrategy, SummaryInspectionStrategy
# Show every column and up to 500 rows when printing DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 500)

In [54]:
data_path = '../extracted_data/logistic_regression.csv'
df = pd.read_csv(data_path)

# Checking the basic code

In [55]:
# Step 1: basic data inspection

In [56]:
# initialize the data inspector with the strategy of datatype inspection
inspector = DataInspection(DatatypesInspectionStrategy())
inspector.execute_inspection(df)

Datatypes of the Nullcounts in the dataframe: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object 
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object 
 5   sub_grade             396030 non-null  object 
 6   emp_title             373103 non-null  object 
 7   emp_length            377729 non-null  object 
 8   home_ownership        396030 non-null  object 
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object 
 11  issue_d               396030 non-null  object 
 12  loan_status           396030 non-null  object 
 13  purpose               396030 non-null  object 
 14  title
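The Non-Null Count column above shows that `emp_title` (373103 non-null) and `emp_length` (377729 non-null) fall short of the 396030 rows, so they contain missing values. A minimal sketch of how the missing share per column could be tabulated follows; `missing_summary` and the toy frame are hypothetical names and made-up values used only for illustration, not part of `analyze_src`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage for columns that have nulls."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one null, largest gaps first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the loan data (values are made up)
toy_missing = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0, 15600.0, 7200.0],
    "emp_title": ["Marketing", None, "Statistician", None],
    "emp_length": ["10+ years", "4 years", None, "6 years"],
})
print(missing_summary(toy_missing))
```

On the real data the same call would be `missing_summary(df)`.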

In [49]:
# switch the strategy to summary inspection
inspector.set_strategy(SummaryInspectionStrategy())
inspector.execute_inspection(df)

Summary statistics of Numerical Columns:  
           loan_amnt       int_rate    installment    annual_inc  \
count  396030.000000  396030.000000  396030.000000  3.960300e+05   
mean    14113.888089      13.639400     431.849698  7.420318e+04   
std      8357.441341       4.472157     250.727790  6.163762e+04   
min       500.000000       5.320000      16.080000  0.000000e+00   
25%      8000.000000      10.490000     250.330000  4.500000e+04   
50%     12000.000000      13.330000     375.430000  6.400000e+04   
75%     20000.000000      16.490000     567.300000  9.000000e+04   
max     40000.000000      30.990000    1533.810000  8.706582e+06   

                 dti       open_acc        pub_rec     revol_bal  \
count  396030.000000  396030.000000  396030.000000  3.960300e+05   
mean       17.379514      11.311153       0.178191  1.584454e+04   
std        18.019092       5.137649       0.530671  2.059184e+04   
min         0.000000       0.000000       0.000000  0.000000e+00   
25% 
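The two cells above swap inspection behaviour at runtime through `set_strategy`, which is the strategy pattern. The source of `analyze_src.basic_data_inspection` is not shown here, so the sketch below is only an assumption about how such a module might be organised; the class names `InspectionStrategy`, `DtypeStrategy`, `SummaryStrategy` and `Inspector` are hypothetical and deliberately differ from the real ones.

```python
from abc import ABC, abstractmethod

import pandas as pd


class InspectionStrategy(ABC):
    """Hypothetical base class: each subclass implements one kind of inspection."""

    @abstractmethod
    def inspect(self, frame: pd.DataFrame) -> None:
        ...


class DtypeStrategy(InspectionStrategy):
    def inspect(self, frame: pd.DataFrame) -> None:
        print("Data types and non-null counts:")
        frame.info()


class SummaryStrategy(InspectionStrategy):
    def inspect(self, frame: pd.DataFrame) -> None:
        print("Summary statistics of numerical columns:")
        print(frame.describe())


class Inspector:
    """Holds the current strategy and delegates the inspection to it."""

    def __init__(self, strategy: InspectionStrategy) -> None:
        self._strategy = strategy

    def set_strategy(self, strategy: InspectionStrategy) -> None:
        self._strategy = strategy

    def execute(self, frame: pd.DataFrame) -> None:
        self._strategy.inspect(frame)


# Toy frame (made-up values) to exercise both strategies
toy_loans = pd.DataFrame({"loan_amnt": [500.0, 12000.0, 40000.0],
                          "grade": ["A", "B", "B"]})
sketch_inspector = Inspector(DtypeStrategy())
sketch_inspector.execute(toy_loans)
sketch_inspector.set_strategy(SummaryStrategy())
sketch_inspector.execute(toy_loans)
```

The benefit of this layout is that a new kind of inspection only needs a new strategy class; the calling code stays the same.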

# Insights

1. Know your data
    
     - The data contains 396030 rows and 27 columns

2. Summary statistics

    **Loan status**
    - There are 2 categories
    - **Loan status** is the target variable; 318357 borrowers have fully paid their loans

    **Loan amount**

    - Loan amount applied for by the borrower
    - The average loan taken from LoanTap is 14113.89
    - Requested amounts range from a minimum of 500 to a maximum of 40000

    **term**

    - Term has 2 categories: 36 and 60 months
    - 36 months is the more common term, with 302005 borrowers
    
    **Interest rate**
    - The mean interest rate is **13.639**
    - Interest rates range from 5.32 to 30.99

    **installment**
    - Monthly payment owed by the borrower
    - Monthly payments range from 16.08 to 1533.81

    **annual income**
    - Annual income of the borrower ranges from 0 to 8706582
    - 75% of the borrowers have an annual income of 90000 or less

    **DTI - Debt to income ratio**
    - The debt-to-income ratio ranges from 0 to 9999

    **open_acc - open credit accounts**
    - Number of open credit lines in the borrower's file
    - ✅ 75% of the users have fewer than 14 credit lines
    - ❌ Surprisingly, one borrower has 90 credit lines

    **pub_rec - derogatory public records**
    - ✅ At least 75% of users have no derogatory records
    - One borrower in the data has 86 derogatory records

    **revol_bal - outstanding revolving balance**
    - ❌ One borrower has a revolving balance of about 17 lakhs (1.7 million)
    - 75% have a revolving balance of 19620 or less

    **revol_util**
    - Amount of credit the borrower is using relative to the available revolving credit
    - Maximum revol_util of 892

    **total_acc** 
    - The total number of credit lines currently in the borrower's credit file
    - One borrower has 151 credit lines, which suggests a frequent borrower

    **mort_acc**
    - Number of mortgage accounts
    - 75% have fewer than 3 mortgage accounts

    **pub_rec_bankruptcies**
    - Number of public record bankruptcies
    - Maximum of 8 bankruptcies

    **Grade**
    - LoanTap assigns 7 grades; grade B is the most common

    **subgrades**
    - LoanTap assigns subgrades within each grade

    **emp_title**
    - It has a large number of unique values (173105), reflecting the variety of job titles

    **emp_length**
    - Employment length is grouped into 11 categories

    **Home ownership**
    - There are 6 types of home ownership
    - Mortgage is the most common type

    **Verification status**
    - There are 3 categories; verified is the most common

    **issue_d**
    - The most common issue month is Oct-2014

    **purpose**
    - The loan purpose is grouped into 14 categories

    **Title**
    - Loan title provided by the borrower
    - Debt consolidation is the most common title

    **earliest credit line**
    - Date the borrower's earliest credit line was opened

    **Initial_list_status**
    - W (Whole Loan Program): the loan was listed as a whole loan, meaning a single investor (or very few investors) funded it
    - F (Fractional Loan Program): the loan was funded in fractional parts by multiple investors
    - 238066 loans are of type F

    **Application type**
    - Almost all applications are individual (395319)
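Most of the insights above are one of three kinds: the most frequent category of a column, a 75th percentile, or an extreme value. A minimal sketch of how each could be checked with pandas follows. The helpers `top_category` and `iqr_outliers` are hypothetical, the 1.5 × IQR rule is a common convention rather than something the notebook itself applies, and the toy frame uses made-up values.

```python
import pandas as pd


def top_category(column: pd.Series) -> tuple:
    """Return (most frequent value, its count, number of distinct values)."""
    counts = column.value_counts()
    return counts.index[0], int(counts.iloc[0]), int(column.nunique())


def iqr_outliers(column: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = column.quantile([0.25, 0.75])
    iqr = q3 - q1
    return column[(column < q1 - k * iqr) | (column > q3 + k * iqr)]


# Toy frame standing in for the loan data (values are made up)
toy_insights = pd.DataFrame({
    "term": ["36 months", "36 months", "60 months", "36 months", "36 months"],
    "open_acc": [8.0, 10.0, 12.0, 14.0, 90.0],
})

print(top_category(toy_insights["term"]))          # most common term and its count
print(toy_insights["open_acc"].quantile(0.75))     # 75th percentile of open accounts
print(iqr_outliers(toy_insights["open_acc"]).tolist())  # unusually large values
```

On the real data, the same calls on `df["term"]`, `df["open_acc"]` and the other columns would reproduce the counts and percentiles listed above.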









