# Chapter 5: Understand Your Data With Descriptive Statistics

## Summary
* Peek At Your Data.
* Dimensions of Your Data.
* Data Types.
* Data Summary.
* Class Distribution.
* Correlations.
* Skewness.

We import the libraries needed for this chapter.

In [1]:
from pandas import read_csv
from pandas import set_option

We load the training and test datasets.

In [2]:
filename = "./cs-training.csv"
test_filename = "./cs-test.csv"

data = read_csv(filename, index_col=0)
test = read_csv(test_filename, index_col=0)
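
The effect of `index_col=0` can be seen on a small in-memory CSV. This is only a sketch: the CSV text below is a hypothetical miniature, and it assumes the real files likewise keep row ids in an unnamed first column.

```python
import io
from pandas import read_csv

# Hypothetical miniature CSV: the first, unnamed column holds row ids.
csv_text = """,SeriousDlqin2yrs,age
1,1,45
2,0,40
3,0,38
"""

# index_col=0 turns that first column into the row index instead of a feature.
sample = read_csv(io.StringIO(csv_text), index_col=0)
print(sample)
print(list(sample.columns))
```

Without `index_col=0`, the id column would be loaded as an extra attribute and would show up in every statistic computed later.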

## 5.1 Peek at Your Data

In [3]:
peek = data.head(20)
print(peek)

    SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  age  \
1                  1                              0.766127   45   
2                  0                              0.957151   40   
3                  0                              0.658180   38   
4                  0                              0.233810   30   
5                  0                              0.907239   49   
6                  0                              0.213179   74   
7                  0                              0.305682   57   
8                  0                              0.754464   39   
9                  0                              0.116951   27   
10                 0                              0.189169   57   
11                 0                              0.644226   30   
12                 0                              0.018798   51   
13                 0                              0.010352   46   
14                 1                              0.964673   4
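
When a frame has many long column names, the printed peek wraps across several blocks. One possible workaround, sketched here on a hypothetical three-column stand-in built from the first rows shown above, is to transpose the peek so that each attribute becomes a row.

```python
from pandas import DataFrame

# Hypothetical stand-in for the credit data: a few rows, long column names.
sample = DataFrame({
    "SeriousDlqin2yrs": [1, 0, 0, 0],
    "RevolvingUtilizationOfUnsecuredLines": [0.766127, 0.957151, 0.658180, 0.233810],
    "age": [45, 40, 38, 30],
}, index=[1, 2, 3, 4])

# head(n) returns the first n rows; .T transposes so each column becomes a row,
# which keeps wide frames from wrapping across several printed blocks.
peek_t = sample.head(3).T
print(peek_t)
```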

## 5.2 Dimensions of Your Data

In [4]:
shape = data.shape
print(shape)

(150000, 11)


In [5]:
test_shape = test.shape
print(test_shape)

(101503, 11)
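
Both files report 11 columns, which suggests they share a schema. A minimal sketch of how that could be checked, using hypothetical miniature frames (`train_sample` and `test_sample` are illustrative names, not part of the chapter's data):

```python
from pandas import DataFrame

# Hypothetical miniature train/test frames with the same columns.
train_sample = DataFrame({"target": [1, 0, 0], "age": [45, 40, 38]})
test_sample = DataFrame({"target": [None, None], "age": [27, 57]})

# shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = train_sample.shape
print(f"{n_rows} rows, {n_cols} columns")

# Same column names in the same order means the two frames share a schema.
same_schema = list(train_sample.columns) == list(test_sample.columns)
print(same_schema)
```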


## 5.3 Data Type For Each Attribute

In [6]:
types = data.dtypes
print(types)

SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                       int64
NumberOfTime30-59DaysPastDueNotWorse      int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans           int64
NumberOfTimes90DaysLate                   int64
NumberRealEstateLoansOrLines              int64
NumberOfTime60-89DaysPastDueNotWorse      int64
NumberOfDependents                      float64
dtype: object
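
A count-like attribute such as NumberOfDependents showing up as float64 is often a sign of missing values, although that is an assumption to verify rather than something the dtype listing proves. The sketch below uses a hypothetical two-column frame to show how a single NaN changes the inferred type.

```python
import numpy as np
from pandas import DataFrame

# Hypothetical columns of whole numbers, with and without a missing value.
sample = DataFrame({
    "complete": [1, 2, 3],
    "with_gap": [1, np.nan, 3],
})

# NaN is a float, so a single missing value promotes the column to float64.
print(sample.dtypes)

# Count missing values per column to confirm the cause.
missing = sample.isnull().sum()
print(missing)
```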


## 5.4 Descriptive Statistics

In [7]:
set_option('display.width', 100)
set_option('display.precision', 3)
description = data.describe()
print(description)

       SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines         age  \
count        150000.000                            150000.000  150000.000   
mean              0.067                                 6.048      52.295   
std               0.250                               249.755      14.772   
min               0.000                                 0.000       0.000   
25%               0.000                                 0.030      41.000   
50%               0.000                                 0.154      52.000   
75%               0.000                                 0.559      63.000   
max               1.000                             50708.000     109.000   

       NumberOfTime30-59DaysPastDueNotWorse   DebtRatio  MonthlyIncome  \
count                            150000.000  150000.000      1.203e+05   
mean                                  0.421     353.005      6.670e+03   
std                                   4.193    2037.819      1.438e+04   
min       
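
In the summary above, RevolvingUtilizationOfUnsecuredLines has a 75th percentile of 0.559 but a maximum of 50708, which points to extreme outliers. A minimal sketch of how such a gap can be probed, using a hypothetical `ratio` column and the `percentiles` argument of `describe()`:

```python
from pandas import DataFrame

# Hypothetical ratio column: mostly small values plus one extreme outlier.
values = [0.03, 0.15, 0.56, 0.2, 0.4, 0.1, 0.3, 0.25, 0.05, 50708.0]
sample = DataFrame({"ratio": values})

# Request specific percentiles instead of the default 25%, 50%, 75%.
summary = sample["ratio"].describe(percentiles=[0.5, 0.9])
print(summary)

# A mean far above the median is a quick hint that outliers dominate.
mean_over_median = summary["mean"] / summary["50%"]
print(mean_over_median)
```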

## 5.5 Class Distribution (Classification Only)

In [8]:
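# Caution: RevolvingUtilizationOfUnsecuredLines is a continuous feature, so this
# grouping yields one row per distinct value rather than one row per class.
# The class (target) column in this dataset is SeriousDlqin2yrs.
class_counts = data.groupby('RevolvingUtilizationOfUnsecuredLines').size()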
print(class_counts)

RevolvingUtilizationOfUnsecuredLines
0.000e+00    10878
8.370e-06        1
9.930e-06        1
1.250e-05        1
1.430e-05        1
1.490e-05        1
1.510e-05        1
1.600e-05        1
1.640e-05        1
1.870e-05        1
1.880e-05        1
2.100e-05        1
2.130e-05        1
2.210e-05        1
2.380e-05        1
2.660e-05        1
2.740e-05        1
2.780e-05        1
2.820e-05        1
2.850e-05        1
2.860e-05        1
3.180e-05        1
3.220e-05        1
3.230e-05        1
3.250e-05        1
3.300e-05        1
3.320e-05        2
3.350e-05        1
3.360e-05        1
3.370e-05        1
             ...  
7.452e+03        1
7.555e+03        1
7.696e+03        1
7.809e+03        1
7.839e+03        1
7.907e+03        1
8.228e+03        1
8.328e+03        1
8.497e+03        1
8.710e+03        1
8.831e+03        1
9.193e+03        1
9.340e+03        1
9.684e+03        1
1.015e+04        1
1.021e+04        1
1.082e+04        1
1.155e+04        1
1.184e+04        1
1.237e+04    
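
A class distribution is normally computed on the target column, which in this dataset is SeriousDlqin2yrs. The sketch below uses ten hypothetical labels rather than the real data, so the counts it prints are illustrative only.

```python
from pandas import DataFrame

# Hypothetical labels: 0 = no serious delinquency, 1 = serious delinquency.
sample = DataFrame({"SeriousDlqin2yrs": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Rows per class.
class_counts = sample.groupby("SeriousDlqin2yrs").size()
print(class_counts)

# Share of each class; a heavily lopsided split signals class imbalance.
class_share = sample["SeriousDlqin2yrs"].value_counts(normalize=True)
print(class_share)
```

On the real frame, the same two calls applied to `data` would give the actual counts and proportions.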

## 5.6 Correlations Between Attributes
Correlation refers to the relationship between two variables and how they may or may not
change together.

In [9]:
set_option('display.width', 100)
set_option('display.precision', 3)
correlations = data.corr(method='pearson')
print(correlations)

                                      SeriousDlqin2yrs  RevolvingUtilizationOfUnsecuredLines  \
SeriousDlqin2yrs                                 1.000                                -0.002   
RevolvingUtilizationOfUnsecuredLines            -0.002                                 1.000   
age                                             -0.115                                -0.006   
NumberOfTime30-59DaysPastDueNotWorse             0.126                                -0.001   
DebtRatio                                       -0.008                                 0.004   
MonthlyIncome                                   -0.020                                 0.007   
NumberOfOpenCreditLinesAndLoans                 -0.030                                -0.011   
NumberOfTimes90DaysLate                          0.117                                -0.001   
NumberRealEstateLoansOrLines                    -0.007                                 0.006   
NumberOfTime60-89DaysPastDueNotWorse    
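
The full matrix is hard to scan, and often only the correlations with the target matter. Pearson's coefficient ranges from -1 to 1, with values near 0 indicating little linear relationship. One way to pull out a single column and rank it, sketched on a hypothetical frame (the column names `target`, `late`, and `age` are illustrative):

```python
from pandas import DataFrame

# Hypothetical data: "late" rises with the target, "age" falls with it.
sample = DataFrame({
    "target": [0, 0, 0, 1, 1, 1],
    "late": [0, 1, 0, 2, 3, 2],
    "age": [60, 55, 58, 30, 35, 28],
})

correlations = sample.corr(method="pearson")

# Keep only the target column, drop its self-correlation, and sort ascending.
with_target = correlations["target"].drop("target").sort_values()
print(with_target)
```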

## 5.7 Skew of Univariate Distributions
Skew describes how far a distribution departs from the symmetric Gaussian (normal, or bell curve) shape by being shifted or squashed in one direction or the other.
The skew result shows a positive (right) or negative (left) skew. Values closer to zero indicate
less skew.

In [10]:
skew = data.skew()
print(skew)

SeriousDlqin2yrs                          3.469
RevolvingUtilizationOfUnsecuredLines     97.632
age                                       0.189
NumberOfTime30-59DaysPastDueNotWorse     22.597
DebtRatio                                95.158
MonthlyIncome                           114.040
NumberOfOpenCreditLinesAndLoans           1.215
NumberOfTimes90DaysLate                  23.087
NumberRealEstateLoansOrLines              3.482
NumberOfTime60-89DaysPastDueNotWorse     23.332
NumberOfDependents                        1.588
dtype: float64
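
Strongly right-skewed attributes are often log-transformed before modeling. This is a common technique rather than something this chapter prescribes. The sketch below uses a hypothetical series of doubling values to show how `numpy.log1p` can pull the skew toward zero.

```python
import numpy as np
from pandas import Series

# Hypothetical values that double at each step, giving a long right tail.
values = Series([1, 2, 4, 8, 16, 32, 64, 128, 256])

raw_skew = values.skew()

# log1p computes log(1 + x); it compresses large values and is defined at 0.
log_skew = np.log1p(values).skew()

print(raw_skew, log_skew)
```

`log1p` is used instead of a plain logarithm because several attributes in this dataset have a minimum of 0, where `log` is undefined.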
