# Chapter 5: Understand Your Data With Descriptive Statistics

## Summary
* Peek At Your Data.
* Dimensions of Your Data.
* Data Types.
* Data Summary.
* Class Distribution.
* Correlations.
* Skewness.

We import the libraries needed for this chapter.

In [1]:
from pandas import read_csv
from pandas import set_option

We load the dataset, using the first column (`Id`) as the index.

In [2]:
filename = "kaggle-house-prices-train.csv"
data = read_csv(filename, index_col=0)

## 5.1 Peek at Your Data

In [3]:
peek = data.head(20)
print(peek)

    MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
Id                                                                    
1           60       RL         65.0     8450   Pave   NaN      Reg   
2           20       RL         80.0     9600   Pave   NaN      Reg   
3           60       RL         68.0    11250   Pave   NaN      IR1   
4           70       RL         60.0     9550   Pave   NaN      IR1   
5           60       RL         84.0    14260   Pave   NaN      IR1   
6           50       RL         85.0    14115   Pave   NaN      IR1   
7           20       RL         75.0    10084   Pave   NaN      Reg   
8           60       RL          NaN    10382   Pave   NaN      IR1   
9           50       RM         51.0     6120   Pave   NaN      Reg   
10         190       RL         50.0     7420   Pave   NaN      Reg   
11          20       RL         70.0    11200   Pave   NaN      Reg   
12          60       RL         85.0    11924   Pave   NaN      IR1   
13    

## 5.2 Dimensions of Your Data

In [4]:
shape = data.shape
print(shape)

(1460, 80)


## 5.3 Data Type For Each Attribute

In [5]:
types = data.dtypes
print(types)

MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 80, dtype: object


## 5.4 Descriptive Statistics

In [6]:
set_option('display.width', 100)
set_option('display.precision', 3)
description = data.describe()
print(description)

       MSSubClass  LotFrontage     LotArea  OverallQual  OverallCond  YearBuilt  YearRemodAdd  \
count    1460.000     1201.000    1460.000     1460.000     1460.000   1460.000      1460.000   
mean       56.897       70.050   10516.828        6.099        5.575   1971.268      1984.866   
std        42.301       24.285    9981.265        1.383        1.113     30.203        20.645   
min        20.000       21.000    1300.000        1.000        1.000   1872.000      1950.000   
25%        20.000       59.000    7553.500        5.000        5.000   1954.000      1967.000   
50%        50.000       69.000    9478.500        6.000        5.000   1973.000      1994.000   
75%        70.000       80.000   11601.500        7.000        6.000   2000.000      2004.000   
max       190.000      313.000  215245.000       10.000        9.000   2010.000      2010.000   

       MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  WoodDeckSF  OpenPorchSF  EnclosedPorch  3SsnPorch  \
count    1452.000    1460

## 5.5 Class Distribution (Classification Only)

In [7]:
class_counts = data.groupby('SaleType').size()
print(class_counts)

SaleType
COD        43
CWD         4
Con         2
ConLD       9
ConLI       5
ConLw       5
New       122
Oth         3
WD       1267
dtype: int64
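
Raw counts can hide how lopsided a class distribution is, so it may help to view them as shares of the total. The sketch below is only an illustration: it copies the `SaleType` counts printed above into a plain dictionary instead of reading the CSV, and the names `total` and `proportions` are introduced here for the example.

```python
# Counts copied from the SaleType output above.
class_counts = {
    "COD": 43, "CWD": 4, "Con": 2, "ConLD": 9, "ConLI": 5,
    "ConLw": 5, "New": 122, "Oth": 3, "WD": 1267,
}

total = sum(class_counts.values())

# Share of all rows that fall in each class.
proportions = {label: count / total for label, count in class_counts.items()}

# Print classes from most to least frequent.
for label, share in sorted(proportions.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<6} {share:6.1%}")
```

`WD` accounts for roughly 87% of the rows, so the classes are heavily imbalanced. With the DataFrame loaded earlier, `data['SaleType'].value_counts(normalize=True)` should give the same shares directly.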


## 5.6 Correlations Between Attributes
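Correlation measures how strongly two variables change together. The Pearson coefficient
used here ranges from -1 (perfect negative linear relationship) through 0 (no linear
relationship) to +1 (perfect positive linear relationship).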
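
To make the coefficient concrete, here is a minimal sketch that computes a Pearson correlation by hand. The function `pearson` is a hypothetical helper written for this example, not part of pandas, and the input is only the `LotFrontage` and `LotArea` values for Ids 1 to 7 from the peek in section 5.1. Seven rows will not reproduce the full-dataset value in the matrix below.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations of each value from its own mean.
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    # Numerator: how the two variables vary together.
    covariation = sum(a * b for a, b in zip(dx, dy))
    # Denominator: product of the individual spreads, which scales r into [-1, 1].
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return covariation / spread


# LotFrontage and LotArea for Ids 1-7, copied from the peek in section 5.1.
lot_frontage = [65.0, 80.0, 68.0, 60.0, 84.0, 85.0, 75.0]
lot_area = [8450, 9600, 11250, 9550, 14260, 14115, 10084]

r = pearson(lot_frontage, lot_area)
print(f"Pearson r on seven rows: {r:.3f}")
```

A value near +1 or -1 means the points lie close to a straight line; a value near 0 means there is little linear relationship, although a non-linear one may still exist.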

In [8]:
set_option('display.width', 100)
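set_option('display.precision', 3)
# Restrict to numeric columns; Pearson correlation is undefined for text columns.
correlations = data.select_dtypes(include='number').corr(method='pearson')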
print(correlations)

               MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
MSSubClass          1.000       -0.386   -0.140        0.033       -0.059      0.028   
LotFrontage        -0.386        1.000    0.426        0.252       -0.059      0.123   
LotArea            -0.140        0.426    1.000        0.106       -0.006      0.014   
OverallQual         0.033        0.252    0.106        1.000       -0.092      0.572   
OverallCond        -0.059       -0.059   -0.006       -0.092        1.000     -0.376   
YearBuilt           0.028        0.123    0.014        0.572       -0.376      1.000   
YearRemodAdd        0.041        0.089    0.014        0.551        0.074      0.593   
MasVnrArea          0.023        0.193    0.104        0.412       -0.128      0.316   
BsmtFinSF1         -0.070        0.234    0.214        0.240       -0.046      0.250   
BsmtFinSF2         -0.066        0.050    0.111       -0.059        0.040     -0.049   
BsmtUnfSF          -0.141       

## 5.7 Skew of Univariate Distributions
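Skew describes how far a distribution that is assumed to be Gaussian (normal, or bell-shaped)
is shifted or squashed in one direction or the other.
The result shows a positive (right) or negative (left) skew; values closer to zero indicate
less skew. In the output below, `MiscVal` (24.477) and `PoolArea` (14.828) have long right
tails, while `FullBath` (0.037) is nearly symmetric.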
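
As a rough illustration of what the number means, the sketch below computes the bias-adjusted sample skewness in plain Python. The helper `sample_skewness` is hypothetical and written for this example. It assumes the adjusted formula is the one behind the "unbiased skew" that `DataFrame.skew` reports, which is worth checking against your pandas version. The three small lists are made-up data, not taken from the house-prices file.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness (adjusted Fisher-Pearson coefficient)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    # Sums of squared and cubed deviations from the mean.
    s2 = sum((v - mean) ** 2 for v in values)
    s3 = sum((v - mean) ** 3 for v in values)
    if s2 == 0:
        # Constant data has no spread; this sketch chooses to report zero skew.
        return 0.0
    return n * math.sqrt(n - 1) / (n - 2) * s3 / s2 ** 1.5


symmetric = [1, 2, 3, 4, 5]      # balanced around the mean
right_tail = [1, 1, 1, 1, 10]    # one large value pulls the tail to the right
left_tail = [-10, -1, -1, -1, -1]  # mirror image of right_tail

for name, sample in [("symmetric", symmetric), ("right_tail", right_tail), ("left_tail", left_tail)]:
    print(f"{name:<10} {sample_skewness(sample):.3f}")
```

The symmetric sample gives zero, the right-tailed sample a positive value, and its mirror image the same magnitude with a negative sign.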

In [9]:
# Skew is only defined for numeric columns, so skip the text (object) columns.
skew = data.skew(numeric_only=True)
print(skew)

MSSubClass        1.408
LotFrontage       2.164
LotArea          12.208
OverallQual       0.217
OverallCond       0.693
YearBuilt        -0.613
YearRemodAdd     -0.504
MasVnrArea        2.669
BsmtFinSF1        1.686
BsmtFinSF2        4.255
BsmtUnfSF         0.920
TotalBsmtSF       1.524
1stFlrSF          1.377
2ndFlrSF          0.813
LowQualFinSF      9.011
GrLivArea         1.367
BsmtFullBath      0.596
BsmtHalfBath      4.103
FullBath          0.037
HalfBath          0.676
BedroomAbvGr      0.212
KitchenAbvGr      4.488
TotRmsAbvGrd      0.676
Fireplaces        0.650
GarageYrBlt      -0.649
GarageCars       -0.343
GarageArea        0.180
WoodDeckSF        1.541
OpenPorchSF       2.364
EnclosedPorch     3.090
3SsnPorch        10.304
ScreenPorch       4.122
PoolArea         14.828
MiscVal          24.477
MoSold            0.212
YrSold            0.096
SalePrice         1.883
dtype: float64
