## Introduction to EDA

There are three types of univariate statistics that we are interested in for each variable:

1. General information: data type, count of total values, number of unique values
2. Range and middle: min, max, mean, median, mode, quartiles
3. Normality and spread: standard deviation, skewness, kurtosis

## Normality

The primary purpose of univariate analysis is to measure the degree of "normality" in each feature's distribution. By "normality" we mean the shape and spread of the variable: how closely its values follow a normal distribution.

A variable has a normal distribution when most cases have values close to the mean $\mu$, and the number of cases falls off symmetrically as values move further below or above the mean.

![normal.png](attachment:normal.png)
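As a quick illustration of that shape, the sketch below draws samples from a normal distribution and measures how many fall within one, two and three standard deviations of the mean. The mean, standard deviation, sample size and the helper `share_within` are assumptions chosen only for this example; for a normal variable the three shares should come out near 0.68, 0.95 and 0.997.

```python
import numpy as np

# Assumed parameters, chosen only for illustration.
rng = np.random.default_rng(seed=0)
mu, sigma = 50.0, 10.0
samples = rng.normal(loc=mu, scale=sigma, size=100_000)


def share_within(values, center, spread, k):
    """Return the fraction of values within k * spread of center."""
    return float(np.mean(np.abs(values - center) <= k * spread))


within_1 = share_within(samples, mu, sigma, 1)
within_2 = share_within(samples, mu, sigma, 2)
within_3 = share_within(samples, mu, sigma, 3)

print(f"within 1 sd: {within_1:.3f}")
print(f"within 2 sd: {within_2:.3f}")
print(f"within 3 sd: {within_3:.3f}")
```

A real feature whose shares differ a lot from these values is a candidate for a closer look at its skewness and kurtosis.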

## Univariate stats

In [1]:
import pandas as pd
df = pd.read_csv('http://www.ishelp.info/data/insurance.csv')
df.head()

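| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |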


In [2]:
df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |
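By default `describe()` only summarizes numeric columns, and it does not report the mode. The sketch below shows one way to fill both gaps. It uses a small made-up frame named `toy` that only mimics a few columns of the insurance data, so the numbers are illustrative and not taken from the real file.

```python
import pandas as pd

# Hypothetical toy data that mimics a few insurance.csv columns.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 18],
    "smoker": ["yes", "no", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast",
               "northwest", "northwest", "southeast"],
})

# Numeric column: describe() gives the median as the 50% row but no mode.
age_median = toy["age"].median()
age_mode = toy["age"].mode().tolist()  # a list, because ties are possible

# Non-numeric columns: count, unique, top (most frequent value) and freq.
categorical_summary = toy.describe(exclude="number")

print(f"median age: {age_median}")
print(f"mode(s) of age: {age_mode}")
print(categorical_summary)
```

For text columns, `top` and `freq` play the role of the mode and its count.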


In [3]:
df.shape

(1338, 7)

### Number of different values

In [5]:
# Print the number of distinct values in every column
for col in df.columns:
    print(f'{col}: {df[col].nunique()}')

age: 47
sex: 2
bmi: 548
children: 6
smoker: 2
region: 4
charges: 1337


### Type of value

In [7]:
# Print the data type of every column
for col in df.columns:
    print(f'{col}: {df[col].dtype}')

age: int64
sex: object
bmi: float64
children: int64
smoker: object
region: object
charges: float64


### Number of null values

In [9]:
# Print the number of missing values in every column
for col in df.columns:
    print(f'{col}: {df[col].isnull().sum()}')

age: 0
sex: 0
bmi: 0
children: 0
smoker: 0
region: 0
charges: 0
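The three "general information" checks (data type, unique values, nulls) can also be collected into a single table. This is a minimal sketch on a made-up frame named `toy`, with one missing `bmi` value added on purpose so the null count is not all zeros; the name `general_info` is likewise just for this example.

```python
import pandas as pd

# Hypothetical toy data; the None in bmi is deliberate.
toy = pd.DataFrame({
    "age": [19, 18, 28, 18],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, None, 22.705],
})

# One row per column of toy, one statistic per column of the result.
general_info = pd.DataFrame({
    "dtype": toy.dtypes,
    "count": toy.count(),        # non-null values
    "unique": toy.nunique(),     # distinct values, NaN excluded
    "nulls": toy.isnull().sum(),
})

print(general_info)
```

Applied to `df`, the same construction would give one row per feature instead of seven separate `print` calls per statistic.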


### Standard Deviation

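The standard deviation measures the dispersion of the data: how far values typically lie from the mean, expressed in the same units as the variable.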

![normal.png](attachment:normal.png)

In [10]:
df.charges.std()

12110.011236693994
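To make the calculation concrete, the sketch below computes the standard deviation by hand on a short made-up series and compares it with pandas. Note that `Series.std()` uses the sample formula by default (it divides by $n - 1$, i.e. `ddof=1`); the population version, which divides by $n$, needs `ddof=0`. The values are illustrative only.

```python
import math

import pandas as pd

# Illustrative values, not taken from the insurance data.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(values)
mean = values.sum() / n
squared_dev = sum((x - mean) ** 2 for x in values)

sample_std = math.sqrt(squared_dev / (n - 1))   # divides by n - 1
population_std = math.sqrt(squared_dev / n)     # divides by n

print(f"by hand, sample:     {sample_std:.4f}")
print(f"pandas default:      {values.std():.4f}")
print(f"by hand, population: {population_std:.4f}")
print(f"pandas with ddof=0:  {values.std(ddof=0):.4f}")
```

With 1338 rows the two versions are nearly identical, but the difference matters for small samples.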

### Departure from normality: skewness and kurtosis

![normal.png](attachment:normal.png)

In [11]:
from scipy.stats import kurtosis, skew  # not used below; the next cells call the pandas methods instead

In [14]:
df.charges.skew()

1.5158796580240388

In [16]:
df.charges.kurt()

1.6062986532967907
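The pandas methods and the `scipy.stats` functions do not return the same numbers by default: pandas applies a small-sample bias correction, while `scipy.stats.skew` and `scipy.stats.kurtosis` use `bias=True` unless told otherwise. Both report excess kurtosis, so a normal distribution scores about 0. The sketch below checks this on a short made-up, right-skewed series; the values are illustrative only.

```python
import pandas as pd
from scipy.stats import kurtosis, skew

# Illustrative right-skewed values, not taken from the insurance data.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0, 25.0])

pandas_skew = values.skew()
pandas_kurt = values.kurt()

# SciPy defaults to the biased estimators.
scipy_skew_biased = skew(values)

# With bias=False SciPy applies the same correction as pandas.
scipy_skew = skew(values, bias=False)
scipy_kurt = kurtosis(values, fisher=True, bias=False)

print(f"pandas skew: {pandas_skew:.4f}, scipy (bias=False): {scipy_skew:.4f}")
print(f"pandas kurt: {pandas_kurt:.4f}, scipy (bias=False): {scipy_kurt:.4f}")
print(f"scipy skew with default bias=True: {scipy_skew_biased:.4f}")
```

A positive skew, as in `charges`, indicates a longer tail on the right side of the distribution.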