## Statistics
The study and practice of collecting and analysing data. It includes descriptive statistics and inferential statistics.

### Descriptive Statistics
Descriptive statistics focuses on summarizing and describing the dataset itself using numerical and graphical methods, without drawing conclusions or making predictions about a population.

#### Fundamental Concepts

<div align=center><img src="pictures/datatype.png"  width="70%"></div>

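#### Measures of Central Tendency
1. Mean - sum of the values divided by the number of values
2. Median - middle value of the sorted data, or the average of the two middle values when the size is even
3. Mode - most frequent value, commonly used to describe categorical data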
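
A minimal sketch of the three measures using Python's standard `statistics` module. The `ages` and `classes` lists are made-up illustrative values, not taken from any dataset in these notes.

```python
import statistics

# Illustrative values only (assumed for this example)
ages = [22, 25, 25, 30, 38, 40]
classes = ["1st", "3rd", "3rd", "2nd", "3rd"]

mean_age = statistics.mean(ages)       # sum / size = 180 / 6
median_age = statistics.median(ages)   # even size: average of the two middle values (25, 30)
mode_class = statistics.mode(classes)  # most frequent category

print(mean_age, median_age, mode_class)
```

The mode is the only one of the three computed on the categorical list here, since a mean of labels such as `"1st"` is not defined.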

#### Measures of Spread
1. Range - **max - min**
2. Variance - **mean((data - mean)^2)**, the average squared deviation from the mean
    
    Population variance (divide by the population size $N$):
    $$
    \sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
    $$

    Sample variance (divide by $n - 1$ to correct the bias from estimating the mean with $\bar{x}$):
    $$
    s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}
    $$
3. Standard Deviation - **sqrt(var)**, easier to interpret because it is in the same units as the data rather than squared units
4. Mean Absolute Deviation - **mean(abs(data - mean))**
    $$
    \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|
    $$
    - SD uses squared deviations, which give more weight to larger deviations, so it is more sensitive to outliers.
    - MAD uses absolute deviations, which weight each deviation in proportion to its size, so it is more robust when the data is skewed or non-normal.
    - MAD is more robust, but SD is more commonly used, especially when a normal distribution is assumed.
  
5. Quartiles, Quantiles, Interquartile Range (IQR)
   - Quartiles split the sorted data into 4 equal parts.
   - Quantiles generalize quartiles: they split the data into any number of equal parts, e.g. 5 (quintiles), 10 (deciles) or 100 (percentiles).
   - IQR is the difference between Q3 (75th percentile) and Q1 (25th percentile): IQR = Q3 - Q1.

6. Outliers (1.5 * IQR rule) - a data point is flagged as an outlier if either holds:
    - Data > Q3 + 1.5 * IQR
    - Data < Q1 - 1.5 * IQR
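
The variance, standard deviation and mean absolute deviation formulas above can be sketched with the standard library as follows. The data values and the helper `mean_abs_dev` are assumptions made for illustration. The last part appends one extreme value to show how each measure reacts to it.

```python
import math
import statistics

def mean_abs_dev(values):
    """Mean absolute deviation: mean(abs(data - mean))."""
    m = statistics.mean(values)
    return sum(abs(x - m) for x in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative values, mean = 5

pop_var = statistics.pvariance(data)     # divides by N
sample_var = statistics.variance(data)   # divides by n - 1
pop_sd = math.sqrt(pop_var)              # back in the units of the data
mad = mean_abs_dev(data)

# Add a single extreme value and recompute both measures
data_out = data + [50]
sd_out = statistics.pstdev(data_out)
mad_out = mean_abs_dev(data_out)

print(pop_var, sample_var, pop_sd, mad)
print(sd_out / pop_sd, mad_out / mad)    # relative growth of SD vs MAD
```

On this data the standard deviation grows by a larger factor than the mean absolute deviation once the extreme value is added, which is the sensitivity to outliers described in item 4.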
  
<div align=center><img src="pictures/iqr.png"  width="70%"></div>
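
A small sketch of the 1.5 * IQR rule with `statistics.quantiles`. The data is made up, and `method="inclusive"` is one of several quartile conventions; other tools or methods may give slightly different Q1 and Q3 for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]   # illustrative values

# n=4 returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, outliers)
```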

#### Data Visualization
| Variables | Graphic |
| ----------- | ----------- |
| 1 Qualitative | Barplot |
| 1 Quantitative | Histogram or Boxplot |
| 2 Qualitative | Clustered Barplot |
| 2 Quantitative | Scatter Plot |
| 1 Quantitative + 1 Qualitative | Side-by-side Boxplots |

```python
import pandas as pd
import numpy as np

# Load the Titanic survival dataset and show the first 11 rows
titanic = pd.read_csv('datasets/TitanicSurvival.csv')
print(titanic[:11])
```

```
                           rownames survived     sex      age passengerClass
0     Allen, Miss. Elisabeth Walton      yes  female  29.0000            1st
1    Allison, Master. Hudson Trevor      yes    male   0.9167            1st
2      Allison, Miss. Helen Loraine       no  female   2.0000            1st
3   Allison, Mr. Hudson Joshua Crei       no    male  30.0000            1st
4   Allison, Mrs. Hudson J C (Bessi       no  female  25.0000            1st
5               Anderson, Mr. Harry      yes    male  48.0000            1st
6   Andrews, Miss. Kornelia Theodos      yes  female  63.0000            1st
7            Andrews, Mr. Thomas Jr       no    male  39.0000            1st
8   Appleton, Mrs. Edward Dale (Cha      yes  female  53.0000            1st
9           Artagaveytia, Mr. Ramon       no    male  71.0000            1st
10           Astor, Col. John Jacob       no    male  47.0000            1st
```


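```python
# Mean of the age column
np.mean(titanic['age'])
# Median age, ignoring missing values
# np.nanmedian(titanic['age'])
# Frequency of each age (the most frequent one is the mode)
# titanic['age'].value_counts()
# Frequency of each category in the sex column
# titanic['sex'].value_counts()
```

```
29.881134512434034
```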