# Explaining Basic Statistics Concepts Using the Titanic Dataset

In [8]:
# Imports

import numpy as np
import pandas as pd

# load the dataset

df = pd.read_csv('./titanic/train.csv')

## Introduction

Knowledge of statistics is essential for data scientists. Typical applications of statistics in data science include:
* Obtain a general understanding and description of the data: data types, measures of central tendency and dispersion.
* Find possible dependencies between different features in the dataset.
* Build machine learning models that rely on statistical concepts, such as Naive Bayes or the Expectation-Maximization algorithm.
* Measure the statistical significance of experimental results.

In this kernel I would like to introduce some statistical concepts and give examples using the Titanic dataset. This is what the data looks like:

In [9]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


## Describe the Data

Statistical concepts are widely used to describe the data: which types of variables the dataset contains and how the values of those variables are distributed.

### 1. Data Types
There are 4 main types of data:
* __Nominal__: the values fall into predetermined categories and cannot be ordered. In the Titanic dataset the nominal columns are Sex, Embarked and Survived.
* __Ordinal__: the values can be ordered, but the distances between them are not measured on a scale. The Pclass variable is an example of an ordinal column.
* __Interval__: the values can be ordered and lie on a scale with equal steps, but the scale has no true zero point (a column of temperature measurements in degrees Celsius would be an example of the interval data type).
* __Ratio__: the values can be ordered, lie on a scale with equal steps, and have a true zero point. The following columns are ratio: Age, SibSp, Parch, Fare.
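The distinction between nominal and ordinal columns can be made explicit in pandas. The snippet below is only a sketch: it uses a tiny hypothetical frame (`sample`, with made-up rows rather than the real `train.csv`) and assumes that passenger class should be ordered as 3 < 2 < 1, which is an interpretation and not something the dataset states.

```python
import pandas as pd

# Hypothetical mini-sample mirroring a few Titanic columns (illustrative values only).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
})

# Nominal: categories without any order.
sample["Sex"] = pd.Categorical(sample["Sex"])

# Ordinal: categories with an order but no meaningful distance between them.
# Assumption of this sketch: 1st class ranks highest, so the order is 3 < 2 < 1.
sample["Pclass"] = pd.Categorical(sample["Pclass"], categories=[3, 2, 1], ordered=True)

print(sample["Sex"].cat.ordered)     # False
print(sample["Pclass"].cat.ordered)  # True
print(sample["Pclass"].min())        # 3, the lowest class under this ordering

# Ratio: plain numeric column, so arithmetic such as the mean is meaningful.
print(sample["Age"].mean())
```

Ordered categoricals support comparisons and `min`/`max`, while pandas refuses to compute an arithmetic mean for them, which matches the idea that ordinal data has an order but no scale.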

### 2. Central Tendency of the Data
For interval and ratio data types we can describe the central tendency of the values. Central tendency can be described using the following concepts:
* __Mean__: the arithmetic average of the values.
* __Median__: the middle value of the sorted values.
* __Mode__: the most frequently occurring value. A variable can have several modes.
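To see how the three measures can differ, here is a minimal sketch on a small hypothetical list of ages (not taken from the dataset), using Python's standard `statistics` module. The single large value pulls the mean upwards, while the median stays in the middle of the data, and the list has two modes.

```python
import statistics

# Hypothetical ages, chosen so that one value (80) is far from the rest.
ages = [22, 24, 24, 30, 30, 41, 80]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # middle value of the sorted list
modes = statistics.multimode(ages)    # every value that shares the highest count

print(mean_age)
print(median_age)  # 30
print(modes)       # [24, 30]
```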

Let's calculate mean, median and mode for the __Age__ values in the dataset:

Find out the mean value:

In [28]:
# calculate mean value manually
df_1 = df.dropna(axis = 0, subset = ['Age']) #remove rows with empty values for Age first
mean_1 = sum(df_1['Age'].values)/len(df_1)
print("The mean value for Age is {mean}".format(mean = mean_1))

The mean value for Age is 29.69911764705882


In [25]:
# calculate mean value using pandas (Series.mean() skips missing values by default)
mean_2 = df['Age'].mean()

print("The mean value for Age is {mean}".format(mean = mean_2))

The mean value for Age is 29.69911764705882
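Both cells give the same result even though only the manual version drops the empty rows explicitly. That is because `Series.mean()` ignores missing values by default (`skipna=True`). A small sketch with a hypothetical series (the values are made up for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([22.0, np.nan, 26.0, 36.0])

print(ages.mean())              # 28.0, NaN is skipped by default
print(ages.dropna().mean())     # 28.0, same result after dropping NaN explicitly
print(ages.mean(skipna=False))  # nan, the missing value propagates
```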


Find the median value ([this Stack Overflow topic](https://stackoverflow.com/questions/24101524/finding-median-of-list-in-python) shows how to find the median manually):

In [29]:
# calculate median manually

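def median(lst):
    ordered = sorted(lst)  # sort once and reuse the result
    quotient, remainder = divmod(len(ordered), 2)
    if remainder:
        return ordered[quotient]  # odd count: the single middle value
    return sum(ordered[quotient - 1:quotient + 1]) / 2.  # even count: mean of the two middle values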

median_1 = median(df_1['Age'].values)

print("The median value for Age is {median}".format(median = median_1))

The median value for Age is 28.0


In [30]:
# calculate median using pandas

median_2 = df['Age'].median()

print("The median value for Age is {median}".format(median = median_2))

The median value for Age is 28.0


Find the mode value ([this Stack Overflow topic](https://stackoverflow.com/questions/10797819/finding-the-mode-of-a-list) shows how to find the mode of a list manually):

In [39]:
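# find the mode manually

def mode(arr):
    # highest number of occurrences of any value (quadratic, but fine for a small list)
    m = max([arr.count(a) for a in arr])
    # return the first value in the list that reaches that count; None if every value is unique
    return [x for x in arr if arr.count(x) == m][0] if m > 1 else None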

mode_1 = mode(df_1['Age'].values.tolist())

print("The mode value for Age is {mode}".format(mode = mode_1))

The mode value for Age is 24.0


In [35]:
# find out mode using pandas

mode_2 = df['Age'].mode()

print("The mode value for Age is {mode} (Taking only 1st discovered mode value.)".format(mode = mode_2[0]))

The mode value for Age is 24.0 (Taking only 1st discovered mode value.)
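Both versions above report a single mode, although a variable can have several. As a hedged sketch of how all modes could be collected in one pass, the hypothetical helper `all_modes` below (not part of the original kernel) counts the values with `collections.Counter` and returns every value that shares the highest count:

```python
from collections import Counter

def all_modes(values):
    """Return every value that occurs most often, in ascending order."""
    counts = Counter(values)  # value -> number of occurrences
    if not counts:
        return []             # an empty input has no mode
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical ages in which 22.0 and 24.0 both occur twice.
print(all_modes([24.0, 22.0, 24.0, 22.0, 30.0]))  # [22.0, 24.0]
```

Unlike the manual `mode` function, this helper treats a list in which every value is unique as having all of its values as modes, which is also how `Series.mode()` behaves.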


### 3. Dispersion of the Data