# Descriptive Statistics

### Measures of Central Tendency

As usual, we use the iris dataset for demonstration.

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [42]:
df = pd.read_csv('iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [43]:
df.rename(columns={'SepalLengthCm': 'SL', 'SepalWidthCm': 'SW', 'PetalLengthCm': 'PL', 'PetalWidthCm': 'PW'}, inplace=True)

In [44]:
df.head()

Unnamed: 0,Id,SL,SW,PL,PW,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [45]:
df['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

In [46]:
# for convenience, split the data into one DataFrame per species
iris_setosa = df[df['Species'] == 'Iris-setosa']
iris_versicolor = df[df['Species'] == 'Iris-versicolor']
iris_virginica = df[df['Species'] == 'Iris-virginica']

### 1. Mean

In [47]:
# calculating mean is simple enough
# Sepal Length
print(np.mean(iris_setosa['SL']))
print(np.mean(iris_versicolor['SL']))
print(np.mean(iris_virginica['SL']))

5.006
5.936
6.587999999999998


In [48]:
# Sepal Width
print(np.mean(iris_setosa['SW']))
print(np.mean(iris_versicolor['SW']))
print(np.mean(iris_virginica['SW']))

3.418
2.7700000000000005
2.974


In [49]:
# Petal Length
print(np.mean(iris_setosa['PL']))
print(np.mean(iris_versicolor['PL']))
print(np.mean(iris_virginica['PL']))

1.464
4.26
5.5520000000000005


In [50]:
# Petal Width
print(np.mean(iris_setosa['PW']))
print(np.mean(iris_versicolor['PW']))
print(np.mean(iris_virginica['PW']))

0.244
1.3259999999999998
2.0260000000000002
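
The per-species means above can also be computed in one step with a pandas `groupby`, which avoids repeating the same call for each subset. This is a minimal sketch on a tiny made-up table: the name `toy` and its values are invented for illustration and are not taken from the iris data.

```python
import pandas as pd

# Hypothetical stand-in for the iris DataFrame (values invented for illustration)
toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
    "SL": [5.0, 5.2, 6.0, 5.8],
    "PW": [0.2, 0.4, 1.3, 1.5],
})

# One row per species, one column per measurement, each cell a mean
species_means = toy.groupby("Species")[["SL", "PW"]].mean()
print(species_means)
```

On the real data, the same pattern would presumably be `df.groupby('Species')[['SL', 'SW', 'PL', 'PW']].mean()`.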


### The Problem with Outliers

The mean is heavily influenced by outliers. Even a single extreme value can change it drastically.

In [52]:
# Note: the mean petal width (PW) of iris_setosa is 0.244
# add a single outlier (50) to the 50 data points
np.mean(np.append(iris_setosa['PW'], 50))
# the mean jumps to about 1.22

1.219607843137255
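
The size of that jump follows directly from the definition of the mean: adding one value `x` to `n` points with mean `m` gives a new mean of `(n * m + x) / (n + 1)`. A short sketch of the arithmetic, using the `PW` figures from the cell above (the variable names are only illustrative):

```python
# Figures from the cell above: 50 setosa PW values with mean 0.244
n = 50
old_mean = 0.244
outlier = 50

# The old total is n * old_mean; the outlier adds to the total and to the count
new_mean = (n * old_mean + outlier) / (n + 1)
print(new_mean)  # about 1.22, matching the np.mean result above
```

The single outlier contributes 50 of the roughly 62.2 total, which is why it dominates the result.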

### 2. Median

The advantage of the median is that it is robust to outliers. A few extreme values can shift it only to a neighbouring value in the sorted data, and it cannot be dragged arbitrarily far unless at least half of the values are outliers.

In [54]:
# calculate median on petal length
print(np.median(iris_setosa['PL']))
print(np.median(iris_versicolor['PL']))
print(np.median(iris_virginica['PL']))

1.5
4.35
5.55


In [57]:
# add a single outlier to the mix
np.median(np.append(iris_setosa['PL'], 50))
# note that the median didn't change

1.5
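
To see how far this robustness goes, the sketch below adds progressively more outliers to a small sample. The sample values are invented for illustration and are not taken from the iris data. With a minority of outliers the median stays among the original values (it may move to a neighbouring one); once outliers form the majority, the median becomes an outlier itself.

```python
import numpy as np

# Hypothetical sample (values invented for illustration)
values = np.array([1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7])

base = np.median(values)                         # 7 points, no outliers
one_outlier = np.median(np.append(values, 50))   # 1 outlier out of 8 points
# 5 outliers out of 12 points: still a minority
few_outliers = np.median(np.append(values, [50] * 5))
# 8 outliers out of 15 points: now the majority
many_outliers = np.median(np.append(values, [50] * 8))

print(base, one_outlier, few_outliers, many_outliers)
```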

### 3. Mode

NumPy has no dedicated mode function, but the `scipy.stats` module provides one.

In [63]:
from scipy import stats

print(stats.mode(iris_setosa['PL'], keepdims=True))
print(stats.mode(iris_versicolor['PL'], keepdims=True))
print(stats.mode(iris_virginica['PL'], keepdims=True))

ModeResult(mode=array([1.5]), count=array([14]))
ModeResult(mode=array([4.5]), count=array([7]))
ModeResult(mode=array([5.1]), count=array([7]))
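
NumPy does not ship a dedicated mode function, but if SciPy is not available a mode can be assembled from `np.unique` with `return_counts=True`. The following is a minimal sketch on made-up values, not the iris data. Like `stats.mode`, taking `np.argmax` of the counts returns the smallest value when several are tied; the `all_modes` line keeps every tied value instead.

```python
import numpy as np

# Hypothetical sample (values invented for illustration)
values = np.array([1.4, 1.5, 1.5, 1.3, 1.5, 1.4, 1.6])

# Sorted unique values and how often each one occurs
uniq, counts = np.unique(values, return_counts=True)

mode = uniq[np.argmax(counts)]            # first (smallest) most-frequent value
all_modes = uniq[counts == counts.max()]  # every value tied for most frequent

print(mode, counts.max(), all_modes)
```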


### Which one should you use?

Where possible, look at all three: together they give a fuller picture of where the data is centred, and a large gap between them hints at skew or outliers.
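
As an illustration of why the three measures are worth comparing, the sketch below computes them for a small right-skewed sample using Python's standard-library `statistics` module. The sample is invented for illustration; on the iris columns the gaps are much smaller.

```python
import statistics

# Hypothetical right-skewed sample (values invented for illustration)
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mean = statistics.mean(sample)        # pulled toward the long right tail
median = statistics.median(sample)    # middle of the sorted values
modes = statistics.multimode(sample)  # list of all most-frequent values

print(mean, median, modes)
```

Here the mean exceeds the median, which in turn exceeds the mode, a pattern that often accompanies a right-skewed distribution.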