# 3. Descriptive statistics

# 3.1. Measures of central tendency

In most situations, the first thing that you’ll want to calculate is a measure of central tendency. The most commonly used measures are the mean, median and mode; occasionally people will also report a trimmed mean.

## 3.1.1. The mean

In [5]:
import numpy as np
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = np.mean(speed)
print(x)

89.76923076923077


## 3.1.2. The median

In [6]:
import numpy as np
speed = [99,86,87,88,86,103,87,94,78,77,85,86]
x = np.median(speed)
print(x)

86.5


## 3.1.3. The mode

In [7]:
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = stats.mode(speed)
print(x)

ModeResult(mode=array([86]), count=array([3]))
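
The exact text of the ModeResult printed above may differ between SciPy versions. As a dependency-free cross-check, here is a minimal sketch using the standard-library statistics module on the same data; multimode() (Python 3.8+) is included in case several values tie for most frequent.

```python
import statistics

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

# Most common single value
mode_value = statistics.mode(speed)

# All values tied for most common (a list, even when there is one winner)
all_modes = statistics.multimode(speed)

# How many times the mode occurs in the data
mode_count = speed.count(mode_value)

print(mode_value, all_modes, mode_count)
```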


## 3.1.4. Trimmed mean
A 10% trimmed mean drops the lowest 10% and the highest 10% of the sorted values and takes the mean of the rest. With the 10 values below, that removes one value from each end (-15 and 12):

In [8]:
dataset = [-15,2,3,4,5,6,7,8,9,12]

import numpy as np
from scipy import stats

dataset2 = np.array(dataset)

stats.trim_mean(dataset2, 0.1)

5.5
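
To make the trimming step explicit, the same result can be sketched by hand with plain Python. The variable names here are illustrative, and the number of values cut from each end is assumed to be the trimmed proportion times the sample size, rounded down.

```python
dataset = [-15, 2, 3, 4, 5, 6, 7, 8, 9, 12]
proportion = 0.1  # fraction to cut from EACH end

ordered = sorted(dataset)
k = int(proportion * len(ordered))      # number of values dropped per end
trimmed = ordered[k:len(ordered) - k]   # values that remain after trimming

trimmed_mean = sum(trimmed) / len(trimmed)
plain_mean = sum(dataset) / len(dataset)

print(trimmed)
print(trimmed_mean, plain_mean)
```

Comparing the two means shows how much the extreme value -15 pulls the ordinary mean down.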

# 3.2. Measures of variability
The second thing that we really want is a measure of the variability of the data. 

## 3.2.1. Range

Use NumPy's ptp() ("peak to peak") function to find the range, that is, the maximum minus the minimum, of the values 13, 21, 21, 40, 48, 55, 72:


In [10]:
import numpy as np

values = [13,21,21,40,48,55,72]

x = np.ptp(values)

print(x)

59
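
The range is simply the largest value minus the smallest, so it can be checked without any library. A minimal sketch on the same values:

```python
values = [13, 21, 21, 40, 48, 55, 72]

# Range = maximum - minimum
value_range = max(values) - min(values)

print(value_range)
```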


## 3.2.2. Interquartile range

In [6]:
# example_1

import numpy as np

# Sample dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Calculate Interquartile Range (IQR) using numpy
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Calculate Quartile Deviation
quartile_deviation = (q3 - q1) / 2

print("Interquartile Range (IQR):", iqr)
print("Quartile Deviation:", quartile_deviation)


Interquartile Range (IQR): 4.5
Quartile Deviation: 2.25


### Interquartile Range And Quartile Deviation of One Array using SciPy

In [7]:
# example_2

import numpy as np
from scipy.stats import iqr

# Sample dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Calculate Interquartile Range (IQR) using scipy
iqr_value = iqr(data)

# Calculate Quartile Deviation
quartile_deviation = iqr_value / 2

print("Interquartile Range (IQR):", iqr_value)
print("Quartile Deviation:", quartile_deviation)


Interquartile Range (IQR): 4.5
Quartile Deviation: 2.25


In [8]:
# example_3

from scipy import stats

values = [13,21,21,40,42,48,55,72]

x = stats.iqr(values)

print(x)

28.75
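
Quartiles, and therefore the IQR, depend on the interpolation convention used. On the 1 to 10 dataset above, the "inclusive" method of the standard library's statistics.quantiles() (Python 3.8+) reproduces the value obtained earlier, while its default "exclusive" method gives a different answer. The sketch below is only meant to illustrate that difference.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# n=4 splits the data into quartiles and returns [Q1, Q2, Q3]
q1_inc, _, q3_inc = statistics.quantiles(data, n=4, method="inclusive")
q1_exc, _, q3_exc = statistics.quantiles(data, n=4, method="exclusive")

iqr_inclusive = q3_inc - q1_inc
iqr_exclusive = q3_exc - q1_exc

print("inclusive:", q1_inc, q3_inc, iqr_inclusive)
print("exclusive:", q1_exc, q3_exc, iqr_exclusive)
```

When comparing IQR values across tools, it is worth checking which convention each one uses.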


## 3.2.3. Mean absolute deviation

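pandas makes the mean absolute deviation (MAD) easy to calculate. MAD is defined as the average absolute distance between each value and the mean. Note that Series.mad() was deprecated in pandas 1.5 and removed in pandas 2.0; on newer versions, compute it directly as (data - data.mean()).abs().mean().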

In [11]:
import pandas as pd

data = pd.Series([56, 31, 56, 8, 32])
data.mad()

15.52
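
The definition translates directly into a few lines of plain Python, which works regardless of the installed pandas version. A minimal sketch on the same five values:

```python
import statistics

data = [56, 31, 56, 8, 32]

center = statistics.mean(data)                  # arithmetic mean of the data
abs_deviations = [abs(x - center) for x in data]  # distance of each value from the mean

mad = sum(abs_deviations) / len(data)           # average of those distances

print(round(mad, 2))
```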

## 3.2.4. Variance

Python's built-in statistics module provides functions for common descriptive statistics. One of them is variance(), which computes the sample variance (dividing by n - 1).

In [14]:
# example_1

# Importing Statistics module
import statistics

# Creating a sample of data
sample = [2.74, 1.23, 2.63, 2.22, 3, 1.98]

statistics.variance(sample)


0.40924

In [16]:
#example_2

# Import statistics Library
import statistics

# Calculate the variance from a sample of data
print(statistics.variance([1, 3, 5, 7, 9, 11]))
print(statistics.variance([2, 2.5, 1.25, 3.1, 1.75, 2.8]))
print(statistics.variance([-11, 5.5, -3.4, 7.1]))
print(statistics.variance([1, 30, 50, 100]))

14
0.4796666666666667
70.80333333333334
1736.9166666666667


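### Use the NumPy var() function to find the variance

Note that np.var() computes the population variance (dividing by n) by default; pass ddof=1 to get the sample variance that statistics.variance() returns.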

In [21]:
#example_3

import numpy as np

speed = [32,111,138,28,59,77,97]

x = np.var(speed)

print(x)

1432.2448979591834
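
The statistics and NumPy examples above use different denominators by default, which is easy to miss. The sketch below computes both versions by hand on the speed data so the difference is visible; the comments name the library calls that are expected to give the same numbers.

```python
speed = [32, 111, 138, 28, 59, 77, 97]

n = len(speed)
mean = sum(speed) / n
sum_sq = sum((x - mean) ** 2 for x in speed)  # sum of squared deviations

# Population variance: divide by n (the default of np.var)
population_variance = sum_sq / n

# Sample variance: divide by n - 1 (statistics.variance, or np.var with ddof=1)
sample_variance = sum_sq / (n - 1)

print(population_variance)
print(sample_variance)
```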


## 3.2.5. Standard deviation

1. The statistics.stdev() function calculates the sample standard deviation (dividing by n - 1).

2. The NumPy std() function calculates the population standard deviation (dividing by n) by default; pass ddof=1 for the sample version.


In [19]:
# example_1

# Import statistics Library
import statistics

# Calculate the standard deviation from a sample of data
print(statistics.stdev([1, 3, 5, 7, 9, 11]))
print(statistics.stdev([2, 2.5, 1.25, 3.1, 1.75, 2.8]))
print(statistics.stdev([-11, 5.5, -3.4, 7.1]))
print(statistics.stdev([1, 30, 50, 100]))

3.7416573867739413
0.6925797186365384
8.414471660973929
41.67633221226008


In [20]:
# example_2

import numpy as np

speed = [86,87,88,86,87,85,86]

x = np.std(speed)

print(x)

0.9035079029052513
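
As a cross-check that needs only the standard library, the sketch below computes both the population and the sample standard deviation for the same speed data and confirms that each is the square root of the matching variance.

```python
import math
import statistics

speed = [86, 87, 88, 86, 87, 85, 86]

population_sd = statistics.pstdev(speed)  # divides by n
sample_sd = statistics.stdev(speed)       # divides by n - 1

# Each standard deviation is the square root of the corresponding variance
sd_from_pvariance = math.sqrt(statistics.pvariance(speed))
sd_from_variance = math.sqrt(statistics.variance(speed))

print(population_sd, sample_sd)
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.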


# 3.3. Skew and kurtosis

In [22]:
# Importing library 
from scipy.stats import skew 
  
# Creating a dataset 
dataset = [88, 85, 82, 97, 67, 77, 74, 86,  
           81, 95, 77, 88, 85, 76, 81] 
  
# Calculate the skewness 
print(skew(dataset, axis=0, bias=True))

0.029331688766181797


In [24]:
# Importing library 
from scipy.stats import kurtosis 
  
# Creating a dataset 
dataset = [88, 85, 82, 97, 67, 77, 74, 86, 
           81, 95, 77, 88, 85, 76, 81] 
  
# Calculate the kurtosis 
print(kurtosis(dataset, axis=0, bias=True))

-0.29271198374234686
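
To see what these two numbers measure, here is a sketch that computes them from central moments in plain Python. It assumes the biased (population) definitions, which is what bias=True requests above, and the Fisher convention for kurtosis, under which a normal distribution scores 0. The helper name central_moment is made up for this example.

```python
dataset = [88, 85, 82, 97, 67, 77, 74, 86,
           81, 95, 77, 88, 85, 76, 81]

n = len(dataset)
mean = sum(dataset) / n

def central_moment(k):
    """k-th central moment: the average of (x - mean) ** k."""
    return sum((x - mean) ** k for x in dataset) / n

m2 = central_moment(2)  # population variance
m3 = central_moment(3)
m4 = central_moment(4)

skewness = m3 / m2 ** 1.5           # asymmetry of the distribution
excess_kurtosis = m4 / m2 ** 2 - 3  # tail weight relative to a normal distribution

print(skewness)
print(excess_kurtosis)
```

A skewness close to 0 indicates a roughly symmetric sample, and a negative excess kurtosis indicates lighter tails than a normal distribution.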


# 3.4. Getting an overall summary of a variable


## 3.4.1. “Describing” a data frame

The pandas describe() method reports basic summary statistics, such as the count, mean, standard deviation and percentiles, for a data frame or a series. By default it summarizes only the numeric columns. To describe every column of the data frame, add the argument include='all'.

In [8]:
import pandas as pd

# read the CSV file into a data frame
data = pd.read_csv('nba.csv')

data.describe(include = 'all')

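|        | Name        | Team                 | Number    | Position | Age       | Height | Weight     | College  | Salary     |
|--------|-------------|----------------------|-----------|----------|-----------|--------|------------|----------|------------|
| count  | 457         | 457                  | 457.0     | 457      | 457.0     | 457    | 457.0      | 373      | 446.0      |
| unique | 457         | 30                   | NaN       | 5        | NaN       | 18     | NaN        | 118      | NaN        |
| top    | Robin Lopez | New Orleans Pelicans | NaN       | SG       | NaN       | 6-9    | NaN        | Kentucky | NaN        |
| freq   | 1           | 19                   | NaN       | 102      | NaN       | 59     | NaN        | 22       | NaN        |
| mean   | NaN         | NaN                  | 17.678337 | NaN      | 26.938731 | NaN    | 221.522976 | NaN      | 4842684.0  |
| std    | NaN         | NaN                  | 15.96609  | NaN      | 4.404016  | NaN    | 26.368343  | NaN      | 5229238.0  |
| min    | NaN         | NaN                  | 0.0       | NaN      | 19.0      | NaN    | 161.0      | NaN      | 30888.0    |
| 25%    | NaN         | NaN                  | 5.0       | NaN      | 24.0      | NaN    | 200.0      | NaN      | 1044792.0  |
| 50%    | NaN         | NaN                  | 13.0      | NaN      | 26.0      | NaN    | 220.0      | NaN      | 2839073.0  |
| 75%    | NaN         | NaN                  | 25.0      | NaN      | 30.0      | NaN    | 240.0      | NaN      | 6500000.0  |
| max    | NaN         | NaN                  | 99.0      | NaN      | 40.0      | NaN    | 307.0      | NaN      | 25000000.0 |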


In [7]:
import pandas as pd

# read the CSV file into a data frame
data = pd.read_csv('nba.csv')

data.describe()

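|       | Number    | Age       | Weight     | Salary     |
|-------|-----------|-----------|------------|------------|
| count | 457.0     | 457.0     | 457.0      | 446.0      |
| mean  | 17.678337 | 26.938731 | 221.522976 | 4842684.0  |
| std   | 15.96609  | 4.404016  | 26.368343  | 5229238.0  |
| min   | 0.0       | 19.0      | 161.0      | 30888.0    |
| 25%   | 5.0       | 24.0      | 200.0      | 1044792.0  |
| 50%   | 13.0      | 26.0      | 220.0      | 2839073.0  |
| 75%   | 25.0      | 30.0      | 240.0      | 6500000.0  |
| max   | 99.0      | 40.0      | 307.0      | 25000000.0 |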
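
The examples above need the nba.csv file. For readers who do not have it, here is a self-contained sketch of the same two calls on a tiny data frame. The column names and values are invented purely for illustration and are not taken from nba.csv.

```python
import pandas as pd

# Hypothetical toy data, made up for this example
df = pd.DataFrame({
    "team": ["A", "B", "A", "C"],
    "age": [25, 30, 22, 27],
    "weight": [200.0, 220.0, 180.0, 240.0],
})

# Default: only the numeric columns (age, weight) are summarized
numeric_summary = df.describe()

# include="all": text columns are added, with count/unique/top/freq rows
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of a text column) are shown as NaN.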


# 3.5. Standard scores

A standard score (z-score) expresses how many standard deviations a value lies from the mean: z = (x - mean) / std_dev. We can compute the z-scores of all data points in a dataset with NumPy, using np.mean() and np.std() for the mean and standard deviation.

In [10]:
import numpy as np
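# Use a NumPy array so the arithmetic below is applied element-wise
dataset = np.array([3, 9, 23, 43, 53, 4, 5, 30, 35, 50, 70, 150, 6, 7, 8, 9, 10])
mean = np.mean(dataset)
std_dev = np.std(dataset)  # population standard deviation (ddof=0)
z_scores = (dataset - mean) / std_dev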

print(z_scores)

[-0.7574907  -0.59097335 -0.20243286  0.35262498  0.6301539  -0.72973781
 -0.70198492 -0.00816262  0.13060185  0.54689523  1.10195307  3.32218443
 -0.67423202 -0.64647913 -0.61872624 -0.59097335 -0.56322046]
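
A useful sanity check on any set of z-scores is that they have mean 0 and standard deviation 1. The sketch below verifies this with the standard library only, assuming the population standard deviation (the same convention as the NumPy default used above).

```python
import statistics

dataset = [3, 9, 23, 43, 53, 4, 5, 30, 35, 50, 70, 150, 6, 7, 8, 9, 10]

mean = statistics.mean(dataset)
std_dev = statistics.pstdev(dataset)  # population standard deviation

# Standardize every value
z_scores = [(x - mean) / std_dev for x in dataset]

z_mean = statistics.mean(z_scores)   # expected to be 0 up to rounding error
z_sd = statistics.pstdev(z_scores)   # expected to be 1 up to rounding error

print(max(z_scores))
```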


### Calculate the z-score and report values with z > 3 as outliers

In [14]:
import numpy as np

dataset = [3, 9, 23, 43, 53, 4, 5, 30, 35, 50, 70, 150, 6, 7, 8, 9, 10]
mean = np.mean(dataset)
std_dev = np.std(dataset)

threshold = 3
outlier = []
for i in dataset:
    # z-score of this value
    z = (i - mean) / std_dev
    # Only unusually high values are flagged by this one-sided test
    if z > threshold:
        outlier.append(i)
print('outlier in dataset is', outlier)


outlier in dataset is [150]
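
The loop above only flags unusually large values. A two-sided variant compares the absolute z-score with the threshold, so unusually small values are caught as well. The sketch below wraps this in a helper; the function name find_outliers and its default threshold are assumptions made for this example, not part of any library.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)  # population standard deviation
    if std_dev == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [x for x in values if abs((x - mean) / std_dev) > threshold]

dataset = [3, 9, 23, 43, 53, 4, 5, 30, 35, 50, 70, 150, 6, 7, 8, 9, 10]

outliers = find_outliers(dataset)
print('outliers in dataset:', outliers)
```

Lowering the threshold flags more points, so the cutoff of 3 is a convention rather than a fixed rule.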
