#### **Descriptive Statistics using Python**
- Descriptive statistics help us understand the data under consideration.
- It refers to the analysis, summary & presentation of findings related to a data set derived from a sample or an entire population.
- There are two kinds of measures in descriptive statistics i.e. **Measurement of Central Tendency & Measurement of Dispersion**.
- Measurement of Central Tendency refers to estimating the typical or central value of the data under consideration e.g. average house price in a location, average height of students in a class etc.
- There are three measures of central tendency i.e. **Mean, Median & Mode**.
- Measurement of Dispersion refers to estimating the spread of the data under consideration e.g. range of house prices in a location or range of heights of students in a class.
- There are four common measures of dispersion i.e. **Range, Variance, Standard Deviation & Inter Quartile Range (IQR)**.

##### **Measurement of Central Tendency**

##### Mean
- Mean or Average is calculated by **adding the given values in a data set & then dividing that sum by the count of values**.
- eg: Mean of 32, 44, 48, 34, 27 is (32+44+48+34+27)/5 = 37 hence the formula becomes **x-bar = (summation of x)/(count of x)**.
- The problem with the mean is that it is affected by outliers e.g. the mean of 32, 44, 48, 34, 127 is (32+44+48+34+127)/5 = 57, which is larger than four of the five values.
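
The two mean examples above can be reproduced with a few lines of plain Python. This is only a minimal sketch; the helper name `mean` and the variable names are illustrative choices, not part of any library.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


typical = [32, 44, 48, 34, 27]
with_outlier = [32, 44, 48, 34, 127]  # same data with 27 replaced by the outlier 127

mean_typical = mean(typical)
mean_outlier = mean(with_outlier)

print(mean_typical)  # 37.0
print(mean_outlier)  # 57.0
```

A single extreme value moves the mean from 37 to 57, which is why the median is often reported alongside it.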

##### Median
- Median is basically the **middle value in a data set when the data set is arranged in ascending or descending order**.
- Median divides data into two equal parts.
- If the number of values in a data set is odd, then median is the middle most value in the data set.
- If the number of values in a data set is even, then the median is the average of the two middle values in the data set.
- Eg : To find the median of 32, 44, 48, 34, 27, first arrange the values as 27, 32, 34, 44, 48. The number of values is 5 i.e. odd, so the median is the middle most value i.e. 34.
- Eg : For 27, 30, 32, 34, 42, 44 the number of values is 6 i.e. even, thus the median = (32+34)/2 = 33.
- Unlike the Mean, the Median is not affected by extreme values e.g. the median of 27, 32, 34, 44, 168 remains 34 i.e. the middle most value.
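
The odd and even cases described above can be sketched as a small function. The name `median` here is a hypothetical helper written for illustration; in practice `statistics.median` from the standard library does the same job.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([32, 44, 48, 34, 27])       # odd count: middle value
even_median = median([27, 30, 32, 34, 42, 44])  # even count: average of 32 and 34
outlier_median = median([27, 32, 34, 44, 168])  # the extreme value 168 does not move the median

print(odd_median)      # 34
print(even_median)     # 33.0
print(outlier_median)  # 34
```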

##### **Measurement of Dispersion**

##### Range
- Range is the difference between the maximum value & the minimum value in a data set i.e. **Range = Max - Min**.
- Eg : Range of 32, 44, 48, 34, 27 is Range = 48 - 27 = 21.
- Like Mean, Range is also affected by extreme values eg: range of 27, 44, 48, 34, 127 is Range = 127 - 27 = 100. 

##### Variance
- Variance is the **sum of squared differences between the mean value & the individual values in a data set divided by the total count of values** (this is the population variance; the sample variance divides by count - 1 instead).
- Eg : To calculate the Variance of 104, 98, 90, 104, 104.
- First calculate the mean i.e. mean = (104+98+90+104+104)/5 = 100.
- Next calculate how far each individual value is from the mean value i.e. (104-100), (98-100), (90-100), (104-100), (104-100) which comes out to be 4, -2, -10, 4, 4.
- Now square the numbers i.e. (4)^2, (-2)^2, (-10)^2, (4)^2, (4)^2 which comes out to be 16, 4, 100, 16, 16.
- Now take the mean of the squared numbers i.e. Variance = (16+4+100+16+16)/5 = 30.4.
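
The same steps can be followed in Python, one line per step. This is a sketch of the population variance (dividing by the count of values), checked against `statistics.pvariance` from the standard library; the variable names are illustrative.

```python
import statistics

values = [104, 98, 90, 104, 104]

mean = sum(values) / len(values)          # step 1: the mean, 100.0
deviations = [x - mean for x in values]   # step 2: distance of each value from the mean
squared = [d ** 2 for d in deviations]    # step 3: square each distance
variance = sum(squared) / len(values)     # step 4: mean of the squared distances

print(deviations)  # [4.0, -2.0, -10.0, 4.0, 4.0]
print(squared)     # [16.0, 4.0, 100.0, 16.0, 16.0]
print(variance)    # 30.4

# Cross-check with the standard library's population variance.
library_variance = statistics.pvariance(values)
```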

##### Standard Deviation
- It denotes the spread of the data values in a data set around its mean value & is calculated as the **square root of variance**.
- Eg : In the variance example above, Standard Deviation = Sqrt(30.4) = 5.51.
- There are two formulas to calculate the standard deviation i.e. one for a sample & one for the entire population.
- **Sample Standard Deviation = Sqrt((Sum of squared differences of mean & individual values) / (n-1))** & **Population Standard Deviation = Sqrt((Sum of squared differences of mean & individual values) / N)**.
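
To make the difference between the two formulas concrete, the sketch below computes both for the variance example data and compares them with `statistics.pstdev` (population) and `statistics.stdev` (sample). The variable names are illustrative.

```python
import math
import statistics

values = [104, 98, 90, 104, 104]
n = len(values)
mean = sum(values) / n
sum_sq = sum((x - mean) ** 2 for x in values)  # sum of squared differences, 152.0

population_sd = math.sqrt(sum_sq / n)      # divide by N
sample_sd = math.sqrt(sum_sq / (n - 1))    # divide by n-1

print(round(population_sd, 2))  # 5.51
print(round(sample_sd, 2))      # 6.16

# The standard library exposes both variants.
library_population_sd = statistics.pstdev(values)
library_sample_sd = statistics.stdev(values)
```

The sample formula always gives the larger value, because dividing by n-1 compensates for estimating the mean from the same sample.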

##### Quartiles
- Just like Median which divides the data into two parts, we have Quartiles which **divide the data into four parts**.
- By splitting the data into four parts, we get three points of splits which are known as Quartiles.
- The first point of split is known as **First Quartile(Q1)**, the second one **Median/Second quartile(Q2)** & third one is the **Third Quartile(Q3)**.
- Eg : In the given data set (27, 30, 32, 34, 42, 44, 48), Q1 = 30, Q2 = 34, Q3 = 44.

##### InterQuartile Range(IQR)
- Inter-Quartile Range or IQR for a given data set is the difference between the Third Quartile & the First Quartile of that data set i.e. **IQR = (Q3 - Q1)**.
- Eg : In the given data set (27, 30, 32, 34, 42, 44, 48), Q1 = 30, Q3 = 44 thus IQR = 44 - 30 = 14.
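
The quartiles and IQR of this data set can be checked with `statistics.quantiles` (available from Python 3.8). This is a sketch using the function's default `"exclusive"` method; other quartile conventions, such as `method="inclusive"`, can give slightly different cut points for the same data.

```python
import statistics

data = [27, 30, 32, 34, 42, 44, 48]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 30.0 34.0 44.0
print(iqr)         # 14.0
```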

##### Quantiles
- If we divide the data into n parts, then we will get **(n-1) points of splits which are known as Quantiles**.
- Examples of Quantiles are Deciles, which divide data into 10 equal parts thus giving 9 points of split or 9 deciles, & Percentiles, which divide data into 100 equal parts thus giving 99 percentiles.

In [1]:
# now let us first consider a list of numbers.
my_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5]
my_data

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5]

In [2]:
# to calculate the sum of all the numbers in the list, we will use the in-built sum() function in Python.
sum(my_data)

60

In [3]:
# to calculate the count of values in the list, we will use the in-built len() function in Python.
len(my_data)

11

In [4]:
# now we can calculate the mean of the given data.
my_mean = sum(my_data) / len(my_data)
my_mean

5.454545454545454

In [5]:
# to simplify the above calculation we will import the statistics module, which provides the measures of central tendency as well as dispersion.
import statistics

In [6]:
# now in order to find the mean of the above data set.
statistics.mean(my_data)

5.454545454545454

In [7]:
# in order to calculate the median of the above data set.
statistics.median(my_data)

5

In [8]:
# we can also find the mode of the data set.
statistics.mode(my_data)

5
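
When several values share the highest count, `statistics.mode` returns only one of them. The sketch below uses `statistics.multimode` (Python 3.8+) to list every mode; the list `tied` is made-up data chosen purely to produce a tie.

```python
import statistics

tied = [1, 2, 2, 3, 3, 4]  # hypothetical data: 2 and 3 both occur twice

all_modes = statistics.multimode(tied)  # every value with the highest count
first_mode = statistics.mode(tied)      # on Python 3.8+, the first mode encountered

print(all_modes)   # [2, 3]
print(first_mode)  # 2
```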

In [9]:
# in order to find out the range.
max(my_data) - min(my_data)  # since there is no method for range available in the statistics module.

9

In [10]:
# in order to find out the standard deviation for the given data set.
statistics.stdev(my_data)  # default value in statistics module is the sample standard deviation.

2.8762349126466136

In [11]:
# we can also find the variance.
statistics.variance(my_data)

8.272727272727273

In [12]:
# now let us find out the quartiles, this is done by calling the quantiles() function of the statistics module.
quart = statistics.quantiles(my_data)  # by default quantiles() uses n=4 and so returns Q1, the median & Q3.
quart  # this is a list containing Q1, Median & Q3.

[3.0, 5.0, 8.0]

In [13]:
# now in order to find the IQR for the given data set we can leverage list indexing.
quart[2] - quart[0]

5.0

In [14]:
# in order to calculate the deciles for the given data set.
statistics.quantiles(my_data, n=10)  
# similarly we can also calculate the percentiles by passing n=100.

[1.2, 2.4, 3.6, 4.8, 5.0, 6.2, 7.4, 8.6, 9.8]

In [15]:
# in practical scenarios we will be getting a very large data set which will be more complex than the above data set.
# so here we will be leveraging the NumPy module to calculate the descriptive statistics measures.
# let us first import the numpy module.
import numpy as np

# now let us create a 1d array containing 100 random integers ranging from 1 to 99 (the upper bound 100 is exclusive).
np.random.seed(101)
arr = np.random.randint(1, 100, 100)
arr

array([96, 12, 82, 71, 64, 88, 76, 10, 78, 41,  5, 64, 41, 61, 93, 65,  6,
       13, 94, 41, 50, 84,  9, 30, 60, 35, 45, 73, 20, 11, 77, 96, 88,  1,
       74,  9, 63, 37, 84, 29, 64,  8, 11, 53, 57, 39, 74, 53, 19, 72, 16,
       45,  1, 13, 18, 76, 80, 98, 94, 25, 37, 64, 20, 36, 31, 11, 61, 21,
       28,  9, 87, 27, 88, 47, 48, 55, 87, 10, 46,  3, 19, 59, 93, 12, 11,
       95, 36, 29,  4, 84, 85, 48, 15, 70, 61, 70, 52,  7, 89, 72])

In [16]:
# now let us calculate the mean of the array.
np.mean(arr)

48.19

In [17]:
# also let us calculate the median of the array.
np.median(arr)

48.0

In [18]:
# we can also calculate the standard deviation for the array (np.std returns the population standard deviation by default i.e. ddof=0).
np.std(arr)

29.81432373876691

In [19]:
# similarly we can also calculate the variance (np.var also uses the population formula by default i.e. ddof=0).
np.var(arr)

888.8939
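
One point worth checking here is that NumPy and the statistics module use different defaults: `np.std` and `np.var` divide by N, while `statistics.stdev` and `statistics.variance` divide by n-1. The sketch below, run on the small `my_data` list from earlier, shows that passing `ddof=1` to NumPy gives the sample versions.

```python
import statistics

import numpy as np

my_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5]

population_sd = np.std(my_data)       # default ddof=0: divide by N
sample_sd = np.std(my_data, ddof=1)   # ddof=1: divide by n-1
sample_var = np.var(my_data, ddof=1)

# The sample versions agree with the statistics module (about 2.876 and 8.273).
print(sample_sd, statistics.stdev(my_data))
print(sample_var, statistics.variance(my_data))
```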

In [20]:
# NumPy does not provide a function for the mode, so for this we leverage the scipy stats module.
import scipy.stats as stats
stats.mode(arr) 
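# thus in this case 11 is reported as the mode & it is present in 4 places inside our data set.
# note that 64 also occurs 4 times in the array; stats.mode reports only one value when there is a tie.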

ModeResult(mode=array([11]), count=array([4]))
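
If SciPy is not available, a mode can still be obtained with NumPy alone by counting the unique values. This is a sketch of one possible approach using `np.unique(..., return_counts=True)`, shown on the small `my_data` values from earlier; the variable names are illustrative.

```python
import numpy as np

my_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5])

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(my_data, return_counts=True)

mode_value = values[np.argmax(counts)]      # smallest value among those with the highest count
mode_count = counts.max()
all_modes = values[counts == counts.max()]  # every value tied for the highest count

print(mode_value, mode_count)  # 5 2
print(all_modes)               # [5]
```

Keeping `all_modes` makes ties visible, which a single reported mode would hide.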

In [21]:
# now let us convert the above 1D array into a 2D array.
arr = arr.reshape(20, 5)
arr

array([[96, 12, 82, 71, 64],
       [88, 76, 10, 78, 41],
       [ 5, 64, 41, 61, 93],
       [65,  6, 13, 94, 41],
       [50, 84,  9, 30, 60],
       [35, 45, 73, 20, 11],
       [77, 96, 88,  1, 74],
       [ 9, 63, 37, 84, 29],
       [64,  8, 11, 53, 57],
       [39, 74, 53, 19, 72],
       [16, 45,  1, 13, 18],
       [76, 80, 98, 94, 25],
       [37, 64, 20, 36, 31],
       [11, 61, 21, 28,  9],
       [87, 27, 88, 47, 48],
       [55, 87, 10, 46,  3],
       [19, 59, 93, 12, 11],
       [95, 36, 29,  4, 84],
       [85, 48, 15, 70, 61],
       [70, 52,  7, 89, 72]])

In [22]:
# now let's say we want to calculate the row-wise & column-wise descriptive measures.

# Central Tendency
print("Row-wise Mean:",np.mean(arr, axis=1))  # this will calculate mean of values for each row of the array. 
print("Column-wise Mean:",np.mean(arr, axis=0))  # this will calculate mean of values for each column of the array.

print("\nRow-wise Median:",np.median(arr, axis=1))  # this will calculate median of values for each row of the array. 
print("Column-wise Median:",np.median(arr, axis=0))  # this will calculate median of values for each column of the array. 

print("\nRow-wise Mode:",stats.mode(arr, axis=1))  # this will calculate mode of values for each row of the array. 
print("Column-wise Mode:",stats.mode(arr, axis=0))  # this will calculate mode of values for each column of the array.

Row-wise Mean: [65.  58.6 52.8 43.8 46.6 36.8 67.2 44.4 38.6 51.4 18.6 74.6 37.6 26.
 59.4 40.2 38.8 49.6 55.8 58. ]
Column-wise Mean: [53.95 54.35 39.95 47.5  45.2 ]

Row-wise Median: [71. 76. 61. 41. 50. 35. 77. 37. 53. 53. 16. 80. 36. 21. 48. 46. 19. 36.
 61. 70.]
Column-wise Median: [59.5 60.  25.  46.5 44.5]

Row-wise Mode: ModeResult(mode=array([[12],
       [10],
       [ 5],
       [ 6],
       [ 9],
       [11],
       [ 1],
       [ 9],
       [ 8],
       [19],
       [ 1],
       [25],
       [20],
       [ 9],
       [27],
       [ 3],
       [11],
       [ 4],
       [15],
       [ 7]]), count=array([[1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1]]))
Column-wise Mode: ModeResult(mode=array([[ 5, 45, 10, 94, 11]]), count=array([[1, 2, 2, 2, 2]]))


In [23]:
# Dispersion
print("\nRow-wise Stdev:",np.std(arr, axis=1))  # this will calculate standard deviation of values for each row of the array. 
print("Column-wise Stdev:",np.std(arr, axis=0))  # this will calculate standard deviation of values for each column of the array.

print("\nRow-wise Variance:",np.var(arr, axis=1))  # this will calculate variance of values for each row of the array. 
print("Column-wise Variance:",np.var(arr, axis=0))  # this will calculate variance of values for each column of the array.


Row-wise Stdev: [28.62167011 29.03515111 29.09570415 32.72552521 25.60937328 21.58147354
 34.01999412 26.30285156 24.03830277 20.69396047 14.45821566 26.13503396
 14.51344205 18.80425484 24.13793695 30.78571097 32.33821269 34.44764143
 23.69303695 28.06421209]
Column-wise Stdev: [29.98078551 25.59741198 33.36236652 30.39325583 26.5548489 ]

Row-wise Variance: [ 819.2   843.04  846.56 1070.96  655.84  465.76 1157.36  691.84  577.84
  428.24  209.04  683.04  210.64  353.6   582.64  947.76 1045.76 1186.64
  561.36  787.6 ]
Column-wise Variance: [ 898.8475  655.2275 1113.0475  923.75    705.16  ]
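
The remaining dispersion measures, range and IQR, can be computed along an axis in the same way. The sketch below copies the first four rows of the array above so that it runs on its own. Note that `np.percentile` uses linear interpolation by default, which is a different convention from the default of `statistics.quantiles`, so the two can disagree on some data sets.

```python
import numpy as np

# First four rows of the 20x5 array shown above, copied here so the sketch is self-contained.
sample = np.array([
    [96, 12, 82, 71, 64],
    [88, 76, 10, 78, 41],
    [ 5, 64, 41, 61, 93],
    [65,  6, 13, 94, 41],
])

# Range = max - min, per row (axis=1) and per column (axis=0).
row_range = sample.max(axis=1) - sample.min(axis=1)
col_range = sample.max(axis=0) - sample.min(axis=0)

# IQR = Q3 - Q1; np.percentile returns one row of results per requested percentile.
q1, q3 = np.percentile(sample, [25, 75], axis=1)
row_iqr = q3 - q1

print("Row-wise Range:", row_range)     # [84 78 88 88]
print("Column-wise Range:", col_range)  # [91 70 72 33 52]
print("Row-wise IQR:", row_iqr)         # [18. 37. 23. 52.]
```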


In [24]:
# similarly we can calculate descriptive statistics measures in a data frame by leveraging the pandas module.
import pandas as pd
df = pd.read_csv("/Users/rahul_arora/Downloads/cars2.csv")
df.head()

,Car,Model,Volume,Weight,CO2
0,Toyoty,Aygo,1.0,790,99
1,Mitsubishi,Space Star,1.2,1160,95
2,Skoda,Citigo,1.0,929,95
3,Fiat,500,0.9,865,90
4,Mini,Cooper,1.5,1140,105


In [25]:
# let us now calculate the descriptive statistics measures for all the numerical data columns of the data frame.

# Central Tendency
print("Mean:\n", df[["Volume", "Weight", "CO2"]].mean())
print("\nMedian:\n", df[["Volume", "Weight", "CO2"]].median())
print("\nMode:\n", df[["Volume", "Weight", "CO2"]].mode())

Mean:
 Volume       1.611111
Weight    1292.277778
CO2        102.027778
dtype: float64

Median:
 Volume       1.6
Weight    1329.0
CO2         99.0
dtype: float64

Mode:
    Volume  Weight  CO2
0     1.6    1365   99


In [26]:
# Dispersion
print("\nStandard Deviation:\n", df[["Volume", "Weight", "CO2"]].std())
print("\nMin:\n", df[["Volume", "Weight", "CO2"]].min())
print("\nMax:\n", df[["Volume", "Weight", "CO2"]].max())


Standard Deviation:
 Volume      0.388975
Weight    242.123889
CO2         7.454571
dtype: float64

Min:
 Volume      0.9
Weight    790.0
CO2        90.0
dtype: float64

Max:
 Volume       2.5
Weight    1746.0
CO2        120.0
dtype: float64


In [27]:
# we can also return all the descriptive statistics measures for all the numerical data columns in the data frame at once.
df.describe() 

,Volume,Weight,CO2
count,36.0,36.0,36.0
mean,1.611111,1292.277778,102.027778
std,0.388975,242.123889,7.454571
min,0.9,790.0,90.0
25%,1.475,1117.25,97.75
50%,1.6,1329.0,99.0
75%,2.0,1418.25,105.0
max,2.5,1746.0,120.0


In [28]:
# we can get the count of the values of the categorical data columns in the data frame.
df["Car"].value_counts()

Mercedes      5
Ford          5
Skoda         4
Audi          3
BMW           3
Opel          3
Volvo         3
VW            1
Mitsubishi    1
Hyundai       1
Suzuki        1
Honda         1
Hundai        1
Mini          1
Fiat          1
Mazda         1
Toyoty        1
Name: Car, dtype: int64

In [29]:
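# We can also create descriptive statistics measures pertaining to a defined grouping for all the numerical data columns of the data frame.
# numeric_only=True skips the text column "Model"; recent pandas versions raise an error without it.
df.groupby("Car").mean(numeric_only=True)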

Car,Volume,Weight,CO2
Audi,1.866667,1455.0,105.666667
BMW,1.733333,1486.666667,107.0
Fiat,0.9,865.0,90.0
Ford,1.54,1274.2,100.0
Honda,1.6,1252.0,94.0
Hundai,1.6,1326.0,97.0
Hyundai,1.1,980.0,99.0
Mazda,2.2,1280.0,104.0
Mercedes,1.94,1439.0,105.6
Mini,1.5,1140.0,105.0
