# **Descriptive Statistics**

# **Measuring Central Tendency**

There are three main measures of central tendency, each of which can be calculated with methods from the pandas library for Python.

1. Mean - The average of the data: the sum of the values divided by the number of values.

2. Median - The middle value of the distribution when the values are arranged in ascending or descending order. With an even number of values, it is the average of the two middle values.

3. Mode - The most frequently occurring value in a distribution.
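
As a quick check of the three definitions above, the sketch below computes each measure by hand and compares it with the corresponding pandas method. The `ages` values are made up purely for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration
ages = pd.Series([25, 26, 25, 23, 30])

# Mean: sum of the values divided by the number of values
manual_mean = ages.sum() / len(ages)

# Median: middle value of the sorted data (odd number of values here)
sorted_ages = ages.sort_values().reset_index(drop=True)
manual_median = sorted_ages[len(sorted_ages) // 2]

# Mode: the value with the highest frequency
manual_mode = ages.value_counts().idxmax()

print(manual_mean, ages.mean())
print(manual_median, ages.median())
print(manual_mode, ages.mode()[0])
```

Each printed pair should agree, which confirms that the pandas methods implement the definitions given above.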

**Calculating Mean and Median**

In [None]:
import pandas as pd

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','Chanchal','Gasper','Naviya','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
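# numeric_only=True skips the non-numeric 'Name' column;
# recent pandas versions raise a TypeError without it
print("Mean Values in the Distribution")
print(df.mean(numeric_only=True))
print("*******************************")
print("Median Values in the Distribution")
print(df.median(numeric_only=True))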

Mean Values in the Distribution
Age       31.833333
Rating     3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age       29.50
Rating     3.79
dtype: float64


**Calculating Mode**

A distribution may or may not have a meaningful mode, depending on whether the data is continuous and whether any value occurs more often than the others. We take a simple distribution below to find the mode. Here the Age column has a single value with the maximum frequency.

In [None]:
import pandas as pd

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','Chanchal','Gasper','Naviya','Andres']),
   'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46])}
#Create a DataFrame
df = pd.DataFrame(d)

print(df.mode())

        Name   Age
0     Andres  25.0
1   Chanchal   NaN
2     Gasper   NaN
3       Jack   NaN
4      James   NaN
5        Lee   NaN
6     Naviya   NaN
7      Ricky   NaN
8      Smith   NaN
9      Steve   NaN
10       Tom   NaN
11       Vin   NaN
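
In the output above, every name occurs exactly once, so all twelve names are returned as modes, while Age has a single mode (25) and the rest of its column is padded with NaN. The sketch below shows the same behaviour on two small, made-up Series: `mode()` returns every value tied for the highest frequency.

```python
import pandas as pd

# Hypothetical data for illustration
bimodal = pd.Series([1, 1, 2, 2, 3])   # 1 and 2 both occur twice
no_repeats = pd.Series([3, 1, 2])      # every value occurs once

# mode() returns a Series holding all values tied for the top frequency
print(bimodal.mode().tolist())
print(no_repeats.mode().tolist())
```

Because the result is always a Series, it is worth checking its length before treating the first element as "the" mode.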


# **Measures of Variability**
There are two common measures of variability:

1. Variance - Population variance is defined as the average of the squared differences from the mean, denoted as 𝜎² ("sigma-squared"). Note that the pandas var() and std() methods compute the sample versions by default (ddof=1), which divide by n - 1 instead of n.

2. Standard Deviation - The square root of the variance. It is used more often than the variance because taking the square root returns it to the original unit of measurement.

**Calculating Variance**

In [1]:
# importing pandas as pd 
import pandas as pd 
  
# Creating the Series 
sr = pd.Series([19.5, 16.8, 22.78, 20.124, 18.1002]) 
  
# Print the series 
print(sr) 

0    19.5000
1    16.8000
2    22.7800
3    20.1240
4    18.1002
dtype: float64


In [2]:
# Find the sample variance (pandas defaults to ddof=1, i.e. divides by n - 1)
sr.var(skipna=True)

5.097387128
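
The value above is the sample variance, because pandas uses `ddof=1` by default. The sketch below, which reuses the same Series, contrasts it with the population variance (`ddof=0`) and shows that the standard deviation is the square root of the variance. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

sr = pd.Series([19.5, 16.8, 22.78, 20.124, 18.1002])

sample_var = sr.var()            # default ddof=1: divide by n - 1
population_var = sr.var(ddof=0)  # divide by n (the sigma-squared definition)

# Population variance written out: mean of squared deviations from the mean
manual_population_var = ((sr - sr.mean()) ** 2).mean()

# Standard deviation is the square root of the variance
sample_std = np.sqrt(sample_var)

print(sample_var, population_var, manual_population_var)
print(sample_std, sr.std())
```

For small samples the two conventions differ noticeably, so it helps to state which one is being reported.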

**Calculating Standard Deviation**

In [3]:
import pandas as pd

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','Chanchal','Gasper','Naviya','Andres']),
   'Age':pd.Series([25,26,25,23,30,25,23,34,40,30,25,46])}
#Create a DataFrame
df = pd.DataFrame(d)

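# Find the sample standard deviation of the numeric columns
# (numeric_only=True skips the non-numeric 'Name' column)
df.std(axis=0, skipna=True, numeric_only=True)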

Age    7.265527
dtype: float64

# **Measures of Position**

1.  Interquartile Range (IQR) - A measure of statistical dispersion: the difference between the upper quartile (Q3) and the lower quartile (Q1), that is, IQR = Q3 - Q1.
2.  Z-score - The Z-score of a data value is the number of standard deviations by which the value lies above or below the mean.

**Calculating IQR**

In [4]:
import numpy as np 
  
data = [32, 36, 46, 47, 56, 69, 75, 79, 79, 88, 89, 91, 92, 93, 96, 97,  
        101, 105, 112, 116] 
  
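# The data must already be sorted; with 20 values, each half has 10 values

# First quartile (Q1): median of the lower half
Q1 = np.median(data[:10]) 
  
# Third quartile (Q3): median of the upper half
Q3 = np.median(data[10:]) 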
  
# Interquartile range (IQR) 
IQR = Q3 - Q1 
  
print(IQR) 

34.0
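
The median-of-halves approach above relies on the data being sorted and split evenly. As an alternative sketch, NumPy's `np.percentile` sorts internally and handles any length. Its default method interpolates linearly between data points, so it is expected to give a somewhat different IQR for the same data than the 34.0 obtained above.

```python
import numpy as np

data = [32, 36, 46, 47, 56, 69, 75, 79, 79, 88, 89, 91, 92, 93, 96, 97,
        101, 105, 112, 116]

# np.percentile sorts internally and, by default, interpolates linearly
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(q1, q3, iqr)
```

Neither result is wrong: several quartile conventions exist, and the choice mostly matters for small datasets.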


**Calculating Z Score**

In [5]:
# stats.zscore() method   
import numpy as np 
from scipy import stats 
    
arr1 = [[20, 2, 7, 1, 34], 
        [50, 12, 12, 34, 4]] 
  
arr2 = [[50, 12, 12, 34, 4],  
        [12, 11, 10, 34, 21]] 
  
print ("\narr1 : ", arr1) 
print ("\narr2 : ", arr2) 
  
print ("\nZ-score for arr1 (axis = 0, default) : \n", stats.zscore(arr1)) 
print ("\nZ-score for arr1 (axis = 1) : \n", stats.zscore(arr1, axis = 1)) 


arr1 :  [[20, 2, 7, 1, 34], [50, 12, 12, 34, 4]]

arr2 :  [[50, 12, 12, 34, 4], [12, 11, 10, 34, 21]]

Z-score for arr1 (axis = 0, default) : 
 [[-1. -1. -1. -1.  1.]
 [ 1.  1.  1.  1. -1.]]

Z-score for arr1 (axis = 1) : 
 [[ 0.57251144 -0.85876716 -0.46118977 -0.93828264  1.68572813]
 [ 1.62005758 -0.61045648 -0.61045648  0.68089376 -1.08003838]]
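
To connect the output above with the definition of the Z-score, the sketch below recomputes the row-wise scores with plain NumPy as (value - mean) / standard deviation. It uses the population standard deviation (`ddof=0`), which is the default for both `np.std` and `stats.zscore`.

```python
import numpy as np

arr1 = np.array([[20, 2, 7, 1, 34],
                 [50, 12, 12, 34, 4]])

# Mean and population standard deviation of each row
row_mean = arr1.mean(axis=1, keepdims=True)
row_std = arr1.std(axis=1, keepdims=True)  # ddof=0 by default

# Z-score: distance from the mean in units of standard deviation
z_rows = (arr1 - row_mean) / row_std

print(z_rows)
```

The printed values should match the `axis = 1` output above. With the default `axis = 0`, each column holds only two values, which is why every score there is exactly -1 or 1.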


# **Correlation**

A correlation is a statistic that quantifies the strength of the relationship between two variables. The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two quantitative variables and ranges from -1 to 1. Kendall's tau, also computed below, is a rank-based alternative.
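
As a small illustration of what the coefficient r measures, the sketch below computes Pearson's r from its definition (the sum of products of deviations, divided by the square root of the product of the summed squared deviations) and compares it with `Series.corr`. The `x` and `y` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

# Deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r from its definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.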

In [6]:
# importing pandas as pd 
import pandas as pd 
  
# Making data frame from the csv file 
df = pd.read_csv("nba.csv") 
  
# Displaying the first 10 rows of the data frame 
df.head(10) 

            Name            Team  Number  Position   Age  Height  Weight            College      Salary
0  Avery Bradley  Boston Celtics     0.0        PG  25.0     6-2   180.0              Texas   7730337.0
1    Jae Crowder  Boston Celtics    99.0        SF  25.0     6-6   235.0          Marquette   6796117.0
2   John Holland  Boston Celtics    30.0        SG  27.0     6-5   205.0  Boston University         NaN
3    R.J. Hunter  Boston Celtics    28.0        SG  22.0     6-5   185.0      Georgia State   1148640.0
4  Jonas Jerebko  Boston Celtics     8.0        PF  29.0    6-10   231.0                NaN   5000000.0
5   Amir Johnson  Boston Celtics    90.0        PF  29.0     6-9   240.0                NaN  12000000.0
6  Jordan Mickey  Boston Celtics    55.0        PF  21.0     6-8   235.0                LSU   1170960.0
7   Kelly Olynyk  Boston Celtics    41.0         C  25.0     7-0   238.0            Gonzaga   2165160.0
8   Terry Rozier  Boston Celtics    12.0        PG  22.0     6-2   190.0         Louisville   1824360.0
9   Marcus Smart  Boston Celtics    36.0        PG  22.0     6-4   220.0     Oklahoma State   3431040.0


In [7]:
# Find the correlation among the numeric columns using the Pearson method
# (numeric_only=True skips text columns such as Name and Team)
df.corr(method='pearson', numeric_only=True) 

          Number       Age    Weight    Salary
Number  1.000000  0.028724  0.206921 -0.112386
Age     0.028724  1.000000  0.087183  0.213459
Weight  0.206921  0.087183  1.000000  0.138321
Salary -0.112386  0.213459  0.138321  1.000000


In [8]:
# Find the correlation among the numeric columns using the Kendall method
df.corr(method='kendall', numeric_only=True) 

          Number       Age    Weight    Salary
Number  1.000000  0.005536  0.155850 -0.075301
Age     0.005536  1.000000  0.066130  0.172616
Weight  0.155850  0.066130  1.000000  0.087165
Salary -0.075301  0.172616  0.087165  1.000000
