# Descriptive Statistical Measures 

Descriptive statistics is concerned with describing and summarizing data. It relies on two main approaches: the quantitative approach and the visual approach.

Types of quantitative measures:
1. Central Tendency: Locates the center of the data. Useful measures include the Mean, Median, and Mode.
2. Variability: Describes the spread of the data. Useful measures include the Variance and Standard Deviation.
3. Correlation: Describes the association between a pair of variables in the dataset. Useful measures include the Covariance and the Correlation Coefficient.

# Python Libraries for Statistics 

1. Statistics: Python's built-in library for descriptive statistics. It is a good fit when the dataset is not too large.
2. NumPy: Third-party library for numerical computations on single- and multi-dimensional arrays.
3. SciPy: Third-party library for scientific computations, built on NumPy. Its scipy.stats module covers statistical analysis.
4. Pandas: Third-party library for data analysis, built on NumPy. It handles labelled one-dimensional Series objects and two-dimensional DataFrame objects.
5. Matplotlib: Third-party library for data visualization. It works well in combination with NumPy, SciPy, and Pandas.

In [5]:
# importing all libraries 
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd
import csv

# Calculating Measures of Central Tendency

Mean: The sample mean, also called the sample arithmetic mean or simply the average, is the sum of the data points divided by the number of data points.
![image.png](attachment:image.png)

In [68]:
# Let us consider some arbitrary data scores
scores = [19,20,47,34,45,43,41,28,39,47]
# Calculating the mean by formula
mean_score = sum(scores)/len(scores)
print(mean_score)
# Using statistics library function
mean_score = statistics.mean(scores)
print(mean_score)

36.3
36.3


To use NumPy and Pandas, we have to convert the list into a NumPy array and a Pandas Series.

In [69]:
scores_np = np.array(scores)
scores_df = pd.Series(scores)

In [70]:
#  Using NumPy function
mean_score = np.mean(scores_np)
print(mean_score)
#Using Pandas function
mean_score = scores_df.mean()
print(mean_score)

36.3
36.3


Median: The median is the middle value of the sorted dataset, that is, the value that splits the dataset in half. When the number of data points is even, it is the average of the two middle values.

In [71]:
# Calculating Median by formula
n = len(scores)
ordered = sorted(scores)
if n % 2:
    # Odd count: take the single middle value
    scores_median = ordered[n // 2]
else:
    # Even count: average the two middle values
    scores_median = 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])

print(scores_median)

40.0


In [72]:
# Using Statistics Library function
scores_median = statistics.median(scores)
print(scores_median)
# Using NumPy function
scores_median = np.median(scores_np)
print(scores_median)
# Using Pandas function
scores_median = scores_df.median()
print(scores_median)

40.0
40.0
40.0


Mode: The mode is the value that appears most frequently in the dataset. If two or more values share the highest frequency, the dataset is said to be multimodal.

In [73]:
# Finding the mode without any library
scores_mode = max((scores.count(item), item) for item in set(scores))[1]
print(scores_mode)

47
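
When several values tie for the highest frequency, a single-value mode hides part of the picture. As a minimal sketch, assuming Python 3.8 or later (where `statistics.multimode` is available) and a small made-up list `tied_scores` that is not part of the original data, all modes can be listed like this:

```python
import statistics

# Hypothetical data: 15 and 19 both appear twice, so there are two modes
tied_scores = [12, 15, 15, 19, 19, 23]

# multimode returns every value that shares the highest frequency,
# in the order in which the values first appear
all_modes = statistics.multimode(tied_scores)
print(all_modes)  # [15, 19]

# For the scores list used in this notebook, 47 is the only repeated value
scores = [19, 20, 47, 34, 45, 43, 41, 28, 39, 47]
print(statistics.multimode(scores))  # [47]
```

Note that the manual `max(...)` approach above breaks ties by picking the largest value, because the `(count, item)` tuples are compared element by element.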


In [75]:
# Using Statistics Library function
scores_mode = statistics.mode(scores)
print(scores_mode)
# Using SciPy function
scores_mode = scipy.stats.mode(scores_np)
print(scores_mode)
# Using Pandas function (returns a Series, since there can be more than one mode)
scores_mode = scores_df.mode()
print(scores_mode)

47
ModeResult(mode=array([47]), count=array([2]))
0    47
dtype: int64


# Calculating Measures of Variability
Measures of variability, also known as measures of dispersion, quantify the spread of the data points.

Variance: The variance quantifies the spread of the data. It indicates how far the data points lie from the mean.
![image.png](attachment:image.png)

In [77]:
# Calculating Variance using Formula (without libraries)
n = len(scores)
score_mean = sum(scores) / n
score_var = sum((item - score_mean)**2 for item in scores) / (n - 1)
print(score_var)

113.1222222222222


In [81]:
# Finding Variance using Libraries
# Using Statistics library function
score_var = statistics.variance(scores)
print(score_var)

# Using NumPy library function
# ddof stands for delta degrees of freedom. Setting ddof=1 computes the sample
# variance 𝑠², with (𝑛 − 1) in the denominator instead of 𝑛.
score_var = np.var(scores_np, ddof=1)
print(score_var)

# Using Pandas Library function
score_var = scores_df.var(ddof=1)
print(score_var)

113.12222222222222
113.1222222222222
113.1222222222222
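
The effect of `ddof` is easiest to see by computing both versions side by side. The sketch below is illustrative only: it reuses the scores list and compares NumPy's default (`ddof=0`, the population variance with 𝑛 in the denominator) against `ddof=1`, cross-checking each against the `statistics` module.

```python
import math
import statistics
import numpy as np

scores = [19, 20, 47, 34, 45, 43, 41, 28, 39, 47]
n = len(scores)

# NumPy defaults to ddof=0, i.e. it divides by n (population variance)
population_var = np.var(scores)
# ddof=1 divides by (n - 1) instead (sample variance)
sample_var = np.var(scores, ddof=1)

print(population_var, sample_var)

# The two differ by the factor n / (n - 1)
print(math.isclose(population_var * n / (n - 1), sample_var))  # True

# The statistics module exposes the same pair as pvariance and variance
print(math.isclose(population_var, statistics.pvariance(scores)))  # True
print(math.isclose(sample_var, statistics.variance(scores)))       # True
```

Pandas uses `ddof=1` by default in `Series.var()` and `Series.std()`, so passing it explicitly there mainly serves as documentation.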


Standard Deviation: The standard deviation, 𝑠, is the positive square root of the sample variance. A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that they are spread over a wider range.
![image.png](attachment:image.png)

In [82]:
# Calculating Standard Deviation using Formula (without libraries)
score_std = score_var**0.5
print(score_std)

10.635893108818939


In [83]:
# Finding Standard Deviation using Libraries
# Using Statistics library function
score_std = statistics.stdev(scores)
print(score_std)

# Using NumPy library function
# ddof stands for delta degrees of freedom. Setting ddof=1 computes the sample
# standard deviation 𝑠, based on (𝑛 − 1) in the denominator instead of 𝑛.
score_std = np.std(scores_np, ddof=1)
print(score_std)

# Using Pandas Library function
score_std = scores_df.std(ddof=1)
print(score_std)

10.63589310881894
10.635893108818939
10.635893108818939


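Skewness: Skewness measures the asymmetry of a data sample. Its value can be positive (a longer right tail), negative (a longer left tail), or zero (perfect symmetry), and it is undefined when all values are identical. A common rule of thumb interprets the value as follows: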
1. Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
2. Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
3. Approximately symmetric distribution: If the skewness value is between −½ and +½.

![image.png](attachment:image.png)


In [85]:
# Calculating Skewness using formula (without libraries)
n = len(scores)
scores_mean = sum(scores) / n
scores_var = sum((item - scores_mean)**2 for item in scores) / (n - 1)
scores_std = scores_var ** 0.5
scores_skew = (sum((item - scores_mean)**3 for item in scores)* n / ((n - 1) * (n - 2) * scores_std**3))
scores_skew

-0.7572169387363916

In [88]:
# Finding Skewness using Libraries
# Using SciPy library function
# bias=False enables the correction for statistical bias
scores_skew = scipy.stats.skew(scores_np, bias=False)
print(scores_skew)
# Using Pandas Library function
scores_skew = scores_df.skew()
print(scores_skew)

-0.7572169387363917
-0.7572169387363917
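
The rule-of-thumb ranges listed for skewness above can be wrapped in a small helper. This is only a sketch: `describe_skewness` is a hypothetical name, and treating the boundary values ±½ and ±1 as belonging to the moderate range is an assumption, since the ranges do not say on which side the boundaries fall.

```python
def describe_skewness(skew):
    """Classify a skewness value using the rule-of-thumb ranges.

    Assumption: the boundaries ±0.5 and ±1 count as 'moderately skewed'.
    """
    magnitude = abs(skew)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


# Skewness of the scores list as computed in this notebook
label = describe_skewness(-0.7572169387363917)
print(label)  # moderately skewed
```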


# Calculating Measures of Correlation

To examine the association between the corresponding elements of two variables in a dataset, we use measures of correlation. Suppose the dataset has two variables x and y, each with n elements. The data then form n pairs of corresponding elements (x, y).
1. The correlation is Positive if larger values of x correspond to larger values of y, and vice versa.
2. The correlation is Negative if larger values of x correspond to smaller values of y, and vice versa.
3. The correlation is Weak, or there is No Correlation, when there is no visible relationship between x and y.


The two statistical measures for correlation are covariance and correlation coefficient.

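Covariance indicates the direction of the linear relationship between a pair of variables. Its magnitude also reflects the strength of the relationship, but it depends on the units and spread of the variables, so covariance values are hard to compare across datasets.

* When the covariance is positive, the correlation is Positive. For a given scale and spread of the variables, a larger positive value indicates a stronger relationship.

* When the covariance is negative, the correlation is Negative. For a given scale and spread of the variables, a more negative value indicates a stronger relationship.

* When the covariance is close to zero, it indicates that the correlation is weak.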

The formula to calculate the covariance is as follows:

![image.png](attachment:image.png)
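
Written out in LaTeX, the sample covariance with (𝑛 − 1) in the denominator, which is the form the code in this section uses, reads as follows. The symbol s_xy is notation assumed here, and x̄ and ȳ denote the sample means of x and y:

$$
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$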

Let us consider the dataset birthwt.csv on risk factors associated with low infant birth weight.
It contains the following columns:

* low: indicator of birth weight less than 2.5 kg.
* age: Mother's age in years.
* lwt: Mother's weight in pounds.
* race: Mother's race (1 = white, 2 = black, 3 = other).
* smoke: smoking status during pregnancy.
* ptl: number of previous premature labours.
* ht: history of hypertension.
* ui: presence of uterine irritability.
* ftv: number of physician visits during the first trimester.
* bwt: birth weight in grams.

Let us now find out whether there is any relationship between the mother's age and the birth weight of the child.

In [104]:
# Loading the dataset
with open('birthwt.csv','r') as f:
    g=f.readlines()
    # Each line is split on commas and the selected columns are converted to integers
    # (index 2 is read as age and index 10 as bwt, which assumes the file starts with a row-label column)
    age = [int(x.split(',')[2]) for x in g[1:]]
    birth_wt  = [int(x.split(',')[10]) for x in g[1:]]

In [107]:
def covariance(x, y):
    # Finding the mean of the series x and y
    mean_x = sum(x)/len(x)
    mean_y = sum(y)/len(y)
    # Subtracting mean from the individual elements
    sub_x = [i - mean_x for i in x]
    sub_y = [i - mean_y for i in y]
    numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
    denominator = len(x)-1
    cov = numerator/denominator
    return cov

In [108]:
cov_func = covariance(age, birth_wt)
print("Covariance from the custom function:", cov_func)

Covariance from the custom function: 348.976443768997


In [109]:
# Preparing data for NumPy and Pandas
age_np, bwt_np = np.array(age), np.array(birth_wt)
age_pd, bwt_pd = pd.Series(age_np), pd.Series(bwt_np)

In [117]:
# Using NumPy function cov
cov_matrix = np.cov(age_np, bwt_np)
# cov_matrix is 2×2: variances on the diagonal, covariance off the diagonal
cov_xy = cov_matrix[0, 1]
cov_xy

348.976443768997

In [118]:
# Using Pandas function cov
cov_xy = age_pd.cov(bwt_pd)
cov_xy

348.976443768997
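
One caveat when reading a covariance value such as the one above: its magnitude depends on the units of the variables. The sketch below uses made-up weights and heights (not taken from birthwt.csv) and a hypothetical helper `sample_covariance` to show that expressing one variable in centimetres instead of metres multiplies the covariance by 100, even though the relationship itself is unchanged.

```python
import math

def sample_covariance(x, y):
    """Sample covariance with (n - 1) in the denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Hypothetical measurements, for illustration only
weight_kg = [50, 60, 70, 80]
height_m = [1.5, 1.6, 1.7, 1.8]
height_cm = [150, 160, 170, 180]  # the same heights in centimetres

cov_m = sample_covariance(weight_kg, height_m)
cov_cm = sample_covariance(weight_kg, height_cm)
print(cov_m, cov_cm)

# Same data, different units: the covariance scales by the conversion factor
print(math.isclose(cov_cm, 100 * cov_m))  # True
```

This unit dependence is one reason to prefer a normalized measure when comparing relationships across variables.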

The Correlation Coefficient, also known as Pearson’s Correlation Coefficient, is another measure of the relationship between two variables. It is the covariance divided by the product of the two standard deviations, which makes it unitless. The Correlation Coefficient is denoted by r and is calculated as follows.
![image.png](attachment:image.png)
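
In LaTeX notation, and consistent with the (𝑛 − 1)-based variance and standard deviation used in this notebook, Pearson's r can be written as below. The symbols s_xy (sample covariance) and s_x, s_y (sample standard deviations) are notation assumed here; the second form follows because the 1/(𝑛 − 1) factors cancel.

$$
r = \frac{s_{xy}}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$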

* The value 𝑟 > 0 indicates positive correlation.
* The value 𝑟 < 0 indicates negative correlation.
* The value r = 1 is the maximum possible value of 𝑟. It corresponds to a perfect positive linear relationship between variables.
* The value r = −1 is the minimum possible value of 𝑟. It corresponds to a perfect negative linear relationship between variables.
* The value r ≈ 0 means that the linear correlation between the variables is weak.


In [119]:
# Writing the function for Correlation Coefficient
def correlation(x, y):
    # Finding the mean of the series x and y
    mean_x = sum(x)/float(len(x))
    mean_y = sum(y)/float(len(y))
    # Subtracting mean from the individual elements
    sub_x = [i-mean_x for i in x]
    sub_y = [i-mean_y for i in y]
    # Sum of products of deviations (the covariance without the 1/(n-1) factor)
    numerator = sum([sub_x[i]*sub_y[i] for i in range(len(sub_x))])
    # Sums of squared deviations of x and y (the variances without the 1/(n-1) factor)
    sum_sq_x = sum([sub_x[i]**2.0 for i in range(len(sub_x))])
    sum_sq_y = sum([sub_y[i]**2.0 for i in range(len(sub_y))])
    # Raising to the power 0.5 takes the square root; the 1/(n-1) factors cancel in the ratio
    denominator = (sum_sq_x*sum_sq_y)**0.5 # short but equivalent to (sum_sq_x**0.5) * (sum_sq_y**0.5)
    cor = numerator/denominator
    return cor

In [120]:
correlation(age, birth_wt)

0.09031781366853261

In [125]:
# Using Library functions
# Using SciPy (pearsonr returns the correlation coefficient together with a p-value)
r = scipy.stats.pearsonr(age_np, bwt_np)
print(r)
# Using NumPy
corr_matrix = np.corrcoef(age_np, bwt_np)
r = corr_matrix[0, 1]
print(r)
# Using Pandas
r = age_pd.corr(bwt_pd)
print(r)

(0.09031781366853263, 0.21647524185521977)
0.09031781366853261
0.09031781366853261
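
The limiting cases r = 1 and r = −1 can be checked on data that lie exactly on a straight line. The sketch below uses made-up values and a hypothetical helper `pearson_r` that follows the same sums-of-deviations approach as the custom function above.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dev_x, dev_y))
    denominator = math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    return numerator / denominator

# Hypothetical data lying exactly on straight lines
x = [1, 2, 3, 4, 5]
y_increasing = [2 * v + 1 for v in x]    # rises with x
y_decreasing = [-3 * v + 10 for v in x]  # falls with x

r_up = pearson_r(x, y_increasing)
r_down = pearson_r(x, y_decreasing)
print(r_up, r_down)  # 1.0 -1.0
```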


# Visualization of Correlation

Python’s matplotlib library is used to create production-quality graphics. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
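
A scatter plot is a common way to inspect correlation visually. The sketch below is a minimal example using matplotlib only. The values in `age_demo` and `bwt_demo` are made-up placeholders, not rows from birthwt.csv, and would be replaced with the `age` and `birth_wt` lists loaded earlier.

```python
import matplotlib.pyplot as plt

# Hypothetical placeholder values; substitute the real age and birth_wt lists
age_demo = [19, 22, 25, 28, 31, 35]
bwt_demo = [2500, 3100, 2900, 3300, 3000, 3400]

fig, ax = plt.subplots()
# Each point is one (x, y) pair; an upward or downward drift suggests correlation
ax.scatter(age_demo, bwt_demo)
ax.set_xlabel("Mother's age (years)")
ax.set_ylabel("Birth weight (grams)")
ax.set_title("Age vs. birth weight")
```

In a notebook the figure is displayed when the cell finishes; in a plain script, a call to `plt.show()` would be needed.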