# **<center>DESCRIPTIVE STATISTICS</center>**

### <center>Descriptive statistics summarize and organize characteristics of a data set. A data set is a collection of responses or observations from a sample or entire population.</center>

In [4]:
# Importing Necessary Libraries
import numpy as np
import pandas as pd

In [5]:
# Reading data and creating a DataFrame

data = pd.read_csv("data.csv")
df = pd.DataFrame(data)

In [7]:
# printing first few rows of the dataset

df.head()

Unnamed: 0,Mthly_HH_Income,Mthly_HH_Expense,No_of_Fly_Members,Emi_or_Rent_Amt,Annual_HH_Income,Highest_Qualified_Member,No_of_Earning_Members
0,5000,8000,3,2000,64200,Under-Graduate,1
1,6000,7000,2,3000,79920,Illiterate,1
2,10000,4500,2,0,112800,Under-Graduate,1
3,10000,2000,1,0,97200,Illiterate,1
4,12500,12000,2,3000,147000,Graduate,1


## 1. Mean

> The mean, or M, is the most commonly used method for finding the average. To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N.

#  $${\displaystyle A={\frac {1}{n}}\sum _{i=1}^{n}a_{i}={\frac {a_{1}+a_{2}+\cdots +a_{n}}{n}}}$$

<table class="kb-table responsive-inline auto-th">
<caption>Example:</caption>
<tbody>
<tr>
<th>Data set</th>
<td data-th="Sum of all values">15, 3, 12, 0, 24, 3</td>
</tr>
<tr>
<th>Sum of all values</th>
<td data-th="Sum of all values">15 + 3 + 12 + 0 + 24 + 3 = 57</td>
</tr>
<tr>
<th>Total number of responses</th>
<td data-th="Sum of all values"><em>N</em> = 6</td>
</tr>
<tr>
<th>Mean</th>
<td data-th="Sum of all values">Divide the sum of values by <em>N </em>to find <em>M</em>:&nbsp;57/6 = <strong><span class="highlight-green">9.5</span></strong></td>
</tr>
</tbody>
</table>
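
The arithmetic in the table can be reproduced in a few lines of plain Python. This is only a minimal sketch of the example data set above; the variable names are illustrative.

```python
# Worked example from the table: mean of a small data set
values = [15, 3, 12, 0, 24, 3]

total = sum(values)   # sum of all values
n = len(values)       # total number of responses, N
mean = total / n      # M = sum of values / N

print(total, n, mean)  # 57 6 9.5
```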
    

In [8]:
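# Printing the means of all the numerical columns in the dataset
# numeric_only=True skips text columns such as Highest_Qualified_Member,
# which recent pandas versions otherwise refuse to average

print(df.mean(numeric_only=True))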

Mthly_HH_Income           41558.00
Mthly_HH_Expense          18818.00
No_of_Fly_Members             4.06
Emi_or_Rent_Amt            3060.00
Annual_HH_Income         490019.04
No_of_Earning_Members         1.46
dtype: float64


## 2. Median

>The median is the value that’s exactly in the middle of a data set.
To find the median, order each response value from the smallest to the biggest. Then, the median is the number in the middle. If there are two numbers in the middle, find their mean.

<table class="kb-table responsive-inline auto-th">
<caption>Example</caption>
<tbody>
<tr>
<th>Ordered data set</th>
<td data-th="Middle numbers">0, 3, 3, 12, 15, 24</td>
</tr>
<tr>
<th>Middle numbers</th>
<td data-th="Middle numbers">3, 12</td>
</tr>
<tr>
<th>Median</th>
<td data-th="Middle numbers">Find the mean of the two middle numbers: (3 + 12)/2 = <strong><span class="highlight-green">7.5</span></strong></td>
</tr>
</tbody>
</table>
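
The procedure described above (sort, then take the middle value or the mean of the two middle values) could be sketched as follows. The helper name `median` is illustrative and is not part of the notebook's code.

```python
def median(values):
    """Return the middle value, averaging the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return ordered[mid]
    # Even count: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

example = [15, 3, 12, 0, 24, 3]
print(median(example))  # 7.5
```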

In [10]:
# Printing the medians of all the numerical columns in the dataset

df.median(numeric_only=True)

Mthly_HH_Income           35000.0
Mthly_HH_Expense          15500.0
No_of_Fly_Members             4.0
Emi_or_Rent_Amt               0.0
Annual_HH_Income         447420.0
No_of_Earning_Members         1.0
dtype: float64

## 3. Mode

> The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.
To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.

<table class="kb-table responsive-inline auto-th">
<caption>Example:</caption>
<tbody>
<tr>
<th>Ordered data set</th>
<td data-th="Mode">0, 3, 3, 12, 15, 24</td>
</tr>
<tr>
<th>Mode</th>
<td data-th="Mode">Find the most frequently occurring response: <span class="highlight-green"><strong>3</strong></span></td>
</tr>
</tbody>
</table>
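
Because a data set can have more than one mode, a small helper that returns every most-frequent value may be clearer than one that returns a single number. The sketch below is illustrative only; the function name `modes` and the second data set are assumptions made for the example.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([0, 3, 3, 12, 15, 24]))  # [3]     (one mode)
print(modes([1, 1, 2, 2, 3]))        # [1, 2]  (two modes)
```

When every value occurs exactly once, this helper returns all of them; whether to call that "no mode" is a matter of convention. `df.mode()` behaves similarly for ties: it returns one row per mode, so a column with several modes produces several rows.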

In [18]:
# Printing the mode of all the columns in the dataset

df.mode()

Unnamed: 0,Mthly_HH_Income,Mthly_HH_Expense,No_of_Fly_Members,Emi_or_Rent_Amt,Annual_HH_Income,Highest_Qualified_Member,No_of_Earning_Members
0,45000,25000,4,0,590400,Graduate,1


## 4. Variance

>The variance is the average of squared deviations from the mean. Variance reflects the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.
To find the variance, simply square the standard deviation.

# $$s^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2$$

In [19]:
# printing the variance of all numeric columns in the Dataset

df.var(numeric_only=True)

Mthly_HH_Income          6.811009e+08
Mthly_HH_Expense         1.461733e+08
No_of_Fly_Members        2.302449e+00
Emi_or_Rent_Amt          3.895551e+07
Annual_HH_Income         1.024869e+11
No_of_Earning_Members    5.391837e-01
dtype: float64
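
pandas computes the sample variance by default (`ddof=1`, dividing by N - 1 rather than N). The sketch below illustrates the difference on the small data set from the mean example, using only the standard library; it is not the notebook's own computation.

```python
import statistics

values = [15, 3, 12, 0, 24, 3]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in values)

sample_variance = ss / (n - 1)   # divides by N - 1 (pandas default, ddof=1)
population_variance = ss / n     # divides by N

# Cross-check against the standard library
assert abs(sample_variance - statistics.variance(values)) < 1e-9
assert abs(population_variance - statistics.pvariance(values)) < 1e-9

print(sample_variance, population_variance)
```

For this data set the sum of squared deviations is 421.5, so the sample variance is about 84.3 and the population variance is 70.25.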

## 5. Standard Deviation

>The standard deviation (s) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

 - List each score and find their mean.
 - Subtract the mean from each score to get the deviation from the mean.
 - Square each of these deviations.
 - Add up all of the squared deviations.
 - Divide the sum of the squared deviations by N – 1.
 - Find the square root of the number you found.

# $$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} .$$
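
The six steps listed above can be followed one by one in code. This is a minimal sketch on the data set from the mean example; the variable names are illustrative.

```python
import math

scores = [15, 3, 12, 0, 24, 3]

# Step 1: list each score and find their mean
mean = sum(scores) / len(scores)

# Step 2: subtract the mean from each score
deviations = [x - mean for x in scores]

# Step 3: square each deviation
squared = [d ** 2 for d in deviations]

# Step 4: add up the squared deviations
total = sum(squared)

# Step 5: divide by N - 1
variance = total / (len(scores) - 1)

# Step 6: take the square root
std_dev = math.sqrt(variance)

print(round(std_dev, 4))
```

Step 5 yields the sample variance, which is why squaring the standard deviation gives the variance.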

In [20]:
# printing the Std. Deviation of all numeric columns in the Dataset

df.std(numeric_only=True)

Mthly_HH_Income           26097.908979
Mthly_HH_Expense          12090.216824
No_of_Fly_Members             1.517382
Emi_or_Rent_Amt            6241.434948
Annual_HH_Income         320135.792123
No_of_Earning_Members         0.734291
dtype: float64

## 6. Correlation

>A correlation is a statistical measure of the relationship between two variables. The measure is best used for variables that have a linear relationship with each other. The fit of the data can be visually represented in a scatterplot. Using a scatterplot, we can generally assess the relationship between the variables and determine whether they are correlated or not.

The correlation coefficient is a value that indicates the strength of the relationship between variables. The coefficient can take any values from -1 to 1. The interpretations of the values are:

 - -1: Perfect negative correlation. The variables tend to move in opposite directions (i.e., when one variable increases, the other variable decreases).
 - 0: No linear correlation. The variables do not have a linear relationship with each other.
 - 1: Perfect positive correlation. The variables tend to move in the same direction (i.e., when one variable increases, the other variable also increases).

$$r =\frac{\sum\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sqrt{\sum\left(x_{i}-\bar{x}\right)^{2} \sum\left(y_{i}-\bar{y}\right)^{2}}}$$

 $$r = \text{correlation coefficient}$$
 $$x_{i} = \text{values of the x-variable in a sample}$$
 $$\bar{x} = \text{mean of the values of the x-variable}$$
 $$y_{i} = \text{values of the y-variable in a sample}$$
 $$\bar{y} = \text{mean of the values of the y-variable}$$
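
The formula translates almost term by term into code. The sketch below is a plain-Python illustration; the function name `pearson_r` and the sample values are hypothetical and chosen only to show the two extreme cases.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of paired deviations
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator

# Hypothetical illustrative values
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises with x on a straight line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises on a straight line

print(r_up, r_down)  # 1.0 -1.0
```

The function fails with a division by zero if either variable is constant, since the correlation is undefined in that case.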

A correlation can be expressed visually. This is done by drawing a scattergram (also known as a scatterplot, scatter graph, scatter chart, or scatter diagram).

A scattergram is a graphical display that shows the relationships or associations between two numerical variables (or co-variables), which are represented as points (or dots) for each pair of scores.

A scattergraph indicates the strength and direction of the correlation between the co-variables.

<img class="img-responsive" src="correlation.jpg" alt="Types of Correlations: Positive, Negative, and Zero">

In [26]:
# correlation between numeric features in the dataset

df.corr(numeric_only=True)

Unnamed: 0,Mthly_HH_Income,Mthly_HH_Expense,No_of_Fly_Members,Emi_or_Rent_Amt,Annual_HH_Income,No_of_Earning_Members
Mthly_HH_Income,1.0,0.649215,0.448317,0.036976,0.970315,0.347883
Mthly_HH_Expense,0.649215,1.0,0.639702,0.40528,0.591222,0.311915
No_of_Fly_Members,0.448317,0.639702,1.0,0.085808,0.430868,0.597482
Emi_or_Rent_Amt,0.036976,0.40528,0.085808,1.0,0.002716,-0.097431
Annual_HH_Income,0.970315,0.591222,0.430868,0.002716,1.0,0.296679
No_of_Earning_Members,0.347883,0.311915,0.597482,-0.097431,0.296679,1.0


## 7. Normal Distribution

>Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, normal distribution will appear as a bell curve.

Points to Remember:
 - A normal distribution is the proper term for a probability bell curve.
 - In a standard normal distribution the mean is zero and the standard deviation is 1. Every normal distribution has zero skew and a kurtosis of 3.
 - Normal distributions are symmetrical, but not all symmetrical distributions are normal.
 - In reality, most pricing distributions are not perfectly normal.


# $$f(x)= {\frac{1}{\sigma\sqrt{2\pi}}}e^{- {\frac {1}{2}} (\frac {x-\mu}{\sigma})^2}$$
$$f(x) = \text{probability density function}$$
$$\sigma = \text{standard deviation}$$
$$\mu = \text{mean}$$
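
As a rough illustration, the density formula can be evaluated directly and compared with the standard library's `statistics.NormalDist`. The function name `normal_pdf` is an assumption made for this sketch.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal probability density function at x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

peak = normal_pdf(0.0)                       # density at the mean
left, right = normal_pdf(-1.0), normal_pdf(1.0)
reference = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

print(round(peak, 4), round(reference, 4))   # 0.3989 0.3989
print(left == right)                         # True (symmetric about the mean)
```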

<img loading="lazy" class="aligncenter wp-image-25652" src="https://cdn.corporatefinanceinstitute.com/assets/normal-distribution-300x232.png" alt="Normal Distribution" width="1024" height="791" srcset="//cdn.corporatefinanceinstitute.com/assets/normal-distribution-300x232.png 300w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution-768x593.png 768w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution-600x464.png 600w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution-388x300.png 388w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution-24x19.png 24w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution-36x28.png 36w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution-48x37.png 48w, //cdn.corporatefinanceinstitute.com/assets/normal-distribution.png 787w" sizes="(max-width: 1024px) 100vw, 1024px">

## 8. Features of Normal Distribution

### 1)  It is symmetric: 
>A normal distribution comes with a perfectly symmetrical shape. This means that the distribution curve can be divided in the middle to produce two equal halves. The symmetric shape occurs when one-half of the observations fall on each side of the curve.


### 2) The mean, median, and mode are equal

>The middle point of a normal distribution is the point with the maximum frequency, which means that it possesses the most observations of the variable. The midpoint is also the point where these three measures fall. The measures are equal in a perfectly normal distribution.

 

### 3) Empirical rule

>In normally distributed data, a constant proportion of the area under the curve lies between the mean and a specific number of standard deviations from the mean. For example, about 68.27% of all cases fall within +/- one standard deviation from the mean, about 95.45% fall within +/- two standard deviations, and about 99.73% fall within +/- three standard deviations.
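
These proportions can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / √2); the sketch below evaluates it with the standard library. The helper name `within_k_sigma` is illustrative.

```python
import math

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```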

### 4) Skewness and kurtosis

>Skewness and kurtosis are coefficients that measure how different a distribution is from a normal distribution. Skewness measures the asymmetry of a distribution, while kurtosis measures the thickness of its tails relative to the tails of a normal distribution.
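
One common way to compute both coefficients is from central moments: skewness as m3 / m2^1.5 and kurtosis as m4 / m2^2, where m_k is the average k-th power of the deviations from the mean. The sketch below uses this simple (population-moment) definition, which is an assumption of the example; the data sets are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) computed from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2   # a normal distribution gives 3 under this definition
    return skewness, kurtosis

# Hypothetical illustrative data
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

sym_skew, sym_kurt = skewness_and_kurtosis(symmetric)
right_skew, right_kurt = skewness_and_kurtosis(right_tailed)

print(sym_skew, round(sym_kurt, 2))  # 0.0 1.7
print(right_skew > 0)                # True
```

pandas' `DataFrame.skew()` and `DataFrame.kurt()` apply bias corrections, and `kurt()` reports excess kurtosis (normal = 0), so their numbers will differ somewhat from this sketch.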

## 9. Positively Skewed & Negatively Skewed Distributions

 - Negatively skewed distribution: A negatively skewed distribution has a long left tail. Negatively skewed distributions are also called left-skewed distributions, because the long tail points in the negative direction on the number line. The mean is also to the left of the peak.<br>
<hr>
 - Positively skewed distribution: A positively skewed distribution has a long right tail. Positively skewed distributions are also called right-skewed distributions, because the long tail points in the positive direction on the number line. The mean is also to the right of the peak.

<img src="https://www.statisticshowto.com/wp-content/uploads/2014/02/pearson-mode-skewness.jpg" alt="Skewed Distribution " width="436" height="200" class="alignleft size-full wp-image-11848" srcset="https://www.statisticshowto.com/wp-content/uploads/2014/02/pearson-mode-skewness.jpg 436w, https://www.statisticshowto.com/wp-content/uploads/2014/02/pearson-mode-skewness-300x137.jpg 300w" sizes="(max-width: 436px) 100vw, 436px">

## 10. Effect on Mean, Median and Mode due to Skewness

Consider the following data set.
4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,10


This data set can be represented by the following histogram. Each interval has width one, and each value is located in the middle of an interval.

<img src="https://s3-us-west-2.amazonaws.com/courses-images/wp-content/uploads/sites/132/2016/04/21214237/fig-ch02_08_01.jpg" alt="This histogram matches the supplied data. It consists of 7 adjacent bars with the x-axis split into intervals of 1 from 4 to 10. The heights of the bars peak in the middle and taper symmetrically to the right and left." width="350" data-media-type="image/jpg">

The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each seven for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal), and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.

The histogram for the data: 
4,5,6,6,6,7,7,7,7,8 is not symmetrical. The right-hand side seems “chopped off” compared to the left side. A distribution of this type is called skewed to the left because it is pulled out to the left.

<img src="https://s3-us-west-2.amazonaws.com/courses-images/wp-content/uploads/sites/132/2016/04/21214239/fig-ch02_08_02.jpg" alt="This histogram matches the supplied data. It consists of 5 adjacent bars with the x-axis split into intervals of 1 from 4 to 8. The peak is to the right, and the heights of the bars taper down to the left." width="350" data-media-type="image/jpg">

The mean is 6.3, the median is 6.5, and the mode is seven. Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so.

The histogram for the data: 
6,7,7,7,7,8,8,8,9,10 is also not symmetrical. It is skewed to the right.



<img src="https://s3-us-west-2.amazonaws.com/courses-images/wp-content/uploads/sites/132/2016/04/21214242/fig-ch02_08_03.jpg" alt="This histogram matches the supplied data. It consists of 5 adjacent bars with the x-axis split into intervals of 1 from 6 to 10. The peak is to the left, and the heights of the bars taper down to the right." width="350" data-media-type="image/jpg">

The mean is 7.7, the median is 7.5, and the mode is seven. Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.

**To summarize, generally if the distribution of data is skewed to the left, the mean is less than the median, which is often less than the mode. If the distribution of data is skewed to the right, the mode is often less than the median, which is less than the mean.**
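
The three data sets from this section can be checked with Python's `statistics` module. This is a minimal sketch; the dictionary keys are labels chosen for the example.

```python
import statistics

data_sets = {
    "symmetric": [4, 5, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 9, 10],
    "skewed left": [4, 5, 6, 6, 6, 7, 7, 7, 7, 8],
    "skewed right": [6, 7, 7, 7, 7, 8, 8, 8, 9, 10],
}

summary = {}
for name, values in data_sets.items():
    # (mean, median, mode) for each data set
    summary[name] = (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )
    print(name, summary[name])
```

The printed values match the text: 7, 7, 7 for the symmetric set; 6.3, 6.5, 7 for the left-skewed set; and 7.7, 7.5, 7 for the right-skewed set.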



## 11. Explain QQ Plot and show the implementation of the same

>Q Q plots (quantile-quantile plots) plot the quantiles of two distributions against each other. A quantile is a cut-off point below which a given fraction of the values fall. For example, the median is the quantile where 50% of the data fall below that point and 50% lie above it. The purpose of Q Q plots is to find out if two sets of data come from the same distribution. A 45-degree reference line is plotted on the Q Q plot; if the two data sets come from a common distribution, the points will fall approximately on that reference line.

<a href="https://www.statisticshowto.com/wp-content/uploads/2015/08/Normal_exponential_qq.svg_.png"><img aria-describedby="caption-attachment-19218" src="https://www.statisticshowto.com/wp-content/uploads/2015/08/Normal_exponential_qq.svg_.png" alt="q q plots" width="256" height="224" class="size-full wp-image-19218"></a>
<p id="caption-attachment-19218" class="wp-caption-text">A Q Q plot showing the 45 degree reference line. Image: skbkekas|Wikimedia Commons.</p>

The image above shows quantiles from a theoretical normal distribution on the horizontal axis. It’s being compared to a set of data on the y-axis. This particular type of Q Q plot is called a normal quantile-quantile (QQ) plot. The points are not clustered on the 45 degree line, and in fact follow a curve, suggesting that the sample data is not normally distributed.

### Implementation

Sample question: Do the following values come from a normal distribution?
7.19, 6.31, 5.89, 4.5, 3.77, 4.25, 5.19, 5.79, 6.79.


Step 1: Order the items from smallest to largest.

3.77,
4.25,
4.50,
5.19,
5.79,
5.89,
6.31,
6.79,
7.19

Step 2: Draw a normal distribution curve. Divide the curve into n+1 segments. We have 9 values, so divide the curve into 10 equally-sized areas. For this example, each segment is 10% of the area (because 100% / 10 = 10%).

<img src="https://www.statisticshowto.com/wp-content/uploads/2015/08/qq-plot.png" alt="" width="346" height="237" class="alignleft size-full wp-image-34140 lazyloaded" sizes="(max-width: 346px) 100vw, 346px" srcset="https://www.statisticshowto.com/wp-content/uploads/2015/08/qq-plot.png 346w, https://www.statisticshowto.com/wp-content/uploads/2015/08/qq-plot-300x205.png 300w" data-ll-status="loaded">

Step 3: Find the z-value (cut-off point) for each segment in Step 2. These segments are areas, so refer to a z-table (or use software) to get a z-value for each segment.
The z-values are:


10% = -1.28 <br>
20% = -0.84<br>
30% = -0.52<br>
40% = -0.25<br>
50% = 0 <br>
60% = 0.25 <br>
70% = 0.52 <br>
80% = 0.84 <br>
90% = 1.28 <br>
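100% = 3.0 (a finite stand-in: the true 100% cut-off is unbounded, and the nine data values only need the nine cut-offs above) <br>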
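
Instead of a z-table, the cut-off points can be obtained from the inverse cumulative distribution function of the standard normal. The sketch below uses `statistics.NormalDist.inv_cdf` from the standard library (Python 3.8+); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

n = 9  # number of data values
# Cumulative areas 10%, 20%, ..., 90% (i / (n + 1) for i = 1..n)
probabilities = [i / (n + 1) for i in range(1, n + 1)]
z_values = [round(standard_normal.inv_cdf(p), 2) for p in probabilities]

print(z_values)
# [-1.28, -0.84, -0.52, -0.25, 0.0, 0.25, 0.52, 0.84, 1.28]
```

`inv_cdf` only accepts probabilities strictly between 0 and 1, so the 100% point cannot be computed this way.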


<img src="https://www.statisticshowto.com/wp-content/uploads/2015/08/qq-plot-2.png">

Step 4: Plot your data set values (Step 1) against your normal distribution cut-off points (Step 3). I used Open Office for this chart:

![image.png](attachment:image.png)

The (almost) straight line on this q q plot indicates the data is approximately normal.
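
The four steps can also be carried out in Python rather than a spreadsheet. The sketch below builds the Q Q points for the sample data and, as an added assumption not used in the walkthrough above, summarizes how straight the points are with a Pearson correlation between the z-values and the ordered data. The helper `pearson_r` is illustrative.

```python
import math
from statistics import NormalDist

sample = [7.19, 6.31, 5.89, 4.5, 3.77, 4.25, 5.19, 5.79, 6.79]

# Step 1: order the items from smallest to largest
ordered = sorted(sample)

# Steps 2 and 3: z-value cut-offs for n + 1 equal-area segments
n = len(ordered)
z_values = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# Step 4: pair each ordered value with its cut-off point
qq_points = list(zip(z_values, ordered))

def pearson_r(xs, ys):
    """Pearson correlation, used here only as a rough straightness measure."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

r = pearson_r(z_values, ordered)

for z, value in qq_points:
    print(f"{z:6.2f}  {value:5.2f}")
print("straightness (r):", round(r, 3))
```

A value of r close to 1 is consistent with the nearly straight line seen in the chart. The pairs in `qq_points` can be passed to any plotting tool to draw the Q Q plot itself.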
