# Descriptive Statistics

# Mean
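The mean is the arithmetic average of a collection of numbers: the sum of the values divided by how many there are. It is not the most common value; that is the mode.
In statistics, it is a measure of the central tendency of a probability distribution, along with the median and mode. For a random variable, the mean is also referred to as the expected value.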

$$\mu=\frac{\Sigma x_i}{n}$$

In [16]:
def user_mean(x):
    arr=x
    n=len(arr)
    return sum(arr)/n

In [18]:
print(user_mean(df['Mthly_HH_Income']))

41558.0


In [6]:
import pandas as pd
df=pd.read_csv(r"data.csv")
print(df.mean(numeric_only=True))  # skip the non-numeric Highest_Qualified_Member column

Mthly_HH_Income           41558.00
Mthly_HH_Expense          18818.00
No_of_Fly_Members             4.06
Emi_or_Rent_Amt            3060.00
Annual_HH_Income         490019.04
No_of_Earning_Members         1.46
dtype: float64


# Median
The median is the middle value of a list of numbers sorted in ascending or descending order. When the data are skewed or contain outliers, it can describe a typical value better than the mean.

$$
Med(X) = \left\{
    \begin{array}{ll}
        X[\frac{n-1}{2}] & \mbox{if $n$ is odd} \\
        \frac{X[\frac{n}{2}-1]+X[\frac{n}{2}]}{2} & \mbox{if $n$ is even}
    \end{array}
\right.
$$

Here $X$ is the sorted list, indexed from 0, and $n$ is the number of values.

In [23]:
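def user_median(x):
    # Sort first: the median is defined on ordered data.
    arr=sorted(x)
    n=len(arr)
    
    if n%2==0:
        # Even count: average the two middle values (true division, not //).
        median1=arr[n//2]
        median2=arr[n//2 -1]
        median=(median1 + median2)/2
    else:
        # Odd count: take the single middle value.
        median=arr[n//2]
    return median

In [24]:
print(user_median(df['Mthly_HH_Income']))

35000.0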


In [8]:
print(df.median(numeric_only=True))  # skip the non-numeric Highest_Qualified_Member column

Mthly_HH_Income           35000.0
Mthly_HH_Expense          15500.0
No_of_Fly_Members             4.0
Emi_or_Rent_Amt               0.0
Annual_HH_Income         447420.0
No_of_Earning_Members         1.0
dtype: float64
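
To illustrate why the median can describe a data set better than the mean, here is a minimal sketch using Python's standard `statistics` module. The income figures are made up for illustration and are not taken from `data.csv`.

```python
from statistics import mean, median

# Hypothetical monthly incomes (not from data.csv).
incomes = [20000, 25000, 30000, 35000, 40000]

# The same incomes plus one deliberately extreme value.
with_outlier = incomes + [500000]

# Without the outlier, mean and median agree.
print(mean(incomes), median(incomes))

# With the outlier, the mean jumps above 100000 while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

A single extreme value pulls the mean far from the bulk of the data, while the median only shifts to the midpoint of the two central values.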


# Mode
The mode is the value with the highest frequency in a set of values, that is, the value that appears most often. A data set can have more than one mode when several values tie for the highest frequency.

In [9]:
print(df.mode())

   Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  Emi_or_Rent_Amt  \
0            45000             25000                  4                0   

   Annual_HH_Income Highest_Qualified_Member  No_of_Earning_Members  
0            590400                 Graduate                      1  
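
The notebook defines its own functions for the mean and the median but not for the mode. A possible counterpart, sketched below, counts frequencies with `collections.Counter`. The name `user_mode` and the sample values are assumptions made for illustration.

```python
from collections import Counter

def user_mode(values):
    """Return every value tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Keep all values whose count equals the maximum, so ties are preserved.
    return [value for value, count in counts.items() if count == top]

# Hypothetical samples (not from data.csv).
qualifications = ["Graduate", "Under-Graduate", "Graduate", "Illiterate", "Graduate"]
print(user_mode(qualifications))
print(user_mode([1, 2, 2, 3, 3]))
```

Returning a list rather than a single value mirrors the way `df.mode()` can return several rows when values tie.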


# Variance
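The variance measures how far the values spread around the mean. It is the average of the squared deviations from the mean. The formula below is the population variance, which divides by $N$. By default pandas `var()` computes the sample variance instead (`ddof=1`, dividing by $N-1$), and that is what the output below shows.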

$$\sigma^2=\frac{\Sigma(X-\mu)^2}{N}$$

In [10]:
print(df.var(numeric_only=True))  # sample variance (ddof=1), numeric columns only

Mthly_HH_Income          6.811009e+08
Mthly_HH_Expense         1.461733e+08
No_of_Fly_Members        2.302449e+00
Emi_or_Rent_Amt          3.895551e+07
Annual_HH_Income         1.024869e+11
No_of_Earning_Members    5.391837e-01
dtype: float64
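
The difference between dividing by $N$ and by $N-1$ can be seen in a small hand-written version. This is a sketch rather than the notebook's own code: the function name `user_variance`, its `ddof` parameter (named after the pandas argument), and the sample data are assumptions.

```python
def user_variance(values, ddof=0):
    """Variance of values; ddof=0 divides by N, ddof=1 divides by N-1."""
    data = list(values)
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mu = sum(data) / n
    # Sum of squared deviations from the mean.
    squared_deviations = sum((x - mu) ** 2 for x in data)
    return squared_deviations / (n - ddof)

# Hypothetical sample with mean 5 (not from data.csv).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(user_variance(sample))          # population variance
print(user_variance(sample, ddof=1))  # sample variance, the pandas default
```

With `ddof=1` the result should match `pd.Series(sample).var()`, since both divide by $N-1$.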


# Standard Deviation
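The standard deviation is the square root of the variance. It measures how spread out the values are, expressed in the same units as the data. pandas `std()` uses `ddof=1` by default, so it divides by $N-1$ rather than the $N$ shown in the formula.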

$$\sigma=\sqrt{\frac{\Sigma(X-\mu)^2}{N}}$$

In [11]:
print(df.std(numeric_only=True))  # sample standard deviation (ddof=1), numeric columns only

Mthly_HH_Income           26097.908979
Mthly_HH_Expense          12090.216824
No_of_Fly_Members             1.517382
Emi_or_Rent_Amt            6241.434948
Annual_HH_Income         320135.792123
No_of_Earning_Members         0.734291
dtype: float64
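
A hand-written standard deviation follows directly from the formula. The sketch below is illustrative only: `user_std`, its `ddof` parameter, and the sample data are assumed names and values, not part of the original notebook.

```python
from math import sqrt

def user_std(values, ddof=0):
    """Standard deviation: the square root of the variance."""
    data = list(values)
    n = len(data)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mu = sum(data) / n
    variance = sum((x - mu) ** 2 for x in data) / (n - ddof)
    return sqrt(variance)

# Hypothetical sample with mean 5 (not from data.csv).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(user_std(sample))          # population form, divides by N
print(user_std(sample, ddof=1))  # sample form, the pandas default
```

Because the square root undoes the squaring of the deviations, the result is on the same scale as the original values, which makes it easier to interpret than the variance.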


# Correlation
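Correlation is a statistic that measures the degree to which two variables move in relation to each other. The Pearson coefficient defined below ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship); a value near 0 indicates little or no linear relationship.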

$$
r_{xy}=\frac{\Sigma(x_i-\mu_x)(y_i-\mu_y)}{\sqrt{\Sigma(x_i-\mu_x)^2\Sigma(y_i-\mu_y)^2}}
$$

In [12]:
print(df.corr(numeric_only=True))  # Pearson correlation between numeric columns

                       Mthly_HH_Income  Mthly_HH_Expense  No_of_Fly_Members  \
Mthly_HH_Income               1.000000          0.649215           0.448317   
Mthly_HH_Expense              0.649215          1.000000           0.639702   
No_of_Fly_Members             0.448317          0.639702           1.000000   
Emi_or_Rent_Amt               0.036976          0.405280           0.085808   
Annual_HH_Income              0.970315          0.591222           0.430868   
No_of_Earning_Members         0.347883          0.311915           0.597482   

                       Emi_or_Rent_Amt  Annual_HH_Income  \
Mthly_HH_Income               0.036976          0.970315   
Mthly_HH_Expense              0.405280          0.591222   
No_of_Fly_Members             0.085808          0.430868   
Emi_or_Rent_Amt               1.000000          0.002716   
Annual_HH_Income              0.002716          1.000000   
No_of_Earning_Members        -0.097431          0.296679   

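                       No_of_Earning_Members  
Mthly_HH_Income                     0.347883  
Mthly_HH_Expense                    0.311915  
No_of_Fly_Members                   0.597482  
Emi_or_Rent_Amt                    -0.097431  
Annual_HH_Income                    0.296679  
No_of_Earning_Members               1.000000  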
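
The Pearson formula can be implemented in a few lines of plain Python. This is a minimal sketch; `user_corr` and the sample lists are hypothetical and are not taken from the original notebook or from `data.csv`.

```python
from math import sqrt

def user_corr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of paired deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical data: ys rises exactly with xs, zs falls exactly with xs.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
zs = [8, 6, 4, 2]
print(user_corr(xs, ys))
print(user_corr(xs, zs))
```

The two calls show the extremes of the coefficient: a perfectly increasing linear relationship and a perfectly decreasing one.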

# Normal Distribution
The normal distribution is a continuous probability distribution that is symmetric about its mean $\mu$ and has a bell-shaped density whose width is set by the standard deviation $\sigma$. It is the most widely used distribution in statistics because many real-world measurements are approximately normal, and because averages of many independent observations tend toward it.

$$
f(x,\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}
e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
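
The density can be evaluated directly in Python. The sketch below uses the standard form of the normal density, with $2\sigma^2$ in the denominator of the exponent. The function name `normal_pdf` and its default arguments are assumptions made for illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Standardize x, then apply exp(-z^2 / 2) / (sigma * sqrt(2 * pi)).
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

# The density peaks at the mean and is symmetric around it.
print(normal_pdf(0.0))
print(normal_pdf(-1.0), normal_pdf(1.0))
```

For the standard normal (`mu=0`, `sigma=1`) the peak height is $1/\sqrt{2\pi}$, roughly 0.3989.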

# Skewness
In statistics, skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. In other words, skewness tells you the amount and direction of skew (departure from horizontal symmetry). The skewness value can be positive or negative, or even undefined. If skewness is 0, the data are perfectly symmetrical, although it is quite unlikely for real-world data. As a general rule of thumb:

- If skewness is less than -1 or greater than 1, the distribution is highly skewed.
- If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
- If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
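
The rule of thumb above can be sketched in code. The example computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) and maps it to the three categories. The function names, the sample data, and the treatment of values exactly at 0.5 or 1 are assumptions. pandas `df.skew()` applies a bias correction, so its values differ slightly from this uncorrected version.

```python
def user_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with no bias correction."""
    data = list(values)
    n = len(data)
    mu = sum(data) / n
    m2 = sum((x - mu) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in data) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

def describe_skew(skewness):
    """Apply the rule of thumb; boundary values count as moderate (an assumption)."""
    magnitude = abs(skewness)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Hypothetical samples (not from data.csv).
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]
print(user_skewness(symmetric), describe_skew(user_skewness(symmetric)))
print(user_skewness(right_tailed), describe_skew(user_skewness(right_tailed)))
```

A positive result indicates a longer right tail, as in the second sample, and a negative result indicates a longer left tail.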