## Descriptive Statistics

Descriptive statistics summarise a dataset by measuring the centre, dispersion and shape of its values. The data is described as it is, as a sample of the whole population, and no inferences are drawn from the sample about that population. This is unlike inferential statistics, in which we model the data on the basis of probability theory.

## Key Elements of Descriptive Statistics

### Measures Of Central Tendency

* Mean
* Median
* Mode

### Measures Of Spread

* Range
* Outliers
* Interquartile Range
* Variance

### Dependence

* Correlation vs. Causation
* Covariance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# house price prediction
data = pd.read_csv('train.csv')

In [None]:
data.shape

In [None]:
data.head()

In [None]:
# let's check the column names
data.columns

In [None]:
# let's check the target column of the data
data['SalePrice'].head(20)

### Mean SalePrice

In [None]:
# checking the average price of houses
mean = np.mean(data['SalePrice'])
print(mean)

### Disadvantage of Mean

* The mean is not always a good summary because it is sensitive to outliers. In simple words, if a few observations are much larger or smaller than the majority of the other observations, the mean is pulled towards those values.

* More generally, if the distribution of the data is skewed (for example, by outliers), the mean is a poor measure of the centre and the median is usually the better choice.
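
The effect of a single outlier can be sketched on a small made-up sample. The prices below are hypothetical and are not taken from `train.csv`; they only illustrate how the mean moves while the median barely changes.

```python
import numpy as np

# Hypothetical prices: five typical houses plus one very expensive one
typical_prices = np.array([150_000, 160_000, 165_000, 170_000, 180_000])
with_outlier = np.append(typical_prices, 2_000_000)

print("mean without outlier:  ", np.mean(typical_prices))
print("median without outlier:", np.median(typical_prices))
print("mean with outlier:     ", np.mean(with_outlier))
print("median with outlier:   ", np.median(with_outlier))
```

Adding one extreme value moves the mean by several hundred thousand, while the median shifts only to the midpoint of the two central values.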

## Median of SalePrice

In [None]:
# checking the median price of houses
median = np.median(data['SalePrice'])
print(median)

* We can see a large difference between the mean and the median, which suggests that outliers in this column are pulling the mean away from the centre of the data.

## Median and Interquartile Range

* Taking the concept of the median a step further, we can define the interquartile range (IQR).
* The IQR is a measure of variability and is based on dividing a data set into quartiles.
* Quartiles divide the sorted observations into four equal-sized groups; the cut points are Q1, Q2 (the median) and Q3.

![image.png](attachment:image.png)

**The interquartile range is equal to Q3 minus Q1.**

**For example,** 
* consider the following numbers: 1, 3, 4, 5, 5, 6, 7, 11.

    * Q1 is the middle value in the first half of the data set (1, 3, 4, 5).
    * Since there are an even number of data points in the first half of the data set, the middle value is the average of the two middle values; that is, Q1 = (3 + 4)/2 or Q1 = 3.5.
    * Q3 is the middle value in the second half of the data set (5, 6, 7, 11).
    * Again, since the second half of the data set has an even number of observations, the middle value is the average of the two middle values; that is, Q3 = (6 + 7)/2 or Q3 = 6.5.
    * The interquartile range is Q3 minus Q1, so IQR = 6.5 - 3.5 = 3.
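
The worked example above can be reproduced as a minimal sketch. Note one assumption worth checking on your own data: `np.percentile` (like pandas `quantile`) interpolates linearly by default, so it can give slightly different quartiles from the "median of each half" method used in the text.

```python
import numpy as np

values = np.array([1, 3, 4, 5, 5, 6, 7, 11])  # already sorted

# "Median of each half" method, as in the worked example
half = len(values) // 2
q1_halves = np.median(values[:half])    # first half: 1, 3, 4, 5
q3_halves = np.median(values[-half:])   # second half: 5, 6, 7, 11
iqr_halves = q3_halves - q1_halves
print("halves method:", q1_halves, q3_halves, iqr_halves)

# NumPy's default (linear interpolation) for comparison
q1_interp, q3_interp = np.percentile(values, [25, 75])
print("np.percentile:", q1_interp, q3_interp, q3_interp - q1_interp)
```

Both conventions are in common use; on a large column such as `SalePrice` the difference between them is usually small.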
    
    
### Box Plot View for IQR

![image.png](attachment:image.png)


### Outliers with Box Plot

* The Boxplot above shows some additional observations below MINIMUM and above MAXIMUM. These are Outliers.
* There are many ways to mathematically represent or define outliers. One such method is using IQR.

In [None]:
# 95th percentile of SalePrice: 95% of the houses sold for this price or less
x = data['SalePrice'].quantile(0.95)
x

In [None]:
### IQR 

# Median
median = np.median(data['SalePrice'])
print("Median :",median)

# lower quartile  
q1 = data['SalePrice'].quantile(0.25)

# upper quartile
q3 = data['SalePrice'].quantile(0.75)

# printing Results
print("Q1:", q1)
print("Q3:", q3)
print("IQR:", q3 - q1)

* Here, the IQR represents the spread of the middle 50% of the values in the SalePrice column. The large gap between the mean and the median suggests that the data contains outliers. Let's check for them using a box plot.

In [None]:
# pass the column explicitly as x so seaborn draws a horizontal box plot
sns.boxplot(x=data['SalePrice'])
plt.show()

In [None]:
## let's find the number of outliers

# for that we have to find the upper and lower outlier limits
outlier_lower_limit = q1 - 1.5*(q3 - q1)
outlier_upper_limit = q3 + 1.5*(q3 - q1)
print("Outlier Lower Limit :", outlier_lower_limit)
print("Outlier Upper Limit :", outlier_upper_limit)

In [None]:
Sales_price = data['SalePrice']

lower_limit_outliers = Sales_price[Sales_price < outlier_lower_limit].count()

upper_limit_outliers = Sales_price[Sales_price > outlier_upper_limit].count()

print("lower_limit_outliers:", lower_limit_outliers)
print("upper_limit_outliers:", upper_limit_outliers)
print("total outliers:", upper_limit_outliers + lower_limit_outliers)

## Skewness

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

**Skewness**, also known as **skew**, is a measure of **asymmetry**. Specifically, it measures how much your distribution is weighted toward one side. 

**Right skew**, also known as **positive skew**, means that your distribution has more extreme values on the high end than on the low end.

**Left skew**, also known as **negative skew**, means that your distribution has more extreme values on the low end than on the high end.

Skewness of 0 represents a distribution with equal weight on both sides. There are multiple ways to calculate skew, but values between -1 and 1 are usually considered low.

![skew](https://upload.wikimedia.org/wikipedia/commons/c/cc/Relationship_between_mean_and_median_under_different_skewness.png)
Diva Jain, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
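
The sign convention described above can be checked numerically. This is a minimal sketch on two made-up samples, using the pandas `Series.skew()` method (one of several ways to compute skew).

```python
import pandas as pd

# Hypothetical samples: one with a long right tail, one symmetric
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])

print("skew of right-skewed sample:", right_skewed.skew())
print("mean:", right_skewed.mean(), "median:", right_skewed.median())
print("skew of symmetric sample:", symmetric.skew())
```

In the right-skewed sample the mean sits above the median, which matches the pattern shown in the image.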

In [None]:
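# let's check the skewness of the data
# (distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement)
sns.histplot(data['SalePrice'], kde=True)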
plt.show()

Thus, we see that our histogram is positively skewed.

The image above shows different examples of skewness and how the mean and the median are affected in each distribution.

## Mode

In [None]:
mode = Sales_price.mode()
print(mode)

In [None]:
## plot the histogram with the mean, median and mode marked as vertical lines

plt.figure(figsize=(10, 6))
plt.hist(Sales_price, bins=40, color='yellow')
# mode() returns a Series (there can be ties), so take its first value
plt.axvline(mode.iloc[0], color='black', label='mode')
plt.axvline(median, color='tab:blue', label='median')
plt.axvline(mean, color='tab:orange', label='mean')
plt.ylim(0, 250)
plt.legend()
plt.show()


## Spread of the Data

* Let's choose the value 250,000 from the SalePrice column and check how far it is from the mean, compared with the other points in the data set.
* We measure this as follows:
      (250,000 - mean) / Random Variation
* We already know the mean; we found it earlier.

* What is Random Variation?
    * It is the typical (roughly average) variation of the data from the mean. The standard deviation is the usual way to measure it.
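
The ratio described above is commonly called a z-score when the "Random Variation" is taken to be the standard deviation; that reading is an assumption here. A minimal sketch on hypothetical prices (not the real `SalePrice` column):

```python
import numpy as np

# Hypothetical sale prices, used only to illustrate the formula
prices = np.array([120_000, 150_000, 165_000, 180_000, 210_000, 250_000])

mean_price = prices.mean()
std_price = prices.std(ddof=1)  # sample standard deviation

# How many standard deviations 250,000 lies above the mean
z_score = (250_000 - mean_price) / std_price
print("mean:", mean_price)
print("standard deviation:", std_price)
print("z-score of 250,000:", z_score)
```

A positive value means the point lies above the mean; the larger the magnitude, the more unusual the point is relative to the rest of the sample.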


### Range of the Data

* Range of data is simply:
    * Max Value of Data - Min Value of data


In [None]:
Range = np.max(Sales_price)-np.min(Sales_price)
Range


## Variance of the Data

In [None]:

# sample variance (pandas divides by n - 1, i.e. ddof=1, by default)
variance = Sales_price.var()
print(variance)
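
One detail that is easy to miss: pandas and NumPy use different defaults for the variance denominator. The sketch below, on a small made-up sample, shows the difference between the sample variance (`ddof=1`) and the population variance (`ddof=0`).

```python
import numpy as np
import pandas as pd

sample = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_variance = sample.var()                    # pandas default: ddof=1
population_variance = np.var(sample.to_numpy())   # NumPy default: ddof=0
numpy_sample_variance = np.var(sample.to_numpy(), ddof=1)

print("sample variance (n - 1):", sample_variance)
print("population variance (n):", population_variance)
```

For a column with many rows the two values are nearly identical, but it helps to know which one a given function returns.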

### Standard Deviation

In [None]:
from math import sqrt

std = sqrt(variance)
print(std)

![image.png](attachment:image.png)

### Correlation

![image.png](attachment:image.png)

* A correlation coefficient of 1 means a perfect positive linear relationship: for every increase in one variable, there is an increase of a fixed proportion in the other.
* A correlation coefficient of -1 means a perfect negative linear relationship: for every increase in one variable, there is a decrease of a fixed proportion in the other.
* Zero means there is no linear relationship: an increase in one variable is not associated with a consistent increase or decrease in the other. The two variables may still be related in a non-linear way.
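
These three cases can be illustrated with `np.corrcoef` on synthetic data. The arrays below are made up for the illustration; the point is that the coefficient depends on how consistently the variables move together, not on the size of the slope.

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])

# Perfect positive linear relationship, regardless of slope and offset
r_positive = np.corrcoef(x, 5 * x + 10)[0, 1]
# Perfect negative linear relationship, even with a shallow slope
r_negative = np.corrcoef(x, -0.1 * x)[0, 1]
# Strongly dependent but not linearly: correlation is zero
r_quadratic = np.corrcoef(x, x ** 2)[0, 1]

print(r_positive, r_negative, r_quadratic)
```

The last case shows why a coefficient of zero should be read as "no linear relationship" rather than "no relationship at all".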

### What is the Correlation between the Sale Price and the Living Area (GrLivArea)?



In [None]:
data['GrLivArea'].value_counts()

In [None]:
# let's find out the correlation

living_room_area = data.GrLivArea

# np.corrcoef returns the matrix of Pearson product-moment correlation coefficients;
# element [0, 1] is the correlation between the two inputs
corr = np.corrcoef(Sales_price, living_room_area)[0, 1]
print("Correlation between Sale Price and Living Area is {0:.2f}".format(corr))

In [None]:
# considering 4 continuous variables and finding the pairwise correlations

x = data[['LotArea','GrLivArea','GarageArea','SalePrice']]
corr = x.corr()     
print(corr)

### Correlation doesn't imply Causation

* Correlation does not imply causation. There may be, for example, an unknown factor that influences both variables similarly.

* Causation indicates that one event is the result of the occurrence of the other event; i.e. there is a causal relationship between the two events. This is also referred to as cause and effect.

* A statistically significant correlation has been reported, for example, between yellow cars and a lower incidence of accidents. That does not indicate that yellow cars are safer, but just that fewer yellow cars are involved in accidents. A third factor, such as the personality type of the purchaser of yellow cars, is more likely to be responsible than the color of the paint itself.
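
The "third factor" idea can be sketched with a small simulation. Everything here is synthetic: `hidden` stands in for an unobserved factor that drives both `x` and `y`, and neither variable affects the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

hidden = rng.normal(size=n)                  # unobserved common factor
x = hidden + rng.normal(scale=0.5, size=n)   # x depends only on hidden
y = hidden + rng.normal(scale=0.5, size=n)   # y depends only on hidden

r_xy = np.corrcoef(x, y)[0, 1]
# Once the hidden factor is subtracted out, only independent noise remains
r_after_removing_hidden = np.corrcoef(x - hidden, y - hidden)[0, 1]

print("correlation of x and y:", r_xy)
print("after removing the hidden factor:", r_after_removing_hidden)
```

`x` and `y` are strongly correlated even though neither causes the other; the correlation essentially disappears once the common factor is accounted for.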

In [None]:
## Covariance

data[['LotArea','GrLivArea','GarageArea','SalePrice']].cov()

Covariance measures the linear relationship between two variables. Covariance is similar to the correlation between two variables, but it differs for the following reasons:

- Correlation coefficients are standardized. A perfect linear relationship therefore corresponds to a coefficient of 1 (or -1 for a perfect negative relationship). Correlation measures both the strength and the direction of the linear relationship between two variables.
- Covariance values are not standardized. Covariance can therefore range from negative infinity to positive infinity, and the value for a perfect linear relationship depends on the data. Because the values are not standardized, it is difficult to judge the strength of the relationship between two variables from the covariance alone.

You can use covariance to understand the direction of a relationship between two variables. Positive covariance values indicate that above-average values of one variable are associated with above-average values of the other variable, and that below-average values are similarly associated. Negative covariance values indicate that above-average values of one variable are associated with below-average values of the other variable.
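
The difference between covariance and correlation can be sketched on made-up data: changing the units of one variable rescales the covariance but leaves the correlation unchanged. The arrays and units below are hypothetical.

```python
import numpy as np

area_m2 = np.array([50.0, 65.0, 80.0, 95.0, 120.0])
price = np.array([100.0, 135.0, 150.0, 190.0, 240.0])  # in thousands

cov_original = np.cov(area_m2, price)[0, 1]
corr_original = np.corrcoef(area_m2, price)[0, 1]

# Express the same prices in full currency units instead of thousands
price_full = price * 1000
cov_rescaled = np.cov(area_m2, price_full)[0, 1]
corr_rescaled = np.corrcoef(area_m2, price_full)[0, 1]

# Correlation is the covariance divided by both standard deviations
corr_from_cov = cov_original / (area_m2.std(ddof=1) * price.std(ddof=1))

print("covariance:", cov_original, "->", cov_rescaled)
print("correlation:", corr_original, "->", corr_rescaled)
```

This is why the covariance matrix is mainly useful for reading the sign of a relationship, while the correlation matrix is easier to compare across pairs of variables.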