In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#### Basic Statistical Measures Used for Data Analysis

### Measures of Central Tendency

In statistics, **central tendency** refers to the tendency of quantitative data to cluster around a central or typical value of its distribution.

**Mean:**
The mean is the average value of a dataset, calculated by summing all values and dividing by the total number of values.

$$ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} $$

**Median:**
The median is the middle value of an ordered dataset. If the number of observations is odd, the median is the middle number. If even, it is the average of the two middle numbers.

**Mode:**
The mode is the most frequently occurring element in the dataset.
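
As a quick illustration of the three definitions above, here is a minimal sketch using Python's standard `statistics` module. The list `values` is a hypothetical toy sample, not part of the wine dataset.

```python
import statistics

# Hypothetical toy sample, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # sum of all values / number of values
median_value = statistics.median(values)  # even count: average of the two middle values
mode_value = statistics.mode(values)      # most frequently occurring value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Because the sample has an even number of observations, the median is the average of the two middle values (3 and 5).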

### Measures of Dispersion

In statistics, **dispersion** describes how spread out or squeezed together a distribution of data is.

**Range:**
The range is the difference between the largest and smallest values in the dataset.

**Interquartile Range (IQR):**
The IQR is the range between the first quartile (Q1) and the third quartile (Q3). It measures the spread of the middle 50% of the data.

**Quartiles:**
Quartiles are a type of quantile that divide the number of data points into four parts. The three quartiles result in four divisions:

- **First Quartile (Q1):** Also known as the lower quartile, it is defined as the 25th percentile where 25% of data is below this point.
- **Second Quartile (Q2):** This is the median of the dataset, so 50% of data is below this point.
- **Third Quartile (Q3):** Also known as the upper quartile, it is defined as the 75th percentile where 75% of data is below this point.

**Percentile:**
A percentile is the value below which a given percentage of the data falls. For example, the 90th percentile is the value below which 90% of the data lies.

**Quantiles:**
Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile than the number of groups created.
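
The range, quartiles, and IQR described above can be sketched with NumPy as follows. The array `values` is a hypothetical toy sample. Note that several quartile conventions exist; `np.percentile` uses linear interpolation by default, so other tools may return slightly different cut points on small samples.

```python
import numpy as np

# Hypothetical toy sample, already sorted for readability.
values = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17])

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(values, [25, 50, 75])

data_range = values.max() - values.min()  # largest value minus smallest value
iqr = q3 - q1                             # spread of the middle 50% of the data

print("Q1, Q2, Q3:", q1, q2, q3)
print("range:", data_range)
print("IQR:", iqr)
```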

**Variance:**
Variance measures the average squared deviation of each data point from the mean. It is a measure of how far a set of numbers is spread out from their average value.

$$ \text{Variance} (\sigma^2) = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} $$

**Standard Deviation:**
It is the square root of the variance. It is in the same unit as the data.

$$ \text{Standard Deviation} (\sigma) = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}} $$
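
The formulas above divide by $n$ (population variance). Many tools, including pandas' `std()` and `describe()`, divide by $n - 1$ by default (sample variance). The sketch below, on a hypothetical toy sample, shows both through NumPy's `ddof` argument.

```python
import numpy as np

# Hypothetical toy sample with mean 5.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# ddof=0 divides by n, matching the population formulas above.
population_var = np.var(values, ddof=0)
population_std = np.std(values, ddof=0)

# ddof=1 divides by n - 1, giving the sample versions.
sample_var = np.var(values, ddof=1)
sample_std = np.std(values, ddof=1)

print("population variance / std:", population_var, population_std)
print("sample variance / std:", sample_var, sample_std)
```

For large datasets the two versions are nearly identical, but on small samples the difference is noticeable.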

### Measures of Shape

**Skewness:**
It measures the asymmetry of the data distribution. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail.

$$ \text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3 $$

**Kurtosis:**
Kurtosis measures the "tailedness" of the data distribution. High kurtosis indicates heavy tails and sharp peaks, while low kurtosis indicates light tails and flat peaks.

$$ \text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} $$
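
The kurtosis formula above can be sketched directly in NumPy. This is an illustrative implementation, assuming that $\sigma$ is taken as the sample standard deviation (`ddof=1`), which is the usual convention for this bias-corrected form. The function name `sample_excess_kurtosis` is hypothetical.

```python
import numpy as np

def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis, following the formula above.

    Assumption: sigma is the sample standard deviation (ddof=1).
    """
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 4:
        raise ValueError("at least 4 observations are required")

    # Standardize each observation.
    z = (x - x.mean()) / x.std(ddof=1)

    # First term: scaled sum of fourth powers.
    scaled_sum = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(z ** 4)
    # Second term: correction that makes a normal distribution score about 0.
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scaled_sum - correction

# Evenly spaced values have lighter tails than a normal distribution,
# so the excess kurtosis is negative.
print(sample_excess_kurtosis([1, 2, 3, 4, 5]))
```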

### Measures of Association

In statistics, **measures of association** are a range of statistical tools that can quantify the strength and direction of relationships between two or more variables.

**Covariance:**
It is the measure of joint variability of two random variables (how much they vary together).

$$ \text{Cov}(X, Y) = E[XY] - E[X]E[Y] $$
where E[X] is the mean of X.

**Pearson's Correlation Coefficient (r):**
It measures the strength and direction of the linear relationship between two variables. In simple linear regression, the coefficient of determination $R^2$ is the square of $r$.

$$ r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

- Here, $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$.
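
The two formulas above can be checked on a small example. This is a minimal sketch with hypothetical toy arrays `x` and `y`. It uses population quantities throughout (division by $n$); the result is cross-checked against `np.corrcoef`.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Cov(X, Y) = E[XY] - E[X]E[Y]
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# r = Cov(X, Y) / (sigma_X * sigma_Y), using population standard deviations
r = cov_xy / (np.std(x) * np.std(y))

print("covariance:", cov_xy)
print("Pearson r:", r)
print("np.corrcoef:", np.corrcoef(x, y)[0, 1])
```

The correlation coefficient comes out the same whether population or sample formulas are used, as long as the covariance and both standard deviations use the same convention.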

#Let's now use these measures on a popular dataset, "WINEQUALITY", to see how they help in understanding data without visualizing it.

In [2]:
data = pd.read_csv("./data/winequality-white.csv", sep= ";")

In [3]:
# Let's see the total number of rows and columns in our dataset
print(data.shape)
data.info()  # Let's see additional info about the data (dtypes, non-null counts)

(4898, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


#### We have 4898 rows and 12 columns. 
- Most of the values are of float64 type, except one int64. 
- There are no null (i.e., missing) values in our dataset.
- Now let's see the first and last 5 values of this dataset.

In [4]:
data.head()  # by default, returns the first 5 rows

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [5]:
data.tail()  # by default, returns the last 5 rows

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.5 | 11.2 | 6 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.9949 | 3.15 | 0.46 | 9.6 | 5 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 4896 | 5.5 | 0.29 | 0.3 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |
| 4897 | 6.0 | 0.21 | 0.38 | 0.8 | 0.02 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 |


#### Out of these 12 columns, "quality" is our target value, i.e., the dependent variable. The rest are independent variables (input features).

In [6]:
# To look at several of the statistical measures discussed above, we can use the pandas "describe()" method
data.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 | 4898.0 |
| mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
| std | 0.843868 | 0.100795 | 0.12102 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 2.0 | 9.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.3 | 0.21 | 0.27 | 1.7 | 0.036 | 23.0 | 108.0 | 0.991723 | 3.09 | 0.41 | 9.5 | 5.0 |
| 50% | 6.8 | 0.26 | 0.32 | 5.2 | 0.043 | 34.0 | 134.0 | 0.99374 | 3.18 | 0.47 | 10.4 | 6.0 |
| 75% | 7.3 | 0.32 | 0.39 | 9.9 | 0.05 | 46.0 | 167.0 | 0.9961 | 3.28 | 0.55 | 11.4 | 6.0 |
| max | 14.2 | 1.1 | 1.66 | 65.8 | 0.346 | 289.0 | 440.0 | 1.03898 | 3.82 | 1.08 | 14.2 | 9.0 |


#### Above we can see the following:

**For the Fixed Acidity column:**

- The mean value is about 6.855.
- The standard deviation is about 0.84, which is small relative to the mean, indicating that the data points are concentrated near the mean value.
- The 25th percentile indicates that 25% of the data points are below 6.3.
- The 50th percentile indicates that 50% of the data points are below 6.8 (this is also the median value).
- The 75th percentile indicates that 75% of the data points are below 7.3.
- The mean (6.855) is slightly greater than the median (6.8), which suggests that this column is slightly right-skewed (positively skewed).

Further, the IQR gives us a value of 1.0, meaning the middle 50% of the data is in about a 1-unit range, indicating a tight concentration around the center.
A common rule is that values more than 1.5 times the IQR below Q1 or above Q3 are considered potential outliers.
$$ \text{lower bound} = Q1 - 1.5 \times \text{IQR} $$
$$ \text{upper bound} = Q3 + 1.5 \times \text{IQR} $$

- The comparison of the 75th percentile with the maximum value, and the 25th percentile with the minimum value, shows that there are outliers present:
    - The minimum value (3.8) is significantly lower than the lower bound (4.8), indicating the presence of low outliers.
    - The maximum value (14.2) is significantly higher than the upper bound (8.8), indicating the presence of high outliers.
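
The outlier check above is plain arithmetic on the `describe()` output. Here is a minimal sketch that reproduces it for the fixed acidity column, with the quartiles, minimum, and maximum typed in by hand from the table above. The variable names are illustrative.

```python
# Values copied from the describe() output for "fixed acidity".
q1, q3 = 6.3, 7.3
minimum, maximum = 3.8, 14.2

iqr = q3 - q1                    # spread of the middle 50% of the data
lower_bound = q1 - 1.5 * iqr     # values below this are potential low outliers
upper_bound = q3 + 1.5 * iqr     # values above this are potential high outliers

has_low_outliers = minimum < lower_bound
has_high_outliers = maximum > upper_bound

print("IQR:", round(iqr, 2))
print("bounds:", round(lower_bound, 2), round(upper_bound, 2))
print("low outliers present:", has_low_outliers)
print("high outliers present:", has_high_outliers)
```

Comparing only the minimum and maximum with the bounds tells us whether at least one outlier exists on each side, not how many there are.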

A similar analysis can be applied to each column to understand how its values behave.

**For the Volatile Acidity column:**
- The standard deviation indicates a relatively tight spread around the mean.
- The data points in this column are slightly right-skewed as the mean is greater than the median.
- The IQR shows a tight concentration of the middle 50% of the data.
- Outliers: Yes

**For the Citric Acid column:**
- The standard deviation shows a moderate spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR indicates a moderate concentration of the middle 50% of the data.
- Outliers: Yes

**For the Residual Sugar column:**
- The standard deviation shows a large spread around the mean.
- The data points in this column are heavily right-skewed.
- The IQR shows a significant spread of the middle 50% of the data.
- Outliers: Yes

**For the Chlorides column:**
- The standard deviation indicates a relatively tight spread around the mean.
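- The data points in this column are right-skewed: the mean is above the median, and the maximum (0.346) is roughly seven times the 75th percentile (0.05), which points to a long upper tail.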
- The IQR shows a tight concentration of the middle 50% of the data.
- Outliers: Yes

**For the Free Sulfur Dioxide column:**
- The standard deviation indicates a large spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR shows a significant spread of the middle 50% of the data.
- Outliers: Yes

**For the Total Sulfur Dioxide column:**
- The standard deviation shows a large spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR shows a significant spread of the middle 50% of the data.
- Outliers: Yes

**For the Density column:**
- The standard deviation shows a tight spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR suggests a tight concentration of the middle 50% of the data.
- Outliers: Yes

**For the pH column:**
- The standard deviation shows a moderate spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR shows a moderate concentration of the middle 50% of the data.
- Outliers: Yes

**For the Sulphates column:**
- The standard deviation shows a moderate spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR indicates a moderate concentration of the middle 50% of the data.
- Outliers: Yes

**For the Alcohol column:**
- The standard deviation shows a moderate spread around the mean.
- The data points in this column are slightly right-skewed.
- The IQR shows a moderate concentration of the middle 50% of the data.
- Outliers: No

**For the Quality column:**
- The standard deviation shows a moderate spread around the mean.
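- The mean (5.88) is slightly below the median (6.0), so the mean-versus-median comparison does not indicate right skew here; the distribution looks roughly symmetric.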
- The IQR indicates a moderate concentration of the middle 50% of the data.
- Outliers: Yes




In [7]:
# Taking a look at the quality column (target value)
data.quality.unique()

array([6, 5, 7, 8, 4, 3, 9], dtype=int64)

In [8]:
data.quality.value_counts()

quality
6    2198
5    1457
7     880
8     175
4     163
3      20
9       5
Name: count, dtype: int64

Quality level 6 is the most common rating (2198 of 4898 wines), indicating an average quality level for the largest group of wines.
Only a few wines receive very high (8 to 9) or very low (3 to 4) ratings, so the distribution is concentrated in the middle range (5 to 7).
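
To put numbers on this observation, the counts above can be turned into shares. This is a small sketch in plain Python, with the counts copied by hand from the `value_counts()` output. The dictionary name `quality_counts` is illustrative.

```python
# Counts copied from the value_counts() output above.
quality_counts = {6: 2198, 5: 1457, 7: 880, 8: 175, 4: 163, 3: 20, 9: 5}

total = sum(quality_counts.values())
shares = {quality: count / total for quality, count in quality_counts.items()}

# The mode is the rating with the highest count.
mode_quality = max(quality_counts, key=quality_counts.get)

# Share of wines rated in the middle range (5 to 7).
middle_share = sum(shares[q] for q in (5, 6, 7))

print("total wines:", total)
print("mode:", mode_quality)
for quality in sorted(shares):
    print(f"quality {quality}: {shares[quality]:.1%}")
print(f"share rated 5 to 7: {middle_share:.1%}")
```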

#### We can quantify the skewness of each distribution by calculating its value, to confirm our observations.

In [9]:
def calculate_skewness(data):
    """
    Calculate the skewness of a dataset.

    Parameters:
    - data (list or pd.Series): The data for which to calculate skewness.

    Returns:
    - float: The skewness of the data.
    """
    # Convert the input (list or pandas Series) to a NumPy array so that
    # the element-wise arithmetic below works for both input types
    data = np.asarray(data, dtype=float)

    # Calculate mean and standard deviation
    mean = np.mean(data)
    # Population standard deviation (ddof=0). Note: the n / ((n - 1)(n - 2))
    # correction is normally paired with the sample standard deviation (ddof=1);
    # for a dataset of this size the difference is very small.
    std_dev = np.std(data, ddof=0)

    # Calculate skewness
    n = len(data)
    skewness = (n / ((n - 1) * (n - 2))) * np.sum(((data - mean) / std_dev) ** 3)

    return skewness

# Alternatively, we can use a library such as scipy.stats.

In [10]:
for column in data.columns:
    print(f'Skewness of {column}: ', calculate_skewness(data[column]))

Skewness of fixed acidity:  0.6479498975036447
Skewness of volatile acidity:  1.5774625722236126
Skewness of citric acid:  1.282313083231465
Skewness of residual sugar:  1.0774236978398457
Skewness of chlorides:  5.024869457659677
Skewness of free sulfur dioxide:  1.4071758425442207
Skewness of total sulfur dioxide:  0.39082952608925625
Skewness of density:  0.9780725217936713
Skewness of pH:  0.45792277644156953
Skewness of sulphates:  0.9774930227702094
Skewness of alcohol:  0.48749127855571994
Skewness of quality:  0.15584412215078905


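Skewness values greater than 0 indicate a positively skewed distribution. Skewness values less than 0 indicate a negatively skewed distribution. A skewness value equal to 0 suggests a symmetric distribution.
(The value won't be exactly equal to 0, so values sufficiently close to 0 can be treated as 0.)
In the output above, every column has positive skewness: chlorides (about 5.02) is by far the most skewed, while quality (about 0.16) is close to symmetric.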
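
To see the sign convention in action, here is a small NumPy sketch of the bias-corrected skewness. It assumes the sample standard deviation (`ddof=1`), which is the usual pairing for the $n / ((n-1)(n-2))$ factor. The function name `adjusted_skewness` and the two toy samples are hypothetical. In practice, `pandas.Series.skew()` or `scipy.stats.skew(..., bias=False)` can be used instead of a hand-written function.

```python
import numpy as np

def adjusted_skewness(values):
    """Bias-corrected skewness, assuming the sample standard deviation (ddof=1)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    if n < 3:
        raise ValueError("at least 3 observations are required")
    z = (x - x.mean()) / x.std(ddof=1)  # standardized observations
    return n / ((n - 1) * (n - 2)) * np.sum(z ** 3)

# Hypothetical toy samples.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]               # mirror image around the mean

print("right-tailed sample:", adjusted_skewness(right_tailed))
print("symmetric sample:", adjusted_skewness(symmetric))
```

The right-tailed sample gives a clearly positive value, while the symmetric sample gives a value at (or numerically very close to) zero.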

In [11]:
# Let's now calculate the covariance of each pair of columns
for col1 in data.columns:
    for col2 in data.columns:
        print(f"COV({col1},{col2}):", data[col1].cov(data[col2]))


# We can also implement a function to calculate the covariance values.
# Note: this function divides by n (population covariance), whereas
# Series.cov above divides by n - 1 by default, so the two results
# differ by a factor of (n - 1) / n.
def calc_covariance(data):
    """
        Calculates the Covariance Matrix
    Args:
        data (DataFrame) : dataset
    Returns:
        cov_matrix (dict) : covariance of each (col1, col2) pair
    """
    # Calculate the mean of each column
    means = {col: data[col].mean() for col in data.columns}

    # Prepare an empty dictionary to store covariance results
    cov_matrix = {}

    # Iterate over all pairs of columns
    for col1 in data.columns:
        for col2 in data.columns:
            # Calculate covariance as E[XY] - E[X]E[Y]
            col1Xcol2 = data[col1] * data[col2]  # Element-wise product X * Y
            mean_col1Xcol2 = col1Xcol2.mean()  # Its mean gives E[XY]
            cov = mean_col1Xcol2 - (means[col1] * means[col2])  # Final COV value

            # Store covariance in the dictionary
            cov_matrix[(col1, col2)] = cov

    return cov_matrix

# The covariance matrix is simply the covariance of every possible pair of columns

COV(fixed acidity,fixed acidity): 0.7121135857004641
COV(fixed acidity,volatile acidity): -0.001930570601679193
COV(fixed acidity,citric acid): 0.029532511571780048
COV(fixed acidity,residual sugar): 0.381021813652795
COV(fixed acidity,chlorides): 0.0004256255361050133
COV(fixed acidity,free sulfur dioxide): -0.7089186423667687
COV(fixed acidity,total sulfur dioxide): 3.266013392629699
COV(fixed acidity,density): 0.000669677255776885
COV(fixed acidity,pH): -0.05426482595364051
COV(fixed acidity,sulphates): -0.0016509922909276956
COV(fixed acidity,alcohol): -0.12553282191892606
COV(fixed acidity,quality): -0.08494730942928613
COV(volatile acidity,fixed acidity): -0.001930570601679193
COV(volatile acidity,volatile acidity): 0.010159540992172523
COV(volatile acidity,citric acid): -0.0018232775514512808
COV(volatile acidity,residual sugar): 0.032865333683183506
COV(volatile acidity,chlorides): 0.000155277485703241
COV(volatile acidity,free sulfur dioxide): -0.16630045911893623
COV(volatile

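Above we can see:
- **Positive Covariance**: when one variable increases, the other tends to increase as well.
- **Negative Covariance**: when one variable increases, the other tends to decrease.
- **Zero Covariance**: the two variables show no linear relationship.
(The value won't be exactly equal to 0, so sufficiently small values can be treated as 0. Because covariance depends on the units of both variables, "small" has to be judged relative to their scales.)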
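
One detail worth knowing when comparing covariance values: the formula $E[XY] - E[X]E[Y]$ computed with plain means divides by $n$ (population covariance), while pandas' `Series.cov` divides by $n - 1$ by default (sample covariance). The sketch below shows the relationship on hypothetical toy arrays, using NumPy's `np.cov` with `ddof=1` as the sample version.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])
n = x.size

# Population covariance: E[XY] - E[X]E[Y], i.e. division by n.
population_cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# Sample covariance: division by n - 1.
sample_cov = np.cov(x, y, ddof=1)[0, 1]

print("population covariance:", population_cov)
print("sample covariance:", sample_cov)
print("ratio sample / population:", sample_cov / population_cov)
```

The ratio equals $n / (n - 1)$, so for a dataset with thousands of rows the two versions are almost identical.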


If we further divide the covariance values $ \text{COV}(X, Y) $ by the product of the standard deviations of $ X $ and $ Y $, we obtain the Pearson Correlation Coefficient. This coefficient provides information about the strength and direction of the linear relationship between the two variables.

- **Strength**: The higher the absolute value of this coefficient, the more strongly the variables follow a linear relationship. For example, a coefficient close to 1 or -1 indicates a strong linear relationship.

- **Direction**: A positive coefficient means that as one variable increases, the other variable also tends to increase (both increase together). A negative coefficient means that as one variable increases, the other variable tends to decrease (one increases while the other decreases).

In [12]:
# Pearson correlation: covariance divided by the product of the standard deviations
for col1 in data.columns:
    for col2 in data.columns:
        print(f"CORR({col1},{col2}):", (data[col1].cov(data[col2]) / (data[col1].std() * data[col2].std())))

CORR(fixed acidity,fixed acidity): 0.9999999999999997
CORR(fixed acidity,volatile acidity): -0.02269729014664708
CORR(fixed acidity,citric acid): 0.2891806976936752
CORR(fixed acidity,residual sugar): 0.08902070136217158
CORR(fixed acidity,chlorides): 0.023085643656347788
CORR(fixed acidity,free sulfur dioxide): -0.04939585908117314
CORR(fixed acidity,total sulfur dioxide): 0.09106975615864087
CORR(fixed acidity,density): 0.2653310138391866
CORR(fixed acidity,pH): -0.425858290991382
CORR(fixed acidity,sulphates): -0.01714298502113732
CORR(fixed acidity,alcohol): -0.12088112319453301
CORR(fixed acidity,quality): -0.11366283071301789
CORR(volatile acidity,fixed acidity): -0.02269729014664708
CORR(volatile acidity,volatile acidity): 1.0
CORR(volatile acidity,citric acid): -0.14947181064857515
CORR(volatile acidity,residual sugar): 0.06428606009099527
CORR(volatile acidity,chlorides): 0.07051157147938474
CORR(volatile acidity,free sulfur dioxide): -0.09701193927796005
CORR(volatile acidity,total sulfur dioxi

#### We covered basic statistics tools, which help us understand data through numbers. But looking at all these numbers can quickly fry your brain. To make sense of data in a visual way, we use Exploratory Data Analysis (EDA). EDA involves creating charts and graphs to visually explore and understand data patterns and relationships.

##### Note: further details are in pen and paper statistics. There are still many more measures we can use to understand what a dataset says; I will be covering those gradually.