## DESCRIPTIVE STATISTICS


In this session we'll look at how to summarize and understand data using Python. Descriptive statistics provide simple summaries about the data, helping us understand its central tendency, variability, and overall distribution.

Key Types of Descriptive Statistics:

**Measures of Central Tendency:**
- Mean: The average value.
- Median: The middle value in sorted data.
- Mode: The most frequently occurring value.

**Measures of Dispersion:**
- Range: Difference between the largest and smallest values.
- Variance: Average of the squared deviations from the mean.
- Standard Deviation: Square root of variance.
- Interquartile Range (IQR): Spread of the middle 50% of the data.
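
Before turning to the real dataset, the measures listed above can be sketched on a handful of numbers. The values below are hypothetical and chosen only so the arithmetic is easy to check by hand; note that pandas computes the sample variance (dividing by n - 1) by default.

```python
import pandas as pd

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow
values = pd.Series([2, 4, 4, 6, 9])

mean_value = values.mean()            # (2 + 4 + 4 + 6 + 9) / 5
median_value = values.median()        # middle value of the sorted data
mode_value = values.mode().iloc[0]    # most frequent value
value_range = values.max() - values.min()
variance = values.var()               # pandas divides by n - 1 by default
std_dev = values.std()                # square root of the variance
iqr = values.quantile(0.75) - values.quantile(0.25)

print(mean_value, median_value, mode_value, value_range, variance, std_dev, iqr)
```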

In [11]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [27]:
# load sample dataset
df = pd.read_csv('penguins.csv')  # penguins dataset, read here from a local CSV file

# Display the first few rows
df.head()

|   | Unnamed: 0 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         344 non-null    int64  
 1   species            344 non-null    object 
 2   island             344 non-null    object 
 3   bill_length_mm     342 non-null    float64
 4   bill_depth_mm      342 non-null    float64
 5   flipper_length_mm  342 non-null    float64
 6   body_mass_g        342 non-null    float64
 7   sex                333 non-null    object 
 8   year               344 non-null    int64  
dtypes: float64(4), int64(2), object(3)
memory usage: 24.3+ KB


In [28]:
df.columns

Index(['Unnamed: 0', 'species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')
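
The outputs that follow no longer include the `Unnamed: 0` column, so it was presumably dropped in a cell that is not shown. The sketch below shows one way this could be done, using a small hypothetical frame (`demo`) rather than the real data; it is an assumption about the missing step, not the original cell.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the leftover index column
demo = pd.DataFrame({
    "Unnamed: 0": [1, 2],
    "species": ["Adelie", "Adelie"],
    "body_mass_g": [3750.0, 3800.0],
})

# Drop the redundant column by name; the original frame is left untouched
demo_clean = demo.drop(columns=["Unnamed: 0"])
print(demo_clean.columns.tolist())
```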

In [34]:
df.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


#### Measures of Central Tendency

Measures of Central Tendency are statistical metrics that represent the "center" or "middle" of a dataset. They summarize a set of data by identifying a single value that best describes the data's central point, around which most data values are distributed.

_Choosing the Right Measure_

**Mean (AKA average):**
- Best for symmetrical distributions (e.g., normal distributions).
- Sensitive to outliers, which can skew the result.

**Median (AKA middle number):**
- Better for skewed distributions or datasets with outliers.
- Represents the 50th percentile (middle value).

**Mode (AKA highest occurrence):**
- Useful for categorical data or when identifying the most common value is important.
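
To see why the median is preferred when outliers are present, here is a small sketch with hypothetical numbers: adding a single extreme value moves the mean a long way but barely shifts the median.

```python
import pandas as pd

# Hypothetical values; the last entry of the second series is an outlier
typical = pd.Series([10, 12, 13, 14, 16])
with_outlier = pd.Series([10, 12, 13, 14, 16, 100])

print("Mean:  ", typical.mean(), "->", with_outlier.mean())      # 13.0 -> 27.5
print("Median:", typical.median(), "->", with_outlier.median())  # 13.0 -> 13.5
```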

In [36]:
# finding the sum
mass_sum = df['body_mass_g'].sum()

# finding the count
count_masses = len(df['body_mass_g'])  # len() also counts rows with missing values, so this average comes out lower than df.mean()

average = mass_sum/count_masses
print(average)

4177.325581395349
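
The gap between this manual average and the result of `df.mean()` comes from the denominator. The sketch below uses a hypothetical four-value series (not the penguin data) to contrast `len()`, which counts every row, with `.count()`, which counts only the non-missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical body masses with one missing value
masses = pd.Series([3750.0, 3800.0, np.nan, 3450.0])

total = masses.sum()              # NaN is skipped: 11000.0
rows = len(masses)                # 4, includes the missing entry
non_missing = masses.count()      # 3, excludes the missing entry

print(total / rows)               # 2750.0, biased low
print(total / non_missing)        # matches masses.mean()
print(masses.mean())
```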


In [37]:
# Calculate mean of numerical columns
# numeric_only=True --> this calculates the mean on numerical columns alone

mean_values = df.mean(numeric_only=True)
print("Mean:\n", mean_values)

Mean:
 bill_length_mm         43.921930
bill_depth_mm          17.151170
flipper_length_mm     200.915205
body_mass_g          4201.754386
year                 2008.029070
dtype: float64


In [38]:
# Calculate the Median
median_values = df.median(numeric_only=True)
print("Median: \n", median_values)

Median: 
 bill_length_mm         44.45
bill_depth_mm          17.30
flipper_length_mm     197.00
body_mass_g          4050.00
year                 2008.00
dtype: float64


In [40]:
# Calculate the Mode
mode_values = df.mode()
print("Mode: \n", mode_values)

Mode: 
   species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Biscoe            41.1           17.0              190.0   

   body_mass_g   sex  year  
0       3800.0  male  2009  


In [41]:
df['sex'].value_counts()

sex
male      168
female    165
Name: count, dtype: int64

#### Measures of Dispersion

Measures of Dispersion are statistical metrics that describe how spread out or scattered the data values are around the center (e.g., the mean or median). While measures of central tendency summarize the "center" of the data, measures of dispersion tell us about its variability.

**When to Use Measures of Dispersion**

- Range: For quick insights into total variability.
- Interquartile Range (IQR): To analyze variability without being affected by outliers.
- Variance: For detailed variability, expressed in squared units of the data.
- Standard Deviation: For detailed variability in the same units as the data.
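
As a rough illustration of the outlier point, the sketch below compares the range and the IQR on hypothetical values with and without one extreme observation. The helper `spread` is made up for this example. The range jumps from 6 to 90, while the IQR only moves from 2 to 3.25.

```python
import pandas as pd

# Hypothetical values; the second series adds one extreme observation
typical = pd.Series([10, 12, 13, 14, 16])
with_outlier = pd.Series([10, 12, 13, 14, 16, 100])

def spread(values):
    """Return (range, IQR) for a numeric Series."""
    value_range = values.max() - values.min()
    iqr = values.quantile(0.75) - values.quantile(0.25)
    return value_range, iqr

print(spread(typical))
print(spread(with_outlier))
```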

In [45]:
df.describe()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 | 344.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 | 2008.02907 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 | 0.818356 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 | 2007.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 | 2007.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 | 2008.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 | 2009.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 | 2009.0 |


In [46]:
# Find the maximum in each of the numerical columns
df.max(numeric_only=True)

bill_length_mm         59.6
bill_depth_mm          21.5
flipper_length_mm     231.0
body_mass_g          6300.0
year                 2009.0
dtype: float64

In [47]:
# Find the minimum value in each of the numerical columns
df.min(numeric_only=True)

bill_length_mm         32.1
bill_depth_mm          13.1
flipper_length_mm     172.0
body_mass_g          2700.0
year                 2007.0
dtype: float64

In [48]:
# Calculate the Range (The maximum value - The minimum value)
df_range = df.max(numeric_only=True) - df.min(numeric_only=True)
print("Range:\n", df_range)

Range:
 bill_length_mm         27.5
bill_depth_mm           8.4
flipper_length_mm      59.0
body_mass_g          3600.0
year                    2.0
dtype: float64


In [49]:
# Calculate Variance
variance_values = df.var(numeric_only=True)
print("Variance:\n", variance_values)

Variance:
 bill_length_mm           29.807054
bill_depth_mm             3.899808
flipper_length_mm       197.731792
body_mass_g          643131.077327
year                      0.669706
dtype: float64


In [50]:
# Calculate the standard deviation
std_values = df.std(numeric_only=True)
print("Standard Deviation:\n", std_values)

Standard Deviation:
 bill_length_mm         5.459584
bill_depth_mm          1.974793
flipper_length_mm     14.061714
body_mass_g          801.954536
year                   0.818356
dtype: float64
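
Two details behind the variance and standard deviation outputs are worth checking on a small example: the standard deviation is the square root of the variance, and pandas divides by n - 1 (sample variance) while NumPy's `np.var` divides by n unless `ddof=1` is passed. The values below are hypothetical.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 6, 9])   # hypothetical values

sample_var = values.var()                     # pandas default ddof=1: divides by n - 1
population_var = np.var(values.to_numpy())    # NumPy default ddof=0: divides by n
std_dev = values.std()                        # pandas default ddof=1 as well

print(sample_var, population_var)             # the two conventions give different results
print(std_dev, np.sqrt(sample_var))           # std is the square root of the variance
```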


In [51]:
# Calculate IQR

# Select only numeric columns
numeric_data = df.select_dtypes(include='number')

Q1 = numeric_data.quantile(0.25)  # 0.25 is the same thing as 25% = 25/100
Q3 = numeric_data.quantile(0.75)  # 0.75 is the same thing as 75% = 75/100

IQR = Q3 - Q1

print("Interquartile Range (IQR):\n", IQR)

Interquartile Range (IQR):
 bill_length_mm          9.275
bill_depth_mm           3.100
flipper_length_mm      23.000
body_mass_g          1200.000
year                    2.000
dtype: float64


In [52]:
numeric_data.head()

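|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 | 2007 |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 | 2007 |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 | 2007 |
| 3 | NaN | NaN | NaN | NaN | 2007 |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 | 2007 |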


## Analyzing relationships between variables: Measures of Association

Measures of association quantify the relationship between two variables: they tell us whether and how the variables are related. There are two main kinds: covariance and correlation.

Key measures of association include:

1. Covariance: Describes the direction of the relationship.
2. Correlation, which comes in two types:
   - Pearson Correlation: Measures the strength and direction of a linear relationship.
   - Spearman Correlation: Measures the strength and direction of a monotonic (not necessarily linear) relationship.


**1. Covariance**

Covariance measures how two variables change together. It indicates the **direction of the relationship but not the strength**.

- A positive covariance means that as one variable increases, the other tends to increase.
- A negative covariance means that as one variable increases, the other tends to decrease.
- A covariance close to 0 indicates no linear relationship.

Example: Let’s say we have two variables: hours of study and exam scores.
- If exam scores increase as study time increases, the covariance will be positive.
- If more study time leads to lower scores, the covariance will be negative.
- If study time has no effect on scores, the covariance will be close to zero.

Formula:

$$\text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

- $X_i$: Data points of variable $X$
- $Y_i$: Data points of variable $Y$
- $\bar{X}$: Mean of $X$
- $\bar{Y}$: Mean of $Y$
- $n$: Number of data points
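
The formula can be followed step by step in code. This is a minimal sketch using two short illustrative arrays, with `np.cov` included only as a cross-check on the manual result.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Deviations of each point from its own mean
dx = x - x.mean()
dy = y - y.mean()

# Sum of the products of paired deviations, divided by n - 1
manual_cov = (dx * dy).sum() / (len(x) - 1)

print(manual_cov)            # 5.0
print(np.cov(x, y)[0, 1])    # NumPy's sample covariance, for comparison
```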


Limitations:

It is difficult to interpret directly because its value depends on the scale of the variables. For instance, if one variable is measured in dollars and the other in kilometers, the covariance won’t be very meaningful.


**Correlation**

Correlation is a more standardized measure of the relationship between two variables. It tells you how strongly and in what direction the variables are related, but it scales the relationship to a range between -1 and 1:

- A correlation of 1 means a perfect positive relationship (as one variable increases, the other also increases proportionally).
- A correlation of -1 means a perfect negative relationship (as one variable increases, the other decreases proportionally).
- A correlation of 0 means no linear relationship (the variables may still be related in a non-linear way).

Example:
- If more hours of study strongly lead to higher exam scores, you might get a correlation close to 1.
- If more hours of study somehow lead to lower exam scores, the correlation might be close to -1.
- If study time has no effect, the correlation will be around 0.

#### Difference between correlation and covariance

While covariance tells you the direction of the relationship, correlation also tells you the strength of the relationship in a clear, understandable way (between -1 and 1). Correlation removes the issue of different units, making it easier to compare relationships between variables.
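
The unit issue can be made concrete with a short sketch. The measurements below are hypothetical: the same distances are expressed first in kilometres and then in metres. The covariance changes by the conversion factor, while the correlation stays the same.

```python
import numpy as np

# Hypothetical measurements: distance in kilometres and cost in dollars
distance_km = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
cost_usd = np.array([2.0, 3.0, 7.0, 8.0, 10.0])

distance_m = distance_km * 1000   # same distances, different unit

cov_km = np.cov(distance_km, cost_usd)[0, 1]
cov_m = np.cov(distance_m, cost_usd)[0, 1]

corr_km = np.corrcoef(distance_km, cost_usd)[0, 1]
corr_m = np.corrcoef(distance_m, cost_usd)[0, 1]

print(cov_km, cov_m)     # covariance grows by a factor of 1000
print(corr_km, corr_m)   # correlation is unchanged
```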


**2. Pearson Correlation Coefficient**

The Pearson correlation coefficient (r) **measures the strength and direction** of a linear relationship between two variables. It is a normalized version of covariance and ranges from -1 to 1.

- $r=+1$: Perfect positive linear relationship.
- $r=-1$: Perfect negative linear relationship.
- $r=0$: No linear relationship.

Formula:

$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

- $r$: The Pearson correlation coefficient.
- $\text{Cov}(X, Y)$: Covariance of $X$ and $Y$.
- $\sigma_X$: Standard deviation of $X$.
- $\sigma_Y$: Standard deviation of $Y$.
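
As a sketch of how the formula maps to code, the example below divides the sample covariance by the product of the two sample standard deviations and compares the result with `np.corrcoef`. The data are hypothetical, and `ddof=1` is used so that the standard deviations follow the same n - 1 convention as `np.cov`.

```python
import numpy as np

# Hypothetical paired values with an imperfect linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 7.0, 8.0, 10.0])

cov_xy = np.cov(x, y)[0, 1]       # sample covariance (divides by n - 1)
sigma_x = np.std(x, ddof=1)       # sample standard deviation of x
sigma_y = np.std(y, ddof=1)       # sample standard deviation of y

r_manual = cov_xy / (sigma_x * sigma_y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)          # the two values agree up to rounding
```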

**3. Spearman Rank Correlation**

The Spearman correlation coefficient (ρ) measures the strength and direction of a monotonic relationship (not necessarily linear). Instead of the raw values, it works on their ranks, that is, the positions of the values in sorted order within each variable.

- Suitable for non-linear but monotonic relationships.
- Less sensitive to outliers than Pearson correlation.

Formula:

$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

- $d_i$ is the difference between the ranks of the $i$-th observation in $X$ and $Y$.
- $n$ is the number of observations.
- This shortcut formula is exact only when there are no tied ranks.
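
The ranking step and the formula can be traced on a small example. The sketch below uses hypothetical observations with no tied values, applies the formula directly, and cross-checks the result against the Pearson correlation of the ranks, which is an equivalent way to obtain Spearman's coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations with no tied values
x = pd.Series([10, 20, 30, 40, 50])
y = pd.Series([3, 1, 4, 9, 7])

rank_x = x.rank()                 # 1.0, 2.0, 3.0, 4.0, 5.0
rank_y = y.rank()                 # 2.0, 1.0, 3.0, 5.0, 4.0

d = rank_x - rank_y               # rank differences
n = len(x)
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Cross-check: Spearman's rho is the Pearson correlation of the ranks
rho_check = np.corrcoef(rank_x, rank_y)[0, 1]

print(rho, rho_check)
```

Here the squared rank differences sum to 4, so ρ = 1 - 24/120 = 0.8.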

In [54]:
# Covariance
import numpy as np

# Example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Calculate covariance
covariance = np.cov(x,y)[0,1]  # Extracts the covariance value
print("Covariance:", covariance)

Covariance: 5.0


**How is this?🤔** Let's break it down.

First, we know that:

`x`: Independent variable (e.g., input values).

`y`: Dependent variable, linearly dependent on x (y=2x).

Next, the Covariance Matrix is calculated:

`np.cov(x, y)` computes the covariance matrix for x and y.

A covariance matrix is a 2x2 matrix where:

- Element `[0, 0]`: Variance of x.
- Element `[1, 1]`: Variance of y.
- Element `[0, 1]` or `[1, 0]`: Covariance between x and y.

So the matrix holds three distinct quantities: the variance of x, the variance of y, and the covariance between x and y (which appears twice because the matrix is symmetric).

$$
\text{Covariance Matrix} =
\begin{bmatrix}
\text{Var}(x) & \text{Cov}(x, y) \\
\text{Cov}(y, x) & \text{Var}(y)
\end{bmatrix}
$$

Now, the Covariance Value is calculated:

`covariance = np.cov(x, y)[0, 1]`

Covariance Matrix Output:

If you print `np.cov(x, y)`:

In [56]:
print(np.cov(x, y))

[[ 2.5  5. ]
 [ 5.  10. ]]


Where:

- Variance of x: 2.5 (at position `[0, 0]`).
- Variance of y: 10.0 (at position `[1, 1]`).
- Covariance of x and y: 5.0 (at positions `[0, 1]` and `[1, 0]`).
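
One way to confirm this reading of the matrix is to compare its diagonal with the sample variances computed separately. In this sketch, `ddof=1` makes `np.var` divide by n - 1, which matches the default used by `np.cov`.

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_matrix = np.cov(x, y)

# ddof=1 makes np.var divide by n - 1, matching np.cov's default
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1)

print(cov_matrix[0, 0], var_x)   # both 2.5
print(cov_matrix[1, 1], var_y)   # both 10.0
print(cov_matrix[0, 1] == cov_matrix[1, 0])  # the matrix is symmetric
```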

Covariance Value:

The extracted covariance value is 5.0.


**Interpretation of the Result**

- The covariance of 5.0 is positive, indicating a positive relationship between x and y (as x increases, y also increases).
- The magnitude of the covariance value depends on the units of x and y, so it doesn’t give a standardized measure of the strength of the relationship. For a normalized measure, you would use the Pearson correlation coefficient.

In [58]:
# Pearson Correlation Coefficient
# pearsonr comes from SciPy's stats module (scipy.stats), not from NumPy or pandas
from scipy.stats import pearsonr

# Example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Calculate Pearson correlation
pearson_corr, _ = pearsonr(x, y)
print("Pearson Correlation Coefficient: ", pearson_corr)

Pearson Correlation Coefficient:  1.0


Interpretation of the Output

- Pearson Correlation Coefficient (r = 1.0):

The value r=1.0 confirms that x and y have a perfect positive linear relationship. As x increases, y increases proportionally.

In [59]:
numeric_data.corr()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
|---|---|---|---|---|---|
| bill_length_mm | 1.0 | -0.235053 | 0.656181 | 0.59511 | 0.054545 |
| bill_depth_mm | -0.235053 | 1.0 | -0.583851 | -0.471916 | -0.060354 |
| flipper_length_mm | 0.656181 | -0.583851 | 1.0 | 0.871202 | 0.169675 |
| body_mass_g | 0.59511 | -0.471916 | 0.871202 | 1.0 | 0.042209 |
| year | 0.054545 | -0.060354 | 0.169675 | 0.042209 | 1.0 |


The `numeric_data.corr()` method in pandas calculates the pairwise correlation coefficients between all numeric columns in a DataFrame. By default, it computes the Pearson correlation coefficient.
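
The same method accepts a `method` argument for other coefficients. The sketch below uses a small hypothetical frame (`demo`) in which y always grows with x but not along a straight line, so the Pearson and Spearman results differ.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
demo["y"] = demo["x"] ** 3

pearson = demo.corr()                       # default: method="pearson"
spearman = demo.corr(method="spearman")     # rank-based alternative

print(pearson.loc["x", "y"])    # below 1: the relationship is not linear
print(spearman.loc["x", "y"])   # 1.0 up to rounding: the ranks agree perfectly
```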

In [60]:
# Spearman Rank Correlation
from scipy.stats import spearmanr

# Example data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Calculate Spearman correlation
spearman_corr, _ = spearmanr(x, y)
print("Spearman Correlation Coefficient: ", spearman_corr)

Spearman Correlation Coefficient:  0.9999999999999999
