<a href="https://colab.research.google.com/github/KGzB/CAS-Applied-Data-Science/blob/master/Module-2/CAS-D1-DescriptiveStatistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook 1, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2024-08-27, A. Mühlemann, University of Bern.

*This notebook is based on the notebook by S. Haug and G. Conti from 2020*


# 1. Descriptive Statistics on Single Features



**Goals**
- Graphical preparation of the data
- Calculate summary statistics


First load the necessary libraries / modules.

In [0]:
# Load the needed python libraries by executing this python code (press ctrl enter)
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pandas as pd
import io

In this notebook we will look at a dataset about the effect of surface and vision on balance (source https://gksmyth.github.io/ozdasl/oz/ctsib.html).

The balance of the subjects was observed on two different surfaces and with restricted and unrestricted vision.

- *Subject*: id of subject
- *Sex*: male or female
- *Age*: age in years
- *Height*: height in cm
- *Weight*: weight in kg
- *Surface*: normal or foam
- *Vision*: eyes open, eyes closed, or closed dome
- *CTSIB*: Qualitative measure of balance, 1 (stable) - 4 (unstable)

I uploaded the dataset to GitHub so that we can read it directly from there.

In [0]:
url = "https://github.com/KGzB/CAS-Applied-Data-Science/blob/master/Module-2/balance.csv?raw=true"
df = pd.read_csv(url, sep=";")
df.head() # Print the first five rows

## 1.1 Graphical Analysis
### 1.1.1 Pie chart and bar plot (categorical variables)
Pie charts are used to show proportions of a whole.

We could, for example, ask ourselves about how the genders were represented in this study percentage-wise. When we look at the dataset more closely we can see that each participant participated in different experiments. Thus, it probably would make sense to first get a subdataframe with just the distinct individuals.

In [0]:
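# Pie chart of the gender distribution
# Keep one row per participant (each subject appears in several experiments)
participants = df[['Subject', 'Sex', 'Age', 'Height', 'Weight']].drop_duplicates()
gender_counts = participants['Sex'].value_counts()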
plt.figure(figsize=(8, 8))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Gender Distribution')
plt.show()

The problem with pie charts is that if there are many groups, we tend to misinterpret the pie slices (https://www.data-to-viz.com/caveat/pie.html). Thus, it often makes sense to use a bar plot instead.

In [0]:
# Barplot of the gender distribution
gender_counts = df['Sex'].value_counts()
plt.figure(figsize=(10, 6))
plt.bar(gender_counts.index, gender_counts.values, color=['blue', 'orange'])
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender Distribution')
plt.show()

### 1.1.2 Histogram (numerical variables)
Histograms are used to display frequencies or proportions.

To get a better idea about our participants we would like to look at their age distribution using a histogram.

In [0]:
# histogram of age distribution
plt.figure(figsize=(10, 6))
plt.hist(df['Age'], bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

The authors mention
"for both males and females, ten older (more than 24 years old) and ten younger subjects were selected".
This can be seen in the histogram, at least as long as the genders are not considered separately.

Let us now check whether this is fulfilled for each gender.

In [0]:
# Create a histogram for each gender
genders = df['Sex'].unique()
plt.figure(figsize=(12, 6))

for gender in genders:
    subset = df[df['Sex'] == gender]
    plt.hist(subset['Age'], bins=20, alpha=0.5, label=gender, edgecolor='black')

plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution by Gender')
plt.legend()
plt.show()

### Kernel Density Estimation
Kernel Density Estimation is a method for visualizing the distribution of a dataset by creating a smooth curve. An advantage of using KDE over a histogram is that it produces a continuous, smooth curve that is not dependent on the bin size. This can provide a clearer picture of the underlying data distribution, especially with smaller datasets, and can reveal multimodality (multiple peaks) that might be hidden by a histogram's fixed binning. Its primary drawback is that its appearance is highly dependent on a parameter called the "bandwidth," which can be difficult to choose, and a poor choice can lead to a misleading representation of the data. *Please play around with the bandwidth to see its effect.*
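To make the role of the bandwidth concrete, the following minimal sketch computes a Gaussian kernel density estimate by hand with NumPy. The sample `ages` and the helper `gaussian_kde_manual` are made up for illustration; they are not part of the dataset or of scikit-learn.

```python
import numpy as np

def gaussian_kde_manual(sample, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centred at each observation."""
    z = (grid[:, np.newaxis] - sample[np.newaxis, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical ages with a younger and an older cluster (not the real data)
ages = np.array([19, 20, 21, 22, 23, 24, 55, 58, 60, 63, 65, 68], dtype=float)

# The grid extends well beyond the data so that the tails are captured
grid = np.linspace(-100, 200, 3001)
dx = grid[1] - grid[0]

narrow = gaussian_kde_manual(ages, grid, bandwidth=2.0)   # follows the two clusters closely
wide = gaussian_kde_manual(ages, grid, bandwidth=25.0)    # much flatter and smoother curve

print("area under narrow KDE:", (narrow * dx).sum())
print("area under wide KDE:", (wide * dx).sum())
print("highest density (narrow, wide):", narrow.max(), wide.max())
```

Both curves are valid densities, but they tell different stories about the same data, which is why the bandwidth deserves some experimentation.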

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# Create a kernel density estimate for the participants' ages
age_data = df['Age'].values[:, np.newaxis]
kde = KernelDensity(bandwidth=3.0)
kde.fit(age_data)

age_range = np.linspace(age_data.min(), age_data.max(), 100)[:, np.newaxis]
log_density = kde.score_samples(age_range)

plt.figure(figsize=(10, 6))
plt.plot(age_range, np.exp(log_density), color='blue')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Kernel Density Estimate of Age')
plt.show()

Try to draw the densities for the males and females in the same plot.

In [0]:
# Kernel density estimates of age for each gender in the same plot
from sklearn.neighbors import KernelDensity

age_range = np.linspace(df['Age'].min(), df['Age'].max(), 100)[:, np.newaxis]

plt.figure(figsize=(10, 6))
for gender in df['Sex'].unique():
    # Fit a separate KDE to the ages of each gender
    ages = df.loc[df['Sex'] == gender, 'Age'].values[:, np.newaxis]
    kde = KernelDensity(bandwidth=3.0).fit(ages)
    plt.plot(age_range, np.exp(kde.score_samples(age_range)), label=gender)

plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Kernel Density Estimate of Age by Gender')
plt.legend()
plt.show()


### 1.1.3 Scatter plot colored by gender (simultaneous description)

Scatter plots use dots to represent values for two different numeric variables.

For example, we could look at the gender and two of the numerical characteristics simultaneously.

It is reasonable to assume that taller people tend to be heavier. We would like to check whether this assumption is fulfilled in our dataset. To differentiate between the genders we can color the dots with respect to the genders.

In [0]:
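# Scatter plot of Height and Weight colored with respect to gender
# One row per participant (each subject appears in several experiments)
participants = df[['Subject', 'Sex', 'Age', 'Height', 'Weight']].drop_duplicates()
plt.figure(figsize=(10, 6))
for gender in participants['Sex'].unique():
    subset = participants[participants['Sex'] == gender]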
    plt.scatter(subset['Height'], subset['Weight'], label=gender, alpha=0.6)

plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Height vs Weight by Gender')
plt.legend()
plt.show()

What can you see in this picture?

Generally, height and weight are correlated. In the sample, men are taller and heavier than women.

## 1.2 Numerical analysis

### 1.2.1 Location

Location measures are used to describe typical values of a variable. Best known are the mean and the median.
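As a small illustration of how the two measures differ, the sketch below compares the mean and the median on made-up weights, once without and once with an extreme value. The numbers are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical body weights in kg (not taken from the dataset)
weights = np.array([58.0, 61.0, 63.0, 66.0, 70.0])
print("mean:", np.mean(weights), "median:", np.median(weights))

# Same data, but the largest value is replaced by an extreme one (e.g. a typing error)
weights_with_outlier = np.array([58.0, 61.0, 63.0, 66.0, 700.0])
print("mean:", np.mean(weights_with_outlier), "median:", np.median(weights_with_outlier))
```

The mean is pulled strongly towards the extreme value, while the median does not change. A large difference between the two is therefore a hint at outliers or a skewed distribution.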


In [0]:
# Mean of numerical variables of participants
mean_values = df[['Age', 'Height', 'Weight']].mean()
display(mean_values)

Maybe it would make sense to group by gender.

In [0]:
# Mean of numerical variables of participants grouped by gender
mean_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].mean()
display(mean_values_by_gender)

In [0]:
# Median of numerical variables on participants grouped by Sex
median_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].median()
display(median_values_by_gender)

In [0]:
# Compare the mean with the median for numerical variables grouped by gender
mean_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].mean()
median_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].median()

comparison = mean_values_by_gender.set_index('Sex') - median_values_by_gender.set_index('Sex')
display(comparison)

What do you notice when you compare the mean with the median?

### 1.2.2 Spread

Typical values are interesting, but sometimes more information is needed. For example, it is also of interest to see how spread out the values are. Typical measures of spread are the variance, the standard deviation, the interquartile range (IQR), etc.
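The following sketch computes these measures on a few made-up heights (hypothetical values, not from the dataset). It also shows a detail that is easy to miss: pandas divides by n - 1 when computing the variance and standard deviation, whereas NumPy divides by n unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm (not taken from the dataset)
heights = pd.Series([160.0, 165.0, 170.0, 175.0, 180.0])

variance = heights.var()    # sample variance, divides by n - 1
std = heights.std()         # square root of the sample variance
iqr = heights.quantile(0.75) - heights.quantile(0.25)

# NumPy divides by n by default; ddof=1 reproduces the pandas result
std_numpy_default = np.std(heights.to_numpy())
std_numpy_sample = np.std(heights.to_numpy(), ddof=1)

print(variance, std, iqr)
print(std_numpy_default, std_numpy_sample)
```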


In [0]:
# Standard deviation of numerical variables on participants grouped by Sex
std_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].std()
display(std_values_by_gender)

In [0]:
from sklearn.preprocessing import StandardScaler

# Z-scaling the values
scaler = StandardScaler()
scaled_values = scaler.fit_transform(df[['Age', 'Height', 'Weight']])

# Convert scaled values back to DataFrame
scaled_df = pd.DataFrame(scaled_values, columns=['Age', 'Height', 'Weight'])

# Mean of numerical variables of participants
mean_values = scaled_df.mean()
display(mean_values)

# Mean of numerical variables of participants grouped by gender
mean_values_by_gender = df[['Sex']].join(scaled_df).groupby('Sex', as_index=False).mean()
display(mean_values_by_gender)

# Median of numerical variables on participants grouped by Sex
median_values_by_gender = df[['Sex']].join(scaled_df).groupby('Sex', as_index=False).median()
display(median_values_by_gender)

# Compare the mean with the median for numerical variables grouped by gender
comparison = mean_values_by_gender.set_index('Sex') - median_values_by_gender.set_index('Sex')
display(comparison)

# Standard deviation of numerical variables on participants grouped by Sex
std_values_by_gender = df[['Sex']].join(scaled_df).groupby('Sex', as_index=False).std()
display(std_values_by_gender)
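
The z-scaling used in the cell above subtracts the mean from each value and divides by the standard deviation. A minimal sketch with made-up heights (hypothetical values) shows what this does. Note that `StandardScaler` uses the standard deviation with `ddof=0`, so the pandas `.std()` of the scaled data (which uses `ddof=1`) is slightly larger than one.

```python
import numpy as np

# Hypothetical heights in cm (not taken from the dataset)
heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])

# z-score: subtract the mean and divide by the standard deviation (ddof=0)
z = (heights - heights.mean()) / heights.std()

print(z.mean())        # zero up to floating-point rounding
print(z.std())         # one when computed with ddof=0
print(z.std(ddof=1))   # slightly larger than one with ddof=1
```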

Which variable has the largest standard deviation? Would you have expected that?

The weight. Yes, I would have expected that.

We can also look at the IQR.

In [0]:
# IQR of Weight
Q1 = df['Weight'].quantile(0.25)
Q3 = df['Weight'].quantile(0.75)
IQR = Q3 - Q1
IQR

Compare the standard deviation of *Weight* with its IQR.


In [0]:
# Standard deviation of Weight
std_weight = df['Weight'].std()

# IQR of Weight
q1_weight = df['Weight'].quantile(0.25)
q3_weight = df['Weight'].quantile(0.75)
iqr_weight = q3_weight - q1_weight

# Comparison
comparison = pd.DataFrame({
    'Metric': ['Standard Deviation', 'IQR'],
    'Weight': [std_weight, iqr_weight]
})

display(comparison)

Alternatively, one can also get most of those measures with a single command:

In [0]:
participants = df[['Subject', 'Sex', 'Age', 'Height', 'Weight']].drop_duplicates()
grouped_participants = participants.groupby('Sex')
grouped_participants[['Age', 'Height', 'Weight']].describe()

However, this output is not easy to read. Here it makes more sense to consider only one variable for the summary:

In [0]:
grouped_participants['Weight'].describe()

Do all these decimal places make sense?

### 1.2.3 Shape

Another aspect that is often looked at is the shape of the distribution. The most commonly used measures are the skewness and the kurtosis.

In [0]:
# Skew of numerical variables on participants grouped by Sex
skew_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].skew()
display(skew_values_by_gender)

The skewness of *Height* is $<0$ for both genders. What does that mean?

The distribution of height is negatively skewed. This means most people are relatively tall, while a smaller number of much shorter individuals create a long tail on the left.
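The sign of the skewness can be checked on small made-up samples. The values below are hypothetical and only chosen so that the tail is clearly on one side.

```python
import pandas as pd

# Hypothetical samples (not taken from the dataset)
left_skewed = pd.Series([1, 8, 9, 9, 10, 10, 10])   # long tail towards small values
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])    # long tail towards large values
symmetric = pd.Series([1, 2, 3, 4, 5])              # no tail on either side

print("left-skewed: ", left_skewed.skew())    # negative
print("right-skewed:", right_skewed.skew())   # positive
print("symmetric:   ", symmetric.skew())      # zero
```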

Now let us look at the kurtosis.

In [0]:
# Kurtosis of numerical variables on participants grouped by Sex
kurtosis_values_by_gender = df.groupby('Sex', as_index=False)[['Age', 'Height', 'Weight']].apply(lambda x: x.kurtosis())
display(kurtosis_values_by_gender)

The idea behind the kurtosis is the following:
For some numerical characteristics, a large sample and narrow histogram intervals yield a histogram that resembles a Gaussian bell curve. In this case, the kurtosis as computed by pandas (the excess kurtosis, which is zero for a normal distribution) is close to zero. If the tails are heavier than expected for a Gaussian distribution, the kurtosis will be substantially positive. If the tails are lighter than expected for a Gaussian distribution, the kurtosis will be negative.
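A quick simulation can illustrate this. The sketch below draws large random samples from three distributions (simulated data, not the balance dataset) and computes the kurtosis with pandas; the choice of the Laplace and uniform distributions as heavy-tailed and light-tailed examples is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 100_000

samples = pd.DataFrame({
    "normal": rng.normal(size=n),            # reference: Gaussian tails
    "laplace": rng.laplace(size=n),          # heavier tails than the normal
    "uniform": rng.uniform(-1, 1, size=n),   # lighter tails than the normal
})

# pandas reports the excess kurtosis (normal distribution = 0)
excess_kurtosis = samples.kurtosis()
print(excess_kurtosis)
```

The normal sample should give a value near zero, the Laplace sample a clearly positive value, and the uniform sample a negative one.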


## 1.3 Outlook
An often used example of a normal distribution is the height distribution in a population. Let us check whether this could also be the case for our dataset.

When looking at hypothesis testing, we will see how to test this mathematically.

Our model is a normal distribution with the mean and width taken from the dataset: **norm.pdf(x,mean,width)**.
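For reference, the density behind `norm.pdf` is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where $\mu$ is the mean and $\sigma$ the standard deviation (the "width"). The sketch below evaluates this formula directly with NumPy. The helper `normal_pdf` and the parameter values are hypothetical and not estimated from the dataset.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical parameters in cm (not estimated from the dataset)
mean, std = 170.0, 10.0
x = np.linspace(mean - 6 * std, mean + 6 * std, 1201)
pdf = normal_pdf(x, mean, std)

# Approximate areas with a simple Riemann sum
dx = x[1] - x[0]
total_area = (pdf * dx).sum()
share_within_one_std = (pdf[np.abs(x - mean) <= std] * dx).sum()

print("area under the curve:", total_area)
print("share within one standard deviation:", share_within_one_std)
```

The total area is one, as for every density, and roughly two thirds of the mass lies within one standard deviation of the mean.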

In [0]:
from scipy.stats import norm

# Get mean and std from height

mean  = participants['Height'].mean()
std = participants['Height'].std()
print(mean,std)

# Create gaussian pdf
xmin = participants['Height'].min()
xmax = participants['Height'].max()
x = np.linspace(xmin, xmax, 100)
pdf = norm.pdf(x, mean, std)

# Plot histogram together with the fitted normal density
plt.hist(participants['Height'], density=True, alpha=0.6, color='skyblue', edgecolor='black', label='Histogram')
plt.plot(x, pdf, 'r', linewidth=2, label='Normal Distribution')
plt.xlabel('Height')
plt.ylabel('Density')
plt.legend()
plt.show()

What do you think?

# 2. Simultaneous Description of two Features

In the first part of this notebook, we only looked at individual characteristics. For example, we calculated the mean *Weight* of the female participants. Of course, we did this simultaneously for all characteristics and both genders, but we never compared two characteristics directly. That is what we would like to do now.

## 2.1 Graphical Analysis

### 2.1.1 Boxplots

A boxplot is a graphical display of the minimum, maximum, and the 3 quartiles.
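The five numbers behind a boxplot can be computed directly, as in this minimal sketch on made-up values (hypothetical data, not from the dataset).

```python
import numpy as np

# Hypothetical measurements (not taken from the dataset)
values = np.array([2.0, 4.0, 4.0, 5.0, 6.0, 7.0, 9.0, 10.0, 12.0])

# Minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)
```

Note that `plt.boxplot` by default draws the whiskers only up to 1.5 times the IQR from the box and marks more extreme points individually, so the whisker ends coincide with the minimum and maximum only if there are no such points.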

We could, for example, look at the difference in balance scores depending on the surface the participants stand on. Let us first consider the case when the participants' view is not obstructed.

In [0]:
# Boxplots of CTSIB-score for the two different surfaces.

Would you expect the boxplots to look like this? Inspect the dataset more closely.

We could also look at the height distribution according to the genders.

In [0]:
# Boxplots of height for the genders.
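
One possible way to draw boxplots per group is sketched below. It uses a small toy DataFrame with made-up values (only the column names *Sex* and *Height* are borrowed from the dataset), so it is a pattern to adapt rather than the solution to the cells above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Height": [160.0, 165.0, 171.0, 172.0, 180.0, 185.0],
})

# One array of heights per gender
groups = {sex: sub["Height"].to_numpy() for sex, sub in toy.groupby("Sex")}

plt.figure(figsize=(8, 6))
plt.boxplot(list(groups.values()))
plt.xticks(range(1, len(groups) + 1), list(groups.keys()))
plt.xlabel("Sex")
plt.ylabel("Height")
plt.title("Height by Gender (toy data)")
plt.show()
```

With the notebook's data, the same pattern could be applied to *Height* by *Sex*, or to the *CTSIB* score by *Surface* after filtering `df` to the rows with unobstructed vision. The exact labels used in the *Vision* column are best checked first with `df['Vision'].unique()`.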

### 2.1.2 Scatter matrix

Now let us see whether some of the numerical information on the participants is correlated (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html).

In [0]:
from pandas.plotting import scatter_matrix
scatter_matrix(participants[['Age','Height','Weight']], alpha=0.2, diagonal='hist')
plt.show()

As we would have assumed there is a positive correlation between height and weight.

## 2.2 Numerical Analysis

### 2.2.1 Correlation
To get more information on the relationship between two variables, the correlation coefficient is calculated.

Let's see whether the correlation coefficient supports this observation.

**Caution**: correlation does not imply causation!


In [0]:
# Pearson correlation between the numerical variables
corr_matrix = df[['Age', 'Height', 'Weight']].corr(method='pearson')

# Show the three pairwise correlations as a table
result = pd.DataFrame(
    [
        ('Age', 'Height', corr_matrix.loc['Age', 'Height']),
        ('Age', 'Weight', corr_matrix.loc['Age', 'Weight']),
        ('Height', 'Weight', corr_matrix.loc['Height', 'Weight']),
    ],
    columns=['Column1', 'Column2', 'PearsonCorrelation']
)

display(result)


What is the definition of the correlation?


### 📌 Definition of Correlation

**Correlation** measures the strength and direction of a **linear relationship** between two quantitative variables.

The most common measure is the **Pearson correlation coefficient (r)**:

$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
$$

where:  
- $\text{Cov}(X, Y)$ = covariance of $X$ and $Y$  
- $\sigma_X, \sigma_Y$ = standard deviations of $X$ and $Y$  

---

### 📊 Interpretation of Pearson r
- **+1** → perfect positive linear relationship (as one increases, so does the other)  
- **0** → no linear relationship  
- **–1** → perfect negative linear relationship (as one increases, the other decreases)  

---

⚠️ **Note**:  
- Correlation only captures **linear** relationships.  
- Correlation does **not imply causation**.
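
The definition above can be checked numerically. The sketch below computes r from the covariance and the standard deviations on made-up values (hypothetical heights and weights) and compares it with `np.corrcoef`. The second part illustrates the note on linearity with a relationship that is perfectly determined but not linear.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), not taken from the dataset
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight = np.array([55.0, 62.0, 66.0, 74.0, 80.0])

# r = Cov(X, Y) / (sigma_X * sigma_Y), using the same ddof for all three terms
cov_xy = np.cov(height, weight, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(height, ddof=1) * np.std(weight, ddof=1))
r_numpy = np.corrcoef(height, weight)[0, 1]
print(r_manual, r_numpy)

# y is fully determined by x, but the relationship is not linear
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
r_nonlinear = np.corrcoef(x, y)[0, 1]
print(r_nonlinear)
```

The two ways of computing r agree, and the quadratic example yields a correlation of zero even though the two variables are clearly related.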


What kind of relationship between *Age* and *Height* does the correlation coefficient suggest?

The coefficient is close to zero, which suggests no linear relationship. This seems intuitive.

### Exercise
We have now seen several examples of descriptive approaches. What we have left out so far is a real investigation of the difference in balance depending on *Surface* and *Vision*. Use the remaining time to investigate further!

In [0]:
# Investigate the difference in balance depending on Surface and Vision

# Pick the correct score column (CTSIB_score vs CTSIB)
score_col = "CTSIB_score" if "CTSIB_score" in df.columns else "CTSIB"

# Descriptive stats by Surface x Vision
balance_stats = (
    df.groupby(["Surface", "Vision"])[score_col]
    .agg(
        mean="mean",
        stddev="std",
        q1=lambda s: s.quantile(0.25),
        q3=lambda s: s.quantile(0.75),
        n="count",
    )
    .reset_index()
)
balance_stats["iqr"] = balance_stats["q3"] - balance_stats["q1"]

display(balance_stats)
