<a href="https://colab.research.google.com/github/KGzB/CAS-Applied-Data-Science/blob/master/Module-2/CAS-D1-DescriptiveStatistics-Sol.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook 1, Module 2, Statistical Inference for Data Science, CAS Applied Data Science, 2024-08-27, A. Mühlemann, University of Bern.

*This notebook is based on the notebook by S. Haug and G. Conti from 2020*


# 1. Descriptive Statistics on Single Features



**Goals**
- Graphical preparation of the data
- Calculate summary statistics


First load the necessary libraries / modules.

In [0]:
# Load the needed python libraries by executing this python code (press ctrl enter)
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pandas as pd
import io

In this notebook we will look at a dataset about the effect of surface and vision on balance.

The balance of the subjects was observed on two different surfaces, with restricted and unrestricted vision.

- *Subject*: id of subject
- *Sex*: male or female
- *Age*: age in years
- *Height*: height in cm
- *Weight*: weight in kg
- *Surface*: normal or foam
- *Vision*: eyes open, eyes closed, or closed dome
- *CTSIB*: Qualitative measure of balance, 1 (stable) - 4 (unstable)

I uploaded the dataset to GitHub so that we can read it directly from there.

In [0]:
url = "https://github.com/KGzB/CAS-Applied-Data-Science/blob/master/Module-2/balance.csv?raw=true"
df = pd.read_csv(url, sep=";")
df.head() # Print the first five rows

## 1.1 Graphical Analysis
### 1.1.1 Pie chart and bar plot (categorical variables)
Pie charts are used to show proportions of a whole.

We could, for example, ask ourselves about how the genders were represented in this study percentage-wise. When we look at the dataset more closely we can see that each participant participated in different experiments. Thus, it probably would make sense to first get a subdataframe with just the distinct individuals.

In [0]:
# Pie chart of the gender distribution
participants = df[['Subject', 'Sex', 'Age', 'Height',	'Weight']].drop_duplicates()
participants.groupby('Sex').size().plot(kind='pie', autopct='%1.1f%%' ,ylabel="")
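
To see why the duplicates have to be dropped first, here is a minimal sketch on a made-up long-format table (the values are invented for illustration and are not taken from the balance dataset). Each subject appears once per experiment, so counting rows directly would count experiments rather than people.

```python
import pandas as pd

# Made-up long-format data: every subject appears once per experiment
toy = pd.DataFrame({
    "Subject": [1, 1, 2, 2, 3, 3],
    "Sex": ["male", "male", "female", "female", "female", "female"],
    "Surface": ["norm", "foam", "norm", "foam", "norm", "foam"],
})

# Counting rows counts experiments, not people
rows_per_sex = toy.groupby("Sex").size()

# Keep one row per subject before counting
toy_participants = toy[["Subject", "Sex"]].drop_duplicates()
people_per_sex = toy_participants.groupby("Sex").size()

print(rows_per_sex)
print(people_per_sex)
```

In this toy case the proportions happen to be the same either way, because every subject has the same number of rows; that need not hold in general.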

The problem with pie charts is that, if there are many groups, we tend to misinterpret the pie slices (https://www.data-to-viz.com/caveat/pie.html). Thus, it is often more reasonable to use a bar plot instead.

In [0]:
# Barplot of the gender distribution
participants.groupby('Sex').size().plot(kind='barh' ,xlabel="number of participants")

### 1.1.2 Histogram (numerical variables)
Histograms are used to display frequencies or proportions.

To get a better idea about our participants we would like to look at their age distribution using a histogram.

In [0]:
# Histogram of the age distribution
participants['Age'].plot(kind="hist", title='Histogram', color='tab:blue')
plt.xlabel("Age in years")  # label the x-axis (the original label= argument only set a legend entry)
plt.show()
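
The difference between frequencies and proportions can be sketched without any plotting. The ages and bin edges below are made up for illustration; `np.histogram` is used here only to expose the numbers that a histogram draws.

```python
import numpy as np

# Made-up ages, for illustration only
ages = np.array([18, 19, 21, 22, 22, 23, 27, 29, 31, 35])
bin_edges = np.array([15, 20, 25, 30, 35, 40])

# Absolute frequencies: number of observations per bin
counts, _ = np.histogram(ages, bins=bin_edges)

# Density: frequencies rescaled so that the bar areas sum to 1
density, _ = np.histogram(ages, bins=bin_edges, density=True)
bin_widths = np.diff(bin_edges)
total_area = np.sum(density * bin_widths)

print("counts:     ", counts)
print("proportions:", counts / counts.sum())
print("total area: ", total_area)
```

The shape of the histogram is the same in both cases; only the scale of the y-axis changes.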

The authors mention
"for both males and females, ten older (more than 24 years old) and ten younger subjects were selected".
This can be seen in the histogram (at least as long as we do not yet distinguish between the genders).

We could also check whether there is a difference in the age distribution between the genders.

In [0]:
# get genders
genders = df['Sex'].unique()

# Create a histogram for each gender
for gender in genders:
    subset = participants[participants['Sex'] == gender]
    plt.hist(subset['Age'], alpha=0.7, color='blue')
    plt.title(f"Histogram for the {gender} participants")
    plt.xlim(15, 40) # make sure all histograms share the same x-axis
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.show()

### Kernel Density Estimation
Kernel Density Estimation is a method for visualizing the distribution of a dataset by creating a smooth curve. An advantage of using KDE over a histogram is that it produces a continuous, smooth curve that is not dependent on the bin size. This can provide a clearer picture of the underlying data distribution, especially with smaller datasets, and can reveal multimodality (multiple peaks) that might be hidden by a histogram's fixed binning. Its primary drawback is that its appearance is highly dependent on a parameter called the "bandwidth," which can be difficult to choose, and a poor choice can lead to a misleading representation of the data. *Please play around with the bandwidth to see its effect.*

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

age_data = participants['Age'].values.reshape(-1, 1)
kde = KernelDensity(bandwidth=3.0, kernel='gaussian')
kde.fit(age_data)

# range for plotting
age_range = np.linspace(participants['Age'].min() - 5, participants['Age'].max() + 5, 1000).reshape(-1, 1)

# Get the density
log_density_gaussian = kde.score_samples(age_range)
density_gaussian = np.exp(log_density_gaussian)

# Plot the KDE
plt.figure(figsize=(10, 6))
plt.plot(age_range, density_gaussian, label='Gaussian Kernel', color='blue', linewidth=2)

# Add histogram and show the labels of both elements in a legend
plt.hist(participants['Age'], density=True, alpha=0.6, color='grey', label='Histogram of Data')
plt.legend()

plt.title('Kernel Density Estimation of Participant Age')
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()
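
To make the role of the bandwidth more concrete, the following sketch writes the Gaussian KDE formula out by hand: the estimate at a point is the average of Gaussian bumps of width `h` centred at the observations. The ages and the helper `gaussian_kde_by_hand` are made up for illustration; this is the same formula, not a description of how scikit-learn computes it internally.

```python
import numpy as np

def gaussian_kde_by_hand(x_grid, data, h):
    """Average of Gaussian bumps with standard deviation h centred at each data point."""
    z = (x_grid[:, None] - data[None, :]) / h
    bumps = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up ages, for illustration only
ages = np.array([19.0, 20.0, 21.0, 22.0, 30.0, 31.0, 33.0])
grid = np.linspace(0.0, 60.0, 1201)

for h in (0.5, 3.0, 10.0):
    density = gaussian_kde_by_hand(grid, ages, h)
    # Count strict local maxima of the curve on the grid
    n_peaks = np.sum((density[1:-1] > density[:-2]) & (density[1:-1] > density[2:]))
    # Riemann-sum approximation of the area on the plotted range
    area = np.sum(density) * (grid[1] - grid[0])
    print(f"h = {h:4.1f}: {n_peaks} peak(s), area on grid = {area:.3f}")
```

A small bandwidth follows individual observations closely, while a large one smooths the two age groups into a single bump.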



Try to draw the densities for the males and females in the same plot.

In [0]:
males = participants[participants['Sex'] == 'male']['Age'].values.reshape(-1, 1)
females = participants[participants['Sex'] == 'female']['Age'].values.reshape(-1, 1)

# plot range
age_range = np.linspace(participants['Age'].min() - 5, participants['Age'].max() + 5, 1000).reshape(-1, 1)

# fit males
kde_males = KernelDensity(bandwidth=3.0, kernel='gaussian')
kde_males.fit(males)
log_density_males = kde_males.score_samples(age_range)
density_males = np.exp(log_density_males)

# fit females
kde_females = KernelDensity(bandwidth=3.0, kernel='gaussian')
kde_females.fit(females)
log_density_females = kde_females.score_samples(age_range)
density_females = np.exp(log_density_females)

# Plot both KDE curves on the same plot
plt.figure(figsize=(10, 6))
plt.plot(age_range, density_males, label='Males', color='blue', linewidth=2)
plt.plot(age_range, density_females, label='Females', color='red', linewidth=2)

plt.title('KDE of Age by Gender')
plt.xlabel('Age')
plt.ylabel('Density')
plt.legend()
plt.show()

### 1.1.3 Scatter plot colored by gender (simultaneous description)

Scatter plots use dots to represent values for two different numeric variables.

For example, we could look at the gender and two of the numerical characteristics simultaneously.

It is reasonable to assume that taller people tend to be heavier. We would like to check whether this assumption is fulfilled in our dataset. To differentiate between the genders we can color the dots with respect to the genders.

In [0]:
# Create colors for genders
colors = {'male': 'blue', 'female': 'red'}

# Create a scatter plot
for gender in participants['Sex'].unique():
    subset = participants[participants['Sex'] == gender]
    plt.scatter(subset['Height'], subset['Weight'], color=colors[gender], label=gender, s=100, alpha=0.7)

# Label and show the plot
plt.title('Height vs. Weight')
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.legend(title='Gender')
plt.show()


What can you see in this picture?

## 1.2 Numerical analysis

### 1.2.1 Location

Location measures are used to describe typical values of a variable. Best known are the mean and the median.


In [0]:
# Mean of numerical variables on participants
participants[['Age', 'Height',	'Weight']].mean()

Maybe it would make sense to group by gender.

In [0]:
grouped_participants = participants.groupby('Sex')
grouped_participants[['Age', 'Height',	'Weight']].mean()

In [0]:
# Median of numerical variables on participants grouped by Sex
grouped_participants[['Age', 'Height',	'Weight']].median()

What do you notice when you compare the mean with the median?
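
One thing to look for in such a comparison: the mean reacts strongly to single extreme values, while the median barely moves. A minimal sketch with made-up weights (not taken from the balance dataset):

```python
import numpy as np

# Made-up weights in kg, for illustration only
weights = np.array([58.0, 61.0, 64.0, 66.0, 71.0])
weights_with_outlier = np.append(weights, 150.0)

print("mean:  ", weights.mean(), "->", weights_with_outlier.mean())
print("median:", np.median(weights), "->", np.median(weights_with_outlier))
```

If mean and median are close, as in the first array, this hints at a roughly symmetric distribution without extreme values.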

### 1.2.2 Spread

Typical values are interesting, but sometimes more information is needed. For example, it is also of interest to see how spread out the values are. Typical measures of spread are the variance, the standard deviation, the interquartile range (IQR), etc.


In [0]:
# Standard deviation of numerical variables on participants grouped by Sex
grouped_participants[['Age', 'Height',	'Weight']].std()

Which variable has the largest standard deviation? Would you have expected that?

We can also look at the IQR.

In [0]:
# IQR of Weight
grouped_participants['Weight'].quantile(0.75)-grouped_participants['Weight'].quantile(0.25)


Compare the standard deviation of *Weight* with its IQR.
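
When comparing the two, keep in mind that they are on different scales: for approximately normal data, the IQR is about 1.35 times the standard deviation. The sketch below checks this on simulated data; the mean of 70 and standard deviation of 10 are arbitrary choices, not values from the balance dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, arbitrary choice
# Simulated, bell-shaped "weights" (not the balance data)
sample = rng.normal(loc=70.0, scale=10.0, size=10_000)

std = sample.std(ddof=1)  # sample standard deviation, as pandas .std() computes it
q25, q75 = np.percentile(sample, [25, 75])
iqr = q75 - q25

print(f"std = {std:.2f}, IQR = {iqr:.2f}, IQR / std = {iqr / std:.2f}")
```

A ratio far from 1.35 in real data can hint at outliers or a non-normal shape, since the standard deviation is sensitive to extreme values and the IQR is not.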


Alternatively, one can also get most of those measures with a single command:

In [0]:
grouped_participants[['Age', 'Height',	'Weight']].describe()

However, this output is not very readable. Here it makes more sense to consider only one variable for the summary:

In [0]:
grouped_participants['Weight'].describe()

Do all these decimal places make sense?

### 1.2.3 Shape

Another aspect that is often looked at is the shape of the distribution. The most commonly used measures are the skewness and the kurtosis.

In [0]:
# Skew of numerical variables on participants grouped by Sex
grouped_participants[['Age', 'Height',	'Weight']].skew()

The skewness of *Height* is $<0$ for both genders. What does that mean?

Now let us look at the kurtosis.

In [0]:
# Kurtosis of numerical variables on participants grouped by Sex
grouped_participants[['Age', 'Height',	'Weight']].apply(pd.DataFrame.kurt, numeric_only=True)

The idea behind the kurtosis is the following:
Some numerical characteristics, when the sample size is large and the intervals are small, result in a histogram that resembles a Gaussian bell curve. In this case, the value of the kurtosis is close to zero. If the tails are heavier than you would expect for a Gaussian distribution the kurtosis will be substantially positive. If the tails are less heavy than you would expect for a Gaussian distribution the kurtosis will be negative.
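
This behaviour can be checked on simulated data. The sketch below uses `scipy.stats.kurtosis` with `fisher=True`, which reports the excess kurtosis (0 for a normal distribution); the three distributions, the sample size, and the seed are arbitrary choices for illustration. Note that pandas' `kurt` applies a bias correction, so the two can differ slightly on small samples.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(seed=1)  # fixed seed, arbitrary choice
n = 100_000

samples = {
    "normal": rng.normal(size=n),                   # reference: excess kurtosis 0
    "laplace (heavy tails)": rng.laplace(size=n),   # theoretical excess kurtosis 3
    "uniform (light tails)": rng.uniform(size=n),   # theoretical excess kurtosis -1.2
}

# fisher=True gives the excess kurtosis (normal distribution -> 0)
excess = {name: kurtosis(values, fisher=True) for name, values in samples.items()}
for name, value in excess.items():
    print(f"{name:22s} {value:6.2f}")
```

With only a few observations per group, as in our dataset, the estimated kurtosis is very noisy and should be interpreted with care.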


## 1.3 Outlook
An often-used example of a normal distribution is the height distribution in a population. Let us check whether this could also be the case for our dataset.

When looking at hypothesis testing, we will see how to test this mathematically.

Our model is a normal distribution with the mean and width (standard deviation) taken from the dataset: **norm.pdf(x, mean, std)**.

In [0]:
from scipy.stats import norm

# Get mean and std from height

mean  = participants['Height'].mean()
std = participants['Height'].std()
print(mean,std)

# Create gaussian pdf
xmin = participants['Height'].min()
xmax = participants['Height'].max()
x = np.linspace(xmin, xmax, 100)
pdf = norm.pdf(x, mean, std)

# Plot histogram together with the fitted normal density
plt.hist(participants['Height'], density=True, alpha=0.6, color='skyblue', edgecolor='black', label='Histogram')
plt.plot(x, pdf, 'r', linewidth=2, label='Normal Distribution')
plt.xlabel('Height (cm)')
plt.legend()
plt.show()

What do you think?

# 2. Simultaneous Description of two Features

In the first part of this notebook, we only looked at individual characteristics. For example, we calculated the mean *Height* of the female participants. Of course, we did this simultaneously for all characteristics and both genders, but we never compared two characteristics directly. That is what we would like to do now.

## 2.1 Graphical Analysis

### 2.1.1 Boxplots

A boxplot is a graphical display of the minimum, maximum, and the 3 quartiles.
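
The five numbers behind a boxplot can be computed directly. The values below are made up for illustration. One caveat: with matplotlib's default settings (which pandas' `boxplot` uses), the whiskers only reach to the most extreme data points within 1.5 times the IQR of the box, so they coincide with the minimum and maximum only if there are no points beyond these fences.

```python
import numpy as np

# Made-up values, for illustration only
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(minimum, q1, median, q3, maximum)
print(lower_fence, upper_fence)
```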

We could, for example, look at the difference in balance scores depending on the surface the participants stand on. Let us first consider the case when the participants' view is not obstructed.

In [0]:
# Boxplots of CTSIB-score for the two different surfaces.
tmp = df[df['Vision']=='open'][['Surface', 'CTSIB']]
tmp.pivot(columns='Surface', values='CTSIB').boxplot()

Would you expect the boxplots to look like this? Inspect the dataset more closely.

We could also look at the height distribution according to the genders.

In [0]:
# Boxplots of height by gender.
tmp = participants[['Sex', 'Height']]
tmp.pivot(columns='Sex', values='Height').boxplot()

### 2.1.2 Scatter matrix

Now let us see whether some of the numerical variables describing the participants are correlated. (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.plotting.scatter_matrix.html)

In [0]:
from pandas.plotting import scatter_matrix
scatter_matrix(participants[['Age','Height','Weight']], alpha=0.2, diagonal='hist')
plt.show()

As we would have assumed, there is a positive correlation between height and weight.

## 2.2 Numerical Analysis

### 2.2.1 Correlation
To get more information on the relationship between two variables, the correlation coefficient is calculated.

Let's see whether the correlation coefficient supports the observation from the scatter matrix.

**Caution**: correlation does not imply causation!


In [0]:
participants[['Age','Height','Weight']].corr()

What is the definition of the correlation?
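
As a reminder, the Pearson correlation coefficient (the default of `DataFrame.corr()`) is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up heights and weights and compares the result with `np.corrcoef`; the numbers are invented for illustration.

```python
import numpy as np

# Made-up heights (cm) and weights (kg), for illustration only
height = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight = np.array([55.0, 62.0, 60.0, 72.0, 75.0, 83.0])

# Pearson correlation: covariance divided by the product of the standard deviations
dx = height - height.mean()
dy = weight - weight.mean()
r_by_hand = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

r_numpy = np.corrcoef(height, weight)[0, 1]
print(r_by_hand, r_numpy)
```

The coefficient always lies between -1 and 1 and only measures the strength of a linear relationship.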


What kind of relationship between *Age* and *Height* does the correlation coefficient suggest?

### Exercise
We have now seen several examples of descriptive approaches. What we have left out so far is to really investigate the difference in balance depending on *Surface* and *Vision*. Use the remaining time to investigate further!
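
One possible starting point, sketched on made-up rows that only borrow the column names of the balance dataset: the category labels (`"norm"`, `"foam"`, `"open"`, `"closed"`, `"dome"`) are assumptions and may differ from the real file, so check `df['Surface'].unique()` and `df['Vision'].unique()` first. Since *CTSIB* is an ordinal score, the mean is only a rough summary; the table of counts is the safer description.

```python
import pandas as pd

# Made-up rows with the same column names as the balance dataset;
# the category labels used here are assumptions and may differ from the real file
toy = pd.DataFrame({
    "Surface": ["norm", "norm", "norm", "foam", "foam", "foam"],
    "Vision": ["open", "closed", "dome", "open", "closed", "dome"],
    "CTSIB": [1, 2, 2, 2, 3, 4],
})

# Mean balance score for every combination of Surface and Vision
mean_scores = toy.pivot_table(index="Surface", columns="Vision", values="CTSIB", aggfunc="mean")
print(mean_scores)

# How often each score occurs per surface
score_counts = pd.crosstab(toy["Surface"], toy["CTSIB"])
print(score_counts)
```

Replacing `toy` with `df` applies the same two summaries to the real data.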