In [None]:
# Load the Bulma stylesheet used for the notebook's styling
from IPython.display import HTML
HTML("""
<style>
@import "https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css";
</style>
""")

# Data wrangling and statistics
This tutorial covers essential concepts in descriptive statistics and their implementation in Python.
The tutorial includes:
1. Initial Data Examination
2. Descriptive Statistics
3. Visualizations

To fulfill the goals of this tutorial, we will make use of the Titanic dataset, which provides information about passengers on board the RMS Titanic, which famously sank on its maiden voyage in April 1912 after hitting an iceberg. As a first step in describing the data, run the code cells below to load the dataset and all libraries needed for obtaining descriptive statistics.


In [None]:
# import numpy
import numpy as np
# library for plotting and importing the data
import seaborn as sns
# library for specific statistics
from scipy.stats import skew, kurtosis
# library for manipulating and analysing data 
import pandas as pd
# Load the titanic dataset
titanic = sns.load_dataset('titanic')

In [None]:
# for display clarity 
pd.set_option('display.float_format', lambda x: '%.3f' % x) 

## Initial Data Examination
When beginning a data analysis or exploration task, one of the first steps is to get a quick overview of your dataset. The .head() method allows you to do that, by showing the initial rows of your dataset, providing a snapshot of what your data looks like. This initial examination helps you understand how the data is organized, spot any potential issues or missing values, and make informed decisions about your further approach.


In [None]:
print(titanic.head())

As illustrated by the output above, our version of the dataset consists of 15 variables:
1. ‘survived’: Indicates whether the passenger survived (0 = No, 1 = Yes).

2. ‘pclass’: Passenger class (1st, 2nd, or 3rd).

3. ‘sex’: Gender of the passenger.

4. ‘age’: Age of the passenger.

5. ‘sibsp’: Number of siblings/spouses aboard.

6. ‘parch’: Number of parents/children aboard.

7. ‘fare’: Fare paid for the ticket.

8. ‘embarked’: Port of embarkation (Cherbourg, Queenstown, or Southampton).

9. ‘class’: Equivalent to pclass but as a categorical data type.

10. ‘who’: Indicates whether the passenger is a child, man, or woman.

11. ‘adult_male’: Indicates if the passenger is an adult male.

12. ‘deck’: Deck where the passenger’s cabin was located.

13. ‘embark_town’: Town where the passenger embarked.

14. ‘alive’: Indicates whether the passenger survived (Yes or No).

15. ‘alone’: Indicates if the passenger was traveling alone.


### Data types and Structures
Examining data types and structures is the next necessary step, as it helps you get an understanding of what types of values the data contains, and how those values are represented. This examination is essential for data cleaning, preprocessing, and selecting appropriate analysis techniques. In this context, data types refer to the format in which data is stored, such as integers, floating-point numbers, strings, dates, or categorical labels. By running the cell below we make use of the dtypes attribute to access the data types of columns in our pandas dataframe:


In [None]:
data_types = titanic.dtypes

# Print data types and structure
print("Data Types:")
print(data_types)


Besides integers, floating point numbers, booleans and categorical data types, the output above also includes a data type called ‘object’. pandas uses this data type for columns that hold arbitrary Python objects. In practice these are most often text strings, as is the case for columns such as ‘sex’ and ‘embarked’ here, but an object column can also contain values of mixed types.
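
To make the ‘object’ data type more concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are hypothetical and not taken from the Titanic dataset). It shows a column with mixed Python types being stored as ‘object’, and a text column being converted to the categorical data type. Note that, depending on your pandas version, a pure text column may be reported as ‘object’ or as a dedicated string type.

```python
import pandas as pd

# Toy data (hypothetical values, not the Titanic dataset)
df = pd.DataFrame({
    "town": ["Cherbourg", "Southampton", "Southampton"],
    "mixed": ["a", 1, 2.5],          # str, int and float in one column
    "fare": [7.25, 71.28, 8.05],
})

# 'mixed' is stored with the generic object dtype
print(df.dtypes)

# Inspect the actual Python type of every value in the mixed column
print(df["mixed"].map(type).value_counts())

# Convert a text column to the categorical data type
df["town"] = df["town"].astype("category")
print(df["town"].dtype)
```

Converting repetitive text columns to ‘category’ is optional, but it makes the set of allowed values explicit and usually reduces memory use.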
### Dimensionality
Examining the dimensionality of data involves understanding the size and structure of a dataset: dimensionality refers to the number of observations (data points or rows) and variables (features or columns) present in the dataset. The code cell below makes use of the ‘shape’ attribute, which gives the dimensions of an array, DataFrame, or other data structure. For a DataFrame it is a tuple (rows, columns):


In [None]:
data_structure = titanic.shape
print("\nData Dimensionality:")
print(data_structure)
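
Because ‘shape’ is an ordinary tuple, it can be unpacked into separate variables. The sketch below uses a small made-up DataFrame (hypothetical column names) and also shows the related `len`, `ndim` and `size` values:

```python
import pandas as pd

# Toy data (hypothetical, 3 rows and 2 columns)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} observations, {n_cols} variables")

# Related attributes
print(len(df))   # number of rows
print(df.ndim)   # number of axes (2 for a DataFrame)
print(df.size)   # total number of cells (rows * columns)
```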

## Descriptive Statistics
Descriptive statistics are a fundamental component of data manipulation and provide a concise summary of the main characteristics and features of a dataset. These statistics are used to describe and understand the essential aspects of data, by summarizing its central tendencies, variability, and distribution. 
### Summary Statistics
When examining a dataset, it’s common to encounter two types of variables: numerical and categorical. Numerical variables represent quantities or measurements, while categorical variables represent categories or labels. To gain meaningful insights, numerical and categorical variables are summarized separately. 
Summaries of numerical variables often include common metrics such as mean (average), median (middle value), mode (most frequent value), range (difference between the maximum and minimum values), variance and standard deviation (both are measures of data dispersion). The describe() method from the pandas library returns the count, mean, standard deviation, minimum, maximum and quartiles of each numerical column (the 50% quartile is the median). For educational purposes, in the code cell below we also calculate the range, mode, variance, skewness and kurtosis, and add these to the summary table:


In [None]:
# Get summary statistics for numerical columns
numerical_summary = titanic.describe()

# Rename the "50%" row to "median" for clarity
numerical_summary = numerical_summary.rename(index={'50%': 'median'})

# Calculate the range and add it as a new row
numerical_columns = titanic.select_dtypes(include=[np.number])
range_values = numerical_columns.max(numeric_only=True) - numerical_columns.min(numeric_only=True)
numerical_summary.loc['range'] = range_values

# Find the mode 
numerical_summary.loc['mode'] = titanic.select_dtypes(include=[np.number]).mode().iloc[0]

# Calculate variance, skewness, and kurtosis and add as new rows to summary
# nan_policy='omit' skips missing values (e.g. in 'age'); the scipy default would return NaN
variance_values = numerical_columns.var()
skewness_values = skew(numerical_columns, axis=0, nan_policy='omit')
kurtosis_values = kurtosis(numerical_columns, axis=0, nan_policy='omit')
numerical_summary.loc['variance'] = variance_values
numerical_summary.loc['skewness'] = skewness_values
numerical_summary.loc['kurtosis'] = kurtosis_values

# Reorder the rows for clarity
numerical_summary = numerical_summary.reindex(['count', 'mean', 'median','mode', 'min', 'max','range', 'std','variance', '25%', '75%','skewness','kurtosis'])

# Print the numerical summary statistics
print(numerical_summary)
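
One detail worth knowing when combining pandas and scipy: pandas methods such as `var()` skip missing values by default, whereas `scipy.stats.skew` and `scipy.stats.kurtosis` propagate them unless `nan_policy='omit'` is passed. The sketch below illustrates this on a small made-up series (the values are hypothetical). It also shows the pandas `skew()` method, which skips missing values but applies a bias correction, so its result generally differs slightly from scipy's default.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Toy data with one missing value (hypothetical values)
values = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0, np.nan])

# Default behaviour: the missing value propagates and the result is NaN
default_result = skew(values.to_numpy())

# Skip missing values explicitly
omit_result = skew(values.to_numpy(), nan_policy="omit")

# pandas skips NaN automatically, but uses a bias-corrected estimator
pandas_result = values.skew()

print(default_result, float(omit_result), pandas_result)
```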

On the other hand, categorical variables require different types of summary statistics. These statistics often consist of counts, percentages, or frequencies for each category within a categorical variable. You can obtain these measures by executing the code cell below, which calls the same ‘describe’ method as above, but this time restricted to the columns stored with the categorical data type (‘class’ and ‘deck’ in this dataset):


In [None]:
# Get summary statistics for categorical columns
categorical_summary = titanic.describe(include='category')

# For clarity rename top to mode, as the two are equivalent
categorical_summary = categorical_summary.rename(index={'top': 'mode'})

print(categorical_summary)

It’s important to note that while the summary above provides essential information about the categorical variables, it only reports the most frequent category of each variable. It therefore gives no insight into how the observations are distributed over the remaining categories, or into potential class imbalances. By executing the code cell below, you can calculate the frequency of occurrence of each unique category within the dataset:


In [None]:
# count the number of observations belonging to each category
category_counts = titanic.select_dtypes(include='category').apply(lambda col: col.value_counts())
# this is only for display purposes
category_counts = category_counts.reindex(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'First', 'Second', 'Third'])
print(category_counts)
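
Since percentages are often easier to compare than raw counts, here is a minimal sketch of how relative frequencies could be obtained with the `normalize=True` option of `value_counts`. The series below is made up for illustration and is not the Titanic data:

```python
import pandas as pd

# Toy categorical data (hypothetical values)
classes = pd.Series(
    ["Third", "First", "Third", "Second", "Third", "First"],
    dtype="category",
)

# Absolute counts per category
counts = classes.value_counts()

# Relative frequencies (shares that sum to 1)
shares = classes.value_counts(normalize=True)

# Combine both into one table
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

A strongly uneven ‘share’ column is a quick indication of class imbalance.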

### Missing Data
Missing data refers to absent or incomplete values in a dataset. It’s common to encounter missing data due to various reasons such as data collection errors, sensor failures, or simply because certain information was not collected for some observations. Dealing with missing data is a crucial step in data preprocessing, as it can significantly impact the accuracy and reliability of any analysis or modeling performed on the dataset. The code cell below counts the missing values for each variable in the dataset by calling the ‘isna’ method, which marks every missing cell as True, and summing the result per column:


In [None]:
# Count missing data for each column
print(titanic.isna().sum())
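
Raw counts are easier to interpret relative to the number of rows. As a sketch, the share of missing values per column can be obtained by taking the mean of the boolean mask instead of its sum. The DataFrame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical values)
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 8.05, 53.10, 8.46],
})

# Number of missing values per column
missing_count = df.isna().sum()

# Share of missing values per column (mean of True/False values)
missing_share = df.isna().mean()

# Combine and sort so the most incomplete columns come first
report = pd.DataFrame({"missing": missing_count, "share": missing_share})
report = report.sort_values("share", ascending=False)
print(report)
```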

### Duplicates
Duplicate rows are observations that have identical values in every column. These duplicates can arise for various reasons, such as missing values, data entry errors, system glitches, or issues during data collection. The pandas library provides the ‘duplicated’ method, which marks each row of a DataFrame or Series that repeats an earlier row. We use this method in the code cell below:


In [None]:
# Count duplicate rows
duplicate_count = titanic.duplicated().sum()

# Print the count of duplicate rows
print(duplicate_count)

It’s important to exercise caution and consider the context of the data to determine whether duplicated rows should be retained or removed. For instance, in the current dataset some of the duplicate rows could be a consequence of missing values. As the age and deck variables contain the most missing observations, it’s possible that rows which are identical originate from different passengers, but appear the same due to incomplete or unavailable information. In such cases, removing all duplicate rows without considering the underlying reasons could lead to the loss of potentially valuable data. For this reason, in the current tutorial, we made the choice to not remove duplicate entries.
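
Before deciding what to do with duplicates, it can help to look at them. The sketch below, on a small made-up DataFrame, uses `duplicated(keep=False)` to show every copy of a duplicated row and `drop_duplicates()` to show what removal would do. Note that pandas treats missing values in the same position as equal when comparing rows, which is how incomplete records from different passengers can end up looking identical.

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 1 are identical, including the missing age
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "age": [np.nan, np.nan, 30.0, 22.0],
    "fare": [8.05, 8.05, 13.00, 8.05],
})

# Default: only the second and later occurrences are marked
is_dup = df.duplicated()

# keep=False marks every member of a duplicated group, for inspection
all_copies = df[df.duplicated(keep=False)]
print(all_copies)

# What the data would look like if duplicates were removed
deduplicated = df.drop_duplicates()
print(len(df), "->", len(deduplicated))
```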
### Outliers
Outliers are data points that deviate significantly from the majority of observations in a dataset. Outliers can have a disproportionate impact on statistical measures and modeling results, therefore it’s crucial that we correctly identify and handle these data points.
$Z$-scores, also known as standard scores, are a valuable tool for detecting outliers. They quantify how many standard deviations a data point is away from the mean of the dataset, and they are advantageous because they provide a standardized measure of deviation that is independent of the original data’s scale. Keep in mind that the mean and standard deviation are themselves affected by extreme values, so $z$-scores are most informative for roughly symmetric distributions. The first step in detecting outliers is calculating the $z$-scores of the numerical values, as done in the code cell below:


In [None]:
# Calculate the mean and standard deviation for each numerical column
mean_values = numerical_columns.mean()
std_values = numerical_columns.std()

# Calculate z-scores for each data point in the numerical columns
z_scores = (numerical_columns - mean_values) / std_values
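
Written as a formula, the calculation in the cell above corresponds to the following, where $x_i$ is an observed value, $\bar{x}$ the column mean and $s$ the sample standard deviation. The $n-1$ denominator reflects the default behaviour of the pandas `std` method, and missing values are skipped when computing $\bar{x}$ and $s$:

$$
z_i = \frac{x_i - \bar{x}}{s}, \qquad s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(x_j - \bar{x}\right)^2}
$$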

Next, we define a $z$-threshold: data points whose absolute $z$-score is greater than that threshold are considered outliers. This is implemented in the code cell below, with the output showing the number of outliers for each numerical variable. The choice of threshold depends on the specific analysis and domain knowledge; for the current analysis we chose a threshold of 2.


In [None]:
# Define the z-score threshold for identifying outliers
z_threshold = 2

# Keep only the values whose absolute z-score exceeds the threshold
# (all other cells, including the non-numerical columns, become NaN)
outliers = titanic[abs(z_scores) > z_threshold]

# Filter to exclude columns with all NaN values
outliers = outliers.dropna(axis=1, how='all')

# Filter to exclude rows with all NaN values
outliers = outliers.dropna(how='all')

# Count outliers for each variable
outlier_counts = outliers.count()

print(outlier_counts)
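
To see the thresholding step in isolation, here is a minimal sketch on a single made-up series of fares (hypothetical values) with one clearly extreme entry:

```python
import pandas as pd

# Toy fares with one extreme value (hypothetical values)
fares = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, 12.0, 13.0, 15.0, 250.0])

# Standardize: subtract the mean, divide by the sample standard deviation
z = (fares - fares.mean()) / fares.std()

# Flag values whose absolute z-score exceeds the threshold
z_threshold = 2
is_outlier = z.abs() > z_threshold

print(fares[is_outlier])
```

Only the fare of 250 is flagged here. Because that single value also inflates the mean and the standard deviation, the remaining values all end up with small $z$-scores, which illustrates why the threshold should be chosen with the data in mind.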

### Correlation
Correlation is a statistical measure that quantifies the degree to which two variables change together. Correlations among several variables are commonly examined by looking at a correlation matrix, which holds one coefficient for every pair of variables. By examining the values in the matrix, you can gain insights into which variables are strongly correlated, which might indicate redundancy or multicollinearity in regression analysis. In the code cell below we use the ‘corr’ method provided by the pandas library to calculate the Pearson correlation coefficients, a measure of linear association, for pairs of the numerical columns in the titanic dataset.


In [None]:
# Calculate the pairwise correlation matrix
correlation_matrix = numerical_columns.corr()

print(correlation_matrix)

Note that the strength of correlation that suggests multicollinearity can vary depending on the field and specific problem, but a common threshold to identify multicollinearity is a correlation coefficient greater than 0.7 or less than -0.7. 
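
As a sketch of how such a threshold could be applied in practice, the code below lists every pair of columns whose absolute correlation exceeds 0.7. The DataFrame and its column names are made up for illustration, and the threshold is the rule of thumb mentioned above rather than a fixed standard.

```python
from itertools import combinations

import pandas as pd

# Toy data (hypothetical values): y is almost a multiple of x
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 11],
    "z": [3, 1, 4, 1, 5],
})

corr = df.corr()
threshold = 0.7

# Visit each unordered pair of columns once and keep the strong ones
strong = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
}

for (a, b), r in strong.items():
    print(f"{a} and {b}: r = {r:.3f}")
```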
## Visualizations
While measures like mean, median, mode and standard deviation provide essential insights into the typical values and spread of data, visualizations offer a more comprehensive understanding by revealing patterns, relationships, and additional information within the dataset. Different data types within a dataset often necessitate the use of specific types of plots. 
### Histograms
A histogram is a graphical representation of the distribution of data. Histograms are particularly useful for understanding the central tendency, spread, and shape of data. In the code cell below we make use of the ‘histplot’ function from the Seaborn library, in order to plot a histogram of the fare prices paid by the passengers on Titanic: 


In [None]:
sns.histplot(titanic['fare'], bins=50, color='skyblue')
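
A histogram is built by splitting the value range into equal-width bins and counting the observations in each bin. The sketch below reproduces that kind of binning numerically with `numpy.histogram` on a few made-up fares (hypothetical values). It is meant to illustrate the idea and is not a description of Seaborn's internals.

```python
import numpy as np

# Toy fares (hypothetical values)
fares = np.array([5.0, 7.0, 8.0, 8.0, 12.0, 15.0, 30.0, 80.0])

# Split the range [5, 80] into 4 equal-width bins and count values per bin
counts, edges = np.histogram(fares, bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}): {count}")
```

Most values fall into the first bin and the last bins are nearly empty, which is the numerical counterpart of a long right tail in the plot.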

### Box Plots
Like histograms, box plots, also known as box-and-whisker plots, are graphical representations used to visualize the distribution of a variable. They provide valuable insights into the central tendency, spread, and presence of outliers in the data. Box plots are particularly useful for comparing the distribution of a variable across different categories or groups. In the code cell below we make use of the ‘boxplot’ function from the Seaborn library, in order to plot the age of the passengers across different classes in the Titanic dataset: 


In [None]:
# Create a boxplot of the "age" column for each passenger class
sns.boxplot(x='class', y='age', data=titanic)
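
The elements of a box plot can also be computed by hand. The sketch below, on a made-up series of ages (hypothetical values), calculates the quartiles that form the box and the fences at 1.5 times the interquartile range, which is the common default rule for drawing points as individual outliers:

```python
import pandas as pd

# Toy ages (hypothetical values)
ages = pd.Series([2, 18, 22, 25, 28, 30, 35, 40, 80])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
fliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(fliers.tolist())
```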

### Bar Charts - (Categorical / classification data)
Bar charts are used to display the frequency of categories within a dataset. Bar charts are particularly useful for comparing different categories and understanding their relative sizes. The code cell below makes use of the ‘barplot’ function from the Seaborn library, in order to provide a visual representation of the class distribution among the passengers: 


In [None]:
# Count the number of passengers in each class
class_counts = titanic['class'].value_counts()
sns.barplot(x=class_counts.index, y=class_counts.values, palette='Set2')

### Heatmaps
Heatmaps are commonly used to visualize relationships, patterns, or distributions within complex datasets. Heatmaps are particularly useful for identifying multicollinearity. The Seaborn library contains the ‘heatmap’ function, which is used in the code cell below to create a heatmap of the previously constructed correlation matrix: 


In [None]:
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)

### Scatter Plots
Scatter plots are commonly used to visualize the relationship between two numerical variables. Additionally, outliers can often be spotted in scatterplots. They appear as points far away from the main cluster. In the code cell below we make use of the ‘scatterplot’ function from the Seaborn library to visualize the relationship between passenger ages and the fare prices paid:


In [None]:
sns.scatterplot(data=titanic, x='age', y='fare', alpha=0.5, color='b')  # Plot age vs. fare

### Pair Plots
Pair plots are a comprehensive way to visualize the relationships between multiple numerical variables simultaneously. Here’s how you can generate pair plots for the age and fare numerical variables using the ‘pairplot’ function from the Seaborn library:


In [None]:
sns.pairplot(titanic[['age', 'fare']])

### Missing Data Patterns
Besides helping in identifying multicollinearity, heatmaps can also be useful for visualizing missing data in a dataset. They provide a quick and intuitive way to identify patterns of missingness across variables. You can create a heatmap of missing data using Seaborn. Here’s how to do it:


In [None]:
sns.heatmap(titanic.isna(), cbar=False, cmap='viridis')