# Exploring Your Dataset for Machine Learning

With any analysis, machine learning or not, it's best to start by getting to grips with your dataset. In this notebook we will demonstrate how to use Python to visualise and characterise the dataset.

## Key objectives
- Load, explore and visualise the dataset
- Apply basic quality criteria, like checks for missing values
- Check for class imbalance
- Calculate correlations between variables

## 1. Introducing the Indian Liver Patient Dataset

### Information from Kaggle

This dataset contains 416 liver patient records and 167 non-liver-patient records. The dataset was collected from test samples in the north east of Andhra Pradesh, India.

'is_patient' is a binary class label used to divide patients into two groups: liver patient or not. This dataset contains:

- 441 male patient records 
- 142 female patient records.

**Note**: Any patient whose age exceeded 89 is listed as being of age "90".

[Indian Liver Patient Dataset](https://www.kaggle.com/datasets/jeevannagaraj/indian-liver-patient-dataset)

## 2. Setting Up Our Environment

Now that we understand the origins of our dataset, let's explore it. 

For this we will be using a combination of widely-used data science packages in Python. Here is a brief description of each:

- **Pandas**: A powerful data manipulation library that provides its own DataFrame structures ideal for working with labelled, tabular data

- **NumPy**: A fundamental package for numerical computing in Python, providing support for arrays and a plethora of mathematical functions

- **Matplotlib**: A comprehensive plotting library capable of creating static, animated and interactive visualisations

- **Seaborn**: A statistical data visualisation library that builds on matplotlib and provides a high-level interface for drawing attractive statistical graphs

- **scikit-learn**: A popular Python machine learning library that provides simple and efficient tools for data analysis and modelling

Using a combination of these tools, we can load, manipulate and visualise our data effectively.

In [None]:
# Import our packages, giving some shorter aliases to make typing easier
import numpy as np  
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder

# Set visualisation style for consistency
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 3. Loading and Initial Inspection

The first thing we want to do is to check that the data we have are as we expect. We can do the following:

- Check how many rows (**samples**) and columns (**features**) we have
- Take a sneak peek at the actual data
- See what types we have in each column

In [None]:
# Load the dataset
df = pd.read_csv("../data/indian_liver_patient.csv") # Read in the CSV file as a Pandas DataFrame

# Display basic information about the dataset

In [None]:
# Display first five rows to understand the structure:


In [None]:
# Check data types of all columns
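
One possible way to carry out the three inspection steps is sketched below. It uses a small made-up DataFrame called `toy` (an assumption, so that the snippet runs without the CSV file); in the notebook you would call the same attributes and methods on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (the values are made up)
toy = pd.DataFrame({
    "age": [65, 62, 58],
    "gender": ["Female", "Male", "Male"],
    "alkphos": [0.9, 0.74, 1.0],
    "is_patient": [1, 1, 2],
})

# .shape is a (rows, columns) tuple, i.e. (samples, features)
n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

# .head() shows the first five rows by default (only three exist here)
print(toy.head())

# .dtypes lists the data type stored in each column
print(toy.dtypes)
```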


It looks like the DataFrame is mostly numeric: the measurements are stored as integers or *floats*, `gender` is stored as text, and `is_patient` holds our labels (the outcome of a diagnosis, and what we are hoping to predict in this practical).

We can check this by counting the values of the DataFrame's `dtypes`, as follows:

In [None]:
# Check for how many unique data types are present in the DataFrame
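
As a minimal sketch, the count can be obtained by chaining `.value_counts()` onto `.dtypes`. The `toy` DataFrame here is made up for illustration; the real `df` will show different counts.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (the values are made up)
toy = pd.DataFrame({
    "age": [65, 62, 58],
    "gender": ["Female", "Male", "Male"],
    "alkphos": [0.9, 0.74, 1.0],
    "is_patient": [1, 1, 2],
})

# Count how many columns share each data type
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```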


## 4a. Data Quality Checks

Data quality is crucial. In this example, we are lucky enough to be using a well-prepared, clean dataset; but with real-world problems, this is often not the case! In this section we will run through some common data quality checks. 

**Null** or **missing** values will often break machine learning models, and so we need to appropriately handle them before we begin training. There are a few common strategies for **handling null values**. You can: 

- Remove the offending row altogether
- Use a replacement value (such as zero, or the average of the other values in the column)
- Interpolate the value based upon the other features in the row
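
The first two strategies are demonstrated in the cells that follow. For the third, one option (not used elsewhere in this notebook, so treat it as an illustrative assumption) is scikit-learn's `KNNImputer`, which fills a missing value with the average of that column among the most similar rows. The toy data and column names below are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up data: feature_b is missing for the third row
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, 20.0, np.nan, 40.0],
})

# Use the two rows that are closest in the observed features
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Here the two nearest rows (by `feature_a`) have `feature_b` values of 20 and 40, so the gap is filled with their average.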

In [None]:
# Check for missing values
missing_values = df.isnull().sum()

title = "Missing values:"
title_length = len(title)

print(title, "=" * title_length, sep="\n")

if missing_values.sum() == 0:
    print("No missing values")
else:
    print(missing_values[missing_values > 0])

In [None]:
# Remove missing values

df = df[~df['alkphos'].isnull()] # ~ is the NOT operator: only rows where 'alkphos' is NOT null are kept.

In [None]:
# Replace missing values (an alternative to removing the rows, as above)

df['alkphos'] = df['alkphos'].fillna(df['alkphos'].mean()) # Replaces all missing values in this column with the column mean.

It's also good practice to check for duplicates. These occur more often than you might think. Leaving duplicates in your data can introduce bias when training your machine learning model, in favour of the duplicated data point.

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()

print(f"Number of duplicate rows: {duplicates}")

In [None]:
# If any of our rows were duplicated, we could filter them out like this
df = df[~df.duplicated()]

## 4b. Augmenting Labelled Data

Text labels need to be converted to a numerical form before machine learning models can read and analyse them.

In [None]:
# Let's see what our gender distribution looks like:

df[['gender', 'is_patient']].groupby('gender').agg('count')

Next, we can reset the index, because we performed operations earlier to remove rows with missing values. When you remove rows, the Pandas DataFrame retains the original index, so the row labels may no longer run in an unbroken sequence. Using the `.reset_index()` method, we can drop (remove) the old index, renumber the rows consecutively to match our modified DataFrame, and apply this change directly to our DataFrame.

*HINT: Whenever you see the keyword argument* `inplace=True`, *this means that it is performing the operation on the original DataFrame, and modifying it, directly (as opposed to creating a copy).*

In [None]:
df.reset_index(drop=True, inplace=True)
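
To see what this does, here is a small sketch on a made-up DataFrame (`toy` is an assumption for illustration). It removes a row, which leaves a gap in the index, and then resets it.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value in the second row
toy = pd.DataFrame({"alkphos": [0.9, np.nan, 1.0, 0.4]})

# Removing the row with the missing value leaves a gap in the index
toy = toy[~toy["alkphos"].isnull()]
print(toy.index.tolist())  # [0, 2, 3]

# drop=True discards the old index instead of keeping it as a new column
toy = toy.reset_index(drop=True)
print(toy.index.tolist())  # [0, 1, 2]
```

Assigning the result back to the variable, as done here, is an alternative to `inplace=True`. A consecutive index matters for the encoding step in this section, because `.join()` lines rows up by their index.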

As mentioned above, machine learning models do not work with text labels. They only understand numerical values, so we must convert text labels (in this case, `gender`) to numerical values that the model will understand.

To do this we can use what's known as **one-hot encoding**. Taking gender as the example, this converts a single column containing the entries 'Male' and 'Female' into two columns, one for Male and another for Female, each holding a binary 0 or 1 to show whether that gender applies to the entry.

In [None]:
# Encode our data

categorical_columns = df.select_dtypes(include=['object']).columns.tolist() # Identify categorical features (with text labels).

encoder = OneHotEncoder(sparse_output=False) # Instantiate the OneHotEncoder. sparse_output=False returns a dense NumPy array rather than a sparse matrix.

one_hot_encoded = encoder.fit_transform(df[categorical_columns]) # Learn categories, and convert them into one-hot-encoded arrays.

one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns)).astype(int) # Convert these arrays into a DataFrame.

df_encoded = df.join(one_hot_df) # Add these new DataFrame columns to the existing DataFrame

df_encoded = df_encoded.drop(categorical_columns, axis=1) # Drop (remove) the old categorical columns.

print(f"Encoded dataset : \n{df_encoded}")

You may have noticed that the `gender_Female` and `gender_Male` columns are exact opposites of each other: every record in this dataset is labelled either 'Male' or 'Female', so whenever one column is 1 the other is 0. We can therefore remove one of the columns, as the remaining one carries all of the information and there is no need to duplicate it.

This may seem long-winded, but one-hot encoding also works on columns with many classes, so it is good to see how it works.

In [None]:
df_encoded = df_encoded.drop(columns = ['gender_Male']) # drop only gender_Male
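
For comparison, Pandas offers a more compact route through `pd.get_dummies`, which can encode a column and drop one redundant category in a single call. This is a different technique from the `OneHotEncoder` workflow above, shown here only as a sketch; the `ward` column and its three classes are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [40, 55, 31, 62],
    "ward": ["north", "south", "east", "north"],  # hypothetical three-class column
})

# drop_first=True removes the first category in sorted order ("east"),
# because it is implied when all of the remaining columns are 0
encoded = pd.get_dummies(toy, columns=["ward"], drop_first=True, dtype=int)

print(encoded)
```

Note that `drop_first=True` chooses which column to drop for you, whereas the cell above drops `gender_Male` explicitly.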

**Note**: As a cautionary measure, we made our changes and stored them in a new DataFrame object called `df_encoded`. Now that we are happy with how the DataFrame looks, we can overwrite the original `df`, and replace its contents with those of `df_encoded`.

In [None]:
# Now that we've checked the encoded DataFrame, overwrite the original

df = df_encoded

## 5. Target Variable Analysis

The goal of this kind of machine learning is to predict one aspect of a data point, based on the others. The thing you're trying to predict is referred to as the **label**, the **class**, or the **target variable**. In this case, the label is the *diagnosis*, stored in `is_patient`, with two possible values: 1 (is a liver patient) and 2 (not a liver patient).

**Class imbalance** can significantly affect model performance, so understanding the class distribution is crucial for machine learning. 

In [None]:
# Quick and simple way to print the distribution of our target variable


We will follow the more widely recognised convention in which a categorical value of 0 means a feature is absent (`False`) and 1 means it is present (`True`). Let's recode this dataset's values of 1 and 2 to match, using the `.replace()` method.

In [None]:
# Analyse the distribution of the target variable
class_dist = df['is_patient'].value_counts()
title = "Distribution of classes:"
title_length = len(title)

print(title, "=" * title_length, sep="\n") # Underlines the title
print(class_dist) # Number of samples in each class
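
The recoding step itself might look like the sketch below. It assumes that 2 is the code for non-patients; check the dataset's documentation and swap the two constants if your copy uses the opposite meaning. The label values here are made up.

```python
import pandas as pd

# Made-up labels using the dataset's original 1/2 coding
labels = pd.Series([1, 1, 2, 1, 2], name="is_patient")

# Assumption: 2 means "not a liver patient" and 1 means "liver patient"
NON_PATIENT_CODE = 2
PATIENT_CODE = 1

# .replace() with a dictionary maps each old value to its new value
remapped = labels.replace({NON_PATIENT_CODE: 0, PATIENT_CODE: 1})

print(remapped.value_counts())
```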

**Note**: On the subject of **class imbalance**, an unbalanced distribution of classes in the target variable can affect your predictions with machine learning. If one class dominates, the algorithm might achieve high accuracy by simply predicting the majority class. In this dataset, there is a fairly large skew towards the positive (liver patient) class, which accounts for 416 of the 583 records: something to keep in mind.
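
To put a number on this, the short calculation below uses the record counts quoted in section 1 to work out the accuracy of a "model" that always predicts the larger class. It is a back-of-the-envelope sketch rather than a real baseline evaluation.

```python
# Record counts quoted in the dataset description in section 1
n_patients = 416
n_non_patients = 167

total = n_patients + n_non_patients

# Accuracy obtained by always predicting the larger class
majority_baseline = max(n_patients, n_non_patients) / total

print(f"Total records: {total}")
print(f"Majority-class accuracy: {majority_baseline:.1%}")  # 71.4%
```

Any model we train later should comfortably beat this figure before we consider it useful.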

## 6. Statistical Summary of Features

A useful way to get an overview of the features is to look at the summary statistics - the mean, standard deviation, and quartile values - for each column. We can do that easily with Pandas. 

Let's first generate a quick preview of the DataFrame again, using the `.head()` method. By default, this displays the first five samples (rows), and all features (columns).

In [None]:
# Preview the DataFrame

For the purposes of this statistical summary, let's again create a new DataFrame called `df_features`, and have it contain only the features (removing the categorical variables). Even though we one-hot-encoded the categorical values to give them a numerical form, they are still just binary labels, so summary statistics on them would be uninformative.

Thus, our new DataFrame will:

- Exclude the target variable (the label we want to later predict using our trained machine learning model)
- Exclude gender, for which summary statistics would be largely redundant

As the target variable is the outcome we want to predict rather than a true feature, removing it at this stage is good exploratory practice. We can then use the `.describe()` method to get a statistical summary of the remaining data in our new DataFrame:

In [None]:
# Create a new DataFrame containing only the features (excluding the target variable)

# Get a statistical summary of numerical features
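
A minimal sketch of these two steps follows. It uses a made-up DataFrame (`toy`, an assumption) and the name `toy_features`; in the notebook the equivalent would be to drop the same two columns from `df` and store the result in `df_features`, which the scaling section relies on.

```python
import pandas as pd

# Hypothetical stand-in for the encoded dataset (the values are made up)
toy = pd.DataFrame({
    "age": [65, 62, 58, 72],
    "alkphos": [0.9, 0.74, 1.0, 0.4],
    "gender_Female": [1, 0, 0, 0],
    "is_patient": [1, 1, 0, 1],
})

# Keep only the features: drop the target variable and the encoded gender column
toy_features = toy.drop(columns=["is_patient", "gender_Female"])

# Count, mean, standard deviation, min, quartiles and max for each column
summary = toy_features.describe()
print(summary)
```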


## 7. Feature Distributions

Understanding how features are distributed is essential for choosing appropriate preprocessing techniques and algorithms. Machine learning algorithms generally look for ways to separate points of different classes by finding high-dimensional patterns. These patterns are often very difficult for us to visualise, but what we can do is to break down the problem, and look at a couple of dimensions at a time. 

The **pairplot** from Seaborn is a great starting tool for eyeballing the data and seeing at a glance whether any pairs of features show clear differences between the classes.

If you see multiple pairs with *decent visual separation between classes*, there is a *good chance a machine learning model will perform well*. Many pairs might have slight separation with a lot of 'blur' between the classes; however, in a higher-dimensional space the boundary will hopefully be more defined.

The figure will be quite large, as this dataset has a lot of features, and so we need to reduce the size and resolution a little to make it display nicely.

In [None]:
g = sns.pairplot(df.drop(columns=['gender_Female']), # Drop gender column
                 hue='is_patient', palette={0: 'green', 1: 'red'},  # Have the colouring correspond to diagnosis
                 height=1.2, plot_kws={'alpha':0.6}) # Specify height, and set 60% transparency between classes
                                                     # so that overlap is easily visible.

g.figure.set_dpi(60) # Set the resolution of the figure to 60 dots per inch.

plt.show() # Display the plot

## 8. Feature Scaling and Comparison

Features often have different scales. Look at the summary statistics above: `alkphos` ranges from 0.3 to 2.8, whereas `tot_proteins` goes from 63 to 2110.

Some machine learning algorithms find it *difficult* to compare features that range over such different magnitudes. Features with larger values can swamp the predictive space and have a disproportionately greater effect on the predictions made by the algorithm.

To combat this, we *rescale* the data so that all the features vary over comparable ranges. In this example, we can use the `StandardScaler()` class imported from `scikit-learn`, which shifts each feature to a mean of zero and scales it to unit variance.
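
As a sketch of what standard scaling does, the snippet below applies the formula (value minus column mean, divided by column standard deviation) by hand and compares the result with `StandardScaler`. The first and last rows reuse the minimum and maximum values quoted above; the middle row is made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales (the middle row is made up)
X = np.array([
    [0.3, 63.0],
    [1.0, 200.0],
    [2.8, 2110.0],
])

# Manual standardisation, using the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation using scikit-learn
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))     # True
print(scaled.mean(axis=0).round(6))    # each column mean is (numerically) zero
print(scaled.std(axis=0).round(6))     # each column standard deviation is one
```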

**Note**: There are other [scalers](https://scikit-learn.org/stable/modules/preprocessing.html#), each with their own advantages and disadvantages; you can learn more about them via scikit-learn's documentation. 

In [None]:
# Compare feature means before scaling
feature_means = df_features.mean()

# Make a single figure that will hold and display two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6), dpi=80)

# Original scale
ax1.bar(range(len(feature_means)), feature_means.values)
ax1.set_xlabel('Feature Index')
ax1.set_ylabel('Mean Value')
ax1.set_title('Feature Means - Original Scale')
ax1.xaxis.set_ticks(np.arange(len(df_features.columns)))
ax1.set_xticklabels(df_features.columns, rotation=45, ha='right')
ax1.grid(True, alpha=0.3)

# Apply standard scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_features)
scaled_means = scaled_features.mean(axis=0) # Mean of each scaled column (close to zero)

# Scaled features
ax2.bar(df_features.columns, scaled_means)
ax2.set_xlabel('Features')
ax2.set_ylabel('Mean Value (Scaled)')
ax2.set_title('Feature Means - After Standard Scaling')
ax2.xaxis.set_ticks(np.arange(len(df_features.columns)))
ax2.set_xticklabels(df_features.columns, rotation=45, ha='right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

*HINT: To truly see the differences between scaled and unscaled features, look at the values on the y-axes of each plot.*

Once we are happy with our scaled data, we can overwrite the original DataFrame values with the newly-scaled features:

In [None]:
# Add our new scaled data back to the DataFrame
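
One way to do this is to assign the scaled array back to the matching columns, leaving the other columns untouched. The sketch below uses a made-up `toy` DataFrame and a hypothetical `feature_cols` list; in the notebook the columns would be `df_features.columns` and the array would be `scaled_features`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the encoded dataset (the values are made up)
toy = pd.DataFrame({
    "age": [65.0, 62.0, 58.0, 72.0],
    "alkphos": [0.9, 0.74, 1.0, 0.4],
    "is_patient": [1, 1, 0, 1],
})

feature_cols = ["age", "alkphos"]  # columns to scale; the label is left alone

scaler = StandardScaler()
scaled_features = scaler.fit_transform(toy[feature_cols])

# Overwrite only the feature columns with their scaled values
toy[feature_cols] = scaled_features

print(toy)
```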



And get a snapshot of our summary statistics now, using the `.describe()` method. Note that we've been lazy and haven't removed the target variable or gender here; ignore them. In fact, leaving them in shows why we removed them above (having a patient who is 0.42% male isn't meaningful information).

In [None]:
# Summary statistics



And finally, a glance at our DataFrame, now that we have done our data exploration and pre-processing:

In [None]:
# View DataFrame



Once we are happy with the data, and are ready to move onto our machine learning analyses, let's save the pre-processed DataFrame to a CSV file, using Pandas' `.to_csv()` method.

In [None]:
# Save DataFrame as CSV file
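
A minimal sketch of the save step is shown below. It writes a made-up DataFrame to a temporary directory so that the snippet is self-contained; the file name is hypothetical, and in the notebook you would pass your own path to `df.to_csv()`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the pre-processed dataset (the values are made up)
toy = pd.DataFrame({
    "age": [0.5, -1.25, 0.75],
    "gender_Female": [1, 0, 0],
    "is_patient": [1, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "liver_preprocessed.csv"  # hypothetical file name

    # index=False stops Pandas from writing the row index as an extra column
    toy.to_csv(out_path, index=False)

    # Read the file back to confirm that it was written as expected
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(list(reloaded.columns))
```

Leaving out `index=False` would add an unnamed index column to the file, which then reappears as an extra feature when the CSV is loaded again.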



In the next notebook, we'll start to perform some modelling using classical machine learning methods.