# Exploring Your Dataset for Machine Learning

With any analysis, machine learning or not, it's best to start by getting to grips with your dataset. In this notebook we will demonstrate how to use Python to visualise and characterise the dataset.

## Key objectives
- Load, explore and visualise the dataset
- Apply basic quality criteria, like checks for missing values
- Check for class imbalance
- Calculate correlations between variables

## 1. Introducing the Breast Cancer Dataset

The dataset used in this series is the Wisconsin Breast Cancer Diagnostic Database, publicly available from the [University of California Irvine (UCI) Machine Learning Repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). It consists of characteristics, or features (the machine learning term), of cell nuclei taken from breast masses which were sampled using fine-needle aspiration (FNA), a common diagnostic procedure in oncology. 

The clinical samples used to form this dataset were collected from January 1989 to November 1991. Relevant features were extracted from the digital images using Multisurface Method-Tree (K.P. Bennett 1992). This is a classification method that uses linear programming to construct a decision tree - a form of machine learning! 

The features are observable characteristics of the cells in the images - for example, radius, concavity, and texture. An example of one of the digitised images from an FNA sample is given below.

![An example of an image of a breast mass from which dataset features were extracted](../assets/fna_pic.png)

## 1b. Knowing Your Dataset

Why is it important to understand the origins of your dataset? The data we're working with here was collected around 1990. While the data points (features) collected in a new sample may theoretically be the same, the hardware used to collect them and the processing methods may differ. To the eye, they could look identical, but there might be subtle changes in the patterns within the data. A model learns the patterns and properties present in whatever data it is trained on, not necessarily the "true" characteristics of breast cancer cells. When asked to predict on cancer cells collected in 2025, it might give us incorrect answers, and if we weren't careful we might mistakenly trust these answers, given the model's strong performance on historical data. This phenomenon is called "dataset shift" or "distribution shift," and it's something you should always keep in mind when using someone else's data, or even your own. Reproducibility is key!
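
To make the idea of a shift concrete, here is a minimal sketch using synthetic numbers. The two arrays, their means and spreads, and the helper `standardised_mean_shift` are all made up for illustration and are not taken from the real dataset. It simply measures how far the mean of a "new" batch has moved, in units of the historical standard deviation.

```python
import numpy as np

# Synthetic, made-up measurements purely for illustration (not the real dataset)
rng = np.random.default_rng(seed=0)
radius_1990 = rng.normal(loc=14.0, scale=3.5, size=500)  # "historical" samples
radius_2025 = rng.normal(loc=15.5, scale=3.5, size=500)  # "new" samples, slightly offset


def standardised_mean_shift(reference, new):
    """Difference in means, in units of the reference standard deviation."""
    return (new.mean() - reference.mean()) / reference.std()


shift = standardised_mean_shift(radius_1990, radius_2025)
print(f"Mean shift: {shift:.2f} reference standard deviations")
```

A check like this only catches simple shifts in one feature at a time, so treat it as a first look rather than a guarantee that new data resembles the training data.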

## 2. Setting Up Our Environment

Now that we understand the origins of our dataset, let's explore it.

For this we will be using a combination of widely-used data science packages in Python. Here is a brief description of each:

- **Pandas**: A powerful data manipulation library that provides DataFrame structures ideal for working with labelled, tabular data
- **Numpy**: The fundamental package for numerical computing in Python, providing support for arrays and mathematical functions
- **Matplotlib**: A comprehensive plotting library capable of creating static, animated, and interactive visualisations
- **Seaborn**: A statistical data visualisation library that builds on matplotlib and provides a high-level interface for drawing attractive statistical graphs
- **Scikit-learn**: A machine learning library that provides simple and efficient tools for data analysis and modelling

Using a combination of these tools, we should be able to load, manipulate, and visualise our data effectively.

In [None]:
# Import our packages, giving some shorter aliases to make typing easier
import numpy as np  
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA

# Set visualisation style for consistency
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 3. Loading and Initial Inspection

The first thing we want to do is to check that the data we have is as we expected. We can check how many rows (samples) and columns (features) we have, take a sneak peek at the actual data, and see what types we have in each column.

In [None]:
# Load the dataset
df = pd.read_csv('../data/breast_cancer.csv')

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"Number of samples: {df.shape[0]}")
# Note: the column count includes the diagnosis label as well as the features
print(f"Number of columns: {df.shape[1]}")

In [None]:
# Display first few rows to understand the structure
df.head()

In [None]:
# Check data types of all columns
df.info()

It looks like the dataframe is all numeric, with `float64` features and an integer column for the diagnosis (the labels). We can check this by counting the values of the DataFrame's `dtypes`.

In [None]:
# Count how many columns have each data type
df.dtypes.value_counts()

## 4. Data Quality Checks

Data quality is crucial. We're lucky enough to be using a well-prepared, clean dataset, but in real-world problems that's not often the case! In this section we'll run through some common data quality checks. 

Null or missing values will often break machine learning models, so we need to handle them appropriately before we begin training. There are a few common strategies for handling null values. You can remove the offending row altogether; use a replacement value (such as zero, or the average of the other values in the column); or estimate the value based upon the other features in the row.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print("-" * 50)
if missing_values.sum() == 0:
    print("No missing values")
else:
    print(missing_values[missing_values > 0])
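
Our dataset has nothing to fix, so here is a small sketch of the first two strategies on a made-up DataFrame. The `toy` frame and its values are hypothetical. Estimating a value from the other features in the row is more involved and is left out here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not the breast cancer dataset)
toy = pd.DataFrame({
    "radius_mean": [10.0, np.nan, 14.0, 18.0],
    "texture_mean": [20.0, 22.0, np.nan, 24.0],
})

# Strategy 1: drop any row that contains a missing value
dropped = toy.dropna()

# Strategy 2a: replace missing values with the mean of each column
mean_filled = toy.fillna(toy.mean())

# Strategy 2b: replace missing values with a constant such as zero
zero_filled = toy.fillna(0)

print(f"Rows before: {len(toy)}, after dropping: {len(dropped)}")
print(mean_filled)
```

Dropping rows is the simplest option but throws away data, while filling with a column average keeps every row at the cost of inventing values, so the right choice depends on how much data is missing.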

It's also good to check for duplicates. This is more common than you might think. Duplicates will introduce bias in training, slightly favouring the duplicated data point.

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

In [None]:
# If any of our rows were duplicated, we could drop the repeats like this
# (keeps the first occurrence of each row and removes later copies)
df_deduped = df.drop_duplicates()
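
The boolean mask returned by `duplicated()` marks the repeated rows themselves, so it matters how it is used. As a small sketch on made-up data (the `toy` frame is hypothetical), indexing with the mask selects only the repeats, whereas `drop_duplicates()` keeps one copy of every row.

```python
import pandas as pd

# Hypothetical toy data where the third row repeats the first
toy = pd.DataFrame({
    "radius_mean": [10.0, 12.0, 10.0, 15.0],
    "diagnosis": [0, 1, 0, 1],
})

# True for every row that repeats an earlier row
mask = toy.duplicated()
print(mask.tolist())

# Indexing with the mask selects ONLY the repeated rows
only_repeats = toy[mask]

# drop_duplicates() keeps the first occurrence of each row
without_repeats = toy.drop_duplicates()

print(f"Repeated rows: {len(only_repeats)}, rows kept: {len(without_repeats)}")
```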

## 5. Target Variable Analysis

The goal of this kind of machine learning is to predict one aspect of a data point, based on the others. The thing you're trying to predict is called the label, the class, or the target variable. In this case, the label is the diagnosis, with two possible values: 0 (benign) and 1 (malignant).

Class imbalance can significantly affect model performance, so understanding the class distribution is crucial for machine learning. 

In [None]:
# Quick and simple way to print the distribution of our target variable
df['diagnosis'].value_counts()

In [None]:
# Define class mapping for better readability
cls_dict = {0: 'benign', 1: 'malignant'}

# Analyse the distribution of the target variable
class_dist = df['diagnosis'].value_counts()

print("Distribution of classes:")
print("-" * 50)
print(f"Benign samples: {class_dist[0]} ({class_dist[0] / len(df) * 100:.1f}%)")
print(f"Malignant samples: {class_dist[1]} ({class_dist[1] / len(df) * 100:.1f}%)")
print(f"\nClass ratio (benign:malignant): {class_dist[0]/class_dist[1]:.2f}:1")

**Note on Class Imbalance**: An unbalanced distribution of classes in the target variable can affect your predictions with machine learning. If one class dominates, the algorithm might achieve high accuracy by simply predicting the majority class. In this dataset, we have a reasonable balance (approximately 1.7:1 benign to malignant), though it's worth noting that the benign class is somewhat more prevalent.
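
To see why accuracy can mislead when one class dominates, here is a minimal sketch with hypothetical labels (90 samples of one class and 10 of the other, not our dataset). It computes the accuracy of a "model" that ignores the features entirely and always predicts the most common class.

```python
import pandas as pd

# Hypothetical labels: 90 samples of class 0 and 10 of class 1 (not the real dataset)
labels = pd.Series([0] * 90 + [1] * 10, name="diagnosis")

# Proportion of samples in each class
class_share = labels.value_counts(normalize=True)

# A "model" that always predicts the most common class
majority_class = class_share.idxmax()
baseline_accuracy = (labels == majority_class).mean()

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any real model should be compared against this baseline: on data this skewed, 90% accuracy would mean the model has learned nothing useful about the minority class.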

## 6. Statistical Summary of Features

A useful way to get an overview of the features is to look at the summary statistics - the mean, standard deviation, and quartile values - for each column. We can do that easily with pandas.

In [None]:
# Create a new dataframe containing only the features (excluding the target variable)
df_features = df.drop(columns=['diagnosis'])

# Get statistical summary of numerical features
df_features.describe().round(2)

## 7. Feature Distributions

Understanding how features are distributed is essential for choosing appropriate preprocessing techniques and algorithms. Machine learning algorithms generally look for ways to separate points of different classes by finding high-dimensional patterns. These patterns are very difficult for our puny human brains to visualise, but what we can do is break down the problem and look at a couple of dimensions at a time. 

The pairplot from seaborn is a great starting tool for seeing at a glance whether any pairs of features show clear differences between the classes. If you can see multiple pairs with visual separation between classes, there is a good chance a machine learning model will perform well. Many pairs might have slight separation with a lot of 'blur' between the classes; however, in a higher dimensional space the boundary will hopefully be more defined.

The figure comes out quite large because we have a lot of features, so we need to reduce the size and resolution a little to make it display nicely.

In [None]:
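# Scatter plots for every pair of features, coloured by diagnosis (green = benign, red = malignant)
g = sns.pairplot(df, hue='diagnosis', palette={0: 'green', 1: 'red'}, height=1.2, plot_kws={'alpha':0.6})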
g.figure.set_dpi(60)
plt.show()
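
One of the objectives listed at the start is to calculate correlations between variables, which puts a number on the relationships the pairplot shows visually. Below is a minimal sketch on synthetic data. The `toy` frame, its column names, and the 0.9 threshold are assumptions chosen for illustration. On the real data, the same calls could be applied to the feature columns.

```python
import numpy as np
import pandas as pd

# Synthetic data purely for illustration (not the breast cancer dataset)
rng = np.random.default_rng(seed=0)
radius = rng.normal(14.0, 3.5, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "area": np.pi * radius**2,                   # determined by radius, so strongly correlated
    "texture": rng.normal(19.0, 4.0, size=200),  # generated independently of radius
})

# Pairwise Pearson correlation coefficients (the pandas default)
corr = toy.corr()
print(corr.round(2))

# Keep only the upper triangle so each pair of features is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)

# Pairs with a very strong linear relationship (threshold is an arbitrary choice)
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```

Strongly correlated features carry overlapping information, which is worth knowing before modelling because some algorithms are sensitive to it.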

## 8. Feature Scaling and Comparison

Features often have different scales. Look at the summary statistics above: `smoothness_mean` ranges from 0.05 to 0.16, whereas `area_mean` goes from 143.5 to 2501. Some machine learning algorithms find it difficult to compare features that range over such different magnitudes. Features with larger numerical values can swamp the predictive space and have a disproportionately large effect upon the predictions made by the algorithm. To combat this, we rescale the data so that all the features vary over comparable ranges. We do this with the `RobustScaler()` class imported from scikit-learn, which centres each feature on its median and divides by its interquartile range, making it less sensitive to outliers than scaling by the mean and standard deviation.

In [None]:
# Compare feature means before scaling
feature_means = df_features.mean()

# Make a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6), dpi=80)

# Original scale
ax1.bar(range(len(feature_means)), feature_means.values)
ax1.set_xlabel('Feature Index')
ax1.set_ylabel('Mean Value')
ax1.set_title('Feature Means - Original Scale')
ax1.grid(True, alpha=0.3)

# Apply robust scaling
scaler = RobustScaler()
scaled_features = scaler.fit_transform(df_features)
scaled_means = np.mean(scaled_features, axis=0)

# Scaled features
ax2.bar(df_features.columns, scaled_means)
ax2.set_xlabel('Features')
ax2.set_ylabel('Mean Value (Scaled)')
ax2.set_title('Feature Means - After Robust Scaling')
ax2.xaxis.set_ticks(np.arange(len(df_features.columns)))
ax2.set_xticklabels(df_features.columns, rotation=45, ha='right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
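
As a rough sketch of what `RobustScaler` does with its default settings, the example below reproduces the calculation by hand on a single made-up feature: subtract the median, then divide by the interquartile range. The array `x` is hypothetical and includes one extreme value to show how it is treated.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical values for a single feature, including one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Scale with scikit-learn's default settings
scaled = RobustScaler().fit_transform(x)

# The same calculation by hand: (value - median) / interquartile range
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The extreme value is still extreme after scaling, but because the median and quartiles barely notice it, the other four values end up spread sensibly between -1 and 0.5 rather than being squashed together.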

In the next notebook, we'll start to perform some modelling using classical machine learning methods.