# Exploring Your Dataset for Machine Learning

With any analysis, machine learning or not, it's best to start by getting to grips with your dataset. In this notebook we demonstrate how to use Python to visualise and characterise it.

## Key objectives
- Load, explore and visualise the dataset
- Apply basic quality criteria, like checks for missing values
- Check for class imbalance
- Calculate correlations between variables

## 1. Introducing the Breast Cancer Dataset

The dataset used in this series is the Wisconsin Breast Cancer Diagnostic Database, publicly available from the [University of California Irvine (UCI) Machine Learning Repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). It describes characteristics (called features in machine learning) of cell nuclei in breast masses that were sampled using fine-needle aspiration (FNA), a common diagnostic procedure in oncology.

The clinical samples used to form this dataset were collected from January 1989 to November 1991. Features were computed from the digitised images, and the original analysis classified the samples using Multisurface Method-Tree (MSM-T; K.P. Bennett, 1992), a classification method that uses linear programming to construct a decision tree - a form of machine learning!

The features are observable characteristics of the cells in the images - for example, radius, concavity, and texture. An example of one of the digitised images from an FNA sample is given below.

![An example of an image of a breast mass from which dataset features were extracted](../assets/fna_pic.png)

## 2. Setting Up Our Environment

In [1]:
# Import our packages, giving some shorter aliases to make typing easier
import numpy as np  
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.decomposition import PCA

# Set visualisation style for consistency
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

## 3. Loading and Initial Inspection

In [4]:
# Load the dataset
df = pd.read_csv('../data/breast_cancer.csv')

In [None]:
# Display first few rows to understand the structure
df.head()

In [None]:
# Check data types of all columns
df.dtypes

In [None]:
# Check how many unique data types are present in the dataframe (and how many columns use each)
df.dtypes.value_counts()

## 4. Data Quality Checks

In [None]:
# Check for missing values (count per column)
df.isnull().sum()

In [None]:
# Check for duplicate rows (number of fully duplicated rows)
df.duplicated().sum()

In [None]:
# If any of our rows were duplicated, we could filter them like this
df = df.drop_duplicates()

## 5. Target Variable Analysis

In [None]:
# Quick and simple way to print the distribution of our target variable


In [None]:
# Define class mapping for better readability

# Analyse the distribution of the target variable
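
As a minimal sketch of what these two cells might contain, the example below counts the classes in a small made-up dataframe. The column name `diagnosis`, the 0/1 coding and the `class_names` mapping are assumptions for illustration; check how the target is stored in your own CSV, since some versions of this dataset use the letters B and M, or the opposite numeric coding.

```python
import pandas as pd

# Made-up stand-in for the real data; 'diagnosis' is a hypothetical target column
toy_df = pd.DataFrame({
    "radius_mean": [17.9, 20.6, 19.7, 11.4, 12.5, 13.1, 9.5, 13.0],
    "diagnosis":   [1, 1, 1, 0, 0, 0, 0, 0],
})

# Quick look: raw counts per class
print(toy_df["diagnosis"].value_counts())

# Assumed coding (0 = benign, 1 = malignant); adjust to match your file
class_names = {0: "benign", 1: "malignant"}

# Readable labels plus the proportion of each class, to reveal any imbalance
class_counts = toy_df["diagnosis"].map(class_names).value_counts()
class_proportions = class_counts / class_counts.sum()
print(class_proportions)
```

If one class turned out to be much rarer than the other, accuracy alone could be a misleading metric when we come to evaluate models.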


## 6. Statistical Summary of Features

In [None]:
# Create a new dataframe containing only the features (excluding the target variable)

# Get statistical summary of numerical features
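
One possible way to fill in this cell, sketched on a small made-up dataframe: drop the target column, then call `describe()`. The name `diagnosis` for the target and the toy values are assumptions, not taken from the real file.

```python
import pandas as pd

# Made-up stand-in for the real data; 'diagnosis' is a hypothetical target column
toy_df = pd.DataFrame({
    "smoothness_mean": [0.08, 0.10, 0.12, 0.09],
    "area_mean":       [400.0, 600.0, 1200.0, 800.0],
    "diagnosis":       [0, 0, 1, 1],
})

# Keep only the features by dropping the target column
toy_features = toy_df.drop(columns=["diagnosis"])

# Count, mean, std, min, quartiles and max for every numerical feature
summary = toy_features.describe()
print(summary)
```

Applied to the notebook's `df`, the same `drop` call would give the `df_features` dataframe used in the later cells.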


## 7. Feature Distributions

In [None]:
# Call seaborn's pairplot() to plot pairwise relationships between features


## 8. Feature Scaling and Comparison

Features often have different scales. Look at the summary statistics above: `smoothness_mean` ranges from 0.05 to 0.16, whereas `area` ranges from 143.5 to 2501. Some machine learning algorithms struggle to compare features that differ so much in magnitude. Features with larger values can dominate the others and have a disproportionately large effect on the algorithm's predictions. To combat this, we rescale the data so that all the features vary over a comparable range. We do this with the `RobustScaler` class imported from scikit-learn.
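
By default, `RobustScaler` centres each feature on its median and divides by its interquartile range (IQR). The NumPy sketch below reproduces that calculation by hand on made-up values (the array is illustrative, not real data) to show what the rescaling does.

```python
import numpy as np

# Made-up values on very different scales, with one outlier in the second column
X = np.array([
    [0.05,  150.0],
    [0.08,  400.0],
    [0.10,  600.0],
    [0.12,  800.0],
    [0.16, 2500.0],
])

# Per-feature median and interquartile range (75th minus 25th percentile)
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25

# Robust scaling: centre on the median, then divide by the IQR
X_scaled = (X - median) / iqr
print(X_scaled)
```

After scaling, both columns have a median of 0 and an IQR of 1, so they can be compared directly, while the outlier in the second column remains clearly visible.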

In [None]:
# Compare feature means before scaling
feature_means = df_features.mean()

# Make a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 6), dpi=80)

# Original scale
ax1.bar(range(len(feature_means)), feature_means.values)
ax1.set_xlabel('Feature Index')
ax1.set_ylabel('Mean Value')
ax1.set_title('Feature Means - Original Scale')
ax1.grid(True, alpha=0.3)

# Apply robust scaling
scaler = RobustScaler()
scaled_features = scaler.fit_transform(df_features)
scaled_means = np.mean(scaled_features, axis=0)

# Scaled features
ax2.bar(df_features.columns, scaled_means)
ax2.set_xlabel('Features')
ax2.set_ylabel('Mean Value (Scaled)')
ax2.set_title('Feature Means - After Robust Scaling')
ax2.xaxis.set_ticks(np.arange(len(df_features.columns)))
ax2.set_xticklabels(df_features.columns, rotation=45, ha='right')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In the next notebook, we'll start to perform some modelling using classical machine learning methods.