# Introduction to Machine Learning with Python

This Jupyter Notebook is designed to help you understand and use some machine learning concepts and libraries in the Python programming language and ecosystem.

We will be using the [scikit‑learn](https://scikit-learn.org/) library for the machine learning parts.

## Sections

1. **Guided classification on the Iris dataset** – walk through a complete machine learning pipeline using a built‑in toy dataset. This is a "classic" dataset and example.
2. **Unsupervised clustering with k‑means** – understand how to group data without knowing their labels/categories in advance.
3. **Self‑guided exercise on an anaemia dataset** – apply similar techniques to a real‑world haematology dataset.


<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 To run each cell of code, press Shift+Enter, or click the play icon (▶) in the toolbar near the top.
</div>

## Getting started

Before running this notebook you should ensure that the required packages are installed. The main libraries we will use are:

- **pandas** for data manipulation
- **numpy** for numerical arrays
- **seaborn** and **matplotlib** for plotting
- **scikit‑learn** for machine learning models and evaluation metrics

Run the following cell to import these libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit‑learn imports
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# display settings
sns.set_theme(style='whitegrid')  # set_theme is the current name for the older sns.set alias
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', None)

## 1. Supervised classification with the Iris dataset

To illustrate a complete machine learning workflow we will use the classic **Iris** dataset. This dataset contains measurements of three species of iris flower (setosa, versicolor and virginica). Each sample includes four numerical features:

1. *Sepal length*
2. *Sepal width*
3. *Petal length*
4. *Petal width*

Our goal is to build a classifier that predicts the species based on these measurements. The steps below walk you through loading the dataset, exploring it, training a model and evaluating its performance.

In [None]:
iris = load_iris()
# This is a shortcut to a built-in dataset for tutorials. 
# A real world equivalent might use something like pandas.read_excel(...) to load tabular/spreadsheet data.

X = pd.DataFrame(iris.data, columns=iris.feature_names)  
# By the way, in Python variables are usually lowercase... 
#   but in mathematical Python, many people break that coding convention to use the maths convention of uppercase for matrices.

y = pd.Series(iris.target, name='species')

print('Shape of feature matrix:', X.shape)
print('Number of labels:', y.nunique())

# Display the first few rows of the data
iris_df = pd.concat([X, y], axis=1)
iris_df.head()  # By the way, in Jupyter Notebooks, the last thing in the cell is automatically printed.

### Exploratory data analysis

Before building models it is important to explore the data. A quick way to visualise pairwise relationships between numerical features coloured by the class is to use `seaborn.pairplot`. This can highlight how separable the species are.

In [None]:
# Pairplot of features coloured by species
sns.pairplot(iris_df, hue='species', diag_kind='hist')
plt.suptitle('Pairwise feature relationships in the Iris dataset', y=1.02)
plt.show()

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  Looking at these plots... what do you expect a model to learn about how to predict the species of an unknown flower?
</div>


My notes...



### Train–test split

We will split the dataset into a **training set** and a **test set** so that we can evaluate (with the test set) how well our classifier (which learns from the training set) generalises to novel data. A common practice is to use 70–80% of the data for training and the remainder for testing.

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  Can you think of cases where this may be problematic?
</div>

My notes...

In [None]:
# Create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print('Training samples:', X_train.shape[0])
print('Test samples:', X_test.shape[0])

### Fitting a Random Forest classifier

A **decision tree** in machine learning is just like one you might draw out to explain how to follow a decision process in real life... like:

```
is petal_length < 2.5?
 ├── yes → class = setosa
 └── no → is petal_width < 1.8?
        ├── yes → class = versicolor
        └── no → class = virginica
```

This tree repeatedly subsets the data until most data fed to it is given a satisfactory label. Such a tree can be automatically learned from data, using `sklearn` or similar tools.
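
As a rough sketch of learning such a tree from data, the code below fits a small `DecisionTreeClassifier` on the Iris data and prints its rules with `export_text`. The depth limit (`max_depth=2`), `random_state=0` and the `*_demo` variable names are illustrative choices rather than part of the workflow above, and the thresholds it learns may differ from the hand-drawn example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()

# A shallow tree keeps the printed rules short enough to read
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(iris_demo.data, iris_demo.target)

# Print the learned decision rules as indented text
tree_rules = export_text(tree_demo, feature_names=list(iris_demo.feature_names))
print(tree_rules)
```

Try increasing `max_depth` to see how the rules grow more specific to the training data.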

A **random forest** is an ensemble of decision trees; each tree is trained on a bootstrapped subset of the data and randomly selects a subset of features when splitting nodes. Averaging the predictions of many trees often improves accuracy and reduces [over‑fitting](https://en.wikipedia.org/wiki/Overfitting) (where the model is too specific to the training data).
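
To make the "averaging" idea concrete, here is a minimal sketch that builds a deliberately small forest and checks that its predicted class probabilities equal the average of its individual trees' probabilities. The forest size (10 trees), `random_state=0` and the `*_demo` names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# A deliberately small forest so the individual trees are easy to inspect
forest_demo = RandomForestClassifier(n_estimators=10, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_
print('Number of trees:', len(forest_demo.estimators_))

# Average the class probabilities predicted by each individual tree
tree_probs = np.mean(
    [tree.predict_proba(X_demo) for tree in forest_demo.estimators_], axis=0
)

# The forest's own probabilities should match that average
forest_probs = forest_demo.predict_proba(X_demo)
print('Forest matches the tree average:', np.allclose(forest_probs, tree_probs))
```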

Below we instantiate a `RandomForestClassifier`, train it on the training data and make predictions on the test data.
Those predictions let us examine the model's performance metrics.

In [None]:
# Instantiate the model (you can adjust n_estimators and other hyperparameters)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit on the training data
rf.fit(X_train, y_train)

In [None]:
# Predict on the test data
y_pred = rf.predict(X_test)

# Evaluate performance
print('Classification report:')
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')
plt.show()

**Precision, recall and F1**

In binary classification the **precision** is the ratio of true positives to all predicted positives, indicating how many selected items were relevant; the **recall** is the ratio of true positives to all actual positives, measuring how many relevant items were selected. 

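The **F1 score** is the harmonic mean of precision and recall, so it is high only when both are high. For multi‑class problems scikit‑learn computes these metrics for each class and can average them (the `classification_report` above shows per‑class values along with macro and weighted averages).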
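
These definitions can be checked by hand. The sketch below uses a small set of made-up binary labels (not the Iris predictions) to count true positives, false positives and false negatives, computes the three metrics from those counts, and compares them with scikit‑learn's `precision_recall_fscore_support`. The `*_demo` arrays are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up binary labels purely for illustration (1 = positive, 0 = negative)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))  # true positives
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))  # false positives
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print('By hand:     ', precision, recall, f1)

# scikit-learn's values for the positive class, for comparison
p_sk, r_sk, f_sk, _ = precision_recall_fscore_support(
    y_true_demo, y_pred_demo, average='binary'
)
print('scikit-learn:', p_sk, r_sk, f_sk)
```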

### Cross‑validation

Random train/test splits can be sensitive to how the data are partitioned. **Cross‑validation** mitigates this by repeatedly splitting the data into different train/test folds and averaging the scores. In scikit‑learn this is done with `cross_val_score`.

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  Consider... what could happen if one species was very rare in the dataset, and we only used a single train/test split?
</div>

In [None]:
# 5‑fold cross‑validation on the training set
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')
print('Cross‑validated accuracy scores:', cv_scores)
print('Mean CV accuracy: %.3f' % cv_scores.mean())

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  K-fold splitting segments the entire dataset by "folding" it as if it were printed on a long piece of paper, e.g. into 5 folds each of which is used to test the predictions in turn. An alternative is "Monte Carlo", where e.g. 5 random test/train splits would be made. What advantage(s) does k-fold have over Monte Carlo?
</div>
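
One way to explore this question is to count how often each sample ends up in a test set under the two schemes. The sketch below does that for 20 made-up samples, using scikit‑learn's `KFold` for k‑fold and `ShuffleSplit` (repeated random splits) as a stand-in for the Monte Carlo approach. The toy data, the helper `count_test_uses` and the split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_samples = 20
X_toy = np.arange(n_samples).reshape(-1, 1)  # 20 made-up samples

def count_test_uses(splitter):
    """Count how many times each sample index is used for testing."""
    counts = np.zeros(n_samples, dtype=int)
    for _, test_idx in splitter.split(X_toy):
        counts[test_idx] += 1
    return counts

kfold_counts = count_test_uses(KFold(n_splits=5, shuffle=True, random_state=0))
mc_counts = count_test_uses(ShuffleSplit(n_splits=5, test_size=0.2, random_state=0))

print('k-fold test counts:     ', kfold_counts)
print('Monte Carlo test counts:', mc_counts)
```

Compare the two rows of counts and think about what they imply for how evenly the data are used.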

## 2. Unsupervised clustering with k‑means

In unsupervised learning there are no labels to learn from. Instead we aim to discover structure in the observed data (e.g. the petal measurements). A widely used clustering algorithm is **k‑means**, which partitions data into k clusters by minimising the within‑cluster sum of squares; it alternates between assigning each datapoint to the nearest centroid and updating centroids to the mean of their assigned points.
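
The two alternating steps can be sketched in a few lines of NumPy. This is a simplified illustration on two made-up blobs of 2‑D points, with a fixed number of iterations and random starting centroids (all assumptions for the sketch); it is not how scikit‑learn implements `KMeans` internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up blobs of 2-D points, purely for illustration
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

k = 2
# Start from k randomly chosen data points as the initial centroids
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):
    # Assignment step: distance from every point to every centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: move each centroid to the mean of its assigned points
    # (a centroid that loses all its points is left where it is)
    centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

print('Final centroids:')
print(centroids)
```

In practice, scikit‑learn's `KMeans` adds a more careful initialisation and a convergence check, so it is the better choice for real data.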

We'll apply k‑means to the iris features and examine how well the resulting clusters correspond to the actual species.

In [None]:
# Standardise features before clustering (important when variables are on different scales)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform k‑means clustering with k=3 (there are three species)
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Compare clusters to true labels
pd.crosstab(y, clusters, rownames=['Actual species'], colnames=['Cluster'])

Although k‑means is unsupervised, in this case it largely recovers the species structure: setosa is well separated from the other two species, while versicolor and virginica overlap somewhat, so some of those samples can end up in the "wrong" cluster. You can try varying the number of clusters and initialisation parameters to see how the assignments change.
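
As a starting point for that experiment, the sketch below refits k‑means for several values of k and prints the within‑cluster sum of squares, which scikit‑learn exposes as `inertia_`. It reloads and standardises the Iris features so that it runs on its own; the range of k, `n_init=10` and the variable names are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Reload and standardise the Iris features so this sketch runs on its own
X_iris_scaled = StandardScaler().fit_transform(load_iris().data)

inertias = {}
for k in range(1, 7):
    # n_init=10 runs k-means from 10 different initialisations and keeps the best
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_iris_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squares

for k, inertia in inertias.items():
    print(f'k={k}: inertia={inertia:.1f}')
```

Inertia generally falls as k grows, so a lower value alone does not mean a better clustering; look for where the improvement starts to level off.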

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  What do you think this <a href="https://en.wikipedia.org/wiki/Contingency_table">Cross Tab (contingency table)</a> is telling you? What does it mean about how successfully the model learned the data structure of the species separation?
</div>

My notes...

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  Bonus! Given that the data has several dimensions (one per feature)... do you know any techniques to try and visualise how well the clusters have been formed? Hint: people are much better at visualising things in 2 dimensions, so you'd need to find the 2 principal dimensions of the data...
  <details>
    <summary>Show me the code!</summary>
    <dt>
        <code>
        import matplotlib.pyplot as plt
        from sklearn.decomposition import PCA
        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(X_scaled)
        plt.figure(figsize=(6,5))
        plt.scatter(X_pca[:,0], X_pca[:,1], c=clusters, cmap='viridis', s=40)
        plt.title("K-Means clusters (projected via PCA)")
        plt.xlabel("PC1")
        plt.ylabel("PC2")
        plt.show()
      </code>
    </dt>
  </details>
</div>

My notes...

## 3. Self‑guided exercise: Anaemia dataset

The previous sections demonstrated how to perform supervised classification and unsupervised clustering using the iris data. Now it's your turn to apply these techniques to a realistic dataset that measures haematological indicators associated with anaemia.

### Dataset description

This dataset is provided by (and can be cited as):

> Mojumdar, Mayen Uddin; Sarker, Dhiman; Assaduzzaman, Md; Rahman, Md. Mohaimenur; Sajeeb, Md. Anisul Haque; Bari, Md Shadikul; Siddiquee, Shah Md Tanvir; Chakraborty, Narayan Ranjan (2024), “Pediatric Anemia Dataset: Hematological Indicators and Diagnostic Classification”, Mendeley Data, V1, doi: 10.17632/y7v7ff3wpj.1

It can be downloaded from: https://data.mendeley.com/datasets/y7v7ff3wpj/1
Since it is an Excel file (not a CSV) we also need a Python library like `openpyxl` to deal with it.

To do all that in a terminal:

```bash
wget https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/y7v7ff3wpj-1.zip
unzip y7v7ff3wpj-1.zip
pip install openpyxl
```

---

The anaemia dataset contains samples from patients with attributes such as:

- **Gender** (categorical)
- **Age** (in years)
- **Hb**: haemoglobin (g/dL)
- **RBC**: red blood cell count (million/µL)
- **PCV**: packed cell volume (%)
- **MCV**: mean corpuscular volume (fL)
- **MCH**: mean corpuscular haemoglobin (pg)
- **MCHC**: mean corpuscular haemoglobin concentration (g/dL)
- **Decision_Class**: target variable indicating anaemia classification (1 for anaemic, 0 for healthy)

These parameters were collected to support **predictive models and early diagnosis of anaemia**.

### Task

Use the code below as a starting point to build a classifier that predicts the anaemia status. You will need to:

1. **Load the dataset** from the Excel file into a pandas DataFrame. If you unzipped the download in the same directory as this notebook, you can read it with `pd.read_excel(...)`, as in the cell below.
2. **Explore the data** – check for missing values, visualise distributions, and look for correlations.
3. **Pre‑process the data** – convert the `Gender` column to numeric form (e.g. map 'm'→1, 'f'→0), and split the data into features `X_ana` and target `y_ana`.
4. **Train a classifier** – you can start with a `RandomForestClassifier` as we used above. Split the data into training and testing sets, fit the model, and evaluate it using accuracy, precision, recall and the confusion matrix.
5. **(Optional) Experiment** with other algorithms such as logistic regression or support vector machines, perform cross‑validation, and adjust hyperparameters.

As you work through this exercise, refer back to the code from the Iris example and adapt it as needed.

In [None]:
anaemia_df = pd.read_excel('Pediatric Anemia Dataset Hematological Indicators and Diagnostic Classification/Anemia Dataset.xlsx')
anaemia_df

<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
Some suggested steps and code snippets are given. Try to write new code to perform each task. Go further if you like!
</div>

### 3.1 Understand the basic info about the data
```python
anaemia_df.head()
anaemia_df.info()
```

In [None]:
# your code...




### 3.2 Encode categories
It is often better to "encode" (i.e. translate) categorical feature data into numbers.
This is basically because the model does not treat the features as text labels, but rather as numeric (slightly fuzzy) measurements.
You could do something like:
```python
anaemia_df['Gender'] = anaemia_df['Gender'].map({'m': 1, 'f': 0})
anaemia_df
```
to overwrite the `Gender` column with numbers instead of text labels.
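
One thing to watch for with `.map(...)` is that any value missing from the mapping becomes `NaN`. The sketch below shows this on a tiny made-up table (not the real anaemia data); the upper-case `'F'` entry and the `Gender_encoded` column name are hypothetical, included only to show the behaviour.

```python
import pandas as pd

# A tiny made-up table, not the real anaemia data
toy_gender_df = pd.DataFrame({'Gender': ['m', 'f', 'F', 'm'], 'Age': [4, 7, 5, 9]})

# Values missing from the mapping (here the upper-case 'F') become NaN
toy_gender_df['Gender_encoded'] = toy_gender_df['Gender'].map({'m': 1, 'f': 0})
n_unmapped = toy_gender_df['Gender_encoded'].isna().sum()
print(toy_gender_df)
print('Unmapped rows:', n_unmapped)

# Normalising the text first avoids losing those rows
toy_gender_df['Gender_encoded'] = (
    toy_gender_df['Gender'].str.lower().map({'m': 1, 'f': 0})
)
print(toy_gender_df)
```

Checking `anaemia_df['Gender'].unique()` before encoding is a quick way to see which values your mapping needs to cover.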

In [None]:
# your code...

# anaemia_df['Gender'] = ...


### 3.3 Separate features and target
As before, we want a feature matrix `X` and a class/target vector `y` (the anaemia decision class).
The features are in the columns with labels: `feature_cols = ['Gender', 'Age', 'Hb', 'RBC', 'PCV', 'MCV', 'MCH', 'MCHC']`

And a column / list of columns can be selected from a dataframe like this: `my_column_series = df["my_column"]` or `df_slice = df[["my_first_column", "my_second_column"]]`
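
To see the difference between the two kinds of selection, here is a minimal sketch on a small made-up table. The column names (`height`, `weight`, `label`) are hypothetical and unrelated to the anaemia data.

```python
import pandas as pd

# A small made-up table with hypothetical column names
toy_cols_df = pd.DataFrame({
    'height': [1.2, 1.5, 1.1],
    'weight': [30, 45, 28],
    'label': [0, 1, 0],
})

single = toy_cols_df['label']                # one column name -> pandas Series
subset = toy_cols_df[['height', 'weight']]   # list of names -> pandas DataFrame

print(type(single).__name__, single.shape)
print(type(subset).__name__, subset.shape)
```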

In [None]:
# your code...


# feature_cols = ...
# X = ...
# y = ...


### 3.4 Make a test/train split of the data
Remember we've got the `train_test_split(...)` function from sklearn to help.

In [None]:
# your code ...

# X_train, X_test, y_train, y_test = ...


### 3.5 Make a Random Forest Classifier
We want to train/fit a model (on the `train` subset) to classify each patient's measurements as anaemic or not.

In [None]:
# your code...

# classifier = ...
# classifier.fit(...


### 3.6 Evaluate the model using the `test` subset
- Use the `classifier` to `predict` the anaemia decision class of the `test` subset.
- Look at the `classification_report` and `confusion_matrix` of those predictions vs the ground truth (`y_test`)
- Visualise the confusion matrix (you can use Seaborn's `sns.heatmap` as we did for the Iris dataset)

In [None]:
# your code...


# y_pred = classifier.predict(...
# ...


In [None]:
# your code...
# cm = ...


<div style="
  background-color:#e6f3ff;
  border-left:6px solid #2196F3;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
  Bonus! Consider changing how many trees are used in the forest, or use one of <code>sklearn</code>'s other classifiers, like <code>LogisticRegression</code>. You could also look for natural subgroups within the patients using k-means clustering... how would you work out the real-world relevance of any clusters?
</div>
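
If you would like a pattern for comparing models, the sketch below scores a few candidates with 5‑fold cross‑validation. It uses the Iris data so that it runs on its own; you would swap in your anaemia features and target. The particular candidates, `max_iter=1000` and the variable names are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cmp, y_cmp = load_iris(return_X_y=True)

candidates = {
    'random forest (50 trees)': RandomForestClassifier(n_estimators=50, random_state=42),
    'random forest (200 trees)': RandomForestClassifier(n_estimators=200, random_state=42),
    # Scaling inside the pipeline means the scaler is re-fitted on each training fold
    'logistic regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

cmp_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cmp, y_cmp, cv=5, scoring='accuracy')
    cmp_scores[name] = scores.mean()
    print(f'{name}: mean accuracy {scores.mean():.3f}')
```

Small differences between mean scores are often within the fold-to-fold variation, so it is worth looking at the individual fold scores too.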

## Summary

In this notebook you have:

- Used `pandas` and `seaborn` to load data into Python and visualise it.
- Built a supervised classification model (random forest) on the iris dataset (flowers), examined its confusion matrix, and cross‑validated its accuracy by splitting up the data repeatedly.
- Applied k‑means clustering to try and discover structure in unlabelled data.
- Written your own code to apply all of those techniques to a dataset derived from paediatric anaemia patients.

[scikit-learn](https://scikit-learn.org/stable/), the library we have been using, is very widely used, especially in tutorial settings like this. There are various other libraries too, including ones more suitable for vast datasets, for very complex models, or for natural language processing (e.g. [spaCy](https://spacy.io/), which is similarly friendly to non-machine-learning-engineers).

<div style="
  background-color:#fdd757;
  border-left:6px solid #ffb81c;
  padding:10px 16px;
  border-radius:6px;
  font-size:14px;
">
 To use any of these techniques or tools for real use cases, you should carefully evaluate your methods, including deciding how you will evaluate them <b>before beginning to train and test</b>. This is important to avoid "p-hacking", where you try so many approaches that the fundamental assumptions required for the statistics to be valid are broken. Some research institutions offer statistics drop-in clinics to staff which, if available to you, are a great place to discuss the options for your own data.
</div>
