# Iris Preprocessing (Section 2 Task 1)

## Load the Iris data

Load the Iris dataset from scikit-learn into a DataFrame, add the species name and the integer target as columns, then check the shape and the number of missing values per column.

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from pathlib import Path
import matplotlib.pyplot as plt

PROJECT_ROOT = Path.cwd().parent
IMAGES_DIR = PROJECT_ROOT / 'outputs' / 'images'
DATA_DIR = PROJECT_ROOT / 'data'
IMAGES_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR.mkdir(parents=True, exist_ok=True)

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['species'] = pd.Categorical.from_codes(data.target, data.target_names)
df['target'] = data.target
print('Shape:', df.shape)
print('Missing values per column:')
print(df.isnull().sum())

Shape: (150, 6)
Missing values per column:
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
target               0
dtype: int64


## Scale features and encode labels

Apply Min-Max scaling to numeric features and encode species as integers. Save the preprocessed dataset for downstream tasks.

In [3]:
# Scaling and encoding
scaler = MinMaxScaler()
features = data.feature_names
df_scaled = df.copy()
df_scaled[features] = scaler.fit_transform(df_scaled[features])
le = LabelEncoder()
df_scaled['species_encoded'] = le.fit_transform(df_scaled['species'])
# Save the preprocessed frame for downstream tasks
DATA_OUT = DATA_DIR / 'iris_preprocessed.csv'
df_scaled.to_csv(DATA_OUT, index=False)
print('Saved preprocessed iris to', DATA_OUT)

Saved preprocessed iris to c:\Users\HP\Documents\DSA2040_Practical_Exam_Geoffrey_Mwangi_566\data\iris_preprocessed.csv
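
To make the two transformations concrete, here is a minimal pure-Python sketch of what Min-Max scaling and integer label encoding compute. The helper names `min_max_scale` and `encode_labels` and the sample values are hypothetical and exist only for illustration; the notebook itself relies on `MinMaxScaler` and `LabelEncoder`.

```python
def min_max_scale(values):
    """Rescale a column to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    lookup = {name: code for code, name in enumerate(classes)}
    return [lookup[name] for name in labels], classes


# Illustrative values only, not rows read from the saved CSV
sepal_length = [4.3, 5.8, 7.9]
species = ['virginica', 'setosa', 'versicolor', 'setosa']

scaled = min_max_scale(sepal_length)
codes, classes = encode_labels(species)
print(scaled)
print(codes, classes)
```

Because the three species names sort alphabetically into the same order as `data.target_names`, the `species_encoded` column should carry the same integers as the existing `target` column.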


## Exploratory visualizations

Generate pairwise scatter plots, a correlation heatmap, and boxplots to inspect feature distributions and relationships.

In [None]:
# Exploratory plots: scatter matrix, correlation heatmap, boxplots
from pandas.plotting import scatter_matrix
scatter_matrix(df_scaled[features], alpha=0.8, figsize=(10,10))
plt.savefig(IMAGES_DIR / 'iris_scatter_matrix.png')


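# Correlation heatmap (Pearson); pin the color range to [-1, 1] so colors are comparable
plt.figure(figsize=(6,5))
plt.imshow(df_scaled[features].corr(), interpolation='nearest', aspect='auto', vmin=-1, vmax=1)
plt.colorbar(label='Pearson correlation')
plt.xticks(range(len(features)), features, rotation=45, ha='right')
plt.yticks(range(len(features)), features)
plt.title('Correlation heatmap - Iris features')
plt.tight_layout()
plt.savefig(IMAGES_DIR / 'iris_correlation_heatmap.png')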


plt.figure(figsize=(8,6))
df_scaled[features].boxplot()
plt.title('Boxplots - Iris features')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(IMAGES_DIR / 'iris_boxplots.png')

print('Plots saved to', IMAGES_DIR)

Plots saved to c:\Users\HP\Documents\DSA2040_Practical_Exam_Geoffrey_Mwangi_566\outputs\images
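
The heatmap is built from `DataFrame.corr()`, which computes Pearson correlation by default. The sketch below is a hand-written version of that coefficient, included to show that Min-Max scaling leaves it unchanged, so the heatmap of the scaled features should match one drawn from the raw measurements. The helper names `pearson` and `min_max` and the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def min_max(values):
    """Rescale values to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative petal length / petal width pairs, not the full dataset
petal_length = [1.4, 4.7, 5.1, 6.0, 1.3]
petal_width = [0.2, 1.4, 1.9, 2.5, 0.2]

raw_corr = pearson(petal_length, petal_width)
scaled_corr = pearson(min_max(petal_length), min_max(petal_width))
print(raw_corr, scaled_corr)
```

The two printed values agree up to floating-point rounding, because Pearson correlation is invariant under shifting and positive rescaling of either variable.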


## Train/test split helper

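Define a helper that splits the scaled features and encoded labels into train and test sets (80/20 by default), stratified by species so that each class keeps the same share in both sets.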

In [5]:
from sklearn.model_selection import train_test_split

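def split_data(df, test_size=0.2, random_state=42):
    """Return X_train, X_test, y_train, y_test, stratified by encoded species."""
    # `features` is the list of feature column names defined in an earlier cell
    X = df[features].values
    y = df['species_encoded'].values
    return train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)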

X_train, X_test, y_train, y_test = split_data(df_scaled)
print('Train/test sizes:', X_train.shape, X_test.shape)

Train/test sizes: (120, 4) (30, 4)
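
To illustrate what stratification guarantees, the following is a hand-rolled sketch of a stratified split over row indices. It is not how `train_test_split` is implemented internally; the function name `stratified_split_indices` is hypothetical, and the labels are a stand-in that assumes 50 rows per class, as in Iris.

```python
import random
from collections import Counter, defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Stand-in labels: three classes with 50 rows each
labels = [0] * 50 + [1] * 50 + [2] * 50

train_idx, test_idx = stratified_split_indices(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print('train:', dict(train_counts))
print('test: ', dict(test_counts))
```

With 150 balanced rows and `test_size=0.2`, each class contributes 40 rows to the training set and 10 to the test set, which is consistent with the 120/30 sizes reported above.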
