# **(ADD THE NOTEBOOK NAME HERE)**

## Objectives

* Start statistical testing using the cleaned CSV file in the Processed folder

## Inputs

* Import scikit-learn for machine learning and Feature-engine for feature engineering

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* If you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\ngubo\\Documents\\vscode-projects\\Capstone_Project_Fruit_Veg_Prices_UK\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\ngubo\\Documents\\vscode-projects\\Capstone_Project_Fruit_Veg_Prices_UK'
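
Re-running the os.chdir cell moves the working directory up one more level each time. A minimal sketch of a guard that only moves up while the current folder is still the notebooks folder. The helper name `move_to_project_root` is hypothetical, and the folder name `jupyter_notebooks` is taken from the path shown above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Change to the parent directory only if we are still inside the notebooks folder.

    Hypothetical helper for illustration; the default folder name is an
    assumption based on the path printed by os.getcwd() in this notebook.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Still inside the notebooks subfolder: move up exactly one level
        os.chdir(cwd.parent)
    # Otherwise leave the working directory untouched
    return Path.cwd()


print(move_to_project_root())
```

Calling the helper a second time leaves the directory unchanged, so the cell is safe to re-run.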

# Section 1

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline

### Data Cleaning
from feature_engine.imputation import MeanMedianImputer

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### PCA
from sklearn.decomposition import PCA

### ML algorithm
from sklearn.cluster import KMeans

# Load the cleaned dataset
df = pd.read_csv('Dataset/Processed/fruitvegprices-2017_2022-cleaned.csv')

# make sure price (and other numeric columns) are numeric
df['price'] = (df['price']
               .astype(str)
               .str.replace('[£,]', '', regex=True)
               .replace('', np.nan)
               .astype(float))

# pick numeric columns to impute
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns to impute:", numeric_cols)

print(numeric_cols)
print(df[numeric_cols].dtypes.head(20))
print(df[numeric_cols].head())  # sample values

# Canonical PipelineCluster - parameterized and safe (uses numeric_cols)
def PipelineCluster(impute_vars=None, n_pca=10, n_clusters=10, add_model=False):
    """Create a reusable sklearn Pipeline for numeric preprocessing and optional clustering.
    The pipeline performs these steps in order:
      1. Imputation (median) for the selected numeric variables using Feature-engine's
         MeanMedianImputer. This is robust to outliers compared to mean imputation.
      2. Standard scaling (zero mean, unit variance) via sklearn's StandardScaler.
      3. PCA to reduce dimensionality (n_pca components).
      4. Optionally append a KMeans clustering step when add_model=True.
    Parameters
    ----------
    impute_vars : list[str] or None
        List of numeric column names to impute. If None, uses the global `numeric_cols`
        discovered earlier in the notebook.
    n_pca : int
        Number of PCA components to keep. It will be capped to the number of features
        (minimum 1) to avoid errors.
    n_clusters : int
        Number of clusters for KMeans when add_model=True.
    add_model : bool
        If True, the returned pipeline includes a final KMeans estimator; otherwise
        the pipeline contains only transformers (imputer, scaler, PCA).
    Returns
    -------
    sklearn.pipeline.Pipeline
        A sklearn Pipeline instance ready to fit/transform (or fit_predict if model
        step is appended).
    """
    # Default to all numeric columns if user didn't provide a list
    if impute_vars is None:
        impute_vars = numeric_cols
    # Ensure n_pca is at most the number of input features and at least 1
    n_pca = min(n_pca, max(1, len(impute_vars)))
    # Build transformer steps. Keep names short and explicit for readability.
    steps = [
        ("imputer", MeanMedianImputer(imputation_method='median', variables=impute_vars)),
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_pca, random_state=0)),
    ]
    # Optionally add a clustering model at the end of the pipeline
    if add_model:
        steps.append(("model", KMeans(n_clusters=n_clusters, random_state=0)))
    pipeline_base = Pipeline(steps)
    return pipeline_base

# Create pipeline and run PCA on numeric columns (no model by default)
pipeline_cluster = PipelineCluster(n_pca=10, n_clusters=10, add_model=False)
# keep the preprocessing steps only: drop the "model" step by name, so PCA is kept whether add_model is True or False
pipeline_pca = Pipeline([step for step in pipeline_cluster.steps if step[0] != "model"])
# fit/transform only numeric columns to avoid converting strings to float
df_pca = pipeline_pca.fit_transform(df[numeric_cols].copy())

print(df_pca.shape, '\n', type(df_pca))

Numeric columns to impute: ['price']
['price']
price    float64
dtype: object
   price
0   2.05
1   1.22
2   1.14
3   1.05
4   1.03
(9256, 1) 
 <class 'numpy.ndarray'>
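
PipelineCluster caps n_pca at the number of input columns, so requesting n_pca=10 with a single numeric column (price) yields a one-component PCA. The same arithmetic in isolation, as a small sketch; `cap_components` is a hypothetical name used only for this illustration.

```python
def cap_components(n_pca, n_features):
    """Mirror the cap in PipelineCluster: n_pca cannot exceed the feature count.

    An empty feature list is treated as one feature, matching max(1, len(...)).
    """
    return min(n_pca, max(1, n_features))


print(cap_components(10, 1))   # one numeric column, as in this notebook
print(cap_components(10, 25))  # more features than requested components
print(cap_components(10, 0))   # empty feature list falls back to 1
```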


In [5]:
# Reuse the canonical PipelineCluster (defined earlier) which supports add_model
# Create pipeline and run PCA on numeric columns (no model by default)
pipeline_cluster = PipelineCluster(n_pca=10, n_clusters=10, add_model=False)
# keep the preprocessing steps only: drop the "model" step by name, so PCA is kept whether add_model is True or False
pipeline_pca = Pipeline([step for step in pipeline_cluster.steps if step[0] != "model"])
# fit/transform only numeric columns to avoid converting strings to float
df_pca = pipeline_pca.fit_transform(df[numeric_cols].copy())

print(df_pca.shape, '\n', type(df_pca))

(9256, 1) 
 <class 'numpy.ndarray'>
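
Selecting pipeline steps by name rather than by position keeps the PCA step regardless of whether a model step is present. A minimal sketch with plain tuples standing in for the pipeline steps; the step names come from PipelineCluster, the single-letter strings are placeholders for the real estimators, and `steps_without_model` is a hypothetical helper.

```python
def steps_without_model(steps, model_name="model"):
    """Return every (name, estimator) pair except the one named `model_name`."""
    return [(name, est) for name, est in steps if name != model_name]


# Placeholder strings stand in for the real estimators
base = [("imputer", "I"), ("scaler", "S"), ("pca", "P")]
with_model = base + [("model", "K")]

# Name-based selection gives the same three steps in both cases
print([name for name, _ in steps_without_model(base)])
print([name for name, _ in steps_without_model(with_model)])

# A positional slice removes whatever is last, which is "pca" when no model was added
print([name for name, _ in base[:-1]])
```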


In [None]:
"""Smoke-test for the preprocessing pipeline.
Runs a tiny end-to-end pipeline on a minimal numeric subset so we can:
  - verify imputation, scaling and PCA run without errors, and
  - inspect the transformed output shape and sample values.
This is intentionally small and non-destructive: it copies the test columns
and does not modify the main DataFrame.
"""
# Choose test columns: prefer 'price' if present, otherwise pick the first numeric
test_cols = ['price'] if 'price' in numeric_cols else numeric_cols[:1]
print('Test columns:', test_cols)
# Build a minimal pipeline with a single PCA component for speed
p_test = PipelineCluster(impute_vars=test_cols, n_pca=1)
# Prepare data copy and run fit_transform (safe, local to X_test)
X_test = df[test_cols].copy()
X_test_trans = p_test.fit_transform(X_test)
# Report shape and show a small sample of transformed values
print('X_test_trans shape:', getattr(X_test_trans, 'shape', None))
print('Sample transformed values (first 5 rows):')
print(X_test_trans[:5])

Test columns: ['price']
X_test_trans shape: (9256, 1)
Sample transformed values (first 5 rows):
[[ 0.24288433]
 [-0.18234037]
 [-0.22332588]
 [-0.26943458]
 [-0.27968096]]
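
With a single input column, the scaler step reduces to a z-score per value, and a one-component PCA on one standardized column can at most flip its sign. A sketch of the scaling arithmetic in plain Python, using only the five sample prices printed earlier. Because the notebook fits on all 9,256 rows, these numbers will not match the output above.

```python
from statistics import fmean, pstdev

# First five prices from the sample printed earlier in the notebook
prices = [2.05, 1.22, 1.14, 1.05, 1.03]

mu = fmean(prices)
sigma = pstdev(prices)  # population standard deviation (ddof=0), as StandardScaler uses

# z-score: subtract the mean, divide by the standard deviation
z = [(p - mu) / sigma for p in prices]
print([round(v, 3) for v in z])
```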


In [7]:
# Demo: run pipeline including KMeans model on numeric columns (small n_clusters for speed)
# Uses the PipelineCluster defined earlier in this notebook, which supports add_model
# Note: n_pca=5 is capped to the number of numeric columns inside PipelineCluster
p_with_model = PipelineCluster(impute_vars=numeric_cols, n_pca=5, n_clusters=5, add_model=True)
X_numeric = df[numeric_cols].copy()
X_trans_labels = p_with_model.fit_predict(X_numeric)
print('Assigned cluster labels shape:', len(X_trans_labels))
print('Cluster label sample (first 20):', X_trans_labels[:20])
# Attach labels to a small sample dataframe for inspection
sample = df.loc[X_numeric.index[:20], :].copy()
sample['cluster_label'] = X_trans_labels[:20]
sample[['item','price','cluster_label']].head(10)

  super()._check_params_vs_input(X, default_n_init=10)


Assigned cluster labels shape: 9256
Cluster label sample (first 20): [0 3 3 3 3 3 3 3 3 3 0 0 3 3 3 3 3 3 3 3]


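| | item | price | cluster_label |
|---|---|---|---|
| 0 | apples | 2.05 | 0 |
| 1 | apples | 1.22 | 3 |
| 2 | apples | 1.14 | 3 |
| 3 | apples | 1.05 | 3 |
| 4 | apples | 1.03 | 3 |
| 5 | apples | 0.85 | 3 |
| 6 | pears | 0.77 | 3 |
| 7 | pears | 1.24 | 3 |
| 8 | beetroot | 0.52 | 3 |
| 9 | brussels_sprouts | 0.78 | 3 |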
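
Because price is the only numeric column, the clusters should amount to price bands. One way to check this is to summarise the price range per cluster label. A minimal sketch in plain Python over the ten rows shown above only; it says nothing about the remaining rows or the other three clusters.

```python
from collections import defaultdict

# (item, price, cluster_label) rows copied from the sample output above
rows = [
    ("apples", 2.05, 0), ("apples", 1.22, 3), ("apples", 1.14, 3),
    ("apples", 1.05, 3), ("apples", 1.03, 3), ("apples", 0.85, 3),
    ("pears", 0.77, 3), ("pears", 1.24, 3), ("beetroot", 0.52, 3),
    ("brussels_sprouts", 0.78, 3),
]

# Group prices by cluster label
by_cluster = defaultdict(list)
for item, price, label in rows:
    by_cluster[label].append(price)

# Count, minimum and maximum price per cluster
summary = {
    label: {"count": len(p), "min": min(p), "max": max(p)}
    for label, p in sorted(by_cluster.items())
}
for label, stats in summary.items():
    print(label, stats)
```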


Section 1 content

---

# Section 2

Section 2 content

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down: do not create a workflow where, at a given point, you need to go back to a previous cell to execute a task, such as refreshing a variable's content.

---