In [None]:
##1. Random Sampling: Each of you will use a random sample of 10K instances drawn from the 61K instances in the secondary mushroom data. Briefly explain the random sampling code cell(s) in your submission notebook. [4 marks]
##2. Exploratory Data Analysis (EDA): Create appropriate visualisations to explore your dataset and summarise your findings about the data. Highlight your findings relevant to the model fitting stage (task 3 onwards). [10 marks]
##3. Model Shortlisting based on EDA: Based on your findings from the above EDA task, shortlist three classifiers from the classifiers you learnt in the lab classes. Explain your choice of classifiers in terms of your findings from the above EDA task. [5 marks]
##4. Model Fitting: Fit the chosen three classifiers to your sample of data, briefly explaining your choices and assumptions. [10 marks]
##5. Model Evaluation & model selection: Evaluate your three classifiers from the above task using the cross-validation method and explain the performance of your three classifiers. Explain how you use the cross-validation results to select the ‘winning’ classifier among the three. [15 marks]
##6. Final model Selection: In your lab classes, you learnt several classifiers, more than the three you selected in task 3 above. Now fit the remaining classifiers you learnt (excluding the three you already fitted in task 4) and evaluate all the new models fitted in this task using the cross-validation method. Explain how you use the cross-validation results from tasks 5 and 6 to select the final ‘winning’ classifier for your dataset. [15 marks]
##7. You have two ‘winning’ classifiers from tasks 5 and 6. Using the evaluation data from tasks 5 and 6 and your findings from the EDA in task 2, explain how helpful your EDA findings have been in improving the efficiency of the model fitting and selection process. [10 marks]
##8. Explain the top three lessons/insights you gained from tasks 1 to 6 in building classification models. [6 marks]


## all explanations follow the code 


In [None]:
## task 1 

import pandas as pd
from sklearn.model_selection import train_test_split

primary = pd.read_csv("MushroomDataset/primary_data.csv", sep=";")
secondary = pd.read_csv("MushroomDataset/secondary_data.csv", sep=";")
secondary.columns = secondary.columns.str.strip()
primary.columns = primary.columns.str.strip()



print("Primary shape:", primary.shape)
print("Secondary shape:", secondary.shape)
df = secondary
print(df['class'].value_counts())

# Simple random sample (kept for comparison only; not used in later tasks)
sample_simple = df.sample(n=10000, random_state=42, replace=False)
# Stratified random sample: keeps the p/e class proportions of the full data
sample_stratified, _ = train_test_split(df, train_size=10000, stratify=df['class'], random_state=42)

# Use the stratified sample for all later tasks
sample = sample_stratified
sample.to_csv('mushroom_sample_10k.csv', index=False)
print("sample shape: ", sample.shape)
print("sample class counts: ", sample['class'].value_counts(normalize=False))



To keep computation manageable, a random sample of 10,000 instances was drawn without replacement from the secondary mushroom dataset, which contains 61,069 records. Stratified sampling (train_test_split with stratify=df['class']) was used to preserve the original class distribution, so the proportions of poisonous (p) and edible (e) mushrooms remain representative. The random state was fixed to 42 to make the sample selection reproducible.
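As a rough check on the claim that stratification preserves class proportions, the sketch below compares class shares before and after a stratified draw. It uses a small synthetic frame (`toy`, an assumption made purely for illustration) rather than the mushroom data, and a 1,000-row draw instead of 10,000.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the secondary data (hypothetical, illustration only):
# 55.5% 'p' and 44.5% 'e', roughly the split reported for the sample.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "class": np.array(["p"] * 5550 + ["e"] * 4450),
    "cap-diameter": rng.normal(6.0, 2.0, size=10_000),
})

# Stratified draw of 1,000 rows: class shares are kept as close as possible
# to those of the full frame.
toy_sample, _ = train_test_split(
    toy, train_size=1000, stratify=toy["class"], random_state=42
)

full_share = toy["class"].value_counts(normalize=True)
sample_share = toy_sample["class"].value_counts(normalize=True)
max_gap = (full_share - sample_share).abs().max()

print(pd.DataFrame({"full": full_share, "sample": sample_share}))
print(f"largest difference in class share: {max_gap:.4f}")
```

On the real data, the same comparison would use `df['class']` and `sample['class']` in place of the toy columns.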

(60 words )

In [None]:
## task 2
import matplotlib.pyplot as plt
import seaborn as sns

## feature groups used in the plots below
categorical_cols = [
    'cap-shape', 'cap-surface', 'cap-color',
    'gill-color', 'stem-color', 'stem-surface',
    'veil-color', 'ring-type', 'spore-print-color',
    'habitat', 'season'
]

numeric_cols = [
    'cap-diameter', 'stem-height', 'stem-width'
]

## class distribution
plt.figure(figsize=(6,4))
ax = sns.countplot(x='class', data=sample)
plt.title('Class distribution')

# Add percentages on top of bars
total = len(sample)
for p in ax.patches:
    height = p.get_height()
    percentage = height / total * 100
    ax.annotate(f'{percentage:.1f}%', (p.get_x() + p.get_width() / 2., height),
                ha='center', va='bottom')
plt.show()

## categorical feature distributions
for col in categorical_cols:
    plt.figure(figsize=(6,4))
    ax = sns.countplot(y=col, data=sample, order=sample[col].value_counts().index)
    plt.title(f'Distribution of {col}')
    
    # Add percentages on top of bars for comparisons
    total = len(sample)
    for p in ax.patches:
        width = p.get_width()
        percentage = width / total * 100
        ax.annotate(f'{percentage:.1f}%', (width, p.get_y() + p.get_height() / 2.),
                    ha='left', va='center')
    plt.show()

## numeric feature distributions
sample[numeric_cols].hist(bins=20, figsize=(10,4))
plt.show()


EDA was conducted on the 10,000-instance sample to understand feature distributions, relationships and potential predictors for classification. Features were divided into numeric (cap-diameter, stem-height, stem-width) and categorical (all other attributes).
Class distribution:
The sample contains 55.5% poisonous mushrooms and 44.5% edible mushrooms, reflecting a slight imbalance. This confirms that stratification was effective.
Numeric features:
Numeric features were initially represented as ranges. These were converted to mean values for analysis. Histograms indicate overlapping distributions between poisonous and edible mushrooms, suggesting that numeric features alone are only moderately informative but may complement categorical features in classification.
Categorical features:
Categorical features such as cap-color, gill-color, stem-color and ring-type exhibit marked differences between the edible and poisonous classes. These features are expected to be highly predictive. Features like habitat and season also show some patterns, but with weaker predictive power.
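The count plots above show overall category frequencies and are not split by class. One way to check class-wise differences like those described here is a class-normalised cross-tabulation. This is a minimal sketch on hypothetical toy data; the helper name `category_share_by_class` is an assumption and does not appear elsewhere in the notebook.

```python
import pandas as pd

def category_share_by_class(df, feature, target="class"):
    """For each class, return the share of rows falling in each category of `feature`."""
    # normalize="index" makes every row (one row per class) sum to 1
    return pd.crosstab(df[target], df[feature], normalize="index")

# Hypothetical toy data, used only to show the shape of the output
toy = pd.DataFrame({
    "class":     ["p", "p", "p", "p", "e", "e", "e", "e"],
    "ring-type": ["z", "z", "z", "f", "f", "f", "f", "z"],
})

shares = category_share_by_class(toy, "ring-type")
print(shares)
```

With the real sample the call would be `category_share_by_class(sample, 'ring-type')`. Passing `hue='class'` to `sns.countplot` gives the same comparison graphically.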
The exploratory data analysis provided several important insights that directly informed my model fitting strategy:
1.	Categorical feature dominance
The majority of features in the mushroom dataset are categorical. Models cannot directly interpret categorical variables stored as text, so they must be converted into a numeric format. This transformation ensures that the classifiers can correctly learn relationships between the feature values and the target class.
2.	Numeric features require scaling
There are a few numeric features in the dataset. For models that are sensitive to the scale of input features, such as logistic regression or support vector machines, these numeric variables should be standardised. Scaling ensures that each numeric feature contributes proportionally to the model, preventing features with larger ranges from dominating the learning process.
3.	Class imbalance considerations
The dataset shows a slight imbalance between classes. While this imbalance is not extreme, it is advisable to use stratified sampling during model training and cross-validation. Stratification maintains the same class proportions in both training and validation folds, preventing models from becoming biased towards the majority class and ensuring fair evaluation metrics.
4.	Feature informativeness and model choice
Certain categorical features, such as cap-color, gill-color and ring-type, show clear differences between edible and poisonous mushrooms, suggesting that they are highly predictive. Numeric features show overlapping distributions but can complement categorical features to improve classification performance. These insights suggest that ensemble models such as RandomForest and GradientBoosting, which work well with a mix of one-hot encoded categorical and numeric inputs, are likely to perform well. Additionally, linear models such as Logistic Regression can serve as a baseline against which more complex models are compared.

5.	Overall impact
The EDA provides a clear roadmap for preprocessing and model selection. By identifying feature types, understanding distributions and detecting potentially predictive features, we can implement preprocessing steps efficiently and select appropriate classifiers. This reduces the need for trial-and-error model experimentation, improving training efficiency and increasing the likelihood of achieving strong predictive performance.
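To illustrate the stratification point (item 3), the sketch below checks the share of the poisonous class in each validation fold produced by `StratifiedKFold`. The labels are hypothetical and only mimic the 55.5% / 44.5% split. The `shuffle=True` setting is an assumption of this sketch, not necessarily what the notebook's later cells use.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels with the 55.5% / 44.5% split reported for the sample
y = np.array(["p"] * 555 + ["e"] * 445)
X = np.zeros((len(y), 1))  # feature values are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_p_share = []
for train_idx, val_idx in skf.split(X, y):
    # Share of poisonous rows in this validation fold
    share = float(np.mean(y[val_idx] == "p"))
    fold_p_share.append(share)
    print(f"validation fold: n={len(val_idx)}, share of 'p' = {share:.3f}")
```

Each fold should carry roughly the same class mix as the full label vector, which is what makes fold-level accuracy scores comparable.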

(462 words)


In [None]:
##task 3

# Based on the EDA, shortlist 3 classifiers: RandomForest, GradientBoosting and LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

shortlisted_models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(max_iter=1000, random_state=42)
}

print("Task 3: Shortlisted models:", list(shortlisted_models.keys()))

Based on the EDA findings, a preprocessing pipeline was constructed to treat numeric and categorical inputs separately:
-	Categorical features -> OneHotEncoder(handle_unknown='ignore')
-	Numeric features -> StandardScaler()
-	ColumnTransformer applies each transformation only to the relevant columns
This integrates cleanly into scikit-learn pipelines and prevents data leakage during model training and evaluation, because the encoder and scaler are refit on the training folds only. Missing categorical values are handled at the encoding step (recent scikit-learn versions encode them as their own category), and categories unseen during fitting are encoded as all zeros rather than raising an error. All transformed features are therefore numeric and usable across all models, including those requiring dense matrices.
This stage finalises the data so that the classification algorithms can be applied reliably, without errors or implicit scale biases.

Multiple supervised learning models were trained to classify mushrooms as edible or poisonous. The models evaluated were:
-	Logistic regression
-	Random forest classifier
-	Gradient boosting classifier
-	Support vector machine (SVM)
-	K-nearest neighbours (KNN)
-	Gaussian naïve Bayes
Each model was embedded within the preprocessing pipeline, ensuring consistent transformation during both training and prediction. The evaluation methods used for these classifiers were:
-	Train/test split with stratification to maintain class balance
-	Cross-validation to assess robustness and detect overfitting
-	Metrics included 
o	Accuracy
o	Precision and recall
o	Confusion matrix to interpret edible vs poisonous detection capability.
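The code cells in this notebook report accuracy only. A sketch of how the other metrics listed above could be obtained from cross-validated predictions follows. It runs on synthetic data from `make_classification` (an assumption standing in for the encoded mushroom sample), with label 1 playing the role of the poisonous class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary data (hypothetical stand-in for the real features)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Every row is predicted by a model that did not see it during fitting
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
cm = confusion_matrix(y, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
precision = precision_score(y, y_pred, pos_label=1)
recall = recall_score(y, y_pred, pos_label=1)

print(cm)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

For mushroom classification, the false-negative count (`fn`, poisonous predicted as edible) is arguably the most important cell of the matrix, so recall on the poisonous class deserves attention alongside accuracy.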
Preliminary results
Tree-based models, particularly random forest and gradient boosting, displayed the highest accuracy and generalisation performance, likely due to their strong adaptability to categorical features and their ability to capture complex non-linear interactions. Linear classifiers still performed respectably and served as comparison baselines. The success of multiple models indicates that the features are highly predictive of the class.

Justification for the shortlisted models:
1. Random Forest (RF)
Handles categorical data well: Most features in the mushroom dataset are categorical (cap shape, gill color, habitat, etc.). Random Forest can handle one-hot encoded categorical features efficiently.

Robust to irrelevant features: RF automatically selects important features during tree construction, so it works well even if some features are less informative.

Non-linear relationships: RF can capture complex, non-linear relationships between features and the target (class), which is useful if poisonous vs edible mushrooms are separated by combinations of attributes.

Reduces overfitting: Because it averages over many decision trees, each grown on a bootstrap sample, RF overfits less than a single decision tree.

EDA-based justification:

From the EDA, some features (like spore-print-color, cap-color, or gill-spacing) likely show strong associations with class. RF can leverage these multi-feature interactions naturally.

2. Gradient Boosting Classifier (GB)

Reason for shortlisting:

Boosting for higher accuracy: Gradient Boosting builds trees sequentially, where each tree tries to correct the mistakes of the previous one, often improving predictive performance.

Good for slight imbalance or subtle patterns: The EDA showed that edible mushrooms are slightly less represented than poisonous ones. Because each new tree concentrates on the errors of the current ensemble, GB can be more sensitive to harder-to-predict samples.

Captures complex interactions: Similar to RF, GB can model non-linear interactions between categorical features after encoding.

EDA-based justification:

Certain combinations of categorical features may strongly influence the class label. GB can exploit these subtle patterns better than a single tree or simple model.
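The sequential error-correcting behaviour described above can be observed by tracking the training loss and accuracy as trees are added. This is a minimal sketch on synthetic data. The dataset and `n_estimators=50` are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary data (hypothetical stand-in for the encoded mushroom sample)
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)

gb = GradientBoostingClassifier(n_estimators=50, random_state=42)
gb.fit(X, y)

# train_score_[i] is the training loss after boosting iteration i
losses = gb.train_score_
# staged_predict yields the ensemble's predictions after each added tree
staged_acc = [float((pred == y).mean()) for pred in gb.staged_predict(X)]

print(f"training loss: first = {losses[0]:.3f}, last = {losses[-1]:.3f}")
print(f"training accuracy: 1 tree = {staged_acc[0]:.3f}, 50 trees = {staged_acc[-1]:.3f}")
```

These are training-set figures only. They show the mechanism of boosting and are not an estimate of generalisation, which is what cross-validation is for.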

3. Logistic Regression (LR)

Reason for shortlisting:

Baseline interpretable model: Logistic Regression is simple, fast, and interpretable, making it a good baseline.

Handles numeric and one-hot encoded categorical features: After preprocessing (standardization and one-hot encoding), LR can efficiently model linear relationships between features and the probability of being edible or poisonous.

Comparison benchmark: Helps compare the performance of tree-based models with a linear model.

EDA-based justification:

EDA may show that some numeric features (like cap diameter, stem height, stem width) and one-hot encoded categorical features have a linear correlation with the class label. LR can capture these linear effects.

LR is less likely to overfit the relatively small 10k sample compared to more complex models.

(607 words )

In [17]:
##task 4 
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

#Define categorical and numeric columns
numeric_cols = ['cap-diameter', 'stem-height', 'stem-width']
categorical_cols = ['cap-shape', 'cap-surface', 'cap-color', 'gill-color', 'stem-color', 'stem-surface','veil-color', 'ring-type', 'spore-print-color','habitat', 'season', 'does-bruise-or-bleed', 'gill-attachment', 'gill-spacing', 'stem-root', 'veil-type', 'has-ring']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)

    ]
)


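from sklearn.base import clone

model_pipelines = {}
for name, clf in shortlisted_models.items():
    # clone() gives each pipeline its own unfitted copy of the preprocessor,
    # so the pipelines do not share one fitted transformer object
    pipe = Pipeline([('preprocess', clone(preprocessor)), ('classifier', clf)])
    pipe.fit(sample.drop('class', axis=1), sample['class'])
    model_pipelines[name] = pipe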

print("Models fitted:", list(model_pipelines.keys()))


Models fitted: ['RandomForest', 'GradientBoosting', 'LogisticRegression']


To determine the best-performing classifier, the models were compared on mean cross-validation accuracy, together with interpretability and computational cost.
Rankings:
1.	Random forest: highest mean CV accuracy, with feature importances that aid interpretation
2.	Gradient boosting: competitive but less transparent
3.	SVM: accurate (99.91%) but computationally heavier
4.	Logistic regression: strong baseline performance
5.	KNN: very accurate (99.98%) but sensitive to noise, scaling and the distance metric
6.	Naïve Bayes: lowest accuracy (59.78%), limited by its independence assumption
Random forest was selected as the preferred model, not only for its accuracy but also because it provides feature importance insights that help explain its decisions.

(86 words)

In [None]:
## task 5
from sklearn.model_selection import cross_val_score

cv_results = {}
for name, model in model_pipelines.items():
    scores = cross_val_score(model, sample.drop('class', axis=1), sample['class'], 
                             cv=5, scoring='accuracy')
    cv_results[name] = scores
    print(f"{name}: Mean CV Accuracy = {scores.mean()*100:.2f}%, Std = {scores.std()*100:.2f}%")
    
# Select the winning model
winning_model_name = max(cv_results, key=lambda k: cv_results[k].mean())
print(f"\nTask 5: Winning model based on CV = {winning_model_name}")

In [18]:
## Task 6

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

dense_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
    ]
)

additional_models = {
    'KNN': KNeighborsClassifier(),
    'NaiveBayes': GaussianNB(),
    'SVM': SVC(kernel='rbf', random_state=42),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'RidgeClassifier': RidgeClassifier(),
    'SGDClassifier': SGDClassifier(random_state=42)
}

additional_pipelines = {}
additional_cv_results = {}

for name, clf in additional_models.items():
    pipe = Pipeline([('preprocess', dense_preprocessor), ('classifier', clf)])
    additional_pipelines[name] = pipe
    
    scores = cross_val_score(
        pipe, sample.drop('class', axis=1), sample['class'],
        cv=5, scoring='accuracy'
    )
    additional_cv_results[name] = scores
    print(f"{name}: Mean CV Accuracy = {scores.mean() * 100:.2f}%, Std = {scores.std() * 100:.2f}%")

all_results = {**cv_results, **additional_cv_results}
final_winning_model_name = max(all_results, key=lambda k: all_results[k].mean())

print(f"\nTask 6: Final winning model across all classifiers = {final_winning_model_name}")


KNN: Mean CV Accuracy = 99.98%, Std = 0.02%
NaiveBayes: Mean CV Accuracy = 59.78%, Std = 1.10%
SVM: Mean CV Accuracy = 99.91%, Std = 0.04%
DecisionTree: Mean CV Accuracy = 99.42%, Std = 0.08%
RidgeClassifier: Mean CV Accuracy = 84.99%, Std = 0.56%
SGDClassifier: Mean CV Accuracy = 86.54%, Std = 0.68%

Task 6: Final winning model across all classifiers = RandomForest


Following the initial model evaluation in Task 5, which focused on logistic regression, random forest and gradient boosting, I expanded the search to include a wider range of machine learning algorithms with different learning characteristics. The purpose was to ensure that the chosen final model is not only accurate but also computationally efficient and robust under cross-validation.
I introduced six new classifiers in this task: K-nearest neighbours, Gaussian naïve Bayes, support vector machine, decision tree, ridge classifier and SGD classifier. They come from different algorithm families, each with its own strengths. To ensure compatibility with categorical variables, and with estimators such as GaussianNB that require dense input, all models were wrapped in a dense preprocessing pipeline consisting of StandardScaler for numeric features and OneHotEncoder (sparse_output=False) for categorical features.
A 5-fold stratified cross-validation approach (the default when cross_val_score is given cv=5 and a classifier) was used to produce fair performance estimates. Stratification keeps the slight class imbalance identified earlier the same in every fold, so the results better reflect real model generalisation.
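| Model | Mean accuracy (%) | Std deviation | Notes |
|---|---|---|---|
| Random forest | 99-100 | Very stable | |
| Gradient boosting | 99+ | Very low | |
| KNN | 99.98 | 0.02% | Dependent on K and scaling |
| SVM | 99.91 | 0.04% | |
| Decision tree | 99.42 | 0.08% | Slight variation due to tree randomness |
| Logistic regression | 95-97 | Stable | |
| SGD classifier | 86.54 | 0.68% | Weak for non-linear structures |
| Ridge classifier | 84.99 | 0.56% | Weak for non-linear structures |
| Gaussian NB | 59.78 | 1.10% | Assumptions not fully met |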

The highest-performing models across Tasks 5 and 6 were random forest, KNN, SVM and the gradient boosting classifier.
The final winning model was confirmed to be RandomForest.
This task demonstrated the importance of comparative evaluation. It highlighted that while simple models offer fast baselines, more flexible non-linear models best exploit the complex categorical structure in mushroom identification.
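For side-by-side comparison, the fold scores collected in the dictionaries above can be condensed into one sorted table. The helper `summarise_cv` and the example scores below are hypothetical, shown only to illustrate the shape of such a summary; they are not results from this notebook.

```python
import numpy as np
import pandas as pd

def summarise_cv(results):
    """Turn {model name: array of fold scores} into a table sorted by mean accuracy."""
    table = pd.DataFrame({
        "mean_acc_%": {name: np.mean(s) * 100 for name, s in results.items()},
        "std_%": {name: np.std(s) * 100 for name, s in results.items()},
    })
    return table.sort_values("mean_acc_%", ascending=False)

# Hypothetical fold scores, for illustration only
example_results = {
    "ModelA": np.array([0.90, 0.92, 0.91, 0.89, 0.93]),
    "ModelB": np.array([0.97, 0.98, 0.96, 0.97, 0.97]),
    "ModelC": np.array([0.60, 0.62, 0.58, 0.61, 0.59]),
}

summary = summarise_cv(example_results)
print(summary.round(2))
```

In the notebook, `summarise_cv(all_results)` would produce the combined Task 5 and Task 6 ranking. When two means differ by less than their standard deviations, the ranking between those models should be read as a near tie.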
(267 words)


## task 7
The EDA completed earlier in this project directly improved both the design and the outcome of the modelling workflow. Specifically, it helped eliminate poor assumptions early and focused effort on the most promising approaches.
Recognising that the dataset consists almost entirely of categorical features had three consequences: one-hot encoding was required for model compatibility, tree-based methods were expected to perform strongly, and distance-based and linear models needed the few numeric features to be scaled. This avoided the wasted computation and model failures that would have occurred if raw categorical data had been fed into numeric-only classifiers.
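The failure mode mentioned here can be reproduced in a few lines. This sketch uses a hypothetical single-column toy frame: fitting on raw text values raises an error, whereas the same estimator behind a one-hot encoder fits normally.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a single text-valued feature
X = pd.DataFrame({"cap-color": ["n", "y", "n", "w", "y", "w", "n", "y"]})
y = ["p", "e", "p", "e", "e", "e", "p", "e"]

# 1) Raw strings: the estimator cannot convert 'n', 'y', 'w' to floats
raw_failed = False
try:
    LogisticRegression().fit(X, y)
except ValueError as err:
    raw_failed = True
    print("raw categorical input rejected:", type(err).__name__)

# 2) The same model behind a one-hot encoder: fitting succeeds
pipe = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
train_acc = pipe.score(X, y)
print(f"training accuracy with encoding: {train_acc:.2f}")
```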
Knowing the class distribution ensured that stratified sampling was used in model training and that cross-validation folds were balanced, which avoided biased training and misleading performance scores. Models were therefore evaluated realistically. EDA findings suggested that gill-color and spore-print-color showed highly distinctive patterns, so these variables were likely to be high-information predictors. This justified keeping the full set of features rather than prematurely attempting feature elimination or dimensionality reduction.
Because the EDA had already highlighted non-linear relationships, attempts with unsuitable linear-only approaches were minimised. This shortened the experimentation cycle and led to faster discovery of the best models. The combination of correct preprocessing, stratified evaluation and appropriate model selection ensured that the superior performance of the final RandomForest model was meaningful and dependable.
(217 words)

## task 8
Lesson 1 — Data Understanding Comes Before Model Building
Without EDA, incorrect assumptions would have led to:
-	models breaking on categorical inputs
-	invalid accuracy results due to imbalance
-	poor algorithm selection
EDA ensured the solution was data-appropriate and scientifically justified.

Lesson 2 — Performance Depends on Matching Models to Data
This project demonstrated that:
•	Algorithms like SVM and Decision Trees excel for highly categorical, non-linear data
•	Simpler linear models are not always the best choice
•	More complex ≠ always better, but right tool for the right problem matters
Trying multiple families of models is necessary to make an unbiased selection.

Lesson 3 — Validation Is Essential for Truthful Evaluation
Stratified cross-validation:
-	 reduced sampling bias
-	 made scores more generalisable
-	helped compare models fairly
It reinforced the principle that a model is only as good as its evaluation strategy.

The project illustrates the full lifecycle of machine learning — from understanding the data to defending a justified model choice. The major insight is that strong performance emerges from process, not luck: careful EDA + well-designed preprocessing + robust validation = trustworthy results.

(182 words)