# Mushrooms challenge

Every autumn, Catalonia hosts a big mushroom-hunting competition. Unfortunately, in recent years many people have picked poisonous mushrooms believing them to be edible, which has overloaded the healthcare system.

The Department of Health has asked us to develop a model that, given basic image attributes of a mushroom, detects whether it is poisonous, and to give guidance on which features are most indicative of a poisonous mushroom.

## 1.&nbsp;Import libraries

In [69]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

from sklearn.metrics import ConfusionMatrixDisplay

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

## 2.&nbsp; Read in, manipulate and split data

In [70]:
# url = "https://drive.google.com/file/d/1Op1vQftBKN1lrPVGGLJU-UOlv_dScTup/view?usp=sharing"

url = "https://drive.google.com/file/d/1eT8uTctwIx9yu2m207ZD1zSfshT9B24k/view?usp=drive_link"
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
mush = pd.read_csv(path)
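
The two lines above turn a Google Drive share link into a direct-download link by extracting the file id, which is the second-to-last segment of the share URL. A minimal sketch of that step as a reusable helper; the function name `drive_download_url` is our own and not part of any library:

```python
def drive_download_url(share_url: str) -> str:
    """Build a direct-download URL from a Drive share link of the form
    https://drive.google.com/file/d/<file_id>/view?..."""
    file_id = share_url.split("/")[-2]  # segment just before "view?..."
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1eT8uTctwIx9yu2m207ZD1zSfshT9B24k/view?usp=drive_link"
download_url = drive_download_url(share_url)
print(download_url)
```

The helper assumes the share link always has the `/file/d/<file_id>/view` shape; other Drive link formats would need different parsing.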

In [71]:
mush.head(10)

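|   | cap.shape | cap.color | bruises | stalk.color.above.ring | stalk.color.below.ring | population | Id | poisonous |
|---|---|---|---|---|---|---|---|---|
| 0 | k | e | False | w | w | v | 6573 | 1 |
| 1 | f | e | True | p | w | y | 4426 | 0 |
| 2 | b | w | False | w | w | s | 7018 | 0 |
| 3 | k | g | False | w | w | n | 5789 | 0 |
| 4 | f | n | True | p | g | v | 6187 | 0 |
| 5 | x | w | False | w | w | s | 2508 | 0 |
| 6 | x | w | False | w | w | a | 488 | 0 |
| 7 | f | y | False | b | b | v | 3673 | 1 |
| 8 | f | y | False | b | b | v | 5364 | 1 |
| 9 | x | y | True | w | w | n | 2582 | 0 |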


In [72]:
X = mush.drop(columns=["Id"]).copy()
y = X.pop("poisonous")

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=123)

X_train.head()

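|   | cap.shape | cap.color | bruises | stalk.color.above.ring | stalk.color.below.ring | population |
|---|---|---|---|---|---|---|
| 198 | b | b | True | w | w | v |
| 4637 | f | n | True | p | g | y |
| 3019 | f | p | True | w | w | v |
| 2468 | x | g | False | w | w | a |
| 6225 | x | w | True | w | w | s |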
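
The split above is purely random. If the class proportions should be identical in the training and test parts, `train_test_split` accepts a `stratify` argument. A small sketch on invented toy data (the 60/40 class ratio is made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the mushroom data: 60 edible (0) and 40 poisonous (1) rows.
toy = pd.DataFrame({
    "cap.shape": ["x", "f", "b", "k"] * 25,
    "poisonous": [0] * 60 + [1] * 40,
})
X_toy = toy.drop(columns=["poisonous"])
y_toy = toy["poisonous"]

# stratify=y_toy keeps the 60/40 class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=123, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification changes anything noticeable here depends on how balanced the real data is; with roughly equal classes the effect is small.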


## 3.&nbsp; Create pipeline using RandomForest classifier

We chose `RandomForestClassifier()` as our model, but you can try any other classifier.

In [206]:
# One-hot encode all categorical columns
categorical_features = X.columns.tolist()
preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='infrequent_if_exist'), categorical_features)
])

# Create the pipeline: preprocessing followed by the classifier
model = make_pipeline(preprocessor, RandomForestClassifier(random_state=42))

# Train on the training set and evaluate on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))



Accuracy: 0.9661538461538461

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.94      0.97       671
           1       0.94      0.99      0.97       629

    accuracy                           0.97      1300
   macro avg       0.97      0.97      0.97      1300
weighted avg       0.97      0.97      0.97      1300
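
Because the preprocessing lives inside the pipeline, trying another classifier only means changing the last step. A minimal sketch with `LogisticRegression` on invented toy data (the data and the names `X_toy`, `model_lr` are ours, chosen only to keep the example self-contained):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data in which cap colour "r" always means poisonous.
X_toy = pd.DataFrame({
    "cap.color": ["r", "w", "r", "w", "r", "w", "r", "w"],
    "bruises": [True, False, False, True, True, False, False, True],
})
y_toy = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

preprocessor_toy = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), X_toy.columns.tolist())
])

# Only the last step changes when swapping classifiers.
model_lr = make_pipeline(preprocessor_toy, LogisticRegression(max_iter=1000))
model_lr.fit(X_toy, y_toy)

print(model_lr.predict(X_toy))
```

On the real data, the same pattern would reuse `preprocessor`, `X_train` and `y_train` from the cell above.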



In [207]:
# Grid search over the Random Forest hyperparameters
# (GridSearchCV is already imported in section 1)

param_grid = {
    'randomforestclassifier__n_estimators': [50, 100, 200],
    'randomforestclassifier__max_depth': [10,20,30],
    'randomforestclassifier__min_samples_split': [2, 5],
    'randomforestclassifier__min_samples_leaf': [1,2,3],
    'randomforestclassifier__max_features': ['sqrt', 'log2']
}

# Run a grid search to find the optimal combination of hyperparameters
rf_search = GridSearchCV(
    model,
    param_grid,
    cv=5,
    verbose=1
)

rf_search.fit(X_train, y_train)

best_param = rf_search.best_params_

best_param

Fitting 5 folds for each of 108 candidates, totalling 540 fits


KeyboardInterrupt: 
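
The exhaustive search above was interrupted before it finished its 540 fits. One cheaper option is `RandomizedSearchCV`, which samples a fixed number of candidates from the same grid; and since the goal is to avoid false negatives, the search can be scored on recall rather than on the default accuracy. A sketch on synthetic data (the data, `n_iter=5` and `cv=3` are illustrative assumptions, not tuned choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary data, used only to keep the sketch self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, 30],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2, 3],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # 5 sampled candidates instead of all 108
    scoring="recall",  # optimise recall of the positive class
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With the pipeline from this notebook, the keys would keep their `randomforestclassifier__` prefix. Passing `n_jobs=-1` to either search class runs the fits in parallel on all available cores.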

In [None]:
# Update parameters of the pipeline using set_params
model.set_params(
    randomforestclassifier__n_estimators=200,
    randomforestclassifier__max_depth=20,
    randomforestclassifier__min_samples_split=2,
    randomforestclassifier__min_samples_leaf=1,
    randomforestclassifier__max_features='sqrt'
)

model.fit(X_train, y_train)

In [None]:
model.predict(X_test)

In [None]:
accuracy_score(y_true=y_test, y_pred=model.predict(X_test))

- A bit better than 95% accuracy.
- That means fewer than 5 out of 100 mushrooms are wrongly labeled.
- "Wrongly labeled" covers two cases:
     - a poisonous mushroom is classified as non-poisonous, or
     - a non-poisonous mushroom is classified as poisonous.

**Are both cases equally dangerous?**

Let's plot the confusion matrix to see how well our model performed.

In [None]:
ConfusionMatrixDisplay.from_estimator(model, X_train, y_train, display_labels=["Not poisonous", "Poisonous"]);

The confusion matrix shows that our model produced 44 false negatives: 44 mushrooms are predicted as non-poisonous (negative) although they are in fact poisonous.

Our task is to avoid these situations at all costs, so we need to find a way to bring the bottom-left cell of the confusion matrix down to 0.
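
The two kinds of error can also be read off as plain numbers with `confusion_matrix`. A small sketch with hypothetical labels; for an honest estimate on the real data, the predictions should come from the held-out test set (`model.predict(X_test)`) rather than from the training data:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = poisonous, 0 = not poisonous.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

print("False negatives (poisonous predicted as safe):", fn)
print("False positives (safe predicted as poisonous):", fp)
```

The false negatives are the bottom-left cell of the plotted matrix, because rows are true labels and columns are predicted labels.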

> Note: Judging from the values in the confusion matrix, the overall accuracy is quite high. However, accuracy is not always the right metric for deciding whether a model performs well enough. This is why we need the `recall` metric for this competition:

**Recall is the ability of the classifier to find all the positive samples.**

In [75]:
from sklearn.metrics import recall_score

recall_score(y_true=y_test, y_pred=model.predict(X_test))

With a recall score of 1, no poisonous mushroom is classified as non-poisonous (there are no false negatives), which means that no one would be poisoned by following the model's prediction.
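
One common way to push recall towards 1 is to lower the probability threshold at which a mushroom is labeled poisonous, instead of relying on the default of 0.5. The sketch below uses synthetic data and a plain Random Forest; the threshold rule (take the lowest probability assigned to any truly poisonous validation sample) is one simple choice among many and is our assumption, not the method used in this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 1 plays the role of "poisonous".
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=123, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
proba_poisonous = clf.predict_proba(X_val)[:, 1]

# Default rule: predict "poisonous" when the probability is at least 0.5.
recall_default = recall_score(y_val, (proba_poisonous >= 0.5).astype(int))

# Cautious rule: use the lowest probability assigned to any truly poisonous
# sample as the threshold, so that none of them falls below it.
threshold = proba_poisonous[y_val == 1].min()
y_pred_cautious = (proba_poisonous >= threshold).astype(int)
recall_cautious = recall_score(y_val, y_pred_cautious)

print("recall at 0.5:", recall_default)
print("threshold:", threshold, "recall:", recall_cautious)
```

The price is more false positives (edible mushrooms flagged as poisonous), and a threshold chosen on one data set gives no guarantee of perfect recall on unseen data.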

## 4.&nbsp; Create pipeline using CatBoost classifier

In [76]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [78]:
from catboost import CatBoostClassifier

# One-hot encode all categorical columns
categorical_features = X.columns.tolist()
preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='infrequent_if_exist'), categorical_features)
])

# Create the pipeline: preprocessing followed by the classifier
model_cb = make_pipeline(preprocessor, CatBoostClassifier(random_state=42))

# Train on the training set and evaluate on the held-out test set
model_cb.fit(X_train, y_train)
y_pred = model_cb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Learning rate set to 0.020827
0:	learn: 0.6584755	total: 3.46ms	remaining: 3.46s
1:	learn: 0.6266146	total: 5.92ms	remaining: 2.96s
2:	learn: 0.5914228	total: 12.7ms	remaining: 4.24s
3:	learn: 0.5621078	total: 27.3ms	remaining: 6.79s
4:	learn: 0.5364339	total: 33.4ms	remaining: 6.65s
5:	learn: 0.5124569	total: 45ms	remaining: 7.45s
6:	learn: 0.4904097	total: 48ms	remaining: 6.81s
7:	learn: 0.4647786	total: 56.1ms	remaining: 6.96s
8:	learn: 0.4416616	total: 58.7ms	remaining: 6.47s
9:	learn: 0.4243799	total: 65.7ms	remaining: 6.51s
10:	learn: 0.4074467	total: 78.8ms	remaining: 7.08s
11:	learn: 0.3899025	total: 86.4ms	remaining: 7.12s
12:	learn: 0.3749001	total: 88.4ms	remaining: 6.71s
13:	learn: 0.3648585	total: 97.6ms	remaining: 6.88s
14:	learn: 0.3523701	total: 103ms	remaining: 6.76s
15:	learn: 0.3407087	total: 105ms	remaining: 6.46s
16:	learn: 0.3283910	total: 111ms	remaining: 6.43s
17:	learn: 0.3178802	total: 131ms	remaining: 7.13s
18:	learn: 0.3090631	total: 133ms	remaining: 6.89s

In [79]:
# Grid search over the CatBoost hyperparameters

param_grid = {
    'catboostclassifier__depth': [6, 8, 10],  # tree depth; deeper trees can model more feature interactions
    'catboostclassifier__learning_rate': [0.01, 0.05, 0.1],  # step size; lower values usually need more iterations
    'catboostclassifier__iterations': [200, 500, 800],  # number of boosting rounds (trees)
    'catboostclassifier__l2_leaf_reg': [1, 3, 5, 7],  # L2 regularization to reduce overfitting
    'catboostclassifier__border_count': [32, 64, 128],  # number of split candidates for numerical features
    'catboostclassifier__scale_pos_weight': [1, 5, 10]  # weight of the positive class; values > 1 penalize false negatives more
}

# Run a grid search to find the optimal combination of hyperparameters
cb_search = GridSearchCV(
    model_cb,
    param_grid,
    cv=5,
    verbose=1
)
cb_search.fit(X_train, y_train)

best_param = cb_search.best_params_

# Use the best estimator found by GridSearchCV
model_cb = cb_search.best_estimator_

best_param

Streaming output truncated to the last 5000 lines.
600:	learn: 0.0574890	total: 1.41s	remaining: 467ms
601:	learn: 0.0574890	total: 1.41s	remaining: 465ms
602:	learn: 0.0574890	total: 1.42s	remaining: 462ms
603:	learn: 0.0574890	total: 1.42s	remaining: 460ms
604:	learn: 0.0574890	total: 1.42s	remaining: 457ms
605:	learn: 0.0574890	total: 1.42s	remaining: 455ms
606:	learn: 0.0574890	total: 1.42s	remaining: 453ms
607:	learn: 0.0574890	total: 1.43s	remaining: 450ms
608:	learn: 0.0574890	total: 1.43s	remaining: 448ms
609:	learn: 0.0574890	total: 1.43s	remaining: 446ms
610:	learn: 0.0574890	total: 1.43s	remaining: 444ms
611:	learn: 0.0574890	total: 1.44s	remaining: 442ms
612:	learn: 0.0574890	total: 1.44s	remaining: 440ms
613:	learn: 0.0574890	total: 1.44s	remaining: 438ms
614:	learn: 0.0574890	total: 1.45s	remaining: 435ms
615:	learn: 0.0574890	total: 1.45s	remaining: 433ms
616:	learn: 0.0574890	total: 1.45s	remaining: 430ms
617:	learn: 0.0574890	total: 1.45s	remaining: 428ms

{'catboostclassifier__border_count': 32,
 'catboostclassifier__depth': 10,
 'catboostclassifier__iterations': 800,
 'catboostclassifier__l2_leaf_reg': 1,
 'catboostclassifier__learning_rate': 0.01,
 'catboostclassifier__scale_pos_weight': 1}

In [90]:
# Recreate pipeline with new parameters
model_cb = make_pipeline(
    preprocessor,
    CatBoostClassifier(
        depth=10,
        iterations=800,
        learning_rate=0.01,
        random_state=42,
        l2_leaf_reg=1,
        border_count=32,  # value selected by the grid search above
        scale_pos_weight=1,
        verbose=0
    )
)

model_cb.fit(X_train, y_train)

In [None]:
model_cb.predict(X_test)

In [81]:
accuracy_score(y_true=y_test, y_pred=model_cb.predict(X_test))

0.9661538461538461

In [None]:
ConfusionMatrixDisplay.from_estimator(model_cb, X_train, y_train, display_labels=["Not poisonous", "Poisonous"]);

## 5.&nbsp; Create pipeline using XGBoost classifier

In [5]:
from xgboost import XGBClassifier

# --------------------
# 1. One-hot encode categorical features
# --------------------
categorical_features = X.columns.tolist()
preprocessor = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_features)
])

# --------------------
# 2. Create pipeline with XGBClassifier
# --------------------
model_xgb = make_pipeline(
    preprocessor,
    # `use_label_encoder` is omitted because XGBoost reports it as unused;
    # 'logloss' is the log-loss metric for a binary target
    XGBClassifier(eval_metric='logloss', random_state=42)
)

# --------------------
# 3. Train and evaluate
# --------------------
model_xgb.fit(X_train, y_train)
y_pred = model_xgb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9615384615384616

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96       671
           1       0.94      0.98      0.96       629

    accuracy                           0.96      1300
   macro avg       0.96      0.96      0.96      1300
weighted avg       0.96      0.96      0.96      1300



In [6]:
# Define the hyperparameter grid
param_grid = {
    'xgbclassifier__n_estimators': [100, 300, 500],
    'xgbclassifier__max_depth': [4, 6, 8, 10],
    'xgbclassifier__learning_rate': [0.01, 0.05, 0.1],
    'xgbclassifier__subsample': [0.6, 0.8, 1.0],
    'xgbclassifier__colsample_bytree': [0.6, 0.8, 1.0],
    'xgbclassifier__scale_pos_weight': [1, 5, 10],  # To handle class imbalance and reduce false negatives
    'xgbclassifier__min_child_weight': [1, 3, 5],
    'xgbclassifier__gamma': [0, 0.1, 0.3],  # Regularization to reduce overfitting
}
# Run a grid search to find the optimal combination of hyperparameters
xgb_search = GridSearchCV(
    model_xgb,
    param_grid,
    cv=5,
    verbose=1
)

xgb_search.fit(X_train, y_train)

best_param = xgb_search.best_params_

best_param

Fitting 5 folds for each of 8748 candidates, totalling 43740 fits

Streaming output truncated to the last 5000 lines.

{'xgbclassifier__colsample_bytree': 1.0,
 'xgbclassifier__gamma': 0.3,
 'xgbclassifier__learning_rate': 0.05,
 'xgbclassifier__max_depth': 10,
 'xgbclassifier__min_child_weight': 1,
 'xgbclassifier__n_estimators': 300,
 'xgbclassifier__scale_pos_weight': 1,
 'xgbclassifier__subsample': 1.0}
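
The size of a grid grows multiplicatively with every hyperparameter added, so it can be worth counting the candidates before launching a search. A short sketch using scikit-learn's `ParameterGrid` on the same grid as above (the name `param_grid_demo` is ours):

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "xgbclassifier__n_estimators": [100, 300, 500],
    "xgbclassifier__max_depth": [4, 6, 8, 10],
    "xgbclassifier__learning_rate": [0.01, 0.05, 0.1],
    "xgbclassifier__subsample": [0.6, 0.8, 1.0],
    "xgbclassifier__colsample_bytree": [0.6, 0.8, 1.0],
    "xgbclassifier__scale_pos_weight": [1, 5, 10],
    "xgbclassifier__min_child_weight": [1, 3, 5],
    "xgbclassifier__gamma": [0, 0.1, 0.3],
}

n_candidates = len(ParameterGrid(param_grid_demo))
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)  # 8748 43740
```

Dropping a single three-valued hyperparameter from this grid would cut the number of fits to a third.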

In [68]:
# Update parameters of the pipeline using set_params
model_xgb.set_params(
    xgbclassifier__learning_rate=0.05,
    xgbclassifier__max_depth=10,
    xgbclassifier__n_estimators=300,
    xgbclassifier__subsample=1.0,
    xgbclassifier__colsample_bytree=1.0,
    xgbclassifier__gamma=0.3,
    xgbclassifier__scale_pos_weight=5,  # grid search selected 1, but 5 performed better
    xgbclassifier__min_child_weight=1
)

model_xgb.fit(X_train, y_train)


In [59]:
accuracy_score(y_true=y_test, y_pred=model_xgb.predict(X_test))

0.9461538461538461

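**XGBoost performed best.** Note, however, that the accuracy scores reported above do not show this: the tuned XGBoost pipeline reaches 0.946 on the test set, compared with 0.966 for the Random Forest baseline and the CatBoost pipeline, and the submission below is generated with the CatBoost pipeline (`model_cb`).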

## Part II: Competition submission with unseen data

In [82]:
# import data
url = "https://drive.google.com/file/d/1eWxV9FGj6D-YnMsv4mHMWRcGIKbjrXYL/view?usp=drive_link"
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
new_data = pd.read_csv(path)

In [83]:
# make sure it's in the same format
id_col = new_data.pop("Id")

In [84]:
# columns are in a different order, so let's change that
order_of_columns = X.columns.to_list()
new_data = new_data[order_of_columns]
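
Before reordering, it can help to check that the new data actually contains every training column, so that a missing column fails with a clear message. A minimal sketch on a made-up frame (`train_columns` and `new_demo` are illustrative stand-ins for `X.columns` and `new_data`):

```python
import pandas as pd

train_columns = ["cap.shape", "cap.color", "bruises"]
new_demo = pd.DataFrame({
    "bruises": [True, False],
    "cap.color": ["e", "w"],
    "cap.shape": ["k", "b"],
})

# Fail early if a training column is missing from the new data.
missing = set(train_columns) - set(new_demo.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

# Reorder the columns to match the training data.
new_demo = new_demo[train_columns]
print(new_demo.columns.tolist())
```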

In [85]:
# predict values
poisonous_pred = model_cb.predict(new_data)

In [86]:
# build the submission file
submission_file = pd.DataFrame({
    'Id':id_col,
    'poisonous':poisonous_pred
})

In [87]:
submission_file.head()

|   | Id | poisonous |
|---|---|---|
| 0 | 5165 | 1 |
| 1 | 4281 | 1 |
| 2 | 231 | 0 |
| 3 | 3890 | 0 |
| 4 | 1521 | 1 |
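
A few sanity checks on the submission frame can catch formatting slips before uploading. The checks below are a sketch; the expected format (an `Id` column and a 0/1 `poisonous` column) is inferred from the frame built above, not from official competition rules, and `check_submission` is a hypothetical helper:

```python
import pandas as pd

submission_demo = pd.DataFrame({"Id": [5165, 4281, 231], "poisonous": [1, 1, 0]})


def check_submission(df: pd.DataFrame) -> None:
    """Raise an AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["Id", "poisonous"], "unexpected columns"
    assert df["Id"].is_unique, "duplicate Ids"
    assert df["poisonous"].isin([0, 1]).all(), "labels must be 0 or 1"
    assert not df.isna().any().any(), "missing values"


check_submission(submission_demo)
print("submission looks consistent")
```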


Download the submission file and upload it to the competition. Good luck!

In [88]:
# If working locally:
#submission_file.to_csv('submission_1.csv',index=False)

In [89]:
# If working on Colab:

from google.colab import files
submission_file.to_csv('submission_1.csv',index=False)
files.download('submission_1.csv')

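
The local and the Colab case can also be handled by a single cell that writes the CSV and triggers the browser download only when `google.colab` can be imported. A sketch, with `save_submission` as a hypothetical helper name:

```python
import pandas as pd


def save_submission(df: pd.DataFrame, filename: str = "submission_1.csv") -> bool:
    """Write the CSV; trigger a browser download only when running on Colab.

    Returns True if the Colab download was triggered, False otherwise.
    """
    df.to_csv(filename, index=False)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(filename)
    return True
```

Calling `save_submission(submission_file)` then works unchanged in both environments.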