In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

cols = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor", 
    "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", 
    "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", 
    "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", 
    "veil-color", "ring-number", "ring-type", "spore-print-color", 
    "population", "habitat"
]

In [11]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
df = pd.read_csv(url, header=None, names=cols)

# Check the label encoding: expect 'e'/'p' (or EDIBLE/POISONOUS in some copies of the dataset)
print(df['class'].unique())

['p' 'e']


In [12]:
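# Encode the target: edible -> 0, poisonous -> 1 (poisonous is the positive class)
class_map = {'e': 0, 'p': 1, 'EDIBLE': 0, 'POISONOUS': 1}

df['y'] = df['class'].map(class_map)

# Any label missing from class_map would show up as NaN here
print("Missing values in y:", df['y'].isnull().sum())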

Missing values in y: 0


In [13]:
# Use only 'odor' and 'spore-print-color' as predictors
features_to_use = ['odor', 'spore-print-color']
X = df[features_to_use].copy()

# One-hot encode; drop_first drops one redundant level per feature
X = pd.get_dummies(X, columns=features_to_use, drop_first=True)

y = df['y']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(f"Training shape: {X_train.shape}")
print(f"Testing shape: {X_test.shape}")

Training shape: (6093, 16)
Testing shape: (2031, 16)


In [15]:
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 0.9931068439192516

Confusion Matrix:
 [[1040    0]
 [  14  977]]


In [16]:
X_odor = pd.get_dummies(df[['odor']], drop_first=True)
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(X_odor, y, test_size=0.25, random_state=42)

model_odor = LogisticRegression()
model_odor.fit(X_train_o, y_train_o)
print("Accuracy (Odor only):", accuracy_score(y_test_o, model_odor.predict(X_test_o)))

Accuracy (Odor only): 0.9852289512555391


In [17]:
X_other = pd.get_dummies(df[['spore-print-color']], drop_first=True)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_other, y, test_size=0.25, random_state=42)

model_other = LogisticRegression()
model_other.fit(X_train_s, y_train_s)
print("Accuracy (Spore Print only):", accuracy_score(y_test_s, model_other.predict(X_test_s)))

Accuracy (Spore Print only): 0.8685376661742984


### Conclusions
Based on the logistic regression models above, we can compare how well each of the two features predicts whether a mushroom is poisonous. All figures are test-set accuracies from a single 75/25 split.

Odor is the more accurate of the two single-feature predictors. Using odor alone, the model reached an accuracy of approximately 98.5%.

Spore print color is the weaker predictor. Using spore-print-color alone, the model reached an accuracy of approximately 86.9%.

Therefore, if we had to choose only one of these two features to predict edibility, odor is considerably more reliable than spore print color.

### Recommendations for Further Analysis
Combine Features for Higher Safety: While odor is highly predictive (98.5%), it is not perfect. The first model, which combines odor and spore-print-color, reached 99.3% test accuracy, an improvement of about 0.8 percentage points over odor alone. Future analysis should identify which combination of features (e.g., odor + habitat + gill color) pushes accuracy closest to 100%.
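One way to organize that search is a small exhaustive loop over feature subsets. The sketch below is only an outline under stated assumptions: `best_feature_subset`, `score_fn`, and `toy_scores` are hypothetical names introduced here, and the toy scores simply echo the accuracies reported above so the code runs without the dataset.

```python
from itertools import combinations


def best_feature_subset(features, score_fn, max_size=3):
    """Try every subset of up to max_size features; return (best_subset, best_score)."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for subset in combinations(features, size):
            score = score_fn(subset)
            # Strict '>' keeps the smaller subset when scores tie
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical stand-in scores (assumption): they mirror the test accuracies
# reported in this notebook rather than being recomputed here.
toy_scores = {
    ("odor",): 0.985,
    ("spore-print-color",): 0.869,
    ("odor", "spore-print-color"): 0.993,
}

subset, score = best_feature_subset(
    ["odor", "spore-print-color"],
    score_fn=lambda s: toy_scores[s],
    max_size=2,
)
print(subset, score)
```

In the notebook itself, `score_fn` could one-hot encode `df[list(subset)]` with `pd.get_dummies` and return the mean of `cross_val_score(LogisticRegression(), X, y)`. With 22 candidate features there are 1,793 subsets of size three or less, so `max_size` should stay small.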

Analyze False Negatives: In the context of mushroom hunting, a "False Negative" (predicting that a poisonous mushroom is edible) is a potentially fatal error. In the combined model's confusion matrix, all 14 test errors were of this kind, with no false positives. Even a model with 99% accuracy might therefore not be safe enough to base decisions about eating on. Further analysis should focus on minimizing false negatives specifically, perhaps by lowering the classification threshold or choosing a model that prioritizes recall over precision.
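To make this concrete, the sketch below first derives recall and precision from the combined model's confusion matrix reported earlier, then shows how lowering the decision threshold can remove false negatives. The helper names `classify` and `count_false_negatives` and the probability values are hypothetical, chosen for illustration; they are not output from the fitted model.

```python
# Counts taken from the combined model's confusion matrix above
# (rows = true class, columns = predicted class; 0 = edible, 1 = poisonous).
tn, fp = 1040, 0
fn, tp = 14, 977

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # share of poisonous mushrooms that were caught
precision = tp / (tp + fp)    # share of "poisonous" predictions that were right

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")


def classify(probabilities, threshold=0.5):
    """Label a sample poisonous (1) when P(poisonous) >= threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def count_false_negatives(y_true, y_pred):
    """Count poisonous samples (1) that were predicted edible (0)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)


# Hypothetical P(poisonous) values, for illustration only (not model output).
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.90, 0.40, 0.15, 0.30, 0.05, 0.02]

for threshold in (0.5, 0.1):
    y_hat = classify(probs, threshold)
    print(threshold, count_false_negatives(y_true, y_hat))
```

With the fitted model, the probabilities would come from `model.predict_proba(X_test)[:, 1]`. A lower threshold trades false negatives for false positives (edible mushrooms flagged as poisonous), which is the safer direction of error here.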

Test Other Classifiers: We used Logistic Regression for this analysis. Future work could compare these results against other algorithms such as Decision Trees, which yield easily interpretable rules, or Random Forests, which often perform well on encoded categorical data like mushroom attributes.
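As a rough illustration of what an interpretable rule looks like, the sketch below learns a majority-vote rule per category of a single feature. This is a simplified, OneR-style stand-in written in plain Python, not scikit-learn's `DecisionTreeClassifier`; the function name `learn_single_feature_rules` and the toy rows are hypothetical and not taken from the UCI file.

```python
from collections import Counter, defaultdict


def learn_single_feature_rules(rows, feature, target):
    """Map each value of `feature` to the majority `target` label seen with it."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    return {value: labels.most_common(1)[0][0] for value, labels in counts.items()}


# Hypothetical toy rows for illustration only.
toy_rows = [
    {"odor": "n", "class": "e"},
    {"odor": "n", "class": "e"},
    {"odor": "n", "class": "p"},
    {"odor": "f", "class": "p"},
    {"odor": "f", "class": "p"},
    {"odor": "a", "class": "e"},
]

rules = learn_single_feature_rules(toy_rows, "odor", "class")
for value, label in rules.items():
    print(f"if odor == {value!r}: predict {label!r}")
```

Each learned entry reads directly as an if-then rule, which is the property that makes tree-style models attractive for a safety-critical task like this one.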