# Project: Mushroom Predictive Analysis using Scikit-learn

## Introduction
In this project, I conduct a predictive analysis to determine which of two selected features ("odor" and one other feature of my choice) better predicts whether a mushroom is poisonous or edible. I use scikit-learn to build and evaluate the model. The analysis covers preprocessing the data, converting categorical values into numerical form, training a classification model, and evaluating its accuracy.

### Step 1: Import Libraries
To start, I will import the necessary libraries: `pandas` and `numpy` for data manipulation, and `scikit-learn` for splitting the data, training the model, and computing evaluation metrics.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

### Step 2: Load Dataset
I will load the mushroom dataset, which I previously preprocessed. The file has no header row, so I read it with `header=None` and assign the column names manually. The dataset describes various characteristics of mushrooms, including whether they are edible or poisonous.

In [3]:
file_path = r"C:\Users\The King\Desktop\FALL 2024\IS 362\IS362_PROJECT_4\agaricus-lepiota.data"
df_mushrooms = pd.read_csv(file_path, header=None)

df_mushrooms.columns = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment',
    'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
    'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
    'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]

print(df_mushrooms.head())


  class cap-shape cap-surface cap-color bruises odor gill-attachment  \
0     p         x           s         n       t    p               f   
1     e         x           s         y       t    a               f   
2     e         b           s         w       t    l               f   
3     p         x           y         w       t    p               f   
4     e         x           s         g       f    n               f   

  gill-spacing gill-size gill-color  ... stalk-surface-below-ring  \
0            c         n          k  ...                        s   
1            c         b          k  ...                        s   
2            c         b          n  ...                        s   
3            c         n          n  ...                        s   
4            w         b          k  ...                        s   

  stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
0                      w                      w         p          w   
1       

### Step 3: Select Features and Target
For this analysis, I will select "odor" and "gill-color" (the second feature of my choice) as predictors, and "class" as the target variable. The "class" column contains labels 'e' for edible and 'p' for poisonous.

In [6]:
features = ['odor', 'gill-color']
target = 'class'

### Step 4: Preprocess Data
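To use these features in a machine learning model, I need to convert the categorical variables into numerical form. I will use `pandas.get_dummies()` to one-hot encode the two features; with `drop_first=True`, the first category of each feature is dropped and acts as the baseline. The target is encoded as 1 for poisonous ('p') and 0 for edible ('e').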

In [7]:
df_features = pd.get_dummies(df_mushrooms[features], drop_first=True)
y = (df_mushrooms[target] == 'p').astype(int)  # 1 = poisonous, 0 = edible

print(df_features.head())

   odor_c  odor_f  odor_l  odor_m  odor_n  odor_p  odor_s  odor_y  \
0   False   False   False   False   False    True   False   False   
1   False   False   False   False   False   False   False   False   
2   False   False    True   False   False   False   False   False   
3   False   False   False   False   False    True   False   False   
4   False   False   False   False    True   False   False   False   

   gill-color_e  gill-color_g  gill-color_h  gill-color_k  gill-color_n  \
0         False         False         False          True         False   
1         False         False         False          True         False   
2         False         False         False         False          True   
3         False         False         False         False          True   
4         False         False         False          True         False   

   gill-color_o  gill-color_p  gill-color_r  gill-color_u  gill-color_w  \
0         False         False         False         False  
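
The effect of `drop_first=True` can be seen on a small example. The sketch below uses a toy frame with made-up values (not rows from the mushroom dataset) and compares the encoded columns with and without the option. It also suggests why the row with odor `a` in the output above has `False` in every odor column: `a` sorts first, so it is the dropped baseline category.

```python
import pandas as pd

# Toy frame with hypothetical values; not taken from the mushroom dataset.
toy = pd.DataFrame({"odor": ["a", "n", "p", "n"]})

full = pd.get_dummies(toy, drop_first=False)    # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))

# The row whose odor is the dropped category is all-False in the reduced frame.
print(reduced.iloc[0].tolist())
```

Dropping one column per feature loses no information, because the baseline category is implied when all remaining columns are `False`.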

### Step 5: Split Dataset
I will split the data into training and testing sets, using 80% of the data for training and 20% for testing.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df_features, y, test_size=0.2, random_state=42)
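
As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The split above does not use it; the sketch below only illustrates the option on hypothetical toy data (`X_toy` and `y_toy` are made-up names and values).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive class.
X_toy = pd.DataFrame({"f": range(100)})
y_toy = pd.Series([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Sizes follow the 80/20 split; class share is preserved in both parts.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

For a dataset whose classes are already close to balanced, a plain random split will usually give similar proportions, so stratification matters most when one class is rare.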

### Step 6: Train Classifier
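I will use the `RandomForestClassifier` to predict whether a mushroom is poisonous or edible. A Random Forest is an ensemble of decision trees, each trained on a bootstrap sample of the data, and it typically overfits less than a single tree. Setting `random_state` makes the result reproducible.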

In [9]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
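
The classifier above is trained on both features together. One way to compare the two features directly is to train the same model on each feature alone and compare test accuracy. The following is a minimal sketch of that idea, not part of the original notebook: `single_feature_accuracy` is a hypothetical helper, and the toy frame is made-up data in which "odor" is perfectly aligned with the class while "gill-color" is unrelated to it.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def single_feature_accuracy(df, feature, target="class"):
    """Train on one one-hot encoded feature and return test accuracy."""
    X = pd.get_dummies(df[[feature]], drop_first=True)
    y = (df[target] == "p").astype(int)  # 1 = poisonous, 0 = edible
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


# Hypothetical toy data, not the mushroom dataset.
toy = pd.DataFrame({
    "class": ["p", "e"] * 50,
    "odor": ["f", "n"] * 50,                  # matches the class exactly
    "gill-color": ["k", "k", "w", "w"] * 25,  # unrelated to the class
})

odor_acc = single_feature_accuracy(toy, "odor")
gill_acc = single_feature_accuracy(toy, "gill-color")
print(f"odor only: {odor_acc:.2f}, gill-color only: {gill_acc:.2f}")
```

On the real data, the same helper could be called with `df_mushrooms` in place of `toy`, assuming the column names assigned in Step 2.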

### Step 7: Make Predictions
Using the trained classifier, I'll predict labels for the test set; these predictions are evaluated in the next step.

In [11]:
y_pred = clf.predict(X_test)

### Step 8: Evaluate Model
I will evaluate the model using the accuracy score and a classification report to determine how well it classifies the mushrooms. In the report, class 0 is edible and class 1 is poisonous.

In [12]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.99

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       843
           1       1.00      0.98      0.99       782

    accuracy                           0.99      1625
   macro avg       0.99      0.99      0.99      1625
weighted avg       0.99      0.99      0.99      1625
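
The report shows a recall of 0.98 for class 1, which means some poisonous mushrooms in the test set were predicted as edible. A confusion matrix makes those counts explicit. The sketch below uses short made-up label lists (`y_true_toy` and `y_pred_toy` are hypothetical, not the notebook's `y_test` and `y_pred`) to show how the four cells are read.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = poisonous, 0 = edible.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"poisonous predicted as edible (false negatives): {fn}")
print(f"edible predicted as poisonous (false positives): {fp}")
```

For this task, false negatives are the costly error, so they are worth checking separately from overall accuracy. The same call should work with `y_test` and `y_pred` from the steps above.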



### Step 9: Conclusion
The model reaches 0.99 accuracy on the test set using only odor and gill-color. To see which of the two features drives the predictions, I inspect the Random Forest's feature importances. Summing the importances printed below over each feature's dummy columns gives roughly 0.89 for odor and 0.11 for gill-color, so odor is by far the stronger predictor of whether a mushroom is poisonous or edible. These are impurity-based importances, so they describe how the trees used the features rather than a causal effect.

In [13]:
feature_importances = pd.Series(clf.feature_importances_, index=df_features.columns)
print("\nFeature Importances:")
print(feature_importances.sort_values(ascending=False))


Feature Importances:
odor_n          0.371035
odor_f          0.226310
odor_y          0.066802
odor_s          0.064020
odor_p          0.059114
odor_l          0.049004
odor_c          0.044707
gill-color_n    0.037176
gill-color_w    0.025198
gill-color_k    0.011653
gill-color_u    0.010670
gill-color_r    0.009905
odor_m          0.008479
gill-color_g    0.007328
gill-color_p    0.004104
gill-color_h    0.002056
gill-color_e    0.001061
gill-color_y    0.000905
gill-color_o    0.000472
dtype: float64
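
Because each feature was one-hot encoded into several columns, the per-column importances can be summed back to one value per original feature. The sketch below does this with the values printed above, copied into a Series by hand so the example is self-contained. It assumes the dummy column names follow the `feature_category` pattern, so splitting on the last underscore recovers the feature name.

```python
import pandas as pd

# Importances copied from the output above.
importances = pd.Series({
    "odor_n": 0.371035, "odor_f": 0.226310, "odor_y": 0.066802,
    "odor_s": 0.064020, "odor_p": 0.059114, "odor_l": 0.049004,
    "odor_c": 0.044707, "gill-color_n": 0.037176, "gill-color_w": 0.025198,
    "gill-color_k": 0.011653, "gill-color_u": 0.010670,
    "gill-color_r": 0.009905, "odor_m": 0.008479, "gill-color_g": 0.007328,
    "gill-color_p": 0.004104, "gill-color_h": 0.002056,
    "gill-color_e": 0.001061, "gill-color_y": 0.000905,
    "gill-color_o": 0.000472,
})

# Group dummy columns by the original feature name (text before the last "_").
by_feature = importances.groupby(lambda name: name.rsplit("_", 1)[0]).sum()
print(by_feature.sort_values(ascending=False))
```

In the notebook itself, `feature_importances` from the cell above could be used in place of the hand-copied Series.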
