## **Assignment 3 (2024/2): ML1**
**Safe to eat or deadly poison?**



This homework is a classification task to identify whether a mushroom is edible or poisonous.

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poison Oak and Ivy.


Step 1. Load 'mushroom2020_dataset.csv' from the “Attachment” (note: this dataset has already been partially prepared).

Step 2. Drop rows where the target (label) variable is missing.

Step 3. Drop the following variables:
'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'

Step 4. Examine the number of rows, the number of variables, and whether any values are missing.

Step 5. Fill missing values with the mean for numeric variables and the mode for nominal variables.

Step 6. Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0 : class1).

Step 7. Convert the nominal variables to numeric using dummy coding with drop_first = True.

Step 8. Split train/test with 20% test, stratify, and seed = 2020.

Step 9. Create a Random Forest with GridSearch on the training data with 5-fold CV, using:
- 'criterion': ['gini','entropy']
- 'max_depth': [2,3]
- 'min_samples_leaf': [2,5]
- 'n_estimators': [100]
- 'random_state': 2020

Step 10. Predict on the testing data set and evaluate with classification_report.
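
Steps 5 to 8 can be tried on a small made-up table before touching the real file. The sketch below is only an illustration: the values are invented, only two feature columns are used, and it swaps in pandas `fillna` for the `SimpleImputer` used in the solution further down.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the mushroom data: the column names follow the assignment,
# but every value here is made up for illustration.
toy = pd.DataFrame({
    "label": ["e", "p", "e", "p", "e", "p", "e", "p", "e", "p"],
    "cap-color-rate": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 3.0],
    "odor": ["a", "f", "a", np.nan, "a", "f", "n", "a", "a", "f"],
})

# Step 5: mean for the numeric column, mode for the nominal column.
toy["cap-color-rate"] = toy["cap-color-rate"].fillna(toy["cap-color-rate"].mean())
toy["odor"] = toy["odor"].fillna(toy["odor"].mode()[0])

# Step 6: e (edible) -> 1, p (poisonous) -> 0, then count each class.
toy["label"] = toy["label"].map({"p": 0, "e": 1})
class_counts = toy["label"].value_counts().sort_index()

# Step 7: dummy-code the nominal columns, dropping the first level of each.
dummy_df = pd.get_dummies(toy, drop_first=True)

# Step 8: stratified 80/20 split with the assignment's seed.
X = dummy_df.drop(columns=["label"])
y = dummy_df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=2020
)

print(class_counts.to_dict())
print(X_train.shape, X_test.shape)
```

Because the split is stratified, the class ratio of the full table is preserved in both the training and the testing part.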


**Complete class MushroomClassifier from given code template below.**

In [5]:
#import your other libraries here
import pandas as pd
# hint
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
import numpy as np

In [6]:
class MushroomClassifier:
    def __init__(self, data_path): # DO NOT modify this line
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def Q1(self): # DO NOT modify this line
        """
            1. (From step 1) Before doing the data prep., how many "na" values are there in the "gill-size" variable?
        """
        # remove pass and replace with your code
        return self.df['gill-size'].isnull().sum()


    def Q2(self): # DO NOT modify this line
        """
            2. (From step 2-4) How many rows of data, how many variables?
            - Drop rows where the target (label) variable is missing.
            - Drop the following variables:
            'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'
            - Examine the number of rows, the number of variables, and whether any values are missing.
        """
        # remove pass and replace with your code
        self.df.dropna(subset=['label'], axis=0, inplace=True)
        drop_list = ['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
        'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type']
        self.df.drop(columns=drop_list, inplace=True)
        return self.df.shape


    def Q3(self): # DO NOT modify this line
        """
            3. (From step 5-6) Answer the quantity class0:class1
            - Fill missing values with the mean for numeric variables and the mode for nominal variables.
            - Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0 : class1).
            - Note: You need to reproduce the process (code) from Q2 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.df.dropna(subset=['label'], axis=0, inplace=True)
        drop_list = ['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
        'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type']
        self.df.drop(columns=drop_list, inplace=True)
        
        num_vars = ['cap-color-rate']
        nom_vars = ['label', 'cap-shape', 'cap-surface', 
                    'bruises', 'odor', 'stalk-shape', 'ring-number',
                   'ring-type', 'spore-print-color', 'population', 'habitat']

        # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
        num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        nom_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

        self.df[num_vars] = num_imp.fit_transform(self.df[num_vars])
        self.df[nom_vars] = nom_imp.fit_transform(self.df[nom_vars])

        self.df['label'] = self.df['label'].map({'p': 0, 'e': 1})
        
        return (self.df['label'].value_counts()[0], self.df['label'].value_counts()[1])


    def Q4(self): # DO NOT modify this line
        """
            4. (From step 7-8) What are the sizes of the training and testing sets?
            - Convert the nominal variable to numeric using a dummy code with drop_first = True.
            - Split train/test with 20% test, stratify, and seed = 2020.
            - Note: You need to reproduce the process (code) from Q2, Q3 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.df.dropna(subset=['label'], axis=0, inplace=True)
        drop_list = ['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
        'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type']
        self.df.drop(columns=drop_list, inplace=True)
        
        num_vars = ['cap-color-rate']
        nom_vars = ['label', 'cap-shape', 'cap-surface', 
                    'bruises', 'odor', 'stalk-shape', 'ring-number',
                   'ring-type', 'spore-print-color', 'population', 'habitat']

        # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
        num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        nom_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

        self.df[num_vars] = num_imp.fit_transform(self.df[num_vars])
        self.df[nom_vars] = nom_imp.fit_transform(self.df[nom_vars])

        self.df['label'] = self.df['label'].map({'p': 0, 'e': 1})
        
        dummy_df = pd.get_dummies(self.df, drop_first=True)
        
        X = dummy_df.drop(columns=['label'])
        y = dummy_df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                            test_size=0.20, 
                                                            random_state=2020)
        
        return (X_train.shape, X_test.shape)


    def Q5(self):
        """
            5. (From step 9) Best params after doing random forest grid search.
            Create a Random Forest with GridSearch on training data with 5 CV.
            - 'criterion':['gini','entropy']
            - 'max_depth': [2,3]
            - 'min_samples_leaf':[2,5]
            - 'n_estimators':[100]
            - 'random_state': 2020
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.df.dropna(subset=['label'], axis=0, inplace=True)
        drop_list = ['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
        'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type']
        self.df.drop(columns=drop_list, inplace=True)
        
        num_vars = ['cap-color-rate']
        nom_vars = ['label', 'cap-shape', 'cap-surface', 
                    'bruises', 'odor', 'stalk-shape', 'ring-number',
                   'ring-type', 'spore-print-color', 'population', 'habitat']

        # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
        num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        nom_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

        self.df[num_vars] = num_imp.fit_transform(self.df[num_vars])
        self.df[nom_vars] = nom_imp.fit_transform(self.df[nom_vars])

        self.df['label'] = self.df['label'].map({'p': 0, 'e': 1})
        
        dummy_df = pd.get_dummies(self.df, drop_first=True)
        
        X = dummy_df.drop(columns=['label'])
        y = dummy_df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                            test_size=0.20, 
                                                            random_state=2020)
        
        rf = RandomForestClassifier()
        parameters = {
            'criterion':['gini','entropy'],
            'max_depth': [2,3],
            'min_samples_leaf':[2,5],
            'n_estimators':[100],
            'random_state': [2020]
        }
            

        grid_search = GridSearchCV(
            estimator=rf,
            param_grid=parameters,
            cv=5,
            n_jobs=-1 # Parallel
        )

        grid_search.fit(X_train, y_train)
        best_params = grid_search.best_params_
        return (
            best_params['criterion'],
            best_params['max_depth'],
            best_params['min_samples_leaf'],
            best_params['n_estimators'],
            best_params['random_state']
    )


    def Q6(self):
        """
            6. (From step 10) What is the value of macro f1 (2 digits)?
            Predict the testing data set with confusion_matrix and classification_report,
            using scientific rounding (less than 0.5 dropped, more than 0.5 then increased)
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4, Q5 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.df.dropna(subset=['label'], axis=0, inplace=True)
        drop_list = ['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
        'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type']
        self.df.drop(columns=drop_list, inplace=True)
        
        num_vars = ['cap-color-rate']
        nom_vars = ['label', 'cap-shape', 'cap-surface', 
                    'bruises', 'odor', 'stalk-shape', 'ring-number',
                   'ring-type', 'spore-print-color', 'population', 'habitat']

        # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
        num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        nom_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

        self.df[num_vars] = num_imp.fit_transform(self.df[num_vars])
        self.df[nom_vars] = nom_imp.fit_transform(self.df[nom_vars])

        self.df['label'] = self.df['label'].map({'p': 0, 'e': 1})
        
        dummy_df = pd.get_dummies(self.df, drop_first=True)
        
        X = dummy_df.drop(columns=['label'])
        y = dummy_df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                            test_size=0.20, 
                                                            random_state=2020)
        
        rf = RandomForestClassifier()
        parameters = {
            'criterion':['gini','entropy'],
            'max_depth': [2,3],
            'min_samples_leaf':[2,5],
            'n_estimators':[100],
            'random_state': [2020]
        }
            

        grid_search = GridSearchCV(
            estimator=rf,
            param_grid=parameters,
            cv=5,
            n_jobs=-1 # Parallel
        )

        grid_search.fit(X_train, y_train)
        
        best_rf = RandomForestClassifier(**grid_search.best_params_)
        best_rf.fit(X_train, y_train)
        y_pred = best_rf.predict(X_test)
        report = classification_report(y_test, y_pred, output_dict=True)
        
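        # Per-class F1 scores; the macro F1 named in the docstring is their
        # unweighted mean, available as report['macro avg']['f1-score'].
        f1_class0 = round(report['0']['f1-score'], 2)
        f1_class1 = round(report['1']['f1-score'], 2)

        return f1_class0, f1_class1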


Run the code below to test that your code can work.

In [8]:
hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q1())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q2())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q3())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q4())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q5())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q6())

121
(5764, 12)
(3660, 2104)
((4611, 42), (1153, 42))
('gini', 3, 5, 100, 2020)
(0.98, 0.97)
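
The Q6 docstring asks for the macro F1, while the last output line lists one F1 per class. As a minimal sketch on made-up labels (not the mushroom predictions), the macro value is the unweighted mean of the per-class scores, and `classification_report` already stores it under the `'macro avg'` key:

```python
from sklearn.metrics import classification_report, f1_score

# Made-up labels and predictions, only to show where the numbers live.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]

report = classification_report(y_true, y_pred, output_dict=True)

# Per-class F1: the report keys are the class labels as strings.
f1_class0 = report["0"]["f1-score"]
f1_class1 = report["1"]["f1-score"]

# Macro F1: the unweighted mean of the per-class F1 scores.
macro_f1 = report["macro avg"]["f1-score"]
macro_f1_direct = f1_score(y_true, y_pred, average="macro")

print(round(f1_class0, 2), round(f1_class1, 2), round(macro_f1, 2))
```

Whether the grader expects the pair of per-class values or the single macro value is not stated here, so treat the choice of return value as an assumption to check against the assignment.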


# Using self.Qx()

In [10]:
class MushroomClassifier:
    def __init__(self, data_path): # DO NOT modify this line
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def Q1(self): # DO NOT modify this line
        """
            1. (From step 1) Before doing the data prep., how many "na" values are there in the "gill-size" variable?
        """
        # remove pass and replace with your code
        return self.df['gill-size'].isnull().sum()


    def Q2(self): # DO NOT modify this line
        """
            2. (From step 2-4) How many rows of data, how many variables?
            - Drop rows where the target (label) variable is missing.
            - Drop the following variables:
            'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'
            - Examine the number of rows, the number of variables, and whether any values are missing.
        """
        # remove pass and replace with your code
        self.df.dropna(subset=['label'], axis=0, inplace=True)
        drop_list = ['id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
        'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type']
        self.df.drop(columns=drop_list, inplace=True)
        return self.df.shape


    def Q3(self): # DO NOT modify this line
        """
            3. (From step 5-6) Answer the quantity class0:class1
            - Fill missing values with the mean for numeric variables and the mode for nominal variables.
            - Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0 : class1).
            - Note: You need to reproduce the process (code) from Q2 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.Q2()
        
        num_vars = ['cap-color-rate']
        nom_vars = ['label', 'cap-shape', 'cap-surface', 
                    'bruises', 'odor', 'stalk-shape', 'ring-number',
                   'ring-type', 'spore-print-color', 'population', 'habitat']

        # np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
        num_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
        nom_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

        self.df[num_vars] = num_imp.fit_transform(self.df[num_vars])
        self.df[nom_vars] = nom_imp.fit_transform(self.df[nom_vars])

        self.df['label'] = self.df['label'].map({'p': 0, 'e': 1})
        
        return (self.df['label'].value_counts()[0], self.df['label'].value_counts()[1])


    def Q4(self): # DO NOT modify this line
        """
            4. (From step 7-8) What are the sizes of the training and testing sets?
            - Convert the nominal variable to numeric using a dummy code with drop_first = True.
            - Split train/test with 20% test, stratify, and seed = 2020.
            - Note: You need to reproduce the process (code) from Q2, Q3 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.Q3()
        
        dummy_df = pd.get_dummies(self.df, drop_first=True)
        
        X = dummy_df.drop(columns=['label'])
        y = dummy_df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                            test_size=0.20, 
                                                            random_state=2020)
        
        return (X_train.shape, X_test.shape)


    def Q5(self):
        """
            5. (From step 9) Best params after doing random forest grid search.
            Create a Random Forest with GridSearch on training data with 5 CV.
            - 'criterion':['gini','entropy']
            - 'max_depth': [2,3]
            - 'min_samples_leaf':[2,5]
            - 'n_estimators':[100]
            - 'random_state': 2020
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4 to obtain the correct result.
        """
        # remove pass and replace with your code
        self.Q4()
        
        dummy_df = pd.get_dummies(self.df, drop_first=True)
        
        X = dummy_df.drop(columns=['label'])
        y = dummy_df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                            test_size=0.20, 
                                                            random_state=2020)
        
        rf = RandomForestClassifier()
        parameters = {
            'criterion':['gini','entropy'],
            'max_depth': [2,3],
            'min_samples_leaf':[2,5],
            'n_estimators':[100],
            'random_state': [2020]
        }
            

        grid_search = GridSearchCV(
            estimator=rf,
            param_grid=parameters,
            cv=5,
            n_jobs=-1 # Parallel
        )

        grid_search.fit(X_train, y_train)
        best_params = grid_search.best_params_
        return (
            best_params['criterion'],
            best_params['max_depth'],
            best_params['min_samples_leaf'],
            best_params['n_estimators'],
            best_params['random_state']
    )


    def Q6(self):
        """
            6. (From step 10) What is the value of macro f1 (2 digits)?
            Predict the testing data set with confusion_matrix and classification_report,
            using scientific rounding (less than 0.5 dropped, more than 0.5 then increased)
            - Note: You need to reproduce the process (code) from Q2, Q3, Q4, Q5 to obtain the correct result.
        """
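        # remove pass and replace with your code
        # Q4 covers all the preprocessing; calling self.Q5() here would run
        # the same grid search twice, since it is repeated below.
        self.Q4()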
        
        dummy_df = pd.get_dummies(self.df, drop_first=True)
        
        X = dummy_df.drop(columns=['label'])
        y = dummy_df['label']
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, 
                                                            test_size=0.20, 
                                                            random_state=2020)
        
        rf = RandomForestClassifier()
        parameters = {
            'criterion':['gini','entropy'],
            'max_depth': [2,3],
            'min_samples_leaf':[2,5],
            'n_estimators':[100],
            'random_state': [2020]
        }
            

        grid_search = GridSearchCV(
            estimator=rf,
            param_grid=parameters,
            cv=5,
            n_jobs=-1 # Parallel
        )

        grid_search.fit(X_train, y_train)
        
        best_rf = RandomForestClassifier(**grid_search.best_params_)
        best_rf.fit(X_train, y_train)
        y_pred = best_rf.predict(X_test)
        report = classification_report(y_test, y_pred, output_dict=True)
        
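        # Per-class F1 scores; the macro F1 named in the docstring is their
        # unweighted mean, available as report['macro avg']['f1-score'].
        f1_class0 = round(report['0']['f1-score'], 2)
        f1_class1 = round(report['1']['f1-score'], 2)

        return f1_class0, f1_class1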
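
Chaining the questions this way works only because every question is run on a fresh object: Q2 drops columns from `self.df` in place, so running it a second time on the same object cannot find them. A minimal sketch of that behaviour, using a hypothetical `TinyPrep` class and a two-column toy frame rather than the real data:

```python
import pandas as pd


class TinyPrep:
    """Hypothetical stand-in for MushroomClassifier, reduced to one step."""

    def __init__(self, df):
        self.df = df

    def drop_id(self):
        # Same pattern as Q2: the column is removed from self.df in place.
        self.df.drop(columns=["id"], inplace=True)
        return self.df.shape

    def drop_id_tolerant(self):
        # errors="ignore" skips labels that are already gone.
        self.df.drop(columns=["id"], inplace=True, errors="ignore")
        return self.df.shape


prep = TinyPrep(pd.DataFrame({"id": [1, 2], "odor": ["a", "f"]}))
first_shape = prep.drop_id()

try:
    prep.drop_id()  # "id" no longer exists, so pandas raises KeyError
    second_call_failed = False
except KeyError:
    second_call_failed = True

tolerant_shape = prep.drop_id_tolerant()
print(first_shape, second_call_failed, tolerant_shape)
```

Creating a new `MushroomClassifier` before each question, as the test cell does, sidesteps the problem without changing the template.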

In [11]:
hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q1())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q2())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q3())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q4())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q5())

hw = MushroomClassifier('mushroom2020_dataset.csv')
print(hw.Q6())

121
(5764, 12)
(3660, 2104)
((4611, 42), (1153, 42))
('gini', 3, 5, 100, 2020)
(0.98, 0.97)
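
Q6 refits a new forest from `best_params_`. With the default `refit=True`, `GridSearchCV` already retrains the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data from `make_classification`; the data, the smaller `n_estimators`, and setting `random_state` on the estimator instead of in the grid are assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the dummy-coded mushroom features.
X, y = make_classification(n_samples=200, n_features=8, random_state=2020)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=2020
)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3],
    "min_samples_leaf": [2, 5],
    "n_estimators": [20],  # smaller than the assignment's 100, for speed
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=2020),
    param_grid=param_grid,
    cv=5,
)
grid_search.fit(X_train, y_train)

# Already fitted on all of X_train with the best parameters.
best_rf = grid_search.best_estimator_

# The manual refit used in Q6, for comparison.
manual_rf = RandomForestClassifier(random_state=2020, **grid_search.best_params_)
manual_rf.fit(X_train, y_train)

same_predictions = bool((best_rf.predict(X_test) == manual_rf.predict(X_test)).all())
print(grid_search.best_params_, same_predictions)
```

Since the seed is fixed, both models are built the same way, so using `best_estimator_` directly saves one fit without changing the predictions.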
