# Data Preprocessing
Data preprocessing is a critical step in machine learning and often determines how well a model performs. This notebook aims to improve the preprocessing stage of our machine learning project.

## Objective
Produce a numerical representation of the categorical data so that it can be used to classify whether a mushroom is 'poisonous' or 'edible'.

## Tasks
- Improve the data preprocessing workflow.
- Data Cleaning & Transformation.
- Feature Engineering.
- Encoding of categorical data, with the reason for each encoding technique used.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv("mushroom.csv")

In [2]:
data

Unnamed: 0.1,Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,0,convex,smooth,brown,bruises,pungent,free,close,narrow,black,...,white,white,partial,white,one,pendant,black,scattered,urban,p
1,1,convex,smooth,yellow,bruises,almond,free,close,broad,black,...,white,white,partial,white,one,pendant,brown,numerous,grasses,e
2,2,bell,smooth,white,bruises,anise,free,close,broad,brown,...,white,white,partial,white,one,pendant,brown,numerous,meadows,e
3,3,convex,scaly,white,bruises,pungent,free,close,narrow,brown,...,white,white,partial,white,one,pendant,black,scattered,urban,p
4,4,convex,smooth,gray,no,none,free,crowded,broad,black,...,white,white,partial,white,one,evanescent,brown,abundant,grasses,e
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,8119,knobbed,smooth,brown,no,none,attached,close,broad,yellow,...,orange,orange,partial,orange,one,pendant,buff,clustered,leaves,e
8120,8120,convex,smooth,brown,no,none,attached,close,broad,yellow,...,orange,orange,partial,brown,one,pendant,buff,several,leaves,e
8121,8121,flat,smooth,brown,no,none,attached,close,broad,brown,...,orange,orange,partial,orange,one,pendant,buff,clustered,leaves,e
8122,8122,knobbed,scaly,brown,no,fishy,free,close,narrow,buff,...,white,white,partial,white,one,evanescent,white,several,leaves,p


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Unnamed: 0                8124 non-null   int64 
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                5644 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [4]:
for i in data.columns:
    print(i,':',data[i].unique())

Unnamed: 0 : [   0    1    2 ... 8121 8122 8123]
cap-shape : ['convex' 'bell' 'sunken' 'flat' 'knobbed' 'conical']
cap-surface : ['smooth' 'scaly' 'fibrous' 'grooves']
cap-color : ['brown' 'yellow' 'white' 'gray' 'red' 'pink' 'buff' 'purple' 'cinnamon'
 'green']
bruises : ['bruises' 'no']
odor : ['pungent' 'almond' 'anise' 'none' 'foul' 'creosote' 'fishy' 'spicy'
 'musty']
gill-attachment : ['free' 'attached']
gill-spacing : ['close' 'crowded']
gill-size : ['narrow' 'broad']
gill-color : ['black' 'brown' 'gray' 'pink' 'white' 'chocolate' 'purple' 'red' 'buff'
 'green' 'yellow' 'orange']
stalk-shape : ['enlarging' 'tapering']
stalk-root : ['equal' 'club' 'bulbous' 'rooted' nan]
stalk-surface-above-ring : ['smooth' 'fibrous' 'silky' 'scaly']
stalk-surface-below-ring : ['smooth' 'fibrous' 'scaly' 'silky']
stalk-color-above-ring : ['white' 'gray' 'pink' 'brown' 'buff' 'red' 'orange' 'cinnamon' 'yellow']
stalk-color-below-ring : ['white' 'pink' 'gray' 'buff' 'brown' 'red' 'yellow' 'orange' 

In [5]:
data.drop(['Unnamed: 0'],axis=1,inplace=True)

In [6]:
data.isnull().sum()

cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
poisonous                      0
dtype: int64

In [7]:
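# Assign the result back: inplace=True on a column selection may not update `data` in recent pandas versions
data['stalk-root'] = data['stalk-root'].fillna('unknown')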
data.isnull().sum()

cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
poisonous                   0
dtype: int64
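
The missing-value step above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's actual data: the `toy` frame and its values are hypothetical. It shows the share of missing values before imputation and treats `'unknown'` as an ordinary category afterwards, which suits `stalk-root` because 2480 of its 8124 values (roughly 30%) are missing and dropping those rows would discard a large part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the mushroom data (values are illustrative only).
toy = pd.DataFrame({
    "stalk-root": ["equal", np.nan, "club", np.nan, "bulbous"],
    "poisonous": ["p", "e", "e", "p", "e"],
})

# Share of missing values per column, before any imputation.
missing_share = toy.isnull().mean()

# Assign the filled column back instead of relying on inplace=True.
toy["stalk-root"] = toy["stalk-root"].fillna("unknown")

# 'unknown' is now a regular category that the encoders can handle.
counts = toy["stalk-root"].value_counts()
print(missing_share)
print(counts)
```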

In [8]:
data['cap-shape_surface'] = data['cap-shape'] + '_' + data['cap-surface']  # Cap shape and surface texture are key morphological features for species identification
data['cap-color_odor'] = data['cap-color'] + '_' + data['odor']  # Certain cap colors and odors are strong indicators of toxicity
data['odor_bruises'] = data['odor'] + '_' + data['bruises'].astype(str)  # Bruising combined with odor often signals poisonous mushrooms
data['gill-color_spore-print'] = data['gill-color'] + '_' + data['spore-print-color']  # Gill and spore print colors are critical for differentiating species
data['ring-number_type'] = data['ring-number'] + '_' + data['ring-type']  # Ring number and type help distinguish between mushroom species
data['population_habitat'] = data['population'] + '_' + data['habitat']  # Population density in specific habitats helps identify species
data['odor_habitat'] = data['odor'] + '_' + data['habitat']  # Odors common to certain habitats can indicate mushroom edibility
data['cap-shape_surface_color'] = data['cap-shape'] + '_' + data['cap-surface'] + '_' + data['cap-color']  # A combination of cap features improves species classification
data['gill-size_spacing'] = data['gill-size'] + '_' + data['gill-spacing']  # Gill size and spacing are important biological traits for classification

**These interactions may capture relationships between categorical attributes that aren't obvious when looked at individually**
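
Because every distinct level of an interaction feature later becomes its own one-hot column, it can help to check how many levels each interaction produces. The sketch below uses a hypothetical `toy` frame with made-up rows and the same string concatenation as the cell above.

```python
import pandas as pd

# Hypothetical stand-in rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "odor": ["pungent", "almond", "none", "none", "fishy"],
    "habitat": ["urban", "grasses", "grasses", "leaves", "leaves"],
})

# Same construction as in the notebook: concatenate two categorical columns.
toy["odor_habitat"] = toy["odor"] + "_" + toy["habitat"]

# Distinct levels per column. After one-hot encoding with drop_first=True,
# each column contributes (levels - 1) indicator columns.
cardinality = toy.nunique()
print(cardinality)
```

An interaction with many rare levels adds many sparse columns, so this count is a cheap way to decide which interactions are worth keeping.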

In [9]:
data

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,poisonous,cap-shape_surface,cap-color_odor,odor_bruises,gill-color_spore-print,ring-number_type,population_habitat,odor_habitat,cap-shape_surface_color,gill-size_spacing
0,convex,smooth,brown,bruises,pungent,free,close,narrow,black,enlarging,...,p,convex_smooth,brown_pungent,pungent_bruises,black_black,one_pendant,scattered_urban,pungent_urban,convex_smooth_brown,narrow_close
1,convex,smooth,yellow,bruises,almond,free,close,broad,black,enlarging,...,e,convex_smooth,yellow_almond,almond_bruises,black_brown,one_pendant,numerous_grasses,almond_grasses,convex_smooth_yellow,broad_close
2,bell,smooth,white,bruises,anise,free,close,broad,brown,enlarging,...,e,bell_smooth,white_anise,anise_bruises,brown_brown,one_pendant,numerous_meadows,anise_meadows,bell_smooth_white,broad_close
3,convex,scaly,white,bruises,pungent,free,close,narrow,brown,enlarging,...,p,convex_scaly,white_pungent,pungent_bruises,brown_black,one_pendant,scattered_urban,pungent_urban,convex_scaly_white,narrow_close
4,convex,smooth,gray,no,none,free,crowded,broad,black,tapering,...,e,convex_smooth,gray_none,none_no,black_brown,one_evanescent,abundant_grasses,none_grasses,convex_smooth_gray,broad_crowded
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,knobbed,smooth,brown,no,none,attached,close,broad,yellow,enlarging,...,e,knobbed_smooth,brown_none,none_no,yellow_buff,one_pendant,clustered_leaves,none_leaves,knobbed_smooth_brown,broad_close
8120,convex,smooth,brown,no,none,attached,close,broad,yellow,enlarging,...,e,convex_smooth,brown_none,none_no,yellow_buff,one_pendant,several_leaves,none_leaves,convex_smooth_brown,broad_close
8121,flat,smooth,brown,no,none,attached,close,broad,brown,enlarging,...,e,flat_smooth,brown_none,none_no,brown_buff,one_pendant,clustered_leaves,none_leaves,flat_smooth_brown,broad_close
8122,knobbed,scaly,brown,no,fishy,free,close,narrow,buff,tapering,...,p,knobbed_scaly,brown_fishy,fishy_no,buff_white,one_evanescent,several_leaves,fishy_leaves,knobbed_scaly_brown,narrow_close


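**Encode 'gill-size', 'gill-spacing', and 'ring-number' as integers using LabelEncoder. Note that LabelEncoder assigns codes in alphabetical order of the labels (for example, 'broad' becomes 0 and 'narrow' becomes 1), so an inherent order is preserved only when it happens to coincide with that alphabetical order.**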

In [10]:
ordinal_features = ['gill-size', 'gill-spacing', 'ring-number']
label_encoder = LabelEncoder()
for feature in ordinal_features:
    data[feature] = label_encoder.fit_transform(data[feature])
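
If an explicit order matters, one alternative is scikit-learn's `OrdinalEncoder` with hand-specified category lists. The sketch below is an illustration under assumptions: the `toy` frame is hypothetical, the chosen orders (`narrow` before `broad`, `none` before `one` before `two`) are illustrative choices, and the `ring-number` values `'none'` and `'two'` are assumed because they do not appear in the output shown above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in rows.
toy = pd.DataFrame({
    "gill-size": ["narrow", "broad", "broad", "narrow"],
    "ring-number": ["one", "two", "none", "one"],
})

# LabelEncoder sorts labels alphabetically: 'broad' -> 0, 'narrow' -> 1.
alphabetical = LabelEncoder().fit(toy["gill-size"]).classes_.tolist()

# OrdinalEncoder with explicit category lists fixes the order by hand.
# These orders are assumptions chosen for illustration.
order = [["narrow", "broad"], ["none", "one", "two"]]
encoder = OrdinalEncoder(categories=order)
encoded = encoder.fit_transform(toy[["gill-size", "ring-number"]])

# Wrap the result in a DataFrame with integer codes.
encoded_df = pd.DataFrame(
    encoded.astype(int), columns=["gill-size", "ring-number"], index=toy.index
)
print(alphabetical)
print(encoded_df)
```

For the two binary features the choice of code is arbitrary either way; the explicit order mainly matters for features with three or more levels.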

**Apply one-hot encoding to the nominal features to convert them into binary indicator columns. The first category of each feature is dropped to avoid multicollinearity.**


In [11]:
nominal_features = ['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 
                    'gill-attachment', 'gill-color', 'stalk-shape', 'stalk-root', 
                    'stalk-surface-above-ring', 'stalk-surface-below-ring', 
                    'stalk-color-above-ring', 'stalk-color-below-ring', 
                    'veil-type', 'veil-color', 'ring-type', 'spore-print-color', 
                    'population', 'habitat']
data = pd.get_dummies(data, columns=nominal_features, drop_first=True, dtype=int)
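
The effect of `drop_first=True` can be seen on a single column. This is a minimal sketch on a hypothetical `toy` frame: without dropping, the indicator columns of one feature always sum to 1, which makes them linearly dependent; with dropping, the first level (alphabetically) is represented by all zeros.

```python
import pandas as pd

# Hypothetical stand-in column with three levels.
toy = pd.DataFrame({"cap-surface": ["smooth", "scaly", "fibrous", "smooth"]})

# One indicator per level: the columns of each row sum to 1.
full = pd.get_dummies(toy, columns=["cap-surface"], dtype=int)

# First level ('fibrous') dropped: it is encoded as all zeros.
reduced = pd.get_dummies(toy, columns=["cap-surface"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping a level mainly matters for linear models; tree-based models are generally not affected by the redundancy.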

**Apply one-hot encoding to the interaction features to convert the combined categorical variables into binary indicators. This lets a model use the feature interactions directly, and dropping the first category again avoids multicollinearity.**

In [12]:
interaction_features = ['cap-shape_surface', 'cap-color_odor', 'odor_bruises', 
                        'gill-color_spore-print', 'ring-number_type', 
                        'population_habitat', 'odor_habitat', 
                        'cap-shape_surface_color', 'gill-size_spacing']
data = pd.get_dummies(data, columns=interaction_features, drop_first=True, dtype=int)

**Map the 'poisonous' target column from categorical labels ('e' for edible, 'p' for poisonous) to binary values (0 for edible, 1 for poisonous) for numerical representation in the model**


In [13]:
data['poisonous'] = data['poisonous'].map({'e': 0, 'p': 1})
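
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a quick check after the mapping can catch unexpected labels. The sketch below uses a hypothetical `labels` series with one deliberately invalid value.

```python
import pandas as pd

# Hypothetical labels; 'x' is a deliberately invalid value.
labels = pd.Series(["p", "e", "e", "p", "x"])

mapped = labels.map({"e": 0, "p": 1})

# Values missing from the dictionary become NaN, so counting NaN
# reveals labels that were not mapped.
unmapped = int(mapped.isnull().sum())
print(unmapped)
```

In the notebook, `data.isnull().sum()` reported 0 missing values for `poisonous` before the mapping, so a non-zero count afterwards would point to a label other than `'e'` or `'p'`.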

In [14]:
data

Unnamed: 0,gill-spacing,gill-size,ring-number,poisonous,cap-shape_conical,cap-shape_convex,cap-shape_flat,cap-shape_knobbed,cap-shape_sunken,cap-surface_grooves,...,cap-shape_surface_color_knobbed_smooth_buff,cap-shape_surface_color_knobbed_smooth_gray,cap-shape_surface_color_knobbed_smooth_pink,cap-shape_surface_color_knobbed_smooth_red,cap-shape_surface_color_knobbed_smooth_white,cap-shape_surface_color_sunken_fibrous_brown,cap-shape_surface_color_sunken_fibrous_gray,gill-size_spacing_broad_crowded,gill-size_spacing_narrow_close,gill-size_spacing_narrow_crowded
0,0,1,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,1,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
8120,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8121,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8122,0,1,1,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
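
As a final sanity check before training, one might confirm that no non-numeric columns remain and separate the features from the target. This is a sketch on a hypothetical `toy` frame; the names `X` and `y` are conventional choices, not taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the fully encoded frame.
toy = pd.DataFrame({
    "gill-size": [1, 0, 0],
    "cap-shape_convex": [1, 1, 0],
    "poisonous": [1, 0, 0],
})

# Any column listed here would still need encoding.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Separate the feature matrix from the target.
X = toy.drop(columns=["poisonous"])
y = toy["poisonous"]
print(non_numeric, X.shape, y.shape)
```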


## Conclusion

The preprocessing steps cleaned and transformed the mushroom dataset for analysis: the redundant index column was dropped, missing 'stalk-root' values were filled with 'unknown', interaction features were generated, and all categorical features were encoded numerically. The interaction features are intended to help a model capture relationships between categorical attributes. The target variable 'poisonous' is mapped to binary values, so the dataset is ready for model training and evaluation in predicting mushroom edibility.