# Mushroom Project (CHANGE)
## Introduction

In this project, my goal is to use binomial logistic regression to predict whether a mushroom is edible or poisonous. The idea is to gain more experience with exploratory data analysis, data cleaning, building binomial logistic regression models, and testing model assumptions. 

The data I am using is the [Mushroom dataset](https://archive.ics.uci.edu/dataset/73/mushroom) from the UCI Machine Learning Repository. The dataset has 8,124 observations with 22 features, was created in 1987, and was last updated on August 10, 2023. Each row represents one mushroom from one of 23 species of gilled mushrooms in the Agaricus and Lepiota family, along with characteristics such as shape, color, and habitat that can be used to predict its edibility.

## Step 1: Imports
### Import Packages

For this project, I will use the following packages:
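* pandas and NumPy for data processing
* scikit-learn for preprocessing, modeling, and evaluation
* SciPy for statistical tests
* seaborn and Matplotlib for data visualization
* ucimlrepo for fetching the dataset from the UCI repository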

In [150]:
# Standard data processing packages
import pandas as pd
import numpy as np

# Preprocessing, modeling, and evaluation packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
import scipy.stats as stats
from scipy.stats import chi2

# Data visualization packages
import seaborn as sns
import matplotlib.pyplot as plt

# Used to fetch data from the UCI repository
from ucimlrepo import fetch_ucirepo 

### Load the Dataset

Load in the data and display the variables. We can see that all the variables are categorical, with two being binary.

In [151]:
# fetch dataset 
mushroom = fetch_ucirepo(id=73) 
  
# data (as pandas dataframes) 
X = mushroom.data.features 
y = mushroom.data.targets 
  
# variable information 
print(mushroom.variables) 

                        name     role         type demographic  \
0                  poisonous   Target  Categorical        None   
1                  cap-shape  Feature  Categorical        None   
2                cap-surface  Feature  Categorical        None   
3                  cap-color  Feature       Binary        None   
4                    bruises  Feature  Categorical        None   
5                       odor  Feature  Categorical        None   
6            gill-attachment  Feature  Categorical        None   
7               gill-spacing  Feature  Categorical        None   
8                  gill-size  Feature  Categorical        None   
9                 gill-color  Feature  Categorical        None   
10               stalk-shape  Feature  Categorical        None   
11                stalk-root  Feature  Categorical        None   
12  stalk-surface-above-ring  Feature  Categorical        None   
13  stalk-surface-below-ring  Feature  Categorical        None   
14    stal

### View the first 5 rows

Display the first 5 rows of the dataset. As we can see below, `poisonous` is the target variable that we will be predicting and the other columns are potential independent variables we can use for our model.

In [152]:
# Combine the features and target into a single DataFrame
df_mushroom = pd.concat([X, y], axis=1)

# Display the first 5 rows of the DataFrame
df_mushroom.head()

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,x,s,n,t,p,f,c,n,k,e,...,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,...,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,...,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,...,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,...,w,w,p,w,o,e,n,a,g,e


## Step 2: Data exploration, data cleaning, and model preparation
### Explore the data

Before constructing any models, it is important to explore the data to better understand its shape, whether any data is missing, and what each column means. In this mushroom dataset, each column is a categorical variable whose values are single letters. These letters are difficult to understand at a glance, and they must also be transformed into numeric data before they can be used in a binomial logistic regression model. We will transform the data later, but first we will explore the data types and shape.

In [153]:
# Print the data types of each column
df_mushroom.dtypes

cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
poisonous                   object
dtype: object

### Check the shape of the dataset

There are 8,124 rows and 23 columns (the 22 features plus the target), as expected, and every column has the object data type.

In [154]:
# Print the shape of the dataset
df_mushroom.shape

(8124, 23)

### Check how many poisonous mushrooms there are

From the code below, we can see that 4,208 mushrooms are classified as edible and 3,916 as poisonous, so the two classes are roughly balanced (about 52% edible and 48% poisonous).

In [155]:
# Count the values of the 'poisonous' column
df_mushroom['poisonous'].value_counts()

poisonous
e    4208
p    3916
Name: count, dtype: int64

### Transform column values

As we saw when we viewed the first 5 rows of the dataset, each column is made up of single letters, which are hard to understand. For example, a value of "y" in the `cap-surface` column indicates scaly, which would be impossible to figure out at a glance.

To make the data easier to interpret, let's transform each column according to the mapping provided by UCI.

In [156]:
# Define the mapping for all categorical columns
mappings = {
    'cap-shape': {'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'},
    'cap-surface': {'f': 'fibrous', 'g': 'grooves', 'y': 'scaly', 's': 'smooth'},
    'cap-color': {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'r': 'green', 'p': 'pink', 'u': 'purple', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'bruises?': {'t': 'bruises', 'f': 'no'},
    'odor': {'a': 'almond', 'l': 'anise', 'c': 'creosote', 'y': 'fishy', 'f': 'foul', 'm': 'musty', 'n': 'none', 'p': 'pungent', 's': 'spicy'},
    'gill-attachment': {'a': 'attached', 'd': 'descending', 'f': 'free', 'n': 'notched'},
    'gill-spacing': {'c': 'close', 'w': 'crowded', 'd': 'distant'},
    'gill-size': {'b': 'broad', 'n': 'narrow'},
    'gill-color': {'k': 'black', 'n': 'brown', 'b': 'buff', 'h': 'chocolate', 'g': 'gray', 'r': 'green', 'o': 'orange', 'p': 'pink', 'u': 'purple', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'stalk-shape': {'e': 'enlarging', 't': 'tapering'},
    'stalk-root': {'b': 'bulbous', 'c': 'club', 'u': 'cup', 'e': 'equal', 'z': 'rhizomorphs', 'r': 'rooted', '?': 'missing'},
    'stalk-surface-above-ring': {'f': 'fibrous', 'y': 'scaly', 'k': 'silky', 's': 'smooth'},
    'stalk-surface-below-ring': {'f': 'fibrous', 'y': 'scaly', 'k': 'silky', 's': 'smooth'},
    'stalk-color-above-ring': {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'o': 'orange', 'p': 'pink', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'stalk-color-below-ring': {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'o': 'orange', 'p': 'pink', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'veil-type': {'p': 'partial', 'u': 'universal'},
    'veil-color': {'n': 'brown', 'o': 'orange', 'w': 'white', 'y': 'yellow'},
    'ring-number': {'n': 0, 'o': 1, 't': 2},
    'ring-type': {'c': 'cobwebby', 'e': 'evanescent', 'f': 'flaring', 'l': 'large', 'n': 'none', 'p': 'pendant', 's': 'sheathing', 'z': 'zone'},
    'spore-print-color': {'k': 'black', 'n': 'brown', 'b': 'buff', 'h': 'chocolate', 'r': 'green', 'o': 'orange', 'u': 'purple', 'w': 'white', 'y': 'yellow'},
    'population': {'a': 'abundant', 'c': 'clustered', 'n': 'numerous', 's': 'scattered', 'v': 'several', 'y': 'solitary'},
    'habitat': {'g': 'grasses', 'l': 'leaves', 'm': 'meadows', 'p': 'paths', 'u': 'urban', 'w': 'waste', 'd': 'woods'},
    'poisonous': {'p': 'poisonous', 'e': 'edible'}
}

# Apply the mapping to all relevant columns
df_mushroom.replace(mappings, inplace=True)

# Verify transformation by displaying first 5 rows
df_mushroom.head()


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,convex,smooth,brown,t,pungent,free,close,narrow,black,enlarging,...,white,white,partial,white,1,pendant,black,scattered,urban,poisonous
1,convex,smooth,yellow,t,almond,free,close,broad,black,enlarging,...,white,white,partial,white,1,pendant,brown,numerous,grasses,edible
2,bell,smooth,white,t,anise,free,close,broad,brown,enlarging,...,white,white,partial,white,1,pendant,brown,numerous,meadows,edible
3,convex,scaly,white,t,pungent,free,close,narrow,brown,enlarging,...,white,white,partial,white,1,pendant,black,scattered,urban,poisonous
4,convex,smooth,gray,f,none,free,crowded,broad,black,tapering,...,white,white,partial,white,1,evanescent,brown,abundant,grasses,edible


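This looks much better! Now we can actually understand what each observation means within each column. One exception is `bruises`, which still shows `t` and `f`: the mapping key `'bruises?'` does not match the column name `bruises`, so that column was left untouched. Let's move forward with the analysis and examine null values.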

### Check for Null values in the data

It looks like every column has 0 missing values except `stalk-root`, which has 2,480. That is approximately 31% of the 8,124 rows. Because I have no way of gathering the missing data, I could remove the column, since there are 21 other feature columns to use for building the model. However, if there is a strong relationship between `stalk-root` and whether the mushroom is poisonous, dropping the column would throw away valuable predictive information.

In [157]:
# Count the number of missing values in each column
df_mushroom.isnull().sum()

cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
poisonous                      0
dtype: int64
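A quick way to express missing counts as a share of rows is to take the mean of the null mask instead of its sum. The sketch below uses a small hypothetical frame (`toy` is not part of the notebook) and then repeats the arithmetic for the `stalk-root` counts reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_mushroom (values are illustrative only)
toy = pd.DataFrame({
    "stalk-root": ["bulbous", np.nan, "club", np.nan],
    "odor": ["almond", "none", "foul", "anise"],
})

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Share implied by the counts reported above: 2,480 missing out of 8,124 rows
reported_share = 2480 / 8124
print(f"stalk-root missing share: {reported_share:.1%}")
```

On the real data, `df_mushroom.isnull().mean()` would give the same per-column shares directly.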

### Chi-Square Test of Independence

To determine whether I should remove the column, I will use a Chi-Square Test of Independence to see if there is a significant association between `stalk-root` and `poisonous`. The Chi-Square Test of Independence calculations are based on the actual and expected number of observations in each combined group. We will evaluate a null and alternative hypothesis as follows:
- **Null Hypothesis (H₀):**  There is no association between the `stalk-root` and whether the mushroom is poisonous or edible. In other words, the distribution of `stalk-root` is independent of the `poisonous` variable.

- **Alternative Hypothesis (H₁):**  There is a significant association between the `stalk-root` and whether the mushroom is poisonous or edible. This means that the `stalk-root` variable may be useful in predicting whether a mushroom is poisonous.

We will use the `crosstab` function from `pandas` to create a contingency table, which shows the number of observations in each combination of groups. The test itself is then run with `chi2_contingency` from `scipy.stats`.

In [158]:
# Drop rows with missing values for the column of interest
df_clean = df_mushroom.dropna(subset=['stalk-root'])

# Create a contingency table
contingency_table = pd.crosstab(df_clean['stalk-root'], df_clean['poisonous'])
print(contingency_table)

poisonous   edible  poisonous
stalk-root                   
bulbous       1920       1856
club           512         44
equal          864        256
rooted         192          0


In [159]:
# Perform the Chi-Square test
chi_sq, p, dof, expected = stats.chi2_contingency(contingency_table)

# Display results
print(f"Chi-Square Statistic: {chi_sq}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:\n", expected)

# Interpretation
if p < 0.05:
    print("The variable is significantly related to the target variable.")
else:
    print("No strong evidence of a relationship between the variable and the target variable.")

Chi-Square Statistic: 638.2637626946303
P-value: 5.103920794627432e-138
Degrees of Freedom: 3
Expected Frequencies Table:
 [[2333.57335223 1442.42664777]
 [ 343.60878809  212.39121191]
 [ 692.16158753  427.83841247]
 [ 118.65627215   73.34372785]]
The variable is significantly related to the target variable.


In the code above, the chi-square statistic (`chi_sq`) is calculated by summing, across all cells of the contingency table, the squared difference between the observed and expected frequency divided by the expected frequency. The following formula is used.

χ² = Σ [(O_i - E_i)² / E_i]

Where:
- `O_i` = Observed frequency (actual count from data)
- `E_i` = Expected frequency (calculated under the assumption of independence)
- The sum (Σ) is taken over all categories in the contingency table.
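To make the formula concrete, here is a minimal sketch that recomputes the statistic by hand with NumPy from the observed counts in the `stalk-root` contingency table. The variable names (`observed`, `expected_manual`, `chi_sq_manual`, `dof_manual`) are my own and do not appear elsewhere in the notebook.

```python
import numpy as np

# Observed counts copied from the stalk-root contingency table
# rows: bulbous, club, equal, rooted; columns: edible, poisonous
observed = np.array([
    [1920, 1856],
    [512, 44],
    [864, 256],
    [192, 0],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

# Sum of (O - E)^2 / E over every cell of the table
chi_sq_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"Chi-square statistic: {chi_sq_manual:.2f}")
print(f"Degrees of freedom: {dof_manual}")
```

This should agree with the statistic and expected frequencies reported by `chi2_contingency`, which applies no continuity correction for tables larger than 2x2.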

In the output above, we can see the chi-square test statistic is `638.26`. We can compare this to a critical value to determine whether it is large enough to reject the null hypothesis that the two variables are unrelated. To calculate the critical chi-square value, we use the following code.

In [160]:
# Calculate the critical value for significance level 0.05
a = 0.05
critical_value = chi2.ppf(1 - a, dof)
print(f"Chi-Squared Critical Value: {critical_value}")

Chi-Squared Critical Value: 7.814727903251179


Since our chi-square statistic (638.26) is far larger than the critical value of 7.81, we reject the null hypothesis: there is a statistically significant association between `stalk-root` and `poisonous`. The extremely small p-value (about 5.1 × 10^-138) leads to the same conclusion.
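The critical-value rule and the p-value rule are two views of the same chi-square distribution, so they should always agree. As a rough sketch, the snippet below takes the statistic and degrees of freedom reported above and applies both rules using `chi2.ppf` and `chi2.sf` from `scipy.stats`. The names `chi_sq_stat`, `alpha`, and the two `reject_*` flags are illustrative.

```python
from scipy.stats import chi2

# Values reported by the test above
chi_sq_stat = 638.2637626946303
dof = 3
alpha = 0.05

# Rule 1: compare the statistic with the critical value at the chosen alpha
critical_value = chi2.ppf(1 - alpha, dof)
reject_by_critical_value = chi_sq_stat > critical_value

# Rule 2: compare the upper-tail probability (p-value) with alpha
p_value = chi2.sf(chi_sq_stat, dof)
reject_by_p_value = p_value < alpha

print(f"Critical value: {critical_value:.2f}")
print(f"P-value: {p_value:.3e}")
print(reject_by_critical_value, reject_by_p_value)
```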

### Transform categorical variables into numeric data type

With this in mind, it does not make sense to drop `stalk-root`, because it has a significant association with whether the mushroom is poisonous. We also cannot simply remove the rows where `stalk-root` is missing, because that would discard about 31% of the data. For now, let's create a new DataFrame in which the categorical variables are transformed into numeric type. To do this we must use either Label Encoding or One Hot Encoding.

In [165]:
# Get the number of unique values per column
unique_values = df_mushroom.nunique().sort_values()
unique_values

veil-type                    1
poisonous                    2
bruises                      2
gill-attachment              2
gill-spacing                 2
gill-size                    2
stalk-shape                  2
ring-number                  3
veil-color                   4
stalk-surface-below-ring     4
stalk-surface-above-ring     4
cap-surface                  4
stalk-root                   4
ring-type                    5
population                   6
cap-shape                    6
habitat                      7
stalk-color-above-ring       9
stalk-color-below-ring       9
odor                         9
spore-print-color            9
cap-color                   10
gill-color                  12
dtype: int64

Given that all of our categorical columns contain nominal features, **One Hot Encoding** is the best choice for transforming the variables into numeric type. Label Encoding does not make sense because none of our variables are ordinal, and coding them as distinct numbers in the same column might introduce a misleading ordinal relationship that would only be appropriate if the data had some inherent order. One Hot Encoding converts a categorical variable into binary format by introducing a new dummy variable column for each unique category of that variable.
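As a small illustration of what One Hot Encoding produces, the sketch below encodes a hypothetical two-column frame (`toy`, not part of the notebook). It uses `pd.get_dummies` rather than the `OneHotEncoder` imported earlier, purely for brevity, and it also shows the `drop_first=True` variant, which omits one reference category per variable.

```python
import pandas as pd

# Hypothetical toy frame with two nominal columns (values are illustrative only)
toy = pd.DataFrame({
    "cap-surface": ["smooth", "scaly", "fibrous", "smooth"],
    "gill-size": ["narrow", "broad", "broad", "narrow"],
})

# One dummy column per category: 3 for cap-surface + 2 for gill-size
full = pd.get_dummies(toy, dtype=int)

# Drop one reference category per variable: 2 for cap-surface + 1 for gill-size
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping a reference category is a common choice for regression models because the full set of dummies for a variable always sums to 1, which makes those columns perfectly collinear with the intercept.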

### Feature selection

Before implementing this in Python, we must consider that many of our variables have a high number of unique values, as seen above. For example, `gill-color` has 12 unique values, so it alone would add 12 dummy columns (or 11 if one category is dropped as the reference level). One Hot Encoding will therefore significantly increase the number of features, which could make interpretation more difficult and lead to overfitting. To help reduce complexity, let's perform the Chi-Square Test of Independence on each categorical variable to see whether we can drop any irrelevant variables from our analysis.

In [168]:
# Get the column names of categorical columns, removing the target column and single numeric column
categorical_cols = df_mushroom.drop(columns=['poisonous','ring-number']).columns

# Dictionary to store p-values
chi2_results = {}

# Perform chi-square test for each categorical column
for col in categorical_cols:
    contingency_table = pd.crosstab(df_mushroom[col], df_mushroom['poisonous'])  # Create contingency table
    chi_sq, p, dof, expected = stats.chi2_contingency(contingency_table)  # Perform chi-square test (named chi_sq to avoid shadowing the imported chi2)
    chi2_results[col] = p  # Store p-value

# Convert results to a DataFrame and sort by significance
chi2_df = pd.DataFrame.from_dict(chi2_results, orient='index', columns=['p_value']).sort_values(by='p_value')

# Display results
print(chi2_df)


                                p_value
habitat                    0.000000e+00
spore-print-color          0.000000e+00
ring-type                  0.000000e+00
bruises                    0.000000e+00
odor                       0.000000e+00
stalk-color-below-ring     0.000000e+00
gill-size                  0.000000e+00
gill-color                 0.000000e+00
stalk-color-above-ring     0.000000e+00
population                 0.000000e+00
stalk-surface-above-ring   0.000000e+00
stalk-surface-below-ring   0.000000e+00
gill-spacing              5.022978e-216
stalk-root                5.103921e-138
cap-shape                 1.196457e-103
cap-color                  6.055815e-78
cap-surface                5.518427e-68
veil-color                 3.320973e-41
gill-attachment            5.501707e-31
stalk-shape                4.604746e-20
veil-type                  1.000000e+00
