# Mushroom Project
## Introduction

In this project, my goal is to use binomial logistic regression to predict whether a mushroom is edible or poisonous. The idea is to gain more experience with exploratory data analysis, data cleaning, building binomial logistic regression models, and testing model assumptions. 

The data I am using is the [Mushroom dataset](https://archive.ics.uci.edu/dataset/73/mushroom) from the UCI Machine Learning Repository. The dataset has 8,124 observations with 22 features; it was created in 1987 and last updated on August 10, 2023. Each row represents one mushroom from one of 23 species of gilled mushrooms in the Agaricus and Lepiota Family, along with characteristics such as shape, color, and habitat that can be used to predict its edibility.

## Step 1: Imports
### Import Packages

This project requires the following packages.

In [970]:
# Standard data processing packages
import pandas as pd
import numpy as np

# Preprocessing, modeling, and evaluation packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
import scipy.stats as stats
from scipy.stats import chi2

# Data visualization packages
import seaborn as sns
import matplotlib.pyplot as plt

# Used to fetch data from the UCI repository
from ucimlrepo import fetch_ucirepo 

### Load the Dataset

Load in the data and display the variables. We can see that all the variables are categorical, with two being binary.

In [971]:
# fetch dataset 
mushroom = fetch_ucirepo(id=73) 
  
# data (as pandas dataframes) 
X = mushroom.data.features 
y = mushroom.data.targets 
  
# variable information 
mushroom.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,poisonous,Target,Categorical,,,,no
1,cap-shape,Feature,Categorical,,"bell=b,conical=c,convex=x,flat=f, knobbed=k,su...",,no
2,cap-surface,Feature,Categorical,,"fibrous=f,grooves=g,scaly=y,smooth=s",,no
3,cap-color,Feature,Binary,,"brown=n,buff=b,cinnamon=c,gray=g,green=r, pink...",,no
4,bruises,Feature,Categorical,,"bruises=t,no=f",,no
5,odor,Feature,Categorical,,"almond=a,anise=l,creosote=c,fishy=y,foul=f, mu...",,no
6,gill-attachment,Feature,Categorical,,"attached=a,descending=d,free=f,notched=n",,no
7,gill-spacing,Feature,Categorical,,"close=c,crowded=w,distant=d",,no
8,gill-size,Feature,Categorical,,"broad=b,narrow=n",,no
9,gill-color,Feature,Categorical,,"black=k,brown=n,buff=b,chocolate=h,gray=g, gre...",,no


### View the first 5 rows

Display the first 5 rows of the dataset. As we can see below, `poisonous` is the target variable that we will be predicting and the other columns are potential independent variables we can use for our model.

In [972]:
# Combine the features and target into a single DataFrame
df_mushroom = pd.concat([X, y], axis=1)

# Display the first 5 rows of the DataFrame
df_mushroom.head()

Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,x,s,n,t,p,f,c,n,k,e,...,w,w,p,w,o,p,k,s,u,p
1,x,s,y,t,a,f,c,b,k,e,...,w,w,p,w,o,p,n,n,g,e
2,b,s,w,t,l,f,c,b,n,e,...,w,w,p,w,o,p,n,n,m,e
3,x,y,w,t,p,f,c,n,n,e,...,w,w,p,w,o,p,k,s,u,p
4,x,s,g,f,n,f,w,b,k,t,...,w,w,p,w,o,e,n,a,g,e


## Step 2: Data exploration and cleaning
### Explore the data

Before constructing any models, it is important to explore the data to better understand its shape, whether any data is missing, and what each column means. In this mushroom dataset, each column is a categorical variable whose values are single letters. We will need to transform these into numeric data later so they can be used in a binomial logistic regression model. First, let's explore the data shape and data types.

In [973]:
# Print the data types of each column
df_mushroom.dtypes

cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
poisonous                   object
dtype: object

### Check the shape of the dataset

It appears that there are 8124 rows with 23 columns, as expected, with each column being an object data type.

In [974]:
# Print the shape of the dataset
df_mushroom.shape

(8124, 23)

### Check how many poisonous mushrooms there are

From the code below, we can see that 4208 mushrooms are classified as edible and 3916 are classified as poisonous. That is roughly 52% edible and 48% poisonous, so our data is well balanced.

In [975]:
# Count the values of the 'poisonous' column
df_mushroom['poisonous'].value_counts()

poisonous
e    4208
p    3916
Name: count, dtype: int64
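
To quantify the balance, the counts above can be turned into proportions. The sketch below rebuilds the two counts by hand, so the `counts` Series is only a stand-in for `df_mushroom['poisonous'].value_counts()` and the snippet runs without downloading the data.

```python
import pandas as pd

# Counts copied from the output above; stands in for
# df_mushroom['poisonous'].value_counts()
counts = pd.Series({'e': 4208, 'p': 3916}, name='count')

# Convert the raw counts to class proportions
proportions = counts / counts.sum()
print(proportions.round(3))
```

On the full DataFrame, `df_mushroom['poisonous'].value_counts(normalize=True)` should give the same proportions in a single call.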

### Transform column values

As we saw when we viewed the first 5 rows of the dataset, each column is made up of single letters, which are hard to understand. For example, a value of "y" in the `cap-surface` column indicates scaly, which would be impossible to figure out at a glance.

To make the data more easily interpretable, let's transform each column according to the mapping provided by UCI.

In [976]:
# Define the mapping for all categorical columns
mappings = {
    'cap-shape': {'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'},
    'cap-surface': {'f': 'fibrous', 'g': 'grooves', 'y': 'scaly', 's': 'smooth'},
    'cap-color': {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'r': 'green', 'p': 'pink', 'u': 'purple', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'bruises': {'t': 'bruises', 'f': 'no'},
    'odor': {'a': 'almond', 'l': 'anise', 'c': 'creosote', 'y': 'fishy', 'f': 'foul', 'm': 'musty', 'n': 'none', 'p': 'pungent', 's': 'spicy'},
    'gill-attachment': {'a': 'attached', 'd': 'descending', 'f': 'free', 'n': 'notched'},
    'gill-spacing': {'c': 'close', 'w': 'crowded', 'd': 'distant'},
    'gill-size': {'b': 'broad', 'n': 'narrow'},
    'gill-color': {'k': 'black', 'n': 'brown', 'b': 'buff', 'h': 'chocolate', 'g': 'gray', 'r': 'green', 'o': 'orange', 'p': 'pink', 'u': 'purple', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'stalk-shape': {'e': 'enlarging', 't': 'tapering'},
    'stalk-root': {'b': 'bulbous', 'c': 'club', 'u': 'cup', 'e': 'equal', 'z': 'rhizomorphs', 'r': 'rooted', '?': 'missing'},
    'stalk-surface-above-ring': {'f': 'fibrous', 'y': 'scaly', 'k': 'silky', 's': 'smooth'},
    'stalk-surface-below-ring': {'f': 'fibrous', 'y': 'scaly', 'k': 'silky', 's': 'smooth'},
    'stalk-color-above-ring': {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'o': 'orange', 'p': 'pink', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'stalk-color-below-ring': {'n': 'brown', 'b': 'buff', 'c': 'cinnamon', 'g': 'gray', 'o': 'orange', 'p': 'pink', 'e': 'red', 'w': 'white', 'y': 'yellow'},
    'veil-type': {'p': 'partial', 'u': 'universal'},
    'veil-color': {'n': 'brown', 'o': 'orange', 'w': 'white', 'y': 'yellow'},
    'ring-number': {'n': 0, 'o': 1, 't': 2},
    'ring-type': {'c': 'cobwebby', 'e': 'evanescent', 'f': 'flaring', 'l': 'large', 'n': 'none', 'p': 'pendant', 's': 'sheathing', 'z': 'zone'},
    'spore-print-color': {'k': 'black', 'n': 'brown', 'b': 'buff', 'h': 'chocolate', 'r': 'green', 'o': 'orange', 'u': 'purple', 'w': 'white', 'y': 'yellow'},
    'population': {'a': 'abundant', 'c': 'clustered', 'n': 'numerous', 's': 'scattered', 'v': 'several', 'y': 'solitary'},
    'habitat': {'g': 'grasses', 'l': 'leaves', 'm': 'meadows', 'p': 'paths', 'u': 'urban', 'w': 'waste', 'd': 'woods'},
    'poisonous': {'p': 'poisonous', 'e': 'edible'}
}

# Apply the mapping to all relevant columns
df_mushroom.replace(mappings, inplace=True)

# Verify transformation by displaying first 5 rows
df_mushroom.head()


Unnamed: 0,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,...,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat,poisonous
0,convex,smooth,brown,bruises,pungent,free,close,narrow,black,enlarging,...,white,white,partial,white,1,pendant,black,scattered,urban,poisonous
1,convex,smooth,yellow,bruises,almond,free,close,broad,black,enlarging,...,white,white,partial,white,1,pendant,brown,numerous,grasses,edible
2,bell,smooth,white,bruises,anise,free,close,broad,brown,enlarging,...,white,white,partial,white,1,pendant,brown,numerous,meadows,edible
3,convex,scaly,white,bruises,pungent,free,close,narrow,brown,enlarging,...,white,white,partial,white,1,pendant,black,scattered,urban,poisonous
4,convex,smooth,gray,no,none,free,crowded,broad,black,tapering,...,white,white,partial,white,1,evanescent,brown,abundant,grasses,edible


This looks much better! Now we can actually understand what each observation means within each column. Let's move forward with the analysis and examine null values.
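
As a side note on the mapping step above, the same decoding can be done one column at a time with `Series.map`, which makes it easy to spot codes that are missing from a mapping (they come back as `NaN`). This is a minimal sketch on a hypothetical three-row frame, `toy`, using a subset of the real mapping dictionary. It is an alternative illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy frame using two of the real column names and letter codes
toy = pd.DataFrame({'cap-surface': ['s', 'y', 'f'],
                    'bruises': ['t', 'f', 't']})

# Subset of the mapping dictionary defined above
toy_mappings = {
    'cap-surface': {'f': 'fibrous', 'g': 'grooves', 'y': 'scaly', 's': 'smooth'},
    'bruises': {'t': 'bruises', 'f': 'no'},
}

# Map each column separately; codes absent from a mapping become NaN
decoded = toy.copy()
for col, mapping in toy_mappings.items():
    decoded[col] = toy[col].map(mapping)

# NaNs introduced by the mapping (0 means every code was covered)
unmapped = decoded.isna().sum().sum() - toy.isna().sum().sum()
print(decoded)
print("Unmapped codes:", unmapped)
```

One caveat of this approach: values that are already missing stay missing, so the count above subtracts the NaNs that were present before mapping.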

### Check for Null values in the data

It looks like every column has 0 missing values, except for `stalk-root`, which has 2,480. That is roughly 31% of the 8,124 rows (2,480 / 8,124 ≈ 0.305). Because I have no way of gathering the missing data, one option is to remove the column, since there are 21 other features to use for building the model. However, if there is a strong relationship between `stalk-root` and whether the mushroom is poisonous, dropping it would throw away valuable predictive information.

In [977]:
# Count the number of missing values in each column
df_mushroom.isnull().sum()

cap-shape                      0
cap-surface                    0
cap-color                      0
bruises                        0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
poisonous                      0
dtype: int64
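
To put the missing count in perspective, here is the share of rows affected, computed from the two figures reported above (the variable names are illustrative only).

```python
# Figures copied from the outputs above
n_rows = 8124
n_missing_stalk_root = 2480

# Fraction of rows with no stalk-root value
missing_share = n_missing_stalk_root / n_rows
print(f"Share of rows missing stalk-root: {missing_share:.1%}")
```

On the full DataFrame, `df_mushroom.isnull().mean()` would report this share for every column at once.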

### Chi-Square Test of Independence

To determine whether I should remove the column, I will use a Chi-Square Test of Independence to see if there is a significant association between `stalk-root` and `poisonous`. The Chi-Square Test of Independence calculations are based on the actual and expected number of observations in each combined group. We will evaluate a null and alternative hypothesis as follows:
- **Null Hypothesis (H₀):**  There is no association between the `stalk-root` and whether the mushroom is poisonous or edible. In other words, the distribution of `stalk-root` is independent of the `poisonous` variable.

- **Alternative Hypothesis (H₁):**  There is a significant association between the `stalk-root` and whether the mushroom is poisonous or edible. This means that the `stalk-root` variable may be useful in predicting whether a mushroom is poisonous.

We will use the `crosstab` function from `pandas` to create a contingency table, which shows the number of observations in each combination of groups. Rows with a missing `stalk-root` value are dropped first.

In [978]:
# Drop rows with missing values for the column of interest
df_clean = df_mushroom.dropna(subset=['stalk-root'])

# Create a contingency table
contingency_table = pd.crosstab(df_clean['stalk-root'], df_clean['poisonous'])
print(contingency_table)

poisonous   edible  poisonous
stalk-root                   
bulbous       1920       1856
club           512         44
equal          864        256
rooted         192          0
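
The raw counts are easier to compare as row-wise shares. The sketch below re-enters the contingency table by hand (so it runs on its own) and divides each row by its total. The names `contingency` and `row_shares` are illustrative.

```python
import pandas as pd

# Contingency table copied from the output above
contingency = pd.DataFrame(
    {'edible': [1920, 512, 864, 192], 'poisonous': [1856, 44, 256, 0]},
    index=pd.Index(['bulbous', 'club', 'equal', 'rooted'], name='stalk-root'),
)

# Divide each row by its total to get the class share per stalk-root type
row_shares = contingency.div(contingency.sum(axis=1), axis=0)
print(row_shares.round(3))
```

The shares differ noticeably between stalk-root types (for example, no `rooted` mushrooms are poisonous), which is the kind of pattern the chi-square test formalizes. On the raw data, `pd.crosstab(..., normalize='index')` should produce the same table directly.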


In [979]:
# Perform the Chi-Square test
chi_sq, p, dof, expected = stats.chi2_contingency(contingency_table)

# Display results
print(f"Chi-Square Statistic: {chi_sq}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:\n", expected)

# Interpretation
if p < 0.05:
    print("The variable is significantly related to the target variable.")
else:
    print("No strong evidence of a relationship between the variable and the target variable.")

Chi-Square Statistic: 638.2637626946303
P-value: 5.103920794627432e-138
Degrees of Freedom: 3
Expected Frequencies Table:
 [[2333.57335223 1442.42664777]
 [ 343.60878809  212.39121191]
 [ 692.16158753  427.83841247]
 [ 118.65627215   73.34372785]]
The variable is significantly related to the target variable.


In the code above, the chi-square statistic (`chi_sq`) is the sum, across all cells of the contingency table, of the squared difference between the observed and expected frequency, divided by the expected frequency. The following formula is used.

χ² = Σ [(O_i - E_i)² / E_i]

Where:
- `O_i` = Observed frequency (actual count from data)
- `E_i` = Expected frequency (calculated under the assumption of independence)
- The sum (Σ) is taken over all categories in the contingency table.
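
The formula can be checked by hand against the `scipy` output. The sketch below re-enters the observed counts from the contingency table and computes the expected counts and the statistic with plain NumPy. Names with the `_manual` suffix are illustrative and not part of the notebook.

```python
import numpy as np

# Observed counts from the stalk-root contingency table above
# (rows: bulbous, club, equal, rooted; columns: edible, poisonous)
observed = np.array([[1920, 1856],
                     [512, 44],
                     [864, 256],
                     [192, 0]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (4, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected counts under independence: row total * column total / grand total
expected_manual = row_totals @ col_totals / n

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_sq_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"Chi-square statistic: {chi_sq_manual:.4f}")
print(f"Degrees of freedom: {dof_manual}")
```

The result should agree with the statistic and expected frequencies reported by `chi2_contingency` above, up to floating-point rounding.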

In the output above, we can see the chi-square test statistic is `638.26`. We can compare this to a critical value to determine whether it is large enough to reject the null hypothesis that the two variables are unrelated. To calculate the critical chi-square value, we use the following code.

In [980]:
# Calculate the critical value for significance level 0.05
a = 0.05
critical_value = chi2.ppf(1 - a, dof)
print(f"Chi-Squared Critical Value: {critical_value}")

Chi-Squared Critical Value: 7.814727903251179


Since our chi-square statistic (638.26) is far larger than the critical value of 7.81, we reject the null hypothesis and conclude that there is a statistically significant association between `stalk-root` and `poisonous`. The p-value leads to the same conclusion: at 5.1 × 10^-138, it is far below the 0.05 significance level. Note that a small p-value indicates strong evidence that an association exists; it does not by itself measure how strong that association is.
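
The critical-value rule and the p-value rule are two views of the same test, so they always agree. The sketch below shows both, starting from the statistic and degrees of freedom reported above. The variable names are illustrative, and `chi2.sf` is the upper-tail (survival) function of the chi-square distribution.

```python
from scipy.stats import chi2

# Values copied from the outputs above
chi_sq_stat = 638.2637626946303
dof = 3
alpha = 0.05

# Critical-value rule: reject H0 if the statistic exceeds the cutoff
critical_value = chi2.ppf(1 - alpha, dof)
reject_by_critical = chi_sq_stat > critical_value

# P-value rule: reject H0 if the upper-tail probability is below alpha
p_value = chi2.sf(chi_sq_stat, dof)
reject_by_p = p_value < alpha

print(f"Critical value: {critical_value:.4f}, reject H0: {reject_by_critical}")
print(f"P-value: {p_value:.3e}, reject H0: {reject_by_p}")
```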

### Feature selection

On the one hand, it might not make sense to drop this column because it has a significant relationship with whether the mushroom is poisonous, and removing the rows where `stalk-root` is missing would delete about 31% of the data. On the other hand, leaving in the rows with missing values will throw off the One Hot Encoding process because it will interpret NULL as a unique category and create a dummy variable for it.

Before dropping any data, let's look at what other features are important or irrelevant for our analysis.

In [981]:
# Get the number of unique values per column
unique_values = df_mushroom.nunique().sort_values()
unique_values

veil-type                    1
poisonous                    2
bruises                      2
gill-attachment              2
gill-spacing                 2
gill-size                    2
stalk-shape                  2
ring-number                  3
veil-color                   4
stalk-surface-below-ring     4
stalk-surface-above-ring     4
cap-surface                  4
stalk-root                   4
ring-type                    5
population                   6
cap-shape                    6
habitat                      7
stalk-color-above-ring       9
stalk-color-below-ring       9
odor                         9
spore-print-color            9
cap-color                   10
gill-color                  12
dtype: int64

It seems that many of our variables have a high number of unique values, as seen above. For example, `gill-color` has 12 unique values, which means One Hot Encoding will create 11 columns for it (one category is dropped as the baseline). Transforming every variable into binary columns this way will significantly increase the number of features, which could make interpretation difficult and lead to overfitting. To help reduce complexity, let's perform the chi-square test of independence on each categorical variable to see if we can drop any irrelevant variables from our analysis.

In [982]:
# Get the column names of categorical columns, removing the target column and single numeric column
categorical_cols = df_mushroom.drop(columns=['poisonous','ring-number']).columns

# Dictionary to store p-values
chi2_results = {}

# Perform chi-square test for each categorical column
for col in categorical_cols:
    contingency_table = pd.crosstab(df_mushroom[col], df_mushroom['poisonous'])  # Create contingency table
    chi2_stat, p, dof, expected = stats.chi2_contingency(contingency_table)  # Perform chi-square test (named chi2_stat so the imported scipy `chi2` is not overwritten)
    chi2_results[col] = p  # Store p-value

# Convert results to a DataFrame and sort by significance
chi2_df = pd.DataFrame.from_dict(chi2_results, orient='index', columns=['p_value']).sort_values(by='p_value')

# Display results
print(chi2_df)


                                p_value
habitat                    0.000000e+00
spore-print-color          0.000000e+00
ring-type                  0.000000e+00
bruises                    0.000000e+00
odor                       0.000000e+00
stalk-color-below-ring     0.000000e+00
gill-size                  0.000000e+00
gill-color                 0.000000e+00
stalk-color-above-ring     0.000000e+00
population                 0.000000e+00
stalk-surface-above-ring   0.000000e+00
stalk-surface-below-ring   0.000000e+00
gill-spacing              5.022978e-216
stalk-root                5.103921e-138
cap-shape                 1.196457e-103
cap-color                  6.055815e-78
cap-surface                5.518427e-68
veil-color                 3.320973e-41
gill-attachment            5.501707e-31
stalk-shape                4.604746e-20
veil-type                  1.000000e+00
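
Two features of the table above are worth a note. The p-values printed as `0.000000e+00` are not exactly zero; they are simply too small to be represented as a floating-point number. The p-value of 1.0 for `veil-type` comes from a degenerate contingency table: with a single category there is only one row, so the degrees of freedom are 0 and there is nothing to test. The sketch below reproduces that case with a hand-built one-row table (the class counts are the ones reported earlier).

```python
import numpy as np
from scipy import stats

# One-row table: a feature with a single category (like veil-type),
# using the class counts reported earlier (4208 edible, 3916 poisonous)
single_category = np.array([[4208, 3916]])

stat, p_value, dof, expected = stats.chi2_contingency(single_category)
print(f"statistic={stat}, p-value={p_value}, dof={dof}")
```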


As seen above, `veil-type` is the only variable that does not have a significant association with `poisonous`. This makes sense because, as we saw earlier, `veil-type` has only one unique value and therefore cannot help predict whether a mushroom is poisonous or edible. The rest of the variables are highly significant, as indicated by their extremely small p-values. Let's drop `veil-type` from our data to reduce complexity.

In [983]:
# Drop veil-type column
df_mushroom = df_mushroom.drop(columns=['veil-type'])

# Check if 'veil-type' column is in the DataFrame
'veil-type' in df_mushroom.columns

False

The `veil-type` column was successfully dropped, as confirmed by the code output above. Let's also create two new dataframes as follows:
1. `df_dropped_rows`: Drops all rows where `stalk-root` is missing.
2. `df_dropped_column`: Drops the `stalk-root` column entirely.

In [984]:
# Drop rows where 'stalk-root' is NaN
df_dropped_rows = df_mushroom.dropna(subset=['stalk-root'])

# Drop the 'stalk-root' column entirely
df_dropped_column = df_mushroom.drop(columns=['stalk-root'])

For now we will conduct our analysis and build our models using `df_dropped_column`, but we will return to `df_dropped_rows` later to see if it is able to better predict mushroom edibility with the extra column, despite losing some data.

### Transform data using One Hot Encoding

Given that all of our categorical columns contain nominal features, **One Hot Encoding** is the best choice for transforming our variables into numeric type. Label Encoding does not make sense because none of these variables are ordinal: coding the categories as numbers in a single column would introduce a misleading order that is only appropriate when the data has some inherent ranking. One Hot Encoding converts a categorical variable into binary format by creating a new dummy variable column for each unique category of that variable.
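
To make the idea concrete, here is a minimal sketch on a hypothetical four-row frame, `toy`. It uses `pandas.get_dummies` rather than the scikit-learn encoder used in this notebook, purely because it is the shortest way to show the resulting columns. `drop_first=True` plays the same role as `drop='first'`.

```python
import pandas as pd

# Toy column using real category names from `cap-surface`
toy = pd.DataFrame({'cap-surface': ['smooth', 'scaly', 'fibrous', 'smooth']})

# drop_first=True drops the first category in sorted order ('fibrous'),
# which becomes the baseline encoded as all zeros
dummies = pd.get_dummies(toy, columns=['cap-surface'], drop_first=True, dtype=float)
print(dummies)
```

Three categories produce two dummy columns, and the `fibrous` row is all zeros. Dropping one category per variable removes the redundant column that would otherwise be perfectly predictable from the others.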

In the code below, we apply One Hot Encoding to every categorical column, that is, every column except `ring-number`.

In [985]:
# Define the columns that need One-Hot Encoding (every column except ring-number)
categorical_columns = ['poisonous', 'bruises', 'gill-attachment', 'gill-spacing', 'gill-size', 'stalk-shape', 'veil-color', 'stalk-surface-below-ring',
                       'stalk-surface-above-ring', 'cap-surface', 'ring-type', 'population', 'cap-shape', 'habitat', 'stalk-color-above-ring', 
                       'stalk-color-below-ring', 'odor', 'spore-print-color', 'cap-color', 'gill-color']

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity

# Fit and transform the categorical columns
encoded_array = encoder.fit_transform(df_dropped_column[categorical_columns])

# Get the new column names from the encoder
encoded_col_names = encoder.get_feature_names_out(categorical_columns)
print("Number of columns after OHE: " + str(len(encoded_col_names)))

# Convert the encoded array to a DataFrame
df_encoded = pd.DataFrame(encoded_array, columns=encoded_col_names, index=df_dropped_column.index)

# Drop the original categorical columns and concatenate the new encoded columns
df_dropped_column = df_dropped_column.drop(columns=categorical_columns).join(df_encoded)

# Display the first few rows
df_dropped_column.head()


Number of columns after OHE: 90


Unnamed: 0,ring-number,poisonous_poisonous,bruises_no,gill-attachment_free,gill-spacing_crowded,gill-size_narrow,stalk-shape_tapering,veil-color_orange,veil-color_white,veil-color_yellow,...,gill-color_buff,gill-color_chocolate,gill-color_gray,gill-color_green,gill-color_orange,gill-color_pink,gill-color_purple,gill-color_red,gill-color_white,gill-color_yellow
0,1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


I printed the number of columns created by One Hot Encoding so we can check that it worked correctly. According to the table below, we would expect 90 dummy columns, because each column with k categories contributes k - 1 columns when `drop='first'` is used. This lines up with what our code printed, showing that our encoder indeed created 90 columns. The resulting dataframe has 91 columns, which makes sense because we joined our 90 dummy variable columns to the 1 remaining column (`ring-number`).

| Categorical Column          | # of Unique Categories | New Columns |
|-----------------------------|------------------------|-------------|
| `poisonous`                 | 2                 | 1                      |
| `bruises`                   | 2                 | 1                      |
| `gill-attachment`           | 2                 | 1                      |
| `gill-spacing`              | 2                 | 1                      |
| `gill-size`                 | 2                 | 1                      |
| `stalk-shape`               | 2                 | 1                      |
| `veil-color`                | 4                 | 3                      |
| `stalk-surface-below-ring`  | 4                 | 3                      |
| `stalk-surface-above-ring`  | 4                 | 3                      |
| `cap-surface`               | 4                 | 3                      |
| `ring-type`                 | 5                 | 4                      |
| `population`                | 6                 | 5                      |
| `cap-shape`                 | 6                 | 5                      |
| `habitat`                   | 7                 | 6                      |
| `stalk-color-above-ring`    | 9                 | 8                      |
| `stalk-color-below-ring`    | 9                 | 8                      |
| `odor`                      | 9                 | 8                      |
| `spore-print-color`         | 9                 | 8                      |
| `cap-color`                 | 10                | 9                      |
| `gill-color`                | 12                | 11                     |
| **Total**                   |                   | **90**                 |
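
The total in the table can be verified with a few lines of arithmetic. The dictionary below simply re-enters the category counts from the table, and `n_categories` is an illustrative name.

```python
# Unique-category counts per encoded column, copied from the table above
n_categories = {
    'poisonous': 2, 'bruises': 2, 'gill-attachment': 2, 'gill-spacing': 2,
    'gill-size': 2, 'stalk-shape': 2, 'veil-color': 4,
    'stalk-surface-below-ring': 4, 'stalk-surface-above-ring': 4,
    'cap-surface': 4, 'ring-type': 5, 'population': 6, 'cap-shape': 6,
    'habitat': 7, 'stalk-color-above-ring': 9, 'stalk-color-below-ring': 9,
    'odor': 9, 'spore-print-color': 9, 'cap-color': 10, 'gill-color': 12,
}

# With drop='first', each column contributes (categories - 1) dummy columns
expected_dummy_columns = sum(k - 1 for k in n_categories.values())
print(expected_dummy_columns)
```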

Let's check the data types again to ensure all the columns are numeric type.

In [986]:
# Check if all columns are of type float
all_float = df_dropped_column.dtypes.eq('float64').all()
print(f"All columns are of float type: {all_float}")

# Identify which column(s) are not of type float
non_float_columns = df_dropped_column.columns[df_dropped_column.dtypes != 'float64']
print("Columns not of float type:")
print(non_float_columns)

All columns are of float type: False
Columns not of float type:
Index(['ring-number'], dtype='object')
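
The output above lists only the names of the non-float columns. To see each column's own dtype alongside its name, the `dtypes` Series can be filtered directly. This is a small sketch on a hypothetical frame, `toy`; the integer dtype of its `ring-number` column is an assumption made for illustration, not a statement about the real data.

```python
import pandas as pd

# Toy frame: one integer column and one float dummy column
toy = pd.DataFrame({'ring-number': [1, 1, 2], 'bruises_no': [0.0, 1.0, 0.0]})

# Boolean-index the dtypes Series to keep each non-float column with its dtype
non_float = toy.dtypes[toy.dtypes != 'float64']
print(non_float)
```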


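The output shows that `ring-number` is the only column that is not `float64`. (The `dtype='object'` in the output describes the index of column names, not the column itself.) Let's convert it to float using the following code.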

In [987]:
# Convert 'ring-number' to float type
df_dropped_column['ring-number'] = df_dropped_column['ring-number'].astype(float)

# Check the data type again to verify
print("Data type of ring-number column: " + str(df_dropped_column['ring-number'].dtype))

# Check if all columns are of type float
all_float = df_dropped_column.dtypes.eq('float64').all()
print(f"All columns are of float type: {all_float}")

Data type of ring-number column: float64
All columns are of float type: True


Now that we have a solid understanding of our data, and it is cleaned and in the right format, we can move on to testing assumptions for our binary logistic regression model.

## Step 3: Checking Assumptions