# Problem Statement

1. Title: Secondary mushroom data

2. Sources:
	(a) Mushroom species drawn from source book:
		Patrick Hardin. Mushrooms & Toadstools.
	    Zondervan, 1999
	(b) Inspired by this mushroom data:
		Jeff Schlimmer. Mushroom Data Set. Apr. 1987.
		url: https://archive.ics.uci.edu/ml/datasets/Mushroom
	(c) Repository containing the related Python scripts and all the data sets: https://mushroom.mathematik.uni-marburg.de/files/ 
	(d) Author: Dennis Wagner
	(e) Date: 05 September 2020

3. Relevant information:
	This dataset includes 61069 hypothetical mushrooms with caps based on 173 species (353 mushrooms
	per species). Each mushroom is identified as definitely edible, definitely poisonous, or of 
	unknown edibility and not recommended (the latter class was combined with the poisonous class).
	Of the 20 variables, 17 are nominal and 3 are metrical.

4. Data simulation:
	The related Python project (Sources (c)) contains a Python module secondary_data_generation.py
	used to generate this data based on primary_data_edited.csv also found in the repository.
	Both nominal and metrical variables are a result of randomization.
	The simulated and ordered by species version is found in secondary_data_generated.csv.
	The randomly shuffled version is found in secondary_data_shuffled.csv.

5. Class information:
	1. class		poisonous=p, edible=e (binary)

6. Variable Information:
   (n: nominal, m: metrical; nominal values as sets of values)
   1. cap-diameter (m):			float number in cm
   2. cap-shape (n):            bell=b, conical=c, convex=x, flat=f,
                                sunken=s, spherical=p, others=o
   3. cap-surface (n):          fibrous=i, grooves=g, scaly=y, smooth=s,
								shiny=h, leathery=l, silky=k, sticky=t,
								wrinkled=w, fleshy=e
   4. cap-color (n):            brown=n, buff=b, gray=g, green=r, pink=p,
								purple=u, red=e, white=w, yellow=y, blue=l, 
								orange=o,  black=k
   5. does-bruise-bleed (n):	bruises-or-bleeding=t,no=f
   6. gill-attachment (n):      adnate=a, adnexed=x, decurrent=d, free=e, 
								sinuate=s, pores=p, none=f, unknown=?
   7. gill-spacing (n):         close=c, distant=d, none=f
   8. gill-color (n):           see cap-color + none=f
   9. stem-height (m):			float number in cm
   10. stem-width (m):			float number in mm   
   11. stem-root (n):           bulbous=b, swollen=s, club=c, cup=u, equal=e,
                                rhizomorphs=z, rooted=r
   12. stem-surface (n): 		see cap-surface + none=f
   13. stem-color (n):			see cap-color + none=f
   14. veil-type (n):           partial=p, universal=u
   15. veil-color (n):          see cap-color + none=f
   16. has-ring (n):            ring=t, none=f
   17. ring-type (n):           cobwebby=c, evanescent=e, flaring=r, grooved=g, 
							    large=l, pendant=p, sheathing=s, zone=z, scaly=y, movable=m, none=f, unknown=?
   18. spore-print-color (n):   see cap-color
   19. habitat (n):             grasses=g, leaves=l, meadows=m, paths=p, heaths=h,
                                urban=u, waste=w, woods=d
   20. season (n):				spring=s, summer=u, autumn=a, winter=w


# IMPORTING BASIC LIBRARIES

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [3]:
pd.set_option("display.max_columns",None)

# READING DATASET

In [18]:
data = pd.read_csv(r"C:\Users\Pragya\Downloads\mushroom\agaricus-lepiota.data",delimiter=",")
data

Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,e,e.1,s.1,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
0,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8118,e,k,s,n,f,n,a,c,b,y,e,?,s,s,o,o,p,o,o,p,b,c,l
8119,e,x,s,n,f,n,a,c,b,y,e,?,s,s,o,o,p,n,o,p,b,v,l
8120,e,f,s,n,f,n,a,c,b,n,e,?,s,s,o,o,p,o,o,p,b,c,l
8121,p,k,y,n,f,y,f,c,n,b,t,?,s,k,w,w,p,w,o,e,w,v,l


# EDA

**1. HEAD**

In [9]:
data.head()

Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,e,e.1,s.1,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
0,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,e,c,s,s,w,w,p,w,o,p,k,n,g


**2. SHAPE**

In [8]:
data.shape

(8123, 23)

**3. COLUMNS**

In [10]:
data.columns

Index(['p', 'x', 's', 'n', 't', 'p.1', 'f', 'c', 'n.1', 'k', 'e', 'e.1', 's.1',
       's.2', 'w', 'w.1', 'p.2', 'w.2', 'o', 'p.3', 'k.1', 's.3', 'u'],
      dtype='object')
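
The single-letter column labels above, together with the 8123-row shape, suggest that the first record of `agaricus-lepiota.data` was read as the header; the UCI file ships without a header row and lists 8124 instances. A minimal sketch of reading such a file with explicit names is given below. The column names follow the UCI `agaricus-lepiota.names` description, and the three inline rows stand in for the real file, so both are assumptions rather than part of the original notebook.

```python
import io
import pandas as pd

# Column names as listed in the UCI agaricus-lepiota.names file (assumed here).
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# Three records in the same layout as the file, used instead of a local path.
sample = io.StringIO(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
    "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m\n"
)

# header=None keeps the first record as data instead of using it as labels.
df = pd.read_csv(sample, header=None, names=COLUMNS)
print(df.shape)
print(df["class"].tolist())
```

With the real file, the same call would be expected to keep all records and give readable column names in every later output.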

**4. DATA TYPE**

In [11]:
data.dtypes

p      object
x      object
s      object
n      object
t      object
p.1    object
f      object
c      object
n.1    object
k      object
e      object
e.1    object
s.1    object
s.2    object
w      object
w.1    object
p.2    object
w.2    object
o      object
p.3    object
k.1    object
s.3    object
u      object
dtype: object

**5. MISSING VALUES**

In [20]:
data.isnull().sum()

p         0
x         0
s         0
n         0
t         0
p.1       0
f         0
c         0
n.1       0
k         0
e         0
e.1    2480
s.1       0
s.2       0
w         0
w.1       0
p.2       0
w.2       0
o         0
p.3       0
k.1       0
s.3       0
u         0
dtype: int64

**6. DUPLICATED VALUES**

In [13]:
data.duplicated().sum()

0

**7. DESCRIPTIVE STATISTICS**

In [14]:
data.describe()

Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,e,e.1,s.1,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
count,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123,8123
unique,2,6,4,10,2,9,2,2,2,12,2,5,4,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,t,b,s,s,w,w,p,w,o,p,w,v,d
freq,4208,3655,3244,2283,4748,3528,7913,6811,5612,1728,4608,3776,5175,4935,4463,4383,8123,7923,7487,3967,2388,4040,3148


**8. INFO**

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 23 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   p       8123 non-null   object
 1   x       8123 non-null   object
 2   s       8123 non-null   object
 3   n       8123 non-null   object
 4   t       8123 non-null   object
 5   p.1     8123 non-null   object
 6   f       8123 non-null   object
 7   c       8123 non-null   object
 8   n.1     8123 non-null   object
 9   k       8123 non-null   object
 10  e       8123 non-null   object
 11  e.1     8123 non-null   object
 12  s.1     8123 non-null   object
 13  s.2     8123 non-null   object
 14  w       8123 non-null   object
 15  w.1     8123 non-null   object
 16  p.2     8123 non-null   object
 17  w.2     8123 non-null   object
 18  o       8123 non-null   object
 19  p.3     8123 non-null   object
 20  k.1     8123 non-null   object
 21  s.3     8123 non-null   object
 22  u       8123 non-null   object

**9. ANOMALY DETECTION**

In [16]:
for i in data.columns:
    print({i:data[i].unique()})

{'p': array(['e', 'p'], dtype=object)}
{'x': array(['x', 'b', 's', 'f', 'k', 'c'], dtype=object)}
{'s': array(['s', 'y', 'f', 'g'], dtype=object)}
{'n': array(['y', 'w', 'g', 'n', 'e', 'p', 'b', 'u', 'c', 'r'], dtype=object)}
{'t': array(['t', 'f'], dtype=object)}
{'p.1': array(['a', 'l', 'p', 'n', 'f', 'c', 'y', 's', 'm'], dtype=object)}
{'f': array(['f', 'a'], dtype=object)}
{'c': array(['c', 'w'], dtype=object)}
{'n.1': array(['b', 'n'], dtype=object)}
{'k': array(['k', 'n', 'g', 'p', 'w', 'h', 'u', 'e', 'b', 'r', 'y', 'o'],
      dtype=object)}
{'e': array(['e', 't'], dtype=object)}
{'e.1': array(['c', 'e', 'b', 'r', '?'], dtype=object)}
{'s.1': array(['s', 'f', 'k', 'y'], dtype=object)}
{'s.2': array(['s', 'f', 'y', 'k'], dtype=object)}
{'w': array(['w', 'g', 'p', 'n', 'b', 'e', 'o', 'c', 'y'], dtype=object)}
{'w.1': array(['w', 'p', 'g', 'b', 'n', 'e', 'y', 'o', 'c'], dtype=object)}
{'p.2': array(['p'], dtype=object)}
{'w.2': array(['w', 'n', 'o', 'y'], dtype=object)}
{'o': array

AN ANOMALY IS DETECTED IN THE FORM OF "?" (COLUMN e.1)

**10. REPLACING "?" WITH NP.NAN**

In [19]:
data.replace("?",np.nan,inplace=True)
data.isnull().sum()

p         0
x         0
s         0
n         0
t         0
p.1       0
f         0
c         0
n.1       0
k         0
e         0
e.1    2480
s.1       0
s.2       0
w         0
w.1       0
p.2       0
w.2       0
o         0
p.3       0
k.1       0
s.3       0
u         0
dtype: int64
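
As an alternative to replacing the placeholder after loading, pandas can treat "?" as missing while the file is parsed. The sketch below uses a tiny inline table with hypothetical columns `a` and `b` in place of the mushroom file.

```python
import io
import pandas as pd

# Tiny stand-in table; "?" marks a missing entry, as in the mushroom data.
raw = io.StringIO("a,b\nx,?\ny,c\nz,?\n")

# na_values adds "?" to the strings that read_csv parses as NaN.
df_na = pd.read_csv(raw, na_values="?")
missing = df_na.isnull().sum()
print(missing)
```

This keeps the missing-value handling in one place, so an `isnull().sum()` run right after loading already reports the gaps.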

# DATA PREPROCESSING

**Checking for missing values**

In [21]:
data.isnull().sum()

p         0
x         0
s         0
n         0
t         0
p.1       0
f         0
c         0
n.1       0
k         0
e         0
e.1    2480
s.1       0
s.2       0
w         0
w.1       0
p.2       0
w.2       0
o         0
p.3       0
k.1       0
s.3       0
u         0
dtype: int64

**REPLACING MISSING VALUE WITH MODE**

In [22]:
for x in data.columns:
    if data[x].dtype=='object' or data[x].dtype=='bool':      # categorical: fill with the most frequent value
        data[x] = data[x].fillna(data[x].mode()[0])             # assign back instead of chained inplace fill
    elif data[x].dtype=='int64' or data[x].dtype=='float64':    # numerical: fill with 0
        data[x] = data[x].fillna(0)
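
Mode imputation inflates the most frequent category. Going by the counts shown earlier, filling the 2480 missing entries of `e.1` with its mode `b` (frequency 3776) would bring `b` to 6256 of 8123 rows. The sketch below compares mode filling with keeping an explicit "missing" category on a small hypothetical column; the column name and the "missing" label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
root = pd.Series(["b", np.nan, "b", "c", np.nan], name="root")

# Option 1: fill gaps with the most frequent value (mode() ignores NaN).
mode_filled = root.fillna(root.mode()[0])

# Option 2: keep missingness visible as its own category.
flagged = root.fillna("missing")

print(mode_filled.value_counts().to_dict())
print(flagged.value_counts().to_dict())
```

Which option is preferable depends on whether the absence of a value is itself informative for the classifier.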


**RECHECKING**

In [23]:
data.isnull().sum()

p      0
x      0
s      0
n      0
t      0
p.1    0
f      0
c      0
n.1    0
k      0
e      0
e.1    0
s.1    0
s.2    0
w      0
w.1    0
p.2    0
w.2    0
o      0
p.3    0
k.1    0
s.3    0
u      0
dtype: int64

**APPLYING LABELENCODER TO CONVERT CATEGORICAL DATA TO NUMERICAL**

In [24]:
colname=[]
for x in data.columns:
    if data[x].dtype=='object':
        colname.append(x)
print(colname)

from sklearn.preprocessing import LabelEncoder
 
le=LabelEncoder()
 
for x in colname:
    data[x]=le.fit_transform(data[x])

['p', 'x', 's', 'n', 't', 'p.1', 'f', 'c', 'n.1', 'k', 'e', 'e.1', 's.1', 's.2', 'w', 'w.1', 'p.2', 'w.2', 'o', 'p.3', 'k.1', 's.3', 'u']
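
`LabelEncoder` maps the sorted categories to integers, which gives nominal values an order they do not have; scikit-learn documents it as an encoder for target labels rather than input features. One common alternative is one-hot encoding. The sketch below contrasts the two on a hypothetical `cap_color` column and is not the encoding used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal feature with three colour codes.
toy = pd.DataFrame({"cap_color": ["n", "y", "w", "n"]})

# LabelEncoder sorts the categories (n, w, y) and maps them to 0, 1, 2.
codes = LabelEncoder().fit_transform(toy["cap_color"])
print(codes)

# One indicator column per category, with no implied order between them.
one_hot = pd.get_dummies(toy, columns=["cap_color"], dtype=int)
print(one_hot)
```

For tree-based models the integer codes are often harmless, while a linear model such as logistic regression reads them as magnitudes.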


In [26]:
data.columns

Index(['p', 'x', 's', 'n', 't', 'p.1', 'f', 'c', 'n.1', 'k', 'e', 'e.1', 's.1',
       's.2', 'w', 'w.1', 'p.2', 'w.2', 'o', 'p.3', 'k.1', 's.3', 'u'],
      dtype='object')

**Creating X and Y**

In [27]:
X = data.drop(['p'],axis=1)
Y = data['p']

In [28]:
print(X.shape)
print(Y.shape)

(8123, 22)
(8123,)


**Scaling the data to avoid data discrepancies**

In [29]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()    
scaler.fit(X)                
X = scaler.transform(X)       
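
In the cell above the scaler is fitted on all rows before the data is split, so the test rows contribute to the means and variances. A possible variant, sketched here on synthetic data from `make_classification` (an assumption, not the mushroom data), splits first and lets a pipeline fit the scaler on the training part only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 5 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=10)

# Split first, so the test rows never influence the scaling statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)

# The pipeline fits StandardScaler on X_tr only, then reuses it on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

The same pipeline object can be passed to `GridSearchCV`, which then refits the scaler inside every fold.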

**Splitting the data into testing and training parts**

In [30]:
# split the data into test and train
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state = 10) 

**Checking class Distribution**

In [32]:
data['p'].value_counts()

p
0    4208
1    3915
Name: count, dtype: int64

# MODEL BUILDING

# LOGISTIC REGRESSION

In [34]:
from sklearn.linear_model import LogisticRegression
# create a model
classifier=LogisticRegression(random_state=10)

# fit the model on the training data (inputs are X_train and Y_train)
classifier.fit(X_train,Y_train)

# the output is the fitted logistic regression equation
print(classifier.intercept_)   # intercept is beta-0
print(classifier.coef_)     # coefficients are beta-1, beta-2, ...

[-2.01820241]
[[-5.43694955e-03  4.76547515e-01 -1.06419025e-01 -1.65418244e-01
  -1.41415008e+00 -6.72521523e-01 -5.50623578e+00  6.64791286e+00
  -8.11299931e-01 -5.11466309e-02 -4.65933616e+00 -5.17310791e+00
   1.23175294e-01 -2.69846778e-01 -2.90948014e-02  0.00000000e+00
   5.64088232e+00  7.36554685e-01  4.26139893e+00 -2.03266062e-01
  -1.81048328e+00  3.09752894e-01]]
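
The intercept and coefficients define the predicted probability of class 1 as sigmoid(beta-0 + x · beta). The sixteenth coefficient above is exactly 0, which is consistent with the constant column `p.2` carrying no information. The sketch below checks the formula against `predict_proba` on synthetic data (an assumption; the mushroom features are not used here).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with 4 numeric features.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=10)
clf = LogisticRegression(random_state=10).fit(X_demo, y_demo)

# Linear score z = beta_0 + x . beta, then the logistic (sigmoid) function.
z = clf.intercept_ + X_demo @ clf.coef_.ravel()
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Column 1 of predict_proba is the probability of class 1.
sklearn_proba = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```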


**Predicting the output on test data**

In [35]:
Y_pred=classifier.predict(X_test)
print(Y_pred)

[0 0 1 ... 0 1 1]


**Evaluation Phase**

In [36]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
# confusion matrix
cfm=confusion_matrix(Y_test,Y_pred)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,Y_pred))


# accuracy_score
acc=accuracy_score(Y_test, Y_pred)
print("Accuracy of the model: ",acc)

[[1225   48]
 [  41 1123]]
Classification report: 
              precision    recall  f1-score   support

           0       0.97      0.96      0.96      1273
           1       0.96      0.96      0.96      1164

    accuracy                           0.96      2437
   macro avg       0.96      0.96      0.96      2437
weighted avg       0.96      0.96      0.96      2437

Accuracy of the model:  0.9634796881411571
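
The figures in the report can be recomputed from the confusion matrix, whose rows are true classes and whose columns are predicted classes. A short worked check using the four counts printed above:

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted.
cfm = np.array([[1225, 48],
                [41, 1123]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cfm.ravel()

accuracy = (tp + tn) / cfm.sum()   # share of all test rows classified correctly
precision_1 = tp / (tp + fp)       # of rows predicted as class 1, share truly class 1
recall_1 = tp / (tp + fn)          # of rows truly class 1, share that were found

print(accuracy, precision_1, recall_1)
```

Here 2348 of the 2437 test rows are classified correctly, which matches the reported accuracy.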


# TUNED LOGISTIC REGRESSION

In [53]:
# store the predicted probabilities
y_pred_prob = classifier.predict_proba(X_test)
print(y_pred_prob)


y_pred_class = []
for value in y_pred_prob[:,1]:
    if value >0.58:                         # if the probability is above the 0.58 threshold, predict 1
        y_pred_class.append(1)
    else:
        y_pred_class.append(0)              # otherwise predict 0


from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
 
# confusion matrix
cfm=confusion_matrix(Y_test,y_pred_class)
print(cfm)
 
print("Classification report: ")
 
print(classification_report(Y_test,y_pred_class))


# accuracy_score
acc=accuracy_score(Y_test, y_pred_class)
print("Accuracy of the model: ",acc)

[[9.99999956e-01 4.41825110e-08]
 [9.83990407e-01 1.60095934e-02]
 [3.09894077e-01 6.90105923e-01]
 ...
 [9.99999894e-01 1.05507704e-07]
 [1.95003078e-05 9.99980500e-01]
 [1.45253956e-01 8.54746044e-01]]
[[1232   41]
 [  66 1098]]
Classification report: 
              precision    recall  f1-score   support

           0       0.95      0.97      0.96      1273
           1       0.96      0.94      0.95      1164

    accuracy                           0.96      2437
   macro avg       0.96      0.96      0.96      2437
weighted avg       0.96      0.96      0.96      2437

Accuracy of the model:  0.9560935576528519
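
Raising the threshold from the default 0.5 to 0.58 lowers the false positives from 48 to 41 but raises the false negatives from 41 to 66. If class 1 is the poisonous class, as the alphabetical label encoding of `e` and `p` suggests, more poisonous mushrooms are then predicted as edible. The loop above can also be written as a single vectorized comparison; the probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical predict_proba output: columns are P(class 0), P(class 1).
proba = np.array([[0.90, 0.10],
                  [0.45, 0.55],
                  [0.40, 0.60],
                  [0.05, 0.95]])

threshold = 0.58

# Predict 1 where P(class 1) exceeds the threshold, otherwise 0.
pred_custom = (proba[:, 1] > threshold).astype(int)

# The default decision rule corresponds to a 0.5 threshold.
pred_default = (proba[:, 1] > 0.5).astype(int)

print(pred_custom)
print(pred_default)
```

Only the second row changes class between the two thresholds, since its probability lies between 0.5 and 0.58.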


# DECISION TREE

In [54]:
#predicting using the Decision_Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DT=DecisionTreeClassifier(random_state=10, 
                                         criterion="gini")
#min_samples_leaf, min_samples_split, max_depth, max_features, max_leaf_nodes

#fit the model on the data and predict the values
model_DT.fit(X_train,Y_train)
Y_pred=model_DT.predict(X_test)
#print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

model_DT.score(X_train,Y_train)

[[1273    0]
 [   0 1164]]
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1273
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2437
   macro avg       1.00      1.00      1.00      2437
weighted avg       1.00      1.00      1.00      2437



1.0

TRAINING ACCURACY IS ALSO 1.0, SO THE MODEL MAY BE OVERFITTING
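
Overfitting usually shows up as a gap between training accuracy and accuracy on held-out data. Here both are 1.0, so a further check is k-fold cross-validation. The sketch below runs it on synthetic data from `make_classification` (an assumption), where an unpruned tree fits the training rows perfectly but scores lower across folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 rows, 8 continuous features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=10)

tree = DecisionTreeClassifier(random_state=10)

# Accuracy on the same rows the tree was trained on.
train_score = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds; cross_val_score refits a clone for each fold.
cv_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")

# A large positive gap points to overfitting; a gap near 0 does not.
gap = train_score - cv_scores.mean()
print(train_score, cv_scores.mean(), gap)
```

Applied to `X_train` and `Y_train`, fold scores that stay close to 1.0 would suggest the data is simply easy to separate rather than memorized.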

# TUNED DECISION TREE

In [56]:
from sklearn.model_selection import GridSearchCV, train_test_split

# Define the model
dt = DecisionTreeClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, Y_train)

# Best parameters and accuracy
print("Best Parameters:", grid_search.best_params_)

# Test accuracy
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy:", accuracy_score(Y_test, y_pred))

Best Parameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Test Accuracy: 1.0


# RANDOM FOREST

In [57]:
#predicting using the Random_Forest_Classifier
from sklearn.ensemble import RandomForestClassifier

model_RandomForest=RandomForestClassifier(n_estimators=100,
                                          random_state=10, bootstrap=True,
                                         n_jobs=-1)

#fit the model on the data and predict the values
model_RandomForest.fit(X_train,Y_train)

Y_pred=model_RandomForest.predict(X_test)


from sklearn.metrics import confusion_matrix, accuracy_score,classification_report
#confusion matrix
print(confusion_matrix(Y_test,Y_pred))
print(accuracy_score(Y_test,Y_pred))
print(classification_report(Y_test,Y_pred))

[[1273    0]
 [   0 1164]]
1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1273
           1       1.00      1.00      1.00      1164

    accuracy                           1.00      2437
   macro avg       1.00      1.00      1.00      2437
weighted avg       1.00      1.00      1.00      2437



# TUNED RANDOM FOREST

In [59]:
# Define model
rf = RandomForestClassifier(random_state=42)

# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'criterion': ['gini', 'entropy']
}

# Grid search
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, Y_train)

# Best parameters and test accuracy
print("Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy:", accuracy_score(Y_test, y_pred))

Best Parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Test Accuracy: 1.0


# CONCLUSION

Logistic regression gives good accuracy (about 96% on the test set), while the decision tree and random forest both reach 100% accuracy.
Because we allow a margin of about 3%, I prefer logistic regression as the best model.