Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency between 50% and 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
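
The log transform mentioned in the regression bullet can be sketched as follows. This is a minimal illustration using only the standard library; the target values are hypothetical and do not come from any project dataset.

```python
import math
import statistics

# Hypothetical right-skewed target values (illustration only).
y = [12.0, 15.0, 18.0, 22.0, 30.0, 45.0, 400.0, 2500.0]

# log1p computes log(1 + y), so it is also defined when y is 0.
y_log = [math.log1p(v) for v in y]

# After predicting on the log scale, expm1 inverts the transform.
y_back = [math.expm1(v) for v in y_log]

# A mean far above the median is one sign of right skew.
print("raw:", statistics.mean(y), statistics.median(y))
print("log:", statistics.mean(y_log), statistics.median(y_log))
```

On the log scale the mean and median sit much closer together, which is the effect the transform is meant to achieve.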

In [None]:
import pandas as pd
import pandas_profiling

df = pd.read_csv('../data/mushrooms.csv')

# Change column names: replace hyphens with underscores
df.columns = [col.replace('-', '_') for col in df]

## Choose your target 

### Which column in your tabular dataset will you predict?

The 'class' column is the target. 

## Is your problem regression or classification?

This is a binary classification problem: each mushroom is classified as either edible (`e`) or poisonous (`p`).

## How is your target distributed?

### Classification: How many classes? Are the classes imbalanced?

There are two classes, and they are roughly balanced. The majority class (edible) makes up 51.8% of the observations.


In [11]:
df['class'].unique()

array(['p', 'e'], dtype=object)

In [12]:
df['class'].value_counts(normalize=True)

e    0.517971
p    0.482029
Name: class, dtype: float64
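
The majority class frequency is also the accuracy of a baseline that always predicts the majority class. The sketch below computes it and applies the 50% to 70% rule of thumb from the assignment. The helper names and the toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter

def majority_class_frequency(labels):
    """Return (majority_label, share of observations with that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

def accuracy_is_reasonable(frequency, low=0.5, high=0.7):
    """Rule of thumb from the assignment: low < majority frequency < high."""
    return low < frequency < high

# Hypothetical toy labels, only to show the call.
toy_labels = ['e', 'e', 'e', 'p', 'p']
label, freq = majority_class_frequency(toy_labels)
print(label, freq, accuracy_is_reasonable(freq))

# The frequency reported above for this dataset falls inside the range.
print(accuracy_is_reasonable(0.517971))
```

Any model worth keeping should beat this baseline accuracy by a clear margin.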

## Choose which observations you will use to train, validate, and test your model.

### Are some observations outliers? Will you exclude them?

Every feature is categorical, so there are no numeric outliers to remove. I will use all of the observations.

### Will you do a random split or a time-based split?

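I will use a random train/test split, with cross-validation on the training set for validation. The dataset has no time component, so a time-based split does not apply.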
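
To make the two ideas concrete, here is a minimal standard-library sketch of a seeded random hold-out split followed by k-fold cross-validation on the training portion. The function names are hypothetical, and this is not how scikit-learn implements either step; it only illustrates the bookkeeping.

```python
import random

def random_split(n_rows, test_fraction=0.25, seed=42):
    """Shuffle row positions and split them into train and test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = round(n_rows * test_fraction)
    return positions[n_test:], positions[:n_test]

def k_fold(positions, k=5):
    """Yield (training, validation) pairs; each position is validated once."""
    for i in range(k):
        validation = positions[i::k]
        held_out = set(validation)
        training = [p for p in positions if p not in held_out]
        yield training, validation

# 20 rows is an arbitrary size chosen for the illustration.
train_idx, test_idx = random_split(20)
folds = list(k_fold(train_idx, k=5))
for training, validation in folds:
    print(len(training), len(validation))
```

The test rows are set aside once and never appear in any fold, so they remain untouched until the final evaluation.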

## Choose your evaluation metric(s).

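I will use accuracy as my main evaluation metric, since the classes are nearly balanced. I will also look at precision for the edible class, because a false positive there (a poisonous mushroom predicted to be edible) could be dangerous.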
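
Which error counts as a false positive depends on which class is treated as positive. The sketch below computes precision and recall for a chosen positive label. The helper name, labels, and predictions are hypothetical and serve only to illustrate the definitions.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall) when `positive` is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: one poisonous mushroom is predicted to be edible.
y_true = ['e', 'e', 'e', 'e', 'p', 'p', 'p', 'p']
y_pred = ['e', 'e', 'e', 'e', 'e', 'p', 'p', 'p']

edible = precision_recall(y_true, y_pred, positive='e')
poisonous = precision_recall(y_true, y_pred, positive='p')
print("edible as positive:", edible)
print("poisonous as positive:", poisonous)
```

The same dangerous mistake lowers precision when edible is the positive class and lowers recall when poisonous is the positive class, so the positive label should be stated alongside the metric.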

## Begin to clean and explore your data

In [19]:
# Split data into train and test
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, random_state=42)

train.shape, test.shape

((6093, 23), (2031, 23))

In [20]:
train.describe().T

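| | count | unique | top | freq |
|---|---|---|---|---|
| class | 6093 | 2 | e | 3168 |
| cap_shape | 6093 | 6 | x | 2764 |
| cap_surface | 6093 | 4 | y | 2445 |
| cap_color | 6093 | 10 | n | 1711 |
| bruises | 6093 | 2 | f | 3555 |
| odor | 6093 | 9 | n | 2672 |
| gill_attachment | 6093 | 2 | f | 5937 |
| gill_spacing | 6093 | 2 | c | 5109 |
| gill_size | 6093 | 2 | b | 4229 |
| gill_color | 6093 | 12 | b | 1298 |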


In [21]:
train['stalk_root'].value_counts()

b    2844
?    1865
e     827
c     415
r     142
Name: stalk_root, dtype: int64

In [22]:
# Get Pandas Profiling Report
train.profile_report()



In [35]:
%matplotlib inline
import plotly.express as px

# px.histogram counts the rows in each category
px.histogram(train, x='bruises', color='class', barmode='group')

In [36]:
# px.histogram counts the rows in each category
px.histogram(train, x='cap_color', color='class', barmode='group')

## Begin to choose which features, if any, to exclude. Would some features "leak" future information?

I will be excluding 'veil_type', because the feature is constant. 
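
One way to confirm that a feature is constant is to count its distinct values. The sketch below is a hedged illustration: the helper name `constant_columns` and the toy frame are hypothetical, and only the pandas call `Series.nunique` is a real API.

```python
import pandas as pd

def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns
            if frame[col].nunique(dropna=False) == 1]

# Hypothetical toy frame, only to show the call.
toy = pd.DataFrame({
    'veil_type': ['p', 'p', 'p'],
    'odor': ['n', 'f', 'n'],
})
constant = constant_columns(toy)
print(constant)
```

A constant column carries no information for any split, so dropping it does not change what a tree-based model can learn.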

In [37]:
target = 'class'
# Drop the target and the constant 'veil_type' column from the feature list
features = train.columns.drop([target, 'veil_type'])
X_train = train[features]
y_train = train[target]

In [44]:
import category_encoders as ce
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),  # map each category to an integer code
    DecisionTreeClassifier(max_depth=3)  # shallow tree, at most 3 levels of splits
)

cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)

array([0.99507793, 0.95652174, 0.9778507 , 0.96223317, 0.95812808])
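
The five fold scores are easier to compare against a baseline once they are summarized. A minimal sketch, copying the scores from the output above and using only the standard library:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
scores = [0.99507793, 0.95652174, 0.9778507, 0.96223317, 0.95812808]

mean_score = mean(scores)
spread = stdev(scores)  # sample standard deviation across folds
print(f"accuracy: {mean_score:.3f} +/- {spread:.3f}")
```

The mean is roughly 0.97, well above the 0.518 that always predicting the majority class would achieve.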

In [40]:
X_train.head()

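| | cap_shape | cap_surface | cap_color | bruises | odor | gill_attachment | gill_spacing | gill_size | gill_color | stalk_shape | ... | stalk_surface_above_ring | stalk_surface_below_ring | stalk_color_above_ring | stalk_color_below_ring | veil_color | ring_number | ring_type | spore_print_color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3887 | x | s | w | f | c | f | c | n | n | e | ... | s | s | w | w | w | o | p | n | s | d |
| 4119 | f | f | y | f | f | f | c | b | h | e | ... | k | k | n | b | w | o | l | h | v | g |
| 1600 | x | y | g | t | n | f | c | b | n | t | ... | s | s | p | g | w | o | p | k | y | d |
| 4988 | x | y | y | f | f | f | c | b | p | e | ... | k | k | n | n | w | o | l | h | y | p |
| 6757 | f | y | n | f | f | f | c | n | b | t | ... | s | k | w | p | w | o | e | w | v | d |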


### Attribute Information: (classes: edible=e, poisonous=p)

**cap-shape**: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s

**cap-surface**: fibrous=f,grooves=g,scaly=y,smooth=s

**cap-color**: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

**bruises**: bruises=t,no=f

**odor**: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

**gill-attachment**: attached=a,descending=d,free=f,notched=n

**gill-spacing**: close=c,crowded=w,distant=d

**gill-size**: broad=b,narrow=n

**gill-color**: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

**stalk-shape**: enlarging=e,tapering=t

**stalk-root**: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

**stalk-surface-above-ring**: fibrous=f,scaly=y,silky=k,smooth=s

**stalk-surface-below-ring**: fibrous=f,scaly=y,silky=k,smooth=s

**stalk-color-above-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

**stalk-color-below-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

**veil-type**: partial=p,universal=u

**veil-color**: brown=n,orange=o,white=w,yellow=y

**ring-number**: none=n,one=o,two=t

**ring-type**: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

**spore-print-color**: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

**population**: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

**habitat**: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d