<a href="https://colab.research.google.com/github/AdrienF/ColabTests/blob/main/Mushrooms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case study: mushrooms!

## Context

Although this dataset was originally contributed to the UCI Machine Learning repository nearly 30 years ago, mushroom hunting (otherwise known as "shrooming") is enjoying new peaks in popularity. Learn which features spell certain death and which are most palatable in this dataset of mushroom characteristics. And how certain can your model be?

### Content

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

- Time period: donated to UCI ML on 27 April 1987

### Inspiration

- What types of machine learning models perform best on this dataset?
- Which features are most indicative of a poisonous mushroom?

### Acknowledgements

This dataset was originally donated to the UCI Machine Learning repository. You can learn more about past research using the data [here](https://archive.ics.uci.edu/ml/datasets/Mushroom). 

## Dataset description

Attribute Information: (classes: edible=e, poisonous=p)

- cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s
- cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
- cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
- bruises: bruises=t,no=f
- odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
- gill-attachment: attached=a,descending=d,free=f,notched=n
- gill-spacing: close=c,crowded=w,distant=d
- gill-size: broad=b,narrow=n
- gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
- stalk-shape: enlarging=e,tapering=t
- stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
- stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
- veil-type: partial=p,universal=u
- veil-color: brown=n,orange=o,white=w,yellow=y
- ring-number: none=n,one=o,two=t
- ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
- spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
- population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
- habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
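
The single-letter codes listed above are compact but hard to read in tables and plots. The sketch below shows one possible way to turn them back into labels with pandas. The `CODEBOOK` dictionary and the `decode` helper are illustrative names, not part of the dataset or the notebook; only two columns are covered, and the three-row sample is made up.

```python
import pandas as pd

# Code-to-label tables copied from the attribute list (only two columns shown).
CODEBOOK = {
    "class": {"e": "edible", "p": "poisonous"},
    "odor": {
        "a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
        "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
    },
}

def decode(frame, codebook=CODEBOOK):
    """Return a copy of `frame` with coded columns replaced by readable labels."""
    decoded = frame.copy()
    for column, mapping in codebook.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(mapping)
    return decoded

# Hypothetical three-row sample standing in for the real CSV.
sample = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})
decoded_sample = decode(sample)
print(decoded_sample)
```

Note that the same letter can mean different things in different columns (`p` is "poisonous" in `class` but "pungent" in `odor`), which is why the mapping is kept per column.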


## Goal of the notebook

In this notebook you will explore this dataset, work out which attributes matter, and identify which mushrooms are edible.
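
One simple way to gauge how much a single attribute matters is to predict, for each value of that attribute, the majority class among rows with that value, and measure the resulting accuracy. The sketch below is only an illustration of that idea: `single_attribute_accuracy` is a hypothetical helper, the six-row frame is made up, and the score is computed on the same rows it is derived from, so it is optimistic.

```python
import pandas as pd

def single_attribute_accuracy(frame, label="class"):
    """Score each attribute by the accuracy of a per-value majority-class rule."""
    scores = {}
    for column in frame.columns.drop(label):
        # Rows: attribute values; columns: classes; cells: row counts.
        counts = pd.crosstab(frame[column], frame[label])
        # Predicting the majority class for each value gets max(count) rows right.
        scores[column] = counts.max(axis=1).sum() / len(frame)
    return pd.Series(scores).sort_values(ascending=False)

# Hypothetical six-row stand-in for the real dataset.
toy = pd.DataFrame({
    "class": ["e", "e", "p", "p", "e", "p"],
    "odor": ["a", "n", "f", "f", "a", "f"],
    "cap-color": ["n", "w", "n", "w", "n", "w"],
})
ranking = single_attribute_accuracy(toy)
print(ranking)
```

On the real data, the function would be called on the loaded DataFrame, assuming its label column is named `class`.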

**Evaluation will be based on the accuracy of the `edible / poisonous` classification, and the baseline will be that of a linear classifier (to be implemented).**

Good luck!

## Let's go!

From here on, you are free to edit this notebook. Feel free to structure it, document it, illustrate it, ...

In [None]:
# Data management is preferably done with pandas
import pandas as pd
# Download dataset
!wget https://raw.githubusercontent.com/AdrienF/ColabTests/main/mushrooms.csv \
  -O mushrooms.csv
# Load dataset
df = pd.read_csv("mushrooms.csv")


--2021-01-05 08:11:32--  https://raw.githubusercontent.com/AdrienF/ColabTests/main/mushrooms.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 374003 (365K) [text/plain]
Saving to: ‘mushrooms.csv’


2021-01-05 08:11:32 (15.3 MB/s) - ‘mushrooms.csv’ saved [374003/374003]
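
Once the file is loaded, a quick first look might cover the table size, the class balance, and how often the `?` code for missing values appears. The sketch below is one way to do that; `summarize` is a hypothetical helper, and the four-row frame is a made-up stand-in so the example runs without the download. In the notebook, the loaded `df` would be passed instead.

```python
import pandas as pd

def summarize(frame, label="class", missing_code="?"):
    """Return the shape, the class balance, and per-column counts of `missing_code`."""
    balance = frame[label].value_counts(normalize=True)
    missing = (frame == missing_code).sum()
    # Keep only the columns where the missing code actually occurs.
    return frame.shape, balance, missing[missing > 0]

# Hypothetical four-row stand-in for the real dataset.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["a", "f", "n", "f"],
    "stalk-root": ["b", "?", "e", "?"],
})
shape, balance, missing = summarize(toy)
print(shape)
print(balance)
print(missing)
```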



In [None]:
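import numpy as np

# Shuffle rows
df = df.sample(frac=1).reset_index(drop=True)

def similarity(rowA, rowB):
  # Number of attribute values the two rows share; the first column ('class') is excluded
  return np.sum((rowA.values == rowB.values)[1:])

# 1-nearest-neighbour classifier, tested with various training set sizes
test_set = df.iloc[1::2]  # every odd-indexed row
for nbt in range(1, 26):
  train_set = df.iloc[0:2*nbt:2]  # the first nbt even-indexed rows
  correct = []
  for testr in range(test_set.shape[0]):
    test_row = test_set.iloc[testr]
    # Similarity of this test row to every training row
    scores = [similarity(train_set.iloc[r], test_row) for r in range(train_set.shape[0])]
    # Predict the class of the most similar training row
    nearest = train_set.iloc[np.argmax(scores)]
    correct.append(nearest['class'] == test_row['class'])
  print("test accuracy:", train_set.shape[0], np.mean(correct))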

test accuracy: 1 0.47882816346627277
test accuracy: 2 0.725258493353028
test accuracy: 3 0.7195962580009847
test accuracy: 4 0.7912358444116199
test accuracy: 5 0.7907434761201378
test accuracy: 6 0.793451501723289
test accuracy: 7 0.7727720334810438
test accuracy: 8 0.6779911373707533
test accuracy: 9 0.6779911373707533
test accuracy: 10 0.8340718857705564
test accuracy: 11 0.8333333333333334
test accuracy: 12 0.8498276710979813
test accuracy: 13 0.8522895125553914
test accuracy: 14 0.8530280649926145
test accuracy: 15 0.8714918759231906
test accuracy: 16 0.8810930576070901
test accuracy: 17 0.8828163466272771
test accuracy: 18 0.8828163466272771
test accuracy: 19 0.8838010832102413
test accuracy: 20 0.8830625307730182
test accuracy: 21 0.8830625307730182
test accuracy: 22 0.8830625307730182
test accuracy: 23 0.8862629246676514
test accuracy: 24 0.8862629246676514
test accuracy: 25 0.8865091088133924
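
For comparison with the nearest-neighbour accuracies above, the linear baseline named in the notebook goal could be sketched as follows. This is only one possible reading of "linear classifier": a least-squares fit on one-hot encoded attributes, written with NumPy and pandas alone. The `linear_baseline` helper, its parameters, and the eight-row frame are illustrative assumptions; the split simply takes the first rows for training, which assumes the rows have already been shuffled.

```python
import numpy as np
import pandas as pd

def linear_baseline(frame, label="class", positive="p", train_fraction=0.5):
    """Fit a least-squares linear classifier on one-hot attributes; return test accuracy."""
    # One-hot encode every attribute and append a constant column for the bias term.
    X = pd.get_dummies(frame.drop(columns=label)).to_numpy(dtype=float)
    X = np.hstack([X, np.ones((len(X), 1))])
    # Targets: +1 for the positive class, -1 for the other one.
    y = np.where(frame[label].to_numpy() == positive, 1.0, -1.0)
    n_train = int(len(frame) * train_fraction)
    # Least-squares weights, estimated on the training rows only.
    weights, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    # The sign of the linear score gives the predicted class.
    predictions = np.where(X[n_train:] @ weights >= 0, 1.0, -1.0)
    return np.mean(predictions == y[n_train:])

# Hypothetical eight-row stand-in; in the notebook, pass the shuffled `df` instead.
toy = pd.DataFrame({
    "class": ["e", "p", "e", "p", "e", "p", "e", "p"],
    "odor": ["a", "f", "n", "f", "a", "f", "n", "f"],
    "cap-color": ["n", "n", "w", "w", "n", "w", "w", "n"],
})
accuracy = linear_baseline(toy)
print("linear baseline test accuracy:", accuracy)
```

Encoding the whole frame before splitting keeps the one-hot columns identical for the training and test rows, at the cost of letting the encoder see which attribute values occur in the test rows.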
