In [45]:
import os
import pandas as pd
from matplotlib import pyplot as plt

# Show all columns of the DataFrame
pd.set_option('display.max_columns', None)

# Loading the Dataset

In [46]:
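# Paths to the raw data and to the file listing one column name per line
data_path = os.path.join("data", "mushrooms.csv")
header_path = os.path.join("data", "names.txt")

# Read the column names, stripping trailing newlines and whitespace
with open(header_path, 'r') as f:
    header = [line.strip() for line in f]

# The CSV has no header row, so load it raw and attach the names afterwards
data = pd.read_csv(data_path, header=None)
data.columns = header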

In [47]:
data.head()

Unnamed: 0,is-edible,cap-shape,cap-surface,cap-color,has-bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
1,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
2,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,PINK,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS
3,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,PINK,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,BROWN,SEVERAL,WOODS
4,EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,BROWN,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS


# Exploring the dataset

## What are the dimensions of the dataset?

In [48]:
data.shape

(8416, 23)

The dataset has 8416 rows and 23 columns.

## Which attributes does the dataset contain, and what are their types?

In [49]:
data.dtypes

is-edible                   object
cap-shape                   object
cap-surface                 object
cap-color                   object
has-bruises                 object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object

Every attribute in the dataset is a string, which pandas reports as dtype `object`. The attributes describe characteristics of the mushroom, for example whether it is edible or poisonous, its habitat, its odor, and the shape of its cap, gills, and stalk.
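Because every column holds a small set of repeated labels, it can help to check how many distinct labels each attribute has. The sketch below does this on a toy frame; the frame and its values are illustrative assumptions, not rows from the real data. It also shows pandas' `category` dtype as one optional way to store such labels, which is not something this notebook does.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (illustrative values only)
toy = pd.DataFrame({
    "cap-shape": ["CONVEX", "FLAT", "CONVEX", "BELL"],
    "habitat": ["WOODS", "WOODS", "GRASSES", "WOODS"],
})

# Number of distinct labels per attribute
distinct_labels = toy.nunique()

# Optional: store the labels as pandas categoricals instead of plain strings
toy_cat = toy.astype("category")

print(distinct_labels)
print(toy_cat.dtypes)
```

On the real data, the same calls would be `data.nunique()` and `data.astype("category")`.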

## Is there missing data in the dataset?

Missing values in this dataset are encoded as "?", and they occur only in attribute no. 11, "stalk-root".

In [50]:
data.eq("?").sum()

is-edible                      0
cap-shape                      0
cap-surface                    0
cap-color                      0
has-bruises                    0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64
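Since pandas does not treat "?" as missing, `isna()` and `dropna()` would not see these entries. One possible approach, sketched here on a toy frame with made-up values, is to recode the placeholder as `NaN` first. The names `toy`, `toy_nan`, and `missing_per_column` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the mushroom data (illustrative values only)
toy = pd.DataFrame({
    "stalk-root": ["BULBOUS", "?", "CLUB", "?"],
    "odor": ["ALMOND", "NONE", "FOUL", "NONE"],
})

# Recode the "?" placeholder as NaN so pandas' missing-data tools can see it
toy_nan = toy.replace("?", np.nan)

# Count missing entries per column
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)
```

The notebook keeps the "?" strings and counts them directly with `eq("?")`; both routes give the same per-column counts.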

There are two straightforward ways to handle these missing values: drop every row that contains one, or remove the "stalk-root" column entirely.

Because so many instances contain "?" (2480, roughly 29% of the 8416 rows), we chose to remove the "stalk-root" column.

In [51]:
data = data.drop('stalk-root', axis=1)
data.shape

(8416, 22)
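To make the trade-off between the two strategies concrete, the sketch below applies both to a small toy frame. The frame and the names `rows_dropped` and `column_dropped` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the mushroom data (illustrative values only)
toy = pd.DataFrame({
    "is-edible": ["EDIBLE", "POISONOUS", "EDIBLE", "POISONOUS"],
    "stalk-root": ["BULBOUS", "?", "CLUB", "?"],
    "odor": ["ALMOND", "FOUL", "NONE", "FOUL"],
})

# Option 1: keep every column, drop rows where any attribute is "?"
rows_dropped = toy[~toy.eq("?").any(axis=1)]

# Option 2: keep every row, drop the column that contains "?"
column_dropped = toy.drop(columns="stalk-root")

print(rows_dropped.shape)    # (2, 3)
print(column_dropped.shape)  # (4, 2)
```

Dropping rows loses instances but keeps every attribute; dropping the column keeps every instance but loses one attribute. In this toy frame, option 1 also happens to remove every POISONOUS row, which shows how row removal can distort the class distribution.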

## The target attribute
The target attribute of this dataset is called "is-edible". Its possible values and their frequencies are shown below.

In [52]:
edibility = data['is-edible']
edibility.value_counts()

EDIBLE       4488
POISONOUS    3928
Name: is-edible, dtype: int64
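Relative frequencies make the class balance easier to read than raw counts. The sketch below recomputes them from the counts printed above; the names `counts` and `proportions` are hypothetical.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"EDIBLE": 4488, "POISONOUS": 3928}, name="is-edible")

# Share of each class, rounded to three decimals
proportions = (counts / counts.sum()).round(3)
print(proportions)
```

This gives about 53% edible and 47% poisonous, so the two classes are fairly balanced. On the full DataFrame, `data['is-edible'].value_counts(normalize=True)` computes the same shares directly.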