# Preprocessing cell for the Pokémon Data Science Project

In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

plt.figure(figsize=(4, 3), dpi=60)
plt.rcParams['figure.figsize'] = [4, 3]
plt.rcParams['figure.dpi'] = 60
sns.set_theme(rc={'figure.figsize':(4,3)})

<Figure size 240x180 with 0 Axes>

### Data reading

In [33]:
data = pd.read_csv("../data/Pokemon.csv", index_col=None)

### About the data, from Kaggle:

The dataset was extracted using the file bulbapediascrapper.py. The main objective of this notebook is to perform an exploratory data analysis and to preprocess the data so that AI techniques can be applied to it.

### Initial tasks:

1. Initial exploration of contents of all datasets
2. Preprocessing
3. Visualizations
4. Comparison with the original UCI dataset

## 1. Initial exploration and cleaning of contents of all datasets

In [34]:
PLOT = True

def describe_feature(data, feat):
    print(f"- Type: {data.loc[:, feat].dtype}")
    print(f"- First rows:\n{data.loc[:, feat].head(5)}")
    print(f"- Last rows:\n{data.loc[:, feat].tail(5)}")
    print(f"- Number of missing values: {data.loc[:, feat].isna().sum()}")
    print(data.loc[:, feat].dtype)
    print(f"- Number of distinct values: {data.loc[:, feat].nunique()}")
    with pd.option_context('display.max_rows', None):
        print(f"- Unique value counts:\n{data.loc[:, feat].value_counts()}")
    if data.loc[:, feat].dtype in ['int64', 'float64']:
        print(f"- Min: {data.loc[:, feat].min()}")
        print(f"- Mean: {data.loc[:, feat].mean()}")
        print(f"- Median: {data.loc[:, feat].median()}")
        print(f"- Max: {data.loc[:, feat].max()}")
        print(f"- Std: {data.loc[:, feat].std()}")
    else:
        print(f"- Unique values: {data.loc[:, feat].unique()}")


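def remove_first_last_letter(series):
    # Values are presumably list-like strings such as "['Grass', 'Poison']".
    # Strip the surrounding brackets (keeping the original index)...
    series = pd.Series([sublist[1:-1] if len(sublist) >= 3 else sublist
                      for sublist in series], index=series.index)
    # ...then strip the outer quotes and split on the "', '" separator
    return series.str[1:-1].str.split('\', \'')

def get_first_element(series):
    def process_list_first(lst):
        return lst[0]
    
    return series.apply(process_list_first)

def get_second_element(series):
    def process_list_second(lst):
        # The string 'None' is the placeholder for a missing second element
        return lst[1] if len(lst) >= 2 else 'None'
    
    return series.apply(process_list_second)


def get_third_element(series):
    def process_list_third(lst):
        # The string 'None' is the placeholder for a missing third element
        return lst[2] if len(lst) >= 3 else 'None'
    
    return series.apply(process_list_third)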

def pieplot(dt, feat, title):
    if feat in dt.columns:
        _, ax = plt.subplots()
        # value_counts() already gives one count per distinct (non-missing) value
        counts = dt[feat].value_counts()
        ax.pie(counts.values, labels=counts.index.tolist(), autopct='%1.1f%%')
        # The title argument was previously accepted but never drawn
        ax.set_title(title)
        plt.show()


def violinplot(dt, feat, title, div=None):
    if feat in dt.columns:
        if div is not None:
            sns.violinplot(data = dt, x=div, y=feat, split=True)
        else:
            sns.violinplot(data = dt, y=feat)
        plt.title(title)
        plt.show()


Let's understand the content of the dataset, starting with its shape, column names and missing values.

In [35]:
print("Pokémon dataset")
print(f"Number of rows: {data.shape[0]}, number of columns: {data.shape[1]}")
print(f"Column names: {data.columns}")
print(f"Number of missing values: {data.isna().sum().sum()}")

Pokémon dataset
Number of rows: 1179, number of columns: 45
Column names: Index(['DexNumber', 'Name', 'Type', 'Abilities', 'Generation', 'Hp', 'Attack',
       'Defense', 'SpecialAttack', 'SpecialDefense', 'Speed', 'TotalStats',
       'Weight', 'Height', 'GenderProbM', 'Category', 'CatchRate', 'EggCycles',
       'EggGroup', 'LevelingRate', 'BaseFriendship', 'IsLegendary',
       'IsMythical', 'IsUltraBeast', 'HasMega', 'EvoStage', 'TotalEvoStages',
       'DamageFromNormal', 'DamageFromFighting', 'DamageFromFlying',
       'DamageFromPoison', 'DamageFromGround', 'DamageFromRock',
       'DamageFromBug', 'DamageFromGhost', 'DamageFromSteel', 'DamageFromFire',
       'DamageFromWater', 'DamageFromGrass', 'DamageFromElectric',
       'DamageFromPsychic', 'DamageFromIce', 'DamageFromDragon',
       'DamageFromDark', 'DamageFromFairy'],
      dtype='object')
Number of missing values: 0


### Attribute information:

1. **DexNumber**: Number of the Pokémon for the national dex
2. **Name**: Name of the Pokémon
3. **Type**: Pokémon's typing as a list
4. **Abilities**: Pokémon's abilities as a list
5. **Generation**: The generation where it was introduced
6. **Hp**: Hp base stat
7. **Attack**: Attack base stat
8. **Defense**: Defense base stat
9. **SpecialAttack**: Special attack base stat
10. **SpecialDefense**: Special defense base stat
11. **Speed**: Speed base stat
12. **TotalStats**: Total stats (sum of the previous six stats)
13. **Weight**: Weight in kg
14. **Height**: Height in m
15. **GenderProbM**: Probability of a Pokémon of that species being male (None if its gender is unknown)
16. **Category**: Category of that Pokémon (distinct Pokémon can share a category, and it may vary between evolutions)
17. **CatchRate**: Capture rate of that Pokémon
18. **EggCycles**: Number of egg cycles needed to hatch an egg of that Pokémon (the number of steps in each cycle varies among games)
19. **EggGroup**: Egg Group(s) of that Pokémon
20. **LevelingRate**: Class of the XP growth of that Pokémon
21. **BaseFriendship**: Base friendship of that Pokémon
22. **IsLegendary**: Denotes if it is a legendary Pokémon
23. **IsMythical**: Denotes if it is a mythical Pokémon
24. **IsUltraBeast**: Denotes if it is an Ultra Beast
25. **HasMega**: Has a Mega evolution
26. **EvoStage**: Evolution stage of that Pokémon
27. **TotalEvoStages**: Total evolution stages for that Pokémon
28. **DamageFrom(Type)**: Damage multiplier taken from attacks of a specific type (one column per attack type)
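
As a quick sanity check on the attribute list, the statement that TotalStats is the sum of the six base stats can be tested directly. The sketch below uses a tiny made-up frame (the names and numbers are illustrative, not rows from Pokemon.csv); on the real data the same comparison would presumably be run on `data`.

```python
import pandas as pd

# The six base-stat columns named in the attribute list
STAT_COLUMNS = ["Hp", "Attack", "Defense", "SpecialAttack", "SpecialDefense", "Speed"]

# Hypothetical toy rows; the second TotalStats value is deliberately wrong
toy = pd.DataFrame(
    {
        "Name": ["ToyMonA", "ToyMonB"],
        "Hp": [45, 60],
        "Attack": [49, 62],
        "Defense": [49, 63],
        "SpecialAttack": [65, 80],
        "SpecialDefense": [65, 80],
        "Speed": [45, 60],
        "TotalStats": [318, 400],
    }
)

# Recompute the total row by row and list the names where it disagrees
computed_total = toy[STAT_COLUMNS].sum(axis=1)
mismatch = toy.loc[computed_total != toy["TotalStats"], "Name"].tolist()
print(mismatch)
```

An empty list on the real dataset would confirm that TotalStats carries no information beyond the six individual stats.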

In [36]:
data.describe()

| | DexNumber | Hp | Attack | Defense | SpecialAttack | SpecialDefense | Speed | TotalStats | Weight | Height | ... | DamageFromSteel | DamageFromFire | DamageFromWater | DamageFromGrass | DamageFromElectric | DamageFromPsychic | DamageFromIce | DamageFromDragon | DamageFromDark | DamageFromFairy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | ... | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 | 1179.0 |
| mean | 514.177269 | 70.823579 | 79.006785 | 73.095844 | 72.067006 | 71.787956 | 69.270568 | 442.075488 | 66.536811 | 1.205344 | ... | 0.995547 | 1.148007 | 1.051739 | 0.993215 | 1.038804 | 0.986005 | 1.201654 | 0.964801 | 1.0581 | 1.090331 |
| std | 296.629667 | 26.485177 | 30.33304 | 28.848474 | 31.280072 | 27.198244 | 29.638923 | 122.72611 | 119.956411 | 1.216914 | ... | 0.514777 | 0.695027 | 0.595839 | 0.725956 | 0.632178 | 0.517945 | 0.736001 | 0.385199 | 0.454902 | 0.535065 |
| min | 1.0 | 1.0 | 5.0 | 5.0 | 10.0 | 20.0 | 5.0 | 180.0 | 0.1 | 0.1 | ... | 0.25 | 0.25 | 0.25 | 0.25 | 0.0 | 0.0 | 0.25 | 0.0 | 0.25 | 0.25 |
| 25% | 255.5 | 51.0 | 55.0 | 50.0 | 50.0 | 50.0 | 45.0 | 330.0 | 8.35 | 0.5 | ... | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 | 1.0 | 1.0 | 1.0 |
| 50% | 525.0 | 70.0 | 76.0 | 70.0 | 65.0 | 70.0 | 67.0 | 464.0 | 28.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 75% | 762.5 | 85.0 | 100.0 | 90.0 | 95.0 | 90.0 | 90.0 | 525.0 | 70.75 | 1.5 | ... | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 |
| max | 1025.0 | 255.0 | 181.0 | 230.0 | 180.0 | 230.0 | 200.0 | 1125.0 | 999.9 | 20.0 | ... | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 4.0 | 4.0 |


### Feature DexNumber

In [37]:
describe_feature(data, 'DexNumber')

- Type: int64
- First rows:
0    494
1      1
2      2
3      3
4      4
Name: DexNumber, dtype: int64
- Last rows:
1174    1022
1175    1023
1176    1024
1177    1024
1178    1025
Name: DexNumber, dtype: int64
- Number of missing values: 0
int64
- Number of distinct values: 1025
- Unique value counts:
479     6
670     5
671     5
649     5
669     5
351     4
585     4
555     4
586     4
931     4
1017    4
386     4
741     4
845     3
745     3
646     3
898     3
413     3
412     3
978     3
800     3
52      3
718     3
550     3
79      2
89      2
101     2
100     2
716     2
902     2
901     2
774     2
720     2
211     2
423     2
724     2
88      2
103     2
422     2
421     2
83      2
964     2
483     2
80      2
78      2
77      2
76      2
157     2
648     2
74      2
105     2
549     2
925     2
628     2
916     2
128     2
215     2
144     2
122     2
145     2
146     2
849     2
641     2
618     2
642     2
705     2
706     2
645     2
110     2
905   

Several rows share a DexNumber because a Pokémon can have more than one form. Let us count how many Pokémon have each number of forms.

In [38]:
print("Number of forms / Number of Pokémon with that amount of forms")
print(data['DexNumber'].value_counts().value_counts())

Number of forms / Number of Pokémon with that amount of forms
1    914
2     87
3     11
4      8
5      4
6      1
Name: DexNumber, dtype: int64
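
The double `value_counts()` used in this cell can be hard to read at first. The following minimal sketch, on a hypothetical series of dex numbers rather than the real column, shows what each step produces: the first call counts rows per dex number, the second counts how many dex numbers share each row count.

```python
import pandas as pd

# Hypothetical dex numbers: 1 and 4 appear once, 2 twice, 3 three times
dex = pd.Series([1, 2, 2, 3, 3, 3, 4], name="DexNumber")

# Step 1: dex number -> number of rows (forms) with that dex number
rows_per_dex = dex.value_counts()

# Step 2: number of forms -> number of dex numbers having that many forms
species_per_form_count = rows_per_dex.value_counts().sort_index()

print(species_per_form_count.to_dict())
```

Sorting by index, as done here, orders the result by number of forms instead of by frequency, which can make the table easier to scan.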


Let us see which Pokémon have more than one form (DexNumber + Name)

In [39]:
names = data.groupby(['DexNumber'])['Name'].unique()
for i, e in zip(names.index, names):
    if len(e) > 1:
        print(i, e)

19 ['Rattata' 'Alolan Rattata']
20 ['Raticate' 'Alolan Raticate']
26 ['Raichu' 'Alolan Raichu']
27 ['Sandshrew' 'Alolan Sandshrew']
28 ['Sandslash' 'Alolan Sandslash']
37 ['Vulpix' 'Alolan Vulpix']
38 ['Ninetales' 'Alolan Ninetales']
50 ['Diglett' 'Alolan Diglett']
51 ['Dugtrio' 'Alolan Dugtrio']
52 ['Meowth' 'Alolan Meowth' 'Galarian Meowth']
53 ['Persian' 'Alolan Persian']
58 ['Growlithe' 'Hisuian Growlithe']
59 ['Arcanine' 'Hisuian Arcanine']
74 ['Geodude' 'Alolan Geodude']
75 ['Graveler' 'Alolan Graveler']
76 ['Golem' 'Alolan Golem']
77 ['Ponyta' 'Galarian Ponyta']
78 ['Rapidash' 'Galarian Rapidash']
79 ['Slowpoke' 'Galarian Slowpoke']
80 ['Slowbro' 'Galarian Slowbro']
83 ["Farfetch'd" "Galarian Farfetch'd"]
88 ['Grimer' 'Alolan Grimer']
89 ['Muk' 'Alolan Muk']
100 ['Voltorb' 'Hisuian Voltorb']
101 ['Electrode' 'Hisuian Electrode']
103 ['Exeggutor' 'Alolan Exeggutor']
105 ['Marowak' 'Alolan Marowak']
110 ['Weezing' 'Galarian Weezing']
122 ['Mr. Mime' 'Galarian Mr. Mime']
128 ['Taur

### Feature Name

In [40]:
describe_feature(data, 'Name')

- Type: object
- First rows:
0       Victini
1     Bulbasaur
2       Ivysaur
3      Venusaur
4    Charmander
Name: Name, dtype: object
- Last rows:
1174               Iron Boulder
1175                 Iron Crown
1176      Terapagos Normal Form
1177    Terapagos Terastal Form
1178                  Pecharunt
Name: Name, dtype: object
- Number of missing values: 0
object
- Number of distinct values: 1179
- Unique value counts:
Victini                               1
Florges Orange Flower                 1
Aromatisse                            1
Spritzee                              1
Aegislash Blade Forme                 1
Aegislash Shield Forme                1
Doublade                              1
Honedge                               1
Meowstic Female                       1
Meowstic Male                         1
Espurr                                1
Furfrou Natural Form                  1
Pangoro                               1
Pancham                               1
Gogoat      

Every Pokémon and form has a different name (1179 distinct values for 1179 rows), so Name is more suitable as a primary key than DexNumber, which is shared between the forms of the same Pokémon.
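
A minimal sketch of how the primary-key observation could be checked programmatically. The toy frame reuses four names from the output above; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame: three forms share DexNumber 52, but every Name is distinct
toy = pd.DataFrame(
    {
        "DexNumber": [52, 52, 52, 53],
        "Name": ["Meowth", "Alolan Meowth", "Galarian Meowth", "Persian"],
    }
)

name_is_key = toy["Name"].is_unique        # True: usable as a primary key
dex_is_key = toy["DexNumber"].is_unique    # False: forms share the dex number

# verify_integrity=True raises a ValueError if the new index has duplicates
indexed = toy.set_index("Name", verify_integrity=True)
print(name_is_key, dex_is_key)
```

Using `verify_integrity=True` turns the assumption into a hard failure if a later version of the dataset ever contains two rows with the same name.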

### Feature Type

Before doing anything else, we are going to split this feature into two: Type1 and Type2. If the Pokémon has only one type, Type2 will hold the placeholder string "None", which represents the absence of a second type and avoids having NaNs.
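
To make the split concrete, here is a small sketch on two example strings. It assumes the raw column stores Python-style list literals such as `"['Grass', 'Poison']"` (inferred from the separator used by `remove_first_last_letter`, not confirmed by the source), and it uses `ast.literal_eval` as an alternative parser; `split_types` is a hypothetical helper, not the notebook's function.

```python
import ast


def split_types(raw):
    """Parse a list-like string and return (type1, type2).

    type2 is the placeholder string 'None' when there is a single type.
    """
    types = ast.literal_eval(raw)  # safely evaluates the list literal
    first = types[0]
    second = types[1] if len(types) > 1 else "None"
    return first, second


print(split_types("['Grass', 'Poison']"))
print(split_types("['Fire']"))
```

`ast.literal_eval` only accepts literals, so it does not execute arbitrary code, and it copes with names containing apostrophes, which a split on a fixed separator handles less robustly.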

In [41]:
data['Type'] = remove_first_last_letter(data['Type'])
data['Type1'] = get_first_element(data['Type'])
data['Type2'] = get_second_element(data['Type'])
describe_feature(data, 'Type1')
describe_feature(data, 'Type2')

- Type: object
- First rows:
0    Psychic
1      Grass
2      Grass
3      Grass
4       Fire
Name: Type1, dtype: object
- Last rows:
1174      Rock
1175     Steel
1176    Normal
1177    Normal
1178    Poison
Name: Type1, dtype: object
- Number of missing values: 0
object
- Number of distinct values: 18
- Unique value counts:
Water       145
Normal      135
Grass       111
Bug          91
Psychic      75
Fire         75
Electric     71
Rock         64
Dark         53
Poison       48
Ground       46
Fighting     45
Fairy        44
Dragon       43
Steel        41
Ice          40
Ghost        40
Flying       12
Name: Type1, dtype: int64
- Unique values: ['Psychic' 'Grass' 'Fire' 'Water' 'Bug' 'Normal' 'Dark' 'Poison'
 'Electric' 'Ground' 'Ice' 'Fairy' 'Steel' 'Fighting' 'Rock' 'Ghost'
 'Dragon' 'Flying']
- Type: object
- First rows:
0      Fire
1    Poison
2    Poison
3    Poison
4      None
Name: Type2, dtype: object
- Last rows:
1174    Psychic
1175    Psychic
1176       None
1177      

In [42]:
data = data.drop('Type', axis=1)

Types are stored in order (Type1, then Type2), and this must be kept in mind in any further analysis: in-game there is no difference between being Normal/Ghost and Ghost/Normal, so the pair of types behaves much more like a set than like a list.
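
One way to make later analyses insensitive to the order of the two types is to compare them as sets. This is only a sketch of the idea; `type_set` is a hypothetical helper, and it assumes the "None" placeholder string introduced above for single-typed Pokémon.

```python
def type_set(type1, type2):
    """Return the typing as an order-independent frozenset.

    The placeholder string 'None' (no second type) is left out.
    """
    return frozenset(t for t in (type1, type2) if t != "None")


# Both orderings collapse to the same key
print(type_set("Normal", "Ghost") == type_set("Ghost", "Normal"))
# A single-typed Pokémon yields a one-element set
print(type_set("Fire", "None"))
```

A `frozenset` is hashable, so it can be used as a group key or as a dictionary key when counting type combinations.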

### Feature Abilities

In [43]:
data['Abilities'] = remove_first_last_letter(data['Abilities'])
data['Ability1'] = get_first_element(data['Abilities'])
data['Ability2'] = get_second_element(data['Abilities'])
data['HiddenAbility'] = get_third_element(data['Abilities'])
data = data.drop('Abilities', axis=1)
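
The three positional helpers used for this split share the same logic, which can be sketched as a single function. `nth_or_placeholder` and the ability labels below are hypothetical and only illustrate the padding behaviour.

```python
def nth_or_placeholder(items, position, placeholder="None"):
    """Return items[position], or the placeholder when the list is too short."""
    return items[position] if len(items) > position else placeholder


# Hypothetical ability list with only two entries
abilities = ["AbilityA", "AbilityB"]

ability1 = nth_or_placeholder(abilities, 0)
ability2 = nth_or_placeholder(abilities, 1)
hidden = nth_or_placeholder(abilities, 2)
print(ability1, ability2, hidden)
```

Note that a purely positional split treats the third element as the hidden ability. Whether the scraped lists are always laid out that way is not shown in this notebook, so it may be worth checking a few known Pokémon by hand.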

### Feature Generation

In [None]:
describe_feature(data, 'Generation')

### Feature Hp

In [None]:
describe_feature(data, 'Hp')

### Feature Attack

In [None]:
describe_feature(data, 'Attack')

### Feature Defense

In [None]:
describe_feature(data, 'Defense')

### Feature SpecialAttack

In [None]:
describe_feature(data, 'SpecialAttack')

### Feature SpecialDefense

In [None]:
describe_feature(data, 'SpecialDefense')

### Feature Speed

In [None]:
describe_feature(data, 'Speed')

### Feature TotalStats

In [None]:
describe_feature(data, 'TotalStats')

### Feature Weight

In [None]:
describe_feature(data, 'Weight')

### Feature Height

In [None]:
describe_feature(data, 'Height')

### Feature GenderProbM

In [None]:
describe_feature(data, 'GenderProbM')

### Feature Category

In [None]:
describe_feature(data, 'Category')

### Feature CatchRate

In [None]:
describe_feature(data, 'CatchRate')

### Feature EggCycles

In [None]:
describe_feature(data, 'EggCycles')

This attribute looks like a boolean one, so it will be processed as such.

### Feature EggGroup

In [None]:
describe_feature(data, 'EggGroup')

### Feature LevelingRate

In [None]:
describe_feature(data, 'LevelingRate')

### Feature BaseFriendship

In [None]:
describe_feature(data, 'BaseFriendship')

### Feature IsLegendary

In [None]:
describe_feature(data, 'IsLegendary')

## 2. Definition of preprocessing function

We will now define the preprocessing function that will be used. It follows these principles:
1. There are numeric attributes as well as categorical ones. Each attribute's data type will be maintained.
2. Some attributes contain values such as 'is f' or 'are p', that is, a structure composed of two words where the second one is presumably the actual category. In these cases, we will retain only the last word.
3. Categorical features contain many bad values (numbers and longer strings), whereas valid categories are represented by a single letter. For this reason, every value that is not a single letter will be turned into NaN.
4. There are many underrepresented classes in the categorical features. Let us keep them for now.
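
Principles 2 and 3 can be illustrated on a small hypothetical series. This is a sketch of the cleaning idea only, using made-up values, and not the function defined in the next cells.

```python
import pandas as pd

# Hypothetical raw categorical values, including compound and invalid ones
raw = pd.Series(["x", "is f", "12.5", "brown", None, "are p"], dtype="string")

# Principle 2: keep only the last word of each value
last_word = raw.str.split().str[-1].astype("string")

# Principle 3: a valid category is exactly one alphabetic character
valid = (last_word.str.len().eq(1) & last_word.str.isalpha()).fillna(False).astype(bool)

# Everything else becomes a missing value
cleaned = last_word.where(valid)
print(cleaned.tolist())
```

Building the `valid` mask first and applying it with `where` keeps the original values untouched until the last step, which makes it easy to inspect which entries are about to be discarded.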

In [None]:
train_cpy = train.copy()
test_cpy = test.copy()

In [None]:
train = train_cpy.copy()
test = test_cpy.copy()

In [None]:
def preproc_attribute(dt, attr, datatype):
    assert attr in dt.columns
    assert datatype in ['int64', 'float64', 'string', 'boolean']

    # 1. Maintaining and considering the correct datatype
    if datatype == 'int64':
        # We will consider them as a string only for removing any incorrect values
        # (read from dt, not from the global train, so the function also works on test)
        dt[attr] = dt[attr].astype('string')

        # We remove all invalid values, we will mark them as None
        dt.loc[~dt[attr].str.isnumeric(), attr] = None

        # And after that, we will consider it as a nullable integer,
        # since plain int64 cannot hold the missing values introduced above
        dt[attr] = dt[attr].astype('Int64')


    elif datatype == 'float64':
        # We will consider them as a string only for removing any incorrect values
        dt[attr] = dt[attr].astype('string')

        # We remove all invalid values, we will mark them as None.
        # regex=False makes '.' a literal dot; n=1 removes a single decimal point only
        dt.loc[~dt[attr].str.replace('.', '', n=1, regex=False).str.isnumeric(), attr] = None

        # And after that, we will consider it as a float
        dt[attr] = dt[attr].astype('float64')


    elif datatype == 'string' or datatype == 'boolean':
        dt[attr] = dt[attr].astype('string')
        
        # 2. Modifying those compound categories: keep only the last word
        dt[attr] = dt[attr].str.split().str[-1]
        dt[attr] = dt[attr].astype('string')

        # 3. Turning to NAs all bad categories (anything that is not a single letter)
        dt.loc[(dt[attr].str.len() > 1) | ~(dt[attr].str.isalpha()), attr] = None


        # 4. Currently, we do nothing about underrepresented categories
        print(dt[attr])
        if datatype == 'boolean':

            # Map 'f'/'t' to '0'/'1' and discard everything else
            dt[attr] = dt[attr].replace(['f','t'], ['0','1'])

            dt.loc[~dt[attr].isin(['0','1']),attr] = None
            dt[attr] = dt[attr].replace('NaT', np.nan)

            dt[attr] = dt[attr].astype('Int64')


    # dt is modified in place and also returned for convenience
    return dt
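
As an alternative to validating numbers through string methods, pandas can coerce invalid entries directly. The sketch below is not the approach taken in `preproc_attribute`; it only shows, on made-up values, how `pd.to_numeric` with `errors="coerce"` behaves.

```python
import pandas as pd

# Hypothetical raw column mixing valid numbers and garbage
raw = pd.Series(["3.5", "abc", "7", None, "1.2.3"])

# Anything that cannot be parsed as a number becomes NaN
as_float = pd.to_numeric(raw, errors="coerce")

# For integer attributes, the nullable Int64 dtype keeps the missing values
as_int = pd.to_numeric(pd.Series(["4", "x", "10"]), errors="coerce").astype("Int64")

print(as_float.tolist())
print(as_int.tolist())
```

Coercion also accepts negative numbers and scientific notation, which `str.isnumeric()` rejects, so the two approaches are not strictly equivalent.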


In [None]:
preproc_attribute(train, 'class', 'string')

if PLOT:
    pieplot(train, 'class', 'Pie plot containing the class attribute for the train dataset')

In [None]:
preproc_attribute(train, 'cap-diameter', 'float64')
preproc_attribute(test, 'cap-diameter', 'float64')

if PLOT:
    violinplot(train, 'cap-diameter', 'Violinplot containing the cap-diameter attribute for the train dataset')
    violinplot(train, 'cap-diameter', 'Violinplot containing the cap-diameter attribute for the train dataset, divided by class', 'class')
    violinplot(test, 'cap-diameter', 'Violinplot containing the cap-diameter attribute for the test dataset')

In [None]:
preproc_attribute(train, 'cap-shape', 'string')
preproc_attribute(test, 'cap-shape', 'string')

if PLOT:
    pieplot(train, 'cap-shape', 'Pie plot containing the cap-shape attribute for the train dataset')
    pieplot(test, 'cap-shape', 'Pie plot containing the cap-shape attribute for the test dataset')

In [None]:
preproc_attribute(train, 'cap-surface', 'string')
preproc_attribute(test, 'cap-surface', 'string')

if PLOT:
    pieplot(train, 'cap-surface', 'Pie plot containing the cap-surface attribute for the train dataset')
    pieplot(test, 'cap-surface', 'Pie plot containing the cap-surface attribute for the test dataset')

In [None]:
preproc_attribute(train, 'cap-color', 'string')
preproc_attribute(test, 'cap-color', 'string')

if PLOT:
    pieplot(train, 'cap-color', 'Pie plot containing the cap-color attribute for the train dataset')
    pieplot(test, 'cap-color', 'Pie plot containing the cap-color attribute for the test dataset')

In [None]:
preproc_attribute(train, 'does-bruise-or-bleed', 'boolean')
preproc_attribute(test, 'does-bruise-or-bleed', 'boolean')

if PLOT:
    pieplot(train, 'does-bruise-or-bleed', 'Pie plot containing the does-bruise-or-bleed attribute for the train dataset')
    pieplot(test, 'does-bruise-or-bleed', 'Pie plot containing the does-bruise-or-bleed attribute for the test dataset')

In [None]:
preproc_attribute(train, 'gill-attachment', 'string')
preproc_attribute(test, 'gill-attachment', 'string')

if PLOT:
    pieplot(train, 'gill-attachment', 'Pie plot containing the gill-attachment attribute for the train dataset')
    pieplot(test, 'gill-attachment', 'Pie plot containing the gill-attachment attribute for the test dataset')

In [None]:
preproc_attribute(train, 'gill-spacing', 'string')
preproc_attribute(test, 'gill-spacing', 'string')

if PLOT:
    pieplot(train, 'gill-spacing', 'Pie plot containing the gill-spacing attribute for the train dataset')
    pieplot(test, 'gill-spacing', 'Pie plot containing the gill-spacing attribute for the test dataset')

In [None]:
preproc_attribute(train, 'gill-color', 'string')
preproc_attribute(test, 'gill-color', 'string')

if PLOT:
    pieplot(train, 'gill-color', 'Pie plot containing the gill-color attribute for the train dataset')
    pieplot(test, 'gill-color', 'Pie plot containing the gill-color attribute for the test dataset')

In [None]:
preproc_attribute(train, 'stem-height', 'float64')
preproc_attribute(test, 'stem-height', 'float64')

if PLOT:
    violinplot(train, 'stem-height', 'Violinplot containing the stem-height attribute for the train dataset')
    violinplot(train, 'stem-height', 'Violinplot containing the stem-height attribute for the train dataset, divided by class', 'class')
    violinplot(test, 'stem-height', 'Violinplot containing the stem-height attribute for the test dataset')

In [None]:
preproc_attribute(train, 'stem-width', 'float64')
preproc_attribute(test, 'stem-width', 'float64')

if PLOT:
    violinplot(train, 'stem-width', 'Violinplot containing the stem-width attribute for the train dataset')
    violinplot(train, 'stem-width', 'Violinplot containing the stem-width attribute for the train dataset, divided by class', 'class')
    violinplot(test, 'stem-width', 'Violinplot containing the stem-width attribute for the test dataset')

In [None]:
preproc_attribute(train, 'stem-root', 'string')
preproc_attribute(test, 'stem-root', 'string')

if PLOT:
    pieplot(train, 'stem-root', 'Pie plot containing the stem-root attribute for the train dataset')
    pieplot(test, 'stem-root', 'Pie plot containing the stem-root attribute for the test dataset')

In [None]:
preproc_attribute(train, 'stem-surface', 'string')
preproc_attribute(test, 'stem-surface', 'string')

if PLOT:
    pieplot(train, 'stem-surface', 'Pie plot containing the stem-surface attribute for the train dataset')
    pieplot(test, 'stem-surface', 'Pie plot containing the stem-surface attribute for the test dataset')

In [None]:
preproc_attribute(train, 'stem-color', 'string')
preproc_attribute(test, 'stem-color', 'string')

if PLOT:
    pieplot(train, 'stem-color', 'Pie plot containing the stem-color attribute for the train dataset')
    pieplot(test, 'stem-color', 'Pie plot containing the stem-color attribute for the test dataset')

In [None]:
preproc_attribute(train, 'veil-type', 'string')
preproc_attribute(test, 'veil-type', 'string')

if PLOT:
    pieplot(train, 'veil-type', 'Pie plot containing the veil-type attribute for the train dataset')
    pieplot(test, 'veil-type', 'Pie plot containing the veil-type attribute for the test dataset')

In [None]:
preproc_attribute(train, 'veil-color', 'string')
preproc_attribute(test, 'veil-color', 'string')

if PLOT:
    pieplot(train, 'veil-color', 'Pie plot containing the veil-color attribute for the train dataset')
    pieplot(test, 'veil-color', 'Pie plot containing the veil-color attribute for the test dataset')

In [None]:
preproc_attribute(train, 'has-ring', 'boolean')
preproc_attribute(test, 'has-ring', 'boolean')

if PLOT:
    pieplot(train, 'has-ring', 'Pie plot containing the has-ring attribute for the train dataset')
    pieplot(test, 'has-ring', 'Pie plot containing the has-ring attribute for the test dataset')

In [None]:
preproc_attribute(train, 'ring-type', 'string')
preproc_attribute(test, 'ring-type', 'string')

if PLOT:
    pieplot(train, 'ring-type', 'Pie plot containing the ring-type attribute for the train dataset')
    pieplot(test, 'ring-type', 'Pie plot containing the ring-type attribute for the test dataset')

In [None]:
preproc_attribute(train, 'spore-print-color', 'string')
preproc_attribute(test, 'spore-print-color', 'string')

if PLOT:
    pieplot(train, 'spore-print-color', 'Pie plot containing the spore-print-color attribute for the train dataset')
    pieplot(test, 'spore-print-color', 'Pie plot containing the spore-print-color attribute for the test dataset')

In [None]:
preproc_attribute(train, 'habitat', 'string')
preproc_attribute(test, 'habitat', 'string')

if PLOT:
    pieplot(train, 'habitat', 'Pie plot containing the habitat attribute for the train dataset')
    pieplot(test, 'habitat', 'Pie plot containing the habitat attribute for the test dataset')

In [None]:
preproc_attribute(train, 'season', 'string')
preproc_attribute(test, 'season', 'string')

if PLOT:
    pieplot(train, 'season', 'Pie plot containing the season attribute for the train dataset')
    pieplot(test, 'season', 'Pie plot containing the season attribute for the test dataset')

# To do:
- Build a color dictionary
- Check which values the other attributes take for individuals that have "strange" values (numerals in categorical attributes...)
- Visualization of missing values
- Correlations
- Treatment of missing values
- Keep only the UCI values / the most represented values

## 3. Further visualizations

Missing values visualizations: first, we show the positions of the missing values (in white) in the training set

In [None]:
msno.matrix(train)

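Then, we show how the missingness of each feature that contains missing values correlates with the missingness of the others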

In [None]:
plt.figure(figsize=(18,10))
sns.heatmap(train.loc[:, train.isnull().any()].isnull().corr(), annot=True, fmt='.2f')

The only high correlation between missingness patterns is the one between veil-type and veil-color
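
To clarify what this heatmap measures, here is a minimal sketch on a made-up frame (the column names `a` to `d` are hypothetical). Each column is turned into a boolean "is missing" indicator, and the correlation is computed between those indicators, not between the values themselves.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": ["x", None, "z", None],   # missing exactly where "a" is missing
        "c": [np.nan, 2.0, 3.0, 4.0],  # missing in a different row
        "d": [1, 2, 3, 4],             # complete, so it is dropped below
    }
)

# Keep only the columns that contain at least one missing value
with_missing = toy.loc[:, toy.isnull().any()]

# Correlate the missingness indicators (True where a value is missing)
missing_corr = with_missing.isnull().corr()
print(missing_corr.round(2))
```

A value close to 1 means two features tend to be missing in the same rows, which is the kind of relationship reported here for veil-type and veil-color.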

## 4. Comparison with the original UCI dataset