# The Problem Definition

This is a multilabel classification problem. The goal is to predict the presence of amphibian species near water reservoirs, based on features obtained from GIS systems and satellite images.
Source: https://archive.ics.uci.edu/dataset/528/amphibians

In [1]:
# importing
import pandas as pd

### Importing and Exploring the data

In [2]:
# path to the dataset (a forward slash avoids the invalid "\d" escape sequence)
path = "amphibians/dataset.csv"

In [7]:
df = pd.read_csv(path, sep=";")

In [8]:
df.head(7)

Unnamed: 0,Integer,Categorical,Numerical,Numerical.1,Categorical.1,Categorical.2,Categorical.3,Categorical.4,Categorical.5,Categorical.6,...,Ordinal.1,Categorical.8,Categorical.9,Label 1,Label 2,Label 3,Label 4,Label 5,Label 6,Label 7
0,ID,Motorway,SR,NR,TR,VR,SUR1,SUR2,SUR3,UR,...,BR,MR,CR,Green frogs,Brown frogs,Common toad,Fire-bellied toad,Tree frog,Common newt,Great crested newt
1,1,A1,600,1,1,4,6,2,10,0,...,0,0,1,0,0,0,0,0,0,0
2,2,A1,700,1,5,1,10,6,10,3,...,1,0,1,0,1,1,0,0,1,0
3,3,A1,200,1,5,1,10,6,10,3,...,1,0,1,0,1,1,0,0,1,0
4,4,A1,300,1,5,0,6,10,2,3,...,0,0,1,0,0,1,0,0,0,0
5,5,A1,600,2,1,4,10,2,6,0,...,5,0,1,0,1,1,1,0,1,1
6,6,A1,200,1,5,1,6,6,10,1,...,0,0,1,0,0,0,0,0,0,0


In [16]:
type_and_column = df.iloc[0:1].T
type_and_column

Unnamed: 0,0
Integer,ID
Categorical,Motorway
Numerical,SR
Numerical.1,NR
Categorical.1,TR
Categorical.2,VR
Categorical.3,SUR1
Categorical.4,SUR2
Categorical.5,SUR3
Categorical.6,UR


### Transforming the data

We can see that the column headers hold the column types, while the real column names sit in the first data row. Let's fix that.

In [19]:
df_tranf_heads = pd.DataFrame(df.values[1:], columns=df.iloc[0])
df_tranf_heads

Unnamed: 0,ID,Motorway,SR,NR,TR,VR,SUR1,SUR2,SUR3,UR,...,BR,MR,CR,Green frogs,Brown frogs,Common toad,Fire-bellied toad,Tree frog,Common newt,Great crested newt
0,1,A1,600,1,1,4,6,2,10,0,...,0,0,1,0,0,0,0,0,0,0
1,2,A1,700,1,5,1,10,6,10,3,...,1,0,1,0,1,1,0,0,1,0
2,3,A1,200,1,5,1,10,6,10,3,...,1,0,1,0,1,1,0,0,1,0
3,4,A1,300,1,5,0,6,10,2,3,...,0,0,1,0,0,1,0,0,0,0
4,5,A1,600,2,1,4,10,2,6,0,...,5,0,1,0,1,1,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,185,S52,2300,1,12,3,2,2,1,0,...,1,0,1,0,1,0,0,0,0,0
185,186,S52,300,1,14,2,7,10,2,0,...,5,0,1,1,1,1,1,0,1,0
186,187,S52,500,1,1,4,1,10,2,0,...,5,0,1,1,1,1,1,0,1,0
187,188,S52,300,1,12,3,2,1,6,0,...,0,0,1,0,1,1,0,0,0,0
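An alternative worth considering is to let `read_csv` skip the type row while loading. The sketch below assumes the file keeps the types on its first line and the real column names on its second. The inline sample (a hypothetical stand-in with only four columns) is used in place of the real file so the example runs on its own.

```python
import io

import pandas as pd

# Inline stand-in for the real file: line 1 holds the column types,
# line 2 holds the actual column names (illustrative subset of columns).
sample = io.StringIO(
    "Integer;Categorical;Numerical;Numerical\n"
    "ID;Motorway;SR;NR\n"
    "1;A1;600;1\n"
    "2;A1;700;1\n"
)

# header=1 uses the second line as the header and discards the line above it.
df_alt = pd.read_csv(sample, sep=";", header=1)

print(df_alt.columns.tolist())
print(df_alt.dtypes)
```

Because the type row never enters the data, pandas can infer integer dtypes for the numeric columns right away, which would make part of the later casting unnecessary.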


In [20]:
df_tranf_heads.shape

(189, 23)

In [22]:
df_tranf_heads.columns

Index(['ID', 'Motorway', 'SR', 'NR', 'TR', 'VR', 'SUR1', 'SUR2', 'SUR3', 'UR',
       'FR', 'OR', 'RR', 'BR', 'MR', 'CR', 'Green frogs', 'Brown frogs',
       'Common toad', 'Fire-bellied toad', 'Tree frog', 'Common newt',
       'Great crested newt'],
      dtype='object', name=0)

In [23]:
df_tranf_heads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   ID                  189 non-null    object
 1   Motorway            189 non-null    object
 2   SR                  189 non-null    object
 3   NR                  189 non-null    object
 4   TR                  189 non-null    object
 5   VR                  189 non-null    object
 6   SUR1                189 non-null    object
 7   SUR2                189 non-null    object
 8   SUR3                189 non-null    object
 9   UR                  189 non-null    object
 10  FR                  189 non-null    object
 11  OR                  189 non-null    object
 12  RR                  189 non-null    object
 13  BR                  189 non-null    object
 14  MR                  189 non-null    object
 15  CR                  189 non-null    object
 16  Green frogs         189 no

In [45]:
df_tranf_heads.isnull().any()

0
ID                    False
Motorway              False
SR                    False
NR                    False
TR                    False
VR                    False
SUR1                  False
SUR2                  False
SUR3                  False
UR                    False
FR                    False
OR                    False
RR                    False
BR                    False
MR                    False
CR                    False
Green frogs           False
Brown frogs           False
Common toad           False
Fire-bellied toad     False
Tree frog             False
Common newt           False
Great crested newt    False
dtype: bool

We can see that every column has dtype object. That is not correct, as both the dataset documentation and the type row in the original header show.
Let's analyse the columns and convert them.

In [24]:
df_tranf_heads.SR

0       600
1       700
2       200
3       300
4       600
       ... 
184    2300
185     300
186     500
187     300
188     300
Name: SR, Length: 189, dtype: object

In [25]:
df_tranf_heads.SR.unique()

array(['600', '700', '200', '300', '500', '750', '7000', '1700', '8000',
       '30000', '1600', '3800', '2500', '800', '4500', '1000', '3300',
       '2100', '400', '1100', '100', '80000', '31000', '25000', '40000',
       '1900', '30', '4300', '4000', '1500', '28300', '50', '9000',
       '19300', '3500', '9100', '1300', '2000', '10050', '16000', '5000',
       '10000', '29000', '8250', '250', '500000', '50000', '450', '8300',
       '1800', '150', '900', '3000', '350', '6300', '3400', '2400',
       '115000', '360000', '4100', '2300', '15000', '2600', '26000',
       '1400', '22000'], dtype=object)

In [26]:
df_tranf_heads.NR.unique()

array(['1', '2', '3', '6', '5', '7', '4', '9', '10', '12'], dtype=object)

In [28]:
df_tranf_heads.OR.unique()

array(['50', '75', '25', '99', '100', '80'], dtype=object)
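Instead of inspecting `unique()` column by column, we could check programmatically which columns hold only digit strings. This is a minimal sketch on a toy frame with illustrative values, not the notebook's actual data; `numeric_candidates` is a name introduced here for the example.

```python
import pandas as pd

# Toy frame mimicking the all-object columns above (values are illustrative).
toy = pd.DataFrame(
    {
        "SR": ["600", "700", "200"],
        "Motorway": ["A1", "A1", "S52"],
        "OR": ["50", "75", "100"],
    },
    dtype="object",
)

# A column is a candidate for an integer cast if every entry is a digit string.
digit_only = toy.apply(lambda col: col.str.isdigit().all())
numeric_candidates = digit_only[digit_only].index.tolist()

print(numeric_candidates)
```

Note that `str.isdigit` rejects negative numbers and decimals, so this check only suits columns of non-negative integers like the ones shown here.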

### Transforming the data types to numerical ones

In [34]:
df_tranf_heads.SR = df_tranf_heads.SR.astype("int32")

In [35]:
df_tranf_heads.NR = df_tranf_heads.NR.astype("int8")

In [36]:
df_tranf_heads.OR = df_tranf_heads.OR.astype("int16")

In [37]:
df_tranf_heads.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   ID                  189 non-null    object
 1   Motorway            189 non-null    object
 2   SR                  189 non-null    int32 
 3   NR                  189 non-null    int8  
 4   TR                  189 non-null    object
 5   VR                  189 non-null    object
 6   SUR1                189 non-null    object
 7   SUR2                189 non-null    object
 8   SUR3                189 non-null    object
 9   UR                  189 non-null    object
 10  FR                  189 non-null    object
 11  OR                  189 non-null    int16 
 12  RR                  189 non-null    object
 13  BR                  189 non-null    object
 14  MR                  189 non-null    object
 15  CR                  189 non-null    object
 16  Green frogs         189 no
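The three casts above can also be written as a single `astype` call with a column-to-dtype mapping. The sketch below uses a toy frame with illustrative values and the same integer types chosen above.

```python
import pandas as pd

# Toy frame with string-encoded numbers, as in the notebook's columns.
toy = pd.DataFrame(
    {"SR": ["600", "500000"], "NR": ["1", "12"], "OR": ["50", "100"]}
)

# One call converts several columns, each to its own integer type.
toy = toy.astype({"SR": "int32", "NR": "int8", "OR": "int16"})

print(toy.dtypes)
```

Keeping the mapping in one place makes it easier to see which width was chosen for each column. `int32` is needed for SR because values such as 500000 do not fit in `int16`.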

### Analysing the Categorical types

Even though these columns contain numbers, they hold categorical information. We have to encode them as numerical indicator columns before we can use them for machine learning.

In [46]:
categorical_col_types = type_and_column.T.filter(like="Categorical")
categorical_col_types

Unnamed: 0,Categorical,Categorical.1,Categorical.2,Categorical.3,Categorical.4,Categorical.5,Categorical.6,Categorical.7,Categorical.8,Categorical.9
0,Motorway,TR,VR,SUR1,SUR2,SUR3,UR,FR,MR,CR


In [47]:
df_tranf_heads1 = pd.get_dummies(df_tranf_heads, columns=categorical_col_types.iloc[0].tolist())

In [48]:
df_tranf_heads1

Unnamed: 0,ID,SR,NR,OR,RR,BR,Green frogs,Brown frogs,Common toad,Fire-bellied toad,...,FR_0,FR_1,FR_2,FR_3,FR_4,MR_0,MR_1,MR_2,CR_1,CR_2
0,1,600,1,50,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,1,0
1,2,700,1,75,1,1,0,1,1,0,...,0,1,0,0,0,1,0,0,1,0
2,3,200,1,75,1,1,0,1,1,0,...,0,0,0,0,1,1,0,0,1,0
3,4,300,1,25,0,0,0,0,1,0,...,0,0,0,0,1,1,0,0,1,0
4,5,600,2,99,0,5,0,1,1,1,...,1,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184,185,2300,1,75,2,1,0,1,0,0,...,1,0,0,0,0,1,0,0,1,0
185,186,300,1,100,5,5,1,1,1,1,...,1,0,0,0,0,1,0,0,1,0
186,187,500,1,100,5,5,1,1,1,1,...,1,0,0,0,0,1,0,0,1,0
187,188,300,1,100,1,0,0,1,1,0,...,1,0,0,0,0,1,0,0,1,0
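To make the effect of `get_dummies` easier to follow, here is a minimal sketch on a toy frame with illustrative values. Passing `dtype=int` is an addition not used in the cell above: newer pandas releases return boolean indicator columns by default, so the explicit dtype keeps the 0/1 output.

```python
import pandas as pd

# Toy frame: one identifier column and two categorical columns.
toy = pd.DataFrame(
    {"ID": [1, 2, 3], "Motorway": ["A1", "A1", "S52"], "CR": ["1", "2", "1"]}
)

# Each listed column is replaced by one indicator column per distinct value,
# named <column>_<value>. Columns not listed are left unchanged.
encoded = pd.get_dummies(toy, columns=["Motorway", "CR"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each categorical column with k distinct values becomes k indicator columns, which is why the column count grows after encoding.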


In [49]:
# Let's check the information about the dataset after these transformations
df_tranf_heads1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 64 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   ID                  189 non-null    object
 1   SR                  189 non-null    int32 
 2   NR                  189 non-null    int8  
 3   OR                  189 non-null    int16 
 4   RR                  189 non-null    object
 5   BR                  189 non-null    object
 6   Green frogs         189 non-null    object
 7   Brown frogs         189 non-null    object
 8   Common toad         189 non-null    object
 9   Fire-bellied toad   189 non-null    object
 10  Tree frog           189 non-null    object
 11  Common newt         189 non-null    object
 12  Great crested newt  189 non-null    object
 13  Motorway_A1         189 non-null    uint8 
 14  Motorway_S52        189 non-null    uint8 
 15  TR_1                189 non-null    uint8 
 16  TR_11               189 no

We still have some object-typed columns. Let's look at them.

In [53]:
df_tranf_heads1.select_dtypes(include = "object")

Unnamed: 0,ID,RR,BR,Green frogs,Brown frogs,Common toad,Fire-bellied toad,Tree frog,Common newt,Great crested newt
0,1,0,0,0,0,0,0,0,0,0
1,2,1,1,0,1,1,0,0,1,0
2,3,1,1,0,1,1,0,0,1,0
3,4,0,0,0,0,1,0,0,0,0
4,5,0,5,0,1,1,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...
184,185,2,1,0,1,0,0,0,0,0
185,186,5,5,1,1,1,1,0,1,0
186,187,5,5,1,1,1,1,0,1,0
187,188,1,0,0,1,1,0,0,0,0
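One possible way to handle these leftover columns is to convert every remaining object column with `pd.to_numeric`. The sketch below is an assumption about a reasonable next step, shown on a toy frame with illustrative values rather than on the notebook's data.

```python
import pandas as pd

# Toy frame: three string-encoded columns stored as object.
toy = pd.DataFrame(
    {"ID": ["1", "2"], "RR": ["0", "5"], "Green frogs": ["0", "1"]},
    dtype="object",
)
toy["SR"] = [600, 700]  # already numeric, so it is left untouched

# Select whatever is still object-typed and parse it as numbers.
object_cols = toy.select_dtypes(include="object").columns
toy[object_cols] = toy[object_cols].apply(pd.to_numeric)

print(toy.dtypes)
```

`pd.to_numeric` raises an error if a value cannot be parsed, which is useful here: a failure would signal that a column is not purely numeric after all.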


The columns originally headed Label 1 to Label 7 are the targets we want to predict.

In [56]:
classification_target = type_and_column.T.filter(like="Label")
classification_target

Unnamed: 0,Label 1,Label 2,Label 3,Label 4,Label 5,Label 6,Label 7
0,Green frogs,Brown frogs,Common toad,Fire-bellied toad,Tree frog,Common newt,Great crested newt


In [68]:
classification_target.iloc[0].to_list()

['Green frogs',
 'Brown frogs',
 'Common toad',
 'Fire-bellied toad',
 'Tree frog',
 'Common newt',
 'Great crested newt']
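With the list of target names in hand, the frame can be split into a feature matrix and a multilabel target matrix. The sketch below uses a toy frame with only two of the seven labels. Dropping `ID` from the features is an assumption made here, on the grounds that it is only a row identifier.

```python
import pandas as pd

# Toy frame: identifier, one feature, and two of the seven label columns.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3],
        "SR": [600, 700, 200],
        "Green frogs": [0, 0, 1],
        "Brown frogs": [0, 1, 1],
    }
)
target_cols = ["Green frogs", "Brown frogs"]

# Features: everything except the labels and the identifier (assumption).
X = toy.drop(columns=target_cols + ["ID"])
# Targets: one binary column per species.
Y = toy[target_cols].astype("int8")

# Multilabel means a row may have several positive labels, or none at all.
labels_per_row = Y.sum(axis=1)
print(labels_per_row.tolist())
```

Here the three rows carry zero, one, and two positive labels respectively, which is what distinguishes a multilabel target from a single multiclass column.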