# Dataset Preparation

In this section we will:

- Select the datasets we will work with

- Clean the data

- Reduce the data to a 2D space (if necessary)

- Reduce the number of labels in each dataset to 2 (if necessary)
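
As an illustration of the last step, one possible way to reduce a multi-label dataset to 2 labels is a one-vs-rest relabelling. This is only a sketch: the helper name toBinaryLabels, its parameters, and the toy dataframe are hypothetical and not necessarily the approach used later in this notebook.

```python
import pandas as pd


def toBinaryLabels(df, positive, column='target'):
    # Hypothetical helper: 1 for the chosen label, 0 for every other label
    out = df.copy()
    out[column] = (out[column] == positive).astype(int)
    return out


# Hypothetical toy dataframe with three distinct labels
df = pd.DataFrame({'f': [0.1, 0.2, 0.3, 0.4], 'target': [0, 1, 2, 1]})

binary = toBinaryLabels(df, positive=1)
print(binary['target'].tolist())  # [0, 1, 0, 1]
```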

## Creating the Cleaning, Reduction, and Plotting Functions

Here we create the functions used to clean the data, reduce it to 2 dimensions, and plot it according to its labels:

### Cleaning

First, we create a function that cleans the data. It expects a scikit-learn dataset object exposing data, feature_names, and target, and the cleaning consists only of:

- Removing duplicate rows
- Removing rows with null/None/NaN values

NOTE: as we will see later, some of our datasets need more cleaning work; this is only a basic cleaning function.

In [58]:
import pandas as pd

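def cleanData(data):
    # Build a dataframe from a scikit-learn dataset object,
    # storing the labels in a final 'target' column
    dataframe = pd.DataFrame(data=data.data, columns=data.feature_names)
    dataframe['target'] = data.target

    # Drop rows with missing values, then drop exact duplicate rows
    dataframe = dataframe.dropna()
    dataframe = dataframe.drop_duplicates()

    return dataframe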

We will use this function later on.
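
To make the behavior of the cleaning function concrete, here is a small sketch that applies it to a toy object. The toy data and the use of SimpleNamespace as a stand-in for a scikit-learn dataset object are assumptions made for illustration only.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def cleanData(data):
    # Same logic as the function above, repeated so the example runs on its own
    dataframe = pd.DataFrame(data=data.data, columns=data.feature_names)
    dataframe['target'] = data.target
    dataframe = dataframe.dropna()
    dataframe = dataframe.drop_duplicates()
    return dataframe


# Hypothetical toy dataset: row 1 contains a NaN, row 3 duplicates row 0
toy = SimpleNamespace(
    data=np.array([[1.0, 2.0], [np.nan, 5.0], [3.0, 4.0], [1.0, 2.0]]),
    feature_names=['a', 'b'],
    target=np.array([0, 1, 1, 0]),
)

cleaned = cleanData(toy)
print(cleaned.index.tolist())  # [0, 2]
```

Note that the surviving rows keep their original index labels, so the index of a cleaned dataframe can have gaps.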

### Dimensionality Reduction

Next, we create a function that uses PCA to reduce the data to 2D:

In [27]:
from sklearn.decomposition import PCA

def reduce2DPCA(df):
    # Assumes the last column holds the labels (named 'target' by cleanData)
    features = df.iloc[:, :-1]
    labels = df.iloc[:, -1]
    
    pca = PCA(n_components=2)
    reduced_features = pca.fit_transform(features)
    
    reduced_df = pd.DataFrame(reduced_features, columns=['x', 'y'])
    reduced_df['label'] = labels.reset_index(drop=True)
    
    return reduced_df

We will use this function later on.
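
Two details of the reduction function are worth illustrating: how much variance the 2D projection keeps, and why the labels have their index reset before being attached. The sketch below uses hypothetical random data with a gap in the index, similar to what the cleaning step can leave behind; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical cleaned dataframe: 4 features plus a label column, with a gap
# in the index (row 1 removed) like the ones the cleaning step can leave
df = pd.DataFrame(rng.normal(size=(6, 4)), columns=['f1', 'f2', 'f3', 'f4'])
df['target'] = [0, 1, 0, 1, 0, 1]
df = df.drop(index=1)

pca = PCA(n_components=2)
reduced = pd.DataFrame(pca.fit_transform(df.iloc[:, :-1]), columns=['x', 'y'])

# Fraction of the original variance kept by the two components
kept = pca.explained_variance_ratio_.sum()

# Assigning labels without resetting the index aligns on index values,
# so a position whose index label is missing from df ends up as NaN
misaligned = reduced.assign(label=df['target'])
aligned = reduced.assign(label=df['target'].reset_index(drop=True))

print(round(kept, 3))
print(misaligned['label'].isna().sum())  # 1
print(aligned['label'].isna().sum())     # 0
```

A low value of kept suggests that the 2D plot may hide structure present in the original feature space, so apparent overlap between labels in the plot should be read with some caution.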

### Plotting

Finally, we create a function that plots the data with a different color for each label, so that we can identify which labels are linearly separable from the others:

In [28]:
import matplotlib.pyplot as plt

def plotData(df, title):

    if not {'x', 'y', 'label'}.issubset(df.columns):
        print("DataFrame must contain 'x', 'y', and 'label' columns.")
        return
    
    plt.figure(figsize=(10, 6))
    
    uniqueLabels = df['label'].unique()
    for label in uniqueLabels:
        subset = df[df['label'] == label]
        plt.scatter(subset['x'], subset['y'], label=label)

    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.title(title)
    plt.show()

Next, we will use this last function to plot our data and visually check which labels are linearly separable.
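
As a complement to visual inspection, one possible numeric check is to fit a linear classifier and look at its training accuracy. The sketch below is an assumption added for illustration, not part of the original workflow: it uses scikit-learn's Perceptron on hypothetical, well-separated 2D data.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(0)

# Hypothetical 2D data: two tight clusters far apart, so a separating line exists
class0 = rng.normal(loc=(-5.0, -5.0), scale=0.5, size=(30, 2))
class1 = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(30, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 30 + [1] * 30)

# tol=None makes the perceptron run all epochs instead of stopping early
clf = Perceptron(max_iter=1000, tol=None, random_state=0)
clf.fit(X, y)

# A training accuracy of 1.0 means this linear boundary separates the sample;
# a lower score does not prove the data is inseparable, only that no
# separating line was found within max_iter epochs
trainAccuracy = clf.score(X, y)
print(trainAccuracy)
```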

### Combining These Functions

Combining the functions that:

- Clean the data
- Reduce the data to 2D
- Plot the data

we get the following function:

In [29]:
def cleanReducePlotData(data, title):
    
    plotData(reduce2DPCA(cleanData(data)), title)

## Dataset Selection and Cleaning

Here we select and import the datasets used in this work.

We use data from different sources so that we can apply our model to a variety of situations and see how it performs on different challenges.

The first datasets are available through scikit-learn:

- Iris Plants

    For more information, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris

In [79]:
from sklearn.datasets import load_iris

irisData = load_iris()

irisDataframe = cleanData(irisData)

irisDataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 149 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  149 non-null    float64
 1   sepal width (cm)   149 non-null    float64
 2   petal length (cm)  149 non-null    float64
 3   petal width (cm)   149 non-null    float64
 4   target             149 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 7.0 KB
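
The row count reported above is lower than the 150 samples in the raw Iris data. A minimal sketch for checking what the cleaning step removed, by counting missing values and duplicate rows before cleaning (the variable names are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame['target'] = iris.target

# Number of missing cells and number of rows that repeat an earlier row
nMissing = int(frame.isna().sum().sum())
nDuplicates = int(frame.duplicated().sum())

print(len(frame), nMissing, nDuplicates)
```

The difference between the raw and cleaned row counts should equal the number of rows dropped for missing values plus the number of duplicates found here.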


- Forest Covertypes

    For more information, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_covtype.html#sklearn.datasets.fetch_covtype

In [80]:
from sklearn.datasets import fetch_covtype

forestData = fetch_covtype()

forestDataframe = cleanData(forestData)

forestDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Elevation                           581012 non-null  float64
 1   Aspect                              581012 non-null  float64
 2   Slope                               581012 non-null  float64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  float64
 4   Vertical_Distance_To_Hydrology      581012 non-null  float64
 5   Horizontal_Distance_To_Roadways     581012 non-null  float64
 6   Hillshade_9am                       581012 non-null  float64
 7   Hillshade_Noon                      581012 non-null  float64
 8   Hillshade_3pm                       581012 non-null  float64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  float64
 10  Wilderness_Area_0                   581012 non-null  float64
 11  Wilderness_Area_1         

- Wine Recognition

    For more information, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine

In [81]:
from sklearn.datasets import load_wine

wineData = load_wine()

wineDataframe = cleanData(wineData)

wineDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  targe

- Optical Recognition of Handwritten Digits

    For more information, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits

In [82]:
from sklearn.datasets import load_digits

digiData = load_digits()

digiDataframe = cleanData(digiData)

digiDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1797 entries, 0 to 1796
Data columns (total 65 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pixel_0_0  1797 non-null   float64
 1   pixel_0_1  1797 non-null   float64
 2   pixel_0_2  1797 non-null   float64
 3   pixel_0_3  1797 non-null   float64
 4   pixel_0_4  1797 non-null   float64
 5   pixel_0_5  1797 non-null   float64
 6   pixel_0_6  1797 non-null   float64
 7   pixel_0_7  1797 non-null   float64
 8   pixel_1_0  1797 non-null   float64
 9   pixel_1_1  1797 non-null   float64
 10  pixel_1_2  1797 non-null   float64
 11  pixel_1_3  1797 non-null   float64
 12  pixel_1_4  1797 non-null   float64
 13  pixel_1_5  1797 non-null   float64
 14  pixel_1_6  1797 non-null   float64
 15  pixel_1_7  1797 non-null   float64
 16  pixel_2_0  1797 non-null   float64
 17  pixel_2_1  1797 non-null   float64
 18  pixel_2_2  1797 non-null   float64
 19  pixel_2_3  1797 non-null   float64
 20  pixel_2_

- Diabetes

    For more information, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes

In [83]:
from sklearn.datasets import load_diabetes

diabData = load_diabetes()

diabDataframe = cleanData(diabData)

diabDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB


- Breast Cancer Wisconsin (Diagnostic)

    For more information, see:

    https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [84]:
from sklearn.datasets import load_breast_cancer

cancerData = load_breast_cancer()

cancerDataframe = cleanData(cancerData)

cancerDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

To vary our data sources, the next 2 datasets come from Kaggle:

- Most Streamed Spotify Songs 2023

    For more information, see:

    https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023

In [103]:
spotifyDataframe = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

spotifyDataframe = spotifyDataframe.dropna()
spotifyDataframe = spotifyDataframe.drop_duplicates()

In [104]:
# Transform object columns that are 
# int-like (streams, in_deezer_playlists, 
# in_shazam_charts) into int columns

import numpy as np

# Function to attempt to convert to int, returning NaN if it fails
def tryConvertInt(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

spotifyDataframe['streams'] = spotifyDataframe['streams'].apply(tryConvertInt)
spotifyDataframe['in_deezer_playlists'] = spotifyDataframe['in_deezer_playlists'].apply(tryConvertInt)
spotifyDataframe['in_shazam_charts'] = spotifyDataframe['in_shazam_charts'].apply(tryConvertInt)

spotifyDataframe.dropna(subset=['streams', 'in_deezer_playlists', 'in_shazam_charts'], inplace=True)

# Avoid NaN transforming dataframe to float
spotifyDataframe['streams'] = spotifyDataframe['streams'].astype(int)
spotifyDataframe['in_deezer_playlists'] = spotifyDataframe['in_deezer_playlists'].astype(int)
spotifyDataframe['in_shazam_charts'] = spotifyDataframe['in_shazam_charts'].astype(int)


# Map the binary column 'mode' to 0 and 1

mode_mapping = {
    'minor': 0,
    'major': 1
}

spotifyDataframe['mode'] = spotifyDataframe['mode'].str.lower().map(mode_mapping)
spotifyDataframe = spotifyDataframe.dropna(subset=['mode'])


# Delete the track_name and artist(s)_name columns

spotifyDataframe = spotifyDataframe.drop(columns=['track_name', 'artist(s)_name'])


# Make the key column the last 
# column and rename it to target

col_X = spotifyDataframe['key']
spotifyDataframe.drop(columns=['key'], inplace=True)
spotifyDataframe['target'] = col_X


spotifyDataframe.info()

<class 'pandas.core.frame.DataFrame'>
Index: 748 entries, 0 to 952
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   artist_count          748 non-null    int64 
 1   released_year         748 non-null    int64 
 2   released_month        748 non-null    int64 
 3   released_day          748 non-null    int64 
 4   in_spotify_playlists  748 non-null    int64 
 5   in_spotify_charts     748 non-null    int64 
 6   streams               748 non-null    int64 
 7   in_apple_playlists    748 non-null    int64 
 8   in_apple_charts       748 non-null    int64 
 9   in_deezer_playlists   748 non-null    int64 
 10  in_deezer_charts      748 non-null    int64 
 11  in_shazam_charts      748 non-null    int64 
 12  bpm                   748 non-null    int64 
 13  mode                  748 non-null    int64 
 14  danceability_%        748 non-null    int64 
 15  valence_%             748 non-null    int64 
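
The conversion done by tryConvertInt in the cell above can also be expressed with pandas' built-in pd.to_numeric. The sketch below is an alternative under stated assumptions, not the notebook's own method: the sample values are hypothetical, and only two of the three columns are shown.

```python
import pandas as pd

# Hypothetical sample mimicking int-like object columns with one malformed entry
df = pd.DataFrame({
    'streams': ['141381703', 'not a number', '133716286'],
    'in_shazam_charts': ['826', '382', '949'],
})

cols = ['streams', 'in_shazam_charts']

# errors='coerce' turns unparseable values into NaN instead of raising
for col in cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop the rows that failed to parse, then cast back to integers
# (a column containing NaN is stored as float until the NaN rows are gone)
df = df.dropna(subset=cols)
df = df.astype({col: 'int64' for col in cols})

print(df['streams'].tolist())  # [141381703, 133716286]
```

With either approach, values written with thousands separators (for example 1,234) fail to parse and their rows are dropped, so it is worth checking how many rows are lost at this step.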


- Mobile Price Classification

    For more information, see:

    https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification

In [105]:
mobileDataframe = pd.read_csv('mobile.csv', encoding='ISO-8859-1')

mobileDataframe = mobileDataframe.dropna()
mobileDataframe = mobileDataframe.drop_duplicates()

mobileDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   battery_power  2000 non-null   int64  
 1   blue           2000 non-null   int64  
 2   clock_speed    2000 non-null   float64
 3   dual_sim       2000 non-null   int64  
 4   fc             2000 non-null   int64  
 5   four_g         2000 non-null   int64  
 6   int_memory     2000 non-null   int64  
 7   m_dep          2000 non-null   float64
 8   mobile_wt      2000 non-null   int64  
 9   n_cores        2000 non-null   int64  
 10  pc             2000 non-null   int64  
 11  px_height      2000 non-null   int64  
 12  px_width       2000 non-null   int64  
 13  ram            2000 non-null   int64  
 14  sc_h           2000 non-null   int64  
 15  sc_w           2000 non-null   int64  
 16  talk_time      2000 non-null   int64  
 17  three_g        2000 non-null   int64  
 18  touch_sc

Finally, we use the make_classification and make_blobs functions from scikit-learn to generate 2 random datasets following given parameters.

In [106]:
# make_blobs generated data:

from sklearn.datasets import make_blobs

feat, targ = make_blobs(n_features=2, centers=2)

genBlobData = np.column_stack((feat, targ))

genBlobDataframe = pd.DataFrame(genBlobData, columns=["feature1", "feature2", "target"])

genBlobDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   feature1  100 non-null    float64
 1   feature2  100 non-null    float64
 2   target    100 non-null    float64
dtypes: float64(3)
memory usage: 2.5 KB
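
In the output above the target column is float64, because np.column_stack casts the integer labels to the dtype of the features. A small variant that keeps the labels as integers and makes the sample reproducible is sketched below; the random_state value is an assumption added here and is not used in the original cell.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# random_state is an assumption added so that the sample is reproducible
feat, targ = make_blobs(n_samples=100, n_features=2, centers=2, random_state=42)

# Build the dataframe from the features first, then attach the labels,
# so the target column keeps its integer dtype
blobs = pd.DataFrame(feat, columns=['feature1', 'feature2'])
blobs['target'] = targ

print(blobs.dtypes)
print(blobs['target'].value_counts().to_dict())
```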


In [107]:
# make_classification generated data:

from sklearn.datasets import make_classification

feat, targ = make_classification(n_features=2, n_redundant=0, n_informative=1, n_clusters_per_class=1)

genClassData = np.column_stack((feat, targ))

genClassDataframe = pd.DataFrame(genClassData, columns=["feature1", "feature2", "target"])

genClassDataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   feature1  100 non-null    float64
 1   feature2  100 non-null    float64
 2   target    100 non-null    float64
dtypes: float64(3)
memory usage: 2.5 KB
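
With n_features=2, n_informative=1, and n_redundant=0, only one of the two generated features carries class information; the other is random noise. The sketch below illustrates this by comparing the gap between class means for each feature. The shuffle=False and random_state arguments are assumptions added for the illustration: with shuffle=False, the informative feature is placed first.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False and random_state are assumptions added for this sketch
X, y = make_classification(
    n_samples=100, n_features=2, n_redundant=0, n_informative=1,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)

# Absolute gap between the two class means, per feature:
# column 0 is the informative feature, column 1 is noise
gap = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
print(gap)
```

In the original cell the default shuffle=True is used, so which of the two columns is the informative one is not fixed.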


## Reducing the Number of Labels in the Datasets

Now that all our datasets are in dataframes, we will apply the function