# Model experimentation

This notebook tests and analyzes different models trained on data from the SDSS_DR18 dataset.

**Objectives:**
* Understand the distribution of the features within each class.
* Analyze the relationship between feature ranges and the classification.
* Identify possible outliers and missing values.
* Identify the maximum and minimum of each feature for each class.

**Libraries used:**
* **Pandas:** data manipulation.
* **NumPy:** numerical operations.
* **Matplotlib:** data visualization.
* **Seaborn:** statistical data visualization.

In [1]:
%pip install torch

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.




In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler

## Data preprocessing

In [3]:
sloan_table = pd.read_csv("Dados/SDSS_DR18.csv")
sloan_table.head(100)

Unnamed: 0,objid,specobjid,ra,dec,u,g,r,i,z,run,...,psfMag_g,psfMag_i,psfMag_z,expAB_u,expAB_g,expAB_r,expAB_i,expAB_z,redshift,class
0,1.240000e+18,3.240000e+17,184.950869,0.733068,18.87062,17.59612,17.11245,16.83899,16.70908,756,...,19.96352,19.25145,19.05230,0.479021,0.518483,0.520474,0.508502,0.488969,0.041691,GALAXY
1,1.240000e+18,3.250000e+17,185.729201,0.679704,19.59560,19.92153,20.34448,20.66213,20.59599,756,...,19.92417,20.65535,20.57387,0.573926,0.531728,0.403072,0.999874,0.189495,-0.000814,STAR
2,1.240000e+18,3.240000e+17,185.687690,0.823480,19.26421,17.87891,17.09593,16.65159,16.35329,756,...,19.33645,18.16669,17.78844,0.701666,0.743386,0.770897,0.778642,0.736771,0.113069,GALAXY
3,1.240000e+18,2.880000e+18,185.677904,0.768362,19.49739,17.96166,17.41269,17.20545,17.11567,756,...,17.96176,17.21564,17.12367,0.999818,0.787760,0.745611,0.399718,0.986137,0.000087,STAR
4,1.240000e+18,2.880000e+18,185.814763,0.776940,18.31519,16.83033,16.26352,16.06320,15.97527,756,...,16.85104,16.08275,15.98694,0.999795,0.834450,0.723526,0.712259,0.527055,0.000018,STAR
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1.240000e+18,7.980000e+17,46.109404,1.045906,19.10064,16.81490,15.60841,15.11939,14.75466,109,...,18.69036,16.95093,16.53452,0.668718,0.717878,0.681205,0.683496,0.689745,0.156453,GALAXY
96,1.240000e+18,4.650000e+17,49.296809,1.047410,18.52110,16.65724,15.92370,15.48312,15.17295,109,...,19.74750,18.25955,17.88374,0.496652,0.503385,0.499375,0.505945,0.501964,0.066758,GALAXY
97,1.240000e+18,1.710000e+18,50.917168,0.901690,17.56852,16.07338,15.48176,15.28377,15.19522,109,...,16.07821,15.24219,15.13657,0.377348,0.999858,0.111211,0.460445,0.253717,0.000240,STAR
98,1.240000e+18,4.650000e+17,50.827067,0.939875,18.73592,17.42720,16.80908,16.47318,16.18574,109,...,19.13368,18.13357,17.79567,0.284895,0.329829,0.341564,0.331625,0.315325,0.036462,GALAXY


In [4]:
# Drop rows with invalid values (negative magnitudes)
sloan_table = sloan_table[sloan_table['u']>=0]
sloan_table = sloan_table[sloan_table['g']>=0]
sloan_table = sloan_table[sloan_table['r']>=0]
sloan_table = sloan_table[sloan_table['i']>=0]
sloan_table = sloan_table[sloan_table['z']>=0]
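
The five per-band filters above can also be written as a single boolean mask. The sketch below is illustrative only: the toy DataFrame and the -9999.0 placeholder value are assumptions made for the example, not values taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values); -9999.0 stands in for an invalid magnitude.
toy = pd.DataFrame({
    "u": [18.9, 19.6, -9999.0],
    "g": [17.6, 19.9, 17.9],
    "r": [17.1, -9999.0, 17.4],
    "i": [16.8, 20.7, 17.2],
    "z": [16.7, 20.6, 17.1],
})

bands = ["u", "g", "r", "i", "z"]

# Keep only rows where every band is non-negative.
valid_rows = (toy[bands] >= 0).all(axis=1)
toy_clean = toy[valid_rows]

print(len(toy), "->", len(toy_clean))
```

A single mask filters the table once instead of five times and keeps the list of bands in one place.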

In [5]:
# .copy() makes the selection an independent DataFrame, which avoids pandas' SettingWithCopyWarning
sloan_filtred = sloan_table[['u','g','r','i','z','class']].copy()
# Encode the class labels as integers: STAR -> 0, QSO -> 1, GALAXY -> 2
sloan_filtred["classificacao"] = sloan_filtred["class"].map({"STAR": 0, "QSO": 1, "GALAXY": 2})


In [6]:
sloan_filtred.head(2000)

Unnamed: 0,u,g,r,i,z,class,classificacao
0,18.87062,17.59612,17.11245,16.83899,16.70908,GALAXY,2
1,19.59560,19.92153,20.34448,20.66213,20.59599,STAR,0
2,19.26421,17.87891,17.09593,16.65159,16.35329,GALAXY,2
3,19.49739,17.96166,17.41269,17.20545,17.11567,STAR,0
4,18.31519,16.83033,16.26352,16.06320,15.97527,STAR,0
...,...,...,...,...,...,...,...
1995,18.59335,16.53429,15.48791,14.98926,14.61014,GALAXY,2
1996,18.64174,18.91801,19.41393,19.77970,20.10166,STAR,0
1997,19.15671,18.22034,17.92332,17.79087,17.73026,STAR,0
1998,18.01667,17.71583,17.63513,17.56084,17.53474,QSO,1


In [7]:
train_validation, test = train_test_split(sloan_filtred, test_size=0.33, random_state=42, stratify=sloan_filtred['classificacao'])
train, validation = train_test_split(train_validation, test_size=0.33, random_state=42, stratify=train_validation['classificacao'])
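
Because the second split is applied to the 67% left over from the first, the final proportions are roughly 45% training, 22% validation, and 33% test. A minimal sketch on synthetic labels (the class counts below are made up for illustration) shows the resulting sizes and that `stratify` keeps the class mix similar in every split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (hypothetical counts): 0 = STAR, 1 = QSO, 2 = GALAXY.
labels = np.array([0] * 400 + [1] * 100 + [2] * 500)
indices = np.arange(len(labels))

# Same two-stage split as in the cell above, applied to row indices.
train_val_idx, test_idx = train_test_split(
    indices, test_size=0.33, random_state=42, stratify=labels)
train_idx, val_idx = train_test_split(
    train_val_idx, test_size=0.33, random_state=42,
    stratify=labels[train_val_idx])

for name, idx in [("train", train_idx), ("validation", val_idx), ("test", test_idx)]:
    share = len(idx) / len(labels)
    class_mix = np.bincount(labels[idx], minlength=3) / len(idx)
    print(f"{name:<10} share={share:.3f} class mix={np.round(class_mix, 3)}")
```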

In [8]:
# training dataset for the classifier
X_train = train[['u','g','r','i','z']]
Y_train = train[['classificacao']]

# validation dataset for the classifier
X_validation = validation[['u','g','r','i','z']]
Y_validation = validation[['classificacao']]

**TODO: modify the cell below so that each class's test set is made up of copies of the existing samples.**

In [9]:
# split the test DataFrame into one DataFrame per class
test_star = test[test['class'] == 'STAR']
test_qso = test[test['class'] == 'QSO']
test_galaxy = test[test['class'] == 'GALAXY']

# split each class: 40% becomes that class's own test sample (*_sample); the remaining 60% is kept in part_*_test
part_star_test, star_sample = train_test_split(test_star,
                                               test_size=0.40,
                                               random_state=42)

part_qso_test, qso_sample = train_test_split(test_qso,
                                             test_size=0.40,
                                             random_state=42)

part_galaxy_test, galaxy_sample = train_test_split(test_galaxy,
                                                   test_size=0.40,
                                                   random_state=42)

# build the general test DataFrame and shuffle it
test_sample = pd.concat([test_star, test_qso, test_galaxy])
test_sample = test_sample.sample(frac=1, random_state=42).reset_index(drop=True)


In [10]:
# test dataset for the classifier
X_test =  test_sample[['u','g','r','i','z']]
Y_test =  test_sample[['classificacao']]

# test dataset for the STAR class
x_star = star_sample[['u','g','r','i','z']]
y_star = star_sample[['classificacao']]

# test dataset for the QSO class
x_qso = qso_sample[['u','g','r','i','z']]
y_qso = qso_sample[['classificacao']]

# test dataset for the GALAXY class
x_galaxy = galaxy_sample[['u','g','r','i','z']]
y_galaxy = galaxy_sample[['classificacao']]

In [11]:
# Helper class: exposes feature and label tensors as a torch Dataset
class TorchDataset(Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels

  def __len__(self):
    return len(self.features)

  def __getitem__(self, idx):
    return self.features[idx], self.labels[idx]

In [12]:
# Converts pandas DataFrames (or arrays) into a DataLoader
def df_to_loader(x_df,y_df):
  if not isinstance(x_df, pd.DataFrame):
    x_df = pd.DataFrame(x_df)

  x_tensor = torch.tensor(x_df.values, dtype=torch.float32)
  y_tensor = torch.tensor(y_df.values, dtype=torch.long)

  torch_dataset = TorchDataset(x_tensor, y_tensor)
  return DataLoader(torch_dataset, batch_size=32, shuffle=True)
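
As a rough illustration of what `df_to_loader` returns, the sketch below builds a loader the same way on random data. The 70-row frame and the use of `TensorDataset` in place of the custom `TorchDataset` class are assumptions made to keep the example self-contained. Note that the labels come out with shape `(batch, 1)` because `y_df` is a one-column DataFrame.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the normalized features and the label column.
x_df = pd.DataFrame(rng.random((70, 5)), columns=["u", "g", "r", "i", "z"])
y_df = pd.DataFrame({"classificacao": rng.integers(0, 3, size=70)})

x_tensor = torch.tensor(x_df.values, dtype=torch.float32)
y_tensor = torch.tensor(y_df.values, dtype=torch.long)

loader = DataLoader(TensorDataset(x_tensor, y_tensor), batch_size=32, shuffle=True)

# Collect the (features, labels) shapes of every batch.
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
print(len(loader), batch_shapes)
```

With 70 rows and a batch size of 32, the loader yields two full batches and a final batch of 6 rows.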

In [13]:
# Normalize the data: fit the scaler on the training set only, then apply it to every split
scaler = MinMaxScaler()

scaler.fit(X_train)
X_train_norm = scaler.transform(X_train)

X_validation_norm = scaler.transform(X_validation)

X_test_norm = scaler.transform(X_test)

X_star_norm = scaler.transform(x_star)

X_qso_norm = scaler.transform(x_qso)

X_galaxy_norm = scaler.transform(x_galaxy)
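
`MinMaxScaler` rescales each feature as `(x - min) / (max - min)`, using the minimum and maximum seen during `fit`. Since only the training set is fitted, validation or test values beyond the training range map outside [0, 1]. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
train_values = np.array([[1.0], [3.0], [5.0]])
new_values = np.array([[0.0], [4.0], [7.0]])

demo_scaler = MinMaxScaler()
demo_scaler.fit(train_values)  # learns min=1 and max=5 from the training data only

train_scaled = demo_scaler.transform(train_values)
new_scaled = demo_scaler.transform(new_values)

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Fitting on the training set alone keeps information from the validation and test sets out of the preprocessing step.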

In [14]:
# Prepare the training data to be fed to the models
trainloader = df_to_loader(X_train_norm, Y_train)

# Prepare the validation data to be fed to the models
validationloader = df_to_loader(X_validation_norm, Y_validation)

# Prepare the test data to be fed to the models
testloader = df_to_loader(X_test_norm, Y_test)

# Prepare the star test data to be fed to the models
starloader = df_to_loader(X_star_norm, y_star)

# Prepare the quasar test data to be fed to the models
qsoloader = df_to_loader(X_qso_norm, y_qso)

# Prepare the galaxy test data to be fed to the models
galaxyloader = df_to_loader(X_galaxy_norm, y_galaxy)

In [15]:
# Adapts the loaders into inputs that are valid for sklearn models (NumPy arrays)
def adapter(loader):
    X_list = []
    y_list = []
    for data in loader:
        inputs, labels = data
        X_list.append(inputs.numpy())
        y_list.append(labels.numpy())
    
    X = np.concatenate(X_list)
    Y = np.concatenate(y_list)
    Y = Y.ravel()
    return X, Y

In [16]:
class datas:
    def __init__(self, trainloader, validationloader, testloader):
        self.train = trainloader
        self.x_train, self.y_train = adapter(trainloader)
        
        self.validation = validationloader
        self.x_validation, self.y_validation = adapter(validationloader)
        
        self.test = testloader
        self.x_test, self.y_test = adapter(testloader)

    # Getters for the training data
    def get_trainloader(self):
        return self.train
    def get_xtrain(self):
        return self.x_train
    def get_ytrain(self):
        return self.y_train

    # Getters for the validation data
    def get_validationloader(self):
        return self.validation
    def get_xvalidation(self):
        return self.x_validation
    def get_yvalidation(self):
        return self.y_validation  

    # Getters for the test data
    def get_testloader(self):
        return self.test
    def get_xtest(self):
        return self.x_test
    def get_ytest(self):
        return self.y_test

    

In [17]:
sloan_data = datas(trainloader,validationloader,testloader)

## Testing the models with basic configurations

In [18]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import warnings

**KNN model**

In [19]:
knn = KNeighborsClassifier(n_neighbors=5)

In [20]:
model_knn = knn.fit(sloan_data.get_xtrain(),sloan_data.get_ytrain())

In [21]:
y_pred_knn = model_knn.predict(sloan_data.get_xvalidation())

In [22]:
acuracia_knn = accuracy_score(sloan_data.get_yvalidation(), y_pred_knn)

print(f"KNN model accuracy: {acuracia_knn}")

KNN model accuracy: 0.9294889190411578
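
Overall accuracy can hide differences between classes, which is presumably why per-class test sets are prepared earlier in this notebook. As a hedged sketch (the label arrays below are invented, not model output), per-class recall can be read alongside accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels for illustration: 0 = STAR, 1 = QSO, 2 = GALAXY.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 2])

overall = accuracy_score(y_true, y_pred)
per_class_recall = recall_score(y_true, y_pred, average=None)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"accuracy: {overall:.2f}")
print("recall per class:", np.round(per_class_recall, 2))
print(cm)
```

In this toy case the overall accuracy is 0.80, while only half of the QSO samples are recovered, a gap that a single accuracy figure does not show.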
