# Normalizando Conjunto de Dados
É necessário normalizar os dados quando há uma diferença muito grande nos valores das colunas, onde as com valores maiores seriam predominantes em ditar a classificação, diminuindo a relevâncias das com valores pequenos.

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

## Carregando Base de dados

In [2]:
!git clone https://github.com/Crissky/MLUD.git

Cloning into 'MLUD'...
remote: Enumerating objects: 42, done.[K
remote: Counting objects: 100% (42/42), done.[K
remote: Compressing objects: 100% (38/38), done.[K
remote: Total 42 (delta 9), reused 13 (delta 0), pack-reused 0[K
Unpacking objects: 100% (42/42), done.


In [3]:
dataset = pd.read_csv('MLUD/Aula 03/admission.csv', delimiter=';')

## Visualizando os Dados

In [4]:
dataset.head()

Unnamed: 0,Name,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Approval
0,Lucas,337,118,4,4.5,4.5,9.65,1,1
1,Ana,324,107,4,4.0,4.5,8.87,1,1
2,Jose,316,104,3,3.0,3.5,8.0,1,1
3,Carlos,322,110,3,3.5,2.5,8.67,1,1
4,Zileide,314,103,2,2.0,3.0,8.21,0,0


In [5]:
X = dataset.iloc[:,:-1].values              # Pegando somente as variáveis independentes
y = dataset.iloc[:,-1].values               # Pegando somente as variáveis dependentes
D = pd.get_dummies(X[:,0])                  # Criando o One-Hot Encoding com o Pandas
X = X[:, 1:]                                # Retirando os Labels Numéricos
X = np.insert(X, 0, D.values, axis=1)       # Inserindo o One-Hot Encoding em X

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [6]:
print(X_train)

[[0 0 0 0 0 0 1 0 0 322 110 3 3.5 2.5 8.67 1]
 [0 0 0 0 0 0 0 1 0 316 104 3 3.0 3.5 8.0 1]
 [0 0 0 0 1 0 0 0 0 302 102 1 2.0 1.5 8.0 0]
 [0 0 0 0 0 1 0 0 0 314 103 2 2.0 3.0 8.21 0]
 [0 1 0 0 0 0 0 0 0 337 118 4 4.5 4.5 9.65 1]
 [0 0 1 0 0 0 0 0 0 330 115 5 4.5 3.0 9.34 1]
 [0 0 0 1 0 0 0 0 0 324 107 4 4.0 4.5 8.87 1]]


## Normalizando os Dados

In [7]:
scale_X = StandardScaler()

X_train = scale_X.fit_transform(X_train)
X_test = scale_X.fit_transform(X_test)

In [8]:
print(X_train)

[[ 0.         -0.40824829 -0.40824829 -0.40824829 -0.40824829 -0.40824829
   2.44948974 -0.40824829  0.          0.12168831  0.27431507 -0.11470787
   0.14433757 -0.71795816 -0.01181521  0.63245553]
 [ 0.         -0.40824829 -0.40824829 -0.40824829 -0.40824829 -0.40824829
  -0.40824829  2.44948974  0.         -0.44619046 -0.77306974 -0.11470787
  -0.36084392  0.28718326 -1.12008234  0.63245553]
 [ 0.         -0.40824829 -0.40824829 -0.40824829  2.44948974 -0.40824829
  -0.40824829 -0.40824829  0.         -1.7712409  -1.122198   -1.720618
  -1.37120689 -1.72309958 -1.12008234 -1.58113883]
 [ 0.         -0.40824829 -0.40824829 -0.40824829 -0.40824829  2.44948974
  -0.40824829 -0.40824829  0.         -0.63548338 -0.94763387 -0.91766294
  -1.37120689 -0.21538745 -0.77271503 -1.58113883]
 [ 0.          2.44948974 -0.40824829 -0.40824829 -0.40824829 -0.40824829
  -0.40824829 -0.40824829  0.          1.54138521  1.67082814  0.6882472
   1.15470054  1.29232469  1.60923222  0.63245553]
 [ 0.   

## Outra alternativa de Normalização

Valores ficarão entre zero e um.

In [9]:
new_ds = dataset.iloc[:,1:]                                 #Selecionando todas as colunas, menos a primeira pois é formada por strings
# dataset.iloc[:,1:] = (new_ds - new_ds.min()) / (new_ds.max() - new_ds.min())

(new_ds - new_ds.min()) / (new_ds.max() - new_ds.min())

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Approval
0,1.0,1.0,0.75,1.0,1.0,1.0,1.0,1.0
1,0.628571,0.352941,0.75,0.8,1.0,0.554286,1.0,1.0
2,0.4,0.176471,0.5,0.4,0.666667,0.057143,1.0,1.0
3,0.571429,0.529412,0.5,0.6,0.333333,0.44,1.0,1.0
4,0.342857,0.117647,0.25,0.0,0.5,0.177143,0.0,0.0
5,0.8,0.823529,1.0,1.0,0.5,0.822857,1.0,1.0
6,0.542857,0.470588,0.5,0.4,0.833333,0.171429,1.0,1.0
7,0.171429,0.0,0.25,0.4,0.833333,0.0,0.0,0.0
8,0.0,0.058824,0.0,0.0,0.0,0.057143,0.0,0.0


In [10]:
new_ds / new_ds.max()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Approval
0,1.0,1.0,0.8,1.0,1.0,1.0,1.0,1.0
1,0.961424,0.90678,0.8,0.888889,1.0,0.919171,1.0,1.0
2,0.937685,0.881356,0.6,0.666667,0.777778,0.829016,1.0,1.0
3,0.95549,0.932203,0.6,0.777778,0.555556,0.898446,1.0,1.0
4,0.931751,0.872881,0.4,0.444444,0.666667,0.850777,0.0,0.0
5,0.979228,0.974576,1.0,1.0,0.666667,0.967876,1.0,1.0
6,0.952522,0.923729,0.6,0.666667,0.888889,0.849741,1.0,1.0
7,0.913947,0.855932,0.4,0.666667,0.888889,0.818653,0.0,0.0
8,0.896142,0.864407,0.2,0.444444,0.333333,0.829016,0.0,0.0
