# Modeling and Evaluation

In this notebook we fit each model and evaluate it, comparing the results to determine which model best predicts the Iris species.

**Library Imports**

In [1]:
from sklearn.datasets import load_iris
import pandas as pd

In [2]:
data = load_iris()

In [3]:
# Features
X = pd.DataFrame(data.data, columns=data.feature_names)
# Dependent variable (target)
y = pd.DataFrame(data.target, columns=['Species'])

In [7]:
display(X.head())
display(y.head())

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


,Species
0,0
1,0
2,0
3,0
4,0


Since I plan to handle both cases, with and without the highly correlated attribute, I will write functions that take the training and test sets as arguments so they can be reused for every case we might need. The classes are already balanced, so no rebalancing is required. The features are also on the same scale, so normalization is not necessary either.
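The paragraph above relies on three properties of the data: a highly correlated attribute pair, balanced classes, and comparable scales. These can be checked directly. The following is a minimal sketch, not part of the original notebook; the names `class_counts`, `corr`, and `ranges` are illustrative.

```python
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['Species'])

# Class balance: number of samples per species
class_counts = y['Species'].value_counts().sort_index()
print(class_counts)

# Pairwise Pearson correlation between the features
corr = X.corr()
print(corr.round(2))

# Value range of each feature (all are measured in cm)
ranges = X.agg(['min', 'max'])
print(ranges)
```

A correlation close to 1 between two columns suggests that one of them carries little extra information, which is the motivation for also testing the models without it.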

**Importing the modeling modules**

In [10]:
# Preprocessing
from sklearn.model_selection import train_test_split
# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# Evaluation metrics
from sklearn.metrics import accuracy_score, classification_report

In [43]:
# DecisionTreeClassifier function: fit on the training set, report on the test set
def model_decision_tree(x_train,x_test,y_train,y_test):
    model = DecisionTreeClassifier(random_state=123)
    model.fit(x_train, y_train)
    y_predict = model.predict(x_test)
    print(classification_report(y_test, y_predict))
    acc = accuracy_score(y_test,y_predict)
    return acc

In [85]:
# RandomForestClassifier function
def model_RandomForestClassifier(x_train,x_test,y_train,y_test):
    model=RandomForestClassifier(random_state=123)
    model.fit(x_train, y_train.values.ravel())
    y_predict = model.predict(x_test)
    print(classification_report(y_test.values.ravel(), y_predict))
    acc = accuracy_score(y_test,y_predict)
    return acc

In [101]:
# LogisticRegression function
def model_LogisticRegression(x_train,x_test,y_train,y_test):
    model=LogisticRegression(random_state=123,max_iter=1000)
    model.fit(x_train, y_train.values.ravel())
    y_predict = model.predict(x_test)
    print(classification_report(y_test, y_predict))
    acc = accuracy_score(y_test,y_predict)
    return acc

In [197]:
# KNeighborsClassifier function
def model_Kneighbors_classifier(x_train,x_test,y_train,y_test):
    # Search for the best number of neighbors (1 to 10), using accuracy as the comparison metric.
    # Note: k is selected on the test set, so the reported accuracy is optimistic.
    melhor_acuracia = -1  # start below any possible accuracy so the first model is always kept
    for i in range(10):
        model = KNeighborsClassifier(n_neighbors=i+1)
        model.fit(x_train, y_train.values.ravel())
        y_predict = model.predict(x_test)
        acuracia = accuracy_score(y_test,y_predict)
        if acuracia > melhor_acuracia:
            melhor_acuracia = acuracia
            melhor_modelo = model
            n=i+1
    y_predict = melhor_modelo.predict(x_test)
    print('Melhor número de vizinhos:{}'.format(n))
    print(classification_report(y_test, y_predict))
    acc = accuracy_score(y_test,y_predict)
    return acc
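The function above picks the number of neighbors by comparing accuracy on the test set, so the same data both selects and scores the model. One common alternative is to choose k by cross-validation on the training set only. The sketch below is an assumption about how that could look, not the notebook's method; `search`, `best_k`, and `knn_test_acc` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Choose k by 5-fold cross-validation on the training set only
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 11))},
    scoring='accuracy',
    cv=5,
)
search.fit(X_train, y_train)
best_k = search.best_params_['n_neighbors']

# The test set is used once, for the final estimate
knn_test_acc = accuracy_score(y_test, search.predict(X_test))
print('best k:', best_k, 'test accuracy:', round(knn_test_acc, 4))
```

With the default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set, which is why `search.predict` can be called directly.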

In [211]:
# Support Vector Classifier function
def model_svc(x_train,x_test,y_train,y_test):
    model = SVC(random_state=123)
    model.fit(x_train, y_train.values.ravel())
    y_predict = model.predict(x_test)
    print(classification_report(y_test, y_predict))
    acc = accuracy_score(y_test,y_predict)
    return acc
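The five functions above repeat the same fit, predict, and report steps, differing only in the estimator. As a possible simplification, a single helper could receive the estimator as an argument. This is a sketch under that assumption; `evaluate_model`, `tree_acc`, and `svc_acc_sketch` are hypothetical names that do not appear in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

def evaluate_model(model, x_train, x_test, y_train, y_test):
    # Flatten the single-column target DataFrames to 1-D arrays
    y_train_1d = y_train.values.ravel()
    y_test_1d = y_test.values.ravel()
    model.fit(x_train, y_train_1d)
    y_predict = model.predict(x_test)
    print(classification_report(y_test_1d, y_predict))
    return accuracy_score(y_test_1d, y_predict)

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['Species'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

tree_acc = evaluate_model(DecisionTreeClassifier(random_state=123), X_train, X_test, y_train, y_test)
svc_acc_sketch = evaluate_model(SVC(random_state=123), X_train, X_test, y_train, y_test)
```

The K-Nearest Neighbors case would still need its own loop, since it compares several estimators before reporting.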

In [212]:
X_train,X_test,y_train,y_test = train_test_split(X,y, random_state=123)
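The support column in the reports below (16, 8, and 14 samples) shows that this split does not preserve the 50/50/50 class balance in the test set. If equal representation matters, `train_test_split` accepts a `stratify` argument. The following is a sketch of that option, not what the notebook uses; the `_s` suffixed names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.DataFrame(data.target, columns=['Species'])

# stratify keeps the class proportions of the target in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=123, stratify=y['Species']
)
test_counts = y_test_s['Species'].value_counts().sort_index()
print(test_counts)
```

Switching to a stratified split would change every accuracy reported in this notebook, so the numbers would no longer be directly comparable.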

In [213]:
decision_tree_acc = model_decision_tree(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.80      1.00      0.89         8
           2       1.00      0.86      0.92        14

    accuracy                           0.95        38
   macro avg       0.93      0.95      0.94        38
weighted avg       0.96      0.95      0.95        38



In [214]:
random_forest_acc = model_RandomForestClassifier(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.73      1.00      0.84         8
           2       1.00      0.79      0.88        14

    accuracy                           0.92        38
   macro avg       0.91      0.93      0.91        38
weighted avg       0.94      0.92      0.92        38



In [215]:
logistic_regression_acc = model_LogisticRegression(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.89      1.00      0.94         8
           2       1.00      0.93      0.96        14

    accuracy                           0.97        38
   macro avg       0.96      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38



In [216]:
kneighbors_acc = model_Kneighbors_classifier(X_train,X_test,y_train,y_test)

Melhor número de vizinhos:5
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.88      0.93         8
           2       0.93      1.00      0.97        14

    accuracy                           0.97        38
   macro avg       0.98      0.96      0.97        38
weighted avg       0.98      0.97      0.97        38



In [219]:
svc_acc = model_svc(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.89      1.00      0.94         8
           2       1.00      0.93      0.96        14

    accuracy                           0.97        38
   macro avg       0.96      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38



In [222]:
print('ACCURACIES')
print('Random Forest Classifier:{}'.format(round(random_forest_acc,4)))
print('Decision Tree Classifier:{}'.format(round(decision_tree_acc,4)))
print('Support Vector Classifier:{}'.format(round(svc_acc,4)))
print('K-Nearest Neighbors:{}'.format(round(kneighbors_acc,4)))
print('Logistic Regression:{}'.format(round(logistic_regression_acc,4)))

ACCURACIES
Random Forest Classifier:0.9211
Decision Tree Classifier:0.9474
Support Vector Classifier:0.9737
K-Nearest Neighbors:0.9737
Logistic Regression:0.9737


From these results, the best models are SVC, K-Nearest Neighbors, and Logistic Regression, tied at 0.9737 accuracy (37 of 38 test samples correct).

If we repeat the same evaluation without the correlated attribute (`petal width (cm)`), will we reach the same result?

In [240]:
X_train,X_test,y_train,y_test = train_test_split(X[['sepal length (cm)','sepal width (cm)','petal length (cm)']],y, random_state=123)

In [241]:
decision_tree_acc = model_decision_tree(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.88      0.93         8
           2       0.93      1.00      0.97        14

    accuracy                           0.97        38
   macro avg       0.98      0.96      0.97        38
weighted avg       0.98      0.97      0.97        38



In [242]:
random_forest_acc = model_RandomForestClassifier(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.64      0.88      0.74         8
           2       0.91      0.71      0.80        14

    accuracy                           0.87        38
   macro avg       0.85      0.86      0.85        38
weighted avg       0.89      0.87      0.87        38



In [243]:
kneighbors_acc = model_Kneighbors_classifier(X_train,X_test,y_train,y_test)

Melhor número de vizinhos:1
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.78      0.88      0.82         8
           2       0.92      0.86      0.89        14

    accuracy                           0.92        38
   macro avg       0.90      0.91      0.90        38
weighted avg       0.92      0.92      0.92        38



In [244]:
logistic_regression_acc = model_LogisticRegression(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.89      1.00      0.94         8
           2       1.00      0.93      0.96        14

    accuracy                           0.97        38
   macro avg       0.96      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38



In [245]:
svc_acc = model_svc(X_train,X_test,y_train,y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       0.80      1.00      0.89         8
           2       1.00      0.86      0.92        14

    accuracy                           0.95        38
   macro avg       0.93      0.95      0.94        38
weighted avg       0.96      0.95      0.95        38



In [246]:
print('ACCURACIES')
print('Random Forest Classifier:{}'.format(round(random_forest_acc,4)))
print('Decision Tree Classifier:{}'.format(round(decision_tree_acc,4)))
print('Support Vector Classifier:{}'.format(round(svc_acc,4)))
print('K-Nearest Neighbors:{}'.format(round(kneighbors_acc,4)))
print('Logistic Regression:{}'.format(round(logistic_regression_acc,4)))

ACCURACIES
Random Forest Classifier:0.8684
Decision Tree Classifier:0.9737
Support Vector Classifier:0.9474
K-Nearest Neighbors:0.9211
Logistic Regression:0.9737


The result does change. SVC and K-Nearest Neighbors both lose accuracy (from 0.9737 to 0.9474 and 0.9211, respectively), Random Forest also drops (from 0.9211 to 0.8684), and Logistic Regression stays at 0.9737. Decision Tree improves from 0.9474 to 0.9737, enough to tie with Logistic Regression for the highest accuracy.

Logistic Regression is therefore the only model with the highest accuracy in both cases.

On this train/test split, the best model for predicting the Iris species is Logistic Regression.
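With 38 test samples, each misclassification changes accuracy by about 0.026, so the gaps between models above correspond to only a few predictions and the ranking may depend on the particular split. One way to check how stable it is would be to compare the models with cross-validation on both feature sets. The sketch below assumes 5-fold stratified cross-validation and fixes k at 5 for K-Nearest Neighbors; neither choice comes from the notebook, and `cv_results` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pandas as pd

data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=123),
    'Random Forest': RandomForestClassifier(random_state=123),
    'Logistic Regression': LogisticRegression(random_state=123, max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),  # k fixed at 5 (assumption)
    'SVC': SVC(random_state=123),
}
feature_sets = {
    'all features': list(X.columns),
    'without petal width': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)'],
}

# Same shuffled, stratified folds for every model and feature set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
cv_results = {}
for set_name, cols in feature_sets.items():
    for model_name, model in models.items():
        scores = cross_val_score(model, X[cols], y, cv=cv, scoring='accuracy')
        cv_results[(set_name, model_name)] = scores.mean()
        print('{} | {}: {:.4f} (+/- {:.4f})'.format(
            set_name, model_name, scores.mean(), scores.std()))
```

If the mean accuracies overlap within their standard deviations, the models are hard to distinguish on this dataset.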

In [None]:
import pickle
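
The final cell imports `pickle`, but the notebook ends before using it. Assuming the intent was to persist the selected Logistic Regression model, a minimal sketch could look like this. The file name `iris_logreg.pkl` and the temporary directory are hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = LogisticRegression(random_state=123, max_iter=1000)
model.fit(X_train, y_train)

# Hypothetical file name, written to a temporary directory for this sketch
model_path = os.path.join(tempfile.mkdtemp(), 'iris_logreg.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

# Reload and confirm the restored model makes the same predictions
with open(model_path, 'rb') as f:
    restored = pickle.load(f)
same_predictions = bool((restored.predict(X_test) == model.predict(X_test)).all())
print(same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled, and the scikit-learn version used to load should match the one used to save.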

