# Testing Models

In this notebook I experiment with several models and evaluate their predictions on the inputs.csv and outputs.csv files for my Snake Debate Chatbot. I wrote these datasets by hand, with some of the inputs contributed by friends and classmates. They are small and limited for now, but they cover most common topics and myths about snakes.

## Reading and Processing the Data
### With Pandas

In [1]:
# Importing Pandas
import pandas as pd
# Reading in the csv file with pandas
df = pd.read_csv('Data/inputs.csv')
# Converting the category column to categorical data
df.Category = df.Category.astype('category')
# Displaying the counts of each category
print("Counts: ")
print(df.Category.value_counts())
# Displaying the first few rows of the data frame
print("\nFirst few rows of data frame:")
df.head()

Counts: 
Pet Snakes            16
Body                  15
Behavior              11
Eat                   11
Understand            10
Eyesight              10
Venomous              10
Pairs                  8
Misunderstand          8
Legs                   8
Generic                8
Kill Snakes            7
Dislocate Jaws         7
Dangerous Snakes       7
Smell                  7
Fear                   7
Rattle                 6
Musk                   6
Mother Snake           6
Deaf                   6
Suffication            6
Evil                   6
Endangered             6
Brain                  6
Infared                6
Pupils                 5
Snake Bite             5
Snake Benefits         5
Swim                   5
Snake Attraction       5
Tails                  5
Topics                 5
Scared                 5
Music                  5
Live Forever           5
Lay Eggs               5
Crossbreeding          5
Definition             5
Flying snakes          5
Greeting        

| Unnamed: 0 | Input | Category |
|---|---|---|
| 0 | Snakes are aggressive | Behavior |
| 1 | Are they aggressive | Behavior |
| 2 | Snakes will not bite unless you try to approac... | Behavior |
| 3 | Do snakes like to bite? | Behavior |
| 4 | Snakes chase people | Behavior |


### Utilizing sklearn to create train/test data frames

In [2]:
# Import sklearn's train_test_split
from sklearn.model_selection import train_test_split
# Divide into train and test (80/20 with seed 1234 for replicable results)
# X contains the predictor columns and y contains the target column
X_train, X_test, y_train, y_test = train_test_split(
    df.Input, df.Category, test_size=0.2, random_state=1234,
    stratify=df.Category)

# Outputting the dimensions of train and test
print("Dimensions of train data: ", X_train.shape)
print("Dimensions of test data: ", X_test.shape)

Dimensions of train data:  (257,)
Dimensions of test data:  (65,)
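
With only a handful of inputs per category, the `stratify` argument is what keeps every category represented in both splits. The sketch below checks this on hypothetical toy data (the `toy` data frame and its sentences are made up for illustration and are not part of inputs.csv); the same comparison could be run on `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 categories with 10 inputs each
toy = pd.DataFrame({
    "Input": [f"example sentence {i}" for i in range(30)],
    "Category": ["Behavior"] * 10 + ["Body"] * 10 + ["Eat"] * 10,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy.Input, toy.Category, test_size=0.2, random_state=1234,
    stratify=toy.Category)

# Share of each category in the train and test splits, side by side
proportions = pd.DataFrame({
    "train": toy_y_train.value_counts(normalize=True),
    "test": toy_y_test.value_counts(normalize=True),
}).fillna(0)
print(proportions)
```

If a category's share differs a lot between the two columns, the split was probably not stratified.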


### Removing stop words

In [3]:
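# Importing the nltk stopwords
from nltk.corpus import stopwords
# This is our list of stopwords; it will be used during vectorization.
# The nltk file id is lowercase ('english'), and newer sklearn versions expect a list rather than a set.
stopwords = stopwords.words('english')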
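
To see what stop word removal does to a single input, the sketch below runs one sentence from the dataset through a vectorizer's analyzer. As an assumption made to keep the example self-contained, it uses sklearn's built-in `'english'` stop word list instead of the nltk list, so the exact words removed may differ slightly.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: sklearn's built-in 'english' stop word list stands in for the nltk list
analyzer = CountVectorizer(stop_words="english").build_analyzer()

# The analyzer lowercases the text, splits it into tokens, and drops stop words
tokens = analyzer("Do snakes like to bite?")
print(tokens)
```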

### Performing tf-idf Vectorization

In [40]:
# Importing our tf-idf vectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
# Setting up our stopwords for our vectorizer
vectorizer = TfidfVectorizer(stop_words=stopwords)
# Perform tf-idf vectorization and fit to training data
X_train_vect = vectorizer.fit_transform(X_train)
# Transforming the test data with the fitted tf-idf vectorization
X_test_vect = vectorizer.transform(X_test)
# Outputting the dimensions of train and test
print("Dimensions of the vectorized train data: ", X_train_vect.shape)
print("Dimensions of the vectorized test data: ", X_test_vect.shape)



Dimensions of the vectorized train data:  (257, 338)
Dimensions of the vectorized test data:  (65, 338)

### Performing Count Vectorization


In [42]:
# Importing the count vectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer
# Setting up our stopwords for our vectorizer
vectorizer = CountVectorizer(stop_words=stopwords, strip_accents='unicode')
# Perform count vectorization and fit to training data
# Note: this overwrites the tf-idf matrices from the previous cell
X_train_vect = vectorizer.fit_transform(X_train)
# Transforming the test data with the fitted count vectorizer
X_test_vect = vectorizer.transform(X_test)
# Outputting the dimensions of train and test
print("Dimensions of the vectorized train data: ", X_train_vect.shape)
print("Dimensions of the vectorized test data: ", X_test_vect.shape)

Dimensions of the vectorized train data:  (257, 338)
Dimensions of the vectorized test data:  (65, 338)
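
Both vectorizers report the same dimensions because they build the same vocabulary; only the cell values differ. The sketch below illustrates this on three inputs from the data frame preview. It assumes sklearn's built-in `'english'` stop word list in place of the nltk list, and the names `count_vect` and `tfidf_vect` are introduced here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A few inputs taken from the data frame preview
texts = ["Snakes are aggressive", "Do snakes like to bite?", "Snakes chase people"]

# Assumption: sklearn's built-in 'english' stop word list replaces the nltk list
count_vect = CountVectorizer(stop_words="english")
tfidf_vect = TfidfVectorizer(stop_words="english")

counts = count_vect.fit_transform(texts).toarray()
tfidf = tfidf_vect.fit_transform(texts).toarray()

# Both vectorizers learn the same vocabulary, so the matrix shapes match
print(sorted(count_vect.vocabulary_))
print(counts.shape, tfidf.shape)

# Count rows hold raw term counts; tf-idf rows are reweighted and L2-normalized
row_norms = np.linalg.norm(tfidf, axis=1)
print(counts[0])
print(tfidf[0].round(2))
```

Since the second cell reuses the names `X_train_vect` and `X_test_vect`, whichever vectorizer cell ran last determines what the models below are trained on.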


## Bernoulli Naive Bayes


In [6]:
# Importing the Bernoulli Naive Bayes model from sklearn
from sklearn.naive_bayes import BernoulliNB
# Creating Bernoulli Naive Bayes model
b_nb_1 = BernoulliNB()
b_nb_1.fit(X_train_vect, y_train)
# Printing the accuracy on the training data
print("Accuracy on Training Data: ", b_nb_1.score(X_train_vect, y_train))

# Testing and evaluating
b_nb_1_pred = b_nb_1.predict(X_test_vect)

# Importing tools from sklearn for evaluation
from sklearn.metrics import confusion_matrix, classification_report
print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, b_nb_1_pred))
print("\nClassification Report:")
print(classification_report(y_test, b_nb_1_pred))

Accuracy on Training Data:  0.07392996108949416

Results on testing data:

Confusion matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.00      0.00      0.00         2
              Body       0.00      0.00      0.00         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       0.00      0.00      0.00         1
  Dangerous Snakes       0.00      0.00      0.00         1
              Deaf       0.00      0.00      0.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       0.00      0.00      0.00         1
    Dislocate Jaws       0.00      0.00      0.00         1
               Eat       0.00      0.00   

  _warn_prf(average, modifier, msg_start, len(result))
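
The `_warn_prf` lines are most likely the tail of sklearn's warning that precision is undefined for categories the model never predicts. A minimal sketch of one way to handle this, assuming a sklearn version that supports the `zero_division` argument of `classification_report`; the labels below are hypothetical and chosen only to trigger the case.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: "Deaf" appears in the truth but is never predicted
y_true = ["Behavior", "Behavior", "Body", "Deaf"]
y_pred = ["Behavior", "Body", "Body", "Body"]

# zero_division=0 reports the undefined precision as 0.0 instead of warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(classification_report(y_true, y_pred, zero_division=0))
```

The scores themselves do not change; the argument only makes the handling of never-predicted categories explicit.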


## Logistic Regression

### With lbfgs

In [41]:
# Importing our Model
from sklearn.linear_model import LogisticRegression
# Training our model, using lbfgs solver
logreg1 = LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=1234, class_weight='balanced', max_iter=3000000)
logreg1.fit(X_train_vect, y_train)

# Printing the accuracy on the training data
print("Accuracy on Training Data: ", logreg1.score(X_train_vect, y_train))

# Testing and evaluating
logreg1_pred = logreg1.predict(X_test_vect)

print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, logreg1_pred))
print("\nClassification Report:")
print(classification_report(y_test, logreg1_pred))

Accuracy on Training Data:  0.9299610894941635

Results on testing data:

Confusion matrix:
[[1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 2]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.50      0.50      0.50         2
              Body       0.00      0.00      0.00         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       1.00      1.00      1.00         1
  Dangerous Snakes       0.50      1.00      0.67         1
              Deaf       1.00      1.00      1.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       1.00      1.00      1.00         1
    Dislocate Jaws       0.50      1.00      0.67         1
               Eat       1.00      0.50    

  _warn_prf(average, modifier, msg_start, len(result))


### With liblinear

In [43]:
# Training our model, using liblinear solver
logreg2 = LogisticRegression(solver='liblinear', random_state=1234, class_weight='balanced', C=10, max_iter=5000)
logreg2.fit(X_train_vect, y_train)
# Printing the accuracy on the training data
print("Accuracy on Training Data: ", logreg2.score(X_train_vect, y_train))

# Testing and evaluating
logreg2_pred = logreg2.predict(X_test_vect)

print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, logreg2_pred))
print("\nClassification Report:")
print(classification_report(y_test, logreg2_pred))

Accuracy on Training Data:  0.980544747081712

Results on testing data:

Confusion matrix:
[[2 0 0 ... 0 0 0]
 [0 3 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 2]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.67      1.00      0.80         2
              Body       0.30      1.00      0.46         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       1.00      1.00      1.00         1
  Dangerous Snakes       1.00      1.00      1.00         1
              Deaf       1.00      1.00      1.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       1.00      1.00      1.00         1
    Dislocate Jaws       1.00      1.00      1.00         1
               Eat       1.00      0.50     

  _warn_prf(average, modifier, msg_start, len(result))
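
For the chatbot itself, the vectorizer and the classifier have to be applied together to each raw user message. One possible way to package them, sketched here on hypothetical toy data with sklearn's `make_pipeline`; the name `chat_clf`, the training sentences, and the use of sklearn's built-in `'english'` stop word list are assumptions for illustration, not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data in the style of inputs.csv
train_texts = ["Snakes are aggressive", "Snakes chase people",
               "Do snakes lay eggs", "Snakes lay eggs in nests"]
train_labels = ["Behavior", "Behavior", "Lay Eggs", "Lay Eggs"]

# Assumption: sklearn's built-in 'english' stop word list replaces the nltk list
chat_clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000, random_state=1234),
)
chat_clf.fit(train_texts, train_labels)

# The pipeline vectorizes the raw string before classifying it
prediction = chat_clf.predict(["Do all snakes lay eggs?"])
print(prediction[0])
```

A pipeline also guarantees that the test data is transformed with the vocabulary learned from the training data, which the cells above do by hand.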


## Random Forest


In [36]:
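# Importing the random forest classifier from sklearn
from sklearn.ensemble import RandomForestClassifier
# Training our model; oob_score=True also stores an out-of-bag accuracy estimate in clf.oob_score_
clf = RandomForestClassifier(random_state=1234, max_depth=18, max_features='log2', oob_score=True)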
clf.fit(X_train_vect, y_train)

# Printing the accuracy on the training data
print("Accuracy on Training Data: ", clf.score(X_train_vect, y_train))

# Testing and evaluating
clf_pred = clf.predict(X_test_vect)

print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, clf_pred))
print("\nClassification Report:")
print(classification_report(y_test, clf_pred))

Accuracy on Training Data:  0.9299610894941635

Results on testing data:

Confusion matrix:
[[2 0 0 ... 0 0 0]
 [0 3 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 0 2]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.50      1.00      0.67         2
              Body       0.27      1.00      0.43         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       0.00      0.00      0.00         1
  Dangerous Snakes       1.00      1.00      1.00         1
              Deaf       1.00      1.00      1.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       1.00      1.00      1.00         1
    Dislocate Jaws       1.00      1.00      1.00         1
               Eat       1.00      1.00    

  _warn_prf(average, modifier, msg_start, len(result))


## Neural Networks

### Neural Network 1


In [22]:
# Importing the multi-layer perceptron classifier from sklearn
from sklearn.neural_network import MLPClassifier

# Train our model with two hidden layers (2 units, then 1 unit)
neural_net1 = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(2,1), random_state=1234, max_iter=5000)
neural_net1.fit(X_train_vect, y_train)

# Printing the accuracy on the training data
print("Accuracy on Training Data: ", neural_net1.score(X_train_vect, y_train))

# Testing and evaluating
neural_net1_pred = neural_net1.predict(X_test_vect)

print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, neural_net1_pred))
print("\nClassification Report:")
print(classification_report(y_test, neural_net1_pred))

Accuracy on Training Data:  0.05058365758754864

Results on testing data:

Confusion matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.00      0.00      0.00         2
              Body       0.00      0.00      0.00         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       0.00      0.00      0.00         1
  Dangerous Snakes       0.00      0.00      0.00         1
              Deaf       0.00      0.00      0.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       0.00      0.00      0.00         1
    Dislocate Jaws       0.00      0.00      0.00         1
               Eat       0.00      0.00   

  _warn_prf(average, modifier, msg_start, len(result))
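
One possible reason for the very low training accuracy (an assumption, not something verified on this dataset) is the size of the hidden layers: with `hidden_layer_sizes=(2,1)`, every input is squeezed through a single unit before the output layer has to separate all categories. The sketch below only inspects the weight matrix shapes on hypothetical toy data to make that bottleneck visible; the `layer_shapes` helper and the wider `(32,)` layout are illustrative choices.

```python
import warnings
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: 4 categories, 2 inputs each
texts = ["Snakes are aggressive", "Snakes chase people",
         "Snakes lay eggs", "Do all snakes lay eggs",
         "Snakes are deaf", "Can snakes hear music",
         "Snakes have legs", "Did snakes lose their legs"]
labels = ["Behavior", "Behavior", "Lay Eggs", "Lay Eggs",
          "Deaf", "Deaf", "Legs", "Legs"]
X = CountVectorizer().fit_transform(texts)

def layer_shapes(hidden_layer_sizes):
    """Fit an MLP and return the shape of each weight matrix."""
    net = MLPClassifier(solver="lbfgs", hidden_layer_sizes=hidden_layer_sizes,
                        random_state=1234, max_iter=500)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # ignore convergence warnings on toy data
        net.fit(X, labels)
    return [w.shape for w in net.coefs_]

narrow = layer_shapes((2, 1))  # same layout as neural_net1
wide = layer_shapes((32,))     # hypothetical wider alternative
print(narrow)
print(wide)
```

In the narrow layout the last weight matrix has a single row, so all categories are predicted from one number.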


### Neural Network 2

In [23]:
# Importing the multi-layer perceptron classifier from sklearn
from sklearn.neural_network import MLPClassifier

# Train our model with a single hidden layer of 2 units
# (2,) is a one-element tuple, whereas (2) is just the integer 2
neural_net2 = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(2,), random_state=1234, max_iter=5000)
neural_net2.fit(X_train_vect, y_train)

# Printing the accuracy on the training data
print("Accuracy on Training Data: ", neural_net2.score(X_train_vect, y_train))

# Testing and evaluating
neural_net2_pred = neural_net2.predict(X_test_vect)

print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, neural_net2_pred))
print("\nClassification Report:")
print(classification_report(y_test, neural_net2_pred))

Accuracy on Training Data:  0.708171206225681

Results on testing data:

Confusion matrix:
[[0 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 2]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.00      0.00      0.00         2
              Body       0.00      0.00      0.00         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       0.00      0.00      0.00         1
  Dangerous Snakes       0.50      1.00      0.67         1
              Deaf       0.00      0.00      0.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       1.00      1.00      1.00         1
    Dislocate Jaws       0.00      0.00      0.00         1
               Eat       1.00      0.50     

  _warn_prf(average, modifier, msg_start, len(result))


### Neural Network 3

In [24]:
# Importing the multi-layer perceptron classifier from sklearn
from sklearn.neural_network import MLPClassifier

# Train our model with two hidden layers of 1 unit each
neural_net3 = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(1,1), random_state=1234, max_iter=50000)
neural_net3.fit(X_train_vect, y_train)

# Printing the accuracy on the training data
print("Accuracy on Training Data: ", neural_net3.score(X_train_vect, y_train))

# Testing and evaluating
neural_net3_pred = neural_net3.predict(X_test_vect)

print("\nResults on testing data:\n")
print("Confusion matrix:")
print(confusion_matrix(y_test, neural_net3_pred))
print("\nClassification Report:")
print(classification_report(y_test, neural_net3_pred))

Accuracy on Training Data:  0.1517509727626459

Results on testing data:

Confusion matrix:
[[0 0 0 ... 0 1 0]
 [0 0 0 ... 0 3 0]
 [0 0 0 ... 0 1 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 2 0]
 [0 0 0 ... 0 2 0]]

Classification Report:
                    precision    recall  f1-score   support

          Behavior       0.00      0.00      0.00         2
              Body       0.00      0.00      0.00         3
             Brain       0.00      0.00      0.00         1
               Bye       0.00      0.00      0.00         1
          Creators       0.00      0.00      0.00         1
     Crossbreeding       0.00      0.00      0.00         1
  Dangerous Snakes       0.00      0.00      0.00         1
              Deaf       0.00      0.00      0.00         1
        Definition       0.00      0.00      0.00         1
           Diamond       0.00      0.00      0.00         1
    Dislocate Jaws       0.00      0.00      0.00         1
               Eat       0.50      0.50    

  _warn_prf(average, modifier, msg_start, len(result))
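
Because the classification reports are long, it can be easier to compare the models on one or two summary numbers. A minimal sketch, assuming a hypothetical helper `compare_models` and toy data in place of the real train and test sets; with the notebook's variables it could be called with the fitted vectors and any subset of the models above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import BernoulliNB

def compare_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and collect test accuracy and macro F1 in a data frame."""
    rows = []
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_te, pred),
            "macro_f1": f1_score(y_te, pred, average="macro", zero_division=0),
        })
    return pd.DataFrame(rows).sort_values("macro_f1", ascending=False)

# Hypothetical toy data in the style of inputs.csv
train_texts = ["Snakes are aggressive", "Snakes chase people",
               "Do snakes lay eggs", "Snakes lay eggs in nests"]
train_labels = ["Behavior", "Behavior", "Lay Eggs", "Lay Eggs"]
test_texts = ["Do snakes lay eggs?", "Are snakes aggressive"]
test_labels = ["Lay Eggs", "Behavior"]

vect = CountVectorizer()
results = compare_models(
    {"Bernoulli NB": BernoulliNB(),
     "Logistic Regression": LogisticRegression(max_iter=1000, random_state=1234)},
    vect.fit_transform(train_texts), train_labels,
    vect.transform(test_texts), test_labels,
)
print(results)
```

Macro F1 weights every category equally, which suits a dataset where most categories have only one or two test inputs.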
