# Test Naive Bayes Classifier to Classify Qualification Text

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Let's try a Naive Bayes model.

Naive Bayes often generalizes well to unseen data because it is a high-bias, low-variance model.

This means it is not a very complex model and does not often overfit the training set, which makes it a great place for us to start.

In [23]:
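# Load the labeled data, drop incomplete rows, and create data partitions
df = pd.read_csv('./labeled_descriptions.csv', encoding='latin-1')
df = df.dropna()
df.head()

# Hold out 33% of the rows for testing; the fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df.text, df.out, test_size=0.33, random_state=4864
)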

Transform the text into scikit-learn's sparse count-vector format.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_test_counts = count_vect.transform(X_test)
X_train_counts.shape,X_test_counts.shape

((52, 1685), (27, 1685))
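
To make the shape above easier to read, here is a minimal sketch of what `CountVectorizer` does. The two-document corpus and the names `toy_docs`, `toy_vect`, `toy_counts`, and `new_counts` are made up for illustration and are not taken from `labeled_descriptions.csv`. Rows are documents and columns are vocabulary terms, which is why the train and test matrices share the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (illustration only, not the notebook's data)
toy_docs = [
    "bachelor degree in computer science required",
    "we build software for hospitals",
]

toy_vect = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix
toy_counts = toy_vect.fit_transform(toy_docs)

# transform reuses the learned vocabulary; words not seen during fit are ignored
new_counts = toy_vect.transform(["degree in data science"])

print(toy_counts.shape)               # (number of documents, vocabulary size)
print(sorted(toy_vect.vocabulary_))   # the learned terms
print(new_counts.toarray())           # counts for the new document
```

The word "data" never appears in the toy training corpus, so it contributes nothing to `new_counts`. The same happens to unseen words in `X_test`.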

Transform the raw counts into normalized term frequencies to avoid extreme outliers.

In [25]:
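from sklearn.feature_extraction.text import TfidfTransformer
# Fit one transformer on the training counts and reuse it for the test counts
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_freq = tf_transformer.transform(X_train_counts)
X_test_freq = tf_transformer.transform(X_test_counts)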
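
As a rough illustration of the step above: with `use_idf=False` and the default `norm="l2"`, `TfidfTransformer` only rescales each row to unit Euclidean length. The small count matrix below is hypothetical and chosen so the arithmetic is easy to follow: a document that is ten times longer ends up with the same frequency vector.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents, 3 terms; the second is the first repeated 10 times
toy_counts = csr_matrix(np.array([[3, 0, 4],
                                  [30, 0, 40]]))

# use_idf=False skips the inverse-document-frequency weighting;
# the default norm="l2" divides each row by its Euclidean length
toy_freq = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()

print(toy_freq)
```

Both rows come out as `[0.6, 0.0, 0.8]`, since the length of `[3, 0, 4]` is 5.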


Train the model on the training set.

In [26]:
from sklearn.naive_bayes import MultinomialNB
naive = MultinomialNB().fit(X_train_freq, y_train)

In [30]:
round(naive.score(X_test_freq,y_test),2)

0.96

It seems we have achieved great accuracy with our first model: 0.96 on the 27 test rows.

We will keep this in mind, but let's also try a support vector machine classifier for comparison.

In [34]:
from sklearn import svm
vector_machine = svm.SVC()
vector_machine.fit(X_train_freq,y_train)


SVC()

In [38]:
vector_machine.score(X_test_freq,y_test)

1.0

In [62]:
X_train_counts.toarray().shape

(52, 1685)
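
With 27 test rows, the gap between 0.96 and 1.0 is a single sample, so one split is a coarse comparison. One possible check is k-fold cross-validation with the whole preprocessing chain wrapped in a pipeline. The sketch below uses a made-up twelve-row corpus so that it runs on its own; in this notebook the inputs would be `df.text` and `df.out`. The helper `make_model` and the label coding (1 for qualification text, 0 for description text) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-in data (illustration only)
qualification_texts = [
    "bachelor degree in computer science required",
    "five years of experience with python required",
    "must have a master degree in statistics",
    "experience with sql and data modeling required",
    "certification in project management preferred",
    "strong knowledge of machine learning required",
]
description_texts = [
    "our company builds software for hospitals",
    "we are a fast growing startup in berlin",
    "the team works on payment products",
    "join our office in downtown chicago",
    "we offer flexible hours and remote work",
    "our mission is to improve public transport",
]
texts = qualification_texts + description_texts
labels = [1] * len(qualification_texts) + [0] * len(description_texts)


def make_model(clf):
    # Same steps as the notebook: counts, then L2-normalized frequencies, then a classifier
    return make_pipeline(CountVectorizer(), TfidfTransformer(use_idf=False), clf)


scores = {}
for name, clf in [("naive_bayes", MultinomialNB()), ("svm", SVC())]:
    # The vectorizer is refit inside every fold, so no test text leaks into the vocabulary
    scores[name] = cross_val_score(make_model(clf), texts, labels, cv=3)
    print(name, round(scores[name].mean(), 2))
```

The printed means on the toy corpus say nothing about the real data; the point is only the pattern of scoring both models across several folds.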

Wow, it seems we achieved 100% accuracy on our test set.

This is not that surprising, as qualification text should be fairly distinguishable from description text.

This is great: it means we can export our model and use it to help identify qualification text in our scraped data.
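
One way the export could look is sketched below. The classifier alone is not enough, because new text must pass through the same fitted vocabulary, so the sketch bundles the vectorizer, the frequency transform, and the classifier into one pipeline and saves that with `joblib`. The training sentences, the file name `qualification_clf.joblib`, and the temporary directory are placeholders for illustration; `MultinomialNB` could be swapped for `SVC`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in training data (in the notebook: X_train and y_train)
train_texts = [
    "bachelor degree in computer science required",
    "five years of experience with python required",
    "our company builds software for hospitals",
    "we are a fast growing startup in berlin",
]
train_labels = [1, 1, 0, 0]  # assumed coding: 1 = qualification, 0 = description

# Keep preprocessing and classifier together so raw text can be passed in directly
model = make_pipeline(CountVectorizer(), TfidfTransformer(use_idf=False), MultinomialNB())
model.fit(train_texts, train_labels)

# Save to a placeholder path, then load it back as a later script would
model_path = os.path.join(tempfile.mkdtemp(), "qualification_clf.joblib")
joblib.dump(model, model_path)
restored = joblib.load(model_path)

print(restored.predict(["master degree in statistics required"]))
```

A file saved this way is generally only safe to load with the same scikit-learn version that produced it, and only from a trusted source.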