# Scikit-Learn Naive Bayes Classifier


Before we can use scikit-learn's Naive Bayes classifier, we need to transform our data into a format that scikit-learn can process. We do this with scikit-learn's `CountVectorizer` object.

First we need to create a `CountVectorizer` object and teach it the vocabulary of the training set. This is done by calling the `.fit()` method and passing it a list of strings containing the words we want it to learn.

```py
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["Training review one", "Second review"])
```

Next, we call its `.transform()` method, passing it a list of strings. It transforms those strings into counts of the trained words.

```py
counts = vectorizer.transform(["one review two review"])
```

`counts` stores a sparse matrix. Calling `counts.toarray()` shows it as `[[1, 2, 0, 0]]` in this case: the number of times each word from the training list appeared in the transformed string. Thus, the word "one" appeared once, the word "review" appeared twice, and neither "second" nor "training" appeared at all. The words are lowercase because `CountVectorizer` lowercases text by default.

How did we know that the 2 corresponds to "review"? You can print `vectorizer.vocabulary_` to see the index that each word corresponds to. The indices follow the alphabetical order of the words.

```py
print(vectorizer.vocabulary_)
# {'training': 3, 'review': 1, 'one': 0, 'second': 2}
```

Note: even though the word "two" was in our new review, there wasn't an index for it in the vocabulary. This is because "two" wasn't in any of the strings passed to the `.fit()` method.
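
The behavior described in the note can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; it reuses the sample strings from above, and the names `row` and `vocab` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["Training review one", "Second review"])

# transform() returns a SciPy sparse matrix; toarray() makes it readable
counts = vectorizer.transform(["one review two review"])
row = counts.toarray()[0]

vocab = vectorizer.vocabulary_

# "two" was never seen by fit(), so it has no index and is not counted
print("two" in vocab)        # False
print(row[vocab["review"]])  # 2
print(int(row.sum()))        # 3 (four words in the string, but "two" is ignored)
```

Looking counts up through `vocabulary_` instead of hard-coding positions keeps the check valid however the indices are assigned.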

## Example: Formatting the Data

```py
from reviews import neg_list, pos_list
from sklearn.feature_extraction.text import CountVectorizer

review = "This crib was amazing"

counter = CountVectorizer()
counter.fit(neg_list + pos_list)  # neg_list and pos_list are already lists of strings
print(counter.vocabulary_)

# the printed array looks like all 0s, but the counts at the indices for
# "this", "crib", "was", and "amazing" should all be 1
review_counts = counter.transform([review])
print(review_counts.toarray())  # convert to a dense array for readable output

# transform the training set
training_counts = counter.transform(neg_list + pos_list)
print(training_counts.toarray())
```

## Using Scikit-Learn MultinomialNB Classifier

Using the `MultinomialNB` classifier involves three steps:

1. Train the classifier with the `.fit()` method, which takes two arguments: an array of data points (the count vectors we prepared above) and an array of labels, one for each data point.

2. Once trained, use the `.predict()` method to predict the labels of new data points. `.predict()` takes a list of points that you want to classify and returns the predicted labels of those points.

3. Use `.predict_proba()` to get the probability of each label for a given point. It takes the same list of points passed to `.predict()`.
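
The three steps can be tried end to end on a tiny dataset. This is a minimal sketch, not the course's data: the four training sentences and their labels are made up for illustration, and the classifier uses scikit-learn's default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = positive review, 0 = negative review
train_texts = [
    "great crib love it",
    "amazing and wonderful crib",
    "awful crib broke quickly",
    "worst crib terrible quality",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

# Step 1: train on the count vectors and their labels
classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# Steps 2 and 3: new text must go through the same fitted vectorizer
new_counts = vectorizer.transform(["wonderful amazing crib"])
predicted = classifier.predict(new_counts)
probabilities = classifier.predict_proba(new_counts)

print(predicted)            # [1]
print(classifier.classes_)  # [0 1], the column order used by predict_proba
print(probabilities)        # approximately [[0.2 0.8]]
```

`classifier.classes_` is worth printing whenever you read `.predict_proba()` output, because it tells you which column belongs to which label.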

### Example Using Scikit-Learn's MultinomialNB 

```py
from reviews import counter, training_counts
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

review = "This crib was amazing"
review_counts = counter.transform([review])

# create your MultinomialNB classifier
classifier = MultinomialNB()

# train the model
# 'training_counts' are our transformed points
# We made the training points by combining neg_list and pos_list, which
# contain 1000 reviews each. So the first 1000 labels should be 0 (negative)
# and the last 1000 labels should be 1 (positive).
training_labels = [0] * 1000 + [1] * 1000
classifier.fit(training_counts, training_labels)

# classify the review and return the predicted labels
print(classifier.predict(review_counts)) # [1]

# The first number printed is the probability that the review is a 0 (bad);
# the second number is the probability that the review is a 1 (good).
print(classifier.predict_proba(review_counts))

# TEST ================================================================================
# review = "This crib was amazing"
# --> .predict() - [1], 
# --> .predict_proba() - [[0.22699537 0.77300463]]

# review = "This crib was great amazing and wonderful"
# --> .predict() - [1], 
# --> .predict_proba() - [[0.04977729 0.95022271]]

# review = "This crib was absolutly awful, worst I have seen"
# --> .predict() - [0], 
# --> .predict_proba() - [[0.67036292 0.32963708]]
```

## Summary

We used scikit-learn's `MultinomialNB` (Naive Bayes) class to create a supervised machine learning classifier. Note:

1. A tagged dataset is necessary to calculate the probabilities used in Bayes' Theorem.

2. In this example, the features of our dataset are the words used in a product review. In order to apply Bayes' Theorem, we assume that these features are independent.

3. Using Bayes' Theorem, we can find P(class|data point) for every possible class. In this example, there were two classes — positive and negative. The class with the highest probability will be the algorithm’s prediction.
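
To make the three points concrete, here is a hand-rolled sketch of the calculation in plain Python. The reviews are invented for illustration, and the add-one smoothing is an assumption chosen to mirror `MultinomialNB`'s default `alpha=1.0`; it is not a description of scikit-learn's internal code.

```python
from collections import Counter
from math import prod

# Hypothetical labeled reviews (the "tagged dataset" from point 1)
reviews = {
    "positive": ["great crib love it", "amazing and wonderful crib"],
    "negative": ["awful crib broke quickly", "worst crib terrible quality"],
}

word_counts = {
    label: Counter(" ".join(texts).split()) for label, texts in reviews.items()
}
vocabulary = set().union(*word_counts.values())
total_reviews = sum(len(texts) for texts in reviews.values())


def posterior(text):
    """Return P(class | text) for every class under the naive assumption."""
    scores = {}
    for label, texts in reviews.items():
        prior = len(texts) / total_reviews
        total_words = sum(word_counts[label].values())
        # Independence assumption (point 2): multiply per-word likelihoods.
        # Add-one smoothing keeps an unseen word from zeroing the product.
        likelihood = prod(
            (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            for word in text.split()
        )
        scores[label] = prior * likelihood
    evidence = sum(scores.values())  # P(text), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}


result = posterior("wonderful amazing crib")
print(result)                       # positive is about 0.8, negative about 0.2
print(max(result, key=result.get))  # positive
```

The class with the largest posterior is the prediction, which is point 3 in miniature.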

We can improve our dataset in the following ways before feeding it into the Naive Bayes classifier:

1. Remove all punctuation from the training set.

2. Lowercase every word in the training set.

3. Use a bigram or trigram model. Right now, the features of a review are individual words. For example, the features of the point "This crib is great" are "This", "crib", "is", and "great". If we used a bigram model, the features would be "This crib", "crib is", and "is great". Using a bigram model makes the assumption of independence more reasonable.

These are Natural Language Processing techniques that can improve the performance of a Naive Bayes classifier.
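
As a rough sketch of how these ideas map onto `CountVectorizer`: with its default settings it already lowercases text and treats punctuation as a separator, and its `ngram_range` parameter switches the features from single words to bigrams. The sample sentence is the one from the list above with an exclamation mark added for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) asks for bigrams only; (1, 2) would keep single words too
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(["This crib is great!"])

features = sorted(bigram_vectorizer.vocabulary_)
print(features)  # ['crib is', 'is great', 'this crib']
```

Note that the default tokenizer also drops one-character words such as "a" or "I", so bigrams are built from the remaining tokens.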