<a href="https://colab.research.google.com/github/EleTP/PracticaTextMining/blob/master/AI_Saturdays_Session_6_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Saturdays: Session 6 - NLP

Session 6 - **Natural Language Processing**

Valencia - 16/11/2019


---




![Saturdays logo](https://cdn-images-1.medium.com/max/718/1*e-i4CFTO6-ypIXccycTEBg@2x.png)





## 0.1. Goal

- Understand the Machine Learning-based approaches to **Natural Language Processing (NLP)**.
    - Analysis of how the algorithms work
    - Feature extraction: going from unstructured data to ML-ready data
- Analysis of the new business cases that NLP opens up



## 0.2. About me
Pablo González Carrizo ([@unmonoqueteclea](https://twitter.com/unmonoqueteclea/))

- My most recent photo:

<center>
<img src=https://unmonoqueteclea.github.io/assets/images/pequeno2.jpg width="300">
</center>


- MSc in **Telecommunications Engineering**
- Democratizing Machine Learning, as a **Machine Learning Engineer**, at [BigML](https://bigml.com/)

<center>
<img src=https://static.bigml.com/static/img/bigml.png width="200">
</center>

Info and contact:
  - https://unmonoqueteclea.github.io
  - pgonzalezcarrizo@gmail.com






## 0.3. Agenda


### **Spoiler 1**

Before lunch we will have a working system that can tell whether an IMDB movie review is positive or negative with an accuracy close to 90%.


<center>
<img src=https://i.gifer.com/V8zA.gif width="400">
</center>



### **Spoiler 2**

At the end there will be a competition and **prizes** for the best entries


<center>
<img src=https://media.giphy.com/media/kKo2x2QSWMNfW/giphy.gif width="400">
</center>

### We will talk about...
- Sentiment Analysis with **IMDB** corpus and **Logistic Regression**
- Your NLP algorithm in a **spreadsheet**
- More NLP techniques
- The Grand Challenge
- Whatever you want






## 0.4. Tips
- You do **not** need to run the notebook at the same time as me or copy every line of code I add
  - `Learning by doing` is great, but it means more than typing and running lines of code against the clock without knowing what we are doing.
  - It is better to start by just listening and trying to **understand** the concepts
  - I will leave time for you to run the notebooks so that we can understand and analyze the results together

  <center>
<img src=https://i.gifer.com/1FA.gif width="500">
</center> 

- Programming is a **means**, not an **end**, for doing Machine Learning (and it is not the only one; have I told you about `BigML` yet?)
  - What we want to learn today is how to apply Machine Learning techniques to text, not how to program the algorithms that do it
  - Something important to remember in ML: `We do not need to reinvent the wheel every time`

- This is the **last** theory session
  - We have time to discuss the doors that Machine Learning can open for us, the problems we see in it, the difficulties of deploying it, etc.
  - A **dialogue** is better than a **monologue**: ask, propose, **challenge me**, be critical

   <center>
<img src=https://i.gifer.com/ZPIA.gif width="500">
</center> 

## 0.5 In action

<center>
<img src=https://i.gifer.com/3wv4.gif width="500">
</center> 

### 0.5.1. Sorry, we are not going to learn Neuro-linguistic programming
- Open Google
- Type "`NLP`"
- You will see a lot of results about **Neuro-linguistic programming (NLP)**
- Nothing to do with our NLP: **Natural Language Processing**




### 0.5.2 Last month in NLP...
- Today is 16/11/2019

<center>
<img src=https://i.gifer.com/1y8s.gif width="500">
</center> 

#### 21/10/2019 - FB explains its new features to protect the 2020 US Elections
  - Some of them using NLP techniques
  - NLP helped fight voter suppression and intimidation
  - [View more](https://newsroom.fb.com/news/2019/10/update-on-election-integrity-efforts/?utm_campaign=Artificial%2BIntelligence%2BWeekly&utm_medium=web&utm_source=Artificial_Intelligence_Weekly_129)
  - Why don't they apply the same to hate speech?
  - And to fake news?

#### 26/10/2019 - Google explains BERT: its latest big release for the search engine

- One of the biggest changes in the **search engine**
- NLP to understand searches better and offer better results
  - From matching keywords to understanding the context of each word
- [View more](https://www.blog.google/products/search/search-language-understanding-bert/)

#### 05/11/2019 - GPT-2 Released
  - See how a modern neural network completes your text
  - Model released two weeks ago
    - Can it be **dangerous**?
    [See this](https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction)
  - [Talk to transformer](https://talktotransformer.com/)
    - "The most difficult part of Machine Learning is"
    - "The 2019 spanish elections"
    - https://twitter.com/unmonoqueteclea/status/1193996445784911872

## 1. Setup

You will need  **fastai 0.7.0**

### 1.1. Installing FastAI Library

In [0]:
print (" Installing FastAI libraries...")
!pip install fastai==0.7.0 > /dev/null
print ("\n Installing required libraries...")
!pip install torchtext==0.2.3 > /dev/null    # Corrects torch error, accepts Float32 type
!git clone https://github.com/fastai/fastai.git fastai_ml
!ln -s fastai_ml/courses/ml1/fastai/ fastai


 Installing FastAI libraries...
ERROR: torchvision 0.4.2+cu100 has requirement torch==1.3.1, but you'll have torch 0.3.1 which is incompatible.

 Installing required libraries...
Cloning into 'fastai_ml'...
remote: Enumerating objects: 31882, done.
remote: Total 31882 (delta 0), reused 0 (delta 0), pack-reused 31882
Receiving objects: 100% (31882/31882), 434.60 MiB | 39.41 MiB/s, done.
Resolving deltas: 100% (23219/23219), done.
Checking out files: 100% (815/815), done.


In [0]:
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag

platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\10/'    
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'
version='1.0.0'
torch_url=f"http://download.pytorch.org/whl/{accelerator}/torch-{version}-{platform}-linux_x86_64.whl"
!pip install -U {torch_url} torchvision

Collecting torch==1.0.0
  Downloading http://download.pytorch.org/whl/cu100/torch-1.0.0-cp36-cp36m-linux_x86_64.whl (753.6MB)
     |████████████████████████████████| 753.6MB 96.1MB/s
Requirement already up-to-date: torchvision in /usr/local/lib/python3.6/dist-packages (0.4.2+cu100)
ERROR: torchvision 0.4.2+cu100 has requirement torch==1.3.1, but you'll have torch 1.0.0 which is incompatible.
ERROR: fastai 0.7.0 has requirement torch<0.4, but you'll have torch 1.0.0 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 0.3.1
    Uninstalling torch-0.3.1:
      Successfully uninstalled torch-0.3.1
Successfully installed torch-1.0.0


### 1.2. Main imports

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
from fastai.nlp import *
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import text 

In [0]:
import warnings
warnings.filterwarnings("ignore")

## 2. IMDB dataset and the sentiment classification task

The [large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) contains a collection of 50,000 reviews from IMDB, with an equal number of positive and negative reviews. The authors considered only highly polarized reviews:

  - A negative review has a score ≤ 4 out of 10
  - A positive review has a score ≥ 7 out of 10. 
  - Neutral reviews are not included in the dataset. 
  
The dataset is divided evenly into training and test sets, each with 25,000 labeled reviews.

The **sentiment classification task** consists of predicting the polarity (positive or negative) of a given text.



### 2.0. Download data

In [0]:
!mkdir data/
!curl http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz  --output data/aclImdb_v1.tar.gz 
!gunzip data/aclImdb_v1.tar.gz;
!tar -xvf data/aclImdb_v1.tar -C data/  > /dev/null

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  65.7M      0  0:00:01  0:00:01 --:--:-- 65.7M


### 2.1. Tokenizing and term document matrix creation

#### 2.1.1 Understanding data

In [0]:
PATH='data/aclImdb/'
names = ['neg','pos']

In [0]:
%ls {PATH}

imdbEr.txt  imdb.vocab  README  test/  train/


In [0]:
%ls {PATH}train

labeledBow.feat  pos/    unsupBow.feat  urls_pos.txt
neg/             unsup/  urls_neg.txt   urls_unsup.txt


Each review is stored in its own file.

Knowing the label of a review is as easy as knowing its parent folder (`pos` or `neg`).

In [0]:
%ls {PATH}train/pos | head

0_9.txt
10000_8.txt
10001_10.txt
10002_7.txt
10003_8.txt
10004_8.txt
10005_7.txt
10006_7.txt
10007_7.txt
10008_7.txt


In [0]:
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)

What does `texts_labels_from_folders` do?
Let's check:

In [0]:
??texts_labels_from_folders
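If the source is not at hand, the following is a rough sketch of what such a helper does, assuming it simply reads every file under each class folder and uses the folder's position in `names` as the label. The function name `texts_labels_from_folders_sketch` and the demo directory are hypothetical, and the real fastai implementation may differ in its details.

```python
import os
import tempfile
from glob import glob

import numpy as np


def texts_labels_from_folders_sketch(path, folders):
    """Read every file under path/<folder>/ and label it with the folder's index."""
    texts, labels = [], []
    for idx, label in enumerate(folders):
        for fname in sorted(glob(os.path.join(path, label, "*.*"))):
            with open(fname, "r", encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(idx)
    return texts, np.array(labels, dtype=np.int64)


# Tiny demo on a throwaway directory that mimics the aclImdb layout.
demo_root = tempfile.mkdtemp()
for folder, review in [("neg", "Awful film."), ("pos", "Loved it!")]:
    os.makedirs(os.path.join(demo_root, folder))
    with open(os.path.join(demo_root, folder, "0_1.txt"), "w", encoding="utf-8") as f:
        f.write(review)

demo_texts, demo_labels = texts_labels_from_folders_sketch(demo_root, ["neg", "pos"])
print(demo_texts, demo_labels)
```

With `names = ['neg','pos']`, this convention gives label 0 to negative reviews and label 1 to positive ones.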

Here is the text of the first review

In [0]:
trn[0]

"Oh, it's the movie - I thought I waited too long to take out the dog... I can't believe I watched the whole thing. I guess I was optimistically anticipating that it was going to get better. Horribly disjointed dialog, pathetic acting, and totally improbable events. Like Toby's mom hanging herself in the time it takes Col to walk upstairs and back down in a room with a 24' ceiling and no chairs, counters or anything around her motionlessly suspended body that she could have possibly used to climb on to do herself in. The little girl that played the daughter of the last family was the best actor in the whole movie, and the puppy of the first couple was a close second. The basic storyline has potential and with a good script and director could be a seriously creepy flick, but this version sadly is not it. I get more scared when I open my electric bill every month."

In [0]:
trn_y[0]

0

#### 2.1.2. Creating feature vectors

[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`).

In [0]:
veczr = CountVectorizer(tokenizer=tokenize)

In [0]:
??tokenize

See more: https://github.com/fastai/fastai/blob/master/old/fastai/text.py

`fit_transform(trn)` finds the vocabulary in the training set. It also transforms the training set into a term-document matrix. 

Since we have to apply the *same transformation* to the validation set, the second line uses just the method `transform(val)`. 

`trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document i as a row that holds, for each word in the vocabulary, the number of times it appears in that document.

In [0]:
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

**QUESTION 1**: What happens if some of the words in the validation set don't appear in the training dataset?

**ANSWER**: Tokenizers often map such words to a special **unknown** token. scikit-learn's `CountVectorizer` simply ignores words that were not seen during `fit`.
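A toy run can make the `fit_transform` / `transform` split concrete. This is a minimal sketch on a made-up three-word corpus, using `CountVectorizer`'s default tokenizer rather than fastai's `tokenize`; all the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["good movie", "bad movie"]
toy_val = ["good film"]  # "film" never appears in the training texts

toy_veczr = CountVectorizer()
toy_trn_term_doc = toy_veczr.fit_transform(toy_train)  # learns the vocabulary
toy_val_term_doc = toy_veczr.transform(toy_val)        # reuses that vocabulary

print(sorted(toy_veczr.vocabulary_.items()))  # word -> column index
print(toy_trn_term_doc.toarray())
print(toy_val_term_doc.toarray())  # only the "good" column is non-zero
```

The validation row has no column for "film", because the vocabulary was fixed when the vectorizer was fitted on the training texts.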

In [0]:
trn_term_doc

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

**QUESTION 2**: What do these dimensions mean? 

We store it as a **sparse matrix**. 
It only stores the non-zero entries, which is more efficient for this kind of matrix.
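To get a feel for how sparse these matrices are, here is a small sketch with a hand-made 3x4 matrix, followed by the density of the real training matrix computed from the figures printed above. The variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 2, 0, 0],
                  [0, 0, 0, 1],
                  [3, 0, 0, 0]])
sparse = csr_matrix(dense)

print(sparse.shape)  # (rows, columns) = (documents, vocabulary size)
print(sparse.nnz)    # number of stored (non-zero) entries
print(sparse.nnz / (sparse.shape[0] * sparse.shape[1]))  # density

# Density of the real training matrix: stored elements / (rows * columns)
imdb_density = 3749745 / (25000 * 75132)
print(imdb_density)
```

Only about 0.2% of the entries of the IMDB term-document matrix are non-zero, which is why a dense array would waste most of its memory on zeros.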

In [0]:
trn_term_doc[0]

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 117 stored elements in Compressed Sparse Row format>

Each word is mapped to a number. 

In [0]:
vocab = veczr.get_feature_names(); 
vocab[13000:13005]

['clustering', 'clutch', 'clutches', 'clutching', 'clutter']

We can use `vocabulary_` to obtain the id of a word

In [0]:
veczr.vocabulary_['absurd']

1297

- 1297 is the `id` of the word `absurd`
- 13000 is the `id` of the word `clustering`
- **QUESTION 3**: What are we doing below?

In [0]:
trn_term_doc[0,1297]

0

In [0]:
trn_term_doc[1,13000]

0

### 2.2. Logistic regression

We can fit a logistic regression using the matrices calculated previously as features.

In [0]:
def print_accuracy(preds,ground):
  accuracy = (preds==ground).sum()/preds.shape[0]
  print("The accuracy is: {}%".format(100*accuracy))
  return accuracy


In [0]:
m = LogisticRegression(dual=True)
m.fit(trn_term_doc, trn_y)
preds = m.predict(val_term_doc)
accuracy = print_accuracy(preds,val_y)

The accuracy is: 87.148%


- **QUESTION 4**: What are we doing below to increase accuracy?
  - Test `np.sign()` with different numbers: -4, -2, 1, 3, 10, etc.

In [0]:
m = LogisticRegression(dual=True)
m.fit(trn_term_doc.sign(), trn_y)
preds = m.predict(val_term_doc.sign())
accuracy = print_accuracy(preds,val_y)

The accuracy is: 87.384%


In [0]:
help(val_term_doc.sign)

Help on method sign in module scipy.sparse.data:

sign() method of scipy.sparse.csr.csr_matrix instance
    Element-wise sign.
    
    See numpy.sign for more information.
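As a quick check of what `sign` does, the sketch below applies it first to a few plain numbers and then to a small hand-made count matrix (the matrix values are invented for illustration).

```python
import numpy as np
from scipy.sparse import csr_matrix

signs = np.sign([-4, -2, 0, 1, 3, 10])
print(signs)  # -1 for negatives, 0 for zero, 1 for positives

counts = csr_matrix(np.array([[0, 3, 1],
                              [2, 0, 0]]))
binarized = counts.sign()
print(binarized.toarray())  # counts become presence/absence flags
```

Because word counts are never negative, applying `sign` to a term-document matrix only records whether each word appears in a document, not how often.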



We can play with **regularization** to avoid overfitting...

See [04:49]: https://www.coursera.org/lecture/ml-classification/visualizing-effect-of-l2-regularization-in-logistic-regression-1VXLD
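In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so smaller values regularize more. The sketch below illustrates the effect on a synthetic dataset (generated with `make_classification`, not the IMDB data) by comparing the size of the learned coefficients for three values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the IMDB matrices are not needed for this illustration.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    coef_norms[C] = np.linalg.norm(model.coef_)
    print(C, coef_norms[C])
```

Stronger regularization (smaller `C`) pulls the coefficients towards zero, which limits how much the model can rely on any single word.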

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc, trn_y)
preds = m.predict(val_term_doc)
accuracy = print_accuracy(preds,val_y)

The accuracy is: 88.28%


In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), trn_y)
preds = m.predict(val_term_doc.sign())
accuracy = print_accuracy(preds,val_y)

The accuracy is: 88.40400000000001%


88.4% accuracy!!

![alt text](https://media.giphy.com/media/xT0GqssRweIhlz209i/giphy.gif)

### 2.3. Your NLP model in a spreadsheet

See here:

https://docs.google.com/spreadsheets/d/1hZcfAKBp4BqvefEFk0pOhbHxVb7sX7tz6QxbnUL8kSE/edit?usp=sharing

![link text](https://media.giphy.com/media/SYQIWpavmTyta4nQhK/giphy.gif)

### 2.4. Naive Bayes

We define the **log-count ratio** $r$ for each word $f$:

$r = \log \frac{\text{ratio of feature $f$ in positive documents}}{\text{ratio of feature $f$ in negative documents}}$

where the ratio of feature $f$ in positive documents is the number of times $f$ appears in positive documents divided by the number of positive documents (the code below adds 1 to both counts for smoothing).

Working with logs is convenient because products of probabilities become sums.

In [0]:
# The probability of a word given a class
def pr(x,y,y_i):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [0]:
# Ratios (for each word) between the probability for the
# class 1 and the probability for the class 0
r = np.log(pr(trn_term_doc, trn_y,1)/pr(trn_term_doc, trn_y,0))
b = np.log((trn_y==1).mean() / (trn_y==0).mean())
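The same computation can be followed by hand on a tiny example. This sketch uses a made-up dense matrix with four documents and two words (think of them as "great" and "awful"); `toy_pr` mirrors `pr` above but is written for a plain NumPy array.

```python
import numpy as np

# Hypothetical term-document counts: 4 documents, 2 words ("great", "awful").
toy_x = np.array([[2, 0],
                  [1, 0],
                  [0, 1],
                  [0, 3]])
toy_y = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative


def toy_pr(x, y, y_i):
    # Same add-one smoothing as pr above, written for a dense array.
    p = x[y == y_i].sum(axis=0)
    return (p + 1) / ((y == y_i).sum() + 1)


toy_r = np.log(toy_pr(toy_x, toy_y, 1) / toy_pr(toy_x, toy_y, 0))
toy_b = np.log((toy_y == 1).mean() / (toy_y == 0).mean())
print(toy_r)  # positive for "great", negative for "awful"
print(toy_b)  # 0.0 because the classes are balanced
```

Here the smoothed ratios are 4 and 0.2, so the log-count ratios are roughly 1.39 and -1.61: the sign tells which class a word points to and the magnitude how strongly.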

Here is the formula for Naive Bayes.

In [0]:
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.81656
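The decision rule used in that cell can be read as "add up the log-count ratios of the words in the document, add the bias, and predict positive if the total is above zero". A minimal sketch with invented ratios for two words:

```python
import numpy as np

# Hypothetical log-count ratios for two words and a zero bias (balanced classes).
toy_r = np.log(np.array([4.0, 0.2]))
toy_b = 0.0

# Two validation documents: the first only uses word 0, the second only word 1.
toy_val = np.array([[3, 0],
                    [0, 2]])

scores = toy_val @ toy_r + toy_b  # weighted sum of the ratios of the words present
toy_preds = scores > 0            # positive score -> predict the positive class
print(scores, toy_preds)
```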

...and binarized Naive Bayes.

`.sign()` maps every positive count to 1 and leaves zeros as 0 (negative values would become -1, but counts are never negative), so the features only record whether a word is present.

In [0]:
pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T>0
(preds==val_y).mean()

0.83184

### 2.5. Trigram with NB features

Our next model is a version of logistic regression with Naive Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A logistic regression model is then trained to predict sentiment.

In [0]:
veczr =  CountVectorizer(ngram_range=(1,3), tokenizer=tokenize, max_features=800000)
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)

In [0]:
trn_term_doc.shape

(25000, 800000)
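To see what `ngram_range=(1,3)` adds to the vocabulary, here is a small sketch on a single made-up sentence, using the default `CountVectorizer` tokenizer instead of fastai's `tokenize`.

```python
from sklearn.feature_extraction.text import CountVectorizer

ngram_veczr = CountVectorizer(ngram_range=(1, 3))
ngram_veczr.fit(["not good at all"])

# Print unigrams first, then bigrams, then trigrams.
for ngram in sorted(ngram_veczr.vocabulary_, key=lambda s: (len(s.split()), s)):
    print(ngram)
```

Features such as "not good" let a linear model capture negation that single words miss, at the cost of a much larger vocabulary (hence `max_features`).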

Here we fit a regularized logistic regression where the features are the binarized unigram, bigram and trigram counts.

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc.sign(), trn_y);
preds = m.predict(val_term_doc.sign())
accuracy = print_accuracy(preds,val_y)

The accuracy is: 90.5%


Let's use **log_count_ratio** as features

In [0]:
r = np.log(pr(trn_term_doc, trn_y,1) / pr(trn_term_doc, trn_y,0))
b = np.log((trn_y==1).mean() / (trn_y==0).mean())

Here is the $\text{log-count ratio}$ `r`.  

In [0]:
r.shape, r

((1, 800000),
 matrix([[-0.07511, -0.00034,  0.10548, ...,  1.38629, -2.07944, -2.07944]]))

Here we fit a regularized logistic regression where the features are the n-gram counts scaled by their log-count ratios.

In [0]:
x_nb = trn_term_doc.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, trn_y);

val_x_nb = val_term_doc.multiply(r)
preds = m.predict(val_x_nb)
accuracy = print_accuracy(preds,val_y)

The accuracy is: 91.556%
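The `multiply` call scales each column of the term-document matrix by the log-count ratio of the corresponding n-gram. A minimal sketch with a hand-made matrix and invented ratios:

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_term_doc = csr_matrix(np.array([[1, 0, 2],
                                    [0, 3, 0]]))
toy_r = np.array([[0.5, -1.0, 2.0]])  # one hypothetical ratio per column

# Element-wise product; the single row of ratios is broadcast over all documents.
toy_x_nb = csr_matrix(toy_term_doc.multiply(toy_r))
print(toy_x_nb.toarray())
```

Zeros stay zero, so the result is as sparse as the original matrix, while each non-zero entry now carries the Naive Bayes evidence for its n-gram.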


### 2.6. fastai NBSVM++

In [0]:
sl=2000

In [0]:
# Here is how we get a model from a bag of words
md = TextClassifierData.from_bow(trn_term_doc, trn_y, val_term_doc, val_y, sl)

In [0]:
learner = md.dotprod_nb_learner()
learner.fit(0.02, 1, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>   
    0      0.024123   0.119538   0.91672   



[0.11953848763465881, 0.9167200000190735]

In [0]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>   
    0      0.020608   0.113216   0.92176   
    1      0.011949   0.111858   0.92132   



[0.11185849909305573, 0.9213200000572205]

In [0]:
learner.fit(0.02, 2, wds=1e-6, cycle_len=1)


epoch      trn_loss   val_loss   <lambda>   
    0      0.016833   0.110698   0.92204   

# 3. More NLP techniques

## 3.1. Stop word removal

- Very useful when the stop words appear more in one of the classes

In [0]:
list(text.ENGLISH_STOP_WORDS)[:10]

In [0]:
veczr =  CountVectorizer(ngram_range=(1,3), 
                         tokenizer=tokenize, 
                         max_features=800000,
                         stop_words="english")
trn_term_doc = veczr.fit_transform(trn)
val_term_doc = veczr.transform(val)
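The effect of `stop_words="english"` is easy to inspect on a single made-up sentence. This sketch uses the default tokenizer and compares the vocabulary with and without the built-in stop word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["this is the most terrible film"]

with_stops = CountVectorizer().fit(toy_docs)
without_stops = CountVectorizer(stop_words="english").fit(toy_docs)

print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```

It is worth checking which words the list removes before relying on it for a sentiment task.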

## 3.2. Stemming and Lemmatizing

Stemming and lemmatization both reduce inflected words to a root form. The difference is that a stem might not be an actual word, whereas a lemma is always a real word of the language.



In [0]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")

In [0]:
print(stemmer.stem("foot"))
print(stemmer.stem("feet"))

In [0]:
print(stemmer.stem("play"))
print(stemmer.stem("plays"))
print(stemmer.stem("played"))

In [0]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmer=WordNetLemmatizer()

In [0]:
print(lemmer.lemmatize("foot"))
print(lemmer.lemmatize("feet"))

In [0]:
print(lemmer.lemmatize("play" ))
print(lemmer.lemmatize("plays"))
print(lemmer.lemmatize("played"))

In [0]:
print(lemmer.lemmatize("play", pos='v' ))
print(lemmer.lemmatize("plays", pos='v'))
print(lemmer.lemmatize("played", pos='v'))

In [0]:
print(lemmer.lemmatize("are", pos='v' ))
print(lemmer.lemmatize("is", pos='v'))
print(lemmer.lemmatize("being", pos='v'))

In [0]:
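import re  # regular expressions, used by the tokenizer

def stemming_tokenizer(str_input):
    # Keep letters, digits and hyphens, lowercase the text and split on whitespace
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    words = [stemmer.stem(word) for word in words]  # reduce each token to its stem
    return words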

In [0]:
cv =  CountVectorizer(ngram_range=(1,3), tokenizer=stemming_tokenizer, max_features=800000)

In [0]:
trn_term_doc = cv.fit_transform(trn)
val_term_doc = cv.transform(val)

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc, trn_y);
preds = m.predict(val_term_doc)
accuracy = print_accuracy(preds,val_y)

## 3.3. TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) down-weights common words that occur in almost all documents and gives more importance to words that appear in only a subset of them. It penalises common words by assigning them lower weights, while rarer words that are frequent within a particular document receive higher weights.

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer
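# use_idf=True enables the inverse-document-frequency weighting;
# with use_idf=False only normalized term frequencies would be used
tfidf_transformer = TfidfTransformer(smooth_idf=False, use_idf=True)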
trn_term_doc_tfidf = tfidf_transformer.fit_transform(trn_term_doc)
val_term_doc_tfidf = tfidf_transformer.transform(val_term_doc)
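To see the weighting at work, here is a minimal sketch on a made-up count matrix with one word that appears in every document and one that appears in a single document. It uses `TfidfTransformer` with its default settings; the `toy_` names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: column 0 is a word present in every document,
# column 1 is a word present only in the first one.
toy_counts = np.array([[2, 1],
                       [2, 0],
                       [2, 0]])

toy_tfidf = TfidfTransformer()  # use_idf=True by default
toy_weights = toy_tfidf.fit_transform(toy_counts).toarray()

print(toy_tfidf.idf_)  # the rarer word gets the larger IDF
print(toy_weights[0])  # first document after weighting and L2 normalisation
```

Each row is rescaled to unit length, so documents of different lengths become comparable.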

In [0]:
m = LogisticRegression(C=0.1, dual=True)
m.fit(trn_term_doc_tfidf, trn_y);
preds = m.predict(val_term_doc_tfidf)
accuracy = print_accuracy(preds,val_y)

## 3.4. Word embeddings
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

http://vectors.nlpl.eu/explore/embeddings/en/calculator/
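The kind of vector arithmetic offered by that calculator can be imitated with a toy example. The vectors below are invented by hand (two dimensions, loosely "royalty" and "gender") purely for illustration; real embeddings are learned from large corpora and have many more dimensions.

```python
import numpy as np

# Hand-made 2-d vectors (axis 0: royalty, axis 1: gender); not real embeddings.
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([0.1, 0.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# "king" - "man" + "woman": which remaining word is closest to the result?
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
candidates = [w for w in toy_vectors if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(toy_vectors[w], target))
print(nearest)
```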


## 3.5. Deep Learning
### 3.5.1 Recurrent Neural Networks
  - https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  - http://karpathy.github.io/2015/05/21/rnn-effectiveness/
### 3.5.2 Attention networks
  - https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f
### 3.5.3 Transformers
  - https://towardsdatascience.com/transformers-141e32e69591

# 4. The Grand Challenge

### 4.1. Dataset: 20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. 

#### 4.1.1 Setup

Loading the **train** and **test** datasets

In [0]:
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', shuffle=True)
test = fetch_20newsgroups(subset='test', shuffle=True)

These are the possible classes:

In [0]:
train.target_names

Let's look at one instance of the dataset:

In [0]:
print(train.data[3])

In [0]:
print("There are {} instances in the train dataset".format(len(train.data)))
print("There are {} instances in the test dataset".format(len(test.data)))

#### 4.1.2 Feature extraction

Let's extract features from the texts

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train.data)
test_counts = count_vect.transform(test.data)

#### 4.1.3 Create model

In [0]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_counts, train.target)

#### 4.1.4 Evaluation

In [0]:
import numpy as np
predicted = clf.predict(test_counts)
print("Accuracy = {}%".format(100*np.mean(predicted == test.target)))
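One possible way to experiment with the techniques from section 3 is to chain the steps in a scikit-learn `Pipeline`, so that each variant is a one-line change. This is only a sketch on a made-up four-document corpus (the texts and the two classes are invented), not a reference solution for the challenge.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented mini-corpus standing in for the newsgroup posts.
toy_texts = [
    "the team won the game",
    "great goal in the match",
    "the cpu and gpu are fast",
    "new graphics card driver",
]
toy_targets = [0, 0, 1, 1]  # 0 = sport, 1 = hardware

toy_clf = Pipeline([
    ("counts", CountVectorizer()),  # text -> term counts
    ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
    ("nb", MultinomialNB()),        # classifier
])
toy_clf.fit(toy_texts, toy_targets)

toy_predicted = toy_clf.predict(["the match was a great game", "fast gpu driver"])
print(toy_predicted)
```

Swapping a step (for example adding `ngram_range` or `stop_words` to the vectorizer, or replacing the classifier) then only requires changing the corresponding entry.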

## References

* Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Sida Wang and Christopher D. Manning [pdf](https://www.aclweb.org/anthology/P12-2018)