## The 20 newsgroups dataset 

You could think of this as the MNIST of NLP: a small, standard dataset for trying out text classification.

This dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Here I'll only use three of those categories.

In [20]:
from sklearn.datasets import fetch_20newsgroups

# Only load the three newsgroups used in this notebook
categories = ['sci.space',
              'soc.religion.christian',
              'comp.graphics']

twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

In [21]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [22]:
twenty_train.target_names

['comp.graphics', 'sci.space', 'soc.religion.christian']

In [23]:
len(twenty_train.data)

1776

## The first objective would be to assign a vector to each sentence/paragraph. 

For example, we have 2 sentences in a dataset:

```
[
    ["Hello, I am a human",         1],
    ["Farewell, I have to leave",   0]
]
```

The first thing we have to do is find the `vocabulary`, which is simply the set of all the unique words in the dataset. In our case, the vocabulary would look like:

```
vocab = ["Hello", "I", "am", "a", "human", "Farewell", "have", "to", "leave" ]
```
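Here is a minimal sketch of how such a vocabulary could be built by hand. The `tokenize` helper and its regex are my own simplification for illustration (they only strip punctuation), not what scikit-learn does internally.

```python
import re

dataset = [
    ["Hello, I am a human", 1],
    ["Farewell, I have to leave", 0],
]

def tokenize(sentence):
    # Keep runs of word characters, which drops punctuation such as commas.
    return re.findall(r"\w+", sentence)

vocab = []
for sentence, _label in dataset:
    for word in tokenize(sentence):
        if word not in vocab:  # keep only the first occurrence of each word
            vocab.append(word)

print(vocab)
```

Note that "I" shows up in both sentences but only once in the vocabulary.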

The next step is to express the sentences in our dataset as vectors in an `n`-dimensional space, where `n` is the size of the vocabulary.

For example, let us express both sentences in the dataset as vectors, with a 1 wherever the sentence contains the vocabulary word and a 0 otherwise:

\begin{array}{|l|l|l|}
\hline
vocabulary & Hello, \thinspace I \thinspace am \thinspace a \thinspace human & Farewell, \thinspace I \thinspace have \thinspace to \thinspace leave \\ \hline
Hello    & 1                     & 0                        \\ \hline
I        & 1                     & 1                        \\ \hline
am       & 1                     & 0                        \\ \hline
a        & 1                     & 0                        \\ \hline
human    & 1                     & 0                        \\ \hline
Farewell & 0                     & 1                        \\ \hline
have     & 0                     & 1                        \\ \hline
to       & 0                     & 1                        \\ \hline
leave    & 0                     & 1                        \\ \hline
\end{array}
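The same vectors can be computed in a few lines of plain Python. This is just a sketch of the idea: `to_binary_vector` is a hypothetical helper I made up for this example, and the tokenization is a simple regex.

```python
import re

vocab = ["Hello", "I", "am", "a", "human", "Farewell", "have", "to", "leave"]

def to_binary_vector(sentence, vocab):
    # Mark each vocabulary word with 1 if it occurs in the sentence, else 0.
    words = set(re.findall(r"\w+", sentence))
    return [1 if term in words else 0 for term in vocab]

v1 = to_binary_vector("Hello, I am a human", vocab)
v2 = to_binary_vector("Farewell, I have to leave", vocab)

print(v1)  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
print(v2)  # [0, 1, 0, 0, 0, 1, 1, 1, 1]
```

Each vector has exactly `len(vocab)` entries, so every sentence lives in the same 9-dimensional space no matter how long it is.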

In our case we use the `CountVectorizer`, which builds a dictionary of features and transforms documents into feature vectors that can be "read" by a model. Unlike the 0/1 table above, it stores how many times each word occurs in a document.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

## We'll now express all of the sentences in the dataset in the form of vectors in a `len(vocabulary)` dimensional space

A shape of `(a,b)` below means there are a total of `a` vectors (i.e. documents) in the dataset, with all of them expressed in a "vocabulary space" of `b` dimensions.

The example below has 1776 documents, each represented as a vector in a "vocabulary space" of 31121 dimensions.

In [25]:
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(1776, 31121)

In [26]:
len(count_vect.vocabulary_)

31121

## Transforming the count matrix to a normalised tf-idf representation

**tf-idf** means **term-frequency** times **inverse document-frequency**.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative. 
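To make the scaling concrete, here is a rough sketch of tf-idf on a tiny made-up count matrix. The data is hypothetical, and the smoothed idf formula plus the L2 normalisation are what I understand `TfidfTransformer` to do by default, so treat that as an assumption rather than a reference implementation.

```python
import math

# Hypothetical count matrix: rows are documents, columns are terms.
terms = ["the", "moon", "pixel"]
counts = [
    [3, 1, 0],
    [2, 0, 1],
    [4, 0, 0],
]

n_docs = len(counts)

# Document frequency: in how many documents does each term appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed inverse document frequency (assumed default of TfidfTransformer).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight raw counts by idf, then scale the row to unit length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]

for row in tfidf:
    print([round(x, 3) for x in row])
```

"the" appears in every document, so it gets the smallest idf, while "moon" and "pixel" each appear in one document and get boosted relative to their raw counts.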

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer

In [28]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(1776, 31121)

## Training the classifier with `MultinomialNB`

Try to recall Bayes' theorem from class 12 first.

Naive Bayes classifiers assume that, given the class, the value of a particular feature is independent of the value of any other feature.

For example, a fruit may be considered to be an orange if it is orange, round, and about 10 cm in diameter. 

A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an orange, regardless of any possible correlations between the color, roundness, and diameter features. Because the per-feature log-probabilities simply add up, the result is a linear model, which is kind of like a super shallow neural network (I might be wrong).
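The independence assumption is easier to see in code. Below is a small from-scratch sketch of a multinomial naive Bayes classifier on a made-up training set. The data, the `log_posterior` and `predict` helpers, and the add-one smoothing are all my own choices for illustration; this is not how `MultinomialNB` is implemented.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label).
train = [
    (["moon", "orbit", "rocket"], "space"),
    (["orbit", "launch", "moon"], "space"),
    (["pixel", "render", "image"], "graphics"),
    (["image", "shader", "pixel"], "graphics"),
]

class_docs = Counter(label for _, label in train)  # documents per class
word_counts = defaultdict(Counter)                 # word counts per class
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def log_posterior(tokens, label, alpha=1.0):
    # log P(class) plus one independent log P(word | class) term per word.
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in tokens:
        if w not in vocab:
            continue  # ignore words never seen during training
        score += math.log(
            (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        )
    return score

def predict(tokens):
    # Pick the class with the highest (unnormalised) log posterior.
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["moon", "launch"]))
print(predict(["render", "pixel"]))
```

Each word just adds its own term to the score, which is exactly the "naive" part: no term ever looks at which other words are present.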


In [29]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [34]:
docs_new = [
            'what if the moon is flat', 
            "By the pope!",
            "You picked the wrong house fool !",  
            "All we had to do, was follow the damn train, CJ!" 
            ]

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))



'what if the moon is flat' => sci.space
'By the pope!' => soc.religion.christian
'You picked the wrong house fool !' => soc.religion.christian
'All we had to do, was follow the damn train, CJ!' => soc.religion.christian


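The model is definitely **not** the best you'll find, but it works on simple sentences. Note that the last two sentences, which have nothing to do with religion, still end up in `soc.religion.christian`, since the classifier has to pick one of the three categories.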