## Scikit-learn: Working with Text Data

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Loading the newsgroups dataset

The dataset is called "Twenty Newsgroups". It is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The cell below loads only four of those categories, from the training subset.

In [2]:

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

### About the dataset
The returned dataset is a `scikit-learn` "bunch": a simple holder object whose fields can be accessed either as Python `dict` keys or as `object` attributes, for convenience.

### Attributes/Keys
`target_names` holds the list of the requested category names

`data` holds the text of every file, one string per document

`target` holds the category of each file, as an integer index into `target_names`
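
As a small illustration of the two access styles, the sketch below builds a toy `Bunch` by hand instead of downloading the real dataset. The field values are made up for the example; only `sklearn.utils.Bunch` itself comes from `scikit-learn`.

```python
from sklearn.utils import Bunch

# Hypothetical toy data standing in for the real newsgroups bunch
toy = Bunch(
    data=["first post text", "second post text"],
    target=[1, 0],
    target_names=["comp.graphics", "sci.med"],
)

# Key access and attribute access return the very same object
print(toy["data"] is toy.data)

# Map each integer target to its category name
labels = [toy.target_names[t] for t in toy.target]
print(labels)
```

The same pattern applies to `twenty_train`: `twenty_train['data']` and `twenty_train.data` refer to the same list.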


In [3]:
print(twenty_train.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [4]:
print(len(twenty_train.filenames)) # Number of files loaded

2257


In [5]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [6]:
# Printing the first 10 files
twenty_train['data'][:10] 

['From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n',
 "From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the probl

In [7]:
# Printing the integer category ids of the first 10 documents
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2], dtype=int64)

In [8]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


### Extracting features from text files
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

A breakdown of how `CountVectorizer` works:

1. **Tokenization**: it first tokenizes the text documents, meaning it splits the text into individual words/terms.

2. **Building the vocabulary**: it then builds a vocabulary of all unique words or terms. Each unique word becomes a feature (a column) in the vectorized representation.

3. **Counting**: for each document, it counts the occurrences of each word in the document and stores these counts in the corresponding matrix cell.
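
The three steps above can be sketched by hand in plain Python. This is only an illustration of the idea, not `CountVectorizer`'s actual implementation: the two toy documents are made up, and the regular expression mimics the default behaviour of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical toy corpus
docs = ["The cat sat on the mat", "The dog chased the cat"]

# 1. Tokenization: lowercase, then keep runs of 2+ word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# 2. Vocabulary: one column index per unique term, in sorted order
unique_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}

# 3. Counting: one row per document, one cell per vocabulary term
counts = []
for tokens in tokenized:
    row = [0] * len(vocabulary)
    for term, n in Counter(tokens).items():
        row[vocabulary[term]] = n
    counts.append(row)

print(vocabulary)
print(counts)
```

The real vectorizer stores the counts in a sparse matrix rather than nested lists, because most cells are zero once the vocabulary grows to tens of thousands of terms.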

In [9]:
# Tokenizing text with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data) # learns the vocabulary and returns the document-term count matrix
X_train_counts.shape

(2257, 35788)
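
The shape reads as (number of documents, vocabulary size). The fitted vectorizer keeps its term-to-column mapping in the `vocabulary_` attribute, and the result is a sparse matrix. A minimal sketch on a made-up two-document corpus (the names `toy_docs`, `toy_vect` and `toy_counts` are illustrative only):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = ["The cat sat on the mat", "The dog chased the cat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

print(toy_counts.shape)            # (documents, vocabulary size)
print(issparse(toy_counts))        # counts are stored sparsely
print(toy_vect.vocabulary_["cat"]) # column index of the term "cat"
print(toy_counts.toarray())        # dense view, fine for tiny examples only
```

Calling `toarray()` on the full newsgroups matrix would allocate roughly 2257 x 35788 cells, so it is best kept for small examples like this one.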

### From occurrences to frequencies

Occurrence count is a good start but there is an issue: 

- longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

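To avoid these potential discrepancies it suffices to compute **no. of occurrences of each word / total no. of words** for each doc. These new features are called `tf`, or term frequencies. A further refinement on top of `tf` is to downscale the weights of words that occur in many documents of the corpus, since they are less informative than words that occur in only a few. This downscaling is called **tf-idf**, or "Term Frequency times Inverse Document Frequency".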
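
To make the arithmetic concrete, here is a small hand computation in plain Python on a made-up count matrix. It is a sketch under stated assumptions: the idf weight used is the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is what `TfidfTransformer` documents as its default. Note also that `TfidfTransformer` by default rescales each row to unit L2 norm rather than dividing by the total word count, so its numbers will differ from the ones computed here.

```python
import math

# Hypothetical counts: 2 documents x 7 vocabulary terms
counts = [
    [1, 0, 0, 1, 1, 1, 2],
    [1, 1, 1, 0, 0, 0, 2],
]

# tf: occurrences of each word divided by the total words in the document
tf = [[c / sum(row) for c in row] for row in counts]

# df: number of documents in which each term appears at least once
n_docs = len(counts)
n_terms = len(counts[0])
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: terms found in every document get the lowest weight (1.0)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# tf-idf: term frequency times inverse document frequency
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

print(df)
print([round(w, 3) for w in idf])
```

Terms that appear in both documents keep their `tf` value unchanged, while terms that appear in only one document are weighted up.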

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

In the example code above, we use

1. `fit(...)` : fit our estimator to the data
2. `transform(...)` : transform our count matrix to a tf (or tf-idf) representation

Instead, we can use

3. `fit_transform(...)` : combine `fit(...)` and `transform(...)` to skip redundant processing
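
A quick way to convince yourself that the two routes agree is to run both on the same small input and compare the results. The sketch below uses a made-up corpus; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus
toy_docs = ["The cat sat on the mat", "The dog chased the cat"]
toy_counts = CountVectorizer().fit_transform(toy_docs)

# Two steps: fit, then transform
two_step = TfidfTransformer().fit(toy_counts).transform(toy_counts)

# One step: fit_transform
one_step = TfidfTransformer().fit_transform(toy_counts)

# Both routes should produce the same matrix
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```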

In [11]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

### Training a classifier

Now that we have our features, we can train a classifier to predict the category of a post. 

Let's start with a `naive Bayes` classifier, which provides a nice baseline for this task. `scikit-learn` includes several variants of this classifier, and the one most suitable for word counts is the multinomial variant.

In [12]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [13]:
# Predicting the categories of new documents

docs_new = ["ChatGPT runs fast on the browser", "Buddhism is not my religion"]

X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
# print(predicted) # OUTPUT: [1, 3]

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'ChatGPT runs fast on the browser' => comp.graphics
'Buddhism is not my religion' => soc.religion.christian
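
Besides the hard label returned by `predict`, `MultinomialNB` also exposes `predict_proba`, which can help judge how confident a prediction is. The sketch below uses a tiny made-up corpus with two classes so that it runs without downloading anything; the documents, labels and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: class 0 = graphics, class 1 = medicine
train_docs = [
    "image rendering with opengl",
    "3d graphics image file",
    "doctor prescribed medicine",
    "patient disease medicine",
]
train_labels = [0, 0, 1, 1]
new_docs = ["graphics image", "medicine for the patient"]

# Same three stages as above: counts -> tf-idf -> naive Bayes
toy_vect = CountVectorizer()
toy_tfidf = TfidfTransformer()
X_toy = toy_tfidf.fit_transform(toy_vect.fit_transform(train_docs))
toy_clf = MultinomialNB().fit(X_toy, train_labels)

# New documents must go through transform (not fit_transform)
X_new = toy_tfidf.transform(toy_vect.transform(new_docs))
proba = toy_clf.predict_proba(X_new)  # one row per document, one column per class
pred = toy_clf.predict(X_new)

for doc, p, label in zip(new_docs, proba, pred):
    print('%r => class %d, probabilities %s' % (doc, label, np.round(p, 3)))
```

Words that never appeared in the training data (here "for" and "the") are simply ignored by the fitted vectorizer.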


### Building a pipeline

`scikit-learn` provides a `Pipeline` class that behaves like a compound classifier. This makes the vectorizer => transformer => classifier sequence easier to work with.

In [14]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [15]:
text_clf.fit(twenty_train.data, twenty_train.target)

In [16]:
text_clf.predict(docs_new)

array([1, 3], dtype=int64)
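
The pipeline output matches the `[1, 3]` obtained earlier from the manual steps. The same check can be sketched on a tiny made-up corpus, which also shows how to reach an individual stage through `named_steps`. The documents, labels and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set: class 0 = graphics, class 1 = medicine
train_docs = [
    "image rendering with opengl",
    "3d graphics image file",
    "doctor prescribed medicine",
    "patient disease medicine",
]
train_labels = [0, 0, 1, 1]
new_docs = ["graphics image", "medicine for the patient"]

# Manual route: each stage fitted and applied by hand
vect = CountVectorizer()
tfidf = TfidfTransformer()
nb = MultinomialNB()
nb.fit(tfidf.fit_transform(vect.fit_transform(train_docs)), train_labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline route: the same stages chained under short names
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(new_docs)

print(np.array_equal(manual_pred, pipe_pred))

# Individual stages remain accessible by the names given above
print(len(pipe.named_steps['vect'].vocabulary_))
```

The step names (`'vect'`, `'tfidf'`, `'clf'`) are arbitrary labels; they matter mainly when you need to look up or configure a specific stage later.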