# Natural Language Processing


In [1]:
import nltk
import numpy as np

#nltk.download()

## Exploring the 20 Newsgroups Dataset

In [2]:
from sklearn.datasets import fetch_20newsgroups

# exploring the available keys
groups = fetch_20newsgroups()
groups.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

In [3]:
# the news groups
print('n_groups:', len(groups.target_names))
print(groups.target_names)

n_groups: 20
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [4]:
# integer encoding of the newsgroup
groups.target

array([7, 4, 4, ..., 3, 1, 8])

In [5]:
# sample data
print('n_posts', len(groups.data), end='\n\n')
print(groups.data[0])

n_posts 11314

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [6]:
# we can see what target this first data element belongs to
groups.target[0]

7

In [7]:
# which belongs to this group
groups.target_names[groups.target[0]]

'rec.autos'

## Visualizations

In [8]:
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')
sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# pass the labels by keyword; recent seaborn versions do not accept them positionally
g = sns.countplot(x=groups.target, color='blue')
plt.title('Distribution of Posts by Type')
plt.show()

<Figure size 640x480 with 1 Axes>

## Feature Selection
### Option 1: Count Vectorizer
* `CountVectorizer` counts how often each word occurs in each document of a corpus. Below is a brief description of some of its less obvious parameters and their meaning.
* `ngram_range`: the minimum and maximum n-gram sizes considered:
    * ex: dan went to the park
        * `(1,2)`: `[dan, went, to, the, park]`, `[dan went, went to, to the, the park]`
* `stop_words`: filters out common words that carry little meaning on their own (e.g. "the", "and")
* `max_features`: keeps only the given number of most frequent terms across the corpus. In our case the features are the words, so with `max_features=500` we consider only the 500 most common words and leave the rest out.
* `binary`: sets nonzero counts to 1
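
To make `ngram_range` and `binary` concrete, here is a minimal pure-Python sketch of the counting they control. The helper `ngram_counts` is hypothetical and is not scikit-learn's implementation: it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer.

```python
from collections import Counter


def ngram_counts(text, ngram_range=(1, 1), binary=False):
    """Count word n-grams in `text` for every n in the inclusive range.

    Hypothetical helper for illustration only; tokenization is a plain
    whitespace split, which is simpler than CountVectorizer's tokenizer.
    """
    tokens = text.lower().split()
    smallest, largest = ngram_range
    counts = Counter()
    for n in range(smallest, largest + 1):
        # slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    if binary:
        # binary=True records presence only, so every nonzero count becomes 1
        counts = Counter({gram: 1 for gram in counts})
    return counts


grams = ngram_counts("dan went to the park", ngram_range=(1, 2))
print(sorted(grams))

repeated = ngram_counts("the park the park")
repeated_binary = ngram_counts("the park the park", binary=True)
print(repeated["the"], repeated_binary["the"])
```

With `(1, 2)` the result contains the five single words plus the four adjacent word pairs from the example above.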


In [9]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1,1),
                     stop_words="english",
                     lowercase=True,
                     max_features=500,
                     binary=False)

# fit_transform takes in a corpus of documents
transformed = cv.fit_transform(groups.data)

# view some of the features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(cv.get_feature_names_out()[50:65].tolist())

['ago', 'agree', 'al', 'american', 'andrew', 'answer', 'anybody', 'apple', 'application', 'apr', 'april', 'area', 'argument', 'armenian', 'armenians']


### Option 2: TF-IDF Vectorizer


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidfv = TfidfVectorizer(sublinear_tf=True,
                         max_df=0.5,
                         stop_words='english',
                         max_features=40000)
transformed = tfidfv.fit_transform(groups.data)

# view some of the features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(tfidfv.get_feature_names_out()[50:65].tolist())

['01000100b', '010235', '0109', '011', '0111', '0114', '0123456789', '0126', '013', '013651', '014', '015', '01580', '01730', '01752']
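
The `TfidfVectorizer` arguments used here are easier to read with the weighting spelled out. The sketch below is a pure-Python approximation, not scikit-learn's code: it assumes the default smoothed idf, `ln((1 + n_docs) / (1 + df)) + 1`, the `sublinear_tf=True` rule that replaces a raw count `tf` with `1 + ln(tf)`, and L2 normalization of each row. The helper `tfidf_rows` and the toy corpus are hypothetical, and tokenization is a plain whitespace split. `max_df=0.5` is not modeled; in scikit-learn it drops terms that occur in more than half of the documents.

```python
import math
from collections import Counter


def tfidf_rows(docs, sublinear_tf=True):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights.

    Hypothetical helper for illustration; assumes smoothed idf and a
    whitespace tokenizer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = []
        for tok in vocab:
            tf = counts[tok]
            if tf and sublinear_tf:
                tf = 1 + math.log(tf)  # dampen repeated occurrences
            weights.append(tf * idf[tok])
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["car car engine", "engine space"]  # toy corpus (assumption)
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first toy document, `car` outweighs `engine` both because it occurs twice and because `engine` appears in every document, which lowers its idf.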


### Improvements with Data Preprocessing

* Notice that many of the features are noise: the vocabulary includes personal names and numbers, which carry little information about a post's topic.
* Below we filter the posts by dropping numbers and names and by lemmatizing each remaining word (reducing it to its dictionary form, e.g. runs -> run).
* Alternatively, we could keep only nouns or other selected parts of speech.

In [14]:
nltk.download('names')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer


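NAMES = set(names.words())
lemmatizer = WordNetLemmatizer()  # build once and reuse for every post

def my_filter(post):
    """Drop non-alphabetic tokens and names, then lowercase and lemmatize."""
    tokens = nltk.tokenize.word_tokenize(post)
    # the names corpus is capitalized ('Andrew'), so compare before lowercasing
    kept = [word for word in tokens if word.isalpha() and word not in NAMES]
    return ' '.join(lemmatizer.lemmatize(word.lower()) for word in kept)


# clean every post in the corpus
cleaned_posts = [my_filter(post) for post in groups.data]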
    
transformed = cv.fit_transform(cleaned_posts)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out()[0:10].tolist())

['able', 'access', 'act', 'action', 'actually', 'address', 'advance', 'ago', 'agree', 'air']


## What Exactly Does `fit_transform` Do in the Context of `CountVectorizer`?

In [None]:
transformed

* `fit_transform` is a `fit` followed by a `transform`. For `CountVectorizer`, `fit` tokenizes the corpus and learns the vocabulary, i.e. the mapping from each term to a column index. `transform` then uses that vocabulary to convert documents into a document-term matrix.
* `fit_transform` combines these two steps: it learns the vocabulary and returns the document-term matrix, stored as a SciPy sparse matrix. Each row is a document, in the order given; each column is one term of the vocabulary learned by `cv`; each entry is the number of times that term occurs in that document.
* It's a very concrete way of summarizing the distribution of features per document.
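
The two steps described above can be checked on a toy corpus. This is a small sketch with made-up documents and a separate vectorizer named `toy_cv` (both are assumptions, chosen so the notebook's `cv` is left untouched); it uses only `fit`, `transform`, `fit_transform`, and the fitted `vocabulary_` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat sat on the mat"]  # toy corpus (assumption)

# step 1: fit learns the vocabulary (term -> column index)
toy_cv = CountVectorizer()
toy_cv.fit(toy_docs)
toy_terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_terms)

# step 2: transform counts each vocabulary term in each document
two_step = toy_cv.transform(toy_docs).toarray()
print(two_step)

# fit_transform does both steps in a single call
one_step = CountVectorizer().fit_transform(toy_docs).toarray()
print(np.array_equal(two_step, one_step))
```

Each row of the printed matrix is one document and each column is one term of `toy_terms`, so the second row has a 2 in the column for `the`.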

In [None]:
len(cleaned_posts)