- Data cleaning / processing / language parsing
- Create features using two different NLP methods: For example, BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.
- Pick one of the models and try to increase accuracy by at least 5 percentage points.
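
The cross-validation step in the list above can be sketched without any library. This is only an illustration under stated assumptions: `stratified_folds` and `train_test_splits` are hypothetical helper names, and the toy labels stand in for the real genre column. In practice scikit-learn's `StratifiedKFold` or `cross_val_score` would likely do this job.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Spread the sample indices of each label evenly across k folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def train_test_splits(labels, k=5):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]

# Toy labels (an assumption): two genres with 10 articles each.
labels = ["news"] * 10 + ["lore"] * 10
splits = list(train_test_splits(labels, k=5))
```

Each model would be fit on the train indices and scored on the held-out indices, and the per-fold scores averaged, so the two feature sets are compared on the same splits.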

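I need to shrink things down for my baby computer. In this case, I keep only the genres that have at least 20 articles and take the first 10 articles from each. I also want just the start of each article rather than the whole thing: the goal is the first 1000 words, although the tokenizing cell below currently slices the first 200 tokens.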
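
A minimal sketch of that downsizing on toy data, assuming the thresholds used in the filtering cell further down (more than 19 files per genre, first 10 kept). `sample_genres` and `truncate` are hypothetical helper names, and the file ids are made up.

```python
from collections import defaultdict

def sample_genres(file_genres, min_files=20, per_genre=10):
    """Keep genres with at least `min_files` files; take the first `per_genre` of each."""
    by_genre = defaultdict(list)
    for fileid, genre in file_genres:
        by_genre[genre].append(fileid)
    return {genre: ids[:per_genre]
            for genre, ids in by_genre.items()
            if len(ids) >= min_files}

def truncate(tokens, limit=1000):
    """Keep only the first `limit` tokens of an article."""
    return tokens[:limit]

# Toy data (an assumption): 25 'news' files and 9 'humor' files.
toy = [("a%02d" % i, "news") for i in range(25)]
toy += [("b%02d" % i, "humor") for i in range(9)]
kept = sample_genres(toy)
```

Here `humor` is dropped because it has fewer than 20 files, and `news` is cut down to its first 10.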

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import brown, stopwords
from nltk import word_tokenize
from collections import Counter

In [2]:
nltk.download('brown')
nltk.download('punkt')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/lukeelliott/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lukeelliott/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [4]:
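# Read cats.txt one line per row into a single column named 'info'.
# Plain file reading is used because newer pandas versions reject sep='\n'.
with open("cats.txt") as f:
    longform = pd.DataFrame({'info': [line for line in f.read().splitlines() if line.strip()]})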

# Grabs first 4 characters of a string
def get_keys(txt):
    return txt[:4]

# Grabs all but first 4 characters of a string
def drop_column_names(txt):
    return txt[4:]

# Takes the raw longform DataFrame, moves the 4-character file id into a
# 'keys' column, and leaves the stripped genre label in 'info'.
# Note: this modifies df in place and also returns it.
def longform_cleaning(df):
    df['keys'] = df['info'].apply(get_keys)
    df['info'] = df['info'].apply(drop_column_names).str.strip()
    return df

labels_df = longform_cleaning(longform)
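
The same split can be sketched in plain Python, which also avoids depending on how `read_csv` handles a newline separator. The line format below (a 4-character file id, a space, then the genre) is an assumption inferred from the slicing above, and `parse_label_line` is a hypothetical name.

```python
def parse_label_line(line):
    """Split one label line into (fileid, genre), assuming a 4-character file id."""
    return line[:4], line[4:].strip()

# Illustrative lines in the assumed format.
sample_lines = ["ca01 news", "cn01 adventure"]
parsed = [parse_label_line(line) for line in sample_lines]
```

Passing `parsed` to `pd.DataFrame(parsed, columns=['keys', 'info'])` would give the same two columns that `longform_cleaning` produces.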

In [5]:
# Keep only genres with at least 20 files, then take the first 10 files of each.
d = {}
list_of_dfs = []
for genre in brown.categories():
    d[genre] = labels_df[labels_df['info'] == genre]
    if len(d[genre]) >= 20:
        list_of_dfs.append(d[genre][0:10])

labels_df = pd.concat(list_of_dfs).reset_index()

In [6]:
labels_df = labels_df.drop(columns=['index'])

In [7]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
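    text = re.sub(r'--', ' ', text)
    # Drop anything in square brackets (raw string avoids invalid-escape warnings).
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of whitespace into single spaces.
    text = ' '.join(text.split())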
    return text
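
A quick check of what the three cleaning steps do, as a standalone sketch. `clean` repeats the same substitutions under a different name so the snippet runs on its own, and the sample sentence is made up.

```python
import re

def clean(text):
    # Replace double dashes with a space.
    text = re.sub(r'--', ' ', text)
    # Remove bracketed spans such as "[laughter]".
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse repeated whitespace.
    return ' '.join(text.split())

cleaned = clean("He paused -- then left [laughter]  quickly .")
print(cleaned)
```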

In [8]:
# Puts all the article words and punctuation into dataframe column
article_col = []
for article in labels_df['keys']:
    article_col.append(text_cleaner(' '.join(brown.words(fileids=[article]))))

labels_df['article_words'] = article_col

In [9]:
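# Tokenize each article, then keep only the leading tokens.
# Note: despite the column names, the slice below keeps the first 200 tokens, not 1000.
labels_df['1000_tokens'] = labels_df['article_words'].apply(word_tokenize)
labels_df['first_1000_words'] = labels_df['1000_tokens'].apply(lambda x: x[0:200])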

In [10]:
# Joins each truncated token list back into a string and cleans it
article_col1000 = []
for article in labels_df['first_1000_words']:
    article_col1000.append(text_cleaner(' '.join(article)))

labels_df['article_words1000'] = article_col1000

In [11]:
labels_df.head()

Unnamed: 0,info,keys,article_words,1000_tokens,first_1000_words,article_words1000
0,adventure,cn01,Dan Morgan told himself he would forget Ann Tu...,"[Dan, Morgan, told, himself, he, would, forget...","[Dan, Morgan, told, himself, he, would, forget...",Dan Morgan told himself he would forget Ann Tu...
1,adventure,cn02,Gavin paused wearily . `` You can't stay here ...,"[Gavin, paused, wearily, ., ``, You, ca, n't, ...","[Gavin, paused, wearily, ., ``, You, ca, n't, ...",Gavin paused wearily . `` You ca n't stay here...
2,adventure,cn03,"The sentry was not dead . He was , in fact , s...","[The, sentry, was, not, dead, ., He, was, ,, i...","[The, sentry, was, not, dead, ., He, was, ,, i...","The sentry was not dead . He was , in fact , s..."
3,adventure,cn04,`` So it wasn't the earthquake that made him r...,"[``, So, it, was, n't, the, earthquake, that, ...","[``, So, it, was, n't, the, earthquake, that, ...",`` So it was n't the earthquake that made him ...
4,adventure,cn05,"She was carrying a quirt , and she started to ...","[She, was, carrying, a, quirt, ,, and, she, st...","[She, was, carrying, a, quirt, ,, and, she, st...","She was carrying a quirt , and she started to ..."


In [12]:
nlp = spacy.load('en')
spacy_articles = []
for article in labels_df['article_words1000']:
    spacy_articles.append(nlp(article))
    
labels_df['spacy_articles'] = spacy_articles

In [13]:
labels_df['info'].value_counts()

belles_lettres    10
lore              10
mystery           10
fiction           10
government        10
romance           10
hobbies           10
learned           10
editorial         10
news              10
adventure         10
Name: info, dtype: int64

In [14]:
# spacy_articles has the spacy breakdowns
labels_df.head()

Unnamed: 0,info,keys,article_words,1000_tokens,first_1000_words,article_words1000,spacy_articles
0,adventure,cn01,Dan Morgan told himself he would forget Ann Tu...,"[Dan, Morgan, told, himself, he, would, forget...","[Dan, Morgan, told, himself, he, would, forget...",Dan Morgan told himself he would forget Ann Tu...,"(Dan, Morgan, told, himself, he, would, forget..."
1,adventure,cn02,Gavin paused wearily . `` You can't stay here ...,"[Gavin, paused, wearily, ., ``, You, ca, n't, ...","[Gavin, paused, wearily, ., ``, You, ca, n't, ...",Gavin paused wearily . `` You ca n't stay here...,"(Gavin, paused, wearily, ., ``, You, ca, n't, ..."
2,adventure,cn03,"The sentry was not dead . He was , in fact , s...","[The, sentry, was, not, dead, ., He, was, ,, i...","[The, sentry, was, not, dead, ., He, was, ,, i...","The sentry was not dead . He was , in fact , s...","(The, sentry, was, not, dead, ., He, was, ,, i..."
3,adventure,cn04,`` So it wasn't the earthquake that made him r...,"[``, So, it, was, n't, the, earthquake, that, ...","[``, So, it, was, n't, the, earthquake, that, ...",`` So it was n't the earthquake that made him ...,"(``, So, it, was, n't, the, earthquake, that, ..."
4,adventure,cn05,"She was carrying a quirt , and she started to ...","[She, was, carrying, a, quirt, ,, and, she, st...","[She, was, carrying, a, quirt, ,, and, she, st...","She was carrying a quirt , and she started to ...","(She, was, carrying, a, quirt, ,, and, she, st..."


Next: reduce to lemmas, remove stopwords and punctuation...

I want to run bag of words and create a common words list but only include the words that appear more than once.
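
The idea can be sketched on plain tokens first. This is a simplified illustration, not the spaCy version used in the next cell: it skips lemmatization, the stop list is a tiny made-up set, and `repeated_words` and `count_vector` are hypothetical names.

```python
from collections import Counter

# Hypothetical tiny stop list, for illustration only.
STOP = {"the", "a", "and", "he", "of"}

def repeated_words(tokens, stop=STOP, top_n=2000):
    """Return up to `top_n` most common non-stop words that occur more than once."""
    counts = Counter(t.lower() for t in tokens
                     if t.isalpha() and t.lower() not in stop)
    return [word for word, count in counts.most_common(top_n) if count > 1]

def count_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(t.lower() for t in tokens)
    return [counts[word] for word in vocab]

tokens = "The fire burned and the fire spread . He saw the night , the long night .".split()
vocab = repeated_words(tokens)
vector = count_vector(tokens, vocab)
```

Words that appear only once in an article (`burned`, `spread`, `saw`, `long`) never make it into the vocabulary, which keeps the feature matrix narrow.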

In [20]:
# Utility function to list the most common lemmas (up to 2000) that appear
# more than once in a text. Each kept lemma is also printed.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    common_words = []
    for item in Counter(allwords).most_common(2000):
        if item[1] > 1:
            print(item[0])
            common_words.append(item[0])
    return common_words

def bow_features(text, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text'] = text['spacy_articles']
    df['genre'] = text['info']
    df.loc[:, common_words] = 0
    
    # Process each row (one article per row), counting occurrences of the common words.
    for i, sentence in enumerate(df['text']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 10 == 0:
            print("Processing row {}".format(i))
            
    return df

In [21]:
common_word_lists = []
for article in labels_df['spacy_articles']:
    common_word_lists.append(bag_of_words(article))

-PRON-
ann
n't
night
sleep
the
work
find
day
little
``
-PRON-
gavin
man
n't
say
chair
rock
pain
clayton
-PRON-
``
the
belt
mike
dean
fiske
say
will
``
-PRON-
n't
jason
the
large
war
party
say
shout
fort
cook
``
-PRON-
fire
say
be
let
burn
big
and
appreciate
m
take
-PRON-
hall
the
``
's
-PRON-
the
rankin
barton
hard
head
swing
man
let
``
pamela
mother
melissa
auntie
grace
station
wagon
-PRON-
simple
-PRON-
``
lord
listen
want
ve
get
n't
but
be
go
the
's
herd
-PRON-
outfit
know
night
hand
coffee
fire
-PRON-
liberal
the
north
northern
welfare
discrimination
social
bourbons
south
``
state
western
world
order
nation
position
law
power
``
man
war
be
increase
-PRON-
ask
question
submarine
launch
missile
room
button
-PRON-
go
squall
game
see
rain
cloud
degree
the
``
isfahan
time
century
great
the
tile
city
traveler
desert
in
``
-PRON-
maestro
the
explain
life
pittsburgh
orchestra
steinberg
london
conduct
concert
``
nation
founding
fathers
america
test
determine
-PRON-
national
john
jefferson
m

In [22]:
len(common_word_lists)

110

In [23]:
flat_list = [item for sublist in common_word_lists for item in sublist]

# De-duplicate words that show up in more than one article's list.
common_words = list(set(flat_list))


In [24]:
len(common_words)

864

In [25]:
the_texts = labels_df.loc[:, ['spacy_articles', 'info']].copy()

In [26]:
word_counts = bow_features(the_texts, common_words)

Processing row 0
Processing row 10
Processing row 20
Processing row 30
Processing row 40
Processing row 50
Processing row 60
Processing row 70
Processing row 80
Processing row 90
Processing row 100


In [30]:
word_counts.head(5)

Unnamed: 0,japanese,lock,papa,grow,or,nations,purchasing,manager,seed,woman,...,pass,hot,fluid,theatre,century,rate,fix,technical,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Dan, Morgan, told, himself, he, would, forget...",adventure
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Gavin, paused, wearily, ., ``, You, ca, n't, ...",adventure
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(The, sentry, was, not, dead, ., He, was, ,, i...",adventure
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(``, So, it, was, n't, the, earthquake, that, ...",adventure
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(She, was, carrying, a, quirt, ,, and, she, st...",adventure


- Next: build the tf-idf features.
- Then fit supervised learning models on each feature set separately.
- Last: pick one model and try to improve it.
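
For the tf-idf step, here is a minimal sketch of the textbook weighting (term frequency times the log of inverse document frequency). It is an assumption about how the features might be built, not the final approach: scikit-learn's `TfidfVectorizer` smooths the idf and L2-normalizes rows by default, so its numbers would differ. `tfidf` is a hypothetical helper and the documents are toy data.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({term: (count / total) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

# Toy documents (an assumption), already tokenized.
docs = [["fire", "night", "fire"], ["night", "herd"], ["night", "missile"]]
weights = tfidf(docs)
```

A word that appears in every document (`night` here) gets weight zero, while a word concentrated in one document (`fire`) is weighted up. That is the main difference from the raw counts in the bag-of-words features.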

Then do unit 4 capstone.

- Try clustering
- Unsupervised feature generation
- Attempt combos of supervised and unsupervised techniques to try to get best results
- Write-up