# Challenge: Build Your Own NLP Model

- Choose a corpus with at least 10 distinct categories.
- Data cleaning / processing / language parsing
- Create features using two different NLP methods, for example BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.
- Pick one of the models and try to increase accuracy by at least 5 percentage points.

In [30]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import brown, stopwords
from nltk import word_tokenize
from collections import Counter
from sklearn.preprocessing import normalize
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split

In [2]:
nltk.download('brown')
nltk.download('punkt')

[nltk_data] Downloading package brown to
[nltk_data]     /Users/lukeelliott/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/lukeelliott/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

In [4]:
# This imports the txt file and names the column 'info'
longform = pd.read_csv("cats.txt", sep='\n', header=None)
longform.columns = ['info']

# Grabs first 4 characters of a string
def get_keys(txt):
    return txt[:4]

# Grabs all but first 4 characters of a string
def drop_column_names(txt):
    return txt[4:]

# Function takes in a dirty, longform DataFrame and pops it back out cleaned
# and split into two columns
def longform_cleaning(df):
    df['keys'] = df['info'].apply(lambda x: get_keys(x))

    df['info'] = df['info'].apply(lambda x: drop_column_names(x))

    df['info'] = df['info'].apply(lambda x: x.strip())
    
    return df

labels_df = longform_cleaning(longform)
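The fixed four-character slice above works because the Brown file IDs seen in this notebook (for example `cn01`) are all four characters long. A minimal sketch of the same split done on whitespace instead, which does not depend on that width. The sample lines are illustrative stand-ins for the contents of `cats.txt`, and `parse_cats` is a hypothetical helper, not part of the notebook.

```python
def parse_cats(lines):
    """Split 'fileid category' lines into (fileid, category) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split once on whitespace: the ID comes first, the category follows.
        fileid, category = line.split(maxsplit=1)
        pairs.append((fileid, category))
    return pairs


# Illustrative sample lines (assumed format: "<fileid> <category>").
sample = ["cn01 adventure", "cn02 adventure", "", "ca01 news"]
pairs = parse_cats(sample)
print(pairs)
```

The resulting pairs could be handed to `pd.DataFrame(pairs, columns=['keys', 'info'])` to get the same two columns the notebook builds.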

In [5]:
d = {}
list_of_dfs = []
for i in brown.categories():
    d[str(i)] = labels_df[labels_df['info'] == i]
    if len(d[str(i)]) > 19:
        list_of_dfs.append(d[str(i)][0:20])
        
labels_df = pd.concat(list_of_dfs).reset_index()

In [6]:
labels_df = labels_df.drop(columns=['index'])

In [7]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
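    # Drop anything in square brackets (raw string avoids invalid-escape warnings).
    text = re.sub(r"\[.*?\]", "", text)
    # Collapse runs of whitespace into single spaces.
    text = ' '.join(text.split())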
    return text
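A quick check of what the cleaner does to a string, as a sketch. The function is repeated here so the block runs on its own, and the sample sentence is made up for illustration.

```python
import re


def text_cleaner(text):
    # Replace the double dash with a space.
    text = re.sub(r'--', ' ', text)
    # Remove bracketed spans such as "[laughter]".
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of whitespace into single spaces.
    return ' '.join(text.split())


sample = "Wait -- he said [laughter]  it's   fine"
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that the bracket pattern is non-greedy, so two separate bracketed spans on one line are removed individually rather than swallowing the text between them.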

In [8]:
# Puts all the article words and punctuation into dataframe column
article_col = []
for article in labels_df['keys']:
    article_col.append(text_cleaner(' '.join(brown.words(fileids=[article]))))

labels_df['article_words'] = article_col

In [9]:
labels_df['word_tokens'] = labels_df['article_words'].apply(lambda x: word_tokenize(x))
# Note: despite the column name, this keeps only the first 500 tokens of each article.
labels_df['first_1000_words'] = labels_df['word_tokens'].apply(lambda x: x[0:500])

In [10]:
# Rejoins each truncated token list into a single string per article
article_col1000 = []
for article in labels_df['first_1000_words']:
    article_col1000.append(text_cleaner(' '.join(article)))

labels_df['article_words1000'] = article_col1000

In [11]:
labels_df.head()

Unnamed: 0,info,keys,article_words,word_tokens,first_1000_words,article_words1000
0,adventure,cn01,Dan Morgan told himself he would forget Ann Tu...,"[Dan, Morgan, told, himself, he, would, forget...","[Dan, Morgan, told, himself, he, would, forget...",Dan Morgan told himself he would forget Ann Tu...
1,adventure,cn02,Gavin paused wearily . `` You can't stay here ...,"[Gavin, paused, wearily, ., ``, You, ca, n't, ...","[Gavin, paused, wearily, ., ``, You, ca, n't, ...",Gavin paused wearily . `` You ca n't stay here...
2,adventure,cn03,"The sentry was not dead . He was , in fact , s...","[The, sentry, was, not, dead, ., He, was, ,, i...","[The, sentry, was, not, dead, ., He, was, ,, i...","The sentry was not dead . He was , in fact , s..."
3,adventure,cn04,`` So it wasn't the earthquake that made him r...,"[``, So, it, was, n't, the, earthquake, that, ...","[``, So, it, was, n't, the, earthquake, that, ...",`` So it was n't the earthquake that made him ...
4,adventure,cn05,"She was carrying a quirt , and she started to ...","[She, was, carrying, a, quirt, ,, and, she, st...","[She, was, carrying, a, quirt, ,, and, she, st...","She was carrying a quirt , and she started to ..."


In [12]:
nlp = spacy.load('en')
spacy_articles = []
for article in labels_df['article_words1000']:
    spacy_articles.append(nlp(article))
    
labels_df['spacy_articles'] = spacy_articles

In [13]:
labels_df['info'].value_counts()

mystery           20
romance           20
editorial         20
government        20
fiction           20
belles_lettres    20
news              20
learned           20
hobbies           20
lore              20
adventure         20
Name: info, dtype: int64

In [14]:
# spacy_articles has the spacy breakdowns
labels_df.head()

Unnamed: 0,info,keys,article_words,word_tokens,first_1000_words,article_words1000,spacy_articles
0,adventure,cn01,Dan Morgan told himself he would forget Ann Tu...,"[Dan, Morgan, told, himself, he, would, forget...","[Dan, Morgan, told, himself, he, would, forget...",Dan Morgan told himself he would forget Ann Tu...,"(Dan, Morgan, told, himself, he, would, forget..."
1,adventure,cn02,Gavin paused wearily . `` You can't stay here ...,"[Gavin, paused, wearily, ., ``, You, ca, n't, ...","[Gavin, paused, wearily, ., ``, You, ca, n't, ...",Gavin paused wearily . `` You ca n't stay here...,"(Gavin, paused, wearily, ., ``, You, ca, n't, ..."
2,adventure,cn03,"The sentry was not dead . He was , in fact , s...","[The, sentry, was, not, dead, ., He, was, ,, i...","[The, sentry, was, not, dead, ., He, was, ,, i...","The sentry was not dead . He was , in fact , s...","(The, sentry, was, not, dead, ., He, was, ,, i..."
3,adventure,cn04,`` So it wasn't the earthquake that made him r...,"[``, So, it, was, n't, the, earthquake, that, ...","[``, So, it, was, n't, the, earthquake, that, ...",`` So it was n't the earthquake that made him ...,"(``, So, it, was, n't, the, earthquake, that, ..."
4,adventure,cn05,"She was carrying a quirt , and she started to ...","[She, was, carrying, a, quirt, ,, and, she, st...","[She, was, carrying, a, quirt, ,, and, she, st...","She was carrying a quirt , and she started to ...","(She, was, carrying, a, quirt, ,, and, she, st..."


#### Bag of Words Feature Generation

In [15]:
# Utility function to list up to 2000 of the most common lemmas in one article,
# keeping only those that occur more than once.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    common_words = []
    for item in Counter(allwords).most_common(2000):
        if item[1] > 1:
            print(item[0])
            common_words.append(item[0])
    return common_words

def bow_features(text, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text'] = text['spacy_articles']
    df['genre'] = text['info']
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each article
    # (the loop variable is still named `sentence`).
    for i, sentence in enumerate(df['text']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 10 == 0:
            print("Processing row {}".format(i))
            
    return df

In [16]:
common_word_lists = []
for article in labels_df['spacy_articles']:
    common_word_lists.append(bag_of_words(article))

-PRON-
ann
day
the
budd
work
meadow
fence
see
morgan
n't
night
sleep
leave
country
find
think
little
time
winchester
lean
shovel
when
south
walk
house
woman
``
-PRON-
n't
man
clayton
gavin
try
go
be
say
leave
big
charlie
chair
rock
life
shake
pain
's
lip
do
stick
just
mean
fight
fair
clear
want
come
-PRON-
``
mike
dean
fiske
susan
kiss
the
inside
julia
turn
say
will
talk
horse
try
ride
show
rifle
pistol
bayonet
belt
man
join
feel
evidently
guard
hand
n't
``
-PRON-
war
be
say
the
large
party
time
sioux
people
mr.
manuel
go
n't
come
jason
look
wrong
montero
shout
fort
there
cook
then
little
that
oso
like
and
old
knife
's
get
see
cheyennes
nations
letter
white
year
``
-PRON-
be
say
n't
fire
wilson
big
m
tell
take
quirt
let
wrist
burn
right
the
and
appreciate
sure
leave
morning
long
time
but
place
smile
carwood
know
girl
-PRON-
hall
man
counter
the
clerk
time
go
afternoon
job
when
authority
hang
speak
baldness
neck
underneath
person
slip
scrap
notice
feel
's
``
-PRON-
man
the
's
barton
ran

bread
butter
right
breakfast
table
door
teacart
voice
time
cup
fill
stir
need
mary
-PRON-
's
jesus
body
moment
feel
son
hold
mixture
myrrh
aloe
present
joseph
arimathea
nicodemus
place
in
concept
mother
child
lap
christ
go
there
human
close
dead
carry
mind
sketch
kate
-PRON-
``
's
come
god
child
juanita
jonathan
death
the
little
when
bed
call
accept
letter
feel
hand
minister
sit
morning
talk
word
mrs.
tussle
three
hold
move
know
duty
baby
-PRON-
``
old
the
sky
ballroom
come
bed
know
thing
leave
feel
mynheer
way
black
star
martha
schuyler
stand
peter
big
time
ball
-PRON-
andrei
path
``
way
land
palestine
alexander
brandel
forest
end
man
peasant
wonder
lack
desire
remember
long
dry
swamp
lublin
care
in
human
sit
belief
travel
seek
this
warsaw
-PRON-
``
'd
get
n't
dramatic
thelma
eileen
drunk
like
old
man
want
be
world
's
table
crazy
woman
hell
ginmill
time
walk
home
hate
what
precious
-PRON-
rector
work
give
lord
village
hino
truth
competition
escape
matter
man
mission
equal
eye
's
churc

forward
cabin
800
-PRON-
a
cost
inch
ride
large
bunk
aft
all
passenger
speed
merc
motor
mph
aboard
like
small
item
etc
gas
tank
building
time
3,000
buy
bronze
nail
standard
-PRON-
panel
vacation
cottage
house
design
material
the
find
build
scale
small
home
workshop
time
utilize
require
use
simple
tool
follow
step
site
apply
handyman
possible
size
timber
major
building
regulation
restriction
usually
check
bridge
``
-PRON-
court
the
shall
newbury
general
act
this
island
say
share
river
massachusetts
type
construction
know
newburyport
salisbury
building
propose
1791
petition
hon'ble
deer
year
times
agree
subscribe
writing
build
receive
later
pool
-PRON-
good
lot
house
year
bit
problem
plunge
time
living
fill
child
plastic
likely
great
as
cover
air
-PRON-
conditioning
house
year
the
line
fha
low
mean
cool
acquire
stress
live
new
home
good
thing
improve
possible
system
$
day
central
no
cost
be
long
know
temperature
control
few
bring
better
forget
hot
radio
thermal
emission
radiation
planet


country
issue
newspaper
erupt
level
matter
effort
step
good
agreement
all
frequently
man
's
captain
voyage
the
discovery
sea
find
time
able
hudson
company
1610
arctic
water
north
seventeen
month
board
go
what
great
year
highly
second
picture
determine
american
half
this
east
in
muscovy
dutch
red
selkirk
great
the
river
settlement
douglas
settle
valley
's
southward
scots
fort
north
empire
hudson
bay
company
mile
assiniboine
october
year
group
york
factory
american
``
man
1812
little
swiss
mercenary
war
settler
plot
late
1818
``
-PRON-
a
write
like
letter
humor
rich
find
man
gray
another
report
house
expression
hungry
horse
corner
god
receive
girl
home
love
wife
forgit
yank
american
nationalism
folklore
america
popularity
century
history
action
point
literature
the
proportion
emphasis
influence
world
twentieth
historian
fact
international
legend
contemporary
occur
country
hope
-PRON-
group
national
identification
apply
spread
dominion
palm
pine
united
states
year
society
course
personal


In [17]:
flat_list = [item for sublist in common_word_lists for item in sublist]

common_words = list(set().union(flat_list))
len(common_words)

3122

In [18]:
the_texts = labels_df.loc[:, ['spacy_articles', 'info']].copy()

In [19]:
word_counts = bow_features(the_texts, common_words)

Processing row 0
Processing row 10
Processing row 20
Processing row 30
Processing row 40
Processing row 50
Processing row 60
Processing row 70
Processing row 80
Processing row 90
Processing row 100
Processing row 110
Processing row 120
Processing row 130
Processing row 140
Processing row 150
Processing row 160
Processing row 170
Processing row 180
Processing row 190
Processing row 200
Processing row 210


In [20]:
word_counts.head(5)

Unnamed: 0,college,virginity,interested,drive,rose,distribution,outfield,regions,vue,growth,...,woman,varani,molecule,mystery,sexual,microorganism,illustration,vote,text,genre
0,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,"(Dan, Morgan, told, himself, he, would, forget...",adventure
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Gavin, paused, wearily, ., ``, You, ca, n't, ...",adventure
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(The, sentry, was, not, dead, ., He, was, ,, i...",adventure
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(``, So, it, was, n't, the, earthquake, that, ...",adventure
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(She, was, carrying, a, quirt, ,, and, she, st...",adventure
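The `bow_features` loop increments one DataFrame cell at a time, which gets slow with thousands of columns. A minimal sketch of the same counting idea using `collections.Counter`, under stated assumptions: tokens are plain strings standing in for spaCy lemmas, `STOP` is a tiny hypothetical stop list, and `bow_rows` is an illustrative helper rather than the notebook's function.

```python
from collections import Counter

# Hypothetical stop list; the notebook relies on spaCy's token.is_stop instead.
STOP = {"the", "a", "and", "was"}


def bow_rows(docs, vocab):
    """Return one list of counts per document, ordered like vocab."""
    vocab = list(vocab)
    rows = []
    for tokens in docs:
        # Count every non-stop token once per document.
        counts = Counter(t for t in tokens if t not in STOP)
        # Counter returns 0 for words the document does not contain.
        rows.append([counts[w] for w in vocab])
    return rows


docs = [
    ["the", "horse", "and", "the", "rifle", "horse"],
    ["a", "court", "was", "a", "court", "bridge"],
]
vocab = ["horse", "rifle", "court", "bridge"]
rows = bow_rows(docs, vocab)
for row in rows:
    print(row)
```

The rows can then be wrapped in one step with `pd.DataFrame(rows, columns=vocab)`, which avoids repeated `df.loc[i, word] += 1` writes.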


#### tf-idf feature generation

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the articles
                             min_df=2, # only use words that appear in at least two articles
                             stop_words='english', 
                             lowercase=True, # convert everything to lower case
                             use_idf=True, # we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', # scales each article's vector to unit length so longer and shorter articles get treated equally
                             smooth_idf=True # adds 1 to all document frequencies, as if an extra document existed that used every word once; prevents divide-by-zero errors
                            )

vecked = vectorizer.fit_transform(labels_df['article_words1000'])
y = labels_df['info']

In [22]:
# Reshape vectorizer to readable content
vecked_csr = vecked.tocsr()

# Number of articles
n = vecked_csr.shape[0]

# A list of dictionaries, one per article
vecked_bypara = [{} for _ in range(0,n)]

# List of features
terms = vectorizer.get_feature_names()

# For each article, lists the feature words and their tf-idf scores
for i, j in zip(*vecked_csr.nonzero()):
    vecked_bypara[i][terms[j]] = vecked_csr[i, j]

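# Keep in mind that with smooth_idf=True the idf weight is ln((1 + n) / (1 + df)) + 1,
# which is always at least 1, so a word that occurs in an article never scores 0.
# A word missing from the dictionary below does not occur in that article
# (or was filtered out by max_df, min_df, or the stop-word list).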

print('Tf_idf vector:', vecked_bypara[0])

Tf_idf vector: {'dan': 0.09288113323778592, 'told': 0.053220919327698415, 'forget': 0.08874578400011568, 'rid': 0.09821252566207198, 'certainly': 0.07252138963918388, 'did': 0.03713329486626584, 'want': 0.053220919327698415, 'wife': 0.06305464797722761, 'married': 0.06606648733914246, 'asking': 0.08251020067712087, 'trouble': 0.06400522747905005, 'woke': 0.09821252566207198, 'middle': 0.07413390512252865, 'night': 0.10432329208116367, 'thinking': 0.1482678102450573, 'sleep': 0.15570556412693984, 'plans': 0.06718999721489784, 'dreams': 0.08874578400011568, 'revolved': 0.09821252566207198, 'long': 0.0398381474051185, 'felt': 0.057493038464867884, 'thing': 0.05434442920345379, 'sell': 0.09821252566207198, 'al': 0.08536695765062792, 'leave': 0.06305464797722761, 'country': 0.1098656072271085, 'streak': 0.09821252566207198, 'allow': 0.08251020067712087, 'best': 0.05268371541656255, 'bitterness': 0.09821252566207198, 'disappointment': 0.08874578400011568, 'poisoned': 0.09821252566207198, 'ha
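To make the scores above easier to read, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with `smooth_idf=True` and `norm='l2'`: raw term counts times `ln((1 + n) / (1 + df)) + 1`, followed by scaling each document vector to unit length. The three tiny documents are made up, and `tfidf` is an illustrative helper; it skips tokenization, stop words, and the `max_df`/`min_df` filters.

```python
import math
from collections import Counter


def tfidf(docs):
    """Smoothed tf-idf with L2 normalization for pre-tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf; the trailing + 1 keeps every weight at 1 or above.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # Guard against empty documents, which have no terms to scale.
        vectors.append({t: v / norm for t, v in raw.items()} if norm else {})
    return idf, vectors


docs = [["night", "sleep", "sleep"], ["night", "court"], ["court", "bridge"]]
idf, vectors = tfidf(docs)
print(vectors[0])
```

Because each vector has unit length, scores are comparable across articles of different lengths, and rarer terms such as `sleep` outweigh terms shared with other documents.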

In [25]:
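# The vectorizer already applied norm='l2', so this second L2 normalization leaves the rows unchanged.
vecked_norm = normalize(vecked)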
vecked_norm_df = pd.DataFrame(data=vecked_norm.toarray())

In [26]:
# DataFrame with tf-idf and bow features
tfidf_bow = pd.concat([word_counts, vecked_norm_df], ignore_index=False, axis=1)

`tfidf_bow` combines the Bag of Words counts from `word_counts` with the tf-idf features for the same articles in `labels_df`.

#### Supervised Models with Bag of Words features

In [38]:
X = word_counts.drop(columns=['text', 'genre'])
y = word_counts['genre']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, 
                                                    stratify = y)

In [39]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [40]:
rfc = ensemble.RandomForestClassifier(100)

rfc_train = rfc.fit(X_train, y_train)

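print('Training score:', rfc.score(X_train, y_train))
print('Test score:', rfc.score(X_test, y_test))

# cross_val_score clones and refits the model on folds of the data it is given,
# so this cross-validates within the held-out test split only.
print('Cross_val_score:', cross_val_score(rfc_train, X_test, y_test, cv=5))

Training score: 1.0
Test score: 0.3939393939393939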
Cross_val_score: [0.40909091 0.54545455 0.36363636 0.45454545 0.45454545]


In [81]:
lr = LogisticRegression(solver='lbfgs', multi_class='auto')
lr_train = lr.fit(X_train, y_train)

print('Training score:', lr.score(X_train, y_train))
print('Test score:', lr.score(X_test, y_test))

print('Cross_val_score:', cross_val_score(lr_train, X_test, y_test, cv=5))

Training score: 1.0
Test score: 0.45454545454545453
Cross_val_score: [0.36363636 0.45454545 0.36363636 0.63636364 0.45454545]


In [42]:
clf = ensemble.GradientBoostingClassifier()
clf_train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

print('Cross_val_score:', cross_val_score(clf_train, X_test, y_test, cv=5))

Training set score: 1.0

Test set score: 0.18181818181818182
Cross_val_score: [0.31818182 0.54545455 0.27272727 0.27272727 0.36363636]


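Accuracy is a harsh metric for multiclass classification: with 11 balanced genres, random guessing scores about 9% (1/11), so test accuracies between 18% and 45% are weak but still above chance. Poor results are to be expected here, and the perfect training scores point to heavy overfitting, which is unsurprising with 3,122 features and only 154 training articles.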
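One way to put the scores in context is to compare them with the accuracy of random guessing. A small sketch, assuming the 11 equally sized genres shown by `value_counts()` earlier; the accuracies are the Bag of Words test scores printed above, and `lift` is just an illustrative name for the ratio to chance.

```python
# 11 balanced genres, so random guessing is right about 1 time in 11.
n_classes = 11
chance = 1 / n_classes

# Test-set accuracies reported above for the Bag of Words features.
test_acc = {
    "random forest": 0.3939393939393939,
    "logistic regression": 0.45454545454545453,
    "gradient boosting": 0.18181818181818182,
}

# How many times better than chance each model did.
lift = {name: acc / chance for name, acc in test_acc.items()}

print(f"chance accuracy: {chance:.3f}")
for name, acc in test_acc.items():
    print(f"{name}: {acc:.3f} accuracy, {lift[name]:.1f}x chance")
```

Read this way, even the weakest model is about twice as accurate as guessing, while the gap to the perfect training scores remains the larger concern.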

#### Supervised Learning with tf-idf Features

In [43]:
X1 = vecked_norm_df
y1 = labels_df['info']

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=.3, stratify=y1)

In [44]:
rfc1 = ensemble.RandomForestClassifier(100)
rfc1_train = rfc1.fit(X1_train, y1_train)

print('Training set score:', rfc1.score(X1_train, y1_train))
print('\nTest set score:', rfc1.score(X1_test, y1_test))

print('Cross_val_score:', cross_val_score(rfc1_train, X1_test, y1_test, cv=5))

Training set score: 1.0

Test set score: 0.36363636363636365
Cross_val_score: [0.36363636 0.45454545 0.45454545 0.36363636 0.18181818]


In [80]:
lr1 = LogisticRegression(solver='lbfgs', multi_class='auto')
lr1_train = lr1.fit(X1_train, y1_train)

print('Training set score:', lr1.score(X1_train, y1_train))
print('\nTest set score:', lr1.score(X1_test, y1_test))

print('Cross_val_score:', cross_val_score(lr1_train, X1_test, y1_test, cv=5))

Training set score: 1.0

Test set score: 0.5
Cross_val_score: [0.40909091 0.63636364 0.54545455 0.45454545 0.63636364]


In [46]:
clf1 = ensemble.GradientBoostingClassifier()
clf1_train = clf1.fit(X1_train, y1_train)

print('Training set score:', clf1.score(X1_train, y1_train))
print('\nTest set score:', clf1.score(X1_test, y1_test))

print('Cross_val_score:', cross_val_score(clf1_train, X1_test, y1_test, cv=5))

Training set score: 1.0

Test set score: 0.2878787878787879
Cross_val_score: [0.18181818 0.36363636 0.27272727 0.09090909 0.18181818]
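To judge whether one feature set or model did better, the fold scores printed above can be summarized by their mean and spread. This sketch only re-uses the numbers from the cell outputs; with five small folds the standard deviations are large, so the ranking should be treated as tentative.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score outputs above.
cv_scores = {
    ("BoW", "random forest"): [0.40909091, 0.54545455, 0.36363636, 0.45454545, 0.45454545],
    ("BoW", "logistic regression"): [0.36363636, 0.45454545, 0.36363636, 0.63636364, 0.45454545],
    ("BoW", "gradient boosting"): [0.31818182, 0.54545455, 0.27272727, 0.27272727, 0.36363636],
    ("tf-idf", "random forest"): [0.36363636, 0.45454545, 0.45454545, 0.36363636, 0.18181818],
    ("tf-idf", "logistic regression"): [0.40909091, 0.63636364, 0.54545455, 0.45454545, 0.63636364],
    ("tf-idf", "gradient boosting"): [0.18181818, 0.36363636, 0.27272727, 0.09090909, 0.18181818],
}

# Mean and sample standard deviation per (features, model) pair.
summary = {key: (mean(s), stdev(s)) for key, s in cv_scores.items()}

# Print from best to worst mean score.
for key, (m, sd) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{key[0]:7s} {key[1]:20s} mean={m:.3f} sd={sd:.3f}")

best = max(summary, key=lambda key: summary[key][0])
print("best:", best)
```

On these numbers, logistic regression on tf-idf features has the highest mean fold score, which is consistent with choosing it for tuning.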


### Fine-tune one model

In [82]:
from sklearn.model_selection import GridSearchCV

In [83]:
LogisticRegression()

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [84]:
lr1_params = {'penalty' : ['l2'],
             'C' : [.01, 1, 10],
             'solver' : ['lbfgs', 'sag']}

In [85]:
lr1_cv = GridSearchCV(lr1_train, param_grid=lr1_params, cv=5)

In [86]:
lr1_cv.fit(X1_train, y1_train)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'penalty': ['l2'], 'C': [0.01, 1, 10], 'solver': ['lbfgs', 'sag']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [87]:
lr1_cv.best_params_

{'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}

In [88]:
best_est = lr1_cv.best_estimator_

In [89]:
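# best_estimator_ was already refit on X1_train by GridSearchCV (refit=True).
# cross_val_score refits clones on folds of the data it is given, here the test split.
best_est_fit = best_est.fit(X1_train, y1_train)
np.mean(cross_val_score(best_est_fit, X1_test, y1_test, cv=5))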

0.5272727272727272

# Challenge Summary Notes

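- With 11 different genres (chance is about 9%), it is exciting to see that the tuned logistic regression on tf-idf features assigned articles to the correct category a little more than half the time (mean cross-validated accuracy of about 0.53).
- That score is roughly level with the untuned logistic regression's fold scores on the same split (mean of about 0.54), so the grid search did not deliver the targeted 5-percentage-point gain.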

# Challenge Continued (Unit 4 Capstone)

The Capstone Project is an extension of all that is seen above.

- Create Clusters
- Unsupervised Feature Generation and Selection
- Build models using unsupervised and supervised learning techniques

### Supervised Learning with tf-idf and BoW at once

In [None]:
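# Drop the non-numeric 'text' and 'genre' columns carried over from word_counts;
# leaving 'genre' in would leak the label into the features.
X2 = tfidf_bow.drop(columns=['text', 'genre'])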
y2 = labels_df['info']

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=.3, stratify=y2)

In [None]:
rfc2 = ensemble.RandomForestClassifier(100)
rfc2_train = rfc2.fit(X2_train, y2_train)

print('Training set score:', rfc2.score(X2_train, y2_train))
print('\nTest set score:', rfc2.score(X2_test, y2_test))

In [None]:
lr2 = LogisticRegression()
lr2_train = lr2.fit(X2_train, y2_train)

print('Training set score:', lr2.score(X2_train, y2_train))
print('\nTest set score:', lr2.score(X2_test, y2_test))

In [None]:
clf2 = ensemble.GradientBoostingClassifier()
clf2_train = clf2.fit(X2_train, y2_train)

print('Training set score:', clf2.score(X2_train, y2_train))
print('\nTest set score:', clf2.score(X2_test, y2_test))

Next steps for the Unit 4 capstone:

- Try clustering
- Compare clustering to modelling for classifying texts
- Compare whether the clustered features improve performance or hurt it on the models
- Unsupervised feature generation
- Attempt combos of supervised and unsupervised techniques to try to get best results
- Write-up