# Natural Language Processing

<img src="https://media.giphy.com/media/msKNSs8rmJ5m/giphy.gif" alt="Drawing" style="width: 300px;"/>





#### Scenario: 
> You are studying whether Twitter users from large metropolitan areas can be effectively separated from Twitter users in the surrounding area based on tweet content. You have been handed a large set of tweets from Illinois, and are set with the task of separating tweets from Chicago vs. tweets from outside of Chicago.



<div>
<p float = 'left'> What type of problem is this?</p>
<p float = 'left'> What steps do you anticipate carrying out? </p>
<p float = 'left'> What challenges do you foresee? </p>
</div>


<img src = 'https://media.giphy.com/media/WqLmcthJ7AgQKwYJbb/giphy.gif' alt="Drawing" style="width: 300px;"  float = 'right'> </img>

By the end of this class, you will be able to:

1. Clean text and transform a set of documents into word tokens
2. Compare texts using word counts and Term Frequency-Inverse Document Frequency
3. Use stemmers and lemmatizers
4. Define and create n-grams
5. Fit supervised models to preprocessed text data, predict binary classes, and assess model accuracy.

## Vanilla Python Text Exploration

Explore the texts in the example texts by:

1. Creating a list of words in each text.
2. Counting the number of occurences of the words.  
3. Ordering the words by number of occurences.
4. Comparing the counts





### We will start with some text examples

In [8]:
# 1. Create a list of words
with open('text_examples/A.txt') as read_file:
    A = read_file.read()
A = A.split()

In [9]:
# 2. Count the words
word, count = np.unique(A, return_counts=True)
# 3. Order the words by most frequent
most_common_A = sorted(list(zip(count, word)), reverse=True)
most_common_A

[(17, 'the'),
 (10, 'to'),
 (7, 'in'),
 (6, 'of'),
 (6, 'a'),
 (5, 'and'),
 (4, 'patent'),
 (4, 'for'),
 (4, 'EU'),
 (3, 'would'),
 (3, 'they'),
 (3, 'that'),
 (3, 'software'),
 (3, 'say'),
 (3, 'law'),
 (3, 'its'),
 (3, 'it'),
 (3, 'has'),
 (3, 'draft'),
 (3, 'directive'),
 (3, 'could'),
 (3, 'The'),
 (2, 'with'),
 (2, 'two'),
 (2, 'their'),
 (2, 'small'),
 (2, 'said'),
 (2, 'proposals'),
 (2, 'ordered'),
 (2, 'one'),
 (2, 'new'),
 (2, 'member'),
 (2, 'legal'),
 (2, 'laws'),
 (2, 'is'),
 (2, 'inventions.'),
 (2, 'have'),
 (2, 'had'),
 (2, 'firms'),
 (2, 'are'),
 (2, 'Supporters'),
 (2, 'MEPs'),
 (2, 'European'),
 (1, 'words,'),
 (1, 'without'),
 (1, 'who'),
 (1, 'which'),
 (1, 'when'),
 (1, 'welcomed'),
 (1, 'voting'),
 (1, 'vocal'),
 (1, 'use'),
 (1, 'up'),
 (1, 'under'),
 (1, 'twice'),
 (1, 'them'),
 (1, 'suffered'),
 (1, 'states.'),
 (1, 'states,'),
 (1, 'started'),
 (1, 'some'),
 (1, 'similar'),
 (1, 'should'),
 (1, 'shopping"'),
 (1, 'setbacks'),
 (1, 'service,'),
 (1, 'serve'),


In [10]:
# 4. Compare the texts
texts = [letter+'.txt' for letter 
         in list('ABCDEFGHIJKL')]

counts = []
for letter in list('ABCDEFGHIJKL'):
    text = letter+'.txt'
    with open(f'text_examples/{text}') as read_file:
        read_text = read_file.read()
    text_split = read_text.split()
    word, count = np.unique(text_split, return_counts=True)
    most_common = sorted(list(zip(count, word)), reverse=True)
    counts.append(most_common)


In [11]:
top_five_per_text = []
for text in counts:
    top_five = {}
    for number in range(5,15):
        top_five[number] = (text[number][1])
    top_five_per_text.append(top_five)

In [185]:
pd.DataFrame.from_dict(top_five_per_text)

Unnamed: 0,5,6,7,8,9,10,11,12,13,14
0,and,patent,for,EU,would,they,that,software,say,law
1,a,have,has,The,Brown,said,proposed,on,nations,ministers
2,gadgets,will,a,our,list,gadget,and,The,Stuff,year's
3,was,to,a,man,commit,and,BNP,suspicion,police,in
4,of,is,in,from,e-mails,attachment,virus,this,they,not
5,at,and,Ireland,would,three,that,part,of,last,for
6,record,indoor,by,time,third,ran,his,champion,Olympic,who
7,income,for,The,Prince,private,from,be,are,It,Cornwall
8,calls,net,free,will,of,make,in,has,a,-
9,will,is,and,Bekele,to,on,indoor,as,The,two


Looking at the most common words, we can see some similarities in some documents. However, there is a lot of junk in there which makes comparison difficult.  Let's do some preprocessing to get rid of the junk.

<h2>Bag of Words</h2>

<img src = "images/bag_of_word.jpg"></img>

What is the problem with text in relation to machine learning?
BOW takes a text, breaks it up into small pieces (words, bigrams, stems, lemma), an converts it into counts.  These counts can then be fed into our familiar machine learning algorithms.

Question: Did any algorithm pop into your head that might be particularly suited to bags of words?

In [54]:
texts = []
for letter in list('ABCDEFGHIJKL'):
    text = letter+'.txt'
    with open(f'text_examples/{text}') as read_file:
        read_text = read_file.read()
    texts.append(read_text)

## Steps for creating a bag of words

1. make lowercase 
2. remove punctuation
3. remove stopwords
4. apply stemmer/lemmatizer



To help us with the above steps, we will introduce a new library, **NLTK**  [documentation](https://www.nltk.org/).


In [55]:
import nltk
from nltk import word_tokenize, regexp_tokenize

### 1. Tokenize and make lowercase

In [56]:
text_A = [word.lower() for 
                word in word_tokenize(texts[0])]
text_A

['reboot',
 'ordered',
 'for',
 'eu',
 'patent',
 'law',
 'a',
 'european',
 'parliament',
 'committee',
 'has',
 'ordered',
 'a',
 'rewrite',
 'of',
 'the',
 'proposals',
 'for',
 'controversial',
 'new',
 'european',
 'union',
 'rules',
 'which',
 'govern',
 'computer-based',
 'inventions',
 '.',
 'the',
 'legal',
 'affairs',
 'committee',
 '(',
 'juri',
 ')',
 'said',
 'the',
 'commission',
 'should',
 're-submit',
 'the',
 'computer',
 'implemented',
 'inventions',
 'directive',
 'after',
 'meps',
 'failed',
 'to',
 'back',
 'it',
 '.',
 'it',
 'has',
 'had',
 'vocal',
 'critics',
 'who',
 'say',
 'it',
 'could',
 'favour',
 'large',
 'over',
 'small',
 'firms',
 'and',
 'impact',
 'open-source',
 'software',
 'innovation',
 '.',
 'supporters',
 'say',
 'it',
 'would',
 'let',
 'firms',
 'protect',
 'their',
 'inventions',
 '.',
 'the',
 'directive',
 'is',
 'intended',
 'to',
 'offer',
 'patent',
 'protection',
 'to',
 'inventions',
 'that',
 'use',
 'software',
 'to',
 'achieve',

### 2. Remove punctuation


In [57]:
import string
string.punctuation

text_A = [word for word in text_A
         if word not in string.punctuation]
text_A

['reboot',
 'ordered',
 'for',
 'eu',
 'patent',
 'law',
 'a',
 'european',
 'parliament',
 'committee',
 'has',
 'ordered',
 'a',
 'rewrite',
 'of',
 'the',
 'proposals',
 'for',
 'controversial',
 'new',
 'european',
 'union',
 'rules',
 'which',
 'govern',
 'computer-based',
 'inventions',
 'the',
 'legal',
 'affairs',
 'committee',
 'juri',
 'said',
 'the',
 'commission',
 'should',
 're-submit',
 'the',
 'computer',
 'implemented',
 'inventions',
 'directive',
 'after',
 'meps',
 'failed',
 'to',
 'back',
 'it',
 'it',
 'has',
 'had',
 'vocal',
 'critics',
 'who',
 'say',
 'it',
 'could',
 'favour',
 'large',
 'over',
 'small',
 'firms',
 'and',
 'impact',
 'open-source',
 'software',
 'innovation',
 'supporters',
 'say',
 'it',
 'would',
 'let',
 'firms',
 'protect',
 'their',
 'inventions',
 'the',
 'directive',
 'is',
 'intended',
 'to',
 'offer',
 'patent',
 'protection',
 'to',
 'inventions',
 'that',
 'use',
 'software',
 'to',
 'achieve',
 'their',
 'effect',
 'in',
 'other

### 3. Stopwords
- Stopwords are words that show up frequently but have low semantic value:  For example, The, a, with, and. 
- In our effort to find representative words for our texts, we need to remove them. Luckily, nltk makes this easy.


In [58]:
from nltk.corpus import stopwords
text_A = [word for word in text_A 
          if text_A not in stopwords.words('english')]
text_A

['reboot',
 'ordered',
 'for',
 'eu',
 'patent',
 'law',
 'a',
 'european',
 'parliament',
 'committee',
 'has',
 'ordered',
 'a',
 'rewrite',
 'of',
 'the',
 'proposals',
 'for',
 'controversial',
 'new',
 'european',
 'union',
 'rules',
 'which',
 'govern',
 'computer-based',
 'inventions',
 'the',
 'legal',
 'affairs',
 'committee',
 'juri',
 'said',
 'the',
 'commission',
 'should',
 're-submit',
 'the',
 'computer',
 'implemented',
 'inventions',
 'directive',
 'after',
 'meps',
 'failed',
 'to',
 'back',
 'it',
 'it',
 'has',
 'had',
 'vocal',
 'critics',
 'who',
 'say',
 'it',
 'could',
 'favour',
 'large',
 'over',
 'small',
 'firms',
 'and',
 'impact',
 'open-source',
 'software',
 'innovation',
 'supporters',
 'say',
 'it',
 'would',
 'let',
 'firms',
 'protect',
 'their',
 'inventions',
 'the',
 'directive',
 'is',
 'intended',
 'to',
 'offer',
 'patent',
 'protection',
 'to',
 'inventions',
 'that',
 'use',
 'software',
 'to',
 'achieve',
 'their',
 'effect',
 'in',
 'other

### 4. Stem and Lemmatize words

Lastly, we can use two techniques to reduce words to common forms.  

1. Stemming is a method which essentially chops off parts of words.  For example, it may chop off the s of a plural, "words" becomes "word". 

There are different stemmers available.  The two we will use here are the Porter and Snowball stemmers.  A main difference between the two is how agressively it stems, Porter being less agressive.

In [59]:
from nltk.stem import *

p_stemmer = PorterStemmer()

text_A_pstem = [p_stemmer.stem(word) for word in text_A]
text_A_pstem

['reboot',
 'order',
 'for',
 'eu',
 'patent',
 'law',
 'a',
 'european',
 'parliament',
 'committe',
 'ha',
 'order',
 'a',
 'rewrit',
 'of',
 'the',
 'propos',
 'for',
 'controversi',
 'new',
 'european',
 'union',
 'rule',
 'which',
 'govern',
 'computer-bas',
 'invent',
 'the',
 'legal',
 'affair',
 'committe',
 'juri',
 'said',
 'the',
 'commiss',
 'should',
 're-submit',
 'the',
 'comput',
 'implement',
 'invent',
 'direct',
 'after',
 'mep',
 'fail',
 'to',
 'back',
 'it',
 'it',
 'ha',
 'had',
 'vocal',
 'critic',
 'who',
 'say',
 'it',
 'could',
 'favour',
 'larg',
 'over',
 'small',
 'firm',
 'and',
 'impact',
 'open-sourc',
 'softwar',
 'innov',
 'support',
 'say',
 'it',
 'would',
 'let',
 'firm',
 'protect',
 'their',
 'invent',
 'the',
 'direct',
 'is',
 'intend',
 'to',
 'offer',
 'patent',
 'protect',
 'to',
 'invent',
 'that',
 'use',
 'softwar',
 'to',
 'achiev',
 'their',
 'effect',
 'in',
 'other',
 'word',
 '``',
 'comput',
 'implement',
 'invent',
 "''",
 'the',
 

In [60]:
s_stemmer = SnowballStemmer(language='english')
text_A_sstem = [s_stemmer.stem(word) for word in text_A]

# In this case, they perform similarly
for p,s in zip(text_A_pstem, text_A_sstem):
    if p != s:
        print(p,s)

ha has
ha has
ha has
thi this
thi this


2. Lemmatizing uses parts of speech in order to return the uninflected form of the word.  In other words, ran/running/runs will all be transformed to run.  

In order to do this, we must tag our words with parts of speech.


In [61]:
from nltk import pos_tag

text_A_pos = pos_tag(text_A)
text_A_pos[2]

('for', 'IN')

In order to get nltk's POS tagger to talk to the Wordnet Lemmatizer, we need to translate the nltk POS to the more general WordNet POS.  

The funciton get_wordnet_pos shown below achieves that.

In [62]:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    '''
    Translate nltk POS to wordnet tags
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [63]:
text_A_pos = [(word[0], get_wordnet_pos(word[1]))
             for word in text_A_pos]
text_A_pos

[('reboot', 'n'),
 ('ordered', 'v'),
 ('for', 'n'),
 ('eu', 'a'),
 ('patent', 'n'),
 ('law', 'n'),
 ('a', 'n'),
 ('european', 'a'),
 ('parliament', 'n'),
 ('committee', 'n'),
 ('has', 'v'),
 ('ordered', 'v'),
 ('a', 'n'),
 ('rewrite', 'n'),
 ('of', 'n'),
 ('the', 'n'),
 ('proposals', 'n'),
 ('for', 'n'),
 ('controversial', 'a'),
 ('new', 'a'),
 ('european', 'a'),
 ('union', 'n'),
 ('rules', 'n'),
 ('which', 'n'),
 ('govern', 'v'),
 ('computer-based', 'a'),
 ('inventions', 'n'),
 ('the', 'n'),
 ('legal', 'a'),
 ('affairs', 'n'),
 ('committee', 'n'),
 ('juri', 'n'),
 ('said', 'v'),
 ('the', 'n'),
 ('commission', 'n'),
 ('should', 'n'),
 ('re-submit', 'v'),
 ('the', 'n'),
 ('computer', 'n'),
 ('implemented', 'v'),
 ('inventions', 'n'),
 ('directive', 'v'),
 ('after', 'n'),
 ('meps', 'n'),
 ('failed', 'v'),
 ('to', 'n'),
 ('back', 'v'),
 ('it', 'n'),
 ('it', 'n'),
 ('has', 'v'),
 ('had', 'v'),
 ('vocal', 'a'),
 ('critics', 'n'),
 ('who', 'n'),
 ('say', 'v'),
 ('it', 'n'),
 ('could', 'n'),


In [64]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

text_A_lemma = [wordnet_lemmatizer.lemmatize(word[0],word[1])
                for word in text_A_pos]

text_A_lemma

['reboot',
 'order',
 'for',
 'eu',
 'patent',
 'law',
 'a',
 'european',
 'parliament',
 'committee',
 'have',
 'order',
 'a',
 'rewrite',
 'of',
 'the',
 'proposal',
 'for',
 'controversial',
 'new',
 'european',
 'union',
 'rule',
 'which',
 'govern',
 'computer-based',
 'invention',
 'the',
 'legal',
 'affair',
 'committee',
 'juri',
 'say',
 'the',
 'commission',
 'should',
 're-submit',
 'the',
 'computer',
 'implement',
 'invention',
 'directive',
 'after',
 'meps',
 'fail',
 'to',
 'back',
 'it',
 'it',
 'have',
 'have',
 'vocal',
 'critic',
 'who',
 'say',
 'it',
 'could',
 'favour',
 'large',
 'over',
 'small',
 'firm',
 'and',
 'impact',
 'open-source',
 'software',
 'innovation',
 'supporter',
 'say',
 'it',
 'would',
 'let',
 'firm',
 'protect',
 'their',
 'invention',
 'the',
 'directive',
 'be',
 'intend',
 'to',
 'offer',
 'patent',
 'protection',
 'to',
 'invention',
 'that',
 'use',
 'software',
 'to',
 'achieve',
 'their',
 'effect',
 'in',
 'other',
 'word',
 '``',


In [65]:
from nltk.corpus import stopwords
stopwords.words('english')

text_A = [word for word in text_A
         if word not in stopwords.words('english')]

text_A

['reboot',
 'ordered',
 'eu',
 'patent',
 'law',
 'european',
 'parliament',
 'committee',
 'ordered',
 'rewrite',
 'proposals',
 'controversial',
 'new',
 'european',
 'union',
 'rules',
 'govern',
 'computer-based',
 'inventions',
 'legal',
 'affairs',
 'committee',
 'juri',
 'said',
 'commission',
 're-submit',
 'computer',
 'implemented',
 'inventions',
 'directive',
 'meps',
 'failed',
 'back',
 'vocal',
 'critics',
 'say',
 'could',
 'favour',
 'large',
 'small',
 'firms',
 'impact',
 'open-source',
 'software',
 'innovation',
 'supporters',
 'say',
 'would',
 'let',
 'firms',
 'protect',
 'inventions',
 'directive',
 'intended',
 'offer',
 'patent',
 'protection',
 'inventions',
 'use',
 'software',
 'achieve',
 'effect',
 'words',
 '``',
 'computer',
 'implemented',
 'invention',
 "''",
 'draft',
 'law',
 'suffered',
 'setbacks',
 'poland',
 'one',
 'largest',
 'eu',
 'member',
 'states',
 'rejected',
 'adoption',
 'twice',
 'two',
 'months',
 'intense',
 'lobbying',
 'issue',


## More tools
![](https://media.giphy.com/media/kI4312sUxEuciCVig2/giphy.gif)

## Regex

In [66]:
from nltk import regexp_tokenize

Regular expressions are a powerful tool which will allow you to search strings for specific patterns.  It takes some getting used to, but once you get it is extremely useful.

1. You can try out regular expressions here: https://regexr.com/
2. Here is a cheatsheet: https://cheatography.com/davechild/cheat-sheets/regular-expressions/
3. Here is a website with a tutorial: https://www.regular-expressions.info/


In [67]:
# We can use a regular expression to remove punctuation, 
# and numbers
pattern = ("([a-zA-Z]+(?:'[a-z]+)?)")
regexp_tokenize(texts[0], pattern=pattern)

['Reboot',
 'ordered',
 'for',
 'EU',
 'patent',
 'law',
 'A',
 'European',
 'Parliament',
 'committee',
 'has',
 'ordered',
 'a',
 'rewrite',
 'of',
 'the',
 'proposals',
 'for',
 'controversial',
 'new',
 'European',
 'Union',
 'rules',
 'which',
 'govern',
 'computer',
 'based',
 'inventions',
 'The',
 'Legal',
 'Affairs',
 'Committee',
 'JURI',
 'said',
 'the',
 'Commission',
 'should',
 're',
 'submit',
 'the',
 'Computer',
 'Implemented',
 'Inventions',
 'Directive',
 'after',
 'MEPs',
 'failed',
 'to',
 'back',
 'it',
 'It',
 'has',
 'had',
 'vocal',
 'critics',
 'who',
 'say',
 'it',
 'could',
 'favour',
 'large',
 'over',
 'small',
 'firms',
 'and',
 'impact',
 'open',
 'source',
 'software',
 'innovation',
 'Supporters',
 'say',
 'it',
 'would',
 'let',
 'firms',
 'protect',
 'their',
 'inventions',
 'The',
 'directive',
 'is',
 'intended',
 'to',
 'offer',
 'patent',
 'protection',
 'to',
 'inventions',
 'that',
 'use',
 'software',
 'to',
 'achieve',
 'their',
 'effect',
 '

In [68]:
nltk.corpus.stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [69]:
# or we can use Word Vectorizers
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=pattern, 
                     stop_words=nltk.corpus.stopwords.words('english'))

cv_texts = cv.fit_transform(texts)
cv.vocabulary_


{'reboot': 703,
 'ordered': 611,
 'eu': 254,
 'patent': 628,
 'law': 474,
 'european': 256,
 'parliament': 622,
 'committee': 155,
 'rewrite': 720,
 'proposals': 670,
 'controversial': 179,
 'new': 585,
 'union': 884,
 'rules': 726,
 'govern': 346,
 'computer': 169,
 'based': 73,
 'inventions': 429,
 'legal': 479,
 'affairs': 20,
 'juri': 450,
 'said': 730,
 'commission': 153,
 'submit': 821,
 'implemented': 400,
 'directive': 226,
 'meps': 535,
 'failed': 269,
 'back': 67,
 'vocal': 902,
 'critics': 197,
 'say': 736,
 'could': 184,
 'favour': 274,
 'large': 466,
 'small': 782,
 'firms': 302,
 'impact': 399,
 'open': 606,
 'source': 794,
 'software': 784,
 'innovation': 420,
 'supporters': 827,
 'would': 935,
 'let': 482,
 'protect': 672,
 'intended': 423,
 'offer': 596,
 'protection': 673,
 'use': 889,
 'achieve': 10,
 'effect': 244,
 'words': 932,
 'invention': 428,
 'draft': 233,
 'suffered': 824,
 'setbacks': 759,
 'poland': 650,
 'one': 605,
 'largest': 468,
 'member': 531,
 'stat

In [71]:
df = pd.DataFrame(cv_texts.toarray())
df.columns = cv.vocabulary_
df.head()

Unnamed: 0,reboot,ordered,eu,patent,law,european,parliament,committee,rewrite,proposals,...,big,plan,defend,title,helsinki,august,want,perform,indoors,start
0,0,0,0,1,0,0,0,0,0,0,...,0,3,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,2,0,1,2,2,0,1,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,2
4,0,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Each word becoming a feature in the dataframe, which means that there are a lot of features.  Also, there are a lot of zeros in the dataframe. A matrix where most of the values are zeros is called a **sparse matrix**.

In [72]:
df.sort_values('plan', ascending=False)

Unnamed: 0,reboot,ordered,eu,patent,law,european,parliament,committee,rewrite,proposals,...,big,plan,defend,title,helsinki,august,want,perform,indoors,start
0,0,0,0,1,0,0,0,0,0,0,...,0,3,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,2,...,0,3,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,2,0,1,2,2,0,1,1,0
5,0,0,0,0,0,0,0,0,0,0,...,1,2,0,0,1,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
8,0,0,2,0,2,0,0,1,0,0,...,0,1,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,2
4,0,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
7,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0


### TF-IDF

We can also use Term-Frequency-Inverse Document frequency instead of a straight word count. 

TF-IDF multiplies the frequency of a given word by the inverse of how frequently the word appears across all documents. In this way, it suggests that the most representative words in a document are the ones which have the highest count within a document, but are rare across documents.

There are different ways to calculate term frequency.  The formula below accounts to document length:

### Term Frequency (TF)


$\begin{align}
tf(w) = \dfrac{single\ word\ count}{total\ number\ of\ words\ in\ document}
\end{align} $




### Inverse Document Frequency (IDF)

$\begin{align}
idf(w) = \log \dfrac{Number\ of\ docs}{Number\ of\ docs\ word\ is\ found}
\end{align} $

$\begin{align}
tfidf = {tf}*{idf}
\end{align} $

In [73]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(token_pattern=pattern, 
                     stop_words=nltk.corpus.stopwords.words('english'))

tfidf_texts = tfidf.fit_transform(texts)
tfidf.vocabulary_

{'reboot': 703,
 'ordered': 611,
 'eu': 254,
 'patent': 628,
 'law': 474,
 'european': 256,
 'parliament': 622,
 'committee': 155,
 'rewrite': 720,
 'proposals': 670,
 'controversial': 179,
 'new': 585,
 'union': 884,
 'rules': 726,
 'govern': 346,
 'computer': 169,
 'based': 73,
 'inventions': 429,
 'legal': 479,
 'affairs': 20,
 'juri': 450,
 'said': 730,
 'commission': 153,
 'submit': 821,
 'implemented': 400,
 'directive': 226,
 'meps': 535,
 'failed': 269,
 'back': 67,
 'vocal': 902,
 'critics': 197,
 'say': 736,
 'could': 184,
 'favour': 274,
 'large': 466,
 'small': 782,
 'firms': 302,
 'impact': 399,
 'open': 606,
 'source': 794,
 'software': 784,
 'innovation': 420,
 'supporters': 827,
 'would': 935,
 'let': 482,
 'protect': 672,
 'intended': 423,
 'offer': 596,
 'protection': 673,
 'use': 889,
 'achieve': 10,
 'effect': 244,
 'words': 932,
 'invention': 428,
 'draft': 233,
 'suffered': 824,
 'setbacks': 759,
 'poland': 650,
 'one': 605,
 'largest': 468,
 'member': 531,
 'stat

In [74]:
df = pd.DataFrame(tfidf_texts.toarray())
df.columns = tfidf.vocabulary_
df.head()

Unnamed: 0,reboot,ordered,eu,patent,law,european,parliament,committee,rewrite,proposals,...,big,plan,defend,title,helsinki,august,want,perform,indoors,start
0,0.0,0.0,0.0,0.059021,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.099823,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.048115,0.0,0.0,0.037515,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.057847,0.0,0.0,0.0,...,0.0,0.075948,0.0,0.067357,0.059216,0.134714,0.0,0.067357,0.067357,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.051038,0.0,0.0,0.0,0.0,0.116109
4,0.0,0.0,0.0,0.0,0.0,0.061499,0.052816,0.0,0.052816,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### NGrams

In addition to creating single word counts, we can also count groups of words.  These are n-grams, with n referring to the number of words to count. 

If, for example, we were trying to separate sports articles from political articles, the word might not be particularly useful by itself.  However, the bigram "World Series" would be highly representative. 

A bigram, is a group of two words, tri-gram is three, etc.

It is very easy to create bigrams with count vectorizers and tf-idf:


In [75]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=pattern, 
                     stop_words=nltk.corpus.stopwords.words('english'), 
                    ngram_range=(1,2))

cv_texts = cv.fit_transform(texts)
cv.vocabulary_


{'reboot': 1788,
 'ordered': 1569,
 'eu': 643,
 'patent': 1613,
 'law': 1194,
 'european': 650,
 'parliament': 1596,
 'committee': 396,
 'rewrite': 1836,
 'proposals': 1711,
 'controversial': 454,
 'new': 1492,
 'union': 2251,
 'rules': 1849,
 'govern': 877,
 'computer': 429,
 'based': 182,
 'inventions': 1089,
 'legal': 1208,
 'affairs': 44,
 'juri': 1136,
 'said': 1858,
 'commission': 391,
 'submit': 2099,
 'implemented': 1010,
 'directive': 569,
 'meps': 1359,
 'failed': 681,
 'back': 169,
 'vocal': 2308,
 'critics': 502,
 'say': 1888,
 'could': 466,
 'favour': 691,
 'large': 1172,
 'small': 2011,
 'firms': 757,
 'impact': 1008,
 'open': 1558,
 'source': 2040,
 'software': 2016,
 'innovation': 1068,
 'supporters': 2111,
 'would': 2400,
 'let': 1218,
 'protect': 1718,
 'intended': 1074,
 'offer': 1524,
 'protection': 1720,
 'use': 2263,
 'achieve': 23,
 'effect': 619,
 'words': 2381,
 'invention': 1087,
 'draft': 590,
 'suffered': 2105,
 'setbacks': 1957,
 'poland': 1664,
 'one': 155

# Twitter Data
- Now that we have the tools, let's return to our twitter data
- Let's start with our usual train test split.


In [76]:
import pandas as pd
import numpy as np
from collections import defaultdict
import string

tweets = pd.read_csv('data/illinois_tweets.csv')
target = tweets.target
text = tweets.combined_text
target.value_counts()

0.0    35000
1.0    34999
Name: target, dtype: int64

In [7]:
print(f"There are {text.shape[0]} tweets")
print(f"""There are: 
                    {target.value_counts()[1]} Chicago tweets,
                    {target.value_counts()[0]} tweets from outside of Chicago
                  """)

There are 69999 tweets
There are: 
                    34999 Chicago tweets,
                    35000 tweets from outside of Chicago
                  


In [328]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(text, target, 
                                                    test_size=.2, 
                                                    random_state=42)

In [330]:
# Let's then work with training and validation steps
X_tt, X_val, y_tt, y_val = train_test_split(X_train, y_train, 
                                            test_size=.2,
                                            random_state=42)

### Now, start walking through the steps of text cleaning.

However you would like, start coding the following steps:

1. Make lowercase
2. Remove punctuation
3. Remove stopwords
4. stem/lemmatize
5. Count-vectorize/TF-IDF
6. Fit to a model and check your metrics.

Remember, there are various ways to do this. There are many choices to be made, some of which it is hard to assertain whether they are best or not.  Experimentation is key.

In [331]:
# your code here