# Text Learning Notes

In [24]:
import pickle as pkl
import pandas as pd
import numpy as np

## Bag of Words model

#### Goal: Find word frequency counts


#### Questions:
1. Does word order matter inside a phrase? No
2. Do long phrases give different input vectors than short ones? Yes
3. Can it handle compound phrases such as "Chicago Bulls"? No, each word is counted separately
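
The three answers can be checked with a minimal sketch that builds a bag of words by hand using `collections.Counter`. The helper name `bag_of_words` and the sample phrases are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter


def bag_of_words(text):
    """Hypothetical helper: count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


# 1. Word order does not matter: both phrases give the same counts.
same_words = bag_of_words("great learning much fun") == bag_of_words("much fun great learning")

# 2. A longer phrase gives a different vector: repeated words raise the counts.
short = bag_of_words("test string")
long_phrase = bag_of_words("test string test string")

# 3. A compound phrase is split into independent words.
bulls = bag_of_words("Chicago Bulls")

print(same_words)
print(short != long_phrase)
print(sorted(bulls))
```

Because only counts are kept, nothing in the result records that "Chicago" and "Bulls" appeared next to each other.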

#### Bag of Words in SKLearn aka CountVectorizer

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
Vectorizer = CountVectorizer()

In [6]:
data = ['this is a long string string string and a test',
        'also this is a test test string great learning much fun']
Vectorizer.fit(data)  # learn the vocabulary; fit() returns the vectorizer itself
bag_of_words = Vectorizer.transform(data)  # sparse document-term count matrix
print(bag_of_words)

  (0, 1)	1
  (0, 4)	1
  (0, 6)	1
  (0, 8)	3
  (0, 9)	1
  (0, 10)	1
  (1, 0)	1
  (1, 2)	1
  (1, 3)	1
  (1, 4)	1
  (1, 5)	1
  (1, 7)	1
  (1, 8)	1
  (1, 9)	2
  (1, 10)	1


In each tuple, the left element is the document number and the right element is the feature number of the word. The value in the column after the tuple is the frequency count.
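
To make the sparse printout easier to read, one option is to convert it to a dense array and map each column index back to its word. This is a sketch on the same two sample strings; the lowercase `vectorizer` and the variables `features` and `dense` are names chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = ['this is a long string string string and a test',
        'also this is a test test string great learning much fun']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data)  # fit and transform in one step

# vocabulary_ maps word -> column index; sort the words by that index
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

dense = counts.toarray()  # shape: (number of documents, number of features)
for doc_index, row in enumerate(dense):
    # keep only the words that actually occur in this document
    nonzero = {features[j]: int(c) for j, c in enumerate(row) if c > 0}
    print(doc_index, nonzero)
```

A dense array is fine for a toy example like this one; for a large corpus the sparse form is usually kept because most entries are zero.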

To look up the feature number of a word, use `vocabulary_`:

In [8]:
word = 'string'
print(Vectorizer.vocabulary_.get(word))

8


### Not all words carry an equal amount of information

Some words carry much less information than others
    - so they must be removed to get rid of noise

Called <b>Stopwords</b>!
    - the, a, will, in, for, you, be, etc.

<b>Can import from NLTK library:</b>

In [14]:
from nltk.corpus import stopwords
import nltk

In [15]:
# opens the interactive downloader; nltk.download('stopwords') fetches only the stopword corpus
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [17]:
# need to specify the language (the corpus file name is lowercase)
sw = stopwords.words('english')
len(sw)

179
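
One way to apply a stopword list is to hand it to `CountVectorizer` through its `stop_words` parameter. The sketch below uses a tiny hand-picked list (an assumption, chosen so the example runs without the NLTK download); a longer list such as `sw` could be passed the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = ['this is a long string string string and a test',
        'also this is a test test string great learning much fun']

# Hand-picked stand-in for a real stopword list (assumption for illustration)
small_stopword_list = ['this', 'is', 'a', 'and', 'also']

vectorizer = CountVectorizer(stop_words=small_stopword_list)
vectorizer.fit(data)

# Words that survive stopword removal, in alphabetical order
remaining = sorted(vectorizer.vocabulary_)
print(remaining)
```

The vocabulary shrinks from 11 features to 7, and the words that remain are the ones that say something about the content.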

#### Problem: not all unique words are different
- Different forms of a word often carry the same meaning, ex: responsiveness, responsive

#### Solution: word stemmer
- stemmer ex: responsive ---> respons
- take a stemmer off the shelf and make use of it
- results in a much cleaner vocabulary

#### Stemming with NLTK

In [18]:
from nltk.stem.snowball import SnowballStemmer

In [19]:
# don't forget to specify the language
stemmer = SnowballStemmer('english')

In [20]:
stemmer.stem("responsiveness")

u'respons'

In [22]:
stemmer.stem("unresponsive")

u'unrespons'

How aggressively to stem depends on the goals of the project, so the stemmer needs to be fine-tuned to those goals to work as efficiently as possible.

#### Order of operations in text processing

1) Stemming

2) bag-of-words
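
The order above can be sketched as follows: stem every token first, then count the stems. This assumes NLTK is installed; the sample text, the whitespace tokenization, and the use of `collections.Counter` as a stand-in for a full bag-of-words vectorizer are illustrative choices.

```python
from collections import Counter

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

text = "responsive responsiveness unresponsive"

# 1) Stemming: reduce every token to its stem
stems = [stemmer.stem(token) for token in text.split()]

# 2) Bag of words: count the stems instead of the raw tokens
counts = Counter(stems)
print(counts)
```

Stemming first lets "responsive" and "responsiveness" fall into the same count, while "unresponsive" keeps its prefix and stays a separate feature.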

## TF-IDF Representation

#### Key Terms:
- TF: term frequency (bag of words)
- IDF: inverse document frequency

#### Idea:

Weight words by how often they occur across the corpus (the whole collection of documents)
    - Rare words get higher weights because they are better at identifying important messages
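
As a rough illustration of the idea, the sketch below computes TF-IDF by hand with the textbook form tf * log(N / df). This exact formula is an assumption chosen for clarity; libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

docs = ['this is a long string string string and a test',
        'also this is a test test string great learning much fun']

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Return {word: tf * log(N / df)} for one tokenized document."""
    tf = Counter(tokens)
    return {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}


weights = [tf_idf(tokens) for tokens in tokenized]
print(weights[0]['long'], weights[0]['string'])
```

Here "string" appears in both documents, so its weight drops to zero even though it occurs three times in the first one, while "long" appears in only one document and keeps a positive weight.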
    

In [33]:
words = pd.read_pickle("your_word_data.pkl")

In [38]:
words[-1]

u'jtownsensf httpgasmsgboardcorpenroncom '

In [39]:
len(words)

17578