# Learning from Text

Example: distinguishing between "nice day" and "a very nice day".

How many inputs should we feed into, say, an SVM?

- Hard to tell how many inputs, because texts vary in length.

The frequency of occurrence of words can be mapped out using a "bag of words".

- Word order does not matter.
- Longer phrases give different input vectors.
- Cannot handle compound phrases (e.g. "Chicago Bulls" vs. "Chicago" and "bulls" on their own).
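The three properties above can be illustrated with a minimal sketch in plain Python. The helper `bag_of_words` is a hypothetical name introduced here for illustration; it only lowercases and splits on whitespace, which is simpler than what a real vectorizer does.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated words; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("nice day")
b = bag_of_words("day nice")
c = bag_of_words("a very nice day")

print(a == b)  # True: same words, different order, same bag
print(a == c)  # False: the longer phrase adds extra words
print(bag_of_words("Chicago Bulls"))  # the team name is split into two separate words
```

Because the counts are keyed by single words, the bag has no way to record that "Chicago" and "Bulls" belonged together.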

# Bag of Words in sklearn

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
vectorizer = CountVectorizer()

In [3]:
str1 = "hi Katie the self driving car will be late Best Sebastian"

In [4]:
str2 = "Hi Sebastian the machine learning class will be great great great Best Katie"

In [5]:
str3 = "Hi Katie machine learning class will be most excellent"

In [6]:
email_list = [str1, str2, str3]

In [26]:
bag_of_words = vectorizer.fit(email_list)

In [27]:
print(bag_of_words)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [28]:
bag_of_words = vectorizer.transform(email_list)

In [29]:
print(bag_of_words)

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (0, 7)	1
  (0, 8)	1
  (0, 9)	1
  (0, 13)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 6)	3
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 11)	1
  (1, 13)	1
  (1, 15)	1
  (1, 16)	1
  (2, 0)	1
  (2, 3)	1
  (2, 5)	1
  (2, 7)	1
  (2, 8)	1
  (2, 10)	1
  (2, 11)	1
  (2, 12)	1
  (2, 16)	1


In [30]:
vectorizer.vocabulary_.get('great')

6

In [31]:
vectorizer.vocabulary_.get('machine')

11

In [17]:
vectorizer.vocabulary_.get('will')

16

In [18]:
vectorizer.vocabulary_.get('excellent')

5

In [20]:
vectorizer.vocabulary_.get('class')

3

In [21]:
vectorizer.vocabulary_.get('be')

0

In [22]:
vectorizer.vocabulary_.get('best')

1

In [23]:
vectorizer.vocabulary_.get('katie')

8

In [25]:
vectorizer.vocabulary_.get('sebastian')

13

In [33]:
vectorizer.vocabulary_.get('hi')

7

In [34]:
print(vectorizer.vocabulary_)

{u'be': 0, u'most': 12, u'hi': 7, u'learning': 10, u'excellent': 5, u'class': 3, u'best': 1, u'katie': 8, u'will': 16, u'great': 6, u'driving': 4, u'car': 2, u'self': 14, u'machine': 11, u'late': 9, u'the': 15, u'sebastian': 13}


In [35]:
"the" in vectorizer.vocabulary_

True
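To see where the indices in `vocabulary_` and the counts in the sparse output come from, here is a hedged sketch that rebuilds them by hand for the same three emails. It assumes the default behaviour visible in the printed `CountVectorizer` settings (lowercasing, tokens of two or more word characters) and assumes the vocabulary is indexed in sorted order, which matches the output shown above. The names `tokenize` and `count_vector` are hypothetical helpers, not part of sklearn.

```python
import re

emails = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie machine learning class will be most excellent",
]

# Same token rule as the token_pattern printed by CountVectorizer
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN.findall(text.lower())

# Assumption: each distinct word gets its index from alphabetical order
all_words = sorted({w for email in emails for w in tokenize(email)})
vocab = {word: index for index, word in enumerate(all_words)}

def count_vector(text):
    """Return one dense row of word counts for a single document."""
    row = [0] * len(vocab)
    for word in tokenize(text):
        row[vocab[word]] += 1
    return row

print(vocab["great"])           # 6
print(count_vector(emails[1]))  # position 6 holds 3, for the three "great"s
```

The dense row for the second email carries the same information as the `(1, j)` entries of the sparse printout, with zeros filled in for words that do not occur.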

## Not all words are equal

Some words contain more information than others

Low-information words:

"the"

"will"

"hi"

"Katie"

"Sebastian"


#### Some words contain more information than others.

Stop words = low-information, highly frequent words, e.g. [the, in, for, you, will, be, have]

#### Quiz:

How many words will be removed when we remove stopwords from "Hi Katie the machine learning class will be great best Sebastian"?

Answer: 3 (the, will, be)
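The quiz answer can be checked with a short sketch. The stop-word set below is an assumption: it is just the short example list from these notes, not the full NLTK list, and `remove_stop_words` is a hypothetical helper.

```python
# Assumed stop-word set: the short example list from these notes
STOP_WORDS = {"the", "in", "for", "you", "will", "be", "have"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split text into lowercase words and separate kept words from removed ones."""
    words = text.lower().split()
    kept = [w for w in words if w not in stop_words]
    removed = [w for w in words if w in stop_words]
    return kept, removed

sentence = "Hi Katie the machine learning class will be great best Sebastian"
kept, removed = remove_stop_words(sentence)

print(removed)       # ['the', 'will', 'be']
print(len(removed))  # 3
```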

### Getting stopwords from NLTK

NLTK = Natural Language Toolkit

In [36]:
from nltk.corpus import stopwords

In [37]:
sw = stopwords.words("english")

In [38]:
sw[0]

u'i'

In [39]:
sw[1]

u'me'

In [40]:
len(sw)

153

In [41]:
print(sw)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

### Not all unique words are different

```
unresponsive
response
responsivity      -->   respon
responsiveness
response 
```

A stemmer function reduces each word to its stem, so that variants of the same word are treated as one.

### Stemming with NLTK

In [42]:
from nltk.stem.snowball import SnowballStemmer

In [43]:
stemmer = SnowballStemmer("english")

In [44]:
stemmer.stem("responsiveness")

u'respons'

In [45]:
stemmer.stem("unresponsive")

u'unrespons'

### Order of operations in text processing

1. Stemming (first, so that variants of a word collapse into a single vocabulary entry)
2. Bag-of-words representation
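The effect of stemming before counting can be sketched as follows. Note that `toy_stem` is a deliberately tiny, hypothetical suffix stripper written only for these example words; it is not the Snowball algorithm, and a real pipeline would use a proper stemmer instead.

```python
from collections import Counter

# Toy suffix list chosen only to cover the example words (not Snowball's rules)
SUFFIXES = ("iveness", "ivity", "ive", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["response", "responsiveness", "responsivity", "unresponsive"]

raw_counts = Counter(words)                           # counting without stemming
stemmed_counts = Counter(toy_stem(w) for w in words)  # stem first, then count

print(len(raw_counts))  # 4 separate vocabulary entries
print(stemmed_counts)   # three of the words share the stem 'respons'
```

If the bag of words were built first, each variant would already occupy its own column, so the stemming has to come before the counting.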

### TfIdf representation

TfIdf = term frequency-inverse document frequency

term frequency = like bag of words (how often a word occurs in a document)

inverse document frequency = weighting by how rarely the word occurs across the documents of the corpus

Quiz: would you weight common words higher, or rarer words?

Answer: rarer words.

TfIdf rates rare words higher than common words.
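A minimal sketch of this weighting on the three example emails, using the textbook form tf × ln(N / df), where N is the number of documents and df is the number of documents containing the word. This exact formula is an assumption made for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie machine learning class will be most excellent",
]

tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Weight each word by its count times ln(N / df), the unsmoothed textbook form."""
    tf = Counter(tokens)
    return {w: count * math.log(float(n_docs) / df[w]) for w, count in tf.items()}

weights = tfidf(tokenized[1])

print(weights["be"])         # 0.0: "be" occurs in every email
print(weights["sebastian"])  # small: occurs in two of the three emails
print(weights["great"])      # largest: occurs three times, in one email only
```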


### Why upweight rare words

Example:

Katie - a physics major

Sebastian - a CS major, has a project on a robot called Stanley

---> "physics" and "Stanley" would occur very rarely in the corpus compared to "Udacity" and "machine learning", which both of them use often (both teach the Udacity intro to machine learning course). The rare words are therefore the ones that help tell the two authors apart.