# Text Learning With Python

### Bag of words: vector representation of the word counts in a document

In scikit-learn, we use `CountVectorizer`:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
vectorizer = CountVectorizer()

In [3]:
string1 = "hi Katie the self driving car will be late Best Sebastian"
string2 = "Hi Sebastian the machine learning class will be great great great Best Katie"
string3 = "Hi Katie the machine learning class will be most excellent"

In [4]:
email_list = [string1, string2, string3]

In [5]:
# Learn the vocabulary of the corpus and assign a column index to each word.
# fit() returns the vectorizer itself, so there is nothing useful to store here:
vectorizer.fit(email_list)

In [6]:
# Count how many times each word occurs in each document:
bag_of_words = vectorizer.transform(email_list)

In [7]:
print(bag_of_words[0])
print()
print(bag_of_words[1])

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (0, 7)	1
  (0, 8)	1
  (0, 9)	1
  (0, 13)	1
  (0, 14)	1
  (0, 15)	1
  (0, 16)	1

  (0, 0)	1
  (0, 1)	1
  (0, 3)	1
  (0, 6)	3
  (0, 7)	1
  (0, 8)	1
  (0, 10)	1
  (0, 11)	1
  (0, 13)	1
  (0, 15)	1
  (0, 16)	1


In the first document (document 0), the first word, word 0 == "be", occurs 1 time, and so on up to the 17th word, word 16 == "will", which also occurs 1 time.

In the second document (document 1), the seventh word, word 6 == "great", occurs 3 times; the last entry is again the 17th word, word 16 == "will", occurring 1 time.

**(a, b)   x:**
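* a = row index of the document in the printed matrix (0 in both outputs above, because each `print` shows a single-row slice)
* b = index of the word in the vocabulary
* x = frequency of word b in document a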

In [8]:
print(vectorizer.vocabulary_.get("great"))
print()
print(vectorizer.vocabulary_.get("will"))

6

16
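
To see where indices such as 6 and 16 come from, the vocabulary and counts can be rebuilt by hand. This is a minimal sketch that only approximates `CountVectorizer`'s default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper `tokenize` and the variables `vocabulary` and `counts` are illustrative names, not part of scikit-learn.

```python
import re
from collections import Counter

email_list = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie the machine learning class will be most excellent",
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    # (an approximation of CountVectorizer's default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically and numbered from 0.
all_tokens = {token for doc in email_list for token in tokenize(doc)}
vocabulary = {word: index for index, word in enumerate(sorted(all_tokens))}

# One {word index: count} mapping per document.
counts = [
    {vocabulary[word]: n for word, n in Counter(tokenize(doc)).items()}
    for doc in email_list
]

print(vocabulary["be"], vocabulary["great"], vocabulary["will"])  # 0 6 16
print(counts[1][vocabulary["great"]])                             # 3
```

The printed indices agree with `vectorizer.vocabulary_` above because both number the alphabetically sorted vocabulary from 0.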


# Getting stopwords from NLTK
* NLTK: Natural Language Toolkit

In [9]:
from nltk.corpus import stopwords

In [10]:
sw = stopwords.words("english")

In [11]:
display(sw[0:7])
len(sw)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours']

179
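
A stopword list is typically used to drop low-information words before counting. The sketch below shows the idea in plain Python; `stopword_subset` is a small hand-written stand-in (an assumption for illustration), and in the notebook one would pass `set(sw)` instead. The helper `remove_stopwords` is likewise a hypothetical name.

```python
# Hypothetical stand-in for `sw`: a small subset of English stopwords.
stopword_subset = {"i", "me", "my", "myself", "we", "our", "ours", "the", "will", "be"}

def remove_stopwords(text, stop_set):
    # Lowercase, split on whitespace, and keep only non-stopwords.
    return [word for word in text.lower().split() if word not in stop_set]

string1 = "hi Katie the self driving car will be late Best Sebastian"
filtered = remove_stopwords(string1, stopword_subset)
print(filtered)
# ['hi', 'katie', 'self', 'driving', 'car', 'late', 'best', 'sebastian']
```

Using a set rather than a list keeps each membership check fast, which matters once the corpus grows.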

# Stemming
* Reducing words to a common root (stem)
> Idea: not all word forms are equally important; variants of the same root can be treated as one term (e.g. respond, responsive, unresponsive -> respon)

Stemming helps to lower the dimensionality of the feature space.

In [4]:
from nltk.stem.snowball import SnowballStemmer

In [5]:
stemmer = SnowballStemmer("english")

In [14]:
display(stemmer.stem("responsiveness"))
display(stemmer.stem("responsivity"))
display(stemmer.stem("unresponsive"))

'respons'

'respons'

'unrespons'

# Importance of order of operations when processing text:
1. Stemming
2. Bag-of-words
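
The reason for this order can be sketched in a few lines: if words are stemmed before they are counted, variants of one root land in the same feature. The function `toy_stem` below is a deliberately crude, hypothetical stand-in for `SnowballStemmer` (it only strips three hard-coded suffixes), and `Counter` stands in for the bag-of-words step.

```python
from collections import Counter

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a few known suffixes.
    for suffix in ("iveness", "ivity", "ive"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

document = "responsive responsiveness responsivity class"

# Bag-of-words without stemming: every variant is its own feature.
raw_counts = Counter(document.split())

# Stemming first, then bag-of-words: the variants collapse into one feature.
stemmed_counts = Counter(toy_stem(word) for word in document.split())

print(len(raw_counts))       # 4
print(dict(stemmed_counts))  # {'respons': 3, 'class': 1}
```

Counting first would fix the four separate features before the stemmer ever sees the words.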

# Term Frequency - Inverse Document Frequency (TFIDF)
* TF (Term Frequency): like bag of words, each term is weighted by how often it occurs in a document
> e.g. a word that occurs 10 times has 10 times more weight than a word that appears only once
* IDF (Inverse Document Frequency): each term is weighted by the inverse of how often it occurs in the corpus as a whole (all the documents put together)
> e.g. rare words are rated more highly than common words (think signal vs. noise)
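
The two weights can be combined by hand on the three example emails. This is a sketch of the textbook form, tf × ln(N / df), under the assumption that plain whitespace tokenization is good enough here; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ. The names `doc_freq` and `tf_idf` are illustrative.

```python
import math
from collections import Counter

email_list = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie the machine learning class will be most excellent",
]

docs = [doc.lower().split() for doc in email_list]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    # Only valid for words that occur somewhere in the corpus (doc_freq > 0).
    tf = doc.count(word)                     # term frequency in this document
    idf = math.log(n_docs / doc_freq[word])  # rarer across the corpus -> larger
    return tf * idf

print(tf_idf("great", docs[1]))  # 3 * ln(3), roughly 3.296
print(tf_idf("hi", docs[1]))     # 0.0, because "hi" appears in every email
```

"great" scores highly because it is frequent in one email and absent from the others, while "hi" is pure noise under this weighting.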

# Mini-project
Construct your own version of that preprocessing step, so that you are going directly from raw data to processed features.

You will be given two text files: one contains the locations of all the emails from Sara, the other has emails from Chris. You will also have access to the `parseOutText()` function, which accepts an opened email as an argument and returns a string containing all the (stemmed) words in the email.

Start with a warmup exercise to get acquainted with `parseOutText()`. Go to the tools directory and run `parse_out_email_text.py`, which contains `parseOutText()` and a test email to run this function over.

`parseOutText()` takes the opened email and returns only the text part, stripping away any metadata that may occur at the beginning of the email, so that what is left is the text of the message. The script is currently set up to print the text of the email to the screen. What text do you get when you run `parseOutText()`?
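
The behaviour described here can be approximated with a short sketch. It is not the course's implementation: `parse_out_text_sketch` is a hypothetical name, and it assumes that the metadata ends at the first blank line (the usual email header convention), which may not be how `parseOutText()` finds the message body. It strips punctuation but leaves stemming out.

```python
import io
import string

def parse_out_text_sketch(email_file):
    # Read the whole opened email from the start.
    email_file.seek(0)
    raw = email_file.read()
    # Assumption: everything before the first blank line is metadata.
    _, _, body = raw.partition("\n\n")
    # Drop punctuation, then collapse runs of whitespace.
    body = body.translate(str.maketrans("", "", string.punctuation))
    return " ".join(body.split())

# A made-up test email, held in memory instead of a file on disk.
fake_email = io.StringIO(
    "From: someone@example.com\n"
    "Subject: test\n"
    "\n"
    "Hi Everyone!  If you can read this message, you're properly using parseOutText.\n"
)
text = parse_out_text_sketch(fake_email)
print(text)
# Hi Everyone If you can read this message youre properly using parseOutText
```

To get stemmed output, each word of the returned string could be passed through a stemmer such as `SnowballStemmer("english")` before the words are joined again.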

In [1]:
# This code required some work to port to Python 3
import parse_out_email_text

In [2]:
# Without stemming
# parse_out_email_text.main()



Hi Everyone  If you can read this message youre properly using parseOutText  Please proceed to the next part of the project



In [2]:
# With stemming
parse_out_email_text.main()

hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project
