# NLP Basics

### Natural Language Processing (NLP)
Field concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language.

NLP in real life: 
 * Spam filter
 * Auto-complete
 * Auto-correct
 
NLP encompasses:
 * Sentiment analysis
 * Topic modeling
 * Text classification
 * Sentence segmentation or part-of-speech tagging
 
 
### NLTK - Natural Language Toolkit 
Suite of open-source tools that make NLP processes in Python easier to build.


### How to install NLTK on your local machine

Both sets of instructions below assume you already have Python installed. These instructions are taken directly from [http://www.nltk.org/install.html](http://www.nltk.org/install.html).

**Mac/Unix**

From the terminal:
1. Install NLTK: run `pip install -U nltk`
2. Test installation: run `python` then type `import nltk`

**Windows**

1. Install NLTK: [http://pypi.python.org/pypi/nltk](http://pypi.python.org/pypi/nltk)
2. Test installation: `Start>Python35`, then type `import nltk`

### Download NLTK data

In [1]:
import nltk
nltk.download()
# Pick the parts of the package that you would like to have installed (maybe all)
# and click download.

# Alternatively, follow the instructions at:
# http://www.nltk.org/install.html  
# http://www.nltk.org/data.html

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [2]:
# Print all the functions, attributes, and methods within the nltk package:
dir(nltk)
# Note the following names among them:
# 'pos_tag'
# 'tokenize'

['AbstractLazySequence',
 'AffixTagger',
 'AlignedSent',
 'Alignment',
 'AnnotationTask',
 'ApplicationExpression',
 'Assignment',
 'BigramAssocMeasures',
 'BigramCollocationFinder',
 'BigramTagger',
 'BinaryMaxentFeatureEncoding',
 'BlanklineTokenizer',
 'BllipParser',
 'BottomUpChartParser',
 'BottomUpLeftCornerChartParser',
 'BottomUpProbabilisticChartParser',
 'Boxer',
 'BrillTagger',
 'BrillTaggerTrainer',
 'CFG',
 'CRFTagger',
 'CfgReadingCommand',
 'ChartParser',
 'ChunkParserI',
 'ChunkScore',
 'Cistem',
 'ClassifierBasedPOSTagger',
 'ClassifierBasedTagger',
 'ClassifierI',
 'ConcordanceIndex',
 'ConditionalExponentialClassifier',
 'ConditionalFreqDist',
 'ConditionalProbDist',
 'ConditionalProbDistI',
 'ConfusionMatrix',
 'ContextIndex',
 'ContextTagger',
 'ContingencyMeasures',
 'CoreNLPDependencyParser',
 'CoreNLPParser',
 'Counter',
 'CrossValidationProbDist',
 'DRS',
 'DecisionTreeClassifier',
 'DefaultTagger',
 'DependencyEvaluator',
 'DependencyGrammar',
 'DependencyGrap

### Preliminary examples: what can you do with NLTK?

In [3]:
from nltk.corpus import stopwords

stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [4]:
stopwords.words('english')[0:500:25]

['i', 'herself', 'been', 'with', 'here', 'very', 'doesn', 'won']

Stop words are words that are used very frequently but contribute little to the meaning of a sentence.
These words are generally sentiment-neutral, so they carry no strong meaning one way or the other. They just cloud the signal and take room away from words that aren't sentiment-neutral, so you can usually drop them safely.

### Structured Data vs. Unstructured Data

Most text data lacks the formal structure of numeric data.

What makes a file unstructured? Binary data, no delimiters, no indication of rows. Examples: an e-mail, a PDF file, or a social media post.

### Read in semi-structured text data

In [5]:
# Open and Read in the raw text
# https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#
rawData = open("SMSSpamCollection.tsv").read()

# Print the first 500 characters of the raw data
rawData[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

In [6]:
# let's replace '\t' by '\n' and then let's split the whole text by '\n' and return a list
parsedData = rawData.replace('\t', '\n').split('\n')

In [7]:
# let's print the first 5 elements of the list
parsedData[0:5]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham']

In [8]:
# Create a new list from parsedData: start at the zeroth position, go to the very end, and take every other element
labelList = parsedData[0::2]
# Do the same again, this time starting from the first position
textList = parsedData[1::2]

In [9]:
print(labelList[0:5])
print(textList[0:5])

['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']


In [10]:
import pandas as pd

#let's create a data frame and pass it a dictionary
fullCorpus = pd.DataFrame({
    'label': labelList,
    'body_list': textList
})

fullCorpus.head()

ValueError: arrays must all be same length

In [11]:
# let's check the length of both lists
print(len(labelList))
print(len(textList))

5571
5570


In [12]:
# let's see the very end of labelList
print(labelList[-5:])

['ham', 'ham', 'ham', 'ham', '']


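The last item of `labelList` is an empty string, most likely because the raw text ends with a newline. That is why `labelList` is one element longer than `textList`.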
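
The trailing empty string can be reproduced on a tiny stand-in for the file contents. The sketch below uses a made-up two-line sample (not the real dataset) and shows one possible alternative: `splitlines()` plus a single split on the tab, which keeps each label paired with its own message. The names `raw`, `pairs`, `labels`, and `texts` are illustrative only.

```python
# Hypothetical sample with the same layout as the file: label, tab, message, newline.
raw = "ham\tHello there\nspam\tFree entry now\n"

# Splitting on '\n' leaves an empty string after the final newline ...
print(raw.split('\n'))  # ['ham\tHello there', 'spam\tFree entry now', '']

# ... whereas splitlines() does not. Splitting each line once on the tab
# keeps every label together with its own message.
pairs = [line.split('\t', 1) for line in raw.splitlines()]
labels = [label for label, _ in pairs]
texts = [text for _, text in pairs]

print(labels)  # ['ham', 'spam']
print(texts)   # ['Hello there', 'Free entry now']
```

Splitting line by line also avoids the two lists drifting out of step if a message itself happens to contain a tab.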

In [13]:
fullCorpus = pd.DataFrame({
    'label': labelList[:-1], # just do not grab the last item
    'body_list': textList
})

fullCorpus.head()

Unnamed: 0,label,body_list
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [14]:
dataset = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None)
dataset.head()

Unnamed: 0,0,1
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [15]:
for col in dataset: 
    print(col) 

0
1


In [16]:
# dataset.rename(columns={'0': 'Label','1':'Body_list'}, inplace=True)
dataset.columns = ['Labels', 'Body_Text']
dataset.head()

Unnamed: 0,Labels,Body_Text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [17]:
# Another way to do above is as follows:
import pandas as pd
dataset = pd.read_csv("SMSSpamCollection.tsv", sep="\t",header=None)
dataset.columns = ['label', 'body_text']
dataset.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


### Explore the dataset

In [18]:
# What is the shape of the dataset?

print("Input data has {} rows and {} columns.".format(len(dataset), len(dataset.columns)))

Input data has 5568 rows and 2 columns.
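
Note that 5568 rows is fewer than the 5570 label/text pairs found by manual splitting earlier. The notebook does not explain the gap. One possible cause (an assumption, not verified against this file) is quote handling: CSV parsers by default treat a field that starts with `"` as quoted, so it can swallow the following newline. The sketch below shows the effect with Python's `csv` module on a made-up three-line sample.

```python
import csv
import io

# Hypothetical sample: the second field of line 1 opens a quote
# that is only closed on line 2.
sample = 'ham\t"hello\nspam\tfree"\nham\tbye\n'

# Default quoting: the quoted field spans two physical lines.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quote characters are treated as ordinary text.
literal_rows = list(csv.reader(io.StringIO(sample), delimiter='\t',
                               quoting=csv.QUOTE_NONE))

print(len(default_rows))  # 2
print(len(literal_rows))  # 3
```

`pd.read_csv` accepts a `quoting` argument that takes the same `csv` constants, which may be worth trying if the row counts need to agree.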


In [20]:
# How many spam/ham are there?

print("Out of {} rows, {} are spam, {} are ham.".format(len(dataset),
                                                       len(dataset[dataset['label']=='spam']),
                                                       len(dataset[dataset['label']=='ham'])))

Out of 5568 rows, 746 are spam, 4822 are ham.


In [23]:
# How much missing data is there?

print("Number of null in Labels: {}".format(dataset['label'].isnull().sum()))
print("Number of null in Body_Text: {}".format(dataset['body_text'].isnull().sum()))

Number of null in Labels: 0
Number of null in Body_Text: 0


### Using regular expressions in Python

Regular expression: a text string that describes a search pattern.
 * 'nlp' : matches only the literal string 'nlp'
 * '[j-q]' : matches any single character between 'j' and 'q', so not just 'n', 'l' and 'p' but every letter in that range
 * '[j-q]+' : matches one or more consecutive characters in that range, so it can capture strings longer than 1 character
 * '[0-9]+' : matches one or more consecutive digits
 * '[j-q0-9]+' : matches one or more consecutive characters that are either letters between 'j' and 'q' or digits
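
As a quick check, the patterns above can be tried with `re.findall` on a short string. The test string `sample` is made up for illustration.

```python
import re

sample = 'nlp42 in 2024'  # hypothetical test string

print(re.findall('nlp', sample))        # ['nlp']
print(re.findall('[j-q]', sample))      # ['n', 'l', 'p', 'n']
print(re.findall('[j-q]+', sample))     # ['nlp', 'n']
print(re.findall('[0-9]+', sample))     # ['42', '2024']
print(re.findall('[j-q0-9]+', sample))  # ['nlp42', 'n', '2024']
```

The 'i' in 'in' is not matched because it falls just before 'j' in the alphabet.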

Why are Regular Expressions useful?
 * Identifying whitespaces between words/tokens
 * Identifying/creating delimiters or end-of-line escape characters
 * Removing punctuation or numbers from your text
 * Cleaning HTML tags from text
 * Identifying some textual patterns you're interested in

Use Cases:
 * Confirming passwords meet criteria
 * Searching URL for some substring
 * Searching for files on your computer
 * Document scraping

Python's `re` package is the most commonly used regex resource. More details can be found [here](https://docs.python.org/3/library/re.html).

In [24]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### Splitting a sentence into a list of words

In [25]:
re.split(r'\s', re_test) # split on each single whitespace character (raw string avoids an invalid-escape warning)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [26]:
re.split(r'\s', re_test_messy)

['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

In [27]:
re.split(r'\s+', re_test_messy) # split on runs of one or more whitespace characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [28]:
re.split(r'\s+', re_test_messy1)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [29]:
re.split(r'\W+', re_test_messy1) # split on runs of one or more non-word characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [30]:
re.split(r'\w+', re_test_messy1)

['', '-', '-', '-', '/', '.', '*', '>>>>', '----', '""""""', '~', '-', '']

In [31]:
re.findall(r'\S+', re_test) # find runs of one or more non-whitespace characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [32]:
re.findall(r'\S+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [33]:
re.findall(r'\S+', re_test_messy1)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [34]:
re.findall(r'\w+', re_test_messy1) # find runs of one or more word characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

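### Takeaways
* Useful methods for tokenizing
    - findall(): returns the pieces that match the pattern
    - split(): returns the pieces between matches of the pattern
* Useful regular expressions for tokenizing
    - '\w' (word characters) & '\W' (non-word characters)
    - '\s' (whitespace characters) & '\S' (non-whitespace characters)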

### Replacing a specific string

In [35]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [36]:
re.findall('[a-z]+', pep8_test)

['try', 'to', 'follow', 'guidelines']

In [37]:
re.findall('[A-Z]+', pep8_test)

['I', 'PEP']

In [38]:
re.findall('[A-Z]+[0-9]+', pep8_test) # capture uppercase letters followed by digits

['PEP8']

In [39]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep8_test) # find the matching text and replace it with 'PEP8 Python Styleguide'

'I try to follow PEP8 Python Styleguide guidelines'

In [40]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep7_test)

'I try to follow PEP8 Python Styleguide guidelines'

In [41]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)

'I try to follow PEP8 Python Styleguide guidelines'

### Other examples of regex methods

- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
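
A brief sketch of how the methods listed above behave, reusing the PEP8 sentence from the earlier cells. The variable names (`text`, `pattern`, `first`, `words`) and the `'a.b axb'` string are illustrative only.

```python
import re

text = 'I try to follow PEP8 guidelines'
pattern = r'[A-Z]+[0-9]+'

# re.search: first match anywhere in the string (None if there is none)
first = re.search(pattern, text)
print(first.group(), first.start())               # PEP8 16

# re.match: succeeds only if the pattern matches at the start of the string
print(re.match(pattern, text))                    # None
print(re.match(r'I try', text).group())           # I try

# re.fullmatch: succeeds only if the whole string matches
print(re.fullmatch(pattern, 'PEP8') is not None)  # True
print(re.fullmatch(pattern, text) is not None)    # False

# re.finditer: like findall, but yields match objects (with positions)
words = [(hit.group(), hit.start()) for hit in re.finditer(r'[a-z]+', text)]
print(words)  # [('try', 2), ('to', 6), ('follow', 9), ('guidelines', 21)]

# re.escape: treat a string literally, so '.' no longer means "any character"
print(re.findall('a.b', 'a.b axb'))               # ['a.b', 'axb']
print(re.findall(re.escape('a.b'), 'a.b axb'))    # ['a.b']
```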

### Machine learning pipeline
1. Raw text - the model can't distinguish words
2. Tokenize - tell the model what to look at
3. Clean text - remove stop words and punctuation, apply stemming, etc., so that the model can focus on the most pivotal words in the text
4. Vectorize - convert the text to a numeric form that a machine learning algorithm can ingest and use to build a model: basically a matrix with one row per text message and one column per word
5. Machine learning algorithm - fit/train the model
6. Spam filter or other application - system to filter emails
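
To make the vectorize step more concrete, here is a minimal sketch of a count matrix built with the standard library only. The token lists in `docs` are made up, and the notebook itself may use a different vectorizer later; this only illustrates the "one row per message, one column per word" idea.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists (one per message).
docs = [
    ['free', 'entry', 'win', 'free'],
    ['date', 'sunday'],
    ['win', 'date'],
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc})

# Count matrix: one row per message, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc)  # missing words count as 0
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['date', 'entry', 'free', 'sunday', 'win']
for row in matrix:
    print(row)
# [0, 1, 2, 0, 1]
# [1, 0, 0, 1, 0]
# [1, 0, 0, 0, 1]
```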

### Pre-processing text data

Cleaning up the text data is necessary to highlight attributes that you're going to want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. **Lemmatize/Stem**
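
The first three steps can be sketched as one function. This is only an illustration: `clean_text` and the tiny `STOPWORDS` set are hypothetical stand-ins (the notebook uses NLTK's English stop word list), and lemmatizing/stemming is left out.

```python
import re
import string

# Hypothetical mini stop word set; the notebook uses NLTK's English list instead.
STOPWORDS = {'i', 'have', 'a', 'on', 'with', 'will'}

def clean_text(text, stopwords=STOPWORDS):
    # 1. Remove punctuation (and lowercase so 'DATE' and 'date' match)
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation).lower()
    # 2. Tokenize on runs of non-word characters, dropping empty strings
    tokens = [tok for tok in re.split(r'\W+', no_punct) if tok]
    # 3. Remove stop words
    return [tok for tok in tokens if tok not in stopwords]

print(clean_text('I HAVE A DATE ON SUNDAY WITH WILL!!'))  # ['date', 'sunday']
```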

In [42]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)  # show up to 100 characters per column when a pandas DataFrame is printed (the default is 50)

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [43]:
# What does the cleaned version look like?
data_cleaned = pd.read_csv("SMSSpamCollection_cleaned.tsv", sep='\t')
data_cleaned.head()

Unnamed: 0,label,body_text,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"['ive', 'searching', 'right', 'words', 'thank', 'breather', 'promise', 'wont', 'take', 'help', '..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"['free', 'entry', '2', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","['nah', 'dont', 'think', 'goes', 'usf', 'lives', 'around', 'though']"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent']"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"['date', 'sunday']"


### Remove punctuation

In [44]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [45]:
# reminder:
"I like NLP." == "I like NLP"

False

In [46]:
# Let's have Python cycle through each character and check whether it is punctuation:
# if it is, discard it; if it is not, keep it.
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

data['body_text_clean'] = data['body_text'].apply(lambda x: remove_punct(x))

data.head()

Unnamed: 0,label,body_text,body_text_clean
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
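
An alternative to the character-by-character loop is `str.translate` with a deletion table. This is a sketch of one other way to get the same result, not the notebook's method; `PUNCT_TABLE` and `remove_punct_translate` are illustrative names.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_translate(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

print(remove_punct_translate("Nah I don't think he goes to usf, he lives around here though"))
# Nah I dont think he goes to usf he lives around here though
```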


### Tokenization

In [51]:
import re

def tokenize(text):
    tokens = re.split(r'\W+', text)  # split wherever there are one or more non-word characters
    return tokens

data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))  # lowercase first: Python is case-sensitive, so 'NLP' and 'nlp' would otherwise count as different tokens

data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_tokenized_2
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[i, have, a, date, on, sunday, with, will]"


In [52]:
# Another way to tokenize your text
import nltk 
def tokenize_2(text):
    tokens_2 = nltk.word_tokenize(text)
    return tokens_2

data['body_text_tokenized_2'] = data['body_text_clean'].apply(lambda x: tokenize_2(x.lower()))

data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_tokenized_2
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[i, have, a, date, on, sunday, with, will]"


In [55]:
(data['body_text_tokenized'] == data['body_text_tokenized_2']).sum()

4729

In [71]:
len(data)

5568
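
So the two tokenizers agree on 4729 of the 5568 rows. The notebook does not say why the rest differ. One behavior that could contribute (an assumption, not checked against this dataset) is that `re.split` returns empty strings when the text starts or ends with a separator, while `re.findall` with `\w+` returns only the tokens themselves. The message below is made up.

```python
import re

text = 'ok see you later '  # hypothetical message with a trailing space

print(re.split(r'\W+', text))    # ['ok', 'see', 'you', 'later', '']
print(re.findall(r'\w+', text))  # ['ok', 'see', 'you', 'later']
```

If the empty strings are unwanted, they can be filtered out after splitting, or `findall` can be used instead.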

In [48]:
# reminder:
'NLP' == 'nlp'

False

### Remove stopwords

In [72]:
import nltk

stopword = nltk.corpus.stopwords.words('english')

In [73]:
stopwords  # note: this is the corpus reader imported in In [3]; the word list itself is stored in `stopword`

<WordListCorpusReader in 'C:\\Users\\34677\\AppData\\Roaming\\nltk_data\\corpora\\stopwords'>

In [74]:
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

data.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_tokenized_2,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[i, have, a, date, on, sunday, with, will]","[date, sunday]"


In [75]:
print(stopword)
print(len(stopword))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '