# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Feature Extraction from Text (Natural Language Processing)
Week 6 | Lesson 4.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Extract features from free form text using Scikit Learn
- Identify Parts of Speech using NLTK
- Remove stop words
- Describe how TF-IDF works

### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Use [nltk.download()](http://www.nltk.org/data.html) to download additional corpora if you need them
- Describe what a transformer is in Scikit Learn and use it
- Recognize basic principles of English language syntax

In [1]:
# Best not to run this cell as it will occupy your notebook kernel,
# better to take this and run it in a terminal
#import nltk
#nltk.download()

<a name="opening"></a>
## Opening
All the models we have learned so far accept a 2D table of real numbers as input (we called it X) and output a vector of classes or numbers (we called it y). Very often, though, our starting data is not a table of numbers but is unstructured. We have already seen how to impose structure on data in the case of scraping, for example. Text documents are a common case where we must define the structure ourselves. We need a way to go from unstructured data to a table of numbers so that we can then apply the usual methods. This is called _feature extraction_.

**Check:** Take a couple of minutes to think of real-world applications of Natural Language Processing and discuss.


<a name="introduction"></a>
## Feature Extraction from Text

### A simple example
Suppose we are building a spam/ham classifier. The inputs are emails and the output is a binary classification.

Here's an example of an input email:

In [3]:
from __future__ import print_function
import pandas as pd
pd.set_option("display.max_columns", 100)

In [5]:
spam = """
Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.
"""
ham = """
Hello,\nI am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.
"""
print(spam)
print("--")
print(ham)


Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.

--

Hello,
I am writing in regards to your application to the position of Data Scientist at Hooli X. We are pleased to inform you that you passed the first round of interviews and we would like to invite you for an on-site interview with our Senior Data Scientist Mr. John Smith. You will find attached to this message further information on date, time and location of the interview. Please let me know if I can be of any further assistance. Best Regards.



**Check:** Can you think of a simple heuristic rule (an imperfect but quick solution) to catch emails like this?

By defining a simple rule that parses the text, we have performed one of the simplest forms of feature extraction from text: binary word counting, which records only whether a given word is present.
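One way such a rule might look is sketched below. The keyword list, the threshold and the two short example strings are hypothetical choices made for illustration; they are not part of the lesson data.

```python
import re

# Hypothetical keyword list, picked by hand for illustration only
SUSPICIOUS_WORDS = {"donate", "euros", "million", "cancer"}

def binary_word_features(text, vocabulary):
    """Map each vocabulary word to 1 if it occurs in the text, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {word: int(word in tokens) for word in sorted(vocabulary)}

def looks_like_spam(text, vocabulary=SUSPICIOUS_WORDS, threshold=2):
    """Flag a text when at least `threshold` suspicious words are present."""
    features = binary_word_features(text, vocabulary)
    return sum(features.values()) >= threshold

example_spam = "I decided to donate the sum of 8,750,000.00 Euros"
example_ham = "We would like to invite you for an on-site interview"

print(looks_like_spam(example_spam))
print(looks_like_spam(example_ham))
```

The features here are binary (present or absent); the next section replaces them with counts.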

### Bag of words (word counting)

The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text (such as a sentence or a document) is represented by the counts of its words, disregarding grammar and even word order. Much of the meaning is lost as a result, but topics in particular are still well represented by this approach.

In [6]:
from collections import Counter

print(Counter(spam.lower().split()))
print("--")
print(Counter(ham.lower().split()))

Counter({'i': 7, 'of': 4, 'and': 3, 'is': 2, 'etc.': 2, 'am': 2, 'an': 2, 'have': 2, 'in': 2, 'your': 2, 'to': 2, 'years': 2, 'with': 2, 'this': 2, 'contact': 2, 'the': 2, 'major': 1, 'old': 1, 'cancer': 1, 'outstanding': 1, 'seven': 1, 'decided': 1, 'through': 1, 'carefully': 1, 'euros(eight': 1, 'seem': 1, 'saw': 1, 'information': 1, 'for': 1, 'euros': 1, 'fifty': 1, '86': 1, 'sum': 1, '"lukoil".': 1, 'only': 1, 'pjsc': 1, 'mr.': 1, '2': 1, 'linkedin.': 1, 'will/donate': 1, 'you': 1, 'hundred': 1, 'was': 1, 'personality.': 1, 'chairman': 1, 'profile': 1, 'you.': 1, 'hello,': 1, 'ago.': 1, 'read': 1, 'going': 1, 'thousand': 1, 'million': 1, 'grayfer': 1, 'reason': 1, 'be': 1, 'one': 1, 'why': 1, 'on': 1, 'name': 1, 'week.': 1, '8,750,000.00': 1, 'later': 1, 'board': 1, 'operation': 1, 'will': 1, 'directors': 1, 'diagnosed': 1, 'valery': 1, 'my': 1})
--
Counter({'to': 5, 'you': 4, 'of': 4, 'the': 3, 'and': 2, 'we': 2, 'scientist': 2, 'data': 2, 'i': 2, 'further': 2, 'this': 1, 'regards

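In the above example we counted the number of times each word appeared in the text. Note that since we included all the words in the text, we created a dictionary that contains many words with only one appearance. Also note that because we split only on whitespace, punctuation stays attached to the tokens, so 'you' and 'you.' are counted as different words.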
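A slightly more careful word count can be sketched with a regular expression that keeps only runs of letters and digits, so that trailing punctuation no longer creates separate entries. The function name `count_words` and the sample sentence are made up for this illustration.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, keep runs of letters or digits, and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

sample = "I am in contact with you. Contact me: you will see!"
counts = count_words(sample)

# 'you.' and 'you' are now counted together, as are 'Contact' and 'contact'
print(counts["you"], counts["contact"])
```

This is still a bag of words: the order of the tokens is discarded and only their counts are kept.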

<a name="demo"></a>
## Demo: Scikit Learn Count Vectorizer

Scikit learn offers a way to do bag-of-words with the CountVectorizer, with many configurable options:

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
# Initialise and fit to the 'spam' message we defined above
# Note ngram_range defaults to (1, 1), so only single words are returned
# Note we did not define stop words, so these are included in the fit

cvec = CountVectorizer()
cvec.fit([spam])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]:
# Returns features (just showing ten out of the total of 69)
# Note that all words have been set to lowercase, this is a default option (see output above)

cvec.get_feature_names()[10:20]

[u'cancer',
 u'carefully',
 u'chairman',
 u'contact',
 u'decided',
 u'diagnosed',
 u'directors',
 u'donate',
 u'eight',
 u'etc']

What did this operation do? Remember that sklearn follows a fit and transform flow. First we fit the CountVectorizer object, which defines the dictionary (vocabulary) of possible words. Then, when we transform any document, it counts the number of times each word in that dictionary appears in the document. (We use the word document to mean any individual text input; for example, each row of a dataframe could be a document.)

When we transform a document, any word that was not present in the documents used for the fit simply does not appear in the transformed output at all. There is no built-in base corpus of words: you define the vocabulary when you perform the fit.
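The behaviour for unseen words can be checked with a small sketch. The two toy sentences are invented for illustration, and the `vocabulary_` attribute is used to look up column indices because it is available across sklearn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat on the mat"]   # used to define the vocabulary
new_docs = ["the dog sat on the log"]     # contains words never seen in the fit

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# Count the fitted vocabulary words in the new document
row = vectorizer.transform(new_docs).toarray()[0]
result = {term: int(row[index]) for term, index in vectorizer.vocabulary_.items()}

print(result)
```

The words 'dog' and 'log' have no column in the output, while 'cat' and 'mat' keep their columns with a count of zero.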

In [11]:
df  = pd.DataFrame(cvec.transform([spam]).todense(), columns=cvec.get_feature_names())
df.transpose()[0].sort_values(ascending=False)[0:10]

of         4
and        3
your       2
this       2
in         2
you        2
have       2
euros      2
etc        2
contact    2
Name: 0, dtype: int64

Note that CountVectorizer has several parameters that we can tweak.

**Check:** spend a couple of minutes scanning the documentation to figure out what those parameters do. Take 5 minutes, then share a few takeaways from the documentation in groups. What features stand out to you?

[CountVectorizer Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
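As a starting point for that exploration, here is a minimal sketch of two of the parameters, `stop_words` and `ngram_range`. The documents and the short custom stop word list are hypothetical; passing `stop_words="english"` would use sklearn's built-in list instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["please send the money now", "please send the report now"]

# Remove a custom list of stop words, then keep single words and word pairs
cvec_bigrams = CountVectorizer(stop_words=["the", "please", "now"],
                               ngram_range=(1, 2))
cvec_bigrams.fit(docs)

features = sorted(cvec_bigrams.vocabulary_)
print(features)
```

Stop words are removed before the n-grams are built, so the pair 'send money' appears even though 'the' sat between the two words in the original text.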


<a name="guided_practice"></a>
## Sklearn HashingVectorizer

As you have seen in the CountVectorizer documentation, we can limit the dictionary to a fixed size or keep only words within a certain frequency range. However, we still have to compute the dictionary and hold it in memory, which can be a problem when we have a large corpus.

One approach to optimising this in sklearn is the `HashingVectorizer`, which converts a collection of text documents to a matrix of occurrences. Each word is mapped to a feature by a hash function that converts it to a hash (essentially an index). If we encounter that word again in the text, it is converted to the same hash, allowing us to count word occurrences without retaining a dictionary in memory. We do not actually fit the hashing function, since it has already been defined, so we can use the hashing approach for words we have never seen before.

This is very convenient. The main drawback is that it is not possible to compute the inverse transform, so we lose information on which words the important features correspond to. It is also possible (though unlikely) for two words to be assigned the same hash (a hash collision), in which case a model built on the HashingVectorizer may underperform one built with a CountVectorizer. However, the savings are substantial, decreasing both run time and memory allocation, and the impact on accuracy tends to be minimal owing to the number of possible hashes. Of course it pays to verify this when building and testing.

I have included a paper which describes the methodology [here](./assets/papers/feature_hashing.pdf). By default there are 2^20 possible hashes in sklearn (just over 1 million), so any given pair of words is unlikely to collide (you can set this to a maximum of 2^31 - 1).

So the key to remember here is that from the modelling perspective, we don't actually input words. We just input some value (index, hash) corresponding to a word and its frequency, and we get out whether that was a predictive feature. But it doesn't matter what word it was, unless we want to actually interpret our model. But interpretation is distinct from accurate prediction. So, generally you would perform exploration using CountVectorizer in order to be able to readily query your model, and then implement HashingVectorizer in production in order to reduce memory load. This is particularly a concern in text applications because of the sheer size of the potential number of features when you start including multiple combinations of words (ngrams).
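The idea of counting by hash can be sketched in plain Python. This is a simplified stand-in and not sklearn's implementation: it uses MD5 from the standard library as the hash function and a deliberately tiny number of buckets, both of which are assumptions made for illustration.

```python
import hashlib
import re

def hashed_counts(text, n_features=16):
    """Count words into a fixed-size vector, indexed by a hash of each word."""
    vector = [0] * n_features
    for token in re.findall(r"[a-z]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features   # same word -> same index, always
        vector[index] += 1
    return vector

first = hashed_counts("free prize call now free")
second = hashed_counts("a document with words we have never seen before")

print(first)
print(second)
```

No vocabulary is stored anywhere, which is why no fit step is needed, and also why we cannot map an index back to a word. With only 16 buckets collisions are likely; a much larger `n_features` makes them rarer.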


**Check:** What characteristics should feature extraction from text satisfy?

In [12]:
# HashingVectorizer is implemented in the same way but there
# is no need to fit this time, since the hashing function
# has already been defined by sklearn. If you run fit, 
# it will just do nothing

from sklearn.feature_extraction.text import HashingVectorizer
hvec = HashingVectorizer()

In [14]:
# The output has been normalised (L2 norm by default) and the columns are hash values
# rather than words, so interpretability is lost.
# The values also do not correspond directly to counts, because hashed features can be
# given a negative sign (a strategy to reduce the effect of hash collisions).
# See for example the discussion of this implementation in sklearn here:
# https://github.com/scikit-learn/scikit-learn/issues/7513

df2  = pd.DataFrame(hvec.transform([spam]).todense())
df2.transpose()[0].abs().sort_values(ascending=False)[0:10]

479532     0.338062
180525     0.253546
1005907    0.169031
994433     0.169031
170062     0.169031
174171     0.169031
832412     0.169031
967636     0.169031
757616     0.169031
144749     0.169031
Name: 0, dtype: float64
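To see raw counts rather than scaled, signed values, the normalisation and the sign alternation can be switched off. The sketch below assumes a version of sklearn in which `HashingVectorizer` accepts the `alternate_sign` parameter; the reduced `n_features` and the toy sentence are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No normalisation and no alternating sign, so entries are plain counts
hvec_counts = HashingVectorizer(n_features=2**10, norm=None, alternate_sign=False)

X = hvec_counts.transform(["free prize call now free"])

print(X.shape)   # one row, 2**10 hashed columns
print(X.sum())   # total number of tokens counted
print(X.max())   # largest count in any single column
```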

**Check:** What new parameters does this vectoriser offer?

<a name="introduction_2"></a>
## Intro: NLTK (Natural Language Tool Kit)

Bag-of-words approaches like the one outlined above completely ignore the structure of a sentence; they merely assess the presence of specific words or word combinations. Here are some additional techniques that can help to build up the complexity required to deal with language as it is actually used (bear in mind that dealing with natural language includes some of the toughest problems in machine learning today).

### Segmentation

_Segmentation_ is a technique for identifying units such as sentences within a body of text. Punctuation serves as our first guide.

In [15]:
easy_text = "I went to the zoo today. What do you think of that? I bet you hate it! Or maybe you don't"

easy_split_text = ["I went to the zoo today.", "What do you think of that?", "I bet you hate it!", "Or maybe you don't"]

In [16]:
def simple_sentencer(text):
    '''Take a string `text` and return
    a list of strings, each containing a sentence'''
    
    sentences = []
    substring = ''
    for c in text:
        if c in ('.', '!', '?'):
            sentences.append(substring + c)
            substring = ''
        else:
            substring += c
    return sentences

simple_sentencer(easy_text)

['I went to the zoo today.',
 ' What do you think of that?',
 ' I bet you hate it!']

The sentencer above doesn't work perfectly: it drops the final sentence, which has no closing punctuation, and it keeps the leading spaces. In the lab you will learn how to improve it. Thankfully, over many years people have worked on such issues, and the NLTK library offers an easy-to-use sentencer (thank you, open source coding).

In [17]:
# You have to use nltk.download() to get the PunktSentenceTokenizer

from nltk.tokenize import PunktSentenceTokenizer

sent_detector = PunktSentenceTokenizer()
sent_detector.sentences_from_text(easy_text)

['I went to the zoo today.',
 'What do you think of that?',
 'I bet you hate it!',
 "Or maybe you don't"]

**Check:** Does NLTK offer other Tokenizers? Use nltk.download() to explore the available packages.

<a name="demo_2"></a>
### Stemming

Normalisation deals with the case where slightly different versions of the same word or phrase exist. For example: LinkedIn will see hundreds of variations of the title "Software Developer" (including "Code Ninja").

**Check:** What are other common cases of text that could need normalisation?

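It would be wrong to consider the words "MR." and "mr" to be different features, so we normalise case and punctuation. Similarly, inflected or derived forms of a word should often count as one feature, so we need a technique to reduce words to a common root. This technique is called _stemming_.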

- Science, Scientist => Scien
- Swimming, Swimmer, Swim => Swim

As we did above we could define a Stemmer based on rules:

In [18]:
def stem(tokens):
    '''rules-based stemming of a bunch of tokens'''
    
    new_bag = []
    for token in tokens:
        # define rules here
        if token.endswith('s'):
            new_bag.append(token[:-1])
        elif token.endswith('er'):
            new_bag.append(token[:-2])
        elif token.endswith('tion'):
            new_bag.append(token[:-4])
        elif token.endswith('tist'):
            new_bag.append(token[:-4])
        elif token.endswith('ce'):
            new_bag.append(token[:-2])
        elif token.endswith('ing'):
            new_bag.append(token[:-3])
        else:
            new_bag.append(token)

    return new_bag

stem(['Science', 'Scientist'])

['Scien', 'Scien']

As before, NLTK contains several robust stemmers.

In [19]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('Swimmed'))
print(stemmer.stem('Swimming'))

from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import EnglishStemmer

Swim
Swim


**Check:** There are various stemmers available in NLTK, such as the additional two imported above. Take a look at [the documentation](http://www.nltk.org/api/nltk.stem.html) and see if you can find the differences.
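One way to explore the differences is to run the three stemmers side by side on the same words. The word list below is an arbitrary choice for illustration, and the outputs are deliberately not listed here because they can vary between NLTK versions.

```python
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import EnglishStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": EnglishStemmer(),
}
words = ["swimming", "scientist", "generously", "maximum"]

# For each word, collect the stem produced by each stemmer
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in comparison.items():
    print(word, stems)
```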

### Stop Words

Some words are very common and provide little information about the content of a text. We usually remove these.

**Check:** Can you give some examples?

In [21]:
# You will need to have used nltk.download() to download the stopwords corpus for this to work

from nltk.corpus import stopwords

stop = stopwords.words('english')
sentence = "this is a really interesting sentence"
print([i for i in sentence.split() if i not in stop])

['really', 'interesting', 'sentence']


In [22]:
# Note you might want to add or remove words from the stopwords list, depending
# on your application (e.g. domain specific words)
stop[0:10]

[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your']

### Parts of Speech

Each word has a specific role in a sentence (verb, noun, etc.). Parts-of-speech (POS) tagging is a feature extraction technique that attaches a tag to each word in the sentence, in order to provide a more precise context for further analysis. This is often a resource-intensive process, but it can sometimes improve the accuracy of our models.



In [26]:
# word_tokenize separates the words and pos_tag identifies the word type.
# There are various taggers available; take a look at the nltk.tag documentation.

from nltk import pos_tag, word_tokenize
pos_tag(word_tokenize("today is a great day to learn nlp"))

[('today', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('day', 'NN'),
 ('to', 'TO'),
 ('learn', 'VB'),
 ('nlp', 'NN')]

In [27]:
# This prints the tags and their meanings. How did I know it was the upenn tagset?
# The NLTK documentation is not very clear on this, so I had to go digging in the source code to confirm it.
# This is probably a bit of code you want to save if you're looking to interpret these tags.

import nltk.help as help_nltk
help_nltk.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

<a name="guided-practice_2"></a>
## Term Frequency - Inverse Document Frequency

More powerful than just removing known stop words is the TF-IDF approach. This tells us which words are most discriminative between documents. Words that occur a lot in one document but don't occur in many documents will tell you something special about that document. This naturally reduces the importance of stop words and, more powerfully, generalises to terms that are common within a specific domain. Its calculation is straightforward.

Term frequency TF is the frequency of a certain term (e.g. a word or stem) in a document:
$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

Inverse document frequency IDF of a term is based on the number of documents in the corpus divided by the number of documents that contain the term. Typically a log scaling is used; the version below is the smoothed form that sklearn uses by default.
$$
\mathrm{idf}(t, D) = \log\left(\frac{1+ N_\text{Documents}}{1+ N_\text{Documents that contain term}}\right) + 1
$$

Term frequency Inverse Document Frequency (TF-IDF) is calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$

This enhances terms that are highly specific to a particular document, while suppressing terms that are common to most documents. Hence, if a term has a high TF-IDF in a particular document, it occurs proportionately more in that document than it tends to occur in the corpus as a whole (all documents). It is therefore an interesting word from the perspective of identifying characteristics of the document, such as the topic (e.g. is the document about politics, mathematics, the environment?) or the sentiment (e.g. is the writer expressing happiness, sadness, or frustration?).
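The calculation can be worked through by hand on a toy corpus. The sketch below assumes the relative term frequency defined above and the smoothed IDF documented for sklearn, ln((1 + N) / (1 + df)) + 1; the three tokenised documents are invented for illustration. It does not apply the L2 normalisation that sklearn's vectorizer performs by default, so the numbers will not match sklearn's output exactly.

```python
import math

# Toy corpus, already tokenised (hypothetical data)
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["see", "you", "later"],
]

def tf(term, doc):
    """Share of the document's terms that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + n_containing)) + 1

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'free' appears in one document, 'call' in two, so 'free' scores higher
print(tfidf("free", docs[0], docs))
print(tfidf("call", docs[0], docs))
```

Both terms have the same term frequency in the first document, so the difference in score comes entirely from the IDF factor.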

Scikit Learn provides a TF-IDF vectorizer (`TfidfVectorizer`) that works similarly to the other vectorizers. Alternatively, we could use CountVectorizer followed by TfidfTransformer. We need to think carefully about what our corpus is in order to interpret the results well.
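The equivalence of the two routes can be checked with a short sketch. The three toy documents are hypothetical, and both routes use sklearn's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "free prize call now",
    "call me later",
    "see you later",
]

# Route 1: a single TfidfVectorizer
direct = TfidfVectorizer().fit_transform(docs)

# Route 2: raw counts first, then the TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print(same)
```

The two-step route is convenient when you also want to keep the raw counts, for example to inspect them before weighting.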


In [28]:
# This is a spam dataset with labels

corpus=pd.read_csv("assets/SMSSpamCollection.txt", sep="\t", header=None, encoding="utf-8")
corpus.columns=["classification", "text"]

In [29]:
# Let's initialise the tfidf and calculate the word frequencies for the whole corpus

from sklearn.feature_extraction.text import TfidfVectorizer
tvec = TfidfVectorizer()
tvec.fit(corpus["text"])

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [30]:
# Now we take one example document from each class (the one at position 100) to compare its contents to the corpus
example_spam=corpus.loc[corpus["classification"]=="spam",:].iloc[100]
example_ham=corpus.loc[corpus["classification"]=="ham",:].iloc[100]

In [31]:
example_spam["text"]

u'To review and KEEP the fantastic Nokia N-Gage game deck with Club Nokia, go 2 www.cnupdates.com/newsletter. unsubscribe from alerts reply with the word OUT'

In [32]:
example_ham["text"]

u"Hmm...my uncle just informed me that he's paying the school directly. So pls buy food."

In [33]:
# This performs the tfidf transformation on the two examples and puts them in a dataframe
# so we can compare the word frequency in each document with the word frequencies in the whole
# corpus, and identify words which stand out

df3  = pd.DataFrame(tvec.transform([example_spam["text"], example_ham["text"]]).todense(), columns=tvec.get_feature_names(), index=['example_spam', 'example_ham'])

In [34]:
# Then we display the columns sorted by the spam example's highest ranking tfidf score for the top 10.
df3.transpose().sort_values('example_spam', ascending=False).head(10).transpose()

| | nokia | deck | newsletter | cnupdates | gage | alerts | review | with | fantastic | club |
|---|---|---|---|---|---|---|---|---|---|---|
| example_spam | 0.346322 | 0.27441 | 0.27441 | 0.27441 | 0.27441 | 0.27441 | 0.235925 | 0.231317 | 0.222039 | 0.20691 |
| example_ham | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [35]:
# This shows us similarly the ham example's highest ranking tfidf scores
df3.transpose().sort_values('example_ham', ascending=False).head(10).transpose()

| | directly | informed | paying | uncle | hmm | food | school | buy | pls | he |
|---|---|---|---|---|---|---|---|---|---|---|
| example_spam | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| example_ham | 0.345969 | 0.345969 | 0.334647 | 0.318081 | 0.311635 | 0.292379 | 0.286856 | 0.246985 | 0.221469 | 0.20174 |


In [36]:
# This shows us all the word features that are actually present in either of the examples and their tfidf
# just as a useful piece of code
df3.loc[:, (df3 != 0).any(axis=0)]

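| | alerts | and | buy | club | cnupdates | com | deck | directly | fantastic | food | from | gage | game | go | he | hmm | informed | just | keep | me | my | newsletter | nokia | out | paying | pls | reply | review | school | so | that | the | to | uncle | unsubscribe | with | word | www |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| example_spam | 0.27441 | 0.090505 | 0.0 | 0.20691 | 0.27441 | 0.165189 | 0.27441 | 0.0 | 0.222039 | 0.0 | 0.123267 | 0.27441 | 0.202175 | 0.124293 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166534 | 0.0 | 0.0 | 0.27441 | 0.346322 | 0.124526 | 0.0 | 0.0 | 0.144785 | 0.235925 | 0.0 | 0.0 | 0.0 | 0.164818 | 0.067412 | 0.0 | 0.20525 | 0.231317 | 0.183158 | 0.153926 |
| example_ham | 0.0 | 0.0 | 0.246985 | 0.0 | 0.0 | 0.0 | 0.0 | 0.345969 | 0.0 | 0.292379 | 0.0 | 0.0 | 0.0 | 0.0 | 0.20174 | 0.311635 | 0.345969 | 0.167721 | 0.0 | 0.139092 | 0.144414 | 0.0 | 0.0 | 0.0 | 0.334647 | 0.221469 | 0.0 | 0.0 | 0.286856 | 0.160044 | 0.15251 | 0.120848 | 0.0 | 0.318081 | 0.0 | 0.0 | 0.0 | 0.0 |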


In [37]:
# Just to walk through the syntax, here's one way you could do a basic spam filter with TFIDF

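# Preparing the data: feature extraction
# Note: tvec was fitted on the whole corpus above, so the test messages influenced the
# vocabulary and IDF weights; in practice you would fit the vectorizer on X_train only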
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(corpus["text"], corpus["classification"], test_size=0.2, random_state=42)
mapping={"spam":1, "ham":0}
y_train=y_train.map(mapping)
y_test=y_test.map(mapping)
X_train_tfidf=pd.DataFrame(tvec.transform(X_train).todense(), columns=tvec.get_feature_names())
X_test_tfidf=pd.DataFrame(tvec.transform(X_test).todense(), columns=tvec.get_feature_names())

# Modelling based on those features and assess model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
logit=LogisticRegression()
logit.fit(X_train_tfidf, y_train)
predictions=logit.predict(X_test_tfidf)
print("accuracy is", accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print("remember the distribution is uneven at:")
print(y_test.value_counts())

accuracy is 0.964125560538
[[966   0]
 [ 40 109]]
remember the distribution is uneven at:
0    966
1    149
Name: classification, dtype: int64
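One way to keep the vectorizer from seeing the test messages is to wrap it in a pipeline with the model, so that the vocabulary and IDF weights are learned from the training data only. The sketch below uses a tiny hand-written dataset (hypothetical, not the SMS corpus) purely to show the syntax; with the real data you would pass `X_train` and `y_train` to `fit`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
train_texts = [
    "win a free prize now",
    "free cash call now",
    "are we meeting for lunch",
    "see you at home later",
]
train_labels = [1, 1, 0, 0]

# The vectorizer is fitted inside the pipeline, on the training texts only
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# New texts are transformed with the training vocabulary, then classified
predictions = model.predict(["free prize call now", "lunch at home later"])
print(predictions)
```

Because the pipeline takes raw text as input, the same object can be passed to tools such as `cross_val_score` without leaking information between folds.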


**Check:** What do we do all this for? Remember we started with just unstructured text.

<a name="conclusion"></a>
## Conclusion

In this lesson we learned about the first steps in Natural Language Processing and about two very powerful toolkits:
- Scikit Learn Feature Extraction Text
- Natural Language Tool Kit

**Check:** Discussion: what are some real world applications of these techniques?

## Some additional notes on text packages

If you are interested in working with text, e.g. on your capstone, you may find the following packages worth researching besides NLTK and Sklearn:
- [Gensim](https://radimrehurek.com/gensim/) is for topic modelling and implements Word2Vec, which can tell you the distance between words (such as Queen to King vs wife to husband)
- [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/) works with gensim to visualise topics in text
- [Fuzzywuzzy](https://pypi.python.org/pypi/fuzzywuzzy) performs fuzzy string matching to score the similarity between strings
- [Textblob](https://textblob.readthedocs.io/en/dev/) wraps up some nltk functionality in an easy-to-use interface (e.g. it can do sentiment analysis in a few lines)
