<img style="float: center;" src="images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, and Ridhima Sodhi. "ADA-KCMO-2018." Coleridge Initiative GitHub Repositories. 2018. https://github.com/Coleridge-Initiative/ada-kcmo-2018. [![DOI](https://zenodo.org/badge/119078858.svg)](https://zenodo.org/badge/latestdoi/119078858)

# Text Analysis
---

## Table of Contents

- [Introduction](#Introduction)
    - [Learning Objectives](#Learning-Objectives)
    - [Glossary of Terms](#Glossary-of-Terms)
- [Python Setup](#Python-Setup)
- [Loading the Data](#Loading-the-Data)
- [Data Exploration & Creating Text Corpus](#Data-Exploration-&-Creating-Text-Corpus)
- [The Principles of Topic Modeling](#The-Principles-of-Topic-Modeling)
- [Topic Modeling in Practice](#Topic-Modeling-in-Practice)
    - [Creating a matrix of Features from Text – Bag of N-grams](#Creating-a-matrix-of-features-from-text-–-Bag-of-N-grams)
    - [Calculating Word Counts](#Calculating-Word-Counts)
    - [Text Cleaning and Normalization](#Text-Cleaning-and-Normalization)
    - [Breaking Text into Pieces – Tokenizing](#Breaking-Text-into-Pieces-–-Tokenizing)
    - [Removing Meaningless Text – Stopwords](#Removing-Meaningless-Text-–-Stopwords)
    - [Stemming and Lemmatization - Distilling text data](#Stemming-and-Lemmatization---Distilling-text-data)
    - [N-grams - Adding context by creating N-grams](#N-grams---Adding-context-by-creating-N-grams)
    - [TF-IDF - Weighting terms based on frequency](#TF-IDF---Weighting-terms-based-on-frequency)
- [Additional Resources](#Additional-Resources)

## Introduction

- Back to [Table of Contents](#Table-of-Contents)

**Text analysis** is used to extract useful information from or summarize a large amount of unstructured text stored in documents. This opens up the opportunity of using text data alongside more conventional data sources (e.g. surveys and administrative data). The goal of text analysis is to take a large corpus of complex and unstructured text data and extract important and meaningful messages in a comprehensible way. 

Text analysis can help with the following tasks:

* **Information Retrieval**: Find relevant information in a large body of text, for example when conducting a systematic literature review, a task that would be very time-consuming to do manually. 

* **Clustering and Text Categorization**: Summarize a large corpus of text by finding the most important phrases, using methods like topic modeling. 

* **Text Summarization**: Create category-sensitive text summaries of a large corpus of text. 

* **Machine Translation**: Translate documents from one language to another. 


### Learning Objectives

In this tutorial, you will...
* Learn how to transform a corpus of text into a structured matrix format so that we can apply natural language processing (NLP) methods
* Learn the basics and applications of topic modeling
* Learn how to do document tagging and evaluate the results

### Glossary of Terms


* **Corpus**: A corpus is the set of all text documents used in your analysis; for example, your corpus of text may include hundreds of research articles.

* **Tokenize**: Tokenization is the process by which text is separated into meaningful terms or phrases. In English this is easy to do for individual words, as they are separated by whitespace; however, it is harder to automatically determine which groups of words constitute meaningful phrases. 

* **Stemming**: Stemming is normalizing text by reducing all forms or conjugations of a word to the word's most basic form. In English, this can mean making a rule of removing the suffixes "ed" or "ing" from the end of all words, but it gets more complex. For example, "to go" is irregular, so you need to tell the algorithm that "went" and "goes" stem from a common lemma, and should be considered alternate forms of the word "go."

* **TF-IDF**: TF-IDF (term frequency-inverse document frequency) is an example of feature engineering where the most important words are extracted by taking into account both their frequency within individual documents and their frequency across the corpus as a whole.

* **Topic Modeling**: Topic modeling is an unsupervised learning method where groups of words that often appear together are clustered into topics. Typically, the words in one topic should be related and make sense (e.g. boat, ship, captain). Individual documents can fall under one topic or multiple topics. 

* **LDA**: LDA (Latent Dirichlet Allocation) is a type of probabilistic model commonly used for topic modeling. 

* **Stop Words**: Stop words are words that have little semantic meaning but occur very frequently, like prepositions, articles and common nouns. For example, every document (in English) will probably contain the words "and" and "the" many times. You will often remove them as part of preprocessing using a list of stop words.
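
To make the TF-IDF entry in the glossary concrete, here is a minimal sketch using the textbook weighting `tf * log(N / df)`. The three-document corpus and the `tf_idf` helper are made up for illustration, and libraries such as `sklearn` use smoothed variants of this formula, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["the", "ship", "sailed"],
    ["the", "captain", "sailed", "the", "ship"],
    ["the", "market", "fell"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    """Textbook TF-IDF: raw count in the document times log(N / df)."""
    tf = doc.count(word)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

# "the" appears in every document, so its weight is zero everywhere.
print(tf_idf("the", docs[1]))
# "captain" appears in only one document, so it gets a higher weight there.
print(tf_idf("captain", docs[1]))
```

A word that occurs in every document carries no information about which document you are looking at, which is why its weight collapses to zero.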


## Python Setup

- Back to [Table of Contents](#Table-of-Contents)

In [None]:
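# kept first: outside IPython, __future__ imports must precede all other statements
from __future__ import print_function
# %pylab inline also loads numpy as `np`, which later cells rely on
%pylab inline 
import nltk
import ujson
import re
import time
# import progressbar
import psycopg2

import pandas as pd
from six.moves import zip, range 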
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, auc
from sklearn import preprocessing
from collections import Counter, OrderedDict
from nltk.corpus import stopwords
from nltk import SnowballStemmer

# nltk.download('stopwords') #download the latest stopwords

## Loading the Data

- Back to [Table of Contents](#Table-of-Contents)

Our dataset for this tutorial is a set of articles from the Business section of the Kansas City Star. For every article, we have parsed out the title, the date, the author, and the text (the article's content).

Let's start by loading the table into a `pandas` DataFrame.

In [None]:
# Database connection properties
db_name = "appliedda"
db_host = "10.10.2.10"
conn = psycopg2.connect(database=db_name, host=db_host) #database connection

In [None]:
df = pd.read_sql('SELECT * FROM ada_kcmo.kansas_city_star_business_articles_2016_2018;', conn)

## Data Exploration & Creating Text Corpus

- Back to [Table of Contents](#Table-of-Contents)

Our business articles data table has 5 fields:

- `title` - article title as it appeared online and in the print newspaper.
- `date` - date of article publication.
- `author` - article author or authors.
- `text` - article content.
- `article_id` - unique identifier for every article.

Let's take a look at examples of the values:

In [None]:
df.describe(include = 'all')

In [None]:
df.head()

Let's take a closer look at the title and text variables.

In [None]:
df['title'][0]

In [None]:
df['text'][0]

Both `title` and `text` have text content. Unsurprisingly, the article content is longer than the title. We could perform text analysis of the title, the content, or both. For the time being, we will focus on the content. 

First we need to form our corpus of text: we can pull out the array of article texts from the DataFrame using the column's `.values` attribute. 

In [None]:
corpus = df['text'].values #pull all the article texts and put them in a numpy array 

In [None]:
len(corpus)

## The Principles of Topic Modeling

- Back to [Table of Contents](#Table-of-Contents)

We are going to apply topic modeling, an unsupervised learning method, to our corpus to find the high-level topics in our corpus as a "first go" for exploring our data. Through this process, we'll discuss how to clean and preprocess our data to get the best results.

Topic modeling is a broad subfield of machine learning and natural language processing. We are going to focus on a common modeling approach called Latent Dirichlet Allocation (LDA). 

To use topic modeling, we first have to assume that topics exist in our corpus, and that some small number of these topics can "explain" the corpus. A topic in this context is a list of words from the corpus, ranked by probability. A single document can be explained by multiple topics. For instance, an article on a merger in the healthcare industry could fall under the topic "finance" as well as the topic "healthcare". The set of topics used by a document is known as the document's allocation. This gives Latent Dirichlet Allocation its name: each document has an allocation of latent topics, drawn from a Dirichlet distribution. 
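
The idea of a document as a mixture of topics can be sketched with plain dictionaries. Everything in this example (the two topics, their word probabilities, and the 60/40 allocation) is invented for illustration; a real LDA model estimates these quantities from the corpus.

```python
# Hypothetical topics: each is a probability distribution over words.
topics = {
    "finance":    {"merger": 0.5, "shares": 0.4, "hospital": 0.1},
    "healthcare": {"merger": 0.1, "shares": 0.1, "hospital": 0.8},
}

# Hypothetical allocation for one article: 60% finance, 40% healthcare.
allocation = {"finance": 0.6, "healthcare": 0.4}

# Probability of each word in this article: allocation-weighted mix of the topics.
vocabulary = sorted(topics["finance"])
word_probs = {
    word: sum(allocation[t] * topics[t][word] for t in topics)
    for word in vocabulary
}

# The words of a topic, ranked by probability, are what we read as "the topic".
ranked_finance = sorted(topics["finance"], key=topics["finance"].get, reverse=True)

print(word_probs)
print(ranked_finance)
```

Because both the allocation and each topic sum to one, the mixed word probabilities for the article also sum to one.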

As a start to modeling topics, we define below a function "create_topics" that uses Latent Dirichlet Allocation (LDA) to find topics. We will refer to this function throughout the notebook, so don't worry if you do not understand every part of the function yet.

In [None]:
def create_topics(tfidf, features, N_TOPICS=3, N_TOP_WORDS=5):
    """
    Given a matrix of features of text data generate topics
    
    Parameters
    -----------
    tfidf: scipy sparse matrix
        sparse matrix of text features
    features: ls
        list of feature names (words), in the column order of tfidf
    N_TOPICS: int
        number of topics (default 3)
    N_TOP_WORDS: int
        number of top words to display in each topic (default 5)
        
    Returns
    -------
    ls_keywords: ls
        list of keywords for each topic
    doctopic: array
        numpy array with the topic distribution of each document
    """
    
    # create an LDA model that will look for N_TOPICS topics
    lda = LatentDirichletAllocation(n_components=N_TOPICS,
                                    learning_method='online')
    # fit the model and get the document-topic matrix
    doctopic = lda.fit_transform(tfidf)

    ls_keywords = []
    for topic_idx, topic in enumerate(lda.components_):
        # indices of the N_TOP_WORDS highest-weighted words in this topic
        word_idx = np.argsort(topic)[::-1][:N_TOP_WORDS]
        keywords = ', '.join(features[j] for j in word_idx)
        ls_keywords.append(keywords)
        print(topic_idx, keywords)
            
    return ls_keywords, doctopic

## Topic Modeling in Practice

- Back to [Table of Contents](#Table-of-Contents)

The first important step in working with text data is cleaning and processing the data, which includes (but is not limited to):

- forming a corpus of text
- tokenization
- removing stop-words
- finding words co-located together (N-grams)
- stemming and lemmatization

Each of these steps will be discussed below. 

The ultimate goal is to transform our text data into a form an algorithm can work with, because a document or a corpus of text cannot be fed directly into an algorithm. Algorithms expect numerical feature vectors of a fixed size, and can't handle documents, which are essentially variable-length sequences of symbols. We will be transforming our text corpus into a *bag of n-grams* to be further analyzed. In this form our text data is represented as a matrix where each row refers to a specific article (document) and each column records the occurrence of a word (feature).

### Creating a Matrix of Features from Text – Bag of N-grams

- Back to [Table of Contents](#Table-of-Contents)

Ultimately, we want to take our collection of documents, the corpus, and convert it into a matrix. Fortunately, `sklearn` has a pre-built object, `CountVectorizer`, that can tokenize, eliminate stopwords, and identify n-grams (and, when given a custom tokenizer, stem our corpus), and output a matrix in one step. Before we apply the vectorizer to our corpus of data, let's apply it to a toy example so that we see what the output looks like and how a bag of words is represented. 

In [None]:
def create_bag_of_words(corpus, 
                        NGRAM_RANGE = (1, 1), 
                        stop_words = None, 
                        stem = False, 
                        MIN_DF = 0.05, 
                        MAX_DF = 0.95, 
                        USE_IDF = False):

    """
    Turn a corpus of text into a bag-of-words.
    
    Parameters
    -----------
    corpus: ls
        list of documents in corpus    
    NGRAM_RANGE: tuple
        range of N-gram sizes. Default (1,1), i.e. single words only
    stop_words: ls
        list of commonly occurring words that have little semantic
        value
    stem: bool
        use a stemmer to stem words
    MIN_DF: float
       exclude words that have a document frequency less than the threshold
    MAX_DF: float
        exclude words that have a document frequency greater than the threshold
    USE_IDF: bool
        Re-weigh words according to the Term Frequency-Inverse Document Frequency 
        (emphasize words unique to a document, suppress words common throughout the corpus)
    
    Returns
    -------
    bag_of_words: scipy sparse matrix
        scipy sparse matrix of text
    features:
        list of words
    """
    #parameters for vectorizer 
    ANALYZER = "word" #features are built from words rather than characters
    STRIP_ACCENTS = 'unicode'
    
    if stem:
        stemmer = nltk.SnowballStemmer("english")
        tokenize = lambda x: [stemmer.stem(i) for i in x.split()]
    else:
        tokenize = None
    vectorizer = CountVectorizer(analyzer=ANALYZER,
                                 tokenizer=tokenize, 
                                 ngram_range=NGRAM_RANGE,
                                 stop_words = stop_words,
                                 strip_accents=STRIP_ACCENTS,
                                 min_df = MIN_DF,
                                 max_df = MAX_DF)
    
    bag_of_words = vectorizer.fit_transform( corpus ) #transform our corpus into a bag of words 
    # get_feature_names() was removed in newer scikit-learn releases in favor of
    # get_feature_names_out(); support both.
    if hasattr(vectorizer, 'get_feature_names_out'):
        features = list(vectorizer.get_feature_names_out())
    else:
        features = vectorizer.get_feature_names()

    if USE_IDF:
        NORM = None #do not normalize the rows of the matrix
        SMOOTH_IDF = True #prevents division by zero errors
        SUBLINEAR_TF = True #replace TF with 1 + log(TF)
        transformer = TfidfTransformer(norm = NORM, smooth_idf = SMOOTH_IDF, sublinear_tf = SUBLINEAR_TF)
        #get the bag-of-words from the vectorizer and
        #then use TFIDF to limit the tokens found throughout the text 
        tfidf = transformer.fit_transform(bag_of_words)
        
        return tfidf, features
    else:
        return bag_of_words, features

In [None]:
# create example corpus.
toy_corpus = ['this is document one', 'this is document two', 'text analysis on documents is fun']

In [None]:
# convert to bag of words
toy_bag_of_words, toy_features = create_bag_of_words( toy_corpus )

In [None]:
# review - our corpus:
toy_corpus

In [None]:
# features derived from the corpus
toy_features

In [None]:
# bag of words that results:
np_bag_of_words = toy_bag_of_words.toarray()
np_bag_of_words

Our data has been transformed from a list of documents into a 3 x 9 matrix, where each row in the matrix corresponds to a document, and each column corresponds to a feature (in the order they appear in `toy_features`). Each entry is the number of times the word appears in the document. In this toy example no word repeats within a document, so a 1 indicates that the word is present and a 0 indicates that it is not.
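
To see where the nine columns come from, the same matrix can be rebuilt by hand with the standard library. This is a sketch under two assumptions: that the vectorizer's default tokenization keeps tokens of two or more word characters, and that the `MIN_DF` and `MAX_DF` thresholds are applied as fractions of the number of documents. The names `toy_docs`, `manual_features` and `manual_matrix` are introduced here for illustration only.

```python
import re
from collections import Counter

toy_docs = ['this is document one',
            'this is document two',
            'text analysis on documents is fun']

# Tokens of two or more word characters (assumed to mirror the vectorizer default).
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in toy_docs]

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

n_docs = len(toy_docs)
min_df, max_df = 0.05, 0.95  # same thresholds as the create_bag_of_words defaults

# Keep terms whose document frequency falls inside the thresholds.
manual_features = sorted(term for term, df in doc_freq.items()
                         if min_df * n_docs <= df <= max_df * n_docs)

# One row per document, one column per kept term; entries are counts.
manual_matrix = [[tokens.count(term) for term in manual_features]
                 for tokens in tokenized]

print(manual_features)
for row in manual_matrix:
    print(row)
```

The word "is" occurs in all three documents, which puts it above the 95% document-frequency threshold, so it is dropped and nine of the ten distinct words remain.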

It is very common that this representation will be a "sparse" matrix, or a matrix that has a lot of 0s. With sparse matrices, it is often more efficient to keep track of which values *aren't* 0 and where those non-zero entries are located, rather than to save the entire matrix. To save space, the `scipy` library has special ways of storing sparse matrices in an efficient way. 
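
The idea of storing only the non-zero entries can be illustrated with a plain dictionary. This is a simplified sketch of the principle, not a description of how `scipy` lays out its data internally; the `dense` matrix, the `sparse` dictionary and the `lookup` helper are made up for the example.

```python
# A small dense count matrix with mostly zeros (3 documents x 9 features).
dense = [
    [0, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 1, 1, 0, 1, 0, 0],
]

# Coordinate-style storage: keep only (row, column) -> value for non-zero cells.
sparse = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}

def lookup(r, c):
    """Any cell that was not stored is implicitly zero."""
    return sparse.get((r, c), 0)

total_cells = len(dense) * len(dense[0])
print(len(sparse), "stored values instead of", total_cells)
```

On a real corpus with thousands of features, most documents use only a small fraction of the vocabulary, so the savings are far larger than in this tiny example.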

Our toy corpus is now ready to be analyzed. We used this toy example to illustrate how a document is turned into a matrix to be used in text analysis. When you're applying this to real text data, the matrix will be much larger and harder to interpret, but it's important that you know the process. 

__Exercise 1 - convert corpus to matrix__

- Back to [Table of Contents](#Table-of-Contents)

To check your knowledge, make your own toy corpus and turn it into a matrix.

In [None]:
#solution
exercise_corpus = ['Batman is friends with Superman', 
                   'Superman is enemies with Lex Luthor',
                   'Batman is enemies with Lex Luthor' ] 
exercise_bag_of_words, exercise_features = create_bag_of_words(exercise_corpus)

In [None]:
# convert bag of words to array
np_bag_of_words = exercise_bag_of_words.toarray()

In [None]:
# show features:
exercise_features

In [None]:
# output derived bag of words:
np_bag_of_words

### Calculating Word Counts

- Back to [Table of Contents](#Table-of-Contents)

As an initial look into the data, we can examine the most frequently occurring words in our corpus. We can sum the columns of the bag of words and then convert the result to a numpy array. From here we can zip the features and word counts into a dictionary, and display the results.

In [None]:
def get_word_counts(bag_of_words, feature_names):

    """
    Get the ordered word counts from a bag_of_words
    
    Parameters
    ----------
    bag_of_words: obj
        scipy sparse matrix from CountVectorizer
    feature_names: ls
        list of words
        
    Returns
    -------
    word_counts: dict
        Dictionary of word counts
    """

    # convert bag of words to array
    np_bag_of_words = bag_of_words.toarray()
    
    # calculate word count.
    word_count = np.sum(np_bag_of_words,axis=0)
    
    # convert to flattened array.
    np_word_count = np.asarray(word_count).ravel()
    
    # create dict of words mapped to count of occurrences of each word.
    dict_word_counts = dict( zip(feature_names, np_word_count) )
    
    # Create ordered dictionary
    orddict_word_counts = OrderedDict( sorted(dict_word_counts.items(), key=lambda x: x[1], reverse=True), )
    
    return orddict_word_counts

In [None]:
# get ordered word counts for our example corpus.
get_word_counts(toy_bag_of_words, toy_features)

Note that the words "document" and "documents" both appear separately in the list. Should they be treated as the same words, since one is just the plural of the other, or should they be considered distinct words? These are the types of decisions you will have to make in your preprocessing steps.

__Exercise 2 - getting word counts__

- Back to [Table of Contents](#Table-of-Contents)

Get the word counts of your exercise corpus.


In [None]:
get_word_counts(exercise_bag_of_words, exercise_features)

Create a bag of words and set of features from our corpus of business articles:

In [None]:
corpus_bag_of_words, corpus_features = create_bag_of_words(corpus)

Let's examine our features. 

In [None]:
corpus_features

The first aspect to notice about the feature list is that the first few entries are numbers that have no real semantic meaning. The feature list also includes numerous other useless words, such as prepositions and articles, that will just add noise to our analysis. 

We can also notice the words *comprises* and *comprising*, or the words *form* and *formed*, are close enough to each other that it might not make sense to treat them as entirely separate words. Part of your cleaning and preprocessing duties will be manually inspecting your lists of features, seeing where these issues arise, and making decisions to either remove them from your analysis or address them separately. 

Let's get the count of the number of times that each of the words appears in our corpus.

In [None]:
get_word_counts(corpus_bag_of_words, corpus_features)

Our top words are articles, prepositions and conjunctions that are not informative whatsoever, so we're probably not going to come up with anything interesting ("garbage in, garbage out"). 

Nevertheless, let's forge blindly ahead and try to create topics, and see the quality of the results that we get.

In [None]:
ls_corpus_keywords, corpus_doctopic = create_topics(corpus_bag_of_words, corpus_features)

> These topics don't give us any real insight into what the data contains - one of the topics is "to, of, in, that, for"! Depending on the articles in your dataset, some topics might hint at subjects ("sales", "consumer", "index", etc.) but the signal is being swamped by the noise. 

We'll have to clean and process our data to get any meaningful information out of this text. 

### Text Cleaning and Normalization

- Back to [Table of Contents](#Table-of-Contents)

To clean and normalize text, we'll remove all special characters, numbers, and punctuation, so we're left with only the words themselves. Then we will make all the text lowercase; this uniformity will ensure that the algorithm doesn't treat "the" and "The" as different words, for example. 

To remove the special characters, numbers and punctuation we will use regular expressions. 


**Regular Expressions**, or "regexes" for short, let you find all the words or phrases in a document or text file that match a certain pattern. These rules are useful for pulling out useful information from a large amount of text. For example, if you want to find all email addresses in a document, you might look for everything that looks like *some combination of letters, _, .* followed by *@*, followed by more letters, and ending in *.com* or *.edu*. If you want to find all the credit card numbers in a document, you might look for everywhere you see the pattern "four numbers, space, four numbers, space, four numbers, space, four numbers." Regexes are also helpful if you are scraping information from websites, because you can use them to separate the content from the HTML code used for formatting the website.
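
The two patterns described above can be sketched with Python's `re` module. The sample text is made up, and both patterns are deliberately simple: real email addresses and card numbers come in many more formats than these expressions cover.

```python
import re

# Made-up sample text for illustration.
text = ("Contact jane.doe@example.com or bob_smith@university.edu. "
        "Card on file: 1234 5678 9012 3456.")

# Letters, underscores and dots, then @, then letters, ending in .com or .edu.
EMAIL_RE = re.compile(r'[A-Za-z_.]+@[A-Za-z]+\.(?:com|edu)\b')

# Four digits, space, four digits, space, four digits, space, four digits.
CARD_RE = re.compile(r'\b\d{4} \d{4} \d{4} \d{4}\b')

emails = EMAIL_RE.findall(text)
cards = CARD_RE.findall(text)
print(emails)
print(cards)
```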

A full tutorial on regular expressions would be outside the scope of this tutorial, but many good tutorials can be found online. [regex101.com](https://regex101.com) is also a great interactive tool for developing and checking regular expressions.

>"Some people, when confronted with a problem, think 
>'I know, I'll use regular expressions.'   Now they have two problems."
> -- Jamie Zawinski

*A word of warning:* Regexes can work much more quickly than plain text sorting; however, if your regular expressions are becoming overly complicated, it's a good idea to find a simpler way to do what you want to do. Any developer should keep in mind there is a trade-off between optimization and understandability. The general philosophy of programming in Python is that your code should be as understandable to *people* as possible, because human time is more valuable than computer time. You should therefore lean toward understandability rather than overly optimizing your code to make it run as quickly as possible. Your future self, code reviewers, people who inherit your code, and anyone else who has to make sense of your code in the future will appreciate it. 

For our purposes, we are going to use a regular expression to match all characters that are not letters -- punctuation, quotes, special characters and numbers -- and replace them with spaces. Then we'll make all of the remaining characters lowercase.  

We will be using the `re` library in python for regular expression matching.

In [None]:
# compile a regular expression that matches runs of non-word characters (\W+) or digits (\d+)
RE_PREPROCESS = re.compile( r'\W+|\d+' )

# get rid of punctuation and make everything lowercase
# the code below works by looping through the array of text ("corpus")
# for a given piece of text ( "description" ) we invoke the `re.sub` command 
# the `re.sub` command takes 3 arguments: (1) the regular expression to match, 
# (2) what we want to substitute in place of that matching string (' ', a space)
# and (3) the text we want to apply this to. 
# we then invoke the `lower()` method on the output of the `re.sub` command
# to make all the remaining characters lowercase.
# the result is a list, where each entry in the list is a cleaned version of the
# corresponding entry in the original corpus.
# we then make the list into a numpy array to use it in analysis

processed_corpus = np.array( [ re.sub( RE_PREPROCESS, ' ', description ).lower() for description in corpus ] )
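
To see what this pattern does on a small scale, here is the same substitution applied to a single made-up sentence. The `clean` helper is a name introduced for this sketch and is not used elsewhere in the notebook.

```python
import re

RE_PREPROCESS = re.compile(r'\W+|\d+')  # same pattern as in the cell above

def clean(text):
    """Replace non-word characters and digits with spaces, then lowercase."""
    return RE_PREPROCESS.sub(' ', text).lower()

sample = "Sales rose 3.5% in Q2, the CEO said!"  # made-up sentence
print(clean(sample).split())

# Note: \W does not match the underscore, so underscores survive cleaning.
print(clean("snake_case"))
```

Notice that "Q2" becomes the single letter "q": removing digits can leave behind fragments, which is one reason to inspect the cleaned text before moving on.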

Next, let's look at an example of the results of this cleanup.

__First Article, Before Cleaning__

- Back to [Table of Contents](#Table-of-Contents)

First, we'll look at the first article in our corpus, before it was cleaned:

In [None]:
corpus[0]

This text includes a lot of useful information, but also includes some things we don't want or need. There are some weird special characters (...). There are also some numbers, which are informative and interesting to a human reading the text (phone numbers, addresses, ...), but when we break down the documents into individual words, the numbers will become meaningless. We'll also want to remove all punctuation, so that we can say any two things separated by a space are individual words.

__First Article, After Cleaning__

- Back to [Table of Contents](#Table-of-Contents)

Now, let's look at this text after cleaning:

In [None]:
processed_corpus[0]

The text is now all lowercase, and all numbers and special characters have been removed. Our text is now normalized.

### Breaking Text into Pieces – Tokenizing

- Back to [Table of Contents](#Table-of-Contents)

Now that we've cleaned our text, we can *tokenize* it by deciding which words or phrases are the most meaningful. Normally the `CountVectorizer` handles this for us, but in this case, we'll split our text into individual words manually to show how it is done.

To go from a whole document to a list of individual words, we can use the string's `.split()` method. By default, this method splits on whitespace between words, so we don't need to specify that explicitly.  

In [None]:
tokens = processed_corpus[0].split()

In [None]:
tokens

### Removing Meaningless Text – Stopwords

- Back to [Table of Contents](#Table-of-Contents)

Stopwords are words that are found commonly throughout a text and carry little semantic meaning. Examples of common stopwords are prepositions ("to", "on", "in"), articles ("the", "an", "a"), conjunctions ("and", "or", "but") and common nouns. For example, the words *the* and *of* are totally ubiquitous, so they won't serve as meaningful features, whether to distinguish documents from each other or to tell what a given document is about. You may also run into words that you want to remove based on where you obtained your corpus of text or what it's about. There are many lists of common stopwords available for you to use, both for general documents and for specific contexts, so you don't have to start from scratch.   

We can eliminate stopwords by checking all the words in our corpus against a list of commonly occurring stopwords that comes with NLTK.
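
The check itself is a simple membership test. The sketch below uses a tiny hand-written stopword set (`toy_stopwords`, invented for this example) in place of the NLTK list, so it runs without downloading anything.

```python
# A tiny hand-written stopword set, standing in for the NLTK list.
toy_stopwords = {"the", "of", "in", "a", "and", "to"}

sample_tokens = "the price of oil fell in the first quarter".split()

# Keep only the tokens that are not in the stopword set.
kept = [token for token in sample_tokens if token not in toy_stopwords]
print(kept)
```

Using a `set` rather than a list makes each membership test fast, which matters once the corpus contains millions of tokens.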

In [None]:
eng_stopwords = stopwords.words('english')

In [None]:
eng_stopwords

In [None]:
#sample of stopwords
#this is an example of slicing where we implicitly start at the beginning and move to the end
#we select every 10th entry in the array
eng_stopwords[::10]

Notice that this list includes "weren" and "hasn" as well as single letters ("t"). Why do you think these are contained in the list of stopwords?

__Exercise 3 - practicing slicing__

- Back to [Table of Contents](#Table-of-Contents)

Try slicing to retrieve every 5th word.

In [None]:
eng_stopwords[::5]

Now that we've cleaned up our data a little bit, let's see what our bag of words looks like.

In [None]:
# create bag of words from processed_corpus
processed_bag_of_words, processed_features = create_bag_of_words( processed_corpus, stop_words = eng_stopwords )
dict_processed_word_counts = get_word_counts( processed_bag_of_words, processed_features )
dict_processed_word_counts

Much better! Now this is starting to look like a reasonable representation of our corpus of text. 

We mentioned that, in addition to stopwords that are common across all types of text analysis problems, there will also be specific stopwords based on the context of your domain. One quick way to remove some of these domain-specific stopwords is by dropping some of your most frequent words. 
> Notice how the top words include words like "said" and "year"? It makes sense that these words are so common in news articles, which is exactly why they won't be very helpful in analysis. We'll start out by dropping the top 20 words. You'll want to change this number, making it bigger and smaller, to see how it affects your resulting topics.

In [None]:
# get top 20 stopwords (slice from start through 20 items in list)
top_20_words = list(dict_processed_word_counts.keys())[:20]

# create new stopword list by combining default stopwords with our top 20.
domain_specific_stopwords = eng_stopwords + top_20_words

# make a new bag of words excluding custom stopwords.
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords)


In [None]:
# what do we have now?
dict_processed_word_counts = get_word_counts(processed_bag_of_words, processed_features)
dict_processed_word_counts

> This is a bit better - although we still see some words that are probably very common ("may"), words like "health" will probably help us come up with more specific categories within the broader realm of business articles. Let's see what topics we produce.

In [None]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features)

Now we are starting to get somewhere! We can manipulate the number of topics we want to find and the number of words to use for each topic to see if we can understand more from our corpus. 

In [None]:
# look for 5 topics, include 10 words in each.
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                       N_TOPICS = 5,
                                                       N_TOP_WORDS= 10)

Some structure is starting to reveal itself. Adding more topics has revealed larger subtopics. Let's see if increasing the number of topics gives us more information.

However, we can see that some useless words are still present. This is an iterative process - after seeing the results of some analysis, you will need to go back to the preprocessing step and add more words to your list of stopwords or change how you cleaned the data.

Now let's try 10 topics with 15 words each:

In [None]:
# 10 topics, 15 words each
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                       N_TOPICS = 10,
                                                       N_TOP_WORDS= 15)

> This looks like a good number of topics for now. Some of the top words are quite similar, like "jobs" and "job". Let's move to stemming and lemmatization.

### Stemming and Lemmatization - Distilling text data

- Back to [Table of Contents](#Table-of-Contents)

We can further process our text through *stemming and lemmatization*, or replacing words with their root or simplest form. For example "systems," "systematic," and "system" are all different words, but we can replace all these words with "system" without sacrificing much meaning. 

- A **lemma** is the original dictionary form of a word (e.g. the lemma for "lies," "lied," and "lying" is "lie").
- The process of turning a word into its simplest form is **stemming**. There are several well known stemming algorithms -- Porter, Snowball, Lancaster -- that all have their respective strengths and weaknesses.
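
To get a feel for what a stemmer does, and why real stemmers need more than one rule, here is a toy suffix-stripping function. `naive_stem` is a deliberately crude illustration written for this example; it is not the Porter, Snowball or Lancaster algorithm.

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["jumped", "jumping", "jumps", "running", "went"]:
    print(word, "->", naive_stem(word))
```

The first three words all collapse to "jump", but "running" turns into "runn" and the irregular form "went" is left untouched. Established stemmers add many more rules to handle cases like doubled consonants, and lemmatizers rely on a dictionary to deal with irregular forms.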

For this tutorial, we'll use the Snowball Stemmer:

In [None]:
# Examples of how a Stemmer works:
stemmer = SnowballStemmer("english")
print(stemmer.stem('lies'))
print(stemmer.stem("lying"))
print(stemmer.stem('systematic'))
print(stemmer.stem("running"))

Let's try creating a bag of stemmed words.

In [None]:
# include stemming when creating our bag of words.
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True)

In [None]:
# create topics with stemmed words.
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                       N_TOPICS = 10,
                                                       N_TOP_WORDS= 15)

> What do we think of these topics?

### N-grams - Adding context by creating N-grams

- Back to [Table of Contents](#Table-of-Contents)

Obviously, reducing a document to a bag of words means losing much of its meaning - we put words in certain orders, and group words together in phrases and sentences, precisely to give them more meaning. If you follow the processing steps we've gone through so far, splitting your document into individual words and then removing stopwords, you'll completely lose all phrases like "kick the bucket," "commander in chief," or "sleeps with the fishes." 

One way to address this is to break down each document similarly, but rather than treating each word as an individual unit, treat each group of 2 words, or 3 words, or *n* words, as a unit. We call this a "bag of *n*-grams," where *n* is the number of words in each chunk. Then you can analyze which groups of words commonly occur together (in a fixed order). 
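
As a minimal sketch of the idea, the snippet below builds *n*-grams from a list of tokens in plain Python. `make_ngrams` is a hypothetical helper written for illustration, not a function used elsewhere in this notebook.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the commander in chief spoke".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

# A bag of 1- and 2-grams keeps both kinds of unit side by side.
features = unigrams + bigrams

print(bigrams)  # ['the commander', 'commander in', 'in chief', 'chief spoke']
```

Because each *n*-gram is a run of consecutive tokens, "in chief" is a feature here but "commander chief" is not.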

Let's transform our corpus into a bag of n-grams with *n*=2: a bag of 2-grams, AKA a bag of bi-grams.

In [None]:
# create bag of words with stemmed words, keeping single words and 2-grams (NGRAM_RANGE = (1, 2)).
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True,
                                                                 NGRAM_RANGE=(1,2))

# Create topics.
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                       N_TOPICS = 10,
                                                       N_TOP_WORDS= 15)

> We can see that this lets us uncover patterns that we couldn't when we just used a bag of words: "kansas citi" now comes up as a single term. Note that the features still include the individual words as well as the bi-grams.

### TF-IDF - Weighting terms based on frequency

- Back to [Table of Contents](#Table-of-Contents)

A final step in cleaning and processing our text data is **Term Frequency-Inverse Document Frequency (TF-IDF)** weighting. TF-IDF is based on the idea that the words (or terms) that are most related to a certain topic will occur frequently in documents on that topic, and infrequently in unrelated documents. It re-weights each term so that its weight increases with how often the term appears within a document and decreases with how many documents in the corpus contain it. This emphasizes words that are distinctive to a document and suppresses words that are common throughout the corpus.
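
The weighting can be sketched by hand on a tiny example. The snippet below assumes the textbook form, weight = (count of the term in the document) × log(number of documents / number of documents containing the term), with a natural logarithm. Libraries typically apply smoothed and normalized variants, so the exact numbers produced in this notebook will differ. `toy_docs` and `tfidf` are hypothetical names used only for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents (not the notebook's corpus).
toy_docs = [
    ["job", "train", "job"],
    ["job", "youth", "program"],
    ["job", "health", "clinic"],
]
n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))


def tfidf(doc):
    """Textbook TF-IDF: raw count times log(N / document frequency)."""
    counts = Counter(doc)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}


weights = [tfidf(doc) for doc in toy_docs]
print(weights[0])
```

In the first document, "job" occurs twice but appears in every document, so its weight is 0, while "train" occurs once in a single document and receives a weight of log(3).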

Let's look at how using TF-IDF affects our bag of words:

In [None]:
# create bag of words including TF-IDF weighting.
processed_bag_of_words, processed_features = create_bag_of_words( processed_corpus,
                                                                  stop_words = domain_specific_stopwords,
                                                                  stem = True,
                                                                  NGRAM_RANGE = ( 1, 2 ),
                                                                  USE_IDF = True )

In [None]:
# let's see what we have:
dict_word_counts = get_word_counts( processed_bag_of_words, processed_features )
dict_word_counts

The word counts have been re-weighted to emphasize the more meaningful words of the corpus, while de-emphasizing the words that are found commonly throughout the corpus.

How does this affect our topics?

In [None]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                       N_TOPICS = 5,
                                                       N_TOP_WORDS= 10)

**Exercise 4 - Refining a topic model**

- Back to [Table of Contents](#Table-of-Contents)

You can only develop an intuition for the right number of topics and topic words suitable for a given problem by iterating until you find a good match. 

Change the number of topics and topic words until you get an intuition for how many words and topics are enough.

In [None]:
exercise_keywords, exercise_doctopic = create_topics( processed_bag_of_words, 
                                                      processed_features, 
                                                      N_TOPICS = 5,
                                                      N_TOP_WORDS= 25 )

In [None]:
exercise_keywords, exercise_doctopic = create_topics( processed_bag_of_words, 
                                                      processed_features, 
                                                      N_TOPICS = 10,
                                                      N_TOP_WORDS= 25 )

In [None]:
# Grab the topic_id of the highest-weighted topic for each document and store it in a list.
ls_topic_id = [np.argsort(processed_doctopic[comment_id])[::-1][0] for comment_id in range(len(corpus))]
df['topic_id'] = ls_topic_id  # add to the dataframe so we can compare with the job titles
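
The cell above picks, for each document, the topic with the largest weight in that document's row of the document-topic matrix. Here is the same logic in plain Python on made-up numbers. `toy_doctopic` and `dominant_topic` are hypothetical names, and the weights are invented for illustration.

```python
# Hypothetical document-topic weights: one row per document, one column per topic.
toy_doctopic = [
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.05, 0.15, 0.80],
]


def dominant_topic(weights):
    """Index of the largest weight in one document's row."""
    return max(range(len(weights)), key=lambda i: weights[i])


toy_topic_ids = [dominant_topic(row) for row in toy_doctopic]
print(toy_topic_ids)  # [1, 0, 2]
```

If the matrix is a two-dimensional NumPy array, `np.argmax(matrix, axis=1)` returns the same indices whenever no row has tied maximum weights.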

Now each row is tagged with a topic ID. Let's see how well the topics explain the text by looking at the first topic, and seeing how similar the texts within that topic are to each other.

In [None]:
topic_num = 0
print(processed_keywords[topic_num])
df[df.topic_id == topic_num].head(10)

**Exercise 5 - Interpreting a model's "topics"**

- Back to [Table of Contents](#Table-of-Contents)

Examine the other topic IDs, and see if the "topics" we identified make sense as groupings of text.

In [None]:
topic_num = 3
print(processed_keywords[topic_num])
df[df.topic_id == topic_num].head(10)
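
Before reading individual rows, it can help to check how many documents landed in each topic, since a topic with very few documents may be hard to interpret. The sketch below counts assignments in a made-up list. `toy_topic_ids` is hypothetical and stands in for a column such as `df['topic_id']`, where pandas' `value_counts()` would give a similar summary.

```python
from collections import Counter

# Hypothetical topic assignments, one per document.
toy_topic_ids = [0, 3, 3, 1, 0, 3, 2]

# Count how many documents were assigned to each topic.
topic_sizes = Counter(toy_topic_ids)

for topic_id, n_docs_in_topic in sorted(topic_sizes.items()):
    print(topic_id, n_docs_in_topic)
```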

---

## Additional Resources

- A great resource for NLP in Python is 
[Natural Language Processing with Python](https://www.amazon.com/Natural-Language-Processing-Python-Analyzing/dp/0596516495).