# Machine Learning w/Python 3.7

## Importing Python Packages / Modules

### Packages Overview
* The Natural Language Toolkit (nltk) - we'll be accessing corpora as well as functional tools from this package.
    * The Brown Corpus: A text corpus of American English, split into fifteen different categories
    * Part of Speech Taggers (pos): prebuilt functions that are designed to determine the part of speech of every word in a given sentence.
* pandas - for data processing    
* matplotlib - for visualizing data (the %matplotlib inline magic command displays plots directly in the Jupyter notebook)
* scikit-learn - for machine learning (ft. various classification, regression and clustering algorithms)

In [None]:
import nltk
from nltk.corpus import brown
from nltk import pos_tag_sents
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn

## Understanding Classification

## Accessing our Data

In [None]:
for cat in brown.categories():
    print(cat)

In [None]:
news_sent = brown.sents(categories=["news"])
romance_sent = brown.sents(categories=["romance"])

In [None]:
print(news_sent[:5])
print()
print(romance_sent[:5])

In [None]:
ndf = pd.DataFrame({'sentence': news_sent,
                    'label':'news'})
rdf = pd.DataFrame({'sentence':romance_sent, 
                    'label':'romance'})

In [None]:
# combining two spreadsheets into 1
df = pd.concat([ndf, rdf])

## Extracting Features from our Data

## Supervised Machine Learning

Supervised machine learning takes place in two steps: the training phase and the testing phase. In the training phase, you use a portion of your data to train your algorithm (which, in our case, is a classification algorithm). You provide both your feature vectors and your labels to the algorithm, and the algorithm searches for patterns in your data that can help associate it with a particular label.

In the testing phase, we take the classifier we trained in the previous step, give it feature vectors representing previously unseen data, and have it predict their labels. We can then compare the "true" labels to the predicted labels to see whether our classifier provides a good and generalizable way of accomplishing the task (in our case, automatically distinguishing news sentences from romance sentences).
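The two phases can be sketched as follows. This is a minimal illustration, not the workshop's own classifier: the toy sentences are hypothetical stand-ins for the Brown data, and the choice of `CountVectorizer` features with a `MultinomialNB` classifier is an assumption made only to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the news and romance sentences.
sentences = [
    "the mayor announced a new budget on monday",
    "the senate committee voted on the tax bill",
    "officials said the election results were final",
    "the governor signed the highway bill today",
    "she smiled and took his hand in the garden",
    "he kissed her softly and whispered her name",
    "her heart raced when he walked into the room",
    "they danced together under the pale moon",
]
labels = ["news"] * 4 + ["romance"] * 4

# Hold out a quarter of the data so the test phase sees unseen sentences.
train_sents, test_sents, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=42, stratify=labels)

# Training phase: learn the vocabulary and the classifier from training data only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_sents)
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Testing phase: predict labels for unseen feature vectors, then compare
# the predictions to the true labels.
X_test = vectorizer.transform(test_sents)
predicted = clf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(predicted, accuracy)
```

With so few sentences the accuracy figure is not meaningful; the point is only the order of operations: split, fit on the training portion, predict on the held-out portion, compare.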

## Unsupervised Machine Learning

In supervised machine learning tasks, the data is assigned to some set of classes. For example, suppose we are given a dataset in which each observation is a set of physical attributes of an object. In a supervised task, the object column acts as the labels, and the algorithm uses these existing separations in the data to develop criteria for classifying unknown observations. In an unsupervised task, by contrast, no labels are provided, and the algorithm must discover groupings in the data on its own.

## Topic Modeling with Latent Dirichlet Allocation (LDA)

One subset of unsupervised learning tasks is topic extraction, where the aim is to find common groupings of items across collections of items. One method of doing so is Latent Dirichlet Allocation (LDA), which models how topics are distributed over a corpus and how words are distributed over a set of topics.

In broad strokes, LDA extracts hidden (latent) topics via the following steps:

1. Arbitrarily decide that there are 10 topics.
2. Select one document and randomly assign each word in the document to one of the 10 topics.
3. Repeat step 2 for all the other documents. This results in the same word being assigned to multiple topics.
4. Compute
    * how much of each topic is in each document?
    * how many topic assignments are due to a given word?
5. Take one word in one document and reassign it to a new topic and then repeat step 4.
6. Repeat step 5 until the model stabilizes such that reassigned topics do not change distributions.

LDA yields a set of words associated with each topic (see step 4, part 2) and the mixture of topics associated with each document (see step 4, part 1).
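The steps above resemble a collapsed Gibbs sampler, which can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not how sklearn fits the model (its `LatentDirichletAllocation` uses variational Bayes). The function name `toy_lda_gibbs`, the smoothing values `alpha` and `beta`, and the fixed iteration count in place of a convergence check are all choices made for this sketch.

```python
import numpy as np

def toy_lda_gibbs(docs, vocab_size, num_topics=2, num_iters=50,
                  alpha=0.1, beta=0.01, seed=42):
    """docs is a list of documents, each a list of integer word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))    # step 4, part 1
    topic_word = np.zeros((num_topics, vocab_size))  # step 4, part 2
    topic_total = np.zeros(num_topics)

    # Steps 2-3: randomly assign every word in every document to a topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = rng.integers(num_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1

    # Steps 5-6: reassign one word at a time, updating the counts each time.
    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z = assignments[d][n]
                # Remove this word's current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # Weight each topic by how common it is in this document
                # and how often this word is assigned to it overall.
                weights = ((doc_topic[d] + alpha)
                           * (topic_word[:, w] + beta)
                           / (topic_total + vocab_size * beta))
                z = rng.choice(num_topics, p=weights / weights.sum())
                assignments[d][n] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Four tiny documents over a six-word vocabulary (word ids 0-5).
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
doc_topic, topic_word = toy_lda_gibbs(docs, vocab_size=6)
print(doc_topic)
print(topic_word)
```

Each row of `doc_topic` counts how many words in a document are assigned to each topic, and each row of `topic_word` counts how often each word is assigned to a topic, matching the two quantities computed in step 4.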

# Let's do topic modeling with sklearn!

One of the best things about sklearn is the simplicity of its syntax.

To do machine learning with sklearn, follow these five steps (the function names remain the same, regardless of the algorithm you use!):

### Step 1: Import your desired algorithm

In this example, we will be using the Latent Dirichlet Allocation algorithm.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

### Step 2: Create an instance of your algorithm

When creating an instance of sklearn's LatentDirichletAllocation algorithm to run on our data, we need to set parameters. n_components is the number of topics in the dataset, and we set random_state to 42 so that this notebook is reproducible. Since the sentences happen to already have labels (either news or romance), let's see if LDA can also find those separations by setting the number of topics to 2.

In [None]:
num_topics = 2
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)

### Step 3: Fit your data

Using the lda object we set up above, we now apply (fit) the LDA algorithm to the bag of words we extracted from our sentences and stored in the tf sparse matrix.
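The feature extraction cell that creates `tf` and `tf_vectorizer` does not appear in this section. A minimal sketch of how they might be built is shown below, assuming sklearn's `CountVectorizer` with English stop words removed; the `stop_words` setting and the toy `token_lists` (a stand-in for `df['sentence']`) are assumptions, not the workshop's confirmed settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df['sentence'], where each row is a list of tokens.
token_lists = [
    ["The", "jury", "said", "the", "election", "was", "fair", "."],
    ["She", "said", "she", "loved", "him", "."],
    ["The", "mayor", "praised", "the", "election", "officials", "."],
]

# CountVectorizer expects one string per document, so join the tokens.
texts = [" ".join(tokens) for tokens in token_lists]

# Assumed settings: lowercase (the default) and drop common English stop words.
tf_vectorizer = CountVectorizer(stop_words="english")

# tf is a sparse matrix with one row per sentence and one column per word.
tf = tf_vectorizer.fit_transform(texts)
print(tf.shape)
```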

In [None]:
lda.fit(tf)

Out [None]:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=2, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

### Step 4: Transform your data

We now want to model the documents in our corpus in terms of the topics discovered by the model. This is done using the .transform method of LDA, which yields the distribution of topics across the documents. The document_topic array has one row per document and one column per topic; each entry is the proportion of that document attributed to that topic.

In [None]:
document_topic = lda.transform(tf)
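To see what the transformed array looks like, here is a small self-contained sketch on a hypothetical toy corpus (the names `toy_texts`, `toy_tf`, `toy_lda`, and `toy_doc_topic` are illustrative only). It checks that each row sums to 1 and picks the topic with the largest share for each document.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the real notebook uses the Brown sentences.
toy_texts = [
    "the council passed the budget and the tax bill",
    "the election committee counted the final votes",
    "she held his hand and kissed him in the garden",
    "her heart ached when he whispered her name",
]
toy_tf = CountVectorizer().fit_transform(toy_texts)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_doc_topic = toy_lda.fit_transform(toy_tf)  # fit, then transform

# Each row is a distribution over topics, so it should sum to 1.
row_sums = toy_doc_topic.sum(axis=1)

# The column with the largest share is the document's dominant topic.
dominant_topic = toy_doc_topic.argmax(axis=1)
print(toy_doc_topic.round(2))
print(dominant_topic)
```

Comparing `dominant_topic` against known labels is one informal way to check whether the discovered topics line up with existing categories, though nothing guarantees that they will.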

Then we visualize how much of each document is each topic, for example, that document 1 is 75% topic A and 25% topic B. We choose an area chart because each band of the chart maps to a different category (in this case a unique topic). The width of each band in relation to the others illustrates how much of the document is thought to be about that topic relative to the others.

In [None]:
%matplotlib inline
import os
import matplotlib.pyplot as plt
import numpy as np

colors = ['tab:green', 'tab:pink']  # one color per topic
topics = np.arange(num_topics)      # one legend label per topic
num_docs = document_topic.shape[0]

fig, ax = plt.subplots(figsize=(15,5))
# stackplot draws one band per row, so transpose to (topics, documents)
_ = ax.stackplot(range(num_docs), document_topic.T, labels=topics, colors=colors)
_ = ax.set_xlim(0, num_docs)
_ = ax.set_ylim(0,1)
_ = ax.set_yticks([])
_ = ax.set_xlabel("document")
_ = ax.legend(title="topic", bbox_to_anchor=(1.06, 1), borderaxespad=0)
os.makedirs("images", exist_ok=True)  # savefig fails if the folder is missing
fig.savefig("images/doc_topic.png", bbox_inches = 'tight', pad_inches = 0)

### Step 5: Print topics

lda.components_ is an array where each row is a topic and each column roughly contains the number of times that word was assigned to that topic; normalizing a row so that it sums to 1 gives the probability of each word in that topic. To figure out which word is in which column, we use the get_feature_names() function from CountVectorizer (renamed get_feature_names_out() in scikit-learn 1.0). The argsort function returns the column indexes sorted by value, which we reverse to put the highest-weighted words first and then map into our collection of words. Here we print the top 10 words in each topic.

In [None]:
num_words = 10
topic_word = lda.components_
# on scikit-learn 1.0 and later, use tf_vectorizer.get_feature_names_out()
words = np.array(tf_vectorizer.get_feature_names())
for i, topic in enumerate(topic_word):
    # argsort sorts in ascending order, so [::-1] reverses to descending
    sorted_idx = topic.argsort()[::-1]
    print(i, words[sorted_idx][:num_words])

Out [None]:
0 ['said' 'like' 'time' 'just' 'll' 'way' 'didn' 'new' 'president' 'thought']
1 ['mrs' 'said' 'home' 'little' 'year' 'day' 'good' 'new' 'got' 'right']
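Because `components_` holds unnormalized weights rather than probabilities, it can help to normalize each row before comparing words across topics. The sketch below does this on a hypothetical toy corpus; the variable names are illustrative, and the `getattr` fallback is only there so the example runs on both older and newer scikit-learn releases.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the real notebook uses the Brown sentences.
toy_texts = [
    "the council passed the budget and the tax bill",
    "the election committee counted the final votes",
    "she held his hand and kissed him in the garden",
    "her heart ached when he whispered her name",
]
vec = CountVectorizer(stop_words="english")
toy_tf = vec.fit_transform(toy_texts)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(toy_tf)

# Divide each row by its sum so that it becomes a distribution over words.
weights = toy_lda.components_
topic_word_prob = weights / weights.sum(axis=1, keepdims=True)

# Use whichever feature-name accessor this scikit-learn version provides.
get_names = getattr(vec, "get_feature_names_out", None) or vec.get_feature_names
vocab = np.array(get_names())

# Indexes of the three most probable words per topic, highest first.
top_idx = topic_word_prob.argsort(axis=1)[:, ::-1][:, :3]
top_words = vocab[top_idx]
for i, row in enumerate(top_words):
    print(i, row)
```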

We can also visualize these topics as lists of words sized by each word's weight in the topic and colored by topic, as proposed by Allen Riddell in Text Analysis with Topic Models for the Humanities and Social Sciences:

In [None]:
# font size for word with largest share in corpus
fontsize_base = 40/ np.max(topic_word)

fig, ax = plt.subplots(figsize=(15, 2), constrained_layout=True)

for i, topic in enumerate(topic_word):
    top_idx = topic.argsort()[::-1][:num_words]
    top_words = words[top_idx]
    top_share = topic[top_idx]
    for j, (word, share) in enumerate(zip(top_words, top_share)):
        ax.text(j, i/4,  word, fontsize=fontsize_base*share, color=colors[i])
        
# stretch the axes to accommodate the words
ax.set_xlim(0, num_words)
ax.set_ylim(-.2, i/4+.2)
ax.axis('off')
#fig.subplots_adjust(hspace=-0)
fig.savefig("images/word_topic.png", bbox_inches = 'tight', pad_inches = 0)

## Review

At the end of this workshop, we have covered the following skills:
* How to use skills from the NLTK workshop to build features for a classification task
* How to build a text classification system that can predict whether sentences belong to one category ("news") or another ("romance")
* How to group data and perform calculations on the aggregations
* How to prepare data for machine learning using pandas, a package for Python that helps to organize your data
* How to use the scikit-learn package for Python to perform different types of machine learning on the data
* How to evaluate the results of machine learning algorithms
* How to visualize observations, aggregations, and algorithmic results

## Resources
"Introduction to Machine Learning with Python", Andreas C. Müller and Sarah Guido. O'Reilly, 2017.

"LING 83800: Methods in Computational Linguistics II", Andrew Rosenberg. http://eniac.cs.qc.cuny.edu/andrew/methods2/, 2014.

"Introduction to Latent Dirichlet Allocation", Edward Chen, http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/, 08/22/2011

"Topic Modeling for Humanists: A Guided Tour", Scott Weingart, http://www.scottbot.net/HIAL/index.html@p=19113.html, 07/25/2012

"The LDA Buffet is Now Open", Matthew Jockers, http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-now-open-or-latent-dirichlet-allocation-for-english-majors/, 09/29/2011

"Introduction to Topic Modeling", Christine Doig, http://chdoig.github.io/pytexas2015-topic-modeling/#/, PyTexas, 2015

##### Acknowledgments
Shout out to the CUNY DHRI for providing much of this workshop's structure.
