# Introduction to Unsupervised Approaches using scikit-learn

This two-hour lesson covers unsupervised computational text analysis approaches in Python using scikit-learn. The central method covered is topic modeling using LDA, a common approach used frequently in the social sciences and humanities.

## 0. Overview

There are two libraries that dominate text analysis in Python. The first is NLTK, which implements a range of natural language processing techniques. You learned about this in part 2 of this series yesterday.

The other dominant library is scikit-learn, which, at its most basic, provides a function to create a memory-efficient document-term matrix. It also implements a variety of quite sophisticated machine learning techniques that you can use on your text. It's a powerful library, and one you will continually return to as you advance in text analysis (and looks great on your CV!).

Because scikit-learn is such a large and powerful library, the goal today is not to become experts but to learn the basic functions in the library and gain an intuition about how you might use it to do text analysis. We'll give you the keys to the kingdom: you go explore! To give an overview, here are some of the things you can do using scikit-learn:
* word weighting
* feature extraction
* text classification / supervised machine learning
    * L2 regression
    * classification algorithms such as nearest neighbors, SVM, and random forest
* clustering / unsupervised machine learning
    * k-means
    * PCA
    * cosine similarity
    * LDA

Today, we'll start with the Document Term Matrix (DTM). The DTM is the bread and butter of most computational text analysis techniques, both simple and more sophisticated methods. In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?

Finally, we will use the DTM to get an introduction to one method for uncovering patterns or themes within text: LDA, a topic modeling algorithm. Again, this will just be an introduction. Look for additional workshops in the future that will get into topic modeling in more detail.
  

### Learning Goals

* Understand the DTM and why it's important to text analysis
* Learn how to create a DTM from a .csv file
* Learn basic functionality of Python's package scikit-learn
* Understand tf-idf scores, and word scores in general
* Learn a simple way to identify distinctive words
* Implement a basic topic modeling algorithm and learn how to tweak it
* In the process, gain more familiarity and comfort with the Pandas package and manipulating data

#### Outline

1. [The Pandas Dataframe: Music Reviews](#1.-The-Pandas-Dataframe:-Music-Reviews)
2. [Explore the Data using Pandas](#2.-Explore-the-Data-using-Pandas)
3. [Creating the DTM: scikit-learn](#3.-Creating-the-DTM:-scikit-learn)
    1. [CountVectorizer function](#CountVectorizer-Function)
4. [What can we do with a DTM?](#4.-What-can-we-do-with-a-DTM?)
    1. [Exercise 4.1](#Exercise-4.1)
5. [Tf-idf scores](#5.-Tf-idf-scores)
    1. [Tf-idf Vectorizer Function](#Tf-Idf-Vectorizer-Function)
6. [Identifying Distinctive Words](#6.-Identifying-Distinctive-Words)
    1. [Identify Distinctive Words by Genre](#Identifying-Disctinctive-Words-by-Genre)
    2. [Exercise 6.1](#Exercise-6.1)
7. [Uncovering patterns using LDA](#7.-Uncovering-Patterns-using-LDA)
    1. [Exercise 7.1](#Exercise-7.1)
    2. [Exercise 7.2](#Exercise-7.2)

#### Key Jargon

* *Document Term Matrix*:
    * a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
    *  short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
* *Topic Modeling*:
    * A statistical model to uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
    
* *LDA*:
    * Latent Dirichlet Allocation. An implementation of topic modeling that assumes a Dirichlet prior. It does not take document order into account, unlike some other topic modeling algorithms.
    

#### Further Resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail 

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn

[More detailed description of implementing LDA using scikit-learn](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py).
    

## 1. The Pandas Dataframe: Music Reviews

First, we read our music reviews corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

Note: I love Pandas for data munging and basic calculations because it's so easy to use, and its data structure is really intuitive for me. It's not memory efficient however, so you might quickly need to move away from it. I recommend always always always using Pandas (or similar) over spreadsheets and Excel. [Excel is bad for science!](https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/)

In [1]:
import pandas
import numpy as np

#create a dataframe called "df"
df = pandas.read_csv("../A-Data/BDHSI2016_music_reviews.csv", sep = '\t', encoding = 'utf-8')

#view the dataframe
#notice the metadata. The column "body" contains our text of interest.
df

| | album | artist | genre | release_date | critic | score | body |
|---|---|---|---|---|---|---|---|
| 0 | Don't Panic | All Time Low | Pop/Rock | 2012-10-09 00:00:00 | Kerrang! | 74.0 | While For Baltimore proves they can still writ... |
| 1 | Fear and Saturday Night | Ryan Bingham | Country | 2015-01-20 00:00:00 | Uncut | 70.0 | There's nothing fake about the purgatorial nar... |
| 2 | The Way I'm Livin' | Lee Ann Womack | Country | 2014-09-23 00:00:00 | Q Magazine | 84.0 | All life's disastrous lows are here on a caree... |
| 3 | Doris | Earl Sweatshirt | Rap | 2013-08-20 00:00:00 | Pitchfork | 82.0 | With Doris, Odd Future’s Odysseus is finally b... |
| 4 | Giraffe | Echoboy | Rock | 2003-02-25 00:00:00 | AllMusic | 71.0 | Though Giraffe is definitely Echoboy's most im... |
| 5 | Weathervanes | Freelance Whales | Indie | 2010-04-13 00:00:00 | Q Magazine | 68.0 | Fans of Owl City and The Postal Service will r... |
| 6 | Build a Rocket Boys! | Elbow | Pop/Rock | 2011-04-12 00:00:00 | Delusions of Adequacy | 82.0 | Whereas previous Elbow records set a mood, Bui... |
| 7 | Ambivalence Avenue | Bibio | Indie | 2009-06-23 00:00:00 | Q Magazine | 78.0 | His remarkable Warp debut follows a series of ... |
| 8 | Wavvves | Wavves | Indie | 2009-03-17 00:00:00 | PopMatters | 68.0 | There’s an energy coursing through this, and r... |
| 9 | Peachtree Road | Elton John | Rock | 2004-11-09 00:00:00 | MelD. | 70.0 | Classic. Songs filled with soul. Lyrics refres... |


In [2]:
#print the first review from the column 'body'
df['body'][0]

'While For Baltimore proves they can still write a grade A banger when they put their mind to it, too many songs are destined to have "must try harder" stamped on their report card. [13 Oct 2012, p.52]'

## 2. Explore the Data using Pandas

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

First, what genres are in this dataset, and how many reviews in each genre?

In [3]:
#We can count this using the value_counts() function
df['genre'].value_counts()

Pop/Rock                  1486
Indie                     1115
Rock                       932
Electronic                 513
Rap                        363
Pop                        149
Country                    140
R&B                        112
Folk                        70
Alternative/Indie Rock      42
Dance                       41
Jazz                        38
Name: genre, dtype: int64

The first thing most people do is to `describe` their data. (This is the `summary` command in R, or the `sum` command in Stata).

In [4]:
#There's only one numeric column in our data so we only get one column for output.
#If there were multiple numeric columns we would get more columns.
df.describe()

| | score |
|---|---|
| count | 5001.0 |
| mean | 72.684223 |
| std | 8.714896 |
| min | 7.4 |
| 25% | 68.0 |
| 50% | 74.0 |
| 75% | 79.0 |
| max | 100.0 |


Who are the reviewers?

In [5]:
df['critic'].value_counts()

AllMusic                     282
PopMatters                   228
Pitchfork                    207
Q Magazine                   178
Uncut                        171
Mojo                         137
Drowned In Sound             132
New Musical Express (NME)    127
The A.V. Club                121
Rolling Stone                112
Under The Radar              100
Spin                          97
The Guardian                  96
musicOMH.com                  88
Entertainment Weekly          87
Slant Magazine                83
Paste Magazine                72
Consequence of Sound          69
Alternative Press             69
Prefix Magazine               68
NOW Magazine                  66
Tiny Mix Tapes                64
Blender                       57
Dusted Magazine               56
Dot Music                     56
Stylus Magazine               55
No Ripcord                    53
Boston Globe                  52
Austin Chronicle              52
Filter                        50
          

And the artists?

In [6]:
df['artist'].value_counts()

Various Artists                        22
R.E.M.                                 16
Arcade Fire                            14
Sigur Rós                              13
Belle & Sebastian                      12
Brian Eno                              11
The Raveonettes                        10
Bob Dylan                              10
Low                                    10
Weezer                                 10
Mogwai                                 10
Kings of Leon                          10
Radiohead                              10
LCD Soundsystem                        10
Ghostface Killah                        9
Sun Kil Moon                            9
M. Ward                                 9
Wilco                                   9
Franz Ferdinand                         9
Eels                                    9
Los Campesinos!                         9
Of Montreal                             8
Neil Young                              8
The Decemberists                  

What if we just want the average score given?

In [7]:
print(df['score'].mean())

72.68422315536893


Slightly more complicated to code: what is the average score for each genre? To do this, we use Pandas' `groupby` function. Note: If you are planning on doing any sort of statistics, including basic statistics, you'll want to get very familiar with the `groupby` function. It's quite powerful.

In [8]:
#create a groupby dataframe grouped by genre
df_genres = df.groupby("genre")
print(df_genres)
#calculate the mean score by genre, print out the results
print(df_genres['score'].mean().sort_values(ascending=False))

<pandas.core.groupby.DataFrameGroupBy object at 0x11500bf60>
genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B                       72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64
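Group means can be hard to compare when the groups differ a lot in size (Jazz has only 38 reviews here, while Pop/Rock has 1,486). As a minimal sketch, assuming a small made-up dataframe rather than the music reviews, `agg` can report the mean and the number of rows for each group side by side. The `toy` data and its values are invented for illustration.

```python
import pandas

# Hypothetical toy data standing in for the reviews dataframe
toy = pandas.DataFrame({
    "genre": ["Jazz", "Jazz", "Rap", "Rap", "Rap", "Pop"],
    "score": [80.0, 90.0, 70.0, 75.0, 65.0, 60.0],
})

# One row per genre, with the mean score and the number of reviews behind it
summary = toy.groupby("genre")["score"].agg(["mean", "count"])
summary = summary.sort_values("mean", ascending=False)
print(summary)
```

Seeing the count next to the mean makes it easier to judge how much weight a given group average deserves.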


## 3. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers.

In [9]:
df['body'] = df['body'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
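The list comprehension above works one character at a time and will raise an error if a review is missing, because a missing value is not a string. One possible alternative, sketched here on a made-up Series rather than on `df['body']`, uses Pandas' vectorized string methods with a regular expression. The example texts are hypothetical.

```python
import pandas

# Hypothetical example texts; None stands in for a missing review
texts = pandas.Series(["Released in 2012, p.52", "No digits here", None])

# Remove every run of digits; missing values stay missing instead of raising
no_digits = texts.str.replace(r"\d+", "", regex=True)
print(no_digits.tolist())
```

On the real data the equivalent call would be applied to the `body` column; whether that is preferable depends on whether your corpus contains missing texts.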

### CountVectorizer Function

Our next step is to turn the text into a document term matrix using the scikit-learn function called `CountVectorizer`. There are two ways to store the result. The first is a sparse matrix, which can be used within `scikit-learn` for further analyses; the second, shown further below, is a Pandas DataFrame.

In [10]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

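#create a CountVectorizer object with the default options
countvec = CountVectorizer()

#fit the vocabulary and build the DTM as a sparse matrix, reusing the object created above
sklearn_dtm = countvec.fit_transform(df.body)
print(sklearn_dtm)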

  (0, 9643)	1
  (0, 2011)	1
  (0, 11604)	1
  (0, 9722)	1
  (0, 13369)	1
  (0, 6358)	1
  (0, 14799)	1
  (0, 9277)	1
  (0, 6417)	1
  (0, 3662)	1
  (0, 671)	1
  (0, 13062)	1
  (0, 8536)	1
  (0, 14542)	1
  (0, 7398)	1
  (0, 14495)	2
  (0, 8941)	1
  (0, 14257)	2
  (0, 11042)	1
  (0, 15740)	1
  (0, 1034)	1
  (0, 6088)	1
  (0, 15995)	1
  (0, 13493)	1
  (0, 1963)	1
  :	:
  (5000, 4803)	1
  (5000, 12068)	1
  (5000, 4724)	1
  (5000, 11414)	1
  (5000, 13381)	1
  (5000, 10844)	1
  (5000, 9821)	1
  (5000, 12918)	1
  (5000, 5168)	1
  (5000, 14110)	1
  (5000, 1202)	1
  (5000, 9261)	1
  (5000, 13040)	1
  (5000, 9134)	1
  (5000, 15882)	1
  (5000, 14500)	1
  (5000, 828)	1
  (5000, 14237)	1
  (5000, 15940)	1
  (5000, 480)	3
  (5000, 744)	1
  (5000, 9663)	1
  (5000, 14243)	1
  (5000, 9722)	1
  (5000, 14257)	1


This format is called Compressed Sparse Row (CSR) format. Each line of the output shows a (document index, term index) pair followed by the count for that cell; cells with a count of zero are not stored. It saves a lot of memory to store the DTM in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas DataFrame, a format we're more familiar with. For larger datasets, you will have to use the sparse format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!
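To make the sparse output easier to read, here is a minimal sketch on a made-up three-document corpus (not the music reviews). It shows the matrix dimensions, the number of stored cells, and one way to total word counts without converting to a dense array. The names `docs`, `vec`, and `index_to_term` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus
docs = ["the album is great", "the songs are great great", "a dull album"]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)  # SciPy sparse matrix, documents x terms

print(dtm.shape)  # (number of documents, vocabulary size)
print(dtm.nnz)    # number of stored (non-zero) cells

# vocabulary_ maps each term to its column index; invert it to read columns
index_to_term = {idx: term for term, idx in vec.vocabulary_.items()}

# Column sums give corpus-wide term counts without building a dense array
totals = np.asarray(dtm.sum(axis=0)).ravel()
counts = {index_to_term[i]: int(totals[i]) for i in range(dtm.shape[1])}
print(counts)
```

Note that the default tokenizer only keeps tokens of two or more characters, so the one-letter word "a" does not appear in the vocabulary.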

In [11]:
#we do the same as we did above, but convert the result into a Pandas dataframe. Note this takes quite a bit more memory, so it will not work well for bigger data.
#get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names()
dtm_df = pandas.DataFrame(countvec.fit_transform(df.body).toarray(), columns=countvec.get_feature_names_out(), index = df.index)

#view the dtm dataframe
dtm_df

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 2, where we found the most frequent words using NLTK).

In [12]:
print(dtm_df.sum().sort_values(ascending=False))

the             7406
and             4557
of              4400
to              3175
is              2914
it              2608
that            2039
in              1775
album           1719
this            1518
but             1439
with            1367
as              1310
on              1139
for             1073
are              812
you              775
their            775
an               751
his              743
more             712
be               691
like             681
from             676
not              650
songs            640
one              580
they             580
its              575
all              574
                ... 
glimmering         1
glimmers           1
gliss              1
glisten            1
glistening         1
glitch             1
respond            1
glitchier          1
glitter            1
glittering         1
glittery           1
glitz              1
glo                1
gloating           1
respectively       1
globular           1
respectfully 

### Exercise 4.1

Print out the most infrequent words rather than the most frequent words. You can look at the [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats) for more information.
* Gold star challenge: 
    * Print the average number of times each word is used in a review
    * Print this out sorted from lowest to highest.

What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But, what do we lose when we represent text in this format?
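As one illustration of both points, here is a small sketch using made-up sentences (not the reviews). It computes cosine similarity between the rows of a DTM, a standard vector-space measure, and in doing so shows that word order is not stored in the matrix. The `docs` list is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sentences: the first two use the same words in a different order
docs = ["dog bites man", "man bites dog", "jazz album review"]

dtm = CountVectorizer().fit_transform(docs)

# Pairwise cosine similarity between the document rows
sims = cosine_similarity(dtm)
print(sims.round(2))
```

The first two sentences mean different things, yet their rows in the DTM are identical, so the matrix cannot tell them apart. The third sentence shares no words with the others, so its similarity to them is zero.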

Today, we will use variations on the DTM to find distinctive words in this dataset, and then do some preliminary work discovering themes in text.

## 5. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw a few ways to do this yesterday, using natural language processing. Today, we'll learn one simple approach: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent but are also used in every single document will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is the `tf-idf score`. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as:

`number_of_documents / number_of_documents_with_term`

so:

`tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_of_documents_with_word1)`

You can, and often should, normalize the term frequency by the length of the document:

`tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_of_documents_with_word1)`

We can calculate this manually, but scikit-learn has a built-in function to do so. This function takes the logarithm of the inverse document frequency and, by default, also rescales the scores within each document, so the numbers will not correspond exactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), but a challenge for you: use Pandas to calculate this manually.
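To see where the scikit-learn numbers come from, here is a sketch in NumPy (not the Pandas challenge itself) that reproduces the default `TfidfVectorizer` output on a made-up corpus. It assumes the default settings as described in the scikit-learn user guide: a smoothed, log-scaled idf of `ln((1 + n) / (1 + df)) + 1`, raw counts as term frequencies, and each document's row rescaled to unit Euclidean length. The toy `docs` are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
docs = ["great album", "great great songs", "dull album"]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed, log-scaled idf (scikit-learn default settings)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale each row to unit Euclidean length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sklearn_tfidf))
```

Because of the row rescaling, every score falls between 0 and 1, which is worth keeping in mind when reading the output later in this section.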

### Tf-Idf Vectorizer Function

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [13]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

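#create the dtm, but with cells weighted by the tf-idf score.
#get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names()
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.body).toarray(), columns=tfidfvec.get_feature_names_out(), index = df.index)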

#view results
dtm_tfidf_df

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's look at the 20 words with highest tf-idf weights.

In [14]:
print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])

brill         1.000000
perfect       1.000000
yummy         1.000000
pppperfect    1.000000
awesome       1.000000
wonderfull    1.000000
meh           1.000000
stars         1.000000
subpar        0.959257
ga            0.908259
masterful     0.898620
grower        0.888624
likable       0.867803
acirc         0.867003
great         0.864253
infectious    0.859996
blank         0.854475
thrilling     0.848810
smart         0.847852
stuff         0.834479
dtype: float64


Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?

## 6. Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we merge the genre of the document into our dtm weighted by tf-idf scores, and then compare genres.

In [15]:
#create dataset with document index and genre
df_genre = df['genre'].to_frame()
print(df_genre)

           genre
0       Pop/Rock
1        Country
2        Country
3            Rap
4           Rock
5          Indie
6       Pop/Rock
7          Indie
8          Indie
9           Rock
10    Electronic
11          Rock
12          Rock
13         Indie
14         Indie
15           Pop
16         Indie
17      Pop/Rock
18           Rap
19          Rock
20         Indie
21    Electronic
22          Rock
23          Rock
24           Rap
25         Indie
26         Indie
27      Pop/Rock
28          Rock
29    Electronic
...          ...
4971    Pop/Rock
4972       Indie
4973  Electronic
4974       Indie
4975        Rock
4976        Rock
4977        Rock
4978     Country
4979    Pop/Rock
4980     Country
4981  Electronic
4982    Pop/Rock
4983     Country
4984    Pop/Rock
4985    Pop/Rock
4986       Indie
4987    Pop/Rock
4988  Electronic
4989        Rock
4990    Pop/Rock
4991         Rap
4992  Electronic
4993        Rock
4994        Rock
4995         Rap
4996       Indie
4997        Ro

In [16]:
#merge this into the dtm_tfidf_df
merged_df = df_genre.join(dtm_tfidf_df, how = 'right', lsuffix='_x')

#view result
merged_df

Unnamed: 0,genre_x,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,Pop/Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Country,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Country,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Rap,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Indie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Pop/Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Indie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Indie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Rock,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Identifying Disctinctive Words by Genre

Now let's compare the words with the highest tf-idf weight for each genre.

Note: there are other ways to do this. Challenge: what is a different approach to identifying rows from a certain genre in our dtm?

In [17]:
#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=="Rap"]
dtm_indie = merged_df[merged_df['genre_x']=="Alternative/Indie Rock"]
dtm_jazz = merged_df[merged_df['genre_x']=="Jazz"]

#print the words with the highest tf-idf scores for each genre
print("Rap Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Indie Words")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Jazz Words")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

Rap Words
blank             0.854475
waste             0.755918
amiable           0.730963
awesomely         0.717079
joyless           0.687687
beastie           0.672439
same              0.672392
sucker            0.663760
vanguard          0.661978
tight             0.653993
lamest            0.639377
derivativeness    0.636271
authentic         0.627192
diverse           0.623373
sermon            0.621175
pushin            0.617699
mastermind        0.609213
neat              0.608922
we                0.600755
lift              0.591821
dtype: float64

Indie Words
underplayed    0.516717
prisoner       0.512087
jezabels       0.512087
careworn       0.509386
folk           0.509321
fourth         0.480502
heyday         0.469035
their          0.458950
riffed         0.458182
bet            0.456164
victory        0.449289
exhausted      0.445969
bigger         0.441849
babelfished    0.431543
lightweight    0.428857
exercised      0.428857
powerhouse     0.422192
worn          

There we go! A method of identifying distinctive words. You notice there are some proper nouns in there. How might we remove those if we're not interested in them?

Tf-idf scores are just one way to identify distinctive or discriminating words. See Monroe, Colaresi, and Quinn (2009) for more ideas for finding distinctive words. (Warning: this paper is a bit outdated. No one has taken up their recommendation to use a Dirichlet prior).

### Exercise 6.1 

Copy and paste the code above to the cell below and change the genres for a different comparison. Instead of outputting the highest weighted words, output the lowest weighted words. How should we interpret these words?

## 7. Uncovering Patterns using LDA

Frequency counts and tf-idf scores are done at the word level. There are other methods of exploratory or unsupervised analysis that work on the document level and by examining the co-occurrence of words within documents. Scikit-learn allows for many of these methods, including:

* document clustering
* document or word similarities using cosine similarity
* PCA
* topic modeling

We'll run through an example of topic modeling here. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling. We will run this in one big chunk of code. Our challenge: use the knowledge of scikit-learn that we gained above to walk through the code and understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one-word or one-sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century.

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [18]:
df_lit = pandas.read_csv("../A-Data/childrens_lit.csv.bz2", sep='\t', encoding = 'utf-8', compression = 'bz2')

#drop rows where the text is missing. I think there's only one row where it's missing, but check me on that.
df_lit = df_lit.dropna(subset=['text'])

#view the dataframe
df_lit

| | Unnamed: 0 | title | author gender | year | text |
|---|---|---|---|---|---|
| 0 | 0 | A Dog with a Bad Name | Male | 1886 | A DOG WITH A BAD NAME BY TALBOT BAINES REED ... |
| 1 | 1 | A Final Reckoning | Male | 1887 | A Final Reckoning: A Tale of Bush Life in Aust... |
| 2 | 2 | A House Party, Don Gesualdo, and A Rainy June | Female | 1887 | A HOUSE-PARTY Don Gesualdo and A Rainy June... |
| 3 | 3 | A Houseful of Girls | Female | 1889 | A HOUSEFUL OF GIRLS. BY SARAH TYTLER, AUTHOR ... |
| 4 | 4 | A Little Country Girl | Female | 1885 | LITTLE COUNTRY GIRL. BY SUSAN COOLIDGE, ... |
| 5 | 5 | A Round Dozen | Female | 1883 | \n A ROUND DOZEN. [Illustration: TOINETTE AND... |
| 6 | 6 | A Sailor's Lass | Female | 1886 | A SAILOR'S LASS by EMMA LESLIE, Author of "... |
| 7 | 7 | A World of Girls | Female | 1886 | A WORLD OF GIRLS: THE STORY OF A SCHOOL. By ... |
| 8 | 8 | Adrift in the Wild | Male | 1887 | Adrift in the Wilds; ... |
| 9 | 9 | Adventures in Africa | Male | 1883 | ADVENTURES IN AFRICA, BY W.H.G. KINGSTON. C... |


Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

In [19]:
####Adapted from:
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = 5
n_top_words = 50

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Use tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=None,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_lit.text)

# Use tf (raw term count) features
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_lit.text)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

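#define the LDA model, with desired options
#note: the number of topics is passed as n_components in scikit-learn 0.19 and later (the older name n_topics was removed in 0.21)
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20,
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)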
#fit the model
lda.fit(tf)

#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
#get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=2000 and n_topics=5...

Topics in LDA model:

Topic #0:
doctor project girls sister papa mamma london baby street sweet dr tom tea aunt remarked presently em ain study works office wasn cousin youth darling loved ladies everybody flower public foundation nurse shop class ma george stairs doesn flowers john lovely carriage bell mary sisters reader garden uncle st term

Topic #1:
dick uncle doctor er jack ain tom yer den fish em rock wolf lads rope gun ha birds ay beneath rocks shock stream tail moments eh mate garden excitedly sand fishing thrust nay ye softly chap gazing bird leg hook tremendous penny growled stones ashore mountain jump farther task angrily

Topic #2:
french king troops ship officers camp army attack prince guns officer village tom soldiers shore regiment indian rode fort british wounded boats march native island queen advanced deck lads james prisoners city vessel sword ships jack column 
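
The slice `topic.argsort()[:-n_top_words - 1:-1]` inside `print_top_words` is the least obvious part of the cell above. A minimal sketch of what it does, using a made-up five-word vocabulary and made-up topic weights (all `toy_*` names and values are assumptions for illustration):

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary
toy_feature_names = ["garden", "ship", "doctor", "army", "tea"]
toy_topic = np.array([0.5, 3.0, 1.2, 0.1, 2.4])

n_top = 3

# argsort returns the indices that would sort the weights from smallest to largest
ascending = toy_topic.argsort()

# Stepping backwards from the end picks the n_top largest weights, biggest first
top_indices = ascending[:-n_top - 1:-1]

top_words = [toy_feature_names[i] for i in top_indices]
print(top_words)
```

So each row of `model.components_` is treated as a list of word weights, and the function prints the words with the highest weights for that topic.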

### Exercise 7.1

Run the same code as above but change some of the parameters. How does this change the output?

Suggestions:
1. Use tf-idf scores rather than raw counts. (Hint: look for the variable name we created.)
2. Change the number of topics. What do you find?
3. Do not remove stop words. How does this change the output?

One thing we may want to do with the output is find the most representative texts for each topic. A simple (but not memory-efficient) way to do this is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.

In [20]:
topic_dist = lda.transform(tf)
topic_dist

array([[  9.15375344e-01,   1.42317721e-02,   6.96084374e-02,
          2.56380976e-05,   7.58808516e-04],
       [  1.76733255e-01,   6.52920094e-02,   6.89648962e-01,
          6.82975969e-02,   2.81769775e-05],
       [  9.69898328e-01,   2.80712669e-05,   1.53776961e-02,
          2.80847733e-05,   1.46678196e-02],
       [  9.99905126e-01,   2.36526492e-05,   2.37332767e-05,
          2.36966695e-05,   2.37909101e-05],
       [  9.46729880e-01,   3.63092671e-02,   5.82999763e-03,
          1.10847475e-02,   4.61076571e-05],
       [  8.94796152e-01,   1.05061842e-01,   4.73949792e-05,
          4.73012256e-05,   4.73090690e-05],
       [  4.29599412e-01,   5.70084609e-01,   1.05253815e-04,
          1.05672565e-04,   1.05052097e-04],
       [  9.99888096e-01,   2.78874100e-05,   2.80340364e-05,
          2.79245614e-05,   2.80576907e-05],
       [  1.26623528e-01,   1.54800031e-01,   6.75428892e-01,
          4.31122056e-02,   3.53426630e-05],
       [  5.43946402e-05,   2.7639595
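
In the array above, each row is one document and each column is one topic, and the values in a row appear to sum to roughly 1. A minimal sketch of how to check that, and how to read off the single strongest topic per document, using a made-up array in place of `topic_dist` (the `toy_topic_dist` values are assumptions for illustration):

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents, 3 topics
toy_topic_dist = np.array([
    [0.90, 0.05, 0.05],
    [0.20, 0.10, 0.70],
    [0.45, 0.50, 0.05],
])

# Each row should add up to (approximately) 1
row_sums = toy_topic_dist.sum(axis=1)
print(np.allclose(row_sums, 1.0))

# argmax along axis=1 gives the column (topic) with the largest weight in each row
dominant_topic = toy_topic_dist.argmax(axis=1)
print(dominant_topic)
```

Note that the third toy document is split almost evenly between two topics, so its "dominant" topic hides a lot of information. Keeping the full distribution, as the lesson does, avoids that loss.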

Merge back in with the original dataframe.

In [21]:
topic_dist_df = pandas.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(df_lit)
df_w_topics

| | 0 | 1 | 2 | 3 | 4 | Unnamed: 0 | title | author gender | year | text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.915375 | 0.014232 | 0.069608 | 0.000026 | 0.000759 | 0.0 | A Dog with a Bad Name | Male | 1886.0 | A DOG WITH A BAD NAME BY TALBOT BAINES REED ... |
| 1 | 0.176733 | 0.065292 | 0.689649 | 0.068298 | 0.000028 | 1.0 | A Final Reckoning | Male | 1887.0 | A Final Reckoning: A Tale of Bush Life in Aust... |
| 2 | 0.969898 | 0.000028 | 0.015378 | 0.000028 | 0.014668 | 2.0 | A House Party, Don Gesualdo, and A Rainy June | Female | 1887.0 | A HOUSE-PARTY Don Gesualdo and A Rainy June... |
| 3 | 0.999905 | 0.000024 | 0.000024 | 0.000024 | 0.000024 | 3.0 | A Houseful of Girls | Female | 1889.0 | A HOUSEFUL OF GIRLS. BY SARAH TYTLER, AUTHOR ... |
| 4 | 0.946730 | 0.036309 | 0.005830 | 0.011085 | 0.000046 | 4.0 | A Little Country Girl | Female | 1885.0 | LITTLE COUNTRY GIRL. BY SUSAN COOLIDGE, ... |
| 5 | 0.894796 | 0.105062 | 0.000047 | 0.000047 | 0.000047 | 5.0 | A Round Dozen | Female | 1883.0 | \n A ROUND DOZEN. [Illustration: TOINETTE AND... |
| 6 | 0.429599 | 0.570085 | 0.000105 | 0.000106 | 0.000105 | 6.0 | A Sailor's Lass | Female | 1886.0 | A SAILOR'S LASS by EMMA LESLIE, Author of "... |
| 7 | 0.999888 | 0.000028 | 0.000028 | 0.000028 | 0.000028 | 7.0 | A World of Girls | Female | 1886.0 | A WORLD OF GIRLS: THE STORY OF A SCHOOL. By ... |
| 8 | 0.126624 | 0.154800 | 0.675429 | 0.043112 | 0.000035 | 8.0 | Adrift in the Wild | Male | 1887.0 | Adrift in the Wilds; ... |
| 9 | 0.000054 | 0.276396 | 0.723441 | 0.000054 | 0.000055 | 9.0 | Adventures in Africa | Male | 1883.0 | ADVENTURES IN AFRICA, BY W.H.G. KINGSTON. C... |
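
One thing to keep in mind about `join`: it matches rows by index label, not by row position. A minimal sketch of what that can mean when the two dataframes do not share the same index, using made-up data (the `toy_*` frames are assumptions for illustration, and whether this applies to `df_lit` depends on how it was built earlier in the lesson):

```python
import pandas

# Hypothetical metadata whose index has a gap (the row labelled 1 was dropped earlier)
toy_meta = pandas.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 3])

# Topic weights for the same three documents; a fresh DataFrame is labelled 0, 1, 2
toy_topics = pandas.DataFrame([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# join() lines rows up by label: label 1 finds no metadata, and label 3 is never used
by_label = toy_topics.join(toy_meta)
print(by_label)

# Resetting the metadata index first lines the rows up by position instead
by_position = toy_topics.join(toy_meta.reset_index(drop=True))
print(by_position)
```

This is worth checking whenever a merged dataframe shows unexpected `NaN` values or row counts.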


Now we can sort the dataframe by the topic of interest and view the top documents for each topic.
Below we sort the documents first by Topic 0 (judging by its top words, I think it's about family, health, and domestic activities), and then by Topic 1 (judging by its top words, I think it's about children playing outside in nature). Could these two topics reflect a family/nature split?

Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?

In [22]:
print(df_w_topics[['title', 'author gender', 0]].sort_values(by=[0], ascending=False))

                                                 title author gender         0
85                                                 NaN           NaN  0.999912
3                                  A Houseful of Girls        Female  0.999905
97                                  The Life of a Ship          Male  0.999901
100                  The Little Princess of Tower Hill        Female  0.999895
7                                     A World of Girls        Female  0.999888
58                                                 NaN           NaN  0.999887
63                                         Quicksilver          Male  0.999883
27                                Elsie's Kith and Kin        Female  0.999879
28                               Elsie's New Relations        Female  0.999869
36                                   Grandmother Elsie        Female  0.999866
108  The Red Man's Revenge A Tale of the Red River ...          Male  0.999865
45                                        Just Sixte

In [23]:
print(df_w_topics[['title', 'author gender', 1]].sort_values(by=[1], ascending=False))

                                                 title author gender         1
24                                    Dick o' the Fens          Male  0.999931
53                                     My Friend Smith          Male  0.999919
48                              Little Lord Fauntleroy        Female  0.999914
55                                    Orange and Green          Male  0.999913
114                            The Willoughby Captains          Male  0.999893
115                                  The Young Buglers          Male  0.999743
18                                    Brownsmith's Boy          Male  0.994254
19                                         Bunyip Land          Male  0.991851
120                          Through Forest and Stream          Male  0.989588
50                                    Middy and Ensign          Male  0.887119
61                                          Post Haste          Male  0.844167
99                               The Lion of the Nor
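
Rather than sorting once per topic by hand, the same idea can be wrapped in a loop. A minimal sketch on a made-up dataframe shaped like `df_w_topics`, with integer topic columns plus a `title` column (the `toy_df` values are assumptions for illustration):

```python
import pandas

# Hypothetical stand-in for df_w_topics: two topic columns and a title column
toy_df = pandas.DataFrame({
    0: [0.9, 0.2, 0.6, 0.1],
    1: [0.1, 0.8, 0.4, 0.9],
    "title": ["Book A", "Book B", "Book C", "Book D"],
})

top_titles = {}
for topic in [0, 1]:
    # nlargest keeps the rows with the highest values in the given column
    top_rows = toy_df.nlargest(2, topic)
    top_titles[topic] = top_rows["title"].tolist()

print(top_titles)
```

With the real data you would loop over `range(n_topics)` and probably ask for more than two rows per topic.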

What other patterns might we find with topic modeling? Toward what end?

### Exercise 7.2

Gold star exercise:
1. Find the most prevalent topic in the corpus.
2. Find the least prevalent topic in the corpus.
3. Find the most prevalent topic by the gender of the author.

Hint: How do we define prevalence? What are different ways of measuring this, and the benefits/drawbacks of each?


Extra bonus gold star exercise:
1. This topic model provides the topic distribution for 127 rows, but there are 131 rows in the full data.
2. What is going on here? (I don't have an answer to this. I hope someone can figure it out!)