# Unsupervised methods

In this lesson, we'll cover unsupervised computational text anlalysis approaches. The central methods covered are TF-IDF and Topic Modeling. Both of these are common approachs in the social sciences and humanities.

[DTM/TF-IDF](#dtm)<br>

[Topic modeling](#topics)<br>

### Today you will
* Understand the DTM and why it's important to text analysis
* Learn how to create a DTM in Python
* Learn basic functionality of Python's package scikit-learn
* Understand tf-idf scores
* Learn a simple way to identify distinctive words
* Implement a basic topic modeling algorithm and learn how to tweak it
* In the process, gain more familiarity and comfort with the Pandas package and manipulating data


### Key Jargon
* *Document Term Matrix*:
    * a matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
    * short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
* *Topic Modeling*:
    * A general class of statistical models that uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
    
* *LDA*:
    * Latent Dirichlet Allocation. A particular model for topic modeling. It does not take document order into account, unlike other topic modeling algorithms.

## DTM/TF-IDF <a id='dtm'></a>

In this lesson we will use Python's scikit-learn package learn to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: **what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums?**

In [1]:
import os
import numpy as np
import pandas as pd

DATA_DIR = 'data'
music_fname = 'music_reviews.csv'
music_fname = os.path.join(DATA_DIR, music_fname)

### First attempt at reading in file

In [2]:
reviews = pd.read_csv(music_fname, sep='\t')
reviews.head()

Unnamed: 0,album,artist,genre,release_date,critic,score,body
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...


Print the text of the first review.

In [3]:
print(reviews['body'][0])

While For Baltimore proves they can still write a grade A banger when they put their mind to it, too many songs are destined to have "must try harder" stamped on their report card. [13 Oct 2012, p.52]


### Explore the Data using Pandas

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks to make sure your data is correct, and there's no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

First, what genres are in this dataset, and how many reviews in each genre?

In [4]:
#We can count this using the value_counts() function
reviews['genre'].value_counts()

Pop/Rock                  1486
Indie                     1115
Rock                       932
Electronic                 513
Rap                        363
Pop                        149
Country                    140
R&B;                       112
Folk                        70
Alternative/Indie Rock      42
Dance                       41
Jazz                        38
Name: genre, dtype: int64

The first thing most people do is to `describe` their data. (This is the `summary` command in R, or the `sum` command in Stata).

In [5]:
#There's only one numeric column in our data so we only get one column for output.
reviews.describe()

Unnamed: 0,score
count,5001.0
mean,72.684223
std,8.714896
min,7.4
25%,68.0
50%,74.0
75%,79.0
max,100.0


This only gets us numerical summaries. To get summaries of some of the other columns, we can explicitly ask for it.

In [6]:
reviews.describe(include=['O'])

Unnamed: 0,album,artist,genre,release_date,critic,body
count,5001,5001,5001,5001,5001,5001
unique,3799,2607,12,956,592,4998
top,Tonight: Franz Ferdinand,Various Artists,Pop/Rock,2011-09-13 00:00:00,AllMusic,travis is always great band!!!!!!!!!!! it's sh...
freq,5,22,1486,29,282,2


Who were the reviewers?

In [7]:
reviews['critic'].value_counts().head(10)

AllMusic                     282
PopMatters                   228
Pitchfork                    207
Q Magazine                   178
Uncut                        171
Mojo                         137
Drowned In Sound             132
New Musical Express (NME)    127
The A.V. Club                121
Rolling Stone                112
Name: critic, dtype: int64

And the artists?

In [8]:
reviews['artist'].value_counts().head(10)

Various Artists      22
R.E.M.               16
Arcade Fire          14
Sigur Rós            13
Belle & Sebastian    12
Brian Eno            11
Mogwai               10
LCD Soundsystem      10
Radiohead            10
Weezer               10
Name: artist, dtype: int64

We can get the average score as follows:

In [9]:
reviews['score'].mean()

72.68422315536893

Now we want to know the average score for each genre? To do this, we use Pandas `groupby` function. You'll want to get very familiar with the `groupby` function. It's quite powerful. (Similar to `collapse` on Stata)

In [10]:
reviews_grouped_by_genre = reviews.groupby("genre")
reviews_grouped_by_genre['score'].mean().sort_values(ascending=False)

genre
Jazz                      77.631579
Folk                      75.900000
Indie                     74.400897
Country                   74.071429
Alternative/Indie Rock    73.928571
Electronic                73.140351
Pop/Rock                  73.033782
R&B;                      72.366071
Rap                       72.173554
Rock                      70.754292
Dance                     70.146341
Pop                       64.608054
Name: score, dtype: float64

### Creating the DTM using scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers.

In [11]:
def remove_digits(comment):
    return ''.join([ch for ch in comment if not ch.isdigit()])

reviews['body_without_digits'] = reviews['body'].apply(remove_digits)
reviews

Unnamed: 0,album,artist,genre,release_date,critic,score,body,body_without_digits
0,Don't Panic,All Time Low,Pop/Rock,2012-10-09 00:00:00,Kerrang!,74.0,While For Baltimore proves they can still writ...,While For Baltimore proves they can still writ...
1,Fear and Saturday Night,Ryan Bingham,Country,2015-01-20 00:00:00,Uncut,70.0,There's nothing fake about the purgatorial nar...,There's nothing fake about the purgatorial nar...
2,The Way I'm Livin',Lee Ann Womack,Country,2014-09-23 00:00:00,Q Magazine,84.0,All life's disastrous lows are here on a caree...,All life's disastrous lows are here on a caree...
3,Doris,Earl Sweatshirt,Rap,2013-08-20 00:00:00,Pitchfork,82.0,"With Doris, Odd Future’s Odysseus is finally b...","With Doris, Odd Future’s Odysseus is finally b..."
4,Giraffe,Echoboy,Rock,2003-02-25 00:00:00,AllMusic,71.0,Though Giraffe is definitely Echoboy's most im...,Though Giraffe is definitely Echoboy's most im...
5,Weathervanes,Freelance Whales,Indie,2010-04-13 00:00:00,Q Magazine,68.0,Fans of Owl City and The Postal Service will r...,Fans of Owl City and The Postal Service will r...
6,Build a Rocket Boys!,Elbow,Pop/Rock,2011-04-12 00:00:00,Delusions of Adequacy,82.0,"Whereas previous Elbow records set a mood, Bui...","Whereas previous Elbow records set a mood, Bui..."
7,Ambivalence Avenue,Bibio,Indie,2009-06-23 00:00:00,Q Magazine,78.0,His remarkable Warp debut follows a series of ...,His remarkable Warp debut follows a series of ...
8,Wavvves,Wavves,Indie,2009-03-17 00:00:00,PopMatters,68.0,"There’s an energy coursing through this, and r...","There’s an energy coursing through this, and r..."
9,Peachtree Road,Elton John,Rock,2004-11-09 00:00:00,MelD.,70.0,Classic. Songs filled with soul. Lyrics refres...,Classic. Songs filled with soul. Lyrics refres...


In [12]:
reviews['body_without_digits'].head()

0    While For Baltimore proves they can still writ...
1    There's nothing fake about the purgatorial nar...
2    All life's disastrous lows are here on a caree...
3    With Doris, Odd Future’s Odysseus is finally b...
4    Though Giraffe is definitely Echoboy's most im...
Name: body_without_digits, dtype: object

### CountVectorizer Function

Our next step is to turn the text into a document term matrix using the scikit-learn function called `CountVectorizer`.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
sparse_dtm = countvec.fit_transform(reviews['body_without_digits'])

Great! We made a DTM! Let's look at it.

In [14]:
sparse_dtm

<5001x16139 sparse matrix of type '<class 'numpy.int64'>'
	with 124340 stored elements in Compressed Sparse Row format>

This format is called Compressed Sparse Format. It save a lot of memory to store the dtm in this format, but it is difficult to look at for a human. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas DataFrame, a format we're more familiar with. For larger datasets, you will have to use the Compressed Sparse Format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!

In [15]:
dtm = pd.DataFrame(sparse_dtm.toarray(), columns=countvec.get_feature_names(), index=reviews.index)
dtm.head()

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### What can we do with a DTM?

We can quickly identify the most frequent words

In [16]:
dtm.sum().sort_values(ascending=False).head(10)

the      7406
and      4557
of       4400
to       3175
is       2914
it       2608
that     2039
in       1775
album    1719
this     1518
dtype: int64

### Challenge

* Print out the most infrequent words rather than the most frequent words. You can look at the [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-stats) for more information.
* Print the average number of times each word is used in a review.
* Print this out sorted from highest to lowest.

In [17]:
dtm.sum().sort_values().head()

sincerest       1
glyn            1
gluttonously    1
glue            1
glows           1
dtype: int64

In [18]:
dtm.mean().sort_values(ascending=False).head()

the    1.480904
and    0.911218
of     0.879824
to     0.634873
is     0.582683
dtype: float64

### TF-IDF scores

How to find distinctive words in a corpus is a long-standing question in text analysis. Today, we'll learn one simple approach to this: TF-IDF. The idea behind words scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguising. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is `tf-idf score`. By offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

Traditionally, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_document_with_word1)

You can, and often should, normalize the numerator: 

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_document_with_word1)

We can calculate this manually, but scikit-learn has a built-in function to do so. This function also uses log frequencies, so the numbers will not correspond excactly to the calculations above. We'll use the [scikit-learn calculation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), but a challenge for you: use Pandas to calculate this manually. 

### TF-IDFVectorizer Function

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()
sparse_tfidf = tfidfvec.fit_transform(reviews['body_without_digits'])
sparse_tfidf

<5001x16139 sparse matrix of type '<class 'numpy.float64'>'
	with 124340 stored elements in Compressed Sparse Row format>

In [20]:
tfidf = pd.DataFrame(sparse_tfidf.toarray(), columns=tfidfvec.get_feature_names(), index=reviews.index)
tfidf.head()

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zone,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's look at the 20 words with highest tf-idf weights.

In [21]:
tfidf.max().sort_values(ascending=False).head(20)

brill         1.000000
perfect       1.000000
yummy         1.000000
pppperfect    1.000000
awesome       1.000000
wonderfull    1.000000
meh           1.000000
stars         1.000000
subpar        0.959257
ga            0.908259
masterful     0.898620
grower        0.888624
likable       0.867803
acirc         0.867003
great         0.864253
infectious    0.859996
blank         0.854475
thrilling     0.848810
smart         0.847852
stuff         0.834479
dtype: float64

Ok! We have successfully identified content words, without removing stop words.

### Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we add in a column of genre.

In [22]:
tfidf['genre_'] = reviews['genre']
tfidf.head()

Unnamed: 0,aa,aaaa,aahs,aaliyah,aaron,ab,abandon,abandoned,abandoning,abc,...,zones,zoo,zooey,zoomer,zu,zydeco,álbum,être,über,genre_
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Pop/Rock
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Country
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Country
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Rap
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Rock


Now lets compare the words with the highest tf-idf weight for each genre. 

In [23]:
rap = tfidf[tfidf['genre_']=='Rap']
indie = tfidf[tfidf['genre_']=='Indie']
jazz = tfidf[tfidf['genre_']=='Jazz']

rap.max(numeric_only=True).sort_values(ascending=False).head()

blank        0.854475
waste        0.755918
amiable      0.730963
awesomely    0.717079
joyless      0.687687
dtype: float64

In [24]:
indie.max(numeric_only=True).sort_values(ascending=False).head()

meh           1.0
awesome       1.0
wonderfull    1.0
perfect       1.0
yummy         1.0
dtype: float64

In [25]:
jazz.max(numeric_only=True).sort_values(ascending=False).head()

purely        0.544477
descending    0.519218
devotional    0.507724
recordings    0.499963
languid       0.487715
dtype: float64

There we go! A method of identifying distinctive words.

### Challenge

Instead of outputting the highest weighted words, output the lowest weighted words. How should we interpret these words?

In [26]:
jazz.max(numeric_only=True).sort_values().head()

aa             0.0
potent         0.0
potential      0.0
potentially    0.0
potion         0.0
dtype: float64

## Topic modeling <a id='topics'></a>

The goal of topic models can be twofold: 1/ learning something about the topics themselves, i.e what the the ext is about 2/ reduce the dimensionality of text to represent a document as a weighted average of K topics instead of a vector of token counts over the whole vocabulary. In the latter case, topic modeling a way to treat text as any data in a more tractable way for any subsequent statistical analysis (linear/logistic regression, etc). 

There are many topic modeling algorithms, but we'll use LDA. This is a standard model to use. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling$^1$. We will run this in one big chunk of code. Our challenge: use our knowledge of scikit-learn that we gained above to walk through the code to understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

First, a bit of theory. LDA is a generative model - a model over the entire data generating process - in which a document is a mixture of topics and topics are probability distributions over tokens in the vocabulary. The (normalized) frequency of word $j$ in document $i$ can be written as:
$q_{ij} = v_{i1}*\theta_{1j} + v_{i2}*\theta_{2j} + ... + v_{iK}*\theta_{Kj}$
where K is the total number of topics, $\theta_{kj}$ is the probability that word $j$ shows up in topic $k$ and $v_{ik}$ is the weight assigned to topic $k$ in document $i$. The model treats $v$ and $\theta$ as generated from Dirichlet-distributed priors and can be estimated through Maximum Likelihood or Bayesian methods.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century. 

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

$^1$ Reference: Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine
Learning Research 3, 993–1022.

In [27]:
literature_fname = os.path.join(DATA_DIR, 'childrens_lit.csv.bz2')
df_lit = pd.read_csv(literature_fname, sep='\t', encoding = 'utf-8', compression = 'bz2', index_col=0)

#drop rows where the text is missing
df_lit = df_lit.dropna(subset=['text'])
df_lit.head()

Unnamed: 0,title,author gender,year,text
0,A Dog with a Bad Name,Male,1886,A DOG WITH A BAD NAME BY TALBOT BAINES REED ...
1,A Final Reckoning,Male,1887,A Final Reckoning: A Tale of Bush Life in Aust...
2,"A House Party, Don Gesualdo, and A Rainy June",Female,1887,A HOUSE-PARTY Don Gesualdo and A Rainy June...
3,A Houseful of Girls,Female,1889,"A HOUSEFUL OF GIRLS. BY SARAH TYTLER, AUTHOR ..."
4,A Little Country Girl,Female,1885,"LITTLE COUNTRY GIRL. BY SUSAN COOLIDGE, ..."


Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

First, we have to import it from sklearn.

In [28]:
from sklearn.decomposition import LatentDirichletAllocation

In sklearn, the input to LDA is a DTM (with either counts or TF-IDF scores).

In [29]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=5000,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(df_lit['text'])

In [30]:
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=5000,
                                stop_words='english'
                                )
tf = tf_vectorizer.fit_transform(df_lit['text'])

This is where we fit the model.

In [31]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
lda = LatentDirichletAllocation(n_topics=10, max_iter=20, random_state=0)
lda = lda.fit(tf)

This is a function to print out the top words for each topic in a pretty way. Don't worry too much about understanding every line of this code.

In [32]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #{}:".format(topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [33]:
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, 20)


Topic #0:
doctor girls papa mamma sister baby street aunt london sweet project tom dr tea study presently flower darling office everybody

Topic #1:
dick doctor uncle tom jack fish em lads rope rock birds shock beneath ay gun stream garden excitedly moments fishing

Topic #2:
french troops officers army attack guns officer tom soldiers regiment british camp village ship march jack fort wounded column native

Topic #3:
project doctor church ma mary gray girls thou works soldier regiment james rode public village officer george st cousin soldiers

Topic #4:
frank james king shore lake camp village forest boats coast attack troops french native fort woods stream ship army guns

Topic #5:
project ye works ship george foundation island shore em observed youth ice deck ay agreement vessel remarked crew fish considerable

Topic #6:
er uncle jack ain den yer wolf folks lion tail gun dish sing jump aunt study seed doctor bag goodness

Topic #7:
deck frank shore ship vessel boats island cabin s

One thing we may want to do with the output is compare the prevalence of each topic across documents. A simple way to do this (but not memory efficient), is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.

In [34]:
topic_dist = lda.transform(tf)
topic_dist

array([[  8.71598187e-01,   1.99211275e-02,   6.32910384e-02, ...,
          1.25034836e-05,   3.79720913e-05,   4.40811557e-02],
       [  3.45023517e-02,   2.96151321e-02,   6.87865338e-02, ...,
          7.16863274e-01,   1.37251101e-05,   1.17770974e-01],
       [  5.42918480e-01,   3.37979473e-03,   1.36686654e-05, ...,
          1.36684699e-05,   3.64883088e-02,   1.36688476e-05],
       ..., 
       [  9.69609121e-06,   9.69620204e-06,   4.83220622e-03, ...,
          9.69645662e-06,   9.93827250e-01,   1.27267072e-03],
       [  7.58453416e-06,   7.58455582e-06,   9.97801502e-01, ...,
          7.58478411e-06,   7.58469156e-06,   7.58441317e-06],
       [  8.35441481e-06,   8.35426939e-06,   4.62466841e-01, ...,
          3.46077391e-02,   8.35442365e-06,   8.35435389e-06]])

Merge back with original dataframe

In [35]:
topic_dist_df = pd.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(df_lit)
df_w_topics

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,title,author gender,year,text
0,0.871598,0.019921,0.063291,0.000013,0.000013,0.000013,0.001021,0.000013,0.000038,0.044081,A Dog with a Bad Name,Male,1886.0,A DOG WITH A BAD NAME BY TALBOT BAINES REED ...
1,0.034502,0.029615,0.068787,0.030595,0.000014,0.000014,0.001825,0.716863,0.000014,0.117771,A Final Reckoning,Male,1887.0,A Final Reckoning: A Tale of Bush Life in Aust...
2,0.542918,0.003380,0.000014,0.417131,0.000014,0.000014,0.000014,0.000014,0.036488,0.000014,"A House Party, Don Gesualdo, and A Rainy June",Female,1887.0,A HOUSE-PARTY Don Gesualdo and A Rainy June...
3,0.984716,0.000012,0.000012,0.005849,0.000012,0.000012,0.000012,0.003454,0.005911,0.000012,A Houseful of Girls,Female,1889.0,"A HOUSEFUL OF GIRLS. BY SARAH TYTLER, AUTHOR ..."
4,0.689145,0.049493,0.000022,0.205900,0.000022,0.000022,0.000022,0.055327,0.000022,0.000022,A Little Country Girl,Female,1885.0,"LITTLE COUNTRY GIRL. BY SUSAN COOLIDGE, ..."
5,0.811108,0.086477,0.000023,0.000141,0.000023,0.000023,0.000023,0.000023,0.000023,0.102136,A Round Dozen,Female,1883.0,\n A ROUND DOZEN. [Illustration: TOINETTE AND...
6,0.430773,0.451863,0.000051,0.000051,0.000051,0.005707,0.111349,0.000051,0.000051,0.000051,A Sailor's Lass,Female,1886.0,"A SAILOR'S LASS by EMMA LESLIE, Author of ""..."
7,0.976862,0.000014,0.000014,0.000014,0.007139,0.000014,0.000014,0.000014,0.015903,0.000014,A World of Girls,Female,1886.0,A WORLD OF GIRLS: THE STORY OF A SCHOOL. By ...
8,0.000017,0.000017,0.000017,0.000017,0.000834,0.137004,0.000017,0.138041,0.000017,0.724018,Adrift in the Wild,Male,1887.0,Adrift in the Wilds; ...
9,0.000027,0.000027,0.067510,0.000027,0.000027,0.000027,0.059704,0.000027,0.000027,0.872599,Adventures in Africa,Male,1883.0,"ADVENTURES IN AFRICA, BY W.H.G. KINGSTON. C..."


Now we can chech the average weight of each topic across gender using `groupby`.

In [36]:
grouped = df_w_topics.groupby('author gender')
grouped[0].mean().sort_values(ascending=False)

author gender
Female    0.529574
Male      0.174689
Name: 0, dtype: float64

## Challenge

Modify the script above to:
* increase the number of topics
* increase the number of printed top words per topic
* fit the model to the tf-idf matrix instead of the tf one

### Further resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail 

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn

[Topic modeling with Textacy](https://github.com/repmax/topic-model/blob/master/topic-modelling.ipynb)