## Introduction to Unsupervised Approaches using scikit-learn

There are two libraries that dominate text analysis in Python. The first is NLTK, which implements a range of natural language processing techniques. You learned about this in part 2 of this series yesterday.

The other dominant library is scikit-learn, which, at its most basic, provides a function to create a memory-efficient document-term matrix. It also implements a variety of quite sophisticated machine learning techniques that you can use on your text. It's a powerful library, and one you will continually return to as you advance in text analysis (and looks great on your CV!).

Because scikit-learn is such a large and powerful library the goal today is not to become experts, but instead learn the basic functions in the library and gain an intuition about how you might use it to do text analysis. We'll give you the keys to the kingdom: you go explore! To give an overview, here are some of the things you can do using scikit-learn:
* word weighting
* feature extraction
* text classification / supervised machine learning
    * L2 regression
    * classification algorithms such as nearest neighbors, SVM, and random forest
* clustering / unsupervised machine learning
    * k-means
    * pca
    * cosine similarity
    * LDA

Today, we'll start with the Document Term Matrix (DTM). The DTM is the bread and butter of most computational text analysis techniques, both simple and more sophisticated methods. In this lesson we will use Python's scikit-learn package learn to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums? 

Finally, we will use the DTM to get an introduction to one method for uncovering patterns or themes within text: LDA, a topic modeling algorithm. Again, this will just be an introduction. Look for additional workshops in the future that will get into topic modeling in more detail.
  

### Learning Goals
* Understand the DTM and why it's important to text analysis
* Learn how to create a DTM from a .csv file
* Learn basic functionality of Python's package scikit-learn
* Understand tf-idf scores, and word scores in general
* Learn a simple way to identify distinctive words
* Implement a basic topic modeling algorithm and learn how to tweak it
* In the process, gain more familiarity and comfort with the Pandas package and manipulating data

### Outline
<ol start="0">
  <li>The Pandas Dataframe: Music Reviews</li>
  <li>Explore the Data using Pandas</li>
          -Basic descriptive statistics
  <li>Creating the DTM: scikit-learn</li>
          -CountVectorizer function
  <li>What can we do with a DTM?</li>
  <li>Tf-idf scores</li>
          -TfidfVectorizer function
  <li>Identifying Distinctive Words</li>
          -Application: Identify distinctive words by genre
  <li>Uncovering patterns using LDA</li>
</ol>

### Key Jargon
* *Document Term Matrix*:
    * a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
    *  short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
* *Topic Modeling*:
    * A statistical model to uncover abstract topics within a text. It uses the co-occurrence fo words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
    
* *LDA*:
    * Latent Dirichlet Allocation. A implementation of topic modeling that assumes a Dirichlet prior. It does not take document order into account, unlike other topic modeling algorithms.
    
### Further Resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail 

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn

[More detailed description of implementing LDA using scikit-learn](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py).

### 0. Import and summarize the data

First, we read our corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

Note: I love Pandas for data munging and basic calculations because it's so easy to use, and its data structure is really intuitive for me. It's not memory efficient however, so you might quickly need to move away from it. I recommend always always always using Pandas (or similar) over spreadsheets and Excel. [Excel is bad for science!](https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/)

In [6]:
import pandas
import numpy as np

#create a dataframe called "df"
df = pandas.read_csv("statement_test_031417.csv", sep = ',', encoding = 'utf-8')

#view the dataframe
#notice the metadata. The column "body" contains our text of interest.
df

Unnamed: 0,ID,PS1
0,1,Grandmothers are essential in the lives of the...
1,2,Most children acquire the same color or a simi...
2,2,"""His name is Brownie, Daddy!"" were the words I..."
3,4,Planting Seeds of Empathy\ The world I c...
4,5,I grew up on the 101. My mom lived on one sid...
5,6,I am who I am because of my family and its str...
6,7,"My grandfather once told me if you are kind, p..."
7,8,263 consecutive weeks. The Burrito Boyz organi...
8,9,"I was born in Hayward CA, but raised in Bay Po..."
9,10,From a young age I have been self motivated in...


In [8]:
#print the first review from the column 'body'
df['PS1'][0]

"Grandmothers are essential in the lives of their grandchildren. Taking on the role of a mother figure at any given moment is time consuming yet rewarding for any grandmother. Some grandmothers have choices of either wanting to care for their grandchildren or leaving them to be cared for by others. When my parents finalized their divorce it was my grandmother who sacrificed her lifestyle in order to take on the role of a mother figure in my life. Once living in a retirement lifestyle and socializing with friends has now turned into caring for two young rambunctious boys.\\\\While in first grade I believed I was invincible, I lived in a beautiful two story home, a star war theme bedroom that I did not have to share with my older brother, loving grandparents who often visited, and hardworking parents who provided for our family. All of a sudden, my idealistic world came to end when my parents decided to part ways. There was no inkling that mother decided to leave our happy home. At a you

### 1. Explore the Data using Pandas

Let's first look at some descriptive statistics about this dataset, to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks to make sure your data is correct, and there's no major errors. It also keeps you in touch with your data, which will help with interpretation. <3 your data!

First, what genres are in this dataset, and how many reviews in each genre?

What ID's have more than one "PS1"s can be found by counting and ranking "ID"s

In [9]:
df['ID'].value_counts()

4074    4
3283    3
3669    3
1511    3
3292    3
354     3
15      3
1205    3
1744    3
1469    3
1575    3
3653    3
2493    3
2667    3
827     3
1819    3
407     3
419     3
3794    3
1208    3
668     3
785     3
4110    3
1762    3
198     3
2707    2
2691    2
2858    2
1028    2
3077    2
       ..
3135    1
1146    1
1150    1
3255    1
1182    1
3251    1
1202    1
3247    1
1198    1
3243    1
1194    1
3239    1
1190    1
3235    1
1186    1
3231    1
3227    1
3203    1
1178    1
3223    1
1174    1
3219    1
1170    1
3215    1
1166    1
3211    1
1162    1
3207    1
1158    1
2049    1
Name: ID, dtype: int64

### 2. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the reviews. Remember, the text is stored in the 'body' column. First, a preprocessing step to remove numbers.

In [11]:
df['PS1'] = df['PS1'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

Our next step is to turn the text into a document term matrix using the scikit-learn function called CountVectorizer. There are two ways to do this. We can turn it into a sparse matrix type, which can be used within scikit-learn for further analyses.

In [13]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

sklearn_dtm = CountVectorizer().fit_transform(df.PS1)
print(sklearn_dtm)

  (0, 11925)	2
  (0, 1399)	2
  (0, 9481)	1
  (0, 13697)	11
  (0, 27654)	10
  (0, 16096)	1
  (0, 18889)	11
  (0, 27664)	4
  (0, 11910)	2
  (0, 27255)	1
  (0, 18997)	4
  (0, 23497)	2
  (0, 17817)	5
  (0, 10402)	2
  (0, 1693)	2
  (0, 1199)	3
  (0, 11623)	2
  (0, 17666)	1
  (0, 14631)	5
  (0, 27905)	1
  (0, 5820)	1
  (0, 30945)	1
  (0, 23255)	1
  (0, 10831)	6
  (0, 11924)	10
  :	:
  (4258, 28700)	1
  (4258, 21287)	1
  (4258, 25016)	1
  (4258, 30017)	1
  (4258, 28255)	1
  (4258, 288)	1
  (4258, 25879)	1
  (4258, 9676)	1
  (4258, 16918)	1
  (4258, 327)	1
  (4258, 23169)	1
  (4258, 28407)	1
  (4258, 16643)	1
  (4258, 15189)	3
  (4258, 28720)	2
  (4258, 7454)	1
  (4258, 716)	2
  (4258, 30962)	1
  (4258, 20606)	1
  (4258, 21467)	1
  (4258, 17039)	2
  (4258, 19927)	1
  (4258, 14811)	2
  (4258, 8964)	1
  (4258, 4364)	1


This format is called Compressed Sparse Format. It save a lot of memory to store the dtm in this format, but it is difficult to look at for a human. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas dataframe, a format we're more familiar with. For larger datasets, you will have to use the Compressed Sparse Format. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!

In [14]:
#we do the same as we did above, but covert it into a Pandas dataframe. Note this takes quite a bit more memory, so will not be good for bigger data.
dtm_df = pandas.DataFrame(countvec.fit_transform(df.PS1).toarray(), columns=countvec.get_feature_names(), index = df.index)

#view the dtm dataframe
dtm_df

Unnamed: 0,____,aa,aabo,aacit,aahs,aaple,aaron,aaronic,aau,ab,...,zoologist,zoology,zoom,zoomed,zooming,zoos,zovi,zucchini,zuma,zydeco
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 2, where we found the most frequent words using NLTK).

In [15]:
print(dtm_df.sum().sort_values(ascending=False))

to              83706
the             80009
my              74357
and             67083
of              45204
in              38775
that            30035
me              28739
was             25497
for             19539
with            17402
it              16963
as              16500
have            15698
is              14197
be              11868
on              11128
from            10686
not             10557
this             9669
but              9453
life             9433
had              9219
school           9111
at               9099
family           8335
they             7792
has              7680
an               7201
when             7139
                ...  
lessens             1
leche               1
lesions             1
leporidae           1
leotard             1
leone               1
lentil              1
leningrad           1
leniency            1
lengthier           1
lends               1
lemongrass          1
leland              1
lejos               1
lehnert   

In [None]:
#####Exercise:
###Print out the most infrequent words rather than the most frequent words.
##Gold star challenge: print the average number of times each word is used in a review
print(dtm_df.mean().sort_values(ascending=False))
#Print this out sorted from highest to lowest.

What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean  geometry). But, what do we lose when we reprsent text in this format?

Today, we will use variations on the DTM to find distinctive words in this dataset, and then do some preliminary work discovering themes in text.

### 4. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw a few ways to this yesterday, using natural language processing. Today, we'll learn one simple approach to this: word scores. The idea behind words scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguising. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. By offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_document_with_word1)

You can, and often should, normalize the numerator: 

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_document_with_word1)

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [16]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

#create the dtm, but with cells weigthed by the tf-idf score.
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.PS1).toarray(), columns=tfidfvec.get_feature_names(), index = df.index)

#view results
dtm_tfidf_df

Unnamed: 0,____,aa,aabo,aacit,aahs,aaple,aaron,aaronic,aau,ab,...,zoologist,zoology,zoom,zoomed,zooming,zoos,zovi,zucchini,zuma,zydeco
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's look at the 20 words with highest tf-idf weights.

In [17]:
print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])

pei          0.874034
dubb         0.821924
ken          0.807414
dori         0.784735
ozzie        0.779671
charlee      0.776855
oakwood      0.776572
troy         0.775049
creek        0.755369
mixteco      0.740840
bhangra      0.732388
cirque       0.727046
lakorns      0.716875
golf         0.715907
om           0.715895
tamil        0.712871
ojiichan     0.704561
starcraft    0.697635
kris         0.693769
richmond     0.689032
dtype: float64


Ok! We have successfully identified content words, without removing stop words. What else do you notice about this list?

### 5. Identifying Distinctive Words

What can we do with this? These scores are best used when you want to identify distinctive words for individual documents, or groups of documents, compared to other groups or the corpus as a whole. To illustrate this, let's compare three genres and identify the most distinctive words by genre.

First we merge the genre of the document into our dtm weighted by tf-idf scores, and then compare genres.

In [None]:
#creat dataset with document index and genre
df_genre = df['genre'].to_frame()
print(df_genre)

In [None]:
#merge this into the dtm_tfidf_df
merged_df = df_genre.join(dtm_tfidf_df, how = 'right', lsuffix='_x')

#view result
merged_df

Now lets compare the words with the highest tf-idf weight for each genre. 

Note: there are other ways to do this. Challenge: what is a different approach to identifying rows from a certain genre in our dtm?

In [None]:
#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=="Rap"]
dtm_indie = merged_df[merged_df['genre_x']=="Alternative/Indie Rock"]
dtm_jazz = merged_df[merged_df['genre_x']=="Jazz"]

#print the words with the highest tf-idf scores for each genre
print("Rap Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Indie Words")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Jazz Words")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

There we go! A method of identifying distinctive words. You notice there are some proper nouns in there. How might we remove those if we're not interested in them?

Tf-idf scores are just one way to identify distinctive or discriminating words. See Monroe, Colaresi, and Quinn (2009) for more ideas for finding distinctive words. (Warning: this paper is a bit outdated. No one has taken up their recommendation to use a Dirichlet prior).

In [None]:
#####Exercise:
###Copy and paste the code above to this cell, and change the genres for a different comparison.
###Instead of outputting the highest weighted words, output the lowest weighted words. 
##How should we interpret these words?

#pull out the reviews for three genres, Rap, Alternative/Indie Rock, and Jazz
dtm_rap = merged_df[merged_df['genre_x']=="Country"]
dtm_indie = merged_df[merged_df['genre_x']=="Pop/Rock"]
dtm_jazz = merged_df[merged_df['genre_x']=="Electronic"]

#print the words with the highest tf-idf scores for each genre
print("Coutry Words")
print(dtm_rap.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Pop/Rock")
print(dtm_indie.max(numeric_only=True).sort_values(ascending=False)[0:20])
print()
print("Electronic")
print(dtm_jazz.max(numeric_only=True).sort_values(ascending=False)[0:20])

### 6. Uncovering Patterns: LDA

Frequency counts and tf-idf scores are done at the word level. There are other methods of exporatory or unsupervised analysis on the document level and by examining the co-occurrence of words within documents. Scikit-learn allows for many of these methods, including:

* document clustering
* document or word similarities using cosine similarity
* pca
* topic modeling

We'll run through an example of topic modeling here. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling. We will run this in one big chunk of code. Our challenge: use our knowledge of scikit-learn that we gained aboe to walk through the code to understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century. 

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [19]:
df_lit = pandas.read_csv("statement_test_031417.csv", sep = ',', encoding = 'utf-8')

#drop rows where the text is missing. I think there's only one row where it's missing, but check me on that.
df_lit = df_lit.dropna(subset=['PS1'])

#view the dataframe
df_lit

Unnamed: 0,ID,PS1
0,1,Grandmothers are essential in the lives of the...
1,2,Most children acquire the same color or a simi...
2,2,"""His name is Brownie, Daddy!"" were the words I..."
3,4,Planting Seeds of Empathy\ The world I c...
4,5,I grew up on the 101. My mom lived on one sid...
5,6,I am who I am because of my family and its str...
6,7,"My grandfather once told me if you are kind, p..."
7,8,263 consecutive weeks. The Burrito Boyz organi...
8,9,"I was born in Hayward CA, but raised in Bay Po..."
9,10,From a young age I have been self motivated in...


Now we're ready to fit the model. This requires the use of CountVecorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

In [21]:
####Adopted From: 
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = 5
n_top_words = 50

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Use tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=None,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_lit.PS1)

# Use tf (raw term count) features
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_lit.PS1)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

#define the lda function, with desired options
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=20,
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda.fit(tf)

#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=2000 and n_topics=5...

Topics in LDA model:

Topic #0:
school high year life team time people new class students work friends learned community years like grade student sports club did college classes just helped world make felt different experience skills able best music taught started program way hard help activities playing person began know leadership things wanted junior teachers

Topic #1:
life time like mom family just mother day dad home years did didn know parents father want people way felt things school brother world house make going ve knew sister wanted friends little person year love got help thought away feel old told wasn don new came started come room

Topic #2:
world science environmental animals environment nature water school love passion life new biology time learn learning like research animal learned natural knowledge california ocean began art class earth summer people work future 

In [None]:
####Exercise:
###Run the same code as above but change some of the parameters. How does this change the output.
###Suggestions:
## 0. Use tf-idf scores rather than raw counts. (hint: look for the variable name we created) 
## 1. Change the number of topics. What do you find?
## 2. Do not remove stop words. How does this change the output?

One thing we may want to do with the output is find the most representative texts for each topic. A simple way to do this (but not memory efficient), is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.

In [22]:
topic_dist = lda.transform(tf)
topic_dist

array([[   0.20512113,  122.90993301,    0.20403351,   21.80433052,
          25.87658183],
       [  17.40275584,   18.47377427,    0.20636533,  157.09147739,
          12.82562717],
       [   0.20364263,   96.41514875,   71.97346067,    0.20357151,
           0.20417644],
       ..., 
       [  44.10964325,    0.20427113,   61.99135408,    7.48961113,
           0.20512042],
       [  85.3714285 ,   77.01374061,    0.20516536,    0.20454836,
           0.20511717],
       [  85.22210455,   81.32842592,    0.20353997,   52.04042252,
           0.20550704]])

Merge back in with the original dataframe.

In [23]:
topic_dist_df = pandas.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(df_lit)
df_w_topics

Unnamed: 0,0,1,2,3,4,ID,PS1
0,0.205121,122.909933,0.204034,21.804331,25.876582,1,Grandmothers are essential in the lives of the...
1,17.402756,18.473774,0.206365,157.091477,12.825627,2,Most children acquire the same color or a simi...
2,0.203643,96.415149,71.973461,0.203572,0.204176,2,"""His name is Brownie, Daddy!"" were the words I..."
3,0.206381,30.567963,0.204870,25.359457,104.661328,4,Planting Seeds of Empathy\ The world I c...
4,0.205534,63.993204,12.761344,34.629261,22.410656,5,I grew up on the 101. My mom lived on one sid...
5,33.431351,0.204540,0.205412,0.206780,150.951916,6,I am who I am because of my family and its str...
6,34.565386,33.874135,0.205119,12.357412,62.997947,7,"My grandfather once told me if you are kind, p..."
7,0.205884,29.706457,18.917972,18.382129,75.787558,8,263 consecutive weeks. The Burrito Boyz organi...
8,0.204154,240.224709,0.205718,24.161999,0.203420,9,"I was born in Hayward CA, but raised in Bay Po..."
9,0.204346,52.488010,0.203373,118.900292,0.203979,10,From a young age I have been self motivated in...


Now we can sort the dataframe for the topic of interest, and view the top documents for the topics.
Below we sort the documents first by Topic 0 (looking at the top words for this topic I think it's about family, health, and domestic activities), and next by Topic 1 (again looking at the top words I think this topic is about children playing outside in nature). These topics may be a family/nature split?

Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?

In [25]:
print(df_w_topics[['ID', 'PS1', 0]].sort_values(by=[0], ascending=False))

        ID                                                PS1           0
1980  1981  I remember first arriving in America at the ag...  243.032962
2112  2112  Covered in face paint and wearing all black, I...  217.655483
1319  1320  My Story\The best way to understand me is to s...  214.154889
4027  4027  My life before joining Associated Student Body...  208.181786
2103  2103  "Band, ten, hut!"\\Everyone snaps to attention...  207.182583
3950  3951  Since the begging of high school, I have been ...  207.034423
2104  2105  As I feel the rough surface of the block under...  206.813344
756    757  "How are you doing in school?", is any adult's...  205.633428
490    491  Changing My Path\\       Although my parents d...  201.940491
1645  1646  Being the youngest of a family of four and the...  199.780045
3334  3335  Growing up in Catholic school has been challen...  199.178435
3849  3850  In school, students are challenged to find the...  198.544826
2378  2379  I moved from Waterloo, Iow

We can read individual essays in full using the code below.  Change the number in the final set of brackets to point to a spesific serial number (ID-1).

In [27]:
df['PS1'][1980]

"I remember first arriving in America at the age of  in the year . By august, I began preschool and for couple days it was hard to leave my mom's side because until now I have never truly left my mom's side, but slowly I began to adapt a form of independence.  Within a year I finally began kindergarten. Within this year, the Teacher began teaching me Basic English beginning with the ABC song, count to  orally, learning to write my English name, how to hold a materials such as scissors and pencils, and how to hold a book properly from looking at words left to right. This was the first step in leading me into the society. As I moved up in my grades in elementary school, I began building up more knowledge involving math and English and at the same time these experiences began to help me create my pre-stage dream. After elementary school, I moved up to middle school. These  years of middle school has broadened my knowledge with math, science, English, and new classes involving electives an

In [28]:
print(df_w_topics[['ID', 'PS1', 1]].sort_values(by=[1], ascending=False))

        ID                                                PS1           1
2131  2132  In a world of fairy tales and magic people don...  321.185614
3327  3328  This may be one of the hardest papers I have e...  262.176514
4228  4229  To say that I'm nervous about this would be an...  259.185398
8        9  I was born in Hayward CA, but raised in Bay Po...  240.224709
1902  1903  Relying on the support of my inconsistent fath...  237.174991
1144  1145  UC Personal Statement\ Growing up in Middle Sc...  232.189433
2639  2640  "Con todo mi amor," my dad uttered as he tucke...  223.179046
1286  1287  It was dawn, I remember walking through the co...  222.179610
2690  2691  It is eleven o'clock at night. My grandmother ...  219.948404
3442  3443  I have always been a pretty positive person. I...  217.172267
1996  1997  I was born in Ottawa, Ontario, Canada-the frig...  214.863029
3614  3615  My story started in one of the worst areas of ...  211.835709
3588  3589  ?Throughout my life my gra

In [29]:
df['PS1'][2131]

'In a world of fairy tales and magic people don\'t exist, yet in a world of people fairy tales may become a reality. I never knew what high school is going to be like except nothing compared to the movies. Every kid dreams of growing up going to a good high school and college especially when it comes to building an amazing future for themselves. I was one of those lucky kids the help and support of my parents and family I got the grades that I needed to pass and get into Westview high school one of the best high schools. Where I lived had required me to go to a different high school but my grades got my inter-district transfer the seal of approval. As time went on summer departed and came the beginning of a fresh new start in my life one that I was going to make the best and most memorable next four years of my life.Time simply had passed and I made so many new friends just by talking to the people around me oh and the principal had cracked a few jokes make us laugh. I went to all my c

In [31]:
print(df_w_topics[['ID', 'PS1', 2]].sort_values(by=[2], ascending=False))

        ID                                                PS1           2
3515  3516  Can one individual have an effect on the envir...  226.925105
2262  2263  While visiting Zambia on a cultural exchange p...  223.798898
1582  1583  I have always been fascinated by how and why o...  200.178204
3177  3178  Growing up in beautiful San Diego led me to be...  196.034895
93      94  A baby sits in a floatie in the pool, content ...  195.622021
4189  4189  No. I'm not a tree hugger. I teach students ab...  193.177338
2507  2507  When I was between the ages of 4 and 10, I had...  181.644078
1943  1944  One moment, I remember myself screaming at the...  180.179243
2343  2344  My curiosity has been the biggest influence in...  179.430716
2753  2754  When I was seven years old, I visited China fo...  174.279485
2652  2653  In seventh grade, I remember my Advanced Art t...  173.689490
4074  4074  Until I entered eighth grade, I always thought...  172.902520
475    476  My choice to pursue enviro

In [33]:
df['PS1'][3515]

'Can one individual have an effect on the environment? I previously found it simple to look at the world around me with rose-colored glasses, just taking what I needed without regard.  \\\\Through contact with my surroundings, I became concerned for the safety and conservation of our natural environment as well as the suppression of the many factors that destroy it.  I have witnessed the fragile relationship between man and nature and the toxicity that threatens both the health of the environment and the impact on our population. \\\\Los Angeles, with its population of  million, has many appealing features such near-perfect weather.  However, LA is also infamous for being number one as America\'s worst smog-polluted city. I consider myself fortunate that my school and the city of Santa Monica opened my eyes, allowing me to help face the problems of man-made environmental destruction. \\\\My "immersion" in the environment began with my interest and certification as a SCUBA diver.  At ag

In [34]:
print(df_w_topics[['ID', 'PS1', 3]].sort_values(by=[3], ascending=False))

        ID                                                PS1           3
2645  2646  My parents migrated from Mexico to the United ...  264.126279
1603  1604  ?The world I come from has had a huge influenc...  261.073138
144    144  "Education should be your main focus right now...  241.179236
3750  3751  Growing up as the oldest child, of young paren...  240.114565
872    872  Two and a half years ago, I moved from the Phi...  223.458023
173    174  "In every conceivable manner the family is the...  221.810996
3668  3669  I am a strong believer that it is not about ho...  216.995420
3126  3127  My family came from a very poverty stricken pa...  216.179580
3148  3149  My parents strived to give my siblings and me ...  215.585856
1451  1452  The cultural background I come from doesn't pr...  215.032654
3940  3941  I am an individual who identifies a less notor...  212.558472
1428  1429  Throughout my life I have faced many adversiti...  205.752150
2532  2533  My family has shaped my dr

In [35]:
df['PS1'][2645]

'My parents migrated from Mexico to the United States in order to provide me with a better education and a more stable household. As a young child, my parents always emphasized the importance of pursuing a higher education because it would provide me with a stable career. Since neither of my parents had the resources to receive an education in Mexico, they took advantage of the educational opportunities in the U.S. for myself and my younger sibling. However, living in poverty became the biggest struggle for my parents, and at times they were not always able to put food on our table. I saw my parents suffer from obtaining minimum-wage jobs that always put them at risk of becoming jobless. At the age of fifteen, living in East Oakland where crime, poverty, and the loss of hopes are considered "norms", I decided to help my family by applying for a part-time job in order to support my family financially. After working in the food industry and being a high school student, I realized that my

In [37]:
print(df_w_topics[['ID', 'PS1', 4]].sort_values(by=[4], ascending=False))

        ID                                                PS1           4
811    812  Anxiety rose as I gingerly passed the San Dieg...  220.105901
3368  3369  From a young age, I was exposed to different c...  208.094314
246    247  Oakland, a big city known for its notorious vi...  192.179092
363    364  At a very young age, I began to build my own r...  186.174948
3263  3264  I did not understand their words, but I could ...  180.341683
584    584  It's always the same: boring and plain white w...  174.261216
3507  3508  After years of not being able to say "I am Nig...  173.805058
3210  3211  Everybody possesses an inherent desire to comp...  173.700609
275    276  My journey began in 2008 in a street named Val...  165.585899
3060  3061  The world I come from in my eyes can only be d...  165.130156
1508  1509  Throughout my childhood, I have always had the...  163.964575
2486  2487  A rock drops into a lake and causes a ripple e...  162.796032
1298  1298  I live in a world of duali

In [38]:
df['PS1'][811]

'Anxiety rose as I gingerly passed the San Diego Airport stores, looking for the boarding gate for Baltimore, Maryland. Not only was I scared of traveling alone as a minor for the first time, I grew increasingly anxious of the idea of a -day Leadership Summit for Medicine and Healthcare in Johns Hopkins University with total strangers. At the summit, what first started as fear transformed into determination to find the field that captures my passion. \\\\My heart wrenches when I see homeless people lugging a cart of recyclables and clothing, a small oxygen tank with tubes attached to their bodies, or cardboard with black marker asking for help, as I drive by some streets in San Diego. The homeless reminds me of the increasing number of veterans suffering from multifaceted medical conditions as a result of deployment to war zones. They struggle from physical, psychological, and psychosocial issues concerning themselves, work, and family; they are prone to fall into the cracks of the sys

What other patterns might we find with topic modeling? Toward what end?

In [None]:
###Ex (gold star exercise!): 
#       Find the most prevalent topic in the corpus.
#       Find the least prevalent topic in the corpus. 
#       Find the most prevalent topic by the gender of the author.
#       Hint: How do we define prevalence? What are different ways of measuring this,
#              and the benefits/drawbacks of each.


#       Extra bonus gold star exercise:
#          This topic model provide the topic distribtution for 127 rows, but there are 131 rows in the full data.
#          What is going on here? (I don't have an answer to this. I hope someone can figure it out!)           