# Analysing Admissions Essays: Unsupervised Approaches using scikit-learn

There are two libraries that dominate text analysis in Python. The first is NLTK, which implements a range of natural language processing techniques (see other notebook).

The other dominant library is scikit-learn, which, at its most basic, provides a function to create a memory-efficient document-term matrix. It also implements a variety of quite sophisticated machine learning techniques that you can use on your text. It's a powerful library well suited for many purposes.

Some of the approaches we will use below for our purposes include:
* word weighting
* feature extraction
* text classification / supervised machine learning
    * L2 regression
    * classification algorithms such as nearest neighbors, SVM, and random forest
* clustering / unsupervised machine learning
    * k-means
    * PCA
    * cosine similarity
    * LDA

Today, we'll start with the Document Term Matrix (DTM). The DTM is the bread and butter of most computational text analysis techniques, both simple and more sophisticated methods. In this lesson we will use Python's scikit-learn package to make a document term matrix from a .csv Music Reviews dataset (collected from MetaCritic.com). We will then use the DTM and a word weighting technique called tf-idf (term frequency inverse document frequency) to identify important and discriminating words within this dataset (utilizing the Pandas package). The illustrating question: what words distinguish reviews of Rap albums, Indie Rock albums, and Jazz albums? 

Finally, we will use the DTM to get an introduction to one method for uncovering patterns or themes within text: LDA, a topic modeling algorithm. Again, this will just be an introduction. Look for additional workshops in the future that will get into topic modeling in more detail.


### Outline
1. Import and view the data using Pandas
1. Explore the Data using Pandas
  1. Basic descriptive statistics
1. Creating the DTM: scikit-learn
  1. CountVectorizer function
1. What can we do with a DTM?
1. Tf-idf scores
  1. TfidfVectorizer function
1. Identifying Distinctive Words
  1. Application: Identify distinctive words by genre
1. Uncovering patterns using LDA

### Key Jargon
* *Document Term Matrix*:
  * a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
* *TF-IDF Scores*: 
  *  short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
* *Topic Modeling*:
  * A statistical model to uncover abstract topics within a text. It uses the co-occurrence of words within documents, compared to their distribution across documents, to uncover these abstract themes. The output is a list of weighted words, which indicate the subject of each topic, and a weight distribution across topics for each document.
* *LDA*:
  * Latent Dirichlet Allocation. An implementation of topic modeling that assumes a Dirichlet prior. It does not take document order into account, unlike other topic modeling algorithms.
    
### Further Resources

[This blog post](https://de.dariah.eu/tatom/feature_selection.html) goes through finding distinctive words using Python in more detail 

Paper: [Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), Burt Monroe, Michael Colaresi, Kevin Quinn

[More detailed description of implementing LDA using scikit-learn](http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-topics-extraction-with-nmf-lda-py).

## 1. Import and view the data using Pandas

First, we read our corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

Note: Pandas is great for data munging and basic calculations because it's so easy to use, and its data structure is really intuitive for me. It's not memory efficient however, so you might quickly need to move away from it. I recommend always always always using Pandas (or similar) over spreadsheets and Excel. [Excel is bad for science!](https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/)

### First we'll import, format, and view the data

This code brings in the packages we'll need (Pandas and NumPy) before reading the data into a Pandas dataframe and inspecting that dataframe.

In [2]:
import pandas
import numpy

#create a dataframe called "df"
df = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS1_F16.csv", sep = ',', encoding = 'utf_8')

#view the dataframe
#notice the metadata. The column "Personal Statement 1 (RETIRED)" contains our text of interest.
df

Unnamed: 0,"﻿""ApplyUC Application CPID""",College,Personal Statement 1 (RETIRED)
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...


Next we will rename the column headers to make things easier going forward. 

In [3]:
# Rename the columns of the Pandas dataframe so they are easier to work with
df.columns = ['CPID', 'College', 'PS1']

# inspect the dataframe again
df

Unnamed: 0,CPID,College,PS1
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...


It is important to review the data contained in the new dataframe we created. This code looks at the first essay in full.

In [4]:
#print the first essay from the column 'PS1'
df['PS1'][0]

'Bojio!\\\\That was what I playfully typed on my family\'s Whatsapp group chat after my older brother posted a picture of his and my sister in law\'s Bali resort. It was an expression that travelled from my mind to my flitting fingertips almost immediately. The resort was simply the image of serenity and solitude- and which student going through examination stress would not want to be a part of that?\\\\It was only when I got home and slumped on the sofa that I saw the nervous look on my mother\'s face. Her kohl-rimmed eyes were wide and her vermillion adorned forehead scrunched up as she asked, utterly confused, "What\'s bojio?"\\\\I burst out laughing. Sometimes, I forgot how every day brought around a new culture shock when you lived in a traditional Indian family but grew up in a multiracial community. The Hokkien phrase "bojio", literally meaning "never invite", is a popular colloquialism in Singapore to teasingly express annoyance at not being invited to something. My brother, wh

## 2. Explore the Data using Pandas

Let's first evaluate the general nature of the data to see if the IDs are unique, if there is any missing data, etc.  
We can also look at some descriptive statistics about this dataset to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. Love your data!

### Are the IDs Unique?

We can find any ID that has more than one "PS1" by counting how many times each ID appears and ranking the counts.

In [5]:
# Count how often each CPID appears. If every count is 1, there are no duplicate IDs.
print(df['CPID'].value_counts())

# List any CPID that appears more than once. An empty list means there are no duplicates.
# (Index.get_duplicates() was removed from pandas, so we use duplicated() instead.)
print()
print("Array containing duplicate CPIDs:")
print(df.loc[df['CPID'].duplicated(), 'CPID'].unique().tolist())

3016703    1
3002500    1
3146453    1
3152598    1
3019479    1
3042008    1
3039961    1
3018728    1
3035871    1
3123936    1
3125987    1
3115748    1
3030135    1
3117799    1
3009256    1
3013354    1
3130093    1
3136238    1
3003119    1
3027667    1
3029714    1
3068623    1
3056321    1
3084983    1
3107512    1
3105465    1
3109563    1
3103422    1
3101375    1
3058368    1
          ..
3044772    1
3014037    1
3040678    1
3042727    1
3151272    1
3153321    1
3012128    1
3132004    1
3028396    1
3007894    1
3132819    1
3048826    1
3132755    1
3134800    1
3061116    1
3059071    1
3136223    1
3100035    1
3110276    1
3021600    1
3085704    1
3130770    1
3081610    1
3144558    1
3093900    1
3095949    1
3089806    1
3091855    1
3132125    1
3116651    1
Name: CPID, dtype: int64

Array containing duplicate CPIDs:
[]
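
To see what a positive result from this check looks like, here is a minimal sketch on a toy dataframe. The IDs and texts are made up for illustration and are not from the essay data.

```python
import pandas

# Toy data (hypothetical IDs, not from the essay dataset)
toy = pandas.DataFrame({
    "CPID": [101, 102, 102, 103],
    "PS1": ["first essay", "second essay", "second essay again", "third essay"],
})

# keep=False marks every row that shares its CPID with another row,
# so we can inspect all copies side by side
dupe_rows = toy[toy["CPID"].duplicated(keep=False)]
print(dupe_rows)

# is_unique is a quick yes/no check on the whole column
print(toy["CPID"].is_unique)
```

Looking at the full duplicate rows, rather than just the IDs, makes it easier to decide which copy to keep.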


### Are there any missing Essays?

In [17]:
# This creates a variable 'empties' holding the row positions where 'PS1' is missing
empties = numpy.where(pandas.isnull(df['PS1']))[0]

# Notice that the result is a NumPy array, not a list. The next operation, list(), gets it into the right format.
print(empties)

empties = list(empties)

print(empties)

# len() gives the number of missing essays
print(len(empties))

df.iloc[empties]

[ 1776  3206  3566  6285  6801  7530  7930  8111  8571 11796 12977 15694
 19073 23667 24682 26014 28080 28573 29154 29548 31818 40212 41898 44980
 53738 64612 73519 74177 79276 81423]
[1776, 3206, 3566, 6285, 6801, 7530, 7930, 8111, 8571, 11796, 12977, 15694, 19073, 23667, 24682, 26014, 28080, 28573, 29154, 29548, 31818, 40212, 41898, 44980, 53738, 64612, 73519, 74177, 79276, 81423]
30


Unnamed: 0,CPID,College,PS1
1776,3133634,College of Engineering,
3206,3157746,College of Letters and Science,
3566,3041638,College of Letters and Science,
6285,3108688,College of Natural Resources,
6801,3052354,College of Letters and Science,
7530,3055366,College of Letters and Science,
7930,3056046,College of Engineering,
8111,3001798,College of Letters and Science,
8571,3145221,College of Natural Resources,
11796,3092946,College of Letters and Science,


Drop the rows with missing essays, using the positions stored in `empties`.

In [12]:
df_no_missing = df.drop(df.index[empties])


# Alternative: df_no_missing = df.dropna(subset=['PS1'])
df_no_missing

Unnamed: 0,CPID,College,PS1
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...


## 3. Creating a smaller sample to check the code

In this section we'll create a smaller sample of the data to check that the analysis we construct below works.

In [89]:
# sample the first 500 essays.
#df_sample = df[:500]
#df_sample.dtypes

# this generates a random sample of 5000 essays, with a random state of 0 for reproducibility.
df_sample = df.sample(n=5000, random_state=0)

df_sample

Unnamed: 0,CPID,College,PS1
22419,3041784,College of Letters and Science,"""Pascale, vete a lavar las manos antes de come..."
70459,3080139,College of Letters and Science,"La fiesta de los quince anos. Traditionally, (..."
34948,3066178,College of Letters and Science,BOOM. The crashing sound turned my worst night...
15704,3124866,College of Letters and Science,As human beings we are filled with insecuritie...
13744,3056166,College of Natural Resources,Environment it is quite a major thing in peopl...
5921,3088397,College of Engineering,"Growing up, I was always interested in how eve..."
23811,3142668,College of Natural Resources,I used to think that my life was difficult to ...
21420,3143725,College of Letters and Science,"In the game of life, love is the key to happin..."
4391,3053134,College of Letters and Science,Grading worksheets and answering questions for...
22449,3110433,College of Letters and Science,As the younger sibling of two disabled brother...


In [90]:
df_sample

Unnamed: 0,CPID,College,PS1
22419,3041784,College of Letters and Science,"""Pascale, vete a lavar las manos antes de come..."
70459,3080139,College of Letters and Science,"La fiesta de los quince anos. Traditionally, (..."
34948,3066178,College of Letters and Science,BOOM. The crashing sound turned my worst night...
15704,3124866,College of Letters and Science,As human beings we are filled with insecuritie...
13744,3056166,College of Natural Resources,Environment it is quite a major thing in peopl...
5921,3088397,College of Engineering,"Growing up, I was always interested in how eve..."
23811,3142668,College of Natural Resources,I used to think that my life was difficult to ...
21420,3143725,College of Letters and Science,"In the game of life, love is the key to happin..."
4391,3053134,College of Letters and Science,Grading worksheets and answering questions for...
22449,3110433,College of Letters and Science,As the younger sibling of two disabled brother...


In [91]:
empties_sample = numpy.where(pandas.isnull(df_sample['PS1']))

# numpy.where returns a tuple of arrays (one per dimension), so len() here gives the tuple length (1), not the number of missing essays.
print(len(empties_sample))

print(empties_sample)

1
(array([2302]),)
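
The output above is a tuple wrapping a single array, which is why `len()` can be misleading. A small sketch on a made-up Series shows the difference, along with a shorter way to count missing values:

```python
import numpy
import pandas

# Toy Series with two missing entries (illustrative data only)
toy = pandas.Series(["essay a", None, "essay b", None])

# numpy.where returns a tuple with one array per dimension
where_result = numpy.where(pandas.isnull(toy))
print(type(where_result), len(where_result))

# Index into the tuple to get the array of row positions
missing_positions = where_result[0]
print(missing_positions)

# Counting missing values directly avoids the tuple altogether
n_missing = int(toy.isnull().sum())
print(n_missing)
```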


In [92]:
df_sample = df_sample.drop(df_sample.index[[2302]])
df_sample

Unnamed: 0,CPID,College,PS1
22419,3041784,College of Letters and Science,"""Pascale, vete a lavar las manos antes de come..."
70459,3080139,College of Letters and Science,"La fiesta de los quince anos. Traditionally, (..."
34948,3066178,College of Letters and Science,BOOM. The crashing sound turned my worst night...
15704,3124866,College of Letters and Science,As human beings we are filled with insecuritie...
13744,3056166,College of Natural Resources,Environment it is quite a major thing in peopl...
5921,3088397,College of Engineering,"Growing up, I was always interested in how eve..."
23811,3142668,College of Natural Resources,I used to think that my life was difficult to ...
21420,3143725,College of Letters and Science,"In the game of life, love is the key to happin..."
4391,3053134,College of Letters and Science,Grading worksheets and answering questions for...
22449,3110433,College of Letters and Science,As the younger sibling of two disabled brother...


## 4. Creating the DTM: scikit-learn

Ok, that's the summary of the metadata. Next, we turn to analyzing the text of the essays. Remember, the text is stored in the 'PS1' column. First, a preprocessing step to remove numbers.

In [13]:
# This gets rid of numbers.
# The lambda fails on rows with a missing essay (NaN is a float, not a string),
# so we apply it to the dataframe with the missing rows dropped.
df_no_missing['PS1'] = df_no_missing['PS1'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))

# If this generates an error, consider uncommenting the code below to see the data types:
#df.dtypes
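
As an alternative sketch of the same preprocessing step, pandas string methods can strip digits with a regular expression. The toy dataframe here is made up; the point is that `.str` methods pass missing values through instead of raising an error.

```python
import pandas

# Toy data with one missing essay (illustrative only)
toy = pandas.DataFrame({"PS1": ["I was 17 in 2016", None, "No digits here"]})

# \d matches any digit; regex=True is needed for pattern matching
toy["PS1"] = toy["PS1"].str.replace(r"\d", "", regex=True)
print(toy)
```

Note that removing digits leaves the surrounding spaces in place, which does not matter for the word counts computed later.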

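Our next step is to turn the text into a document term matrix using the scikit-learn function called CountVectorizer. There are two ways to do this. First, we can turn it into a sparse matrix type, which can be used within scikit-learn for further analyses. Second, as we do further below, we can convert that sparse matrix into a Pandas dataframe, which is easier to inspect.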

In [19]:
#import the function CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()

# Originally: sklearn_dtm = CountVectorizer().fit_transform(df.PS1), which fails when values are missing.
# .values.astype('U') converts every essay to a unicode string first, which fixed the CountVectorizer issues.
sklearn_dtm = countvec.fit_transform(df_no_missing['PS1'].values.astype('U'))

print(sklearn_dtm)

  (0, 11094)	4
  (0, 97131)	8
  (0, 105906)	9
  (0, 106662)	3
  (0, 74218)	1
  (0, 100732)	1
  (0, 68485)	6
  (0, 64431)	26
  (0, 33421)	3
  (0, 106671)	1
  (0, 40311)	1
  (0, 16191)	1
  (0, 1748)	1
  (0, 68288)	1
  (0, 12519)	2
  (0, 75284)	1
  (0, 73479)	1
  (0, 68016)	12
  (0, 43510)	1
  (0, 3728)	19
  (0, 88895)	1
  (0, 46556)	10
  (0, 54708)	1
  (0, 7572)	1
  (0, 81462)	2
  :	:
  (82543, 86545)	1
  (82543, 89256)	1
  (82543, 76860)	1
  (82543, 63565)	1
  (82543, 964)	1
  (82543, 24571)	2
  (82543, 58573)	1
  (82543, 76943)	1
  (82543, 88310)	1
  (82543, 16213)	1
  (82543, 7655)	1
  (82543, 42349)	2
  (82543, 95964)	1
  (82543, 56822)	1
  (82543, 18485)	1
  (82543, 63473)	1
  (82543, 17933)	1
  (82543, 82424)	1
  (82543, 30350)	1
  (82543, 36674)	1
  (82543, 22704)	1
  (82543, 95709)	1
  (82543, 22331)	1
  (82543, 46333)	1
  (82543, 67849)	1


This is a compressed sparse format (a SciPy compressed sparse row matrix): only the non-zero cells are stored, printed as (document, term) positions followed by the count. It saves a lot of memory to store the DTM in this format, but it is difficult for a human to read. To illustrate the techniques in this lesson we will first convert this matrix back to a Pandas dataframe, a format we're more familiar with. For larger datasets, you will have to work with the sparse format directly. Putting it into a DataFrame, however, will enable us to get more comfortable with Pandas!
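
To make the sparse versus dense distinction concrete, here is a small sketch on three made-up sentences. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny documents (illustrative only)
docs = ["the cat sat", "the dog sat down", "a bird flew"]

vec = CountVectorizer()
sparse_dtm = vec.fit_transform(docs)

# shape is (number of documents, number of distinct terms)
print(sparse_dtm.shape)

# nnz is the number of non-zero cells actually stored
print(sparse_dtm.nnz)

# toarray() builds the full matrix, zeros included
dense = sparse_dtm.toarray()
print(dense)
```

The default tokenizer ignores one-letter words, so "a" does not get a column. Even in this tiny example fewer than half of the cells are non-zero; with tens of thousands of essays the proportion of zeros is far higher.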

In [83]:
#we do the same as we did above, but convert it into a Pandas dataframe. Note this takes quite a bit more memory, so it will not be good for bigger data.
#get_feature_names_out() requires scikit-learn 1.0 or later; older versions use get_feature_names().
dtm_df = pandas.DataFrame(countvec.fit_transform(df_sample['PS1'].values.astype('U')).toarray(), columns=countvec.get_feature_names_out(), index = df_sample.index)

#view the dtm dataframe
dtm_df

Unnamed: 0,___,____,________,_m,aa,aaa,aahs,aamc,aap,aaron,...,zozo,zta,ztas,zubaz,zuckerberg,zuhaib,zulu,zumba,zw,zzwwwiiiipp
22419,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
70459,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34948,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13744,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5921,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23811,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21420,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4391,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22449,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 5. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 2, where we found the most frequent words using NLTK).

In [84]:
print(dtm_df.sum().sort_values(ascending=False))

the             97109
to              91283
my              79187
and             75705
of              55542
in              46596
that            33245
me              30427
was             27140
for             21436
with            20171
as              18959
it              18147
have            16559
is              16252
on              12775
be              12201
from            12060
this            11670
not             10967
but             10758
at              10494
school          10302
had              9628
an               9177
life             8988
has              8231
they             7946
family           7883
when             7843
                ...  
oyi                 1
drunkenness         1
overwritten         1
overwriting         1
overstays           1
overworld           1
overstepped         1
dummy               1
overstuffed         1
dumbfounding        1
overtake            1
dumbed              1
dumbbell            1
overtaking          1
duly      
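
For datasets too large to convert to a dataframe, the same word totals can be computed on the sparse matrix itself. This is a minimal sketch on made-up documents, not the essay data.

```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (illustrative only)
docs = ["my family and my school", "my life at school", "family life"]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

# Summing a sparse matrix over rows returns a 1 x n_terms matrix; flatten it
totals = numpy.asarray(dtm.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index, so sort terms by that index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
counts = dict(zip(terms, totals.tolist()))

# Rank terms from most to least frequent
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```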

In [None]:
#####Exercise:
###Print out the most infrequent words rather than the most frequent words.
##Gold star challenge: print the average number of times each word is used in an essay,
##sorted from highest to lowest.
print(dtm_df.mean().sort_values(ascending=False))

What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But, what do we lose when we represent text in this format?
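
One thing that is lost is word order. As a quick illustration with two made-up sentences that mean different things:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, different order, different meaning
docs = ["the dog chased the cat", "the cat chased the dog"]

dtm = CountVectorizer().fit_transform(docs).toarray()
print(dtm)

# Both sentences produce exactly the same row of counts
same_rows = bool((dtm[0] == dtm[1]).all())
print(same_rows)
```

This is why the DTM is often described as a "bag of words" representation.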

Today, we will use variations on the DTM to find distinctive words in this dataset, and then do some preliminary work discovering themes in text.

## 6. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw a few ways to do this yesterday, using natural language processing. Today, we'll learn one simple approach to this: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_document_with_word1)

You can, and often should, normalize the numerator: 

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_document_with_word1)

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.
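
One caveat worth knowing: the ratio shown above is the textbook form, and TfidfVectorizer's default settings use a variation on it. With default settings, scikit-learn computes a smoothed, logged idf, `ln((1 + n_docs) / (1 + doc_freq)) + 1`, multiplies it by the raw count, and then scales each document's row to unit length. The sketch below reproduces that on a made-up corpus and checks it against the library; treat it as an illustration of the defaults rather than a replacement for the manual Pandas challenge.

```python
import numpy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
docs = ["apple banana apple", "banana cherry", "apple cherry cherry"]

# Raw term counts per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Number of documents each term appears in
doc_freq = (counts > 0).sum(axis=0)

# Smoothed, logged idf (scikit-learn's default, smooth_idf=True)
idf = numpy.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length (norm='l2')
weighted = counts * idf
manual = weighted / numpy.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library's output
sk = TfidfVectorizer().fit_transform(docs).toarray()
matches = numpy.allclose(manual, sk)
print(matches)
```

Because of the logging and normalization, the scores will not equal the simple product from the formulas above, but they rank words within a document in a similar spirit.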

In [None]:
#import the function
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvec = TfidfVectorizer()

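#create the dtm, but with cells weighted by the tf-idf score.
#We use df_sample (its missing essay was dropped above): TfidfVectorizer raises an error on missing values,
#and a dense matrix for the full dataset would need far more memory.
dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df_sample['PS1'].values.astype('U')).toarray(), columns=tfidfvec.get_feature_names_out(), index = df_sample.index)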

#view results
dtm_tfidf_df

Let's look at the 20 words with highest tf-idf weights.

In [None]:
print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])
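
The outline mentions identifying distinctive words by group. One possible sketch, assuming a grouping column such as College, is to average the tf-idf weights within each group and look at the highest-scoring terms. The toy texts below are invented for illustration, and averaging is only one of several reasonable ways to summarize a group.

```python
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data with a grouping column (illustrative only)
toy = pandas.DataFrame({
    "College": ["Engineering", "Engineering", "Letters", "Letters"],
    "PS1": ["robots circuits code", "code robots bridges",
            "poetry history novels", "novels poetry theatre"],
})

vec = TfidfVectorizer()
weights = vec.fit_transform(toy["PS1"]).toarray()
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
tfidf_df = pandas.DataFrame(weights, columns=terms, index=toy.index)

# Average tf-idf weight of each term within each college
mean_by_college = tfidf_df.groupby(toy["College"]).mean()

# Two highest-scoring terms per college
top_terms = {
    college: row.sort_values(ascending=False).index[:2].tolist()
    for college, row in mean_by_college.iterrows()
}
print(top_terms)
```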

## 7. Uncovering Patterns: LDA

Frequency counts and tf-idf scores are done at the word level. There are other methods of exploratory or unsupervised analysis that work on the document level, by examining the co-occurrence of words within documents. Scikit-learn allows for many of these methods, including:

* document clustering
* document or word similarities using cosine similarity
* PCA
* topic modeling
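
As a brief illustration of one item on this list, document similarity with cosine similarity can be sketched in a few lines. The three sentences are made up; `cosine_similarity` comes from `sklearn.metrics.pairwise`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two similar documents and one unrelated document (illustrative only)
docs = [
    "my family moved to a new city",
    "my family moved to a new country",
    "robots and circuits fascinate me",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# sims[i, j] is the cosine of the angle between documents i and j
sims = cosine_similarity(tfidf)
print(sims.round(2))
```

Values near 1 mean two documents use very similar vocabulary; a value of 0 means they share no terms.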

We'll run through an example of topic modeling here. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling. We will run this in one big chunk of code. Our challenge: use our knowledge of scikit-learn that we gained above to walk through the code to understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century. 

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [None]:
df_lit = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Small Sample/AdmissionsEssays/statement_test_031417.csv", sep = ',', encoding = 'utf-8')

#drop rows where the text is missing. I think there's only one row where it's missing, but check me on that.
df_lit = df_lit.dropna(subset=['PS1'])

#view the dataframe
df_lit

Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

In [None]:
####Adapted from: 
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = 5
n_top_words = 50

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Use tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=None,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_lit.PS1)

# Use tf (raw term count) features
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_lit.PS1)

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

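#define the lda function, with desired options
#(the number of topics is set with n_components; the older n_topics argument was removed from scikit-learn)
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=20,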
                                learning_method='online',
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda.fit(tf)

#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

In [None]:
####Exercise:
###Run the same code as above but change some of the parameters. How does this change the output?
###Suggestions:
## 0. Use tf-idf scores rather than raw counts. (hint: look for the variable name we created) 
## 1. Change the number of topics. What do you find?
## 2. Do not remove stop words. How does this change the output?

One thing we may want to do with the output is find the most representative texts for each topic. A simple way to do this (but not memory efficient), is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.

In [None]:
topic_dist = lda.transform(tf)
topic_dist
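
To get a feel for what the topic distribution array contains, here is a self-contained sketch on a made-up corpus. The documents and the choice of two topics are assumptions for illustration only.

```python
import numpy
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only)
docs = [
    "family mother father home family",
    "mother father sister home",
    "school teacher class exam school",
    "teacher class homework exam",
    "family home school class",
]

tf_toy = CountVectorizer().fit_transform(docs)
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic
toy_dist = lda_toy.fit_transform(tf_toy)
print(toy_dist.round(2))

# Each row is a probability distribution over topics
row_sums = toy_dist.sum(axis=1)

# The column with the largest weight is the document's dominant topic
dominant = toy_dist.argmax(axis=1)
print(dominant)
```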

Merge back in with the original dataframe.

In [None]:
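#reuse df_lit's index so each topic row lines up with its own essay after dropna
topic_dist_df = pandas.DataFrame(topic_dist, index=df_lit.index)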
df_w_topics = topic_dist_df.join(df_lit)
df_w_topics
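
A caution when joining by index: `join` matches rows on index labels, and `dropna` leaves gaps in those labels. The toy sketch below (made-up texts and made-up topic weights) shows how a freshly numbered array can end up paired with the wrong rows, and how passing the original index avoids it.

```python
import numpy
import pandas

# Toy texts with one missing row (illustrative only)
texts = pandas.DataFrame({"PS1": ["essay a", None, "essay c", "essay d"]})
texts = texts.dropna(subset=["PS1"])  # remaining index labels: 0, 2, 3

# Made-up topic weights, one row per remaining essay
fake_dist = numpy.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# Default index is 0, 1, 2, so labels no longer match the texts
misaligned = pandas.DataFrame(fake_dist).join(texts)
print(misaligned)

# Reusing the texts' index keeps every row with its own essay
aligned = pandas.DataFrame(fake_dist, index=texts.index).join(texts)
print(aligned)
```

In the misaligned version one row has no text at all and "essay c" is paired with weights that belong to "essay d".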

Now we can sort the dataframe for the topic of interest, and view the top documents for the topics.
Below we sort the documents first by Topic 0 (looking at the top words for this topic I think it's about family, health, and domestic activities), and next by Topic 1 (again looking at the top words I think this topic is about children playing outside in nature). These topics may be a family/nature split?

Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?

In [None]:
print(df_w_topics[['ID', 'PS1', 0]].sort_values(by=[0], ascending=False))

We can read individual essays in full using the code below. Change the number in the final set of brackets to point to a specific serial number (ID-1).

In [None]:
df['PS1'][1980]

In [None]:
print(df_w_topics[['ID', 'PS1', 1]].sort_values(by=[1], ascending=False))

In [None]:
df['PS1'][2131]

In [None]:
print(df_w_topics[['ID', 'PS1', 2]].sort_values(by=[2], ascending=False))

In [None]:
df['PS1'][3515]

In [None]:
print(df_w_topics[['ID', 'PS1', 3]].sort_values(by=[3], ascending=False))

In [None]:
df['PS1'][2645]

In [None]:
print(df_w_topics[['ID', 'PS1', 4]].sort_values(by=[4], ascending=False))

In [None]:
df['PS1'][811]

What other patterns might we find with topic modeling? Toward what end?

In [None]:
###Ex (gold star exercise!): 
#       Find the most prevalent topic in the corpus.
#       Find the least prevalent topic in the corpus. 
#       Find the most prevalent topic by the gender of the author.
#       Hint: How do we define prevalence? What are different ways of measuring this,
#              and the benefits/drawbacks of each.


#       Extra bonus gold star exercise:
#          This topic model provides the topic distribution for 127 rows, but there are 131 rows in the full data.
#          What is going on here? (I don't have an answer to this. I hope someone can figure it out!)
#          A place to start looking: rows dropped by dropna() before the model was fit.