# Analysing Admissions Essays: Unsupervised Approaches using scikit-learn

This notebook is designed to analyze every admissions essay submitted to Berkeley in the 2014-2015 academic year using topic modeling with Python's scikit-learn package.  It begins with two CSV files (UTF-8), one for each of the two essay prompts that year, each containing unique and corresponding dummy IDs.  Both files have column headers.

### Personal Statement 1
1. Freshmen: Describe the world you come from — for example, your family, community or school — and tell us how your world has shaped your dreams and aspirations.
1. Transfers: What is your intended major? Discuss how your interest in the subject developed and describe any experience you have had in the field — such as volunteer work, internships and employment, participation in student organizations and activities — and what you have gained from your involvement.

### Personal Statement 2
1. Tell us about a personal quality, talent, accomplishment, contribution or experience that is important to you. What about this quality or accomplishment makes you proud, and how does it relate to the person you are?

### Outline
1. [Setup the Analysis](#0.-Setup-the-Analysis)
  1. [Import Packages](#Import-Packages)
  1. [Important Questions](#Important-Questions)
  1. [Initialize Variables](#Initilize-Variables)
1. [Import and view the data using Pandas](#1.-Import-and-view-the-data-using-Pandas)
  1. [Import the data into a Pandas Dataframe](#Import-the-data-into-a-Pandas-Dataframe)
  1. [Label the Columns](#Lable-the-Columns)
  1. Merge the dataframes
  1. Review the Data
1. [Explore the Data & Drop missing values](#2.-Explore-the-Data-using-Pandas)
  1. Are the ID's Unique?
  2. Find Missing Data
  1. Drop Missing Data
1. [Pre-Processing the Essays](#3.-Pre-Processing-the-Essays)
  1. Cleaning the text and tokenizing
  1. Remove Stopwords
  1. Stem the Tokens
1. [Creating a sample for testing](#4.-Creating-a-sample-for-testing)
1. [Creating the DTM: scikit-learn](#5.-Creating-the-DTM:-scikit-learn)
  1. CountVectorizer function
1. [Tf-idf scores](#6.-Tf-idf-scores)
  1. TfidfVectorizer function
1. [Uncovering patterns using LDA](#7.-Uncovering-Patterns:-LDA)
1. [The End!](#8.-The-End!)

## 0. Setup the Analysis

Before we get started we must import a number of packages we'll need, answer a few important questions, and set up a few variables we'll need later.

### Import Packages

In [None]:
import pandas
import numpy
import time
import datetime
import platform
import ast # Used below to turn generated column-name strings into lists.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
#import socket # I think this was to get the computer name, but may be changed now that platform is being used. see socket.gethostname()
#from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
print('All packages successfully imported!')

### Important Questions

In [None]:
#QUESTIONS TO ANSWER 
sample_yes_no = 'No'
number_of_topics = 5 # to revert to original, remove 'number_of_topics' from 'n_topics'.

### Initilize Variables

This cell initializes a few variables we'll need later and normalizes the answers to the questions above.

In [None]:
sample_yes_no = sample_yes_no.upper()
print('Analyse only a sample? '+sample_yes_no)

#This establishes the start time to calculate the time it takes to run the script
start_time = time.time()
print(start_time)

#Using platform package to access the machine name.
run_on = platform.node()

print(run_on)

## 1. Import and view the data using Pandas

First, we read our corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

Note: Pandas is great for data munging and basic calculations because it's so easy to use, and its data structure is really intuitive. It's not memory efficient however, so you might quickly need to move away from it. 

### Import the data into a Pandas Dataframe

The next cell automatically detects where the script is running and adjusts the file paths accordingly.

In [None]:
if (run_on == 'BensMBP') or (run_on == 'BensMBP.local'):
    df1 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS1_F16.csv", sep = ',', encoding = 'utf_8')
    df2 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS2_F16.csv", sep = ',', encoding = 'utf_8')
    print('The script is running locally.')
elif run_on == 'mercury':
    df1 = pandas.read_csv("../data/originals/PS1_F16.csv", sep = ',', encoding = 'utf_8')
    df2 = pandas.read_csv("../data/originals/PS2_F16.csv", sep = ',', encoding = 'utf_8')
    print('The script is running on mercury.')
else:
    print('The file path is unclear on this machine.')
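
As a hedged alternative to the `if`/`elif` chain above, the machine-to-folder mapping could be kept in a dictionary so that an unknown machine fails immediately instead of leaving `df1` and `df2` undefined. This is only a sketch: `DATA_DIRS` and `resolve_data_dir` are illustrative names that do not appear in the original notebook, and the local folder is a placeholder rather than the real path.

```python
# Hypothetical lookup table: machine name -> folder holding the two CSV files.
# The local folder is a placeholder, not the notebook's real path.
DATA_DIRS = {
    'BensMBP': '/path/to/local/Data',
    'BensMBP.local': '/path/to/local/Data',
    'mercury': '../data/originals',
}

def resolve_data_dir(machine_name, data_dirs=DATA_DIRS):
    """Return the data folder for a machine, or fail loudly if it is unknown."""
    try:
        return data_dirs[machine_name]
    except KeyError:
        raise RuntimeError('The file path is unclear on this machine: ' + machine_name)

print(resolve_data_dir('mercury'))
```

In the notebook, the argument would be `platform.node()`, and the returned folder would be joined with the two CSV file names before calling `pandas.read_csv`.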

With the data read from the CSV files into Pandas, we can view the dataframes and begin our session.

In [None]:
#create dataframes called "df1" and "df2"

#df1 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS1_F16.csv", sep = ',', encoding = 'utf_8')
#df2 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS2_F16.csv", sep = ',', encoding = 'utf_8')

#for the server
#df1 = pandas.read_csv("../data/originals/PS1_F16.csv", sep = ',', encoding = 'utf_8')
#df2 = pandas.read_csv("../data/originals/PS2_F16.csv", sep = ',', encoding = 'utf_8')


# View the dataframe.
# Notice the metadata. The column "Personal Statement 1 (RETIRED)" contains our text of interest.
# You can move the hashtag to view the other dataframe.
df1
# df2

### Lable the Columns

Next we can rename the column headers so they are easier to work with.

In [None]:
# This renames the column headers
df1.columns = ['CPID', 'College', 'PS1']
df2.columns = ['CPID', 'College', 'PS2']

# View the dataframe.  You can move the hashtag to view the other dataframe.
df1
# df2

### Merge the Dataframes

Now we will merge the two dataframes on their two common elements (CPID and College) using `merge`.

In [None]:
# Merge the two data frames so that we have one data frame with both questions attached to common CPID's and College.
df = pandas.merge(df1, df2, on=['CPID', 'College'])
df
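
Because `merge` performs an inner join by default, any ID that appears in only one of the two files silently disappears. The following sketch, on made-up toy frames standing in for `df1` and `df2`, shows one way to see which rows would be lost; the IDs, colleges, and essay texts are invented for illustration.

```python
import pandas

# Toy stand-ins for df1 and df2; the IDs and colleges are made up.
ps1 = pandas.DataFrame({'CPID': [1, 2, 3],
                        'College': ['L&S', 'Eng', 'L&S'],
                        'PS1': ['essay a', 'essay b', 'essay c']})
ps2 = pandas.DataFrame({'CPID': [1, 2, 4],
                        'College': ['L&S', 'Eng', 'Eng'],
                        'PS2': ['essay d', 'essay e', 'essay f']})

# The default merge is an inner join: only IDs present in both frames survive.
inner = pandas.merge(ps1, ps2, on=['CPID', 'College'])

# An outer join with indicator=True labels where each row came from,
# which makes unmatched applicants easy to spot.
outer = pandas.merge(ps1, ps2, on=['CPID', 'College'], how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']

print(len(inner), len(outer), len(unmatched))
```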

### Review the data

It can be helpful to see how much memory is being used by this new dataframe.  We can do that with the `info` method.  We can also view individual essays, housed in particular cells, in full.

In [None]:
# Check the amount of memory being occupied by this newly created dataframe.

# PROGRESS
print('Full original dataframe created.')
# info() prints its report itself and returns None, so it is not wrapped in print().
df.info(memory_usage='deep')

It is important to review the data contained in the new dataframe we created.  This code looks at an essay in full.

In [None]:
#Print the first essay from the column 'PS1'. Printing shows the text more faithfully than the dataframe view.
print(df['PS1'][0])

## 2. Explore the Data using Pandas

Let's first evaluate the general nature of the data to see if the ID's are unique, if there is any missing data, etc.  
We can also look at some descriptive statistics about this data set to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. Love your data!

### Are the ID's Unique?

Counting how often each ID occurs, and ranking the counts, shows which IDs have more than one "PS1".

In [None]:
#This tells us if we have any duplicate IDs.  If every count is 1 we are ok.
print(df['CPID'].value_counts())

# This checks for duplicate CPIDs.  If the array is empty there are no duplicates.
# Index.get_duplicates() was removed from newer pandas releases, so duplicated() is used instead.
print()
print('Array containing duplicate CPIDs:')
print(df.loc[df['CPID'].duplicated(), 'CPID'].unique())
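
For a quicker yes/no answer, and for seeing every row involved in a duplicate, a sketch along these lines may help. The IDs are made up, and `ids` stands in for `df['CPID']`.

```python
import pandas

# Made-up IDs: 102 appears twice.
ids = pandas.Series([101, 102, 103, 102], name='CPID')

all_unique = ids.is_unique                         # False when any ID repeats
duplicate_ids = ids[ids.duplicated()].unique()     # the IDs that repeat
duplicate_rows = ids[ids.duplicated(keep=False)]   # every row involved in a repeat

print(all_unique, list(duplicate_ids), list(duplicate_rows.index))
```

Passing `keep=False` marks all occurrences of a repeated ID, not just the later ones, which is useful when the duplicated essays need to be compared side by side.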

### Find missing essays

Advanced operations will not work with empty data.  The next few steps are designed to find, explore and purge records with missing data.

In [None]:
# This creates a variable empties

# First for PS1
print('Summarizing missing data for PS1:')
empties_PS1 = numpy.where(pandas.isnull(df['PS1']))[0]

print(empties_PS1)

# Notice that this is not formatted as a list.  The next operation, "list", gets it in the right format.
empties_PS1 = list(empties_PS1)
print(empties_PS1)

#This counts the number of missing essays for PS1.
print(len(empties_PS1))

#This lists the elements with missing data.
df.iloc[empties_PS1]

In [None]:
# Repeat the above steps for PS2
print('Summarizing missing data for PS2')
empties_PS2 = numpy.where(pandas.isnull(df['PS2']))[0]
print(empties_PS2)
empties_PS2 = list(empties_PS2)
print(empties_PS2)
print(len(empties_PS2))
df.iloc[empties_PS2]

Next we can create a list of every ID which has at least one missing essay.

In [None]:
#This combines the two lists of missing data without duplicating anything.
empties_any = empties_PS1 + list(set(empties_PS2) - set(empties_PS1))
empties_any.sort()

print(empties_any)
print('There are', len(empties_any), 'CPIDs with at least one missing essay.')

### Drop the missing essays

This takes the list of IDs that have at least one missing essay and drops them, creating a new dataframe where each cell is populated.

In [None]:
df_no_missing = df.drop(df.index[empties_any])

# PROGRESS
print('Records with missing data have been dropped.')

# df_no_missing = df.dropna()
df_no_missing
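
The commented-out `dropna` line above hints at a shorter route. As a sketch on a made-up toy frame, dropping rows by the positions found with `numpy.where` and calling `dropna` on the two essay columns should give the same result:

```python
import numpy
import pandas

# Toy frame with one missing PS1 and one missing PS2.
toy = pandas.DataFrame({'CPID': [1, 2, 3, 4],
                        'PS1': ['a', None, 'c', 'd'],
                        'PS2': ['e', 'f', None, 'h']})

# Position-based approach, as in the cells above.
missing_positions = sorted(set(numpy.where(toy['PS1'].isnull())[0]) |
                           set(numpy.where(toy['PS2'].isnull())[0]))
by_position = toy.drop(toy.index[missing_positions])

# Shorter approach: drop any row with a missing essay in either column.
by_dropna = toy.dropna(subset=['PS1', 'PS2'])

print(by_position.equals(by_dropna))
```

The position-based version has the advantage of leaving behind explicit lists of which records were missing, which is why the notebook keeps it.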

## 3. Pre-Processing the Essays

Once we have a Pandas Dataframe in the appropriate structure, we can begin to process the text in the two essay columns.  This involves cleaning the text in a number of ways before tokenizing the essays.  

### Cleaning the text and tokenizing

The section below combines multiple preprocessing steps into a single line of code.  It is repeated once for each of the two essays, and results in a largely "preprocessed", tokenized new column.  Most of this is accomplished with the `.str` accessor of a Pandas column.  Here is what each step accomplishes:
1. `str.replace('\\', ' ')` - This replaces some of the idiosyncratic backslashes that were present with spaces.
1. `str.lower()` - This shifts all the letters to lowercase.
1. `str.replace('[^\w\s]','')` - This gets rid of punctuation.  The "`^`" negates the set, the "`\w`" matches any word character (alphanumeric & underscore), and the "`\s`" matches any whitespace character (spaces, tabs, line breaks), so every character that is neither is removed.
1. `str.replace('[\d]','')` - This gets rid of all digits.
1. `str.split()` - This tokenizes what's left, creating a list within the Pandas cell.
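
The steps above can be sketched on a single string with Python's standard `re` module, which makes it easier to see what each one does. The function name `clean_and_tokenize` and the sample text are illustrative and not part of the original notebook.

```python
import re

def clean_and_tokenize(text):
    """Apply the five steps above to one essay and return a list of tokens."""
    text = text.replace('\\', ' ')        # 1. backslashes become spaces
    text = text.lower()                   # 2. lowercase
    text = re.sub(r'[^\w\s]', '', text)   # 3. drop punctuation
    text = re.sub(r'\d', '', text)        # 4. drop digits
    return text.split()                   # 5. split on whitespace

sample_essay = "Hello, World!\\ It's 2015."   # made-up text containing one backslash
tokens = clean_and_tokenize(sample_essay)
print(tokens)
```

Note that removing punctuation turns "It's" into "its", a side effect worth keeping in mind when reading the topic words later.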

In [None]:
#create two new columns with tokenized essay responses.  
#In the same operation it makes everything lowercase.
#regex is set explicitly because its default differs between pandas versions.
df_no_missing['PS1_clean'] = df_no_missing['PS1'].str.replace('\\', ' ', regex=False).str.lower().str.replace(r'[^\w\s]', '', regex=True).str.replace(r'[\d]', '', regex=True).str.split()
df_no_missing['PS2_clean'] = df_no_missing['PS2'].str.replace('\\', ' ', regex=False).str.lower().str.replace(r'[^\w\s]', '', regex=True).str.replace(r'[\d]', '', regex=True).str.split()

# PROGRESS
print('Step 1 of preprocessing complete.')

df_no_missing

In [None]:
# this shows that we've mostly dealt with the odd backslashes and cleaned the text in a bunch of other ways!
# .iloc[0] picks the first remaining row even if the row labelled 0 was dropped above.
print(df_no_missing['PS1_clean'].iloc[0])

### Remove Stopwords

Stopwords are words that appear frequently and tend not to be distinctive.  They are generally removed prior to text analysis unless there is a compelling reason to keep them. [More info](http://www.nltk.org/book/ch02.html#code-unusual)

In [None]:
#stopwords imported from NLTK above

#Removes English stop words from the tokenized columns.
#A set makes each membership test fast, which matters across many essays.
stop_words = set(stopwords.words('english'))
df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: [item for item in x if item not in stop_words])
df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: [item for item in x if item not in stop_words])

# PROGRESS
print('Stopwords have been removed.')

df_no_missing

### Stem the Tokens

Stemming reduces words with multiple endings to their common stem.  There are multiple ways to do this, but we will use the Porter Stemmer for our purposes. http://www.bogotobogo.com/python/NLTK/Stemming_NLTK.php

In [None]:
# using the "Porter Stemmer" we'll stem the words
porter_stemmer = PorterStemmer()

# There is lots of code in the saved file to rework this if necessary.

#This works on my machine with NLTK 3.2.1, but not on Mercury when it had NLTK 3.2.2!
df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: [porter_stemmer.stem(item) for item in x])
df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: [porter_stemmer.stem(item) for item in x])

# PROGRESS
print('Stemmed the tokens.')

'''
print('Test List 1')
print(test_list_1)
print('Test List 2')
print(test_list_2)
'''

df_no_missing

Next we join the tokens back into a string so we can run CountVectorizer and create a document term matrix.

In [None]:
df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: ' '.join(x))
df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: ' '.join(x))

# PROGRESS
print('Transformed the tokens back into strings.')

# text_list_stemmed = [' '.join([porter_stemmer.stem(word) for word in sentence.split(" ")]) for sentence in text_list]
df_no_missing

## 4. Creating a sample for testing

In this section we'll create a smaller sample of the data to check that the analysis we construct below works.

In [None]:
# This generates a random sample of N essays, with a random state set for reproducibility.
# N can be increased slowly to test the computational resources required as you scale up.
df_sample = df_no_missing.sample(n=500, random_state=0)

# This code resets the index: rows are sorted by their original labels, which are kept in a new 'index' column, and new labels are generated.
df_sample = df_sample.sort_index()
df_sample = df_sample.reset_index()

# PROGRESS
print('Created a sample of the data.')

#This is where the sample data is assigned for analysis, depending on the answer given at the top.
#sample_yes_no was converted to upper case above, so only upper-case answers need checking.
if sample_yes_no in ('Y', 'YES'):
    df_no_missing = df_sample
    print('Running on the sample, not the whole dataset.')
elif sample_yes_no in ('N', 'NO'):
    print('Running on the entire dataset.')
else:
    print('Unable to determine what data to analyze, the sample or the entire set.')

df_sample
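
One possible refinement, sketched here under the assumption that an unrecognised answer should stop the run rather than just print a message, is to turn the yes/no answer into a boolean in a single place. The helper name `wants_sample` is hypothetical.

```python
def wants_sample(answer):
    """Translate a yes/no answer into True (sample) or False (full dataset)."""
    normalized = answer.strip().upper()
    if normalized in ('Y', 'YES'):
        return True
    if normalized in ('N', 'NO'):
        return False
    raise ValueError('Unable to determine what data to analyze: ' + repr(answer))

print(wants_sample('No'), wants_sample(' yes '))
```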

## 5. Creating the DTM: scikit-learn

Now that we've preprocessed the text and created two columns with strings, the required input for scikit-learn's CountVectorizer, we can create a document term matrix.  This is the building block for Topic Modeling and a number of other methods we may want to explore.  There are two ways to do this. We can turn it into a sparse matrix type, which can be used within scikit-learn for further analyses.  We can then turn it into a full document term matrix, but this is very memory intensive and might not be a great idea for larger data sets.

In [None]:
# see above for: from sklearn.feature_extraction.text import CountVectorizer
# A separate vectorizer is fit for each essay so that each one keeps its own vocabulary.
countvec_PS1 = CountVectorizer()
countvec_PS2 = CountVectorizer()

sklearn_dtm_PS1 = countvec_PS1.fit_transform(df_no_missing['PS1_clean'])
sklearn_dtm_PS2 = countvec_PS2.fit_transform(df_no_missing['PS2_clean'])

print('PS1 sparse matrix type')
print(sklearn_dtm_PS1)
print(' ')
print('PS2 sparse matrix type')
print(sklearn_dtm_PS2)
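
To make the sparse versus full distinction concrete, here is a small sketch on three made-up, already-cleaned "essays". The documents are invented; only `CountVectorizer` and its `vocabulary_` attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up, already-cleaned "essays".
docs = ['family school family', 'school community', 'community dreams dreams']

vectorizer = CountVectorizer()
dtm_sparse = vectorizer.fit_transform(docs)   # sparse matrix: stores only non-zero counts

# vocabulary_ maps each term to its column; sorting by column recovers the header row.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense view: fine for a toy corpus, memory-hungry for the full set of essays.
dtm_dense = dtm_sparse.toarray()

print(terms)
print(dtm_dense)
```

Each row is a document and each column a term, so the first row records that "family" occurs twice and "school" once in the first toy essay.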

## 6. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw a few ways to do this yesterday, using natural language processing. Today, we'll learn one simple approach to this: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent, but are also used in every single document, will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will in theory filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as such:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_document_with_word1)

You can, and often should, normalize the numerator: 

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_document_with_word1)
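
The normalized formula above can be sketched in plain Python on a made-up tokenized corpus. This follows the simplified formula in this notebook only; scikit-learn's `TfidfVectorizer` uses a log-scaled, smoothed inverse document frequency and normalizes each row by default, so its numbers will differ.

```python
def tfidf(term, doc_tokens, corpus):
    """Normalized term frequency times number_of_documents / documents_with_term."""
    term_frequency = doc_tokens.count(term) / len(doc_tokens)
    documents_with_term = sum(1 for doc in corpus if term in doc)
    return term_frequency * (len(corpus) / documents_with_term)

# Made-up tokenized corpus of three short documents.
corpus = [['family', 'school', 'family'],
          ['school', 'community'],
          ['community', 'dreams', 'dreams']]

score_family = tfidf('family', corpus[0], corpus)   # frequent here, absent elsewhere
score_school = tfidf('school', corpus[0], corpus)   # rarer here, shared with another document
print(score_family, score_school)
```

"family" scores higher than "school" for the first document because it is both more frequent there and confined to that document. The function assumes the term occurs in at least one document; otherwise the division would fail.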

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [None]:
# see above for from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer()

# #create the dtm, but with cells weighted by the tf-idf score.
# dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.PS1).toarray(), columns=tfidfvec.get_feature_names(), index = df.index)

# #view results
# dtm_tfidf_df

Let's look at the 20 words with highest tf-idf weights.

In [None]:
# print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])

## 7. Uncovering Patterns: LDA

Frequency counts and tf-idf scores are done at the word level. There are other methods of exploratory or unsupervised analysis on the document level and by examining the co-occurrence of words within documents. Scikit-learn allows for many of these methods, including:

* document clustering
* document or word similarities using cosine similarity
* PCA
* topic modeling

We'll run through an example of topic modeling here. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling. We will run this in one big chunk of code. Our challenge: use the knowledge of scikit-learn that we gained above to walk through the code to understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century. 

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [None]:
# df_lit = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Small Sample/AdmissionsEssays/statement_test_031417.csv", sep = ',', encoding = 'utf-8')

# #drop rows where the text is missing. I think there's only one row where it's missing, but check me on that.
# df_lit = df_lit.dropna(subset=['PS1'])

#df_lit = df_no_missing

#view the dataframe
#df_lit

Now we're ready to fit the model. This requires the use of CountVectorizer, which we've already used, and the scikit-learn function LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

### Running the LDA Model

In [None]:
#should switch to batch (from online).  n_samples should be closer to full set

####Adapted from:
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

# See above for: from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# and:from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = number_of_topics #changed this so the value can be set above.
n_top_words = 100

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Use tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=None,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_no_missing['PS1_clean'])

# Use tf (raw term count) features
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_no_missing['PS1_clean'])

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

#define the lda function, with desired options.  max_iter=100 may be too few iterations and is worth revisiting.
#The number of topics is passed as n_components in current scikit-learn; older releases called it n_topics.
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=100,
                                learning_method='batch',  # changed from 'online' to 'batch'
                                learning_offset=80.,
                                total_samples=n_samples,
                                random_state=0)
#fit the model
lda.fit(tf)

#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
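# get_feature_names() was removed from newer scikit-learn releases; get_feature_names_out() replaces it.
# list() keeps the list type that the cells below expect.
tf_feature_names = list(tf_vectorizer.get_feature_names_out())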
print_top_words(lda, tf_feature_names, n_top_words)
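
The slice `topic.argsort()[:-n_top_words - 1:-1]` inside `print_top_words` is compact but hard to read. A sketch on made-up weights shows what it does: sort the indices by weight, then walk backwards from the end to collect the largest ones.

```python
import numpy

# Made-up weights of four words in a single topic.
feature_names = ['school', 'family', 'dreams', 'community']
topic = numpy.array([0.1, 0.7, 0.3, 0.5])
n_top = 2

ascending = topic.argsort()                 # indices from smallest to largest weight
top_indices = ascending[:-n_top - 1:-1]     # walk backwards from the end, n_top steps
top_words = [feature_names[i] for i in top_indices]
print(top_words)
```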

### Format Topic Document Matrix

One thing we may want to do with the output is find the most representative texts for each topic. A simple way to do this (but not memory efficient), is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array. The numbers produced here sum to 1 for every row, and represent the relative representativeness of each topic.  The topic with the highest number is the most likely topic to have structured the document.  

In [None]:
topic_dist = lda.transform(tf)
topic_dist
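
As a sketch of how such an array can be read, the following uses a made-up three-document, three-topic array in place of `topic_dist` to confirm that the rows sum to 1 and to pick the dominant topic of each document.

```python
import numpy

# Made-up document-topic weights for three documents and three topics.
toy_topic_dist = numpy.array([[0.7, 0.2, 0.1],
                              [0.1, 0.1, 0.8],
                              [0.3, 0.4, 0.3]])

row_sums = toy_topic_dist.sum(axis=1)            # each row should sum to 1
dominant_topic = toy_topic_dist.argmax(axis=1)   # index of the largest weight per row
print(row_sums, dominant_topic)
```

The third toy document shows why the full row matters: its dominant topic wins by a small margin, so treating it as belonging to a single topic would hide a lot.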

Then we create a Pandas DataFrame and do some work to rename the columns.

In [None]:
#this reads the array into a dataframe
topic_dist_df = pandas.DataFrame(topic_dist)

#This is a little script to name the columns appropriately
col_number = n_topics
col_nubber_original = col_number
column_name_string_PS1 = ""
for i in range(0,col_number):
    column_name_string_PS1+="'Topic_"+str(col_nubber_original-col_number)+"_PS1', "
    col_number = col_number - 1


#This cleans up the end of the string produced above and adds brackets
column_name_string_PS1 = column_name_string_PS1[:-2]
column_name_string_PS1 = '['+column_name_string_PS1+']'


#This converts the string into a list type object.
column_name_string_PS1 = ast.literal_eval(column_name_string_PS1)

#This applies the list containing the names of the columns to the dataframe
topic_dist_df.columns = column_name_string_PS1

topic_dist_df
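
The same column names could also be produced more directly with a list comprehension, without building a string and parsing it back. This is a sketch of an alternative, not the notebook's method; `topic_column_names` is a hypothetical helper.

```python
def topic_column_names(n_topics, suffix='PS1'):
    """Return ['Topic_0_<suffix>', 'Topic_1_<suffix>', ...] for n_topics topics."""
    return ['Topic_{}_{}'.format(i, suffix) for i in range(n_topics)]

names = topic_column_names(3)
print(names)
```

In the notebook this would be used as `topic_dist_df.columns = topic_column_names(n_topics)`, and the same helper with `suffix='PS1_features'` would cover the feature-weight columns.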

Now we can merge the new dataframe with the original dataframe of essays and ID's.

In [None]:
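# join matches rows by index label.  Resetting the index lines the essays up with the
# 0..n-1 labels of topic_dist_df; otherwise rows dropped earlier would shift the topic
# weights onto the wrong essays.
df_w_topics = df_no_missing.reset_index(drop=True).join(topic_dist_df)

# PROGRESS
print('Joined the topic and original dataframes.')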

df_w_topics
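
Because `join` aligns rows on index labels rather than on position, it is worth checking that the essays and the topic weights share the same labels. A toy sketch with made-up values shows what can go wrong when a row was dropped from one frame but not the other:

```python
import pandas

# Essays whose original row 1 was dropped, so the labels are 0, 2, 3.
essays = pandas.DataFrame({'PS1': ['a', 'c', 'd']}, index=[0, 2, 3])
# Topic weights come back from the model labelled 0, 1, 2.
weights = pandas.DataFrame({'Topic_0_PS1': [0.9, 0.8, 0.7]})

joined_by_label = essays.join(weights)                             # matches labels, not positions
joined_by_position = essays.reset_index(drop=True).join(weights)   # labels now agree

print(joined_by_label)
print(joined_by_position)
```

In the label-based result, essay 'c' receives the weight meant for essay 'd' and essay 'd' receives none, while the position-based result keeps each essay with its own weight.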

### Save the Document Topic Matrix

In [None]:
#Writing the output of this run to a CSV

now = datetime.datetime.now()
time_for_f_name = now.strftime("Date_%Y-%m-%d_Time_%H-%M")
path = '../data/'
topic_num_string = str(number_of_topics)

# The file name records the number of topics and the time of the run.
df_w_topics.to_csv(path+'Admissions_PS1_'+topic_num_string+'_Topic_'+time_for_f_name+'.csv', sep=',')
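
Since the same naming pattern is reused for the feature matrix further down, the file name could be built by one small helper. This is a sketch under that assumption; `output_file_name` is a hypothetical name, and the date is fixed only to make the example reproducible.

```python
import datetime

def output_file_name(n_topics, when, prefix='Admissions_PS1'):
    """Build a file name that records the number of topics and the run time."""
    stamp = when.strftime('Date_%Y-%m-%d_Time_%H-%M')
    return '{}_{}_Topic_{}.csv'.format(prefix, n_topics, stamp)

example = output_file_name(5, datetime.datetime(2017, 3, 14, 9, 5))
print(example)
```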

### Format Feature Matrix

These first few steps walk through the data that we have and try to get it into a readable format.  First, we take a look at the list of words stored in `tf_feature_names`.

In [None]:
print(len(tf_feature_names))
print(type(tf_feature_names))
print(tf_feature_names)

We are going to need to transform the list into an array, which we can do below.

In [None]:
# This transforms the list into a numpy array
tf_feature_names_array = numpy.asarray(tf_feature_names)
print(type(tf_feature_names_array))
print(tf_feature_names_array.shape)
print(tf_feature_names_array)

This is a bit of an aside, but we see that this matrix stores the topic weights for each document.

In [None]:
#These identical arrays provide the topic weighting for each text.
print(lda.transform(tf))
print(lda.transform(tf).shape)
print(topic_dist)
print(topic_dist.shape)

Now we can look at a matrix that seems to store the feature weights.  We notice that the dimensions of this matrix correspond to the number of topics and the length of the word list we just reviewed above.

In [None]:
#These two things tell us what is in the LDA model at this point
#I believe this tells us the weights of the features, which in this case are words

#this tells us the dimensions of the array
print(lda.components_.shape)
print(type(lda.components_))

#This shows that components is an array of numbers
print(lda.components_)

This transposes the matrix so it matches the terms.

In [None]:
transposed_components = numpy.transpose(lda.components_)
print(transposed_components.shape)
print(type(transposed_components))
print(transposed_components)

Now we can convert the feature list into a Pandas DataFrame for easy viewing and manipulation.

In [None]:
item_new = pandas.DataFrame(tf_feature_names_array)
item_new.columns = ['Feature_Words']
item_new

This creates a Pandas DataFrame out of the feature weights.

In [None]:
item_new_2 = pandas.DataFrame(transposed_components)
item_new_2

This names the feature weight columns on the fly.  We can change the number of topics without upsetting this.

In [None]:
#This is a little script to name the columns appropriately
col_number = n_topics
col_nubber_original = col_number
column_name_string_PS1_features = ""
for i in range(0,col_number):
    column_name_string_PS1_features+="'Topic_"+str(col_nubber_original-col_number)+"_PS1_features', "
    col_number = col_number - 1


#This cleans up the end of the string produced above and adds brackets
column_name_string_PS1_features = column_name_string_PS1_features[:-2]
column_name_string_PS1_features = '['+column_name_string_PS1_features+']'


#This converts the string into a list type object.
column_name_string_PS1_features = ast.literal_eval(column_name_string_PS1_features)

#This applies the list containing the names of the columns to the dataframe
item_new_2.columns = column_name_string_PS1_features

item_new_2

Now we can merge the two data frames we saw above: the features (terms), and the feature weights.

In [None]:
df_crazy = pandas.concat([item_new, item_new_2], axis=1)
df_crazy

### Save the Feature Matrix

In [None]:
#This saves the feature matrix to a csv
df_crazy.to_csv(path+'Admissions_PS1_Feature_Words_'+topic_num_string+'_Topic_'+time_for_f_name+'.csv', sep=',')

### Reviewing Final Model Output

Now we can sort the dataframe for the topic of interest, and view the top documents for the topics.
Below we sort the documents first by Topic 0 (looking at the top words for this topic I think it's about family, health, and domestic activities), and next by Topic 1 (again looking at the top words I think this topic is about children playing outside in nature). These topics may be a family/nature split?

Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?

We can read individual essays in full using the code below.  Change the number in the final set of brackets to point to a specific serial number (ID-1).

In [None]:
print(df_w_topics[['CPID', 'PS1', 'Topic_0_PS1']].sort_values(by=['Topic_0_PS1'], ascending=False))

In [None]:
print(df_w_topics['PS1'][205])

This allows us to sort the most important words/features for each topic.

In [None]:
# The variable is named after the topic column it is sorted by.
topic_2_feature_terms = df_crazy.sort_values(['Topic_2_PS1_features'], ascending=False)
topic_2_feature_terms

## 8. The End!

In [None]:
# This calculates the amount of time it took to run the model, it requires "start_time" set at start.
end_time = time.time()
total_seconds = end_time - start_time

m, s = divmod(total_seconds, 60)
h, m = divmod(m, 60)
print("This script took %d:%02d:%02d to run (h:m:s)" % (h, m, s))
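
The two `divmod` calls above first split the seconds into whole minutes and leftover seconds, then split the minutes into whole hours and leftover minutes. Wrapped in a small function (the name `format_elapsed` is illustrative), the same logic can be checked on a known duration:

```python
def format_elapsed(total_seconds):
    """Format a duration in seconds as h:mm:ss."""
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return '%d:%02d:%02d' % (hours, minutes, seconds)

print(format_elapsed(3725.4))
```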

In [None]:
print('Done! The output of the section above corresponds to Admissions_PS1_'+topic_num_string+'_Topic_'+time_for_f_name+'.csv')
