# Manipulating Topic Model Output

This notebook imports the output of a scikit-learn LDA topic model so that the results can be interpreted.

1. [Import Packages and Initialize Variables](#1.-Import-Packages-and-Initialize-Variables)
1. [Import the Topic Model Output](#2.-Import-the-Topic-Model-Output)
 1. [Inspect the Imported DataFrames](#Inspect-the-imported-DataFrames)
1. [Check for Data Issues](#3.-Check-for-data-issues)
1. [Interpret the Topics](#4.-Interpret-the-topics)
 1. [Create Top Word Lists](#Create-Top-Word-Lists)
 1. [Review Coded Essays](#Review-Coded-Essays)
1. [Create Top Topic Column](#5.-Create-Top-Topic-Column)
1. [Analyze the Data](#6.-Analyze-the-Data)
1. [Scratch Cells](#7.-Scratch-Cells)

## 1. Import Packages and Initialize Variables

This cell imports pandas and NumPy, which the rest of the notebook uses.

In [2]:
import pandas
import numpy
print('All packages successfully imported!')

All packages successfully imported!


## 2. Import the Topic Model Output

First we must bring in our two dataframes: the document-topic weights and the term-topic weights.

In [3]:
doc_topic_df = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/model output/2017-05-02_at_17-34/Admissions_PS1_3_Topic_2017-05-02_at_17-34.gzip", compression='gzip', sep = ',', index_col=0, encoding = 'utf_8')
term_topic_df = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/model output/2017-05-02_at_17-34/Admissions_PS1_3_Topic_Feature_Words-2017-05-02_at_17-34.csv", sep = ',', index_col=0, encoding = 'utf_8')
print('Imported the Topic Model output.')

Imported the Topic Model output.
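
As a minimal sketch of what the `compression` and `index_col` arguments do, the cell below writes a tiny made-up table to a gzip-compressed CSV and reads it back. The values, the temporary directory, and the file name `toy_doc_topic.gzip` are illustrative assumptions, not the project's data.

```python
import os
import tempfile

import pandas

# Toy stand-in for the document-topic table (values are made up)
toy_df = pandas.DataFrame(
    {"CPID": [1, 2], "Topic_0_PS1": [0.25, 0.75], "Topic_1_PS1": [0.75, 0.25]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_doc_topic.gzip")
    # compression is passed explicitly rather than relying on pandas
    # inferring it from the file suffix
    toy_df.to_csv(toy_path, compression="gzip", sep=",", encoding="utf_8")
    # index_col=0 restores the saved row index instead of loading it
    # as an extra 'Unnamed: 0' data column
    reloaded_df = pandas.read_csv(
        toy_path, compression="gzip", sep=",", index_col=0, encoding="utf_8"
    )

print(reloaded_df)
```

Leaving out `index_col=0` would load the saved index as an ordinary column, which would then be written out again on the next export.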


If we need to convert the gzipped dataframe to a plain CSV file, use the code below.

In [27]:
doc_topic_df.to_csv('/Users/benjamin/Desktop/admissions_with_3_topic_weights.csv', sep=',')

### Inspect the imported DataFrames

First we can look at the Document Topic dataframe.

In [4]:
doc_topic_df

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean,Topic_0_PS1,Topic_1_PS1,Topic_2_PS1
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...,bojio play type famili whatsapp group chat old...,costum torn tatter shirt pant old colour fade ...,0.409134,0.197232,0.393634
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ...",world shape lean man look like despit age look...,sinc childhood yearn utopia know appear unreal...,0.047477,0.281248,0.671275
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...,come mediocr famili malaysia malaysian expos d...,rais famili academ number one prioriti excel g...,0.382529,0.140288,0.477183
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...,mexicanamerican one amaz thing could ever fina...,learn femal stereotyp earli life women weak ca...,0.370764,0.627213,0.002023
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...,realli confus fill inform come standard test s...,bu loud smell help ac yellow school bu transpo...,0.266908,0.632173,0.100919
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins...",sinc born parent want identifi dutch american ...,past want inform inspir knew look classroom te...,0.709202,0.034891,0.255907
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...,think cousin sid see guy love nascar alway ask...,there reason peopl buy magazin want like way l...,0.014716,0.931561,0.053723
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...,born suburb east seattl middl class famili two...,ive defin moment throughout life chang futur o...,0.454547,0.002095,0.543358
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,...",youngest famili held much higher standard olde...,one talent proud danc danc sinc kindergarten f...,0.002823,0.848057,0.149120
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an...",live grand expect cardiologistprofessor nutrit...,throughout life ive explor enjoy challeng inte...,0.002717,0.446715,0.550568


Next we can look at the Term Topic dataframe.

In [5]:
term_topic_df

Unnamed: 0,Feature_Words,Topic_0_PS1_features,Topic_1_PS1_features,Topic_2_PS1_features
0,aa,10.311660,48.572147,5.116193
1,aaron,14.658362,40.894435,61.447203
2,ab,9.851093,72.968547,94.180360
3,aback,68.630848,45.699409,64.669743
4,abacu,0.376226,0.353855,126.269918
5,abandon,245.344496,1272.749805,439.905699
6,abc,68.118829,31.032383,54.848787
7,abdomen,0.354808,31.421680,30.223512
8,abhor,26.533994,25.486269,12.979737
9,abid,146.732472,68.623314,42.644214


We can capture some variables from our analysis that we may want to use later. The two topic counts printed below should be the same.

In [6]:
#This section is for the doc_topic matrix
doc_topic_headers = list(doc_topic_df)
doc_topic_headers = [ x for x in doc_topic_headers if "Topic_" in x ]
num_topics = len(doc_topic_headers)
print(num_topics)
print(doc_topic_headers)
print('')

#This section is for the term_topic matrix
term_topic_headers = list(term_topic_df)
term_topic_headers = [ x for x in term_topic_headers if "Topic_" in x ]
num_topics1 = len(term_topic_headers)
print(num_topics1)
print(term_topic_headers)
print('')

# This gets the topic numbers as a list for later calculations
topic_numbers = list(range(0, num_topics))
print(topic_numbers)

#This creates a list of names for the top-term columns of the dataframe created later
top_terms_topic_names = ['Top_Terms_T_'+str(x) for x in range(num_topics)]
print(top_terms_topic_names)

#This creates a list of names for the feature-weight columns of the dataframe created later
feature_weight_names = ['Feature_Weights_T_'+str(x) for x in range(num_topics)]
print(feature_weight_names)

3
['Topic_0_PS1', 'Topic_1_PS1', 'Topic_2_PS1']

3
['Topic_0_PS1_features', 'Topic_1_PS1_features', 'Topic_2_PS1_features']

[0, 1, 2]
['Top_Terms_T_0', 'Top_Terms_T_1', 'Top_Terms_T_2']
['Feature_Weights_T_0', 'Feature_Weights_T_1', 'Feature_Weights_T_2']
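
Rather than comparing the two printouts by eye, the check can be made explicit. The sketch below uses two toy dataframes with made-up values, and the helper name `topic_columns` is hypothetical.

```python
import pandas


def topic_columns(df):
    """Return the column names that contain 'Topic_', in their original order."""
    return [col for col in df.columns if "Topic_" in col]


# Toy frames with made-up values that mimic the two imported tables
toy_doc_topic = pandas.DataFrame(
    {"CPID": [1], "Topic_0_PS1": [0.4], "Topic_1_PS1": [0.6]}
)
toy_term_topic = pandas.DataFrame(
    {
        "Feature_Words": ["school"],
        "Topic_0_PS1_features": [3.0],
        "Topic_1_PS1_features": [1.0],
    }
)

doc_cols = topic_columns(toy_doc_topic)
term_cols = topic_columns(toy_term_topic)

# Fail loudly if the two tables disagree on the number of topics
if len(doc_cols) != len(term_cols):
    raise ValueError("The two tables describe different numbers of topics")

print(len(doc_cols), doc_cols, term_cols)
```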


## 3. Check for data issues

First we can see if there is any missing data.

In [7]:
#are there any missing values in the calculated values in the document topic dataframe
print(doc_topic_df[doc_topic_headers].isnull().values.any())

#are there any missing values in the topic term dataframe
print(term_topic_df.isnull().values.any())

False
False


This finds the rows, if any, with a missing value in the `Topic_0_PS1` column.

In [8]:
#Find where the nan's are in the first dataframe if needed
index0 = doc_topic_df['Topic_0_PS1'].index[doc_topic_df['Topic_0_PS1'].apply(numpy.isnan)]

df_index = doc_topic_df.index.values.tolist()
[df_index.index(i) for i in index0]

[]

In [9]:
print(index0)
print(len(index0))

Int64Index([], dtype='int64')
0
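
The same lookup can be written with a boolean mask. This is a sketch on a toy column with made-up values. It returns both the index labels and the zero-based positions of the missing rows.

```python
import numpy
import pandas

# Toy topic column with two missing weights and a non-default index
toy_df = pandas.DataFrame(
    {"Topic_0_PS1": [0.2, numpy.nan, 0.5, numpy.nan]},
    index=[10, 11, 12, 13],
)

# Boolean mask that is True wherever the topic weight is missing
is_missing = toy_df["Topic_0_PS1"].isna()

# Index labels of the rows with a missing weight
missing_labels = toy_df.index[is_missing].tolist()

# Zero-based positions of those same rows
missing_positions = numpy.flatnonzero(is_missing.to_numpy()).tolist()

print(missing_labels)
print(missing_positions)
```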


This prints all the rows for which any cell is NaN.

In [10]:
doc_topic_df[pandas.isnull(doc_topic_df).any(axis=1)]

#or
#test_null_df = doc_topic_df[doc_topic_df['PS2_clean'].isnull()]
#test_null_df

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean,Topic_0_PS1,Topic_1_PS1,Topic_2_PS1
2693,3094431,College of Letters and Science,I became interested in Economics four years ag...,a,becam interest econom four year ago read krugm...,,0.673216,0.019867,0.306917
4454,3154717,College of Letters and Science,I am horrible at making decisions. In the begi...,.,horribl make decis begin high school embarrass...,,0.365282,0.632603,0.002114
6843,3166277,College of Chemistry,-,"Cold, heavy rain drops fell onto my face. Some...",,cold heavi rain drop fell onto face someon scr...,0.333333,0.333333,0.333333
14476,3084615,College of Letters and Science,A,B,,b,0.333333,0.333333,0.333333
18641,3111191,College of Letters and Science,.,"My dad once said, ""Someone can take your wealt...",,dad said someon take wealth properti one mean ...,0.333333,0.333333,0.333333
30183,3119489,College of Letters and Science,"College Admission Essay\November 15, 2015\\Col...",.,colleg admiss essay novemb colleg admiss essay...,,0.301061,0.636183,0.062757
62975,3108845,College of Letters and Science,d,I learned early on in my working life that I d...,,learn earli work life dont favor work money pe...,0.333333,0.333333,0.333333
65978,3158564,College of Letters and Science,0,0,,,0.333333,0.333333,0.333333
72460,3029150,College of Environmental Design,My name is Laura Vasquez and I have always bee...,I,name laura vasquez alway artist complet mean w...,,0.161873,0.585503,0.252623
75667,3067097,College of Letters and Science,My life has been in a constant state of freefa...,0,life constant state freefal im constantli fall...,,0.002139,0.995508,0.002352


This selects the rows in which a raw or cleaned essay is shorter than 10 characters. We may want to consider cutting these in the pre-processing steps of the main analysis.

In [11]:
post_processing_nan = (doc_topic_df['PS1'].str.len() < 10) | (doc_topic_df['PS2'].str.len() < 10) | (doc_topic_df['PS1_clean'].str.len() < 10) | (doc_topic_df['PS2_clean'].str.len() < 10)

# This uses the boolean mask from above to select the matching rows
doc_topic_df_short_essays = doc_topic_df.loc[post_processing_nan]
doc_topic_df_short_essays

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean,Topic_0_PS1,Topic_1_PS1,Topic_2_PS1
2693,3094431,College of Letters and Science,I became interested in Economics four years ag...,a,becam interest econom four year ago read krugm...,,0.673216,0.019867,0.306917
4454,3154717,College of Letters and Science,I am horrible at making decisions. In the begi...,.,horribl make decis begin high school embarrass...,,0.365282,0.632603,0.002114
6531,3119005,College of Letters and Science,Tajikistan is a small country in Central Asia....,see1,tajikistan small countri central asia place ta...,see,0.411828,0.418693,0.16948
6843,3166277,College of Chemistry,-,"Cold, heavy rain drops fell onto my face. Some...",,cold heavi rain drop fell onto face someon scr...,0.333333,0.333333,0.333333
14476,3084615,College of Letters and Science,A,B,,b,0.333333,0.333333,0.333333
18641,3111191,College of Letters and Science,.,"My dad once said, ""Someone can take your wealt...",,dad said someon take wealth properti one mean ...,0.333333,0.333333,0.333333
30183,3119489,College of Letters and Science,"College Admission Essay\November 15, 2015\\Col...",.,colleg admiss essay novemb colleg admiss essay...,,0.301061,0.636183,0.062757
31182,3009014,College of Natural Resources,xxxx,xxxx,xxxx,xxxx,0.333333,0.333333,0.333333
33411,3165372,College of Engineering,I started my first few years of life in a Hond...,I have not discovered who I am.,start first year life hondura highest murder r...,discov,0.267467,0.730718,0.001815
40698,3115474,College of Letters and Science,"A Better Me\As Paul Day said, ""All separation ...",All in Statement 1,better paul day said separ involv suspend mome...,statement,0.360296,0.460946,0.178758
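
One detail worth noting: `.str.len()` returns NaN for a missing value, and `NaN < 10` evaluates to False, so a missing cleaned essay is only flagged when another column in the same row is short. The sketch below, on toy data with made-up text, treats missing values as empty strings so they always count as short. The names `min_length` and `text_columns` are illustrative.

```python
import pandas

toy_df = pandas.DataFrame(
    {
        "PS1": ["A long enough first essay.", "-", "Another long first essay."],
        "PS1_clean": ["long enough first essay", None, "short"],
    }
)

min_length = 10  # same threshold as the cell above
text_columns = ["PS1", "PS1_clean"]

# Treat a missing value as an empty string so it counts as short
lengths = toy_df[text_columns].fillna("").apply(lambda col: col.str.len())

# Flag a row if any of its text columns falls below the threshold
too_short = (lengths < min_length).any(axis=1)
short_rows = toy_df.loc[too_short]

print(short_rows)
```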


## 4. Interpret the topics

### Create Top Word Lists

For the sake of comparison, this creates a dataframe in which each topic has a list of topic words and weights sorted in descending order, from most important to least important. This can be helpful when trying to interpret the topics.

For the future, this should be written into a script that generates this table for `n` number of topics.

This creates a function that can be used for any number of topics.

In [12]:
def top_word_df():
    # Build one sorted (term, weight) column pair per topic, then join them side by side
    frames = []
    for i in range(num_topics):
        # Sort the terms by their weight in topic i, heaviest first
        temp_df = term_topic_df.sort_values(by=term_topic_headers[i], ascending=False)
        # Drop the weight columns that belong to the other topics
        temp_list = [ head for head in term_topic_headers if head != term_topic_headers[i] ]
        temp_df = temp_df.drop(temp_list, axis=1)
        temp_df = temp_df.reset_index(drop=True)
        temp_df = temp_df.rename(columns={'Feature_Words' : top_terms_topic_names[i],
                                          term_topic_headers[i] : feature_weight_names[i]})
        frames.append(temp_df)
    return pandas.concat(frames, axis=1)

all_top_features = top_word_df()
all_top_features

Unnamed: 0,Top_Terms_T_0,Feature_Weights_T_0,Top_Terms_T_1,Feature_Weights_T_1,Top_Terms_T_2,Feature_Weights_T_2
0,school,86039.590577,life,100064.574918,world,38043.347045
1,world,58136.494662,famili,99783.739578,scienc,33081.132753
2,commun,57608.361133,parent,81673.816297,learn,32070.935068
3,peopl,51039.094573,want,79367.610510,time,29462.060155
4,cultur,44998.253142,school,73574.449997,engin,27095.455150
5,differ,42536.877762,work,67226.393059,like,26820.819289
6,learn,40204.745620,help,62737.435635,comput,25315.216989
7,student,39670.078746,time,62074.096954,work,23408.738369
8,famili,39068.970116,year,55850.982757,use,23209.784246
9,live,34191.826005,alway,55383.174650,year,22007.700785
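
When only the top few terms per topic are needed, `DataFrame.nlargest` avoids sorting and concatenating whole tables. The sketch below uses a toy term-topic table with made-up weights, and the helper name `top_terms` is hypothetical.

```python
import pandas

# Toy term-topic table (weights are made up)
toy_term_topic = pandas.DataFrame(
    {
        "Feature_Words": ["school", "famili", "scienc", "world"],
        "Topic_0_features": [9.0, 4.0, 1.0, 6.0],
        "Topic_1_features": [2.0, 8.0, 7.0, 3.0],
    }
)


def top_terms(term_topic, weight_column, n):
    """Return the n terms with the largest weight in one topic column."""
    top_rows = term_topic.nlargest(n, weight_column)
    return top_rows["Feature_Words"].tolist()


top_two = {
    column: top_terms(toy_term_topic, column, 2)
    for column in ["Topic_0_features", "Topic_1_features"]
}
print(top_two)
```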


This creates a table of the top 100 words which can be exported and reviewed to make sense of the topics.

In [13]:
top_100_words_all = all_top_features.head(n=100)
top_100_words_all

Unnamed: 0,Top_Terms_T_0,Feature_Weights_T_0,Top_Terms_T_1,Feature_Weights_T_1,Top_Terms_T_2,Feature_Weights_T_2
0,school,86039.590577,life,100064.574918,world,38043.347045
1,world,58136.494662,famili,99783.739578,scienc,33081.132753
2,commun,57608.361133,parent,81673.816297,learn,32070.935068
3,peopl,51039.094573,want,79367.610510,time,29462.060155
4,cultur,44998.253142,school,73574.449997,engin,27095.455150
5,differ,42536.877762,work,67226.393059,like,26820.819289
6,learn,40204.745620,help,62737.435635,comput,25315.216989
7,student,39670.078746,time,62074.096954,work,23408.738369
8,famili,39068.970116,year,55850.982757,use,23209.784246
9,live,34191.826005,alway,55383.174650,year,22007.700785


In [14]:
def print_top_words():
    # Print the top words of each topic as one comma-separated line
    for i in range(num_topics):
        print('Topic '+str(i)+' Top Words')
        word_list = top_100_words_all[top_terms_topic_names[i]].tolist()
        print(", ".join(word_list))
        print('')
print_top_words()

Topic 0 Top Words
school, world, commun, peopl, cultur, differ, learn, student, famili, live, life, experi, help, educ, year, mani, high, opportun, new, work, countri, parent, person, make, becom, develop, studi, class, come, understand, dream, languag, way, american, chang, busi, aspir, valu, citi, environ, friend, place, social, import, like, divers, attend, want, societi, futur, academ, time, howev, activ, abl, shape, state, passion, believ, club, taught, allow, colleg, success, alway, challeng, pursu, grow, realiz, program, knowledg, group, goal, english, provid, inspir, intern, continu, teacher, travel, career, individu, background, influenc, chines, hope, skill, home, univers, unit, creat, better, speak, perspect, use, tradit, children, member, econom, small

Topic 1 Top Words
life, famili, parent, want, school, work, help, time, year, alway, mother, make, peopl, live, father, like, day, know, mom, becom, thing, dream, person, hard, come, learn, way, dad, home, brother, love, wor

This exports the table of top words for each topic with weights.

In [15]:
#top_100_words_all.to_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/model output/test_wordlist.csv", sep=',')

### Review Coded Essays

This gives us a sense of which essays are most representative of each topic.

Next we can print the most representative essays for each topic (five per topic in the call below).

In [16]:
def print_top_docs(number):
    i = 0

    while i < num_topics:
        print_index = 1
        list_of_top_docs = doc_topic_df.sort_values(by=doc_topic_headers[i], ascending=False).head(number).index.tolist()
        for x in list_of_top_docs:
            print("Topic "+str(i)+": #"+str(print_index)+" most representative (of "+str(number)+")")
            print('CPID: '+str(doc_topic_df['CPID'][x]))
            print(doc_topic_df['PS1'][x])
            print('')
            print_index += 1
        i += 1
            
print_top_docs(5)

Topic 0: #1 most representative (of 5)
CPID: 3060626
Living in such a culturally diverse state as California, paradoxes are ubiquitous. Diversity breeds paradoxes. The paradox that has been most prevalent in my life has been my parents' relationship and its enduring nature. \\At its basic level, my parents' relationship is a beautiful love story. My father was born in Tehran, Iran and was raised as a devout Muslim. Attending Islamic schools, he enrolled in the Iranian military to fight a religious war against Iraq during the 1980's. In contrast, my mother was born in a very impoverished community in East Los Angeles to Mexican immigrants. Raised with Catholic teachings, she once entertained the thought of becoming a nun in order to fulfill her religious duties. \\Indeed, my parents grew up in entirely different cultures, but as the old adage goes, opposites attract. Thankfully for me, the environment my parents cultivated has instilled within me a holistic, rich, and vibrant perspectiv

This code can be used to see any essay if you know its CPID.

In [17]:
# Use this code to print any one essay in full (lookup by CPID)
CPID_to_find = 3060626
print(doc_topic_df['PS1'][doc_topic_df[doc_topic_df['CPID'] == CPID_to_find].index.tolist()[0]])

Living in such a culturally diverse state as California, paradoxes are ubiquitous. Diversity breeds paradoxes. The paradox that has been most prevalent in my life has been my parents' relationship and its enduring nature. \\At its basic level, my parents' relationship is a beautiful love story. My father was born in Tehran, Iran and was raised as a devout Muslim. Attending Islamic schools, he enrolled in the Iranian military to fight a religious war against Iraq during the 1980's. In contrast, my mother was born in a very impoverished community in East Los Angeles to Mexican immigrants. Raised with Catholic teachings, she once entertained the thought of becoming a nun in order to fulfill her religious duties. \\Indeed, my parents grew up in entirely different cultures, but as the old adage goes, opposites attract. Thankfully for me, the environment my parents cultivated has instilled within me a holistic, rich, and vibrant perspective. Specifically, my family's background has enhanced 
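
The lookup can also be wrapped in a small helper that reports a missing ID instead of failing with an `IndexError` on an empty list. This is a sketch on toy data with made-up IDs and text, and `essay_for` is a hypothetical name.

```python
import pandas

toy_df = pandas.DataFrame(
    {"CPID": [101, 102, 103], "PS1": ["first essay", "second essay", "third essay"]}
)


def essay_for(df, cpid, column="PS1"):
    """Return the essay text for one CPID, or raise KeyError if the ID is absent."""
    matches = df.loc[df["CPID"] == cpid, column]
    if matches.empty:
        raise KeyError(f"No row with CPID {cpid}")
    # iloc[0] takes the first match by position, regardless of index labels
    return matches.iloc[0]


print(essay_for(toy_df, 102))
```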

## 5. Create Top Topic Column

Now we can begin to prepare the document-topic matrix for further analysis.

In [18]:
#the issue with this is that for those essays where there is no clear top topic, such as
#those with no remaining words on which to deploy the model, they get assigned the first column.
doc_topic_df['Top_Topic'] = doc_topic_df[doc_topic_headers].idxmax(axis=1)

This is supposed to allow me to mark/mask rows where the top topic weights are equal. It doesn't seem to work for the purposes of the cross tabs below.

In [19]:
# http://stackoverflow.com/questions/40331738/idxmax-equality-with-pandas
equality_df = doc_topic_df.eq(doc_topic_df.max(axis=1), axis=0).sum(axis=1)
#print(equality_df)

doc_topic_df['Top_Topic'] = doc_topic_df['Top_Topic'].mask(equality_df > 1, 'Equality')
#mydf['winner'] = mydf['winner'].mask(s > 1, 'Equality')
doc_topic_df

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean,Topic_0_PS1,Topic_1_PS1,Topic_2_PS1,Top_Topic
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...,bojio play type famili whatsapp group chat old...,costum torn tatter shirt pant old colour fade ...,0.409134,0.197232,0.393634,Topic_0_PS1
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ...",world shape lean man look like despit age look...,sinc childhood yearn utopia know appear unreal...,0.047477,0.281248,0.671275,Topic_2_PS1
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...,come mediocr famili malaysia malaysian expos d...,rais famili academ number one prioriti excel g...,0.382529,0.140288,0.477183,Topic_2_PS1
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...,mexicanamerican one amaz thing could ever fina...,learn femal stereotyp earli life women weak ca...,0.370764,0.627213,0.002023,Topic_1_PS1
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...,realli confus fill inform come standard test s...,bu loud smell help ac yellow school bu transpo...,0.266908,0.632173,0.100919,Topic_1_PS1
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins...",sinc born parent want identifi dutch american ...,past want inform inspir knew look classroom te...,0.709202,0.034891,0.255907,Topic_0_PS1
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...,think cousin sid see guy love nascar alway ask...,there reason peopl buy magazin want like way l...,0.014716,0.931561,0.053723,Topic_1_PS1
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...,born suburb east seattl middl class famili two...,ive defin moment throughout life chang futur o...,0.454547,0.002095,0.543358,Topic_2_PS1
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,...",youngest famili held much higher standard olde...,one talent proud danc danc sinc kindergarten f...,0.002823,0.848057,0.149120,Topic_1_PS1
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an...",live grand expect cardiologistprofessor nutrit...,throughout life ive explor enjoy challeng inte...,0.002717,0.446715,0.550568,Topic_2_PS1


This shows the challenge with my work so far. The essay below has no clear top topic, but my script coded it as topic 0.

In [20]:
doc_topic_df.loc[doc_topic_df['CPID'] == 3166277]

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean,Topic_0_PS1,Topic_1_PS1,Topic_2_PS1,Top_Topic
6843,3166277,College of Chemistry,-,"Cold, heavy rain drops fell onto my face. Some...",,cold heavi rain drop fell onto face someon scr...,0.333333,0.333333,0.333333,Topic_0_PS1
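
One possible explanation, offered as an assumption rather than a verified diagnosis, is that the equality check above runs over every column of `doc_topic_df`. A non-topic column such as `CPID` can then supply the row maximum, so no row ever has more than one match. The sketch below restricts the comparison to the topic columns and uses `numpy.isclose` to tolerate floating-point noise. The data and column names are made up.

```python
import numpy
import pandas

toy_df = pandas.DataFrame(
    {
        "CPID": [3000001, 3000002, 3000003],
        "Topic_0": [0.7, 1 / 3, 0.1],
        "Topic_1": [0.2, 1 / 3, 0.45],
        "Topic_2": [0.1, 1 / 3, 0.45],
    }
)
topic_cols = ["Topic_0", "Topic_1", "Topic_2"]

# Only the topic weights take part in the comparison
weights = toy_df[topic_cols]
row_max = weights.max(axis=1)

# Count, per row, how many topic weights are approximately equal to the row maximum
is_max = numpy.isclose(weights.to_numpy(), row_max.to_numpy()[:, None])
n_at_max = pandas.Series(is_max.sum(axis=1), index=toy_df.index)

# Keep the idxmax label unless two or more topics tie for first place
toy_df["Top_Topic"] = weights.idxmax(axis=1).mask(n_at_max > 1, "Equality")
print(toy_df[["CPID", "Top_Topic"]])
```

Rows marked `Equality` would then appear as their own column in a cross tabulation instead of being counted under the first topic.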


Export the new data. This code works, but run it only as needed.

In [21]:
#doc_topic_df.to_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/model output/Admissions_Coded.gzip", compression='gzip', sep=',')

## 6. Analyze the Data

This cross-tabulates college against top topic to show how the essays are distributed.

In [22]:
pandas.crosstab(doc_topic_df.College, doc_topic_df.Top_Topic, margins=True)

College,Topic_0_PS1,Topic_1_PS1,Topic_2_PS1,All
College of Chemistry,789,1393,943,3125
College of Engineering,5019,6607,7026,18652
College of Environmental Design,394,624,413,1431
College of Letters and Science,17069,22834,13219,53122
College of Natural Resources,1554,3326,1324,6204
All,24825,34784,22925,82534
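
Because the colleges differ greatly in size, row shares can be easier to compare than raw counts. The sketch below uses the `normalize` argument of `pandas.crosstab` on toy data with made-up values.

```python
import pandas

toy_df = pandas.DataFrame(
    {
        "College": ["Engineering", "Engineering", "Chemistry", "Chemistry"],
        "Top_Topic": ["Topic_0", "Topic_1", "Topic_1", "Topic_1"],
    }
)

# normalize='index' divides each row by its total, giving topic shares per college
shares = pandas.crosstab(toy_df["College"], toy_df["Top_Topic"], normalize="index")
print(shares)
```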


## 7. Scratch Cells

In [23]:
#dupes = top_100_words_all[top_100_words_all.duplicated(['Feature_Words_S0', 'Feature_Words_S1', 'Feature_Words_S2'], keep=False)]
#dupes

In [24]:
#Original top word code
'''
# ",".join(myList )

print('Topic 0 Top Words')
word_list_topic_0 = top_100_words_all['Feature_Words_S0'].tolist()
word_list_topic_0 = myString = ", ".join(word_list_topic_0)
print(word_list_topic_0)
print('')

print('Topic 1 Top Words')
word_list_topic_1 = top_100_words_all['Feature_Words_S1'].tolist()
word_list_topic_1 = myString = ", ".join(word_list_topic_1)
print(word_list_topic_1)
print('')

print('Topic 2 Top Words')
word_list_topic_2 = top_100_words_all['Feature_Words_S2'].tolist()
word_list_topic_2 = myString = ", ".join(word_list_topic_2)
print(word_list_topic_2)
print('')
'''


In [25]:
#Original top word dataframe code.  This was written into a function above

'''
df_0_sort = term_topic_df.sort_values(by=('Topic_0_PS1_features'), ascending=False)
df_0_sort = df_0_sort.drop(['Topic_1_PS1_features', 'Topic_2_PS1_features'], axis=1)
df_0_sort.rename(columns={'Feature_Words': 'Feature_Words_S0',
                  'Topic_0_PS1_features': 'Topic_0_PS1_features_S0'}, inplace=True)
df_0_sort = df_0_sort.reset_index(drop=True)

df_1_sort = term_topic_df.sort_values(by=('Topic_1_PS1_features'), ascending=False)
df_1_sort = df_1_sort.drop(['Topic_0_PS1_features', 'Topic_2_PS1_features'], axis=1)
df_1_sort.rename(columns={'Feature_Words': 'Feature_Words_S1',
                  'Topic_1_PS1_features': 'Topic_1_PS1_features_S1'}, inplace=True)
df_1_sort = df_1_sort.reset_index(drop=True)

df_2_sort = term_topic_df.sort_values(by=('Topic_2_PS1_features'), ascending=False)
df_2_sort = df_2_sort.drop(['Topic_0_PS1_features', 'Topic_1_PS1_features'], axis=1)
df_2_sort.rename(columns={'Feature_Words': 'Feature_Words_S2',
                  'Topic_2_PS1_features': 'Topic_2_PS1_features_S2'}, inplace=True)
df_2_sort = df_2_sort.reset_index(drop=True)

#this concatinates the dataframes we just created above.
all_top_features = pandas.concat([df_0_sort, df_1_sort, df_2_sort], axis=1)
all_top_features
'''

In [26]:
# Old code to print top docs.  Made into a function above

'''
# This generates lists of ID's for the most important topics
list_of_top_docs_topic_0 = doc_topic_df.sort_values(by='Topic_0_PS1', ascending=False).head(10).index.tolist()
list_of_top_docs_topic_1 = doc_topic_df.sort_values(by='Topic_1_PS1', ascending=False).head(10).index.tolist()
list_of_top_docs_topic_2 = doc_topic_df.sort_values(by='Topic_2_PS1', ascending=False).head(10).index.tolist()

# This code, which is repeated, prints and formats the most represtentative topics
print_index = 1
num_essays = len(list_of_top_docs_topic_0)

for i in list_of_top_docs_topic_0:
    print("Topic 0: # "+str(print_index)+" most representative (of "+str(num_essays)+")")
    print('CPID: '+str(doc_topic_df['CPID'][i]))
    print(doc_topic_df['PS1'][i])
    print('')
    print_index = print_index + 1

print_index = 1
num_essays = len(list_of_top_docs_topic_1)

for i in list_of_top_docs_topic_1:
    print("Topic 1: # "+str(print_index)+" most representative (of "+str(num_essays)+")")
    print('CPID: '+str(doc_topic_df['CPID'][i]))
    print(doc_topic_df['PS1'][i])
    print('')
    print_index = print_index + 1

print_index = 1
num_essays = len(list_of_top_docs_topic_2)
    
for i in list_of_top_docs_topic_2:
    print("Topic 2: # "+str(print_index)+" most representative (of "+str(num_essays)+")")
    print('CPID: '+str(doc_topic_df['CPID'][i]))
    print(doc_topic_df['PS1'][i])
    print('')
    print_index = print_index + 1
'''
