# Analysing Admissions Essays: Unsupervised Approaches using scikit-learn

This notebook is designed to analyze every admissions essay submitted to Berkeley in the 2014-2015 academic year using topic modeling in Python's scikit-learn package.  It begins with two CSV files (UTF-8), one for each of the two essay prompts that year, each containing unique and corresponding dummy IDs.  Both files have column headers.

### Personal Statement 1
1. Freshmen: Describe the world you come from — for example, your family, community or school — and tell us how your world has shaped your dreams and aspirations.
1. Transfers: What is your intended major? Discuss how your interest in the subject developed and describe any experience you have had in the field — such as volunteer work, internships and employment, participation in student organizations and activities — and what you have gained from your involvement.

### Personal Statement 2
1. Tell us about a personal quality, talent, accomplishment, contribution or experience that is important to you. What about this quality or accomplishment makes you proud, and how does it relate to the person you are?

### Outline
1. Import and view the data using Pandas
  1. Import the data into a Pandas Dataframe
  1. Label the columns
  1. Merge the dataframes
  1. Review the Data
1. Explore the Data & Drop missing values
  1. Are the ID's Unique?
  2. Find Missing Data
  1. Drop Missing Data
1. Pre-Processing the Essays
  1. Cleaning the text and tokenizing
  1. Remove Stopwords
  1. Stem the Tokens
1. Creating a sample for testing
1. Creating the DTM: scikit-learn
  1. CountVectorizer function
1. What can we do with a DTM?
1. Tf-idf scores
  1. TfidfVectorizer function
1. Identifying Distinctive Words
  1. Application: Identify distinctive words by genre
1. Uncovering patterns using LDA

## 1. Import and view the data using Pandas

First, we read our corpus, which is stored as a .csv file on our hard drive, into a Pandas dataframe. 

Note: Pandas is great for data munging and basic calculations because it's so easy to use, and its data structure is really intuitive. It's not memory efficient however, so you might quickly need to move away from it. 

### Import the data into a Pandas Dataframe

To get started, we need to import a few packages.

In [1]:
import pandas
import numpy
import socket
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import time
import datetime

In [2]:
#print(socket.gethostname())
run_on = socket.gethostname()

if run_on == 'BensMBP.local':
    df1 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS1_F16.csv", sep = ',', encoding = 'utf_8')
    df2 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS2_F16.csv", sep = ',', encoding = 'utf_8')
    print('The script is running locally.')
elif run_on == 'mercury':
    df1 = pandas.read_csv("../data/originals/PS1_F16.csv", sep = ',', encoding = 'utf_8')
    df2 = pandas.read_csv("../data/originals/PS2_F16.csv", sep = ',', encoding = 'utf_8')
    print('The script is running on mercury.')
else:
    print('The file path is unclear on this machine.')

The script is running locally.
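
On an unrecognised machine the hostname check above only prints a message, so `df1` and `df2` stay undefined and the failure surfaces later. One possible alternative, sketched here under assumptions, keeps the hostname-to-directory mapping in a dictionary and raises immediately when the host is unknown. The helper name `resolve_data_dir` and the directory `/path/to/local/data` are hypothetical placeholders, not part of the original notebook.

```python
import socket
from pathlib import Path

# Hypothetical mapping from hostname to data directory; adjust to your machines.
DATA_DIRS = {
    "BensMBP.local": Path("/path/to/local/data"),  # placeholder path (assumption)
    "mercury": Path("../data/originals"),
}


def resolve_data_dir(hostname, data_dirs=DATA_DIRS):
    """Return the data directory configured for this host, or fail loudly."""
    try:
        return data_dirs[hostname]
    except KeyError:
        raise RuntimeError(f"No data directory configured for host {hostname!r}")


# Usage (commented out because it needs the real files):
# data_dir = resolve_data_dir(socket.gethostname())
# df1 = pandas.read_csv(data_dir / "PS1_F16.csv", encoding="utf_8")
```

Failing early this way makes it clear that the problem is the file path rather than something in the later analysis.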


Next we can view the data that was read from the CSV files into Pandas and begin our session.

In [3]:
# The dataframes "df1" and "df2" were created in the cell above.



#df1 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS1_F16.csv", sep = ',', encoding = 'utf_8')
#df2 = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Data/PS2_F16.csv", sep = ',', encoding = 'utf_8')

#for the server
#df1 = pandas.read_csv("../data/originals/PS1_F16.csv", sep = ',', encoding = 'utf_8')
#df2 = pandas.read_csv("../data/originals/PS2_F16.csv", sep = ',', encoding = 'utf_8')


# View the dataframe.
# Notice the metadata. The column "Personal Statement 1 (RETIRED)" contains our text of interest.
# You can move the hashtag to view the other dataframe.
df1
# df2

Unnamed: 0,"﻿""ApplyUC Application CPID""",College,Personal Statement 1 (RETIRED)
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...


### Label the Columns

Next we can rename the column headers so they are easier to work with.

In [4]:
# This renames the column headers
df1.columns = ['CPID', 'College', 'PS1']
df2.columns = ['CPID', 'College', 'PS2']

# View the dataframe.  You can move the hashtag to view the other dataframe.
df1
# df2

Unnamed: 0,CPID,College,PS1
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...
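
Assigning a list to `df.columns` relies on the columns arriving in a fixed order. A more defensive variant, sketched here on invented toy data, renames by header name with `DataFrame.rename`. In the real file the first header appears to carry a byte-order mark, so the key would have to match it exactly (reading the file with `encoding='utf-8-sig'` would strip it); the toy frame below sidesteps that.

```python
import pandas

# Toy stand-in for df1 with the original (long) column headers.
toy = pandas.DataFrame({
    "ApplyUC Application CPID": [1, 2],
    "College": ["College of Engineering", "College of Letters and Science"],
    "Personal Statement 1 (RETIRED)": ["essay one", "essay two"],
})

# Rename by name; columns not listed in the mapping are left unchanged.
toy = toy.rename(columns={
    "ApplyUC Application CPID": "CPID",
    "Personal Statement 1 (RETIRED)": "PS1",
})
print(list(toy.columns))
```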


### Merge the Dataframes

Now we will merge the two dataframes on their two common elements (CPID and College) using `merge`.

In [5]:
# Merge the two data frames so that we have one data frame with both questions attached to common CPID's and College.
df = pandas.merge(df1, df2, on=['CPID', 'College'])
df

Unnamed: 0,CPID,College,PS1,PS2
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."
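
By default `merge` performs an inner join, so applicants present in only one file would be dropped silently. A minimal sketch on invented data shows how the `how`, `indicator`, and `validate` parameters of `pandas.merge` can surface unmatched or duplicated keys. Whether the real files contain such rows is not known from this notebook.

```python
import pandas

left = pandas.DataFrame({"CPID": [1, 2, 3],
                         "College": ["A", "B", "A"],
                         "PS1": ["x", "y", "z"]})
right = pandas.DataFrame({"CPID": [1, 2, 4],
                          "College": ["A", "B", "B"],
                          "PS2": ["p", "q", "r"]})

# how="outer" keeps unmatched rows; indicator=True adds a "_merge" column;
# validate="one_to_one" raises an error if a key pair is duplicated.
merged = pandas.merge(left, right, on=["CPID", "College"],
                      how="outer", indicator=True, validate="one_to_one")

# Rows that were found in only one of the two frames.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["CPID", "_merge"]])
```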


### Review the data

It can be helpful to see how much memory is being used by this new dataframe.  We can do that with the `info` method.  We can also view individual essays, housed in particular cells, in full.

In [6]:
# Check the amount of memory being occupied by this newly created element.  
print(df.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82574 entries, 0 to 82573
Data columns (total 4 columns):
CPID       82574 non-null int64
College    82574 non-null object
PS1        82544 non-null object
PS2        82542 non-null object
dtypes: int64(1), object(3)
memory usage: 436.7 MB
None
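
To see which columns account for the memory, `DataFrame.memory_usage(deep=True)` reports bytes per column. The sketch below uses an invented two-row frame; the exact numbers depend on the pandas version and string storage.

```python
import pandas

toy = pandas.DataFrame({
    "CPID": [1, 2],
    "PS1": ["a fairly long essay " * 20, "another long essay " * 20],
})

# Bytes per column; deep=True inspects the actual string contents.
usage = toy.memory_usage(deep=True)
print(usage)
```

In a frame like ours, the essay columns can be expected to dominate.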


It is important to review the data contained in the new dataframe we created.  This code looks at an essay in full.

In [7]:
# Print the first essay from the column 'PS1'.  Using print() shows the text more faithfully than the dataframe preview.
print(df['PS1'][0])

Bojio!\\That was what I playfully typed on my family's Whatsapp group chat after my older brother posted a picture of his and my sister in law's Bali resort. It was an expression that travelled from my mind to my flitting fingertips almost immediately. The resort was simply the image of serenity and solitude- and which student going through examination stress would not want to be a part of that?\\It was only when I got home and slumped on the sofa that I saw the nervous look on my mother's face. Her kohl-rimmed eyes were wide and her vermillion adorned forehead scrunched up as she asked, utterly confused, "What's bojio?"\\I burst out laughing. Sometimes, I forgot how every day brought around a new culture shock when you lived in a traditional Indian family but grew up in a multiracial community. The Hokkien phrase "bojio", literally meaning "never invite", is a popular colloquialism in Singapore to teasingly express annoyance at not being invited to something. My brother, who has grown

## 2. Explore the Data using Pandas

Let's first evaluate the general nature of the data to see if the ID's are unique, if there is any missing data, etc.  
We can also look at some descriptive statistics about this data set to get a feel for what's in it. We'll do this using the Pandas package. 

Note: this is always good practice. It serves two purposes. It checks that your data is correct and that there are no major errors. It also keeps you in touch with your data, which will help with interpretation. Love your data!

### Are the ID's Unique?

IDs with more than one essay can be found by counting how often each CPID occurs and ranking the counts.

In [8]:
#This tells us if we have any duplicate IDs.  If each response is 1 we are ok.
print(df['CPID'].value_counts())

# This checks for duplicate CPIDs.  If the array is empty there are no duplicates.
print()
print("Array containing duplicate CPIDs:")
print(df.loc[df['CPID'].duplicated(), 'CPID'].unique())

3016703    1
3002500    1
3146453    1
3152598    1
3019479    1
3042008    1
3039961    1
3018728    1
3035871    1
3123936    1
3125987    1
3115748    1
3030135    1
3117799    1
3009256    1
3013354    1
3130093    1
3136238    1
3003119    1
3027667    1
3029714    1
3068623    1
3056321    1
3084983    1
3107512    1
3105465    1
3109563    1
3103422    1
3101375    1
3058368    1
          ..
3044772    1
3014037    1
3040678    1
3042727    1
3151272    1
3153321    1
3012128    1
3132004    1
3028396    1
3007894    1
3132819    1
3048826    1
3132755    1
3134800    1
3061116    1
3059071    1
3136223    1
3100035    1
3110276    1
3021600    1
3085704    1
3130770    1
3081610    1
3144558    1
3093900    1
3095949    1
3089806    1
3091855    1
3132125    1
3116651    1
Name: CPID, dtype: int64

Array containing duplicate CPIDs:
[]
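
If duplicates did turn up, it would help to see the offending rows rather than just the IDs. A minimal sketch on invented data, using `Series.is_unique` for a quick yes/no answer and `Series.duplicated` to pull out the repeated values and their rows:

```python
import pandas

toy = pandas.DataFrame({"CPID": [10, 11, 11, 12, 12, 12],
                        "PS1": list("abcdef")})

# True only when every CPID appears exactly once.
print(toy["CPID"].is_unique)

# duplicated() flags the second and later occurrences of each value.
dupes = toy.loc[toy["CPID"].duplicated(), "CPID"].unique()
print(dupes)

# keep=False flags every row that shares its CPID with another row.
dupe_rows = toy[toy["CPID"].duplicated(keep=False)]
print(dupe_rows)
```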


### Find missing essays

Advanced operations will not work with empty data.  The next few steps are designed to find, explore and purge records with missing data.

In [9]:
# This creates a variable empties

# First for PS1
print('Summarizing missing data for PS1:')
empties_PS1 = numpy.where(pandas.isnull(df['PS1']))[0]

print(empties_PS1)

# Notice that this is a NumPy array, not a list.  The next operation, "list", gets it into the right format.
empties_PS1 = list(empties_PS1)
print(empties_PS1)

#This counts the number of missing essays for PS1.
print(len(empties_PS1))

#This lists the rows with missing data.
df.iloc[empties_PS1]

Summarizing missing data for PS1:
[ 1776  3206  3566  6285  6801  7530  7930  8111  8571 11796 12977 15694
 19073 23667 24682 26014 28080 28573 29154 29548 31818 40212 41898 44980
 53738 64612 73519 74177 79276 81423]
[1776, 3206, 3566, 6285, 6801, 7530, 7930, 8111, 8571, 11796, 12977, 15694, 19073, 23667, 24682, 26014, 28080, 28573, 29154, 29548, 31818, 40212, 41898, 44980, 53738, 64612, 73519, 74177, 79276, 81423]
30


Unnamed: 0,CPID,College,PS1,PS2
1776,3133634,College of Engineering,,
3206,3157746,College of Letters and Science,,
3566,3041638,College of Letters and Science,,
6285,3108688,College of Natural Resources,,I never described myself as super religious. W...
6801,3052354,College of Letters and Science,,
7530,3055366,College of Letters and Science,,"At a young age the word ""weird "" or "" differen..."
7930,3056046,College of Engineering,,
8111,3001798,College of Letters and Science,,
8571,3145221,College of Natural Resources,,"The ocean is my world. From an early age, the ..."
11796,3092946,College of Letters and Science,,


In [10]:
# Repeat the above steps for PS2
print('Summarizing missing data for PS2')
empties_PS2 = numpy.where(pandas.isnull(df['PS2']))[0]
print(empties_PS2)
empties_PS2 = list(empties_PS2)
print(empties_PS2)
print(len(empties_PS2))
df.iloc[empties_PS2]

Summarizing missing data for PS2
[ 1222  1776  3206  3566  3667  3900  6801  7930  8111 11796 12977 14552
 16926 19073 23667 24682 26014 28573 31818 36381 40212 41898 44980 53738
 57647 60910 62127 72379 73519 74177 79276 81423]
[1222, 1776, 3206, 3566, 3667, 3900, 6801, 7930, 8111, 11796, 12977, 14552, 16926, 19073, 23667, 24682, 26014, 28573, 31818, 36381, 40212, 41898, 44980, 53738, 57647, 60910, 62127, 72379, 73519, 74177, 79276, 81423]
32


Unnamed: 0,CPID,College,PS1,PS2
1222,3152303,College of Letters and Science,"Every day, the mirror reminds me that I carry ...",
1776,3133634,College of Engineering,,
3206,3157746,College of Letters and Science,,
3566,3041638,College of Letters and Science,,
3667,3162451,College of Natural Resources,DAD'S BACK ACHED BECAUSE OF ME. Mom's hands mo...,
3900,3060413,College of Letters and Science,"I am lying down on my belly, savoring the exqu...",
6801,3052354,College of Letters and Science,,
7930,3056046,College of Engineering,,
8111,3001798,College of Letters and Science,,
11796,3092946,College of Letters and Science,,


Next we can create a list of every ID which has at least one missing essay.

In [11]:
#This combines the two lists of missing data without duplicating anything.
empties_any = empties_PS1 + list(set(empties_PS2) - set(empties_PS1))
empties_any.sort()

print(empties_any)
print('There are', len(empties_any), 'CPIDs with at least one missing essay.')

[1222, 1776, 3206, 3566, 3667, 3900, 6285, 6801, 7530, 7930, 8111, 8571, 11796, 12977, 14552, 15694, 16926, 19073, 23667, 24682, 26014, 28080, 28573, 29154, 29548, 31818, 36381, 40212, 41898, 44980, 53738, 57647, 60910, 62127, 64612, 72379, 73519, 74177, 79276, 81423]
There are 40 CPIDs with at least one missing essay.
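
The same set of row positions can also be obtained in one step from a boolean mask. This is a sketch on invented data rather than the notebook's own method: `isna().any(axis=1)` marks every row in which at least one of the two essay columns is missing.

```python
import numpy
import pandas

toy = pandas.DataFrame({
    "PS1": ["a", None, "c", None],
    "PS2": ["w", "x", None, None],
})

# Row-wise: True if either essay is missing.
missing_any = toy[["PS1", "PS2"]].isna().any(axis=1)

# Integer positions of those rows, comparable to empties_any above.
positions = numpy.flatnonzero(missing_any.to_numpy()).tolist()
print(positions)
```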


### Drop the missing essays

This takes the list of row positions that have at least one missing essay and drops those rows, creating a new dataframe in which every cell is populated.

In [12]:
df_no_missing = df.drop(df.index[empties_any])

# df_no_missing = df.dropna()
df_no_missing

Unnamed: 0,CPID,College,PS1,PS2
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an..."
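
The commented-out `dropna()` call above would give the same result whenever the essay columns are the only ones with missing values. A slightly more explicit sketch, on invented data, restricts the check to the essay columns with the `subset` parameter:

```python
import pandas

toy = pandas.DataFrame({
    "CPID": [1, 2, 3, 4],
    "PS1": ["a", None, "c", "d"],
    "PS2": ["w", "x", None, "z"],
})

# Drop rows where either essay is missing; other columns are not considered.
complete = toy.dropna(subset=["PS1", "PS2"])
print(complete["CPID"].tolist())
print(complete.index.tolist())
```

Note that the surviving rows keep their original index labels, which matters for label-based lookups later on.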


## 3. Pre-Processing the Essays

Once we have a Pandas Dataframe in the appropriate structure, we can begin to process the text in the two essay columns.  This involves cleaning the text in a number of ways before tokenizing the essays.  

### Cleaning the text and tokenizing

The cell below chains several preprocessing steps into a single line of code.  The line is repeated once for each of the two essays and produces a new, largely preprocessed and tokenized column.  Most of this is done with the pandas `.str` accessor.  Here is what each step accomplishes:
1. `str.replace('\\', ' ')` - This replaces the idiosyncratic backslashes present in the raw text with spaces.
1. `str.lower()` - This shifts all the letters to lowercase.
1. `str.replace('[^\w\s]','')` - This gets rid of punctuation.  The "`^`" inside the brackets negates the set, the "`\w`" matches any word character (alphanumeric & underscore), and the "`\s`" matches any whitespace character (spaces, tabs, line breaks).
1. `str.replace('[\d]','')` - This gets rid of all digits.
1. `str.split()` - This tokenizes what's left, creating a list within each pandas cell.
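
The five steps above can be tried on a single toy string before running them on the full corpus. This sketch passes `regex` explicitly, since the default of `Series.str.replace` differs across pandas versions; the sample sentence is invented.

```python
import pandas

# One invented "essay" containing backslashes, punctuation and digits.
raw = pandas.Series(["Bojio!\\\\That's 5' 10\" tall."])

tokens = (
    raw.str.replace("\\", " ", regex=False)       # literal backslash -> space
       .str.lower()                               # lowercase everything
       .str.replace(r"[^\w\s]", "", regex=True)   # strip punctuation
       .str.replace(r"\d", "", regex=True)        # strip digits
       .str.split()                               # split on whitespace
)
print(tokens[0])
```

Because punctuation is deleted rather than replaced with a space, contractions and hyphenated words are fused ("that's" becomes "thats").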

In [13]:
#create two new columns with tokenized essay responses.  
#In the same operation it makes everything lowercase.
#regex is passed explicitly because the default for str.replace differs between pandas versions.
df_no_missing['PS1_clean'] = df_no_missing['PS1'].str.replace('\\', ' ', regex=False).str.lower().str.replace(r'[^\w\s]', '', regex=True).str.replace(r'[\d]', '', regex=True).str.split()
df_no_missing['PS2_clean'] = df_no_missing['PS2'].str.replace('\\', ' ', regex=False).str.lower().str.replace(r'[^\w\s]', '', regex=True).str.replace(r'[\d]', '', regex=True).str.split()
df_no_missing

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...,"[bojio, that, was, what, i, playfully, typed, ...","[costume, a, torn, and, tattered, shirt, and, ..."
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ...","[my, world, is, shaped, by, a, lean, man, who,...","[since, childhood, i, have, yearned, for, utop..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...,"[i, come, from, a, mediocre, family, in, malay...","[i, was, raised, in, a, family, where, academi..."
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...,"[being, a, mexicanamerican, now, is, one, of, ...","[i, learned, about, the, female, stereotypes, ..."
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...,"[i, am, really, confused, when, i, have, to, f...","[the, bus, was, loud, and, smelled, it, did, n..."
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins...","[since, before, i, was, born, my, parents, wan...","[in, the, past, when, i, wanted, information, ..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...,"[when, i, think, of, my, cousin, sid, i, see, ...","[theres, a, reason, that, people, buy, magazin..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...,"[i, was, born, in, a, suburb, just, east, of, ...","[ive, had, a, few, defining, moments, througho..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,...","[being, the, youngest, in, my, family, i, have...","[one, talent, that, i, am, proud, to, have, is..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an...","[living, up, to, the, grand, expectations, of,...","[throughout, my, life, ive, explored, enjoyed,..."


In [14]:
# this shows that we've mostly dealt with the odd backslashes and cleaned the text in a bunch of other ways!
print(df_no_missing['PS1_clean'][0])

['bojio', 'that', 'was', 'what', 'i', 'playfully', 'typed', 'on', 'my', 'familys', 'whatsapp', 'group', 'chat', 'after', 'my', 'older', 'brother', 'posted', 'a', 'picture', 'of', 'his', 'and', 'my', 'sister', 'in', 'laws', 'bali', 'resort', 'it', 'was', 'an', 'expression', 'that', 'travelled', 'from', 'my', 'mind', 'to', 'my', 'flitting', 'fingertips', 'almost', 'immediately', 'the', 'resort', 'was', 'simply', 'the', 'image', 'of', 'serenity', 'and', 'solitude', 'and', 'which', 'student', 'going', 'through', 'examination', 'stress', 'would', 'not', 'want', 'to', 'be', 'a', 'part', 'of', 'that', 'it', 'was', 'only', 'when', 'i', 'got', 'home', 'and', 'slumped', 'on', 'the', 'sofa', 'that', 'i', 'saw', 'the', 'nervous', 'look', 'on', 'my', 'mothers', 'face', 'her', 'kohlrimmed', 'eyes', 'were', 'wide', 'and', 'her', 'vermillion', 'adorned', 'forehead', 'scrunched', 'up', 'as', 'she', 'asked', 'utterly', 'confused', 'whats', 'bojio', 'i', 'burst', 'out', 'laughing', 'sometimes', 'i', 'for

### Remove Stopwords

Stopwords are words that appear frequently and tend not to be distinctive.  They are generally removed prior to text analysis unless there is a compelling reason to keep them. [More info](http://www.nltk.org/book/ch02.html#code-unusual)

In [15]:
#stopwords imported from NLTK Above

#Removes english stop words from the tokenized columns
stop_words = stopwords.words('english')
df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: [item for item in x if item not in stop_words])
df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: [item for item in x if item not in stop_words])

df_no_missing

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...,"[bojio, playfully, typed, familys, whatsapp, g...","[costume, torn, tattered, shirt, pants, old, c..."
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ...","[world, shaped, lean, man, looks, like, despit...","[since, childhood, yearned, utopia, know, appe..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...,"[come, mediocre, family, malaysia, malaysian, ...","[raised, family, academics, number, one, prior..."
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...,"[mexicanamerican, one, amazing, things, could,...","[learned, female, stereotypes, early, life, wo..."
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...,"[really, confused, fill, information, comes, s...","[bus, loud, smelled, help, ac, yellow, school,..."
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins...","[since, born, parents, wanted, identify, dutch...","[past, wanted, information, inspiration, knew,..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...,"[think, cousin, sid, see, guy, loves, nascar, ...","[theres, reason, people, buy, magazines, want,..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...,"[born, suburb, east, seattle, middle, class, f...","[ive, defining, moments, throughout, life, cha..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,...","[youngest, family, held, much, higher, standar...","[one, talent, proud, dancing, dancing, since, ..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an...","[living, grand, expectations, cardiologistprof...","[throughout, life, ive, explored, enjoyed, cha..."
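
Membership tests against a Python set take constant time on average, whereas a list is scanned element by element; across tens of thousands of essays that difference adds up. A minimal sketch of the same filter, using a tiny invented stop list in place of NLTK's English list so that it runs without any corpus download:

```python
# A tiny stand-in stop list (assumption); the notebook uses NLTK's English list.
stop_words = {"i", "my", "was", "that", "what", "on"}

tokens = ["bojio", "that", "was", "what", "i", "playfully", "typed", "on", "my"]

# Keep only the tokens that are not stopwords.
kept = [t for t in tokens if t not in stop_words]
print(kept)
```

With the real list, the equivalent change would be wrapping `stopwords.words('english')` in `set(...)`.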


### Stem the Tokens

Stemming reduces words with multiple endings to their common stem.  There are multiple ways to do this, but we will use the Porter Stemmer for our purposes. http://www.bogotobogo.com/python/NLTK/Stemming_NLTK.php

In [16]:
# using the "Porter Stemmer" we'll stem the words
porter_stemmer = PorterStemmer()

# These lines can be used to compare the output of a few cells
#test_list_1 = [porter_stemmer.stem(item) for item in df_no_missing['PS1_clean'][0]]
#test_list_2 = [porter_stemmer.stem(item) for item in df_no_missing['PS2_clean'][8000]]

#This does not work:
#df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: porter_stemmer.stem(item) for item in df_no_missing)
#df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: porter_stemmer.stem(item) for item in df_no_missing)
#also does not work:
#df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: [porter_stemmer.stem(item) for item in df_no_missing['PS1_clean']])
#df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: [porter_stemmer.stem(item) for item in df_no_missing['PS2_clean']])
#also doesn't work:
#df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: item for item in porter_stemmer.stem(x))
#df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: item for item in porter_stemmer.stem(x))

#Testing
#df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: porter_stemmer.stem(item) for item in df_no_missing['PS1_clean'])
#df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: porter_stemmer.stem(item) for item in df_no_missing['PS2_clean'])

df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: [porter_stemmer.stem(item) for item in x])
df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: [porter_stemmer.stem(item) for item in x])


'''
print('Test List 1')
print(test_list_1)
print('Test List 2')
print(test_list_2)
'''

df_no_missing

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...,"[bojio, play, type, famili, whatsapp, group, c...","[costum, torn, tatter, shirt, pant, old, colou..."
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ...","[world, shape, lean, man, look, like, despit, ...","[sinc, childhood, yearn, utopia, know, appear,..."
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...,"[come, mediocr, famili, malaysia, malaysian, e...","[rais, famili, academ, number, one, prioriti, ..."
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...,"[mexicanamerican, one, amaz, thing, could, eve...","[learn, femal, stereotyp, earli, life, women, ..."
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...,"[realli, confus, fill, inform, come, standard,...","[bu, loud, smell, help, ac, yellow, school, bu..."
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins...","[sinc, born, parent, want, identifi, dutch, am...","[past, want, inform, inspir, knew, look, class..."
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...,"[think, cousin, sid, see, guy, love, nascar, a...","[there, reason, peopl, buy, magazin, want, lik..."
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...,"[born, suburb, east, seattl, middl, class, fam...","[ive, defin, moment, throughout, life, chang, ..."
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,...","[youngest, famili, held, much, higher, standar...","[one, talent, proud, danc, danc, sinc, kinderg..."
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an...","[live, grand, expect, cardiologistprofessor, n...","[throughout, life, ive, explor, enjoy, challen..."
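
To get a feel for what the stemmer does, it can be run on a handful of individual words. The words below are taken from the essays shown above; note that the stems are not always dictionary words.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# A few tokens that appear in the essays above.
words = ["family", "dancing", "learned", "stereotypes"]
stems = [stemmer.stem(w) for w in words]
print(list(zip(words, stems)))
```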


In [17]:
'''
#This code works for one cell in the Pandas Dataframe. 
print("test_list_1 - Individual")
print(test_list_1)
print('')
print("test_list_1 - dataframe")
print(df_no_missing['PS1_clean'][0])
print('')
print("test_list_2 - Individual")
print(test_list_2)
print('')
print("test_list_2 - dataframe")
print(df_no_missing['PS2_clean'][8000])
'''


Next we join the tokens back into a single string per essay so that we can run the count vectorizer and create a document-term matrix.

In [18]:
df_no_missing['PS1_clean'] = df_no_missing['PS1_clean'].apply(lambda x: ' '.join(x))
df_no_missing['PS2_clean'] = df_no_missing['PS2_clean'].apply(lambda x: ' '.join(x))


# text_list_stemmed = [' '.join([porter_stemmer.stem(word) for word in sentence.split(" ")]) for sentence in text_list]
df_no_missing

Unnamed: 0,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,3128896,College of Letters and Science,Bojio!\\That was what I playfully typed on my ...,Costume: a torn and tattered shirt and pants s...,bojio play type famili whatsapp group chat old...,costum torn tatter shirt pant old colour fade ...
1,3092833,College of Engineering,"My world is shaped by a 5' 10"", lean man, who ...","Since childhood, I have yearned for utopia. I ...",world shape lean man look like despit age look...,sinc childhood yearn utopia know appear unreal...
2,3142974,College of Engineering,I come from a mediocre family in Malaysia. As ...,I was raised in a family where academics was t...,come mediocr famili malaysia malaysian expos d...,rais famili academ number one prioriti excel g...
3,3020517,College of Letters and Science,Being a Mexican-American now is one of the mos...,I learned about the female stereotypes early i...,mexicanamerican one amaz thing could ever fina...,learn femal stereotyp earli life women weak ca...
4,3121294,College of Natural Resources,I am really confused when I have to fill out i...,The bus was loud and smelled. It did not help ...,realli confus fill inform come standard test s...,bu loud smell help ac yellow school bu transpo...
5,3162551,College of Letters and Science,"Since before I was born, my parents wanted me ...","In the past, when I wanted information and ins...",sinc born parent want identifi dutch american ...,past want inform inspir knew look classroom te...
6,3090215,College of Letters and Science,"When I think of my cousin Sid, I see him as th...",There's a reason that people buy magazines: th...,think cousin sid see guy love nascar alway ask...,there reason peopl buy magazin want like way l...
7,3158702,College of Engineering,"I was born in a suburb just east of Seattle, t...",I've had a few defining moments throughout my ...,born suburb east seattl middl class famili two...,ive defin moment throughout life chang futur o...
8,3095465,College of Natural Resources,"Being the youngest in my family, I have been h...","One talent that I am proud to have is dancing,...",youngest famili held much higher standard olde...,one talent proud danc danc sinc kindergarten f...
9,3147750,College of Letters and Science,Living up to the grand expectations of a cardi...,"Throughout my life, I've explored, enjoyed, an...",live grand expect cardiologistprofessor nutrit...,throughout life ive explor enjoy challeng inte...


## 4. Creating a sample for testing

In this section we'll create a smaller sample of the data to check that the analysis we construct below works.

In [19]:
# This generates a random sample of N essays, with a random state set for reproducibility.
# N can be increased gradually to test the computational resources required as you scale up.
df_sample = df_no_missing.sample(n=500, random_state=0)

# Sort by the original row labels, then reset the index: the original labels are kept
# in a new 'index' column and fresh consecutive ones (0, 1, 2, ...) are generated.
df_sample = df_sample.sort_index()
df_sample = df_sample.reset_index()

#This is where I assign the sample data to be analyzed.  If I want to run the whole dataset, comment this out.
#df_no_missing = df_sample

df_sample

Unnamed: 0,index,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,338,3163703,College of Natural Resources,"As I stared into the vulgar man's eyes, I felt...",Living in a society that praises women for the...,stare vulgar man eye felt nail dig deeper fles...,live societi prais women extern beauti instead...
1,439,3056579,College of Engineering,I come from an Indian family whose ancestry tr...,My involvement in speech and debate has had a ...,come indian famili whose ancestri trace back i...,involv speech debat signific impact ive becom ...
2,722,3113770,College of Letters and Science,Whenever I walked into my mother's room at the...,I was part of a student volunteer project to e...,whenev walk mother room begin month would ofte...,part student volunt project encourag physic me...
3,794,3120168,College of Engineering,The early years of my childhood in Mexico were...,My life is one of numbers. The universal langu...,earli year childhood mexico easi fact first ye...,life one number univers languag math transcend...
4,822,3150465,College of Letters and Science,My social life in reality was not exactly stel...,"A few months ago, I had an experience that sho...",social life realiti exactli stellar start foun...,month ago experi show much leadership skill im...
5,1102,3047627,College of Chemistry,Most people acquire a unique personality based...,"A human's natural instinct is to fear failure,...",peopl acquir uniqu person base upon individu c...,human natur instinct fear failur go life avoid...
6,1302,3094299,College of Letters and Science,"First, I noticed her trembling hands. In the y...","I knocked tentatively on her door, fidgeting u...",first notic trembl hand year known never seen ...,knock tent door fidget uneasili muffl cri echo...
7,1467,3112108,College of Natural Resources,Sophomore year I decided to transfer to Campbe...,"In 2nd grade, I joined Cub Scouts with a few o...",sophomor year decid transfer campbel hall desi...,nd grade join cub scout friend love outdoor en...
8,1478,3129573,College of Engineering,"Every year, distant relatives gather at our ho...",I've always looked forward to the day I'd fini...,everi year distant rel gather hous celebr call...,ive alway look forward day id finish middl sch...
9,1563,3042091,College of Letters and Science,I am the youngest of three brothers and come f...,"As a high school senior, I find that the most ...",youngest three brother come famili known deter...,high school senior find defin qualiti famili p...


## 5. Creating the DTM: scikit-learn

Now that we've preprocessed the text and created two columns of strings, the required input for scikit-learn's CountVectorizer, we can create a document-term matrix (DTM). This is the building block for topic modeling and a number of other methods we may want to explore. There are two ways to store it. We can keep it as a sparse matrix, which scikit-learn can use directly for further analyses, or we can convert it into a full (dense) document-term matrix. The dense form is very memory intensive and might not be a good idea for larger datasets.
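
To make the idea of a document-term matrix concrete, here is a minimal hand-rolled sketch in plain Python. The three toy essays are made up for illustration, and the code only mimics the general idea behind CountVectorizer (one row per document, one column per vocabulary term, raw counts in the cells); it is not scikit-learn's implementation, which by default also lowercases text and ignores one-character tokens.

```python
from collections import Counter

# Toy "cleaned" essays (hypothetical strings, already stemmed and space-joined).
docs = [
    "famili school dream famili",
    "school commun learn",
    "dream learn learn school",
]

# Vocabulary: every distinct token, sorted and mapped to a column index.
vocab = sorted({token for doc in docs for token in doc.split()})
col_index = {token: j for j, token in enumerate(vocab)}

# One row per document, one column per vocabulary term; cells hold raw counts.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts.get(token, 0) for token in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Most cells in a real DTM are zero, because any single essay uses only a small share of the full vocabulary. That is why the sparse representation matters at scale.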

In [20]:
# see above for: from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()

# Original: sklearn_dtm = CountVectorizer().fit_transform(df.PS1)
# I added '.values.astype('U')' for an interim step in the section below.
# It seemed to fix the CountVectorizer issues.
# Note: each fit_transform call refits countvec, so after the second call
# its vocabulary describes PS2 only.
sklearn_dtm_PS1 = countvec.fit_transform(df_no_missing['PS1_clean'])
sklearn_dtm_PS2 = countvec.fit_transform(df_no_missing['PS2_clean'])

print('PS1 sparse matrix type')
print(sklearn_dtm_PS1)
print(' ')
print('PS2 sparse matrix type')
print(sklearn_dtm_PS2)

PS1 sparse matrix type
  (0, 5947)	1
  (0, 253)	1
  (0, 7771)	1
  (0, 4180)	1
  (0, 2926)	1
  (0, 2798)	1
  (0, 8233)	1
  (0, 2378)	1
  (0, 2729)	1
  (0, 1451)	1
  (0, 368)	1
  (0, 3051)	1
  (0, 1223)	1
  (0, 4164)	1
  (0, 2454)	1
  (0, 5937)	2
  (0, 266)	1
  (0, 5911)	1
  (0, 7440)	1
  (0, 5647)	1
  (0, 5489)	1
  (0, 7209)	1
  (0, 7189)	1
  (0, 1833)	1
  (0, 6605)	1
  :	:
  (499, 1113)	1
  (499, 5209)	2
  (499, 4855)	2
  (499, 1067)	3
  (499, 8713)	1
  (499, 520)	1
  (499, 4566)	1
  (499, 9317)	1
  (499, 2877)	1
  (499, 7821)	1
  (499, 5993)	1
  (499, 4826)	1
  (499, 3153)	3
  (499, 6255)	1
  (499, 7840)	1
  (499, 4657)	3
  (499, 9273)	3
  (499, 9030)	2
  (499, 4485)	1
  (499, 5460)	1
  (499, 5254)	1
  (499, 8281)	1
  (499, 3447)	1
  (499, 1125)	1
  (499, 2008)	1
 
PS2 sparse matrix type
  (0, 8545)	1
  (0, 7818)	1
  (0, 924)	1
  (0, 3004)	1
  (0, 6722)	1
  (0, 7601)	1
  (0, 6306)	1
  (0, 3592)	1
  (0, 2968)	1
  (0, 9215)	1
  (0, 1160)	1
  (0, 3942)	1
  (0, 775)	1
  (0, 4447)	2
  (0, 

This format is called Compressed Sparse Row (CSR) format. Storing the DTM this way saves a lot of memory, but it is difficult for a human to read. To illustrate the techniques in this lesson we would normally convert this matrix to a Pandas dataframe, a format we're more familiar with, which would also let us get more comfortable with Pandas. For larger datasets, however, you have to stay with the sparse format, and for this data we will skip the conversion to avoid crashing the kernel.
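
The printed output above can be read as `(document, term_index)  count`, listing only the non-zero cells. The sketch below illustrates that reading on a small made-up matrix. The dictionary is only a stand-in chosen for readability: scipy's CSR layout stores the same information in three arrays rather than a dictionary.

```python
# A small dense document-term matrix (hypothetical counts): 3 documents x 5 terms.
dense = [
    [0, 1, 2, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 2, 1],
]

# Keep only the non-zero cells as {(row, column): count}.
sparse = {
    (i, j): count
    for i, row in enumerate(dense)
    for j, count in enumerate(row)
    if count != 0
}

# Print in the same shape as the sparse output: "(document, term_index)  count".
for (i, j), count in sparse.items():
    print(f"  ({i}, {j})\t{count}")

total_cells = len(dense) * len(dense[0])
print(f"stored {len(sparse)} of {total_cells} cells")
```

With 500 essays and a vocabulary of several thousand stems, the share of non-zero cells is far smaller than in this toy case, so the savings are correspondingly larger.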

In [21]:
# #we do the same as we did above, but convert it into a Pandas dataframe.
#Note this takes quite a bit more memory, so will not be good for bigger data.
#THIS IS TOO LARGE FOR MY MACHINE AND THIS DATA
# (get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it)
# dtm_df = pandas.DataFrame(countvec.fit_transform(df_sample['PS1'].values.astype('U')).toarray(), columns=countvec.get_feature_names_out(), index = df_sample.index)

# #view the dtm dataframe
# dtm_df

## 6. What can we do with a DTM?

We can do a number of calculations using a DTM. For a toy example, we can quickly identify the most frequent words (compare this to how many steps it took in lesson 2, where we found the most frequent words using NLTK).

In [22]:
# print(dtm_df.sum().sort_values(ascending=False))

In [23]:
#####Exercise:
###Print out the most infrequent words rather than the most frequent words.
##Gold star challenge: print the average number of times each word is used in an essay
#print(dtm_df.mean().sort_values(ascending=False))
#Print this out sorted from highest to lowest.

What else does the DTM enable? Because it is in the format of a matrix, we can perform any matrix algebra or vector manipulation on it, which enables some pretty exciting things (think vector space and Euclidean geometry). But what do we lose when we represent text in this format?

Today, we will use variations on the DTM to find distinctive words in this dataset, and then do some preliminary work discovering themes in text.

## 7. Tf-idf scores

How to find distinctive words in a corpus is a long-standing question in text analysis. We saw a few ways to do this yesterday, using natural language processing. Today, we'll learn one simple approach: word scores. The idea behind word scores is to weight words not just by their frequency, but by their frequency in one document compared to their distribution across all documents. Words that are frequent but also used in every single document will not be distinguishing. We want to identify words that are unevenly distributed across the corpus.

One of the most popular ways to weight words (beyond frequency counts) is *tf-idf* scores. Offsetting the frequency of a word by its document frequency (the number of documents in which it appears) will, in theory, filter out common terms such as 'the', 'of', and 'and'.

More precisely, the inverse document frequency is calculated as:

number_of_documents / number_documents_with_term

so:

tfidf_word1 = word1_frequency_document1 * (number_of_documents / number_documents_with_word1)

You can, and often should, normalize the term frequency by the length of the document:

tfidf_word1 = (word1_frequency_document1 / word_count_document1) * (number_of_documents / number_documents_with_word1)
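
As a worked illustration of the normalized formula, here is a minimal sketch in plain Python on three made-up essays. It follows the formula exactly as written above. Note that scikit-learn's TfidfVectorizer will not reproduce these numbers: by default it uses a log-scaled, smoothed inverse document frequency and rescales each document's row to unit length.

```python
from collections import Counter

# Toy "cleaned" essays (hypothetical strings, already stemmed and space-joined).
docs = [
    "famili school dream famili",
    "school commun learn",
    "dream learn learn school",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear at least once?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(word, tokens):
    """Length-normalized term frequency times the (unlogged) inverse document frequency."""
    tf = tokens.count(word) / len(tokens)
    idf = n_docs / doc_freq[word]
    return tf * idf


# One {word: score} dictionary per document.
scores = [
    {word: tfidf(word, tokens) for word in set(tokens)}
    for tokens in tokenized
]

for doc_id, doc_scores in enumerate(scores):
    ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    print(doc_id, [(word, round(score, 3)) for word, score in ranked])
```

In the first toy essay, "famili" scores 1.5 because it is frequent there and appears in no other essay, while "school" scores only 0.25 because it appears in all three. This version uses no Pandas, so the challenge of doing it with a dataframe still stands.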

We can calculate this manually, but scikit-learn has a built-in function to do so. We'll use it, but a challenge for you: use Pandas to calculate this manually. 

To do so, we simply do the same thing we did above with CountVectorizer, but instead we use the function TfidfVectorizer.

In [24]:
# see above for from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer()

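# #create the dtm, but with cells weighted by the tf-idf score.
# (get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it)
# dtm_tfidf_df = pandas.DataFrame(tfidfvec.fit_transform(df.PS1).toarray(), columns=tfidfvec.get_feature_names_out(), index = df.index)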

# #view results
# dtm_tfidf_df

Let's look at the 20 words with highest tf-idf weights.

In [25]:
# print(dtm_tfidf_df.max().sort_values(ascending=False)[0:20])

## 8. Uncovering Patterns: LDA

Frequency counts and tf-idf scores are calculated at the word level. Other methods of exploratory or unsupervised analysis work at the document level, by examining the co-occurrence of words within documents. Scikit-learn supports many of these methods, including:

* document clustering
* document or word similarities using cosine similarity
* principal component analysis (PCA)
* topic modeling

We'll run through an example of topic modeling here. Again, the goal is not to learn everything you need to know about topic modeling. Instead, this will provide you some starter code to run a simple model, with the idea that you can use this base of knowledge to explore this further.

We will run Latent Dirichlet Allocation, the most basic and the oldest version of topic modeling. We will run this in one big chunk of code. Our challenge: use the knowledge of scikit-learn that we gained above to walk through the code and understand what it is doing. Your challenge: figure out how to modify this code to work on your own data, and/or tweak the parameters to get better output.

Note: we will be using a different dataset for this technique. The music reviews in the above dataset are often short, one word or one sentence reviews. Topic modeling is not really appropriate for texts that are this short. Instead, we want texts that are longer and are composed of multiple topics each. For this exercise we will use a database of children's literature from the 19th century. 

The data were compiled by students in this course: http://english197s2015.pbworks.com/w/page/93127947/FrontPage
Found here: http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets#demo-corpora

That page has additional corpora, for those interested in exploring text analysis further.

I did some minimal cleaning to get the children's literature data in .csv format for our use.

In [26]:
# df_lit = pandas.read_csv("/Volumes/Extra Space/Google Drive/Scholarship/Writing Projects - Personal/Admissions Essays/Small Sample/AdmissionsEssays/statement_test_031417.csv", sep = ',', encoding = 'utf-8')

# #drop rows where the text is missing. I think there's only one row where it's missing, but check me on that.
# df_lit = df_lit.dropna(subset=['PS1'])

#df_lit = df_no_missing

#view the dataframe
#df_lit

Now we're ready to fit the model. This requires CountVectorizer, which we've already used, and the scikit-learn class LatentDirichletAllocation.

See [here](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) for more information about this function. 

In [27]:
# see the early line: import time
start_time = time.time()

#should switch to batch (from online).  n_samples should be closer to full set

####Adapted from:
#Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

# See above for: from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# and:from sklearn.decomposition import LatentDirichletAllocation

n_samples = 2000
n_topics = 5
n_top_words = 50

##This is a function to print out the top words for each topic in a pretty way.
#Don't worry too much about understanding every line of this code.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Use tf-idf features
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, min_df=50,
                                   max_features=None,
                                   stop_words='english')

tfidf = tfidf_vectorizer.fit_transform(df_no_missing['PS1_clean'])

# Use tf (raw term count) features
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.80, min_df=50,
                                max_features=None,
                                stop_words='english'
                                )

tf = tf_vectorizer.fit_transform(df_no_missing['PS1_clean'])

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_topics=%d..."
      % (n_samples, n_topics))

#define the lda function, with desired options TAKE A LOOK AT THIS.  MIGHT BE TOO FEW
# Note: scikit-learn renamed the n_topics argument to n_components; current releases no longer accept n_topics.
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=100,  #100 (THATS WHAT LAURA DID)
                                learning_method='batch',  # CHANGE THIS from 'online' TO 'batch'
                                learning_offset=80.,
                                total_samples=n_samples, # TAKE A LOOK AT THIS RE CHANGE TO BATCH
                                random_state=0)
#fit the model
lda.fit(tf)

#print the top words per topic, using the function defined above.
#Unlike R, which has a built-in function to print top words, we have to write our own for scikit-learn
#I think this demonstrates the different aims of the two packages: R is for social scientists, Python for computer scientists

print("\nTopics in LDA model:")
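# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it.
tf_feature_names = tf_vectorizer.get_feature_names_out()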
print_top_words(lda, tf_feature_names, n_top_words)

print("This program took", time.time() - start_time, "seconds to run.")

Extracting tf features for LDA...
Fitting LDA models with tf features, n_samples=2000 and n_topics=5...

Topics in LDA model:

Topic #0:
world father busi dream life parent learn famili alway shape dad person passion pursu scienc influenc aspir explor develop work desir way inspir job futur grew great import natur travel taught love becom valu age engin young encourag environ come univers grow studi understand new thank childhood mind believ allow

Topic #1:
time life want like day year peopl world love know make felt thing realiz start feel live help everi home way look friend new littl someth thought becam use face person talk learn howev alway word began didnt think chang place read tri im rememb ask need realli experi say

Topic #2:
school commun cultur peopl differ world help student experi famili mani learn educ live countri make year new citi valu come way chang understand attend american place high life divers state social person languag friend parent small becom intern like im
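
The slice `topic.argsort()[:-n_top_words - 1:-1]` inside `print_top_words` is probably the least obvious line in the cell above. The sketch below reproduces the same logic in plain Python on made-up weights: `sorted(range(...), key=...)` stands in for numpy's `argsort()`, and the feature names and numbers are hypothetical.

```python
# Hypothetical weights for one topic (one row of model.components_),
# with one value per vocabulary term.
feature_names = ["commun", "dream", "famili", "learn", "school"]
topic = [0.4, 2.1, 5.0, 1.3, 3.2]
n_top_words = 3

# Term indices ordered by ascending weight: a plain-Python stand-in for argsort().
ascending = sorted(range(len(topic)), key=lambda i: topic[i])

# [:-n_top_words - 1:-1] walks backwards from the last (largest) element
# and stops after n_top_words items, giving the heaviest terms first.
top_indices = ascending[:-n_top_words - 1:-1]

top_words = [feature_names[i] for i in top_indices]
print(" ".join(top_words))
```

Reading the topic output is then just a matter of remembering that words are listed from the highest weight downward.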

In [28]:
####Exercise:
###Run the same code as above but change some of the parameters. How does this change the output?
###Suggestions:
## 0. Use tf-idf scores rather than raw counts. (hint: look for the variable name we created) 
## 1. Change the number of topics. What do you find?
## 2. Do not remove stop words. How does this change the output?

One thing we may want to do with the output is find the most representative texts for each topic. A simple (but not memory-efficient) way to do this is to merge the topic distribution back into the Pandas dataframe.

First get the topic distribution array.

In [29]:
topic_dist = lda.transform(tf)
topic_dist

array([[ 0.06109782,  0.48728143,  0.00150029,  0.44862768,  0.00149278],
       [ 0.24978888,  0.35073259,  0.03186259,  0.00206101,  0.36555494],
       [ 0.00149007,  0.10929221,  0.00149249,  0.53047133,  0.35725389],
       ..., 
       [ 0.26470612,  0.6757214 ,  0.05592117,  0.00182514,  0.00182617],
       [ 0.00272899,  0.42434396,  0.00271834,  0.00269908,  0.56750962],
       [ 0.23761803,  0.31539662,  0.07437001,  0.00182616,  0.37078919]])
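
Each row of this array describes one essay, and each column one topic. As a quick sanity check, the sketch below takes the first three rows printed above and confirms that each row sums to roughly 1, then picks out the topic with the largest weight for each essay. It uses plain Python lists rather than the numpy array, so the variable names here are illustrative only.

```python
# First three rows of topic_dist, copied from the output above.
topic_dist_rows = [
    [0.06109782, 0.48728143, 0.00150029, 0.44862768, 0.00149278],
    [0.24978888, 0.35073259, 0.03186259, 0.00206101, 0.36555494],
    [0.00149007, 0.10929221, 0.00149249, 0.53047133, 0.35725389],
]

row_sums = []
dominant_topics = []
for doc_id, row in enumerate(topic_dist_rows):
    total = sum(row)
    # Index of the largest weight in this row, i.e. the essay's dominant topic.
    dominant = max(range(len(row)), key=lambda k: row[k])
    row_sums.append(total)
    dominant_topics.append(dominant)
    print(f"essay {doc_id}: sum={total:.6f}, dominant topic={dominant}")
```

Note how close the top two weights are in the first two essays: a single "dominant topic" label can hide the fact that an essay is split almost evenly between topics.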

Merge back in with the original dataframe.

In [30]:
topic_dist_df = pandas.DataFrame(topic_dist)
df_w_topics = topic_dist_df.join(df_no_missing)
df_w_topics

Unnamed: 0,0,1,2,3,4,index,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,0.061098,0.487281,0.001500,0.448628,0.001493,338,3163703,College of Natural Resources,"As I stared into the vulgar man's eyes, I felt...",Living in a society that praises women for the...,stare vulgar man eye felt nail dig deeper fles...,live societi prais women extern beauti instead...
1,0.249789,0.350733,0.031863,0.002061,0.365555,439,3056579,College of Engineering,I come from an Indian family whose ancestry tr...,My involvement in speech and debate has had a ...,come indian famili whose ancestri trace back i...,involv speech debat signific impact ive becom ...
2,0.001490,0.109292,0.001492,0.530471,0.357254,722,3113770,College of Letters and Science,Whenever I walked into my mother's room at the...,I was part of a student volunteer project to e...,whenev walk mother room begin month would ofte...,part student volunt project encourag physic me...
3,0.001147,0.425401,0.145133,0.427161,0.001158,794,3120168,College of Engineering,The early years of my childhood in Mexico were...,My life is one of numbers. The universal langu...,earli year childhood mexico easi fact first ye...,life one number univers languag math transcend...
4,0.001474,0.341226,0.372105,0.001467,0.283728,822,3150465,College of Letters and Science,My social life in reality was not exactly stel...,"A few months ago, I had an experience that sho...",social life realiti exactli stellar start foun...,month ago experi show much leadership skill im...
5,0.381607,0.134930,0.112337,0.001956,0.369170,1102,3047627,College of Chemistry,Most people acquire a unique personality based...,"A human's natural instinct is to fear failure,...",peopl acquir uniqu person base upon individu c...,human natur instinct fear failur go life avoid...
6,0.133610,0.659837,0.002020,0.170216,0.034317,1302,3094299,College of Letters and Science,"First, I noticed her trembling hands. In the y...","I knocked tentatively on her door, fidgeting u...",first notic trembl hand year known never seen ...,knock tent door fidget uneasili muffl cri echo...
7,0.001608,0.461333,0.150073,0.001610,0.385376,1467,3112108,College of Natural Resources,Sophomore year I decided to transfer to Campbe...,"In 2nd grade, I joined Cub Scouts with a few o...",sophomor year decid transfer campbel hall desi...,nd grade join cub scout friend love outdoor en...
8,0.055490,0.437935,0.001720,0.374395,0.130459,1478,3129573,College of Engineering,"Every year, distant relatives gather at our ho...",I've always looked forward to the day I'd fini...,everi year distant rel gather hous celebr call...,ive alway look forward day id finish middl sch...
9,0.001704,0.108070,0.229489,0.400097,0.260640,1563,3042091,College of Letters and Science,I am the youngest of three brothers and come f...,"As a high school senior, I find that the most ...",youngest three brother come famili known deter...,high school senior find defin qualiti famili p...
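
One caveat about the merge above: `join` matches rows by index label, not by position. It works here because the sample's index was reset to 0, 1, 2, and so on. If the essay dataframe still had gaps in its index (for example after dropping rows with missing essays), the topic rows could attach to the wrong essays or to none. The sketch below shows the problem and one possible fix on made-up data; the frame, column values, and topic weights are all hypothetical.

```python
import pandas as pd

# Hypothetical essays frame whose index has gaps, as happens after dropping rows.
essays = pd.DataFrame(
    {"CPID": [101, 102, 103],
     "PS1_clean": ["famili school", "dream learn", "commun school"]},
    index=[0, 2, 5],
)

# Hypothetical topic weights, one row per essay, in the same order as `essays`.
topic_dist = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# Default index (0, 1, 2): join() matches on labels, so the second topic row finds
# no essay and the third topic row is attached to the essay labelled 2.
misaligned = pd.DataFrame(topic_dist).join(essays)

# Reusing the essays' own index keeps every topic row attached to its essay.
aligned = pd.DataFrame(topic_dist, index=essays.index).join(essays)

print(misaligned)
print(aligned)
```

Passing `index=df_no_missing.index` when building `topic_dist_df` would be one way to apply the same idea in this notebook, assuming the rows of `topic_dist` are in the same order as the dataframe.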


In [31]:
df_w_topics.columns = ['Topic_1_PS1', 'Topic_2_PS1', 'Topic_3_PS1', 'Topic_4_PS1', 'Topic_5_PS1', 'index', 'CPID', 'College', 'PS1', 'PS2', 'PS1_clean', 'PS2_clean']
df_w_topics

Unnamed: 0,Topic_1_PS1,Topic_2_PS1,Topic_3_PS1,Topic_4_PS1,Topic_5_PS1,index,CPID,College,PS1,PS2,PS1_clean,PS2_clean
0,0.061098,0.487281,0.001500,0.448628,0.001493,338,3163703,College of Natural Resources,"As I stared into the vulgar man's eyes, I felt...",Living in a society that praises women for the...,stare vulgar man eye felt nail dig deeper fles...,live societi prais women extern beauti instead...
1,0.249789,0.350733,0.031863,0.002061,0.365555,439,3056579,College of Engineering,I come from an Indian family whose ancestry tr...,My involvement in speech and debate has had a ...,come indian famili whose ancestri trace back i...,involv speech debat signific impact ive becom ...
2,0.001490,0.109292,0.001492,0.530471,0.357254,722,3113770,College of Letters and Science,Whenever I walked into my mother's room at the...,I was part of a student volunteer project to e...,whenev walk mother room begin month would ofte...,part student volunt project encourag physic me...
3,0.001147,0.425401,0.145133,0.427161,0.001158,794,3120168,College of Engineering,The early years of my childhood in Mexico were...,My life is one of numbers. The universal langu...,earli year childhood mexico easi fact first ye...,life one number univers languag math transcend...
4,0.001474,0.341226,0.372105,0.001467,0.283728,822,3150465,College of Letters and Science,My social life in reality was not exactly stel...,"A few months ago, I had an experience that sho...",social life realiti exactli stellar start foun...,month ago experi show much leadership skill im...
5,0.381607,0.134930,0.112337,0.001956,0.369170,1102,3047627,College of Chemistry,Most people acquire a unique personality based...,"A human's natural instinct is to fear failure,...",peopl acquir uniqu person base upon individu c...,human natur instinct fear failur go life avoid...
6,0.133610,0.659837,0.002020,0.170216,0.034317,1302,3094299,College of Letters and Science,"First, I noticed her trembling hands. In the y...","I knocked tentatively on her door, fidgeting u...",first notic trembl hand year known never seen ...,knock tent door fidget uneasili muffl cri echo...
7,0.001608,0.461333,0.150073,0.001610,0.385376,1467,3112108,College of Natural Resources,Sophomore year I decided to transfer to Campbe...,"In 2nd grade, I joined Cub Scouts with a few o...",sophomor year decid transfer campbel hall desi...,nd grade join cub scout friend love outdoor en...
8,0.055490,0.437935,0.001720,0.374395,0.130459,1478,3129573,College of Engineering,"Every year, distant relatives gather at our ho...",I've always looked forward to the day I'd fini...,everi year distant rel gather hous celebr call...,ive alway look forward day id finish middl sch...
9,0.001704,0.108070,0.229489,0.400097,0.260640,1563,3042091,College of Letters and Science,I am the youngest of three brothers and come f...,"As a high school senior, I find that the most ...",youngest three brother come famili known deter...,high school senior find defin qualiti famili p...


In [32]:
#Writing the output of this run to a CSV
#df_w_topics.to_csv('Admissions_PS1_Full_'+time_for_f_name+'.csv', sep=',')

now = datetime.datetime.now()
time_for_f_name = now.strftime("Date_%Y-%m-%d_Time_%H-%M")
path = '../data/'
#print(path+'Admissions_PS1_Full_'+time_for_f_name+'.csv')

df_w_topics.to_csv(path+'Admissions_PS1_Full_'+time_for_f_name+'.csv', sep=',')

Now we can sort the dataframe for the topic of interest, and view the top documents for the topics.
Below we sort the documents first by Topic 0 (looking at the top words for this topic I think it's about family, health, and domestic activities), and next by Topic 1 (again looking at the top words I think this topic is about children playing outside in nature). These topics may be a family/nature split?

Look at the titles for the two different topics. Look at the gender of the author. Hypotheses?

We can read individual essays in full using the code below.  Change the number in the final set of brackets to point to a specific serial number (ID-1).

In [33]:
print(df_w_topics[['CPID', 'PS1', 'Topic_1_PS1']].sort_values(by=['Topic_1_PS1'], ascending=False))

        CPID                                                PS1  Topic_1_PS1
205  3164991  Two of my greatest passions in life are sports...     0.597007
334  3062672  I invite you to partake on a journey through m...     0.559961
407  3149279  While I do not have an overly adverse backgrou...     0.474799
210  3073911  Life in teeming, fast-paced Mumbai can be over...     0.466141
391  3114432  My upbringing and environment have shaped me i...     0.459060
167  3033733  Everything in the universe follows a pattern a...     0.442625
49   3088330  As far back as I can remember I have always be...     0.427264
225  3111615  Coming from an Asian family, I was taught that...     0.402263
489  3012529  How is the immigration policy affecting the ho...     0.398052
165  3050717  I haven't written a Father's Day Card in a lon...     0.397703
411  3129460  Whether it was awkward introductions to my fat...     0.395188
250  3122476  My aspiration to study abroad and be a banker ...     0.393210

In [34]:
print(df_w_topics['PS1'][205])

Two of my greatest passions in life are sports and business. From childhood I was naturally inclined toward sports: to shoot hoops, play catch, and be involved in competition. On the other hand, my interest in business developed over time from observing my grandfather. My grandpa is considered a legend in Chicago Chinese restaurant history, starting over thirty restaurants throughout the Midwest. His entrepreneurial skills were inspirational, and early in my life Grandpa planted seeds in my mind that someday I would be his successor in the restaurant business. Although the specific dream of taking over the restaurants has faded, working there during my high school years has helped me understand the value of hard work and the benefits of running your own business - especially when it's doing something that you love.  The idea was that business can help you pursue your dreams.\That idea was also emphasized in a significantly different experience. Last summer my mom started helping out at

In [35]:
print(df_w_topics[['CPID', 'PS1', 'Topic_2_PS1']].sort_values(by=['Topic_2_PS1'], ascending=False))

        CPID                                                PS1  Topic_2_PS1
426  3137551  It hurt me to hear the weeping of the woman's ...     0.919399
246  3028348  Cities are always usually associated with rauc...     0.917954
460  3117358  The valley stretches magnificently before me l...     0.912304
41   3067256  "Bubbles, Bubbles, Bubbles! My Bubbles!" Those...     0.889309
143  3139749  What if I told you that eating one more spoonf...     0.870800
461  3039291  Some people refer to Surprise Lake Camp as the...     0.856031
338  3054133  I struck a match. The ball of dry twigs in the...     0.825735
371  3097308  The Boy In The Picture\\    From the day my mo...     0.816478
387  3015387  I was the product of two people who were drive...     0.815780
204  3027535  As an inquisitive young boy, it was hard to fo...     0.796811
120  3012485  All I wanted to do was watch cartoons. But my ...     0.782642
265  3127430  My world is very smelly. It's a good thing, I ...     0.770938

In [36]:
print(df_w_topics['PS1'][246])

Cities are always usually associated with raucous streets, and they never sleep. I just happen to live in a silent corner of city. Everyone in my neighborhood went about his or her life without really disturbing anyone else. I have enjoyed my time living here, but I've felt as though something was calling to me in this silent environment. In, my heart, I felt as though I was about to burst with passion. Once it burst, it would form into something special and unique to call my own. This was the beginning of my journey.\                   As a young lad, I was very reclusive. I was never keen on opening myself up to others or talking about the things that I enjoyed. I thought I would never be exceptionally good at anything. Every time I returned home, I was a bored. I constantly drew pictures and read books. Even this didn't suffice. One day, when my mother was busy in the kitchen,  I began to ask if lunch was ready. I asked repeatedly asking, much to my mother's annoyance. After turning

In [37]:
print(df_w_topics[['CPID', 'PS1', 'Topic_3_PS1']].sort_values(by=['Topic_3_PS1'], ascending=False))

        CPID                                                PS1  Topic_3_PS1
397  3079405  Growing up in Catholic school has been challen...     0.994602
116  3069451  State Highway 6 is the artery of my community....     0.993379
360  3145558  I take pride in being from my hometown of Tula...     0.990366
24   3141633  I live in the north of Cyprus, an island divid...     0.840052
178  3092850  I am one of multitude of international student...     0.822328
136  3150904  "The loving, good person - even alone - can ma...     0.813652
37   3082703  I was born in China, and of those 1.4 billion ...     0.796733
247  3121554  I consider cities as the center of life. Citie...     0.792400
203  3049642  Koreans, Americans, and Kiwis happily frolicki...     0.790042
406  3095512  I was born to Indian immigrants in the Lone St...     0.746499
300  3125649  Being a U.S. citizen studying abroad, I feel l...     0.730092
138  3001563  I come from the most important community of ea...     0.717908

In [38]:
print(df_w_topics['PS1'][397])

Growing up in Catholic school has been challenging at times; however, I have come to accept its effect on my life as positive. I did not realize when I began kindergarten how much of an influence both my school's faith and the people that surrounded me would have on me throughout elementary, middle, and even high school. \     Because I went to school with a small group of classmates for nine years, my friendships with my peers grew stronger over time. When I entered high school, most of my classmates joined me at the same school, where we would spend four more years surrounded by each other and learning more about one another. Having close friends around me for so long has taught me to be an open, kind, and confident person. Forming friendships with people with so many unique talents and personalities has taught me to be versatile and open to new experiences and opportunities. My friends have also positively affected my work in school, because they encouraged me to take challenging cl

In [39]:
# Rank essays by their Topic 4 weight, highest first
print(df_w_topics[['CPID', 'PS1', 'Topic_4_PS1']].sort_values(by='Topic_4_PS1', ascending=False))

        CPID                                                PS1  Topic_4_PS1
243  3053400  I lived in a small blue house on the corner of...     0.969703
88   3122848  I come from a family of 11. My mother was Colo...     0.919852
129  3124909  I come from a family that has as much or less ...     0.909960
109  3085781  I am a product of a large, rowdy, colorful His...     0.898410
307  3120191  Joseph Olivarez \The past five generations of ...     0.880235
314  3157030  As I begin thinking about college and what my ...     0.877089
118  3028289  I am the middle child, the forgotten child. I ...     0.873222
284  3131619  ?What's in my world that makes me who I am? Th...     0.873145
34   3164751  My grandmother came to this country without kn...     0.872294
147  3112676  My parent's life story is full of hardship, di...     0.871389
343  3032409  At the age of ten I lost my grandmother to can...     0.855583
283  3067522  I live in a community where you could hear fam...     0.851328

In [40]:
# Read the top-weighted Topic 4 essay (row label 243) in full
print(df_w_topics.loc[243, 'PS1'])

I lived in a small blue house on the corner of Ceres and Modesto. The house, which comfortably houses five people was jam-packed with eight: my mother and father, my three sisters and my two brothers. Since the house was so full, my parents told stories that seemed to gather the family around and quiet down the never ending chaos that came with a eight person family. \    I never had a dull moment with my family. My dad is a man who would remember every moment in his life and would always retell these moments as stories. Some of the stories are just retelling the adventures of his daily life, while some stories that are told in the house define who we are as a family. There were many stories of successes like my brother's 4.8 GPA or the many students from the Lo family that crosses the graduation stage as valedictorians.  There are also stories that tell about the failures of my family, like the brother who failed so many classes he was forced to change his major or the little sister w

In [41]:
# Rank essays by their Topic 5 weight, highest first
print(df_w_topics[['CPID', 'PS1', 'Topic_5_PS1']].sort_values(by='Topic_5_PS1', ascending=False))

        CPID                                                PS1  Topic_5_PS1
238  3042427  Voices were chattering and people crowded arou...     0.993759
52   3084329  Gloucester High School is a tool I have used t...     0.993608
268  3070695  When I was young, I idolized Neil Degrasse Tys...     0.993517
366  3133798  At the age of five, I have been obsessed with ...     0.913096
231  3020298  In middle school, I thought I was amazing. Whe...     0.909822
323  3103691  In the moment, I was mindlessly following Erik...     0.904525
86   3066309  The spirit of teamwork and competition has bee...     0.892214
478  3062802  I live about 15 minutes from NASA Ames Researc...     0.868751
121  3149862  I grew up seeing a faded black and white pictu...     0.855162
275  3023235  It was the first day of the first grade and I ...     0.833701
463  3044758  I've always loved building things. As a child,...     0.825478
487  3091626  My growth and maturation have happened in my h...     0.796757

In [42]:
# Read the top-weighted Topic 5 essay (row label 238) in full
print(df_w_topics.loc[238, 'PS1'])

Voices were chattering and people crowded around in the brightly lit room. Sitting on a swivel chair in the underground cavern, I was surrounded by the sound of laughter and clacking keyboards. I live here--in the computer lab. I've been living here since ninth grade. \The computer lab was where I was first welcomed with open arms to the coding community. I met friends and mentors who taught me and shared my enthusiasm. They encouraged me to delve into the exciting world of computer science. My curiosity for programming grew as I discovered the fascinating computing specializations of artificial intelligence and graphics. As my programming skills developed, I began work on programming projects with others which further pushed me to grow. Armed with just lines of codes and my trusty keyboard, I ventured deeper into the limitless world of computer science.\On Friday nights, my friends and I played through the game we made, jumping over enemies and avoiding spike traps. Ever since I disco
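
The cells above repeat the same two steps for each topic: sort the essays by one topic's weight, then print the top-ranked essay. A minimal sketch of how that pattern could be wrapped in a loop is shown here. The helper name `top_essays_for_topic` is hypothetical and not part of the original notebook, and the small `toy` dataframe (IDs, texts, and weights) is invented purely so the example runs on its own; with the real data you would pass `df_w_topics` instead.

```python
import pandas as pd


def top_essays_for_topic(df, topic_col, text_col="PS1", id_col="CPID", n=3):
    """Return the n rows with the highest weight in topic_col (hypothetical helper)."""
    return df.nlargest(n, topic_col)[[id_col, text_col, topic_col]]


# Toy stand-in for df_w_topics: every value below is made up for illustration.
toy = pd.DataFrame(
    {
        "CPID": [101, 102, 103, 104],
        "PS1": ["essay a", "essay b", "essay c", "essay d"],
        "Topic_3_PS1": [0.10, 0.95, 0.40, 0.70],
        "Topic_4_PS1": [0.80, 0.02, 0.55, 0.20],
    },
    index=[10, 20, 30, 40],
)

for col in ["Topic_3_PS1", "Topic_4_PS1"]:
    top = top_essays_for_topic(toy, col, n=2)
    print(top)
    # Look up the single highest-weighted essay by its row label
    best_label = top.index[0]
    print(toy.loc[best_label, "PS1"])
```

Using `.loc` with the row label keeps the lookup tied to the dataframe's index, which matters here because rows were dropped earlier and the labels are no longer consecutive positions.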