# What Does Computational Text Analysis Look Like?

Below are just a few things you can do with computer-assisted, or computational, text analysis.

As I always say, these methods can assist your analysis; they do not replace it. The meaning comes from the analyst, but these methods can help us dig deeper into that meaning.

In this NB I'll present two broad categories of calculations/analyses you can do with computational text analysis:

1. Descriptive statistics
2. Frequent and Distinctive Words

This is meant to get you thinking about how a computer "reads" text, so you can start thinking about the strengths of a computer compared to a human, and how you might combine the strengths of both to build a better content analysis project.

### 1. Descriptive Statistics

First, I'll create a pandas dataframe with the sample interviews, with each respondent statement as a separate row. Each row is labeled with the filename to indicate which interview the statement belongs to.

I'll then calculate the number of speech acts (respondent statements) in each interview, the average number of words per speech act overall, and the average number of words per speech act within each interview.
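To make these steps concrete before running them on the real data, here is a minimal sketch on a made-up sample. The interview names (`demo_A`, `demo_B`) and their text are hypothetical, and I use only the standard library here instead of pandas, so treat it as an illustration of the logic rather than the pipeline used below.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical sample in the same "I:" / "R:" layout as the transcripts
sample = {
    "demo_A": "Interview: demo_A\nI:\tHow old are you?\nR:\tI'm 63.\n"
              "I:\tWhere were you born?\nR:\tJersey City, New Jersey.\n",
    "demo_B": "Interview: demo_B\nI:\tCan you hear me okay?\n"
              "R:\tI can hear you perfectly.\n",
}

# One (interview, statement, word_count) tuple per respondent statement
rows = []
for name, text in sample.items():
    # Capture everything after "R:" up to the end of the line
    for statement in re.findall(r"R:(.*)", text):
        statement = statement.strip()
        rows.append((name, statement, len(statement.split())))

# Number of speech acts and list of word counts for each interview
speech_acts = defaultdict(int)
words = defaultdict(list)
for name, _, n_words in rows:
    speech_acts[name] += 1
    words[name].append(n_words)

overall_mean = mean(n for _, _, n in rows)
mean_by_interview = {name: mean(counts) for name, counts in words.items()}

print(dict(speech_acts))
print(overall_mean)
print(mean_by_interview)
```

Note that "words" here just means whitespace-separated chunks, so punctuation stays attached to the word it follows. That is fine for rough counts, but it is not real tokenization.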

In [1]:
#first import all packages needed for the NB
import re
import os
import nltk
from nltk.corpus import stopwords
import string
import pandas
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
my_file = open("../data/SF657.txt", encoding='utf-8', errors='ignore').read()
my_file

'Interview: SF657\nI:\tSo, before we get started on the interview, I just kind of want to go over some logistics first. I guess the first question is can you hear me okay?\nR:\tI can hear you perfectly.\nI:\tGreat. So the first thing is, would you be able to give me your address? We’re compensating our interviewees $40--\nR:\tSo here’s what I’m going to ask you. I work [inaudible 1:26] program of recovery right now, and I see this as part of my service that I can make amends.\nI:\tOkay. \nR:\tAnd I’d like you to donate that money, if you could really do this, is to The Safe House. The Safe House is actually a place in San Francisco that gets women off the street who have been in prostitution. If you could instead of giving it to me, if you guys could make a check out to them and give them money I would really appreciate it.\nI:\tThat’s really wonderful of you. Yes, I believe we’ve had other respondents ask the same, and I believe that that is possible. So I can talk with the lead resea

In [87]:
#create a dataframe
folder_path = "../data/"
#print(os.listdir(folder_path))
filenames = os.listdir(folder_path)
columns = ['interview', 'statement']
master_df = pandas.DataFrame(columns=columns)
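for f in filenames:
    df = pandas.DataFrame(columns=columns)
    my_file = open(folder_path+f, encoding='utf-8', errors='ignore').read()
    #keep only respondent lines: everything after "R:" up to the end of the line
    respondent = re.findall(r"R:(.*)", my_file)
    df['statement'] = respondent
    #label each statement with its interview (the filename minus the ".txt" extension)
    df['interview'] = f[:-4]
    master_df = pandas.concat([df,master_df])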

master_df

Unnamed: 0,interview,statement
0,SF280,\t52.
1,SF280,"\tJune 26th, 64."
2,SF280,\tSan Francisco.
3,SF280,\tI guess white.
4,SF280,\tSingle.
5,SF280,\tNo.
6,SF280,\tNo.
7,SF280,\tHigh school. Twelfth.
8,SF280,\tMy last year I was in special ed. In high sc...
9,SF280,\tYeah.


In [107]:
print(open(folder_path+filenames[0], encoding='utf-8', errors='ignore').read())

Interview: SF300

I:	So we'll just start with background questions. How old are you?
R:	I'm 63.
I:	And what date were you born?
R:	1953. Yeah, that adds up right.
I:	And the month and the day?
R:	February 7.
I:	Same day as my brother.
R:	Really?
I:	Mm-hmm [yes]. He's a '97, though. He's younger than I am. Where were you born?
R:	Jersey City, New Jersey.
I:	And are you a US citizen?
R:	Yes.
I:	And if you could, choose from this list of categories what you self-identify as and how people characterize you.
R:	Now this is only in California. I found this out when I moved to California.
I:	How so?
R:	It turns out that in California, Italians are not considered white people.
I:	And was that something you experienced in the Bay Area or other parts?
R:	All over California, up and down the whole statePortland to LA.
I:	Interesting.
R:	Yeah. On the East Coast, before I moved out here, I would have just checked white. Doesn't work here.
I:	And was that something that you experienced in job, or ed

In [93]:
#number of speech acts by interview
master_df['interview'].value_counts().sort_values(ascending=False)

SF259         618
SF283         569
SF300         523
SF216         511
SF310         474
SF202         463
SF280         408
SF286         396
SF218.1       339
SF318         331
SF261         323
SF335 pt 2    221
SF206.1       142
SF335 pt 3    128
SF206.2       108
SF218.2        81
SF335 pt 1     54
SF338 pt 1      1
Name: interview, dtype: int64

In [94]:
#create a word count column
master_df['tokens'] = master_df['statement'].str.split()
master_df['word_count'] = master_df['tokens'].str.len()
master_df

Unnamed: 0,interview,statement,tokens,word_count
0,SF280,\t52.,[52.],1
1,SF280,"\tJune 26th, 64.","[June, 26th,, 64.]",3
2,SF280,\tSan Francisco.,"[San, Francisco.]",2
3,SF280,\tI guess white.,"[I, guess, white.]",3
4,SF280,\tSingle.,[Single.],1
5,SF280,\tNo.,[No.],1
6,SF280,\tNo.,[No.],1
7,SF280,\tHigh school. Twelfth.,"[High, school., Twelfth.]",3
8,SF280,\tMy last year I was in special ed. In high sc...,"[My, last, year, I, was, in, special, ed., In,...",11
9,SF280,\tYeah.,[Yeah.],1


In [91]:
#average number of words per speech act
master_df['word_count'].mean()

20.132688927943761

In [92]:
#average number of words per speech act, by interview
grouped = master_df.groupby('interview')
print(grouped['word_count'].mean().sort_values(ascending=False))

#note: SF259 had the highest number of speech acts, but is near the low end for words per speech act

interview
SF206.1       50.760563
SF286         37.661616
SF206.2       37.481481
SF283         26.936731
SF300         23.286807
SF202         21.952484
SF318         19.990937
SF310         17.586498
SF261         17.582043
SF218.1       17.262537
SF280         16.313725
SF218.2       16.283951
SF216         12.150685
SF259         12.042071
SF335 pt 3     7.195312
SF335 pt 2     6.819005
SF338 pt 1     4.000000
SF335 pt 1     3.203704
Name: word_count, dtype: float64


### 2. Frequent and Distinctive Words

Next I'll look at the words themselves. I'll first treat all interviews as one long text, and look at the most frequent words overall, and the most frequent nouns, verbs, and adjectives.

Second, I'll compare two interviews and identify the words most distinctive of each when compared to the other. The same approach could be used, for example, to compare cities, or perhaps specific questions.
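The comparison I use is a simple difference of proportions: for each word, take its share of all words in the first text and subtract its share of all words in the second. Here is a minimal sketch of that idea on two made-up sentences. The texts and the helper name `proportions` are hypothetical, and I use a plain `Counter` as a stand-in for the document-term matrix (with no stop word removal), so this only illustrates the arithmetic.

```python
from collections import Counter

def proportions(text):
    """Return each word's share of the total word count in `text`."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical toy documents, not taken from the interviews
doc_a = "my daughter and my son came home"
doc_b = "yeah yeah I guess my job was okay"

p_a = proportions(doc_a)
p_b = proportions(doc_b)

# A word missing from one document has proportion 0 there
vocab = set(p_a) | set(p_b)
diff = {w: p_a.get(w, 0.0) - p_b.get(w, 0.0) for w in vocab}

# Positive values lean toward doc_a, negative values toward doc_b
ranked = sorted(diff, key=diff.get, reverse=True)
print(ranked[0], diff[ranked[0]])
print(ranked[-1], diff[ranked[-1]])
```

Because the counts are divided by each text's length, a long interview does not dominate a short one just by having more words.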

In [61]:
#create list with all interviews
folder_path = "../data/"
#print(os.listdir(folder_path))
filenames = os.listdir(folder_path)
text_list = []
for f in filenames:
    my_file = open(folder_path+f, encoding='utf-8', errors='ignore').read()
    respondent = re.findall("R:(.*)", my_file)
    text_list.append(' '.join(respondent))

In [None]:
#concatenate list into one long string and tokenize
text = ' '.join(text_list)
tokens = nltk.word_tokenize(text)

In [95]:
#remove punctuation, lowercase, and remove stopwords. Print out most frequent words
tokens_clean = [word for word in tokens if word not in string.punctuation]
tokens_clean = [word.lower() for word in tokens_clean]
stop_words = set(stopwords.words('english')) #build the stopword set once, not once per token
tokens_clean = [word for word in tokens_clean if word not in stop_words]

word_frequency = nltk.FreqDist(tokens_clean)
word_frequency.most_common(50)

[('yeah', 1770),
 ('know', 1535),
 ('like', 1066),
 ('dont', 668),
 ('inaudible', 510),
 ('get', 463),
 ('im', 460),
 ('got', 448),
 ('going', 442),
 ("'s", 424),
 ('think', 423),
 ('really', 418),
 ("n't", 412),
 ('one', 410),
 ('go', 404),
 ('--', 402),
 ('would', 368),
 ('didnt', 364),
 ('thats', 362),
 ('people', 349),
 ('time', 345),
 ('mean', 344),
 ('yes', 323),
 ('okay', 315),
 ('oh', 311),
 ('well', 307),
 ('right', 287),
 ('said', 271),
 ('kind', 258),
 ('want', 251),
 ('something', 223),
 ('lot', 221),
 ('say', 215),
 ('umm', 212),
 ('stuff', 211),
 ('work', 209),
 ('agree', 209),
 ('thing', 208),
 ('back', 204),
 ('shit', 198),
 ('went', 197),
 ('job', 195),
 ('could', 192),
 ('two', 192),
 ('never', 192),
 ('much', 177),
 ('day', 176),
 ("'m", 170),
 ('jail', 166),
 ('good', 163)]
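Notice that `'--'` survives in the list above. That is because `word not in string.punctuation` is a substring test against one long string of punctuation characters, and `'--'` is not a substring of it. Below is a minimal sketch of a stricter filter that drops any token made up entirely of punctuation. The token list is hypothetical, and this is one possible variant, not the filter used to produce the counts above.

```python
import string

# Hypothetical token list, similar to what a tokenizer might return
tokens = ["--", "yeah", ".", "n't", "...", "know"]

# Substring test, as in the cell above: multi-character tokens such as "--"
# are kept because they are not substrings of string.punctuation
substring_filter = [t for t in tokens if t not in string.punctuation]

# Stricter variant: drop tokens in which every character is punctuation
punct = set(string.punctuation)
strict_filter = [t for t in tokens if not all(ch in punct for ch in t)]

print(substring_filter)
print(strict_filter)
```

Tokens like `n't` contain letters, so they are kept by both filters. Whether to keep or merge them is an analytic choice.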

In [105]:
#tag each word into its part-of-speech
tagged_tokens = nltk.pos_tag(tokens_clean)
#preview what this looks like
tagged_tokens[:10]

[("'m", 'VBP'),
 ('63', 'CD'),
 ('1953', 'CD'),
 ('yeah', 'NN'),
 ('adds', 'VBZ'),
 ('right', 'JJ'),
 ('february', 'JJ'),
 ('7', 'CD'),
 ('really', 'RB'),
 ('jersey', 'JJ')]

In [98]:
#print most frequent nouns, verbs, and adjectives
adjectives = [word for (word,pos) in tagged_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']
nouns = [word for (word,pos) in tagged_tokens if pos=='NN' or pos=='NNS']
verbs = [word for (word,pos) in tagged_tokens if pos in ['VB', 'VBD','VBG','VBN','VBP','VBZ']]
freq_nouns = nltk.FreqDist(nouns)
freq_verbs = nltk.FreqDist(verbs)
freq_adjs= nltk.FreqDist(adjectives)
print("Most Frequent Nouns:")
print(freq_nouns.most_common(20))
print()
print("Most Frequent Verbs")
print(freq_verbs.most_common(20))
print()
print("Most Frequent Adjectives")
print(freq_adjs.most_common(20))

Most Frequent Nouns:
[('yeah', 799), ('dont', 355), ('people', 349), ('time', 345), ('thats', 316), ('kind', 258), ('im', 252), ('didnt', 226), ('something', 223), ('thing', 208), ('job', 195), ('lot', 178), ('day', 176), ('yes', 171), ('work', 171), ('anything', 161), ('things', 142), ('years', 140), ('youre', 139), ('nothing', 138)]

Most Frequent Verbs
[('know', 1249), ('going', 442), ('got', 424), ('go', 382), ('think', 350), ('get', 347), ('said', 271), ('say', 213), ('want', 210), ('went', 197), ('yeah', 176), ("'m", 170), ('told', 126), ('see', 119), ("'re", 113), ('make', 113), ('happened', 111), ('take', 110), ('came', 104), ('getting', 102)]

Most Frequent Adjectives
[('inaudible', 503), ('know', 195), ('yeah', 170), ('dont', 164), ('good', 159), ('umm', 157), ('mean', 152), ('right', 146), ('much', 146), ('little', 129), ('im', 129), ('sure', 124), ('whole', 119), ('different', 117), ('okay', 113), ('oh', 109), ('ive', 101), ('agree', 100), ('bad', 93), ('hmm-hmm', 80)]


In [104]:
#Let's compare two different interviews
#This could also be done to compare, for example, different cities, by concatenating the respondents by city
#SF206.1 had the highest number of words per speech act, while SF259 had one of the lowest. Let's compare these.
print(filenames)
print(filenames[7])
print(filenames[8])
compare_list = [text_list[7], text_list[8]]

['SF300.txt', 'SF335 pt 1.txt', 'SF218.2.txt', 'SF286.txt', 'SF338 pt 2.txt', 'SF218.1.txt', 'SF310.txt', 'SF206.1.txt', 'SF259.txt', 'SF206.2.txt', 'SF338 pt 1.txt', 'SF335 pt 3.txt', 'SF337.txt', 'SF335 pt 2.txt', 'SF318.txt', 'SF216.txt', 'SF261.txt', 'SF283.txt', 'SF202.txt', 'SF280.txt']
SF206.1.txt
SF259.txt


In [101]:
#create a document term matrix
countvec = CountVectorizer(stop_words='english')
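#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
df = pandas.DataFrame(countvec.fit_transform(compare_list).toarray(), columns=countvec.get_feature_names_out())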
df

Unnamed: 0,00,000,000the,01,02,03,04,05,08,09,...,ymca,youd,youi,youll,young,youngest,youre,youve,zero,zoo
0,3,0,0,3,2,0,2,2,4,1,...,1,2,1,1,7,2,11,2,2,0
1,4,2,1,1,1,5,0,0,1,0,...,0,0,0,0,1,0,0,0,1,1


In [102]:
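#implement difference of proportions
#total words per interview, then divide each count by that total to get proportions
df['word_count'] = df.sum(axis=1)
df = df.div(df['word_count'], axis=0)
#row 2 = proportion in SF206.1 (row 0) minus proportion in SF259 (row 1)
df.loc[2] = df.loc[0] - df.loc[1]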

#print most distinctive words
#higher values are most defining of SF206.1
df.loc[2].sort_values(ascending=False)

inaudible    0.022955
dont         0.017782
im           0.017459
like         0.015209
daughter     0.011639
didnt        0.010346
thats        0.010023
shes         0.008406
baby         0.008083
going        0.008032
son          0.007759
mom          0.006362
brother      0.005496
hes          0.005173
aint         0.005173
came         0.004975
said         0.004912
momma        0.004850
dad          0.004777
little       0.004714
oregon       0.004526
come         0.004266
auntie       0.004203
ill          0.003880
tell         0.003650
took         0.003619
youre        0.003556
lot          0.003276
passed       0.003233
knew         0.003233
               ...   
know        -0.003164
remember    -0.003212
probably    -0.003212
pretty      -0.003212
anger       -0.003212
worked      -0.003504
maybe       -0.003733
times       -0.003764
place       -0.003994
oh          -0.004003
people      -0.004337
didn        -0.004380
job         -0.004640
time        -0.004691
guess     

In [103]:
#most distinctive words. Lower values are most defining of SF259
df.loc[2].sort_values(ascending=True)

yeah        -0.138383
really      -0.016037
think       -0.015088
just        -0.011194
don         -0.009635
kind        -0.007987
mm          -0.007883
strongly    -0.007821
shit        -0.007591
disagree    -0.005839
ve          -0.005839
fucking     -0.005745
okay        -0.005589
agree       -0.005422
yes         -0.005401
guess       -0.004807
time        -0.004691
job         -0.004640
didn        -0.004380
people      -0.004337
oh          -0.004003
place       -0.003994
times       -0.003764
maybe       -0.003733
worked      -0.003504
anger       -0.003212
pretty      -0.003212
probably    -0.003212
remember    -0.003212
know        -0.003164
               ...   
knew         0.003233
passed       0.003233
lot          0.003276
youre        0.003556
took         0.003619
tell         0.003650
ill          0.003880
auntie       0.004203
come         0.004266
oregon       0.004526
little       0.004714
dad          0.004777
momma        0.004850
said         0.004912
came      

I see a lot of family words in the first respondent, SF206.1, and more cuss words, qualifiers, and anger words in the second respondent, SF259. This is one potential pattern to look for in your data (with many, many caveats, of course).