# How does Computational Text Analysis Look

Below are just a few things you can do with computer-assisted, or computational text analysis. 

Like I always say, these methods can assist you with your analysis, they do not replace it. The meaning comes from the analyst, but these methods can help us dig deeper into meanings.

In this NB I'll present two broad categories of calculations/analyses you can do with computational text analysis:

1. Descriptive statistics
2. Frequent and Distinctive Words

This is meant to get you thinking about how a computer "reads" text, so you can start thinking about the strenghts of a computer compared to a human, and how you might combine the strengths of both to build a better content analysis project.

### 1. Descriptive Statistics

First, I'll create a pandas dataframe with the sample interviews, wotj each respondent statement as a separate row. attached to the filename to indicate to which interview the statement belongs.

I'll then calculate the number of speech acts for each interview, the average number of words by speech statement overall, and the average number of words by speech statement separated by interview.

In [54]:
#first import all packages needed for the NB
import re
import os
import nltk
from nltk.corpus import stopwords
import string
import pandas
from sklearn.feature_extraction.text import CountVectorizer



In [55]:
# change these for your machine
data_folder = "/Volumes/Extra Space/Google Drive/Scholarship/Workshops - Talks - Notes/Comp Text Analysis - CTAWG/Computational Text Analysis Working Group/Smith Interview Project/Smith Interview Sample/data/"
my_file = open("/Volumes/Extra Space/Google Drive/Scholarship/Workshops - Talks - Notes/Comp Text Analysis - CTAWG/Computational Text Analysis Working Group/Smith Interview Project/Smith Interview Sample/data/SF657, 1 and 2 txt file.txt", encoding='utf-8', errors='ignore').read()
my_file

'Interview: SF657\nI:\tSo, before we get started on the interview, I just kind of want to go over some logistics first. I guess the first question is can you hear me okay?\nR:\tI can hear you perfectly.\nI:\tGreat. So the first thing is, would you be able to give me your address? We’re compensating our interviewees $40--\nR:\tSo here’s what I’m going to ask you. I work [inaudible 1:26] program of recovery right now, and I see this as part of my service that I can make amends.\nI:\tOkay. \nR:\tAnd I’d like you to donate that money, if you could really do this, is to The Safe House. The Safe House is actually a place in San Francisco that gets women off the street who have been in prostitution. If you could instead of giving it to me, if you guys could make a check out to them and give them money I would really appreciate it.\nI:\tThat’s really wonderful of you. Yes, I believe we’ve had other respondents ask the same, and I believe that that is possible. So I can talk with the lead resea

In [56]:
#create a dataframe
folder_path = data_folder
#print(os.listdir(folder_path))
filenames = os.listdir(folder_path)
columns = ['interview', 'statement_combined']
master_df = pandas.DataFrame(columns=columns)

for f in filenames:
    df = pandas.DataFrame(columns=columns)
    my_file = open(folder_path+f, encoding='utf-8', errors='ignore').read()
    respondent = re.findall("(R|I|OV):\t(.*)", my_file)
    df['statement_combined'] = respondent
    df['interview'] = f[:-4]
    master_df = pandas.concat([df,master_df])

master_df

Unnamed: 0,interview,statement_combined
0,"SF657, 1 and 2 txt file","(I, So, before we get started on the interview..."
1,"SF657, 1 and 2 txt file","(R, I can hear you perfectly.)"
2,"SF657, 1 and 2 txt file","(I, Great. So the first thing is, would you be..."
3,"SF657, 1 and 2 txt file","(R, So here’s what I’m going to ask you. I wor..."
4,"SF657, 1 and 2 txt file","(I, Okay. )"
5,"SF657, 1 and 2 txt file","(R, And I’d like you to donate that money, if ..."
6,"SF657, 1 and 2 txt file","(I, That’s really wonderful of you. Yes, I bel..."
7,"SF657, 1 and 2 txt file","(R, That’s fine.)"
8,"SF657, 1 and 2 txt file","(I, We do need to send you a consent form. I’m..."
9,"SF657, 1 and 2 txt file","(R, Yes. And nothing I say is going to have my..."


In [57]:
print(open(folder_path+filenames[0], encoding='utf-8', errors='ignore').read())




In [58]:
# master_df.iloc[100]['statement']
master_df[['speaker','statement']] = pandas.DataFrame([x for x in master_df.statement_combined])
# master_df = master_df.drop('statement')
del master_df['statement_combined']
master_df

Unnamed: 0,interview,speaker,statement
0,"SF657, 1 and 2 txt file",I,"So, before we get started on the interview, I ..."
1,"SF657, 1 and 2 txt file",R,I can hear you perfectly.
2,"SF657, 1 and 2 txt file",I,"Great. So the first thing is, would you be abl..."
3,"SF657, 1 and 2 txt file",R,So here’s what I’m going to ask you. I work [i...
4,"SF657, 1 and 2 txt file",I,Okay.
5,"SF657, 1 and 2 txt file",R,"And I’d like you to donate that money, if you ..."
6,"SF657, 1 and 2 txt file",I,"That’s really wonderful of you. Yes, I believe..."
7,"SF657, 1 and 2 txt file",R,That’s fine.
8,"SF657, 1 and 2 txt file",I,We do need to send you a consent form. I’m goi...
9,"SF657, 1 and 2 txt file",R,Yes. And nothing I say is going to have my nam...


In [17]:
#number of speech acts by interview
master_df['interview'].value_counts().sort_values(ascending=False)

SF259                      618
SF283                      569
SF300                      523
SF216                      511
SF310                      474
SF202                      463
SF280                      408
SF286                      396
SF218.1                    339
SF318                      331
SF261                      323
SF657, 1 and 2 txt file    290
SF335 pt 2                 221
SF206.1                    142
SF335 pt 3                 128
SF206.2                    108
SF218.2                     81
SF335 pt 1                  54
SF338 pt 1                   1
Name: interview, dtype: int64

In [18]:
#create a word count column
master_df['tokens'] = master_df['statement'].str.split()
master_df['word_count'] = master_df['tokens'].str.len()
master_df

Unnamed: 0,interview,statement,tokens,word_count
0,"SF657, 1 and 2 txt file",\tI can hear you perfectly.,"[I, can, hear, you, perfectly.]",5
1,"SF657, 1 and 2 txt file",\tSo here’s what I’m going to ask you. I work ...,"[So, here’s, what, I’m, going, to, ask, you., ...",31
2,"SF657, 1 and 2 txt file","\tAnd I’d like you to donate that money, if yo...","[And, I’d, like, you, to, donate, that, money,...",68
3,"SF657, 1 and 2 txt file",\tThat’s fine.,"[That’s, fine.]",2
4,"SF657, 1 and 2 txt file",\tYes. And nothing I say is going to have my n...,"[Yes., And, nothing, I, say, is, going, to, ha...",14
5,"SF657, 1 and 2 txt file",\t[gives address],"[[gives, address]]",2
6,"SF657, 1 and 2 txt file",\t[Inaudible 3:13].,"[[Inaudible, 3:13].]",2
7,"SF657, 1 and 2 txt file",\t[Inaudible 3:25].,"[[Inaudible, 3:25].]",2
8,"SF657, 1 and 2 txt file",\tI’m on a headset. [Inaudible 3:44].,"[I’m, on, a, headset., [Inaudible, 3:44].]",6
9,"SF657, 1 and 2 txt file","\tThey will not have my address, right?","[They, will, not, have, my, address,, right?]",7


In [11]:
#average number of words per speech act
master_df['word_count'].mean()

19.956521739130434

In [19]:
#average number of words per speech act, by interview
grouped = master_df.groupby('interview')
print(grouped['word_count'].mean().sort_values(ascending=False))

#note: SF269 had the most number of speech acts, but the lower end of words per speech act

interview
SF206.1                    50.760563
SF286                      37.661616
SF206.2                    37.481481
SF283                      26.936731
SF300                      23.286807
SF202                      21.952484
SF318                      19.990937
SF310                      17.586498
SF261                      17.582043
SF218.1                    17.262537
SF657, 1 and 2 txt file    16.500000
SF280                      16.313725
SF218.2                    16.283951
SF216                      12.150685
SF259                      12.042071
SF335 pt 3                  7.195312
SF335 pt 2                  6.819005
SF338 pt 1                  4.000000
SF335 pt 1                  3.203704
Name: word_count, dtype: float64


### Frequent and Distinguising Words

Next I'll look at the words themselves. I'll first treat all interviews as one long text, and look at the most frequent words overall, and the most frequent nouns, verbs, and adjectives.

Second I'll compare two interviews, and identify the most distinctive words for each of these interviews when compared to one another. This could be done, for example, when comparing cities, or perhaps specific questions.

In [20]:
#create list with all interviews
folder_path = data_folder
#print(os.listdir(folder_path))
filenames = os.listdir(folder_path)
text_list = []
for f in filenames:
    my_file = open(folder_path+f, encoding='utf-8', errors='ignore').read()
    respondent = re.findall("R:(.*)", my_file)
    text_list.append(' '.join(x for x in respondent))

In [21]:
#concatenate list into one long string and tokenize
text = ' '.join(x for x in text_list)
tokens = nltk.word_tokenize(text)

In [22]:
#remove punctuation, lowercase, and remove stopwords. Print out most frequent words
tokens_clean = [word for word in tokens if word not in string.punctuation]
tokens_clean = [word.lower() for word in tokens_clean]
tokens_clean = [word for word in tokens_clean if word not in stopwords.words('english')]

word_frequency = nltk.FreqDist(tokens_clean)
word_frequency.most_common(50)

[('yeah', 1797),
 ('know', 1560),
 ('like', 1079),
 ('dont', 668),
 ('inaudible', 550),
 ('get', 486),
 ('going', 475),
 ('got', 466),
 ('im', 460),
 ('think', 452),
 ('one', 433),
 ('really', 432),
 ('--', 429),
 ("'s", 424),
 ('would', 419),
 ('go', 417),
 ("n't", 412),
 ('people', 376),
 ('time', 366),
 ('didnt', 364),
 ('thats', 362),
 ('mean', 358),
 ('yes', 340),
 ('okay', 321),
 ('well', 320),
 ('oh', 314),
 ('right', 305),
 ('said', 272),
 ('kind', 265),
 ('want', 262),
 ('say', 258),
 ('agree', 238),
 ('lot', 231),
 ('something', 230),
 ('umm', 228),
 ('thing', 221),
 ('work', 216),
 ('stuff', 214),
 ('back', 207),
 ('two', 205),
 ('job', 203),
 ('went', 203),
 ('could', 202),
 ('never', 201),
 ('shit', 199),
 ('much', 183),
 ('day', 179),
 ('good', 175),
 ('strongly', 172),
 ("'m", 170)]

In [23]:
#tag each word into its part-of-speech
tagged_tokens = nltk.pos_tag(tokens_clean)
#preview what this looks like
tagged_tokens[:10]

[('okay', 'JJ'),
 ('okay', 'NN'),
 ('yeah', '$'),
 ('23', 'CD'),
 ('may', 'MD'),
 ('13', 'CD'),
 ('93', 'CD'),
 ('yeah', 'JJ'),
 ('san', 'JJ'),
 ('francisco', 'JJ')]

In [24]:
#print most frequent nouns, verbs, and adjectives
adjectives = [word for (word,pos) in tagged_tokens if pos == 'JJ' or pos=='JJR' or pos=='JJS']
nouns = [word for (word,pos) in tagged_tokens if pos=='NN' or pos=='NNS']
verbs = [word for (word,pos) in tagged_tokens if pos in ['VB', 'VBD','VBG','VBN','VBP','VBZ']]
freq_nouns = nltk.FreqDist(nouns)
freq_verbs = nltk.FreqDist(verbs)
freq_adjs= nltk.FreqDist(adjectives)
print("Most Frequent Nouns:")
print(freq_nouns.most_common(20))
print()
print("Most Frequent Verbs")
print(freq_verbs.most_common(20))
print()
print("Most Frequent Adjectives")
print(freq_adjs.most_common(20))

Most Frequent Nouns:
[('yeah', 815), ('people', 376), ('time', 366), ('dont', 355), ('thats', 316), ('kind', 265), ('im', 252), ('something', 230), ('didnt', 226), ('thing', 221), ('job', 203), ('lot', 186), ('day', 179), ('yes', 178), ('work', 178), ('anything', 162), ('things', 156), ('years', 149), ('way', 141), ('youre', 139)]

Most Frequent Verbs
[('know', 1271), ('going', 475), ('got', 441), ('go', 395), ('think', 376), ('get', 359), ('said', 272), ('say', 256), ('want', 220), ('went', 203), ('yeah', 176), ("'m", 170), ('told', 126), ('see', 121), ('make', 120), ('take', 116), ("'re", 113), ('happened', 112), ('agree', 109), ('getting', 108)]

Most Frequent Adjectives
[('inaudible', 543), ('know', 196), ('yeah', 174), ('good', 171), ('umm', 170), ('dont', 164), ('mean', 159), ('right', 157), ('much', 151), ('little', 132), ('im', 129), ('sure', 126), ('different', 121), ('whole', 121), ('agree', 115), ('okay', 112), ('oh', 108), ('ive', 101), ('bad', 94), ('first', 84)]


In [25]:
#Let's compare two different interviews
#This could also be done to compare, for example, different cities, by concatening the respondents by city
#SF206.1 had the highest number of words per speech act, while SF259 had one of the lowest. Let's compare these.
print(filenames)
print(filenames[7])
print(filenames[8])
compare_list = [text_list[7], text_list[8]]

['Icon\r', 'SF202.txt', 'SF206.1.txt', 'SF206.2.txt', 'SF216.txt', 'SF218.1.txt', 'SF218.2.txt', 'SF259.txt', 'SF261.txt', 'SF280.txt', 'SF283.txt', 'SF286.txt', 'SF300.txt', 'SF310.txt', 'SF318.txt', 'SF335 pt 1.txt', 'SF335 pt 2.txt', 'SF335 pt 3.txt', 'SF337.txt', 'SF338 pt 1.txt', 'SF338 pt 2.txt', 'SF657, 1 and 2 txt file.txt']
SF259.txt
SF261.txt


In [26]:
#create a document term matrix
countvec = CountVectorizer(stop_words='english')
df = pandas.DataFrame(countvec.fit_transform(compare_list).toarray(), columns=countvec.get_feature_names())
df

Unnamed: 0,00,000,000the,01,02,03,08,09,10,11,...,yearand,years,yelled,yelling,yep,yes,young,youre,zero,zoo
0,4,2,1,1,1,5,1,0,2,2,...,1,13,0,0,3,34,1,0,1,1
1,3,13,0,3,1,0,0,1,3,0,...,0,23,3,2,0,16,1,14,0,0


In [27]:
#implement difference of proportions
df['word_count'] = df.sum(axis=1)
df = df.iloc[:,:].div(df.word_count, axis=0)
df.loc[2] = df.loc[0] - df.loc[1]

#print most distinctive words
#higher values are most defining of SF206.1
df.loc[2].sort_values(ascending=False)

yeah          0.118732
think         0.014074
don           0.009635
kind          0.009530
just          0.009340
really        0.008556
hmm           0.008175
mm            0.007883
shit          0.007194
fucking       0.006715
okay          0.006191
ve            0.005839
strongly      0.005689
like          0.005659
time          0.004955
guess         0.004836
didn          0.004380
maybe         0.004380
doing         0.004252
fuck          0.003796
got           0.003683
yes           0.003578
stuff         0.003166
community     0.002920
worked        0.002710
wasn          0.002628
cool          0.002523
make          0.002418
dad           0.002336
case          0.002336
                ...   
stage        -0.002778
hes          -0.002778
going        -0.002786
days         -0.002800
close        -0.002883
situation    -0.002883
understand   -0.002987
said         -0.002987
diversion    -0.002987
went         -0.003033
knew         -0.003175
support      -0.003175
right      

In [28]:
#most distinctive words. Lower values are most defining of SF259
df.loc[2].sort_values(ascending=True)

im           -0.018651
thats        -0.012698
dont         -0.011508
ive          -0.008730
high         -0.006057
youre        -0.005556
absolutely   -0.005556
years        -0.005331
theres       -0.005159
job          -0.004957
000          -0.004575
sure         -0.004515
things       -0.004118
need         -0.003968
nope         -0.003968
good         -0.003886
theyre       -0.003571
right        -0.003557
support      -0.003175
knew         -0.003175
went         -0.003033
diversion    -0.002987
said         -0.002987
understand   -0.002987
situation    -0.002883
close        -0.002883
days         -0.002800
going        -0.002786
hes          -0.002778
stage        -0.002778
                ...   
case          0.002336
dad           0.002336
make          0.002418
cool          0.002523
wasn          0.002628
worked        0.002710
community     0.002920
stuff         0.003166
yes           0.003578
got           0.003683
fuck          0.003796
doing         0.004252
maybe      

I see a lot of family words in the first respondent, SF206.1. Mores cuss words, qualifiers, and anger words in the second respondent, SF259. This is one potential pattern to look for in you data (with many many caveats, of course).