# Dr Turow Factiva NLP Data Analysis Request
## Etienne Jacquot - ASC IT SYSADMIN - epj@asc.upenn.edu
### Last edited: 09/24/2019

_________________________

## In the below analysis, we will perform the following steps:
1. Open & read the Factiva file provided by Dr Turow
2. Create a list of dictionaries which contain **document ID**, **article text**, **tokens**, and **POS tags**
3. Look at *collocates* around the key terms: **Voice**, **Surveillance**, and **Privacy**
4. Look at *concordance* lines for usage of the key terms
5. Use NLTK for *Natural Language Processing* to find the most common **adjectives** used **10 words to the left and right** of these key terms

__________
## Extracting Data from File

In [1]:
%run Functions.ipynb
# There is a jturow function in that other notebook which grabbed nouns, verbs, adjectives...
# The request has since changed, so it is currently not in use
# Please also note that most, if not all, of the functions were authored by Matt O'Donnell at ASC

In [2]:
characters_to_remove = '!,.()[]"'

In [3]:
# Check that the file Dr Turow provided has been uploaded to JupyterHub; it was .rtf, so I converted it and saved it as UTF-16 .txt
for item in os.listdir('../JTurow_Data'):
    if item.endswith('.txt'):
        print(item)

Factiva-Smart_Speaker_and_Voice.txt


In [4]:
# Read file of text data
for item in os.listdir('../JTurow_Data'):
    if item.endswith('.txt'):
        # Use a context manager so the file handle is closed after reading
        with open('../JTurow_Data/' + item, 'r', encoding="UTF-16") as f:
            text = f.read().splitlines()

# Preparing to capture data as list of dictionaries
articles_total=[]
articles = {}
article_txt=[]

# Go through lines in article to separate individual articles
for line in text:
    article_txt += [line]
    # Each article ends with this Document Line
    if line.startswith('Document ') and len(line) == 34:
        doc_id = line.split()[1]
        articles = {'document_ID':doc_id,
                    'article_text':article_txt, 
                    'tokens':[],
                    'POS_tag':[]
                        }
        articles_total.append(articles)    
        article_txt = []
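
The splitting rule above relies on the trailer line that closes each article: `Document ` (9 characters) followed by a 25-character ID such as `CNEWSN0020190607ef6700004` gives the 34 characters the cell checks for. Below is a minimal sketch of the same logic as a reusable function, run on made-up lines. The name `split_articles`, the sample text, and the second document ID are hypothetical and not part of Functions.ipynb.

```python
def split_articles(lines, trailer_prefix='Document ', trailer_len=34):
    """Group lines into articles, closing each one at its 'Document <ID>' trailer line."""
    articles = []
    current = []
    for line in lines:
        current.append(line)
        # The trailer line both ends the article and carries its ID
        if line.startswith(trailer_prefix) and len(line) == trailer_len:
            articles.append({'document_ID': line.split()[1],
                             'article_text': current,
                             'tokens': [],
                             'POS_tag': []})
            current = []
    return articles

# Made-up sample; the second ID is invented for illustration only
sample_lines = [
    'First made-up headline',
    'Some body text.',
    'Document CNEWSN0020190607ef6700004',
    'Second made-up headline',
    'Document CNEWSN0020190607ef6700005',
]
sample_articles = split_articles(sample_lines)
print([a['document_ID'] for a in sample_articles])
```

Any lines after the last trailer are dropped, which is also how the cell above behaves.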

In [5]:
# Total number of articles provided by Dr Turow
len(articles_total)

100

In [6]:
# Extracting tokens from article texts and updating the dictionary
tokens = []
total_tokens = []
for article in articles_total:
    for words in article['article_text']:
        tokens += tokenize(words,strip_chars=characters_to_remove,lowercase=True)
    
    # Also creating total_tokens which is all tokens in one list
    total_tokens += tokens
    article['tokens']+= tokens
    tokens = []
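
The `tokenize` helper comes from Functions.ipynb and is not shown in this notebook. As a rough illustration only, the sketch below assumes it deletes every character in `strip_chars` wherever it occurs (tokens such as `newscom` in the later trigram output suggest interior periods are removed too), optionally lowercases, and splits on whitespace. The name `tokenize_sketch` is hypothetical and the real function may behave differently.

```python
def tokenize_sketch(text, strip_chars='', lowercase=False):
    """Split text on whitespace after deleting every character in strip_chars."""
    # str.maketrans with a third argument maps those characters to None (deletion)
    cleaned = text.translate(str.maketrans('', '', strip_chars))
    if lowercase:
        cleaned = cleaned.lower()
    return cleaned.split()

characters_to_remove = '!,.()[]"'
sample_tokens = tokenize_sketch('See it at CNET News.com!',
                                strip_chars=characters_to_remove,
                                lowercase=True)
print(sample_tokens)  # ['see', 'it', 'at', 'cnet', 'newscom']
```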

In [7]:
# Applying NLTK part of speech tagging to tokens and then updating the dictionary
# Note: each token is passed to nltk.pos_tag on its own, so it is tagged without sentence context
import nltk
for article in articles_total:
    for words in article['tokens']:
        word = [words]
        nltk_text = nltk.pos_tag(word)
        article['POS_tag'] += nltk_text 

_________________


## Looking for Collocates near instances of **"Voice"**, **"Surveillance"**, and **"Privacy"** in the articles

In [8]:
# Only one article references 'surveillance'
for article in articles_total:
    for word in article['tokens']:
        if word == 'surveillance':
            print('Article ID is:',article['document_ID'], '-- this article contains reference of "surveillance"')

Article ID is: CNEWSN0020190607ef6700004 -- this article contains reference of "surveillance"


In [9]:
# Pulling collocates on word 'surveillance', 10 to left and 10 to right
jturow_surveillance_colls = Counter()
for article in articles_total:
    jturow_surveillance_colls.update(collocates(article['tokens'],'surveillance', win=[10,10]))
jturow_surveillance_colls.most_common(15)

[('a', 2),
 ('see', 1),
 ('in', 1),
 ('the', 1),
 ('future', 1),
 ('read:', 1),
 ("amazon's", 1),
 ('helping', 1),
 ('police', 1),
 ('build', 1),
 ('network', 1),
 ('with', 1),
 ('ring', 1),
 ('doorbells', 1),
 ('alexa', 1)]
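
The `collocates` helper is defined in Functions.ipynb and is not shown here. The sketch below is one plausible way a windowed collocate count could work, assuming `win=[10,10]` means 10 tokens to the left and 10 to the right of each hit. The name `collocates_sketch` is hypothetical. It is run on the single "surveillance" context found in the corpus and reproduces the count of 2 for `a` seen in the output above.

```python
from collections import Counter

def collocates_sketch(tokens, node, win=(10, 10)):
    """Count tokens within win[0] positions left and win[1] positions right of each node hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - win[0]):i]    # clamp at the start of the list
            right = tokens[i + 1:i + 1 + win[1]]   # slicing clamps at the end automatically
            counts.update(left + right)
    return counts

context = ("see in the future read: amazon's helping police build a "
           "surveillance network with ring doorbells alexa amazon has been a leader").split()
colls = collocates_sketch(context, 'surveillance', win=(10, 10))
print(colls.most_common(3))
```

Because function words such as `the` and `a` dominate raw collocate counts, a stop-word filter or a part-of-speech filter is usually needed before the list becomes informative, which is what the adjective cells later in this notebook do.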

In [10]:
# Pulling collocates on word 'voice', 10 to left and 10 to right
jturow_voice_colls = Counter()
for article in articles_total:
    jturow_voice_colls.update(collocates(article['tokens'],'voice', win=[10,10]))
jturow_voice_colls.most_common(15)

[('the', 258),
 ('to', 141),
 ('and', 125),
 ('a', 119),
 ('of', 115),
 ('with', 96),
 ('assistant', 81),
 ('you', 63),
 ('alexa', 59),
 ('google', 55),
 ('in', 53),
 ('smart', 49),
 ('that', 49),
 ('commands', 48),
 ('it', 45)]

In [11]:
# Pulling collocates on word 'privacy', 10 to left and 10 to right
jturow_privacy_colls = Counter()
for article in articles_total:
    jturow_privacy_colls.update(collocates(article['tokens'],'privacy', win=[10,10]))
jturow_privacy_colls.most_common(15)

[('the', 39),
 ('to', 34),
 ('a', 34),
 ('and', 28),
 ('of', 20),
 ('alexa', 14),
 ('concerns', 13),
 ('as', 12),
 ('on', 12),
 ('about', 11),
 ('it', 11),
 ('has', 10),
 ('that', 9),
 ('for', 8),
 ('in', 8)]

__________
## Looking for Concordance usage of terms **"Voice"**, **"Surveillance"**, and **"Privacy"** in the articles

In [12]:
# Pulling concordance lines for the words 'voice', 'surveillance', and 'privacy'
kwic_voice=[]
kwic_surv=[]
kwic_priv=[]
for article in articles_total:
    kwic_voice.extend(make_kwic('voice', article['tokens']))
    kwic_surv.extend(make_kwic('surveillance', article['tokens']))
    kwic_priv.extend(make_kwic('privacy', article['tokens']))
    
# For random sampling, not used here...
#sample_kwic_voice = random.sample(kwic_voice,30)
#sample_kwic_surv = random.sample(kwic_surv,1)
#sample_kwic_voice = sort_kwic(sample_kwic_voice, order=['L1'])
#sample_kwic_surv = sort_kwic(sample_kwic_surv, order=['L1'])


In [13]:
print('Number of kwic for word "surveillance" is:', len(kwic_surv))
print_kwic(kwic_surv)

Number of kwic for word "surveillance" is: 1
                                             see in the future read: amazon's helping police build a  surveillance  network with ring doorbells alexa amazon has been a leader
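
`make_kwic` and `print_kwic` also come from Functions.ipynb. Later cells index each concordance line as `kwic[0]` (left context) and `kwic[2]` (right context) and iterate over the words, so the sketch below assumes each line is a `(left_tokens, node, right_tokens)` tuple. The names `make_kwic_sketch` and `format_kwic_line`, and the `win` and `width` parameters, are hypothetical.

```python
def make_kwic_sketch(node, tokens, win=10):
    """Return one (left_tokens, node, right_tokens) tuple per occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            lines.append((tokens[max(0, i - win):i], tok, tokens[i + 1:i + 1 + win]))
    return lines

def format_kwic_line(kwic, width=60):
    """Right-align the left context so the node words line up in one column."""
    left, node, right = kwic
    return ' '.join(left).rjust(width) + '  ' + node + '  ' + ' '.join(right)

tokens = 'the practice raises privacy concerns for smart-speaker users'.split()
kwic_lines = make_kwic_sketch('privacy', tokens, win=3)
for kwic in kwic_lines:
    print(format_kwic_line(kwic, width=30))
```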


In [14]:
print('Number of kwic for word "voice" is:', len(kwic_voice))
print_kwic(kwic_voice)

Number of kwic for word "voice" is: 259
                                         with alexa google assistant and siri via apple homekit best  voice  control ecobee smartthermostat click to view image chris monroe/cnet the
                                                 standing directly in front of it aside from using a  voice  command testing a thermostat let's talk about testing smart thermostats
                                           or siri most connected thermostats work with at least one  voice  assistant and some like the ecobee3 lite and ecobee smartthermostat
                                          lite and ecobee smartthermostat work with all three do the  voice  commands flow naturally like they would in an actual conversation?
                                                if the goal of these companies is to eventually have  voice  assistants that we can have natural-sounding conversations with -— rather
                                           c cnet networks in

In [15]:
print('Number of kwic for word "privacy" is:', len(kwic_priv))
print_kwic(kwic_priv)

Number of kwic for word "privacy" is: 54
                                  popping up throughout people's homes and lives the practice raises  privacy  concerns for smart-speaker users in particular who might have known
                                                   these systems and what risks there may be to your  privacy  if you think about it why would you want a
                                    professor at the university of michigan who has studied people's  privacy  perceptions when it comes to smart speakers as a result
                                           about them he often hears that rather than using built-in  privacy  controls such as a physical mute button that many smart
                                                    first place and they could do more to talk about  privacy  risks users face as well as how they're protecting users'
                                         blog post that point out the companies' commitments to user  privacy  cassell and 

________
## Now pulling most frequent Adjectives near (10 words before & 10 words after) the tokens  'voice', 'surveillance', and 'privacy'
### NLTK Part-of-Speech Tags: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
    Note: the token **nest** is tagged as a superlative adjective (JJS) when it is really a proper noun referring to Google Nest.

In [29]:
Left_of_Voice = []
Right_of_Voice = []
LR_of_Voice = []
for kwic in kwic_voice:
    Left_of_Voice.append(kwic[0])
    Right_of_Voice.append(kwic[2])
    LR_of_Voice.append(kwic[0] + kwic[2])
    
nltk_LR_of_voice = []
for line in LR_of_Voice:
    for words in line:
        word = [words]
        if len(words) > 0:
            tagged = nltk.pos_tag(word)
            nltk_LR_of_voice += tagged
            
adjective_LR_voice = Counter()
for tagged_word in nltk_LR_of_voice:
    if tagged_word[1].startswith('J'):
        #print(tagged_word)
        adjective_LR_voice.update([tagged_word])
adjective_LR_voice.most_common(30)

[(('best', 'JJS'), 21),
 (('nest', 'JJS'), 14),
 (('same', 'JJ'), 10),
 (('other', 'JJ'), 9),
 (('most', 'JJS'), 7),
 (('compatible', 'JJ'), 7),
 (('own', 'JJ'), 7),
 (('few', 'JJ'), 7),
 (('unidentified', 'JJ'), 7),
 (('much', 'JJ'), 6),
 (('able', 'JJ'), 6),
 (('new', 'JJ'), 6),
 (('available', 'JJ'), 6),
 (('small', 'JJ'), 5),
 (('final', 'JJ'), 4),
 (('actual', 'JJ'), 3),
 (('important', 'JJ'), 3),
 (('easier', 'JJR'), 3),
 (('affordable', 'JJ'), 3),
 (('agnostic', 'JJ'), 3),
 (('google-owned', 'JJ'), 3),
 (('alexa-enabled', 'JJ'), 3),
 (('major', 'JJ'), 3),
 (('eponymous', 'JJ'), 3),
 (('easy', 'JJ'), 3),
 (('basic', 'JJ'), 3),
 (('larger', 'JJR'), 3),
 (('least', 'JJS'), 2),
 (('local', 'JJ'), 2),
 (('main', 'JJ'), 2)]
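
The adjective cells in this section all apply the same filter: keep `(word, tag)` pairs whose Penn Treebank tag starts with `J` (JJ, JJR, JJS). A minimal sketch of that filter as one helper is shown below, with an optional exclusion set for known mis-tags such as `nest`. The name `count_adjectives` and the `exclude` parameter are hypothetical, and the sample pairs are hand-written in the style of the tagger output rather than produced by NLTK.

```python
from collections import Counter

def count_adjectives(tagged_tokens, exclude=()):
    """Count (word, tag) pairs whose tag starts with 'J', skipping words in exclude."""
    counts = Counter()
    for word, tag in tagged_tokens:
        if tag.startswith('J') and word not in exclude:
            counts[(word, tag)] += 1
    return counts

# Hand-written (word, tag) pairs for illustration
tagged_sample = [('best', 'JJS'), ('voice', 'NN'), ('nest', 'JJS'),
                 ('same', 'JJ'), ('best', 'JJS'), ('the', 'DT')]
adjectives = count_adjectives(tagged_sample, exclude={'nest'})
print(adjectives.most_common())  # [(('best', 'JJS'), 2), (('same', 'JJ'), 1)]
```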

In [30]:
Left_of_Priv = []
Right_of_Priv = []
LR_of_Priv = []
for kwic in kwic_priv:
    Left_of_Priv.append(kwic[0])
    Right_of_Priv.append(kwic[2])
    LR_of_Priv.append(kwic[0] + kwic[2])

nltk_LR_of_priv = []
for line in LR_of_Priv:
    for words in line:
        word = [words]
        if len(words) > 0:
            tagged = nltk.pos_tag(word)
            nltk_LR_of_priv += tagged
            
adjective_LR_priv = Counter()
for tagged_word in nltk_LR_of_priv:
    if tagged_word[1].startswith('J'):
        #print(tagged_word)
        adjective_LR_priv.update([tagged_word])
adjective_LR_priv.most_common(30)

[(('such', 'JJ'), 5),
 (('physical', 'JJ'), 4),
 (('many', 'JJ'), 4),
 (('next', 'JJ'), 4),
 (('own', 'JJ'), 3),
 (('likely', 'JJ'), 3),
 (('manual', 'JJ'), 3),
 (('least', 'JJS'), 3),
 (('commercial', 'JJ'), 3),
 (('other', 'JJ'), 2),
 (('major', 'JJ'), 2),
 (('similar', 'JJ'), 2),
 (('2-mic', 'JJ'), 2),
 (('concerned', 'JJ'), 2),
 (('social', 'JJ'), 2),
 (('biggest', 'JJS'), 2),
 (('available', 'JJ'), 2),
 (('particular', 'JJ'), 1),
 (('previous', 'JJ'), 1),
 (('usual', 'JJ'), 1),
 (('nest', 'JJS'), 1),
 (('numerous', 'JJ'), 1),
 (('great', 'JJ'), 1),
 (('real', 'JJ'), 1),
 (('last', 'JJ'), 1),
 (('medical', 'JJ'), 1),
 (('big', 'JJ'), 1),
 (('fundamental', 'JJ'), 1),
 (('creative', 'JJ'), 1),
 (('hard', 'JJ'), 1)]

In [18]:
Left_of_Surv = []
Right_of_Surv = []
LR_of_Surv = []
for kwic in kwic_surv:
    Left_of_Surv.append(kwic[0])
    Right_of_Surv.append(kwic[2])
    LR_of_Surv.append(kwic[0] + kwic[2])
    
nltk_LR_of_surv = []
for line in LR_of_Surv:
    for words in line:
        word = [words]
        if len(words) > 0:
            tagged = nltk.pos_tag(word)
            nltk_LR_of_surv += tagged
            
adjective_LR_surv = Counter()
for tagged_word in nltk_LR_of_surv:
    if tagged_word[1].startswith('J'):
        #print(tagged_word)
        adjective_LR_surv.update([tagged_word])
# Results are empty: there is only 1 instance of 'surveillance' and none of its neighbouring words is tagged as an adjective
adjective_LR_surv.most_common(15)

[]

In [19]:
# Nothing here is listed as Adjective around surveillance ... 
nltk_LR_of_surv

[('see', 'VB'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('future', 'NN'),
 ('read:', 'NN'),
 ("amazon's", 'NN'),
 ('helping', 'VBG'),
 ('police', 'NNS'),
 ('build', 'NN'),
 ('a', 'DT'),
 ('network', 'NN'),
 ('with', 'IN'),
 ('ring', 'NN'),
 ('doorbells', 'NNS'),
 ('alexa', 'NN'),
 ('amazon', 'NN'),
 ('has', 'VBZ'),
 ('been', 'VBN'),
 ('a', 'DT'),
 ('leader', 'NN')]

### We can also look at the most frequent adjectives to the left and to the right of Voice separately (not as combined totals), that is, adjectives that come before Voice and adjectives that come after it

In [20]:
nltk_left_of_voice = []
for line in Left_of_Voice:
    for words in line:
        word = [words]
        if len(words) > 0:
            tagged = nltk.pos_tag(word)
            nltk_left_of_voice += tagged
adjective_left_voice = Counter()
for tagged_word in nltk_left_of_voice:
    if tagged_word[1].startswith('J'):
        #print(tagged_word)
        adjective_left_voice.update([tagged_word])
adjective_left_voice.most_common(15)

[(('best', 'JJS'), 12),
 (('nest', 'JJS'), 7),
 (('same', 'JJ'), 7),
 (('most', 'JJS'), 5),
 (('compatible', 'JJ'), 5),
 (('available', 'JJ'), 5),
 (('own', 'JJ'), 4),
 (('able', 'JJ'), 4),
 (('unidentified', 'JJ'), 4),
 (('affordable', 'JJ'), 3),
 (('agnostic', 'JJ'), 3),
 (('new', 'JJ'), 3),
 (('much', 'JJ'), 3),
 (('local', 'JJ'), 2),
 (('main', 'JJ'), 2)]

In [21]:
nltk_right_of_voice = []
for line in Right_of_Voice:
    for words in line:
        word = [words]
        if len(words) > 0:
            tagged = nltk.pos_tag(word)
            nltk_right_of_voice += tagged
adjective_right_voice = Counter()
for tagged_word in nltk_right_of_voice:
    if tagged_word[1].startswith('J'):
        #print(tagged_word)
        adjective_right_voice.update([tagged_word])
adjective_right_voice.most_common(15)

[(('best', 'JJS'), 9),
 (('other', 'JJ'), 8),
 (('nest', 'JJS'), 7),
 (('few', 'JJ'), 6),
 (('easier', 'JJR'), 3),
 (('much', 'JJ'), 3),
 (('same', 'JJ'), 3),
 (('google-owned', 'JJ'), 3),
 (('final', 'JJ'), 3),
 (('small', 'JJ'), 3),
 (('new', 'JJ'), 3),
 (('own', 'JJ'), 3),
 (('larger', 'JJR'), 3),
 (('unidentified', 'JJ'), 3),
 (('compatible', 'JJ'), 2)]

_______
# This jupyterhub document & analysis officially ends here, based on most recent request from Dr Turow!
## Below is testing & other sample requests which we put together for reference

In [22]:
total_nltk_tokens = []
target_nltk_tokens = []
IN_and_PRP_tokens = []
NNJJV_nlkt_tokens = []

for article in articles_total:
    for words in article['POS_tag']:
        total_nltk_tokens += [words]
        
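        # Excluding Prepositions & Pronouns as requested in original ticket by Dr Turow
        # (currently disabled, so target_nltk_tokens stays empty)
        # Note: 'FW' is the foreign-word tag; pronouns are tagged 'PRP' / 'PRP$'
        # targeted_ = (words[1] == 'IN' or words[1] == 'FW')
        #if not targeted_:
            #target_nltk_tokens += [words]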
        
        # Including Nouns, Verbs, Adjectives
        targeted_nouns_verbs_adj = (words[1].startswith('NN') or words[1].startswith('JJ') or words[1].startswith('V'))
        if targeted_nouns_verbs_adj:
            NNJJV_nlkt_tokens += [words]
        
        else:
            IN_and_PRP_tokens += [words]
            
# Sanity check: every tagged token landed in exactly one of the two lists
if len(total_nltk_tokens) - len(NNJJV_nlkt_tokens) == len(IN_and_PRP_tokens):
    print('Completed!')

Completed!


In [23]:
# Random example of noun, verb, or adjective only
# random.choice avoids the IndexError that randint(0, len(...)) can raise, since randint includes its upper bound
random.choice(NNJJV_nlkt_tokens)

('universal', 'NN')

In [24]:
# random from total
# Sample from the whole list (the original indexed it using the length of NNJJV_nlkt_tokens)
random.choice(total_nltk_tokens)

('can', 'MD')

In [25]:
total_most_common_nltk_NVJ = Counter()
for tokens in NNJJV_nlkt_tokens:
    total_most_common_nltk_NVJ.update(get_ngram_tokens([tokens],1))
total_most_common_nltk_NVJ.most_common(15)

[(('is', 'VBZ'), 928),
 (('smart', 'NN'), 796),
 (('amazon', 'NN'), 752),
 (('google', 'NN'), 450),
 (('home', 'NN'), 415),
 (('are', 'VBP'), 370),
 (('best', 'JJS'), 369),
 (('prime', 'NN'), 359),
 (('be', 'VB'), 341),
 (('alexa', 'NN'), 331),
 (('echo', 'NN'), 320),
 (('day', 'NN'), 311),
 (('has', 'VBZ'), 309),
 (('view', 'NN'), 296),
 (('click', 'NN'), 277)]

In [26]:
total_most_common_nltk_other = Counter()
for tokens in total_nltk_tokens:
    total_most_common_nltk_other.update(get_ngram_tokens([tokens],1))
total_most_common_nltk_other.most_common(15)

[(('the', 'DT'), 4372),
 (('to', 'TO'), 2458),
 (('a', 'DT'), 2152),
 (('and', 'CC'), 2104),
 (('of', 'IN'), 1629),
 (('for', 'IN'), 1119),
 (('you', 'PRP'), 1004),
 (('in', 'IN'), 982),
 (('is', 'VBZ'), 928),
 (('that', 'IN'), 883),
 (('it', 'PRP'), 865),
 (('on', 'IN'), 841),
 (('with', 'IN'), 841),
 (('smart', 'NN'), 796),
 (('at', 'IN'), 783)]

In [27]:
total_most_common_nltk_tokens = Counter()
for target_tokens in target_nltk_tokens:
    total_most_common_nltk_tokens.update(get_ngram_tokens([target_tokens],1))
total_most_common_nltk_tokens.most_common(30)

[]

In [28]:
# Most common 3 token combinations in total_tokens list
trigrams = Counter()
trigrams.update(get_ngram_tokens(total_tokens,3))
trigrams.most_common(30)

[('click to view', 267),
 ('to view image', 265),
 ('all rights reserved', 70),
 ('the best smart', 67),
 ('see it at', 64),
 ('inc all rights', 62),
 ('view image chris', 59),
 ('image chris monroe/cnet', 59),
 ('if you want', 58),
 ('see at amazon', 58),
 ('a smart speaker', 53),
 ('prime day deals', 53),
 ('echo show 5', 50),
 ('2019 cnet newscom', 49),
 ('cnet newscom cnewsn', 49),
 ('newscom cnewsn english', 49),
 ('cnewsn english c', 49),
 ('english c cnet', 49),
 ('c cnet networks', 49),
 ('cnet networks inc', 49),
 ('networks inc all', 49),
 ('the nest hub', 44),
 ('a lot of', 39),
 ('the google home', 39),
 ('be able to', 39),
 ('the amazon echo', 38),
 ('one of the', 37),
 ('amazon echo show', 36),
 ('google nest hub', 36),
 ('it at amazonfire', 36)]
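
`get_ngram_tokens` is another Functions.ipynb helper that is not shown here. For string tokens, the trigram output above is consistent with a sliding window whose members are joined by spaces. The sketch below assumes exactly that; the name `ngrams_sketch` is hypothetical and the real helper may differ, for example in how it handles the `(word, tag)` tuples passed to it in earlier cells.

```python
from collections import Counter

def ngrams_sketch(tokens, n):
    """Return every run of n consecutive string tokens, joined by single spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample_tokens = 'click to view image click to view image'.split()
trigram_counts = Counter(ngrams_sketch(sample_tokens, 3))
print(trigram_counts.most_common(2))  # [('click to view', 2), ('to view image', 2)]
```

Much of the real trigram list is page boilerplate ("click to view image", "all rights reserved", the CNET trailer), so stripping those lines before tokenizing would likely give a more informative ranking.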