# NLP
Find your favorite news source and grab the article text.

1. Show the most common words in the article.
2. Show the most common words under a part of speech. (i.e. NOUN: {'Bob':12, 'Alice':4,})
3. Find a subject/object relationship through the dependency parser in any sentence.
4. Show the most common Entities and their types. 
5. Find Entites and their dependency (hint: entity.root.head)
6. Find the most similar words in the article

Note: Yes, the notebook from the video is not provided, I leave it to you to make your own :) it's your final assignment for the semester. Enjoy!

In [11]:
!pip install spacy



In [15]:
!python -m spacy download en

[!] As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the full
pipeline package name 'en_core_web_sm' instead.
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


2022-04-29 12:34:46.813795: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-04-29 12:34:46.813827: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
import spacy
# import spacy.en.download
# spacy.en.download.download.en
from spacy.lang.en import English
processor = spacy.load('en_core_web_sm')

In [3]:
import pandas as pd
import numpy as np

### 1. Show the most common words in the article.

In [4]:
from collections import Counter # simple counter assist

In [5]:
# https://www.bbc.com/news/business-61258357

text = "Twitter, which this week agreed to be bought by billionaire Elon Musk, has said its user numbers grew faster than expected over the past year. Advertising revenue has also been rising, but by less than was forecast. Some observers have questioned Mr Musk's commercial judgement in buying Twitter, a platform that despite its high profile has not consistently made high returns. In the latest quarter it made a profit of $513m (£412m) on revenues of $1.2bn. Daily active users of the platform rose to 229 million, up from 199 million a year earlier, the company said, publishing its latest financial results. New users grew faster outside the US, by 18.1%, than in its home market where numbers were up 6.4% over the 12 months to the end of March. This week Twitter's board agreed a $44bn sale to Tesla boss, Mr Musk, the world's richest person, and a prolific user of the platform. In publishing its results, the firm said it was withdrawing all previously provided guidance over its immediate commercial outlook. However, it did say revenues had been affected by 'headwinds associated with the war in Ukraine'. Mr Musk's purchase is likely to take several months to complete, after which the company will be owned privately. While Mr Musk has not made clear his precise plans for the platform, he has spoken about reducing advertising, and cracking down on 'bot' or automated accounts. He has also prompted controversy by suggesting there may be a new approach to how Twitter moderates free speech."

In [6]:
twitter = processor(text)
twitter

Twitter, which this week agreed to be bought by billionaire Elon Musk, has said its user numbers grew faster than expected over the past year. Advertising revenue has also been rising, but by less than was forecast. Some observers have questioned Mr Musk's commercial judgement in buying Twitter, a platform that despite its high profile has not consistently made high returns. In the latest quarter it made a profit of $513m (£412m) on revenues of $1.2bn. Daily active users of the platform rose to 229 million, up from 199 million a year earlier, the company said, publishing its latest financial results. New users grew faster outside the US, by 18.1%, than in its home market where numbers were up 6.4% over the 12 months to the end of March. This week Twitter's board agreed a $44bn sale to Tesla boss, Mr Musk, the world's richest person, and a prolific user of the platform. In publishing its results, the firm said it was withdrawing all previously provided guidance over its immediate commer

In [7]:
tokens = [token.text for token in twitter if not token.is_punct if not token.is_stop] # needed to remove pun otherwise 'words' only showed up once
tokens[0:10]

['Twitter',
 'week',
 'agreed',
 'bought',
 'billionaire',
 'Elon',
 'Musk',
 'said',
 'user',
 'numbers']

In [8]:
freq = Counter(tokens)
freq.most_common()[0:10]

[('Musk', 5),
 ('Twitter', 4),
 ('Mr', 4),
 ('platform', 4),
 ('said', 3),
 ('$', 3),
 ('week', 2),
 ('agreed', 2),
 ('user', 2),
 ('numbers', 2)]

### 2. Show the most common words under a part of speech. (i.e. NOUN: {'Bob':12, 'Alice':4,})

In [9]:
pos = [token.pos_ for token in twitter if not token.is_punct if not token.is_stop]
pos[0:5], tokens[0:5]

(['PROPN', 'NOUN', 'VERB', 'VERB', 'NOUN'],
 ['Twitter', 'week', 'agreed', 'bought', 'billionaire'])

In [10]:
df = pd.DataFrame({"pos": pos, "word": tokens})
df.head()

Unnamed: 0,pos,word
0,PROPN,Twitter
1,NOUN,week
2,VERB,agreed
3,VERB,bought
4,NOUN,billionaire


In [11]:
df['count'] = df.groupby('word')['word'].transform('count')
df = df.sort_values(['pos', 'count', 'word'], ascending = [True, False, True])
df.head(25) # shows words by pos and how many

Unnamed: 0,pos,word,count
23,ADJ,commercial,2
103,ADJ,commercial,2
29,ADJ,high,2
32,ADJ,high,2
34,ADJ,latest,2
60,ADJ,latest,2
63,ADJ,New,1
47,ADJ,active,1
122,ADJ,clear,1
61,ADJ,financial,1


In [12]:
# convert to a list
list = df.values.tolist()
list[0:25]

[['ADJ', 'commercial', 2],
 ['ADJ', 'commercial', 2],
 ['ADJ', 'high', 2],
 ['ADJ', 'high', 2],
 ['ADJ', 'latest', 2],
 ['ADJ', 'latest', 2],
 ['ADJ', 'New', 1],
 ['ADJ', 'active', 1],
 ['ADJ', 'clear', 1],
 ['ADJ', 'financial', 1],
 ['ADJ', 'free', 1],
 ['ADJ', 'immediate', 1],
 ['ADJ', 'likely', 1],
 ['ADJ', 'new', 1],
 ['ADJ', 'past', 1],
 ['ADJ', 'precise', 1],
 ['ADJ', 'prolific', 1],
 ['ADJ', 'richest', 1],
 ['ADP', 'outside', 1],
 ['ADV', 'faster', 2],
 ['ADV', 'faster', 2],
 ['ADV', 'Daily', 1],
 ['ADV', 'consistently', 1],
 ['ADV', 'earlier', 1],
 ['ADV', 'previously', 1]]

In [13]:
# remove duplicates from list
new_list = []
for i in list:
    if i not in new_list:
        new_list.append(i)

new_list[0:25]

[['ADJ', 'commercial', 2],
 ['ADJ', 'high', 2],
 ['ADJ', 'latest', 2],
 ['ADJ', 'New', 1],
 ['ADJ', 'active', 1],
 ['ADJ', 'clear', 1],
 ['ADJ', 'financial', 1],
 ['ADJ', 'free', 1],
 ['ADJ', 'immediate', 1],
 ['ADJ', 'likely', 1],
 ['ADJ', 'new', 1],
 ['ADJ', 'past', 1],
 ['ADJ', 'precise', 1],
 ['ADJ', 'prolific', 1],
 ['ADJ', 'richest', 1],
 ['ADP', 'outside', 1],
 ['ADV', 'faster', 2],
 ['ADV', 'Daily', 1],
 ['ADV', 'consistently', 1],
 ['ADV', 'earlier', 1],
 ['ADV', 'previously', 1],
 ['ADV', 'privately', 1],
 ['NOUN', 'platform', 4],
 ['NOUN', 'company', 2],
 ['NOUN', 'months', 2]]

In [15]:
# group by pos
output = {}

for x, y, z in new_list:
    group = output.get(x, [])
    group.extend((y, z))
    output[x] = group

print(output)

{'ADJ': ['commercial', 2, 'high', 2, 'latest', 2, 'New', 1, 'active', 1, 'clear', 1, 'financial', 1, 'free', 1, 'immediate', 1, 'likely', 1, 'new', 1, 'past', 1, 'precise', 1, 'prolific', 1, 'richest', 1], 'ADP': ['outside', 1], 'ADV': ['faster', 2, 'Daily', 1, 'consistently', 1, 'earlier', 1, 'previously', 1, 'privately', 1], 'NOUN': ['platform', 4, 'company', 2, 'months', 2, 'numbers', 2, 'results', 2, 'revenues', 2, 'user', 2, 'users', 2, 'week', 2, 'year', 2, '44bn', 1, 'Advertising', 1, 'accounts', 1, 'advertising', 1, 'approach', 1, 'billionaire', 1, 'board', 1, 'boss', 1, 'bot', 1, 'controversy', 1, 'end', 1, 'firm', 1, 'guidance', 1, 'headwinds', 1, 'home', 1, 'judgement', 1, 'market', 1, 'observers', 1, 'outlook', 1, 'person', 1, 'plans', 1, 'profile', 1, 'profit', 1, 'purchase', 1, 'quarter', 1, 'returns', 1, 'revenue', 1, 'sale', 1, 'speech', 1, 'war', 1, 'world', 1], 'NUM': ['m', 2, 'million', 2, '1.2bn', 1, '12', 1, '18.1', 1, '199', 1, '229', 1, '412', 1, '513', 1, '6.4',

### 3. Find a subject/object relationship through the dependency parser in any sentence.

In [16]:
def pr_tree(word, level):
    if word.is_punct:
        return
    for child in word.lefts:
        pr_tree(child, level+1)
        print('\t' * level + word.text + ' - ' + word.dep_)
        for child in word.rights:
            pr_tree(child, level+1)

In [17]:
for sentence in twitter.sents:
    pr_tree(sentence.root, 0)
    print('---------------------------------------------')

said - ROOT
		numbers - nsubj
		numbers - nsubj
	grew - ccomp
said - ROOT
		numbers - nsubj
		numbers - nsubj
	grew - ccomp
---------------------------------------------
	revenue - nsubj
rising - ROOT
	forecast - conj
	forecast - conj
rising - ROOT
	forecast - conj
	forecast - conj
rising - ROOT
	forecast - conj
	forecast - conj
rising - ROOT
	forecast - conj
	forecast - conj
---------------------------------------------
	observers - nsubj
questioned - ROOT
		Musk - poss
	judgement - dobj
	judgement - dobj
questioned - ROOT
		Musk - poss
	judgement - dobj
	judgement - dobj
---------------------------------------------
made - ROOT
	profit - dobj
	412 - dobj
made - ROOT
	profit - dobj
	412 - dobj
---------------------------------------------
		users - nsubj
		users - nsubj
	rose - ccomp
said - ROOT
said - ROOT
	company - nsubj
said - ROOT
---------------------------------------------
	users - nsubj
grew - ROOT
---------------------------------------------
	week - npadvmod
agreed - ROOT
	

### 4. Show the most common Entities and their types. 

In [18]:
for entity in twitter.ents:
    print(entity, entity.label_, )

this week DATE
Elon Musk PERSON
the past year DATE
Musk PERSON
Twitter PRODUCT
the latest quarter DATE
513 MONEY
Daily DATE
229 million CARDINAL
199 million CARDINAL
a year earlier DATE
US GPE
18.1% PERCENT
6.4% PERCENT
the 12 months DATE
the end of March DATE
Twitter PERSON
44bn MONEY
Tesla ORG
Musk PERSON
Ukraine GPE
Musk PERSON
several months DATE
Musk PERSON
Twitter PRODUCT


### 5. Find Entites and their dependency (hint: entity.root.head)

In [19]:
n = 0
for entity in twitter.ents:
    print("'", entity, "' ", "is dependent on ", "'", entity.root.head, "'", sep = '')
    n += 1

'this week' is dependent on 'agreed'
'Elon Musk' is dependent on 'by'
'the past year' is dependent on 'over'
'Musk' is dependent on 'judgement'
'Twitter' is dependent on 'buying'
'the latest quarter' is dependent on 'In'
'513' is dependent on 'of'
'Daily' is dependent on 'users'
'229 million' is dependent on 'to'
'199 million' is dependent on 'from'
'a year earlier' is dependent on 'from'
'US' is dependent on 'outside'
'18.1%' is dependent on 'by'
'6.4%' is dependent on 'up'
'the 12 months' is dependent on 'over'
'the end of March' is dependent on 'to'
'Twitter' is dependent on 'board'
'44bn' is dependent on 'sale'
'Tesla' is dependent on 'boss'
'Musk' is dependent on 'sale'
'Ukraine' is dependent on 'in'
'Musk' is dependent on 'purchase'
'several months' is dependent on 'take'
'Musk' is dependent on 'made'
'Twitter' is dependent on 'moderates'


### 6. Find the most similar words in the article

In [21]:
# used doc.ents from spaCy documentation to find similar words/phrases
similar_words = [(token_first.text, token_second.text, token_first.similarity(token_second)) for token_first in twitter.ents for token_second in twitter.ents]
similar_words[0:50]

  similar_words = [(token_first.text, token_second.text, token_first.similarity(token_second)) for token_first in twitter.ents for token_second in twitter.ents]


[('this week', 'this week', 1.0),
 ('this week', 'Elon Musk', 0.1362256109714508),
 ('this week', 'the past year', 0.4012893736362457),
 ('this week', 'Musk', 0.08003652095794678),
 ('this week', 'Twitter', -0.0020605523604899645),
 ('this week', 'the latest quarter', 0.26848894357681274),
 ('this week', '513', -0.11720535904169083),
 ('this week', 'Daily', -0.03588566556572914),
 ('this week', '229 million', -0.048933278769254684),
 ('this week', '199 million', -0.0656629130244255),
 ('this week', 'a year earlier', 0.4542818069458008),
 ('this week', 'US', 0.04240124300122261),
 ('this week', '18.1%', 0.17094899713993073),
 ('this week', '6.4%', 0.13086354732513428),
 ('this week', 'the 12 months', 0.2084415853023529),
 ('this week', 'the end of March', 0.2791098654270172),
 ('this week', 'Twitter', 0.14400601387023926),
 ('this week', '44bn', 0.027430834248661995),
 ('this week', 'Tesla', -0.07456112653017044),
 ('this week', 'Musk', 0.11970679461956024),
 ('this week', 'Ukraine', -0

In [30]:
# convert tupils to dataframe to clean up
df = pd.DataFrame(similar_words, columns =['first', 'second', 'score'])
df.head()

Unnamed: 0,first,second,score
0,this week,this week,1.0
1,this week,Elon Musk,0.136226
2,this week,the past year,0.401289
3,this week,Musk,0.080037
4,this week,Twitter,-0.002061


In [32]:
# organize dataframe by score value (after removing duplicates)
df = df.sort_values(['score'], ascending = [False])
df = df.drop_duplicates()

df.head(50)

Unnamed: 0,first,second,score
0,this week,this week,1.0
52,the past year,the past year,1.0
404,Twitter,Twitter,1.0
390,the end of March,the end of March,1.0
364,the 12 months,the 12 months,1.0
338,6.4%,6.4%,1.0
286,US,US,1.0
260,a year earlier,a year earlier,1.0
234,199 million,199 million,1.0
208,229 million,229 million,1.0
