# Part-of-Speech Tagging

### Use cases include:

- Remove all proper nouns (names of people, companies, places)
- Find statements where the grammatical subject is 'Apple'


## Identifying the part of speech for each word

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('data/news.csv')
len(df)

75

In [3]:
df.head()

| Unnamed: 0 | news_headline | news_article | news_category | full_text | clean_text |
|---|---|---|---|---|---|
| 0 | Apple loses $450 billion in market cap in the ... | Shares of the world's first trillion-dollar pu... | technology | Apple loses $450 billion in market cap in the ... | apple lose billion market cap last month share... |
| 1 | US startup makes ₹28,000 cordless 'infrared' h... | US-based startup Volo Beauty showed a ₹28,000 ... | technology | US startup makes ₹28,000 cordless 'infrared' h... | us startup make cordless infrared hair dryer u... |
| 2 | Apple, which never attends CES, puts ad near e... | Apple, which never attends technology event CE... | technology | Apple, which never attends CES, puts ad near e... | apple never attend ce put ad near event troll ... |
| 3 | Billionaire offers ₹6.5 crore to 100 people fo... | A tweet by billionaire and Japan's largest clo... | technology | Billionaire offers ₹6.5 crore to 100 people fo... | billionaire offer crore people retweet tweet t... |
| 4 | Google shifted $22.7 billion to tax haven Berm... | Google moved $22.7 billion through a Dutch she... | technology | Google shifted $22.7 billion to tax haven Berm... | google shift billion tax bermuda google move b... |


#### Load the English language model

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

#### Extract the first document and convert the text to a single string

In [5]:
sentence = str(df.iloc[0].full_text)
print(sentence)

Apple loses $450 billion in market cap in the last 3 months. Shares of the world's first trillion-dollar public company, Apple, have fallen over 38% between October 2018 and January 2019, costing it over $450 billion in market capitalisation. CEO Tim Cook recently cut Apple's quarterly revenue forecast for the first time in nearly 20 years, adding the company expects to make $84 billion, lower than the previous estimate of $89-93 billion.


#### Convert the text to a spaCy `Doc` object of tokens

In [6]:
sentence_nlp = nlp(sentence)
print(sentence_nlp)
type(sentence_nlp)

Apple loses $450 billion in market cap in the last 3 months. Shares of the world's first trillion-dollar public company, Apple, have fallen over 38% between October 2018 and January 2019, costing it over $450 billion in market capitalisation. CEO Tim Cook recently cut Apple's quarterly revenue forecast for the first time in nearly 20 years, adding the company expects to make $84 billion, lower than the previous estimate of $89-93 billion.


spacy.tokens.doc.Doc

#### Extract the parts of speech tags

**Text:** The original token text.  
**Lemma:** The base (dictionary) form of the token.  
**POS:** The coarse-grained part-of-speech tag (e.g. `NOUN`, `VERB`).  
**Tag:** The fine-grained part-of-speech tag (e.g. `NNP`, `VBZ`).  
**Dep:** The syntactic dependency, i.e. the relation between the token and its head.  
**Shape:** The word shape: capitalisation, punctuation, digits (e.g. `Apple` becomes `Xxxxx`).  
**is alpha:** Does the token consist only of alphabetic characters?  
**is stop:** Is the token on the stop list, i.e. among the most common words of the language?  
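
To make the **Shape** attribute more concrete, the following is a minimal pure-Python sketch of how such a shape string can be derived. It is an approximation for illustration only, not spaCy's own code: the function name `approx_word_shape` is hypothetical, and the rule of cutting off a run of the same character class after four repeats is an assumption chosen to match shapes seen in this notebook (for example `billion` has the shape `xxxx`).

```python
def approx_word_shape(text):
    """Approximate a spaCy-style word shape (hypothetical helper, not spaCy's API)."""
    shape = []
    last = ""   # previous shape character
    run = 0     # how many times `last` has repeated in a row
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch  # punctuation and symbols are kept as they are
        if s == last:
            run += 1
        else:
            run = 0
            last = s
        # Assumption: keep at most four consecutive identical shape characters.
        if run < 4:
            shape.append(s)
    return "".join(shape)


for word in ["Apple", "loses", "$", "450", "billion"]:
    print(word, approx_word_shape(word))
```

In the notebook itself the value comes from `word.shape_`, so this helper is only useful for understanding what the column encodes.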

In [7]:
# Build one row per token. doc_id is fixed at 1 because only the first document is processed here.
columns = ['doc_id', 'word', 'lemma', 'POS_tag', 'tag_type', 'dep', 'shape', 'is_alpha', 'is_stop']
df_pos = pd.DataFrame(
    [(1, word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
      word.shape_, word.is_alpha, word.is_stop) for word in sentence_nlp],
    columns=columns,
)

df_pos.head(10)

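| | doc_id | word | lemma | POS_tag | tag_type | dep | shape | is_alpha | is_stop |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Apple | Apple | PROPN | NNP | nsubj | Xxxxx | True | False |
| 1 | 1 | loses | lose | VERB | VBZ | ROOT | xxxx | True | False |
| 2 | 1 | $ | $ | SYM | $ | quantmod | $ | False | False |
| 3 | 1 | 450 | 450 | NUM | CD | compound | ddd | False | False |
| 4 | 1 | billion | billion | NUM | CD | dobj | xxxx | True | False |
| 5 | 1 | in | in | ADP | IN | prep | xx | True | True |
| 6 | 1 | market | market | NOUN | NN | compound | xxxx | True | False |
| 7 | 1 | cap | cap | NOUN | NN | pobj | xxx | True | False |
| 8 | 1 | in | in | ADP | IN | prep | xx | True | True |
| 9 | 1 | the | the | DET | DT | det | xxx | True | True |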
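
The two use cases listed at the start of this notebook can be sketched on these ten rows. The example below copies the `word`, `POS_tag` and `dep` values from the output above into plain tuples so it runs without spaCy or pandas; the variable names are illustrative only. It treats a token as the subject when its dependency label is `nsubj`, which is a simplifying assumption (it ignores passive subjects, for instance).

```python
# The ten token rows shown above, copied as (word, POS_tag, dep) tuples.
rows = [
    ("Apple", "PROPN", "nsubj"),
    ("loses", "VERB", "ROOT"),
    ("$", "SYM", "quantmod"),
    ("450", "NUM", "compound"),
    ("billion", "NUM", "dobj"),
    ("in", "ADP", "prep"),
    ("market", "NOUN", "compound"),
    ("cap", "NOUN", "pobj"),
    ("in", "ADP", "prep"),
    ("the", "DET", "det"),
]

# Use case 1: drop every token tagged as a proper noun (PROPN).
without_proper_nouns = [word for word, pos, _ in rows if pos != "PROPN"]

# Use case 2: check whether 'Apple' appears as a nominal subject (nsubj).
apple_is_subject = any(word == "Apple" and dep == "nsubj" for word, _, dep in rows)

print(" ".join(without_proper_nouns))
print(apple_is_subject)
```

On the full DataFrame the same filter could be written as `df_pos[df_pos.POS_tag != 'PROPN']`.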


In [8]:
df_pos.POS_tag.value_counts()

NOUN     17
NUM      15
ADP      10
PUNCT     9
PROPN     7
VERB      7
ADJ       7
SYM       5
DET       5
PART      3
ADV       2
CCONJ     1
SCONJ     1
AUX       1
PRON      1
Name: POS_tag, dtype: int64
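
As a quick sanity check, the counts above can be turned into percentage shares. The sketch below retypes the counts by hand into a dictionary (so it runs without pandas) and confirms that they add up to the 91 tokens in `df_pos`.

```python
# Tag counts copied from the value_counts() output above.
tag_counts = {
    "NOUN": 17, "NUM": 15, "ADP": 10, "PUNCT": 9, "PROPN": 7,
    "VERB": 7, "ADJ": 7, "SYM": 5, "DET": 5, "PART": 3,
    "ADV": 2, "CCONJ": 1, "SCONJ": 1, "AUX": 1, "PRON": 1,
}

total = sum(tag_counts.values())

# Percentage share of each tag, rounded to one decimal place.
shares = {tag: round(100 * n / total, 1) for tag, n in tag_counts.items()}

print(total)
for tag, share in shares.items():
    print(f"{tag:<6} {share:>5}%")
```

In pandas, `df_pos.POS_tag.value_counts(normalize=True)` returns the same proportions as fractions.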

In [9]:
df_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   doc_id    91 non-null     int64 
 1   word      91 non-null     object
 2   lemma     91 non-null     object
 3   POS_tag   91 non-null     object
 4   tag_type  91 non-null     object
 5   dep       91 non-null     object
 6   shape     91 non-null     object
 7   is_alpha  91 non-null     bool  
 8   is_stop   91 non-null     bool  
dtypes: bool(2), int64(1), object(6)
memory usage: 5.3+ KB
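
The steps above tag only the first of the 75 documents, with `doc_id` hard-coded. One possible way to extend this to every row is sketched below. The helper `token_rows` and the constant `COLUMNS` are hypothetical names introduced for illustration, and the stand-in tokens (built with `SimpleNamespace` from the first two rows of the token table) exist only so the sketch runs without spaCy installed.

```python
from types import SimpleNamespace

COLUMNS = ['doc_id', 'word', 'lemma', 'POS_tag', 'tag_type', 'dep',
           'shape', 'is_alpha', 'is_stop']


def token_rows(doc_id, tokens):
    """Return one tuple per token, in the same column order as df_pos."""
    return [
        (doc_id, t.text, t.lemma_, t.pos_, t.tag_, t.dep_,
         t.shape_, t.is_alpha, t.is_stop)
        for t in tokens
    ]


# Stand-in tokens mimicking the attributes used above (not real spaCy tokens).
fake_doc = [
    SimpleNamespace(text="Apple", lemma_="Apple", pos_="PROPN", tag_="NNP",
                    dep_="nsubj", shape_="Xxxxx", is_alpha=True, is_stop=False),
    SimpleNamespace(text="loses", lemma_="lose", pos_="VERB", tag_="VBZ",
                    dep_="ROOT", shape_="xxxx", is_alpha=True, is_stop=False),
]

rows = token_rows(0, fake_doc)
print(rows[0])

# With the real pipeline, assuming `nlp`, `df` and `pd` from the cells above:
# all_rows = [row
#             for i, doc in enumerate(nlp.pipe(df.full_text.astype(str)))
#             for row in token_rows(i, doc)]
# df_pos_all = pd.DataFrame(all_rows, columns=COLUMNS)
```

Using `nlp.pipe` processes the texts as a stream, which is generally faster than calling `nlp` on each text in a loop.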
