
# **Part of Speech (POS) Tagging**

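For this, I used the spaCy library. I first imported the library and loaded the en_core_web_sm model, then passed the input text to it. Finally, for each token I printed word.pos_ (the coarse-grained tag), word.tag_ (the fine-grained tag), and spacy.explain(word.tag_) (a short description of the fine-grained tag).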

spaCy ships with pre-trained statistical models that use a word's context (the surrounding words) to predict its most likely POS tag, so the same word can receive different tags in different positions.

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')
data = sp("Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary.")
for word in data:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')

Xi           PROPN      NNP      noun, proper singular
Jinping      PROPN      NNP      noun, proper singular
is           AUX        VBZ      verb, 3rd person singular present
a            DET        DT       determiner
Chinese      ADJ        JJ       adjective
politician   NOUN       NN       noun, singular or mass
who          PRON       WP       wh-pronoun, personal
has          AUX        VBZ      verb, 3rd person singular present
served       VERB       VBN      verb, past participle
as           SCONJ      IN       conjunction, subordinating or preposition
General      PROPN      NNP      noun, proper singular
Secretary    PROPN      NNP      noun, proper singular
of           ADP        IN       conjunction, subordinating or preposition
the          DET        DT       determiner
Chinese      PROPN      NNP      noun, proper singular
Communist    PROPN      NNP      noun, proper singular
Party        PROPN      NNP      noun, proper singular
(            PUNCT      -LRB-    le
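The output above already shows this context sensitivity: Chinese is tagged ADJ in "a Chinese politician" but PROPN in "the Chinese Communist Party". The following is only a small plain-Python sketch of how one might spot such cases. The rows are copied by hand from the first 17 lines of the printed output rather than produced by spaCy, and the helper name `ambiguous_tokens` is made up for illustration.

```python
from collections import defaultdict

# (token, coarse POS tag) pairs copied from the printed output above
rows = [
    ("Xi", "PROPN"), ("Jinping", "PROPN"), ("is", "AUX"), ("a", "DET"),
    ("Chinese", "ADJ"), ("politician", "NOUN"), ("who", "PRON"),
    ("has", "AUX"), ("served", "VERB"), ("as", "SCONJ"),
    ("General", "PROPN"), ("Secretary", "PROPN"), ("of", "ADP"),
    ("the", "DET"), ("Chinese", "PROPN"), ("Communist", "PROPN"),
    ("Party", "PROPN"),
]


def ambiguous_tokens(tagged_rows):
    """Return tokens that were given more than one distinct tag."""
    tags_seen = defaultdict(set)
    for token, tag in tagged_rows:
        tags_seen[token].add(tag)
    # Keep only tokens with at least two different tags, sorted for stable output
    return {tok: sorted(tags) for tok, tags in tags_seen.items() if len(tags) > 1}


ambiguous = ambiguous_tokens(rows)
print(ambiguous)  # {'Chinese': ['ADJ', 'PROPN']}
```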

# **Dividing into Tokens**

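For this, I used the Gensim library. corpora.Dictionary assigns an integer id to each unique token, and I printed its token2id attribute (a dict, not a function) to show every token with its id.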



In [None]:
import gensim
from gensim import corpora
from pprint import pprint
text = "Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary."
tokens = [text.split()]  # one document: a single list of whitespace-separated tokens (the old comprehension repeated this list once per character of text)
gensim_dictionary = corpora.Dictionary(tokens)

print(f"The dictionary has: {len(gensim_dictionary)} tokens")
print(gensim_dictionary.token2id)

# Another way of printing
for k, v in gensim_dictionary.token2id.items():
  print(f'{k:{15}} {v:{10}}')

The dictionary has: 71 tokens
{'(CCP)': 0, '(CMC)': 1, '(PRC)': 2, '2012,': 3, '2012.': 4, '2013.': 5, 'CCP': 6, 'Central': 7, 'Chairman': 8, 'China': 9, 'China,': 10, 'Chinese': 11, 'Commission': 12, 'Communist': 13, 'County': 14, 'Cultural': 15, 'General': 16, 'He': 17, 'Jinping': 18, 'Liangjiahe,': 19, 'Military': 20, 'Party': 21, "People's": 22, 'President': 23, 'Republic': 24, 'Revolution': 25, 'Secretary': 26, 'The': 27, 'Xi': 28, 'Yanchuan': 29, 'Zhongxun,': 30, 'a': 31, 'and': 32, 'as': 33, 'been': 34, 'cave': 35, 'country,': 36, 'during': 37, 'exiled': 38, "father's": 39, 'following': 40, 'has': 41, 'he': 42, 'his': 43, 'in': 44, 'is': 45, 'joined': 46, 'leader': 47, 'lived': 48, 'most': 49, 'of': 50, 'paramount': 51, 'party': 52, 'political': 53, 'politician': 54, 'prominent': 55, 'purge': 56, 'rural': 57, 'secretary.': 58, 'served': 59, 'since': 60, 'son': 61, 'teenager': 62, 'the': 63, 'to': 64, 'veteran': 65, 'village': 66, 'was': 67, 'where': 68, 'who': 69, 'worked': 70}
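Because the text is split on whitespace only, punctuation stays attached to the tokens, which is why '2012,' and '2012.' appear above as two separate entries. The sketch below is one possible way to get cleaner tokens with a regular expression. It is plain Python that only imitates the shape of token2id and is not Gensim's implementation; the function name `build_token2id` and the sample sentence are assumptions for illustration.

```python
import re


def build_token2id(text):
    """Map each unique token to an integer id, assigned in sorted order."""
    # \w+ keeps runs of letters, digits and underscores; punctuation is dropped,
    # so "2012," and "2012." would both become the single token "2012".
    tokens = re.findall(r"\w+", text)
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


sample = "He has been the leader since 2012, and president since 2013."
token2id = build_token2id(sample)
print(len(token2id), "unique tokens")
for token, idx in token2id.items():
    print(f"{token:15} {idx:10}")
```

Note that `\w+` also splits "People's" into "People" and "s", so a real pipeline would probably want a proper tokenizer instead.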


# **Named entity recognizer (NER)**



In [None]:
import spacy
from spacy import displacy  # displacy renders spaCy annotations graphically; style='ent' below highlights the named entities
nlp = spacy.load('en_core_web_sm')
text = nlp ("Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary.")
displacy.render(text, style = 'ent', jupyter=True)

**Another way**

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')
data = sp("Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary.")
print(data.ents)  # ents holds all named entities found in the text, as a tuple of spans
for entity in data.ents:
    print("{:30s}\t{:30s}\t".format(entity.text,entity.label_))

(Xi Jinping, Chinese, the Chinese Communist Party, CCP, the Central Military Commission, 2012, the People's Republic of China, PRC, 2013, China, 2012, Chinese, Communist, Xi Zhongxun, Yanchuan County, the Cultural Revolution, Liangjiahe, CCP)
Xi Jinping                    	PERSON                        	
Chinese                       	NORP                          	
the Chinese Communist Party   	ORG                           	
CCP                           	ORG                           	
the Central Military Commission	ORG                           	
2012                          	DATE                          	
the People's Republic of China	GPE                           	
PRC                           	GPE                           	
2013                          	DATE                          	
China                         	GPE                           	
2012                          	DATE                          	
Chinese                       	NORP                          	
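Once the entities are available as (text, label) pairs, it is often handy to group them by label. The following is a minimal plain-Python sketch, assuming the pairs shown in the table above were copied by hand (in the notebook they would come from entity.text and entity.label_); the helper name `group_by_label` is made up for illustration.

```python
from collections import defaultdict

# (entity text, entity label) pairs copied from the printed table above
entities = [
    ("Xi Jinping", "PERSON"),
    ("Chinese", "NORP"),
    ("the Chinese Communist Party", "ORG"),
    ("CCP", "ORG"),
    ("the Central Military Commission", "ORG"),
    ("2012", "DATE"),
    ("the People's Republic of China", "GPE"),
    ("PRC", "GPE"),
    ("2013", "DATE"),
    ("China", "GPE"),
    ("2012", "DATE"),
    ("Chinese", "NORP"),
]


def group_by_label(pairs):
    """Group entity texts under their label, dropping repeated mentions."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(f"{label:8} {texts}")
```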


# **Lemmatization**

In [2]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
data = "Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary."
nltk_tokens = nltk.word_tokenize(data)
print("{0:20}{1:30}".format("Original","Lemmatization"))
for w in nltk_tokens:
    # pos="v" tells the lemmatizer to treat every token as a verb, so only verb forms are reduced (e.g. "served" -> "serve").
    print("{0:20}{1:30}".format(w, wordnet_lemmatizer.lemmatize(w, pos="v")))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Original            Lemmatization                 
Xi                  Xi                            
Jinping             Jinping                       
is                  be                            
a                   a                             
Chinese             Chinese                       
politician          politician                    
who                 who                           
has                 have                          
served              serve                         
as                  as                            
General             General                       
Secretary           Secretary                     
of                  of                            
the                 the                           
Chinese  
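To give a feel for why "is" becomes "be" and "served" becomes "serve" while "Chinese" is left alone, here is a toy lemmatizer in plain Python. It is only a sketch loosely modeled on the general idea of an exception list plus suffix rules checked against a vocabulary; it is not WordNet's actual algorithm, and the tables `EXCEPTIONS`, `VOCAB` and `SUFFIX_RULES` are tiny hand-written assumptions.

```python
# Irregular verb forms are looked up directly (hand-written, illustrative only)
EXCEPTIONS = {"is": "be", "was": "be", "been": "be", "has": "have"}

# Known base forms; a candidate lemma is accepted only if it appears here
VOCAB = {"serve", "join", "work", "live", "exile"}

# (suffix, replacement) rules for verbs, tried in order
SUFFIX_RULES = [("ed", "e"), ("ed", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a verb lemma for word, or word unchanged if no rule applies."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    for suffix, replacement in SUFFIX_RULES:
        if lower.endswith(suffix):
            candidate = lower[: -len(suffix)] + replacement
            if candidate in VOCAB:
                return candidate
    return word


for w in ["is", "has", "served", "joined", "Chinese", "politician"]:
    print("{0:20}{1:30}".format(w, toy_lemmatize(w)))
```

Words that match no exception and no rule come back unchanged, which mirrors how the nouns and proper names pass through untouched in the output above.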

# **Co-reference resolution**

This cell fails with a runtime error and crashes the session because of a version incompatibility, most likely between neuralcoref and the installed spaCy version.

In [None]:
!pip install botocore
!pip install neuralcoref --no-binary neuralcoref
import spacy
import neuralcoref
nlp = spacy.load('en')
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name= 'neuralcoref')
doc = nlp("Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013.")
doc._.has_coref
doc._.coref_clusters

Collecting urllib3<1.27,>=1.25.4; python_version != "3.4"
  Downloading https://files.pythonhosted.org/packages/23/fc/8a49991f7905261f9ca9df5aa9b58363c3c821ce3e7f671895442b7100f2/urllib3-1.26.3-py2.py3-none-any.whl (137kB)

  return f(*args, **kwds)


# **Parsing**


In [3]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
data = "Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolution and lived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary."
new_token = pos_tag(word_tokenize(data))  # tag each token with its Penn Treebank POS tag
new_token

# Chunk grammar for a noun phrase (NP): an optional determiner (DT), then zero or more adjectives (JJ), then exactly one singular noun (NN).
np_grammar = r"NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(np_grammar)  # RegexpParser builds a chunker from regular-expression patterns over POS tags.
result = chunk_parser.parse(new_token)
result

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


TclError: ignored

Tree('S', [Tree('NP', [('Xi', 'NN')]), ('Jinping', 'NNP'), ('is', 'VBZ'), Tree('NP', [('a', 'DT'), ('Chinese', 'JJ'), ('politician', 'NN')]), ('who', 'WP'), ('has', 'VBZ'), ('served', 'VBN'), ('as', 'IN'), ('General', 'NNP'), ('Secretary', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Chinese', 'NNP'), ('Communist', 'NNP'), ('Party', 'NNP'), ('(', '('), ('CCP', 'NNP'), (')', ')'), ('and', 'CC'), ('Chairman', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Central', 'NNP'), ('Military', 'NNP'), ('Commission', 'NNP'), ('(', '('), ('CMC', 'NNP'), (')', ')'), ('since', 'IN'), ('2012', 'CD'), (',', ','), ('and', 'CC'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('People', 'NNP'), ("'s", 'POS'), ('Republic', 'NNP'), ('of', 'IN'), ('China', 'NNP'), ('(', '('), ('PRC', 'NNP'), (')', ')'), ('since', 'IN'), ('2013', 'CD'), ('.', '.'), ('He', 'PRP'), ('has', 'VBZ'), ('been', 'VBN'), Tree('NP', [('the', 'DT'), ('paramount', 'JJ'), ('leader', 'NN')]), ('of', 'IN'), ('China', 'NNP'), (',', ','), ('the', 'DT
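The tag pattern <DT>?<JJ>*<NN> can be read as an ordinary regular expression over the sequence of tags instead of over characters. The sketch below illustrates that idea in plain Python; it is not how NLTK is implemented, the function name `chunk_noun_phrases` is made up, and the tagged tokens are copied from the start of the tree above.

```python
import re


def chunk_noun_phrases(tagged):
    """Return lists of words matching: optional DT, any number of JJ, one NN."""
    # Encode the tags as one string such as "<DT><JJ><NN>" so that a
    # regular expression can be matched against tags rather than characters.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Each tag contributes exactly one "<", so counting them converts
        # character offsets back into token indices.
        start = tag_string[: match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


tagged = [
    ("Xi", "NN"), ("Jinping", "NNP"), ("is", "VBZ"), ("a", "DT"),
    ("Chinese", "JJ"), ("politician", "NN"), ("who", "WP"),
    ("has", "VBZ"), ("served", "VBN"),
]
chunks = chunk_noun_phrases(tagged)
print(chunks)  # [['Xi'], ['a', 'Chinese', 'politician']]
```

Because the pattern requires the literal tag <NN>, proper nouns tagged NNP (such as "Jinping") are not chunked, which matches the tree above.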