<a href="https://colab.research.google.com/github/Nishthavan/Cognizant_Internship/blob/main/NLP_Pipeline_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing Important Libraries**

In [1]:
# Install spaCy
!pip install -U spacy  # Do not upgrade when working with neuralcoref (section 9 uses spaCy 2.3.7); otherwise upgrade
# Download the large spaCy English model
!python3 -m spacy download en_core_web_lg
# Install textacy
!pip install -U textacy

Collecting en-core-web-lg==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl (777.4 MB)
     |████████████████████████████████| 777.4 MB 5.4 kB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


# **Getting The Data & Loading The NLP Model**

In [2]:
import spacy
import textacy
# Loading the English NLP model
NLP = spacy.load('en_core_web_lg')
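# Reading data from the file; the context manager closes it automatically.
# Note: Data.txt starts with a UTF-8 byte-order mark (BOM), which stays attached to the
# first word ("We") in the outputs below. Opening with encoding="utf-8-sig" would strip it.
with open("Data.txt", "rt") as File:
  text = File.read()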
print(text)

﻿We created AlphaGo, a computer program that combines advanced search trees with deep neural networks. These neural networks take a description of the Go board as an input and process it through a number of different network layers containing millions of neuron-like connections. One neural network, the “policy network”, selects the next move to play. The other neural network, the “value network”, predicts the winner of the game. We introduced AlphaGo to numerous amateur games to help it develop an understanding of the reasonable human play. Then we had it play against different versions of itself thousands of times, each time learning from its mistakes. Over time, AlphaGo improved and became increasingly stronger and better at learning and decision-making. This process is known as reinforcement learning. AlphaGo went on to defeat Go world champions in different global arenas and arguably became the greatest Go player of all time.


In [3]:
# Parsing the text with spaCy
final_doc = NLP(text)

# **1. Sentence Segmentation**

In [13]:
sentences = text.split(".")
for sentence in sentences:
  print(sentence)

print("-----------Different Method-------------")
print()
# We can also segment using the parsed document, which is the better way:
# it does not depend on every sentence ending in a clean period.

for sentence in final_doc.sents:
  print(sentence.text)

﻿We created AlphaGo, a computer program that combines advanced search trees with deep neural networks
 These neural networks take a description of the Go board as an input and process it through a number of different network layers containing millions of neuron-like connections
 One neural network, the “policy network”, selects the next move to play
 The other neural network, the “value network”, predicts the winner of the game
 We introduced AlphaGo to numerous amateur games to help it develop an understanding of the reasonable human play
 Then we had it play against different versions of itself thousands of times, each time learning from its mistakes
 Over time, AlphaGo improved and became increasingly stronger and better at learning and decision-making
 This process is known as reinforcement learning
 AlphaGo went on to defeat Go world champions in different global arenas and arguably became the greatest Go player of all time

-----------Different Method-------------

﻿We created Al

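To see why splitting on "." is fragile, here is a minimal sketch that uses only the standard library. The sample text is made up for illustration (it is not taken from Data.txt): an abbreviation and a decimal number each cause a spurious break, and the final period leaves an empty string behind.

```python
# Hypothetical sample text, chosen to contain an abbreviation and a decimal.
sample = "Dr. Smith paid 3.50 dollars. He left."

naive_pieces = sample.split(".")
for piece in naive_pieces:
    print(repr(piece))

# Two real sentences, but the naive split yields five pieces
# (including one empty string after the final period).
print(len(naive_pieces))  # 5
```

The AlphaGo text happens to be clean enough for `split(".")` to look correct, which is why the two methods appear to agree here.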
# **2. Tokenization**

In [12]:
tokens = [token.text for token in final_doc]  # One string per token
print(tokens)
print(len(tokens))

['We', 'created', 'AlphaGo', ',', 'a', 'computer', 'program', 'that', 'combines', 'advanced', 'search', 'trees', 'with', 'deep', 'neural', 'networks', '.', 'These', 'neural', 'networks', 'take', 'a', 'description', 'of', 'the', 'Go', 'board', 'as', 'an', 'input', 'and', 'process', 'it', 'through', 'a', 'number', 'of', 'different', 'network', 'layers', 'containing', 'millions', 'of', 'neuron', '-', 'like', 'connections', '.', 'One', 'neural', 'network', ',', 'the', '“', 'policy', 'network', '”', ',', 'selects', 'the', 'next', 'move', 'to', 'play', '.', 'The', 'other', 'neural', 'network', ',', 'the', '“', 'value', 'network', '”', ',', 'predicts', 'the', 'winner', 'of', 'the', 'game', '.', 'We', 'introduced', 'AlphaGo', 'to', 'numerous', 'amateur', 'games', 'to', 'help', 'it', 'develop', 'an', 'understanding', 'of', 'the', 'reasonable', 'human', 'play', '.', 'Then', 'we', 'had', 'it', 'play', 'against', 'different', 'versions', 'of', 'itself', 'thousands', 'of', 'times', ',', 'each', 'ti

# **3. Part of Speech**

In [14]:
from prettytable import PrettyTable # Using PrettyTable to create Table
Table = PrettyTable(["Word","TAG","Meaning of TAG","POS","Meaning of POS"])
for token in final_doc:
  Table.add_row([token.text, token.tag_, spacy.explain(token.tag_),token.pos_ ,spacy.explain(token.pos_)]) # Adding data to table
print(Table)

+---------------+------+----------------------------------------------------+-------+--------------------------+
|      Word     | TAG  |                   Meaning of TAG                   |  POS  |      Meaning of POS      |
+---------------+------+----------------------------------------------------+-------+--------------------------+
|      ﻿We      |  RB  |                       adverb                       |  ADV  |          adverb          |
|    created    | VBN  |               verb, past participle                |  VERB |           verb           |
|    AlphaGo    | NNP  |               noun, proper singular                | PROPN |       proper noun        |
|       ,       |  ,   |              punctuation mark, comma               | PUNCT |       punctuation        |
|       a       |  DT  |                     determiner                     |  DET  |        determiner        |
|    computer   |  NN  |               noun, singular or mass               |  NOUN |           

# **4. Text Lemmatization**

In [15]:
# Finding the most basic form (lemma) of each word.
# token.lemma is the integer hash ID of the lemma; token.lemma_ is the lemma as a string.
Table2 = PrettyTable(["Word", "Lemma (int)", "Lemma (str)"])
for token in final_doc:
  Table2.add_row([token.text, token.lemma ,token.lemma_])
print(Table2)

+---------------+----------------------+---------------+
|      Word     |     Lemma (int)      |  Lemma (str)  |
+---------------+----------------------+---------------+
|      ﻿We      | 10911418286934844680 |      ﻿we      |
|    created    | 10217138629343254602 |     create    |
|    AlphaGo    | 6073627545995165565  |    AlphaGo    |
|       ,       | 2593208677638477497  |       ,       |
|       a       | 11901859001352538922 |       a       |
|    computer   | 4912942957612137283  |    computer   |
|    program    | 17812688126189747487 |    program    |
|      that     | 4380130941430378203  |      that     |
|    combines   | 4438640794806048232  |    combine    |
|    advanced   | 3943929226210916060  |    advanced   |
|     search    |  295895373269394349  |     search    |
|     trees     | 5236966400857015965  |      tree     |
|      with     | 12510949447758279278 |      with     |
|      deep     | 12691978708603459222 |      deep     |
|     neural    | 1527895112069

# **5. Stop Words** 

In [22]:
# Stop words create noise in text statistics because they appear so frequently, so we may want to filter them out.
# We can identify stop words by checking each token against a hardcoded list of known stop words.
# We can also build a personalized stop word list suited to the problem at hand.

All_Stopwords = NLP.Defaults.stop_words
print(len(All_Stopwords))
print(All_Stopwords)

# We can add our own words to this list by:
# NLP.Defaults.stop_words.add("New_Stopword")

# We can remove words from this list by:
# NLP.Defaults.stop_words.remove("Stopword")

print()
print("-------------------------------")
print()

# Cleaning or removing stop words:
Tokens = []
Words = []
Stop_Words = []
for token in final_doc:
  if not token.is_stop:  # Keep tokens that are not stop words
    Tokens.append(token)
    Words.append(token.text)
  else:
    Stop_Words.append(token.text)

# After removing stop words
print((" ".join(Words)))
print(len(Words))

print()

print("Stop Words: ")
# All the stop words that were removed
print(Stop_Words)
print(len(Stop_Words))

# 106 + 66 (Stop words) = 172 (total)

326
{'above', 'nine', 'never', 'between', 'becoming', 'call', 'everyone', 'whole', 'hereby', "'m", 'put', 'from', 'please', 'almost', 'why', 'beside', 'your', 'wherever', 'the', 'hereafter', 'where', 'somehow', 'any', 'give', 'becomes', 'been', 'too', 'around', 'is', 'have', 'does', 'must', 'thence', 'unless', 'after', 'against', 'than', 'mostly', 'on', 'under', 'six', 'herein', '‘d', 'therefore', 'across', 'seem', 'this', 'seems', 'its', 'by', 'he', 'made', 'below', 'sometime', 'how', 'hundred', 'sixty', 'back', 'herself', 'more', 'then', 'former', 'last', 'few', 'make', 'since', 'off', 'was', 'nevertheless', 'what', 'quite', 'themselves', 'four', 'into', 'least', 'his', 'keep', 'over', 'via', 'even', 'not', 'could', 'while', 'name', 'her', 'itself', 'sometimes', 'my', 'although', 'whither', 'very', 'at', 'otherwise', 'noone', '’ll', 'ourselves', 'perhaps', 'be', 'myself', 'several', 'whereas', 'forty', 'top', 'those', 'mine', 'thereupon', 'down', 'get', 'until', 'everything', 'become
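The idea of checking tokens against a personalized list can be sketched without spaCy. The names `CUSTOM_STOP_WORDS` and `split_stop_words` are hypothetical, and the tiny list below is hand-picked for illustration; it is not spaCy's list.

```python
# A tiny, hand-picked stop word list (hypothetical; spaCy's own list has 326 entries).
CUSTOM_STOP_WORDS = {"we", "a", "that", "with", "the", "of", "to", "it"}

def split_stop_words(words, stop_words):
    """Return (kept, removed) lists, comparing case-insensitively."""
    kept, removed = [], []
    for word in words:
        if word.lower() in stop_words:
            removed.append(word)
        else:
            kept.append(word)
    return kept, removed

words = ["We", "created", "AlphaGo", ",", "a", "computer", "program"]
kept, removed = split_stop_words(words, CUSTOM_STOP_WORDS)
print(kept)     # ['created', 'AlphaGo', ',', 'computer', 'program']
print(removed)  # ['We', 'a']

# Every token lands in exactly one list, so the two counts add up to the total.
print(len(kept) + len(removed) == len(words))  # True
```

The same bookkeeping explains the `106 + 66 = 172` check in the cell above: kept tokens plus removed stop words must equal the token count of the document.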

# **6. Dependency Parsing**

In [17]:
from spacy import displacy
displacy.render(final_doc, jupyter=True, options={'distance':90})

# **7. Finding Noun Phrases**

In [19]:
NounChunks = textacy.extract.noun_chunks(final_doc, min_freq=3) # min_freq: minimum number of times a chunk must occur
NounChunks = map(str, NounChunks)
NounChunks = map(str.lower, NounChunks) # Converting them to lower case str
# Now make a set of these so that they only have unique values
for NounChunk in set(NounChunks):
  if len(NounChunk.split(" ")) > 0: # Print any noun chunk that is at least 1 word long (always true, so nothing is filtered here)
    print(NounChunk)

it
alphago
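The frequency filter can be pictured with a plain `Counter`. This is only a sketch of the idea, not textacy's implementation; the phrase list is typed in by hand and the helper name `frequent_phrases` is hypothetical.

```python
from collections import Counter

# Hand-typed noun phrases (a hypothetical extraction result, for illustration only).
phrases = ["AlphaGo", "it", "the game", "alphago", "It", "AlphaGo", "it", "the winner"]

def frequent_phrases(phrases, min_freq):
    """Lower-case the phrases and keep those seen at least min_freq times."""
    counts = Counter(phrase.lower() for phrase in phrases)
    return sorted(phrase for phrase, n in counts.items() if n >= min_freq)

print(frequent_phrases(phrases, min_freq=3))  # ['alphago', 'it']
```

With a threshold of 3, phrases that occur only once or twice drop out, which is why a short text leaves so few survivors.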


# **8. Named Entity Recognition**

In [20]:
Table3 = PrettyTable(["Entity", "Label", "Meaning of Label"])
for entity in final_doc.ents:
  Table3.add_row([entity.text, entity.label_ ,spacy.explain(entity.label_)])
print(Table3)

+-----------+----------+----------------------------------------------+
|   Entity  |  Label   |               Meaning of Label               |
+-----------+----------+----------------------------------------------+
|  AlphaGo  |   ORG    |   Companies, agencies, institutions, etc.    |
|  millions | CARDINAL | Numerals that do not fall under another type |
|    One    | CARDINAL | Numerals that do not fall under another type |
|  AlphaGo  |  PERSON  |         People, including fictional          |
| thousands | CARDINAL | Numerals that do not fall under another type |
|  AlphaGo  |  PERSON  |         People, including fictional          |
|  AlphaGo  |  PERSON  |         People, including fictional          |
+-----------+----------+----------------------------------------------+
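The table shows the same string, AlphaGo, receiving two different labels. A small standard-library sketch can surface such inconsistencies; the `(text, label)` pairs are copied from the table above, and the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# (entity text, label) pairs copied from the table above.
entities = [
    ("AlphaGo", "ORG"),
    ("millions", "CARDINAL"),
    ("One", "CARDINAL"),
    ("AlphaGo", "PERSON"),
    ("thousands", "CARDINAL"),
    ("AlphaGo", "PERSON"),
    ("AlphaGo", "PERSON"),
]

# Tally how often each label was assigned to each entity string.
labels_by_text = defaultdict(Counter)
for entity_text, label in entities:
    labels_by_text[entity_text][label] += 1

# Report entity strings that received more than one distinct label.
for entity_text, counts in labels_by_text.items():
    if len(counts) > 1:
        print(entity_text, dict(counts))  # AlphaGo {'ORG': 1, 'PERSON': 3}
```

Since AlphaGo is a program, neither label fits well, so the model's entity labels are worth spot-checking before they are used downstream.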


# **9. Coreference Resolution** *Used spaCy 2.3.7 in this section*

In [None]:
''' We need neuralcoref for this step, but it is not compatible with spaCy 3.2.4, so this (9th) step is executed with
spaCy 2.3.7 instead. All of the previous 8 steps used spaCy 3.2.4. After changing package versions,
always remember to restart the runtime. '''

!git clone https://github.com/huggingface/neuralcoref.git
%cd neuralcoref
!pip install -r requirements.txt
!pip install -e .

In [1]:
!python -m spacy download en 

Collecting en_core_web_sm==2.3.1
  Using cached en_core_web_sm-2.3.1-py3-none-any.whl
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [3]:
!pip show neuralcoref

Name: neuralcoref
Version: 4.0
Summary: Coreference Resolution in spaCy with Neural Networks
Home-page: https://github.com/huggingface/neuralcoref
Author: Thomas Wolf
Author-email: thomwolf@gmail.com
License: MIT
Location: /content/NeuralCoref/neuralcoref
Requires: numpy, boto3, requests, spacy
Required-by: 


In [4]:
import neuralcoref
import spacy
NLP1 = spacy.load('en')
neuralcoref.add_to_pipe(NLP1)

# Reading the same data file again; the context manager closes it automatically
with open("Data.txt", "rt") as File:
    text = File.read()

final_doc_1 = NLP1(text)
print(len(final_doc_1))

172


In [5]:
# Printing all the Mentions in the text:
for cluster in final_doc_1._.coref_clusters:
    print(cluster.mentions)

[deep neural networks, These neural networks]
[an input, it]
[The other neural network, the “value network”, We, we]
[AlphaGo, it, it, itself, its, AlphaGo, AlphaGo]


In [6]:
# Printing pronouns & their references
for token in final_doc_1:
    if token.pos_ == 'PRON' and token._.in_coref:
      for cluster in token._.coref_clusters:
        print(token.text,"     |     ",cluster.main.text) 

it      |      an input
We      |      The other neural network, the “value network”
it      |      AlphaGo
we      |      The other neural network, the “value network”
it      |      AlphaGo
itself      |      AlphaGo
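Once each pronoun is paired with the main mention of its cluster, the text can be rewritten with the references filled in. The following is a minimal sketch of that substitution using only the standard library; the token list, the index map, and the helper name `resolve` are hypothetical and typed in by hand rather than produced by neuralcoref.

```python
# Hypothetical token list and pronoun -> antecedent map, typed in by hand.
tokens = ["We", "introduced", "AlphaGo", "to", "amateur", "games",
          "to", "help", "it", "develop"]
antecedents = {8: "AlphaGo"}  # token index -> main mention of its cluster

def resolve(tokens, antecedents):
    """Replace each mapped token with the main mention of its cluster."""
    return " ".join(antecedents.get(i, tok) for i, tok in enumerate(tokens))

print(resolve(tokens, antecedents))
# We introduced AlphaGo to amateur games to help AlphaGo develop
```

Blind substitution also propagates mistakes: in the output above, "We" is linked to "The other neural network", which is wrong, so resolved text should be reviewed. neuralcoref itself exposes a resolved version of the document as `doc._.coref_resolved`.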
