<a href="https://colab.research.google.com/github/AC8151/COG_INTERNSHIP_GN22CDBDS001/blob/main/NLP%20Pipeline/NLP_Example_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#NLP Example Part - 2
##**Coreference Resolution for NLP Pipeline**
> **BY - Aditi Chatterjee**

*Previously, in PART - 1, an NLP pipeline was created. PART - 2 deals with **Coreference Resolution**, an optional step in an NLP pipeline that links expressions such as pronouns to the entity they refer to.*

<img src='https://cdn-images-1.medium.com/max/1024/1*d9lDwTfR8SWiDIcXNza2hQ.png'>

##**1. The text (same file used in PART - 1)**

In [None]:
%%writefile greece.txt
Ancient Greece was a civilization that dominated much of the Mediterranean thousands of years ago. At its peak under General Alexander the Great , Ancient Greece ruled much of Europe and western Asia. The Greeks came before the Romans and much of the Roman culture was influenced by them. Ancient Greece formed the foundation of much of Western culture today. Everything from government, to arts, literature, and even sports was influenced by the Greek civilization.

Writing greece.txt


**NOTE:** Coreference resolution is performed here with ***neuralcoref***, a spaCy extension that requires an older version of spaCy (***spacy 2.x***).

> Thus, spaCy has been ***downgraded*** here to be compatible with neuralcoref
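
The version constraint above can be checked before importing neuralcoref. The following is a minimal sketch and not part of the original notebook: `is_spacy_2x` is a hypothetical helper that only inspects the major version in a version string such as `spacy.__version__`.

```python
def is_spacy_2x(version: str) -> bool:
    """Return True if the version string has major version 2 (e.g. '2.2.4')."""
    major = version.split(".")[0]
    return major == "2"


# Hypothetical usage inside the notebook (requires spaCy to be installed):
#   import spacy
#   assert is_spacy_2x(spacy.__version__), "neuralcoref needs spaCy 2.x"
print(is_spacy_2x("2.2.4"))  # True
print(is_spacy_2x("3.4.1"))  # False
```

A check like this fails early with a readable message instead of a harder-to-read import error later in the notebook.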

##**2. Installing Neuralcoref & Downgrading Spacy**

*Neuralcoref 4.0.0 works best with SpaCy 2.2.4*

In [None]:
#import shutil
#shutil.rmtree('/content/neuralcoref', ignore_errors=True)

###**2.1. Neuralcoref**

In [None]:
!git clone https://github.com/huggingface/neuralcoref.git


Cloning into 'neuralcoref'...
remote: Enumerating objects: 772, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 772 (delta 10), reused 5 (delta 1), pack-reused 748
Receiving objects: 100% (772/772), 67.85 MiB | 29.13 MiB/s, done.
Resolving deltas: 100% (407/407), done.


In [None]:
cd neuralcoref


/content/neuralcoref


In [None]:
pip install -r requirements.txt




In [None]:
pip install -e .

Obtaining file:///content/neuralcoref
Collecting boto3
  Downloading boto3-1.21.39-py3-none-any.whl (132 kB)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.2-py3-none-any.whl (79 kB)
Collecting jmespath<2.0.0,>=0.7.1
  Downloading jmespath-1.0.0-py3-none-any.whl (23 kB)
Collecting botocore<1.25.0,>=1.24.39
  Downloading botocore-1.24.39-py3-none-any.whl (8.7 MB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3, jmespath, botocore, s3transfer, boto3, neuralcoref
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
  R

###**2.2. Spacy & En_Core_Web_Lg**

*spaCy was automatically downgraded for compatibility with Neuralcoref 4.0.0*

In [None]:
pip show spacy
# If the next cell shows a warning (in yellow), run this cell once more
# and then re-run the next cell

Name: spacy
Version: 2.2.4
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /usr/local/lib/python3.7/dist-packages
Requires: preshed, blis, plac, requests, murmurhash, catalogue, thinc, numpy, wasabi, tqdm, srsly, cymem, setuptools
Required-by: fastai, en-core-web-sm, neuralcoref


In [None]:
import spacy.cli

spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


##**3. Importing Libraries**

In [None]:
#nlp.remove_pipe('neuralcoref')

In [None]:
import spacy
import neuralcoref

# load model
nlp = spacy.load('en_core_web_lg')

# add neuralcoref to the spaCy pipeline
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

100%|██████████| 40155833/40155833 [00:00<00:00, 43437766.86B/s]


##**4. Reading text file**

In [None]:
# read the file written in section 1; the with-block closes it automatically
with open('/content/greece.txt', 'r') as f:
    text = f.read()
doc = nlp(text)
print(doc)

Ancient Greece was a civilization that dominated much of the Mediterranean thousands of years ago. At its peak under General Alexander the Great , Ancient Greece ruled much of Europe and western Asia. The Greeks came before the Romans and much of the Roman culture was influenced by them. Ancient Greece formed the foundation of much of Western culture today. Everything from government, to arts, literature, and even sports was influenced by the Greek civilization.


##**5. STEP 9: Performing Coreference Resolution**

*Check whether the text contains any coreferences*

In [None]:
print('Coreferences Present:',doc._.has_coref)

Coreferences Present: True


*It does! Next, show the clusters of mentions that refer to the same entity*

In [None]:
print('COREFERENCES OCCUR HERE:')
# each cluster groups all mentions that refer to the same entity
clusters = doc._.coref_clusters
for cluster in clusters:
  print(cluster)

COREFERENCES OCCUR HERE:
Ancient Greece: [Ancient Greece, its, Ancient Greece, Ancient Greece]
The Greeks: [The Greeks, them]
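
Each printed cluster pairs a main mention with the list of all mentions that refer to it. The sketch below shows one way to turn clusters into a plain dictionary, relying on the `main` and `mentions` attributes of a neuralcoref cluster. `summarize_clusters` is a hypothetical helper, and the `SimpleNamespace` stand-ins are only there so the example runs without spaCy or neuralcoref installed.

```python
from types import SimpleNamespace


def summarize_clusters(clusters):
    """Map each cluster's main mention text to the texts of all its mentions."""
    return {
        cluster.main.text: [mention.text for mention in cluster.mentions]
        for cluster in clusters
    }


def _span(text):
    """Stand-in for a spaCy Span: an object with only a .text attribute."""
    return SimpleNamespace(text=text)


# Stand-in cluster mirroring the second cluster printed above.
demo_clusters = [
    SimpleNamespace(
        main=_span("The Greeks"),
        mentions=[_span("The Greeks"), _span("them")],
    ),
]
print(summarize_clusters(demo_clusters))
```

In the notebook itself, the same helper could be called as `summarize_clusters(doc._.coref_clusters)`.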


*ACTUAL Coreference Resolution*

In [None]:
resolved_doc = doc._.coref_resolved
print('COREFERENCES RESOLVED:')
print(resolved_doc)

COREFERENCES RESOLVED:
Ancient Greece was a civilization that dominated much of the Mediterranean thousands of years ago. At Ancient Greece peak under General Alexander the Great , Ancient Greece ruled much of Europe and western Asia. The Greeks came before the Romans and much of the Roman culture was influenced by The Greeks. Ancient Greece formed the foundation of much of Western culture today. Everything from government, to arts, literature, and even sports was influenced by the Greek civilization.


*Et voilà! The coreferences have been resolved successfully.*
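
In the resolved text, the possessive pronoun "its" is replaced verbatim, which yields "At Ancient Greece peak". The sketch below illustrates one possible post-processing idea in plain Python: replace mentions by character span and append `'s` when the replaced mention is a possessive pronoun. This is not how neuralcoref works internally. The function name, the small pronoun set, and the `(main_text, spans)` input format are all assumptions made for illustration.

```python
# Assumed, deliberately small set of possessive pronouns.
POSSESSIVE_PRONOUNS = {"its", "his", "their"}


def resolve_with_possessives(text, clusters):
    """Replace each mention with its cluster's main mention.

    `clusters` is a list of (main_text, [(start, end), ...]) pairs, where
    each (start, end) is the character span of one mention in `text`.
    """
    replacements = []
    for main_text, spans in clusters:
        for start, end in spans:
            mention = text[start:end]
            if mention.lower() in POSSESSIVE_PRONOUNS:
                new = main_text + "'s"
            else:
                new = main_text
            replacements.append((start, end, new))
    # Apply from right to left so earlier offsets stay valid.
    for start, end, new in sorted(replacements, reverse=True):
        text = text[:start] + new + text[end:]
    return text


sentence = "At its peak, Ancient Greece ruled much of Europe."
start = sentence.index("its")
print(resolve_with_possessives(sentence, [("Ancient Greece", [(start, start + 3)])]))
```

With neuralcoref, the character spans could be taken from each mention's `start_char` and `end_char`, skipping the main mention itself.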

# **END OF PART - 2**