# Coreference Resolution for Textbook Contents
> A notebook for downloading textbooks from official sources, extracting them to machine-readable text, and resolving coreferences with NeuralCoref

- toc: true 
- badges: false
- comments: true
- categories: [jupyter]
- author: Nirant Kasliwal and Meghana Bhange
<!-- - image: images/chart-preview.png -->

In [None]:
# hide
!pip install requests
!pip install pydantic
!pip install tqdm
!pip install pdfminer.six
!pip uninstall -y spacy
!pip uninstall -y neuralcoref
!pip install spacy==2.1.0 
!pip install neuralcoref --no-binary neuralcoref
!python -m spacy download en

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# hide_input
import json
from io import StringIO
from pathlib import Path
from typing import List, Union

import requests
from pydantic import BaseModel
from tqdm.notebook import tqdm

import neuralcoref
import spacy
from textbook import Book, Chapter
from textbookutils import pdf_to_text

In [None]:
# Convenience helpers patched onto Path: list a directory, or only its PDF files
Path.pdfls = lambda self: [p for p in self.iterdir() if p.suffix == ".pdf"]
Path.ls = lambda self: list(self.iterdir())

## Get List of Books and Download Links

In [None]:
# collapse-hide
sheet_name = "History"
books_list = (
    f"https://api.steinhq.com/v1/storages/5fd49704f62b6004b3eb63a3/{sheet_name}"
)
r = requests.get(books_list)
r.raise_for_status()  # fail early if the sheet could not be fetched

In [None]:
# collapse-hide
ncert_history_books = [Book(**x) for x in json.loads(r.text)]

## Download and Extract all Books

In [None]:
# collapse-show
for book in tqdm(ncert_history_books):
    book.download("../data/raw")
    book.unzip("../data/extract")

In [None]:
single_book = ncert_history_books[0]

In [None]:
pdf_files = []
for folder in single_book.extract_to_path.ls():
    pdf_files.extend(folder.pdfls())
pdf_files.sort()
pdf_files = [
    file for file in pdf_files if file.stem[-2:].isdigit()
]  # keep the chapter files, nothing else
pdf_files
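
The filter above keeps only files whose stem ends in two digits, on the assumption that those are the chapter files. A minimal sketch of that rule, run on made-up file names (the names below are hypothetical and are not taken from the downloaded archives):

```python
from pathlib import Path


def is_chapter_file(path: Path) -> bool:
    """Return True when the last two characters of the file stem are digits."""
    return path.stem[-2:].isdigit()


# Hypothetical file names, used only to illustrate the rule
sample_files = [
    Path("book12.pdf"),
    Path("bookps.pdf"),
    Path("cover.pdf"),
    Path("book01.pdf"),
]
chapter_files = sorted(p for p in sample_files if is_chapter_file(p))
print([p.name for p in chapter_files])
```

Note that a stem consisting of a single digit would also pass this check, because slicing `[-2:]` on a one-character string returns that one character.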

# Using NeuralCoref by Hugging Face and spaCy

- At the time of writing, NeuralCoref only works with spaCy 2.1.0, which is why the install cell at the top of this notebook pins that version.
- More information on NeuralCoref is available [here](https://medium.com/huggingface/state-of-the-art-neural-coreference-resolution-for-chatbots-3302365dcf30).

## Using NeuralCoref

NeuralCoref will resolve the coreferences and annotate them as [extension attributes](https://spacy.io/usage/processing-pipelines#custom-components-extensions) in the spaCy `Doc`,  `Span` and `Token` objects under the `._.` dictionary.

Here is the list of the annotations:

|  Attribute                |  Type              |  Description
|---------------------------|--------------------|-----------------------------------------------------
|`doc._.has_coref`          |boolean             |Whether any coreference has been resolved in the Doc
|`doc._.coref_clusters`     |list of `Cluster`   |All the clusters of coreferring mentions in the doc
|`doc._.coref_resolved`     |unicode             |Unicode representation of the doc where each coreferring mention is replaced by the main mention in the associated cluster.
|`doc._.coref_scores`       |Dict of Dict        |Scores of the coreference resolution between mentions.
|`span._.is_coref`          |boolean             |Whether the span has at least one coreferring mention
|`span._.coref_cluster`     |`Cluster`           |Cluster of mentions that corefer with the span
|`span._.coref_scores`      |Dict                |Scores of the coreference resolution of a span with other mentions (if applicable).
|`token._.in_coref`         |boolean             |Whether the token is inside at least one coreferring mention
|`token._.coref_clusters`   |list of `Cluster`   |All the clusters of coreferring mentions that contain the token

A `Cluster` is a cluster of coreferring mentions which has 3 attributes and a few methods to simplify the navigation inside a cluster:

|  Attribute or method   |  Type / Return type |  Description
|------------------------|---------------------|-----------------------------------------------------
|`i`                     |int                  |Index of the cluster in the Doc
|`main`                  |`Span`               |Span of the most representative mention in the cluster
|`mentions`              |list of `Span`       |List of all the mentions in the cluster
|`__getitem__`           |return `Span`        |Access a mention in the cluster
|`__iter__`              |yields `Span`        |Iterate over mentions in the cluster
|`__len__`               |return int           |Number of mentions in the cluster
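
To make the two tables concrete, here is a small sketch of walking the clusters through `main` and `mentions`. The helper name `summarize_clusters` is made up for this example, and the `SimpleNamespace` objects are stand-ins that mimic only the attributes listed above, so the snippet runs without NeuralCoref installed. With the real pipeline you would pass the `Doc` returned by `nlp(text)` instead.

```python
from types import SimpleNamespace


def summarize_clusters(doc):
    """Map each cluster's main mention text to the text of all its mentions."""
    return {
        cluster.main.text: [mention.text for mention in cluster.mentions]
        for cluster in doc._.coref_clusters
    }


def _span(text):
    # Stand-in for a spaCy Span: only the `.text` attribute is mimicked
    return SimpleNamespace(text=text)


# Hand-built stand-ins, not real NeuralCoref output
fake_cluster = SimpleNamespace(
    main=_span("Ashoka"),
    mentions=[_span("Ashoka"), _span("he"), _span("his")],
)
fake_doc = SimpleNamespace(_=SimpleNamespace(coref_clusters=[fake_cluster]))

print(summarize_clusters(fake_doc))
```

Keying by the main mention's text is a simplification: two clusters whose main mentions share the same text would overwrite each other, so keying by `cluster.i` may be a safer choice on real documents.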

In [None]:
nlp = spacy.load("en")
neuralcoref.add_to_pipe(nlp)

# Get the coreferences for each PDF file

In [None]:
coreferce_mapping_for_each_pdf = {}
for file in tqdm(pdf_files):
    output_io_wrapper = StringIO()
    plain_text = pdf_to_text(file, output_io_wrapper)
    doc = nlp(plain_text)
    coreferce_mapping_for_each_pdf[file] = {
        "plain_text": plain_text,
        "doc": doc,
        "resolved_text": doc._.coref_resolved,
        "coreference_clusters": doc._.coref_clusters,
    }
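
The mapping built above holds spaCy `Doc` and `Cluster` objects, which cannot be written to JSON directly. One possible way to persist the results is to keep only plain strings, as sketched below. The function name `to_json_records` and the demo data are assumptions made for illustration; the stand-in clusters mimic only `main`, `mentions` and `.text`, and the resolved text is written by hand rather than produced by NeuralCoref.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def to_json_records(mapping):
    """Keep only plain-Python fields so the result can pass through json.dumps."""
    records = {}
    for path, entry in mapping.items():
        records[str(path)] = {
            "resolved_text": entry["resolved_text"],
            "clusters": [
                {"main": c.main.text, "mentions": [m.text for m in c.mentions]}
                for c in entry["coreference_clusters"]
            ],
        }
    return records


def _span(text):
    # Stand-in for a spaCy Span: only the `.text` attribute is mimicked
    return SimpleNamespace(text=text)


# Hand-written demo entry with the same keys as the mapping built above
demo_cluster = SimpleNamespace(
    main=_span("the king"), mentions=[_span("the king"), _span("he")]
)
demo_mapping = {
    Path("chapter01.pdf"): {
        "plain_text": "The king ruled. He built roads.",
        "doc": None,
        "resolved_text": "The king ruled. the king built roads.",
        "coreference_clusters": [demo_cluster],
    }
}

payload = json.dumps(to_json_records(demo_mapping), indent=2)
print(payload)
```

Dropping the `doc` object keeps the output small; if the full annotations are needed later, re-running the pipeline on `plain_text` is one option.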