# Solving the Definition Extraction Problem
## Approach 4: Using spaCy's Text Classifier
In this approach, we decided to give **spaCy's model pipeline** a shot. Here is a summary of how spaCy's models work, adapted from the [spaCy docs](https://spacy.io/usage/training#basics):

- They are statistical and every “decision” they make is a prediction. This prediction is based on the examples the model has seen during training. To train a model, you first need training data. 


- The model is then shown the unlabelled text and makes a prediction. We give the model feedback on that prediction in the form of an error gradient of the loss function, which measures the difference between the model's prediction and the expected output from the training example. The greater the difference, the larger the gradient and the updates to our model.


- We want the model to come up with a theory that can be generalized across other examples. If you only test the model with the data it was trained on, you’ll have no idea how well it’s generalizing. So, that is why we also need evaluation data to test our model.

![](https://spacy.io/training-73950e71e6b59678754a87d6cf1481f9.svg)
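
To make the feedback loop above concrete, here is a minimal sketch of a single gradient update. It is not spaCy's internal training code: the one-weight logistic model, the `gradient_step` helper and the learning rate are all assumptions chosen only to show that a prediction further from the label produces a larger gradient and therefore a larger update.

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(weight, x, y_true, learning_rate=0.1):
    """One gradient-descent update for a toy one-weight logistic model.

    For the cross-entropy loss, the gradient with respect to the weight
    is (prediction - label) * x.
    """
    y_pred = sigmoid(weight * x)
    error = y_pred - y_true          # difference between prediction and label
    gradient = error * x
    return weight - learning_rate * gradient, gradient


# Same input and label, two different starting weights (hypothetical values).
w_good, grad_good = gradient_step(weight=2.0, x=1.0, y_true=1.0)   # prediction near the label
w_bad, grad_bad = gradient_step(weight=-2.0, x=1.0, y_true=1.0)    # prediction far from the label

print(f"nearly right: gradient={grad_good:.3f}, weight  2.0 -> {w_good:.3f}")
print(f"badly wrong:  gradient={grad_bad:.3f}, weight -2.0 -> {w_bad:.3f}")
```

In both cases the weight moves in the direction that raises the predicted probability of the correct label, but the badly wrong model moves much further.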


### Optional: Run the next cell only if you are using Google Colab to run this notebook.

In [1]:
from google.colab import drive
from importlib.machinery import SourceFileLoader
!python -m spacy download en_core_web_lg
STORAGE_PATH = "gdrive/My Drive/deft_corpus/data"
OUTPUT_PATH = "gdrive/My Drive/deft_eval_models/spacy-model"
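SOURCE_PATH = "gdrive/My Drive/source/data_loader.py"
# Assumption: classifiers.py sits next to data_loader.py, mirroring the local `source` package.
CLASSIFIERS_PATH = "gdrive/My Drive/source/classifiers.py"

drive.mount('/content/gdrive', force_remount=True)
# Load each module from its own file under a distinct module name.
data_loader = SourceFileLoader('data_loader', SOURCE_PATH).load_module()
classifiers = SourceFileLoader('classifiers', CLASSIFIERS_PATH).load_module()
DeftCorpusLoader = data_loader.DeftCorpusLoader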
DeftSpacyClassifier = classifiers.DeftSpacyClassifier

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
Mounted at /content/gdrive


### If you are running locally, run this cell instead of the one above.

In [0]:
#imports cell
import sys 
sys.path.append("../")
from source.data_loader import DeftCorpusLoader
from source.classifiers import DeftSpacyClassifier
# Download a pretrained language model to start from instead of building a blank one. Comment this line out if you already downloaded it through the CLI.
!python -m spacy download en_core_web_lg
STORAGE_PATH = "../deft_corpus/data"
OUTPUT_PATH = "./deft_eval_models/spacy-model"

### Adding a text classifier to a spaCy model
We followed the step-by-step guide from [spaCy's example](https://spacy.io/usage/training#textcat) to write our own implementation of spaCy's text classifier for the DEFT corpus.

**What do we call it? Duh... the `DeftSpacyClassifier`!**

- Text classification models can be used to solve a wide variety of problems. Differences in text length, number of labels, difficulty, and runtime performance constraints mean that no single algorithm performs well on all types of problems. To handle a wider variety of problems, the `TextCategorizer` object allows configuration of its model architecture, using the `architecture` keyword argument. 


- The chosen architecture is `simple_cnn`, a neural network model where token vectors are calculated using a CNN.


- We built our model on top of spaCy's existing `en_core_web_lg` language model instead of starting from a blank language model.

In [2]:
positive = "DEFINITION"
negative = "NOT DEFINITION"
deft_classifier = DeftSpacyClassifier(positive_label=positive, negative_label=negative)

Loaded default model '<module 'en_core_web_lg' from '/usr/local/lib/python3.6/dist-packages/en_core_web_lg/__init__.py'>'


### Loading the dataset and adjusting its labels for the spaCy format
- We load the dataset as usual; the main difference is that we now have to perform an extra step: changing the label format to match spaCy's labeling format. Instead of a binary label vector, each instance gets a dict indicating whether it is a definition or not.


- Example: `{"DEFINITION": True, "NOT DEFINITION": False}`
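
As a toy illustration of this conversion, the sketch below turns a few made-up binary labels into the dict format. The `to_spacy_cats` helper and the `toy_labels` values are hypothetical and exist only for this example; the real labels come from the `HasDef` column loaded in the next cell.

```python
def to_spacy_cats(binary_labels, positive="DEFINITION", negative="NOT DEFINITION"):
    """Map each 0/1 label to a dict with one boolean entry per category."""
    return [{positive: bool(y), negative: not bool(y)} for y in binary_labels]


toy_labels = [1, 0, 1]  # made-up labels: 1 = sentence contains a definition
toy_cats = to_spacy_cats(toy_labels)

for label, cats in zip(toy_labels, toy_cats):
    print(label, "->", cats)
```

Exactly one of the two categories is `True` for every instance, which is what a classifier with mutually exclusive classes expects.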

In [0]:
deft_loader = DeftCorpusLoader(STORAGE_PATH)
trainframe, devframe = deft_loader.load_classification_data(preprocess=True, clean=True)
train_cats = [{positive: bool(y), negative: not bool(y)} for y in trainframe["HasDef"]]
dev_cats = [{positive: bool(y), negative: not bool(y)} for y in devframe["HasDef"]]

### Start the training loop

- Used **compounding batch sizes with a starting size of 32, a maximum size of 100 and a compounding factor (step) of 1.001.** These values were tuned manually for the best results.


- For each iteration, we evaluate the model by computing **loss, precision, recall, f1-score** on evaluation data (dev split).


- Used a **dropout rate of 0.2 and the Adam optimizer**.
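
To show what a compounding batch-size schedule looks like, here is a pure-Python sketch. The `compounding` and `minibatch` functions below are re-implemented for illustration and are assumed to behave like spaCy's utilities of the same names; the internals of `DeftSpacyClassifier.fit` are not shown in this notebook, and the 1000-item toy dataset is made up.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ..., capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch(items, sizes):
    """Split items into consecutive batches whose sizes come from the `sizes` iterator."""
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch


# Schedule from the settings above: start at 32, cap at 100, multiply by 1.001 per batch.
first_sizes = list(islice(compounding(32.0, 100.0, 1.001), 3))
print("first scheduled sizes:", first_sizes)

# Toy dataset of 1000 items, batched with the same schedule.
batches = list(minibatch(range(1000), compounding(32.0, 100.0, 1.001)))
print("number of batches:", len(batches))
print("first batch size:", len(batches[0]), "| last batch size:", len(batches[-1]))
```

With a factor this close to 1, the batch size grows slowly: it takes on the order of a thousand batches to climb from 32 to the cap of 100.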

In [15]:
deft_classifier.fit(trainframe["Sentence"], devframe["Sentence"],
                   train_cats, dev_cats,output_dir=OUTPUT_PATH)

Training the model...
LOSS 	  P  	  R  	  F  
0.133	0.756	0.495	0.598
0.029	0.741	0.553	0.633
0.011	0.736	0.618	0.672
0.006	0.736	0.680	0.707
0.003	0.737	0.702	0.719
Saved model to gdrive/My Drive/deft_eval_models/spacy-model
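
The F column in the log above is the harmonic mean of the P and R columns. The sketch below recomputes it from the logged (already rounded) precision and recall values as a sanity check; the `f1_score` helper is written for this example and is not part of `DeftSpacyClassifier`.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, F) triples copied from the training log above.
logged = [
    (0.756, 0.495, 0.598),
    (0.741, 0.553, 0.633),
    (0.736, 0.618, 0.672),
    (0.736, 0.680, 0.707),
    (0.737, 0.702, 0.719),
]

for p, r, f in logged:
    print(f"P={p:.3f} R={r:.3f} logged F={f:.3f} recomputed F={f1_score(p, r):.3f}")
```

Precision stays roughly flat across iterations, so the improvement in F comes almost entirely from the rise in recall.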


### Reporting full details of the evaluation scores on dev data

- This approach scored the **highest values** reported in our work. 
- **P/R/F1 for Positive class: 0.74/0.70/0.72**
- **P/R/F1 for Negative class: 0.84/0.86/0.85**
- **The official score is the F1 of the positive class: 0.72**

In [21]:
deft_classifier.score(devframe["Sentence"], devframe["HasDef"])

              precision    recall  f1-score   support

           0       0.84      0.86      0.85       510
           1       0.74      0.70      0.72       275

    accuracy                           0.81       785
   macro avg       0.79      0.78      0.79       785
weighted avg       0.81      0.81      0.81       785

