# CONTEXT

So far we took some data, wrote some **rules**, and used those rules to get labels.

In this approach, we will use a **Machine Learning model** instead of rules to get labels from that data.

In [1]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.matcher import Matcher

In [4]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.0.0/en_core_web_md-3.0.0-py3-none-any.whl (47.1 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.0.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')




In [5]:
nlp = spacy.load("en_core_web_md")

In [13]:
doc = nlp("My name is Brock and I was born in March. \
           I work at EY from India. I just bought a bench press \
           cost $1000 on ebay and I will get is services here for 20 euro a year.")

In [14]:
displacy.render(doc, style="ent")

In [15]:
[(e,type(e)) for e in doc.ents]

[(Brock, spacy.tokens.span.Span),
 (March, spacy.tokens.span.Span),
 (EY, spacy.tokens.span.Span),
 (India, spacy.tokens.span.Span),
 (1000, spacy.tokens.span.Span),
 (20 euro, spacy.tokens.span.Span)]
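Each entity is a `Span`, so besides its text it also carries a label and character offsets. A minimal sketch of reading those attributes follows. To keep it runnable without downloading a model, it uses a blank pipeline and assigns two entities by hand; the labels `ORG` and `GPE` are assumptions for illustration, not output copied from `en_core_web_md`.

```python
import spacy
from spacy.tokens import Span

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("I work at EY from India.")

# Hand-assigned entities (token indices 3 and 5); a trained `ner` would set these itself.
doc.ents = [Span(doc, 3, 4, label="ORG"), Span(doc, 5, 6, label="GPE")]

# Text, label and character offsets of each entity span.
rows = [(e.text, e.label_, e.start_char, e.end_char) for e in doc.ents]
print(rows)
```

The character offsets (`start_char`, `end_char`) are the same kind of numbers the training data format further down expects.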

#### `nlp` runs several steps in its pipeline, which we will look at below

Link/Documentation : https://spacy.io/usage/spacy-101#pipelines

In [16]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1c6e1067a08>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1c6e104da68>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1c6e0c91668>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1c6e0db9978>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1c6e0df0988>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1c6e0ded8c8>)]

The entities above are detected by the **ner** component of the pipeline.

![pipeline](images/nlppipeline.png)

We can't remove the tokenizer step because it's required: it splits the text into tokens (roughly, words and punctuation). The other parts can be swapped or modified, though.

Functions

- **Tokenizer** : Splits text into tokens (words and punctuation)
- **Tagger** : Assigns part-of-speech tags (noun/verb/adjective) for us
- **Parser** : Grammatical relationships : dependency labeling
- **ner** : Detects named entities. What we got above (Brock, March, EY, India etc)
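To see that the tokenizer is always present while the other components are optional, here is a small sketch assuming spaCy 3.x. It uses a blank pipeline and the built-in `sentencizer` as a stand-in component, chosen only because it needs no downloaded model.

```python
import spacy

# A blank pipeline has no components, yet it can still tokenize.
nlp = spacy.blank("en")
print(nlp.pipe_names)
tokens = [t.text for t in nlp("We can swap the other parts.")]
print(tokens)

# Components are added by their registered string name.
nlp.add_pipe("sentencizer")

# Components can be switched off temporarily and come back afterwards.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
names_after = list(nlp.pipe_names)
print(names_inside, names_after)
```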

## Parsing a Doc 

First I'll need to prepare the data such that it fits the API. Docs found [here](https://spacy.io/usage/training#training-simple-style).

The docs say the data format needs to look something like this; 

```
TRAIN_DATA = [
   ("Uber blew through $1 million", {"entities": [(0, 4, "ORG")]}),
   ("Google rebrands its apps", {"entities": [(0, 6, "ORG")]})
]
```

So we need something like this (the label name is our own choice; the code below ends up using `PROLANG`):

```
TRAIN_DATA = [
   ("Python is cool", {"entities": [(0, 6, "PROGLANG")]}),
   ("Me like golang", {"entities": [(8, 14, "PROGLANG")]})
]
```
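The numbers in each tuple are character offsets into the text, with the end offset exclusive. A quick way to sanity-check hand-written offsets is to slice the text with them, as in this plain-Python sketch:

```python
TRAIN_DATA = [
    ("Python is cool", {"entities": [(0, 6, "PROGLANG")]}),
    ("Me like golang", {"entities": [(8, 14, "PROGLANG")]}),
]

# Slice each text with its (start, end) offsets to see what is actually labeled.
spans = []
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        spans.append((text[start:end], label))

print(spans)
```

If a slice comes back with a stray space or a chopped-off letter, the offsets are off by one.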

## Making Matches!

I've taken the patterns code we wrote previously and put it in a separate Python file (`common.py`). This keeps the notebook clean, and it is still easy for me to quickly get all these patterns.
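The contents of `common.py` are not shown here, so the following is only a hypothetical stand-in for `create_patterns`: it assumes the function yields one token pattern per language, each matching a single lowercase token. The real file may well handle more languages and trickier spellings.

```python
import spacy
from spacy.matcher import Matcher

def create_patterns():
    """Hypothetical stand-in for common.create_patterns (not the real file)."""
    for name in ("python", "golang", "javascript"):
        # One pattern = a list of token descriptions; here a single token, case-insensitive.
        yield [{"LOWER": name}]

# A blank pipeline is enough because LOWER only needs the tokenizer.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("PROG_LANG", [*create_patterns()])

doc = nlp("I do code with datastuff using python and golang.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```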

In [19]:
from common import create_patterns
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("PROG_LANG", [*create_patterns()])

In [31]:
doc = nlp("I do code with datastuff using python and golang.")

for idx, start, end in matcher(doc):
    print(doc[start:end])

python
golang


In [33]:
type(doc[start:end])

spacy.tokens.span.Span

In [41]:
def parse_train_data(doc):
    detections = [(doc[start:end].start_char, doc[start:end].end_char, "PROLANG") for idx, start, end in matcher(doc)] 
    return (doc.text, {"entities" : detections})

parse_train_data(nlp("I like python, javascript and golang"))

('I like python, javascript and golang',
 {'entities': [(7, 13, 'PROLANG'), (15, 25, 'PROLANG'), (30, 36, 'PROLANG')]})

Found 3 matches: python, javascript and golang, as defined by the matcher patterns in `common.py`. Nothing has been trained yet; these labels come from the rules.
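One thing to watch for: a `Matcher` can return overlapping matches, while entity annotations for NER must not overlap. A possible variant of `parse_train_data`, sketched here with `spacy.util.filter_spans` (which keeps the longest span among overlapping ones), avoids that. The two patterns below are made up for the demonstration and are not taken from `common.py`.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Made-up patterns that overlap on purpose: "visual basic" and "basic".
matcher.add("PROG_LANG", [
    [{"LOWER": "visual"}, {"LOWER": "basic"}],
    [{"LOWER": "basic"}],
])

def parse_train_data(doc):
    # Drop overlapping matches, keeping the longest span of each overlap.
    spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
    detections = [(s.start_char, s.end_char, "PROLANG") for s in spans]
    return (doc.text, {"entities": detections})

result = parse_train_data(nlp("I like visual basic"))
print(result)
```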

## Full Training Set

Now to load the previous data: the `Id` and `Title` columns of the first 5,000 rows of `archive/Questions.csv`.

In [42]:
df = (pd.read_csv("archive/Questions.csv", nrows=5000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [43]:
df.head()

|   | Id  | Title |
|---|-----|-------|
| 0 | 80  | SQLStatement.execute() - multiple queries in o... |
| 1 | 90  | Good branching and merging tutorials for Torto... |
| 2 | 120 | ASP.NET Site Maps |
| 3 | 180 | Function for creating color wheels |
| 4 | 260 | Adding scripting functionality to .NET applica... |


**NOTE:**

Ideally this should be **manually labeled data**, but since I have not created any, I'm using labels generated by our matcher.

In [68]:
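# Lazily iterate over the question titles
titles = (title for title in df['Title'])
# Keep (as plain strings) only the titles where the matcher finds at least one language
g = [str(d) for d in nlp.pipe(titles) if len(matcher(d)) > 0]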

In [70]:
# Keep only the titles with exactly one match and convert them to the training format
TRAIN_DATA = [parse_train_data(d) for d in nlp.pipe(g) if len(matcher(d)) == 1]
TRAIN_DATA[5:8]

[('How to set up unit testing for Visual Studio C++',
  {'entities': [(45, 48, 'PROLANG')]}),
 ('How do you pack a visual studio c++ project for release?',
  {'entities': [(32, 35, 'PROLANG')]}),
 ('How do you get leading wildcard full-text searches to work in SQL Server?',
  {'entities': [(62, 65, 'PROLANG')]})]

## Training Loop

Again, the docs for reference are [here](https://spacy.io/usage/training#training-simple-style). We take a slightly different approach than what is listed though.

We first create a blank nlp model and then add a `ner` step to it. This is easier than loading in a big model and replacing a step. It's also faster since the loading can be slow.


In [77]:
def create_blank_nlp(train_data):
    # Create a blank English model
    nlp = spacy.blank("en")
    # Create the NER component (spaCy v2 style)
    ner = nlp.create_pipe("ner")
    # Attach the ner component to the nlp object that we created
    # NOTE: passing the component object is v2 style; spaCy v3 raises E966 here (see the error below)
    nlp.add_pipe(ner, last=True)
    ner = nlp.get_pipe("ner")
    # Make sure "ner" is aware of the "PROLANG" entity we created above
    # by registering every label that appears in our training data
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    return nlp

Next we just run it.

In [76]:
import random 
import datetime as dt

#Create blank nlp object
nlp = create_blank_nlp(TRAIN_DATA)
optimizer = nlp.begin_training()  
for i in range(20):
    #Shuffle the train data : happening 20 times
    random.shuffle(TRAIN_DATA)
    losses = {}
    #Updating entire training data to nlp model we created
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(f"Losses at iteration {i} - {dt.datetime.now()}", losses)

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.ner.EntityRecognizer object at 0x000001C680820DD8> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
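The E966 message above points at the fix: call `nlp.add_pipe("ner")` with the string name instead of using `nlp.create_pipe`. Below is a minimal sketch of the same function and loop in the spaCy v3 style, assuming spaCy 3.x is installed. In v3, `nlp.update` expects `Example` objects and `nlp.initialize` takes the place of `begin_training`. The two-sentence `TRAIN_DATA` is a made-up stand-in so the block runs on its own, and 5 iterations is an arbitrary choice; neither is the notebook's real setup.

```python
import random
import spacy
from spacy.training import Example

# Made-up stand-in for the real TRAIN_DATA built above.
TRAIN_DATA = [
    ("I like python", {"entities": [(7, 13, "PROLANG")]}),
    ("Me like golang", {"entities": [(8, 14, "PROLANG")]}),
]

def create_blank_nlp(train_data):
    # Create a blank English model
    nlp = spacy.blank("en")
    # v3: add the component by its registered string name
    ner = nlp.add_pipe("ner", last=True)
    # Register every label that appears in the training data
    for _, annotations in train_data:
        for _, _, label in annotations.get("entities", []):
            ner.add_label(label)
    return nlp

nlp = create_blank_nlp(TRAIN_DATA)
# v3: wrap each (text, annotations) pair in an Example
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
# v3: initialize() replaces begin_training() and returns the optimizer
optimizer = nlp.initialize(lambda: examples)

for i in range(5):
    random.shuffle(examples)
    losses = {}
    for example in examples:
        nlp.update([example], sgd=optimizer, losses=losses)
    print(f"Losses at iteration {i}", losses)
```

With only two examples the model will not learn anything useful; the point is just that the calls run without E966.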

## Improvements 

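The training code above is written against the spaCy v2 API (`nlp.create_pipe` followed by passing the component object to `nlp.add_pipe`), which spaCy v3 rejects with the E966 error shown. That API is deprecated, so there's no need to look too much into it.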

----