# Important Dates


###### Recall: The intent classification categories
* Important Dates
* Course
* Professor
* Location -> Stretch

## Subclasses of Important Dates
Important dates has many different subclasses.
The subclasses we have identified so far are as follows
 
* faculty report
* curriculum study
* open residence halls
* holiday
* convocation
* instruction_begin  ** Removing this so that we only have semester start 4/29
* add date: 1) without permission, 2) with permission
* withdraw date
* drop date
* semester start
* semester end
* break
* finals
* registration
* graduation



 Faculty Report	Thursday, January 16
 Curriculum Study & Improvement of Instruction	Thursday – Friday, January 16 – 17
 New Student Orientation/Registration	Friday, January 17
* Residence Halls Opens	Sunday, January 19
* Martin Luther King Holiday	Monday, January 20
* Spring Convocation	Tuesday, January 21
* Instruction Begins	Wednesday, January 22
* Last Day to Add a Course without Instructor’s Permission	Wednesday, January 23
* Late Registration (late fee applies)	Friday, January 24
* Deadline for Filing Degree Application
* (Students meeting requirements at end of spring)	Friday, January 31
* Last Day to Add a Course (Instructor’s Permission Required)	Friday, January 31
* Last Day to Drop Course without “W” (refund)	Friday, February 7
* Window for early performance grades	Friday – Tuesday, February 28 – March 3
* Spring Break (no classes scheduled)	Monday – Friday, March 16-27
* Summer and Fall registration begins	Register for Classes
* Spring Holiday (no classes scheduled)	Friday, April 10
* Last Day to Drop Course with “W” (no refund)	Friday, April 17
* Last Day to Withdraw from the University (4:59 p.m.)	Friday, May 8
* EXAM WEEK	Monday – Friday, May 11-15
* Last Day of Classes	Friday, May 15
* Commencement	Friday, May 15 and Saturday, May 16
* Campus Housing Closes	Saturday, May 16
* Faculty Deadline to Submit Final Grades (by 5:00 p.m.)

In [2]:
#!/usr/bin/env python
# coding: utf8
"""Example of training an additional entity type
This script shows how to add a new entity type to an existing pretrained NER
model. To keep the example short and simple, only four sentences are provided
as examples. In practice, you'll need many more — a few hundred would be a
good start. You will also likely need to mix in examples of other entity
types, which might be obtained by running the entity recognizer over unlabelled
sentences, and adding their annotations to the training set.
The actual training is performed by looping over the examples, and calling
`nlp.entity.update()`. The `update()` method steps through the words of the
input. At each word, it makes a prediction. It then consults the annotations
provided on the GoldParse instance, to see whether it was right. If it was
wrong, it adjusts its weights so that the correct action will score higher
next time.
After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.
For more details, see the documentation:
* Training: https://spacy.io/usage/training
* NER: https://spacy.io/usage/linguistic-features#named-entities
Compatible with: spaCy v2.1.0+
Last tested with: v2.1.0
"""
from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

## Preliminary

We use a dictionary, important_dates where the key is the intent classifiction subcategory and the value is a list of questions.

The dictionary is accessed like important_dates['faculty_report'] or important_dates['semester_start'].

In [3]:
important_dates = {}

In [4]:
def get_substring_label_truple(intent, label_indicator, label):
    start = intent.find(label_indicator)
    end = start + len(label_indicator)
    return (start, end, label)

## Faculty Report
All faculty members who hold full-time appointments should be prepared to report on their teaching, research/professional development and scholarly activity, and service activities for the academic year (fall, spring and summer).

In [5]:
FACULTY_REPORT_LABEL = "FACULTY_REPORT"
FACULTY_REPORT_TRAIN_DATA = [
    # faculty_report
    (
        "When is faculty report?",
        {"entities": [
            get_substring_label_truple(
                "When is faculty report",
                "faculty report",
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    
    (
        "When do I need to turn in the faculty report?",
        {"entities": [
            get_substring_label_truple(
                "When do I need to turn in the faculty report?", 
                "faculty report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the faculty report due?", # START COUNTING HERE
        {"entities": [
            get_substring_label_truple(
                "When is the faculty report due?", 
                "faculty report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "Spring report is due when?",
        {"entities": [
            get_substring_label_truple(
                "Spring report is due when?", 
                "Spring report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When do I need to turn in the spring report?",
        {"entities": [
            get_substring_label_truple(
                "When do I need to turn in the spring report?", 
                "spring report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the spring report due?",
        {"entities": [
            get_substring_label_truple(
                "When is the spring report due?", 
                "spring report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the spring report due?",
        {"entities": [
            get_substring_label_truple(
                "When is the spring report due?", 
                "spring report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When do I need to have the fall report done by?",
        {"entities": [
            get_substring_label_truple(
                "When do I need to have the fall report done by?", 
                "fall report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the fall report due?",
        {"entities": [
            get_substring_label_truple(
                "When is the fall report due?", 
                "fall report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the annual report due?",
        {"entities": [
            get_substring_label_truple(
                "When is the annual report due?", 
                "annual report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the research professor report due?",
        {"entities": [
            get_substring_label_truple(
                "When is the research professor report due?", 
                "research professor report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "When is the report due?",
        {"entities": [
            get_substring_label_truple(
                "When is the report due?", 
                "report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
    (
        "Am I supposed to turn a report in before the semester starts?",
        {"entities": [
            get_substring_label_truple(
                "Am I supposed to turn a report in before the semester starts?", 
                "report", 
                FACULTY_REPORT_LABEL
            )]
        },
    ),
] # End faculty report training data

In [6]:
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def chunk_faculty_report(model=None, new_model_name="class", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(FACULTY_REPORT_LABEL)  # add new entity label to entity recognizer
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(FACULTY_REPORT_TRAIN_DATA)
            batches = minibatch(FACULTY_REPORT_TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "I am not sure when to turn in the report."
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
        

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

In [7]:
chunk_faculty_report()

Created blank 'en' model
Losses {'ner': 51.38056101283291}
Losses {'ner': 22.06885736801678}
Losses {'ner': 22.112721854737174}
Losses {'ner': 20.244643689722565}
Losses {'ner': 15.23087171569653}
Losses {'ner': 10.89579725802815}
Losses {'ner': 9.918974945623608}
Losses {'ner': 5.2916755410030625}
Losses {'ner': 9.78242050506075}
Losses {'ner': 5.284540225012781}
Losses {'ner': 5.2322494321578805}
Losses {'ner': 3.6995813324879263}
Losses {'ner': 2.7445078088188684}
Losses {'ner': 3.238315051541401}
Losses {'ner': 1.702649412253405}
Losses {'ner': 2.087992687814245}
Losses {'ner': 2.2809875529506676}
Losses {'ner': 1.6630185900486618}
Losses {'ner': 3.2851466951857478}
Losses {'ner': 1.8191297307807102}
Losses {'ner': 0.029036725202794243}
Losses {'ner': 0.05846673842888321}
Losses {'ner': 0.0035427188763067306}
Losses {'ner': 0.002327218744113337}
Losses {'ner': 5.215032398440371e-06}
Losses {'ner': 1.7405913441205214e-06}
Losses {'ner': 0.00016960208616405628}
Losses {'ner': 8.35710

## Curriculum Study
Curriculum study is a training day

In [8]:
CURRICULUM_STUDY_LABEL = "CURRICULUM_STUDY"

CURRICULUM_STUDY_TRAIN_DATA = [
    # faculty_report
    (
        "When is curriculum study?",
        {"entities": [
            get_substring_label_truple(
                "When is curriculum study",
                "curriculum study",
                CURRICULUM_STUDY_LABEL
            )]
        },
    ),
    
    (
        "When do I need to go to the curriculum study",
        {"entities": [
            get_substring_label_truple(
                "When do I need to go to the curriculum study", 
                "curriculum study",
                CURRICULUM_STUDY_LABEL
            )]
        },
    ),
    (
        "Is there a curriculum study?", # START COUNTING HERE
        {"entities": [
            get_substring_label_truple(
                "Is there a curriculum study?", 
                "curriculum study",
                CURRICULUM_STUDY_LABEL
            )]
        },
    ),
    # Ask about how other people would ask this question
] # End curriculum report training data

In [9]:
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def chunk_curriculum_study(model=None, new_model_name="class", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(CURRICULUM_STUDY_LABEL)  # add new entity label to entity recognizer
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(CURRICULUM_STUDY_TRAIN_DATA)
            batches = minibatch(CURRICULUM_STUDY_TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "I am not sure when curriculum study begins"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
        

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

In [10]:
chunk_curriculum_study()

Created blank 'en' model
Losses {'ner': 16.64488697052002}
Losses {'ner': 14.267718613147736}
Losses {'ner': 10.907584339380264}
Losses {'ner': 6.874922752380371}
Losses {'ner': 4.6637024958617985}
Losses {'ner': 4.395704441245471}
Losses {'ner': 4.015897666307865}
Losses {'ner': 2.5376757009944413}
Losses {'ner': 4.502518051769584}
Losses {'ner': 4.814317160955397}
Losses {'ner': 3.668994711653795}
Losses {'ner': 2.944350157675217}
Losses {'ner': 2.037997988227744}
Losses {'ner': 1.3221059021715291}
Losses {'ner': 0.3084877991933732}
Losses {'ner': 0.011853643368249456}
Losses {'ner': 0.1555180497456728}
Losses {'ner': 0.00033895644263548237}
Losses {'ner': 0.01795182247578017}
Losses {'ner': 0.0006314947097356853}
Losses {'ner': 9.616128918047232e-05}
Losses {'ner': 0.00014639613340696218}
Losses {'ner': 2.2631199029764712e-05}
Losses {'ner': 2.3012198669688486e-08}
Losses {'ner': 5.670478605447673e-09}
Losses {'ner': 3.666999713748477e-08}
Losses {'ner': 2.262892246899507e-06}
Losse

Note that "not sure" is coming up as a label. Might require more data for training.

## Open Residence Halls
Students might wonder when they can begin to move in

In [11]:
OPEN_RESIDENCE_HALLS_LABEL = "OPEN_RESIDENCE_HALLS"
OPEN_RESIDENCE_HALLS_TRAIN_DATA = [
    (
        "When do residence halls open up?",
        {"entities": [
            get_substring_label_truple(
                "When do residence halls open up?",
                "residence halls",
                OPEN_RESIDENCE_HALLS_LABEL
            )]
        },
    ),
    
    (
        "When can I move in?",
        {"entities": [
            get_substring_label_truple(
                "When can I move in?", 
                "move in", 
                OPEN_RESIDENCE_HALLS_LABEL
            )]
        },
    ),
    (
        "When do the dorms open up?",
        {"entities": [
            get_substring_label_truple(
                "When do the dorms open up?", 
                "dorms open", 
                OPEN_RESIDENCE_HALLS_LABEL
            )]
        },
    ),
    (
        "Are the dorms open yet?",
        {"entities": [
            get_substring_label_truple(
                "Are the dorms open yet?", 
                "dorms open", 
                OPEN_RESIDENCE_HALLS_LABEL
            )]
        },
    ),
    (
        "Am I allowed to move in yet?",
        {"entities": [
            get_substring_label_truple(
                "Am I allowed to move in yet?", 
                "move in", 
                OPEN_RESIDENCE_HALLS_LABEL
            )]
        },
    ),

] # End open residence halls training data

In [12]:
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def chunk_open_residence_halls(model=None, new_model_name="class", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(OPEN_RESIDENCE_HALLS_LABEL)  # add new entity label to entity recognizer
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(OPEN_RESIDENCE_HALLS_TRAIN_DATA)
            batches = minibatch(OPEN_RESIDENCE_HALLS_TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)

    # test the trained model
    test_text = "When am I allowed to move into the dorms?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
        

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

In [13]:
chunk_open_residence_halls()

Created blank 'en' model
Losses {'ner': 25.647283852100372}
Losses {'ner': 15.766216836869717}
Losses {'ner': 8.216171025298536}
Losses {'ner': 8.623759474605322}
Losses {'ner': 7.303969016087649}
Losses {'ner': 8.815506931161508}
Losses {'ner': 10.379225305863656}
Losses {'ner': 7.064791837474331}
Losses {'ner': 4.251984400761103}
Losses {'ner': 1.4530246723462366}
Losses {'ner': 0.7567167352701816}
Losses {'ner': 0.24815775438722648}
Losses {'ner': 0.5710542981098645}
Losses {'ner': 0.06303585202209579}
Losses {'ner': 0.00016555962731485435}
Losses {'ner': 0.0005336267719110343}
Losses {'ner': 0.00016540878755570205}
Losses {'ner': 4.507005835411788e-06}
Losses {'ner': 2.8316961308524612e-05}
Losses {'ner': 1.0393440866078927e-06}
Losses {'ner': 1.1324133057124032e-05}
Losses {'ner': 4.477963591782647e-05}
Losses {'ner': 3.443737974604366e-09}
Losses {'ner': 1.0673731926246527e-05}
Losses {'ner': 2.2317222111305387e-06}
Losses {'ner': 4.383041218916441e-07}
Losses {'ner': 3.905537757

## Holidays

### Holidays CS Helper recognizes:
* New Year’s Day: Wednesday, January 1
* Birthday of Martin Luther King Jr: Monday, January 20
* President’s Day: Monday, February 17
* Memorial Day: Monday, May 25
* Independence Day: Saturday, July 4
* Labor Day: Monday, September 7
* Columbus Day: Monday, October 12
* Veterans' Day: Wednesday, November 11
* Thanksgiving: Thursday, November 26
* Christmas: Friday, December 25

### Other important days to note: 
* Valentine’s Day: Friday, February 14
* St Patrick’s Day: Tuesday, March 17
* Good Friday: Friday, April 10
* Easter: Sunday, April 12
* Mother’s Day: Sunday, May 10
* Father’s Day: Sunday, June 21
* Halloween: Saturday, October 31




In [14]:
HOLIDAYS_LABEL = "HOLIDAYS"
HOLIDAYS_TRAIN_DATA = [
    # New Year's Day
    (
        "When is New Year's Day?",
        {"entities": [
            get_substring_label_truple(
                "When is New Year's Day?",
                "New Year's Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "When is New Years?",
        {"entities": [
            get_substring_label_truple(
                "When is New Years?",
                "New Years",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for New Years Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for New Years Day?",
                "New Years Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for New Years?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for New Years?",
                "New Years",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for New Year's Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for New Year's Day?",
                "New Year's Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for New Years?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for New Years?",
                "New Years",
                HOLIDAYS_LABEL
            )]
        },
    ),

    # MLK Day
    (
        "When is Martin Luther King day?",
        {"entities": [
            get_substring_label_truple(
                "When is Martin Luther King day??",
                "Martin Luther King",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "When is MLK day?",
        {"entities": [
            get_substring_label_truple(
                "When is MLK day?",
                "MLK",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "When is Martin Luther King Junior day?",
        {"entities": [
            get_substring_label_truple(
                "When is Martin Luther King Jr day?",
                "Martin Luther King Junior",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Martin Luther King's day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Martin Luther King's day?",
                "Martin Luther King",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for MLK?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for MLK?",
                "MLK",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for MLK day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for MLK day?",
                "MLK",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for MLK junior day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for MLK junior day?",
                "MLK",
                HOLIDAYS_LABEL
            )]
        },
    ),

    # Presidents Day
    (
        "When is Presidents Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Presidents Day?",
                "Presidents Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Presidents Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Presidents Day?",
                "Presidents Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Presidents Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Presidents Day?",
                "Presidents Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    
    # Memorial Day
    (
        "When is Memorial Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Memorial Day?",
                "Memorial Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Memorial Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Memorial Day?",
                "Memorial Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Memorial Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Memorial Day?",
                "Memorial Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    
    # Independence Day, Fourth of July, 4th of July,
    (
        "When is Independence Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Independence Day?",
                "Independence Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "When is Fourth of July",
        {"entities": [
            get_substring_label_truple(
                "When is Fourth of July",
                "Fourth of July",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "When is the 4th of July?",
        {"entities": [
            get_substring_label_truple(
                "When is the 4th of July?",
                "4th of July",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Independence Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Independence Day?",
                "Independence Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Fourth of July?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Fourth of July?",
                "Fourth of July",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for the 4th of July?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for the 4th of July?",
                "4th of July",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Independence Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Independence Day?",
                "Independence Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Fourth of July?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Fourth of July?",
                "Fourth of July",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for the 4th of July?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed the 4th of July?",
                "4th of July",
                HOLIDAYS_LABEL
            )]
        },
    ),

    # Labor Day
    (
        "When is Labor Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Labor Dayy?",
                "Labor Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Labor Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Labor Day?",
                "Labor Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Labor Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Labor Day?",
                "Labor Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    
    # Columbus Day
    (
        "When is Columbus Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Columbus Dayy?",
                "Columbus Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Columbus Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Columbus Day?",
                "Columbus Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Columbus Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Columbus Day?",
                "Columbus Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    
    # Veterans Day
    (
        "When is Veterans Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Veterans Dayy?",
                "Veterans Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Veterans Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Veterans Day?",
                "Veterans Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Veterans Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Veterans Day?",
                "Veterans Day",
                HOLIDAYS_LABEL
            )]
        },
    ),

    # Thanksgiving Day
    (
        "When is Thanksgiving Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Thanksgiving Dayy?",
                "Thanksgiving Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Thanksgiving Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Thanksgiving Day?",
                "Thanksgiving Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Thanksgiving Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Thanksgiving Day?",
                "Thanksgiving Day",
                HOLIDAYS_LABEL
            )]
        },
    ),

    # Christmas Day
    (
        "When is Christmas Day?",
        {"entities": [
            get_substring_label_truple(
                "When is Christmas Dayy?",
                "Christmas Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Are you open for Christmas Day?",
        {"entities": [
            get_substring_label_truple(
                "Are you open for Christmas Day?",
                "Christmas Day",
                HOLIDAYS_LABEL
            )]
        },
    ),
    (
        "Is the university closed for Christmas Day?",
        {"entities": [
            get_substring_label_truple(
                "Is the university closed for Christmas Day?",
                "Christmas Day",
                HOLIDAYS_LABEL
            )]
        },
    ),    

] # End holidays training data

In [15]:
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def chunk_holidays(model=None, new_model_name="class", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(HOLIDAYS_LABEL)  # add new entity label to entity recognizer
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(HOLIDAYS_TRAIN_DATA)
            batches = minibatch(HOLIDAYS_TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
            
    # test the trained model: HOLIDAYS
    test_text = "Is there any school on the fourth"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
            
    # test the trained model: CURRICULUM STUDY
    test_text = "Do we have a break for new years?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # test the trained model: OPEN RESIDENCE HALLS
    test_text = "Is there class on labor day?"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
        

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

In [16]:
chunk_holidays()

Created blank 'en' model
Losses {'ner': 96.10060695337029}
Losses {'ner': 42.48257501226708}
Losses {'ner': 31.712946910978964}
Losses {'ner': 28.470781178815955}
Losses {'ner': 15.132017531164124}
Losses {'ner': 6.09346911936783}
Losses {'ner': 6.2112449398749305}
Losses {'ner': 0.0024505623259943603}
Losses {'ner': 0.03823347877814979}
Losses {'ner': 0.0001152338224900665}
Losses {'ner': 0.0004363360277196627}
Losses {'ner': 0.829102710754756}
Losses {'ner': 0.0014955903556105581}
Losses {'ner': 0.0017686557527298908}
Losses {'ner': 0.00028340566615505503}
Losses {'ner': 8.296999778402453e-06}
Losses {'ner': 6.324807480829586e-05}
Losses {'ner': 6.596592317329203e-05}
Losses {'ner': 1.910838383068366}
Losses {'ner': 1.8950070980835618}
Losses {'ner': 1.7643311030099101e-06}
Losses {'ner': 8.141407162498561e-05}
Losses {'ner': 2.657914738817611e-06}
Losses {'ner': 3.8207690058731234e-05}
Losses {'ner': 2.028982455703454e-07}
Losses {'ner': 0.0006022267773490347}
Losses {'ner': 0.00040

## Instruction begins
Everyone wants to know when school starts

In [17]:
INSTRUCTION_BEGINS_LABEL = "INSTRUCTION_BEGINS"

INSTRUCTION_BEGINS_TRAIN_DATA = [
    # instruction begins
    (
        "When does instruction begin?",
        {"entities": [
            get_substring_label_truple(
                "When does instruction begin?",
                "instruction begin",
                INSTRUCTION_BEGINS_LABEL
            )]
        },
    ),
    (
        "Has instruction began?",
        {"entities": [
            get_substring_label_truple(
                "Has instruction began?",
                "instruction began",
                INSTRUCTION_BEGINS_LABEL
            )]
        },
    ),
    
    (
        "When does school start?",
        {"entities": [
            get_substring_label_truple(
                "When does school start?", 
                "school start",
                INSTRUCTION_BEGINS_LABEL
            )]
        },
    ),
    (
        "Has school started yet?",
        {"entities": [
            get_substring_label_truple(
                "Has school started yet?", 
                "school start",
                INSTRUCTION_BEGINS_LABEL
            )]
        },
    ),
    (
        "When do classes begin?",
        {"entities": [
            get_substring_label_truple(
                "When does classes begin?",
                "classes begin",
                INSTRUCTION_BEGINS_LABEL
            )]
        },
    ),

    (
        "Have classes started yet?",
        {"entities": [
            get_substring_label_truple(
                "Has school started yet?", 
                "school start",
                INSTRUCTION_BEGINS_LABEL
            )]
        },
    ),
    # Ask about how other people would ask this question
] # End curriculum report training data

## Add Date
There are two dates that exist for adding classes that a user might want to know about. Those are dependent upon requiring instructor approval or not. We really only need to know when a user wants to know about the add date. CS Helper can just provide both dates back to the user

In [19]:
ADD_DATE_LABEL = 'ADD_DATE'
ADD_DATE_TRAIN_DATA = [
    (
        "When is the last day to add a class?",
        {"entities": [
            get_substring_label_truple(
                "When is the last day to add a class?", 
                "add a class",
                ADD_DATE_LABEL
            )]
        },
    ),
    (
        "Can I still add a course?",
        {"entities": [
            get_substring_label_truple(
                "Can I still add a course?", 
                "add a course",
                ADD_DATE_LABEL
            )]
        },
    ),
    (
        "Is it too late to add a course for this semester?",
        {"entities": [
            get_substring_label_truple(
                "Is it too late to add a course for this semester?", 
                "add a course",
                ADD_DATE_LABEL
            )]
        },
    ),
    (
        "I was wondering if I could still add a class?",
        {"entities": [
            get_substring_label_truple(
                "I was wondering if I could still add a class?", 
                "add a class",
                ADD_DATE_LABEL
            )]
        },
    ),
    (
        "Are we past the add date?",
        {"entities": [
            get_substring_label_truple(
                "Are we past the add date?", 
                "add date",
                ADD_DATE_LABEL
            )]
        },
    ),
    (
        "When is the final add date?",
        {"entities": [
            get_substring_label_truple(
                "When is the final add date?", 
                "add date",
                ADD_DATE_LABEL
            )]
        },
    ),
    
]

## Withdraw Date


In [20]:
WITHDRAW_DATE_LABEL = 'WITHDRAW_DATE'
WITHDRAW_DATE_TRAIN_DATA = [
    (
        "When is the last day to withdraw from a class?",
        {"entities": [
            get_substring_label_truple(
                "When is the last day to withdraw from a class?", 
                "withdraw from a class",
                WITHDRAW_DATE_LABEL
            )]
        },
    ),
    (
        "Can I still withdraw from a course?",
        {"entities": [
            get_substring_label_truple(
                "Can I still withdraw from a course?", 
                "withdraw from a course",
                WITHDRAW_DATE_LABEL
            )]
        },
    ),
    (
        "Is it too late to withdraw from a course for this semester?",
        {"entities": [
            get_substring_label_truple(
                "Is it too late to withdraw from a course for this semester?", 
                "withdraw from a course",
                WITHDRAW_DATE_LABEL
            )]
        },
    ),
    (
        "I was wondering if I could still withdraw from a class?",
        {"entities": [
            get_substring_label_truple(
                "I was wondering if I could still withdraw from a class?", 
                "withdraw from a class",
                WITHDRAW_DATE_LABEL
            )]
        },
    ),
    (
        "Are we past the withdraw date?",
        {"entities": [
            get_substring_label_truple(
                "Are we past the withdraw date?", 
                "withdraw date",
                WITHDRAW_DATE_LABEL
            )]
        },
    ),
    (
        "When is the last day to withdraw?",
        {"entities": [
            get_substring_label_truple(
                "When is the last day to withdraw?", 
                "day to withdraw",
                WITHDRAW_DATE_LABEL
            )]
        },
    ),
    
]

## Drop date

In [21]:
DROP_DATE_LABEL = 'DROP_DATE'
DROP_DATE_TRAIN_DATA = [
    (
        "When is the last day to drop a class?",
        {"entities": [
            get_substring_label_truple(
                "When is the last day to drop a class?", 
                "drop a class",
                DROP_DATE_LABEL
            )]
        },
    ),
    (
        "Can I still drop a course?",
        {"entities": [
            get_substring_label_truple(
                "Can I still drop a course?", 
                "drop a course",
                DROP_DATE_LABEL
            )]
        },
    ),
    (
        "Is it too late to drop a course for this semester?",
        {"entities": [
            get_substring_label_truple(
                "Is it too late to drop a course for this semester?", 
                "drop a course",
                DROP_DATE_LABEL
            )]
        },
    ),
    (
        "I was wondering if I could still drop a class?",
        {"entities": [
            get_substring_label_truple(
                "I was wondering if I could still drop a class?", 
                "drop a class",
                DROP_DATE_LABEL
            )]
        },
    ),
    (
        "Are we past the drop date?",
        {"entities": [
            get_substring_label_truple(
                "Are we past the drop date?", 
                "drop date",
                DROP_DATE_LABEL
            )]
        },
    ),
    (
        "When is the last day to drop?",
        {"entities": [
            get_substring_label_truple(
                "When is the last day to drop?", 
                "day to drop",
                DROP_DATE_LABEL
            )]
        },
    ),
    
]

In [22]:
## Semester Start

In [23]:
SEMESTER_START_LABEL = 'SEMESTER_START'
SEMESTER_START_TRAIN_DATA = [
    (
        "When does the semester start?",
        {"entities": [
            get_substring_label_truple(
                "When does the semester start?", 
                "semester start",
                SEMESTER_START_LABEL
            )]
        },
    ),
    (
        "When is the start of the semester?",
        {"entities": [
            get_substring_label_truple(
                "When is the start of the semester?", 
                "start of the semester",
                SEMESTER_START_LABEL
            )]
        },
    ),
    (
        "What day does the semester start?",
        {"entities": [
            get_substring_label_truple(
                "What day does the semester start?", 
                "semester start",
                SEMESTER_START_LABEL
            )]
        },
    ),
    (
        "What day is the start of the semester?",
        {"entities": [
            get_substring_label_truple(
                "What day is the start of the semester?", 
                "start of the semester",
                SEMESTER_START_LABEL
            )]
        },
    ),
    (
        "Which day does the semester start?",
        {"entities": [
            get_substring_label_truple(
                "Which day does the semester start?", 
                "semester start",
                SEMESTER_START_LABEL
            )]
        },
    ),
    (
        "Which day is the start of the semester?",
        {"entities": [
            get_substring_label_truple(
                "Which day is the start of the semester?", 
                "start of the semester",
                SEMESTER_START_LABEL
            )]
        },
    ),
]

In [24]:
## Semester End

In [25]:
SEMESTER_END_LABEL = 'SEMESTER_END'
SEMESTER_END_TRAIN_DATA = [
    (
        "When does the semester end?",
        {"entities": [
            get_substring_label_truple(
                "When does the semester end?", 
                "semester end",
                SEMESTER_END_LABEL
            )]
        },
    ),
    (
        "When is the end of the semester?",
        {"entities": [
            get_substring_label_truple(
                "When is the end of the semester?", 
                "end of the semester",
                SEMESTER_END_LABEL
            )]
        },
    ),
    (
        "What day does the semester end?",
        {"entities": [
            get_substring_label_truple(
                "What day does the semester end?", 
                "semester end",
                SEMESTER_END_LABEL
            )]
        },
    ),
    (
        "What day is the end of the semester?",
        {"entities": [
            get_substring_label_truple(
                "What day is the end of the semester?", 
                "end of the semester",
                SEMESTER_END_LABEL
            )]
        },
    ),
    (
        "Which day does the semester end?",
        {"entities": [
            get_substring_label_truple(
                "Which day does the semester end?", 
                "semester end",
                SEMESTER_END_LABEL
            )]
        },
    ),
    (
        "Which day is the end of the semester?",
        {"entities": [
            get_substring_label_truple(
                "Which day is the end of the semester?", 
                "end of the semester",
                SEMESTER_END_LABEL
            )]
        },
    ),
]

In [26]:
## Break

In [27]:
BREAK_LABEL = 'BREAK'
BREAK_TRAIN_DATA = [
    (
        "When is the break?",
        {"entities": [
            get_substring_label_truple(
                "When is the break?", 
                "break",
                BREAK_LABEL
            )]
        },
    ),
    (
        "When is spring break?",
        {"entities": [
            get_substring_label_truple(
                "When is spring break?", 
                "break",
                BREAK_LABEL
            )]
        },
    ),
    (
        "When is fall break?",
        {"entities": [
            get_substring_label_truple(
                "When is fall break?", 
                "break",
                BREAK_LABEL
            )]
        },
    ),
]

In [28]:
## Finals

In [29]:
FINALS_LABEL = 'FINALS'
FINALS_TRAIN_DATA = [
    (
        "When are finals?",
        {"entities": [
            get_substring_label_truple(
                "When are finals?", 
                "finals",
                FINALS_LABEL
            )]
        },
    ),
    (
        "When do finals start?",
        {"entities": [
            get_substring_label_truple(
                "When do finals start?", 
                "finals",
                FINALS_LABEL
            )]
        },
    ),
]

In [30]:
## Registration

In [31]:
REGISTRATION_LABEL = 'REGISTRATION'
REGISTRATION_TRAIN_DATA = [
    (
        "When can I begin to register for classes?",
        {"entities": [
            get_substring_label_truple(
                "When can I begin to register for classes?", 
                "register",
                REGISTRATION_LABEL
            )]
        },
    ),
    (
        "When does registration start?",
        {"entities": [
            get_substring_label_truple(
                "When does registration start?", 
                "registration",
                REGISTRATION_LABEL
            )]
        },
    ),
]

In [None]:
GRADUATION_LABEL = 'GRADUATION'
GRADUATION_TRAIN_DATA = [
    (
        "When is graduation?",
        {"entities": [
            get_substring_label_truple(
                "When is graduation?", 
                "graduation?",
                GRADUATION_LABEL
            )]
        },
    ),
    (
        "What day is graduation?",
        {"entities": [
            get_substring_label_truple(
                "What day is graduation?", 
                "graduation",
                GRADUATION_LABEL
            )]
        },
    ),
    (
        "When does graduation start?",
        {"entities": [
            get_substring_label_truple(
                "When does graduation start?", 
                "graduation",
                GRADUATION_LABEL
            )]
        },
    ),
    (
        "What time is graduation?",
        {"entities": [
            get_substring_label_truple(
                "What time is graduation?", 
                "graduation",
                GRADUATION_LABEL
            )]
        },
    ),
]

## WTD

In [39]:
IMPORTANT_DATES_TRAIN_DATA = []
subcategory_train_data = [
    FACULTY_REPORT_TRAIN_DATA,
    CURRICULUM_STUDY_TRAIN_DATA,
    OPEN_RESIDENCE_HALLS_TRAIN_DATA,
    HOLIDAYS_TRAIN_DATA,
    INSTRUCTION_BEGINS_TRAIN_DATA,
    ADD_DATE_TRAIN_DATA,
    WITHDRAW_DATE_TRAIN_DATA,
    DROP_DATE_TRAIN_DATA,
    SEMESTER_START_TRAIN_DATA,
    SEMESTER_END_TRAIN_DATA,
    BREAK_TRAIN_DATA,
    FINALS_TRAIN_DATA,
    REGISTRATION_TRAIN_DATA,
    GRADUATION_TRAIN_DATA,
]

IMPORTANT_DATES_LABELS = [
    FACULTY_REPORT_LABEL,
    CURRICULUM_STUDY_LABEL,
    OPEN_RESIDENCE_HALLS_LABEL,
    HOLIDAYS_LABEL,
    INSTRUCTION_BEGINS_LABEL,
    ADD_DATE_LABEL,
    WITHDRAW_DATE_LABEL,
    DROP_DATE_LABEL,
    SEMESTER_START_LABEL,
    SEMESTER_END_LABEL,
    BREAK_LABEL,
    FINALS_LABEL,
    REGISTRATION_LABEL,
    GRADUATION_LABEL,
]

for s in subcategory_train_data:
    IMPORTANT_DATES_TRAIN_DATA.extend(s)
    


In [43]:
def test_model(nlp):
    utterances = [
        
        "I am not sure when to turn in the report.", # Faculty report
        "I am not sure when curriculum study begins", # Curriculum study
        "When am I allowed to move into the dorms?", # Residence Halls Open
        "Are we open for Martin Luther King Day?", # Holidays
        "When do classes start?", # Instruction begins
        "Can I still add a class for this semester?", # Add date
        "Is it too late to withdraw from a class?", # Withdraw date
        "When does the semester start?", # Semester start
        "When does the semester end?", # Semester end
        "Which week is spring break?", # Break
        "What week is finals week?", # Finals
        "Can I register for next semester yet?", # Registration
    ]
    # test the trained model: CURRICULUM STUDY
    
    for u in utterances:
        doc = nlp(u)
        print("Entities in '%s'" % u)
        for ent in doc.ents:
            print(ent.label_, ent.text)
    

In [44]:
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def chunk_important_dates(model=None, new_model_name="class", output_dir=None, n_iter=30):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
        
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
        
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    # Add labels
    for l in IMPORTANT_DATES_LABELS:
        ner.add_label(l)  # add new entity label to entity recognizer
        
    # Begin/Resume training
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    
    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        
        # batch up the examples using spaCy's minibatch
        for itn in range(n_iter):
            random.shuffle(IMPORTANT_DATES_TRAIN_DATA)
            batches = minibatch(IMPORTANT_DATES_TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print("Losses", losses)
            
    test_model(nlp)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

In [45]:
chunk_important_dates()

Created blank 'en' model
Losses {'ner': 249.76170928887757}
Losses {'ner': 139.84909687744786}
Losses {'ner': 163.75799565175444}
Losses {'ner': 175.59950806400732}
Losses {'ner': 133.46897065410616}
Losses {'ner': 126.2065661016922}
Losses {'ner': 133.66536078225758}
Losses {'ner': 126.52656941303702}
Losses {'ner': 85.21886263562479}
Losses {'ner': 73.14054021418634}
Losses {'ner': 67.1987382788683}
Losses {'ner': 53.83804509495027}
Losses {'ner': 58.420892304051826}
Losses {'ner': 34.478665213156894}
Losses {'ner': 53.05879166461248}
Losses {'ner': 26.470127299862785}
Losses {'ner': 28.7590863000675}
Losses {'ner': 21.92261269816455}
Losses {'ner': 28.76388924495248}
Losses {'ner': 12.855424226681974}
Losses {'ner': 11.409866706882973}
Losses {'ner': 12.91490087823108}
Losses {'ner': 20.84585494190036}
Losses {'ner': 10.288504506895558}
Losses {'ner': 14.410486104279565}
Losses {'ner': 9.187588319578254}
Losses {'ner': 7.8605643689354485}
Losses {'ner': 3.381498989613924}
Losses {'n