In [1]:
import spacy
from essay_evaluation.collocational_aspects import CollocationalAspects
from essay_evaluation.lexical_accuracy import SpellChecker, CollocationPreprocessor, CollocationDectector, \
    CollocationEvaluator, LexicalAccuracy
from essay_evaluation.lexical_density import LexicalDensityFeatures
from essay_evaluation.lexical_sophistication import LexicalSophisticationFeatures
from essay_evaluation.lexical_variation import LexicalVariationFeatures
def setup_spacy():
    nlp = spacy.load('en_core_web_sm')

    # add all required components
    spell_checker = SpellChecker()
    nlp.add_pipe(spell_checker, name=spell_checker.name, last=True)

    col_preproc = CollocationPreprocessor()
    nlp.add_pipe(col_preproc, name=col_preproc.name, last=True)

    col_detect = CollocationDectector()
    nlp.add_pipe(col_detect, name=col_detect.name, last=True)

    col_evaluator = CollocationEvaluator()
    nlp.add_pipe(col_evaluator, name=col_evaluator.name, last=True)

    # Add all feature extractors:
    lvf = LexicalVariationFeatures()
    lsf = LexicalSophisticationFeatures()
    ldf = LexicalDensityFeatures()
    la = LexicalAccuracy()
    ca = CollocationalAspects()

    nlp.add_pipe(lvf, name=lvf.name, last=True)
    nlp.add_pipe(lsf, name=lsf.name, last=True)
    nlp.add_pipe(la, name=la.name, last=True)
    nlp.add_pipe(ca, name=ca.name, last=True)
    nlp.add_pipe(ldf, name=ldf.name, last=True)

    return nlp
nlp = setup_spacy()

# Feature Dictionary Refactoring
This notebook tests that the system still works after storing all features in a dictionary instead of an array.

## Change proposition

I want to propose the following changes to our essay_evaluation Python package:
1. Let's use a dictionary to store the feature values.
At the moment we add a new spaCy document extension for each feature group (e.g. features_lv). Each extension is a list containing the index values (e.g. [0.7, 0.1, 0.4, ...]). This is convenient, because a list of lists can easily be converted to a numpy matrix, but I think the positions are easy to mix up. For example, if we add a new index and forget to update our code, the value at position X (features_lv[x]) may no longer be the same feature index as before. The FormativeFeedbackEvaluator would also be much cleaner if we did not need to store positions like this:
```
# before
binning_indicies = [6, 18, 19, 20, 21, 22, 28, 29, 30, 31, 32, 33]
# after:
binning_features = ['LV_HDD', 'LV_W', ...]
```

Same goes for the classifier if we need specific feature values.
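
To make the risk concrete, here is a minimal sketch with made-up data. Only the names LV_W, LV_TTR and LV_HDD come from this notebook; LV_NEW, the values and the variable names are hypothetical. It shows how a hard-coded position silently points at a different feature once a new one is inserted, while a lookup by name keeps working.

```python
# Hypothetical feature layout before a new index is added.
names_v1 = ["LV_W", "LV_TTR", "LV_HDD"]
values_v1 = [80.0, 0.85, 0.93]

# The same document after someone inserts a new feature at position 1.
names_v2 = ["LV_W", "LV_NEW", "LV_TTR", "LV_HDD"]
values_v2 = [80.0, 0.5, 0.85, 0.93]

hdd_position = 2  # hard-coded position that was correct for v1

# Positional access: correct before, silently wrong after the insertion.
by_index_v1 = values_v1[hdd_position]
by_index_v2 = values_v2[hdd_position]

# Name-based access: unaffected by the insertion.
features_v1 = dict(zip(names_v1, values_v1))
features_v2 = dict(zip(names_v2, values_v2))

print(by_index_v1, by_index_v2)                      # 0.93 0.85
print(features_v1["LV_HDD"], features_v2["LV_HDD"])  # 0.93 0.93
```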

2. There should be only one document extension for features
Instead of having many extensions to get the feature indices (doc._.features_lv, doc._.features_ld, ...) there should be only one place, called doc._.features. All the keys of our indices have a unique namespace prefix (LV, LD, LS, ...), so there's no way they can collide.

```
# before
doc._.features_lv = [0.1,0.5,...]
doc._.features_ca = [1, 24, 2, ...]
#... and so on

# after
doc._.features = {
    'LV_W': 3,
    # ...
    'CA_BIN1': 0.3,
    # ...
}
```
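
As a rough sketch of how the per-module values could end up in one place: each extractor contributes a small dictionary with its own prefix, and a helper merges them while refusing duplicate keys. The helper `merge_features` and the sample values are hypothetical and not part of essay_evaluation.

```python
def merge_features(*feature_dicts):
    """Merge per-module feature dicts; raise if two modules share a key."""
    merged = {}
    for feature_dict in feature_dicts:
        # dict key views support set operations, so & gives the shared keys
        duplicates = merged.keys() & feature_dict.keys()
        if duplicates:
            raise KeyError("duplicate feature names: %s" % sorted(duplicates))
        merged.update(feature_dict)
    return merged


# Hypothetical output of two extractors, each with its own namespace prefix.
lv = {"LV_W": 3, "LV_TTR": 0.85}
ca = {"CA_BIN1": 0.3}

features = merge_features(lv, ca)
print(features["LV_W"], features["CA_BIN1"])  # 3 0.3
```

With the prefixes in place the duplicate check should never fire; it only guards against a naming mistake in a new extractor.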

3. Let's use OrderedDict instead of a regular dictionary
Since Python 3.7 the dict class is guaranteed to keep the order of the elements as they were inserted (in CPython 3.6 this was still an implementation detail). I think we should consider backwards compatibility with Python 3.5, as its EOL is September 2020. Also see this: http://gandenberger.org/2018/03/10/ordered-dicts-vs-ordereddict/
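
One practical difference worth keeping in mind, sketched here with made-up values: OrderedDict equality is order-sensitive, while plain dict equality is not. If key order is ever used to build a feature row, the stricter comparison can help catch accidental reordering.

```python
from collections import OrderedDict

# Same keys and values, different insertion order (values are illustrative).
a = OrderedDict([("LV_W", 80.0), ("LV_TTR", 0.85)])
b = OrderedDict([("LV_TTR", 0.85), ("LV_W", 80.0)])

print(a == b)              # False: OrderedDict compares order as well
print(dict(a) == dict(b))  # True: plain dicts ignore order when comparing

# Values come out in insertion order, which matters when building a row.
row = list(a.values())
print(row)                 # [80.0, 0.85]
```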

4. To maintain backwards compatibility: keep all list indexes for now
We're using the doc._.features_XX lists a lot in our code and notebooks, so I would not remove them yet. Instead, let's add the dictionary alongside them.
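
One possible way to keep both representations in sync during the transition, as a sketch only: fill the dictionary once and derive each legacy list from it, using the module's feature name order. The helper `to_legacy_list` is hypothetical, and `lv_feature_names` merely stands in for a class-level feature_names list.

```python
from collections import OrderedDict


def to_legacy_list(features, feature_names):
    """Return the values for feature_names, in that order, as a plain list."""
    return [features[name] for name in feature_names]


# Stand-in for a module's feature_names attribute (shortened for illustration).
lv_feature_names = ["LV_W", "LV_WT", "LV_TTR"]

# Combined dictionary holding features of several modules.
features = OrderedDict([
    ("LV_W", 80.0),
    ("LV_WT", 68.0),
    ("LV_TTR", 0.85),
    ("CA_BIN1", 0.3),
])

features_lv = to_legacy_list(features, lv_feature_names)
print(features_lv)  # [80.0, 68.0, 0.85]
```

Deriving the list from the dictionary, rather than maintaining both by hand, means the two cannot drift apart.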

# Testing the new approach 

We will collect the feature indices the old way and the new way.
Then we check that both give the same result and that everything still works as before.
**It's very important that none of our existing code breaks!**

In [2]:
# sample text from EFCAMDAT
text = 'Feeling likehome! Are you looking for a home which combines modern living and traditional craftsmen work in a unique way? Have you ever dreamed of feeling the cool breeze of the Pacific when you are waking up? If yes, then welcome to Pacific Heights. The completely remodeled kitchen  which offers you all kind of modern appliances - will enable you to be your own chef. You will enjoy your dinner on the wooden veranda with a once in a lifetime view over the Pacific. Even during rainy days you can enjoy the great view under the rustic porch. The staircase and trim throughout the house are still original, giving the house a special feeling of being comfortable. The apartment is perfectly suited for families with its three bedrooms and two bathrooms, both in line with the rustic and cosy design of the apartment. You can rent the apartment for $ 1850 per month plus the cleaning deposit. Pets are welcome. Do not miss this unique opportunity to get your ideal living space.'
doc = nlp(text)
doc

Feeling likehome! Are you looking for a home which combines modern living and traditional craftsmen work in a unique way? Have you ever dreamed of feeling the cool breeze of the Pacific when you are waking up? If yes, then welcome to Pacific Heights. The completely remodeled kitchen  which offers you all kind of modern appliances - will enable you to be your own chef. You will enjoy your dinner on the wooden veranda with a once in a lifetime view over the Pacific. Even during rainy days you can enjoy the great view under the rustic porch. The staircase and trim throughout the house are still original, giving the house a special feeling of being comfortable. The apartment is perfectly suited for families with its three bedrooms and two bathrooms, both in line with the rustic and cosy design of the apartment. You can rent the apartment for $ 1850 per month plus the cleaning deposit. Pets are welcome. Do not miss this unique opportunity to get your ideal living space.

In [3]:
legacy_feature_values = doc._.features_lv + doc._.features_ls + doc._.features_la + doc._.features_ca + doc._.features_ld
legacy_feature_names = all_feature_names = LexicalVariationFeatures.feature_names + LexicalSophisticationFeatures.feature_names + \
                    LexicalAccuracy.feature_names + CollocationalAspects.feature_names + LexicalDensityFeatures.feature_names
print("NAME", "LIST INDEX VALUE (LEGACY)", "NEW DICT VALUE")
result = {
    "Name": [],
    "Old Array": [],
    "New Dictionary": [],
    "Equal": []
}
for index, feature_name in enumerate(legacy_feature_names):
    result["Name"].append(feature_name)
    result["Old Array"].append(legacy_feature_values[index])
    result["New Dictionary"].append(doc._.features[feature_name])
    result["Equal"].append(legacy_feature_values[index] == doc._.features[feature_name])

NAME LIST INDEX VALUE (LEGACY) NEW DICT VALUE


In [4]:
import pandas as pd
pd.DataFrame(data=result)

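|   | Name | Old Array | New Dictionary | Equal |
|---|------|-----------|----------------|-------|
| 0 | LV_W | 80.0 | 80.0 | True |
| 1 | LV_WT | 68.0 | 68.0 | True |
| 2 | LV_WT1 | 58.0 | 58.0 | True |
| 3 | LV_TTR | 0.85 | 0.85 | True |
| 4 | LV_CTTR | 5.375872 | 5.375872 | True |
| 5 | LV_RTTR | 7.602631 | 7.602631 | True |
| 6 | LV_HDD | 0.929236 | 0.929236 | True |
| 7 | LV_DUGA | 118.153359 | 118.153359 | True |
| 8 | LV_MAAS | 0.008464 | 0.008464 | True |
| 9 | LV_SUMM | 0.974421 | 0.974421 | True |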
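
Instead of reading the Equal column by eye, the comparison could also be asserted. The sketch below uses a few rows copied from the table and a hypothetical helper `find_mismatches`. It compares with `math.isclose` rather than `==`, on the assumption that tiny floating-point differences should not count as failures.

```python
import math


def find_mismatches(names, old_values, new_features, rel_tol=1e-9):
    """Return the feature names whose old and new values differ or are missing."""
    mismatches = []
    for name, old_value in zip(names, old_values):
        if name not in new_features:
            mismatches.append(name)
        elif not math.isclose(old_value, new_features[name], rel_tol=rel_tol):
            mismatches.append(name)
    return mismatches


# A few rows copied from the table above.
names = ["LV_W", "LV_TTR", "LV_HDD"]
old_values = [80.0, 0.85, 0.929236]
new_features = {"LV_W": 80.0, "LV_TTR": 0.85, "LV_HDD": 0.929236}

mismatches = find_mismatches(names, old_values, new_features)
assert not mismatches, mismatches
```

In the notebook, the same helper could be called with legacy_feature_names, legacy_feature_values and doc._.features.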
