# Basic System

This notebook provides code for implementing a very simple machine learning system for named entity recognition.
It uses logistic regression and one feature (the token itself).
Links to information about the packages are provided. Your job is to document the code and use it to train a system. You can then use your evaluation code to provide the first basic evaluation of your system.
In the next assignment, you can use this as a basis to experiment with more features and more machine learning methods.

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
import pandas as pd
import sys

# If you want to include other modules, you can add them here
# Please note the recommendations on using modules in the Programming General Guidelines

#recommended resource for examples:

#https://scikit-learn.org/stable/modules/feature_extraction.html

In [42]:
def extract_features_and_labels(trainingfile):
    '''
    Extracts labels and features from the trainingfile
    
    :param trainingfile: path to the file with the training data
    
    :returns a list with a dictionary for each row containing the features for that row, and a list with the target labels
    '''
    
    data = []
    targets = []
    with open(trainingfile, 'r', encoding='utf8') as infile:
        for line in infile:
            components = line.rstrip('\n').split()
            if len(components) > 0:
                token = components[0]
                feature_dict = {'token':token}
                data.append(feature_dict)
                #gold is in the last column
                targets.append(components[-1])
    return data, targets
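
The parsing logic above can be tried on a small in-memory sample before running it on the full training file. The snippet below is a minimal sketch: the three-column layout (token, POS tag, gold label) and the sample rows are assumptions made for illustration, and the actual preprocessed file may have different intermediate columns. The code relies only on the token being in the first column and the gold label in the last.

```python
import io

# Hypothetical sample: token, POS tag, gold NER label (column layout is an assumption)
sample = io.StringIO(
    "EU NNP B-ORG\n"
    "rejects VBZ O\n"
    "\n"
    "Peter NNP B-PER\n"
)

data, targets = [], []
for line in sample:
    components = line.rstrip('\n').split()
    # Empty lines (sentence boundaries) produce no components and are skipped
    if len(components) > 0:
        data.append({'token': components[0]})   # first column: the token
        targets.append(components[-1])          # last column: the gold label
```

Each non-empty row yields one feature dictionary and one label, so `data` and `targets` always have the same length.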

In [43]:
def extract_features(inputfile):
    '''
    Extracts features from the inputfile
    
    :param inputfile: path to the inputfile
    
    :returns a list with a dictionary for each row containing the features for that row
    '''
    data = []
    with open(inputfile, 'r', encoding='utf8') as infile:
        for line in infile:
            components = line.rstrip('\n').split()
            if len(components) > 0:
                token = components[0]
                feature_dict = {'token':token}
                data.append(feature_dict)
    return data

In [55]:
def create_classifier(train_features, train_targets):
    '''
    Creates a logistic regression classifier, one-hot encodes the train_features and uses these to fit the model
    
    :param train_features: a list of dictionaries containing the training features
    :param train_targets: a list of the target labels
    
    :returns the fitted logistic regression model and the fitted DictVectorizer used to one-hot encode the features
    '''
    logreg = LogisticRegression(max_iter=500)
    vec = DictVectorizer()
    features_vectorized = vec.fit_transform(train_features)
    model = logreg.fit(features_vectorized, train_targets)
    
    return model, vec
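
To see what the one-hot encoding step produces, it can help to run `DictVectorizer` on a few toy feature dictionaries. The sketch below uses made-up tokens (an assumption, not data from the corpus) and also shows what happens to a token that was not seen during fitting: `DictVectorizer.transform` silently ignores unknown features, so such a token ends up as an all-zero row.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy training features (hypothetical tokens, for illustration only)
train = [{'token': 'London'}, {'token': 'is'}, {'token': 'London'}]

vec = DictVectorizer()
X_train = vec.fit_transform(train)      # sparse matrix: one row per token, one column per distinct token
feature_names = vec.feature_names_      # column names in the form 'token=<value>'

# A token that never occurred in the training data
X_new = vec.transform([{'token': 'Paris'}])
unseen_nonzeros = X_new.nnz             # number of non-zero entries in the encoded row
```

With the token as the only feature, an unseen token gives the classifier no information, so its prediction rests on the model's intercept alone. This is one reason to expect the basic system to struggle with names that do not appear in the training data.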

In [56]:
def classify_data(model, vec, inputdata, outputfile):
    '''
    Extracts the features from the inputdata, one-hot encodes them, uses the model to predict the labels,
    and writes each input line followed by its predicted class to the outputfile
    
    :param model: fitted model
    :param vec: fitted vectoriser used to vectorise the features
    :param inputdata: path to the inputfile
    :param outputfile: path to the outputfile
    '''
    features = extract_features(inputdata)
    features = vec.transform(features)
    predictions = model.predict(features)
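    # Read and write with the same encoding that extract_features uses
    with open(inputdata, 'r', encoding='utf8') as infile, \
            open(outputfile, 'w', encoding='utf8') as outfile:
        counter = 0
        for line in infile:
            stripped = line.rstrip('\n')
            # Skip empty lines so the counter stays aligned with the predictions
            if len(stripped.split()) > 0:
                outfile.write(stripped + '\t' + predictions[counter] + '\n')
                counter += 1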

In [57]:
def main(argv=None):
    '''
    Extracts the training features and gold labels, trains a classifier on them, and uses it to classify the input data
    
    :param argv: a list of arguments to indicate the following -> argv[0]: python program used,
    argv[1]: path to the trainingfile, argv[2]: path to the inputfile, argv[3]: path to the outputfile
    '''
    
    #a very basic way for picking up commandline arguments
    if argv is None:
        argv = sys.argv    
    
    #you can replace the values for these with paths to the appropriate files for now, e.g. by specifying values in argv
    #argv = ['mypython_program','','','']
    trainingfile = argv[1]
    inputfile = argv[2]
    outputfile = argv[3]
    
    training_features, gold_labels = extract_features_and_labels(trainingfile)
    ml_model, vec = create_classifier(training_features, gold_labels)
    classify_data(ml_model, vec, inputfile, outputfile)


### Change paths
The cell below can be run to create an output file with the assigned classes from the classifier.
Please replace the paths of the files if they are not the same on your device.

In [59]:
path_trainingfile = '../data/conll2003.train-preprocessed.conll'
path_inputfile = '../data/conll2003.dev-preprocessed.conll'
path_outputfile = '../data/conll2003.dev-basic_system-out.conll'


args = ['python', path_trainingfile, path_inputfile, path_outputfile]
main(args)
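
Once the output file exists, a first sanity check is to compare the gold labels with the predictions. The sketch below is not the evaluation code the assignment refers to. It is a hypothetical helper (`score_predictions`) that assumes the input file carries the gold label in its last column, as the training file does, so that in the output file the gold label is the second-to-last column and the prediction is the last one. The sample rows are invented for illustration.

```python
from collections import Counter

def score_predictions(lines):
    """Return token-level accuracy and a count of (gold, predicted) mismatches."""
    correct = 0
    total = 0
    confusions = Counter()
    for line in lines:
        components = line.rstrip('\n').split()
        if len(components) < 2:
            continue  # skip empty or malformed lines
        gold, predicted = components[-2], components[-1]
        total += 1
        if gold == predicted:
            correct += 1
        else:
            confusions[(gold, predicted)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, confusions

# Hypothetical output rows: token, gold label, predicted label
sample_output = [
    "EU\tB-ORG\tB-ORG\n",
    "rejects\tO\tO\n",
    "German\tB-MISC\tO\n",
    "\n",
    "Brussels\tB-LOC\tB-LOC\n",
]
accuracy, confusions = score_predictions(sample_output)
```

To score the real output, the same function could be called with an open file handle, for example `score_predictions(open(path_outputfile, encoding='utf8'))`. Token-level accuracy tends to be dominated by the frequent `O` label, so per-class precision and recall give a more informative picture.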

