# Capstone Project: Improving NLTK's Lemmatization Process

Submitted by Shannon Bingham
February 2019


## Executive Summary
The Natural Language Toolkit (NLTK) is a widely used open-source library for processing and analyzing human language. It includes many useful utilities, among them an interface to the WordNet lexical reference database. When I used the toolkit's stemming and lemmatizing programs for a Natural Language Processing (NLP) classification project at General Assembly, I grew curious about two things: the data being created, and the logic under the hood. Those questions led me to ask how I could improve the quality of my input data, especially since the sheer number of features in an NLP project makes it hard to see what the data looks like.

For my capstone project, I decided to improve the NLTK interface to WordNet. My work on this project led to many interesting insights about electronic dictionaries and about language processing in general. Over the course of the project, I became more deeply interested in NLP, a rapidly growing area of data science. The changes that I made to the toolkit lay a foundation that gives greater control over the lemmatization process and, thus, more opportunity to improve the data. The changes will be available for implementation anywhere.

## Notebook Description
This notebook includes all the code I used to compare the lemmas created by the production version of the lemmatizer with those created by my test version. Please note the Setup cell, which drives the variable substitutions.

## Set up environment.

In [1]:
# Import libraries.
import pandas as pd
import numpy as np
import random

# Increase number of columns that can be viewed in notebook.
pd.set_option('display.max_columns', 500)

# Set random seed.
random.seed(42)

In [2]:
# Specify the number of reviews of each type (pos and neg) being sampled.
n = 50

# Set file locations for the input files.
dev_csv  = f'./data/lemmas_dev_s{n*2}.csv'
prod_csv = f'./data/lemmas_prod_s{n*2}.csv'

# Print messages.
print(f'** The dev  lemmas will be loaded from "{dev_csv}". **')
print()
print(f'** The prod lemmas will be loaded from "{prod_csv}". **')

** The dev  lemmas will be loaded from "./data/lemmas_dev_s100.csv". **

** The prod lemmas will be loaded from "./data/lemmas_prod_s100.csv". **
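
Before loading, it can help to confirm that both input files are actually present. The following is a minimal sketch using the standard library's `pathlib`. The helper name `missing_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def missing_files(paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Same naming scheme as the Setup cell above.
n = 50
dev_csv = f'./data/lemmas_dev_s{n*2}.csv'
prod_csv = f'./data/lemmas_prod_s{n*2}.csv'

missing = missing_files([dev_csv, prod_csv])
if missing:
    print(f'Missing input files: {missing}')
else:
    print('Both input files were found.')
```

Failing early here gives a clearer message than the error `pd.read_csv` would raise on a missing file.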


In [3]:
# Load the data.
dev = pd.read_csv(dev_csv)
prod = pd.read_csv(prod_csv)

# Take a look. 
prod.head()

Unnamed: 0,label,lemmas
0,1,i wa very impressed with this small independen...
1,1,shot in the heart is wonderful it brilliantly ...
2,1,i have not seen this in over yr but i still re...
3,1,police story brought hong kong movie to modern...
4,1,the word classic is thrown around too loosely ...


In [4]:
# Check the shapes.
print(dev.shape)
print(prod.shape)

(100, 2)
(100, 2)


## Compare lemmas.

In [5]:
# Initialize a list to hold the differences.
diff_list = []

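for i in range(len(dev)):
    dev_list  = dev.iloc[i]['lemmas'].split(" ")
    prod_list = prod.iloc[i]['lemmas'].split(" ")

    # Compare position by position. zip stops at the shorter list,
    # so rows with unequal token counts cannot raise an IndexError.
    for dev_lemma, prod_lemma in zip(dev_list, prod_list):
        if dev_lemma != prod_lemma:
            diff_list.append((dev_lemma, prod_lemma))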

# Remove duplicates.
diff_set = set(diff_list)

# Calculate the number of unique values.
len(diff_set)

829
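
The positional comparison above assumes that the dev and prod lemmatizers produce the same number of tokens for every review. The sketch below shows one way to make that assumption explicit: rows with differing token counts are skipped and reported instead of being compared. The function name `compare_lemmas` and the toy sentences are hypothetical; only the lemma pairs `stun`/`stunning` and `air`/`aired` come from this notebook's output.

```python
def compare_lemmas(dev_texts, prod_texts):
    """Collect unique (dev, prod) lemma pairs that differ at the same position.

    Rows whose token counts differ are skipped and their row numbers
    returned, because a positional comparison is not meaningful for them.
    """
    diffs = set()
    skipped_rows = []
    for row, (dev_text, prod_text) in enumerate(zip(dev_texts, prod_texts)):
        dev_tokens = dev_text.split(" ")
        prod_tokens = prod_text.split(" ")
        if len(dev_tokens) != len(prod_tokens):
            skipped_rows.append(row)
            continue
        for dev_lemma, prod_lemma in zip(dev_tokens, prod_tokens):
            if dev_lemma != prod_lemma:
                diffs.add((dev_lemma, prod_lemma))
    return diffs, skipped_rows

# Toy input, invented for illustration only.
dev_texts = ["a stun film", "it air last night", "too short"]
prod_texts = ["a stunning film", "it aired last night", "this one is longer"]

diffs, skipped = compare_lemmas(dev_texts, prod_texts)
print(sorted(diffs))
print(skipped)
```

With the notebook's data, the call would presumably be `compare_lemmas(dev['lemmas'], prod['lemmas'])`. An empty `skipped` list would confirm that every row was compared in full.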

In [7]:
# Load to a dataframe if analysis is desired. 

# Get it back into list format.
diff_list = list(diff_set)

df = pd.DataFrame(diff_list, columns=['dev lemma', 'prod lemma'])
df.head()

Unnamed: 0,dev lemma,prod lemma
0,stun,stunning
1,annoy,annoying
2,remind,reminded
3,try,trying
4,air,aired
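
One possible follow-up analysis is to tally which endings the dev version strips that the prod version keeps. The sketch below uses only the five pairs displayed above. The helper `dropped_suffix` is hypothetical and handles only the simple case where the dev lemma is a prefix of the prod lemma.

```python
from collections import Counter

def dropped_suffix(dev_lemma, prod_lemma):
    """Return the part of prod_lemma that follows dev_lemma.

    Returns None when dev_lemma is not a proper prefix of prod_lemma.
    """
    if prod_lemma.startswith(dev_lemma) and len(prod_lemma) > len(dev_lemma):
        return prod_lemma[len(dev_lemma):]
    return None

# The five pairs shown in the dataframe preview above.
pairs = [
    ("stun", "stunning"),
    ("annoy", "annoying"),
    ("remind", "reminded"),
    ("try", "trying"),
    ("air", "aired"),
]

suffix_counts = Counter(dropped_suffix(d, p) for d, p in pairs)
for suffix, count in suffix_counts.items():
    print(suffix, count)
```

In the notebook, `pairs` could be replaced with `diff_set`. Note that a doubled consonant shows up as its own suffix (`stun` to `stunning` yields `ning`), so a fuller analysis would need to normalize such cases.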
