## Lab 6: Text Analysis and Natural Language Processing 

In this lab, we explore the text data provided by Kiva's API. Our primary source of textual data is the descriptive text that accompanies each loan request and is posted publicly on the Kiva website. Kiva is unusual in that borrowers often do not write these descriptions themselves; instead, they fill out a questionnaire, and Kiva's team of volunteer translators writes the description from it. We try to leverage this body of text (also called a *"corpus"*) to see whether we can find any patterns in how an individual translator writes a description.

As always, we first import our packages and read in our data below. 

In [1]:
import nltk
import pandas as pd
import numpy as np

# output of multiple commands in a cell will be output at once.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

In [2]:
#datapath = '~/intro_to_machine_learning/data'
datapath = '~/Desktop'
df = pd.read_csv(datapath+'/df.csv', low_memory=False)

## Exploratory Analysis and Feature Engineering

We have very limited information about translators. In fact, the only variable in our dataset relevant to translators is their name! What information can we extract from this field? 

In text analysis, a common simple task is categorizing names by gender. We know, just from our everyday knowledge of English names, that names that end in -a are likely to be female, and names that end in -o are likely to be male (for example, Jenna and Pablo). Since we have both the gender data and the name data for the borrowers, let's use the borrowers' data to train a classifier that can predict gender from a name! Then, we will apply this model to the translators' names to predict their genders.

Here, we use the Naive Bayes Classifier (for a comprehensive review, take a look back at Module 6). This algorithm assigns a label (in our case, "male" or "female") using the last letter of the name provided in the data. Remember that we first need to clean our data to ensure that we are capturing the last letter of first names.
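
To make the idea concrete, here is a minimal sketch of the calculation a Naive Bayes classifier performs with a single feature. The toy names and labels are invented for illustration, and the sketch uses raw relative frequencies; NLTK's implementation smooths its probability estimates, so its numbers will differ.

```python
from collections import Counter

# Hypothetical toy training data: (name, label) pairs invented for illustration.
toy_names = [("Jenna", "female"), ("Maria", "female"), ("Ruth", "female"),
             ("Pablo", "male"), ("Marco", "male"), ("Joshua", "male")]

label_counts = Counter(label for _, label in toy_names)
letter_counts = Counter((name[-1].lower(), label) for name, label in toy_names)

def posterior(last_letter):
    """Return P(label | last_letter) from unsmoothed relative frequencies."""
    scores = {}
    for label, n_label in label_counts.items():
        prior = n_label / len(toy_names)                             # P(label)
        likelihood = letter_counts[(last_letter, label)] / n_label   # P(letter | label)
        scores[label] = prior * likelihood
    total = sum(scores.values())
    # A letter never seen in training gives no evidence, so fall back to the priors.
    if total == 0:
        return {label: n / len(toy_names) for label, n in label_counts.items()}
    # Normalise so the scores sum to 1.
    return {label: score / total for label, score in scores.items()}

print(posterior("a"))
```

In this toy data two of the three names ending in "a" are female, so the posterior for "female" comes out at two thirds.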

In [28]:
#create name and gender dataframe for single borrowers
names = df[['name', 'gender', 'borrower_count']]
names = names[['name', 'gender']][names['borrower_count'] == 1]

names.sample(10)

            name  gender
105125   Johnson    Male
31299       Mary  Female
12673       Jane  Female
95279    Pauline  Female
112818     Simon    Male
45702       Juma    Male
86799   Kaloleni  Female
26716    Mariamu  Female
39780      David    Male
57197   Mary Ann  Female


Here we see there are some instances in which the name is not an individual's first name, but rather the name of a business or a collective, or "Anonymous". Let's drop these out of our training dataset as they won't be helpful in determining the gender of a person. 

In [30]:
# remove null values, duplicates, and "Anonymous"
names = names.loc[names['name'].notnull()]
names = names.drop_duplicates()
names = names[names['name'] != "Anonymous"]

Now let's define a function that will return the last letter of our borrowers' first names. This letter will be the **feature** we use to attempt to predict the target label, gender.

In [31]:
#function that returns last letter of first name 
def gender_features(name):
    return {'last_letter': name[-1]}
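
Note that the function above takes the last character of the whole `name` string, so a two-part name such as "Mary Ann" contributes the final letter of its second part. A hedged variant, assuming names are whitespace-separated and that the first token is the first name, might look like this (`first_name_features` is a hypothetical helper, not part of the lab):

```python
def first_name_features(name):
    """Return the lower-cased last letter of the first whitespace-separated token."""
    tokens = str(name).split()
    if not tokens:  # empty or whitespace-only input
        return {'last_letter': ''}
    return {'last_letter': tokens[0][-1].lower()}

print(first_name_features("Mary Ann"))  # {'last_letter': 'y'}
print(first_name_features("Pablo"))     # {'last_letter': 'o'}
```

Swapping this in would change the trained model, so the accuracy reported in this lab would no longer apply.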

Now let's prepare to train our model. We split train and test sets as usual. 

In [32]:
# Set training-test split %
split_pct = 0.80

# Remove rows with a missing name or gender
names = names.dropna(subset=['name', 'gender'])

# the pandas command "sample" already randomizes its selection. 
names_shuffled = names.sample(frac=1)

train_set = names_shuffled[:int((len(names_shuffled)*split_pct))] 
test_set = names_shuffled[int(len(names_shuffled)*split_pct+1):]  

len(train_set.index)
len(test_set.index)

10436

2608
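
The slicing above is one way to split the data. As a sketch of the same idea on a plain Python list, with a fixed seed so that the split is reproducible (the helper name, the seed value, and the toy data are arbitrary choices for illustration):

```python
import random

def train_test_split_list(items, split_pct=0.80, seed=0):
    """Shuffle a copy of `items` and cut it once into a train part and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_pct)
    return shuffled[:cut], shuffled[cut:]

toy = list(range(10))
train, test = train_test_split_list(toy)
print(len(train), len(test))  # 8 2
```

Using one cut index for both slices means every element lands in exactly one of the two sets.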

Now we prepare our data by converting the name and gender columns into lists of (name, label) pairs, so that each name stays associated with its gender. We then replace each name with its feature dictionary.

In [33]:
female_train = train_set[train_set['gender'] == "Female"]
male_train = train_set[train_set['gender'] == "Male"]
female_test = test_set[test_set['gender'] == "Female"]
male_test = test_set[test_set['gender'] == "Male"]

train_feature_set = [(name, "female") for name in female_train['name']] + \
[(name, "male") for name in male_train['name']]

test_feature_set = [(name, "female") for name in female_test['name']] + \
[(name, "male") for name in male_test['name']]

In [34]:
train_feature_set = [(gender_features(n), g) for (n, g) in train_feature_set]
test_feature_set = [(gender_features(n), g) for (n, g) in test_feature_set]

In [35]:
classifier = nltk.NaiveBayesClassifier.train(train_feature_set)

In [37]:
#let's test out our new classifier! 

classifier.classify(gender_features('Cleopatra'))
classifier.classify(gender_features('Maximillian'))
classifier.classify(gender_features('James'))

'female'

'male'

'male'

It looks like it works okay for our three samples, but let's get a better sense of overall accuracy.

The nltk `accuracy()` function returns the fraction of test examples for which the predicted label matches the true label.

In [38]:
#Get a sense of overall accuracy

print(nltk.classify.accuracy(classifier, test_feature_set))

0.6276840490797546


**DEV NOTE** Why is this prediction so poor? An accuracy of about 0.63 is only about 13 percentage points above the 50% a coin flip would achieve...
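
One way to judge a figure like this is to compare it with a majority-class baseline, i.e. the accuracy obtained by always predicting the most common label. The sketch below assumes a list of (features, label) pairs shaped like `test_feature_set`; the helper name and the toy data are invented for illustration.

```python
from collections import Counter

def majority_baseline(labelled_examples):
    """Accuracy of always guessing the most frequent label."""
    labels = [label for _, label in labelled_examples]
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical test set with three female names and one male name.
toy_test_set = [({'last_letter': 'a'}, 'female'),
                ({'last_letter': 'e'}, 'female'),
                ({'last_letter': 'y'}, 'female'),
                ({'last_letter': 'o'}, 'male')]
print(majority_baseline(toy_test_set))  # 0.75
```

If the baseline on the real test set is close to the classifier's accuracy, the last-letter feature is adding little.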

In [39]:
#Find out which features were most informative in determining outcome

classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'f'              male : female =      9.2 : 1.0
             last_letter = 'k'              male : female =      5.8 : 1.0
             last_letter = 'w'              male : female =      3.5 : 1.0
             last_letter = 'p'              male : female =      3.2 : 1.0
             last_letter = 'h'            female : male   =      2.8 : 1.0


`show_most_informative_features()` reports likelihood ratios. For the first entry, 'f', we see that a name ending in this letter is 9.2 times more likely among the male names than among the female names in the training set.
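
As a rough sketch of where such a ratio comes from, the snippet below divides P(last letter | male) by P(last letter | female) using raw counts. The counts are invented for illustration, and NLTK smooths its estimates, so the numbers it reports will not match a raw-count calculation exactly.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | class A) to P(feature | class B) from raw counts."""
    return (count_a / total_a) / (count_b / total_b)

# Hypothetical counts: 46 of 1,000 male names and 5 of 1,000 female names end in 'f'.
ratio = likelihood_ratio(46, 1000, 5, 1000)
print(round(ratio, 1))  # 9.2
```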

Now let's apply this classifier to the translator dataset to see which ones are male and female! 

In [40]:
translators = pd.DataFrame()
translators[['translator_first_name', 'translator_last_name', 'null1', 'null2']] = df['translator.byline'].str.split(expand=True)

# drop the two overflow columns (the positional axis argument is not accepted by recent pandas)
translators = translators.drop(columns=['null1', 'null2'])

# remove null values and duplicates
translators = translators.loc[translators['translator_first_name'].notnull()]
translators = translators.drop_duplicates()

translators.head(10)

    translator_first_name  translator_last_name
0                   Julie                Keaton
1                  Morena                 Calvo
8                    Lynn                 Cerra
19               Mohammad                 Awais
21                 Cheryl              Strecker
23                   Rita                Snyder
25                Maureen               Wharton
29                  Lorne               Warwick
31                   Caty               McKenna
34                Trishna                 Patel


In [41]:
translators['last_letter'] = translators['translator_first_name'].apply(gender_features)
translators_last = translators['last_letter']
translators_last[0:5]

0     {'last_letter': 'e'}
1     {'last_letter': 'a'}
8     {'last_letter': 'n'}
19    {'last_letter': 'd'}
21    {'last_letter': 'l'}
Name: last_letter, dtype: object

In [138]:
# Classify each translator. Iterating over the values avoids label-based
# lookups such as translators_last[2], which fail because the index has gaps.
translators['predicted_gender'] = [classifier.classify(features) for features in translators_last]

It is easy to see how you can expand this kind of analysis for other things. Do you think there are patterns in gender for names that contain 

## Cleaning text 

You already got a taste of this when cleaning the variable "name" to exclude business names.

Cleaning can be as easy or as difficult as you want it to be, so ask what is best for your task. Should you keep word roots or full words? Should quotations, HTML, and CSS be included? What about numbers and tables? Are all words equally informative (for example "a", "as", "is", "the")?

For our task, we want to stem the words and tokenize so as to get a broad overview of what words are used when people request loans.
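
Before stemming, a typical cleaning pass lowercases the text, splits it into word tokens, and drops very common "stop words". Here is a minimal standard-library sketch, assuming a simple regular-expression tokenizer and a tiny hand-written stop-word list (both are illustrative stand-ins, not NLTK's tokenizer or stop-word corpus, and the sample sentence is invented):

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "as", "is", "the", "to", "and", "of", "her"}

def clean_tokens(description):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in STOP_WORDS]

sample = "Mary is a farmer and hopes to expand her farm."
print(clean_tokens(sample))  # ['mary', 'farmer', 'hopes', 'expand', 'farm']
```

A stemmer would then reduce related forms such as "farmer" and "farm" toward a shared root.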

In [8]:
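# read all English descriptions into a single corpus, skipping missing values
descriptions = df['description.texts.en'].dropna().astype(str)

# tokenize: nltk.tokenize is a module, not a function, so call nltk.word_tokenize
# (this needs the tokenizer models, which can be fetched with nltk.download)
tokens = nltk.word_tokenize(' '.join(descriptions))

# wrap the tokens in an nltk.Text so that concordance() and similar() are available
text = nltk.Text(tokens)

In [None]:
# Clean data - stem

# Porter stemmer is one of several

pstemmer = nltk.PorterStemmer()
# stem() works on one word at a time, so apply it to every token
stemmed = [pstemmer.stem(token) for token in tokens]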

## Preliminary investigations / visualizations 

In [None]:
#Prelim investigations : getting context of a word 

text.concordance("farmer")
text.concordance("hope")
text.concordance("future")

# try some words you're interested in - limited by imagination! 
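
For intuition, a concordance is a "keyword in context" listing: every occurrence of a word shown with a few words on either side. The sketch below is a simplified stand-in, not NLTK's implementation; the helper name and the toy sentence are invented for illustration.

```python
def keyword_in_context(tokens, keyword, window=2):
    """Return each occurrence of `keyword` with `window` tokens on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + [token] + right))
    return hits

toy_tokens = "she is a farmer who sells maize and the farmer hopes to expand".split()
print(keyword_in_context(toy_tokens, "farmer"))
# ['is a farmer who sells', 'and the farmer hopes to']
```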

In [None]:
# prelim investigation: getting other words that appear in similar range of contexts 

text.similar("farmer")
text.similar("hope")
text.similar("future")


## Algorithms

Bag of words is the simplest representation of a text: it records which words occur and how often. The downside is a big loss of context, because the order of words is not preserved ("man eats bread" is the same as "bread eats man").
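
A minimal sketch of that loss of word order, using only the standard library (the helper name is hypothetical):

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each lower-cased word, ignoring the order in which words appear."""
    return Counter(sentence.lower().split())

# Both sentences contain the same words, so their bags are identical.
print(bag_of_words("man eats bread") == bag_of_words("bread eats man"))  # True
```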

In [None]:
# Frequency of words - get an idea of how loan requests in Kenya are described
# Gets at a nuanced question - what do people/translators include in requests?

fd = nltk.FreqDist(text)
fd.plot()
fd.plot(50, cumulative=True)
fd.most_common(12)

Ideas for research: 
Try to cluster descriptions based on who the translator is?
Try to parse out all adjectives and see what that looks like per translator?

In [124]:
# example of document classification 
#   steps: construct a list of documents, labelled 
#   define feature extractor 
#   train classifier

from nltk.corpus import movie_reviews 
# movie review corpus; each review is categorized as pos or neg

In [127]:
import random

documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

In [126]:
# feature extractor: define a feature for each word, 
# indicating whether doc contains that word 
# limit num of features classifier needs to process
# stick with 2k most freq words 

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# keys() cannot be sliced in Python 3; most_common() returns words in frequency order
word_features = [word for word, _ in all_words.most_common(2000)]

In [None]:
def document_features(document): 
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt')))


featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))  # e.g. 0.81; varies with the shuffle
classifier.show_most_informative_features(5)