## Lab 6: Text Analysis and Natural Language Processing 

In this lab, we explore the text data provided by Kiva's API. Our primary source of textual data is the descriptive text that accompanies each loan request and is posted publicly on the Kiva website. Kiva is unusual in that borrowers often do not write these descriptions themselves; instead, they fill out a questionnaire that goes to Kiva's team of volunteer translators. We use this body of text (also called a *corpus*) to look for patterns in how an individual translator writes a description.

As always, we first import our packages and read in our data below. 

In [1]:
import pandas as pd
import numpy as np

# NLP-specific packages: 
import nltk
from nltk.corpus import names
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


# show the output of every command in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

In [2]:
#datapath = '~/intro_to_machine_learning/data'
datapath = '~/Desktop'
df = pd.read_csv(datapath+'/df.csv', low_memory=False)

## Exploratory Analysis and Feature Engineering

We have very limited information about translators. In fact, the only variable in our dataset relevant to translators is their name! What information can we extract from this field? 

In text analysis, a common simple task is categorizing names by gender. We know, just from our everyday knowledge of English names, that names that end in -a are likely to be female, and names that end in -o are likely to be male (for example, Jenna and Pablo). Since we have both the gender data and the name data for the borrowers, let's use the borrowers' data to train a classifier that can predict gender from a name. Then, we will apply this model to the translators' names to predict their genders.

Here, we use the Naive Bayes classifier (for a comprehensive review, take a look back at Module 6). The classifier assigns a label (in our case, "male" or "female") based on the last letter of the name provided in the data. Remember that we first need to clean our data to ensure that we are capturing the last letter of first names.

In [3]:
#create name and gender dataframe for single borrowers
kiva_names = df[['name', 'gender', 'borrower_count']]
kiva_names = kiva_names[['name', 'gender']][kiva_names['borrower_count'] == 1]

kiva_names.sample(15)
len(kiva_names)

| | name | gender |
|---|---|---|
| 116045 | Anonymous | Female |
| 8250 | Nyaboke | Female |
| 37091 | Nyanchama | Female |
| 10268 | Kirui | Male |
| 15450 | Elizabeth | Female |
| 86584 | Fanuel | Male |
| 28183 | Grace | Female |
| 77504 | Benjamin | Male |
| 89401 | Hannah | Female |
| 63380 | Andrea | Male |


105297

Here we see there are some instances in which the name is not an individual's first name, but rather the name of a business or a collective, or "Anonymous". Let's drop these out of our training dataset as they won't be helpful in determining the gender of a person. 

Let's also select only the first name. 

In [13]:
# rm null values, anonymous, and duplicates

kiva_names = kiva_names.loc[kiva_names['name'].isnull() == False]
kiva_names = kiva_names.drop_duplicates()
kiva_names = kiva_names[kiva_names['name'] != "Anonymous"]
kiva_names['name'] = kiva_names['name'].str.split(expand=True)[0]

len(kiva_names['name'])
kiva_names['name'].head(15)

9794

0     Evaline
1      Julias
2        Rose
3        Jane
4       Alice
5       Clare
6        Mary
7       James
8     Jacinta
9       Emily
10     Fridah
11    Charity
12      Susan
13      Joyce
14     Daniel
Name: name, dtype: object

Now let's define a function that will return the last letter of our borrowers' first names. This letter will be a **feature** we will use to attempt to predict the output feature, gender. 

In [14]:
#function that returns last letter of first name 
def gender_features(name):
    return {'last_letter': name[-1]}

Now let's prepare to train our model. We split train and test sets as usual. 

In [15]:
# Set training-test split %
split_pct = 0.80

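# Note: this boolean mask keeps every row (it only blanks out individual null cells);
# rows with a null name were already removed in the previous cell
kiva_names = kiva_names[pd.notnull(kiva_names)]

# sample(frac=1) returns all rows in random order, i.e. a shuffle
kiva_names_shuffled = kiva_names.sample(frac=1)

kiva_train_set = kiva_names_shuffled[:int((len(kiva_names_shuffled)*split_pct))] 
# Note: the +1 below skips one row, so train + test is one row short of the total
kiva_test_set = kiva_names_shuffled[int(len(kiva_names_shuffled)*split_pct+1):]  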

len(kiva_train_set.index)
len(kiva_test_set.index)

7835

1958
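
A slightly simpler way to cut shuffled data is to compute one cut index and use it for both slices, so that every row lands in exactly one of the two sets. The sketch below illustrates this on a plain list rather than on the notebook's DataFrame; the names `records` and `cut` are illustrative and do not appear elsewhere in this lab.

```python
import random

# Illustrative stand-in for the shuffled rows; this is not the Kiva data.
records = list(range(9794))
random.seed(0)
random.shuffle(records)

split_pct = 0.80
cut = int(len(records) * split_pct)  # one index shared by both slices

train_set = records[:cut]   # first 80% of the shuffled rows
test_set = records[cut:]    # everything that is left

print(len(train_set), len(test_set), len(train_set) + len(test_set))
```

The same two slices work on a DataFrame with `.iloc[:cut]` and `.iloc[cut:]`.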

Now we prepare our data by converting the name and gender columns into lists of (name, gender) pairs, so that each name stays associated with its label. We then replace each name with its feature dictionary.

In [16]:
kiva_female_train = kiva_train_set[kiva_train_set['gender'] == "Female"]
kiva_male_train = kiva_train_set[kiva_train_set['gender'] == "Male"]
kiva_female_test = kiva_test_set[kiva_test_set['gender'] == "Female"]
kiva_male_test = kiva_test_set[kiva_test_set['gender'] == "Male"]

kiva_train_feature_set = [(name, "female") for name in kiva_female_train['name']] + \
[(name, "male") for name in kiva_male_train['name']]

kiva_test_feature_set = [(name, "female") for name in kiva_female_test['name']] + \
[(name, "male") for name in kiva_male_test['name']]

In [17]:
kiva_train_feature_set = [(gender_features(n), g) for (n, g) in kiva_train_feature_set]
kiva_test_feature_set = [(gender_features(n), g) for (n, g) in kiva_test_feature_set]

In [18]:
kiva_classifier = nltk.NaiveBayesClassifier.train(kiva_train_feature_set)

In [19]:
#let's test out our new classifier! 

kiva_classifier.classify(gender_features('Cleopatra'))
kiva_classifier.classify(gender_features('Maximillian'))
kiva_classifier.classify(gender_features('James'))

'female'

'male'

'male'

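It looks like it works okay for our three samples, but let's get a better sense of overall accuracy.

We will measure this with nltk's `accuracy()` function, which returns the fraction of predictions that are correct. First, let's look at which last letters the classifier found most informative.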

In [20]:
#Find out which features were most informative in determining outcome

kiva_classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =      8.8 : 1.0
             last_letter = 'p'              male : female =      5.9 : 1.0
             last_letter = 'w'              male : female =      3.6 : 1.0
             last_letter = 'x'              male : female =      3.2 : 1.0
             last_letter = 'd'              male : female =      3.1 : 1.0


`show_most_informative_features()` prints likelihood ratios. For the first entry, 'k', we see that males are more likely than females to have this letter as their last letter by a factor of 8.8.
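
To make the idea of a likelihood ratio concrete, here is a minimal sketch on a tiny made-up list of names (not the Kiva data). It compares how often a last letter occurs among male names with how often it occurs among female names. The helper `last_letter_likelihood` is hypothetical, and NLTK additionally smooths its estimates, so its ratios will not match a raw calculation like this one exactly.

```python
# Made-up training examples, for illustration only.
labeled_names = [
    ("Erik", "male"), ("Mark", "male"), ("Patrick", "male"), ("James", "male"),
    ("Anna", "female"), ("Maria", "female"), ("Rose", "female"), ("Malak", "female"),
]

def last_letter_likelihood(letter, gender):
    """Share of names with the given gender that end in the given letter."""
    total = sum(1 for _, g in labeled_names if g == gender)
    hits = sum(1 for name, g in labeled_names
               if g == gender and name[-1].lower() == letter)
    return hits / total

# Ratio of the two likelihoods for the last letter 'k'
ratio = last_letter_likelihood("k", "male") / last_letter_likelihood("k", "female")
print("male : female = {:.1f} : 1.0".format(ratio))
```

Note that a letter that never occurs for one gender would give a zero likelihood and an undefined ratio here, which is one reason smoothing is used in practice.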

But how accurate is this? Let's run this classifier on our test dataset. 

In [21]:
#Get a sense of overall accuracy

print(nltk.classify.accuracy(kiva_classifier, kiva_test_feature_set))

0.6828396322778345


This prediction is okay, but not amazing. Remember that a random generator of genders would likely get an accuracy of about 50%, so at least we are better than random. One hypothesis for why the classifier does not do better is that this dataset mixes Kenyan and American first names. Whereas you might expect an American female name to end in -a and an American male name to end in -o (e.g. Jenna and Julio), these conventions do not necessarily hold for Kenyan names.

Since we see that the translators have primarily American names, let's try training a model using a corpus of American names.  

In [22]:
from nltk.corpus import names
nltk_labeled_names = ([(name, "male") for name in names.words("male.txt")] +
                [(name, "female") for name in names.words("female.txt")])

nltk_feature_sets = [(gender_features(n), gender)
                for (n, gender) in nltk_labeled_names]

# Divide the feature sets into training and test sets
# NOTE: the list is not shuffled, so the 500-name test set contains only male names
# (male.txt is listed first); shuffle nltk_labeled_names first for a representative split
nltk_train_set, nltk_test_set = nltk_feature_sets[500:], nltk_feature_sets[:500]

# Train the Naive Bayes classifier
nltk_classifier = nltk.NaiveBayesClassifier.train(nltk_train_set)

# Test out the classifier with a few samples outside of the training set
print(nltk_classifier.classify(gender_features("neo")))  # returns male
print(nltk_classifier.classify(gender_features("trinity")))  # returns female

# Test the accuracy of the classifier on the test data
print(nltk.classify.accuracy(nltk_classifier, nltk_test_set)) 

# Examine the classifier to see which features are most effective for
# distinguishing the name's gender (the method prints its table and returns None)
print(nltk_classifier.show_most_informative_features(5))

male
female
0.602
Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0
None


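Interestingly, our model trained on the Kiva data gets a higher accuracy score (about 0.68 versus 0.60). Keep in mind that the two scores come from different test sets, and that the NLTK test set above is the first 500 entries of an unshuffled list, which are all male names, so the comparison is only rough. Let's use the Kiva model to try to predict translators' genders.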

In [23]:
translators = pd.DataFrame()
translators['translator_first_name'] = df['translator.byline'].str.split(expand=True)[0]

# rm null values and duplicates
translators = translators.loc[translators['translator_first_name'].isnull() == False]
translators = translators.drop_duplicates()

translators.head(5)

| | translator_first_name |
|---|---|
| 0 | Julie |
| 1 | Morena |
| 8 | Lynn |
| 19 | Mohammad |
| 21 | Cheryl |


In [24]:
translators['last_letter'] = translators['translator_first_name'].apply(lambda x: gender_features(x))
translators_last = translators['last_letter']
translators_last[0:5]

0     {'last_letter': 'e'}
1     {'last_letter': 'a'}
8     {'last_letter': 'n'}
19    {'last_letter': 'd'}
21    {'last_letter': 'l'}
Name: last_letter, dtype: object

In [25]:
translators['gender'] = translators_last.apply(lambda x: kiva_classifier.classify(x))
translators.head(10)

| | translator_first_name | last_letter | gender |
|---|---|---|---|
| 0 | Julie | {'last_letter': 'e'} | female |
| 1 | Morena | {'last_letter': 'a'} | female |
| 8 | Lynn | {'last_letter': 'n'} | male |
| 19 | Mohammad | {'last_letter': 'd'} | male |
| 21 | Cheryl | {'last_letter': 'l'} | male |
| 23 | Rita | {'last_letter': 'a'} | female |
| 25 | Maureen | {'last_letter': 'n'} | male |
| 29 | Lorne | {'last_letter': 'e'} | female |
| 31 | Caty | {'last_letter': 'y'} | female |
| 34 | Trishna | {'last_letter': 'a'} | female |


Interesting - even in this small sample of 10, we see that the accuracy rate is far from perfect. Using our own understanding of what gender we would assign the names we see, this sample has an accuracy score of 60%. Not great.  

**How can we make this prediction better? Can you think of other aspects of a name that might be predictive of gender?** 
A quick test we can try is using the final two letters of a name instead of just one. Try it! 
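
As a starting point, here is one possible sketch of such a feature function. The name `gender_features_suffix` and the feature key `last_two_letters` are made up for this example, and unlike `gender_features` above, this version lowercases the name first.

```python
def gender_features_suffix(name):
    """Return the last letter and the last two letters of a name as features."""
    name = name.lower()
    return {
        'last_letter': name[-1],        # same feature as before
        'last_two_letters': name[-2:],  # new feature: two-letter suffix
    }

print(gender_features_suffix('Maureen'))
print(gender_features_suffix('Pablo'))
```

Feature sets built with this function can be passed to `nltk.NaiveBayesClassifier.train()` in the same way as the single-letter feature sets, and the resulting accuracy can then be compared on the same test set.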

We just completed our first supervised learning exercise: classification. Let's now move on to our unsupervised learning exercise: finding patterns in the loan descriptions written by translators. First we need to clean the text data: 

## Cleaning text 

Cleaning text is almost always required in text analysis. You have already gotten a taste of this in this notebook when you cleaned the variable "name" to exclude business names, and in past notebooks as well. 

Cleaning can be as extensive as you want it to be, depending on what best serves your research question. Is it best to look at full sentences, so you can retain the context of words? Is it best to look at individual words? Should you remove punctuation, HTML code, stopwords? 

Before answering this question, we have to know what's in our data. Let's turn to some exploratory analyses to determine how we should clean our data.

Note that we don't run the following snippets of code on the whole dataset as text analysis is very computationally expensive and may crash your computer. Instead, we draw samples from the dataset. 

In [26]:
# keep only the non-null English descriptions (a pandas Series)
text_raw = df['description.texts.en'][df['description.texts.en'].isnull() == False]

# take a random sample of 100 entries and convert it to a list of strings
text_raw_abridged = text_raw.sample(100)
text = list(map(str, text_raw_abridged))

print(text[0:3]) # Each loan description is one item in the list

['Muche is a 51-year-old married woman. She has six children with ages ranging from 9 to 22 years. She describes herself to be an honest woman. She operates a food stall where she sells coconuts. She has been involved in this business for 10 years. Her business is located within a rural area and her primary customers are local residents. \r\r\n\r\r\nMuche describes her biggest business challenge to be shortage of coconuts to sell. She will use the Kes 15,000 loan to buy coconuts to sell. Her business goal is to expand her business within five years. She hopes that in the future, she will be a supplier of coconuts. This is her second loan with SMEP DTM after taking a loan of Kes 10,000 which she managed to repay successfully.', "John is still married to Ann and they now have three children.  He is still running the Agrovet shop and it has been getting him income which is providing for his family's needs.  \r\r\n\r\r\nJohn is applying for his second loan term with KADET after repaying th

In [27]:
# Remove line breaks (\r, \n) and HTML <br /> tags
text = [w.replace('\r', '') for w in text]
text = [w.replace('\n', '') for w in text]
text = [w.replace('<br />', '') for w in text]

# Lowercase
text = [w.lower() for w in text]

print(text[0:3])

['muche is a 51-year-old married woman. she has six children with ages ranging from 9 to 22 years. she describes herself to be an honest woman. she operates a food stall where she sells coconuts. she has been involved in this business for 10 years. her business is located within a rural area and her primary customers are local residents. muche describes her biggest business challenge to be shortage of coconuts to sell. she will use the kes 15,000 loan to buy coconuts to sell. her business goal is to expand her business within five years. she hopes that in the future, she will be a supplier of coconuts. this is her second loan with smep dtm after taking a loan of kes 10,000 which she managed to repay successfully.', "john is still married to ann and they now have three children.  he is still running the agrovet shop and it has been getting him income which is providing for his family's needs.  john is applying for his second loan term with kadet after repaying the previous loan successf
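
Replacing line breaks with an empty string can glue together two words that were separated only by a line break. As an alternative, the sketch below replaces each break or `<br />` tag with a space and then collapses repeated whitespace. The function name `clean_description` and the sample string are made up for illustration.

```python
import re

def clean_description(raw):
    """Lowercase a description and turn line breaks and <br /> tags into single spaces."""
    # Replace <br>, <br/> or <br /> tags and runs of \r or \n with a space
    no_breaks = re.sub(r'<br\s*/?>|[\r\n]+', ' ', raw)
    # Collapse any run of whitespace into one space and trim the ends
    collapsed = re.sub(r'\s+', ' ', no_breaks).strip()
    return collapsed.lower()

sample = "for sale.\r\r\nIn future he plans to build.<br />Thank you"
print(clean_description(sample))
```

Applied to the notebook's data, this would be `text = [clean_description(w) for w in text]`.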

In [28]:
from nltk.tokenize import word_tokenize
from nltk.text import Text  

tokens = list(map(word_tokenize, text))
kiva_text = nltk.Text(tokens)
kiva_text[0:2]

[['muche',
  'is',
  'a',
  '51-year-old',
  'married',
  'woman',
  '.',
  'she',
  'has',
  'six',
  'children',
  'with',
  'ages',
  'ranging',
  'from',
  '9',
  'to',
  '22',
  'years',
  '.',
  'she',
  'describes',
  'herself',
  'to',
  'be',
  'an',
  'honest',
  'woman',
  '.',
  'she',
  'operates',
  'a',
  'food',
  'stall',
  'where',
  'she',
  'sells',
  'coconuts',
  '.',
  'she',
  'has',
  'been',
  'involved',
  'in',
  'this',
  'business',
  'for',
  '10',
  'years',
  '.',
  'her',
  'business',
  'is',
  'located',
  'within',
  'a',
  'rural',
  'area',
  'and',
  'her',
  'primary',
  'customers',
  'are',
  'local',
  'residents',
  '.',
  'muche',
  'describes',
  'her',
  'biggest',
  'business',
  'challenge',
  'to',
  'be',
  'shortage',
  'of',
  'coconuts',
  'to',
  'sell',
  '.',
  'she',
  'will',
  'use',
  'the',
  'kes',
  '15,000',
  'loan',
  'to',
  'buy',
  'coconuts',
  'to',
  'sell',
  '.',
  'her',
  'business',
  'goal',
  'is',
  'to',

## Preliminary investigations / visualizations 

- Frequency
- Concordance

In [29]:
text_corpus = list() 

for x in range(0, len(kiva_text)): 
    text_corpus.extend(kiva_text[x])

text_corpus = nltk.Text(text_corpus)

kiva_fdist = nltk.FreqDist(text_corpus)

# Word frequencies give a first idea of what the Kenyan loan descriptions contain,
# and of what borrowers and translators choose to include in a request

In [30]:
#kiva_fdist.plot()
#kiva_fdist.plot(50, cumulative=True)
kiva_fdist.most_common(50)

[('.', 708),
 ('to', 509),
 (',', 456),
 ('and', 415),
 ('a', 400),
 ('the', 380),
 ('she', 345),
 ('her', 341),
 ('is', 334),
 ('of', 263),
 ('in', 226),
 ('business', 196),
 ('for', 180),
 ('has', 162),
 ('loan', 153),
 ('will', 133),
 ('he', 123),
 ('years', 122),
 ('with', 109),
 ('this', 104),
 ('his', 104),
 ('children', 99),
 ('from', 80),
 ('be', 75),
 ('been', 75),
 ('as', 70),
 ('that', 64),
 ('use', 61),
 ('buy', 61),
 ('married', 59),
 ('are', 58),
 ('income', 58),
 ('farming', 55),
 ('have', 52),
 ('one', 51),
 ('more', 50),
 ('also', 49),
 ('kes', 48),
 ('which', 43),
 ('family', 42),
 ('by', 39),
 ('old', 39),
 ('kenya', 39),
 ('purchase', 37),
 ('area', 33),
 ('future', 33),
 ('group', 33),
 ('hopes', 32),
 ('who', 31),
 ("'s", 30)]

In [201]:
text_corpus.concordance('future')

Displaying 25 of 39 matches:
o open up a supermarket in the near future . sharon is a kiva borrower who des
 earn an income . she hopes that in future she will be able to educate and fee
other items . she hopes that in the future she will be able to feed and educat
ty products . she hopes that in the future , she will be prosperous . this is 
 five years . she hopes that in the future , she will be successful in busines
thin 5 years . he hopes that in the future , he will be able to economically s
eams of establishing a hotel in the future . with the kes 20,000 she wants to 
ng materials . he hopes that in the future , he will have a better life . this
 in his electronic shop for sale.in future he plans to build his own building 
s children to school , save for the future , buy a cow , and invest in a busin
, potatoes and other stock . in the future , philis aspires to become a local 
s to expand his curio shop . in the future , he wants to be able to invest mor
 living conditions and 

In [144]:
text_corpus.concordance('seasonality')

Displaying 4 of 4 matches:
m the retail shop . she mentioned seasonality and debts as a major challenge in
 . she faces a major challenge of seasonality in her business . in addition , s
ors and passersby . she mentioned seasonality and price fluctuation as her majo
 . she faces a major challenge of seasonality and perishability of her stock le


In [148]:
text_corpus.concordance('working')

Displaying 9 of 9 matches:
o serve her clients with due to low working capital . with a kiva loan , faith 
st business challenge is inadequate working capital . she will use the kes 30,0
business challenge to be inadequate working capital . he will use the kes 70,00
business challenge to be inadequate working capital . he will use the kes 25,00
st in a business . moses notes that working with one acre fund has led to many 
business challenge to be inadequate working capital . he will use the kes 20,00
rmed for the past 10 years , she is working hard and learning good farming prac
mptly . loise is a 34 year old hard working business lady . she is married and 
utions.esther’s favorite part about working with komaza is planting trees on he


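`similar()` lists other words that appear in the same contexts as the given word, where a context is the pair of words immediately before and after it. The most similar words are listed first.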
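
The idea behind `similar()` can be sketched in plain Python. This is a simplified illustration on a made-up token list, not NLTK's implementation, which may differ in details such as lowercasing and filtering. The function name `similar_words` is hypothetical.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words by how many (previous word, next word) contexts they share with target."""
    contexts = defaultdict(set)
    # Record the surrounding pair of words for every token that has both neighbours
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    target_contexts = contexts.get(target, set())
    # Count shared contexts for every other word that shares at least one
    shared = {word: len(ctx & target_contexts)
              for word, ctx in contexts.items()
              if word != target and ctx & target_contexts}
    # Most shared contexts first, ties broken alphabetically
    return sorted(shared, key=lambda word: (-shared[word], word))

tokens = "she sells coconuts . she sells bananas . he buys coconuts . he buys maize .".split()
print(similar_words(tokens, "coconuts"))
```

Here "bananas" and "maize" are returned because each appears between the same neighbouring words as one occurrence of "coconuts".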

In [157]:
# prelim investigation: getting other words that appear in similar range of contexts 

text_corpus.similar("future")

loan sale business market farm area children farmer profits price
education facilitator year family planting distribution members plant
shop years


In [161]:
text_corpus.similar("children")

years business income clients hopes which loan family kadet neighbors
husband is farm and profits enterprise region attention home kiva


### Remove stop words

**DEVNOTE:** still in progress

In [31]:
#remove stop words

print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))

Number of stop words: 318
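
To see what removing stop words does to a frequency count, here is a small self-contained sketch. The stop-word set and the token list are made up for illustration; in this notebook, `ENGLISH_STOP_WORDS` and the tokens in `text_corpus` could be used in their place.

```python
from collections import Counter
import string

# A tiny illustrative stop-word set, much smaller than a real one.
stop_words = {"a", "is", "she", "to", "the", "her", "will", "and"}
punctuation = set(string.punctuation)

tokens = ["she", "will", "use", "the", "loan", "to", "buy", "coconuts",
          "to", "sell", ".", "her", "business", "is", "a", "food", "stall", ",",
          "and", "the", "loan", "will", "help", "her", "business", "."]

# Keep only tokens that are neither stop words nor single punctuation marks
content_tokens = [t for t in tokens
                  if t not in stop_words and t not in punctuation]

print(Counter(content_tokens).most_common(3))
```

With the filler words gone, the most frequent tokens are content words such as "loan" and "business" rather than "to" or ".".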


In [32]:
# Fit on the cleaned description strings; CountVectorizer tokenizes each document itself
test = CountVectorizer(min_df=5, stop_words="english").fit(text)
test

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=5,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

### Stem words 

**DEVNOTE:** still in progress

In [50]:
# Clean data - stem
# The Porter stemmer is one of several stemmers available in nltk
stemmer = nltk.stem.PorterStemmer()
# Stem each token of the second cleaned description
print([stemmer.stem(token) for token in word_tokenize(text[1])])

## Algorithms: Latent Dirichlet Allocation

**DEVNOTE:** still in progress

Now that the data is clean, let's turn to our unsupervised model: topic modeling.

The simplest representation of text is a bag of words, which records how often each word occurs in a document. The downside is a big loss of context: because word order is not preserved, "man eats bread" is represented the same way as "bread eats man". 
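
The loss of word order in a bag of words is easy to check directly. The following sketch uses a hypothetical helper, `bag_of_words`, built on Python's `Counter`:

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each word in a sentence, ignoring word order."""
    return Counter(sentence.lower().split())

first = bag_of_words("man eats bread")
second = bag_of_words("bread eats man")

# The two sentences mean different things but have identical word counts
print(first == second)
```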

https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

gensim is one option for topic modeling because it allows the model to be run on data that might exceed your machine's RAM. The cell below uses scikit-learn's `LatentDirichletAllocation` instead.

In [202]:
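from sklearn.decomposition import LatentDirichletAllocation

# Build a document-term matrix from the cleaned description strings
vect = CountVectorizer(min_df=5, stop_words="english")
kiva_corpus = vect.fit_transform(text)

# n_components replaces the older n_topics argument
lda = LatentDirichletAllocation(n_components=5, learning_method="batch",
                                max_iter=25, random_state=0)
document_topics = lda.fit_transform(kiva_corpus)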

In [None]:
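# for each topic (a row in components_), sort features by weight
# reverse each row with [:, ::-1] to make the sort descending
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
# get feature names from the fitted CountVectorizer, assumed here to be named `vect`
# (scikit-learn versions before 1.0 use get_feature_names() instead)
feature_names = np.array(vect.get_feature_names_out())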

In [None]:
# print the 5 topics fitted above, 10 words each (requires the mglearn package)
import mglearn
mglearn.tools.print_topics(topics=range(5), feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)

**DEVNOTES** 
Ideas for research: 
- Try to cluster descriptions based on who the translator is?
- Try to parse out all adjectives and see what that looks like per translator?

In [None]:
# Other clustering algos using nltk

help(nltk.cluster)