## Lab 6: Text Analysis and Natural Language Processing 

In this lab, we explore the text data provided by Kiva's API. Our primary source of textual data is the descriptive text that borrowers submit with a loan request, which is posted publicly on the Kiva website. Kiva is unusual in that borrowers often do not write these descriptions themselves; instead, they fill out a questionnaire that goes to Kiva's team of volunteer translators. We use this body of text (also called a *"corpus"*) to look for patterns in how descriptions are written.

As always, we first import our packages and read in our data below. 

In [30]:
import pandas as pd
import numpy as np

# NLP-specific packages: 
import nltk
from nltk.corpus import names
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.text import Text  
from nltk.stem import PorterStemmer


# output of multiple commands in a cell will be output at once.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

In [31]:
# datapath = '~/intro_to_machine_learning/data'
datapath = '~/Desktop'
df = pd.read_csv(datapath+'/df.csv', low_memory=False)

## Exploratory Analysis and Feature Engineering

We have very limited information about translators. In fact, the only variable in our dataset relevant to translators is their name! What information can we extract from this field? 

In text analysis, a common yet simple task is categorizing names by gender. We know, just from our daily knowledge of English names, that names that end in -a are likely to be female, and names that end in -o are likely to be male (for example, Jenna and Pablo). Since we have both the gender data and the name data for the borrowers, let's use the borrowers' data to train a classifier model that can predict gender from a name! Then, we will apply this model to the translators' names to predict their genders. 

Here, we use the Naive Bayes Classifier (for a comprehensive review, take a look back at Module 6.) This algorithm assigns a label (in our case, "male" or "female") using the last letter of the name provided in the data. Remember that we first need to clean our data to ensure that we are capturing the last letter of first names. 
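To make the idea concrete, here is a minimal sketch of a one-feature Naive Bayes classifier in plain Python. The helper names (`train_last_letter_nb`, `classify_last_letter`), the toy names, and the add-one smoothing are assumptions made for illustration; NLTK's own implementation may smooth its estimates differently.

```python
from collections import Counter, defaultdict

def train_last_letter_nb(labeled_names):
    """Count labels and (label, last letter) pairs from (name, label) tuples."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    for name, label in labeled_names:
        label_counts[label] += 1
        letter_counts[label][name[-1].lower()] += 1
    return label_counts, letter_counts

def classify_last_letter(name, label_counts, letter_counts, alphabet_size=26):
    """Return the label that maximizes P(label) * P(last letter | label)."""
    total = sum(label_counts.values())
    letter = name[-1].lower()
    scores = {}
    for label, n in label_counts.items():
        prior = n / total
        # Add-one smoothing (an assumption) so an unseen letter never scores zero
        likelihood = (letter_counts[label][letter] + 1) / (n + alphabet_size)
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)

# Toy training data, invented for illustration
toy = [("Jenna", "female"), ("Maria", "female"), ("Lucy", "female"),
       ("Pablo", "male"), ("Julio", "male"), ("Patrick", "male")]
label_counts, letter_counts = train_last_letter_nb(toy)
print(classify_last_letter("Anna", label_counts, letter_counts))
print(classify_last_letter("Mario", label_counts, letter_counts))
```

With only one feature, the classifier simply compares how often each last letter appears under each label, weighted by how common each label is overall.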

In [32]:
#create name and gender dataframe for single borrowers
kiva_names = df[['name', 'gender', 'borrower_count']]
kiva_names = kiva_names[['name', 'gender']][kiva_names['borrower_count'] == 1]

kiva_names.head(5)
len(kiva_names)

|   | name | gender |
|---|---|---|
| 0 | Naomi | Female |
| 1 | Florence | Female |
| 2 | Lucy | Female |
| 3 | Kadzo | Female |
| 4 | Maureen | Female |


100961

We know from looking through the data that there are some instances in which the name is not an individual's first name, but rather the name of a business or a collective, or "Anonymous". Let's drop these out of our training dataset as they won't be helpful in determining the gender of a person. Let's also select only the first name. 

In [33]:
# rm null values, anonymous, and duplicates

kiva_names = kiva_names.loc[kiva_names['name'].isnull() == False]
kiva_names = kiva_names.drop_duplicates()
kiva_names = kiva_names[kiva_names['name'] != "Anonymous"]
kiva_names['name'] = kiva_names['name'].str.split(expand=True)[0]

len(kiva_names['name'])
kiva_names['name'].head(15)

12761

0         Naomi
1      Florence
2          Lucy
3         Kadzo
4       Maureen
5         Grace
6        Martha
7        Zawadi
8         Susan
9     Christine
10        Lydia
11         Jane
12         Mary
13      Richard
14    Valentine
Name: name, dtype: object

Now let's define a function that will return the last letter of our borrowers' first names. This letter will be the **feature** we use to try to predict the label, gender. 

In [34]:
#function that returns last letter of first name 
def gender_features(name):
    return {'last_letter': name[-1]}

Now let's prepare to train our model. We split train and test sets as usual. 

In [35]:
# Set training-test split %
split_pct = 0.80

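# Mask null and NaN values (note: this keeps every row; use dropna() to actually drop rows)
kiva_names = kiva_names[pd.notnull(kiva_names)]

# sample(frac=1) returns every row in random order; pass random_state to make the shuffle repeatable
kiva_names_shuffled = kiva_names.sample(frac=1)

kiva_train_set = kiva_names_shuffled[:int((len(kiva_names_shuffled)*split_pct))] 
# note: the +1 below leaves the single row at the split index out of both sets
kiva_test_set = kiva_names_shuffled[int(len(kiva_names_shuffled)*split_pct+1):]  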

len(kiva_train_set.index)
len(kiva_test_set.index)

10208

2552
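For comparison, a minimal sketch of a repeatable split that assigns every row to exactly one set. It uses a plain list and a hypothetical helper, `split_train_test`, rather than the lab's dataframe; the seed value is arbitrary.

```python
import random

def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut it once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    # Slicing at the same index on both sides puts every row in exactly one set
    return shuffled[:cut], shuffled[cut:]

rows = list(range(11))
train, test = split_train_test(rows)
print(len(train), len(test))
```

Fixing the seed means the same rows land in the training set on every run, which makes accuracy figures comparable between runs.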

Now we prepare our data by converting the name and gender columns into lists of (name, label) pairs, so each name stays associated with its gender. 

In [36]:
kiva_female_train = kiva_train_set[kiva_train_set['gender'] == "Female"]
kiva_male_train = kiva_train_set[kiva_train_set['gender'] == "Male"]
kiva_female_test = kiva_test_set[kiva_test_set['gender'] == "Female"]
kiva_male_test = kiva_test_set[kiva_test_set['gender'] == "Male"]

kiva_train_feature_set = [(name, "female") for name in kiva_female_train['name']] + \
[(name, "male") for name in kiva_male_train['name']]

kiva_test_feature_set = [(name, "female") for name in kiva_female_test['name']] + \
[(name, "male") for name in kiva_male_test['name']]

In [37]:
kiva_train_feature_set = [(gender_features(n), g) for (n, g) in kiva_train_feature_set]
kiva_test_feature_set = [(gender_features(n), g) for (n, g) in kiva_test_feature_set]

In [38]:
kiva_classifier = nltk.NaiveBayesClassifier.train(kiva_train_feature_set)

In [39]:
#let's test out our new classifier! 

kiva_classifier.classify(gender_features('Cleopatra'))
kiva_classifier.classify(gender_features('Maximillian'))
kiva_classifier.classify(gender_features('James'))

'female'

'male'

'male'

It looks like it works okay for our three samples, but let's get a better sense of overall accuracy.

NLTK's `accuracy()` function returns the fraction of predictions that are correct. Before we compute it, let's look at which features the classifier found most informative.

In [40]:
#Find out which features were most informative in determining outcome

kiva_classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     18.9 : 1.0
             last_letter = 'p'              male : female =     12.1 : 1.0
             last_letter = 'f'              male : female =     10.3 : 1.0
             last_letter = 'w'              male : female =      6.1 : 1.0
             last_letter = 'd'              male : female =      5.3 : 1.0


`show_most_informative_features` returns likelihood ratios. For the first entry, "k", we see that this last letter is about 18.9 times more likely among male names than among female names in our training data.
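A rough sketch of how such a ratio can be computed by hand. The function name `likelihood_ratio`, the toy name lists, and the smoothing constant of 0.5 are assumptions for illustration, so the numbers will not match NLTK's output above.

```python
from collections import Counter

def likelihood_ratio(letter, male_names, female_names, smoothing=0.5):
    """Ratio P(last letter | male) / P(last letter | female), with additive smoothing."""
    male_last = Counter(n[-1].lower() for n in male_names)
    female_last = Counter(n[-1].lower() for n in female_names)
    # 26 possible letters, each given a small pseudo-count
    p_male = (male_last[letter] + smoothing) / (len(male_names) + 26 * smoothing)
    p_female = (female_last[letter] + smoothing) / (len(female_names) + 26 * smoothing)
    return p_male / p_female

# Toy data, invented for illustration
males = ["Patrick", "Frederick", "Mark", "John"]
females = ["Jenna", "Maria", "Lucy", "Grace"]
print(likelihood_ratio("k", males, females))
print(likelihood_ratio("a", males, females))
```

A ratio above 1 means the letter points toward "male"; a ratio below 1 means it points toward "female".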

But how accurate is this? Let's run this classifier on our test dataset. 

In [41]:
#Get a sense of overall accuracy

print(nltk.classify.accuracy(kiva_classifier, kiva_test_feature_set))

0.7021943573667712


This prediction is okay, but not amazing. Remember that guessing genders at random would get an accuracy of about 50%, so at least we are better than random. (A stricter baseline is to always predict the most common gender in the training data, which can score well above 50% when the classes are unbalanced.) One hypothesis for why we do not classify genders better is that this dataset mixes Kenyan and American first names. Whereas you might expect an American female name to end in -a and an American male name to end in -o (e.g. Jenna and Julio), these conventions do not necessarily hold for Kenyan names. 
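To put an accuracy figure in context, it helps to compare it against a simple baseline. Below is a minimal sketch, on made-up labels, of accuracy and of a "always predict the majority class" baseline; the helper names `accuracy` and `majority_baseline` are hypothetical.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy([majority] * len(test_labels), test_labels)

# Made-up labels, for illustration only
train_labels = ["female"] * 7 + ["male"] * 3
test_labels = ["female", "female", "male", "female", "male"]
print(majority_baseline(train_labels, test_labels))
```

If a classifier cannot beat this baseline, its features are not adding information beyond the class balance.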

Let's use our model to try to predict our translators' genders from their names. 

In [42]:
translators = pd.DataFrame()
translators['translator_first_name'] = df['translator.byline'].str.split(expand=True)[0]

# rm null values and duplicates
translators = translators.loc[translators['translator_first_name'].isnull() == False]
translators = translators.drop_duplicates()

translators.head(5)

|   | translator_first_name |
|---|---|
| 0 | Michelle |
| 8 | Tim |
| 9 | Madhurima |
| 20 | Teresa |
| 23 | John |


In [43]:
translators['last_letter'] = translators['translator_first_name'].apply(lambda x: gender_features(x))
translators_last = translators['last_letter']
translators_last[0:5]

0     {'last_letter': 'e'}
8     {'last_letter': 'm'}
9     {'last_letter': 'a'}
20    {'last_letter': 'a'}
23    {'last_letter': 'n'}
Name: last_letter, dtype: object

In [44]:
translators['gender'] = translators_last.apply(lambda x: kiva_classifier.classify(x))
translators.head(10)

|   | translator_first_name | last_letter | gender |
|---|---|---|---|
| 0 | Michelle | {'last_letter': 'e'} | female |
| 8 | Tim | {'last_letter': 'm'} | male |
| 9 | Madhurima | {'last_letter': 'a'} | female |
| 20 | Teresa | {'last_letter': 'a'} | female |
| 23 | John | {'last_letter': 'n'} | male |
| 24 | Sheilah | {'last_letter': 'h'} | female |
| 26 | Patrick | {'last_letter': 'k'} | male |
| 27 | Lynn | {'last_letter': 'n'} | male |
| 29 | Frederick | {'last_letter': 'k'} | male |
| 30 | Mike | {'last_letter': 'e'} | female |


Interesting - even in this small sample of 10, we see that the accuracy rate is far from perfect. Using our own understanding of what gender we would assign the names we see, at least two predictions look wrong (Lynn and Mike), which puts this sample at roughly 80% accuracy. Not great.  

**How can we make this prediction better? Can you think of other aspects of a name that might be predictive of gender?** 
A quick test we can try is using the final two letters of a name instead of just one. Try it! 
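As a starting point, here is one possible feature function for that experiment. The name `gender_features_suffix` is hypothetical, and lowercasing the name first is an added assumption (the original `gender_features` does not do this).

```python
def gender_features_suffix(name):
    """Return the last letter and the last two letters of a name as features."""
    # Lowercasing is an assumption; it treats 'A' and 'a' as the same letter
    name = name.strip().lower()
    return {'last_letter': name[-1:], 'last_two': name[-2:]}

print(gender_features_suffix('Lynn'))
print(gender_features_suffix('Mike'))
```

To use it, swap it in wherever `gender_features` is called, retrain the classifier, and compare accuracy on the test set.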

We just completed our first supervised learning exercise: classification. Let's move on to our next question, finding patterns in the loan descriptions written by translators, which is our unsupervised learning exercise. First we need to clean the text data: 

## Cleaning text 

Cleaning text is almost always required in text analysis. You have already gotten a taste of this in this notebook when you cleaned the variable "name" to exclude business names, and in past notebooks as well. 

Cleaning can be as extensive as you want it to be, depending on what serves your research question the best. Is it best to look at full sentences, so you can retain the context of words? Is it best to look at individual words? Should you remove punctuation, HTML code, stopwords? 

Before answering this question, we have to know what's in our data. Let's turn to some exploratory analyses to determine how we should clean our data.

Note that we don't run the following snippets of code on the whole dataset as text analysis is very computationally expensive and may crash your computer. Instead, we draw a sample of 1000 descriptions from the dataset. *This means that your results will look slightly different, but that's okay -- make sure to post on Slack anything you find interesting!*  

In [45]:
def text_to_list(df, text_field, sample_num = 1000):
    """Convert a text field in the dataframe to a (sampled) list of strings."""
    # read all non-null text into a single Series
    text_raw = df[text_field][df[text_field].isnull() == False]

    # take sample of n (default 1000) entries, read into list
    # pass random_state to sample() to draw the same sample every time
    text_raw_abridged = text_raw.sample(sample_num) 
    text = list(map(str, text_raw_abridged))
    return text

text = text_to_list(df, 'description.texts.en')

print(text[0:3]) # Each description is an item in the list

['Luvuno is from Samburu and has a fruits and vegetables business. She sells from her home and also around Samburu.\r\r\r\r\r\n\r\r\r\r\r\nLuvuno says that she has been able to help her family through her business and that she has given them all that they need to continue having a dignified life, with no deprivations. \r\r\r\r\r\n\r\r\r\r\r\nLuvuno wants to be able to continue working and helping her family, so she is asking for this loan to invest in fruits and vegetables, to continue working and offering good products to her customers.\r\r\r\r\r\n', 'Paul is a nurse/midwife managing his own private facility in Vihiga in the western part of Kenya.  He is 40 years old and lives with his wife and two children in Kakamega. His wife who is also a trained nurse and works at Kakamega Provincial General Hospital. <p>Paul trained as a nurse/midwife in 1990 and has since operated his own private clinic at Stend Kisa in Vihiga. He offers a wide range of clinical services including general curat

We see there are HTML tags and line-break characters (`\r`, `\n`) cluttering up the text. Below, we remove these along with some punctuation, and convert all capital letters to lowercase.

In [46]:
def clean_text(text):
    """Remove line breaks, HTML tags, and punctuation; convert to lowercase."""
    # Line breaks, HTML tags, periods, and commas are replaced with a space
    to_space = ['\r', '\n', '<br />', '<p>', '</p>', '<i>', '</i>', '.', ',']
    # Question marks and semicolons are dropped outright
    to_drop = ['?', ';']

    cleaned = []
    for t in text:
        for s in to_space:
            t = t.replace(s, ' ')
        for s in to_drop:
            t = t.replace(s, '')
        # Remove extra spaces, then lowercase
        cleaned.append(" ".join(t.split()).lower())

    return cleaned

text = clean_text(text)

print(text[0:3])

['luvuno is from samburu and has a fruits and vegetables business she sells from her home and also around samburu luvuno says that she has been able to help her family through her business and that she has given them all that they need to continue having a dignified life with no deprivations luvuno wants to be able to continue working and helping her family so she is asking for this loan to invest in fruits and vegetables to continue working and offering good products to her customers', 'paul is a nurse/midwife managing his own private facility in vihiga in the western part of kenya he is 40 years old and lives with his wife and two children in kakamega his wife who is also a trained nurse and works at kakamega provincial general hospital paul trained as a nurse/midwife in 1990 and has since operated his own private clinic at stend kisa in vihiga he offers a wide range of clinical services including general curative family planning maternal and child health services and also runs a pha
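Listing every tag by hand misses any tag we did not anticipate. As an alternative, here is a small sketch that strips anything shaped like an HTML tag with a regular expression. The function name `clean_text_regex` is hypothetical, and unlike `clean_text` it replaces `?` and `;` with a space instead of deleting them.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')    # anything shaped like a tag: <p>, </i>, <br />
PUNCT_RE = re.compile(r'[.,?;]')   # the punctuation marks handled in this lab

def clean_text_regex(texts):
    """Strip tags and punctuation, collapse whitespace, and lowercase."""
    cleaned = []
    for t in texts:
        t = TAG_RE.sub(' ', t)
        t = PUNCT_RE.sub(' ', t)
        # split() with no arguments also swallows \r and \n
        cleaned.append(' '.join(t.split()).lower())
    return cleaned

print(clean_text_regex(['Paul is a nurse.<p>He lives in Kakamega,\r\nKenya.</p>']))
```

A regular expression is still a blunt tool for HTML; for heavily marked-up text, a real HTML parser is the safer choice.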

In [47]:
# Use the same text cleaning functions for the use text field
use_text = text_to_list(df, 'use')
use_text = clean_text(use_text)

print(use_text[0:3])

['to buy mobile phone accessories', 'to buy a dairy cow', 'to buy cattle feed and a goat']


Great! The text looks clean. We also notice that this dataset is a list where every item in the list is a description. Now we tokenize each item in the list so that each word is separated out. This yields a list of lists. 

In [48]:
tokens = list(map(word_tokenize, text))
kiva_text = nltk.Text(tokens)
kiva_text[0:2]

[['luvuno',
  'is',
  'from',
  'samburu',
  'and',
  'has',
  'a',
  'fruits',
  'and',
  'vegetables',
  'business',
  'she',
  'sells',
  'from',
  'her',
  'home',
  'and',
  'also',
  'around',
  'samburu',
  'luvuno',
  'says',
  'that',
  'she',
  'has',
  'been',
  'able',
  'to',
  'help',
  'her',
  'family',
  'through',
  'her',
  'business',
  'and',
  'that',
  'she',
  'has',
  'given',
  'them',
  'all',
  'that',
  'they',
  'need',
  'to',
  'continue',
  'having',
  'a',
  'dignified',
  'life',
  'with',
  'no',
  'deprivations',
  'luvuno',
  'wants',
  'to',
  'be',
  'able',
  'to',
  'continue',
  'working',
  'and',
  'helping',
  'her',
  'family',
  'so',
  'she',
  'is',
  'asking',
  'for',
  'this',
  'loan',
  'to',
  'invest',
  'in',
  'fruits',
  'and',
  'vegetables',
  'to',
  'continue',
  'working',
  'and',
  'offering',
  'good',
  'products',
  'to',
  'her',
  'customers'],
 ['paul',
  'is',
  'a',
  'nurse/midwife',
  'managing',
  'his',
  'o

## N-grams and word prediction

The task of predicting the next word in a sentence might seem irrelevant if one thinks of natural language processing (NLP) only in terms of processing text for semantic understanding. However, NLP also involves processing noisy data and checking text for errors. For example, noisy data can be produced in speech or handwriting recognition, as the computer may not properly recognize words due to unclear speech or handwriting that differs significantly from the computer’s model. Additionally, NLP could be extended to such functions as spell checking in order to catch errors in which no word is misspelled but the user has accidentally typed a word that she or he did not intend. In the sentence “I picked up the phone to answer her fall,” for instance, fall may have been the intended word, but it is more likely that call was simply mistyped. A spell checker cannot catch this error because both fall and call are English words. An NLP algorithm that could catch this error would thus need to look beyond what letters form words and instead attempt to determine what word is most probable in a given sentence.

### N-Gram Models
One of the oldest methods used in trying to compute the probability that a given word is the next word in a sentence is employing n-gram models. N-gram models are attempts to guess the next word in a sentence based upon the (n - 1) previous words in the sentence. These models base their guesses on the probability of a given word without any context (i.e., the is a more common word than green and is thus more probable than green if context is ignored) and the probability of a word given the last (n – 1) words. For example, take the sentence beginning “The four leaf clover was the color...”. Using a bigram model, one would compute P(green | color) and P(the | color) to determine the more probable guess between these two words. Based on this example, one might imagine that the model’s guess would be even more accurate if we computed P(green | The four leaf clover was the color), making a 7-gram model. However, such a model would take enormous computing power and a much greater amount of time than the bigram model to compute. Since good estimates can be made based on smaller models, it is more practical to use bi- or trigram models. This idea that a future event (in this case, the next word) can be predicted using a relatively short history (for the example, one or two words) is called a Markov assumption.
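The bigram estimate described above is just a ratio of counts: how often the pair occurs, divided by how often the first word occurs with anything after it. A minimal sketch on a made-up sentence (the function name `bigram_probability` is hypothetical):

```python
from collections import Counter

def bigram_probability(tokens, previous, word):
    """Maximum-likelihood estimate of P(word | previous) from a list of tokens."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count only positions that have a following word
    previous_count = Counter(tokens[:-1])[previous]
    if previous_count == 0:
        return 0.0
    return pair_counts[(previous, word)] / previous_count

tokens = "the color green is the color of the clover".split()
print(bigram_probability(tokens, "color", "green"))
print(bigram_probability(tokens, "the", "color"))
```

On a corpus this small the estimates are crude, which is why n-gram models are trained on large corpora.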

In order to make these predictions about the next word in a sentence, the NLP application must have access to probabilities about how often specific words occur in general and how often specific words occur after particular words. To program a computer with these probabilities by hand would be extremely tedious and raises the question of how those probabilities would be reached in the first place. A simple way to find the probabilities might involve counting the number of occurrences of words in samples of text; however, a human would probably introduce errors into this process and be unable to sort through millions of words quickly. Thus, n-gram models are usually trained on corpora, huge text files that can be processed to determine statistical properties about the words and sentences within.

Source: https://cs.stanford.edu/people/eroberts/courses/soco/projects/2004-05/nlp/techniques_word.html

### Implementation
Here we create and implement a trigram model. 

First, we convert the text list to a dictionary mapping the first two words to a third word that follows. We save this dictionary as the object `chains`.

Next, we use these chains to make a text string. Using a random starting bigram, we add a word by selecting one word among the possible words that can follow. 

Then we repeat this word selection process for the next pair of words. 

In [49]:
from random import choice

def make_chains(text_list):
    """Takes input text as a list of strings; returns a _dictionary_ of Markov chains.

    Each key is a tuple (word1, word2) and each value is a list of the
    word(s) that follow those two words in the input text. Words from all
    strings are joined into one sequence, so a chain can span two descriptions.

    For example:

        >>> make_chains(["hi there mary hi there juanita"])
        {('hi', 'there'): ['mary', 'juanita'], ('there', 'mary'): ['hi'], ('mary', 'hi'): ['there']}
    """
    
    word_list = []
    for sentences in text_list:
        words = sentences.split(' ')
        for word in words:
            word_list.append(word)

    ## Let's make trigrams
    chains = {}
    for i in range((len(word_list)-2)):
        key = (word_list[i], word_list[i+1])
        word_three = word_list[i+2]  

        if key not in chains:        
            chains[key] = [word_three]            

        else:       
            chains[key].append(word_three)

    return chains

def make_text(chains, text_len, silent_start = False):
    """Takes dictionary of markov chains; returns random text."""

    fake_text = []
    # Start from a randomly chosen bigram
    starting_tuple = choice(list(chains.keys()))
    fake_text.append(starting_tuple)
    
    if not silent_start:
        print('Starting bigram: ' + ' '.join(starting_tuple))

    # Each step tries to append one more bigram
    for i in range(text_len):
        add_text(fake_text, chains)
    
    # Flatten the list of bigrams into a single string
    result_text = ' '.join([i for tup in fake_text for i in tup])
        
    return result_text

def add_text(fake_text, chains):
    """Helper function to append a bigram that starts with a word following the last bigram."""
    starting_tuple = fake_text[-1]
    for key in chains.keys():
        # Draw a word that followed the last bigram somewhere in the corpus
        random_word = choice(list(chains[starting_tuple]))
        
        # Append the first bigram found that begins with the drawn word
        if random_word == key[0]:
            fake_text.append(key)
            break
            
    return fake_text
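For comparison, here is a sketch of a more conventional trigram generator that adds one word at a time and slides the two-word window forward after each pick. The names `build_trigram_chains` and `generate` are hypothetical, and unlike `make_chains` this version does not let a chain run across two descriptions.

```python
import random

def build_trigram_chains(texts):
    """Map each (word1, word2) pair to the list of words seen right after it."""
    chains = {}
    for text in texts:
        words = text.split()
        # Slide a three-word window within each description
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            chains.setdefault((w1, w2), []).append(w3)
    return chains

def generate(chains, n_words, seed=None):
    """Start from a random bigram and append up to n_words more words."""
    rng = random.Random(seed)
    w1, w2 = rng.choice(sorted(chains))
    output = [w1, w2]
    for _ in range(n_words):
        followers = chains.get((w1, w2))
        if not followers:       # dead end: nothing ever followed this pair
            break
        w3 = rng.choice(followers)
        output.append(w3)
        w1, w2 = w2, w3         # slide the window forward by one word
    return ' '.join(output)

chains = build_trigram_chains(["to buy a dairy cow", "to buy cattle feed and a goat"])
print(chains[('to', 'buy')])
print(generate(chains, 5, seed=1))
```

Because a follower that appears more often is stored more often in the list, `rng.choice` automatically picks it with higher probability.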


### Predicting words

Let's try our prediction functions on our cleaned text fields! Please share any funny ones!

In [50]:
# Get a Markov chain for the description text field
chains = make_chains(text)

# Produce random text
random_text = make_text(chains, 10)

print(random_text)

Starting bigram: competition and
competition and lack of water that is from a nurse/midwife in fruits and has a trained nurse and works at kakamega his


In [51]:
# Get a Markov chain for the use text field
chains = make_chains(use_text)

# Produce random text
random_text = make_text(chains, 5)

print(random_text)

Starting bigram: cabbages kales
cabbages kales and a goat to buy cattle feed for her cafe


## Preliminary investigations / visualizations 

Now that we've got cleaned data, let's conduct some preliminary investigations. Frequency, concordance and similar are all functions of the NLTK package that can give us a sense of what is in our text without our having to read every single line.

- Frequency
- Concordance
- Similar 

Frequency returns a list of unique words, with how often each word shows up in the corpus. This provides an idea of what words are included in the descriptions of loan requests in Kenya. Note that the most common words are relatively uninformative, such as "to," "and," or "is." Later we will remove these for analysis so they do not overinfluence our results. 

In [52]:
# Read all sentences into single list 

text_corpus = list() 

for x in range(0, len(kiva_text)): 
    text_corpus.extend(kiva_text[x])

text_corpus = nltk.Text(text_corpus)

In [53]:
kiva_fdist = nltk.FreqDist(text_corpus)
kiva_fdist.most_common(25)

# Optional plots:
#kiva_fdist.plot()
#kiva_fdist.plot(50, cumulative=True)

[('to', 5397),
 ('and', 4336),
 ('a', 3883),
 ('the', 3861),
 ('she', 3561),
 ('is', 3226),
 ('her', 3107),
 ('of', 2796),
 ('in', 2274),
 ('for', 1974),
 ('has', 1916),
 ('business', 1866),
 ('he', 1789),
 ('loan', 1601),
 ('his', 1496),
 ('will', 1411),
 ('with', 1264),
 ('years', 1239),
 ('this', 1107),
 ('children', 1010),
 ('from', 874),
 ('that', 828),
 ('be', 784),
 ('been', 767),
 ('as', 630)]
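`nltk.FreqDist` is built on Python's `collections.Counter`, so the same kind of count can be sketched with the standard library alone. The token list below is made up for illustration.

```python
from collections import Counter

tokens = "to buy a dairy cow to buy cattle feed and a goat".split()
counts = Counter(tokens)

# Raw counts for the most frequent words
print(counts.most_common(3))

# Relative frequency: a word's share of all tokens
print(counts["buy"] / len(tokens))
```

Relative frequencies are useful when comparing corpora of different sizes, where raw counts are not directly comparable.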

Concordance takes an input word of your choosing and returns the surrounding words. This provides important context about how a specific word is used in the text corpus. Here, we test "man" and "woman". Note that a word can be used differently or ambiguously from one description to the next. This gets at an important point for NLP: words can be and are used ambiguously, and it is difficult to parse meaning unless we also look at context.

In [54]:
text_corpus.concordance('man')

Displaying 25 of 99 matches:
nd make a loan ! johnson is a married man he describes himself to be focused he
arming david is a 31-year-old married man with three children he has been runni
 hire richard is a 26-year-old single man who has had to overcome a lot of adve
way of life he is a very enterprising man and although he never had a formal ed
 and happy family davies is a married man he has five children he operates a ta
repaid successfully joab is a married man he has 3 children he describes himsel
lets jackson is a 29-year-old married man who has been blessed with three child
financial support joseph is a married man he has 4 children he describes himsel
r services felix is a happily married man and is blessed with 2 children his wi
oost his business he is a hardworking man who will use profits from his busines
ost her business richard is a married man with five children who are in school 
stallments billy is a happily married man who lives in bomet with his family he
cial indepe

In [55]:
text_corpus.concordance('woman')

Displaying 25 of 225 matches:
l of 8 25 acres `` esha is a married woman with five children all of whom atten
rything they need loice is a married woman with four children all of whom gradu
oost her business monica is a single woman who has 2 children she operates a fa
 10 solar lights biliah is a married woman she has three children she describes
boost her business saumu is a single woman with one child who attends school sh
loring materials bendera is a single woman with three children all of whom atte
t her business everlyne is a married woman she has three children with ages ran
ap for resale elizabeth is a married woman she has two children she describes h
 of 4 75 acres magdaline is a single woman and has one child she describes hers
e has experienced susan is a married woman she has 2 children she describes her
rds of his family ritah is a married woman she has one child she describes hers
in a biodigester fahima is a married woman with two children all of whom attend
d fertiliz
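Under the hood, a concordance is a "keyword in context" listing. A minimal sketch of the idea, using a hypothetical helper `keyword_in_context` and a made-up token list (NLTK's version works on characters of context and aligns the output, which this sketch does not attempt):

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return each occurrence of keyword with `window` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{keyword}] {right}".strip())
    return lines

tokens = ("esha is a married woman with five children "
          "monica is a single woman who has two children").split()
for line in keyword_in_context(tokens, "woman"):
    print(line)
```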

Similar takes in an input word of your choosing, but returns other words that appear in a similar range of contexts. This is called finding the "distributional similarity." Most similar words appear first. 

In [56]:
text_corpus.similar("mother")

loan farmer father woman business profit group businessman family shop
living cow challenge farm man house community lack supplier plot


In [57]:
text_corpus.similar("father")

mother loan farmer business part stock challenge farm man area lack
person variety city dream lady shortage piece help family


Collocations are pairs of words that occur together in the data unusually often. Here, we recognize pairs of words that are familiar to us in day-to-day life and indicate a writing style, such as "major challenge" or "primary customers". There are also unexpected pairings, like "three children." 

In [58]:
text_corpus.collocations()

years old; acre fund; one acre; juhudi kilimo; school fees; piped
water; 000 kes; kadet ltd; primary customers; join yehu; married
woman; greatest monthly; first loan; microfinance bank; five years;
anticipated profits; major challenge; three children; smep
microfinance; wheat flour
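One common way to score "unusually often" is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is an illustration of that measure on a made-up sentence; the function name `pmi_scores` is hypothetical, and NLTK's `collocations()` may rank pairs with a different association measure.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), c in pair_counts.items():
        if c < min_count:
            continue
        p_pair = c / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # PMI > 0 means the pair occurs more often than independence predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "she pays school fees and he pays school fees and she sells milk".split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

The `min_count` filter matters: PMI tends to overrate pairs of rare words that happen to co-occur once.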


### Remove stop words

"Stop words" are words like "to", "the", "a" - words that are plentiful but do not offer significantly meaningful information about the document. Here, we import a predetermined set of stop words defined by the NLTK package and then remove them from the dataset. The resulting dataset has words that we can generally agree are meaningful and say something about the content of the loan request. You can also define your own set of "stop words" to remove if you have a very specific set of words you want to remove. 

However, as the output below shows, the remaining words still have suffixes such as "-s" and "-ing". We want to remove these because if we do not, the algorithm will count a set of words like "married" and "marries" as different words, when we can consider them, for our purposes, the same word. To remove these, we stem our text data. 

In [59]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
 

In [60]:
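#remove stop words

# Build the stop-word set once; calling stopwords.words() inside the loop reloads the list for every word
stop_words = set(stopwords.words('english'))
text_corpus_clean = [word for word in text_corpus if word not in stop_words]
text_corpus_clean[0:50]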

['luvuno',
 'samburu',
 'fruits',
 'vegetables',
 'business',
 'sells',
 'home',
 'also',
 'around',
 'samburu',
 'luvuno',
 'says',
 'able',
 'help',
 'family',
 'business',
 'given',
 'need',
 'continue',
 'dignified',
 'life',
 'deprivations',
 'luvuno',
 'wants',
 'able',
 'continue',
 'working',
 'helping',
 'family',
 'asking',
 'loan',
 'invest',
 'fruits',
 'vegetables',
 'continue',
 'working',
 'offering',
 'good',
 'products',
 'customers',
 'paul',
 'nurse/midwife',
 'managing',
 'private',
 'facility',
 'vihiga',
 'western',
 'part',
 'kenya',
 '40']
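A custom stop-word list is just a set union. In the sketch below, `base_stop_words` is a tiny stand-in for NLTK's list and `domain_stop_words` is a hypothetical set of corpus-specific words; neither comes from the lab's data.

```python
def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word collection."""
    stop_words = set(stop_words)   # set membership tests are fast
    return [t for t in tokens if t not in stop_words]

base_stop_words = {"a", "and", "is", "to", "the"}   # tiny stand-in for NLTK's list
domain_stop_words = {"loan", "kiva"}                # hypothetical corpus-specific additions

tokens = "she is asking for a loan to buy the dairy cow".split()
print(remove_stop_words(tokens, base_stop_words | domain_stop_words))
```

Words such as "loan" appear in nearly every description, so removing them can make the remaining vocabulary more distinctive.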

### Stem words 

The Porter Stemmer is one of several stemming tools (others include the Snowball Stemmer and the Lancaster Stemmer). Each type of stemmer uses different rules to "stem" a word like "running" to "run". Here we use the Porter Stemmer as it is very commonly used. Try others! 
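To get a feel for what "rules" means here, the toy function below strips a handful of suffixes. It is a deliberately crude illustration, not the Porter algorithm, and the suffix list and the name `crude_stem` are made up for this sketch.

```python
SUFFIXES = ("ing", "ies", "ied", "es", "ed", "s")   # checked in this order

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

print([crude_stem(w) for w in ["married", "marries", "working", "works", "loan"]])
```

Real stemmers apply many more rules and conditions, which is why different stemmers can disagree on the same word.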

In [61]:
# Clean data - stem
# Porter stemmer is one of several

porter = nltk.PorterStemmer()
[porter.stem(t) for t in text_corpus_clean]

['luvuno',
 'samburu',
 'fruit',
 'veget',
 'busi',
 'sell',
 'home',
 'also',
 'around',
 'samburu',
 'luvuno',
 'say',
 'abl',
 'help',
 'famili',
 'busi',
 'given',
 'need',
 'continu',
 'dignifi',
 'life',
 'depriv',
 'luvuno',
 'want',
 'abl',
 'continu',
 'work',
 'help',
 'famili',
 'ask',
 'loan',
 'invest',
 'fruit',
 'veget',
 'continu',
 'work',
 'offer',
 'good',
 'product',
 'custom',
 'paul',
 'nurse/midwif',
 'manag',
 'privat',
 'facil',
 'vihiga',
 'western',
 'part',
 'kenya',
 '40',
 'year',
 'old',
 'live',
 'wife',
 'two',
 'children',
 'kakamega',
 'wife',
 'also',
 'train',
 'nurs',
 'work',
 'kakamega',
 'provinci',
 'gener',
 'hospit',
 'paul',
 'train',
 'nurse/midwif',
 '1990',
 'sinc',
 'oper',
 'privat',
 'clinic',
 'stend',
 'kisa',
 'vihiga',
 'offer',
 'wide',
 'rang',
 'clinic',
 'servic',
 'includ',
 'gener',
 'cur',
 'famili',
 'plan',
 'matern',
 'child',
 'health',
 'servic',
 'also',
 'run',
 'pharmaci',
 'laboratori',
 'paul',
 'attribut',
 'succe

In [62]:
# number of words in the entire corpus (after stop-word removal)
len(text_corpus_clean)

# number of unique words in entire corpus
len(set(text_corpus_clean))

67717

4727

## Algorithm: K-Means Clustering

Here, we apply k-means clustering to the documents. We use scikit-learn to weight each term in the documents by tf-idf (term frequency-inverse document frequency), and then cluster the resulting document vectors. 

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters

# Notes on parameters defined below:
#  max_df: this is the maximum document frequency a given feature can have to be 
#        used in the tf-idf matrix. If the term is in more than 80% of the documents it 
#        probably carries little meaning (it does not help tell loan descriptions apart)
#  min_df: this could be an integer (e.g. 5) and the term would have to be in at least 5 of 
#        the documents to be considered. Here we pass 0.01; the term must be in at least 1% of 
#        the documents, a low threshold as each document is comparatively short. 
#  ngram_range: (1,3) means we look at unigrams, bigrams and trigrams. 

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.01, stop_words='english',
                                 use_idf=True, ngram_range=(1,3))

# fit the vectorizer to text that is still in sentences
tfidf_matrix = tfidf_vectorizer.fit_transform(text)
tfidf_matrix

<1000x2113 sparse matrix of type '<class 'numpy.float64'>'
	with 75700 stored elements in Compressed Sparse Row format>
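To see what the numbers in this matrix mean, here is a hand-rolled sketch of tf-idf on three made-up documents. It uses the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row, which is the form the scikit-learn documentation describes for the default settings. The whitespace tokenization, the absence of stop-word removal and n-grams, and the function name `tfidf` are simplifying assumptions.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows) where each row is an L2-normalized tf-idf vector."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(doc_freq)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

docs = ["buy a dairy cow", "buy cattle feed", "buy solar light"]
vocabulary, rows = tfidf(docs)
for term, weight in zip(vocabulary, rows[0]):
    print(term, round(weight, 3))
```

"buy" appears in every document, so it gets a lower weight than "dairy" or "cow", which appear in only one.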

In [64]:
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

k = 6
model = KMeans(n_clusters = k, init='k-means++', max_iter=100, n_init=1)
model.fit(tfidf_matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=6, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
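The fitting step alternates between two moves: assign each point to its nearest centroid, then move each centroid to the mean of its members. A minimal sketch of that loop on four made-up 2-D points follows. The function `kmeans` and its hand-picked starting centroids are assumptions for illustration; scikit-learn's implementation adds k-means++ initialization and other refinements.

```python
def kmeans(points, centroids, n_iter=10):
    """Plain k-means on lists of numeric vectors; returns (labels, centroids)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])   # leave an empty cluster where it is
        if new_centroids == centroids:               # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)
print(centroids)
```

The result depends on the starting centroids, which is why the clusters found in this lab can change from run to run when `random_state` is not set.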

In [65]:
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# scikit-learn 1.0+ provides get_feature_names_out(); older versions use get_feature_names()
terms = tfidf_vectorizer.get_feature_names_out()
for i in range(k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

Top terms per cluster:
Cluster 0:
 business
 electricity
 school
 selling
 wants purchase
 piped
 piped water
 yehu
 house electricity
 electricity piped water

Cluster 1:
 business
 describes
 primary customers
 business located
 involved business
 customers
 biggest business
 use kes
 biggest business challenge
 operates

Cluster 2:
 farming
 farm
 dairy
 income
 milk
 kiva
 poultry
 juhudi
 family
 kilimo

Cluster 3:
 business
 years
 kenya
 000
 old
 children
 years old
 kes
 buy
 income

Cluster 4:
 kadet
 business
 years
 old
 loan kadet
 use
 plans
 hopes
 years old
 introduced

Cluster 5:
 group
 acre
 acre fund
 fund
 total
 solar
 farmers
 light
 receive
 solar light

Fascinating -- the clusters correspond to some of the big partners that we saw in earlier notebooks, namely Faulu, One Acre Fund, Juhudi, Yehu and VisionFund Kenya (formerly known as Kadet). Each partner also appears to specialize in certain types of loans (e.g., Juhudi helps fund farming/dairy/poultry loans). Let's refresh our memory of the top partners by loan amount. 

In [66]:
partners = df.groupby(['partner_name'])['loan_amount'].sum()
partners.sort_values(ascending=False)

partner_name
VisionFund Kenya                                                            11351650
One Acre Fund                                                                8026025
Yehu Microfinance Trust                                                      7631100
Juhudi Kilimo                                                                7493275
SMEP Microfinance Bank                                                       6738200
Faulu Kenya                                                                  2892775
Milango Financial Services                                                   1378975
Hand in Hand Eastern Africa                                                   988825
Kenya ECLOF                                                                   851175
Evidence Action                                                               799950
Ebony Foundation (Eb-F)                                                       697975
Women`s Economic Empowerment Consort (WEEC)         

### Homework

What other clusters do you see? Try adjusting the number of clusters (i.e. the hyperparameter "k").