## Lab 6: Text Analysis and Natural Language Processing 

In this lab, we explore the text data provided by Kiva's API. Our primary source of text is the descriptions that accompany each loan request and are posted publicly on the Kiva website. Kiva is unusual in that borrowers often do not write these descriptions themselves; instead, they fill out a questionnaire for Kiva's team of volunteer translators. We use this body of text (also called a *"corpus"*) to look for patterns in how descriptions are written.

As always, we first import our packages and read in our data below. 

In [2]:
import pandas as pd
import numpy as np

# NLP-specific packages: 
import nltk
from nltk.corpus import names
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import gensim
from gensim import corpora
from nltk.tokenize import word_tokenize
from nltk.text import Text  
from nltk.stem import PorterStemmer


# output of multiple commands in a cell will be output at once.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# display up to 80 columns, this keeps everything visible
pd.set_option('display.max_columns', 80)
pd.set_option('expand_frame_repr', True)

Slow version of gensim.models.doc2vec is being used


In [4]:
#datapath = '~/intro_to_machine_learning/data'
datapath = '~/Desktop'
df = pd.read_csv(datapath+'/df.csv', low_memory=False)

## Exploratory Analysis and Feature Engineering

We have very limited information about translators. In fact, the only variable in our dataset relevant to translators is their name! What information can we extract from this field? 

In text analysis, a common yet simple task is categorizing names by gender. We know from our everyday knowledge of English names that names ending in -a are likely to be female, and names ending in -o are likely to be male (for example, Jenna and Pablo). Since we have both the gender data and the name data for the borrowers, let's use the borrowers' data to train a classifier that can predict gender from a name! Then, we will apply this model to the translators' names to predict their genders. 

Here, we use the Naive Bayes Classifier (for a comprehensive review, look back at Module 6). The classifier assigns a label (in our case, "male" or "female") using a single feature: the last letter of the name. Remember that we first need to clean our data to ensure that we are capturing the last letter of first names. 

In [46]:
#create name and gender dataframe for single borrowers
kiva_names = df[['name', 'gender', 'borrower_count']]
kiva_names = kiva_names[['name', 'gender']][kiva_names['borrower_count'] == 1]

kiva_names.head(5)
len(kiva_names)

| | name | gender |
|---|---|---|
| 0 | Evaline | Female |
| 1 | Julias | Male |
| 2 | Rose | Female |
| 3 | Jane | Female |
| 4 | Alice | Female |


105297

We know from looking through the data that there are some instances in which the name is not an individual's first name, but rather the name of a business or a collective, or "Anonymous". Let's drop these out of our training dataset as they won't be helpful in determining the gender of a person. Let's also select only the first name. 

In [48]:
# rm null values, anonymous, and duplicates

kiva_names = kiva_names.loc[kiva_names['name'].isnull() == False]
kiva_names = kiva_names.drop_duplicates()
kiva_names = kiva_names[kiva_names['name'] != "Anonymous"]
kiva_names['name'] = kiva_names['name'].str.split(expand=True)[0]

len(kiva_names['name'])
kiva_names['name'].head(15)

9794

0     Evaline
1      Julias
2        Rose
3        Jane
4       Alice
5       Clare
6        Mary
7       James
8     Jacinta
9       Emily
10     Fridah
11    Charity
12      Susan
13      Joyce
14     Daniel
Name: name, dtype: object

Now let's define a function that will return the last letter of our borrowers' first names. This letter will be a **feature** we will use to attempt to predict the output feature, gender. 

In [49]:
#function that returns last letter of first name 
def gender_features(name):
    return {'last_letter': name[-1]}

Now let's prepare to train our model. We split train and test sets as usual. 

In [50]:
# Set training-test split %
split_pct = 0.80

# Remove null and NaN values 
kiva_names = kiva_names[pd.notnull(kiva_names)]

# the pandas command "sample" already randomizes its selection. 
kiva_names_shuffled = kiva_names.sample(frac=1)

kiva_train_set = kiva_names_shuffled[:int((len(kiva_names_shuffled)*split_pct))] 
kiva_test_set = kiva_names_shuffled[int(len(kiva_names_shuffled)*split_pct+1):]  

len(kiva_train_set.index)
len(kiva_test_set.index)

7835

1958

Now we prepare our data by converting the name and gender columns into lists of (name, label) pairs, so that each name stays associated with its gender. We then replace each name with its feature dictionary. 

In [51]:
kiva_female_train = kiva_train_set[kiva_train_set['gender'] == "Female"]
kiva_male_train = kiva_train_set[kiva_train_set['gender'] == "Male"]
kiva_female_test = kiva_test_set[kiva_test_set['gender'] == "Female"]
kiva_male_test = kiva_test_set[kiva_test_set['gender'] == "Male"]

kiva_train_feature_set = [(name, "female") for name in kiva_female_train['name']] + \
[(name, "male") for name in kiva_male_train['name']]

kiva_test_feature_set = [(name, "female") for name in kiva_female_test['name']] + \
[(name, "male") for name in kiva_male_test['name']]

In [52]:
kiva_train_feature_set = [(gender_features(n), g) for (n, g) in kiva_train_feature_set]
kiva_test_feature_set = [(gender_features(n), g) for (n, g) in kiva_test_feature_set]

In [53]:
kiva_classifier = nltk.NaiveBayesClassifier.train(kiva_train_feature_set)

In [54]:
#let's test out our new classifier! 

kiva_classifier.classify(gender_features('Cleopatra'))
kiva_classifier.classify(gender_features('Maximillian'))
kiva_classifier.classify(gender_features('James'))

'female'

'male'

'male'

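It looks like it works okay for our three samples, but let's get a better sense of overall accuracy.

First, let's look at which features the classifier found most informative. After that, we will use NLTK's `accuracy()` function, which returns the fraction of test examples that our classifier labels correctly.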

In [55]:
#Find out which features were most informative in determining outcome

kiva_classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'f'              male : female =     12.4 : 1.0
             last_letter = 'k'              male : female =      8.2 : 1.0
             last_letter = 'p'              male : female =      5.3 : 1.0
             last_letter = 'w'              male : female =      3.5 : 1.0
             last_letter = 's'              male : female =      3.0 : 1.0


`show_most_informative_features()` returns likelihood ratios. For the first entry, "f", a name ending in this letter is about 12.4 times more likely among the male names than among the female names in our training data.
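To make the idea of a likelihood ratio concrete, here is a minimal sketch that computes one by hand from raw counts. The toy names and the helper `last_letter_likelihood` are hypothetical and are not part of the Kiva data or of NLTK. NLTK also smooths its probability estimates, which this sketch leaves out, so its numbers will not match NLTK's exactly.

```python
# Hypothetical toy training data (not taken from the Kiva dataset).
toy_names = [
    ("Joseph", "male"), ("Rolf", "male"), ("Peter", "male"), ("Josef", "male"),
    ("Steff", "female"), ("Anna", "female"), ("Maria", "female"), ("Ruth", "female"),
]

def last_letter_likelihood(names, letter, label):
    """Estimate P(last letter == letter | label) from raw counts (no smoothing)."""
    in_class = [name for name, gender in names if gender == label]
    hits = sum(1 for name in in_class if name[-1].lower() == letter)
    return hits / len(in_class)

p_male = last_letter_likelihood(toy_names, "f", "male")      # 2 of 4 male names end in "f"
p_female = last_letter_likelihood(toy_names, "f", "female")  # 1 of 4 female names end in "f"
ratio = p_male / p_female

print("male : female = %.1f : 1.0" % ratio)
```

In this toy data the ratio is 2.0, meaning a final "f" is twice as common among the male names as among the female names.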

But how accurate is this? Let's run this classifier on our test dataset. 

In [56]:
#Get a sense of overall accuracy

print(nltk.classify.accuracy(kiva_classifier, kiva_test_feature_set))

0.6818181818181818


This prediction is okay, but not amazing. Remember that a random generator of genders would likely get an accuracy of about 50%, so at least we are better than random. One possible reason we do not do better is that this particular dataset mixes Kenyan and American first names. Whereas you might expect an American female name to end in -a and an American male name to end in -o (e.g. Jenna and Julio), these conventions do not necessarily hold for Kenyan names. 

Let's use our model to try to predict our translators' genders from their first names. 

In [57]:
translators = pd.DataFrame()
translators['translator_first_name'] = df['translator.byline'].str.split(expand=True)[0]

# rm null values and duplicates
translators = translators.loc[translators['translator_first_name'].isnull() == False]
translators = translators.drop_duplicates()

translators.head(5)

| | translator_first_name |
|---|---|
| 0 | Julie |
| 1 | Morena |
| 8 | Lynn |
| 19 | Mohammad |
| 21 | Cheryl |


In [58]:
translators['last_letter'] = translators['translator_first_name'].apply(lambda x: gender_features(x))
translators_last = translators['last_letter']
translators_last[0:5]

0     {'last_letter': 'e'}
1     {'last_letter': 'a'}
8     {'last_letter': 'n'}
19    {'last_letter': 'd'}
21    {'last_letter': 'l'}
Name: last_letter, dtype: object

In [59]:
translators['gender'] = translators_last.apply(lambda x: kiva_classifier.classify(x))
translators.head(10)

| | translator_first_name | last_letter | gender |
|---|---|---|---|
| 0 | Julie | {'last_letter': 'e'} | female |
| 1 | Morena | {'last_letter': 'a'} | female |
| 8 | Lynn | {'last_letter': 'n'} | male |
| 19 | Mohammad | {'last_letter': 'd'} | male |
| 21 | Cheryl | {'last_letter': 'l'} | male |
| 23 | Rita | {'last_letter': 'a'} | female |
| 25 | Maureen | {'last_letter': 'n'} | male |
| 29 | Lorne | {'last_letter': 'e'} | female |
| 31 | Caty | {'last_letter': 'y'} | female |
| 34 | Trishna | {'last_letter': 'a'} | female |


Interesting - even in this small sample of 10, we see that the accuracy rate is far from perfect. Using our own understanding of what gender we would assign the names we see, this sample has an accuracy score of 60%. Not great.  

**How can we make this prediction better? Can you think of other aspects of a name that might be predictive of gender?** 
A quick test we can try is using the final two letters of a name instead of just one. Try it! 
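As a starting point, here is one possible sketch of a feature function that returns both the last letter and the last two letters. The name `gender_features_v2` is hypothetical; to try it, you would swap it in wherever `gender_features` is used above and retrain the classifier. Whether it improves accuracy on this dataset is something to check for yourself.

```python
def gender_features_v2(name):
    """Return the last letter and the last two letters of a name as features."""
    name = name.lower()  # treat 'A' and 'a' as the same letter
    return {
        'last_letter': name[-1],
        'last_two_letters': name[-2:],
    }

example = gender_features_v2('Jenna')
print(example)
```

Lowercasing is a design choice made here for illustration; the original `gender_features` keeps the case of the name as it is.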

We just completed our first supervised learning exercise: classification. Let's move on to our next question, finding patterns in the loan descriptions written by translators, which is our unsupervised learning exercise. First we need to clean the text data: 

## Cleaning text 

Cleaning text is almost always required in text analysis. You have already gotten a taste of this in this notebook when you cleaned the variable "name" to exclude business names, and in past notebooks as well. 

Cleaning can be as extensive as you want it to be, depending on what serves your research question the best. Is it best to look at full sentences, so you can retain the context of words? Is it best to look at individual words? Should you remove grammar, HTML code, stopwords? 

Before answering this question, we have to know what's in our data. Let's turn to some exploratory analyses to determine how we should clean our data.

Note that we don't run the following snippets of code on the whole dataset as text analysis is very computationally expensive and may crash your computer. Instead, we draw a sample of 1000 descriptions from the dataset. *This means that your results will look slightly different, but that's okay -- make sure to post on Slack anything you find interesting!*  

In [23]:
# read all non-null descriptions into a single Series
text_raw = df['description.texts.en'][df['description.texts.en'].notnull()]

# take sample of 1000 entries, read into list
sample_num = 1000
text_raw_abridged = text_raw.sample(sample_num)
text = list(map(str, text_raw_abridged))

print(text[0:3]) # Each description is an item in the list

['Rose is a married woman with three kids, all of whom attend school. She owns a house that has neither electricity nor piped water. Her greatest monthly expense is food for the family.\r\r\n\r\r\nRose operates a grocery vegetables selling stall, and she sells at the market to town dwellers and neighbors. She faces a challenge of high cost of transportation to her place of operation. She dreams of expanding and establishing a motorcycle transport business in the future. \r\r\n\r\r\nWith the Kshs 20,000, she wants to purchase green vegetables, two crates of tomatoes, and onions for resale. She decided to join Yehu to access loans to boost her business. ', '<p>Joyce K. is an average young Kenyan. She is married with four children of her own. She also cares for the child of her late sister.</p>\r\r\n\r\r\n<p>Joyce started her tailoring business three years ago after completing her tailoring course in Kisii Town.  Before venturing into the business, Joyce worked on commission for a friend 

We see there are HTML tags and line-break characters (`\r`, `\n`) cluttering up the text. Below, we remove these along with periods and commas, and convert all capital letters to lowercase.

In [24]:
# Remove carriage returns, newlines, HTML tags, periods, and commas
text = [w.replace('\r', '') for w in text]
text = [w.replace('\n', '') for w in text]
text = [w.replace('<br />', '') for w in text]
text = [w.replace('<p>', '') for w in text]
text = [w.replace('</p>', '') for w in text]
text = [w.replace('.', '') for w in text]
text = [w.replace(',', '') for w in text]

# Lowercase
text = [w.lower() for w in text]

print(text[0:3])

['rose is a married woman with three kids all of whom attend school she owns a house that has neither electricity nor piped water her greatest monthly expense is food for the familyrose operates a grocery vegetables selling stall and she sells at the market to town dwellers and neighbors she faces a challenge of high cost of transportation to her place of operation she dreams of expanding and establishing a motorcycle transport business in the future with the kshs 20000 she wants to purchase green vegetables two crates of tomatoes and onions for resale she decided to join yehu to access loans to boost her business ', 'joyce k is an average young kenyan she is married with four children of her own she also cares for the child of her late sisterjoyce started her tailoring business three years ago after completing her tailoring course in kisii town  before venturing into the business joyce worked on commission for a friend in getembe market in the town of kisii from this job she accumulat
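One thing to watch for in the output above: because line breaks are replaced with an empty string, the last word of one paragraph is fused with the first word of the next (for example, "familyrose"). A minimal alternative sketch, using regular expressions and replacing tags and line breaks with a space instead, could look like this. The function name `clean_description` and the sample string are hypothetical.

```python
import re

def clean_description(raw):
    """Strip HTML tags, periods and commas, normalize whitespace, and lowercase."""
    no_tags = re.sub(r'<[^>]+>', ' ', raw)     # replace any HTML tag with a space
    no_punct = re.sub(r'[.,]', '', no_tags)    # drop periods and commas
    collapsed = re.sub(r'\s+', ' ', no_punct)  # collapse \r, \n and repeated spaces
    return collapsed.strip().lower()

# Hypothetical sample in the same style as the raw descriptions.
sample = '<p>Food for the family.</p>\r\r\n\r\r\nRose operates a stall, and she sells vegetables.'
cleaned = clean_description(sample)
print(cleaned)
```

With this approach "family" and "rose" stay separate tokens. The rest of this notebook continues with the original cleaning step, so its outputs still contain a few fused words.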

Great! The text looks much cleaner. Note that this dataset is a list in which every item is one description. Now we tokenize each item in the list so that each word is separated out. This yields a list of lists. 

In [25]:
tokens = list(map(word_tokenize, text))
kiva_text = nltk.Text(tokens)
kiva_text[0:2]

[['rose',
  'is',
  'a',
  'married',
  'woman',
  'with',
  'three',
  'kids',
  'all',
  'of',
  'whom',
  'attend',
  'school',
  'she',
  'owns',
  'a',
  'house',
  'that',
  'has',
  'neither',
  'electricity',
  'nor',
  'piped',
  'water',
  'her',
  'greatest',
  'monthly',
  'expense',
  'is',
  'food',
  'for',
  'the',
  'familyrose',
  'operates',
  'a',
  'grocery',
  'vegetables',
  'selling',
  'stall',
  'and',
  'she',
  'sells',
  'at',
  'the',
  'market',
  'to',
  'town',
  'dwellers',
  'and',
  'neighbors',
  'she',
  'faces',
  'a',
  'challenge',
  'of',
  'high',
  'cost',
  'of',
  'transportation',
  'to',
  'her',
  'place',
  'of',
  'operation',
  'she',
  'dreams',
  'of',
  'expanding',
  'and',
  'establishing',
  'a',
  'motorcycle',
  'transport',
  'business',
  'in',
  'the',
  'future',
  'with',
  'the',
  'kshs',
  '20000',
  'she',
  'wants',
  'to',
  'purchase',
  'green',
  'vegetables',
  'two',
  'crates',
  'of',
  'tomatoes',
  'and',
 

## Preliminary investigations / visualizations 

Now that we've got cleaned data, let's conduct some preliminary investigations. Frequency, concordance and similar are all functions of the NLTK package that can give us a sense of what is in our text without our having to read every single line.

- Frequency
- Concordance
- Similar 

Frequency returns a list of unique words, with how often each word shows up in the corpus. This provides an idea of what words are included in the descriptions of loan requests in Kenya. Note that the most common words are relatively uninformative, such as "to," "and," or "is." Later we will remove these for analysis so they do not overinfluence our results. 

In [26]:
# Flatten the tokenized descriptions into a single list of tokens

text_corpus = list() 

for x in range(0, len(kiva_text)): 
    text_corpus.extend(kiva_text[x])

text_corpus = nltk.Text(text_corpus)

In [27]:
kiva_fdist = nltk.FreqDist(text_corpus)
# optional plots of the distribution:
#kiva_fdist.plot()
#kiva_fdist.plot(50, cumulative=True)
kiva_fdist.most_common(25)

[('to', 5332),
 ('and', 4270),
 ('the', 3891),
 ('a', 3883),
 ('she', 3449),
 ('her', 3180),
 ('is', 3149),
 ('of', 2730),
 ('in', 2194),
 ('for', 1932),
 ('has', 1873),
 ('business', 1811),
 ('he', 1662),
 ('loan', 1572),
 ('will', 1394),
 ('his', 1373),
 ('years', 1234),
 ('with', 1177),
 ('this', 1040),
 ('children', 926),
 ('that', 839),
 ('from', 805),
 ('be', 769),
 ('been', 729),
 ('as', 634)]

Concordance takes an input word of your choosing and returns the surrounding words. This provides important context about how a specific word is used in the text corpus. Here, we test "man" and "woman". Note that some words are used differently or ambiguously in different places. This gets at an important point for NLP: words can be and are used ambiguously, and it is difficult to parse meaning unless we also look at context.

In [73]:
text_corpus.concordance('man')

Displaying 25 of 102 matches:
aler douglas is a 30-year-old married man he has 3 children with ages ranging f
e an ambitious visionary and decisive man douglas operates a motor vehicle busi
 humble honest and a very hardworking man on this planet earth david is a fathe
 the only “key to success” david is a man who is very enterprising he is seekin
so as to boost his income julius is a man who believes in hard work and he is l
in the near future hezron is a single man he describes himself as honest he ope
the next five years paul is a married man with 2 children he describes himself 
ough juhudi kilimo tsuwa is a married man with three children all of whom atten
he describes himself as a hardworking man jared is 28 years old married to rose
s is a thirty-eight-year-old business man he is married to anastasia a farmer t
r lovely children he is a hardworking man and aggressive with the principle of 
epaid successfully clerk is a married man he has five children he describes him
e lovely c

In [74]:
text_corpus.concordance('woman')

Displaying 25 of 237 matches:
                                     woman with three kids all of whom attend s
filling life dorine is a 27-year-old woman with 4 children ranging in age from 
aid successfully habiba is a married woman with four children all of whom atten
ocation in mombasa asha is a married woman with three children all of whom atte
the loan promptly nzije is a married woman who has been blessed with five schoo
thful 28-year-old mrembo ( beautiful woman ) she is a wise farmer whom neighbor
“ she said millicent mobilized other woman in her village and joined juhudi kil
ease her income level millicent is a woman who is making a change for herself a
le within 5 years grace is a married woman and owns a house that has piped wate
cess loans from banks since she is a woman and a smallholder farmer she joined 
ical decisions margaret is a married woman with two children both of whom atten
e solar lights mwanasha is a married woman with three children all of whom atte
n shilling
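To show what a concordance does under the hood, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation; the function `simple_concordance`, the `window` parameter, and the toy sentence are all hypothetical.

```python
def simple_concordance(tokens, target, window=3):
    """Return each occurrence of target with up to `window` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(('%s [%s] %s' % (left, tok, right)).strip())
    return lines

# Hypothetical toy corpus, already lowercased and tokenized.
toy_tokens = 'grace is a married woman with two children and she is a woman farmer'.split()
result = simple_concordance(toy_tokens, 'woman')
for line in result:
    print(line)
```

NLTK's version measures the context in characters rather than tokens, which is why its output lines above are all the same width.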

Similar also takes an input word of your choosing, but returns other words that appear in a similar range of contexts. This is called finding the "distributional similarity." Most similar words appear first. 

In [35]:
text_corpus.similar("mother")

woman father business loan farmer man house lady and challenge married
that children friend profit lack living hope life sale


In [38]:
text_corpus.similar("father")

mother woman challenge business loan lack farmer way house women
variety hope life sale stock number lives lady part group


Collocations are pairs of words that occur together in the data unusually often. Here, we recognize pairs of words that are familiar to us in day-to-day life and indicate a writing style, such as "major challenge" or "primary customers". There are also unexpected pairings, like "three children." 

In [76]:
text_corpus.collocations()

years old; acre fund; one acre; juhudi kilimo; piped water; join yehu;
greatest monthly; school fees; major challenge; primary customers;
married woman; access loans; kadet ltd; first loan; attend school;
solar lights; wheat flour; microfinance bank; monthly expense; three
children
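One way to think about "unusually often" is to compare how often a pair occurs with how often we would expect it to occur if the two words were independent. The sketch below uses pointwise mutual information (PMI) as a stand-in for this idea; NLTK's `collocations()` uses its own association measure and filtering, so this is an illustration rather than a reproduction. The toy sentence and the function `pmi` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and tokenized.
toy_tokens = 'piped water is costly and piped water is scarce but water is life'.split()

unigram_counts = Counter(toy_tokens)
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))  # adjacent word pairs

def pmi(word_a, word_b):
    """log2 of observed pair probability over the probability expected under independence."""
    p_pair = bigram_counts[(word_a, word_b)] / sum(bigram_counts.values())
    p_a = unigram_counts[word_a] / len(toy_tokens)
    p_b = unigram_counts[word_b] / len(toy_tokens)
    return math.log2(p_pair / (p_a * p_b))

most_common_pair = bigram_counts.most_common(1)[0]
piped_water_pmi = pmi('piped', 'water')
print(most_common_pair, round(piped_water_pmi, 3))
```

A positive PMI means the pair appears together more often than chance would suggest. On a corpus this small the scores are not meaningful; the point is only to show the calculation.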


### Remove stop words

"Stop words" are words like "to", "the", "a" - words that are plentiful but do not offer significantly meaningful information about the document. Here, we import a predetermined set of stop words defined by the NLTK package and then remove them from the dataset. The resulting dataset has words that we can generally agree are meaningful and say something about the content of the loan request. You can also define your own set of "stop words" to remove if you have a very specific set of words you want to remove. 

However, we see that these words still have suffixes such as "-s" and "-ing". We want to remove these because if we do not, the algorithm will count a set of words like "married" and "marries" as different words, when we can consider them, for our purposes, the same word. To remove these, we stem our text data. 

In [39]:
set(stopwords.words('english'))

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
 

In [41]:
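#remove stop words
# build the set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))
text_corpus_clean = [word for word in text_corpus if word not in stop_words]
text_corpus_clean[0:50]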

['rose',
 'married',
 'woman',
 'three',
 'kids',
 'attend',
 'school',
 'owns',
 'house',
 'neither',
 'electricity',
 'piped',
 'water',
 'greatest',
 'monthly',
 'expense',
 'food',
 'familyrose',
 'operates',
 'grocery',
 'vegetables',
 'selling',
 'stall',
 'sells',
 'market',
 'town',
 'dwellers',
 'neighbors',
 'faces',
 'challenge',
 'high',
 'cost',
 'transportation',
 'place',
 'operation',
 'dreams',
 'expanding',
 'establishing',
 'motorcycle',
 'transport',
 'business',
 'future',
 'kshs',
 '20000',
 'wants',
 'purchase',
 'green',
 'vegetables',
 'two',
 'crates']

### Stem words 

The Porter Stemmer is one of several stemming tools (including Snowball Stemmer and the Lancaster Stemmer). Each type of stemmer uses different rules to "stem" a word like "running" to "run". Here we use the Porter Stemmer as it is very commonly used. Try others! 

In [42]:
# Clean data - stem
# Porter stemmer is one of several

porter = nltk.PorterStemmer()
[porter.stem(t) for t in text_corpus_clean]

['rose',
 'marri',
 'woman',
 'three',
 'kid',
 'attend',
 'school',
 'own',
 'hous',
 'neither',
 'electr',
 'pipe',
 'water',
 'greatest',
 'monthli',
 'expens',
 'food',
 'familyros',
 'oper',
 'groceri',
 'veget',
 'sell',
 'stall',
 'sell',
 'market',
 'town',
 'dweller',
 'neighbor',
 'face',
 'challeng',
 'high',
 'cost',
 'transport',
 'place',
 'oper',
 'dream',
 'expand',
 'establish',
 'motorcycl',
 'transport',
 'busi',
 'futur',
 'ksh',
 '20000',
 'want',
 'purchas',
 'green',
 'veget',
 'two',
 'crate',
 'tomato',
 'onion',
 'resal',
 'decid',
 'join',
 'yehu',
 'access',
 'loan',
 'boost',
 'busi',
 'joyc',
 'k',
 'averag',
 'young',
 'kenyan',
 'marri',
 'four',
 'children',
 'also',
 'care',
 'child',
 'late',
 'sisterjoyc',
 'start',
 'tailor',
 'busi',
 'three',
 'year',
 'ago',
 'complet',
 'tailor',
 'cours',
 'kisii',
 'town',
 'ventur',
 'busi',
 'joyc',
 'work',
 'commiss',
 'friend',
 'getemb',
 'market',
 'town',
 'kisii',
 'job',
 'accumul',
 'save',
 'start'

In [54]:
# number of words in the entire corpus (after stop word removal, before stemming)
len(text_corpus_clean)

# number of unique words in entire corpus
len(set(text_corpus_clean))

66075

5503

## Algorithm: K-Means Clustering

Here, we apply k-means clustering to the documents. We use scikit-learn to compute a tf-idf weight for each term in each document, and then cluster the documents based on these weights. 

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters

# Notes on parameters defined below:
#  max_df: this is the maximum document frequency a given term can have and still be 
#        used in the tf-idf matrix. If the term is in more than 80% of the documents it 
#        probably carries little meaning.
#  min_df: this could be an integer (e.g. 5) and the term would have to be in at least 5 of 
#        the documents to be considered. Here I pass 0.01; the term must be in at least 1% of 
#        the documents, as each document is comparatively short. 
#  ngram_range: this just means I'll look at unigrams, bigrams and trigrams. 

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.01, stop_words='english',
                                 use_idf=True, ngram_range=(1,3))

# fit the vectorizer to the cleaned descriptions (one string per description, not tokenized)
tfidf_matrix = tfidf_vectorizer.fit_transform(text)
tfidf_matrix

<1000x1980 sparse matrix of type '<class 'numpy.float64'>'
	with 70828 stored elements in Compressed Sparse Row format>
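To give a feel for what the numbers in this matrix mean, here is a minimal tf-idf sketch in plain Python. It assumes a smoothed idf of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document, which to our understanding matches scikit-learn's default settings; treat that as an assumption and check the documentation. The toy documents and the function `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized "descriptions".
toy_docs = [["loan", "farm", "dairy"], ["loan", "shop"], ["loan", "farm"]]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        norm = math.sqrt(sum(value ** 2 for value in raw.values()))
        weights.append({term: value / norm for term, value in raw.items()})
    return weights

weights = tfidf(toy_docs)
print(weights[0])
```

In the first toy document, "dairy" (which appears in only one document) receives the largest weight, and "loan" (which appears in every document) receives the smallest. This is the behavior we want: terms that are common across the whole corpus contribute less to a document's representation.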

In [55]:
from sklearn.cluster import KMeans

k = 6
model = KMeans(n_clusters = k, init='k-means++', max_iter=100, n_init=1)
model.fit(tfidf_matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=6, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
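As a reminder of what the model is doing when it fits, here is a minimal sketch of the k-means loop in plain Python: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The toy 2-D points and the function `kmeans` are hypothetical, and the sketch uses fixed starting centroids rather than the k-means++ initialization used above.

```python
def kmeans(points, centroids, n_iter=10):
    """Run a fixed number of assignment and update steps."""
    for _ in range(n_iter):
        # assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for point in points:
            dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(point)
        # update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical toy data: two well-separated groups of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
final_centroids, final_clusters = kmeans(toy_points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
```

In our notebook each "point" is a row of the tf-idf matrix, so the centroids live in a space with one dimension per term. That is why we can read off the top terms for each cluster from the centroid coordinates.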

In [58]:
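print("Top terms per cluster:")
# sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# on scikit-learn >= 1.0, use get_feature_names_out() instead
terms = tfidf_vectorizer.get_feature_names()
for i in range(k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()  # blank line between clusters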

Top terms per cluster:
Cluster 0:
 business
 years
 buy
 faulu
 kenya
 living
 faulu kenya
 old
 children
 requesting loan

Cluster 1:
 business
 describes
 involved business
 primary customers
 business located
 involved
 located
 describes biggest
 biggest business
 operates

Cluster 2:
 group
 acre
 acre fund
 total
 fund
 solar
 farmers
 light
 lights
 solar lights

Cluster 3:
 farming
 farm
 dairy
 income
 juhudi
 milk
 kilimo
 juhudi kilimo
 family
 poultry

Cluster 4:
 business
 school
 electricity
 yehu
 piped
 house electricity
 electricity piped
 house
 house electricity piped
 water

Cluster 5:
 years
 business
 kadet
 use
 old
 years old
 hopes
 children
 income
 years old married

Fascinating -- most clusters correspond to one of the big partners that we saw in earlier notebooks, namely Faulu, One Acre Fund, Juhudi, Yehu and VisionFund Kenya (formerly known as Kadet). Each partner also appears to specialize in certain types of loans (e.g., Juhudi helps fund farming/dairy/poultry loans). Let's refresh our memory of the top partners by loan amount. 

In [71]:
partners = df.groupby(['partner_name'])['loan_amount'].sum()
partners.sort_values(ascending=False)

partner_name
VisionFund Kenya                                                            11960875
One Acre Fund                                                                9800175
Juhudi Kilimo                                                                8245775
Yehu Microfinance Trust                                                      7774375
SMEP Microfinance Bank                                                       7746150
Faulu Kenya                                                                  2998575
Milango Financial Services                                                   1440700
Kenya ECLOF                                                                  1162850
Hand in Hand Eastern Africa                                                  1059875
Evidence Action                                                               799950
Ebony Foundation (Eb-F)                                                       697975
Strathmore University                               

### Homework

What other clusters do you see? Try adjusting the number of clusters (i.e. the hyperparameter "k").