# Classification assignment

**Objective**

This assignment tests your understanding of logistic regression and Naive Bayes. We will use the `use` column of the Kiva dataset to predict the sector each loan belongs to.

**Application of assignment in real life**

Imagine Kiva had no sector classification for its loans. A prediction model trained on a small set of manually classified loans could then be used to classify all the remaining loans automatically.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import string

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB

from sklearn import linear_model

from sklearn.metrics import accuracy_score

In [2]:
# Load the data
path = '../data/'
filename = 'loans.csv'
data = pd.read_csv(path+filename)

In [3]:
# Clean the data by dropping rows with missing values in the "use" column
data_final = data.loc[data['use'].notnull()]
data_final.head()

Unnamed: 0,id_number,loan_amount,lender_count,status,funded_date,funded_amount,repayment_term,location_country_code,sector,description,use
17,767909,300,12,funded,2014-09-15T11:10:34Z,300,10,BJ,Arts,Ahmed Baba is a Beninese artisan who specializ...,to invest in a bulk purchase of raw materials ...
18,1266423,50000,1519,funded,2017-04-18T06:42:43Z,50000,12,BJ,Agriculture,"<a href=""http://agpowerbenin.com/""> Tolaro Glo...",to add value and jobs to the local economy by ...
19,1480779,50000,1574,funded,2018-04-06T22:46:13Z,50000,14,BJ,Agriculture,"In February 2017, MCE made a 12-month loan of ...",to promote the growth of the business by trans...
201,1563183,2200,2,funded,2018-07-12T17:48:22Z,2200,6,BF,Food,"Haoua N°2 is 40 years old, married, and the m...",to pay for vegetables to sell.
202,1563224,950,2,funded,2018-07-12T17:48:23Z,950,6,BF,Food,Bintou just finished her first Kiva loan. The ...,"to buy bags of pearl millet, sugar, and powder..."


# Logistic Regression

In [13]:
# Prepare the data: tokenize with nltk, drop English stop words, and count words
# (no lemmatization is applied here)
vectorizer = CountVectorizer(tokenizer=word_tokenize, stop_words = 'english')
X = vectorizer.fit_transform(data_final['use'].values)
y = data_final['sector']
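
To make the vectorization step concrete, here is a minimal standard-library sketch of what a bag-of-words count matrix looks like. The two sentences, the tiny stop-word set, and the whitespace `tokenize` helper are all made up for illustration; this is not how `CountVectorizer` or `word_tokenize` are implemented.

```python
from collections import Counter

# Toy loan "use" strings (made up for illustration, not taken from loans.csv)
docs = [
    "to buy vegetables to sell",
    "to buy a sewing machine",
]
# Tiny hypothetical stop-word list; scikit-learn's 'english' list is much longer
stop_words = {"to", "a"}

def tokenize(text):
    """Lowercase and split on whitespace (far simpler than nltk's word_tokenize)."""
    return [tok for tok in text.lower().split() if tok not in stop_words]

# Vocabulary: every distinct token, sorted so the columns have a stable order
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocab])

print(vocab)   # ['buy', 'machine', 'sell', 'sewing', 'vegetables']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 0]]
```

Each row plays the role of one row of `X`, and each column corresponds to one word of the vocabulary.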

In [15]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
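
Note that the vectorizer above is fitted on every row before the split, so its vocabulary also contains words that only occur in the test rows. One possible way to avoid this, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn pipeline. The toy strings, labels, and variable names below are assumptions for illustration and do not come from `loans.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up "use" strings and sectors, for illustration only
toy_texts = [
    "to buy vegetables to sell", "to buy rice and sugar",
    "to buy maize flour to sell", "to buy beans and cooking oil",
    "to buy a sewing machine", "to buy hair salon equipment",
    "to repair a motorcycle taxi", "to buy tools for a repair shop",
]
toy_labels = ["Food"] * 4 + ["Services"] * 4

# Split the raw text first, so the test rows stay unseen
text_train, text_test, label_train, label_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

# The pipeline fits the vectorizer on the training text only
pipe = make_pipeline(CountVectorizer(stop_words='english'), LogisticRegression())
pipe.fit(text_train, label_train)

# The test text is transformed with the vocabulary learned from the training text
toy_preds = pipe.predict(text_test)
print(list(zip(text_test, toy_preds)))
```

The Naive Bayes section further down follows the same idea by calling `fit_transform` on the training rows and `transform` on the test rows.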

In [78]:
# Train the model
# solver='liblinear' is set explicitly: the L1 penalty needs a solver that supports it,
# and the default solver in recent scikit-learn versions (lbfgs) does not
logistic_model = linear_model.LogisticRegression(C=1, penalty='l1', solver='liblinear')
logistic_model.fit(X_train, y_train)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [79]:
# Test the model
preds_lr = logistic_model.predict(X_test)

In [80]:
# Check the accuracy
accuracy_score(y_test, preds_lr)

0.7992957746478874
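
As a reminder of what this number means, accuracy is the fraction of test rows whose predicted label equals the true label. Below is a minimal sketch with made-up labels; the `accuracy` helper is hypothetical and only mirrors the definition, while `accuracy_score` is documented as taking `(y_true, y_pred)` in that order.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels for illustration
y_true = ["Food", "Retail", "Food", "Arts", "Services"]
y_pred = ["Food", "Food", "Food", "Arts", "Retail"]

print(accuracy(y_true, y_pred))  # 3 of 5 labels match -> 0.6
```

Plain accuracy gives the same value if the two arguments are swapped, but most other metrics do not, so keeping the true labels first is a safe habit.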

# Naive Bayes

In [81]:
# Split the data into train and test sets
train, test = train_test_split(data_final, test_size=0.2, random_state=42)

In [86]:
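# Prepare the data: tokenize with nltk, drop English stop words, and count words
# (no lemmatization is applied here); the vectorizer is fitted on the training rows only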
vectorizer = CountVectorizer(tokenizer=word_tokenize, stop_words = 'english')
train_features = vectorizer.fit_transform(train['use'])
test_features =  vectorizer.transform(test['use'])

In [87]:
# Train the model
nb_model = MultinomialNB()
nb_model.fit(train_features, train['sector'])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [88]:
# Test the model
preds_nb = nb_model.predict(test_features)

In [89]:
# Check the accuracy
accuracy_score(test['sector'], preds_nb)

0.7834507042253521
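
To see what the `alpha=1.0` in the `MultinomialNB` output refers to, here is a small from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The training documents and all helper names are made up for illustration, and this is a simplified version of the idea rather than scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

# Toy training data (made up for illustration): (tokens, sector)
train_docs = [
    (["buy", "vegetables", "sell"], "Food"),
    (["buy", "rice", "sugar"], "Food"),
    (["buy", "sewing", "machine"], "Services"),
]
alpha = 1.0  # add-one smoothing, the same value shown in the MultinomialNB output

class_counts = Counter(label for _, label in train_docs)
word_counts = defaultdict(Counter)
for doc_tokens, label in train_docs:
    word_counts[label].update(doc_tokens)
vocab = {tok for doc_tokens, _ in train_docs for tok in doc_tokens}

def log_posterior(doc_tokens, label):
    """Unnormalized log P(label | tokens) under the multinomial model."""
    log_prob = math.log(class_counts[label] / len(train_docs))  # class prior
    total = sum(word_counts[label].values())
    for tok in doc_tokens:
        if tok not in vocab:
            continue  # words never seen in training are ignored
        # smoothed estimate of P(word | class)
        log_prob += math.log((word_counts[label][tok] + alpha)
                             / (total + alpha * len(vocab)))
    return log_prob

def predict(doc_tokens):
    """Return the class with the highest log posterior."""
    return max(class_counts, key=lambda label: log_posterior(doc_tokens, label))

print(predict(["buy", "sugar"]))       # Food
print(predict(["sewing", "machine"]))  # Services
```

Without smoothing, a word that never appears with a class would give that class a probability of zero, no matter what the other words say.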

# Comparing models

In [90]:
# Compare the models using an accuracy score
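def compare_models(logistic_model, naive_bayes_model):
    """Return the test-set accuracy of both fitted models as (logistic, naive Bayes).

    Uses the test features and labels prepared in the cells above.
    """
    logistic_accuracy = accuracy_score(y_test, logistic_model.predict(X_test))
    naive_bayes_accuracy = accuracy_score(test['sector'], naive_bayes_model.predict(test_features))
    return logistic_accuracy, naive_bayes_accuracy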
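
A single accuracy number can hide large differences between sectors. As a rough sketch with made-up labels, the hypothetical helper below computes, for each true class, the fraction of its rows that a model gets right (per-class recall).

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

# Made-up labels for illustration
true_labels = ["Food", "Food", "Food", "Retail", "Retail", "Arts"]
model_a = ["Food", "Food", "Food", "Food", "Food", "Food"]      # always predicts Food
model_b = ["Food", "Food", "Retail", "Retail", "Food", "Arts"]  # spreads its guesses

print(per_class_recall(true_labels, model_a))
print(per_class_recall(true_labels, model_b))
```

Here `model_a` reaches 50% accuracy while never predicting two of the three sectors, which the per-class view makes visible.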

# Qualitative checks for understanding

**1) What is one way you could improve the model to better predict the sectors?**

*TYPE YOUR RESPONSES HERE*

**2) What are the logistic regression assumptions we need to check for? Have we met them?**

*TYPE YOUR RESPONSES HERE*

**3) How can we interpret the logistic regression coefficients in this case?**

*TYPE YOUR RESPONSES HERE*

# Test answers

In [92]:
# Test the accuracy of the models
compare_models(logistic_model, nb_model) == (0.7992957746478874, 0.7834507042253521)

True

In [75]:
# Pull the wrong predictions rows for logistic regression
msk = preds_lr != y_test
wrong_predictions = pd.DataFrame(y_test[msk], columns = ['sector'])
pd.merge(wrong_predictions, data_final, left_index=True, right_index=True)

Unnamed: 0,sector_x,id_number,loan_amount,lender_count,status,funded_date,funded_amount,repayment_term,location_country_code,sector_y,description,use
4345,Services,969071,900,33,funded,2015-10-28T14:25:04Z,900,25,ZA,Services,"Mabel,51, is married and a mother to 1 girl an...",to buy knitting sets and a new sewing machine ...
732,Retail,1543044,450,17,funded,2018-06-18T13:30:46Z,450,9,CM,Retail,Abubakar is a young man of 36. He is single an...,to buy hair extensions and other beauty products.
2699,Food,1546977,3750,127,funded,2018-06-21T08:32:37Z,3750,6,MW,Food,Martha is a member of the Kachere Group and is...,"to pay for more sugar, soap, salt, squash, and..."
3588,Food,1548097,3725,126,funded,2018-06-22T15:43:30Z,3725,6,RW,Food,Greetings from Rwanda! This is Abishyizehamwe ...,"to buy more sugar, salt and rice to sell at th..."
4238,Health,1352912,50000,1646,funded,2017-08-22T18:16:34Z,50000,20,ZA,Health,The Problem<P>In developing countries like Ind...,to fund primary healthcare projects in South A...
716,Retail,1548927,550,18,funded,2018-07-14T03:30:48Z,550,15,CM,Retail,Hermine Carole is a young woman aged 34. She l...,to buy business supplies.
3199,Services,1539030,525,16,funded,2018-05-29T23:19:15Z,525,18,MZ,Services,Mariamo is 27 years old. She is the single mot...,Mariamo to buy equipment for preschool.
950,Retail,620547,5550,160,funded,2013-10-24T13:37:00Z,5550,6,CG,Retail,Basile is one of the members of the group Que ...,to buy cartons of milk and large sacks of sugar.
5829,Retail,1572584,1400,0,fundraising,,0,9,ZW,Retail,Judith is a 44 year old mother of four. Her hu...,"to buy clothes, shoes and blankets."
1024,Food,1543579,2650,82,funded,2018-07-01T19:11:44Z,2650,6,CD,Food,"Noella, who is a customer of IMF Hekima, is th...","to buy 300 kilos of beans, including transport..."


In [40]:
# Pull the wrong prediction rows for naive bayes
msk = preds_nb != test['sector']
test[msk]

Unnamed: 0,id_number,loan_amount,lender_count,status,funded_date,funded_amount,repayment_term,location_country_code,sector,description,use
3092,1557041,850,20,fundraising,,575,21,MZ,Housing,"Hemitério is 47 years old, a teacher, married ...","to buy cement, doors, locks and a toilet seat ..."
5302,1547138,300,10,funded,2018-06-14T12:20:12Z,300,10,CD,Construction,Léa is a seller of juices of a various qualiti...,to buy renovation materials for the house and ...
1496,1566087,275,6,funded,2018-07-15T10:07:51Z,275,9,GH,Manufacturing,"Stephen is 36 years old, married and has three...","to buy a welding machine, grinding machine and..."
2598,1562318,900,32,funded,2018-07-11T13:35:11Z,900,10,MG,Manufacturing,Marie Solange is very happy because the previo...,to buy a new industrial sewing machine and 50 ...
299,1557395,450,18,funded,2018-06-28T12:45:59Z,450,10,BF,Clothing,"Pagomdgoalma is the leader of the ""Sebe Allahy...",to buy 20 lots of pagnes.
4508,807625,800,30,funded,2014-12-03T20:57:01Z,800,7,SS,Retail,"A loan of 2,500 SSP helps James to buy differe...",To purchase more goods to sell.
1504,1563158,1050,34,funded,2018-07-11T17:51:00Z,1050,8,GH,Education,Charles is sixty-two years of age. He lives in...,to buy furniture for his school.
572,1035058,2600,85,funded,2016-03-31T00:54:33Z,2600,8,BI,Clothing,Joseph is a member of the Vehasi group and he ...,to increase his capital and buy large quantiti...
375,1534649,725,2,funded,2018-05-24T19:55:29Z,725,5,BF,Clothing,The group named “Nerwaya” just finished its fi...,to purchase traditional fabric for resale.
1620,1294308,50000,1706,funded,2017-05-18T05:06:35Z,50000,10,CI,Agriculture,CAJU is a nut processing business in rural Ivo...,double cashew nut export output and hire about...


# Appendix code

In [76]:
def wm2df(wm, feat_names):
    """Convert a sparse document-term matrix into a DataFrame with one row per document."""
    # create an index label for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df

In [38]:
# retrieve the terms found in the corpus
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there)
tokens = vectorizer.get_feature_names()

# create a dataframe from the matrix
wm2df(test_features, tokens)

Unnamed: 0,&,','','ll,'pagne,'s,(,),",",",10",...,zamble,zealous,zimbabwe,zinc,zippers,zitenje,zucchini,’,“,”
Doc0,0,0,0,0,0,0,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
Doc1,0,0,0,0,0,2,0,0,5,0,...,0,0,0,0,0,0,0,0,0,0
Doc2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Doc3,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
Doc4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Doc5,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
Doc6,0,0,0,0,0,0,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
Doc7,0,0,0,0,0,0,0,0,4,0,...,0,0,0,0,0,0,0,0,0,0
Doc8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Doc9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Sources: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f, https://stackoverflow.com/questions/29067434/using-sklearn-logistic-regression-on-new-text-when-countvectorizer-has-been-used, https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af