# Aoife Duna and Alec Schneider
# General Assembly Data Science 1-21-2020

This notebook uses the Kiva data set from the "Data Science for Good: Kiva Crowdfunding" Kaggle competition, available at https://www.kaggle.com/kiva/data-science-for-good-kiva-crowdfunding.

Using three different models, we attempt to predict:
1. Whether or not the loan was fully funded
2. If the loan was fully funded, how quickly it was funded
3. If the loan was not fully funded, what percentage of the loan was funded

id: Unique ID for loan
funded_amount: The amount disbursed by Kiva to the field agent (USD)
loan_amount: The amount disbursed by the field agent to the borrower (USD)
activity: More granular category
sector: High level category
use: Exact usage of loan amount
country_code: ISO country code of country in which loan was disbursed
country: Full country name of country in which loan was disbursed
region: Full region name within the country
currency: The currency in which the loan was disbursed
partner_id: ID of partner organization
posted_time: The time at which the loan is posted on Kiva by the field agent
disbursed_time: The time at which the loan is disbursed by the field agent to the borrower
funded_time: The time at which the loan posted to Kiva gets funded by lenders completely
term_in_months: The duration for which the loan was disbursed in months
lender_count: The total number of lenders that contributed to this loan
tags: Hashtags added to the funding request
borrower_genders: Comma separated M,F letters, where each instance represents a single male/female in the group
repayment_interval: Interval at which the loan is expected to be repaid
date

In [89]:
import pandas as pd
import numpy as np

In [90]:
kiva = pd.read_csv('/Users/aoifeduna/AoifeRepo/kiva_loans.csv')

In [91]:
kiva.head()

Unnamed: 0,id,funded_amount,loan_amount,activity,sector,use,country_code,country,region,currency,partner_id,posted_time,disbursed_time,funded_time,term_in_months,lender_count,tags,borrower_genders,repayment_interval,date
0,653051,300.0,300.0,Fruits & Vegetables,Food,"To buy seasonal, fresh fruits to sell.",PK,Pakistan,Lahore,PKR,247.0,2014-01-01 06:12:39+00:00,2013-12-17 08:00:00+00:00,2014-01-02 10:06:32+00:00,12.0,12,,female,irregular,2014-01-01
1,653053,575.0,575.0,Rickshaw,Transportation,to repair and maintain the auto rickshaw used ...,PK,Pakistan,Lahore,PKR,247.0,2014-01-01 06:51:08+00:00,2013-12-17 08:00:00+00:00,2014-01-02 09:17:23+00:00,11.0,14,,"female, female",irregular,2014-01-01
2,653068,150.0,150.0,Transportation,Transportation,To repair their old cycle-van and buy another ...,IN,India,Maynaguri,INR,334.0,2014-01-01 09:58:07+00:00,2013-12-17 08:00:00+00:00,2014-01-01 16:01:36+00:00,43.0,6,"user_favorite, user_favorite",female,bullet,2014-01-01
3,653063,200.0,200.0,Embroidery,Arts,to purchase an embroidery machine and a variet...,PK,Pakistan,Lahore,PKR,247.0,2014-01-01 08:03:11+00:00,2013-12-24 08:00:00+00:00,2014-01-01 13:00:00+00:00,11.0,8,,female,irregular,2014-01-01
4,653084,400.0,400.0,Milk Sales,Food,to purchase one buffalo.,PK,Pakistan,Abdul Hakeem,PKR,245.0,2014-01-01 11:53:19+00:00,2013-12-17 08:00:00+00:00,2014-01-01 19:18:51+00:00,14.0,16,,female,monthly,2014-01-01


In [92]:
# We need to create all of the variables necessary to do our analysis

In [93]:
kiva['funded_time'] = pd.to_datetime(kiva['funded_time'])

In [94]:
kiva['posted_time'] = pd.to_datetime(kiva['posted_time'])

In [95]:
kiva['TimetoFund'] = kiva['funded_time'] - kiva['posted_time']

In [96]:
kiva['TimetoFundMinutes'] = kiva['TimetoFund'] / np.timedelta64(1, 'm')
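
The conversion from timestamps to minutes can be checked on a tiny frame. This is only a sketch: the `loans` frame and its second row are made up for illustration, while the first row reuses the timestamps of the first loan shown in `kiva.head()`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): row 0 copies the first loan above, row 1 was never fully funded.
loans = pd.DataFrame({
    "posted_time": ["2014-01-01 06:12:39+00:00", "2014-01-01 09:00:00+00:00"],
    "funded_time": ["2014-01-02 10:06:32+00:00", None],
})
loans["posted_time"] = pd.to_datetime(loans["posted_time"], utc=True)
loans["funded_time"] = pd.to_datetime(loans["funded_time"], utc=True)

# Subtracting two datetime columns gives a timedelta column ...
elapsed = loans["funded_time"] - loans["posted_time"]
# ... and dividing by a one-minute timedelta turns it into float minutes.
loans["minutes"] = elapsed / np.timedelta64(1, "m")
print(loans["minutes"])
```

A missing `funded_time` becomes `NaT` and then `NaN` minutes, which is why the time-to-fund column contains nulls that have to be handled before modeling.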

In [97]:
kiva['PercentFunded'] = kiva['funded_amount'] / kiva['loan_amount']

In [98]:
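# Loans without a funded_time have no time to fund; fill those nulls with the mean
kiva['TimetoFundMinutes'] = kiva['TimetoFundMinutes'].fillna(kiva['TimetoFundMinutes'].mean())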

In [99]:
# The most common value is a single female borrower; fill the nulls with that mode
kiva['borrower_genders'] = kiva['borrower_genders'].fillna("female")

In [100]:
kiva['NumberofBorrowers'] = kiva['borrower_genders'].str.split().str.len()

In [101]:
kiva['NumberofFemaleBorrowers'] = kiva.borrower_genders.str.count("female")

In [102]:
kiva['PercentFemaleBorrowers'] = kiva['NumberofFemaleBorrowers'] / kiva['NumberofBorrowers']
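
To see how the three gender features behave on a group loan, here is a small sketch on made-up `borrower_genders` strings (the `genders` series is illustrative, not taken from the file).

```python
import pandas as pd

# Hypothetical values in the same "comma separated" layout as borrower_genders.
genders = pd.Series(["female", "female, female", "male, female, male"])

# Splitting on whitespace yields one token per borrower ("female," still counts once).
n_borrowers = genders.str.split().str.len()
# "male" does not contain "female", so a substring count of "female" is safe.
# The reverse is not true: counting "male" would also match inside "female".
n_female = genders.str.count("female")
share_female = n_female / n_borrowers
print(share_female.round(2).tolist())  # [1.0, 1.0, 0.33]
```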

In [103]:
# Tags are comma separated and a single tag can contain spaces, so split on commas
kiva['NumberofTags'] = kiva['tags'].str.split(',').str.len()

In [104]:
# Loans without tags get a count of zero
kiva['NumberofTags'] = kiva['NumberofTags'].fillna(0)
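
The delimiter used to split `tags` decides the count. The sketch below compares whitespace and comma splitting on made-up tag strings. The multi-word tag is an assumption about what a tag can look like, not a value copied from the rows shown above.

```python
import pandas as pd

# Hypothetical tag strings: a repeated tag, a multi-word tag, and a missing value.
tags = pd.Series(["user_favorite, user_favorite", "#Woman Owned Biz, #Parent", None])

# Whitespace splitting counts every word, so a multi-word tag is counted several times.
by_space = tags.str.split().str.len().fillna(0)
# Comma splitting counts one entry per tag.
by_comma = tags.str.split(",").str.len().fillna(0)

print(pd.DataFrame({"tags": tags, "by_space": by_space, "by_comma": by_comma}))
```

The two counts agree whenever no tag contains a space and differ otherwise.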

In [105]:
kiva['CountWordsinDesc'] = kiva['use'].str.split().str.len()

In [106]:
kiva['PostedDayofWeek'] = kiva.posted_time.dt.day_name()

In [107]:
# posted_time is already a datetime column, so the hour can be read directly
kiva['PostedTimeofDay'] = kiva['posted_time'].dt.hour
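
A quick check of the two calendar features on illustrative timestamps (the `posted` series is made up; its first value matches the first loan above):

```python
import pandas as pd

# Hypothetical posting times in the same UTC format as posted_time.
posted = pd.to_datetime(
    pd.Series(["2014-01-01 06:12:39+00:00", "2014-01-04 23:05:00+00:00"]),
    utc=True,
)

print(posted.dt.day_name().tolist())  # ['Wednesday', 'Saturday']
print(posted.dt.hour.tolist())        # [6, 23]
```

Both features are derived from UTC timestamps, so the hour reflects UTC rather than the borrower's local time.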

In [108]:
# 1 if the loan has a partner_id, 0 otherwise; Series.gt treats NaN as False without a warning
kiva['PartnerPresent'] = kiva['partner_id'].gt(0).astype(np.uint8)

In [109]:
# Loans without a description get a word count of zero
kiva['CountWordsinDesc'] = kiva['CountWordsinDesc'].fillna(0)

In [110]:
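# 1 if the loan was fully funded (funded amount covers the requested amount), 0 otherwise
kiva['LoanFunded'] = (kiva['funded_amount'] >= kiva['loan_amount']).astype(np.uint8)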

In [111]:
kiva['LoanFunded'] = kiva['LoanFunded'].fillna(0)

In [118]:
# Build the feature matrix once so that each model can reuse it
X_matrix = kiva.drop(
    ['use', 'country_code', 'region', 'currency', 'date', 'TimetoFund', 'tags',
     'posted_time', 'disbursed_time', 'funded_time', 'NumberofFemaleBorrowers',
     'borrower_genders', 'partner_id', 'TimetoFundMinutes', 'id', 'PercentFunded', 'LoanFunded'],
    axis=1)
# These columns are dropped because they are targets, identifiers, raw inputs to derived
# features, or duplicated by another column (country_code by country, for example)

In [119]:
y_matrix = kiva[['LoanFunded', 'PercentFunded', 'TimetoFundMinutes']]
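
Because the second and third questions are conditional on the funding outcome, each target naturally applies to a different subset of rows. The following is a minimal sketch of that selection, not the notebook's own approach. It assumes `LoanFunded` marks fully funded loans, and the toy frame `y` holds made-up values.

```python
import pandas as pd

# Hypothetical targets: two fully funded loans and two that were not.
y = pd.DataFrame({
    "LoanFunded": [1, 1, 0, 0],
    "PercentFunded": [1.0, 1.0, 0.4, 0.0],
    "TimetoFundMinutes": [1673.9, 363.5, float("nan"), float("nan")],
})

funded = y["LoanFunded"] == 1
y_class = y["LoanFunded"]                     # question 1: every loan
y_speed = y.loc[funded, "TimetoFundMinutes"]  # question 2: fully funded loans only
y_share = y.loc[~funded, "PercentFunded"]     # question 3: loans not fully funded

print(len(y_class), len(y_speed), len(y_share))  # 4 2 2
```

The same boolean mask can index the feature matrix, which keeps features and targets aligned for each model.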

In [120]:
X_matrix.head()

Unnamed: 0,funded_amount,loan_amount,activity,sector,country,term_in_months,lender_count,repayment_interval,NumberofBorrowers,PercentFemaleBorrowers,NumberofTags,CountWordsinDesc,PostedDayofWeek,PostedTimeofDay,PartnerPresent
0,300.0,300.0,Fruits & Vegetables,Food,Pakistan,12.0,12,irregular,1,1.0,0.0,7.0,Wednesday,6,1
1,575.0,575.0,Rickshaw,Transportation,Pakistan,11.0,14,irregular,2,1.0,0.0,11.0,Wednesday,6,1
2,150.0,150.0,Transportation,Transportation,India,43.0,6,bullet,1,1.0,2.0,17.0,Wednesday,9,1
3,200.0,200.0,Embroidery,Arts,Pakistan,11.0,8,irregular,1,1.0,0.0,12.0,Wednesday,8,1
4,400.0,400.0,Milk Sales,Food,Pakistan,14.0,16,monthly,1,1.0,0.0,4.0,Wednesday,11,1


In [122]:
X_matrix.columns

Index(['funded_amount', 'loan_amount', 'activity', 'sector', 'country',
       'term_in_months', 'lender_count', 'repayment_interval',
       'NumberofBorrowers', 'PercentFemaleBorrowers', 'NumberofTags',
       'CountWordsinDesc', 'PostedDayofWeek', 'PostedTimeofDay',
       'PartnerPresent'],
      dtype='object')

In [129]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from category_encoders import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

lr, ridge, lasso = LinearRegression(), Ridge(), Lasso()
ohe = OneHotEncoder()
sc = StandardScaler()
rf = RandomForestRegressor()

In [130]:
ridge_pipe = make_pipeline(ohe, sc, ridge)
lasso_pipe = make_pipeline(ohe, sc, lasso)
rf_pipe = make_pipeline(ohe, rf)
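
One way such a pipeline could be scored is with `cross_val_score`. The sketch below is self-contained and runs on synthetic data. It swaps in scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` instead of the `category_encoders` encoder used above, and the frame `X`, the target `y`, and the names `encode` and `pipe` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the feature matrix: two numeric columns and one categorical.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "loan_amount": rng.uniform(100, 1000, n),
    "term_in_months": rng.integers(6, 36, n).astype(float),
    "sector": rng.choice(["Food", "Arts", "Transportation"], n),
})
# Synthetic target that depends on loan_amount plus noise.
y = 0.01 * X["loan_amount"] + rng.normal(0, 1, n)

# One-hot encode the categorical column and standardize the remaining numeric ones.
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"])],
    remainder=StandardScaler(),
)
pipe = make_pipeline(encode, Ridge())

# 3-fold cross-validated R^2; the pipeline is refit on each training fold.
scores = cross_val_score(pipe, X, y, cv=3, scoring="r2")
print(scores.mean())
```

Putting the encoder and scaler inside the pipeline means they are fit only on each training fold, so no information from the held-out fold leaks into preprocessing.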