## Hi there!
If this is in any way helpful to you, please upvote. Thanks for dropping by!

![donorschoose.org logo](https://cdn.donorschoose.net/images/logo/dc-logo-tagline.png)

# Executive Summary:

This notebook explores DonorsChoose.org data and develops a simple recommendation system for first-time donors. KMeans clustering identifies latent groupings, or profiles, of projects, and each new project is classified into one of the identified profiles. A web-based API running the classifier outputs lists of Donor IDs segmented by location (using `Donor State` as a proxy variable), by organization loyalty (using `Donation Included Optional Donation` as a proxy variable), and by donation timing (using the difference between `Donation Received Date` and `Project Posted Date` as a derived proxy variable). This segmentation by location, organization loyalty, and donation timing draws on related literature on giving preferences shaped by proximity and personal background<sup> [1] </sup>, support for the crowdfunding host as an indicator of donor retention or long-term repeat donations<sup> [2] </sup>, and the effect of timing on when donors actually donate<sup> [3] </sup>. Following the methodology illustrated in the flowchart below, a working recommender system was developed with four identified latent project profiles and corresponding clusters of potential repeat donors.

![Methodology](https://tjalba.files.wordpress.com/2018/06/screen-shot-2018-06-19-at-2-02-11-pm.png?w=900)

![Recommender API](https://tjalba.files.wordpress.com/2018/06/screen-shot-2018-06-21-at-12-06-09-am.png?w=720)

**References:**  
<sup> [1] </sup> Breeze, B. (2013) How donors choose charities: the role of personal taste and experiences in giving decisions. *Voluntary Sector Review*, *Vol. 4*, (2), pp. 165-183  
<sup> [2] </sup> Althoff, T. and Leskovec, J. (2015) Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org. *ACM*  
<sup> [3] </sup> Solomon, J., et al. (2015) Don’t Wait! How Timing Affects Coordination of Crowdfunding Donations. *ACM*

# Outline: 
1. [About Donorschoose.org](##About Donorschoose.org)
2. [Loading the Libraries](##Loading the Libraries)
3. [About the Data](##About the Data)
4. [Questions we would like to ask](##Questions we would like to ask)
5. [Exploring the Data](##Exploring the Data)  
5.1. [Teachers Dataset](##Teachers Dataset)  
    5.1.1. [Teacher Dataset preprocessing](####Teacher Dataset preprocessing)  
    5.1.2. [Teacher ID duplication check](####Teacher ID duplication check)  
    5.1.3. [Teacher Prefix distribution](#### Teacher Prefix distribution)    
    5.1.4. [Teacher project creation trend over time](#### Teacher project creation trend over time)  
    5.1.4.1. [Most popular days of the week for Teacher project creation](####Most popular days of the week for Teacher project creation)  
    5.1.4.2. [Most popular months of the year for Teacher project creation](####Most popular months of the year for Teacher project creation)  
    5.1.4.3. [Teacher project creation change over time](####Teacher project creation change over time)  
    5.1.4.4. [Teacher project creation change over time by Teacher Prefix](#### Teacher project creation change over time by Teacher Prefix)  
    5.1.4.5. [Teacher project creation change over time by Teacher Gender](#### Teacher project creation change over time by Teacher Gender)  
    5.1.4.6. [Teacher project creation change over time in 2018 by Teacher Prefix](#### Teacher project creation change over time in 2018 by Teacher Prefix)  
    
    5.2. [Schools Dataset](##Schools Dataset)  
    5.2.1. [Schools Metro Type distribution](####Schools Metro Type distribution)  
    5.2.2. [Schools Percentage Free Lunch distribution](#### Schools Percentage Free Lunch distribution)  
    5.2.3. [Schools State distribution](#### Schools State distribution)  
    
    5.3. [Donors Dataset](## Donors Dataset)  
    5.3.1. [Donor-Teacher distribution](## Donor-Teacher distribution)  
    5.3.2. [Donors State distribution](#### Donors State distribution)  
    5.3.3. [Donors City distribution](#### Donors City distribution)  
    5.3.4. [Donation Time per Location of Donor](#### Donation Time per Location of Donor)
  
    5.4. [Donations Dataset](## Donations Dataset)  
    5.4.1. [Donations Amount distribution](#### Donations Amount distribution)  
    5.4.2. [Donations Included Optional Donation distribution](#### Donations Included Optional Donation distribution)  
    5.4.3. [Donations received trend over time in 2018](#### Donations received trend over time in 2018)  
    
    5.5. [Resources Dataset](## Resources Dataset)  
    5.5.1. [Resources Unit Price distribution](#### Resources Unit Price distribution)  
    5.5.2. [Resource Unit Price vs Resource Quantity](#### Resource Unit Price vs Resource Quantity)  
    
    5.6. [Projects Dataset](## Projects Dataset)  
    5.6.1. [Projects Dataset preprocessing](#### Projects Dataset preprocessing)  
    5.6.2. [Unique Projects by Teacher ID](#### Unique Projects by Teacher ID)  
    5.6.3. [Projects Type distribution](#### Projects Type distribution)  
    5.6.4. [Projects Subject Category Tree distribution](#### Projects Subject Category Tree distribution)  
    5.6.5. [Projects Grade Level Category distribution](#### Projects Grade Level Category distribution)  
    5.6.6. [Projects Resource Category distribution](#### Projects Resource Category distribution)  
    5.6.7. [Projects Current Status distribution](#### [Projects Current Status distribution])  
    5.6.8. [Projects Costs distribution](#### Projects Costs distribution)  
    5.6.9. [Number of days before project gets fully funded](#### Number of days before project gets fully funded)    
6.  [Merging the Datasets](##Merge Dataset)  
6.1. [Data Wrangling](## Data Wrangling)  
6.1.1. [Sampling 10000 rows](#### Sampling 10000 rows)  
6.1.2. [Identifying useful and not so useful features](#### Identifying useful and not so useful features)  
6.1.3. [Dropping not so useful features](#### Dropping not so useful features)  
6.1.4. [Dropping null values](#### Dropping null values)  
6.1.5. [Encoding Labels for Categorical Variables](#### Encoding Labels for Categorical Variables)  
6.1.6. [Encoding Datetime Features](#### Encoding Datetime Features)  

7. [Building the Recommendation System](# Building the Recommendation System)  
7.1. [Identifying Item Profiles](## Identifying Item Profiles (Projects)  
    7.1.1. [Creating the feature vector for previous projects for KMeans Clustering](#### Creating the feature vector for previous projects for KMeans Clustering)  
    7.1.2. [Initial KMeans Clustering with number of neighbors set to 5](#### Initial KMeans Clustering with number of neighbors set to 5)  
    7.1.3. [Finding the optimal number of clusters](#### Finding the optimal number of clusters)  
    7.1.4. [Plotting the first two components of the clusters](#### Plotting the first two components of the clusters)    
    7.1.5. [Storing Donor IDs per Cluster](#### Storing Donor IDs per Cluster)  
    7.1.6. [Labeling donors based on project clusters](#### Labeling donors based on project clusters)  
    7.1.7. [Visualizing the Clusters](#### Visualizing the Clusters)    
    
    7.2. [Classifying New Unobserved Items (New Projects)](## Classifying New Unobserved Items (New Projects)     
    7.3. [Filtering User Profiles](## Filtering User Profiles)    
    7.4. [Storing Donor ID and other Donor information in dictionaries, by clusters](## Storing Donor ID and other Donor information in dictionaries, by clusters)
8. [Demonstration](# Demonstration)
9. [Sample Web App](# Sample Web App)  
10. [Summary](# Summary)
11. [References](# References)  
12. [Collaborators](# Collaborators)  
13. [Acknowledgements](# Acknowledgements)

# Key Insights from Data:

Projects
- The number of first-time project postings has grown rapidly since 2002, peaking at around 80,000 projects per year in 2016.
- Requested items are mostly inexpensive: most cost less than 20 dollars, with a few outliers costing almost 100,000 dollars.
- Cheaper resource items tend to be requested in larger quantities.
- The majority of projects are teacher-led.
- The top projects deal with literacy and language.
- The top grade level category for projects is Grades Pre-K-2.
- The top projects requested books, supplies, or technology.
- Most projects are already fully funded.
- The average cost of a project is around 741 dollars, pulled up by a few high-value projects, the largest costing 255,737 dollars. 75% of projects cost less than 868 dollars.
- On average, it takes only about a month (~32.07 days) for a project to get fully funded. 75% of projects get funded within 50 days.

Teachers
- Most teachers who create projects have the prefix `Mrs.` or `Ms.`
- From this we can infer that most teachers who create projects are female.
- Sunday is the most popular day for teachers to post their first project (73,410 projects posted).
- September is the most popular month for teachers to post their first project (59,495 projects posted).

Schools
- Most beneficiary schools are in `suburban` or `urban` communities.
- Most beneficiary schools serve many students who receive free lunch: the median school has 61% of its student population receiving free lunch.
- California has the largest number of beneficiary schools, while Wyoming has the fewest.

Donors
- A handful of donors are teachers, but most are not.
- California has the largest number of donors, while Wyoming has the fewest.
- Chicago is home to the largest number of donors, followed by the eastern city of New York and western cities in California such as San Francisco and Los Angeles.
- Swarm plots of donations over time show that donors from the same city tend to donate in bursts within a short time period.
- Most donations are less than 20 dollars, with very few large individual ones at 400 or 500 dollars.
- Most donations include an optional donation to DonorsChoose.org.
- A peak in donations received in 2018 was observed on January 26.


## About DonorsChoose.org
DonorsChoose.org is an online charity platform dedicated to supporting K-12 public education in the U.S. Briefly, it is a crowdfunding site where teachers can post project requests and where donors can donate to help fund the teachers' educational causes. Since its founding in 2000, the platform has raised over $685 million for 1.1 million projects from over 3 million people and partners. Moreover, teachers from almost 75% of all public schools in the U.S. have sought the help of DonorsChoose.org in raising funds for their projects, making the platform the premier website for supporting education.

Currently, teachers still spend over a billion dollars out of pocket for their students' needs. In order for students to get what they need to learn, DonorsChoose.org must be able to encourage its roster of first time donors to donate again to projects that inspire them most.

## Loading the Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
from collections import OrderedDict
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
color = sns.color_palette()
from numpy import array
from matplotlib import cm
from scipy.misc import imread
import base64
from sklearn import preprocessing
from mpl_toolkits.basemap import Basemap
from wordcloud import WordCloud, STOPWORDS
import plotly.plotly as py1
import plotly.offline as py
py.init_notebook_mode(connected=True)
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.offline as offline
offline.init_notebook_mode()
from plotly import tools


import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
df_donations = pd.read_csv('../input/io/Donations.csv')
df_donors = pd.read_csv('../input/io/Donors.csv')
df_resources = pd.read_csv('../input/io/Resources.csv')
df_schools = pd.read_csv('../input/io/Schools.csv')
df_teachers = pd.read_csv('../input/io/Teachers.csv')
df_projects = pd.read_csv('../input/io/Projects.csv')

## About the Data
The donorschoose.org data contains six separate datasets with the following number of datapoints and features. 

In [3]:
pd.DataFrame({'Dataset':['Donations','Donors','Resources','Schools','Teachers', 'Projects'],
             'Datapoints':[df_donations.shape[0], df_donors.shape[0],df_resources.shape[0],
                     df_schools.shape[0], df_teachers.shape[0], df_projects.shape[0]],
             'Features':[df_donations.shape[1], df_donors.shape[1],df_resources.shape[1],
                     df_schools.shape[1], df_teachers.shape[1], df_projects.shape[1]]})

In [4]:
df_donations.head(3)

In [5]:
df_donors.head(3)

In [6]:
df_resources.head(3)

In [7]:
df_schools.head(3)

In [8]:
df_teachers.head(3)

In [9]:
df_projects.head(3)

## Questions we would like to ask
With the objective of creating a robust recommendation system that effectively targets previous first-time donors with new projects, we would like to ask the data the following questions:
1. **On Donor Preferences/Behavior:** Which projects are funded most by which donors?
2. **On Donation Timing:** Does timing, i.e. project creation date or deadline, affect donation dynamics and donor coordination?
3. **On Recommendation System:** By which features or proxy attributes can the donors and projects be clustered?

## Exploring the Dataset

## Teachers Dataset

#### Teacher Dataset preprocessing
We change `Teacher First Project Posted Date` to datetime format in order to explore time series trends over it.

In [10]:
df_teachers['Teacher First Project Posted Date'] = pd.to_datetime(
    df_teachers['Teacher First Project Posted Date'], errors='coerce')

df_teachers.dtypes

#### Teacher ID duplication check
We check if there are any `Teacher ID` duplicates. 

In [11]:
((df_teachers['Teacher ID'].value_counts().values)>1).any()
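
An equivalent check can be written with `Series.duplicated`, which may read more directly than counting values. The sketch below uses a made-up `toy_teachers` frame in place of `df_teachers`; the IDs are purely illustrative.

```python
import pandas as pd

# Toy stand-in for df_teachers; the IDs are invented for illustration.
toy_teachers = pd.DataFrame({"Teacher ID": ["t1", "t2", "t3", "t2"]})

# True if any Teacher ID occurs more than once.
has_duplicates = toy_teachers["Teacher ID"].duplicated().any()

# keep=False flags every row of a repeated ID, which helps when inspecting them.
duplicate_rows = toy_teachers[toy_teachers["Teacher ID"].duplicated(keep=False)]

print(has_duplicates)
print(duplicate_rows)
```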

#### Teacher Prefix distribution
We check how teachers are distributed according to their designation/prefix, i.e. Mrs., Ms., Mr., Teacher, Dr., or Mx. We see here that Mrs. and Ms. largely dominate the distribution.

In [12]:
plt.figure(figsize=(6,6))
plt.bar(df_teachers['Teacher Prefix'].value_counts().index, 
       df_teachers['Teacher Prefix'].value_counts(),
       color=sns.color_palette('viridis'))
plt.xlabel('Teacher Prefix')
plt.ylabel('Counts')
plt.title('Teacher Prefix Distribution')
plt.tight_layout()

#### Teacher project creation trend over time
We check how teacher project creation has changed since 2002, and specifically in 2018.

In [13]:
df_teachers['weekdays'] = df_teachers['Teacher First Project Posted Date'
                                     ].dt.dayofweek

df_teachers['month'] = df_teachers['Teacher First Project Posted Date'].dt.month

df_teachers['year'] = df_teachers['Teacher First Project Posted Date'].dt.year

weekdays = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',
            4:'Friday',5:'Saturday',6:'Sunday'}

months= {1 :"Jan",2 :"Feb",3 :"Mar",4 :"Apr",5 : "May",6 : "Jun",
          7 : "Jul",8 :"Aug", 9 :"Sep",10 :"Oct",11 :"Nov",12 :"Dec"}

df_teachers['weekdays']=df_teachers['weekdays'].map(weekdays)
df_teachers['month']=df_teachers['month'].map(months)

df_teachers.head(3)
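
As a side note, pandas can produce weekday and month names directly, which avoids maintaining the lookup dictionaries by hand. The sketch below is a minimal illustration on made-up dates (`toy_dates` is hypothetical); it also shows that values coerced to `NaT` simply come out as missing names.

```python
import pandas as pd

# Made-up dates; the last entry is invalid on purpose and becomes NaT.
toy_dates = pd.Series(pd.to_datetime(
    ["2018-01-07", "2018-09-01", "not a date"], errors="coerce"))

# English names by default; NaT yields a missing value.
day_names = toy_dates.dt.day_name()

# Three-letter month abbreviations, matching the 'Jan', 'Feb', ... labels used above.
month_abbr = toy_dates.dt.month_name().str[:3]

print(pd.DataFrame({"weekdays": day_names, "month": month_abbr}))
```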

#### Most popular days of the week for Teacher project creation
We check on which days of the week teachers most often post their first project. The weekend and the days adjacent to it appear to be the most favored: Sunday, Saturday, Monday, and Friday. This seems reasonable given that teachers would most likely have the most free time around the weekend.

In [14]:
plt.figure(figsize=(10,6))

plt.bar(df_teachers['weekdays'].value_counts().index, 
        df_teachers['weekdays'].value_counts(),
        color=sns.color_palette('viridis'))
plt.xlabel('Days of the Week')
plt.ylabel('Counts')
plt.title('Most Popular Days for Teacher Project Creation')
plt.tight_layout()

In [15]:
pd.DataFrame(df_teachers['weekdays'].value_counts())

**Observation:** Top days of the week for teachers posting their first project:
- Sunday: 73,410
- Saturday: 66,105
- Monday: 60,933
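
`value_counts` orders the days by frequency, which is what the ranking above relies on. If a calendar-ordered view is preferred for plotting, one option is to reindex the counts. This is a small sketch on made-up values; `toy_days` and `weekday_order` are illustrative names, not part of the notebook.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Made-up weekday labels standing in for df_teachers['weekdays'].
toy_days = pd.Series(["Sunday", "Monday", "Sunday", "Saturday",
                      "Friday", "Sunday", "Saturday"])

# Sorted by frequency, most common first.
counts = toy_days.value_counts()

# Same counts in calendar order; days that never occur get 0.
calendar_counts = counts.reindex(weekday_order, fill_value=0)
print(calendar_counts)
```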

#### Most popular months of the year for Teacher project creation
We check in which months of the year teachers most often post their first project. There is a clear spike in project creation at the start of the academic year: August, September, and October. Perhaps teachers assess the needs of their students at the start of classes and create projects as they deem necessary. We also observe a steady trend in the latter half of the academic year.

In [16]:
plt.figure(figsize=(10,6))
plt.bar(df_teachers['month'].value_counts().index, 
        df_teachers['month'].value_counts(),
        color=sns.color_palette('plasma'))
plt.xlabel('Months')
plt.ylabel('Counts')
plt.title('Most Popular Months for Teacher Project Creation')
plt.tight_layout()

In [17]:
pd.DataFrame(df_teachers['month'].value_counts())

**Observations:** Top months for teachers posting their first project:
- September: 59,495
- August: 52,141
- October: 48,361

#### Teacher project creation change over time
We check how teacher project creation has changed since 2002. We see here that project creation has increased steadily over the years.

In [18]:
ts = df_teachers.groupby('year').agg({'Teacher ID' : 'count'}).reset_index()
plt.figure(figsize=(10,6))
plt.plot(ts['year'][:-1],ts['Teacher ID'][:-1], 
         color=sns.color_palette('plasma')[0] )
plt.xlabel('Years')
plt.ylabel('Counts')
plt.title('Teacher Project Creation Over Time')
plt.tight_layout()
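
The `[:-1]` slices in the plot call drop the last row of `ts`, presumably because the most recent year in the data is incomplete (an assumption on our part; the notebook does not state the reason). A more explicit way to express the same filter is sketched below on made-up rows.

```python
import pandas as pd

# Made-up rows standing in for df_teachers.
toy = pd.DataFrame({
    "Teacher ID": ["t1", "t2", "t3", "t4"],
    "Teacher First Project Posted Date": pd.to_datetime(
        ["2016-03-01", "2016-09-10", "2017-08-20", "2018-01-05"]),
})
toy["year"] = toy["Teacher First Project Posted Date"].dt.year

# Number of first projects per year, same aggregation as in the cell above.
ts = toy.groupby("year").agg({"Teacher ID": "count"}).reset_index()

# Keep every year except the latest one, which is assumed to be partial.
complete = ts[ts["year"] < ts["year"].max()]
print(complete)
```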

#### Teacher project creation change over time by Teacher Prefix
We check how teacher project creation has changed since 2002 by Teacher Prefix. Consistent with the distribution of teacher prefixes, we observe the same trend, with Mrs. and Ms. consistently dominating the trendlines.

In [19]:
plt.figure(figsize=(10,6))
pref = ['Mrs.','Ms.','Mr.','Teacher','Dr.','Mx.']
for i in range(len(pref)):
    ts = df_teachers[df_teachers['Teacher Prefix'] == pref[i]].groupby('year').agg({'Teacher ID' : 'count'}).reset_index()
    plt.plot(ts['year'][:-1],ts['Teacher ID'][:-1], marker = 'o',markersize = 8,
             color=sns.color_palette('plasma')[i], label = pref[i])
    plt.xlabel('Years')
    plt.ylabel('Counts')
    
plt.legend()
plt.title('Teacher Project Creation Over Time, by Teacher Prefix')
plt.tight_layout()

#### Teacher project creation change over time by Teacher Gender
Using Teacher Prefix as a proxy for teacher gender, we check how teacher project creation has changed since 2002 by gender. Since both Mrs. and Ms. map to Female, we see the same trend observed above.

In [20]:
gender = {'Mrs.':'Female','Ms.':'Female','Mr.':'Male','Dr.':'Unknown','Mx.':'Unknown'}
df_teachers['gender'] = df_teachers['Teacher Prefix']
df_teachers['gender'] = df_teachers['gender'].map(gender)
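
One detail of `Series.map` with a dictionary is worth keeping in mind: any prefix that is not a key in `gender` (for example `Teacher`) becomes `NaN` and therefore does not appear in any of the plotted groups. The sketch below shows this on a few made-up prefixes (`toy_prefixes` is illustrative).

```python
import pandas as pd

# Same mapping as in the cell above.
gender = {'Mrs.': 'Female', 'Ms.': 'Female', 'Mr.': 'Male',
          'Dr.': 'Unknown', 'Mx.': 'Unknown'}

# Made-up prefixes; 'Teacher' is not a key of the mapping.
toy_prefixes = pd.Series(['Mrs.', 'Ms.', 'Mr.', 'Teacher', 'Dr.'])
mapped = toy_prefixes.map(gender)

# dropna=False makes the unmapped rows visible in the tally.
print(mapped.value_counts(dropna=False))
```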

In [21]:
plt.figure(figsize=(10,6))

genders = ['Female','Male','Unknown']

for i in range(len(genders)):
    ts = df_teachers[df_teachers['gender'] == genders[i]].groupby('year').agg({'Teacher ID' : 'count'}).reset_index()
    
    plt.plot(ts['year'][:-1],ts['Teacher ID'][:-1], marker = 'o',markersize = 8,
             color=sns.color_palette('plasma')[i], label = genders[i])
    plt.xlabel('Years')
    plt.ylabel('Counts')
    
plt.legend()
plt.title('Teacher Project Creation Over Time, by Teacher Gender')
plt.tight_layout()

#### Teacher project creation change over time in 2018 by Teacher Prefix
We check how the trend changes in 2018. It appears that some days see spikes in teacher project creation.

In [22]:
plt.figure(figsize=(10,6))

pref = ['Mrs.','Ms.','Mr.','Teacher','Dr.','Mx.']
ts = df_teachers[df_teachers['year'] == 2018]

for i in range(len(pref)):
    daily = ts[ts['Teacher Prefix']==pref[i]].groupby(
        ts['Teacher First Project Posted Date']).agg({'Teacher ID' : 'count'}).reset_index()
    
    plt.plot(daily['Teacher First Project Posted Date'],
             daily['Teacher ID'],
             color=sns.color_palette('Dark2')[i], label = pref[i])
    plt.xlabel('Days')
    plt.ylabel('Counts')
    
plt.legend()
plt.title('Teacher Project Creation in 2018')
plt.tight_layout()

## Schools Dataset

#### Schools Metro Type distribution
We check which metro types the schools come from. Since a school's metro type can be a proxy for the amount of government and private support an institution gets, it is of interest to see how the schools are distributed by metro type. The chart below counts beneficiary schools per metro type; most are in suburban or urban communities.

In [23]:
plt.figure(figsize=(10,6))
plt.bar(df_schools['School Metro Type'].value_counts().index, 
        df_schools['School Metro Type'].value_counts(),
        color=sns.color_palette('plasma'))
plt.xlabel('Metro Type')
plt.ylabel('Counts')
plt.title('Schools Metro Type Distribution')
plt.tight_layout()

#### Schools Percentage Free Lunch distribution
We check the share of students receiving free lunch in each school. As above, a school's free-lunch percentage is a good proxy for its economic status, indicating whether much of the student population falls below the poverty line.

In [24]:
plt.figure(figsize=(10,6))
df_schools['School Percentage Free Lunch'].hist(bins = 50, color=sns.color_palette('plasma')[2])
plt.xlabel('School Percentage Free Lunch')
plt.ylabel('Counts')
plt.title('School Percentage Free Lunch')
plt.tight_layout()

In [25]:
pd.DataFrame(df_schools['School Percentage Free Lunch'].describe())

#### Schools State distribution
We check which states the schools come from. We would like to see where public school teachers find themselves lacking educational materials. This is interesting because such information could be used not only to help DonorsChoose.org but also to help state governments prioritize funds for public education.

In [26]:
plt.figure(figsize=(10,14))

plt.barh(np.arange(0,816,16), 
        df_schools['School State'].value_counts()[::-1], height = 12, 
        color=sns.color_palette('plasma'), align='center')

plt.yticks(np.arange(0,816,16),df_schools['School State'].value_counts()[::-1].index)
plt.xlabel('Counts')
plt.ylabel('State')
plt.title('School State Distribution')
plt.tight_layout()
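
The bar positions above are hard-coded as `np.arange(0,816,16)`, which yields 51 positions and so only works while the data has exactly 51 distinct states. A possible way to avoid the magic number is to derive the positions from the counts themselves. The sketch below uses a made-up `toy_counts` series.

```python
import numpy as np
import pandas as pd

# Made-up counts standing in for df_schools['School State'].value_counts().
toy_counts = pd.Series([5, 3, 2], index=["California", "Texas", "New York"])

step = 16  # spacing between bars, as in the cell above
positions = np.arange(len(toy_counts)) * step

# With 51 categories this reproduces the hard-coded positions exactly.
matches_hardcoded = np.array_equal(np.arange(51) * step, np.arange(0, 816, 16))
print(positions, matches_hardcoded)
```

The same `positions` array can then be passed to both `plt.barh` and `plt.yticks`, so the two calls cannot drift apart.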

## Donors Dataset

#### Donor-Teacher distribution
We check how many donors are teachers. This lets us see whether being a teacher actually encourages a person to donate, given that a teacher has probably experienced being in need. We see here that the majority of donors are non-teachers.

In [27]:
plt.figure(figsize=(10,6))
plt.bar(df_donors['Donor Is Teacher'].value_counts().index, 
        df_donors['Donor Is Teacher'].value_counts(),
        color=sns.color_palette('plasma'))
plt.xlabel('Donor Is Teacher')
plt.ylabel('Counts')
plt.title('Donor-Teacher Distribution')
plt.tight_layout()

#### Donors State distribution
We check where the donors are coming from. Similar to mapping where beneficiary schools are, it is also of interest to know where the donors are coming from. 

In [28]:
plt.figure(figsize=(10,14))
plt.barh(np.arange(0,832,16),
        df_donors['Donor State'].value_counts()[::-1], height = 12, 
         color=sns.color_palette('plasma'), align='center')
plt.yticks(np.arange(0,832,16), df_donors['Donor State'].value_counts()[::-1].index)
plt.xlabel('Counts')
plt.ylabel('State')
plt.title('Donors State Distribution')
plt.tight_layout()

#### Donors City distribution
We check which cities the donors are coming from, this time at a finer level than the state. We plot the 15 cities with the most donors.

In [29]:
plt.figure(figsize=(10,14))
plt.barh(np.arange(0, 240, 16),
        df_donors['Donor City'].value_counts()[:15][::-1], height = 12, 
         color=sns.color_palette('plasma'), align='center')
plt.yticks(np.arange(0, 240, 16), df_donors['Donor City'].value_counts()[:15][::-1].index)
plt.xlabel('Counts')
plt.ylabel('City')
plt.title('Donors City Distribution')
plt.tight_layout()

In [30]:
pd.DataFrame(df_donors['Donor City'].value_counts()[:15])

**Observations:** Top cities where donors are coming from:
- Chicago: 34,352
- New York: 27,863
- Brooklyn: 22,330
- Los Angeles: 18,320
- San Francisco: 16,925


#### Donation Time per Location of Donor
We see if we can find a pattern in the time of donations and locations. 

In [31]:
# Donation Time per Location of Donor
print("Donors Features:",df_donors.columns)
print("Donations Features:",df_donations.columns)

We are concerned with `Donor ID`, `Donor City`, `Donor State`, and `Donation Received Date`.

In [32]:
df_donations[['Donor ID', 'Donation Received Date']].head()

In [33]:
df_donors[['Donor ID', 'Donor State', 'Donor City']].head()

Merging `df_donors` and `df_donations` by `Donor ID` and checking the dimensions:

In [34]:
df_temp = pd.merge(df_donors, df_donations, on=['Donor ID'])[['Donor ID', 'Donor State', 'Donor City', 'Donation Received Date']]
print(df_donations.shape)
print(df_donors.shape)
print(df_temp.shape)
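
Comparing shapes is a useful first check. Two further checks that pandas supports are `validate`, which raises if the assumed key relationship does not hold, and `indicator`, which labels where each row came from. The sketch below uses made-up frames (`toy_donors`, `toy_donations`); it assumes each `Donor ID` appears once in the donors table.

```python
import pandas as pd

toy_donors = pd.DataFrame({
    "Donor ID": ["d1", "d2", "d3"],
    "Donor State": ["California", "Texas", "Ohio"],
})
toy_donations = pd.DataFrame({
    "Donor ID": ["d1", "d1", "d2", "d9"],  # d9 has no donor record
    "Donation Received Date": ["2018-01-02", "2018-01-03",
                               "2018-02-01", "2018-02-02"],
})

# Inner join (the default); raises MergeError if a Donor ID repeats in toy_donors.
merged = pd.merge(toy_donors, toy_donations, on="Donor ID",
                  validate="one_to_many")

# Left join from the donations side to count donations without a donor record.
audit = pd.merge(toy_donations, toy_donors, on="Donor ID",
                 how="left", indicator=True)
unmatched = int((audit["_merge"] == "left_only").sum())
print(merged.shape, unmatched)
```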

Converting `Donation Received Date` to datetime format.

In [35]:
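# Assign by column name so the column actually takes the datetime64 dtype
df_temp['Donation Received Date'] = pd.to_datetime(df_temp['Donation Received Date'])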

In [36]:
df_temp.head()

We separate the date and time into year, time, month, and day.

In [37]:
df_temp['Donation Received Time'] = [d.time() for d in df_temp['Donation Received Date']]
df_temp['Donation Received Year'] = [d.year for d in df_temp['Donation Received Date']]
df_temp['Donation Received Month'] = [d.month for d in df_temp['Donation Received Date']]
df_temp['Donation Received Day'] = [d.day for d in df_temp['Donation Received Date']]
df_temp.head()
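
The list comprehensions above visit every row in Python, which can be slow on a table with millions of donations. Once the column has a datetime dtype, the `.dt` accessor gives the same parts in a vectorized way. This is a sketch on two made-up timestamps, not a change to the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Donation Received Date": pd.to_datetime(
        ["2015-03-14 13:05:00", "2018-02-03 08:30:15"]),
})
dates = toy["Donation Received Date"]

# Vectorized equivalents of the per-row list comprehensions.
toy["Donation Received Time"] = dates.dt.time
toy["Donation Received Year"] = dates.dt.year
toy["Donation Received Month"] = dates.dt.month
toy["Donation Received Day"] = dates.dt.day
print(toy)
```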

We get the list of states and save in `states`.

In [38]:
states = set(df_temp['Donor State'].values)
len(states)

We randomly select five states to visualize. Below are the states selected:

In [49]:
# Select Random States to Visualize
import random
random.seed(33)
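# random.sample needs a sequence (sets are rejected on Python 3.11+);
# sorting first keeps the seeded draw reproducible across runs
states_5 = random.sample(sorted(states, key=str), 5)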
states_5

Below are the swarm plots showing donations over time for each of the five randomly selected states. We show only 1000 datapoints per state, colored by city. Note that donors in the same city tend to donate in groups, as shown by the vertical stacks of similarly colored dots: a long vertical stack of points with the same color indicates donations from multiple donors in the same city within a short period of time. From these swarm plots, we immediately see a possible clustering of donors based on location. We can choose to visualize all states and still see the same pattern.

Supporting this observation, a 2016 study by Traag on campaign donations found that donating is socially contagious and can spread through networks of people. The chances of people donating increase when they are exposed to donors from different social groups (who don't know each other) as well as to donors who belong to the same social group (who know each other). <sup>5</sup>

In [50]:
# Plotting 1st 1000 datapoints per state in states_5
for state in states_5:
    plt.figure(figsize=(15,3))
    a = df_temp[df_temp['Donor State'] == state][:1000]
    #b = a[a['Donation Received Year']==2015]
    #b = b[b['Donation Received Month']]
    ax = sns.swarmplot(y = 'Donor State', x = 'Donation Received Date', hue = 'Donor City', data=a, palette='Set2')
    plt.title(state)
    ax.legend_.remove()
    plt.show()

## Donations Dataset

#### Donations Dataset preprocessing
We change `Donation Received Date` to datetime format in order to explore time series trends over it.

In [41]:
df_donations['Donation Received Date'] = pd.to_datetime(
    df_donations['Donation Received Date'], errors='coerce')

df_donations['year'] = df_donations['Donation Received Date'].dt.year
df_donations['day-formated'] = df_donations['Donation Received Date'].dt.strftime('%m/%d/%Y')

#### Donations Amount distribution
We check how large typical donations to DonorsChoose.org projects are. It is interesting to know how much donors are actually putting into teachers' causes.

In [42]:
plt.figure(figsize=(10,6))
df_donations['Donation Amount'].hist(bins = 50, range = (0,500), color=sns.color_palette('plasma')[2])
plt.xlabel('Donation Amount')
plt.ylabel('Counts')
plt.title('Donations Amount Distribution')
plt.tight_layout()

#### Donations Included Optional Donation distribution
We check how many donations included optional support to DonorsChoose.org. As a proxy for implied loyalty to and confidence in the organization and crowdfunding host, the `Donation Included Optional Donation` flag can be telling of a donor's long-term engagement with the platform and of the likelihood of donating again. We see here that most donations come with the optional support to DonorsChoose.org.

In [43]:
plt.figure(figsize=(10,6))
plt.bar(df_donations['Donation Included Optional Donation'].value_counts().index, 
        df_donations['Donation Included Optional Donation'].value_counts(),
        color=sns.color_palette('plasma'))
plt.xlabel('Donation Included Optional Donation')
plt.ylabel('Counts')
plt.title('Donations Included Optional Donation Distribution')
plt.tight_layout()

#### Donations received trend over time in 2018
We check when donations are received the most and how donation receipts change over time in 2018. Similar to the micro trends in the Projects dataset, we observe some spikes here. Interestingly, the biggest spike can be found on January 2nd. Perhaps people all over the U.S. wanted to share their generosity coming out of the Christmas and New Year holidays; or maybe some of them consider donating to DonorsChoose.org part of their New Year's resolutions.

In [44]:
plt.figure(figsize=(10,6))

included = ['Yes','No']
ts = df_donations[df_donations['year']==2018]

for i in range(len(included)):
    daily = ts[ts['Donation Included Optional Donation']==included[i]].groupby(
        ts['day-formated']).agg({'Donation ID' : 'count'}).reset_index()
    
    plt.plot(daily['day-formated'],
             daily['Donation ID'],
             color=sns.color_palette('Dark2')[i], label = included[i])
    plt.xlabel('Days')
    plt.ylabel('Counts')

plt.xticks([0,25, 50,75,100,125,150])
plt.legend()
plt.title('Donations Received Trend in 2018')
plt.tight_layout()
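
The plot above groups on `day-formated`, a string column. String keys in `%m/%d/%Y` form sort chronologically only within a single year, and the numeric tick positions make it hard to read the exact peak date off the axis. A possible alternative, sketched here on made-up rows, keeps the grouping key as a datetime and reads the busiest day with `idxmax`.

```python
import pandas as pd

# Made-up donations; only the timestamps matter for this illustration.
toy = pd.DataFrame({
    "Donation ID": ["a", "b", "c", "d", "e"],
    "Donation Received Date": pd.to_datetime([
        "2018-03-05 09:00", "2018-03-05 17:30", "2018-03-05 23:59",
        "2018-03-06 12:00", "2018-04-01 08:00"]),
})

# Truncate each timestamp to its calendar day and count donations per day.
day = toy["Donation Received Date"].dt.floor("D")
daily = toy.groupby(day)["Donation ID"].count()

# The index is a real date, so the peak can be read off directly.
peak_day = daily.idxmax()
print(daily)
print(peak_day.date(), daily.max())
```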

## Resources Dataset

#### Resources Unit Price distribution
We check how the unit price for requested items in the projects are distributed. It is also interesting to know how much the items requested cost for the projects posted by teachers. Are they really expensive? Can teachers afford them otherwise? How cheap or expensive can they get?

In [45]:
df = df_resources

fig = plt.figure(figsize=(10,6))
ax2 = fig.add_subplot(111)
ax2.hist(df[df['Resource Unit Price']<100]['Resource Unit Price'],100, color=sns.color_palette('viridis')[1]);

ax3 = fig.add_axes([0.55,0.5,0.4,0.4])
ax3.hist(df['Resource Unit Price'].dropna(), 50, color=sns.color_palette('viridis')[0]);
ax2.set_xlabel("Resource Unit Price ($)")
ax2.set_ylabel('Frequency')

ax3.set_xlabel("Resource Unit Price ($)")
ax3.set_ylabel('Frequency')
ax2.set_title('Distribution of Resource Unit Price', size = 18)
plt.tight_layout()

#### Resource Unit Price vs Resource Quantity
We check the relationship between price and quantity for the resource items requested, mapping how many units teachers request at each price. We observe a pattern resembling a power law: more expensive items tend to be requested in smaller quantities than cheaper ones.

In [46]:
fig = plt.figure(figsize=(10,6))
ax4 = fig.add_subplot(111)
ax4.scatter(df['Resource Quantity'],df['Resource Unit Price'], color='#aa3333')
ax4.set_xlabel('Resource Quantity')
ax4.set_ylabel('Resource Unit Price')
ax4.set_title('Relationship of Resource Unit Price vs. Resource Quantity', size =18)
ax4.set_xlim(0,600);
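
Whether a relationship really follows a power law can be checked in log-log space, where a power law becomes a straight line. The sketch below uses synthetic numbers constructed to follow an exact power law (they are an assumption for illustration, not values from `df_resources`) and recovers the exponent with a least-squares line fit.

```python
import numpy as np

# Hypothetical synthetic data: price falls off as an exact power of quantity
quantity = np.array([1, 2, 5, 10, 50, 100], dtype=float)
unit_price = 500.0 * quantity ** -1.0

# A power law y = a * x**b is a straight line in log-log space,
# so a degree-1 fit on the logarithms recovers the exponent b
slope, intercept = np.polyfit(np.log10(quantity), np.log10(unit_price), deg=1)
print('estimated exponent:', round(slope, 3))
print('estimated prefactor:', round(10 ** intercept, 1))
```

On real data, rows with zero or missing price or quantity would have to be removed before taking logarithms, and a noisy cloud of points will not give a clean straight line.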

## Projects Dataset

#### Projects Dataset preprocessing
We convert the columns that hold dates to the datetime dtype for further exploration.

In [47]:
df_projects['Project Posted Date'] = pd.to_datetime(
    df_projects['Project Posted Date'], errors='coerce')

df_projects['Project Expiration Date'] = pd.to_datetime(
    df_projects['Project Expiration Date'], errors='coerce')

df_projects['Project Fully Funded Date'] = pd.to_datetime(
    df_projects['Project Fully Funded Date'], errors='coerce')

df_projects.dtypes

In [48]:
df_projects['year-posted'] = df_projects['Project Posted Date'].dt.year
df_projects['day-posted-formated'] = df_projects['Project Posted Date'].dt.strftime('%m/%d/%Y')

df_projects['year-expiry'] = df_projects['Project Expiration Date'].dt.year
df_projects['day-expiry-formated'] = df_projects['Project Expiration Date'].dt.strftime('%m/%d/%Y')

df_projects['year-funded'] = df_projects['Project Fully Funded Date'].dt.year
df_projects['day-funded-formated'] = df_projects['Project Fully Funded Date'].dt.strftime('%m/%d/%Y')

df_projects['delta-days-before-expiry'] = (df_projects['Project Expiration Date'] - df_projects['Project Posted Date']).dt.days
df_projects['delta-days-before-funded'] = (df_projects['Project Fully Funded Date'] - df_projects['Project Posted Date']).dt.days
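
The derived `delta-days-*` columns inherit missing values from their inputs. The small sketch below, on hypothetical rows, shows how `errors='coerce'` turns unparseable or missing dates into `NaT` and how the day difference then becomes `NaN` for those rows.

```python
import pandas as pd

# Hypothetical rows: the second project was never fully funded,
# the third has a malformed posting date
toy = pd.DataFrame({
    'Project Posted Date': ['2018-01-01', '2018-01-10', 'not a date'],
    'Project Fully Funded Date': ['2018-01-31', None, '2018-02-01'],
})
for col in toy.columns:
    # unparseable or missing values become NaT instead of raising an error
    toy[col] = pd.to_datetime(toy[col], errors='coerce')

# the difference of two datetime columns is a timedelta; .dt.days extracts whole days
toy['delta-days-before-funded'] = (
    toy['Project Fully Funded Date'] - toy['Project Posted Date']).dt.days
print(toy['delta-days-before-funded'].tolist())
```

Because `describe()` skips missing values, summary statistics of such a column describe only the rows where both dates are present.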

In [None]:
df_projects.columns.tolist()

#### Unique Projects by Teacher ID
We check how many projects each teacher has initiated, which shows how active teachers are in terms of project creation.

In [None]:
pd.DataFrame(df_projects['Teacher ID'].value_counts().describe())

**Observations:** Most teachers created only one project, while the average is about 2.8 projects per teacher.

#### Projects Type distribution
We check how the types of projects are distributed across all projects created. We see here that most projects are teacher-led, and only a few are student-led or for professional development.

In [None]:
plt.figure(figsize=(10,6))
plt.bar(df_projects['Project Type'].value_counts().index, 
        df_projects['Project Type'].value_counts(),
        color=sns.color_palette('viridis'))
plt.xlabel('Project Type')
plt.ylabel('Counts')
plt.title('Projects Type Distribution')
plt.tight_layout()

#### Projects Subject Category Tree distribution
We check what subject categories the projects fall into. We observe here that the most common category is literacy and language.

In [None]:
plt.figure(figsize=(10,6))
plt.barh(df_projects['Project Subject Category Tree'].value_counts()[:15].index, 
        df_projects['Project Subject Category Tree'].value_counts()[:15],
        color=sns.color_palette('viridis'))
plt.xlabel('Counts')
plt.ylabel('Project Subject Category Tree')
plt.title('Projects Subject Category Tree Distribution')
plt.tight_layout()

#### Projects Subject Subcategory Tree distribution
We check what subject subcategories the projects fall into. Similarly, we observe that the top projects deal with literacy, mathematics, or writing.

In [None]:
plt.figure(figsize=(10,6))
plt.barh(df_projects['Project Subject Subcategory Tree'].value_counts()[:15].index, 
        df_projects['Project Subject Subcategory Tree'].value_counts()[:15],
        color=sns.color_palette('viridis'))
plt.xlabel('Counts')
plt.ylabel('Project Subject Subcategory Tree')
plt.title('Projects Subject Subcategory Tree Distribution')
plt.tight_layout()

#### Projects Grade Level Category distribution
We check what grade level categories the projects fall into.

In [None]:
temp = df_projects['Project Grade Level Category'].value_counts()
fig = {
  "data": [
    {
      "values": temp.values,
      "labels": temp.index,
      "domain": {"x": [0, .48]},
      "name": "Grade Level Category",
      #"hoverinfo":"label+percent+name",
      'marker': {'colors': ['rgb(45, 35, 113)',
                                  'rgb(0, 208, 110)',
                                  'rgb(0, 208, 202)',
                                  'rgb(83, 158, 196)',
                                  'rgb(124, 231, 87)']},
      "hole": .7,
      "type": "pie"
    },
    
    ],
  "layout": {
        "title":"Distribution of Projects Grade Level Category",
        "annotations": [
            {
                "font": {
                    "size": 20
                },
                "showarrow": False,
                "text": "Grade Level Categories",
                "x": 0.11,
                "y": 0.5
            }
            
        ]
    }
}
iplot(fig, filename='donut')

#### Projects Resource Category distribution
We check what resource categories the projects fall into. Here we observe that projects request technology, supplies, or books the most.

In [None]:
plt.figure(figsize=(10,6))
plt.barh(df_projects['Project Resource Category'].value_counts().index, 
        df_projects['Project Resource Category'].value_counts(),
        color=sns.color_palette('viridis'))
plt.xlabel('Counts')
plt.ylabel('Project Resource Category')
plt.title('Projects Resource Category Distribution')
plt.tight_layout()

#### Projects Current Status distribution
We check what status the projects are in now. We would like to see how many projects are still running and how many are already fully funded.

In [None]:
plt.figure(figsize=(10,6))
plt.barh(df_projects['Project Current Status'].value_counts().index, 
        df_projects['Project Current Status'].value_counts(),
        color=sns.color_palette('viridis'))
plt.xlabel('Counts')
plt.ylabel('Project Current Status')
plt.title('Projects Current Status Distribution')
plt.tight_layout()

#### Projects Costs distribution
We check how the costs of the projects are distributed.

In [None]:
df = df_projects

fig = plt.figure(figsize=(10,6))
ax2 = fig.add_subplot(111)
ax2.hist(df[df['Project Cost']<5000]['Project Cost'],100, color=sns.color_palette('viridis')[1]);

ax3 = fig.add_axes([0.55,0.5,0.4,0.4])
ax3.hist(df['Project Cost'].dropna(), 50, color=sns.color_palette('viridis')[0]);
ax2.set_xlabel("Project Cost ($)")
ax2.set_ylabel('Frequency')

ax3.set_xlabel("Project Cost ($)")
ax3.set_ylabel('Frequency')
ax2.set_title('Distribution of Project Cost', size = 18)
plt.tight_layout()

In [None]:
pd.DataFrame(df_projects['Project Cost'].describe())

**Observations:** The average project costs around 741 dollars, a figure pulled up by a few high-value projects, the largest of which costs 255,737 dollars. 75% of the projects cost less than 868 dollars.

#### Number of days before project gets fully funded
We check how many days it takes for a project to be fully funded. As an indicator of donation coordination and donor dynamics, the time to full funding sheds light on how long a project waits before it gets all the support it needs.

In [None]:
df = df_projects

fig = plt.figure(figsize=(10,6))
ax2 = fig.add_subplot(111)
ax2.hist(df['delta-days-before-funded'].dropna(),100, color=sns.color_palette('viridis')[1]);

#ax3 = fig.add_axes([0.55,0.5,0.4,0.4])
#ax3.hist(df['Project Cost'].dropna(), 50, color=sns.color_palette('viridis')[0]);
ax2.set_xlabel("Number of days before fully funded")
ax2.set_ylabel('Frequency')

#ax3.set_xlabel("Project Cost ($)")
#ax3.set_ylabel('Frequency')
ax2.set_title('Distribution of days it took before project funding', size = 18)
plt.tight_layout()

In [None]:
pd.DataFrame(df['delta-days-before-funded'].describe())

**Observations:** On average, it only takes about a month (~32.07 days) for a project to get fully funded. 75% of projects get funded within 50 days.

## Merging the Datasets
Using the code below, the dataframes are merged on their identifier keys, i.e. `Project ID`, `Donor ID`, `Teacher ID`, and `School ID`.

In [None]:
# donations_resources = pd.merge(df_donations,df_resources, how='left',on='Project ID')
# donations_resources_donors = pd.merge(donations_resources, df_donors, how='left', on='Donor ID')
# donations_resources_donors_projects = pd.merge(donations_resources_donors, df_projects, how='left', on='Project ID')
# donations_resources_donors_projects_teachers = pd.merge(donations_resources_donors_projects, df_teachers, how='left', on='Teacher ID')
# merged_df = pd.merge(donations_resources_donors_projects_teachers, df_schools, how='left', on='School ID')
# merged_df.head(1)
# merged_df.sample(n = 50000, axis = 0).to_pickle('sampled_df.pickle')
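
When chaining left merges like the ones above, it helps to know which joins multiply rows. The sketch below uses hypothetical miniature tables (column names follow the dataset, values are made up) to show that joining donations with resources on `Project ID` repeats each donation once per resource line, and that the `validate` argument of `pd.merge` can guard joins that are expected to be many-to-one.

```python
import pandas as pd

# Hypothetical miniature tables; values are illustrative only
donations = pd.DataFrame({'Donation ID': ['d1', 'd2', 'd3'],
                          'Project ID': ['p1', 'p1', 'p2'],
                          'Donor ID': ['u1', 'u2', 'u1']})
resources = pd.DataFrame({'Project ID': ['p1', 'p1', 'p2'],
                          'Resource Item Name': ['books', 'markers', 'tablet']})
donors = pd.DataFrame({'Donor ID': ['u1', 'u2'],
                       'Donor State': ['Texas', 'Ohio']})

# Donations x resources is many-to-many on Project ID: each donation row is
# repeated once per resource line of its project
step1 = pd.merge(donations, resources, how='left', on='Project ID')

# Each Donor ID appears once in donors, so validate='many_to_one' makes pandas
# raise a MergeError if that assumption is ever violated
step2 = pd.merge(step1, donors, how='left', on='Donor ID', validate='many_to_one')

print(len(donations), '->', len(step1), 'rows after adding resources')
print(step2[['Donation ID', 'Resource Item Name', 'Donor State']])
```

A consequence for any later counting is that a row of the merged frame is a donation-resource pair rather than a single donation.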

## Data Wrangling

### Note: From this point on, we use the merged dataset, which was built on our own computers and uploaded to this kernel. Furthermore, for the purposes of demonstration, we use a subset of 10,000 datapoints sampled from that dataset.

In [45]:
df = pd.read_csv("../input/sampled-dataset-50k/sampled_df_jojie.csv").iloc[:,1:]
df.shape

In [9]:
n_rows = 10000
df_sample_1 = df.sample(n = n_rows, random_state = 123, axis = 0)
df_sample_1.shape

#### Identifying useful and not so useful features
To recognize latent groupings of projects, we omit ID features like `Project ID`, `Donation ID`, `Donor ID`, `School ID` and `Teacher ID`, free-form text features like `Project Title`, `Project Essay` and `Project Short Description`, and, for simplicity, other features we assume to be less useful.

Next, we identify the features that characterize either the donors or the projects, as well as the categorical and datetime variables for later encoding.

In [8]:
not_useful = ['Project ID', 'Donation ID','Donor ID',
          'Donor Cart Sequence','Resource Vendor Name','Resource Item Name', 
              'Teacher Project Posted Sequence','School ID', 'Teacher ID',
              'School Name', 'Donor Zip', 'School Zip', 'Unnamed: 0.1']

date_feat = ['Donation Received Date','Project Posted Date', 'Project Expiration Date',
        'Project Fully Funded Date','Teacher First Project Posted Date']

donor_feat = ['Donation ID', 'Donor ID',
       'Donation Included Optional Donation', 'Donation Amount',
       'Donor Cart Sequence', 'Donation Received Date', 
       'Donor City', 'Donor State', 'Donor Is Teacher', 'Donor Zip']

project_feat = ['Resource Item Name',
       'Resource Quantity', 'Resource Unit Price', 'Resource Vendor Name',
       'Project Type', 'Project Title', 'Project Essay',
       'Project Short Description', 'Project Need Statement',
       'Project Subject Category Tree', 'Project Subject Subcategory Tree',
       'Project Grade Level Category', 'Project Resource Category',
       'Project Cost', 'Project Posted Date', 'Project Expiration Date',
       'Project Current Status', 'Project Fully Funded Date', 'Teacher Prefix',
       'Teacher First Project Posted Date', 'School Name', 'School Metro Type',
       'School Percentage Free Lunch', 'School State', 'School Zip',
       'School City', 'School County', 'School District']

cat_feat = ['Donor City', 'Donor State', 'Donor Is Teacher', 'Donor Zip',
            'Project Type','Project Subject Category Tree', 'Donation Included Optional Donation',
            'Project Subject Subcategory Tree',
            'Project Grade Level Category', 'Project Resource Category',
            'Project Current Status','Teacher Prefix','School Metro Type',
            'School State', 'School Zip','School City','School County', 'School District']

#### Dropping not so useful features

In [10]:
df_sample = df_sample_1.drop(labels=not_useful,axis = 1)

In [11]:
# keep only the categorical features that survived the drop above
cat_feat_new = [x for x in df_sample.columns if x in cat_feat]

#### Dropping null values

In [12]:
df_sample.dropna(axis=0, inplace=True)
#df_new[pd.isnull(df_new).any(axis=1)]

In [13]:
checker1 = df_sample[df_sample['School State']=='Alaska'].index.tolist()
checker2 = df_sample[df_sample['School Metro Type']=='town'].index.tolist()

#### Encoding Labels for Categorical Variables

In [14]:
from sklearn.preprocessing import LabelEncoder

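labels = {}

for cat in cat_feat_new:
    # Fit a fresh encoder per column. Numeric columns are encoded as-is;
    # everything else is cast to str first so that fit and transform
    # see exactly the same values.
    le = LabelEncoder()

    if df_sample[cat].dtype == 'float64' or df_sample[cat].dtype == 'int':
        df_sample[cat] = le.fit_transform(df_sample[cat])

    else:
        df_sample[cat] = le.fit_transform(df_sample[cat].astype(str))

    # remember the original category names; position in the list = integer code
    labels[cat] = list(le.classes_)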
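
A new project can only be classified correctly if its categories are mapped to the same integers as in the training data. The following is a minimal sketch of one way to ensure that, assuming one fitted `LabelEncoder` is kept per column; the frames, the `encoders` dictionary, and the `encode` helper are hypothetical names used for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training rows and one new row; values are illustrative only
train = pd.DataFrame({
    'School Metro Type': ['urban', 'rural', 'suburban', 'urban'],
    'Project Type': ['Teacher-Led', 'Student-Led', 'Teacher-Led', 'Teacher-Led'],
})
new_row = pd.DataFrame({'School Metro Type': ['suburban'],
                        'Project Type': ['Teacher-Led']})

# Fit one encoder per column on the training data and keep it for later
encoders = {col: LabelEncoder().fit(train[col].astype(str)) for col in train.columns}

def encode(frame, encoders):
    """Return a copy of frame with each column mapped through its stored encoder."""
    out = frame.copy()
    for col, enc in encoders.items():
        # transform() raises ValueError for a category that was not seen in training
        out[col] = enc.transform(out[col].astype(str))
    return out

train_encoded = encode(train, encoders)
new_encoded = encode(new_row, encoders)

# For comparison: an encoder fitted on the single new row always yields code 0
naive_code = LabelEncoder().fit_transform(new_row['School Metro Type'])[0]
print(new_encoded)
print('code from a freshly fitted encoder:', naive_code)
```

The comparison at the end shows why refitting an encoder on the new row alone is not equivalent: the stored encoder and the fresh one assign different codes to the same category.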

#### Encoding Datetime Features 

In [15]:
# convert every date column; unparseable values become NaT
for col in date_feat:
    df_sample[col] = pd.to_datetime(df_sample[col], errors='coerce')

In [16]:
df_sample['delta-days-before-expiry'] = (df_sample['Project Expiration Date'] - df_sample['Project Posted Date']).dt.days
df_sample['delta-days-before-funded'] = (df_sample['Project Fully Funded Date'] - df_sample['Project Posted Date']).dt.days
df_sample['delta-days-before-donating'] = (df_sample['Donation Received Date'] - df_sample['Project Posted Date']).dt.days

# Building the Recommendation System
This section presents a model for a hybrid recommendation system utilizing both content-based filtering and demographic recommendations. By definition, content-based filtering focuses on the properties of items, and the similarity between items is determined by measuring the similarity of their properties. A demographic recommender, on the other hand, provides recommendations based on the demographic profile of the user.

Borrowing terminology from the field of data mining<sup>[4]</sup>, there are two components of the recommendation system in relation to the donorschoose.org challenge:
1. **Item Profiles** (Projects) - given by the characteristics of the projects from which 'profiles' will be constructed
2. **User Profiles** (Donors) - given by the demographics of the donors 

By making inferences from users' behaviors (donor donation and project preferences) and their demographic profiles, we develop a recommendation system potentially useful for the creation of targeted email campaigns to encourage repeat donations from first time donors. 

**Note:** For the purpose of simplicity, we opted to omit `Project Essay` and other free-form answer features of the projects and resorted to using the several project 'tags' features to discover item profiles. 

Construction of the aforementioned Recommendation System involves the following steps:
1. Determining Item Profiles (using KMeans Clustering / unsupervised Machine Learning)
2. Classifying New Unobserved Items (using Logistic Regression / supervised Machine Learning)
3. Filtering User Profiles (using heuristics)

Fed with information about a new project, a web app containing the identified project clusters returns the corresponding cluster of potential donors, which can be filtered by location (`Donor State`), explicit support for donorschoose.org (`Donation Included Optional Donation`), and donation timing (`Donation Received Date`).

![donorschoose recommender](https://tjalba.files.wordpress.com/2018/06/screen-shot-2018-06-19-at-9-36-19-am.png?w=1000)
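
The three steps can be sketched end to end on synthetic data. Everything in the block below is an assumption made for illustration: the two numeric features stand in for the encoded project features, the blobs are generated rather than taken from the dataset, and the donor identifiers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for encoded project features: three well-separated groups
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features, _ = make_blobs(n_samples=90, centers=centers, cluster_std=0.5, random_state=0)
past = pd.DataFrame(features, columns=['feature_a', 'feature_b'])
past['Donor ID'] = ['donor_{}'.format(i) for i in range(len(past))]

# Step 1: item profiles from unsupervised clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
past['Label'] = kmeans.labels_

# Step 2: a supervised classifier learns to reproduce the cluster labels
clf = LogisticRegression(max_iter=1000).fit(features, past['Label'])

# Step 3: a new project is assigned a profile, and the donors who gave to
# past projects with that profile are returned
new_project = np.array([[10.0, 0.0]])
predicted = clf.predict(new_project)[0]
recommended = past.loc[past['Label'] == predicted, 'Donor ID']
print('Predicted cluster:', predicted, '| donors returned:', len(recommended))
```

The heuristic filters on location, organization loyalty, and timing would then be applied to the returned donors.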

## Identifying Item Profiles (Projects) - KMeans Clustering

#### Creating the feature vector for previous projects for KMeans Clustering

In [161]:
X = df_sample[['Resource Quantity', 'Resource Unit Price','Project Type',
       'Project Subject Category Tree', 'Project Subject Subcategory Tree',
       'Project Grade Level Category', 'Project Resource Category',
       'Project Cost', 'Teacher Prefix',
       'School Metro Type','School Percentage Free Lunch', 'School State', 'School City',
       'School County', 'School District']]

In [162]:
X.shape

#### Initial KMeans Clustering with number of clusters set to 5

In [163]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 5).fit(X)
plt.figure(figsize=(10,6))
for i in range(len(kmeans.cluster_centers_)):
    plt.plot(kmeans.cluster_centers_[i][0],kmeans.cluster_centers_[i][1],'x', markersize=12, label=i);
    plt.xticks([], [])
    plt.yticks([], [])
    plt.legend()
plt.title('KMeans Clustering (n_clusters = 5)');

#### Finding the optimal number of clusters
We find that the optimal number of clusters is 4. We used two internal validation measures: the **sum of squared distances** (SS) of each datapoint from the centroid of its cluster, and the **Silhouette** value. SS generally shrinks as k grows, so we look for the k beyond which it stops dropping sharply, while a Silhouette value closer to 1 indicates better-separated clusters. Plotting the two measures against the number of clusters k, we find that the best trade-off between them is at k = 4.

In [164]:
from sklearn.metrics import silhouette_score
sse = []
silhouette = []
krange = range(2, 15)
plt.figure(figsize=(10,6))
for k in krange:
    kmeans = KMeans(n_clusters = k).fit(X)
    sse.append(kmeans.inertia_)
    
    cluster_labels = kmeans.predict(X)
    sl = silhouette_score(X, cluster_labels)
    silhouette.append(sl)
    
plt.plot(krange, sse, label='SS', c='orange')
# use names that do not overwrite the `labels` dictionary of encoded categories
lines, legend_labels = plt.gca().get_legend_handles_labels()
plt.twinx()
plt.plot(krange, silhouette, label = 'Silhouette', c='blue')
lines2, legend_labels2 = plt.gca().get_legend_handles_labels()
plt.legend(lines+lines2, legend_labels+legend_labels2)
plt.title("Validation Measures per k Clusters")
plt.show()
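
To make the two validation measures concrete, here is a minimal sketch on synthetic data with four clearly separated groups (generated with `make_blobs`, an assumption for illustration and not the project features). It records the inertia and the silhouette score for a range of k and picks the k with the highest silhouette.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data with four well-separated groups
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
points, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

inertia = {}
silhouette = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertia[k] = model.inertia_                              # within-cluster sum of squares
    silhouette[k] = silhouette_score(points, model.labels_)  # in [-1, 1], higher is better

best_k = max(silhouette, key=silhouette.get)
for k in inertia:
    print(k, round(inertia[k], 1), round(silhouette[k], 3))
print('k with the highest silhouette:', best_k)
```

On real, unscaled features the picture is usually less clear-cut, which is why the choice above is described as a trade-off between the two curves.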


#### Plotting the first two components of the clusters
We plot each cluster centroid using its first two coordinates, which correspond to `Resource Quantity` and `Resource Unit Price`, the first two columns of `X`.

In [165]:
n_clusters = 4
kmeans = KMeans(n_clusters = n_clusters).fit(X)
plt.figure(figsize=(10,6))
for i in range(len(kmeans.cluster_centers_)):
    plt.plot(kmeans.cluster_centers_[i][0],kmeans.cluster_centers_[i][1],'x', markersize=12, label=i);
    plt.xticks([], [])
    plt.yticks([], [])
    plt.legend()
plt.title('KMeans Clustering (n_clusters = {})'.format(n_clusters));

#### Storing Donor IDs per Cluster 

In [166]:
clusters_list = {}
for c in range(n_clusters):
    # X keeps the row labels of df, so select by label (.loc) rather than by position
    clusters_list[c] = df.loc[X.index[kmeans.labels_ == c], 'Donor ID']

#### Labeling donors based on project clusters

In [167]:
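# assign() returns a new frame, which avoids a SettingWithCopyWarning on the slice of df_sample
X = X.assign(Label=kmeans.labels_)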

In [168]:
X.shape

In [169]:
X.head()

In [170]:
X.tail()

#### Visualizing the Clusters

In [171]:
from sklearn.decomposition import PCA
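# project only the features; the cluster label itself should not enter the PCA
pca = PCA(n_components = 3).fit_transform(X.drop('Label', axis = 1))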
plt.figure(figsize=(10,6))
plt.scatter(pca[:,0],pca[:,1], c = kmeans.labels_);

## Classifying New Unobserved Items (New Projects) - Logistic Regression

In [172]:
from collections import Counter

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

pd.options.display.float_format = '{:,.2g}'.format

In [173]:
X_ = X.drop('Label', axis = 1)
y_ = X['Label']

In [174]:
def ml_class(feature, target, ml_type='knn_class', show_PCC=False,
             param_range=range(1, 30), seed_settings=range(0, 30),
             plot=False, report=True, penalty='l2'):
    """
    Plot accuracy vs parameter for test and training data. Print
    maximum accuracy and corresponding parameter value. Print number of trials.

    Inputs
    ======
    feature: Dataframe of features
    target: Series of target values
    ml_type: String. One of 'knn_class', 'log_reg', or 'svc'
    show_PCC: Boolean. Will show 1.25 x PCC on plot if True
    param_range: Range of values for parameters
    seed_settings: Range of seed settings to run
    plot: Boolean. Will show plot if True
    report: Boolean. Will show report if True
    penalty: String either l1 for L1 norm or l2 for L2 norm
    Outputs
    =======
    Plot of accuracy vs parameter for test and training data
    Report showing number of maximum accuracy, optimal parameters, PCC, and
        no. of iterations
    """

    train_acc = []
    test_acc = []

    # Initiate counter for number of trials
    iterations = 0

    # create an array of cols: parameters and rows: seeds
    for seed in seed_settings:

        # count one trial
        iterations += 1

        # split data into test and training sets
        X_train, X_test, y_train, y_test = train_test_split(feature,
                                                            target,
                                                            random_state=seed)
        train = []
        test = []

        # make a list of accuracies for different parameters
        for param in param_range:
            # build the model
            if ml_type == 'knn_class':
                clf = KNeighborsClassifier(n_neighbors=param)

            elif ml_type == 'log_reg':
                # set the solver explicitly: liblinear supports both l1 and l2 penalties
                clf = LogisticRegression(C=param, penalty=penalty, solver='liblinear')

            elif ml_type == 'svc':
                clf = LinearSVC(C=param, penalty=penalty, dual=False)

            clf.fit(X_train, y_train)

            # record training set accuracy
            train.append(clf.score(X_train, y_train))
            # record generalization accuracy
            test.append(clf.score(X_test, y_test))

        # append the list to _acc arrays
        train_acc.append(train)
        test_acc.append(test)

    # compute mean and error across columns
    train_all = np.mean(train_acc, axis=0)
    test_all = np.mean(test_acc, axis=0)

    # compute variance across seeds
    var_train = np.var(train_acc, axis=0)
    var_test = np.var(test_acc, axis=0)

    # compute pcc
    state_counts = Counter(target)
    df_state = pd.DataFrame.from_dict(state_counts, orient='index')
    num = (df_state[0] / df_state[0].sum())**2
    pcc = 1.25 * num.sum()

    if plot == True:
        plt.figure(figsize=(10,6))
        # plot mean training accuracy with a +/- variance band
        plt.plot(param_range, train_all, c='b',
                 label="training set", marker='.')
        plt.fill_between(param_range,
                         train_all + var_train,
                         train_all - var_train,
                         color='b', alpha=0.1)

        # plot mean test accuracy with a +/- variance band
        plt.plot(param_range, test_all, c='r', label="test set", marker='.')
        plt.fill_between(param_range,
                         test_all + var_test,
                         test_all - var_test,
                         color='r', alpha=0.1)

        # plot pcc line
        if show_PCC == True:
            plt.plot(param_range, [pcc] * len(param_range),
                     c='tab:gray', label="pcc", linestyle='--')

        plt.xlabel('Parameter Value')
        plt.ylabel('Accuracy')
        plt.title(ml_type + ": Accuracy vs Parameter Value")
        plt.legend(loc=0)

        plt.tight_layout()
        plt.show()

    max_inds = np.argmax(test_all)
    acc_max = np.amax(test_all)
    param_max = (param_range)[max_inds]

    if report == True:
        print('Report:')
        print('=======')
        print("Max average accuracy: {}".format(
            np.round(acc_max, 4)))
        print("Var of accuracy at optimal parameter: {0:.4f}".format(
            var_test[max_inds]))
        print("Optimal parameter: {0:.4f}".format(param_max))
        if ml_type != "knn_class":
            print("Regularization: ", penalty)
        print('1.25 x PCC: {0:.4f}'.format(pcc))
        print('Total iterations: {}'.format(iterations))
        

    # return maximum accuracy and corresponding parameter value
    return np.round(acc_max, 4), param_max  # best_feat

Logistic regression classified projects from the held-out test splits into the identified clusters with 95-96% accuracy.

In [175]:
C = [1e-3,0.1, 0.75, 1, 5, 10]

acc_max, param_max = ml_class(X_, y_, ml_type='log_reg', show_PCC=True,
                            param_range=C, seed_settings=range(0, 20),
                            plot=True, report=True, penalty='l1');
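
The `1.25 x PCC` line in the report is a baseline for judging the accuracy. The proportional chance criterion is the accuracy expected from assigning classes at random in proportion to their sizes, $PCC = \sum_i p_i^2$, where $p_i$ is the share of class $i$, and a classifier is commonly considered useful when it beats 1.25 times that value. A small worked example with hypothetical class sizes (the helper name is ours, not part of the notebook):

```python
from collections import Counter

def proportional_chance_criterion(labels):
    """Return the sum of squared class proportions for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())

# Hypothetical cluster labels: 50, 30 and 20 members in three classes
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
pcc = proportional_chance_criterion(toy_labels)   # 0.5**2 + 0.3**2 + 0.2**2
threshold = 1.25 * pcc
print('PCC: {:.3f}, 1.25 x PCC: {:.3f}'.format(pcc, threshold))
```

Note that inside `ml_class` the variable named `pcc` already holds the value multiplied by 1.25.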

Using the optimal parameter identified above, we find the values of the coefficients for each feature.

In [179]:
X_train, X_test, y_train, y_test = train_test_split(X_, y_)
# set the solver explicitly: liblinear supports the l1 penalty used here
clf = LogisticRegression(C=param_max, penalty='l1', solver='liblinear')
clf.fit(X_train, y_train)

plt.figure(figsize=(10,6))
for i in range(n_clusters):
    sorted_i = np.argsort(abs(clf.coef_[i]))[::-1]
    
    plt.plot(X_.columns[sorted_i], clf.coef_[i][sorted_i], marker='o', linestyle='none', linewidth=1, label = "Cluster "+str(i))
    plt.bar(X_.columns[sorted_i], clf.coef_[i][sorted_i], alpha=0.3, width=0.1)
    plt.legend()
    plt.xticks(rotation=90)
    plt.ylabel("Coefficient")
    plt.title("Logistic Regression Feature Coefficients per Cluster")
plt.show()

## Filtering User Profiles 

In [180]:
# convert every date column of the full dataset; unparseable values become NaT
for col in date_feat:
    df[col] = pd.to_datetime(df[col], errors='coerce')

df['delta-days-before-expiry'] = (df['Project Expiration Date'] - df['Project Posted Date']).dt.days
df['delta-days-before-funded'] = (df['Project Fully Funded Date'] - df['Project Posted Date']).dt.days
df['delta-days-before-donating'] = (df['Donation Received Date'] - df['Project Posted Date']).dt.days

#### Storing Donor ID and other Donor information in dictionaries, by clusters
After clustering, the Donor IDs of donors who previously donated to projects in each of the identified clusters are stored in dictionaries, so that these clusters of Donor IDs can easily be looked up later on. Furthermore, to take into account proximity, organizational loyalty, and donation timing, the Donor IDs are also stored together with these features in individual dictionaries.
1. **Proximity/Location** - `clusters_by_state`
2. **Organizational Loyalty** - `clusters_by_org_loyalty`
3. **Donation Timing** - `clusters_by_early_donors` and `clusters_by_late_donors`

In [181]:
clusters_by_state = {}
clusters_by_org_loyalty = {}
clusters_by_early_donors = {}
clusters_by_late_donors = {}
clusters_by_all = {}

for c, donor_ids in clusters_list.items():
    # rows of the full dataset that belong to cluster c, selected by row label
    members = df.loc[donor_ids.index]
    days = members['delta-days-before-donating']

    clusters_by_state[c] = members[['Donor ID','Donor State']]
    clusters_by_org_loyalty[c] = members[['Donor ID','Donation Included Optional Donation']]
    # early donors gave within 30 days of posting; late donors on day 30 or later,
    # so that donations made exactly 30 days after posting are not left out
    clusters_by_early_donors[c] = members.loc[days < 30, ['Donor ID','delta-days-before-donating']]
    clusters_by_late_donors[c] = members.loc[days >= 30, ['Donor ID','delta-days-before-donating']]
    clusters_by_all[c] = members[['Donor ID','Donor State','Donation Included Optional Donation','delta-days-before-donating']]
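
Once a cluster's donors are available as a frame with the four columns stored in `clusters_by_all`, the three heuristics can be combined into a single filter. The sketch below is one possible way to do that; the `filter_donors` helper and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical contents of one entry of clusters_by_all; values are made up
cluster_donors = pd.DataFrame({
    'Donor ID': ['u1', 'u2', 'u3', 'u4'],
    'Donor State': ['New York', 'Texas', 'New York', 'New York'],
    'Donation Included Optional Donation': ['Yes', 'Yes', 'No', 'Yes'],
    'delta-days-before-donating': [3, 12, 45, 60],
})

def filter_donors(frame, state=None, org_loyal=None, max_days=None):
    """Apply whichever of the three heuristics are requested (None = skip)."""
    mask = pd.Series(True, index=frame.index)
    if state is not None:
        mask &= frame['Donor State'] == state
    if org_loyal is not None:
        wanted = 'Yes' if org_loyal else 'No'
        mask &= frame['Donation Included Optional Donation'] == wanted
    if max_days is not None:
        mask &= frame['delta-days-before-donating'] < max_days
    # a donor may appear in several rows, so return each ID once
    return frame.loc[mask, 'Donor ID'].drop_duplicates().tolist()

same_state_loyal = filter_donors(cluster_donors, state='New York', org_loyal=True)
same_state_loyal_early = filter_donors(cluster_donors, state='New York',
                                       org_loyal=True, max_days=30)
print(same_state_loyal)
print(same_state_loyal_early)
```

Keeping the filters optional lets a campaign start broad (all donors of the cluster) and narrow down only as far as the remaining list stays useful.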

# Demonstration
Now, to demonstrate how this recommender system works when presented with a new project, we use a function `donors_to_recommend` that returns the Donor IDs of potential donors based on the predicted class or cluster of the new project. A new project to be recommended to donors has the features outlined below.

In [182]:
import pandas as pd
X_test = pd.DataFrame(columns=['Resource Quantity', 'Resource Unit Price','Project Type',
       'Project Subject Category Tree', 'Project Subject Subcategory Tree',
       'Project Grade Level Category', 'Project Resource Category',
       'Project Cost', 'Teacher Prefix',
       'School Metro Type','School Percentage Free Lunch', 'School State', 'School City',
       'School County', 'School District'])

X_test['Resource Quantity'] = [10]
X_test['Resource Unit Price'] = [10]
X_test['Project Type'] = ['Teacher-Led']
X_test['Project Subject Category Tree'] = ['Health & Sports']
X_test['Project Subject Subcategory Tree'] = ['Gym & Fitness, Health & Wellness']
X_test['Project Grade Level Category'] = ['Grades 9-12']
X_test['Project Resource Category'] = ['Sports & Exercise Equipment']
X_test['Project Cost'] = [53.3]
X_test['Teacher Prefix'] = ['Mrs.']
X_test['School Metro Type'] = ['suburban']
X_test['School Percentage Free Lunch'] = [65]
X_test['School State'] = ['New York']
X_test['School City'] = ['New York City']
X_test['School County'] = ['Queens']
X_test['School District'] = ['New York Dept Of Education']
temp = X_test
temp.T

In [187]:
def donors_to_recommend(X, ind_ = 0, cluster_disp = False):
    cat_feat = [
            'Project Type','Project Subject Category Tree',
            'Project Subject Subcategory Tree',
            'Project Grade Level Category', 'Project Resource Category',
            'Teacher Prefix','School Metro Type',
            'School State','School City','School County', 'School District']
    
    # work on a copy of the argument instead of the global X_test
    X_transformed = X.copy()

    for cat in cat_feat:
        # Refit the encoder on the raw training values so that every category
        # receives the same integer code the classifier was trained on.
        # transform() raises ValueError for a category unseen in training.
        le = LabelEncoder()
        le.fit(df_sample_1.loc[df_sample.index, cat].astype(str))
        X_transformed[cat] = le.transform(X_transformed[cat].astype(str))
    
    y = clf.predict(X_transformed)
    if cluster_disp:
        print("Cluster Num", y[ind_])
    
    return clusters_by_all[y[ind_]]

The categorical variables of a new project must be encoded the same way as in the initial input dataset. The optimized logistic regression model then predicts the project's class, which determines the cluster of donors the project will be recommended to.

In [188]:
donors_pred = donors_to_recommend(X_test, ind_ = 0, cluster_disp = True)
display(donors_pred.head())
print("No. of donors returned:", len(donors_pred))

# Sample Web App
We developed an interactive web application which takes in information about new projects and runs the classifier to determine which cluster of donors it would recommend the project to. The app is accessible through this link: https://cyntwikip-choosedonors.herokuapp.com/


![Recommender API](https://tjalba.files.wordpress.com/2018/06/screen-shot-2018-06-21-at-12-06-09-am.png?w=720)

# Summary
With the aim of increasing the volume of donations and encouraging first-time donors to donate again, DonorsChoose.org faces the challenge of building a recommendation engine that allows previous donors to easily find and support the new projects that inspire them the most. To build the recommender, a clustering algorithm (KMeans with k = 4) first partitioned the projects on their characteristic features to develop item profiles. Next, a classifier (logistic regression with L1 regularization) was trained to predict the clusters of new, unobserved projects. The resulting recommendation system then returns the Donor IDs and information about the donors corresponding to the predicted cluster. In the experiment above, we have shown how the recommender identifies potential donors when presented with a never-before-seen project. This recommendation system can be used to run an email marketing campaign, making the identification of the target segment more efficient.

# References
<sup> [1] </sup> Breeze, B. (2013) How donors choose charities: the role of personal taste and experiences in giving decisions. *Voluntary Sector Review*, *Vol. 4*, (2), pp. 165-183  
<sup> [2] </sup> Althoff, T and Leskovec, J (2015) Donor Retention in Online Crowdfunding Communities: A Case Study of Donorschoose.org. *ACM*  
<sup> [3] </sup> Solomon, J, et al. (2015) Don’t Wait! How Timing Affects Coordination of Crowdfunding Donations. *ACM*  
<sup> [4] </sup> Rajaraman, Leskovec, and Ullman (2014) Chapter 9 Recommendation Systems in *Mining of Massive Datasets* (307-340) Palo Alto, Calif., USA  
<sup> [5] </sup> Traag, Vincent (2016) Complex Contagion of Campaign Donations. *Public Library of Science*

# Collaborators
Tristan Joshua Alba <br>
Prince Joseph Erneszer Javier <br>
Jude Michael Teves

# Acknowledgements
We would like to acknowledge our mentors Dr. Christopher Monterola and Dr. Erika Fille Legara for their invaluable inputs on how to tackle this problem. We would also like to thank Dr. Christian Alis and Eduardo David for giving us access to computing hardware and also for giving their inputs. We thank Erwin Obias and Bryan Damasco for their support on this project.