# Donors' Choices Report

Patrick King

Professor Morgan

DATA 512

7 December 2018

## Introduction

This report covers my statistical analysis of the open data shared by [DonorsChoose.org](https://www.donorschoose.org), regarding project funding success rate by subject. It details:
* why I examined DonorsChoose open data
* how I went about acquiring, cleaning, and preparing the data for processing
* the statistical methods I used to produce results
* analysis of the implications of said results
* all resources used in conducting this project

In addition to providing some insight into which projects attract donations and how DonorsChoose project managers might be able to revise their tactics to increase funding success, this project aims to be transparent about the data science applied, so that others can easily understand, share, and scale it for additional feedback, correction, and follow-on work.

Supplemental to this report are the [slides](https://github.com/PKing70/data-512-a6/blob/master/DonorsChoicesPreso.pdf) of the corresponding presentation given in class, also available in this repository.

### About this work
I was inspired to start this research because my wife is a public school teacher who often funds her classroom projects, unreimbursed, from our personal savings. I was aware of DonorsChoose due to its high publicity profile, as it is often promoted by celebrities such as [Stephen Colbert](https://youtu.be/I4GMX0MJNYw) (a DonorsChoose board member, not coincidentally). I thought I might be able to learn tips and tricks to help Nicole, my wife, build a DonorsChoose page for her next unfunded project idea.

This self-serving inclination is not unique to Nicole's classroom and our personal finances, however. The New York Times [reports](https://www.nytimes.com/2018/05/16/us/teachers-school-supplies.html) that 94 percent of public school teachers fund their classroom projects out-of-pocket at an average rate of $479 annually. Seven percent report spending over $1000 of their own money on their classrooms to fill budget gaps not met by the stipends allocated by their school districts (Chokshi). DonorsChoose has grown for years, and now channels millions of dollars yearly in private donations to public schools, from over three million donors, reaching over 30 million students through over one million projects.

This work could be the first step in providing insight and advice to help teachers attract classroom funding using DonorsChoose. With such findings, guidelines about how to structure project proposals could be used to increase funding success. However, in researching this project, I decided that whether it is a fair idea to help teachers fund their classrooms using DonorsChoose is a valid, debated question that demands additional consideration and research, as covered in the [Discussion](#discussion), below.

## Background

### About DonorsChoose

[DonorsChoose](https://www.donorschoose.org) is a crowd-sourced nonprofit organization that provides a platform for teachers to request funding for classroom supplies and projects. Donors can then peruse the advertised projects on the site and choose which causes to support, based on information from the page the teacher or class creates for each plea. These "attractor" pages are standardized, containing uniform components such as classroom data, project category, and essays or short-form paragraphs. The page components cover how funds will be used and the financial targets, provide a status bar showing progress towards the goal, and more.

DonorsChoose has made [thirteen years of its data](https://data.donorschoose.org/docs/overview/) open and publicly accessible for analysis. The data includes information about:

* Projects (including classroom projects that have been posted, school information such as government-issued NCES ID, lat/long, and city/state/zip)
* Donations (including donation amounts, donor city, state)
* Project resources (including materials/resources requested for the classroom projects, including vendor name)
* Project essays (including the full text of the teacher-written requests accompanying all classroom projects)
* Giving pages (number of teachers, students, amount raised)


### DonorsChoose Looker Visualizations

In addition to raw data, DonorsChoose also makes filtered data available through an [exploration/visualization portal](https://data.donorschoose.org/explore-our-impact/) served by [Looker](https://looker.com/). To use it, any browser seems able to log in with the shared credentials (login: “opendata@donorschoose.org” / “teachersrock1”), then interact and run queries to see results about specific locations, schools, years, and so on. This is an interesting and useful way to examine trends and results regarding projects and funding, and each visualization can be downloaded as an Acrobat PDF, with its filtered, preprocessed underlying data available as a standard CSV file.

<h4 align="center">Image 1: DonorsChoose exploration portal</h4>

![Visualizations.png](https://github.com/PKing70/data-512-a6/blob/master/Visualizations.png?raw=true)

These visualizations are scoped to the following types:

* dollars raised
* projects funded
* students reached
* donors
* teachers helped
* poverty breakdown by school
* metro type breakdown by project
* top 10: projects by subject area (count)
* all projects (for which there is a project complete date)

However, what's not available in the data exploration portal are any statistics regarding projects that do not have a project complete date. Essentially, a user can explore all the projects that successfully received funding but none of the projects that did not. In the underlying source data .gz archive, however, both successful and unsuccessful projects are provided.

### DonorsChoose APIs

DonorsChoose also makes its data available through REST APIs with multiple endpoints, documented [here](https://data.donorschoose.org/docs/overview/):

* Project Listings
* Donors
* Teachers
* Schools
* Giving pages
* Partners
* Transactions

These APIs, particularly Project Listings, return JSON data that likely corresponds to the raw source data referenced above. I did not use these APIs to produce my report. Additional ideas for using these APIs in future work are covered in [Discussion](#discussion), below.

### Related Work

I could not find other work analyzing funding completion rate, though there is existing (and substantially more rigorous) [research](https://cs.stanford.edu/people/jure/pubs/donors-www15.pdf) by Althoff and Leskovec of Stanford that analyzes this dataset with respect to donor retention rate. This research found that teachers were not particularly successful at retaining donors: "Instead, they are most successful in retaining donors in their first projects. After those, the return rates are monotonically decreasing over time. However, we see that the effect levels off for site donors whereas the retention continues to decrease for teacher-referred donors. This could be explained by viewing teacher-referred donors as a limited resource available to the teacher. Teachers are only able to receive a certain amount of donors from their personal support network. This support is limited and asking over and over again for new projects is less and less likely to be successful as the teacher “drained” most of the resources available to them already" (Althoff, 6). This makes it valuable to determine whether, and how, a teacher can construct a project description page that is most likely to succeed on the first attempt. Examining the features of projects that correlate with successful funding might be a worthwhile addition to the research about how teachers retain donors over time. So, that's what I decided to analyze.

### Research Question

The broad question I examine is: Which features available in the raw, unfiltered DonorsChoose project data correlate with project funding success and failure?



## Methods

### Data acquisition

DonorsChoose provides guidelines for accessing its available data. By reading through these guidelines, I learned much about the structure of the dataset and the issues within it, such as: "A few years back, donor addresses became optional, even when the donor is eligible to receive a mailed thank-you packet from the classroom. So there are a lot of null address fields for donors who elected not to provide their address." ([OpenData Layout and Docs page](https://research.donorschoose.org/t/opedata-layout-and-docs/18)).

The fields and schema of each data file are covered, including the schema of the project data on which I focused. DonorsChoose Data provides guidance to join and query these files using [PostgreSQL](https://www.postgresql.org/). However, after downloading, installing, and attempting a variety of configuration approaches, I did not pursue this path, as I was not able to learn enough about PostgreSQL to run join or query scripts. As I indicated in my verbal presentation about this report, perhaps I shall try again after I complete DATA 514 (Data Management) during the Winter 2018 quarter.

According to the documentation, "the data is compressed, quoted, escaped and without a header. To properly import, use Python's pandas using code in 'How to read this CSV file' under each button. No need to decompress files." This is the method I used to proceed.


First, I import pandas to read the source CSV from the GZ archive. (pandas dataframes will be used later, too.)

In [1]:
import pandas     # For dataframing, merging, numeric conversions, and reading CSVs

To load the compressed CSV from the archive, the .gz file must be present within the working directory. Then, execute this code:

In [5]:
projects = pandas.read_csv('opendata_projects000.gz', escapechar='\\', names=['_projectid', 
           '_teacher_acctid', '_schoolid', 'school_ncesid', 'school_latitude', 
           'school_longitude', 'school_city', 'school_state', 'school_zip', 'school_metro', 
           'school_district', 'school_county', 'school_charter', 'school_magnet', 
           'school_year_round', 'school_nlns', 'school_kipp', 'school_charter_ready_promise', 
           'teacher_prefix', 'teacher_teach_for_america', 'teacher_ny_teaching_fellow', 
           'primary_focus_subject', 'primary_focus_area' ,'secondary_focus_subject', 
           'secondary_focus_area', 'resource_type', 'poverty_level', 'grade_level', 
           'vendor_shipping_charges', 'sales_tax', 'payment_processing_charges', 
           'fulfillment_labor_materials', 'total_price_excluding_optional_support', 
           'total_price_including_optional_support', 'students_reached', 'total_donations', 
           'num_donors', 'eligible_double_your_impact_match', 'eligible_almost_home_match', 
           'funding_status', 'date_posted', 'date_completed', 'date_thank_you_packet_mailed', 
           'date_expiration'])
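
To illustrate what the `names` and `escapechar` arguments do, here is a minimal sketch that writes a tiny headerless, gzip-compressed CSV to a temporary directory and reads it back the same way. The three-column layout, the `note` column, and the file name `toy_projects.gz` are hypothetical and chosen only for illustration; the real projects file has many more columns.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical three-column layout, for illustration only.
columns = ["_projectid", "note", "funding_status"]

# Headerless rows; the first row contains backslash-escaped quotes.
rows = [
    'p1,"Books for \\"Room 5\\"",completed\n',
    'p2,"Art supplies",expired\n',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_projects.gz")
    with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
        handle.writelines(rows)

    # Compression is inferred from the .gz suffix, so the file is never
    # decompressed by hand; names= supplies the missing header row.
    toy = pd.read_csv(path, escapechar="\\", names=columns)

print(toy)
```

Without `names`, pandas would treat the first data row as the header, and without `escapechar` the backslash-escaped quotes would not be read as literal quote characters.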

To get an idea of the size of the loaded data, I look at the row count:

In [6]:
projects.shape[0]

1203287

1.2 million projects loaded. To see how many of each field are populated with values other than null or NaN, I use count:

In [7]:
projects.count()

_projectid                                1203287
_teacher_acctid                           1203287
_schoolid                                 1203287
school_ncesid                             1130941
school_latitude                           1203287
school_longitude                          1203287
school_city                               1193492
school_state                              1203287
school_zip                                1203283
school_metro                              1059692
school_district                           1202934
school_county                             1203270
school_charter                            1203287
school_magnet                             1203287
school_year_round                         1203287
school_nlns                               1203287
school_kipp                               1203287
school_charter_ready_promise              1203287
teacher_prefix                            1203241
teacher_teach_for_america                 1203287
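
As a complement to `count()`, the following sketch shows how the number and share of missing values per column could be read off directly. The toy dataframe and its values are made up for illustration; only the column names come from the projects data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the projects dataframe (values are illustrative).
toy = pd.DataFrame({
    "school_state": ["NY", "NY", "CA", "WA"],
    "school_metro": ["urban", np.nan, np.nan, "rural"],
})

populated = toy.count()            # non-null values per column
missing = toy.isna().sum()         # null/NaN values per column
missing_share = toy.isna().mean()  # fraction of rows missing per column

print(pd.DataFrame({"populated": populated,
                    "missing": missing,
                    "missing_share": missing_share}))
```

For every column, the populated and missing counts add up to the row count, so either view carries the same information.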


Finally, I look at the head to get an idea of what's in the top five rows:

In [9]:
projects.head()

Unnamed: 0,_projectid,_teacher_acctid,_schoolid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_zip,school_metro,...,students_reached,total_donations,num_donors,eligible_double_your_impact_match,eligible_almost_home_match,funding_status,date_posted,date_completed,date_thank_you_packet_mailed,date_expiration
0,7342bd01a2a7725ce033a179d22e382d,5c43ef5eac0f5857c266baa1ccfa3d3f,9e72d6f2f1e9367b578b6479aa5852b7,360009700000.0,40.688454,-73.910432,New York City,NY,11207.0,urban,...,0.0,251.9,1,f,f,completed,2002-09-13 00:00:00,2002-09-23 00:00:00,2003-01-27 00:00:00,2003-12-31 00:00:00
1,ed87d61cef7fda668ae70be7e0c6cebf,1f4493b3d3fe4a611f3f4d21a249376a,1ae4695be589a36816188e2b301a0395,360007700000.0,40.765517,-73.96009,New York City,NY,10065.0,,...,0.0,137.0,1,f,f,completed,2002-09-13 00:00:00,2002-09-23 00:00:00,2003-01-03 00:00:00,2003-12-31 00:00:00
2,b56b502d25666e29550d107bf7e17910,57426949b47700ccf62098e1e9b0220c,4a06a328dd87bd29892d73310052f45f,360007700000.0,40.770233,-73.95076,New York City,NY,10075.0,,...,0.0,125.0,1,f,f,completed,2002-09-16 00:00:00,2002-09-19 00:00:00,2002-12-19 00:00:00,2003-12-31 00:00:00
3,016f03312995d5c89d6b348be4682166,9c0aa56b63b743454d6da9effcf122fc,bb0af5dac1b54693ba86ef63eacd6594,360007600000.0,40.727826,-73.978721,New York City,NY,10009.0,urban,...,0.0,205.0,1,f,f,completed,2002-09-17 00:00:00,2002-09-17 00:00:00,2002-12-02 00:00:00,2003-12-31 00:00:00
4,cf6275558534ca1b276b0d8d5130dd9a,1d4d8a42730dbb66af1ebb6ab37456b7,768dab263f87881fe7c68ffb3965df7c,360008300000.0,40.841216,-73.938605,New York City,NY,10032.0,urban,...,0.0,264.0,1,f,f,completed,2002-09-17 00:00:00,2002-09-23 00:00:00,2003-02-26 00:00:00,2003-12-31 00:00:00


If printed, a reader might not see all the columns, above. But when embedded in an interactive notebook, scroll to the right in the above dataframe to see all columns.

The funding_status field is the response variable I want to examine, so I want to see what's in it by using [unique()](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.unique.html#pandas.unique):

In [10]:
projects.funding_status.unique()

array(['completed', 'expired', 'reallocated', 'live'], dtype=object)

There are four states for funding_status. According to the documentation, "**funding_status** refers to the status of this project as of the date the dataset was created. *Reallocated* projects are projects that received partial funding but the project never completed, so the donations were moved towards another project. *Completed* projects refer to projects that received full funding. *Expired* projects are ones that expired before donations were made. *Live* projects are projects that were still open for donations on the day the dataset was created."

Completed projects are what I want to encode with 1 as success. Expired should be 0 for failure. I want to omit live projects as undetermined, and I'm not sure how to consider reallocated projects at this time.
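
One way this encoding could be sketched is with a dictionary and `Series.map()`. This is an illustrative alternative on toy data, not the approach taken later in this report (which uses dummy variables); the `funded` column name is an assumption introduced here.

```python
import pandas as pd

# Toy statuses covering all four states seen in the data.
toy = pd.DataFrame({
    "funding_status": ["completed", "expired", "live",
                       "reallocated", "completed"],
})

# Statuses absent from the mapping (live, reallocated) become NaN.
outcome_codes = {"completed": 1, "expired": 0}
toy["funded"] = toy["funding_status"].map(outcome_codes)

# Drop the undetermined rows, then store the outcome as an integer.
encoded = toy.dropna(subset=["funded"]).astype({"funded": int})

print(encoded)
```

Leaving a status out of the mapping is what excludes it, so the treatment of reallocated projects can be changed later by adding one dictionary entry.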

For my main analysis, I want to encode the categorical feature primary_focus_subject to determine association with funding_status. To see what's available in primary_focus_subject:

In [11]:
projects.primary_focus_subject.unique()

array(['Other', 'Literacy', 'Early Development', 'History & Geography',
       'Economics', 'Environmental Science', 'Health & Life Science',
       'Literature & Writing', 'Mathematics', 'Music', 'Visual Arts',
       'College & Career Prep', 'Parent Involvement', 'Social Sciences',
       'Civics & Government', 'Extracurricular', 'Performing Arts',
       'Character Education', 'Applied Sciences', 'Team Sports',
       'Foreign Languages', 'Community Service', 'Special Needs',
       'Gym & Fitness', 'ESL', 'Health & Wellness', 'Nutrition', nan,
       'Financial Literacy'], dtype=object)

There are similar/related features in the projects, for which I consider running additional regressions and correlations. They would be primary_focus_area, which is a "parent group" category that gathers related subjects under one heading. Also, there are secondary subjects and focus areas that a teacher can list for projects, too. To look at each, I use unique() again:

In [12]:
projects.primary_focus_area.unique()

array(['Applied Learning', 'Literacy & Language', 'History & Civics',
       'Math & Science', 'Music & The Arts', 'Health & Sports',
       'Special Needs', nan], dtype=object)

In [13]:
projects.secondary_focus_subject.unique()

array([nan, 'History & Geography', 'Early Development', 'Extracurricular',
       'Other', 'College & Career Prep', 'Performing Arts', 'Literacy',
       'Literature & Writing', 'Mathematics', 'Social Sciences',
       'Applied Sciences', 'Environmental Science', 'Visual Arts',
       'Parent Involvement', 'Foreign Languages', 'Team Sports',
       'Character Education', 'Music', 'Health & Life Science',
       'Civics & Government', 'Community Service', 'Economics',
       'Special Needs', 'ESL', 'Health & Wellness', 'Gym & Fitness',
       'Nutrition', 'Financial Literacy'], dtype=object)

In [14]:
projects.secondary_focus_area.unique()

array([nan, 'History & Civics', 'Applied Learning', 'Music & The Arts',
       'Literacy & Language', 'Math & Science', 'Health & Sports',
       'Special Needs'], dtype=object)

Because these four features are closely related (each subject rolls up into a focus area, and the secondary fields reuse the same categories), I don't want to combine them all into one model, where they would be strongly correlated with one another. Instead, I want to simplify to the predictor I am most interested in, primary_focus_subject. So I use data_funding to contain just my wanted predictor and response:

In [15]:
data_funding = pandas.DataFrame({'funding':projects['funding_status'],
                                 'subject':projects['primary_focus_subject']})

I want to examine data_funding for appropriateness, as I did with the projects source loaded from the dataset:

In [16]:
data_funding.count()

funding    1203287
subject    1203241
dtype: int64

In [17]:
data_funding.head()

Unnamed: 0,funding,subject
0,completed,Other
1,completed,Literacy
2,completed,Early Development
3,completed,History & Geography
4,completed,Other


In [18]:
data_funding.subject.unique()

array(['Other', 'Literacy', 'Early Development', 'History & Geography',
       'Economics', 'Environmental Science', 'Health & Life Science',
       'Literature & Writing', 'Mathematics', 'Music', 'Visual Arts',
       'College & Career Prep', 'Parent Involvement', 'Social Sciences',
       'Civics & Government', 'Extracurricular', 'Performing Arts',
       'Character Education', 'Applied Sciences', 'Team Sports',
       'Foreign Languages', 'Community Service', 'Special Needs',
       'Gym & Fitness', 'ESL', 'Health & Wellness', 'Nutrition', nan,
       'Financial Literacy'], dtype=object)

In [70]:
data_funding.funding.unique()

array(['completed', 'expired', 'reallocated', 'live'], dtype=object)

I want to determine how many of each interesting state of funding there are:

In [20]:
data_funding[(data_funding['funding']=='completed')].count()

funding    797071
subject    797047
dtype: int64

In [21]:
data_funding[(data_funding['funding']=='expired')].count()

funding    333630
subject    333608
dtype: int64

In [22]:
data_funding[(data_funding['funding']=='reallocated')].count()

funding    9086
subject    9086
dtype: int64

In [23]:
data_funding[(data_funding['funding']=='live')].count()

funding    63500
subject    63500
dtype: int64

I know I don't want to include live projects. Looking at the very low count of reallocated projects (9086/1203287 is less than one percent) I decide to not include them as not relevant.
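
The arithmetic behind this decision can be checked with a short sketch that uses only the four counts reported above.

```python
# Counts per funding status, copied from the outputs above.
status_counts = {
    "completed": 797071,
    "expired": 333630,
    "reallocated": 9086,
    "live": 63500,
}

total = sum(status_counts.values())
shares = {status: count / total for status, count in status_counts.items()}

for status, share in shares.items():
    print(f"{status:<12}{share:7.2%}")

# Rows that remain once live and reallocated projects are excluded.
kept = status_counts["completed"] + status_counts["expired"]
print("total:", total, "kept:", kept)
```

The four counts sum to the 1,203,287 rows loaded, and the completed and expired counts together give the 1,130,701 rows that remain after filtering.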

Pandas [get_dummies()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) is a quick way to create a boolean "dummy" variable for each subject categorical value: 

In [24]:
df_dummy = pandas.get_dummies(data_funding)
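
A small sketch on made-up rows shows what `get_dummies()` produces, including what happens to a missing subject. Passing `dtype=int` is an assumption added here so that the columns hold 0/1 as in this report's output; newer pandas releases otherwise default to True/False.

```python
import numpy as np
import pandas as pd

# Toy rows; the last one has no subject recorded.
toy = pd.DataFrame({
    "funding": ["completed", "expired", "completed"],
    "subject": ["Music", "Literacy", np.nan],
})

# One 0/1 column per distinct value, prefixed with the source column name.
dummies = pd.get_dummies(toy, dtype=int)

print(dummies)
```

A row with a missing subject gets 0 in every subject column rather than a column of its own, so such rows stay in the dataframe without belonging to any subject.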

In [25]:
df_dummy

Unnamed: 0,funding_completed,funding_expired,funding_live,funding_reallocated,subject_Applied Sciences,subject_Character Education,subject_Civics & Government,subject_College & Career Prep,subject_Community Service,subject_ESL,...,subject_Mathematics,subject_Music,subject_Nutrition,subject_Other,subject_Parent Involvement,subject_Performing Arts,subject_Social Sciences,subject_Special Needs,subject_Team Sports,subject_Visual Arts
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


I remove the unwanted funding statuses by selecting for only funding_completed or funding_expired:

In [26]:
df_dummy = df_dummy[(df_dummy['funding_completed']==1) | (df_dummy['funding_expired']==1)]

I want to check the layout of my df_dummy working dataframe, using the methods I've used before for examination:

In [27]:
df_dummy.shape[0]

1130701

In [28]:
df_dummy.head()

Unnamed: 0,funding_completed,funding_expired,funding_live,funding_reallocated,subject_Applied Sciences,subject_Character Education,subject_Civics & Government,subject_College & Career Prep,subject_Community Service,subject_ESL,...,subject_Mathematics,subject_Music,subject_Nutrition,subject_Other,subject_Parent Involvement,subject_Performing Arts,subject_Social Sciences,subject_Special Needs,subject_Team Sports,subject_Visual Arts
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


That's what I expect, so now I drop the unneeded columns:

In [29]:
df_dummy = df_dummy.drop(["funding_expired", "funding_live", "funding_reallocated"], axis=1)

In [30]:
df_dummy.shape[0]

1130701

In [31]:
df_dummy.head()

Unnamed: 0,funding_completed,subject_Applied Sciences,subject_Character Education,subject_Civics & Government,subject_College & Career Prep,subject_Community Service,subject_ESL,subject_Early Development,subject_Economics,subject_Environmental Science,...,subject_Mathematics,subject_Music,subject_Nutrition,subject_Other,subject_Parent Involvement,subject_Performing Arts,subject_Social Sciences,subject_Special Needs,subject_Team Sports,subject_Visual Arts
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


Things appear to be set up in the dataframe as I want for my regression, mostly.

I want to see how many of each subject are in my df_dummy dataframe now:

In [41]:
for column in df_dummy:
    print(column)
    print(df_dummy[df_dummy[column] == 1].shape[0])     # Rows where this column is 1

funding_completed
797071
subject_Applied Sciences
59784
subject_Character Education
14016
subject_Civics & Government
3855
subject_College & Career Prep
11116
subject_Community Service
2326
subject_ESL
15212
subject_Early Development
23352
subject_Economics
2995
subject_Environmental Science
45862
subject_Extracurricular
4811
subject_Financial Literacy
2829
subject_Foreign Languages
8152
subject_Gym & Fitness
13029
subject_Health & Life Science
37320
subject_Health & Wellness
23951
subject_History & Geography
25061
subject_Literacy
323646
subject_Literature & Writing
136983
subject_Mathematics
156476
subject_Music
33262
subject_Nutrition
2325
subject_Other
19690
subject_Parent Involvement
1545
subject_Performing Arts
15341
subject_Social Sciences
13978
subject_Special Needs
73955
subject_Team Sports
8778
subject_Visual Arts
51005
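
Because every column holds only 0 or 1, the same tallies could also be obtained without a loop by summing each column. The sketch below uses a made-up three-column frame to show the idea.

```python
import pandas as pd

# Toy 0/1 frame shaped like df_dummy (values are illustrative).
toy = pd.DataFrame({
    "funding_completed": [1, 0, 1, 1],
    "subject_Music":     [1, 0, 0, 1],
    "subject_Literacy":  [0, 1, 1, 0],
})

# For 0/1 columns, the column sum equals the number of rows set to 1.
counts = toy.sum().sort_values(ascending=False)

print(counts)
```

Sorting the result makes it easier to see which subjects dominate the data.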


## Findings

The [corr()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) method, applied to the dataframe, should show me the Pearson's correlation coefficient of each subject with funding_completed (in the funding_completed column):

In [40]:
correlation = df_dummy.corr(method='pearson')
correlation.head()

Unnamed: 0,funding_completed,subject_Applied Sciences,subject_Character Education,subject_Civics & Government,subject_College & Career Prep,subject_Community Service,subject_ESL,subject_Early Development,subject_Economics,subject_Environmental Science,...,subject_Mathematics,subject_Music,subject_Nutrition,subject_Other,subject_Parent Involvement,subject_Performing Arts,subject_Social Sciences,subject_Special Needs,subject_Team Sports,subject_Visual Arts
funding_completed,1.0,0.00936,0.001238,0.000382,-0.00973,0.001169,-0.007633,-0.011572,0.006215,0.026534,...,-0.003131,0.020708,0.004368,-0.026524,-0.006884,0.005843,-0.001748,0.000255,0.011137,0.013172
subject_Applied Sciences,0.00936,1.0,-0.02647,-0.01382,-0.023543,-0.010727,-0.027591,-0.034311,-0.012176,-0.04858,...,-0.094691,-0.041134,-0.010725,-0.031454,-0.00874,-0.02771,-0.026434,-0.062505,-0.020899,-0.051354
subject_Character Education,0.001238,-0.02647,1.0,-0.006553,-0.011163,-0.005087,-0.013083,-0.016269,-0.005774,-0.023035,...,-0.044899,-0.019504,-0.005085,-0.014915,-0.004144,-0.013139,-0.012534,-0.029638,-0.00991,-0.02435
subject_Civics & Government,0.000382,-0.01382,-0.006553,1.0,-0.005828,-0.002656,-0.00683,-0.008494,-0.003014,-0.012026,...,-0.023441,-0.010183,-0.002655,-0.007787,-0.002164,-0.00686,-0.006544,-0.015473,-0.005174,-0.012713
subject_College & Career Prep,-0.00973,-0.023543,-0.011163,-0.005828,1.0,-0.004524,-0.011636,-0.01447,-0.005135,-0.020488,...,-0.039934,-0.017347,-0.004523,-0.013265,-0.003686,-0.011686,-0.011148,-0.02636,-0.008814,-0.021657
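
When both variables are 0/1, Pearson's correlation coefficient reduces to the phi coefficient of the 2x2 table of counts. The sketch below, on made-up data, computes it by hand and compares it with what `corr()` returns; the `n11`, `n10`, `n01`, and `n00` names are introduced here for the four cell counts.

```python
import numpy as np
import pandas as pd

# Toy 0/1 data (illustrative only).
toy = pd.DataFrame({
    "funding_completed": [1, 1, 1, 0, 0, 1, 0, 1],
    "subject_Music":     [1, 1, 0, 0, 0, 1, 1, 0],
})

# 2x2 table: rows are subject_Music, columns are funding_completed.
table = pd.crosstab(toy["subject_Music"], toy["funding_completed"])
n11 = table.loc[1, 1]   # Music and funded
n10 = table.loc[1, 0]   # Music and not funded
n01 = table.loc[0, 1]   # not Music and funded
n00 = table.loc[0, 0]   # not Music and not funded

# Phi coefficient from the cell counts and the four marginal totals.
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)

pearson = toy["subject_Music"].corr(toy["funding_completed"])
print(phi, pearson)
```

Seen this way, each coefficient in the funding_completed column summarizes how far one subject's funded share departs from that of all other projects, scaled by how common the subject and the outcome are.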


To simplify my view to just this correlation:

In [34]:
corr = correlation[['funding_completed']]
corr

Unnamed: 0,funding_completed
funding_completed,1.0
subject_Applied Sciences,0.00936
subject_Character Education,0.001238
subject_Civics & Government,0.000382
subject_College & Career Prep,-0.00973
subject_Community Service,0.001169
subject_ESL,-0.007633
subject_Early Development,-0.011572
subject_Economics,0.006215
subject_Environmental Science,0.026534


In [86]:
corr.sort_values('funding_completed')

Unnamed: 0,funding_completed
subject_Other,-0.026524
subject_Literature & Writing,-0.015453
subject_Early Development,-0.011572
subject_Gym & Fitness,-0.009968
subject_College & Career Prep,-0.00973
subject_Health & Wellness,-0.008901
subject_ESL,-0.007633
subject_Foreign Languages,-0.007441
subject_Parent Involvement,-0.006884
subject_Literacy,-0.004181


That shows which subjects are negatively and positively correlated with funding_completed.

To set up a logistic regression, I want to put funding_completed into a y response column, and put the subject dummies into an X matrix. For this I use loc with a boolean mask over the column names: every column except funding_completed goes into X, and funding_completed alone goes into y:

In [35]:
X = df_dummy.loc[:, df_dummy.columns != 'funding_completed']
y = df_dummy.loc[:, df_dummy.columns == 'funding_completed']
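
The same split can be seen on a toy frame, where it is easier to check that the mask acts on column names and keeps every row. The `is_response` name is introduced here for illustration.

```python
import pandas as pd

# Toy 0/1 frame shaped like df_dummy (values are illustrative).
toy = pd.DataFrame({
    "funding_completed": [1, 0, 1],
    "subject_Music":     [1, 0, 0],
    "subject_Literacy":  [0, 1, 0],
})

# Boolean mask with one entry per column name.
is_response = toy.columns == "funding_completed"

X = toy.loc[:, ~is_response]   # all predictor columns, every row
y = toy.loc[:, is_response]    # one-column dataframe, every row

print(X.shape, y.shape)
```

Selecting with a mask returns y as a one-column dataframe rather than a Series, which matches the (n, 1) shape reported for y in this report.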

In [36]:
print(X.shape)
print(y.shape)

(1130701, 28)
(1130701, 1)


In [37]:
X.head()

Unnamed: 0,subject_Applied Sciences,subject_Character Education,subject_Civics & Government,subject_College & Career Prep,subject_Community Service,subject_ESL,subject_Early Development,subject_Economics,subject_Environmental Science,subject_Extracurricular,...,subject_Mathematics,subject_Music,subject_Nutrition,subject_Other,subject_Parent Involvement,subject_Performing Arts,subject_Social Sciences,subject_Special Needs,subject_Team Sports,subject_Visual Arts
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [38]:
y.head()

Unnamed: 0,funding_completed
0,1
1,1
2,1
3,1
4,1


X and y look as I want them to look.

To proceed to the logistic regression, I import [statsmodels.api](https://www.statsmodels.org/stable/index.html). I also want to add an intercept constant (a column of 1's prepended on the left of the dataframe), and [add_constant()](https://www.statsmodels.org/dev/generated/statsmodels.tools.tools.add_constant.html) does this.

Then run the regression using [Logit()](https://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.html).

In [44]:
import statsmodels.api as sm

X_int = sm.add_constant(X)     # Prepend an intercept column of 1's

logit_model = sm.Logit(y, X_int)

result = logit_model.fit()
print(result.summary2())       # Display the fitted coefficients and p-values


Optimization terminated successfully.
         Current function value: 0.245994
         Iterations 5
                              Results: Logit
Model:                 Logit               Pseudo R-squared:   inf        
Dependent Variable:    funding_completed   AIC:                556349.3014
Date:                  2018-12-09 19:32    BIC:                556695.5135
No. Observations:      1130701             Log-Likelihood:     -2.7815e+05
Df Model:              28                  LL-Null:            0.0000     
Df Residuals:          1130672             LLR p-value:        1.0000     
Converged:             1.0000              Scale:              1.0000     
No. Iterations:        5.0000                                             
--------------------------------------------------------------------------
                              Coef.  Std.Err.   z    P>|z|   [0.025 0.975]
--------------------------------------------------------------------------
const                       

The p-values of the logistic regression show that each of the subjects is a significant predictor of funding success at a significance level of 0.05 (except Other, College & Career Prep, and Parent Involvement). The most significant findings would likely be represented by the smallest p-values, though to rank the subjects I'll go with a sort of the Pearson's correlations calculated above.
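
To help read such coefficients, here is a hedged sketch of what they mean in the simplest case. When a logistic regression has a single categorical predictor and one category is held out as the reference, the maximum-likelihood coefficients have a closed form: the intercept is the log-odds of funding for the reference category, and each other coefficient is the difference in log-odds from that reference. The toy data and the choice of Literacy as the reference are assumptions for illustration. The model in this report keeps all 28 dummies, so its implicit baseline would appear to be the few rows with no subject recorded.

```python
import numpy as np
import pandas as pd

# Toy data: 10 projects per subject with 7, 8, and 5 funded (illustrative).
toy = pd.DataFrame({
    "subject": ["Literacy"] * 10 + ["Music"] * 10 + ["Other"] * 10,
    "funded":  [1] * 7 + [0] * 3 + [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5,
})

rate = toy.groupby("subject")["funded"].mean()   # funded share per subject
log_odds = np.log(rate / (1 - rate))             # logit of each share

reference = "Literacy"                           # arbitrary baseline choice
intercept = log_odds[reference]
coefficients = (log_odds - intercept).drop(reference)
odds_ratios = np.exp(coefficients)               # odds relative to reference

print(coefficients)
print(odds_ratios)
```

A positive coefficient means a subject's odds of funding are higher than the reference subject's, and a negative one means they are lower, which is why the sign depends on which category serves as the baseline.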

In [46]:
corr.sort_values(['funding_completed'])

Unnamed: 0,funding_completed
subject_Other,-0.026524
subject_Literature & Writing,-0.015453
subject_Early Development,-0.011572
subject_Gym & Fitness,-0.009968
subject_College & Career Prep,-0.00973
subject_Health & Wellness,-0.008901
subject_ESL,-0.007633
subject_Foreign Languages,-0.007441
subject_Parent Involvement,-0.006884
subject_Literacy,-0.004181


<a id='discussion'></a>

## Discussion

The DonorsChoose approach to funding education expenses is a valuable alternative for teachers who often must use personal funds to equip their classrooms or to enable novel learning experiences. Helping teachers understand the effectiveness or shortcomings of various options in setting up, describing, and categorizing their projects might increase the rates of donation. Though, it might instead produce a "zero-sum" competitive result, where some teachers attract even more donations, causing others who are less informed or adept with DonorsChoose to go without. 


## Conclusion

By either measure, Pearson's correlation coefficient or the p-values of a logistic regression, some subjects are associated with funding success while others are associated with failure, although the correlation coefficients are small in magnitude (all of the subject coefficients shown above fall between -0.03 and 0.03). This answers the research question about which features available in the raw, unfiltered DonorsChoose project data correlate with project funding success or failure.

With such information, teachers could consider categorizing their projects differently in pursuit of greater funding likelihood. For example, if a teacher wanted soccer balls, they could choose to categorize the project as "Team Sports" rather than "Gym & Fitness" and possibly increase their likelihood of funding. However, given what I learned in doing this research and in reading about the reservations some have about crowdfunding classroom budgets, perhaps more work should be done before disseminating any such "advice."





## References

Althoff, Tim and Leskovec, Jure. "Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org." *ACM International Conference on World Wide Web (WWW)* 2015 [<https://cs.stanford.edu/people/jure/pubs/donors-www15.pdf>](https://cs.stanford.edu/people/jure/pubs/donors-www15.pdf).

Chokshi, Niraj. "94 Percent of U.S. Teachers Spend Their Own Money on School Supplies, Survey Finds." *New York Times*. Web. 16 May 2018 [<https://www.nytimes.com/2018/05/16/us/teachers-school-supplies.html>](https://www.nytimes.com/2018/05/16/us/teachers-school-supplies.html).

Karp, Jonathan. "Want to Keep School Funding Alive? Put an End to DonorsChoose." *Mic* 11 July 2013: n. pag. Web. 21 November 2018 [<https://mic.com/articles/54021/want-to-keep-school-funding-alive-put-an-end-to-donorschoose>](https://mic.com/articles/54021/want-to-keep-school-funding-alive-put-an-end-to-donorschoose#.vCXQjvABc).

Tyre, Peg. "Beyond School Supplies: How DonorsChoose is Crowdsourcing Real Education Reform." *FastCompany* 2 October 2014: n. pag. Web. 21 November 2018 [<https://www.fastcompany.com/3025597/donorschoose-hot-for-teachers>](https://www.fastcompany.com/3025597/donorschoose-hot-for-teachers).
