### The Modeling Problem

We assume that DonorsChoose has hired a digital content expert who will review projects, help teachers improve their postings, and so increase their chances of reaching their funding threshold. Because this individualized review is labor-intensive, the digital content expert has **time to review and support only 10% of the projects posted to the platform on a given day**.

Our task is to help this content expert focus their limited time on the projects that most need help. We therefore want to build a model that **identifies the projects least likely to be fully funded before they expire** and passes them to the digital content expert for review.
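
The selection step this implies can be sketched in a few lines. The sketch assumes a model has already produced, for each project posted on a given day, a predicted probability of being fully funded. The function name `select_for_review`, the `review_capacity` parameter, and the example scores are hypothetical and not part of the original notebook.

```python
import math

def select_for_review(funding_scores, review_capacity=0.10):
    """Return the project ids with the lowest predicted funding probability.

    funding_scores: dict mapping project id -> predicted probability of
        being fully funded (hypothetical model output).
    review_capacity: fraction of the day's projects the expert can review.
    """
    # Round down so the expert is never handed more than they can review.
    n_review = math.floor(len(funding_scores) * review_capacity)
    # Sort ascending: the projects least likely to be funded come first.
    ranked = sorted(funding_scores, key=funding_scores.get)
    return ranked[:n_review]

# Hypothetical scores for 20 projects posted on one day.
day_scores = {f"p{i:02d}": i / 20 for i in range(20)}
flagged = select_for_review(day_scores)
print(flagged)
```

With 20 projects and a 10% capacity, the two lowest-scoring projects are passed to the expert. Ranking within each day, rather than applying a fixed probability cutoff, keeps the workload at the stated capacity regardless of how many projects are posted.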


# Getting Set Up

In [1]:
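!pip install kaggle

import json
import os

# Read Kaggle credentials from the environment; never commit a real key to a notebook.
# KAGGLE_USERNAME and KAGGLE_KEY must be set before running this cell.
api_token = {"username": os.environ["KAGGLE_USERNAME"], "key": os.environ["KAGGLE_KEY"]}

# Write the token where the kaggle CLI expects it, then restrict its permissions.
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(api_token, file)

!chmod 600 /root/.kaggle/kaggle.json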

!kaggle competitions download -c kdd-cup-2014-predicting-excitement-at-donors-choose

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Downloading kdd-cup-2014-predicting-excitement-at-donors-choose.zip to /content
100% 922M/926M [00:11<00:00, 112MB/s] 
100% 926M/926M [00:11<00:00, 81.4MB/s]


In [5]:
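import zipfile

# Extract into the current working directory (/content), where the archive was
# downloaded, so the relative paths passed to pd.read_csv below resolve.
with zipfile.ZipFile('kdd-cup-2014-predicting-excitement-at-donors-choose.zip') as z:
    z.extractall(".")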

In [7]:
import pandas as pd
pd.set_option('display.max_columns', None)

In [9]:
donations_df = pd.read_csv("donations.csv.zip")
projects_df = pd.read_csv("projects.csv.zip")

# Getting the Baseline Rate

In [10]:
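from datetime import timedelta

# Parse timestamps so they can be subtracted.
donations_df['donation_timestamp'] = pd.to_datetime(donations_df.donation_timestamp)
projects_df['date_posted'] = pd.to_datetime(projects_df.date_posted)

# One row per (project, donation); projects without donations keep one row with NaT/NaN.
a = pd.merge(projects_df, donations_df, on=['projectid'], how='left')
# True when the donation arrived less than 120 days (about 4 months) after posting.
# Comparisons against NaT evaluate to False, so projects with no donations are False.
a['in_4_months'] = (a['donation_timestamp'] - a['date_posted']) < timedelta(days=120)

# Zero out missing donations and donations made outside the window.
values = {"donation_to_project": 0.0}
a = a.fillna(value=values)
a.loc[~a['in_4_months'], 'donation_to_project'] = 0.0

# Sum in-window donations per project and compare against the project's price.
donation_in_4_months = a.groupby(['projectid'])['donation_to_project'].sum().reset_index(name='donation_in_4_months')
df = pd.merge(projects_df, donation_in_4_months, on=['projectid'], how='left')
df['if_fully_funded_after_4_months'] = df['total_price_excluding_optional_support'] <= df['donation_in_4_months']
df['if_fully_funded_after_4_months'].value_counts()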

True     363940
False    300158
Name: if_fully_funded_after_4_months, dtype: int64
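
To make the labeling rule in the cell above easier to check by hand, here is a minimal pure-Python sketch of the same logic on toy data. The three projects, their prices, and the donation records are invented for illustration; only the 120-day window and the "in-window donations cover the price" rule come from the notebook.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=120)  # same 120-day cutoff as the cell above

# Toy data (hypothetical): project id -> (date posted, price to fund)
projects = {
    "A": (datetime(2013, 1, 1), 100.0),
    "B": (datetime(2013, 1, 1), 100.0),
    "C": (datetime(2013, 1, 1), 100.0),  # receives no donations
}
# Toy donations: (project id, donation timestamp, amount)
donations = [
    ("A", datetime(2013, 2, 1), 60.0),
    ("A", datetime(2013, 3, 1), 40.0),  # A reaches its price inside the window
    ("B", datetime(2013, 2, 1), 50.0),
    ("B", datetime(2013, 8, 1), 50.0),  # more than 120 days after posting
]

# Sum only the donations that arrive within the window.
raised = {pid: 0.0 for pid in projects}
for pid, ts, amount in donations:
    date_posted, _ = projects[pid]
    if ts - date_posted < WINDOW:
        raised[pid] += amount

# A project counts as fully funded when in-window donations cover its price.
labels = {pid: raised[pid] >= price for pid, (_, price) in projects.items()}
print(labels)
```

Project B eventually collects its full price, but the second donation arrives too late, so it is labeled as not funded. Project C never receives a donation and is labeled the same way, which mirrors how the left merge and `fillna` treat projects without donations.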

In [28]:
# The mean of a boolean column is the fraction of True values.
ratio = df['if_fully_funded_after_4_months'].mean()
print("Percentage of projects that are fully funded is: %.2f" % (ratio*100) + "%")

Percentage of projects that are fully funded is: 54.80%
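
One way to read this baseline, sketched below using only the two counts printed earlier: if the expert reviewed a random 10% of projects, the expected share of reviewed projects that would not have been fully funded equals the not-funded base rate. Treating that share as the benchmark for a model to beat is our interpretation rather than something the notebook states.

```python
# Counts taken from the value_counts() output above.
funded = 363_940
not_funded = 300_158
total = funded + not_funded

funded_rate = funded / total
not_funded_rate = not_funded / total

# With random selection, the expected fraction of reviewed projects that
# truly needed help equals the not-funded base rate.
print(f"Fully funded within 120 days: {funded_rate:.2%}")
print(f"Not fully funded (random-selection precision): {not_funded_rate:.2%}")
```

A model is only useful here if the 10% of projects it flags contains a higher share of unfunded projects than this random-selection figure.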
