# Predicting Donation Projects Outcome Based on DonorsChoose.org Data

## Feature Engineering Notebook

#### *Author: Kunyu He*
#### *University of Chicago CAPP'20*

### Executive Summary

In this notebook, I read the preprocessed training and testing data, which were produced by the [ETL notebook](https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/5a618e1d-b09e-4956-b0f0-ced1bbeae5a3/view?access_token=a7f6bc86b447a978fbf3c54457916f471e1fa4c4eccc9542d20086a6d63f6cb7) and stored on [IBM Cloud Object Storage](https://www.ibm.com/cloud/object-storage?S_PKG=AW&cm_mmc=Search_Google-_-Cloud_Cloud+Platform-_-WW_NA-_-+ibm++object++storage_Broad_&cm_mmca1=000016GC&cm_mmca2=10007090&cm_mmca7=9060146&cm_mmca8=aud-311016886972:kwd-346458796492&cm_mmca9=_k_CjwKCAiAyfvhBRBsEiwAe2t_i-XCqy6aVw7VL5rPgPbazlACBDB8tL5qFioP_k0oLEF8dxisH8cTlBoClHoQAvD_BwE_k_&cm_mmca10=317209285867&cm_mmca11=b&mkwid=_k_CjwKCAiAyfvhBRBsEiwAe2t_i-XCqy6aVw7VL5rPgPbazlACBDB8tL5qFioP_k0oLEF8dxisH8cTlBoClHoQAvD_BwE_k_|1445|530530&cvosrc=ppc.google.%2Bibm%20%2Bobject%20%2Bstorage&cvo_campaign=000016GC&cvo_crid=317209285867&Matchtype=b&gclid=CjwKCAiAyfvhBRBsEiwAe2t_i-XCqy6aVw7VL5rPgPbazlACBDB8tL5qFioP_k0oLEF8dxisH8cTlBoClHoQAvD_BwE) in `.csv` format, and perform feature engineering based on findings in the [EDA notebook](https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/ea878b70-71e0-406e-a118-7c8e898db8fb/view?access_token=031878c433b299b1fd3fbf7c0fb07b2ebc776d4b30ced9b8fe37e18a13890cec). The process includes:

* Create new features: the month and year of project posting, plus the character lengths of the project's title, need statement and short description
* Apply one-hot encoding to categorical variables with more than two levels
* Drop the target, the essay data and variables with too many categories from the features DataFrame
* Standardize the features DataFrame with a standard scaler and transform it into a NumPy feature matrix
* Extract the target variable *(`fully_funded`)*

Output data is stored in `.csv` format.

### Load Data

The cells below list the data assets in the IBM Cloud Object Storage bucket linked to this project. Because the setup cell includes my credentials, its code is hidden from unauthorized viewers.

In [1]:
import pandas as pd
import numpy as np

from sklearn import preprocessing

In [2]:
# The code was removed by Watson Studio for sharing.

In [3]:
project.get_files()

[{'asset_id': '7a2d8b2c-65c5-4258-8605-b95653bd30c5',
  'name': 'Donation-Projects-Outcome-Prediction.data.test.csv'},
 {'asset_id': 'f5450447-79fb-4b53-b646-5fbb9a220a8f',
  'name': 'Donation-Projects-Outcome-Prediction.data.train.csv'},
 {'asset_id': 'b6920713-693a-454e-855f-a24c95efd8ce',
  'name': 'Donation-Projects-Outcome-Prediction.data.projects.csv'},
 {'asset_id': '9b1c8961-1564-41a7-8402-89fded8d7e21',
  'name': 'Donation-Projects-Outcome-Prediction.data.outcomes.csv'},
 {'asset_id': '42bb6e53-4a3c-48f5-bcfb-5a7ab6b4461c',
  'name': 'Donation-Projects-Outcome-Prediction.data.resources.csv'},
 {'asset_id': '260878e4-a8c3-4c74-8070-48813214e8b2',
  'name': 'Donation-Projects-Outcome-Prediction.data.essays.csv'}]

Load data into the environment.

In [4]:
train = pd.read_csv(project.get_file('Donation-Projects-Outcome-Prediction.data.train.csv'))
test = pd.read_csv(project.get_file('Donation-Projects-Outcome-Prediction.data.test.csv'))
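
The preprocessing function further down calls the `.dt` accessor on `date_posted`, which only works on a datetime column, while `pd.read_csv` reads dates as plain strings unless told otherwise. A minimal sketch of parsing the column at load time follows; the two-row in-memory CSV is made up for illustration and stands in for the real file.

```python
import io

import pandas as pd

# Made-up two-row sample standing in for the real training file.
sample_csv = io.StringIO(
    "date_posted,fully_funded\n"
    "2013-01-15,1\n"
    "2013-11-02,0\n"
)

# parse_dates converts the named column to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["date_posted"])

# With a datetime column, the .dt accessor is available.
months = sample["date_posted"].dt.month.tolist()
years = sample["date_posted"].dt.year.tolist()
print(sample.dtypes)
print(months, years)
```

The same `parse_dates` argument could presumably be passed to the `pd.read_csv` calls above, since it is independent of where the file object comes from.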

### Feature Engineering

Define a preprocess function for feature engineering.

In [6]:
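def preprocess(df):
    """Return (standardized feature matrix, labels, unscaled features DataFrame)."""
    # work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # make sure the posting date is a datetime before using `.dt`
    # (a no-op if the column has already been parsed)
    df['date_posted'] = pd.to_datetime(df['date_posted'])

    # create new features: posting month/year and character lengths of text fields
    df['month_of_year'] = pd.Categorical(df.date_posted.dt.month)
    df['year'] = pd.Categorical(df.date_posted.dt.year)
    df['need_statement_length'] = df.need_statement.str.len()
    df['short_description_length'] = df.short_description.str.len()
    df['title_length'] = df.title.str.len()

    # one-hot encode the multi-level categorical variables
    multi_level_cat = ['teacher_prefix', 'primary_focus_subject', 'resource_type', 'poverty_level', 'month_of_year',
                       'year', 'grade_level']
    dummies = pd.get_dummies(df[multi_level_cat])

    # extract the target, then drop it along with the text fields, the
    # high-cardinality variables and the columns that were one-hot encoded
    labels = df.fully_funded.values
    to_drop = ['school_state', 'date_posted', 'fully_funded', 'title', 'short_description', 'need_statement', 'essay',
               'primary_focus_area'] + multi_level_cat
    features = pd.concat([df.drop(to_drop, axis=1), dummies], axis=1)

    # standardize; the scaler is fit on whichever DataFrame is passed in
    X = preprocessing.StandardScaler().fit_transform(features)
    return X, np.array(labels), features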
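
Note that `.str.len()` returns the number of characters in each string, not the number of words. If word counts are preferred, one possible sketch is to split on whitespace first. The sample titles below are made up for illustration.

```python
import pandas as pd

# Made-up titles, including a missing value.
titles = pd.Series(["Books for our class", "Help!", None])

# Characters per title (what `.str.len()` computes).
char_length = titles.str.len()

# Whitespace-separated tokens per title; missing titles stay missing.
word_count = titles.str.split().str.len()

print(pd.DataFrame({"title": titles, "chars": char_length, "words": word_count}))
```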

Perform feature engineering. Note that `X_train`, `y_train`, `X_test` and `y_test` are NumPy ndarrays, while `train_features` and `test_features` are pandas DataFrames.

In [7]:
X_train, y_train, train_features = preprocess(train)
X_test, y_test, test_features = preprocess(test)
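
Because `preprocess` is called separately on each set, the one-hot columns and the scaling statistics are derived from each set on its own. A common alternative is to transform the test set with the training set's columns and statistics. The sketch below assumes that is the desired behavior: it aligns the dummy columns with `reindex` and fits the scaler on the training features only. The helper name `align_and_scale` and the toy frames are hypothetical, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def align_and_scale(train_features, test_features):
    """Scale both sets using the training columns and training statistics."""
    # Give the test set exactly the training columns: categories missing from
    # the test set become all-zero columns, test-only columns are dropped.
    test_aligned = test_features.reindex(columns=train_features.columns,
                                         fill_value=0.0)

    # Fit on the training features only, then apply to both sets.
    scaler = StandardScaler().fit(train_features)
    return scaler.transform(train_features), scaler.transform(test_aligned)


# Made-up toy data: the test set lacks the "rural" category.
toy_train = pd.get_dummies(
    pd.DataFrame({"price": [10.0, 20.0, 30.0],
                  "metro": ["urban", "rural", "urban"]}),
    dtype=float,
)
toy_test = pd.get_dummies(
    pd.DataFrame({"price": [20.0, 40.0],
                  "metro": ["urban", "urban"]}),
    dtype=float,
)

X_toy_train, X_toy_test = align_and_scale(toy_train, toy_test)
print(X_toy_train.shape, X_toy_test.shape)
```

With this approach both matrices have the same number of columns in the same order, which most estimators require at prediction time.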

### Data Outputs Storage

Now save all data sets as `.csv` files and upload them back to my IBM Cloud Object Storage bucket.

In [13]:
project.save_data(data=pd.DataFrame(X_train).to_csv(index=False),
                  file_name='Donation-Projects-Outcome-Prediction.data.X_train.csv', overwrite=True)

{'asset_id': '945f71e4-c723-4cbd-a71c-381eaadb0ba1',
 'bucket_name': 'donationprojectsoutcomeprediction-donotdelete-pr-felyzh04iugf9l',
 'file_name': 'Donation-Projects-Outcome-Prediction.data.X_train.csv',
 'message': 'File Donation-Projects-Outcome-Prediction.data.X_train.csv has been written successfully to the associated OS'}

In [14]:
project.save_data(data=pd.DataFrame(y_train).to_csv(index=False),
                  file_name='Donation-Projects-Outcome-Prediction.data.y_train.csv', overwrite=True)

{'asset_id': '977cdb81-5861-4ef9-a937-190339b4f8fb',
 'bucket_name': 'donationprojectsoutcomeprediction-donotdelete-pr-felyzh04iugf9l',
 'file_name': 'Donation-Projects-Outcome-Prediction.data.y_train.csv',
 'message': 'File Donation-Projects-Outcome-Prediction.data.y_train.csv has been written successfully to the associated OS'}

In [16]:
project.save_data(data=pd.DataFrame(X_test).to_csv(index=False),
                  file_name='Donation-Projects-Outcome-Prediction.data.X_test.csv', overwrite=True)

{'asset_id': '4cd74b0a-1df2-4f6a-914c-6f095c4a16a1',
 'bucket_name': 'donationprojectsoutcomeprediction-donotdelete-pr-felyzh04iugf9l',
 'file_name': 'Donation-Projects-Outcome-Prediction.data.X_test.csv',
 'message': 'File Donation-Projects-Outcome-Prediction.data.X_test.csv has been written successfully to the associated OS'}

In [15]:
project.save_data(data=pd.DataFrame(y_test).to_csv(index=False),
                  file_name='Donation-Projects-Outcome-Prediction.data.y_test.csv', overwrite=True)

{'asset_id': '71dda0d1-cf32-4124-8170-64c425d52820',
 'bucket_name': 'donationprojectsoutcomeprediction-donotdelete-pr-felyzh04iugf9l',
 'file_name': 'Donation-Projects-Outcome-Prediction.data.y_test.csv',
 'message': 'File Donation-Projects-Outcome-Prediction.data.y_test.csv has been written successfully to the associated OS'}

In [18]:
project.save_data(data=train_features.to_csv(index=False),
                  file_name='Donation-Projects-Outcome-Prediction.data.train_features.csv', overwrite=True)

{'asset_id': '4aaa220a-3d1e-48fe-a563-0410d08290f2',
 'bucket_name': 'donationprojectsoutcomeprediction-donotdelete-pr-felyzh04iugf9l',
 'file_name': 'Donation-Projects-Outcome-Prediction.data.train_features.csv',
 'message': 'File Donation-Projects-Outcome-Prediction.data.train_features.csv has been written successfully to the associated OS'}
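
In the cells above, `project.save_data` receives the CSV text produced by `to_csv(index=False)`. Wrapping an ndarray in `pd.DataFrame` gives it integer column labels, which are written as a header row, so whoever reads `X_train.csv` later can recover the matrix by taking `.values` of the parsed frame. A small sketch of that round trip, using a made-up array and an in-memory buffer in place of the storage bucket:

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for X_train.
X_demo = np.array([[0.5, -1.25], [1.5, 2.0]])

# Same conversion as in the save cells: ndarray -> DataFrame -> CSV text.
csv_text = pd.DataFrame(X_demo).to_csv(index=False)
header = csv_text.splitlines()[0]  # the integer column labels

# Reading the text back recovers the original matrix.
X_back = pd.read_csv(io.StringIO(csv_text)).values
print(header)
print(np.allclose(X_demo, X_back))
```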

Check whether the output data files are successfully uploaded.

In [19]:
project.get_files()

[{'asset_id': '945f71e4-c723-4cbd-a71c-381eaadb0ba1',
  'name': 'Donation-Projects-Outcome-Prediction.data.X_train.csv'},
 {'asset_id': '7a2d8b2c-65c5-4258-8605-b95653bd30c5',
  'name': 'Donation-Projects-Outcome-Prediction.data.test.csv'},
 {'asset_id': '71dda0d1-cf32-4124-8170-64c425d52820',
  'name': 'Donation-Projects-Outcome-Prediction.data.y_test.csv'},
 {'asset_id': '4aaa220a-3d1e-48fe-a563-0410d08290f2',
  'name': 'Donation-Projects-Outcome-Prediction.data.train_features.csv'},
 {'asset_id': 'f5450447-79fb-4b53-b646-5fbb9a220a8f',
  'name': 'Donation-Projects-Outcome-Prediction.data.train.csv'},
 {'asset_id': '977cdb81-5861-4ef9-a937-190339b4f8fb',
  'name': 'Donation-Projects-Outcome-Prediction.data.y_train.csv'},
 {'asset_id': '4cd74b0a-1df2-4f6a-914c-6f095c4a16a1',
  'name': 'Donation-Projects-Outcome-Prediction.data.X_test.csv'},
 {'asset_id': 'b6920713-693a-454e-855f-a24c95efd8ce',
  'name': 'Donation-Projects-Outcome-Prediction.data.projects.csv'},
 {'asset_id': '9b1c8961

Now that our data is stored properly, the feature engineering for this project is done.

**Cheers!**