# Project Recommendations for DonorsChoose

This notebook explores different approaches to helping [DonorsChoose](https://www.donorschoose.org/) recommend the right project to right user.
So far we have tried the following methods:
- Content-based recommendations using tfidf
- ...

Other relevant methods:
- Topic models using LDA
- ...

In [34]:
import numpy as np
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer

### Load and trim data

In [40]:
# Test flag for faster exploration
test_mode = True

projects = pd.read_csv(os.path.join(os.getcwd(), "data", "Projects.csv"), error_bad_lines=False, warn_bad_lines=False,
                      parse_dates=["Project Posted Date", "Project Fully Funded Date"])

if test_mode:
    projects = projects.head(10000)
    
# Remove rows not containing a valid ID (i.e. not a 32 chars ID)
projects = project.loc[projects.loc[:,"Project ID"].str.len() == 32,:]

NameError: name 'project' is not defined

In [30]:
projects.shape

(9892, 16)

In [32]:
projects.loc[:,"Project ID"].nunique()

9892

In [67]:
projects.head()

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date
0,77b7d3f2ac4e32d538914e4a8cb8a525,c2d5cb0a29a62e72cdccee939f434181,59f7d2c62f7e76a99d31db6f62b7b67c,2,Teacher-Led,anti-bullying begins with me,do you remember your favorite classroom from e...,"Applied Learning, Literacy & Language","Character Education, Literacy",Grades PreK-2,Books,$490.38,2013-01-01,Fully Funded,2013-03-12
1,fd928b7f6386366a9cad2bea40df4b25,8acbb544c9215b25c71a0c655200baea,8fbd92394e20d647ddcdc6085ce1604b,1,Teacher-Led,ukuleles for middle schoolers,what sound is happier than a ukulele? we have...,Music & The Arts,Music,Grades 6-8,Supplies,$420.61,2013-01-01,Expired,
2,7c915e8e1d27f10a94abd689e99c336f,0ae85ea7c7acc41cffa9f81dc61d46df,9140ac16d2e6cee45bd50b0b2ce8cd04,2,Teacher-Led,"big books, flip books, and everything in between","my 1st graders may be small, but they have big...","Special Needs""","Literacy, Special Needs",Grades PreK-2,Books,$510.46,2013-01-01,Fully Funded,2013-01-07
3,feeec44c2a3a3d9a31c137a9780d2521,deddcdb20f86599cefa5e7eb31da309b,63750e765b46f9fa4d71e780047e896e,1,Teacher-Led,a little for a lot,our students were so excited to come to school...,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Grades 3-5,Supplies,$282.80,2013-01-01,Fully Funded,2013-05-29
4,037719bf60853f234610458a210f45a9,f3f0dc60ba3026944eeffe8b76cb8d9c,0d5b4cc12b2eb00013460d0ac38ce2a2,1,Teacher-Led,technology in the classroom,were you ever that kid that struggled to stay ...,"Literacy & Language, Math & Science","Literacy, Mathematics",Grades PreK-2,Technology,$555.28,2013-01-01,Fully Funded,2013-02-14


### Preprocess text data

In [54]:
projects.dtypes

Project ID                          object
School ID                           object
Teacher ID                          object
Teacher Project Posted Sequence     object
Project Type                        object
Project Title                       object
Project Essay                       object
Project Subject Category Tree       object
Project Subject Subcategory Tree    object
Project Grade Level Category        object
Project Resource Category           object
Project Cost                        object
Project Posted Date                 object
Project Current Status              object
Project Fully Funded Date           object
dtype: object

In [65]:
text_features = ["Project Title", "Project Essay"]

# Preprocess text columns to make them comparable
for feature in text_features:
    projects.loc[:,feature] = projects.loc[:,feature].astype(str).fillna("")
    projects.loc[:,feature] = projects.loc[:,feature].str.lower()
    
text = projects["Project Title"] + " " + projects["Project Essay"] # This is a pandas series containing title and essay text

vectorizer = TfidfVectorizer(strip_accents="unicode",
                            analyzer="word",
                            stop_words="english",
                            max_df=0.9)

tfidf_matrix = vectorizer.fit_transform(text)
tfidf_feature_names = vectorizer.get_feature_names()

In [66]:
tfidf_matrix

<10000x22984 sparse matrix of type '<class 'numpy.float64'>'
	with 910659 stored elements in Compressed Sparse Row format>

In [62]:
type(text)

pandas.core.series.Series

### Topic Modeling