# Project Recommendations for DonorsChoose

This notebook explores different approaches to helping [DonorsChoose](https://www.donorschoose.org/) recommend the right project to the right user.
So far we have tried the following methods:
- Content-based recommendations using tfidf
    - Vectorize project descriptions with tfidf
    - Calculate document distances (e.g. cosine similarity)
    - Explore document similarity
    - Create recommendations based on similarity
    - Create an evaluation metric to test whether similarity is a good predictor for recommendations
        - E.g. for users with more than one donation, omit the last donation, rank projects by similarity to the remaining donations, and check the rank of the omitted (actually donated) project
        - Compare the score against baselines: a popularity "algorithm", "recommending distinct projects", or "random projects"
- ...
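
The evaluation idea from the list above can be sketched as follows. This is a minimal illustration, not the notebook's final metric: the toy vectors and donation histories are made up, scoring a candidate by its maximum similarity to any earlier donation is an assumption, and so is summarizing the ranks as a mean reciprocal rank.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def held_out_rank(history, project_vectors):
    """Hide the last donation, rank all unseen projects by similarity to the
    remaining donations, and return the 1-based rank of the hidden project."""
    *seen, held_out = history
    scores = {}
    for pid, vec in project_vectors.items():
        if pid in seen:
            continue  # do not recommend projects the donor already supported
        # Assumption: a candidate's score is its best similarity to any earlier donation
        scores[pid] = max(cosine(vec, project_vectors[s]) for s in seen)
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking.index(held_out) + 1

# Hypothetical toy data: 4 projects described by 3 made-up tfidf features
project_vectors = {
    "p1": [1.0, 0.0, 0.0],
    "p2": [0.9, 0.1, 0.0],
    "p3": [0.0, 1.0, 0.0],
    "p4": [0.0, 0.5, 1.0],
}
# Donation histories in chronological order (only donors with more than one donation)
histories = {"donor_a": ["p1", "p2"], "donor_b": ["p3", "p2"]}

ranks = {donor: held_out_rank(h, project_vectors) for donor, h in histories.items()}
mean_reciprocal_rank = sum(1 / r for r in ranks.values()) / len(ranks)
print(ranks, mean_reciprocal_rank)
```

The same `held_out_rank` function could be fed rankings from a popularity or random baseline to compare against the similarity-based score.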

Other relevant methods:
- Topic models using LDA
- Tag generation (automated tagging)

### Links

- Nice tfidf helper code: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
- Our content-based recommender roughly follows this kernel: https://www.kaggle.com/ranliu/donor-project-matching-with-recommender-systems/code (a more extensive example, written for a different challenge: https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101/code)

# Code

## Prerequisites
To install spacy and download the English language model, run `conda install -c conda-forge spacy` and `python -m spacy download en`.

In [52]:
import numpy as np
import pandas as pd
import os
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

### Load and trim data

In [4]:
# Test flag for faster exploration
test_mode = True

# on_bad_lines="skip" requires pandas >= 1.3; older versions used error_bad_lines=False, warn_bad_lines=False
projects = pd.read_csv(os.path.join(os.getcwd(), "data", "Projects.csv"), on_bad_lines="skip",
                      parse_dates=["Project Posted Date", "Project Fully Funded Date"])

if test_mode:
    projects = projects.head(10000)
    
# Remove rows not containing a valid ID (i.e. not a 32 chars ID)
projects = projects.loc[projects.loc[:,"Project ID"].str.len() == 32,:]

In [5]:
projects.shape

(10000, 15)

In [6]:
projects.loc[:,"Project ID"].nunique()

10000

In [7]:
projects.head()

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date
0,77b7d3f2ac4e32d538914e4a8cb8a525,c2d5cb0a29a62e72cdccee939f434181,59f7d2c62f7e76a99d31db6f62b7b67c,2,Teacher-Led,Anti-Bullying Begins with Me,do you remember your favorite classroom from e...,"Applied Learning, Literacy & Language","Character Education, Literacy",Grades PreK-2,Books,$490.38,2013-01-01,Fully Funded,2013-03-12
1,fd928b7f6386366a9cad2bea40df4b25,8acbb544c9215b25c71a0c655200baea,8fbd92394e20d647ddcdc6085ce1604b,1,Teacher-Led,Ukuleles For Middle Schoolers,what sound is happier than a ukulele? we have...,Music & The Arts,Music,Grades 6-8,Supplies,$420.61,2013-01-01,Expired,NaT
2,7c915e8e1d27f10a94abd689e99c336f,0ae85ea7c7acc41cffa9f81dc61d46df,9140ac16d2e6cee45bd50b0b2ce8cd04,2,Teacher-Led,"Big Books, Flip Books, And Everything In Between","my 1st graders may be small, but they have big...","Literacy & Language, Special Needs","Literacy, Special Needs",Grades PreK-2,Books,$510.46,2013-01-01,Fully Funded,2013-01-07
3,feeec44c2a3a3d9a31c137a9780d2521,deddcdb20f86599cefa5e7eb31da309b,63750e765b46f9fa4d71e780047e896e,1,Teacher-Led,A Little for a Lot,our students were so excited to come to school...,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Grades 3-5,Supplies,$282.80,2013-01-01,Fully Funded,2013-05-29
4,037719bf60853f234610458a210f45a9,f3f0dc60ba3026944eeffe8b76cb8d9c,0d5b4cc12b2eb00013460d0ac38ce2a2,1,Teacher-Led,Technology in the Classroom,were you ever that kid that struggled to stay ...,"Literacy & Language, Math & Science","Literacy, Mathematics",Grades PreK-2,Technology,$555.28,2013-01-01,Fully Funded,2013-02-14


### Preprocess text data

In [8]:
projects.dtypes

Project ID                                  object
School ID                                   object
Teacher ID                                  object
Teacher Project Posted Sequence              int64
Project Type                                object
Project Title                               object
Project Essay                               object
Project Subject Category Tree               object
Project Subject Subcategory Tree            object
Project Grade Level Category                object
Project Resource Category                   object
Project Cost                                object
Project Posted Date                 datetime64[ns]
Project Current Status                      object
Project Fully Funded Date           datetime64[ns]
dtype: object

In [9]:
text_features = ["Project Title", "Project Essay"]

# Preprocess text columns to make them comparable
for feature in text_features:
    # Fill missing values before casting; astype(str) would otherwise turn NaN into the string "nan"
    projects.loc[:,feature] = projects.loc[:,feature].fillna("").astype(str).str.lower()
    
# pandas Series containing the title and essay text of each project
text = projects["Project Title"] + " " + projects["Project Essay"]

vectorizer = TfidfVectorizer(strip_accents="unicode",
                            analyzer="word",
                            stop_words="english",
                            max_df=0.9)

tfidf_matrix = vectorizer.fit_transform(text)
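# get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names()
tfidf_feature_names = vectorizer.get_feature_names_out()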
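
As a next step, document similarity can be derived from the tfidf matrix. The sketch below is an illustration under assumptions: it uses a small made-up corpus instead of the project data, and `toy_text`, `toy_vectorizer`, and `most_similar` are hypothetical names. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus standing in for the combined title + essay text
toy_text = [
    "ukuleles for our middle school music class",
    "new guitars and ukuleles for music lessons",
    "books for first grade reading corner",
    "tablets to bring technology into the classroom",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_text)  # rows are L2-normalized by default

# For L2-normalized rows the dot product equals the cosine similarity
similarity = linear_kernel(toy_matrix, toy_matrix)

def most_similar(doc_index, similarity, top_n=1):
    """Return the indices of the top_n most similar documents, excluding the document itself."""
    order = np.argsort(-similarity[doc_index])
    return [int(i) for i in order if i != doc_index][:top_n]

print(most_similar(0, similarity))
```

Note that the full pairwise matrix is dense and grows quadratically with the number of documents, so for the complete project data it may be preferable to compute similarities one row (or one batch of rows) at a time.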

# Unrelated Stuff / Testing



## Understanding tfidf with small examples

In [61]:
# Good intro with nice code: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
def wm2df(wm, feat_names):
    """Create a dataframe from a (sparse) word matrix, one row per document."""
    # Create an index label for each row
    doc_names = ['Doc{:d}'.format(idx) for idx in range(wm.shape[0])]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df

In [64]:

test_series = pd.Series(["I write a sentence about school's computer ", "I've written a sentence about school furniture", "I am writing a sentence about school computer periphery"])

# Create spacy tokenizer
# The loaded model is not assigned to a variable; tokens and lemmas come from the blank English() pipeline below
spacy.load("en")
lemmatizer = spacy.lang.en.English()
def spacy_tokenizer(doc):
    tokens = lemmatizer(doc)
    return [token.lemma_ for token in tokens]
# Vectorize with custom spacy token lemmatizer, e.g. written, writing --> write
# smooth_idf (disabled here) adds one to every document frequency to prevent division by zero
test_tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer, analyzer="word", stop_words="english", smooth_idf=False)
test_fitted = test_tfidf.fit(test_series)
test_transformed = test_fitted.transform(test_series)

idf = test_tfidf.idf_

print("> The vocabulary doesn't contain 'written' and 'writing' as separate tokens; both are lemmatized to 'write'. '-PRON-' is a special token for pronouns")
print(test_tfidf.vocabulary_)
print("\n> 'Furniture' and 'periphery' have the highest idf weights because each appears in only one document. 'Computer' appears in two out of three documents")
# Zip groups the elements from first object with elements from second object by index
# get_feature_names_out() requires scikit-learn >= 1.0; older versions used get_feature_names()
print(dict(zip(test_fitted.get_feature_names_out(), idf)))
print("\n> The tf-idf matrix")
print(wm2df(test_transformed, test_tfidf.get_feature_names_out()))

> The vocabulary doesn't contain 'written' and 'writing' as separate tokens; both are lemmatized to 'write'. '-PRON-' is a special token for pronouns
{'write': 6, 'sentence': 5, 'school': 4, 'computer': 1, '-PRON-': 0, 'furniture': 2, 'periphery': 3}

> 'Furniture' and 'periphery' have the highest idf weights because each appears in only one document. 'Computer' appears in two out of three documents
{'-PRON-': 2.09861228866811, 'computer': 1.4054651081081644, 'furniture': 2.09861228866811, 'periphery': 2.09861228866811, 'school': 1.0, 'sentence': 1.0, 'write': 1.0}

> The tf-idf matrix
        -PRON-  computer  furniture  periphery    school  sentence     write
Doc0  0.000000  0.630099   0.000000   0.000000  0.448321  0.448321  0.448321
Doc1  0.610714  0.000000   0.610714   0.000000  0.291008  0.291008  0.291008
Doc2  0.000000  0.458913   0.000000   0.685239  0.326520  0.326520  0.326520
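
The numbers above can be reproduced by hand, which helps to see what the vectorizer is doing. The sketch below assumes scikit-learn's formula for `smooth_idf=False`, idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row; the document frequencies are read off the three example sentences, and the variable names are chosen for illustration only.

```python
import math

n_docs = 3
# Document frequencies read off the example: in how many documents each term occurs
doc_freq = {"computer": 2, "furniture": 1, "school": 3}

# With smooth_idf=False: idf(t) = ln(n / df(t)) + 1
idf = {term: math.log(n_docs / df) + 1 for term, df in doc_freq.items()}

# Doc0 contains computer, school, sentence and write once each (tf = 1),
# so its unnormalized weights are just the idf values
doc0 = {"computer": idf["computer"], "school": 1.0, "sentence": 1.0, "write": 1.0}
l2_norm = math.sqrt(sum(w * w for w in doc0.values()))
doc0_normalized = {term: w / l2_norm for term, w in doc0.items()}

print({t: round(v, 6) for t, v in idf.items()})
print({t: round(v, 6) for t, v in doc0_normalized.items()})
```

The resulting values match the `Doc0` row of the matrix: 'computer' gets the largest weight because it is the rarest of the four terms in that document.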
