# Project Recommendations for DonorsChoose

This notebook explores different approaches to helping [DonorsChoose](https://www.donorschoose.org/) recommend the right project to the right user.
So far we have tried the following methods:
- Content-based recommendations using tfidf
    - Vectorize project descriptions with tfidf
    - Calculate document distances (e.g. cosine similarity)
    - Explore document similarity
    - Create recommendations based on similarity
    - Create an evaluation metric to test whether similarity is a good predictor for recommendations
        - E.g. for users with #donations > 1, omit the last donation, rank projects by similarity to the remaining donations, and check the rank of the actually donated (omitted) project
        - Compare the score against the performance of a popularity "algorithm" and of "recommending distinct projects" or "random projects"
- ...
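
The leave-last-out evaluation described in the list above could be sketched roughly as follows. This is only a minimal illustration under several assumptions that are not part of the notebook: `similarity` is a dense project-by-project similarity matrix, `donations` maps each donor to a chronologically ordered list of project row indices, candidates are scored by their mean similarity to the donor's earlier projects, and mean reciprocal rank is used as the summary score. The names `held_out_rank` and `mean_reciprocal_rank` are hypothetical.

```python
import numpy as np

def held_out_rank(similarity, history, held_out):
    """Rank (1 = best) of the held-out project among projects not in `history`."""
    # Assumption: score every project by its mean similarity to the donor's history
    scores = similarity[history].mean(axis=0)
    # Projects the donor already supported are not recommended again
    scores[history] = -np.inf
    # Rank = 1 + number of projects that score strictly higher than the held-out one
    return int(1 + np.sum(scores > scores[held_out]))

def mean_reciprocal_rank(similarity, donations):
    """Average 1/rank of the omitted last donation over donors with more than one donation."""
    reciprocal_ranks = []
    for project_indices in donations.values():
        if len(project_indices) < 2:
            continue  # nothing to hold out for single-donation donors
        *history, held_out = project_indices
        reciprocal_ranks.append(1.0 / held_out_rank(similarity, history, held_out))
    return float(np.mean(reciprocal_ranks)) if reciprocal_ranks else float("nan")

# Toy data: 4 projects and 3 donors (purely illustrative values)
toy_similarity = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
toy_donations = {"donor_a": [0, 1], "donor_b": [2, 3, 0], "donor_c": [1]}
toy_score = mean_reciprocal_rank(toy_similarity, toy_donations)
```

The same function could be fed scores from a popularity or random baseline instead of the similarity matrix to make the comparison mentioned above.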

Other relevant methods:
- Topic models using LDA
- Tag generation (automated tagging)

### Links

- Nice tfidf helper code: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
- We started doing approximately the content based recommender from here: https://www.kaggle.com/ranliu/donor-project-matching-with-recommender-systems/code (with more code but different challenge here: https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101/code)

## ToDo
- Clean projects.csv (see, e.g. https://www.kaggle.com/madaha/cleaning-projects-csv-file for necessary steps).

# Code

## Prerequisites
To install spacy and download the English language model, run `conda install -c conda-forge spacy` and `python -m spacy download en`.

In [6]:
import numpy as np
import pandas as pd
import os
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

### Load and trim data

In [2]:
# Test flag for faster exploration
test_mode = True

projects = pd.read_csv(os.path.join(os.getcwd(), "data", "Projects.csv"), error_bad_lines=False, warn_bad_lines=False,
                      parse_dates=["Project Posted Date", "Project Fully Funded Date"])

if test_mode:
    projects = projects.head(10000)
    
# Remove rows not containing a valid ID (i.e. not a 32 chars ID)
projects = projects.loc[projects.loc[:,"Project ID"].str.len() == 32,:]

In [3]:
projects.loc[:,"Project ID"].nunique()

10000

In [4]:
projects.head()

Unnamed: 0,Project ID,School ID,Teacher ID,Teacher Project Posted Sequence,Project Type,Project Title,Project Essay,Project Subject Category Tree,Project Subject Subcategory Tree,Project Grade Level Category,Project Resource Category,Project Cost,Project Posted Date,Project Current Status,Project Fully Funded Date
0,77b7d3f2ac4e32d538914e4a8cb8a525,c2d5cb0a29a62e72cdccee939f434181,59f7d2c62f7e76a99d31db6f62b7b67c,2,Teacher-Led,Anti-Bullying Begins with Me,do you remember your favorite classroom from e...,"Applied Learning, Literacy & Language","Character Education, Literacy",Grades PreK-2,Books,$490.38,2013-01-01,Fully Funded,2013-03-12
1,fd928b7f6386366a9cad2bea40df4b25,8acbb544c9215b25c71a0c655200baea,8fbd92394e20d647ddcdc6085ce1604b,1,Teacher-Led,Ukuleles For Middle Schoolers,what sound is happier than a ukulele? we have...,Music & The Arts,Music,Grades 6-8,Supplies,$420.61,2013-01-01,Expired,NaT
2,7c915e8e1d27f10a94abd689e99c336f,0ae85ea7c7acc41cffa9f81dc61d46df,9140ac16d2e6cee45bd50b0b2ce8cd04,2,Teacher-Led,"Big Books, Flip Books, And Everything In Between","my 1st graders may be small, but they have big...","Literacy & Language, Special Needs","Literacy, Special Needs",Grades PreK-2,Books,$510.46,2013-01-01,Fully Funded,2013-01-07
3,feeec44c2a3a3d9a31c137a9780d2521,deddcdb20f86599cefa5e7eb31da309b,63750e765b46f9fa4d71e780047e896e,1,Teacher-Led,A Little for a Lot,our students were so excited to come to school...,"Math & Science, Literacy & Language","Applied Sciences, Literature & Writing",Grades 3-5,Supplies,$282.80,2013-01-01,Fully Funded,2013-05-29
4,037719bf60853f234610458a210f45a9,f3f0dc60ba3026944eeffe8b76cb8d9c,0d5b4cc12b2eb00013460d0ac38ce2a2,1,Teacher-Led,Technology in the Classroom,were you ever that kid that struggled to stay ...,"Literacy & Language, Math & Science","Literacy, Mathematics",Grades PreK-2,Technology,$555.28,2013-01-01,Fully Funded,2013-02-14


### Preprocess text data

In [34]:
text_features = ["Project Title", "Project Essay"]

# Preprocess text columns to make them comparable
for feature in text_features:
    # Fill missing values before casting; astype(str) would otherwise turn NaN into the string "nan"
    projects.loc[:,feature] = projects.loc[:,feature].fillna("").astype(str)
    projects.loc[:,feature] = projects.loc[:,feature].str.lower()
    
text = projects["Project Title"] + " " + projects["Project Essay"] # This is a pandas series containing title and essay text

# Create spacy tokenizer
spacy.load("en")
lemmatizer = spacy.lang.en.English()
def spacy_tokenizer(doc):
    tokens = lemmatizer(doc)
    return([token.lemma_ for token in tokens])

# Vectorize with custom spacy token lemmatizer, e.g. written, writing --> write
vectorizer = TfidfVectorizer(strip_accents="unicode",
                            tokenizer=spacy_tokenizer,
                            analyzer="word",
                            stop_words="english",
                            max_df=0.9,
                            norm="l2")


project_ids = projects['Project ID'].tolist()
tfidf_matrix = vectorizer.fit_transform(text)
tfidf_feature_names = vectorizer.get_feature_names()

To get the most similar projects to a specific project of interest, I calculate the cosine similarity between that project and all others and return the ones with the highest cosine similarity.

In [57]:
def similar_cosines(tfidf_matrix, index, top_n = 5):
    '''
    tfidf_matrix, index of the document of interest -> list of tuples (index, cosine similarity)
    '''
    # Since the vectors have already been l2-normalized by the TfidfVectorizer, a simple dot product suffices
    # to calculate the cosine similarity. Slice with index:index + 1 to conserve the 2-D shape of the row.
    cosine_similarities = linear_kernel(tfidf_matrix[index:index + 1], tfidf_matrix).flatten()
    # Get indices of the documents with the highest cosine similarity, excluding the document itself.
    # Note: argsort orders all indices even if only top_n are needed; a partial sort would be more efficient.
    related_docs_indices = [idx for idx in cosine_similarities.argsort()[::-1] if idx != index]
    return [(idx, cosine_similarities[idx]) for idx in related_docs_indices][0:top_n]

In [59]:
# Top-5 most similar projects to the project with index 1
similar_cosines(tfidf_matrix, 1, top_n=5)

[(4860, 0.4071512061637883),
 (7615, 0.3738961659694673),
 (9132, 0.36502767777877104),
 (6342, 0.36367576190672746),
 (3536, 0.34230283439174436)]
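
The comment in `similar_cosines` notes that sorting all indices is wasteful when only a few results are needed. One possible alternative, sketched here under the assumption that the similarities are already available as a 1-D NumPy array, is a partial sort with `np.argpartition`. The function name `top_n_similar` is hypothetical and not part of the notebook.

```python
import numpy as np

def top_n_similar(similarities, index, top_n=5):
    """Return (index, similarity) pairs of the top_n most similar documents, excluding `index`."""
    sims = np.asarray(similarities, dtype=float).copy()
    sims[index] = -np.inf  # make sure the query document is never returned
    top_n = min(top_n, sims.size - 1)
    if top_n <= 0:
        return []
    # argpartition moves the top_n largest values to the end without fully sorting the array
    candidates = np.argpartition(sims, -top_n)[-top_n:]
    # Only these few candidates are sorted, in descending order of similarity
    ordered = candidates[np.argsort(sims[candidates])[::-1]]
    return [(int(i), float(sims[i])) for i in ordered]

# Toy similarity vector for a query document at position 1 (illustrative values)
example = top_n_similar(np.array([0.1, 1.0, 0.5, 0.9, 0.3]), index=1, top_n=2)
```

Whether this is measurably faster depends on the number of documents; for the 10,000-row test subset the difference is likely small.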

To better understand the results, I extract the top words (or rather tokens or features) for a document.

In [76]:
def top_tfidf_features(tfidf_matrix, index, feature_names, top_n=5):
    row = tfidf_matrix[index].toarray().flatten()
    top_features = [(feature_names[idx], row[idx]) for idx in row.argsort()[::-1][:top_n]]
    return top_features
    

In [77]:
top_tfidf_features(tfidf_matrix, 1, tfidf_feature_names, top_n=5)

[('ukuleles', 0.31518235997016336),
 ('ukulele', 0.31518235997016336),
 ('junior', 0.2244233099572649),
 ('campus', 0.20585001857645263),
 ('tenure', 0.18586984294521866)]

Apparently the lemmatizer didn't reduce ukuleles to ukulele, so both appear as separate tokens/features. In general, the data needs much more cleaning, e.g. removing all the special character sequences such as `\r\n` and doing something with all the `-donotremoveessaydivider-->"it',` stuff.
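
A first cleaning pass could look roughly like the sketch below. It rests on assumptions: the divider is taken to appear in the raw essays as an HTML-style comment of the form `<!--DONOTREMOVEESSAYDIVIDER-->` (inferred from the fragment above, not verified against the full data), and replacing it and every whitespace run with a single space is assumed to be acceptable. The name `clean_essay` is hypothetical.

```python
import re

# Assumption: the essay divider is an HTML-style comment; matched case-insensitively
# because the text columns are lower-cased earlier in the notebook.
ESSAY_DIVIDER = re.compile(r"<!--\s*donotremoveessaydivider\s*-->", flags=re.IGNORECASE)
# Any run of whitespace, including "\r\n" sequences and tabs
WHITESPACE_RUN = re.compile(r"\s+")

def clean_essay(text):
    """Replace the essay divider and all whitespace runs with single spaces."""
    text = ESSAY_DIVIDER.sub(" ", text)
    text = WHITESPACE_RUN.sub(" ", text)
    return text.strip()

cleaned = clean_essay("my students\r\n\r\n  love music<!--DONOTREMOVEESSAYDIVIDER-->they need\tukuleles ")
```

Applied to the combined text before vectorizing (for example via `text.map(clean_essay)`), this should remove the pure-whitespace tokens from the vocabulary, although other noise may remain.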

In [80]:
tfidf_feature_names[::]

['\r\n',
 '\r\n\t\t\t\t\t\t',
 '\r\n\r\n',
 '\r\n\r\n\r\n',
 '\r\n\r\n\r\n\r\n',
 '\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    ',
 '\r\n\r\n\r\n\r\n\r\n\r\n ',
 '\r\n\r\n\r\n\r\n\r\n ',
 '\r\n\r\n\r\n\r\n ',
 '\r\n\r\n\r\n ',
 '\r\n\r\n\r\n  ',
 '\r\n\r\n ',
 '\r\n\r\n  ',
 '\r\n\r\n   ',
 '\r\n\r\n    ',
 '\r\n ',
 '\r\n \r\n',
 '\r\n \r\n\r\n',
 '\r\n \r\n\r\n\r\n',
 '\r\n \r\n\r\n\r\n ',
 '\r\n \r\n\r\n ',
 '\r\n \r\n ',
 '\r\n  ',
 '\r\n  \r\n',
 '\r\n   ',
 '\r\n    ',
 '\r\n    \r\n',
 '\r\n     ',
 '\r\n     \r\n',
 '\r\n      ',
 '\r\n      \r\n',
 '\r\n       ',
 '\r\n        ',
 '\r\n                                  ',
 ' ',
 ' \r\n',
 ' \r\n\r\n',
 ' \r\n\r\n\r\n',
 ' \r\n\r\n\r\n ',
 ' \r\n\r\n ',
 ' \r\n\r\n    ',
 ' \r\n ',
 ' \r\n \r\n',
 ' \r\n \r\n ',
 ' \r\n  ',
 ' \r\n  \r\n',
 ' \r\n   \r\n',
 ' \r\n    ',
 ' \r\n     ',
 ' \r\n     \r\n',
 ' \r\n      ',
 ' \r\n        \r\n',
 ' \r\n         ',
 '  ',
 '  \r\n',
 '  \r\n\r\n',
 '  \r\n\r\n ',
 '  \r\n ',
 '  \r\n \r\n',
 

# Unrelated Stuff / Testing



## Understanding tfidf with small examples

In [6]:
# Good intro with nice code: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
# create a dataframe from a word matrix
def wm2df(wm, feat_names):
    
    # create an index for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return(df)

In [64]:

test_series = pd.Series(["I write a sentence about school's computer ", "I've written a sentence about school furniture", "I am writing a sentence about school computer periphery"])

# smooth_idf=True would add one to every document frequency to prevent division by zero; it is disabled below
# Create spacy tokenizer
spacy.load("en")
lemmatizer = spacy.lang.en.English()
def spacy_tokenizer(doc):
    tokens = lemmatizer(doc)
    return([token.lemma_ for token in tokens])
# Vectorize with custom spacy token lemmatizer, e.g. written, writing --> write
test_tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer, analyzer="word", stop_words="english", smooth_idf=False)
test_fitted = test_tfidf.fit(test_series)
test_transformed = test_fitted.transform(test_series)

idf = test_tfidf.idf_

print("> The vocabulary doesn't contain 'write' and 'written' as separate tokens. '-PRON-' is a special token for pronouns")
print(test_tfidf.vocabulary_)
print("\n> 'Furniture' and 'periphery' have the highest idf weights because they only appear in one document. 'Computer' appears in two out of three documents")
# Zip groups the elements from first object with elements from second object by index
print(dict(zip(test_fitted.get_feature_names(), idf)))
print("\n> The tf-idf matrix")
print(wm2df(test_transformed, test_tfidf.get_feature_names()))

> The vocabulary doesn't contain 'write' and 'written' as separate tokens. '-PRON-' is a special token for pronouns
{'write': 6, 'sentence': 5, 'school': 4, 'computer': 1, '-PRON-': 0, 'furniture': 2, 'periphery': 3}

> 'Furniture' and 'periphery' have the highest idf weights because they only appear in one document. 'Computer' appears in two out of three documents
{'-PRON-': 2.09861228866811, 'computer': 1.4054651081081644, 'furniture': 2.09861228866811, 'periphery': 2.09861228866811, 'school': 1.0, 'sentence': 1.0, 'write': 1.0}

> The tf-idf matrix
        -PRON-  computer  furniture  periphery    school  sentence     write
Doc0  0.000000  0.630099   0.000000   0.000000  0.448321  0.448321  0.448321
Doc1  0.610714  0.000000   0.610714   0.000000  0.291008  0.291008  0.291008
Doc2  0.000000  0.458913   0.000000   0.685239  0.326520  0.326520  0.326520
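
The numbers above can be checked by hand. With `smooth_idf=False`, scikit-learn computes the idf as `ln(n_docs / doc_freq) + 1`, and each row is then l2-normalized. The sketch below recomputes the idf weights and the `Doc0` row with the standard library only; the helper name `idf_unsmoothed` is hypothetical, and it assumes each token occurs once in `Doc0`, as the matrix suggests.

```python
import math

def idf_unsmoothed(n_docs, doc_freq):
    """idf as computed by TfidfVectorizer with smooth_idf=False: ln(n / df) + 1."""
    return math.log(n_docs / doc_freq) + 1.0

n_docs = 3
idf_rare = idf_unsmoothed(n_docs, 1)      # furniture, periphery, -PRON-: one document each
idf_computer = idf_unsmoothed(n_docs, 2)  # computer: two of three documents
idf_common = idf_unsmoothed(n_docs, 3)    # school, sentence, write: all documents

# Doc0 contains computer, school, sentence and write once each (tf = 1),
# so the unnormalized tf-idf weights equal the idf values.
doc0 = {"computer": idf_computer, "school": idf_common,
        "sentence": idf_common, "write": idf_common}
l2_norm = math.sqrt(sum(weight ** 2 for weight in doc0.values()))
doc0_normalized = {token: weight / l2_norm for token, weight in doc0.items()}
```

The resulting values should agree with the printed idf dictionary and with the `Doc0` row of the tf-idf matrix up to rounding.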


## Nice performance evaluation tools

In [56]:
def my_add():
    return list(projects.loc[:, "Teacher Project Posted Sequence"] + 100)

def iter_add():
    lst = []
    for row in projects.loc[:,"Teacher Project Posted Sequence"].values:
        val = row + 100
        lst.append(val)
    return  lst

In [59]:
%timeit my_add()
%timeit iter_add()

301 µs ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.67 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%reload_ext line_profiler

In [50]:
%lprun -f my_add my_add()


In [58]:
%lprun -f iter_add iter_add()