# Project Recommendations for DonorsChoose

This notebook explores different approaches to helping [DonorsChoose](https://www.donorschoose.org/) recommend the right project to the right user.
So far we have tried the following methods:
- Content-based recommendations using tfidf
    - tfidf project descriptions
    - Calculate document distances (e.g. cosine similarity)
    - Explore document similarity
    - Create recommendations based on similarity 
    - Create evaluation metric to test whether similarity is a good predictor for recommendations
        - E.g. for users with more than one donation, omit the last donation, rank the remaining projects by similarity to the earlier donations, and check where the actually donated (omitted) project lands in that ranking
        - Compare the score against the performance of a popularity "algorithm", "recommending distinct projects", or "random projects"
- ...
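
The evaluation idea in the list above could be sketched roughly as follows. This is only an illustration, not the notebook's implementation: `held_out_rank`, the toy similarity matrix, and the donation history are hypothetical, and scoring a candidate by its highest similarity to any earlier donation is just one possible choice.

```python
def held_out_rank(similarity, history):
    """Rank of the held-out (last) donation among projects not donated to before.

    similarity: square list of lists, similarity[i][j] between projects i and j
    history: project indices in donation order (at least two entries; assumes
             the last project does not also appear earlier in the history)
    """
    *seen, held_out = history
    candidates = [p for p in range(len(similarity)) if p not in seen]
    # Score each candidate by its highest similarity to any earlier donation.
    scores = {p: max(similarity[p][s] for s in seen) for p in candidates}
    ranking = sorted(candidates, key=lambda p: scores[p], reverse=True)
    return ranking.index(held_out) + 1  # 1 is the best possible rank


# Toy data (hypothetical): 4 projects; the donor gave to project 0, then project 2.
toy_similarity = [
    [1.0, 0.1, 0.8, 0.3],
    [0.1, 1.0, 0.2, 0.4],
    [0.8, 0.2, 1.0, 0.1],
    [0.3, 0.4, 0.1, 1.0],
]
rank = held_out_rank(toy_similarity, history=[0, 2])
print(rank)
```

Averaging this rank (or its reciprocal) over all donors with more than one donation would give a single score that can be compared against the popularity and random baselines.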

Other relevant methods:
- Topic models using LDA
- Tag generation (automated tagging)

### Links

- Nice tfidf helper code: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
- We started out by loosely following the content-based recommender from here: https://www.kaggle.com/ranliu/donor-project-matching-with-recommender-systems/code (more code, but for a different challenge, here: https://www.kaggle.com/gspmoreira/recommender-systems-in-python-101/code)

# Code

## Prerequisites
To install spaCy and download the English language model loaded below, run `conda install -c conda-forge spacy` and `python -m spacy download en_core_web_md`.

In [3]:
import numpy as np
import pandas as pd
import os
from spacy import displacy
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

nlp = spacy.load("en_core_web_md")


### Load and trim data

In [4]:
# Test flag for faster exploration
test_mode = True

# I am reading in the cleaned projects csv from https://www.kaggle.com/madaha/cleaning-projects-csv-file
projects = pd.read_csv(os.path.join(os.getcwd(), "data", "projects_cleaned.csv"),
                       parse_dates=["Project_Posted_Date", "Project_Fully_Funded_Date"])

if test_mode:
    projects = projects.head(100)

def clean_text_cols(df, text_features=("Project_Title", "Project_Essay")):
    """Fills missing values in the columns in text_features and casts them to str. Returns a dataframe"""
    for feature in text_features:
        # Fill NaN before casting: astype(str) would otherwise turn NaN into the string "nan"
        df.loc[:, feature] = df.loc[:, feature].fillna("").astype(str)
        #df.loc[:, feature] = df.loc[:, feature].str.lower()
    return df

projects = clean_text_cols(projects)

text = projects["Project_Title"] + " " + projects["Project_Essay"] # This is a pandas series containing title and essay text


In [9]:
projects.loc[1, "Project_Essay"]

"what sound is happier than a ukulele?  we have students who can't wait to spend their lunch period strumming away, learning to play via on-line tutorials and creating student-led music education in our small junior high. our students are diverse and our school is rural. our kids are also entrepreneurial!  we have had many student-initiated projects on campus, and this project itself was created by a student with a passion for ukulele who wants to teach and learn with his peers. music education is hard to come by and expensive in our small, rural town.  with instruments to play, our students will have the chance to learn the basics of music, in a social setting (a club that meets at lunch) while learning about pacific island culture. these 4 ukuleles will become a part of our little junior high campus and will remain a part of our curriculum beyond my own tenure.  we try to engender an environment that promotes student-led learning, and i am excited to see students seeking musical enri

In [41]:
docs = text.apply(lambda x: nlp(x))

In [42]:
displacy.render(docs[1], style='ent', jupyter=True)

### Tfidf Vectorize

In [7]:
# Create spaCy tokenizer that returns lemmas, e.g. written, writing --> write
lemmatizer = spacy.lang.en.English()
def spacy_tokenizer(doc):
    tokens = lemmatizer(doc)
    return [token.lemma_ for token in tokens]

# Vectorize with the custom spaCy lemmatizing tokenizer

vectorizer = TfidfVectorizer(strip_accents="unicode",
                            tokenizer=spacy_tokenizer,
                            analyzer="word",
                            stop_words="english",
                            max_df=0.9,
                            norm="l2")


project_ids = projects['Project_ID'].tolist()
tfidf_matrix = vectorizer.fit_transform(text)
tfidf_feature_names = vectorizer.get_feature_names()

To get the most similar projects to a specific project of interest, I calculate the cosine similarity between that project and all others and return the ones with the highest cosine similarity.
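
The cell below uses a plain dot product instead of a full cosine computation. That works because cosine similarity is the dot product divided by both vector lengths, and with `norm="l2"` every tf-idf row already has length 1. A small check with made-up vectors (not real tf-idf rows; the helper names are hypothetical), using only the standard library:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    length = math.sqrt(dot(vec, vec))
    return [x / length for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up term weights for two documents (illustrative only).
doc_a = [3.0, 0.0, 4.0]
doc_b = [1.0, 2.0, 2.0]

full = cosine(doc_a, doc_b)                                # divide by both lengths
shortcut = dot(l2_normalize(doc_a), l2_normalize(doc_b))   # normalize first, then dot
print(full, shortcut)  # both are 11/15, up to floating-point rounding
```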

In [8]:
def similar_cosines(tfidf_matrix, index, top_n=5):
    '''
    tfidf_matrix, index of the document of interest -> list of tuples (index, cosine similarity)
    '''
    # Since the vectors have already been l2-normalized in the TfidfVectorizer, a simple dot product suffices
    # to calculate the cosine similarity. Slicing with index:index + 1 keeps the row two-dimensional
    # (shape (1, n_features) instead of (n_features,)).
    cosine_similarities = linear_kernel(tfidf_matrix[index:index + 1], tfidf_matrix).flatten()
    # Get indices of all other documents, sorted by descending cosine similarity.
    related_docs_indices = [idx for idx in cosine_similarities.argsort()[::-1] if idx != index]
    return [(idx, cosine_similarities[idx]) for idx in related_docs_indices[:top_n]]

In [15]:
# Top-5 most similar projects to the project with index 1
similar_cosines(tfidf_matrix, 1, top_n=5)

[(88, 0.2565553027750274),
 (38, 0.17036385345475594),
 (25, 0.138652443261203),
 (90, 0.13727128990539694),
 (22, 0.12279547037657573)]

In [16]:
projects.loc[88, "Project_Essay"]

'have you ever tried to picture something new in your head without ever seeing it in person? the students in my school have been learning about instruments that kids play around the world, and now it is time for them to have the opportunity to play them! my students are extremely musical! they love to sing songs and play instruments, and by the time their 45-minute music class is over, they are moaning and groaning for more time to play. my students truly believe that music is a universal language that everyone in the world is able to speak. they are very interested in how children from all over the world make music. we have learned many songs from many countries, and my students have even been able to sing in other languages! the tactile aspect of being able to touch and play an instrument from a different part of the world is really amazing! since the students have learned about and had the opportunity to play instruments in the modern-day orchestra, this is a fantastic way for them 

To better understand the results, I extract the top words (or rather tokens or features) for a document.

In [27]:
def top_tfidf_features(tfidf_matrix, index, feature_names, top_n=5):
    row = tfidf_matrix[index].toarray().flatten()
    top_features = [(feature_names[idx], row[idx]) for idx in row.argsort()[::-1][:top_n]]
    return top_features
    

In [29]:
top_tfidf_features(tfidf_matrix, 1, tfidf_feature_names, top_n=5)

[('ukulele', 0.3670846855190913),
 ('ukuleles', 0.36668457197241855),
 ('music', 0.3625657958762883),
 ('play', 0.3232747433279028),
 (' ', 0.2676539798559057)]

Apparently the lemmatizer didn't reduce "ukuleles" to "ukulele", so both appear as separate tokens/features. In general, the data needs much more cleaning, e.g. removing all the special character sequences such as `\r\n`.
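
One possible cleaning step, sketched under assumptions: `normalize_whitespace` is a hypothetical helper, and it assumes the text contains real control characters (an actual carriage return and newline) rather than the literal two-character sequences backslash-r and backslash-n, which would need a separate pattern.

```python
import re

# Control characters (including \r, \n, \t) and any run of whitespace.
_JUNK = re.compile(r"[\x00-\x1f\x7f\s]+")

def normalize_whitespace(text):
    """Collapse control characters and whitespace runs into single spaces."""
    return _JUNK.sub(" ", text).strip()

sample = "learn\r\n\tto  play\x10catapults "
cleaned = normalize_whitespace(sample)
print(cleaned)  # learn to play catapults
```

Applying something like this to the text series before vectorizing should remove most of the whitespace-only features. Dropping tokens where spaCy's `token.is_space` is true inside the tokenizer would be another option.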

In [30]:
tfidf_feature_names

['\t',
 '\t\t',
 '\t\t\t',
 '\t\t\t\t',
 '\t ',
 '\t  \t',
 '\x10catapults',
 ' ',
 '  ',
 '   ',
 '    ',
 '     ',
 '      ',
 '       ',
 '        ',
 '         ',
 '          ',
 '           ',
 '            ',
 '             ',
 '              ',
 '                ',
 '                 ',
 '                  ',
 '                    ',
 '                     ',
 '                       ',
 '                        ',
 '                            ',
 '                                 ',
 '                                    ',
 '                                     ',
 '                                             ',
 '                                                 ',
 '                                                      ',
 '                                                                                                                                                                                                       ',
 '                                                   

# Unrelated Stuff / Testing



## Understanding tfidf with small examples

In [None]:
# Good intro with nice code: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af
def wm2df(wm, feat_names):
    """Create a dataframe from a (sparse) word matrix."""
    # Create an index label for each row (document)
    doc_names = ['Doc{:d}'.format(idx) for idx in range(wm.shape[0])]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return df

In [None]:

test_series = pd.Series(["I write a sentence about school's computer ", "I've written a sentence about school furniture", "I am writing a sentence about school computer periphery"])

# Create spaCy tokenizer
lemmatizer = spacy.lang.en.English()
def spacy_tokenizer(doc):
    tokens = lemmatizer(doc)
    return [token.lemma_ for token in tokens]

# Vectorize with the custom spaCy lemmatizing tokenizer, e.g. written, writing --> write.
# smooth_idf=True would add one to every document frequency to prevent division by zero; it is switched off here.
test_tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer, analyzer="word", stop_words="english", smooth_idf=False)
test_fitted = test_tfidf.fit(test_series)
test_transformed = test_fitted.transform(test_series)

idf = test_tfidf.idf_

print("> The vocabulary doesn't contain 'write' and 'written' as separate tokens. '-PRON-' is a special token for pronouns")
print(test_tfidf.vocabulary_)
print("\n> 'Furniture' and 'periphery' have the highest idf weights because they only appear in one document. 'Computer' appears in two out of three documents")
# Zip groups the elements from first object with elements from second object by index
print(dict(zip(test_fitted.get_feature_names(), idf)))
print("\n> The tf-idf matrix")
print(wm2df(test_transformed, test_tfidf.get_feature_names()))
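
To see where the idf weights printed above come from, here is a small worked calculation. It assumes scikit-learn's documented formulas: `ln(n / df) + 1` with `smooth_idf=False` and `ln((1 + n) / (1 + df)) + 1` with smoothing. The helper `idf` is hypothetical and only mirrors those formulas.

```python
import math

def idf(n_docs, doc_freq, smooth=False):
    """idf weight following the formulas documented for scikit-learn's TfidfTransformer."""
    if smooth:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1

n_docs = 3  # the three test sentences
# Document frequencies as described in the printed comments above.
for term, doc_freq in [("furniture", 1), ("computer", 2), ("school", 3)]:
    print(term, round(idf(n_docs, doc_freq), 4))
# furniture 2.0986
# computer 1.4055
# school 1.0
```

The final tf-idf matrix additionally l2-normalizes each row, so the matrix entries are smaller than these raw idf values.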

## Nice performance evaluation tools

In [None]:
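# Column name assumed to use underscores like the other columns in projects_cleaned.csv
def my_add():
    """Vectorized addition on the whole column."""
    return list(projects.loc[:, "Teacher_Project_Posted_Sequence"] + 100)

def iter_add():
    """Same addition with an explicit Python loop, for comparison."""
    lst = []
    for row in projects.loc[:, "Teacher_Project_Posted_Sequence"].values:
        val = row + 100
        lst.append(val)
    return lst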

In [None]:
%timeit my_add()
%timeit iter_add()

In [13]:
%reload_ext line_profiler

In [None]:
%lprun -f my_add my_add()


In [None]:
%lprun -f iter_add iter_add()