**Donor Recommender System (Content Based Recommendation)**

In this kernel I won't do EDA, since there are already excellent kernels with valuable insights. I will jump straight into developing the recommender system.
To reiterate, the problem statement is: **Using all the previous donations, projects and donor details, build a system that can suggest the donors most likely to donate to any given new project.**
Initially I will use project categories along with word vectors to suggest donors.
**Word vectors** are vector representations of words, generated by an algorithm trained on a large corpus of text. Each word has its own vector of dimension 100, 200, or whatever we choose. The main advantage is that words that are semantically similar, or that tend to appear near each other in text, end up close to each other in vector space. For example, if we build word vectors by training on a large number of documents, the vectors for words like 'book' and 'author' would be very similar. We will exploit this, as you will see later. Since gathering a large text corpus, preprocessing it, and training on it is time-consuming, and pre-trained models are already available, we will use those. Here I use GloVe vectors, which are also available on Kaggle.
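To make "close in vector space" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are made up for illustration; real GloVe vectors are learned from text and have 50 to 300 dimensions.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "book":   [0.9, 0.8, 0.1],
    "author": [0.8, 0.9, 0.2],
    "soccer": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_book_author = cosine(toy_vectors["book"], toy_vectors["author"])
sim_book_soccer = cosine(toy_vectors["book"], toy_vectors["soccer"])
print("book vs author:", round(sim_book_author, 3))
print("book vs soccer:", round(sim_book_soccer, 3))
```

With these toy numbers, 'book' scores much closer to 'author' than to 'soccer', which is the property the rest of the kernel relies on.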

The overall plan is:
1. Build a word2vec-format model from the pre-trained GloVe vectors.
2. Calculate the average word vector for each unique category used in the projects dataset and store them in a list.
3. Given any new project, extract its category, compute its average word vector, and find the most similar vector in the list above.
4. Once you have the most similar categories, get the projects tagged under those categories.
5. Finally, get the donors who have donated to the projects found above.
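Steps 2 to 5 can be sketched end to end on toy data. Everything here (the vectors, project IDs, donor IDs, and the helper names `avg_vec`, `cosine` and `suggest_donors`) is hypothetical and only meant to show the flow, not the actual implementation used later in the kernel.

```python
import math

# Made-up vectors and records, for illustration only.
toy_vectors = {
    "math": [0.9, 0.1, 0.0],
    "science": [0.8, 0.2, 0.1],
    "health": [0.1, 0.9, 0.1],
    "sports": [0.0, 0.8, 0.3],
    "fitness": [0.1, 0.85, 0.2],
}
project_category = {"p1": "Math Science", "p2": "Health Sports", "p3": "Health Sports"}
donations = [("p1", "donor_a"), ("p2", "donor_b"), ("p3", "donor_c")]

def avg_vec(text):
    """Average the vectors of the known words in `text`."""
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Step 2: one average vector per unique category.
category_vecs = {c: avg_vec(c) for c in set(project_category.values())}

def suggest_donors(new_category):
    query = avg_vec(new_category)  # step 3: vector for the new project's category
    best = max(category_vecs, key=lambda c: cosine(category_vecs[c], query))
    projects = {p for p, c in project_category.items() if c == best}  # step 4
    donors = sorted({d for p, d in donations if p in projects})  # step 5
    return best, donors

best_category, suggested = suggest_donors("Fitness")
print(best_category, suggested)
```

Here the unseen category "Fitness" lands on "Health Sports", so the donors of projects p2 and p3 are suggested.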

Let's begin with imports and initializations :


In [None]:
import numpy as np
import pandas as pd
import time
import os
import pickle
import math
from gensim.models import Word2Vec,KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

# stop words are not used in the kernel anymore since I am using pre-trained model.
stop = set(stopwords.words('english'))
stop = stop.union(set(string.punctuation))
stop = text.ENGLISH_STOP_WORDS.union(stop)

translator = str.maketrans('', '', string.punctuation)

glove_input_file = '../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt' 
word2vec_output_file = 'glove_w2v.txt'
projects_file = "projects"
model_file = 'model'
update = False # if for some reason you want to update the loaded objects make it True
# ['glove.6B.50d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.6B.100d.txt'] these are the available pre-trained models

Next, some helper functions that I am going to use. They are simple and commented, so I won't explain them again here.

In [None]:
def print_time(tag, start_time):
    """
    To print time difference with a text
    """
    print(tag,(time.time()-start_time))
    
def get_avg_vec(words):
    """
    This method accepts a string with multiple words and returns the 
    average of their word vectors. Words missing from the model are skipped.
    """
    avg = np.zeros((len(model["book"]),))
    found = 0
    for word in words.split():
        try:
            avg = avg + model[word.lower()]
            found += 1
        except KeyError: # word is not in the pre-trained vocabulary
            print("word not found",word)

    if found == 0: # avoid 0/0 (NaN) for empty or fully out-of-vocabulary input
        print("no known words in",repr(words))
        return avg
    return avg/found

def smooth_donor_preference(x):
    """
    To reduce large numbers to smaller, comparable values. I had found this idea in one of the kernels some time back.
    I do not take credit for this smoothening idea.
    """
    return math.log(1+x, 2)

def build_df_groupy_donor(df):
    df["eventStrength"] = df["Donation Amount"]
    return df.groupby(['Project ID','Donor ID'])['eventStrength'].sum().apply(smooth_donor_preference).reset_index()
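To get a feel for what the smoothing does, here is a small sketch that applies the same formula, log base 2 of (1 + x), to a few illustrative donation totals. The amounts are made up and `smooth` is just a local stand-in for `smooth_donor_preference`.

```python
import math

def smooth(x):
    # Same formula as smooth_donor_preference: log base 2 of (1 + x).
    return math.log(1 + x, 2)

amounts = [0, 1, 7, 127, 1023, 10000]  # illustrative donation totals
smoothed = [smooth(a) for a in amounts]
for amount, value in zip(amounts, smoothed):
    print(f"{amount:>6} -> {value:.2f}")
```

The ordering of donors is preserved, but a 10000 total ends up only a few units above a 127 total. It also means that a cut-off such as `eventStrength > 7` corresponds to a summed donation above 127.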

In [None]:
print(os.listdir("../input/io"))
print(os.listdir("../input/glove-global-vectors-for-word-representation"))

Build a word2vec model from the GloVe vectors file. We will use the file with 200-dimensional vectors. Once the model is built for the first time, it is stored in output storage so that all subsequent runs can load it directly.

In [None]:
model = None
if(not os.path.isfile(word2vec_output_file) or update):
    print("Glove model not pre-loaded. Loading now...")
    glove2word2vec(glove_input_file, word2vec_output_file)
    load_time = time.time()
    model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
    print_time("model load time",load_time)
    pickle.dump(model,open(model_file,'wb')) # store the word2vec model in output folder after building for first time to save time

if(model is None): # if model is already available in output just load it
    load_time = time.time()
    model = pickle.load(open(model_file,'rb'))
    print_time("model pickle load time",load_time)
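For context on the conversion step: as far as I understand, the GloVe text files contain only `token v1 v2 ...` lines, while the word2vec text format expects a first line with the vocabulary size and the vector dimension. The sketch below mimics that idea on two made-up lines; `add_word2vec_header` is a hypothetical helper, not a gensim function.

```python
import io

# Made-up GloVe-style lines: a token followed by its vector components.
glove_text = "book 0.1 0.2 0.3\nauthor 0.2 0.1 0.4\n"

def add_word2vec_header(glove_lines):
    """Prepend the '<vocab_size> <dimensions>' header used by the word2vec text format."""
    rows = [line for line in glove_lines if line.strip()]
    dims = len(rows[0].split()) - 1  # first field is the token itself
    return [f"{len(rows)} {dims}\n"] + rows

converted = add_word2vec_header(io.StringIO(glove_text).readlines())
print("".join(converted))
```

Recent gensim releases appear to accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which would make the separate conversion file unnecessary, but check the documentation for the version installed in your kernel before relying on it.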

Now let's test if it's actually working.

In [None]:
print("size of output file",os.path.getsize(word2vec_output_file)//(1024*1024),"MB")
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1) # just to check if model is actually working
print("Testing Word2Vec model (Result should be 'queen')",result)
print(model.most_similar(positive=["book"],topn=2))

Now read the projects and donations CSVs and store them as DataFrames. We keep only the first 50K project rows to save memory; otherwise the kernel sometimes stops working.

In [None]:
df_projects = None
if(not os.path.isfile(projects_file) or update):
    print("No pre-loaded projects found. Loading now...")
    df_projects = pd.read_csv("../input/io/Projects.csv", low_memory=False)
    df_projects = df_projects[0:50000] # keep only the first 50K rows to save memory
    pickle.dump(df_projects,open(projects_file,'wb'))

print("size of projects file",os.path.getsize(projects_file)//(1024*1024),"MB")
if(df_projects is None): # if already in storage, load it.
    df_projects = pickle.load(open(projects_file,'rb'))

# drop projects with no category
df_projects.dropna(subset=['Project Subject Category Tree'],inplace=True)
categories = df_projects["Project Subject Category Tree"].unique()
print("These are the overall categories in use till now")
print(categories)

load_time = time.time()
df_donations = pd.read_csv("../input/io/Donations.csv", low_memory=False)
df_gp = build_df_groupy_donor(df_donations)
#df_gp.set_index("Project ID",inplace=True)
display(df_gp[0:50])
print_time("Donations load time",load_time)

Now that the loading (boring) part is over, let's understand how we are going to recommend donors. The idea is to use the categories ("Project Subject Category Tree") tagged on each project to find similar projects. While I could have matched them directly (like `Project Subject Category Tree == "Math & Science"`), that would not hold up in the long run, since someone could write "math & science" as "Math and Science". Another advantage is that even if a new category is introduced, this system still works by returning the most similar existing category. For example, if a new project is tagged "Fitness", it returns projects under "Health & Sports", as we will see in action later. Awesome, right?
The idea of using NLP techniques to find similar projects started with trying to infer the project type from "Project Essay". I tried several approaches, such as TF-IDF with cosine similarity and TF-IDF plus word vectors with K-Means, but the results were not satisfactory, maybe because "Project Essay" contains many words that say little about the project type. It may still be achievable by tweaking the IDF parameters to remove such words. Here is a sample essay that shows the insignificant words:

In [None]:
#display(df_projects[df_projects["Project Subject Category Tree"] == 'Math & Science, Literacy & Language'].head())
text_ms_ll = df_projects[df_projects["Project Subject Category Tree"] == 'Math & Science, Literacy & Language'].iloc[0]["Project Essay"]
print(text_ms_ll)
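One way to drop words that appear in almost every essay is a document-frequency cut-off, similar in spirit to the `max_df` parameter of scikit-learn's `TfidfVectorizer`. The sketch below is a plain-Python illustration on made-up mini essays; `informative_words` and its threshold are assumptions, not something tuned on the real data.

```python
from collections import Counter

# Made-up mini "essays", for illustration only.
essays = [
    "my students need books to love reading",
    "my students need microscopes for science labs",
    "my students need soccer balls for recess",
]

def informative_words(docs, max_doc_fraction=0.7):
    """Keep words that occur in at most max_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    limit = max_doc_fraction * len(docs)
    return [[w for w in doc.split() if doc_freq[w] <= limit] for doc in docs]

filtered = informative_words(essays)
for words in filtered:
    print(words)
```

Words like "my", "students" and "need" occur in every essay and are removed, while the words that hint at the project type survive.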

This is the final piece of code. In the code below, **translator** removes all punctuation from a given string, so "Math & Science" becomes "Math Science" and we can calculate accurate word vector averages. As already discussed, even if a new project with an unseen category is passed to this system, it returns the most similar categories gracefully. You can see this in the output of the next cell.

In [None]:
stime = time.time()
cat_vec = np.zeros((len(categories),len(model["book"])))
print("Number of categories",len(categories))
count = 0
"""
'categories' holds all the unique categories used in projects dataset.
we calculate average word vectors for each of them and store in 'cat_vec'
"""
for cat in categories:
    #print("Real-",cat.strip(),"--trans-",cat.strip().translate(translator))
    words = cat.strip().translate(translator)
    cat_vec[count] = get_avg_vec(words)
    count = count + 1

print("category vectors calculated",cat_vec.shape)

def get_similar_category(cats):
    """
    parameter : 'n' category strings
    returns : list of most similar category (from projects dataset) to each of the 'n' categories sent as parameter
    """
    #arr.argsort()[-3:]
    sim_cats = []
    for cat in cats:
        words = cat.strip().translate(translator)
        avg = get_avg_vec(words)
        res = cosine_similarity(cat_vec,avg.reshape(1,-1))
        max_ind = np.argmax(res)
        print("Given category- ",cat,", Most similar category- ",categories[max_ind])
        sim_cats.append(categories[max_ind])
    return sim_cats

res = get_similar_category(["Applied Math","Fitness"]) # array of matching categories
print("\n")
for cat in res:
    print("For category",cat)
    df_temp = df_projects[df_projects["Project Subject Category Tree"] == cat].reset_index() # keep only fields needed
    #display(df_temp)
    df_cat_projs = df_gp[df_gp["Project ID"].isin(df_temp["Project ID"])]#.sort_values("eventStrength",ascending=False)
    df_cat_projs = df_cat_projs.groupby(["Project ID","Donor ID"]).agg({'eventStrength':'sum'})
    df_cat_projs.sort_values("eventStrength",ascending=False,inplace=True)
    # suggest donors who have donated most generously for better turnover and brevity of display
    df_cat_projs = df_cat_projs[df_cat_projs["eventStrength"] > 7]
    print("Suggested donors ({0}) for category {1}".format(len(df_cat_projs),cat))
    display(df_cat_projs)

print_time("End time",stime)
# 'Math & Science, Literacy & Language', 'Health & Sports, History & Civics'
# next, think about collaborative filtering with matrix [projects, donors]
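As a small aside on the punctuation handling, this sketch shows what the `translator` table does to category strings, and why exact matching stays brittle even after normalization. The `normalize` helper is hypothetical and only used for this illustration.

```python
import string

# Same construction as in the imports cell: delete every punctuation character.
translator = str.maketrans('', '', string.punctuation)

def normalize(category):
    """Strip punctuation, lowercase, and split into words."""
    return category.strip().translate(translator).lower().split()

full = normalize("Math & Science, Literacy & Language")
with_ampersand = normalize("math & science")
with_and = normalize("Math and Science")
print(full)
print(with_ampersand, with_and, with_ampersand == with_and)
```

The two spellings still differ by the word "and", so an equality check would miss the match, whereas their average word vectors should stay close.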

Things in the pipeline:
1. Using other features when suggesting donors, such as project location, since people tend to support a charity more generously in their own state or locality. This means combining "Donor Zip" with the project zip code.
2. Trying collaborative filtering.