# Estimating Project Codes
Project codes are estimated with spaCy's [`PhraseMatcher`](https://spacy.io/api/phrasematcher) class.
You'll need the [project codes `.csv` file from here](https://docs.google.com/spreadsheets/d/1VkahTppRoIEW54esq2Sb35C9bNfCGIwJUErEfpkFGWo/edit?usp=sharing). 

This file is the master list of project codes and their formal names, along with any additional names that should match each code. Additional names are separated by a `|` character and must be parsed out when the file is loaded. The cells below show the approach I recommend for estimating project codes.
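
As a minimal sketch of that parsing step, the helper below splits one `Match Keys` value on `|` and trims stray whitespace. The name `parse_match_keys` and the sample value are assumptions for illustration, not part of the original notebook.

```python
def parse_match_keys(raw):
    """Split a '|'-separated Match Keys value into clean names.

    Hypothetical helper: strips surrounding spaces and newlines and
    drops empty entries so they cannot end up in a match pattern.
    """
    return [key.strip() for key in raw.split('|') if key.strip()]


# Assumed sample value, shaped like a row with several alternative names.
print(parse_match_keys('Open TV | OTV | Open Television\n'))
```

Trimming at load time keeps the spaces and trailing newlines in the spreadsheet out of the patterns that are later handed to the matcher.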

In [20]:
import pandas as pd
import re
from tqdm.notebook import tqdm
import os
from datetime import datetime, timedelta, date
import numpy as np

# Loading the Master Project Code list

In [21]:
proj_codes_df = pd.read_csv('../project_codes.csv')
proj_codes_df = proj_codes_df[['Project Code', 'Match Keys']]
proj_codes_df.head()

| | Project Code | Match Keys |
|---|---|---|
| 0 | ANO | Kiam Marcelo Junio |
| 1 | AP | The Algebra Project |
| 2 | AS | Afternoon Snatch |
| 3 | AV | Ambivert |
| 4 | BF | Brave Futures |


In [22]:
proj_codes_df.index=proj_codes_df['Project Code'] # sets the index as the project code

# Load the files to estimate

In [38]:
ls -l ../data

total 432
-rw-r--r--@  1 scottcambo  staff   11204 May  1 15:58 Facebook_Post_(11_01_2017)_(11_30_2017).csv
drwxr-xr-x@ 45 scottcambo  staff    1440 Apr 24 11:48 Gabby 2010-2016 Vimeo Video Files/
-rw-r--r--@  1 scottcambo  staff  169578 Apr 23 17:33 Gabby 2010-2016 Vimeo Video Files-20200423T223328Z-001.zip
drwxr-xr-x  45 scottcambo  staff    1440 Apr 24 11:49 Gabby 2010-2016 Vimeo Video Files_RESOLVED/
-rw-------   1 scottcambo  staff   25417 Apr 23 17:37 Gabby 2010-2016 Vimeo Video Files_RESOLVED.zip
drwxr-xr-x@  5 scottcambo  staff     160 Apr 23 11:32 OTV - Shared Data/
drwxr-xr-x   5 scottcambo  staff     160 May  6 12:02 from_rey_05062020/
-rw-r--r--   1 scottcambo  staff    1757 May  4 12:19 vimeo_video_unhandled_project_ids
-rw-r--r--   1 scottcambo  staff    1757 May  4 12:20 vimeo_video_unhandled_project_ids.csv


In [43]:
file_directory = '../data/from_rey_05062020/'
# Keep only names that end in .csv (a substring test would also match e.g. "x.csv.bak")
csv_files = [file_name for file_name in os.listdir(file_directory) if file_name.endswith('.csv')]

In [44]:
csv_files

['Facebook_Post_(05_01_2015)_(05_31_2015).csv',
 'Facebook_Post_(11_01_2017)_(11_30_2017).csv',
 'Facebook_Post_(09_01_2016)_(09_30_2016).csv']

In [45]:
df = pd.read_csv(file_directory+csv_files[1])

In [46]:
df.head()

Unnamed: 0,Date,Lifetime Total Likes,Daily New Likes,Daily Unlikes,Daily Page Engaged Users,Weekly Page Engaged Users,28 Days Page Engaged Users,Daily Total Reach,Weekly Total Reach,28 Days Total Reach,...,Weekly Total web site click count per Page by age and gender - 55-64.U,Weekly Total web site click count per Page by age and gender - 65+.F,Weekly Total web site click count per Page by age and gender - 65+.M,Weekly Total web site click count per Page by age and gender - 65+.U,Weekly Total web site click count per Page by age and gender - <13.F,Weekly Total web site click count per Page by age and gender - <13.M,Weekly Total web site click count per Page by age and gender - <13.U,Weekly Total web site click count per Page by age and gender - UNKNOWN.F,Weekly Total web site click count per Page by age and gender - UNKNOWN.M,Weekly Total web site click count per Page by age and gender - UNKNOWN.U
0,,Lifetime: The total number of people who have ...,Daily: The number of new people who have liked...,Daily: The number of Unlikes of your Page (Uni...,Daily: The number of people who engaged with y...,Weekly: The number of people who engaged with ...,28 Days: The number of people who engaged with...,Daily: The number of people who had any conten...,Weekly: The number of people who had any conte...,28 Days: The number of people who had any cont...,...,,,,,,,,,,
1,2017-11-01,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2017-11-02,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2017-11-03,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2017-11-04,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# [spaCy PhraseMatcher](https://spacy.io/api/phrasematcher)
[Install spaCy](https://spacy.io/usage)

In [26]:
import string
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)

In [27]:
def name_to_doc(name):
    return nlp(name.lower().translate(str.maketrans('', '', string.punctuation)))
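
To make the normalization inside `name_to_doc` concrete, here is a small sketch of just the string step (lowercasing and removing punctuation), without spaCy. The function name `normalize` is an assumption for illustration; the sample text is one of the post messages shown in the output further down.

```python
import string


def normalize(text):
    """Lowercase the text and delete every ASCII punctuation character."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))


print(normalize("Catch #Futurewomen today in South Shore!"))
# catch futurewomen today in south shore
```

Because the same normalization is applied to both the match keys and the post text, a hashtag such as `#Futurewomen` can still match the key `futurewomen`.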

In [28]:
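# Add each project code and its match keys to the phrase matcher

for code, row in tqdm(proj_codes_df.iterrows(), total=len(proj_codes_df)):
    # Strip stray spaces/newlines around each key; left in place they can become
    # extra tokens in the pattern and stop it from matching.
    names = [name_to_doc(n.strip()) for n in row['Match Keys'].split('|') if n.strip()]
    print(names)
    # spaCy v2 signature; in spaCy v3 use phrase_matcher.add(code, names)
    phrase_matcher.add(code, None, *names)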


[kiam marcelo junio]
[the algebra project]
[afternoon snatch]
[ambivert]
[brave futures]
[brujos]
[bronx cunt tour]
[brown girls]
[black melodies]
[brand new boy]
[borderd]
[bsayf by roy kinsey]
[been there]
[code]
[the conspiracy theorist]
[damaged goods]
[darling shear]
[for better]
[filipino fusions]
[fame]
[full out]
[fobia]
[freaky phyllis]
[the furies]
[fck stan]
[futurewomen]
[fck yes]
[granny ballers
]
[the haven]
[the hoodoisie]
[hookups]
[i  love me]
[good enough]
[open tv ,  otv ,  open television]
[geetas guide to moving on]
[the hoodosie]
[hair story]
[hook ups]
[it goes unsaid]
[inertia]
[in real life]
[just call me ripley]
[kickin it]
[kings and queens]
[kissing walls
]
[lipstick city]
[let go and let god]
[low strung]
[michaela angela davis]
[movement matters]
[melody set me free]
[night night]
[nupita obama creates vogua]
[outtakes]
[on the verge]
[prep4love ,  p4l ,  dr every woman ,  one little pill
]
[project basho]
[pay day]
[public relations]
[philadelphia voices 

In [29]:
def estimate_project_id(name):
    # pandas reads a missing post message as NaN (a float); nothing to match
    if isinstance(name, float):
        return None

    doc = name_to_doc(name)

    codes = []
    for match_id, start, end in phrase_matcher(doc):
        # doc[end] is the first token after the matched phrase; skip matches
        # followed by "presents" (e.g. "OTV presents: ...")
        if end < len(doc) and doc[end].text == 'presents':
            continue
        codes.append(nlp.vocab.strings[match_id])

    # No usable match, including the case where every match was filtered out
    if not codes:
        return 'None'
    if 'TQIK' in codes:
        return 'TQIK'
    return '|'.join(codes)
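
The "presents" check can be illustrated without loading a spaCy model. The sketch below works on a plain token list and uses `(label, start, end)` tuples with an exclusive `end`, the same span convention the matcher returns. The function name `filter_matches` and the label `'OTV'` are hypothetical; `'B'` is the code that appears for Brujos in the output further down.

```python
def filter_matches(tokens, matches, stop_word='presents'):
    """Return labels of matches that are not directly followed by stop_word.

    Each match is (label, start, end) with an exclusive end index, so
    tokens[end] is the first token after the matched phrase.
    """
    kept = []
    for label, start, end in matches:
        if end < len(tokens) and tokens[end] == stop_word:
            continue  # e.g. "otv presents ..." credits the presenter, not the project
        kept.append(label)
    return kept


tokens = ['otv', 'presents', 'brujos', 'season', 'one']
matches = [('OTV', 0, 1), ('B', 2, 3)]  # 'OTV' is a hypothetical code
print(filter_matches(tokens, matches))  # ['B']
```

The bounds check matters: a phrase matched at the very end of a post has no following token, so it is kept without indexing past the end of the list.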


In [30]:
def load_and_clean_df(file_loc):
    df = pd.read_csv(file_loc)
    cols_to_delete = []
    for col in df.columns:
        if 'Unnamed' in col:
            cols_to_delete.append(col)
            
    df = df.drop(columns=cols_to_delete)

    return df

In [32]:
output_dir = '../data/'
for file_name in tqdm(csv_files):
    file_loc = file_directory+file_name
    df = load_and_clean_df(file_loc)
    df['Project ID'] = df['Post Message'].apply(estimate_project_id)
    df.to_csv(output_dir+file_name, index=False)


KeyError: 'Post Message'
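
This `KeyError` is likely raised by an export that has no `Post Message` column, such as the page-level file previewed earlier, whose columns start with `Date` and `Lifetime Total Likes`. One possible guard is to check the header before processing a file. The sketch below uses the standard-library `csv` module on in-memory stand-ins so it runs without the data files; the helper name `has_column` and both sample files are assumptions for illustration.

```python
import csv
import io


def has_column(csv_file, column):
    """Return True if the header row of an open CSV file contains `column`."""
    header = next(csv.reader(csv_file), [])
    return column in header


# Hypothetical miniature files standing in for the Facebook exports.
post_level = io.StringIO("Post ID,Post Message\n1,Catch #Futurewomen today\n")
page_level = io.StringIO("Date,Lifetime Total Likes\n2017-11-01,100\n")

print(has_column(post_level, 'Post Message'))  # True
print(has_column(page_level, 'Post Message'))  # False
```

With pandas already loaded, the equivalent check inside the loop would be `'Post Message' in df.columns`, skipping any file where it is false.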

# Check the files

In [18]:
df = pd.read_csv(output_dir+csv_files[0])
for ind, row in df.iterrows():
    print("%s) %s: %s\n" % (ind, row['Post Message'], row['Project ID']))

0) nan: nan

1) Sunday! Catch a sneak peek of our upcoming show #TheTTV after Tanya Saracho's play #Fade, which follows Mexican-born Lucia after she's hired to write for a ruthless Hollywood TV series. Join us for a discussion with creators Daniel Kyri Madison and Bea Cordelia Sullivan-Knoff about writing queer, black and trans characters outside of Hollywood!: None

2) Catch #Futurewomen today in South Shore!: FW

3) Love this deep conversation about art, identity and politics with NIC Kay with Anna Martine Whitehead for Art21! #lilBLK: None

4) Binging Brujos today 🧙🏽‍♂️👨🏽‍🏫💫✨

Episodes 1-11 free to watch at https://weareopen.tv/open-tv-originals/brujos

#BrujosTV #DecolonizeThanksgiving: B

5) The first season of Brujos is live! Watch all eleven episodes to see how these gay Latino witches join folx of all colors to stop the hunt against their people! 

Watch now on weareopen.tv or save it on Vimeo: https://vimeo.com/album/4451466

#brujostv #opentvoriginals: B

6) Brandon Markell H