# Estimating Project Codes
Project codes are estimated with spaCy's [`PhraseMatcher`](https://spacy.io/api/phrasematcher) class.
You'll need the [project codes `.csv` file from here](https://docs.google.com/spreadsheets/d/1VkahTppRoIEW54esq2Sb35C9bNfCGIwJUErEfpkFGWo/edit?usp=sharing).

This file is the master list of project codes and their formal names, along with any additional names that should match each code. Additional names are stored in the `Match Keys` column, separated by a `|` character, and have to be split apart when the file is loaded. Below is the approach I recommend for estimating project codes.
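
As a minimal sketch of that parsing step, the helper below splits one `Match Keys` cell on `|` and trims each piece. The function name `parse_match_keys` and the sample cells are hypothetical, written in the style of the file rather than copied from it.

```python
def parse_match_keys(raw):
    """Split a 'Match Keys' cell on '|' and trim whitespace from each key."""
    # pandas reads an empty cell as NaN (a float), so guard against non-strings.
    if not isinstance(raw, str):
        return []
    return [key.strip() for key in raw.split('|') if key.strip()]


# Hypothetical cells: one with several names, one with a trailing newline.
print(parse_match_keys('Open TV | OTV | Open Television'))
print(parse_match_keys('Granny Ballers\n'))
```

Trimming matters because a stray space or newline left on a key would otherwise become part of the phrase being matched.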

In [1]:
import pandas as pd
import re
from tqdm.notebook import tqdm
import os
from datetime import datetime, timedelta, date
import numpy as np

# Loading the Master Project Code list

In [2]:
proj_codes_df = pd.read_csv('../project_codes_05152020.csv', index_col=0)
#proj_codes_df = proj_codes_df[['Project Code', 'Match Keys']]
proj_codes_df.head()

| Unnamed: 0 | Post Example | Match Keys | Formal Name | Is OTV project? | Notes |
|---|---|---|---|---|---|
| ANO | Kiam Marcelo Junio -- The Artists of Nupita Obama | Kiam Marcelo Junio | Nupita Obama Artists | PRESENTS | There will be two other videos for Erik Wallac... |
| AP | | The Algebra Project | The Algebra Project | FALSE | |
| AS | Afternoon Snatch -- Episode 1 | Afternoon Snatch | Afternoon Snatch | ORIGINALS | |
| AV | Open TV Presents - ambivert by ester alegria | Ambivert | Ambivert | PRESENTS | |
| BF | `\n` | Brave Futures | Brave Futures | PRESENTS | Note this will include around 12 separate shor... |


# Load the files to estimate

In [3]:
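file_directory = '../data/gabby_estimates_05202020/Gabby Estimator Needed/'
# Keep only files whose name ends in .csv (a substring test would also pick up names like 'x.csv.bak').
csv_files = [file_name for file_name in os.listdir(file_directory) if file_name.endswith('.csv')]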

In [4]:
df = pd.read_csv(file_directory+csv_files[0])

In [5]:
df.head()

| Unnamed: 0 | plays | downloads | loads | finishes | likes | comments | uri | name | duration | created_time | sizes | unique_loads | mean_percent | mean_seconds | sum_seconds | total_seconds | unique_viewers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 229 | 0.0 | 585.0 | 19.0 | 2.0 | 0.0 | /videos/134807299 | Open TV Presents: Nupita Obama Creates Vogua | 731.0 | 2015-07-29T04:46:53+00:00 | Array | 0.0 | 58.0 | 394.0 | 90327.0 | 167293.0 | 0.0 |
| 1 | 187 | 0.0 | 452.0 | 48.0 | 0.0 | 0.0 | /videos/121810540 | You're So Talented, Chapter One | 341.0 | 2015-03-10T18:04:57+00:00 | Array | 0.0 | 77.0 | 256.0 | 47915.0 | 64507.0 | 0.0 |
| 2 | 172 | 2.0 | 292.0 | 23.0 | 2.0 | 0.0 | /videos/136670475 | Futurewomen, The Origins -- Episode 1 | 805.0 | 2015-08-18T23:17:14+00:00 | Array | 0.0 | 45.0 | 287.0 | 49529.0 | 137702.0 | 0.0 |
| 3 | 148 | 0.0 | 229.0 | 27.0 | 0.0 | 0.0 | /videos/123121866 | You're So Talented, Season One -- Chapter Two | 459.0 | 2015-03-24T18:39:13+00:00 | Array | 0.0 | 81.0 | 345.0 | 51160.0 | 66855.0 | 0.0 |
| 4 | 135 | 0.0 | 404.0 | 26.0 | 0.0 | 0.0 | /videos/124849631 | You're So Talented, Season One -- Chapter Four | 295.0 | 2015-04-13T17:46:12+00:00 | Array | 0.0 | 76.0 | 205.0 | 27787.0 | 39777.0 | 0.0 |


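# [spaCy PhraseMatcher](https://spacy.io/api/phrasematcher)
[Install spaCy](https://spacy.io/usage) and download the `en_core_web_sm` model (`python -m spacy download en_core_web_sm`) before running the cells below.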

In [6]:
import string

import spacy
from spacy.matcher import PhraseMatcher

# Load the small English pipeline; the matcher shares its vocabulary.
nlp = spacy.load('en_core_web_sm')
phrase_matcher = PhraseMatcher(nlp.vocab)

In [7]:
def name_to_doc(name):
    # Lowercase and strip punctuation so video titles and match keys are compared in the same form.
    return nlp(name.lower().translate(str.maketrans('', '', string.punctuation)))
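
To see what the text normalization inside `name_to_doc` does before spaCy tokenizes anything, here is a small sketch using only the standard library. The helper name `normalize` is hypothetical; it reproduces just the lowercasing and punctuation-stripping step, not the spaCy call.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)


def normalize(name):
    """Lowercase a title and remove punctuation (the text step of name_to_doc)."""
    return name.lower().translate(_STRIP_PUNCT)


print(normalize("You're So Talented, Chapter One"))
# youre so talented chapter one
print(normalize('#NupitaObama Creates Vogua Premiere: Wicker Park'))
# nupitaobama creates vogua premiere wicker park
```

The second title shows a limit of this approach: the hashtag collapses to the single word `nupitaobama`, so a key such as `nupita obama creates vogua` cannot match it.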

In [8]:
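# Add each project code and its match keys to the phrase matcher.

for code, row in tqdm(proj_codes_df.iterrows(), total=len(proj_codes_df)):
    # Strip each key first: leading spaces or trailing newlines would become
    # extra tokens in the pattern and keep it from matching.
    names = [name_to_doc(n.strip()) for n in row['Match Keys'].split('|')]
    print(names)
    # spaCy v3 signature; spaCy v2 used phrase_matcher.add(code, None, *names).
    phrase_matcher.add(code, names)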


[kiam marcelo junio]
[the algebra project]
[afternoon snatch]
[ambivert]
[brave futures]
[brujos]
[bronx cunt tour]
[brown girls]
[black melodies]
[brand new boy]
[borderd]
[bsayf by roy kinsey]
[been there]
[code]
[conspiracy theorist]
[damaged goods]
[darling shear]
[for better]
[filipino fusions]
[fame]
[full out]
[fobia]
[freaky phyllis]
[the furies]
[fck stan]
[futurewomen]
[fck yes]
[granny ballers
]
[the haven]
[the hoodoisie]
[i  love me]
[good enough]
[open tv ,  otv ,  open television]
[open tv ,  otv ,  open television ,  community]
[geetas guide to moving on]
[the hoodosie]
[hair story]
[hook ups]
[it goes unsaid]
[inertia]
[in real life]
[just call me ripley]
[kickin it]
[kings and queens]
[kissing walls
]
[lipstick city]
[let go and let god]
[low strung]
[michaela angela davis]
[movement matters]
[melody set me free]
[night night]
[nupita obama creates vogua]
[outtakes]
[on the verge]
[prep4love ,  p4l ,  dr every woman ,  one little pill
]
[project basho]
[pay day]
[publ

In [9]:
def estimate_project_id(name):
    try:
        doc = name_to_doc(name)
    except AttributeError:
        # A non-string name (for example NaN) has no .lower(); show it, then re-raise.
        print(name)
        raise

    codes = []
    for match_id, start, end in phrase_matcher(doc):
        # doc[end] is the first token after the matched phrase. Skip matches
        # followed by "presents" so that "OTV presents: ..." is not coded
        # from the "OTV" part alone.
        if end < len(doc) and doc[end].text == 'presents':
            continue
        codes.append(nlp.vocab.strings[match_id])

    if 'TQIK' in codes:
        return 'TQIK'
    # No usable match, including the case where every match was filtered out.
    if not codes:
        return 'None'
    return '|'.join(codes)
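
The "presents" check can be illustrated without spaCy. The sketch below works on a plain token list and hand-written `(code, start, end)` spans, which stand in for what the matcher would report; the `'OTV'` code and the helper name `keep_match` are assumptions made for the example.

```python
def keep_match(tokens, end):
    """Return False when the token right after the match is 'presents'."""
    return end >= len(tokens) or tokens[end] != 'presents'


tokens = 'open tv presents nupita obama creates vogua'.split()

# Hypothetical spans: 'open tv' covers tokens 0-1, the title covers tokens 3-6.
matches = [('OTV', 0, 2), ('NOCV', 3, 7)]

kept = [code for code, start, end in matches if keep_match(tokens, end)]
print(kept)  # ['NOCV']
```

The first span is dropped because the token at index 2 is `presents`; the second is kept because it runs to the end of the title.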


In [10]:
def load_and_clean_df(file_loc):
    df = pd.read_csv(file_loc)
    # Drop leftover index columns such as 'Unnamed: 0'.
    df = df.drop(columns=[col for col in df.columns if 'Unnamed' in col])
    # Drop a trailing row whose 'plays' value is just a line break.
    if len(df) > 0 and df.iloc[-1]['plays'] == '\r\n':
        df = df.iloc[:-1]
    # estimate_project_id expects a string, so fill in missing names.
    df['name'] = df['name'].fillna('NO NAME')
    return df

In [13]:
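# Note: output_dir is the same folder as file_directory, so each input file
# is overwritten in place with the added 'Project ID' column.
output_dir = '../data/gabby_estimates_05202020/Gabby Estimator Needed/'
for file_name in tqdm(csv_files):
    file_loc = file_directory+file_name
    df = load_and_clean_df(file_loc)
    df['Project ID'] = df['name'].apply(estimate_project_id)
    df.to_csv(output_dir+file_name, index=False)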


# Check the files
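One quick check, sketched here under assumptions, is to tally how many estimates came back unmatched (`'None'`), with a single code, or with several codes joined by `|`. The helper name `summarize` and the sample list are hypothetical; in practice the input would be the values of the `Project ID` column.

```python
from collections import Counter


def summarize(project_ids):
    """Count unmatched, single-code, and multi-code estimates."""
    summary = Counter()
    for pid in project_ids:
        # Some pandas versions may read the literal text 'None' back from a
        # CSV as a missing value, so treat non-strings as unmatched too.
        if not isinstance(pid, str) or pid == 'None':
            summary['unmatched'] += 1
        elif '|' in pid:
            summary['multiple'] += 1
        else:
            summary['single'] += 1
    return summary


# Hypothetical sample of estimated IDs.
sample = ['NOCV', 'YST', 'FW', 'None', 'FW|YST']
print(dict(summarize(sample)))
# {'single': 3, 'unmatched': 1, 'multiple': 1}
```

Rows counted as `unmatched` or `multiple` are the ones most worth reviewing by hand.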

In [14]:
df = pd.read_csv(output_dir+csv_files[0])
# Print each video title next to its estimated project code.
for ind, row in df.iterrows():
    print(f"{ind}) {row['name']}: {row['Project ID']}\n")

0) Open TV Presents: Nupita Obama Creates Vogua: NOCV

1) You're So Talented, Chapter One: YST

2) Futurewomen, The Origins -- Episode 1: FW

3) You're So Talented, Season One -- Chapter Two: YST

4) You're So Talented, Season One -- Chapter Four: YST

5) You're So Talented, Season One -- Chapter Three: YST

6) You're So Talented, Season One -- Chapter Seven: YST

7) You're So Talented, Season One -- Chapter Six: YST

8) You're So Talented, Season One -- Chapter Five: YST

9) Teaser -- Futurewomen, an alternate reality series by Honey Pot Performance: FW

10) Teaser -- Nupita Obama Creates Vogua: NOCV

11) Introducing Peppa of #Futurewomen: FW

12) Introducing Wonder of #Futurewomen: FW

13) Introducing Althea of #Futurewomen: FW

14) FAME: Kade Style (Anniversary Cut!): FM

15) Introducing Isis of #Futurewomen: FW

16) Open TV Presents: Nupita Obama Creates Vogua Trailer: NOCV

17) #NupitaObama Creates Vogua Premiere: Wicker Park: None

18) 0: None

19) #NupitaObama Creates Vogua Prem