<a href="https://colab.research.google.com/github/RamosCatalina/Data-for-Good/blob/main/Copy_of_4_Channels_Pipeline_Spring_2022_DEI_TVL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook runs the Spring 2022 pipeline for identifying DEI-related practice and outcome terms and their co-occurrences in a corpus of texts. It is tailored for the TVL dataset but includes instructions on how to modify it to be run on any corpus. It will output CSVs.**

*For reference, here are examples of:*
- *the [most recent full CSV](https://docs.google.com/spreadsheets/d/1xr0n7pO6l76EUKbABBI7ZiIp_mjf57reXp8TP-5bNK8/edit?usp=sharing) for TVL GIC Employee Engagement, Diversity, & Inclusion*
- *the [most recent validation sample CSV](https://docs.google.com/spreadsheets/d/1D7f7wb7VpKbkYspcpE-oDcm93XY59_g2HseQbcevR0M/edit?usp=sharing) for TVL GIC Employee Engagement, Diversity, & Inclusion*

Further explanation of our methodology using this pipeline is in [this document](https://docs.google.com/document/d/1GJseNlhVhQhbRLBv-iWtV3ycXFBAWUItFikzLxijok4/edit?usp=sharing).

## (RUN ENTIRE SECTION AS IS): Imports

In [None]:
# Ensures the tqdm package is up-to-date as necessary
!pip install "tqdm>=4.9.0"



In [None]:
import datetime
import numpy as np
import os
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import random

from tqdm import tqdm

import plotly.graph_objects as go
import plotly.figure_factory as ff
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [None]:
import itertools
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## (SLIGHT MODIFICATION): Mount your drive
- MODIFY-1: make sure you are in your working directory

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd /content/drive/My Drive/DART/

/content/drive/.shortcut-targets-by-id/1OtnD2owSU31YUJ0wHYRD95tiByYxg8X5/DART


In [None]:
import os
os.chdir("/content/drive/My Drive/DART/")
!ls

'API Business Reports'	    'Investor Protection'   'Text Extraction v1'
 business_report_df.csv     'Management Diagnosis'  'Text Extraction v2'
 business_report_df.pkl      Tables		    'Text Extraction v3'
 Example		     Testing		    'Text Tables'
'Executives and Employees'  'Text Extraction prod'


## (RUN ENTIRE SECTION AS IS): Define functions to capture terms

In [None]:
# Functions to create patterns of practice/outcome terms

def create_pattern_2(buck1, buck2, rangelist):
  '''
  Creates a pattern that checks whether at least one search word from each of
  the 2 buckets is 1) within a certain number of words of the other and
  2) as a phrase, within a certain number of words of a DEI context word
  in a text (context window).

  Parameters
  ----------
  buck1 : list of strings; len(buck1)>=1
      The strings are search words in regex format
  buck2 : list of strings; len(buck2)>=1
      The strings are search words in regex format
  rangelist : list of 2 ints of form [a,b]
      a = context window between buck1 and buck2 search words
      b = context window between buck1-buck2 phrase and DEI context word

  Returns
  -------
  [[buck1, buck2], rangelist] : formatted list of the parameters
  '''
  return [[buck1, buck2], rangelist]

def create_pattern_3(buck1, buck2, buck3, rangelist):
  '''
  Creates a pattern that checks whether at least one search word from each of
  the 3 buckets is 1) within a certain number of words of the others and
  2) as a phrase, within a certain number of words of a DEI context word
  in a text (context window).

  Parameters
  ----------
  buck1 : list of strings; len(buck1)>=1
      The strings are search words in regex format
  buck2 : list of strings; len(buck2)>=1
      The strings are search words in regex format
  buck3 : list of strings; len(buck3)>=1
      The strings are search words in regex format
  rangelist : list of 4 ints of form [a_1_2,a_2_3,a_1_3,b]
      a_1_2 = context window between buck1 and buck2 search words
      a_2_3 = context window between buck2 and buck3 search words
      a_1_3 = context window between buck1 and buck3 search words
      b = context window between buck1-buck2-buck3 phrase and DEI context word

  Returns
  -------
  [[buck1, buck2, buck3], rangelist] : formatted list of the parameters
  '''
  return [[buck1, buck2, buck3], rangelist]
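
As a quick illustration of the return format, the sketch below rebuilds the `p0` pattern of the 'discrimination lawsuit' term and unpacks it. The function body is repeated only so the snippet runs on its own; the names `buckets`, `a` and `b` are chosen here for illustration.

```python
def create_pattern_2(buck1, buck2, rangelist):
    # Same body as the function defined above, repeated so this snippet is self-contained
    return [[buck1, buck2], rangelist]

# Pattern taken from the 'discrimination lawsuit' term in CELL_1
p0 = create_pattern_2(['lawsuit', ' sue(d)? '], ['discriminat'], [15, 25])

# A pattern is just [list_of_buckets, rangelist]
buckets, rangelist = p0
a, b = rangelist  # a: window between the two buckets, b: window to the DEI context word

print(buckets)  # [['lawsuit', ' sue(d)? '], ['discriminat']]
print(a, b)     # 15 25
```

The pattern functions do no matching themselves; they only package the buckets and windows that `check_cooccur()` evaluates later.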

In [None]:
# Functions to find practice/outcome terms in a text

def word_indices(input_str, search_word_lst, DEI_contx_list):
  '''
  Finds all indices (by word) of every occurrence of every search word for
  practice/outcome terms and DEI context terms. 

  Parameters
  ----------
  input_str : string
      This is the input text you are searching through.
  search_word_lst : list of strings
      The strings are the practice/outcome term pattern search words in regex format
  DEI_contx_list : list of strings
      The strings are the DEI context search words in regex format

  Returns
  -------
  word_ind_dict : dict of indices where each search word was found
      key: string of search word in regex format
      value: list of indices where that search word was found in the text

      Example: dict with 4 search words
      word_ind_dict = {'apprentice': [54],
                       'female': [0, 17, 21],
                       'program': [27],
                       'wom(e|a)n': [4]}
  '''
  # Build dict with every search word as a key and an empty list as its value
  total_list = search_word_lst + DEI_contx_list
  word_ind_dict = {w: [] for w in total_list}

  # Fill empty lists with index matches
  for w in total_list:
    for match in re.finditer(w, input_str):
      before_str = word_tokenize(input_str[:match.start()])
      word_ind_dict[w].append(len(before_str))
  
  return word_ind_dict

def check_cooccur(word_ind_dict, terms_dict, DEI_contx_list):
  '''
  Finds every instance of all practice/outcome terms (as pattern + DEI context 
  word) in a text. Uses a dict of indices where each search word is found in the 
  text, the output of function word_indices().

  Parameters
  ----------
  word_ind_dict : dict of indices where each search word is found in a text
      Output of function word_indices()
      key: string of search word in regex format
      value: list of indices where that search word was found in the text
  terms_dict : dict of practice/outcome term patterns
      key: name of practice/outcome term
      value: list of patterns for that term; [p0,p1...]
  DEI_contx_list : list of strings
      The strings are the DEI context search words in regex format

  Returns
  -------
  flagged_terms : dict of every instance of all practice/outcome terms
      key: name of practice/outcome term
      value: list of instances with each instance being a list of search words 
      in regex format and the context window used to flag them

      Example: dict with 1 practice term found in 4 instances
      {'program-retain': [[('program', 'retain', 'talent', 'wom(e|a)n'), [10, 10, 10, 30]],
                          [('program', 'retain', 'talent', 'wom(e|a)n'), [10, 10, 10, 30]],
                          [('program', 'retain', 'talent', 'female'), [10, 10, 10, 30]],
                          [('program', 'apprenticeship', 'gender'), [4, 50]]]}
  '''  
  flagged_terms = {}

  # Iterate through all of the terms and their patterns
  for key, value in terms_dict.items():
    is_match = 0
    match_pattern = []
    
    # Iterate through each pattern of a term
    for pattern in value: # pattern = [[[buck1], [buck2]], [a,b]]
      combo_buck = pattern[0] + [DEI_contx_list] # combo_buck = [[buck1], [buck2], [DEI_contx_list]]
      combos = list(itertools.product(*combo_buck)) # List of combos of 1-ea word + 1 DEI context term, per pattern
      
      # Iterate through each possible search word combo for a pattern
      for c in combos: # c = ('lawsuit', 'discriminat', 'wom(e|a)n')
        
        # Collect indices for each search word
        ind_list = [] # List of indices-lists for each word in combo
        for w in c: # w = 'lawsuit'
          if w in word_ind_dict:
            ind_list.append(word_ind_dict[w]) # [[2,56],[],[45,23,12]]

        # Check if ind_list has enough lists (every word in combo is found in text)
        if len(ind_list) == len(c):
          # Check if indices are in range
          combos_inds = list(itertools.product(*ind_list)) # All possible combos of indices: combos_inds = [(24, 53, 20), (24, 53, 28)]

          # Iterate through each combo of indices
          for c_i in combos_inds: 
            
            # if is a 2-bucket pattern, c_i = (24, 53, 20)
            if len(pattern[1]) == 2: # pattern[1] = [a,b] -> rangelist
              subrange = [c_i[0],c_i[1]]
              subrange.sort()
              # if DEI context word is within term phrase buffer zone (24-(b+1) <= 20 <= 53+(b+1))
              # AND search words are within term pattern context window (|24-53|-1 <= a)
              if (subrange[0]-(pattern[1][1]+1) <= c_i[2] <= subrange[1]+(pattern[1][1]+1)) and (abs(c_i[0] - c_i[1])-1 <= pattern[1][0]):
                is_match = 1
                match_pattern.append([c,pattern[1]])
            
            # if is a 3-bucket pattern, c_i = (50, 34, 60, 61)
            elif len(pattern[1]) == 4: # pattern[1] = [a_1_2,a_2_3,a_1_3,b] -> rangelist
              subrange = [c_i[0],c_i[1],c_i[2]]
              subrange.sort()
              # if DEI context word is within term phrase buffer zone (34-(b+1) <= 61 <= 60+(b+1))
              # AND search words are within term pattern context windows (|50-34|-1 <= a_1_2 AND |34-60|-1 <= a_2_3 AND |50-60|-1 <= a_1_3)
              if (subrange[0]-(pattern[1][3]+1) <= c_i[3] <= subrange[2]+(pattern[1][3]+1)) and (abs(c_i[0] - c_i[1])-1 <= pattern[1][0]) and (abs(c_i[1] - c_i[2])-1 <= pattern[1][1]) and (abs(c_i[0] - c_i[2])-1 <= pattern[1][2]):
                is_match = 1
                match_pattern.append([c,pattern[1]])
            
            else:
              print("wrong range length")
              is_match = -1
    
    # If there is at least one instance found for the term, insert the match pattern into flagged_terms dict
    if is_match == 1:
      flagged_terms[key] = match_pattern
    
  return flagged_terms        
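
To make the window arithmetic concrete, here is a minimal, self-contained sketch of the 2-bucket check on one sentence. It is not the pipeline itself: `word_indices_simple` and `in_window_2` are hypothetical helpers written for this example, and the word index is computed with a plain whitespace split rather than nltk's `word_tokenize`, so indices can differ from the real function where punctuation is involved.

```python
import itertools
import re

def word_indices_simple(text, words):
    # Stand-in for word_indices(): counts whitespace-separated tokens before each
    # match (assumption: str.split() is used here instead of nltk's word_tokenize)
    return {w: [len(text[:m.start()].split()) for m in re.finditer(w, text)]
            for w in words}

def in_window_2(c_i, a, b):
    # Same test as the 2-bucket branch of check_cooccur()
    i1, i2, i_dei = c_i
    lo, hi = sorted((i1, i2))
    near_dei = lo - (b + 1) <= i_dei <= hi + (b + 1)
    near_each_other = abs(i1 - i2) - 1 <= a
    return near_dei and near_each_other

text = "the lawsuit filed last year alleges the firm discriminated against women"
combo = ('lawsuit', 'discriminat', 'wom(e|a)n')  # bucket 1, bucket 2, DEI context word

inds = word_indices_simple(text, combo)
print(inds)  # {'lawsuit': [1], 'discriminat': [8], 'wom(e|a)n': [10]}

# Every combination of indices that satisfies both windows (a=15, b=25)
hits = [c_i for c_i in itertools.product(*(inds[w] for w in combo))
        if in_window_2(c_i, a=15, b=25)]
print(hits)  # [(1, 8, 10)]

# Six words separate 'lawsuit' and 'discriminat', so a window of 3 is too tight
print(in_window_2((1, 8, 10), a=3, b=25))  # False
```

The `-1` in the bucket check means `a` counts the words strictly between the two search words, which is why a gap of six words passes with `a=15` and fails with `a=3`.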

## (UPDATE REGULARLY): Term dictionaries and lists
Refer to the [original notebook](https://colab.research.google.com/drive/1TMTydjBmS3cxpAbHm1H3rL_I1keMKbBW?usp=sharing) for the most up-to-date term dictionaries, and replace your `CELL_1` and `CELL_3` with their up-to-date equivalents. You can then run this entire section as is.

*For your information:*
- *practice terms describe something an organization does: managing, hiring, training, setting up programs (typically gerunds)*
- *outcome terms are everything else, i.e. whatever results from those actions*
- [Spreadsheet](https://docs.google.com/spreadsheets/d/1kvx0vdwRB8C9WJ3vALMviU7i4UqXBugtYJ3FsxQROQE/edit?usp=sharing) of terms and relevant info
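
The search words in `CELL_1` and `CELL_3` are regular expressions, and the leading and trailing spaces inside them matter. The short sketch below shows the difference between a space-padded word and a bare stem. It assumes the input text is lowercased before matching, which the all-lowercase patterns suggest but this section does not show; the sample sentence is made up for illustration.

```python
import re

text = "We are hiring women engineers. Recruitment in the shire is slow."
lowered = text.lower()  # assumption: texts are lowercased before matching

# Space-padded pattern: only matches 'hiring' as a standalone word, not inside 'shire'
hire = [m.group(0) for m in re.finditer(' hir(e(s|d)?|ing) ', lowered)]
print(hire)  # [' hiring ']

# Bare stem: matches inside longer words such as 'recruitment'
recruit = [m.group(0) for m in re.finditer('recruit', lowered)]
print(recruit)  # ['recruit']

# Matching is case-sensitive, so the original casing misses 'Recruitment'
recruit_cased = [m.group(0) for m in re.finditer('recruit', text)]
print(recruit_cased)  # []
```

One consequence of the padding is that a space-padded word at the very start or end of a text, or directly next to punctuation, is not matched.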

In [None]:
# CELL_1: Create the dictionaries for the PRACTICE and OUTCOME terms

terms_dict = {}
terms_category = {}

################################################################################
################################################################################
#                        Talent-Attraction-Retention                           #
################################################################################
################################################################################

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                   PRACTICE - Talent-Attraction-Retention                     #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

# Practice: 'compensation-equal'
p0 = create_pattern_2(['(-| )pa(id|y)',' wage(s)? ','compensat'],['disparity','(un)?equal',' same','(un)?fair',' more ', ' less', ' gap '],[7,25])
terms_dict['compensation-equal'] = [p0]
terms_category['compensation-equal'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'programs/initiatives-attract'
p0 = create_pattern_3(['program', 'initiative', 'campaign'],['attract',' hir(e(s|d)?|ing) ', 'recruit', ' f(ind|ound)', '(br(ing(ing|s)?|ought)|dr(aw(ing|s)?|ew)) in', ' entic(e|ing)', 'discover', 'acquir(e|ing)', ' gain', 'collect', 'gather', 'procur(e|ing)'],['talent','skill','leader'],[10,10,10,25])
p1 = create_pattern_2(['program', 'initiative', 'campaign'],[' hir(e(s|d)?|ing) ', 'recruit'],[7,25])
terms_dict['programs/initiatives-attract'] = [p0,p1]
terms_category['programs/initiatives-attract'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'programs/initiatives-retain'
p0 = create_pattern_3(['program', 'initiative', 'campaign'],['retain', 'retention',' keep', 'preserv(ation|e)', 'maintain', ' train', ' hold', 'develop'],['talent','skill','leader'],[10,10,10,25])
p1 = create_pattern_2(['program', 'initiative', 'campaign'],[' mentor', 'apprentice'],[7,25])
p2 = create_pattern_3(['program', 'initiative', 'campaign'],['career','employee','work(er|force)','contractor'],['advanc', 'develop', 'promot(e(d|s)?|ing|ion(s)?) ', ' train', '(present(ed|ing|s)|grant(ed|ing|s)|g(ave|iv(e(s)?|ing))|provid(e(d|s)?|ing)|offer(ed|s)?|ing) opportunit'],[10,10,10,25])
terms_dict['programs/initiatives-retain'] = [p0,p1,p2]
terms_category['programs/initiatives-retain'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'policies/public commitment-attract'
p0 = create_pattern_3(['polic(ies|y)', 'commitment'],['attract',' hir(e(s|d)?|ing) ', 'recruit', ' f(ind|ound)', '(br(ing(ing|s)?|ought)|dr(aw(ing|s)?|ew)) in', ' entic(e|ing)', 'discover', 'acquir(e|ing)', ' gain', 'collect', 'gather', 'procur(e|ing)'],['talent','skill','leader'],[10,10,10,25])
p1 = create_pattern_2(['polic(ies|y)', 'commitment'],[' hir(e(s|d)?|ing) ', 'recruit'],[7,25])
terms_dict['policies/public commitment-attract'] = [p0,p1]
terms_category['policies/public commitment-attract'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'policies/public commitment-retain'
p0 = create_pattern_3(['polic(ies|y)', 'commitment'],['retain', 'retention',' keep', 'preserv(ation|e)', 'maintain', ' train', ' hold', 'develop'],['talent','skill','leader'],[10,10,10,25])
p1 = create_pattern_2(['polic(ies|y)', 'commitment'],[' mentor', 'apprentice'],[7,25])
p2 = create_pattern_3(['polic(ies|y)', 'commitment'],['career','employee','work(er|force)','contractor'],['advanc', 'develop', 'promot(e(d|s)?|ing|ion(s)?) ', ' train', '(present(ed|ing|s)|grant(ed|ing|s)|g(ave|iv(e(s)?|ing))|provid(e(d|s)?|ing)|offer(ed|s)?|ing) opportunit'],[10,10,10,25])
terms_dict['policies/public commitment-retain'] = [p0,p1,p2]
terms_category['policies/public commitment-retain'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'whistleblower protection'
p0 = create_pattern_2(['whistle(-| )?blow'],['protect', 'safeguard', 'preserv(ation|e)', 'shelter', 'shield', 'support', 'promot(e(d|s)?|ing|ion(s)?) ', 'defend(ed|ing|s)? ', ' listen', 'accept', 'bolster', 'assist'],[7,25])
terms_dict['whistleblower protection'] = [p0]
terms_category['whistleblower protection'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'hiring/recruitment'
p0 = create_pattern_2([' hir(e(s|d)?|ing) ', 'recruit'],['different','vari(ety|ous)','divers'],[10,25])
p1 = create_pattern_3(['welcom(e|ing)','embrac','celebrat','proud','support'],['employ', 'build(ing)? a (team|work(force|place)?)'],['different','vari(ety|ous)','divers'],[10,10,10,25])
p2 = create_pattern_2([' hir(e(s|d)?|ing) ', 'recruit', 'attract','(bring|draw(n)?) (in|to)'],['employ', 'work(er|force)', 'contractor', ' position', 'opportunit', 'opening'],[8,25])
terms_dict['hiring/recruitment'] = [p0,p1,p2]
terms_category['hiring/recruitment'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'training-employee-development'
p0 = create_pattern_2(['apprentice', 'mentor'],['develop','advanc','skill','leader','progress','opportunit','promot(ed|ion) ','(rise|climb|progress|move) up', 'support','empower'],[7, 25])
p1 = create_pattern_3([' train'],['career','employee','work(er|force)','contractor'],['develop','advanc','skill','leader','progress','opportunit','promot(ed|ion) ','(rise|climb|progress|move) up', 'support','empower'],[7,7,7,25])
terms_dict['training-employee-development'] = [p0,p1]
terms_category['training-employee-development'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'promotion-employee'
p0 = create_pattern_2(['promot(ed|ion) ', 'advanc', '(rise|climb|progress|move) up', 'career(.*)progress','progress(.*)career'],['employ(ee|ment)', 'work(er|force|( )?place)', ' job ', 'contractor'],[8,25])
terms_dict['promotion-employee'] = [p0]
terms_category['promotion-employee'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'worker union'
p0 = create_pattern_2(['union'],['employee', 'work(er|force|( )?place)', 'contractor',' labo(u)?r'],[5,25])
terms_dict['worker union'] = [p0]
terms_category['worker union'] = ('Talent-Attraction-Retention','PRACTICE')

# Practice: 'worker committee'
p0 = create_pattern_2(['committee'],['employee', 'work(er|force|( )?place)', 'contractor',' labo(u)?r'],[5,25])
terms_dict['worker committee'] = [p0]
terms_category['worker committee'] = ('Talent-Attraction-Retention','PRACTICE')

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                   OUTCOME - Talent-Attraction-Retention                      #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

# Outcome: 'labor shortage'
p0 = create_pattern_2(['shortage'],[' labo(u)?r','work(er|force)','employee','contractor'],[7,25])
p1 = create_pattern_2(['shortage'],['significant','persistent', 'pervasive','critical','massive','severe','sustained','well(-| )(known|documented)'],[10,25])
p2 = create_pattern_2(['shortage'],['opportunit','opening'],[10,25])
p3 = create_pattern_2([' position', 'opportunit','opening',' labo(u)?r','work(er|force)','employee','contractor'],['insufficient','unfilled','empty'],[6,25])
terms_dict['labor shortage'] = [p0,p1,p2,p3]
terms_category['labor shortage'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'skill shortage/gap'
p0 = create_pattern_2(['shortage',' gap ','limited pool'],['((low|un|semi|high)(-|ly | |))?skill','talent'],[7,25])
terms_dict['skill shortage/gap'] = [p0]
terms_category['skill shortage/gap'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'discrimination lawsuit'
p0 = create_pattern_2(['lawsuit', ' sue(d)? '],['discriminat'],[15,25])
terms_dict['discrimination lawsuit'] = [p0]
terms_category['discrimination lawsuit'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'attrition'
p0 = create_pattern_2(['turnover'],[' high', 'disruptive', 'work(er|force)', 'employee', 'contractor',' rate', 'voluntary'],[7,25])
p1 = create_pattern_2(['attrition'],[' high', 'disruptive', 'work(er|force)', 'employee', ' rate','contractor'],[7,25])
p2 = create_pattern_2([' quit', ' leav(e(r(s)?)?|ing) ', ' left '],[' high', ' rate'],[6,25])
terms_dict['attrition'] = [p0,p1,p2]
terms_category['attrition'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'talent retention'
p0 = create_pattern_2(['retain', 'retention',' keep', 'preserv(ation|e)', 'maintain', 'invest'],['talent','skill'],[7,25])
terms_dict['talent retention'] = [p0]
terms_category['talent retention'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'talent attraction'
p0 = create_pattern_2(['attract','(br(ing|ought)|draw(n)?) (in|to)'],['talent','skill'],[7,25])
terms_dict['talent attraction'] = [p0]
terms_category['talent attraction'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'diverse workforce composition'
p0 = create_pattern_2(['(under(-| )?)?represent(ation|ed)', 'demographic','compos(e|ition)', 'make(-| )?up of', ' only', ' few', 'divers'],['(non(-|))?executive', 'director', 'board( member|( )?room)', 'manage(ment|r)','(base|low|mid|senior)(-| )level', 'leader', 'c-suite', 'employee', 'work(er|force|( )?place)', ' ceo ', 'contractor', 'professionals', 'technical'],[7,25])
terms_dict['diverse workforce composition'] = [p0]
terms_category['diverse workforce composition'] = ('Talent-Attraction-Retention', 'OUTCOME')

# Outcome: 'aging workforce'
p0 = create_pattern_2(['aging', ' old(er|est)? '],['employee','work(er|force|( )?place)','contractor'],[7,25])
terms_dict['aging workforce'] = [p0]
terms_category['aging workforce'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'go public'
p0 = create_pattern_2(['whistle(-| )?blow', 'worker','employee','contractor'],['(go(es|ing)|went) public'],[8,25])
p1 = create_pattern_3(['whistle(-| )?blow', 'worker','employee','contractor'],['expos(e|ing)', 'alleg', 'report', ' leak', 'br(ing(ing|s)?|ought) to light', 'disclos(e|ing)', 'uncover', 'unmask', 'document', ' claim', 'complain'],['publicly', '(in |to )(the )?public'],[10,10,20,25])
terms_dict['go public'] = [p0,p1]
terms_category['go public'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'retaliation/reprisal'
p0 = create_pattern_2(['retaliat', 'reprisal'],['employee','work(er|force|( )?place)', 'contractor'],[12,25])
terms_dict['retaliation/reprisal'] = [p0]
terms_category['retaliation/reprisal'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'harassment-employee'
p0 = create_pattern_2(['harass','bull(ie|y)',' torment', ' teas(e|ing)', 'mistreat'],['employee','work(er|force|( )?place)', 'contractor'],[10,25])
terms_dict['harassment-employee'] = [p0]
terms_category['harassment-employee'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'unsafe conditions'
p0 = create_pattern_3(['conditions', 'environment', 'work( )?place'],['work','employee','contractor'],['unsafe', 'dangerous', 'hazard', 'violen(ce|t)', 'abus(e|ive)', ' harm', 'threat', 'precarious'],[10,10,10,25])
terms_dict['unsafe conditions'] = [p0]
terms_category['unsafe conditions'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'inclusive culture'
p0 = create_pattern_2(['inclusive', 'friendly', 'support', 'welcom(e|ing)', 'amicable', 'be heard', 'empower'],['culture', 'work( conditions| environment|( )?place)', 'manage(ment|r)', 'boss ', 'leader'],[7,25])
p1 = create_pattern_3(['conditions', 'environment', 'work( )?place'],['work','employee','contractor'],['inclusive', 'friendly', 'support', 'welcom(e|ing)', 'amicable', 'be heard'],[10,10,10,25])
terms_dict['inclusive culture'] = [p0,p1]
terms_category['inclusive culture'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'toxic culture'
p0 = create_pattern_2(['malpractice', 'violat','non(-|)inclusive','toxic','hostile','(not |un)friendly','(not |un)welcom(e|ing)','cruel','mean','disparag','exclus(ion|ive)', 'dismissive', 'abus(e|ive)', 'threat'],['culture', 'work( conditions| environment|( )?place)', 'manage(ment|r)', 'boss ', 'leader'],[7,25])
p1 = create_pattern_3(['(lack|void) of', ' no(ne|t)? ', 'lacking', 'loss of'],['support', 'communicat', 'protect', 'feedback', 'investigat', 'report', ' record', 'follow(-| )up', 'action', 'anonymity', 'confiden(ce|tiality)','integrity','trust', 'transparen', 'approachab', 'accountab', 'psychological(ly)? safe(ty)?', 'disclos', 'honest'],['culture', 'work( conditions| environment|( )?place)', 'manage(ment|r)', 'boss ', 'leader'],[7,10,15,25])
p2 = create_pattern_3(['fear(ing| of)', 'risk(ing| of)', '(scared|afraid) of'],['(lack of |in)action', 'retaliat', 'dismiss', 'judg(ed|ment)', 'sham(e|ing)', 'authority', 'embarrass', 'blame', 'offen(d|se)', 'terminat', ' fire(d)? ', 'promot(ed|ion) ', 'trouble', 'job (safety|security)', 'perc(eive|eption)', 'reputation', 'dismiss', 'relationship'],['culture', 'work( conditions| environment|( )?place)', 'manage(ment|r)', 'boss ', 'leader'],[7,10,15,25])
terms_dict['toxic culture'] = [p0,p1,p2]
terms_category['toxic culture'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'job satisfaction'
p0 = create_pattern_2(['employ(ee|ment)', 'work( conditions| environment|er|force|( )?place)', ' job', 'contractor'],['satisf(action|ied)', 'morale', 'happy', 'engag(ed|ement)', 'content'],[7,25])
terms_dict['job satisfaction'] = [p0]
terms_category['job satisfaction'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'to be liable'
p0 = create_pattern_3(['liab(ility|le)'],[' (is|are|was|were) ','accept', 'take on', 'assume', 'acknowledge',' hold','affirm','recognize'],['company', 'firm', 'business', 'leader', ' ceo ', 'executive', 'president', ' vp ', 'director', 'chair(-)?(wo)?man','spokes(person|man|woman)'],[10,15,10,25])
terms_dict['to be liable'] = [p0]
terms_category['to be liable'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'trust-employee'
p0 = create_pattern_2(['trust'],['employee', 'work(er|force)', 'contractor'],[7,25])
terms_dict['trust-employee'] = [p0]
terms_category['trust-employee'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'quit/resign'
p0 = create_pattern_2(['resign',' quit', ' leav(e(r(s)?)?|ing) ', ' left ','depart(ed|ing|s)'],['employee', 'work(er|force)', 'contractor'],[7,25])
terms_dict['quit/resign'] = [p0]
terms_category['quit/resign'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'strike/walk-out-employee'
p0 = create_pattern_2(['strik(e|ing)', 'walk( |-)?out','refuse(d)? to work'],['employee', 'work(er|force|( )?place)', 'contractor'],[7,25])
terms_dict['strike/walk-out-employee'] = [p0]
terms_category['strike/walk-out-employee'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'sit-in-employee'
p0 = create_pattern_2(['s(at|it)-in'],['employee', 'work(er|force|( )?place)', 'contractor'],[7,25])
terms_dict['sit-in-employee'] = [p0]
terms_category['sit-in-employee'] = ('Talent-Attraction-Retention','OUTCOME')

# Outcome: 'protest-employee'
p0 = create_pattern_2(['protest', 'demonstrat(e|ion)', 'march', 'picket'],['employee', 'work(er|force|( )?place)', 'contractor'],[7,25])
terms_dict['protest-employee'] = [p0]
terms_category['protest-employee'] = ('Talent-Attraction-Retention','OUTCOME')



################################################################################
################################################################################
#           Product-DMD (Product Design, Marketing & Delivery)                 #
################################################################################
################################################################################

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                           PRACTICE - Product-DMD                             #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                           OUTCOME - Product-DMD                              #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#



################################################################################
################################################################################
#                            Community-Relations                               #
################################################################################
################################################################################

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                       PRACTICE - Community-Relations                         #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                        OUTCOME - Community-Relations                         #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#



################################################################################
################################################################################
#                        Innovation-Risk-Recognition                           #
################################################################################
################################################################################

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                   PRACTICE - Innovation-Risk-Recognition                     #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                   OUTCOME - Innovation-Risk-Recognition                      #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

################################################################################
################################################################################
#                                    Other                                     #
################################################################################
################################################################################

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                               PRACTICE - Other                               #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#

#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#
#                               OUTCOME - Other                                #
#------------------------------------------------------------------------------#
#------------------------------------------------------------------------------#


In [None]:
# Collect the unique search words (across all buckets of all patterns) and the term names
search_words_list = []
terms_list = []
for key, val in terms_dict.items():
  terms_list.append(key)
  for pattern in val:
    bucks_list = pattern[0]
    for bucket in bucks_list:
      for word in bucket:
        search_words_list.append(word)

search_words_list = list(set(search_words_list))
terms_list = list(set(terms_list))
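
The loop above implies a nesting of term name, then patterns, then buckets, then words. A minimal sketch of that nesting follows, using a made-up `toy_terms_dict`. The entries, the helper name `flatten_search_words`, and the second tuple element are all hypothetical placeholders; only `pattern[0]` is read, mirroring the cell above.

```python
# Hypothetical miniature of terms_dict; the real entries are defined earlier in the notebook.
# Each pattern is a tuple whose first element is a list of buckets (lists of words).
# The second element (10) is a placeholder that this sketch never reads.
toy_terms_dict = {
    "hiring/recruitment": [
        ([["hire", "recruit"], ["candidate", "applicant"]], 10),
    ],
    "attrition": [
        ([["attrition", "turnover"]], 10),
    ],
}

def flatten_search_words(terms_dict):
    """Return the sorted, de-duplicated words found in every bucket."""
    words = set()
    for patterns in terms_dict.values():
        for pattern in patterns:
            for bucket in pattern[0]:
                words.update(bucket)
    return sorted(words)

toy_search_words = flatten_search_words(toy_terms_dict)
toy_terms = sorted(toy_terms_dict)
print(toy_search_words)
print(toy_terms)
```

Sorting instead of calling `list(set(...))` gives a stable order between runs, which can make the column order of the final CSV reproducible.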

In [None]:
# CELL_3: Create the dictionary for the DEI context terms
DEI_context_dict = {
    'ethnic': 'ethnic',
    '(disab(i|l)| abilit(y|ies))': '(dis)ability',
    '(marital status|married)': 'marital status',
    'bias': 'bias',
    'religio': 'religio',
    'inclusiv': 'inclusive',
    'divers': 'diverse',
    #' access': 'access',
    ' race ': 'race',
    'racism': 'race',
    'racist': 'race',
    'racial': 'race',
    'bipoc': 'race',
    'people of colo(u)?r': 'race',
    'blackface': 'race',
    'black': 'race',
    'white': 'race',
    'asian': 'race',
    'latino': 'race',
    'hispanic':'ethnic',
    '(indigenous|native(s| (america|population|communit|govern|reservation))|american indian|first nations|trib(al|e)|aborigin)': 'race',
    '(environmental human rights defender|ehrd)': 'advocate',
    'working (famil|parent|mother|mom|father|dad)': 'familial status',
    'veteran': 'military status',
    '(service|guard|reserve) member': 'military status',
    'minorities': 'minorit',
    'minority group': 'minorit',
    'lgbt': 'LGBT', 
    'sexual orientation': 'LGBT',
    'gender identit': 'LGBT',
    ' gay': 'LGBT',
    'lesbian': 'LGBT',
    'bisexual': 'LGBT',
    'transgender': 'LGBT',
    'queer': 'LGBT',
    'asexual': 'LGBT',
    '(homo|trans)phobia': 'LGBT',
    'non(-)?binary': 'LGBT',
    'wom(e|a)n':'gender-M/F',
    'female':'gender-M/F',
    'gender':'gender-M/F',
    'based on sex':'gender-M/F',
    'pregnant':'gender-M/F',
    'on the basis of sex':'gender-M/F',
    'maternity leave':'gender-M/F',
    'sexist':'gender-M/F',
    'sex discrimination':'gender-M/F',
    'age discrimin': 'age',
    'age bias': 'age',
    'ageism': 'age',
    'average age': 'age',
    ' old(er)? ': 'age',
    'youth': 'youth',
    'young': 'youth',
    'next generation': 'youth',
    'nationalit': 'nationality',
    'national origin': 'nationality',
    'foreign nationals': 'nationality',
    'under(-)?represented': 'underrepresented',
    'migrant': 'migrant',
    'foreigner': 'migrant',
    ' visa ': 'migrant',
    'citizen': 'migrant',
    'foreign worker': 'migrant',
    'entry(-| )level': 'education/skill level',
    'education level': 'education/skill level',
    '(college|undergraduate|graduate) degree': 'education/skill level',
    'high school diploma': 'education/skill level',
    '(low|un|semi|high)(-|ly | |)skill': 'education/skill level',
    'economic status': 'economic status',
    'economic class': 'economic status',
    'low(-| )income': 'economic status', 
    'high(-| )income': 'economic status',
    'impoverish': 'economic status',
    'poverty': 'economic status',
    'middle(-| )class': 'economic status',
    'working(-| )class': 'economic status',
    'criminal history': 'criminal history',
    'felon': 'criminal history',
    'background check': 'criminal history',
    '(convict(s)? |formerly convicted|convicted formerly)': 'criminal history',
    'factory work': 'factory work',
    'supplier contract':'supplier contract'
}

DEI_context_list = []
for key, val in DEI_context_dict.items():
  DEI_context_list.append(key)

DEI_context_list = list(set(DEI_context_list))
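
To illustrate how the keys and values of `DEI_context_dict` relate, here is a small sketch that treats each key as a regular expression and each value as the category it reports. It assumes the keys are applied with `re.search` to lowercased text; the actual matching is done by `word_indices`, defined elsewhere in the notebook, and may differ. `toy_context` and `context_categories` are hypothetical names.

```python
import re

# A few entries copied from DEI_context_dict; the helper below is illustrative only.
toy_context = {
    'divers': 'diverse',
    'wom(e|a)n': 'gender-M/F',
    ' race ': 'race',
    'racial': 'race',
}

def context_categories(text, context_dict):
    """Return the sorted categories whose pattern occurs in the lowercased text."""
    lowered = text.lower()
    return sorted({
        category
        for pattern, category in context_dict.items()
        if re.search(pattern, lowered)
    })

found = context_categories("The board values Diversity and promotes women.", toy_context)
print(found)

# The spaces around ' race ' keep it from firing inside longer words such as "embrace".
not_found = context_categories("We embrace change.", toy_context)
print(not_found)
```

Under this assumption, stem-style keys such as `'divers'` match "diverse", "diversity", and "diversified" alike, while space-padded keys such as `' race '` only match the standalone word.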

## (REPLACE ENTIRELY): Prepare your data
- *This is TVL-specific work - gathering file names for all industries.*

In [None]:
#industry_files = {}  # "Industry": ["file1.csv", "file2.csv", ...]

#tvl_raw_dir = 'tvl_downloads_raw/'  # Create shortcut for this folder in your personal drive: https://drive.google.com/drive/folders/1S9MvX0UI7hfrSxYi3DwnxD3QL7mK_s29?usp=sharing
#gic_dir = 'Employee Engagement, Diversity, & Inclusion/'
#file_prefix = 'Truvalue_Spotlights_'

# Get all industries for which we have raw outputs
#for raw_output_file_name in os.listdir(tvl_raw_dir + gic_dir):
    #if not raw_output_file_name.startswith(file_prefix):
        #continue
    #industry_file_name = raw_output_file_name[len(file_prefix):]
    #next_underscore_idx = industry_file_name.find('_')
    #industry_name = industry_file_name[:next_underscore_idx]
    #industry_files[industry_name] = []

# Get all raw file names by industry
#for industry_name in industry_files:
    #industry_file_names = [file_name for file_name in os.listdir(tvl_raw_dir + gic_dir) if industry_name == file_name.split("_")[2]]
    #industry_files[industry_name] += industry_file_names

In [None]:
# check (list of industries)
#industry_list = sorted(list(industry_files.keys()))

#print(industry_list)
#print("The number of industries:", len(industry_list))

In [None]:
# check that each industry has six file names
#good_ind_filecount = 0

#for industry in industry_files:

    #if len(industry_files[industry]) != 6:
        #print(industry, "does not have 6 files.")
    #else:
        #good_ind_filecount += 1
        
#if good_ind_filecount == len(industry_files):
    #print("All industries are good to go!")

All industries are good to go!


## (REPLACE ENTIRELY): Create the input dataframe
The dataframe should be assigned to the name `all_industries_events_master_df`. The only requirement is that the text you want to analyze sits in one or more columns, with each row holding one section of the data. Sections should be large enough to capture terms in their correct context, but not so large that runtime becomes expensive.

*For TVL:*
- *Each row is a different article*
- *Each column is a feature of the article (industry type, URL, date, text)*
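
For a corpus that is not already split into article-sized rows, one possible way to section the text is by a fixed word count. The sketch below is only an illustration: `chunk_words`, `raw_docs`, and the `max_words` value are assumptions and not part of the original pipeline.

```python
import pandas as pd

def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Hypothetical raw corpus: one long string per company.
raw_docs = {
    "Company A": "one two three four five six seven eight nine ten eleven twelve",
    "Company B": "alpha beta",
}

# One row per chunk; max_words is tiny here only to keep the example readable.
rows = [
    {"Company": company, "Text": chunk}
    for company, text in raw_docs.items()
    for chunk in chunk_words(text, max_words=5)
]
all_industries_events_master_df = pd.DataFrame(rows, columns=["Company", "Text"])
print(all_industries_events_master_df)
```

A fixed word count can cut a sentence in half, so splitting on paragraphs or sentences may preserve context better when the source text has reliable boundaries.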

In [None]:
#industry_all_events_dict = {}

# Get all events in one df, by industry
#for industry_name, file_names_lst in sorted(industry_files.items()):
    #print(industry_name)
    #for file_name in file_names_lst:
        #print('\t' + file_name)
        #sub_df = pd.read_csv(tvl_raw_dir + gic_dir + file_name)
        #sub_df.dropna(axis=0, how='all', inplace=True)
        #if sub_df.empty:
            #continue
        #all_are_scm = list(sub_df['Category'].unique()) == [gic_dir[:-1]]
        #if not all_are_scm:
            #print()
            #print(f'NOT ALL ARTICLES ARE {gic_dir[:-1]}!!!')
            #print(file_name)
            #print(f'NOT ALL ARTICLES ARE {gic_dir[:-1]}!!!')
            #print()
        #if industry_name not in industry_all_events_dict:
            #industry_all_events_dict[industry_name] = sub_df
        #else:
            #industry_all_events_dict[industry_name] = pd.concat([industry_all_events_dict[industry_name], sub_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
    
    # Create some columns with cleaned text/dates
    #if industry_name in industry_all_events_dict:
      #industry_df = industry_all_events_dict[industry_name].copy()
      #print(f'Before dropping dupes: {industry_df.shape}')

      # Drop duplicates for combos of company + TVL ID + article (repeating article pertaining to the same company)
          # Reasoning: TVL ID represents an identifier of a Spotlight Event for ONE company. 
          # A Spotlight Event may be made up of several articles. 
          # So we do not yet want to drop all the articles that comprise a single TVL ID by 
          # doing a hard drop_duplicates on JUST TVL ID. So, we are dropping on a combination of 
          # columns, to ensure that we only drop repeating articles for the same company. 
          # Repeating articles may occur due to potential overlap of articles from the CSVs.
      #drop_dupes_cols = ['Company', 'TVL ID', 'Primary Article Spotlight Headline','Primary Article Bullet Points', 'Spotlight Start Date']
      #industry_df = industry_df.drop_duplicates(drop_dupes_cols, keep='first')
      #print(f'After dropping dupes: {industry_df.shape}')
      #industry_df['INDUSTRY'] = industry_name
      #industry_df = industry_df[['INDUSTRY', 'Company', 'TVL ID', 'Category', 'Primary Article Spotlight Headline',
        #'Primary Article Bullet Points', 'Primary Article Source',
        #'Primary Article URL Link', 'Spotlight Start Date',
        #'Spotlight End Date', 'Spotlight Volume']]
    
      #industry_df['headline_lower'] = industry_df['Primary Article Spotlight Headline'].str.lower()
      #industry_df['bullet_pts_lower'] = industry_df['Primary Article Bullet Points'].str.lower()
      #industry_df['date'] = industry_df['Spotlight Start Date'].apply(lambda s_date: datetime.datetime.strptime(s_date, '%m/%d/%Y'))
      #industry_df['year'] = industry_df['date'].dt.year

      #industry_all_events_dict[industry_name] = industry_df

In [None]:
#all_industries_events_master_df = pd.concat(
    #[industry_df for industry_name, industry_df in industry_all_events_dict.items()], ignore_index=True)

#all_industries_events_master_df.head()

INPUT DATAFRAME

In [None]:
company_data_df=pd.read_csv("business_report_df.csv")
company_data_df.drop(company_data_df.columns[0], axis=1, inplace=True)
company_data_df["Text"]=company_data_df["Text"].apply(str)
company_data_df.head(10)

Unnamed: 0,Company,Text
0,Asiana Airlines,"아시아나항공 / 2021.03.31 [Correction] Business Report VIII. Matters concerning executives and employees, etc. 1. Current status of executives and employees go. Executives (Base dat e: December 31, 2020 )"
1,Asiana Airlines,"me. Current status of candidates for appointment and dismissal of regist ered executives (Base dat e: December 31, 2020"
2,Asiana Airlines,"all. Status of employees, etc. (Base dat e: December 31, 2020 ) (In millions of Korean won)"
3,Asiana Airlines,"Note 1) Non-executive employees on December 31, 2020, excluding executives, overs eas local employees, and foreign flight crew/cabin crew For workers whose working hours are short compared to the prescribed working hours of Im Note 5) Non-affiliated workers are workers who are dispatched, serviced, or contract ed by the employer 'in the workplace' of the employer who is obligated to disclose La. Remuneration of unregistered executives (Base dat e: December 31, 2020 ) (Unit: 1,000 won) Note) The total annual salary is based on the wage and salary income in the statemen t of wage and salary payment submitted to the competent tax office in accordance wit h Article 20 of the Income Tax Act. 2. Remuneration of executives, etc. <Status of remuneration for all directors and auditors> (1) Amount approved by the general meeting of shareholders (Unit: 1,000 won) Note 1) The above registered executives (5) are composed of 2 inside directors and 3 outside directors (2) Amount of remuneration paid (2-1) All directors and auditors (Unit: 1,000 won) (2-2) by type (Unit: 1,000 won)"
4,Asiana Airlines,"Note 1) Enter the number of registered directors only for inside directors. Note 2) All 3 members of our audit committee are outside directors . Note 3) The total amount of remuneration is the cumulative amount paid until the end of the current period reflecting all rem uneration (earned and retired income, etc.) of retired/new directors during the discl osure period. Calculated by dividing by * 5) The Company did not grant stock options <Status of individual remuneration for directors and auditors with a remuneration amount of 500 million won or more> (1) Individual remuneration amount (Unit: 1,000 won) (2) Calculation standards and methods - None (Unit: 1,000 won) <Status of remuneration for the top 5 individuals out of 500 million won or more paid in re muneration> (1) Individual remuneration amount (Unit: 1,000 won)"
5,Asiana Airlines,"Note) Total remuneration includes retirement income (2) Calculation standards and methods (Unit: 1,000 won)"
6,Busan Bank,"부산은행 / 2021.03.31 Business Report VIII. Matters concerning executives and employees, etc. 1. Current status of executives and employees go. Executives (Base dat e: December 31, 2020 ) (Unit: sh"
7,Busan Bank,"Note 2) Average remuneration per person: Total remuneration (including executives who retired in 2020) / annualized number of people - by type (In millions of Korean won) Note 1) The total amount of remuneration is based on the actual payment year regar dless of the year in which the performance pay is attributed. Note 2) Including the number and remuneration of 2 newly appointed registered dir ectors (2 outside directors) and 2 retired registered directors (2 outside directors) fro m the start date of the business year to the base date of preparation of public docume nts (C) Remuneration payment standards for directors and auditors remuneration system The remuneration system of executive directors and standing auditors is divided into fixed compensation, basic salary, activity allowance, and variable compensation, perf ormance salary. Based on this, the payment is divided into cash and stock price-linke d cash compensation, and the remuneration system for outside directors is paid in th e form of basic annual salary and meeting participation allowance. Short-term incentives are paid in the amount of 40-60% cash compensation and 60- 40% stock price-related cash compensation. Cash compensation is paid immediately in the evaluation year, and stock price-related cash compensation is distributed equa"
8,Busan Bank,"Note 1) The expiration date of the term of office of a registered director shall be the date of the regular general meeting of shareholders in the year of that term (excluding representative director) Note 2) Changes in executives after the settlement date - Registered executives * Retirement: Bin Dae-in (term expired), Jang Hyeon-gi (term expired), Oh Jae-chan (term expired), Choi Kyung-su (resigned midway) * Appointments: <Newly appointed> Gam-Chan Ahn, Jo Seong-Rae, Hoe-Yong Kim, Su-Hee Kim <Interim> Young-Jae Kim (3 consecutive appointments), Jong-Gyu Park (1 consecutive appointment) - Unregistered executives * Retirement: Bang Seong-bin (traditional landlord), Jang Jong-ho (expired), Roh Jong-geun (expired), Park Il-yong (expired), Hwang Myeong-sik (expired) * Appointments: <Newly appointed> Choi Woo-hyeong, Heo Young-seon, Jung Young-jun, Park Seon-ho <interim position> Kang Moon-seong (promotion), Park Kyung-hee, Park Myeong-cheol ※ Current status of newly appointed executives between the date of filing of the disclosure documents (December 31, 2020) and the su mission date of disclosure documents (March 31, 2021) (Unit: sh"
9,Busan Bank,"Note) The expiration date of the term of office of a registered director shall be the date of the regular general meeting of shareholders in the year of that term (excluding t epresentative director) me. employee status (Base dat e: December 31, 2020 ) (In millions of Korean won) Note) Excluding registered executives and overseas local employees - Current status of remuneration for unregistered executives (Base dat e: December 31, 2020 ) (In millions of Korean won) Note) Including performance pay for management in March 2020 (paid once a year) 2. Remuneration of executives, etc. go. Remuneration status of directors, auditors, etc. (1) Total remuneration for directors and auditors <Status of remuneration for all directors and auditors> (A) Amount approved by the general meeting of shareholders (In millions of Korean won) (B) Amount of remuneration paid - All directors and auditors (In millions of Korean won) Note 1) Total remuneration: The total amount of remuneration for executives (direct ors and auditors) paid from January to December 2020"


## (SLIGHT MODIFICATION): Create output dataframe w/ indicator and summary columns
- MODIFY-1: Change the `input_string` assignment so that it concatenates every column in your dataframe that holds text data. Call `.lower()` on each one to remove uppercase characters.
- MODIFY-2: Depending on the document-information columns in your input dataframe (`all_industries_events_master_df`, named `company_data_df` in the cells below), you may want to concatenate the indicator and summary columns in `ind_df` with those input columns in a different order. Assign the dataframe of combined columns to `full_master_df`.
- MODIFY-3: Insert the `ANY_*` indicator columns for practice term, outcome term, DEI context term, and practice-outcome term co-occurrence into `full_master_df` at the column indices you prefer for the final CSV.
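
As a rough illustration of MODIFY-1 and MODIFY-3, the sketch below builds a lowercased input string from several text columns and inserts one indicator column at a chosen position. The toy dataframes, the `Headline` and `Body` columns, the helper `build_input_string`, and the exact column name `ANY_PRACTICE` are assumptions made for this example.

```python
import pandas as pd

# MODIFY-1 sketch: join several text columns into one lowercased string per row.
toy_df = pd.DataFrame({
    "Headline": ["Firm Expands Hiring", None],
    "Body": ["New recruits welcomed.", "Union talks continue."],
})
text_columns = ["Headline", "Body"]  # hypothetical text columns

def build_input_string(row, columns):
    """Lowercase and join the non-missing text columns of one row."""
    parts = [str(row[c]).lower() for c in columns if pd.notna(row[c])]
    return " ".join(parts)

toy_df["input_string"] = toy_df.apply(
    lambda row: build_input_string(row, text_columns), axis=1
)
print(toy_df["input_string"].tolist())

# MODIFY-3 sketch: derive an ANY_* indicator from a summary column and
# place it at a chosen column index with DataFrame.insert.
toy_full = pd.DataFrame({
    "Text": ["firm expands hiring", "union talks continue"],
    "PRACTICE_TERMS_FOUND": ["hiring/recruitment (...)", ""],
})
any_practice = (toy_full["PRACTICE_TERMS_FOUND"] != "").astype(int)
toy_full.insert(1, "ANY_PRACTICE", any_practice)
print(toy_full.columns.tolist())
```

Skipping missing values before calling `.lower()` avoids an `AttributeError` on `None` or `NaN` cells, which is one reason the notebook casts `Text` to `str` when loading the data.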

In [None]:
def check_columns(row):
    input_string = row['Text'].lower() #+ row['Primary Article Bullet Points'].lower() #MODIFY-1: replace w/ the column names that hold your text data
    word_index_dict = word_indices(input_string, search_words_list, DEI_context_list)
    flag_term_dict = check_cooccur(word_index_dict, terms_dict, DEI_context_list)
    
    flagged_terms = []
    OUTCOME_list = ''
    PRACTICE_list = ''
    TAR_list = ''
    PDMD_list = ''
    CR_list = ''
    IRR_list = ''
    OTHER_list = ''
    DEI_terms_list = ''
    
    TAR_bool = 0
    PDMD_bool = 0
    CR_bool = 0
    IRR_bool = 0
    OTHER_bool = 0

    add_columns = ['PRACTICE_TERMS_FOUND','OUTCOME_TERMS_FOUND','DEI-CONTEXT_TERMS_FOUND','TAR_TERMS_FOUND','PDMD_TERMS_FOUND','CR_TERMS_FOUND','IRR_TERMS_FOUND','OTHER_TERMS_FOUND','TAR_ind','PDMD_ind','CR_ind','IRR_ind','OTHER_ind']
    series_columns = []
    for term in terms_list:
      indicator_col_name = "{}_{}_{}".format(term, terms_category[term][1], terms_category[term][0])
      add_columns.append(indicator_col_name)
      series_columns.append(term)
    for dt in DEI_context_list:
      indicator_col_name = "{}_{}_{}".format(dt, 'DEI-CONTEXT', DEI_context_dict[dt])
      add_columns.append(indicator_col_name)
      series_columns.append(dt)

    for key, value in flag_term_dict.items():
      flagged_terms.append(key)
      
      value_unique_str = list(set([str(v) for v in value]))
      # The DEI context term is the last element of each instance's first entry
      DEI_unique = list(set([instance[0][-1] for instance in value]))
      flagged_terms = flagged_terms + DEI_unique

      if terms_category[key][1] == 'OUTCOME':
        OUTCOME_list = OUTCOME_list + key + ' ('+ terms_category[key][0] + '): '+ '\n'.join(value_unique_str) + '\n\n'
      elif terms_category[key][1] == 'PRACTICE':
        PRACTICE_list = PRACTICE_list + key + ' ('+ terms_category[key][0] + '): '+ '\n'.join(value_unique_str) + '\n\n'

      if terms_category[key][0] == 'Talent-Attraction-Retention':
        TAR_list = TAR_list + key + ' ('+ terms_category[key][1] + '): '+ '\n'.join(value_unique_str) + '\n\n'
        TAR_bool = 1
      elif terms_category[key][0] == 'Product-DMD':
        PDMD_list = PDMD_list + key + ' ('+ terms_category[key][1] + '): '+ '\n'.join(value_unique_str) + '\n\n'
        PDMD_bool = 1
      elif terms_category[key][0] == 'Community-Relations':
        CR_list = CR_list + key + ' ('+ terms_category[key][1] + '): '+ '\n'.join(value_unique_str) + '\n\n'
        CR_bool = 1
      elif terms_category[key][0] == 'Innovation-Risk-Recognition':
        IRR_list = IRR_list + key + ' ('+ terms_category[key][1] + '): '+ '\n'.join(value_unique_str) + '\n\n'
        IRR_bool = 1
      elif terms_category[key][0] == 'Other':
        OTHER_list = OTHER_list + key + ' ('+ terms_category[key][1] + '): '+ '\n'.join(value_unique_str) + '\n\n'
        OTHER_bool = 1

      for DEI_word in DEI_unique:
        DEI_terms_list = DEI_terms_list + DEI_word + ' ('+ DEI_context_dict[DEI_word] + ') '+ ' ['+ key + '], \n'
    
    flagged_terms_ind = []
    for s in series_columns:
      if s in flagged_terms:
        flagged_terms_ind.append(1)
      else:
        flagged_terms_ind.append(0)
    
    pre_list = [PRACTICE_list, OUTCOME_list, DEI_terms_list, TAR_list, PDMD_list, CR_list, IRR_list, OTHER_list, TAR_bool, PDMD_bool, CR_bool, IRR_bool, OTHER_bool]
    series_data = pre_list + flagged_terms_ind
    final_series = pd.Series(data=series_data, index =add_columns)
    
    return final_series

In [None]:
tqdm.pandas()
ind_df = company_data_df.progress_apply(check_columns, axis=1)


100%|██████████| 17586/17586 [1:10:54<00:00,  4.13it/s]


In [None]:
ind_df.head()

Unnamed: 0,PRACTICE_TERMS_FOUND,OUTCOME_TERMS_FOUND,DEI-CONTEXT_TERMS_FOUND,TAR_TERMS_FOUND,PDMD_TERMS_FOUND,CR_TERMS_FOUND,IRR_TERMS_FOUND,OTHER_TERMS_FOUND,TAR_ind,PDMD_ind,CR_ind,IRR_ind,OTHER_ind,policies/public commitment-attract_PRACTICE_Talent-Attraction-Retention,diverse workforce composition_OUTCOME_Talent-Attraction-Retention,worker union_PRACTICE_Talent-Attraction-Retention,whistleblower protection_PRACTICE_Talent-Attraction-Retention,compensation-equal_PRACTICE_Talent-Attraction-Retention,programs/initiatives-retain_PRACTICE_Talent-Attraction-Retention,talent retention_OUTCOME_Talent-Attraction-Retention,training-employee-development_PRACTICE_Talent-Attraction-Retention,aging workforce_OUTCOME_Talent-Attraction-Retention,quit/resign_OUTCOME_Talent-Attraction-Retention,retaliation/reprisal_OUTCOME_Talent-Attraction-Retention,unsafe conditions_OUTCOME_Talent-Attraction-Retention,harassment-employee_OUTCOME_Talent-Attraction-Retention,talent attraction_OUTCOME_Talent-Attraction-Retention,promotion-employee_PRACTICE_Talent-Attraction-Retention,policies/public commitment-retain_PRACTICE_Talent-Attraction-Retention,hiring/recruitment_PRACTICE_Talent-Attraction-Retention,trust-employee_OUTCOME_Talent-Attraction-Retention,worker committee_PRACTICE_Talent-Attraction-Retention,go public_OUTCOME_Talent-Attraction-Retention,protest-employee_OUTCOME_Talent-Attraction-Retention,job satisfaction_OUTCOME_Talent-Attraction-Retention,skill shortage/gap_OUTCOME_Talent-Attraction-Retention,sit-in-employee_OUTCOME_Talent-Attraction-Retention,to be liable_OUTCOME_Talent-Attraction-Retention,programs/initiatives-attract_PRACTICE_Talent-Attraction-Retention,toxic culture_OUTCOME_Talent-Attraction-Retention,inclusive culture_OUTCOME_Talent-Attraction-Retention,attrition_OUTCOME_Talent-Attraction-Retention,labor shortage_OUTCOME_Talent-Attraction-Retention,strike/walk-out-employee_OUTCOME_Talent-Attraction-Retention,discrimination lawsuit_OUTCOME_Talent-Attraction-Retention,next generation_DEI-CONTEXT_youth,working(-| )class_DEI-CONTEXT_economic status,entry(-| )level_DEI-CONTEXT_education/skill level,(environmental human rights defender|ehrd)_DEI-CONTEXT_advocate,racist_DEI-CONTEXT_race,background check_DEI-CONTEXT_criminal history,(convict(s)? |formerly convicted|convicted formerly)_DEI-CONTEXT_criminal history,visa _DEI-CONTEXT_migrant,religio_DEI-CONTEXT_religio,ageism_DEI-CONTEXT_age,bipoc_DEI-CONTEXT_race,low(-| )income_DEI-CONTEXT_economic status,(low|un|semi|high)(-|ly | |)skill_DEI-CONTEXT_education/skill level,divers_DEI-CONTEXT_diverse,based on sex_DEI-CONTEXT_gender-M/F,sexual orientation_DEI-CONTEXT_LGBT,citizen_DEI-CONTEXT_migrant,migrant_DEI-CONTEXT_migrant,(disab(i|l)| abilit(y|ies))_DEI-CONTEXT_(dis)ability,queer_DEI-CONTEXT_LGBT,gay_DEI-CONTEXT_LGBT,minority group_DEI-CONTEXT_minorit,age bias_DEI-CONTEXT_age,(service|guard|reserve) member_DEI-CONTEXT_military status,hispanic_DEI-CONTEXT_ethnic,criminal history_DEI-CONTEXT_criminal history,gender identit_DEI-CONTEXT_LGBT,youth_DEI-CONTEXT_youth,black_DEI-CONTEXT_race,factory work_DEI-CONTEXT_factory work,(college|undergraduate|graduate) degree_DEI-CONTEXT_education/skill level,felon_DEI-CONTEXT_criminal history,economic status_DEI-CONTEXT_economic status,young_DEI-CONTEXT_youth,age discrimin_DEI-CONTEXT_age,under(-)?represented_DEI-CONTEXT_underrepresented,minorities_DEI-CONTEXT_minorit,high school diploma_DEI-CONTEXT_education/skill level,working (famil|parent|mother|mom|father|dad)_DEI-CONTEXT_familial status,asian_DEI-CONTEXT_race,racism_DEI-CONTEXT_race,impoverish_DEI-CONTEXT_economic status,transgender_DEI-CONTEXT_LGBT,bisexual_DEI-CONTEXT_LGBT,foreign worker_DEI-CONTEXT_migrant,middle(-| )class_DEI-CONTEXT_economic status,(indigenous|native(s| (america|population|communit|govern|reservation))|american indian|first nations|trib(al|e)|aborigin)_DEI-CONTEXT_race,economic class_DEI-CONTEXT_economic status,veteran_DEI-CONTEXT_military status,average age_DEI-CONTEXT_age,foreign nationals_DEI-CONTEXT_nationality,nationalit_DEI-CONTEXT_nationality,white_DEI-CONTEXT_race,old(er)? _DEI-CONTEXT_age,racial_DEI-CONTEXT_race,high(-| )income_DEI-CONTEXT_economic status,poverty_DEI-CONTEXT_economic status,lesbian_DEI-CONTEXT_LGBT,sexist_DEI-CONTEXT_gender-M/F,non(-)?binary_DEI-CONTEXT_LGBT,bias_DEI-CONTEXT_bias,inclusiv_DEI-CONTEXT_inclusive,wom(e|a)n_DEI-CONTEXT_gender-M/F,foreigner_DEI-CONTEXT_migrant,ethnic_DEI-CONTEXT_ethnic,gender_DEI-CONTEXT_gender-M/F,people of colo(u)?r_DEI-CONTEXT_race,national origin_DEI-CONTEXT_nationality,maternity leave_DEI-CONTEXT_gender-M/F,latino_DEI-CONTEXT_race,race _DEI-CONTEXT_race,sex discrimination_DEI-CONTEXT_gender-M/F,supplier contract_DEI-CONTEXT_supplier contract,blackface_DEI-CONTEXT_race,(marital status|married)_DEI-CONTEXT_marital status,female_DEI-CONTEXT_gender-M/F,asexual_DEI-CONTEXT_LGBT,lgbt_DEI-CONTEXT_LGBT,pregnant_DEI-CONTEXT_gender-M/F,on the basis of sex_DEI-CONTEXT_gender-M/F,(homo|trans)phobia_DEI-CONTEXT_LGBT,education level_DEI-CONTEXT_education/skill level
0,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Put all the columns in one master df

#MODIFY-2: change the indices depending on what column order you want for the final csv
left1 = company_data_df[company_data_df.columns[0:4]]
left2 = ind_df[ind_df.columns[0:8]]
right2 = ind_df[ind_df.columns[8:]]
full_master_df = pd.concat([left1, left2, right2], axis=1)
full_master_df.head()



Unnamed: 0,Company,Text,PRACTICE_TERMS_FOUND,OUTCOME_TERMS_FOUND,DEI-CONTEXT_TERMS_FOUND,TAR_TERMS_FOUND,PDMD_TERMS_FOUND,CR_TERMS_FOUND,IRR_TERMS_FOUND,OTHER_TERMS_FOUND,TAR_ind,PDMD_ind,CR_ind,IRR_ind,OTHER_ind,policies/public commitment-attract_PRACTICE_Talent-Attraction-Retention,diverse workforce composition_OUTCOME_Talent-Attraction-Retention,worker union_PRACTICE_Talent-Attraction-Retention,whistleblower protection_PRACTICE_Talent-Attraction-Retention,compensation-equal_PRACTICE_Talent-Attraction-Retention,programs/initiatives-retain_PRACTICE_Talent-Attraction-Retention,talent retention_OUTCOME_Talent-Attraction-Retention,training-employee-development_PRACTICE_Talent-Attraction-Retention,aging workforce_OUTCOME_Talent-Attraction-Retention,quit/resign_OUTCOME_Talent-Attraction-Retention,retaliation/reprisal_OUTCOME_Talent-Attraction-Retention,unsafe conditions_OUTCOME_Talent-Attraction-Retention,harassment-employee_OUTCOME_Talent-Attraction-Retention,talent attraction_OUTCOME_Talent-Attraction-Retention,promotion-employee_PRACTICE_Talent-Attraction-Retention,policies/public commitment-retain_PRACTICE_Talent-Attraction-Retention,hiring/recruitment_PRACTICE_Talent-Attraction-Retention,trust-employee_OUTCOME_Talent-Attraction-Retention,worker committee_PRACTICE_Talent-Attraction-Retention,go public_OUTCOME_Talent-Attraction-Retention,protest-employee_OUTCOME_Talent-Attraction-Retention,job satisfaction_OUTCOME_Talent-Attraction-Retention,skill shortage/gap_OUTCOME_Talent-Attraction-Retention,sit-in-employee_OUTCOME_Talent-Attraction-Retention,to be liable_OUTCOME_Talent-Attraction-Retention,programs/initiatives-attract_PRACTICE_Talent-Attraction-Retention,toxic culture_OUTCOME_Talent-Attraction-Retention,inclusive culture_OUTCOME_Talent-Attraction-Retention,attrition_OUTCOME_Talent-Attraction-Retention,labor shortage_OUTCOME_Talent-Attraction-Retention,strike/walk-out-employee_OUTCOME_Talent-Attraction-Retention,discrimination lawsuit_OUTCOME_Talent-Attraction-Retention,next generation_DEI-CONTEXT_youth,working(-| )class_DEI-CONTEXT_economic status,entry(-| )level_DEI-CONTEXT_education/skill level,(environmental human rights defender|ehrd)_DEI-CONTEXT_advocate,racist_DEI-CONTEXT_race,background check_DEI-CONTEXT_criminal history,(convict(s)? |formerly convicted|convicted formerly)_DEI-CONTEXT_criminal history,visa _DEI-CONTEXT_migrant,religio_DEI-CONTEXT_religio,ageism_DEI-CONTEXT_age,bipoc_DEI-CONTEXT_race,low(-| )income_DEI-CONTEXT_economic status,(low|un|semi|high)(-|ly | |)skill_DEI-CONTEXT_education/skill level,divers_DEI-CONTEXT_diverse,based on sex_DEI-CONTEXT_gender-M/F,sexual orientation_DEI-CONTEXT_LGBT,citizen_DEI-CONTEXT_migrant,migrant_DEI-CONTEXT_migrant,(disab(i|l)| abilit(y|ies))_DEI-CONTEXT_(dis)ability,queer_DEI-CONTEXT_LGBT,gay_DEI-CONTEXT_LGBT,minority group_DEI-CONTEXT_minorit,age bias_DEI-CONTEXT_age,(service|guard|reserve) member_DEI-CONTEXT_military status,hispanic_DEI-CONTEXT_ethnic,criminal history_DEI-CONTEXT_criminal history,gender identit_DEI-CONTEXT_LGBT,youth_DEI-CONTEXT_youth,black_DEI-CONTEXT_race,factory work_DEI-CONTEXT_factory work,(college|undergraduate|graduate) degree_DEI-CONTEXT_education/skill level,felon_DEI-CONTEXT_criminal history,economic status_DEI-CONTEXT_economic status,young_DEI-CONTEXT_youth,age discrimin_DEI-CONTEXT_age,under(-)?represented_DEI-CONTEXT_underrepresented,minorities_DEI-CONTEXT_minorit,high school diploma_DEI-CONTEXT_education/skill level,working (famil|parent|mother|mom|father|dad)_DEI-CONTEXT_familial status,asian_DEI-CONTEXT_race,racism_DEI-CONTEXT_race,impoverish_DEI-CONTEXT_economic status,transgender_DEI-CONTEXT_LGBT,bisexual_DEI-CONTEXT_LGBT,foreign worker_DEI-CONTEXT_migrant,middle(-| )class_DEI-CONTEXT_economic status,(indigenous|native(s| (america|population|communit|govern|reservation))|american indian|first nations|trib(al|e)|aborigin)_DEI-CONTEXT_race,economic class_DEI-CONTEXT_economic status,veteran_DEI-CONTEXT_military status,average age_DEI-CONTEXT_age,foreign nationals_DEI-CONTEXT_nationality,nationalit_DEI-CONTEXT_nationality,white_DEI-CONTEXT_race,old(er)? _DEI-CONTEXT_age,racial_DEI-CONTEXT_race,high(-| )income_DEI-CONTEXT_economic status,poverty_DEI-CONTEXT_economic status,lesbian_DEI-CONTEXT_LGBT,sexist_DEI-CONTEXT_gender-M/F,non(-)?binary_DEI-CONTEXT_LGBT,bias_DEI-CONTEXT_bias,inclusiv_DEI-CONTEXT_inclusive,wom(e|a)n_DEI-CONTEXT_gender-M/F,foreigner_DEI-CONTEXT_migrant,ethnic_DEI-CONTEXT_ethnic,gender_DEI-CONTEXT_gender-M/F,people of colo(u)?r_DEI-CONTEXT_race,national origin_DEI-CONTEXT_nationality,maternity leave_DEI-CONTEXT_gender-M/F,latino_DEI-CONTEXT_race,race _DEI-CONTEXT_race,sex discrimination_DEI-CONTEXT_gender-M/F,supplier contract_DEI-CONTEXT_supplier contract,blackface_DEI-CONTEXT_race,(marital status|married)_DEI-CONTEXT_marital status,female_DEI-CONTEXT_gender-M/F,asexual_DEI-CONTEXT_LGBT,lgbt_DEI-CONTEXT_LGBT,pregnant_DEI-CONTEXT_gender-M/F,on the basis of sex_DEI-CONTEXT_gender-M/F,(homo|trans)phobia_DEI-CONTEXT_LGBT,education level_DEI-CONTEXT_education/skill level
0,Asiana Airlines,"아시아나항공 / 2021.03.31 [Correction] Business Report VIII. Matters concerning executives and employees, etc. 1. Current status of executives and employees go. Executives (Base dat e: December 31, 2020 )",,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Asiana Airlines,"me. Current status of candidates for appointment and dismissal of regist ered executives (Base dat e: December 31, 2020",,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Asiana Airlines,"all. Status of employees, etc. (Base dat e: December 31, 2020 ) (In millions of Korean won)",,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Asiana Airlines,"Note 1) Non-executive employees on December 31, 2020, excluding executives, overs eas local employees, and foreign flight crew/cabin crew For workers whose working hours are short compared to the prescribed working hours of Im Note 5) Non-affiliated workers are workers who are dispatched, serviced, or contract ed by the employer 'in the workplace' of the employer who is obligated to disclose La. Remuneration of unregistered executives (Base dat e: December 31, 2020 ) (Unit: 1,000 won) Note) The total annual salary is based on the wage and salary income in the statemen t of wage and salary payment submitted to the competent tax office in accordance wit h Article 20 of the Income Tax Act. 2. Remuneration of executives, etc. <Status of remuneration for all directors and auditors> (1) Amount approved by the general meeting of shareholders (Unit: 1,000 won) Note 1) The above registered executives (5) are composed of 2 inside directors and 3 outside directors (2) Amount of remuneration paid (2-1) All directors and auditors (Unit: 1,000 won) (2-2) by type (Unit: 1,000 won)",,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Asiana Airlines,"Note 1) Enter the number of registered directors only for inside directors. Note 2) All 3 members of our audit committee are outside directors . Note 3) The total amount of remuneration is the cumulative amount paid until the end of the current period reflecting all rem uneration (earned and retired income, etc.) of retired/new directors during the discl osure period. Calculated by dividing by * 5) The Company did not grant stock options <Status of individual remuneration for directors and auditors with a remuneration amount of 500 million won or more> (1) Individual remuneration amount (Unit: 1,000 won) (2) Calculation standards and methods - None (Unit: 1,000 won) <Status of remuneration for the top 5 individuals out of 500 million won or more paid in re muneration> (1) Individual remuneration amount (Unit: 1,000 won)",,,,,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Collect the indicator columns by type (practice, outcome, DEI-context) so that
# rows with co-occurrences of practice and outcome terms can be flagged quickly

# Column name format: {term}_{PRACTICE|OUTCOME|DEI-CONTEXT}_{category}

all_cols = list(full_master_df.columns)
practice_term_cols = []
outcome_term_cols = []
DEI_context_term_cols = []

for col in all_cols: 
    
    try:
        col_type = col.split('_')[1]

        if col_type == 'PRACTICE':
            practice_term_cols.append(col)
        elif col_type == 'OUTCOME':
            outcome_term_cols.append(col)
        elif col_type == 'DEI-CONTEXT':
            DEI_context_term_cols.append(col)

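    except (IndexError, AttributeError):
        # Skip columns that do not follow the naming format (e.g. metadata columns)
        continue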
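
The naming convention above can be exercised on a few example column names. This is a minimal sketch, not the notebook's own code: `classify_columns` is a hypothetical helper, and the `ANY_` guard is an assumption added here because names such as `ANY_PRACTICE_TERM` would otherwise be read as practice columns if the classification were re-run after the aggregate indicators exist.

```python
def classify_columns(columns):
    """Group indicator columns by the type encoded in their name.

    Assumes the '{term}_{TYPE}_{category}' convention, where the term
    itself contains no underscore.
    """
    groups = {'PRACTICE': [], 'OUTCOME': [], 'DEI-CONTEXT': []}
    for col in columns:
        parts = str(col).split('_')
        # Metadata columns (e.g. 'INDUSTRY') have no type part and are skipped;
        # aggregate indicators (assumed to start with 'ANY_') are skipped too
        if len(parts) < 3 or parts[0] == 'ANY':
            continue
        if parts[1] in groups:
            groups[parts[1]].append(col)
    return groups


example_cols = [
    'INDUSTRY',
    'TVL ID',
    'lawsuit_OUTCOME_Talent-Attraction-Retention',
    'talent attraction_PRACTICE_Talent-Attraction-Retention',
    'wom(e|a)n_DEI-CONTEXT_gender-M/F',
    'ANY_PRACTICE_TERM',
]
groups = classify_columns(example_cols)
```

Here `groups['PRACTICE']` holds only the `talent attraction` column, even though `ANY_PRACTICE_TERM` is present in the input.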
                                     

In [None]:
full_master_df['ANY_DEI-CONTEXT_TERM'] = 0
for DEI_context_term_col in DEI_context_term_cols:
    full_master_df['ANY_DEI-CONTEXT_TERM'] = np.where(full_master_df[DEI_context_term_col] == 1, 
                                                                    1, full_master_df['ANY_DEI-CONTEXT_TERM'])
    
full_master_df['ANY_PRACTICE_TERM'] = 0
for practice_term_col in practice_term_cols:
    full_master_df['ANY_PRACTICE_TERM'] = np.where(full_master_df[practice_term_col] == 1, 
                                                                    1, full_master_df['ANY_PRACTICE_TERM'])
    
full_master_df['ANY_OUTCOME_TERM'] = 0
for outcome_term_col in outcome_term_cols:
    full_master_df['ANY_OUTCOME_TERM'] = np.where(full_master_df[outcome_term_col] == 1, 
                                                                    1, full_master_df['ANY_OUTCOME_TERM'])
    
# Only practice + outcome co-occurrence
full_master_df['ANY_PRACTICE_AND_OUTCOME'] = np.where((full_master_df['ANY_PRACTICE_TERM'] == 1) & (full_master_df['ANY_OUTCOME_TERM'] == 1), 1, 0)
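
The same "any term present" indicators can likely be computed without looping over columns. The sketch below uses a toy dataframe (`toy_df` and its three term columns are illustrative, not pipeline data) and assumes every term column holds 0/1 values, as in the cell above.

```python
import pandas as pd

# Toy stand-in for full_master_df with 0/1 term indicator columns
toy_df = pd.DataFrame({
    'talent attraction_PRACTICE_Talent-Attraction-Retention': [1, 0, 0],
    'reporting_PRACTICE_Talent-Attraction-Retention': [0, 0, 1],
    'lawsuit_OUTCOME_Talent-Attraction-Retention': [1, 0, 0],
})
practice_cols = [c for c in toy_df.columns if '_PRACTICE_' in c]
outcome_cols = [c for c in toy_df.columns if '_OUTCOME_' in c]

# A row is flagged when at least one column in the group equals 1
toy_df['ANY_PRACTICE_TERM'] = toy_df[practice_cols].eq(1).any(axis=1).astype(int)
toy_df['ANY_OUTCOME_TERM'] = toy_df[outcome_cols].eq(1).any(axis=1).astype(int)

# Co-occurrence: both group-level flags are set on the same row
toy_df['ANY_PRACTICE_AND_OUTCOME'] = (
    toy_df['ANY_PRACTICE_TERM'] & toy_df['ANY_OUTCOME_TERM']
)
```

Only the first toy row contains both a practice term and an outcome term, so it is the only row flagged as a co-occurrence.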

In [None]:
#MODIFY-3: Insert the indicator columns into `full_master_df` at column indices you prefer for the column order of the final CSV

for i, col in enumerate(['ANY_PRACTICE_AND_OUTCOME'
                         ]):
    mid = full_master_df[col]
    full_master_df.drop(labels=[col], axis=1, inplace = True)
    full_master_df.insert(i+4, col, mid) #MODIFY-3a: if necessary, change the index (set by 'i+4') to get the column order you want

for i, col in enumerate(['ANY_PRACTICE_TERM', 
                         'ANY_OUTCOME_TERM', 
                         'ANY_DEI-CONTEXT_TERM',
                         ]):
    mid = full_master_df[col]
    full_master_df.drop(labels=[col], axis=1, inplace = True)
    full_master_df.insert(i+22, col, mid) #MODIFY-3b: if necessary, change the index (set by 'i+22') to get the column order you want
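
The drop-then-insert pattern above can be wrapped in a small helper. This is a sketch on a toy dataframe; `move_columns` is a hypothetical name, not part of the pipeline. Note that `DataFrame.insert` raises an error if the target position is beyond the current number of columns, so the start index has to fit the dataframe being reordered.

```python
import pandas as pd


def move_columns(df, cols, start):
    """Move `cols` so they occupy positions start, start+1, ... (in place)."""
    for offset, col in enumerate(cols):
        values = df.pop(col)  # remove the column and keep its values
        df.insert(start + offset, col, values)
    return df


toy_df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'FLAG': [0]})
move_columns(toy_df, ['FLAG'], 1)
```

After the call, the toy columns are ordered `a`, `FLAG`, `b`, `c`.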

### (OPTIONAL): Checking some features of the final dataframe (some checks are TVL-specific)

In [None]:
# Total rows
full_master_df.shape[0]

17586

In [None]:
#full_master_df['Primary Article Source'].nunique()

In [None]:
# Number of ARTICLES with a practice-outcome co-occurrence
# (Of course, need to check later if these co-occurrences make sense with our 
# contexts, but just to get an idea)
full_master_df['ANY_PRACTICE_AND_OUTCOME'].sum()

1

In [None]:
# Total events
full_master_df[''].nunique()

49

In [None]:
# Number of EVENTS with a practice-outcome co-occurrence
#full_master_df.groupby(['INDUSTRY', 'TVL ID'])['ANY_PRACTICE_AND_OUTCOME'].sum().reset_index()['ANY_PRACTICE_AND_OUTCOME'].value_counts().reset_index().iloc[1:]['ANY_PRACTICE_AND_OUTCOME'].sum()
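
One way to count events, rather than articles, with at least one co-occurrence is to sum the indicator per event and count the events whose sum is positive. The sketch below runs on toy data and assumes, as in the commented-out line above, that `INDUSTRY` and `TVL ID` together identify an event.

```python
import pandas as pd

# Toy article-level data: event e1 has two flagged articles, e2 and e3 have none
toy_df = pd.DataFrame({
    'INDUSTRY': ['Airlines', 'Airlines', 'Airlines', 'AutoParts'],
    'TVL ID': ['e1', 'e1', 'e2', 'e3'],
    'ANY_PRACTICE_AND_OUTCOME': [1, 1, 0, 0],
})

# Number of flagged articles within each event
per_event = toy_df.groupby(['INDUSTRY', 'TVL ID'])['ANY_PRACTICE_AND_OUTCOME'].sum()

# An event counts once, no matter how many of its articles are flagged
events_with_cooccurrence = int((per_event > 0).sum())
articles_with_cooccurrence = int(toy_df['ANY_PRACTICE_AND_OUTCOME'].sum())
```

In the toy data, two articles are flagged but they belong to the same event, so the event-level count is 1.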

## (OPTIONAL): Function to randomly generate CSV of positive samples
- Run this code if you want to create a separate CSV of positive samples for faster viewing or validation. 

Slight modifications will be necessary:
- MODIFY-1: Change the `100` to whatever size sample you want
- MODIFY-2: Change the condition by which you are determining positivity if necessary. This code is presently capturing the rows that have a practice-outcome co-occurrence.
- MODIFY-3: Change the name of the CSV you are printing out to reflect the characteristics of your sample



In [None]:
# Create validation sample of up to 100 positive rows
sample_size = 100 #MODIFY-1: change 100 to whatever size sample you want
positive_rows = full_master_df[full_master_df['ANY_PRACTICE_AND_OUTCOME'] == 1] #MODIFY-2: change to column(s) you want to filter on (to get positive results)

# Sample without replacement, capped at the number of positive rows so the cell
# always finishes even when fewer positives exist than requested
DEI_valid_100 = positive_rows.sample(n=min(sample_size, len(positive_rows)))
DEI_valid_100.to_csv(f'{datetime.datetime.today().month}_{datetime.datetime.today().day}-VALID100_CONTEXT_TVLDEI-Industry_Article_Lvl.csv', index=False) #MODIFY-3: change the 100 in the file name to your sample size, and the TVL name

## (SLIGHT MODIFICATION): Create CSV(s)
Output dataframe `full_master_df` as the final CSV.
- MODIFY-1: Change the name of the CSV you are printing out to reflect the characteristics of your CSV



In [None]:
#MODIFY-1: change the 'DART_terms_agg' part of the CSV name (and the Drive folder path, if necessary)

time = str(datetime.datetime.today().month) + '_' + str(datetime.datetime.today().day)
full_master_df.to_csv('/content/drive/My Drive/DART/' + time + '_DART_terms_agg.csv', index=False)
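
The date-stamped file name used here and in the other export cells could be built in one place. This is a sketch only: `dated_csv_path` is a hypothetical helper, and the folder and label in the example simply mirror the cell above.

```python
import datetime
from pathlib import Path


def dated_csv_path(folder, label, today=None):
    """Build '<folder>/<month>_<day>_<label>.csv' for the given (or current) date."""
    today = today or datetime.date.today()
    return Path(folder) / f'{today.month}_{today.day}_{label}.csv'


# A fixed date is passed here only to make the example reproducible
example_path = dated_csv_path('/content/drive/My Drive/DART', 'DART_terms_agg',
                              today=datetime.date(2022, 4, 9))
```

With the fixed date, the file name is `4_9_DART_terms_agg.csv`. Because the stamp has no year, files written on the same calendar day in different years would share a name.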



### (OPTIONAL): Create the other 3 CSVs
- These are currently tailored for the data provided by TVL and we haven't used these CSVs for visualizations, so it is safe to ignore these for now.
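
The three optional CSVs all follow the same pattern: group the article-level rows and sum the indicator columns. A minimal sketch on toy data is shown below; the `title` column is an illustrative text column, and `numeric_only=True` is used on the assumption that text columns should be left out of the sums.

```python
import pandas as pd

# Toy article-level data with one text column and one indicator column
toy_df = pd.DataFrame({
    'INDUSTRY': ['Airlines', 'Airlines', 'AutoParts'],
    'TVL ID': ['e1', 'e1', 'e2'],
    'title': ['a', 'b', 'c'],
    'year': [2020, 2021, 2021],
    'ANY_PRACTICE_TERM': [1, 1, 0],
})

event_sums = (
    toy_df.groupby(['INDUSTRY', 'TVL ID'])
    .sum(numeric_only=True)   # sum indicators, skip text columns
    .reset_index()            # turn the group keys back into columns
    .drop(columns=['year'])   # a sum of years carries no meaning
)
```

The result has one row per event, with the indicator columns holding counts of matching articles.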



In [None]:
# Group articles by events: do a groupby on TVL ID (count events (by TVL ID) within each INDUSTRY)
# This is useful to observe counts for each term, for each event, within each industry

industry_event_level_sums = full_master_df.groupby(['INDUSTRY', 'TVL ID']).sum(numeric_only=True).reset_index() #MODIFY: change groupby column categories
industry_event_level_sums = industry_event_level_sums.drop(['year'], axis=1)
industry_event_level_sums.to_csv(f'{datetime.datetime.today().month}_{datetime.datetime.today().day}-TVLDEI-Industry_Event_Lvl.csv', index=False) #MODIFY: replace the 'TVL' in the CSV name

Unnamed: 0,INDUSTRY,TVL ID,ANY_PRACTICE_AND_RISK,ANY_CONTEXT_AND_PRACTICE_AND_RISK,Spotlight Volume,ANY_PRACTICE_TERM,ANY_RISK_TERM,ANY_DEI-CONTEXT_TERM,TAR_ind,PDMD_ind,CR_ind,IRR_ind,labor shortage_RISK_Talent-Attraction-Retention,workforce composition_PRACTICE_Talent-Attraction-Retention,programs/initiatives-retain_PRACTICE_Talent-Attraction-Retention,skill gap_RISK_Talent-Attraction-Retention,aging workforce_PRACTICE_Talent-Attraction-Retention,attrition_RISK_Talent-Attraction-Retention,talent attraction_PRACTICE_Talent-Attraction-Retention,discrimination lawsuit_RISK_Talent-Attraction-Retention,skill shortage_RISK_Talent-Attraction-Retention,talent retention_PRACTICE_Talent-Attraction-Retention,reporting_PRACTICE_Talent-Attraction-Retention,programs/initiatives-attract_PRACTICE_Talent-Attraction-Retention,unfilled positions_RISK_Talent-Attraction-Retention,working (famil|parent|mother|mom|father|dad)_DEI-CONTEXT_familial status,citizen_DEI-CONTEXT_migrant,racial_DEI-CONTEXT_race,visa_DEI-CONTEXT_migrant,sex discrimination_DEI-CONTEXT_gender-M/F,next generation_DEI-CONTEXT_youth,pregnant_DEI-CONTEXT_gender-M/F,high school diploma_DEI-CONTEXT_education/skill level,high(-| )income_DEI-CONTEXT_economic status,minorities_DEI-CONTEXT_minorit,white_DEI-CONTEXT_race,minority group_DEI-CONTEXT_minorit,bipoc_DEI-CONTEXT_race,gender_DEI-CONTEXT_gender-M/F,sexual orientation_DEI-CONTEXT_LGBT,youth_DEI-CONTEXT_youth,gender identity_DEI-CONTEXT_LGBT,disabili_DEI-CONTEXT_disabili,criminal history_DEI-CONTEXT_criminal history,under(-)?represented_DEI-CONTEXT_underrepresented,lesbian_DEI-CONTEXT_LGBT,ethnic_DEI-CONTEXT_ethnic,low(-| )income_DEI-CONTEXT_economic status,veteran_DEI-CONTEXT_military status,average age_DEI-CONTEXT_age,bisexual_DEI-CONTEXT_LGBT,(low|un|semi|high)(-|ly | |)skill_DEI-CONTEXT_education/skill level,background check_DEI-CONTEXT_criminal history,convict_DEI-CONTEXT_criminal history,people of colo[u]?r_DEI-CONTEXT_race,economic class_DEI-CONTEXT_economic 
status,marital status_DEI-CONTEXT_marital status,(service|guard|reserve) member_DEI-CONTEXT_military status,impoverish_DEI-CONTEXT_economic status,[im]?migrant_DEI-CONTEXT_migrant,homophobia_DEI-CONTEXT_LGBT,queer_DEI-CONTEXT_LGBT,latino_DEI-CONTEXT_race,race_DEI-CONTEXT_race,wom(e|a)n_DEI-CONTEXT_gender-M/F,(college|undergraduate|graduate) degree_DEI-CONTEXT_education/skill level,asian_DEI-CONTEXT_race,age discrimin_DEI-CONTEXT_age,foreign worker_DEI-CONTEXT_migrant,based on sex_DEI-CONTEXT_gender-M/F,entry(-| )level_DEI-CONTEXT_education/skill level,blackface_DEI-CONTEXT_race,on the basis of sex_DEI-CONTEXT_gender-M/F,foreign nationals_DEI-CONTEXT_nationality,female_DEI-CONTEXT_gender-M/F,gay_DEI-CONTEXT_LGBT,bias_DEI-CONTEXT_bias,national origin_DEI-CONTEXT_nationality,old(er)?_DEI-CONTEXT_age,sexist_DEI-CONTEXT_gender-M/F,divers_DEI-CONTEXT_divers,black_DEI-CONTEXT_race,religio_DEI-CONTEXT_religio,factory work_DEI-CONTEXT_factory work,ageism_DEI-CONTEXT_age,transgender_DEI-CONTEXT_LGBT,felon_DEI-CONTEXT_criminal history,age bias_DEI-CONTEXT_age,inclusiv_DEI-CONTEXT_inclusiv,access_DEI-CONTEXT_access,poverty_DEI-CONTEXT_economic status,racism_DEI-CONTEXT_race,nonbinary_DEI-CONTEXT_LGBT,young_DEI-CONTEXT_youth,maternity leave_DEI-CONTEXT_gender-M/F,lgbt_DEI-CONTEXT_LGBT,nationality_DEI-CONTEXT_nationality,working class_DEI-CONTEXT_economic status,asexual_DEI-CONTEXT_LGBT,foreigner_DEI-CONTEXT_migrant,economic status_DEI-CONTEXT_economic status,education level_DEI-CONTEXT_education/skill level,racist_DEI-CONTEXT_race,middle class_DEI-CONTEXT_economic status
0,Advertising& Marketing,0e80a6a7-b133-4f3d-b3ba-eaae4b621e95,0,0,16,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Advertising& Marketing,18cf502f-16b1-4ec9-8547-57e12c9aa827,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Advertising& Marketing,1918f688-e921-4bb5-80e4-a4402b44c71a,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Advertising& Marketing,1a81c9de-1ca9-4c35-9111-d90661d0a891,0,0,7,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Advertising& Marketing,24463be5-9776-4cf8-8472-d42fc99e4c4e,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# This is useful to observe counts for each term, by year, within each industry
# *** Such breakdowns explain how each article was labeled, but they do not represent accurate numbers for co-occurrences 
# (need to check articles manually to ensure co-occurring terms produce useful results)
industry_year_level_sums = full_master_df.groupby(['INDUSTRY', 'year']).sum(numeric_only=True).reset_index() #MODIFY: change groupby column categories, if necessary
industry_year_level_sums.to_csv(f'{datetime.datetime.today().month}_{datetime.datetime.today().day}-TVLDEI-Industry_Year_Lvl.csv', index=False) #MODIFY: replace the 'TVL' in the CSV name


Unnamed: 0,INDUSTRY,year,ANY_PRACTICE_AND_RISK,ANY_CONTEXT_AND_PRACTICE_AND_RISK,Spotlight Volume,ANY_PRACTICE_TERM,ANY_RISK_TERM,ANY_DEI-CONTEXT_TERM,TAR_ind,PDMD_ind,CR_ind,IRR_ind,labor shortage_RISK_Talent-Attraction-Retention,workforce composition_PRACTICE_Talent-Attraction-Retention,programs/initiatives-retain_PRACTICE_Talent-Attraction-Retention,skill gap_RISK_Talent-Attraction-Retention,aging workforce_PRACTICE_Talent-Attraction-Retention,attrition_RISK_Talent-Attraction-Retention,talent attraction_PRACTICE_Talent-Attraction-Retention,discrimination lawsuit_RISK_Talent-Attraction-Retention,skill shortage_RISK_Talent-Attraction-Retention,talent retention_PRACTICE_Talent-Attraction-Retention,reporting_PRACTICE_Talent-Attraction-Retention,programs/initiatives-attract_PRACTICE_Talent-Attraction-Retention,unfilled positions_RISK_Talent-Attraction-Retention,working (famil|parent|mother|mom|father|dad)_DEI-CONTEXT_familial status,citizen_DEI-CONTEXT_migrant,racial_DEI-CONTEXT_race,visa_DEI-CONTEXT_migrant,sex discrimination_DEI-CONTEXT_gender-M/F,next generation_DEI-CONTEXT_youth,pregnant_DEI-CONTEXT_gender-M/F,high school diploma_DEI-CONTEXT_education/skill level,high(-| )income_DEI-CONTEXT_economic status,minorities_DEI-CONTEXT_minorit,white_DEI-CONTEXT_race,minority group_DEI-CONTEXT_minorit,bipoc_DEI-CONTEXT_race,gender_DEI-CONTEXT_gender-M/F,sexual orientation_DEI-CONTEXT_LGBT,youth_DEI-CONTEXT_youth,gender identity_DEI-CONTEXT_LGBT,disabili_DEI-CONTEXT_disabili,criminal history_DEI-CONTEXT_criminal history,under(-)?represented_DEI-CONTEXT_underrepresented,lesbian_DEI-CONTEXT_LGBT,ethnic_DEI-CONTEXT_ethnic,low(-| )income_DEI-CONTEXT_economic status,veteran_DEI-CONTEXT_military status,average age_DEI-CONTEXT_age,bisexual_DEI-CONTEXT_LGBT,(low|un|semi|high)(-|ly | |)skill_DEI-CONTEXT_education/skill level,background check_DEI-CONTEXT_criminal history,convict_DEI-CONTEXT_criminal history,people of colo[u]?r_DEI-CONTEXT_race,economic class_DEI-CONTEXT_economic 
status,marital status_DEI-CONTEXT_marital status,(service|guard|reserve) member_DEI-CONTEXT_military status,impoverish_DEI-CONTEXT_economic status,[im]?migrant_DEI-CONTEXT_migrant,homophobia_DEI-CONTEXT_LGBT,queer_DEI-CONTEXT_LGBT,latino_DEI-CONTEXT_race,race_DEI-CONTEXT_race,wom(e|a)n_DEI-CONTEXT_gender-M/F,(college|undergraduate|graduate) degree_DEI-CONTEXT_education/skill level,asian_DEI-CONTEXT_race,age discrimin_DEI-CONTEXT_age,foreign worker_DEI-CONTEXT_migrant,based on sex_DEI-CONTEXT_gender-M/F,entry(-| )level_DEI-CONTEXT_education/skill level,blackface_DEI-CONTEXT_race,on the basis of sex_DEI-CONTEXT_gender-M/F,foreign nationals_DEI-CONTEXT_nationality,female_DEI-CONTEXT_gender-M/F,gay_DEI-CONTEXT_LGBT,bias_DEI-CONTEXT_bias,national origin_DEI-CONTEXT_nationality,old(er)?_DEI-CONTEXT_age,sexist_DEI-CONTEXT_gender-M/F,divers_DEI-CONTEXT_divers,black_DEI-CONTEXT_race,religio_DEI-CONTEXT_religio,factory work_DEI-CONTEXT_factory work,ageism_DEI-CONTEXT_age,transgender_DEI-CONTEXT_LGBT,felon_DEI-CONTEXT_criminal history,age bias_DEI-CONTEXT_age,inclusiv_DEI-CONTEXT_inclusiv,access_DEI-CONTEXT_access,poverty_DEI-CONTEXT_economic status,racism_DEI-CONTEXT_race,nonbinary_DEI-CONTEXT_LGBT,young_DEI-CONTEXT_youth,maternity leave_DEI-CONTEXT_gender-M/F,lgbt_DEI-CONTEXT_LGBT,nationality_DEI-CONTEXT_nationality,working class_DEI-CONTEXT_economic status,asexual_DEI-CONTEXT_LGBT,foreigner_DEI-CONTEXT_migrant,economic status_DEI-CONTEXT_economic status,education level_DEI-CONTEXT_education/skill level,racist_DEI-CONTEXT_race,middle class_DEI-CONTEXT_economic status
0,Advertising& Marketing,2016,0,0,19,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Advertising& Marketing,2017,0,0,26,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Advertising& Marketing,2018,0,0,15,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Advertising& Marketing,2019,0,0,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Advertising& Marketing,2020,0,0,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Advertising& Marketing,2021,0,0,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,Aerospace& Defense,2016,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Aerospace& Defense,2017,0,0,20,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,Aerospace& Defense,2018,0,0,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,Aerospace& Defense,2019,0,0,52,0,2,2,2,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
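
One reason to read these group-level sums with care: positive practice and outcome counts in the same group do not imply that any single article contains both. The toy example below illustrates this one aspect only; it does not replace the manual check of articles mentioned above.

```python
import pandas as pd

# Two articles in the same industry and year: one has a practice term,
# the other has an outcome term, and neither has both
toy_df = pd.DataFrame({
    'INDUSTRY': ['Airlines', 'Airlines'],
    'year': [2019, 2019],
    'ANY_PRACTICE_TERM': [1, 0],
    'ANY_OUTCOME_TERM': [0, 1],
})
toy_df['ANY_PRACTICE_AND_OUTCOME'] = (
    toy_df['ANY_PRACTICE_TERM'] & toy_df['ANY_OUTCOME_TERM']
)

year_sums = toy_df.groupby(['INDUSTRY', 'year']).sum(numeric_only=True).reset_index()
```

In `year_sums`, both the practice and outcome counts are 1 while the co-occurrence count is 0, so only the row-level co-occurrence column should be read as a count of co-occurring articles.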


In [None]:
# This is useful to observe counts for each term, within each industry (Jan 2016-Sep 2021)

industry_level_sums = full_master_df.groupby(['INDUSTRY']).sum(numeric_only=True) #MODIFY: change groupby column categories, if necessary
industry_level_sums = industry_level_sums.drop(['year'], axis=1)
# INDUSTRY is the index here, so write the index to keep the industry labels in the CSV
industry_level_sums.to_csv(f'{datetime.datetime.today().month}_{datetime.datetime.today().day}-TVLDEI-Industry_ONLY_Lvl.csv', index=True) #MODIFY: replace the 'TVL' in the CSV name

Unnamed: 0_level_0,ANY_PRACTICE_AND_RISK,ANY_CONTEXT_AND_PRACTICE_AND_RISK,Spotlight Volume,ANY_PRACTICE_TERM,ANY_RISK_TERM,ANY_DEI-CONTEXT_TERM,TAR_ind,PDMD_ind,CR_ind,IRR_ind,labor shortage_RISK_Talent-Attraction-Retention,workforce composition_PRACTICE_Talent-Attraction-Retention,programs/initiatives-retain_PRACTICE_Talent-Attraction-Retention,skill gap_RISK_Talent-Attraction-Retention,aging workforce_PRACTICE_Talent-Attraction-Retention,attrition_RISK_Talent-Attraction-Retention,talent attraction_PRACTICE_Talent-Attraction-Retention,discrimination lawsuit_RISK_Talent-Attraction-Retention,skill shortage_RISK_Talent-Attraction-Retention,talent retention_PRACTICE_Talent-Attraction-Retention,reporting_PRACTICE_Talent-Attraction-Retention,programs/initiatives-attract_PRACTICE_Talent-Attraction-Retention,unfilled positions_RISK_Talent-Attraction-Retention,working (famil|parent|mother|mom|father|dad)_DEI-CONTEXT_familial status,citizen_DEI-CONTEXT_migrant,racial_DEI-CONTEXT_race,visa_DEI-CONTEXT_migrant,sex discrimination_DEI-CONTEXT_gender-M/F,next generation_DEI-CONTEXT_youth,pregnant_DEI-CONTEXT_gender-M/F,high school diploma_DEI-CONTEXT_education/skill level,high(-| )income_DEI-CONTEXT_economic status,minorities_DEI-CONTEXT_minorit,white_DEI-CONTEXT_race,minority group_DEI-CONTEXT_minorit,bipoc_DEI-CONTEXT_race,gender_DEI-CONTEXT_gender-M/F,sexual orientation_DEI-CONTEXT_LGBT,youth_DEI-CONTEXT_youth,gender identity_DEI-CONTEXT_LGBT,disabili_DEI-CONTEXT_disabili,criminal history_DEI-CONTEXT_criminal history,under(-)?represented_DEI-CONTEXT_underrepresented,lesbian_DEI-CONTEXT_LGBT,ethnic_DEI-CONTEXT_ethnic,low(-| )income_DEI-CONTEXT_economic status,veteran_DEI-CONTEXT_military status,average age_DEI-CONTEXT_age,bisexual_DEI-CONTEXT_LGBT,(low|un|semi|high)(-|ly | |)skill_DEI-CONTEXT_education/skill level,background check_DEI-CONTEXT_criminal history,convict_DEI-CONTEXT_criminal history,people of colo[u]?r_DEI-CONTEXT_race,economic class_DEI-CONTEXT_economic 
status,marital status_DEI-CONTEXT_marital status,(service|guard|reserve) member_DEI-CONTEXT_military status,impoverish_DEI-CONTEXT_economic status,[im]?migrant_DEI-CONTEXT_migrant,homophobia_DEI-CONTEXT_LGBT,queer_DEI-CONTEXT_LGBT,latino_DEI-CONTEXT_race,race_DEI-CONTEXT_race,wom(e|a)n_DEI-CONTEXT_gender-M/F,(college|undergraduate|graduate) degree_DEI-CONTEXT_education/skill level,asian_DEI-CONTEXT_race,age discrimin_DEI-CONTEXT_age,foreign worker_DEI-CONTEXT_migrant,based on sex_DEI-CONTEXT_gender-M/F,entry(-| )level_DEI-CONTEXT_education/skill level,blackface_DEI-CONTEXT_race,on the basis of sex_DEI-CONTEXT_gender-M/F,foreign nationals_DEI-CONTEXT_nationality,female_DEI-CONTEXT_gender-M/F,gay_DEI-CONTEXT_LGBT,bias_DEI-CONTEXT_bias,national origin_DEI-CONTEXT_nationality,old(er)?_DEI-CONTEXT_age,sexist_DEI-CONTEXT_gender-M/F,divers_DEI-CONTEXT_divers,black_DEI-CONTEXT_race,religio_DEI-CONTEXT_religio,factory work_DEI-CONTEXT_factory work,ageism_DEI-CONTEXT_age,transgender_DEI-CONTEXT_LGBT,felon_DEI-CONTEXT_criminal history,age bias_DEI-CONTEXT_age,inclusiv_DEI-CONTEXT_inclusiv,access_DEI-CONTEXT_access,poverty_DEI-CONTEXT_economic status,racism_DEI-CONTEXT_race,nonbinary_DEI-CONTEXT_LGBT,young_DEI-CONTEXT_youth,maternity leave_DEI-CONTEXT_gender-M/F,lgbt_DEI-CONTEXT_LGBT,nationality_DEI-CONTEXT_nationality,working class_DEI-CONTEXT_economic status,asexual_DEI-CONTEXT_LGBT,foreigner_DEI-CONTEXT_migrant,economic status_DEI-CONTEXT_economic status,education level_DEI-CONTEXT_education/skill level,racist_DEI-CONTEXT_race,middle class_DEI-CONTEXT_economic status
INDUSTRY
Advertising& Marketing,0,0,178,1,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Aerospace& Defense,0,0,307,2,4,6,6,0,0,0,0,2,0,0,0,1,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
AgriculturalProducts,0,0,73,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
AirFreight & Logistics,0,0,422,1,2,3,3,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Airlines,1,1,1168,3,4,6,6,0,0,0,0,3,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
AlcoholicBeverages,0,0,209,3,1,4,4,0,0,0,0,2,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Apparel,Accessories & Foo",1,1,1408,9,16,24,24,0,0,0,0,9,0,3,0,1,0,8,0,0,0,0,4,0,1,2,0,0,0,1,0,0,0,0,0,0,6,0,0,0,1,0,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,10,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,3,1,0,0,0,0,0,0,3,2,1,0,0,2,0,1,0,0,0,0,0,0,0,0
ApplianceManufacturing,0,0,37,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
AssetManagement & Custody,1,1,407,10,2,11,11,0,0,0,0,6,0,0,1,1,3,1,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,0,1,0,1,0,6,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
AutoParts,0,0,259,1,2,3,3,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## (IGNORE): Miscellaneous Old Code

In [None]:
industry_level_sums.sort_values(by='ANY_PRACTICE_AND_RISK', ascending=False).head(10)

Unnamed: 0_level_0,ANY_PRACTICE_TERM,ANY_RISK_TERM,DEI_validation_w/_context,ANY_PRACTICE_AND_RISK,ANY_CONTEXT_AND_PRACTICE_AND_RISK,ANY_2DEI_AND_RISK,Spotlight Volume,diversity_ind,employee engage_ind,inclusion_ind,DEI_ind,DEI_keyword_ind,marked_DEI_relevant_ind,work_ind,employ_ind,root_employ_ind,marked_root_employ_ind,wage_PRACTICE_Wages,wage theft_PRACTICE_Wages,living wage_PRACTICE_Wages,low wages_PRACTICE_Wages,underpay_PRACTICE_Wages,DEI compensation_PRACTICE_Wages,precarity_PRACTICE_Precarious-Work,precarious work_PRACTICE_Precarious-Work,gig work_PRACTICE_Precarious-Work,alternative work_PRACTICE_Precarious-Work,alternate work_PRACTICE_Precarious-Work,contingent work_PRACTICE_Precarious-Work,informal work_PRACTICE_Precarious-Work,casual work_PRACTICE_Precarious-Work,hazardous work_PRACTICE_Precarious-Work,temporary work-employement_PRACTICE_Precarious-Work,contract labor_PRACTICE_Precarious-Work,confinement_PRACTICE_Mdrn-Slav-Risk,document retention_PRACTICE_Mdrn-Slav-Risk,restriction of movement_PRACTICE_Mdrn-Slav-Risk,delayed wage_PRACTICE_Mdrn-Slav-Risk,pay manipulation_PRACTICE_Mdrn-Slav-Risk,punishment_PRACTICE_Mdrn-Slav-Risk,poor food_PRACTICE_Mdrn-Slav-Risk,deprivationunpaid wage_PRACTICE_Mdrn-Slav-Risk,delayed payment_PRACTICE_Mdrn-Slav-Risk,wage violation_PRACTICE_Mdrn-Slav-Risk,coercive labor_PRACTICE_Mdrn-Slav-Risk,prison labor_PRACTICE_Mdrn-Slav-Risk,recruitment fee_PRACTICE_Mdrn-Slav-Risk,withhold wage_PRACTICE_Mdrn-Slav-Risk,passport retention_PRACTICE_Mdrn-Slav-Risk,freedom of association_PRACTICE_Work-Conditions,collective bargaining_PRACTICE_Work-Conditions,work stoppage_PRACTICE_Work-Conditions,hotline_PRACTICE_Work-Conditions,collective bargaining agreement_PRACTICE_Work-Conditions,unsafe conditions_PRACTICE_Work-Conditions,employee morale_PRACTICE_Work-Conditions,grievance mechanism_PRACTICE_Work-Conditions,reprisal/retaliation_PRACTICE_Work-Conditions,code of conduct_PRACTICE_Good-Practices,due diligence_PRACTICE_Good-Practices,ethical 
recruit_PRACTICE_Good-Practices,handbook_PRACTICE_Good-Practices,social audit_PRACTICE_Good-Practices,equal benefits_PRACTICE_Good-Practices,transparency_PRACTICE_Good-Practices,traceability_PRACTICE_Good-Practices,visibility_PRACTICE_Good-Practices,accessibtimely payments_PRACTICE_Good-Practices,union_PRACTICE_Good-Practices,worker committee_PRACTICE_Good-Practices,empower_PRACTICE_Good-Practices,accomodat_PRACTICE_Good-Practices,code of conduct negative_PRACTICE_Good-Practices,engagement_PRACTICE_Good-Practices,flexible work_PRACTICE_Good-Practices,outsourc_PRACTICE_Neutral-Practices,subcontracting_PRACTICE_Neutral-Practices,program_PRACTICE_Neutral-Practices,initiative_PRACTICE_Neutral-Practices,training_PRACTICE_Neutral-Practices,development_PRACTICE_Neutral-Practices,exempt_PRACTICE_Neutral-Practices,recruit_PRACTICE_Neutral-Practices,promotion_PRACTICE_Neutral-Practices,arbitration_PRACTICE_Neutral-Practices,corrective action_PRACTICE_Neutral-Practices,hiring_PRACTICE_Neutral-Practices,order delay_PRACTICE_Negative-Practices,lead time_PRACTICE_Negative-Practices,unplanned shipment_PRACTICE_Negative-Practices,corruption_PRACTICE_Negative-Practices,quota system_PRACTICE_Negative-Practices,delayed payment_PRACTICE_Negative-Practices,weak governance_PRACTICE_Negative-Practices,wage violation_PRACTICE_Negative-Practices,informal supply chain_PRACTICE_Negative-Practices,last-minute order modification_PRACTICE_Negative-Practices,unfair timing demand_PRACTICE_Negative-Practices,pricing pressure_PRACTICE_Negative-Practices,poor forecasting_PRACTICE_Negative-Practices,irresponsible exit_PRACTICE_Negative-Practices,hour violation_PRACTICE_Negative-Practices,canceled order_PRACTICE_Negative-Practices,overtime NEGATIVE_PRACTICE_Negative-Practices,lead time 
NEGATIVE_PRACTICE_Negative-Practices,turnover_RISK_Employee/Talent-Retention,retention_RISK_Employee/Talent-Retention,talent_RISK_Employee/Talent-Retention,strike_RISK_Worker-Protest,sit-in_RISK_Worker-Protest,operational disruption_RISK_Worker-Protest,protest_RISK_Worker-Protest,injury_RISK_Worker-Protest,walkout_RISK_Worker-Protest,boycott_RISK_Consumer-Protest,protest_RISK_Consumer-Protest,social license_RISK_Consumer-Protest,operational disruption_RISK_Operational-Costs,operating cost_RISK_Operational-Costs,delay_RISK_Operational-Costs,disruption_RISK_Operational-Costs,withhold release order_RISK_Operational-Costs,block import_RISK_Operational-Costs,sanction_RISK_Financial-Loss,reimburse_RISK_Financial-Loss,restitution_RISK_Financial-Loss,fine_RISK_Financial-Loss,compensation_RISK_Financial-Loss,penalt_RISK_Financial-Loss,bankrupt_RISK_Financial-Loss,liabl_RISK_Financial-Loss,loss_RISK_Financial-Loss,lost_RISK_Financial-Loss,pay damages_RISK_Financial-Loss,seizure of assets_RISK_Financial-Loss,lawsuit_RISK_Legal-Risk,litigation_RISK_Legal-Risk,impoundment_RISK_Legal-Risk,detain_RISK_Legal-Risk,penalt_RISK_Legal-Risk,sanction_RISK_Legal-Risk,court_RISK_Legal-Risk,consent decree_RISK_Legal-Risk,court-ordered relief_RISK_Legal-Risk,brand damage_RISK_Reputational-Damage,monetary damage_RISK_Reputational-Damage,brand reputation_RISK_Reputational-Damage,brand recognition_RISK_Reputational-Damage,social license_RISK_Reputational-Damage,decreased trust_RISK_Reputational-Damage,decreased innovation_RISK_Reputational-Damage,lost opportunit_RISK_Reputational-Damage,workplace shutdown_RISK_Reputational-Damage,reimburse_RISK_Remedy,compensation_RISK_Remedy,divest_RISK_Remedy,restitution_RISK_Remedy,modern slavery_RISK_Modern-Slavery,debt bondage_RISK_Modern-Slavery,human traffic_RISK_Modern-Slavery,forced labor_RISK_Modern-Slavery,child 
labor_RISK_Modern-Slavery,alleg_RISK_Other,accus_RISK_Other,exploit_RISK_Other,expose_RISK_Other,investigat_RISK_Other,police_RISK_Other,enforcement_RISK_Other,security force_RISK_Other,inspection_RISK_Other,inspector_RISK_Other,scandal_RISK_Other-RK,government action_RISK_Other-RK,share price_RISK_Other-RK,share value_RISK_Other-RK,investment_RISK_Other-RK,negative return_RISK_Other-RK,women_DEI-CONTEXT_DEI-Context,female_DEI-CONTEXT_DEI-Context,people of color_DEI-CONTEXT_DEI-Context,ethnicit_DEI-CONTEXT_DEI-Context,sexual_DEI-CONTEXT_DEI-Context,orientation_DEI-CONTEXT_DEI-Context,gender_DEI-CONTEXT_DEI-Context,disabili_DEI-CONTEXT_DEI-Context,LGBT_DEI-CONTEXT_DEI-Context,parental_DEI-CONTEXT_DEI-Context,marital_DEI-CONTEXT_DEI-Context,mother_DEI-CONTEXT_DEI-Context,pregnant_DEI-CONTEXT_DEI-Context,family-friendlyveteran_DEI-CONTEXT_DEI-Context,fairness_DEI-CONTEXT_DEI-Context,age _DEI-CONTEXT_DEI-Context,bias_DEI-CONTEXT_DEI-Context,religio_DEI-CONTEXT_DEI-Context,race_DEI-CONTEXT_DEI-Context,minorit_DEI-CONTEXT_DEI-Context,justice_DEI-CONTEXT_DEI-Context,equity_DEI-CONTEXT_DEI-Context,equality_DEI-CONTEXT_DEI-Context,nationality_DEI-CONTEXT_DEI-Context,underrepresented_DEI-CONTEXT_DEI-Context,migrant_DEI-CONTEXT_DEI-Context,education/skill level_DEI-CONTEXT_DEI-Context,economic status_DEI-CONTEXT_DEI-Context,criminal history_DEI-CONTEXT_DEI-Context,precarious worker_DEI-CONTEXT_DEI-Context,DEI abroad_DEI-CONTEXT_DEI-Context,discriminat_DEI-CONTEXT_DEI-Standalone-Terms,harrass_DEI-CONTEXT_DEI-Standalone-Terms,profiled_DEI-CONTEXT_DEI-Standalone-Terms,ANY_DEI-CONTEXT_TERM,ANY_DEI_TERM,ANY_ROOT_EMPLOY_TERM
INDUSTRY,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 
100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1,Unnamed: 157_level_1,Unnamed: 158_level_1,Unnamed: 159_level_1,Unnamed: 160_level_1,Unnamed: 161_level_1,Unnamed: 162_level_1,Unnamed: 163_level_1,Unnamed: 164_level_1,Unnamed: 165_level_1,Unnamed: 166_level_1,Unnamed: 167_level_1,Unnamed: 168_level_1,Unnamed: 169_level_1,Unnamed: 170_level_1,Unnamed: 171_level_1,Unnamed: 172_level_1,Unnamed: 173_level_1,Unnamed: 174_level_1,Unnamed: 175_level_1,Unnamed: 176_level_1,Unnamed: 177_level_1,Unnamed: 178_level_1,Unnamed: 179_level_1,Unnamed: 180_level_1,Unnamed: 181_level_1,Unnamed: 182_level_1,Unnamed: 183_level_1,Unnamed: 184_level_1,Unnamed: 185_level_1,Unnamed: 186_level_1,Unnamed: 187_level_1,Unnamed: 188_level_1,Unnamed: 189_level_1,Unnamed: 190_level_1,Unnamed: 191_level_1,Unnamed: 192_level_1,Unnamed: 193_level_1,Unnamed: 194_level_1,Unnamed: 
195_level_1,Unnamed: 196_level_1,Unnamed: 197_level_1,Unnamed: 198_level_1,Unnamed: 199_level_1,Unnamed: 200_level_1,Unnamed: 201_level_1,Unnamed: 202_level_1,Unnamed: 203_level_1,Unnamed: 204_level_1,Unnamed: 205_level_1,Unnamed: 206_level_1,Unnamed: 207_level_1,Unnamed: 208_level_1,Unnamed: 209_level_1,Unnamed: 210_level_1,Unnamed: 211_level_1,Unnamed: 212_level_1,Unnamed: 213_level_1,Unnamed: 214_level_1,Unnamed: 215_level_1
Software& IT Services,748,331,108,224,114,15,4954,122,36,82,0,240,240,697,648,1345,1345,30,0,1,0,3,2,0,0,5,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,2,0,0,2,171,10,0,8,2,0,0,1,0,0,14,1,7,0,22,0,57,0,0,36,64,36,1,283,97,146,209,4,65,22,4,1,171,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,11,10,1,0,0,8,0,6,4,8,0,0,3,7,11,16,0,0,3,0,2,44,2,0,0,15,14,1,0,50,2,0,4,2,0,35,2,6,0,0,2,0,0,0,0,0,0,3,44,2,0,0,0,0,1,0,78,36,2,5,44,9,7,0,2,3,4,0,6,0,72,0,152,71,4,7,71,9,88,21,0,12,0,14,5,0,14,1,17,7,47,7,5,15,17,6,11,103,33,7,1,40,1,60,1,0,456,62,834
InternetMedia & Services,270,262,109,162,129,4,6706,64,2,21,0,87,87,239,226,465,465,16,0,1,0,2,7,0,0,2,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,81,2,0,28,5,0,0,4,0,0,8,0,0,0,24,0,11,0,3,2,15,1,0,63,31,27,32,0,21,13,22,0,54,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,4,2,0,52,1,27,7,52,0,0,0,2,2,19,0,2,2,0,6,17,0,0,1,6,8,0,0,51,2,0,0,0,2,50,0,0,0,0,0,0,0,0,0,0,1,2,17,2,0,0,0,0,0,0,89,65,1,7,54,15,8,0,0,2,14,0,2,0,9,0,82,33,4,1,131,3,51,3,0,4,1,5,4,0,0,0,13,4,44,7,7,5,2,1,8,17,14,3,10,36,0,77,0,0,284,20,302
CommercialBanks,334,179,42,120,58,5,1707,71,14,56,0,141,141,282,258,540,540,19,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,52,2,0,1,1,0,0,1,0,0,6,0,1,0,20,0,26,0,0,14,14,1,0,154,96,80,111,0,22,5,4,0,54,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,4,8,0,0,9,0,0,1,9,0,0,1,2,3,5,12,2,2,0,5,19,5,0,0,4,7,0,0,14,2,0,0,5,2,20,0,2,0,0,0,0,0,0,0,0,0,2,19,1,0,0,0,0,0,0,21,10,1,4,7,4,4,0,1,0,5,0,1,0,78,0,109,57,5,0,25,2,72,13,0,15,1,8,1,0,3,0,10,4,32,14,5,7,23,0,4,9,4,8,0,5,0,36,0,1,223,39,356
Multilineand Specialty Re,222,150,64,100,71,10,1283,51,0,30,0,81,81,170,187,357,357,26,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,49,2,0,10,1,0,0,2,0,0,4,1,1,0,11,0,19,0,0,0,4,0,0,80,29,68,55,2,18,7,3,0,61,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,2,2,0,0,11,0,0,3,11,0,0,1,2,0,5,0,0,3,0,1,11,1,3,3,7,4,2,0,34,7,0,2,1,0,47,7,10,0,0,0,0,0,0,0,0,0,3,11,0,0,0,0,0,0,0,50,16,1,2,19,10,1,0,1,0,3,0,2,0,16,0,46,29,4,3,33,3,36,19,0,13,0,6,3,0,1,0,14,5,40,4,9,13,5,3,3,6,6,1,4,10,1,57,0,1,181,25,244
Restaurants,155,152,61,89,64,5,2054,31,0,22,0,53,53,134,153,287,287,16,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,4,0,37,1,0,12,3,0,0,1,0,0,4,1,0,0,15,0,7,0,0,0,7,1,0,40,10,36,19,1,11,10,0,1,38,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,1,3,6,0,0,16,1,0,3,16,0,0,0,1,0,12,0,0,0,0,3,6,3,0,2,16,8,6,0,57,4,0,0,3,0,35,4,10,0,3,1,0,0,0,0,0,1,0,6,0,0,0,0,0,0,1,56,17,2,2,17,16,3,0,2,2,1,0,7,0,11,0,32,25,6,2,53,5,26,20,0,2,0,2,0,0,1,1,15,14,42,4,3,6,2,2,1,7,7,2,0,2,2,56,1,1,159,20,189
Media& Entertainment,127,163,37,79,57,1,1186,16,1,6,0,23,23,91,87,178,178,6,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,0,0,4,0,0,0,0,0,0,4,0,1,0,21,0,1,0,0,1,2,1,0,45,9,14,20,3,1,5,6,0,21,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,2,0,0,10,1,0,3,10,0,0,0,0,0,7,0,1,1,0,4,6,1,2,2,2,5,3,0,37,5,0,2,1,1,35,0,2,0,1,0,0,0,0,0,0,1,1,6,1,0,0,0,0,0,0,91,38,2,8,49,8,0,0,0,0,13,0,0,0,17,0,55,35,1,2,95,1,31,0,0,1,1,3,0,0,0,0,5,2,19,1,3,3,0,0,1,2,4,2,0,3,0,40,0,0,158,5,131
Professional& Commercial,216,104,46,78,51,6,846,46,4,27,0,77,77,204,199,403,403,6,1,0,0,4,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,53,4,0,7,1,1,0,0,0,0,4,0,0,0,6,0,10,0,0,4,11,8,0,60,24,40,56,0,34,10,5,0,39,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,4,1,0,0,5,2,1,3,5,0,0,0,1,1,4,1,1,0,0,1,11,1,0,0,2,3,1,0,17,5,0,0,1,1,16,2,1,0,1,0,0,0,0,0,0,0,0,11,0,0,0,0,0,0,0,26,5,1,1,12,5,0,0,2,1,5,0,0,0,21,0,60,30,3,4,30,6,43,15,0,5,1,5,2,0,2,0,16,5,16,1,3,3,11,4,3,6,6,4,1,10,0,40,0,0,136,24,252
"Apparel,Accessories & Foo",173,134,45,75,58,5,1408,48,4,38,0,90,90,158,117,275,275,11,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,30,1,0,7,2,0,0,0,0,0,9,1,1,0,8,0,19,0,0,4,5,1,0,58,41,35,27,1,8,11,11,1,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,2,0,2,0,0,13,0,1,0,13,0,0,1,0,0,7,0,0,1,0,1,5,1,11,1,11,7,2,0,30,7,0,0,1,0,23,1,4,0,0,0,0,0,0,0,0,0,1,5,1,0,0,0,0,2,0,54,20,3,0,18,6,1,0,1,0,3,0,3,0,15,0,79,36,3,4,56,4,46,4,0,8,0,5,3,0,0,0,6,1,34,2,3,8,13,2,4,6,3,4,0,6,4,31,0,2,174,31,192
Insurance,170,90,32,57,31,7,738,55,11,41,0,107,107,159,158,317,317,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,2,0,48,0,0,2,2,0,0,1,0,0,6,0,0,0,9,0,19,0,2,11,13,0,0,62,36,33,45,1,21,6,1,0,31,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,0,0,2,3,0,0,2,0,0,0,0,0,4,0,0,3,0,2,13,1,3,1,3,2,1,0,12,0,0,1,1,0,14,1,0,0,0,0,1,0,0,0,0,0,3,13,2,0,0,0,0,0,0,17,10,0,1,5,0,0,0,0,0,3,0,0,0,35,0,51,27,0,0,22,1,41,14,0,6,0,2,0,0,6,0,3,2,13,2,1,14,13,2,3,9,3,6,0,4,0,20,0,0,134,36,205
Airlines,116,89,22,55,30,1,1168,18,1,6,0,25,25,75,95,170,170,5,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,3,0,0,1,33,0,0,3,0,0,0,0,0,0,1,0,0,0,23,0,2,0,0,1,0,0,0,37,9,40,23,3,10,4,1,1,14,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,8,0,1,3,0,0,1,3,0,1,0,2,3,5,0,0,1,0,2,5,1,2,1,9,5,0,0,23,2,0,1,1,0,22,0,1,0,0,0,0,0,0,0,0,0,1,5,0,0,0,0,0,0,0,31,13,1,1,11,6,2,0,2,0,1,0,1,0,3,0,27,25,1,0,31,5,18,8,0,1,0,3,1,0,1,0,3,9,14,5,0,1,2,3,1,5,2,0,0,4,0,32,0,1,100,4,123


### TESTER: Tokenize and find terms in text

In [None]:
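# The negative lookbehind only allows a match at the start of the string
# or right after a space or one of . , ? ! ;
reg = re.compile(r'(?<![^ .,?!;])theref(o|t)?')
test_string = 'therefore'

# match() anchors at position 0; the optional group consumes the 'o' in 'therefore'
matched = reg.match(test_string)
is_match = bool(matched)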

print(matched)
print(is_match)

<re.Match object; span=(0, 7), match='therefo'>
True
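
The tester above only checks a match at the very start of the string. As a minimal sketch (the sample sentences are made up for illustration), the same lookbehind can be tried with `search`, which scans the whole text the way `re.finditer` does in `word_indices`. It shows that the pattern fires at the start of the text or after a space, but not in the middle of a word, and that `match` alone would miss a term that is not at position 0.

```python
import re

# Same boundary-style lookbehind as the tester cell, applied to an illustrative term.
pattern = re.compile(r'(?<![^ .,?!;])theref(o|t)?')

def first_match(text):
    """Return the first matched substring anywhere in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

at_start = first_match('therefore we stop')      # start of string
after_space = first_match('and therefore stop')  # preceded by a space
mid_word = first_match('xtherefore')             # preceded by a letter, so no match
anchored = pattern.match('and therefore stop')   # match() only tries position 0

print(at_start, after_space, mid_word, anchored)
```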


In [None]:
def create_pattern_2(buck1, buck2, rangelist):
  # rangelist = [max gap between the two terms, max gap to a DEI context term]
  return [[buck1, buck2], rangelist]

def create_pattern_3(buck1, buck2, buck3, rangelist):
  # rangelist = [max gap 1-2, max gap 2-3, max gap 1-3, max gap to a DEI context term]
  return [[buck1, buck2, buck3], rangelist]

def word_indices(input_str, search_word_lst, DEI_contx_list):
  # Returns a dictionary mapping each regex term to the token indices where it is found in the text
  total_list = search_word_lst + DEI_contx_list

  # Build dict with all terms as keys w/ empty lists as values
  word_ind_dict = {w: [] for w in total_list}

  # Update lists: the token index of a match is the number of tokens before its start.
  # (An earlier version compared each token to the term for exact equality;
  # regex matching is used instead so that patterns and multi-word terms work.)
  for w in total_list:
    for match in re.finditer(w, input_str):
      before_str = word_tokenize(input_str[:match.start()])
      word_ind_dict[w].append(len(before_str))

  return word_ind_dict

def check_cooccur(word_ind_dict, terms_dict, DEI_contx_list):
  # terms_dict maps each term to a list of patterns; a pattern is [[bucket1, bucket2, ...], rangelist], e.g. [[['a'], ['b']], [4, 5]]
  flagged_terms = {}
  for key, value in terms_dict.items():
    print("key: ", key)
    print("value: ", value)
    is_match = 0
    match_pattern = []
    
    for pattern in value: # for one possible pattern for a term
      combo_buck = pattern[0] + [DEI_contx_list]
      print("combo_buck: ", combo_buck)
      combos = list(itertools.product(*combo_buck)) # Combos of 1-ea word + 1 DEI context term, per pattern
      for c in combos:
        print('combo: ', c)
        # Collect indices for each word
        ind_list = []
        for w in c:
          print("word: ", w)
          if w in word_ind_dict:
            ind_list.append(word_ind_dict[w]) # List of index lists, one per word in combo; an empty list yields no index combos below, so the combo cannot match
          #else:
            #word_ind_dict[w] = []
          print("index_list: ", ind_list)
        # Check if ind_list has enough lists (every word in combo is found in text)
        if len(ind_list) == len(c):
          print('same length')
          # Check if indices are in range
          combos_inds = list(itertools.product(*ind_list)) # All possible combos of indices
          print("list of combos_inds: ", combos_inds)
          for c_i in combos_inds:
            print("combo index: ", c_i)
            if len(pattern[1]) == 2:
              print("pattern ranges: ", pattern[1])
              subrange = [c_i[0],c_i[1]]
              subrange.sort()
              print("subrange: ", subrange)
              if (subrange[0]-(pattern[1][1]+1) <= c_i[2] <= subrange[1]+(pattern[1][1]+1)) and (abs(c_i[0] - c_i[1])-1 <= pattern[1][0]):
                is_match = 1
                print("match found")
                match_pattern.append([c,pattern[1]])
                print('match patterns: ', match_pattern)
            elif len(pattern[1]) == 4:
              print("pattern ranges: ", pattern[1])
              subrange = [c_i[0],c_i[1],c_i[2]]
              subrange.sort()
              print("subrange: ", subrange)
              if (subrange[0]-(pattern[1][3]+1) <= c_i[3] <= subrange[2]+(pattern[1][3]+1)) and (abs(c_i[0] - c_i[1])-1 <= pattern[1][0]) and (abs(c_i[1] - c_i[2])-1 <= pattern[1][1]) and (abs(c_i[0] - c_i[2])-1 <= pattern[1][2]):
                is_match = 1
                print("match found")
                match_pattern.append([c,pattern[1]])
                print('match patterns: ', match_pattern)
            else:
              print("wrong range length")
              if is_match == 0:
                is_match = -1 # flag the malformed pattern without discarding an earlier match
    
    if is_match == 1:
      flagged_terms[key] = match_pattern
    
  return flagged_terms
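
The two-bucket range test inside `check_cooccur` is easier to read in isolation. The sketch below restates that condition as a standalone function; `in_window_2` and its parameter names are hypothetical and not part of the notebook. The two search terms must have at most `max_gap_terms` tokens between them, and the DEI context term must fall inside the span of the two terms widened by `max_gap_dei + 1` on each side.

```python
def in_window_2(i0, i1, dei_idx, max_gap_terms, max_gap_dei):
    """Hypothetical helper mirroring the two-bucket check in check_cooccur."""
    lo, hi = sorted([i0, i1])
    # Number of tokens strictly between the two search terms
    terms_close = abs(i0 - i1) - 1 <= max_gap_terms
    # The context term may sit before, between, or after the two terms
    dei_close = lo - (max_gap_dei + 1) <= dei_idx <= hi + (max_gap_dei + 1)
    return terms_close and dei_close

# 29 tokens lie between indices 24 and 54, within a gap of 40;
# a context term at index 20 lies inside the widened span [-7, 85].
ok = in_window_2(24, 54, 20, max_gap_terms=40, max_gap_dei=30)
# With a maximum gap of 10 the two terms are too far apart.
too_far = in_window_2(24, 54, 20, max_gap_terms=10, max_gap_dei=30)
# A context term at index 95 lies beyond the upper bound of 85.
dei_far = in_window_2(24, 54, 95, max_gap_terms=40, max_gap_dei=30)

print(ok, too_far, dei_far)
```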

In [None]:
s = '* Wood Dale , March 25, 2021 (GLOBE NEWSWIRE) -- AAR (NYSE: AIR), a l(ea-d)ing\'s avi-ation ser--vices $14.3.5 provider to commercial and governments operators, MROs and OEMs worldwide, announces that its EAGLE Career Pathways program for aircraft maintenance technicians (AMTs) has been recognized by the of Labor\'s (DOL) Employment and Training Division as a nationally registered apprenticeship.'
s_block = word_tokenize(s)
s_block

['*',
 'Wood',
 'Dale',
 ',',
 'March',
 '25',
 ',',
 '2021',
 '(',
 'GLOBE',
 'NEWSWIRE',
 ')',
 '--',
 'AAR',
 '(',
 'NYSE',
 ':',
 'AIR',
 ')',
 ',',
 'a',
 'l',
 '(',
 'ea-d',
 ')',
 'ing',
 "'s",
 'avi-ation',
 'ser',
 '--',
 'vices',
 '$',
 '14.3.5',
 'provider',
 'to',
 'commercial',
 'and',
 'governments',
 'operators',
 ',',
 'MROs',
 'and',
 'OEMs',
 'worldwide',
 ',',
 'announces',
 'that',
 'its',
 'EAGLE',
 'Career',
 'Pathways',
 'program',
 'for',
 'aircraft',
 'maintenance',
 'technicians',
 '(',
 'AMTs',
 ')',
 'has',
 'been',
 'recognized',
 'by',
 'the',
 'of',
 'Labor',
 "'s",
 '(',
 'DOL',
 ')',
 'Employment',
 'and',
 'Training',
 'Division',
 'as',
 'a',
 'nationally',
 'registered',
 'apprenticeship',
 '.']

In [None]:
s = 'a leading aviation services female provider to commercial and governments operators, MROs and OEMs worldwide, announces that its women EAGLE Career Pathways program for aircraft maintenance woman technicians (AMTs) has been recognized by the of Labor\'s (DOL) work Employment and Training Division as a female nationally registered apprenticeship.'
wl = ['program', 'apprentice', 'work']
DEI_contx_list = ['wom(e|a)n', 'female', 'gender']
#twl = ['a', 'leading', 'operators', 'services', 'wow', 'hello']
#wid = word_indices(s, wl, twl)
wid = word_indices(s, wl,DEI_contx_list)
wid

{'apprentice': [54],
 'female': [4, 51],
 'gender': [],
 'program': [24],
 'wom(e|a)n': [20, 28],
 'work': [44]}
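
The indices above are token positions, not character offsets: each regex match is located by counting the tokens that precede it. A minimal sketch of that idea follows, assuming plain whitespace splitting as a simplified stand-in for `word_tokenize` (so the numbers differ from the NLTK-based output above); `token_indices` and the sample sentence are illustrative only. It also shows why `apprentice` is found inside `apprenticeship`: terms are matched as regex substrings.

```python
import re

def token_indices(text, term):
    """Token index of each regex match: count the tokens that precede it.

    str.split() is a simplified stand-in for nltk's word_tokenize.
    """
    return [len(text[:m.start()].split()) for m in re.finditer(term, text)]

text = 'its women program for woman technicians is an apprenticeship'
women_idx = token_indices(text, 'wom(e|a)n')
prog_idx = token_indices(text, 'program')
# 'apprentice' is matched as a substring of 'apprenticeship'.
appr_idx = token_indices(text, 'apprentice')

print(women_idx, prog_idx, appr_idx)
```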

In [None]:
p0 = create_pattern_2(['program'],['apprentice'], [40,30])
p1 = create_pattern_3(['program'],['apprentice'],['work'], [40, 40,40,30])
td = {'test': [p0,p1]}
ft = check_cooccur(wid, td, DEI_contx_list)

c:  ('program', 'apprentice', 'wom(e|a)n')
w in combo:  program
w index list:  [24]
w in combo:  apprentice
w index list:  [54]
w in combo:  wom(e|a)n
w index list:  [20, 28]
ind_list has enough lists
list of all possible combos of indices:  [(24, 54, 20), (24, 54, 28)]
c:  ('program', 'apprentice', 'female')
w in combo:  program
w index list:  [24]
w in combo:  apprentice
w index list:  [54]
w in combo:  female
w index list:  [4, 51]
ind_list has enough lists
list of all possible combos of indices:  [(24, 54, 4), (24, 54, 51)]
c:  ('program', 'apprentice', 'gender')
w in combo:  program
w index list:  [24]
w in combo:  apprentice
w index list:  [54]
w in combo:  gender
w index list:  []
ind_list has enough lists
list of all possible combos of indices:  []
c:  ('program', 'apprentice', 'work', 'wom(e|a)n')
w in combo:  program
w index list:  [24]
w in combo:  apprentice
w index list:  [54]
w in combo:  work
w index list:  [44]
w in combo:  wom(e|a)n
w index list:  [20, 28]
ind_list has

In [None]:
ft

{'test': [[('program', 'apprentice', 'wom(e|a)n'), [40, 30]],
  [('program', 'apprentice', 'wom(e|a)n'), [40, 30]],
  [('program', 'apprentice', 'female'), [40, 30]],
  [('program', 'apprentice', 'female'), [40, 30]],
  [('program', 'apprentice', 'work', 'wom(e|a)n'), [40, 40, 40, 30]],
  [('program', 'apprentice', 'work', 'wom(e|a)n'), [40, 40, 40, 30]],
  [('program', 'apprentice', 'work', 'female'), [40, 40, 40, 30]],
  [('program', 'apprentice', 'work', 'female'), [40, 40, 40, 30]]]}

In [None]:
p0 = create_pattern_3(['aviation'],['commercial', 'government', 'a'],['operators'], [10,10,10,30])
p1 = create_pattern_2(['leading','services'],['aviation'],[4,50])
td = {'aviation services': [p0,p1]}
DEI_contx_list = ['program', 'apprenticeship']
#DEI_contx_list = create_DEI_context_list(DEI_contx_list_pre)
ft = check_cooccur(wid, td, DEI_contx_list)
ft
#td = {
   # 'aviation services': [[[['aviation'],['commercial', 'government'],['operators']],[10,10,10,30]]],[[['leading','services'],['aviation']],[4,30]]]
#}
#(term,[[pattern1],[pattern2],[[[bucket1],[bucket2]],[4,5]]])
#check_cooccur(word_ind_dict, terms_dict, DEI_contx_list)

key:  aviation services
value:  [[[['aviation'], ['commercial', 'government', 'a'], ['operators']], [10, 10, 10, 30]], [[['leading', 'services'], ['aviation']], [4, 50]]]
combo_buck:  [['aviation'], ['commercial', 'government', 'a'], ['operators'], ['program', 'apprenticeship']]
combo:  ('aviation', 'commercial', 'operators', 'program')
word:  aviation
index_list:  [[2]]
word:  commercial
index_list:  [[2], [6]]
word:  operators
index_list:  [[2], [6], [9]]
word:  program
index_list:  [[2], [6], [9], [22]]
same length
list of combos_inds:  [(2, 6, 9, 22)]
combo index:  (2, 6, 9, 22)
pattern ranges:  [10, 10, 10, 30]
subrange:  [2, 6, 9]
match found
match patterns:  [[('aviation', 'commercial', 'operators', 'program'), [10, 10, 10, 30]]]
combo:  ('aviation', 'commercial', 'operators', 'apprenticeship')
word:  aviation
index_list:  [[2]]
word:  commercial
index_list:  [[2], [6]]
word:  operators
index_list:  [[2], [6], [9]]
word:  apprenticeship
index_list:  [[2], [6], [9], [49]]
same le

{'aviation services': [[('aviation', 'commercial', 'operators', 'program'),
   [10, 10, 10, 30]],
  [('aviation', 'a', 'operators', 'program'), [10, 10, 10, 30]],
  [('leading', 'aviation', 'program'), [4, 50]],
  [('leading', 'aviation', 'apprenticeship'), [4, 50]],
  [('services', 'aviation', 'program'), [4, 50]],
  [('services', 'aviation', 'apprenticeship'), [4, 50]]]}

### Old work

In [None]:
# Other ideas:
# To find more risk terms, do an n-gram analysis on articles with at least one practice term and with labor indicator = 1

# Other words to add:
# Manipulated wages, working conditions (usually has negative adj before it)

# Observations from 10-31 output:
# Articles with "corruption"/"fraud" usually do not relate to workers rights 
# Term "due diligence" occurs in relation to topics like company's debts, environment-related practices

# things to do with 10-31 output
# (DONE) See if articles marked Relevant=Yes also have labor indicator
# (Yes) If an article talks about sustainable AND ethical sourcing, does that allude to worker rights? 
# (DONE, not useful) Create worker rights org indicator
# (Started) Create "supplier relationship" indicator (doing this manually, since there are so many ways of talking about this)
# (DONE) Create updated heatmaps of practice/risk terms found in the articles that produce relevant co-occurrences
# (DONE) Calculate shares (percentages) of articles/events with practice-risk co-occurrence, by industry
# Odds-ratio test at industry level
    # Would this explain likelihood of ANY practice contributing to ANY risk? Or will we do it on term level (not enough cooccurrences to do term level by industry)
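
For the industry-level odds-ratio idea noted above, a rough sketch of the arithmetic might look like the following. It assumes each industry can be summarized as a 2x2 table of article counts (practice term present or absent, risk term present or absent). The function names and the counts are hypothetical, and the numbers are illustrative rather than taken from the TVL data.

```python
def odds_ratio(both, practice_only, risk_only, neither):
    """Odds ratio for a 2x2 table of article counts.

    Values above 1 mean risk terms are more likely in articles that
    also contain a practice term. Returns None if undefined.
    """
    if practice_only == 0 or risk_only == 0:
        return None
    return (both * neither) / (practice_only * risk_only)

def table_from_totals(n_articles, n_practice, n_risk, n_both):
    """Derive the four cells from marginal counts for one industry."""
    practice_only = n_practice - n_both
    risk_only = n_risk - n_both
    neither = n_articles - n_practice - n_risk + n_both
    return n_both, practice_only, risk_only, neither

# Illustrative counts only (not taken from the TVL data).
cells = table_from_totals(n_articles=1000, n_practice=200, n_risk=100, n_both=50)
ratio = odds_ratio(*cells)

print(cells, ratio)
```

This treats ANY practice term against ANY risk term; a term-level version would need enough co-occurrences per industry to be meaningful.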

In [None]:
# DEI Context Terms
DEI_context_category_to_term_mapping_SIMPLE = {
    'DEI-Context': ['ethnic', 'disabili',
                    'marital status', 'working mother', 'pregnant',
                    'bias', 'religio', 'marginaliz', 'inclusiv','divers', 'access'
                    ],
}

DEI_context_category_to_term_mapping_COMPLEX = {
    'DEI-Context': {'race': attach_regex_to_beginning_of_terms(['race', 'racism', 'racist', 'racial', 'bipoc', 'people of colo[u]?r', 'blackface']),
                    'familial status': attach_regex_to_beginning_of_terms(['working (famil|parent|mother|mom|father|dad)']), 
                    'military status': attach_regex_to_beginning_of_terms(['veteran', '(service|guard|reserve) member']),
                    'minorit': attach_regex_to_beginning_of_terms(['minorities', 'minority group']),
                    'LGBT': attach_regex_to_beginning_of_terms(['lgbt', 'sexual orientation', 'gender identity', 'gay', 'lesbian', 
                                                                'bisexual', 'transgender', 'queer', 'asexual', 'homophobia', 'nonbinary']),
                    'gender-M/F': attach_regex_to_beginning_of_terms(['wom(e|a)n', 'female', 'gender','working (mother|mom)', 'based on sex',
                                                                      'pregnant', 'on the basis of sex', 'maternity leave', 
                                                                      'sexist', 'sex discrimination']),
                    'age': attach_regex_to_beginning_of_terms(['age discrimin', 'age bias', 'ageism', 'average age', 'older', 'old']),
                    'youth': attach_regex_to_beginning_of_terms(['youth', 'young', 'next generation']),
                    # 'justice': attach_regex_to_beginning_of_terms(['racial justice', 'social justice']),
                    # 'equity': attach_regex_to_beginning_of_terms(['racial (in)?equity', 'gender [in]?equity', 'social [in]?equity', '[in]?equitable']),
                    # 'engagement': attach_regex_to_beginning_of_terms(['worker engagement', 'employee engagement']),
                    # 'equality': attach_regex_to_beginning_of_terms(['racial [in]?equality', 'gender [in]?equality', 'social [in]?equality']),
                    'nationality': attach_regex_to_beginning_of_terms(['nationality', 'national origin', 'foreign nationals']),
                    'underrepresented': attach_regex_to_beginning_of_terms(['under(-)?represented']),
                    'migrant': attach_regex_to_beginning_of_terms(['(im)?migrant', 'foreigner', 'visa', 'citizen', 'foreign worker']),
                    'education/skill level': attach_regex_to_beginning_of_terms(['entry(.*)level','education level',
                                                                                 'college degree', 'undergraduate degree', 
                                                                                 'graduate degree', 'high school diploma',
                                                                                 '(low|un|semi|high)(-|ly | |)skill']),
                    'economic status': attach_regex_to_beginning_of_terms(['economic status', 'economic class', 'low(.*)income', #'high(.*)income',
                                                                          'impoverish', 'poverty', 'middle class', 'working class']),
                    'criminal history': attach_regex_to_beginning_of_terms(['criminal history', 'felon', 'background check', 'convict']),
                    #'precarious worker': attach_regex_to_beginning_of_terms(['temporary(.*)worker', 'temporary(.*)employee', 'contract labo[u]?rer',
                     #                                                        'contract worker', 'contractor', 'seasonal(.*)employee', 'seasonal(.*)worker',
                     #                                                        'part(.*)time worker', 'part(.*)time employee']),
                    'DEI abroad-factory': attach_regex_to_beginning_of_terms(['factory work']),},

}
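
A note on the optional prefixes in these patterns: square brackets define a character class, so a pattern such as `[in]?equity` (as in the commented-out entries) allows at most one extra character, `i` or `n`, not the two-letter prefix `in`. The short check below is a sketch using plain `re` without the notebook's helper functions. It compares the bracket form with the group form `(in)?` and shows that a single optional letter, as in `colo[u]?r`, is unaffected.

```python
import re

# '[in]?' is a character class: at most ONE character, 'i' or 'n'.
char_class = re.compile(r'gender [in]?equity')
# '(in)?' is a group: the two-letter prefix 'in', or nothing.
group = re.compile(r'gender (in)?equity')

samples = ('gender equity', 'gender inequity')
class_hits = [bool(char_class.search(s)) for s in samples]
group_hits = [bool(group.search(s)) for s in samples]

# A single optional letter is fine either way: 'colo[u]?r' behaves like 'colou?r'.
colour_hits = [bool(re.search(r'colo[u]?r', s)) for s in ('color', 'colour')]

print(class_hits, group_hits, colour_hits)
```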

In [None]:
practice_category_to_term_mapping_SIMPLE = {

    'Talent-Attraction-Retention': [],
    'Product-DMD': [],
    'Community-Relations': [],
    'Innovation-Risk-Recognition': []

}
 
practice_category_to_term_mapping_COMPLEX = {
   'Talent-Attraction-Retention': {
        'talent attraction': # talent attraction (for attraction, hiring, advancement, development, mentorship)
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['attract(s|ed|ion)?','hir(ing|e(s|d)?)'],['talent(s|ed)?','skill(s|ed)?'],0,7),
          ]),
        'talent retention': #(among DEI workers)
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['retain(ed)?', 'retention','keep', 'preserv(e|ation)', 'maintain'],['talent(s|ed)?','skill(s|ed)?'],0,7),
          ]),
        'programs/initiatives-attract': #programs/initiatives  (for attraction, hiring, advancement, development, mentorship)
          attach_regex_to_beginning_of_terms([
            regex_3_all_n_range(['program(s)?', 'initiative(s)?', 'campaign(s)?'],['attract(s|ed|ion)?','hir(ing|e(s|d)?)'],['talent(s|ed)?','skill(s|ed)?','leader(s|ship)?'],0,7,0,5,0,5),
          ]),  
        'programs/initiatives-retain': #programs/initiatives  (for attraction, hiring, advancement, development, mentorship)
          attach_regex_to_beginning_of_terms([
            regex_3_all_n_range(['program(s)?', 'initiative(s)?', 'campaign(s)?'],['retain(ed)?', 'retention','keep', 'preserv(e|ation)', 'maintain'],['talent(s|ed)?','skill(s|ed)?','leader(s|ship)?'],0,7,0,5,0,5),
            regex_3_all_n_range(['program(s)?', 'initiative(s)?', 'campaign(s)?'],['advance(ment)?', 'develop(ment)?', 'mentor(ship)?', 'promot(e(d)?|ion(s)?)'],['talent(s|ed)?','skill(s|ed)?','leader(s|ship)?'],0,7,0,5,0,5),
          ]),
        'workforce composition': #low representation/rates of (DEI) in workforce OR management
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['(under( |-)?)?represent(ation|ed)', 'demographic(s)?','composition', 'make( |-)?up of', 'only', 'few'],
                             ['executive(s)?', 'director(s)?', 'board( member(s)?)?', 'manage(ment|r(s)?)','level', 'leader(s|ship)?', 'C-suite', 'employee(s)?', 'work(force|er(s)?|place)'],0,7),
          ]),
        'aging workforce':
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['aging', 'old(er)?'], ['employee(s)?', 'work(force|er(s)?|place)'],0,7),
          ]),
        'reporting':
          attach_regex_to_beginning_of_terms([
            regex_3_all_n_range(['report(ed|s)?', 'release(s|ed)?'], ['data'],['(under( |-)?)?represent(ation|ed)', 'demographic(s)?','composition', 'make( |-)?up of'],0,7,0,4,0,8),
          ]),

    },

    'Product-DMD': {
        'programs/initiatives': #programs/initiatives (for DEI customers: interpreters, inclusive/diverse dietary offerings)
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['program(s)?', 'initiative(s)?', 'campaign(s)?'],['customer(s)?','client(s)?','consumer(s)?'],0,5),
          ]),  
        'marketing': #marketing (non/inclusive)
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['marketing', 'branding'],['customer(s)?','client(s)?','consumer(s)?', 'public'],0,10),
          ]),
    },

    'Community-Relations': {
        'programs/initiatives': #programs/initiatives (for stakeholder cooperation/engagement), # more: donate/charity
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['program(s)?', 'initiative(s)?', 'campaign(s)?', 'engage(s|ment)?', 'collaborat(e(d)?|ion)', 'cooperat(e(d)?|ion)'],['communit(ies|y)','local(s)?','stakeholder(s)?'],0,10),
          ]),  
    },

    'Innovation-Risk-Recognition': {
        'product design': 
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['product(s)?', 'service(s)?'],['design(s|ed)?'],0,7),
          ]),
        'on-the-job training': 
         attach_regex_to_beginning_of_terms([
            regex_or_n_range(['train(ing|ed|s)?'],['on( |-)the( |-)job'],0,5),
          ]),
    },
}

In [None]:
# RISK term categories
risk_category_to_term_mapping_SIMPLE = {

    'Talent-Attraction-Retention': [],
    'Product-DMD': [],
    'Community-Relations': [],
    'Innovation-Risk-Recognition': []

}

risk_category_to_term_mapping_COMPLEX = {
    'Talent-Attraction-Retention': {
        'labor shortage': # sustained labor shortage
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['shortage(s)?'],['labo(u)?r','work(force|er|ers)','employee(s)?'],0,7),
            regex_or_n_range(['shortage(s)?'],['significant','persistent', 'pervasive','critical','massive','severe','sustained','well( |-)(known|documented)'],0,10),
            regex_or_n_range(['shortage(s)?'],['opportunit(y|ies)','opening(s)?'],0,15)
          ]),
        'skill shortage':  
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['shortage(s)?'],['((low|un|semi|high)(-|ly | |))?skill(s|ed)?','talent(ed)?'],0,7),
          ]),
        'skill gap': # additional
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['((low|un|semi|high)(-|ly | |))?skill(s|ed)?', 'talent(ed)?'],['gap'],0,5),
          ]),
        'unfilled positions': 
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['position(s)?', 'opportunit(y|ies)','opening(s)?'],['unfilled','empty','open'],0,6),
          ]),
        'discrimination lawsuit':
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['lawsuit', 'sue(d)?', 'settle(ment)?'],['discriminat(ion|e|ed)'],0,15),
          ]),
        'attrition': #(among DEI workers)
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['turnover'],['high', 'work(er|force)', 'employee', 'rate', 'voluntary'],0,5),
            regex_or_n_range(['attrition'],['high', 'work(er|force)', 'employee', 'rate'],0,5),
            regex_or_n_range(['quit'],['high', 'rate'],0,5),
          ]),
        
    },

    'Product-DMD': {
        'consumer trust':
          attach_regex_to_beginning_of_terms([
            regex_3_all_n_range(['trust'],['consumer(s)?','client(s)?', 'customer(s)?', 'user(s)?', 'public'],['maintain','keep','cultivate','preserve','protect','retain','sustain','uphold'],0,3,0,5,0,5),
          ]),
        'boycott':
          attach_regex_to_beginning_of_terms(['boycott(s)?']),
        'public backlash': #social media campaigns/backlash -> positive option, brand damage
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['social media', 'online'],['backlash', 'outrage', 'ang(er|ry)', 'condemn(ation)?', 'fury', 'indignation', 'shock(ed)?', 'offend(ed)?', 'criticism'],0,7),
            regex_or_n_range(['public'],['backlash', 'outrage', 'ang(er|ry)', 'condemn(ation)?', 'fury', 'indignation', 'shock(ed)?', 'offend(ed)?', 'criticism'],0,7),
          ]),
        'reinforce stereotypes':
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['stereotype(s)?'],['perpetuate(s|d)?', 'reinforce(s|d)?', 'bolster(s|d)?', 'emphasize(s|d)?'],0,5)
          ]),
    },

    'Community-Relations': {
        'operational delay/shutdown':
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['delay(ed|s|ing)?'],['operation(s|al)?'],0,10),
            regex_or_n_range(['stall(ed|s|ing)?'],['operation(s|al)?'],0,10),
            regex_or_n_range(['shut(-| )?down(s)?'],['operation(s|al)?'],0,10),
          ]),
        'community backlash/conflict': 
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['communit(y|ies)'],['protest(s|ed)?', 'demonstrat(ions|ion)','backlash', 'conflict', 'clash', 'violen(ce|t)', 'outrage', 'ang(er|ry)', 'condemn(ation)?', 'fury', 'indignation', 'shock(ed)?', 'offen(d|ded|se)', 'criticism'],0,7),
            regex_or_n_range(['local(s)?'],['protest(s|ed)?', 'demonstrat(ions|ion)','backlash', 'conflict', 'clash', 'violen(ce|t)', 'outrage', 'ang(er|ry)', 'condemn(ation)?', 'fury', 'indignation', 'shock(ed)?', 'offen(d|ded|se)?', 'criticism'],0,7),
            regex_3_n_range(['communit(y|ies)','local(s)?'],['social media', 'online'],['backlash', 'outrage', 'ang(er|ry)', 'condemn(ation)?', 'fury', 'indignation', 'shock(ed)?', 'offend(ed)?', 'criticism'],0,7,0,4,0,4),
          ]),
    },

    'Innovation-Risk-Recognition': {
        'disruptive turnover': #(among DEI workers)
          attach_regex_to_beginning_of_terms([
            regex_or_n_range(['turnover', 'attrition'],['disruptive', 'high'],0,5),
          ]),
    }

}
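
Every entry in the mappings above is wrapped in `attach_regex_to_beginning_of_terms`, which prepends the lookbehind `(?<![^ .,?!;])` seen in the compiled patterns printed later in this notebook. The sketch below shows what that prefix does for the `boycott(s)?` term. It assumes the text has already been lowercased, which is not shown in this section.

```python
import re

# Lookbehind prefix as it appears in the notebook's compiled patterns:
# the term must sit at the start of the text or directly after a space
# or one of . , ? ! ;
PREFIX = '(?<![^ .,?!;])'
boycott = re.compile(PREFIX + 'boycott(s)?')

after_space = boycott.search('calls for a boycott') is not None
at_start = boycott.search('boycott now') is not None
# A hyphen is not in the allowed set, so the term is rejected here
after_hyphen = boycott.search('anti-boycott law') is not None
# The prefix constrains only the start of the term, so longer words still match
inside_longer_word = boycott.search('the boycotting continued') is not None

print(after_space, at_start, after_hyphen, inside_longer_word)
```

The prefix guards only the left edge of a term. Terms built with the `regex_*_n_range` helpers also get `\b` on both sides.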


In [None]:
# Legacy 'Financial-Loss' entries (not part of the current risk mapping):
#    'Financial-Loss': {'pay damages': attach_regex_to_beginning_of_terms(['pay(.*)damage', '(agreed|had|forced)(.*)to pay']),
#              'seizure of assets': attach_regex_to_beginning_of_terms(['seiz(.*)asset'])},

In [None]:
def regex_n_range(term_list_1, term_list_2, i, n):
  '''
  Takes in 2 lists of terms and returns a regex pattern that matches when at
  least one term in list 1 is followed, within [i,n] intervening words, by at
  least one term in list 2.
  ORDER DEPENDENT
  '''
  group_1 = '|'.join(term_list_1)
  group_2 = '|'.join(term_list_2)
  # Raw strings keep \b, \W and \w as regex escapes rather than Python escapes
  final_string = r'\b('+group_1+r')(?:\W+\w+){'+str(i)+','+str(n)+r'}?\W+('+group_2+r')\b'

  return final_string


def regex_or_n_range(term_list_1, term_list_2, i, n):
  '''
  Implements regex_n_range for ORDER INDEPENDENT
  Takes in 2 lists of terms and returns regex pattern to identify if at least
  one term in each list is within [i,n] words of each other.
  '''
  group_1 = '|'.join(term_list_1)
  group_2 = '|'.join(term_list_2)
  final_string = r'\b(('+group_1+r')(?:\W+\w+){'+str(i)+','+str(n)+r'}?\W+('+group_2+')|('+group_2+r')(?:\W+\w+){'+str(i)+','+str(n)+r'}?\W+('+group_1+r'))\b'

  return final_string

def regex_3_n_range(term_list_1, term_list_2, term_list_3, i, n, i_2, n_2):
  r'''
  Takes in 3 lists of terms and returns regex pattern to identify if at least
  one term in list 1 is followed within [i,n] words by at least one term in
  list 2 AND that term is followed within [i_2,n_2] words by at least one term
  in list 3. ORDER DEPENDENT (list 1, then list 2, then list 3)
  Output example for (['shortage(s)?'], ['labo(u)?r','sustained'], ['no'], 0, 15, 1, 2):
  \b(shortage(s)?)(?:\W+\w+){0,15}?\W+(labo(u)?r|sustained)(?:\W+\w+){1,2}?\W+(no)\b
  '''
  group_1 = '|'.join(term_list_1)
  group_2 = '|'.join(term_list_2)
  group_3 = '|'.join(term_list_3)
  final_string = r'\b('+group_1+r')(?:\W+\w+){'+str(i)+','+str(n)+r'}?\W+('+group_2+r')(?:\W+\w+){'+str(i_2)+','+str(n_2)+r'}?\W+('+group_3+r')\b'

  return final_string

def regex_3_all_n_range(term_list_1, term_list_2, term_list_3, i_12, n_12, i_23, n_23, i_13, n_13):
  '''
  Implements regex_3_n_range for ORDER INDEPENDENT
  [i_12,n_12]: the acceptable words range between the terms in list 1 and list 2
  [i_23,n_23]: the acceptable words range between the terms in list 2 and list 3
  [i_13,n_13]: the acceptable words range between the terms in list 1 and list 3
  '''
  # One argument tuple per ordering of the three lists; each gap uses the
  # range of the two lists that sit next to each other in that ordering
  orderings = [
    (term_list_1, term_list_2, term_list_3, i_12, n_12, i_23, n_23),
    (term_list_1, term_list_3, term_list_2, i_13, n_13, i_23, n_23),
    (term_list_2, term_list_1, term_list_3, i_12, n_12, i_13, n_13),
    (term_list_2, term_list_3, term_list_1, i_23, n_23, i_13, n_13),
    (term_list_3, term_list_1, term_list_2, i_13, n_13, i_12, n_12),
    (term_list_3, term_list_2, term_list_1, i_23, n_23, i_12, n_12),
  ]
  # Strip the leading and trailing \b (2 characters each) from every ordering
  # so that the alternatives share a single pair of word boundaries
  perm = [regex_3_n_range(*args)[2:-2] for args in orderings]
  perm_str = '|'.join(perm)
  final_string = r'\b('+perm_str+r')\b'

  return final_string
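
The difference between the order-dependent and order-independent builders is easiest to see on sample sentences. The block below is a minimal, self-contained sketch: `gap`, `ordered` and `either_order` are hypothetical names that restate the string shapes of `regex_n_range` and `regex_or_n_range` so the example runs on its own, and the sentences are made up for illustration.

```python
import re

def gap(i, n):
    # i..n intervening words (matched lazily), then the separator before the next term
    return r'(?:\W+\w+){' + str(i) + ',' + str(n) + r'}?\W+'

def ordered(terms_1, terms_2, i, n):
    # Same shape as regex_n_range: a list 1 term must come first
    return r'\b(' + '|'.join(terms_1) + ')' + gap(i, n) + '(' + '|'.join(terms_2) + r')\b'

def either_order(terms_1, terms_2, i, n):
    # Same shape as regex_or_n_range: both orders joined with |
    g1, g2 = '|'.join(terms_1), '|'.join(terms_2)
    return r'\b((' + g1 + ')' + gap(i, n) + '(' + g2 + ')|(' + g2 + ')' + gap(i, n) + '(' + g1 + r'))\b'

skill_gap_ordered = ordered(['skill(s)?'], ['gap'], 0, 5)
skill_gap_either = either_order(['skill(s)?'], ['gap'], 0, 5)

forward = 'a widening skills gap in manufacturing'
backward = 'the gap in digital skills is growing'
too_far = 'skills are one of many things that can widen the gap'

found_forward_ordered = re.search(skill_gap_ordered, forward) is not None
found_backward_ordered = re.search(skill_gap_ordered, backward) is not None
found_backward_either = re.search(skill_gap_either, backward) is not None
found_too_far = re.search(skill_gap_either, too_far) is not None  # 9 words apart, limit is 5

print(found_forward_ordered, found_backward_ordered, found_backward_either, found_too_far)
```

The gap bounds count whole intervening words, so `0,5` allows the two terms to be adjacent or separated by up to five words.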

In [None]:
try_string1 = regex_3_n_range(['shortage(s)?'], ['labo(u)?r','sustained'], ["no"], 0, 15, 1, 2)
print(try_string1)

\b(shortage(s)?)(?:\W+\w+){0,15}?\W+(labo(u)?r|sustained)(?:\W+\w+){1,2}?\W+(no)\b
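
To check what the printed pattern accepts, it can be run against a few sentences. This is a small sketch using the pattern exactly as printed above; the sample sentences are invented for illustration.

```python
import re

# Pattern copied from the printed output of try_string1
pattern = re.compile(r'\b(shortage(s)?)(?:\W+\w+){0,15}?\W+(labo(u)?r|sustained)(?:\W+\w+){1,2}?\W+(no)\b')

# list 1 term, then a list 2 term, then 1 to 2 words, then "no"
hit = pattern.search('a shortage of labour with almost no relief')
# "no" follows "labour" directly, but the second gap requires at least 1 word
too_close = pattern.search('a shortage of labour, no relief')
# Same terms in a different order: regex_3_n_range patterns are order dependent
wrong_order = pattern.search('no sustained shortage')

print(hit.group(0) if hit else None)
print(too_close, wrong_order)
```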


In [None]:
try_string2 = regex_3_all_n_range(['hello'], ['wow','sustained'], ["no"], 1,2,2,3,1,3)
print(try_string2)

\b((hello)(?:\W+\w+){1,2}?\W+(wow|sustained)(?:\W+\w+){2,3}?\W+(no)|(hello)(?:\W+\w+){1,3}?\W+(no)(?:\W+\w+){2,3}?\W+(wow|sustained)|(wow|sustained)(?:\W+\w+){1,2}?\W+(hello)(?:\W+\w+){1,3}?\W+(no)|(wow|sustained)(?:\W+\w+){2,3}?\W+(no)(?:\W+\w+){1,3}?\W+(hello)|(no)(?:\W+\w+){1,3}?\W+(hello)(?:\W+\w+){1,2}?\W+(wow|sustained)|(no)(?:\W+\w+){2,3}?\W+(wow|sustained)(?:\W+\w+){1,2}?\W+(hello))\b
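
The six alternatives in this output correspond to the six orderings of the three lists. As an alternative sketch (not the notebook's implementation), the same string can be generated with `itertools.permutations`, looking up each gap by the pair of lists that sit next to each other. The names `gap` and `all_orders_pattern` and the label-keyed dictionaries are assumptions made for this example.

```python
import itertools
import re

def gap(i, n):
    # i..n intervening words (matched lazily), then the separator before the next term
    return r'(?:\W+\w+){' + str(i) + ',' + str(n) + r'}?\W+'

def all_orders_pattern(term_lists, ranges):
    # term_lists: label -> list of terms
    # ranges: frozenset of two labels -> (min_words, max_words) for that pair
    alternatives = []
    for a, b, c in itertools.permutations(term_lists):
        ga, gb, gc = ('|'.join(term_lists[k]) for k in (a, b, c))
        i1, n1 = ranges[frozenset((a, b))]
        i2, n2 = ranges[frozenset((b, c))]
        alternatives.append('(' + ga + ')' + gap(i1, n1) + '(' + gb + ')' + gap(i2, n2) + '(' + gc + ')')
    return r'\b(' + '|'.join(alternatives) + r')\b'

term_lists = {'1': ['hello'], '2': ['wow', 'sustained'], '3': ['no']}
ranges = {
    frozenset(('1', '2')): (1, 2),
    frozenset(('2', '3')): (2, 3),
    frozenset(('1', '3')): (1, 3),
}
pattern = all_orders_pattern(term_lists, ranges)

# "no" ... "hello" ... "wow" is one of the six accepted orders
match_any_order = re.search(pattern, 'no one said hello to the wow') is not None
# Adjacent terms fail because every range here requires at least one word in between
match_adjacent = re.search(pattern, 'hello wow no') is not None
print(match_any_order, match_adjacent)
```

Keying the ranges by unordered pair makes it explicit that a range belongs to two lists regardless of which one appears first.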


In [None]:
'''
   'Risk': ['strike', 'sit-in', 'protest', 'boycott', 'disruption',
            'social license', 'operating cost', 'delay', 'sanction', 'restitution', 
            'fine', 'penalt', 'bankrupt', 'liabl', 'financial loss', 'lawsuit', 
            'litigation', 'impoundment', 'detain', 'penalt', 'sanction', 'court',
            'consent decree', 'brand damage', 'monetary damage', 'brand reputation',
            'brand recognition','trust', 'innovation', #decreased trust, maintaining trust
            'lost opportunit', 'resign', 'divest', 'conciliation agreement',
            'modern slavery', 'debt bondage', 'human traffic', 'alleg', 'accus',
            'exploit', 'publicly expose', 'investigat', 'enforcement', 
            'inspect', 'scandal', 'government action', 'share price', 'share value', 
            'positions', 'attrition', 'backlash', 'campaign', 'stereotypes', #'shortage'
            ],

    
    'Innovation-Risk-Recognition': {}
    'Other-Risk': {}

risk_category_to_term_mapping_COMPLEX = {
    'Risk': {
        'turnover': attach_regex_to_beginning_of_terms(['high turnover', 'worker turnover', 'employee turnover', 'turnover rate', 'voluntary turnover', 'quit rate']),
        'retention': attach_regex_to_beginning_of_terms(['low retention', 'retention rate', 'employee retention', 'worker retention']),
        'talent': attach_regex_to_beginning_of_terms(['attract talent', 'retain talent', 'find talent', 'talent acquisition'])
        'walkout': attach_regex_to_beginning_of_terms(['walk[- ]?out']),
        'social license': attach_regex_to_beginning_of_terms(['social licen[cs]e']),
        'withhold release order': attach_regex_to_beginning_of_terms(['withhold release order', 'wro']),
        'block import': attach_regex_to_beginning_of_terms(['block(.*)import', 'ban(.*)import', 'import(.*)ban',
                                                                          'prohibit(.*)import', 'import(.*)prohibit',
                                                                          'block(.*)entry', 'entry(.*)block',
                                                                          'seiz(.*)product', 'product(.*)seiz']),
        'pay damages': attach_regex_to_beginning_of_terms(['pay(.*)damage', '(agreed|had|forced)(.*)to pay']),
        'seizure of assets': attach_regex_to_beginning_of_terms(['seiz(.*)asset'])},
        'court-ordered relief': attach_regex_to_beginning_of_terms(['monetary relief', 'equitable relief', 
                                                                              'injunctive relief', 'back pay', 
                                                                              'front pay', 'compensatory damages',
                                                                              '(punitive|exemplary) damages']),
        'settlement': attach_regex_to_beginning_of_terms(['pay(.*)to settle']),
        'workplace shutdown': attach_regex_to_beginning_of_terms(['workplace shutdown', 'shutdown']),
        'forced labor': attach_regex_to_beginning_of_terms(['(forced|slave) labo[u]?r']),
        'child labor': attach_regex_to_beginning_of_terms(['child labo[u]?r', 'child slave']),
        'negative return': attach_regex_to_beginning_of_terms(['negative(.*)return'])
}
'''

In [None]:
'''
def attach_regex_to_beginning_of_terms(terms_lst, regex='(?<![^ .,?!;])'):
    # Prefix every term with `regex` (by default, a lookbehind requiring start of text, a space, or . , ? ! ; before the term)
    return [regex + term for term in terms_lst]

# RISK term categories
risk_category_to_term_mapping_SIMPLE = {
    'Employee/Talent-Retention': [],

   'Worker-Protest': ['strike', 'sit-in', 'operational disruption',
                      'protest', 'injury'], 
   'Consumer-Protest': ['boycott', 'protest', 'social license'],
 
   'Operational-Costs': ['operational disruption', 'operating cost', 'delay',
                         'disruption'],
 
   'Financial-Loss': ['sanction', 
                      #'reimburse', 
                      'restitution', 'fine', 
              'penalt', 'bankrupt', 'liabl', ' financial loss', 
              #'lost'
              ],
 
   'Legal-Risk': ['lawsuit', 'litigation', 'impoundment', 'detain',
                  'penalt', 'sanction', 'court', 'consent decree'],
 
   'Reputational-Damage': ['brand damage', 'monetary damage',
                           'brand reputation', 'brand recognition',
                           'social license',
                           'decreased trust',
                           'decreased innovation',
                           'lost opportunit', 'resign'
                           ],
 
   "Remedy": [#'reimburse', 
              'divest', 'restitution', 'conciliation agreement'],
 
   "Modern-Slavery": ['modern slavery', 'debt bondage', 'human traffic'],
 
   'Other': ['alleg', 'accus', 'exploit', 'publicly expose', 'investigat',
            'enforcement', 'security force', 'inspection', 'inspector'],
 
   'Other-RK': ['scandal', 'government action', 'share price', 'share value',
                #'investment',
                # 'sales'
                ]
}
 
risk_category_to_term_mapping_COMPLEX = {
    'Employee/Talent-Retention': {
        'turnover': attach_regex_to_beginning_of_terms(['high turnover', 'worker turnover', 'employee turnover', 'turnover rate', 'voluntary turnover', 'quit rate']),
        'retention': attach_regex_to_beginning_of_terms(['low retention', 'retention rate', 'employee retention', 'worker retention']),
        'talent': attach_regex_to_beginning_of_terms(['attract talent', 'retain talent', 'find talent', 'talent acquisition']) 
    },
   'Worker-Protest': {'walkout': attach_regex_to_beginning_of_terms(['walk[- ]?out'])},
   'Consumer-Protest': {'social license': attach_regex_to_beginning_of_terms(['social licen[cs]e'])},
   'Operational-Costs': {'withhold release order': attach_regex_to_beginning_of_terms(['withhold release order', 'wro']),
                      'block import': attach_regex_to_beginning_of_terms(['block(.*)import', 'ban(.*)import', 'import(.*)ban',
                                                                          'prohibit(.*)import', 'import(.*)prohibit',
                                                                          'block(.*)entry', 'entry(.*)block',
                                                                          'seiz(.*)product', 'product(.*)seiz']),},
   'Financial-Loss': {'pay damages': attach_regex_to_beginning_of_terms(['pay(.*)damage', '(agreed|had|forced)(.*)to pay']),
             'seizure of assets': attach_regex_to_beginning_of_terms(['seiz(.*)asset'])},
   'Legal-Risk': {'court-ordered relief': attach_regex_to_beginning_of_terms(['monetary relief', 'equitable relief', 
                                                                              'injunctive relief', 'back pay', 
                                                                              'front pay', 'compensatory damages',
                                                                              '(punitive|exemplary) damages']),
                  'settlement': attach_regex_to_beginning_of_terms(['pay(.*)to settle'])
                                                                              },
   'Reputational-Damage': {'workplace shutdown': attach_regex_to_beginning_of_terms(['workplace shutdown', 'shutdown']),
                           'social license': attach_regex_to_beginning_of_terms(['social licen[cs]e'])},
   "Remedy": {},
   "Modern-Slavery": {
                      'forced labor': attach_regex_to_beginning_of_terms(['(forced|slave) labo[u]?r']),
                      'child labor': attach_regex_to_beginning_of_terms(['child labo[u]?r', 'child slave'])},
 
   'Other': {},
   'Other-RK': {'negative return': attach_regex_to_beginning_of_terms(['negative(.*)return'])}
}


# PRACTICE term categories UPDATED 11/29/2021 from Raiha's supply chain list
practice_category_to_term_mapping_SIMPLE = {
    #'Employee/Talent-Retention': [],
   'Wages': [
             'wage',
             'wage theft',
             'living wage'],
   'Precarious-Work': [
                       'precarity', 'precarious work',
                       'gig work',
                       'alternative work',
                       'alternate work',
                       'contingent work',
                       # 'migrant',
                       'informal work',
                       'casual work',
                       'hazardous work'
                       ],
   'Mdrn-Slav-Risk': [# 'broker', 'agent',
                      'confinement', 'document retention',
                      'restriction of movement',
                      'delayed wage',
                      'pay manipulation',
                      'punishment', 'poor food', #'retaliat',
                      # 'sexual violence', 'sexual harassment',
                      'deprivation',
                      'unpaid wage',
                      'delayed payment',
                      'wage violation',
                       ],
 
   'Work-Conditions': [
                       # 'lockout',
                       'freedom of association',
                       'collective bargaining', 
                       'work stoppage',
                       'hotline',
                       #'worker retention'
                       ],
   'Good-Practices': ['code of conduct', 'due diligence',
                      'ethical recruit', # ethical recruitment
                      'handbook',  # supplier handbook
                      # 'supplier remediation',
                      'social audit',
                      # 'risk assessment',
                      'equal benefits',
                      'transparency', 'traceability', 'visibility', 'accessib',
                      # 'supply chain map',
                      'timely payments',
                      'union', 'worker committee', 
                      'empower', 'accommodati'
                      ],
                       
   'Neutral-Practices': [
                         # 'sourcing',
                         'outsourc', # Kept based on Joanne's feedback
                         # ‘raw material’,
                         'subcontracting',
                         # 'small-holder supply chain',
                         # 'overtime', 
                         # 'demand volatility',
                         'program', 'initiative', 'training', #'development',
                         'exempt', 'recruit', 'promotion', 'arbitration', 
                         'mentorship', 'affirmative action'
                         # investment (in people)
                         ],
   'Negative-Practices': [
                          # 'conflict',
                          'order delay',
                          'lead time', # previously 'short lead time'
                          'unplanned shipment',
                          'corruption',
                          # 'fraud',
                          'quota system', 'delayed payment',
                          'weak governance', 'wage violation',
                          'informal supply chain', 'last-minute order modification', 
                          'unfair timing demand', 'pricing pressure',
                          'poor forecasting', 'irresponsible exit'
                          ],
   'Other': []
}
 
practice_category_to_term_mapping_COMPLEX = {
   #'Employee/Talent-Retention': {
    #    'turnover': attach_regex_to_beginning_of_terms(['involuntary turnover', 'lay(.*)off']),
    #},

   'Wages': {# 'pricing': attach_regex_to_beginning_of_terms(['pricing', 'price']),
             'low wages': attach_regex_to_beginning_of_terms(['low wage', 'poverty-level wage']),
             'underpay': attach_regex_to_beginning_of_terms(['underpay', 'underpaid', 'inadequate pay', 'reduced pay']), 
             'DEI compensation': attach_regex_to_beginning_of_terms(['(un)?fair (pay|compensation)', '(un)?equal (pay|compensation)', 
                                                                     '(pay|wage|compensation) ((dis)?parit|bias|discrimination)',
                                                                     'pay equality', 'paid less', 'pay gap'])
             },
   'Precarious-Work': {'temporary work-employement': attach_regex_to_beginning_of_terms(['temporary( |-)(work|employ|contract)', 'non(-)?permanent (work|employ|contract)',
                                                                                         'seasonal( |-)(work|employ|contract)', 'part( |-)time worker']),
                       'contract labor': attach_regex_to_beginning_of_terms(['contract labo[u]?r ', 'contract work', 'contractor'])},
   'Mdrn-Slav-Risk': {
       # 'third party': attach_regex_to_beginning_of_terms(['third[- ]party'])
       'coercive labor': attach_regex_to_beginning_of_terms(['coercive labo[u]?r']),
       'prison labor': attach_regex_to_beginning_of_terms(['prison labo[u]?r']),
       'recruitment fee': attach_regex_to_beginning_of_terms(['recruitment(.*)fee']),
       'withhold wage': attach_regex_to_beginning_of_terms(['withh[oe]ld(ing)? wage']),
       'passport retention': attach_regex_to_beginning_of_terms(['passport retention', 'retention of passport', 'withh[oe]ld(ing)? passport'])
    },
    
    'Work-Conditions': {
        'collective bargaining agreement': attach_regex_to_beginning_of_terms(['collective bargaining agreement', 'cba']),
        'unsafe conditions': attach_regex_to_beginning_of_terms(['(toxic|hostile|poor|dire) work(ing| environment|place)', 'working conditions', 
                                                                 'dangerous', 'unsafe', 'unhealth', 'hazard', 'violence', 'assault']),
        'employee morale': attach_regex_to_beginning_of_terms(['employee morale', 'employee satisfaction']),
        'grievance mechanism': attach_regex_to_beginning_of_terms(['grievance mechanism', 'grievance system']),
        'reprisal/retaliation': attach_regex_to_beginning_of_terms(['retaliat', 'reprisal'])
    },
    
    'Good-Practices': {'code of conduct negative': attach_regex_to_beginning_of_terms(['code of conduct(.*)breach', 'breach(.*)code of conduct',
                                                                                    'violat(.*)code of conduct', 'code of conduct(.*)violat',
                                                                                    'non[- ]?compliance(.*)code of conduct', 'code of conduct(.*)non[- ]?compliance',
                                                                                    'break(.*)code of conduct',
                                                                                    'broken(.*)code of conduct', 'code of conduct(.*)broken',
                                                                                    'fail(.*)code of conduct']),
                      #'engagement': attach_regex_to_beginning_of_terms(['worker engagement', 'employee engagement']),
                      'flexible work': attach_regex_to_beginning_of_terms(['flexible work', 'flexible hour', 'remote( |-)work', 'work( |-)from( |-)home']),
                      'family leave': attach_regex_to_beginning_of_terms(['maternity leave', 'parental leave'])                                                         
   },
                      
                                             
   'Neutral-Practices': {'corrective action': attach_regex_to_beginning_of_terms(['corrective(.*)action', 'corrective(.*)plan', 'corrective(.*)measure']),
                         'hiring': attach_regex_to_beginning_of_terms(['hire', 'hiring', 'eligib', 'equal( employment)? opportunit']),
                         'termination/layoff': attach_regex_to_beginning_of_terms(['termination','terminated','layoff','laid( |-)off', 'fired'])
                         },

   'Negative-Practices': {
       # 'piece work': attach_regex_to_beginning_of_terms(['piece work', 'piece[- ]rate']),
       # 'production target': attach_regex_to_beginning_of_terms(['production target', 'production quota']),
       'hour violation': attach_regex_to_beginning_of_terms(['hour (law )?violation']),
       'canceled order': attach_regex_to_beginning_of_terms(['cancel[l]?ed order']),
       'overtime NEGATIVE': attach_regex_to_beginning_of_terms(['(forced|unpaid|chronic|mandatory|unlawful) overtime']),
       'lead time NEGATIVE': attach_regex_to_beginning_of_terms(['short lead time', 'inadequate lead time'])},

   'Other': {}
}
'''

In [None]:
# Adding extra cleaning terms here (during processing, any mentions of 
# practice/risk terms in the irrelevant contexts below will be excluded)

# agent
# practice_terms_regex_dict['Mdrn-Slav-Risk']['agent']['extra_cleaning'] = True
# practice_terms_regex_dict['Mdrn-Slav-Risk']['agent']['terms_to_remove'] = ['recruitment agent']

# third party
# practice_terms_regex_dict['Mdrn-Slav-Risk']['third party']['extra_cleaning'] = True
# practice_terms_regex_dict['Mdrn-Slav-Risk']['third party']['terms_to_remove'] = ['independent third party']

# race
#DEI_context_terms_regex_dict['DEI-Context']['race']['extra_cleaning'] = True
#DEI_context_terms_regex_dict['DEI-Context']['race']['terms_to_remove'] = ['arms race', 'a race to', 'the race to', 'rat race', 'to corrosive racial inequality']
'''
# turnover
risk_terms_regex_dict['Employee/Talent-Retention']['turnover']['extra_cleaning'] = True
risk_terms_regex_dict['Employee/Talent-Retention']['turnover']['terms_to_remove'] = ['involuntary turnover']
#practice_terms_regex_dict['Employee/Talent-Retention']['turnover']['extra_cleaning'] = True
#practice_terms_regex_dict['Employee/Talent-Retention']['turnover']['terms_to_remove'] = ['voluntary turnover']

# shutdown
risk_terms_regex_dict['Reputational-Damage']['workplace shutdown']['extra_cleaning'] = True
risk_terms_regex_dict['Reputational-Damage']['workplace shutdown']['terms_to_remove'] = ['year-end shutdown','maintenance shutdown']

# fine
risk_terms_regex_dict['Financial-Loss']['fine']['extra_cleaning'] = True
risk_terms_regex_dict['Financial-Loss']['fine']['terms_to_remove'] = ['finecast', 'fine fragrance', 'fine and tall', 'driftable fine', 'fine chemical', 'fine construction level', 'fine-grain', 'fine partic', 'finergreen', 'fine molecular', 'fine wine', 'fine product', 'fine paper', 'fine balanc', 'fine-tun', 'fine fib', 'finely', 'fine, sand', 'fine tea', 'finest', 'finesse']
 
# strike
risk_terms_regex_dict['Worker-Protest']['strike']['extra_cleaning'] = True
risk_terms_regex_dict['Worker-Protest']['strike']['terms_to_remove'] = [r'disaster(s\b|\b) strike', 'strike deal', 'weather strike', 'strike a conversation']


# reimbursement
risk_terms_regex_dict['Financial-Loss']['reimburse']['extra_cleaning'] = True
risk_terms_regex_dict['Financial-Loss']['reimburse']['terms_to_remove'] = ['tuition reimbursement']
risk_terms_regex_dict['Remedy']['reimburse']['extra_cleaning'] = True
risk_terms_regex_dict['Remedy']['reimburse']['terms_to_remove'] = ['tuition reimbursement']

# union
practice_terms_regex_dict['Good-Practices']['union']['extra_cleaning'] = True
practice_terms_regex_dict['Good-Practices']['union']['terms_to_remove'] = ['european union', 'customs union', 'american civil liberties union']
 
# pricing pressure
practice_terms_regex_dict['Negative-Practices']['pricing pressure']['context_words'] = ['supplier', 'factory', 'manufactur']


##### OLD CLEANING ###########
# fine
risk_terms_regex_dict['Fines']['fine']['extra_cleaning'] = True
risk_terms_regex_dict['Fines']['fine']['terms_to_remove'] = ['finecast', 'fine fragrance', 'fine and tall', 'driftable fine', 'fine chemical', 'fine construction level', 'fine-grain', 'fine partic', 'finergreen', 'fine molecular', 'fine wine', 'fine product', 'fine paper', 'fine balanc', 'fine-tun', 'fine fib', 'finely', 'fine, sand', 'fine tea', 'finest', 'finesse']

# strike
risk_terms_regex_dict['Worker-Protest']['strike']['extra_cleaning'] = True
risk_terms_regex_dict['Worker-Protest']['strike']['terms_to_remove'] = [r'disaster(s\b|\b) strike', 'strike deal', 'weather strike']

# reimbursement
risk_terms_regex_dict['Fines']['reimburse']['extra_cleaning'] = True
risk_terms_regex_dict['Fines']['reimburse']['terms_to_remove'] = ['tuition reimbursement']
risk_terms_regex_dict['Remedy']['reimburse']['extra_cleaning'] = True
risk_terms_regex_dict['Remedy']['reimburse']['terms_to_remove'] = ['tuition reimbursement']

'''
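
The processing step that consumes `extra_cleaning` and `terms_to_remove` is not part of this section, so the block below is only a guess at how such exclusions could be applied: irrelevant phrases are blanked out before the term regexes are counted. `count_mentions` is a hypothetical helper, the sample text is invented, and lowercased input is assumed.

```python
import re

# Lookbehind prefix as it appears in the notebook's compiled patterns
PREFIX = '(?<![^ .,?!;])'

def count_mentions(text, term_entry):
    # Hypothetical helper, not the pipeline's own processing code.
    if term_entry['extra_cleaning']:
        for phrase in term_entry['terms_to_remove']:
            # Entries are treated as regexes because some in the notebook
            # are written that way, e.g. r'disaster(s\b|\b) strike'
            text = re.sub(phrase, ' ', text)
    return sum(1 for regex in term_entry['regex_lst'] for _ in re.finditer(regex, text))

union_entry = {
    'regex_lst': [PREFIX + 'union'],
    'extra_cleaning': True,
    'terms_to_remove': ['european union', 'customs union', 'american civil liberties union'],
}

text = 'the european union met the workers union and the union leaders'
cleaned_count = count_mentions(text, union_entry)
raw_count = count_mentions(text, dict(union_entry, extra_cleaning=False))
print(cleaned_count, raw_count)
```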

In [None]:
practice_terms_regex_dict

{'Community-Relations': {'programs/initiatives': {'extra_cleaning': False,
   'regex_lst': [re.compile(r'(?<![^ .,?!;])\b((program(s)?|initiative(s)?|campaign(s)?|engage(s|ment)?|collaborat(e(d)?|ion)|cooperat(e(d)?|ion))(?:\W+\w+){0,10}?\W+(communit(ies|y)|local(s)?|stakeholder(s)?)|(communit(ies|y)|local(s)?|stakeholder(s)?)(?:\W+\w+){0,10}?\W+(program(s)?|initiative(s)?|campaign(s)?|engage(s|ment)?|collaborat(e(d)?|ion)|cooperat(e(d)?|ion)))\b',
    re.UNICODE)],
   'terms_to_remove': []}},
 'Innovation-Risk-Recognition': {'on-the-job training': {'extra_cleaning': False,
   'regex_lst': [re.compile(r'(?<![^ .,?!;])\b((train(ing|ed|s)?)(?:\W+\w+){0,5}?\W+(on( |-)the( |-)job)|(on( |-)the( |-)job)(?:\W+\w+){0,5}?\W+(train(ing|ed|s)?))\b',
    re.UNICODE)],
   'terms_to_remove': []},
  'product design': {'extra_cleaning': False,
   'regex_lst': [re.compile(r'(?<![^ .,?!;])\b((product(s)?|service(s)?)(?:\W+\w+){0,7}?\W+(design(s|ed)?)|(design(s|ed)?)(?:\W+\w+){0,7}?\W+(product(s)?|servic

In [None]:
# Structure of the returned dict:
# {'term_type_cat': {'clean term': {'regex_lst': ['fmt1', 'fmt2'], 'extra_cleaning': False, 'terms_to_remove': []}}}

def create_comprehensive_term_regex_cleaning_dict(
    term_type, category_to_term_mapping_SIMPLE, category_to_term_mapping_COMPLEX):
    '''
    Builds {category: {term: {'regex_lst', 'extra_cleaning', 'terms_to_remove'}}}
    from the SIMPLE mapping (category -> list of plain terms) and the COMPLEX
    mapping (category -> {clean term: list of regex strings}).
    Every category in the SIMPLE mapping must also be a key of the COMPLEX mapping.
    `term_type` is not used in the body.
    '''
    term_type_regexes_cleaning = {}
    for term_cat, term_cat_SIMPLE_lst in category_to_term_mapping_SIMPLE.items():

        # Create dictionary for term type category
        if term_cat not in term_type_regexes_cleaning:
            term_type_regexes_cleaning[term_cat] = dict()
        
        # Variable for new category dict for `term_type_regexes_cleaning`
        term_cat_dict = term_type_regexes_cleaning[term_cat]

        # SIMPLE terms: regex strings (not compiled) with the default prefix attached
        for term_SIMPLE in term_cat_SIMPLE_lst:
            term_regex_str_lst = attach_regex_to_beginning_of_terms([term_SIMPLE])
            term_cat_dict[term_SIMPLE] = {'regex_lst': term_regex_str_lst, 'extra_cleaning': False, 'terms_to_remove': []}  # TODO: Create detailed function for examples of the term to ignore, which will have extra_cleaning=True

        # COMPLEX terms: every regex string in the list is compiled
        term_cat_COMPLEX = category_to_term_mapping_COMPLEX[term_cat]
        for term_clean_COMPLEX, term_COMPLEX_regex_list in term_cat_COMPLEX.items():
            term_cat_dict[term_clean_COMPLEX] = {'regex_lst': [re.compile(regex) for regex in term_COMPLEX_regex_list], 'extra_cleaning': False, 'terms_to_remove': []} 

        term_type_regexes_cleaning[term_cat] = term_cat_dict

    return term_type_regexes_cleaning
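
A toy run makes the shape of the returned dictionary concrete. The block below is a self-contained sketch: `build_term_dict` is a hypothetical compact restatement of the function above, `'Toy-Category'` is an invented category, and the terms are borrowed from the legacy lists earlier in the notebook.

```python
import re

def attach_regex_to_beginning_of_terms(terms_lst, regex='(?<![^ .,?!;])'):
    return [regex + term for term in terms_lst]

def build_term_dict(simple, complex_):
    # Compact restatement of create_comprehensive_term_regex_cleaning_dict
    out = {}
    for cat, simple_terms in simple.items():
        entries = {}
        for term in simple_terms:
            entries[term] = {'regex_lst': attach_regex_to_beginning_of_terms([term]),
                             'extra_cleaning': False, 'terms_to_remove': []}
        for clean_term, regexes in complex_[cat].items():
            entries[clean_term] = {'regex_lst': [re.compile(r) for r in regexes],
                                   'extra_cleaning': False, 'terms_to_remove': []}
        out[cat] = entries
    return out

toy_simple = {'Toy-Category': ['hotline']}
toy_complex = {'Toy-Category': {
    'family leave': attach_regex_to_beginning_of_terms(['maternity leave', 'parental leave'])}}
toy_dict = build_term_dict(toy_simple, toy_complex)

simple_entry = toy_dict['Toy-Category']['hotline']
complex_entry = toy_dict['Toy-Category']['family leave']

# SIMPLE terms keep plain regex strings, COMPLEX terms hold compiled patterns
print(type(simple_entry['regex_lst'][0]).__name__, type(complex_entry['regex_lst'][0]).__name__)

# re.search accepts either form as long as no flags argument is passed
simple_found = re.search(simple_entry['regex_lst'][0], 'call the hotline') is not None
complex_found = re.search(complex_entry['regex_lst'][1], 'paid parental leave policy') is not None
print(simple_found, complex_found)
```

Because the two kinds of entry store different types, any downstream code that passes `flags` to `re.search` would need to handle the compiled patterns separately.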

In [None]:
DEI_context_terms_regex_dict = create_comprehensive_term_regex_cleaning_dict(
    term_type='DEI-context', 
    category_to_term_mapping_SIMPLE=DEI_context_category_to_term_mapping_SIMPLE, 
    category_to_term_mapping_COMPLEX=DEI_context_category_to_term_mapping_COMPLEX)

practice_terms_regex_dict = create_comprehensive_term_regex_cleaning_dict(
    term_type='practice', 
    category_to_term_mapping_SIMPLE=practice_category_to_term_mapping_SIMPLE, 
    category_to_term_mapping_COMPLEX=practice_category_to_term_mapping_COMPLEX)

risk_terms_regex_dict = create_comprehensive_term_regex_cleaning_dict(
    term_type='risk', 
    category_to_term_mapping_SIMPLE=risk_category_to_term_mapping_SIMPLE, 
    category_to_term_mapping_COMPLEX=risk_category_to_term_mapping_COMPLEX)
'''
supplier_relship_regex_dict = create_comprehensive_term_regex_cleaning_dict(
    term_type='supplier-relship', 
    category_to_term_mapping_SIMPLE=supplier_relship_category_to_term_mapping_SIMPLE, 
    category_to_term_mapping_COMPLEX=supplier_relship_category_to_term_mapping_COMPLEX)
'''

"\n# turnover\nrisk_terms_regex_dict['Employee/Talent-Retention']['turnover']['extra_cleaning'] = True\nrisk_terms_regex_dict['Employee/Talent-Retention']['turnover']['terms_to_remove'] = ['involuntary turnover']\n#practice_terms_regex_dict['Employee/Talent-Retention']['turnover']['extra_cleaning'] = True\n#practice_terms_regex_dict['Employee/Talent-Retention']['turnover']['terms_to_remove'] = ['voluntary turnover']\n\n# shutdown\nrisk_terms_regex_dict['Reputational-Damage']['workplace shutdown']['extra_cleaning'] = True\nrisk_terms_regex_dict['Reputational-Damage']['workplace shutdown']['terms_to_remove'] = ['year-end shutdown','maintenance shutdown']\n\n# fine\nrisk_terms_regex_dict['Financial-Loss']['fine']['extra_cleaning'] = True\nrisk_terms_regex_dict['Financial-Loss']['fine']['terms_to_remove'] = ['finecast', 'fine fragrance', 'fine and tall', 'driftable fine', 'fine chemical', 'fine construction level', 'fine-grain', 'fine partic', 'finergreen', 'fine molecular', 'fine wine', '


In [None]:
# covid_lp_relevant_keywords = ['block', 'review', 'police', 'issue', r'licen[cs]e', 'wage', 'revoke', 'temporary'] as of 10/11
# pattern_covid_lp_relevant_keywords = '|'.join(covid_lp_relevant_keywords)

# Adaptation of Raiha's supply chain labor practice heuristic - get_labor_indicator_v2  
def get_DEI_indicator_v2(row, DEI_keywords): 
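    """Count how many of the given keywords were found in this row, 
    based on the `{keyword}_ind` indicator columns (1 = keyword found). 
    An earlier version returned 1 only when at least 2 keywords were found; 
    that threshold is kept, disabled, after the return statement. 
    (v1 included all words as well as "employee".) """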
    cols = [f'{keyword}_ind' for keyword in DEI_keywords]
    count_DEI_term_inds_1 = 0
    for col in cols:
        if row[col] == 1:
            count_DEI_term_inds_1 += 1
    
    return count_DEI_term_inds_1
    '''
    if count_DEI_term_inds_1 >= 2:
        return 1
    else:
        return 0
    '''
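
The row-wise count above can also be computed in one vectorized step. The following is a minimal sketch, not the notebook's method: `count_keyword_hits` is a hypothetical helper name, the toy data is made up, and the `>= 2` threshold mirrors the disabled branch above.

```python
import pandas as pd

def count_keyword_hits(df, keywords):
    """Sum the 0/1 `{keyword}_ind` columns across each row."""
    cols = [f'{keyword}_ind' for keyword in keywords]
    return df[cols].sum(axis=1)

# Made-up indicator columns for three articles
toy_df = pd.DataFrame({
    'diversity_ind': [1, 0, 1],
    'equity_ind':    [1, 0, 0],
    'inclusion_ind': [1, 0, 1],
})

counts = count_keyword_hits(toy_df, ['diversity', 'equity', 'inclusion'])
flags = (counts >= 2).astype(int)  # optional threshold: at least 2 keywords

print(counts.tolist())  # [3, 0, 2]
print(flags.tolist())   # [1, 0, 1]
```

Summing the columns avoids a Python-level `apply` over every row, which may matter on the larger industry files.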

In [None]:
industry_all_events_dict = {}

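# DEI heuristic: count how many of the 4 DEI keywords are found in each article
# NOTE: with the lookbehind prefix applied below, the leading space in ' dei' must itself 
# follow a space, a punctuation mark, or the start of the text for the keyword to match.
DEI_keywords = ['diversity', 'equity', 'inclusion', ' dei'] # replaced employee engage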

# labor_keywords = [r'labo[u]r', 'wage', 'worker']
covid_keywords = ['covid', 'coronavirus', 'pandemic']
blm_keywords = ['blm', 'black lives matter', 'george floyd']
pattern_DEI = '|'.join(DEI_keywords)

# Get all events in one df, by industry
for industry_name, file_names_lst in sorted(industry_files.items()):
    print(industry_name)
    for file_name in file_names_lst:
        print('\t' + file_name)
        sub_df = pd.read_csv(tvl_raw_dir + gic_dir + file_name)
        sub_df.dropna(axis=0, how='all', inplace=True)
        if sub_df.empty:
            continue
        all_are_scm = list(sub_df['Category'].unique()) == [gic_dir[:-1]]
        if not all_are_scm:
            print()
            print(f'NOT ALL ARTICLES ARE {gic_dir[:-1]}!!!')
            print(file_name)
            print(f'NOT ALL ARTICLES ARE {gic_dir[:-1]}!!!')
            print()
        if industry_name not in industry_all_events_dict:
            industry_all_events_dict[industry_name] = sub_df
        else:
            industry_all_events_dict[industry_name] = pd.concat([industry_all_events_dict[industry_name], sub_df], ignore_index=True)
    
    # Create some columns with cleaned text/dates
    if industry_name in industry_all_events_dict:
      industry_df = industry_all_events_dict[industry_name].copy()
      print(f'Before dropping dupes: {industry_df.shape}')

      # Drop duplicates for combos of company + TVL ID + article (repeating article pertaining to the same company)
          # Reasoning: TVL ID represents an identifier of a Spotlight Event for ONE company. 
          # A Spotlight Event may be made up of several articles. 
          # So we do not yet want to drop all the articles that comprise a single TVL ID by 
          # doing a hard drop_duplicates on JUST TVL ID. So, we are dropping on a combination of 
          # columns, to ensure that we only drop repeating articles for the same company. 
          # Repeating articles may occur due to potential overlap of articles from the CSVs.
      drop_dupes_cols = ['Company', 'TVL ID', 'Primary Article Spotlight Headline', 
                        'Primary Article Bullet Points', 'Spotlight Start Date']
      industry_df = industry_df.drop_duplicates(drop_dupes_cols, keep='first')
      print(f'After dropping dupes: {industry_df.shape}')
      industry_df['INDUSTRY'] = industry_name
      industry_df = industry_df[['INDUSTRY', 'Company', 'TVL ID', 'Category', 'Primary Article Spotlight Headline',
        'Primary Article Bullet Points', 'Primary Article Source',
        'Primary Article URL Link', 'Spotlight Start Date',
        'Spotlight End Date', 'Spotlight Volume']].copy()
    
      industry_df['headline_lower'] = industry_df['Primary Article Spotlight Headline'].str.lower()
      industry_df['bullet_pts_lower'] = industry_df['Primary Article Bullet Points'].str.lower()
      industry_df['date'] = industry_df['Spotlight Start Date'].apply(lambda s_date: datetime.datetime.strptime(s_date, '%m/%d/%Y'))
      industry_df['year'] = industry_df['date'].dt.year

      # Apply DEI heuristic (count of the DEI keywords found per article)
      for keyword in DEI_keywords:
          keyword_regex = re.compile(f'(?<![^ .,?!;]){keyword}')
          industry_df[f'{keyword}_ind'] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 
                                                  1, 0)
      industry_df['DEI_keyword_ind'] = industry_df.apply(lambda row: get_DEI_indicator_v2(row, DEI_keywords), axis=1)
      #industry_df['marked_DEI_relevant_ind'] = industry_df['DEI_keyword_ind']

      # Apply work/employ root word check
      root_employ = ['work', 'employ']
      for keyword in root_employ:
          keyword_regex = re.compile(f'(?<![^ .,?!;]){keyword}')
          industry_df[f'{keyword}_ind'] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 
                                                  1, 0)
      industry_df['root_employ_ind'] = industry_df.apply(lambda row: get_DEI_indicator_v2(row, root_employ), axis=1)
      #industry_df['marked_root_employ_ind'] = industry_df['root_employ_ind']

      # Apply COVID heuristic
      for keyword in covid_keywords:
          keyword_regex = re.compile(f'(?<![^ .,?!;]){keyword}')
          industry_df[f'{keyword}_ind'] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 
                                                  1, 0)
      industry_df['COVID_keyword_ind'] = industry_df.apply(lambda row: get_DEI_indicator_v2(row, covid_keywords), axis=1)
      #industry_df['marked_covid_ind'] = industry_df['covid_ind']
      
      # Apply Black Lives Matter heuristic
      for keyword in blm_keywords:
          keyword_regex = re.compile(f'(?<![^ .,?!;]){keyword}')
          industry_df[f'{keyword}_ind'] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 
                                                  1, 0)
      industry_df['BLM_keyword_ind'] = industry_df.apply(lambda row: get_DEI_indicator_v2(row, blm_keywords), axis=1)
      #industry_df['blm_ind'] = industry_df['blm_ind']


      '''
      # Apply COVID-specific labor practice-relevance heuristic ONLY to articles marked as lp-relevant above
      mask_marked_as_lp_relevant = (industry_df['labor_keyword_ind'] == 1)
      mask_mentions_covid = (industry_df['headline_lower'].str.contains(pattern_covid) | industry_df['bullet_pts_lower'].str.contains(pattern_covid))
      industry_df['covid_and_labor_keyword_ind'] = np.where(mask_marked_as_lp_relevant & mask_mentions_covid, 1, 0)

      covid_lp_relevant_subset = industry_df[industry_df['covid_and_labor_keyword_ind'] == 1]
      if covid_lp_relevant_subset.empty:
          continue
      covid_lp_relevant_subset['marked_labor_relevant_ind'] = np.where((covid_lp_relevant_subset['headline_lower'].str.contains(pattern_covid_lp_relevant_keywords) | covid_lp_relevant_subset['bullet_pts_lower'].str.contains(pattern_covid_lp_relevant_keywords)), 1, 0)
      '''
      # PRACTICE TERM INDICATORS CREATED HERE
      term_type = 'practice'
      for cat, cat_terms in practice_terms_regex_dict.items():
          for clean_term, term_dict in cat_terms.items():
              indicator_col_name = "{}_{}_{}".format(clean_term, term_type.upper(), cat)
              keyword_regexes = term_dict['regex_lst']
              industry_df[indicator_col_name] = 0
              for keyword_regex_str in keyword_regexes:
                  keyword_regex = re.compile(keyword_regex_str)
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 1, industry_df[indicator_col_name])
              if term_dict['extra_cleaning']:
                  pattern_terms_to_remove = re.compile('|'.join(term_dict['terms_to_remove']))
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(pattern_terms_to_remove) | industry_df['bullet_pts_lower'].str.contains(pattern_terms_to_remove), 0, industry_df[indicator_col_name])

      # RISK TERM INDICATORS CREATED HERE
      term_type = 'risk'
      for cat, cat_terms in risk_terms_regex_dict.items():
          for clean_term, term_dict in cat_terms.items():
              indicator_col_name = "{}_{}_{}".format(clean_term, term_type.upper(), cat)
              keyword_regexes = term_dict['regex_lst']
              industry_df[indicator_col_name] = 0
              for keyword_regex_str in keyword_regexes:
                  keyword_regex = re.compile(keyword_regex_str)
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 1, industry_df[indicator_col_name])
              if term_dict['extra_cleaning']:
                  pattern_terms_to_remove = re.compile('|'.join(term_dict['terms_to_remove']))
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(pattern_terms_to_remove) | industry_df['bullet_pts_lower'].str.contains(pattern_terms_to_remove), 0, industry_df[indicator_col_name])

      # DEI TERM INDICATORS CREATED HERE
      term_type = 'DEI-context'
      for cat, cat_terms in DEI_context_terms_regex_dict.items():
          for clean_term, term_dict in cat_terms.items():
              indicator_col_name = "{}_{}_{}".format(clean_term, term_type.upper(), cat)
              keyword_regexes = term_dict['regex_lst']
              industry_df[indicator_col_name] = 0
              for keyword_regex_str in keyword_regexes:
                  keyword_regex = re.compile(keyword_regex_str)
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 1, industry_df[indicator_col_name])
              if term_dict['extra_cleaning']:
                  pattern_terms_to_remove = re.compile('|'.join(term_dict['terms_to_remove']))
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(pattern_terms_to_remove) | industry_df['bullet_pts_lower'].str.contains(pattern_terms_to_remove), 0, industry_df[indicator_col_name])

      # SUPPLIER RELATIONSHIP INDICATORS CREATED HERE
      '''
      term_type = 'supplier-relationship'
      for cat, cat_terms in supplier_relship_regex_dict.items():
          for clean_term, term_dict in cat_terms.items():
              indicator_col_name = "{}_{}_{}".format(clean_term, term_type.upper(), cat)
              keyword_regexes = term_dict['regex_lst']
              industry_df[indicator_col_name] = 0
              for keyword_regex_str in keyword_regexes:
                  keyword_regex = re.compile(keyword_regex_str)
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(keyword_regex) | industry_df['bullet_pts_lower'].str.contains(keyword_regex), 1, industry_df[indicator_col_name])
              if term_dict['extra_cleaning']:
                  pattern_terms_to_remove = re.compile('|'.join(term_dict['terms_to_remove']))
                  industry_df[indicator_col_name] = np.where(industry_df['headline_lower'].str.contains(pattern_terms_to_remove) | industry_df['bullet_pts_lower'].str.contains(pattern_terms_to_remove), 0, industry_df[indicator_col_name])
    '''

      industry_all_events_dict[industry_name] = industry_df


Advertising& Marketing
	Truvalue_Spotlights_Advertising& Marketing_12Months_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Advertising& Marketing_12Months_20211023.csv
NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!

	Truvalue_Spotlights_Advertising& Marketing_2020_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Advertising& Marketing_2020_20211023.csv
NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!

	Truvalue_Spotlights_Advertising& Marketing_2019_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Advertising& Marketing_2019_20211023.csv
NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!

	Truvalue_Spotlights_Advertising& Marketing_2018_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Advertising& Marketing_2018_2021102



Aerospace& Defense
	Truvalue_Spotlights_Aerospace& Defense_12Months_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Aerospace& Defense_12Months_20211023.csv
NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!

	Truvalue_Spotlights_Aerospace& Defense_2020_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Aerospace& Defense_2020_20211023.csv
NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!

	Truvalue_Spotlights_Aerospace& Defense_2019_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Aerospace& Defense_2019_20211023.csv
NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!

	Truvalue_Spotlights_Aerospace& Defense_2018_20211023.csv

NOT ALL ARTICLES ARE Employee Engagement, Diversity, & Inclusion!!!
Truvalue_Spotlights_Aerospace& Defense_2018_20211023.csv
NOT ALL ARTICLES ARE Employee 
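
The practice, risk, and DEI-context loops in the cell above share the same logic, so they could be folded into one helper. The sketch below is one possible refactor under stated assumptions, not the notebook's code: `text_matches` and `add_term_indicators` are hypothetical names, `na=False` is added so rows with missing text count as non-matches, and removal terms are escaped with `re.escape` (the original joins them unescaped). The toy data is made up.

```python
import re
import pandas as pd

TEXT_COLS = ['headline_lower', 'bullet_pts_lower']

def text_matches(df, pattern):
    """Boolean Series: True where either text column contains `pattern`."""
    hits = pd.Series(False, index=df.index)
    for col in TEXT_COLS:
        # na=False (an assumption) treats missing text as "no match"
        hits = hits | df[col].str.contains(pattern, regex=True, na=False)
    return hits

def add_term_indicators(df, terms_regex_dict, term_type):
    """Add one 0/1 column per term, named `{term}_{TYPE}_{category}`."""
    for cat, cat_terms in terms_regex_dict.items():
        for clean_term, term_dict in cat_terms.items():
            col = "{}_{}_{}".format(clean_term, term_type.upper(), cat)
            matched = pd.Series(False, index=df.index)
            for pattern in term_dict['regex_lst']:
                matched = matched | text_matches(df, pattern)
            if term_dict['extra_cleaning'] and term_dict['terms_to_remove']:
                # Escaping is an assumption: it matches the removal terms literally
                removal = '|'.join(re.escape(t) for t in term_dict['terms_to_remove'])
                matched = matched & ~text_matches(df, removal)
            df[col] = matched.astype(int)
    return df

# Made-up toy data
toy = pd.DataFrame({
    'headline_lower': ['company pays fine over bias claims',
                       'new fine fragrance line', None],
    'bullet_pts_lower': ['regulator issued a penalty',
                         'launch planned for spring', 'no text'],
})
toy_terms = {'Financial-Loss': {'fine': {
    'regex_lst': [r'(?<![^ .,?!;])fine'],
    'extra_cleaning': True,
    'terms_to_remove': ['fine fragrance'],
}}}

add_term_indicators(toy, toy_terms, 'risk')
print(toy['fine_RISK_Financial-Loss'].tolist())  # [1, 0, 0]
```

With a helper like this, each term type would need a single call, for example `add_term_indicators(industry_df, practice_terms_regex_dict, 'practice')`.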



### Simple Visualizations

In [None]:
# Industry Abbrev to Full Industry name mapping:
"""CONTEXT: TVL abbreviates industry names in their data sets, so this is just 
a mapping from TVL abbreviated industry names to full industry names. It can be 
helpful for visualizations, etc. """

industry_abbrev_to_full_map = {
    "Advertising& Marketing": "Advertising & Marketing",
    "Aerospace& Defense": "Aerospace & Defense",
    "AgriculturalProducts": "Agricultural Products",
    "AirFreight & Logistics": "Air Freight & Logistics",
    "Airlines": "Airlines",
    "AlcoholicBeverages": "Alcoholic Beverages",
    "Apparel,Accessories & Foo": "Apparel, Accessories & Footwear",
    "ApplianceManufacturing": "Appliance Manufacturing",
    "AssetManagement & Custody": "Asset Management & Custody Activities",
    "AutoParts": "Auto Parts",
    "Automobiles": "Automobiles",
    "Biofuels": "Biofuels",
    "Biotechnology& Pharmaceut": "Biotechnology & Pharmaceuticals",
    "BuildingProducts & Furnis": "Building Products & Furnishings",
    "CarRental & Leasing": "Car Rental & Leasing",
    "Casinos& Gaming": "Casinos & Gaming",
    "Chemicals": "Chemicals",
    "CoalOperations": "Coal Operations",
    "CommercialBanks": "Commercial Banks",
    "ConstructionMaterials": "Construction Materials",
    "ConsumerFinance": "Consumer Finance",
    "Containers& Packaging": "Containers & Packaging",
    "CruiseLines": "Cruise Lines",
    "DrugRetailers": "Drug Retailers",
    "E-Commerce": "E-Commerce", 
    "Education": "Education",
    "ElectricUtilities & Power": "Electric Utilities & Power Generators",
    "Electrical& Electronic Eq": "Electrical & Electronic Equipment",
    "ElectronicManufacturing S": "Electronic Manufacturing Services & Original Design Manufacturing",
    "Engineering& Construction": "Engineering & Construction Services",
    "FoodRetailers & Distribut": "Food Retailers & Distributors",
    "ForestryManagement": "Forestry Management",
    "FuelCells & Industrial Ba": "Fuel Cells & Industrial Batteries",
    "GasUtilities & Distributo": "Gas Utilities & Distributors",
    "Hardware": "Hardware",
    "HealthCare Delivery": "Health Care Delivery",
    "HealthCare Distributors": "Health Care Distributors",
    "HomeBuilders": "Home Builders",
    "Hotels& Lodging": "Hotels & Lodging",
    "Household& Personal Produ": "Household & Personal Products",
    "IndustrialMachinery & Goo": "Industrial Machinery & Goods",
    "Insurance": "Insurance",
    "InternetMedia & Services": "Internet Media & Services",
    "InvestmentBanking & Broke": "Investment Banking & Brokerage",
    "Iron& Steel Producers": "Iron & Steel Producers",
    "LeisureFacilities": "Leisure Facilities",
    "ManagedCare": "Managed Care",
    "MarineTransportation": "Marine Transportation",
    "Meat,Poultry & Dairy": "Meat, Poultry & Dairy",
    "Media& Entertainment": "Media & Entertainment",
    "MedicalEquipment & Suppli": "Medical Equipment & Supplies",
    "Metals& Mining": "Metals & Mining",
    "MortgageFinance": "Mortgage Finance",
    "Multilineand Specialty Re": "Multiline and Specialty Retailers & Distributors",
    "Non-AlcoholicBeverages": "Non-Alcoholic Beverages",
    "Oil& Gas - Exploration & ": "Oil & Gas - Exploration & Production",
    "Oil& Gas - Midstream": "Oil & Gas - Midstream",
    "Oil& Gas - Refining & Mar": "Oil & Gas - Refining & Marketing",
    "Oil& Gas - Services": "Oil & Gas - Services",
    "ProcessedFoods": "Processed Foods",
    "Professional& Commercial": "Professional & Commercial Services",
    "Pulp& Paper Products": "Pulp & Paper Products",
    "RailTransportation": "Rail Transportation",
    "RealEstate": "Real Estate",
    "RealEstate Services": "Real Estate Services",
    "Restaurants": "Restaurants",
    "RoadTransportation": "Road Transportation",
    "Security& Commodity Excha": "Security & Commodity Exchanges",
    "Semiconductors": "Semiconductors",
    "Software& IT Services": "Software & IT Services",
    "SolarTechnology & Project": "Solar Technology & Project Developers",
    "TelecommunicationServices": "Telecommunication Services",
    "Tobacco": "Tobacco",
    "Toys& Sporting Goods": "Toys & Sporting Goods",
    "WasteManagement": "Waste Management",
    "WaterUtilities & Services": "Water Utilities & Services",
    "WindTechnology & Project ": "Wind Technology & Project Developers"
}
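
As a usage illustration, the mapping can be applied to the `INDUSTRY` column with `Series.map`. This is a minimal sketch on made-up rows: `INDUSTRY_FULL` is a hypothetical column name, and only two entries of the mapping are copied so the snippet runs on its own.

```python
import pandas as pd

# Two entries copied from the mapping above so the sketch is self-contained
abbrev_map = {
    "Advertising& Marketing": "Advertising & Marketing",
    "AutoParts": "Auto Parts",
}

toy = pd.DataFrame({'INDUSTRY': ['AutoParts', 'Advertising& Marketing', 'UnknownIndustry']})

# Unmapped abbreviations fall back to the original value instead of NaN
toy['INDUSTRY_FULL'] = toy['INDUSTRY'].map(abbrev_map).fillna(toy['INDUSTRY'])

print(toy['INDUSTRY_FULL'].tolist())
# ['Auto Parts', 'Advertising & Marketing', 'UnknownIndustry']
```

The `fillna` fallback keeps any abbreviation that is missing from the mapping visible in plots rather than silently dropping it.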