# Human Rights Considered NLP 

### **Overview**

This notebook creates a training dataset using data sourced from the [Police Brutality 2020 API](https://github.com/2020PB/police-brutality) by adding category labels for types of force and the people involved in incidents using [Snorkel](https://www.snorkel.org/) for NLP. 

Build on original notebook by [Axel Corro](https://github.com/axefx) sourced from the HRD Team C DS [repository](https://github.com/Lambda-School-Labs/Labs25-Human_Rights_First-TeamC-DS/blob/main/notebooks/snorkel_hrf.ipynb).



# Imports 

In [1]:
!pip install snorkel

Collecting snorkel
[?25l  Downloading https://files.pythonhosted.org/packages/4e/6a/e33babd8b4fb34867b695b5ab6b02c9106ec9de05ed4a02b2b9417eb3ae7/snorkel-0.9.6-py3-none-any.whl (144kB)
[K     |████████████████████████████████| 153kB 2.9MB/s 
[?25hCollecting scikit-learn<0.22.0,>=0.20.2
[?25l  Downloading https://files.pythonhosted.org/packages/a0/c5/d2238762d780dde84a20b8c761f563fe882b88c5a5fb03c056547c442a19/scikit_learn-0.21.3-cp36-cp36m-manylinux1_x86_64.whl (6.7MB)
[K     |████████████████████████████████| 6.7MB 7.3MB/s 
Collecting networkx<2.4,>=2.2
[?25l  Downloading https://files.pythonhosted.org/packages/85/08/f20aef11d4c343b557e5de6b9548761811eb16e438cee3d32b1c66c8566b/networkx-2.3.zip (1.7MB)
[K     |████████████████████████████████| 1.8MB 13.5MB/s 
Collecting munkres>=1.0.6
  Downloading https://files.pythonhosted.org/packages/64/97/61ddc63578870e04db6eb1d3bee58ad4e727f682068a7c7405edb8b2cdeb/munkres-1.1.2-py2.py3-none-any.whl
Collecting tensorboard<2.0.0,>=1.14.0
[?2

In [2]:
import pandas as pd

from snorkel.labeling import labeling_function
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

import sys
from google.colab import files

In [141]:
# using our cleaned processed data
df = pd.read_csv('https://raw.githubusercontent.com/Lambda-School-Labs/Labs25-Human_Rights_First-TeamC-DS/main/Data/pv_incidents.csv', na_values=False)

In [4]:
df2 = df.filter(['text'], axis=1)

In [5]:
df2['text'] = df2['text'].astype(str)

# Use of Force Tags

### Categories of force:

- **Presence**: Police show up and their presence is enough to de-escalate. This is ideal. 

- **verbalization**: Police use voice commands, force is non-physical.

- **empty-hand control soft technique**: Officers use grabs, holds and joint locks to restrain an individual. shove, chase, spit, raid, push

- **empty-hand control hard technique**: Officers use punches and kicks to restrain an individual.

- **blunt impact**: Officers may use a baton to immobilize a combative person, struck, shield, beat 

- **projectiles**: Projectiles shot or launched by police at civilians. Includes "less lethal" mutnitions such as rubber bullets, bean bag rounds, water hoses, and flash grenades, as well as deadly weapons such as firearms.

- **chemical**: Officers use chemical sprays or projectiles embedded with chemicals to restrain an individual (e.g., pepper spray). 

- **conducted energy devices**: Officers may use CEDs to immobilize an individual. CEDs discharge a high-voltage, low-amperage jolt of electricity at a distance. 

- **miscillaneous**: LRAD (long-range audio device), sound cannon, sonic weapon

## Presence category

Police presence is enough to de-escalate. This is ideal.

In [6]:
PRESENCE = 1
NOT_PRESENCE = 0
ABSTAIN = -1

In [7]:
@labeling_function()
def lf_keyword_swarm(x):
  return PRESENCE if 'swarm' in x.text.lower() else ABSTAIN

In [8]:
@labeling_function()
def lf_keyword_show(x):
  return PRESENCE if 'show' in x.text.lower() else ABSTAIN

In [9]:
@labeling_function()
def lf_keyword_arrive(x):
  return PRESENCE if 'arrive' in x.text.lower() else ABSTAIN

In [10]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_swarm, lf_keyword_show, lf_keyword_arrive]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["presence_label"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 6832.63it/s]


## Verbalization Category

police use voice commands, force is non-physical

In [11]:
VERBAL = 1
NOT_VERBAL = 0
ABSTAIN = -1

In [12]:
@labeling_function()
def lf_keyword_shout(x):
  return VERBAL if 'shout' in x.text.lower() else ABSTAIN

In [13]:
@labeling_function()
def lf_keyword_order(x):
  return VERBAL if 'order' in x.text.lower() else ABSTAIN

In [14]:
@labeling_function()
def lf_keyword_loudspeaker(x):
  return VERBAL if 'loudspeaker' in x.text.lower() else ABSTAIN

In [15]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_shout, lf_keyword_order,lf_keyword_loudspeaker]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["verbal_label"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 15317.04it/s]


In [16]:
lf_keyword_shout, lf_keyword_order, lf_keyword_loudspeaker = (L_train != ABSTAIN).mean(axis=0)
print(f"lf_keyword_shout coverage: {lf_keyword_shout * 100:.1f}%")
print(f"lf_keyword_order coverage: {lf_keyword_order * 100:.1f}%")
print(f"lf_keyword_loudspeaker coverage: {lf_keyword_loudspeaker * 100:.1f}%")

lf_keyword_shout coverage: 0.1%
lf_keyword_order coverage: 0.3%
lf_keyword_loudspeaker coverage: 0.0%


In [17]:
df2[df2['verbal_label']==1]

Unnamed: 0,text,presence_label,verbal_label
70,police apply no assembly order to journalists,-1,1
284,officers shove press during dispersal order,-1,1
445,police selectively enforce curfew and dispersa...,-1,1
737,police charge into peaceful crowd shouting gr...,-1,1


## Empty-hand Control - Soft Technique

Officers use grabs, holds and joint locks to restrain an individual. shove, chase, spit, raid, push

In [18]:
EHCSOFT = 1
NOT_EHCSOFT = 0
ABSTAIN = -1

In [19]:
@labeling_function()
def lf_keyword_shove(x):
  return EHCSOFT if 'shove' in x.text.lower() else ABSTAIN

In [20]:
@labeling_function()
def lf_keyword_grabs(x):
  return EHCSOFT if 'grabs' in x.text.lower() else ABSTAIN

In [21]:
@labeling_function()
def lf_keyword_holds(x):
  return EHCSOFT if 'holds' in x.text.lower() else ABSTAIN

In [22]:
@labeling_function()
def lf_keyword_arrest(x):
  return EHCSOFT if 'arrest' in x.text.lower() else ABSTAIN

In [23]:
@labeling_function()
def lf_keyword_spit(x):
  return EHCSOFT if 'spit' in x.text.lower() else ABSTAIN

In [24]:
@labeling_function()
def lf_keyword_raid(x):
  return EHCSOFT if 'raid' in x.text.lower() else ABSTAIN

In [25]:
@labeling_function()
def lf_keyword_push(x):
  return EHCSOFT if 'push' in x.text.lower() else ABSTAIN

In [26]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_shove, lf_keyword_grabs, lf_keyword_spit, lf_keyword_raid,
      lf_keyword_push, lf_keyword_holds, lf_keyword_arrest]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["ehc-soft_technique"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 6842.96it/s]


In [27]:
df2[df2['ehc-soft_technique']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique
12,police violently arrest drummer,-1,-1,1
13,police kneel on mans neck to make arrest,-1,-1,1
14,police punch arrestee on ground,-1,-1,1
15,livestreamer arrested and punched,-1,-1,1
16,police shove and pepper spray protesters,-1,-1,1
...,...,...,...,...
1056,livestreamer arrested while filming protest,-1,-1,1
1057,lmpd bearcat strikes vehicle police initially...,-1,-1,1
1063,police arrest protesters leaving scene,-1,-1,1
1067,peaceful protesters arrested for breaking curfew,-1,-1,1


## Empty-hand Control - Hard Technique

Officers use bodily force (punches and kicks or asphyxiation) to restrain an individual. 

In [28]:
EHCHARD = 1
NOT_EHCHARD = 0
ABSTAIN = -1

In [29]:
@labeling_function()
def lf_keyword_beat(x):
  return EHCHARD if 'beat' in x.text.lower() else ABSTAIN

In [30]:
@labeling_function()
def lf_keyword_tackle(x):
  return EHCHARD if 'tackle' in x.text.lower() else ABSTAIN

In [31]:
@labeling_function()
def lf_keyword_punch(x):
  return EHCHARD if 'punch' in x.text.lower() else ABSTAIN

In [32]:
@labeling_function()
def lf_keyword_assault(x):
  return EHCHARD if 'assault' in x.text.lower() else ABSTAIN

In [33]:
@labeling_function()
def lf_keyword_choke(x):
  return EHCHARD if 'choke' in x.text.lower() else ABSTAIN

In [34]:
@labeling_function()
def lf_keyword_kick(x):
  return EHCHARD if 'kick' in x.text.lower() else ABSTAIN

In [35]:
@labeling_function()
def lf_keyword_kneel(x):
  return EHCHARD if 'kneel' in x.text.lower() else ABSTAIN

In [36]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_beat, lf_keyword_tackle, lf_keyword_choke,
       lf_keyword_kick, lf_keyword_punch, lf_keyword_assault,
       lf_keyword_kneel]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["ehc-hard_technique"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 7565.22it/s]


In [37]:
df2[df2['ehc-hard_technique']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique
1,police assault protesters,-1,-1,-1,1
13,police kneel on mans neck to make arrest,-1,-1,1,1
14,police punch arrestee on ground,-1,-1,1,1
15,livestreamer arrested and punched,-1,-1,1,1
18,police officer tackles and knees man on the gr...,-1,-1,-1,1
...,...,...,...,...,...
1006,police choke man and push woman filming event ...,-1,-1,0,1
1047,police officers use batons to beat protester,-1,-1,-1,1
1048,louisville police swarm and beat a man screami...,1,-1,-1,1
1053,police tackle protester then target witness,-1,-1,-1,1


## Blunt Impact Category

Officers may use tools like batons to immobilize a person.

In [38]:
BLUNT = 1
NOT_BLUNT = 0
ABSTAIN = -1

In [39]:
@labeling_function()
def lf_keyword_baton(x):
  return BLUNT if 'baton' in x.text.lower() else ABSTAIN

In [40]:
@labeling_function()
def lf_keyword_club(x):
  return BLUNT if 'club' in x.text.lower() else ABSTAIN

In [41]:
@labeling_function()
def lf_keyword_shield(x):
  return BLUNT if 'shield' in x.text.lower() else ABSTAIN

In [42]:
@labeling_function()
def lf_keyword_bike(x):
  return BLUNT if 'bike' in x.text.lower() else ABSTAIN

In [43]:
@labeling_function()
def lf_keyword_horse(x):
  return BLUNT if 'horse' in x.text.lower() else ABSTAIN

In [44]:
@labeling_function()
def lf_keyword_vehicle(x):
  return BLUNT if 'vehicle' in x.text.lower() else ABSTAIN

In [45]:
@labeling_function()
def lf_keyword_car(x):
  return BLUNT if 'car' in x.text.lower() else ABSTAIN

In [46]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_baton, lf_keyword_club, lf_keyword_horse, lf_keyword_vehicle,
       lf_keyword_car, lf_keyword_shield, lf_keyword_bike]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["blunt_impact"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 6739.78it/s]


In [47]:
df2[df2['blunt_impact']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact
6,police use horses as weapons,-1,-1,-1,-1,1
17,police push protesters with horses,-1,-1,0,-1,1
25,police trample protester with horse,-1,-1,-1,-1,1
27,police beat protester with batons then pepper...,-1,-1,-1,1,1
43,officer shoots projectile from moving vehicle,-1,-1,-1,-1,1
...,...,...,...,...,...,...
1045,police shoot at cars in traffic from overpass,-1,-1,-1,-1,1
1047,police officers use batons to beat protester,-1,-1,-1,1,1
1057,lmpd bearcat strikes vehicle police initially...,-1,-1,1,-1,1
1062,protester hit by police car,-1,-1,-1,-1,1


## Projectiles category

Projectiles shot or launched by police at civilians. Includes "less lethal" mutnitions such as rubber bullets, bean bag rounds, water hoses, and flash grenades, as well as deadly weapons such as firearms. 


In [48]:
PROJECTILE = 1
NOT_PROJECTILE = 0
ABSTAIN = -1

In [49]:
@labeling_function()
def lf_keyword_pepper(x):
  return PROJECTILE if 'pepper' in x.text else ABSTAIN

In [50]:
@labeling_function()
def lf_keyword_rubber(x):
  return PROJECTILE if 'rubber' in x.text else ABSTAIN

In [51]:
@labeling_function()
def lf_keyword_bean(x):
  return PROJECTILE if 'bean' in x.text else ABSTAIN

In [52]:
@labeling_function()
def lf_keyword_shoot(x):
  return PROJECTILE if 'shoot' in x.text else ABSTAIN

In [53]:
@labeling_function()
def lf_keyword_shot(x):
  return PROJECTILE if 'shot' in x.text else ABSTAIN

In [54]:
@labeling_function()
def lf_keyword_fire(x):
  return PROJECTILE if 'fire' in x.text else ABSTAIN

In [55]:
@labeling_function()
def lf_keyword_grenade(x):
  return PROJECTILE if 'grenade' in x.text else ABSTAIN

In [56]:
@labeling_function()
def lf_keyword_bullet(x):
  return PROJECTILE if 'bullet' in x.text else ABSTAIN

In [57]:
@labeling_function()
def lf_keyword_throw(x):
  return PROJECTILE if 'throw' in x.text else ABSTAIN

In [58]:
@labeling_function()
def lf_keyword_discharge(x):
  return PROJECTILE if 'discharge' in x.text else ABSTAIN

In [59]:
@labeling_function()
def lf_keyword_projectile(x):
  return PROJECTILE if 'projectile' in x.text else ABSTAIN

In [60]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_pepper, lf_keyword_rubber, lf_keyword_bean,
       lf_keyword_shoot, lf_keyword_shot, lf_keyword_fire, lf_keyword_grenade, 
       lf_keyword_bullet, lf_keyword_throw, lf_keyword_discharge, 
       lf_keyword_projectile]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["projectile"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 4656.80it/s]


In [61]:
df2[df2['projectile'] == 1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile
2,police shoot non violent protester in the head,-1,-1,-1,-1,-1,1
3,police use tear gas rubber bullets on protes...,-1,-1,-1,-1,-1,1
4,police open fire on crowd with rubber bullets,-1,-1,-1,-1,-1,1
5,pregnant woman shot with bean bags by police,-1,-1,-1,-1,-1,1
8,police open fire on crowd after a protester th...,-1,-1,-1,-1,-1,1
...,...,...,...,...,...,...,...
1052,police shove woman and then fire pepper balls ...,-1,-1,1,-1,-1,1
1060,police fire at peaceful protesters,-1,-1,-1,-1,-1,1
1064,reporter shows tear gas canister fired at him ...,1,-1,-1,-1,-1,1
1065,woman bleeding from face after being shot by p...,-1,-1,-1,-1,-1,1


## Chemical Agents

Police use chemical agents including pepper pray, tear gas on civilians. 

In [62]:
CHEMICAL = 1
NOT_CHEMICAL = 0
ABSTAIN = -1

In [63]:
@labeling_function()
def lf_keyword_pepper(x):
  return CHEMICAL if 'pepper' in x.text else ABSTAIN

In [64]:
@labeling_function()
def lf_keyword_gas(x):
  return CHEMICAL if 'gas' in x.text else ABSTAIN

In [65]:
@labeling_function()
def lf_keyword_smoke(x):
  return CHEMICAL if 'smoke' in x.text else ABSTAIN

In [66]:
@labeling_function()
def lf_keyword_mace(x):
  return CHEMICAL if 'mace' in x.text else ABSTAIN

In [67]:
@labeling_function()
def lf_keyword_spray(x):
  return CHEMICAL if 'spray' in x.text else ABSTAIN

In [68]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_pepper, lf_keyword_gas, lf_keyword_smoke, 
       lf_keyword_spray, lf_keyword_mace]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["chemical"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 10125.43it/s]


In [69]:
df2[df2['chemical']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical
9,police spray a man in the face while he stands...,-1,-1,-1,-1,-1,-1,1
16,police shove and pepper spray protesters,-1,-1,1,-1,-1,0,1
27,police beat protester with batons then pepper...,-1,-1,-1,1,1,0,1
37,police officer pepper sprays protesters for no...,-1,-1,-1,-1,-1,0,1
46,officer pepper sprays protester,-1,-1,-1,-1,-1,0,1
...,...,...,...,...,...,...,...,...
1025,police fire pepper bullets into apartment,-1,-1,-1,-1,-1,1,1
1051,protesters in st matthews shot with pepper ro...,-1,-1,-1,-1,-1,1,1
1052,police shove woman and then fire pepper balls ...,-1,-1,1,-1,-1,1,1
1061,protester pepper sprayed through open door,-1,-1,-1,-1,-1,0,1


## Conducted energy devices

Officers may use CEDs to immobilize an individual. CEDs discharge a high-voltage, low-amperage jolt of electricity at a distance. Most commonly tasers.

In [70]:
CED = 1
NOT_CED = 0
ABSTAIN = -1

In [71]:
@labeling_function()
def lf_keyword_taser(x):
  return CED if 'taser' in x.text else ABSTAIN

In [72]:
@labeling_function()
def lf_keyword_stun(x):
  return CED if 'stun' in x.text else ABSTAIN

In [73]:
@labeling_function()
def lf_keyword_stungun(x):
  return CED if 'stungun' in x.text else ABSTAIN

In [74]:
@labeling_function()
def lf_keyword_taze(x):
  return CED if 'taze' in x.text else ABSTAIN

In [75]:
from snorkel.labeling.model import LabelModel
from snorkel.labeling import PandasLFApplier

# Define the set of labeling functions (LFs)
lfs = [lf_keyword_taser, lf_keyword_stun, lf_keyword_stungun, lf_keyword_taze]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2["ced_category"] = label_model.predict(L=L_train, tie_break_policy="abstain")

100%|██████████| 1069/1069 [00:00<00:00, 11727.36it/s]


In [76]:
df2[df2['ced_category']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical,ced_category
262,officers deploy tear gas and stun grenades aga...,-1,-1,-1,-1,-1,0,0,1
319,police throw stun grenade at retreating protes...,-1,-1,-1,-1,-1,0,-1,1
333,police respond to vandalism with tear gas and ...,-1,-1,-1,-1,-1,0,0,1
356,police throw stun grenades and tear gas canist...,-1,-1,-1,-1,-1,0,0,1
357,police throw stun grenade at independent journ...,-1,-1,-1,-1,-1,0,-1,1
367,officer throws stun grenade at protesters on s...,-1,-1,-1,-1,-1,0,-1,1
383,officer repeatedly uses stun gun on suspect wh...,-1,-1,-1,-1,-1,-1,-1,1


# Add force tags to dataframe

In [77]:
df2.columns

Index(['text', 'presence_label', 'verbal_label', 'ehc-soft_technique',
       'ehc-hard_technique', 'blunt_impact', 'projectile', 'chemical',
       'ced_category'],
      dtype='object')

In [78]:
def add_force_labels(row):
  tags = []
  if row['presence_label'] == 1:
    tags.append('Presence')
  if row['verbal_label'] == 1:
    tags.append('Verbalization')
  if row['ehc-soft_technique'] == 1:
    tags.append('EHC Soft Technique')
  if row['ehc-hard_technique'] == 1:
    tags.append('EHC Hard Technique')
  if row['blunt_impact'] == 1:
    tags.append('Blunt Impact')
  if row['projectile'] == 1 or row['projectile'] == 0:
    tags.append('Projectiles')
  if row['chemical'] == 1:
    tags.append('Chemical')
  if row['ced_category'] == 1:
    tags.append('Conductive Energy')
  if not tags:
    tags.append('Other/Unknown')
  return tags

In [79]:
# apply force tags to incident data
df2['force_tags'] = df2.apply(add_force_labels,axis=1)

In [80]:
# take a peek
df2[['text','force_tags']].head(3)

Unnamed: 0,text,force_tags
0,police throw tear gas at protesters on a bridge,[Projectiles]
1,police assault protesters,[EHC Hard Technique]
2,police shoot non violent protester in the head,[Projectiles]


In [81]:
# clean the tags column by seperating tags
def join_tags(content):
  return ', '.join(content)

In [143]:
# add column to main df
df['force_tags'] = df2['force_tags'].apply(join_tags)

In [83]:
df['force_tags'].value_counts()

Other/Unknown                                                                  231
Projectiles                                                                    231
EHC Soft Technique                                                             202
Projectiles, Chemical                                                          115
EHC Hard Technique                                                              66
Blunt Impact                                                                    44
EHC Soft Technique, Blunt Impact                                                22
EHC Soft Technique, Projectiles, Chemical                                       22
EHC Soft Technique, EHC Hard Technique                                          22
EHC Hard Technique, Blunt Impact                                                22
EHC Soft Technique, Projectiles                                                 13
EHC Hard Technique, Projectiles, Chemical                                       13
Blun

# Human Categories



### Police Categories:

police, officer, deputy, PD, cop

federal, agent

In [84]:
POLICE = 1
NOT_POLICE = 0
ABSTAIN = -1

In [85]:
@labeling_function()
def lf_keyword_police(x):
  return POLICE if 'police' in x.text else ABSTAIN

In [86]:
@labeling_function()
def lf_keyword_officer(x):
  return POLICE if 'officer' in x.text else ABSTAIN

In [87]:
@labeling_function()
def lf_keyword_deputy(x):
  return POLICE if 'deputy' in x.text else ABSTAIN

In [88]:
@labeling_function()
def lf_keyword_pd(x):
  return POLICE if 'PD' in x.text else ABSTAIN

In [89]:
@labeling_function()
def lf_keyword_cop(x):
  return POLICE if 'cop' in x.text else ABSTAIN

In [90]:
@labeling_function()
def lf_keyword_enforcement(x):
  return POLICE if 'enforcement' in x.text else ABSTAIN

In [91]:
@labeling_function()
def lf_keyword_leo(x):
  return POLICE if 'LEO' in x.text else ABSTAIN

In [92]:
@labeling_function()
def lf_keyword_swat(x):
  return POLICE if 'SWAT' in x.text else ABSTAIN

In [93]:
# Define the set of labeling functions (LFs)
lfs = [lf_keyword_police, lf_keyword_officer, lf_keyword_deputy, lf_keyword_pd,
       lf_keyword_cop, lf_keyword_enforcement, lf_keyword_swat, lf_keyword_leo]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2['police_label'] = label_model.predict(L=L_train, tie_break_policy='abstain')

100%|██████████| 1069/1069 [00:00<00:00, 5825.05it/s]


In [94]:
df2[df2['police_label']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical,ced_category,force_tags,police_label
0,police throw tear gas at protesters on a bridge,-1,-1,-1,-1,-1,0,0,-1,[Projectiles],1
1,police assault protesters,-1,-1,-1,1,-1,-1,-1,-1,[EHC Hard Technique],1
2,police shoot non violent protester in the head,-1,-1,-1,-1,-1,1,-1,-1,[Projectiles],1
3,police use tear gas rubber bullets on protes...,-1,-1,-1,-1,-1,1,0,-1,[Projectiles],1
4,police open fire on crowd with rubber bullets,-1,-1,-1,-1,-1,1,-1,-1,[Projectiles],1
...,...,...,...,...,...,...,...,...,...,...,...
1062,protester hit by police car,-1,-1,-1,-1,1,-1,-1,-1,[Blunt Impact],1
1063,police arrest protesters leaving scene,-1,-1,1,-1,-1,-1,-1,-1,[EHC Soft Technique],1
1064,reporter shows tear gas canister fired at him ...,1,-1,-1,-1,-1,1,0,-1,"[Presence, Projectiles]",1
1065,woman bleeding from face after being shot by p...,-1,-1,-1,-1,-1,1,-1,-1,[Projectiles],1


### Federal Agent Category

In [95]:
FEDERAL = 1
NOT_FEDERAL = 0
ABSTAIN = -1

In [96]:
@labeling_function()
def lf_keyword_federal(x):
  return FEDERAL if 'federal' in x.text else ABSTAIN

In [97]:
@labeling_function()
def lf_keyword_feds(x):
  return FEDERAL if 'feds' in x.text else ABSTAIN

In [98]:
# national guard
@labeling_function()
def lf_keyword_guard(x):
  return FEDERAL if 'guard' in x.text else ABSTAIN

In [99]:
# Define the set of labeling functions (LFs)
lfs = [lf_keyword_federal, lf_keyword_feds, lf_keyword_guard]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2['federal_label'] = label_model.predict(L=L_train, tie_break_policy='abstain')

100%|██████████| 1069/1069 [00:00<00:00, 16305.06it/s]


In [100]:
df2[df2['federal_label']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical,ced_category,force_tags,police_label,federal_label
95,local and federal police fire on protesters s...,-1,-1,1,-1,-1,1,-1,-1,"[EHC Soft Technique, Projectiles]",1,1
102,badgeless federal agents deployed to portland ...,-1,-1,-1,-1,-1,-1,0,-1,[Other/Unknown],-1,1
103,confirmed report of u s federal agents kneeli...,-1,-1,-1,1,-1,-1,-1,-1,[EHC Hard Technique],-1,1
104,federal officers raid vigil for slain protester,-1,-1,0,-1,-1,-1,-1,-1,[Other/Unknown],0,1
105,federal agents fire tear gas during shift change,-1,-1,-1,-1,-1,1,0,-1,[Projectiles],-1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
547,police and national guard use tear gas and lra...,-1,-1,-1,-1,-1,-1,0,-1,[Other/Unknown],1,1
697,unmarked federal agents aid police,-1,-1,-1,-1,-1,-1,-1,-1,[Other/Unknown],1,1
698,unmarked federal agents aid police,-1,-1,-1,-1,-1,-1,-1,-1,[Other/Unknown],1,1
703,unknown federal agents aid police,-1,-1,-1,-1,-1,-1,-1,-1,[Other/Unknown],1,1


### Civilian Categories:

protesters, medic, 

reporter, journalist, 

minor, child

In [101]:
PROTESTER = 1
NOT_PROTESTER = 0
ABSTAIN = -1

In [102]:
@labeling_function()
def lf_keyword_protester(x):
  return PROTESTER if 'protester' in x.text else ABSTAIN

In [103]:
# adding the mispelling 'protestor'
@labeling_function()
def lf_keyword_protestor(x):
  return PROTESTER if 'protestor' in x.text else ABSTAIN

In [104]:
@labeling_function()
def lf_keyword_medic(x):
  return PROTESTER if 'medic' in x.text else ABSTAIN

In [105]:
# Define the set of labeling functions (LFs)
lfs = [lf_keyword_protester, lf_keyword_protestor, lf_keyword_medic]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2['protester_label'] = label_model.predict(L=L_train, tie_break_policy='abstain')

100%|██████████| 1069/1069 [00:00<00:00, 13679.36it/s]


In [106]:
df2[df2['protester_label']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical,ced_category,force_tags,police_label,federal_label,protester_label
0,police throw tear gas at protesters on a bridge,-1,-1,-1,-1,-1,0,0,-1,[Projectiles],1,-1,1
1,police assault protesters,-1,-1,-1,1,-1,-1,-1,-1,[EHC Hard Technique],1,-1,1
2,police shoot non violent protester in the head,-1,-1,-1,-1,-1,1,-1,-1,[Projectiles],1,-1,1
3,police use tear gas rubber bullets on protes...,-1,-1,-1,-1,-1,1,0,-1,[Projectiles],1,-1,1
7,police critically injure year old black pro...,-1,-1,-1,-1,-1,-1,-1,-1,[Other/Unknown],1,-1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1062,protester hit by police car,-1,-1,-1,-1,1,-1,-1,-1,[Blunt Impact],1,-1,1
1063,police arrest protesters leaving scene,-1,-1,1,-1,-1,-1,-1,-1,[EHC Soft Technique],1,-1,1
1066,police mace shoot pepper bullets at protester...,-1,-1,-1,-1,-1,1,1,-1,"[Projectiles, Chemical]",1,-1,1
1067,peaceful protesters arrested for breaking curfew,-1,-1,1,-1,-1,-1,-1,-1,[EHC Soft Technique],-1,-1,1


 Press 

In [107]:
PRESS = 1
NOT_PRESS = 0
ABSTAIN = -1

In [108]:
@labeling_function()
def lf_keyword_reporter(x):
  return PRESS if 'reporter' in x.text else ABSTAIN

In [109]:
@labeling_function()
def lf_keyword_press(x):
  return PRESS if 'press' in x.text else ABSTAIN

In [110]:
@labeling_function()
def lf_keyword_journalist(x):
  return PRESS if 'journalist' in x.text else ABSTAIN

In [111]:
# Define the set of labeling functions (LFs)
lfs = [lf_keyword_reporter, lf_keyword_press, lf_keyword_journalist]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2['press_label'] = label_model.predict(L=L_train, tie_break_policy='abstain')

100%|██████████| 1069/1069 [00:00<00:00, 14524.82it/s]


In [112]:
df2[df2['press_label']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical,ced_category,force_tags,police_label,federal_label,protester_label,press_label
34,police arrest two reporters,-1,-1,1,-1,-1,-1,-1,-1,[EHC Soft Technique],1,-1,-1,1
35,police arrest journalist michael harriot,-1,-1,1,-1,-1,-1,-1,-1,[EHC Soft Technique],1,-1,-1,1
38,police tear gas reporters,-1,-1,-1,-1,-1,-1,0,-1,[Other/Unknown],1,-1,-1,1
42,reporter shot with tear gas canister,-1,-1,-1,-1,-1,1,0,-1,[Projectiles],-1,-1,-1,1
53,police shove member of the press,-1,-1,1,-1,-1,-1,-1,-1,[EHC Soft Technique],1,-1,-1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1010,student journalist shot and tear gassed,-1,-1,-1,-1,-1,1,0,-1,[Projectiles],-1,-1,-1,1
1027,reporter shot with rubber bullets,-1,-1,-1,-1,-1,1,-1,-1,[Projectiles],-1,-1,-1,1
1042,police shoot rubber bullets at reporter,-1,-1,-1,-1,-1,1,-1,-1,[Projectiles],1,-1,-1,1
1058,police target independent journalists and live...,-1,-1,-1,-1,-1,-1,-1,-1,[Other/Unknown],1,-1,-1,1


Minors

In [113]:
MINOR = 1
NOT_MINOR = 0
ABSTAIN = -1

In [114]:
@labeling_function()
def lf_keyword_minor(x):
  return MINOR if 'minor' in x.text else ABSTAIN

In [115]:
@labeling_function()
def lf_keyword_underage(x):
  return MINOR if 'underage' in x.text else ABSTAIN

In [116]:
@labeling_function()
def lf_keyword_teen(x):
  return MINOR if 'teen' in x.text else ABSTAIN

In [117]:
@labeling_function()
def lf_keyword_child(x):
  return MINOR if 'child' in x.text else ABSTAIN

In [118]:
@labeling_function()
def lf_keyword_baby(x):
  return MINOR if 'baby' in x.text else ABSTAIN

In [119]:
@labeling_function()
def lf_keyword_toddler(x):
  return MINOR if 'toddler' in x.text else ABSTAIN

In [120]:
# Define the set of labeling functions (LFs)
lfs = [lf_keyword_minor, lf_keyword_child, lf_keyword_baby, 
       lf_keyword_underage, lf_keyword_teen, lf_keyword_toddler]

# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df2)

# Train the label model and compute the training labels
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50, seed=123)
df2['minor_label'] = label_model.predict(L=L_train, tie_break_policy='abstain')

100%|██████████| 1069/1069 [00:00<00:00, 7277.91it/s]


In [121]:
df2[df2['minor_label']==1]

Unnamed: 0,text,presence_label,verbal_label,ehc-soft_technique,ehc-hard_technique,blunt_impact,projectile,chemical,ced_category,force_tags,police_label,federal_label,protester_label,press_label,minor_label
44,underage protester tackled and arrested,-1,-1,1,1,-1,-1,-1,-1,"[EHC Soft Technique, EHC Hard Technique]",-1,-1,1,-1,1
411,toddler tear gassed by police,-1,-1,-1,-1,-1,-1,0,-1,[Other/Unknown],1,-1,-1,-1,1
771,police pepper spray young child,-1,-1,-1,-1,-1,0,1,-1,"[Projectiles, Chemical]",1,-1,-1,-1,1
943,law enforcement gas teenagers at a park,-1,-1,-1,-1,-1,-1,0,-1,[Other/Unknown],0,-1,-1,-1,1


# Add human tags to Dataframe

In [122]:
df2.columns

Index(['text', 'presence_label', 'verbal_label', 'ehc-soft_technique',
       'ehc-hard_technique', 'blunt_impact', 'projectile', 'chemical',
       'ced_category', 'force_tags', 'police_label', 'federal_label',
       'protester_label', 'press_label', 'minor_label'],
      dtype='object')

In [123]:
def add_human_labels(row):
  tags = []
  if row['police_label'] == 1 or row['police_label'] == 0:
    tags.append('Police')
  if row['federal_label'] == 1:
    tags.append('Federal')
  if row['protester_label'] == 1:
    tags.append('Protester')
  if row['press_label'] == 1:
    tags.append('Press')
  if row['minor_label'] == 1:
    tags.append('Minor')
  if not tags:
    tags.append('Other/Unknown')
  return tags

In [124]:
# apply human tags to incident data
df2['human_tags'] = df2.apply(add_human_labels,axis=1)

In [125]:
# take a peek
df2[['text','force_tags', 'human_tags']].head(3)

Unnamed: 0,text,force_tags,human_tags
0,police throw tear gas at protesters on a bridge,[Projectiles],"[Police, Protester]"
1,police assault protesters,[EHC Hard Technique],"[Police, Protester]"
2,police shoot non violent protester in the head,[Projectiles],"[Police, Protester]"


In [126]:
# clean the tags column by seperating tags
def join_tags(content):
  return ', '.join(content)

In [142]:
# add column to main df
df['human_tags'] = df2['human_tags'].apply(join_tags)

In [128]:
df['human_tags'].value_counts()

Police, Protester                    430
Police                               286
Other/Unknown                         85
Protester                             69
Police, Press                         67
Press                                 45
Police, Federal                       32
Police, Federal, Protester            26
Federal, Protester                     8
Police, Protester, Press               5
Police, Federal, Protester, Press      4
Police, Federal, Press                 3
Protester, Press                       3
Police, Minor                          3
Federal                                2
Protester, Minor                       1
Name: human_tags, dtype: int64

In [144]:
# last check
df = df.drop('date_text', axis=1)
df = df.drop('Unnamed: 0', axis=1)
df = df.drop_duplicates(subset=['id'], keep='last')
df.head(3)

Unnamed: 0,STATE_NAME,edit_at,CITY,text,date,id,Link 1,Link 2,Link 3,Link 4,Link 5,Link 6,Link 7,Link 8,Link 9,Link 10,Link 11,Link 12,Link 13,Link 14,Link 15,Link 16,Link 17,Link 18,Link 19,Link 20,STATE_CODE,COUNTY,LATITUDE,LONGITUDE,human_tags,force_tags
0,Louisiana,https://github.com/2020PB/police-brutality/blo...,New Orleans,police throw tear gas at protesters on a bridge,2020-06-03,la-neworleans-1,https://twitter.com/misaacstein/status/1268381...,https://twitter.com/ckm_news/status/1268382403...,https://twitter.com/brynstole/status/126838134...,https://twitter.com/xxnthe/status/126842775987...,https://twitter.com/greg_doucette/status/12685...,https://www.wdsu.com/article/protesters-on-i-1...,,,,,,,,,,,,,,,LA,Jefferson,29.963071,-90.160953,"Police, Protester",Projectiles
1,Texas,https://github.com/2020PB/police-brutality/blo...,Austin,police assault protesters,2020-05-30,tx-austin-2,https://gfycat.com/tautimaginativedore,https://www.reddit.com/r/2020PoliceBrutality/c...,,,,,,,,,,,,,,,,,,,TX,Hays,30.210692,-97.942749,"Police, Protester",EHC Hard Technique
2,Texas,https://github.com/2020PB/police-brutality/blo...,Austin,police shoot non violent protester in the head,2020-05-30,tx-austin-3,https://www.reddit.com/r/PublicFreakout/commen...,https://www.instagram.com/p/CA6TCIGnuWm/,https://www.youtube.com/watch?v=-BGyTi-KdKc (a...,https://streamable.com/o1uqgy (aftermath),https://cbsaustin.com/news/local/austin-teen-h...,,,,,,,,,,,,,,,,TX,Hays,30.210692,-97.942749,"Police, Protester",Projectiles


In [146]:
print(df.shape)

(1039, 32)


In [148]:
# exporting the dataframe 
df.to_csv('training_data.csv')
files.download('training_data.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>