<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/general_hate_speech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Our third workshop!

Today we are going to focus on practicing more pandas skills, using regular expressions to find what we need in data, and reasoning about how reliable models are.

We continue working with the hate speech dataset from last week, because hate speech classification is a special case of sentiment analysis. I grabbed a smaller toy dataset since we aren't going to do a very scientific analysis.


[list of notebooks](https://docs.google.com/document/d/18cNWM8iu7hVXn3DdHBN3mmuTbZwoJ_AzTdk71oBOLeQ/edit?usp=sharing)

[pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)


In [None]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.4
Looking in indexes: https://pypi.org/simple, https://us

In [None]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax
import csv
from datetime import datetime
from matplotlib.dates import date2num

# more packages, tools for getting to google drive
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

import pandas as pd # basically the excel of python

In [None]:
import re

# deep learning toolkit
from torch.utils.data import DataLoader
from torch.nn import Softmax
import torch

In [None]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

from datasets import Dataset

There seems to be plenty of confusion over importing data, so I moved it all into a function. 

With the function below, you just need to understand the **inputs** and the **result**, not the inner workings of the file manager object. We don't need to care how a function works as long as it behaves well.



| method | Description |inputs| returns|
|-------|----------------------------------------------------------------------|--|--|
| import_data_from_drive| downloads google drive data to colab workspace|share_link, your_name_for_file|None|
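
The one step inside the function that is worth a look is how the file id is pulled out of the share link. Here is a minimal sketch, assuming the link has the usual `https://drive.google.com/file/d/<id>/view` shape (the variable names are only for illustration):

```python
# A share link in the usual "file/d/<id>/view" shape
share_link = "https://drive.google.com/file/d/1xekg4coSgYQ0-HbqM_hYmnTNJf55t2KG/view?usp=sharing"

# Splitting on "/" gives: ['https:', '', 'drive.google.com', 'file', 'd', '<id>', 'view?usp=sharing']
parts = share_link.split("/")

# The id is the sixth piece (index 5)
file_id = parts[5]
print(file_id)
```

If a link has a different shape (for example a folder link), index 5 will probably not hold the id, so it is worth reading the printed id before trusting the download.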



In [None]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings a data file from a google drive share link to your colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in google drive
     your_name_for_file: a string naming the file, preferably ending in a file type, ex. 'data.csv'
     """
  file_id = share_link.split("/")[5] # separate the id from the link
  print("Using id", file_id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will move the file from Drive to the workspace
  download_object = drive.CreateFile({'id': file_id})
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)

  return

# Exercise 1

We will continue exploring hate speech classification, a special case of sentiment analysis. Given a new dataset of hate speech on Twitter, our steps will be: 1) clean the tweets, 2) infer hate speech labels with our model, 3)

Given the new dataframe, let's get comfortable with where the text we need lives.

In [None]:
#CHANGE THIS HERE
link = "https://drive.google.com/file/d/1xekg4coSgYQ0-HbqM_hYmnTNJf55t2KG/view?usp=sharing"

In [None]:
my_file_name = "hate.csv"

import_data_from_drive(link, my_file_name)

df = pd.read_csv(my_file_name)
df.columns

Using id 1xekg4coSgYQ0-HbqM_hYmnTNJf55t2KG to find file on drive
Authenticated colab user
Added file to workspace with name hate.csv


Index(['Unnamed: 0', 'worker_id', 'task_id', 'task_response_id', 'Tweet',
       'Tweet Text', 'created_at'],
      dtype='object')
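
To get comfortable with where the text lives, here is a small sketch on a made-up dataframe that reuses three of the column names above. The rows are invented for illustration and are not from the real file.

```python
import pandas as pd

# Invented rows that reuse the real column names
toy_df = pd.DataFrame({
    "Tweet": ["link-1", "link-2"],
    "Tweet Text": ["@someone I love this #MyLife", "nothing to see here"],
    "created_at": ["2022-07-07 06:54:17+00:00", None],
})

# A column name with a space in it needs bracket access (toy_df.Tweet Text would not parse)
text_column = toy_df["Tweet Text"]

# One cell: row label 0, column "Tweet Text"
first_tweet = toy_df.loc[0, "Tweet Text"]

# How many rows have no timestamp?
missing_dates = toy_df["created_at"].isna().sum()

print(first_tweet)
print(missing_dates)
```

The same three lines should work on the real `df`, since only the column names matter.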

## 1) Clean the data


Our first step is to clean up the data by removing mentions (@'s) and hashtags. The code for this is in day 1 - appendix. We are going to adjust it a little so that our final code will:

task:
- remove the entire mention (ex. "@username" -> "")
- remove just the # sign, but leave the body of the hashtag (ex. "#MyLife" -> "MyLife")

The core method for cleaning is `re.sub()`, which replaces every match of a pattern with another string. We will then use pandas' `apply()` to clean the whole dataset. Examine the methods more below:


| method | Description |inputs| returns|
|-------|----------------------------------------------------------------------|--|--|
| [re.sub](https://docs.python.org/3/library/re.html) |search a given text for anything that matches the given expression, and replace each match with a new string| (regex query, replacement, the given text)| the text with replacements|
|[pd.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)|given a function that works on one element of a series, call the function on every element (on a dataframe, the function is called on whole columns or rows instead) |a function; for dataframes, also axis (default is 0, apply to each column; 1 applies to each row)|a new series or dataframe holding the results|

  
To match strings, we write an expression in regex. Regex is a shorthand system for expressing string patterns. For example, say we want to find all instances of "color" but we know that some text uses the British spelling "colour". We may write the query "colou?r", where `u?` means "an optional u", so it matches both "color" and "colour". To clean up these different spellings, we could call `text = re.sub("colou?r", "color", text)`. That will show those stuck-up Brits!
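
The spelling example can be tried directly. This is a small sketch on a made-up sentence; the `?` makes only the character right before it optional, so where it sits in the pattern matters.

```python
import re

text = "The colour red and the color blue."

# "u?" means "zero or one u", so this pattern matches both spellings
normalized = re.sub(r"colou?r", "color", text)
print(normalized)

# Putting the "?" after the second "o" instead makes that "o" optional,
# which matches "color" and "colr" but leaves "colour" untouched
unchanged = re.sub(r"colo?r", "color", "colour")
print(unchanged)
```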








In [None]:
def clean(tweet):
  ## ADD WHATEVER YOU NEED HERE ACCORDING TO DATA
  # remove mentions: "@" followed by letters, digits, or underscores
  tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)
  # remove only the "#" sign, keeping the body of the hashtag
  tweet = re.sub(r"#", "", tweet)

  return tweet

In [None]:
df['clean_text'] = df['Tweet Text'].apply(clean)
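
To see what the cleaning step does before running it on the full dataset, here is a self-contained sketch on two made-up tweets. It repeats the `clean` function so it can run on its own; the sample strings are invented.

```python
import re
import pandas as pd

def clean(tweet):
    # remove the whole mention, e.g. "@username"
    tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)
    # remove only the "#" sign, keeping the hashtag body
    tweet = re.sub(r"#", "", tweet)
    return tweet

toy = pd.Series(["@username hello #MyLife", "no tags here"])

# apply() calls clean() once per element and returns a new Series
cleaned = toy.apply(clean)
print(list(cleaned))
```

Note that removing a mention leaves the space that followed it, so the first result starts with a blank. If that matters for your data, adding `tweet = tweet.strip()` before the `return` is one way to tidy it up.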

## 2) Infer label scores

Now that we have our data ready, let's apply our code from last meeting to infer hate speech labels!

task: apply the hate speech model from the sentiment analysis part 3 notebook




In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/roberta-hate-speech-dynabench-r4-target")

model = AutoModelForSequenceClassification.from_pretrained("facebook/roberta-hate-speech-dynabench-r4-target")


In [None]:
dataset = Dataset.from_pandas(df)

In [None]:
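model.to('cuda')
labels = ['nothate', 'hate']

rows = [] # one row of scores per tweet

for data in progress_bar(dataset):

  # tokenize one tweet, cutting off anything longer than the model can read
  tokens = tokenizer(data['clean_text'], return_tensors='pt', truncation=True)
  tokens = tokens.to('cuda')

  # we are only predicting, so skip the gradient bookkeeping
  with torch.no_grad():
    outputs = model(**tokens) # passes input_ids and attention_mask together

  logits = outputs['logits'].cpu().numpy()
  scores = softmax(logits, axis=1) # turn raw scores into probabilities that sum to 1

  rows.append(scores[0])

# build the dataframe once, instead of concatenating inside the loop
class_df = pd.DataFrame(rows, columns=labels)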

  0%|          | 0/1000 [00:00<?, ?it/s]

In [None]:
class_df

Unnamed: 0,nothate,hate
0,0.001428,0.998572
1,0.000539,0.999461
2,0.000219,0.999781
3,0.999078,0.000922
4,0.992906,0.007094
...,...,...
995,0.000335,0.999665
996,0.000371,0.999629
997,0.000205,0.999795
998,0.000411,0.999589
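
Each pair of scores above adds up to 1 because of the softmax step. The notebook uses `scipy.special.softmax`; the plain Python sketch below only illustrates the idea, and the function name and input numbers are made up.

```python
import math

def softmax_sketch(logits):
    """Turn a list of raw model outputs into probabilities that sum to 1."""
    biggest = max(logits)
    # subtracting the max does not change the result, but avoids huge exponentials
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# made-up raw outputs for (nothate, hate)
scores = softmax_sketch([-3.3, 3.3])
print(scores)
print(sum(scores))
```

A gap of a few points between the raw outputs is already enough to push one score very close to 1, which is worth keeping in mind when reading scores like 0.999 as "the model is certain".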


In [None]:
final_df = pd.concat([df,class_df], axis=1)
final_df

Unnamed: 0.1,Unnamed: 0,worker_id,task_id,task_response_id,Tweet,Tweet Text,created_at,clean_text,nothate,hate
0,0,4TNHZX6VKQAA,94345b44-37a0-479a-bfc5-3950b7fa24ed,7433adb3-3869-4dab-a190-586dd9d4162b,https://twitter.com/Lex70188667/status/1544937...,Why you running like a women??? No man walks o...,2022-07-07 06:54:17+00:00,Why you running like a women??? No man walks o...,0.001428,0.998572
1,1,D7HVJTEKMYNW,794a2e67-413e-4e49-bb4e-aaaf887b5602,4e566967-b7e6-487f-8992-7c15a9dd15fb,https://twitter.com/ApplesAndRust/status/15415...,Fucking NIGGER crap quality I threw it in the ...,2022-06-27 22:36:34+00:00,Fucking NIGGER crap quality I threw it in the ...,0.000539,0.999461
2,2,KEMPHTDC6KGH,6b495266-1d2e-4ac8-8a91-0cd46ed70125,ed93d34e-a504-48a3-9d9e-82739e6cb921,https://twitter.com/Opopideis/status/154602787...,"Just realized that I don’t hate niggers, what ...",,"Just realized that I don’t hate niggers, what ...",0.000219,0.999781
3,3,7ZGW996YFZ6X,6c3832a5-4ebd-4545-acf2-997751312a53,70a7654d-2f65-4bec-bdde-b783eed9c269,https://twitter.com/SnapCrackle12/status/15443...,Looks alot like Ellen Page. Is it her brother?,2022-07-05 16:21:04+00:00,Looks alot like Ellen Page. Is it her brother?,0.999078,0.000922
4,4,7ZGW996YFZ6X,ee831d3a-f8f3-4a64-a601-2033923e4375,fe877609-2166-49e9-9d20-692efdfbbe3f,https://twitter.com/patrickhenry2nd/status/154...,"Reminder, Ellen Page is and always will be a w...",2022-07-05 16:23:48+00:00,"Reminder, Ellen Page is and always will be a w...",0.992906,0.007094
...,...,...,...,...,...,...,...,...,...,...
995,995,CEHXHWFAVDAD,d70f0016-32ec-4230-8d94-f0b842ffbe75,e90e452a-5d21-459f-b709-55922f7a67e0,https://twitter.com/Daniel22305652/status/1543...,And you're stupid fucking nigger who supports ...,,And you're stupid fucking nigger who supports ...,0.000335,0.999665
996,996,CEHXHWFAVDAD,d1a485ed-fdf3-4e50-b565-c0a50f577c45,e2c5eff7-a510-41b5-b727-f157bd263083,https://twitter.com/Longfordlass75/status/1533...,Not for dirty muzzies have sex with. Filthy cr...,,Not for dirty muzzies have sex with. Filthy cr...,0.000371,0.999629
997,997,CEHXHWFAVDAD,a956d9dc-fdbb-4465-aaec-d21c6c4aac7f,0b7ceaaa-bf3e-4ab9-a2a8-8c84c9d0d6dc,https://twitter.com/Bailey_4357/status/1544116...,Mormons are disgusting - and I’m pretty sure C...,,Mormons are disgusting - and I’m pretty sure C...,0.000205,0.999795
998,998,4TNHZX6VKQAA,5d65ee77-71ea-4585-8248-83c128452a48,b46517fb-f567-4f21-8dba-c0bbada9757d,https://twitter.com/Trollbias69/status/1532374...,This invasion has been in the works for years....,,This invasion has been in the works for years....,0.000411,0.999589
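
One thing to know about `pd.concat(..., axis=1)`: it lines rows up by index label, not by position. The sketch below uses two tiny made-up dataframes to show what happens when the labels agree and when they do not.

```python
import pandas as pd

left = pd.DataFrame({"text": ["a", "b", "c"]})     # index 0, 1, 2
right = pd.DataFrame({"hate": [0.1, 0.9, 0.5]})    # index 0, 1, 2

# Same index labels on both sides: rows pair up as expected
combined = pd.concat([left, right], axis=1)

# Shift the labels on one side: rows no longer pair up, and gaps are filled with NaN
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

# Resetting the index puts the rows back in positional order
fixed = pd.concat([left, shifted.reset_index(drop=True)], axis=1)

print(combined.shape, misaligned.shape, fixed.shape)
```

In this notebook both dataframes are numbered from 0, so the concat lines up. If you filter or sort one of them first, calling `reset_index(drop=True)` on both before concatenating is a reasonable safeguard.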


In [None]:
final_df.to_excel("my_data_with_hatespeech.xlsx")
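
Once the scores sit next to the text, ordinary pandas filtering is a simple way to start reasoning about how much to trust the model. The sketch below uses made-up rows, and the 0.9 cutoff is an arbitrary assumption, not a recommended value.

```python
import pandas as pd

# Invented rows with the same score columns as final_df
toy = pd.DataFrame({
    "clean_text": ["first example", "second example", "third example"],
    "nothate": [0.9991, 0.0005, 0.45],
    "hate": [0.0009, 0.9995, 0.55],
})

score_columns = ["nothate", "hate"]

# The predicted label is the column holding the larger score
toy["label"] = toy[score_columns].idxmax(axis=1)

# The larger of the two scores is a rough measure of how sure the model is
top_score = toy[score_columns].max(axis=1)

threshold = 0.9  # arbitrary cutoff, chosen only for this illustration
confident = toy[top_score >= threshold]
uncertain = toy[top_score < threshold]

print(toy["label"].value_counts())
print(uncertain["clean_text"].tolist())
```

Reading the `uncertain` rows by hand, along with a sample of the `confident` ones, is a quick check on whether high scores actually match what a person would call hate speech.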