<a href="https://colab.research.google.com/github/AtreyaBandyopadhyay/NLP-with-Disaster-Tweets/blob/main/2_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#importing libraries
import pandas as pd
import numpy as np
from tqdm import tqdm

In [2]:
tqdm.pandas()

In [3]:
from transformers import BertTokenizer

In [4]:
import regex as re

In [5]:
pd.options.display.max_colwidth = 100


## 1. Load Dataset

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
dataset = pd.read_csv("/content/drive/MyDrive/Disaster Tweet Twitter/Dataset/train.csv")

In [15]:
dataset.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are being notified by officers. No other evacuation or... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation orders in California | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school | 1 |


In [16]:
#size of dataset
dataset.shape

(7613, 5)

In [17]:
#features
dataset.columns

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

In [18]:
dataset.target.unique()

array([1, 0])

## 2. Initializing Bert Tokenizer

In [19]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## 3. Processing hashtags

In [20]:
text = dataset.text.iloc[5]

In [21]:
text

'#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires'

In [22]:
tokenizer.tokenize(text)

['#',
 'rocky',
 '##fire',
 'update',
 '=',
 '>',
 'california',
 'h',
 '##wy',
 '.',
 '20',
 'closed',
 'in',
 'both',
 'directions',
 'due',
 'to',
 'lake',
 'county',
 'fire',
 '-',
 '#',
 'caf',
 '##ire',
 '#',
 'wild',
 '##fires']

The example above shows that the BERT tokenizer emits the leading '#' of a hashtag as a separate token, which lets the model recognize hashtags. Hashtags are also often several words written without spaces, such as "#RockyFire"; the tokenizer splits this one into the word pieces 'rocky' and '##fire', so the underlying words are still visible to the model. **In conclusion, hashtags do not need any preprocessing.**

## 4. Processing mentions

As we observed in the EDA section, mentions occur more frequently in the negative class than in the positive class. Mentions in the positive class frequently refer to news portals or government websites, while other mentions usually carry no semantic meaning. Based on these observations, news portal mentions are replaced with @news, government site mentions are replaced with @gov, and all other mentions are replaced with @mention.

In [23]:
def replace_mentions(text):
  '''
  Replace news mentions with @news, government site mentions with @gov
  and all other mentions with @mention.
  '''
  def _label(match):
    mention = match.group(1)
    if "gov" in mention:
      return "@gov"
    if "news" in mention or "reuters" in mention or "nytimes" in mention:
      return "@news"
    return "@mention"

  # Substitute each mention as a whole token, so a handle that is a prefix
  # of another handle (e.g. @cnn and @cnnnews) is not partially rewritten.
  return re.sub(r"@(\w+)", _label, text)

In [24]:
text = "@marksmaponyane Hey!Sundowns were annihilated in their previous meeting with Celtic.Indeed its an improvement. @nytimes @cdcgov"

In [25]:
replace_mentions(text)

'@mention Hey!Sundowns were annihilated in their previous meeting with Celtic.Indeed its an improvement. @news @gov'

In [26]:
dataset['text'] = dataset['text'].apply(lambda x: replace_mentions(x))

## 5. Processing URLs

In [27]:
text = dataset.text.iloc[207]

In [28]:
text

'http://t.co/J8TYT1XRRK Twelve feared killed in Pakistani air ambulance helicopter crash http://t.co/9d4nAzOI94'

In [29]:
tokenizer.tokenize(text)

['http',
 ':',
 '/',
 '/',
 't',
 '.',
 'co',
 '/',
 'j',
 '##8',
 '##ty',
 '##t',
 '##1',
 '##x',
 '##rr',
 '##k',
 'twelve',
 'feared',
 'killed',
 'in',
 'pakistani',
 'air',
 'ambulance',
 'helicopter',
 'crash',
 'http',
 ':',
 '/',
 '/',
 't',
 '.',
 'co',
 '/',
 '9',
 '##d',
 '##4',
 '##na',
 '##zo',
 '##i',
 '##9',
 '##4']

The URL is split into many tokens that carry no information. However, the presence or absence of a URL can be indicative of the target class. **So we replace the entire URL with http.**

In [30]:
dataset['text'] = dataset['text'].apply(lambda x: re.sub(r'https?://\S+|www\.\S+','http', x))

In [31]:
dataset.text.iloc[207]

'http Twelve feared killed in Pakistani air ambulance helicopter crash http'
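
As a small illustration of the substitution above, the sketch below applies the same pattern to a few made-up strings. It uses the standard library `re` module instead of the `regex` package imported in this notebook (the pattern behaves the same in both), and the helper name `replace_urls` and the sample strings are assumptions for illustration only.

```python
import re  # standard library; the notebook imports the `regex` package under the same name

# Same pattern as in the cell above: http(s) links or bare www. links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def replace_urls(text, placeholder="http"):
    """Replace every URL in `text` with a single placeholder token."""
    return URL_PATTERN.sub(placeholder, text)

samples = [
    "Crash reported http://t.co/9d4nAzOI94",
    "see www.example.com for details",
    "no link here",
]
for sample in samples:
    print(replace_urls(sample))
```

Because `\S+` runs up to the next whitespace, punctuation attached to the end of a link is swallowed together with the link.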

## 6. Processing Emoticons

In [33]:
def replace_emojis(text):
  '''
  Replace emoticons with semantically equivalent text.
  '''
  text = text.replace(':-)','smile')
  text = text.replace(':-(','sad')
  text = text.replace(':-o','surprise')
  text = text.replace(':-D','gawk')
  text = text.replace(':D','gawk')
  text = text.replace(';-)','wink')
  return text


In [34]:
dataset.text.iloc[4278]

"@mention Yay good cooler weather for PDX..ABQ NM is feeling the heat wave now bcuz my rain dances aren't working :-)"

In [35]:
dataset.text = dataset.text.apply(lambda x:replace_emojis(x))

In [36]:
dataset.text.iloc[4278]

"@mention Yay good cooler weather for PDX..ABQ NM is feeling the heat wave now bcuz my rain dances aren't working smile"

## 7. Remove duplicates

In [37]:
#total number of duplicated text
dataset.duplicated(subset="text").sum()

653

In [38]:
#total number of duplicated text and labels
dataset.duplicated(subset=["text","target"]).sum()

585

68 records (653 − 585) repeat the text of an earlier record but carry a different target. Since the target is not consistent among these duplicated texts, we drop these records.

In [39]:
dataset.drop(dataset[~dataset.duplicated(subset=["text","target"]) & dataset.duplicated(subset="text")].index,axis=0,inplace=True)

For the rest of the duplicates, we keep the first occurrence.

In [40]:
duplicated = dataset.duplicated(subset=['text','target'],keep="first")

In [41]:
dataset = dataset[~duplicated]
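
To make the two-step deduplication easier to follow, here is a minimal sketch on a toy DataFrame. The data and the variable names (`toy`, `same_text`, `same_pair`, `conflicting`) are invented for illustration; only the `duplicated` and `drop` calls mirror the cells above.

```python
import pandas as pd

# Toy data: "fire" appears with both labels, "flood" is an exact duplicate.
toy = pd.DataFrame({
    "text":   ["fire", "fire", "fire", "flood", "flood", "storm"],
    "target": [1,      1,      0,      0,       0,       1],
})

same_text = toy.duplicated(subset="text")              # text seen in an earlier row
same_pair = toy.duplicated(subset=["text", "target"])  # text AND target seen in an earlier row

# Step 1: rows that repeat an earlier text with a label not yet seen for that text.
conflicting = same_text & ~same_pair
toy = toy.drop(toy[conflicting].index)

# Step 2: of the remaining exact duplicates, keep only the first occurrence.
toy = toy[~toy.duplicated(subset=["text", "target"], keep="first")]
print(toy)
```

In this toy run only rows 0, 3 and 5 survive. Note that the text "fire" is not removed entirely: the row with the first-seen label is kept and the later, disagreeing row is dropped.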

## 8. Clean text

In [42]:
def clean_text(text):
  '''
  Lowercase the text, strip HTML remnants and collapse repeated characters.
  '''
  text = text.lower()

  # remove HTML tags and entities first, so the punctuation rule below
  # cannot break up adjacent entities such as "&gt;&gt;"
  text = re.sub(r"<[^>]+>", " ", text)
  text = re.sub(r"&(?:lt|gt|amp);", " ", text)

  #text=text.replace("#","")
  text = text.replace("'", "")
  text = text.replace("-", "")
  text = text.replace("~", " ")

  # collapse a run of punctuation marks into its first character
  r = re.compile(r'([?.,/#!$%^&*;:{}=_`~()-])[?.,/#!$%^&*;:{}=_`~()-]+')
  text = r.sub(r'\1', text)

  # collapse any character repeated three or more times into one. Eg funnyyy -> funny
  r = re.compile(r"(.)\1{2,}")
  text = r.sub(r'\1', text)

  #Spell Correction
  #text=TextBlob(text)
  #text=str(text.correct())

  return text
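
The two collapsing rules in `clean_text` can be tried in isolation. The sketch below copies the two patterns into a hypothetical helper, `collapse_runs`, and uses the standard library `re` module; the sample strings are made up for illustration.

```python
import re

# Same patterns as in clean_text, compiled once.
PUNCT_RUN = re.compile(r'([?.,/#!$%^&*;:{}=_`~()-])[?.,/#!$%^&*;:{}=_`~()-]+')
CHAR_RUN = re.compile(r"(.)\1{2,}")

def collapse_runs(text):
    """Collapse punctuation runs, then characters repeated three or more times."""
    text = PUNCT_RUN.sub(r"\1", text)  # "?!!" -> "?", "..." -> "."
    text = CHAR_RUN.sub(r"\1", text)   # "sooo" -> "so", "funnyyy" -> "funny"
    return text

print(collapse_runs("what?!! this is sooo funnyyy..."))
print(collapse_runs("13,000 people"))
```

The second rule matches any character, not only letters, so digits are collapsed as well: "13,000" becomes "13,0". Whether that matters depends on how much weight numbers carry for the task.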

In [43]:
dataset['text'] = dataset['text'].progress_apply(lambda x:clean_text(x))

## 9. Cleaning Keyword

In [44]:
dataset['keyword'] = dataset['keyword'].str.replace("%20", " ", regex=False)

## 10. Saving file

In [45]:
dataset.to_csv("/content/drive/MyDrive/Disaster Tweet Twitter/Dataset/cleaned_train.csv")