# Text data cleaning and preprocessing

Using the OffensEval dataset, the first task is to prepare and clean the data for further use.

Importing the dataset and removing unnecessary columns

In [1]:
import pandas as pd

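# The OLID training file is tab-separated
tweets_data = pd.read_csv('../Dataset/olid-training-v1.0.tsv', delimiter='\t')
# Keep only the tweet text and the subtask_a label (columns 1 and 2)
tweets_data = tweets_data.iloc[:,1:3]
tweets_data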

Unnamed: 0,tweet,subtask_a
0,@USER She should ask a few native Americans wh...,OFF
1,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF
2,Amazon is investigating Chinese employees who ...,NOT
3,"@USER Someone should'veTaken"" this piece of sh...",OFF
4,@USER @USER Obama wanted liberals &amp; illega...,NOT
...,...,...
13235,@USER Sometimes I get strong vibes from people...,OFF
13236,Benidorm ✅ Creamfields ✅ Maga ✅ Not too sh...,NOT
13237,@USER And why report this garbage. We don't g...,OFF
13238,@USER Pussy,OFF
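Selecting columns by position works, but it silently picks the wrong data if the column order ever changes. A minimal sketch of selecting by name instead is shown here; the in-memory `sample_tsv` rows are made up, and the header names other than `tweet` and `subtask_a` are an assumption about the file layout.

```python
import io

import pandas as pd

# Tiny stand-in for the training file (hypothetical rows; the header
# names besides 'tweet' and 'subtask_a' are assumed, not taken from above).
sample_tsv = (
    "id\ttweet\tsubtask_a\tsubtask_b\tsubtask_c\n"
    "1\t@USER example one URL\tOFF\tUNT\tNULL\n"
    "2\texample two #tag\tNOT\tNULL\tNULL\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    usecols=["tweet", "subtask_a"],  # keep only the text and its label
)
print(sample.columns.tolist())
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path used in the cell above.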


Creating a function to remove unnecessary characters and other noise from the text.

In [2]:
import re 
import string 

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F700-\U0001F77F"  # alchemical symbols
    "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
    "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA00-\U0001FA6F"  # Chess Symbols
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U00002702-\U000027B0"  # Dingbats
    "\U000024C2-\U0001F251"  # very wide range: also matches CJK and other non-emoji characters
    "]+"
    )

def clean_text(text):
    # Remove @USER mention placeholders
    text = text.replace('@USER', '')
    # Remove hashtags
    text = re.sub(r'#\w+', '', text)
    # Remove emoji
    text = EMOJI_PATTERN.sub('', text)
    # Remove URL placeholders
    text = text.replace('URL', '')
    # Remove punctuation marks (apostrophes are kept)
    text = re.sub(r'[.,/#!$%"?@^&*+;:{}=\-_`~()]', '', text)
    # Remove digits
    text = re.sub(r'\d', '', text)
    # Collapse whitespace left behind by the removals and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    # Lowercase
    text = text.lower()

    return text

# Alias used when applying the cleaning function to the tweet column
clean = clean_text
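The raw tweets contain HTML entities such as `&amp;`, and stripping `&` and `;` leaves the stray token `amp` in the cleaned text. One possible refinement, sketched here as an optional extra step rather than part of the original pipeline, is to decode entities with the standard library's `html.unescape` before removing punctuation. The helper name `decode_entities` is hypothetical.

```python
import html
import re


def decode_entities(text):
    """Turn HTML entities such as '&amp;' back into the characters they stand for."""
    return html.unescape(text)


raw = "Obama wanted liberals &amp; illegals"
decoded = decode_entities(raw)  # '&amp;' becomes '&'

# Same idea as the punctuation step in clean_text, reduced to the
# two characters that matter for this example.
stripped = re.sub(r"[&;]", "", decoded)
stripped = re.sub(r"\s+", " ", stripped).strip().lower()
print(stripped)
```

Without the decoding step the same input would end up containing the word `amp`.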

In [3]:
tweets_data.tweet = tweets_data.tweet.apply(clean)
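A tweet that consisted only of mentions, hashtags, URLs or emoji can end up empty after cleaning. The following sketch shows one way to count and filter such rows; the `cleaned` frame is a hypothetical stand-in for `tweets_data`, and whether empty tweets should be dropped at all is a modelling decision this notebook does not make.

```python
import pandas as pd

# Hypothetical cleaned tweets; two of them have no text left.
cleaned = pd.DataFrame({
    "tweet": ["go home you're drunk", "", "  ", "buy more icecream"],
    "subtask_a": ["OFF", "NOT", "OFF", "NOT"],
})

# A tweet counts as empty if nothing but whitespace remains.
is_empty = cleaned["tweet"].str.strip().eq("")
print(f"{is_empty.sum()} empty tweets after cleaning")

# Keep the original index so rows can still be traced back to the source file.
non_empty = cleaned.loc[~is_empty]
```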

Separating the dataset into offensive and non-offensive tweets

In [4]:
off_tweets = tweets_data.loc[tweets_data['subtask_a'] == "OFF"]
off_tweets

Unnamed: 0,tweet,subtask_a
0,she should ask a few native americans what th...,OFF
1,go home you’re drunk,OFF
3,someone should'vetaken this piece of shit to ...,OFF
5,liberals are all kookoo,OFF
6,oh noes tough shit,OFF
...,...,...
13223,is advocating for conduct within bounds of hu...,OFF
13227,liars like the antifa twins you vigorously de...,OFF
13235,sometimes i get strong vibes from people and ...,OFF
13237,and why report this garbage we don't give a crap,OFF


In [5]:
not_tweets = tweets_data.loc[tweets_data['subtask_a'] == "NOT"]
not_tweets

Unnamed: 0,tweet,subtask_a
2,amazon is investigating chinese employees who ...,NOT
4,obama wanted liberals amp illegals to move in...,NOT
8,buy more icecream,NOT
10,it’s not my fault you support gun control,NOT
11,what’s the difference between and one of thes...,NOT
...,...,...
13232,she is not the brightest light on the tree,NOT
13233,if i say you are mad now you will say i'm tir...,NOT
13234,retweet complete amp followed all patriots,NOT
13236,benidorm creamfields maga not too shabby of a ...,NOT
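Before working with the two subsets it can help to check how balanced the classes are. The sketch below uses a handful of rows copied from the outputs above as a toy frame (`labelled` is a hypothetical name) and shows `value_counts` for the class sizes plus `groupby` as an alternative to writing one boolean filter per label.

```python
import pandas as pd

# Toy frame built from a few rows shown in the outputs above.
labelled = pd.DataFrame({
    "tweet": [
        "liberals are all kookoo",
        "oh noes tough shit",
        "buy more icecream",
        "she is not the brightest light on the tree",
        "retweet complete amp followed all patriots",
    ],
    "subtask_a": ["OFF", "OFF", "NOT", "NOT", "NOT"],
})

# Number of tweets per label
counts = labelled["subtask_a"].value_counts()
print(counts)

# One sub-frame per label, keyed by the label value
groups = {label: frame for label, frame in labelled.groupby("subtask_a")}
print(groups["OFF"])
```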


## Corpus of text

Extracting only the cleaned tweet text to build the corpus

In [6]:
only_tweets = pd.DataFrame(tweets_data.tweet)
only_tweets

Unnamed: 0,tweet
0,she should ask a few native americans what th...
1,go home you’re drunk
2,amazon is investigating chinese employees who ...
3,someone should'vetaken this piece of shit to ...
4,obama wanted liberals amp illegals to move in...
...,...
13235,sometimes i get strong vibes from people and ...
13236,benidorm creamfields maga not too shabby of a ...
13237,and why report this garbage we don't give a crap
13238,pussy


In [7]:
only_tweets_off = pd.DataFrame(off_tweets.tweet)
only_tweets_off

Unnamed: 0,tweet
0,she should ask a few native americans what th...
1,go home you’re drunk
3,someone should'vetaken this piece of shit to ...
5,liberals are all kookoo
6,oh noes tough shit
...,...
13223,is advocating for conduct within bounds of hu...
13227,liars like the antifa twins you vigorously de...
13235,sometimes i get strong vibes from people and ...
13237,and why report this garbage we don't give a crap


In [8]:
only_tweets_not = pd.DataFrame(not_tweets.tweet)
only_tweets_not

Unnamed: 0,tweet
2,amazon is investigating chinese employees who ...
4,obama wanted liberals amp illegals to move in...
8,buy more icecream
10,it’s not my fault you support gun control
11,what’s the difference between and one of thes...
...,...
13232,she is not the brightest light on the tree
13233,if i say you are mad now you will say i'm tir...
13234,retweet complete amp followed all patriots
13236,benidorm creamfields maga not too shabby of a ...


Exporting the reduced training set, whose tweet column now holds the cleaned text, together with the clean corpora of all, offensive, and non-offensive tweets

In [9]:
tweets_data.to_csv('tweets_data_clean.csv')
only_tweets.to_csv('only_tweets.csv')
only_tweets_off.to_csv('only_tweets_off.csv')
only_tweets_not.to_csv('only_tweets_not.csv')
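By default `to_csv` writes the DataFrame index as a first column with an empty header, so reading the file back without further arguments yields an extra `Unnamed: 0` column. The round trip below is a small sketch of this behaviour using a temporary directory and a hypothetical file name; passing `index_col=0` to `read_csv` restores the original index.

```python
import os
import tempfile

import pandas as pd

# Two rows with a non-contiguous index, as in the filtered subsets above.
frame = pd.DataFrame(
    {"tweet": ["buy more icecream", "oh noes tough shit"]},
    index=[8, 6],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "only_tweets_demo.csv")  # hypothetical file name
    frame.to_csv(path)                          # index is written as an unnamed first column
    naive = pd.read_csv(path)                   # index comes back as a regular column
    restored = pd.read_csv(path, index_col=0)   # first column becomes the index again

print(naive.columns.tolist())
print(restored.index.tolist())
```

If the original row numbers are not needed downstream, `to_csv(path, index=False)` avoids the extra column altogether.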

Pickling the corpora of tweets

In [10]:
only_tweets.to_pickle("corpus.pkl")
only_tweets_not.to_pickle("corpus_not.pkl")
only_tweets_off.to_pickle("corpus_off.pkl")
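A later notebook can load these files with `pd.read_pickle`. The sketch below round-trips a small, hypothetical frame through a temporary file to show that index and contents survive unchanged. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for one of the corpora (rows taken from the outputs above).
corpus = pd.DataFrame(
    {"tweet": ["go home you’re drunk", "buy more icecream"]},
    index=[1, 8],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus_demo.pkl")  # hypothetical file name
    corpus.to_pickle(path)
    loaded = pd.read_pickle(path)

# The loaded frame should match the original, index included.
print(loaded.equals(corpus))
```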