# Blabber Cleaning
#### (By: Mark Ehab Aziz)
#### (Built Under: Python 3.11.4)
Filtering and cleaning text data, as tasked in 'to do.txt'.

Make sure the `nltk` package is installed, for example with `pip install nltk`.
The NLTK features used below should not need anything beyond the basic install.

In [20]:
# Importing Libraries
import pandas as pd                         # Loading Data
from nltk.tokenize import regexp_tokenize   # Tokenizing words with regular expressions
from nltk.stem import PorterStemmer         # Stemming words

In [21]:
# Loading data into kernel
# Using two methods (As stated in my previous projects)
# 1. Path working within my git repo
blab = pd.read_csv("../dataset/train.csv")

# 2. Path when data is within the same folder
#blab = pd.read_csv("./train.csv")

# Data Exploration
Using `.head(n)` to show the first $n$ rows of the dataset.

In [22]:
# Defining n rows to see
n = 5

# Showing head
blab.head(n)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\r\nWhy the edits made under my use...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\r\nMore\r\nI can't make any real suggestions...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


As stated by our todo list, we are only tasked with cleaning of the text, so we'll be focusing on `comment_text`.

Referring to our todo list once again, we will be dropping `id`, `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, and `identity_hate`, as we are not concerned with classifying the sentiment or the meaning behind any of the comments.

Reminder for what to be done:
- Read Text
- Clean Text (Capitalisation, punctuation)
- Remove Stop Words
- Tokenization
- Stemming

None of the tasks listed above uses the columns marked for dropping.

In [23]:
# Defining list of columns to be dropped
col_droppable = ["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Dropping
txt_blab = blab.drop(columns = col_droppable)

# Viewing
txt_blab.head(n)

Unnamed: 0,comment_text
0,Explanation\r\nWhy the edits made under my use...
1,D'aww! He matches this background colour I'm s...
2,"Hey man, I'm really not trying to edit war. It..."
3,"""\r\nMore\r\nI can't make any real suggestions..."
4,"You, sir, are my hero. Any chance you remember..."
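As an alternative to dropping columns after loading, the unwanted columns can be skipped at read time. The sketch below is not what this notebook does; it uses a made-up three-column CSV held in memory (`csv_text` is a hypothetical stand-in for `train.csv`) and the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for train.csv (assumption for illustration)
csv_text = "id,comment_text,toxic\nabc,Hello there,0\ndef,Second comment,1\n"

# Only the listed column is parsed; the others are never loaded
only_text = pd.read_csv(io.StringIO(csv_text), usecols=["comment_text"])

print(only_text.columns.tolist())
```

Either route ends with a frame that holds just the text; `drop()` keeps the original `blab` frame around for later inspection, which is why it may still be the handier choice here.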


In [24]:
# Removing '\n' '\r' '\t' from every line
txt_blab.replace(r'[\r\n\t]', ' ', regex = True, inplace=True)

# As shown, the new line, carriage return and tab escape characters are now replaced by spaces
txt_blab.head(n)

Unnamed: 0,comment_text
0,Explanation Why the edits made under my usern...
1,D'aww! He matches this background colour I'm s...
2,"Hey man, I'm really not trying to edit war. It..."
3,""" More I can't make any real suggestions on ..."
4,"You, sir, are my hero. Any chance you remember..."
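A small sketch of the replacement above on a toy frame (the `demo` rows are made up for illustration). It also shows that `\r\n` turns into two spaces, since each character is replaced on its own; collapsing those runs is an optional extra step, not something the notebook does.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real dataset
demo = pd.DataFrame({"comment_text": ["Explanation\r\nWhy the edits", "tab\there"]})

# Replace every carriage return, new line, or tab with one space
cleaned = demo.replace(r"[\r\n\t]", " ", regex=True)

# Optional: collapse runs of two or more spaces into a single space
collapsed = cleaned["comment_text"].str.replace(r" {2,}", " ", regex=True)

print(cleaned["comment_text"].tolist())
print(collapsed.tolist())
```

The double spaces are harmless for the tokenizer used later, as spaces never end up inside a token.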


# Cleaning Above Sentences
Using the NLTK library for Python, I will be copy-pasting or writing regex patterns that are enough to extract words starting with either an uppercase or a lowercase letter.

This may violate the order of operations specified in the todo list, as cleaning the data precedes tokenization there, but `regexp_tokenize()` takes care of both steps anyway.

Will also be removing the URLs as specified.

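Special characters will be removed, as every token must start with a letter (A through Z or a through z) and continue with word characters (`\w`), so no punctuation and no spaces end up in a token. Note that `\w` also matches digits and underscores, so these can still appear inside a token after its first letter, and that single-letter words such as "I" are dropped because `\w+` needs at least one more character.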
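A quick check of what the pattern `[A-Za-z]\w+` actually keeps, run on a made-up sentence (`sample` is an assumption for illustration). The stricter `[A-Za-z]+` pattern is shown only for comparison; it is not used anywhere in this notebook.

```python
from nltk.tokenize import regexp_tokenize

# Same pattern as upper_lower_words in this notebook
pattern = r"[A-Za-z]\w+"

# Hypothetical sample text
sample = "I can't edit 2 pages, see file_v2 or h4x0r!"

tokens = regexp_tokenize(sample, pattern)
print(tokens)
# ['can', 'edit', 'pages', 'see', 'file_v2', 'or', 'h4x0r']

# Letters-only comparison pattern (not used in the notebook)
letters_only = regexp_tokenize(sample, r"[A-Za-z]+")
print(letters_only)
```

The lone "I" and the "t" of "can't" disappear with the first pattern, while digits and underscores survive inside `file_v2` and `h4x0r`. Which behaviour is preferable depends on how the tokens are used later.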

In [25]:
# Defining Regex patterns
# Match words starting with Uppercase letters
# (no capturing group, otherwise regexp_tokenize would return only the captured first letter)
upper_words = r"[A-Z]\w+"

# Match Words that start with either Upper/lowercase letters
upper_lower_words = r"[A-Za-z]\w+"

# Match URLs
url_pattern = r"(http|ftp|https):\/\/([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?"

In [26]:
# Removing URLs (Standard URL Scheme, There still exist instances
# of just 'https' or 'http' randomly written, they will just be
# treated like normal words and tokenized as the rest)
txt_blab.replace(url_pattern, '', regex = True, inplace = True)


txt_blab.head(n)

Unnamed: 0,comment_text
0,Explanation Why the edits made under my usern...
1,D'aww! He matches this background colour I'm s...
2,"Hey man, I'm really not trying to edit war. It..."
3,""" More I can't make any real suggestions on ..."
4,"You, sir, are my hero. Any chance you remember..."
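The first five rows contain no URLs, so the output above looks unchanged. A toy example makes the effect visible; the two `demo` rows are made up, and `url_pattern` is copied from the cell above.

```python
import pandas as pd

# Pattern copied from this notebook
url_pattern = r"(http|ftp|https):\/\/([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?"

# Hypothetical rows: one full URL, one bare 'https'
demo = pd.DataFrame({
    "comment_text": [
        "see http://example.com/page?id=1 for details",
        "just https written alone",
    ]
})

# Full URLs are removed; a bare scheme without '://' is left untouched
out = demo.replace(url_pattern, "", regex=True)
print(out["comment_text"].tolist())
```

The removed URL leaves a double space behind, and the bare `https` stays in place to be tokenized like any other word.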


In [27]:
# Instantiate a list of tokens, to hold tokens of each entry
# Probably better to use a dictionary if we care about count???
# Still have to access every token and change to lower
# Also still have to make it so that I can stem
token_per_row = []

# Start and finish indices of iterator
# Bound to become the length of the file eventually
# I really despise this
for i in range(0, 50):
    # Grab string fully from dataframe
    line = txt_blab.iloc[i,0]

    # Append list of tokens
    # 2D list of lists; each containing tokens of each row
    token_per_row.append(regexp_tokenize(line, upper_lower_words))
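One possible way to get rid of the hard-coded range is to iterate over the column itself, so the loop always covers every row. This is a sketch on a made-up frame (`demo` is an assumption), not the notebook's final code.

```python
import pandas as pd
from nltk.tokenize import regexp_tokenize

# Same pattern as in this notebook
upper_lower_words = r"[A-Za-z]\w+"

# Hypothetical frame standing in for txt_blab
demo = pd.DataFrame({
    "comment_text": ["Hey man, really not trying", "You, sir, are my hero"]
})

# One list of tokens per row, for however many rows the frame holds
token_per_row = [
    regexp_tokenize(line, upper_lower_words) for line in demo["comment_text"]
]
print(token_per_row)
```

With the real data, `demo["comment_text"]` would become `txt_blab["comment_text"]`; on the full dataset this may take a while, which is a fair reason to keep a small range while experimenting.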

In [None]:
# Need to remove stop words too
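A minimal sketch of how stop word removal could look. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and hand-written for illustration. NLTK ships a fuller English list under `nltk.corpus.stopwords`, but that needs an extra `nltk.download("stopwords")`, which goes beyond the basic install.

```python
# Tiny hand-written stop word set (assumption, far from complete)
STOP_WORDS = {"the", "a", "an", "is", "are", "my", "to", "not", "you"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Hypothetical token lists, shaped like token_per_row
rows = [
    ["You", "sir", "are", "my", "hero"],
    ["Hey", "man", "really", "not", "trying"],
]

filtered = [remove_stop_words(row) for row in rows]
print(filtered)
# [['sir', 'hero'], ['Hey', 'man', 'really', 'trying']]
```

Comparing on the lowercase form means the check works whether or not the tokens have been lowercased yet.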

In [28]:
# Flatten the 2D??
# Lowercase every word
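The remaining steps could be sketched as follows: flatten the list of lists, lowercase every token, then stem with the `PorterStemmer` imported at the top. The `token_per_row` values here are made up for illustration, and flattening is only one option; keeping one list per row would work just as well if the row structure matters later.

```python
from itertools import chain
from nltk.stem import PorterStemmer

# Hypothetical token lists, shaped like token_per_row
token_per_row = [
    ["Hey", "man", "really", "trying"],
    ["Explanation", "edits", "made"],
]

# Flatten the 2D list into a single list of tokens
flat_tokens = list(chain.from_iterable(token_per_row))

# Lowercase every word
lower_tokens = [tok.lower() for tok in flat_tokens]

# Stem every word, e.g. 'edits' -> 'edit', 'trying' -> 'tri'
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in lower_tokens]
print(stems)
```

Stems are not always dictionary words ('tri' for 'trying'), which is expected from the Porter algorithm.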