# Scott Breitbach
## 20-March-2021
## DSC550, Week 2

In [1]:
# Standard library
import sys
import unicodedata

# Third-party
import nltk
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

## 1) Read the *controversial-comments.jsonl* file and pre-process the text. 

In [2]:
# Forward slash avoids the invalid "\c" escape sequence and works on any OS
allCommentsDF = pd.read_json("controversial-comments/controversial-comments.jsonl", lines=True)
allCommentsDF

Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...
...,...,...
949995,0,I genuinely can't understand how anyone can su...
949996,0,"As a reminder, this subreddit [is for civil di..."
949997,0,K. Don't explain why or anything.
949998,0,[deleted]


Grab a sample of the set to work with:

In [3]:
commentsDF = allCommentsDF.sample(100)

#### A) Convert all text to lowercase letters.

In [4]:
commentsDF.txt = commentsDF['txt'].str.lower()

In [5]:
commentsDF.head()

Unnamed: 0,con,txt
202617,0,meh. i'm sure there will be a leaked email so...
812710,0,"politicians should take note, that if you're g..."
124885,0,he's really got it in for aaron rodgers' ankle.
545351,0,of course not. he's just trying to trick you i...
751713,0,oops sorry wrong comment.\n\nbasically i said ...


#### B) Remove all punctuation from the text.

In [6]:
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) 
                            if unicodedata.category(chr(i)).startswith('P'))

In [7]:
commentsDF.txt = [string.translate(punctuation) for string in commentsDF.txt]

In [8]:
commentsDF.head()

Unnamed: 0,con,txt
202617,0,meh im sure there will be a leaked email soon...
812710,0,politicians should take note that if youre goo...
124885,0,hes really got it in for aaron rodgers ankle
545351,0,of course not hes just trying to trick you int...
751713,0,oops sorry wrong comment\n\nbasically i said t...


#### C) Remove stop words.

In [9]:
# import nltk
# nltk.download('stopwords')

In [10]:
stop_words = stopwords.words('english')

def removeStopWords(string):
    tokenized_words = word_tokenize(string)
    return [word for word in tokenized_words if word not in stop_words]

In [11]:
commentsDF.txt = commentsDF.txt.apply(removeStopWords)

In [12]:
commentsDF.head()

Unnamed: 0,con,txt
202617,0,"[meh, im, sure, leaked, email, soon, dnc, staf..."
812710,0,"[politicians, take, note, youre, good, avoid, ..."
124885,0,"[hes, really, got, aaron, rodgers, ankle]"
545351,0,"[course, hes, trying, trick, believing, lies, ..."
751713,0,"[oops, sorry, wrong, comment, basically, said,..."


#### D) Apply NLTK's PorterStemmer.

In [13]:
porter = PorterStemmer()

In [14]:
commentsDF.txt = commentsDF.txt.apply(lambda x: [porter.stem(word) for word in x])

In [15]:
commentsDF.head()

Unnamed: 0,con,txt
202617,0,"[meh, im, sure, leak, email, soon, dnc, staffe..."
812710,0,"[politician, take, note, your, good, avoid, sc..."
124885,0,"[he, realli, got, aaron, rodger, ankl]"
545351,0,"[cours, he, tri, trick, believ, lie, msm, liza..."
751713,0,"[oop, sorri, wrong, comment, basic, said, coun..."


## 2) Get text into a usable form for model-building.

#### A) Convert each text entry into a word-count vector.
See sections 5.3 & 6.8 in the *Machine Learning with Python Cookbook*

In [16]:
count = CountVectorizer()

In [17]:
# commentsDF['WCV'] = commentsDF.txt.apply(lambda x: count.fit_transform(x).toarray())

Converting to a word-count vector threw an error when the 'txt' field contained an empty list. The function below should return an empty list when it runs into this error.

In [18]:
# def wordCountVector(wordList):
#     try:
#         array = count.fit_transform(wordList).toarray()
#         return array
#     except:
#         return []

In [19]:
# commentsDF['WCV'] = commentsDF.txt.apply(lambda x: wordCountVector(x))
# commentsDF.head()
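
To see why the row-by-row attempts above don't work, here is a minimal sketch on made-up token lists (the `docs` data and variable names are hypothetical, not from the assignment). As far as I can tell, fitting a `CountVectorizer` per row learns a separate vocabulary for each row, so the columns don't line up, and an empty list leaves nothing to build a vocabulary from.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for two pre-processed comments (hypothetical data).
docs = [["leak", "email", "soon"], ["wrong", "comment"]]

# Fitting per row treats each token as its own document and learns a
# separate vocabulary each time, so columns do not line up across rows.
per_row_vocab = [sorted(CountVectorizer().fit(tokens).vocabulary_) for tokens in docs]

# An empty token list has no vocabulary at all, which raises ValueError.
try:
    CountVectorizer().fit_transform([])
    empty_error = None
except ValueError as err:
    empty_error = str(err)

# Fitting once on the joined strings gives one shared vocabulary and
# a single (n_documents x n_terms) matrix.
joined = [" ".join(tokens) for tokens in docs]
shared = CountVectorizer()
matrix = shared.fit_transform(joined)

print(per_row_vocab)
print(empty_error)
print(matrix.shape)
```
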

Apparently this needs to be in one large matrix so let's try it again:

In [20]:
# Convert all the text data into a list of strings,
# with each comment as one string in the list
text_data = [" ".join(tokens) for tokens in commentsDF.txt]

In [21]:
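# # Word-count vector as a DataFrame
# # (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() works on 1.0+)
# wordCountVector = pd.DataFrame(count.fit_transform(text_data).toarray(), columns=count.get_feature_names_out())
# wordCountVector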

In [22]:
# Word-count vector as a sparse matrix
sparseWCV = count.fit_transform(text_data)
sparseWCV

<100x1004 sparse matrix of type '<class 'numpy.int64'>'
	with 1661 stored elements in Compressed Sparse Row format>

#### B) Convert each text entry into a part-of-speech tag vector.
See section 6.7 in the *Machine Learning with Python Cookbook*

In [23]:
nltk.pos_tag(commentsDF.txt.iloc[0])[:5]

[('meh', 'NN'), ('im', 'JJ'), ('sure', 'JJ'), ('leak', 'JJ'), ('email', 'NN')]

In [24]:
commentsDF['PoS'] = commentsDF.txt.apply(lambda x: [tag for word, tag in nltk.pos_tag(x)])
commentsDF.head(2)

Unnamed: 0,con,txt,PoS
202617,0,"[meh, im, sure, leak, email, soon, dnc, staffe...","[NN, JJ, JJ, JJ, NN, RB, JJ, NN, NN, JJS, NN, ..."
812710,0,"[politician, take, note, your, good, avoid, sc...","[JJ, VB, VB, PRP$, JJ, NN, NN, NN, VBP, CD, RB..."


This also needs to be in a matrix:

In [25]:
oneHotMulti = MultiLabelBinarizer()

In [26]:
taggedTweets = []

for tweet in text_data:
    tweetTag = nltk.pos_tag(word_tokenize(tweet))
    taggedTweets.append([tag for word, tag in tweetTag])

In [27]:
# # Part-of-speech tags as a DataFrame
# partOfSpeech = pd.DataFrame(oneHotMulti.fit_transform(taggedTweets), columns=oneHotMulti.classes_)
# partOfSpeech

In [28]:
# Part-of-speech tags as a coded matrix
posMatrix = oneHotMulti.fit_transform(taggedTweets)
posMatrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]])

In [29]:
posMatrix.shape

(100, 29)

#### C) Convert each entry into a term frequency-inverse document frequency (**tfidf**) vector.
See section 6.9 in the *Machine Learning with Python Cookbook*

In [30]:
tfidf = TfidfVectorizer()

In [31]:
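# # tfidf vector as a DataFrame:
# # (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() works on 1.0+)
# tfidfVector = pd.DataFrame(tfidf.fit_transform(text_data).toarray(), columns=tfidf.get_feature_names_out())
# tfidfVector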

In [32]:
# tfidf vector as a sparse matrix:
sparseTfidf = tfidf.fit_transform(text_data)
sparseTfidf

<100x1004 sparse matrix of type '<class 'numpy.float64'>'
	with 1661 stored elements in Compressed Sparse Row format>

## **Follow-Up Question**

### For the three techniques in problem 2 above, give an example where each would be useful.

#### A) Word-count vector

In the article [Building a Better Profanity Detection Library with scikit-learn](https://victorzhou.com/blog/better-profanity-detection-with-scikit-learn/) the author, Victor Zhou, discusses how he was looking for a pre-built way to detect profanity in some text he was working with. He found a few options, most of which just checked against a pre-generated list, but they weren't as comprehensive as he'd like. So, he set out to create his own profanity detection library using machine learning.  
To build his model, he used a couple of datasets that had already been labeled by humans as to whether or not they contained offensive or hate speech. He then created a word-count vector using scikit-learn's CountVectorizer to feed his machine learning model, and the result is now available to install in Python as `profanity-check`.
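
As a rough illustration of that idea (not Zhou's actual code), here is a minimal sketch that feeds word counts into a classifier. The four training sentences and their labels are made up, and the choice of `LogisticRegression` is my own assumption for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-made training set (hypothetical); 1 = offensive, 0 = not.
texts = [
    "you are a jerk",
    "what a stupid jerk",
    "have a nice day",
    "what a lovely day",
]
labels = [1, 1, 0, 0]

# CountVectorizer turns each text into a word-count vector,
# which the classifier then uses as its features.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

prediction = model.predict(["stupid jerk"])
print(prediction)
```

A real model would need far more labeled data than this, but the pipeline shape (count vectors in, label out) is the same.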

#### B) Part-of-speech vector

Part-of-speech vectors are useful for working with and understanding language. Words carry meaning beyond their spelling, and the same word can mean different things in different contexts.  
An example of the usefulness of POS tagging is in text-to-speech applications. Because the English language contains homographs (words that are spelled the same but mean different things, and are sometimes pronounced differently), POS tagging can help determine which meaning, and therefore which pronunciation, is appropriate (e.g. minute (60 seconds) vs. minute (very small)).
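
A minimal sketch of the "minute" example, assuming the tags have already been assigned. The `PRONUNCIATIONS` table and the `pronounce` helper are hypothetical names I made up for illustration, and the tags are written by hand here rather than produced by `nltk.pos_tag`.

```python
# Hypothetical lookup: (word, first two letters of the Penn Treebank tag)
# mapped to a rough pronunciation hint.
PRONUNCIATIONS = {
    ("minute", "NN"): "MIN-it",   # noun: 60 seconds
    ("minute", "JJ"): "my-NOOT",  # adjective: very small
}


def pronounce(tagged_tokens):
    """Return a pronunciation hint per token, falling back to the word itself."""
    return [PRONUNCIATIONS.get((word, tag[:2]), word) for word, tag in tagged_tokens]


# Tags are hand-written for this sketch; a tagger would normally supply them.
noun_use = [("wait", "VB"), ("a", "DT"), ("minute", "NN")]
adjective_use = [("a", "DT"), ("minute", "JJ"), ("detail", "NN")]

print(pronounce(noun_use))
print(pronounce(adjective_use))
```
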

#### C) TF-IDF vector

TF-IDF (term frequency-inverse document frequency) is a way of weighting words based on two metrics: term frequency, or how often a word shows up in a text, and inverse document frequency, which lowers a word's score if it is very common across all documents. The idea is that higher-scoring words are more important to that particular text.  
While there are a lot of uses for tfidf (from Natural Language Processing to auto-tagging / finding keywords), one of the most common uses is probably in search, where tfidf is used to find, rank, and return the most relevant results for your query.
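
Here is a minimal sketch of the search use case, assuming a tiny made-up corpus and query (the `documents`, `vectorizer`, and `ranked` names are mine). It ranks documents by cosine similarity between their tfidf vectors and the query's tfidf vector, which is one simple way to do it, not necessarily how any real search engine works.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny made-up corpus (hypothetical data).
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]

# Learn the vocabulary and idf weights from the corpus.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Project the query into the same tfidf space (transform, not fit_transform).
query_vector = vectorizer.transform(["cat mat"])

# Score each document against the query and sort best-first.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = scores.argsort()[::-1]

print(ranked[0], scores)
```

Note that "cats" in the second document does not match "cat" in the query, which is one reason stemming (as in part 1D) can matter before vectorizing.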