### 2.2 Exercise: Building Your Text Classifiers <br> DSC550 <br> 2020-11-29 <br> Michael Hotaling

In [1]:
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords

### Pre-processing

- **For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame.**

In [2]:
df = pd.read_json("controversial-comments.jsonl",lines=True)
df.head(10)

Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...
5,0,Thanks for playing. [I feel like her now](http...
6,0,[deleted]
7,0,"i cant be racist, i have a black friend \n\nl..."
8,0,Nope. You're right that they are both smoke an...
9,0,&lt;That's exactly what it means. especially w...


- **Convert all text to lowercase letters.**

In [3]:
df['txt'] = df['txt'].str.lower()

In [4]:
df.head(10)

Unnamed: 0,con,txt
0,0,well it's great that he did something about th...
1,0,you are right mr. president.
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
5,0,thanks for playing. [i feel like her now](http...
6,0,[deleted]
7,0,"i cant be racist, i have a black friend \n\nl..."
8,0,nope. you're right that they are both smoke an...
9,0,&lt;that's exactly what it means. especially w...


- **Remove all punctuation from the text.**

In [5]:
import re

# Replace newlines with spaces
df["txt"] = df['txt'].str.replace("\n", " ")

# Remove URLs: keep only the text before the first http:// or https:// link
df['txt'] = df['txt'].apply(lambda x: re.split(r'https?://.*', str(x))[0])

# Remove punctuation (the pattern is a regex, so regex=True must be explicit in pandas >= 2.0)
df["txt"] = df['txt'].str.replace(r'[^\w\s]', '', regex=True)

In [6]:
df.head(10)

Unnamed: 0,con,txt
0,0,well its great that he did something about tho...
1,0,you are right mr president
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
5,0,thanks for playing i feel like her now
6,0,deleted
7,0,i cant be racist i have a black friend lolo...
8,0,nope youre right that they are both smoke and ...
9,0,ltthats exactly what it means especially when ...
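
Two side effects of the cleaning step are visible here: the URL step keeps only the text before the first link, so anything written after a link is lost, and HTML entities such as `&lt;` leave residue behind (`ltthats` in row 9). A minimal sketch of a gentler cleaner follows. The function name `clean_comment` and the URL pattern are assumptions for illustration, not part of the exercise.

```python
import html
import re

# Assumed pattern: a URL runs from http:// or https:// to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def clean_comment(text):
    """Lowercase a comment, decode HTML entities, drop URLs and punctuation."""
    text = html.unescape(str(text)).lower()   # "&lt;" becomes "<" before punctuation removal
    text = text.replace("\n", " ")
    text = URL_PATTERN.sub(" ", text)         # remove only the link, keep the text after it
    text = re.sub(r"[^\w\s]", "", text)       # strip punctuation
    return " ".join(text.split())             # collapse repeated whitespace

sample = "&lt;That's it. [link](http://example.com/a) and more!"
print(clean_comment(sample))
```

Applied with `df['txt'].apply(clean_comment)`, this would replace the separate lowercase, newline, URL, and punctuation steps, at the cost of producing slightly different output from the tables shown in this notebook.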


- **Remove stop words.**

In [7]:
stop = set(stopwords.words('english')) 

In [8]:
df['txt'] = df['txt'].str.split().apply(lambda x: [item for item in x if item not in stop])

In [9]:
df.head(10)

Unnamed: 0,con,txt
0,0,"[well, great, something, beliefs, office, doub..."
1,0,"[right, mr, president]"
2,0,"[given, input, apart, saying, wrong, argument,..."
3,0,"[get, frustration, reason, want, way, foundati..."
4,0,"[far, expert, tpp, would, tend, agree, lot, pr..."
5,0,"[thanks, playing, feel, like]"
6,0,[deleted]
7,0,"[cant, racist, black, friend, lololol, vast, m..."
8,0,"[nope, youre, right, smoke, smoke, bad, lungs,..."
9,0,"[ltthats, exactly, means, especially, power, t..."
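
Tokens such as `cant` and `youre` survive the filter above. A likely cause is ordering: punctuation was stripped first, while the NLTK list stores contractions with their apostrophes, so the stripped forms no longer match. The sketch below shows one possible workaround, which is to normalize the stop list the same way as the text. The short `raw_stop` list is a hypothetical stand-in for `stopwords.words('english')`.

```python
import re

# Hypothetical miniature stop list; in the notebook this would be stopwords.words('english').
raw_stop = ["you're", "don't", "the", "is", "a"]

# Strip punctuation from the stop words so they match the already-cleaned text.
stop_normalized = {re.sub(r"[^\w\s]", "", word) for word in raw_stop}

def remove_stop_words(text, stop_words):
    """Split on whitespace and drop every token found in stop_words."""
    return [token for token in text.split() if token not in stop_words]

filtered = remove_stop_words("youre right the sky is blue", stop_normalized)
print(filtered)
```

The alternative is to remove stop words before removing punctuation, which needs a tokenizer that keeps contractions intact.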


- **Apply NLTK’s PorterStemmer.**

In [10]:
from nltk.stem import PorterStemmer

In [11]:
ps = PorterStemmer()

In [12]:
# Stem every token and rejoin each comment into one space-separated string.
# (The original loop used a variable named `string`, which shadowed the imported module.)

stemmed_data = [" ".join(ps.stem(word) for word in words) for words in df['txt']]

In [13]:
df['txt'] = stemmed_data

In [14]:
df.head(10)

Unnamed: 0,con,txt
0,0,well great someth belief offic doubt trump wou...
1,0,right mr presid
2,0,given input apart say wrong argument clearli
3,0,get frustrat reason want way foundat complex p...
4,0,far expert tpp would tend agre lot problem und...
5,0,thank play feel like
6,0,delet
7,0,cant racist black friend lololol vast major mi...
8,0,nope your right smoke smoke bad lung tobacco l...
9,0,ltthat exactli mean especi power take exist ri...


In [15]:
# Saving a backup of the cleaned data for the next exercises

df.to_csv("cleaned_data.csv")

### Analysis

**You will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.**

- **Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).**

In [16]:
df = pd.read_csv("cleaned_data.csv", index_col=0)
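
One detail of this round trip is worth checking: comments that became empty during cleaning are written as empty CSV fields and read back as missing values, which is presumably why the later cells call `.astype('U')` before vectorizing. The sketch below reproduces the effect on a hypothetical two-row frame, with an in-memory buffer standing in for `cleaned_data.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row frame; the second comment became empty after cleaning.
cleaned = pd.DataFrame({"con": [0, 0], "txt": ["right mr presid", ""]})

buffer = io.StringIO()   # stands in for cleaned_data.csv
cleaned.to_csv(buffer)
buffer.seek(0)

reloaded = pd.read_csv(buffer, index_col=0)
missing = reloaded["txt"].isna().tolist()   # the empty string comes back as a missing value
filled = reloaded["txt"].fillna("")         # restore empty strings before vectorizing

print(missing)
print(filled.tolist())
```

Filling with an empty string keeps missing comments from being converted to a literal `nan` token when the column is cast to strings.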

In [17]:
dftest = df.sample(frac = 0.1)

In [18]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

counter = CountVectorizer()

In [19]:
bag_of_words = counter.fit_transform(np.array(dftest['txt'].values.astype('U')))
arr = bag_of_words.toarray()

In [20]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(arr, columns=counter.get_feature_names_out())

Unnamed: 0,00,000,00000000000001,000000017,000007,00001,000019,000079,0001,00015,...,федерация,хорошо,это,яepublican,яs,яussia,الإسلامية,الدولة,ಠ_ರ,ຈلຈ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
94996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
94997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
94998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
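
To make clear what each row of this matrix holds, here is a pure-Python sketch of a word-count vector built on two hypothetical stemmed comments. It is a simplification rather than a reimplementation of `CountVectorizer`, whose default tokenizer also drops single-character tokens.

```python
from collections import Counter

# Hypothetical stemmed comments in the style of the cleaned data.
docs = ["right mr presid", "smoke smoke bad lung"]

# The vocabulary is the sorted set of all tokens; each token gets one column.
vocabulary = sorted({token for doc in docs for token in doc.split()})
index = {term: i for i, term in enumerate(vocabulary)}

def count_vector(doc):
    """Return the number of occurrences of each vocabulary term in doc."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(doc.split()).items():
        vector[index[term]] = count
    return vector

vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Almost every entry is zero because a comment uses only a few of the vocabulary terms, which is why `fit_transform` returns a sparse matrix. Calling `.toarray()` on the full data set may need a large amount of memory.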


- **Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).**

In [21]:
from nltk import pos_tag, word_tokenize

In [22]:
def tokenizer(arr):
    """Return one list of (token, POS tag) pairs per text entry in arr."""
    return [pos_tag(word_tokenize(str(text))) for text in arr]

In [23]:
dftest['tokenized'] = tokenizer(dftest['txt'])
dftest

Unnamed: 0,con,txt,tokenized
518518,0,delet,"[(delet, NN)]"
783840,0,dont forget senat wyom citizen punch like 3 ti...,"[(dont, NN), (forget, NN), (senat, NN), (wyom,..."
640884,0,remind subreddit civil discuss,"[(remind, VB), (subreddit, NN), (civil, JJ), (..."
867775,0,200 million didnt live swing state didnt matte...,"[(200, CD), (million, CD), (didnt, NN), (live,..."
798617,0,final hour shall identifi young,"[(final, JJ), (hour, NN), (shall, MD), (identi..."
...,...,...,...
860039,0,delet,"[(delet, NN)]"
359057,0,final open husband eye move franc tri get vase...,"[(final, JJ), (open, JJ), (husband, NN), (eye,..."
740414,0,much sit hand roll back protect minor creat lo...,"[(much, JJ), (sit, NN), (hand, NN), (roll, NN)..."
400357,0,ethic rule prevent anyon actual busi serv gove...,"[(ethic, JJ), (rule, NN), (prevent, NN), (anyo..."
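
The `tokenized` column holds (token, tag) pairs rather than numeric vectors. One way to finish the conversion is to count how often each tag occurs in a comment. The sketch below does this in plain Python on hypothetical tagged input in the same format. The Cookbook recipe uses scikit-learn's `MultiLabelBinarizer` instead, which records only the presence of a tag, not its count.

```python
from collections import Counter

# Hypothetical tagged comments in the same (token, tag) format as the column above.
tagged_docs = [
    [("delet", "NN")],
    [("final", "JJ"), ("hour", "NN"), ("shall", "MD")],
]

# One column per tag that occurs anywhere in the data.
tag_set = sorted({tag for doc in tagged_docs for _, tag in doc})

def tag_vector(doc):
    """Count the occurrences of each tag in one tagged comment."""
    counts = Counter(tag for _, tag in doc)
    return [counts[tag] for tag in tag_set]

tag_vectors = [tag_vector(doc) for doc in tagged_docs]
print(tag_set)
print(tag_vectors)
```

Because the text was stemmed before tagging, many stems (`delet`, `dont`) are no longer real words and default to `NN`, so tagging the unstemmed text would likely give more informative vectors.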


- **Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).**

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

dftest = df.sample(frac = 0.05)

tfidf = TfidfVectorizer()

In [25]:
feature_matrix = tfidf.fit_transform(dftest['txt'].values.astype('U'))

In [26]:
feature_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
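
The printed corners are all zero because each comment contains only a handful of the vocabulary terms. To show what the nonzero entries hold, the sketch below recomputes tf-idf by hand with the formula that `TfidfVectorizer` uses by default: a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The two documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stemmed comments.
docs = ["fbi investig", "fbi stori"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(token for doc in tokenized for token in set(doc))

# Smoothed inverse document frequency (scikit-learn's default, smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(tokens):
    """Weight raw term counts by idf, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
print(vocabulary)
print(vectors)
```

In the first vector, `investig` gets a higher weight than `fbi` because it appears in only one of the two documents.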

In [27]:
# vocabulary_ maps each term to its column index in the feature matrix (not to a count)
for i, (k, v) in enumerate(tfidf.vocabulary_.items()):
    print(k, v)
    if i > 25:
        break

washington 25764
post 18234
bastion 3151
bipartisan 3522
polit 18099
honest 11769
mistak 15361
id 12068
prefer 18415
fabric 8566
stori 22599
leav 13797
major 14553
detail 6834
lead 13766
blm 3647
riot 20138
rememb 19733
illeg 12142
look 14275
wikileak 26149
keep 13322
tell 23369
year 26581
long 14254
fbi 8785
investig 12777
