### 2.2 Exercise: Building Your Text Classifiers <br> DSC550 <br> 2020-11-29 <br> Michael Hotaling

In [1]:
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords

### Pre-processing

- **For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame.**

In [2]:
df = pd.read_json("controversial-comments.jsonl",lines=True)
df.head(10)

Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...
5,0,Thanks for playing. [I feel like her now](http...
6,0,[deleted]
7,0,"i cant be racist, i have a black friend \n\nl..."
8,0,Nope. You're right that they are both smoke an...
9,0,&lt;That's exactly what it means. especially w...


- **Convert all text to lowercase letters.**

In [3]:
df['txt'] = df['txt'].str.lower()

In [4]:
df.head(10)

Unnamed: 0,con,txt
0,0,well it's great that he did something about th...
1,0,you are right mr. president.
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
5,0,thanks for playing. [i feel like her now](http...
6,0,[deleted]
7,0,"i cant be racist, i have a black friend \n\nl..."
8,0,nope. you're right that they are both smoke an...
9,0,&lt;that's exactly what it means. especially w...


- **Remove all punctuation from the text.**

In [5]:
# Replace newlines with spaces
df["txt"] = df["txt"].str.replace("\n", " ", regex=False)

# Remove URLs (only the URL itself, not the text that follows it)
df["txt"] = df["txt"].str.replace(r"https?://\S+", "", regex=True)

# Remove punctuation: keep only word characters and whitespace.
# regex=True is required because recent pandas versions default to literal matching.
df["txt"] = df["txt"].str.replace(r"[^\w\s]", "", regex=True)

In [6]:
df.head(10)

Unnamed: 0,con,txt
0,0,well its great that he did something about tho...
1,0,you are right mr president
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
5,0,thanks for playing i feel like her now
6,0,deleted
7,0,i cant be racist i have a black friend lolo...
8,0,nope youre right that they are both smoke and ...
9,0,ltthats exactly what it means especially when ...
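
The cleaning steps can be checked on a single string before running them over the whole DataFrame. The following is a minimal sketch using only the standard library. The helper name `clean_comment`, the sample text, and the URL pattern (which removes only the URL token) are assumptions made for illustration, not part of the assignment.

```python
import re

def clean_comment(text):
    """Lowercase, flatten newlines, drop URLs, then strip punctuation."""
    text = text.lower().replace("\n", " ")
    # Assumed pattern: remove the URL up to the next whitespace character.
    text = re.sub(r"https?://\S+", "", text)
    # Keep word characters and whitespace; everything else counts as punctuation.
    text = re.sub(r"[^\w\s]", "", text)
    return text

# Hypothetical sample comment with a markdown link and a newline.
sample = "Thanks for playing. [I feel like her now](http://example.com/x)\nOK!"
cleaned = clean_comment(sample)
print(cleaned)
```

Running the URL step before the punctuation step matters: once `:` and `/` are stripped, the URL can no longer be recognised by the pattern.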


- **Remove stop words.**

In [7]:
stop = set(stopwords.words('english')) 

In [8]:
df['txt'] = df['txt'].str.split().apply(lambda x: [item for item in x if item not in stop])

In [9]:
df.head(10)

Unnamed: 0,con,txt
0,0,"[well, great, something, beliefs, office, doub..."
1,0,"[right, mr, president]"
2,0,"[given, input, apart, saying, wrong, argument,..."
3,0,"[get, frustration, reason, want, way, foundati..."
4,0,"[far, expert, tpp, would, tend, agree, lot, pr..."
5,0,"[thanks, playing, feel, like]"
6,0,[deleted]
7,0,"[cant, racist, black, friend, lololol, vast, m..."
8,0,"[nope, youre, right, smoke, smoke, bad, lungs,..."
9,0,"[ltthats, exactly, means, especially, power, t..."
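
Row 8 above still contains `youre`. A likely cause is that punctuation was stripped before the stop words were removed, so contractions no longer match stop-list entries that contain an apostrophe. The sketch below illustrates the effect with a tiny hand-written stop list; the list and the helper name `remove_stop_words` are assumptions for illustration and are much smaller than NLTK's English list.

```python
# Tiny illustrative stop list (an assumption, not NLTK's full list).
stop_words = {"you", "are", "that", "they", "both", "you're"}

def remove_stop_words(text, stop_words):
    """Split on whitespace and keep tokens that are not in the stop list."""
    return [tok for tok in text.split() if tok not in stop_words]

# Apostrophe still present: the contraction matches the stop list.
before_strip = remove_stop_words("nope you're right that they are both smoke", stop_words)
# Apostrophe already stripped: "youre" is not in the stop list and survives.
after_strip = remove_stop_words("nope youre right that they are both smoke", stop_words)

print(before_strip)
print(after_strip)
```

If contractions should be dropped, one option is to filter stop words before stripping punctuation; another is to strip punctuation from the stop list as well.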


- **Apply NLTK’s PorterStemmer.**

In [10]:
from nltk.stem import PorterStemmer

In [11]:
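# Create the stemmer once rather than on every call
ps = PorterStemmer()

def stemmer(tokens):
    """Stem each token and join the stems into one space-separated string."""
    # Using join avoids shadowing the imported `string` module and a trailing space
    return " ".join(ps.stem(token) for token in tokens)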

In [12]:
df['txt'] = df['txt'].apply(stemmer)

In [13]:
df.head(10)

Unnamed: 0,con,txt
0,0,well great someth belief offic doubt trump wou...
1,0,right mr presid
2,0,given input apart say wrong argument clearli
3,0,get frustrat reason want way foundat complex p...
4,0,far expert tpp would tend agre lot problem und...
5,0,thank play feel like
6,0,delet
7,0,cant racist black friend lololol vast major mi...
8,0,nope your right smoke smoke bad lung tobacco l...
9,0,ltthat exactli mean especi power take exist ri...


In [14]:
# Saving a backup of the cleaned data for the next exercises

df.to_csv("cleaned_data.csv")

### Analysis

**You will apply three different techniques to get the pre-processed data into a usable form for model-building. Apply each of the following steps to it individually.**

- **Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).**

In [15]:
df = pd.read_csv("cleaned_data.csv", index_col=0)

In [16]:
dftest = df.sample(frac = 0.005)

In [17]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix 

counter = CountVectorizer()

In [18]:
bag_of_words = counter.fit_transform(np.array(dftest['txt'].values.astype('U')))
arr = bag_of_words.toarray()

In [19]:
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
pd.DataFrame(arr, columns=counter.get_feature_names_out())

Unnamed: 0,00209,00313,010,01673,03394,05633,06,07,08,09357,...,zone,zuckerberg,zweifach,zzzzzz,ˈdeməˌɡäɡ,ˈterəˌrizəm,феликс,эдмундович,яepublican,ຈلຈ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4745,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4746,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4747,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4748,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# fit_transform already returns a sparse matrix, so it can be printed directly
print(bag_of_words)

  (0, 701)	1
  (0, 1265)	1
  (0, 2185)	1
  (0, 2622)	1
  (0, 2667)	1
  (0, 3702)	1
  (0, 4126)	2
  (0, 4507)	1
  (0, 4536)	1
  (0, 4663)	1
  (0, 5213)	1
  (0, 5222)	1
  (0, 5335)	1
  (0, 5943)	1
  (0, 6092)	1
  (0, 6385)	1
  (0, 6397)	1
  (0, 6922)	1
  (0, 7337)	1
  (0, 7648)	1
  (0, 7667)	1
  (0, 7683)	1
  (0, 7734)	1
  (0, 7740)	1
  (0, 7848)	1
  :	:
  (4747, 5588)	1
  (4747, 5939)	1
  (4747, 7877)	1
  (4748, 767)	1
  (4748, 2608)	1
  (4748, 3236)	1
  (4748, 4198)	1
  (4748, 4297)	1
  (4748, 6328)	1
  (4748, 7863)	1
  (4749, 424)	1
  (4749, 2178)	2
  (4749, 3193)	1
  (4749, 4512)	1
  (4749, 4640)	1
  (4749, 4662)	1
  (4749, 6240)	1
  (4749, 6920)	1
  (4749, 6956)	1
  (4749, 7676)	1
  (4749, 7877)	3
  (4749, 8276)	1
  (4749, 8316)	2
  (4749, 8471)	2
  (4749, 8527)	1
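
To make the output above easier to read, here is a minimal standard-library sketch of what a word-count vector contains. It is not how `CountVectorizer` is implemented; the two sample documents and all variable names are assumptions chosen for illustration. One difference worth noting is that `CountVectorizer` ignores single-character tokens by default, which this sketch does not.

```python
from collections import Counter

# Two short, already pre-processed documents (illustrative samples).
docs = ["nope your right smoke smoke bad lung", "right mr presid"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Dense representation: one row per document, one column per vocabulary term.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocab])

# Sparse representation: only the non-zero (row, column, count) entries,
# which mirrors the "(row, column)  count" lines printed above.
sparse = [(r, c, v) for r, row in enumerate(rows) for c, v in enumerate(row) if v]

print(vocab)
print(rows)
print(sparse)
```

Because most entries are zero, the sparse form is usually much smaller than the dense array, which is why calling `toarray()` is only practical on a small sample.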


- **Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).**

In [21]:
from nltk import pos_tag, word_tokenize

In [22]:
def tokenizer(arr):
    # Tokenize each entry and tag every token with its part of speech
    return [pos_tag(word_tokenize(str(text))) for text in arr]

In [23]:
dftest['tokenized'] = tokenizer(dftest['txt'])
dftest

Unnamed: 0,con,txt,tokenized
841082,0,yeah nois make call punditri unbeliev equivoc ...,"[(yeah, NN), (nois, NNS), (make, VBP), (call, ..."
890021,0,even mobi know,"[(even, RB), (mobi, NNS), (know, VBP)]"
514860,0,game show refer pyramid believ,"[(game, NN), (show, NN), (refer, VBP), (pyrami..."
712605,1,say thug black,"[(say, VB), (thug, JJ), (black, JJ)]"
12439,0,delet,"[(delet, NN)]"
...,...,...,...
885134,0,remind subreddit civil discuss,"[(remind, VB), (subreddit, NN), (civil, JJ), (..."
576759,0,famili alpha threaten victim intimid lawsuit,"[(famili, NN), (alpha, NN), (threaten, VB), (v..."
760572,0,didnt misinform campaign interfer outsid parti...,"[(didnt, NN), (misinform, NN), (campaign, NN),..."
601630,0,troll kind job go away environment regul,"[(troll, NN), (kind, NN), (job, NN), (go, VB),..."
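
The `tokenized` column holds lists of `(token, tag)` pairs rather than numeric vectors. One possible way to turn them into fixed-length vectors is to count how often each tag occurs per entry. The sketch below uses two tagged rows copied from the output above; counting tags is an assumed encoding chosen for illustration, and other encodings (such as one-hot indicators per tag) are equally valid.

```python
from collections import Counter

# Tagged entries copied from two rows of the output above.
tagged_docs = [
    [("even", "RB"), ("mobi", "NNS"), ("know", "VBP")],
    [("say", "VB"), ("thug", "JJ"), ("black", "JJ")],
]

# One column per distinct tag, sorted so the column order is stable.
tag_columns = sorted({tag for doc in tagged_docs for _, tag in doc})

# Count the tags in each entry and lay the counts out in column order.
tag_vectors = []
for doc in tagged_docs:
    counts = Counter(tag for _, tag in doc)
    tag_vectors.append([counts[tag] for tag in tag_columns])

print(tag_columns)
print(tag_vectors)
```

Note that the tags here were assigned to stemmed tokens, so some of them (for example `thug` tagged as `JJ`) may not match what a tagger would produce on the original sentences.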


- **Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).**

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

dftest = df.sample(frac = 0.05)

tfidf = TfidfVectorizer()

In [25]:
feature_matrix = tfidf.fit_transform(dftest['txt'].values.astype('U'))

In [26]:
feature_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [27]:
for i, (k, v) in enumerate(tfidf.vocabulary_.items()):
    print(k, v)
    if i > 25:
        break

there 23579
noth 16596
stop 22675
happen 11312
that 23499
point 18176
realli 19480
nepot 16145
sinc 21666
hillari 11711
isnt 13004
appoint 2460
offic 16878
rel 19790
small 21884
number 16688
famili 8730
domin 7394
american 2046
polit 18211
would 26403
akin 1836
rise 20233
hereditari 11603
aristocraci 2555
pay 17643
trump 24229
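
The integers printed from `tfidf.vocabulary_` are column indices into the feature matrix, not scores. To show what the scores themselves represent, here is a minimal standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings, as I understand them: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The sample documents and all names are assumptions for illustration.

```python
import math
from collections import Counter

# Two short, already pre-processed documents (illustrative samples).
docs = ["right mr presid", "right smoke smoke"]
n_docs = len(docs)

vocab = sorted({tok for doc in docs for tok in doc.split()})

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Weight raw term counts by idf, then scale the row to unit L2 length."""
    counts = Counter(doc.split())
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in docs]

print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In this sketch `right` appears in both documents, so it receives the lowest idf and a smaller weight than terms that occur in only one document.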
