### 2.2 Exercise: Building Your Text Classifiers <br> DSC550 <br> 2020-11-29 <br> Michael Hotaling

In [1]:
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords

### Pre-processing

- **For this part, you will start by reading the controversial-comments.jsonl file into a DataFrame.**

In [2]:
# Loading the data into a DataFrame in Jupyter
# JSON Lines files need lines=True so that each line is parsed as its own record

df = pd.read_json("controversial-comments.jsonl",lines=True)
df.head(10)

Unnamed: 0,con,txt
0,0,Well it's great that he did something about th...
1,0,You are right Mr. President.
2,0,You have given no input apart from saying I am...
3,0,I get the frustration but the reason they want...
4,0,I am far from an expert on TPP and I would ten...
5,0,Thanks for playing. [I feel like her now](http...
6,0,[deleted]
7,0,"i cant be racist, i have a black friend \n\nl..."
8,0,Nope. You're right that they are both smoke an...
9,0,&lt;That's exactly what it means. especially w...
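
As a quick illustration of why `lines=True` matters, here is a minimal sketch that reads two JSON Lines records from memory. The records themselves are made up for the example and are not taken from the data file.

```python
import io
import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line
sample_jsonl = io.StringIO(
    '{"con": 0, "txt": "You are right Mr. President."}\n'
    '{"con": 1, "txt": "[deleted]"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object
sample_df = pd.read_json(sample_jsonl, lines=True)
print(sample_df)
```

Each line becomes one row, with the JSON keys as column names.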


- **Convert all text to lowercase letters.**

In [3]:
# The .str accessor applies lower() to every entry in the column,
# converting all characters to lowercase

df['txt'] = df['txt'].str.lower()

In [4]:
df.head(10)

Unnamed: 0,con,txt
0,0,well it's great that he did something about th...
1,0,you are right mr. president.
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
5,0,thanks for playing. [i feel like her now](http...
6,0,[deleted]
7,0,"i cant be racist, i have a black friend \n\nl..."
8,0,nope. you're right that they are both smoke an...
9,0,&lt;that's exactly what it means. especially w...


- **Remove all punctuation from the text.**

In [5]:
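import re

# Replacing \n (new lines) with spaces
df["txt"] = df["txt"].str.replace("\n", " ")

# Removing URLs
# Note: this keeps only the text before the first URL, so anything after a link is dropped too
df["txt"] = df["txt"].apply(lambda x: re.split(r"https?://.*", str(x))[0])

# Removing punctuation: drop every character that is not a word character or whitespace
# regex=True is needed in recent pandas versions, where str.replace treats the pattern literally by default
df["txt"] = df["txt"].str.replace(r"[^\w\s]", "", regex=True)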

In [6]:
df.head(10)

Unnamed: 0,con,txt
0,0,well its great that he did something about tho...
1,0,you are right mr president
2,0,you have given no input apart from saying i am...
3,0,i get the frustration but the reason they want...
4,0,i am far from an expert on tpp and i would ten...
5,0,thanks for playing i feel like her now
6,0,deleted
7,0,i cant be racist i have a black friend lolo...
8,0,nope youre right that they are both smoke and ...
9,0,ltthats exactly what it means especially when ...
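
To see what the two patterns do, here is a small sketch on a hand-made Series. The comments and the `example.com` URL are hypothetical, and the URL step uses `str.replace` instead of `re.split`, which should give the same result for single-line text.

```python
import pandas as pd

# Hypothetical comments, already lowercased
demo = pd.Series([
    "thanks for playing. [i feel like her now](http://example.com/a)",
    "you're right, mr. president!",
])

# Drop everything from the first URL to the end of the string
demo = demo.str.replace(r"https?://.*", "", regex=True)

# Drop every character that is neither a word character nor whitespace
demo = demo.str.replace(r"[^\w\s]", "", regex=True)

print(demo.tolist())
```

Apostrophes are removed as well, which is why contractions such as "you're" turn into "youre" in the output above.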


- **Remove stop words.**

In [7]:
# Creating the list of stop words

stop = set(stopwords.words('english')) 

In [8]:
# Creating new lines based on the old lines, while removing all words in the stop list
# Each post must be transformed into a list of words split by spaces
# Once that is done, we can apply our lambda function to recreate the list without stop words
# This will return a list of words for each row

df['txt'] = df['txt'].str.split().apply(lambda x: [item for item in x if item not in stop])

In [9]:
df.head(10)

Unnamed: 0,con,txt
0,0,"[well, great, something, beliefs, office, doub..."
1,0,"[right, mr, president]"
2,0,"[given, input, apart, saying, wrong, argument,..."
3,0,"[get, frustration, reason, want, way, foundati..."
4,0,"[far, expert, tpp, would, tend, agree, lot, pr..."
5,0,"[thanks, playing, feel, like]"
6,0,[deleted]
7,0,"[cant, racist, black, friend, lololol, vast, m..."
8,0,"[nope, youre, right, smoke, smoke, bad, lungs,..."
9,0,"[ltthats, exactly, means, especially, power, t..."
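
The same split-and-filter pattern can be tried on a toy example. This sketch assumes a tiny hand-written stop list (`demo_stop`) so that it runs without downloading the NLTK corpus; the notebook itself uses NLTK's full English list.

```python
import pandas as pd

# A tiny hypothetical stop list; the notebook uses NLTK's full English list
demo_stop = {"you", "are", "i", "have", "a", "the"}

demo = pd.Series(["you are right mr president", "i have a black friend"])

# Split each entry on whitespace, then keep only the words not in the stop list
tokens = demo.str.split().apply(lambda words: [w for w in words if w not in demo_stop])
print(tokens.tolist())
```

Using a set for the stop list keeps each membership check fast, which matters when the column has many rows.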


- **Apply NLTK’s PorterStemmer.**

In [10]:
from nltk.stem import PorterStemmer

In [11]:
# We can define our stemming function here
# The PorterStemmer from nltk is created once and reused for every row
# Each word in the token list is stemmed, and the stems are joined
# with single spaces to rebuild the text as one string

ps = PorterStemmer()

def stemmer(words):
    return " ".join(ps.stem(word) for word in words)

In [12]:
# We can then apply the stemmer function to our series

df['txt'] = df['txt'].apply(stemmer)

In [13]:
df.head(10)

Unnamed: 0,con,txt
0,0,well great someth belief offic doubt trump wou...
1,0,right mr presid
2,0,given input apart say wrong argument clearli
3,0,get frustrat reason want way foundat complex p...
4,0,far expert tpp would tend agre lot problem und...
5,0,thank play feel like
6,0,delet
7,0,cant racist black friend lololol vast major mi...
8,0,nope your right smoke smoke bad lung tobacco l...
9,0,ltthat exactli mean especi power take exist ri...
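
For a single row, the stemming step can be sketched as follows. The token list is the one shown for row 1 in the stop-word output, and the only assumption is that NLTK is installed (the Porter stemmer needs no extra corpus download).

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Tokens for row 1 after stop-word removal
tokens = ["right", "mr", "president"]

# Stem each token and rebuild a space-separated string
stemmed = " ".join(ps.stem(t) for t in tokens)
print(stemmed)
```

Stems such as "presid" are not dictionary words; the goal is only to map related word forms to the same token.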


In [14]:
# Saving a backup of the cleaned data for the next exercises

df.to_csv("cleaned_data.csv")

### Analysis

**You will apply three different techniques to get it into a usable form for model-building. Apply each of the following steps (individually) to the pre-processed data.**

- **Convert each text entry into a word-count vector (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook).**

In [15]:
df = pd.read_csv("cleaned_data.csv", index_col=0)

In [16]:
# The full dataframe is too large to process here, so we take a 5% random sample

dftest = df.sample(frac = 0.05)

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix 

# Instantiating the CountVectorizer class from scikit-learn

counter = CountVectorizer()

In [18]:
# Pass the series to the CountVectorizer
# Each entry is cast to a Unicode string so the vectorizer receives only text
# (empty entries are read back from the CSV as NaN)

bag_of_words = counter.fit_transform(np.array(dftest['txt'].values.astype('U')))

# Convert the bag_of_words to an array to generate the dataframe
arr = bag_of_words.toarray()

In [19]:
# Create the dataframe using the feature names as the column names
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)

pd.DataFrame(arr, columns=counter.get_feature_names_out())

Unnamed: 0,00,000,000000000000,000000000000001,00000000000001,00000005,0000000566,000000143,000000554,000005,...,русский,сделали,спасибо,это,яepublican,яnew,яs,ترومپ,مرسی,ಠ_ಠ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
47498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# bag_of_words is already a SciPy sparse matrix, so it can be printed directly
# Each line shows (row, column) followed by the count stored at that position

print(bag_of_words)

  (0, 1535)	1
  (0, 6411)	1
  (0, 7350)	1
  (0, 9564)	1
  (0, 14992)	1
  (0, 20266)	1
  (0, 23776)	1
  (0, 26811)	1
  (1, 3624)	1
  (1, 6151)	1
  (1, 7071)	1
  (1, 15292)	1
  (1, 19160)	1
  (1, 19177)	1
  (1, 20905)	1
  (1, 21155)	1
  (1, 21554)	1
  (1, 22939)	1
  (1, 24363)	1
  (1, 26337)	1
  (2, 3333)	1
  (2, 7422)	1
  (2, 12236)	1
  (2, 14139)	1
  (2, 14303)	1
  :	:
  (47497, 19570)	1
  (47497, 19594)	1
  (47497, 20989)	1
  (47497, 21165)	1
  (47497, 23533)	1
  (47497, 23639)	1
  (47498, 3154)	1
  (47498, 3399)	1
  (47498, 11647)	1
  (47498, 14264)	1
  (47499, 1795)	1
  (47499, 1994)	1
  (47499, 3300)	1
  (47499, 6233)	1
  (47499, 6291)	2
  (47499, 8278)	1
  (47499, 10377)	1
  (47499, 14053)	1
  (47499, 14303)	1
  (47499, 14323)	2
  (47499, 14992)	1
  (47499, 22622)	1
  (47499, 23462)	1
  (47499, 23934)	2
  (47499, 25853)	2


- **Convert each text entry into a part-of-speech tag vector (see section 6.7 in the Machine Learning with Python Cookbook).**

In [21]:
from nltk import pos_tag, word_tokenize

In [22]:
# Create a function to generate part-of-speech tags for each line
# The lines need to be split into a word list using split() or word_tokenize
# Then we can pass the list of words into the part-of-speech function

def part_of_speech_analyzer(arr):
    tokens = []
    for i in arr:
        tokens.append(pos_tag(word_tokenize(str(i))))
    return tokens

In [23]:
# Applying the function to our series

dftest['part-of-speech'] = part_of_speech_analyzer(dftest['txt'])
dftest

Unnamed: 0,con,txt,part-of-speech
797983,0,access doesnt mean damn thing your fuck right,"[(access, NN), (doesnt, NN), (mean, VBP), (dam..."
868827,0,putin push blatantli say crimea direct militar...,"[(putin, NN), (push, NN), (blatantli, NNS), (s..."
42551,0,lol suppos sure seem like dont believ illeg se...,"[(lol, JJ), (suppos, JJ), (sure, JJ), (seem, V..."
789237,0,protectionist polici domest industri plu immig...,"[(protectionist, NN), (polici, NN), (domest, J..."
715901,0,hell liter content said said oh fox news never...,"[(hell, NN), (liter, RBR), (content, NN), (sai..."
...,...,...,...
468118,0,god miser veng petti peopl,"[(god, NNS), (miser, RBR), (veng, RB), (petti,..."
566114,0,gtwe never ever ever get back togeth gtyou go ...,"[(gtwe, NN), (never, RB), (ever, RB), (ever, R..."
371861,0,realli dont care administr polit parti reason ...,"[(realli, NN), (dont, NN), (care, NN), (admini..."
556686,0,hillari lobbi berni base,"[(hillari, NN), (lobbi, NN), (berni, NN), (bas..."


- **Convert each entry into a term frequency-inverse document frequency (tfidf) vector (see section 6.9 in the Machine Learning with Python Cookbook).**

In [24]:
# Importing the library
from sklearn.feature_extraction.text import TfidfVectorizer

# Taking a sample from the dataframe
dftest = df.sample(frac = 0.01)

# Checking the dataframe
dftest

Unnamed: 0,con,txt
889327,0,delet
145131,0,vast vast major far left peopl dont never use ...
9412,0,american vote trump im afraid proud read somet...
744165,0,3 day rule dont believ anyth news 3 day pass
890017,0,poor guy want throw elect get sell book run te...
...,...,...
476922,0,low ga price increas mpg standard
420131,0,weak nottru answer usual refer lie excus lie
560564,0,one say bad person judg order take care think ...
302528,0,well least someth good come trump presid


In [25]:
# Instantiating the TfidfVectorizer class from scikit-learn
tfidf = TfidfVectorizer()

# Passing the series to the vectorizer as Unicode strings
feature_matrix = tfidf.fit_transform(dftest['txt'].values.astype('U'))

In [26]:
# Loading the TF-IDF data into a dataframe with the feature names as column names and
# matching the index values to the associated entries
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)

pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out()).set_index(dftest.index)

Unnamed: 0,00,00001,0005,001,007,00983,01,01508,016run,02,...,zip,zodiac,zoloft,zombi,zone,zuckerberg,zullo,zuriel45,ಠ_ಠ,눈_눈
889327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
145131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
744165,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
890017,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
476922,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
420131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
560564,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
302528,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
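
A toy corpus helps show how the weights behave. With the default settings, `TfidfVectorizer` scales each row to unit L2 norm, so a document with a single term gets a weight of exactly 1.0 for that term. The sketch below checks this on three stemmed entries taken from the sample shown earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three stemmed entries from the sample shown earlier
docs = [
    "delet",
    "low ga price increas mpg standard",
    "well least someth good come trump presid",
]

tfidf_demo = TfidfVectorizer()
weights = tfidf_demo.fit_transform(docs)

# With the default norm="l2", every row has Euclidean length 1
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(row_norms)

# The one-word document carries all of its weight on that single term
print(weights[0].toarray().max())
```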


In [27]:
# feature_matrix is already a SciPy sparse matrix, so it can be printed directly
# Each line shows (row, column) followed by the TF-IDF weight stored at that position

print(feature_matrix)

  (0, 2971)	1.0
  (1, 7410)	0.3047124269826051
  (1, 5195)	0.1931497581513458
  (1, 3137)	0.13790987237666952
  (1, 169)	0.3047124269826051
  (1, 1281)	0.3047124269826051
  (1, 6386)	0.16266364761040605
  (1, 2318)	0.18363871141869217
  (1, 11875)	0.18253493990917322
  (1, 10774)	0.37695264571658954
  (1, 11561)	0.2843132879470299
  (1, 7325)	0.14507030633298978
  (1, 3350)	0.10885128462708425
  (1, 8032)	0.1018889866798203
  (1, 6312)	0.16772348012256974
  (1, 4009)	0.1624691444094898
  (1, 6641)	0.1686498802238612
  (1, 11618)	0.47164661617031794
  (2, 8401)	0.13343527703205804
  (2, 11891)	0.10848774532052306
  (2, 8936)	0.12888024698820438
  (2, 2867)	0.09464407411170876
  (2, 8291)	0.10577017980943523
  (2, 8240)	0.09170236898767316
  (2, 4221)	0.14314849623132325
  :	:
  (9497, 4163)	0.218376223733438
  (9497, 7650)	0.16379852656067267
  (9497, 9395)	0.19645459071433588
  (9497, 1198)	0.22724796029421382
  (9497, 6644)	0.17138924803910613
  (9497, 10865)	0.16209575041629062
  (94