# Implementation 

## Initializing the data 
Before any data mining can be performed, the data set must be initialized. Because our data consists of tweets, that is, raw text, we have to preprocess the text before we can classify it.

### Read in data set
The data set we are using, a collection of roughly 5,000 tweets, is read in from a CSV file.

In [53]:
import pandas as pd

df = pd.read_csv('data/Political-media-DFE.csv')
df.head()

| | id | golden | unitstate | trustedjudgments | lastjudgmentat | audience | audience:confidence | bias | bias:confidence | message | ... | origgolden | audiencegold | biasgold | bioid | embed | id.1 | label | messagegold | source | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 766192484 | False | finalized | 1 | 8/4/2015 21:17 | national | 1.0 | partisan | 1.0 | policy | ... | | | | R000596 | `<blockquote class="twitter-tweet" width="450">...` | 3.83e+17 | From: Trey Radel (Representative from Florida) | | twitter | RT @nowthisnews: Rep. Trey Radel (R- #FL) slam... |
| 1 | 766192485 | False | finalized | 1 | 8/4/2015 21:20 | national | 1.0 | partisan | 1.0 | attack | ... | | | | M000355 | `<blockquote class="twitter-tweet" width="450">...` | 3.11e+17 | From: Mitch McConnell (Senator from Kentucky) | | twitter | VIDEO - #Obamacare: Full of Higher Costs and ... |
| 2 | 766192486 | False | finalized | 1 | 8/4/2015 21:14 | national | 1.0 | neutral | 1.0 | support | ... | | | | S001180 | `<blockquote class="twitter-tweet" width="450">...` | 3.39e+17 | From: Kurt Schrader (Representative from Oregon) | | twitter | Please join me today in remembering our fallen... |
| 3 | 766192487 | False | finalized | 1 | 8/4/2015 21:08 | national | 1.0 | neutral | 1.0 | policy | ... | | | | C000880 | `<blockquote class="twitter-tweet" width="450">...` | 2.99e+17 | From: Michael Crapo (Senator from Idaho) | | twitter | RT @SenatorLeahy: 1st step toward Senate debat... |
| 4 | 766192488 | False | finalized | 1 | 8/4/2015 21:26 | national | 1.0 | partisan | 1.0 | policy | ... | | | | U000038 | `<blockquote class="twitter-tweet" width="450">...` | 4.08e+17 | From: Mark Udall (Senator from Colorado) | | twitter | .@amazon delivery #drones show need to update ... |


In [54]:
# Drop the columns that are not needed for classification
cols = [
    'id', 'golden', 'unitstate', 'lastjudgmentat', 'label',
    'audience:confidence', 'bias:confidence', 'origgolden',
    'audiencegold', 'biasgold', 'bioid', 'embed', 'id.1', 'message',
    'message:confidence', 'trustedjudgments', 'audience', 'messagegold',
    'source',
]
df.drop(columns=cols, inplace=True)
df.head()

| | bias | text |
|---|---|---|
| 0 | partisan | RT @nowthisnews: Rep. Trey Radel (R- #FL) slam... |
| 1 | partisan | VIDEO - #Obamacare: Full of Higher Costs and ... |
| 2 | neutral | Please join me today in remembering our fallen... |
| 3 | neutral | RT @SenatorLeahy: 1st step toward Senate debat... |
| 4 | partisan | .@amazon delivery #drones show need to update ... |


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 2 columns):
bias    4999 non-null object
text    4999 non-null object
dtypes: object(2)
memory usage: 78.2+ KB


In [56]:
# Store the target variable
y = df['bias']

y.value_counts()

neutral     3688
partisan    1311
Name: bias, dtype: int64

### Encode Label

In [57]:
from sklearn.preprocessing import LabelEncoder

# Encode the class labels as numbers
le = LabelEncoder()
y_enc = le.fit_transform(y)
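
To make the encoding concrete, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` produces: it sorts the distinct class names and assigns each one its index, so `neutral` becomes 0 and `partisan` becomes 1. The `labels` list is a made-up sample, not the real data.

```python
# Hypothetical sample of labels; the real values live in df['bias'].
labels = ['partisan', 'partisan', 'neutral', 'neutral', 'partisan']

# LabelEncoder sorts the distinct classes and numbers them from 0.
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Encode, then decode again to show the mapping is reversible.
encoded = [class_to_index[name] for name in labels]
decoded = [classes[index] for index in encoded]

print(classes)  # ['neutral', 'partisan']
print(encoded)  # [1, 1, 0, 0, 1]
```

On the fitted encoder itself, `le.classes_` holds the same sorted class list and `le.inverse_transform` maps the numbers back to labels.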

In [58]:
raw_text = df['text']

## Pre-processing text 

In [59]:
# Imports for text preprocessing
import re
import nltk
from nltk.corpus import stopwords
import string

### NLTK 
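NLTK (the Natural Language Toolkit) is a Python library for working with human-language text. In this notebook we use two of its components: the built-in English stop-word list and the Porter stemmer.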

### Noise removal
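To convert the text into a format that is easier to mine, we strip anything that adds noise rather than meaning: @ mentions, hashtags, URLs, numbers, emoticons, and surplus whitespace. (The "From:" label was already dropped with the `label` column, and the retweet marker "RT" is handled with the stop words.)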


#### Begin with raw text

In [60]:
raw_text.head()

0    RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1    VIDEO - #Obamacare:  Full of Higher Costs and ...
2    Please join me today in remembering our fallen...
3    RT @SenatorLeahy: 1st step toward Senate debat...
4    .@amazon delivery #drones show need to update ...
Name: text, dtype: object

#### Remove the @ mentions

In [61]:
# regex=True is explicit because newer pandas versions default to literal matching
processed = raw_text.str.replace(r'(?:@[\w_]+)', '', regex=True)
processed.head()

0    RT : Rep. Trey Radel (R- #FL) slams #Obamacare...
1    VIDEO - #Obamacare:  Full of Higher Costs and ...
2    Please join me today in remembering our fallen...
3    RT : 1st step toward Senate debate on Leahy-Cr...
4    . delivery #drones show need to update law to ...
Name: text, dtype: object
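
As a standalone illustration of what this pattern does, the sketch below applies the same idea with Python's `re` module to two made-up strings (they are assumptions, not rows from the data set). Since `\w` already includes the underscore, `@\w+` matches the same handles as `@[\w_]+`.

```python
import re

# Same idea as the notebook's pattern: an '@' followed by word characters.
MENTION_RE = re.compile(r'@\w+')

# Hypothetical strings, used only for illustration.
tweet = "RT @nowthisnews: thanks @SenatorLeahy for the update"
contact = "write to info@example.com"

without_mentions = MENTION_RE.sub('', tweet)
without_address = MENTION_RE.sub('', contact)

print(repr(without_mentions))  # 'RT : thanks  for the update'
print(repr(without_address))   # 'write to info.com'
```

The leftover colon and doubled spaces are dealt with by later steps. The second string shows that the pattern also clips e-mail addresses, which may or may not matter for a given corpus.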

#### Remove hashtags (#)

In [62]:
processed = processed.str.replace(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", '', regex=True)
processed.head()

0    RT : Rep. Trey Radel (R- ) slams .  https://t....
1    VIDEO - :  Full of Higher Costs and Broken Pro...
2    Please join me today in remembering our fallen...
3    RT : 1st step toward Senate debate on Leahy-Cr...
4    . delivery  show need to update law to promote...
Name: text, dtype: object
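
One detail of this pattern is easy to miss: it requires at least two word characters after the `#`, so a one-character tag is left in place. A minimal sketch with `re` and made-up strings:

```python
import re

# The notebook's hashtag pattern: '#', then at least two word characters,
# optionally with apostrophes or hyphens in between.
HASHTAG_RE = re.compile(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)")

# Hypothetical examples, used only for illustration.
samples = ["slams #Obamacare today", "(R- #FL)", "plan #b"]
cleaned = [HASHTAG_RE.sub('', text) for text in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
# 'slams #Obamacare today' -> 'slams  today'
# '(R- #FL)' -> '(R- )'
# 'plan #b' -> 'plan #b'
```

Because the whole tag is deleted, the topic word is lost as well. Stripping only the `#` symbol would keep the word as a feature; that would be a different design choice from the one made here.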

#### Remove URLs 

In [63]:
processed = processed.str.replace(r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', '', regex=True)
processed.head()

0                 RT : Rep. Trey Radel (R- ) slams .  
1    VIDEO - :  Full of Higher Costs and Broken Pro...
2    Please join me today in remembering our fallen...
3    RT : 1st step toward Senate debate on Leahy-Cr...
4    . delivery  show need to update law to promote...
Name: text, dtype: object
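
The sketch below runs the same URL pattern with `re` on made-up strings (both links are hypothetical). The character range `$-_` spans ASCII 36 to 95, which covers `/`, digits, and upper-case letters, so mixed-case short links are matched even though the letter class is lower-case only. The `#` character (ASCII 35) falls just outside that range, so a URL fragment is left behind.

```python
import re

URL_RE = re.compile(
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+'
)

# Hypothetical strings; the links are made up.
tweet = "Read more https://t.co/AbC123 today"
with_fragment = "http://example.com/page#top"

tweet_clean = URL_RE.sub('', tweet)
fragment_clean = URL_RE.sub('', with_fragment)

print(repr(tweet_clean))     # 'Read more  today'
print(repr(fragment_clean))  # '#top'
```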

#### Remove numbers

In [64]:
processed = processed.str.replace(r'(?:(?:\d+,?)+(?:\.?\d+)?)', '', regex=True)
processed.head()

0                 RT : Rep. Trey Radel (R- ) slams .  
1    VIDEO - :  Full of Higher Costs and Broken Pro...
2    Please join me today in remembering our fallen...
3    RT : st step toward Senate debate on Leahy-Cra...
4    . delivery  show need to update law to promote...
Name: text, dtype: object
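
This pattern removes digit runs wherever they occur, including inside tokens, which is why "1st" becomes "st" in the output above. A small sketch with `re` and made-up strings:

```python
import re

NUMBER_RE = re.compile(r'(?:(?:\d+,?)+(?:\.?\d+)?)')

# Hypothetical examples, used only for illustration.
samples = ["1st step toward debate", "costs rose 2,500.75 dollars", "H.R. 3590"]
cleaned = [NUMBER_RE.sub('', text) for text in samples]

print(cleaned)
# ['st step toward debate', 'costs rose  dollars', 'H.R. ']
```

Whether ordinals such as "1st" should survive is a modelling choice; with this pattern the leftover fragment "st" simply stays in the text.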

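#### Remove emoticons

In [65]:
# The pattern is written in verbose style, so it must be applied with
# re.VERBOSE; otherwise its whitespace and comments are matched literally.
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
processed = processed.str.replace(emoticons_str, '', flags=re.VERBOSE, regex=True)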
processed.head()

0                 RT : Rep. Trey Radel (R- ) slams .  
1    VIDEO - :  Full of Higher Costs and Broken Pro...
2    Please join me today in remembering our fallen...
3    RT : st step toward Senate debate on Leahy-Cra...
4    . delivery  show need to update law to promote...
Name: text, dtype: object
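
A pattern laid out over several lines with `#` comments is only interpreted that way when it is applied with the `re.VERBOSE` flag; without the flag, the spaces, newlines, and comment text are treated as literal characters to match. The sketch below shows the difference on a made-up string.

```python
import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

# Hypothetical text, used only for illustration.
text = "great news :) see you soon ;-D"

verbose = re.sub(emoticons_str, '', text, flags=re.VERBOSE)
literal = re.sub(emoticons_str, '', text)  # layout is matched literally

print(repr(verbose))  # 'great news  see you soon '
print(repr(literal))  # 'great news :) see you soon ;-D'

# ':/' also counts as an emoticon, so a URL scheme would be damaged.
url_damage = re.sub(emoticons_str, '', 'http://x', flags=re.VERBOSE)
print(repr(url_damage))  # 'http/x'
```

The last line is one reason to strip URLs before emoticons, as the steps here do.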

#### Remove extra whitespace

In [69]:
# Remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+$', '', regex=True)

# Replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

processed.head()

0                            Rep. Trey Radel (R- slams
1             VIDEO Full Higher Costs Broken Promises:
2    Please join today remembering fallen heroes ho...
3    st step toward Senate debate Leahy-Crapo bill ...
4    delivery show need update law promote &amp; pr...
Name: text, dtype: object
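
For a single string, the same normalisation can be sketched without regular expressions: `str.split()` with no arguments splits on runs of whitespace and discards leading and trailing whitespace, so re-joining with one space does both steps at once. The helper name `squeeze_whitespace` is made up for this example.

```python
def squeeze_whitespace(text):
    """Trim the ends and collapse inner whitespace runs to single spaces."""
    return ' '.join(text.split())

# Hypothetical string with the kind of gaps left by the removal steps.
raw = "  RT :  Rep. Trey Radel (R- ) slams .  \n"
squeezed = squeeze_whitespace(raw)

print(repr(squeezed))  # 'RT : Rep. Trey Radel (R- ) slams .'
```

On a pandas Series, `processed.str.split().str.join(' ')` should give the same result for ordinary spaces, tabs, and newlines.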

In [70]:
# Lowercase the corpus
processed = processed.str.lower()
processed.head()

0                            rep. trey radel (r- slams
1             video full higher costs broken promises:
2    please join today remembering fallen heroes ho...
3    st step toward senate debate leahy-crapo bill ...
4    delivery show need update law promote &amp; pr...
Name: text, dtype: object

#### Stop Words
To preprocess the data, we must also remove stop words: words that are very common but carry little meaning, especially when taken out of context.

NLTK has a built-in list of English stop words, which we use here. We extend it with punctuation marks and Twitter-specific terms such as 'RT' (which marks a retweet) and 'via' (which credits the original author).

In [71]:
punctuation_marks = list(string.punctuation)
stop_words = stopwords.words('english') + punctuation_marks + ['rt', 'via']

In [72]:
# Remove all stop words
stop_set = set(stop_words)  # build the lookup set once, not once per term
processed = processed.apply(lambda x: ' '.join(
    term for term in x.split() if term not in stop_set)
)
processed.head()

0                            rep. trey radel (r- slams
1             video full higher costs broken promises:
2    please join today remembering fallen heroes ho...
3    st step toward senate debate leahy-crapo bill ...
4    delivery show need update law promote &amp; pr...
Name: text, dtype: object

We can see above that some of the more common English words, such as 'of' and 'the', have been removed from the tweets, along with standalone punctuation and some Twitter-related strings such as "RT".
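
The filtering logic can be sketched on a single string as follows. The stop list here is a small hypothetical stand-in for NLTK's English list, and `remove_stop_words` is a made-up helper name. Because tokens are split on whitespace only, punctuation attached to a word (as in "promises:") is kept, while a free-standing ":" or "-" is dropped.

```python
import string

# Small hypothetical stop list; the notebook uses NLTK's English list instead.
sample_stop_words = {'of', 'and', 'the', 'to', 'rt', 'via'} | set(string.punctuation)

def remove_stop_words(text, stop_set):
    """Drop whitespace-separated tokens that appear in stop_set."""
    return ' '.join(term for term in text.split() if term not in stop_set)

# Hypothetical lower-cased tweet, used only for illustration.
tweet = "rt : video - full of higher costs and broken promises:"
filtered = remove_stop_words(tweet, sample_stop_words)

print(filtered)  # video full higher costs broken promises:
```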

#### Stemming

In [73]:
# Reduce each word to its stem using a Porter stemmer
porter = nltk.PorterStemmer()
processed = processed.apply(lambda x: ' '.join(
    porter.stem(term) for term in x.split())
)
processed.head()

0                             rep. trey radel (r- slam
1              video full higher cost broken promises:
2    pleas join today rememb fallen hero honor men ...
3    st step toward senat debat leahy-crapo bill se...
4    deliveri show need updat law promot &amp; prot...
Name: text, dtype: object
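
Stemming reduces inflected forms to a common root so that, for example, "costs" and "cost" count as the same feature. The sketch below assumes NLTK is installed and applies its `PorterStemmer` to a few tokens taken from the output above. It also shows why "promises:" is unchanged there: the trailing colon is part of the token, so no suffix rule applies.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Tokens taken from the tweets shown above.
tokens = ['costs', 'remembering', 'delivery', 'promises:']
stems = [porter.stem(token) for token in tokens]

print(stems)  # ['cost', 'rememb', 'deliveri', 'promises:']
```

Stripping punctuation from each token before stemming, for instance with `token.strip(string.punctuation)`, would let such words be stemmed as well.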