# Assignment 2 

Raw text cannot practically be fed straight into a machine learning or deep learning model. The data must first be prepared (cleaned), which involves a range of text preparation steps. The appropriate steps often depend on the characteristics and patterns of the data as well as on the natural language processing task at hand.


For this assignment, I was given a CSV file with five columns: ID, timestamp, title of the news story, content of the news story, and pageview count of the news story.
The goal of this assignment is to prepare the text data so that it can be used by any machine learning model to predict pageviews.


# Importing the libraries

In [1]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import scipy
import pandas as pd
from numpy import loadtxt, zeros, ones, array, linspace, logspace
from pylab import scatter, show, title, xlabel, ylabel, plot, contour
import string
import matplotlib.pyplot as plt
%matplotlib inline
import time
import pylab as pl

# Load the CSV data
For this assignment, I read the CSV data called "2018-ENGR5775-ASSIGNMENT2.csv" from the current working directory.

In [2]:
csv_filename="2018-ENGR5775-ASSIGNMENT2.csv"
df=pd.read_csv(csv_filename, encoding='ISO-8859-1')

# Inspecting the data
After reading the data, I inspect it to get an idea of what kind of textual data is available. Such an inspection gives a better understanding of the content and helps in planning the data preparation and cleaning tasks accordingly.

In [3]:
print(df)

            id                    crawled  \
0      9546141  2016-01-01T04:47:00+00:00   
1      9544391  2016-01-01T05:01:00+00:00   
2      9544392  2016-01-01T05:01:00+00:00   
3      9544393  2016-01-01T05:01:00+00:00   
4      9544394  2016-01-01T05:01:00+00:00   
5      9544395  2016-01-01T05:01:00+00:00   
6      9544396  2016-01-01T05:01:00+00:00   
7      9544397  2016-01-01T05:01:00+00:00   
8      9544398  2016-01-01T05:01:00+00:00   
9      9544390  2016-01-01T05:02:00+00:00   
10     9544389  2016-01-01T05:03:00+00:00   
11     9544387  2016-01-01T05:05:00+00:00   
12     9544385  2016-01-01T05:08:00+00:00   
13     9544383  2016-01-01T05:10:00+00:00   
14     9544380  2016-01-01T05:15:00+00:00   
15     9551093  2016-01-01T14:14:00+00:00   
16     9551746  2016-01-01T15:04:00+00:00   
17     9552250  2016-01-01T15:40:00+00:00   
18     9552773  2016-01-01T16:04:00+00:00   
19     9553182  2016-01-01T16:42:00+00:00   
20     9554025  2016-01-01T17:55:00+00:00   
21     955

# Install NLTK
For text preparation I use the Natural Language Toolkit (NLTK), a Python library for working with and modeling text. It is widely used because it provides good tools for loading and cleaning text, which we can use to get our data ready for machine learning and deep learning algorithms.

In [4]:
import sys
!{sys.executable} -m pip install -U nltk

Collecting nltk
  Downloading nltk-3.2.5.tar.gz (1.2MB)
Requirement already up-to-date: six in c:\users\100631155\appdata\local\continuum\anaconda3\lib\site-packages (from nltk)
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk: started
  Running setup.py bdist_wheel for nltk: finished with status 'done'
  Stored in directory: C:\Users\100631155\AppData\Local\pip\Cache\wheels\18\9c\1f\276bc3f421614062468cb1c9d695e6086d0c73d67ea363c501
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.4
    Uninstalling nltk-3.2.4:
      Successfully uninstalled nltk-3.2.4
Successfully installed nltk-3.2.5


You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


# nltk.download()
After installation, you need to download the data used with the library, including a large set of documents that you can use later for testing other tools in NLTK.

In [None]:
import nltk
nltk.download()

# Categorizing the data in terms of pageviews

In [5]:
high = df.pageviews >= 12000
low = df.pageviews < 12000
df.loc[high,'pageviews'] = 1
df.loc[low,'pageviews'] = 0

# Preparing the textual data 



# Filter out rows that contain null values

In [6]:
df = df.dropna(how='any',axis=0)
df.head()

|   | id | crawled | title | Original content | pageviews |
|---|---|---|---|---|---|
| 0 | 9546141 | 2016-01-01T04:47:00+00:00 | Kanye West Releases Awful New Song on New Year... | Is Drake inside Kanye West??s head? That??s ... | 1.0 |
| 1 | 9544391 | 2016-01-01T05:01:00+00:00 | Why Everybody Should Work in Hollywood | I??m convinced that working in Hollywood is t... | 1.0 |
| 2 | 9544392 | 2016-01-01T05:01:00+00:00 | Bill Cosby Tamir Rice And The Power of Prosecu... | What do Bill Cosby and Tamir Rice??s have in ... | 1.0 |
| 3 | 9544393 | 2016-01-01T05:01:00+00:00 | Translating the ??Iliad??? Who Isn??t. | Pop quiz: Which is greater? (a) The number of ... | 0.0 |
| 4 | 9544394 | 2016-01-01T05:01:00+00:00 | How Oprah Created a Profitable Weight-Loss Plan | Forget portion control: weight loss is a power... | 1.0 |


# Filter Out Punctuation
Clean text often means a list of words or tokens that we can work with in our machine learning models. Tokens are split on white space and punctuation, and any tokens we are not interested in, such as standalone punctuation, can be filtered out. Here I remove every punctuation character from the title and content columns.

In [7]:
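# regex=True is needed on recent pandas, where str.replace treats the pattern as a literal string by default
df[df.columns[2]] = df[df.columns[2]].str.replace('[{}]'.format(string.punctuation), '', regex=True)
df[df.columns[3]] = df[df.columns[3]].str.replace('[{}]'.format(string.punctuation), '', regex=True)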
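
As a minimal standalone sketch of the same idea, assuming plain Python strings rather than a DataFrame column, `str.translate` can delete every character listed in `string.punctuation`. The helper name `strip_punctuation` and the sample sentence are illustrative only and are not part of the assignment code.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punctuation(text):
    """Return text with all ASCII punctuation removed (hypothetical helper)."""
    return text.translate(_PUNCT_TABLE)


sample = "Forget portion control: weight-loss is a power game!"
print(strip_punctuation(sample))
# prints: Forget portion control weightloss is a power game
```

Note that deleting the hyphen merges "weight-loss" into "weightloss". If hyphenated words should stay separate, one option is to replace punctuation with a space instead of deleting it.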

# Filter out redundant characters

In [8]:
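# '??' is how mis-encoded apostrophes appear in this file (e.g. "West??s"); '?' is also in
# string.punctuation, so this finds nothing if the punctuation filter has already stripped it
df[df.columns[2]] = df[df.columns[2]].map(lambda x: str(x).replace('??', "'"))
df[df.columns[3]] = df[df.columns[3]].map(lambda k: str(k).replace('??', "'"))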

# Filter out stopwords
Stop words are words that do not contribute to the deeper meaning of a phrase. They are the most common words, such as "the", "a", and "is". For some applications, like document classification, it may make sense to remove stop words. NLTK provides a list of commonly agreed-upon stop words for a variety of languages, including English. In this assignment, I filter out all the stop words.

In [9]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
df[df.columns[2]] = (df[df.columns[2]].str.lower().str.split()).apply(lambda x: ([item for item in x if item not in stop]))
df[df.columns[3]] = (df[df.columns[3]].str.lower().str.split()).apply(lambda k: ([item for item in k if item not in stop]))
print(df[df.columns[2]].head())
print(df[df.columns[3]].head())

0    [kanye, west, releases, awful, new, song, new,...
1                         [everybody, work, hollywood]
2       [bill, cosby, tamir, rice, power, prosecutors]
3                        [translating, iliad, isnt]
4       [oprah, created, profitable, weightloss, plan]
Name: title, dtype: object
0    [drake, inside, kanye, wests, head, thats, i...
1    [im, convinced, working, hollywood, effective...
2    [bill, cosby, tamir, rices, common, cases, re...
3    [pop, quiz, greater, number, republican, presi...
4    [forget, portion, control, weight, loss, power...
Name: Original content, dtype: object
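
The filtering step can be sketched on a single string without NLTK. This is an illustration under stated assumptions: `STOP_WORDS` is a tiny hand-picked list (NLTK's English list is much longer) and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny hand-picked stop list for illustration only; not NLTK's English list
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "in", "on", "why", "should"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop tokens found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]


print(remove_stop_words("Why Everybody Should Work in Hollywood"))
# prints: ['everybody', 'work', 'hollywood']
```

Keeping the stop words in a `set` makes each membership test constant time on average, which can matter when the same check is repeated for every token of every article.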


# Stemming Words
Stemming refers to the process of reducing each word to its root or base form. Some applications, like document classification, may benefit from stemming, both to reduce the vocabulary and to focus on the sense or sentiment of a document rather than its deeper meaning. In this assignment, I use one of the most popular stemming algorithms, the Porter stemming algorithm.

In [10]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
df[df.columns[2]] = (df[df.columns[2]]).apply(lambda x: [porter.stem(y) for y in x])
df[df.columns[3]] = (df[df.columns[3]]).apply(lambda k: [porter.stem(l) for l in k])
print(df[df.columns[2]].head())
print(df[df.columns[3]].head())

0    [kany, west, releas, aw, new, song, new, year...
1                         [everybodi, work, hollywood]
2        [bill, cosbi, tamir, rice, power, prosecutor]
3                           [translat, iliad, isnt]
4             [oprah, creat, profit, weightloss, plan]
Name: title, dtype: object
0    [drake, insid, kany, west, head, that, impre...
1    [im, convinc, work, hollywood, effect, effici...
2    [bill, cosbi, tamir, rice, common, case, reve...
3    [pop, quiz, greater, number, republican, presi...
4    [forget, portion, control, weight, loss, power...
Name: Original content, dtype: object


# Term document matrix
Most machine learning algorithms cannot use raw text directly; the text must first be converted into numbers. One way is to convert each document into a vector of numbers that can serve as input to different machine learning algorithms. A prominent example is the bag-of-words model, which stores the occurrences of the words in a document. Each word has a unique id, and a document is represented as a vector over the known words whose values are the frequencies of each word.

In this assignment, I use CountVectorizer, which tokenizes the text, builds a vocabulary of all the known words, and counts the occurrences of each word.

In [11]:
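doc_col1 = df[df.columns[2]].values
# Join each token list back into one space-separated string per title
combine = []
for row in doc_col1:
    str1 = " ".join(str(x) for x in row)
    combine.append(str1)
print(combine)

# Build the vocabulary and the document-term count matrix
vec = CountVectorizer()
X = vec.fit_transform(combine).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_matrix = pd.DataFrame(X, columns=vec.get_feature_names_out())
print(data_matrix)
# Binary pageview label (last column)
y = df.iloc[:, 4].values
print(y)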

['kany west releas aw new song new year\x83 eve', 'everybodi work hollywood', 'bill cosbi tamir rice power prosecutor', 'translat \x83iliad\x83 isn\x83t', 'oprah creat profit weightloss plan', 'world fall diana\x83 niec', 'democrat readi lincoln', 'florida men overcam racist', 'mom\x83 drama queen ann enright\x83 \x83the green road\x83', 'gay open marriag need come closet', 'syrian refuge band tour europ', 'miss trump', 'happen final season \x83downton abbey\x83', '\x83sherlock\x83 send benedict cumberbatch back time victorian london', 'mad scientist clone 100000 cow', 'cosbi speak assault arrest', 'mash actor wayn roger die', '380 injur filipino nye celebr', 'man crush elev nye', 'natali cole die la age 65', 'obama delay iran missil sanction', '2 kill manhunt underway tel aviv', 'watch last night\x83 tv host get wast', 'man attack french soldier mosqu', '4 dead shoot near lo angel', 'clinton rais 37m fourth quarter', 'trump use jihadist recruit vid', 'goodby natali cole soul diva figh
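
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch that builds a vocabulary and a count matrix for three of the stemmed titles above. It is not how CountVectorizer is implemented; `bag_of_words` is a hypothetical helper, and it splits on whitespace only, whereas CountVectorizer's default tokenizer also lowercases the text and ignores single-character tokens.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    # A sorted vocabulary gives every word a stable column index (its "id")
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["everybodi work hollywood", "miss trump", "trump use jihadist recruit vid"]
vocab, X_toy = bag_of_words(docs)
print(vocab)
for row in X_toy:
    print(row)
```

Each row has one entry per vocabulary word, so the matrix width grows with the vocabulary; this is why the real title matrix has thousands of columns, most of them zero.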

# Split Dataset

In [12]:
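from sklearn.model_selection import train_test_split
from sklearn import metrics

# Hold out 30% of the rows for testing; random_state fixes the shuffle for reproducibility.
# The sklearn.cross_validation module no longer exists; train_test_split lives in model_selection.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)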
print(df.shape)
print(x_train.shape)
print(y_train.shape)

(1455, 5)
(1018, 3374)
(1018,)




# Prediction of PageView using Decision Tree

In [13]:
from sklearn import tree

In [14]:
def pageview_prediction_model_dT(x_train, y_train, x_test):
    model = tree.DecisionTreeClassifier(criterion='gini')
    model.fit(x_train, y_train)
    print(model.score(x_train, y_train))
    #Predict Output
    output = model.predict(x_test)
    return output

In [15]:
output_dt = pageview_prediction_model_dT(x_train, y_train, x_test)

0.997053045187


# Calculating Accuracy

In [16]:
def calculate_accuracy(output, y_test):
    accuracy = 0
    for i in range(len(output)):
        if y_test[i] == output[i]:
            accuracy += 1
    
    accuracy = accuracy/len(output)
    
    return accuracy

In [17]:
accuracy = calculate_accuracy(output_dt, y_test)
print(accuracy)

0.585812356979405


# Prediction of PageView using Gaussian Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB

In [19]:
def pageview_prediction_model_GNB(x_train, y_train, x_test):
    model = GaussianNB()
    model.fit(x_train, y_train)
    print(model.score(x_train, y_train))
    #Predict Output
    output= model.predict(x_test)
    return output

In [20]:
output_GNB = pageview_prediction_model_GNB(x_train, y_train, x_test)

0.969548133595


In [21]:
accuracy = calculate_accuracy(output_GNB, y_test)
print(accuracy)

0.5652173913043478


# Prediction of PageView using Support Vector Machine

In [22]:
from sklearn import svm
from itertools import cycle
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from numpy import interp  # scipy.interp was an alias of numpy.interp and is not available in recent SciPy releases

In [23]:
def pageview_prediction_model_SVM(x_train, y_train, x_test):
    model = svm.SVC()
    model.fit(x_train, y_train)
    print(model.score(x_train, y_train))
    output= model.predict(x_test)
    return output

In [26]:
output_SVM = pageview_prediction_model_SVM(x_train, y_train, x_test)

0.584479371316


In [27]:
accuracy = calculate_accuracy(output_SVM, y_test)
print(accuracy)

0.5743707093821511


# Prediction of PageView using Random Forest

In [28]:
from sklearn.ensemble import RandomForestClassifier

In [29]:
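def pageview_prediction_model_RF(x_train, y_train, x_test):
    clf = RandomForestClassifier(n_jobs=2, random_state=0)
    clf.fit(x_train, y_train)
    # Predict on the held-out features passed in as x_test
    output = clf.predict(x_test)
    return output

In [30]:
# Call the random forest function. The original cell called
# pageview_prediction_model_SVM here, so the scores recorded for cells 30 and 31
# (0.584479371316 and 0.5743707093821511) were the SVM results, not random forest results.
output_RF = pageview_prediction_model_RF(x_train, y_train, x_test)

In [31]:
accuracy = calculate_accuracy(output_RF, y_test)
print(accuracy)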
