# Week 5 and 6 - Assignment 6
#### Team 5

**Goal:** Classify new "test" documents using already classified "training" documents.

This project uses the "Twitter sentiment analysis: Self-driving cars" dataset from the CrowdFlower library (https://www.crowdflower.com/data-for-everyone/). The dataset contains tweets expressing users' opinions of self-driving cars, a label marking each tweet as relevant to the topic or not, and, for relevant tweets, a sentiment score given as an integer from 1 (negative) to 5 (positive).

# Importing data

#### Seeing what we're working with

In [9]:
import nltk 
import pandas as pd
import sys
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.api import TokenizerI
from nltk.tokenize.toktok import ToktokTokenizer
toktok = ToktokTokenizer()

In [10]:
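# Load the labelled tweets from the project's GitHub repository
data = pd.read_csv("https://raw.githubusercontent.com/Galanopoulog/DATA620-Assignment-6/master/Self-Driving-Cars.csv")
# Drop the columns that are not needed for classification (selected by position)
data = data.drop(data.columns[[1,2,3,4,7,8,9]], axis=1)
# Remove non-ASCII characters from the tweet text
data['text'] = data['text'].replace({r'[^\x00-\x7F]+':''}, regex=True)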
data.head(11)

| | _unit_id | sentiment | sentiment:confidence | text |
|---|---|---|---|---|
| 0 | 724227031 | 5 | 0.7579 | Two places I'd invest all my money if I could:... |
| 1 | 724227032 | 5 | 0.8775 | Awesome! Google driverless cars will help the ... |
| 2 | 724227033 | 2 | 0.6805 | If Google maps can't keep up with road constru... |
| 3 | 724227034 | 2 | 0.882 | Autonomous cars seem way overhyped given the t... |
| 4 | 724227035 | 3 | 1.0 | Just saw Google self-driving car on I-34. It w... |
| 5 | 724227036 | 3 | 1.0 | Will driverless cars eventually replace taxi d... |
| 6 | 724227037 | not_relevant | 0.5367 | Chicago metro expected to be fully autonomous ... |
| 7 | 724227038 | not_relevant | 0.6548 | I love the infotainment system in my new car. ... |
| 8 | 724227039 | 5 | 0.7187 | Autonomous vehicles could reduce traffic fatal... |
| 9 | 724227040 | 1 | 0.6412 | Driverless cars are not worth the risk. Don't... |
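
The regular expression used when loading the data removes every run of characters outside the 7-bit ASCII range. A minimal standalone sketch of the same pattern, using only the standard library; the helper name `strip_non_ascii` and the sample string are made up for illustration and are not part of the notebook or the dataset:

```python
import re

# Matches one or more consecutive characters outside the 7-bit ASCII range
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def strip_non_ascii(text):
    """Return text with every run of non-ASCII characters removed."""
    return NON_ASCII.sub('', text)

# Hypothetical sample, not taken from the dataset
sample = u"Driverless cars \u2764 caf\u00e9"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Note that the characters are deleted rather than replaced, so surrounding spaces are left as they were and accented letters disappear entirely.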


In [3]:
len(data)

7156

In [4]:
print(data.dtypes)

_unit_id                  int64
sentiment                object
sentiment:confidence    float64
text                     object
dtype: object


In [5]:
# Count relevant and irrelevant
rel_count = len(data[data.sentiment!="not_relevant"])
irrel_count = len(data[data.sentiment=="not_relevant"])
neut_count = len(data[data.sentiment=="3"])

print("Relevant: %d" % rel_count)
print("Not_Relevant: %d" % irrel_count)
print("Neutral: %d" % neut_count)

Relevant: 6943
Not_Relevant: 213
Neutral: 4245


Two key things to note are that 1) the tweet text and sentiment values are stored as objects (strings), which may need to be converted for later manipulation, and 2) the non-relevant tweets (213) are far fewer than the relevant ones (6943), so splitting the data into training and test sets should be more methodical than simply shuffling the data at random. It will also be interesting to see how the relevant neutral tweets differ from the irrelevant ones.
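
Both points can be illustrated on a handful of labels. The sketch below is plain Python rather than pandas so that it runs on its own; the `labels` list and the helper `parse_sentiment` are hypothetical and only mimic the string format of the `sentiment` column. In pandas, `pd.to_numeric(data.sentiment, errors="coerce")` would be one way to do a similar conversion.

```python
from collections import Counter

def parse_sentiment(label):
    """Map '1'..'5' to an int and anything else (e.g. 'not_relevant') to None."""
    return int(label) if label.isdigit() else None

# Hypothetical labels in the same string format as the sentiment column
labels = ["5", "5", "2", "3", "3", "not_relevant", "1", "3"]

# Point 1: convert the string labels to numbers where possible
scores = [parse_sentiment(label) for label in labels]

# Point 2: count each class and express it as a share of the total
counts = Counter(labels)
shares = {label: float(n) / len(labels) for label, n in counts.items()}

print(scores)
print(counts.most_common())
```

Looking at class shares before splitting makes it easier to decide whether each class needs to be split separately.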

#### Splitting the data

The data is split into the training set, the evaluation set and the final set that will be used for testing.

In [6]:
# What sizes should our sets be?
train_set = int(0.7 * len(data))
dev_set = int(.15 * len(data))
final_set = len(data) - train_set - dev_set
Total = train_set + dev_set + final_set

print("Training set: %d" % train_set)
print("Evaluation set: %d" % dev_set)
print("Testing set: %d" % final_set)
print("Total: %d" % Total)

Training set: 5009
Evaluation set: 1073
Testing set: 1074
Total: 7156


In [7]:
# Let's split our data into relevant and non-relevant parts.
relevant = data[data.sentiment!="not_relevant"]
nonrelevant = data[data.sentiment=="not_relevant"]

# Let's take 70%, 15% and 15% of each part
rel_train = int(0.7 * len(relevant))
rel_dev = int(.15 * len(relevant))
rel_fin = len(relevant) - rel_train - rel_dev
rel_Total = rel_train + rel_dev + rel_fin

irrel_train = int(0.7 * len(nonrelevant))
irrel_dev = int(.15 * len(nonrelevant))
irrel_fin = len(nonrelevant) - irrel_train - irrel_dev
irrel_Total = irrel_train + irrel_dev + irrel_fin

print("Training set: %d" % rel_train)
print("Evaluation set: %d" % rel_dev)
print("Testing set: %d" % rel_fin)
print("Total: %d" % rel_Total)
print("----------------------")
print("Training set: %d" % irrel_train)
print("Evaluation set: %d" % irrel_dev)
print("Testing set: %d" % irrel_fin)
print("Total: %d" % irrel_Total)

Training set: 4860
Evaluation set: 1041
Testing set: 1042
Total: 6943
----------------------
Training set: 149
Evaluation set: 31
Testing set: 33
Total: 213


In [8]:
# Make sets for relevant data
rel_train = relevant[0:4860]
rel_dev = relevant[4860:5901]
rel_fin = relevant[5901:6943]

# Make sets for non-relevant data
irrel_train = nonrelevant[0:149]
irrel_dev = nonrelevant[149:180]
irrel_fin = nonrelevant[180:213]

# Combine sets by set types
train_set = pd.concat([rel_train, irrel_train], axis=0)
dev_set = pd.concat([rel_dev, irrel_dev], axis=0)
final_set = pd.concat([rel_fin, irrel_fin], axis=0)

# Determine lengths and compare to set size estimates
print("Training set: %d" % len(train_set))
print("Evaluation set: %d" % len(dev_set))
print("Testing set: %d" % len(final_set))

Training set: 5009
Evaluation set: 1072
Testing set: 1075


Each set now contains roughly the same proportion of relevant and non-relevant entries, and the set sizes (5009, 1072 and 1075) closely match the 70%, 15%, 15% split chosen originally.
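
The hard-coded slice boundaries above can also be derived from the group sizes. The following is a minimal sketch of that idea in plain Python; the helper names `split_sizes` and `stratified_split` are made up for illustration, and the placeholder rows only reproduce the group sizes (6943 relevant, 213 non-relevant), not the actual tweets. Like the cells above, it slices each group in its existing order and does not shuffle.

```python
def split_sizes(n, train_frac=0.7, dev_frac=0.15):
    """Return (train, dev, test) sizes; the test set takes the remainder."""
    n_train = int(train_frac * n)
    n_dev = int(dev_frac * n)
    return n_train, n_dev, n - n_train - n_dev

def stratified_split(groups, train_frac=0.7, dev_frac=0.15):
    """Split each group in order and pool the pieces into train/dev/test lists."""
    train, dev, test = [], [], []
    for rows in groups:
        n_train, n_dev, _ = split_sizes(len(rows), train_frac, dev_frac)
        train.extend(rows[:n_train])
        dev.extend(rows[n_train:n_train + n_dev])
        test.extend(rows[n_train + n_dev:])
    return train, dev, test

# Placeholder rows with the same group sizes as the relevant / non-relevant tweets
relevant_rows = [("relevant", i) for i in range(6943)]
nonrelevant_rows = [("not_relevant", i) for i in range(213)]

train, dev, test = stratified_split([relevant_rows, nonrelevant_rows])
print(len(train), len(dev), len(test))
```

Because the sizes are computed per group, the same helper would keep working if the dataset grew or if the groups were defined differently, for example one group per sentiment value.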

# Analyze and Categorize Data

In [18]:
from textblob.classifiers import NaiveBayesClassifier

In [11]:
train_set[0:10]

| | _unit_id | sentiment | sentiment:confidence | text |
|---|---|---|---|---|
| 0 | 724227031 | 5 | 0.7579 | Two places I'd invest all my money if I could:... |
| 1 | 724227032 | 5 | 0.8775 | Awesome! Google driverless cars will help the ... |
| 2 | 724227033 | 2 | 0.6805 | If Google maps can't keep up with road constru... |
| 3 | 724227034 | 2 | 0.882 | Autonomous cars seem way overhyped given the t... |
| 4 | 724227035 | 3 | 1.0 | Just saw Google self-driving car on I-34. It w... |
| 5 | 724227036 | 3 | 1.0 | Will driverless cars eventually replace taxi d... |
| 8 | 724227039 | 5 | 0.7187 | Autonomous vehicles could reduce traffic fatal... |
| 9 | 724227040 | 1 | 0.6412 | Driverless cars are not worth the risk. Don't... |
| 10 | 724227041 | 3 | 0.9184 | Driverless cars are now legal in Florida, Cali... |
| 11 | 724227610 | 3 | 1.0 | Audi is the first carmaker to get a license fr... |


In [12]:
# Making tuples for classification
subset = train_set[['text', 'sentiment']]
tuples = [tuple(x) for x in subset.values]

train_set[['text']][0:10]
subset[['text']]
len(tuples)

5009

In [32]:
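# Tokenizing words
# Caution: tokenize() expects a single string. Given a DataFrame, it appears to
# tokenize the frame's printed (truncated) representation rather than every tweet,
# so all_words covers only part of the vocabulary.
all_words = toktok.tokenize(subset[['text']] )

# One boolean feature per word: True if the word occurs as a substring of the tweet
t = [({word: (word in (x[0])) for word in all_words}, x[1]) for x in tuples]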
len(t)

5009
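
Since a tokenizer works on one string at a time, an alternative is to tokenize each tweet separately, pool the tokens into a vocabulary, and test membership against a tweet's tokens instead of its raw text. The sketch below illustrates that approach under stated assumptions: the regex-based `tokenize` is only a stand-in for `ToktokTokenizer` so the example runs without NLTK, the helper names are hypothetical, and the two sample tweets are invented.

```python
import re

def tokenize(text):
    """Stand-in tokenizer: words (with apostrophes) or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def build_vocabulary(texts):
    """Collect the set of distinct tokens across all texts."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(tokenize(text))
    return vocabulary

def featurize(text, vocabulary):
    """Map every vocabulary word to True/False by token membership."""
    tokens = set(tokenize(text))
    return {word: (word in tokens) for word in vocabulary}

# Hypothetical (text, label) pairs in the same shape as `tuples`
pairs = [
    ("Awesome! Driverless cars are cool", "5"),
    ("Driverless cars are not worth the risk", "1"),
]

vocabulary = build_vocabulary(text for text, _ in pairs)
feature_sets = [(featurize(text, vocabulary), label) for text, label in pairs]
print(sorted(vocabulary))
```

The resulting list of `(feature dict, label)` pairs has the same shape as `t`, which is the input format `nltk.NaiveBayesClassifier.train` accepts. Matching on tokens avoids counting a short word as present just because it appears inside a longer one.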

In [30]:
# Classifying Methods

# Naive Bayes
classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()

Most Informative Features
                    Damn = True                1 : 3      =     41.9 : 1.0
                    Fuck = True                1 : 3      =     41.9 : 1.0
                    cool = True                5 : 3      =     33.4 : 1.0
                 Awesome = True                5 : 3      =     32.7 : 1.0
                    seem = True                1 : 3      =     23.3 : 1.0
                  reduce = True                5 : 3      =     22.4 : 1.0
                       c = False          not_re : 2      =     20.9 : 1.0
                tracking = True           not_re : 3      =     20.4 : 1.0
         _______________ = True           not_re : 3      =     20.4 : 1.0
                     Dor = True           not_re : 3      =     20.4 : 1.0
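
Each row above can be read as a likelihood ratio: for example, `Damn = True   1 : 3 = 41.9 : 1.0` says the feature is about 42 times more likely in tweets labelled `1` than in tweets labelled `3` (`not_re` is `not_relevant` cut short in the display). The following is a rough sketch of how such a ratio can be computed, assuming add-0.5 smoothing over two possible feature values, which to our understanding is close to what NLTK's default estimator does. The function names and the counts are made up for illustration.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(feature value | label)."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature | label a) / P(feature | label b) after smoothing."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: a word seen in 8 of 100 tweets labelled "1"
# and in 4 of 2000 tweets labelled "3"
ratio = likelihood_ratio(8, 100, 4, 2000)
print("1 : 3 = %.1f : 1.0" % ratio)
```

Smoothing keeps the ratio finite when a word never occurs with one of the labels, which is common for rare words in a dataset of this size.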


In [31]:
# Other Methods to Consider

# MNB
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(t)

# BernoulliNB
BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(t)

# Logistic Regression
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(t)

# SGDC
SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(t)

# Linear SVC
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(t)

# Note: testing_set is not defined in the cells shown; it must hold (features, label)
# pairs built from a held-out split in the same way as t.
print("Percent Accuracy:")
print("------------------")
print("Naive Bayes: %d" % round(nltk.classify.accuracy(classifier, testing_set) * 100, 2))
print("MNB: %d" % round(nltk.classify.accuracy(MNB_classifier, testing_set) * 100, 2))
print("BernoulliNB: %d" % round(nltk.classify.accuracy(BernoulliNB_classifier, testing_set) * 100, 2))
print("LogisticRegression: %d" % round(nltk.classify.accuracy(LogisticRegression_classifier, testing_set) * 100, 2))
print("SGDClassifier: %d" % round(nltk.classify.accuracy(SGDClassifier_classifier, testing_set) * 100, 2))
print("LinearSVC: %d" % round(nltk.classify.accuracy(LinearSVC_classifier, testing_set) * 100, 2))

Percent Accuracy:
------------------
Naive Bayes: 50
MNB: 58
BernoulliNB: 51
LogisticRegression: 61
SGDClassifier: 37
LinearSVC: 62


Of all these models, Linear SVC has the highest accuracy (62%), narrowly ahead of logistic regression (61%).
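
One way to put these accuracies in context is to compare them with a majority-class baseline, that is, always predicting the most common label. The sketch below shows the idea in plain Python. The helper names and the toy label lists are hypothetical; only the counts 4245 and 7156 come from the cells above, and they describe the full dataset rather than the held-out split, so the resulting share is just an approximate reference point.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return float(correct) / len(actual)

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return accuracy([majority_label] * len(test_labels), test_labels)

# Share of neutral ("3") tweets in the full dataset, from the counts reported earlier
neutral_share = 4245.0 / 7156
print("Neutral share: %.1f%%" % (100 * neutral_share))

# Hypothetical labels, only to exercise the helpers
baseline = majority_baseline(["3", "3", "5", "1"], ["3", "5", "3", "3"])
print("Toy baseline accuracy: %.2f" % baseline)
```

If the held-out split has a similar share of neutral tweets, a model needs to clear that share by a clear margin before its accuracy says much about what it has learned.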