# Capturing Sentimental Opinions on Presidential Candidates

In this project, we'll aim to capture the opinions and sentiments of a sample of electoral voters. We'll do this by analyzing their tweets on a daily basis over a period of time to see how these opinions are varying or changing over time. 

Can analyzing sentiments of tweets from a sample during a given period provide insight to how an election will swing? The main goal of the project is to examine and analyze tweets containing reference to candidates vying for a presidential spot, classify these tweets into positive, negative or neutral sentiments and from these, try to conclude on how much online sentiments provide for a good indicator to predict the results of an election.

## Training Data Set 

The data set we used in this project is the Sentiment140 dataset. You can find the data set [here](http://help.sentiment140.com/for-students). 

## Exploring the Data Set 

Below we'll do a quick exploration of the data set to find irregularities and understand the composition of the data.

### What questions should we try to answer while looking at the data? (Please add ideas)
1. How many positive/negative records do we have?
2. Are there null/empty records?
3. Do we really need all the columns? Which features are important?
4. The text column, do we need @usernames, RTs, and any unnecessary text?
5. Are we taking neutral records into consideration or just positives and negatives?
6. Do we take into consideration tweets that contain e.g "I wish", "I hope" (ithout comparison these tweets might be deemed neutral, otherwise positive or negative).
7. How do we plan to prepare data to prevent bias? (i.e Ratio of positive to negative tweets), to prevent the model from being biased towards the seniment with the higher ratio?

In [2]:
import pandas as pd
import numpy as np
dataset = "./trainingandtestdata/training.1600000.processed.noemoticon.csv"
cols = ['sentiment', 'id', 'date' 'query', 'username', 'text']
df = pd.read_csv(dataset, header=None, names=cols, encoding='latin-1')
df.head(10)

Unnamed: 0,sentiment,id,datequery,username,text
0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
0,1467811592,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,mybirch,Need a hug
0,1467811594,Mon Apr 06 22:20:03 PDT 2009,NO_QUERY,coZZ,@LOLTrish hey long time no see! Yes.. Rains a...
0,1467811795,Mon Apr 06 22:20:05 PDT 2009,NO_QUERY,2Hood4Hollywood,@Tatiana_K nope they didn't have it
0,1467812025,Mon Apr 06 22:20:09 PDT 2009,NO_QUERY,mimismo,@twittera que me muera ?


In [3]:
df.describe()

Unnamed: 0,sentiment
count,1600000.0
mean,1998818000.0
std,193576100.0
min,1467810000.0
25%,1956916000.0
50%,2002102000.0
75%,2177059000.0
max,2329206000.0


In [4]:
df.shape

(1600000, 5)

In [5]:
df.index

Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            ...
            4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
           dtype='int64', length=1600000)

In [6]:
df.columns

Index(['sentiment', 'id', 'datequery', 'username', 'text'], dtype='object')

In [7]:
df.isnull().sum()

sentiment    0
id           0
datequery    0
username     0
text         0
dtype: int64