# 1. Introduction

## Exploring the Dataset
This notebook uses the *sentiment140* dataset, which contains 1,600,000 tweets extracted using the Twitter API. Each tweet is annotated with its polarity (0 = negative, 4 = positive), so the dataset can be used to train and evaluate sentiment classifiers.

It contains the following six fields:

- **target**: the polarity of the tweet (0 = negative, 4 = positive)
- **ids**: the ID of the tweet (e.g. 2087)
- **date**: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)
- **flag**: the query (e.g. lyx); if there is no query, this value is NO_QUERY
- **user**: the user who tweeted (e.g. robotickilldozr)
- **text**: the text of the tweet (e.g. Lyx is cool)
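
To make the field layout concrete, here is a minimal sketch that parses one headerless CSV line into the six named fields using only the standard library. The sample line is assembled from the example values in the list above; its quoting style and its target value of 4 are assumptions made for illustration, not a row copied from the file.

```python
import csv
import io

# Field names in the order listed above
COLUMN_NAMES = ["target", "ids", "date", "flag", "user", "text"]

# Illustrative line built from the example values (assumed quoting, assumed target)
sample_csv = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'

# The file has no header row, so the field names are supplied explicitly
reader = csv.DictReader(io.StringIO(sample_csv), fieldnames=COLUMN_NAMES)
row = next(reader)

print(row["target"], row["user"], row["text"])
```

Passing the names explicitly plays the same role as the `names=` argument to `pd.read_csv` used later in this notebook.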

In [1]:
# Build absolute paths to the unzipped dataset and to the file where the preprocessed data will be stored
import os

# Notebooks do not define __file__, so the quoted string '__file__' is resolved
# relative to the current working directory; dirname() then yields that directory
fileDir = os.path.dirname(os.path.realpath('__file__'))
absFilePath = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon.csv')
absFilePathToPreprocessedDataset = os.path.join(fileDir, '../Data/training.1600000.processed.noemoticon_preprocessed.csv')
pathToDataset = os.path.abspath(os.path.realpath(absFilePath))
pathToPreprocessedDataset = os.path.abspath(os.path.realpath(absFilePathToPreprocessedDataset))
print(pathToDataset)
print(pathToPreprocessedDataset)

c:\Users\nlp_workshop\Documents\PSIML-NLPWorkshop\Data\training.1600000.processed.noemoticon.csv
c:\Users\nlp_workshop\Documents\PSIML-NLPWorkshop\Data\training.1600000.processed.noemoticon_preprocessed.csv


In [2]:
import pandas as pd

# Column names for the CSV file
columnNames = ["target", "ids", "date", "flag", "user", "text"]

# Without an explicit encoding, the file cannot be read
# Without explicit names, the first row is mistaken for a header
dataset = pd.read_csv(pathToDataset, encoding='cp1252', names=columnNames)

dataset

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| ... | ... | ... | ... | ... | ... | ... |
| 1599995 | 4 | 2193601966 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | AmandaMarie1028 | Just woke up. Having no school is the best fee... |
| 1599996 | 4 | 2193601969 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | TheWDBoards | TheWDB.com - Very cool to hear old Walt interv... |
| 1599997 | 4 | 2193601991 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | bpbabe | Are you ready for your MoJo Makeover? Ask me f... |
| 1599998 | 4 | 2193602064 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | tinydiamondz | Happy 38th Birthday to my boo of alll time!!! ... |


In [3]:
dataset['target'].value_counts()

4    800000
0    800000
Name: target, dtype: int64

## Text Preprocessing

In [4]:
# Draw a random sample of 1,000 tweets; no random_state is set, so the sample differs between runs
dataset = dataset.sample(1000)

In [6]:
from Common.TextPreprocessor import preprocess_text

# Apply the preprocess_text helper to the text of every tweet
dataset["text"] = dataset["text"].apply(preprocess_text)
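
The implementation of `preprocess_text` lives in the `Common` package and is not shown in this notebook. As a rough illustration of what a tweet normalizer might do, the sketch below lowercases the text, drops URLs, and collapses repeated whitespace. The function name `simple_preprocess`, the patterns, and the sample string are all hypothetical; this is not the `Common.TextPreprocessor` code.

```python
import re

# Hypothetical patterns; the real preprocess_text may use different rules
URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def simple_preprocess(text: str) -> str:
    """Lowercase the text, remove URLs, and collapse runs of whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.strip()


# Made-up input used only to exercise the function
cleaned = simple_preprocess("Check   THIS out: http://example.com/page  Great!")
print(cleaned)
```

A function like this can be passed to `Series.apply` in the same way as `preprocess_text` above.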

In [7]:
# Rescale the labels from {0, 4} to {0.0, 1.0}
dataset["target"] = dataset["target"] / 4

# Keep only the text and target columns
dataset = dataset[['text', 'target']]

In [8]:
from Common.SplitDataset import SplitDataset

# SplitDataset returns the data with an added 'split' column (see the output of this cell)
dataset = SplitDataset(dataset)

dataset

| | text | target | split |
|---|---|---|---|
| 0 | is hopeful that the crcna will adopt the belha... | 0.0 | train |
| 1 | @nikredbull im going there for vacation this s... | 0.0 | train |
| 2 | @bdothill i will come see you today | 1.0 | train |
| 3 | boo my dad threw out his old bike. no hipster ... | 0.0 | train |
| 4 | @erica_lick sounds like you need a week of dig... | 1.0 | train |
| ... | ... | ... | ... |
| 995 | @lifecoachmary think of the years of story tel... | 1.0 | test |
| 996 | @ronearl excellent--thank you! | 1.0 | test |
| 997 | awake, not in the mood to study | 0.0 | test |
| 998 | @alaksir iyaa, part-time sambil kuliah hoho.. ... | 1.0 | test |
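
`SplitDataset` is also defined in the `Common` package, so its exact behavior is not visible here. The displayed rows only show that the first rows are labeled `train` and the last rows `test`. A minimal sketch of one way to produce such labels follows; the function name `assign_splits` and the 80/20 ratio are assumptions for illustration, and the real helper may use other ratios or additional splits.

```python
from typing import List


def assign_splits(n_rows: int, train_fraction: float = 0.8) -> List[str]:
    """Label the first train_fraction of rows 'train' and the remainder 'test'."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    n_train = int(n_rows * train_fraction)
    return ["train"] * n_train + ["test"] * (n_rows - n_train)


# Labels for a 1,000-row sample such as the one drawn earlier
splits = assign_splits(1000)
print(splits.count("train"), splits.count("test"))
```

Because the rows were already drawn at random with `sample`, labeling them in order does not introduce an ordering bias. The resulting list could be assigned to a DataFrame column of the same length.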


In [9]:
dataset['target'].value_counts()

0.0    517
1.0    483
Name: target, dtype: int64

In [10]:
dataset.to_csv(pathToPreprocessedDataset, index=False)