## Twitter Sentiment Analysis 


The data comes from Marios Michailidis' dataset hosted in [this Kaggle repository](https://www.kaggle.com/kazanova/sentiment140/).
It consists of 1.6 million tweets, evenly split between negative and positive. The labels were assigned automatically based on emoticons; the exact procedure has not been disclosed.
For more information on *sentiment140*, see their [API documentation](http://help.sentiment140.com/api).

### Load data

In [1]:
import os 
import numpy as np
import pandas as pd 

# load raw dataset
dir_path = os.path.join("..","data","1_raw","sentiment140")
filename = "training.1600000.processed.noemoticon.csv"
filepath = os.path.join(dir_path, filename)

# keep polarity, tweet ID, username, and text (file columns 0, 1, 4, 5)
raw = pd.read_csv(filepath, encoding='latin-1', usecols=[0,1,4,5])
raw.columns = ["Polarity","ID","Username","Text"]

### Preview data

In [2]:
raw.shape

(1599999, 4)
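
The shape is one row short of the 1.6 million rows in the file. The sentiment140 CSV has no header row, so `pd.read_csv` with its default settings consumes the first tweet as column names. Below is a minimal sketch of a call that keeps every row, run on an in-memory sample rather than the real file. The two sample rows and the `Date` and `Query` column names are made up for illustration.

```python
import io

import pandas as pd

# Two made-up rows in the six-column sentiment140 layout
# (polarity, id, date, query, user, text), with no header row.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","first tweet"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","second tweet"\n'
)

all_names = ["Polarity", "ID", "Date", "Query", "Username", "Text"]

# header=None tells pandas that the first line is data, not column names.
demo = pd.read_csv(
    sample,
    header=None,
    names=all_names,
    usecols=["Polarity", "ID", "Username", "Text"],
)
print(demo.shape)
```

With the real file, the same arguments together with `filepath` and `encoding='latin-1'` should yield 1,600,000 rows. Every row label would then shift by one relative to the outputs shown in this notebook.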

In [3]:
raw.head()

Unnamed: 0,Polarity,ID,Username,Text
0,0,1467810672,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,joy_wolf,@Kwesidei not the whole crew


In [4]:
X = raw.iloc[:, 1:4]
y = raw.iloc[:, 0]

In [5]:
X.head()

Unnamed: 0,ID,Username,Text
0,1467810672,scotthamilton,is upset that he can't update his Facebook by ...
1,1467810917,mattycus,@Kenichan I dived many times for the ball. Man...
2,1467811184,ElleCTF,my whole body feels itchy and like its on fire
3,1467811193,Karoli,"@nationwideclass no, it's not behaving at all...."
4,1467811372,joy_wolf,@Kwesidei not the whole crew


In [6]:
y_df = pd.DataFrame({'target': y})
# recode polarity: 4 (positive) -> 1; 0 (negative) stays 0
y_df.loc[y_df["target"] == 4, "target"] = 1
y_df.tail()

Unnamed: 0,target
1599994,1
1599995,1
1599996,1
1599997,1
1599998,1


### Split into Train, Test sets

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_df, test_size=0.25, random_state=42)

### Set Test set aside

In [17]:
# Keep the original row labels as a column so rows can be traced back to `raw`.
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning.
X_train = X_train.assign(index=X_train.index)

In [9]:
X_train.to_csv(os.path.join(dir_path, "X_train.csv"), index=False) 

In [10]:
X_train.head()

Unnamed: 0,ID,Username,Text,index
66270,1691417640,goatkinghoang,working madly,66270
428045,2063856213,Emma_Keenan,How can it be this cold! It's June!!! Sheesh,428045
1307927,2012574359,shellywellyx1,@jojoalexander ight i let they lil white boy k...,1307927
1112400,1972402571,FreagleDiva,Tweetlater Pro is the way to go for those who ...,1112400
840793,1559873939,bumgirl,&quot;I wanna wake up where you are&quot; I lo...,840793


In [11]:
# Save train and test sets with original indices
#X_train.to_csv(os.path.join(dir_path, "X_train.csv"), index=False) 
#X_test.to_csv(os.path.join(dir_path, "X_test.csv"), index=True)
#y_train.to_csv(os.path.join(dir_path, "y_train.csv"), index=True)
#y_test.to_csv(os.path.join(dir_path, "y_test.csv"), index=True)

### Cleanup Process

The cleanup steps are implemented in the custom script `cleanup_module.py`.

Because the cleanup function is CPU-bound, I use multiprocessing: the data is split into 50k-row chunks that are processed 8 at a time (my laptop has 8 logical processors). Chunks finish in no guaranteed order, which is why the log below is not sequential.
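
`cleanup_module.py` is not shown here, so the following is only a sketch of the chunk-and-dispatch pattern described above, not the module's actual code. `chunk_ranges`, `clean_chunk`, and `clean_in_chunks` are hypothetical names, and the lowercase-and-strip step merely stands in for the real cleanup.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

import pandas as pd


def chunk_ranges(n_rows, chunk_size=50_000):
    """Yield (chunk_id, range) pairs covering n_rows positions; ids start at 1."""
    for chunk_id, start in enumerate(range(0, n_rows, chunk_size), start=1):
        yield chunk_id, range(start, min(start + chunk_size, n_rows))


def clean_chunk(chunk_id, chunk):
    """Stand-in for the real cleanup: lowercase and strip the Text column."""
    cleaned = chunk.assign(text=chunk["Text"].str.lower().str.strip())
    return chunk_id, cleaned


def clean_in_chunks(df, chunk_size=50_000, max_workers=8):
    """Clean df chunk by chunk in worker processes; return results keyed by chunk id."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(clean_chunk, chunk_id, df.iloc[rows.start:rows.stop])
            for chunk_id, rows in chunk_ranges(len(df), chunk_size)
        ]
        # as_completed yields futures as they finish, not in submission order.
        for future in as_completed(futures):
            chunk_id, cleaned = future.result()
            results[chunk_id] = cleaned
    return results
```

On platforms that start worker processes with `spawn` (Windows, recent macOS), the call to `clean_in_chunks` has to sit under an `if __name__ == "__main__":` guard in a script.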

In [12]:
cmd = 'python cleanup_module.py X_train.csv'
!{cmd}

Saving cleaned up train dataset: 3
Saving cleaned up train dataset: 1
Saving cleaned up train dataset: 4
Saving cleaned up train dataset: 2
Saving cleaned up train dataset: 6
Saving cleaned up train dataset: 5
Finished in 34.73 second(s)


### X_train $\Rightarrow$ Raw $\Rightarrow$ X_clean

We can trace the path backward to the `raw` data and forward to the `clean` data via the **X_train** indices:

- For example, the sixth parameter tuple passed to the multiprocessing executor indicates that cleaned subset 6 (`train_6.csv`) covers positional rows 250000 through 300000 of **X_train**:

```
(X_name, range( 250000,  300001),  6)
```
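
One detail worth checking in a tuple like this is the stop value of the `range`. The sketch below inspects the range shown above and compares it with a half-open alternative. `"X_train"` stands in for `X_name`, and whether `cleanup_module.py` relies on the extra position is not shown here, so treat the comparison as an assumption to verify rather than a diagnosis.

```python
# Parameter tuple as shown above; "X_train" is a stand-in for X_name.
params = ("X_train", range(250000, 300001), 6)
name, rows, subset_id = params

print(len(rows))          # number of positions covered by the range
print(rows[0], rows[-1])  # first and last position

# Half-open alternative that covers exactly chunk_size rows per subset.
chunk_size = 50_000
start = (subset_id - 1) * chunk_size
half_open = range(start, start + chunk_size)
print(len(half_open), half_open[-1])
```

If the next subset starts at position 300000, the range as shown would place that row in two subsets, while the half-open form would not.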

In [13]:
X_train.iloc[250000:250005, ]

Unnamed: 0,ID,Username,Text,index
127175,1834669916,willysandi,i am so sleepy but there's still lots of assig...,127175
1521865,2176374015,brandyf82,thinks its great how one person can make you f...,1521865
1028707,1932667098,warfoot,Got the BrainBone daily question right! - htt...,1028707
327835,2010059197,sydeshow,"@tinayayo Girl, I get out a lot but you put me...",327835
865153,1677410625,__JANEDOE,sitting at my desk... doing nothing. i look ar...,865153


In [14]:
# X_train's index holds row labels from `raw`, so select by label with .loc
raw.loc[X_train.index[250000:250005]]

Unnamed: 0,Polarity,ID,Username,Text
127175,0,1834669916,willysandi,i am so sleepy but there's still lots of assig...
1521865,4,2176374015,brandyf82,thinks its great how one person can make you f...
1028707,4,1932667098,warfoot,Got the BrainBone daily question right! - htt...
327835,0,2010059197,sydeshow,"@tinayayo Girl, I get out a lot but you put me..."
865153,4,1677410625,__JANEDOE,sitting at my desk... doing nothing. i look ar...


In [15]:
dirpath = os.path.join("..","data","2_clean","sentiment140")
filename = "train_6.csv"
filepath = os.path.join(dirpath, filename)

X_clean = pd.read_csv(filepath)

In [16]:
X_clean.head()

Unnamed: 0,username,text,index,lemmatized
0,willysandi,i am so sleepy but there's still lots of assig...,127175,i am so sleepy but there still lot assignment ...
1,brandyf82,thinks its great how one person can make you f...,1521865,think great how one person can make you feel w...
2,warfoot,Got the BrainBone daily question right! - htt...,1028707,got brainbone daily question right
3,sydeshow,"@tinayayo Girl, I get out a lot but you put me...",327835,tinayayo girl i get out lot but you put me my ...
4,__JANEDOE,sitting at my desk... doing nothing. i look ar...,865153,sitting my desk doing nothing i look around al...
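
The round trip shown in cells 13 to 16 can also be checked programmatically. Below is a minimal sketch on toy data, assuming the `index` column holds the original row labels. `raw_demo`, `train_demo`, and `clean_demo` are hypothetical stand-ins, and a shuffle replaces `train_test_split`.

```python
import pandas as pd

# Toy stand-in for `raw`, with the default 0..n-1 row labels.
raw_demo = pd.DataFrame({
    "Polarity": [0, 0, 4, 4, 0, 4],
    "Text": ["a", "b", "c", "d", "e", "f"],
})

# Shuffle as a stand-in for train_test_split; row labels travel with the rows.
train_demo = raw_demo[["Text"]].sample(frac=1, random_state=42)
train_demo = train_demo.assign(index=train_demo.index)

# A "cleaned" chunk: positions 2..4 of the shuffled frame, with a fresh 0..n-1 index
# as it would have after being written to CSV and read back.
clean_demo = train_demo.iloc[2:5].reset_index(drop=True)

# Backward trace: the stored labels select the matching rows in the raw data.
traced = raw_demo.loc[clean_demo["index"]]

assert traced["Text"].tolist() == clean_demo["Text"].tolist()
print(traced)
```

The same comparison on the real data would use `raw`, `X_clean["index"]`, and the corresponding text columns.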


---