## Twitter Sentiment Analysis 
---

**PROJECT STRUCTURE**

Following well-established practices, the plan is to...

- do minimal "cleanup" before splitting into train, test datasets
- split into train, test datasets; set the test dataset aside until final generalization test
- create a cleanup pipeline for the training data that can be re-applied to the test data
- create an ML pre-processing pipeline that can be re-applied as well
- using cross-validation, develop several models without hyperparameter tuning to establish some baselines
- choose a few models for further hyperparameter tuning; go wild
- go back if necessary, create new features, create a 2nd ML processing pipeline
- re-apply all cleanup and pre-processing steps to the test set
- calculate generalization error, once
- present results

Differentiating *cleanup* from *pre-processing* helps because the original cleanup is often simple enough and doesn't need to be tweaked, while pre-processing might involve several iterations and/or stages, as incorporated in the "go back if necessary" step during modeling.
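As a rough illustration of the plan above, the sketch below uses scikit-learn's `Pipeline` so that pre-processing fitted on the training data can be re-applied unchanged to the test data. The toy tweets and the choice of `TfidfVectorizer` with `LogisticRegression` are assumptions made for the example, not the models used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Toy data (hypothetical); 1 = positive, 0 = negative
texts = ["love this", "great day", "so happy", "really nice", "best ever", "feeling good",
         "hate this", "awful day", "so sad", "really bad", "worst ever", "feeling sick"]
labels = [1] * 6 + [0] * 6

# Split first; the test portion stays untouched until the final evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Pre-processing and model live in one pipeline, so the same fitted
# transformations are re-applied to the test set later
baseline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# Baseline via cross-validation, using the training data only
cv_scores = cross_val_score(baseline, X_tr, y_tr, cv=3)

# Final step, done once: fit on the full training set, score on the test set
baseline.fit(X_tr, y_tr)
test_accuracy = baseline.score(X_te, y_te)
```

Keeping the vectorizer inside the pipeline means each cross-validation fold fits its own vocabulary, which avoids leaking information from the validation fold into training.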

**THE DATA**

The data comes from Marios Michailidis' dataset hosted in [this Kaggle repository](https://www.kaggle.com/kazanova/sentiment140/).
It consists of 1.6 M rows evenly split into negative and positive Tweets. The labels were created automatically using emoticons; the specific steps of how this was accomplished have not been disclosed.
For more information on *sentiment140*, see their [API](http://help.sentiment140.com/api).

### Load data

In [1]:
import os 
import numpy as np
import pandas as pd 

# load raw dataset
dir_path = os.path.join("..","data","1_raw","sentiment140")
filename = "training.1600000.processed.noemoticon.csv"
filepath = os.path.join(dir_path, filename)

# NOTE: no header= argument is passed, so the first line of the file is read as the header row
raw = pd.read_csv(filepath, encoding='latin-1', usecols=[0,1,4,5])
raw.columns = ["Polarity","ID","Username","Text"]

### Preview data

In [2]:
raw.shape

(1599999, 4)
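The shape shows 1,599,999 rows, one short of the 1.6 M described earlier. One possible explanation, assuming the CSV file has no header row, is that `read_csv` consumed the first record as column names. The sketch below reproduces that behavior on a small in-memory file; the three records are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-record file with no header row
csv_text = (
    '0,1,"Mon","NO_QUERY","user_a","first tweet"\n'
    '0,2,"Mon","NO_QUERY","user_b","second tweet"\n'
    '4,3,"Tue","NO_QUERY","user_c","third tweet"\n'
)
cols = ["Polarity", "ID", "Username", "Text"]

# Default behavior: the first record becomes the header and is lost as data
default_read = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 4, 5])
default_read.columns = cols

# header=None: every record is kept as data
full_read = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[0, 1, 4, 5])
full_read.columns = cols
```

Here `default_read` holds two rows and `full_read` holds three. Switching the real load to `header=None` would shift every row index by one, so any saved indices would need to be regenerated.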

In [3]:
raw.head()

| | Polarity | ID | Username | Text |
|---|---|---|---|---|
| 0 | 0 | 1467810672 | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 0 | 1467810917 | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 0 | 1467811184 | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 0 | 1467811193 | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 0 | 1467811372 | joy_wolf | @Kwesidei not the whole crew |


In [4]:
X = raw.iloc[:, 1:4]
y = raw.iloc[:, 0]

In [5]:
X.head()

| | ID | Username | Text |
|---|---|---|---|
| 0 | 1467810672 | scotthamilton | is upset that he can't update his Facebook by ... |
| 1 | 1467810917 | mattycus | @Kenichan I dived many times for the ball. Man... |
| 2 | 1467811184 | ElleCTF | my whole body feels itchy and like its on fire |
| 3 | 1467811193 | Karoli | @nationwideclass no, it's not behaving at all.... |
| 4 | 1467811372 | joy_wolf | @Kwesidei not the whole crew |


#### Minimal cleanup!

In [6]:
# change 4s to 1s
y_df = pd.DataFrame({'target':y}) 
y_df.loc[y_df["target"] == 4, "target"] = 1
y_df.tail()

| | target |
|---|---|
| 1599994 | 1 |
| 1599995 | 1 |
| 1599996 | 1 |
| 1599997 | 1 |
| 1599998 | 1 |
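When remapping labels, it matters whether the assignment targets whole rows or a single column. The sketch below contrasts the two on a hypothetical frame with an extra `id` column: a row-mask assignment overwrites every column in the matching rows, while `.loc` with an explicit column name changes only the label.

```python
import pandas as pd

# Hypothetical frame with a label column and one extra column
df = pd.DataFrame({"target": [0, 4, 4, 0], "id": [10, 11, 12, 13]})

# Row-mask assignment: every column in the matching rows is set to 1
whole_rows = df.copy()
whole_rows[whole_rows["target"] == 4] = 1

# .loc with an explicit column: only the label changes
label_only = df.copy()
label_only.loc[label_only["target"] == 4, "target"] = 1
```

On a single-column frame both forms give the same result; the `.loc` form stays correct if more columns are added later.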


### Split into Train, Test sets

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_df, test_size=0.25, random_state=42)

### Set Test set aside

Save the training and test sets, and their indices as well (although arguably the procedure is reproducible by setting the same random seed above).
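The reproducibility claim can be checked on a small example. The sketch below, using made-up data, splits the same frames twice with the same `random_state` and compares the resulting indices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y_df
X_toy = pd.DataFrame({"Text": [f"tweet {i}" for i in range(20)]})
y_toy = pd.DataFrame({"target": [0, 1] * 10})

first = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# Same data and same seed: identical train indices, in identical order
same_split = np.array_equal(first[0].index, second[0].index)

# Within one split, features and labels share the same index
aligned = first[0].index.equals(first[2].index)
```

The seed only reproduces the split if the input rows arrive in the same order, so saving the indices is still a useful safeguard.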

In [8]:
X_train.head()

| | ID | Username | Text |
|---|---|---|---|
| 66270 | 1691417640 | goatkinghoang | working madly |
| 428045 | 2063856213 | Emma_Keenan | How can it be this cold! It's June!!! Sheesh |
| 1307927 | 2012574359 | shellywellyx1 | @jojoalexander ight i let they lil white boy k... |
| 1112400 | 1972402571 | FreagleDiva | Tweetlater Pro is the way to go for those who ... |
| 840793 | 1559873939 | bumgirl | &quot;I wanna wake up where you are&quot; I lo... |


In [9]:
y_train.head()

| | target |
|---|---|
| 66270 | 0 |
| 428045 | 0 |
| 1307927 | 1 |
| 1112400 | 1 |
| 840793 | 1 |


In [10]:
# save train and test indices
dirpath = os.path.join("..","data","2_clean","sentiment140")
trainpath = os.path.join(dirpath, "train_ix.npy")
testpath = os.path.join(dirpath, "test_ix.npy")

with open(trainpath, 'wb') as f:
    np.save(f, np.array(y_train.index))
    
with open(testpath, 'wb') as f:
    np.save(f, np.array(y_test.index))

In [11]:
# Save train and test sets; index=False drops the row index, which is why the split indices are saved separately above
X_train.to_csv(os.path.join(dir_path, "X_train.csv"), index=False) 
X_test.to_csv(os.path.join(dir_path, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(dir_path, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(dir_path, "y_test.csv"), index=False)
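Because the CSV files are written without their index, the saved `.npy` arrays are the only link back to the raw rows. The sketch below shows the round trip on a hypothetical three-row frame written to a temporary directory: the reloaded CSV gets a fresh 0-based index, and the saved array restores the original one.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical slice of a training set, indexed by raw row numbers
X_part = pd.DataFrame({"Text": ["a", "b", "c"]}, index=[66270, 428045, 1307927])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "X_part.csv")
    ix_path = os.path.join(tmp, "part_ix.npy")

    X_part.to_csv(csv_path, index=False)       # index is not written
    np.save(ix_path, np.array(X_part.index))   # index is stored separately

    reloaded = pd.read_csv(csv_path)
    saved_ix = np.load(ix_path)

default_index = list(reloaded.index)  # fresh positional index after reload
reloaded.index = saved_ix             # original raw row numbers restored
```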

### Cleanup Process

All details of cleanup steps can be found in the custom script: `cleanup_module.py`

Since my cleanup function entails heavy CPU-bound processes, I use multiprocessing, splitting the data into 50k-row chunks that are processed 8 at a time (my laptop has 8 logical processors). Chunks finish asynchronously, so the order in which they are saved is not sequential.
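The contents of `cleanup_module.py` are not shown here, so the following is only a minimal sketch of the chunked pattern described above, built on the standard library's `concurrent.futures`. The names `chunk_ranges`, `clean_chunk`, and `process_in_chunks` are hypothetical, the worker is a placeholder that just counts rows, and the ranges are half-open and non-overlapping, which is an assumption of this sketch.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def chunk_ranges(n_rows, chunk_size=50_000):
    """Return (chunk_number, range) pairs covering n_rows without overlap."""
    return [
        (k + 1, range(start, min(start + chunk_size, n_rows)))
        for k, start in enumerate(range(0, n_rows, chunk_size))
    ]


def clean_chunk(chunk_number, rows):
    """Placeholder for the CPU-bound cleanup; here it only counts rows."""
    return chunk_number, len(rows)


def process_in_chunks(n_rows, worker=clean_chunk, max_workers=8,
                      executor_cls=ProcessPoolExecutor):
    """Submit every chunk, then collect results in completion order."""
    finished = []
    with executor_cls(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, k, rows)
                   for k, rows in chunk_ranges(n_rows)]
        # as_completed yields futures as they finish, not as they were submitted
        for future in as_completed(futures):
            finished.append(future.result())
    return finished
```

A 75% training split of 1.6 M rows is 1.2 M rows, which `chunk_ranges` divides into 24 chunks. In a script, `process_in_chunks(1_200_000)` would normally be called under an `if __name__ == "__main__":` guard, since process pools may re-import the main module.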

In [12]:
cmd = 'python cleanup_module.py X_train.csv'
!{cmd}

Saving cleaned up train dataset: 1
Saving cleaned up train dataset: 3
Saving cleaned up train dataset: 2
Saving cleaned up train dataset: 6
Saving cleaned up train dataset: 8
Saving cleaned up train dataset: 7
Saving cleaned up train dataset: 5
Saving cleaned up train dataset: 4
Saving cleaned up train dataset: 9
Saving cleaned up train dataset: 10
Saving cleaned up train dataset: 12
Saving cleaned up train dataset: 11
Saving cleaned up train dataset: 16
Saving cleaned up train dataset: 13
Saving cleaned up train dataset: 14
Saving cleaned up train dataset: 15
Saving cleaned up train dataset: 17
Saving cleaned up train dataset: 19
Saving cleaned up train dataset: 18
Saving cleaned up train dataset: 21
Saving cleaned up train dataset: 20
Saving cleaned up train dataset: 22
Saving cleaned up train dataset: 23
Saving cleaned up train dataset: 24
Elapsed time: 2 minute(s) and 35 second(s).


### X_clean $\Rightarrow$ train_ix $\Rightarrow$ raw data

After loading a cleaned subset we trace the path backward to the raw data via the train indices, using the ranges passed to the multiprocessing executor in `cleanup_module.py`:

- For example, this set of parameters shows that cleaned data subset 6 covers the range from 250000 to 300000:

```
(X_name, range( 250000,  300001),  6)
```
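The mapping between a position in the training set and a cleaned subset is simple arithmetic. The sketch below assumes that subset `k` starts at position `(k - 1) * 50000`, which is consistent with subset 6 starting at 250000; the helper names are hypothetical.

```python
CHUNK_SIZE = 50_000  # rows per cleaned subset, as described above


def locate(position, chunk_size=CHUNK_SIZE):
    """Map a 0-based training-set position to (subset number, row offset)."""
    subset = position // chunk_size + 1  # subsets are numbered from 1
    offset = position % chunk_size       # row within that subset's file
    return subset, offset


def subset_start(subset, chunk_size=CHUNK_SIZE):
    """Return the first training-set position covered by a subset."""
    return (subset - 1) * chunk_size
```

For instance, `locate(250004)` gives subset 6, offset 4. The range quoted above ends at 300001, so the boundary position 300000 may also appear in subset 6; this sketch assigns it to subset 7 only.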

In [13]:
# loading example train subset 6
dirpath = os.path.join("..","data","2_clean","sentiment140")
filename = "train_6.csv"
filepath = os.path.join(dirpath, filename)

X_clean =  pd.read_csv(filepath)

In [14]:
X_clean.head()

| | username | text | lemmatized |
|---|---|---|---|
| 0 | willysandi | i am so sleepy but there's still lots of assig... | i am so sleepy but there still lot assignment ... |
| 1 | brandyf82 | thinks its great how one person can make you f... | think great how one person can make you feel w... |
| 2 | warfoot | Got the BrainBone daily question right! - htt... | got brainbone daily question right |
| 3 | sydeshow | @tinayayo Girl, I get out a lot but you put me... | tinayayo girl i get out lot but you put me my ... |
| 4 | __JANEDOE | sitting at my desk... doing nothing. i look ar... | sitting my desk doing nothing i look around al... |


In [15]:
# loading previously saved train indices
with open(os.path.join(dirpath, 'train_ix.npy'), 'rb') as f:
    train_ix = np.load(f)

In [16]:
# unit test: recovered train ix match X_train ix?
assert (train_ix[250000:250005] == X_train.index[250000:250005]).all()
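An index check of this kind is meaningful only when the comparison is element-wise. Reducing each side with `.all()` first compares two booleans, which passes for any two arrays of non-zero values. The sketch below shows the difference on made-up arrays.

```python
import numpy as np

saved = np.array([127175, 1521865, 1028707])
other = np.array([5, 6, 7])  # clearly different indices

# Each .all() is True (all values are non-zero), so this compares True == True
vacuous = saved.all() == other.all()

# Element-wise comparison first, then .all(): the mismatch is detected
strict_mismatch = (saved == other).all()

# np.array_equal also checks that the shapes agree
strict_match = np.array_equal(saved, saved.copy())
```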

In [17]:
# pass indices to raw data
raw.iloc[list(train_ix[250000:250005]), ]

| | Polarity | ID | Username | Text |
|---|---|---|---|---|
| 127175 | 0 | 1834669916 | willysandi | i am so sleepy but there's still lots of assig... |
| 1521865 | 4 | 2176374015 | brandyf82 | thinks its great how one person can make you f... |
| 1028707 | 4 | 1932667098 | warfoot | Got the BrainBone daily question right! - htt... |
| 327835 | 0 | 2010059197 | sydeshow | @tinayayo Girl, I get out a lot but you put me... |
| 865153 | 4 | 1677410625 | __JANEDOE | sitting at my desk... doing nothing. i look ar... |


In [18]:
# confirm y_train has correct target vals
y_train[250000:250005]

| | target |
|---|---|
| 127175 | 0 |
| 1521865 | 1 |
| 1028707 | 1 |
| 327835 | 0 |
| 865153 | 1 |
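The visual comparison above can also be expressed as a programmatic check. The sketch below uses small made-up frames in place of `raw` and `y_train`: it looks up the raw polarity by index, maps 4 to 1, and compares the result with the stored targets.

```python
import pandas as pd

# Hypothetical stand-ins for raw and y_train
raw_toy = pd.DataFrame({"Polarity": [0, 4, 4, 0, 4]})
y_toy = pd.DataFrame({"target": [0, 1, 1]}, index=[3, 1, 4])

# Look up raw labels by the training index and apply the same 4 -> 1 mapping
expected = raw_toy.loc[y_toy.index, "Polarity"].replace({4: 1})

# Compare values position by position
labels_match = bool((expected.to_numpy() == y_toy["target"].to_numpy()).all())
```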


---