## Twitter Sentiment Analysis 
---

**PROJECT STRUCTURE**

Following well-established practices, the plan is to...

- do minimal data exploration and cleanup before splitting into train, test datasets
- split into train, test datasets; set the test dataset aside and do not peek further to prevent data leakage
- create a cleanup pipeline for the training data that can be re-applied to the test data
- create an ML pre-processing pipeline that can be re-applied as well
- using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
- short-list promising models for further hyperparameter tuning; go wild
- go back if necessary, create new features (feature engineering), create a 2nd ML processing pipeline
- re-apply all cleanup and pre-processing steps to the test set
- evaluate the most promising model using the test set and calculate the generalization error, once
- create a presentation including how this model would be deployed, evaluated, and maintained in production

Differentiating *cleanup* from *pre-processing* and *further processing* helps because the original cleanup is often simple enough and doesn't need to be tweaked, while later processing might involve several iterations and stages.
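
As a rough illustration of that separation, the sketch below applies a one-off cleanup function first and then cross-validates an untuned baseline inside a scikit-learn `Pipeline`. The toy tweets, the `basic_cleanup` helper, and the choice of `TfidfVectorizer` with `LogisticRegression` are assumptions made for this example, not the models or steps used later in the project.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


def basic_cleanup(texts):
    """Hypothetical cleanup step: lowercase and mask @mentions."""
    return [re.sub(r"@\w+", "USERNAME", t.lower()) for t in texts]


# Toy stand-in data invented for illustration: 1 = positive, 0 = negative
texts = [
    "I love this sunny day", "@anna what a great show", "so happy right now",
    "best coffee ever", "feeling great today", "love my new phone",
    "I hate mondays", "@bob this is awful", "so sad right now",
    "worst traffic ever", "feeling sick today", "hate my old phone",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Cleanup happens once, outside the pipeline that gets iterated on
clean_texts = basic_cleanup(texts)

# Pre-processing and model live together so they can be re-applied as a unit
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# 3-fold cross-validation with default hyperparameters as a baseline
scores = cross_val_score(pipe, clean_texts, labels, cv=3)
print(scores.mean())
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so no vocabulary statistics leak from the validation fold.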

**THE DATA**

The *sentiment140* dataset comes from Marios Michailidis's [Kaggle repository](https://www.kaggle.com/kazanova/sentiment140/).
It consists of 1.6 M rows evenly split into negative and positive Tweets. The labels were created automatically using emoticons; the specific steps of how this was accomplished have not been disclosed.
For more information on *sentiment140*, see their [API](http://help.sentiment140.com/api).

### Load data

In [1]:
import os 
import time
import numpy as np
import pandas as pd 

# time notebook
start_notebook = time.time()

# load raw dataset
raw_path = os.path.join("..","data","1_raw","sentiment140")
filename = "training.1600000.processed.noemoticon.csv"
filepath = os.path.join(raw_path, filename)

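# NOTE: the CSV has no header row, so read_csv consumes the first tweet as the header;
# this is why the shape below is 1,599,999 rows. Passing header=None would keep all rows.
raw = pd.read_csv(filepath, encoding='latin-1', usecols=[0,1,4,5])
raw.columns = ["Polarity","ID","Username","Text"]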

### Preview data

In [2]:
raw.shape

(1599999, 4)

In [3]:
raw.head()

Unnamed: 0,Polarity,ID,Username,Text
0,0,1467810672,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,joy_wolf,@Kwesidei not the whole crew


In [4]:
X = raw.iloc[:, 1:4]
y = raw.iloc[:, 0]

In [5]:
X.head()

Unnamed: 0,ID,Username,Text
0,1467810672,scotthamilton,is upset that he can't update his Facebook by ...
1,1467810917,mattycus,@Kenichan I dived many times for the ball. Man...
2,1467811184,ElleCTF,my whole body feels itchy and like its on fire
3,1467811193,Karoli,"@nationwideclass no, it's not behaving at all...."
4,1467811372,joy_wolf,@Kwesidei not the whole crew


#### Minimal cleanup!

In [6]:
# change 4s to 1s (positive label) so the target is binary 0/1
y_df = pd.DataFrame({'target':y}) 
y_df.loc[y_df["target"] == 4, "target"] = 1
y_df.tail()

Unnamed: 0,target
1599994,1
1599995,1
1599996,1
1599997,1
1599998,1
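
An alternative way to express the same relabeling is an explicit mapping, sketched below on a toy column. It assumes the polarity column only contains 0 and 4, as the dataset description suggests; any other value would turn into `NaN` and be easy to spot.

```python
import pandas as pd

# Toy polarity column using the dataset's 0/4 coding
polarity = pd.Series([0, 4, 4, 0, 4], name="Polarity")

# Map negative (0) to 0 and positive (4) to 1; unmapped values become NaN
target = polarity.map({0: 0, 4: 1}).rename("target")

# Guard against unexpected labels before moving on
assert target.notna().all()
print(target.tolist())
```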


### Split into Train, Test sets

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_df, test_size=0.25, random_state=42)

### Set Test set aside

Save the training and test sets, and their indices as well (although arguably the procedure is reproducible by setting the same random seed as above).
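
The reproducibility claim can be checked on a small scale. The sketch below uses a toy 20-row frame (an assumption for illustration) and shows that two calls to `train_test_split` with the same `random_state` return identical index sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y
df = pd.DataFrame({"Text": [f"tweet {i}" for i in range(20)]})
labels = pd.Series(np.arange(20) % 2, name="target")


def split_indices(seed):
    """Return the train and test row indices produced for a given seed."""
    X_tr, X_te, _, _ = train_test_split(df, labels, test_size=0.25, random_state=seed)
    return X_tr.index.to_numpy(), X_te.index.to_numpy()


train_a, test_a = split_indices(42)
train_b, test_b = split_indices(42)

# Same seed, same data: the split is identical
same = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
# Train and test never share a row
disjoint = len(np.intersect1d(train_a, test_a)) == 0
print(same, disjoint)
```

Saving the indices is still a cheap safeguard, since the split only reproduces if the input rows, their order, and the library behavior all stay the same.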

In [8]:
X_train.head()

Unnamed: 0,ID,Username,Text
66270,1691417640,goatkinghoang,working madly
428045,2063856213,Emma_Keenan,How can it be this cold! It's June!!! Sheesh
1307927,2012574359,shellywellyx1,@jojoalexander ight i let they lil white boy k...
1112400,1972402571,FreagleDiva,Tweetlater Pro is the way to go for those who ...
840793,1559873939,bumgirl,&quot;I wanna wake up where you are&quot; I lo...


In [9]:
y_train.head()

Unnamed: 0,target
66270,0
428045,0
1307927,1
1112400,1
840793,1


In [10]:
# save train and test indices
trainpath = os.path.join(raw_path, "train_ix.npy")
testpath = os.path.join(raw_path, "test_ix.npy")

with open(trainpath, 'wb') as f:
    np.save(f, np.array(y_train.index))
    
with open(testpath, 'wb') as f:
    np.save(f, np.array(y_test.index))

In [11]:
# Save train and test sets (index=False drops the original row indices, hence the .npy files saved above)
X_train.to_csv(os.path.join(raw_path, "X_train.csv"), index=False) 
X_test.to_csv(os.path.join(raw_path, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(raw_path, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(raw_path, "y_test.csv"), index=False)
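
To see how the saved indices and the CSV files fit together, here is a minimal round-trip sketch on a toy frame. The file names and the temporary directory are assumptions for the example; only the pattern (CSV without index plus a separate `.npy` index file) mirrors the cells above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame whose index mimics shuffled row labels from the split
X_demo = pd.DataFrame({"Text": ["a", "b", "c"]}, index=[66270, 428045, 1307927])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "X_demo.csv")
    ix_path = os.path.join(tmp, "demo_ix.npy")

    # Write the data without its index and store the index separately
    X_demo.to_csv(csv_path, index=False)
    np.save(ix_path, X_demo.index.to_numpy())

    # Reloading the CSV yields a fresh 0..n-1 index
    restored = pd.read_csv(csv_path)
    default_index = restored.index.tolist()

    # Re-attach the original row labels from the .npy file
    restored.index = np.load(ix_path)

print(default_index, restored.index.tolist())
```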

In [12]:
# print time so far

mins, secs = divmod(time.time() - start_notebook, 60)
print(f'Time so far: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Time so far: 0 minute(s) and 31 second(s)


### Cleanup Process

All details of cleanup steps can be found in the custom script: `cleanup_module.py`

Because the cleanup function is CPU-bound, I use multiprocessing: the data is split into 50k-row chunks, which are processed 8 at a time (my laptop has 8 logical processors). Chunks complete in no fixed order, as the output below shows.
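
The chunking scheme can be sketched as follows. This is not the code in `cleanup_module.py`: `chunk_ranges`, `clean_chunk`, and `run_cleanup` are hypothetical names, the cleanup itself is replaced by a placeholder, the row count is an approximation of the 75% training split, and the sketch uses exclusive stop values.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def chunk_ranges(n_rows, chunk_size=50_000):
    """Return (chunk_number, start, stop) triples covering n_rows; stop is exclusive."""
    return [
        (number, start, min(start + chunk_size, n_rows))
        for number, start in enumerate(range(0, n_rows, chunk_size), start=1)
    ]


def clean_chunk(number, start, stop):
    """Placeholder for the CPU-bound cleanup; it only reports the chunk size."""
    return number, stop - start


def run_cleanup(n_rows, max_workers=8):
    """Process all chunks in a pool and collect results as they finish."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(clean_chunk, *chunk) for chunk in chunk_ranges(n_rows)]
        # as_completed yields futures in completion order, not submission order
        for future in as_completed(futures):
            number, size = future.result()
            results[number] = size
    return results


if __name__ == "__main__":
    # Assumed training-set size: roughly 75% of 1,599,999 rows
    sizes = run_cleanup(1_199_999)
    print(len(sizes), "chunks processed")
```

With about 1.2 M training rows and 50k rows per chunk, this yields 24 chunks, which matches the 24 files saved by the cleanup run.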

In [13]:
cmd = 'python cleanup_module.py X_train'
!{cmd}

Saving X_train.2
Saving X_train.8
Saving X_train.3
Saving X_train.4
Saving X_train.5
Saving X_train.1
Saving X_train.6
Saving X_train.7
Saving X_train.10
Saving X_train.9
Saving X_train.15
Saving X_train.11
Saving X_train.16
Saving X_train.12
Saving X_train.13
Saving X_train.14
Saving X_train.18
Saving X_train.17
Saving X_train.20
Saving X_train.22
Saving X_train.19
Saving X_train.21
Saving X_train.23
Saving X_train.24
Cleanup time: 2 minute(s) and 33 second(s).


### X_clean $\Rightarrow$ train_ix $\Rightarrow$ raw data

After loading a cleaned subset we trace the path backward to the raw data via the train indices, using the ranges passed to the multiprocessing executor in `cleanup_module.py`:

- For example, this set of parameters shows that cleaned subset 6 covers training-set positions 250000 to 300000:

```
(X_name, range( 250000,  300001),  6)
```
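
The backward trace can be written as a small helper. The sketch below assumes that chunks are numbered from 1, that each holds 50,000 rows, and that cleanup keeps rows in order without dropping any; `raw_label_for` and the toy index array are hypothetical.

```python
import numpy as np

CHUNK_SIZE = 50_000  # chunk size stated in the cleanup section


def raw_label_for(train_ix, chunk_number, row_in_chunk, chunk_size=CHUNK_SIZE):
    """Map (chunk number, row within the cleaned chunk) to the raw-data row label."""
    # Position of the row within the full training set
    position = (chunk_number - 1) * chunk_size + row_in_chunk
    return int(train_ix[position])


# Toy stand-in for train_ix: a shuffled permutation of 300,000 row labels
rng = np.random.default_rng(0)
toy_train_ix = rng.permutation(300_000)

# First row of cleaned subset 6 sits at training position 250000
label = raw_label_for(toy_train_ix, chunk_number=6, row_in_chunk=0)
print(label == toy_train_ix[250_000])
```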

In [14]:
# loading example train subset 6
clean_path = os.path.join("..","data","2_clean","sentiment140")
filename = "X_train.6.csv"
filepath = os.path.join(clean_path, filename)

X_clean =  pd.read_csv(filepath)

In [15]:
X_clean.head(10)

Unnamed: 0,username,text,lemmatized
0,willysandi,i am so sleepy but there's still lots of assig...,i am so sleepy but there still lot assignment ...
1,brandyf82,thinks its great how one person can make you f...,think great how one person can make you feel w...
2,warfoot,Got the BrainBone daily question right! - htt...,got brainbone daily question right URL
3,sydeshow,"@tinayayo Girl, I get out a lot but you put me...",USERNAME girl i get out lot but you put me my ...
4,__JANEDOE,sitting at my desk... doing nothing. i look ar...,sitting my desk doing nothing i look around al...
5,lordtrilink,@guaranteedjuicy Wishful thinking?,USERNAME wishful thinking
6,eirwen29,@acoffinyoursize I was going to buy it today! ...,USERNAME i going buy today then my bank accoun...
7,Kerry_0,I want a Crabby Patty,i want crabby patty
8,Megann57,@coakay123 I'm gonna direct message you.,USERNAME im gon na direct message you
9,Miss_Grace,Got to wait till 10 Its taking forever!,got wait till 10 taking forever


In [16]:
# loading previously saved train indices
with open(os.path.join(raw_path, 'train_ix.npy'), 'rb') as f:
    train_ix = np.load(f)

In [17]:
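# unit test: do the recovered train indices match the X_train indices?
# (compare element-wise; calling .all() on each side would only compare two booleans)
assert np.array_equal(train_ix[250000:250010], X_train.index[250000:250010])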

In [18]:
# pass indices to raw data
raw.iloc[list(train_ix[250000:250010]), ]

Unnamed: 0,Polarity,ID,Username,Text
127175,0,1834669916,willysandi,i am so sleepy but there's still lots of assig...
1521865,4,2176374015,brandyf82,thinks its great how one person can make you f...
1028707,4,1932667098,warfoot,Got the BrainBone daily question right! - htt...
327835,0,2010059197,sydeshow,"@tinayayo Girl, I get out a lot but you put me..."
865153,4,1677410625,__JANEDOE,sitting at my desk... doing nothing. i look ar...
1141972,4,1977246621,lordtrilink,@guaranteedjuicy Wishful thinking?
70329,0,1693625979,eirwen29,@acoffinyoursize I was going to buy it today! ...
49803,0,1678267176,Kerry_0,I want a Crabby Patty
1533460,4,2178502727,Megann57,@coakay123 I'm gonna direct message you.
444364,0,2067816745,Miss_Grace,Got to wait till 10 Its taking forever!


In [19]:
# confirm y_train has correct target vals
y_train[250000:250010]

Unnamed: 0,target
127175,0
1521865,1
1028707,1
327835,0
865153,1
1141972,1
70329,0
49803,0
1533460,1
444364,0


---

In [20]:
# print total time
mins, secs = divmod(time.time() - start_notebook, 60)
print(f'Total running time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Total running time: 3 minute(s) and 7 second(s)
