# Twitter Sentiment Analysis - POC


## Project Structure

This is a proof of concept (POC) for the main project. I'm first running the entire project structure with a very small dataset, avoiding all the complexity of a large dataset, multiprocessing, and so forth, to focus on the project itself.

The project structure is:

1. set expectations and define the problem or question the project is addressing, or the product being developed
2. preliminary exploratory data analysis (EDA), just enough to figure out whether some pre-dataset-splitting cleanup needs to be performed
3. split the dataset into training and test subsets; set the test subset aside and do not peek further, to avoid data leakage
4. create a cleanup pipeline for the training data that can be re-applied to the test data
5. create an ML pre-processing pipeline that can be re-applied as well
6. using cross-validation, evaluate a variety of models without hyperparameter tuning to establish some baselines
7. short-list promising models for further hyperparameter tuning
8. iterate on any phase of the project as needed (refine the question/problem, recreate the cleanup/pre-processing)
9. consider feature selection and feature engineering
10. decide on a final cleanup and processing pipeline
11. settle on a final model
12. re-apply all cleanup and processing steps to the test set
13. evaluate the final model using the test set once and estimate the uncertainty around this generalization
14. create a presentation including how this model would be deployed, evaluated, and maintained in production
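
As a rough illustration of steps 3 to 6 and 12 to 13, here is a minimal sketch on a made-up toy corpus. The `TfidfVectorizer` and `LogisticRegression` choices are placeholder assumptions for the sketch, not the models this project settles on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy corpus standing in for the real Tweets (1 = positive, 0 = negative)
texts = [
    "love this", "great day", "so happy", "best ever",
    "really good", "feeling great", "what a win", "awesome news",
    "hate this", "bad day", "so sad", "worst ever",
    "really awful", "feeling sick", "what a loss", "terrible news",
]
labels = [1] * 8 + [0] * 8

# Step 3: split once and set the test subset aside
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Steps 4-5: bundle pre-processing and model so the same steps
# can be re-applied to the test set later
baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # placeholder pre-processing (assumption)
    ("clf", LogisticRegression()),  # placeholder baseline model (assumption)
])

# Step 6: cross-validated baseline, using the training data only
cv_scores = cross_val_score(baseline, X_train, y_train, cv=3, scoring="accuracy")

# Steps 12-13: fit on all training data, then score the untouched test set once
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)

print(cv_scores, test_accuracy)
```

Keeping every transformation inside the `Pipeline` means that cross-validation re-fits the pre-processing on each training fold, so nothing leaks from the held-out fold or from the test set.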

---

## 1. Project Definition

The project's goal is simply to develop a machine learning model that classifies a Tweet as positive or negative, and possibly predicts how strongly so, as a way to automatically generate reports that serve as input to an app.

The app is beyond the scope of the project - it could be any app that takes short text input from users and needs, as part of its process, to evaluate the "sentiment" of the text. One example would be Twitter users who download their Tweets for the last year and want to know how positive or negative their posts were, as a timeline for example, or in aggregate compared with a cohort of friends. Another example would be an app that tracks positive/negative sentiment around a topic by combining hashtags and the like with this sentiment evaluation.

The scope of this project is to provide a model that predicts the sentiment of a Tweet (negative or positive) as accurately as possible, and does so within a reasonable amount of time, though not necessarily in real time. The evaluation criteria for what constitutes "success" and "most accurate prediction" are TBD. I might use ROC curves and AUC, among other metrics. Accuracy alone isn't enough without considering recall.
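
To make the candidate metrics concrete, here is a small sketch on made-up labels and scores. The 0.5 decision threshold is an assumption for illustration only.

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Made-up ground truth (1 = positive) and model scores for four Tweets
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Turn scores into hard predictions with an assumed 0.5 threshold
y_pred = [int(score >= 0.5) for score in y_score]

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of positives that were found
auc = roc_auc_score(y_true, y_score)       # threshold-independent ranking quality

print(accuracy, recall, auc)  # 0.75 0.5 0.75
```

Here the accuracy looks decent at 0.75 even though half of the positive Tweets are missed, which is why accuracy is worth reading alongside recall.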

---

## 2. Preliminary EDA

For the POC, I'm using a very small dataset - but first we examine the entire dataset.

In [1]:
import os
import time
import numpy as np
import pandas as pd 

# time notebook
start_notebook = time.time()

# load raw dataset
raw_path = os.path.join("..","data","1_raw","sentiment140")
filename = "training.1600000.processed.noemoticon.csv"
filepath = os.path.join(raw_path, filename)

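# Note: header=None is not passed, so pandas uses the first line of the file as the header.
# If the file has no header row, that first Tweet is not loaded (hence 1,599,999 rows below).
raw = pd.read_csv(filepath, encoding='latin-1', usecols=[0,1,4,5])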
raw.columns = ["target","ID","username","tweet"]

There are four columns:
- target: the target of prediction; 0 means negative, 4 means positive
- ID: the Tweet ID for each Tweet
- username: the username of the Tweeter
- tweet: the text of the Tweet
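
One thing worth checking when loading a headerless CSV is how `pd.read_csv` treats the first line. The sketch below uses a tiny in-memory file with made-up rows (not the real dataset) to show the difference between the default and `header=None`.

```python
import io

import pandas as pd

# Two made-up rows in a six-column layout, with no header line
csv_text = (
    '"0","1","Mon Apr 06","NO_QUERY","alice","first tweet"\n'
    '"4","2","Mon Apr 06","NO_QUERY","bob","second tweet"\n'
)
columns = ["target", "ID", "username", "tweet"]

# Default behaviour: the first line is consumed as the header
default_read = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 4, 5])
default_read.columns = columns

# header=None: every line is treated as data
explicit_read = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[0, 1, 4, 5])
explicit_read.columns = columns

print(len(default_read), len(explicit_read))  # 1 2
```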

### Preview data

In [2]:
raw.head()

Unnamed: 0,target,ID,username,tweet
0,0,1467810672,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,joy_wolf,@Kwesidei not the whole crew


In [3]:
raw.tail()

Unnamed: 0,target,ID,username,tweet
1599994,4,2193601966,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599995,4,2193601969,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599996,4,2193601991,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599997,4,2193602064,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599998,4,2193602129,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


### Target distribution

There are 1,599,999 rows and the target is evenly distributed:

In [4]:
raw[["target","ID"]].groupby("target").count()

target,ID
0,799999
4,800000
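
An alternative way to look at class balance is `value_counts`, which can also report proportions. This is a sketch on a made-up toy frame, not the real data.

```python
import pandas as pd

# Made-up toy targets using the same 0/4 coding
toy = pd.DataFrame({"target": [0, 0, 0, 4, 4, 4, 4, 0]})

# Absolute counts per class, ordered by class label
counts = toy["target"].value_counts().sort_index()

# Same information as proportions of the whole
shares = toy["target"].value_counts(normalize=True).sort_index()

print(counts.to_dict())  # {0: 4, 4: 4}
print(shares.to_dict())  # {0: 0.5, 4: 0.5}
```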


### Missing values

There are no NA values:

In [5]:
raw.isnull().sum()

target      0
ID          0
username    0
tweet       0
dtype: int64

### Deduplication

There are 3,370 suspect rows, in which every Tweet ID appears twice. Half of the rows are the duplicates and the other half are their originals, and each pair looks suspicious because the same Tweet shows up with both a positive and a negative target:

In [6]:
all_dupes = raw[raw.duplicated(subset=['ID'], keep=False)].sort_values(by=['ID'])
all_dupes.head(6)

Unnamed: 0,target,ID,username,tweet
212,0,1467863684,DjGundam,Awwh babs... you look so sad underneith that s...
800260,4,1467863684,DjGundam,Awwh babs... you look so sad underneith that s...
274,0,1467880442,iCalvin,Haven't tweeted nearly all day Posted my webs...
800299,4,1467880442,iCalvin,Haven't tweeted nearly all day Posted my webs...
988,0,1468053611,mariejamora,@hellobebe I also send some updates in plurk b...
801279,4,1468053611,mariejamora,@hellobebe I also send some updates in plurk b...


In [7]:
len(all_dupes)

3370
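
The `keep` argument decides which members of a duplicate group get flagged. Here is a minimal sketch on made-up rows, showing why `keep=False` counts both halves of each pair.

```python
import pandas as pd

# Made-up rows: ID 101 appears twice with conflicting targets
toy = pd.DataFrame({
    "target": [0, 4, 0, 4],
    "ID": [101, 101, 102, 103],
    "tweet": ["same text", "same text", "unique a", "unique b"],
})

# Default keep="first": only the later copy is flagged
n_later = int(toy.duplicated(subset=["ID"]).sum())

# keep=False: every row that shares its ID with another row is flagged
n_all = int(toy.duplicated(subset=["ID"], keep=False).sum())

# drop_duplicates with keep=False removes both halves of each pair
deduped = toy.drop_duplicates(subset="ID", keep=False)

print(n_later, n_all, len(deduped))  # 1 2 2
```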

Verifying that every duplicated ID appears exactly twice, i.e. that they all come in pairs like the positive/negative "Twin Tweets" above:

In [8]:
all_dupes_grouped = all_dupes[["ID","target"]].groupby("ID").count()
all_dupes_grouped.head(3)

ID,target
1467863684,2
1467880442,2
1468053611,2


In [9]:
all_dupes_grouped[all_dupes_grouped["target"] != 2]

ID,target
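
Counting rows per ID confirms the pairs, but not that the two labels in each pair disagree. A possible extra check, sketched here on made-up data, is to count the distinct targets per ID with `nunique`.

```python
import pandas as pd

# Made-up duplicate rows: IDs 101 and 102 have conflicting labels, 103 does not
toy_dupes = pd.DataFrame({
    "ID": [101, 101, 102, 102, 103, 103],
    "target": [0, 4, 0, 4, 4, 4],
})

# Per ID: how many rows there are, and how many distinct labels they carry
per_id = toy_dupes.groupby("ID")["target"].agg(["count", "nunique"])

conflicting = per_id[per_id["nunique"] == 2]  # same Tweet, both labels
same_label = per_id[per_id["nunique"] == 1]   # plain repeats with one label

print(conflicting.index.tolist(), same_label.index.tolist())  # [101, 102] [103]
```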


Since these Tweets are not useful, I'm dropping them from the entire dataset. Deduplication is a complex procedure that I will not be reproducing in a "cleanup pipeline", so dropping them now, before the split, ensures my test set won't have duplicates.

In [10]:
raw_deduped = raw.drop_duplicates(subset="ID", keep=False)

In [11]:
# double check
assert len(raw) - len(raw_deduped) == len(all_dupes)

We expect the distribution of the target to remain balanced, as before:

In [12]:
raw_deduped[["target","ID"]].groupby("target").count()

target,ID
0,798314
4,798315


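We got rid of the ID dupes, and we know there are username dupes (we don't care about those), but what about retweets? There might be tons of duplicated Tweet text, so we need to check for that. Note that the check below is keyed on `ID`, which was just deduplicated, so its empty result is expected and does not by itself show that the dataset creators avoided retweets; checking the text itself would need `subset=['tweet']`.

In [14]:
# duplicates are looked up by ID here, not by tweet text
text_dupes = raw_deduped[raw_deduped.duplicated(subset=['ID'], keep=False)].sort_values(by=['tweet'])
text_dupes.head(6)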

Unnamed: 0,target,ID,username,tweet
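
To look for repeated text rather than repeated IDs, the duplicate check can be keyed on the `tweet` column. This is a sketch on made-up rows; treating a leading `RT ` as a retweet marker is an assumption about how retweets appear in the text.

```python
import pandas as pd

# Made-up rows: two different IDs carry identical text, one looks like a retweet
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "username": ["a", "b", "c", "d"],
    "tweet": ["good morning", "good morning", "RT @a: hello", "something else"],
})

# Rows whose text is shared with at least one other row
text_dupes = toy[toy.duplicated(subset=["tweet"], keep=False)].sort_values(by=["tweet"])

# Rows that start with the conventional retweet prefix (assumed convention)
looks_like_retweet = toy["tweet"].str.startswith("RT ")

print(len(text_dupes), int(looks_like_retweet.sum()))  # 2 1
```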


In [16]:
# reindex to a fresh 0..n-1 range after dropping rows
raw_deduped = raw_deduped.reset_index(drop=True)

In [17]:
raw_deduped.tail()

Unnamed: 0,target,ID,username,tweet
1596624,4,2193601966,AmandaMarie1028,Just woke up. Having no school is the best fee...
1596625,4,2193601969,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1596626,4,2193601991,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1596627,4,2193602064,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1596628,4,2193602129,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


### Split X, y

Here I just split my data into a feature matrix X and its target vector y:

In [18]:
X = raw_deduped.iloc[:, 1:4]
y = raw_deduped.iloc[:, 0]

### Recode target

A final, minimal cleanup step is to recode the target values to 0s and 1s. In sentiment analysis, the polarity of a sentiment is typically a continuum from -1 to +1, but since we don't have a grading here, the 0/4 coding makes no sense. I'll frame this binary target as a plain `IsPositive` label:

In [19]:
y_df = pd.DataFrame({'target': y})
# recode the positive label 4 -> 1 (negative stays 0)
y_df.loc[y_df["target"] == 4, "target"] = 1
y_df.tail()

Unnamed: 0,target
1596624,1
1596625,1
1596626,1
1596627,1
1596628,1
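
Another way to express the same recoding is an explicit mapping, which spells out both classes. This is a sketch on a made-up Series, not the cell used above.

```python
import pandas as pd

# Made-up targets in the original 0/4 coding
y_toy = pd.Series([0, 4, 4, 0, 4], name="target")

# Explicit mapping: 0 stays negative, 4 becomes 1 (positive)
y_binary = y_toy.map({0: 0, 4: 1})

print(y_binary.tolist())  # [0, 1, 1, 0, 1]
```

With `map`, any value missing from the dictionary becomes `NaN`, which makes an unexpected label easy to spot.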


---

## 3. Split into Training & Test Sets

In [20]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_df, test_size=0.25, random_state=42)
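
With a target this balanced and this many rows, a plain random split should stay close to 50/50. If an exact class balance in both subsets matters, one option is the `stratify` argument, sketched here on made-up data (`toy_X` and `toy_y` are hypothetical names).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up balanced toy data: 100 negatives and 100 positives
toy_X = pd.DataFrame({"ID": range(200), "tweet": [f"tweet {i}" for i in range(200)]})
toy_y = pd.Series([0, 1] * 100, name="target")

# stratify keeps the class proportions identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

print(len(X_tr), len(X_te))            # 150 50
print(int(y_tr.sum()), int(y_te.sum()))  # 75 25
```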

### Set Test set aside

Save the training and test sets, and their indices as well (although arguably the procedure is reproducible by setting the same random seed above).

In [21]:
X_train.head()

Unnamed: 0,ID,username,tweet
1250211,1996787711,lifeischill,I went to frys to see andy and saw my BUD inst...
1195219,1985020349,knikkim,back home after brunch with church friends and...
906747,1750930121,ginny9577,"@specialk0478 yeah she's part lab, part spanie..."
427169,2063956079,kirstyhooper,kirsty is doing some ipd work!
905628,1695806605,TwentyFour,@ScaryMommy Sure! My entire blogroll is ter...


In [22]:
y_train.head()

Unnamed: 0,target
1250211,1
1195219,1
906747,1
427169,0
905628,1


In [23]:
# save train and test indices
trainpath = os.path.join(raw_path, "train_ix.npy")
testpath = os.path.join(raw_path, "test_ix.npy")

with open(trainpath, 'wb') as f:
    np.save(f, np.array(y_train.index))
    
with open(testpath, 'wb') as f:
    np.save(f, np.array(y_test.index))

In [24]:
# Save train and test sets
# Note: index=False omits the row index from the CSVs; the split indices are kept in the .npy files saved above
X_train.to_csv(os.path.join(raw_path, "X_train.csv"), index=False)
X_test.to_csv(os.path.join(raw_path, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(raw_path, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(raw_path, "y_test.csv"), index=False)
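
For later notebooks, the saved CSVs and index files can be recombined. Below is a minimal round-trip sketch on a made-up frame written to a temporary directory; the file names mirror the ones above but nothing here touches the real data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up subset with a shuffled, non-contiguous index, like a split would produce
X_part = pd.DataFrame({"ID": [11, 22, 33], "tweet": ["a", "b", "c"]}, index=[7, 2, 5])

with tempfile.TemporaryDirectory() as tmp:
    ix_path = os.path.join(tmp, "train_ix.npy")
    csv_path = os.path.join(tmp, "X_train.csv")

    # Save the index and the data separately, as in the cells above
    np.save(ix_path, X_part.index.to_numpy())
    X_part.to_csv(csv_path, index=False)

    # Reload the data and re-attach the original row index
    restored = pd.read_csv(csv_path)
    restored.index = np.load(ix_path)

print(restored.index.tolist())  # [7, 2, 5]
```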

---

In [25]:
mins, secs = divmod(time.time() - start_notebook, 60)
print(f'Total running time: {mins:0.0f} minute(s) and {secs:0.0f} second(s)')

Total running time: 1 minute(s) and 51 second(s)
