## Twitter Sentiment Analysis

### Part 1: sentiment140 dataset cleanup

The code was originally inspired by Gaurav Singhal's guide: [Building a Twitter Sentiment Analysis in Python.](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python)

The data comes from Marios Michailidis' sentiment140 dataset hosted on [Kaggle.](https://www.kaggle.com/kazanova/sentiment140/)


### Cleanup Process

All cleanup steps are implemented in the custom Python script `cleanup_module.py`.

Michailidis' dataset consists of 1.6 million rows, evenly split between negative and positive Tweets. The labels were generated automatically from emoticons: a Tweet with a happy face is labeled positive, and one with a sad face is labeled negative.

Because my cleanup function is heavily CPU-bound, I use multiprocessing: the data is split into 32 chunks of 50k rows each, which are processed 8 at a time (my laptop has 8 logical processors). The chunks are processed asynchronously, so they do not necessarily finish in order.
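The chunking scheme described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `cleanup_module.py`: the names `make_chunk_params`, `process_chunk`, and `run_all` are hypothetical, the worker is a stand-in that only counts rows, and the half-open row ranges are an assumption (the actual script may define its bounds differently).

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def make_chunk_params(n_rows=1_600_000, chunk_size=50_000):
    """Return (row_range, chunk_id) pairs; chunk ids start at 1."""
    return [
        (range(start, min(start + chunk_size, n_rows)), chunk_id)
        for chunk_id, start in enumerate(range(0, n_rows, chunk_size), start=1)
    ]


def process_chunk(row_range, chunk_id):
    """Hypothetical stand-in for the real cleanup of one chunk."""
    # A real worker would load these rows, clean the text, and save a CSV.
    return chunk_id, len(row_range)


def run_all(params, worker=process_chunk, max_workers=8,
            executor_cls=ProcessPoolExecutor):
    """Submit every chunk and collect results in completion order."""
    finished = []
    with executor_cls(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, rows, cid) for rows, cid in params]
        # as_completed yields futures as they finish, not as they were submitted,
        # which is why the chunk ids can come back out of order.
        for future in as_completed(futures):
            finished.append(future.result())
    return finished
```

When run as a script, `run_all(make_chunk_params())` should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. With `max_workers=8`, at most 8 of the 32 chunks are in flight at any time.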

Here I run the script by passing a shell command from the notebook.

In [10]:
cmd = 'python cleanup_module.py'
!{cmd}

Saving cleaned up train dataset: 6
Saving cleaned up train dataset: 5
Saving cleaned up train dataset: 3
Saving cleaned up train dataset: 2
Saving cleaned up train dataset: 1
Saving cleaned up train dataset: 4
Saving cleaned up train dataset: 8
Saving cleaned up train dataset: 7
Saving cleaned up train dataset: 9
Saving cleaned up train dataset: 10
Saving cleaned up train dataset: 12
Saving cleaned up train dataset: 11
Saving cleaned up train dataset: 13
Saving cleaned up train dataset: 15
Saving cleaned up train dataset: 14
Saving cleaned up train dataset: 16
Saving cleaned up train dataset: 17
Saving cleaned up train dataset: 18
Saving cleaned up train dataset: 19
Saving cleaned up train dataset: 20
Saving cleaned up train dataset: 22
Saving cleaned up train dataset: 23
Saving cleaned up train dataset: 21
Saving cleaned up train dataset: 24
Saving cleaned up train dataset: 25
Saving cleaned up train dataset: 26
Saving cleaned up train dataset: 28
Saving cleaned up train dataset: 27
S


Even without precompiling the regex patterns, the entire dataset is processed in just under 5 minutes, which is good enough for me since this is a one-time process.
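For reference, precompiling patterns would look roughly like the sketch below. The patterns and the `basic_clean` function are hypothetical examples, not the ones used in `cleanup_module.py`, and the sketch skips the later steps visible in the cleaned output (such as stop-word removal and stemming).

```python
import re

# Hypothetical patterns, compiled once at import time instead of on every call.
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def basic_clean(text: str) -> str:
    """Lowercase a tweet and strip mentions, URLs, and non-letter characters."""
    text = text.lower()
    text = MENTION_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub("", text)
    # Collapse the runs of spaces left behind by the substitutions above.
    return WHITESPACE_RE.sub(" ", text).strip()
```

Python's `re` module already caches recently used patterns, so the speed-up from precompiling is often modest. That is consistent with the uncompiled version being fast enough here.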

This is how we map the cleaned data back to the original data (which includes Tweet IDs, etc.). The key is the list of parameters passed to the multiprocessing executor. For example, the last set of parameters indicates that cleaned dataset 32 covers the rows from 1550000 to 1600000:

```
(range(1550000, 1600001), 32)
```
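The offset arithmetic implied by that parameter can be sketched as a small helper. The function name `original_index` is hypothetical, and the sketch assumes that every cleaned file restarts its index at 0 and that chunk `k` begins at row `(k - 1) * 50000` of the raw data.

```python
CHUNK_SIZE = 50_000  # rows per cleaned file, per the chunking described earlier


def original_index(chunk_id: int, local_index: int,
                   chunk_size: int = CHUNK_SIZE) -> int:
    """Map a row label in train_<chunk_id>.csv to its row label in the raw CSV."""
    if chunk_id < 1:
        raise ValueError("chunk ids start at 1")
    return (chunk_id - 1) * chunk_size + local_index
```

For example, `original_index(32, 401)` gives `1550401`: row 401 of `train_32.csv` corresponds to row 1550401 of the raw data.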


In [12]:
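import pandas as pd

# Load only the label (column 0) and the Tweet text (column 5) from the raw data
df = pd.read_csv("../data/1_raw/sentiment140/training.1600000.processed.noemoticon.csv",
                 encoding='latin-1',
                 usecols=[0, 5])

df.columns = ['target', 'text']

# Load the last cleaned chunk (dataset 32)
df_clean = pd.read_csv("../data/2_clean/sentiment140/train_32.csv")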

In [13]:
df.loc[1550401:1550406,]

| | target | text |
|---|---|---|
| 1550401 | 4 | Going to see Ghosts of Girlfriends Past with @... |
| 1550402 | 4 | @sarah_cawood can't wait to see the movie, it ... |
| 1550403 | 4 | @ScottHuska rock the boat |
| 1550404 | 4 | @DooneyStudio me and the remaining web develop... |
| 1550405 | 4 | Wooh powergun Haha washing away |
| 1550406 | 4 | @mileycyrus NO MILEY IM NOT VOTING FOR YOU &gt... |


In [14]:
df_clean.loc[401:406,]

| | target | text |
|---|---|---|
| 401 | 1 | go see ghost girlfriend past |
| 402 | 1 | cant wait see movi look so good |
| 403 | 1 | rock boat |
| 404 | 1 | me remain web develop have plan keep compani a... |
| 405 | 1 | wooh powergun haha wash away |
| 406 | 1 | no miley im not vote you hhahahah joke cours i |


---