## Twitter Sentiment Analysis 

### Part 1: sentiment140 dataset cleanup

The code was originally inspired by Gaurav Singhal's guide, [Building a Twitter Sentiment Analysis in Python](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python).

The data comes from Marios Michailidis' sentiment140 dataset, hosted on [Kaggle](https://www.kaggle.com/kazanova/sentiment140/).


### Cleanup Process

All cleanup steps are implemented in the custom Python script `cleanup_module.py`.

Michailidis' dataset consists of 1.6 million rows, evenly split between negative and positive Tweets. The labels were assigned automatically based on emoticons; the specific steps used to do this have not been disclosed.

Because my cleanup function is heavily CPU-bound, I use multiprocessing: the data is split into 32 chunks of 50k rows each, which are processed 8 at a time (my laptop has 8 logical processors). Chunks finish in no fixed order, as the output below shows.
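
The chunking scheme described above can be sketched roughly as follows. This is not the code in `cleanup_module.py`: `make_params`, `clean_chunk`, and `run_all` are hypothetical names, the per-chunk work is a placeholder, and the sketch uses plain half-open ranges, so the exact bounds in the real script may differ.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

N_ROWS = 1_600_000    # rows in the dataset
CHUNK_SIZE = 50_000   # rows per chunk, giving 32 chunks
MAX_WORKERS = 8       # one worker per logical processor


def make_params(n_rows=N_ROWS, chunk_size=CHUNK_SIZE):
    """Return (row_range, chunk_number) tuples, with chunk numbers starting at 1."""
    return [
        (range(start, min(start + chunk_size, n_rows)), number)
        for number, start in enumerate(range(0, n_rows, chunk_size), start=1)
    ]


def clean_chunk(row_range, number):
    """Hypothetical stand-in for the real per-chunk cleanup work."""
    # The real script would clean these rows and save them to a file here.
    return number, len(row_range)


def run_all(max_workers=MAX_WORKERS):
    """Process all chunks, at most `max_workers` at a time."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(clean_chunk, rows, number)
            for rows, number in make_params()
        ]
        # as_completed yields futures in completion order, not submission
        # order, which is why the chunk numbers print out of sequence.
        for future in as_completed(futures):
            number, _ = future.result()
            print(f"Saving cleaned up train dataset: {number}")
```

When running this as a script, call `run_all()` under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.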

In [1]:
cmd = 'python cleanup_module.py'
!{cmd}

Saving cleaned up train dataset: 3
Saving cleaned up train dataset: 1
Saving cleaned up train dataset: 6
Saving cleaned up train dataset: 8
Saving cleaned up train dataset: 4
Saving cleaned up train dataset: 5
Saving cleaned up train dataset: 7
Saving cleaned up train dataset: 2
Saving cleaned up train dataset: 9
Saving cleaned up train dataset: 12
Saving cleaned up train dataset: 11
Saving cleaned up train dataset: 14
Saving cleaned up train dataset: 16
Saving cleaned up train dataset: 13
Saving cleaned up train dataset: 10
Saving cleaned up train dataset: 15
Saving cleaned up train dataset: 18
Saving cleaned up train dataset: 19
Saving cleaned up train dataset: 17
Saving cleaned up train dataset: 23
Saving cleaned up train dataset: 20
Saving cleaned up train dataset: 21
Saving cleaned up train dataset: 24
Saving cleaned up train dataset: 22
Saving cleaned up train dataset: 26
Saving cleaned up train dataset: 25
Saving cleaned up train dataset: 27
Saving cleaned up train dataset: 30
...

Mapping rows in the cleaned data back to the original data (which includes Tweet IDs, etc.):

- The key is the list of parameters passed to the multiprocessing executor. For example, the last set of parameters indicates that cleaned dataset 32 covers original rows 1550000 to 1600000:

```
(range(1550000, 1600001), 32)
```
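
As a small illustration of this mapping, the sketch below converts between row labels in a cleaned chunk and row labels in the original DataFrame. It assumes each cleaned file is indexed from 0 and preserves row order within its chunk; the helper names `to_original_index` and `to_local_index` are hypothetical and not part of `cleanup_module.py`.

```python
# Parameters of the last chunk, as passed to the executor.
rows, chunk_number = (range(1550000, 1600001), 32)


def to_original_index(rows, local_index):
    """Map a 0-based row label in a cleaned chunk to the original row label."""
    return rows.start + local_index


def to_local_index(rows, original_index):
    """Inverse mapping: original row label to the label inside the chunk."""
    if original_index not in rows:
        raise ValueError(f"row {original_index} is not in chunk range {rows}")
    return original_index - rows.start


print(to_original_index(rows, 401))    # 1550401
print(to_local_index(rows, 1550406))   # 406
```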


In [3]:
import os
import pandas as pd

# Load the raw data, keeping only the label (column 0) and the Tweet text (column 5)
dirpath = os.path.join("..", "data", "1_raw", "sentiment140")
filename = "training.1600000.processed.noemoticon.csv"
filepath = os.path.join(dirpath, filename)

df = pd.read_csv(filepath,
                 encoding='latin-1',
                 usecols=[0, 5])

df.columns = ['target', 'text']

# Load the last cleaned chunk (dataset 32)
dirpath = os.path.join("..", "data", "2_clean", "sentiment140")
filename = "train_32.csv"
filepath = os.path.join(dirpath, filename)

df_clean = pd.read_csv(filepath)

In [4]:
df.loc[1550401:1550406]

| | target | text |
|---|---|---|
| 1550401 | 4 | Going to see Ghosts of Girlfriends Past with @... |
| 1550402 | 4 | @sarah_cawood can't wait to see the movie, it ... |
| 1550403 | 4 | @ScottHuska rock the boat |
| 1550404 | 4 | @DooneyStudio me and the remaining web develop... |
| 1550405 | 4 | Wooh powergun Haha washing away |
| 1550406 | 4 | @mileycyrus NO MILEY IM NOT VOTING FOR YOU &gt... |


In [5]:
df_clean.loc[401:406]

| | target | text | tokenized | filtered | stemmed | lemmatized |
|---|---|---|---|---|---|---|
| 401 | 1 | Going to see Ghosts of Girlfriends Past with @... | going to see ghosts of girlfriends past with d... | going see ghosts girlfriends past danii245 | go see ghost girlfriend past danii245 | going see ghost girlfriend past danii245 |
| 402 | 1 | @sarah_cawood can't wait to see the movie, it ... | sarahcawood cant wait to see the movie it look... | sarahcawood cant wait see movie looks so good | sarahcawood cant wait see movi look so good | sarahcawood cant wait see movie look so good |
| 403 | 1 | @ScottHuska rock the boat | scotthuska rock the boat | scotthuska rock boat | scotthuska rock boat | scotthuska rock boat |
| 404 | 1 | @DooneyStudio me and the remaining web develop... | dooneystudio me and the remaining web develope... | dooneystudio me remaining web developer have p... | dooneystudio me remain web develop have plan k... | dooneystudio me remaining web developer have p... |
| 405 | 1 | Wooh powergun Haha washing away | wooh powergun haha washing away | wooh powergun haha washing away | wooh powergun haha wash away | wooh powergun haha washing away |
| 406 | 1 | @mileycyrus NO MILEY IM NOT VOTING FOR YOU &gt... | mileycyrus no miley im not voting for you hhah... | mileycyrus no miley im not voting you hhahahah... | mileycyru no miley im not vote you hhahahah jo... | mileycyrus no miley im not voting you hhahahah... |
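
To give a feel for how the `tokenized` and `filtered` columns relate to the raw text, here is a minimal sketch inferred only from the rows displayed above, not taken from `cleanup_module.py`. The function names and the tiny stop-word list are assumptions, and the stemming and lemmatizing steps are left out because they depend on an NLP library that this notebook does not show.

```python
import html
import re

# Hypothetical, deliberately tiny stop-word list based on words that
# disappear in the rows shown above; the real list is not shown here.
STOP_WORDS = {"to", "of", "with", "the", "it", "and", "for"}


def tokenize(text):
    """Decode HTML entities, lowercase, and keep only letters, digits and spaces."""
    text = html.unescape(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse any repeated whitespace left behind by removed characters.
    return " ".join(text.split())


def filter_stop_words(tokenized):
    """Drop stop words from an already tokenized string."""
    return " ".join(word for word in tokenized.split() if word not in STOP_WORDS)


tokenized = tokenize("@ScottHuska rock the boat")
print(tokenized)                     # scotthuska rock the boat
print(filter_stop_words(tokenized))  # scotthuska rock boat
```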


---