## Twitter Sentiment Analysis in Python

The code was inspired by Gaurav Singhal's guide: [Building a Twitter Sentiment Analysis in Python](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python).

The data comes from Marios Michailidis' sentiment140 dataset hosted on [Kaggle](https://www.kaggle.com/kazanova/sentiment140/).


### Data Cleanup

All details of the cleanup steps can be found in the custom Python script `cleanup_tweets.py`.

Michailidis' dataset consists of 1.6M rows evenly split into negative and positive Tweets. The labels were assigned automatically from emoticons: Tweets with a happy face are labelled positive, and Tweets with a sad face are labelled negative.

Since my cleanup function is heavily CPU-bound, I use multiprocessing: the data is split into 32 chunks of 50k rows each, which are processed 8 at a time (I have 8 logical processors). Chunks finish in no fixed order, so the progress messages in the output are not sequential.
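A minimal sketch of how such a chunked run could be set up, assuming `concurrent.futures.ProcessPoolExecutor`; the actual `cleanup_tweets.py` may be organised differently. The names `build_chunk_params`, `clean_chunk`, and `run_cleanup` are hypothetical, and `clean_chunk` is only a stand-in for the real cleanup work.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

N_ROWS = 1_600_000     # rows in the sentiment140 training file
CHUNK_SIZE = 50_000    # rows per chunk, giving 32 chunks
N_WORKERS = 8          # one worker per logical processor


def build_chunk_params(n_rows=N_ROWS, chunk_size=CHUNK_SIZE):
    """Return (row_range, chunk_number) pairs, with chunks numbered from 1."""
    return [
        (range(start, min(start + chunk_size, n_rows)), number)
        for number, start in enumerate(range(0, n_rows, chunk_size), start=1)
    ]


def clean_chunk(params):
    """Hypothetical stand-in for the real per-chunk cleanup."""
    row_range, number = params
    # The real script would clean the rows in `row_range` and save them here.
    return number


def run_cleanup(params_list, n_workers=N_WORKERS):
    """Process chunks in parallel and collect chunk numbers as they finish."""
    finished = []
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        futures = [executor.submit(clean_chunk, p) for p in params_list]
        # as_completed yields in completion order, not submission order
        for future in as_completed(futures):
            finished.append(future.result())
    return finished


params = build_chunk_params()
```

In a script, `run_cleanup(params)` would be called under an `if __name__ == "__main__":` guard, which is required on platforms that spawn worker processes. Because results are collected in completion order, the chunk numbers can come back shuffled.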

Here I run `cleanup_tweets.py` from the notebook by passing a command to the shell.

In [1]:
cmd = 'python cleanup_tweets.py'
!{cmd}

Saving cleaned up train dataset: 1
Saving cleaned up train dataset: 8
Saving cleaned up train dataset: 2
Saving cleaned up train dataset: 7
Saving cleaned up train dataset: 5
Saving cleaned up train dataset: 4
Saving cleaned up train dataset: 6
Saving cleaned up train dataset: 3
Saving cleaned up train dataset: 9
Saving cleaned up train dataset: 10
Saving cleaned up train dataset: 11
Saving cleaned up train dataset: 15
Saving cleaned up train dataset: 14
Saving cleaned up train dataset: 13
Saving cleaned up train dataset: 12
Saving cleaned up train dataset: 16
Saving cleaned up train dataset: 17
Saving cleaned up train dataset: 19
Saving cleaned up train dataset: 18
Saving cleaned up train dataset: 23
Saving cleaned up train dataset: 20
Saving cleaned up train dataset: 21
Saving cleaned up train dataset: 22
Saving cleaned up train dataset: 24
Saving cleaned up train dataset: 27
Saving cleaned up train dataset: 25
Saving cleaned up train dataset: 29
Saving cleaned up train dataset: 31
S


Even without compiling the regex patterns, the entire dataset is processed in just under 5 minutes, which is good enough for me since it's a one-time process.

Here I show how to map the cleaned data back to the original data (which includes Tweet IDs, etc.). The key is the list of parameters passed to the multiprocessing executor. For example, this last set of parameters indicates that cleaned dataset 32 contains rows 1550000 to 1599999 (a Python `range` excludes its end point):

```
(range(1550000, 1600000), 32)
```


In [37]:
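import pandas as pd

# Raw data: keep only the label (column 0) and the Tweet text (column 5)
df = pd.read_csv("./data/raw/training.1600000.processed.noemoticon.csv",
                 encoding='latin-1',
                 usecols=[0, 5])

df.columns = ['target', 'text']

# Cleaned chunk 32, which covers the last 50k rows of the raw data
df_clean = pd.read_csv("./data/clean/train_32.csv")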

In [46]:
df.loc[1550401:1550406,]

Unnamed: 0,target,text
1550401,4,Going to see Ghosts of Girlfriends Past with @DANii245
1550402,4,"@sarah_cawood can't wait to see the movie, it looks so good"
1550403,4,@ScottHuska rock the boat
1550404,4,@DooneyStudio me and the remaining web developer have plans to keep the company afloat
1550405,4,Wooh powergun Haha washing away
1550406,4,@mileycyrus NO MILEY IM NOT VOTING FOR YOU &gt;=( HHAHAHAH JOKES OF COURSE I WILL


In [47]:
df_clean.loc[401:406,]

Unnamed: 0,target,text
401,1,go see ghost girlfriend past
402,1,cant wait see movi look so good
403,1,rock boat
404,1,me remain web develop have plan keep compani afloat
405,1,wooh powergun haha wash away
406,1,no miley im not vote you hhahahah joke cours i
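
The two slices above line up because each cleaned chunk holds a contiguous block of 50k raw rows. A small sketch of that arithmetic follows, assuming the cleanup keeps row order and drops no rows within a chunk; the function names are hypothetical.

```python
CHUNK_SIZE = 50_000  # rows per cleaned chunk


def chunk_to_original_index(chunk_number, local_index, chunk_size=CHUNK_SIZE):
    """Map a row position in cleaned chunk `chunk_number` (1-based) to the raw row index."""
    if not 0 <= local_index < chunk_size:
        raise ValueError("local_index is outside the chunk")
    return (chunk_number - 1) * chunk_size + local_index


def original_to_chunk_index(original_index, chunk_size=CHUNK_SIZE):
    """Inverse mapping: raw row index -> (chunk_number, local_index)."""
    chunk_number, local_index = divmod(original_index, chunk_size)
    return chunk_number + 1, local_index
```

For example, `chunk_to_original_index(32, 401)` gives 1550401, which matches the first row of the two slices shown above.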


## Preprocess for Machine Learning

In [None]:
# Split dataset into Train, Test

In [None]:
# ML
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [255]:
# Vectorization with TF-IDF
# There are other techniques as well, such as Bag of Words and N-grams

# TODO: read more about this, make sure this implementation is kosher

def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector
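
To make the weighting concrete, here is a pure-Python sketch of the formulas scikit-learn documents for `TfidfVectorizer(sublinear_tf=True)` with its default `smooth_idf=True` and `norm='l2'`. It is an illustration, not the library's implementation: tokenisation is a plain whitespace split, and `tfidf_sublinear` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_sublinear(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Sublinear tf: 1 + ln(count) replaces the raw count
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


docs = ["rock boat", "rock rock rock boat afloat", "cant wait see movi"]
vectors = tfidf_sublinear(docs)
```

In the second document, "rock" appears three times but its weight is only about 2.1 times that of "boat" (a factor of 1 + ln 3), which is the dampening effect that `sublinear_tf=True` is meant to provide.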

In [None]:


# Assumption: `dataset` is the cleaned DataFrame, with the label in column 0
# and the Tweet text in column 1.
# The same TF-IDF vectorizer is reused later on unseen trending data.
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

# Train a Naive Bayes model and report accuracy on the held-out 20%
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

# Train a Logistic Regression model and report accuracy on the held-out 20%
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))

Naive Bayes gives nearly 76% accuracy, and Logistic Regression gives nearly 79%. These accuracy figures were recorded without stemming or lemmatization. Better preprocessing techniques might improve them.

## Testing on Real-time Feeds

This step is completely optional and only applies if you have read and implemented the guide [Building a Twitter Bot with Python](https://www.pluralsight.com/guides/building-a-twitter-bot-with-python).

In [None]:
# Note: load_dataset, remove_unwanted_cols, preprocess_tweet_text and
# int_to_string are helper functions assumed to be defined as in the guide.
test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"
test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])

# Creating text feature
test_ds.text = test_ds["text"].apply(preprocess_tweet_text)
test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())

# Using Logistic Regression model for prediction
test_prediction_lr = LR_model.predict(test_feature)

# Aggregating the result per hashtag (max() keeps the highest predicted label)
test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction': test_prediction_lr})
test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
test_result.columns = ['hashtag', 'predictions']
test_result.predictions = test_result['predictions'].apply(int_to_string)

print(test_result)
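
Note that the cell above aggregates each hashtag with `max()`, so a hashtag gets the highest label found among its Tweets. The sketch below contrasts that with a majority vote, assuming labels of 0 for negative and 1 for positive; `aggregate_by_hashtag` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import defaultdict


def aggregate_by_hashtag(hashtags, predictions):
    """Return {hashtag: (max_label, majority_label)} for 0/1 predictions."""
    grouped = defaultdict(list)
    for tag, label in zip(hashtags, predictions):
        grouped[tag].append(label)

    summary = {}
    for tag, labels in grouped.items():
        positive_share = sum(labels) / len(labels)
        # max(): positive if at least one Tweet is positive
        # majority: positive only if more than half of the Tweets are positive
        summary[tag] = (max(labels), int(positive_share > 0.5))
    return summary


# Made-up example: "#a" has one positive Tweet out of three
hashtags = ["#a", "#a", "#a", "#b", "#b"]
predictions = [0, 0, 1, 1, 1]
summary = aggregate_by_hashtag(hashtags, predictions)
```

Here "#a" is positive under `max()` but negative under a majority vote, so the choice of aggregation can change the reported sentiment of a hashtag.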

I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems depending on how you want to use it.

I highly recommend trying different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try to implement more machine learning models and you might be able to get accuracy over 85%.