## Twitter Sentiment Analysis

### Part 2: sentiment140 pre-processing for ML

The code was originally inspired by Gaurav Singhal's guide: [Building a Twitter Sentiment Analysis in Python.](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python)

The data comes from Marios Michailidis' sentiment140 dataset, hosted on [Kaggle.](https://www.kaggle.com/kazanova/sentiment140/)


### Load Cleaned Data

In [1]:
import time
import load_data as ld

start = time.perf_counter()

df = ld.run_processes()

finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Finished in 10.26 second(s)


In [2]:
df.head()

Unnamed: 0,target,text,tokenized,filtered,stemmed
0,0,is upset that he can't update his Facebook by ...,is upset that he cant update his facebook by t...,upset cant update his facebook texting might c...,upset cant updat hi facebook text might cri re...
1,0,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sav...,i dived many times ball managed save 50 rest g...,i dive mani time ball manag save 50 rest go ou...
2,0,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire,my whole body feels itchy like fire,my whole bodi feel itchi like fire
3,0,"@nationwideclass no, it's not behaving at all....",no its not behaving at all im mad why am i her...,no not behaving all im mad why am i here becau...,no not behav all im mad whi am i here becaus i...
4,0,@Kwesidei not the whole crew,not the whole crew,not whole crew,not whole crew


In [3]:
df.tail()

Unnamed: 0,target,text,tokenized,filtered,stemmed
1599994,1,Just woke up. Having no school is the best fee...,just woke up having no school is the best feel...,just woke up having no school best feeling ever,just woke up have no school best feel ever
1599995,1,TheWDB.com - Very cool to hear old Walt interv...,very cool to hear old walt interviews,very cool hear old walt interviews,veri cool hear old walt interview
1599996,1,Are you ready for your MoJo Makeover? Ask me f...,are you ready for your mojo makeover ask me fo...,you ready your mojo makeover ask me details,you readi your mojo makeov ask me detail
1599997,1,Happy 38th Birthday to my boo of alll time!!! ...,happy 38th birthday to my boo of alll time tup...,happy 38th birthday my boo alll time tupac ama...,happi 38th birthday my boo alll time tupac ama...
1599998,1,happy #charitytuesday @theNSPCC @SparksCharity...,happy charitytuesday,happy charitytuesday,happi charitytuesday


In [44]:
def calc_rsr(txt):
    """Calculates the ratio of characters in 
    the right-side of the QWERTY keyboard, also
    known as RSR (Right-Side Ratio), given a 
    lower-case text object.
    """
    lside = ['q','w','e','r','t',
             'a','s','d','f','g',
             'z','x','c','v','b']   
    
    rside = ['y','u','i','o','p',
             'h','j','k','l',
             'n','m']
    
    txt = str(txt)
    
    sub_string = [x for x in txt]
    lcount = rcount = 0
    
    for i in sub_string:
        if i in lside:
            lcount += 1
        elif i in rside:
            rcount += 1
        else:
            pass
    
    den = rcount+lcount
    if den != 0:
        return rcount / den
    else:
        return 0
    
def map_rsr(list_):
    map_iterator = map(calc_rsr, list_)
    return list(map_iterator)

In [45]:
start = time.perf_counter()
df['rsr'] = map_rsr(df.loc[:, 'tokenized'])
finish = time.perf_counter()
print(f'Finished in {round(finish-start, 2)} second(s)')

Finished in 72.25 second(s)
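
The runtime above is likely dominated by scanning two Python lists for every character. A minimal sketch of an alternative, assuming the same left/right key split as `calc_rsr`, uses sets for constant-time membership tests. The name `calc_rsr_fast` is hypothetical, and the speed-up has not been benchmarked on this dataset.

```python
# Hypothetical variant of calc_rsr; same left/right key split as the original.
LEFT_KEYS = frozenset("qwertasdfgzxcvb")
RIGHT_KEYS = frozenset("yuiophjklnm")

def calc_rsr_fast(txt):
    """Return the share of right-side letters among all counted letters in txt."""
    txt = str(txt)
    rcount = sum(ch in RIGHT_KEYS for ch in txt)
    lcount = sum(ch in LEFT_KEYS for ch in txt)
    den = rcount + lcount
    # Text with no letters from either side (empty, digits only) scores 0.
    return rcount / den if den else 0

print(calc_rsr_fast("not the whole crew"))  # 6 right-side letters out of 15 -> 0.4
```

If the results match on a sample, it could be applied with `df['rsr'] = df['tokenized'].map(calc_rsr_fast)`.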


In [47]:
df.head()

Unnamed: 0,target,text,tokenized,filtered,stemmed,rsr
0,0,is upset that he can't update his Facebook by ...,is upset that he cant update his facebook by t...,upset cant update his facebook texting might c...,upset cant updat hi facebook text might cri re...,0.404762
1,0,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sav...,i dived many times ball managed save 50 rest g...,i dive mani time ball manag save 50 rest go ou...,0.37931
2,0,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire,my whole body feels itchy like fire,my whole bodi feel itchi like fire,0.513514
3,0,"@nationwideclass no, it's not behaving at all....",no its not behaving at all im mad why am i her...,no not behaving all im mad why am i here becau...,no not behav all im mad whi am i here becaus i...,0.424242
4,0,@Kwesidei not the whole crew,not the whole crew,not whole crew,not whole crew,0.4


In [48]:
df.tail()

Unnamed: 0,target,text,tokenized,filtered,stemmed,rsr
1599994,1,Just woke up. Having no school is the best fee...,just woke up having no school is the best feel...,just woke up having no school best feeling ever,just woke up have no school best feel ever,0.454545
1599995,1,TheWDB.com - Very cool to hear old Walt interv...,very cool to hear old walt interviews,very cool hear old walt interviews,veri cool hear old walt interview,0.387097
1599996,1,Are you ready for your MoJo Makeover? Ask me f...,are you ready for your mojo makeover ask me fo...,you ready your mojo makeover ask me details,you readi your mojo makeov ask me detail,0.444444
1599997,1,Happy 38th Birthday to my boo of alll time!!! ...,happy 38th birthday to my boo of alll time tup...,happy 38th birthday my boo alll time tupac ama...,happi 38th birthday my boo alll time tupac ama...,0.541667
1599998,1,happy #charitytuesday @theNSPCC @SparksCharity...,happy charitytuesday,happy charitytuesday,happi charitytuesday,0.473684


In [55]:
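# numeric_only=True keeps the text columns out of the mean (needed on recent pandas versions)
df.groupby(['target']).mean(numeric_only=True)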

target,rsr
0,0.456939
1,0.462562


In [64]:
# save data for R analysis
df_sub = df.loc[:, ['target', 'tokenized', 'filtered', 'stemmed', 'rsr']].copy()

In [65]:
import os
dir_ = os.path.join("..","data","2_clean","sentiment140")
filename = "sentiment140_training.csv"
filepath = os.path.join(dir_, filename)
# Note: the file is gzip-compressed even though its name ends in .csv
df_sub.to_csv(filepath, index=False, compression='gzip')
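
Because the file is written with `compression='gzip'` but keeps a `.csv` extension, reading it back with pandas needs the compression stated explicitly. The sketch below shows the round trip on a made-up toy frame in a temporary directory; the file name and values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_sub; the values are made up for illustration.
toy = pd.DataFrame({"target": [0, 1], "rsr": [0.4, 0.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_training.csv")  # hypothetical file name
    toy.to_csv(path, index=False, compression="gzip")

    # The default compression="infer" only looks at the file extension,
    # so a gzip file named .csv has to be declared explicitly.
    restored = pd.read_csv(path, compression="gzip")

print(restored)
```

Naming the file `sentiment140_training.csv.gz` instead would let pandas infer the compression, at the cost of changing the path that downstream scripts expect.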

### Preprocessing Data for ML

In [5]:
import numpy as np
import pandas as pd

# ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [6]:
# split dataset into train, test datasets...

In [7]:
# Vectorization with TF-IDF
# There are other techniques as well, such as Bag of Words and N-grams

# TODO: read more about this, make sure this implementation is kosher

def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector
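
On the TODO above, one point worth checking is whether the vectorizer is fitted on training text only, so that no vocabulary or IDF statistics leak in from the test split. The sketch below illustrates that order of operations on a made-up toy corpus; the texts, labels, and split size are assumptions for illustration, not part of the sentiment140 pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus, made up for illustration (0 = negative, 1 = positive).
texts = ["i love this", "so happy today", "this is awful",
         "i hate mondays", "what a great day", "worst day ever"]
labels = [1, 1, 0, 0, 1, 0]

# Split the raw text first, then fit the vectorizer on the training part only.
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=30, stratify=labels)

vectorizer = TfidfVectorizer(sublinear_tf=True)
X_train = vectorizer.fit_transform(train_txt)  # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_txt)        # reuses them; unseen words are dropped

print(X_train.shape, X_test.shape)
```

Both matrices share the same number of columns because the test text is projected onto the vocabulary learned from the training text.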

In [None]:


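# `dataset` is not defined in this notebook; it is assumed to be a DataFrame
# with the target in column 0 and the text in column 1.
# The same TF-IDF vectorizer will be reused to score unseen trending tweets.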
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

# Training Naive Bayes model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

# Training Logistic Regression model
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))

Naive Bayes gives nearly 76% accuracy and Logistic Regression nearly 79%. These figures were recorded without stemming or lemmatization, so better preprocessing might improve them.
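
One candidate for a better technique is to add bigrams to the TF-IDF features, so that phrases such as "not happy" become features of their own. The sketch below wires this into a scikit-learn pipeline on a made-up toy corpus; whether it improves accuracy on sentiment140 is untested here, and the `max_iter` value is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, made up for illustration (0 = negative, 1 = positive).
train_txt = ["i love this", "so happy today", "what a great day",
             "this is awful", "not happy at all", "worst day ever"]
y_train = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together,
# so the vectorizer only ever sees the text passed to fit().
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(solver="lbfgs", max_iter=1000),
)
model.fit(train_txt, y_train)

preds = model.predict(["not happy with this", "great day today"])
print(preds)
```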

### Testing on Real-time Feeds

This step is optional and applies only if you have read and implemented the guide [Building a Twitter Bot with Python.](https://www.pluralsight.com/guides/building-a-twitter-bot-with-python)

In [None]:
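# load_dataset, remove_unwanted_cols, preprocess_tweet_text and int_to_string
# are not defined in this notebook; they are assumed to come from the guide linked at the top.
test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"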
test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])

# Creating text feature
test_ds.text = test_ds["text"].apply(preprocess_tweet_text)
test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())

# Using Logistic Regression model for prediction
test_prediction_lr = LR_model.predict(test_feature)

# Aggregating predictions per hashtag (max keeps the highest predicted label)
test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction': test_prediction_lr})
test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
test_result.columns = ['hashtag', 'predictions']
test_result.predictions = test_result['predictions'].apply(int_to_string)

print(test_result)
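
The cell above aggregates with `.max()`, so a single positive prediction labels the whole hashtag as positive. If an average is the intent, `.mean()` gives the share of positive predictions instead. The sketch below compares the two on made-up predictions for two hypothetical hashtags.

```python
import pandas as pd

# Made-up predictions (0 = negative, 1 = positive) for two hypothetical hashtags.
toy = pd.DataFrame({
    "hashtag": ["#a", "#a", "#a", "#b", "#b"],
    "prediction": [0, 0, 1, 0, 0],
})

by_tag = toy.groupby("hashtag")["prediction"]
summary = pd.DataFrame({
    "max": by_tag.max(),    # 1 as soon as any tweet is predicted positive
    "mean": by_tag.mean(),  # share of tweets predicted positive
}).reset_index()

print(summary)
```

Here `#a` gets a max of 1 but a mean of about 0.33, which shows how much the choice of aggregate can change the reported sentiment.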

I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems, depending on how you want to use it.

I highly recommend trying different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try to implement more machine learning models and you might be able to get accuracy over 85%.