## Twitter Sentiment Analysis in Python

The code was inspired by Gaurav Singhal's guide: [Building a Twitter Sentiment Analysis in Python.](https://www.pluralsight.com/guides/building-a-twitter-sentiment-analysis-in-python)

The data comes from Marios Michailidis' sentiment140 dataset, hosted on [Kaggle.](https://www.kaggle.com/kazanova/sentiment140/)


### Load Cleaned Data

In [4]:
import pandas as pd 
num = 1

# load subset
filepath = ''.join(["./data/clean/train_", str(num), ".csv"])
df = pd.read_csv(filepath)

In [5]:
df.index

RangeIndex(start=0, stop=50000, step=1)

In [6]:
num = 2

# load subset
filepath = ''.join(["./data/clean/train_", str(num), ".csv"])
df2 = pd.read_csv(filepath)

In [7]:
df2.index

RangeIndex(start=0, stop=50000, step=1)

In [8]:
df.tail()

Unnamed: 0,target,text
49995,0,20 mintu late my meet start 8 howd i know i go...
49996,0,super excit you tweet event happen onli way i ...
49997,0,i want anoth day off much sht do today got new...
49998,0,i just jack up thi umbrella cake
49999,0,oh epic work fail also ka still need inciner d...


In [9]:
df2.head()

Unnamed: 0,target,text
0,0,revis her exam revis da polistish system deuts...
1,0,veri annoy about thi game i dont think instal
2,0,jackson not feel wellboo
3,0,cant bring my clearbook
4,0,i just broke up my boyfriend all i want do cri


In [13]:
new_index = range(50000, 100000)
df2.index = new_index

In [15]:
df2.head()

Unnamed: 0,target,text
50000,0,revis her exam revis da polistish system deuts...
50001,0,veri annoy about thi game i dont think instal
50002,0,jackson not feel wellboo
50003,0,cant bring my clearbook
50004,0,i just broke up my boyfriend all i want do cri


In [16]:
df2.tail()

Unnamed: 0,target,text
99995,0,look like my router broke more tweet my fone then
99996,0,i realli dont want colleg right now wish sunni
99997,0,offer you pepto
99998,0,i would sooooo there if i didnt have revis do
99999,0,omg i just moisturis my leg burn me realli hur...
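
The manual re-indexing above can also be left to pandas. A minimal sketch, assuming two small stand-in frames (the `part1` and `part2` names and their contents are made up for illustration): `pd.concat` with `ignore_index=True` discards each part's own index and numbers the combined rows continuously.

```python
import pandas as pd

# Hypothetical stand-ins for two cleaned subsets
part1 = pd.DataFrame({"target": [0, 0], "text": ["first tweet", "second tweet"]})
part2 = pd.DataFrame({"target": [0, 4], "text": ["third tweet", "fourth tweet"]})

# ignore_index=True drops the per-part indexes and renumbers rows from 0
combined = pd.concat([part1, part2], ignore_index=True)
print(combined.index)
```

With the real subsets, this would avoid having to keep the `range(50000, 100000)` bounds in sync with the file sizes by hand.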


In [17]:
# same params except -1 at end of ranges

In [20]:
cmd = 'python load_clean_data.py'
!{cmd}

Appending train subseet 2 ...
Appending train subseet 3 ...
Appending train subseet 1 ...
Empty DataFrame
Columns: []
Index: []
Empty DataFrame
Columns: []
Index: []
Empty DataFrame
Columns: []
Index: []
Finished in 0.26 second(s)


In [25]:
import time
import pandas as pd
import concurrent.futures

def load_training_data(params):
 
    # unpack parameters
    ix_list, num = params
                        
    # load subset
    filepath = ''.join(["./data/clean/train_", str(num), ".csv"])
    df = pd.read_csv(filepath)
    df.index = ix_list
    # NOTE: DataFrame.append returns a new frame; it does not modify dfm in place
    dfm.append(df)

    # print our result
    result=''.join(["Appending train subseet ", str(num), " ..."])
    print(result)
    
    return dfm


def run_processes():
    # reading the CSV files is mostly I/O-bound, so use multithreading
    with concurrent.futures.ThreadPoolExecutor() as executor:
                   
        params_list = [
                       (range(     0,    50000),  1),
                       (range(  50000,  100000),  2),
                       (range( 100000,  150000),  3)
                      ]
        results = [executor.submit(load_training_data, p) for p in params_list]

        # as_completed returns an iterator that yields each future
        # as soon as its thread has finished
        for f in concurrent.futures.as_completed(results):
            print(f.result())

In [27]:

dfm = pd.DataFrame()
run_processes()


Appending train subseet 1 ...
Appending train subseet 3 ...
Empty DataFrame
Columns: []
Index: []
Empty DataFrame
Columns: []
Index: []
Appending train subseet 2 ...
Empty DataFrame
Columns: []
Index: []
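
The `Empty DataFrame` output is what one would expect from `dfm.append(df)`: `DataFrame.append` returned a new frame rather than changing `dfm` in place, and the method was removed in pandas 2.0. One possible alternative is sketched below, under the assumption that each worker simply returns its own frame and the pieces are joined afterwards with `pd.concat`. The `load_subset` and `load_all` names and the `data_dir` parameter are hypothetical.

```python
import concurrent.futures
from pathlib import Path

import pandas as pd


def load_subset(params):
    """Read one cleaned subset and give it the requested index."""
    data_dir, ix_range, num = params
    df = pd.read_csv(Path(data_dir) / f"train_{num}.csv")
    df.index = ix_range
    return df


def load_all(data_dir, params_list):
    """Load the subsets in threads and concatenate them in input order."""
    jobs = [(data_dir, ix_range, num) for ix_range, num in params_list]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map yields results in input order, unlike as_completed
        frames = list(executor.map(load_subset, jobs))
    return pd.concat(frames)
```

A call such as `dfm = load_all("./data/clean", params_list)`, with the same `params_list` as in `run_processes`, would then return the combined frame instead of relying on a global.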


### Preprocessing Data for ML

In [None]:
# ML
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [1]:
# load data...

In [3]:
# split dataset into train, test datasets...

In [255]:
# Vectorization with TF-IDF
# There are other techniques as well, such as Bag of Words and N-grams

# TODO: read more about this, make sure this implementation is kosher

def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector
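
As a small illustration of what `get_feature_vector` produces, the sketch below fits a `TfidfVectorizer` on a made-up three-document corpus (the sample strings are assumptions, not rows from the dataset). With `sublinear_tf=True`, scikit-learn replaces the raw term frequency `tf` with `1 + log(tf)`. The fitted vocabulary is reused when transforming unseen text, and words not seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training corpus in the style of the cleaned tweets
train_docs = ["i love thi game", "i hate thi game", "love love love"]

vectorizer = TfidfVectorizer(sublinear_tf=True)
vectorizer.fit(train_docs)

# "umbrella" was never seen during fit, so it contributes no feature
unseen = vectorizer.transform(["love thi umbrella"])

print(sorted(vectorizer.vocabulary_))
print(unseen.shape)
```

The default tokenizer keeps only tokens of two or more characters, which is why `i` does not appear in the vocabulary.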

In [None]:


# Same tf vector will be used for testing sentiments on unseen trending data
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

# Training Naive Bayes model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

# Training Logistic Regression model
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))

Naive Bayes reaches nearly 76% accuracy and Logistic Regression nearly 79%. These figures were recorded without stemming or lemmatization, so better preprocessing might improve them.

### Testing on Real-time Feeds

This step is optional and applies only if you have read and implemented the guide [Building a Twitter Bot with Python.](https://www.pluralsight.com/guides/building-a-twitter-bot-with-python)

In [None]:
test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"
test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])

# Creating text feature
test_ds.text = test_ds["text"].apply(preprocess_tweet_text)
test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())

# Using Logistic Regression model for prediction
test_prediction_lr = LR_model.predict(test_feature)

# Aggregating per hashtag: keep the maximum prediction for each hashtag
test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction': test_prediction_lr})
test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
test_result.columns = ['hashtag', 'predictions']
test_result.predictions = test_result['predictions'].apply(int_to_string)

print(test_result)
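
The helpers `load_dataset`, `remove_unwanted_cols`, `preprocess_tweet_text`, and `int_to_string` are not defined in this notebook; they appear to come from the guides linked above. To show the aggregation step in isolation, here is a sketch on made-up predictions. The `int_to_string` body is a hypothetical stand-in that assumes sentiment140's labels (0 = negative, 4 = positive). Note that `groupby(...).max()` labels a hashtag positive if any of its tweets is predicted positive, whereas `mean()` would give a literal average.

```python
import pandas as pd


def int_to_string(sentiment):
    # Hypothetical mapping, assuming sentiment140 labels: 0 negative, 4 positive
    return "Negative" if sentiment == 0 else "Positive"


# Made-up predictions for two hashtags
preds = pd.DataFrame({
    "hashtag": ["#a", "#a", "#b", "#b"],
    "prediction": [0, 4, 0, 0],
})

# One positive tweet is enough to make the whole hashtag positive
by_max = preds.groupby("hashtag")["prediction"].max().reset_index()
by_max["prediction"] = by_max["prediction"].apply(int_to_string)

# The mean keeps the share of positive tweets visible instead
by_mean = preds.groupby("hashtag")["prediction"].mean().reset_index()

print(by_max)
print(by_mean)
```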

I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems, depending on how you want to use it.

I highly recommend trying different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try implementing more machine learning models and you might be able to get accuracy over 85%.
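
One way to try such variations is a scikit-learn `Pipeline`, which also keeps the vectorizer and the feature selector fitted on the training data only. A minimal sketch on a made-up toy corpus follows; the n-gram range, the value of `k`, and the sample texts are arbitrary assumptions, and it makes no claim about the accuracy reachable on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy corpus with sentiment140-style labels (0 negative, 4 positive)
texts = ["love thi game", "realli love it", "hate thi game", "realli hate it"]
labels = [4, 4, 0, 0]

model = Pipeline([
    # unigrams and bigrams instead of unigrams only
    ("tfidf", TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))),
    # keep the k features most associated with the labels (chi-squared test)
    ("select", SelectKBest(chi2, k=4)),
    ("clf", LogisticRegression()),
])
model.fit(texts, labels)

print(model.predict(["love love love"]))
```

Swapping the `clf` step for `MultinomialNB()` or another classifier lets the same preprocessing be compared across models.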