<h1 align="center" style="color:dodgerblue; font-weight:700"> Sentiment Analysis on Twitter Tweets</h1>
<hr/>

<p>Sentiment analysis is performed on several datasets, each after exploratory data analysis and preprocessing, using a variety of machine learning techniques that are trained and evaluated separately.</p>
<h4 style="font-weight:600;">15 implemented ML Algorithms : </h4>
<ol>
<li>Logistic Regression</li>
<ul>
<li>Newton CG</li>
<li>SAG</li>
<li>SAGA</li>
<li>LBFGS</li>
</ul>
<li>Decision Tree Classifier</li>
<li>Support Vector Machines</li>
<ul>
<li>Linear</li>
<li>Poly</li>
<li>RBF</li>
<li>Sigmoid</li>
</ul>
<li>Majority Voting Ensemble</li>
<li>Extreme Learning Machines</li>
<ul>
<li>Tanh</li>
<li>SinSQ</li>
<li>Tribas</li>
<li>Hardlim</li>
</ul>
<li>Artificial Neural Networks (Multi-Layer Perceptron) with Gradient Descent</li>
</ol>
<hr/>

<h2 align="center" style="color:red">Part 1 : Data Preprocessing</h2>
<ol>
<li>Exploratory Data Analysis</li>
<li>Data Preprocessing</li>
<li>Cleaning</li>
<li>Lemmatization</li>
</ol>
<h4 style="font-weight:700">>>> Run Text_Preprocessing_MP_Hybrid.py to preprocess additional datasets</h4>
<hr/>
<style>
li {
    margin-left:100px;
    /* color: yellow; */
}
</style>

<h4>Flow of Control</h4>
<ol>
<li>Sentence Segmentation</li>
<li>Word Tokenization</li>
<li>Consecutive repeated characters reduced to at most 2 occurrences</li>
<li>Spelling Corrections</li>
<li>Removal of #Hashtags, @Mentions, http:// URLs, etc. (Noise 1)</li>
<li>Removal of Special Unicode Characters (Noise 2)</li>
<li>Chat Abbreviation Conversions (Noise 3)</li>
<li>Removal of Punctuation except `'` (Noise 4)</li>
<li>Stop Words Removal (Noise 5)</li>
<li>Parts of Speech Tagging</li>
<li>Stemming & Lemmatization</li>
<li>WhiteSpace Removals</li>
<li>Chunking</li>
</ol>
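
A few of the steps above can be sketched with the standard library alone. This is a minimal illustration, not the contents of Text_Preprocessing_MP_Hybrid.py: the helper name `clean_tweet` and the regular expressions are assumptions made for this example, and steps such as spelling correction, stop word removal and lemmatization are left out.

```python
import re
import string

# Punctuation to strip: everything except the apostrophe (Noise 4).
PUNCT_EXCEPT_APOSTROPHE = string.punctuation.replace("'", "")

def clean_tweet(text: str) -> str:
    """Hypothetical helper covering a handful of the listed cleaning steps."""
    # Collapse any character repeated 3+ times down to 2 ("soooo" -> "soo").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Noise 1: drop URLs, then @mentions and #hashtags.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    # Noise 4: remove punctuation except the apostrophe.
    text = text.translate(str.maketrans("", "", PUNCT_EXCEPT_APOSTROPHE))
    # Whitespace removal: squeeze runs of whitespace and trim both ends.
    return " ".join(text.split())

print(clean_tweet("@bob I'm soooo happy!!! #blessed http://t.co/xyz"))
# prints: I'm soo happy
```
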
<hr/>
<style>
h4 {
    display: flex;
    justify-content: center;
    color: green;
    font-weight: 650;
}
li {
    margin-left:100px;
    /* color: yellow; */
}
</style>

<h2 align="center" style="color:red">Part 2 : Machine Learning Models Training</h2>
<hr/>

<h3>Necessary Imports</h3>

In [2]:
import warnings
warnings.filterwarnings("ignore")
import traceback

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

from sklearn_extensions.extreme_learning_machines.elm import GenELMClassifier
from sklearn_extensions.extreme_learning_machines.random_layer import RBFRandomLayer, MLPRandomLayer

from sklearn.neural_network import MLPClassifier, MLPRegressor

In [3]:
class Dataset:
    def __init__(
            self,
            filePath : str = "",
            df : pd.DataFrame = None,
            preprocessed : str = "pro",
            name : str = ""
        ) -> None:
        """Dataset
            ======
            Represents AI-ML datasets

            Parameters
            ---------
            filePath : str - System Path to the file location (default '')
            df : pd.DataFrame - Pandas DataFrame for the dataset (default None)
            preprocessed : str - Indicates whether data is raw or processed, Possible values = ('raw', 'pro')
            name : str - Given name for the dataset
        """
        self.filePath = filePath
        self.name = name if name else filePath
        self.df = df
        self.preprocessed = preprocessed
    


    def loadDataset(self, filePath : str = ""):
        # Use an explicitly passed path; otherwise fall back to the stored one
        if filePath:
            self.filePath = filePath
            if not self.name: self.name = filePath
        if not self.filePath:
            raise ValueError("Path to dataset file not provided")
        try:
            self.df = pd.read_csv(self.filePath)
        except Exception as e:
            print(f"Cannot import dataset named {self.name} from path {self.filePath}")
            print(f"Original Error : {e}")
            # Print the traceback of the exception currently being handled
            traceback.print_exc()
    

    def get_DF(self):
        if self.df is None:
            self.loadDataset()
        return self.df

In [4]:
datasets = {
    "dataset_1" : {
        "dataset" : Dataset("./data/dataset_1_processed.csv", name = "dataset_1"),
        "trainingDone" : True
    },

    "dataset_2" : {
        "dataset" : Dataset("./data/dataset_2_processed.csv", name = "dataset_2"),
        "trainingDone" : False
    },

    "dataset_3" : {
        "dataset" : Dataset("./data/dataset_3_processed.csv", name = "dataset_3"),
        "trainingDone" : False
    },

    "dataset_4" : {
        "dataset" : Dataset("./data/dataset_4_processed.csv", name = "dataset_4"),
        "trainingDone" : False
    },
}

### Data Preparation

In [5]:
dfTwe = pd.read_csv("./data/dataset_3_processed.csv")

In [6]:
# Replace non-string entries (e.g. NaN from empty tweets) with a placeholder token
not_str = ~dfTwe["Proc_Tweet"].apply(lambda val: isinstance(val, str))
dfTwe.loc[not_str, "Proc_Tweet"] = "neutral"

In [7]:
dfTwe["Proc_Tweet"].isna().value_counts()

False    10001
Name: Proc_Tweet, dtype: int64

#### Digitalize Sentiments

In [8]:
sents = {
    'fun' : 1, 'happiness' : 1, 'love' : 1, 'relief' : 1, 'enthusiasm' : 1, 'surprise' : 1,
    'empty' : 0, 'neutral' : 0,
    'boredom' : -1, 'anger' : -1, 'hate' : -1, 'sadness' : -1, 'worry' : -1
}

def digitalize(sent : str):
    return sents.get(sent, 0)

In [9]:
if type(dfTwe.loc[0, "sentiment"]) == str:
    dfTwe["sentiment"] = dfTwe["sentiment"].apply(digitalize)

In [10]:
x = dfTwe["Proc_Tweet"][:10000]
y = dfTwe["sentiment"] [:10000]

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25)

In [12]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((7500,), (2500,), (7500,), (2500,))

In [13]:
def generate_results(ML_Model, y_test, y_pred):
    results = {
        "ML Model" : ML_Model,
        "classification_report"  : classification_report(y_test, y_pred),
        "accuracy_score"  : accuracy_score(y_test, y_pred)
    }
    
    funcs = (precision_score, recall_score, f1_score,)
    params = ("micro", "macro", "weighted",)

    for func in funcs:
        results[func.__name__] = {}
        for param in params:
            results[func.__name__][param] = func(y_test, y_pred, average = param)
    
    return results
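
The three `average` settings used above can give very different numbers on imbalanced data. The sketch below recomputes recall by hand to show why; it is an illustration written for this note (the function name `recall_by_average` and the toy labels are assumptions), not a replacement for the scikit-learn functions, and it only considers labels that occur in `y_true`.

```python
from collections import Counter

def recall_by_average(y_true, y_pred):
    """Recall under micro, macro and weighted averaging (illustrative only)."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    # True positives per class: predictions that match the true label.
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {c: tp[c] / support[c] for c in labels}
    n = len(y_true)
    return {
        # micro: pool all samples, so frequent classes dominate
        "micro": sum(tp.values()) / n,
        # macro: plain mean of per-class recall, every class counts equally
        "macro": sum(per_class.values()) / len(labels),
        # weighted: per-class recall weighted by class support
        "weighted": sum(per_class[c] * support[c] for c in labels) / n,
    }

# Toy imbalanced data: 9 samples of class 0, 1 sample of class 1,
# and a model that always predicts class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(recall_by_average(y_true, y_pred))
```

Here micro and weighted recall are both 0.9 while macro recall is 0.5, because the minority class is never recovered.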

In [14]:
def Text_Classifier(ML_Method, x_train = x_train, x_test = x_test, y_train = y_train, y_test = y_test):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), ML_Method)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return generate_results(model, y_test, y_pred)
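
To make `ngram_range=(1, 3)` concrete, the sketch below lists the word unigrams, bigrams and trigrams of a short text. It is a simplified stand-in that splits on whitespace (the helper `word_ngrams` is hypothetical); `TfidfVectorizer` applies its own tokenization and then weights each n-gram, so the real feature set can differ.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text` for n within `ngram_range` (inclusive)."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("not very happy"))
# prints: ['not', 'very', 'happy', 'not very', 'very happy', 'not very happy']
```

Including bigrams and trigrams lets the model see phrases such as "not very happy" as a single feature instead of three unrelated words.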

In [15]:
ML_results = {}

### Logistic Regression

In [16]:
LOGR_solvers = ('newton-cg', 'sag', 'saga', 'lbfgs',)

In [17]:
for slvr in LOGR_solvers:
    result_LOGR = Text_Classifier(LogisticRegression(random_state = 0, solver = slvr, multi_class = 'auto'))
    ML_results[f"LOGR_{slvr}"] = result_LOGR

### Decision Tree Classifier

In [18]:
result_DT = Text_Classifier(DecisionTreeClassifier())
ML_results['Decision_Tree'] = result_DT

### Support Vector Machines

In [19]:
SVM_Kernels = ['linear', 'poly', 'rbf', 'sigmoid']

In [20]:
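for krnl in SVM_Kernels:
    try:
        result_SVM = Text_Classifier(SVC(kernel=krnl))
        ML_results[f"SVM_{krnl}"] = result_SVM
    except Exception as e:
        # Report the failing kernel instead of silently skipping it
        print(f"SVM with kernel {krnl} failed with error : {e}")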

### Majority Voting Ensemble

In [21]:
result_MVE = Text_Classifier(VotingClassifier(estimators = [
    ('lr', LogisticRegression(random_state = 0, solver = 'lbfgs', multi_class = 'auto')),
    ('svm', SVC(kernel="rbf"))
]))
ML_results["MVE"] = result_MVE
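
`VotingClassifier` defaults to hard (majority) voting. The idea can be sketched in a few lines of plain Python; the helper `hard_vote` and the sample predictions are hypothetical, and ties are resolved here in favour of the smallest label, which is an assumption of this sketch chosen to mirror the ascending-order tie-breaking that scikit-learn documents.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over the labels predicted for one sample.

    Ties go to the smallest label (an assumption made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)

# Each row holds the labels predicted for one sample by hypothetical classifiers.
rows = [(1, 1, 0), (0, 1, 0), (1, 0)]
print([hard_vote(r) for r in rows])
# prints: [1, 0, 0]
```

With only two estimators, as in the cell above, every disagreement is a tie, so the tie-breaking rule decides the outcome for those samples.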

### Extreme Learning Machines

In [22]:
def make_classifiers():
    nh : int = 10

    # Custom (user defined) transfer function
    sinsq = (lambda x: np.power(np.sin(x), 2.0))
    srhl_sinsq = MLPRandomLayer(n_hidden=nh, activation_func=sinsq)

    # Internal Transfer functions
    srhl_tanh = MLPRandomLayer(n_hidden=nh, activation_func='tanh')
    srhl_tribas = MLPRandomLayer(n_hidden=nh, activation_func='tribas')
    srhl_hardlim = MLPRandomLayer(n_hidden=nh, activation_func='hardlim')

    # Gaussian RBF
    srhl_rbf = RBFRandomLayer(n_hidden=nh*2, rbf_width=0.1, random_state=0)
    
    log_reg = LogisticRegression()

    classifiers = [
        ('ELM(10,tanh)', GenELMClassifier(hidden_layer=srhl_tanh)),
        ('ELM(10,tanh,LR)', GenELMClassifier(hidden_layer=srhl_tanh, regressor=log_reg)),
        ('ELM(10,sinsq)', GenELMClassifier(hidden_layer=srhl_sinsq)),
        ('ELM(10,tribas)', GenELMClassifier(hidden_layer=srhl_tribas)),
        ('ELM(hardlim)', GenELMClassifier(hidden_layer=srhl_hardlim)),
        ('ELM(20,rbf(0.1))', GenELMClassifier(hidden_layer=srhl_rbf)),
    ]

    return classifiers


if __name__ == '__main__':
    dataset = [dfTwe["Proc_Tweet"].to_list(), dfTwe["sentiment"].to_list()]
    names, classifiers = zip(*make_classifiers())


    X, y = dataset
    vectorizer = TfidfVectorizer(min_df=3, sublinear_tf=True, norm='l2', ngram_range=(1, 3))
    X = vectorizer.fit_transform(X)[:10000]
    y = y[:10000]

    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=3)

    for name, clf in zip(names, classifiers):
        try:
            clf.fit(x_train, y_train)
            # score = clf.score(x_test, y_test)
            y_pred = clf.predict(x_test)
            print(f'Model {name} Successful, metrics saved')
            ML_results[name] = generate_results(clf, y_test, y_pred)

        except Exception as e: print(f'Model {name} failed with error : {e}')

Model ELM(10,tanh) Successful, metrics saved
Model ELM(10,tanh,LR) Successful, metrics saved
Model ELM(10,sinsq) Successful, metrics saved
Model ELM(10,tribas) Successful, metrics saved
Model ELM(hardlim) Successful, metrics saved
Model ELM(20,rbf(0.1)) failed with error : unsupported operand type(s) for -: 'map' and 'map'
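
For reference, the four hidden-layer transfer functions named above can be written as scalar functions. These are the commonly used textbook definitions, stated here as an assumption rather than copied from `sklearn_extensions`; in particular, conventions differ on the value of `hardlim` at exactly 0.

```python
import math

def tanh(x):
    # Hyperbolic tangent, output in (-1, 1).
    return math.tanh(x)

def sinsq(x):
    # Same formula as the custom lambda in the cell above: sin(x) squared.
    return math.sin(x) ** 2

def tribas(x):
    # Triangular basis: 1 at x == 0, falling linearly to 0 for |x| >= 1.
    return max(0.0, 1.0 - abs(x))

def hardlim(x):
    # Hard limit (step): 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

for f in (tanh, sinsq, tribas, hardlim):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, -0.5, 0.5, 2.0)])
```
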


### Artificial Neural Networks - MLP

In [23]:
x = dfTwe["Proc_Tweet"]#[:10000]
y = dfTwe["sentiment"] #[:10000]

In [24]:
x = vectorizer.fit_transform(x).toarray()

In [25]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

In [26]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((7500, 4146), (2501, 4146), (7500,), (2501,))

In [27]:
clf = MLPClassifier(solver='sgd', random_state=20, hidden_layer_sizes=(15,12), alpha=1e-5)

In [28]:
sz = 10000
clf.fit(x_train[:sz], y_train[:sz])

MLPClassifier(alpha=1e-05, hidden_layer_sizes=(15, 12), random_state=20,
              solver='sgd')

In [29]:
y_pred = clf.predict(x_test)

In [30]:
ML_results['ANN-GD'] = generate_results(clf, y_test, y_pred)

## Full Metrics Results

In [31]:
for meth, result in ML_results.items():
    print('-'*50 + " " + meth + " " + '-'*50)
    # Print each stored metric name followed by its value
    for k, v in result.items():
        print(k)
        print(v)
    print('-'*100)

-------------------------------------------------- LOGR_newton-cg --------------------------------------------------
ML Model
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(ngram_range=(1, 3))),
                ('logisticregression',
                 LogisticRegression(random_state=0, solver='newton-cg'))])
classification_report
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      2328
           1       1.00      0.08      0.15       172

    accuracy                           0.94      2500
   macro avg       0.97      0.54      0.56      2500
weighted avg       0.94      0.94      0.91      2500

accuracy_score
0.9368
precision_score
{'micro': 0.9368, 'macro': 0.9682220434432824, 'weighted': 0.9408167337087691}
recall_score
{'micro': 0.9368, 'macro': 0.5406976744186046, 'weighted': 0.9368}
f1_score
{'micro': 0.9368, 'macro': 0.5588583477402379, 'weighted': 0.9109941309174406}
----------------------------------------------------

## 4 Core Metrics

In [32]:
metrics_table = {
    'ML Method' : [],
    'Accuracy' : [],
    'Precision' : [],
    'Recall' : [],
    'F1-Score' : [],
}

for meth, result in ML_results.items():
    metrics_table['ML Method'].append(meth)
    metrics_table['Accuracy'].append(result['accuracy_score'])
    # For each metric, keep the highest of the micro / macro / weighted averages,
    # so the values in one row may come from different averaging schemes
    metrics_table['Precision'].append(max(result['precision_score'].values()))
    metrics_table['Recall'].append(max(result['recall_score'].values()))
    metrics_table['F1-Score'].append(max(result['f1_score'].values()))

In [33]:
df_metrics = pd.DataFrame(metrics_table)

In [34]:
for idx in df_metrics.index:
    df_metrics.loc[idx, :] = df_metrics.loc[idx, :].apply(lambda val : val.replace('_', ' - ').upper() if type(val) == str else val)

In [35]:
def color(x):
    # Build a frame of CSS strings with the same shape as the styled frame;
    # only the 'ML Method' column is coloured
    styles = pd.DataFrame('', index=x.index, columns=x.columns)
    styles.loc[:, 'ML Method'] = 'color: cyan'
    return styles

In [36]:
# 'ML Method' is kept as a regular column because color() styles it by name
df_metrics.style.apply(color, axis=None)

| | ML Method | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 0 | LOGR - NEWTON-CG | 0.9368 | 0.968222 | 0.9368 | 0.9368 |
| 1 | LOGR - SAG | 0.9368 | 0.968222 | 0.9368 | 0.9368 |
| 2 | LOGR - SAGA | 0.9368 | 0.968222 | 0.9368 | 0.9368 |
| 3 | LOGR - LBFGS | 0.9368 | 0.968222 | 0.9368 | 0.9368 |
| 4 | DECISION - TREE | 0.9196 | 0.919816 | 0.9196 | 0.919708 |
| 5 | SVM - LINEAR | 0.9488 | 0.9488 | 0.9488 | 0.9488 |
| 6 | SVM - POLY | 0.9416 | 0.9416 | 0.9416 | 0.9416 |
| 7 | SVM - RBF | 0.9424 | 0.954386 | 0.9424 | 0.9424 |
| 8 | SVM - SIGMOID | 0.946 | 0.946 | 0.946 | 0.946 |
| 9 | MVE | 0.9364 | 0.968034 | 0.9364 | 0.9364 |
