<p style="text-align: center; font-size: 35px;"><b>Social Media Management</b></p>
<p style="margin-top: -20px; text-align: center; font-size: 27px;"><b>Final Project</b></p>
<p style="margin-top: -10px; text-align: center; font-size: 18px;"><b>Classifier for Climate Change Tweets</b></p>
<p style="text-align: center; font-size: 16px;"><a href="https://antonioscardace.altervista.org/">Antonio Scardace</a> • 1000007272 • 2021/2022</p>

## Introduction to Project Idea and Data
Given a series of tweets about climate change, the task is to determine whether the author of each tweet is skeptical of, or supports, the belief in man-made climate change. Once implemented, trained, and tested, the classifier can be applied in real-world contexts.

To solve this problem, I built a classifier trained on a dataset of **43943 tweets** collected between 2015-04-27 and 2018-02-21. The data collection was funded by a ***Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo***. <br/>
Each row of the dataset contains the text of the tweet ('*message*'), the tweet id ('*tweetid*'), and the sentiment of the tweet ('*sentiment*').

Each sentiment is labelled as one of the following classes:
- ``-1`` (**Anti**) &#8594; the tweet author doesn't believe in man-made climate change;
- ``0`` (**Neutral**) &#8594; the tweet author neither supports nor refutes the belief of man-made climate change;
- ``1`` (**Pro**) &#8594; the tweet author supports the belief of man-made climate change;
- ``2`` (**News**) &#8594; the tweet links to factual news about climate change.
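
When inspecting predictions, it can help to translate the numeric codes back into class names. A minimal sketch; `LABEL_NAMES` and `label_name` are hypothetical names introduced here for illustration and are not part of the notebook:

```python
# Hypothetical helper: readable names for the dataset's sentiment codes
LABEL_NAMES = {-1: 'Anti', 0: 'Neutral', 1: 'Pro', 2: 'News'}

def label_name(code):
    """Return the class name for a sentiment code from the dataset."""
    return LABEL_NAMES[int(code)]

print([label_name(c) for c in (-1, 0, 1, 2)])
```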

<img src="https://antonioscardace.altervista.org/smm/dataset_distr.png" alt="dataset tweets distribution" style="width: 550px; margin-top: 10px; border: 1px solid #555"/>

## Base Code

I first imported the modules I needed:
- ``requests`` sends HTTP requests with a simple API.
- ``numpy`` adds support for large arrays and matrices, along with a large collection of math functions.
- ``pandas`` helps with data manipulation and analysis through tabular data structures (DataFrames).
- ``re`` provides regular expression matching operations.
- ``vaderSentiment`` is a lexicon and rule-based sentiment-analysis tool tuned for social media text.
- ``sklearn`` (scikit-learn) is a Machine Learning library for Python.

From ``sklearn`` I imported some functions and classes:
- **train_test_split** _(function)_ &#8594; splits the dataset into Training Set and Test Set.
- **KNeighborsClassifier** _(class)_ &#8594; implements the K-Nearest Neighbors Classifier.
- **MultinomialNB** _(class)_ &#8594; implements the Multinomial Naive Bayes Classifier.
- **LogisticRegression** _(class)_ &#8594; implements the Logistic Regression Classifier.
- **SGDClassifier** _(class)_ &#8594; implements linear classifiers (such as SVM, Logistic Regression) trained with SGD.
- **SVC** _(class)_ &#8594; implements the Support Vector Classifier, i.e. the SVM (Support Vector Machine) for classification.
- **GridSearchCV** _(class)_ &#8594; searches a grid of parameter values (such as K in the KNN algorithm) and keeps the best-scoring combination.
- **ShuffleSplit** _(class)_ &#8594; a random-split cross-validator; on large datasets it speeds up GridSearchCV by replacing the full Cross-Validation with a single train/validation split.
- **CountVectorizer** _(class)_ &#8594; converts a collection of text documents to a matrix of token counts.
- **TfidfTransformer** _(class)_ &#8594; transforms a count matrix to a normalized tf or tf-idf representation.
- **Pipeline** _(class)_ &#8594; sequentially applies a list of transformations and a final estimator (a classifier, in our case). It provides the methods to train and test our algorithm.
- **f1_score** _(function)_ &#8594; the F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean (here computed once per output class).
- **accuracy_score** _(function)_ &#8594; accuracy is the fraction of predictions our model got right.
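
To make the two metrics concrete, here is a small pure-Python sketch that computes per-class precision, recall, and F1, plus accuracy, on made-up labels. The helper names `per_class_scores` and `accuracy` are hypothetical; in the notebook itself these values come from ``f1_score(..., average=None)`` and ``accuracy_score``.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treated as the 'positive' class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels, made up for illustration
y_true = [1, 1, 1, 0, 0, -1]
y_pred = [1, 1, 0, 0, 1, -1]

for label in (-1, 0, 1):
    print(label, per_class_scores(y_true, y_pred, label))
print('accuracy:', accuracy(y_true, y_pred))
```

A class with few examples can have a low F1 even when the overall accuracy looks acceptable, which is why the notebook reports both.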

In [3]:
import requests

import numpy as np
import pandas as pd

import re
import vaderSentiment.vaderSentiment

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

Additionally, I created two classes to keep the code readable and reliable.

**TwitterAPI** &#8594; Implements methods for authenticating and for retrieving tweets and user data, all via REST APIs.
- ``getTweetsByUserId(id, nmax)`` &#8594; Gets the most recent tweets of a user by user id, at most _nmax_ per call.
- ``getTweetById(id)`` &#8594; Gets tweet info by tweet id.
- ``getUserByHandle(handle)`` &#8594; Gets user info by handle (e.g. @JohnDoe).
- It also holds a **Bearer Token** attribute, a string that is the predominant type of access token used with **OAuth 2.0**.

In [2]:
class TwitterAPI:

    def __init__(self):
        self.__BEARER = 'AAAAAAAAAAAAAAAAAAAAAAYXVQEAAAAADCquAxAdP8tVr%2BGc7OvHAnlgans%3DbGrWjD5c31ld2CTvTTM0nhyXBYEtTH1oBhqoAkM8ycPhv9Lfy2'
    
    def getTweetsByUserId(self, id, nmax):
        headers = { 'Authorization': "Bearer " + self.__BEARER }
        response = requests.get(f'https://api.twitter.com/2/users/{id}/tweets?max_results={nmax}', headers=headers)
        return response.json()

    def getTweetById(self, id):
        headers = { 'Authorization': "Bearer " + self.__BEARER }
        response = requests.get(f'https://api.twitter.com/2/tweets?ids={id}&expansions=author_id', headers=headers)
        return response.json()

    def getUserByHandle(self, handle):
        handle = handle.replace('@', '')
        headers = { 'Authorization': "Bearer " + self.__BEARER }
        response = requests.get(f'https://api.twitter.com/2/users/by/username/{handle}?user.fields=verified', headers=headers)
        return response.json()

**AnalyzeVIP** &#8594; Implements methods to retrieve the tweets of a VIP, classify them, and transform the predictions into a DataFrame.
- ``__init__(handle)`` &#8594; Gets the user id from the handle.
- ``__loadTweets(rows)`` &#8594; Gets up to _rows_ of the user's latest tweets.
- ``__classify(c, rows)`` &#8594; Feeds the tweet texts to the classifier _c_ to get predictions.
- ``makeTable(c, rows)`` &#8594; Builds a DataFrame with two columns: _message_ and _predict_.

In [3]:
class AnalyzeVIP:

    def __init__(self, handle):
        self.__twitter = TwitterAPI()
        self.__id = self.__twitter.getUserByHandle(handle)['data']['id']

    def __loadTweets(self, rows):
        tweets = self.__twitter.getTweetsByUserId(self.__id, rows)
        self.__texts = [tweet['text'] for tweet in tweets['data']]

    def __classify(self, c, rows):
        self.__loadTweets(rows)
        return c.predict(self.__texts)

    def makeTable(self, c, rows):
        preds = self.__classify(c, rows)
        # Named `table` to avoid shadowing the built-in `dict` type
        table = {'message': self.__texts, 'predict': preds}
        return pd.DataFrame(data=table)
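
The class above indexes the response with ``['data']`` directly, which raises a ``KeyError`` whenever the API answers without that key. A defensive variant could look like the sketch below. The helper `extract_texts` is hypothetical, and the two example payloads are assumptions about what a response with and without tweets might look like, not captured API output.

```python
def extract_texts(response):
    """Return the tweet texts from a decoded JSON response, or [] if there are none.

    Hypothetical helper: it assumes that error or empty responses simply lack
    a 'data' list, so it never raises KeyError.
    """
    data = response.get('data')
    if not isinstance(data, list):
        return []
    return [tweet['text'] for tweet in data if 'text' in tweet]

# Illustrative payloads (assumed shapes)
with_tweets = {'data': [{'id': '1', 'text': 'Keep 1.5 alive'},
                        {'id': '2', 'text': 'School strike'}]}
without_tweets = {'meta': {'result_count': 0}}

print(extract_texts(with_tweets))
print(extract_texts(without_tweets))
```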

I also implemented a **preprocessing** function, ``preProc``, using **Regular Expressions** (RegEx). <br/>
It removes links, @mentions, and the retweet marker ``RT `` from the tweet text, because they confuse the classifier in multiple ways. <br/>
For instance, the dataset may contain a tweet (with sentiment **J**) consisting of just an image (represented by a link). Since there is no text to analyze, the classifier would learn to assign class **J** to every tweet that consists only of a link, which is correct only by chance.

In [4]:
def preProc(doc):
    return re.sub(r"(?:\@|https?\://)\S+", "", doc).replace('RT ', '')
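
To see what the preprocessing does, the sketch below applies the same expression to two made-up tweets. The example strings are invented for illustration.

```python
import re

def preProc(doc):
    # Same expression as in the notebook: drop @mentions and http(s) links, then the 'RT ' marker
    return re.sub(r"(?:\@|https?\://)\S+", "", doc).replace('RT ', '')

# Made-up tweets for illustration
retweet = "RT @SomeUser: Watch this documentary https://t.co/abc123 tonight"
link_only = "https://t.co/xyz789"

print(repr(preProc(retweet)))    # mention, link and 'RT ' are gone, extra spaces remain
print(repr(preProc(link_only)))  # nothing is left to classify
```

Two side effects are worth knowing: the leftover whitespace is harmless because the vectorizer tokenizes on word boundaries, and ``.replace('RT ', '')`` also matches inside uppercase words that end in "RT" (for example, ``'START now'`` becomes ``'STAnow'``).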

## Dataset Loading

First, I loaded the dataset from a CSV file.

In [5]:
tweets = pd.read_csv('dataset.csv')
tweets.pop('tweetid')
tweets.head()

Unnamed: 0,sentiment,message
0,-1,@tiniebeany climate change is an interesting h...
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...
3,1,RT @Mick_Fanning: Just watched this amazing do...
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ..."


## Experimental Dataset

**As an experiment**, I also extracted the sentiment of each tweet with **VADER** and added it as extra features. <br/>
Each sentiment returned by VADER is composed of **four scores**: negative, neutral, positive, and compound (a normalized overall score between -1 and 1).

In [6]:
vader = vaderSentiment.vaderSentiment.SentimentIntensityAnalyzer()

tweets['neg'], tweets['neu'], tweets['pos'], tweets['com'] = 0, 0, 0, 0
idx = 0

for tweet in tweets.iterrows():
    sentiment = vader.polarity_scores(preProc(tweet[1]['message']))

    tweets.loc[idx, 'neg'] = sentiment['neg'] * 100
    tweets.loc[idx, 'neu'] = sentiment['neu'] * 100
    tweets.loc[idx, 'pos'] = sentiment['pos'] * 100
    tweets.loc[idx, 'com'] = sentiment['compound'] * 100

    idx += 1

tweets.head()

Unnamed: 0,sentiment,message,neg,neu,pos,com
0,-1,@tiniebeany climate change is an interesting h...,8.1,62.2,29.7,64.28
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,0.0,100.0,0.0,0.0
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...,0.0,54.4,45.6,85.44
3,1,RT @Mick_Fanning: Just watched this amazing do...,0.0,74.3,25.7,67.05
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ...",15.6,73.6,10.8,-27.32


I tested this dataset with each pipeline and each algorithm. Here is an example with the KNN pipeline. <br/>
Unfortunately, as you can see below, this new version of the dataset did not improve the results. On the contrary, it made them worse than those of the KNN classifier that you can find [below in the project](#comparison-of-classification-algorithms). <br/>
The extra features only produced a more complex model, with higher space and time complexity. For these reasons I did not use it.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin # base classes for custom transformers with fit and transform methods
from sklearn.pipeline import FeatureUnion # concatenates the outputs of several transformers
from sklearn.preprocessing import MaxAbsScaler # scales each feature by its maximum absolute value, into the range [-1, 1]

In [None]:
class NumberSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

class TextSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

In [None]:
knn_exp = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('selector', TextSelector(key='message')),
            ('vect', CountVectorizer(stop_words=None, strip_accents=None, lowercase=True, max_df=0.75, preprocessor=preProc)),
            ('tfidf', TfidfTransformer(use_idf=False)),
            ('standard', MaxAbsScaler())
        ])),
        ('pos', Pipeline([
            ('selector', NumberSelector(key='pos')),
            ('standard', MaxAbsScaler())
        ])),
        ('neg', Pipeline([
            ('selector', NumberSelector(key='neg')),
            ('standard', MaxAbsScaler())
        ])),
        ('neu', Pipeline([
            ('selector', NumberSelector(key='neu')),
            ('standard', MaxAbsScaler())
        ])),
        ('com', Pipeline([
            ('selector', NumberSelector(key='com')),
            ('standard', MaxAbsScaler())
        ]))
    ])),
    ('classifier', KNeighborsClassifier(n_neighbors=1, weights='uniform'))
])

In [None]:
x_tr_exp, x_te_exp, y_tr_exp, y_te_exp = train_test_split(tweets[['message', 'neg', 'neu', 'pos', 'com']], tweets['sentiment'], test_size=0.3)

knn_exp.fit(x_tr_exp, y_tr_exp)
knn_exp_predict = knn_exp.predict(x_te_exp)

print('F1 score for KNN on experimental dataset: ', f1_score(y_te_exp, knn_exp_predict, average=None) * 100)
print('Accuracy score for KNN on experimental dataset: ', accuracy_score(y_te_exp, knn_exp_predict) * 100)

F1 score for KNN on experimental dataset:  [25.91656131 37.70835633 47.18579786 53.24384787]
Accuracy score for KNN on experimental dataset:  43.988469999241445


## Splitting Dataset

I split the dataset into a Training Set (TR) and a Test Set (TE), with 85% and 15% of the tweets respectively. <br/>
[Later](#comparison-of-classification-algorithms), the Validation Set (VA) is extracted from the Training Set; it amounts to about 15% of the whole dataset.
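
The proportions can be checked with a little arithmetic. The sketch below assumes that a fractional ``test_size`` is rounded up to a whole number of samples and that the remainder goes to the training side; the value 0.1765 is the ``test_size`` passed to ``ShuffleSplit`` later in the notebook.

```python
import math

n_total = 43943  # dataset size stated in the introduction

# Assumption: a float test_size is rounded up, the rest is the training set
n_te = math.ceil(0.15 * n_total)
n_tr = n_total - n_te

# Validation split taken later from TR with test_size=0.1765
n_va = math.ceil(0.1765 * n_tr)

print('TR:', n_tr, 'TE:', n_te, 'VA:', n_va)
print('VA as a share of the whole dataset:', round(n_va / n_total, 4))
```

Since 0.1765 × 0.85 ≈ 0.15, the validation set is about the same size as the test set.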

In [7]:
np.random.seed()

x_tr, x_te, y_tr, y_te = train_test_split(tweets['message'], tweets['sentiment'], test_size=0.15)

print('Training Set Length:', len(x_tr))
print('Test Set Length:', len(x_te))

Training Set Length: 37351
Test Set Length: 6592


## Comparison of Classification Algorithms

I chose five algorithms: ``K-Nearest Neighbors``, ``Multinomial Naive-Bayes``, ``Logistic Regression``, ``SVM``, and ``SGD-Classifier``. <br/>
``K-Nearest Neighbors`` and ``SVM`` are, respectively, a simple and a more complex geometric approach to classification. <br/>
``Multinomial Naive-Bayes`` and ``Logistic Regression`` are, respectively, a simple and a more complex probabilistic approach to classification. <br/>
``SGD-Classifier`` implements linear models (both geometric and probabilistic) and trains them with SGD (Stochastic Gradient Descent).

I built five different pipelines, one for each algorithm. <br/>
Each pipeline has three main steps:
- _vect_ &#8594; **CountVectorizer** &#8594; converts a collection of documents (tweets) to a matrix of token counts (Bag-of-Words). The tuned parameters are:
    - *stop_words* &#187; Removes some very common words (based on the English vocabulary).
    - *strip_accents* &#187; Removes accents and performs other character normalization.
    - *lowercase* &#187; Converts all characters to lowercase.
    - *max_df* &#187; When building the vocabulary, ignores terms that have a document frequency strictly higher than the given threshold.
- _tfidf_ &#8594; **TfidfTransformer** &#8594; transforms a count matrix to a normalized tf or tf-idf representation (a common term weighting scheme in information retrieval). The tuned parameter is:
    - *use_idf* &#187; Enables inverse-document-frequency reweighting. If False, idf(t) = 1.
- _clf_ &#8594; the algorithm used for classification.
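
The first two steps can be sketched in pure Python to show what the classifier actually receives. This is an illustration of the idea, not scikit-learn's implementation: it assumes tokens of two or more word characters, a smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row, which is how I understand the library's defaults. The function names `tokenize` and `tfidf_matrix` are hypothetical.

```python
import math
import re

def tokenize(doc, lowercase=True):
    # Tokens of two or more word characters (assumed default token pattern)
    text = doc.lower() if lowercase else doc
    return re.findall(r"\b\w\w+\b", text)

def tfidf_matrix(docs, use_idf=True):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    # Smoothed idf; with use_idf=False every term gets weight 1
    idf = {t: (math.log((1 + n_docs) / (1 + df[t])) + 1) if use_idf else 1.0
           for t in vocab}
    rows = []
    for toks in tokenized:
        row = [toks.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])  # L2-normalize each document
    return vocab, rows

docs = ["Climate change is real", "Climate change is a hoax"]
vocab, rows = tfidf_matrix(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in both documents ("climate", "change", "is") receive a lower weight than the terms that distinguish them ("real", "hoax").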

In each pipeline, some parameters of each step were chosen with ``GridSearchCV``. <br/>
By default, *GridSearchCV* evaluates every parameter combination with 5-fold **Cross-Validation**, which is slow on a dataset of this size. <br/>
To solve this problem, I set the *'cv'* parameter to a **ShuffleSplit()** that takes the **Validation Set** (15% of the dataset, i.e. ``test_size=0.1765`` of the TR) and evaluates each combination just once (instead of 5 times), reducing the execution time.

**N.B.** Before running this code, I suggest setting the *n_jobs* parameter of each _GridSearchCV_ to the number of your cores minus 2. <br/>
**N.B.** Expect the entire run to last **about 5 hours**, instead of about **25 hours** without ShuffleSplit.
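
A rough count of model fits helps explain the time saving. The sketch below copies the KNN grid from the next subsection (with ``np.arange(1, 8)`` written as ``range(1, 8)``) and counts the parameter combinations; each combination is fitted once per validation split.

```python
from itertools import product

# Copy of the KNN grid, with np.arange(1, 8) written as range(1, 8)
params = {
    'vect__stop_words': (None, 'english'),
    'vect__strip_accents': (None, 'unicode'),
    'vect__lowercase': [True, False],
    'vect__max_df': [0.125, 0.25, 0.375, 0.5, 0.75, 0.875, 1.0],
    'tfidf__use_idf': (True, False),
    'clf__n_neighbors': list(range(1, 8)),
    'clf__weights': ['uniform', 'distance'],
}

n_candidates = sum(1 for _ in product(*params.values()))
fits_single_split = n_candidates * 1   # ShuffleSplit(n_splits=1)
fits_five_fold = n_candidates * 5      # 5-fold cross-validation

print(n_candidates, fits_single_split, fits_five_fold)
```

The factor of five between the two counts is consistent with the estimate of about 5 hours versus about 25 hours.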

#### **K-Nearest Neighbors Algorithm**

This algorithm computes the distances between a query point and all the examples in the data, selects the K examples closest to the query, and then assigns the most frequent label among them.

<img src="https://antonioscardace.altervista.org/smm/knn.png" style="height: 300px; margin-top: 10px;"/>

In this case, the classifier takes two parameters:
- *n_neighbors* &#187; Number of closest examples to consider during classification. It is usually called *K parameter*.
- *weights* &#187; Weight function used in prediction.
    - *'uniform'* : uniform weights. All points in each neighborhood are weighted equally.
    - *'distance'* : weight points by the inverse of their distance. In this case, closer neighbors will have a greater influence than neighbors which are further away. 
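
The effect of the two weighting schemes can be seen on a toy example. This is a simplified pure-Python sketch with a hypothetical `knn_predict` function; the real ``KNeighborsClassifier`` uses optimized neighbor search and treats zero distances specially, whereas here a tiny floor is used instead.

```python
from collections import defaultdict

def knn_predict(train, query, k, weights='uniform'):
    """train: list of (point, label) pairs; points and query are tuples of floats."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    neighbors = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        if weights == 'distance':
            # Inverse-distance weight; a tiny floor avoids division by zero
            votes[label] += 1.0 / max(dist(point, query), 1e-12)
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# One close 'A' neighbor and two farther 'B' neighbors (made-up 1-D data)
train = [((0.1,), 'A'), ((1.0,), 'B'), ((1.1,), 'B')]
query = (0.0,)

print(knn_predict(train, query, k=3, weights='uniform'))
print(knn_predict(train, query, k=3, weights='distance'))
```

With uniform weights the two 'B' neighbors outvote the single 'A'; with distance weights the much closer 'A' wins.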

In [8]:
knn_va_classifier = Pipeline([
    ('vect', CountVectorizer(preprocessor=preProc)),
    ('tfidf', TfidfTransformer()),
    ('clf', KNeighborsClassifier())
])

params = {
    'vect__stop_words': (None, 'english'),
    'vect__strip_accents': (None, 'unicode'),
    'vect__lowercase': [True, False],
    'vect__max_df': [0.125, 0.25, 0.375, 0.5, 0.75, 0.875, 1.0],
    'tfidf__use_idf': (True, False),
    'clf__n_neighbors': np.arange(1, 8),
    'clf__weights': ['uniform', 'distance'],
}
knn_gs = GridSearchCV(knn_va_classifier, params, cv=ShuffleSplit(test_size=0.1765, n_splits=1), n_jobs=4)

knn_gs.fit(x_tr, y_tr)

print('Use Stop-Words-Removal: ', knn_gs.best_params_['vect__stop_words'])
print('Remove Accents: ', knn_gs.best_params_['vect__strip_accents'])
print('Lowercase: ', knn_gs.best_params_['vect__lowercase'])
print('Max-df: ', knn_gs.best_params_['vect__max_df'])
print('Use TF-IDF: ', knn_gs.best_params_['tfidf__use_idf'])
print('Best K parameter: ', knn_gs.best_params_['clf__n_neighbors'])
print('Weight function used in prediction: ', knn_gs.best_params_['clf__weights'])

Use Stop-Words-Removal:  None
Remove Accents:  None
Lowercase:  True
Max-df:  0.75
Use TF-IDF:  False
Best K parameter:  1
Weight function used in prediction:  uniform


In [9]:
knn_classifier = Pipeline([
    ('vect', CountVectorizer(
        stop_words=knn_gs.best_params_['vect__stop_words'],
        strip_accents=knn_gs.best_params_['vect__strip_accents'],
        lowercase=knn_gs.best_params_['vect__lowercase'],
        max_df=knn_gs.best_params_['vect__max_df'],
        preprocessor=preProc,
    )),
    ('tfidf', TfidfTransformer(use_idf=knn_gs.best_params_['tfidf__use_idf'])),
    ('clf', KNeighborsClassifier(n_neighbors=knn_gs.best_params_['clf__n_neighbors'], weights=knn_gs.best_params_['clf__weights']))
])

knn_classifier.fit(x_tr, y_tr)
knn_predict = knn_classifier.predict(x_te)

print('F1 score for KNN on TE: ', f1_score(y_te, knn_predict, average=None) * 100)
print('Accuracy score for KNN on TE: ', accuracy_score(y_te, knn_predict) * 100)

F1 score for KNN on TE:  [28.16901408 40.31907179 59.06111603 62.66666667]
Accuracy score for KNN on TE:  51.88106796116505


#### **Multinomial Naive-Bayes Algorithm**

This algorithm is a probabilistic approach to classification. It is based on applying Bayes' theorem with strong (naive) independence assumptions between the features.

In this case, no classifier parameters are tuned: &#8709;
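
A minimal sketch of the idea, assuming raw token counts and additive smoothing with alpha = 1. The function names `train_nb` and `predict_nb` are hypothetical, and the notebook's pipeline feeds tf-idf weights rather than raw counts to ``MultinomialNB``, so this only illustrates the principle.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists. Returns log-priors, log-likelihoods and the vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        # Additive smoothing keeps unseen words from zeroing the product
        log_like[c] = {t: math.log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like, vocab

def predict_nb(model, doc):
    log_prior, log_like, vocab = model
    known = [t for t in doc if t in vocab]  # ignore out-of-vocabulary tokens
    # "Naive" assumption: per-token log-probabilities simply add up
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in known) for c in log_prior}
    return max(scores, key=scores.get)

# Made-up training data: -1 = Anti, 1 = Pro
docs = [['climate', 'hoax'], ['hoax', 'scam'], ['climate', 'action'], ['action', 'now']]
labels = [-1, -1, 1, 1]
model = train_nb(docs, labels)

print(predict_nb(model, ['hoax']))
print(predict_nb(model, ['action', 'now']))
```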

In [10]:
nb_va_classifier = Pipeline([
    ('vect', CountVectorizer(preprocessor=preProc)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

params = {
    'vect__stop_words': (None, 'english'),
    'vect__strip_accents': (None, 'unicode'),
    'vect__lowercase': [True, False],
    'vect__max_df': [0.125, 0.25, 0.375, 0.5, 0.75, 0.875, 1.0],
    'tfidf__use_idf': (True, False)
}
nb_gs = GridSearchCV(nb_va_classifier, params, cv=ShuffleSplit(test_size=0.1765, n_splits=1), n_jobs=4)

nb_gs.fit(x_tr, y_tr)

print('Use Stop-Words-Removal: ', nb_gs.best_params_['vect__stop_words'])
print('Remove Accents: ', nb_gs.best_params_['vect__strip_accents'])
print('Lowercase: ', nb_gs.best_params_['vect__lowercase'])
print('Max-df: ', nb_gs.best_params_['vect__max_df'])
print('Use TF-IDF: ', nb_gs.best_params_['tfidf__use_idf'])

Use Stop-Words-Removal:  english
Remove Accents:  None
Lowercase:  True
Max-df:  0.125
Use TF-IDF:  True


In [11]:
nb_classifier = Pipeline([
    ('vect', CountVectorizer(
        stop_words=nb_gs.best_params_['vect__stop_words'],
        strip_accents=nb_gs.best_params_['vect__strip_accents'],
        lowercase=nb_gs.best_params_['vect__lowercase'],
        max_df=nb_gs.best_params_['vect__max_df'],
        preprocessor=preProc
    )),
    ('tfidf', TfidfTransformer(use_idf=nb_gs.best_params_['tfidf__use_idf'])),
    ('clf', MultinomialNB())
])

nb_classifier.fit(x_tr, y_tr)
nb_predict = nb_classifier.predict(x_te)

print('F1 score for MultinomialNB on TE: ', f1_score(y_te, nb_predict, average=None) * 100)
print('Accuracy score for MultinomialNB on TE: ', accuracy_score(y_te, nb_predict) * 100)

F1 score for MultinomialNB on TE:  [ 9.50819672 16.50641026 74.90523124 59.43621596]
Accuracy score for MultinomialNB on TE:  63.89563106796117


#### **Logistic Regression Algorithm**

This algorithm is named for the function used at the core of the method, the logistic function. It is an S-shaped curve that maps any real number to a value (probability) strictly between 0 and 1.

<img src="https://antonioscardace.altervista.org/smm/logistic_regression.png" style="height: 300px; margin-top: 10px;"/>

In this case, two classifier parameters are tuned:
- *max_iter* &#187; Maximum number of iterations taken for the solvers to converge.
- *solver* &#187; Algorithm to use in the optimization problem. The best choice depends on the dataset size and on the number of possible output classes.
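
The logistic function itself is short enough to write out. The sketch below covers only the binary case, with hypothetical weights and inputs; the notebook's four-class problem is handled internally by ``LogisticRegression``.

```python
import math

def logistic(z):
    """Map a real number to a value between 0 and 1 (S-shaped curve)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    # Probability of the positive class for a linear model (binary case)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return logistic(z)

print(logistic(0.0))
print(round(predict_proba([2.0, -1.0], 0.5, [1.0, 1.0]), 4))
```

The solver's job is to find the weights and bias that make these probabilities agree with the training labels as closely as possible.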

In [12]:
lr_va_classifier = Pipeline([
    ('vect', CountVectorizer(preprocessor=preProc)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

params = {
    'vect__stop_words': (None, 'english'),
    'vect__strip_accents': (None, 'unicode'),
    'vect__lowercase': [True, False],
    'vect__max_df': [0.125, 0.25, 0.375, 0.5, 0.75, 0.875, 1.0],
    'tfidf__use_idf': (True, False),
    'clf__max_iter': (600, 700, 800),
    'clf__solver': ['liblinear', 'sag', 'saga', 'lbfgs']
}
lr_gs = GridSearchCV(lr_va_classifier, params, cv=ShuffleSplit(test_size=0.1765, n_splits=1), n_jobs=4)

lr_gs.fit(x_tr, y_tr)

print('Use Stop-Words-Removal: ', lr_gs.best_params_['vect__stop_words'])
print('Remove Accents: ', lr_gs.best_params_['vect__strip_accents'])
print('Lowercase: ', lr_gs.best_params_['vect__lowercase'])
print('Max-df: ', lr_gs.best_params_['vect__max_df'])
print('Use TF-IDF: ', lr_gs.best_params_['tfidf__use_idf'])
print('Maximum number of iterations taken to converge: ', lr_gs.best_params_['clf__max_iter'])
print('Algorithm to use in the optimization problem: ', lr_gs.best_params_['clf__solver'])

Use Stop-Words-Removal:  None
Remove Accents:  None
Lowercase:  False
Max-df:  0.25
Use TF-IDF:  True
Maximum number of iterations taken to converge:  600
Algorithm to use in the optimization problem:  saga


In [13]:
lr_classifier = Pipeline([
    ('vect', CountVectorizer(
        stop_words=lr_gs.best_params_['vect__stop_words'],
        strip_accents=lr_gs.best_params_['vect__strip_accents'],
        lowercase=lr_gs.best_params_['vect__lowercase'],
        max_df=lr_gs.best_params_['vect__max_df'],
        preprocessor=preProc
    )),
    ('tfidf', TfidfTransformer(use_idf=lr_gs.best_params_['tfidf__use_idf'])),
    ('clf', LogisticRegression(max_iter=lr_gs.best_params_['clf__max_iter'], solver=lr_gs.best_params_['clf__solver']))
])

lr_classifier.fit(x_tr, y_tr)
lr_predict = lr_classifier.predict(x_te)

print('F1 score for LogisticRegressor on TE: ', f1_score(y_te, lr_predict, average=None) * 100)
print('Accuracy score for LogisticRegressor on TE: ', accuracy_score(y_te, lr_predict) * 100)

F1 score for LogisticRegressor on TE:  [47.80600462 49.12652197 80.04125838 73.84960718]
Accuracy score for LogisticRegressor on TE:  72.23907766990291


#### **Support-Vector-Machine (SVM) Algorithm**

The objective of the Support Vector Machine algorithm is to find a hyperplane in an N-dimensional space (N = number of features) that distinctly separates the data points of different classes. <br/>
Among all such hyperplanes, we want the one with the maximum margin, i.e. the maximum distance from the nearest data points of each class. Maximizing this margin gives some robustness, so that future data points can be classified with more confidence.

<img src="https://antonioscardace.altervista.org/smm/kernel_svm.png" style="height: 250px; margin-top: 10px;"/> <br/>
<img src="https://antonioscardace.altervista.org/smm/svm_example.png" style="height: 250px; margin-left:15px; margin-top: 10px;"/>

In this case, one classifier parameter is tuned:
- *kernel* &#187; Specifies the kernel type used by the algorithm, which determines the shape of the "best fit" boundary that divides (categorizes) the data.
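
For a linear kernel, the hyperplane and its margin can be written down directly. The sketch below uses a hypothetical hyperplane (`w`, `b`) and made-up 2-D points for two classes; it does not train anything, it only shows how a point is classified and how wide the margin is.

```python
import math

def decision_function(w, b, x):
    # Signed score: positive on one side of the hyperplane w.x + b = 0, negative on the other
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    # Distance between the two margin hyperplanes w.x + b = +1 and w.x + b = -1
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and labeled points
w, b = [1.0, 1.0], -3.0
points = [([1.0, 1.0], -1), ([3.0, 2.0], 1), ([0.0, 2.0], -1), ([2.0, 2.0], 1)]

predictions = [1 if decision_function(w, b, x) >= 0 else -1 for x, _ in points]
print(predictions)
print(round(margin_width(w), 4))
```

Training an SVM amounts to choosing `w` and `b` so that this margin is as wide as possible while the points stay on the correct side.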

In [16]:
svm_va_classifier = Pipeline([
    ('vect', CountVectorizer(preprocessor=preProc)),
    ('tfidf', TfidfTransformer()),
    ('clf', SVC())
])

params = {
    'vect__stop_words': (None, 'english'),
    'vect__strip_accents': (None, 'unicode'),
    'vect__lowercase': [True, False],
    'vect__max_df': [0.125, 0.25, 0.375, 0.5, 0.75, 0.875, 1.0],
    'tfidf__use_idf': (True, False),
    'clf__kernel': ('linear', 'rbf')
}
svm_gs = GridSearchCV(svm_va_classifier, params, cv=ShuffleSplit(test_size=0.1765, n_splits=1), n_jobs=4)

svm_gs.fit(x_tr, y_tr)

print('Use Stop-Words-Removal: ', svm_gs.best_params_['vect__stop_words'])
print('Remove Accents: ', svm_gs.best_params_['vect__strip_accents'])
print('Lowercase: ', svm_gs.best_params_['vect__lowercase'])
print('Max-df: ', svm_gs.best_params_['vect__max_df'])
print('Use TF-IDF: ', svm_gs.best_params_['tfidf__use_idf'])
print('Kernel type used: ', svm_gs.best_params_['clf__kernel'])

Use Stop-Words-Removal:  None
Remove Accents:  None
Lowercase:  True
Max-df:  0.375
Use TF-IDF:  True
Kernel type used:  linear


In [17]:
svm_classifier = Pipeline([
    ('vect', CountVectorizer(
        stop_words=svm_gs.best_params_['vect__stop_words'],
        strip_accents=svm_gs.best_params_['vect__strip_accents'],
        lowercase=svm_gs.best_params_['vect__lowercase'],
        max_df=svm_gs.best_params_['vect__max_df'],
        preprocessor=preProc
    )),
    ('tfidf', TfidfTransformer(use_idf=svm_gs.best_params_['tfidf__use_idf'])),
    ('clf', SVC(kernel=svm_gs.best_params_['clf__kernel']))
])

svm_classifier.fit(x_tr, y_tr)
svm_predict = svm_classifier.predict(x_te)

print('F1 score for SVM on TE: ', f1_score(y_te, svm_predict, average=None) * 100)
print('Accuracy score for SVM on TE: ', accuracy_score(y_te, svm_predict) * 100)

F1 score for SVM on TE:  [55.76923077 51.55374427 80.55811505 75.29761905]
Accuracy score for SVM on TE:  73.40716019417476


#### **SGD (Stochastic Gradient Descent)**

It trains linear classifiers (such as SVM and Logistic Regression) with SGD (Stochastic Gradient Descent). With its default hinge loss, ``SGDClassifier`` fits a linear SVM.

In this case, no classifier parameters are tuned: &#8709;
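
The update rule behind this approach can be sketched for the binary case with hinge loss and an L2 penalty. This is a simplified illustration with a hypothetical `sgd_hinge` function, a constant learning rate, and no shuffling; ``SGDClassifier`` uses its own learning-rate schedule and handles the four classes internally.

```python
def sgd_hinge(X, y, lr=0.1, alpha=0.01, epochs=20):
    """Train a linear classifier with hinge loss and L2 penalty, one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):  # fixed order here; real SGD usually shuffles
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if target * score < 1:
                # Sample inside the margin: move the hyperplane towards classifying it
                w = [wi + lr * (target * xi - alpha * wi) for wi, xi in zip(w, x)]
                b += lr * target
            else:
                # Sample outside the margin: only the L2 penalty shrinks the weights
                w = [wi - lr * alpha * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Made-up, linearly separable 2-D data with labels +1 / -1
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
w, b = sgd_hinge(X, y)

print([predict(w, b, x) for x in X])
```

Because each update looks at a single sample, the method scales well to large datasets, at the cost of a noisier path to the solution.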

In [14]:
sgd_va_classifier = Pipeline([
    ('vect', CountVectorizer(preprocessor=preProc)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier())
])

params = {
    'vect__stop_words': (None, 'english'),
    'vect__strip_accents': (None, 'unicode'),
    'vect__lowercase': [True, False],
    'vect__max_df': [0.125, 0.25, 0.375, 0.5, 0.75, 0.875, 1.0],
    'tfidf__use_idf': (True, False)
}
sgd_gs = GridSearchCV(sgd_va_classifier, params, cv=ShuffleSplit(test_size=0.1765, n_splits=1), n_jobs=4)

sgd_gs.fit(x_tr, y_tr)

print('Use Stop-Words-Removal: ', sgd_gs.best_params_['vect__stop_words'])
print('Remove Accents: ', sgd_gs.best_params_['vect__strip_accents'])
print('Lowercase: ', sgd_gs.best_params_['vect__lowercase'])
print('Max-df: ', sgd_gs.best_params_['vect__max_df'])
print('Use TF-IDF: ', sgd_gs.best_params_['tfidf__use_idf'])

Use Stop-Words-Removal:  None
Remove Accents:  None
Lowercase:  True
Max-df:  0.375
Use TF-IDF:  True


In [15]:
sgd_classifier = Pipeline([
    ('vect', CountVectorizer(
        stop_words=sgd_gs.best_params_['vect__stop_words'],
        strip_accents=sgd_gs.best_params_['vect__strip_accents'],
        lowercase=sgd_gs.best_params_['vect__lowercase'],
        max_df=sgd_gs.best_params_['vect__max_df'],
        preprocessor=preProc
    )),
    ('tfidf', TfidfTransformer(use_idf=sgd_gs.best_params_['tfidf__use_idf'])),
    ('clf', SGDClassifier())
])

sgd_classifier.fit(x_tr, y_tr)
sgd_predict = sgd_classifier.predict(x_te)

print('F1 score for SGD on TE: ', f1_score(y_te, sgd_predict, average=None) * 100)
print('Accuracy score for SGD on TE: ', accuracy_score(y_te, sgd_predict) * 100)

F1 score for SGD on TE:  [41.26582278 40.95778198 79.04515811 71.71641791]
Accuracy score for SGD on TE:  70.70691747572816


## Real World Application

As we have just seen, **SVM** is the best of these five algorithms, so we will use it in a real-world application.

There are probably many real-world applications for this project, but I was interested in one in particular: <br/>
Given the tweets of some verified Twitter accounts (VIPs such as politicians or actors), we want to understand how much these people care about climate change on their Twitter page. <br/>
We will run tests on verified Twitter profiles owned by well-known activists (such as **Greta Thunberg** and **Leonardo DiCaprio**). <br/>
In time, with an appropriate dataset, analytics could be built on how many VIPs are interested in climate change (and in what way) for each main job category (such as politics, cinema, football).

In [18]:
greta = AnalyzeVIP('@GretaThunberg')
results = greta.makeTable(svm_classifier, 15)

results['message'] = results['message'].str[:130]
results.style

Unnamed: 0,message,predict
0,RT @Fridays4FutureU: Today #FridaysForFuture Uganda activists have again protested against the East African Crude Oil Pipeline at,1
1,RT @omaer_alam: Today I participated in the weekly strike against climate change from the bridge of Turag river in Dhaka district,1
2,RT @Joshomonukk: Keep 1.5 Alive. #FridaysForFuture @Riseupmovt https://t.co/tK8t98dvOo,1
3,RT @fff_tui: Save life on earth Start with a budget that funds #ClimateActionNow Standing outside #NewZealand Minister of finan,1
4,"RT @EKOenergy_: We're in front the Finnish Parliament, asking for #ClimateAction and #sustainable #RenewableEnergy! #fridaysforf",1
5,RT @auber_fichess: Week 41 #ClimateStrike in #Angola In Angola the politicians treat the law with a pistol and the 27th of May is,1
6,RT @FFF_Sweden: Med stor glädje meddelar vi om ett samarbete mellan FFF Sverige och Sáminuorra - vi ska bygga en gruva på Östermal,0
7,"School strike week 189. We recently received an irresistible offer from fossil fuel lobbyists, so we have now changed our narrativ",1
8,RT @stopEACOP: Last week we left the comfort of our homes and travelled to the lion's den. We were at @totalenergies headquarters,1
9,Here is the full list of contributors: https://t.co/eNnvrC4miN,1


As we can see, the tweets of Greta Thunberg (activist) are almost all about climate change, and she supports the belief in man-made climate change. <br/>
Note that the 11th tweet is news, and that the tweet in row 6 is in Swedish: our classifier cannot analyze this language, so it classified the tweet as Neutral. <br/>
The last tweet, on the contrary, is not about climate change, yet it has been classified as Pro. This is an error!

In [19]:
leo = AnalyzeVIP('@LeoDiCaprio')
results = leo.makeTable(svm_classifier, 15)

results['message'] = results['message'].str[:130]
results.style

Unnamed: 0,message,predict
0,"It’s time for people to feel good about their purchases and for businesses to meet that challenge. As a Strategic Advisor, I am ex",1
1,@NPR @UN: https://t.co/8SxRVbwIAU,0
2,"@newscientist: “Planting trillions of trees won’t replace the 10 million hectares of forest ecosystems lost each year, but documen",1
3,"Since October 2020, @NatGeo has documented a pattern of ReconAfrica breaking rules and ignoring environmental and community concer",1
4,Coral scientist @ProfTerryHughes claims a 6th mass bleaching is unfolding across the #GreatBarrierReef. @UNESCO’s World Heritage C,1
5,It’s great to see @TheSolutionsProject be recognized for their amazing work toward resolving the climate crisis. https://t.co/TR13,1
6,Recent @IPCC_CH report tells a sobering truth: Nearly half of humanity is living in the danger zone now. The facts are undeniable,1
7,.@CityNational’s parent company @RBC is violating the rights of indigenous Wet'suwet'en people & bankrolling climate crisis.Jo,1
8,"Today, our #JustLookUp coalition kicks off at 12:30pm, at Pershing Square in Los Angeles – be there to join the movement and march",1
9,"Wild fish populations are threatened more than ever before. I’m pleased to be an investor in @wildtypefoods, the clear leader in c",1


As we can see, the tweets of Leonardo DiCaprio (actor and activist) are almost all about climate change, and he supports the belief in man-made climate change.

## Conclusions

I tested and compared five classification algorithms. The best is SVM, followed in order of per-class F1 scores by Logistic Regression, SGD, K-Nearest Neighbors, and Multinomial Naive-Bayes (by accuracy alone, Multinomial Naive-Bayes ranks above K-Nearest Neighbors). <br/>
A Neural Network would probably classify text better.
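
The scores reported in this notebook can be ranked in two ways: by the unweighted mean of the per-class F1 scores (macro-F1) or by accuracy. The sketch below uses the Test Set values printed earlier, rounded to two decimals; the per-class F1 values are in label order (Anti, Neutral, Pro, News).

```python
# Per-class F1 (Anti, Neutral, Pro, News) and accuracy on TE, rounded from the outputs above
results = {
    'KNN':                 ([28.17, 40.32, 59.06, 62.67], 51.88),
    'MultinomialNB':       ([9.51, 16.51, 74.91, 59.44], 63.90),
    'Logistic Regression': ([47.81, 49.13, 80.04, 73.85], 72.24),
    'SVM':                 ([55.77, 51.55, 80.56, 75.30], 73.41),
    'SGD':                 ([41.27, 40.96, 79.05, 71.72], 70.71),
}

macro_f1 = {name: sum(f1) / len(f1) for name, (f1, _) in results.items()}
by_macro_f1 = sorted(results, key=lambda n: macro_f1[n], reverse=True)
by_accuracy = sorted(results, key=lambda n: results[n][1], reverse=True)

print('By macro-F1:', by_macro_f1)
print('By accuracy:', by_accuracy)
```

The two rankings agree on the top three and differ only on the last two places: Multinomial Naive-Bayes has the higher accuracy, but its F1 on the Anti and Neutral classes is much lower than that of K-Nearest Neighbors.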

This project has Pros 🆗 and Cons ⛔, due to the basic models used and to the imbalanced dataset (the classes are not well proportioned):
* 🆗 when tweets are really about climate change, the model works well enough.
* 🆗 it is "quick", and it was really useful for introducing me to the Machine Learning world.
* 🆗 it is easy to read and was easy to write.
* ⛔ when tweets are not about climate change, the model does not work (it probably bases its prediction on the tweet sentiment without really understanding the topic).
* ⛔ when tweets contain humor, the model does not work very well.

To conclude: I learned a lot by implementing this project. I put into practice many theoretical notions studied during my Social Media Management course at university. <br/>
Moreover, I learned that Machine Learning is a broad field, useful for many real ideas and applications.

## Bibliography

All info about sklearn classes and functions was taken from: [scikit-learn.org](https://scikit-learn.org/stable/index.html). <br/>
All info about Twitter APIs was taken from: [developer.twitter.com](https://developer.twitter.com/en/docs/twitter-api). <br/>
The dataset and the idea were taken from: [kaggle.com/edqian/twitter-climate-change-sentiment-dataset](https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset).