# CS 180 Machine Project: Global Temperatures

The project aims to find an effective machine learning algorithm that can predict whether a Twitter user supports the belief in man-made climate change.

### Features
The features to be extracted are:
<br>
* Unigrams
* Bigrams
* Trigrams
<br>

### Model
The models to be used are:
<br>
* Support Vector Machine
* Decision Trees
<br>
<br>
---
! Run the cell below to load the libraries to be used in the project

In [2]:
import pandas as pd
import re
import string
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, recall_score, precision_score
from collections import defaultdict

### Data Import

The Twitter data comes from two datasets:
<br>
* Kaggle
* data.world
<br>

The Twitter sentiments needed are:
<br>
1. Pro
2. Anti
3. Neutral 
<br>
<br>
---
! Run the cell below to import the data

In [3]:
data1 = pd.read_csv("../data/kaggle_twitter_data.csv")
data2 = pd.read_csv("../data/dataworld_twitter_data.csv")

# Remove other columns
data1 = data1[["sentiment", "tweet"]]
data2 = data2[["sentiment", "tweet"]]

# Remove sentiment=2 from Kaggle data set
data1 = data1[data1.sentiment != 2]

frames = [data1, data2]

# All data
data = pd.concat(frames, ignore_index=True)

### Data Preprocessing

The data (the Tweet) is preprocessed such that:
1. Rows with NAN are dropped
2. Tweet is converted to string
3. Tweets are lowercased
4. Removes non-English Characters from the Tweet
5. Removes URLs from the Tweet
6. Removing RT and hyperlink from Tweet
7. Removes @ and # from Tweet
8. Remove stopwords from Tweet
9. Remove numbers from Tweet
10. Remove punctuations from Tweet
11. The Tweet is tokenized
12. The Tweet is lemmatized
<br>

After that, the preprocessed tweet is a new column to the dataset
<br>
---
! Run the cell below to preprocess the data

In [4]:
# Pre-processing
# Drop rows with NA or NAN
data = data.dropna()

df = data["tweet"]

# Make tweet to str
df = df.apply(str)

# Lowercase all words
df = df.apply(lambda x: x.lower())

# Remove non-English characters
df = df.apply(lambda x: x.encode("ascii", "ignore").decode())

# Remove URLS
df = df.apply(
    lambda x: re.sub(r"http?://[A-Za-z0-9./]+", "", x, flags=re.MULTILINE)
)
df = df.apply(
    lambda x: re.sub(r"https?://[A-Za-z0-9./]+", "", x, flags=re.MULTILINE)
)
df = df.apply(
    lambda x: re.sub(r"www?://[A-Za-z0-9./]+", "", x, flags=re.MULTILINE)
)

# Removing RT and link
df = df.apply(lambda x: re.sub(r"\bRT\b", "", x).strip())
df = df.apply(lambda x: re.sub(r"\blink\b", "", x).strip())

# Remove @ and #
df = df.apply(lambda x: re.sub(r"@[A-Za-z0-9_]+", "", x))
df = df.apply(lambda x: re.sub(r"#[A-Za-z0-9_]+", "", x))

# Remove stopwords
stop = stopwords.words("english")
df = df.apply(lambda x: " ".join([x for x in x.split() if x not in stop]))

# Remove numbers
df = df.apply(lambda x: re.sub(r"[0-9]+", "", x))

# Remove punctuations
df = df.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

# Tokenized
tokenizedTweets = [word_tokenize(x) for x in df]

# Lemmatize
lemmatizer = WordNetLemmatizer()
for tweet in tokenizedTweets:
    for word in tweet:
        word = lemmatizer.lemmatize(word)

processed = tokenizedTweets

# Append changed tweet to database
final = []
for x in range(len(processed)):
    final.append(" ".join(processed[x]))

out = pd.DataFrame(final)

data["changedtweet"] = out

### Splitting of Data

The data is split into three categories:
<br>
* Positive
* Negative
* Neutral
<br>

The first 5000 tweets in a category's dataset are to be used for testing and training.

For each category, the samples are split such that there are 60% tweets for training and 40% tweets for testing.
Therefore, for each category, there are:
<br>
* 3000 tweets for training
* 2000 tweets for testing
<br>
---
! Run the cell below to split the dataset

In [5]:
# 5000 samples per label
positive = data[data["sentiment"] == 1][:5000]
negative = data[data["sentiment"] == -1][:5000]
neutral = data[data["sentiment"] == 0][:5000]

#The features are extracted from the changed tweet and the target is the sentiment
features = ["changedtweet"]
targets = ["sentiment"]

#Create an empty Dataframe for features in train and test and sentiment in train and test
X_train = pd.DataFrame(columns = features)
X_test = pd.DataFrame(columns = features)
y_train = pd.DataFrame(columns = targets)
y_test = pd.DataFrame(columns = targets)

#Set an empty array to append the tweets
X_train_list = []
X_test_list = []
y_train_list = []
y_test_list = []

for category in (positive, negative, neutral):
    X = category["changedtweet"]
    y = category["sentiment"]
    Xs_train, Xs_test, ys_train, ys_test = train_test_split(X, y, random_state=0, train_size=0.6) #Split the dataset
    
    #Append the split data set to the array
    X_train_list.append(Xs_train)
    X_test_list.append(Xs_test)
    y_train_list.append(ys_train)
    y_test_list.append(ys_test)

#Concat the three categories to create one dataset
X_train = pd.concat(X_train_list, ignore_index=True)
X_test = pd.concat(X_test_list, ignore_index=True)
y_train = pd.concat(y_train_list, ignore_index=True)
y_test = pd.concat(y_test_list, ignore_index=True)

### Feature Extraction (Unigrams)

To extract the feature, use a TF-IDF vectorizer.
<br>
---
! Run the cell below to vectorize the tweets (for the Unigrams)

In [9]:
#To extract the features use TF-IDF vectorizer
Tfidf_vect = TfidfVectorizer(ngram_range=(1,1)) #Unigram
print(Tfidf_vect)
Tfidf_vect.fit(data['changedtweet']) #Fit the vectorizer to the pre-processed tweets
Train_X_Tfidf = Tfidf_vect.transform(X_train) #Get the vectorized tweets for trainig
Test_X_Tfidf = Tfidf_vect.transform(X_test) #Get the vectorized tweets for testing

# features_by_gram = defaultdict(list)
# for f, w in zip(Tfidf_vect.get_feature_names(), Tfidf_vect.idf_):
#     features_by_gram[len(f.split(' '))].append((f, w))
# top_n = 10
# for gram, features in features_by_gram.items():
#     top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
#     top_features = [f[0] for f in top_features]
#     print ('{}-gram top:'.format(gram), top_features)

# needs to happen after fit_transform()

def display_scores(vectorizer, tfidf_result):
    # http://stackoverflow.com/questions/16078015/
    scores = zip(vectorizer.get_feature_names(),
                 np.asarray(tfidf_result.sum(axis=0)).ravel())
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    for item in sorted_scores:
        print "{0:50} Score: {1}".format(item[0], item[1])

display_scores(Tfidf_vect, Train_X_Tfidf)

TfidfVectorizer()
  (0, 28830)	0.2887321221437649
  (0, 28220)	0.3655044034158788
  (0, 26509)	0.4365080523212724
  (0, 25397)	0.37763176597422193
  (0, 22420)	0.11040303992949227
  (0, 22159)	0.3425059004694709
  (0, 12282)	0.3835668206604658
  (0, 11840)	0.39613190517300406
  (0, 4723)	0.09056659396775765
  (0, 4078)	0.0902764591644557
  (1, 25941)	0.4695217301918389
  (1, 22420)	0.1500818689863665
  (1, 12416)	0.4822472841578036
  (1, 10901)	0.4006270326558702
  (1, 7022)	0.4592939720250854
  (1, 4723)	0.12311620856718347
  (1, 4078)	0.12272179937731553
  (1, 2401)	0.350415641080777
  (2, 25883)	0.380559044218338
  (2, 22876)	0.24323500970121362
  (2, 22420)	0.08176318416533178
  (2, 21922)	0.3293808252574299
  (2, 16977)	0.26116650552560106
  (2, 15074)	0.333133879641695
  (2, 15033)	0.3721141286152039
  :	:
  (8996, 1103)	0.42191527565196124
  (8997, 28034)	0.1340384827754342
  (8997, 25651)	0.33744751259827394
  (8997, 17479)	0.5032841725949833
  (8997, 13181)	0.4694822589733035


### SVM Algorithm for Unigrams

The SVM is set up with the hyperparameters:
<br>
* Kernel- Radial Basis Function
* Gamma- 1.3
* C- 1000
<br>

In [6]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(kernel='rbf', gamma=1.3, C=1000)
SVM.fit(Train_X_Tfidf,y_train)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
print("SVM Classification Report for Unigrams")
# Print the classification report
print(classification_report(y_test,predictions_SVM))


print("SVM Accuracy Score for Unigrams-> ", end="")
print(round(accuracy_score(predictions_SVM, y_test)*100,4))
print("SVM Precision Score for Unigrams-> ", end="")
print(round(precision_score(predictions_SVM, y_test, average='weighted')*100,4))
print("SVM Recall Score for Unigrams-> ", end="")
print(round(recall_score(predictions_SVM, y_test, average='weighted')*100,4))
print("SVM F1 Score for Unigrams-> ", end="")
print(round(f1_score(predictions_SVM, y_test, average='weighted')*100,4))

SVM Classification Report for Unigrams
              precision    recall  f1-score   support

          -1       0.76      0.72      0.74      2000
           0       0.65      0.78      0.71      2000
           1       0.90      0.77      0.83      2000

    accuracy                           0.76      6000
   macro avg       0.77      0.76      0.76      6000
weighted avg       0.77      0.76      0.76      6000

SVM Accuracy Score for Unigrams-> 75.6333
SVM Precision Score for Unigrams-> 75.8217
SVM Recall Score for Unigrams-> 75.6333
SVM F1 Score for Unigrams-> 75.3047


### Decision Tree Algorithm for Unigrams

The Decision Tree is set up with the hyperparameters:
<br>
* random state- 0
<br>

In [21]:
# Classifier - Algorithm - Decision Tree
# fit the training dataset on the classifier
dt = DecisionTreeClassifier(random_state=0)
dt.fit(Train_X_Tfidf, y_train)
# predict the labels on validation dataset
predictions_DecisionTree = dt.predict(Test_X_Tfidf)
print("Decision Tree Classification Report for Unigrams")
# Print the classification report
print(classification_report(y_test,predictions_DecisionTree))

print("Decision Tree Accuracy Score for Unigrams-> ", end="")
print(round(accuracy_score(predictions_DecisionTree, y_test)*100,4))
print("Decision Tree Precision Score for Unigrams-> ", end="")
print(round(precision_score(predictions_DecisionTree, y_test, average='weighted')*100,4))
print("Decision Tree Recall Score for Unigrams-> ", end="")
print(round(recall_score(predictions_DecisionTree, y_test, average='weighted')*100,4))
print("Decision Tree F1 Score for Unigrams-> ", end="")
print(round(f1_score(predictions_DecisionTree, y_test, average='weighted')*100,4))


Decision Tree Classification Report for Unigrams
              precision    recall  f1-score   support

          -1       0.64      0.55      0.59      2000
           0       0.55      0.67      0.61      2000
           1       0.75      0.71      0.73      2000

    accuracy                           0.64      6000
   macro avg       0.65      0.64      0.64      6000
weighted avg       0.65      0.64      0.64      6000

Decision Tree Accuracy Score for Unigrams-> 64.2167
Decision Tree Precision Score for Unigrams-> 64.6938
Decision Tree Recall Score for Unigrams-> 64.2167
Decision Tree F1 Score for Unigrams-> 64.1277


### Feature Extraction (Bigrams)

To extract the feature, use TF-IDF vectorizer.
<br>
---
! Run the cell below to vectorize the tweets (for the Bigrams)

In [22]:
#To extract the features use TF-IDF vectorizer
Tfidf_vect = TfidfVectorizer(ngram_range=(2,2)) #Bigrams
Tfidf_vect.fit(data['changedtweet']) #Fit the vectorizer to the pre-processed tweets
Train_X_Tfidf = Tfidf_vect.transform(X_train) #Get the vectorized tweets for trainig
Test_X_Tfidf = Tfidf_vect.transform(X_test) #Get the vectorized tweets for testing

### SVM Algorithm for Bigrams

The SVM is set up with the hyperparameters:
<br>
* Kernel- Radial Basis Function
* Gamma- 1.3
* C- 1000
<br>

In [24]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(kernel='rbf', gamma=1.3, C=1000)
SVM.fit(Train_X_Tfidf,y_train)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
print("SVM Classification Report for Bigrams")
# Print the classification report
print(classification_report(y_test,predictions_SVM))


print("SVM Accuracy Score for Bigrams-> ", end="")
print(round(accuracy_score(predictions_SVM, y_test)*100,4))
print("SVM Precision Score for Bigrams-> ", end="")
print(round(precision_score(predictions_SVM, y_test, average='weighted')*100,4))
print("SVM Recall Score for Bigrams-> ", end="")
print(round(recall_score(predictions_SVM, y_test, average='weighted')*100,4))
print("SVM F1 Score for Bigrams-> ", end="")
print(round(f1_score(predictions_SVM, y_test, average='weighted')*100,4))

SVM Classification Report for Bigrams
              precision    recall  f1-score   support

          -1       0.66      0.68      0.67      2000
           0       0.55      0.72      0.63      2000
           1       0.97      0.63      0.76      2000

    accuracy                           0.68      6000
   macro avg       0.73      0.68      0.69      6000
weighted avg       0.73      0.68      0.69      6000

SVM Accuracy Score for Bigrams-> 67.9
SVM Precision Score for Bigrams-> 68.9252
SVM Recall Score for Bigrams-> 67.9
SVM F1 Score for Bigrams-> 67.1229


### Decision Tree Algorithm for Bigrams

The Decision Tree is set up with the hyperparameters:
<br>
* random state- 0
<br>

In [25]:
# Classifier - Algorithm - Decision Tree
# fit the training dataset on the classifier
dt = DecisionTreeClassifier(random_state=0)
dt.fit(Train_X_Tfidf, y_train)
# predict the labels on validation dataset
predictions_DecisionTree = dt.predict(Test_X_Tfidf)
print("Decision Tree Classification Report for Bigrams")
# Print the classification report
print(classification_report(y_test,predictions_DecisionTree))

print("Decision Tree Accuracy Score for Bigrams-> ", end="")
print(round(accuracy_score(predictions_DecisionTree, y_test)*100,4))
print("Decision Tree Precision Score for Bigrams-> ", end="")
print(round(precision_score(predictions_DecisionTree, y_test, average='weighted')*100,4))
print("Decision Tree Recall Score for Bigrams-> ", end="")
print(round(recall_score(predictions_DecisionTree, y_test, average='weighted')*100,4))
print("Decision Tree F1 Score for Bigrams-> ", end="")
print(round(f1_score(predictions_DecisionTree, y_test, average='weighted')*100,4))

Decision Tree Classification Report for Bigrams
              precision    recall  f1-score   support

          -1       0.57      0.57      0.57      2000
           0       0.51      0.61      0.56      2000
           1       0.81      0.66      0.73      2000

    accuracy                           0.61      6000
   macro avg       0.63      0.61      0.62      6000
weighted avg       0.63      0.61      0.62      6000

Decision Tree Accuracy Score for Bigrams-> 61.25
Decision Tree Precision Score for Bigrams-> 60.973
Decision Tree Recall Score for Bigrams-> 61.25
Decision Tree F1 Score for Bigrams-> 60.7093


### Feature Extraction (Trigrams)

To extract the feature, use TF-IDF vectorizer.
<br>
---
! Run the cell below to vectorize the tweets (for the Trigrams)

In [26]:
#To extract the features use TF-IDF vectorizer
Tfidf_vect = TfidfVectorizer(ngram_range=(3,3)) #Unigram
Tfidf_vect.fit(data['changedtweet']) #Fit the vectorizer to the pre-processed tweets
Train_X_Tfidf = Tfidf_vect.transform(X_train) #Get the vectorized tweets for trainig
Test_X_Tfidf = Tfidf_vect.transform(X_test) #Get the vectorized tweets for testing

### SVM Algorithm for Trigrams

The SVM is set up with the hyperparameters:
<br>
* Kernel- Radial Basis Function
* Gamma- 1.3
* C- 1000
<br>

In [28]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
SVM = svm.SVC(kernel='rbf', gamma=1.3, C=1000)
SVM.fit(Train_X_Tfidf,y_train)
# predict the labels on validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
print("SVM Classification Report for Trigrams")
# Print the classification report
print(classification_report(y_test,predictions_SVM))


print("SVM Accuracy Score for Trigrams-> ", end="")
print(round(accuracy_score(predictions_SVM, y_test)*100,4))
print("SVM Precision Score for Trigrams-> ", end="")
print(round(precision_score(predictions_SVM, y_test, average='weighted')*100,4))
print("SVM Recall Score for Trigrams-> ", end="")
print(round(recall_score(predictions_SVM, y_test, average='weighted')*100,4))
print("SVM F1 Score for Trigrams-> ", end="")
print(round(f1_score(predictions_SVM, y_test, average='weighted')*100,4))

SVM Classification Report for Trigrams
              precision    recall  f1-score   support

          -1       0.63      0.55      0.59      2000
           0       0.51      0.78      0.62      2000
           1       0.98      0.59      0.73      2000

    accuracy                           0.64      6000
   macro avg       0.71      0.64      0.65      6000
weighted avg       0.71      0.64      0.65      6000

SVM Accuracy Score for Trigrams-> 64.0
SVM Precision Score for Trigrams-> 67.4366
SVM Recall Score for Trigrams-> 64.0
SVM F1 Score for Trigrams-> 63.2697


### Decision Tree Algorithm for Trigrams

The Decision Tree is set up with the hyperparameters:
<br>
* random state- 0
<br>

In [29]:
# Classifier - Algorithm - Decision Tree
# fit the training dataset on the classifier
dt = DecisionTreeClassifier(random_state=0)
dt.fit(Train_X_Tfidf, y_train)
# predict the labels on validation dataset
predictions_DecisionTree = dt.predict(Test_X_Tfidf)
print("Decision Tree Classification Report for Trigrams")
# Print the classification report
print(classification_report(y_test,predictions_DecisionTree))

print("Decision Tree Accuracy Score for Trigrams-> ", end="")
print(round(accuracy_score(predictions_DecisionTree, y_test)*100,4))
print("Decision Tree Precision Score for Trigrams-> ", end="")
print(round(precision_score(predictions_DecisionTree, y_test, average='weighted')*100,4))
print("Decision Tree Recall Score for Trigrams-> ", end="")
print(round(recall_score(predictions_DecisionTree, y_test, average='weighted')*100,4))
print("Decision Tree F1 Score for Trigrams-> ", end="")
print(round(f1_score(predictions_DecisionTree, y_test, average='weighted')*100,4))

Decision Tree Classification Report for Trigrams
              precision    recall  f1-score   support

          -1       0.69      0.39      0.50      2000
           0       0.50      0.78      0.61      2000
           1       0.82      0.72      0.77      2000

    accuracy                           0.63      6000
   macro avg       0.67      0.63      0.63      6000
weighted avg       0.67      0.63      0.63      6000

Decision Tree Accuracy Score for Trigrams-> 63.0167
Decision Tree Precision Score for Trigrams-> 69.0803
Decision Tree Recall Score for Trigrams-> 63.0167
Decision Tree F1 Score for Trigrams-> 63.5244
