# Transformation of Tweets UK Dataset for Cluster Modelling

**By Esraa Mohamed**

<img src="https://i.imgur.com/3zHbJa1.png">

To cluster text data, we first need to transform the words into numerical vectors; only then can any modelling be performed. This notebook explains the different strategies we use to transform our tweets into numbers and then cluster them.
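As a minimal illustration of what "transforming words into numbers" means, the sketch below counts words over a shared vocabulary. The helper `bag_of_words` and the two example tweets are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(tweets):
    """Map each tweet to a list of word counts over a shared vocabulary."""
    tokenised = [tweet.lower().split() for tweet in tweets]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Two made-up tweets, not rows from the dataset.
vocabulary, vectors = bag_of_words(["no social distancing", "social distancing works"])
print(vocabulary)  # ['distancing', 'no', 'social', 'works']
print(vectors)     # [[1, 1, 1, 0], [1, 0, 1, 1]]
```

Each tweet becomes one row of numbers with one column per vocabulary word, which is the shape a clustering algorithm expects.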

##### Main Requirements for Clustering:
The primary requirements that should be met by a clustering algorithm are:
* It should be scalable;
* It should be able to deal with attributes of different types;
* It should be able to discover arbitrary shape clusters;
* It should have an inbuilt ability to deal with noise and outliers;
* The clusters should not vary with the order of input records;
* It should be able to handle data of high dimensions;
* It should be easy to interpret and use.

We will apply all the required steps in this notebook.

###  Libraries

The following libraries will be used throughout the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel
import torch
import nltk
from transformers import pipeline
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
import gensim.downloader
from sklearn.preprocessing import MinMaxScaler

Let us start by reading the data and exploring it a bit.

In [2]:
dataset=pd.read_csv(r'C:\Users\r04ra18\Desktop\Esraa-project-data\final-Essra-project\dataset\tweets-clean-data\uk-clean-data-tweets.csv')

In [3]:
#dataset
X=dataset.iloc[:,1:]
X.head()

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,...,mentions_count,extracted_hashtags,Year,Month,Day,Hour,Minute,Second,Period,Period_id
0,285794400.0,DaveWithington1,5235,7324,Studying Social Psychology @OpenUniversity,Stoke-on-Trent,@HearnBob @Brimshack @budget_tourist Hi Bob. S...,2021-12-08 22:18:20+00:00,0,1,...,3,[],2021,12,8,22,18,20,post-restriction,1
1,65005890.0,x3ChelseaLouise,279,5188,fiancé 💍 dog momma 🐾 Carbie 🍝🥔 lover of all t...,South Yorkshire,Allegra Stratton crying making this bull💩 stat...,2021-12-08 21:33:29+00:00,0,0,...,0,[],2021,12,8,21,33,29,post-restriction,1
2,2511388000.0,alfaqfour,4336,113064,Old School car fanatic. Would-be chef. Ex Ford...,Essex,Rubbish! Hospitals are filling with double-jab...,2021-12-08 21:22:36+00:00,0,0,...,0,[],2021,12,8,21,22,36,post-restriction,1
3,390370200.0,Jdrt4,202,15416,"still working at 75 taken up cycling again, an...",Stopsley Luton,"Gary, as John Still once said "" control the Co...",2021-12-08 20:40:19+00:00,0,0,...,0,[],2021,12,8,20,40,19,post-restriction,1
4,20828530.0,nwwilson,638,20232,Proud dad with 2 kids. A long suffering Parti...,Scotland,@BorisJohnson @Number10press @Conservatives \n...,2021-12-08 20:40:02+00:00,0,0,...,3,[],2021,12,8,20,40,2,post-restriction,1


## 1. Adding labels Using Deep Learning

### 1. Using BERT Deep Learning Model to Label our Data

In [4]:
clean_ls = dataset['clean_tweets'].tolist()

In [5]:
# load the default sentiment-analysis pipeline (a DistilBERT model, see the message below)
nlp = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [6]:
# pass the tweet with index 1000 to the pipeline and look at its result
result = nlp(dataset['clean_tweets'][1000])

In [7]:
result

[{'label': 'NEGATIVE', 'score': 0.995641827583313}]
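The result is a list holding one dictionary per input text, and `score` is the model's confidence in the predicted label rather than a signed sentiment value. The sketch below shows one way to fold label and score into a single signed number; `signed_score` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def signed_score(result):
    """Convert one pipeline result into a signed confidence in [-1, 1]."""
    prediction = result[0]  # the pipeline wraps the single prediction in a list
    sign = 1.0 if prediction["label"] == "POSITIVE" else -1.0
    return sign * prediction["score"]

# Same structure as the output shown above.
example = [{"label": "NEGATIVE", "score": 0.995641827583313}]
signed = signed_score(example)
print(signed)  # -0.995641827583313
```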

Let's look at one of the results.

In [8]:
dataset['Month'][1000]

9

In [9]:
dataset['text'][1000]

'@ScottPughsley Of course. Nobody wearing masks anywhere really now, no social distancing. Covid has gone. 🥴'

In [10]:
# show the cleaned version of the same tweet
dataset['clean_tweets'][1000]

'scottpughsley of course. nobody wearing masks anywhere really now, no social distancing. covid has gone. '

Let us now loop over the dataframe and get the result from the BERT model for every tweet.

In [11]:
# pass every tweet through the pipeline and save the results in a list
results_ls = []
for i in range(len(dataset['clean_tweets'])):
    result = nlp(dataset['clean_tweets'][i])
    results_ls.append(result)
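Calling the pipeline once per tweet works but can be slow for 12,265 tweets. As far as we know, a Hugging Face pipeline also accepts a list of strings, so the tweets could be sent in slices instead. The sketch below is only an illustration: `classify_in_batches` and `fake_classifier` are hypothetical names, and the stand-in classifier exists purely so the example runs without downloading a model.

```python
def classify_in_batches(texts, classifier, batch_size=32):
    """Call `classifier` on successive slices of `texts` and collect the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(classifier(batch))
    return results

# Stand-in classifier, used only to make the sketch runnable without a model.
def fake_classifier(batch):
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]

demo_results = classify_in_batches(["a", "b", "c"], fake_classifier, batch_size=2)
print(len(demo_results))  # 3
```

With the real pipeline this might be called as `classify_in_batches(clean_ls, nlp)`. Under that assumption each element would be a dictionary rather than a one-element list, so any later code that unpacks `results_ls` would need adjusting.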

Let us check the length of the result.

In [12]:
len(results_ls)

12265

Let's look at a few example tweets.

In [13]:
results_ls[1]

[{'label': 'POSITIVE', 'score': 0.9304701685905457}]

In [14]:
dataset['text'][1]

'Allegra Stratton crying making this bull💩 statement while she was laughing about not social distancing while people sat at home distanced from their dying loved ones and she thought it was funny???? Do not feel sorry for that woman even a little bit!! Not so funny now is it!!'

In [15]:
results_ls[1000]

[{'label': 'NEGATIVE', 'score': 0.995641827583313}]

In [16]:
dataset['clean_tweets'][1000]

'scottpughsley of course. nobody wearing masks anywhere really now, no social distancing. covid has gone. '

Now let us build a dictionary and store all the data into a dataframe.

In [17]:
# Each entry of results_ls is a one-element list holding a dict, so unpack
# the dicts and build the dataframe in one step. This avoids DataFrame.append,
# which was removed in pandas 2.0.
df = pd.DataFrame([result[0] for result in results_ls])

Let us open and explore our dataframe.

In [18]:
df

| | label | score |
|---|---|---|
| 0 | POSITIVE | 0.923956 |
| 1 | POSITIVE | 0.930470 |
| 2 | NEGATIVE | 0.999142 |
| 3 | NEGATIVE | 0.992522 |
| 4 | POSITIVE | 0.995535 |
| ... | ... | ... |
| 12260 | NEGATIVE | 0.978447 |
| 12261 | NEGATIVE | 0.984081 |
| 12262 | POSITIVE | 0.998981 |
| 12263 | NEGATIVE | 0.998508 |


Let us open another example.

In [19]:
results_ls[13]

[{'label': 'NEGATIVE', 'score': 0.9758056402206421}]

In [20]:
dataset['clean_tweets'][13]

'bbcnickrobinson regret she was caught out more like. borisjohnson now allegra has done the right thing we assume you will be expecting jacob_rees_mogg  s resignation too last nights video mocking social distancing amp; the police wont investigate is contemptuous towards the people who voted'

Now let us join our newly built dataframe to the original data.

In [21]:
X =  X.join(df)

Next, let us encode the two sentiment labels as two numeric clusters.

In [22]:
clst_map = {'NEGATIVE': 0, 'POSITIVE': 1}
X['PN_score_clst'] = X['label'].map(clst_map)

In [23]:
X

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,...,Month,Day,Hour,Minute,Second,Period,Period_id,label,score,PN_score_clst
0,2.857944e+08,DaveWithington1,5235,7324,Studying Social Psychology @OpenUniversity,Stoke-on-Trent,@HearnBob @Brimshack @budget_tourist Hi Bob. S...,2021-12-08 22:18:20+00:00,0,1,...,12,8,22,18,20,post-restriction,1,POSITIVE,0.923956,1
1,6.500589e+07,x3ChelseaLouise,279,5188,fiancé 💍 dog momma 🐾 Carbie 🍝🥔 lover of all t...,South Yorkshire,Allegra Stratton crying making this bull💩 stat...,2021-12-08 21:33:29+00:00,0,0,...,12,8,21,33,29,post-restriction,1,POSITIVE,0.930470,1
2,2.511388e+09,alfaqfour,4336,113064,Old School car fanatic. Would-be chef. Ex Ford...,Essex,Rubbish! Hospitals are filling with double-jab...,2021-12-08 21:22:36+00:00,0,0,...,12,8,21,22,36,post-restriction,1,NEGATIVE,0.999142,0
3,3.903702e+08,Jdrt4,202,15416,"still working at 75 taken up cycling again, an...",Stopsley Luton,"Gary, as John Still once said "" control the Co...",2021-12-08 20:40:19+00:00,0,0,...,12,8,20,40,19,post-restriction,1,NEGATIVE,0.992522,0
4,2.082853e+07,nwwilson,638,20232,Proud dad with 2 kids. A long suffering Parti...,Scotland,@BorisJohnson @Number10press @Conservatives \n...,2021-12-08 20:40:02+00:00,0,0,...,12,8,20,40,2,post-restriction,1,POSITIVE,0.995535,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12260,8.538250e+07,LouiseStockwell,2664,27682,Teacher and happy optimistic lover of all stuf...,no_location,Family is everything #togetherapart @Max_Stoc...,2021-01-01 00:42:21+00:00,1,1,...,1,1,0,42,21,restriction,0,NEGATIVE,0.978447,0
12261,6.726682e+07,SubversiveRun,549,1602,"Former soldier, retired firefighter, very matu...","Glasgow, Scotland",He decided that Covid rules applied not to him...,2021-01-01 00:35:11+00:00,0,2,...,1,1,0,35,11,restriction,0,NEGATIVE,0.984081,0
12262,1.520131e+09,bethofnight,1317,34355,BA Politics. Sociology Masters student. UK,berkshire/notts - she/her,you know what? fuck it. the government would h...,2021-01-01 00:17:33+00:00,0,0,...,1,1,0,17,33,restriction,0,POSITIVE,0.998981,1
12263,1.286070e+08,Kerry0301,22,541,Loving it:),West Midlands,Horrified to see crowds gathered in London and...,2021-01-01 00:09:40+00:00,0,1,...,1,1,0,9,40,restriction,0,NEGATIVE,0.998508,0


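Let us also add a `cluster_id` column. Note that it uses the opposite encoding to `PN_score_clst`: here POSITIVE maps to 0 and NEGATIVE maps to 1.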

In [24]:
# create clusters

points_map={'POSITIVE': 0, 'NEGATIVE': 1}
X['cluster_id'] = X['label'].map(points_map)

In [25]:
X.head()

Unnamed: 0,author_id,username,author_followers,author_tweets,author_description,author_location,text,created_at,retweets,replies,...,Day,Hour,Minute,Second,Period,Period_id,label,score,PN_score_clst,cluster_id
0,285794400.0,DaveWithington1,5235,7324,Studying Social Psychology @OpenUniversity,Stoke-on-Trent,@HearnBob @Brimshack @budget_tourist Hi Bob. S...,2021-12-08 22:18:20+00:00,0,1,...,8,22,18,20,post-restriction,1,POSITIVE,0.923956,1,0
1,65005890.0,x3ChelseaLouise,279,5188,fiancé 💍 dog momma 🐾 Carbie 🍝🥔 lover of all t...,South Yorkshire,Allegra Stratton crying making this bull💩 stat...,2021-12-08 21:33:29+00:00,0,0,...,8,21,33,29,post-restriction,1,POSITIVE,0.93047,1,0
2,2511388000.0,alfaqfour,4336,113064,Old School car fanatic. Would-be chef. Ex Ford...,Essex,Rubbish! Hospitals are filling with double-jab...,2021-12-08 21:22:36+00:00,0,0,...,8,21,22,36,post-restriction,1,NEGATIVE,0.999142,0,1
3,390370200.0,Jdrt4,202,15416,"still working at 75 taken up cycling again, an...",Stopsley Luton,"Gary, as John Still once said "" control the Co...",2021-12-08 20:40:19+00:00,0,0,...,8,20,40,19,post-restriction,1,NEGATIVE,0.992522,0,1
4,20828530.0,nwwilson,638,20232,Proud dad with 2 kids. A long suffering Parti...,Scotland,@BorisJohnson @Number10press @Conservatives \n...,2021-12-08 20:40:02+00:00,0,0,...,8,20,40,2,post-restriction,1,POSITIVE,0.995535,1,0


### 2. Vectorizing the Sets of Words

Let us now create a bag of words and build vectors from the tweets in our dataframe.

In [26]:
tweets_bowl = dataset['clean_tweets'].tolist()

In [27]:
# delete duplicates
string1 = str(tweets_bowl)
words = string1.split()
clean_tweets = " ".join(sorted(set(words), key=words.index))

In [28]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [29]:
'''Vectorize the sets of words, then standardize them. TF-IDF is used in order to take care of the least
frequent words. Standardizing is needed because TF-IDF favors long sentences, and there would otherwise be
inconsistencies between the length of the tweets and the length of the set of words.'''


def get_vectors(*strs):
    text = [t for t in strs]
    vectorizer = TfidfVectorizer()
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()

In [30]:
tweets_vector = get_vectors(clean_tweets)

In [31]:
tweets_vector

array([[0.00205699, 0.00205699, 0.00205699, ..., 0.00205699, 0.00205699,
        0.01234192]])
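Almost every entry in the vector above is identical because `get_vectors(clean_tweets)` fits the vectorizer on a single string: every term then occurs in one of one documents, the inverse document frequency is the same for all terms, and only the raw counts and the normalisation remain. The sketch below reimplements the weighting in plain Python to make this visible. It assumes smoothed IDF, $\ln\frac{1+n}{1+\mathrm{df}}+1$, and L2 normalisation, which to our understanding are scikit-learn's defaults; tokenisation here is a simple whitespace split, which differs from `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2 normalisation, over whitespace tokens."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(tokenised)
    # Number of documents that contain each word at least once.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard for empty documents
        rows.append([v / norm for v in raw])
    return vocabulary, rows

# With a single document every IDF equals 1, so only the counts matter.
vocab, rows = tfidf(["covid rules covid"])
print(vocab)  # ['covid', 'rules']
print(rows)   # one row, proportional to the counts [2, 1]
```

Fitting on the full column of tweets, as done in the next cell, gives each tweet its own row and lets the IDF term distinguish rare words from common ones.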

In [32]:
## Vectorizing the tweets
tv=TfidfVectorizer()
# tweets_bowl = tweets_bowl.tweets.apply(get_vectors)
# tweets_bowl.head()
tfidf_tweets =tv.fit_transform(dataset['clean_tweets'])

In [33]:
tfidf_tweets

<12265x29105 sparse matrix of type '<class 'numpy.float64'>'
	with 317167 stored elements in Compressed Sparse Row format>

### 3. Calculating the Jaccard Similarity of the Sets of Words

In [34]:
'''Jaccard similarity is good for cases where duplication does not matter, 
cosine similarity is good for cases where duplication matters while analyzing text similarity. For two product descriptions, 
it will be better to use Jaccard similarity as repetition of a word does not reduce their similarity.'''

def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)


# jaccard_score(socialvector, economic_vector)
#for similarity of 1 and 2 of column1
# jaccard_similarity('dog lion a dog','dog is cat')


def get_scores(group,tweets):
    scores = []

    for i in range(len(group)):
        s = jaccard_similarity(group['text'][i], tweets[i])
        # append inside the loop so that every row's score is kept
        scores.append(s)
    return scores

In [35]:
X['label']

0        POSITIVE
1        POSITIVE
2        NEGATIVE
3        NEGATIVE
4        POSITIVE
           ...   
12260    NEGATIVE
12261    NEGATIVE
12262    POSITIVE
12263    NEGATIVE
12264    POSITIVE
Name: label, Length: 12265, dtype: object

##########################################

## 2. Calculating the Jaccard similarity Score for the Set of Words

Let us get the score of the clean tweets.

In [36]:
jaccard_similarity(X['text'][100] , dataset['clean_tweets'][100])

0.7435897435897436
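Because both arguments are plain strings, `set(query)` and `set(document)` in `jaccard_similarity` collect individual characters, so the score above compares character sets, $J(A,B)=\frac{|A\cap B|}{|A\cup B|}$ with $A$ and $B$ the characters of each string. The sketch below contrasts this with a word-level variant. The names `jaccard_chars` and `jaccard_words` are illustrative and do not appear in the notebook.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def jaccard_chars(a, b):
    """Compare the sets of characters (what set(string) produces)."""
    return jaccard(set(a), set(b))

def jaccard_words(a, b):
    """Compare the sets of whitespace-separated words."""
    return jaccard(set(a.split()), set(b.split()))

char_score = jaccard_chars("dog lion a dog", "dog is cat")
word_score = jaccard_words("dog lion a dog", "dog is cat")
print(char_score)  # 6 shared characters out of 11
print(word_score)  # 0.2, only "dog" is shared among 5 distinct words
```

Character-level scores tend to be high for any two English sentences because they share most of the alphabet, so the word-level variant may be the better fit when the goal is to compare vocabulary.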

In [37]:
scores_text = []

for i in range(len(dataset['clean_tweets'])):
    s = jaccard_similarity(X['text'][i], dataset['clean_tweets'][i])
    
    scores_text.append(s)

In [38]:
scores_text[:5]

[0.675,
 0.7666666666666667,
 0.6136363636363636,
 0.6190476190476191,
 0.5909090909090909]

Let us build a for loop and get the scores for the author location.

In [39]:
scores_location = []

for i in range(len(dataset['clean_tweets'])):
    s = jaccard_similarity(X['author_location'][i], dataset['clean_tweets'][i])
    
    scores_location.append(s)

In [40]:
scores_location[:5]

[0.2, 0.4, 0.10344827586206896, 0.3448275862068966, 0.25925925925925924]

Let us build a for loop and get the scores for the username.

In [41]:
scores_username = []

for i in range(len(dataset['clean_tweets'])):
    s = jaccard_similarity(X['username'][i], dataset['clean_tweets'][i])
    
    scores_username.append(s)

In [42]:
scores_username[:5]

[0.3,
 0.2962962962962963,
 0.25925925925925924,
 0.10344827586206896,
 0.23076923076923078]

Let us build a for loop and get the scores for the author description.

In [43]:
scores_author_descrip = []

for i in range(len(dataset['clean_tweets'])):
    s = jaccard_similarity(X['author_description'][i], dataset['clean_tweets'][i])
    
    scores_author_descrip.append(s)

In [44]:
scores_author_descrip[:5]

[0.5625, 0.6451612903225806, 0.5365853658536586, 0.696969696969697, 0.6875]

Let us add the hashtag data.

In [45]:
dataset.columns

Index(['Unnamed: 0', 'author_id', 'username', 'author_followers',
       'author_tweets', 'author_description', 'author_location', 'text',
       'created_at', 'retweets', 'replies', 'likes', 'quote_count',
       'clean_tweets', 'hastags_count', 'mentions_count', 'extracted_hashtags',
       'Year', 'Month', 'Day', 'Hour', 'Minute', 'Second', 'Period',
       'Period_id'],
      dtype='object')

In [46]:
scores_extracted_hashtags = []

for i in range(len(dataset['clean_tweets'])):
    s = jaccard_similarity(X['extracted_hashtags'][i], dataset['clean_tweets'][i])
    
    scores_extracted_hashtags.append(s)

In [47]:
scores_extracted_hashtags[:10]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21875, 0.0, 0.0, 0.32142857142857145]
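The five loops above all have the same shape, so they could be folded into one helper. The sketch below is one possible version: `safe_jaccard` and `paired_scores` are hypothetical names, the `str()` conversion is an assumed guard for non-string cells, and the empty-union check avoids the division by zero that the original function would hit if both strings were empty.

```python
def safe_jaccard(a, b):
    """Character-level Jaccard that returns 0.0 when both inputs are empty."""
    set_a, set_b = set(str(a)), set(str(b))  # str() is an assumed guard for non-string cells
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def paired_scores(left, right):
    """Score two equally long sequences element by element."""
    return [safe_jaccard(a, b) for a, b in zip(left, right)]

# Made-up values, not rows from the dataset.
demo = paired_scores(["Essex", "", "[]"], ["essex rules", "", "no hashtags"])
print(demo)  # [0.375, 0.0, 0.0]
```

With the notebook's data this might be used as `paired_scores(X['username'], dataset['clean_tweets'])`, assuming both columns have the same length and order.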

Let us add all these scores to our dataframe and recheck all the numerical columns.

In [48]:
X['scores_author_descrip'] = scores_author_descrip
X['scores_username'] = scores_username
X['scores_location'] = scores_location
X['scores_text'] = scores_text

In [49]:
X['scores_extracted_hashtags']= scores_extracted_hashtags

In [50]:
num_cols = ['float64', 'int64']

In [51]:
X_num = X.select_dtypes(include=['number'])

In [52]:
X_num.head()

Unnamed: 0,author_id,author_followers,author_tweets,retweets,replies,likes,quote_count,hastags_count,mentions_count,Year,...,Second,Period_id,score,PN_score_clst,cluster_id,scores_author_descrip,scores_username,scores_location,scores_text,scores_extracted_hashtags
0,285794400.0,5235,7324,0,1,0,0,0,3,2021,...,20,1,0.923956,1,0,0.5625,0.3,0.2,0.675,0.0
1,65005890.0,279,5188,0,0,0,0,0,0,2021,...,29,1,0.93047,1,0,0.645161,0.296296,0.4,0.766667,0.0
2,2511388000.0,4336,113064,0,0,0,0,0,0,2021,...,36,1,0.999142,0,1,0.536585,0.259259,0.103448,0.613636,0.0
3,390370200.0,202,15416,0,0,0,0,0,0,2021,...,19,1,0.992522,0,1,0.69697,0.103448,0.344828,0.619048,0.0
4,20828530.0,638,20232,0,0,0,0,0,3,2021,...,2,1,0.995535,1,0,0.6875,0.230769,0.259259,0.590909,0.0


In [53]:
X[['scores_text', 'author_id']]

| | scores_text | author_id |
|---|---|---|
| 0 | 0.675000 | 2.857944e+08 |
| 1 | 0.766667 | 6.500589e+07 |
| 2 | 0.613636 | 2.511388e+09 |
| 3 | 0.619048 | 3.903702e+08 |
| 4 | 0.590909 | 2.082853e+07 |
| ... | ... | ... |
| 12260 | 0.469388 | 8.538250e+07 |
| 12261 | 0.685714 | 6.726682e+07 |
| 12262 | 0.892857 | 1.520131e+09 |
| 12263 | 0.750000 | 1.286070e+08 |


In [54]:
###########################################################

## 3. Testing Clusters using GENSIM

Let us list all the pretrained models available in gensim-data.

In [55]:
# Show all available models in gensim-data

print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


Select the GloVe Twitter 25 model.

In [56]:
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')

Process the data for modelling.

In [57]:
# Join lines together so it becomes one long line
text = " ".join(tweets_bowl)

# Separate out the sentences 
sentences = nltk.sent_tokenize(text)

# Separate out each word within each sentence
tokenised_sents = [nltk.word_tokenize(sent) for sent in sentences]

Let us now check the three terms we want to study in our data:

In [58]:
# Use the downloaded vectors as usual:
glove_vectors.most_similar('supportive')

[('hardworking', 0.9011430740356445),
 ('respectful', 0.889769971370697),
 ('appreciative', 0.8867218494415283),
 ('generous', 0.8865000605583191),
 ('devoted', 0.883791983127594),
 ('trustworthy', 0.877402126789093),
 ('thoughtful', 0.8707559108734131),
 ('humble', 0.8627842664718628),
 ('caring', 0.8619816303253174),
 ('sincere', 0.8579764366149902)]

In [59]:
# Use the downloaded vectors as usual:
glove_vectors.most_similar('against')

[('despite', 0.8768965005874634),
 ('lead', 0.8731784224510193),
 ('states', 0.8638511300086975),
 ('facing', 0.8598164319992065),
 ('defeat', 0.8579378724098206),
 ('strike', 0.8572096824645996),
 ('beating', 0.8568548560142517),
 ('state', 0.8552259802818298),
 ('court', 0.8533600568771362),
 ('between', 0.8486519455909729)]

In [60]:
# Use the downloaded vectors as usual:
glove_vectors.most_similar('neutral')

[('stable', 0.8939410448074341),
 ('flexible', 0.8851061463356018),
 ('narrow', 0.8575757145881653),
 ('complex', 0.8499032258987427),
 ('efficient', 0.8473550081253052),
 ('vibrant', 0.8449707627296448),
 ('vital', 0.8376925587654114),
 ('visible', 0.8360723257064819),
 ('rational', 0.8359155058860779),
 ('transparent', 0.835024356842041)]
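The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows the calculation on toy vectors. The function name `cosine_similarity` and the three-dimensional inputs are illustrative only, whereas the downloaded GloVe vectors have 25 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: one pair pointing the same way, one pair at right angles.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

A value close to 1 means two words are used in similar contexts in the training corpus, which is not the same as having the same meaning: the neighbours of "against" above are words from news and sports contexts, not synonyms.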

Let us now train a skip-gram Word2Vec model on our own dataset.

In [61]:
sg_tweets = gensim.models.Word2Vec(tokenised_sents, sg=1, min_count=2, window=5)
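In this call, `sg=1` selects the skip-gram architecture, `window=5` is the maximum distance between a word and the context words it is trained to predict, and `min_count=2` drops words that occur fewer than two times. The sketch below shows which (centre, context) pairs a window produces. `skipgram_pairs` is a simplified illustration, not gensim's implementation, which includes further details that are omitted here.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (centre, context) pairs for every token within `window` positions."""
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield centre, tokens[j]

pairs = list(skipgram_pairs(["no", "social", "distancing"], window=1))
print(pairs)
# [('no', 'social'), ('social', 'no'), ('social', 'distancing'), ('distancing', 'social')]
```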

Let us also build a second model with the same settings and continue training it for a further 200 epochs.

In [62]:
# Skip-gram model
sg_tweets_1 = gensim.models.Word2Vec(tokenised_sents, sg=1, min_count=2, window=5)
sg_tweets_1.train(tokenised_sents, total_examples=len(tokenised_sents), epochs=200)

(53388452, 78087800)

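Compare the results from the two models. In the tables that follow, the `g_*` columns come from the first argument (`sg_tweets`) and the `sg_*` columns from the second (`sg_tweets_1`).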

In [63]:
def comparing_embeddings_similarity(word, g_emb, sg_emb):
    g    = pd.DataFrame(g_emb.wv.most_similar(positive=[word])[:100],columns=["g_name","g_score"])
    sg   = pd.DataFrame(sg_emb.wv.most_similar(positive=[word])[:100],columns=["sg_name","sg_score"])
    
    df = pd.concat([g, sg],axis = 1)
    display (df)

In [64]:
word = 'support' 

comparing_embeddings_similarity(word, sg_tweets, sg_tweets_1)

| | g_name | g_score | sg_name | sg_score |
|---|---|---|---|---|
| 0 | changing | 0.913458 | cooperation | 0.616683 |
| 1 | name | 0.906812 | oneonediet | 0.585027 |
| 2 | add | 0.88814 | alexnorrisnn | 0.570392 |
| 3 | vote | 0.882276 | them.add | 0.561178 |
| 4 | standupforshopworkers | 0.878342 | jeremyclarkson | 0.525511 |
| 5 | amendments | 0.869263 | animalrights | 0.503901 |
| 6 | mp | 0.847675 | financial | 0.497974 |
| 7 | protect | 0.839555 | boozo | 0.483106 |
| 8 | book | 0.826119 | placement | 0.470715 |
| 9 | notpartofthejob | 0.823921 | tradegovuk | 0.46903 |


In [65]:
word = 'against' 

comparing_embeddings_similarity(word, sg_tweets, sg_tweets_1)

| | g_name | g_score | sg_name | sg_score |
|---|---|---|---|---|
| 0 | experts | 0.908687 | determined | 0.547297 |
| 1 | remaining | 0.903857 | interests | 0.526556 |
| 2 | gov | 0.903393 | batty_mufc | 0.511474 |
| 3 | governments | 0.901918 | bma | 0.473096 |
| 4 | citizens | 0.901812 | violently | 0.469592 |
| 5 | ignore | 0.901264 | bent | 0.457242 |
| 6 | considers | 0.900715 | gazzap | 0.451499 |
| 7 | introduce | 0.89984 | flush | 0.441139 |
| 8 | type | 0.898572 | checksno | 0.440445 |
| 9 | 'social | 0.898112 | rip | 0.433858 |


In [66]:
word = 'neutral' 

comparing_embeddings_similarity(word, sg_tweets, sg_tweets_1)

| | g_name | g_score | sg_name | sg_score |
|---|---|---|---|---|
| 0 | faulty | 0.988918 | vegan | 0.64831 |
| 1 | incoming | 0.988739 | candles | 0.619732 |
| 2 | implies | 0.988604 | carbon | 0.611798 |
| 3 | lionsofficial | 0.98812 | orange | 0.537835 |
| 4 | andyhholt | 0.987844 | jeffkoons | 0.504654 |
| 5 | filthy | 0.987751 | canarywharf | 0.504312 |
| 6 | jacko_ | 0.987639 | dishonest | 0.501921 |
| 7 | pingdemic | 0.987623 | fruity | 0.48909 |
| 8 | grabbed | 0.987596 | yds | 0.486499 |
| 9 | sudden | 0.987577 | postpopart | 0.486466 |


###################################

## 4. Adding Labels using Semi-supervised learning

### 1. Create a list of words 

Let us start by creating lists for our needed words.

In [67]:
support_words = '''

acknowledge
admit
allow
comply
concede
concur
grant
recognize
set
settle
sign
accede
acquiesce
check
consent
engage
okay
permit
subscribe
be of the same mind
bury the hatchet
buy into
clinch the deal
come to terms
cut a deal
give blessing
give carte blanche
give green light
give the go-ahead
go along with
make a deal
pass on
play ball
see eye to eye
shake on
side with
take one up on
yes
'''

In [68]:
against_words = '''

antagonistic
conflicting
contending
rival
adverse
anti
opposite
antithetical
conflicting
contrary
incompatible
inconsistent
paradoxical
antipodal
con
converse
counter
opposite
reverse
adverse
against
agin
antipodean
antithetic
counteractive
diametric
discrepant
incongruous
irreconcilable
negating
nullifying
opposing
ornery
polar
repugnant

'''

In [69]:
neutral_words = '''

disinterested
evenhanded
fair-minded
inactive
indifferent
nonaligned
nonpartisan
unbiased
uncommitted
undecided
uninvolved
calm
cool
noncombatant
aloof
bystanding
clinical
collected
detached
disengaged
dispassionate
easy
impersonal
inert
middle-of-road
nonbelligerent
nonchalant
nonparticipating
on sidelines
on the fence
pacifistic
poker-faced
relaxed
unaligned
unconcerned
unprejudiced

'''

### 2. Remove Duplicates

In [70]:
# delete duplicates
string1 = support_words
words = string1.split()
support = " ".join(sorted(set(words), key=words.index))

In [71]:
support

'acknowledge admit allow comply concede concur grant recognize set settle sign accede acquiesce check consent engage okay permit subscribe be of the same mind bury hatchet buy into clinch deal come to terms cut a give blessing carte blanche green light go-ahead go along with make pass on play ball see eye shake side take one up yes'
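Splitting on whitespace de-duplicates single words, so multi-word phrases are broken up: in the output above "bury the hatchet" has become "bury hatchet", because "the" was already seen in "be of the same mind". If the phrases are meant to stay intact, de-duplicating line by line is one option. The sketch below uses a hypothetical helper, `unique_phrases`, and a short made-up sample rather than the full word lists.

```python
def unique_phrases(block):
    """Return the non-empty lines of `block`, de-duplicated, in first-seen order."""
    phrases = [line.strip() for line in block.splitlines() if line.strip()]
    return list(dict.fromkeys(phrases))  # dicts preserve insertion order

sample = """
bury the hatchet
give the go-ahead
bury the hatchet
yes
"""
phrases = unique_phrases(sample)
print(phrases)  # ['bury the hatchet', 'give the go-ahead', 'yes']
```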

In [72]:
# delete duplicates
string1 = against_words
words = string1.split()
against = " ".join(sorted(set(words), key=words.index))

In [73]:
# delete duplicates
string1 = neutral_words
words = string1.split()
neutral = " ".join(sorted(set(words), key=words.index))

In [74]:
neutral

'disinterested evenhanded fair-minded inactive indifferent nonaligned nonpartisan unbiased uncommitted undecided uninvolved calm cool noncombatant aloof bystanding clinical collected detached disengaged dispassionate easy impersonal inert middle-of-road nonbelligerent nonchalant nonparticipating on sidelines the fence pacifistic poker-faced relaxed unaligned unconcerned unprejudiced'

### 3. Bag of Words

Let us clean our tweets and build a bag of words:

In [75]:
tweets_bowl = dataset['clean_tweets'].tolist()

In [76]:
# delete duplicates
string1 = str(tweets_bowl)
words = string1.split()
clean_tweets = " ".join(sorted(set(words), key=words.index))

In [77]:
clean_tweets



In [78]:
clean_tweets = clean_tweets.replace("[", "").replace("]", "").replace(",", "")

In [79]:
clean_tweets = clean_tweets.replace("\\", "").replace("\n", "").replace("\\n", "")

In [80]:
clean_tweets



In [81]:
# Join lines together so it becomes one long line
text = " ".join(tweets_bowl)

# Separate out the sentences 
sentences = nltk.sent_tokenize(text)

# Separate out each word within each sentence
tokenised_sents = [nltk.word_tokenize(sent) for sent in sentences]

In [82]:
total_tokens = [t for sent in tokenised_sents for t in sent]

print ('Total number of tokens: %i'%len(total_tokens))

Total number of tokens: 390439


### 4. Calculate Jaccard Similarity

In [83]:
'''Jaccard similarity is good for cases where duplication does not matter, 
cosine similarity is good for cases where duplication matters while analyzing text similarity. For two product descriptions, 
it will be better to use Jaccard similarity as repetition of a word does not reduce their similarity.'''

def jaccard_similarity(query, document):
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)


# jaccard_score(socialvector, economic_vector)
#for similarity of 1 and 2 of column1
# jaccard_similarity('dog lion a dog','dog is cat')


def get_scores(group,tweets):
    scores = []
    for tweet in tweets:
        s = jaccard_similarity(group, tweet)
        scores.append(s)
    return scores

In [84]:
# Tweet scores
support_scores = get_scores(support, dataset['clean_tweets'])
support_scores[:10]

[0.7666666666666667,
 0.8846153846153846,
 0.8928571428571429,
 0.7666666666666667,
 0.7931034482758621,
 0.7857142857142857,
 0.8214285714285714,
 0.8518518518518519,
 0.8571428571428571,
 0.8076923076923077]

In [85]:
# Tweet scores
against_scores = get_scores(against, dataset['clean_tweets'])
against_scores[:10]

[0.75,
 0.875,
 0.75,
 0.75,
 0.8461538461538461,
 0.7692307692307693,
 0.8076923076923077,
 0.84,
 0.7777777777777778,
 0.7916666666666666]

In [86]:
# Tweet scores
neutral_scores = get_scores(neutral, dataset['clean_tweets'])
neutral_scores[:10]

[0.7931034482758621,
 0.8461538461538461,
 0.8571428571428571,
 0.7931034482758621,
 0.8888888888888888,
 0.8148148148148148,
 0.7857142857142857,
 0.8148148148148148,
 0.7586206896551724,
 0.7692307692307693]

In [87]:
'''new df with names, and the jaccard scores for each group'''

data  = {'Clean_Tweets':dataset['clean_tweets'], 'support_scores' : support_scores, 'neutral_scores':neutral_scores,
         'against_scores': against_scores}

scores_df = pd.DataFrame(data)
scores_df

| | Clean_Tweets | support_scores | neutral_scores | against_scores |
|---|---|---|---|---|
| 0 | hearnbob brimshack budget_tourist hi bob. so t... | 0.766667 | 0.793103 | 0.750000 |
| 1 | allegra stratton crying making this bull state... | 0.884615 | 0.846154 | 0.875000 |
| 2 | rubbish hospitals are filling with double-jabb... | 0.892857 | 0.857143 | 0.750000 |
| 3 | gary, as john still once said " control the co... | 0.766667 | 0.793103 | 0.750000 |
| 4 | borisjohnson numberpress conservatives this is... | 0.793103 | 0.888889 | 0.846154 |
| ... | ... | ... | ... | ... |
| 12260 | family is everything togetherapart max_stockw... | 0.821429 | 0.851852 | 0.880000 |
| 12261 | he decided that covid rules applied not to him... | 0.785714 | 0.750000 | 0.769231 |
| 12262 | you know what fuck it. the government would ha... | 0.888889 | 0.785714 | 0.807692 |
| 12263 | horrified to see crowds gathered in london and... | 0.851852 | 0.814815 | 0.840000 |


In [88]:
scores_df['Clean_Tweets'].is_unique

False

In [89]:
scores_df.groupby('Clean_Tweets')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028D216677F0>

In [90]:
scores_df.duplicated(subset=['Clean_Tweets']).sum()

96

In [91]:
scores_df[['support_scores' , 'neutral_scores', 'against_scores']]

| | support_scores | neutral_scores | against_scores |
|---|---|---|---|
| 0 | 0.766667 | 0.793103 | 0.750000 |
| 1 | 0.884615 | 0.846154 | 0.875000 |
| 2 | 0.892857 | 0.857143 | 0.750000 |
| 3 | 0.766667 | 0.793103 | 0.750000 |
| 4 | 0.793103 | 0.888889 | 0.846154 |
| ... | ... | ... | ... |
| 12260 | 0.821429 | 0.851852 | 0.880000 |
| 12261 | 0.785714 | 0.750000 | 0.769231 |
| 12262 | 0.888889 | 0.785714 | 0.807692 |
| 12263 | 0.851852 | 0.814815 | 0.840000 |


In [92]:
max(scores_df['support_scores'])

1.0

### 5. Create Clusters from the Labels

In [93]:
'''Actual assigning of classes to the tweets'''

def get_clusters(l1, l2, l3):
    support_ls = []
    against_ls = []
    neutral_ls = []
    
    for i, j, k in zip(l1, l2, l3):
        m = max(i, j, k)
        if m == i:
            support_ls.append(1)
        else:
            support_ls.append(0)
        if m == j:
            against_ls.append(1)
        else:
            against_ls.append(0)
        
        if m == j:
            neutral_ls.append(0)
        elif m == k:
            neutral_ls.append(1)
        else:
            neutral_ls.append(0)    
            
    return support_ls, against_ls, neutral_ls
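One property of `get_clusters` is worth noting: when the support and neutral scores tie for the maximum, both flags are set to 1, while a tie with the against score forces the neutral flag to 0. The sketch below is an alternative that always sets exactly one flag by taking the position of the maximum. `assign_label`, `one_hot` and the tie-break order (support, then against, then neutral) are assumptions made for illustration.

```python
LABELS = ("support", "against", "neutral")

def assign_label(support, against, neutral):
    """Return the name of the highest score; ties go to the earliest label."""
    scores = (support, against, neutral)
    best = max(range(len(scores)), key=lambda idx: scores[idx])
    return LABELS[best]

def one_hot(label):
    """Return the (support, against, neutral) indicator tuple for a label."""
    return tuple(int(name == label) for name in LABELS)

# Scores of the first tweet in scores_df: support, against, neutral.
label = assign_label(0.766667, 0.750000, 0.793103)
flags = one_hot(label)
print(label, flags)  # neutral (0, 0, 1)
```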

In [94]:
support_df_ls = scores_df['support_scores'].to_list()
against_df_ls = scores_df['against_scores'].to_list()
neutral_df_ls = scores_df['neutral_scores'].to_list()

In [95]:
support, against, neutral = get_clusters(support_df_ls, against_df_ls, neutral_df_ls)

In [96]:
max(support), min(support)

(1, 0)

In [97]:
max(against), min(against)

(1, 0)

In [98]:
max(neutral), min(neutral)

(1, 0)

In [99]:
scores_df['support_clst'] = support

In [100]:
scores_df['against_clst'] = against

In [101]:
scores_df['neutral_clst'] = neutral

In [102]:
scores_df

| | Clean_Tweets | support_scores | neutral_scores | against_scores | support_clst | against_clst | neutral_clst |
|---|---|---|---|---|---|---|---|
| 0 | hearnbob brimshack budget_tourist hi bob. so t... | 0.766667 | 0.793103 | 0.750000 | 0 | 0 | 1 |
| 1 | allegra stratton crying making this bull state... | 0.884615 | 0.846154 | 0.875000 | 1 | 0 | 0 |
| 2 | rubbish hospitals are filling with double-jabb... | 0.892857 | 0.857143 | 0.750000 | 1 | 0 | 0 |
| 3 | gary, as john still once said " control the co... | 0.766667 | 0.793103 | 0.750000 | 0 | 0 | 1 |
| 4 | borisjohnson numberpress conservatives this is... | 0.793103 | 0.888889 | 0.846154 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12260 | family is everything togetherapart max_stockw... | 0.821429 | 0.851852 | 0.880000 | 0 | 1 | 0 |
| 12261 | he decided that covid rules applied not to him... | 0.785714 | 0.750000 | 0.769231 | 1 | 0 | 0 |
| 12262 | you know what fuck it. the government would ha... | 0.888889 | 0.785714 | 0.807692 | 1 | 0 | 0 |
| 12263 | horrified to see crowds gathered in london and... | 0.851852 | 0.814815 | 0.840000 | 1 | 0 | 0 |


In [103]:
scores_df_clst = scores_df.drop(['support_scores', 'neutral_scores', 'against_scores'], axis=1)

In [104]:
scores_df_clst

| | Clean_Tweets | support_clst | against_clst | neutral_clst |
|---|---|---|---|---|
| 0 | hearnbob brimshack budget_tourist hi bob. so t... | 0 | 0 | 1 |
| 1 | allegra stratton crying making this bull state... | 1 | 0 | 0 |
| 2 | rubbish hospitals are filling with double-jabb... | 1 | 0 | 0 |
| 3 | gary, as john still once said " control the co... | 0 | 0 | 1 |
| 4 | borisjohnson numberpress conservatives this is... | 0 | 0 | 1 |
| ... | ... | ... | ... | ... |
| 12260 | family is everything togetherapart max_stockw... | 0 | 1 | 0 |
| 12261 | he decided that covid rules applied not to him... | 1 | 0 | 0 |
| 12262 | you know what fuck it. the government would ha... | 1 | 0 | 0 |
| 12263 | horrified to see crowds gathered in london and... | 1 | 0 | 0 |


### 6. Create one Column for All Three Labels

In [105]:
#get all labels in one column
'''Actual assigning of labels to the tweets'''

def get_labels(l1, l2, l3):
    label_ls = []
    
    for i, j, k in zip(l1, l2, l3):
        
        if i == 1:
            label_ls.append(0)
        elif j ==1:
            label_ls.append(1)
        else:
            label_ls.append(2)    
            
    return label_ls

In [106]:
support_clst_ls = scores_df_clst['support_clst'].to_list()
against_clst_ls = scores_df_clst['against_clst'].to_list()
neutral_clst_ls = scores_df_clst['neutral_clst'].to_list()

In [107]:
all_label = get_labels(support_clst_ls, against_clst_ls, neutral_clst_ls)

In [108]:
len(all_label)

12265

In [109]:
scores_df_clst['all_label'] =  all_label

In [110]:
scores_df_clst

Unnamed: 0,Clean_Tweets,support_clst,against_clst,neutral_clst,all_label
0,hearnbob brimshack budget_tourist hi bob. so t...,0,0,1,2
1,allegra stratton crying making this bull state...,1,0,0,0
2,rubbish hospitals are filling with double-jabb...,1,0,0,0
3,"gary, as john still once said "" control the co...",0,0,1,2
4,borisjohnson numberpress conservatives this is...,0,0,1,2
...,...,...,...,...,...
12260,family is everything togetherapart max_stockw...,0,1,0,1
12261,he decided that covid rules applied not to him...,1,0,0,0
12262,you know what fuck it. the government would ha...,1,0,0,0
12263,horrified to see crowds gathered in london and...,1,0,0,0


In [111]:
scores_df_clst['all_label'].value_counts()

0    5812
1    3579
2    2874
Name: all_label, dtype: int64
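
The loop in `get_labels` can also be written without explicit iteration. The following is only a minimal sketch on a toy frame (the values are made up for illustration and are not taken from the dataset); it uses `numpy.select`, which picks the first condition that holds, so it keeps the same precedence as `get_labels`: support first, then against, otherwise neutral.

```python
import numpy as np
import pandas as pd

# Toy stand-in for scores_df_clst (illustrative values only)
toy = pd.DataFrame({
    "support_clst": [1, 0, 0, 1],
    "against_clst": [0, 1, 0, 1],
    "neutral_clst": [0, 0, 1, 0],
})

# Same precedence as get_labels: support -> 0, else against -> 1, else 2
toy["all_label"] = np.select(
    [toy["support_clst"] == 1, toy["against_clst"] == 1],
    [0, 1],
    default=2,
)
print(toy["all_label"].tolist())
```

The last toy row sets both the support and the against flag; like `get_labels`, the sketch resolves it to label 0.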

In [112]:
# Sum the one-hot flags per unique tweet text; duplicated tweets get counts > 1.
# Note: groupby sorts the rows by tweet text, and the summed all_label is no
# longer a valid class label for duplicated tweets.
pivot_clusters = scores_df_clst.groupby(['Clean_Tweets']).sum()
pivot_clusters['support_scores_clst'] = pivot_clusters['support_clst'].astype(int)
pivot_clusters['neutral_scores_clst'] = pivot_clusters['neutral_clst'].astype(int)
pivot_clusters['against_scores_clst'] = pivot_clusters['against_clst'].astype(int)
pivot_clusters['all_label'] = pivot_clusters['all_label'].astype(int)

# Number of flags set across all copies of the same tweet text
pivot_clusters['total'] = pivot_clusters['support_scores_clst'] + pivot_clusters['neutral_scores_clst'] + pivot_clusters['against_scores_clst']
pivot_clusters.loc["Total"] = pivot_clusters.sum()  # add a totals row
print(pivot_clusters.shape)
pivot_clusters.tail()

(12170, 8)


Clean_Tweets,support_clst,against_clst,neutral_clst,all_label,support_scores_clst,neutral_scores_clst,against_scores_clst,total
zokko social distancing,1,0,0,0,1,0,0,1
"zonker dhscgovuk imperialcollege ipsosmori transportgovuk grantshapps sajidjavid no this is independent data thankfully, i was in sharm last august safe as houses no covid and measures put in place are really good occupancy only, all staff masked, social distancing etc etc",1,0,0,0,1,0,0,1
zubhaque thealiceroberts im not saying that you are not correct or otherwise regarding the causes of previous lockdowns but i do feel that disregarding any advice for people to maintain social distancing etc. as we come out of a level of lockdown is both naive amp; poor judgement.,1,0,0,0,1,0,0,1
"zwings e-scooters provide people with a new urban mobility mode, allowing you to move any time, anywhere, in a socially-distant, sustainable, and riveting waysocialdistancing zwings",1,0,0,0,1,0,0,1
Total,5812,3582,2874,9327,5812,2874,3582,12268


In [113]:
pivot_clusters = pivot_clusters.reset_index(drop=True)

In [114]:
pivot_clusters['all_label'].nunique()

10

In [115]:
pivot_clusters[:-1]

Unnamed: 0,support_clst,against_clst,neutral_clst,all_label,support_scores_clst,neutral_scores_clst,against_scores_clst,total
0,0,0,1,2,0,1,0,1
1,1,0,0,0,1,0,0,1
2,1,0,0,0,1,0,0,1
3,1,0,0,0,1,0,0,1
4,0,0,1,2,0,1,0,1
...,...,...,...,...,...,...,...,...
12164,1,0,0,0,1,0,0,1
12165,1,0,0,0,1,0,0,1
12166,1,0,0,0,1,0,0,1
12167,1,0,0,0,1,0,0,1


## 5. Join All Dataframes Together to Build One Large Dataframe

In [116]:
# Drop the trailing "Total" row, then join on the integer index (positional match)
X_num_cols_1 = X_num.join(pivot_clusters[['support_scores_clst', 'neutral_scores_clst', 'against_scores_clst', 'total']][:-1])

In [117]:
X_num_cols_2 = X_num_cols_1.join(scores_df[['support_scores' , 'neutral_scores', 'against_scores']])

In [118]:
X_num_cols = X_num_cols_2.join(scores_df_clst['all_label'])

In [119]:
X_num_cols['against_scores'].max()

1.0

In [120]:
X_num_cols

Unnamed: 0,author_id,author_followers,author_tweets,retweets,replies,likes,quote_count,hastags_count,mentions_count,Year,...,scores_text,scores_extracted_hashtags,support_scores_clst,neutral_scores_clst,against_scores_clst,total,support_scores,neutral_scores,against_scores,all_label
0,2.857944e+08,5235,7324,0,1,0,0,0,3,2021,...,0.675000,0.000000,0.0,1.0,0.0,1.0,0.766667,0.793103,0.750000,2
1,6.500589e+07,279,5188,0,0,0,0,0,0,2021,...,0.766667,0.000000,1.0,0.0,0.0,1.0,0.884615,0.846154,0.875000,0
2,2.511388e+09,4336,113064,0,0,0,0,0,0,2021,...,0.613636,0.000000,1.0,0.0,0.0,1.0,0.892857,0.857143,0.750000,0
3,3.903702e+08,202,15416,0,0,0,0,0,0,2021,...,0.619048,0.000000,1.0,0.0,0.0,1.0,0.766667,0.793103,0.750000,2
4,2.082853e+07,638,20232,0,0,0,0,0,3,2021,...,0.590909,0.000000,0.0,1.0,0.0,1.0,0.793103,0.888889,0.846154,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12260,8.538250e+07,2664,27682,1,1,4,0,7,1,2021,...,0.469388,0.454545,,,,,0.821429,0.851852,0.880000,1
12261,6.726682e+07,549,1602,0,2,8,0,0,0,2021,...,0.685714,0.000000,,,,,0.785714,0.750000,0.769231,0
12262,1.520131e+09,1317,34355,0,0,1,0,0,0,2021,...,0.892857,0.000000,,,,,0.888889,0.785714,0.807692,0
12263,1.286070e+08,22,541,0,1,2,0,3,0,2021,...,0.750000,0.400000,,,,,0.851852,0.814815,0.840000,0


In [121]:
X_num_cols = X_num_cols.fillna(0)

In [122]:
X_num_cols['total'].value_counts()

1.0     12114
0.0        96
2.0        42
3.0         5
4.0         4
6.0         1
13.0        1
15.0        1
5.0         1
Name: total, dtype: int64
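
The 96 rows with `total` equal to 0.0 match the gap between the 12265 tweets and the 12169 grouped rows (12170 minus the totals row): the grouped frame is shorter, so the trailing rows had no partner in the join and were filled with 0. Because `groupby` also sorts by tweet text, the remaining rows appear to be matched by position rather than by tweet. One possible way to keep one row per original tweet is `groupby(...).transform("sum")`. This is only a sketch on a toy frame with made-up values, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for scores_df_clst (hypothetical values); "b" is a duplicated tweet
toy = pd.DataFrame({
    "Clean_Tweets": ["b", "a", "b", "c"],
    "support_clst": [1, 0, 1, 0],
    "against_clst": [0, 1, 0, 0],
    "neutral_clst": [0, 0, 0, 1],
})
flag_cols = ["support_clst", "against_clst", "neutral_clst"]

# transform("sum") returns one row per original row, in the original order
per_text = toy.groupby("Clean_Tweets")[flag_cols].transform("sum")
per_text.columns = ["support_scores_clst", "against_scores_clst", "neutral_scores_clst"]
per_text["total"] = per_text.sum(axis=1)

# The index is unchanged, so the join lines up row by row and leaves no NaN
aligned = toy.join(per_text)
print(aligned)
```

With this layout every copy of a duplicated tweet carries the same counts, and no `fillna(0)` is needed after the join.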

In [123]:
max(X_num_cols['support_scores'])

1.0

In [124]:
X_num_cols_ls = X_num_cols.columns.tolist()

In [125]:
len(X_num_cols_ls)

32

In [126]:
X_num_cols_scaled = X_num_cols.copy()


## 6. Scale the Dataframe

In [127]:
scaler = MinMaxScaler()

scaler.fit(X_num_cols_scaled[['author_id', 'author_followers', 'author_tweets', 
            'retweets', 'replies', 'likes', 'quote_count', 
            'Year', 'Month', 'Day', 'Hour', 'Minute','Second']])

MinMaxScaler()

In [128]:
scaler.transform(X_num_cols_scaled[['author_id', 'author_followers', 'author_tweets', 
            'retweets', 'replies', 'likes', 'quote_count', 
            'Year', 'Month', 'Day', 'Hour', 'Minute','Second']])

array([[1.95068246e-10, 2.27587916e-02, 4.11154337e-03, ...,
        9.56521739e-01, 3.05084746e-01, 3.38983051e-01],
       [4.43634508e-11, 1.21293273e-03, 2.91227304e-03, ...,
        9.13043478e-01, 5.59322034e-01, 4.91525424e-01],
       [1.71420348e-09, 1.88504528e-02, 6.34799164e-02, ...,
        9.13043478e-01, 3.72881356e-01, 6.10169492e-01],
       ...,
       [1.03759563e-09, 5.72556419e-03, 1.92882645e-02, ...,
        0.00000000e+00, 2.88135593e-01, 5.59322034e-01],
       [8.77760129e-11, 9.56434413e-05, 3.03186320e-04, ...,
        0.00000000e+00, 1.52542373e-01, 6.77966102e-01],
       [5.61466581e-01, 1.33031332e-03, 2.63884389e-04, ...,
        0.00000000e+00, 6.77966102e-02, 2.54237288e-01]])

In [129]:
# Columns that were passed to the scaler when it was fitted
scale_cols = ['author_id', 'author_followers', 'author_tweets',
              'retweets', 'replies', 'likes', 'quote_count',
              'Year', 'Month', 'Day', 'Hour', 'Minute', 'Second']

# Rebuild a dataframe from the scaled array, keeping the original column names
X_num_cols_scaled = pd.DataFrame(scaler.transform(X_num_cols_scaled[scale_cols]),
                                 columns=scale_cols)

In [130]:
X_num_cols_scaled_ls = X_num_cols_scaled.columns.to_list()

In [131]:
X_num_cols_final = X_num_cols_scaled.join(X_num_cols.drop(columns=X_num_cols_scaled_ls))

In [132]:
X_num_cols_final

Unnamed: 0,author_id,author_followers,author_tweets,retweets,replies,likes,quote_count,Year,Month,Day,...,scores_text,scores_extracted_hashtags,support_scores_clst,neutral_scores_clst,against_scores_clst,total,support_scores,neutral_scores,against_scores,all_label
0,1.950682e-10,0.022759,0.004112,0.000000,0.002611,0.000000,0.000000,0.0,1.0,0.233333,...,0.675000,0.000000,0.0,1.0,0.0,1.0,0.766667,0.793103,0.750000,2
1,4.436345e-11,0.001213,0.002912,0.000000,0.000000,0.000000,0.000000,0.0,1.0,0.233333,...,0.766667,0.000000,1.0,0.0,0.0,1.0,0.884615,0.846154,0.875000,0
2,1.714203e-09,0.018850,0.063480,0.000000,0.000000,0.000000,0.000000,0.0,1.0,0.233333,...,0.613636,0.000000,1.0,0.0,0.0,1.0,0.892857,0.857143,0.750000,0
3,2.664491e-10,0.000878,0.008655,0.000000,0.000000,0.000000,0.000000,0.0,1.0,0.233333,...,0.619048,0.000000,1.0,0.0,0.0,1.0,0.766667,0.793103,0.750000,2
4,1.420908e-11,0.002774,0.011359,0.000000,0.000000,0.000000,0.000000,0.0,1.0,0.233333,...,0.590909,0.000000,0.0,1.0,0.0,1.0,0.793103,0.888889,0.846154,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12260,5.827202e-11,0.011582,0.015542,0.000951,0.002611,0.001248,0.000000,0.0,0.0,0.000000,...,0.469388,0.454545,0.0,0.0,0.0,0.0,0.821429,0.851852,0.880000,1
12261,4.590671e-11,0.002387,0.000899,0.000000,0.005222,0.002495,0.000000,0.0,0.0,0.000000,...,0.685714,0.000000,0.0,0.0,0.0,0.0,0.785714,0.750000,0.769231,0
12262,1.037596e-09,0.005726,0.019288,0.000000,0.000000,0.000312,0.000000,0.0,0.0,0.000000,...,0.892857,0.000000,0.0,0.0,0.0,0.0,0.888889,0.785714,0.807692,0
12263,8.777601e-11,0.000096,0.000303,0.000000,0.002611,0.000624,0.000000,0.0,0.0,0.000000,...,0.750000,0.400000,0.0,0.0,0.0,0.0,0.851852,0.814815,0.840000,0


In [133]:
X_num_cols_final.columns

Index(['author_id', 'author_followers', 'author_tweets', 'retweets', 'replies',
       'likes', 'quote_count', 'Year', 'Month', 'Day', 'Hour', 'Minute',
       'Second', 'hastags_count', 'mentions_count', 'Period_id', 'score',
       'PN_score_clst', 'cluster_id', 'scores_author_descrip',
       'scores_username', 'scores_location', 'scores_text',
       'scores_extracted_hashtags', 'support_scores_clst',
       'neutral_scores_clst', 'against_scores_clst', 'total', 'support_scores',
       'neutral_scores', 'against_scores', 'all_label'],
      dtype='object')
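
For reference, `MinMaxScaler` with its default range maps each column to [0, 1] using (x - min) / (max - min). The sketch below, on a toy frame with made-up values, scales a subset of columns in place, which avoids the separate copy, rebuild and join steps and keeps the original index. It is an alternative layout under these assumptions, not the code used above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame (hypothetical values): two columns to scale, one to leave untouched
toy = pd.DataFrame({
    "likes": [0.0, 5.0, 10.0],
    "replies": [2.0, 4.0, 6.0],
    "all_label": [0, 1, 2],
}, index=[10, 11, 12])

scale_cols = ["likes", "replies"]

# Assign the scaled values back into a copy; other columns and the index stay as they are
scaled = toy.copy()
scaled[scale_cols] = MinMaxScaler().fit_transform(toy[scale_cols])

# The same result computed by hand with (x - min) / (max - min)
col_min = toy[scale_cols].min()
col_max = toy[scale_cols].max()
manual = (toy[scale_cols] - col_min) / (col_max - col_min)

print(scaled)
```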


## 7. Save the Dataframe to a CSV File

In [135]:
# X.to_csv(r'C:\Users\r04ra18\Desktop\output_X_data_final.csv')

In [136]:
X_num.to_csv(r'C:\Users\r04ra18\Desktop\output_X_num_data_final.csv')

In [137]:
X_num_cols.to_csv(r'C:\Users\r04ra18\Desktop\output_X_num_cols_data_final.csv')

In [139]:
X_num_cols_final.to_csv(r'C:\Users\r04ra18\Desktop\output_X_num_cols_uk_final.csv')
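
Note that `to_csv` writes the row index by default, and that index comes back as an `Unnamed: 0` column when the file is read again, which is likely why the first column is sliced off with `iloc[:, 1:]` at the start of this notebook. A minimal sketch of the difference, using a temporary folder and made-up file names rather than the paths above:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"likes": [0, 5], "all_label": [2, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # hypothetical output folder

    # Default: the index is written as an extra, unnamed first column
    toy.to_csv(out_dir / "with_index.csv")
    # index=False: only the data columns are written
    toy.to_csv(out_dir / "no_index.csv", index=False)

    with_index_cols = pd.read_csv(out_dir / "with_index.csv").columns.tolist()
    no_index_cols = pd.read_csv(out_dir / "no_index.csv").columns.tolist()

print(with_index_cols)
print(no_index_cols)
```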

<img src="https://i.imgur.com/VCzUM0V.png">