## Clustering Users by Twitter Bio

This program uses our custom-built Twitter bio datasets to group users into similarity clusters.

**TODO: Create or find better text preprocessing code before drawing conclusions from the data. Twitter bios are highly irregular in format and style, and standard preprocessing techniques alone do not seem well suited to the job.**

**Citations**

Twitter text preprocessing:
- "Basic Tweet Preprocessing Method With Python" by Anil Emrah, https://medium.com/analytics-vidhya/basic-tweet-preprocessing-method-with-python-56b4e53854a1

Text encoding: LEGAL-BERT model series by:

- I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. 
"LEGAL-BERT: The Muppets straight out of Law School". 
In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) 
(Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261)

Pretrained Model Repo / Implementation:

- https://huggingface.co/nlpaueb/legal-bert-base-uncased

## Imports

In [None]:
# Pretrained Transformers from HuggingFace (install before importing)
!pip install transformers

# general
import json
import os
import re
import string

# data handling and modelling
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import BayesianGaussianMixture

# PyTorch
import torch

# HuggingFace tokenizer and model loaders
from transformers import AutoTokenizer, AutoModel

# text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')



In [None]:
from google.colab import drive
drive.mount('/content/drive')

DATASETS_FOLDER = '/content/drive/MyDrive/Colab_Notebooks/models/ReRight/datasets'

Mounted at /content/drive


# Datasets

In [None]:
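# Load datasets from Excel files

def load_datasets(dataset_folder=os.path.join(DATASETS_FOLDER, 'Twitter_Bios')):
    """Read every Excel file under dataset_folder into a single DataFrame."""
    frames = []
    for dirpath, dirnames, filenames in os.walk(dataset_folder):
        for file in filenames:
            # os.walk yields bare file names, so join each one with its directory
            frames.append(pd.read_excel(os.path.join(dirpath, file)))

    # ignore_index gives the combined frame a unique 0..n-1 index
    return pd.concat(frames, ignore_index=True) if frames else None


df = load_datasets()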

In [None]:
df.head(5)

Unnamed: 0,username,description,name,id
0,EmJaRo2,#IStandWithRosieDuffield,Emma Robertson,1217216916986699776
1,NotACommunist24,20. Autistic. Bi. Based af. ❤RESIDENT EVIL!❤ I...,Based Syndicalist,1309627886941409280
2,BarbaraRowley7,she/her. Former Healthcare Worker. Science ner...,Barbara Rowley,1170232400569192448
3,dreamchxild,𝘧𝘦𝘦𝘭𝘪𝘯𝘨 𝘮𝘺𝘴𝘦𝘭𝘧 𝘭𝘪𝘬𝘦 𝘪𝘮 𝘯𝘰𝘳𝘮𝘢 𝘫𝘦𝘢𝘯𝘦. 𝘯𝘪𝘨𝘩𝘵 𝘴𝘤𝘳𝘪...,𝐃𝐈𝐀༄ *.ﾟ♡,727246415232159744
4,rockinfabblue,Full of Myself.💖🤗\nVirgo sun♍\nGemini rising♊\...,Tiffany,1473144504


Preprocessing

*TODO: improve text prep. We believe it should be possible to find/create steps more tailored to Twitter data.*
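One possible step toward Twitter-specific preprocessing is to normalize stylized Unicode text before anything else runs. The bios shown above contain letters drawn from the mathematical alphanumeric block and many emoji, which standard tokenizers and stopword lists do not recognize. The sketch below is only an illustration: `normalize_bio` is a hypothetical helper, and dropping every character in Unicode category `So` is an assumption that may remove symbols you want to keep.

```python
import unicodedata


def normalize_bio(text):
    """Fold stylized Unicode letters to plain ones and drop symbol characters."""
    # NFKC maps compatibility characters (e.g. mathematical italic letters)
    # onto their ordinary equivalents
    folded = unicodedata.normalize("NFKC", text)
    # Category "So" (Symbol, other) covers most emoji and pictographs
    kept = [ch for ch in folded if unicodedata.category(ch) != "So"]
    # Collapse runs of whitespace, including newlines, into single spaces
    return " ".join("".join(kept).split())


print(normalize_bio("𝘧𝘦𝘦𝘭𝘪𝘯𝘨 𝘮𝘺𝘴𝘦𝘭𝘧"))
print(normalize_bio("Virgo sun♍\nGemini rising♊"))
```

Running a step like this before `preprocess_tweet` would let the later lower-casing and stopword removal see ordinary letters.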

In [None]:
# NOTE: This code is adapted from the post "Basic Tweet Preprocessing Method With Python" by Anil Emrah,
# hosted on GitHub at https://gist.github.com/anilemrah/a390f0f7008670e6187ef980ee10c1da#file-preprocess_tweet-py
STOPWORDS = set(stopwords.words('english'))  # build once instead of once per word

def preprocess_tweet(text):
    """
    Function comes from https://medium.com/analytics-vidhya/basic-tweet-preprocessing-method-with-python-56b4e53854a1

    The URL, username, and hashtag steps run before punctuation is stripped,
    because they rely on characters such as '/', '@' and '#'.
    """
    # missing bios are read as NaN (a float); treat them as empty text
    if not isinstance(text, str):
        return []

    # convert text to lower-case
    text = text.lower()

    # remove URLs
    text = re.sub(r'(www\.\S+)|(https?://\S+)', '', text)

    # remove usernames
    text = re.sub(r'@\S+', '', text)

    # remove the # in #hashtag
    text = re.sub(r'#(\S+)', r'\1', text)

    # remove the remaining punctuation characters
    text = ''.join(char for char in text if char not in string.punctuation)

    # split the text into word tokens
    tokens = word_tokenize(text)

    # remove stopwords from final word list
    return [word for word in tokens if word not in STOPWORDS]

In [None]:
df['description'] = df['description'].apply(preprocess_tweet)
df['name'] = df['name'].apply(preprocess_tweet)

In [None]:
df.head(10)

Unnamed: 0,username,description,name,id
0,EmJaRo2,[istandwithrosieduffield],[emmarobertson],1217216916986699776
1,NotACommunist24,[20autisticbibasedaf❤residentevil❤illtweetingc...,[basedsyndicalist],1309627886941409280
2,BarbaraRowley7,[sheherformerhealthcareworkersciencenerdwashha...,[barbararowley],1170232400569192448
3,dreamchxild,[𝘧𝘦𝘦𝘭𝘪𝘯𝘨𝘮𝘺𝘴𝘦𝘭𝘧𝘭𝘪𝘬𝘦𝘪𝘮𝘯𝘰𝘳𝘮𝘢𝘫𝘦𝘢𝘯𝘦𝘯𝘪𝘨𝘩𝘵𝘴𝘤𝘳𝘪𝘣𝘣𝘭𝘦𝘳𝘮𝘢...,[𝐃𝐈𝐀༄ﾟ♡],727246415232159744
4,rockinfabblue,[fullmyself💖🤗virgosun♍geminirising♊sagittarius...,[tiffany],1473144504
5,ZMafereka,[siyaqhuba],[pandemicpapi],1347793743345307649
6,judithpark13,[mummy2preciousgirls💖💖],[judithpark],382876047
7,taycolmenero,[wildthings],[taylorcolmenero],305766960
8,norlenemm,[zimbabweanfounderherwombbfollowreproductivehe...,[norlenem],1251104064793952258
9,Lilies09,[lifesworthdamntillshout],[englishrose],54625997


## Vectorization

Tokenizer

*Note: this tokenizer was used successfully in our rights violations code, but we recommend finding or creating a preprocessor better suited to Twitter data.*

In [None]:
"""
Tokenizer created by I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras and I. Androutsopoulos. 
"LEGAL-BERT: The Muppets straight out of Law School". In Findings of Empirical Methods in Natural Language Processing (EMNLP 2020) 
(Short Papers), to be held online, 2020. (https://aclanthology.org/2020.findings-emnlp.261)

PRETRAINED MODEL IMPLEMENTATION from https://huggingface.co/nlpaueb/legal-bert-base-uncased
"""

# Tokenizer
# This tokenizer comes from the LEGAL-BERT variant pretrained on ECHR case text
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/bert-base-uncased-echr")

# Twitter bios
# join each user's preprocessed bio tokens back into a single string
texts = df['description'].apply(' '.join).to_list()
tokens_echr = tokenizer(texts,
                   padding=True,
                   truncation=True,
                   max_length=256,  # pad/truncate to uniform size
                   return_tensors="pt")  # return in PyTorch format

# zero out the token IDs at padded positions
masked_tokens_echr = tokens_echr['input_ids'] * tokens_echr['attention_mask']

Vectorizer

In [None]:
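# TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# build a vocabulary of every token ID up to and including the largest one seen,
# converted to strings for fitting the vectorizer
max_token_id = int(torch.max(tokens_echr['input_ids']))
corpus = [str(c) for c in range(max_token_id + 1)]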

# fit
tfidf_vectorizer.fit(corpus)

# convert to string for applying vectorizer
masked_tokens_list = masked_tokens_echr.tolist()
masked_tokens_echr_string = [' '.join([str(num) for num in masked_tokens_list[i]]) 
                                for i in range(len(masked_tokens_list))]

encoded_data = tfidf_vectorizer.transform(masked_tokens_echr_string)

In [None]:
encoded_data

<1000x29980 sparse matrix of type '<class 'numpy.float64'>'
	with 132098 stored elements in Compressed Sparse Row format>
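One detail worth checking when TF-IDF is applied to token IDs written as strings: the default `token_pattern` of `TfidfVectorizer` only matches tokens of two or more characters, so single-digit IDs never become features. That includes the `0` left at padded positions by the attention-mask multiplication. The sketch below uses made-up ID strings to show the effect; the alternative pattern is an assumption about what one might want, not something the notebook currently uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical token-ID "documents": each ID is written as a string
docs = ["101 7 2054 102 0 0", "101 2054 2054 9 102 0"]

# Default token_pattern requires at least two word characters per token
default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Assumed alternative: keep every run of word characters, single digits included
all_ids_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
all_ids_vec.fit(docs)

default_vocab = sorted(default_vec.vocabulary_)
all_ids_vocab = sorted(all_ids_vec.vocabulary_)
print(default_vocab)   # single-digit IDs are missing
print(all_ids_vocab)   # single-digit IDs are present
```

Dropping the padding ID is arguably convenient here, but IDs 1 to 9 are dropped along with it, so it is worth deciding deliberately which behaviour is wanted.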

# Mixture Models

Dimension Reduction

In [None]:
pca_transform = PCA(n_components=10)

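# if using TF-IDF vectorized data
# toarray() returns a plain ndarray; todense() returns np.matrix, which recent scikit-learn versions reject
encoded_data_condensed = pca_transform.fit_transform(encoded_data.toarray())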
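Densifying the TF-IDF matrix is fine at 1000 rows but grows quickly with more users. As an alternative technique (not the one used in this notebook), scikit-learn's `TruncatedSVD` accepts sparse input directly. Unlike PCA it does not center the data, so the components are not identical. The sketch below runs on a small random sparse matrix that stands in for `encoded_data`; all `toy_` names are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical stand-in for the TF-IDF matrix: 50 documents, 200 features, ~5% non-zero
toy_dense = rng.random((50, 200)) * (rng.random((50, 200)) < 0.05)
toy_encoded = sparse.csr_matrix(toy_dense)

# TruncatedSVD works on the sparse matrix without converting it to a dense array
svd = TruncatedSVD(n_components=10, random_state=0)
toy_condensed = svd.fit_transform(toy_encoded)

print(toy_condensed.shape)
# Fraction of variance retained by the 10 components
print(svd.explained_variance_ratio_.sum())
```

The explained variance ratio is also a quick way to judge whether 10 components keep enough of the signal for clustering.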

Cluster Model

In [None]:
mixture_model = BayesianGaussianMixture(n_components=5, random_state=142)
mixture_model.fit(encoded_data_condensed)

In [None]:
clusters_assignments = mixture_model.predict(encoded_data_condensed)

In [None]:
clusters_probs = mixture_model.predict_proba(encoded_data_condensed)
clusters_probs[:5,:]
print(clusters_probs.shape)

(1000, 5)
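With a Bayesian mixture, `n_components` acts as an upper bound: the fitted model can leave some components with weights close to zero. Checking `weights_` after fitting is a simple way to see how many of the five clusters are actually in use. The sketch below uses synthetic two-group data as an assumption, not the bio data, and every `toy_` name is hypothetical.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Hypothetical 2-D data with two well-separated groups
toy_points = np.vstack([rng.normal(-5, 0.5, (100, 2)),
                        rng.normal(5, 0.5, (100, 2))])

toy_model = BayesianGaussianMixture(n_components=5, random_state=142)
toy_model.fit(toy_points)

# Soft assignments: one probability per component, each row sums to 1
toy_probs = toy_model.predict_proba(toy_points)

# Mixture weights: components with a weight near zero are effectively unused
print(np.round(toy_model.weights_, 3))
print(toy_probs.shape)
```

Applying the same check to `mixture_model.weights_` would show whether all five components carry weight for the bios.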


Combine into dataframe

In [None]:
# hard assignments
df['clusters_assignments'] = clusters_assignments
df.head(2)

In [None]:
# soft assignments
soft_assignments_df = pd.DataFrame(clusters_probs, columns=['clusters_probs_0', 'clusters_probs_1', 'clusters_probs_2', 'clusters_probs_3', 'clusters_probs_4'])
soft_assignments_df.head(3)
print(len(soft_assignments_df))

1000


In [None]:
# align the soft assignments with df row by row, then join them column-wise
df = pd.concat([df, soft_assignments_df.set_axis(df.index)], axis='columns')
len(df)

1000

## Interpretations

TODO:
- Interpret results (what do the clusters represent?)
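A possible starting point for interpretation is to list the most frequent tokens in each cluster. The sketch below is a minimal illustration on made-up data: `top_tokens_per_cluster` is a hypothetical helper, and it assumes each bio is already a list of tokens (as `preprocess_tweet` returns) with one hard cluster label per bio.

```python
from collections import Counter, defaultdict


def top_tokens_per_cluster(token_lists, assignments, k=3):
    """Return the k most frequent tokens for each cluster label."""
    counts = defaultdict(Counter)
    for tokens, cluster in zip(token_lists, assignments):
        counts[cluster].update(tokens)
    return {cluster: [tok for tok, _ in counter.most_common(k)]
            for cluster, counter in sorted(counts.items())}


# Hypothetical preprocessed bios and their cluster labels
toy_tokens = [["science", "nerd"], ["science", "teacher"],
              ["virgo", "sun"], ["virgo", "moon"]]
toy_assignments = [0, 0, 1, 1]

summary = top_tokens_per_cluster(toy_tokens, toy_assignments, k=1)
print(summary)
```

In the notebook, the same call could take `df['description']` and `clusters_assignments` as inputs, and reading a few raw bios per cluster alongside the token lists would help confirm what each group represents.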