# Project aims

This project explores an alternative to conventional grouping methods, such as k-means clustering, for finding profile similarity among OkCupid users. Our assumption is that text analysis and topic modeling can help determine user compatibility and potentially enhance the statistical algorithm the platform already uses.

# Project methods and organization:

* Subset selection
* Initial text preprocessing
* Keyword selection
* KMeans model implementation and analysis
* LDA model implementation and analysis
* Conclusions

In [1]:
#dependencies
import numpy as np
import pandas as pd

import random

from matplotlib import pyplot as plt
import seaborn as sns

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('omw-1.4')
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans

import string
import re

import gensim
from gensim import corpora
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, preprocess_string, stem_text
from gensim.models import LdaMulticore, Word2Vec
from gensim.models.coherencemodel import CoherenceModel

from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.manifold import TSNE

import spacy

!pip install git+https://github.com/LIAAD/yake
import yake

!pip install pyldavis
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...




# Subset selection

In [2]:
#importing the dataset
okc_data = pd.read_csv('/kaggle/input/okcupid-profiles/okcupid_profiles.csv')
okc_data.head()

Unnamed: 0,age,status,sex,orientation,body_type,diet,drinks,drugs,education,ethnicity,...,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
0,22,single,m,straight,a little extra,strictly anything,socially,never,working on college/university,"asian, white",...,about me: i would love to think that i was so...,currently working as an international agent fo...,making people laugh. ranting about a good salt...,"the way i look. i am a six foot half asian, ha...","books: absurdistan, the republic, of mice and ...",food. water. cell phone. shelter.,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet! you are ti...
1,35,single,m,straight,average,mostly other,often,sometimes,working on space camp,white,...,i am a chef: this is what that means. 1. i am ...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories. my b...,,,i am very open and will share just about anyth...,
2,38,available,m,straight,thin,anything,socially,,graduated from masters program,,...,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,okay this is where the cultural matrix gets so...,movement conversation creation contemplation t...,,viewing. listening. dancing. talking. drinking...,"when i was five years old, i was known as ""the...","you are bright, open, intense, silly, ironic, ..."
3,23,single,m,straight,thin,vegetarian,socially,,working on college/university,white,...,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,"bataille, celine, beckett. . . lynch, jarmusch...",,cats and german philosophy,,,you feel so inclined.
4,29,single,m,straight,athletic,,socially,never,graduated from college/university,"asian, black, other",...,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at: http://bagsbrown....,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians at the moment...",,,,,


In this dataset, the main textual information is located in the so-called "essays": user-generated texts that answer question prompts such as "You should message me if..." and "What people first notice about me". It is therefore reasonable to group the essays belonging to the same individual into one document per user, rather than aggregating all of the text data into a single corpus.
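
The per-user grouping described above can be sketched in plain Python. This is a minimal illustration only: the two rows reuse essay fragments visible in the table above, while the dict layout and the `user_document` helper are assumptions made for the example. Joining with a space matters, because otherwise the last word of one essay fuses with the first word of the next.

```python
# Hypothetical miniature of the essay columns: one dict per user,
# with None standing in for an unanswered prompt.
users = [
    {"essay5": "food. water. cell phone. shelter.", "essay6": "duality and humorous things"},
    {"essay5": None, "essay6": "cats and german philosophy"},
]

def user_document(row):
    """Join one user's answered essays into a single text, separated by spaces."""
    return " ".join(text for text in row.values() if text)

# One document per user, not one document for the whole dataset.
documents = [user_document(row) for row in users]
print(documents)
```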

In [3]:
#creating the essay subset
essays = okc_data[['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']]
essays.head()

Unnamed: 0,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
0,about me: i would love to think that i was so...,currently working as an international agent fo...,making people laugh. ranting about a good salt...,"the way i look. i am a six foot half asian, ha...","books: absurdistan, the republic, of mice and ...",food. water. cell phone. shelter.,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet! you are ti...
1,i am a chef: this is what that means. 1. i am ...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories. my b...,,,i am very open and will share just about anyth...,
2,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,okay this is where the cultural matrix gets so...,movement conversation creation contemplation t...,,viewing. listening. dancing. talking. drinking...,"when i was five years old, i was known as ""the...","you are bright, open, intense, silly, ironic, ..."
3,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,"bataille, celine, beckett. . . lynch, jarmusch...",,cats and german philosophy,,,you feel so inclined.
4,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at: http://bagsbrown....,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians at the moment...",,,,,


# Text preprocessing

In [4]:
essays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   essay0  54458 non-null  object
 1   essay1  52374 non-null  object
 2   essay2  50308 non-null  object
 3   essay3  48470 non-null  object
 4   essay4  49409 non-null  object
 5   essay5  49096 non-null  object
 6   essay6  46175 non-null  object
 7   essay7  47495 non-null  object
 8   essay8  40721 non-null  object
 9   essay9  47343 non-null  object
dtypes: object(10)
memory usage: 4.6+ MB


As can be seen, the dataset contains null values. We could fill them with typical placeholders such as "None" or "N/A", but these would create issues with preprocessing and topic selection later on. Instead, we use a punctuation mark, which keeps empty values visible and is removed at the preprocessing stage.
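
A small sketch of why the placeholder choice matters, assuming a simplified cleaning step (ASCII punctuation removal followed by whitespace splitting) rather than the full pipeline used later. The helper name `strip_punctuation_and_tokenize` is hypothetical.

```python
import string

def strip_punctuation_and_tokenize(text):
    """Remove ASCII punctuation, lowercase, then split on whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return cleaned.lower().split()

# A '.' placeholder leaves no token behind after punctuation removal...
with_dot = strip_punctuation_and_tokenize("cats and german philosophy .")
# ...whereas word-like placeholders survive as spurious tokens.
with_none = strip_punctuation_and_tokenize("cats and german philosophy None")
with_na = strip_punctuation_and_tokenize("cats and german philosophy N/A")

print(with_dot)
print(with_none)
print(with_na)
```

Word-like placeholders would then show up in the vocabulary and could be picked up as keywords or topic terms.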

In [5]:
essays = essays.fillna('.')
essays.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   essay0  59946 non-null  object
 1   essay1  59946 non-null  object
 2   essay2  59946 non-null  object
 3   essay3  59946 non-null  object
 4   essay4  59946 non-null  object
 5   essay5  59946 non-null  object
 6   essay6  59946 non-null  object
 7   essay7  59946 non-null  object
 8   essay8  59946 non-null  object
 9   essay9  59946 non-null  object
dtypes: object(10)
memory usage: 4.6+ MB


In [6]:
#sample essay
essays.essay0[0]

"about me:  i would love to think that i was some some kind of intellectual: either the dumbest smart guy, or the smartest dumb guy. can't say i can tell the difference. i love to talk about ideas and concepts. i forge odd metaphors instead of reciting cliches. like the simularities between a friend of mine's house and an underwater salt mine. my favorite word is salt by the way (weird choice i know). to me most things in life are better as metaphors. i seek to make myself a little better everyday, in some productively lazy way. got tired of tying my shoes. considered hiring a five year old, but would probably have to tie both of our shoes... decided to only wear leather shoes dress shoes.  about you:  you love to have really serious, really deep conversations about really silly stuff. you have to be willing to snap me out of a light hearted rant with a kiss. you don't have to be funny, but you have to be able to make me laugh. you should be able to bend spoons with your mind, and tele

As the original dataset consists of 59,946 rows, processing it all at once would be slow. To simplify the process, we draw 3,000 random samples without replacement (approximately 5% of the data). No random state is set, so that we can see the impact of sample variation on model coherence; as a consequence, the outputs below change from run to run.
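
For readers who prefer a repeatable draw, the sketch below contrasts an unseeded sample with a seeded one using only the standard library. The seed value and the `draw_sample` helper are assumptions for illustration; in pandas, the equivalent would be passing `random_state` to `sample`.

```python
import random

N_ROWS = 59946      # number of profiles reported by essays.info()
SAMPLE_SIZE = 3000  # sample size used in the notebook

def draw_sample(seed=None):
    """Return SAMPLE_SIZE distinct row indices; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(range(N_ROWS), SAMPLE_SIZE)

unseeded = draw_sample()        # differs from run to run, as in the notebook
seeded_a = draw_sample(seed=1)  # hypothetical seed value
seeded_b = draw_sample(seed=1)

print(len(set(unseeded)), seeded_a == seeded_b)
print(f"sample fraction: {SAMPLE_SIZE / N_ROWS:.1%}")
```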

In [7]:
#aggregating user-generated essays into one text per user
#(joined with a space so that words at essay boundaries do not merge)
essays['texts'] = essays.apply(' '.join, axis=1)
sample = essays['texts'].sample(n=3000, replace=False)
sample = sample.reset_index()
sample['texts'][0:10]

0    i am fun, smart, spontaneous, easy going and s...
1    above all else, there must be laughter. the pe...
2    simple and straightforward - easy to make conv...
3    i'm a minnesota girl who fell in love with san...
4    i'm an east coast native who moved to the bay ...
5    simply said: i'm a dorky, sweet, hopeless roma...
6    i like to create. i write short stories, draw,...
7    i'm a california native-- grew up in the south...
8    curious, creative, pragmatic optimist seeks ne...
9    i'm new to this coast and looking to make frie...
Name: texts, dtype: object

In [8]:
#text preprocessing
nlp = spacy.load('en_core_web_sm')
all_stopwords = nlp.Defaults.stop_words
all_stopwords.add("love") #custom stopword added by hand

#first pass: gensim stopword removal, then tokenization
sample['texts'] = sample['texts'].apply(remove_stopwords)
sample_tokens = sample['texts'].apply(word_tokenize)

#second pass: spaCy stopwords, then punctuation removal and lowercasing
sample_clean = sample_tokens.apply(lambda text: " ".join(i for i in text if i not in all_stopwords))
sample_np = sample_clean.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
sample_np = sample_np.apply(lambda x: x.lower().strip())
sample_np = sample_np.apply(remove_stopwords) #final pass for certainty

In [9]:
#to gauge text similarity both lemmatisation and stemming were used, with stemming giving more consistent results

sample_stemmed = sample_np.apply(lambda x: stem_text(x))
sample['stems'] = sample_stemmed
sample['stems'][0:10]

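# lemmatizer = WordNetLemmatizer()
# #WordNetLemmatizer.lemmatize expects a single word, so it is applied token by token
# sample_lems = sample_np.apply(lambda x: ' '.join(lemmatizer.lemmatize(w) for w in x.split()))
# sample['lems'] = sample_lems
# sample_lems[0:10]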

0    fun smart spontan easi go self motiv need simi...
1    laughter peopl appreci world laugh hard mayb p...
2    simpl straightforward easi convers fun love do...
3    minnesota girl fell san francisco visit final ...
4    east coast nativ move bai area school work pre...
5    simpli said dorki sweet hopeless romant hilari...
6    like creat write short stori draw dii sew like...
7    california nativ grew south move 6 year ago ge...
8    curiou creativ pragmat optimist seek new adven...
9    new coast look friend mayb hard worker rise qu...
Name: stems, dtype: object

# Keyword extraction

Unlike purposefully created text data, such as news items or academic articles, the essays do not share a uniform aim; or rather, their aim is to present the user as desirable. This can involve a wide range of linguistic, and sometimes paralinguistic, means that hinder classification. We therefore propose to standardize the data further by selecting only relevant keywords and using them for corpus creation and analysis. The YAKE keyword extractor was used for this purpose.

In [10]:
#Extracted keyword strings
yaxtract_mod = yake.KeywordExtractor(lan='en', n=1, dedupLim=0.7, top=5, dedupFunc='seqm', windowsSize=1)

keywords = sample['stems'].apply(lambda x: yaxtract_mod.extract_keywords(x))
keywords_clean = keywords.apply(lambda i: list(dict(i).keys()))

keywords_str = keywords_clean.apply(lambda x: ' '.join(x))

keywords_str[0:10]

0          fun convers type music nom
1          thing plai good peopl time
2       food restaur work convers fun
3      friend differ restaur san easi
4       plai video game work interest
5     simpli session person time huge
6    favorit nice current charact sew
7     california move plai year place
8          make thing rel talk pretti
9       friend danc coast time random
Name: stems, dtype: object
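
The post-processing in the cell above can be traced on a single mocked result. This sketch assumes that `extract_keywords` returns a list of `(keyword, score)` tuples ordered from most to least relevant, with lower scores marking more relevant keywords; the scores below are invented for illustration and are not YAKE output.

```python
# Mock of what a single extract_keywords call might return:
# a list of (keyword, score) tuples. The scores are invented for illustration.
mock_result = [("fun", 0.05), ("convers", 0.08), ("type", 0.11), ("music", 0.12), ("nom", 0.15)]

# dict() keeps the first-seen order of the keys, so the ranking is preserved.
keywords_only = list(dict(mock_result).keys())
keyword_string = " ".join(keywords_only)

# An equivalent, slightly more direct spelling of the same step.
direct = [keyword for keyword, _score in mock_result]

print(keyword_string)
print(keywords_only == direct)
```

Each profile is thus reduced to a short string of at most five stems, which is the input to both the tf-idf vectorizer and the keyword-based LDA model.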

# Tf-idf vectorization and KMeans

There are two very interesting public notebooks available on Kaggle that served as indirect precursors to this project:

* https://www.kaggle.com/code/vishaldixit2489/love-profile-match-making
* https://www.kaggle.com/code/vishaldixit2489/textanalysis-okcupid-profiles

In both, vectorization and k-means clustering were used to group OkC profiles for improved user recommendations. However, this method was not a complete success in either case, despite the use of dimensionality reduction techniques such as PCA and t-SNE. This leads us to assume that such diverse data cannot be clustered well with this algorithm. Regardless, we would like to run k-means on the selected keywords to see the result first-hand. t-SNE will also be used for dimensionality reduction and visualization.
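
To make the vectorization step concrete, here is a hand-rolled tf-idf sketch on the first three keyword strings from the output above. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); it is a simplified stand-in written for illustration, not the library's implementation.

```python
import math
from collections import Counter

# First three keyword strings from the keyword extraction output.
docs = ["fun convers type music nom",
        "thing plai good peopl time",
        "food restaur work convers fun"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Term frequency times smoothed idf, followed by L2 normalization."""
    tf = Counter(tokens)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]

def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

print(round(cosine(vectors[0], vectors[2]), 3), cosine(vectors[0], vectors[1]))
```

Terms shared between profiles receive a lower weight than terms unique to one profile, and profiles with no keyword in common have a similarity of zero.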

Please take note that the following cells take a comparatively long time to run, due to large k-values.

In [11]:
#vectorization
tf_idf = TfidfVectorizer(lowercase=False)
X = tf_idf.fit_transform(keywords_str)

In [12]:
#selecting the K-value
# inertia_vals = []
# k_vals = [1000, 1500, 1800, 2000]

# for k in k_vals:
#     kmns = KMeans(n_clusters=k)
#     kmns.fit(X)
    
#     inertia_vals.append(kmns.inertia_)
# plt.plot(k_vals, inertia_vals, 'ro-')
# plt.xlabel('k-values')
# plt.ylabel('inertias')
# plt.show()

In [13]:
#k-means model using the sub-optimal k-value
# kmn = KMeans(n_clusters=2000, random_state=1)
# y = kmn.fit_predict(X)
# print(kmn.score(X))
# print(kmn.inertia_)
# print(silhouette_score(X, y))

# as expected, the resulting scores are subpar

In [14]:
#introducing t-SNE

# X_d = X.toarray() #dense array, since t-SNE is run on dense input here
# tsne = TSNE(n_components=2)
# Y = tsne.fit_transform(X_d)
# plt.scatter(Y[:, 0], Y[:, 1])
# plt.show()

As can be seen from the elbow plot, there is no clear elbow, that is, no K-value that offers a good tradeoff against inertia for the given dataset. It is therefore unlikely that even one of the larger K-values will satisfy our grouping requirements, as the scatter plot further illustrates. For this reason, we will concentrate on LDA topic modeling as a soft clustering technique.
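
One possible contributing factor, offered here as an assumption rather than a finding of this notebook, is how little the five-keyword profiles overlap. The sketch below counts, for the ten keyword strings printed earlier, how many profile pairs share at least one keyword.

```python
from itertools import combinations

# The ten keyword strings printed in the keyword extraction output.
keyword_strings = [
    "fun convers type music nom",
    "thing plai good peopl time",
    "food restaur work convers fun",
    "friend differ restaur san easi",
    "plai video game work interest",
    "simpli session person time huge",
    "favorit nice current charact sew",
    "california move plai year place",
    "make thing rel talk pretti",
    "friend danc coast time random",
]
keyword_sets = [set(s.split()) for s in keyword_strings]

# All unordered pairs of profiles, and those sharing at least one keyword.
pairs = list(combinations(range(len(keyword_sets)), 2))
overlapping = [(i, j) for i, j in pairs if keyword_sets[i] & keyword_sets[j]]

print(f"{len(overlapping)} of {len(pairs)} profile pairs share at least one keyword")
```

Most pairs in this small excerpt have no keyword in common, so their tf-idf vectors are orthogonal and a distance-based method has little to group them by.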

# LDA model as applied to essay texts

To compare our keyword-oriented approach with the general text-oriented one, we apply Latent Dirichlet Allocation (LDA) first to the stemmed essays and then to the keywords previously extracted from them.
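
Both LDA models take a bag-of-words corpus built from a filtered dictionary. The sketch below is a simplified pure-Python stand-in for that step, mirroring the documented meaning of the `no_below` and `no_above` arguments of gensim's `Dictionary.filter_extremes`. The helper names, the toy corpus, and the alphabetical id assignment are assumptions made for readability.

```python
from collections import Counter

def build_vocabulary(corpus, no_below=1, no_above=1.0):
    """Keep tokens found in at least `no_below` documents and
    in at most a `no_above` fraction of all documents."""
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    max_docs = no_above * len(corpus)
    kept = sorted(t for t, n in doc_freq.items() if no_below <= n <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocabulary):
    """Return sorted (token_id, count) pairs, skipping filtered-out tokens."""
    counts = Counter(vocabulary[t] for t in doc if t in vocabulary)
    return sorted(counts.items())

# Toy corpus of stemmed tokens (hypothetical, not taken from the dataset).
corpus = [["like", "music", "food"],
          ["like", "music", "travel"],
          ["like", "food", "food", "cook"],
          ["like", "travel", "hike"]]

vocabulary = build_vocabulary(corpus, no_below=2, no_above=0.7)
bows = [to_bow(doc, vocabulary) for doc in corpus]
print(vocabulary)
print(bows)
```

In this toy run, "like" is dropped for appearing in every document, while "cook" and "hike" are dropped for appearing in only one, which is the same kind of pruning the essay dictionary undergoes with `no_below=5` and `no_above=0.7`.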

In [15]:
# building the corpus and dictionaries
essay_corpus = sample['stems'].apply(lambda x: word_tokenize(x))

essay_corpus
essay_dicts = corpora.Dictionary(essay_corpus)
essay_dicts.filter_extremes(no_below=5, no_above=0.7, keep_n=100000)
essay_bow = [essay_dicts.doc2bow(i) for i in essay_corpus]

In [16]:
#introducing the LDA model
essay_lda=gensim.models.LdaMulticore(essay_bow, num_topics=112,id2word=essay_dicts, passes=2, workers=2)

# for index,topic in essay_lda.print_topics(-1):
#     print("Topic: {} \nIdeas: {}".format(index, topic))
#     print("\n")

In [17]:
coherence_model = CoherenceModel(model=essay_lda, texts=essay_corpus, dictionary=essay_dicts, coherence='c_v')
coherence_score = coherence_model.get_coherence()
coherence_score

0.3417209304466829

In [18]:
#interactive visualization
pyLDAvis.gensim.prepare(essay_lda, essay_bow, essay_dicts, mds='mmds')



The coherence score remains very low even with a fairly large number of topics (112). This is likely due to the diversity of the texts, which preprocessing cannot offset. Let us compare this result with the keyword-based analysis.

# LDA model as applied to keywords

In [19]:
#building the corpus and dictionaries
kwd_corpus = keywords_str.apply(lambda x: word_tokenize(x))
kwd_dicts = corpora.Dictionary(kwd_corpus)

kwd_bow = [kwd_dicts.doc2bow(i) for i in kwd_corpus]

In [20]:
#LDA model based on keywords
kwd_lda=gensim.models.LdaMulticore(kwd_bow, num_topics=95, id2word=kwd_dicts, passes=2, random_state=1)

# for index,topic in kwd_lda.print_topics(-1):
#     print("Topic: {} \nIdeas: {}".format(index, topic))
#     print("\n")

In [21]:
coherence_model = CoherenceModel(model=kwd_lda, texts=kwd_corpus, dictionary=kwd_dicts, coherence='c_v')
coherence_score = coherence_model.get_coherence()
coherence_score

0.604935804644273

As seen above, using keywords yields roughly 1.8 times the coherence (0.60 versus 0.34) with fewer topics (95 versus 112). With more than 100 topics, the coherence score can rise above 0.7, which is a good result.
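
The size of the improvement can be checked directly from the two scores printed above. Note that the sample is drawn without a random state and the essay model is trained without one, so these particular figures will differ between runs.

```python
essay_coherence = 0.3417209304466829   # c_v score of the essay-based model (112 topics)
keyword_coherence = 0.604935804644273  # c_v score of the keyword-based model (95 topics)

# Relative and absolute improvement of the keyword-based model.
ratio = keyword_coherence / essay_coherence
gain = keyword_coherence - essay_coherence

print(f"ratio: {ratio:.2f}, absolute gain: {gain:.3f}")
```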

In [22]:
#visualizing the topics
pyLDAvis.gensim.prepare(kwd_lda, kwd_bow, kwd_dicts, mds='mmds')



# Conclusions

Although LDA is a "soft" clustering algorithm, its value in determining text similarity should not be underestimated. With its help, we were able to identify a number of topics that OkCupid profiles have in common and that can be used to group them. Streamlining the process by first stemming the texts and then selecting user-relevant keywords can contribute significantly to the quality of the topic modeling.