The goal of this project is to use data science and neural networks to build a clustering algorithm that can find the best matches for a user on OkCupid.

The dataset also tells us a lot about who people are, so we can use this personal data to learn more about people's behavior and which variables influence each other.

## Loading Packages

In [91]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

## Exploring Data

In [92]:
profiles = pd.read_csv('Data/profiles.csv')
profiles.head(5)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [93]:
print(profiles.columns.unique())

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')


In [94]:
profiles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-null  float64
 18  income       59946 non-null  int64  
 19  job 

In [95]:
profiles.dtypes

age              int64
body_type       object
diet            object
drinks          object
drugs           object
education       object
essay0          object
essay1          object
essay2          object
essay3          object
essay4          object
essay5          object
essay6          object
essay7          object
essay8          object
essay9          object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
offspring       object
orientation     object
pets            object
religion        object
sex             object
sign            object
smokes          object
speaks          object
status          object
dtype: object

In [96]:
profiles.describe()

Unnamed: 0,age,height,income
count,59946.0,59943.0,59946.0
mean,32.34029,68.295281,20033.222534
std,9.452779,3.994803,97346.192104
min,18.0,1.0,-1.0
25%,26.0,66.0,-1.0
50%,30.0,68.0,-1.0
75%,37.0,71.0,-1.0
max,110.0,95.0,1000000.0


We see that the dataset contains very little numerical data: only 3 columns (age, height and income) are numerical, and the rest are categorical. We will therefore have to do some feature engineering to get the most out of the data. We will also drop the 10 essay columns later, since we won't be looking at NLP in this project.
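The numerical/categorical split can also be read off programmatically. A minimal sketch, assuming a small made-up frame (`toy`) with the same mix of dtypes as the profiles data:

```python
import pandas as pd

# Toy frame with the same mix of dtypes as the profiles data (values are made up).
toy = pd.DataFrame({
    "age": [22, 35],
    "height": [75.0, 70.0],
    "income": [-1, 80000],
    "diet": ["anything", None],
    "sex": ["m", "f"],
})

# select_dtypes keeps the original column order.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

On the real data the same two calls would be made on profiles instead of the toy frame.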

In [97]:
for column in profiles.columns:
    print(f"The {column} column has {profiles[column].nunique()} unique values.")

The age column has 54 unique values.
The body_type column has 12 unique values.
The diet column has 18 unique values.
The drinks column has 6 unique values.
The drugs column has 3 unique values.
The education column has 32 unique values.
The essay0 column has 54350 unique values.
The essay1 column has 51516 unique values.
The essay2 column has 48635 unique values.
The essay3 column has 43533 unique values.
The essay4 column has 49260 unique values.
The essay5 column has 48963 unique values.
The essay6 column has 43603 unique values.
The essay7 column has 45554 unique values.
The essay8 column has 39324 unique values.
The essay9 column has 45443 unique values.
The ethnicity column has 217 unique values.
The height column has 60 unique values.
The income column has 13 unique values.
The job column has 21 unique values.
The last_online column has 30123 unique values.
The location column has 199 unique values.
The offspring column has 15 unique values.
The orientation column has 3 unique v

We see that many columns that should only take a fixed set of values have far more. The sign column is one example: it also contains values such as "gemini but it doesn't matter", although it should have been a multiple-choice field with one option per sign and nothing else. We will clean this up in the feature engineering section.
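As a possible shortcut, DataFrame.nunique() gives the per-column counts in one call, and value_counts() on a suspicious column lists the variants behind an inflated count. A small sketch on made-up values (the `toy` frame is hypothetical):

```python
import pandas as pd

# Made-up answers: two signs, but four distinct raw strings.
toy = pd.DataFrame({
    "sign": [
        "gemini",
        "gemini but it doesn't matter",
        "pisces",
        "pisces but it doesn't matter",
        None,
    ],
    "drugs": ["never", "never", "sometimes", None, "never"],
})

# Distinct non-missing values per column, largest first.
counts = toy.nunique().sort_values(ascending=False)
# Every raw variant in the sign column with its frequency.
variants = toy["sign"].value_counts()

print(counts)
print(variants)
```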

## Feature Engineering

### Irrelevant Columns

In [98]:
print(profiles.columns.unique())
print(profiles.columns.nunique())

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')
31


As mentioned, we won't deal with NLP, so we will drop the essay columns. We will also drop the offspring and last_online columns, which we won't use in this analysis. Note that the status column is kept in the DataFrame.

In [99]:
profiles = profiles.drop(
    columns=['offspring', 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9', 'last_online']
)
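The essay columns share a common prefix, so the list passed to drop can be built rather than typed out. A minimal sketch on an empty toy frame (hypothetical; only the column names matter here):

```python
import pandas as pd

# Empty frame with a subset of the profile column names.
toy = pd.DataFrame(
    columns=["age", "essay0", "essay1", "essay9", "last_online", "offspring", "status"]
)

# Collect every column whose name starts with 'essay'.
essay_cols = [c for c in toy.columns if c.startswith("essay")]
toy = toy.drop(columns=essay_cols + ["offspring", "last_online"])

print(list(toy.columns))
```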

In [100]:
profiles.columns.unique()

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity',
       'height', 'income', 'job', 'location', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

### Converting to binary columns

#### Sex column

In [101]:
profiles['sex'] = profiles.sex.replace({
    'm': 'male',
    'f': 'female'
})


dummies = pd.get_dummies(profiles['sex'])
profiles = pd.concat([profiles, dummies], axis=1)
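Two variations on the encoding, sketched on a made-up three-row frame. The first asks get_dummies for integer columns; the second keeps a single indicator, which carries the same information when the column holds only two values (an assumption about the data, not something checked here):

```python
import pandas as pd

toy = pd.DataFrame({"sex": ["m", "f", "m"]})

# Variant 1: one 0/1 column per category, stored as integers.
labels = toy["sex"].replace({"m": "male", "f": "female"})
dummies = pd.get_dummies(labels, dtype=int)

# Variant 2: a single indicator column (1 = female, 0 = male).
is_female = (toy["sex"] == "f").astype(int)

print(dummies)
print(is_female.tolist())
```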

In [102]:
profiles.columns.unique()

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity',
       'height', 'income', 'job', 'location', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status', 'female',
       'male'],
      dtype='object')

### Engineering Features

Some of our features should only take a fixed set of values. For example, the 'Sign' feature should have only 12 unique values, but it has 48, so we need to clean it up.

#### Sign Column

My approach is to split each 'Sign' value with Python's str.split method and keep only the first element of the resulting list, which gives the astrological sign of each person. But first, let's make sure this works in all cases.

In [103]:
# Make a list of all the values at index 0 of each value in sign column
signs = []
for i in profiles['sign']:
    if type(i) != str:
        # skip
        continue
    split = i.split(' ')
    sign = split[0]
    if sign not in signs:
        signs.append(sign)

print(f'Len of signs list: {len(signs)}')
print(f'List of signs: {signs}')

Len of signs list: 12
List of signs: ['gemini', 'cancer', 'pisces', 'aquarius', 'taurus', 'virgo', 'sagittarius', 'leo', 'aries', 'libra', 'scorpio', 'capricorn']


Here we see that this method works in practice, so now we just have to apply it to our DataFrame.

In [104]:
split_and_take_first = lambda x: x.split(' ')[0] if type(x) == str else x
profiles['sign'] = profiles['sign'].apply(split_and_take_first)
print(profiles['sign'].value_counts())

leo            4374
gemini         4310
libra          4207
cancer         4206
virgo          4141
taurus         4140
scorpio        4134
aries          3989
pisces         3946
sagittarius    3942
aquarius       3928
capricorn      3573
Name: sign, dtype: int64
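The same first-word extraction can also be written with pandas' vectorized string methods, which pass missing values through unchanged. A short sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up sign answers, including one missing value.
sign = pd.Series([
    "gemini",
    "pisces but it doesn't matter",
    np.nan,
    "leo but it doesn't matter",
])

# Split on spaces and keep element 0 of each list; NaN stays NaN.
first_word = sign.str.split(" ").str[0]

print(first_word.tolist())
```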


Now that the sign column is clean, the next column we will look at is the 'Pets' column, which has 15 unique values. Let's look at what is going on in that column.

#### Pets Column

In [105]:
print(f'Pets column unique values: {profiles.pets.unique()}')

Pets column unique values: ['likes dogs and likes cats' 'has cats' 'likes cats' nan
 'has dogs and likes cats' 'likes dogs and has cats'
 'likes dogs and dislikes cats' 'has dogs' 'has dogs and dislikes cats'
 'likes dogs' 'has dogs and has cats' 'dislikes dogs and has cats'
 'dislikes dogs and dislikes cats' 'dislikes cats'
 'dislikes dogs and likes cats' 'dislikes dogs']


We see that the column doesn't only say which pets people have; it also tells us which pets they like or dislike. This is another problem in the data collection: the form should have asked whether the user has any pets and, if so, which ones.

To get the most out of this feature, we will derive two columns from the pets column: one for which pets people have and one for which pets they like. This will let us see whether people like the same pets they have or different ones. This is done in the next cell.

In [106]:
def transform_pets_column(data):
    if type(data) != str:
        return 'no pets'
    has_dogs = 'has dogs' in data.lower()
    has_cats = 'has cats' in data.lower()

    if has_dogs and has_cats:
        return 'has dogs and cats'
    elif has_dogs:
        return 'has dogs'
    elif has_cats:
        return 'has cats'
    else:
        return 'no pets'
    
def create_pet_preference_column(data):
    if type(data) != str:
        return 'no pet preference'
    likes_dogs = 'likes dogs' in data.lower()
    likes_cats = 'likes cats' in data.lower()
    dislikes_dogs = 'dislikes dogs' in data.lower()
    dislikes_cats = 'dislikes cats' in data.lower()

    if likes_dogs and likes_cats:
        return 'likes dogs and cats'
    elif likes_dogs:
        return 'likes dogs'
    elif likes_cats:
        return 'likes cats'
    elif dislikes_dogs and dislikes_cats:
        return 'dislikes dogs and cats'
    elif dislikes_dogs:
        return 'dislikes dogs'
    elif dislikes_cats:
        return 'dislikes cats'
    else:
        return 'no pet preference'

profiles['pet_opinion'] = profiles['pets'].apply(create_pet_preference_column)
profiles['pet_owner'] = profiles['pets'].apply(transform_pets_column)
    


In [107]:
print(profiles['pet_opinion'].value_counts())

no pet preference      26935
likes dogs and cats    17279
likes dogs             11662
likes cats              4070
Name: pet_opinion, dtype: int64
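None of the dislikes categories appear in the counts above, although the raw column contains values such as 'dislikes dogs'. The reason is that 'likes dogs' is a substring of 'dislikes dogs', so the substring tests in create_pet_preference_column match both. One possible fix, sketched as a hypothetical pet_opinion function that uses regex word boundaries; the counts above come from the original function and would change if this version were applied:

```python
import re


def pet_opinion(value):
    """Classify a raw pets answer by its stated likes and dislikes (sketch)."""
    if not isinstance(value, str):
        return "no pet preference"
    text = value.lower()
    # \b stops 'likes dogs' from matching inside 'dislikes dogs'.
    likes_dogs = re.search(r"\blikes dogs\b", text) is not None
    likes_cats = re.search(r"\blikes cats\b", text) is not None
    dislikes_dogs = "dislikes dogs" in text
    dislikes_cats = "dislikes cats" in text

    if likes_dogs and likes_cats:
        return "likes dogs and cats"
    if likes_dogs:
        return "likes dogs"
    if likes_cats:
        return "likes cats"
    if dislikes_dogs and dislikes_cats:
        return "dislikes dogs and cats"
    if dislikes_dogs:
        return "dislikes dogs"
    if dislikes_cats:
        return "dislikes cats"
    return "no pet preference"


for raw in ["likes dogs and likes cats", "dislikes dogs", "has dogs"]:
    print(raw, "->", pet_opinion(raw))
```

The branch order is the same as in the original function, so a mixed answer such as 'likes dogs and dislikes cats' is still reported by its liked animal only.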


In [108]:
print(profiles['pet_owner'].value_counts())

no pets              45653
has dogs              7019
has cats              5800
has dogs and cats     1474
Name: pet_owner, dtype: int64
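To compare what people own with what they say they like, one option is a cross-tabulation of the two derived columns. A sketch on a made-up five-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "pet_owner": ["has dogs", "has cats", "no pets", "has dogs", "no pets"],
    "pet_opinion": [
        "likes cats",
        "likes dogs",
        "likes dogs and cats",
        "no pet preference",
        "likes dogs",
    ],
})

# Rows: what people own; columns: what they say they like.
table = pd.crosstab(toy["pet_owner"], toy["pet_opinion"])

print(table)
```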


#### Religion Column

In [109]:
print(f'No of unique values in religion column: {profiles.religion.nunique()}')
print(f'Unique values in religion column: {profiles.religion.unique()}')


No of unique values in religion column: 45
Unique values in religion column: ['agnosticism and very serious about it'
 'agnosticism but not too serious about it' nan 'atheism' 'christianity'
 'christianity but not too serious about it'
 'atheism and laughing about it' 'christianity and very serious about it'
 'other' 'catholicism' 'catholicism but not too serious about it'
 'catholicism and somewhat serious about it'
 'agnosticism and somewhat serious about it'
 'catholicism and laughing about it' 'agnosticism and laughing about it'
 'agnosticism' 'atheism and somewhat serious about it'
 'buddhism but not too serious about it'
 'other but not too serious about it' 'buddhism'
 'other and laughing about it' 'judaism but not too serious about it'
 'buddhism and laughing about it' 'other and somewhat serious about it'
 'other and very serious about it' 'hinduism but not too serious about it'
 'atheism but not too serious about it' 'judaism'
 'christianity and somewhat serious about it'
 'h

We see that if we split the strings in the religion column on spaces, index 0 holds the religion, index 1 holds 'but' or 'and', and everything from index 2 onward describes how serious they are about it.

In [110]:
for i in profiles.religion.unique():
    if type(i) != str:
        continue
    split = i.split(' ')
    religion = split[0]
    if len(split) > 2:
        seriousness = split[2::]
        seriousness = ' '.join(seriousness)
        print(f'Religion: {religion}, seriousness: {seriousness}')

Religion: agnosticism, seriousness: very serious about it
Religion: agnosticism, seriousness: not too serious about it
Religion: christianity, seriousness: not too serious about it
Religion: atheism, seriousness: laughing about it
Religion: christianity, seriousness: very serious about it
Religion: catholicism, seriousness: not too serious about it
Religion: catholicism, seriousness: somewhat serious about it
Religion: agnosticism, seriousness: somewhat serious about it
Religion: catholicism, seriousness: laughing about it
Religion: agnosticism, seriousness: laughing about it
Religion: atheism, seriousness: somewhat serious about it
Religion: buddhism, seriousness: not too serious about it
Religion: other, seriousness: not too serious about it
Religion: other, seriousness: laughing about it
Religion: judaism, seriousness: not too serious about it
Religion: buddhism, seriousness: laughing about it
Religion: other, seriousness: somewhat serious about it
Religion: other, seriousness: very

We see that there are only a few different religions, which makes it much easier to create the columns we want.

In [111]:
def create_seriousness_column(data):
    if type(data) != str:
        return pd.NA
    split = data.split(' ')
    if len(split) > 2:
        seriousness = split[2::]
        seriousness = ' '.join(seriousness)
        return seriousness
    else:
        return 'not answered'

profiles['religion_seriousness'] = profiles['religion'].apply(create_seriousness_column)
print(profiles['religion_seriousness'].value_counts())
print(f"No of unique values in religion_seriousness column: {profiles['religion_seriousness'].nunique()}")
    

not too serious about it     12212
not answered                 11781
laughing about it             8995
somewhat serious about it     4516
very serious about it         2216
Name: religion_seriousness, dtype: int64
No of unique values in religion_seriousness column: 5


In [113]:
def split_religion_column(data):
    if type(data) != str:
        return pd.NA
    split = data.split(' ')
    religion = split[0]
    return religion

profiles['religion'] = profiles['religion'].apply(split_religion_column)
print(profiles['religion'].value_counts())
print(f"No of unique values in religion column: {profiles['religion'].nunique()}")

agnosticism     8812
other           7743
atheism         6985
christianity    5787
catholicism     4758
judaism         3098
buddhism        1948
hinduism         450
islam            139
Name: religion, dtype: int64
No of unique values in religion column: 9
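As an alternative to the two helper functions, both pieces can be pulled out in one pass with Series.str.extract and named groups. This is a sketch on made-up values; it assumes the raw religion strings, so on the real data it would have to run before the religion column is overwritten:

```python
import numpy as np
import pandas as pd

raw = pd.Series([
    "agnosticism and very serious about it",
    "christianity but not too serious about it",
    "atheism",
    np.nan,
])

# First word = religion; optional 'and'/'but' followed by the seriousness text.
pattern = r"^(?P<religion>\w+)(?: (?:and|but) (?P<seriousness>.+))?$"
parts = raw.str.extract(pattern)

# A religion without a qualifier is labeled 'not answered'; missing rows stay missing.
no_qualifier = parts["religion"].notna() & parts["seriousness"].isna()
parts.loc[no_qualifier, "seriousness"] = "not answered"

print(parts)
```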
