In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OrdinalEncoder

# Preprocessing the data and training the model

In [2]:
raw_dataset = pd.read_csv('../data/okcupid.csv') 
okcupid_profiles = raw_dataset.drop(columns="Unnamed: 0") 

As the output below shows, almost every column has the object dtype, which we cannot use directly to fit the Random Forest.
We need to convert these values into numbers, which we can do with OrdinalEncoder from sklearn.
We also need to handle the missing data.

In [3]:
okcupid_profiles.dtypes

age              int64
status          object
sex             object
orientation     object
body_type       object
diet            object
drinks          object
drugs           object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
offspring       object
pets            object
religion        object
sign            object
smokes          object
speaks          object
dtype: object
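
To illustrate the conversion mentioned above, here is a minimal sketch of OrdinalEncoder applied to a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only; they are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for okcupid_profiles; the values are illustrative only.
toy = pd.DataFrame({
    "sex": ["m", "f", "f", "m"],
    "body_type": ["average", "fit", "thin", "average"],
})

encoder = OrdinalEncoder()
# fit_transform returns a float array with one code per category;
# by default the categories of each column are sorted before being numbered.
encoded = encoder.fit_transform(toy[["sex", "body_type"]])
toy_encoded = pd.DataFrame(encoded, columns=["sex", "body_type"], index=toy.index)

print(encoder.categories_)
print(toy_encoded)
```

The codes follow the sorted order of the category names, so the numeric order carries no meaning of its own.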

## Reworking the dataset
### Dropping columns

We noticed that for many columns the value distribution is too skewed to be useful: a single value has roughly 50,000 occurrences, and the other values have too few.
We decided to address this by making some changes to the dataset.
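
One way to spot such skewed columns is to measure the share of rows taken by the most frequent value. The sketch below does this on a made-up frame; the `toy` data and the `top_share` name are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy frame: 'status' is dominated by one value, 'sign' is evenly spread.
toy = pd.DataFrame({
    "status": ["single"] * 9 + ["married"],
    "sign": ["leo", "gemini", "libra", "cancer", "virgo"] * 2,
})

# Share of rows taken by the most frequent value in each column.
# value_counts sorts in descending order, so .iloc[0] is the top share.
top_share = {
    col: toy[col].value_counts(normalize=True, dropna=False).iloc[0]
    for col in toy.columns
}
print(top_share)
```

A column whose top share is close to 1 carries very little information for the classifier.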

We are going to drop the following columns:

In [4]:
okcupid_profiles = okcupid_profiles.drop(columns = ['status', 
                                                     'orientation',
                                                     'diet',
                                                     'drinks',
                                                     'drugs',
                                                     'education',
                                                     'ethnicity',
                                                     'income',
                                                     'last_online',
                                                     'offspring',
                                                     'pets',
                                                     'smokes',
                                                     'speaks'])

Now we are left with:

In [5]:
for element in okcupid_profiles.columns:
    print(element)

age
sex
body_type
height
job
location
religion
sign


### Handling 'age'

In [6]:
okcupid_profiles['age_grouped'] = okcupid_profiles['age'].apply(lambda x: 5*np.floor(x/5))
print( okcupid_profiles['age_grouped'].value_counts() )
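# Note: drop() returns a new DataFrame and the result is not assigned here,
# so this only displays the frame without 'age'; the column stays in okcupid_profiles
okcupid_profiles.drop(columns = 'age')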

25.0     17818
30.0     12579
20.0     10003
35.0      7267
40.0      4648
45.0      2690
50.0      1650
55.0      1210
15.0       920
60.0       789
65.0       370
110.0        1
105.0        1
Name: age_grouped, dtype: int64


Unnamed: 0,sex,body_type,height,job,location,religion,sign,age_grouped
0,m,a little extra,75.0,transportation,"south san francisco, california",agnosticism and very serious about it,gemini,20.0
1,m,average,70.0,hospitality / travel,"oakland, california",agnosticism but not too serious about it,cancer,35.0
2,m,thin,68.0,,"san francisco, california",,pisces but it doesn&rsquo;t matter,35.0
3,m,thin,71.0,student,"berkeley, california",,pisces,20.0
4,m,athletic,66.0,artistic / musical / writer,"san francisco, california",,aquarius,25.0
...,...,...,...,...,...,...,...,...
59941,f,,62.0,sales / marketing / biz dev,"oakland, california",catholicism but not too serious about it,cancer and it&rsquo;s fun to think about,55.0
59942,m,fit,72.0,entertainment / media,"san francisco, california",agnosticism,leo but it doesn&rsquo;t matter,20.0
59943,m,average,71.0,construction / craftsmanship,"south san francisco, california",christianity but not too serious about it,sagittarius but it doesn&rsquo;t matter,40.0
59944,m,athletic,73.0,medicine / health,"san francisco, california",agnosticism but not too serious about it,leo and it&rsquo;s fun to think about,25.0
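
The grouped ages above are floats because np.floor returns floats. As a possible alternative, the sketch below bins ages into 5-year groups with integer floor division, which keeps the integer dtype. The `ages` series is made up for illustration.

```python
import pandas as pd

# Illustrative ages, including an implausible outlier like those seen in the counts above.
ages = pd.Series([22, 35, 38, 23, 29, 110], name="age")

# Floor each age to the lower bound of its 5-year bin (22 -> 20, 38 -> 35).
age_grouped = (ages // 5) * 5
print(age_grouped.tolist())
```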


### Handling 'religion'

In [7]:
pd.isna(okcupid_profiles["religion"]).sum()

20226

In [8]:
okcupid_profiles['religion'].value_counts()

agnosticism                                   2724
other                                         2691
agnosticism but not too serious about it      2636
agnosticism and laughing about it             2496
catholicism but not too serious about it      2318
atheism                                       2175
other and laughing about it                   2119
atheism and laughing about it                 2074
christianity                                  1957
christianity but not too serious about it     1952
other but not too serious about it            1554
judaism but not too serious about it          1517
atheism but not too serious about it          1318
catholicism                                   1064
christianity and somewhat serious about it     927
atheism and somewhat serious about it          848
other and somewhat serious about it            846
catholicism and laughing about it              726
judaism and laughing about it                  681
buddhism but not too serious ab

As the value counts show, the data is too sparse: each religion is split across several qualifiers. We are going to merge all the variants of each religion into a single value.

In [9]:
# in the religion column, find values containing the word "christian" and replace them with "christian"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('christianity but not too serious about it', 'christian')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('christianity and very serious about it', 'christian')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('christianity and somewhat serious about it', 'christian')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('christianity and laughing about it', 'christian')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('christianity', 'christian')

In [10]:
# in the religion column, find values containing the word "agnosticism" and replace them with "agnostic"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('agnosticism but not too serious about it', 'agnostic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('agnosticism and very serious about it', 'agnostic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('agnosticism and somewhat serious about it', 'agnostic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('agnosticism and laughing about it', 'agnostic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('agnosticism', 'agnostic')

In [11]:
# in the religion column, find values containing the word "atheism" and replace them with "atheist"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('atheism but not too serious about it', 'atheist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('atheism and very serious about it', 'atheist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('atheism and somewhat serious about it', 'atheist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('atheism and laughing about it', 'atheist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('atheism', 'atheist')

In [12]:
# in the religion column, find values containing the word "catholicism" and replace them with "catholic"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('catholicism but not too serious about it', 'catholic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('catholicism and very serious about it', 'catholic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('catholicism and somewhat serious about it', 'catholic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('catholicism and laughing about it', 'catholic')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('catholicism', 'catholic')

In [13]:
# in the religion column, find values containing the word "judaism" and replace them with "jewish"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('judaism but not too serious about it', 'jewish')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('judaism and very serious about it', 'jewish')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('judaism and somewhat serious about it', 'jewish')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('judaism and laughing about it', 'jewish')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('judaism', 'jewish')

In [14]:
# in the religion column, find values containing the word "buddhism" and replace them with "buddhist"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('buddhism but not too serious about it', 'buddhist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('buddhism and very serious about it', 'buddhist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('buddhism and somewhat serious about it', 'buddhist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('buddhism and laughing about it', 'buddhist')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('buddhism', 'buddhist')

In [15]:
# in the religion column, find values containing the word "hinduism" and replace them with "hindu"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('hinduism but not too serious about it', 'hindu')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('hinduism and very serious about it', 'hindu')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('hinduism and somewhat serious about it', 'hindu')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('hinduism and laughing about it', 'hindu')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('hinduism', 'hindu')

In [16]:
# in the religion column, find values containing the word "islam" and replace them with "muslim"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('islam but not too serious about it', 'muslim')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('islam and very serious about it', 'muslim')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('islam and somewhat serious about it', 'muslim')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('islam and laughing about it', 'muslim')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('islam', 'muslim')

In [17]:
# in the religion column, find values containing the word "other" and replace them with "other"
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('other but not too serious about it', 'other')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('other and very serious about it', 'other')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('other and somewhat serious about it', 'other')
okcupid_profiles['religion'] = okcupid_profiles['religion'].str.replace('other and laughing about it', 'other')
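
The repeated str.replace calls above all keep the first word of the value and rename it. A more compact sketch of the same idea follows, assuming every value starts with the religion name; the `labels` mapping and the sample series are illustrative, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Sample values in the same shape as the 'religion' column.
religion = pd.Series([
    "agnosticism and laughing about it",
    "christianity",
    "islam but not too serious about it",
    "other and somewhat serious about it",
    np.nan,
])

# Hypothetical mapping from the first word to the merged label.
labels = {
    "agnosticism": "agnostic", "atheism": "atheist",
    "christianity": "christian", "catholicism": "catholic",
    "judaism": "jewish", "buddhism": "buddhist",
    "hinduism": "hindu", "islam": "muslim", "other": "other",
}

# Keep only the first word, then rename it; missing values stay NaN.
merged = religion.str.split().str[0].map(labels)
print(merged.tolist())
```

Unlike str.replace, map turns any first word that is not in `labels` into NaN, so the mapping has to cover every religion present in the data.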

In [18]:
print(okcupid_profiles['religion'].value_counts())
print("Missing values: ", okcupid_profiles['religion'].isnull().sum())

agnostic     8812
other        7743
atheist      6985
christian    5787
catholic     4758
jewish       3098
buddhist     1948
hindu         450
muslim        139
Name: religion, dtype: int64
Missing values:  20226


At this point, we noticed that we have too many missing values, especially since we also treat 'other' as essentially missing. So we are going to convert the religion attribute to a boolean.

In [19]:
"""for the religion column, convert the missing data into a boolean false, 
convert the 'other' values into boolean false, and the rest boolean true
"""
okcupid_profiles['religion'] = okcupid_profiles['religion'].fillna(False)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('other', False)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('agnostic', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('atheist', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('christian', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('catholic', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('jewish', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('buddhist', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('hindu', True)
okcupid_profiles['religion'] = okcupid_profiles['religion'].replace('muslim', True)
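
The same boolean can be derived in one expression: a value is True when it is present and is not 'other'. This is a sketch on a made-up series, assuming the merged labels produced earlier; it is not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Sample of merged religion labels, with a missing value.
religion = pd.Series(["agnostic", "other", np.nan, "muslim", "atheist"])

# True for any stated value except 'other'; NaN and 'other' become False.
religious = religion.notna() & (religion != "other")
print(religious.tolist())
```

This yields a proper bool column rather than an object column holding True and False values.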


In [20]:
okcupid_profiles.rename(columns = {'religion': 'religious'}, inplace = True)

In [21]:
okcupid_profiles['religious'].value_counts()

True     31977
False    27969
Name: religious, dtype: int64

### Handling 'sign'

In [22]:
pd.isna(okcupid_profiles["sign"]).sum()

11056

In [23]:
okcupid_profiles['sign'].value_counts()

gemini and it&rsquo;s fun to think about         1782
scorpio and it&rsquo;s fun to think about        1772
leo and it&rsquo;s fun to think about            1692
libra and it&rsquo;s fun to think about          1649
taurus and it&rsquo;s fun to think about         1640
cancer and it&rsquo;s fun to think about         1597
pisces and it&rsquo;s fun to think about         1592
sagittarius and it&rsquo;s fun to think about    1583
virgo and it&rsquo;s fun to think about          1574
aries and it&rsquo;s fun to think about          1573
aquarius and it&rsquo;s fun to think about       1503
virgo but it doesn&rsquo;t matter                1497
leo but it doesn&rsquo;t matter                  1457
cancer but it doesn&rsquo;t matter               1454
gemini but it doesn&rsquo;t matter               1453
taurus but it doesn&rsquo;t matter               1450
libra but it doesn&rsquo;t matter                1408
aquarius but it doesn&rsquo;t matter             1408
capricorn and it&rsquo;s fun

In [24]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('gemini and it&rsquo;s fun to think about', 'gemini')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('gemini but it doesn&rsquo;t matter', 'gemini')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('gemini and it matters a lot', 'gemini')

In [25]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('scorpio and it&rsquo;s fun to think about', 'scorpio')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('scorpio but it doesn&rsquo;t matter', 'scorpio')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('scorpio and it matters a lot', 'scorpio')

In [26]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('leo and it&rsquo;s fun to think about', 'leo')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('leo but it doesn&rsquo;t matter', 'leo')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('leo and it matters a lot', 'leo')

In [27]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('libra and it&rsquo;s fun to think about', 'libra')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('libra but it doesn&rsquo;t matter', 'libra')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('libra and it matters a lot', 'libra')

In [28]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('taurus and it&rsquo;s fun to think about', 'taurus')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('taurus but it doesn&rsquo;t matter', 'taurus')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('taurus and it matters a lot', 'taurus')

In [29]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('cancer and it&rsquo;s fun to think about', 'cancer')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('cancer but it doesn&rsquo;t matter', 'cancer')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('cancer and it matters a lot', 'cancer')

In [30]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('pisces and it&rsquo;s fun to think about', 'pisces')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('pisces but it doesn&rsquo;t matter', 'pisces')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('pisces and it matters a lot', 'pisces')

In [31]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('sagittarius and it&rsquo;s fun to think about', 'sagittarius')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('sagittarius but it doesn&rsquo;t matter', 'sagittarius')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('sagittarius and it matters a lot', 'sagittarius')

In [32]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('virgo and it&rsquo;s fun to think about', 'virgo')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('virgo but it doesn&rsquo;t matter', 'virgo')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('virgo and it matters a lot', 'virgo')

In [33]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('aries and it&rsquo;s fun to think about', 'aries')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('aries but it doesn&rsquo;t matter', 'aries')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('aries and it matters a lot', 'aries')

In [34]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('aquarius and it&rsquo;s fun to think about', 'aquarius')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('aquarius but it doesn&rsquo;t matter', 'aquarius')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('aquarius and it matters a lot', 'aquarius')

In [35]:
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('capricorn and it&rsquo;s fun to think about', 'capricorn')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('capricorn but it doesn&rsquo;t matter', 'capricorn')
okcupid_profiles['sign'] = okcupid_profiles['sign'].str.replace('capricorn and it matters a lot', 'capricorn')
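
Since every value of 'sign' starts with the sign name, the replacements above could also be written as a single step that keeps the first word. The sketch below shows this on a made-up series; it assumes no value begins with anything other than the sign.

```python
import numpy as np
import pandas as pd

# Sample values in the same shape as the 'sign' column.
sign = pd.Series([
    "gemini and it&rsquo;s fun to think about",
    "leo but it doesn&rsquo;t matter",
    "virgo",
    "capricorn and it matters a lot",
    np.nan,
])

# Split on whitespace and keep the first token; missing values stay NaN.
sign_clean = sign.str.split().str[0]
print(sign_clean.tolist())
```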

In [36]:
okcupid_profiles['sign'].value_counts()

leo            4374
gemini         4310
libra          4207
cancer         4206
virgo          4141
taurus         4140
scorpio        4134
aries          3989
pisces         3946
sagittarius    3942
aquarius       3928
capricorn      3573
Name: sign, dtype: int64

### Handling 'location'  

In [37]:
okcupid_profiles['location'].value_counts()

san francisco, california         31064
oakland, california                7214
berkeley, california               4212
san mateo, california              1331
palo alto, california              1064
                                  ...  
south wellfleet, massachusetts        1
orange, california                    1
astoria, new york                     1
london, united kingdom                1
rochester, michigan                   1
Name: location, Length: 199, dtype: int64

In [38]:
okcupid_profiles['location'] = okcupid_profiles['location'].fillna(False)
okcupid_profiles['location'] = (okcupid_profiles['location'] == 'san francisco, california')
okcupid_profiles.rename(columns = {'location': 'lives_in_san_francisco'}, inplace=True)

In [39]:
okcupid_profiles['lives_in_san_francisco'].value_counts()

True     31064
False    28882
Name: lives_in_san_francisco, dtype: int64

## Filling the missing data

The columns containing missing data are the following:

In [40]:
print(okcupid_profiles.isna().sum())

age                           0
sex                           0
body_type                  5296
height                        3
job                        8198
lives_in_san_francisco        0
religious                     0
sign                      11056
age_grouped                   0
dtype: int64


Since only three rows are missing a height, we think it is better to drop them than to replace the NaN with a placeholder such as 0 or -1, or with the average height.

In [41]:
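# Pass the subset as a list so the call also works on pandas versions that do not accept a single label
okcupid_profiles = okcupid_profiles.dropna(how = 'any', subset = ['height'])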

For the following attributes, we are going to replace the missing values with a placeholder string.

In [42]:
okcupid_profiles['job'] = okcupid_profiles['job'].fillna(value = 'not specified')
okcupid_profiles['body_type'] = okcupid_profiles['body_type'].fillna(value = 'rather not say')
okcupid_profiles['sign'] = okcupid_profiles['sign'].fillna(value = 'unknown')
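
The two steps above (dropping rows without a height, then filling the categorical gaps) can be sketched together on a made-up frame. Here the placeholders are passed to fillna as a single dict keyed by column name; the `toy` data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column.
toy = pd.DataFrame({
    "job": ["student", np.nan, "sales / marketing / biz dev"],
    "body_type": [np.nan, "fit", "thin"],
    "sign": ["leo", np.nan, "virgo"],
    "height": [70.0, 65.0, np.nan],
})

# Drop the rows that lack a height, then fill each column with its placeholder.
filled = toy.dropna(subset=["height"]).fillna({
    "job": "not specified",
    "body_type": "rather not say",
    "sign": "unknown",
})
print(filled)
print(filled.isna().sum())
```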

And now no column contains missing values.

In [43]:
print(okcupid_profiles.isna().sum())

age                       0
sex                       0
body_type                 0
height                    0
job                       0
lives_in_san_francisco    0
religious                 0
sign                      0
age_grouped               0
dtype: int64


## Saving the processed data

In [None]:
okcupid_profiles.to_csv('../data/okcupid_processed.csv', index=False)