# OKCupid Project

In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize user experience. These apps give us access to a wealth of information that we've never had before about how different people experience romance. In this project, data from OKCupid, an app that focuses on using multiple choice and short answers to match users, will be analysed. 

## The Dataset

The OKCupid Dataset contains the following features:

* `body_type`: multiple choice question with 12 possible answers: average, fit, athletic, thin, curvy, a litle extra, skinny, full figured, overweight, jacked, used up, and rather not say.
* `diet`: Icluding variations of the options: anything, vegetarian, vegan, kosher, and halal
* `drinks` : Abaut the person's drinking habits. Options are: desperately, very often, often, socially, rarely, and not at all.
* `drugs` : The person's drug use, with the options never, sometimes, or often. 
* `Education`: The education level of the person, with 32 different options.
* `essay 0` to `essay9`: Open short answers.
* `ethnicity`. The ethnicity of the person.
* `height`
* `Income`: How much the person makes a year.
* `job`: 21 possible answers.
* `location`: where the person lives.
* `offspring`: whether they have any offspring and whether they might want in the future (15 possible answers).
* `orientation`: Sexual orientation. Options are: straight, gay, or bisexual.
* `pets`: whether they like or dislike either dogs, cats or both and whether they have them as pets. 15 possible answers.
* `religion`: The religion the person follows an how important it is for them. 
* `sex`: Male or Female.
* `sign`: The horoscope sign and the importance the person places on horoscope. 48 possible answers.
* `smokes`" the smoking habbit of the person. Options are: yes, sometimes, when drinking, trying to quit, and no.
* `speaks`: the language/s the person speaks and the level (either poorly, okay, or fluently).
* `status`: the marital status, including the options: single, available, seeing someone, married, unknown.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
data = pd.read_csv('profiles.csv')
data.head()

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [3]:
# Create a subset of the dataset with the multiple choice questions only. 
mc = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'ethnicity', 'height', 'income', 'job', 'offspring', 'orientation', 'pets', 'religion', 'sex', 'sign','smokes','speaks', 'status']
mc_data = data[mc]
mc_data.head()

Unnamed: 0,body_type,diet,drinks,drugs,education,ethnicity,height,income,job,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,a little extra,strictly anything,socially,never,working on college/university,"asian, white",75.0,-1,transportation,"doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,average,mostly other,often,sometimes,working on space camp,white,70.0,80000,hospitality / travel,"doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,thin,anything,socially,,graduated from masters program,,68.0,-1,,,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,thin,vegetarian,socially,,working on college/university,white,71.0,20000,student,doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,athletic,,socially,never,graduated from college/university,"asian, black, other",66.0,-1,artistic / musical / writer,,straight,likes dogs and likes cats,,m,aquarius,no,english,single


In [4]:
# Create a subset of the dataset with the open short questions
oq = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
oq_data = data[oq]
oq_data.head()

Unnamed: 0,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9
0,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...","books:<br />\nabsurdistan, the republic, of mi...",food.<br />\nwater.<br />\ncell phone.<br />\n...,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet!<br />\nyou...
1,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,i am die hard christopher moore fan. i don't r...,delicious porkness in all of its glories.<br /...,,,i am very open and will share just about anyth...,
2,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,okay this is where the cultural matrix gets so...,movement<br />\nconversation<br />\ncreation<b...,,viewing. listening. dancing. talking. drinking...,"when i was five years old, i was known as ""the...","you are bright, open, intense, silly, ironic, ..."
3,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,"bataille, celine, beckett. . .<br />\nlynch, j...",,cats and german philosophy,,,you feel so inclined.
4,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians<br />\nat the...",,,,,


In [5]:
mc_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   body_type    54650 non-null  object 
 1   diet         35551 non-null  object 
 2   drinks       56961 non-null  object 
 3   drugs        45866 non-null  object 
 4   education    53318 non-null  object 
 5   ethnicity    54266 non-null  object 
 6   height       59943 non-null  float64
 7   income       59946 non-null  int64  
 8   job          51748 non-null  object 
 9   offspring    24385 non-null  object 
 10  orientation  59946 non-null  object 
 11  pets         40025 non-null  object 
 12  religion     39720 non-null  object 
 13  sex          59946 non-null  object 
 14  sign         48890 non-null  object 
 15  smokes       54434 non-null  object 
 16  speaks       59896 non-null  object 
 17  status       59946 non-null  object 
dtypes: float64(1), int64(1), object(16)
memory usa

In [7]:
mc_data.describe(include='all')

Unnamed: 0,body_type,diet,drinks,drugs,education,ethnicity,height,income,job,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
count,54650,35551,56961,45866,53318,54266,59943.0,59946.0,51748,24385,59946,40025,39720,59946,48890,54434,59896,59946
unique,12,18,6,3,32,217,,,21,15,3,15,45,2,48,5,7647,5
top,average,mostly anything,socially,never,graduated from college/university,white,,,other,doesn&rsquo;t have kids,straight,likes dogs and likes cats,agnosticism,m,gemini and it&rsquo;s fun to think about,no,english,single
freq,14652,16585,41780,37724,23959,32831,,,7589,7560,51606,14814,2724,35829,1782,43896,21828,55697
mean,,,,,,,68.295281,20033.222534,,,,,,,,,,
std,,,,,,,3.994803,97346.192104,,,,,,,,,,
min,,,,,,,1.0,-1.0,,,,,,,,,,
25%,,,,,,,66.0,-1.0,,,,,,,,,,
50%,,,,,,,68.0,-1.0,,,,,,,,,,
75%,,,,,,,71.0,-1.0,,,,,,,,,,


In [8]:
mc_data.diet.value_counts()


diet
mostly anything        16585
anything                6183
strictly anything       5113
mostly vegetarian       3444
mostly other            1007
strictly vegetarian      875
vegetarian               667
strictly other           452
mostly vegan             338
other                    331
strictly vegan           228
vegan                    136
mostly kosher             86
mostly halal              48
strictly halal            18
strictly kosher           18
halal                     11
kosher                    11
Name: count, dtype: int64

In [9]:
# Merge categories in the feature `diet` for model simplification
# Remove the prefixes 'mtrictly' or 'mostly' and any potential white spaces
mc_data['diet'] = mc_data['diet'].str.replace(r'^(strictly|mostly)\s+', '', regex=True)
mc_data.diet.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['diet'] = mc_data['diet'].str.replace(r'^(strictly|mostly)\s+', '', regex=True)


diet
anything      27881
vegetarian     4986
other          1790
vegan           702
kosher          115
halal            77
Name: count, dtype: int64

In [10]:
mc_data.body_type.value_counts()

body_type
average           14652
fit               12711
athletic          11819
thin               4711
curvy              3924
a little extra     2629
skinny             1777
full figured       1009
overweight          444
jacked              421
used up             355
rather not say      198
Name: count, dtype: int64

In [15]:
# Merge categories in the 'body_type' feature for model simplification
# Create a mapping dictionary 
body_mapping = {
    'thin': 'lean',
    'skiny': 'lean',
    'fit': 'lean',
    'curvy': 'curvy',
    'a little extra': 'curvy',
    'full figured': 'fuller',
    'overweight': 'fuller',
    'athletic': 'muscular',
    'jacked': 'muscular',
    'used up': 'other_unkown',
    'rather not say': 'other_unkown'    
}

# Apply mapping to the body_type feature
mc_data['body_type'] = mc_data['body_type'].map(body_mapping)
mc_data.body_type.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['body_type'] = mc_data['body_type'].map(body_mapping)


body_type
lean            17422
muscular        12240
curvy            6553
fuller           1453
other_unkown      553
Name: count, dtype: int64

In [21]:
mc_data.education.value_counts()

education
graduated from college/university    23959
graduated from masters program        8961
working on college/university         5712
working on masters program            1683
graduated from two-year college       1531
graduated from high school            1428
graduated from ph.d program           1272
graduated from law school             1122
working on two-year college           1074
dropped out of college/university      995
working on ph.d program                983
college/university                     801
graduated from space camp              657
dropped out of space camp              523
graduated from med school              446
working on space camp                  445
working on law school                  269
two-year college                       222
working on med school                  212
dropped out of two-year college        191
dropped out of masters program         140
masters program                        136
dropped out of ph.d program            127
d

In [23]:
def simplify_education(education):
    # Check on missing values and return them as np.NaN
    if pd.isna(education):
        return np.NaN
    #Convert education to string and lower case
    education = str(education).lower()

    # Check for keywords in the education field and return the new category
    if 'high school' in education:
        return 'high school or less'
    elif 'college' in education:
        return 'college or less'
    elif 'masters program' in education:
        return 'graduate degree or less'
    elif 'space' in education: 
        return 'other'
    elif 'ph.d program' in education:
        return 'doctorate/professional school'
    elif 'law' in education:
        return 'doctorate/professional school'
    elif 'med' in education:
        return 'doctorate/professional school'   

mc_data['education'] = mc_data['education'].apply(simplify_education)
mc_data['education'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['education'] = mc_data['education'].apply(simplify_education)


education
college or less                  34485
graduate degree or less          10920
doctorate/professional school     4517
high school or less               1713
other                             1683
Name: count, dtype: int64

In [27]:
print(mc_data.ethnicity.value_counts())

ethnicity
white                                                                 32831
asian                                                                  6134
hispanic / latin                                                       2823
black                                                                  2008
other                                                                  1706
                                                                      ...  
middle eastern, indian, white                                             1
asian, middle eastern, black, white, other                                1
asian, middle eastern, indian, hispanic / latin, white, other             1
black, native american, indian, pacific islander, hispanic / latin        1
asian, black, indian                                                      1
Name: count, Length: 217, dtype: int64


In [29]:
print(mc_data.ethnicity.unique())

['asian, white' 'white' nan 'asian, black, other' 'white, other'
 'hispanic / latin, white' 'hispanic / latin' 'pacific islander, white'
 'asian' 'black, white' 'pacific islander' 'asian, native american'
 'asian, pacific islander' 'black, native american, white'
 'middle eastern, other' 'native american, white' 'indian' 'black'
 'black, native american, hispanic / latin, other'
 'black, native american, hispanic / latin'
 'asian, black, pacific islander'
 'asian, middle eastern, black, native american, indian, pacific islander, hispanic / latin, white, other'
 'other' 'hispanic / latin, other' 'asian, black' 'middle eastern, white'
 'native american, white, other' 'black, native american'
 'black, white, other' 'hispanic / latin, white, other' 'middle eastern'
 'black, other' 'native american, hispanic / latin, white' 'black, indian'
 'indian, white, other' 'middle eastern, indian, other'
 'black, native american, hispanic / latin, white, other'
 'pacific islander, hispanic / latin' '

In [31]:
# Function to simplify 'ethnicity' 
def simplify_ethnicity(ethnicity):
    if pd.isna(ethnicity):
        return np.NaN
    # Convert to lowercase for case sensitive matching
    ethnicity = ethnicity.lower()

    # Define groups' keywords
    white = ['white']
    black = ['black']
    asian = ['asian', 'indian']
    hispanic = ['hispanic', 'latin']
    native_american = ['native american']
    pacific_islander = ['pacific islander']
    middle_eastern = ['middle eastern']

    # Create a list to store the detected categories
    detected_groups = []
    # Check category group
    if any(keyword in ethnicity for keyword in white):
        detected_groups.append('white')
    if any(keyword in ethnicity for keyword in black):
        detected_groups.append('black')
    if any(keyword in ethnicity for keyword in asian):
        detected_groups.append('asian')
    if any(keyword in ethnicity for keyword in hispanic):
        detected_groups.append('hispanic / latin')
    if any(keyword in ethnicity for keyword in pacific_islander):
        detected_groups.append('pacific islander')
    if any(keyword in ethnicity for keyword in middle_eastern):
        detected_groups.append('middle eastern')

    # Control for mixed ethnicity
    if len(detected_groups) > 1:
        return 'mixed ethnicity'
    # If one group is detected
    if detected_groups:
        return detected_groups[0]
    # If no recognized group
    else:
        return 'other'

mc_data['ethnicity'] = mc_data['ethnicity'].apply(simplify_ethnicity)
mc_data['ethnicity'].value_counts()


    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['ethnicity'] = mc_data['ethnicity'].apply(simplify_ethnicity)


ethnicity
white               33925
asian                6291
mixed ethnicity      4973
hispanic / latin     3059
other                2898
black                2292
pacific islander      460
middle eastern        368
Name: count, dtype: int64

In [39]:
mc_data['offspring'] = mc_data['offspring'].str.replace('&rsquo;', '')
mc_data.offspring.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['offspring'] = mc_data['offspring'].str.replace('&rsquo;', '')


offspring
doesnt have kids                         7560
doesnt have kids, but might want them    3875
doesnt have kids, but wants them         3565
doesnt want kids                         2927
has kids                                 1883
has a kid                                1881
doesnt have kids, and doesnt want any    1132
has kids, but doesnt want more            442
has a kid, but doesnt want more           275
has a kid, and might want more            231
wants kids                                225
might want kids                           182
has kids, and might want more             115
has a kid, and wants more                  71
has kids, and wants more                   21
Name: count, dtype: int64

In [41]:
# Function to simplify offspring
def simplify_offspring(offspring):
    if pd.isna(offspring):
        return np.NaN
    if 'doesnt have' in offspring:
        return 'no'
    elif 'has a kid' in offspring:
        return 'one'
    elif 'has kids' in offspring:
        return 'more than one'
    else:
        return 'no'

mc_data['offspring'] = mc_data['offspring'].apply(simplify_offspring)
mc_data.offspring.value_counts()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['offspring'] = mc_data['offspring'].apply(simplify_offspring)


offspring
no               19466
more than one     2461
one               2458
Name: count, dtype: int64

In [45]:
mc_data.pets.value_counts()

pets
likes dogs and likes cats          14814
likes dogs                          7224
likes dogs and has cats             4313
has dogs                            4134
has dogs and likes cats             2333
likes dogs and dislikes cats        2029
has dogs and has cats               1474
has cats                            1406
likes cats                          1063
has dogs and dislikes cats           552
dislikes dogs and likes cats         240
dislikes dogs and dislikes cats      196
dislikes cats                        122
dislikes dogs and has cats            81
dislikes dogs                         44
Name: count, dtype: int64

In [49]:
mc_data['pets'] = mc_data['pets'].apply(lambda x: 'yes' if isinstance(x,str) and 'has' in x else x if x is None else 'no')
mc_data.pets.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['pets'] = mc_data['pets'].apply(lambda x: 'yes' if isinstance(x,str) and 'has' in x else x if x is None else 'no')


pets
no     45653
yes    14293
Name: count, dtype: int64

In [57]:
 # Create a function to simplify religion
def simplify_religion(religion):
    # Control for nul values
    if pd.isna(religion):
        return np.NaN
    # Check for keyword in the religion field
    if 'agnosticism' in religion:
        return 'agnosticism'
    elif 'catholicism' in religion:
        return 'christianity'
    elif 'christianity' in religion:
        return 'christianity'
    elif 'atheism' in religion:
        return 'atheism'
    elif 'judaism' in religion:
        return 'judaism'
    elif 'buddhism' in religion:
        return 'buddhism'
    elif 'hinduism' in religion:
        return 'hinduism'
    elif 'islam' in religion:
        return 'islam'
    elif 'other' in religion:
        return 'other'

mc_data['religion'] = mc_data['religion'].apply(simplify_religion)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mc_data['religion'] = mc_data['religion'].apply(simplify_religion)


In [59]:
mc_data.religion.value_counts()

religion
christianity    10545
agnosticism      8812
other            7743
atheism          6985
judaism          3098
buddhism         1948
hinduism          450
islam             139
Name: count, dtype: int64

In [61]:
mc_data.describe(include='all')

Unnamed: 0,body_type,diet,drinks,drugs,education,ethnicity,height,income,job,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
count,38221,35551,56961,45866,53318,54266,59943.0,59946.0,51748,24385,59946,59946,39720,59946,48890,54434,59896,59946
unique,5,6,6,3,5,8,,,21,3,3,2,8,2,48,5,7647,5
top,lean,anything,socially,never,college or less,white,,,other,no,straight,no,christianity,m,gemini and it&rsquo;s fun to think about,no,english,single
freq,17422,27881,41780,37724,34485,33925,,,7589,19466,51606,45653,10545,35829,1782,43896,21828,55697
mean,,,,,,,68.295281,20033.222534,,,,,,,,,,
std,,,,,,,3.994803,97346.192104,,,,,,,,,,
min,,,,,,,1.0,-1.0,,,,,,,,,,
25%,,,,,,,66.0,-1.0,,,,,,,,,,
50%,,,,,,,68.0,-1.0,,,,,,,,,,
75%,,,,,,,71.0,-1.0,,,,,,,,,,
