# OKCupid Date a Scientist - Part 2 Match couples
Given clients' dataset of OKcupid - a dating company, this project's objective is to cluster these clients into groups and try to match a couple. There is only one data file which is 'profile.csv'. The dataset consists of 31 columns which are about clients. All columns are as follows:
 - **age**
 - **body_type** nominal such as average, thin, a little extra, etc.
 - diet 
 - drinks 
 - drugs        
 - education       
 - **essay0 - essay9**  These ten columns are more details about themselves which written by each individual, for examples, 'About me', 'What I do' and 'My interests'. However, I have not been provided the real questions or topics for each essay column.
 - ethnicity such as asian, indian, white, etc.
 - height           
 - income      
 - job           
 - last_online       
 - location         
 - offspring      
 - orientation      
 - pets         
 - religion      
 - sex             
 - sign - zodiac signs such as aries, taurus, gemini    
 - smokes         
 - speaks 
 - status - a member's status such as single, available, married

In [1]:
# import necessary libraries and change some display settings.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [57]:
# import dataset from a csv file
df = pd.read_csv("profiles.csv")

In [58]:
# check all columns' names
df.columns

Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')

In [59]:
# have a look at the first five rows of data
df['religion_cleaned'] = df.religion.str.split().str.get(0)
df.head(10)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status,religion_cleaned
0,22,a little extra,strictly anything,socially,never,working on college/university,"about me:<br />\n<br />\ni would love to think that i was some some kind of intellectual:\neither the dumbest smart guy, or the smartest dumb guy. can't say i\ncan tell the difference. i love to talk about ideas and concepts. i\nforge odd metaphors instead of reciting cliches. like the\nsimularities between a friend of mine's house and an underwater\nsalt mine. my favorite word is salt by the way (weird choice i\nknow). to me most things in life are better as metaphors. i seek to\nmake myself a little better everyday, in some productively lazy\nway. got tired of tying my shoes. considered hiring a five year\nold, but would probably have to tie both of our shoes... decided to\nonly wear leather shoes dress shoes.<br />\n<br />\nabout you:<br />\n<br />\nyou love to have really serious, really deep conversations about\nreally silly stuff. you have to be willing to snap me out of a\nlight hearted rant with a kiss. you don't have to be funny, but you\nhave to be able to make me laugh. you should be able to bend spoons\nwith your mind, and telepathically make me smile while i am still\nat work. you should love life, and be cool with just letting the\nwind blow. extra points for reading all this and guessing my\nfavorite video game (no hints given yet). and lastly you have a\ngood attention span.","currently working as an international agent for a freight\nforwarding company. import, export, domestic you know the\nworks.<br />\nonline classes and trying to better myself in my free time. perhaps\na hours worth of a good book or a video game on a lazy sunday.","making people laugh.<br />\nranting about a good salting.<br />\nfinding simplicity in complexity, and complexity in simplicity.","the way i look. i am a six foot half asian, half caucasian mutt. it\nmakes it tough not to notice me, and for me to blend in.","books:<br />\nabsurdistan, the republic, of mice and men (only book that made me\nwant to cry), catcher in the rye, the prince.<br />\n<br />\nmovies:<br />\ngladiator, operation valkyrie, the producers, down periscope.<br />\n<br />\nshows:<br />\nthe borgia, arrested development, game of thrones, monty\npython<br />\n<br />\nmusic:<br />\naesop rock, hail mary mallon, george thorogood and the delaware\ndestroyers, felt<br />\n<br />\nfood:<br />\ni'm down for anything.",food.<br />\nwater.<br />\ncell phone.<br />\nshelter.,duality and humorous things,trying to find someone to hang out with. i am down for anything\nexcept a club.,i am new to california and looking for someone to wisper my secrets\nto.,you want to be swept off your feet!<br />\nyou are tired of the norm.<br />\nyou want to catch a coffee or a bite.<br />\nor if you want to talk philosophy.,"asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single,agnosticism
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1. i am a workaholic.<br />\n2. i love to cook regardless of whether i am at work.<br />\n3. i love to drink and eat foods that are probably really bad for\nme.<br />\n4. i love being around people that resemble line 1-3.<br />\ni love the outdoors and i am an avid skier. if its snowing i will\nbe in tahoe at the very least. i am a very confident and friendly.\ni'm not interested in acting or being a typical guy. i have no time\nor patience for rediculous acts of territorial pissing. overall i\nam a very likable easygoing individual. i am very adventurous and\nalways looking forward to doing new things and hopefully sharing it\nwith the right person.,dedicating everyday to being an unbelievable badass.,being silly. having ridiculous amonts of fun wherever. being a\nsmart ass. ohh and i can cook. ;),,i am die hard christopher moore fan. i don't really watch a lot of\ntv unless there is humor involved. i am kind of stuck on 90's\nalternative music. i am pretty much a fan of everything though... i\ndo need to draw a line at most types of electronica.,delicious porkness in all of its glories.<br />\nmy big ass doughboy's sinking into 15 new inches.<br />\nmy overly resilient liver.<br />\na good sharp knife.<br />\nmy ps3... it plays blurays too. ;)<br />\nmy over the top energy and my outlook on life... just give me a bag\nof lemons and see what happens. ;),,,i am very open and will share just about anything.,,white,70.0,80000,hospitality / travel,2012-06-29-21-41,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (poorly)",single,agnosticism
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public text on an online\ndating site makes me pleasantly uncomfortable. i'll try to be as\nearnest as possible in the noble endeavor of standing naked before\nthe world.<br />\n<br />\ni've lived in san francisco for 15 years, and both love it and find\nmyself frustrated with its deficits. lots of great friends and\nacquaintances (which increases my apprehension to put anything on\nthis site), but i'm feeling like meeting some new people that\naren't just friends of friends. it's okay if you are a friend of a\nfriend too. chances are, if you make it through the complex\nfiltering process of multiple choice questions, lifestyle\nstatistics, photo scanning, and these indulgent blurbs of text\nwithout moving quickly on to another search result, you are\nprobably already a cultural peer and at most 2 people removed. at\nfirst, i thought i should say as little as possible here to avoid\nyou, but that seems silly.<br />\n<br />\nas far as culture goes, i'm definitely more on the weird side of\nthe spectrum, but i don't exactly wear it on my sleeve. once you\nget me talking, it will probably become increasingly apparent that\nwhile i'd like to think of myself as just like everybody else (and\nby some definition i certainly am), most people don't see me that\nway. that's fine with me. most of the people i find myself\ngravitating towards are pretty weird themselves. you probably are\ntoo.","i make nerdy software for musicians, artists, and experimenters to\nindulge in their own weirdness, but i like to spend time away from\nthe computer when working on my artwork (which is typically more\nconcerned with group dynamics and communication, than with visual\nform, objects, or technology). i also record and deejay dance,\nnoise, pop, and experimental music (most of which electronic or at\nleast studio based). besides these relatively ego driven\nactivities, i've been enjoying things like meditation and tai chi\nto try and gently flirt with ego death.","improvising in different contexts. alternating between being\npresent and decidedly outside of a moment, or trying to hold both\nat once. rambling intellectual conversations that hold said\nconversations in contempt while seeking to find something that\ntranscends them. being critical while remaining generous. listening\nto and using body language--often performed in caricature or large\ngestures, if not outright interpretive dance. dry, dark, and\nraunchy humor.","my large jaw and large glasses are the physical things people\ncomment on the most. when sufficiently stimulated, i have an\nunmistakable cackle of a laugh. after that, it goes in more\ndirections than i care to describe right now. maybe i'll come back\nto this.","okay this is where the cultural matrix gets so specific, it's like\nbeing in the crosshairs.<br />\n<br />\nfor what it's worth, i find myself reading more non-fiction than\nfiction. it's usually some kind of philosophy, art, or science text\nby silly authors such as ranciere, de certeau, bataille,\nbaudrillard, butler, stein, arendt, nietzche, zizek, etc. i'll\noften throw in some weird new age or pop-psychology book in the mix\nas well. as for fiction, i enjoy what little i've read of eco,\nperec, wallace, bolao, dick, vonnegut, atwood, delilo, etc. when i\nwas young, i was a rabid asimov reader.<br />\n<br />\ndirectors i find myself drawn to are makavejev, kuchar, jodorowsky,\nherzog, hara, klein, waters, verhoeven, ackerman, hitchcock, lang,\ngorin, goddard, miike, ohbayashi, tarkovsky, sokurov, warhol, etc.\nbut i also like a good amount of ""trashy"" stuff. too much to\nname.<br />\n<br />\ni definitely enjoy the character development that happens in long\nform episodic television over the course of 10-100 episodes, which\na 1-2hr movie usually can't compete with. some of my recent tv\nfavorites are: breaking bad, the wire, dexter, true blood, the\nprisoner, lost, fringe.<br />\n<br />\na smattered sampling of the vast field of music i like and deejay:\nart ensemble, sun ra, evan parker, lil wayne, dj funk, mr. fingers,\nmaurizio, rob hood, dan bell, james blake, nonesuch recordings,\nomar souleyman, ethiopiques, fela kuti, john cage, meredith monk,\nrobert ashley, terry riley, yoko ono, merzbow, tom tom club, jit,\njuke, bounce, hyphy, snap, crunk, b'more, kuduro, pop, noise, jazz,\ntechno, house, acid, new/no wave, (post)punk, etc.<br />\n<br />\na few of the famous art/dance/theater folk that might locate my\nsensibility: andy warhol, bruce nauman, yayoi kusama, louise\nbourgeois, tino sehgal, george kuchar, michel duchamp, marina\nabramovic, gelatin, carolee schneeman, gustav metzger, mike kelly,\nmike smith, andrea fraser, gordon matta-clark, jerzy grotowski,\nsamuel beckett, antonin artaud, tadeusz kantor, anna halperin,\nmerce cunningham, etc. i'm clearly leaving out a younger generation\nof contemporary artists, many of whom are friends.<br />\n<br />\nlocal food regulars: sushi zone, chow, ppq, pagolac, lers ros,\nburma superstar, minako, shalimar, delfina pizza, rosamunde,\narinells, suppenkuche, cha-ya, blue plate, golden era, etc.",movement<br />\nconversation<br />\ncreation<br />\ncontemplation<br />\ntouch<br />\nhumor,,viewing. listening. dancing. talking. drinking. performing.,"when i was five years old, i was known as ""the boogerman"".","you are bright, open, intense, silly, ironic, critical, caring,\ngenerous, looking for an exploration, rather than finding ""a match""\nof some predetermined qualities.<br />\n<br />\ni'm currently in a fabulous and open relationship, so you should be\ncomfortable with that.",,68.0,-1,,2012-06-27-09-10,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available,
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books according to the library\nof congress classification system,socially awkward but i do my best,"bataille, celine, beckett. . .<br />\nlynch, jarmusch, r.w. fassbender. . .<br />\ntwin peaks &amp; fishing w/ john<br />\njoy division, throbbing gristle, cabaret voltaire. . .<br />\nvegetarian pho and coffee",,cats and german philosophy,,,you feel so inclined.,white,71.0,20000,student,2012-06-28-14-22,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single,
4,29,athletic,,socially,never,graduated from college/university,"hey how's it going? currently vague on the profile i know, more to\ncome soon. looking to meet new folks outside of my circle of\nfriends. i'm pretty responsive on the reply tip, feel free to drop\na line. cheers.",work work work work + play,creating imagery to look at:<br />\nhttp://bagsbrown.blogspot.com/<br />\nhttp://stayruly.blogspot.com/,i smile a lot and my inquisitive nature,"music: bands, rappers, musicians<br />\nat the moment: thee oh sees.<br />\nforever: wu-tang<br />\nbooks: artbooks for days<br />\naudiobooks: my collection, thick (thanks audible)<br />\nshows: live ones<br />\nfood: with stellar friends whenever<br />\nmovies &gt; tv<br />\npodcast: radiolab, this american life, the moth, joe rogan, the\nchamps",,,,,,"asian, black, other",66.0,-1,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single,
5,29,average,mostly anything,socially,,graduated from college/university,"i'm an australian living in san francisco, but don't hold that\nagainst me. i spend most of my days trying to build cool stuff for\nmy company. i speak mandarin and have been known to bust out\nchinese songs at karaoke. i'm pretty cheeky. someone asked me if\nthat meant something about my arse, which i find really\nfunny.<br />\n<br />\ni'm a little oddball. i have a wild imagination; i like to think of\nthe most improbable reasons people are doing things just for fun. i\nlove to laugh and look for reasons to do so. occasionally this gets\nme in trouble because people think i'm laughing at them. sometimes\ni am, but more often i'm only laughing at myself.<br />\n<br />\ni'm an entrepreneur (like everyone else in sf, it seems) and i love\nwhat i do. i enjoy parties and downtime in equal measure.\nintelligence really turns me on and i love people who can teach me\nnew things.",building awesome stuff. figuring out what's important. having\nadventures. looking for treasure.,"imagining random shit. laughing at aforementioned random shit.\nbeing goofy. articulating what i think and feel. convincing people\ni'm right. admitting when i'm wrong.<br />\n<br />\ni'm also pretty good at helping people think through problems; my\nfriends say i give good advice. and when i don't have a clue how to\nhelp, i will say: i give pretty good hug.",i have a big smile. i also get asked if i'm wearing blue-coloured\ncontacts (no).,"books: to kill a mockingbird, lord of the rings, 1984, the farseer\ntrilogy.<br />\n<br />\nmusic: the beatles, frank sinatra, john mayer, jason mraz,\ndeadmau5, andrew bayer, everything on anjunadeep records, bach,\nsatie.<br />\n<br />\ntv shows: how i met your mother, scrubs, the west wing, breaking\nbad.<br />\n<br />\nmovies: star wars, the godfather pt ii, 500 days of summer,\nnapoleon dynamite, american beauty, lotr<br />\n<br />\nfood: thai, vietnamese, shanghai dumplings, pizza!","like everyone else, i love my friends and family, and need hugs,\nhuman contact, water and sunshine. let's take that as given.<br />\n<br />\n1. something to build<br />\n2. something to sing<br />\n3. something to play on (my guitar would be first choice)<br />\n4. something to write/draw on<br />\n5. a big goal worth dreaming about<br />\n6. something to laugh at",what my contribution to the world is going to be and/or should be.\nand what's for breakfast. i love breakfast.,out with my friends!,i cried on my first day at school because a bird shat on my head.\ntrue story.,you're awesome.,white,67.0,-1,computer / hardware / software,2012-06-29-19-18,"san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes cats,atheism,m,taurus,no,"english (fluently), chinese (okay)",single,atheism
6,32,fit,strictly anything,socially,never,graduated from college/university,life is about the little things. i love to laugh. it's easy to do\nwhen one can find beauty and humor in the ugly. this perspective\nmakes for a more gratifying life. it's a gift. we are here to play.,digging up buried treasure,frolicking<br />\nwitty banter<br />\nusing my camera to extract sums of a whole and share my perspective\nwith the world in hopes of opening up theirs<br />\nbeing amused by things most people would miss,i am the last unicorn,i like books. ones with pictures. reading them is great too. where\ndo people find the time? i spend more time with other people not\nreading. i collect books. they sit neatly on my bookshelves.<br />\n<br />\nmovies are great. especially on movie night. with brownies.<br />\n<br />\nmusic. i love (love) it all. unless it's country.<br />\n<br />\ni love food.,laughter<br />\namazing people in my life<br />\ncolor<br />\ncuriosity<br />\nmusic and rhythm<br />\na good pair of sunglasses,"synchronicity<br />\n<br />\nthere is this whole other realm where the fabrics of our life\nstories intersect as they dance and play in a magical burst of\nenergy. this realm doesn't need you to believe in it in order to\nmaintain. it is a cluster of synchronicities and happenings. it is\na gift to those who notice them. something to be treasured\nappreciated. there is something special in each and every moment\nthat you experience in your daily waking life. this something\nbrings us back to the age old question: if a tree falls in the\nforest and no one is there to hear it, does it make a sound? this\nworks in the same way. if you are not consciously there to hear it,\nsee it, taste it, smell it, feel it none of this matters, it's\nstill there. pay attention to the little things, those that are\noften overlooked. see if you can find the magic in this gift we\ncall life.",plotting to take over the world with my army of segway riding\npandas and fire breathing kittens,my typical friday night,,"white, other",65.0,-1,,2012-06-25-20-45,"san francisco, california",,straight,likes dogs and likes cats,,f,virgo,,english,single,
7,31,average,mostly anything,socially,never,graduated from college/university,,"writing. meeting new people, spending time with friends, seeing\nfilms, going to literary events and lectures, sifting through\nbookstores and thrift stores, exploring the city. i also work full\ntime at an interactive agency.","remembering people's birthdays, sending cards, being thoughtful,\narm wrestling",i'm rather approachable (a byproduct of being from a small town in\nthe midwest).,"i like: alphabetized lists, aquariums, autobiographies, beer on\ntap, ben folds, biking, brunch, citrus, cocktails, color, comfort\nfood, craft projects, dancing, design, diy, essays, fabric stores,\nfield trips, flea markets, foreign films, glee, good, hammond\norgans, helping lost tourists, indie rock, ice cream, languages,\nlectures, letterpress, libraries, literary fiction, live shows, mad\nmen, martha stewart living, memoir, mix tapes, non-fiction, npr,\nplants, puns, sewing, short stories, siestas, singer-songwriters,\nspicy food, stationery, storytelling, sufjan stevens, talking to\nstrangers, tea, tegan and sara, the office, 30 rock, travel,\nquilts, quirky movies, wes anderson, wine, writing, yoga.","friends, family, notebook/pen, books, music, travel",things that amuse and inspire me,out and about or relaxing at home with a good book or netflix,,,white,65.0,-1,artistic / musical / writer,2012-06-29-12-30,"san francisco, california","doesn&rsquo;t have kids, but wants them",straight,likes dogs and likes cats,christianity,f,sagittarius,no,"english, spanish (okay)",single,christianity
8,24,,strictly anything,socially,,graduated from college/university,,"oh goodness. at the moment i have 4 jobs, so it'd be nice to find\none i could settle into. other than that, i'm making sure i'm\nsurrounded by good people and keeping happy and active",,i'm freakishly blonde and have the same name as that hurricane from\na while back.,"i am always willing to try new foods and am not too picky. i do\nhowever have an extremely low tolerance to spicy food, i'm working\non it tho... i've already conquered pouring chipotle hot sauce all\nover my burritos. guess i still have a long ways to go. one of my\nfavorite spots in marin is sol food-- simply delicious puerto rican\nfood.<br />\n<br />\nlove to laugh so i like to watch a fair amount of comedies-\nromantic, dark or otherwise. princess bride, pixar!, fear and\nloathing, caddyshack, thomas crown affair, the birdcage (""chewing\ngum helps me think""... ""sweetie, you're wasting your gum!""), the\nbig lebowski, star wars (don't judge)<br />\n<br />\ni like historical fiction, murder mysteries, fantasy,\nutopian/dystopia, and, uh, the occasional romance book. just\nstarted the historian and trying to decide whether or not to read\nthe hunger games... already saw battle royale and it was pretty\ncool, so i'd imagine that collin's version she ripped off has got\nto be fun to read.<br />\n<br />\ncan't get enough music. like it, love it, want some more of it.\nlove to dance to it and to many peoples dismay, sing to it. pretty\nopen to any kind, even country.\nhttp://www.youtube.com/watch?v=dghbozbsv18&amp;feature=bfa&amp;list=pl01a90d209abe1ed3&amp;lf=plpp_video","sports/my softball glove<br />\ncoffee. because nobody likes zombies.<br />\nkindle<br />\nloud music/concerts<br />\ncandlelight showers<br />\noh, my amazing/crazy friends and family<br />\ngirl scout cookies... curse those delectable treats that dethrone\nmy generally reasonable diet-- i could eat a box of those in under\nan hour",,"in or out... drinking with friends, maybe a bar or dancing my pants\noff, watching or playing a game","potential friends/lovers/people who come in contact with me, beware\n... when alone in my car or the shower, i enjoy singing loudly and\noff key... and even sometimes when i'm not so alone.",http://www.youtube.com/watch?v=4dxbwzuwsxk let's have some fun and\nmaybe get into a little trouble,white,67.0,-1,,2012-06-29-23-39,"belvedere tiburon, california",doesn&rsquo;t have kids,straight,likes dogs and likes cats,christianity but not too serious about it,f,gemini but it doesn&rsquo;t matter,when drinking,english,single,christianity
9,37,athletic,mostly anything,not at all,never,working on two-year college,"my names jake.<br />\ni'm a creative guy and i look for the same in others.<br />\n<br />\ni'm easy going, practical and i don't have many hang ups. i\nappreciate life and try to live it to the fullest. i'm sober and\nhave been for the past few years.<br />\n<br />\ni love music and i play guitar. i like tons of different bands. i'm\nan artist and i love to paint/draw etc. and i love creative\npeople.<br />\n<br />\ni've got to say i'm not too big on internet dating. you cant really\nget an earnest impression of anyone from a few polished paragraphs.\nbut we'll see, you never know.","i have an apartment. i like to explore and check things out. i like\ngood japanese and peruvian food. nothing beats good ceviche on a\nhot day. or a hot chai on a cold one.<br />\n<br />\ni've been working on my a.o.d. certification but have stalled out.\ni'm hoping to pursue art but have yet to find the best venue.\nrecently i've been working on a construction job in belmont. it's\nnot my dream job. but for the time being it affords me other\nopportunities. plus it keeps me in shape, so i shouldn't complain.",i'm good at finding creative solutions to problems. i can organize\na living space pretty well. i'm good at making people smile. i'm\ngood at laughing at inappropriate times. and i make a mean bowl of\ncereal.,i'm short,"i like some tv. i love summer heights high and angry boys. and i\nlove fringe.<br />\n<br />\ni'm reading stiff after finishing elliott smith and the big nothing\n(loved it). i like biographies.<br />\n<br />\ni love music. it would be impossible to list everything i like\nbecause the list grows exponentially. i like george harrison or the\nclash. i like flight of the conchord's, old radiohead and elliott\nsmith. djali zwan, x, the knitters, the kinks, john lennon, floyd,\nnina simone, the smiths, seu jorge, the sex pistols, immortal\ntechnique, al green, dead kennedy's, the beatles, cat stevens, nine\ninch nails, the dead, bob dylan etc.<br />\nmy taste is varied, i love music.<br />\n<br />\nand i love movie's, i like all kinds.<br />\nthese days i'm at netflix more often though.<br />\nand i never miss science friday's.","music, my guitar<br />\ncontrast<br />\ngood food<br />\nmy bike<br />\nmy paintbrush<br />\nmy toothbrush<br />\nfamily &amp; friends<br />\n<em>ok....there's seven</em>",<strong><em>you should</em></strong>,<strong><em>send a message</em></strong>,<em><strong>and say hi.</strong></em>,you can rock the bells,white,65.0,-1,student,2012-06-28-21-08,"san mateo, california",,straight,likes dogs and likes cats,atheism and laughing about it,m,cancer but it doesn&rsquo;t matter,no,english (fluently),single,atheism


As the text data in all essay columns has some noises such as html tags, video links, a function to manage these was created.

In [71]:
cols_to_match = ['age', 'sex', 'orientation', 'diet', 'drinks', 'drugs', 'smokes',  
                 'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 
                 'essay7', 'essay8', 'essay9', 'religion_cleaned']

def create_match_data(df, cols_to_match):
    match_df = df[cols_to_match]

    essay_columns = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9']
    # delete all html tags, new line escape, and http links in the essay columns
    filled_df = match_df.replace({r'<[A-Za-z\/][^>]*>' : '', r'\n' : ' ', r'http[^ ]*[ ]' : ' ', r'http[^ ]*' : ''}, regex=True)
    # fill the essay columns with null values with a space
    filled_df.fillna(' ')
    # create the 'combined_essay' column by combining all the essay columns together
    filled_df['combined_essay'] = filled_df[essay_columns].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
              
    return filled_df

In [72]:
match_data = create_match_data(df, cols_to_match)
match_data.columns

Index(['age', 'sex', 'orientation', 'diet', 'drinks', 'drugs', 'smokes',
       'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6',
       'essay7', 'essay8', 'essay9', 'religion_cleaned', 'combined_essay'],
      dtype='object')

In [73]:
match_data.shape

(59946, 19)

In [93]:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import random


def find_top_5_matches(sample, df):
    cols_to_match = ['age', 'sex', 'orientation', 'diet', 'drinks', 'drugs', 'smokes', 'pets',  'essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7', 'essay8', 'essay9', 'religion_cleaned']
    # create a dataframe with only columns used for matching
    data = create_match_data(df, cols_to_match)
    
    # check a user's orientation to prepare proper dataset
    if sample['orientation'] == 'straight':
        dataset = data.loc[(data.sex != sample.sex) & (data.orientation == 'straight')]
    elif sample['orientation'] == 'gay':
        dataset = data.loc[(data.sex == sample.sex) & (data.orientation == 'gay')]
    else:
        dataset = data.loc[((data.sex == sample.sex) & (data.orientation == 'gay')) | 
                        ((data.sex != sample.sex) & (data.orientation == 'straight'))]  
    

    # select only used columns in the dataset by excluding essay0-essay9, and religion columns
    cols_to_transform = ['age', 'orientation','diet', 'drinks', 'drugs', 'smokes', 'pets', 'combined_essay', 'religion_cleaned']
    prepared_dataset = dataset[cols_to_transform]
    
   # remove containing null values rows
    prepared_dataset = prepared_dataset.dropna()
 
    
    # create a columntransformer to transform different features with diferent tools
    ct = ColumnTransformer(transformers=[
        ('tfidf', TfidfVectorizer(), 'combined_essay'),
        ('scaler', StandardScaler(), ['age']), 
        ('onehot', OneHotEncoder(), ['orientation', 'diet', 'drinks', 'drugs', 'smokes', 'pets', 'religion_cleaned'])])

    # fitting columns transformer with new_dataset
    transformed_dataset = ct.fit_transform(prepared_dataset)
    
    # preprocess sample
    essay_columns = ['essay0', 'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6',
       'essay7', 'essay8', 'essay9']
    for col in essay_columns:
        if not isinstance(sample[col], str):
            sample[col] = ' '
    sample['combined_essay'] = ' '.join([sample[col] for col in essay_columns])

    prepared_sample = sample[cols_to_transform]
      
    # reshape values of sample as it is only one sample
    reshaped_sample = prepared_sample.values.reshape(1,-1)
    df_sample = pd.DataFrame(reshaped_sample, columns = ['age', 'orientation', 'diet', 'drinks', 'drugs', 'smokes', 'pets', 'combined_essay', 'religion_cleaned'])
    transformed_sample = ct.transform(df_sample)
    
    # find distances between sample and all datapoints in prepared 
    distances = []
    for data_point in transformed_dataset:
        distances.append(euclidean_distances(transformed_sample, data_point)[0][0])
    # get top 5 closet indices people in dataset
    five_closet_indice = np.argsort(distances)[::-1][:5]

    print("The top 5 people matches:")
    i = 1
    for index in five_closet_indice:
        print("\n")
        print("Suggested match " + str(i))
        print(dataset.iloc[index])
        i +=1

## Testing a function with sample data

In [84]:
sample = df.iloc[7]
print("Who are you?")
print("\n")
print(sample)

Who are you?


age                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           31
body_type                                                                                                                                                                                                                                               

In [94]:
import numpy as np

find_top_5_matches(sample, df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample['combined_essay'] = ' '.join([sample[col] for col in essay_columns])


The top 5 people matches:


Suggested match 1
age                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            34
sex                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     