## **OkCupid Machine Learning Project**
This project analyzes data from the online dating application OkCupid. In recent years, the use of dating apps to find love has risen sharply. Many of these apps use sophisticated data science techniques to recommend possible matches and to optimize the user experience. They also give us access to a wealth of information, not available before, about how different people experience romance.

### **Project Objectives**
The first goal of this project is to **scope, prepare and analyze the data**. The second goal is to **create ML model(s)** that can answer questions and predict outcomes. Depending on the results and the resources available, the model(s) could then be tuned and optimized. The work is organized into the following steps:
- Load and check the data
- Exploratory data analysis
- Select the ML model
- Prepare the data  
- Build the model(s)
- Train and evaluate the model(s)
- Conclusions 
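
The steps listed above can be sketched end to end on a small synthetic dataset. This is only a minimal illustration: the data is randomly generated, the `toy_df` frame and its `target` column are hypothetical, and k-nearest neighbors is assumed here simply because it is one of the classifiers imported later in the notebook. It is not the project's final model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real profiles (hypothetical values, not OkCupid data)
rng = np.random.default_rng(42)
n = 200
toy_df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "height": rng.normal(68, 4, size=n),
})
# Hypothetical binary target loosely tied to height so the model has some signal
toy_df["target"] = (toy_df["height"] + rng.normal(0, 2, size=n) > 68).astype(int)

# Prepare the data: split first, then fit the scaler on the training set only
X = toy_df[["age", "height"]]
y = toy_df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build, train and evaluate the model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)
toy_accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Accuracy on the held-out toy data: {toy_accuracy:.2f}")
```

Fitting the scaler on the training split only keeps information from the test rows out of the preprocessing step.
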
### **Dataset**

The dataset used in this project is from [Kaggle](https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles). It contains 59,946 user profiles (roughly 60k records) spread over 31 columns, combining structured information such as age, sex, and orientation with free-text answers to open-ended essay prompts. OkCupid is a mobile dating app. It sets itself apart from other dating apps by using a precomputed compatibility score, calculated from optional questions that users may choose to answer.

##### The columns in the dataset include: 
- ``age:`` continuous variable of age of user
- ``body_type:`` categorical variable of body type of user
- ``diet:`` categorical variable of dietary information
- ``drinks:``  categorical variable of alcohol consumption
- ``drugs:`` categorical variable of drug usage
- ``education:`` categorical variable of educational attainment
- ``ethnicity:`` categorical variable of ethnic backgrounds
- ``height:`` continuous variable of height of user
- ``income:`` continuous variable of income of user
- ``job:`` categorical variable of employment description
- ``offspring:`` categorical variable of children status
- ``orientation:`` categorical variable of sexual orientation
- ``pets:`` categorical variable of pet preferences
- ``religion:`` categorical variable of religious background
- ``sex:`` categorical variable of gender
- ``sign:`` categorical variable of astrological symbol
- ``smokes:`` categorical variable of smoking habits
- ``speaks:`` categorical variable of languages spoken
- ``status:`` categorical variable of relationship status
- ``last_online:`` date variable of last login
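
One way to put the variable types listed above to use is to record them explicitly and cast each group to a matching pandas dtype. The sketch below runs on a tiny hand-made frame; the sample rows and the `continuous_cols` / `categorical_cols` names are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Tiny hand-made sample (hypothetical rows, not taken from profiles.csv)
sample_df = pd.DataFrame({
    "age": [22, 35, 38],
    "height": [75.0, 70.0, 68.0],
    "body_type": ["a little extra", "average", "thin"],
    "drinks": ["socially", "often", "socially"],
    "sex": ["m", "m", "f"],
})

# Group the columns by the variable type given in the list above
continuous_cols = ["age", "height"]
categorical_cols = ["body_type", "drinks", "sex"]

# Cast the categorical columns to the pandas 'category' dtype
sample_df = sample_df.astype({col: "category" for col in categorical_cols})

# Summarize each group separately
print(sample_df[continuous_cols].describe())
print(sample_df[categorical_cols].describe())
```
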
##### And a set of open-ended short-answer responses to:
- ``essay0:`` My self summary
- ``essay1:`` What I’m doing with my life
- ``essay2:`` I’m really good at
- ``essay3:`` The first thing people usually notice about me
- ``essay4:`` Favorite books, movies, show, music, and food
- ``essay5:`` The six things I could never do without
- ``essay6:`` I spend a lot of time thinking about
- ``essay7:`` On a typical Friday night I am
- ``essay8:`` The most private thing I am willing to admit
- ``essay9:`` You should message me if…
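
The essay answers contain HTML remnants such as `<br />` tags and `&rsquo;` entities, so they usually need cleaning before any text analysis. The following is a minimal sketch, assuming the essays are merged into a single field; the `essays_df` sample, the `all_essays` column and the `clean_essay` helper are hypothetical names introduced here.

```python
import html
import re

import pandas as pd

# Hypothetical two-profile sample that mimics the markup seen in the essays
essays_df = pd.DataFrame({
    "essay0": ["about me:<br />\ni love hiking", None],
    "essay1": ["i don&rsquo;t sit still", "work work work + play"],
})
essay_cols = ["essay0", "essay1"]  # the full data would use essay0 to essay9


def clean_essay(text: str) -> str:
    """Decode HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(text)            # &rsquo; becomes a right single quote
    text = re.sub(r"<[^>]+>", " ", text)  # remove tags such as <br />
    return re.sub(r"\s+", " ", text).strip()


# Treat missing essays as empty strings, join them per profile, then clean
essays_df["all_essays"] = (
    essays_df[essay_cols].fillna("").apply(" ".join, axis=1).map(clean_essay)
)
print(essays_df["all_essays"].tolist())
```
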

#### **1. Load and check the data**

In [1]:
# import libraries
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from sklearn.feature_selection import RFE

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report

# Set the theme for the seaborn plots:
#   context     - scales plot elements for the target medium (here: notebook)
#   style       - axes style, including the background and grid
#   palette     - default color palette
#   font        - font family
#   font_scale  - scaling factor applied to the font sizes
#   color_codes - remap shorthand color codes (e.g. 'b', 'g') to the palette colors
sn.set_theme(context='notebook', style='darkgrid', palette='tab10', font='sans-serif', font_scale=1, color_codes=True, rc=None)

In [8]:
# load the data (pd.read_csv already returns a DataFrame)
profiles_df = pd.read_csv('profiles.csv')

# info() prints its summary directly and returns None, so it is not wrapped in display()
profiles_df.info()
display(profiles_df.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-null  float64
 18  income       59946 non-null  int64  
 19  job 

|  | age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | ... | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 59946.0 | 54650 | 35551 | 56961 | 45866 | 53318 | 54458 | 52374 | 50308 | 48470 | ... | 59946 | 24385 | 59946 | 40025 | 39720 | 59946 | 48890 | 54434 | 59896 | 59946 |
| unique |  | 12 | 18 | 6 | 3 | 32 | 54350 | 51516 | 48635 | 43533 | ... | 199 | 15 | 3 | 15 | 45 | 2 | 48 | 5 | 7647 | 5 |
| top |  | average | mostly anything | socially | never | graduated from college/university | . | enjoying it. | listening | my smile | ... | san francisco, california | doesn&rsquo;t have kids | straight | likes dogs and likes cats | agnosticism | m | gemini and it&rsquo;s fun to think about | no | english | single |
| freq |  | 14652 | 16585 | 41780 | 37724 | 23959 | 12 | 61 | 82 | 529 | ... | 31064 | 7560 | 51606 | 14814 | 2724 | 35829 | 1782 | 43896 | 21828 | 55697 |
| mean | 32.34029 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| std | 9.452779 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| min | 18.0 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 25% | 26.0 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 50% | 30.0 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 75% | 37.0 |  |  |  |  |  |  |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
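
The non-null counts above show that several columns are far from complete; `offspring`, for example, has 24,385 values out of 59,946, so roughly 59% are missing. A quick way to rank columns by their share of missing values is sketched below on a hypothetical miniature frame (`mini_df` stands in for `profiles_df`).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for profiles_df
mini_df = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["strictly anything", np.nan, "anything", np.nan],
    "offspring": [np.nan, np.nan, np.nan, "doesn't want kids"],
})

# Share of missing values per column, largest first
missing_share = mini_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on `profiles_df` would give the corresponding ranking for the full dataset.
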


In [9]:
display(profiles_df.head())

|  | age | body_type | diet | drinks | drugs | education | essay0 | essay1 | essay2 | essay3 | ... | location | offspring | orientation | pets | religion | sex | sign | smokes | speaks | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | a little extra | strictly anything | socially | never | working on college/university | about me:<br />\n<br />\ni would love to think... | currently working as an international agent fo... | making people laugh.<br />\nranting about a go... | the way i look. i am a six foot half asian, ha... | ... | south san francisco, california | doesn&rsquo;t have kids, but might want them | straight | likes dogs and likes cats | agnosticism and very serious about it | m | gemini | sometimes | english | single |
| 1 | 35 | average | mostly other | often | sometimes | working on space camp | i am a chef: this is what that means.<br />\n1... | dedicating everyday to being an unbelievable b... | being silly. having ridiculous amonts of fun w... |  | ... | oakland, california | doesn&rsquo;t have kids, but might want them | straight | likes dogs and likes cats | agnosticism but not too serious about it | m | cancer | no | english (fluently), spanish (poorly), french (... | single |
| 2 | 38 | thin | anything | socially |  | graduated from masters program | i'm not ashamed of much, but writing public te... | i make nerdy software for musicians, artists, ... | improvising in different contexts. alternating... | my large jaw and large glasses are the physica... | ... | san francisco, california |  | straight | has cats |  | m | pisces but it doesn&rsquo;t matter | no | english, french, c++ | available |
| 3 | 23 | thin | vegetarian | socially |  | working on college/university | i work in a library and go to school. . . | reading things written by old dead people | playing synthesizers and organizing books acco... | socially awkward but i do my best | ... | berkeley, california | doesn&rsquo;t want kids | straight | likes cats |  | m | pisces | no | english, german (poorly) | single |
| 4 | 29 | athletic |  | socially | never | graduated from college/university | hey how's it going? currently vague on the pro... | work work work work + play | creating imagery to look at:<br />\nhttp://bag... | i smile a lot and my inquisitive nature | ... | san francisco, california |  | straight | likes dogs and likes cats |  | m | aquarius | no | english | single |
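
The preview shows that some categorical columns carry extra text: `sign` can include a qualifier and an HTML entity ("pisces but it doesn&rsquo;t matter"), and `speaks` can list several languages in one string. A possible way to normalize such values is sketched below; the `sign_clean` and `n_languages` column names are hypothetical, and keeping only the first word of `sign` is an assumption about what the analysis needs.

```python
import html

import pandas as pd

# Values copied from rows 0, 2 and 4 of the preview above
preview_df = pd.DataFrame({
    "sign": ["gemini", "pisces but it doesn&rsquo;t matter", "aquarius"],
    "speaks": ["english", "english, french, c++", "english"],
})

# Decode HTML entities (skipping missing values), then keep the first word as the base sign
preview_df["sign_clean"] = (
    preview_df["sign"].map(html.unescape, na_action="ignore").str.split().str[0]
)

# Count how many languages each profile lists
preview_df["n_languages"] = preview_df["speaks"].str.split(",").str.len()
print(preview_df[["sign_clean", "n_languages"]])
```
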


#### **2. Exploratory Data Analysis**

#### **3. Select the ML model**

#### **4. Prepare the data**

#### **5. Build the model(s)**

#### **6. Train and Evaluate the model(s)**

#### **7. Conclusions**