# Unlocking the Key to Finding Your Match

## Introduction
Struggling to find a match in the modern dating scene, I became determined to uncover the secret to success. My mom always assured me I was handsome, so looks couldn't be the issue. Could it be because I'm under 6 feet tall? My ethnicity? Or maybe my job title? Armed with my newfound knowledge of machine learning, I set out to dive deep into dating data and discover the hidden ingredients for finding my perfect match.

## Project Topic
Understanding the factors that lead to success on a second date can provide valuable insights into what I need to improve to find my ideal match. In this analysis, I employed several supervised machine learning algorithms, including logistic regression, random forest, and gradient boosting models, to predict the traits that make each gender more attractive for a match. The dataset used for this analysis includes various attributes related to dating. By training predictive models on this data, I aim not only to enhance my knowledge of machine learning but also to apply these findings to my personal quest for the perfect match.

## About The Data
The data comes from the Speed Dating Experiment dataset on Kaggle: https://www.kaggle.com/annavictoria/speed-dating-experiment
The data was gathered from 552 participants in experimental speed dating events from 2002-2004.
During the events, each attendee had a four-minute "first date" with every other participant of the opposite sex.
At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes:
- Attractiveness
- Sincerity
- Intelligence
- Fun
- Ambition
- Shared Interests

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include:
- demographics
- dating habits
- self-perception across key attributes
- beliefs on what others find valuable in a mate
- lifestyle information


## Import Packages

In [2]:
# importing packages

import pandas as pd
import numpy as np
import seaborn as sns
pd.options.display.max_rows = 1000  # display up to 1000 rows before pandas truncates the output
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics

## Initial Look
First, assess the data types and the size of the dataset. The speed dating data is stored in a CSV file (5.2 MB). See speed-dating-data-key.doc for the data dictionary and question key.

In [31]:
# importing data
speed_dating_events_data = pd.read_csv('data/speed_dating_data.csv', encoding="ISO-8859-1") # this encoding handles reading non-ASCII characters. 
speed_dating_events_data.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


In [32]:
#To identify the data types and size of the data
speed_dating_events_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8378 entries, 0 to 8377
Columns: 195 entries, iid to amb5_3
dtypes: float64(174), int64(13), object(8)
memory usage: 12.5+ MB


## Data Cleaning
There are 8378 rows of dating data with 195 columns. Many columns come from three surveys that asked participants the same questions. The first survey, which students filled out to register for the event, was completed in full, which explains why those columns contain very few NaN values. Few participants responded to the follow-up surveys, which explains the large number of NaN values in those columns.
With nearly 200 columns to work with, I decided not to drop columns from the original dataframe. Instead, I build several smaller dataframes for analysis by extracting the essential columns from it.
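
The raw null counts are easier to compare as a share of rows per column. The snippet below is a minimal sketch on a toy dataframe, not on the real data: the column names `signup_q` and `followup_q` are hypothetical stand-ins for a registration-survey column and a follow-up-survey column, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (hypothetical column names and values):
# one question answered at registration, one asked again in a follow-up survey.
toy = pd.DataFrame({
    "signup_q": [15.0, 20.0, 30.0, 25.0],
    "followup_q": [np.nan, np.nan, 20.0, np.nan],
})

# isnull() gives a boolean frame; the column mean is the fraction of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `speed_dating_events_data`, the same expression would rank all 195 columns by how sparse they are, which can help decide which columns are complete enough to extract.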

In [33]:
# counting null values
speed_dating_events_data.isnull().sum()

iid            0
id             1
gender         0
idg            0
condtn         0
wave           0
round          0
position       0
positin1    1846
order          0
partner        0
pid           10
match          0
int_corr     158
samerace       0
age_o        104
race_o        73
pf_o_att      89
pf_o_sin      89
pf_o_int      89
pf_o_fun      98
pf_o_amb     107
pf_o_sha     129
dec_o          0
attr_o       212
sinc_o       287
intel_o      306
fun_o        360
amb_o        722
shar_o      1076
like_o       250
prob_o       318
met_o        385
age           95
field         63
field_cd      82
undergra    3464
mn_sat      5245
tuition     4795
race          63
imprace       79
imprelig      79
from          79
zipcode     1064
income      4099
goal          79
date          97
go_out        79
career        89
career_c     138
sports        79
tvsports      79
exercise      79
dining        79
museums       79
art           79
hiking        79
gaming        79
clubbing      

It is important to note that each participant appears in the dataframe multiple times, once for each participant of the opposite gender they met. An exploratory analysis on this dataset as-is would therefore treat the 8378 rows as 8378 individuals, with many repetitions.

In other words, if I am Goku and I participate in a wave (speed dating event) with 10 participants of the opposite gender, I count as 10 Gokus. This can bias the analysis, so I want to create a second DataFrame with only the unique entries, giving the real number of participants: 551.
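
To make the repetition concrete, here is a small sketch on a toy pairing table. The values are invented for illustration; only the column names `iid`, `pid`, and `age` mirror the real dataset. It counts rows per participant, then keeps only person-level columns before dropping duplicates.

```python
import pandas as pd

# Toy pairing table: participant 1 met three partners, participant 2 met two.
pairs = pd.DataFrame({
    "iid": [1, 1, 1, 2, 2],
    "age": [21, 21, 21, 24, 24],   # person-level value, repeated on every row
    "pid": [11, 12, 13, 11, 12],   # partner id, different on every row
})

# Rows per participant: each person appears once per date.
rows_per_person = pairs.groupby("iid").size()

# Select only person-level columns, then collapse the repeated rows.
people = pairs[["iid", "age"]].drop_duplicates()

print(rows_per_person.to_dict())
print(len(people), people["iid"].is_unique)
```

Deduplication only yields one row per person when every selected column is constant across that person's rows, so checking `is_unique` on `iid` afterwards is a cheap sanity check.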

In [34]:
personal_attributes = ['gender', 'age', 'field',
       'race', 'imprace', 'imprelig', 'from',
       'income', 'goal', 'date', 'go_out', 'career',
       'career_c', 'sports', 'tvsports', 'exercise', 'dining', 'museums',
       'art', 'hiking', 'gaming', 'clubbing', 'reading', 'tv', 'theater',
       'movies', 'concerts', 'music', 'shopping', 'yoga', 'exphappy',
       'expnum','match_es']
decision = ['match','dec',
       'attr', 'sinc', 'intel', 'fun', 'amb', 'shar', 'like', 'prob',
       'met']
evaluation = ['satis_2', 'length', 'numdat_2']
outcome = ['you_call', 'them_cal', 'date_3', 'numdat_3',
       'num_in_3']

In [35]:
# keep one row per participant: select person-level columns, then drop the repeated rows
speed_dating_events_analysis_data = speed_dating_events_data[['iid', 'wave'] + personal_attributes + evaluation + outcome].drop_duplicates().copy()
len(speed_dating_events_analysis_data)

551

In [36]:
speed_dating_events_analysis_data.head()

Unnamed: 0,iid,wave,gender,age,field,race,imprace,imprelig,from,income,...,expnum,match_es,satis_2,length,numdat_2,you_call,them_cal,date_3,numdat_3,num_in_3
0,1,1,0,21.0,Law,4.0,2.0,4.0,Chicago,69487.0,...,2.0,4.0,6.0,2.0,1.0,1.0,1.0,0.0,,
10,2,1,0,24.0,law,2.0,2.0,5.0,Alabama,65929.0,...,5.0,3.0,5.0,2.0,,0.0,0.0,0.0,,
20,3,1,0,25.0,Economics,2.0,8.0,4.0,Connecticut,,...,2.0,,,,,,,,,
30,4,1,0,23.0,Law,2.0,1.0,1.0,Texas,37754.0,...,2.0,2.0,4.0,3.0,2.0,0.0,0.0,0.0,,
40,5,1,0,21.0,Law,2.0,8.0,1.0,Bowdoin College,86340.0,...,10.0,,7.0,2.0,2.0,0.0,0.0,0.0,,


## Exploratory Data Analysis

### Who are the participants?
Here, we explore the dataset in terms of those features that describe the participants. The goals are to get to know what is in this dataset and get some hints on what to focus on for further analysis.

These speed dating events were dedicated to partners of opposite genders.

In [37]:
speed_dating_events_analysis_data['gender'] = speed_dating_events_analysis_data.gender.map({1 : 'Male', 0 : 'Female'}).fillna(speed_dating_events_analysis_data.gender)
speed_dating_events_data['gender'] = speed_dating_events_data.gender.map({1 : 'Male', 0 : 'Female'}).fillna(speed_dating_events_data.gender)
speed_dating_events_analysis_data.gender.value_counts(dropna=False)

gender
Male      277
Female    274
Name: count, dtype: int64