# How do we predict a potential match in dating?

Jingjin Wei, Lei Zhang

### Background

Imagine that we run a social search app like Tinder with millions of users. Each user registers, writes a short personal description, and looks for a match. The question is how to predict a match. More precisely, for a given user of our app, whom should we recommend?



### Data description

The dataset we use was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment. In speed dating, participants have a four-minute conversation with each partner and then decide whether they are interested in that person. The subjects are students from the graduate and professional schools of Columbia University. Before the event, every registered participant filled out an online form with basic personal information (age, gender, religion, etc.) and rated themselves on attributes such as attractiveness. Fourteen rounds were conducted from 2002 to 2004, with a varying number of participants. After each date, both participants decide whether they want to meet the partner again. Only if both say 'yes' do we call it a 'match'.
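The match rule can be sketched in a few lines. This is a minimal illustration on made-up rows, assuming the participant's decision is stored in a column named `dec` and the partner's decision in a column named `dec_o`; the values below are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy decisions for three dates (made-up values).
# `dec` is the participant's answer, `dec_o` the partner's (1 = yes, 0 = no).
dates = pd.DataFrame({
    "iid":   [1, 1, 2],
    "pid":   [11, 12, 11],
    "dec":   [1, 1, 0],
    "dec_o": [1, 0, 1],
})

# A match requires a 'yes' from both sides.
dates["match"] = ((dates["dec"] == 1) & (dates["dec_o"] == 1)).astype(int)
print(dates)
```

Only the first toy date counts as a match, because it is the only one where both sides said yes.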

### Besides training a model to predict whether a male and a female will match, here are some interesting, intuitive questions:
 
(1) What are the most important factors when men or women make their decisions?

(2) How do men and women differ when they choose their partners?

(3) Does a factor such as age difference affect one's decision positively or negatively?
 
To answer these questions, we first standardize every feature over the roughly 8k samples to zero mean and unit variance before training the model. Under this scaling, a "NaN" in the data, which means the participant did not provide the information, can be treated as zero, that is, as the feature's mean. After training, the magnitude of each coefficient indicates the importance of the corresponding feature, and its sign shows whether the feature raises or lowers the probability of a match.
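The claim that a missing value can be treated as zero after standardization can be checked on a small example. The sketch below uses a hypothetical `ages` series with made-up values and shows that filling NaN with 0 after standardizing gives the same result as filling NaN with the observed mean before scaling.

```python
import numpy as np
import pandas as pd

def standardize(series):
    """Scale to zero mean and unit variance; pandas skips NaN in mean/std."""
    return (series - series.mean()) / series.std(ddof=0)

# Made-up values with one missing entry.
ages = pd.Series([22.0, 25.0, np.nan, 31.0])

# Option A: standardize first, then replace NaN with 0.
z_then_fill = standardize(ages).fillna(0.0)

# Option B: replace NaN with the observed mean, then scale with the observed statistics.
observed = ages.dropna()
mean_fill = (ages.fillna(observed.mean()) - observed.mean()) / observed.std(ddof=0)

print(np.allclose(z_then_fill, mean_fill))
```

Both options place the missing entry exactly at the centre of the standardized feature, so it contributes nothing to the linear term of the model.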
 

For both the male and the female model, we use the first 1000 rows as test data and the remaining rows (over 3500) as training data.
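A minimal sketch of this positional split is shown below. The helper `split_first_n` and the synthetic `pairs` frame are hypothetical names introduced only for illustration; the row count of 4500 is an assumption chosen to mirror the sizes mentioned above.

```python
import numpy as np
import pandas as pd

def split_first_n(df, n_test=1000):
    """Return (test, train): the first n_test rows and the remainder."""
    return df.iloc[:n_test], df.iloc[n_test:]

# Synthetic stand-in for the pairwise feature table (values are random).
rng = np.random.default_rng(0)
pairs = pd.DataFrame({
    "age_diff": rng.normal(size=4500),
    "match": rng.integers(0, 2, size=4500),
})

test, train = split_first_n(pairs)
print(len(test), len(train))
```

Because the rows are ordered by participant id, a positional split puts the lowest-numbered participants in the test set. A shuffled split, for example with `sklearn.model_selection.train_test_split`, would be one alternative if that ordering is a concern.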

### Part I. Male

In [16]:

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
# sklearn.cross_validation was removed from newer scikit-learn releases and is not used below
from sklearn import linear_model

%matplotlib inline
data_df = pd.read_csv("Speed Dating Data.csv", encoding="ISO-8859-1")
data_df.head()
fields = data_df.columns
num_dates_per_male = data_df[data_df.gender == 1].groupby('iid').apply(len)
num_dates_per_female = data_df[data_df.gender == 0].groupby('iid').apply(len)


# 1.Preprocessing
def str_to_float(series):
    return series.apply(lambda x: str(x).replace(",", "")).astype('float64')

for trait in ['mn_sat', 'tuition', 'income']:
    data_df[trait] = str_to_float(data_df[trait])

    
data_df['pid'] = data_df['pid'].fillna(-1.0).astype('int64')  # Invalid PID as -1

# standardize features
def standardize_feature(series):
    return (series - series.mean()) / series.std(ddof=0)
    
    
# 2.PROFILE OF THE PERSON
#     2.1 'tuition' + 'income' -> 'financial'

# fill out the nan in tuition and income to be mean value.
data_df['tuition']=data_df['tuition'].fillna(data_df['tuition'].mean())
data_df['income']=data_df['income'].fillna(data_df['income'].mean())


# standardize -> plus, not used

# data_df['financial'] = standardize_feature(data_df['tuition']) \
#                        .add(standardize_feature(data_df['income']), fill_value=0.0)

# plus -> standardize
data_df['financial'] = standardize_feature((data_df['tuition']) \
                       .add((data_df['income'])))

    
    
#     2.2 'date' + 'go_out' -> 'experience' 
#     fill nan
data_df['date']=data_df['date'].fillna(data_df['date'].mean())
data_df['go_out']=data_df['go_out'].fillna(data_df['go_out'].mean())

#     importance bet/ date and go_out. giving date more weight.
a=5
b=1
data_df['experience'] = a*data_df['date'] + b*data_df['go_out']
data_df['experience'] = standardize_feature(data_df['experience'])


#     2.3 'mn_sat' -> 'intelligence'
data_df['int'] = standardize_feature(data_df['mn_sat']);
data_df['int'] = data_df['int'].fillna(value = 0)

# 3 pairwise features for male participants

#     3.0
# divide them into female data and male data
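# .copy() makes the later column assignments act on independent frames, not on slices of data_df
data_df_m = data_df[data_df.gender == 1].copy()  # data of male
data_df_f = data_df[data_df.gender == 0].copy()  # data of female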


# Create a dataframe containing information for each person that needs to be looked up
# profiles = data_df[['iid', 'mn_sat', 'goal', 'field_cd', 'financial', 'experience']]\
#            .set_index(keys='iid').drop_duplicates()
    
profiles_m = data_df_m[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
           .set_index(keys='iid').drop_duplicates()
for trait in ['int', 'financial', 'experience']:
#     profiles_m[trait] = profiles_m[trait].fillna(profiles_m[trait].mean())  # Fill NaN values with mean
    profiles_m[trait] = profiles_m[trait].fillna(value=0)


# data_df_m = data_df_m.fillna(data_df_m.mean())

profiles_f = data_df_f[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
           .set_index(keys='iid').drop_duplicates()
for trait in ['int', 'financial', 'experience']:
#     profiles_f[trait] = profiles_f[trait].fillna(profiles_f[trait].mean())  # Fill NaN values with mean
    profiles_f[trait] = profiles_f[trait].fillna(value=0)


##########################################################################

#     3.1 age difference =  male age - female age
data_df_m['age_diff'] = data_df_m['age'].sub(data_df_m['age_o'])  # Age difference
# If nan, set the value to be 0.
data_df_m['age_diff'] = data_df_m['age_diff'].fillna(value=0)
data_df_m['age_diff'] = standardize_feature(data_df_m['age_diff'])

#     3.2 same field
def is_similar_profession(x, profiles):
    if np.isnan(x['field_cd']) or np.isnan(x['pid']) or x['pid'] not in profiles.index or \
    x['field_cd'] != profiles.loc[x['pid']]['field_cd']:
        return -1
    else:
        return int(x['field_cd'] == profiles.loc[x['pid']]['field_cd'])
    
data_df_m['sim_profession'] = data_df_m[['field_cd', 'pid']]\
                            .apply(lambda x: is_similar_profession(x, profiles_f), axis=1)


#     3.3 same career
def is_similar_career(x, profiles):
    if np.isnan(x['career_c']) or np.isnan(x['pid']) or x['pid'] not in profiles.index or\
        x['career_c'] != profiles.loc[x['pid']]['career_c']:
        return -1
    else:
        return int(x['career_c'] == profiles.loc[x['pid']]['career_c'])
    
data_df_m['sim_career'] = data_df_m[['career_c', 'pid']]\
                            .apply(lambda x: is_similar_career(x, profiles_f), axis=1)
    
    
    
    

#     3.4 basic traits difference (standardized)
    
def trait_difference(trait):
    trait_other = data_df_m['pid'].apply(lambda x: profiles_f.loc[x][trait] if x in profiles_f.index else None)
    return data_df_m[trait].sub(trait_other)
    
# basic trait difference : male - female
for trait in ['int','experience', 'financial']:
    string= trait + '_diff'
    data_df_m[string] = trait_difference(trait)
    data_df_m[string] = data_df_m[string].fillna(value=0) 

data_df_m.loc[data_df_m['pid']==21].loc[data_df_m['iid']==40]
# data_df_m.loc[data_df_m['iid']==11]['pid']

############################################################
#     3.5.1 preprocess. 
attr_exp = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']
attr_o = ['attr1_3', 'sinc1_3', 'intel1_3', 'fun1_3', 'amb1_3', 'shar1_3']

# profiles_f = data_df_f[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
#            .set_index(keys='iid').drop_duplicates()

# attr comes from original data, contains only the pair ids and attributes we need.
attr = data_df[['iid','gender','pid','attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1',\
               'attr1_3', 'sinc1_3', 'intel1_3', 'fun1_3', 'amb1_3', 'shar1_3']].set_index(keys='iid').drop_duplicates()

# data_norm['sum1']=data_norm[f1].sum(axis=1)
attr['attr1_sum']=attr[attr_exp].sum(axis=1)
attr['attr3_sum']=attr[attr_o].sum(axis=1)

#############################################################

# .copy() so that adding the 'rating' column does not write into a slice of data_df_m
df_train = data_df_m[['iid','pid','match', 'age_diff','sim_profession','sim_career','int_diff','experience_diff','financial_diff']].copy()
df_train['rating'] = np.zeros(len(df_train))

attr_exp_n = []
attr_o_n = []

for trait in attr_exp:
    
    attr[trait + '_n'] = attr[trait]/attr['attr1_sum'] 
    attr_exp_n.append(trait+'_n')

    
for trait in attr_o:
    
    attr[trait + '_n'] = attr[trait]/attr['attr3_sum'] 
    attr_o_n.append(trait+'_n')


    #######################################################
    
    
attr[attr_exp_n] = attr[attr_exp_n].fillna(value=1.0/6).astype('float64')
attr[attr_o_n] = attr[attr_o_n].fillna(value=1.0/6).astype('float64')

attr_m = attr[attr.gender == 1];
attr_f = attr[attr.gender == 0];


for i in attr_m.index.drop_duplicates():
    for j in attr_m.loc[i].pid:
        # j = female iid, (i,j) makes a pair
        temp1=0
        temp2=0
        temp3=0
        for k in np.arange(0,6):

            if i not in attr_m.index or \
                j not in attr_f.index:
                
                temp1 = 0
                temp2 = 0
            else:

                temp1 = attr_m.loc[attr_m['pid']==j].loc[i][attr_exp_n[k]]
                temp2 = attr_f.loc[attr_f['pid']==i].loc[j][attr_o_n[k]]
            temp3 += temp1*temp2

        get_index = df_train.loc[df_train['iid']==i].loc[df_train['pid']==j].index
        
        # DataFrame.set_value was removed from pandas; .loc does the same assignment
        df_train.loc[get_index, 'rating'] = temp3


df_train['rating']=standardize_feature((df_train['rating']))   



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

### Final data

Here is our final data.

iid: id of the participant (male here)

pid: id of the partner (female here)

match: whether both the participant and the partner said they would like to meet again

age_diff: age_of_male - age_of_female

sim_profession: whether they have a similar profession (1 if their `field_cd` is the same, -1 otherwise)

sim_career: whether they have a similar career plan (1 if their `career_c` is the same, -1 otherwise)

int_diff: the difference in the average SAT score of their undergraduate colleges, used as a measure of 'intelligence difference'

experience_diff: the difference in the participants' social-life experience, which combines how often they hang out with friends and how often they go on dates

financial_diff: the difference in a wealth indicator that combines tuition with the average income of the participant's birthplace

rating: rating = expectation_of_participant * self_rating_of_partner, summed over the six attributes. attr_exp_n is the normalized expectation for each attribute, and attr_o_n is the normalized self-rating.
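The rating feature is a dot product between two normalized six-element vectors. The sketch below illustrates the idea on made-up scores; the helper `normalize` and the numbers are assumptions for the example, and the handling of missing values (filled with 1/6 in the notebook) is left out for brevity.

```python
import numpy as np

def normalize(scores):
    """Rescale attribute scores so that they sum to 1."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

# Order: attractive, sincere, intelligent, fun, ambitious, shared interests.
# Both vectors are made-up values for illustration.
expectation = normalize([35, 20, 20, 20, 0, 5])   # what the participant looks for
self_rating = normalize([6, 8, 8, 8, 7, 5])       # how the partner rates themself

# The rating is large when the partner is strong where the participant cares most.
rating = float(np.dot(expectation, self_rating))
print(round(rating, 4))
```

Because both vectors sum to 1, the raw rating always lies between 0 and 1 before it is standardized across all pairs.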

In [17]:
df_train_all=df_train
df_test=df_train_all[:1000]
print(df_test)
df_train=df_train_all[1000:]
print(df_train)
df_train.head()

      iid  pid  match  age_diff  sim_profession  sim_career  int_diff  \
100    11    1      0  1.172198              -1          -1       0.0   
101    11    2      0  0.531540              -1          -1       0.0   
102    11    3      0  0.317987              -1          -1       0.0   
103    11    4      0  0.745093              -1          -1       0.0   
104    11    5      0  1.172198              -1          -1       0.0   
105    11    6      0  0.745093              -1          -1       0.0   
106    11    7      0  0.958645              -1          -1       0.0   
107    11    8      0  0.317987              -1          -1       0.0   
108    11    9      0  0.104434              -1          -1       0.0   
109    11   10      0  0.104434              -1          -1       0.0   
110    12    1      0  0.104434               1          -1       0.0   
111    12    2      0 -0.536224               1          -1       0.0   
112    12    3      0 -0.749777              -1    

| | iid | pid | match | age_diff | sim_profession | sim_career | int_diff | experience_diff | financial_diff | rating |
|--|--|--|--|--|--|--|--|--|--|--|
| 2199 | 160 | 157 | 0 | 0.104434 | 1 | -1 | 0.0 | -0.524824 | -0.80195 | -0.145104 |
| 2200 | 161 | 142 | 0 | -0.536224 | -1 | -1 | 0.0 | -0.65603 | 0.229107 | 0.054959 |
| 2201 | 161 | 143 | 0 | -2.4582 | -1 | -1 | 0.0 | -2.230502 | 0.672587 | -0.145104 |
| 2202 | 161 | 144 | 0 | -0.322671 | 1 | -1 | 0.0 | -0.65603 | 1.854359 | -0.112535 |
| 2203 | 161 | 145 | 0 | -0.536224 | 1 | -1 | 0.0 | -1.836884 | 1.135466 | -0.145126 |


### Logistic Regression:

We use the first 1000 rows as test data and the remaining rows (over 3500) as training data.

In [18]:
## Logistic regression with scikit-learn
from sklearn.linear_model import LogisticRegression

features = ['age_diff', 'sim_profession', 'sim_career', 'int_diff',
            'experience_diff', 'financial_diff', 'rating']
X = df_train[features]
y = df_train["match"]
Xtest = df_test[features]
ytest = df_test["match"]

# A large C means very weak regularization, so this fit is effectively unregularized.
clf_LR = LogisticRegression(C=1000)
clf_LR.fit(X, y)
print("No-R Single Logistic Regression accuracy:", clf_LR.score(Xtest, ytest))

# The l1 penalty needs a solver that supports it, such as liblinear.
clf_l1_LR = LogisticRegression(C=1, penalty='l1', solver='liblinear')
clf_l1_LR.fit(X, y)
print("Logistic Regression accuracy with l1 penalty:", clf_l1_LR.score(Xtest, ytest))

clf_l2_LR = LogisticRegression(C=1, penalty='l2')
clf_l2_LR.fit(X, y)
print("Logistic Regression accuracy with l2 penalty:", clf_l2_LR.score(Xtest, ytest))

print("parameter of Male")
print(tuple(features))
print(clf_LR.coef_)

No-R Single Logistic Regression accuracy: 0.823
Logistic Regression accuracy with l1 penalty: 0.823
Logistic Regression accuracy with l2 penalty: 0.823
parameter of Male
('age_diff', 'sim_profession', 'sim_career', 'int_diff', 'experience_diff', 'financial_diff', 'rating')
[[ 0.00387055  0.26222104 -0.00373471  0.03390501 -0.11622529 -0.13160005
   0.00503269]]
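The printed coefficient array is easier to read when each value is paired with its feature name and sorted by magnitude. The sketch below does this for the male model, with the values copied from the output above; the variable names are introduced only for this illustration.

```python
import numpy as np

feature_names = ['age_diff', 'sim_profession', 'sim_career', 'int_diff',
                 'experience_diff', 'financial_diff', 'rating']

# Values copied from the printed clf_LR.coef_ output of the male model.
coef = np.array([0.00387055, 0.26222104, -0.00373471, 0.03390501,
                 -0.11622529, -0.13160005, 0.00503269])

# Sort by absolute value so the most influential features come first.
ranked = sorted(zip(feature_names, coef), key=lambda pair: abs(pair[1]), reverse=True)
for name, value in ranked:
    print(f"{name:>16s}  {value:+.4f}")
```

Comparing magnitudes this way is only meaningful because the continuous features were standardized; `sim_profession` and `sim_career` are coded as -1/1 and are not standardized, so their scale is close but not identical.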


### Part II. Female Result

In [20]:
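# Roles are swapped in this part: the `_m` variables now hold the participants (female)
# and the `_f` variables hold the partners (male), so the code from Part I can be reused.
data_df_m = data_df[data_df.gender == 0].copy()  # data of female
data_df_f = data_df[data_df.gender == 1].copy()  # data of male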


# Create a dataframe containing information for each person that needs to be looked up
# profiles = data_df[['iid', 'mn_sat', 'goal', 'field_cd', 'financial', 'experience']]\
#            .set_index(keys='iid').drop_duplicates()
    
profiles_m = data_df_m[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
           .set_index(keys='iid').drop_duplicates()
for trait in ['int', 'financial', 'experience']:
#     profiles_m[trait] = profiles_m[trait].fillna(profiles_m[trait].mean())  # Fill NaN values with mean
    profiles_m[trait] = profiles_m[trait].fillna(value=0)


# data_df_m = data_df_m.fillna(data_df_m.mean())

profiles_f = data_df_f[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
           .set_index(keys='iid').drop_duplicates()
for trait in ['int', 'financial', 'experience']:
#     profiles_f[trait] = profiles_f[trait].fillna(profiles_f[trait].mean())  # Fill NaN values with mean
    profiles_f[trait] = profiles_f[trait].fillna(value=0)


##########################################################################

#     3.1 age difference = participant (female) age - partner (male) age
data_df_m['age_diff'] = data_df_m['age'].sub(data_df_m['age_o'])  # Age difference
# If nan, set the value to be 0.
data_df_m['age_diff'] = data_df_m['age_diff'].fillna(value=0)
data_df_m['age_diff'] = standardize_feature(data_df_m['age_diff'])

#     3.2 same field
def is_similar_profession(x, profiles):
    if np.isnan(x['field_cd']) or np.isnan(x['pid']) or x['pid'] not in profiles.index or \
    x['field_cd'] != profiles.loc[x['pid']]['field_cd']:
        return -1
    else:
        return int(x['field_cd'] == profiles.loc[x['pid']]['field_cd'])
    
data_df_m['sim_profession'] = data_df_m[['field_cd', 'pid']]\
                            .apply(lambda x: is_similar_profession(x, profiles_f), axis=1)


#     3.3 same career
def is_similar_career(x, profiles):
    if np.isnan(x['career_c']) or np.isnan(x['pid']) or x['pid'] not in profiles.index or\
        x['career_c'] != profiles.loc[x['pid']]['career_c']:
        return -1
    else:
        return int(x['career_c'] == profiles.loc[x['pid']]['career_c'])
    
data_df_m['sim_career'] = data_df_m[['career_c', 'pid']]\
                            .apply(lambda x: is_similar_career(x, profiles_f), axis=1)
    
    
    
    

#     3.4 basic traits difference (standardized)
    
def trait_difference(trait):
    trait_other = data_df_m['pid'].apply(lambda x: profiles_f.loc[x][trait] if x in profiles_f.index else None)
    return data_df_m[trait].sub(trait_other)
    
# basic trait difference : participant (female) - partner (male)
for trait in ['int','experience', 'financial']:
    string= trait + '_diff'
    data_df_m[string] = trait_difference(trait)
    data_df_m[string] = data_df_m[string].fillna(value=0) 

# data_df_m.loc[data_df_m['pid']==21].loc[data_df_m['iid']==40]
# data_df_m.loc[data_df_m['iid']==11]['pid']

############################################################
#     3.5.1 preprocess. 
attr_exp = ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']
attr_o = ['attr1_3', 'sinc1_3', 'intel1_3', 'fun1_3', 'amb1_3', 'shar1_3']

# profiles_f = data_df_f[['iid', 'int',  'field_cd', 'financial', 'experience','career_c']]\
#            .set_index(keys='iid').drop_duplicates()

# attr comes from original data, contains only the pair ids and attributes we need.
attr = data_df[['iid','gender','pid','attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1',\
               'attr1_3', 'sinc1_3', 'intel1_3', 'fun1_3', 'amb1_3', 'shar1_3']].set_index(keys='iid').drop_duplicates()

# data_norm['sum1']=data_norm[f1].sum(axis=1)
attr['attr1_sum']=attr[attr_exp].sum(axis=1)
attr['attr3_sum']=attr[attr_o].sum(axis=1)

#############################################################

# .copy() so that adding the 'rating' column does not write into a slice of data_df_m
df_train = data_df_m[['iid','pid','match', 'age_diff','sim_profession','sim_career','int_diff','experience_diff','financial_diff']].copy()
df_train['rating'] = np.zeros(len(df_train))

attr_exp_n = []
attr_o_n = []

for trait in attr_exp:
    
    attr[trait + '_n'] = attr[trait]/attr['attr1_sum'] 
    attr_exp_n.append(trait+'_n')

    
for trait in attr_o:
    
    attr[trait + '_n'] = attr[trait]/attr['attr3_sum'] 
    attr_o_n.append(trait+'_n')


    #######################################################
    
    
attr[attr_exp_n] = attr[attr_exp_n].fillna(value=1.0/6).astype('float64')
attr[attr_o_n] = attr[attr_o_n].fillna(value=1.0/6).astype('float64')

attr_m = attr[attr.gender == 0];
attr_f = attr[attr.gender == 1];


for i in attr_m.index.drop_duplicates():
    for j in attr_m.loc[i].pid:
        # j = male iid (the partner), (i,j) makes a pair
        temp1=0
        temp2=0
        temp3=0
        for k in np.arange(0,6):

            if i not in attr_m.index or \
                j not in attr_f.index:
                
                temp1 = 0
                temp2 = 0
            else:

                temp1 = attr_m.loc[attr_m['pid']==j].loc[i][attr_exp_n[k]]
                temp2 = attr_f.loc[attr_f['pid']==i].loc[j][attr_o_n[k]]
            temp3 += temp1*temp2

        get_index = df_train.loc[df_train['iid']==i].loc[df_train['pid']==j].index
        
        # DataFrame.set_value was removed from pandas; .loc does the same assignment
        df_train.loc[get_index, 'rating'] = temp3


df_train['rating']=standardize_feature((df_train['rating']))   
df_train_all=df_train
df_test=df_train_all[:1000]
print(df_test)
df_train=df_train_all[1000:]
print(df_train)
df_train.head()

      iid  pid  match  age_diff  sim_profession  sim_career  int_diff  \
0       1   11      0 -1.170556              -1          -1       0.0   
1       1   12      0 -0.104051               1          -1       0.0   
2       1   13      1 -0.104051               1          -1       0.0   
3       1   14      1 -0.317352               1          -1       0.0   
4       1   15      1 -0.530653               1          -1       0.0   
5       1   16      0 -0.743954              -1          -1       0.0   
6       1   17      0 -1.810459              -1          -1       0.0   
7       1   18      0 -1.170556              -1          -1       0.0   
8       1   19      1 -1.383857              -1          -1       0.0   
9       1   20      0 -0.530653              -1          -1       0.0   
10      2   11      0 -0.530653              -1          -1       0.0   
11      2   12      0  0.535853               1          -1       0.0   
12      2   13      0  0.535853               1    

| | iid | pid | match | age_diff | sim_profession | sim_career | int_diff | experience_diff | financial_diff | rating |
|--|--|--|--|--|--|--|--|--|--|--|
| 1953 | 145 | 167 | 0 | 0.322552 | 1 | 1 | 0.0 | 0.524824 | -0.666122 | -0.097389 |
| 1954 | 145 | 168 | 0 | 0.322552 | 1 | -1 | 0.0 | 1.836884 | 1.236769 | -0.097389 |
| 1955 | 145 | 169 | 0 | -0.743954 | 1 | 1 | 0.0 | 0.524824 | 0.559594 | -0.04669 |
| 1956 | 145 | 170 | 0 | -0.743954 | 1 | -1 | 0.0 | 1.049648 | -0.462879 | 0.133706 |
| 1957 | 145 | 171 | 0 | -0.957255 | -1 | -1 | 0.0 | 0.0 | 0.0 | -0.097234 |


In [21]:
df_train

| | iid | pid | match | age_diff | sim_profession | sim_career | int_diff | experience_diff | financial_diff | rating |
|--|--|--|--|--|--|--|--|--|--|--|
| 1953 | 145 | 167 | 0 | 0.322552 | 1 | 1 | 0.000000 | 0.524824 | -0.666122 | -0.097389 |
| 1954 | 145 | 168 | 0 | 0.322552 | 1 | -1 | 0.000000 | 1.836884 | 1.236769 | -0.097389 |
| 1955 | 145 | 169 | 0 | -0.743954 | 1 | 1 | 0.000000 | 0.524824 | 0.559594 | -0.046690 |
| 1956 | 145 | 170 | 0 | -0.743954 | 1 | -1 | 0.000000 | 1.049648 | -0.462879 | 0.133706 |
| 1957 | 145 | 171 | 0 | -0.957255 | -1 | -1 | 0.000000 | 0.000000 | 0.000000 | -0.097234 |
| 1958 | 145 | 172 | 0 | -0.530653 | -1 | -1 | 0.000000 | 2.624119 | -0.462879 | -0.097389 |
| 1959 | 145 | 173 | 0 | 0.535853 | -1 | 1 | 0.000000 | 1.968090 | -0.462879 | -0.097389 |
| 1960 | 146 | 158 | 0 | 2.882165 | -1 | 1 | 0.000000 | 0.524824 | 1.631915 | -0.039707 |
| 1961 | 146 | 159 | 0 | 1.815659 | -1 | 1 | 0.000000 | 0.524824 | 1.053498 | -0.046793 |
| 1962 | 146 | 160 | 0 | 1.602358 | -1 | -1 | 0.000000 | 1.180854 | 2.313936 | -0.097389 |

In [22]:
## Logistic regression with scikit-learn
from sklearn.linear_model import LogisticRegression

features = ['age_diff', 'sim_profession', 'sim_career', 'int_diff',
            'experience_diff', 'financial_diff', 'rating']
X = df_train[features]
y = df_train["match"]
Xtest = df_test[features]
ytest = df_test["match"]

# A large C means very weak regularization, so this fit is effectively unregularized.
clf_LR = LogisticRegression(C=1000)
clf_LR.fit(X, y)
print("No-R Single Logistic Regression accuracy:", clf_LR.score(Xtest, ytest))

# The l1 penalty needs a solver that supports it, such as liblinear.
clf_l1_LR = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
clf_l1_LR.fit(X, y)
print("Logistic Regression accuracy with l1 penalty:", clf_l1_LR.score(Xtest, ytest))

clf_l2_LR = LogisticRegression(C=0.1, penalty='l2')
clf_l2_LR.fit(X, y)
print("Logistic Regression accuracy with l2 penalty:", clf_l2_LR.score(Xtest, ytest))

print("parameter of Female")
print(tuple(features))
print(clf_LR.coef_)

No-R Single Logistic Regression accuracy: 0.82
Logistic Regression accuracy with l1 penalty: 0.82
Logistic Regression accuracy with l2 penalty: 0.82
parameter of Female
('age_diff', 'sim_profession', 'sim_career', 'int_diff', 'experience_diff', 'financial_diff', 'rating')
[[-0.01748436  0.28766951 -0.02164836 -0.00968521  0.13286699  0.08635026
  -0.01038006]]


### Conclusion:
Here are the results from logistic regression:

 Male accuracy: 0.823
 
 Female accuracy: 0.82


|Coeff | age_diff|    sim_profession |  sim_career |   int_diff  | experience_diff  | financial_diff   |   rating|
|--|--------------|----------------|----------------|------------|------------------|-----------------|----------|
|Male |  0.00387055  |  0.26222104  |   -0.00373471 | 0.03390501 |  -0.11622529    |   -0.13160005    |   0.00503269|
|Female | -0.01748436  | 0.28766951  |   -0.02164836 |  -0.00968521 | 0.13286699    |    0.08635026   |   -0.01038006|
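A logistic regression coefficient is a change in log-odds, so it can be turned into an odds ratio with an exponential. The sketch below does this for `sim_profession`, using the coefficients from the table and the fact that this feature is coded as -1 or 1, so moving from "different" to "same" changes the feature by 2. The variable names are introduced only for this illustration, and the calculation holds the other features fixed.

```python
import math

# Coefficients of sim_profession copied from the table above.
coef_sim_profession = {"male": 0.26222104, "female": 0.28766951}

odds_ratio = {}
for group, coef in coef_sim_profession.items():
    # The feature moves from -1 (different) to 1 (same), a change of 2 units.
    delta_log_odds = coef * (1 - (-1))
    odds_ratio[group] = math.exp(delta_log_odds)
    print(f"{group}: odds of a match multiply by about {odds_ratio[group]:.2f}")
```

An odds ratio above 1 means the odds of a match go up when both people share a profession; the same recipe with a one-unit change applies to the standardized features.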

In conclusion, this is a material world!

1. The rating is the only subjective factor among these 7 features. It reflects how well the partner's attributes (personal characteristics such as attractiveness, intelligence, fun, and ambition) fit what the participant is looking for. Yet its coefficient is very small in magnitude for both men and women.
2. sim_profession, which indicates whether the participant and the partner have the same profession, is the most important factor. It is not hard to imagine that two people from the same background have more to talk about on a first date and are more likely to say 'yes'.
3. experience_diff: this feature compares the participant's and the partner's experience in social life, combining how often they hang out with friends and how often they go on dates. In the dataset's coding, smaller values of `date` and `go_out` mean more frequent activity, so a negative difference (participant minus partner) means that the participant is the more socially active one. The coefficient is negative for men and positive for women, so in both models a match is more likely when the man is more socially experienced than the woman.
4. financial_diff: `financial` is an indicator of the family's wealth. The coefficient is negative for men (male minus female) and positive for women (female minus male), so, read literally, both models make a match more likely when the woman's indicator is higher than the man's. This runs against the mainstream idea that the man should be the wealthier partner, although the effect is modest in size.


   
