# Lab 01
## Conrad Appel & Eric Hawkins

In [None]:
%matplotlib inline
import seaborn as sns
import pandas as p

The dataset we're using for this project is named "[Speed Dating Experiment](https://www.kaggle.com/annavictoria/speed-dating-experiment)" and was submitted to Kaggle by Anna Montoya. The data was collected by two Columbia professors, Ray Fisman and Sheena Iyengar, in order to discover what causes love at first sight.

The experiment was run in a few distinct phases. First, participants answered general questions about themselves, such as their age, race, gender, income, career, and interests, as well as what they look for in a date. After this questionnaire, each participant went on 10 rounds of 4-minute dates and afterwards rated each date's attributes, overall attractiveness, and whether or not they'd go on a second date. After these first 10 dates, the participants answered the same questions about what they find attractive in a date and how they think they measured up. After 10 more dates, they answered the same questions one more time. Then, three weeks later, a final follow-up questionnaire asked how many of their dates each person had met up with again, and who asked whom.

Through this data, we hope to discover the "key" to acquiring and impressing dates. We believe the dataset can show which attributes are most attractive in a person, at least when making first impressions. With the abundance of data available from this study, we may also be able to uncover other interesting traits that affect attractiveness. For example, we might be able to estimate a person's confidence by comparing the scores others gave them to the scores they expected to receive.

In [None]:
data = p.read_csv('./speeddating.csv', encoding='ISO-8859-1')
data['gender'] = data['gender'].apply(lambda v: 'Male' if v==1 else 'Female')

In [None]:
data['zipcode'].unique()[0:5]

With the exception of the zipcode field, most of the fields look like they were imported correctly by pandas. There are some other exceptions, such as the "id" column, whose datatype should have been an int instead of a float, but this shouldn't cause issues in the long run, as none of the numbers are too big or too small to be stored exactly as floats (and if it does become a problem, it will be easy to fix).
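
One way to double-check the imported types is to inspect `dtypes` directly. The sketch below uses a small made-up frame (not the real data) to show why a column of whole numbers ends up as a float when a value is missing, and how pandas' nullable `Int64` type could be used if integer ids are ever needed. That the real "id" column is a float because of missing values is an assumption here, and the `toy` frame is purely illustrative.

```python
import numpy as np
import pandas as p

# Toy stand-in for the real data: an integer-like column with one missing value.
toy = p.DataFrame({'id': [1.0, 2.0, np.nan], 'zipcode': ['60,521', '6,268', None]})

# A missing value forces pandas to store whole numbers as float64.
print(toy.dtypes)

# The nullable integer dtype stores integers and keeps the gap as <NA>.
toy['id'] = toy['id'].astype('Int64')
print(toy['id'].dtype)
```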

The zipcode field should be an "object" (really a string in this case), since zip codes are labels rather than ordered data. However, the values were malformed in the input file, leading to entries such as "60,521" and "6,268", which should be transformed to "60521" and "06268".

In [None]:
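# Leave missing zip codes as NaN; str(NaN) would otherwise be padded to '00nan'.
data['zipcode'] = data['zipcode'].apply(lambda v: v if p.isnull(v) else str(v).replace(',', '').zfill(5))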
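
An alternative to `apply` is pandas' vectorized string methods, which leave missing values untouched. This is a minimal sketch on made-up values that mirror the malformed zip codes described above; the `zips` and `cleaned` names are illustrative and not part of the notebook.

```python
import pandas as p

# Toy values mirroring the malformed zip codes, plus one missing entry.
zips = p.Series(['60,521', '6,268', None])

# The .str accessor skips missing entries, so the gap stays missing
# instead of being turned into a padded string.
cleaned = zips.str.replace(',', '', regex=False).str.zfill(5)
print(cleaned.tolist())
```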

Side-note: The fields ending with "_o" are scores that the participant's partner gave.

### Perceived Personality Traits

In the study, participants were asked to rate their partners on six individual personality traits, as well as overall attractiveness. The six traits are physical attractiveness, sincerity, intelligence, fun, ambition, and shared interests. For each trait, we compare the average score a person received from their partners with the fraction of those partners who wanted to meet them again. This should make it easy for us to visualize which personality traits are most likely to be dealbreakers during the dating process.

Originally, we used each individual's average overall attractiveness score instead. For that comparison, every trait had a correlation of nearly 1. This makes a lot of sense, since all six of these traits are highly desirable in a partner. However, that result only confirmed the obvious and offered no new insight, so we switched to the partner's binary decision.

In [None]:
overall = data.drop_duplicates('iid')

for attribute in ['dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o', 'amb_o', 'shar_o']:
    cur = data.groupby('iid')[attribute].mean()
    overall = p.merge(overall, cur.reset_index(), on='iid')

for attribute, title in [('attr_o_y', 'Attraction'), ('sinc_o_y', 'Sincerity'), ('intel_o_y', 'Intelligence'), ('fun_o_y', 'Fun'), ('amb_o_y', 'Ambition'), ('shar_o_y', 'Shared Interests')]:
    # `size` was renamed to `height` in seaborn 0.9
    tmp = sns.lmplot(x='dec_o_y', y=attribute, data=overall, height=3, col='gender')
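
The `_y` suffix on the column names above comes from how `merge` handles name clashes: when both frames contain a column with the same name, pandas renames the left copy to `<name>_x` and the right copy to `<name>_y`. The sketch below reproduces the same pattern on made-up ratings; the `toy`, `per_person`, and `means` names are illustrative only.

```python
import pandas as p

# Made-up ratings: two participants (iid 1 and 2), two dates each.
toy = p.DataFrame({
    'iid':    [1, 1, 2, 2],
    'dec_o':  [1, 0, 1, 1],
    'attr_o': [6.0, 8.0, 9.0, 7.0],
})

# One row per participant, as in the cell above.
per_person = toy.drop_duplicates('iid')

for attribute in ['dec_o', 'attr_o']:
    means = toy.groupby('iid')[attribute].mean()
    # Both frames contain `attribute`, so merge keeps the left copy as
    # `<attribute>_x` (a single date's value) and the right copy as
    # `<attribute>_y` (the per-person mean).
    per_person = p.merge(per_person, means.reset_index(), on='iid')

print(per_person[['iid', 'dec_o_y', 'attr_o_y']])
```

Because `dec_o` is a 0/1 decision, its per-person mean is the fraction of partners who said yes.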

In [None]:
overall_bygender = overall.groupby('gender')
men = overall_bygender.get_group('Male')
women = overall_bygender.get_group('Female')
for attribute, title in [('attr_o_y', 'Attraction'), ('sinc_o_y', 'Sincerity'), ('intel_o_y', 'Intelligence'), ('fun_o_y', 'Fun'), ('amb_o_y', 'Ambition'), ('shar_o_y', 'Shared Interests')]:
    print(title+' for men: '+str(men['dec_o_y'].corr(men[attribute])))
    print(title+' for women: '+str(women['dec_o_y'].corr(women[attribute])))

In the graphs above, a regression line is drawn over a scatter plot with one point per individual, comparing how often their dates wanted to see them again with their average score on a specific personality trait. Below the graphs, we've printed the correlation coefficient for each trait, split by the gender of the person being rated. We interpret a correlation coefficient closer to one as meaning that the trait is more strongly associated with a partner's decision, in general.
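
`Series.corr` computes the Pearson correlation coefficient by default. As a sanity check on what that number means, the sketch below computes it by hand on made-up values and compares it with the pandas result. The data and variable names are illustrative and do not come from the study.

```python
import numpy as np
import pandas as p

# Made-up per-person averages: share of "yes" decisions and a mean trait score.
yes_rate = p.Series([0.2, 0.5, 0.6, 0.9])
trait = p.Series([5.0, 6.5, 6.0, 8.0])

# Pearson r: the covariance of the two series divided by the product of
# their standard deviations (the normalizing constants cancel out).
dx = yes_rate - yes_rate.mean()
dy = trait - trait.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

pandas_r = yes_rate.corr(trait)
print(manual_r, pandas_r)
```

A value near 1 or -1 indicates a strong linear relationship between the two averages; on its own it does not show that one causes the other.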

Glancing at the results above, attraction, fun, and shared interests stand out as the most important traits for a successful dating experience, as their correlation coefficients are the closest to one. In the context of a four-minute first date, it makes sense that these traits matter most. Looks initially capture the partner's eye. Some people seem to exude fun, which helps them make a good impression. And looking for shared interests is one of the easier conversations to have when making small talk with a stranger.