# Dataset analysis

### Load dependencies

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Raw data overview
Read the data from the file and print the first entries. Each row represents an interaction between two people.

In [12]:
df = pd.read_csv('./data/raw_data.csv', encoding="ISO-8859-1")
df.head()

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,


Since the dataset contains **195** columns, `head()` truncates the output and cannot give a meaningful overview of them.
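
A complementary way to get an overview of a wide frame is to look at its shape and to build a one-row-per-column summary. The sketch below is only illustrative: it runs on a small stand-in frame named `toy`, whose column names and values are hypothetical, so that it works without the CSV file. In the notebook, `df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; names and values are made up.
toy = pd.DataFrame({
    "gender": [0, 1, 0],
    "age": [21.0, None, 24.0],
    "field": ["Law", "Physics", None],
})

# shape is (number of records, number of columns)
n_rows, n_cols = toy.shape

# One row per column: its dtype and how many non-null entries it holds.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.count(),
})

print(n_rows, n_cols)
print(overview)
```

Because the summary has one row per column, it stays readable even when the frame is too wide for `head()`.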

Show details about the columns and the values they contain.

In [8]:
with pd.option_context('display.max_rows', None): # Remove the limit on displayed rows
    display(df.describe().T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
iid,8378.0,283.675937,158.583367,1.0,154.0,281.0,407.0,552.0
id,8377.0,8.960248,5.491329,1.0,4.0,8.0,13.0,22.0
gender,8378.0,0.500597,0.500029,0.0,0.0,1.0,1.0,1.0
idg,8378.0,17.327166,10.940735,1.0,8.0,16.0,26.0,44.0
condtn,8378.0,1.828837,0.376673,1.0,2.0,2.0,2.0,2.0
wave,8378.0,11.350919,5.995903,1.0,7.0,11.0,15.0,21.0
round,8378.0,16.872046,4.358458,5.0,14.0,18.0,20.0,22.0
position,8378.0,9.042731,5.514939,1.0,4.0,8.0,13.0,22.0
positin1,6532.0,9.295775,5.650199,1.0,4.0,9.0,14.0,22.0
order,8378.0,8.927668,5.477009,1.0,4.0,8.0,13.0,22.0
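
In the table above, `count` is below 8378 for some columns (for example `positin1`, with 6532), which means those columns have missing values. Note also that `describe()` summarizes only numeric columns by default. A minimal sketch of how the missing entries per column could be counted, again on a hypothetical stand-in frame `toy` rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in; the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    "position": [7, 7, 3, 5],
    "positin1": [None, None, 3.0, 5.0],
})

# isna() marks missing cells; sum() counts them, mean() gives their share.
report = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
}).sort_values("missing", ascending=False)

print(report)
```

Sorting by the number of missing entries puts the most incomplete columns first, which helps when deciding what to drop or impute.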


Show the number of records present in the dataset.

In [9]:
len(df)

8378

A large amount of information is collected for each interaction, and not all of it is relevant to our work, so we need to preprocess the data.

### Data pre-processing

The document [SpeedDatingSurveyAndDataKey.doc](./data/SpeedDatingSurveyAndDataKey.doc) contains the survey given during the event and therefore documents the meaning of the columns in the dataset.

Remove the fields that are irrelevant to our work.

In [14]:
irrelevant_cols = [
    'iid', 'id', 'idg',             # identifiers of the person
    #TODO: what is condtn?
    'wave',                         # number of the experiment's wave
    'round',                        # number of people that met in wave
    'position', 'positin1', 'order',# details about the logistics of the event
    'partner', 'pid',               # partner's identifiers
    #TODO what is pf_o_att
    'undergra', 'mn_sat', 'tuition',# details of the university career of the person
    'from', 'zipcode', 'income',    # details of the person's origins
    #'goal',                         # person's goal in participating in the experiment, in fact could be turned in goal in using the app, DISCUSS
    'exphappy', 'expnum',           # expectations on the event DISCUSS
    'dec', 'dec_o',                 # want to meet the person again
    'attr', 'sinc', 'intel', 'fun', # person's evaluation of the partner and of the meeting
    'amb', 'shar', 'like', 'prob',
    'met', 'match_es',
    'attr1_s','sinc1_s','intel1_s', # questions asked in the middle of the event
    'fun1_s', 'amb1_s', 'shar1_s',
    'attr3_s','sinc3_s','intel3_s',
    'fun3_s', 'amb3_s',
    'satis_2', 'length', 'numdat_2',# person's evaluation of the event
    'attr7_2','sinc7_2','intel7_2', # questions asked after the event
    'fun7_2', 'amb7_2', 'shar7_2',
    'you_call','them_cal','date_3', # dates after the event
    'numdat_3', 'num_in_3',
    'attr7_3','sinc7_3','intel7_3', # questions asked after the event
    'fun7_3', 'amb7_3', 'shar7_3',
    'attr4_3','sinc4_3','intel4_3',
    'fun4_3', 'amb4_3', 'shar4_3',
    'attr2_3','sinc2_3','intel2_3',
    'fun2_3', 'amb2_3', 'shar2_3',
    'attr3_3','sinc3_3','intel3_3',
    'fun3_3', 'amb3_3',
    'attr5_3','sinc5_3','intel5_3',
    'fun5_3', 'amb5_3',
]
df = df.drop(irrelevant_cols, axis=1)
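
With its default `errors='raise'`, `DataFrame.drop` raises a `KeyError` if any listed label is absent, which also happens when the cell above is run a second time on the already reduced `df`. One possible guard, sketched on a hypothetical stand-in frame `toy` with an illustrative `cols_to_drop` list, is to check the names before dropping:

```python
import pandas as pd

# Hypothetical stand-in; only the column names resemble the real dataset.
toy = pd.DataFrame({"iid": [1, 2], "gender": [0, 1], "wave": [1, 1]})

# "zipcode" is deliberately absent from toy to show the check.
cols_to_drop = ["iid", "wave", "zipcode"]

# Split the requested names into those that exist and those that do not.
missing = [c for c in cols_to_drop if c not in toy.columns]
present = [c for c in cols_to_drop if c in toy.columns]

if missing:
    print("Not in frame, skipped:", missing)

# Drop only the columns that are actually there.
reduced = toy.drop(columns=present)
print(list(reduced.columns))
```

Passing `errors='ignore'` to `drop` would skip absent labels silently; the explicit check is shown here because it also surfaces misspelled column names.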

In [15]:
df

Unnamed: 0,gender,condtn,match,int_corr,samerace,age_o,race_o,pf_o_att,pf_o_sin,pf_o_int,...,sinc5_2,intel5_2,fun5_2,amb5_2,attr1_3,sinc1_3,intel1_3,fun1_3,amb1_3,shar1_3
0,0,1,0,0.14,0,27.0,2.0,35.0,20.0,20.0,...,,,,,15.0,20.0,20.0,15.0,15.0,15.0
1,0,1,0,0.54,0,22.0,2.0,60.0,0.0,0.0,...,,,,,15.0,20.0,20.0,15.0,15.0,15.0
2,0,1,1,0.16,1,22.0,4.0,19.0,18.0,19.0,...,,,,,15.0,20.0,20.0,15.0,15.0,15.0
3,0,1,1,0.61,0,23.0,2.0,30.0,5.0,15.0,...,,,,,15.0,20.0,20.0,15.0,15.0,15.0
4,0,1,1,0.21,0,24.0,3.0,30.0,10.0,20.0,...,,,,,15.0,20.0,20.0,15.0,15.0,15.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8373,1,2,0,0.64,0,26.0,3.0,10.0,10.0,30.0,...,3.0,9.0,4.0,7.0,70.0,0.0,20.0,10.0,0.0,0.0
8374,1,2,0,0.71,0,24.0,6.0,50.0,20.0,10.0,...,3.0,9.0,4.0,7.0,70.0,0.0,20.0,10.0,0.0,0.0
8375,1,2,0,-0.46,0,29.0,3.0,40.0,10.0,30.0,...,3.0,9.0,4.0,7.0,70.0,0.0,20.0,10.0,0.0,0.0
8376,1,2,0,0.62,0,22.0,4.0,10.0,25.0,25.0,...,3.0,9.0,4.0,7.0,70.0,0.0,20.0,10.0,0.0,0.0
