# Data Exploration

Possible Questions:  
**Can we predict people's political orientation based on how they portray their character traits and habits?**

Data:
- Cognitive Ability test questions
- Question_data
- Parsed_data

Exploration:
- Keywords
- Political Questions
- Descriptive Questions
- Question Extraction

Source: https://figshare.com/articles/dataset/OKCupid_Datasets/14987388?file=28850916



In [153]:
import pandas as pd

### Cognitive ability test questions

In [154]:
# cognitive ability test questions
test_items = pd.read_csv('data/test_items.csv')
print(test_items.columns)
print(test_items.shape)

display(test_items.head())

Index(['Unnamed: 0', 'ID', 'text', 'option_1', 'option_2', 'option_3',
       'option_4', 'option_correct'],
      dtype='object')
(28, 8)


,Unnamed: 0,ID,text,option_1,option_2,option_3,option_4,option_correct
0,q178,178,Which is bigger?,The earth,The sun,,,2
1,q255,255,STALE is to STEAL as 89475 is to...,89457,98547,89754,89547,4
2,q1201,1201,"What is next in this series? 1, 4, 10, 19, 31, _",36,48,46,Don't know / don't care,3
3,q14835,14835,"If you turn a left-handed glove inside out, it...",On my left hand,On my right hand,,,2
4,q8672,8672,In the line 'Wherefore art thou Romeo?' what d...,Why,Where,How,Who cares / wtf?,1


### Question Data

In [155]:
question_data = pd.read_csv('data/question_data.csv', sep=';')
question_data = question_data.set_index('Unnamed: 0')  # set first column (question number) as index
print('shape', question_data.shape)
print(question_data.columns)
display(question_data.head())

shape (2620, 9)
Index(['text', 'option_1', 'option_2', 'option_3', 'option_4', 'N', 'Type',
       'Order', 'Keywords'],
      dtype='object')


Unnamed: 0,text,option_1,option_2,option_3,option_4,N,Type,Order,Keywords
q2,Breast implants?,more cool than pathetic,more pathetic than cool,,,24839,N,,sex/intimacy; preference; opinion
q11,How does the idea of being slapped hard in the...,Horrified,Aroused,Nostalgic,Indifferent,28860,N,,sex/intimacy
q12,Divide your age by 2. Have you had sex with a...,Yes,No,,,22496,O,,sex/intimacy
q13,Is a girl who's slept with 100 guys a bad person?,Yes,No,,,32581,O,,sex/intimacy
q14,Is a guy who's slept with 100 girls a bad person?,Yes,No,,,31127,O,,sex/intimacy


### Parsed Data

In [156]:
data = pd.read_parquet('data/parsed_data_public.parquet', engine='fastparquet')
print('shape', data.shape)
display(data.head())

shape (68371, 2626)


,Unnamed: 0,q2,q11,q12,q13,q14,q16,q17,q18,q20,...,q86615,q86699,q363047,CA,gender_orientation,gender,race,gender2,gender2_num,CA_items
0,1,,Horrified,,,,,No,,,...,,,,0.76308,Hetero_female,Woman,White,Woman,0.0,4
1,2,,,,,,,,,,...,,,,,Hetero_male,Man,,Man,1.0,0
2,3,,,,No,No,,No,,,...,,,,0.661309,Hetero_female,Woman,,Woman,0.0,7
3,4,,,,,,,,,,...,,,,,Hetero_female,Woman,White,Woman,0.0,0
4,5,,,,,,,,,,...,,,,0.875424,Bisexual_female,Woman,,Woman,0.0,3


## Exploration

List of keywords

In [157]:
keys = set(question_data.Keywords)  # unique keyword combinations (';'-separated strings, plus NaN for untagged questions)
print('number of unique keywords', len(keys))
print(keys)


number of unique keywords 62
{'politics; descriptive; preference', 'sex/intimacy; preference', 'opinion; cognitive', 'opinion', 'politics; cognitive', nan, 'sex/intimacy; preference; descriptive', 'descriptive; technology', 'politics; sex/intimacy; preference', 'politics; opinion; cognitive', 'sex/intimacy; religion/superstition; preference', 'sex/intimacy; BDSM', 'politics; religion/superstition', 'politics; opinion; sex/intimacy', 'sex/intimacy; religion/superstition', 'politics; preference; opinion; sex/intimacy', 'descriptive', 'politics; descriptive', 'descriptive; preference', 'religion/superstition; opinion', 'preference; descriptive; technology', 'preference; descriptive; opinion', 'preference; descriptive; politics', 'politics; opinion', 'politics; preference; opinion', 'preference; technology', 'descriptive; opinion', 'sex/intimacy; preference; opinion', 'religion/superstition', 'religion/superstition; opinion; cognitive', 'sex/intimacy; descriptive; BDSM', 'religion/supersti
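
Each entry in the set above is a full `;`-separated combination, so the count of 62 refers to combinations rather than individual keywords. A minimal sketch of how the individual keywords could be extracted is shown below. The helper `unique_keywords` and the toy series are illustrative assumptions, not part of the original notebook; the toy strings are copied from the output above.

```python
import pandas as pd


def unique_keywords(keyword_column: pd.Series) -> set:
    """Split ';'-separated keyword strings into a set of individual keywords."""
    tokens = (
        keyword_column.dropna()   # skip questions without any keyword
        .str.split(';')           # 'a; b' -> ['a', ' b']
        .explode()                # one row per keyword
        .str.strip()              # remove the whitespace left by the split
    )
    return set(tokens)


# Toy stand-in for question_data.Keywords (hypothetical subset)
toy_keywords = pd.Series(
    ['politics; descriptive; preference', 'sex/intimacy; preference', 'opinion', None]
)
print(sorted(unique_keywords(toy_keywords)))
```

On the real data this would be called as `unique_keywords(question_data.Keywords)`.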

### Number of answers per question

In [158]:
# count, per column, how many respondents gave a non-null answer
n_answers_per_question = data.notnull().sum(axis=0)[1:]  # [1:] skips the first column ('Unnamed: 0'), which holds a row number, not an answer

# add the number of respondents who answered each question to the question dataframe
question_data = question_data.join(n_answers_per_question.to_frame('n_answers'))
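
The tables later in this notebook list some questions (for example `q34113` and `q416235`) with `n_answers` equal to 68371, the total number of rows, while the metadata column `N` is much smaller. One possible explanation, which is an assumption that would need checking, is that those columns encode missing answers as something other than NaN. A small sanity check along these lines could surface such cases; `count_mismatches` is a hypothetical helper, and the toy values are taken from the tables in this notebook.

```python
import pandas as pd


def count_mismatches(questions: pd.DataFrame) -> pd.DataFrame:
    """Return rows where the metadata count N differs from the computed n_answers."""
    both_known = questions['N'].notna() & questions['n_answers'].notna()
    differs = questions['N'] != questions['n_answers']
    return questions[both_known & differs]


# Toy stand-in for question_data with the two count columns
toy_questions = pd.DataFrame(
    {'N': [31769, 54202, 20364], 'n_answers': [68371.0, 54202.0, 68371.0]},
    index=['q34113', 'q179268', 'q416235'],
)
print(count_mismatches(toy_questions))
```

Questions flagged this way may be better ranked by `N` than by `n_answers` until the discrepancy is understood.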

### Political Questions

In [159]:
# find political questions
p_questions = question_data[question_data.Keywords.str.contains('politics', na=False)]
print(f'number of questions involving politics: {p_questions.shape[0]}')

# sort political questions
sorted_p_questions = p_questions.sort_values(by=['n_answers'], ascending=False)
print(sorted_p_questions.head()['text'].values)
display(sorted_p_questions.head())


number of questions involving politics: 270
['How do you feel about government-subsidized food programs (free lunch, food stamps, etc.)?'
 'Are you either vegetarian or vegan?' 'Do you enjoy discussing politics?'
 "Should burning your country's flag be illegal?"
 'Which best describes your political beliefs?']


Unnamed: 0,text,option_1,option_2,option_3,option_4,N,Type,Order,Keywords,n_answers
q34113,How do you feel about government-subsidized fo...,No problem,"It's okay, if it is not abused",Okay for short amounts of time,Never - Get a job,31769,O,,politics,68371.0
q179268,Are you either vegetarian or vegan?,Yes,No,,,54202,O,,politics; descriptive,54202.0
q403,Do you enjoy discussing politics?,Yes,No,,,52369,O,,politics; preference; descriptive,52369.0
q175,Should burning your country's flag be illegal?,Yes,No,,,45720,O,,politics,45720.0
q212813,Which best describes your political beliefs?,Liberal / Left-wing,Centrist,Conservative / Right-wing,Other,45107,M,[4],politics; descriptive,45107.0


### Descriptive questions

In [160]:
# find questions whose only keyword is 'descriptive' (exact match)
d_questions = question_data[question_data.Keywords == 'descriptive']
print(f'number of questions with keyword descriptive: {d_questions.shape[0]}')

# sort descriptive questions
sorted_d_questions = d_questions.sort_values(by=['n_answers'], ascending=False)
print(sorted_d_questions['text'].head().values)

display(sorted_d_questions.head())

number of questions with keyword descriptive: 829
['Do you like watching foreign movies with subtitles?'
 'Which type of wine would you prefer to drink outside of a meal, such as for leisure?'
 'Have you smoked a cigarette in the last 6 months?'
 'Do you enjoy intense intellectual conversations?'
 'Rate your self-confidence:']


Unnamed: 0,text,option_1,option_2,option_3,option_4,N,Type,Order,Keywords,n_answers
q416235,Do you like watching foreign movies with subti...,Yes,No,Can't answer without a subtitle,,20364,O,"3, 1, 2",descriptive,68371.0
q85419,Which type of wine would you prefer to drink o...,"White (such as Chardonnay, Riesling).","Red (such as Merlot, Cabernet, Shiraz).",Rosé (such as White Zinfindel).,I don't drink wine.,18838,N,,descriptive,68371.0
q501,Have you smoked a cigarette in the last 6 months?,Yes,No,,,57123,O,,descriptive,57123.0
q358084,Do you enjoy intense intellectual conversations?,Yes,No,,,54696,O,,descriptive,54696.0
q20930,Rate your self-confidence:,"Very, very high",Higher than average,Average,Below average,53737,O,,descriptive,53737.0
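
The descriptive filter uses an exact match, so questions tagged with a combination such as `politics; descriptive` are excluded, whereas the political filter uses a substring match. If a wider descriptive set were wanted, a token-based mask is one option. The sketch below is only an illustration: `has_keyword` is a hypothetical helper, and the rows labelled `toy_*` are made up.

```python
import pandas as pd


def has_keyword(keyword_column: pd.Series, keyword: str) -> pd.Series:
    """Boolean mask: True where `keyword` is one of the ';'-separated entries."""
    return keyword_column.fillna('').apply(
        lambda entry: keyword in [token.strip() for token in entry.split(';')]
    )


# Toy stand-in for question_data; only q501 and q212813 mirror real rows
toy = pd.DataFrame(
    {'Keywords': ['descriptive', 'politics; descriptive',
                  'descriptive; technology', 'opinion', None]},
    index=['q501', 'q212813', 'toy_1', 'toy_2', 'toy_3'],
)

exact_only = toy[toy.Keywords == 'descriptive']                 # filter used in the notebook
any_descriptive = toy[has_keyword(toy.Keywords, 'descriptive')]  # wider alternative
print(len(exact_only), len(any_descriptive))
```

Note that the wider mask would also select the target question `q212813`, which would then have to be removed from the feature set.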


### Extract Answered Questions

In [161]:
# extract the 10 most-answered descriptive questions plus the target question about political orientation ('q212813')
questions = sorted_d_questions[:10].index.to_list() + ['q212813']

feature_target_data = data[questions].dropna()  # keep only respondents who answered all 11 questions
# TODO: decide how to handle the political view 'Other'

print('shape:', feature_target_data.shape)
display(feature_target_data.head())

shape: (32271, 11)


Unnamed: 0,q416235,q85419,q501,q358084,q20930,q4018,q77,q80,q49,q79,q212813
10,Yes,"Red (such as Merlot, Cabernet, Shiraz).",No,Yes,Average,Yes,Sometimes,I never do drugs.,Carefree,Never.,Liberal / Left-wing
12,Yes,"White (such as Chardonnay, Riesling).",No,Yes,Average,Yes,Sometimes,I never do drugs.,Intense,Never.,Other
14,Can't answer without a subtitle,Rosé (such as White Zinfindel).,No,Yes,Average,No,Never,I never do drugs.,Intense,Never.,Other
18,Can't answer without a subtitle,Rosé (such as White Zinfindel).,Yes,Yes,Higher than average,Yes,Sometimes,"I've done drugs in the past, but no longer.",Intense,Never.,Other
25,Can't answer without a subtitle,Rosé (such as White Zinfindel).,Yes,Yes,Higher than average,Yes,Sometimes,"I've done drugs in the past, but no longer.",Carefree,I smoke occasionally.,Liberal / Left-wing
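
One way the open TODO could be resolved, shown here only as a sketch and not as a decision made in this notebook, is to drop respondents who answered 'Other' and one-hot encode the remaining categorical answers so that a classifier can consume them. `prepare_features` is a hypothetical helper; the toy frame reuses a few values from the table above.

```python
import pandas as pd

TARGET = 'q212813'  # "Which best describes your political beliefs?"


def prepare_features(frame: pd.DataFrame, target: str = TARGET,
                     drop_labels=('Other',)):
    """Drop rows with excluded target labels and one-hot encode the features."""
    kept = frame[~frame[target].isin(list(drop_labels))]
    X = pd.get_dummies(kept.drop(columns=[target]))  # one indicator column per answer option
    y = kept[target]
    return X, y


# Toy stand-in for feature_target_data (two feature columns only)
toy_data = pd.DataFrame(
    {
        'q501': ['No', 'No', 'Yes', 'Yes'],
        'q20930': ['Average', 'Average', 'Higher than average', 'Higher than average'],
        'q212813': ['Liberal / Left-wing', 'Other', 'Other', 'Liberal / Left-wing'],
    },
    index=[10, 12, 18, 25],
)

X, y = prepare_features(toy_data)
print(X.shape, y.value_counts().to_dict())
```

Keeping 'Other' as a fourth class, by passing `drop_labels=()`, would be an equally valid alternative; which choice is better depends on how the prediction question is framed.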
