# Data Exploration

Possible Questions:  
**Can we predict people's political orientation based on how they portray their character traits and habits?**

Data:
- Cognitive Ability test questions
- Question_data
- Parsed_data

Exploration:
- Keywords
- Political Questions
- Descriptive Questions
- Question Extraction

Source: https://figshare.com/articles/dataset/OKCupid_Datasets/14987388?file=28850916



In [1]:
import pandas as pd

### Cognitive ability test questions

In [2]:
# cognitive ability test questions
test_items = pd.read_csv('data/test_items.csv')
print(test_items.columns)
print(test_items.shape)

display(test_items.head())

Index(['Unnamed: 0', 'ID', 'text', 'option_1', 'option_2', 'option_3',
       'option_4', 'option_correct'],
      dtype='object')
(28, 8)


Unnamed: 0.1,Unnamed: 0,ID,text,option_1,option_2,option_3,option_4,option_correct
0,q178,178,Which is bigger?,The earth,The sun,,,2
1,q255,255,STALE is to STEAL as 89475 is to...,89457,98547,89754,89547,4
2,q1201,1201,"What is next in this series? 1, 4, 10, 19, 31, _",36,48,46,Don't know / don't care,3
3,q14835,14835,"If you turn a left-handed glove inside out, it...",On my left hand,On my right hand,,,2
4,q8672,8672,In the line 'Wherefore art thou Romeo?' what d...,Why,Where,How,Who cares / wtf?,1


### Question Data

In [3]:
question_data = pd.read_csv('data/question_data.csv', sep=';')
question_data = question_data.set_index('Unnamed: 0')  # set first column (question number) as index
print('shape', question_data.shape)
print(question_data.columns)
display(question_data.head())

shape (2620, 9)
Index(['text', 'option_1', 'option_2', 'option_3', 'option_4', 'N', 'Type',
       'Order', 'Keywords'],
      dtype='object')


Unnamed: 0,text,option_1,option_2,option_3,option_4,N,Type,Order,Keywords
q2,Breast implants?,more cool than pathetic,more pathetic than cool,,,24839,N,,sex/intimacy; preference; opinion
q11,How does the idea of being slapped hard in the...,Horrified,Aroused,Nostalgic,Indifferent,28860,N,,sex/intimacy
q12,Divide your age by 2. Have you had sex with a...,Yes,No,,,22496,O,,sex/intimacy
q13,Is a girl who's slept with 100 guys a bad person?,Yes,No,,,32581,O,,sex/intimacy
q14,Is a guy who's slept with 100 girls a bad person?,Yes,No,,,31127,O,,sex/intimacy


### Data

In [4]:
data = pd.read_parquet('data/parsed_data_public.parquet')
print('shape', data.shape)
display(data.head())

shape (68371, 2626)


Unnamed: 0.1,Unnamed: 0,q2,q11,q12,q13,q14,q16,q17,q18,q20,...,q86615,q86699,q363047,CA,gender_orientation,gender,race,gender2,gender2_num,CA_items
0,1,,Horrified,,,,,No,,,...,,,,0.76308,Hetero_female,Woman,White,Woman,0.0,4
1,2,,,,,,,,,,...,,,,,Hetero_male,Man,,Man,1.0,0
2,3,,,,No,No,,No,,,...,,,,0.661309,Hetero_female,Woman,,Woman,0.0,7
3,4,,,,,,,,,,...,,,,,Hetero_female,Woman,White,Woman,0.0,0
4,5,,,,,,,,,,...,,,,0.875424,Bisexual_female,Woman,,Woman,0.0,3


## Exploration

List of keywords

In [5]:
keys = question_data.Keywords.dropna().unique().tolist()  # unique keyword strings (single keywords and ';'-separated combinations)
print('number of unique keywords', len(keys))
print(keys)


number of unique keywords 61
['sex/intimacy; preference; opinion', 'sex/intimacy', 'sex/intimacy; BDSM', 'religion/superstition', 'preference', 'descriptive', 'opinion', 'religion/superstition; descriptive', 'politics', 'preference; opinion', 'sex/intimacy; opinion', 'religion/superstition; opinion', 'sex/intimacy; preference; descriptive', 'preference; descriptive', 'sex/intimacy; preference', 'sex/intimacy; descriptive', 'politics; preference', 'politics; opinion', 'cognitive', 'descriptive; cognitive', 'politics; religion/superstition', 'opinion; technology', 'opinion; descriptive', 'politics; preference; opinion; sex/intimacy', 'descriptive; preference', 'politics; preference; descriptive', 'politics; religion/superstition; opinion', 'politics; opinion; sex/intimacy', 'descriptive; technology', 'sex/intimacy; religion/superstition', 'politics; sex/intimacy; religion/superstition ', 'religion/superstition; preference', 'opinion; cognitive', 'politics; preference; opinion', 'preferen

In [6]:
[key for key in keys if ";" not in key]  # keep only single (non-combined) keywords

['sex/intimacy',
 'religion/superstition',
 'preference',
 'descriptive',
 'opinion',
 'politics',
 'cognitive']
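
The single keywords can also be recovered directly from the combined strings by splitting on `;`. The snippet below is a minimal sketch on toy values (the series `keywords` is illustrative, not the real `Keywords` column). Stripping whitespace matters because at least one entry in the list above carries a trailing space.

```python
import pandas as pd

# Toy keyword strings in the same "a; b; c" format (illustrative values only).
keywords = pd.Series([
    "sex/intimacy; preference; opinion",
    "politics",
    None,
    "politics; sex/intimacy; religion/superstition ",  # note the trailing space
])

# Split each string into tags, put one tag per row, and strip whitespace.
tags = (
    keywords.dropna()
    .str.split(";")
    .explode()
    .str.strip()
)

unique_tags = sorted(tags.unique())
tag_counts = tags.value_counts()

print(unique_tags)
print(tag_counts)
```

Applied to `question_data.Keywords`, the same chain would give one row per (question, tag) pair, which also makes it possible to count how many questions carry each tag.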

### Number of answers per question

In [7]:
# count, per question, how many respondents gave a non-null answer
n_answers_per_question = data.notnull().sum(axis=0)[1:]  # [1:] skips the first column ('Unnamed: 0'), which holds a row id rather than a question

# add the answer counts to the question dataframe as column 'n_answers'
question_data = question_data.join(n_answers_per_question.to_frame('n_answers'))
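
Since the question metadata already ships with a count column `N`, the computed `n_answers` can be cross-checked against it. The following is a rough sketch on toy data; the frames `answers` and `meta` and the column `row_id` are hypothetical stand-ins for `data`, `question_data` and the `Unnamed: 0` column.

```python
import pandas as pd

# Toy stand-ins: three respondents, two question columns, one id column.
answers = pd.DataFrame({
    "row_id": [1, 2, 3],
    "q1": ["Yes", None, "No"],
    "q2": [None, None, "Yes"],
})
# Toy metadata with a reported answer count N per question.
meta = pd.DataFrame({"N": [2, 5]}, index=["q1", "q2"])

# Count non-null answers per question column and attach them to the metadata.
n_answers = answers.drop(columns="row_id").notna().sum().rename("n_answers")
meta = meta.join(n_answers)

# Questions where the reported and the computed counts disagree.
mismatch = meta[meta["N"] != meta["n_answers"]]
print(mismatch)
```

Rows that end up in `mismatch` may be worth a closer look, for example to see whether missing answers are stored as something other than NaN.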

### Political Questions

In [8]:
# find political questions (any keyword string that contains 'politics')
p_questions = question_data[question_data.Keywords.str.contains('politics', na=False)]
print(f'number of questions involving politics: {p_questions.shape[0]}')

# sort political questions by number of answers, most answered first
sorted_p_questions = p_questions.sort_values(by=['n_answers'], ascending=False)
print(sorted_p_questions[:15]['text'].values)
display(sorted_p_questions.head())
political_belief = 'q212813'  # target question: 'Which best describes your political beliefs?'

number of questions involving politics: 270
['How do you feel about government-subsidized food programs (free lunch, food stamps, etc.)?'
 'Are you either vegetarian or vegan?' 'Do you enjoy discussing politics?'
 "Should burning your country's flag be illegal?"
 'Which best describes your political beliefs?'
 'Should evolution and creationism be taught side-by-side in school?'
 'For you personally, is abortion an option in case of an accidental pregnancy?'
 'Do you believe your country would be more or less safe if every adult owned a gun?'
 'In a relationship I like to discuss politics with my partner.'
 'Which is worse: starving children or abused animals?'
 'The idea of gay and lesbian couples having children is:'
 'When men show extra courtesy toward women (opening doors, pulling out chairs, etc.), this is:'
 'Can anything be made the subject of a joke?'
 'Which is more offensive: book burning or flag burning?'
 'Are you okay with people who grow marijuana for their own personal us

Unnamed: 0,text,option_1,option_2,option_3,option_4,N,Type,Order,Keywords,n_answers
q34113,How do you feel about government-subsidized fo...,No problem,"It's okay, if it is not abused",Okay for short amounts of time,Never - Get a job,31769,O,,politics,68371.0
q179268,Are you either vegetarian or vegan?,Yes,No,,,54202,O,,politics; descriptive,54202.0
q403,Do you enjoy discussing politics?,Yes,No,,,52369,O,,politics; preference; descriptive,52369.0
q175,Should burning your country's flag be illegal?,Yes,No,,,45720,O,,politics,45720.0
q212813,Which best describes your political beliefs?,Liberal / Left-wing,Centrist,Conservative / Right-wing,Other,45107,M,[4],politics; descriptive,45107.0


### Descriptive questions

In [9]:
# find purely descriptive questions (exact match, so combined keywords such as 'politics; descriptive' are excluded)
d_questions = question_data[question_data.Keywords == 'descriptive']
print(f'number of questions with keyword descriptive: {d_questions.shape[0]}')

# sort descriptive questions
sorted_d_questions = d_questions.sort_values(by=['n_answers'], ascending=False)
print(sorted_d_questions['text'][:10].values)

display(sorted_d_questions[:10])

number of questions with keyword descriptive: 829
['Do you like watching foreign movies with subtitles?'
 'Which type of wine would you prefer to drink outside of a meal, such as for leisure?'
 'Have you smoked a cigarette in the last 6 months?'
 'Do you enjoy intense intellectual conversations?'
 'Rate your self-confidence:' 'Are you happy with your life?'
 'How frequently do you drink alcohol?'
 "What's your deal with harder drugs (stuff beyond pot)?"
 'Which word describes you better?'
 "What's your relationship with marijuana?"]


Unnamed: 0,text,option_1,option_2,option_3,option_4,N,Type,Order,Keywords,n_answers
q416235,Do you like watching foreign movies with subti...,Yes,No,Can't answer without a subtitle,,20364,O,"3, 1, 2",descriptive,68371.0
q85419,Which type of wine would you prefer to drink o...,"White (such as Chardonnay, Riesling).","Red (such as Merlot, Cabernet, Shiraz).",Rosé (such as White Zinfindel).,I don't drink wine.,18838,N,,descriptive,68371.0
q501,Have you smoked a cigarette in the last 6 months?,Yes,No,,,57123,O,,descriptive,57123.0
q358084,Do you enjoy intense intellectual conversations?,Yes,No,,,54696,O,,descriptive,54696.0
q20930,Rate your self-confidence:,"Very, very high",Higher than average,Average,Below average,53737,O,,descriptive,53737.0
q4018,Are you happy with your life?,Yes,No,,,53625,O,,descriptive,53625.0
q77,How frequently do you drink alcohol?,Very often,Sometimes,Rarely,Never,52467,O,,descriptive,52467.0
q80,What's your deal with harder drugs (stuff beyo...,I do drugs regularly.,I do drugs occasionally.,"I've done drugs in the past, but no longer.",I never do drugs.,50107,O,,descriptive,50107.0
q49,Which word describes you better?,Carefree,Intense,,,49827,N,,descriptive,49827.0
q79,What's your relationship with marijuana?,I smoke regularly.,I smoke occasionally.,"I smoked in the past, but no longer.",Never.,49796,O,,descriptive,49796.0
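
The political questions were selected with `str.contains`, while the descriptive questions were selected with an exact comparison. The two filters behave differently on combined keyword strings, as this small sketch on toy data shows (the frame `meta` and its index labels are made up for illustration).

```python
import pandas as pd

# Toy metadata with single and combined keyword strings (illustrative only).
meta = pd.DataFrame(
    {"Keywords": ["descriptive", "politics; descriptive", "politics",
                  None, "preference; descriptive"]},
    index=["qa", "qb", "qc", "qd", "qe"],
)

# Substring match: every question whose keyword string mentions 'descriptive'.
# na=False treats missing keywords as "no match" instead of propagating NaN.
contains = meta[meta["Keywords"].str.contains("descriptive", na=False)]

# Exact match: only questions tagged with 'descriptive' and nothing else.
exact = meta[meta["Keywords"] == "descriptive"]

print(contains.index.tolist())
print(exact.index.tolist())
```

The exact comparison keeps the feature set free of questions that are also tagged as political, which seems desirable when the target itself is a political question.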


### Extract Answered Questions and exclude political orientation 'Other'

In [10]:
# select the 10 most answered descriptive questions plus the target question on political orientation ('q212813')
questions = sorted_d_questions[:10].index.to_list() + [political_belief]

feature_target_data = data[questions].dropna()  # keep only respondents who answered all selected questions
feature_target_data = feature_target_data[feature_target_data[political_belief] != 'Other']  # drop respondents who chose 'Other'

# TODO: decide how to handle the political view 'Other'

print('shape:', feature_target_data.shape)
display(feature_target_data[:10])

shape: (21348, 11)


Unnamed: 0,q416235,q85419,q501,q358084,q20930,q4018,q77,q80,q49,q79,q212813
10,Yes,"Red (such as Merlot, Cabernet, Shiraz).",No,Yes,Average,Yes,Sometimes,I never do drugs.,Carefree,Never.,Liberal / Left-wing
25,Can't answer without a subtitle,Rosé (such as White Zinfindel).,Yes,Yes,Higher than average,Yes,Sometimes,"I've done drugs in the past, but no longer.",Carefree,I smoke occasionally.,Liberal / Left-wing
30,Can't answer without a subtitle,Rosé (such as White Zinfindel).,Yes,Yes,Average,Yes,Sometimes,I never do drugs.,Intense,I smoke occasionally.,Liberal / Left-wing
33,Can't answer without a subtitle,Rosé (such as White Zinfindel).,No,Yes,Average,Yes,Sometimes,I never do drugs.,Intense,"I smoked in the past, but no longer.",Liberal / Left-wing
36,Can't answer without a subtitle,Rosé (such as White Zinfindel).,No,Yes,Average,Yes,Sometimes,I never do drugs.,Carefree,Never.,Liberal / Left-wing
37,Can't answer without a subtitle,Rosé (such as White Zinfindel).,No,Yes,Average,Yes,Sometimes,I never do drugs.,Carefree,I smoke occasionally.,Liberal / Left-wing
39,Yes,"White (such as Chardonnay, Riesling).",No,Yes,Higher than average,Yes,Sometimes,I never do drugs.,Carefree,Never.,Liberal / Left-wing
40,Can't answer without a subtitle,Rosé (such as White Zinfindel).,No,Yes,Average,Yes,Sometimes,I never do drugs.,Intense,Never.,Liberal / Left-wing
71,Can't answer without a subtitle,Rosé (such as White Zinfindel).,No,Yes,Average,Yes,Rarely,I never do drugs.,Carefree,Never.,Centrist
72,Can't answer without a subtitle,Rosé (such as White Zinfindel).,Yes,Yes,Higher than average,No,Sometimes,I never do drugs.,Carefree,Never.,Liberal / Left-wing
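
The two filtering steps can be traced on a small example. This is only a sketch: the frame `toy` and the column names `q_a`, `q_b` and `target` are hypothetical, and `target` plays the role of `q212813`.

```python
import pandas as pd

# Toy answers: two feature questions, one target question, one unused column.
toy = pd.DataFrame({
    "q_a": ["Yes", None, "No", "Yes", "No"],
    "q_b": ["Carefree", "Intense", "Intense", "Carefree", "Carefree"],
    "target": ["Liberal / Left-wing", "Centrist", "Other", None, "Centrist"],
    "unused": [1, 2, 3, 4, 5],
})
selected = ["q_a", "q_b", "target"]

# Step 1: keep only rows with an answer to every selected question.
complete = toy[selected].dropna()

# Step 2: drop rows whose target answer is 'Other'.
filtered = complete[complete["target"] != "Other"]

print(complete.shape, filtered.shape)
print(filtered)
```

Comparing the shapes after each step shows how many respondents are lost to missing answers and how many to the 'Other' category, which may help when deciding what to do about the TODO above.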


## Convert answers to categorical variables (ordered where applicable)

In [11]:
from pandas.api.types import CategoricalDtype

In [12]:
# collect the answer options of every selected question, in the order given in question_data
options = [column for column in question_data.columns if 'option' in column]
questions_categories = {index: row[options].dropna().tolist() for index, row in question_data.loc[questions].iterrows()}

# questions whose options form a natural scale are treated as ordered (chosen by hand), all others as unordered
unordered_categories = {k: questions_categories[k] for k in questions_categories.keys() - {'q20930', 'q77', 'q80', 'q79'}}
ordered_categories = {k: questions_categories[k] for k in questions_categories.keys() - unordered_categories.keys()}
print('Categories with no order:\n', unordered_categories)
print('Categories with order:\n', ordered_categories)

Categories with no order:
 {'q416235': ['Yes', 'No', "Can't answer without a subtitle"], 'q358084': ['Yes', 'No'], 'q212813': ['Liberal / Left-wing', 'Centrist', 'Conservative / Right-wing', 'Other'], 'q49': ['Carefree', 'Intense'], 'q4018': ['Yes', 'No'], 'q501': ['Yes', 'No'], 'q85419': ['White (such as Chardonnay, Riesling).', 'Red (such as Merlot, Cabernet, Shiraz).', 'Rosé (such as White Zinfindel).', "I don't drink wine."]}
Categories with order:
 {'q20930': ['Very, very high', 'Higher than average', 'Average', 'Below average'], 'q77': ['Very often', 'Sometimes', 'Rarely', 'Never'], 'q80': ['I do drugs regularly.', 'I do drugs occasionally.', "I've done drugs in the past, but no longer.", 'I never do drugs.'], 'q79': ['I smoke regularly.', 'I smoke occasionally.', 'I smoked in the past, but no longer.', 'Never.']}


In [13]:
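# unordered questions: plain categorical dtype
for question, categories in unordered_categories.items():
    cat_type = CategoricalDtype(categories=categories)
    feature_target_data[question] = feature_target_data[question].astype(cat_type)

# ordered questions: options are listed from highest to lowest, so reverse them to get an ascending scale
# (categories[::-1] returns a new list; list.reverse() works in place and returns None)
for question, categories in ordered_categories.items():
    cat_type = CategoricalDtype(categories=categories[::-1], ordered=True)
    feature_target_data[question] = feature_target_data[question].astype(cat_type)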

In [14]:
feature_target_data['q77']

10       Sometimes
25       Sometimes
30       Sometimes
33       Sometimes
36       Sometimes
           ...    
68040       Rarely
68123    Sometimes
68139    Sometimes
68142       Rarely
68171    Sometimes
Name: q77, Length: 21348, dtype: category
Categories (4, object): ['Never' < 'Rarely' < 'Sometimes' < 'Very often']
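
An ordered categorical supports comparisons and maps each answer to an integer code that follows the scale. The sketch below uses the answer options of `q77` as printed above on a made-up series of three answers. It reverses the option list with slicing, because `list.reverse()` reverses in place and returns `None`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Answer options of q77, listed from highest to lowest frequency.
options = ["Very often", "Sometimes", "Rarely", "Never"]

# Reverse with slicing to get an ascending scale: Never < ... < Very often.
dtype = CategoricalDtype(categories=options[::-1], ordered=True)

# Three made-up answers, converted to the ordered dtype.
answers = pd.Series(["Sometimes", "Never", "Very often"]).astype(dtype)

# Integer codes follow the category order (Never=0, ..., Very often=3).
codes = answers.cat.codes.tolist()

# Ordered categoricals can be compared against one of their categories.
at_least_sometimes = (answers >= "Sometimes").tolist()

print(codes)
print(at_least_sometimes)
```

The integer codes are one possible numeric encoding for the ordered questions. The unordered questions would probably need a different encoding, such as one-hot columns.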