## Asking questions

The real question is one I ask myself before the data comes in. For example, I want to know whether you can guess a person's zodiac sign from the way they write. Is there anything zodiac signs can tell us?

### Question 1:
What are the clusters in this data?
- Maximise a cluster-quality score for KMeans (e.g. silhouette score; KMeans is unsupervised, so it has no accuracy score as such)

Which dimensions best define the clusters?
- Optimise with different features

### Question 2:
Can I predict age/zodiac/gender/drug use/drinking from essays?
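
Question 2 is a text classification problem. As a minimal sketch of one common baseline (TF-IDF features fed to a multinomial naive Bayes classifier), the block below uses made-up toy essays and labels rather than the real profiles. The names `toy_essays`, `toy_labels` and `text_clf` are illustrative assumptions, and nothing here is a result from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for essay text and a target column (not real profile data).
toy_essays = [
    "i love hiking and climbing on weekends",
    "hiking trails and camping trips are my thing",
    "i spend evenings reading novels and poetry",
    "books, poetry readings and quiet libraries",
]
toy_labels = ["outdoors", "outdoors", "books", "books"]

# TF-IDF turns each essay into a weighted bag-of-words vector;
# multinomial naive Bayes is a common baseline for text classification.
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(toy_essays, toy_labels)

# Predict the label of two unseen toy texts.
toy_predictions = text_clf.predict(["camping and hiking", "poetry and novels"])
print(toy_predictions)
```

On the real data, the same pipeline could be fitted on an essay column and a target such as `sign`, after dropping rows where either is missing.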

Question 1 is answered first.

## Collecting and Processing data

In [5]:
import numpy as np
import pandas as pd
profiles = pd.read_csv("./profiles.csv")
# The data comes from an OKCupid profiles dataset

In [56]:
## Getting to know the data
# print(profiles.columns)
# print(profiles.dtypes)
# print(profiles.income.head())

# Data types
# for col in profiles.columns:
#     print(col)
#     print(profiles[col].nunique())

print(profiles.status.value_counts())
print(profiles.columns)
## making lists of features by data type for convenience
quant_features = ['age', 'height', 'income']
# income is discrete; the others are continuous.
categorical_features = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'offspring', 'orientation', 'ethnicity', 'job', 'pets', 'religion', 'sex', 'sign', 'smokes', 'speaks', 'status']
# speaks has a huge number of distinct combinations because it distinguishes different levels for each language. Be careful when encoding it.
language_features = ['essay0','essay1','essay2','essay3','essay4','essay5','essay6','essay7','essay8','essay9']
date_features = ['last_online']

single            55697
seeing someone     2064
available          1865
married             310
unknown              10
Name: status, dtype: int64
Index(['age', 'body_type', 'diet', 'drinks', 'drugs', 'education', 'essay0',
       'essay1', 'essay2', 'essay3', 'essay4', 'essay5', 'essay6', 'essay7',
       'essay8', 'essay9', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'orientation', 'pets',
       'religion', 'sex', 'sign', 'smokes', 'speaks', 'status'],
      dtype='object')
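
The warning about `speaks` can be checked by counting distinct values per column. The sketch below does this on a toy frame. The `speaks` strings are hypothetical, since the real column contents are not shown here, and the 50% threshold is an arbitrary assumption for illustration.

```python
import pandas as pd

# Hypothetical values; the real column contents are not shown in this notebook.
toy = pd.DataFrame({
    "sex": ["m", "f", "f", "m", "f"],
    "speaks": [
        "english",
        "english (fluently), spanish (poorly)",
        "english (fluently), spanish (okay)",
        "english, french",
        "english (okay), french (fluently)",
    ],
})

# Number of distinct values per column, highest first.
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# Columns whose distinct-value count approaches the row count would explode
# into many sparse columns under one-hot encoding.
high_cardinality = cardinality[cardinality > 0.5 * len(toy)].index.tolist()
print(high_cardinality)
```

On the real data, `profiles[categorical_features].nunique()` gives the same overview in one call.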


In [61]:
# Cleaning data
# print(profiles.info())

# Most features have close to 50k non-null entries. The ones with fewer are diet, offspring, religion and essay8. Why?
# How many rows have all values non-null?

# Max is 59946
# print(profiles.count()>50000)
# print(profiles[profiles['income'] > 50000].count())

profiles_no_na = profiles.dropna(subset=language_features)
print(profiles_no_na.info())
# Only 4407 rows are filled completely, so we must drop NaNs pairwise (per analysis), not listwise, or we would lose too much data.
profiles_no_na.sign.value_counts()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 29866 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          29866 non-null  int64  
 1   body_type    27626 non-null  object 
 2   diet         18720 non-null  object 
 3   drinks       28912 non-null  object 
 4   drugs        22751 non-null  object 
 5   education    27564 non-null  object 
 6   essay0       29866 non-null  object 
 7   essay1       29866 non-null  object 
 8   essay2       29866 non-null  object 
 9   essay3       29866 non-null  object 
 10  essay4       29866 non-null  object 
 11  essay5       29866 non-null  object 
 12  essay6       29866 non-null  object 
 13  essay7       29866 non-null  object 
 14  essay8       29866 non-null  object 
 15  essay9       29866 non-null  object 
 16  ethnicity    27616 non-null  object 
 17  height       29865 non-null  float64
 18  income       29866 non-null  int64  
 19  job 
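
The comment in the cleaning cell contrasts listwise and pairwise deletion. As a small illustration on made-up values (the frame `toy` and its contents are assumptions, not profile data), listwise deletion drops a row if any column is missing, while pairwise deletion drops rows only for the columns a given analysis actually uses:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; most rows are missing something different.
toy = pd.DataFrame({
    "age": [25, 31, 40, 22],
    "diet": ["vegan", np.nan, "anything", np.nan],
    "essay0": ["hello", "hi there", np.nan, "hey"],
})

# Listwise deletion: drop a row if ANY column is missing.
listwise = toy.dropna()

# Pairwise deletion: drop rows only for the columns the current analysis uses.
age_vs_essay = toy.dropna(subset=["age", "essay0"])
age_vs_diet = toy.dropna(subset=["age", "diet"])

print(len(listwise), len(age_vs_essay), len(age_vs_diet))
```

Each pairwise subset keeps more rows than the listwise one, at the cost of different analyses running on different subsets of profiles.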

scorpio and it&rsquo;s fun to think about        1073
gemini and it&rsquo;s fun to think about         1062
leo and it&rsquo;s fun to think about             997
libra and it&rsquo;s fun to think about           994
pisces and it&rsquo;s fun to think about          976
taurus and it&rsquo;s fun to think about          971
sagittarius and it&rsquo;s fun to think about     967
aries and it&rsquo;s fun to think about           954
virgo and it&rsquo;s fun to think about           938
cancer and it&rsquo;s fun to think about          936
aquarius and it&rsquo;s fun to think about        893
virgo but it doesn&rsquo;t matter                 823
libra but it doesn&rsquo;t matter                 806
capricorn and it&rsquo;s fun to think about       803
cancer but it doesn&rsquo;t matter                801
gemini but it doesn&rsquo;t matter                788
leo but it doesn&rsquo;t matter                   781
aries but it doesn&rsquo;t matter                 766
taurus but it doesn&rsquo;t 
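
The `sign` values above combine the zodiac sign with an attitude towards astrology and contain the HTML entity `&rsquo;`. One possible way to separate the two, sketched on a few sample strings (the bare `"leo"` entry is a hypothetical case, and `sign_only` and `attitude` are illustrative names), is to decode the entity and split on the first space:

```python
import html
import pandas as pd

raw_signs = pd.Series([
    "scorpio and it&rsquo;s fun to think about",
    "virgo but it doesn&rsquo;t matter",
    "leo",  # hypothetical: a sign with no attitude attached
])

# Decode HTML entities such as &rsquo; into the actual character.
decoded = raw_signs.map(html.unescape)

# Split once on whitespace: first token is the sign, the rest is the attitude.
parts = decoded.str.split(n=1, expand=True)
sign_only = parts[0]
attitude = parts[1]

print(sign_only.value_counts())
```

With the sign isolated, there are at most twelve classes to predict instead of several dozen sign and attitude combinations.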

## Analysis starts with KMeans
KMeans will inform us:
- What are the clusters in this data? (Maximise a cluster-quality score for KMeans, e.g. silhouette score, since KMeans is unsupervised and has no accuracy score as such.)
- Which dimensions best define the clusters? (Optimise with different features.)

In [None]:
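# Starting with just categorical features, KMeans
from sklearn.model_selection import train_test_split

df = profiles[categorical_features]

# KMeans is unsupervised, so there are no labels (y) to split off.
# Assumption: hold out 20% of the rows to check the clusters on unseen data.
x_train, x_test = train_test_split(df, test_size=0.2, random_state=42)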
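
KMeans works on numeric vectors, so categorical columns need encoding first. The block below is a minimal sketch under stated assumptions: the toy values are made up, one-hot encoding via `pd.get_dummies` is one possible choice (Euclidean distance on one-hot data is a debatable fit), and the silhouette score stands in for the "score" to maximise when choosing the number of clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up categorical values standing in for a few profile columns.
toy = pd.DataFrame({
    "drinks": ["socially", "socially", "socially", "not at all",
               "not at all", "often", "often", "often"],
    "smokes": ["no", "no", "sometimes", "no",
               "no", "yes", "yes", "sometimes"],
    "diet": ["anything", "vegetarian", "anything", "vegetarian",
             "vegan", "anything", "anything", "anything"],
})

# One-hot encode so every category becomes its own 0/1 column.
encoded = pd.get_dummies(toy).astype(float)

# Fit KMeans for several cluster counts and record the silhouette score
# (ranges from -1 to 1; higher means better-separated clusters).
scores = {}
for k in range(2, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(encoded)
    scores[k] = silhouette_score(encoded, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Repeating the loop with different subsets of columns is one way to explore which dimensions define the clusters most clearly.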
