**Finding a model**

Let's see whether we can build a model that predicts the level of happiness someone reports. I will use all of the features that I looked at in previous sections. I predict that income and social level will be integral, but we'll find out.

In [0]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

df_happy = pd.read_csv('happy.csv')
df_jobs = pd.read_csv('jobs.csv')
df_social = pd.read_csv('social.csv')
df_internet = pd.read_csv('internet.csv')

In [4]:
len(df_happy)

64816

In [0]:
df_happy = df_happy[df_happy['General happiness'] != 'Don\'t know']

In [0]:
len(df_happy)

64777

I'll start out with the basics. I'll evaluate a 10-nearest neighbors classifier with the features that I studied in exploration.

In [4]:
df_happy['Satisfaction with financial situation'].value_counts()

More or less      26606
Satisfied         17588
Not at all sat    15953
Don't know          116
Name: Satisfaction with financial situation, dtype: int64

In [0]:
len(df_jobs)

37887

In [0]:
len(df_social)

13430

In [0]:
len(df_internet)

606

If possible, I'd like to avoid using features from df_internet, because it has far fewer rows (606) than the other tables.

In [5]:
df_happy

Unnamed: 0.1,Unnamed: 0,index,Standard of living of r will improve,How much time felt sad in past wk,How much time felt happy in past wk,How much time felt depressed in past wk,I expect more good things to happen to me than bad,I'm always optimistic about my future,Happiness of marriage,General happiness,Satisfaction with financial situation,Rs self ranking of social position,Is life exciting or dull,year
0,0,0,,,,,,,,Not too happy,Not at all sat,,,1972.0
1,1,1,,,,,,,,Not too happy,More or less,,,1972.0
2,2,2,,,,,,,,Pretty happy,Satisfied,,,1972.0
3,3,3,,,,,,,,Not too happy,Not at all sat,,,1972.0
4,4,4,,,,,,,,Pretty happy,Satisfied,,,1972.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64811,64811,64811,Neither,,,,,,Very happy,Very happy,Satisfied,4,Routine,2018.0
64812,64812,64812,Disagree,,,,,,Very happy,Very happy,More or less,3,,2018.0
64813,64813,64813,,,,,,,,Very happy,More or less,5,Routine,2018.0
64814,64814,64814,,,,,,,,,,,,


In [0]:
df_happy = df_happy[(df_happy['year'].notna()) &
         (df_happy['Satisfaction with financial situation'].notna()) &
         (df_happy['General happiness'].notna())]

In [0]:
y_train = df_happy['General happiness']

In [36]:
# estimate test precision, recall, and F1 for a given feature set
def get_cv_error(features, clss):
  # define pipeline: every candidate feature here is treated as categorical,
  # so one-hot encode whichever ones were requested
  ct = make_column_transformer(
      (OneHotEncoder(handle_unknown='ignore'), features),
      remainder="passthrough"
  )

  pipeline = make_pipeline(
      ct,
      KNeighborsClassifier(n_neighbors=10)
  )

  # one-vs-rest target: True where the respondent is in class clss
  is_class = (y_train == clss)
  # average precision and recall over the cross-validation folds
  recall = cross_val_score(pipeline, df_happy[features],
                is_class, cv=10, scoring="recall").mean()
  precision = cross_val_score(pipeline, df_happy[features],
                is_class, cv=10, scoring="precision").mean()
  # F1 is the harmonic mean of the averaged precision and recall
  # (nan when both are 0)
  return (precision, recall, 2 * (precision * recall) / (precision + recall))

# calculate and store scores for different feature sets
errs = pd.Series(dtype=object)
classes = ['Not too happy', 'Pretty happy', 'Very happy']
for clss in classes:
  for features in [["year"],
                   ["Satisfaction with financial situation"],
                   ["year", "Satisfaction with financial situation"]]:
    errs[clss + ", " + str(features)] = get_cv_error(features, clss)

errs

  _warn_prf(average, modifier, msg_start, len(result))

Not too happy, ['year']                                             (0.04687401133122441, 0.1586161879895561, 0.07...
Not too happy, ['Satisfaction with financial situation']                                              (0.0, 0.0, nan)
Not too happy, ['year', 'Satisfaction with financial situation']    (0.08836795842177565, 0.13655352480417754, 0.1...
Pretty happy, ['year']                                              (0.42627406298655596, 0.5778551557343097, 0.49...
Pretty happy, ['Satisfaction with financial situation']             (0.5065532797126299, 0.5089154228855721, 0.507...
Pretty happy, ['year', 'Satisfaction with financial situation']     (0.5242951897012246, 0.5043302377993648, 0.514...
Very happy, ['year']                                                (0.1240291921427713, 0.10978674674007485, 0.11...
Very happy, ['Satisfaction with financial situation']               (0.16306238225772535, 0.1288685947082858, 0.14...
Very happy, ['year', 'Satisfaction with financial situat

In [37]:
for val in errs:
  print(val[2])

0.07236332455478037
nan
0.10729927643345803
0.49062343861755137
0.5077316039370429
0.5141189603713742
0.11647419394258846
0.1439629344561468
0.21451648580159788


We can see from the above that none of the feature combinations gives a great F1 score; the best is about .51, for Pretty happy. Adding year seemed to help a little, but then I realized that year is really a time index and can't be used in this context. If I were to try to predict happiness for a future year, that year would never appear in the training data, so year would not be a helpful training feature.
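
The F1 values printed above are the harmonic mean of the fold-averaged precision and recall, which is why a class with precision and recall both 0 shows up as `nan`. Here is a minimal sketch of that calculation with a guard for the 0/0 case. `f1_from` is a hypothetical helper name of mine, not something the cells above define.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    total = precision + recall
    if total == 0:
        # avoid 0/0, which is what produces nan in the cells above
        return 0.0
    return 2 * precision * recall / total

# precision and recall copied from the
# "Pretty happy, ['Satisfaction with financial situation']" row
print(f1_from(0.5065532797126299, 0.5089154228855721))
# the degenerate case seen for 'Not too happy'
print(f1_from(0.0, 0.0))
```

Returning 0.0 instead of `nan` is a design choice: it keeps the scores comparable across classes, at the cost of hiding the fact that the classifier found no positives at all.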

I'd like to move on to job features now. I'll add income to see if it improves the F1 score.

In [38]:
df_jobs.head()

Unnamed: 0.1,Unnamed: 0,index,Rs income in constant $,Rs job is secure,Respondents income,Number of hours usually work a week,General happiness,income
0,3117,3117,4935.0,,$1000 to 2999,,Very happy,4935.0
1,3118,3118,43178.0,,$15000 - 19999,,Very happy,43178.0
2,3121,3121,18505.0,,$7000 to 7999,,Pretty happy,18505.0
3,3122,3122,22206.0,,$8000 to 9999,,Pretty happy,22206.0
4,3123,3123,55515.0,,$20000 - 24999,,Very happy,55515.0


In [0]:
# both frames have a 'General happiness' column, so the merge adds suffixes;
# '_x' is the copy that came from df_happy
df_happy_jobs = df_happy.merge(df_jobs, on="index")
y_train = df_happy_jobs['General happiness_x']

In [40]:
# calculate estimate of test error for a given feature set
def get_cv_error(features, clss):
  # define pipeline
  if len(features) == 1:
    ct = make_column_transformer(
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )
  if len(features) == 2:
    ct = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ["Satisfaction with financial situation"]),
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )

  pipeline = make_pipeline(
      ct,
      KNeighborsClassifier(n_neighbors=10)
  )

  # one-vs-rest target: True where the respondent is in class clss
  is_class = (y_train == clss)
  # average precision and recall over the cross-validation folds
  recall = cross_val_score(pipeline, df_happy_jobs[features],
                is_class, cv=10, scoring="recall").mean()
  precision = cross_val_score(pipeline, df_happy_jobs[features],
                is_class, cv=10, scoring="precision").mean()
  # F1 is the harmonic mean of the averaged precision and recall
  return (precision, recall, 2 * (precision * recall) / (precision + recall))

# calculate and store scores for different feature sets
errs = pd.Series(dtype=object)
classes = ['Not too happy','Pretty happy', 'Very happy']
for clss in classes:
  for features in [["income"],
                   ["income", "Satisfaction with financial situation"]]:
    errs[clss + ", " + str(features)] = get_cv_error(features, clss)

errs

  _warn_prf(average, modifier, msg_start, len(result))


Not too happy, ['income']                                             (0.027942882641677824, 0.01566579634464752, 0....
Not too happy, ['income', 'Satisfaction with financial situation']    (0.1373199535850706, 0.015674681831230435, 0.0...
Pretty happy, ['income']                                              (0.5656069946040347, 0.5346845622904625, 0.549...
Pretty happy, ['income', 'Satisfaction with financial situation']     (0.5781024136145734, 0.5727305881429087, 0.575...
Very happy, ['income']                                                (0.24395710683473207, 0.13402876470922628, 0.1...
Very happy, ['income', 'Satisfaction with financial situation']       (0.36741274139543434, 0.1638192222951475, 0.22...
dtype: object

In [41]:
for val in errs:
  print(val[2])

0.02007616460403116
0.028137543197752895
0.5497112586991892
0.5754039636518566
0.17300789332435834
0.22660259047132822


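With income and financial satisfaction together, F1 is .028 for Not too happy, .575 for Pretty happy, and .227 for Very happy. Compared with financial satisfaction alone (.508 and .144), that is a boost for Pretty happy and Very happy, although this run only covers the rows that also appear in df_jobs. Not too happy is still close to 0.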

I am getting the warnings about precision because, in some folds, the classifier never predicts the positive class, so precision is 0/0 and scikit-learn sets it to 0. This mostly hits the much smaller category of Not too happy.
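
To make the source of those warnings concrete, here is a small sketch on made-up labels, assuming scikit-learn 0.22 or later (where `precision_score` accepts `zero_division`). The toy `y_true` and `y_pred` lists are mine and have nothing to do with the survey data.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

# toy labels: one positive case, but the classifier never predicts positive
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

# default behaviour: precision is 0/0, so scikit-learn warns and returns 0.0
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_precision = precision_score(y_true, y_pred)
warned = any(issubclass(w.category, UndefinedMetricWarning) for w in caught)

# explicit zero_division: same value, but no warning is raised
quiet_precision = precision_score(y_true, y_pred, zero_division=0)

print(default_precision, warned, quiet_precision)
```

Silencing the warning does not fix anything by itself; it only records that a score of 0 for an unpredicted class is expected here.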

Now I'll see if adding social trends helps with predictions.

In [45]:
df_happy_jobs_social = df_happy_jobs.merge(df_social, 
                                           left_on=["index"], 
                                           right_on=["index"])
df_happy_jobs_social['General happiness'].value_counts()

Pretty happy     4913
Very happy       2554
Not too happy     821
Name: General happiness, dtype: int64

In [42]:
# calculate estimate of test error for a given feature set
y_train = df_happy_jobs_social['General happiness']
def get_cv_error(features, clss):
  # define pipeline
  if len(features) == 1:
    ct = make_column_transformer(
        (StandardScaler(), ["social_score"]),
        remainder="passthrough"
    )
  if len(features) == 2:
    ct = make_column_transformer(
        (StandardScaler(), ["income", "social_score"]),
        remainder="passthrough"
    )
  if len(features) == 3:
    ct = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ["Satisfaction with financial situation"]),
        (StandardScaler(), ["income", "social_score"]),
        remainder="passthrough"
    )

  pipeline = make_pipeline(
      ct,
      KNeighborsClassifier(n_neighbors=10)
  )

  # one-vs-rest target: True where the respondent is in class clss
  is_class = (y_train == clss)
  # average precision and recall over the cross-validation folds
  recall = cross_val_score(pipeline, df_happy_jobs_social[features],
                is_class, cv=10, scoring="recall").mean()
  precision = cross_val_score(pipeline, df_happy_jobs_social[features],
                is_class, cv=10, scoring="precision").mean()
  # F1 is the harmonic mean of the averaged precision and recall
  return (precision, recall, 2 * (precision * recall) / (precision + recall))

# calculate and store scores for different feature sets
errs = pd.Series(dtype=object)
classes = ['Not too happy','Pretty happy', 'Very happy']
for clss in classes:
  for features in [["social_score"],
                   ["income", "social_score"],
                   ["income", "Satisfaction with financial situation", "social_score"]]:
    errs[clss + ", " + str(features)] = get_cv_error(features, clss)

errs

  _warn_prf(average, modifier, msg_start, len(result))

Not too happy, ['social_score']                                                                                         (0.0, 0.0, nan)
Not too happy, ['income', 'social_score']                                                                               (0.0, 0.0, nan)
Not too happy, ['income', 'Satisfaction with financial situation', 'social_score']                                      (0.0, 0.0, nan)
Pretty happy, ['social_score']                                                        (0.5945018382607017, 0.6467078138194824, 0.619...
Pretty happy, ['income', 'social_score']                                              (0.5890257629236345, 0.6061360588147633, 0.597...
Pretty happy, ['income', 'Satisfaction with financial situation', 'social_score']     (0.6106161563361755, 0.6326138790919477, 0.621...
Very happy, ['social_score']                                                          (0.15619047619047619, 0.04274509803921569, 0.0...
Very happy, ['income', 'social_score']          

In [46]:
for val in errs:
  print(val[2])
# the below are the f1 scores
# they are ordered the same as the happiness and associated features above

nan
nan
nan
0.619506919703529
0.5974584327996366
0.621420403767869
0.06712099877968647
0.11579681274075096
0.19960610323161174


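The huge imbalance between Not too happy (821 rows) and the other classes (4913 and 2554 rows) means the classifier never correctly predicts Not too happy, so its precision and recall are both 0 and F1 is nan. Compared with the income run, F1 for Pretty happy went up (.575 to .621), but it went down for Very happy (.227 to .200). I think adding social_score to the equation doesn't actually help much.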
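
A trivial baseline helps put these F1 values in context. The sketch below uses the class counts printed above and computes the F1 of a hypothetical classifier that labels every row as the positive class. This baseline is my own addition for comparison, not something the cells above ran.

```python
# class counts from the merged happiness/jobs/social data above
counts = {"Pretty happy": 4913, "Very happy": 2554, "Not too happy": 821}
total = sum(counts.values())

def always_positive_f1(positives, total):
    """F1 of a classifier that labels every row as the positive class."""
    precision = positives / total  # every row is predicted positive
    recall = 1.0                   # so no positive row is missed
    return 2 * precision * recall / (precision + recall)

baseline = {label: always_positive_f1(n, total) for label, n in counts.items()}
for label, score in baseline.items():
    share = counts[label] / total
    print(f"{label}: share={share:.3f}, baseline F1={score:.3f}")
```

If a model's F1 for the majority class is no better than this baseline, it may simply be predicting that class for almost everyone.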

I think income and satisfaction with financial situation are the best predictors of happiness so far. We'll change models now to see what works best. I'll still look at social score just to see how it performs with a different model.
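
Since only the estimator changes from here on, one way to avoid redefining the whole function for each model is to build the pipeline from the column types. This is a rough sketch under my own assumptions: `build_pipeline` is a hypothetical helper, the toy frame is made up, and every non-numeric column is assumed to be categorical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(X, estimator):
    """One-hot encode non-numeric columns, scale numeric ones, then add estimator."""
    numeric = X.select_dtypes(include="number").columns.tolist()
    categorical = X.select_dtypes(exclude="number").columns.tolist()
    transformers = []
    if categorical:
        transformers.append((OneHotEncoder(handle_unknown="ignore"), categorical))
    if numeric:
        transformers.append((StandardScaler(), numeric))
    return make_pipeline(make_column_transformer(*transformers), estimator)

# tiny made-up frame, only to show the call pattern
toy = pd.DataFrame({
    "income": [4935.0, 43178.0, 18505.0, 22206.0, 55515.0, 30000.0],
    "Satisfaction with financial situation": [
        "Not at all sat", "Satisfied", "More or less",
        "More or less", "Satisfied", "Not at all sat",
    ],
})
toy_target = [False, True, False, False, True, False]

for estimator in (KNeighborsClassifier(n_neighbors=3),
                  RandomForestClassifier(n_estimators=10, random_state=0)):
    model = build_pipeline(toy, estimator).fit(toy, toy_target)
    print(type(estimator).__name__, model.predict(toy))
```

The same helper could be handed to `cross_val_score` in place of the hand-written `if len(features) == ...` branches.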

In [52]:
# calculate estimate of test error for a given feature set
from sklearn import svm
y_train = df_happy_jobs_social['General happiness']
def get_cv_error(features, clss):
  # define pipeline
  if len(features) == 1:
    ct = make_column_transformer(
        (StandardScaler(), ["social_score"]),
        remainder="passthrough"
    )
  if len(features) == 2 and features[1] == "social_score":
    ct = make_column_transformer(
        (StandardScaler(), ["income", "social_score"]),
        remainder="passthrough"
    )
  if len(features) == 2 and features[1] != "social_score":
    ct = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ["Satisfaction with financial situation"]),
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )
  if len(features) == 3:
    ct = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ["Satisfaction with financial situation"]),
        (StandardScaler(), ["income", "social_score"]),
        remainder="passthrough"
    )

  pipeline = make_pipeline(
      ct,
      svm.SVC()
  )

  # one-vs-rest target: True where the respondent is in class clss
  is_class = (y_train == clss)
  # average precision and recall over the cross-validation folds
  recall = cross_val_score(pipeline, df_happy_jobs_social[features],
                is_class, cv=5, scoring="recall").mean()
  precision = cross_val_score(pipeline, df_happy_jobs_social[features],
                is_class, cv=5, scoring="precision").mean()
  # F1 is the harmonic mean of the averaged precision and recall
  return (precision, recall, 2 * (precision * recall) / (precision + recall))

# calculate and store scores for different feature sets
errs = pd.Series(dtype=object)
classes = ['Not too happy','Pretty happy', 'Very happy']
for clss in classes:
  for features in [["social_score"],
                   ["income", "social_score"],
                   ["income", "Satisfaction with financial situation"],
                   ["income", "Satisfaction with financial situation", "social_score"]]:
    errs[clss + ", " + str(features)] = get_cv_error(features, clss)

errs

ValueError: The number of classes has to be greater than one; got 1 class


Don't know, ['social_score']                                                                                            (nan, nan, nan)
Don't know, ['income', 'social_score']                                                                                  (nan, nan, nan)
Don't know, ['income', 'Satisfaction with financial situation']                                                         (nan, nan, nan)
Don't know, ['income', 'Satisfaction with financial situation', 'social_score']                                         (nan, nan, nan)
Not too happy, ['social_score']                                                                                         (0.0, 0.0, nan)
Not too happy, ['income', 'social_score']                                                                               (0.0, 0.0, nan)
Not too happy, ['income', 'Satisfaction with financial situation']                                                      (0.0, 0.0, nan)
Not too happy, ['income', 'Satisfaction with fin

In [26]:
for val in errs:
  print(val[2])
# the below are the f1 scores
# they are ordered the same as the happiness and associated features above

nan
nan
nan
nan
0.02441505595116989
0.024514811031664967
0.7437082918052215
0.7177672452193558
0.7057036998964975
nan
0.012339676418811297
0.10941931076880329


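As expected, cutting the data down to the rows that have social_score caused issues with class balance, which gave us nans. The ValueError appears to come from the Don't know runs shown in the output: no rows with that label are left, so the target has only one class and SVC cannot fit. I'll return to the job and happiness data so we have more rows and more variety in the data.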
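
One way to avoid the single-class failure is to check each candidate class before cross-validating it. A minimal sketch, using a hypothetical helper `usable_classes` and made-up labels rather than the real survey column:

```python
def usable_classes(labels, candidates, min_positives=1):
    """Keep candidate classes whose one-vs-rest target has both outcomes."""
    kept = []
    for cls in candidates:
        positives = sum(1 for label in labels if label == cls)
        negatives = len(labels) - positives
        if positives >= min_positives and negatives >= 1:
            kept.append(cls)
    return kept

# made-up labels in which nobody answered "Don't know"
labels = ["Pretty happy", "Very happy", "Pretty happy", "Not too happy"]
candidates = ["Don't know", "Not too happy", "Pretty happy", "Very happy"]
print(usable_classes(labels, candidates))
```

With 10-fold cross-validation it would make sense to raise `min_positives` to at least the number of folds, so that every fold can contain a positive example.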

In [0]:
# calculate estimate of test error for a given feature set
from sklearn import svm
y_train = df_happy_jobs['General happiness_x']
def get_cv_error(features, clss):
  # define pipeline
  if len(features) == 1:
    ct = make_column_transformer(
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )
  if len(features) == 2:
    ct = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ["Satisfaction with financial situation"]),
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )

  pipeline = make_pipeline(
      ct,
      svm.SVC()
  )

  # one-vs-rest target: True where the respondent is in class clss
  is_class = (y_train == clss)
  # average precision and recall over the cross-validation folds
  recall = cross_val_score(pipeline, df_happy_jobs[features],
                is_class, cv=10, scoring="recall").mean()
  precision = cross_val_score(pipeline, df_happy_jobs[features],
                is_class, cv=10, scoring="precision").mean()
  # F1 is the harmonic mean of the averaged precision and recall
  return (precision, recall, 2 * (precision * recall) / (precision + recall))

# calculate and store scores for different feature sets
errs = pd.Series(dtype=object)
classes = ['Not too happy','Pretty happy', 'Very happy']
for clss in classes:
  for features in [["income"],
                   ["income", "Satisfaction with financial situation"]]:
    errs[clss + ", " + str(features)] = get_cv_error(features, clss)

errs

  _warn_prf(average, modifier, msg_start, len(result))

I had huge issues with svm.SVC. I was getting UndefinedMetricWarning and zero-division warnings left and right, and I had to stop the cell because it had been running for over an hour. The warnings came from metrics being all 0 in some folds, again because the classes are uneven; the run time is probably because a kernel SVM scales poorly to a data set of this size. I'll try RandomForestClassifier and see if that can get around it.
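
Part of the run time also comes from calling `cross_val_score` twice, which fits every fold once for recall and again for precision. A sketch of a single-pass alternative with `cross_validate`, shown on synthetic data with a KNN classifier purely as a stand-in (the arrays `X` and `y` are made up):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the real features and the one-vs-rest target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.5 * rng.normal(size=200) > 0

# one pass over the folds yields every metric,
# so the model is fitted only once per fold
scores = cross_validate(
    KNeighborsClassifier(n_neighbors=10), X, y, cv=5,
    scoring=["precision", "recall", "f1"],
)
summary = {name: scores[f"test_{name}"].mean()
           for name in ("precision", "recall", "f1")}
print(summary)
```

Note that the `f1` reported here is averaged per fold, so it can differ slightly from the F1 computed from the averaged precision and recall in the cells above.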

In [0]:
for val in errs:
  print(val[2])

In [10]:
# calculate estimate of test error for a given feature set
from sklearn.ensemble import RandomForestClassifier
y_train = df_happy_jobs['General happiness_x']
def get_cv_error(features, clss):
  # define pipeline
  if len(features) == 1:
    ct = make_column_transformer(
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )
  if len(features) == 2:
    ct = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ["Satisfaction with financial situation"]),
        (StandardScaler(), ["income"]),
        remainder="passthrough"
    )

  pipeline = make_pipeline(
      ct,
      RandomForestClassifier()
  )

  # one-vs-rest target: True where the respondent is in class clss
  is_class = (y_train == clss)
  # average precision and recall over the cross-validation folds
  recall = cross_val_score(pipeline, df_happy_jobs[features],
                is_class, cv=10, scoring="recall").mean()
  precision = cross_val_score(pipeline, df_happy_jobs[features],
                is_class, cv=10, scoring="precision").mean()
  # F1 is the harmonic mean of the averaged precision and recall
  return (precision, recall, 2 * (precision * recall) / (precision + recall))

# calculate and store scores for different feature sets
errs = pd.Series(dtype=object)
classes = ['Not too happy','Pretty happy', 'Very happy']
for clss in classes:
  for features in [["income"],
                   ["income", "Satisfaction with financial situation"]]:
    errs[clss + ", " + str(features)] = get_cv_error(features, clss)

errs

  _warn_prf(average, modifier, msg_start, len(result))


Not too happy, ['income']                                             (0.042131247197200725, 0.1047298128579826, 0.0...
Not too happy, ['income', 'Satisfaction with financial situation']    (0.15275698237124052, 0.08723428977622244, 0.1...
Pretty happy, ['income']                                              (0.5653308608153542, 0.7604959620909157, 0.648...
Pretty happy, ['income', 'Satisfaction with financial situation']     (0.5766200025337027, 0.6910342664629766, 0.628...
Very happy, ['income']                                                (0.2758813348224719, 0.13857551123766804, 0.18...
Very happy, ['income', 'Satisfaction with financial situation']       (0.33137346904234866, 0.222111858164385, 0.265...
dtype: object

In [11]:
for val in errs:
  print(val[2])

# the below are the f1 scores
# they are ordered the same as the happiness and associated features above

0.060089415571129186
0.11105109570256486
0.6485490102742516
0.6286638087751365
0.18448433113061366
0.2659581865586948


The best-performing model was RandomForestClassifier with income and satisfaction with financial situation. Adding other features cut the data down too much, and svm.SVC ran for too long to finish. F1 never got very good: this model reaches about .63 for Pretty happy (.65 with income alone), .27 for Very happy, and .11 for Not too happy. That isn't horrible considering how hard it is to tell if someone really is "happy".