# What Makes People Happy?


## Part 2: NLP Modeling and Results

---

## Modeling Goals:

###### 1. Select Model
The goal for this portion of the project was to determine which model was able to most accurately predict which category each Happy Moment was assigned to. Each Happy Moment was pre-assigned to a category by the research team that initially studied the subject. 
###### 2. Identify Most Informative Features
After selecting the most accurate model, apply it to each of the categories to determine which features were most significant to its prediction.


In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy as sp
import spacy
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.feature_selection import chi2
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
import nltk
from nltk.corpus import stopwords
import string
import re

plt.style.use('fivethirtyeight')

%matplotlib inline

#### Read in data and dummy the categories

In [3]:
hm_df = pd.read_csv('./data/happy_moments.csv')
category_id_dummies = pd.get_dummies(hm_df.category_id,prefix='category')
category_id_dummies.rename(columns={'category_0':'affection_0',
                                   'category_1':'leisure_1',
                                   'category_2':'bonding_2',
                                   'category_3':'achievement_3',
                                   'category_4':'enjoy_the_moment_4',
                                   'category_5':'exercise_5',
                                   'category_6':'nature_6'},inplace=True)
hm_df = pd.concat([hm_df,category_id_dummies],axis=1)
hm_df.to_csv('./data/hm_df.csv',index=False)
hm_df.head(2)

Unnamed: 0,hmid,wid,reflection_period,cleaned_hm,num_sentence,predicted_category,age,country,gender,marital,parenthood,category_id,toke_text,affection_0,leisure_1,bonding_2,achievement_3,enjoy_the_moment_4,exercise_5,nature_6
0,27673,2053,24h,I went on a successful date with someone I fel...,1,affection,35.0,USA,m,single,n,0,"['successful', 'date', 'feel', 'sympathy', 'co...",1,0,0,0,0,0,0
1,27873,2053,24h,I played a new game that was fun and got to en...,1,leisure,35.0,USA,m,single,n,1,"['play', 'new', 'game', 'fun', 'enjoy', 'mecha...",0,1,0,0,0,0,0


#### Prepare data for NLP

In [4]:
nlp = spacy.load('en_core_web_sm')
STOPLIST = set(stopwords.words('english') + ["n't", "'s", "'m", "ca"] + list(ENGLISH_STOP_WORDS))
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-----", "---", "...", "“", "”", "'ve"]
category_id_df = hm_df[['predicted_category','category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)

### Modeling Functions

Functions created to fit and test the accuracy of different models

In [5]:
class CleanTextTransformer(TransformerMixin):
    # Pipeline step that normalizes whitespace and case before vectorizing

    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; the transformer is stateless
        return self

    def get_params(self, deep=True):
        return {}


def cleanText(text):
    # replace newlines and carriage returns with spaces
    text = text.strip().replace("\n", " ").replace("\r", " ")

    # lowercase
    text = text.lower()
    return text
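
The hand-written `get_params` above exists so that scikit-learn can inspect and copy the transformer. A minimal sketch of an alternative, assuming a reasonably recent scikit-learn, is to inherit from `BaseEstimator`, which supplies `get_params` and `set_params` automatically. The names `LowercaseCleaner` and `clean_text` are hypothetical stand-ins for illustration, not part of the original notebook.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone


def clean_text(text):
    # Same cleaning steps as cleanText: trim, flatten line breaks, lowercase
    return text.strip().replace("\n", " ").replace("\r", " ").lower()


class LowercaseCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for CleanTextTransformer."""

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        return [clean_text(text) for text in X]


cleaner = LowercaseCleaner()
cleaned = cleaner.fit_transform(["  Went for a\nRUN  "])

# BaseEstimator provides get_params, so the transformer can be cloned
# (which tools such as GridSearchCV do) without extra code
cleaner_copy = clone(cleaner)
```

Because the class takes no constructor arguments, the inherited `get_params` returns an empty dictionary, matching the behavior of the manual override.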

#### Count Vectorizer
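This function checks the accuracy of a classifier trained on count-vectorized Happy Moment tokens, using a single 80/20 train/test split. The "CV" in the printed label stands for CountVectorizer, not cross-validation.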

In [6]:
def model_test(df,clf):
    vectorizer = CountVectorizer(ngram_range=(1,1), min_df=2)
    pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])
    X = df.toke_text
    y = df.category_id
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,test_size=0.2)
    pipe.fit(X_train,y_train)
    y_pred_class = pipe.predict(X_test)
    print("----------------------------------------------------------------------------------------------")
    print("CV Accuracy Score:", metrics.accuracy_score(y_test, y_pred_class))
    print("----------------------------------------------------------------------------------------------")


#### TFIDF Vectorizer
This function checks the accuracy of a classifier trained on Term Frequency-Inverse Document Frequency (TFIDF) features built from the Happy Moment tokens, using the same 80/20 train/test split.

In [7]:
def tfidf_model_test(df,clf):
    vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2)
    pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])
    X = df.toke_text
    y = df.category_id
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,test_size=0.2)
    pipe.fit(X_train,y_train)
    y_pred_class = pipe.predict(X_test)
    print("----------------------------------------------------------------------------------------------")
    print("Tfidf Accuracy Score:", metrics.accuracy_score(y_test, y_pred_class))
    print("----------------------------------------------------------------------------------------------")
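
To make the TFIDF weighting concrete, here is a small pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, followed by L2 normalization of each document vector. The function name `tfidf_vectors` and the three toy documents are assumptions for illustration. The sketch splits on whitespace, which is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TFIDF vectors)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Toy documents (hypothetical, not taken from the Happy Moments data)
vocab, vectors = tfidf_vectors(["gym run", "gym yoga", "rain sun"])
```

In the first toy document, `run` receives a larger weight than `gym` because `gym` also appears in the second document. This is the effect that separates TFIDF from plain counts.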

### Model Testing
Each model's accuracy was tested with both the CountVectorizer and TFIDF testing functions.

#### Models Tested:
- Linear Support Vector Machine
- Multinomial Naive Bayes
- Logistic Regression

In [8]:
clf_lsvc = LinearSVC()
model_test(hm_df,clf_lsvc)
clf_lsvc_tfidf = LinearSVC()
tfidf_model_test(hm_df,clf_lsvc_tfidf)#Highest Scoring Model

----------------------------------------------------------------------------------------------
CV Accuracy Score: 0.8792198049512379
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.8841210302575644
----------------------------------------------------------------------------------------------


In [9]:
clf_nb = MultinomialNB()
model_test(hm_df,clf_nb)
clf_nb_tfidf = MultinomialNB()
tfidf_model_test(hm_df,clf_nb_tfidf)

----------------------------------------------------------------------------------------------
CV Accuracy Score: 0.8037509377344336
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.7428357089272318
----------------------------------------------------------------------------------------------


In [10]:
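clf_lr = LogisticRegression(C=1e9)
model_test(hm_df,clf_lr)
clf_lr_tfidf = LogisticRegression(C=1e9)
# Pass the logistic regression here. The original call passed clf_nb, so the
# TFIDF score recorded below repeats the Naive Bayes result from In [9].
tfidf_model_test(hm_df,clf_lr_tfidf)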

----------------------------------------------------------------------------------------------
CV Accuracy Score: 0.8623155788947237
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.7428357089272318
----------------------------------------------------------------------------------------------


### Model Selected:
> The Linear Support Vector Machine with a TFIDF vectorizer achieved the highest accuracy (about 0.884) of the configurations tested.
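
The comparison above relies on a single 80/20 split per configuration. One way to make such a comparison less sensitive to the particular split is k-fold cross-validation over every vectorizer and classifier pair. The sketch below is illustrative only: the toy `texts` and `labels` are invented stand-ins for `hm_df.toke_text` and `hm_df.category_id`, the vectorizers use default settings rather than `min_df=2`, and the scores it produces say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data (hypothetical): four short "moments" per category
texts = [
    "wife husband dinner", "son daughter family", "husband family trip", "wife son park",
    "gym workout morning", "yoga gym class", "run mile gym", "workout run yoga",
    "rain weather cool", "sun weather beach", "snow weather walk", "sun rain sunset",
]
labels = ["affection"] * 4 + ["exercise"] * 4 + ["nature"] * 4

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "linear_svc": LinearSVC,
    "naive_bayes": MultinomialNB,
    "log_reg": lambda: LogisticRegression(max_iter=1000),
}

scores = {}
for vec_name, make_vec in vectorizers.items():
    for clf_name, make_clf in classifiers.items():
        # A fresh pipeline per pair, so no fitted state is shared
        pipe = make_pipeline(make_vec(), make_clf())
        fold_scores = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
        scores[(vec_name, clf_name)] = fold_scores.mean()

for pair, mean_accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(mean_accuracy, 3))
```

Building a fresh pipeline for each pair also avoids the kind of variable mix-up that is easy to make when classifiers are created and passed by hand.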

---

## Apply Selected Model to Each of the Predicted Categories

### Function Used to Test Accuracy and Determine Most Informative Features

In [11]:
def tfidf_model_most_informative(df,dfx,dfy,clf,N):
    # df is unused; it is kept so that existing calls keep working
    vectorizer = TfidfVectorizer(ngram_range=(1,1), min_df=2)
    pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])
    X = dfx
    y = dfy
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,test_size=0.2)
    pipe.fit(X_train,y_train)
    y_pred_class = pipe.predict(X_test)
    print("----------------------------------------------------------------------------------------------")
    print("Tfidf Accuracy Score:", metrics.accuracy_score(y_test, y_pred_class))
    print("----------------------------------------------------------------------------------------------")
    # Print the N features with the highest coefficient values for the positive class.
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    feature_names = vectorizer.get_feature_names_out().tolist()
    coefs_with_fns = sorted(zip(clf.coef_[0].tolist(), feature_names))
    top_features = coefs_with_fns[:-(N + 1):-1]
    print("#Features with highest coefficient values:")
    print("(coefficient,feature name)")
    for feat in top_features:
        print(feat)


### Apply Function to Each of the Predicted Categories

The model is fit on each category's binary indicator column in turn. Each run reports the accuracy of predicting that category, as well as the most significant features used to predict it.
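
Because each run is a binary problem on a single indicator column, accuracy is easiest to interpret next to a majority-class baseline: a model that always predicts "not in this category" is already correct for every moment outside the category. The sketch below shows that calculation in plain Python. The function name `majority_baseline_accuracy` and the 5-in-100 toy labels are assumptions for illustration, not figures from the Happy Moments data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical indicator column: 5 positives out of 100 moments
toy_labels = [1] * 5 + [0] * 95
baseline = majority_baseline_accuracy(toy_labels)
print("Majority-class baseline accuracy:", baseline)
```

On the real data, the same function could be applied to a column such as `hm_df.exercise_5` to see how much of each reported accuracy comes from class imbalance alone.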

#### Predicted Category: Affection

In [12]:
dfx = hm_df.toke_text
dfy = hm_df.affection_0
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.95743935983996
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(7.981595981332734, 'wife')
(7.928953343039355, 'husband')
(7.7170025137529565, 'son')
(7.541949527205374, 'family')
(7.420829562123793, 'daughter')
(6.591834665638432, 'sister')
(6.3519700444545215, 'boyfriend')
(6.266532712706981, 'brother')


#### Predicted Category: Leisure

In [13]:
dfx = hm_df.toke_text
dfy = hm_df.leisure_1
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.9645911477869468
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(2.7710294914729934, 'surprises')
(2.7185430411293936, 'nap')
(2.704888026135278, 'episode')
(2.690225468511501, 'watch')
(2.551766759141951, 'read')
(2.4594703940097515, 'watched')
(2.3488387150192915, 'game')
(2.2023985328526026, 'mexican')


#### Predicted Category: Bonding
> **Note:** Bonding differs from Affection in that it covers relationships more likely associated with friends, neighbors, etc.

In [14]:
dfx = hm_df.toke_text
dfy = hm_df.bonding_2
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.9867966991747937
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(11.922201432990597, 'friend')
(5.16143130151014, 'friends')
(4.410834141469935, 'roommate')
(4.40714776435012, 'coworker')
(4.3317949351715725, 'worker')
(3.838889638640418, 'neighbor')
(3.280475213778114, 'colleague')
(2.8752669872010337, 'mate')


#### Predicted Category: Achievement

In [15]:
dfx = hm_df.toke_text
dfy = hm_df.achievement_3
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.9218304576144036
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(3.0419402742253467, 'job')
(2.751026948135801, 'exam')
(2.7222849285122197, 'raise')
(2.65144150447287, 'win')
(2.578817623182829, 'approval')
(2.540846301723316, 'company')
(2.4464555312449967, 'aid')
(2.235979004134518, 'beat')


#### Predicted Category: Enjoy The Moment
> **Note:** This category appears to be more of a "catch-all" that contains the miscellaneous Happy Moments.

In [16]:
dfx = hm_df.toke_text
dfy = hm_df.enjoy_the_moment_4
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.9393348337084271
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(3.151642883309448, 'happy')
(2.2974743587764372, 'blissful')
(2.156808753600808, 'package')
(2.1515789231558826, 'randomly')
(2.143983994020429, 'enjoy')
(2.121652923544125, 'jesus')
(2.0445416110410854, 'abandon')
(2.0266449832488274, 'crave')


#### Predicted Category: Exercise

In [17]:
dfx = hm_df.toke_text
dfy = hm_df.exercise_5
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.9959989997499374
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(4.810522070447907, 'gym')
(3.8414214460514344, 'workout')
(3.1397626095780953, 'yoga')
(2.8229801110999704, 'exercise')
(2.660053903065797, 'run')
(2.5606706442601506, 'mile')
(2.3055573561273355, 'jog')
(2.3027550089756876, 'push')


#### Predicted Category: Nature

In [18]:
dfx = hm_df.toke_text
dfy = hm_df.nature_6
tfidf_model_most_informative(hm_df,dfx,dfy,clf_lsvc_tfidf,8)

----------------------------------------------------------------------------------------------
Tfidf Accuracy Score: 0.9915478869717429
----------------------------------------------------------------------------------------------
#Features with highest coefficient values:
(coefficient,feature name)
(3.7893986407982005, 'weather')
(3.341300331051462, 'rain')
(2.788552013488483, 'sun')
(2.6661528141123862, 'temperature')
(2.593729616755351, 'snow')
(2.41448242713872, 'ocean')
(2.3472220589449395, 'sunset')
(2.2607673873897647, 'nature')


# Conclusion:

The preceding results indicate that the model was generally successful at predicting the category of each worker's Happy Moment.

Further analysis will be done to better understand the relationships between workers' demographic data and the predicted results and features.