<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Laden-der-Daten-und-Hinzufügen-von-neugenerierten-Metadaten" data-toc-modified-id="Laden-der-Daten-und-Hinzufügen-von-neugenerierten-Metadaten-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Laden der Daten und Hinzufügen von neugenerierten Metadaten</a></span></li><li><span><a href="#Zusammenführen-von-Daten-und-Metadaten" data-toc-modified-id="Zusammenführen-von-Daten-und-Metadaten-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Zusammenführen von Daten und Metadaten</a></span></li><li><span><a href="#Reduzierung-auf-Daten-mit-ground_truth-Label" data-toc-modified-id="Reduzierung-auf-Daten-mit-ground_truth-Label-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Reduzierung auf Daten mit ground_truth Label</a></span></li><li><span><a href="#Altersangabe-nach-Dekade" data-toc-modified-id="Altersangabe-nach-Dekade-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Altersangabe nach Dekade</a></span></li><li><span><a href="#Gruppierung-nach-Alter-pro-Dekade" data-toc-modified-id="Gruppierung-nach-Alter-pro-Dekade-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Gruppierung nach Alter pro Dekade</a></span></li><li><span><a href="#Gruppierung-nach-Worker-ID" data-toc-modified-id="Gruppierung-nach-Worker-ID-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Gruppierung nach Worker-ID</a></span></li><li><span><a href="#nach-Alter-und-Beziehungsstatus" data-toc-modified-id="nach-Alter-und-Beziehungsstatus-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>nach Alter und Beziehungsstatus</a></span></li></ul></div>

In [2]:
import pandas as pd
import numpy as np

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import string

import matplotlib.pyplot as plt
import seaborn as sns

Quelle: https://www.kaggle.com/ritresearch/happydb

What made you happy today? Reflect on the past 24 hours, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence.

Citation:

Please cite the following publication if you are using the dataset for your work:

HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments, LREC 2018 (to appear)

Akari Asai, Sara Evensen, Behzad Golshan, Alon Halevy, Vivian Li, Andrei
Lopatenko, Daniela Stepanov, Yoshihiko Suhara, Wang-Chiew Tan and Yinzhan Xu

#### Laden der Daten und Hinzufügen von neugenerierten Metadaten

In [49]:
# demographic.csv: Enthält demographische Informationen zu den Beschäftigten. Die Metadaten enthalten worker id, age, country, 
# gender, marital status, und status of parenthood.
demographic = pd.read_csv('demographic.csv')
demographic.head(1)

Unnamed: 0,wid,age,country,gender,marital,parenthood
0,1,37.0,USA,m,married,y


In [50]:
# cleaned_hm.csv: Bereinigtes Korpus der 100.000 crowd-sourced Happy Moments.
# Zur Bereinigung wurde Rechschreibung kontrolliert und leere/ ein-Wort Statements entfernt
# Die rohen, unbereinigten Momente sind zur Referenz erhalten
# Der Autor jedes Moments ist durch seine/ihre worker ID/ wid repräsentiert.
happy = pd.read_csv('cleaned_hm.csv')
happy.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection


In [51]:
happy['length'] = happy['cleaned_hm'].apply(lambda x: len(x.split()))
happy.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection,14


In [52]:
ps = PorterStemmer()

for i, row in happy.iterrows():
    words = word_tokenize(row.cleaned_hm)
    sent = ""
    for word in words:
        sent = sent + " " + ps.stem(word)
    happy.at[i,"stemmed"] = sent
happy.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection,14,I went on a success date with someon I felt s...


In [12]:
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Returns a list of the cleaned text
    """
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # remove commonly occurring words
    LIMIT_WORDS = ['happy', 'day', 'got', 'went', 'today', 'made', 'one', 'two', 'time', 'last', 'first', 'going', 'getting', 'took', 'found', 'lot', 'really', 'saw', 'see', 'month', 'week', 'day', 'yesterday', 'year', 'ago', 'now', 'still', 'since', 'something', 'great', 'good', 'long', 'thing', 'toi', 'without', 'yesteri', '2s', 'toand', 'ing']
    nopunc = [char for char in mess if char.lower() not in LIMIT_WORDS]
        
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)   
    # Now just remove any stopwords

    
    return [word.lower() for word in nopunc.split()]
# Apply to entire happy dataset, column cleaned_hm
happy['preprocessed'] = happy['stemmed'].apply(text_process)
happy.head(1)

#### Zusammenführen von Daten und Metadaten

In [54]:
happy_moments = pd.merge(happy, demographic, on='wid', validate = 'm:1')
happy_moments.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed,preprocessed,age,country,gender,marital,parenthood
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection,14,I went on a success date with someon I felt s...,"[i, went, on, a, success, date, with, someon, ...",35,USA,m,single,n


#### Reduzierung auf Daten mit ground_truth Label

In [55]:
happy_ground = happy_moments[happy_moments['ground_truth_category'].notna()]
happy_ground.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed,preprocessed,age,country,gender,marital,parenthood
6,40281,2053,24h,I played a game for about half an hour.,I played a game for about half an hour.,True,1,leisure,leisure,9,I play a game for about half an hour .,"[i, play, a, game, for, about, half, an, hour, .]",35,USA,m,single,n


#### Altersangabe nach Dekade

In [56]:
age = happy_ground[happy_ground['age']!='prefer not to say']
age = age[age['age']!='čá']

for i, row in age.iterrows():
    if float(row.age)>100:
        age.at[i,'age']= np.nan
    if float(row.age)<18:
        age.at[i,'age']= np.nan
        
age = age[age.age.notna()]
age[['age']] = age[['age']].astype(float)
age[['age']] = age[['age']].astype(int)
age.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed,preprocessed,age,country,gender,marital,parenthood
6,40281,2053,24h,I played a game for about half an hour.,I played a game for about half an hour.,True,1,leisure,leisure,9,I play a game for about half an hour .,"[i, play, a, game, for, about, half, an, hour, .]",35,USA,m,single,n


In [57]:
np.unique(age.age)

array([18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
       69, 70, 71, 72, 73, 74, 75, 76, 77, 79, 81, 83, 88])

In [58]:
# add age range column
for i, row in age.iterrows():
    if row.age < 20:
        age.at[i,'age_range'] = '10-19'
    if row.age >=20 and row.age <30:
        age.at[i,'age_range'] = '20-29'
    if row.age >=30 and row.age < 40:
        age.at[i,'age_range'] = '30-39'
    if row.age >=40 and row.age < 50:
        age.at[i,'age_range'] = '40-49'
    if row.age >=50 and row.age < 60:
        age.at[i,'age_range'] = '50-59'
    if row.age >=60 and row.age < 70:
        age.at[i,'age_range'] = '60-69'
    if row.age >=70 and row.age < 80:
        age.at[i,'age_range'] = '70-79'
    if row.age >=80 and row.age < 90:
        age.at[i,'age_range'] = '80-89'
age.head(1)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed,preprocessed,age,country,gender,marital,parenthood,age_range
6,40281,2053,24h,I played a game for about half an hour.,I played a game for about half an hour.,True,1,leisure,leisure,9,I play a game for about half an hour .,"[i, play, a, game, for, about, half, an, hour, .]",35,USA,m,single,n,30-39


In [60]:
for i, row in age.iterrows():
    relationship = row.marital
    ages = row.age_range
    age.at[i, 'relation_age'] = str(relationship)+'_'+str(ages)
age

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed,preprocessed,age,country,gender,marital,parenthood,age_range,relation_age
6,40281,2053,24h,I played a game for about half an hour.,I played a game for about half an hour.,True,1,leisure,leisure,9,I play a game for about half an hour .,"[i, play, a, game, for, about, half, an, hour, .]",35,USA,m,single,n,30-39,single_30-39
15,32821,2,24h,When my family plan a abroad tour with me,When my family plan a abroad tour with me,True,1,affection,affection,9,when my famili plan a abroad tour with me,"[when, my, famili, plan, a, abroad, tour, with...",29,IND,m,married,y,20-29,married_20-29
19,34843,2,24h,When my house ready to live with my family,When my house ready to live with my family,True,1,affection,affection,9,when my hous readi to live with my famili,"[when, my, hous, readi, to, live, with, my, fa...",29,IND,m,married,y,20-29,married_20-29
23,37031,2,24h,When my friend meet me today with expensive gi...,When my friend meet me today with expensive gi...,True,1,bonding,bonding,11,when my friend meet me today with expens gift...,"[when, my, friend, meet, me, today, with, expe...",29,IND,m,married,y,20-29,married_20-29
25,38598,2,24h,I was very happy when my son playing with whol...,I was very happy when my son playing with whol...,True,1,affection,affection,11,I wa veri happi when my son play with whole day,"[i, wa, veri, happi, when, my, son, play, with...",29,IND,m,married,y,20-29,married_20-29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100494,128202,10685,24h,My tooth stopped aching after my dentist visit.,My tooth stopped aching after my dentist visit.,True,1,achievement,achievement,8,My tooth stop ach after my dentist visit .,"[my, tooth, stop, ach, after, my, dentist, vis...",46,USA,f,married,y,40-49,married_40-49
100496,127705,9044,24h,I took a bath with my husband.,I took a bath with my husband.,True,1,affection,affection,7,I took a bath with my husband .,"[i, took, a, bath, with, my, husband, .]",32,USA,f,married,n,30-39,married_30-39
100526,127748,8880,24h,I got on the scales in the morning and I was 5...,I got on the scales in the morning and I was 5...,True,1,achievement,achievement,14,I got on the scale in the morn and I wa 5 lb ...,"[i, got, on, the, scale, in, the, morn, and, i...",58,USA,f,married,n,50-59,married_50-59
100529,127751,11402,24h,Quite dinner with my wife.,Quite dinner with my wife.,True,1,affection,affection,5,quit dinner with my wife .,"[quit, dinner, with, my, wife, .]",32,USA,m,married,n,30-39,married_30-39


In [61]:
import os
directories = os.listdir('topic_dict/')
dic = {}
for file in directories:
    p = pd.read_csv('topic_dict/'+file, header=None)
    dic[file.split('-')[0]] = list(p[0])

In [62]:
for i, row in age.iterrows():
    for k, v in dic.items():
        for word in v:
            if word in row.cleaned_hm:
                age.at[i, 'topic']=k

In [63]:
age

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,length,stemmed,preprocessed,age,country,gender,marital,parenthood,age_range,relation_age,topic
6,40281,2053,24h,I played a game for about half an hour.,I played a game for about half an hour.,True,1,leisure,leisure,9,I play a game for about half an hour .,"[i, play, a, game, for, about, half, an, hour, .]",35,USA,m,single,n,30-39,single_30-39,
15,32821,2,24h,When my family plan a abroad tour with me,When my family plan a abroad tour with me,True,1,affection,affection,9,when my famili plan a abroad tour with me,"[when, my, famili, plan, a, abroad, tour, with...",29,IND,m,married,y,20-29,married_20-29,family
19,34843,2,24h,When my house ready to live with my family,When my house ready to live with my family,True,1,affection,affection,9,when my hous readi to live with my famili,"[when, my, hous, readi, to, live, with, my, fa...",29,IND,m,married,y,20-29,married_20-29,family
23,37031,2,24h,When my friend meet me today with expensive gi...,When my friend meet me today with expensive gi...,True,1,bonding,bonding,11,when my friend meet me today with expens gift...,"[when, my, friend, meet, me, today, with, expe...",29,IND,m,married,y,20-29,married_20-29,people
25,38598,2,24h,I was very happy when my son playing with whol...,I was very happy when my son playing with whol...,True,1,affection,affection,11,I wa veri happi when my son play with whole day,"[i, wa, veri, happi, when, my, son, play, with...",29,IND,m,married,y,20-29,married_20-29,family
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100494,128202,10685,24h,My tooth stopped aching after my dentist visit.,My tooth stopped aching after my dentist visit.,True,1,achievement,achievement,8,My tooth stop ach after my dentist visit .,"[my, tooth, stop, ach, after, my, dentist, vis...",46,USA,f,married,y,40-49,married_40-49,
100496,127705,9044,24h,I took a bath with my husband.,I took a bath with my husband.,True,1,affection,affection,7,I took a bath with my husband .,"[i, took, a, bath, with, my, husband, .]",32,USA,f,married,n,30-39,married_30-39,family
100526,127748,8880,24h,I got on the scales in the morning and I was 5...,I got on the scales in the morning and I was 5...,True,1,achievement,achievement,14,I got on the scale in the morn and I wa 5 lb ...,"[i, got, on, the, scale, in, the, morn, and, i...",58,USA,f,married,n,50-59,married_50-59,food
100529,127751,11402,24h,Quite dinner with my wife.,Quite dinner with my wife.,True,1,affection,affection,5,quit dinner with my wife .,"[quit, dinner, with, my, wife, .]",32,USA,m,married,n,30-39,married_30-39,food


In [64]:
age = age.fillna(0)

In [65]:
happy_day = age[age.reflection_period=='24h']
happy_months = age[age.reflection_period=='3m']

In [66]:
age.to_csv('happy_preprocessed_onlygroundtruth.csv')

In [67]:
happy_day.to_csv('happy_preprocessed_24h.csv')

In [68]:
happy_months.to_csv('happy_preprocessed_3m.csv')

#### Gruppierung nach Alter pro Dekade

In [33]:
combined_all = age.groupby(['age_range', 'ground_truth_category'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
c = age.groupby(['age_range', 'ground_truth_category'])['cleaned_hm'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
combined_all = combined_all.merge(c, on=['age_range','ground_truth_category'])
combined_all.drop_duplicates(subset=['age_range', 'ground_truth_category'], keep='first', inplace=True)
combined_all['preprocessed'] = combined_all['stemmed'].apply(text_process)
combined_all['length'] = combined_all['stemmed'].apply(lambda x: len(x.split()))
combined_all.head(1)

Unnamed: 0,age_range,ground_truth_category,stemmed,cleaned_hm,preprocessed,length
0,10-19,achievement,I swat for my carib stud exam and I wa abl to...,I swatted for my carib studs exam and I was ab...,"[i, swat, for, my, carib, stud, exam, and, i, ...",1072


In [34]:
combined_day = happy_day.groupby(['age_range', 'ground_truth_category'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
c = happy_day.groupby(['age_range', 'ground_truth_category'])['cleaned_hm'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
combined_day = combined_day.merge(c, on=['age_range','ground_truth_category'])
combined_day.drop_duplicates(subset=['age_range', 'ground_truth_category'], keep='first', inplace=True)
combined_day['preprocessed'] = combined_day['stemmed'].apply(text_process)
combined_day['length'] = combined_day['stemmed'].apply(lambda x: len(x.split()))
combined_day.head(1)

Unnamed: 0,age_range,ground_truth_category,stemmed,cleaned_hm,preprocessed,length
0,10-19,achievement,"I final complet a long , tire task at work . ...","I finally completed a long, tiring task at wor...","[i, final, complet, a, long, ,, tire, task, at...",464


In [35]:
combined_month = happy_months.groupby(['age_range', 'ground_truth_category'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
c = happy_months.groupby(['age_range', 'ground_truth_category'])['cleaned_hm'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
combined_month = combined_month.merge(c, on=['age_range','ground_truth_category'])
combined_month.drop_duplicates(subset=['age_range', 'ground_truth_category'], keep='first', inplace=True)
combined_month['preprocessed'] = combined_month['stemmed'].apply(text_process)
combined_month['length'] = combined_month['stemmed'].apply(lambda x: len(x.split()))
combined_month.head(1)

Unnamed: 0,age_range,ground_truth_category,stemmed,cleaned_hm,preprocessed,length
0,10-19,achievement,I swat for my carib stud exam and I wa abl to...,I swatted for my carib studs exam and I was ab...,"[i, swat, for, my, carib, stud, exam, and, i, ...",608


In [36]:
combined_all = combined_all.to_csv('happy_combined_by_age_all.csv')
combined_day = combined_day.to_csv('happy_combined_by_age_day.csv')
combined_month = combined_month.to_csv('happy_combined_by_age_months.csv')

#### Gruppierung nach Worker-ID

In [69]:
wid_all = age.groupby(['wid'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
wid_all['preprocessed'] = wid_all['stemmed'].apply(text_process)
wid_all['length'] = wid_all['stemmed'].apply(lambda x: len(x.split()))
wid_all = pd.merge(wid_all, age[['wid','reflection_period','country','gender','marital', 'parenthood','cleaned_hm','relation_age','age_range','age']], on=['wid'])
wid_all.drop_duplicates(subset=['wid'], keep='first', inplace=True)
wid_all.head(1)

Unnamed: 0,wid,stemmed,preprocessed,length,reflection_period,country,gender,marital,parenthood,cleaned_hm,relation_age,age_range,age
0,1,My mother call out of the blue to tell me how...,"[my, mother, call, out, of, the, blue, to, tel...",141,24h,USA,m,married,y,My mother called out of the blue to tell me ho...,married_30-39,30-39,37


In [70]:
wid_day = happy_day.groupby(['wid'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
wid_day['preprocessed'] = wid_day['stemmed'].apply(text_process)
wid_day['length'] = wid_day['stemmed'].apply(lambda x: len(x.split()))
wid_day = pd.merge(wid_day, happy_day[['wid','reflection_period','country','gender','marital', 'parenthood','cleaned_hm','relation_age','age_range','age']], on=['wid'])
wid_day.drop_duplicates(subset=['wid'], keep='first', inplace=True)
wid_day.head(1)

Unnamed: 0,wid,stemmed,preprocessed,length,reflection_period,country,gender,marital,parenthood,cleaned_hm,relation_age,age_range,age
0,1,My mother call out of the blue to tell me how...,"[my, mother, call, out, of, the, blue, to, tel...",72,24h,USA,m,married,y,My mother called out of the blue to tell me ho...,married_30-39,30-39,37


In [71]:
wid_month = happy_months.groupby(['wid'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
wid_month['preprocessed'] = wid_month['stemmed'].apply(text_process)
wid_month['length'] = wid_month['stemmed'].apply(lambda x: len(x.split()))
wid_month = pd.merge(wid_month, happy_months[['wid','reflection_period','country','gender','marital', 'parenthood', 'cleaned_hm','relation_age','age_range','age']], on=['wid'])
wid_month.drop_duplicates(subset=['wid'], keep='first', inplace=True)
wid_month.head(1)

Unnamed: 0,wid,stemmed,preprocessed,length,reflection_period,country,gender,marital,parenthood,cleaned_hm,relation_age,age_range,age
0,1,my son had a great time on hi 8th birthday . ...,"[my, son, had, a, great, time, on, hi, 8th, bi...",69,3m,USA,m,married,y,my son had a great time on his 8th birthday.,married_30-39,30-39,37


In [47]:
wid_all.to_csv('happy_combined_by_wid_all.csv')
wid_day.to_csv('happy_combined_by_wid_day.csv')
wid_month.to_csv('happy_combined_by_wid_months.csv')

#### nach Alter und Beziehungsstatus

In [7]:
happy_day = pd.read_csv('happy_preprocessed_24h.csv')
happy_months = pd.read_csv('happy_preprocessed_3m.csv')

In [9]:
for i, row in happy_day.iterrows():
    relationship = row.marital
    ages = row.age_range
    happy_day.at[i, 'relation_age'] = str(relationship)+'_'+str(ages)
happy_day.head(1)

Unnamed: 0.1,Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,...,stemmed,preprocessed,age,country,gender,marital,parenthood,age_range,relation_age,topic
0,6,40281,2053,24h,I played a game for about half an hour.,I played a game for about half an hour.,True,1,leisure,leisure,...,I play a game for about half an hour .,"['i', 'play', 'a', 'game', 'for', 'about', 'ha...",35,USA,m,single,n,30-39,single_30-39,0


In [17]:
rel_age = happy_day.groupby(['relation_age', 'ground_truth_category'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
c = happy_day.groupby(['relation_age', 'ground_truth_category'])['cleaned_hm'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
rel_age_ = rel_age.merge(c, on=['relation_age','ground_truth_category'])
rel_age.drop_duplicates(subset=['relation_age', 'ground_truth_category'], keep='first', inplace=True)
rel_age['preprocessed'] = rel_age['stemmed'].apply(text_process)
rel_age['length'] = rel_age['stemmed'].apply(lambda x: len(x.split()))

for i, row in rel_age.iterrows():
    rel_age.at[i, 'marital'] = row.relation_age.split('_')[0]
    rel_age.at[i, 'age_rage'] = row.relation_age.split('_')[1]
    
rel_age.to_csv('rel_age_day.csv')

In [18]:
rel_age_m = happy_months.groupby(['relation_age', 'ground_truth_category'])['stemmed'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
c = happy_months.groupby(['relation_age', 'ground_truth_category'])['cleaned_hm'].apply(lambda x: ' '.join(x.astype(str))).reset_index()
rel_age_m = rel_age_m.merge(c, on=['relation_age','ground_truth_category'])
rel_age_m.drop_duplicates(subset=['relation_age', 'ground_truth_category'], keep='first', inplace=True)
rel_age_m['preprocessed'] = rel_age_m['stemmed'].apply(text_process)
rel_age_m['length'] = rel_age_m['stemmed'].apply(lambda x: len(x.split()))

for i, row in rel_age_m.iterrows():
    rel_age_m.at[i, 'marital'] = row.relation_age.split('_')[0]
    rel_age_m.at[i, 'age_rage'] = row.relation_age.split('_')[1]

rel_age_m.to_csv('rel_age_months.csv')