
## Summarizing Data

In [None]:
import pandas as pd
import numpy as np
import keras
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from google.colab import data_table


df = pd.read_csv("dataset_mood_smartphone.csv", parse_dates=["time"])
df = df.rename(columns = {"Unnamed: 0" : "index", "id" : "user_id", "time" : "datetime"}, 
               inplace = False)
# print(df[df['user_id']=='AS14.01']) # all information for one user Id
print(df['user_id'].value_counts()) # instances per user
print("Number of different user id's:", len(df['user_id'].value_counts())) # number of users
df.head()


## Preprocessing

In [None]:
# unstack creates a separate column for each distinct value of 'variable'
df_seperated = df.set_index(['variable','index', 'user_id', 'datetime']).unstack(['variable'])
# Flatten the column index: one plainly named column per variable
df_seperated.columns = ['activity',	'appCat.builtin',
                        'appCat.communication', 'appCat.entertainment',	'appCat.finance',
                        'appCat.game', 'appCat.office', 'appCat.other', 'appCat.social',
                        'appCat.travel', 'appCat.unknown', 'appCat.utilities', 
                        'appCat.weather', 'call', 'circumplex.arousal', 'circumplex.valence',
                        'mood', 'screen', 'sms']

# These resampler functions aggregate the data (by summing or averaging) per period.
# 'D' means the aggregation is done per calendar day (the index holds datetimes).

def resampler_sum(x):    
    return x.set_index('datetime').resample('D').sum()

def resampler_avg(x):
    return x.set_index('datetime').resample('D').mean()
    
# apply the resamplers
df_aggregated_daily = df_seperated.copy().reset_index(level=2).groupby(level=1).apply(resampler_sum)
df_aggregated_avg = df_seperated.copy().reset_index(level=2).groupby(level=1).apply(resampler_avg)

# These values are useful as averages (not summed!) 
df_aggregated_daily["mood"] = df_aggregated_avg["mood"]
df_aggregated_daily["circumplex.arousal"] = df_aggregated_avg["circumplex.arousal"]
df_aggregated_daily["circumplex.valence"] = df_aggregated_avg["circumplex.valence"]
df_aggregated_daily["activity"] = df_aggregated_avg["activity"]
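
As a rough illustration of the aggregation step above, the sketch below turns a few long-format records into one row per user and day, summing durations and averaging ratings. The toy values, the `value` column name, and the use of a plain `groupby` instead of the resampler helpers are assumptions made for this example only.

```python
import pandas as pd

# Toy long-format records (hypothetical values) mimicking the dataset layout.
long_df = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u1"],
    "datetime": pd.to_datetime([
        "2014-03-21 09:00", "2014-03-21 18:00",
        "2014-03-21 10:00", "2014-03-22 11:00",
    ]),
    "variable": ["mood", "mood", "screen", "screen"],
    "value": [6.0, 8.0, 100.0, 50.0],
})

# Group the values per user, calendar day and variable.
per_day = (
    long_df
    .assign(date=long_df["datetime"].dt.floor("D"))
    .groupby(["user_id", "date", "variable"])["value"]
)

# Sum everything first, then overwrite the rating-like column with its mean.
daily = per_day.sum().unstack("variable")
daily["mood"] = per_day.mean().unstack("variable")["mood"]
print(daily)
```

Unlike `resample('D')`, this variant only produces rows for days that have at least one record.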


## Handling missing values

In [None]:
df_including_na = df_aggregated_daily.copy()

# when there is no value for mood the instance is useless.
df_including_na.dropna(subset = ["mood"], inplace= True)

# all distinct users
persons = df_including_na.index.get_level_values(0).unique()

# Shows which days are missing
for person in persons:
    # print(person)
    df_per_personX = df_including_na.xs(person, level='user_id')
    # print("First recorded date: ", df_per_personX.index.min())
    # print("Final date: ", df_per_personX.index.max())
    # print("Duration: ", df_per_personX.index.max()-df_per_personX.index.min())
    # print(pd.date_range(start = df_per_personX.index.min(), end = df_per_personX.index.max()).difference(df_per_personX.index))

## Shows dataframe per user to observe where the missing days are
# df_including_na.xs('AS14.01', level='user_id') #--> participant started later (so delete first two days for which there are some measurements)
# df_including_na.xs('AS14.03', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.06', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.12', level='user_id') #--> participant started later (so delete first day for which there are some measurements)
# df_including_na.xs('AS14.14', level='user_id') #--> 3 randomly missing days (interpolate)
# df_including_na.xs('AS14.15', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.16', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.17', level='user_id') #--> skipped a whole week after a week of missing activity values. SO WHAT TO DO HERE?? <-------------------------------------------------
# df_including_na.xs('AS14.23', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.24', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.25', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.26', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.27', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.28', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.29', level='user_id') #--> 2 days randomly missing, not sequentially (interpolate)
# df_including_na.xs('AS14.31', level='user_id') #--> just one day missing (interpolate)
# df_including_na.xs('AS14.32', level='user_id') #--> 5 days randomly missing of which 2 follow each other directly (interpolate)
# df_including_na.xs('AS14.33', level='user_id') #--> 3 days randomly missing of which 2 follow each other directly (interpolate)

## EVERYONE MISSES THE 6TH OF MAY -->> mention this in report

# Rows are deleted manually (by position) for participants who had trouble at the start of the
# trial (no phone usage information available). Hard-coded positions are ugly, but finding the correct days to delete took a long time
df_no_missing_days = df_including_na.copy().drop(df_including_na.index[[0,1, 389, 635, 636, 637, 638, 639, 640, 641, 642, 643, 644, 645]])

# Interpolate 'valence'; there are only 2 cases where all data is present except for valence
df_no_missing_days['circumplex.valence'] = df_no_missing_days['circumplex.valence'].interpolate(method='linear')

# Interpolate 'activity' values, so note NO EXTRAPOLATION!
df_no_missing_activity = df_no_missing_days.groupby('user_id').apply(lambda group: group.interpolate(method='linear', limit=10, limit_area= 'inside', order=1))

# Delete rows that still contain NaN values: these would have to be extrapolated, which does not seem realistic
df_no_missing_activity = df_no_missing_activity.dropna()

# Function that inserts a row with missing values for the days that are not in the dataset
def fill_days(x): 
    missing_dates = pd.date_range(start = x.index.min(), end = x.index.max()).difference(x.index)
    for missing_date in missing_dates:
        x.loc[missing_date] = [np.nan] * x.shape[-1]
    # Sort once, after all missing days have been inserted
    x = x.sort_values(by="datetime")
    return x

# Insert missing days using the function above
df_added_days = df_no_missing_activity.copy().reset_index(level=0).groupby('user_id').apply(fill_days)
# The group-wise apply leaves a redundant 'user_id' column (it is already an index level), so drop it
df_added_days = df_added_days.drop(columns='user_id')

# interpolate values for the inserted days
df_final = df_added_days.groupby('user_id').apply(lambda group: group.interpolate(method='linear', limit=2))


# print dataframe fully
data_table.DataTable(df_final)
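
The "insert missing days, then interpolate" idea can also be sketched with `reindex` on a single user's daily frame. This is a minimal example on made-up mood values, not the notebook's `fill_days` helper; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy daily data for one hypothetical user; 23 and 24 March are missing.
days = pd.to_datetime(["2014-03-21", "2014-03-22", "2014-03-25"])
user_daily = pd.DataFrame({"mood": [6.0, 7.0, 8.5]}, index=days)

# Reindex onto the full calendar range so missing days show up as NaN rows.
full_range = pd.date_range(user_daily.index.min(), user_daily.index.max(), freq="D")
with_gaps = user_daily.reindex(full_range)

# Linear interpolation, filling at most `limit` consecutive missing days.
filled = with_gaps.interpolate(method="linear", limit=2)
print(filled)
```

Because the range runs from the first to the last observed day, nothing is extrapolated beyond the observed period.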

In [None]:
## CREATE THE 5-DAILY FEATURES BY SLIDING OVER THE DATA 
##          WITH A WINDOW OF SIZE 5 & COMPUTE DELTA

# Generate rolling mean and rolling sum with window = 5. Shift by 1 because 'rolling'
# includes the current day; the shift is done per user so that no values
# leak from one user into the next

df_aggregated_daily = df_final.copy()
#print(df_aggregated_daily.head())
df_avg_rolling_mean_5daily = df_aggregated_daily.reset_index(level=0).groupby('user_id').rolling(window=5).mean().groupby(level=0).shift(1)
df_avg_rolling_sum_5daily = df_aggregated_daily.reset_index(level=0).groupby('user_id').rolling(window=5).sum().groupby(level=0).shift(1)
# print(df_avg_rolling_sum_5daily.head(6))

## GROUP BY
## REMOVE NA VALUES
## CHANGE WINDOW TYPE

# Take the sum for everything
df_avg_5daily = df_avg_rolling_sum_5daily

# Take the mean of relevant attributes (the mood-related attributes )
df_avg_5daily["mood"] = df_avg_rolling_mean_5daily["mood"]
df_avg_5daily["circumplex.arousal"] = df_avg_rolling_mean_5daily["circumplex.arousal"]
df_avg_5daily["circumplex.valence"] = df_avg_rolling_mean_5daily["circumplex.valence"]
df_avg_5daily["activity"] = df_avg_rolling_mean_5daily["activity"]

# Rename columns
df_avg_5daily = df_avg_5daily.add_prefix('AVG_')
df_avg_5daily = df_avg_5daily.rename(columns = {'AVG_mood':'MEAN_mood', 
            'AVG_circumplex.arousal':'MEAN_circumplex.arousal', 'AVG_circumplex.valence':'MEAN_circumplex.valence', 
            'AVG_activity':'MEAN_activity',})

## CALCULATE DELTA
## Option 1: Take sum of absolute differences between rows:
## Subtract each value with value of last row: take absolute difference
df_aggregated_daily_differences = abs(df_aggregated_daily.groupby('user_id').diff()) # <- .groupby('user_id')
df_5daily_delta = df_aggregated_daily_differences.reset_index(level=0).groupby('user_id').rolling(window=5).sum().groupby(level=0).shift(1)


## Option 2: Take trend of last 5 datapoints
# polyfit = lambda y: np.polyfit(range(len(y)), y, 1)[0] # 0 = a, 1 = b (y=ax+b). Take 1st order polynomial
# df_5daily_delta = df_aggregated_daily.rolling(window=5).apply(polyfit).shift(1)

#NOTE: BECAUSE OF THE DIFFERENCING, THE VALUES NOW START FROM THE 7TH ROW (NOT THE 6TH)

# Rename columns
df_5daily_delta = df_5daily_delta.add_prefix('DELTA_')
# Remove weekdays
df_5daily_delta = df_5daily_delta.loc[:,'DELTA_activity':'DELTA_sms']

# Join 5-daily-DELTA-df and 5-daily-avg-df
df_avg_5daily = pd.concat([df_avg_5daily, df_5daily_delta], axis=1)

# Add label to df: actual avg mood on that day

df_avg_5daily['LABEL_mood'] = df_aggregated_daily.loc[:,'mood']

#df_avg_5daily['Monday':'Sunday'] = df_aggregated_daily['Monday':'Sunday']
# Add weekdays as a one hot encoding to the dataframe
names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
for i, x in enumerate(names):
    df_avg_5daily[x] = (df_avg_5daily.index.get_level_values(1).weekday == i).astype(int)

# Drop all rows with na-values: this drops first 6 rows of every user
df_avg_5daily = df_avg_5daily.dropna()

## 1st 5 values communication: 
#print(df_aggregated_daily.iloc[0:5, 2])
# user_id  datetime  
# AS14.01  2014-03-21     6280.890
#          2014-03-22     4962.918
#          2014-03-23     5237.319
#          2014-03-24     9270.629
#          2014-03-25    10276.751
# Name: appCat.communication, dtype: float64
# Slope: 
#print( np.polyfit(range(len(df_aggregated_daily.iloc[0:5, 2].index)), df_aggregated_daily.iloc[0:5, 2], 1) )
#1229.9432999999983
# test = df_aggregated_daily.iloc[0:5, 2]
# test = np.array(test)
# print(test)
# coef = np.polyfit(range(len(test)), test, 1)
# print( coef )
# plt.plot(test)
# plt.plot([coef[0]*x + coef[1] for x in range(len(test))])
# plt.show()

#df_aggregated_daily.head()
#df_aggregated_daily_differences.head(10)
#df_5daily_delta.head(10)
df_avg_5daily.head(n=10)
#data_table.DataTable(df_avg_5daily)


In [None]:
import numpy as np
import scipy.stats


def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, m-h, m+h

## Baseline
The baseline predicts each participant's mood on a given day as that participant's average mood on the previous day.
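
A minimal sketch of this previous-day baseline on toy data, assuming a `(user_id, datetime)` index like the one used in this notebook. The mood values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical users with three days of average mood each.
index = pd.MultiIndex.from_product(
    [["u1", "u2"], pd.date_range("2014-03-21", periods=3, freq="D")],
    names=["user_id", "datetime"],
)
toy = pd.DataFrame({"mood": [6.0, 7.0, 8.0, 5.0, 5.5, 6.5]}, index=index)

# Shift within each user: the first day of every user gets no prediction (NaN).
toy["predicted mood"] = toy.groupby(level="user_id")["mood"].shift(1)

# Absolute and squared errors over the days that have a prediction.
errors = (toy["predicted mood"] - toy["mood"]).abs().dropna()
mae = errors.mean()
mse = (errors ** 2).mean()
print(mae, mse)
```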

In [None]:
# by grouping the data per user and shifting the values 1 place we get the baseline model
df_aggregated_daily_baseline = df_final.copy()
df_aggregated_daily_baseline['predicted mood'] = df_aggregated_daily_baseline['mood'].groupby(
    'user_id').shift(1)

# Create NumPy arrays for the labels and the predictions to compare the two
# (no longer strictly needed: pandas objects are accepted as well)
labels = np.array(df_aggregated_daily_baseline['mood'])
predictions = np.array(df_aggregated_daily_baseline['predicted mood'])
print("Number of NaN predictions :", np.count_nonzero(np.isnan(predictions)), 
      "(a NaN value for every first observation per user)")

errors_bl = abs(predictions - labels)
errors_bl = errors_bl[~np.isnan(errors_bl)]
squared_errors_bl = np.power(errors_bl,2)
mse=round(np.mean(np.power(errors_bl,2)), 2)
print('Confidence Interval MSE (Baseline): ', mean_confidence_interval(squared_errors_bl))
print('Confidence Interval MAE (Baseline): ', mean_confidence_interval(errors_bl)) 

# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': range(len(predictions)), 'prediction': predictions})
plt.figure(figsize=(20,10))
plt.rcParams['axes.facecolor'] = 'lightgrey'
# Plot the actual values
plt.plot(range(len(labels)), labels, 'b-', label = 'actual')
# Plot the predicted values
plt.plot(range(len(predictions)), predictions, 'ro', label = 'prediction')
plt.xticks(rotation = 60); 
plt.legend()
# Graph labels
plt.xlabel('Days in test set'); plt.ylabel('Mood'); plt.title('BASELINE Actual and Predicted Values. MSE = {}'.format(mse));

# # print dataframe fully
# data_table.DataTable(df_aggregated_daily_baseline)
plt.savefig("baseline.pdf")

## Random Forest

In [None]:
avg5day = True
fulldata = False

if avg5day :
    # Labels are the mood values we want to predict
    labels = df_avg_5daily['LABEL_mood']
    # Select the feature columns
    if fulldata: # with all variables still in the dataframe
        features = df_avg_5daily.drop(['MEAN_mood', 'MEAN_circumplex.arousal', 
                                    'MEAN_circumplex.valence', 'LABEL_mood',
                                    'DELTA_mood', 'DELTA_circumplex.arousal',
                                    'DELTA_circumplex.valence'], axis = 1)
    else: # without the variables that were pruned because their importance is < (max/4)
        features = df_avg_5daily.drop(['MEAN_mood', 'MEAN_circumplex.arousal', 
                                    'MEAN_circumplex.valence', 'LABEL_mood',
                                    'DELTA_mood', 'DELTA_circumplex.arousal',
                                    'DELTA_circumplex.valence', 
                                    'AVG_appCat.finance', 'AVG_appCat.game',
                                    'AVG_appCat.weather', 'DELTA_appCat.finance',
                                    'DELTA_appCat.game', 'DELTA_appCat.weather',
                                    'Monday', 'Tuesday', 'Wednesday','Thursday',
                                    'Friday', 'Saturday', 'Sunday' ], axis = 1)
                                

                                
else:
    # Labels are the mood values we want to predict
    labels = np.array(df_aggregated_daily['mood'])
    # We want to predict mood so we have to take it out of the dataset to predict
    ## FEATURES without the mood:
    if fulldata:
        features = df_aggregated_daily.drop(['mood', 'circumplex.arousal', 'circumplex.valence'], axis = 1)
    else:
        ## FEATURES without the mood + importance > 0.03:
        features = df_aggregated_daily.drop(['mood', 'circumplex.arousal', 'circumplex.valence',
                                             'appCat.finance', 'appCat.game', 'appCat.office',
                                             'appCat.unknown', 'appCat.weather'], axis = 1)



# Saving feature names for later use in the plot
feature_list = list(features.columns)

# No longer needed: pandas also accepted
# Convert to numpy array
features = np.array(features)

train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size = 0.25, random_state = 42)

rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
rf.fit(train_features, train_labels)

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Print out the mean squared error (MSE)
errors_rf = abs(predictions - test_labels)
errors_rf = errors_rf[~np.isnan(errors_rf)]
squared_errors_rf = np.power(errors_rf,2)
mse=round(np.mean(np.power(errors_rf,2)), 2)
print('Confidence Interval MSE (Random Forest): ', mean_confidence_interval(squared_errors_rf))
print('Confidence Interval MAE (Random Forest): ', mean_confidence_interval(errors_rf))

# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in 
                       zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

print("Importance cutoff (max/4): ", max(importances)/4)

%matplotlib inline
# Set the style
plt.style.use('fivethirtyeight')
plt.rcParams['axes.facecolor'] = 'lightgrey'
if avg5day:
    plt.rcParams["figure.figsize"]=(10, 10) # Including DELTA / Weekdays
# list of x locations for plotting
x_values = list(range(len(importances)))
# Make a bar chart
# plt.bar(x_values, importances, orientation = 'vertical')
plt.barh(list(reversed(feature_list)), list(reversed(importances)))
# Tick labels for x axis
#plt.xticks(x_values, feature_list, rotation='vertical')
# Print the features whose importance falls below the cutoff
cutoff = max(importances)/4
boolimp = np.array(importances) < cutoff
for i in range(len(boolimp)):
    if boolimp[i]: 
        print(feature_list[i]) 
#print(feature_list[importances < cutoff])
plt.vlines(x=cutoff, ymin=-1, ymax=len(feature_list), linestyles="dashed", colors="red", )
plt.ylabel('Variable'); plt.title('Variable Importances for RF (with MSE test data = {})'.format(mse))
plt.xlabel('Importance, cutoff = (max/4) ={}'.format(round(cutoff,3)))
 
plt.savefig("variable_importances.pdf", bbox_inches='tight')
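
The pruning rule used above (drop every feature whose importance is below a quarter of the maximum) can be sketched in isolation. The feature names and importance values here are made up and do not come from the fitted forest.

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only.
feature_names = ["AVG_screen", "MEAN_activity", "AVG_appCat.game", "AVG_sms"]
importances = np.array([0.40, 0.35, 0.05, 0.20])

# Cutoff at a quarter of the largest importance.
cutoff = importances.max() / 4
below_cutoff = importances < cutoff

pruned = [name for name, drop in zip(feature_names, below_cutoff) if drop]
kept = [name for name, drop in zip(feature_names, below_cutoff) if not drop]
print("pruned:", pruned)
print("kept:", kept)
```

Converting the importances to a NumPy array is what makes the element-wise comparison with the cutoff possible; a plain Python list cannot be compared to a float this way.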

In [None]:
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': range(len(predictions)), 'prediction': predictions})
# Plot the actual values
plt.figure(figsize=(20,10))
plt.rcParams['axes.facecolor'] = 'lightgrey'
plt.plot(range(len(predictions)), test_labels, 'b-', label = 'actual')
# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = 60); 
plt.legend()
# Graph labels
plt.xlabel('Days in test set'); plt.ylabel('Mood'); plt.title('RANDOM FOREST Actual and Predicted Values. MSE = {}'.format(mse));
plt.savefig("random_forest.pdf")

## LSTM
### many-to-one

In [None]:
# function to transform daily data into an LSTM-ready format

def lstm_data_transform(data, num_steps=5):
    """Change data to the format for LSTM training
    using a sliding-window approach."""
    # Prepare the list for the transformed data
    X, y = list(), list()
    # Loop over the user's data set
    users = data.index.unique(0)
    for user in users:
        u_data = data.filter(like=user, axis=0)       
        y_data = u_data['mood']
        x_data = u_data.drop(columns='mood')
        for i in range(u_data.shape[0]):
            # compute a new (sliding window) index
            end_ix = i + num_steps
            # if index is larger than the size of the dataset, we stop
            if end_ix >= u_data.shape[0]:
                break
            # Get a sequence of data for x
            seq_X = x_data.iloc[i:end_ix] # update for columns if changed!
            # Use the mood on the day right after the window as y
            seq_y = y_data.iloc[end_ix]
            # Append the sequences to the lists
            X.append(seq_X)
            y.append(seq_y)
    # Make final arrays
    x_array = np.array(X)
    y_array = np.array(y)
    return x_array, y_array
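
To make the windowing concrete, here is a stripped-down NumPy sketch for a single user: each window of `num_steps` consecutive days is paired with the target of the day that follows it. The helper name `sliding_windows` and the toy arrays are hypothetical.

```python
import numpy as np

def sliding_windows(features, target, num_steps):
    """Pair each window of `num_steps` rows with the target of the next row."""
    X, y = [], []
    for start in range(len(features) - num_steps):
        end = start + num_steps
        X.append(features[start:end])  # rows start .. end-1
        y.append(target[end])          # the row right after the window
    return np.array(X), np.array(y)

# 7 days with 2 features each; the target is simply 0, 10, ..., 60.
features = np.arange(14, dtype=float).reshape(7, 2)
target = np.arange(7, dtype=float) * 10

X, y = sliding_windows(features, target, num_steps=5)
print(X.shape, y)
```

With 7 days and a window of 5, only two samples remain, which is why short per-user histories contribute few training examples.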

In [None]:
# normalizes values

def standardize(data):
    mood = data['mood']
    data_mean = data.mean(axis=0)
    data_std = data.std(axis=0)
    data = (data - data_mean) / data_std
    data['mood'] = mood
    return data

def normalize(data):
    mood = data['mood']
    data = (data-data.min())/(data.max()-data.min())
    data['mood'] = mood
                             
    return data
#     x = data.values #returns a numpy array
#     cols = data.columns
#     min_max_scaler = preprocessing.MinMaxScaler()
#     x_scaled = min_max_scaler.fit_transform(x)
#     df = pd.DataFrame(x_scaled)
#     df.columns = cols
#     return df

#t_data = standardize(df_aggregated_daily.drop(columns=['appCat.finance','appCat.game','appCat.office','appCat.unknown','appCat.weather']))
t_data = normalize(df_aggregated_daily.drop(columns=['appCat.finance','appCat.game','appCat.office','appCat.unknown','appCat.weather']))
#t_data = normalize(df_aggregated_daily)

# splits data into train and test data

train_ind = int(0.75 * t_data.shape[0])
train_data = t_data[:train_ind] #df_aggregated_daily
test_data = t_data[train_ind:]

# transform it into a training-ready format
num_steps = 5
# training set
(x_train_transformed,
 y_train_transformed) = lstm_data_transform(train_data, num_steps=num_steps)
assert x_train_transformed.shape[0] == y_train_transformed.shape[0]
# test set
(x_test_transformed,
 y_test_transformed) = lstm_data_transform(test_data, num_steps=num_steps)
assert x_test_transformed.shape[0] == y_test_transformed.shape[0]


In [None]:
# # compile model
model = keras.Sequential()
model.add(keras.layers.LSTM(13, activation='tanh', input_shape=(num_steps, 13), return_sequences=False))
model.add(keras.layers.Dense(units=13, activation='relu'))
model.add(keras.layers.Dense(units=1, activation='linear'))
adam = keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=adam, loss='mse')


In [None]:
# fit model and predict
model.fit(x_train_transformed, y_train_transformed, epochs=10)
test_predict = model.predict(x_test_transformed)

In [None]:
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': range(len(test_predict)), 'prediction': test_predict.squeeze()})
# Plot the actual values
plt.figure(figsize=(20,10))
plt.rcParams['axes.facecolor'] = 'lightgrey'
plt.plot(range(len(y_test_transformed)), y_test_transformed, 'b-', label = 'actual')
# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = 60); 
plt.legend()
# Graph labels
mse=round(mean_squared_error(test_predict, y_test_transformed), 2)
plt.xlabel('Days in test set'); plt.ylabel('Mood'); plt.title('LSTM Actual and Predicted Values. MSE = {}'.format(mse))

# Print out the mean squared error (MSE)
errors_lstm = abs(predictions_data['prediction'] - y_test_transformed)
errors_lstm = errors_lstm[~np.isnan(errors_lstm)]
squared_errors_lstm = np.power(errors_lstm,2)
mse=round(np.mean(np.power(errors_lstm,2)), 2)
print('Confidence Interval MSE (LSTM): ', mean_confidence_interval(squared_errors_lstm))
print('Confidence Interval MAE (LSTM): ', mean_confidence_interval(errors_lstm))

## Statistical Analysis

In [None]:
from scipy.stats import mannwhitneyu

print("Random Forest vs Baseline")
print(scipy.stats.mannwhitneyu(squared_errors_rf, squared_errors_bl))
print("\nLSTM vs Baseline")
print(scipy.stats.mannwhitneyu(squared_errors_lstm, squared_errors_bl))
print("\nLSTM vs Random Forest")
print(scipy.stats.mannwhitneyu(squared_errors_lstm, squared_errors_rf))

## LSTM
### many-to-many

In [None]:
# splitting 
train_ind = int(0.75 * t_data.shape[0])
x_train = t_data[:train_ind].drop(columns='mood')
x_test = t_data[train_ind:].drop(columns='mood')
y_train = t_data['mood'][:train_ind].values.reshape(-1, 1)
y_test = t_data['mood'][train_ind:].values.reshape(-1, 1)
print(y_test.shape)
print(y_train.shape)

In [None]:
# scaling
from sklearn.preprocessing import StandardScaler
scaler_x = StandardScaler()
scaler_y = StandardScaler()

x_train_sc = scaler_x.fit_transform(x_train)
x_test_sc = scaler_x.transform(x_test)
y_train_sc = scaler_y.fit_transform(y_train)
y_test_sc = scaler_y.transform(y_test)

In [None]:
# reshape
num_steps = 3
# training set
x_train_shaped = np.reshape(x_train_sc[:864], newshape=(-1, num_steps, 18))
y_train_shaped = np.reshape(y_train_sc[:864], newshape=(-1, num_steps, 1))
assert x_train_shaped.shape[0] == y_train_shaped.shape[0]
# test set
x_test_shaped = np.reshape(x_test_sc[:90], newshape=(-1, num_steps, 18))
y_test_shaped = np.reshape(y_test_sc[:90], newshape=(-1, num_steps, 1))
assert x_test_shaped.shape[0] == y_test_shaped.shape[0]

In [None]:
# compile model

model = keras.Sequential()
model.add(keras.layers.LSTM(18, activation='tanh', input_shape=(num_steps, 18), return_sequences=True))
model.add(keras.layers.Dense(units=18, activation='relu'))
model.add(keras.layers.Dense(units=1, activation='linear'))
adam = keras.optimizers.Adam(learning_rate=0.001)
model.compile(optimizer=adam, loss='mse')


In [None]:
model.fit(x_train_shaped, y_train_shaped, epochs=30)
test_predict = model.predict(x_test_shaped)

In [None]:
print(test_predict.reshape(90,1).squeeze(), y_test_shaped.shape)
print('Mean Squared Error over test data (LSTM):', 
      round(mean_squared_error(test_predict.squeeze(), y_test_shaped.squeeze()), 2))

In [None]:
# Dataframe with predictions and dates
predictions_data = pd.DataFrame(data = {'date': range(len(test_predict.reshape(90,1).squeeze())), 'prediction': test_predict.reshape(90,1).squeeze()})
# Plot the actual values
plt.figure(figsize=(20,10))
plt.rcParams['axes.facecolor'] = 'lightgrey'
plt.plot(range(len(y_test_shaped.reshape(90,1).squeeze())), y_test_shaped.reshape(90,1).squeeze(), 'b-', label = 'actual')
# Plot the predicted values
plt.plot(predictions_data['date'], predictions_data['prediction'], 'ro', label = 'prediction')
plt.xticks(rotation = 60); 
plt.legend()
# Graph labels
mse= round(mean_squared_error(test_predict.squeeze(), y_test_shaped.squeeze()), 2)
plt.xlabel('Days in test set'); plt.ylabel('Mood'); plt.title('LSTM Actual and Predicted Values. MSE = {}'.format(mse));

print('Mean Squared Error over test data (LSTM):', mse)

## RNN


In [None]:

# Simple: 
import tensorflow as tf
from tensorflow import keras
model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, x_train_transformed.shape[-1]])
])
optimizer = keras.optimizers.Adam(learning_rate=0.005)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(x_train_transformed, y_train_transformed, epochs=20,
                    validation_data=(x_test_transformed, y_test_transformed))
model.evaluate(x_test_transformed, y_test_transformed)

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size = 0.25, random_state = 42)

# Simple: 
import tensorflow as tf
from tensorflow import keras
model = keras.models.Sequential([
    keras.layers.SimpleRNN(1, input_shape=[None, 1])
])
optimizer = keras.optimizers.Adam(learning_rate=0.005)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
model.evaluate(X_valid, y_valid)

Temporal approach: an LSTM seems like a good fit. A classical time-series model is an alternative, but I do not know much about those (ARIMA, the autoregressive integrated moving average, is the proposed time-series model, but I think it only uses mood itself as a variable).

Things to try out: random forest, ARIMA, RNNs, prophet if the dataset is univariate

**Aggregating Data (not based on literature sadly):**


1.   Make everything hourly (e.g. number of pickups within one app domain, total minutes of usage within one app domain, both per hour). Some attributes are hourly by default (such as activity), so we want everything on an hourly basis.
2.   Make everything daily (THINK ABOUT: maybe also account for mornings/afternoons/evenings/nights. Will that make a difference?). Same examples as above. 
3.   Make everything 5-daily. IMPORTANT: Also measure the DELTA for every attribute. How much did the behaviour/measurements change over the days (or mornings/afternoons/evenings/nights)?

=> END RESULT: We end up with 
*   average numbers over 5-day periods (same examples as in step 1)
*   DELTA of attributes over days. This might be even more important than the averages in the point above
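
The history features described above can be sketched per user as follows. This is an illustration on invented `screen` values with a window of 2 instead of 5 to keep the example small; the helper `by_user` and the column names are assumptions. The important detail is that every step is grouped by user, so one user's history never feeds another user's features.

```python
import pandas as pd

# Two hypothetical users with four days of total screen time each.
index = pd.MultiIndex.from_product(
    [["u1", "u2"], pd.date_range("2014-03-21", periods=4, freq="D")],
    names=["user_id", "datetime"],
)
daily = pd.DataFrame(
    {"screen": [1.0, 2.0, 4.0, 8.0, 10.0, 10.0, 13.0, 11.0]}, index=index
)

window = 2  # the notebook uses 5; 2 keeps this toy example short


def by_user(series):
    """Group a (user_id, datetime)-indexed series by user."""
    return series.groupby(level="user_id")


# Shift first, so each window only covers days before the current one.
past = by_user(daily["screen"]).shift(1)

# Average over the previous `window` days.
avg_screen = by_user(past).transform(lambda s: s.rolling(window).mean())

# DELTA: sum of absolute day-to-day changes over the previous `window` days.
abs_change = by_user(past).diff().abs()
delta_screen = by_user(abs_change).transform(lambda s: s.rolling(window).sum())

features = pd.DataFrame({"AVG_screen": avg_screen, "DELTA_screen": delta_screen})
print(features)
```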


---











**Attributes for aggregating History**

*   Mood/Arousal/Valence: avg per day/part-of-day (don't include 0's)
*   Activity: "
*   Call/SMS: "
*  Screen/All-apps: # of pickups, total time (or average already)

ALL: DELTA-parameter which summarizes how much the behaviour/mood changed over days (positive/more or negative/less)

https://www.researchgate.net/publication/310598732_How_to_Predict_Mood_Delving_into_Features_of_Smartphone-_Based_Data (paper on this topic published by previous VU researchers)

---

Also include day of the week as an attribute.

---
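
A small sketch of the weekday attribute as a one-hot encoding, assuming a daily `DatetimeIndex`. The three dates and mood values are toy data.

```python
import pandas as pd

# Three consecutive days: a Friday, a Saturday and a Sunday.
dates = pd.date_range("2014-03-21", periods=3, freq="D")
frame = pd.DataFrame({"mood": [6.0, 7.0, 8.0]}, index=dates)

names = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# pandas numbers weekdays from Monday=0 to Sunday=6.
for i, name in enumerate(names):
    frame[name] = (frame.index.weekday == i).astype(int)

print(frame[names])
```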






TODO: (11-4)


*   In 'df_avg_5daily' discriminate between users: probably use 'groupby' (now the rolling window just continues)
*   Joining data when values are NA: currently we drop everything. Instead, delete NA rows when a user starts later, and fill in (interpolate) randomly missing days (such as the 6th of May)
*   DELTA: try using np.polyfit to fit the trend ( https://www.emilkhatib.com/analyzing-trends-in-data-with-pandas/ )

*   REPORT: Compare attributes: including mood/arousal/valence, excluding mood/arousal/valence, excluding DELTA-mood/arousal/valence (mainly to show how close our MSE is without all the mood-related attributes)

* Specify potential window function for rolling : https://docs.scipy.org/doc/scipy/reference/signal.windows.html#module-scipy.signal.windows 
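
For the `np.polyfit` item in the list above, a trend-based DELTA could look roughly like this. The first part reuses the five `appCat.communication` values noted earlier in this notebook; the rolling part runs on a made-up series with a window of 3. The helper name `slope` is hypothetical.

```python
import numpy as np
import pandas as pd

def slope(window_values):
    """Slope a of the least-squares line y = a*x + b through the window."""
    x = np.arange(len(window_values))
    return np.polyfit(x, window_values, 1)[0]

# The five appCat.communication values listed earlier in the notebook.
communication = np.array([6280.890, 4962.918, 5237.319, 9270.629, 10276.751])
print(slope(communication))  # approximately 1229.94

# Rolling trend on a toy series; shift(1) excludes the current day.
series = pd.Series([1.0, 3.0, 5.0, 7.0, 6.0, 5.0])
trend = series.rolling(window=3).apply(slope, raw=True).shift(1)
print(trend)
```

Unlike the sum of absolute differences, the slope keeps the sign of the change, so it distinguishes increasing from decreasing behaviour. In the real data this would need to be applied per user.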

