# **QPID Machine Learning/Data Science Challenge**

## Introduction

We'd like to get a better sense of your approach to and intuition for machine learning, natural language processing, and data science, as well as your other technical and analytical skills. To that end, we'd like you to complete this ML challenge. 

In this challenge, we provide you with some code samples using `pandas` and `scikit-learn`. You may use other Python libraries and consult online resources to complete this challenge.

This is an interactive Jupyter notebook that allows you to write, comment on, and execute python code directly on Google's servers. If you are new to Jupyter notebooks, you can read more about them here: [https://jupyter.org/](https://jupyter.org/). **Please edit this notebook directly** and take as much time as you feel is reasonable to complete this exercise. 


### Problem Statement
The goal of this notebook is to develop a model that predicts whether self-reported severity of [fibromyalgia](https://en.wikipedia.org/wiki/Fibromyalgia) **IMPROVED**, **WORSENED**, or stayed the **SAME** over a variable period of time for patients who have this condition. Below are instructions about how to download the data, as well as some sample code that generates such predictions. 

**Your goal is to improve on this model and show us how you think through such a problem.**


## Tasks

This challenge consists of two parts: modeling and answering additional questions. The first part is to train a model and predict the change in status of fibromyalgia. The second part is to answer some questions about your approach. 


### Part 1: Modeling



1.   We will evaluate your results on a held-out test set. Therefore, please make sure that your code is clearly documented, so that we can easily run **your code** against our test set. The test set is formatted exactly the same as the dataset we provided. **You will not be judged solely on the performance of the model.** We are also interested in your creativity and problem-solving approach.  
2.   Please report on how well the model did. You may choose whatever metrics you find appropriate.


### Part 2: Additional Questions

Please answer the following questions at the bottom of the notebook when you have completed part 1:

1.   If you've explored the data, please describe your observations about the dataset. 
2.   What approach (i.e. modeling & evaluation) did you use?
3.   What features have you tried (please also include the ones that you do not include in your final model)?
4.   Why did you use this approach?
5.   How would you improve your model if you had more time?



## The Dataset

### Downloading the Dataset

The following function downloads a dataset and imports it as a pandas `DataFrame` object. 

You may need to sign in with your Google account when prompted.

In [0]:
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once per notebook.
!pip install -U -q PyDrive

In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

import pandas as pd
import io
import json

In [0]:
# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
def download_dataset(file_id):
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)

  # Download a file based on its file ID.
  downloaded = drive.CreateFile({'id': file_id})
  content = downloaded.GetContentString()
  
  return pd.read_csv(io.StringIO(content))
  

### Dataset Description

The dataset you will be using is a subset of the Chronic Illness dataset on Kaggle: https://www.kaggle.com/flaredown/flaredown-autoimmune-symptom-tracker



Flaredown is an app that helps patients of chronic autoimmune and invisible illnesses improve their symptoms by avoiding triggers and evaluating their treatments. Each day, patients track their symptom severity, treatments and doses, and any potential environmental triggers (foods, stress, allergens, etc) they encounter.

**About the data**

Instead of coupling symptoms to a particular illness, Flaredown asks users to create their unique set of conditions, symptoms and treatments (“**trackables**”). They can then “check-in” each day and record the severity of symptoms and conditions, the doses of treatments, and “tag” the day with any unexpected environmental factors.

**Condition**: an illness or diagnosis, for example Rheumatoid Arthritis, rated on a scale of **0 (not active) to 4 (extremely active)**.

**Symptom**: self-explanatory, also rated on a 0–4 scale.

**Treatment**: anything a patient uses to improve their symptoms, along with an optional dose, which is a string that describes how much they took during the day. For instance “3 x 5mg”.

**Tag**: a string representing an environmental factor that does not occur every day, for example “ate dairy” or “rainy day”.

**Food**: food items were seeded from the publicly-available USDA food database. Users have also added many food items manually.

**Weather**: weather is pulled automatically for the user's postal code from the Dark Sky API. Weather parameters include a description, precipitation intensity, humidity, pressure, and min/max temperatures for the day.

If users do not see a symptom, treatment, tag, or food in our database (for instance “Abdominal Pain” as a symptom) they may add it by simply naming it. This means that the data requires some cleaning, but it is patient-centered and indicates their primary concerns.



The following is a snippet of what the original dataset looks like:

In [0]:
sample_file_id = '1r4afwYJ3JC_8kJFMKbYea_7vnnXaFb1r'
sample_df = download_dataset(sample_file_id)

In [0]:
sample_df
#sample_df['trackable_name'].value_counts()

The dataset we provided is a subset of the original dataset, and we grouped all the `trackable_type`, `trackable_name`, and `trackable_value` of a patient/user within a `checkin_date` into one JSON array. 

For example, the 10 rows in the `DataFrame` above belong to the same user on the same `checkin_date`, so the trackable columns are aggregated into a JSON array as follows:

In [0]:
sample_df.head(10)[['trackable_type', 'trackable_name', 'trackable_value']].to_json(orient='records')
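The grouping described above can be sketched with a pandas `groupby`. This is only an illustration on made-up rows: the `user_id` column name and all sample values are assumptions, not taken from the actual dataset.

```python
import json
import pandas as pd

# Toy rows; `user_id` and every value here are hypothetical.
rows = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "checkin_date": ["2017-10-29", "2017-10-29", "2017-11-01", "2017-10-29"],
    "trackable_type": ["Condition", "Symptom", "Condition", "Treatment"],
    "trackable_name": ["Fibromyalgia", "Fatigue", "Fibromyalgia", "Tramadol"],
    "trackable_value": ["3", "2", "2", "1 x 50mg"],
})

trackable_cols = ["trackable_type", "trackable_name", "trackable_value"]

def to_json_array(group):
    # One JSON array of records per (user, date) pair.
    return group[trackable_cols].to_json(orient="records")

# Keys are (user_id, checkin_date) tuples.
entries = {
    key: to_json_array(group)
    for key, group in rows.groupby(["user_id", "checkin_date"])
}

first = json.loads(entries[("u1", "2017-10-29")])
```

Each value in `entries` is a JSON string in the same shape as the output of the cell above, so `json.loads` turns it back into a list of dictionaries.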

For this challenge, we want to compare the `trackable_value`, i.e. the severity, of a user's condition between the two closest recorded dates in the dataset. (Please run the following code to download the training dataset.)

In [0]:
train_file_id = '1hDM1VfaZ7o1moBN1IlcaFExfoE2Jeqgg'
dataframe = download_dataset(train_file_id)
dataframe['entries_from'] = dataframe.entries_from.apply(json.loads)
dataframe['entries_to'] = dataframe.entries_to.apply(json.loads)

In [0]:
dataframe.head()
#dataframe['entries_from']['trackable_type'].value_counts()

  Columns with the `from` suffix contain the records from the same earlier date, and the ones with the `to` suffix have the records from a later date. `entries_from` and `entries_to` columns are the aggregated JSON arrays mentioned above. 
  
  Taking the first row as an example, the JSON array in the `entries_from` column contains all the `trackable` entries on the date (10/29/17) in the `checkin_data_from` column, whereas the array in the `entries_to` column contains all the `trackable` entries on the date (11/1/17) specified in the `checkin_data_to` column. 
  
  The `status` column has three values: `IMPROVED`, `WORSENED`, and `SAME`. If the value in `value_from` is greater than that in `value_to`, that means the condition of a user has worsened between those two check-in dates; if `value_from` is less than `value_to` that means the condition has improved. If those two values are the same, that means the condition has remained the same. 
  
  We ask you to build a model to predict the `status` of a patient's condition, whether it's `WORSENED`, `IMPROVED`, or `SAME`.
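The labeling rule in the paragraphs above can be written as a small function, which also gives a way to check the rule against the provided `status` column. This sketch follows the wording above literally and assumes `value_from` and `value_to` are numeric. The names `derive_status` and `agreement` and the sample rows are hypothetical.

```python
def derive_status(value_from, value_to):
    # Rule as worded in the description above; verify it against `status`.
    if value_from > value_to:
        return "WORSENED"
    if value_from < value_to:
        return "IMPROVED"
    return "SAME"

def agreement(rows):
    """Fraction of (value_from, value_to, status) rows that match the rule."""
    matches = sum(derive_status(a, b) == status for a, b, status in rows)
    return matches / len(rows)

# Made-up rows, for illustration only.
sample_rows = [(3, 2, "WORSENED"), (1, 1, "SAME"), (0, 2, "IMPROVED"), (2, 2, "SAME")]
rule_agreement = agreement(sample_rows)
```

If the agreement on the real data is far below 1.0, the direction of the comparison is worth re-checking before relying on `value_from` and `value_to`.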

### Hints

We strongly encourage you to look at the values in the `entries_from` and `entries_to` columns and extract useful features from there. And please note that `trackable_value` can be free text. 
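As a starting point for the hint above, the entries of one check-in can be summarized with a couple of small helpers. This is a minimal sketch that assumes each entry is a dictionary with the keys `trackable_type` and `trackable_name`, as in the JSON arrays shown earlier. The helper names and the example check-in are hypothetical.

```python
from collections import Counter

def count_by_type(entries):
    """Count how many trackables of each type appear in one check-in."""
    return Counter(e.get("trackable_type") for e in entries)

def names_of_type(entries, trackable_type):
    """List trackable names of one type, in their original order."""
    return [e.get("trackable_name") for e in entries
            if e.get("trackable_type") == trackable_type]

# Hypothetical check-in in the same shape as one `entries_from` value.
example_entries = [
    {"trackable_type": "Condition", "trackable_name": "Fibromyalgia", "trackable_value": "3"},
    {"trackable_type": "Symptom", "trackable_name": "Fatigue", "trackable_value": "2"},
    {"trackable_type": "Symptom", "trackable_name": "Headache", "trackable_value": "4"},
    {"trackable_type": "Treatment", "trackable_name": "Tramadol", "trackable_value": "1 x 50mg"},
]

type_counts = count_by_type(example_entries)
symptom_names = names_of_type(example_entries, "Symptom")
```

Counts per type give simple numeric features, while the name lists can feed a text vectorizer.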

## Your Solution 

The following is a working code sample. You are free to use the following code as part of your solution.

### Import Libraries

In [0]:
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Generate Predictions



In [0]:
# download the dataset
train_file_id = '1hDM1VfaZ7o1moBN1IlcaFExfoE2Jeqgg'
dataframe = download_dataset(train_file_id)
dataframe['entries_from'] = dataframe.entries_from.apply(json.loads)
dataframe['entries_to'] = dataframe.entries_to.apply(json.loads)

In [0]:
from google.colab import files
example_df = dataframe.copy()
#example_df.to_csv('example_df.csv')
#files.download('example_df.csv')

# Exploratory Data Analysis

In [0]:
dataframe.describe()

## **Distribution of gender**

In [0]:
import seaborn as sns

dataframe['sex'].value_counts()

More than 83% of the records come from female users.
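A share like the one quoted above can be read directly from normalized counts. A small sketch on made-up values (the notebook itself would use `dataframe['sex']`):

```python
import pandas as pd

# Toy column; the values are illustrative, not the real distribution.
sex_demo = pd.Series(["female", "female", "female", "female", "female", "male"])
shares = sex_demo.value_counts(normalize=True)  # fractions that sum to 1
female_share = shares["female"]
```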

In [0]:
fig = sns.countplot(x = 'sex', data= dataframe )

In [0]:
dataframe.head()

# **Age Vs Status**

Is there any pattern between age and disease status?

In [0]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(15,8))
sns.countplot(x = 'age',hue='status', data = dataframe, dodge= True)
#plt.xticks(rotation = 45)
plt.show()



*   Most of the data comes from people aged 22-35. There is a sudden increase at ages 43-45, and there appear to be fewer older users.
*   Across all ages, SAME is by far the most common status.
*   The age column contains invalid values (-1, 0, 1); these are replaced with the mean below.
*   Most people aged 58 and 62 reported a worsened disease status.






In [0]:
import numpy as np

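# Ages of -1, 0, and 1 are treated as invalid placeholders.
invalid_ages = [-1, 0, 1]
# Compute the mean from valid ages only, so the placeholders do not skew it,
# and apply the same replacement value to all three placeholders.
mean_age = int(dataframe.loc[~dataframe['age'].isin(invalid_ages), 'age'].mean())
dataframe['age'] = dataframe['age'].replace(invalid_ages, mean_age)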
#dataframe['age'] = dataframe['age'].astype('Int64')

In [0]:
int(dataframe["age"].mean())

# **Countries**

Generating latitude and longitude data to plot on an interactive map.

In [0]:
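# `locator` is not defined anywhere in this notebook. geopy's Nominatim geocoder is
# assumed here, since the code below relies on `geocode` and `Location.point`.
from geopy.geocoders import Nominatim
locator = Nominatim(user_agent="qpid_fibromyalgia_notebook")  # any descriptive name works

df = pd.DataFrame(dataframe['country'].unique())
df['values'] = df[0].apply(locator.geocode)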
df['location'] = df['values'].apply(lambda loc: tuple(loc.point) if loc else None)
df[['latitude', 'longitude', 'altitude']] = pd.DataFrame(df['location'].tolist(), index=df.index)
df = df.drop(['location', 'values', 'altitude'], axis=1)
#dataframe.drop(['latitude', 'longitude'], axis = 1, inplace= True)
dataframe = dataframe.join(df.set_index(0), on='country')

In [0]:
dataframe.columns

In [0]:
dataframe['latitude'].isna().value_counts()

In [0]:
dataframe["latitude"] = dataframe.latitude.replace(np.nan,0)
dataframe["longitude"] = dataframe.longitude.replace(np.nan, 0)

In [0]:
import folium
from folium.plugins import FastMarkerCluster
folium_map = folium.Map()
FastMarkerCluster(data=list(zip(dataframe['latitude'].values, dataframe['longitude'].values))).add_to(folium_map )
folium.LayerControl().add_to(folium_map)
folium_map

## **Trackables**

In [0]:
trackable_type_from = []
trackable_value_from = []
trackable_name_from = []
for row in dataframe['entries_from']:
  for d in row:
    for key, value in d.items():
      if key == 'trackable_name':
        trackable_name_from.append(value)
      elif key == 'trackable_type':
        trackable_type_from.append(value)
      else:
        trackable_value_from.append(value)

In [0]:
sns.countplot(x=trackable_type_from)
plt.title("Trackable types_from")
plt.show()


In [0]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS) 
string_from = (" ").join(trackable_name_from)
wordcloud = WordCloud(width = 2000, height = 2000, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 10).generate(string_from)  
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.title('Wordcloud of trackable names_from')
plt.show()


In [0]:
plt.figure(figsize=(15,8))
pd.Series(trackable_name_from).value_counts().head(35).plot(kind= 'barh' )
plt.title('Trackable Names_from')
plt.show()


It looks like Anxiety is the leading problem. Let us see which age groups are most affected by it.

In [0]:
anxiety_age_from = []
for index, row in dataframe['entries_from'].items():
  #print(index)
  for d in row:
    for key, value in d.items():
      if value == 'Anxiety':
        age = dataframe['age'][index]
        anxiety_age_from.append(age)


In [0]:
pd.Series(anxiety_age_from).value_counts().head(5).plot(kind = 'bar')
plt.title("Top 5 ages among check-ins that track Anxiety")
plt.xlabel('Age')
plt.ylabel('Counts')
plt.show()


In [0]:
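# Values may be numbers or null as well as free text, so convert each to a string.
string_from = " ".join(str(value) for value in trackable_value_from)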
wordcloud = WordCloud(width = 2000, height = 2000, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 10).generate(string_from)  
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.title('Wordcloud of trackable values_from')
plt.show()


In [0]:
plt.figure(figsize=(15,8))
pd.Series(trackable_value_from).value_counts().head(30).plot(kind= 'barh')
plt.title('Trackable Value_from')
plt.show()


In [0]:
dataframe.head()

# Parsing trackable_types

In [0]:
# Collect trackable names from `entries_from`, grouped by trackable type.
names_by_type = {'Symptom': [], 'Treatment': [], 'Condition': [], 'Weather': []}
for row in dataframe['entries_from']:
  for d in row:
    trackable_type = d.get('trackable_type')
    if trackable_type in names_by_type:
      names_by_type[trackable_type].append(d.get('trackable_name'))

Symptoms = names_by_type['Symptom']
Treatments = names_by_type['Treatment']
Conditions = names_by_type['Condition']
Weather = names_by_type['Weather']

# Draw the four bar charts in one 2x2 figure. Calling plt.show() only once,
# at the end, keeps the subplots together.
plt.figure(figsize=(15, 8))
panels = [('Symptoms', Symptoms), ('Conditions', Conditions),
          ('Treatments', Treatments), ('Weather', Weather)]
for position, (label, names) in enumerate(panels, start=1):
  plt.subplot(2, 2, position)
  pd.Series(names).value_counts().head().plot(kind='barh')
  plt.title("Types of " + label)
plt.tight_layout()
plt.show()



In [0]:
#dataframe.drop(['symptoms'], axis = 1, inplace= True)
#dataframe.drop(['conditions'], axis = 1, inplace= True)
#dataframe.drop(['treatments'], axis = 1, inplace= True)
#dataframe.drop(['weather'], axis = 1, inplace= True)

def names_for_type(entries, trackable_type):
  # Join the names of all trackables of the given type in one check-in.
  names = [d.get("trackable_name") for d in entries
           if d.get("trackable_type") == trackable_type]
  return ','.join(names)

def symp(col):
  return names_for_type(col, 'Symptom')

def cond(col):
  return names_for_type(col, 'Condition')

def treat(col):
  return names_for_type(col, 'Treatment')

def weat(col):
  return names_for_type(col, 'Weather')



dataframe['symptoms']  = dataframe['entries_from'].apply(symp)
dataframe['conditions']  = dataframe['entries_from'].apply(cond)
dataframe['treatments']  = dataframe['entries_from'].apply(treat)
dataframe['weather']  = dataframe['entries_from'].apply(weat)

In [0]:
dataframe[['symptoms', 'conditions', 'treatments', 'weather']]

In [0]:
import nltk
nltk.download('stopwords')

In [0]:
import string
string.punctuation

In [0]:
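from nltk.stem import SnowballStemmer
import string
from nltk.corpus import stopwords
stemmer = SnowballStemmer("english")
# Build the stopword set once; reloading the list for every word is very slow.
english_stopwords = set(stopwords.words("english"))
# Translation table that deletes every punctuation character.
remove_punctuation = str.maketrans('', '', string.punctuation)

def cleanText(para):
    # Commas separate trackable names, so turn them into spaces first.
    para = para.replace(',', " ")
    para = para.translate(remove_punctuation)
    words = [stemmer.stem(word) for word in para.split() if word.lower() not in english_stopwords]
    
    return " ".join(words)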

dataframe["symptoms"] = dataframe["symptoms"].apply(cleanText)
dataframe["conditions"] = dataframe["conditions"].apply(cleanText)
dataframe["treatments"] = dataframe["treatments"].apply(cleanText)
dataframe["weather"] = dataframe["weather"].apply(cleanText)
dataframe.head(n = 10)    

Create a new column called `text` that combines the symptoms, conditions, treatments, and weather columns.

In [0]:
dataframe['text'] = dataframe['symptoms'] + ' ' + dataframe['conditions'] + ' ' + dataframe['treatments'] + ' ' + dataframe['weather']

In [0]:
def encode_labels(label):
    if label == 'IMPROVED' or label == 'female':
        return 0
    elif label == 'WORSENED' or label ==  'male':
        return 1
    elif label == 'doesnt_say':
        return 3
    else:
        return 2
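`encode_labels` handles both `status` and `sex` in one function, so an unexpected value in either column silently becomes 2. A minimal alternative sketch keeps one explicit mapping per column. The codes mirror the ones above; the names `STATUS_CODES`, `SEX_CODES`, and `encode` are hypothetical.

```python
STATUS_CODES = {"IMPROVED": 0, "WORSENED": 1, "SAME": 2}
SEX_CODES = {"female": 0, "male": 1, "doesnt_say": 3}

def encode(value, codes, default=None):
    """Look up a label, failing loudly unless a default is given."""
    if value in codes:
        return codes[value]
    if default is None:
        raise ValueError(f"unexpected label: {value!r}")
    return default

status_code = encode("SAME", STATUS_CODES)
# Any other `sex` value falls back to 2, as in the function above.
other_sex_code = encode("other", SEX_CODES, default=2)
```

With separate mappings, a misspelled status raises an error instead of being counted as `SAME`.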


In [0]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
dataframe['country'] = encoder.fit_transform(dataframe['country']) 

In [0]:
dataframe['y'] = example_df.status.apply(encode_labels)
dataframe['sex'] = example_df.sex.apply(encode_labels)

In [0]:
dataframe.head()

# **Modelling with text from combined columns of Symptoms, Conditions, Weather, Treatment**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(dataframe['text'], dataframe['y'], train_size=0.8, random_state=23123)


In [0]:
X_train

Using Count Vectorizer with Word Embeddings

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
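# Fit on the training text only. A second fit on X_test would discard the
# training vocabulary and replace it with the test vocabulary.
vectorizer.fit(X_train)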
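Each call to `fit` on a `CountVectorizer` replaces any previously learned vocabulary, so the usual pattern is to fit on the training text and only transform the test text. A small sketch on made-up, already-cleaned strings (all names ending in `_demo` are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts_demo = ["fatigu headach", "anxieti fatigu"]  # toy training text
test_texts_demo = ["headach nausea"]                     # toy test text

vectorizer_demo = CountVectorizer()
train_matrix = vectorizer_demo.fit_transform(train_texts_demo)  # learns the vocabulary
test_matrix = vectorizer_demo.transform(test_texts_demo)        # reuses it unchanged
vocab_demo = sorted(vectorizer_demo.vocabulary_)
```

The word `nausea` appears only in the test text, so it has no column and is ignored when the test text is transformed.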

In [0]:

from keras.preprocessing.sequence import pad_sequences
# vocabulary_ maps each term to its column index; get_feature_names() was removed in newer scikit-learn releases.
word2idx = vectorizer.vocabulary_
tokenize = vectorizer.build_tokenizer()
preprocess = vectorizer.build_preprocessor()
 
def to_sequence(tokenizer, preprocessor, index, text):
    words = tokenizer(preprocessor(text))
    indexes = [index[word] for word in words if word in index]
    return indexes

X_train_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in X_train]
max_sqeunce=60
labels_len = len(word2idx)  # vocabulary size, also used as the padding index
X_train_sequences = pad_sequences(X_train_sequences, maxlen=max_sqeunce, value=labels_len)
X_test_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in X_test]
X_test_sequences = pad_sequences(X_test_sequences, maxlen=max_sqeunce, value=labels_len)
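To make the sequence preparation above easier to follow, here is a plain NumPy sketch of the same idea: words are mapped to vocabulary indices, unknown words are dropped, and each sequence is padded on the left with a value equal to the vocabulary size. It mirrors the default "pre" padding and truncation of `pad_sequences`. The helper names and the toy vocabulary are hypothetical.

```python
import numpy as np

def to_indices(text, index):
    # Words missing from the vocabulary are skipped, as in `to_sequence`.
    return [index[word] for word in text.split() if word in index]

def pad_left(sequences, maxlen, value):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # keep the last `maxlen` items
        if kept:
            out[i, maxlen - len(kept):] = kept
    return out

vocab_toy = {"fatigu": 0, "headach": 1}
sequences_toy = [to_indices("fatigu nausea headach", vocab_toy),
                 to_indices("headach", vocab_toy)]
padded_toy = pad_left(sequences_toy, maxlen=3, value=len(vocab_toy))
```

Using the vocabulary size as the padding value is why the embedding matrix needs one extra row beyond the vocabulary.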

In [0]:

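import spacy

# `nlp` is not defined anywhere in this notebook. A spaCy English model with
# 300-dimensional word vectors is assumed here (for example en_core_web_md,
# installed with: python -m spacy download en_core_web_md).
nlp = spacy.load('en_core_web_md')

emb_len = 300
# One extra row of zeros is reserved for the padding index.
embeddings_index = np.zeros((len(word2idx) + 1, emb_len))
for word, idx in word2idx.items():
    lexeme = nlp.vocab[word]
    # Words without a pretrained vector keep their all-zero row.
    if lexeme.has_vector:
        embeddings_index[idx] = lexeme.vector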


Using LSTM

In [0]:
from keras.models import Sequential, Model
from keras.layers import Activation,Input,concatenate, BatchNormalization,Dense, Dropout,Flatten, LSTM, Embedding
model = Sequential()
model.add(Embedding(len(word2idx) + 1,
                    emb_len,  # Embedding size
                    weights=[embeddings_index],
                    input_length=max_sqeunce,
                    trainable=False))
model.add(LSTM(300, dropout=0.2))
model.add(Dense(3, activation='softmax'))
 
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [0]:
model.fit(X_train_sequences, y_train, 
          epochs=10, batch_size=128, verbose=1, 
          validation_split=0.1)
 
scores = model.evaluate(X_test_sequences, y_test, verbose=1)
print("Accuracy:", scores[1])  

Accuracy is barely 50%, which should be improved. Next, include additional features such as age, sex, and country.

In [0]:
text_data = Input(shape=(max_sqeunce,), name='text')
meta_data = Input(shape=(3,), name = 'meta')
x=(Embedding(len(word2idx) + 1,
                    300,  
                    weights=[embeddings_index],
                    input_length=max_sqeunce,
                    trainable=False))(text_data)
x2 = ((LSTM(300, dropout=0.2, recurrent_dropout=0.2)))(x)
x4 = concatenate([x2, meta_data])
x5 = Dense(150, activation='relu')(x4)
x6 = Dropout(0.25)(x5)
x7 = BatchNormalization()(x6)
out=(Dense(len(set(y_train)), activation="softmax"))(x7)
model = Model(inputs=[text_data, meta_data ], outputs=out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [0]:
# X_train.index holds index labels, so select rows with .loc rather than .iloc.
df_cat_train = dataframe.loc[X_train.index][['age', 'sex', 'country']]
df_cat_test = dataframe.loc[X_test.index][['age', 'sex', 'country']]

In [0]:
df_cat_train['sex'].value_counts()

In [0]:
model.fit([X_train_sequences, df_cat_train], y_train, 
          epochs=12, batch_size=128, verbose=1, 
          validation_split=0.1)
 
scores = model.evaluate([X_test_sequences, df_cat_test],y_test, verbose=1)
print("Accuracy:", scores[1])  


Accuracy has not improved, so adding more features may not be helping much. Let's try the extracted features as separate columns instead of one combined text column.

In [0]:
class ItemSelector(TransformerMixin):
    """This class allows you to select a subset of a dataframe based on a given column name.
    If as_feature is False, you will need to pass the data to another Transformer to convert it into features;
    otherwise, scikit-learn will throw a dimension-related exception.
    If as_feature is True, the selected column of the dataframe is used as a feature directly.
    For example, if 'key' is set to 'age', the values from the 'age' column will be used as features
    without the need for another Transformer.
    """
    def __init__(self, key, as_feature=False):
        self.key = key
        self.as_feature = as_feature

    def fit(self, x, y=None):
        return self

    def transform(self, dataframe):
        if self.as_feature:
            return dataframe[[self.key]]
        return dataframe[self.key]

In [0]:
X_train, X_test, y_train, y_test = train_test_split(dataframe, dataframe["y"], test_size = 0.2)

In [0]:

pipeline = Pipeline([
        
    ('union', FeatureUnion(
        transformer_list=[
        
            ('age', Pipeline([
                ('selector', ItemSelector('age', as_feature=True))
            ])),
            ('sex', Pipeline([
                ('selector', ItemSelector('sex', as_feature=True))
            ])),
            ('country', Pipeline([
                ('selector', ItemSelector('country', as_feature=True))
            ])),
            ('symptoms', Pipeline([
                ('selector', ItemSelector('symptoms')),
                ('cnt', CountVectorizer()),
                
            ])),
             ('conditions', Pipeline([
                ('selector', ItemSelector('conditions')),
                ('cnt', CountVectorizer()),
                
            ])),
             ('treatments', Pipeline([
                ('selector', ItemSelector('treatments')),
                ('cnt', CountVectorizer()),
               
            ])),
             ('weather', Pipeline([
                ('selector', ItemSelector('weather')),
                ('cnt', CountVectorizer())
            ])),

            
           
            
        ],

    )),

    # Only features are built here; the classifiers in the following sections
    # are fit on the transformed output of this pipeline.
])


pipeline.fit(X_train, y_train)
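The same kind of feature matrix can also be built with scikit-learn's `ColumnTransformer`, which selects columns by name and so does not need a custom selector class. This is an alternative sketch, not the approach used in this notebook, and the toy dataframe is made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy data with one numeric column and two text columns.
toy = pd.DataFrame({
    "age": [30, 45, 27],
    "symptoms": ["fatigu headach", "fatigu", "nausea"],
    "treatments": ["tramadol", "", "tramadol"],
})

features = ColumnTransformer([
    ("age", "passthrough", ["age"]),                  # list selector: kept as a 2-D column
    ("symptoms", CountVectorizer(), "symptoms"),      # string selector: passed as 1-D text
    ("treatments", CountVectorizer(), "treatments"),
])
matrix = features.fit_transform(toy)
```

The result has one column for `age`, three for the symptom vocabulary, and one for the treatment vocabulary.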


# KNN

In [0]:
train = pipeline.transform(X_train)
test = pipeline.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train, y_train)
y_pred = knn.predict(test)
print(accuracy_score(y_test, y_pred))

# SVM

In [0]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
svm = LinearSVC( max_iter=10000)
svm.fit(train, y_train)
y_pred = svm.predict(test)
print(accuracy_score(y_test, y_pred))
#scores = cross_val_score(svm,train, y_train, cv=3, scoring="accuracy" )
#print(np.mean(scores))

# Naive Bayes

In [0]:
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(train, y_train)
y_pred = NB.predict(test)
print(accuracy_score(y_test, y_pred))
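Accuracy alone is hard to interpret when one class dominates, as `SAME` appears to in the age plot earlier. A minimal sketch of two extra checks, a majority-class baseline and per-class recall, using only the standard library. The helper names and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Toy labels using the encoding above (0 = IMPROVED, 1 = WORSENED, 2 = SAME).
y_true_demo = [2, 2, 2, 0, 1, 2]
y_pred_demo = [2, 2, 0, 0, 2, 2]

baseline = majority_baseline_accuracy(y_true_demo)
recalls = per_class_recall(y_true_demo, y_pred_demo)
```

A model is only useful if its accuracy clearly beats the baseline, and per-class recall shows whether it ever predicts the rarer classes correctly.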

### Additional Questions

Please answer the following questions:

**1.   If you've explored the data, please describe your observations about the dataset.**

**Answer:**


1.   `trackable_name` and `trackable_type` are the same in `entries_from` and `entries_to`; only `trackable_value` changes.
2.   Most of the observations come from female users.
3.   Most frequently recorded trackables:
                  symptom = Headache, Fatigue
                  condition = Depression, Anxiety
                  treatments = Tramadol
                  weather = icon, pressure, tem_min, tem_max, precip_Intensity
4.   Anxiety is the leading problem among these fibromyalgia patients, observed mostly between the ages of 24 and 32.
5.   Ages range from 14 to 79, with an average age of 37.
6.   The age column contains invalid values (-1, 0, 1), which were replaced with the mean.
7.   Most people aged 58 and 62 reported a worsened disease status.



**2.   What approach (i.e. modeling & evaluation) did you use?**

First, the features symptoms, treatments, conditions, and weather were extracted from the list of dictionaries. Latitude and longitude were generated for each country to plot the data on an interactive Folium map.

Basic data exploration used pandas, matplotlib, seaborn, and nltk. Text cleaning consisted of stopword removal, punctuation removal, and stemming.

Count vectorization and word embeddings were used to train an LSTM on the combined text, reaching 51% validation accuracy. Adding age, sex, and country as extra features did not help; the model still could not distinguish the target classes properly.

I then compared Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Multinomial Naive Bayes (NB), trained on the extracted features kept as separate columns together with the metadata. The accuracies were as follows:

        SVM : 45%
        KNN: 39%
        NB: 41%



**3.   What features have you tried (please also include the ones that you do not include in your final model)?**

All models were trained with symptoms, conditions, treatments, and weather as features, along with metadata such as age, sex, and country.

userid, checkin_to, checkin_from, value_to, value_from, trackable_value, latitude, and longitude are not included.

**4.   What tradeoffs did you consider? Why did you use this approach?**

Trackable names and values take many distinct values, so using them directly as features would produce a very large number of columns, possibly more than the number of observations (p > n). In that regime traditional ML models can give unstable results, and the feature matrix becomes extremely sparse. I therefore used the trackable types as columns, with the trackable names as their values, and ignored `trackable_value` for now.
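One simple way to limit the number of columns is to keep only the trackable names that occur in at least a few check-ins. A minimal sketch with the standard library; `frequent_names`, the threshold, and the toy check-ins are hypothetical.

```python
from collections import Counter

def frequent_names(checkins, min_count):
    """Keep names that occur in at least `min_count` check-ins."""
    # set(names) counts each name once per check-in.
    counts = Counter(name for names in checkins for name in set(names))
    return {name for name, n in counts.items() if n >= min_count}

# Toy check-ins, each a list of trackable names.
checkins_demo = [
    ["Fatigue", "Headache"],
    ["Fatigue"],
    ["Fatigue", "Brain fog"],
]
kept_names = frequent_names(checkins_demo, min_count=2)
```

`CountVectorizer` offers a comparable control through its `min_df` and `max_features` parameters.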

**5.   How would you improve your model if you had more time?**

This project was genuinely interesting to work on. My first enhancement would be to find a way to use `trackable_value` as a feature and check whether model performance improves. Adding an attention layer to the existing LSTM is another idea.
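As a first step toward using `trackable_value`, numeric severities can be separated from free-text values such as doses. This is only a sketch under the assumption that severities are stored as numbers or numeric strings; the helper names and the sample check-in are hypothetical.

```python
def parse_severity(value):
    """Return the value as a float if it looks numeric, otherwise None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def mean_severity(entries, trackable_type):
    """Average the numeric values of one trackable type in a check-in."""
    values = [parse_severity(e.get("trackable_value")) for e in entries
              if e.get("trackable_type") == trackable_type]
    numeric = [v for v in values if v is not None]
    return sum(numeric) / len(numeric) if numeric else None

# Hypothetical check-in in the same shape as one `entries_from` value.
checkin_demo = [
    {"trackable_type": "Symptom", "trackable_name": "Fatigue", "trackable_value": "2"},
    {"trackable_type": "Symptom", "trackable_name": "Headache", "trackable_value": "4"},
    {"trackable_type": "Treatment", "trackable_name": "Tramadol", "trackable_value": "1 x 50mg"},
]
symptom_mean = mean_severity(checkin_demo, "Symptom")
treatment_mean = mean_severity(checkin_demo, "Treatment")
```

Free-text values such as `"1 x 50mg"` are skipped rather than guessed at, so a type with no numeric values yields `None`.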

