#### <span style="color:#AD07FF"> In this notebook we clean and analyze the data, add new features, <br> and prepare the result so that we can run a model on it.

# <span style="color:#FF7B07"><div align="center">**Table Of Contents**
[<span style="color:#FF7B07">**1. Introduction**](#1)<br>
[<span style="color:#FF7B07">**2. Import Libraries and Load Data**](#2)<br>
[<span style="color:#FF7B07">**3. Data Analysis**](#3)<br>
[<span style="color:#FF7B07">**4. Create Features**](#4)<br> 
[<span style="color:#FF7B07">**5. Data Cleaning**](#5)<br>
[<span style="color:#FF7B07">**6. Data Preparation**](#6)<br> 

# <span style="color:#FF7B07"><div align="center">**Introduction** <a  name="1"></a>

Nowadays, a healthy lifestyle has become a valued characteristic of modern society. <br>
More and more people try to improve their health by exercising regularly and paying attention to their eating habits. <br>
To satisfy the specific needs of every individual, conclusions drawn from users’ data are of high importance. <br>
Companies such as MyFitnessPal operate in the lucrative business field of health data. <br>
The healthcare industry is booming, especially when it comes to the analysis of health-related data. <br>
This report shows how data analysis can improve the lives of users, <br>
and highlights the potential for companies active in this business. <br>
#### <span style="color:#FF7B07">  **Source** https://www.kaggle.com/vetrirah/customer?select=Train.csv

# <span style="color:#FF7B07"><div align="center">**Import Libraries and Load Data** <a  name="2"></a>

In [1]:
from datetime import datetime
import json
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn import preprocessing

In [2]:
# The data is wide, so these options let us see the whole output
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [3]:
data = pd.read_csv('myFitnessPal_parsed.csv')

In [4]:
# with open('foods.json') as json_file:
#      foods = json.load(json_file)

# <span style="color:#FF7B07"><div align="center">**Data Analysis** <a  name="3"></a>

#### <span style="color:#FF7B07">Let us first review the general information of the data 

In [5]:
print(f'\n Data shape - {data.shape} \n')

data.head(3)


 Data shape - (587186, 16) 



Unnamed: 0,user_id,date,sequence,food_ids,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar
0,1,2014-09-15,1,"[1, 2, 3, 4, 4]",2430,96,37.0,50.0,855.0,63.0,1572.0,196.0,52.0,79.0,2300.0,59.0
1,1,2014-09-16,1,"[5, 1, 2, 3, 6, 7]",1862,158,54.0,114.0,2215.0,100.0,1832.0,229.0,61.0,92.0,2300.0,69.0
2,1,2014-09-17,1,"[1, 2, 3, 6, 8, 9, 10]",2251,187,60.0,98.0,1765.0,105.0,1685.0,210.0,56.0,85.0,2300.0,63.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 587186 entries, 0 to 587185
Data columns (total 16 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         587186 non-null  int64  
 1   date            587186 non-null  object 
 2   sequence        587186 non-null  int64  
 3   food_ids        587186 non-null  object 
 4   total_calories  587186 non-null  int64  
 5   total_carbs     587186 non-null  int64  
 6   total_fat       586309 non-null  float64
 7   total_protein   586309 non-null  float64
 8   total_sodium    585881 non-null  float64
 9   total_sugar     585879 non-null  float64
 10  goal_calories   585264 non-null  float64
 11  goal_carbs      585261 non-null  float64
 12  goal_fat        559873 non-null  float64
 13  goal_protein    559868 non-null  float64
 14  goal_sodium     519466 non-null  float64
 15  goal_sugar      519196 non-null  float64
dtypes: float64(10), int64(4), object(2)
memory usage: 71.7+ 

#### <span style="color:#00CC00"> As you can see, most columns are numeric, so we don't need to convert them, <br> but "date" and "food_ids" are stored as objects and need to be converted into numeric variables somehow

In [7]:
data.describe()

Unnamed: 0,user_id,sequence,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar
count,587186.0,587186.0,587186.0,587186.0,586309.0,586309.0,585881.0,585879.0,585264.0,585261.0,559873.0,559868.0,519466.0,519196.0
mean,4946.928031,4.004532,1421.923,153.679057,71.235077,93.821524,1157.984207,407.563147,1613.422573,194.22268,90.101536,157.212782,1446.919333,411.366736
std,2844.719822,1.341577,2284.502,355.777029,277.018504,289.062514,2049.176072,907.108966,722.874981,352.425203,240.765991,390.16981,1166.553691,886.227384
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2513.0,3.0,1038.0,76.0,29.0,40.0,30.0,22.0,1314.0,122.0,46.0,70.0,56.0,45.0
50%,4928.0,4.0,1403.0,135.0,49.0,66.0,690.0,51.0,1630.0,178.0,58.0,96.0,2300.0,64.0
75%,7427.0,5.0,1766.0,191.0,72.0,98.0,1982.0,139.0,1991.0,227.0,77.0,135.0,2300.0,100.0
max,9897.0,6.0,1200800.0,210865.0,132088.0,114949.0,960000.0,168015.0,26068.0,154417.0,38128.0,102945.0,23009.0,35055.0


#### <span style="color:#00CC00"> At first glance the features have very different ranges and the maxima lie far above the 75th percentiles, <br> so we will probably need to normalize the data and deal with outliers

#### <span style="color:#FF7B07"> Each user has a different number of records in the data

In [8]:
print(f'user_id 1 has {len(data[data["user_id"]==1])} records')
print(f'user_id 2 has {len(data[data["user_id"]==2])} records')

user_id 1 has 173 records
user_id 2 has 60 records


#### <span style="color:#FF7B07"> A user may also have skipped days between the first and the last day of their records

In [9]:
data[data["user_id"]==7].head(2)

Unnamed: 0,user_id,date,sequence,food_ids,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar
561,7,2014-10-06,1,"[980, 981]",360,45,14.0,15.0,2.0,5.0,1400.0,140.0,31.0,140.0,25.0,102.0
562,7,2014-10-15,1,[982],140,2,9.0,12.0,0.0,1.0,1400.0,140.0,31.0,140.0,25.0,102.0


#### <span style="color:#00CC00"> As we can see, the user with ID 7 made the first record on 2014-10-06 and the second on 2014-10-15. <br> We will need to capture this gap as a feature

#### <span style="color:#FF7B07"> Check for null values in the data

In [10]:
null_df = pd.DataFrame(data.isna().sum())
null_df.columns = ["Null Frequency"]
null_df.T

Unnamed: 0,user_id,date,sequence,food_ids,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar
Null Frequency,0,0,0,0,0,0,877,877,1305,1307,1922,1925,27313,27318,67720,67990


#### <span style="color:#00CC00"> As we saw, 10 columns sometimes contain null values, <br> so a single record can have at most 10 null columns. <br> I think we should drop every record that has 4 or more null features, <br> because otherwise we would be filling in 25 percent or more of that record's 16 columns

In [11]:
print('Users which have null in more than 3 columns : ',(len(data.loc[data.isnull().sum(axis=1)>3])/len(data))*100,'%')

Users which have null in more than 3 columns :  4.6515073588266755 %


#### <span style="color:#00CC00"> About 4.65 percent of the records fall into this group. <br> Since the dataset is large and filling in 4 or more null features would not be accurate, we can drop them
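The row-level rule described above can be sketched in plain Python. The sample records and the names `MAX_NULLS` and `count_nulls` are illustrative assumptions, not part of the dataset or the notebook.

```python
# Hypothetical records; None stands for a missing value.
records = [
    {"goal_fat": 52.0, "goal_protein": 79.0, "goal_sodium": 2300.0, "goal_sugar": 59.0},
    {"goal_fat": None, "goal_protein": None, "goal_sodium": None, "goal_sugar": None},
    {"goal_fat": 61.0, "goal_protein": 92.0, "goal_sodium": None, "goal_sugar": None},
]

MAX_NULLS = 3  # keep a record only if it has at most 3 missing values


def count_nulls(record):
    """Number of missing (None) fields in one record."""
    return sum(value is None for value in record.values())


kept = [r for r in records if count_nulls(r) <= MAX_NULLS]
dropped_share = 100 * (len(records) - len(kept)) / len(records)
print(f"kept {len(kept)} of {len(records)} records, dropped {dropped_share:.1f} %")
```

Only the second record, with 4 missing fields, is dropped; the third one has 2 missing fields and stays available for imputation.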

#### <span style="color:#FF7B07"> The values themselves can still be problematic, for example when the goal calories are 0. <br> Such a record does not make sense, <br> because the main purpose of the app is to set a goal and get closer to it.

In [12]:
print('Users which have zero in goal_calories : ',(len(data[data['goal_calories']==0])/len(data))*100,'%')

Users which have zero in goal_calories :  0.11767991743672362 %


#### <span style="color:#00CC00"> Only about 0.12 percent of the records are affected, so we can drop them

# <span style="color:#FF7B07"><div align="center">**Create Features** <a  name="4"></a>

In [13]:
data.head(3)

Unnamed: 0,user_id,date,sequence,food_ids,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar
0,1,2014-09-15,1,"[1, 2, 3, 4, 4]",2430,96,37.0,50.0,855.0,63.0,1572.0,196.0,52.0,79.0,2300.0,59.0
1,1,2014-09-16,1,"[5, 1, 2, 3, 6, 7]",1862,158,54.0,114.0,2215.0,100.0,1832.0,229.0,61.0,92.0,2300.0,69.0
2,1,2014-09-17,1,"[1, 2, 3, 6, 8, 9, 10]",2251,187,60.0,98.0,1765.0,105.0,1685.0,210.0,56.0,85.0,2300.0,63.0


#### <span style="color:#FF7B07"> As we have seen, one feature we want to rework is food_ids. <br> We will extract as much information as we can from it into new columns, <br> so that we lose as little information as possible.<br> The first step is a new feature holding the number of entries in food_ids.

In [14]:
data['foods_len'] = data["food_ids"].apply(lambda x: len(x[1:-1].split(',')))
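
Note that splitting on commas reports a length of 1 for an empty list string such as `"[]"`. A minimal alternative sketch, assuming every `food_ids` value is a valid Python list literal (not verified against the full file), parses the string with `ast.literal_eval`. The helper name `count_foods` is hypothetical.

```python
import ast


def count_foods(food_ids_text):
    """Return the number of food entries in a string such as "[1, 2, 3, 4, 4]"."""
    return len(ast.literal_eval(food_ids_text))


print(count_foods("[1, 2, 3, 4, 4]"))  # duplicates are counted
print(count_foods("[982]"))
print(count_foods("[]"))               # the comma split would report 1 here
```

If this behavior is preferred, `data["food_ids"].apply(count_foods)` could take the place of the lambda, at the cost of slower parsing.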

#### <span style="color:#FF7B07"> Each user may have several rows. <br> Our goal is to transform the data so that each user is described by a single row. <br> So let's add a separate feature holding the number of records for each user. <br> It preserves this information and lets us filter users later, for example those who have fewer than 10 records

#### <span style="color:#FF7B07"> To do this, we create a temporary dataframe with the unique user IDs and the corresponding number of logs, <br> so that the values are not repeated on every row and data manipulation stays simple. <br> Once the data is prepared we will merge it back, so that all columns end up together

In [15]:
# add a new feature which shows how many days are logged by each user
user_logged_freq = data["user_id"].value_counts()
user_logged_df = pd.DataFrame(data["user_id"].unique(),columns = ['user_id'])
user_logged_df["logged_frequency"] = user_logged_df["user_id"].apply(lambda _id: user_logged_freq[_id])

#### <span style="color:#FF7B07"> As we have seen, users may not have data for every consecutive day and there may be gaps in the middle, <br> so let's add a separate feature that shows how many days are missing between the first and the last record. <br> This helps us retain as much information as possible about our users

In [16]:
# this function counts the days between the start and end dates and subtracts the logged days to get the missed days
def days_missed(d1, d2,loggedDays):
    d1 = datetime.strptime(str(d1), "%Y-%m-%d")
    d2 = datetime.strptime(str(d2), "%Y-%m-%d")
    return abs(abs((d2 - d1).days)-loggedDays)

In [17]:
# this function calls days_missed with the dates of the user's last and first records
def get_missed_days(df,userID,logged_frequency):
    tail = df[df["user_id"]==userID].tail(1)['date'].values[0]
    head = df[df["user_id"]==userID].head(1)['date'].values[0]
    return days_missed(tail,head,logged_frequency)   

In [18]:
# add new feature based on how many days are missed for each user
user_logged_df['days_missed'] = user_logged_df[['user_id','logged_frequency']].apply(lambda x: get_missed_days(data,x.user_id,x.logged_frequency),axis=1)
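
As a worked example of the arithmetic, the sketch below uses the two log dates shown for user 7 earlier. Treating them as that user's only records is an assumption made for illustration. It also shows that `days_missed` subtracts the log count from the plain date difference, whereas counting both end dates as calendar days would give a result that is larger by one.

```python
from datetime import date

# Two log dates from the user 7 preview; assumed here to be the only records.
log_dates = [date(2014, 10, 6), date(2014, 10, 15)]

span_days = (max(log_dates) - min(log_dates)).days  # difference between last and first date
logged = len(log_dates)                             # number of logged records

# Same arithmetic as days_missed in the notebook.
missed_as_in_notebook = abs(span_days - logged)

# Variant that counts both the first and the last day as calendar days.
missed_inclusive = (span_days + 1) - logged

print(missed_as_in_notebook, missed_inclusive)
```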

#### <span style="color:#FF7B07"> Since the data relates to nutrition, health and exercise, <br>we need a separate variable for each record <br> that shows how far it deviates from a healthy macronutrient distribution.

In [19]:
def getHealtyDistributedValues(value,lower,upper):
    if(value <= lower):
        return lower-value # the amount by which the person fell short of the range
    if(value >= upper):
        return value-upper # the amount by which the person exceeded the range
    return 0 # values inside the range have no deviation

In [20]:
# The ranges used here: carbs 45-65% of daily calories, fats 10-35% and proteins 20-35%
# This method measures how far a person's daily distribution deviates from those ranges
# 0 for people within the permissible ranges
# the maximum value is 1.3 (when a person ate only fats)

def healthyDistributed(carbs,fat,protein):
    totalCalories = fat*9 + carbs*4 + protein*4 # convert grams to calories (1 g fat = 9 calories, 1 g carbs or protein = 4 calories) and sum
    # the tiny constant in the denominators avoids division by zero for empty logs
    deviation =  getHealtyDistributedValues(carbs*4 / (totalCalories+0.00000001),0.45,0.65)
    deviation += getHealtyDistributedValues(protein*4 / (totalCalories+0.00000001),0.2,0.35)
    deviation += getHealtyDistributedValues(fat*9 / (totalCalories+0.00000001),0.1,0.35)
    return deviation

In [21]:
data['healtyDistrib'] = data[['total_carbs','total_fat','total_protein']].apply(lambda x: healthyDistributed(x.total_carbs,x.total_fat,x.total_protein),axis=1)
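
To make the score concrete, here is a worked example for the first record of user 1 (96 g carbs, 37 g fat, 50 g protein). It is a standalone sketch with the same ranges as the notebook. The helper name `range_deviation` is hypothetical, and the small constant added to the denominator in the notebook is left out because the total is non-zero here.

```python
def range_deviation(value, lower, upper):
    """Distance of value from the interval [lower, upper]; 0 inside it."""
    if value < lower:
        return lower - value
    if value > upper:
        return value - upper
    return 0.0


carbs_g, fat_g, protein_g = 96, 37.0, 50.0

# Convert grams to calories: 4 per gram of carbs or protein, 9 per gram of fat.
carb_kcal, fat_kcal, protein_kcal = carbs_g * 4, fat_g * 9, protein_g * 4
total_kcal = carb_kcal + fat_kcal + protein_kcal

score = (
    range_deviation(carb_kcal / total_kcal, 0.45, 0.65)       # carbs share is below 45%
    + range_deviation(protein_kcal / total_kcal, 0.20, 0.35)  # protein share is in range
    + range_deviation(fat_kcal / total_kcal, 0.10, 0.35)      # fat share is above 35%
)
print(round(score, 6))
```

The result agrees with the unscaled `healtyDistrib` value shown for this record in the preview after the cleaning step (0.044384).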

#### <span style="color:#FF7B07"> One of the models we will build predicts whether each user will approach their goal in the future, <br>so we need a variable that describes this.

In [22]:
# this function takes all nutrient totals and goals and checks whether the calorie difference is less than the given percentage of the calorie goal
# TODO:
def check_bounds(total_calories, total_carbs, total_fat, total_protein, total_sodium, total_sugar, 
               goal_calories, goal_carbs, goal_fat, goal_protein, goal_sodium, goal_sugar,percent):
    
    return (abs(goal_calories - total_calories) < goal_calories * percent / 100)

In [23]:
# this function takes the user's last num_days records and counts the days on which the calories were within the goal range
def reach_goal(df,user_id,num_days):
    allowed_difference_percentage = 15
    tails = df[df["user_id"]==user_id].tail(num_days)
    # columns 4 to 15 hold the total_* and goal_* values;
    # the result is kept in a local Series instead of being written into a slice of df
    reached = tails.apply(lambda row: check_bounds(*(row.values[4:16]),allowed_difference_percentage),axis=1)
    return reached.sum()

In [24]:
# create a new feature which shows whether the user reached the goal in the last days:
# 1 if the number of days on which the user reached the goal is at least threshold, else 0
# TODO:

number_of_last_days = 5
threshold = 2
user_logged_df["reach_goal"] = user_logged_df['user_id'].apply(lambda x: reach_goal(data,x,number_of_last_days))
user_logged_df["reach_goal"] = user_logged_df["reach_goal"].apply(lambda x: 1 if x>=threshold else 0)
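
The labeling rule can be traced by hand on a small example. The calorie values below are hypothetical; the 15 percent tolerance, the 5-day window and the threshold of 2 are the settings used in the cells above.

```python
ALLOWED_DIFF_PERCENT = 15  # tolerance around the calorie goal
LAST_DAYS = 5              # how many of the most recent records are checked
THRESHOLD = 2              # minimum number of days on target for label 1


def within_goal(total_calories, goal_calories, percent=ALLOWED_DIFF_PERCENT):
    """True if total calories differ from the goal by less than percent of the goal."""
    return abs(goal_calories - total_calories) < goal_calories * percent / 100


# Hypothetical (total_calories, goal_calories) pairs, oldest first.
days = [(2430, 1572), (1862, 1832), (2251, 1685), (1500, 1600), (1400, 1900)]

days_on_target = sum(within_goal(total, goal) for total, goal in days[-LAST_DAYS:])
label = 1 if days_on_target >= THRESHOLD else 0
print(days_on_target, label)
```

Only the second and fourth day fall inside the tolerance, so this user is on target on 2 of 5 days and receives the label 1.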

#### <span style="color:#FF7B07"> Each user has goal columns and actually consumed nutrient columns, <br> so it is important to know the deviation between them and expose it as separate features

In [25]:
data["calories_diff"] = data["goal_calories"]-data["total_calories"]
data["carbs_diff"] = data["goal_carbs"]-data["total_carbs"]
data["fat_diff"] = data["goal_fat"]-data["total_fat"]
data["protein_diff"] = data["goal_protein"]-data["total_protein"]
data["sodium_diff"] = data["goal_sodium"]-data["total_sodium"]
data["sugar_diff"] = data["goal_sugar"]-data["total_sugar"]

# <span style="color:#FF7B07"><div align="center">**Data Cleaning** <a  name="5"></a>

#### <span style="color:#FF7B07"> Drop records which have more than 3 null values

In [26]:
# index labels of the rows that have more than 3 null values
null_rows = data[data.isnull().sum(axis=1)>3].index
data.drop(index=null_rows, inplace = True)

In [27]:
print('Values which have null in more than 3 columns : ',(len(data.loc[data.isnull().sum(axis=1)>3])/len(data))*100,'%')

Values which have null in more than 3 columns :  0.0 %


#### <span style="color:#FF7B07"> Drop records which have 0 goal calories

In [28]:
# index labels of the rows whose calorie goal is 0
zero_goal_rows = data[data['goal_calories']==0].index
data.drop(index=zero_goal_rows, inplace = True)

In [29]:
print('Examples which have zero in goal_calories : ',(len(data[data['goal_calories']==0])/len(data))*100,'%')

Examples which have zero in goal_calories :  0.0 %


#### <span style="color:#FF7B07"> Delete the "date", "food_ids" and "sequence" columns because we no longer use them. The information we need from them is now in other columns

In [30]:
data = data.drop(columns=['date','food_ids','sequence'])

In [31]:
data.head(2)

Unnamed: 0,user_id,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar,foods_len,healtyDistrib,calories_diff,carbs_diff,fat_diff,protein_diff,sodium_diff,sugar_diff
0,1,2430,96,37.0,50.0,855.0,63.0,1572.0,196.0,52.0,79.0,2300.0,59.0,5,0.044384,-858.0,100.0,15.0,29.0,1445.0,-4.0
1,1,1862,158,54.0,114.0,2215.0,100.0,1832.0,229.0,61.0,92.0,2300.0,69.0,6,0.048475,-30.0,71.0,7.0,-22.0,85.0,-31.0


#### <span style="color:#FF7B07"> Some of the features have a wide range of values, so let's scale them

In [44]:
scaler = preprocessing.MinMaxScaler()
col_df = data.columns.drop('user_id')
col_user_logged_df = ["logged_frequency","days_missed"]

user_logged_df[col_user_logged_df] = scaler.fit_transform(user_logged_df[col_user_logged_df])
data[col_df] = scaler.fit_transform(data[col_df])
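
With its default settings, `MinMaxScaler` rescales each column to the range [0, 1] using (x - min) / (max - min). The plain-Python sketch below applies that formula to the `total_calories` summary statistics from the `describe()` output above. Those statistics were computed before any rows were dropped, so the values in the scaled dataframe may differ. The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# min, 25%, 50%, 75% and max of total_calories from describe()
total_calories = [0.0, 1038.0, 1403.0, 1766.0, 1200800.0]

scaled = min_max_scale(total_calories)
print([round(v, 6) for v in scaled])
```

Because the maximum is so far above the quartiles, typical values end up very close to 0 after scaling, which is one reason the outliers noted earlier matter.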

# <span style="color:#FF7B07"><div align="center">**Data Preparation** <a  name="6"></a>

#### <span style="color:#FF7B07"> As we have seen, users have different numbers of records, and we want each user to be described by a single row.<br>For this we take 5 records from each user and then flatten them. <br> For users with more than 5 records we keep the last 5. Users with fewer than 5 records are padded with rows of -1s

In [46]:
def row_padding(x,num_row):
    # users with enough records keep only their last num_row rows
    if np.shape(x)[0] >= num_row:
        return x.tail(num_row)

    # padding row: same format and same id as the last row (user_id is assumed
    # to be the first column), all other columns replaced by -1
    pad_row = x.iloc[-1].copy()
    pad_row.iloc[1:] = -1

    # DataFrame.append was removed in pandas 2.0, so the padding is added with pd.concat
    padding = pd.DataFrame([pad_row]*(num_row - np.shape(x)[0]))
    return pd.concat([x, padding])

In [47]:
# this function flattens all rows of a user, which we have already padded,
# into one vector because we need one input per user
def flatten_rows(x,cols):
    for i in range(1,x.shape[0]):
        temp_row = x.iloc[i]
        for j in range(1,len(cols)):
            # positional access: column j of row i becomes the new column "<name>_<i>"
            x[cols[j]+"_"+str(i)] = temp_row.iloc[j]
    return x.head(1)

In [48]:
# It may take 5 minutes 
data = data.groupby('user_id').apply(row_padding,5).reset_index(drop=True)

cols = data.columns
data = data.groupby('user_id').apply(flatten_rows,cols).reset_index(drop=True)
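
The effect of padding and flattening on a single user can be illustrated without pandas. This is a simplified sketch, not the notebook's implementation: the function names, the two feature columns and the values are hypothetical, while the window of 5 rows, the -1 filler and the `_1` to `_4` column suffixes follow the cells above.

```python
NUM_ROWS = 5    # records kept per user, as in the notebook
PAD_VALUE = -1  # filler for users with fewer records


def pad_rows(rows, num_rows=NUM_ROWS):
    """Keep the last num_rows rows; append PAD_VALUE rows if there are fewer."""
    kept = rows[-num_rows:]
    width = len(rows[0])
    padding = [[PAD_VALUE] * width for _ in range(num_rows - len(kept))]
    return kept + padding


def flatten(user_id, rows, columns):
    """Turn the padded rows into one {column: value} record for the user."""
    record = {"user_id": user_id}
    for i, row in enumerate(rows):
        for name, value in zip(columns, row):
            # the first row keeps the plain column name, later rows get a suffix
            key = name if i == 0 else f"{name}_{i}"
            record[key] = value
    return record


# Hypothetical user with only two scaled records and two feature columns.
columns = ["total_calories", "foods_len"]
rows = [[0.0034, 0.23], [0.0025, 0.16]]

record = flatten(7, pad_rows(rows), columns)
print(record)
```

The user ends up with one record of 11 entries: the ID, the two real rows, and three padded rows filled with -1.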

In [49]:
data.head(3)

Unnamed: 0,user_id,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar,foods_len,healtyDistrib,calories_diff,carbs_diff,fat_diff,protein_diff,sodium_diff,sugar_diff,total_calories_1,total_carbs_1,total_fat_1,total_protein_1,total_sodium_1,total_sugar_1,goal_calories_1,goal_carbs_1,goal_fat_1,goal_protein_1,goal_sodium_1,goal_sugar_1,foods_len_1,healtyDistrib_1,calories_diff_1,carbs_diff_1,fat_diff_1,protein_diff_1,sodium_diff_1,sugar_diff_1,total_calories_2,total_carbs_2,total_fat_2,total_protein_2,total_sodium_2,total_sugar_2,goal_calories_2,goal_carbs_2,goal_fat_2,goal_protein_2,goal_sodium_2,goal_sugar_2,foods_len_2,healtyDistrib_2,calories_diff_2,carbs_diff_2,fat_diff_2,protein_diff_2,sodium_diff_2,sugar_diff_2,total_calories_3,total_carbs_3,total_fat_3,total_protein_3,total_sodium_3,total_sugar_3,goal_calories_3,goal_carbs_3,goal_fat_3,goal_protein_3,goal_sodium_3,goal_sugar_3,foods_len_3,healtyDistrib_3,calories_diff_3,carbs_diff_3,fat_diff_3,protein_diff_3,sodium_diff_3,sugar_diff_3,total_calories_4,total_carbs_4,total_fat_4,total_protein_4,total_sodium_4,total_sugar_4,goal_calories_4,goal_carbs_4,goal_fat_4,goal_protein_4,goal_sodium_4,goal_sugar_4,foods_len_4,healtyDistrib_4,calories_diff_4,carbs_diff_4,fat_diff_4,protein_diff_4,sodium_diff_4,sugar_diff_4
0,1.0,0.003433,0.0022,0.001484,0.0056,0.013776,0.000744,0.175586,0.003704,0.004013,0.002224,0.099961,0.004907,0.229508,0.093842,0.980352,0.576915,0.775728,0.225061,0.941119,0.827611,0.002506,0.001072,0.000666,0.002467,0.007334,0.000369,0.109103,0.002305,0.002492,0.001379,0.099961,0.003052,0.163934,0.076151,0.979846,0.576975,0.776022,0.225114,0.947266,0.827601,0.002025,0.001224,0.000825,0.003133,0.007651,0.00044,0.069743,0.00147,0.0016,0.000884,0.099961,0.00194,0.163934,0.09328,0.979479,0.576533,0.775698,0.224579,0.946964,0.827349,0.001899,0.000811,0.000651,0.007133,0.003806,0.000595,0.08716,0.001839,0.001993,0.001107,0.099961,0.002425,0.147541,0.1341,0.979974,0.576928,0.775922,0.223848,0.950633,0.827305,0.001006,0.000545,0.000333,0.003267,0.002709,0.000363,0.089308,0.001885,0.002046,0.001127,0.099961,0.002482,0.081967,0.062623,0.980896,0.577101,0.77618,0.224737,0.95168,0.827507
1,2.0,0.001289,0.000726,0.000394,0.001467,0.004171,0.000595,0.0506,0.001069,0.001154,0.000641,0.099961,0.001426,0.213115,0.063449,0.979793,0.576651,0.775933,0.224767,0.950285,0.827132,0.001393,0.001176,0.000394,0.001633,0.003135,0.000548,0.0506,0.001069,0.001154,0.000641,0.099961,0.001426,0.213115,0.062802,0.979691,0.57639,0.775933,0.224729,0.951273,0.827172,0.001186,0.00092,0.000341,0.001033,0.000993,0.000464,0.0506,0.001069,0.001154,0.000641,0.099961,0.001426,0.196721,0.080754,0.979895,0.576538,0.775975,0.224865,0.953318,0.827241,0.001203,0.001053,0.000273,0.002233,0.003038,0.000393,0.0506,0.001069,0.001154,0.000641,0.099961,0.001426,0.262295,0.014553,0.979878,0.576461,0.776027,0.224594,0.951366,0.8273,4.7e-05,5.2e-05,8e-06,0.000233,0.000255,6e-05,0.0506,0.001069,0.001154,0.000641,0.099961,0.001426,0.016393,0.0,0.981012,0.577041,0.776233,0.225046,0.954022,0.827576
2,3.0,0.001223,0.000749,0.00031,0.001833,0.00378,8.9e-05,0.056969,0.001198,0.001285,0.000719,0.099961,0.000884,0.163934,0.015246,0.979994,0.576692,0.776027,0.224744,0.950658,0.827458,0.00117,0.000825,0.000363,0.0021,0.005135,9.5e-05,0.057007,0.001205,0.001285,0.000719,0.099961,0.000884,0.114754,0.013378,0.980047,0.576651,0.775986,0.224684,0.949365,0.827453,0.001491,0.000659,0.000545,0.002633,0.005547,0.000268,0.057007,0.001205,0.001285,0.000719,0.099961,0.000884,0.131148,0.123482,0.979732,0.576747,0.775845,0.224564,0.948972,0.82731,0.001832,0.0,0.0,0.0,0.0,0.0,0.04638,0.000978,0.001049,0.000583,0.099961,0.000713,0.0,0.576923,0.979171,0.577033,0.776216,0.225053,0.954265,0.827502,0.001006,0.00082,0.000227,0.001133,0.004064,6.5e-05,0.04638,0.000978,0.001049,0.000583,0.099961,0.000713,0.081967,0.058568,0.979981,0.576558,0.776039,0.224797,0.950387,0.827448


#### <span style="color:#FF7B07"> Merge the dataframe with the per-user features which we created earlier

In [35]:
data = pd.merge(data, user_logged_df, on=['user_id'])

#### <span style="color:#FF7B07"> Handle missing data with KNNImputer

In [36]:
imputer = KNNImputer()
data[data.columns] = np.round(imputer.fit_transform(data))

In [37]:
null_df = pd.DataFrame(data.isna().sum())
null_df.columns = ["Null Frequency"]
null_df.T

Unnamed: 0,user_id,total_calories,total_carbs,total_fat,total_protein,total_sodium,total_sugar,goal_calories,goal_carbs,goal_fat,goal_protein,goal_sodium,goal_sugar,foods_len,healtyDistrib,calories_diff,carbs_diff,fat_diff,protein_diff,sodium_diff,sugar_diff,total_calories_1,total_carbs_1,total_fat_1,total_protein_1,total_sodium_1,total_sugar_1,goal_calories_1,goal_carbs_1,goal_fat_1,goal_protein_1,goal_sodium_1,goal_sugar_1,foods_len_1,healtyDistrib_1,calories_diff_1,carbs_diff_1,fat_diff_1,protein_diff_1,sodium_diff_1,sugar_diff_1,total_calories_2,total_carbs_2,total_fat_2,total_protein_2,total_sodium_2,total_sugar_2,goal_calories_2,goal_carbs_2,goal_fat_2,goal_protein_2,goal_sodium_2,goal_sugar_2,foods_len_2,healtyDistrib_2,calories_diff_2,carbs_diff_2,fat_diff_2,protein_diff_2,sodium_diff_2,sugar_diff_2,total_calories_3,total_carbs_3,total_fat_3,total_protein_3,total_sodium_3,total_sugar_3,goal_calories_3,goal_carbs_3,goal_fat_3,goal_protein_3,goal_sodium_3,goal_sugar_3,foods_len_3,healtyDistrib_3,calories_diff_3,carbs_diff_3,fat_diff_3,protein_diff_3,sodium_diff_3,sugar_diff_3,total_calories_4,total_carbs_4,total_fat_4,total_protein_4,total_sodium_4,total_sugar_4,goal_calories_4,goal_carbs_4,goal_fat_4,goal_protein_4,goal_sodium_4,goal_sugar_4,foods_len_4,healtyDistrib_4,calories_diff_4,carbs_diff_4,fat_diff_4,protein_diff_4,sodium_diff_4,sugar_diff_4,logged_frequency,days_missed,reach_goal
Null Frequency,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### <span style="color:#00CC00"> As you can see, no null values remain

# <span style="color:#FF7B07"> Data for predicting whether a user reaches the goal

In [38]:
# choose the users who have more logs than number_of_logs
number_of_logs = 60
data_2 = data[data["logged_frequency"]>number_of_logs]

# <span style="color:#FF7B07"> Save the updated dataframe

In [1]:
data.to_csv(r'myFitnessPal_parsed.csv',index = False)

NameError: name 'data' is not defined