### Possible Issues

`calories_day` can be named better

`comfort_food` is not yet coded

`nan` values for cuisine?

`father_profession`, `mothers_profession` not coded; column names are inconsistent (`father_` vs. `mothers_`)

`food_childhood` not coded 

`healthy_feel`, `life_rewarding` interesting scale: 1 for strongest, 10 for weakest 

`healthy_meal` not coded 

`meals_dinner_friend` not coded 

some columns that have int values such as `[food]_calories` are actually categorical/discrete, not continuous 

`type_sports` not coded 

`weight` has to be cleaned
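As a rough illustration of what cleaning `weight` could involve, the sketch below pulls the first run of digits out of free-text answers and turns everything else into `NaN`. The sample values are toy data of the kind found in the `weight` column, and the regex is an assumption about what counts as a valid weight, not a decision made in this notebook.

```python
import pandas as pd

# Toy values resembling free-text answers in the `weight` column.
weight_raw = pd.Series(
    ["187", "155", "I'm not answering this.", "Not sure, 240", "190"]
)

# Extract the first run of digits; rows without any digits become NaN.
weight_clean = pd.to_numeric(
    weight_raw.str.extract(r"(\d+)", expand=False), errors="coerce"
)
print(weight_clean.tolist())
```

Answers with several numbers or with units would need extra handling, so the extracted values should be spot-checked before use.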

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('food_coded.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   GPA                           123 non-null    object 
 1   Gender                        125 non-null    int64  
 2   breakfast                     125 non-null    int64  
 3   calories_chicken              125 non-null    int64  
 4   calories_day                  106 non-null    float64
 5   calories_scone                124 non-null    float64
 6   coffee                        125 non-null    int64  
 7   comfort_food                  124 non-null    object 
 8   comfort_food_reasons          124 non-null    object 
 9   comfort_food_reasons_coded    106 non-null    float64
 10  cook                          122 non-null    float64
 11  comfort_food_reasons_coded.1  125 non-null    int64  
 12  cuisine                       108 non-null    float64
 13  diet_

In [4]:
df.head()

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,comfort_food_reasons,comfort_food_reasons_coded,...,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
0,2.4,2,1,430,,315.0,1,none,we dont have comfort,9.0,...,1.0,1.0,1,1165.0,345,car racing,5,1,1315,187
1,3.654,1,1,610,3.0,420.0,2,"chocolate, chips, ice cream","Stress, bored, anger",1.0,...,1.0,1.0,2,725.0,690,Basketball,4,2,900,155
2,3.3,1,1,720,4.0,420.0,2,"frozen yogurt, pizza, fast food","stress, sadness",1.0,...,1.0,2.0,5,1165.0,500,none,5,1,900,I'm not answering this.
3,3.2,1,1,430,3.0,420.0,2,"Pizza, Mac and cheese, ice cream",Boredom,2.0,...,1.0,2.0,5,725.0,690,,3,1,1315,"Not sure, 240"
4,3.5,1,1,720,2.0,420.0,2,"Ice cream, chocolate, chips","Stress, boredom, cravings",1.0,...,1.0,1.0,4,940.0,500,Softball,4,2,760,190


### Removing redundancies

Because these columns have already been coded, we only keep the codes and remove the uncoded columns.

In [5]:
df.drop(['comfort_food_reasons', 'diet_current', 'eating_changes', 'fav_cuisine', 'ideal_diet'], inplace=True, axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   GPA                           123 non-null    object 
 1   Gender                        125 non-null    int64  
 2   breakfast                     125 non-null    int64  
 3   calories_chicken              125 non-null    int64  
 4   calories_day                  106 non-null    float64
 5   calories_scone                124 non-null    float64
 6   coffee                        125 non-null    int64  
 7   comfort_food                  124 non-null    object 
 8   comfort_food_reasons_coded    106 non-null    float64
 9   cook                          122 non-null    float64
 10  comfort_food_reasons_coded.1  125 non-null    int64  
 11  cuisine                       108 non-null    float64
 12  diet_current_coded            125 non-null    int64  
 13  drink

`comfort_food_reasons_coded` and `comfort_food_reasons_coded.1` are almost the same column, except that the former has `nan` values. Hence, we choose the latter to keep.
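One way to check that two columns agree wherever both have a value is sketched below. The two Series are toy stand-ins with made-up values, not the actual columns, so only the comparison pattern carries over.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two codings (values are illustrative only).
a = pd.Series([9.0, 1.0, np.nan, 2.0, np.nan], name="comfort_food_reasons_coded")
b = pd.Series([9, 1, 1, 2, 3], name="comfort_food_reasons_coded.1")

both = a.notna()                                # rows where the first coding exists
n_disagree = int((a[both] != b[both]).sum())    # mismatches among those rows
n_missing = int(a.isna().sum())                 # rows only the second coding covers
print(n_disagree, n_missing)
```

A non-zero `n_disagree` on the real data would be the "almost" in "almost the same" and would be worth inspecting before dropping either column.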

In [6]:
df.drop(['comfort_food_reasons_coded'], inplace=True, axis=1)
df.rename({'comfort_food_reasons_coded.1': 'comfort_food_reasons_coded'}, axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 55 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   GPA                         123 non-null    object 
 1   Gender                      125 non-null    int64  
 2   breakfast                   125 non-null    int64  
 3   calories_chicken            125 non-null    int64  
 4   calories_day                106 non-null    float64
 5   calories_scone              124 non-null    float64
 6   coffee                      125 non-null    int64  
 7   comfort_food                124 non-null    object 
 8   cook                        122 non-null    float64
 9   comfort_food_reasons_coded  125 non-null    int64  
 10  cuisine                     108 non-null    float64
 11  diet_current_coded          125 non-null    int64  
 12  drink                       123 non-null    float64
 13  eating_changes_coded        125 non

There are two codings for `eating_changes`: `eating_changes_coded` and `eating_changes_coded1`. The latter is more detailed than the former, so we keep it.
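A quick way to see how much finer one coding is than another is to compare the number of distinct codes and cross-tabulate them. The sketch below uses toy codes, not the real columns, purely to show the pattern.

```python
import pandas as pd

# Toy codes (illustrative only): a coarse and a finer coding of the same answers.
coarse = pd.Series([1, 1, 2, 2, 3, 3], name="eating_changes_coded")
fine = pd.Series([1, 2, 3, 4, 5, 6], name="eating_changes_coded1")

# More distinct values suggests a more detailed coding.
print(coarse.nunique(), fine.nunique())

# The cross-tabulation shows how the fine codes nest inside the coarse ones.
table = pd.crosstab(fine, coarse)
print(table)
```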

In [7]:
df.drop(['eating_changes_coded'], inplace=True, axis=1)
df.rename({'eating_changes_coded1': 'eating_changes_coded'}, axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 54 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   GPA                         123 non-null    object 
 1   Gender                      125 non-null    int64  
 2   breakfast                   125 non-null    int64  
 3   calories_chicken            125 non-null    int64  
 4   calories_day                106 non-null    float64
 5   calories_scone              124 non-null    float64
 6   coffee                      125 non-null    int64  
 7   comfort_food                124 non-null    object 
 8   cook                        122 non-null    float64
 9   comfort_food_reasons_coded  125 non-null    int64  
 10  cuisine                     108 non-null    float64
 11  diet_current_coded          125 non-null    int64  
 12  drink                       123 non-null    float64
 13  eating_changes_coded        125 non

### Missing Values

`GPA` has missing values, which we will impute with the mean. The column is currently stored as `object`, so we first need to find the non-numeric entries that prevent it from being converted to float.

In [8]:
df['GPA'].value_counts()

3.5           13
3             11
3.2           10
3.7           10
3.3            9
3.4            9
3.6            7
3.9            7
3.8            6
2.8            5
4              4
3.1            3
2.9            2
3.83           2
2.6            2
2.4            1
3.79 bitch     1
3.73           1
2.71           1
3.92           1
3.68           1
3.75           1
Unknown        1
3.77           1
3.63           1
3.67           1
3.89           1
Personal       1
3.35           1
3.292          1
3.605          1
3.654          1
3.65           1
3.87           1
2.2            1
3.904          1
2.25           1
3.882          1
Name: GPA, dtype: int64

We should extract the number from `3.79 bitch`, and turn `Unknown` and `Personal` into `NaN` values to be imputed later.
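The same clean-up can also be sketched in a single pass: extract the leading number where one exists and let everything else fall through to `NaN`. The sample values below are toy data, and the regex is an assumption about what a valid GPA looks like; this is an alternative to the step-by-step replacements in this notebook, not what it actually runs.

```python
import pandas as pd

# Toy GPA entries: clean numbers, a number with trailing text, and two non-answers.
gpa_raw = pd.Series(["3.5", "3.79 extra text", "Unknown", "Personal ", "4"])

# Capture an integer or decimal number; entries without one become NaN.
gpa_num = pd.to_numeric(
    gpa_raw.str.extract(r"(\d+(?:\.\d+)?)", expand=False), errors="coerce"
)

# Impute the remaining NaN values with the column mean.
gpa_filled = gpa_num.fillna(gpa_num.mean())
print(gpa_filled.tolist())
```

The trade-off is that a blanket extraction can silently accept malformed answers, whereas explicit replacements force each odd value to be looked at.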

In [9]:
df.loc[df['GPA'] == '3.79 bitch']

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,cook,comfort_food_reasons_coded,...,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
73,3.79 bitch,2,1,720,4.0,420.0,2,"Chips, ice cream",1.0,2,...,1.0,1.0,2,1165.0,850,baseball,4,1,1315,200


In [10]:
df['GPA'] = df['GPA'].replace({'3.79 bitch': 3.79})

In [11]:
df.loc[df['GPA'] == 3.79]

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,cook,comfort_food_reasons_coded,...,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight
73,3.79,2,1,720,4.0,420.0,2,"Chips, ice cream",1.0,2,...,1.0,1.0,2,1165.0,850,baseball,4,1,1315,200


Note that `Personal ` has a trailing space. The space must be included for the value to match.

In [17]:
df['GPA'] = df['GPA'].replace({
    'Unknown': np.nan,
    'Personal ': np.nan
})

df.loc[(df['GPA'] == 'Unknown') | (df['GPA'] == 'Personal ')]

Unnamed: 0,GPA,Gender,breakfast,calories_chicken,calories_day,calories_scone,coffee,comfort_food,cook,comfort_food_reasons_coded,...,soup,sports,thai_food,tortilla_calories,turkey_calories,type_sports,veggies_day,vitamins,waffle_calories,weight


In [18]:
df['GPA'].value_counts()

3.5      13
3        11
3.7      10
3.2      10
3.3       9
3.4       9
3.6       7
3.9       7
3.8       6
2.8       5
4         4
3.1       3
2.9       2
2.6       2
3.83      2
3.79      1
2.71      1
3.73      1
2.4       1
3.92      1
3.68      1
3.75      1
3.77      1
3.63      1
3.67      1
3.65      1
3.35      1
3.292     1
3.605     1
3.89      1
3.654     1
3.87      1
2.2       1
3.904     1
2.25      1
3.882     1
Name: GPA, dtype: int64

Now, we can impute.

In [20]:
df['GPA'] = pd.to_numeric(df['GPA'])
df['GPA'] = df['GPA'].fillna(value = df['GPA'].mean())
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 54 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   GPA                         125 non-null    float64
 1   Gender                      125 non-null    int64  
 2   breakfast                   125 non-null    int64  
 3   calories_chicken            125 non-null    int64  
 4   calories_day                106 non-null    float64
 5   calories_scone              124 non-null    float64
 6   coffee                      125 non-null    int64  
 7   comfort_food                124 non-null    object 
 8   cook                        122 non-null    float64
 9   comfort_food_reasons_coded  125 non-null    int64  
 10  cuisine                     108 non-null    float64
 11  diet_current_coded          125 non-null    int64  
 12  drink                       123 non-null    float64
 13  eating_changes_coded        125 non

Next, we look at `calories_day`. Since it is categorical in spite of the data type being `float64`, we impute the `NaN` values with the mode.

In [29]:
df['calories_day'] = df['calories_day'].fillna(value = df['calories_day'].mode())
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 54 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   GPA                         125 non-null    float64
 1   Gender                      125 non-null    int64  
 2   breakfast                   125 non-null    int64  
 3   calories_chicken            125 non-null    int64  
 4   calories_day                107 non-null    float64
 5   calories_scone              124 non-null    float64
 6   coffee                      125 non-null    int64  
 7   comfort_food                124 non-null    object 
 8   cook                        122 non-null    float64
 9   comfort_food_reasons_coded  125 non-null    int64  
 10  cuisine                     108 non-null    float64
 11  diet_current_coded          125 non-null    int64  
 12  drink                       123 non-null    float64
 13  eating_changes_coded        125 non

In [30]:
len(df[df['calories_day'].isnull()])

18
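The 18 values still missing are consistent with how `fillna` treats a Series argument. `Series.mode()` returns a Series indexed from 0, and `fillna` aligns on the index, so only a `NaN` at a matching index label is filled. That would explain the non-null count rising by just one, from 106 to 107. The sketch below reproduces the effect on toy data and shows one possible fix, which is to pass the first modal value as a scalar.

```python
import numpy as np
import pandas as pd

# Toy column with NaN at index 0 and index 4; the mode is 3.0.
s = pd.Series([np.nan, 3.0, 4.0, 3.0, np.nan, 2.0])

# mode() yields a Series with index [0]; fillna aligns on index,
# so only the NaN at index 0 is filled.
partial = s.fillna(s.mode())

# Taking the first modal value gives a scalar that fills every NaN.
full = s.fillna(s.mode().iloc[0])

print(int(partial.isna().sum()), int(full.isna().sum()))
```

Applied to this dataset, the same idea would read `df['calories_day'].fillna(df['calories_day'].mode().iloc[0])`. If the column has several modes, `.iloc[0]` picks the smallest one.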