# Data Analytics Final Project

Data cleaner for the food_coded.csv dataset

### Defining the Assignment

We want to learn about the eating habits of college students: how health-conscious they are, how their eating habits have changed since starting college, their preferences related to ethnic food, and their preferences regarding available selections and pricing.

This is primarily an exploratory data analysis. We work with a collection of data, clean and format it, summarize statistics, visualize distributions, examine relationships, and create a predictive model. Along the way we address inconsistencies in the dataset, handle data that may skew or bias the results of our model, and look for meaningful insight into how the cafeteria can help improve the eating habits of students.

Before we begin our analysis, we have to make sure the dataset has been thoroughly cleaned and formatted so that Python data analysis and visualization libraries can be used on it directly to produce meaningful models and visualizations.

### Project Setup

In [33]:
# md repo
# <span style="color:green"></span>
# <span style="color:red"></span>

In [34]:
# import built in libraries for handling files and data
import io
import os
import random

# import library for data analysis
import pandas as pd

# define constants
DATA_FILE_PATH = os.path.join(os.getcwd(), "material/data", "food_coded.csv")
CLEAN_FILE_PATH = os.path.join(os.getcwd(), "material/data", "food_cleaned.csv")
FAST_COLUMN = [
    'cook', 'cuisine', 'diet_current', 'drink', 'eating_changes', 'employment', 
    'exercise', 'father_education', 'father_profession', 'fav_cuisine', 
    'fav_food', 'food_childhood', 'healthy_meal', 'ideal_diet', 'income', 
    'life_rewarding', 'marital_status', 'meals_dinner_friend', 'mother_education', 
    'mother_profession', 'on_off_campus', 'persian_food', 'self_perception_weight', 
    'soup', 'sports', 'tortilla_calories', 'type_sports', 'weight'
]

In [35]:
# import dataset from csv file
food = pd.read_csv(DATA_FILE_PATH)
food.head()

|   | GPA | Gender | breakfast | calories_chicken | calories_day | calories_scone | coffee | comfort_food | comfort_food_reasons | comfort_food_reasons_coded | ... | soup | sports | thai_food | tortilla_calories | turkey_calories | type_sports | veggies_day | vitamins | waffle_calories | weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.4 | 2 | 1 | 430 | NaN | 315.0 | 1 | none | we dont have comfort | 9.0 | ... | 1.0 | 1.0 | 1 | 1165.0 | 345 | car racing | 5 | 1 | 1315 | 187 |
| 1 | 3.654 | 1 | 1 | 610 | 3.0 | 420.0 | 2 | chocolate, chips, ice cream | Stress, bored, anger | 1.0 | ... | 1.0 | 1.0 | 2 | 725.0 | 690 | Basketball | 4 | 2 | 900 | 155 |
| 2 | 3.3 | 1 | 1 | 720 | 4.0 | 420.0 | 2 | frozen yogurt, pizza, fast food | stress, sadness | 1.0 | ... | 1.0 | 2.0 | 5 | 1165.0 | 500 | none | 5 | 1 | 900 | I'm not answering this. |
| 3 | 3.2 | 1 | 1 | 430 | 3.0 | 420.0 | 2 | Pizza, Mac and cheese, ice cream | Boredom | 2.0 | ... | 1.0 | 2.0 | 5 | 725.0 | 690 | NaN | 3 | 1 | 1315 | Not sure, 240 |
| 4 | 3.5 | 1 | 1 | 720 | 2.0 | 420.0 | 2 | Ice cream, chocolate, chips | Stress, boredom, cravings | 1.0 | ... | 1.0 | 1.0 | 4 | 940.0 | 500 | Softball | 4 | 2 | 760 | 190 |


In [36]:
# explore the dataset
descript = food.describe()
shape = food.shape
columns = food.columns
nulls = food.isnull().sum()

# had to look up how to handle the response from .info
buffer = io.StringIO()
food.info(buf=buffer)
info = buffer.getvalue()

In [37]:
def produceReport(shape, columns, info, descript, nulls, name):

    # use name parameter to add a suffix to the file name
    suffix = f"_{name}" if name in ['raw', 'clean'] else ""
    
    # format string
    report = f"""
=======================================
FOOD DATA SUMMARY REPORT
=======================================

* OVERVIEW
- **Shape (Rows, Columns):** {shape}
- **Columns:** {', '.join(columns)}

---------------------------------------

* COLUMNS
- Data Types and Non-Null Counts:
{info}
---------------------------------------

* DESCRIPTIVE STATISTICS
{descript}

---------------------------------------

* MISSING VALUES
- Number of Missing Values Per Column:
{nulls.to_string()}

=======================================
"""

    
    file_name = f"material/data_report{suffix}.txt"

    # save file
    with open(file_name, "w") as file:
        file.write(report)
        return file_name

In [38]:
# generate a report from the `explore the dataset` cell
name = 'raw'
file_name = produceReport(shape, columns, info, descript, nulls, name)

### Cleaning the Dataset

In [39]:
# view the columns
print(food.columns)

Index(['GPA', 'Gender', 'breakfast', 'calories_chicken', 'calories_day',
       'calories_scone', 'coffee', 'comfort_food', 'comfort_food_reasons',
       'comfort_food_reasons_coded', 'cook', 'comfort_food_reasons_coded.1',
       'cuisine', 'diet_current', 'diet_current_coded', 'drink',
       'eating_changes', 'eating_changes_coded', 'eating_changes_coded1',
       'eating_out', 'employment', 'ethnic_food', 'exercise',
       'father_education', 'father_profession', 'fav_cuisine',
       'fav_cuisine_coded', 'fav_food', 'food_childhood', 'fries', 'fruit_day',
       'grade_level', 'greek_food', 'healthy_feeling', 'healthy_meal',
       'ideal_diet', 'ideal_diet_coded', 'income', 'indian_food',
       'italian_food', 'life_rewarding', 'marital_status',
       'meals_dinner_friend', 'mother_education', 'mother_profession',
       'nutritional_check', 'on_off_campus', 'parents_cook', 'pay_meal_out',
       'persian_food', 'self_perception_weight', 'soup', 'sports', 'thai_food',
       

In [40]:
# fix gender and gpa capitalization
food.rename(columns={"Gender": "gender"}, inplace=True)
food.rename(columns={"GPA": "gpa"}, inplace=True)

<span style="color:green">With GPA and Gender lowercased, I am happy with the way the columns are named. No further changes needed.</span>

In [41]:
# summary statistics
print(food.describe())

           gender   breakfast  calories_chicken  calories_day  calories_scone  \
count  125.000000  125.000000        125.000000    106.000000      124.000000   
mean     1.392000    1.112000        577.320000      3.028302      505.241935   
std      0.490161    0.316636        131.214156      0.639308      230.840506   
min      1.000000    1.000000        265.000000      2.000000      315.000000   
25%      1.000000    1.000000        430.000000      3.000000      420.000000   
50%      1.000000    1.000000        610.000000      3.000000      420.000000   
75%      2.000000    1.000000        720.000000      3.000000      420.000000   
max      2.000000    2.000000        720.000000      4.000000      980.000000   

          coffee  comfort_food_reasons_coded        cook  \
count  125.00000                  106.000000  122.000000   
mean     1.75200                    2.698113    2.786885   
std      0.43359                    1.972042    1.038351   
min      1.00000              

In [42]:
# show missing data; rerun this 'to do list' after each cell below
missing_data = food.isnull().sum()
show_missing = missing_data[missing_data > 0]
print(show_missing.to_string())

gpa                            2
calories_day                  19
calories_scone                 1
comfort_food                   1
comfort_food_reasons           2
comfort_food_reasons_coded    19
cook                           3
cuisine                       17
diet_current                   1
drink                          2
eating_changes                 3
employment                     9
exercise                      13
father_education               1
father_profession              3
fav_cuisine                    2
fav_food                       2
food_childhood                 1
healthy_meal                   1
ideal_diet                     1
income                         1
life_rewarding                 1
marital_status                 1
meals_dinner_friend            3
mother_education               3
mother_profession              2
on_off_campus                  1
persian_food                   1
self_perception_weight         1
soup                           1
sports    

In [43]:
# clean gpa and handle null values
food['gpa'] = pd.to_numeric(food['gpa'], errors='coerce')
print(food['gpa'].dtype)
print(f"Mean of 'gpa': {food['gpa'].mean()}")
print(f"Median of 'gpa': {food['gpa'].median()}")
print(f"Mode of 'gpa': {food['gpa'].mode()[0]}")

# I will use the median/mode value of 3.5 for missing gpa values.
food['gpa'] = food['gpa'].fillna(food['gpa'].median())

# Last value printed should be '0'
print(food['gpa'].isnull().sum())


float64
Mean of 'gpa': 3.4155583333333337
Median of 'gpa': 3.5
Mode of 'gpa': 3.5
0


In [44]:
# clean calories_day and handle null values
food['calories_day'] = pd.to_numeric(food['calories_day'], errors='coerce')
print(food['calories_day'].dtype)
print(f"Mean of 'calories_day': {food['calories_day'].mean()}")
print(f"Median of 'calories_day': {food['calories_day'].median()}")
print(f"Mode of 'calories_day': {food['calories_day'].mode()[0]}")

# I will use the median/mode value for missing calories_day values.
food['calories_day'] = food['calories_day'].fillna(food['calories_day'].median())

# Last value printed should be '0'
print(food['calories_day'].isnull().sum())

float64
Mean of 'calories_day': 3.0283018867924527
Median of 'calories_day': 3.0
Mode of 'calories_day': 3.0
0


In [45]:
# clean calories_scone and handle null values
food['calories_scone'] = pd.to_numeric(food['calories_scone'], errors='coerce')
print(food['calories_scone'].dtype)
print(f"Mean of 'calories_scone': {food['calories_scone'].mean()}")
print(f"Median of 'calories_scone': {food['calories_scone'].median()}")
print(f"Mode of 'calories_scone': {food['calories_scone'].mode()[0]}")

# I will use the median/mode value for missing calories_scone values.
food['calories_scone'] = food['calories_scone'].fillna(food['calories_scone'].median())

# Last value printed should be '0'
print(food['calories_scone'].isnull().sum())


float64
Mean of 'calories_scone': 505.241935483871
Median of 'calories_scone': 420.0
Mode of 'calories_scone': 420.0
0


In [46]:
# clean weight and handle null values
food['weight'] = pd.to_numeric(food['weight'], errors='coerce')
print(food['weight'].dtype)
print(f"Mean of 'weight': {food['weight'].mean()}")
print(f"Median of 'weight': {food['weight'].median()}")
print(f"Mode of 'weight': {food['weight'].mode()[0]}")

# I will use the median value for missing weight values.
food['weight'] = food['weight'].fillna(food['weight'].median())

# Last value printed should be '0'
print(food['weight'].isnull().sum())

float64
Mean of 'weight': 158.5
Median of 'weight': 155.0
Mode of 'weight': 135.0
0
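
The four numeric cells above share one pattern: coerce the column to numeric, inspect its center, and fill the gaps with the median. A minimal sketch of that pattern as a reusable helper, run on a toy frame. The name `impute_median` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd


def impute_median(df, columns):
    """Coerce each column to numeric and fill missing values with its median.

    Hypothetical helper; returns a new frame and leaves the input unchanged.
    """
    out = df.copy()
    for column in columns:
        # Non-numeric answers (free text, blanks) become NaN
        out[column] = pd.to_numeric(out[column], errors="coerce")
        out[column] = out[column].fillna(out[column].median())
    return out


# Toy data standing in for the survey columns
toy = pd.DataFrame({
    "gpa": ["3.5", "3.0", None, "4.0"],
    "weight": ["150", "I'm not answering this.", "190", "170"],
})
cleaned = impute_median(toy, ["gpa", "weight"])
print(cleaned)
```

Coercion also discards partially numeric answers such as "Not sure, 240", which then receive the median like any other missing value.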


<hr style="color:red">

CODE: RETROFIX CELL

In [47]:
check_comfort = ['comfort_food', 'comfort_food_reasons', 'comfort_food_reasons_coded']

for column in check_comfort:
    missing_count = food[column].isnull().sum()
    unique_values = food[column].nunique()
    print(f"Column: {column}")
    print(f"  Missing values: {missing_count}")
    print(f"  Unique values: {unique_values}")
    print(f"  Unique values in '{column}': {food[column].unique()}\n")

Column: comfort_food
  Missing values: 1
  Unique values: 124
  Unique values in 'comfort_food': ['none' 'chocolate, chips, ice cream' 'frozen yogurt, pizza, fast food'
 'Pizza, Mac and cheese, ice cream' 'Ice cream, chocolate, chips '
 'Candy, brownies and soda.'
 'Chocolate, ice cream, french fries, pretzels'
 'Ice cream, cheeseburgers, chips.' 'Donuts, ice cream, chips'
 'Mac and cheese, chocolate, and pasta '
 'Pasta, grandma homemade chocolate cake anything homemade '
 'chocolate, pasta, soup, chips, popcorn' 'Cookies, popcorn, and chips'
 'ice cream, cake, chocolate'
 'Pizza, fruit, spaghetti, chicken and Potatoes  '
 'cookies, donuts, candy bars' 'Saltfish, Candy and Kit Kat '
 'chips, cookies, ice cream' 'Chocolate, ice crea '
 'pizza, wings, Chinese' 'Fast food, pizza, subs'
 'chocolate, sweets, ice cream' 'burgers, chips, cookies'
 'Chilli, soup, pot pie' 'Soup, pasta, brownies, cake'
 'chocolate, ice cream/milkshake, cookies'
 'Chips, ice cream, microwaveable foods ' 'Chicke

In [48]:
# replace null with 'missing' in comfort_food and comfort_food_reasons
food['comfort_food'] = food['comfort_food'].fillna('missing')
food['comfort_food_reasons'] = food['comfort_food_reasons'].fillna('missing')

# replace null with code=9 (None) in comfort_food_reasons_coded
food['comfort_food_reasons_coded'] = food['comfort_food_reasons_coded'].fillna(9)

# should all print '0'
print(food['comfort_food'].isnull().sum())
print(food['comfort_food_reasons'].isnull().sum())
print(food['comfort_food_reasons_coded'].isnull().sum())


0
0
0


<hr style="color:red">

In [49]:
### RETROFIXED
# clean comfort_food and handle missing values
#print(f"Mode of 'comfort_food': {food['comfort_food'].mode()[0]}")

# create a list to randomly select a mode
#modes = food['comfort_food'].mode().tolist()
#selected_mode = random.choice(modes)
#food['comfort_food'] = food['comfort_food'].fillna(selected_mode)

# Last value printed should be '0'
#print(food['comfort_food'].isnull().sum())

<span style="color:red">At this point I decided it would be smarter (due to time constraints) to establish ground rules for how I was cleaning, and to document the process in the report, because the laissez-faire approach of iterating mode imputation over the categorical missing data will introduce some bias.</span>
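
As an illustration of such a ground rule, here is a minimal sketch on toy data: fill a categorical column with its mode only when a small share of values is missing, and break ties between modes at random. The helper `fill_with_mode` and its parameters `max_missing_share` and `rng` are hypothetical names chosen for this sketch.

```python
import random

import pandas as pd


def fill_with_mode(series, max_missing_share=0.10, rng=None):
    """Fill missing values with a mode if the missing share is small enough.

    Hypothetical helper: `max_missing_share` and `rng` are illustrative
    parameters, not names from the notebook.
    """
    rng = rng or random.Random()
    if series.isnull().mean() > max_missing_share:
        return series  # too much missing data; leave for another strategy
    modes = series.mode().tolist()
    return series.fillna(rng.choice(modes))


# Toy column: 1 of 10 values missing, with 'a' and 'b' tied as modes
toy = pd.Series(["a", "a", "a", "a", "b", "b", "b", "b", "c", None])
filled = fill_with_mode(toy, rng=random.Random(0))
print(filled.iloc[-1])
```

Passing a seeded `random.Random` makes the random tie-break repeatable across runs, which helps when the cleaned file needs to be regenerated.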

In [50]:
def runNullFixer(food, FAST_COLUMN):

    for column in FAST_COLUMN:

        missing_count = food[column].isnull().sum()

        if missing_count <= 12:
            print(f"\nHandling missing data for '{column}' with {missing_count} missing values.")
            
            # Get modes of the column and select a random mode
            modes = food[column].mode().tolist()
            selected_mode = random.choice(modes)
            
            # Fill missing values with the selected mode
            food[column] = food[column].fillna(selected_mode)

            print(f"After runNullFixer missing values in '{column}': {food[column].isnull().sum()}")

In [51]:
# run the iterator function to replace nulls in categorical columns with the mode
# only replaces columns with count(missing_values) <= 12 (about 10% of the 125-row set)
# selects a mode at random if count(modes) > 1
runNullFixer(food, FAST_COLUMN)


Handling missing data for 'cook' with 3 missing values.
After runNullFixer missing values in 'cook': 0

Handling missing data for 'diet_current' with 1 missing values.
After runNullFixer missing values in 'diet_current': 0

Handling missing data for 'drink' with 2 missing values.
After runNullFixer missing values in 'drink': 0

Handling missing data for 'eating_changes' with 3 missing values.
After runNullFixer missing values in 'eating_changes': 0

Handling missing data for 'employment' with 9 missing values.
After runNullFixer missing values in 'employment': 0

Handling missing data for 'father_education' with 1 missing values.
After runNullFixer missing values in 'father_education': 0

Handling missing data for 'father_profession' with 3 missing values.
After runNullFixer missing values in 'father_profession': 0

Handling missing data for 'fav_cuisine' with 2 missing values.
After runNullFixer missing values in 'fav_cuisine': 0

Handling missing data for 'fav_food' with 2 missing v

In [52]:
# show missing data; rerun this 'to do list' after each cell below
missing_data = food.isnull().sum()
show_missing = missing_data[missing_data > 0]
print(show_missing.to_string())

cuisine        17
exercise       13
type_sports    26


In [53]:
# column check
check = ['comfort_food_reasons_coded', 'cuisine', 'exercise', 'type_sports']

for column in check:
    print(f"\nColumn: {column}")
    print(f"Data type: {food[column].dtype}")
    print(f"Unique values: {food[column].unique()}")


Column: comfort_food_reasons_coded
Data type: float64
Unique values: [9. 1. 2. 4. 3. 7. 6. 5. 8.]

Column: cuisine
Data type: float64
Unique values: [nan  1.  3.  2.  6.  4.  5.]

Column: exercise
Data type: float64
Unique values: [ 1.  2.  3. nan]

Column: type_sports
Data type: object
Unique values: ['car racing' 'Basketball ' 'none' nan 'Softball' 'None.' 'soccer'
 'field hockey' 'Running' 'Soccer and basketball ' 'intramural volleyball'
 'Hockey' 'hockey' 'dancing ' 'basketball' 'Soccer' 'Tennis'
 'tennis soccer gym' 'Gaelic Football' 'Ice hockey' 'Lacrosse '
 'snowboarding' 'none organized' 'softball' 'Lacrosse' 'Softball '
 'Dancing' 'wrestling ' 'no particular engagement ' 'Volleyball' 'soccer '
 'wrestling & rowing' 'Wrestling' 'Skiing' 'skiing '
 'Water polo and running ' 'Ice Hockey' 'rowing ' 'tennis  '
 'Recreational Basketball, Equestrian Team' 'Rec Volleyball' 'baseball'
 'I danced in high school' 'horse back riding' 'competitive skiing'
 'Rowing, Running, and Cycling' '

<span style="color:green">For cuisine, exercise and type_sports I am going to select a replacement value from the dataset at random. More than 10% of the data is missing in those columns, and filling every gap with the mode seems likely to introduce more bias than random sampling.</span> <br>
<span style="color:red">For comfort_food_reasons_coded I will have to put in the correct data, which I will do earlier in the process (see CODE: RETROFIX CELL).<br>CORRECTION: We are doing text analysis on those columns, so I will merely clean the data in RETROFIX.</span>
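
Random replacements can be drawn from the list of unique values or from the observed values themselves. The second option keeps the column's existing proportions, because common answers are drawn more often. A minimal sketch of that variant on toy data; `fill_from_observed` and the seed are hypothetical, and this is an alternative for comparison, not the notebook's method.

```python
import numpy as np
import pandas as pd


def fill_from_observed(series, seed=None):
    """Replace missing entries with random draws from the non-missing entries."""
    rng = np.random.default_rng(seed)
    observed = series.dropna().to_numpy()
    missing = series.isnull()
    filled = series.copy()
    # One independent draw per missing row, weighted by observed frequency
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


toy = pd.Series([1.0, 1.0, 1.0, 2.0, np.nan, np.nan])
result = fill_from_observed(toy, seed=42)
print(result.tolist())
```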

In [54]:
random_replace = ['type_sports', 'cuisine', 'exercise']

for column in random_replace:

    print(f"Unique values for '{column}': {food[column].dropna().unique()}")
    
    # create a list of unique values
    unique = food[column].dropna().unique().tolist()
    
    # iterate through each row and fill missing values with a random selection
    for index, row in food.iterrows():

        if pd.isnull(row[column]): # thanks murach's
            selected = random.choice(unique)
            food.at[index, column] = selected
            print(f"Row {index}, Column '{column}' filled with value: {selected}")
    
    # Print the last value (number of remaining missing values) should be '0'
    print(f"Missing values in '{column}' after random_replace: {food[column].isnull().sum()}")

Unique values for 'type_sports': ['car racing' 'Basketball ' 'none' 'Softball' 'None.' 'soccer'
 'field hockey' 'Running' 'Soccer and basketball ' 'intramural volleyball'
 'Hockey' 'hockey' 'dancing ' 'basketball' 'Soccer' 'Tennis'
 'tennis soccer gym' 'Gaelic Football' 'Ice hockey' 'Lacrosse '
 'snowboarding' 'none organized' 'softball' 'Lacrosse' 'Softball '
 'Dancing' 'wrestling ' 'no particular engagement ' 'Volleyball' 'soccer '
 'wrestling & rowing' 'Wrestling' 'Skiing' 'skiing '
 'Water polo and running ' 'Ice Hockey' 'rowing ' 'tennis  '
 'Recreational Basketball, Equestrian Team' 'Rec Volleyball' 'baseball'
 'I danced in high school' 'horse back riding' 'competitive skiing'
 'Rowing, Running, and Cycling' 'softball and basketball' 'wrestling'
 'Marching Band' 'Collegiate Water Polo' 'None right now'
 'volleyball, lacrosse' 'none ' 'Fotball' 'crew'
 'Football, Basketball, Volleyball, Golf' 'hockey, soccer, golf'
 'Running ' 'Volleyball, Track'
 'When I can, rarely though play p

<span style="color:red">Now I'm seeing all the goofy sports inputs...</span>

<span style="color:green">I used ChatGPT to generate the mapping object by pasting in the unique value list and explaining the rules I wanted. I then had to edit the code a little (a lot for Multi).</span><br>
QUERY: I am doing data analysis in python with jupyter notebooks. my dataset is related to cafeteria at a college. these are the 'type_sports' unique values in my data set. I need to split this down into categories, and use the 'None' category to catch all the inputs that mean that, and use a 'Multi' category to catch all the inputs with multiple sports listed. then i need to do this type of thing: intramural volleyball, Rec Volleyball, should both be Volleyball (same for other inputs that mean the same thing...) Unique values for 'type_sports': (copy and pasted from results above)

In [55]:
#CHAT GPT OBJECT/FUNCTION
single_sport_mapping = {
    'Volleyball': ['intramural volleyball', 'Rec Volleyball', 'volleyball'],
    'Basketball': ['Basketball ', 'basketball'],
    'Soccer': ['soccer', 'Soccer', 'soccer '],
    'Hockey': ['Hockey', 'hockey', 'Ice hockey', 'Ice Hockey'],
    'Softball': ['Softball', 'softball', 'Softball '],
    'Tennis': ['Tennis', 'tennis'],
    'Running': ['Running', 'Running '],
    'Dancing': ['dancing ', 'Dancing'],
    'Wrestling': ['wrestling ', 'Wrestling'],
    'Lacrosse': ['Lacrosse ', 'Lacrosse'],
    'Skiing': ['Skiing', 'competitive skiing', 'skiing '],
    'Football': ['Football', 'Fotball'],
    'Golf': ['Golf'],
    'Rowing': ['rowing ', 'Rowing'],
    'GaelicFootball': ['Gaelic Football'],
    'Baseball': ['baseball'],
    'Snowboarding': ['snowboarding'],
    'WaterPolo': ['Collegiate Water Polo'],
    'HorseRiding': ['horse back riding'],
    'MarchingBand': ['Marching Band'],
    'NoSport': [
        'none', 'no particular engagement', 'none organized', 'none ', 'None', 'I danced in high school', 'Missing',
        'None right now', 'None at the moment', 'When I can, rarely though play pool, darts, and basketball.', 'None.'
    ],
    'Multi': [
        'Soccer and basketball', 'softball and basketball', 'soccer and basketball', 'Rowing, Running, and Cycling', 
        'Water polo and running', 'Recreational Basketball, Equestrian Team', 'wrestling & rowing', 'volleyball, lacrosse', 
        'Football, Basketball, Volleyball, Golf', 'hockey, soccer, golf', 'tennis soccer gym','rowing, running', 
    ]
}

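def categorize_sport(value):

    # handle missing or non-string inputs before calling string methods
    if pd.isna(value) or not isinstance(value, str):
        return 'NoSport'

    value = value.strip()

    for category, items in single_sport_mapping.items():
        if value in items:
            return category

    # anything not found in the mapping is treated as no sport
    return 'NoSport'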

food['type_sports'] = food['type_sports'].apply(categorize_sport)

# had a hard time getting rid of NaN values... then I vaguely remembered something like this in R, where changes to NaN don't carry over when you export
# wondered if it was the same here, so I just replaced missing values in the analysis.
food['type_sports'].fillna('NoSport', inplace=True)

print(f"Updated unique values for 'type_sports': {food['type_sports'].unique()}")

Updated unique values for 'type_sports': ['NoSport' 'Baseball' 'Softball' 'Soccer' 'Running' 'Multi' 'Volleyball'
 'Hockey' 'Basketball' 'Tennis' 'GaelicFootball' 'Lacrosse' 'MarchingBand'
 'Snowboarding' 'Dancing' 'Wrestling' 'Skiing' 'Football' 'HorseRiding'
 'WaterPolo']


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  food['type_sports'].fillna('NoSport', inplace=True)
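
Note that `categorize_sport` strips each value before the lookup, while some mapping entries (for example `'Basketball '`) keep their trailing space, so those inputs fall through to `NoSport`. A minimal sketch of a lookup that normalizes both sides the same way follows. The shortened mapping and the helper names are hypothetical stand-ins, not the notebook's full mapping.

```python
def normalize(text):
    """Lowercase and trim so that 'Basketball ' and 'basketball' compare equal."""
    return text.strip().lower()


# Hypothetical, shortened mapping for illustration only
sport_mapping = {
    "Basketball": ["Basketball ", "basketball"],
    "Hockey": ["Hockey", "Ice hockey"],
    "NoSport": ["none", "None."],
}

# Invert once into {normalized raw value: category}
lookup = {
    normalize(raw): category
    for category, raw_values in sport_mapping.items()
    for raw in raw_values
}


def categorize(value, default="NoSport"):
    """Return the category for a raw survey answer, or `default` if unknown."""
    if not isinstance(value, str):
        return default  # covers NaN and other non-string inputs
    return lookup.get(normalize(value), default)


print(categorize("Basketball "))
print(categorize("ICE HOCKEY"))
print(categorize(float("nan")))
```

Building the inverted dictionary once also replaces the nested loop with a single dictionary lookup per value.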


In [56]:
print(f"Remaining null values: {food['type_sports'].isna().sum()}")
food.head()

Remaining null values: 0


|   | gpa | gender | breakfast | calories_chicken | calories_day | calories_scone | coffee | comfort_food | comfort_food_reasons | comfort_food_reasons_coded | ... | soup | sports | thai_food | tortilla_calories | turkey_calories | type_sports | veggies_day | vitamins | waffle_calories | weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.4 | 2 | 1 | 430 | 3.0 | 315.0 | 1 | none | we dont have comfort | 9.0 | ... | 1.0 | 1.0 | 1 | 1165.0 | 345 | NoSport | 5 | 1 | 1315 | 187.0 |
| 1 | 3.654 | 1 | 1 | 610 | 3.0 | 420.0 | 2 | chocolate, chips, ice cream | Stress, bored, anger | 1.0 | ... | 1.0 | 1.0 | 2 | 725.0 | 690 | NoSport | 4 | 2 | 900 | 155.0 |
| 2 | 3.3 | 1 | 1 | 720 | 4.0 | 420.0 | 2 | frozen yogurt, pizza, fast food | stress, sadness | 1.0 | ... | 1.0 | 2.0 | 5 | 1165.0 | 500 | NoSport | 5 | 1 | 900 | 155.0 |
| 3 | 3.2 | 1 | 1 | 430 | 3.0 | 420.0 | 2 | Pizza, Mac and cheese, ice cream | Boredom | 2.0 | ... | 1.0 | 2.0 | 5 | 725.0 | 690 | Baseball | 3 | 1 | 1315 | 155.0 |
| 4 | 3.5 | 1 | 1 | 720 | 2.0 | 420.0 | 2 | Ice cream, chocolate, chips | Stress, boredom, cravings | 1.0 | ... | 1.0 | 1.0 | 4 | 940.0 | 500 | Softball | 4 | 2 | 760 | 190.0 |


<span style="color:green">I am happy with the type_sports values.</span>

In [57]:
# Check for missing values in the 'cuisine' column
missing_values = food['cuisine'].isnull().sum()
print(f"Missing values in 'cuisine' column: {missing_values}")

# Check unique values in the 'cuisine' column
unique_values = food['cuisine'].unique()
print(f"Unique values in 'cuisine' column: {unique_values}")

# Count the occurrences of each unique value in the 'cuisine' column
cuisine_counts = food['cuisine'].value_counts()
print(f"Value counts for 'cuisine' column:\n{cuisine_counts}")

# Convert 'cuisine' column to integer
food['cuisine'] = food['cuisine'].astype(int)
data_type = food['cuisine'].dtype
print(f"Data type of 'cuisine' column: {data_type}")



Missing values in 'cuisine' column: 0
Unique values in 'cuisine' column: [6. 1. 3. 2. 5. 4.]
Value counts for 'cuisine' column:
cuisine
1.0    87
2.0    17
3.0     8
6.0     5
4.0     5
5.0     3
Name: count, dtype: int64
Data type of 'cuisine' column: int64
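
The conversion above works because `cuisine` no longer has missing values; `astype(int)` raises when a column still contains NaN. If a coded column still had gaps, pandas' nullable `Int64` dtype is one option. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

codes = pd.Series([1.0, 3.0, np.nan, 2.0])

# Plain int cannot represent NaN, so this conversion raises
try:
    codes.astype(int)
    converted_plain = True
except ValueError:
    converted_plain = False

# The nullable integer dtype keeps the gap as <NA>
nullable = codes.astype("Int64")
print(converted_plain)
print(nullable.dtype)
```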


### Export Cleaned Report and Dataset

In [58]:
# check total nulls in food dataframe # had to research how to achieve the results of .sum().sum()
total_null = food.isnull().sum().sum() 
print(f"Total null values in the 'food' dataframe: {total_null}")

Total null values in the 'food' dataframe: 0
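
For reference, `isnull()` returns a frame of booleans; the first `.sum()` counts them per column and the second adds those per-column counts into one total. A toy sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, None, 3], "b": [None, None, "x"]})

per_column = toy.isnull().sum()      # Series: missing count for each column
total = per_column.sum()             # scalar: missing count for the whole frame

print(per_column.to_dict())
print(total)
```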


In [59]:
# explore the dataset
descript = food.describe()
shape = food.shape
columns = food.columns
nulls = food.isnull().sum()

# had to look up how to handle the response from .info
buffer = io.StringIO()
food.info(buf=buffer)
info = buffer.getvalue()

# generate a report from the `explore the dataset` cell
name = 'clean'
file_name = produceReport(shape, columns, info, descript, nulls, name)

In [60]:
food.to_csv(CLEAN_FILE_PATH, index=False)