For this project, I will work from the terminal in Visual Studio Code and keep a copy of the project on GitHub. The code that follows includes instructions and links to the GitHub repo.

**Load commands and packages to explore dataset**

In [19]:
#Commands to automatically reload imported modules before each cell runs
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [20]:
#Import library packages to use for the task
import pandas as pd
import numpy as np

**Load dataset into dataframe**

In [21]:
df = pd.read_csv('../data/raw/FastFoodNutritionMenuV3.csv')

**Display basic information about the dataset**

In [22]:
df.head()

|  | Company | Item | Calories | Calories from\nFat | Total Fat\n(g) | Saturated Fat\n(g) | Trans Fat\n(g) | Cholesterol\n(mg) | Sodium \n(mg) | Carbs\n(g) | Fiber\n(g) | Sugars\n(g) | Protein\n(g) | Weight Watchers\nPnts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | McDonald’s | Hamburger | 250 | 80 | 9 | 3.5 | 0.5 | 25 | 520 | 31 | 2 | 6 | 12 | 247.5 |
| 1 | McDonald’s | Cheeseburger | 300 | 110 | 12 | 6.0 | 0.5 | 40 | 750 | 33 | 2 | 6 | 15 | 297.0 |
| 2 | McDonald’s | Double Cheeseburger | 440 | 210 | 23 | 11.0 | 1.5 | 80 | 1150 | 34 | 2 | 7 | 25 | 433.0 |
| 3 | McDonald’s | McDouble | 390 | 170 | 19 | 8.0 | 1.0 | 65 | 920 | 33 | 2 | 7 | 22 | 383.0 |
| 4 | McDonald’s | Quarter Pounder® with Cheese | 510 | 230 | 26 | 12.0 | 1.5 | 90 | 1190 | 40 | 3 | 9 | 29 | 502.0 |


In [23]:
#Display the dimensions of the dataframe
df.shape

(1147, 14)

In [24]:
#Display the summary info of df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1147 entries, 0 to 1146
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Company               1147 non-null   object
 1   Item                  1147 non-null   object
 2   Calories              1147 non-null   object
 3   Calories from
Fat     642 non-null    object
 4   Total Fat
(g)         1091 non-null   object
 5   Saturated Fat
(g)     1091 non-null   object
 6   Trans Fat
(g)         1091 non-null   object
 7   Cholesterol
(mg)      1147 non-null   object
 8   Sodium 
(mg)          1147 non-null   object
 9   Carbs
(g)             1091 non-null   object
 10  Fiber
(g)             1091 non-null   object
 11  Sugars
(g)            1147 non-null   object
 12  Protein
(g)           1091 non-null   object
 13  Weight Watchers
Pnts  887 non-null    object
dtypes: object(14)
memory usage: 125.6+ KB
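
The column names in the summary above contain embedded line breaks (for example `Calories from\nFat`), which is why each one wraps onto a second line. One possible way to tidy them, shown here as a minimal sketch on a hypothetical `toy` frame rather than as a step taken in this notebook, is to collapse every run of whitespace into a single space:

```python
import pandas as pd

# Hypothetical empty frame that only borrows a few column names for illustration
toy = pd.DataFrame(columns=["Calories from\nFat", "Sodium \n(mg)", "Item"])

# Replace any run of whitespace (spaces, newlines) with one space, then trim the ends
toy.columns = toy.columns.str.replace(r"\s+", " ", regex=True).str.strip()

print(list(toy.columns))
```

Note that the later cells in this notebook refer to the original names with `\n`, so renaming would require updating those references as well.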


In [25]:
#Display descriptive statistics of df
df.describe(include='all')

|  | Company | Item | Calories | Calories from\nFat | Total Fat\n(g) | Saturated Fat\n(g) | Trans Fat\n(g) | Cholesterol\n(mg) | Sodium \n(mg) | Carbs\n(g) | Fiber\n(g) | Sugars\n(g) | Protein\n(g) | Weight Watchers\nPnts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1147 | 1147 | 1147 | 642 | 1091 | 1091 | 1091 | 1147 | 1147 | 1091 | 1091 | 1147 | 1091 | 887 |
| unique | 6 | 1071 | 105 | 64 | 73 | 35 | 11 | 65 | 214 | 131 | 17 | 122 | 56 | 524 |
| top | McDonald’s | 20 fl oz | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| freq | 328 | 11 | 83 | 175 | 357 | 383 | 954 | 378 | 54 | 75 | 551 | 190 | 314 | 67 |


In [27]:
#Count the number of duplicate rows in the dataframe
df[df.duplicated()].shape

(7, 14)
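
The count above relies on `DataFrame.duplicated()`, which by default flags every repeat of a row after its first occurrence. A minimal sketch on made-up data (the `toy` frame is hypothetical, not taken from the dataset) shows how the flagged rows relate to what `drop_duplicates()` would keep:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the behaviour
toy = pd.DataFrame({
    "Company": ["A", "A", "B", "A"],
    "Item": ["Fries", "Fries", "Cola", "Fries"],
})

# duplicated() marks repeats after the first occurrence (keep='first' is the default)
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps one copy of each distinct row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped.shape)
```

Summing the boolean mask gives the duplicate count directly, without building an intermediate filtered frame.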

In [28]:
#Check for null values in dataset and display
null_values = df.isnull().sum()
print("Null values in the dataset:")
print(null_values)

Null values in the dataset:
Company                    0
Item                       0
Calories                   0
Calories from\nFat       505
Total Fat\n(g)            56
Saturated Fat\n(g)        56
Trans Fat\n(g)            56
Cholesterol\n(mg)          0
Sodium \n(mg)              0
Carbs\n(g)                56
Fiber\n(g)                56
Sugars\n(g)                0
Protein\n(g)              56
Weight Watchers\nPnts    260
dtype: int64


**Clean and prepare the dataset**

In [29]:
#Create a copy of the dataframe to be cleaned and prepared
df_cleaned = df.copy()

In [31]:
#Replace null values with zero for the following columns
replace_with_zero = ['Calories from\nFat', 'Total Fat\n(g)', 'Saturated Fat\n(g)','Trans Fat\n(g)', 'Carbs\n(g)', 'Fiber\n(g)', 'Protein\n(g)', 'Weight Watchers\nPnts']
df_cleaned[replace_with_zero] = df_cleaned[replace_with_zero].fillna(0)
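
Because `df.info()` reports every column as `object`, the nutrition values are presumably stored as strings, and `fillna(0)` alone leaves them that way. The sketch below is one possible follow-up, not a step from this notebook: it assumes a hypothetical `toy` column with a non-numeric entry (`"n/a"` is invented for illustration) and converts it with `pd.to_numeric` before filling.

```python
import pandas as pd

# Hypothetical toy column mimicking an object-dtype nutrition field
toy = pd.DataFrame({"Total Fat\n(g)": ["9", "12", None, "n/a"]})

# Coerce to numbers first; values that cannot be parsed become NaN
as_number = pd.to_numeric(toy["Total Fat\n(g)"], errors="coerce")

# Then fill the remaining gaps with zero, mirroring the cell above
toy["Total Fat\n(g)"] = as_number.fillna(0)

print(toy["Total Fat\n(g)"].tolist())
print(toy["Total Fat\n(g)"].dtype)
```

With `errors="coerce"`, unparseable entries are silently turned into missing values, so it is worth counting them before filling if they might carry meaning.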

In [32]:
# Count the null values in each column
null_counts = df_cleaned.isnull().sum()

# Count the total number of cells (rows x columns) in the DataFrame
total_cells = df_cleaned.size

# Calculate the overall percentage of null cells
total_percent_null = (null_counts.sum() / total_cells) * 100

print(f"Total percentage of null data: {total_percent_null:.2f}%")

Total percentage of null data: 0.00%
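
A per-column breakdown is often more informative than a single total. As a minimal sketch on hypothetical data (the `toy` frame and its values are made up), `isnull().mean()` gives the share of missing values in each column, and the same mask averaged over every cell gives the overall share:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in one column
toy = pd.DataFrame({
    "Calories": [250, 300, 440, 390],
    "Fiber": [2.0, np.nan, np.nan, 3.0],
})

# The mean of a boolean mask is the fraction of True values, per column
pct_null_by_column = toy.isnull().mean() * 100

# Averaging the mask over all cells gives the overall percentage
pct_null_overall = toy.isnull().to_numpy().mean() * 100

print(pct_null_by_column)
print(f"Overall: {pct_null_overall:.2f}%")
```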


In [33]:
#Check for null values in new dataframe and display
null_values = df_cleaned.isnull().sum()
print("Null values in the dataset:")
print(null_values)

Null values in the dataset:
Company                  0
Item                     0
Calories                 0
Calories from\nFat       0
Total Fat\n(g)           0
Saturated Fat\n(g)       0
Trans Fat\n(g)           0
Cholesterol\n(mg)        0
Sodium \n(mg)            0
Carbs\n(g)               0
Fiber\n(g)               0
Sugars\n(g)              0
Protein\n(g)             0
Weight Watchers\nPnts    0
dtype: int64
