In [1]:
import pandas as pd
food_info = pd.read_csv("food_info.csv")
col_names = food_info.columns.tolist()
print(col_names)
food_info.head(3)

['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']


,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
1,1002,BUTTER WHIPPED WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,50.489,23.426,3.012,219.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3069.0,840.0,2.8,1.8,73.0,8.6,61.924,28.732,3.694,256.0


Transforming a Column

We can use the arithmetic operators to transform a numerical column. The values in the "Iron_(mg)" column, for example, are currently in milligrams. We can divide each value by 1000 to convert the values to grams. The following code will divide each value in the "Iron_(mg)" column by 1000, and return a new Series object with those values:


In [2]:
div_1000 = food_info["Iron_(mg)"] / 1000
add_100 = food_info["Iron_(mg)"] + 100
sub_100 = food_info["Iron_(mg)"] - 100
mult_2 = food_info["Iron_(mg)"] * 2
sodium_grams = food_info["Sodium_(mg)"] / 1000
sugar_milligrams = food_info["Sugar_Tot_(g)"] * 1000
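
As a small self-contained illustration of the same idea, the sketch below applies a scalar division to a toy Series. The values in `iron_mg` are made up for the example and do not come from food_info.csv.

```python
import pandas as pd

# Hypothetical iron values in milligrams (not taken from food_info.csv).
iron_mg = pd.Series([0.5, 2.0, 10.0], name="Iron_(mg)")

# The scalar is applied to every element and a new Series is returned.
iron_g = iron_mg / 1000

print(iron_g.tolist())   # converted values, in grams
print(iron_mg.tolist())  # the original Series is left unchanged
```

Because the operation returns a new Series, the original column only changes if the result is assigned back to it.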

Performing Math with Multiple Columns

In addition to transforming columns by numerical values, we can transform columns by other columns. When we use an arithmetic operator between two columns (Series objects), pandas will perform that computation in a pair-wise fashion, and return a new Series object. It applies the arithmetic operator to the first value in both columns, the second value in both columns, and so on.

In [3]:
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
print(water_energy[0:5])
grams_of_protein_per_gram_of_water = food_info["Protein_(g)"] / food_info["Water_(g)"]
milligrams_of_calcium_and_iron = food_info["Calcium_(mg)"] + food_info["Iron_(mg)"]

0    11378.79
1    11378.79
2      210.24
3    14970.73
4    15251.81
dtype: float64
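
The pairing of values can be seen more easily on a tiny DataFrame. The sketch below uses a hypothetical three-row frame named `toy` with made-up numbers; only the column names are borrowed from the data set.

```python
import pandas as pd

# Hypothetical values, not taken from food_info.csv.
toy = pd.DataFrame({
    "Water_(g)": [10.0, 20.0, 30.0],
    "Energ_Kcal": [100, 200, 300],
})

# Row 0 is multiplied with row 0, row 1 with row 1, and so on.
water_energy = toy["Water_(g)"] * toy["Energ_Kcal"]

print(water_energy.tolist())  # [1000.0, 4000.0, 9000.0]
```

Both columns come from the same DataFrame, so they share the same index and the values line up row by row.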


In [4]:
weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = food_info["Lipid_Tot_(g)"] * (-0.75)
initial_rating = weighted_protein + weighted_fat

Normalizing Columns in a Data Set

The columns in the data set use different units (kilocalories, milligrams, etc.). As a result, the range of values varies greatly between columns. For example, the "Vit_A_IU" column ranges from 0 to 100000, while the "Fiber_TD_(g)" column ranges from 0 to 79. In calculations that combine several columns, a column like "Vit_A_IU" can dominate the result simply because its values are on a larger scale.

While there are many ways to normalize data, one of the simplest is to divide every value in a column by that column's maximum value. Provided the values are non-negative, each column then ranges from 0 to 1. To find the maximum value of a column, we use the Series.max() method. The following code calculates the largest value in the "Energ_Kcal" column and assigns it to max_calories:

# The largest value in the "Energ_Kcal" column.
max_calories = food_info["Energ_Kcal"].max()

We can then use the division operator (/) to divide the values in the "Energ_Kcal" column by the maximum value, max_calories:

# Divide the values in "Energ_Kcal" by the largest value.
normalized_calories = food_info["Energ_Kcal"] / max_calories
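
To check the effect of dividing by the maximum, here is a minimal sketch on a toy Series. The calorie values are hypothetical and chosen so the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical calorie values, not taken from food_info.csv.
calories = pd.Series([0, 225, 450, 900], name="Energ_Kcal")

# Largest value in the Series.
max_calories = calories.max()

# Every value is divided by the maximum, so the largest becomes 1.0.
normalized = calories / max_calories

print(normalized.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The row that held the maximum always maps to exactly 1.0, and all other non-negative values fall between 0 and 1.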



In [7]:
print(food_info["Protein_(g)"][0:5])
max_protein = food_info["Protein_(g)"].max()
print('max_protein',max_protein)
normalized_protein=food_info["Protein_(g)"]/max_protein
print('normalized_protein',normalized_protein.head(5))
max_fat=food_info["Lipid_Tot_(g)"].max()
normalized_fat=food_info["Lipid_Tot_(g)"]/max_fat

0     0.85
1     0.85
2     0.28
3    21.40
4    23.24
Name: Protein_(g), dtype: float64
('max_protein', 88.319999999999993)
('normalized_protein', 0    0.009624
1    0.009624
2    0.003170
3    0.242301
4    0.263134
Name: Protein_(g), dtype: float64)


In [10]:
# Create new columns by assigning a Series to a new column name.
food_info["Normalized_Protein"] = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
food_info["Normalized_Fat"] = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
food_info["Norm_Nutr_Index"] = 2 * food_info["Normalized_Protein"] - 0.75 * food_info["Normalized_Fat"]
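
The same three assignments can be traced by hand on a small frame. In the sketch below, `toy` and the food names "FOOD A" to "FOOD C" are hypothetical, and the numbers are chosen so each step is easy to follow.

```python
import pandas as pd

# Hypothetical foods and values, not taken from food_info.csv.
toy = pd.DataFrame({
    "Shrt_Desc": ["FOOD A", "FOOD B", "FOOD C"],
    "Protein_(g)": [10.0, 20.0, 40.0],
    "Lipid_Tot_(g)": [50.0, 25.0, 0.0],
})

# Assigning to a column name that does not exist yet adds a new column.
toy["Normalized_Protein"] = toy["Protein_(g)"] / toy["Protein_(g)"].max()
toy["Normalized_Fat"] = toy["Lipid_Tot_(g)"] / toy["Lipid_Tot_(g)"].max()

# Reward protein (weight 2) and penalize fat (weight 0.75).
toy["Norm_Nutr_Index"] = 2 * toy["Normalized_Protein"] - 0.75 * toy["Normalized_Fat"]

print(toy["Norm_Nutr_Index"].tolist())  # [-0.25, 0.625, 2.0]
```

For "FOOD A", normalized protein is 0.25 and normalized fat is 1.0, so the index is 2 * 0.25 - 0.75 * 1.0 = -0.25.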

Sorting a DataFrame by a Column

The DataFrame currently appears in numerical order according to the NDB_No column. NDB_No is a unique USDA identifier that isn't really useful for our needs. To explore which foods rank the highest in the Norm_Nutr_Index column, we need to sort the DataFrame by that column. DataFrame objects have a sort_values() method that we can use to sort the entire DataFrame.

To sort the DataFrame on the Sodium_(mg) column, pass the column name to the DataFrame.sort_values() method, which returns the sorted DataFrame:

food_info.sort_values("Sodium_(mg)")

By default, pandas will sort the data by the column we specify in ascending order and return a new DataFrame, rather than modifying food_info itself. To customize the method's behavior, use the parameters listed in the documentation:

# Sorts the DataFrame in-place, rather than returning a new DataFrame.
food_info.sort_values("Sodium_(mg)", inplace=True)

# Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
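
To rank foods by Norm_Nutr_Index, the same method applies with ascending=False. The sketch below uses a hypothetical three-row frame named `toy` with made-up index values, and keeps the default of returning a new DataFrame.

```python
import pandas as pd

# Hypothetical foods and index values, not taken from food_info.csv.
toy = pd.DataFrame({
    "Shrt_Desc": ["FOOD A", "FOOD B", "FOOD C"],
    "Norm_Nutr_Index": [-0.25, 2.0, 0.625],
})

# Highest index first; without inplace=True a new DataFrame is returned.
ranked = toy.sort_values("Norm_Nutr_Index", ascending=False)

print(ranked["Shrt_Desc"].tolist())  # ['FOOD B', 'FOOD C', 'FOOD A']
print(toy["Shrt_Desc"].tolist())     # ['FOOD A', 'FOOD B', 'FOOD C']
print(ranked.index.tolist())         # [1, 2, 0]
```

The index labels travel with their rows, so the sorted frame keeps the original labels in a new order. Calling reset_index(drop=True) on the result renumbers the rows from 0 if that is preferred.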