# Pandas

Pandas is a library that unifies the most common workflows for which data analysts and data scientists previously relied on many different libraries. It has quickly become an important tool in a data professional's toolbelt and is the most popular library for working with tabular data in Python. Tabular data is any data that can be represented as rows and columns. The CSV files we've worked with in previous missions are all examples of tabular data.

To represent tabular data, pandas uses a custom data structure called a **dataframe**. A dataframe is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The dataframe is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages that pandas has over NumPy is the ability to **store mixed data types in rows and columns**. Many tabular datasets contain a range of data types; a pandas dataframe lets each column keep its own type, whereas a NumPy array has a single data type for all of its elements. Pandas dataframes can also **handle missing values gracefully, using the special value NaN to represent them**. A common complaint with NumPy is its lack of a general-purpose marker for missing values that works across data types, so people end up having to find and replace these values manually. In addition, **pandas dataframes contain axis labels** for both rows and columns, which lets you refer to elements in the dataframe more intuitively. Since many tabular datasets contain column titles, this means that **dataframes preserve the metadata from the file** around the data.
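
To make these three features concrete, here is a minimal sketch on a hand-built dataframe. The column names, row labels, and values are invented for illustration and do not come from `food_info.csv`.

```python
import numpy as np
import pandas

# Hypothetical toy data: a text column, an integer column,
# and a float column with one missing value.
toy = pandas.DataFrame(
    {
        "name": ["butter", "cheese", "syrup"],
        "kcal": [717, 353, 269],
        "sugar_g": [0.06, np.nan, 73.2],
    },
    index=["a", "b", "c"],  # custom row labels
)

print(toy.dtypes)                      # each column keeps its own dtype
print(toy["sugar_g"].isnull().sum())   # count of missing values in one column
print(toy.loc["b", "name"])            # look up a value by row and column label
```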

After importing pandas, we can refer to the module by the name `pandas` and use **dot notation to call its functions**. To read a CSV file into a dataframe, we use the `pandas.read_csv()` function and pass in the file name as a string:

In [4]:
import pandas

food_info = pandas.read_csv('food_info.csv')
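
If `food_info.csv` is not at hand, the same call can be tried on in-memory text. The sketch below relies on `read_csv()` accepting a file-like object as well as a path; the names `csv_text` and `sample` are ours, and the two data rows are copied from the output shown in the next section.

```python
import io
import pandas

# CSV text standing in for a file on disk (hypothetical three-column excerpt).
csv_text = (
    "NDB_No,Shrt_Desc,Water_(g)\n"
    "1001,BUTTER WITH SALT,15.87\n"
    "1002,BUTTER WHIPPED WITH SALT,15.87\n"
)

# The first line of the text becomes the column labels.
sample = pandas.read_csv(io.StringIO(csv_text))
print(type(sample))
print(sample)
```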

## Basic functions
To select the **first 5 rows of a dataframe**, use the dataframe method `head()`. When you call the head() method, pandas will return a new dataframe containing just the first 5 rows:

In [6]:
first_5_rows = food_info.head()
print(first_5_rows)

   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
0    1001          BUTTER WITH SALT      15.87         717         0.85   
1    1002  BUTTER WHIPPED WITH SALT      15.87         717         0.85   
2    1003      BUTTER OIL ANHYDROUS       0.24         876         0.28   
3    1004               CHEESE BLUE      42.41         353        21.40   
4    1005              CHEESE BRICK      41.11         371        23.24   

   Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
0          81.11     2.11            0.06           0.0           0.06   
1          81.11     2.11            0.06           0.0           0.06   
2          99.48     0.00            0.00           0.0           0.00   
3          28.74     5.11            2.34           0.0           0.50   
4          29.68     3.18            2.79           0.0           0.51   

        ...        Vit_A_IU  Vit_A_RAE  Vit_E_(mg)  Vit_D_mcg  Vit_D_IU  \
0       ...          2499.0  

In [7]:
# pass an integer into the head() method to change the default of 5
print(food_info.head(10))

   NDB_No                 Shrt_Desc  Water_(g)  Energ_Kcal  Protein_(g)  \
0    1001          BUTTER WITH SALT      15.87         717         0.85   
1    1002  BUTTER WHIPPED WITH SALT      15.87         717         0.85   
2    1003      BUTTER OIL ANHYDROUS       0.24         876         0.28   
3    1004               CHEESE BLUE      42.41         353        21.40   
4    1005              CHEESE BRICK      41.11         371        23.24   
5    1006               CHEESE BRIE      48.42         334        20.75   
6    1007          CHEESE CAMEMBERT      51.80         300        19.80   
7    1008            CHEESE CARAWAY      39.28         376        25.18   
8    1009            CHEESE CHEDDAR      37.10         406        24.04   
9    1010           CHEESE CHESHIRE      37.65         387        23.37   

   Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  Sugar_Tot_(g)  \
0          81.11     2.11            0.06           0.0           0.06   
1          81.11     2.11 

In [8]:
# To access the full list of column names, use the columns attribute:

column_names = food_info.columns
print(column_names)

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')


Lastly, you can use the **shape attribute** to understand the dimensions of the dataframe. The shape attribute returns a **tuple of integers** representing the number of rows followed by the number of columns:

In [9]:
# Returns the tuple (8618,36) and assigns to `dimensions`.
dimensions = food_info.shape
print(dimensions)
# The number of rows, 8618.
num_rows = dimensions[0]
print(num_rows)
# The number of columns, 36.
num_cols = dimensions[1]
print(num_cols)

(8618, 36)
8618
36


In [16]:
food_info.shape

(8618, 36)

In [15]:
food_info.shape[0]

8618
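
The three tools in this section can be tried together on a small stand-in dataframe. `demo` below is a hypothetical 8-row frame, not the food dataset, and unpacking `shape` into two names is just one possible alternative to indexing the tuple.

```python
import pandas

# Hypothetical stand-in for food_info: 8 rows, 2 columns.
demo = pandas.DataFrame({"id": range(8), "value": [x * 1.5 for x in range(8)]})

print(demo.head().shape)      # head() returns the first 5 rows by default
print(demo.head(3).shape)     # an integer argument changes the row count
print(demo.columns.tolist())  # column labels as a plain Python list

# shape is a (rows, columns) tuple, so it can be unpacked directly.
num_rows, num_cols = demo.shape
print(num_rows, num_cols)
```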

## Navigating dataframes
When you read a file into a dataframe, pandas uses the values in the first row (also known as the **header**) for the column labels and the row number for the row labels. Collectively, the labels are referred to as the index.

### Series
The **Series object** is a core data structure that pandas uses to represent rows and columns. A Series is a **labelled collection of values** similar to the NumPy vector. The main advantage of Series objects is the ability to **utilize non-integer labels**, whereas the elements of a NumPy array are addressed by integer position.

Pandas utilizes this feature to **provide more context** when returning a row or a column from a dataframe. For example, when you select a row from a dataframe, instead of just returning the values in that row as a list, **pandas returns a Series object that contains the column labels** as well as the corresponding values.
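
As a rough sketch of what such a row looks like, we can build a Series by hand. The three values are copied from the first row shown earlier; constructing the Series manually is only for illustration, since pandas normally creates it for us when we select a row.

```python
import pandas

# A hand-built Series whose labels are column names rather than integers.
row = pandas.Series(
    [1001, "BUTTER WITH SALT", 15.87],
    index=["NDB_No", "Shrt_Desc", "Water_(g)"],
)

print(row["Shrt_Desc"])    # access a value by its label
print(row.index.tolist())  # the labels travel with the values
print(row.dtype)           # numbers and text together give the object dtype
```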

### loc[]
While we use bracket notation to access elements in a NumPy array or a standard list, we need to **use the pandas `loc[]` indexer to select rows in a dataframe**. `loc[]` is written with square brackets rather than called like a method, and it allows you to **select rows by row labels**. Recall that when you read a file into a dataframe, pandas uses the row number (or position) as each row's label. Pandas uses **zero-indexing**, so the first row is at index 0, the second row at index 1, and so on.


In [30]:
print(food_info.loc[0])

NDB_No                         1001
Shrt_Desc          BUTTER WITH SALT
Water_(g)                     15.87
Energ_Kcal                      717
Protein_(g)                    0.85
Lipid_Tot_(g)                 81.11
Ash_(g)                        2.11
Carbohydrt_(g)                 0.06
Fiber_TD_(g)                      0
Sugar_Tot_(g)                  0.06
Calcium_(mg)                     24
Iron_(mg)                      0.02
Magnesium_(mg)                    2
Phosphorus_(mg)                  24
Potassium_(mg)                   24
Sodium_(mg)                     643
Zinc_(mg)                      0.09
Copper_(mg)                       0
Manganese_(mg)                    0
Selenium_(mcg)                    1
Vit_C_(mg)                        0
Thiamin_(mg)                  0.005
Riboflavin_(mg)               0.034
Niacin_(mg)                   0.042
Vit_B6_(mg)                   0.003
Vit_B12_(mcg)                  0.17
Vit_A_IU                       2499
Vit_A_RAE                   

In [22]:
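# Row label 0, column label "Shrt_Desc". The older form food_info.loc[0][1]
# relies on positional access to a labelled Series, which recent pandas versions deprecate.
food_info.loc[0, "Shrt_Desc"]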

'BUTTER WITH SALT'
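
Because `loc[]` works with labels rather than positions, the two only coincide when the row labels happen to be 0, 1, 2, and so on. The sketch below uses a hypothetical dataframe with the labels 10, 20, and 30 to show the difference. `iloc[]`, the position-based counterpart in pandas, is not covered in the text above and appears here only for contrast.

```python
import pandas

# Hypothetical frame whose row labels are not 0, 1, 2.
df = pandas.DataFrame(
    {"Shrt_Desc": ["CHEESE BLUE", "CHEESE BRICK", "CHEESE BRIE"]},
    index=[10, 20, 30],
)

print(df.loc[10, "Shrt_Desc"])  # label 10 refers to the first row
print(df.iloc[0, 0])            # position 0 also refers to the first row

# No row carries the label 0, so a label lookup for 0 fails.
try:
    df.loc[0]
except KeyError:
    print("label 0 does not exist")
```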

### Datatypes
When you displayed individual rows, represented as Series objects, you may have noticed the text `dtype: object` after the last value. This refers to the data type, or dtype, of that Series. The object dtype can hold any Python object; pandas uses it for string values and for Series that mix several types, such as a row containing both text and numbers. Pandas borrows from the NumPy type system and contains the following dtypes:

* `object` - for representing string values.
* `int` - for representing integer values.
* `float` - for representing float values.
* `datetime` - for representing time values.
* `bool` - for representing Boolean values.

To access the types for each column, use the `DataFrame.dtypes` attribute to return a Series containing each column name and its corresponding type.

In [27]:
food_info.dtypes

NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64
Carbohydrt_(g)     float64
Fiber_TD_(g)       float64
Sugar_Tot_(g)      float64
Calcium_(mg)       float64
Iron_(mg)          float64
Magnesium_(mg)     float64
Phosphorus_(mg)    float64
Potassium_(mg)     float64
Sodium_(mg)        float64
Zinc_(mg)          float64
Copper_(mg)        float64
Manganese_(mg)     float64
Selenium_(mcg)     float64
Vit_C_(mg)         float64
Thiamin_(mg)       float64
Riboflavin_(mg)    float64
Niacin_(mg)        float64
Vit_B6_(mg)        float64
Vit_B12_(mcg)      float64
Vit_A_IU           float64
Vit_A_RAE          float64
Vit_E_(mg)         float64
Vit_D_mcg          float64
Vit_D_IU           float64
Vit_K_(mcg)        float64
FA_Sat_(g)         float64
FA_Mono_(g)        float64
FA_Poly_(g)        float64
Cholestrl_(mg)     float64
dtype: object

In [28]:
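# "Energ_Kcal" is the fourth column; looking it up by label avoids
# positional access to a labelled Series, which recent pandas versions deprecate.
food_info.dtypes["Energ_Kcal"]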

dtype('int64')
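
To see the listed dtypes side by side, here is a small sketch with one column per type. The frame `types_demo` and its column names are hypothetical. Depending on the pandas version, the text column may be reported as `object` or as a dedicated string dtype, so the code does not assume either.

```python
import pandas

# One hypothetical column for each dtype family in the list above.
types_demo = pandas.DataFrame(
    {
        "text": ["a", "b"],
        "whole": [1, 2],
        "decimal": [0.5, 1.5],
        "flag": [True, False],
        "when": pandas.to_datetime(["2020-01-01", "2020-01-02"]),
    }
)

print(types_demo.dtypes)           # a Series: column name -> dtype
print(types_demo.dtypes["whole"])  # look up one column's dtype by name
```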

### Slicing

If you're interested in accessing multiple rows of the dataframe, you can pass in either a **slice of row labels or a list of row labels** and pandas will return a dataframe. Note that unlike slicing lists in Python, a slice of a dataframe using .loc[] **will include both the start and the end row**:

In [29]:
#row 3, 4, 5, 6
food_info.loc[3:6]

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 1004 | CHEESE BLUE | 42.41 | 353 | 21.4 | 28.74 | 5.11 | 2.34 | 0.0 | 0.5 | ... | 721.0 | 198.0 | 0.25 | 0.5 | 21.0 | 2.4 | 18.669 | 7.778 | 0.8 | 75.0 |
| 4 | 1005 | CHEESE BRICK | 41.11 | 371 | 23.24 | 29.68 | 3.18 | 2.79 | 0.0 | 0.51 | ... | 1080.0 | 292.0 | 0.26 | 0.5 | 22.0 | 2.5 | 18.764 | 8.598 | 0.784 | 94.0 |
| 5 | 1006 | CHEESE BRIE | 48.42 | 334 | 20.75 | 27.68 | 2.7 | 0.45 | 0.0 | 0.45 | ... | 592.0 | 174.0 | 0.24 | 0.5 | 20.0 | 2.3 | 17.41 | 8.013 | 0.826 | 100.0 |
| 6 | 1007 | CHEESE CAMEMBERT | 51.8 | 300 | 19.8 | 24.26 | 3.68 | 0.46 | 0.0 | 0.46 | ... | 820.0 | 241.0 | 0.21 | 0.4 | 18.0 | 2.0 | 15.259 | 7.023 | 0.724 | 72.0 |


In [33]:
# Pass a list of row labels to select specific rows
food_info.loc[[3,6,9,38]]

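|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3 | 1004 | CHEESE BLUE | 42.41 | 353 | 21.4 | 28.74 | 5.11 | 2.34 | 0.0 | 0.5 | ... | 721.0 | 198.0 | 0.25 | 0.5 | 21.0 | 2.4 | 18.669 | 7.778 | 0.8 | 75.0 |
| 6 | 1007 | CHEESE CAMEMBERT | 51.8 | 300 | 19.8 | 24.26 | 3.68 | 0.46 | 0.0 | 0.46 | ... | 820.0 | 241.0 | 0.21 | 0.4 | 18.0 | 2.0 | 15.259 | 7.023 | 0.724 | 72.0 |
| 9 | 1010 | CHEESE CHESHIRE | 37.65 | 387 | 23.37 | 30.6 | 3.6 | 4.78 | 0.0 |  | ... | 985.0 | 233.0 |  |  |  |  | 19.475 | 8.671 | 0.87 | 103.0 |
| 38 | 1039 | CHEESE ROQUEFORT | 39.38 | 369 | 21.54 | 30.64 | 6.44 | 2.0 | 0.0 |  | ... | 1047.0 | 294.0 |  |  |  |  | 19.263 | 8.474 | 1.32 | 90.0 |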


In [43]:
num_rows = food_info.shape[0]
# Select the last five rows; the slice end is inclusive, hence num_rows-1
food_info.loc[num_rows-5:num_rows-1]

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_IU | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8613 | 83110 | MACKEREL SALTED | 43.0 | 305 | 18.5 | 25.1 | 13.4 | 0.0 | 0.0 | 0.0 | ... | 157.0 | 47.0 | 2.38 | 25.2 | 1006.0 | 7.8 | 7.148 | 8.32 | 6.21 | 95.0 |
| 8614 | 90240 | SCALLOP (BAY&SEA) CKD STMD | 70.25 | 111 | 20.54 | 0.84 | 2.97 | 5.41 | 0.0 | 0.0 | ... | 5.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.218 | 0.082 | 0.222 | 41.0 |
| 8615 | 90480 | SYRUP CANE | 26.0 | 269 | 0.0 | 0.0 | 0.86 | 73.14 | 0.0 | 73.2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8616 | 90560 | SNAIL RAW | 79.2 | 90 | 16.1 | 1.4 | 1.3 | 2.0 | 0.0 | 0.0 | ... | 100.0 | 30.0 | 5.0 | 0.0 | 0.0 | 0.1 | 0.361 | 0.259 | 0.252 | 50.0 |
| 8617 | 93600 | TURTLE GREEN RAW | 78.5 | 89 | 19.8 | 0.5 | 1.2 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 30.0 | 0.5 | 0.0 | 0.0 | 0.1 | 0.127 | 0.088 | 0.17 | 50.0 |
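
The difference between list slicing and `loc[]` slicing is easy to check on a tiny example. The data below is hypothetical. `tail()` is not discussed in the text; it is shown only as a possible shortcut for selecting the last rows without computing labels from `shape`.

```python
import pandas

values = ["a", "b", "c", "d", "e", "f", "g", "h"]
df = pandas.DataFrame({"letter": values})

print(values[3:6])                     # list slice: end excluded, 3 items
print(df.loc[3:6, "letter"].tolist())  # label slice: end included, 4 items
print(df.tail(2))                      # the last 2 rows
```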


### Accessing columns
When accessing a **column in a dataframe**, pandas returns a **Series object containing the row label and each row's value** for that column. To access a single column, use bracket notation and **pass in the column name as a string**:

In [48]:
ndb_no = food_info["NDB_No"]
print(ndb_no)

0        1001
1        1002
2        1003
3        1004
4        1005
5        1006
6        1007
7        1008
8        1009
9        1010
10       1011
11       1012
12       1013
13       1014
14       1015
15       1016
16       1017
17       1018
18       1019
19       1020
20       1021
21       1022
22       1023
23       1024
24       1025
25       1026
26       1027
27       1028
28       1029
29       1030
        ...  
8588    43544
8589    43546
8590    43550
8591    43566
8592    43570
8593    43572
8594    43585
8595    43589
8596    43595
8597    43597
8598    43598
8599    44005
8600    44018
8601    44048
8602    44055
8603    44061
8604    44074
8605    44110
8606    44158
8607    44203
8608    44258
8609    44259
8610    44260
8611    48052
8612    80200
8613    83110
8614    90240
8615    90480
8616    90560
8617    93600
Name: NDB_No, Length: 8618, dtype: int64


To select **multiple columns**, pass in a list of strings representing the column names, and pandas will return a dataframe containing only the values in those columns:

In [54]:
protein_carbs = food_info[["Protein_(g)", "Carbohydrt_(g)"]]
print(protein_carbs)

      Protein_(g)  Carbohydrt_(g)
0            0.85            0.06
1            0.85            0.06
2            0.28            0.00
3           21.40            2.34
4           23.24            2.79
5           20.75            0.45
6           19.80            0.46
7           25.18            3.06
8           24.04            1.33
9           23.37            4.78
10          23.76            2.57
11          11.12            3.38
12          10.69            4.61
13          10.34            6.66
14          10.45            4.76
15          12.39            2.72
16           5.93            4.07
17          24.99            1.43
18          14.21            4.09
19          25.60            1.55
20           9.65           42.65
21          24.94            2.22
22          29.81            0.36
23          20.05            0.49
24          24.48            0.68
25          22.17            2.19
26          21.60            2.47
27          24.26            2.77
28          24
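
The type of the result depends on what goes inside the brackets: a string gives a Series, while a list gives a dataframe, even when the list holds a single name. A short sketch on a hypothetical two-row frame, with values taken from the output above:

```python
import pandas

df = pandas.DataFrame(
    {"Protein_(g)": [0.85, 21.40], "Carbohydrt_(g)": [0.06, 2.34]}
)

one_col = df["Protein_(g)"]     # string: returns a Series
also_one = df[["Protein_(g)"]]  # one-item list: returns a one-column DataFrame

print(type(one_col).__name__)
print(type(also_one).__name__)
```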

#### Exercise
Select and display only the columns that use grams for measurement (that end with "(g)"). To accomplish this:
* Use the columns attribute to return the column names in `food_info` and convert to a list by calling the method `tolist()`
* Create a new list, `gram_columns`, containing only the column names that end in `"(g)"`. The string method **`endswith()`** returns True if the string object calling the method ends with the string passed into the parentheses.
* Pass `gram_columns` into bracket notation to select just those columns and assign the resulting dataframe to `gram_df`
* Then use the dataframe method `head()` to display the first 3 rows of gram_df.

In [56]:
gram_columns = []

for name in food_info.columns.tolist():
    if name.endswith("(g)"):
        gram_columns.append(name)
gram_df = food_info[gram_columns]
print(gram_df.head(3))

   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06   
1      15.87         0.85          81.11     2.11            0.06   
2       0.24         0.28          99.48     0.00            0.00   

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)  
0           0.0           0.06      51.368       21.021        3.043  
1           0.0           0.06      50.489       23.426        3.012  
2           0.0           0.00      61.924       28.732        3.694  
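
The loop in the solution can also be written as a list comprehension, which some readers may find more compact. This is an alternative sketch, not the exercise's reference answer, and it runs on a hypothetical one-row frame that borrows three column names from the dataset.

```python
import pandas

# Hypothetical frame with two gram columns and one non-gram column.
df = pandas.DataFrame(
    {"Water_(g)": [15.87], "Energ_Kcal": [717], "Protein_(g)": [0.85]}
)

# Iterating over df.columns yields the column names as strings.
gram_columns = [name for name in df.columns if name.endswith("(g)")]
gram_df = df[gram_columns]
print(gram_df.columns.tolist())
```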


# Data Manipulation With Pandas

We'll build a basic nutritional index for people who want to eat high-protein, low-fat foods. The "Lipid_Tot_(g)" column contains each food's total fat content, and the "Protein_(g)" column contains each food's total protein content (both in grams). Let's use the following formula to score each food in our data set:

```Score = 2 * Protein_(g) - 0.75 * Lipid_Tot_(g)```

## Arithmetic operators

We can use the arithmetic operators to transform a numerical column. The values in the "Iron_(mg)" column, for example, are currently in milligrams. We can divide each value by 1000 to convert the values to grams.

In [58]:
div_1000 = food_info["Iron_(mg)"] / 1000

add_100 = food_info["Iron_(mg)"] + 100

sodium_grams = food_info["Sodium_(mg)"] / 1000
sugar_milligrams = food_info["Sugar_Tot_(g)"] * 1000

In addition to transforming columns by numerical values, **we can transform columns by other columns**. When we use an arithmetic operator between two columns (Series objects), pandas will perform that computation in a pair-wise fashion, and return a new Series object. It applies the arithmetic operator to the first value in both columns, the second value in both columns, and so on.

In the following code, we multiply the "Water_(g)" column by the "Energ_Kcal" column, and assign the resulting Series to water_energy:

In [59]:
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
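
Here is a small sketch of the pair-wise behaviour, using three water and energy values from the rows shown earlier. Two columns of the same dataframe always share row labels, so "first with first" and "same label with same label" mean the same thing there. The last part reorders one Series (our own addition) to show that pandas matches values on their labels.

```python
import pandas

water = pandas.Series([15.87, 0.24, 42.41])
energy = pandas.Series([717, 876, 353])

# Each value is multiplied by the value with the same row label.
product = water * energy
print(product)

# Same labels in a different order: the result for label 0 is unchanged.
shuffled = energy.loc[[2, 0, 1]]
realigned = water * shuffled
print(realigned.loc[0] == product.loc[0])
```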

In [70]:
#Nutrition index: 
weighted_protein = food_info["Protein_(g)"] * 2
weighted_fat = food_info["Lipid_Tot_(g)"] * -0.75
initial_rating = weighted_protein + weighted_fat
print(initial_rating.head())

0   -59.1325
1   -59.1325
2   -74.0500
3    21.2450
4    24.2200
dtype: float64
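
As a quick check of the arithmetic, the same computation can be run on two rows copied from the dataset (butter with salt and blue cheese). The results should agree with the first and fourth values printed above, up to floating-point rounding. The variable names are ours.

```python
import pandas

protein = pandas.Series([0.85, 21.40])  # Protein_(g) for rows 0 and 3
fat = pandas.Series([81.11, 28.74])     # Lipid_Tot_(g) for rows 0 and 3

# Score = 2 * protein - 0.75 * fat, applied to every row at once.
score = 2 * protein - 0.75 * fat
print(score.round(4).tolist())
```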


## Normalising columns in a data set

The columns in the data set **use different units** (kilo-calories, milligrams, etc.). As a result, the range of values varies greatly between columns. For example, the "Vit_A_IU" column ranges from 0 to 100000, while the "Fiber_TD_(g)" column ranges from 0 to 79. For certain calculations, columns like "Vit_A_IU" can have a greater effect on the result, due to the scale of the values.

While there are many ways to normalize data, one of the **simplest ways is to divide all of the values in a column by that column's maximum value**. This way, all of the columns will **range from 0 to 1**. To calculate the maximum value of a column, we use the `Series.max()` method. In the following code, we use the `Series.max()` method to calculate the largest value in the "Energ_Kcal" column, and assign it to `max_calories`:

In [71]:
max_calories = food_info["Energ_Kcal"].max()
normalised_calories = food_info["Energ_Kcal"]/max_calories
print(normalised_calories.head())

0    0.794900
1    0.794900
2    0.971175
3    0.391353
4    0.411308
Name: Energ_Kcal, dtype: float64


In [74]:
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max() 
print(normalized_fat.head())

0    0.8111
1    0.8111
2    0.9948
3    0.2874
4    0.2968
Name: Lipid_Tot_(g), dtype: float64
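
Dividing by the maximum lands in the 0 to 1 range only when the column has no negative values, which holds for the nutrient amounts here. A minimal sketch on the first five "Energ_Kcal" values follows. Note that the maximum is taken over this small sample, so the numbers differ from the full-dataset output above.

```python
import pandas

# First five calorie values from the dataset; max() is computed on this sample only.
kcal = pandas.Series([717, 717, 876, 353, 371], name="Energ_Kcal")

normalised = kcal / kcal.max()
print(normalised.min(), normalised.max())
```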


## Creating new columns

To add a new column to a dataframe, we use bracket notation to specify the name we want for that column, then use the assignment operator (=) to specify the Series object containing the values we want to assign to that column:

In [75]:
food_info["Iron_(g)"] = food_info["Iron_(mg)"] / 1000
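
The sketch below repeats this assignment on a hypothetical one-column frame holding the iron values of the first three foods, to show that the new column is appended after the existing ones and that the dataframe itself is modified.

```python
import pandas

# Iron_(mg) values of the first three rows of the dataset.
df = pandas.DataFrame({"Iron_(mg)": [0.02, 0.16, 0.0]})

# Assigning to a name that does not exist yet creates the column.
df["Iron_(g)"] = df["Iron_(mg)"] / 1000
print(df.columns.tolist())
print(df)
```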

In [78]:
food_info.columns
food_info.head()

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_A_RAE | Vit_E_(mg) | Vit_D_mcg | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | Iron_(g) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1001 | BUTTER WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 684.0 | 2.32 | 1.5 | 60.0 | 7.0 | 51.368 | 21.021 | 3.043 | 215.0 | 2e-05 |
| 1 | 1002 | BUTTER WHIPPED WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 684.0 | 2.32 | 1.5 | 60.0 | 7.0 | 50.489 | 23.426 | 3.012 | 219.0 | 0.00016 |
| 2 | 1003 | BUTTER OIL ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 840.0 | 2.8 | 1.8 | 73.0 | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 | 0.0 |
| 3 | 1004 | CHEESE BLUE | 42.41 | 353 | 21.4 | 28.74 | 5.11 | 2.34 | 0.0 | 0.5 | ... | 198.0 | 0.25 | 0.5 | 21.0 | 2.4 | 18.669 | 7.778 | 0.8 | 75.0 | 0.00031 |
| 4 | 1005 | CHEESE BRICK | 41.11 | 371 | 23.24 | 29.68 | 3.18 | 2.79 | 0.0 | 0.51 | ... | 292.0 | 0.26 | 0.5 | 22.0 | 2.5 | 18.764 | 8.598 | 0.784 | 94.0 | 0.00043 |


In [81]:
# Add normalised columns

food_info["Normalized_Protein"] = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
food_info["Normalized_Fat"] = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()

# Normalised nutrition index
food_info["Norm_Nutr_Index"] = 2 * food_info["Normalized_Protein"] - 0.75 * food_info["Normalized_Fat"]

In [82]:
food_info.head()

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | Iron_(g) | Normalized_Protein | Normalized_Fat | Norm_Nutr_Index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1001 | BUTTER WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 60.0 | 7.0 | 51.368 | 21.021 | 3.043 | 215.0 | 2e-05 | 0.009624 | 0.8111 | -0.589077 |
| 1 | 1002 | BUTTER WHIPPED WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 60.0 | 7.0 | 50.489 | 23.426 | 3.012 | 219.0 | 0.00016 | 0.009624 | 0.8111 | -0.589077 |
| 2 | 1003 | BUTTER OIL ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 73.0 | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 | 0.0 | 0.00317 | 0.9948 | -0.739759 |
| 3 | 1004 | CHEESE BLUE | 42.41 | 353 | 21.4 | 28.74 | 5.11 | 2.34 | 0.0 | 0.5 | ... | 21.0 | 2.4 | 18.669 | 7.778 | 0.8 | 75.0 | 0.00031 | 0.242301 | 0.2874 | 0.269051 |
| 4 | 1005 | CHEESE BRICK | 41.11 | 371 | 23.24 | 29.68 | 3.18 | 2.79 | 0.0 | 0.51 | ... | 22.0 | 2.5 | 18.764 | 8.598 | 0.784 | 94.0 | 0.00043 | 0.263134 | 0.2968 | 0.303668 |


## Sorting dataset by column

To explore which foods rank the highest in the Norm_Nutr_Index column, we need to sort the DataFrame by that column. DataFrame objects have a `sort_values()` method that we can use to sort the entire DataFrame.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html

To sort the DataFrame on the `Water_(g)` column, pass in the column name to the `DataFrame.sort_values()` method, which returns a new, sorted DataFrame:


In [84]:
food_info.sort_values("Water_(g)")

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | Iron_(g) | Normalized_Protein | Normalized_Fat | Norm_Nutr_Index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 676 | 4544 | SHORTENING HOUSEHOLD LARD&VEG OIL | 0.00 | 900 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 21.5 | 40.300 | 44.400 | 10.900 | 56.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 664 | 4520 | FAT MUTTON TALLOW | 0.00 | 902 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 28.0 | 0.0 | 47.300 | 40.600 | 7.800 | 102.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 665 | 4528 | OIL WALNUT | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 15.0 | 9.100 | 22.800 | 63.300 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 666 | 4529 | OIL ALMOND | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 7.0 | 8.200 | 69.900 | 17.400 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 667 | 4530 | OIL APRICOT KERNEL | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... |  |  | 6.300 | 60.000 | 29.300 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 668 | 4531 | OIL SOYBN LECITHIN | 0.00 | 763 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 183.9 | 15.005 | 10.977 | 45.318 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 669 | 4532 | OIL HAZELNUT | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... |  |  | 7.400 | 78.000 | 10.200 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 670 | 4534 | OIL BABASSU | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... |  |  | 81.200 | 11.400 | 1.600 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 671 | 4536 | OIL SHEANUT | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... |  |  | 46.600 | 44.000 | 5.200 | 0.0 | 0.00000 | 0.000000 | 1.0000 | -0.750000 |
| 8599 | 44005 | OIL CORN PEANUT AND OLIVE | 0.00 | 884 | 0.00 | 100.00 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 0.0 | 21.0 | 14.367 | 48.033 | 33.033 | 0.0 | 0.00013 | 0.000000 | 1.0000 | -0.750000 |


By default, pandas sorts the data by the column we specify in ascending order and returns a new DataFrame, rather than modifying `food_info` itself. To customize the method's behavior, use the **parameters listed** in the documentation.

In [87]:
# Sorts the DataFrame in-place, rather than returning a new DataFrame.
food_info.sort_values("Water_(g)", inplace=True, ascending=False)

food_info.sort_values("Norm_Nutr_Index", ascending=False, inplace=True)

In [88]:
food_info.head()

|  | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_D_IU | Vit_K_(mcg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | Iron_(g) | Normalized_Protein | Normalized_Fat | Norm_Nutr_Index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4991 | 16423 | SOY PROT ISOLATE K TYPE CRUDE PROT BASIS | 4.98 | 321 | 88.32 | 0.53 | 3.58 | 2.59 | 2.0 | 0.0 | ... | 0.0 | 0.0 | 0.066 | 0.101 | 0.258 | 0.0 | 0.0145 | 1.0 | 0.0053 | 1.996025 |
| 6155 | 19177 | GELATINS DRY PDR UNSWTND | 13.0 | 335 | 85.6 | 0.1 | 1.3 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.07 | 0.06 | 0.01 | 0.0 | 0.00111 | 0.969203 | 0.001 | 1.937656 |
| 216 | 1258 | EGG WHITE DRIED STABILIZED GLUCOSE RED | 6.53 | 362 | 84.63 | 0.48 | 3.63 | 4.72 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.147 | 0.173 | 0.07 | 20.0 | 0.00023 | 0.95822 | 0.0048 | 1.91284 |
| 124 | 1136 | EGG WHITE DRIED PDR STABILIZED GLUCOSE RED | 8.54 | 376 | 82.4 | 0.04 | 4.55 | 4.47 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00024 | 0.932971 | 0.0004 | 1.865642 |
| 8152 | 35055 | SEAL BEARDED (OOGRUK) MEAT DRIED (ALASKA NATIVE) | 11.6 | 351 | 82.6 | 2.3 | 3.5 | 0.0 | 0.0 | 0.0 | ... |  |  | 0.6 | 1.33 | 0.37 |  | 0.0496 | 0.935236 | 0.023 | 1.853221 |
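
Notice that the row labels in the sorted output (4991, 6155, ...) are the original ones: sorting moves rows but does not renumber them. The sketch below shows this on a hypothetical three-row frame, using the default behaviour of returning a new DataFrame. `reset_index(drop=True)` is not covered in the text above; it is one way to renumber the rows if the old labels are no longer needed.

```python
import pandas

df = pandas.DataFrame({"food": ["a", "b", "c"], "score": [0.3, 1.9, -0.7]})

# Without inplace=True, sort_values() leaves df untouched and returns a sorted copy.
ranked = df.sort_values("score", ascending=False)

print(ranked["food"].tolist())   # highest score first
print(df["food"].tolist())       # original order is unchanged
print(ranked.index.tolist())     # original row labels travel with the rows

# Renumber the rows 0..n-1, discarding the old labels.
renumbered = ranked.reset_index(drop=True)
print(renumbered.index.tolist())
```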

