# Python: The pandas library

**Goal**: master the use of the Pandas library to manipulate numerical data!

## Introduction to the Pandas library

In this chapter, we will look at the ``Pandas`` library, one of the most widely used libraries among data scientists and data analysts. Pandas is built around an efficient data structure called the ``dataframe``: a two-dimensional, labeled table built on top of NumPy arrays. It offers a suite of methods to quickly explore, analyze and visualize data. The main advantage of Pandas over NumPy is that a dataframe can store a different type of data in each column. Pandas also uses the ``NaN`` value to mark missing values, which makes them easy to detect and replace. Finally, extracting a column by name is more intuitive with Pandas.
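
As a minimal sketch of these two points (mixed column types and missing values), the toy dataframe below is built by hand. Its column names and values are illustrative assumptions, not taken from the dataset used in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a text column, an integer column and a float column with a gap.
foods = pd.DataFrame({
    "name": ["butter", "cheese", "syrup"],
    "energy_kcal": [717, 353, 269],
    "sugar_g": [0.06, np.nan, 73.2],
})

# Each column keeps its own data type.
print(foods.dtypes)

# Count the missing values per column, then replace them in one column with 0.
missing_per_column = foods.isna().sum()
filled = foods.fillna({"sugar_g": 0.0})
print(missing_per_column)
print(filled)
```

Here ``fillna()`` returns a new dataframe and leaves ``foods`` unchanged.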

## Read a csv file with Pandas

In [1]:
import pandas as pd

In [2]:
food_infos = pd.read_csv("food_infos.csv")
food_infos

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
1,1002,BUTTER WHIPPED WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,50.489,23.426,3.012,219.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.00,0.00,0.0,0.00,...,3069.0,840.0,2.80,1.8,73.0,8.6,61.924,28.732,3.694,256.0
3,1004,CHEESE BLUE,42.41,353,21.40,28.74,5.11,2.34,0.0,0.50,...,721.0,198.0,0.25,0.5,21.0,2.4,18.669,7.778,0.800,75.0
4,1005,CHEESE BRICK,41.11,371,23.24,29.68,3.18,2.79,0.0,0.51,...,1080.0,292.0,0.26,0.5,22.0,2.5,18.764,8.598,0.784,94.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8613,83110,MACKEREL SALTED,43.00,305,18.50,25.10,13.40,0.00,0.0,0.00,...,157.0,47.0,2.38,25.2,1006.0,7.8,7.148,8.320,6.210,95.0
8614,90240,SCALLOP (BAY&SEA) CKD STMD,70.25,111,20.54,0.84,2.97,5.41,0.0,0.00,...,5.0,2.0,0.00,0.0,2.0,0.0,0.218,0.082,0.222,41.0
8615,90480,SYRUP CANE,26.00,269,0.00,0.00,0.86,73.14,0.0,73.20,...,0.0,0.0,0.00,0.0,0.0,0.0,0.000,0.000,0.000,0.0
8616,90560,SNAIL RAW,79.20,90,16.10,1.40,1.30,2.00,0.0,0.00,...,100.0,30.0,5.00,0.0,0.0,0.1,0.361,0.259,0.252,50.0
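
If you do not have ``food_infos.csv`` at hand, the same call can be tried on a small piece of text. The sketch below assumes a three-column CSV written inline and read through ``io.StringIO``. The ``usecols`` parameter, not used elsewhere in this chapter, restricts which columns are loaded.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """NDB_No,Shrt_Desc,Energ_Kcal
1001,BUTTER WITH SALT,717
1004,CHEESE BLUE,353
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# usecols limits the columns that are read.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["NDB_No", "Energ_Kcal"])
print(subset.columns.tolist())
```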


In [3]:
print(type(food_infos))

<class 'pandas.core.frame.DataFrame'>


## Dataframe exploration

To explore the data, we will start with the ``head()`` method. It gives an overview of the first rows of a dataframe (the first 5 by default).

In [4]:
food_infos.head()

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
1,1002,BUTTER WHIPPED WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,50.489,23.426,3.012,219.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3069.0,840.0,2.8,1.8,73.0,8.6,61.924,28.732,3.694,256.0
3,1004,CHEESE BLUE,42.41,353,21.4,28.74,5.11,2.34,0.0,0.5,...,721.0,198.0,0.25,0.5,21.0,2.4,18.669,7.778,0.8,75.0
4,1005,CHEESE BRICK,41.11,371,23.24,29.68,3.18,2.79,0.0,0.51,...,1080.0,292.0,0.26,0.5,22.0,2.5,18.764,8.598,0.784,94.0


You can pass an argument to ``head()`` to specify exactly how many rows to display. For example, the following code displays the first 10 rows.

In [5]:
food_infos.head(10)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
1,1002,BUTTER WHIPPED WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,50.489,23.426,3.012,219.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3069.0,840.0,2.8,1.8,73.0,8.6,61.924,28.732,3.694,256.0
3,1004,CHEESE BLUE,42.41,353,21.4,28.74,5.11,2.34,0.0,0.5,...,721.0,198.0,0.25,0.5,21.0,2.4,18.669,7.778,0.8,75.0
4,1005,CHEESE BRICK,41.11,371,23.24,29.68,3.18,2.79,0.0,0.51,...,1080.0,292.0,0.26,0.5,22.0,2.5,18.764,8.598,0.784,94.0
5,1006,CHEESE BRIE,48.42,334,20.75,27.68,2.7,0.45,0.0,0.45,...,592.0,174.0,0.24,0.5,20.0,2.3,17.41,8.013,0.826,100.0
6,1007,CHEESE CAMEMBERT,51.8,300,19.8,24.26,3.68,0.46,0.0,0.46,...,820.0,241.0,0.21,0.4,18.0,2.0,15.259,7.023,0.724,72.0
7,1008,CHEESE CARAWAY,39.28,376,25.18,29.2,3.28,3.06,0.0,,...,1054.0,271.0,,,,,18.584,8.275,0.83,93.0
8,1009,CHEESE CHEDDAR,37.1,406,24.04,33.82,3.71,1.33,0.0,0.28,...,994.0,263.0,0.78,0.6,24.0,2.9,19.368,8.428,1.433,102.0
9,1010,CHEESE CHESHIRE,37.65,387,23.37,30.6,3.6,4.78,0.0,,...,985.0,233.0,,,,,19.475,8.671,0.87,103.0
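
The row counts are easier to check on a tiny dataframe. This sketch uses a hypothetical 8-row frame and also shows ``tail()``, the counterpart of ``head()`` for the last rows.

```python
import pandas as pd

# Hypothetical frame with 8 rows, used only to count what comes back.
numbers = pd.DataFrame({"value": range(8)})

first_default = numbers.head()   # first 5 rows (default)
first_three = numbers.head(3)    # first 3 rows
last_two = numbers.tail(2)       # last 2 rows

print(len(first_default), len(first_three), len(last_two))
```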


To access the complete list of all columns, we use the ``columns`` attribute.

In [6]:
food_infos.columns

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')

Another important attribute is ``shape``, which returns a tuple of two elements containing the dimensions of the dataframe, i.e. its number of rows and its number of columns.

In [7]:
food_infos.shape

(8618, 36)

So our dataset contains 8618 rows and 36 columns.

To get the number of rows and the number of columns separately, index into this tuple.

In [8]:
# number of rows
food_infos.shape[0]

8618

In [9]:
# number of columns
food_infos.shape[1]

36
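
Since ``shape`` is an ordinary tuple, it can also be unpacked in one line. A small sketch on a hypothetical 3-by-2 dataframe:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
table = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple into two variables.
n_rows, n_cols = table.shape
print(n_rows, n_cols)

# len() on a dataframe gives the number of rows.
print(len(table))
```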

## Rows selection

Unlike NumPy, Pandas selects rows with the ``loc[]`` indexer, which takes the index label of the row: ``loc[row index]``. Applied to a dataframe with a single label, ``loc[]`` returns a ``Series`` object. ``Series`` are the data structures Pandas uses to represent the data of a single row or column.

In [10]:
# example: return the first row as a Series
food_infos.loc[0]

NDB_No                         1001
Shrt_Desc          BUTTER WITH SALT
Water_(g)                     15.87
Energ_Kcal                      717
Protein_(g)                    0.85
Lipid_Tot_(g)                 81.11
Ash_(g)                        2.11
Carbohydrt_(g)                 0.06
Fiber_TD_(g)                    0.0
Sugar_Tot_(g)                  0.06
Calcium_(mg)                   24.0
Iron_(mg)                      0.02
Magnesium_(mg)                  2.0
Phosphorus_(mg)                24.0
Potassium_(mg)                 24.0
Sodium_(mg)                   643.0
Zinc_(mg)                      0.09
Copper_(mg)                     0.0
Manganese_(mg)                  0.0
Selenium_(mcg)                  1.0
Vit_C_(mg)                      0.0
Thiamin_(mg)                  0.005
Riboflavin_(mg)               0.034
Niacin_(mg)                   0.042
Vit_B6_(mg)                   0.003
Vit_B12_(mcg)                  0.17
Vit_A_IU                     2499.0
Vit_A_RAE                   

In [11]:
type(food_infos.loc[0])

pandas.core.series.Series

Selecting several consecutive rows works in a similar way, with ``loc[index of the first element:index of the last element]``. Unlike a regular Python slice, the last index is included, and this time the result is a ``dataframe``.

In [12]:
food_infos.loc[0:5]

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
1,1002,BUTTER WHIPPED WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,50.489,23.426,3.012,219.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3069.0,840.0,2.8,1.8,73.0,8.6,61.924,28.732,3.694,256.0
3,1004,CHEESE BLUE,42.41,353,21.4,28.74,5.11,2.34,0.0,0.5,...,721.0,198.0,0.25,0.5,21.0,2.4,18.669,7.778,0.8,75.0
4,1005,CHEESE BRICK,41.11,371,23.24,29.68,3.18,2.79,0.0,0.51,...,1080.0,292.0,0.26,0.5,22.0,2.5,18.764,8.598,0.784,94.0
5,1006,CHEESE BRIE,48.42,334,20.75,27.68,2.7,0.45,0.0,0.45,...,592.0,174.0,0.24,0.5,20.0,2.3,17.41,8.013,0.826,100.0


In [13]:
type(food_infos.loc[0:5])

pandas.core.frame.DataFrame
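
Because ``loc[]`` works with index labels, a slice includes its end label. The sketch below makes this visible on a hypothetical dataframe whose labels do not start at 0. It contrasts ``loc[]`` with ``iloc[]``, the position-based indexer, which is not covered in this chapter.

```python
import pandas as pd

# Hypothetical frame whose index labels are 100..103 rather than 0..3.
scores = pd.DataFrame({"score": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

# Label-based slice: labels 100, 101 and 102 (end label included).
by_label = scores.loc[100:102]

# Position-based slice: positions 0 and 1 (end position excluded).
by_position = scores.iloc[0:2]

print(by_label)
print(by_position)
```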

It is also possible to select specific rows of a dataframe by passing a list, ``loc[[index of row_x, index of row_y, etc]]``, which again returns a ``dataframe``.

In [14]:
food_infos.loc[[0,2,4,6]]

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3069.0,840.0,2.8,1.8,73.0,8.6,61.924,28.732,3.694,256.0
4,1005,CHEESE BRICK,41.11,371,23.24,29.68,3.18,2.79,0.0,0.51,...,1080.0,292.0,0.26,0.5,22.0,2.5,18.764,8.598,0.784,94.0
6,1007,CHEESE CAMEMBERT,51.8,300,19.8,24.26,3.68,0.46,0.0,0.46,...,820.0,241.0,0.21,0.4,18.0,2.0,15.259,7.023,0.724,72.0


In [15]:
type(food_infos.loc[[0,2,4,6]])

pandas.core.frame.DataFrame

You can also select a single element of a row for a given column. To do this, we use ``loc[row index, column name]``.

In [16]:
food_infos.loc[4, "Shrt_Desc"]

'CHEESE BRICK'
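
The row and column arguments of ``loc[]`` can be combined in other ways too. The following sketch uses a three-row toy dataframe (re-indexed from 0, an assumption made for brevity) to show what each combination returns.

```python
import pandas as pd

# Toy frame reusing a few values from the dataset, re-indexed from 0.
cheeses = pd.DataFrame({
    "Shrt_Desc": ["CHEESE BLUE", "CHEESE BRICK", "CHEESE BRIE"],
    "Energ_Kcal": [353, 371, 334],
})

# One row label and one column name: a single value.
one_value = cheeses.loc[1, "Shrt_Desc"]

# One row label and a list of column names: a Series.
one_row = cheeses.loc[1, ["Shrt_Desc", "Energ_Kcal"]]

# A slice of row labels and a list of column names: a dataframe.
block = cheeses.loc[0:1, ["Energ_Kcal"]]

print(one_value)
print(type(one_row).__name__, type(block).__name__)
```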

## Pandas data types

When ``Pandas`` reads a file into a ``dataframe``, it analyzes the values and infers a ``data type`` for each column. The most common ``data types`` in a ``pandas dataframe`` are: object, int, float, datetime and bool. To see the data type associated with each column of a dataframe, we use the ``dtypes`` attribute.

In [17]:
food_infos.dtypes

NDB_No               int64
Shrt_Desc           object
Water_(g)          float64
Energ_Kcal           int64
Protein_(g)        float64
Lipid_Tot_(g)      float64
Ash_(g)            float64
Carbohydrt_(g)     float64
Fiber_TD_(g)       float64
Sugar_Tot_(g)      float64
Calcium_(mg)       float64
Iron_(mg)          float64
Magnesium_(mg)     float64
Phosphorus_(mg)    float64
Potassium_(mg)     float64
Sodium_(mg)        float64
Zinc_(mg)          float64
Copper_(mg)        float64
Manganese_(mg)     float64
Selenium_(mcg)     float64
Vit_C_(mg)         float64
Thiamin_(mg)       float64
Riboflavin_(mg)    float64
Niacin_(mg)        float64
Vit_B6_(mg)        float64
Vit_B12_(mcg)      float64
Vit_A_IU           float64
Vit_A_RAE          float64
Vit_E_(mg)         float64
Vit_D_mcg          float64
Vit_D_IU           float64
Vit_K_(mcg)        float64
FA_Sat_(g)         float64
FA_Mono_(g)        float64
FA_Poly_(g)        float64
Cholestrl_(mg)     float64
dtype: object
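
Once the types are known, they can be used to filter or convert columns. This is a sketch on a small hypothetical dataframe. ``select_dtypes()`` and ``astype()`` are pandas methods that this chapter does not otherwise use.

```python
import pandas as pd

# Toy frame with an integer, a text and a float column.
mixed = pd.DataFrame({
    "NDB_No": [1001, 1002],
    "Shrt_Desc": ["BUTTER WITH SALT", "BUTTER WHIPPED WITH SALT"],
    "Water_(g)": [15.87, 15.87],
})

print(mixed.dtypes)

# Keep only the numeric columns (integers and floats).
numeric_only = mixed.select_dtypes(include="number")

# Convert one column to another type; the original frame is left unchanged.
ndb_as_float = mixed["NDB_No"].astype("float64")

print(numeric_only.columns.tolist())
print(ndb_as_float.dtype)
```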

## Columns selection

In this section, we will see how to ``extract`` a column from a ``dataframe``. A column is selected by its name, written in square brackets after the dataframe.

In [18]:
food_infos.head(3)

Unnamed: 0,NDB_No,Shrt_Desc,Water_(g),Energ_Kcal,Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),...,Vit_A_IU,Vit_A_RAE,Vit_E_(mg),Vit_D_mcg,Vit_D_IU,Vit_K_(mcg),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g),Cholestrl_(mg)
0,1001,BUTTER WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,51.368,21.021,3.043,215.0
1,1002,BUTTER WHIPPED WITH SALT,15.87,717,0.85,81.11,2.11,0.06,0.0,0.06,...,2499.0,684.0,2.32,1.5,60.0,7.0,50.489,23.426,3.012,219.0
2,1003,BUTTER OIL ANHYDROUS,0.24,876,0.28,99.48,0.0,0.0,0.0,0.0,...,3069.0,840.0,2.8,1.8,73.0,8.6,61.924,28.732,3.694,256.0


In [19]:
# first method for the Shrt_Desc column
Shrt_Desc_Col = food_infos["Shrt_Desc"]
Shrt_Desc_Col

0                 BUTTER WITH SALT
1         BUTTER WHIPPED WITH SALT
2             BUTTER OIL ANHYDROUS
3                      CHEESE BLUE
4                     CHEESE BRICK
                   ...            
8613               MACKEREL SALTED
8614    SCALLOP (BAY&SEA) CKD STMD
8615                    SYRUP CANE
8616                     SNAIL RAW
8617              TURTLE GREEN RAW
Name: Shrt_Desc, Length: 8618, dtype: object

As with rows, extracting a single column returns a series.

In [20]:
# second method for the Shrt_Desc column
Shrt_Desc_Col = food_infos.Shrt_Desc
Shrt_Desc_Col

0                 BUTTER WITH SALT
1         BUTTER WHIPPED WITH SALT
2             BUTTER OIL ANHYDROUS
3                      CHEESE BLUE
4                     CHEESE BRICK
                   ...            
8613               MACKEREL SALTED
8614    SCALLOP (BAY&SEA) CKD STMD
8615                    SYRUP CANE
8616                     SNAIL RAW
8617              TURTLE GREEN RAW
Name: Shrt_Desc, Length: 8618, dtype: object
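
The two notations are not fully interchangeable: the dot notation only works when the column name is a valid Python identifier. A short sketch on a two-column toy dataframe illustrates the difference.

```python
import pandas as pd

# Toy frame with one identifier-like name and one name containing parentheses.
sample = pd.DataFrame({
    "Shrt_Desc": ["BUTTER WITH SALT", "CHEESE BLUE"],
    "Water_(g)": [15.87, 42.41],
})

# Both notations return the same Series for "Shrt_Desc".
same = sample["Shrt_Desc"].equals(sample.Shrt_Desc)
print(same)

# "Water_(g)" contains parentheses, so only the bracket notation can reach it.
water = sample["Water_(g)"]
print(water)
```

For this reason the bracket notation is the safer default for a dataset like this one, where most column names contain a unit in parentheses.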

Note that dataframes offer other, more specialized methods, for example to select a list of columns. We won't cover all of them here; you can find them in the pandas documentation on dataframes.

Several columns are extracted by passing a list of column names, much like the list of row indexes used with ``loc[]``. The result is a dataframe.

In [21]:
# extraction of the first two columns
two_first_column = food_infos[["NDB_No","Shrt_Desc"]]
two_first_column

Unnamed: 0,NDB_No,Shrt_Desc
0,1001,BUTTER WITH SALT
1,1002,BUTTER WHIPPED WITH SALT
2,1003,BUTTER OIL ANHYDROUS
3,1004,CHEESE BLUE
4,1005,CHEESE BRICK
...,...,...
8613,83110,MACKEREL SALTED
8614,90240,SCALLOP (BAY&SEA) CKD STMD
8615,90480,SYRUP CANE
8616,90560,SNAIL RAW


### Training

In this exercise, we will work through the following task:

* select and display only the columns that use grams as a unit of measurement (i.e. that end with "(g)"). To do this:
    * use the columns attribute to return the names of the columns in the food_infos dataframe and convert them to a list using the to_list() method
    * create a new list named gram_columns, containing only the columns ending with "(g)"
    * hint: the endswith() method returns True if the string on which it is called ends with the value given in parentheses, e.g. "total(g)".endswith("(g)") returns True; a for loop is highly recommended to go through all the column names
    * select the gram_columns from the food_infos dataframe and assign the resulting dataframe to the gram_df variable
    * display the first 5 rows of the gram_df dataframe

In [22]:
column_names = food_infos.columns.to_list()
print(column_names)

['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)']


In [23]:
gram_columns = [col for col in column_names if col.endswith("(g)")]
gram_columns

['Water_(g)',
 'Protein_(g)',
 'Lipid_Tot_(g)',
 'Ash_(g)',
 'Carbohydrt_(g)',
 'Fiber_TD_(g)',
 'Sugar_Tot_(g)',
 'FA_Sat_(g)',
 'FA_Mono_(g)',
 'FA_Poly_(g)']

In [24]:
gram_df = food_infos[gram_columns]
gram_df.head()

Unnamed: 0,Water_(g),Protein_(g),Lipid_Tot_(g),Ash_(g),Carbohydrt_(g),Fiber_TD_(g),Sugar_Tot_(g),FA_Sat_(g),FA_Mono_(g),FA_Poly_(g)
0,15.87,0.85,81.11,2.11,0.06,0.0,0.06,51.368,21.021,3.043
1,15.87,0.85,81.11,2.11,0.06,0.0,0.06,50.489,23.426,3.012
2,0.24,0.28,99.48,0.0,0.0,0.0,0.0,61.924,28.732,3.694
3,42.41,21.4,28.74,5.11,2.34,0.0,0.5,18.669,7.778,0.8
4,41.11,23.24,29.68,3.18,2.79,0.0,0.51,18.764,8.598,0.784
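
The same selection can also be written without an explicit list of names. The sketch below is one possible alternative, not the solution expected by the exercise. It relies on the vectorized ``str.endswith()`` of the column index and on ``DataFrame.filter()`` with a regular expression, applied to a four-column toy dataframe.

```python
import pandas as pd

# Toy frame reusing a few values from the dataset.
nutrients = pd.DataFrame({
    "Shrt_Desc": ["BUTTER WITH SALT", "CHEESE BLUE"],
    "Water_(g)": [15.87, 42.41],
    "Protein_(g)": [0.85, 21.40],
    "Cholestrl_(mg)": [215.0, 75.0],
})

# Boolean mask over the column names, then column selection with loc.
gram_mask = nutrients.columns.str.endswith("(g)")
gram_df = nutrients.loc[:, gram_mask]

# Same selection with a regular expression anchored at the end of the name.
via_filter = nutrients.filter(regex=r"\(g\)$")

print(gram_df.columns.tolist())
print(gram_df.equals(via_filter))
```

Note that the "(mg)" column is left out in both cases, since its name ends with "mg)" rather than "(g)".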
