# 02. What is pandas
While NumPy is a great tool for efficient manipulation of numerical data, it lacks some of the higher-level functionality of spreadsheets and databases. The goal of the pandas library is to provide this additional functionality. The name is derived from "panel data", an econometrics term for multidimensional, structured data sets. pandas is open source software, originally written by Wes McKinney. The project grew out of a need for a high-performance tool for analyzing financial data. Since then, it has become a tool of choice for data scientists not only in finance, but across domains such as neuroscience, economics, statistics, and web analytics.

It is important to realize that pandas was not meant to replace NumPy, but to provide additional functionality on top of it. In fact, pandas is built on NumPy: it uses NumPy's multidimensional array and its fast operations internally.

The main purpose of pandas is to let us take tabular data, such as a CSV file or a SQL table with rows and columns, and load it into a Python object called a DataFrame. Compared to lists or dictionaries, DataFrames are much easier to work with for this kind of data.

### In addition to the data frame object, pandas provides us with

- tools for reading and writing data
- data alignment and integrated handling of missing data
- the ability to perform arithmetic operations on the data
- easy reshaping and pivoting of data sets
- user-friendly operations for merging and joining data
- the ability to handle time series

In [1]:
import pandas as pd

The pandas library has two main objects which serve as containers for our data:

a one-dimensional labeled array called Series

a two-dimensional labeled array called DataFrame

In [2]:
my_Series = pd.Series([1,'cat',10.2,'dog'])
my_Series

0       1
1     cat
2    10.2
3     dog
dtype: object

Notice that, unlike in NumPy arrays, we have an additional column on the left giving the numbers 0 to 3 for each entry. This is the index of the Series. We can use the index to select specific entries of the Series by passing a label inside square brackets:

In [3]:
my_Series[1]

'cat'

With the default index, this looks exactly like positional indexing in NumPy arrays. The difference is that we can replace the default index 0 to 3 with any labels of our choice:

In [4]:
ages = pd.Series([20,53,68], index=['John', 'Allen', 'Mary'])
ages


John     20
Allen    53
Mary     68
dtype: int64

In [5]:
ages['John']

20
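
As a small sketch of the two ways to reach the same entry, the example below looks up a value by label with .loc and by position with .iloc. It reuses the ages Series from above; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

ages = pd.Series([20, 53, 68], index=['John', 'Allen', 'Mary'])

# Label-based lookup: uses the custom index
by_label = ages.loc['Allen']

# Position-based lookup: counts entries from 0, like a NumPy array
by_position = ages.iloc[1]

print(by_label, by_position)
```

Using .loc and .iloc explicitly avoids any ambiguity about whether a value inside the brackets is a label or a position.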

## DataFrames

pd.DataFrame({ 'label1' : [col1], 'label2': [col2], .... })

In [6]:
df = pd.DataFrame( {'user' : [1,2,3],
            'age' : [24,54,17],
            'sex' : ['F','F','M'],
            'occupation' : ['technician','musician','student']})

In [7]:
df

Unnamed: 0,user,age,sex,occupation
0,1,24,F,technician
1,2,54,F,musician
2,3,17,M,student


In [8]:
df.set_index('user', inplace = True) # inplace=True modifies the original DataFrame instead of returning a new one

In [9]:
df

user,age,sex,occupation
1,24,F,technician
2,54,F,musician
3,17,M,student


In [10]:
df.head()

user,age,sex,occupation
1,24,F,technician
2,54,F,musician
3,17,M,student


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 3 columns):
age           3 non-null int64
sex           3 non-null object
occupation    3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes


In [12]:
df.index

Int64Index([1, 2, 3], dtype='int64', name='user')

In [13]:
#displaying the number of rows
df.shape[0]
#displaying the number of columns
df.shape[1]
#displaying the labels of all the columns
df.columns
#displaying the data types of each column
df.dtypes

age            int64
sex           object
occupation    object
dtype: object

In [14]:
df.describe()


Unnamed: 0,age
count,3.0
mean,31.666667
std,19.655364
min,17.0
25%,20.5
50%,24.0
75%,39.0
max,54.0


In [15]:
# Selecting columns
df['occupation']

user
1    technician
2      musician
3       student
Name: occupation, dtype: object

# Importing Data from files

### df = pd.read_csv('file_name.csv')

Here file_name is to be replaced by the actual name of the CSV file you want to load. This reads the data from the CSV file into a DataFrame. By default, the first row of the file is interpreted as a header row. If the file has no header row, we can override this as follows:

df = pd.read_csv('file_name.csv', header=None)

Or we can provide our own list of column headers

df = pd.read_csv('file_name.csv', names=['Header1', 'Header2', ....])

One of the main advantages of using pandas to import data is that it will automatically recognize and parse common missing data indicators for us. For example, it recognizes both NA and empty fields as missing data. However, missing data can often take many different forms. This is why pandas has an option for us to specify any additional symbols that should be interpreted as missing data by passing them with the na_values parameter. For example, the code below imports the csv file and treats all question mark symbols as missing data.

df = pd.read_csv('file_name.csv', na_values=['?'])
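
To see the effect of na_values without needing a file on disk, the sketch below reads a small CSV from an in-memory string. The CSV text and the column names are made up for illustration; io.StringIO simply stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content: one '?' marker and one empty field in the age column
csv_text = "name,age\nAnn,31\nBob,?\nCid,\n"

# '?' is treated as missing in addition to the default markers (e.g. empty fields)
df_na = pd.read_csv(io.StringIO(csv_text), na_values=['?'])

# Count the missing values in each column
missing_per_column = df_na.isna().sum()
print(missing_per_column)
```

Both the question mark and the empty field end up as NaN, so the age column reports two missing values.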

Besides CSV files, we can also create a DataFrame directly from an Excel file using the read_excel() function. The syntax is similar:

df = pd.read_excel('file_name.xls', sheet_name = 'sheet 3')

#### Other supported files include JSON, HTML, SAS, and SQL.

In [16]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(9).reshape(3,3), columns=['a','b', 'c'])

In [17]:
df

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


## Dropping Values


In [18]:
df.drop(0, axis=0) #original df not modified

Unnamed: 0,a,b,c
1,3,4,5
2,6,7,8


In [19]:
df.drop([0,2], axis=0)

Unnamed: 0,a,b,c
1,3,4,5


In [20]:
df.drop(['b','c'], axis=1)

Unnamed: 0,a
0,0
1,3
2,6
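
Since drop() leaves the original DataFrame untouched, the result has to be assigned to a name if we want to keep it. The following is a minimal sketch of that pattern; df_small is an illustrative name, and drop(columns=...) is used as an equivalent spelling of axis=1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# drop() returns a new DataFrame; keep the result by assigning it
df_small = df.drop(columns=['b', 'c'])

print(list(df.columns))        # the original still has all three columns
print(list(df_small.columns))  # only 'a' remains in the new DataFrame
```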


## Arithmetic operations

In [21]:
df['a'] + df['b']

0     1
1     7
2    13
dtype: int32

In [22]:
df['a'].add(df['b'])

0     1
1     7
2    13
dtype: int32

Pandas also provides similar methods for the other operations:

sub()

div()

mul()

In [23]:
df.add(df.loc[0:1,:])

Unnamed: 0,a,b,c
0,0.0,2.0,4.0
1,6.0,8.0,10.0
2,,,


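The .loc indexer selects entries by label: we first specify the rows that we want and then the columns. Label slices include both endpoints, so df.loc[0:1, :] selects rows 0 and 1. Because row 2 does not exist in that selection, the sum for that row is NaN. Passing fill_value=0 treats the missing entries as 0 instead: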

In [24]:
df.add(df.loc[0:1,:], fill_value=0)

Unnamed: 0,a,b,c
0,0.0,2.0,4.0
1,6.0,8.0,10.0
2,6.0,7.0,8.0
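
The same alignment rule applies to Series. As a rough sketch with made-up data, the example below adds two Series whose indexes only partly overlap, once with the + operator and once with add() and fill_value.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# Values are matched by label; 'a' has no partner in s2, so the result is NaN there
plain = s1 + s2

# fill_value=0 substitutes 0 for the missing partner before adding
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```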


## Concatenating DataFrames

In [25]:
df1= pd.DataFrame([['Mark', 50], ['Kate', 46]],
                 columns=['name', 'age'])
df2 = pd.DataFrame([['Jon', 3], ['David', 4]],
                columns=['name', 'age'])

In [26]:
df1

Unnamed: 0,name,age
0,Mark,50
1,Kate,46


In [27]:
df2

Unnamed: 0,name,age
0,Jon,3
1,David,4


In [28]:
df_concat = pd.concat([df1,df2])
df_concat

Unnamed: 0,name,age
0,Mark,50
1,Kate,46
0,Jon,3
1,David,4


In [29]:
df_concat.reset_index() # alternatively, pass ignore_index=True to pd.concat() to get a fresh index

Unnamed: 0,index,name,age
0,0,Mark,50
1,1,Kate,46
2,0,Jon,3
3,1,David,4
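
Note that reset_index() keeps the old index as a new column named index. If that column is not wanted, one option is sketched below, using the same df1 and df2 as above; fresh and dropped are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame([['Mark', 50], ['Kate', 46]], columns=['name', 'age'])
df2 = pd.DataFrame([['Jon', 3], ['David', 4]], columns=['name', 'age'])

# Option 1: ask concat to build a new 0..n-1 index directly
fresh = pd.concat([df1, df2], ignore_index=True)

# Option 2: reset the index afterwards and discard the old one
dropped = pd.concat([df1, df2]).reset_index(drop=True)

print(fresh)
```

Both options give the same result here; ignore_index=True simply saves a step.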


In [30]:
df3 = pd.DataFrame(['writer', 'journalist'], columns=['occupation'])
df3

Unnamed: 0,occupation
0,writer
1,journalist


In [31]:
pd.concat([df1, df3], axis=1) # concatenate along the columns with axis=1

Unnamed: 0,name,age,occupation
0,Mark,50,writer
1,Kate,46,journalist


In [32]:
columns = [
    'Mountain', 'Height (m)', 'Range', 'Coordinates', 'Parent mountain',
    'First ascent', 'Ascents bef. 2004', 'Failed attempts bef. 2004'
]
df = pd.read_csv('Mountains.csv', nrows=10, usecols=columns)
df

Unnamed: 0,Mountain,Height (m),Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
0,Mount Everest / Sagarmatha / Chomolungma,8848,Mahalangur Himalaya,27°59′17″N 86°55′31″E﻿,,1953,>>145,121
1,K2 / Qogir / Godwin Austen,8611,Baltoro Karakoram,35°52′53″N 76°30′48″E﻿,Mount Everest,1954,45,44
2,Kangchenjunga,8586,Kangchenjunga Himalaya,27°42′12″N 88°08′51″E﻿,Mount Everest,1955,38,24
3,Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26
4,Makalu,8485,Mahalangur Himalaya,27°53′23″N 87°05′20″E﻿,Mount Everest,1955,45,52
5,Cho Oyu,8188,Mahalangur Himalaya,28°05′39″N 86°39′39″E﻿,Mount Everest,1954,79,28
6,Dhaulagiri I,8167,Dhaulagiri Himalaya,28°41′48″N 83°29′35″E﻿,K2,1960,51,39
7,Manaslu,8163,Manaslu Himalaya,28°33′00″N 84°33′35″E﻿,Cho Oyu,1956,49,45
8,Nanga Parbat,8126,Nanga Parbat Himalaya,35°14′14″N 74°35′21″E﻿,Dhaulagiri,1953,52,67
9,Annapurna I,8091,Annapurna Himalaya,28°35′44″N 83°49′13″E﻿,Cho Oyu,1950,36,47


In [33]:
df.set_index('Mountain', inplace=True)
df

Mountain,Height (m),Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
Mount Everest / Sagarmatha / Chomolungma,8848,Mahalangur Himalaya,27°59′17″N 86°55′31″E﻿,,1953,>>145,121
K2 / Qogir / Godwin Austen,8611,Baltoro Karakoram,35°52′53″N 76°30′48″E﻿,Mount Everest,1954,45,44
Kangchenjunga,8586,Kangchenjunga Himalaya,27°42′12″N 88°08′51″E﻿,Mount Everest,1955,38,24
Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26
Makalu,8485,Mahalangur Himalaya,27°53′23″N 87°05′20″E﻿,Mount Everest,1955,45,52
Cho Oyu,8188,Mahalangur Himalaya,28°05′39″N 86°39′39″E﻿,Mount Everest,1954,79,28
Dhaulagiri I,8167,Dhaulagiri Himalaya,28°41′48″N 83°29′35″E﻿,K2,1960,51,39
Manaslu,8163,Manaslu Himalaya,28°33′00″N 84°33′35″E﻿,Cho Oyu,1956,49,45
Nanga Parbat,8126,Nanga Parbat Himalaya,35°14′14″N 74°35′21″E﻿,Dhaulagiri,1953,52,67
Annapurna I,8091,Annapurna Himalaya,28°35′44″N 83°49′13″E﻿,Cho Oyu,1950,36,47


In [34]:
df.index


Index(['Mount Everest / Sagarmatha / Chomolungma',
       'K2 / Qogir / Godwin Austen', 'Kangchenjunga', 'Lhotse', 'Makalu',
       'Cho Oyu', 'Dhaulagiri I', 'Manaslu', 'Nanga Parbat', 'Annapurna I'],
      dtype='object', name='Mountain')

In [35]:
df.columns

Index(['Height (m)', 'Range', 'Coordinates', 'Parent mountain', 'First ascent',
       'Ascents bef. 2004', 'Failed attempts bef. 2004'],
      dtype='object')

#### The attribute operator . to select columns

In [36]:
df.Range

Mountain
Mount Everest / Sagarmatha / Chomolungma       Mahalangur Himalaya
K2 / Qogir / Godwin Austen                       Baltoro Karakoram
Kangchenjunga                               Kangchenjunga Himalaya
Lhotse                                         Mahalangur Himalaya
Makalu                                         Mahalangur Himalaya
Cho Oyu                                        Mahalangur Himalaya
Dhaulagiri I                                   Dhaulagiri Himalaya
Manaslu                                           Manaslu Himalaya
Nanga Parbat                                 Nanga Parbat Himalaya
Annapurna I                                     Annapurna Himalaya
Name: Range, dtype: object

In [37]:
df.Height (m) # attribute access does not work for column names containing spaces; see the next cell

AttributeError: 'DataFrame' object has no attribute 'Height'

In [None]:
getattr(df, 'Height (m)') # workaround for column names containing spaces or other special characters

#### The index operator [ ] to select columns

In [40]:
df['Height (m)']

Mountain
Mount Everest / Sagarmatha / Chomolungma    8848
K2 / Qogir / Godwin Austen                  8611
Kangchenjunga                               8586
Lhotse                                      8516
Makalu                                      8485
Cho Oyu                                     8188
Dhaulagiri I                                8167
Manaslu                                     8163
Nanga Parbat                                8126
Annapurna I                                 8091
Name: Height (m), dtype: int64

In [41]:
df['Height (m)'][0:3:2]

Mountain
Mount Everest / Sagarmatha / Chomolungma    8848
Kangchenjunga                               8586
Name: Height (m), dtype: int64

In [42]:
df[['Height (m)', 'Range', 'Coordinates']]

Mountain,Height (m),Range,Coordinates
Mount Everest / Sagarmatha / Chomolungma,8848,Mahalangur Himalaya,27°59′17″N 86°55′31″E﻿
K2 / Qogir / Godwin Austen,8611,Baltoro Karakoram,35°52′53″N 76°30′48″E﻿
Kangchenjunga,8586,Kangchenjunga Himalaya,27°42′12″N 88°08′51″E﻿
Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿
Makalu,8485,Mahalangur Himalaya,27°53′23″N 87°05′20″E﻿
Cho Oyu,8188,Mahalangur Himalaya,28°05′39″N 86°39′39″E﻿
Dhaulagiri I,8167,Dhaulagiri Himalaya,28°41′48″N 83°29′35″E﻿
Manaslu,8163,Manaslu Himalaya,28°33′00″N 84°33′35″E﻿
Nanga Parbat,8126,Nanga Parbat Himalaya,35°14′14″N 74°35′21″E﻿
Annapurna I,8091,Annapurna Himalaya,28°35′44″N 83°49′13″E﻿


#### The index operator [ ] to select rows

a[start:stop:step]: with integer positions, start is inclusive and stop is exclusive; with index labels, both start and stop are inclusive.

In [43]:
df['Lhotse':'Manaslu']

Mountain,Height (m),Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26
Makalu,8485,Mahalangur Himalaya,27°53′23″N 87°05′20″E﻿,Mount Everest,1955,45,52
Cho Oyu,8188,Mahalangur Himalaya,28°05′39″N 86°39′39″E﻿,Mount Everest,1954,79,28
Dhaulagiri I,8167,Dhaulagiri Himalaya,28°41′48″N 83°29′35″E﻿,K2,1960,51,39
Manaslu,8163,Manaslu Himalaya,28°33′00″N 84°33′35″E﻿,Cho Oyu,1956,49,45
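
The difference between positional and label slices is easy to miss, so here is a minimal sketch on a tiny made-up DataFrame. The name letters and its values are purely illustrative.

```python
import pandas as pd

letters = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# Positional slice: the stop position 3 is excluded, so we get rows 'b' and 'c'
by_position = letters.iloc[1:3]

# Label slice: the stop label 'd' is included, so we get rows 'b', 'c' and 'd'
by_label = letters.loc['b':'d']

print(list(by_position.index))
print(list(by_label.index))
```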


#### The iloc operator to select rows and columns by position

df.iloc[rows, columns]

In [44]:
df.iloc[:, 2:6]

Mountain,Coordinates,Parent mountain,First ascent,Ascents bef. 2004
Mount Everest / Sagarmatha / Chomolungma,27°59′17″N 86°55′31″E﻿,,1953,>>145
K2 / Qogir / Godwin Austen,35°52′53″N 76°30′48″E﻿,Mount Everest,1954,45
Kangchenjunga,27°42′12″N 88°08′51″E﻿,Mount Everest,1955,38
Lhotse,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26
Makalu,27°53′23″N 87°05′20″E﻿,Mount Everest,1955,45
Cho Oyu,28°05′39″N 86°39′39″E﻿,Mount Everest,1954,79
Dhaulagiri I,28°41′48″N 83°29′35″E﻿,K2,1960,51
Manaslu,28°33′00″N 84°33′35″E﻿,Cho Oyu,1956,49
Nanga Parbat,35°14′14″N 74°35′21″E﻿,Dhaulagiri,1953,52
Annapurna I,28°35′44″N 83°49′13″E﻿,Cho Oyu,1950,36


In [45]:
df.iloc[::2, 2:]

Mountain,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
Mount Everest / Sagarmatha / Chomolungma,27°59′17″N 86°55′31″E﻿,,1953,>>145,121
Kangchenjunga,27°42′12″N 88°08′51″E﻿,Mount Everest,1955,38,24
Makalu,27°53′23″N 87°05′20″E﻿,Mount Everest,1955,45,52
Dhaulagiri I,28°41′48″N 83°29′35″E﻿,K2,1960,51,39
Nanga Parbat,35°14′14″N 74°35′21″E﻿,Dhaulagiri,1953,52,67


In [46]:
df.iloc[[3,5,8], 2:]

Mountain,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
Lhotse,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26
Cho Oyu,28°05′39″N 86°39′39″E﻿,Mount Everest,1954,79,28
Nanga Parbat,35°14′14″N 74°35′21″E﻿,Dhaulagiri,1953,52,67


#### The loc operator to select rows and columns by label

The loc operator is similar to iloc, except that instead of referencing rows and columns by their position in the DataFrame, we use the index labels and column names respectively. Label slices include both endpoints. The general syntax is the same:

df.loc[rows, columns]

In [47]:
df.loc[:, 'Height (m)':'First ascent':2]

Mountain,Height (m),Coordinates,First ascent
Mount Everest / Sagarmatha / Chomolungma,8848,27°59′17″N 86°55′31″E﻿,1953
K2 / Qogir / Godwin Austen,8611,35°52′53″N 76°30′48″E﻿,1954
Kangchenjunga,8586,27°42′12″N 88°08′51″E﻿,1955
Lhotse,8516,27°57′42″N 86°55′59″E﻿,1956
Makalu,8485,27°53′23″N 87°05′20″E﻿,1955
Cho Oyu,8188,28°05′39″N 86°39′39″E﻿,1954
Dhaulagiri I,8167,28°41′48″N 83°29′35″E﻿,1960
Manaslu,8163,28°33′00″N 84°33′35″E﻿,1956
Nanga Parbat,8126,35°14′14″N 74°35′21″E﻿,1953
Annapurna I,8091,28°35′44″N 83°49′13″E﻿,1950


#### Boolean selection of rows using the [ ] operator

In [48]:
df['Parent mountain'] == 'Mount Everest'

Mountain
Mount Everest / Sagarmatha / Chomolungma    False
K2 / Qogir / Godwin Austen                   True
Kangchenjunga                                True
Lhotse                                       True
Makalu                                       True
Cho Oyu                                      True
Dhaulagiri I                                False
Manaslu                                     False
Nanga Parbat                                False
Annapurna I                                 False
Name: Parent mountain, dtype: bool

In [49]:
df[df['Parent mountain'] == 'Mount Everest']

Mountain,Height (m),Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
K2 / Qogir / Godwin Austen,8611,Baltoro Karakoram,35°52′53″N 76°30′48″E﻿,Mount Everest,1954,45,44
Kangchenjunga,8586,Kangchenjunga Himalaya,27°42′12″N 88°08′51″E﻿,Mount Everest,1955,38,24
Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26
Makalu,8485,Mahalangur Himalaya,27°53′23″N 87°05′20″E﻿,Mount Everest,1955,45,52
Cho Oyu,8188,Mahalangur Himalaya,28°05′39″N 86°39′39″E﻿,Mount Everest,1954,79,28


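Although plain Python uses the keywords and, or, and not, these do not work element-wise when combining multiple conditions on pandas objects. Instead, we must use the following operators, wrapping each condition in parentheses because these operators bind more tightly than comparisons such as == and >: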

& for and

| for or

~ for not

In [50]:
df[(df['Parent mountain'] == 'Mount Everest') & (df['First ascent'] > 1955)]

Mountain,Height (m),Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26
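
The example above only uses &. As a rough sketch of the other two operators, and of what happens when the Python keyword and is used by mistake, consider the small made-up DataFrame below. The name peaks and its values are hypothetical and not taken from Mountains.csv.

```python
import pandas as pd

peaks = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'height': [8848, 8611, 7500, 8091],
    'year': [1953, 1954, 1960, 1950],
})

# | keeps rows where at least one condition holds
tall_or_late = peaks[(peaks['height'] > 8500) | (peaks['year'] > 1955)]

# ~ inverts a condition
not_tall = peaks[~(peaks['height'] > 8500)]

# The keyword 'and' tries to reduce each Series to a single True/False and fails
try:
    peaks[(peaks['height'] > 8500) and (peaks['year'] > 1955)]
except ValueError as err:
    message = str(err)
    print(message)
```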


#### Boolean selection of rows and columns using the .loc operator

In [51]:
df.loc[(df['Parent mountain'] == 'Mount Everest') & (df['First ascent'] > 1955), :]

Mountain,Height (m),Range,Coordinates,Parent mountain,First ascent,Ascents bef. 2004,Failed attempts bef. 2004
Lhotse,8516,Mahalangur Himalaya,27°57′42″N 86°55′59″E﻿,Mount Everest,1956,26,26


In [52]:
df.loc[(df['Parent mountain'] == 'Mount Everest') & (df['First ascent'] > 1955), 'Height (m)':'Range']

Mountain,Height (m),Range
Lhotse,8516,Mahalangur Himalaya


In [53]:
col_criteria = [True, False, False, False, True, True, False]
df.loc[df['Height (m)'] > 8000, col_criteria]

Mountain,Height (m),First ascent,Ascents bef. 2004
Mount Everest / Sagarmatha / Chomolungma,8848,1953,>>145
K2 / Qogir / Godwin Austen,8611,1954,45
Kangchenjunga,8586,1955,38
Lhotse,8516,1956,26
Makalu,8485,1955,45
Cho Oyu,8188,1954,79
Dhaulagiri I,8167,1960,51
Manaslu,8163,1956,49
Nanga Parbat,8126,1953,52
Annapurna I,8091,1950,36


#### Views vs copies

In [54]:
df= pd.DataFrame( {'user' : [1,2,3],
            'age' : [24,54,17],
            'sex' : ['F','F','M'],
            'occupation' : ['technician','musician','student']})
df

Unnamed: 0,user,age,sex,occupation
0,1,24,F,technician
1,2,54,F,musician
2,3,17,M,student


Part 1: Warning after failed attempt at setting values

Suppose we want to change the values 'F' to 'Female'. How would we go about this?

Well, one way would be to begin selecting the relevant values with boolean indexing as follows:

In [55]:
df[df.sex=='F']

Unnamed: 0,user,age,sex,occupation
0,1,24,F,technician
1,2,54,F,musician


In [56]:
df[df.sex=='F'].sex

0    F
1    F
Name: sex, dtype: object

In [57]:
df[df.sex=='F'].sex = 'Female'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


If we check the original DataFrame df, we will see that it remains unchanged! So indeed, the warning is for good reason since the command we performed did not have the intended effect.

The problem with chained assignment is that pandas cannot ensure whether a copy or a view is returned, thus giving us a warning. In our case, the warning was appropriate since indeed the changes were not performed on the original DataFrame. Instead, they were performed on the copy returned by the chained selection.

So what can we do to fix this problem? The answer is given in the warning from pandas: we use the .loc indexer instead, selecting the rows and the column in a single step!

In [58]:
df.loc[df.sex=='F','sex'] = 'Female'
df

Unnamed: 0,user,age,sex,occupation
0,1,24,Female,technician
1,2,54,Female,musician
2,3,17,M,student


In [59]:
df = pd.DataFrame( {
    'user' : [1,2,3], 
    'age' : [24,54,17], 
    'sex' : ['F','F','M'], 
    'occupation' : ['technician','musician','student']
})

In [None]:
df2 = df.loc[df.sex=='F']

Suppose further that we now want to change the values in the DataFrame df2. Using what we just learned, we use .loc again to assign the new values:

In [None]:
df2.loc[0:1,'sex']='Female'

So far so good, right? Well, not quite! Executing this command gives us the same old warning from pandas.

This is very confusing since, after all, we used .loc, which is exactly what the warning advises us to do.

So what is happening here? Actually, the whole source of the problem is not the assignment statement, but our definition of the DataFrame df2. When we define df2 to be a subset of df, pandas does not know if once again this subset is a copy or a view. So when later on in our code we attempt to modify this subset, we get the same warning, since pandas does not know if we are modifying just df2, or both df and df2.

In our case, the code did modify the DataFrame df2 and left df unchanged, as was our intention. To tell pandas explicitly that df2 is an independent copy, and so avoid the warning, we can call copy() when defining it:

In [60]:
df2 = df.loc[df.sex=='F'].copy()

In [61]:
df2.loc[0:1,'sex']='Female'
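
To check that the explicit copy really is independent, we can compare both DataFrames after the assignment. This is a minimal sketch with a reduced version of the DataFrame above; females is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3], 'sex': ['F', 'F', 'M']})

# copy() makes the subset an independent DataFrame
females = df.loc[df['sex'] == 'F'].copy()

# Modifying the copy does not touch the original
females.loc[:, 'sex'] = 'Female'

print(df['sex'].tolist())
print(females['sex'].tolist())
```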

#### Applying functions

In this unit, we will look at three main methods that help us apply functions to a DataFrame or Series. The methods are:

map()

apply()

applymap()

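The map() method applies to a Series and replaces each existing value with a new one, looked up in a dictionary or computed by a function.

Note: when a dictionary is passed, it must cover every value in the Series; any value without a matching key becomes NaN.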

In [63]:
df = pd.DataFrame( {
    'user': [1, 2, 3],
    'age': [24, 54, 17],
    'sex': ['F', 'F', 'M'],
    'occupation': ['technician', 'musician', 'student']
})
df

Unnamed: 0,user,age,sex,occupation
0,1,24,F,technician
1,2,54,F,musician
2,3,17,M,student


In [64]:
df['sex'] = df['sex'].map({'F': 'Female', 'M': 'Male'})
df

Unnamed: 0,user,age,sex,occupation
0,1,24,Female,technician
1,2,54,Female,musician
2,3,17,Male,student


In [65]:
df['sex'].map({'Female': 1})

0    1.0
1    1.0
2    NaN
Name: sex, dtype: float64

In [66]:
df['sex'].replace('Female', '1')

0       1
1       1
2    Male
Name: sex, dtype: object
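
Besides a dictionary, map() also accepts a function, which is then called on each value. The sketch below shows both forms on a small made-up Series; coded and lengths are illustrative names.

```python
import pandas as pd

s = pd.Series(['Female', 'Female', 'Male'])

# A dictionary that covers every value, so no NaN appears
coded = s.map({'Female': 1, 'Male': 0})

# A function is applied to each value in turn; here the built-in len
lengths = s.map(len)

print(coded.tolist())
print(lengths.tolist())
```

When only some values need changing and the rest should stay as they are, replace() is usually the more convenient choice, as the cell above shows.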

In [67]:
df2 = pd.DataFrame(
    data=np.arange(9).reshape(3,3), columns=['a','b', 'c'])

We will now apply the Python sum function to our DataFrame. We can apply the function in two ways:

- along axis 0, meaning down the DataFrame, which returns the sum of each column
- along axis 1, meaning across the DataFrame, which returns the sum of each row

In [68]:
df2

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8


In [69]:
df2.apply(sum, axis=0)

a     9
b    12
c    15
dtype: int64

In [70]:
df2.apply(sum, axis=1)

0     3
1    12
2    21
dtype: int64

In [71]:
df2.apply(np.max, axis = 1)

0    2
1    5
2    8
dtype: int64

In [72]:
df2.apply(np.mean, axis = 0)

a    3.0
b    4.0
c    5.0
dtype: float64
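
apply() is not limited to built-in functions. As a small sketch, the example below passes a lambda that computes the range (maximum minus minimum) of each column and of each row; col_range and row_range are illustrative names.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# axis=0: the lambda receives one column at a time as a Series
col_range = df2.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the lambda receives one row at a time as a Series
row_range = df2.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```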

#### The applymap() method
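The applymap() method applies a function to every element of a DataFrame. (In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which does the same thing.) Let's say we have some custom function such as: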

In [73]:
def my_func(x):
    if x > 5:
        size = 'Large'
    elif x >3:
        size = 'Medium'
    else:
        size = 'Small'
    return size

In [74]:
df2.applymap(my_func)

Unnamed: 0,a,b,c
0,Small,Small,Small
1,Small,Medium,Medium
2,Large,Large,Large


One thing to keep in mind is that while applymap() is a convenient and versatile method, it can be quite inefficient for larger datasets, basically because it has to run the function on each individual element of the DataFrame. It is always best whenever possible to try and use vectorized operations.
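
As one possible vectorized alternative, the labelling done by my_func can be sketched with NumPy's select function, which evaluates whole boolean arrays instead of calling a Python function per element. This is only one way to do it; sized is an illustrative name.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
values = df2.to_numpy()

# Conditions are checked in order; the first one that is True picks the label
labels = np.select(
    [values > 5, values > 3],
    ['Large', 'Medium'],
    default='Small',
)

# Wrap the resulting array back into a DataFrame with the same labels
sized = pd.DataFrame(labels, index=df2.index, columns=df2.columns)
print(sized)
```

The result matches the applymap() output above, but the comparisons run on the whole array at once.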

#### Sorting

In [75]:
df = pd.DataFrame({'A':[3,6,1,12,3],'B':[0,0,7,5,6],'C':[10,4,5,8,2]})
df


Unnamed: 0,A,B,C
0,3,0,10
1,6,0,4
2,1,7,5
3,12,5,8
4,3,6,2


In [76]:
df.sort_index()

Unnamed: 0,A,B,C
0,3,0,10
1,6,0,4
2,1,7,5
3,12,5,8
4,3,6,2


In [77]:
df.sort_index(ascending=False)

Unnamed: 0,A,B,C
4,3,6,2
3,12,5,8
2,1,7,5
1,6,0,4
0,3,0,10


In [78]:
df.sort_index(ascending=False, axis=1)

Unnamed: 0,C,B,A
0,10,0,3
1,4,0,6
2,5,7,1
3,8,5,12
4,2,6,3


In [79]:
df['A'].sort_values()

2     1
0     3
4     3
1     6
3    12
Name: A, dtype: int64

In [80]:
df.sort_values('A')

Unnamed: 0,A,B,C
2,1,7,5
0,3,0,10
4,3,6,2
1,6,0,4
3,12,5,8


In [81]:
df.sort_values(['A','C'])

Unnamed: 0,A,B,C
2,1,7,5
4,3,6,2
0,3,0,10
1,6,0,4
3,12,5,8
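
When sorting by several columns, each column can have its own direction. The sketch below sorts the same DataFrame by A ascending and, within equal values of A, by C descending; sorted_df is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 6, 1, 12, 3], 'B': [0, 0, 7, 5, 6], 'C': [10, 4, 5, 8, 2]})

# One ascending flag per sort column: A ascending, C descending
sorted_df = df.sort_values(['A', 'C'], ascending=[True, False])

print(sorted_df)
```

Compared with the previous output, the two rows with A equal to 3 swap places, because C is now sorted from largest to smallest.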


#### Grouping

In [82]:
import pandas as pd
import numpy as np
df = pd.DataFrame({
       'A' : ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog'],
       'B' : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
       'C' : np.random.randint(10, size=8)})

In [83]:
df

Unnamed: 0,A,B,C
0,dog,one,2
1,cat,one,1
2,dog,two,6
3,cat,three,4
4,dog,two,1
5,cat,two,6
6,dog,one,5
7,dog,three,9


In [84]:
df['C'].mean()

4.25

In [85]:
df.groupby('A')['C'].mean()

A
cat    3.666667
dog    4.600000
Name: C, dtype: float64
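
groupby() also accepts a list of columns, in which case one group is formed per combination of values. The sketch below uses the values displayed above written out by hand rather than random numbers, so that the results are reproducible; by_pair and mean_by_a are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['dog', 'cat', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [2, 1, 6, 4, 1, 6, 5, 9],  # fixed values copied from the output above
})

# One group per (A, B) combination; the result has a two-level index
by_pair = df.groupby(['A', 'B'])['C'].sum()

# Single-key grouping, as in the cell above
mean_by_a = df.groupby('A')['C'].mean()

print(by_pair)
```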

In [86]:
import pandas as pd
df = pd.read_csv('songs_requested.csv')

In [87]:
df

Unnamed: 0,Musician,Name,Decade,Requested
0,Led Zeppelin,Stairway to Heaven,70,435
1,Led Zeppelin,Kashmir,70,284
2,Led Zeppelin,Immigrant Song,70,129
3,Led Zeppelin,Whole Lotta Love,60,337
4,Led Zeppelin,Black Dog,70,302
5,Led Zeppelin,Good Times Bad Times,60,220
6,Led Zeppelin,Moby Dick,60,93
7,Led Zeppelin,Ramble On,60,144
8,Led Zeppelin,All My Love,70,396
9,Led Zeppelin,The Song Remains the Same,70,178


In [88]:
df.groupby('Musician')['Requested'].describe()

Musician,count,mean,std,min,25%,50%,75%,max
Bob Dylan,3.0,210.666667,78.678671,143.0,167.5,192.0,244.5,297.0
David Bowie,6.0,165.166667,32.046321,107.0,156.25,177.5,183.75,194.0
Led Zeppelin,11.0,234.818182,124.607237,65.0,136.5,220.0,319.5,435.0


In [89]:
df.groupby('Musician')['Decade'].agg(lambda x: str(min(x))+"-"+str(max(x)))

Musician
Bob Dylan       60-70
David Bowie     60-90
Led Zeppelin    60-70
Name: Decade, dtype: object

In [90]:
df.groupby('Musician')['Decade'].agg(lambda x: str(min(x))+"-"+str(max(x))).to_frame()

Musician,Decade
Bob Dylan,60-70
David Bowie,60-90
Led Zeppelin,60-70


In [93]:
grouped_df=df.groupby('Musician').agg({'Name': 'count',
                                    'Decade': lambda x: str(min(x))+"-"+str(max(x)), 
                                    'Requested': ['sum', 'max', np.mean]}
                                    )

In [94]:
grouped_df.rename(columns={"count":"Total",
                           "<lambda>":"span"})

,Name,Decade,Requested,Requested,Requested
Musician,Total,span,sum,max,mean
Bob Dylan,3,60-70,632,297,210.666667
David Bowie,6,60-90,991,194,165.166667
Led Zeppelin,11,60-70,2583,435,234.818182


In [95]:
grouped_df = df.groupby('Musician').agg({
    'Name': 'count',
    'Decade': lambda x: str(min(x))+"-"+str(max(x)), 
    'Requested': ['sum', 'max', np.mean] # Produces a DataFrame with "2-levels" columns
})
grouped_df['Requested'] # all sub-columns under 'Requested'

Musician,sum,max,mean
Bob Dylan,632,297,210.666667
David Bowie,991,194,165.166667
Led Zeppelin,2583,435,234.818182


In [96]:
grouped_df

,Name,Decade,Requested,Requested,Requested
Musician,count,<lambda>,sum,max,mean
Bob Dylan,3,60-70,632,297,210.666667
David Bowie,6,60-90,991,194,165.166667
Led Zeppelin,11,60-70,2583,435,234.818182


In [98]:
grouped_df = df.groupby('Musician').agg({
    'Name': 'count',
    'Decade': lambda x: str(min(x))+"-"+str(max(x)), 
    'Requested': ['sum', 'max', np.mean]
})
grouped_df.columns = ['Number of songs', 'Decade', 'Requested_sum', 'Requested_max', 'Requested_mean']
grouped_df # Returns a DataFrame with 5 columns

Musician,Number of songs,Decade,Requested_sum,Requested_max,Requested_mean
Bob Dylan,3,60-70,632,297,210.666667
David Bowie,6,60-90,991,194,165.166667
Led Zeppelin,11,60-70,2583,435,234.818182


In [99]:
grouped_df = df.groupby('Musician').agg({
    'Name': 'count',
    'Decade': lambda x: str(min(x))+"-"+str(max(x)), 
    'Requested': ['sum', 'max', np.mean]
})
grouped_df.columns = pd.MultiIndex.from_tuples([
    ('Number of songs', None),
    ('Decade', None),
    ('Requested', 'sum'),
    ('Requested', 'max'),
    ('Requested', 'mean')
])
grouped_df # Returns a DataFrame with "2-level" columns

,Number of songs,Decade,Requested,Requested,Requested
Musician,NaN,NaN,sum,max,mean
Bob Dylan,3,60-70,632,297,210.666667
David Bowie,6,60-90,991,194,165.166667
Led Zeppelin,11,60-70,2583,435,234.818182
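
Another way to get flat, readable column names is named aggregation, where each output column is given as name=(column, function). The sketch below uses a tiny made-up DataFrame in place of songs_requested.csv; the musicians X and Y, the song names, and the output column names are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the songs data
songs = pd.DataFrame({
    'Musician': ['X', 'X', 'Y'],
    'Name': ['s1', 's2', 's3'],
    'Decade': [60, 70, 80],
    'Requested': [100, 300, 50],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = songs.groupby('Musician').agg(
    n_songs=('Name', 'count'),
    span=('Decade', lambda x: f"{x.min()}-{x.max()}"),
    requested_sum=('Requested', 'sum'),
    requested_mean=('Requested', 'mean'),
)

print(summary)
```

This avoids both the two-level columns and the separate renaming step, at the cost of spelling out every output column.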


Pandas has several useful methods for dealing with missing values in our data. When we say 'missing', we simply mean a value that is null or not present. In many applications, data sets contain several missing values which can arise from the values not being reported, the values not being stored properly, or the values not existing in the first place.

In [100]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), index=['a', 'c', 'e'], columns=['A', 'B', 'C'])

In [101]:
df

Unnamed: 0,A,B,C
a,0,6,9
c,2,5,2
e,8,4,6


In [102]:
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])

In [103]:
df2

Unnamed: 0,A,B,C
a,0.0,6.0,9.0
b,,,
c,2.0,5.0,2.0
d,,,
e,8.0,4.0,6.0
f,,,


In [104]:
df2.isnull()


Unnamed: 0,A,B,C
a,False,False,False
b,True,True,True
c,False,False,False
d,True,True,True
e,False,False,False
f,True,True,True


In [105]:
df2.isnull().sum()

A    3
B    3
C    3
dtype: int64

In [106]:
df2.isnull().sum(axis=1)

a    0
b    3
c    0
d    3
e    0
f    3
dtype: int64

In [107]:
df2[df2['A'].isnull()]

Unnamed: 0,A,B,C
b,,,
d,,,
f,,,


In [108]:
df2.dropna()


Unnamed: 0,A,B,C
a,0.0,6.0,9.0
c,2.0,5.0,2.0
e,8.0,4.0,6.0


The default setting for the dropna() method is to drop a row if any of its values are missing. Sometimes, we might be interested in dropping a row only when it has a missing value in a certain column, or only if it has a missing value in all of the columns. We can specify these as follows:

#### drop a row if it has a missing value in all of the columns
df2.dropna(how='all')

#### drop a row if it has a missing value in column 'A'
df2.dropna(subset=['A'])

#### drop a row if it has a missing value in column 'A' or column 'B'
df2.dropna(subset=['A','B'])

#### drop a row if it has a missing value in both column 'A' and column 'B'
df2.dropna(subset=['A','B'], how='all')
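
In df2 every row is either complete or entirely missing, so these variants all return the same rows. To make the differences visible, the sketch below uses a made-up DataFrame with one partially missing row; df_nan and the result names are illustrative.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1.0, np.nan, np.nan],
    'B': [2.0, 5.0, np.nan],
    'C': [3.0, 6.0, np.nan],
})

any_missing = df_nan.dropna()                             # keeps only row 0
all_missing = df_nan.dropna(how='all')                    # keeps rows 0 and 1
missing_in_a = df_nan.dropna(subset=['A'])                # keeps only row 0
missing_in_both = df_nan.dropna(subset=['A', 'B'], how='all')  # keeps rows 0 and 1

print(len(any_missing), len(all_missing), len(missing_in_a), len(missing_in_both))
```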

In [110]:
df2.fillna(value=0)

Unnamed: 0,A,B,C
a,0.0,6.0,9.0
b,0.0,0.0,0.0
c,2.0,5.0,2.0
d,0.0,0.0,0.0
e,8.0,4.0,6.0
f,0.0,0.0,0.0


#### replace with the mean of the same dataframe:
df2.fillna(value=df2.mean())


#### replace with specific values:
df2.fillna(value = {'A': 100, 'B': 200, 'C': 300})

#### replace only the NaNs for a column:
df2[['A']].fillna(value={'A': 100, 'B': 200, 'C': 300})

#### replace only the first NaN in each column:
df2.fillna(value={'A': 100, 'B': 200, 'C': 300},limit=1)

Unnamed: 0,A,B,C
a,0.0,6.0,9.0
b,100.0,200.0,300.0
c,2.0,5.0,2.0
d,,,
e,8.0,4.0,6.0
f,,,
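
As a minimal sketch of how the limit parameter behaves, the example below rebuilds df2 from the fixed values shown earlier (instead of random numbers) and fills at most one missing value per column; limited is an illustrative name.

```python
import pandas as pd

# Fixed values copied from the df output above, so the example is reproducible
df = pd.DataFrame({'A': [0, 2, 8], 'B': [6, 5, 4], 'C': [9, 2, 6]}, index=['a', 'c', 'e'])
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f'])

# limit=1: in each column, only the first NaN (row 'b') is replaced
limited = df2.fillna(value={'A': 100, 'B': 200, 'C': 300}, limit=1)

print(limited)
```

Rows d and f keep their NaN values, because the single allowed fill per column has already been used on row b.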
