In [None]:
import pandas as pd

# Series 

A pandas `Series` is a one-dimensional labeled array that can hold data of any type (integers, strings, floating point numbers, Python objects, etc.). A `Series` can be created from:

🔸Python dictionaries
🔸NumPy ndarrays
🔸Scalar values
🔸Lists
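
The list and dictionary cases are shown in the cells that follow. As a minimal sketch of the other two, the snippet below builds one `Series` from a NumPy array and one from a scalar. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# From a NumPy ndarray: the values keep the array's dtype (float64 here)
ser_from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a scalar: the value is repeated once for every index label supplied
ser_from_scalar = pd.Series(7, index=['a', 'b', 'c'])

print(ser_from_array)
print(ser_from_scalar)
```

Note that a scalar needs an explicit `index` to produce more than one element, since pandas has no other way to know how long the `Series` should be.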

In [4]:
# Create a pandas Series using a list

list_ser = [45, 123, 67, 1, 14]

serA = pd.Series(list_ser)

print(serA) # The default index in pandas always starts with 0 (zero).

print(type(serA))

0     45
1    123
2     67
3      1
4     14
dtype: int64
<class 'pandas.core.series.Series'>


In [5]:
serA.index # inspect index

RangeIndex(start=0, stop=5, step=1)

Likewise, the `.values` attribute of a pandas object returns the underlying data as an array.

When a `Series` has string labels in its index, as `serB` does in the next section, `print(serB.index)` reports `dtype='object'`. pandas builds on NumPy, whose native string dtypes are fixed-width, so pandas has traditionally stored strings as references to ordinary Python objects under the generic `object` dtype.

The labels are still regular Python strings; `object` only describes how they are stored. Newer pandas releases may report a dedicated string dtype instead.
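
A quick way to check this for yourself is sketched below. The small `Series` is illustrative only, and the dtype that gets printed depends on the installed pandas version, so no particular value is assumed here.

```python
import pandas as pd

ser = pd.Series([1.0, 2.0], index=['Num1', 'Num2'])

# Reported dtype of the index; this varies with the pandas version
print(ser.index.dtype)

# Whatever dtype is reported, each label is an ordinary Python str
labels_are_str = all(isinstance(label, str) for label in ser.index)
print(labels_are_str)
```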

# Creating series

## Using a list

In [14]:
# create a Series using the same list as above, but define the index, data type and Series name:
serB = pd.Series(list_ser,
                 index=['Num1', 'Num2', 'Num3', 'Num4', 'Num5'], # indices can be strings
                 dtype='float',
                 name='Numbers')

print(serB)
print(serB.index)
serB.values  # last expression in the cell, so the notebook displays it as array(...)

Num1     45.0
Num2    123.0
Num3     67.0
Num4      1.0
Num5     14.0
Name: Numbers, dtype: float64
Index(['Num1', 'Num2', 'Num3', 'Num4', 'Num5'], dtype='object')


array([ 45., 123.,  67.,   1.,  14.])

## Using a dictionary

In the next example we will create a `Series` from a dictionary. We will use the top 5 Canadian provinces by population, with figures taken from the 2017 column of the [Statistics Canada](https://www150.statcan.gc.ca/n1/pub/12-581-x/2018000/pop-eng.htm) web site:

In [15]:
population_dict = {'ON': 14193384, 'QC': 8394034, 'BC': 4817160, 'AB': 4286134, 'MB': 1338109}

provinces_population = pd.Series(population_dict, name='Top 5 provinces by population')

provinces_population # the dictionary keys become the indices

ON    14193384
QC     8394034
BC     4817160
AB     4286134
MB     1338109
Name: Top 5 provinces by population, dtype: int64

In [21]:
print(provinces_population['ON']) # population of Ontario

# Select only provinces with a population greater than 5 million.
# This type of selection is called boolean indexing:
print(provinces_population[provinces_population > 5000000])
print(provinces_population[2:4])        # slicing by position excludes the end position
print(provinces_population['BC':'MB'])  # slicing by label includes the end label
print('NS' in provinces_population)     # membership test on the index (NS is Nova Scotia)

14193384
ON    14193384
QC     8394034
Name: Top 5 provinces by population, dtype: int64
BC    4817160
AB    4286134
Name: Top 5 provinces by population, dtype: int64
BC    4817160
AB    4286134
MB    1338109
Name: Top 5 provinces by population, dtype: int64
False
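
The cell above mixes positional and label-based slicing through plain square brackets. A more explicit way to write the same selections, sketched here with `.iloc` and `.loc` on a rebuilt copy of the data (the name `pop` is illustrative), makes the difference in end-point handling easier to see:

```python
import pandas as pd

pop = pd.Series({'ON': 14193384, 'QC': 8394034, 'BC': 4817160,
                 'AB': 4286134, 'MB': 1338109})

by_position = pop.iloc[2:4]      # positions 2 and 3; the end position is excluded
by_label = pop.loc['BC':'MB']    # labels BC through MB; the end label is included
mask = pop > 5_000_000           # boolean Series, one True/False per province
large = pop[mask]                # keeps only the rows where the mask is True

print(by_position.index.tolist())
print(by_label.index.tolist())
print(large.index.tolist())
```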


In [23]:
print(provinces_population.sum())
print(provinces_population.mean())

33028821
6605764.2


# DataFrames

A `DataFrame` can be created from a:

- Dictionary of 1-D structures (`ndarray`s, `list`s, dictionaries, tuples or `Series`)
- List of 1-D structures
- 2-D NumPy `ndarray`
- `Series`
- Another `DataFrame`
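
The sections below cover the 2-dimensional list and the dictionary of lists. As a hedged sketch of a few of the other options, using made-up values and illustrative variable names:

```python
import numpy as np
import pandas as pd

# Dictionary of Series: keys become column names, rows are aligned on the index
df_from_series = pd.DataFrame({
    'land': pd.Series([10, 20], index=['x', 'y']),
    'water': pd.Series([1, 2], index=['x', 'y']),
})

# 2-D ndarray: rows get a default RangeIndex; column names are passed explicitly
df_from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# List of 1-D structures (dictionaries here): each element becomes one row
df_from_rows = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

print(df_from_series.shape, df_from_array.shape, df_from_rows.shape)
```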

## Using a 2-dimensional list

In [42]:
df = pd.DataFrame(data=[[8, 128, 27.5],
                        [10, 138.9, 34.5],
                        [16, 157.3, 91.1],
                        [6, 116.6, 21.4],
                        [14, 159.2, 54.4]],
                  columns=['Age', 'Height', 'Weight'])

df

,Age,Height,Weight
0,8,128.0,27.5
1,10,138.9,34.5
2,16,157.3,91.1
3,6,116.6,21.4
4,14,159.2,54.4


In [44]:
print(df.index)
print(df.columns)

RangeIndex(start=0, stop=5, step=1)
Index(['Age', 'Height', 'Weight'], dtype='object')


In [46]:
# Use the `info()` method to print summary information about the `DataFrame`:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     5 non-null      int64  
 1   Height  5 non-null      float64
 2   Weight  5 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 252.0 bytes


## Using a dictionary

To create a `DataFrame` from a dictionary, let's start with a simple dictionary holding the land and water areas of the five Canadian provinces from the previous example (i.e. Ontario, Quebec, British Columbia, Alberta and Manitoba). The data comes from a [Wikipedia page](https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada).

In [47]:
area = {'province':['ON', 'QC', 'BC', 'AB', 'MB'],
        'area_land': [917741, 1356128, 925186, 642317, 553556],
        'area_water': [158654, 185928, 19549, 19531, 94241]}

provinces_area = pd.DataFrame(area)

provinces_area

,province,area_land,area_water
0,ON,917741,158654
1,QC,1356128,185928
2,BC,925186,19549
3,AB,642317,19531
4,MB,553556,94241


### set_index()

If we want the province to be the index of this `DataFrame`, we can use the `set_index()` method. It returns a new `DataFrame` rather than modifying the original, so the result has to be assigned back:

In [49]:
provinces_area = provinces_area.set_index('province')

provinces_area

province,area_land,area_water
ON,917741,158654
QC,1356128,185928
BC,925186,19549
AB,642317,19531
MB,553556,94241
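
A small sketch of why the result of `set_index()` has to be captured: by default the method returns a new `DataFrame` and leaves the original untouched. The frame below is a shortened, illustrative copy of the data, and the names `areas` and `indexed` are not from the original notebook.

```python
import pandas as pd

areas = pd.DataFrame({'province': ['ON', 'QC'],
                      'area_land': [917741, 1356128]})

indexed = areas.set_index('province')   # returns a new DataFrame

print(areas.columns.tolist())           # the original still has a 'province' column
print(indexed.index.tolist())           # the new frame is indexed by province
print(indexed.loc['QC', 'area_land'])   # label-based lookup through the new index
```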


# Loading DataFrames

For this exercise, we will continue looking into the data that describes Canadian provinces. This time, we will use three fiscal years (2016-17 to 2018-19) of [Federal Support to all Canadian Provinces and Territories](https://www.fin.gc.ca/fedprov/mtp-eng.asp) data. All numbers are in millions of dollars.

The dataset used in this section is stored in the `pandas_ex1.csv` file. We assume that the file is in the notebook's working directory.

In [50]:
# Reading a csv file/loading csv file into a DataFrame

# sep=',' is the default separator; it is passed explicitly here to make the file format clear:
prov_support = pd.read_csv('pandas_ex1.csv', sep=',')

prov_support

,Canadian Provinces and Territories,Two-Letter Abbreviation,2016-17,2017-18,2018-19
0,Newfoundland and Labrador,NL,724,734,750
1,Prince Edward Island,PE,584,601,638
2,Nova Scotia,NS,3060,3138,3201
3,New Brunswick,NB,2741,2814,2956
4,Quebec,QC,21372,22720,23749
5,Ontario,ON,21347,21101,21420
6,Manitoba,MB,3531,3675,3965
7,Saskatchewan,SK,1565,1613,1673
8,Alberta,AB,5772,5943,6157
9,British Columbia,BC,6482,6680,6925
10,Yukon,YT,946,973,1006
11,Northwest Territories,NT,1281,1294,1319
12,Nunavut,NU,1539,1583,1634


In [51]:
prov_support.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Canadian Provinces and Territories  13 non-null     object
 1   Two-Letter Abbreviation             13 non-null     object
 2   2016-17                             13 non-null     int64 
 3   2017-18                             13 non-null     int64 
 4   2018-19                             13 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 652.0+ bytes


## Object data type

The `DataFrame` has 5 columns: 3 hold integers and 2 are of the `object` data type. In pandas, `object` usually means that the column holds strings.

## dtypes

We can also use the `dtypes` attribute to check the data type of each column:

In [56]:
prov_support.dtypes

Canadian Provinces and Territories    object
Two-Letter Abbreviation               object
2016-17                                int64
2017-18                                int64
2018-19                                int64
dtype: object
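
Once the column types are known, they can be used to filter or convert columns. The following is a minimal sketch on a made-up two-row frame (the names `frame`, `numeric_only` and `as_float` are illustrative), using `select_dtypes()` and `astype()`:

```python
import pandas as pd

frame = pd.DataFrame({'name': ['Quebec', 'Ontario'],
                      '2016-17': [21372, 21347]})

# Keep only the numeric columns
numeric_only = frame.select_dtypes(include='number')
print(numeric_only.columns.tolist())

# astype() returns a converted copy; the original column is unchanged
as_float = frame['2016-17'].astype('float64')
print(as_float.dtype)
```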

In [13]:
# Omit header
# specify column names
# use column 'province' as index

prov_support = pd.read_csv('pandas_ex1.csv',
                           sep=',',
                           skiprows=1,
                           header=None, # no header row remains, because skiprows=1 already skipped it
                           names=['province_name','province','2016','2017','2018'], # names of columns
                           index_col='province') 

prov_support

province,province_name,2016,2017,2018
NL,Newfoundland and Labrador,724,734,750
PE,Prince Edward Island,584,601,638
NS,Nova Scotia,3060,3138,3201
NB,New Brunswick,2741,2814,2956
QC,Quebec,21372,22720,23749
ON,Ontario,21347,21101,21420
MB,Manitoba,3531,3675,3965
SK,Saskatchewan,1565,1613,1673
AB,Alberta,5772,5943,6157
BC,British Columbia,6482,6680,6925
YT,Yukon,946,973,1006
NT,Northwest Territories,1281,1294,1319
NU,Nunavut,1539,1583,1634
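
To experiment with these parameters without the real file, the same call can be run against an in-memory string. This is only a sketch: `csv_text` is a shortened, illustrative stand-in for `pandas_ex1.csv`, and `io.StringIO` is used so that `read_csv()` can treat the string like a file.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file (two illustrative rows only)
csv_text = (
    "Canadian Provinces and Territories,Two-Letter Abbreviation,2016-17\n"
    "Quebec,QC,21372\n"
    "Ontario,ON,21347\n"
)

sample = pd.read_csv(io.StringIO(csv_text),
                     skiprows=1,             # skip the original header line
                     header=None,            # the remaining lines are all data
                     names=['province_name', 'province', '2016'],
                     index_col='province')   # use the abbreviation as the index

print(sample)
```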


# Saving DataFrames

In `pandas`, the `to_csv()` method writes `DataFrame` data out in `csv` format. It mirrors the reading function, except that its parameters control how the output is formatted rather than how the input is parsed. When no file path is given, it returns the CSV text as a string:

In [14]:
csv_out = prov_support.to_csv() # convert DataFrame to csv format

csv_out

'province,province_name,2016,2017,2018\r\nNL,Newfoundland and Labrador,724,734,750\r\nPE,Prince Edward Island,584,601,638\r\nNS,Nova Scotia,3060,3138,3201\r\nNB,New Brunswick,2741,2814,2956\r\nQC,Quebec,21372,22720,23749\r\nON,Ontario,21347,21101,21420\r\nMB,Manitoba,3531,3675,3965\r\nSK,Saskatchewan,1565,1613,1673\r\nAB,Alberta,5772,5943,6157\r\nBC,British Columbia,6482,6680,6925\r\nYT,Yukon,946,973,1006\r\nNT,Northwest Territories,1281,1294,1319\r\nNU,Nunavut,1539,1583,1634\r\n'

## to_csv( ) part 1

In [15]:
# writing to a file
prov_support.to_csv('csv_out1.csv')

## to_csv( ) part 2

We can also make a more elaborate call and specify if we want to include a header row and index, and what character should be used as a column separator:

In [16]:
prov_support.to_csv('csv_out2.csv',
                    sep='\t',       # the separator used for the columns
                    index=True,     # whether to include indexes in output
                    header=True)    # whether to include headers in output
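
One way to check that the chosen separator and index settings work as intended is a round trip: write the data out and read it back with matching parameters. The sketch below does this in memory on a tiny illustrative frame, so no files are created; the names `frame`, `tsv_text` and `restored` are assumptions for the example.

```python
import io
import pandas as pd

frame = pd.DataFrame({'2016': [21372, 21347]},
                     index=pd.Index(['QC', 'ON'], name='province'))

tsv_text = frame.to_csv(sep='\t')   # no path given, so the text is returned
print(repr(tsv_text))

# Read the text back, using the same separator and index column
restored = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col='province')
print(restored)
```

If the separator passed to `read_csv()` did not match the one used by `to_csv()`, each line would be read as a single column, which is an easy mistake to spot in the printed result.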

# Renaming DataFrame columns

Let's load the data into the DataFrame one more time without specifying any parameters for the read_csv() function, like we did in the beginning of this section:

In [17]:
fed_sup = pd.read_csv('pandas_ex1.csv', sep=',')

fed_sup

,Canadian Provinces and Territories,Two-Letter Abbreviation,2016-17,2017-18,2018-19
0,Newfoundland and Labrador,NL,724,734,750
1,Prince Edward Island,PE,584,601,638
2,Nova Scotia,NS,3060,3138,3201
3,New Brunswick,NB,2741,2814,2956
4,Quebec,QC,21372,22720,23749
5,Ontario,ON,21347,21101,21420
6,Manitoba,MB,3531,3675,3965
7,Saskatchewan,SK,1565,1613,1673
8,Alberta,AB,5772,5943,6157
9,British Columbia,BC,6482,6680,6925
10,Yukon,YT,946,973,1006
11,Northwest Territories,NT,1281,1294,1319
12,Nunavut,NU,1539,1583,1634


## .columns

In [18]:
# rename columns
fed_sup.columns = ['Province Name', 'Province Abbreviation', '2016', '2017', '2018'] 

fed_sup

,Province Name,Province Abbreviation,2016,2017,2018
0,Newfoundland and Labrador,NL,724,734,750
1,Prince Edward Island,PE,584,601,638
2,Nova Scotia,NS,3060,3138,3201
3,New Brunswick,NB,2741,2814,2956
4,Quebec,QC,21372,22720,23749
5,Ontario,ON,21347,21101,21420
6,Manitoba,MB,3531,3675,3965
7,Saskatchewan,SK,1565,1613,1673
8,Alberta,AB,5772,5943,6157
9,British Columbia,BC,6482,6680,6925
10,Yukon,YT,946,973,1006
11,Northwest Territories,NT,1281,1294,1319
12,Nunavut,NU,1539,1583,1634


## rename()

The advantage of the `rename()` method is that you only list the columns you want to change and state explicitly which new name each one gets.

In [19]:
# rename columns and remove spaces
fed_sup = fed_sup.rename(columns={'Province Name':'province_name', 'Province Abbreviation':'province'})

fed_sup

,province_name,province,2016,2017,2018
0,Newfoundland and Labrador,NL,724,734,750
1,Prince Edward Island,PE,584,601,638
2,Nova Scotia,NS,3060,3138,3201
3,New Brunswick,NB,2741,2814,2956
4,Quebec,QC,21372,22720,23749
5,Ontario,ON,21347,21101,21420
6,Manitoba,MB,3531,3675,3965
7,Saskatchewan,SK,1565,1613,1673
8,Alberta,AB,5772,5943,6157
9,British Columbia,BC,6482,6680,6925
10,Yukon,YT,946,973,1006
11,Northwest Territories,NT,1281,1294,1319
12,Nunavut,NU,1539,1583,1634
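
As a further sketch of what `rename()` allows, the `columns` argument accepts either a dictionary (only the listed columns change) or a callable that is applied to every column name. The one-row frame and the variable names below are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({'Province Name': ['Quebec'],
                      'Province Abbreviation': ['QC'],
                      '2016': [21372]})

# A dictionary renames only the columns it lists
partly_renamed = frame.rename(columns={'Province Abbreviation': 'province'})

# A callable is applied to every column name
snake_case = frame.rename(columns=lambda name: name.lower().replace(' ', '_'))

print(partly_renamed.columns.tolist())
print(snake_case.columns.tolist())
print(frame.columns.tolist())   # rename() returned new frames; the original is unchanged
```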


# Working with DataFrames

## head()

In [20]:
# first 5 rows of the DataFrame
prov_support.head()

province,province_name,2016,2017,2018
NL,Newfoundland and Labrador,724,734,750
PE,Prince Edward Island,584,601,638
NS,Nova Scotia,3060,3138,3201
NB,New Brunswick,2741,2814,2956
QC,Quebec,21372,22720,23749


## tail()

In [21]:
# last 5 rows
prov_support.tail()

province,province_name,2016,2017,2018
AB,Alberta,5772,5943,6157
BC,British Columbia,6482,6680,6925
YT,Yukon,946,973,1006
NT,Northwest Territories,1281,1294,1319
NU,Nunavut,1539,1583,1634


In [23]:
prov_support.head(3)

province,province_name,2016,2017,2018
NL,Newfoundland and Labrador,724,734,750
PE,Prince Edward Island,584,601,638
NS,Nova Scotia,3060,3138,3201


## Axes and indices

An important concept in pandas is the axes of a `DataFrame`. Each axis is an index: `axis=0` is the row index, and `axis=1` is the set of column names.

**NOTE:** `Series` have only one axis and it is a row axis.

In [24]:
prov_support.shape

(13, 4)

In [25]:
prov_support.columns

Index(['province_name', '2016', '2017', '2018'], dtype='object')

In [26]:
prov_support.index

Index(['NL', 'PE', 'NS', 'NB', 'QC', 'ON', 'MB', 'SK', 'AB', 'BC', 'YT', 'NT',
       'NU'],
      dtype='object', name='province')

In [27]:
prov_support.axes # returns both axes, the rows index (axis=0) first, then column names (axis=1).

[Index(['NL', 'PE', 'NS', 'NB', 'QC', 'ON', 'MB', 'SK', 'AB', 'BC', 'YT', 'NT',
        'NU'],
       dtype='object', name='province'),
 Index(['province_name', '2016', '2017', '2018'], dtype='object')]
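
The `axis` argument appears in many `DataFrame` methods, where it names the axis that the operation runs along. A minimal sketch with `sum()`, using two rows of the support data and illustrative variable names:

```python
import pandas as pd

support = pd.DataFrame({'2016': [724, 584], '2017': [734, 601]},
                       index=['NL', 'PE'])

per_year = support.sum(axis=0)       # adds down the rows: one total per column
per_province = support.sum(axis=1)   # adds across the columns: one total per row

print(per_year)
print(per_province)
```

The result of `axis=0` is labeled by the column names, while the result of `axis=1` is labeled by the row index.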

## sort_index( )

In [None]:
prov_support.sort_index(axis=0, ascending=True) # The default order is ascending

In [None]:
prov_support.sort_index(axis=1, ascending=True)
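
Since the two cells above show no output, here is a small self-contained sketch of what `sort_index()` does along each axis. The frame and its labels are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]},
                     index=['QC', 'AB', 'ON'])

rows_sorted = frame.sort_index(axis=0)         # sort by row labels
cols_sorted = frame.sort_index(axis=1)         # sort by column names
rows_desc = frame.sort_index(ascending=False)  # row labels in descending order

print(rows_sorted.index.tolist())
print(cols_sorted.columns.tolist())
print(rows_desc.index.tolist())
```

Each call returns a new `DataFrame`; `frame` itself keeps its original order.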

## type( )

Both the row index and the column names are pandas `Index` objects:

In [29]:
obj1 = prov_support.index

type(obj1)

pandas.core.indexes.base.Index

In [30]:
obj2 = prov_support.columns

type(obj2)

pandas.core.indexes.base.Index

In [31]:
print(obj1)
print(obj2)

Index(['NL', 'PE', 'NS', 'NB', 'QC', 'ON', 'MB', 'SK', 'AB', 'BC', 'YT', 'NT',
       'NU'],
      dtype='object', name='province')
Index(['province_name', '2016', '2017', '2018'], dtype='object')
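
An `Index` behaves much like an immutable, ordered collection of labels. The sketch below, on a small illustrative index, shows a membership test, a label-to-position lookup, and what happens when code tries to assign to an element:

```python
import pandas as pd

idx = pd.Index(['NL', 'PE', 'NS'], name='province')

has_pe = 'PE' in idx            # membership test against the labels
position = idx.get_loc('NS')    # integer position of a label

# Index objects do not support item assignment
try:
    idx[0] = 'XX'
    mutated = True
except TypeError:
    mutated = False

print(has_pe, position, mutated)
```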


# Hierarchical indexes in DataFrames

In [32]:
df_hierarch = pd.DataFrame(data=[4, 7, 2, 5, 6],
                           columns=['Data'],
                           index=
                           [['a', 'a', 'b', 'b', 'a'],
                            ['x', 'y', 'x', 'y', 'x']])

df_hierarch

,,Data
a,x,4
a,y,7
b,x,2
b,y,5
a,x,6


In [35]:
df_hierarch.index

MultiIndex([('a', 'x'),
            ('a', 'y'),
            ('b', 'x'),
            ('b', 'y'),
            ('a', 'x')],
           )
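
With a hierarchical index, rows can be selected by the outer level alone or by a full tuple of labels. The following sketch rebuilds the same data under the illustrative name `df_h` and sorts the index first, which is a cautious choice here rather than something the original notebook does:

```python
import pandas as pd

df_h = pd.DataFrame({'Data': [4, 7, 2, 5, 6]},
                    index=[['a', 'a', 'b', 'b', 'a'],
                           ['x', 'y', 'x', 'y', 'x']])

df_sorted = df_h.sort_index()                  # sort by both index levels

outer_a = df_sorted.loc['a']                   # all rows whose first level is 'a'
a_x = df_sorted.loc[('a', 'x'), 'Data']        # rows matching both levels
totals = df_h.groupby(level=0)['Data'].sum()   # aggregate by the first level

print(outer_a)
print(a_x)
print(totals)
```

Because the pair `('a', 'x')` occurs twice in this index, selecting it returns two values rather than a single one.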