In [None]:
import pandas as pd

# Series 

A pandas `Series` is a one-dimensional labeled array that can hold data of any type (integers, strings, floating point numbers, Python objects, etc.). Each value is associated with an index label. A `Series` can be created from:

🔸Python dictionaries
🔸NumPy ndarrays
🔸Scalar values
🔸Lists
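
As a minimal sketch of the four options listed above (the variable names and sample values are illustrative, not taken from the rest of the notebook):

```python
import numpy as np
import pandas as pd

# From a list: a default integer index (0, 1, 2, ...) is assigned
from_list = pd.Series([45, 123, 67])

# From a NumPy ndarray, with explicit index labels
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=['a', 'b', 'c'])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'x': 10, 'y': 20})

# From a scalar: the value is repeated for every label in the index
from_scalar = pd.Series(7, index=['a', 'b', 'c'])

print(from_list, from_array, from_dict, from_scalar, sep='\n\n')
```

Note that a scalar needs an explicit index to define the length of the resulting `Series`.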

In [4]:
# Create a pandas Series using a list

list_ser = [45, 123, 67, 1, 14]

serA = pd.Series(list_ser)

print(serA) # The default index in pandas always starts with 0 (zero).

print(type(serA))

0     45
1    123
2     67
3      1
4     14
dtype: int64
<class 'pandas.core.series.Series'>


In [5]:
serA.index # inspect index

RangeIndex(start=0, stop=5, step=1)

Similarly, `serA.values` returns the underlying data of the Series as a NumPy array.

When you create a Series with an index of strings, the dtype of the index is typically reported as 'object'. pandas builds on NumPy, whose native string dtypes are fixed-width, so pandas has traditionally stored strings under the generic 'object' dtype, which can hold any Python object.

Therefore, when you see dtype='object' in the output of print(serB.index) in the next section, the index labels are still ordinary strings; 'object' only describes how they are stored. Newer pandas versions may report a dedicated string dtype instead.
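
A small check of this behaviour, as a sketch (the name `labels` and its values are illustrative; the exact dtype printed depends on the installed pandas version, so no output is shown):

```python
import pandas as pd

labels = pd.Series([1.0, 2.0], index=['Num1', 'Num2'])

# The index dtype is reported as 'object' in many pandas versions;
# newer releases may report a dedicated string dtype instead.
print(labels.index.dtype)

# Whatever dtype is reported, each individual label is a Python str
print(type(labels.index[0]))
```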

# Creating Series

## Using a list

In [14]:
# Create a Series using the same list as above, but define the index, data type and Series name:
serB = pd.Series(list_ser,
                 index=['Num1', 'Num2', 'Num3', 'Num4', 'Num5'], # indices can be strings
                 dtype='float',
                 name='Numbers')

print(serB)
print(serB.index)
print(serB.values)

Num1     45.0
Num2    123.0
Num3     67.0
Num4      1.0
Num5     14.0
Name: Numbers, dtype: float64
Index(['Num1', 'Num2', 'Num3', 'Num4', 'Num5'], dtype='object')


array([ 45., 123.,  67.,   1.,  14.])

## Using a dictionary

In the next example we will create a `Series` from a dictionary, using the five most populous Canadian provinces. The figures come from the 2017 column of the data on the [Statistics Canada](https://www150.statcan.gc.ca/n1/pub/12-581-x/2018000/pop-eng.htm) website:

In [15]:
population_dict = {'ON': 14193384, 'QC': 8394034, 'BC': 4817160, 'AB': 4286134, 'MB': 1338109}

provinces_population = pd.Series(population_dict, name='Top 5 provinces by population')

provinces_population # the dictionary keys become the indices

ON    14193384
QC     8394034
BC     4817160
AB     4286134
MB     1338109
Name: Top 5 provinces by population, dtype: int64

In [21]:
print(provinces_population['ON']) # population of Ontario

# Select only provinces with a population greater than 5 million.
# This type of selection is called boolean indexing:
print(provinces_population[provinces_population > 5000000])
print(provinces_population[2:4]) # positional slice: the end position is excluded
print(provinces_population['BC':'MB']) # label slice: the end label is included
print('NS' in provinces_population) # membership test on the index (NS, i.e. Nova Scotia)

14193384
ON    14193384
QC     8394034
Name: Top 5 provinces by population, dtype: int64
BC    4817160
AB    4286134
Name: Top 5 provinces by population, dtype: int64
BC    4817160
AB    4286134
MB    1338109
Name: Top 5 provinces by population, dtype: int64
False
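
The two slices above behave differently: slicing by position excludes the end point, while slicing by label includes it. A sketch that makes the distinction explicit with `.iloc` (position-based) and `.loc` (label-based), using the same population figures under the illustrative name `pop`:

```python
import pandas as pd

pop = pd.Series({'ON': 14193384, 'QC': 8394034, 'BC': 4817160,
                 'AB': 4286134, 'MB': 1338109})

# Position-based slice: the stop position (4) is excluded, as in Python lists
by_position = pop.iloc[2:4]

# Label-based slice: the stop label ('MB') is included
by_label = pop.loc['BC':'MB']

print(list(by_position.index))  # ['BC', 'AB']
print(list(by_label.index))     # ['BC', 'AB', 'MB']
```

Using `.iloc` and `.loc` explicitly avoids any ambiguity about which kind of slicing is intended.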


In [23]:
print(provinces_population.sum())
print(provinces_population.mean())

33028821
6605764.2


# DataFrames

A `DataFrame` can be created from a:

- Dictionary of 1-D structures (`ndarray`s, `list`s, dictionaries, tuples or `Series`)
- List of 1-D structures
- 2-D NumPy `ndarray`
- `Series`
- Another `DataFrame`
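
A brief sketch of several of these options (the variable names are illustrative; the values are the first two rows of the example that follows):

```python
import numpy as np
import pandas as pd

# Dictionary of lists: the keys become column names
from_dict = pd.DataFrame({'Age': [8, 10], 'Height': [128.0, 138.9]})

# List of dictionaries: each dictionary becomes one row
from_rows = pd.DataFrame([{'Age': 8, 'Height': 128.0},
                          {'Age': 10, 'Height': 138.9}])

# 2-D ndarray: column names are supplied separately. An ndarray has a
# single dtype, so the Age column is stored as float here.
from_array = pd.DataFrame(np.array([[8, 128.0], [10, 138.9]]),
                          columns=['Age', 'Height'])

# A named Series becomes a single-column DataFrame
from_series = pd.Series([8, 10], name='Age').to_frame()

print(from_dict.dtypes)
print(from_array.dtypes)
```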

## Using a 2-dimensional list

In [42]:
df = pd.DataFrame(data=[[8, 128, 27.5],
                        [10, 138.9, 34.5],
                        [16, 157.3, 91.1],
                        [6, 116.6, 21.4],
                        [14, 159.2, 54.4]],
                    columns= ['Age', 'Height', 'Weight'])

df

,Age,Height,Weight
0,8,128.0,27.5
1,10,138.9,34.5
2,16,157.3,91.1
3,6,116.6,21.4
4,14,159.2,54.4


In [44]:
print(df.index)
print(df.columns)

RangeIndex(start=0, stop=5, step=1)
Index(['Age', 'Height', 'Weight'], dtype='object')


In [46]:
# Use the `info()` method to print summary information about the `DataFrame`:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     5 non-null      int64  
 1   Height  5 non-null      float64
 2   Weight  5 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 252.0 bytes


## Using a dictionary

To create a `DataFrame` from a dictionary, let's start with a simple dictionary that holds the land and water areas of the five Canadian provinces from the previous example (i.e. Ontario, Quebec, British Columbia, Alberta and Manitoba). The data comes from a [Wikipedia page](https://en.wikipedia.org/wiki/Provinces_and_territories_of_Canada).

In [47]:
area = {'province':['ON', 'QC', 'BC', 'AB', 'MB'],
        'area_land': [917741, 1356128, 925186, 642317, 553556],
        'area_water': [158654, 185928, 19549, 19531, 94241]}

provinces_area = pd.DataFrame(area)

provinces_area

,province,area_land,area_water
0,ON,917741,158654
1,QC,1356128,185928
2,BC,925186,19549
3,AB,642317,19531
4,MB,553556,94241


### set_index()

To make the province column the index of this `DataFrame`, use the `set_index()` method. It returns a new `DataFrame` rather than modifying the original, so we assign the result back:

In [49]:
provinces_area = provinces_area.set_index('province')

provinces_area

province,area_land,area_water
ON,917741,158654
QC,1356128,185928
BC,925186,19549
AB,642317,19531
MB,553556,94241
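
As a minimal sketch of how `set_index()` behaves, the example below uses a shortened, two-row version of the area data (the names `areas`, `indexed` and `restored` are illustrative). It shows that the method returns a new `DataFrame` and leaves the original untouched, and that `reset_index()` reverses the operation:

```python
import pandas as pd

areas = pd.DataFrame({'province': ['ON', 'QC'],
                      'area_land': [917741, 1356128]})

# set_index() returns a new DataFrame; `areas` itself is unchanged
indexed = areas.set_index('province')

print(list(areas.columns))    # ['province', 'area_land']
print(list(indexed.columns))  # ['area_land']

# With the province as index, rows can be looked up by label
print(indexed.loc['QC', 'area_land'])  # 1356128

# reset_index() moves the index back into a regular column
restored = indexed.reset_index()
```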


# Loading/Saving DataFrames

For this exercise, we will continue looking into data that describes Canadian provinces. This time, we will use the last three years of data on [Federal Support to all Canadian Provinces and Territories](https://www.fin.gc.ca/fedprov/mtp-eng.asp). All numbers are in millions of dollars.

The dataset used in this section is stored in the `pandas_ex1.csv` file; we assume that the file is in the notebook's working directory.
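
Since the contents of `pandas_ex1.csv` are not shown here, the sketch below illustrates the general save and load round trip with `to_csv()` and `read_csv()` on a shortened version of the area data instead. The file name `areas_example.csv` and the temporary directory are assumptions made for the example only:

```python
import os
import tempfile

import pandas as pd

areas = pd.DataFrame({'province': ['ON', 'QC', 'BC'],
                      'area_land': [917741, 1356128, 925186]})

# Hypothetical file name inside a temporary directory (not pandas_ex1.csv)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'areas_example.csv')

    # index=False keeps the default integer index out of the file
    areas.to_csv(csv_path, index=False)

    # A bare file name would be resolved against the current working
    # directory; here the full path is passed instead
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
print(list(reloaded.columns))
```

Writing with `index=False` avoids an extra unnamed column appearing when the file is read back.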