
# <font color=#14F278>Unit 2 - Pandas DataFrames</font>
---

## <font color=#14F278> 1. Pandas DataFrame - Definition:</font>



In the previous unit we learnt about the concept of Series. Pandas <font color=#14F278>**DataFrames**</font> are a natural continuation of it.

A <font color=#14F278>**DataFrame**</font> is a <font color=#14F278>**2-dimensional array of data**</font>, indexed by rows and columns. Each column in a __DataFrame__ corresponds to a __Series__ object. Put simply, a __DataFrame__ is the pandas term for a table. Just like in any table, each cell is uniquely identified by its row and column index. In that sense:

- Each __DataFrame__ is a <font color=#14F278>**collection of Series**</font>, each having a single data type
- We can think of each pair `(row_index, column_index)` as a <font color=#14F278>**key**</font> of an element in the __DataFrame__
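
The two points above can be sketched on a small table. The column names, row labels and values below are hypothetical, chosen only for illustration; `df.at[row, column]` is one of several pandas accessors for a single cell.

```python
import pandas as pd

# Hypothetical table: two columns, three labelled rows.
df = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "qty": [10, 20, 30]},
    index=["x", "y", "z"],
)

# Selecting one column returns a Series with a single dtype.
price = df["price"]
print(type(price).__name__)  # Series
print(price.dtype)           # float64

# A (row_index, column_index) pair identifies exactly one cell.
cell = df.at["y", "qty"]
print(cell)                  # 20
```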


<center>
    <div>
        <img src="..\images\dataframes_001.png"/>
    </div>
</center>

In [1]:
# Imports
import pandas as pd
import numpy as np
import datetime as dt

---
## <font color=#14F278> 2. Constructing a Pandas DataFrame:</font>

A __DataFrame__ object can be created in many ways. To see why, recall that a simple __Series__ object can be constructed from a list as well as from a dictionary. Now we are handling a 2-dimensional object rather than a 1-dimensional one, so there are two directions in which to build it: horizontally, as a collection of rows, or vertically, as a collection of columns.

---
### <font color=#14F278> 2.1 Constructing a DataFrame with 1 Column:</font>
A __DataFrame__ can have a single column. This looks very much like a Series, but once it is initialised as a DataFrame object, pandas treats it as such: a table with only one column. A single-column DataFrame can be built in 2 ways:

- From a <font color=#14F278>**List**</font> :
    - `pd.DataFrame([item1, item2, ...], columns = ['column1'])` 
    
- From a <font color=#14F278>**Series**</font>:
    - `pd.DataFrame(series_name)`

In [2]:
# Constructing a dataframe from a simple list
df = pd.DataFrame([1,2,3,4], columns=['column1'])
display(df)

Unnamed: 0,column1
0,1
1,2
2,3
3,4


In [3]:
# Constructing a dataframe from a series object
s  = pd.Series({'a':123, 'b':456})
df = pd.DataFrame(s)
display(df)

# the Series has no name, so the column gets the default label 0 - we can rename the column like this
df = df.rename(columns = {0:'column1'})
display(df)

Unnamed: 0,0
a,123
b,456


Unnamed: 0,column1
a,123
b,456
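
As an alternative to renaming the column afterwards, the column name can be set while converting. This is a minimal sketch, assuming the same small Series as above; `Series.to_frame(name=...)` and `Series.rename(...)` are standard pandas methods, while the variable names `df_a` and `df_b` are only illustrative.

```python
import pandas as pd

s = pd.Series({"a": 123, "b": 456})

# Option 1: name the column while converting the Series.
df_a = s.to_frame(name="column1")

# Option 2: name the Series first; the DataFrame constructor reuses that name.
df_b = pd.DataFrame(s.rename("column1"))

print(list(df_a.columns))  # ['column1']
print(df_a.equals(df_b))   # True
```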


---
### <font color=#14F278> 2.2 Constructing a DataFrame with Multiple Columns:</font>
Now let's explore the ways to build a <font color=#14F278>**multi-column DataFrame**</font>. Below is a brief syntax guide on the different ways to do it:

<center>
    <div>
        <img src="..\images\dataframes_002.png"/>
    </div>
</center>


- From a <font color=#14F278>**List of Lists**</font> - each inner list corresponds to a row:
    - `pd.DataFrame([list1, list2, list3, ...], columns = ['column1', ...])` where `list1 = [item1, item2, ...]`
    
    
- From a <font color=#14F278>**List of Dictionaries**</font> - each dictionary corresponds to a row; each key in a dict corresponds to a column name
    - `pd.DataFrame([dict1, dict2, dict3, ...])` where `dict1 = {'column1': value1, 'column2': value2, ...}`
   
   
- From a <font color=#14F278>**Dictionary of List Values**</font> - each key corresponds to a column name; each list corresponds to column values
    - `pd.DataFrame({'column1': list1, 'column2': list2, ...})` where `list1 = [item1, item2, ...]`
    
    
- From a <font color=#14F278>**Dictionary of Series**</font> - each key corresponds to a column name; each Series corresponds to column values
    - `pd.DataFrame({'column1': series1, 'column2': series2, ...})` where `series1 = pd.Series(...)`

In [4]:
# Constructing a dataframe using a list of lists.
data = [[1,2], # row 1
        [3,4], # row 2
        [5,6]] # row 3

df = pd.DataFrame(data, columns=['col1', 'col2'])
display(df)

Unnamed: 0,col1,col2
0,1,2
1,3,4
2,5,6


In [6]:
# From a list of dictionaries
data = [{'a':111, 'b':222}, # row 1
        {'a':333, 'b':444}, # row 2
        {'b':666, 'a':555}] # row 3

df = pd.DataFrame(data)
display(df)

Unnamed: 0,a,b
0,111,222
1,333,444
2,555,666


In [5]:
# From a dictionary of list values - the author's preferred form
data = {'a':[1,2,3,4], 'b':[5,6,7,8]}

df = pd.DataFrame(data)
display(df)

Unnamed: 0,a,b
0,1,5
1,2,6
2,3,7
3,4,8


In [7]:
# From a dictionary of series
s1 = pd.Series([1,2], index=['a','b'])
s2 = pd.Series([3,4], index=['a','b'])

data = {'series1':s1, 'series2':s2}

df = pd.DataFrame(data)
display(df)

Unnamed: 0,series1,series2
a,1,3
b,2,4


In [8]:
# From a dictionary of series - showing index alignment
s1 = pd.Series([1,2], index=['a','b'])
s2 = pd.Series([3,4], index=['b','c'])
data = {'series1':s1, 'series2':s2}

df = pd.DataFrame(data)
display(df)

Unnamed: 0,series1,series2
a,1.0,
b,2.0,3.0
c,,4.0


In [11]:
type(df.iloc[0,1])

numpy.float64
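
The float type above comes from index alignment: where a Series has no value for a row label, pandas fills the cell with `NaN`, which forces an integer column to become `float64`. A minimal sketch of that effect, reusing the same two Series:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([3, 4], index=["b", "c"])

# On its own, each Series holds integers.
print(s1.dtype)

# The DataFrame index is the union of both indexes; gaps become NaN.
df = pd.DataFrame({"series1": s1, "series2": s2})
print(list(df.index))  # ['a', 'b', 'c']

# NaN is a float, so both columns are stored as float64.
print(df.dtypes)

# One missing value in each column.
print(df.isna().sum())
```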

---
## <font color=#14F278> 3. Retrieving DataFrame's Index, Columns and Values:</font>


Obtaining the index and values of a DataFrame works the same way as with Series; a DataFrame additionally exposes its column labels via `.columns`:

In [16]:
# Getting the index
df.index

Index(['a', 'b', 'c'], dtype='object')

In [20]:
# Getting the columns
df.columns

Index(['series1', 'series2'], dtype='object')

In [None]:
# Getting the values
df.values
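
The cell above has no recorded output, so here is a small sketch of what the values look like, using a hypothetical two-column DataFrame. `DataFrame.to_numpy()` is the method the pandas documentation recommends over `.values`; both return a NumPy array here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}, index=["r1", "r2"])

# .values and .to_numpy() both return the data as a 2-D NumPy array.
arr = df.to_numpy()
print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)

# Mixed int/float columns are upcast to one common dtype in the array.
print(arr.dtype)           # float64

print(np.array_equal(arr, df.values))  # True
```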

---
## <font color=#14F278> 4. DataFrame - Column Data Types:</font>

Since a __DataFrame__ is really just a <font color=#14F278>**collection of Series**</font>, we can easily obtain the data types of all columns in the same way we did with Series. To cast a new data type onto a column however, we now have to specify the column name of interest:

In [21]:
# Getting datatypes for each column - note the difference between s.dtype and df.dtypes - there is an extra s
df.dtypes

series1    float64
series2    float64
dtype: object

In [22]:
# Typecasting on a dataframe column
# double square brackets select a one-column DataFrame; df['series1'] (a Series) would work as well - more on this later
df[['series1']] = df[['series1']].astype(str)
df.dtypes

series1     object
series2    float64
dtype: object
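
Two other common ways to cast a column are sketched below on a hypothetical integer DataFrame: single-bracket selection, and passing a `{column: dtype}` mapping to `DataFrame.astype`. The names `df2`, `a` and `b` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Single brackets select one column as a Series; assigning back replaces the column.
df["a"] = df["a"].astype("float64")
print(df["a"].dtype)   # float64

# astype also accepts a {column: dtype} mapping and returns a new DataFrame.
df2 = df.astype({"b": "float64"})
print(df2["b"].dtype)  # float64

# The original DataFrame keeps its integer column "b".
print(df["b"].dtype)
```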

---
## <font color=#14F278> 5. DataFrame Shape:</font>
As we know, the __Shape__ of an object describes its dimensions. With __Series__ we saw that the `.shape` attribute returns a one-element tuple __(x,)__, where x is the number of elements in the Series. 


__DataFrames__ are however 2-dimensional, so we would expect to <font color=#14F278>**obtain information on 2 things**</font> - the number of rows and number of columns in it.

Remember: The output of `df.shape` is <font color=#14F278>**a pair (x,y)**</font> where:
- <font color=#14F278>**x**</font> corresponds to the number of rows
- <font color=#14F278>**y**</font> corresponds to the number of columns

In [23]:
# 2 dimensions - 3 rows, 2 columns.
df.shape

(3, 2)
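
A short side-by-side sketch of the two shapes, on hypothetical data. Since `.shape` is a plain tuple, it can be unpacked directly into row and column counts.

```python
import pandas as pd

s = pd.Series([10, 20, 30])
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A Series is 1-dimensional: its shape is a one-element tuple.
print(s.shape)         # (3,)

# A DataFrame is 2-dimensional: (rows, columns).
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2

# len() of a DataFrame counts rows only.
print(len(df))         # 3
```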

---
## <font color=#14F278> 6. Pandas Index Object:</font>
We encountered the concept of <font color=#14F278>**Index**</font> with both __Series__ and __DataFrames__. 

<font color=#14F278>**Pandas Index**</font> is an <font color=#14F278>**immutable sequence**</font> used for indexing and alignment - the basic object, storing axis labels for all Pandas objects. Think of an index as an immutable list or tuple.

In the context of working with Series and DataFrames, indexes are created automatically unless explicitly specified. Below we show a couple of ways to explicitly construct an __Index__ object via the `pd.Index()` constructor:

In [24]:
# Creating an integer index
idx = pd.Index([1,2,3])
idx

Index([1, 2, 3], dtype='int64')

In [13]:
# Creating a datetime index
pd.Index([dt.datetime(2020,1,1), dt.datetime(2020,1,2)])

DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', freq=None)

In [14]:
# Pandas indexes are immutable objects -- we can't change their values in place.
idx = pd.Index([1,2,3])
idx[1] = 5 # We get an error

TypeError: Index does not support mutable operations
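
Immutability applies to the Index object itself, not to the labels a DataFrame may carry. A minimal sketch, assuming a hypothetical one-column DataFrame: the in-place assignment fails, but a whole new Index can be assigned to `df.index`.

```python
import pandas as pd

idx = pd.Index([1, 2, 3])

# In-place item assignment is rejected.
try:
    idx[1] = 5
except TypeError as err:
    print("TypeError:", err)

# The labels of a DataFrame can still be replaced by assigning a new Index.
df = pd.DataFrame({"a": [10, 20, 30]}, index=idx)
df.index = pd.Index([1, 5, 3])
print(list(df.index))  # [1, 5, 3]

# The original Index object is untouched.
print(list(idx))       # [1, 2, 3]
```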

---
## <font color=#14F278> 7. Summary:</font>

- Pandas __DataFrame__ is a 2-dimensional array of data, indexed by rows and columns
- DataFrames can be constructed in multiple ways - via list of lists, list of dictionaries, dictionary of lists, etc.
- The values, index, column data types and shape of a DataFrame are obtained much as with Series (`.values`, `.index`, `.dtypes`, `.shape`); `.columns` is specific to DataFrames
- Pandas __Index__ Object is an immutable sequence for indexing and alignment of Pandas objects

---
## <font color=#FF8181> 8. Concept Check: </font>

1. What is a pandas DataFrame? How is it different to a Series?
2. What are some useful attributes of a DataFrame instance?
3. What is the data type for a column in a dataframe?
4. Construct the following DataFrame in Pandas:

| |col1|col2|
|-|-|-|
|0|1|2.0|
|1|3|4.0|
|2|5|6.0|

Do this by using:
- list of lists
- list of tuples
- list of dictionaries
- dictionary of lists

In [None]:
# 1) What is a Pandas DataFrame Object? How is it different to a Series?

# A DataFrame is a 2-dimensional array of data, indexed by rows and columns (Unlike Series, which is 1-dimensional).
# Each column of a DataFrame is a Series object, in that sense a DataFrame is a collection of Series.

In [None]:
# 2) What are some useful attributes of a DataFrame instance?

# The concept of DataFrame is a natural continuation of the concept of Series - in that sense, it inherits a lot of rules and
# attributes, applicable to Series:
# - General - Dimension, Shape - .shape, .index, .columns
# - Data related - .values, .dtypes

In [None]:
# 3) What is the data type for each column in a DataFrame?

# Each column in a DataFrame is a Series. In that sense, each DataFrame column has a single data type - int, float, object, etc.

In [28]:
#4 a
data = [[1,2.0], # row 1
        [3,4.0], # row 2
        [5,6.0]] # row 3

df = pd.DataFrame(data, columns=['col1', 'col2'])
display(df)



Unnamed: 0,col1,col2
0,1,2.0
1,3,4.0
2,5,6.0


In [30]:
#4 b
data = [(1,2.0),
        (3,4.0),
        (5,6.0)]



df = pd.DataFrame(data, columns=['col1', 'col2'])
display(df)

Unnamed: 0,col1,col2
0,1,2.0
1,3,4.0
2,5,6.0


In [27]:
#4 c
data = [{'col1':1, 'col2':2.0},
        {'col1':3, 'col2':4.0},
        {'col1':5, 'col2':6.0}]


df = pd.DataFrame(data)
display(df)



Unnamed: 0,col1,col2
0,1,2.0
1,3,4.0
2,5,6.0


In [29]:
#4 d
data = {'col1':[1,3,5], 'col2':[2.0,4.0,6.0]}

df = pd.DataFrame(data)
display(df)

Unnamed: 0,col1,col2
0,1,2.0
1,3,4.0
2,5,6.0
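
As a quick check on question 4, the four constructions can be compared with `DataFrame.equals`, which tests that two DataFrames have the same shape, labels, values and column dtypes. This is a sketch with illustrative variable names, not part of the original exercise.

```python
import pandas as pd

cols = ["col1", "col2"]

from_lists = pd.DataFrame([[1, 2.0], [3, 4.0], [5, 6.0]], columns=cols)
from_tuples = pd.DataFrame([(1, 2.0), (3, 4.0), (5, 6.0)], columns=cols)
from_dicts = pd.DataFrame([
    {"col1": 1, "col2": 2.0},
    {"col1": 3, "col2": 4.0},
    {"col1": 5, "col2": 6.0},
])
from_dict_of_lists = pd.DataFrame({"col1": [1, 3, 5], "col2": [2.0, 4.0, 6.0]})

# Compare every other construction against the list-of-lists version.
others = [from_tuples, from_dicts, from_dict_of_lists]
all_equal = all(from_lists.equals(other) for other in others)
print(all_equal)
```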
