<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png">

Quantconnect

Introduction to Financial Python
</div>

# 01 Data Types and Data Structures

# Introduction

This tutorial provides a basic introduction to the Python programming language. If you are new to Python, you should run the code snippets while reading this tutorial. If you are an advanced Python user, please feel free to skip this chapter.

# Basic Variable Types
The basic types of variables in Python are: strings, integers, floating point numbers and booleans.

Strings in Python are contiguous sequences of characters enclosed in either single quotes (' ') or double quotes (" ").


In [1]:
my_string1 = 'Welcome to'
my_string2 = "QuantConnect"
print(my_string1 + ' ' + my_string2)

Welcome to QuantConnect


An integer is a whole number with no fractional part.

In [2]:
my_int = 10
print(my_int)
print(type(my_int))

10
<class 'int'>


The built-in function int() can convert a string into an integer.

In [3]:
my_string = "100"
print(type(my_string))
my_int = int(my_string)
print(type(my_int))

<class 'str'>
<class 'int'>


A floating point number, or float, represents a real number. In Python, a numeric literal must include a decimal point to be defined as a float.

In [4]:
my_string = "100"
my_float = float(my_string)
print(type(my_float))
flo=3.54
print(type(flo))

<class 'float'>
<class 'float'>


If we don't include a decimal point, the variable is defined as an integer, as with my_int earlier. The built-in function float() can convert a string or an integer into a float.
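
As a small sketch of how these conversions behave with other inputs (the variable names are illustrative, not part of the tutorial): int() truncates a float toward zero, float() accepts an integer, and a string that is not a valid integer literal raises a ValueError.

```python
# int() on a numeric string returns the corresponding integer
from_string = int("42")

# int() on a float discards the fractional part (truncation toward zero)
from_float = int(3.99)

# float() on an integer returns the same value as a float
as_float = float(10)

# A string that is not a valid integer literal raises ValueError
try:
    int("3.5")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(from_string, from_float, as_float, conversion_failed)
```

To turn a string such as "3.5" into an integer, one option is to convert it to a float first and then to an int.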

A boolean, or bool, is a binary variable. Its value can only be True or False. Booleans are useful for logical operations, which are covered in the next chapter.

In [5]:
my_bool = False
print(my_bool)
print(type(my_bool))

False
<class 'bool'>


# Basic Math Operations

The basic math operators in Python are demonstrated below:

In [6]:
print("Addition ", 1+1)
print("Subtraction ", 5-2)
print("Multiplication ", 2*3)
print("Division ", 10/2)
print('exponent', 2**3)

Addition  2
Subtraction  3
Multiplication  6
Division  5.0
exponent 8


In Python 3, the / operator always performs true division and returns a float, even when both operands are integers:

In [7]:
print(1/3)
print(1.0/3)

0.3333333333333333
0.3333333333333333
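
When a whole-number result is wanted instead, Python also provides floor division (//) and the modulo operator (%). The following is a minimal sketch with illustrative variable names:

```python
# Floor division keeps only the whole part of the quotient
quotient = 7 // 2

# Modulo returns the remainder of the division
remainder = 7 % 2

# divmod() returns both values at once as a tuple
both = divmod(7, 2)

# Floor division rounds toward negative infinity, not toward zero
negative_floor = -7 // 2

print(quotient, remainder, both, negative_floor)
```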


# Data Collections

## List
A list is an ordered collection of values. A list is mutable, which means you can change its elements in place without creating a new list. To create a list, put comma-separated values between square brackets.

In [8]:
my_list = ['Quant', 'Connect']
print(my_list)

['Quant', 'Connect']


The values in a list are called "elements". We can access list elements by indexing. Python indexing starts at 0, so in a list of length n the first element has index 0 and the last has index n − 1. The length of a list can be obtained with the built-in function len().

In [9]:
my_list = ['Quant', 'Connect', 1,2,3]
print(len(my_list))
print(my_list[0])
print(my_list[len(my_list) -1])

5
Quant
3
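
Python also accepts negative indices, which count from the end of the list, so the last element can be reached without calling len(). A minimal sketch using the same list:

```python
my_list = ['Quant', 'Connect', 1, 2, 3]

# Index -1 refers to the last element, -2 to the one before it
last = my_list[-1]
second_last = my_list[-2]

# This is equivalent to indexing with len(my_list) - 1
same_as_last = my_list[len(my_list) - 1]

print(last, second_last, same_as_last)
```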


You can also change the elements in the list by accessing an index and assigning a new value.

In [10]:
my_list = ['Quant','Connect',1,2,3]
my_list[2] = 'go'
print(my_list)

['Quant', 'Connect', 'go', 2, 3]


A list can also be sliced with a colon:

In [11]:
my_list = ['Quant','Connect',1,2,3]
print(my_list[1:3])

['Connect', 1]


The slice starts from the first element indicated, but excludes the last element indicated. Here we select all elements starting from index 1, which refers to the second element:

In [12]:
print(my_list[1:])

['Connect', 1, 2, 3]


And all elements up to but excluding index 3:

In [13]:
print(my_list[:3])

['Quant', 'Connect', 1]
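
A slice can also take a third value, the step, and can use negative indices. The sketch below is an illustration on the same list, not part of the original notebook:

```python
my_list = ['Quant', 'Connect', 1, 2, 3]

# Every second element, starting from index 0
every_other = my_list[::2]

# The last two elements
last_two = my_list[-2:]

# A step of -1 walks the list backwards, producing a reversed copy
reversed_copy = my_list[::-1]

print(every_other)
print(last_two)
print(reversed_copy)
```

Slicing always returns a new list, so my_list itself is left unchanged.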


If you wish to add or remove an element from a list, you can use the append() and remove() methods for lists as follows:

In [14]:
my_list = ['Hello', 'Quant']
my_list.append('Hello')
print(my_list)



['Hello', 'Quant', 'Hello']


In [15]:
my_list.remove('Hello')
print(my_list)

['Quant', 'Hello']


When there are repeated instances of "Hello", the first one is removed.

## Tuple
A tuple is a data structure type similar to a list. The difference is that a tuple is immutable, which means you can't change the elements in it once it's defined. We create a tuple by putting comma-separated values between parentheses.

In [16]:
# A tuple is created, its values are immutable.
my_tuple = ('Bienvenido','a','mi','primer','laboratorio','de','algoritmos')

Just like a list, a tuple can be sliced by index.

In [17]:
my_tuple = ('Bienvenido','a','mi','primer','laboratorio','de','algoritmos')
#The tuple is sliced in the same way as a list.
print(my_tuple[1:])

('a', 'mi', 'primer', 'laboratorio', 'de', 'algoritmos')
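
To see what immutability means in practice, the following sketch (with a short illustrative tuple) tries to assign to an element. Python raises a TypeError, so a modified tuple has to be built as a new object:

```python
my_tuple = ('a', 'b', 'c')

# Assigning to an element of a tuple is not allowed
try:
    my_tuple[0] = 'z'
    assignment_allowed = True
except TypeError:
    assignment_allowed = False

# A "changed" tuple must be created as a new tuple
new_tuple = ('z',) + my_tuple[1:]

print(assignment_allowed)
print(my_tuple, new_tuple)
```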


## Set
A set is an **unordered**  collection with **no duplicate** elements. The built-in function **set()** can be used to create sets.

In [18]:
stock_list = ['A','A','A','B','B','C','C','CC']
# A set is created using the set() function.
stock_set = set(stock_list)
# Prints the set, which contains the values of stock_list which don't repeat themselves.
print(stock_set)

{'A', 'CC', 'B', 'C'}


Converting a list to a set is an easy way to remove duplicate elements. Because a set is unordered, the order in which its elements are printed may vary.
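
If a predictable order is needed after removing duplicates, one option is to pass the set to sorted(), which returns a list. A minimal sketch using the same data:

```python
stock_list = ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'CC']

# set() drops duplicates; sorted() returns them as an ordered list
unique_sorted = sorted(set(stock_list))

# Membership tests work directly on a set
stock_set = set(stock_list)
has_cc = 'CC' in stock_set
has_d = 'D' in stock_set

print(unique_sorted, has_cc, has_d)
```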

## Dictionary
A dictionary is one of the most important data structures in Python. Unlike sequences, which are indexed by integers, dictionaries are indexed by keys, which can be any immutable (hashable) type, such as strings or numbers.

A dictionary is a collection of key : value pairs, with the requirement that the keys are unique. Since Python 3.7, dictionaries preserve the order in which keys were inserted. We create a dictionary by placing a comma-separated list of key : value pairs within braces.

In [19]:
my_dic = {'clave1':'valor1', 'clave2':'valor2', 'clave3':'valor3'}

After defining a dictionary, we can access any value by indicating its key in brackets.

In [20]:
# We obtain a value from the previously assigned key
print(my_dic['clave1'])

valor1


We can also change the value associated with a specified key:

In [21]:
'''
We access the values of the dictionary by indicating the assigned key.
Unlike the elements of a tuple, these values can be changed.
'''
my_dic['clave1'] = 'valor34'
print(my_dic['clave1'])

valor34


The keys of a dictionary can be retrieved with the keys() method:

In [22]:
# We can obtain the dictionary keys with the .keys() method.
print(my_dic.keys())

dict_keys(['clave1', 'clave2', 'clave3'])


The dictionary method keys() returns a view object (dict_keys) containing all the keys used in the dictionary. Wrap it in list() to obtain a regular list.
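
Two other standard dictionary methods are often useful alongside keys(): items(), which yields key and value together, and get(), which returns a default instead of raising an error when a key is missing. The sketch below uses a hypothetical price dictionary and assumes Python 3.7 or later for the insertion order:

```python
# Hypothetical ticker -> price mapping, for illustration only
prices = {'AAPL': 144.09, 'GOOG': 911.71}

# The "in" operator tests whether a key exists
has_msft = 'MSFT' in prices

# get() returns the default (here 0.0) when the key is missing
msft_price = prices.get('MSFT', 0.0)

# items() yields (key, value) pairs in insertion order
lines = []
for ticker, price in prices.items():
    lines.append('{} -> {}'.format(ticker, price))

print(has_msft, msft_price)
print(lines)
```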

# Common String Operations
A string is an immutable sequence of characters. It can be sliced by index just like a tuple:

In [23]:
#We define a character string using quotation marks.
my_str = 'Bienvenido a mi primer laboratorio de Algoritmos'
# We can slice it like a list. Like a tuple, it is immutable.
print(my_str[13:])

mi primer laboratorio de Algoritmos


There are many methods associated with strings. We can use string.count() to count the occurrences of a substring in a string, string.find() to return the index of the first occurrence of a substring, and string.replace() to replace one substring with another.

In [24]:
#The count() method allows us to count the occurrences of a substring in the string.
print('Contando cuantas veces está la subcadena "as" en esta frase'.count('as'))
#The find() method allows searching for the index of the first occurrence of a substring.
print('Primera vez que aparece "as" en esta cadena'.find('as'))
#The replace method allows us to replace one substring with another.
print('Todas las "a" se reemplazarán por "u"'.replace('a','u'))

3
25
Todus lus "u" se reempluzurán por "u"


One of the most commonly used string methods is string.split(). It splits the string at the indicated separator and returns a list:

In [25]:
Time = '2022-08-20 18:21:00'
#The split method receives a substring and uses it to split items in a list.
#We will use a space to divide date and time
splited_list = Time.split(' ')
date = splited_list[0]
time = splited_list[1]
print(date, time)
#We can split the time using ":", obtaining separately: hours, minutes and seconds.
#We obtain the hour by indexing the first element of the list.
hour = time.split(':')[0]
print(hour)

2022-08-20 18:21:00
18
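
The pieces returned by split() are still strings. As a sketch of a common next step (variable names are illustrative), they can be converted to integers with int() and recombined with the join() method, which is the inverse of split():

```python
timestamp = '2022-08-20 18:21:00'

# Unpack the two pieces produced by splitting on the space
date_part, time_part = timestamp.split(' ')

# Split the date again and convert each piece to an integer
date_pieces = date_part.split('-')
year = int(date_pieces[0])
month = int(date_pieces[1])
day = int(date_pieces[2])

# join() glues a list of strings back together with a separator
slash_date = '/'.join(date_pieces)

print(year, month, day)
print(slash_date)
```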


We can insert the values of variables into a string. This is called string formatting.

In [26]:
# We can format the string using curly braces and the .format() method.
my_time = 'Hora: {}, Minutos:{}'.format('18','26')
print(my_time)

Hora: 18, Minutos:26


Another way to format a string is to use the % symbol.

In [27]:
# We use the symbol %, followed by different defined characters, to format the string
print('Tomaremos la gravedad como %f m / s^2'%9.81)
print('Mi nombre es %s y mi apellido es %s'%('Daniela','Tocua'))

Tomaremos la gravedad como 9.810000 m / s^2
Mi nombre es Daniela y mi apellido es Tocua
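
A third option, available from Python 3.6 onward, is the formatted string literal (f-string), which places expressions directly inside the braces. The sketch below is an addition to the tutorial and uses illustrative values:

```python
hour = 18
minute = 5
gravity = 9.81

# Expressions inside {} are evaluated; :02d pads an integer to two digits
clock = f'Hour: {hour}, Minutes: {minute:02d}'

# :.1f formats a float with one digit after the decimal point
rounded = f'Gravity is about {gravity:.1f} m/s^2'

print(clock)
print(rounded)
```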


# Summary

We have seen the basic data types and data structures in Python. It's important to keep practicing to become familiar with them. In the next tutorial, we will cover logical operations and for and while loops in Python.


# 02 Logical Operations and Loops

# Introduction
We discussed the basic data types and data structures in Python in the last tutorial. This chapter covers logical operations and loops in Python, which are very common in programming.

# Logical Operations
Like most programming languages, Python has comparison operators:

In [28]:
# We use comparison operators to evaluate whether a statement is true or false.
print(1 == 0)
print(1 == 1)
print(1 != 0)
print(5 >= 5)
print(5 >= 6)

False
True
True
True
False


Each statement above has a boolean value, which must be either True or False, but not both.

We can combine simple statements P and Q to form complex statements using logical operators:

- The statement "P and Q" is true if both P and Q are true, otherwise it is false.
- The statement "P or Q" is false if both P and Q are false, otherwise it is true.
- The statement "not P" is true if P is false, and vice versa.

In [29]:
'''
We can use the operators "and" and "or" to form complex propositions.
An "and" statement is true only if both operands are true.
An "or" statement is true if at least one operand is true.
'''
print(2 > 1 and 3 > 2)
print(2 > 1 and 3 < 2) 
print(2 > 1 or 3 < 2)
print(2 < 1 and 3 < 2)

True
False
True
False
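
The three rules for "and", "or" and "not" can be tabulated by evaluating every combination of P and Q. This is a small sketch; it uses for loops, which are introduced later in this chapter:

```python
# Each row holds: P, Q, P and Q, P or Q, not P
rows = []
for p in (True, False):
    for q in (True, False):
        rows.append((p, q, p and q, p or q, not p))

for row in rows:
    print(row)
```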


When dealing with a complex logical statement that involves several statements, we can use parentheses to separate and combine them.

In [30]:
# Parentheses are used to prioritize the operations performed.
print((3 > 2 or 1 < 3) and (1!=3 and 4>3) and not ( 3 < 2 or 1 < 3 and (1!=3 and 4>3)))
print(3 > 2 or 1 < 3 and (1!=3 and 4>3) and not ( 3 < 2 or 1 < 3 and (1!=3 and 4>3)))

False
True


Comparing the two statements above, we can see that it's wise to use parentheses when writing a complex logical statement.
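
The two results differ because of operator precedence: without parentheses, Python evaluates "not" first, then "and", then "or". A minimal sketch with illustrative variable names:

```python
# "and" binds tighter than "or": this is True or (False and False)
without_parentheses = True or False and False

# Parentheses force the "or" to be evaluated first
with_parentheses = (True or False) and False

# "not" binds tighter than "and": this is (not False) and False
negation_first = not False and False

print(without_parentheses, with_parentheses, negation_first)
```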

# If Statement
An if statement executes a segment of code only if its condition is true. A standard if statement consists of 3 segments: if, elif and else.

```python
if statement1:
    # if statement1 is true, execute the code here.
    # code.....
    # code.....
elif statement2:
    # if statement1 is false and statement2 is true, execute the code here.
    # code......
    # code......
else:
    # if none of the above statements is true, execute the code here.
    # code......
```

An if statement doesn't necessarily have elif and else parts. If they are omitted, the indented block of code is executed when the condition is true; otherwise the whole if statement is skipped.

In [31]:
i = 0
# Checks the truth value of i==0 and executes the code inside the statement if it is true
if i == 0:
    print('i==0 es verdadero')

i==0 es verdadero


As we mentioned above, we can write some complex statements here:

In [32]:
# We store Boolean values in p and q
p = 0 > 1
q = 2 > 3
'''
We use if, elif, and else statements to execute operations
from the truth value of these
'''
if p and q:
    print('p y q es verdadero')
elif p and not q:
    print('q es falsa')
elif q and not p:
    print('p es falsa')
else:
    print('Ni p ni q es verdadera')

Ni p ni q es verdadera


# Loop Structure
Loops are an essential part of programming. The "for" and "while" loops run a block of code repeatedly.

## While Loop
A "while" loop runs repeatedly as long as its condition remains true.

In [33]:
i = 0
# The loop repeats until the evaluated condition (i < 5) becomes false.
# We add 1 to the variable i on each pass, so the loop iterates 5 times.
while i < 5:
    print(i)
    i += 1 

0
1
2
3
4


When making a while loop, we need to ensure that something changes from iteration to iteration so that the loop eventually terminates; otherwise, it will run forever. Here we used i += 1 (short for i = i + 1) to increase i after each iteration. This is the most common way to control a while loop.
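
A while loop is also a natural fit when the number of iterations is not known in advance. The sketch below uses hypothetical numbers (a balance of 100 growing 10% per year) and counts how many years it takes to double:

```python
# Hypothetical starting balance and growth rate, for illustration only
balance = 100.0
years = 0

# The condition depends on balance, which changes on every pass,
# so the loop is guaranteed to terminate
while balance < 200.0:
    balance *= 1.10
    years += 1

print(years, round(balance, 2))
```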

## For Loop
A "for" loop iterates over a sequence of values and terminates when the sequence is exhausted.

In [34]:
# This for loop iterates through the elements of a list and prints them.
for i in [0,1,1,2,3,5,8]:
    print(i)

0
1
1
2
3
5
8


We can also add if statements in a for loop. Here is a real example from our pairs trading algorithm:

In [35]:
stocks = ['Ecopetrol','Bancolombia','Nutresa','Bavaria','EPM','Alpina']
selected = ['Ecopetrol','Nestlé']
new_list = []
'''
Iterate over the elements of the list stocks with a for loop.
We add to a new list the elements that are not in selected (using an if).
'''
for i in stocks:
    if i not in selected:
        new_list.append(i)
print(new_list)

['Bancolombia', 'Nutresa', 'Bavaria', 'EPM', 'Alpina']


Here we iterated over all the elements in the list 'stocks'. Later in this chapter, we will introduce a more concise way to do this in a single line of code.

## Break and continue
These are two commonly used commands in a for loop. If "break" is triggered while a loop is executing, the loop will terminate immediately:

In [36]:
my_stocks = ['inversion1','inversion2','inversion3','inversion4','inversion5','inversion6', 'inversion7', 'inversion8']
for i in my_stocks:
    print(i)
    if i == 'inversion5':
        # The break statement forces the loop to end
        break

inversion1
inversion2
inversion3
inversion4
inversion5


The "continue" command tells the loop to end this iteration and skip to the next iteration:

In [37]:
my_stocks = ['inversion1','inversion2','inversion3','inversion4','inversion5','inversion6', 'inversion7', 'inversion8']
for i in my_stocks:
    if i == 'inversion5':
        # The continue statement ends the current iteration and proceeds to the next one.
        continue
    print(i)

inversion1
inversion2
inversion3
inversion4
inversion6
inversion7
inversion8
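
A typical use of break is to stop searching as soon as a match is found. The sketch below, with a hypothetical price list and threshold, records the first price above the threshold and skips the rest of the list:

```python
# Hypothetical prices and threshold, for illustration only
prices = [143.73, 145.83, 143.68, 144.02]
threshold = 144
first_above = None
checked = 0

for price in prices:
    checked += 1
    if price > threshold:
        first_above = price
        # No need to look at the remaining prices
        break

print(first_above, checked)
```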


# List Comprehension
List comprehension is a Pythonic way to create lists. A common application is to make a new list where each element is the result of some operation applied to each member of another sequence. For example, we can create a list of squares using a for loop:

In [40]:
squares = []
# Creates a list with a for loop
for i in [1,2,3,4,5]:
    squares.append(i**2)
print(squares)

[1, 4, 9, 16, 25]


Using list comprehension:

In [41]:
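# Note: naming a variable "list" shadows the built-in list() function; avoid this in real code.
list = [1,2,3,4,5]
'''
We can create the same list using list comprehension.
We apply the **2 operation to each member of the list and collect the results in squares.
'''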
squares = [x**2 for x in list]
print(squares)

[1, 4, 9, 16, 25]


Recall the example above where we used a for loop to select stocks. Here we use list comprehension:

In [42]:
stocks = ['Tesla','Chocorramo','Nestlé','Gansito']
selected = ['Chocorramo','Nestlé']
# We use a list comprehension with an if clause to keep only the elements of stocks that are in selected.
new_list = [x for x in stocks if x in selected]
print(new_list)

['Chocorramo', 'Nestlé']


A list comprehension consists of square brackets containing an expression followed by a "for" clause, then zero or more additional "for" or "if" clauses. For example:

In [43]:
# Generates the Cartesian product of two lists, excluding pairs whose two elements are equal.
print([(x, y) for x in ["a","b","c"] for y in ["a","d","b"] if x != y])
# Performs the same operation, but concatenates each pair into a single string.
print([str(x)+' vs '+str(y) for x in ['Tesla','Chocorramo','Nestlé','Gansito'] for y in ['Gansito','Chocorramo','Quipitos'] if x!=y])

[('a', 'd'), ('a', 'b'), ('b', 'a'), ('b', 'd'), ('c', 'a'), ('c', 'd'), ('c', 'b')]
['Tesla vs Gansito', 'Tesla vs Chocorramo', 'Tesla vs Quipitos', 'Chocorramo vs Gansito', 'Chocorramo vs Quipitos', 'Nestlé vs Gansito', 'Nestlé vs Chocorramo', 'Nestlé vs Quipitos', 'Gansito vs Chocorramo', 'Gansito vs Quipitos']


List comprehension is an elegant way to organize one or more for loops when creating a list.
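
To make the correspondence explicit, the sketch below writes the first comprehension from the previous cell as nested for loops. The clauses of the comprehension appear in the same order as the loops, from outermost to innermost:

```python
# Nested-loop version
pairs_loop = []
for x in ['a', 'b', 'c']:
    for y in ['a', 'd', 'b']:
        if x != y:
            pairs_loop.append((x, y))

# Equivalent list comprehension
pairs_comp = [(x, y) for x in ['a', 'b', 'c'] for y in ['a', 'd', 'b'] if x != y]

print(pairs_loop == pairs_comp)
```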

# Summary
This chapter has introduced logical operations, loops, and list comprehension. In the next chapter, we will introduce functions and object-oriented programming, which will enable us to make our code cleaner and more versatile.


# 03 Functions and Object-Oriented Programming

# Introduction

In the last tutorial we introduced logical operations, loops and list comprehension. We will introduce functions and object-oriented programming in this chapter, which will enable us to build complex algorithms in more flexible ways.

# Functions
A function is a reusable block of code. We can use a function to output a value, or do anything else we want. We can easily define our own function by using the keyword "def".

In [44]:
'''
Defines a function using the def keyword;
its parameters are listed inside parentheses.
When executed, the function returns the product of the arguments x and y.
'''

def product(x,y):
    return x*y
print(product(12,12))
print(product(4,64))

144
256


The keyword "def" is followed by the function name and the parenthesized list of formal parameters. The statements that form the body of the function start at the next line, and must be indented. The product() function above has "x" and "y" as its parameters. A function doesn't necessarily have parameters:

In [45]:
# This function returns nothing, it performs an operation without arguments
# When executed, the function prints "Bienvenido a mi laboratorio"
def dar_bienvenida():
    print('Bienvenido a mi laboratorio')
dar_bienvenida()

Bienvenido a mi laboratorio
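
Parameters can also be given default values, which makes them optional when the function is called. The function below is a hypothetical example (rate_of_return and its parameter names are not part of the tutorial):

```python
def rate_of_return(open_price, close_price, as_percent=False):
    """Return the simple return between two prices.

    as_percent defaults to False, so it can be omitted by the caller.
    """
    result = close_price / open_price - 1
    if as_percent:
        return result * 100
    return result

# Positional arguments only: the default for as_percent is used
fraction = rate_of_return(100, 102)

# A keyword argument overrides the default
percent = rate_of_return(100, 102, as_percent=True)

print(round(fraction, 4), round(percent, 2))
```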


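# Built-in Functions
**range()** is a function that creates an arithmetic sequence of integers. In Python 3 it returns a lazy range object rather than a list, which is why printing it shows range(...) instead of the individual values. It's often used in for loops. The arguments must be integers. If the "step" argument is omitted, it defaults to 1.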

In [46]:
#Print a range between two numbers at a given step

# If only one argument is given, the range goes from 0 to that argument.
print(range(10))
# If 2 arguments are received, the range starts from the first one and ends at the second one excluding it
print(range(1,11))
# If 3 arguments are received, the numbers change with the given step (default is 1)
print(range(1,11,2))

range(0, 10)
range(1, 11)
range(1, 11, 2)
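
To see the numbers a range contains, it can be converted with list(). The sketch below assumes the name list still refers to the built-in function, so run it in a fresh session or after any variable named list has been deleted:

```python
# One argument: 0 up to, but not including, the argument
zero_to_nine = list(range(10))

# Three arguments: start, stop (excluded) and step
odd_numbers = list(range(1, 11, 2))

# A negative step counts downwards
countdown = list(range(10, 0, -3))

print(zero_to_nine)
print(odd_numbers)
print(countdown)
```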


**len()** is another function used together with range() to create a for loop. This function returns the length of an object. The argument must be a sequence or a collection.

In [47]:
# The len() function returns the length of an iterable object
tickers = ['elemento1','elemento2','elemento3','elemento4','elemento5']
print('El tamaño de la lista es {}'.format(len(tickers)))
# Prints the elements of the tickers list using a for loop with range len(tickers)
for i in range(len(tickers)):
    print(tickers[i])

# This can be done in a pythonic way as:
for ticker in tickers: print("ticker: "+ticker)

El tamaño de la lista es 5
elemento1
elemento2
elemento3
elemento4
elemento5
ticker: elemento1
ticker: elemento2
ticker: elemento3
ticker: elemento4
ticker: elemento5


Note: As the second loop shows, when the index itself is not needed it is simpler to iterate over the list directly, for example "for ticker in tickers: print(ticker)".
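
When both the position and the element are needed, the built-in enumerate() function is a common alternative to range(len(...)). A minimal sketch with hypothetical tickers:

```python
# Hypothetical ticker list, for illustration only
tickers = ['AAPL', 'GOOG', 'MSFT']

numbered = []
# enumerate() yields (position, element) pairs, starting at 0
for position, ticker in enumerate(tickers):
    numbered.append('{}: {}'.format(position, ticker))

print(numbered)
```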

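**map()** is a function that applies a specific function to every item of a sequence or collection. In Python 3 it returns an iterator, so we wrap the result in list() to see the values.

Earlier in this notebook the name list was assigned to [1,2,3,4,5], which shadows the built-in list() function. We delete that variable with del so that list refers to the built-in again: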

In [48]:
print(list)
# Using the del keyword we delete the variable named list, which was shadowing the built-in list function
del list
list

[1, 2, 3, 4, 5]


list

In [49]:
#With map() we apply an operation to all the elements of an iterable
tickers = ['Ecopetrol','Bancolombia','Nutresa','Bavaria','EPM','Alpina']
list(map(len,tickers))

[9, 11, 7, 7, 3, 6]


The **lambda operator** is a way to create small anonymous functions. These functions are typically used only where they are defined. For example:

In [51]:
# With the lambda operator we create an anonymous function, which in this case is executed on the variable x, which varies from 0 to 9.
list(map(lambda x: x**2, range(10)))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

map() can be applied to more than one list. If the lists differ in length, map() stops when the shortest one is exhausted.

In [52]:
# use map in 2 lists to apply the anonymous function x + y 
list(map(lambda x, y: x+y, [1,2,3,4,5],[5,4,3,2,1]))

[6, 6, 6, 6, 6]
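
The sketch below illustrates two points about map(): a list comprehension produces the same result as map() with a lambda, and with lists of unequal length map() stops at the shortest one. The inputs are illustrative:

```python
# map() with a lambda and the equivalent list comprehension
squares_map = list(map(lambda x: x**2, range(5)))
squares_comp = [x**2 for x in range(5)]

# With unequal lengths, the extra element (3) is silently ignored
truncated = list(map(lambda x, y: x + y, [1, 2, 3], [10, 20]))

print(squares_map == squares_comp)
print(truncated)
```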

**sorted()** takes an iterable, such as a list or set, and returns a new sorted list.

In [53]:
# Sorted method returns a new list with the original list elements sorted
sorted([5,2,3,4,1])

[1, 2, 3, 4, 5]

We can add a "key" parameter to specify a function to be called on each list element prior to making comparisons. For example:

In [54]:
# Using the optional key argument we can sort the list by the second element of each tuple
price_list = [('AAPL',144.09),('GOOG',911.71),('MSFT',69),('FB',150),('WMT',75.32)]
sorted(price_list, key = lambda x: x[1])

[('MSFT', 69), ('WMT', 75.32), ('AAPL', 144.09), ('FB', 150), ('GOOG', 911.71)]

By default the values are sorted in ascending order. We can change it to descending by adding the optional parameter "reverse".

In [55]:
#Using the optional reverse argument we can organize the list in descending order
price_list = [('AAPL',144.09),('GOOG',911.71),('MSFT',69),('FB',150),('WMT',75.32)]
sorted(price_list, key = lambda x: x[1],reverse = True)

[('GOOG', 911.71), ('FB', 150), ('AAPL', 144.09), ('WMT', 75.32), ('MSFT', 69)]

Lists also have a method list.sort(). This method takes the same "key" and "reverse" arguments as sorted(), but it sorts the list in place and returns None instead of a new list.

In [56]:
# The sort() method does not create a new list; it sorts the original list in place.
price_list = [('AAPL',144.09),('GOOG',911.71),('MSFT',69),('FB',150),('WMT',75.32)]
price_list.sort(key = lambda x: x[1])
print(price_list)

[('MSFT', 69), ('WMT', 75.32), ('AAPL', 144.09), ('FB', 150), ('GOOG', 911.71)]
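
The difference between the two matters when the result is assigned to a variable. In the sketch below (with a shortened price list), sorted() leaves the original untouched, while sort() changes it and returns None:

```python
price_list = [('AAPL', 144.09), ('GOOG', 911.71), ('MSFT', 69)]

# sorted() returns a new list; price_list keeps its original order
ascending = sorted(price_list, key=lambda x: x[1])
original_first = price_list[0]

# sort() reorders price_list itself and returns None
result = price_list.sort(key=lambda x: x[1], reverse=True)

print(ascending)
print(result)
print(price_list)
```

Writing price_list = price_list.sort() is therefore a common mistake, because it replaces the list with None.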


# Object-Oriented Programming
Python is an object-oriented programming language. It's important to understand the concept of "objects" because almost every kind of data from the QuantConnect API is an object.

## Class
A class is a type of data, just like a string, float, or list. When we create an object of that data type, we call it an instance of a class.

In Python, everything is an object - everything is an instance of some class. The data stored inside an object are called attributes, and the functions which are associated with the object are called methods.

For example, as mentioned above, a list is an object of the "list" class, and it has a method list.sort().

We can create our own objects by defining a class. We would do this when it's helpful to group related data and functions together. For example, we define a class named "stock" here:

In [57]:
# Defines a stock class, its attributes and methods
class stock:
    def __init__(self, ticker, open, close, volume):
        self.ticker = ticker
        self.open = open
        self.close = close
        self.volume = volume
        self.rate_return = float(close)/open - 1
 
    def update(self, open, close):
        self.open = open
        self.close = close
        self.rate_return = float(self.close)/self.open - 1
 
    def print_return(self):
        print(self.rate_return)

The "stock" class has the attributes "ticker", "open", "close", "volume" and "rate_return". Inside the class body, the first method is called __init__, which is a special method. When we create a new instance of the class, the __init__ method is immediately executed with all the parameters that we pass to the "stock" object. The purpose of this method is to set up a new "stock" object using the data we have provided.

Here we create two stock objects named "nestle" and "google".

In [63]:
nestle = stock('NESN', 143.69, 144.09, 20109375)
google = stock('GOOG', 898.7, 911.7, 1561616)

stock objects also have two other methods: update() and print_return(). We can access the attributes of a stock object and call its methods:

In [64]:
nestle.ticker
google.print_return()
google.update(912.8,913.4)
google.print_return()

0.014465338822744034
0.0006573181419806673


By calling the update() method, we updated the open and close prices of a stock. Please note that when we use the attributes or call the methods **inside a class**, we need to specify them as self.attribute or self.method(); otherwise Python will treat them as ordinary local or global names and will raise an error if no such name exists.

We can add an attribute to an object anywhere:

In [65]:
nestle.ceo = 'Mark Schneider'
nestle.ceo

'Mark Schneider'

We can check what names (i.e. attributes and methods) are defined on an object using the dir() function:

In [66]:
dir(nestle)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'ceo',
 'close',
 'open',
 'print_return',
 'rate_return',
 'ticker',
 'update',
 'volume']
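
Most of the names returned by dir() are special methods that every object has. As a sketch of how to list only the names we defined ourselves, the example below uses a small hypothetical class (Quote is not part of the tutorial) and filters out names that start with a double underscore. The built-in vars() function returns just the instance attributes as a dictionary:

```python
class Quote:
    """Hypothetical minimal class, used only for this illustration."""
    def __init__(self, ticker, price):
        self.ticker = ticker
        self.price = price

q = Quote('GOOG', 911.7)

# Keep only the names that are not special (double-underscore) names
public_names = [name for name in dir(q) if not name.startswith('__')]

# vars() returns the instance attributes and their current values
attributes = vars(q)

print(public_names)
print(attributes)
```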

## Inheritance
Inheritance is a way of arranging classes in a hierarchy from the most general to the most specific. A "child" class is a more specific type of a "parent" class because a child class inherits all the attributes and methods of its parent. For example, we define a class named "child" which inherits from "stock":

In [62]:
# We create the class child, which inherits from stock
class child(stock):
    def __init__(self,name):
        self.name = name

In [None]:
#Child class inherits attributes and methods from parent class
aa = child('aa')
print(aa.name)
aa.update(100,102)
print(aa.open)
print(aa.close)
aa.print_return()

aa
100
102
0.020000000000000018


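As seen above, the new class child has inherited the methods from stock. Note that child defines its own __init__, which replaces the parent's, so attributes such as open and close exist only after update() has been called.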
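
When a child class needs the parent's setup as well as its own, a common pattern is to call the parent's __init__ through super(). The sketch below uses hypothetical classes (Stock and NamedStock, with simplified parameters) rather than the ones defined in this notebook:

```python
class Stock:
    """Simplified parent class, for illustration only."""
    def __init__(self, ticker, open_price, close_price):
        self.ticker = ticker
        self.open = open_price
        self.close = close_price
        self.rate_return = close_price / open_price - 1

class NamedStock(Stock):
    """Child class that adds a name on top of the parent's attributes."""
    def __init__(self, name, ticker, open_price, close_price):
        # Run the parent's __init__ first so its attributes are set
        super().__init__(ticker, open_price, close_price)
        self.name = name

example = NamedStock('Example Corp', 'EXMP', 100.0, 102.0)
print(example.name, example.ticker, round(example.rate_return, 4))
```

With this pattern the object has both the inherited attributes and the new one as soon as it is created.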

# Summary

In this chapter we have introduced functions and classes. When we write a QuantConnect algorithm, we define it as a class that inherits from QCAlgorithm. This means our algorithm inherits the QC API methods from the QCAlgorithm class.

In the next chapter, we will introduce NumPy and Pandas, which enable us to conduct scientific calculations in Python.


# 04 NumPy and Basic Pandas

# Introduction

Now that we have introduced the fundamentals of Python, it's time to learn about NumPy and Pandas.

# NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It also has strong integration with Pandas, which is another powerful tool for manipulating financial data.

Python packages like NumPy and Pandas contain classes and methods which we can use by importing the package:

In [68]:
import numpy as np

## Basic NumPy Arrays
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. Here we make an array by passing a list of Apple stock prices:

In [69]:
#We create an array from a list using np.array()
price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list)
print(price_array, type(price_array))

[143.73 145.83 143.68 144.02 143.5  142.62] <class 'numpy.ndarray'>


Notice that the type of array is "ndarray" which is a multi-dimensional array. If we pass np.array() a list of lists, it will create a 2-dimensional array.

In [70]:
Ar = np.array([[1,3],[2,4]])
print(Ar, type(Ar))

[[1 3]
 [2 4]] <class 'numpy.ndarray'>


We get the dimensions of an ndarray using the .shape attribute:

In [71]:
# Shape attribute contains array dimensions
print(Ar.shape)

(2, 2)


If we create a 2-dimensional array (i.e. a matrix), each row can be accessed by index:

In [72]:
print(Ar[0])
print(Ar[1])

[1 3]
[2 4]


If we want to access the matrix by column instead:

In [73]:
# We index by column; each result is itself an ndarray
print('the first column: ', Ar[:,0])
print('the second column: ', Ar[:,1])

the first column:  [1 2]
the second column:  [3 4]


## Array Functions
NumPy has built-in functions that allow us to perform calculations on arrays. For example, we can apply the natural logarithm to each element of an array:

In [74]:
# log function is applied to each element of the array
print(np.log(price_array))

[4.96793654 4.98244156 4.9675886  4.96995218 4.96633504 4.96018375]
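
One common use of the element-wise logarithm in finance is computing log returns, the differences between consecutive log prices. The sketch below is an addition to the tutorial; it uses np.diff(), which returns the differences between adjacent elements:

```python
import numpy as np

price_array = np.array([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

# Difference of consecutive log prices: log(p_t) - log(p_{t-1})
log_returns = np.diff(np.log(price_array))

# Log returns add up to the log return over the whole period
total_log_return = log_returns.sum()

print(log_returns)
print(total_log_return)
```

The result has one element fewer than the price array, because the first price has no predecessor.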


Other functions return a single value:

In [75]:
# mean: sum(price_array) / len(price_array)
print(np.mean(price_array))
# std: sqrt(((a_0 - mean)**2 + (a_1 - mean)**2 + ... + (a_(n-1) - mean)**2) / n), where mean = np.mean(price_array) and n = len(price_array)
print(np.std(price_array))
# sum: a_0 + a_1 + ... + a_(n-1), where a_i belongs to price_array and 0 <= i < len(price_array)
print(np.sum(price_array))
# max: returns the largest value in the array
print(np.max(price_array))

143.89666666666668
0.9673790478515796
863.38
145.83


The functions above return the mean, standard deviation, total and maximum value of an array.

# Pandas
Pandas is one of the most powerful tools for dealing with financial data. 

First we need to import Pandas:

In [76]:
import pandas as pd

## Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

We create a Series by calling pd.Series(data), where data can be a dictionary, an array or just a scalar value.

In [77]:
# We create a Series from a list using the pd.Series() constructor
price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price)
s

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
dtype: float64

We can customize the indices of a new Series:

In [79]:
# Using the optional index argument we can customize the index labels
s = pd.Series(price,index = ['q','w','e','r','t','y'])
s

q    143.73
w    145.83
e    143.68
r    144.02
t    143.50
y    142.62
dtype: float64

Or we can change the indices of an existing Series:

In [81]:
# By assigning to the index attribute we can customize the index
s.index = [6,5,4,3,2,1]
s

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64

A Series is like a list in that it can be sliced by position:

In [82]:
# Series can be sliced like lists
print(s[1:])
print(s[:-2])

5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64
6    143.73
5    145.83
4    143.68
3    144.02
dtype: float64


A Series is also like a dictionary: its values can be set or fetched by index label:

In [83]:
#The values of the series can be obtained from its index
print(s[4])
s[4] = 0
print(s)

143.68
6    143.73
5    145.83
4      0.00
3    144.02
2    143.50
1    142.62
dtype: float64


A Series can also have a name attribute, which is used when we build a Pandas DataFrame from several Series.

In [84]:
# We can give a name to the series using the attribute name
s = pd.Series(price, name = 'Apple  Lista de Precios')
print(s)
print(s.name)

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
Name: Apple  Lista de Precios, dtype: float64
Apple  Lista de Precios


We can get the statistical summaries of a Series:

In [85]:
# Prints the statistical summary of the series data
print(s.describe())

count      6.000000
mean     143.896667
std        1.059711
min      142.620000
25%      143.545000
50%      143.705000
75%      143.947500
max      145.830000
Name: Apple  Lista de Precios, dtype: float64


## Time Index
Pandas has a built-in function specifically for creating date indices: pd.date_range(). We use it to create a new index for our Series:

In [86]:
#We create a date_range from a given date, with n periods = len(s) with a step of 1 day.
time_index = pd.date_range('2022-08-21',periods = len(s),freq = 'D')
print(time_index)
s.index = time_index
print(s)

DatetimeIndex(['2022-08-21', '2022-08-22', '2022-08-23', '2022-08-24',
               '2022-08-25', '2022-08-26'],
              dtype='datetime64[ns]', freq='D')
2022-08-21    143.73
2022-08-22    145.83
2022-08-23    143.68
2022-08-24    144.02
2022-08-25    143.50
2022-08-26    142.62
Freq: D, Name: Apple  Lista de Precios, dtype: float64


Series are usually accessed using the iloc[] and loc[] indexers. iloc[] accesses elements by integer position, and loc[] accesses elements by index label.

iloc[] is necessary when the index of a Series consists of integers. Take our previously defined Series as an example:

In [87]:
s.index = [6,5,4,3,2,1]
print(s)
print(s[1])

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
Name: Apple  Lista de Precios, dtype: float64
142.62


If we intended to take an element by position, we would make a mistake here: because the index consists of integers, s[1] looks up the label 1 and returns the last element. To access an element by position, we use iloc[]; for example, iloc[0] returns the first element:

In [88]:
# iloc selects elements by position, not by the customized index labels
print(s.iloc[0])

143.73
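
The difference between the two indexers can be sketched on a small, made-up Series (the name `prices` and its three values are illustrative, not the Series used above). With a descending integer index, the same number 1 selects different elements depending on whether it is read as a label or as a position:

```python
import pandas as pd

# Hypothetical three-element Series with a descending integer index
prices = pd.Series([143.73, 145.83, 143.68], index=[3, 2, 1])

by_label = prices.loc[1]       # the element whose label is 1
by_position = prices.iloc[1]   # the second element, whatever its label

print(by_label)      # 143.68
print(by_position)   # 145.83
```

Writing loc[] or iloc[] explicitly makes the intent clear and avoids the ambiguity of plain square brackets on an integer index.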


While working with time series data, we often use time as the index. Pandas provides us with various ways to access the data by time index:

In [89]:
s.index = time_index
print(s['2022-08-22'])

145.83


We can even access a range of dates:

In [90]:
print(s['2022-08-21':'2022-08-26'])

2022-08-21    143.73
2022-08-22    145.83
2022-08-23    143.68
2022-08-24    144.02
2022-08-25    143.50
2022-08-26    142.62
Freq: D, Name: Apple  Lista de Precios, dtype: float64


Indexing with s[] is very flexible. Instead of labels, we can put a boolean condition in the square brackets:

In [91]:
# Instead of providing an index, we can filter data by a boolean condition
print(s[s < np.mean(s)])
# Each condition is wrapped in parentheses before combining them with &
print(s[(s > np.mean(s)) & (s < np.mean(s) + 1.64*np.std(s))])

2022-08-21    143.73
2022-08-23    143.68
2022-08-25    143.50
2022-08-26    142.62
Name: Apple  Lista de Precios, dtype: float64
2022-08-24    144.02
Name: Apple  Lista de Precios, dtype: float64


As demonstrated, we can use logical operators like & (and), | (or) and ~ (not) to group multiple conditions.
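
As a small sketch of the other two operators, the snippet below applies | and ~ to the same six prices. The name `s_demo` and the thresholds 145 and 143 are chosen only for illustration:

```python
import pandas as pd

s_demo = pd.Series([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])
mean = s_demo.mean()

# | (or): keep prices above 145 or below 143
extremes = s_demo[(s_demo > 145) | (s_demo < 143)]

# ~ (not): keep prices that are NOT above the mean
not_above_mean = s_demo[~(s_demo > mean)]

print(extremes)
print(not_above_mean)
```

Each condition needs its own parentheses because &, | and ~ bind more tightly than comparison operators in Python.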

# Summary
Here we have introduced NumPy and Pandas for scientific computing in Python. In the next chapter, we will dive deeper into Pandas to learn about resampling and manipulating DataFrames, both of which are commonly used in financial data analysis.

<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png"> <img style="display: block; margin: auto;" alt="photo" src="https://www.marketing-branding.com/wp-content/uploads/2020/07/google-colaboratory-colab-guia-completa.jpg " width="50" height="50">
<img style="display: block; margin: auto;" alt="photo" src="https://upload.wikimedia.org/wikipedia/commons/d/da/Yahoo_Finance_Logo_2019.svg" width="50" height="50">  

Quantconnect -> Google Colab with Yahoo Finance data

Introduction to Financial Python
</div>

# 05 Pandas-Resampling and DataFrame

# Introduction
In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.

# Fetching Data
Here we use Yahoo Finance, through the yfinance package, to retrieve data.


In [92]:
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting yfinance
  Downloading yfinance-0.1.74-py2.py3-none-any.whl (27 kB)
Collecting requests>=2.26
  Downloading requests-2.28.1-py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 1.1 MB/s
Installing collected packages: requests, yfinance
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
Successfully installed requests-2.28.1 yfinance-0.1.74


In [93]:
import yfinance as yf

aapl = yf.Ticker("AAPL")

# get stock info
print(aapl.info)

# get historical market data
aapl_table = aapl.history(start="2021-01-01",  end="2021-12-31")
aapl_table

{'zip': '95014', 'sector': 'Technology', 'fullTimeEmployees': 154000, 'longBusinessSummary': 'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. In addition, the company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; AirPods Max, an over-ear wireless headphone; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, and iPod touch. Further, it provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. Additionally, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with o

                  Open        High         Low       Close     Volume  Dividends  Stock Splits
Date
2021-01-04  132.155067  132.244143  125.464171  128.087082  143301900        0.0             0
2021-01-05  127.572362  130.393233  127.117058  129.670685   97664900        0.0             0
2021-01-06  126.414355  129.710315  125.088049  125.305801  155088000        0.0             0
2021-01-07  127.047822  130.284398  126.552933  129.581650  109578200        0.0             0
2021-01-08  131.076194  131.274161  128.898687  130.700089  105158200        0.0             0
...                ...         ...         ...         ...        ...        ...           ...
2021-12-23  175.125322  176.121201  174.547711  175.553543   68356600        0.0             0
2021-12-27  176.360215  179.676494  176.340308  179.586868   74919600        0.0             0
2021-12-28  179.417557  180.582734  177.794270  178.551132   79144300        0.0             0
2021-12-29  178.590981  179.885626  177.405882  178.640778   62348900        0.0             0


We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:

In [94]:
aapl = aapl_table['Close'].loc['2021']

In [95]:
print(aapl)

Date
2021-01-04    128.087082
2021-01-05    129.670685
2021-01-06    125.305801
2021-01-07    129.581650
2021-01-08    130.700089
                 ...    
2021-12-23    175.553543
2021-12-27    179.586868
2021-12-28    178.551132
2021-12-29    178.640778
2021-12-30    177.465637
Name: Close, Length: 251, dtype: float64


Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].

In [96]:
print(aapl['2021-8'])

Date
2021-08-02    144.492630
2021-08-03    146.319656
2021-08-04    145.912537
2021-08-05    146.021759
2021-08-06    145.325653
2021-08-09    145.275925
2021-08-10    144.788666
2021-08-11    145.047226
2021-08-12    148.060333
2021-08-13    148.269165
2021-08-16    150.277908
2021-08-17    149.353104
2021-08-18    145.544434
2021-08-19    145.882538
2021-08-20    147.364227
2021-08-23    148.875793
2021-08-24    148.786255
2021-08-25    147.533279
2021-08-26    146.717865
2021-08-27    147.771957
2021-08-30    152.266739
2021-08-31    150.983948
Name: Close, dtype: float64


In [97]:
# We use the format yyyy-mm to obtain the data in a specific range of months
aapl['2021-2':'2021-4']

Date
2021-02-01    132.768723
2021-02-02    133.610031
2021-02-03    132.570770
2021-02-04    135.985474
2021-02-05    135.564194
                 ...    
2021-04-26    133.542053
2021-04-27    133.214920
2021-04-28    132.412003
2021-04-29    132.312897
2021-04-30    130.310547
Name: Close, Length: 63, dtype: float64

.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.

In [98]:
# We obtain the first 7 and the last 6 data points by using the head and tail methods
# print(aapl[:7])
# print(aapl[-6:])
print(aapl.head(7))
print(aapl.tail(6))

Date
2021-01-04    128.087082
2021-01-05    129.670685
2021-01-06    125.305801
2021-01-07    129.581650
2021-01-08    130.700089
2021-01-11    127.661453
2021-01-12    127.483307
Name: Close, dtype: float64
Date
2021-12-22    174.916183
2021-12-23    175.553543
2021-12-27    179.586868
2021-12-28    178.551132
2021-12-29    178.640778
2021-12-30    177.465637
Name: Close, dtype: float64


# Resampling
**_series.resample(freq)_** returns a "DatetimeIndexResampler" object, which groups the data in a Series into regular time intervals. The argument "freq" determines the length of each interval.

**_series.resample(freq).mean()_** is a complete statement that groups the data into intervals and then computes the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:

In [99]:
#We reorganize the data in months, calculating and saving the mean of each month.
by_month = aapl.resample('M').mean()
print(by_month)

Date
2021-01-31    131.676840
2021-02-28    130.339429
2021-03-31    120.805690
2021-04-30    130.660323
2021-05-31    125.845275
2021-06-30    129.041125
2021-07-31    144.114830
2021-08-31    147.312345
2021-09-30    147.479778
2021-10-31    144.752678
2021-11-30    153.578459
2021-12-31    172.647003
Freq: M, Name: Close, dtype: float64
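
To make the two steps explicit, the sketch below separates grouping from aggregation. It runs on made-up daily values (1 to 10, starting on Monday 2021-01-04) instead of the downloaded prices, so it needs no network access; the names `daily`, `resampler` and `weekly_mean` are illustrative:

```python
import pandas as pd

# Hypothetical daily data: the values 1..10 starting on Monday 2021-01-04
days = pd.date_range('2021-01-04', periods=10, freq='D')
daily = pd.Series(range(1, 11), index=days, dtype='float64')

resampler = daily.resample('W')   # step 1: group into weekly bins, nothing computed yet
weekly_mean = resampler.mean()    # step 2: aggregate each bin

print(type(resampler).__name__)
print(weekly_mean)
```

The first week (Monday to Sunday) holds the values 1 to 7 and the second holds 8 to 10, so the two means are 4.0 and 9.0, each labeled with the Sunday that closes its week.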


We can also aggregate the data by week:

In [100]:
# If we use freq = 'W', we obtain the data summarized by week
by_week = aapl.resample('W').mean()
print(by_week.head())

Date
2021-01-10    128.669061
2021-01-17    127.625833
2021-01-24    132.580658
2021-01-31    138.012570
2021-02-07    134.099838
Freq: W-SUN, Name: Close, dtype: float64


We can also aggregate the data by month with max:

In [101]:
#We obtain max data of each month 
aapl.resample('M').max()

Date
2021-01-31    141.696518
2021-02-28    135.985474
2021-03-31    126.672630
2021-04-30    133.660995
2021-05-31    131.381088
2021-06-30    135.993073
2021-07-31    148.097000
2021-08-31    152.266739
2021-09-30    155.816879
2021-10-31    151.719833
2021-11-30    164.618805
2021-12-31    179.586868
Freq: M, Name: Close, dtype: float64

We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.

In [102]:
# We can select frequencies with an integer followed by "D", "W" or "M"
four_day = aapl.resample('4D').mean()
two_week = aapl.resample('2W').mean()
two_month = aapl.resample('2M').mean()


print(four_day)
print(two_week)
print(two_month)

Date
2021-01-04    128.161304
2021-01-08    129.180771
2021-01-12    127.616928
2021-01-16    126.523224
2021-01-20    134.599803
                 ...    
2021-12-14    173.539379
2021-12-18    170.663773
2021-12-22    175.234863
2021-12-26    178.926259
2021-12-30    177.465637
Freq: 4D, Name: Close, Length: 91, dtype: float64
Date
2021-01-10    128.669061
2021-01-24    129.827977
2021-02-07    136.056204
2021-02-21    132.434040
2021-03-07    122.517289
2021-03-21    120.463404
2021-04-04    120.521341
2021-04-18    129.827808
2021-05-02    132.366417
2021-05-16    126.650613
2021-05-30    125.039936
2021-06-13    124.889782
2021-06-27    130.970767
2021-07-11    139.038081
2021-07-25    145.318741
2021-08-08    145.514258
2021-08-22    146.986353
2021-09-05    150.083000
2021-09-19    149.870192
2021-10-03    143.220454
2021-10-17    141.434460
2021-10-31    148.360654
2021-11-14    149.268668
2021-11-28    156.473612
2021-12-12    167.457054
2021-12-26    173.441174
2022-01-09    1

Besides the mean() method, other methods can also be used with the resampler:



In [103]:
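# We can aggregate the data using std, max or min.
# The variable names avoid shadowing the built-in max() and min().
weekly_std = aapl.resample('W').std()
weekly_max = aapl.resample('W').max()
weekly_min = aapl.resample('W').min()


print(weekly_std)
print(weekly_max)
print(weekly_min)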

Date
2021-01-10    2.098294
2021-01-17    1.315374
2021-01-24    4.977887
2021-01-31    4.805412
2021-02-07    1.585070
2021-02-14    0.713243
2021-02-21    1.590622
2021-02-28    2.517064
2021-03-07    3.079017
2021-03-14    2.177679
2021-03-21    2.519356
2021-03-28    1.360368
2021-04-04    1.304575
2021-04-11    2.970556
2021-04-18    1.512193
2021-04-25    1.113220
2021-05-02    1.258303
2021-05-09    1.892434
2021-05-16    1.823230
2021-05-23    1.079980
2021-05-30    1.119368
2021-06-06    1.004038
2021-06-13    0.624552
2021-06-20    0.789393
2021-06-27    0.641696
2021-07-04    1.870535
2021-07-11    1.377539
2021-07-18    1.931999
2021-07-25    2.225637
2021-08-01    1.547768
2021-08-08    0.723525
2021-08-15    1.723224
2021-08-22    2.088033
2021-08-29    0.905412
2021-09-05    0.956703
2021-09-12    3.319941
2021-09-19    1.351200
2021-09-26    1.878797
2021-10-03    1.499646
2021-10-10    1.646000
2021-10-17    1.596806
2021-10-24    1.158914
2021-10-31    1.583678
2021-1

Often we want to calculate monthly returns of a stock, based on prices on the last day of each month. To fetch those prices, we use the series.resample(freq).agg() method:

In [104]:
# We obtain the closing price of the last trading day of each month using an anonymous function
last_day = aapl.resample('M').agg(lambda x: x.iloc[-1])
print(last_day)

Date
2021-01-31    130.611023
2021-02-28    120.199745
2021-03-31    121.081955
2021-04-30    130.310547
2021-05-31    123.730247
2021-06-30    135.993073
2021-07-31    144.830215
2021-08-31    150.983948
2021-09-30    140.711517
2021-10-31    148.965256
2021-11-30    164.618805
2021-12-31    177.465637
Freq: M, Name: Close, dtype: float64


Or directly calculate the monthly rates of return using the data for the first day and the last day:

In [105]:
# We calculate the rate of return using an anonymous function
monthly_return = aapl.resample('M').agg(lambda x: x.iloc[-1]/x.iloc[0] - 1)
print(monthly_return)

Date
2021-01-31    0.019705
2021-02-28   -0.094668
2021-03-31   -0.044135
2021-04-30    0.068780
2021-05-31   -0.058234
2021-06-30    0.102028
2021-07-31    0.062577
2021-08-31    0.044925
2021-09-30   -0.072192
2021-10-31    0.050123
2021-11-30    0.111313
2021-12-31    0.081508
Freq: M, Name: Close, dtype: float64
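
The same first-to-last return can also be sketched without a lambda, by passing the built-in aggregation names 'first' and 'last' to agg(). The example uses made-up daily values and a weekly frequency so that it runs without downloaded data; `daily`, `ends` and `weekly_return` are illustrative names:

```python
import pandas as pd

# Hypothetical daily data: the values 1..10 starting on Monday 2021-01-04
days = pd.date_range('2021-01-04', periods=10, freq='D')
daily = pd.Series(range(1, 11), index=days, dtype='float64')

# agg() with a list of names returns a DataFrame with one column per aggregation
ends = daily.resample('W').agg(['first', 'last'])

# Return over each interval: last / first - 1
weekly_return = ends['last'] / ends['first'] - 1
print(weekly_return)
```

Here the first week runs from 1 to 7 and the second from 8 to 10, giving returns of 6.0 and 0.25.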


A Series object also provides convenient methods for quick calculations, such as the mean, standard deviation and maximum:

In [106]:
print(monthly_return.mean())
print(monthly_return.std())
print(monthly_return.max())

0.02264415568771186
0.07158531795712104
0.11131337593461788


Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.

In [107]:
# .diff() calculates the difference between consecutive elements: x_i - x_(i-1)
print(last_day.diff())
# .pct_change() calculates the relative change: (x_i - x_(i-1)) / x_(i-1)
print(last_day.pct_change())

Date
2021-01-31          NaN
2021-02-28   -10.411278
2021-03-31     0.882210
2021-04-30     9.228592
2021-05-31    -6.580299
2021-06-30    12.262825
2021-07-31     8.837143
2021-08-31     6.153732
2021-09-30   -10.272430
2021-10-31     8.253738
2021-11-30    15.653549
2021-12-31    12.846832
Freq: M, Name: Close, dtype: float64
Date
2021-01-31         NaN
2021-02-28   -0.079712
2021-03-31    0.007340
2021-04-30    0.076218
2021-05-31   -0.050497
2021-06-30    0.099109
2021-07-31    0.064982
2021-08-31    0.042489
2021-09-30   -0.068037
2021-10-31    0.058657
2021-11-30    0.105082
2021-12-31    0.078040
Freq: M, Name: Close, dtype: float64


Notice that we introduced a NaN value while calculating percentage changes, i.e. returns: the first element has no previous element to compare with.
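
The relationship between the two methods can be checked on a tiny made-up Series (the name `p` and its three values are illustrative): the percentage change is the difference divided by the previous value.

```python
import pandas as pd

p = pd.Series([100.0, 110.0, 99.0])

manual = p.diff() / p.shift(1)   # (x_i - x_(i-1)) / x_(i-1)
builtin = p.pct_change()

print(manual)
print(builtin)

# Compare from the second element on; the first is NaN in both
print((manual - builtin).iloc[1:].abs().max() < 1e-12)
```

The comparison uses a small tolerance because the two computations can differ in the last floating-point digit.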

When dealing with NaN values, we usually either remove the data point or fill it with a specific value. Here we fill it with 0:

In [108]:
# We can change all NaN values to some value with fillna(value)
daily_return = last_day.pct_change()
print(daily_return.fillna(0))

Date
2021-01-31    0.000000
2021-02-28   -0.079712
2021-03-31    0.007340
2021-04-30    0.076218
2021-05-31   -0.050497
2021-06-30    0.099109
2021-07-31    0.064982
2021-08-31    0.042489
2021-09-30   -0.068037
2021-10-31    0.058657
2021-11-30    0.105082
2021-12-31    0.078040
Freq: M, Name: Close, dtype: float64


Alternatively, we can fill a NaN with the next valid value. This is called 'backward fill', or 'bfill' for short:

In [109]:
# We can use bfill to fill NaN values with the next valid value
daily_return = last_day.pct_change()
print(daily_return.bfill())

Date
2021-01-31   -0.079712
2021-02-28   -0.079712
2021-03-31    0.007340
2021-04-30    0.076218
2021-05-31   -0.050497
2021-06-30    0.099109
2021-07-31    0.064982
2021-08-31    0.042489
2021-09-30   -0.068037
2021-10-31    0.058657
2021-11-30    0.105082
2021-12-31    0.078040
Freq: M, Name: Close, dtype: float64


Since there is a 'backward fill' method, there is also a 'forward fill' method, or 'ffill' for short. However, we can't use it here because the NaN is the first value, so there is no earlier value to carry forward.
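
A minimal sketch of both fill directions, on a made-up Series with gaps in the middle and at the end (the names `gaps`, `forward`, `backward` and `leading` are illustrative):

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, 3.0, np.nan])

forward = gaps.ffill()    # each NaN takes the previous valid value
backward = gaps.bfill()   # each NaN takes the next valid value

# A leading NaN has nothing before it, so ffill leaves it as NaN
leading = pd.Series([np.nan, 2.0]).ffill()

print(forward)
print(backward)
print(leading)
```

In the same way, the trailing NaN in `backward` stays NaN because there is no later value to copy.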

We can also simply remove NaN values with **_.dropna()_**:

In [110]:
#We can remove NaN with dropna()
daily_return = last_day.pct_change()
daily_return.dropna()

Date
2021-02-28   -0.079712
2021-03-31    0.007340
2021-04-30    0.076218
2021-05-31   -0.050497
2021-06-30    0.099109
2021-07-31    0.064982
2021-08-31    0.042489
2021-09-30   -0.068037
2021-10-31    0.058657
2021-11-30    0.105082
2021-12-31    0.078040
Freq: M, Name: Close, dtype: float64

# DataFrame
The **DataFrame** is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.

More precisely, a DataFrame is a collection of Series objects, each of which may contain different data types. A DataFrame can be created from various data types: dictionary, 2-D numpy.ndarray, a Series or another DataFrame.

## Create DataFrames
The most common method of creating a DataFrame is passing a dictionary:

In [111]:
import pandas as pd
# We can create a DataFrame by passing a dictionary and a date index.
# The dictionary is named price_dict to avoid shadowing the built-in dict.
price_dict = {'Nestle': [143.5, 144.09, 142.73, 144.18, 143.77],
              'Alpina': [898.7, 911.71, 906.69, 918.59, 926.99],
              'Coca-Cola': [155.58, 153.67, 152.36, 152.94, 153.49]}
data_index = pd.date_range('2021-08-21',periods = 5, freq = 'D')
df = pd.DataFrame(price_dict, index = data_index)
print(df)

            Nestle  Alpina  Coca-Cola
2021-08-21  143.50  898.70     155.58
2021-08-22  144.09  911.71     153.67
2021-08-23  142.73  906.69     152.36
2021-08-24  144.18  918.59     152.94
2021-08-25  143.77  926.99     153.49


## Manipulating DataFrames
We can fetch values in a DataFrame by columns and index. Each column in a DataFrame is essentially a Pandas Series. We can fetch a column by square brackets: **df['column_name']**

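If a column name is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute or method, we can also use df.column_name to fetch a column: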

In [112]:
# We can obtain columns with df['column_name']
df = aapl_table
print(df.Close.tail(5))
print(df['Volume'].tail(5))

Date
2021-12-23    175.553543
2021-12-27    179.586868
2021-12-28    178.551132
2021-12-29    178.640778
2021-12-30    177.465637
Name: Close, dtype: float64
Date
2021-12-23    68356600
2021-12-27    74919600
2021-12-28    79144300
2021-12-29    62348900
2021-12-30    59773000
Name: Volume, dtype: int64


All the methods we applied to a Series, such as iloc[], loc[] and resampling, can also be applied to a DataFrame:

In [113]:
# We can use Series methods with DataFrames
# Select the rows for 2021 by label with loc[]
aapl_2021 = df.loc['2021']
# Here we aggregate the data by month, keeping the values of the last trading day
aapl_month = aapl_2021.resample('M').agg(lambda x: x.iloc[-1])
print(aapl_month)

                  Open        High         Low       Close     Volume  \
Date                                                                    
2021-01-31  134.441456  135.342157  128.878913  130.611023  177523800   
2021-02-28  121.518110  123.758352  120.140265  120.199745  164560400   
2021-03-31  120.586327  122.439971  120.090699  121.081955  118323800   
2021-04-30  130.627741  132.392176  129.923958  130.310547  109839500   
2021-05-31  124.683469  124.911848  123.670674  123.730247   71311100   
2021-06-30  135.208641  136.439892  134.910756  135.993073   63261400   
2021-07-31  143.360669  145.296898  143.092571  144.830215   70440600   
2021-08-31  151.809325  151.948544  150.446948  150.983948   86453100   
2021-09-30  142.859485  143.575474  140.492742  140.711517   89056700   
2021-10-31  146.399631  149.104475  145.594147  148.965256  124953200   
2021-11-30  159.330690  164.837900  159.260971  164.618805  174048100   
2021-12-31  178.730408  179.825881  177.356090  177

  


We may select certain columns of a DataFrame using their names:

In [114]:
# We can select columns from a DataFrame by listing their names inside square brackets
aapl_bar = aapl_month[['Open', 'High', 'Low', 'Close']]
print(aapl_bar)

                  Open        High         Low       Close
Date                                                      
2021-01-31  134.441456  135.342157  128.878913  130.611023
2021-02-28  121.518110  123.758352  120.140265  120.199745
2021-03-31  120.586327  122.439971  120.090699  121.081955
2021-04-30  130.627741  132.392176  129.923958  130.310547
2021-05-31  124.683469  124.911848  123.670674  123.730247
2021-06-30  135.208641  136.439892  134.910756  135.993073
2021-07-31  143.360669  145.296898  143.092571  144.830215
2021-08-31  151.809325  151.948544  150.446948  150.983948
2021-09-30  142.859485  143.575474  140.492742  140.711517
2021-10-31  146.399631  149.104475  145.594147  148.965256
2021-11-30  159.330690  164.837900  159.260971  164.618805
2021-12-31  178.730408  179.825881  177.356090  177.465637


We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:

In [115]:
# We can specify rows and columns using the loc[row_index,[columns]] method. 
print(aapl_month.loc['2021-03':'2021-06',['Open', 'High', 'Low', 'Close']])

                  Open        High         Low       Close
Date                                                      
2021-03-31  120.586327  122.439971  120.090699  121.081955
2021-04-30  130.627741  132.392176  129.923958  130.310547
2021-05-31  124.683469  124.911848  123.670674  123.730247
2021-06-30  135.208641  136.439892  134.910756  135.993073
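
The same rows-and-columns selection can be written with iloc[] by position. The sketch below uses a small made-up table (`bars`, with hypothetical Open and Close values) to show one difference that is easy to miss: a label slice in loc[] includes its end point, while a position slice in iloc[] excludes it.

```python
import pandas as pd

idx = pd.date_range('2021-01-01', periods=4, freq='D')
bars = pd.DataFrame({'Open': [1.0, 2.0, 3.0, 4.0],
                     'Close': [1.5, 2.5, 3.5, 4.5]}, index=idx)

# By label: both '2021-01-02' and '2021-01-03' are included
by_label = bars.loc['2021-01-02':'2021-01-03', ['Close']]

# By position: rows 1 and 2 (row 3 is excluded), column 1
by_position = bars.iloc[1:3, [1]]

print(by_label)
print(by_label.equals(by_position))
```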


The subsetting methods of a DataFrame are quite useful. By writing logical statements in square brackets, we can make customized subsets:

In [116]:
import numpy as np
# We can get subsets by using boolean conditions as index
above = aapl_bar[aapl_bar.Close > np.mean(aapl_bar.Close)]
print(above)

                  Open        High         Low       Close
Date                                                      
2021-07-31  143.360669  145.296898  143.092571  144.830215
2021-08-31  151.809325  151.948544  150.446948  150.983948
2021-10-31  146.399631  149.104475  145.594147  148.965256
2021-11-30  159.330690  164.837900  159.260971  164.618805
2021-12-31  178.730408  179.825881  177.356090  177.465637


## Data Validation
As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:

In [117]:
# We add a column by declaring its name in square brackets.
# aapl_bar was selected from aapl_month, so we take an explicit copy first
# to avoid pandas' SettingWithCopyWarning.
aapl_bar = aapl_bar.copy()
aapl_bar['rate_return'] = aapl_bar.Close.pct_change()
print(aapl_bar)

                  Open        High         Low       Close  rate_return
Date                                                                   
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109
2021-07-31  143.360669  145.296898  143.092571  144.830215     0.064982
2021-08-31  151.809325  151.948544  150.446948  150.983948     0.042489
2021-09-30  142.859485  143.575474  140.492742  140.711517    -0.068037
2021-10-31  146.399631  149.104475  145.594147  148.965256     0.058657
2021-11-30  159.330690  164.837900  159.260971  164.618805     0.105082
2021-12-31  178.730408  179.825881  177.356090  177.465637     0


Here the calculation introduced a NaN value. If the DataFrame is large, we would not be able to spot it by eye. **isnull()** provides a convenient way to check for missing values.

In [118]:
# With .isnull() we get True wherever a value is missing (NaN)
missing = aapl_bar.isnull()
print(missing)
print('---------------------------------------------')
print(missing.describe())

             Open   High    Low  Close  rate_return
Date                                               
2021-01-31  False  False  False  False         True
2021-02-28  False  False  False  False        False
2021-03-31  False  False  False  False        False
2021-04-30  False  False  False  False        False
2021-05-31  False  False  False  False        False
2021-06-30  False  False  False  False        False
2021-07-31  False  False  False  False        False
2021-08-31  False  False  False  False        False
2021-09-30  False  False  False  False        False
2021-10-31  False  False  False  False        False
2021-11-30  False  False  False  False        False
2021-12-31  False  False  False  False        False
---------------------------------------------
         Open   High    Low  Close rate_return
count      12     12     12     12          12
unique      1      1      1      1           2
top     False  False  False  False       False
freq       12     12     12     12    

The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values, it has at least one missing value.

We can deduce the number of missing values by comparing "count" with "freq". There are 12 counts and 11 False values, so there is one True value which corresponds to the missing value.

We can also find the rows with missing values easily:

In [119]:
# We can find the rows with missing values by using the boolean column as a filter.
print(missing[missing.rate_return])

             Open   High    Low  Close  rate_return
Date                                               
2021-01-31  False  False  False  False         True
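
A more direct way to count missing values is to sum the boolean table, since True counts as 1. The sketch below uses a small made-up DataFrame (`frame`) and illustrative variable names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'Close': [1.0, 2.0, 3.0],
                      'rate_return': [np.nan, 1.0, 0.5]})

# Number of missing values in each column
missing_per_column = frame.isnull().sum()

# Rows of the original data that contain at least one missing value
rows_with_missing = frame[frame.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Unlike filtering the boolean table, the second statement returns the original rows, so the surrounding values stay visible.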


Usually when dealing with missing data, we either delete the whole row or fill it with some value. As we introduced for Series, the same methods **dropna()** and **fillna()** can be applied to a DataFrame.

In [120]:
# We can also use dropna() and fillna() to change NaN values
drop = aapl_bar.dropna()
print(drop)
print('\n--------------------------------------------------\n')
fill = aapl_bar.fillna(0)
print(fill)

                  Open        High         Low       Close  rate_return
Date                                                                   
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109
2021-07-31  143.360669  145.296898  143.092571  144.830215     0.064982
2021-08-31  151.809325  151.948544  150.446948  150.983948     0.042489
2021-09-30  142.859485  143.575474  140.492742  140.711517    -0.068037
2021-10-31  146.399631  149.104475  145.594147  148.965256     0.058657
2021-11-30  159.330690  164.837900  159.260971  164.618805     0.105082
2021-12-31  178.730408  179.825881  177.356090  177.465637     0.078040

--------------------------------------------------

           

## DataFrame Concat
We have seen how to extract a Series from a DataFrame. Now we need to consider how to merge a Series or a DataFrame into another one.

In Pandas, the function **concat()** allows us to merge multiple Series into a DataFrame:

In [121]:
# We can merge series along an axis by using concat().
s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1,s2], axis = 1)
print(data_frame)

     AAPL    GOOG
0  143.50  898.70
1  144.09  911.71
2  142.73  906.69
3  144.18  918.59
4  143.77  926.99
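
In the example above both Series share the same index. As a sketch of what happens when they do not, the snippet below concatenates two made-up Series (`a` and `b`) whose labels only partly overlap, once with the default behavior and once with join = 'inner':

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'], name='A')
b = pd.Series([10.0, 20.0], index=['y', 'z'], name='B')

# Default: keep every label found in either Series, filling gaps with NaN
outer = pd.concat([a, b], axis=1)

# join='inner': keep only the labels found in both Series
inner = pd.concat([a, b], axis=1, join='inner')

print(outer)
print(inner)
```

The default result has three rows, with a NaN in column B for the label 'x'; the inner result has only the two shared rows.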


The "axis = 1" parameter will join two DataFrames by columns:

In [122]:
# with axis = 1, we can merge horizontally
log_price = np.log(aapl_bar.Close)
log_price.name = 'log_price'
print(log_price)
print('\n---------------------- separate line--------------------\n')
concat = pd.concat([aapl_bar, log_price], axis = 1)
print(concat)

Date
2021-01-31    4.872224
2021-02-28    4.789155
2021-03-31    4.796468
2021-04-30    4.869920
2021-05-31    4.818104
2021-06-30    4.912604
2021-07-31    4.975562
2021-08-31    5.017174
2021-09-30    4.946712
2021-10-31    5.003713
2021-11-30    5.103633
2021-12-31    5.178777
Freq: M, Name: log_price, dtype: float64

---------------------- separate line--------------------

                  Open        High         Low       Close  rate_return  \
Date                                                                      
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN   
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712   
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340   
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218   
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497   
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109   
2021-07-31  143.360

We can also join DataFrames whose row indices only partly overlap. Consider these two DataFrames, which cover only the first four months of 2021:

In [123]:
df_volume = aapl_table.loc['2020-10':'2021-04',['Volume', 'Stock Splits']].resample('M').agg(lambda x: x.iloc[-1])
print(df_volume)
print('\n---------------------- separate line--------------------\n')
df_2021 = aapl_table.loc['2020-10':'2021-04',['Open', 'High', 'Low', 'Close']].resample('M').agg(lambda x: x.iloc[-1])
print(df_2021)

               Volume  Stock Splits
Date                               
2021-01-31  177523800             0
2021-02-28  164560400             0
2021-03-31  118323800             0
2021-04-30  109839500             0

---------------------- separate line--------------------

                  Open        High         Low       Close
Date                                                      
2021-01-31  134.441456  135.342157  128.878913  130.611023
2021-02-28  121.518110  123.758352  120.140265  120.199745
2021-03-31  120.586327  122.439971  120.090699  121.081955
2021-04-30  130.627741  132.392176  129.923958  130.310547


Now we merge 'df_volume' with our DataFrame 'aapl_bar':

In [124]:
concat = pd.concat([aapl_bar, df_volume], axis = 1)
print(concat)

                  Open        High         Low       Close  rate_return  \
Date                                                                      
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN   
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712   
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340   
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218   
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497   
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109   
2021-07-31  143.360669  145.296898  143.092571  144.830215     0.064982   
2021-08-31  151.809325  151.948544  150.446948  150.983948     0.042489   
2021-09-30  142.859485  143.575474  140.492742  140.711517    -0.068037   
2021-10-31  146.399631  149.104475  145.594147  148.965256     0.058657   
2021-11-30  159.330690  164.837900  159.260971  164.618805     0.105082   
2021-12-31  178.730408  1

By default, the DataFrames are joined on the union of their indexes, so no information is lost; rows that are missing from one DataFrame are filled with NaN. We can also join them on the intersection of their indexes. This is called an 'inner join':

In [125]:
# join = 'inner' keeps only the rows whose index label appears in both DataFrames.
# NaN values inside those rows are kept, not dropped.
concat = pd.concat([aapl_bar,df_volume],axis = 1, join = 'inner')
print(concat)

                  Open        High         Low       Close  rate_return  \
Date                                                                      
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN   
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712   
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340   
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218   

               Volume  Stock Splits  
Date                                 
2021-01-31  177523800             0  
2021-02-28  164560400             0  
2021-03-31  118323800             0  
2021-04-30  109839500             0  
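
The difference between the two join modes can be sketched on a pair of small DataFrames. This is only an illustration: the names `prices` and `volume` and their values are made up and are not part of the AAPL data used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data: 'prices' covers three month-ends, 'volume' only two.
prices = pd.DataFrame(
    {"Close": [130.6, 120.2, 121.1]},
    index=pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"]),
)
volume = pd.DataFrame(
    {"Volume": [177, 164]},
    index=pd.to_datetime(["2021-01-31", "2021-02-28"]),
)

# Default (join='outer'): keep the union of the index labels, fill gaps with NaN.
outer = pd.concat([prices, volume], axis=1)

# join='inner': keep only the index labels present in both DataFrames.
inner = pd.concat([prices, volume], axis=1, join="inner")

print(outer)
print(inner)
```

Here `outer` keeps the March row with a missing `Volume`, while `inner` drops that row because `volume` has no entry for it.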


Only the intersection remains when we use the 'inner join' method. Now let's try to append one DataFrame to another:

In [126]:
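# The method df_1.append(df_2) appends the rows of df_2 to df_1.
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0;
# on newer versions use pd.concat instead.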
append = aapl_bar.append(df_2021)
print(append)

                  Open        High         Low       Close  rate_return
Date                                                                   
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109
2021-07-31  143.360669  145.296898  143.092571  144.830215     0.064982
2021-08-31  151.809325  151.948544  150.446948  150.983948     0.042489
2021-09-30  142.859485  143.575474  140.492742  140.711517    -0.068037
2021-10-31  146.399631  149.104475  145.594147  148.965256     0.058657
2021-11-30  159.330690  164.837900  159.260971  164.618805     0.105082
2021-12-31  178.730408  179.825881  177.356090  177.465637     0

'Append' essentially concatenates two DataFrames along axis = 0, so here is an alternative way to append:

In [127]:
# Appending is like concatenating two DataFrames vertically
concat = pd.concat([aapl_bar, df_2021], axis = 0)
print(concat)

                  Open        High         Low       Close  rate_return
Date                                                                   
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109
2021-07-31  143.360669  145.296898  143.092571  144.830215     0.064982
2021-08-31  151.809325  151.948544  150.446948  150.983948     0.042489
2021-09-30  142.859485  143.575474  140.492742  140.711517    -0.068037
2021-10-31  146.399631  149.104475  145.594147  148.965256     0.058657
2021-11-30  159.330690  164.837900  159.260971  164.618805     0.105082
2021-12-31  178.730408  179.825881  177.356090  177.465637     0
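
One thing to keep in mind is that a vertical concatenation does not remove overlapping dates. Both DataFrames above cover January to April 2021, so those dates would be expected to appear twice in the result. A minimal sketch on made-up data (the names `first` and `second` are hypothetical) shows how to detect this, along with one possible cleanup:

```python
import pandas as pd

# Hypothetical toy data with one overlapping date (2021-02-28).
first = pd.DataFrame(
    {"Close": [130.6, 120.2]},
    index=pd.to_datetime(["2021-01-31", "2021-02-28"]),
)
second = pd.DataFrame(
    {"Close": [120.2, 121.1]},
    index=pd.to_datetime(["2021-02-28", "2021-03-31"]),
)

# axis=0 stacks the rows of 'second' below the rows of 'first'.
stacked = pd.concat([first, second], axis=0)

# The overlapping date now appears twice in the index.
has_duplicates = stacked.index.duplicated().any()

# One possible cleanup: keep only the first row for each date.
deduplicated = stacked[~stacked.index.duplicated(keep="first")]

print(stacked)
print(has_duplicates)
print(deduplicated)
```

Whether to keep the first row, the last row, or all duplicates depends on the analysis; keeping the first is just one choice.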

Please note that if the two DataFrames have columns with the same names, those columns are treated as the same column and their values are stacked together. It is therefore very important to have the right column names. Suppose we change a column name here:

In [128]:
# Columns must have the same name, otherwise, they will be considered different columns
# and will lead us to a lot of empty data
df_2021.columns = ['Change', 'High','Low','Close']
concat = pd.concat([aapl_bar, df_2021], axis = 0)
print(concat)

                  Open        High         Low       Close  rate_return  \
Date                                                                      
2021-01-31  134.441456  135.342157  128.878913  130.611023          NaN   
2021-02-28  121.518110  123.758352  120.140265  120.199745    -0.079712   
2021-03-31  120.586327  122.439971  120.090699  121.081955     0.007340   
2021-04-30  130.627741  132.392176  129.923958  130.310547     0.076218   
2021-05-31  124.683469  124.911848  123.670674  123.730247    -0.050497   
2021-06-30  135.208641  136.439892  134.910756  135.993073     0.099109   
2021-07-31  143.360669  145.296898  143.092571  144.830215     0.064982   
2021-08-31  151.809325  151.948544  150.446948  150.983948     0.042489   
2021-09-30  142.859485  143.575474  140.492742  140.711517    -0.068037   
2021-10-31  146.399631  149.104475  145.594147  148.965256     0.058657   
2021-11-30  159.330690  164.837900  159.260971  164.618805     0.105082   
2021-12-31  178.730408  1

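Since the column name 'Open' has been changed, the new DataFrame has a new column named 'Change'. The rows that come from 'aapl_bar' have NaN in 'Change', and the rows that come from 'df_2021' have NaN in 'Open'.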
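
If a mismatch like this is accidental, one way to repair it is to rename the column before concatenating. The sketch below uses made-up single-row DataFrames (`left` and `right` are hypothetical names) rather than the AAPL data:

```python
import pandas as pd

# Hypothetical toy data; 'Change' stands in for a mislabelled 'Open' column.
left = pd.DataFrame(
    {"Open": [134.4], "Close": [130.6]},
    index=pd.to_datetime(["2021-01-31"]),
)
right = pd.DataFrame(
    {"Change": [121.5], "Close": [120.2]},
    index=pd.to_datetime(["2021-02-28"]),
)

# Mismatched names: the result has both 'Open' and 'Change', padded with NaN.
mismatched = pd.concat([left, right], axis=0)

# Renaming the column before concatenating lines the data up again.
aligned = pd.concat([left, right.rename(columns={"Change": "Open"})], axis=0)

print(mismatched)
print(aligned)
```

`rename` returns a new DataFrame here, so `right` itself is left unchanged.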

# Summary

In this chapter we introduced some of the most important parts of pandas: resampling and DataFrame manipulation. We only covered the methods most commonly used in financial data analysis. There are many other methods, often used in data mining, that are also beneficial. You can always check the official [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) documentation for help.