<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png">

Quantconnect

Introduction to Financial Python
</div>

# 01 Data Types and Data Structures

# Introduction

This tutorial provides a basic introduction to the Python programming language. If you are new to Python, you should run the code snippets while reading this tutorial. If you are an advanced Python user, please feel free to skip this chapter.

# Basic Variable Types
The basic types of variables in Python are: strings, integers, floating point numbers and booleans.

Strings in Python are contiguous sequences of characters enclosed in either single quotes (' ') or double quotes (" ").


In [None]:
my_string1 = 'Aprendiendo' # Assign string values to the variables my_string1 and my_string2
my_string2 = "Python"
print(my_string1 + ' ' + my_string2)

Aprendiendo Python


An integer is a whole number with no fractional part.

In [None]:
my_int = 1035 # Assign an integer value to a variable
print(my_int)
print(type(my_int)) # Show the type of the value stored in my_int

1035
<class 'int'>


The built-in function int() can convert a string into an integer.

In [None]:
my_string = "1035" # A number stored as a string
print(type(my_string)) # Show the type of the value stored in my_string
my_int = int(my_string) # Convert the string to an integer
print(type(my_int)) # Show the type of my_int after converting the value held by my_string

<class 'str'>
<class 'int'>


A floating point number, or float, represents a real number. In Python, a numeric literal written with a decimal point (for example 1.0) is defined as a float.

In [None]:
my_string = "1123.4" # A decimal number stored as a string
my_float = float(my_string) # Convert the string above to a float
print(type(my_float)) # Show the type of the value stored in my_float
print(my_float)

<class 'float'>
1123.4


If we don't include a decimal point, the variable is defined as an integer. The built-in function float() can convert a string or an integer into a float.

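As a minimal sketch of the point above (the variable names are illustrative and not part of the original tutorial), the snippet below contrasts an integer literal with a float literal and converts both an integer and a string with float():

```python
# Illustrative names; not part of the original tutorial.
whole = 7          # no decimal point, so this is an int
fractional = 7.0   # the decimal point makes this a float
print(type(whole), type(fractional))

from_int = float(whole)     # int -> float
from_str = float("7.25")    # str -> float
print(from_int, from_str)
```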
In [None]:
my_bool = True # Assign a boolean value, either True or False
print(my_bool)
print(type(my_bool))

True
<class 'bool'>


A boolean, or bool, is a binary variable. Its value can only be True or False. It is useful for logical operations, which will be covered in the next chapter.

In [None]:
# Arithmetic operations evaluated directly inside print
print("Addition ", 32+90)
print("Subtraction ", 30-43)
print("Multiplication ", 6*9)
print("Division ", 55/9)
print('exponent', 4**2) # The ** operator denotes exponentiation

Addition  122
Subtraction  -13
Multiplication  54
Division  6.111111111111111
exponent 16


# Basic Math Operations

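The basic math operators in Python are addition (+), subtraction (-), multiplication (*), division (/) and exponentiation (**). In Python 3, the / operator always returns a float, even when the operands divide evenly: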

In [None]:
print(3/3)
print(5/3)

1.0
1.6666666666666667

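Python also has floor division (//) and a modulo operator (%), which the cells above do not show. The short sketch below reuses the numbers from the division example to illustrate how they relate to /:

```python
true_quotient = 55 / 9     # true division always returns a float
floor_quotient = 55 // 9   # floor division discards the fractional part
remainder = 55 % 9         # modulo returns the remainder
print(true_quotient, floor_quotient, remainder)

# The floor quotient and the remainder reconstruct the original number.
print(floor_quotient * 9 + remainder == 55)
```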

# Data Collections

## List
A list is an ordered collection of values. A list is mutable, which means you can change its elements in place without creating a new list. To create a list, put comma-separated values between square brackets.

In [None]:
my_list = ['Lista', True, 32, 2.457, [False, 20]] # Assign values to the list, including a nested list
print(my_list)

['Lista', True, 32, 2.457, [False, 20]]


The values in a list are called "elements". We can access list elements by indexing. Python indexing starts at 0, so in a list of length n the first element has index 0 and the last has index n − 1. The length of a list can be obtained with the built-in function len().

In [None]:
my_list = ['Lista', True, 32, 2.457, [False, 20]]
print(len(my_list)) # Show the number of elements in the list
print(my_list[0]) # Show the first element of the list
print(my_list[len(my_list) -1]) # Show the last element of the list

5
Lista
[False, 20]


You can also change the elements in the list by accessing an index and assigning a new value.

In [None]:
my_list = ['Lista', True, 32, 2.457, [False, 20]]
my_list[2] = 'go' # Change the value of the element at index 2
print(my_list)

['Lista', True, 'go', 2.457, [False, 20]]


A list can also be sliced with a colon:

In [None]:
my_list = ['Lista', True, 32, 2.457, [False, 20]]
print(my_list[0:4]) # [n1:n2] shows the elements from index n1 (inclusive) up to, but not including, index n2

['Lista', True, 32, 2.457]


The slice starts from the first index indicated, but excludes the last index indicated. Here we select all elements starting from index 2, which refers to the third element:

In [None]:
print(my_list[2:]) # [n1:] shows all elements from index n1 onward

[32, 2.457, [False, 20]]


And all elements up to but excluding index 4:

In [None]:
print(my_list[:4]) # [:n2] shows all elements before index n2

['Lista', True, 32, 2.457]

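The slicing rules above can be summarized in one short sketch. The list prices is made up for illustration; negative indices, which count from the end of the list, are standard Python but are not covered in the text above.

```python
prices = [10, 20, 30, 40, 50]  # hypothetical values

first_three = prices[:3]   # indices 0, 1, 2 (index 3 is excluded)
from_second = prices[1:]   # index 1 through the end
middle = prices[1:4]       # indices 1, 2, 3
last = prices[-1]          # negative indices count from the end

print(first_three, from_second, middle, last)
```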

If you wish to add or remove an element from a list, you can use the append() and remove() methods for lists as follows:

In [None]:
my_list = ['Elementos', 'Base']
my_list.append('Añadido') # Append an element to the list
print(my_list)

['Elementos', 'Base', 'Añadido']


In [None]:
my_list.remove('Base') # Remove the given element from the list
print(my_list)

['Elementos', 'Añadido']


When the value passed to remove() appears more than once in the list, only the first occurrence is removed.

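A small sketch of that behavior, using a made-up list of tickers in which one value appears twice. It also shows that remove() raises a ValueError when the value is absent:

```python
tickers = ['AAPL', 'IBM', 'AAPL', 'GOOG']  # 'AAPL' appears twice
tickers.remove('AAPL')                      # removes only the first match
print(tickers)

# Removing a value that is not present raises a ValueError.
try:
    tickers.remove('MSFT')
except ValueError:
    print('MSFT is not in the list')
```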
## Tuple
A tuple is a data structure type similar to a list. The difference is that a tuple is immutable, which means you can't change the elements in it once it's defined. We create a tuple by putting comma-separated values between parentheses.

In [None]:
my_tuple = ('Esto','es','una', 'tupla', 1106) # Assign values to the tuple

Just like a list, a tuple can be sliced by index.

In [None]:
my_tuple = ('Esto','es','una', 'tupla', 1106) 
print(my_tuple[2:]) # As with a list, show the elements from the given index onward

('una', 'tupla', 1106)

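To make the immutability concrete, the sketch below (with hypothetical values) tries to assign to a tuple element. Python rejects the assignment with a TypeError and the tuple is left unchanged:

```python
my_tuple = ('AAPL', 143.73, 20109375)  # hypothetical values

try:
    my_tuple[1] = 150.0     # item assignment is not allowed on a tuple
    changed = True
except TypeError:
    changed = False
    print('Tuples do not support item assignment')

print(my_tuple)
```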

## Set
A set is an **unordered**  collection with **no duplicate** elements. The built-in function **set()** can be used to create sets.

In [None]:
stock_list = ['AAPL','GOOG','IBM','AAPL','IBM','FB','F','GOOG'] # Assign values to the list
stock_set = set(stock_list) # Convert the list into a set
print(stock_set)

{'FB', 'GOOG', 'AAPL', 'IBM', 'F'}


A set is an easy way to remove duplicate elements from a list.

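Because a set is unordered, the printed order can differ from the order of the original list. The sketch below shows two common ways to get a predictable result: sorting the set, or using dict.fromkeys(), which keeps the first occurrence of each value (this relies on dictionaries preserving insertion order, which holds from Python 3.7 onward).

```python
stock_list = ['AAPL', 'GOOG', 'IBM', 'AAPL', 'IBM', 'FB', 'F', 'GOOG']

unique_sorted = sorted(set(stock_list))             # alphabetical order
unique_in_order = list(dict.fromkeys(stock_list))   # first-seen order

print(unique_sorted)
print(unique_in_order)
```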
## Dictionary
A dictionary is one of the most important data structures in Python. Unlike sequences, which are indexed by integers, dictionaries are indexed by keys, which can be any immutable (hashable) type, such as strings, integers or floats.

A dictionary is a collection of key : value pairs, with the requirement that the keys are unique. Since Python 3.7 a dictionary preserves the order in which keys were inserted, but values are looked up by key rather than by position. We create a dictionary by placing a comma-separated list of key : value pairs within braces.

In [None]:
my_dic = {1:'APPLE', 2:'Hola', 'Llave':'x'} # Assign key : value pairs to the dictionary

In [None]:
print(my_dic[2]) # Show the value stored under the given key

Hola


As shown above, after defining a dictionary we can access any value by indicating its key in brackets.

In [None]:
my_dic['Llave'] = 'Nuevo valor' # Change the value stored under the given key
print(my_dic['Llave'])

Nuevo valor


As shown above, we can also change the value associated with a specified key.

In [None]:
print(my_dic.keys()) # Show all the keys in the dictionary

dict_keys([1, 2, 'Llave'])


The dictionary method dict.keys() returns a view of all the keys used in the dictionary. In Python 3 this is a dict_keys object rather than a list; wrap it in list() if a list is needed.

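A few other common dictionary operations are sketched below on a made-up ticker-to-price mapping (the names and values are assumptions for illustration): iterating over key : value pairs with items(), testing membership with in, and reading a key that may be missing with get().

```python
prices = {'AAPL': 143.73, 'IBM': 142.62}  # hypothetical ticker -> price mapping

# items() yields (key, value) pairs.
for ticker, price in prices.items():
    print(ticker, price)

has_goog = 'GOOG' in prices        # membership test on the keys
goog_price = prices.get('GOOG')    # get() returns None when the key is missing
key_list = list(prices.keys())     # convert the keys view to a list

print(has_goog, goog_price, key_list)
```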
# Common String Operations
A string is an immutable sequence of characters. It can be sliced by index just like a tuple:

In [None]:
my_str = 'Hola mundo'
print(my_str[4:]) # A string behaves like a sequence of characters, so we can show its characters from a given index onward

 mundo


There are many methods associated with strings. We can use string.count() to count the occurrences of a character or substring, string.find() to return the index of its first occurrence, and string.replace() to replace characters.

In [None]:
print('Counting the number of e appears in this sentence'.count('en')) # Count the occurrences of the given character or substring
print('The first time e appears in this sentence'.find('i')) # Show the first index at which the given character or substring appears
print('all the a in this sentence now becomes e'.replace('a','e')) # Show the sentence after replacing every occurrence of one character with another

2
5
ell the e in this sentence now becomes e


One of the most commonly used string methods is string.split(). This method splits the string at the indicated separator and returns a list:

In [None]:
Time = '2021-10-30 11:45:00'
splited_list = Time.split(' ') # Split the string at the given separator
date = splited_list[0] # First part of the split string
time = splited_list[1] # Second part of the split string
print(date, time)
hour = time.split(':')[0] # Split the time at each ':' and keep the first piece
print(hour)

2021-10-30 11:45:00
11


We can substitute the values of our variables into parts of a string. This is called string formatting.

In [None]:
my_time = 'Hour: {}, Minute:{}'.format('11','45') # Insert values at the {} placeholders of a string
print(my_time)

Hour: 11, Minute:45


Another way to format a string is to use the % symbol.

In [None]:
print('the pi number is %f'%3.141592)
print('%s %s'%('Salida','Formateada'))

the pi number is 3.141592
Salida Formateada

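A third option, not used in the cells above, is the f-string (available from Python 3.6). The sketch below rebuilds both examples with f-strings; the format specifier .2f, which rounds to two decimal places, is an addition for illustration.

```python
hour, minute = 11, 45
pi = 3.141592

my_time = f'Hour: {hour}, Minute: {minute}'   # expressions go inside the braces
pi_text = f'the pi number is {pi:.2f}'        # .2f keeps two decimal places

print(my_time)
print(pi_text)
```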

# Summary

We have seen the basic data types and data structures in Python. It's important to keep practicing to become familiar with these data structures. In the next tutorial, we will cover for and while loops and logical operations in Python.

# 02 Logical Operations and Loops

# Introduction
We discussed the basic data types and data structures in Python in the last tutorial. This chapter covers logical operations and loops in Python, which are very common in programming.

# Logical Operations
Like most programming languages, Python has comparison operators:

In [None]:
print(1 == 0) # Test for equality
print(1 == 1)
print(1 != 0) # Test for inequality
print(15 >= 20) # Test for greater than or equal to
print(30 >= 26)

False
True
True
False
True


Each statement above has a boolean value, which must be either True or False, but not both.

We can combine simple statements P and Q to form complex statements using logical operators:

- The statement "P and Q" is true if both P and Q are true, otherwise it is false.
- The statement "P or Q" is false if both P and Q are false, otherwise it is true.
- The statement "not P" is true if P is false, and vice versa.

In [None]:
print(212 > 1000 and 30 > 22) # Proposition P and Q
print(20 > 11 and 30 < 22)
print(45 > 6 or 3 < 2) # Proposition P or Q
print(90 < 0 or 321 > 200)

False
False
True
True

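The three rules listed earlier can be checked exhaustively. The sketch below uses itertools.product from the standard library to enumerate every combination of P and Q and prints one row of the truth table per combination (the rows list is only there to make the result easy to inspect):

```python
from itertools import product

rows = []
# Columns: P, Q, P and Q, P or Q, not P
for p, q in product([True, False], repeat=2):
    row = (p, q, p and q, p or q, not p)
    rows.append(row)
    print(*row)
```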

When dealing with a very complex logical statement that involves several statements, we can use brackets to separate and combine them.

In [None]:
print((30 > 21 or 12 < 33) and (2 != 3 and 4 > 3) or not ( 3 < 2 or 1 < 3 and (1 != 3 and 4 > 3))) # Combination of negation and disjunction
print(3 == 2 or 1 < 3 or (1 != 3 and 4 <= 3) and not ( 3 < 2 or 1 < 3 and (1 != 3 or 4 > 3))) # Combination of negation, conjunction and disjunction

True
True


Comparing the above two statements, we can see that it's wise to use brackets when we make a complex logical statement.

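The reason brackets matter is operator precedence: Python evaluates "not" first, then "and", then "or". A minimal sketch with the same three values, with and without brackets:

```python
# Without brackets, "and" binds tighter than "or":
# this is parsed as True or (False and False).
without_brackets = True or False and False

# Brackets force the "or" to be evaluated first.
with_brackets = (True or False) and False

print(without_brackets)
print(with_brackets)
```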
# If Statement
An if statement executes a segment of code only if its condition is true. A standard if statement consists of 3 segments: if, elif and else.

```python
if statement1:
    # if statement1 is true, execute the code here.
    # code.....
    # code.....
elif statement2:
    # if statement1 is false and statement2 is true, execute the code here.
    # code......
    # code......
else:
    # if none of the above statements is true, execute the code here.
    # code......
```

An if statement doesn't necessarily have elif and else parts. If they are not specified, the indented block of code will be executed when the condition is true; otherwise the whole if statement will be skipped.

In [None]:
i = 10
if i != 1: # Check whether the condition i != 1 holds
    print('i != 1 is True')

i != 1 is True


As we mentioned above, we can write some complex statements here:

In [None]:
p = 1 < 0 # Store a boolean value in each of the variables p and q
q = 2 > 3
if p and q:
    print('p and q is true')
elif p and not q:
    print('q is false')
elif q and not p:
    print('p is false')
else:
    print('None of p and q is true')

None of p and q is true


# Loop Structure
Loops are an essential part of programming. The "for" and "while" loops run a block of code repeatedly.

## While Loop
A "while" loop will run repeatedly as long as its condition remains true.

In [None]:
i = 0
while i < 5: # Run the inner block while i is less than 5
    print(i)
    i += 2 

0
2
4


When making a while loop, we need to ensure that something changes from iteration to iteration so that the loop will terminate; otherwise, it will run forever. Here we used i += 2 (short for i = i + 2) to make i larger after each iteration. Incrementing a counter like this is the most common way to control a while loop.

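A while loop is also handy when the number of iterations is not known in advance. The sketch below, with a made-up starting balance and a made-up 10% yearly growth rate, counts how many years it takes for the balance to double. The loop terminates because the balance grows on every iteration.

```python
balance = 100.0   # hypothetical starting balance
years = 0

while balance < 200.0:    # keep going until the balance has doubled
    balance *= 1.10       # grow by 10% per year (assumed rate)
    years += 1

print(years, round(balance, 2))
```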
## For Loop
A "for" loop will iterate over a sequence of values and terminate when the sequence has ended.

In [None]:
for i in [1, 'Hola', 3.14, 4, True]: # Iterate over a list of values
    print(i)

1
Hola
3.14
4
True


We can also add if statements in a for loop. Here is a real example from our pairs trading algorithm:

In [None]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
selected = ['AAPL','IBM']
new_list = []
for i in stocks: # Iterate over a list
    if i not in selected: # Check that the value i is not in the list of selected stocks
        new_list.append(i) # Append the values that are not in 'selected'
print(new_list)

['GOOG', 'FB', 'F', 'V', 'G', 'GE']


Here we iterated over all the elements in the list 'stocks'. Later in this chapter, we will introduce a more concise way to do this in a single line of code.

## Break and continue
These are two statements commonly used inside loops. If "break" is triggered while a loop is executing, the loop will terminate immediately:

In [None]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
for i in stocks:
    print(i)
    if i == 'IBM':
        break # Exit the for loop immediately

AAPL
GOOG
IBM


The "continue" command tells the loop to end this iteration and skip to the next iteration:

In [None]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
for i in stocks:
    if i == 'IBM':
        continue # Stop the current iteration and move on to the next one
    print(i)

AAPL
GOOG
FB
F
V
G
GE


# List Comprehension
List comprehension is a Pythonic way to create lists. A common application is to make a new list where each element is the result of some operation applied to each member of another sequence. For example, if we want to create a list of cubes using a for loop:

In [None]:
cubes = []
for i in [1, 2, 3, 4, 5, 6, 7]:
    cubes.append(i**3) # Append the cube of i to the initially empty list
print(cubes)

[1, 8, 27, 64, 125, 216, 343]


Using list comprehension:

In [None]:
numbers = [1, 2, 3, 4, 5, 6, 7] # Named 'numbers' to avoid shadowing the built-in list()
cubes = [x**3 for x in numbers] # Concise way to build the list of cubes
print(cubes)

[1, 8, 27, 64, 125, 216, 343]


Recall the example above where we used a for loop to select stocks. Here we use list comprehension:

In [None]:
stocks = ['APPLE','GOOGLE','IBM','FACEBOOK','F','V', 'G', 'GE']
selected = ['GOOGLE','IBM', 'FACEBOOK']
new_list = [x for x in stocks if x in selected] # Keep the values of 'stocks' that are also in 'selected'
print(new_list)

['GOOGLE', 'IBM', 'FACEBOOK']


A list comprehension consists of square brackets containing an expression followed by a "for" clause, then zero or more additional "for" or "if" clauses. For example:

In [None]:
# List of pairs whose two values differ
print([(x, y) for x in [1,2,3] for y in [2,5,4] if x != y])
print([str(x)+' vs '+str(y) for x in ['APPLE','GOOGLE','IBM'] for y in ['F','V','G'] if x!=y])

[(1, 2), (1, 5), (1, 4), (2, 5), (2, 4), (3, 2), (3, 5), (3, 4)]
['APPLE vs F', 'APPLE vs V', 'APPLE vs G', 'GOOGLE vs F', 'GOOGLE vs V', 'GOOGLE vs G', 'IBM vs F', 'IBM vs V', 'IBM vs G']


List comprehension is an elegant way to organize one or more for loops when creating a list.

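To see how the clauses map onto ordinary loops, the sketch below writes the first comprehension from the cell above as nested for loops. The clauses nest in the order they are written, left to right, and both versions produce the same list.

```python
# Nested-loop version: the first "for" clause is the outer loop.
pairs_loop = []
for x in [1, 2, 3]:
    for y in [2, 5, 4]:
        if x != y:
            pairs_loop.append((x, y))

# Equivalent list comprehension.
pairs_comprehension = [(x, y) for x in [1, 2, 3] for y in [2, 5, 4] if x != y]

print(pairs_loop == pairs_comprehension)
```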
# Summary
This chapter has introduced logical operations, loops, and list comprehension. In the next chapter, we will introduce functions and object-oriented programming, which will enable us to make our code clean and versatile.

# 03 Functions and Object-Oriented Programming

# Introduction

In the last tutorial we introduced logical operations, loops and list comprehension. We will introduce functions and object-oriented programming in this chapter, which will enable us to build complex algorithms in more flexible ways.

# Functions
A function is a reusable block of code. We can use a function to output a value, or do anything else we want. We can easily define our own function by using the keyword "def".

In [None]:
def product(x,y): # Define a function that returns the product of two values
    return x*y
print(product(3,3)) # Call the function with its arguments
print(product(5,15))

9
75


The keyword "def" is followed by the function name and the parenthesized list of formal parameters. The statements that form the body of the function start at the next line, and must be indented. The product() function above has "x" and "y" as its parameters. A function doesn't necessarily have parameters:

In [None]:
def say_hi(): # Define a function that prints a message
    print('Hola mundo')
say_hi() # Call the function

Hola mundo


# Built-in Function
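**range()** returns an immutable sequence representing an arithmetic progression. In Python 3 it produces a range object rather than a list, which is why the output below shows range(...) instead of the individual values. It's often used in for loops. The arguments must be integers. If the "step" argument is omitted, it defaults to 1.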

In [None]:
print(range(15)) # Integers from 0 to 14
print(range(1,21)) # Integers from 1 to 20
print(range(1,16,3)) # Integers from 1 to 15 in steps of 3

range(0, 15)
range(1, 21)
range(1, 16, 3)

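To see the individual values that a range object represents, it can be passed to list(). A minimal sketch using the same three calls as above:

```python
first = list(range(15))           # 0 through 14
second = list(range(1, 21))       # 1 through 20
stepped = list(range(1, 16, 3))   # start at 1, stop before 16, step 3

print(first)
print(second)
print(stepped)
```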

**len()** is another function often used together with range() in a for loop. This function returns the length of an object. The argument must be a sequence or a collection.

In [None]:
tickers = ['APPLE','GOOGLE','IBM','FACEBOOK','F','V']
print('The length of tickers is {}'.format(len(tickers)))
for i in range(len(tickers)): # Iterate over the indices 0 to 5 produced by range()
    print(tickers[i])

The length of tickers is 6
APPLE
GOOGLE
IBM
FACEBOOK
F
V


Note: If you only need the tickers themselves and not their indices, simply write "for ticker in tickers: print(ticker)".

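When both the index and the item are needed, the built-in enumerate() is a common alternative to range(len(...)). The sketch below is one way to write it, using a shortened ticker list for brevity:

```python
tickers = ['APPLE', 'GOOGLE', 'IBM']

# Iterate over the items directly when the index is not needed.
for ticker in tickers:
    print(ticker)

# enumerate() yields (index, item) pairs.
indexed = list(enumerate(tickers))
for i, ticker in indexed:
    print(i, ticker)
```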
**map()** is a function that applies a given function to every item of a sequence or collection. In Python 3 it returns an iterator, so we wrap it in list() to see the results.

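If a variable named list has been assigned, as in the cell below, it shadows the built-in list() function. Deleting the variable with del restores access to the built-in, which the following examples need: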

In [None]:
list = [1,2,3,4,5]
print(list)
del list # Delete the variable so that 'list' refers to the built-in again
list

[1, 2, 3, 4, 5]


list

In [None]:
tickers = ['APPLE','GOOGLE','IBM','FACEBOOK','F','V']
list(map(len,tickers)) # Build a list with the length of each string

[5, 6, 3, 8, 1, 1]

In [None]:
tickers = ['APPLE','GOOGLE','IBM','FACEBOOK','F','V', 'G', 'GE']
print(list(map(len,tickers))) # Print a list with the length of each string

[5, 6, 3, 8, 1, 1, 1, 2]


The **lambda operator** is a way to create small anonymous functions. Such functions are typically used only where they are defined. For example:

In [None]:
list(map(lambda x: x**2, range(10))) # Create a local anonymous function and pass it to map()

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

map() can be applied to more than one list. In Python 3 it stops as soon as the shortest list is exhausted, so the lists should normally have the same length.

In [None]:
list(map(lambda x, y: x+y, [1, 2, 3, 4, 5], [10, 9, 8, 7, 6])) # The anonymous function adds the values at each position of the two lists

[11, 11, 11, 11, 11]

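A quick sketch of what happens in Python 3 when the lists differ in length: map() stops at the end of the shorter list, so the extra element is silently ignored.

```python
# The first list has three items, the second only two.
short_result = list(map(lambda x, y: x + y, [1, 2, 3], [10, 20]))
print(short_result)
```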
**sorted()** takes any iterable, such as a list or set, and returns a new sorted list.

In [None]:
sorted([15, 23, 3, 0, 2]) # Sort the given list of values

[0, 2, 3, 15, 23]

We can add a "key" parameter to specify a function to be called on each list element prior to making comparisons. For example:

In [None]:
price_list = [('APPLE',144.09),('GOOGLE',911.71),('MICROSOFT',69),('FACEBOOK',150),('WMT',75.32)] # List of (name, price) tuples
sorted(price_list, key = lambda x: x[1]) # Sort the list by the price of each item, in ascending order

[('MICROSOFT', 69),
 ('WMT', 75.32),
 ('APPLE', 144.09),
 ('FACEBOOK', 150),
 ('GOOGLE', 911.71)]

By default the values are sorted in ascending order. We can change it to descending by adding the optional parameter "reverse".

In [None]:
price_list = [('APPLE',144.09),('GOOGLE',911.71),('MICROSOFT',69),('FACEBOOK',150),('WMT',75.32)]
sorted(price_list, key = lambda x: x[1],reverse = True) # Sort the list by price in descending order

[('GOOGLE', 911.71),
 ('FACEBOOK', 150),
 ('APPLE', 144.09),
 ('WMT', 75.32),
 ('MICROSOFT', 69)]

Lists also have a method list.sort(). It takes the same "key" and "reverse" arguments as sorted(), but it sorts the list in place and returns None instead of a new list.

In [None]:
price_list = [('APPLE',144.09),('GOOGLE',911.71),('MICROSOFT',69),('FACEBOOK',150),('WMT',75.32)]
price_list.sort(key = lambda x: x[1]) # Sort the list itself in place, without creating a new one
print(price_list)

[('MICROSOFT', 69), ('WMT', 75.32), ('APPLE', 144.09), ('FACEBOOK', 150), ('GOOGLE', 911.71)]

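The difference between the two is easy to trip over, so here is a minimal sketch on a shortened price list: sorted() leaves the original untouched and returns a new list, while list.sort() modifies the list and returns None.

```python
price_list = [('APPLE', 144.09), ('GOOGLE', 911.71), ('MICROSOFT', 69)]

new_list = sorted(price_list, key=lambda x: x[1])   # returns a new sorted list
unchanged = price_list[0]                           # the original order is intact

result = price_list.sort(key=lambda x: x[1])        # sorts in place, returns None

print(result)
print(price_list == new_list)
```

Writing price_list = price_list.sort(...) is therefore a common mistake, because it replaces the list with None.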

# Object-Oriented Programming
Python is an object-oriented programming language. It's important to understand the concept of "objects" because almost every kind of data from the QuantConnect API is an object.

## Class
A class is a type of data, just like a string, float, or list. When we create an object of that data type, we call it an instance of a class.

In Python, everything is an object - everything is an instance of some class. The data stored inside an object are called attributes, and the functions which are associated with the object are called methods.

For example, as mentioned above, a list is an object of the "list" class, and it has a method list.sort().

We can create our own objects by defining a class. We would do this when it's helpful to group certain functions together. For example, we define a class named "stock" here:

In [None]:
class stock:
    def __init__(self, ticker, open, close, volume): # Initialize the attributes of the stock object
        self.ticker = ticker
        self.open = open
        self.close = close
        self.volume = volume
        self.rate_return = float(close)/open - 1
 
    def update(self, open, close): # Method that updates some of the object's values
        self.open = open
        self.close = close
        self.rate_return = float(self.close)/self.open - 1
 
    def print_return(self): # Method that prints the result of close/open - 1
        print(self.rate_return)

The "stock" class has attributes "ticker", "open", "close", "volume" and "rate_return". Inside the class body, the first method is called __init__, which is a special method. When we create a new instance of the class, the __init__ method is immediately executed with all the parameters that we pass to the "stock" object. The purpose of this method is to set up a new "stock" object using the data we have provided.

Here we create two stock objects named "apple" and "google".

In [None]:
apple = stock('APPLE', 143.69, 144.09, 20109375) # Create two instances of the stock class
google = stock('GOOGLE', 898.7, 911.7, 1561616)

stock objects also have two other methods: update() and print_return(). We can access the attributes of a stock object and call its methods:

In [None]:
print(apple.ticker) # Access the value of the object's ticker attribute
google.print_return() # Print the rate of return via the print_return() method of the stock class
google.update(910.8, 1013.4) # Use the update() method of the stock object
google.print_return()# Print the rate of return again after the update

APPLE
0.11264822134387353
0.11264822134387353


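By calling the update() function, we updated the open and close prices of a stock. Note that with the original prices the first print_return() call would print roughly 0.0145 (911.7/898.7 - 1); the repeated value in the output above suggests the cell was re-run after update() had already been applied. Please also note that when we use the attributes or call the methods **inside a class**, we need to specify them as self.attribute or self.method(), otherwise Python will look them up as ordinary local or global names and will usually raise a NameError.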

We can add an attribute to an object anywhere:

In [None]:
apple.ceo = 'Tim Cook' # Create a new attribute on the 'apple' instance of the stock class
apple.ceo

'Tim Cook'

We can check what names (i.e. attributes and methods) are defined on an object using the dir() function:

In [None]:
dir(apple) # List the names (attributes and methods) defined on the instance

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'ceo',
 'close',
 'open',
 'print_return',
 'rate_return',
 'ticker',
 'update',
 'volume']

## Inheritance
Inheritance is a way of arranging classes in a hierarchy from the most general to the most specific. A "child" class is a more specific type of a "parent" class because a child class will inherit all the attributes and methods of its parent. For example, we define a class named "child" which inherits from "stock":

In [None]:
class child(stock): # Create a child class of the stock class
    def __init__(self,name):
        self.name = name # Define a name attribute for the child class

In [None]:
aa = child('hi') # Create an instance of the child class, passing the name as an argument
print(aa.name)
aa.update(230,302)
print(aa.open)
print(aa.close)
print(aa.print_return())

hi
230
302
0.31304347826086953
None


As seen above, the new class "child" has inherited the methods from "stock".

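Note that child defines its own __init__, which replaces the parent's, so attributes such as ticker are never set on aa. A common pattern, sketched below with hypothetical class names (Asset and NamedAsset, which are not part of the tutorial), is to call the parent initializer through super() so that the child gets the parent's attributes as well as its own:

```python
class Asset:
    """Hypothetical parent class, loosely modeled on the stock class above."""
    def __init__(self, ticker, open_price, close_price):
        self.ticker = ticker
        self.open_price = open_price
        self.close_price = close_price

    def rate_return(self):
        return self.close_price / self.open_price - 1


class NamedAsset(Asset):
    """Hypothetical child class that adds a name attribute."""
    def __init__(self, name, ticker, open_price, close_price):
        super().__init__(ticker, open_price, close_price)  # run the parent setup
        self.name = name


fund = NamedAsset('hi', 'IBM', 230, 302)
print(fund.name, fund.ticker, fund.rate_return())
```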
# Summary

In this chapter we have introduced functions and classes. When we write a QuantConnect algorithm, we define our algorithm as a class that inherits from QCAlgorithm. This means our algorithm inherits the QC API methods from the QCAlgorithm class.

In the next chapter, we will introduce NumPy and Pandas, which enable us to conduct scientific calculations in Python.

# 04 NumPy and Basic Pandas

# Introduction

Now that we have introduced the fundamentals of Python, it's time to learn about NumPy and Pandas.

# NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It also has strong integration with Pandas, which is another powerful tool for manipulating financial data.

Python packages like NumPy and Pandas contain classes and methods which we can use by importing the package:

In [None]:
import numpy as np

## Basic NumPy Arrays
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. Here we make an array by passing a list of Apple stock prices:

In [None]:
price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list) # Convert the list of numbers into a NumPy array
print(price_array, type(price_array))

[143.73 145.83 143.68 144.02 143.5  142.62] <class 'numpy.ndarray'>


Notice that the type of array is "ndarray" which is a multi-dimensional array. If we pass np.array() a list of lists, it will create a 2-dimensional array.

In [None]:
Ar = np.array([[1,2],[3,4], [5,6]]) # Create a matrix (2-dimensional array) from a list of lists
print(Ar, type(Ar))

[[1 2]
 [3 4]
 [5 6]] <class 'numpy.ndarray'>


We get the dimensions of an ndarray using the .shape attribute:

In [None]:
print(Ar.shape) # Show the dimensions of the matrix as (rows, columns)

(3, 2)


If we create a 2-dimensional array (i.e. matrix), each row can be accessed by index:

In [None]:
print(Ar[0]) # Show the given row of the matrix, counting from 0
print(Ar[1])

[1 2]
[3 4]


If we want to access the matrix by column instead:

In [None]:
print('the first column: ', Ar[:,0]) # Show the given column of the matrix
print('the second column: ', Ar[:,1])

the first column:  [1 3 5]
the second column:  [2 4 6]

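Row and column indices can also be combined in one pair of brackets. The sketch below rebuilds the same array and shows single-element access, a slice of rows, and a per-column calculation using the axis argument of mean(). These calls are standard NumPy, although the tutorial text does not cover them.

```python
import numpy as np

Ar = np.array([[1, 2], [3, 4], [5, 6]])

element = Ar[1, 0]               # row 1, column 0
last_two_rows = Ar[1:, :]        # rows 1 and 2, all columns
column_means = Ar.mean(axis=0)   # mean of each column

print(element)
print(last_two_rows)
print(column_means)
```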

## Array Functions
NumPy has built-in functions that allow us to perform calculations on arrays. For example, we can apply the natural logarithm to each element of an array:

In [None]:
print(np.log(price_array)) # Apply the natural logarithm to each value in the array

[4.96793654 4.98244156 4.9675886  4.96995218 4.96633504 4.96018375]


Other functions return a single value:

In [None]:
print(np.mean(price_array)) # Mean of the stock prices in the array
print(np.std(price_array)) # Standard deviation of the prices in the array
print(np.sum(price_array)) # Sum of all the elements of the array
print(np.max(price_array)) # Largest value in the array

143.89666666666668
0.9673790478515796
863.38
145.83


The functions above return the mean, standard deviation, total and maximum value of an array.

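Element-wise functions combine naturally into financial calculations. As an illustration that goes slightly beyond the text above, the sketch below computes daily log returns from the same price array using np.log() and np.diff(). The sum of the log returns equals the log of the total price change over the period.

```python
import numpy as np

price_array = np.array([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

# Log return for each day: log(p_t) - log(p_{t-1})
log_returns = np.diff(np.log(price_array))
total_log_return = np.log(price_array[-1] / price_array[0])

print(log_returns)
print(log_returns.sum(), total_log_return)
```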
# Pandas
Pandas is one of the most powerful tools for dealing with financial data. 

First we need to import Pandas:

In [None]:
import pandas as pd

## Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

We create a Series by calling pd.Series(data), where data can be a dictionary, an array or just a scalar value.

In [None]:
price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price) # Convert the list into a one-dimensional Series
s

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
dtype: float64

We can customize the indices of a new Series:

In [None]:
s = pd.Series(price,index = ['i.', 'ii.', 'iii.', 'iv.', 'v.', 'vi.']) # Use the given labels as the index of the Series
s

i.      143.73
ii.     145.83
iii.    143.68
iv.     144.02
v.      143.50
vi.     142.62
dtype: float64

Or we can change the indices of an existing Series:

In [None]:
s.index = [6, 5, 4, 3, 2, 1] # Another way to change the index of the Series
s

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64

A Series is like a list in that it can be sliced by position:

In [None]:
print(s[1:]) # As with a list, show the values of the Series from a given position onward
print(s[:-2]) # Show the values of the Series except the last two

5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64
6    143.73
5    145.83
4    143.68
3    144.02
dtype: float64


A Series is also like a dictionary whose values can be set or fetched by index label:

In [None]:
print(s[3]) # Imprime el valor de la serie que está el la posición 3
s[5] = 1000 # Modifica el valor de que está en la posición 5
print(s)

144.02
6     143.73
5    1000.00
4     143.68
3     144.02
2     143.50
1     142.62
dtype: float64


A Series can also have a name attribute, which is used as the column name when we build a Pandas DataFrame from several Series.

In [None]:
s = pd.Series(price, name = 'Apple Price List :D') # Give the Series a 'name' attribute
print(s)
print(s.name) # Print the 'name' attribute set above

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
Name: Apple Price List :D, dtype: float64
Apple Price List :D


We can get the statistical summaries of a Series:

In [None]:
print(s.describe()) # Show a statistical summary of the Series

count      6.000000
mean     143.896667
std        1.059711
min      142.620000
25%      143.545000
50%      143.705000
75%      143.947500
max      145.830000
Name: Apple Price List :D, dtype: float64
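
The "std" reported here (1.059711) differs from the value np.std() printed earlier (0.967379). The sketch below reproduces both from the same price list: NumPy divides by N by default, while Pandas divides by N - 1.

```python
import numpy as np
import pandas as pd

price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price)

population_std = np.std(price)          # NumPy default: ddof=0, divide by N
sample_std = s.std()                    # Pandas default: ddof=1, divide by N - 1
numpy_sample_std = np.std(price, ddof=1)  # NumPy asked to use N - 1 as well

print(population_std, sample_std, numpy_sample_std)
```

Passing ddof explicitly to either library is a simple way to make sure two results are comparable.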


## Time Index
Pandas has a built-in function specifically for creating date indices: pd.date_range(). We use it to create a new index for our Series:

In [None]:
time_index = pd.date_range('2021-10-30', periods = len(s), freq = 'D') # Daily dates starting on the given date, one per element of s
print(time_index)
s.index = time_index # The dates become the index of the Series
print(s)

DatetimeIndex(['2021-10-30', '2021-10-31', '2021-11-01', '2021-11-02',
               '2021-11-03', '2021-11-04'],
              dtype='datetime64[ns]', freq='D')
2021-10-30    143.73
2021-10-31    145.83
2021-11-01    143.68
2021-11-02    144.02
2021-11-03    143.50
2021-11-04    142.62
Freq: D, Name: Apple Price List :D, dtype: float64


Series elements are usually accessed through the iloc[] and loc[] indexers. iloc[] selects elements by integer position, and loc[] selects them by index label.

iloc[] is necessary when the index labels of a Series are themselves integers. Take our previously defined Series as an example:

In [None]:
s.index = [6,5,4,3,2,1]
print(s)
print(s[1])

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
Name: Apple Price List :D, dtype: float64
142.62


If we intended to take the second element of the series, s[1] would be a mistake, because the index labels are integers and s[1] returns the element labelled 1. To access an element by its position, we use iloc[] instead:

In [None]:
print(s.iloc[5]) # Element at position 5 (the sixth element), regardless of its label
print(s[5]) # Element whose index label is 5, for comparison with the iloc[] result

142.62
145.83
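
A compact way to keep the two lookups apart is to always spell out loc[] or iloc[]. The sketch below rebuilds the same Series so that it can be run on its own.

```python
import pandas as pd

price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price, index=[6, 5, 4, 3, 2, 1])

by_label = s.loc[5]           # the element whose label is 5
by_position = s.iloc[1]       # the second element, whatever its label
last_by_position = s.iloc[5]  # the sixth element, whose label happens to be 1

print(by_label, by_position, last_by_position)
```

Here the label 5 and the position 1 point at the same element, which is exactly why a bare s[...] is easy to misread when labels are integers.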


While working with time series data, we often use time as the index. Pandas provides us with various ways to access the data by time index:

In [None]:
s.index = time_index
print(s['2021-10-30']) # Value of the Series at the given date

143.73


We can even access a range of dates:

In [None]:
print(s['2021-10-31':'2021-11-03']) # Values of the Series within the given date range

2021-10-31    145.83
2021-11-01    143.68
2021-11-02    144.02
2021-11-03    143.50
Freq: D, Name: Apple Price List :D, dtype: float64
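
Note that the end date '2021-11-03' is part of the result. Label-based slices include both end points, unlike positional slices. A small sketch on the same six prices:

```python
import pandas as pd

idx = pd.date_range('2021-10-30', periods=6, freq='D')
s = pd.Series([143.73, 145.83, 143.68, 144.02, 143.5, 142.62], index=idx)

by_label = s.loc['2021-10-31':'2021-11-03']  # both end dates are included
by_position = s.iloc[1:4]                    # stop position 4 is excluded

print(len(by_label), len(by_position))
```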


Series[] provides us a very flexible way to index data. We can put a Boolean condition in the square brackets:

In [None]:
print(s[s > np.mean(s)]) # Values that are greater than the mean of the Series
print([(s > np.mean(s)) & (s < np.mean(s) + 1.64*np.std(s))]) # Boolean mask: True where the value is above the mean and below the mean + 1.64 standard deviations

2021-10-31    145.83
2021-11-02    144.02
Name: Apple Price List :D, dtype: float64
[2021-10-30    False
2021-10-31    False
2021-11-01    False
2021-11-02     True
2021-11-03    False
2021-11-04    False
Freq: D, Name: Apple Price List :D, dtype: bool]


As demonstrated, we can use logical operators like & (and), | (or) and ~ (not) to group multiple conditions.
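
The cell above only uses &. A minimal sketch of | and ~ on the same prices follows; the thresholds 143.0 and 145.0 are arbitrary values chosen for illustration. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s = pd.Series([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

is_extreme = (s < 143.0) | (s > 145.0)  # below 143 OR above 145
extremes = s[is_extreme]
not_extremes = s[~is_extreme]           # ~ negates the Boolean mask

print(extremes)
print(not_extremes)
```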

# Summary
Here we have introduced NumPy and Pandas for scientific computing in Python. In the next chapter, we will dive into Pandas to learn about resampling and manipulating DataFrames, which are commonly used in financial data analysis.

<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png"> <img style="display: block; margin: auto;" alt="photo" src="https://www.marketing-branding.com/wp-content/uploads/2020/07/google-colaboratory-colab-guia-completa.jpg " width="50" height="50">
<img style="display: block; margin: auto;" alt="photo" src="https://upload.wikimedia.org/wikipedia/commons/d/da/Yahoo_Finance_Logo_2019.svg" width="50" height="50">  

Quantconnect -> Google Colab with Yahoo Finance data

Introduction to Financial Python
</div>

# 05 Pandas-Resampling and DataFrame

# Introduction
In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.

# Fetching Data
Here we use Yahoo Finance, through the yfinance package, to retrieve data.


In [None]:
!pip install yfinance

Collecting yfinance
  Downloading yfinance-0.1.64.tar.gz (26 kB)
Collecting lxml>=4.5.1
  Downloading lxml-4.6.3-cp37-cp37m-manylinux2014_x86_64.whl (6.3 MB)
Building wheels for collected packages: yfinance
  Building wheel for yfinance (setup.py) ... done
  Created wheel for yfinance: filename=yfinance-0.1.64-py2.py3-none-any.whl size=24109 sha256=4964c52b534f7f2caf0ebc1366b56a05a7ebe9c62c682557574287a26284cda2
  Stored in directory: /root/.cache/pip/wheels/86/fe/9b/a4d3d78796b699e37065e5b6c27b75cff448ddb8b24943c288
Successfully built yfinance
Installing collected packages: lxml, yfinance
  Attempting uninstall: lxml
    Found existing installation: lxml 4.2.6
    Uninstalling lxml-4.2.6:
      Successfully uninstalled lxml-4.2.6
Successfully installed lxml-4.6.3 yfinance-0.1.64


In [None]:
import yfinance as yf

aapl = yf.Ticker("AAPL") # Ticker object that gives access to Apple's data

# get stock info
print(aapl.info) # General information about the company and its stock

# get historical market data
aapl_table = aapl.history(start="2020-01-01",  end="2021-09-30") # Daily price history (and related columns) for the given date range
aapl_table

{'zip': '95014', 'sector': 'Technology', 'fullTimeEmployees': 154000, 'longBusinessSummary': 'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. In addition, the company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; AirPods Max, an over-ear wireless headphone; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, and iPod touch. Further, it provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. Additionally, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with o

                  Open        High         Low       Close     Volume  Dividends  Stock Splits
Date
2020-01-02   73.082524   74.158141   72.823491   74.096466  135480400        0.0           0.0
2020-01-03   73.307008   74.153188   73.146654   73.376083  146322800        0.0           0.0
2020-01-06   72.478106   74.000243   72.221535   73.960770  118387200        0.0           0.0
2020-01-07   73.970634   74.232135   73.388424   73.612923  108872000        0.0           0.0
2020-01-08   73.309478   75.105456   73.309478   74.797081  132079200        0.0           0.0
...                ...         ...         ...         ...        ...        ...           ...
2021-09-23  146.649994  147.080002  145.639999  146.830002   64838200        0.0           0.0
2021-09-24  145.660004  147.470001  145.559998  146.919998   53477900        0.0           0.0
2021-09-27  145.470001  145.960007  143.820007  145.369995   74150700        0.0           0.0
2021-09-28  143.250000  144.750000  141.690002  141.910004  108972300        0.0           0.0


We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:

In [None]:
aapl = aapl_table['Close']['2020'] # Apple's daily closing prices for 2020

In [None]:
print(aapl)

Date
2020-01-02     74.096466
2020-01-03     73.376083
2020-01-06     73.960770
2020-01-07     73.612923
2020-01-08     74.797081
                 ...    
2020-12-24    131.352844
2020-12-28    136.050781
2020-12-29    134.239273
2020-12-30    133.094666
2020-12-31    132.069473
Name: Close, Length: 253, dtype: float64


Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].

In [None]:
print(aapl['2020-2']) # Closing prices for one specific month

Date
2020-02-03    76.146538
2020-02-04    78.660416
2020-02-05    79.301834
2020-02-06    80.229416
2020-02-07    79.138893
2020-02-10    79.514763
2020-02-11    79.035027
2020-02-12    80.911934
2020-02-13    80.335747
2020-02-14    80.355537
2020-02-18    78.884186
2020-02-19    80.026649
2020-02-20    79.205658
2020-02-21    77.412827
2020-02-24    73.735695
2020-02-25    71.238098
2020-02-26    72.368210
2020-02-27    67.637619
2020-02-28    67.598061
Name: Close, dtype: float64


In [None]:
aapl['2020-2':'2020-4'] # Closing prices for a range of months, both end months included

Date
2020-02-03    76.146538
2020-02-04    78.660416
2020-02-05    79.301834
2020-02-06    80.229416
2020-02-07    79.138893
                ...    
2020-04-24    69.974464
2020-04-27    70.023949
2020-04-28    68.888893
2020-04-29    71.151558
2020-04-30    72.652580
Name: Close, Length: 62, dtype: float64
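
The same partial-date selection can be written with loc[], which makes it explicit that we are selecting by label. The sketch below uses a synthetic business-day series, not the downloaded prices, so holidays are not excluded and the row counts differ from the output above.

```python
import numpy as np
import pandas as pd

# Synthetic series: one value per weekday of 2020 (an assumption for illustration)
idx = pd.bdate_range('2020-01-01', '2020-12-31')
prices = pd.Series(np.arange(len(idx), dtype=float), index=idx)

february = prices.loc['2020-02']               # every row in February 2020
feb_to_apr = prices.loc['2020-02':'2020-04']   # February through the end of April

print(len(february))
print(feb_to_apr.index[0], feb_to_apr.index[-1])
```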

.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.

In [None]:
print(aapl.head(10)) # First 10 rows of the Series, i.e. the first 10 trading days of the year
print(aapl.tail(10)) # Last 10 rows of the Series, i.e. the last 10 trading days of the year

Date
2020-01-02    74.096466
2020-01-03    73.376083
2020-01-06    73.960770
2020-01-07    73.612923
2020-01-08    74.797081
2020-01-09    76.385841
2020-01-10    76.558517
2020-01-13    78.194138
2020-01-14    77.138268
2020-01-15    76.807693
Name: Close, dtype: float64
Date
2020-12-17    128.098129
2020-12-18    126.067680
2020-12-21    127.630333
2020-12-22    131.263260
2020-12-23    130.347565
2020-12-24    131.352844
2020-12-28    136.050781
2020-12-29    134.239273
2020-12-30    133.094666
2020-12-31    132.069473
Name: Close, dtype: float64


# Resampling
**_series.resample(freq)_** returns a "DatetimeIndexResampler" object, which groups the data of a Series into regular time intervals. The argument "freq" determines the length of each interval.

**_series.resample(freq).mean()_** is a complete statement that groups data into intervals and then computes the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:

In [None]:
by_month = aapl.resample('M').mean() # Resample the data into monthly intervals
                                     # and take the mean closing price of each month
print(by_month)

Date
2020-01-31     76.949838
2020-02-29     76.933532
2020-03-31     64.898713
2020-04-30     67.357250
2020-05-31     76.812860
2020-06-30     85.744504
2020-07-31     94.784615
2020-08-31    116.512445
2020-09-30    114.389503
2020-10-31    115.669217
2020-11-30    116.240820
2020-12-31    126.695181
Freq: M, Name: Close, dtype: float64


We can also aggregate the data by week:

In [None]:
by_week = aapl.resample('W').mean() # Resample the data into weekly intervals
print(by_week.head(6)) # Mean closing price for the first 6 weeks of the year

Date
2020-01-05    73.736275
2020-01-12    75.063026
2020-01-19    77.708145
2020-01-26    78.439001
2020-02-02    78.172438
2020-02-09    78.695419
Freq: W-SUN, Name: Close, dtype: float64
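
To see which rows land in each weekly bin, a tiny synthetic series is easier to check by hand than real prices. The sketch below uses 14 consecutive days starting on a Monday with the made-up values 1 to 14.

```python
import pandas as pd

idx = pd.date_range('2020-01-06', periods=14, freq='D')  # 2020-01-06 is a Monday
daily = pd.Series(range(1, 15), index=idx, dtype=float)

# 'W' means weeks ending on Sunday; each bin is labelled with that Sunday
weekly_mean = daily.resample('W').mean()
print(weekly_mean)
```

The first bin holds the values 1 to 7 and the second holds 8 to 14, so the two means are 4 and 11. Recent Pandas releases spell the month-end alias 'ME' rather than 'M'; the weekly alias 'W' used here is the same in both.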


We can also aggregate the data by month with max:

In [None]:
aapl.resample('M').max() # Resample into monthly intervals and take the highest closing price of each month

Date
2020-01-31     80.014793
2020-02-29     80.911934
2020-03-31     74.863319
2020-04-30     72.652580
2020-05-31     79.154755
2020-06-30     90.883041
2020-07-31    105.390907
2020-08-31    128.215332
2020-09-30    133.322495
2020-10-31    123.604996
2020-11-30    119.737419
2020-12-31    136.050781
Freq: M, Name: Close, dtype: float64

We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.

In [None]:
three_day = aapl.resample('3D').mean() # Mean closing price over intervals of 3 days
two_week = aapl.resample('2W').mean() # Mean closing price over intervals of 2 weeks
two_month = aapl.resample('2M').mean() # Mean closing price over intervals of 2 months


print(three_day)
print(two_week)
print(two_month)

Date
2020-01-02     73.736275
2020-01-05     73.786846
2020-01-08     75.913813
2020-01-11     78.194138
2020-01-14     77.238594
                 ...    
2020-12-18    126.067680
2020-12-21    129.747053
2020-12-24    131.352844
2020-12-27    135.145027
2020-12-30    132.582069
Freq: 3D, Name: Close, Length: 122, dtype: float64
Date
2020-01-05     73.736275
2020-01-19     76.385586
2020-02-02     78.290911
2020-02-16     79.363010
2020-03-01     74.234111
2020-03-15     69.881996
2020-03-29     60.305849
2020-04-12     62.983980
2020-04-26     69.082268
2020-05-10     72.662711
2020-05-24     77.718850
2020-06-07     79.763347
2020-06-21     85.621429
2020-07-05     89.779913
2020-07-19     94.893465
2020-08-02     95.576089
2020-08-16    111.076923
2020-08-30    120.777678
2020-09-13    120.567858
2020-09-27    110.262808
2020-10-11    114.399205
2020-10-25    117.998057
2020-11-08    113.183499
2020-11-22    117.841325
2020-12-06    118.518699
2020-12-20    124.255190
2021-01-03    

Besides the mean() method, other methods can also be used with the resampler:



In [None]:
monthly_std = aapl.resample('M').std() # Standard deviation of the prices for each month
weekly_max = aapl.resample('W').max() # Highest price of each week
weekly_min = aapl.resample('W').min() # Lowest price of each week


print(monthly_std)
print(weekly_max)
print(weekly_min)

Date
2020-01-31    2.016131
2020-02-29    4.287998
2020-03-31    5.647737
2020-04-30    3.799528
2020-05-31    2.301080
2020-06-30    3.665461
2020-07-31    3.134716
2020-08-31    6.615103
2020-09-30    6.942130
2020-10-31    3.438472
2020-11-30    3.093313
2020-12-31    4.766074
Freq: M, Name: Close, dtype: float64
Date
2020-01-05     74.096466
2020-01-12     76.558517
2020-01-19     78.630806
2020-01-26     78.754166
2020-02-02     80.014793
2020-02-09     80.229416
2020-02-16     80.911934
2020-02-23     80.026649
2020-03-01     73.735695
2020-03-08     74.863319
2020-03-15     70.560555
2020-03-22     62.528706
2020-03-29     63.908558
2020-04-05     63.010906
2020-04-12     66.270126
2020-04-19     70.983398
2020-04-26     69.974464
2020-05-03     72.652580
2020-05-10     76.898369
2020-05-17     78.108391
2020-05-24     79.154755
2020-05-31     78.911758
2020-06-07     82.197166
2020-06-14     87.488525
2020-06-21     87.300087
2020-06-28     90.883041
2020-07-05     90.454071
20

Often we want to calculate monthly returns of a stock, based on prices on the last day of each month. To fetch those prices, we use the series.resample(freq).agg() method:

In [None]:
last_day = aapl.resample('M').agg(lambda x: x.iloc[-1]) # Closing price on the last trading day of each month
print(last_day)

Date
2020-01-31     76.356232
2020-02-29     67.598061
2020-03-31     62.882317
2020-04-30     72.652580
2020-05-31     78.834900
2020-06-30     90.454071
2020-07-31    105.390907
2020-08-31    128.215332
2020-09-30    115.069885
2020-10-31    108.164307
2020-11-30    118.493263
2020-12-31    132.069473
Freq: M, Name: Close, dtype: float64
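
For the common case of picking the first or last observation in each interval, the resampler also has built-in first() and last() methods. The sketch below compares them with the lambda approach on a synthetic daily series (made-up values 1 to 14) with weekly bins.

```python
import pandas as pd

idx = pd.date_range('2020-01-06', periods=14, freq='D')  # two full Monday-to-Sunday weeks
daily = pd.Series(range(1, 15), index=idx, dtype=float)

last_by_agg = daily.resample('W').agg(lambda x: x.iloc[-1])  # custom aggregation
last_builtin = daily.resample('W').last()                    # built-in equivalent
first_builtin = daily.resample('W').first()

print(last_by_agg.tolist(), last_builtin.tolist(), first_builtin.tolist())
```

agg() with a lambda remains the more general tool, since it accepts any function of the values in an interval.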


Or directly calculate the monthly rates of return using the data for the first day and the last day:

In [None]:
monthly_return = aapl.resample('M').agg(lambda x: x.iloc[-1]/x.iloc[0] - 1) # Return within each month: (last price)/(first price) - 1
print(monthly_return)

Date
2020-01-31    0.030498
2020-02-29   -0.112264
2020-03-31   -0.148991
2020-04-30    0.219543
2020-05-31    0.102849
2020-06-30    0.133447
2020-07-31    0.167340
2020-08-31    0.186668
2020-09-30   -0.136906
2020-10-31   -0.067900
2020-11-30    0.096400
2020-12-31    0.081242
Freq: M, Name: Close, dtype: float64


The Series object also provides some convenient methods for quick calculations.

In [None]:
print(monthly_return.mean()) # Mean of the monthly returns
print(monthly_return.std()) # Standard deviation of the monthly returns
print(monthly_return.max()) # Largest monthly return

0.04599382488058706
0.1310883002937684
0.2195425063061609


Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.

In [None]:
print(last_day.diff()) # Difference between each month-end price and the previous one
print(last_day.pct_change()) # Percentage change between each month-end price and the previous one

Date
2020-01-31          NaN
2020-02-29    -8.758171
2020-03-31    -4.715744
2020-04-30     9.770264
2020-05-31     6.182320
2020-06-30    11.619171
2020-07-31    14.936836
2020-08-31    22.824425
2020-09-30   -13.145447
2020-10-31    -6.905579
2020-11-30    10.328957
2020-12-31    13.576210
Freq: M, Name: Close, dtype: float64
Date
2020-01-31         NaN
2020-02-29   -0.114701
2020-03-31   -0.069762
2020-04-30    0.155374
2020-05-31    0.085094
2020-06-30    0.147386
2020-07-31    0.165132
2020-08-31    0.216569
2020-09-30   -0.102526
2020-10-31   -0.060012
2020-11-30    0.095493
2020-12-31    0.114574
Freq: M, Name: Close, dtype: float64
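
These percentage changes are close to, but not the same as, the monthly_return values computed earlier. monthly_return compares the last and first prices inside one month, while pct_change() on last_day compares consecutive month-end prices. The arithmetic for February 2020, using prices copied from the outputs above:

```python
# Closing prices copied from the printed outputs earlier in this chapter
jan_last = 76.356232   # last close of January 2020
feb_first = 76.146538  # first close of February 2020
feb_last = 67.598061   # last close of February 2020

within_month = feb_last / feb_first - 1     # definition used by monthly_return
month_over_month = feb_last / jan_last - 1  # definition used by last_day.pct_change()

print(round(within_month, 6), round(month_over_month, 6))
```

The two numbers differ by the price move between the last trading day of January and the first trading day of February.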


Notice that calculating percentage changes, i.e. returns, produced a NaN value: the first element has no previous value to compare with.

When dealing with NaN values, we usually either remove the data point or fill it with a specific value. Here we fill it with 0:

In [None]:
month_end_return = last_day.pct_change() # Percentage change between consecutive month-end prices
print(month_end_return.fillna(0)) # Replace the NaN (Not a Number) with 0

Date
2020-01-31    0.000000
2020-02-29   -0.114701
2020-03-31   -0.069762
2020-04-30    0.155374
2020-05-31    0.085094
2020-06-30    0.147386
2020-07-31    0.165132
2020-08-31    0.216569
2020-09-30   -0.102526
2020-10-31   -0.060012
2020-11-30    0.095493
2020-12-31    0.114574
Freq: M, Name: Close, dtype: float64


Alternatively, we can fill a NaN with the next valid value. This is called 'backward fill', or 'bfill' for short:

In [None]:
month_end_return = last_day.pct_change() # Percentage change between consecutive month-end prices
print(month_end_return.bfill()) # Backward fill: replace each NaN with the next valid value

Date
2020-01-31   -0.114701
2020-02-29   -0.114701
2020-03-31   -0.069762
2020-04-30    0.155374
2020-05-31    0.085094
2020-06-30    0.147386
2020-07-31    0.165132
2020-08-31    0.216569
2020-09-30   -0.102526
2020-10-31   -0.060012
2020-11-30    0.095493
2020-12-31    0.114574
Freq: M, Name: Close, dtype: float64


As expected, since there is a 'backward fill' method, there is also a 'forward fill' method, or 'ffill' for short. However, we can't use it here because the NaN is the first value, so there is no earlier value to carry forward.
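
A small made-up Series with gaps in the middle and at the end shows how the two directions behave, including the case where a fill has nothing to copy from:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

forward = s.ffill()   # each NaN takes the previous valid value
backward = s.bfill()  # each NaN takes the next valid value

print(forward)
print(backward)
```

The trailing NaN survives the backward fill for the same reason a leading NaN survives a forward fill.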

We can also simply remove NaN values with **_.dropna()_**:

In [None]:
month_end_return = last_day.pct_change()
month_end_return.dropna() # Drop the NaN values from the result

Date
2020-02-29   -0.114701
2020-03-31   -0.069762
2020-04-30    0.155374
2020-05-31    0.085094
2020-06-30    0.147386
2020-07-31    0.165132
2020-08-31    0.216569
2020-09-30   -0.102526
2020-10-31   -0.060012
2020-11-30    0.095493
2020-12-31    0.114574
Freq: M, Name: Close, dtype: float64

# DataFrame
The **DataFrame** is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.

More precisely, a DataFrame is a collection of Series objects, each of which may contain different data types. A DataFrame can be created from various data types: dictionary, 2-D numpy.ndarray, a Series or another DataFrame.

## Create DataFrames
The most common method of creating a DataFrame is passing a dictionary:

In [None]:
import pandas as pd

price_dict = {'AAPL': [143.5, 144.09, 142.73, 144.18, 143.77],'GOOG':[898.7, 911.71, 906.69, 918.59, 926.99],
              'IBM':[155.58, 153.67, 152.36, 152.94, 153.49]} # Each key becomes a column; named price_dict to avoid shadowing the built-in dict
data_index = pd.date_range('2020-09-15',periods = 5, freq = 'D') # Five daily dates starting on the given date, used as the row index
df = pd.DataFrame(price_dict, index = data_index) # Build the DataFrame from the dictionary
print(df)

              AAPL    GOOG     IBM
2020-09-15  143.50  898.70  155.58
2020-09-16  144.09  911.71  153.67
2020-09-17  142.73  906.69  152.36
2020-09-18  144.18  918.59  152.94
2020-09-19  143.77  926.99  153.49
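
The other inputs mentioned above work in a similar way. A minimal sketch with a 2-D NumPy array and with a Series, reusing a few of the numbers from the dictionary example:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2020-09-15', periods=3, freq='D')

# From a 2-D array: rows are dates, and column names are passed separately
values = np.array([[143.5, 898.7], [144.09, 911.71], [142.73, 906.69]])
df_from_array = pd.DataFrame(values, index=dates, columns=['AAPL', 'GOOG'])

# From a Series: the dictionary key becomes the column name, the Series index becomes the row index
aapl_col = pd.Series([143.5, 144.09, 142.73], index=dates)
df_from_series = pd.DataFrame({'AAPL': aapl_col})

print(df_from_array)
print(df_from_series)
```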


## Manipulating DataFrames
We can fetch values in a DataFrame by columns and index. Each column in a DataFrame is essentially a Pandas Series. We can fetch a column by square brackets: **df['column_name']**

If a column name contains no spaces, then we can also use df.column_name to fetch a column:

In [None]:
df = aapl_table
print(df.Close.tail(10)) # Last 10 rows of the Close column, using attribute access
print(df['Volume'].tail(10)) # Last 10 rows of the Volume column, using square brackets

Date
2021-09-16    148.789993
2021-09-17    146.059998
2021-09-20    142.940002
2021-09-21    143.429993
2021-09-22    145.850006
2021-09-23    146.830002
2021-09-24    146.919998
2021-09-27    145.369995
2021-09-28    141.910004
2021-09-29    142.830002
Name: Close, dtype: float64
Date
2021-09-16     68034100
2021-09-17    129868800
2021-09-20    123478900
2021-09-21     75834000
2021-09-22     76404300
2021-09-23     64838200
2021-09-24     53477900
2021-09-27     74150700
2021-09-28    108972300
2021-09-29     74602000
Name: Volume, dtype: int64


All the tools we applied to a Series, such as iloc[], loc[] and the resampling methods, can also be applied to a DataFrame:

In [None]:
aapl_2021 = df.loc['2021'] # Select the rows for the given year
aapl_month = aapl_2021.resample('2W').agg(lambda x: x.iloc[-1]) # For every column, keep the last value of each two-week interval
print(aapl_month)

                  Open        High  ...  Dividends  Stock Splits
Date                                ...                         
2021-01-10  131.810677  132.009754  ...      0.000           0.0
2021-01-24  135.642671  139.195983  ...      0.000           0.0
2021-02-07  136.911983  136.981751  ...      0.205           0.0
2021-02-21  129.824641  130.293143  ...      0.000           0.0
2021-03-07  120.594177  121.551114  ...      0.000           0.0
2021-03-21  119.517621  121.042740  ...      0.000           0.0
2021-04-04  123.265626  123.783964  ...      0.000           0.0
2021-04-18  133.871700  134.240515  ...      0.000           0.0
2021-05-02  131.359728  133.134050  ...      0.000           0.0
2021-05-16  126.061129  127.698675  ...      0.000           0.0
2021-05-30  125.382147  125.611806  ...      0.000           0.0
2021-06-13  126.340704  127.249347  ...      0.000           0.0
2021-06-27  133.260356  133.689705  ...      0.000           0.0
2021-07-11  142.536444  1

We may select certain columns of a DataFrame using their names:

In [None]:
aapl_bar = aapl_month[['Open', 'High', 'Low', 'Close']] # Keep only the listed columns of the DataFrame
print(aapl_bar)

                  Open        High         Low       Close
Date                                                      
2021-01-10  131.810677  132.009754  129.620969  131.432465
2021-01-24  135.642671  139.195983  134.388569  138.419632
2021-02-07  136.911983  136.981751  135.426729  136.323853
2021-02-21  129.824641  130.293143  128.389231  129.455811
2021-03-07  120.594177  121.551114  117.195048  121.032768
2021-03-21  119.517621  121.042740  119.298321  119.607330
2021-04-04  123.265626  123.783964  122.099351  122.607727
2021-04-18  133.871700  134.240515  132.854949  133.732147
2021-05-02  131.359728  133.134050  130.652001  131.040756
2021-05-16  126.061129  127.698675  125.661726  127.259331
2021-05-30  125.382147  125.611806  124.363676  124.423584
2021-06-13  126.340704  127.249347  125.911347  127.159477
2021-06-27  133.260356  133.689705  132.611319  132.910873
2021-07-11  142.536444  145.432099  142.436587  144.892914
2021-07-25  147.329270  148.497518  146.700207  148.3377

We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:

In [None]:
print(aapl_month.loc['2021-03':'2021-06',['Open', 'High', 'Low', 'Close']]) # Rows in the given date range, restricted to the listed columns

                  Open        High         Low       Close
Date                                                      
2021-03-07  120.594177  121.551114  117.195048  121.032768
2021-03-21  119.517621  121.042740  119.298321  119.607330
2021-04-04  123.265626  123.783964  122.099351  122.607727
2021-04-18  133.871700  134.240515  132.854949  133.732147
2021-05-02  131.359728  133.134050  130.652001  131.040756
2021-05-16  126.061129  127.698675  125.661726  127.259331
2021-05-30  125.382147  125.611806  124.363676  124.423584
2021-06-13  126.340704  127.249347  125.911347  127.159477
2021-06-27  133.260356  133.689705  132.611319  132.910873
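
iloc[] accepts the same "rows, columns" pair, but with integer positions instead of labels. The sketch below uses a small made-up DataFrame so that both selections can be compared directly.

```python
import pandas as pd

dates = pd.date_range('2021-03-01', periods=4, freq='D')
df = pd.DataFrame({'Open': [1.0, 2.0, 3.0, 4.0],
                   'Close': [1.5, 2.5, 3.5, 4.5],
                   'Volume': [10, 20, 30, 40]}, index=dates)

by_label = df.loc['2021-03-02':'2021-03-03', ['Open', 'Close']]  # end date included
by_position = df.iloc[1:3, [0, 1]]                               # stop position excluded

print(by_label)
print(by_position)
```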


Subsetting a DataFrame is quite useful. By writing logical statements in square brackets, we can make customized subsets:

In [None]:
import numpy as np

above = aapl_bar[aapl_bar.Close > np.mean(aapl_bar.Close)] # Rows whose Close price is above the mean Close price of the table
print(above)

                  Open        High         Low       Close
Date                                                      
2021-01-24  135.642671  139.195983  134.388569  138.419632
2021-02-07  136.911983  136.981751  135.426729  136.323853
2021-07-11  142.536444  145.432099  142.436587  144.892914
2021-07-25  147.329270  148.497518  146.700207  148.337753
2021-08-08  146.350006  147.110001  145.630005  146.139999
2021-08-22  147.440002  148.500000  146.779999  148.190002
2021-09-05  153.759995  154.630005  153.089996  154.300003
2021-09-19  148.820007  148.820007  145.759995  146.059998
2021-10-03  142.470001  144.449997  142.029999  142.830002


## Data Validation
As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:

In [None]:
aapl_bar['rate_return'] = aapl_bar.Close.pct_change() # Add a new column holding the percentage change of the Close price
print(aapl_bar)

                  Open        High         Low       Close  rate_return
Date                                                                   
2021-01-10  131.810677  132.009754  129.620969  131.432465          NaN
2021-01-24  135.642671  139.195983  134.388569  138.419632     0.053162
2021-02-07  136.911983  136.981751  135.426729  136.323853    -0.015141
2021-02-21  129.824641  130.293143  128.389231  129.455811    -0.050380
2021-03-07  120.594177  121.551114  117.195048  121.032768    -0.065065
2021-03-21  119.517621  121.042740  119.298321  119.607330    -0.011777
2021-04-04  123.265626  123.783964  122.099351  122.607727     0.025085
2021-04-18  133.871700  134.240515  132.854949  133.732147     0.090732
2021-05-02  131.359728  133.134050  130.652001  131.040756    -0.020125
2021-05-16  126.061129  127.698675  125.661726  127.259331    -0.028857
2021-05-30  125.382147  125.611806  124.363676  124.423584    -0.022283
2021-06-13  126.340704  127.249347  125.911347  127.159477     0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

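The warning printed above appears because aapl_bar was itself selected from another DataFrame, so Pandas cannot tell whether the new column should also affect the original. One common way to avoid it, sketched here on made-up data with the hypothetical names table and bars, is to take an explicit copy before adding columns:

```python
import pandas as pd

# Made-up stand-in for a price table (illustrative values only)
table = pd.DataFrame({'Open': [10.0, 11.0, 12.0],
                      'Close': [10.5, 11.5, 11.0],
                      'Volume': [100, 200, 300]})

bars = table[['Open', 'Close']].copy()            # independent copy of the selected columns
bars['rate_return'] = bars['Close'].pct_change()  # the new column only exists in the copy

print(bars)
print(table.columns.tolist())
```

The original table is left untouched, which is usually what is intended when building a derived table.
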

Here the calculation introduced a NaN value. If the DataFrame is large, we would not be able to spot it by eye. **isnull()** provides a convenient way to check for missing values.

In [None]:
missing = aapl_bar.isnull() # Boolean DataFrame: True wherever a value is NaN
print(missing)
print('---------------------------------------------')
print(missing.describe())

             Open   High    Low  Close  rate_return
Date                                               
2021-01-10  False  False  False  False         True
2021-01-24  False  False  False  False        False
2021-02-07  False  False  False  False        False
2021-02-21  False  False  False  False        False
2021-03-07  False  False  False  False        False
2021-03-21  False  False  False  False        False
2021-04-04  False  False  False  False        False
2021-04-18  False  False  False  False        False
2021-05-02  False  False  False  False        False
2021-05-16  False  False  False  False        False
2021-05-30  False  False  False  False        False
2021-06-13  False  False  False  False        False
2021-06-27  False  False  False  False        False
2021-07-11  False  False  False  False        False
2021-07-25  False  False  False  False        False
2021-08-08  False  False  False  False        False
2021-08-22  False  False  False  False        False
2021-09-05  

The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values (True and False), it has at least one missing value.

We can deduce the number of missing values by comparing "count" with "freq". "freq" is the number of occurrences of the most common value, which here is False, so the difference between "count" and "freq" is the number of True values. In this table that difference is one: the first "rate_return" entry is the only missing value.

We can also find the rows with missing values easily:

In [None]:
print(missing[missing.rate_return]) # Rows that contain a NaN in the given column

             Open   High    Low  Close  rate_return
Date                                               
2021-01-10  False  False  False  False         True
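
A more direct count is also possible, because True is treated as 1 when a Boolean column is summed. The sketch below uses a small made-up DataFrame rather than aapl_bar:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing return (illustrative values only)
df = pd.DataFrame({'Close': [10.0, 11.0, 12.0],
                   'rate_return': [np.nan, 0.1, 0.0909]})

missing_per_column = df.isnull().sum()               # number of NaN values in each column
rows_with_missing = df[df['rate_return'].isnull()]   # the original rows, not the Boolean table

print(missing_per_column)
print(rows_with_missing)
```

Filtering the original DataFrame with the mask shows the actual data in the affected rows, which is often more useful than a row of True and False values.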


Usually when dealing with missing data, we either delete the whole row or fill it with some value. As we introduced in the Series chapter, the same methods **dropna()** and **fillna()** can be applied to a DataFrame.

In [None]:
drop = aapl_bar.dropna() # Drop the rows that contain NaN values
print(drop)
print('\n--------------------------------------------------\n')
fill = aapl_bar.fillna(0) # Replace the NaN values with 0
print(fill)

                  Open        High         Low       Close  rate_return
Date                                                                   
2021-01-24  135.642671  139.195983  134.388569  138.419632     0.053162
2021-02-07  136.911983  136.981751  135.426729  136.323853    -0.015141
2021-02-21  129.824641  130.293143  128.389231  129.455811    -0.050380
2021-03-07  120.594177  121.551114  117.195048  121.032768    -0.065065
2021-03-21  119.517621  121.042740  119.298321  119.607330    -0.011777
2021-04-04  123.265626  123.783964  122.099351  122.607727     0.025085
2021-04-18  133.871700  134.240515  132.854949  133.732147     0.090732
2021-05-02  131.359728  133.134050  130.652001  131.040756    -0.020125
2021-05-16  126.061129  127.698675  125.661726  127.259331    -0.028857
2021-05-30  125.382147  125.611806  124.363676  124.423584    -0.022283
2021-06-13  126.340704  127.249347  125.911347  127.159477     0.021989
2021-06-27  133.260356  133.689705  132.611319  132.910873     0
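Which treatment fits depends on the data: a filled-in zero is indistinguishable from a genuine zero afterwards. The following sketch, on made-up data (the name `toy_prices` is hypothetical), compares dropna() and fillna(0) with forward filling via ffill(), which is another option Pandas provides:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN in each column
toy_prices = pd.DataFrame(
    {'Close': [100.0, np.nan, 104.0], 'rate_return': [np.nan, 0.02, 0.04]},
    index=pd.to_datetime(['2021-01-10', '2021-01-24', '2021-02-07']),
)

dropped = toy_prices.dropna()        # keeps only rows without any NaN
zero_filled = toy_prices.fillna(0)   # replaces every NaN with 0
forward_filled = toy_prices.ffill()  # carries the last known value forward

print(dropped)
print(zero_filled)
print(forward_filled)
```

Note that ffill() leaves a leading NaN in place, because there is no earlier value to carry forward.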

## DataFrame Concat
We have seen how to extract a Series from a DataFrame. Now we need to consider how to merge a Series or a DataFrame into another one.

In Pandas, the function **concat()** allows us to merge multiple Series into a DataFrame:

In [None]:
s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1,s2], axis = 1) # Concatenate the Series by columns; with axis=0 they are concatenated by rows
print(data_frame)

     AAPL    GOOG
0  143.50  898.70
1  144.09  911.71
2  142.73  906.69
3  144.18  918.59
4  143.77  926.99
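For comparison, here is a minimal sketch of what happens with axis = 0 instead. The two Series are stacked by rows into one longer Series rather than placed side by side. The names `t1`, `t2`, `stacked` and `side_by_side` are illustrative:

```python
import pandas as pd

t1 = pd.Series([143.5, 144.09, 142.73], name='AAPL')
t2 = pd.Series([898.7, 911.71, 906.69], name='GOOG')

stacked = pd.concat([t1, t2], axis=0)       # rows of t2 follow the rows of t1
side_by_side = pd.concat([t1, t2], axis=1)  # one column per Series

print(type(stacked).__name__, stacked.shape)
print(type(side_by_side).__name__, side_by_side.shape)
```

With axis = 0 the original index labels are kept, so the labels 0, 1, 2 appear twice in `stacked`.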


The "axis = 1" parameter will join two DataFrames by columns:

In [None]:
log_price = np.log(aapl_bar.Close) # Compute the logarithm of the Close price
log_price.name = 'log_price'
print(log_price)
print('\n---------------------- separate line--------------------\n')
concat = pd.concat([aapl_bar, log_price], axis = 1) # Concatenate the existing DataFrame with the log price Series by columns
print(concat)

Date
2021-01-10    4.878493
2021-01-24    4.930290
2021-02-07    4.915033
2021-02-21    4.863340
2021-03-07    4.796061
2021-03-21    4.784214
2021-04-04    4.808990
2021-04-18    4.895839
2021-05-02    4.875508
2021-05-16    4.846227
2021-05-30    4.823692
2021-06-13    4.845442
2021-06-27    4.889679
2021-07-11    4.975995
2021-07-25    4.999492
2021-08-08    4.984565
2021-08-22    4.998495
2021-09-05    5.038899
2021-09-19    4.984017
2021-10-03    4.961655
Freq: 2W-SUN, Name: log_price, dtype: float64

---------------------- separate line--------------------

                  Open        High  ...  rate_return  log_price
Date                                ...                        
2021-01-10  131.810677  132.009754  ...          NaN   4.878493
2021-01-24  135.642671  139.195983  ...     0.053162   4.930290
2021-02-07  136.911983  136.981751  ...    -0.015141   4.915033
2021-02-21  129.824641  130.293143  ...    -0.050380   4.863340
2021-03-07  120.594177  121.551114  ...    -0.

We can also join two DataFrames by rows. Consider these two DataFrames:

In [None]:
# Select the columns to show over the chosen period and keep the last value of each month
df_volume = aapl_table.loc['2021-04':'2021-09',['Volume', 'Stock Splits']].resample('M').agg(lambda x: x.iloc[-1])
print(df_volume)
print('\n---------------------- separate line--------------------\n')
# Same selection for the price columns (despite its name, df_2017 holds 2021 data)
df_2017 = aapl_table.loc['2021-04':'2021-09',['Open', 'High', 'Low', 'Close']].resample('M').agg(lambda x: x.iloc[-1])
print(df_2017)

               Volume  Stock Splits
Date                               
2021-04-30  109839500           0.0
2021-05-31   71311100           0.0
2021-06-30   63261400           0.0
2021-07-31   70382000           0.0
2021-08-31   86453100           0.0
2021-09-30   74602000           0.0

---------------------- separate line--------------------

                  Open        High         Low       Close
Date                                                      
2021-04-30  131.359728  133.134050  130.652001  131.040756
2021-05-31  125.382147  125.611806  124.363676  124.423584
2021-06-30  135.966285  137.204435  135.666731  136.755112
2021-07-31  144.164003  146.111083  143.894403  145.641785
2021-08-31  152.660004  152.800003  151.289993  151.830002
2021-09-30  142.470001  144.449997  142.029999  142.830002


Now we merge 'df_volume' with our DataFrame 'aapl_bar':

In [None]:
concat = pd.concat([aapl_bar, df_volume], axis = 1) # Concatenate the two existing DataFrames by columns
print(concat)

                  Open        High  ...       Volume  Stock Splits
Date                                ...                           
2021-01-10  131.810677  132.009754  ...          NaN           NaN
2021-01-24  135.642671  139.195983  ...          NaN           NaN
2021-02-07  136.911983  136.981751  ...          NaN           NaN
2021-02-21  129.824641  130.293143  ...          NaN           NaN
2021-03-07  120.594177  121.551114  ...          NaN           NaN
2021-03-21  119.517621  121.042740  ...          NaN           NaN
2021-04-04  123.265626  123.783964  ...          NaN           NaN
2021-04-18  133.871700  134.240515  ...          NaN           NaN
2021-04-30         NaN         NaN  ...  109839500.0           0.0
2021-05-02  131.359728  133.134050  ...          NaN           NaN
2021-05-16  126.061129  127.698675  ...          NaN           NaN
2021-05-30  125.382147  125.611806  ...          NaN           NaN
2021-05-31         NaN         NaN  ...   71311100.0          

By default the DataFrames are joined on the union of their indexes (an 'outer join'), so every row from both is kept and no information is lost. Entries with no counterpart are filled with NaN. We can also merge them by intersection, which is called an 'inner join':

In [None]:
concat = pd.concat([aapl_bar,df_volume],axis = 1, join = 'inner') # Concatenate the DataFrames, keeping only the intersection of their indexes
print(concat)

                 Open       High  ...     Volume  Stock Splits
Date                              ...                         
2016-10-31  26.646396  26.782384  ...  105677600             0
2016-11-30  26.300094  26.441492  ...  144649200             0
2016-12-31  27.490193  27.619807  ...  122345200             0

[3 rows x 7 columns]
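To make the difference between the two join modes concrete, here is a minimal sketch on two small made-up DataFrames whose dates only partly overlap. The names `left`, `right`, `outer` and `inner` and all values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data: the two indexes share 2021-04-30 and 2021-06-30 only
left = pd.DataFrame(
    {'Close': [131.0, 124.4, 136.8]},
    index=pd.to_datetime(['2021-04-30', '2021-05-30', '2021-06-30']),
)
right = pd.DataFrame(
    {'Volume': [109839500, 71311100, 63261400]},
    index=pd.to_datetime(['2021-04-30', '2021-05-31', '2021-06-30']),
)

outer = pd.concat([left, right], axis=1)                # default: join='outer'
inner = pd.concat([left, right], axis=1, join='inner')  # only the shared dates

print(outer)
print(inner)
```

The outer join keeps all four distinct dates and fills the gaps with NaN, while the inner join keeps only the two dates present in both DataFrames.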


Only the intersection is kept when we use the 'inner join' method. Now let's try to append a DataFrame to another one:

In [None]:
append = aapl_bar.append(df_2017) # Concatenate the DataFrames by rows (DataFrame.append was removed in pandas 2.0; use pd.concat there)
print(append)

                  Open        High  ...  rate_return      Change
Date                                ...                         
2021-01-10  131.810677  132.009754  ...          NaN         NaN
2021-01-24  135.642671  139.195983  ...     0.053162         NaN
2021-02-07  136.911983  136.981751  ...    -0.015141         NaN
2021-02-21  129.824641  130.293143  ...    -0.050380         NaN
2021-03-07  120.594177  121.551114  ...    -0.065065         NaN
2021-03-21  119.517621  121.042740  ...    -0.011777         NaN
2021-04-04  123.265626  123.783964  ...     0.025085         NaN
2021-04-18  133.871700  134.240515  ...     0.090732         NaN
2021-05-02  131.359728  133.134050  ...    -0.020125         NaN
2021-05-16  126.061129  127.698675  ...    -0.028857         NaN
2021-05-30  125.382147  125.611806  ...    -0.022283         NaN
2021-06-13  126.340704  127.249347  ...     0.021989         NaN
2021-06-27  133.260356  133.689705  ...     0.045230         NaN
2021-07-11  142.536444  1

Appending is essentially concatenating two DataFrames along axis = 0, so here is an alternative way to append:

In [None]:
concat = pd.concat([aapl_bar, df_2017], axis = 0) # Another way to concatenate DataFrames by rows, as an alternative to append()
print(concat)

                 Open       High        Low      Close  rate_return
Date                                                               
2016-01-31  21.852388  22.440250  21.750952  22.440250          NaN
2016-02-29  22.450761  22.768308  22.402086  22.411358    -0.001288
2016-03-31  25.431520  25.473242  25.236819  25.262316     0.127210
2016-04-30  21.785529  21.954734  21.442488  21.727583    -0.139921
2016-05-31  23.226409  23.412967  23.044516  23.287041     0.071773
2016-06-30  22.023112  22.333262  21.990464  22.293619    -0.042660
2016-07-31  24.296780  24.380731  24.177849  24.301443     0.090063
2016-08-31  24.773054  24.986412  24.768364  24.876215     0.023652
2016-09-30  26.367388  26.580747  26.212645  26.505720     0.065505
2016-10-31  26.646396  26.782384  26.540888  26.620605     0.004334
2016-11-30  26.300094  26.441492  25.986660  26.045576    -0.021601
2016-12-31  27.490193  27.619807  27.202683  27.294592     0.047955
2016-10-31  26.646396  26.782384  26.540888  26.

Please note that if the two DataFrames have columns with the same names, these columns are considered to be the same and will be merged. It is therefore very important to have the right column names. Suppose we change one of the column names:

In [None]:
df_2017.columns = ['Change', 'High','Low','Close'] # Rename the columns: 'Open' becomes 'Change'
concat = pd.concat([aapl_bar, df_2017], axis = 0) # Concatenate the DataFrames by rows
concat = concat.fillna(0) # Replace NaN with zeros
print(concat)

                  Open        High  ...  rate_return      Change
Date                                ...                         
2021-01-10  131.810677  132.009754  ...     0.000000    0.000000
2021-01-24  135.642671  139.195983  ...     0.053162    0.000000
2021-02-07  136.911983  136.981751  ...    -0.015141    0.000000
2021-02-21  129.824641  130.293143  ...    -0.050380    0.000000
2021-03-07  120.594177  121.551114  ...    -0.065065    0.000000
2021-03-21  119.517621  121.042740  ...    -0.011777    0.000000
2021-04-04  123.265626  123.783964  ...     0.025085    0.000000
2021-04-18  133.871700  134.240515  ...     0.090732    0.000000
2021-05-02  131.359728  133.134050  ...    -0.020125    0.000000
2021-05-16  126.061129  127.698675  ...    -0.028857    0.000000
2021-05-30  125.382147  125.611806  ...    -0.022283    0.000000
2021-06-13  126.340704  127.249347  ...     0.021989    0.000000
2021-06-27  133.260356  133.689705  ...     0.045230    0.000000
2021-07-11  142.536444  1

Since the 'Open' column of 'df_2017' has been renamed, the new DataFrame has a new column named 'Change'.
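The alignment on column names can be seen in isolation with a minimal sketch on made-up data (the names `a`, `b` and `b_renamed` are hypothetical):

```python
import pandas as pd

a = pd.DataFrame({'Open': [1.0, 2.0], 'Close': [1.5, 2.5]})
b = pd.DataFrame({'Open': [3.0], 'Close': [3.5]})
b_renamed = b.rename(columns={'Open': 'Change'})  # same values, different label

same_names = pd.concat([a, b], axis=0)                # columns line up
different_names = pd.concat([a, b_renamed], axis=0)   # 'Change' becomes a third column

print(same_names)
print(different_names)
```

In `different_names`, the rows that came from `a` have NaN under 'Change', and the row that came from `b_renamed` has NaN under 'Open'.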

# Summary

Here we introduced the most important parts of Pandas: resampling and DataFrame manipulation. We only covered the methods most commonly used in financial data analysis. Pandas offers many other methods, used for example in data mining, that are also useful. You can always check the official [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) documentation for help.