<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png">

Quantconnect

Introduction to Financial Python
</div>

# 01 Data Types and Data Structures

# Introduction

This tutorial provides a basic introduction to the Python programming language. If you are new to Python, you should run the code snippets while reading this tutorial. If you are an advanced Python user, please feel free to skip this chapter.

# Basic Variable Types
The basic types of variables in Python are: strings, integers, floating point numbers and booleans.

Strings in Python are contiguous sequences of characters enclosed in either single quotes (' ') or double quotes (" ").


In [244]:
my_string1 = 'Welcome to'
my_string2 = "QuantConnect"
print(my_string1 + ' ' + my_string2)


# ++

change_line_in_string = "\nCHANGE da world\n"

# Conversion between a character and its Unicode code point
s = chr(ord("r")+1)

# String multiplication
double_s = s*2

# Strings are concatenated with '+'. Separate arguments passed to print()
# are joined with the sep string. The string method lower() converts the
# string to lower case.
print(change_line_in_string.lower() + "my final me" + double_s + "age", 
      "Goodbye", sep=". ")

Welcome to QuantConnect

change da world
my final message. Goodbye


An integer is a whole number with no fractional part.

In [245]:
my_int = 10
print(my_int)
print(type(my_int))

10
<class 'int'>


The built-in function int() can convert a string into an integer.

In [246]:
my_string = "100"
print(type(my_string))
my_int = int(my_string)
print(type(my_int))

<class 'str'>
<class 'int'>


A floating point number, or float, represents a real number. In Python, a numeric literal must contain a decimal point (for example 3.54 or 10.0) to be defined as a float.

In [247]:
my_string = "100"
my_float = float(my_string)
print(type(my_float))
flo=3.54
print(type(flo))

<class 'float'>
<class 'float'>


As you can see above, if we don't include a decimal value, the variable would be defined as an integer. The built-in function float() can convert a string or an integer into a float.
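The conversions described above can be sketched as follows. This is a minimal illustration and the variable names are only for demonstration. Note that int() truncates a float toward zero rather than rounding, and that a string containing a decimal point has to go through float() first.

```python
# Converting between strings, integers and floats.
whole = int("100")        # string -> int
as_float = float(whole)   # int -> float
truncated = int(3.99)     # float -> int drops the fractional part
negative = int(-3.99)     # truncation is toward zero, not downward

print(whole, as_float, truncated, negative)

# int() cannot parse a string that contains a decimal point directly.
try:
    int("3.5")
except ValueError as error:
    print("ValueError:", error)

# Converting through float() first works.
from_decimal_string = int(float("3.5"))
print(from_decimal_string)
```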

A boolean, or bool, is a binary variable. Its value can only be True or False. It is useful for logical operations, which are covered in the next chapter.

In [248]:
my_bool = False
print(my_bool)
print(type(my_bool))

False
<class 'bool'>


# Basic Math Operations

The basic math operators in Python are demonstrated below:

In [249]:
print("Addition ", 1+1)
print("Subtraction ", 5-2)
print("Multiplication ", 2*3)
print("Division ", 10/2)
print('exponent', 2**3)

Addition  2
Subtraction  3
Multiplication  6
Division  5.0
exponent 8


In Python 3, the division operator / always returns a float, even when both operands are integers:

In [250]:
print(1/3)
print(1.0/3)

0.3333333333333333
0.3333333333333333
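Besides the operators shown above, Python also provides floor division (//) and the remainder operator (%). The short sketch below is an addition to the tutorial's examples; the order-size scenario and the variable names are hypothetical.

```python
# Floor division and remainder, illustrated with a hypothetical order size.
cash = 1000
price = 144

shares = cash // price      # whole shares that can be bought
leftover = cash % price     # cash remaining after the purchase
print(shares, leftover)

# True division returns a float, floor division of two ints returns an int.
print(type(10 / 2), type(10 // 2))

# divmod() returns the quotient and the remainder together.
quotient, remainder = divmod(cash, price)
print(quotient, remainder)
```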


# Data Collections

## List
A list is an ordered collection of values. A list is mutable, which means you can change its elements in place without creating a new list. To create a list, put comma-separated values between square brackets.

In [251]:
my_list = ['Quant', 'Connect']
print(my_list)


# ++

my_int_list = [1, 2, 3, 4, 5, 6, 39]
print("\nInteger list: \n", my_int_list, sep="")

# Lists are iterable, thus the built-in function map can be used
# to pass each list element as an argument to a callback function
my_str_list = list(map(str, my_int_list))
print("String list: \n", my_str_list, sep="")

['Quant', 'Connect']

Integer list: 
[1, 2, 3, 4, 5, 6, 39]
String list: 
['1', '2', '3', '4', '5', '6', '39']


The values in a list are called "elements". We can access list elements by indexing. Python indexing starts at 0, so in a list of length n the first element has index 0 and the last has index n - 1. The length of a list can be obtained with the built-in function len().

In [252]:
my_list = ["First element", 'Quant', 'Connect', 1,2,3, "Last element"]
print("My list: ", my_list, "\n")
print("n: ", len(my_list))
print("Index 0: ", my_list[0])
print("Index n-1: ", my_list[len(my_list) -1], "\n")


# ++

# One way of swapping two elements.
my_list[3], my_list[4] = my_list[4], my_list[3]
print("Swapped 1 and 2: ", my_list, "\n")

# Elements can be accessed by negative indexes. Should our list have n
# elements, the last one has index -1 and the first one index -n.
print("Index -1: ", my_list[-1])
print("Index -n: ", my_list[-len(my_list)])

My list:  ['First element', 'Quant', 'Connect', 1, 2, 3, 'Last element'] 

n:  7
Index 0:  First element
Index n-1:  Last element 

Swapped 1 and 2:  ['First element', 'Quant', 'Connect', 2, 1, 3, 'Last element'] 

Index -1:  Last element
Index -n:  First element


You can also change the elements in the list by accessing an index and assigning a new value.

In [253]:
my_list = ["First element", 'Quant', 'Connect', 1,2,3, "Last element"]
my_list[2] = 'go'
print(my_list)

['First element', 'Quant', 'go', 1, 2, 3, 'Last element']
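Because lists are mutable, assigning a list to a second name does not copy it: both names refer to the same object. The sketch below illustrates this with hypothetical variable names, and shows list.copy() as one way to obtain an independent list.

```python
# Two names can refer to the same list object.
original = ['Quant', 'Connect']
alias = original            # no copy is made
alias[1] = 'go'
print(original)             # the change is visible through both names

# copy() creates a new list with the same elements (a shallow copy).
independent = original.copy()
independent[0] = 'Hello'
print(original, independent)
```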


A list can also be sliced with a colon:

In [254]:
my_list = ["First element", 'Quant', 'Connect', 1,2,3, "Last element"]
print(my_list[1:3])

['Quant', 'Connect']


The slice starts from the first element indicated, but excludes the last element indicated. Here we select all elements starting from index 1, which refers to the second element:

In [255]:
print(my_list[1:])

['Quant', 'Connect', 1, 2, 3, 'Last element']


And all elements up to but excluding index 3:

In [256]:
print(my_list[:3])


# ++

# A second colon can be used to designate the increment step. 
# The following code prints the list from start to end omitting the
# elements with odd indexes.
print(my_list[::2])

['First element', 'Quant', 'Connect']
['First element', 'Connect', 2, 'Last element']


If you wish to add or remove an element from a list, you can use the append() and remove() methods for lists as follows:

In [257]:
my_list = ['Hello', 'Quant']
# append() adds the element to the end of the list
my_list.append('Hello')
print(my_list)


# ++

# The insert method places an element at a given index.
my_list.insert(0, "Don't")
print(my_list)
my_list.insert(4, "Me")
print(my_list)

# Alternatively, the '+' operator concatenates lists and '*' repeats them.
my_list = my_list + my_list + [":)"]*3
print(my_list)

['Hello', 'Quant', 'Hello']
["Don't", 'Hello', 'Quant', 'Hello']
["Don't", 'Hello', 'Quant', 'Hello', 'Me']
["Don't", 'Hello', 'Quant', 'Hello', 'Me', "Don't", 'Hello', 'Quant', 'Hello', 'Me', ':)', ':)', ':)']


In [258]:
my_list.remove('Hello')
print(my_list)

["Don't", 'Quant', 'Hello', 'Me', "Don't", 'Hello', 'Quant', 'Hello', 'Me', ':)', ':)', ':)']


When there are repeated instances of "Hello", the first one is removed.

## Tuple
A tuple is a data structure type similar to a list. The difference is that a tuple is immutable, which means you can't change the elements in it once it's defined. We create a tuple by putting comma-separated values between parentheses.

In [259]:
my_tuple = ('Welcome','to','QuantConnect')

Just like a list, a tuple can be sliced by index.

In [260]:
my_tuple = ('Welcome','to','QuantConnect')
print(my_tuple[1:])


# ++

# We can iterate over tuples with the map function.
print(tuple(map(len, my_tuple[::2])))

('to', 'QuantConnect')
(7, 12)
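The immutability of tuples can be checked directly: assigning to an index raises a TypeError. The following sketch also shows tuple unpacking, a common way to read the elements. The variable names are illustrative.

```python
my_tuple = ('Welcome', 'to', 'QuantConnect')

# Trying to modify an element raises a TypeError.
try:
    my_tuple[1] = 'at'
except TypeError as error:
    print("TypeError:", error)

# Unpacking assigns each element to its own variable.
first, second, third = my_tuple
print(third)

# A tuple with a single element needs a trailing comma.
single = ('AAPL',)
print(type(single), len(single))
```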


## Set
A set is an **unordered**  collection with **no duplicate** elements. The built-in function **set()** can be used to create sets.

In [261]:
stock_list = ['AAPL','GOOG','IBM','AAPL','IBM','FB','F','GOOG']
stock_set = set(stock_list)
print(stock_set)


# ++

# Sets are iterable but not subscriptable, that is, due to the lack of 
# order, their elements can't be accessed by index.
print(set(map(len, stock_list)))

{'IBM', 'AAPL', 'F', 'FB', 'GOOG'}
{1, 2, 3, 4}


Set is an easy way to remove duplicate elements from a list.
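As a brief sketch of that idea, the code below removes duplicates in two ways. Converting to a set loses the original order, so an order-preserving alternative based on dict.fromkeys() is shown as well (it relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7). The set operations at the end are an additional illustration with made-up groupings.

```python
stock_list = ['AAPL', 'GOOG', 'IBM', 'AAPL', 'IBM', 'FB', 'F', 'GOOG']

# sorted() gives a deterministic order after removing duplicates.
unique_sorted = sorted(set(stock_list))
print(unique_sorted)

# dict.fromkeys() removes duplicates while keeping first-seen order.
unique_in_order = list(dict.fromkeys(stock_list))
print(unique_in_order)

# Sets also support union (|), intersection (&) and difference (-).
tech = {'AAPL', 'GOOG', 'IBM', 'FB'}
held = {'AAPL', 'F'}
print(tech & held, held - tech)
```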

## Dictionary
A dictionary is one of the most important data structures in Python. Unlike sequences, which are indexed by integers, dictionaries are indexed by keys. A key can be any hashable (immutable) value, such as a string, a number, or a tuple of such values.

A dictionary is a collection of key : value pairs, with the requirement that the keys are unique. Since Python 3.7, dictionaries preserve the order in which keys were inserted. We create a dictionary by placing a comma-separated list of key : value pairs within braces.

In [262]:
my_dic = {'AAPL':'Apple', 'FB':'Facebook', 'GOOG':'Alphabet'}

After defining a dictionary, we can access any value by indicating its key in brackets.

In [263]:
print(my_dic['GOOG'])


# ++

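# Keys and values admit various types (keys must be hashable).
# Note that True == 1 and both hash to the same value, so the key True
# below refers to the same entry as the key 1 and overwrites its value.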
my_dict = {1:2, 
           'a':'b', 
           True:False, 
           3.4:{"inner":"dictionary"}, 
           "subscriptables":[("core1", "core2"), True]}
print(my_dict[3.4], my_dict[True])
print(my_dict["subscriptables"][0][0])

Alphabet
{'inner': 'dictionary'} False
core1


We can also change the value associated with a specified key:

In [264]:
my_dic['GOOG'] = 'Alphabet Company'
print(my_dic['GOOG'])

Alphabet Company


The dictionary method keys() returns a view of all the keys in the dictionary. Wrap it in list() if an actual list is needed.

In [265]:
print(my_dic.keys())

dict_keys(['AAPL', 'FB', 'GOOG'])
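Beyond keys(), dictionaries offer items() for iterating over pairs, the in operator for membership tests, and get() for lookups that should not fail on a missing key. A minimal sketch, using a dictionary similar to the one above (the 'MSFT' entry is a hypothetical addition):

```python
my_dic = {'AAPL': 'Apple', 'FB': 'Facebook', 'GOOG': 'Alphabet'}

# items() yields (key, value) pairs.
for ticker, name in my_dic.items():
    print(ticker, '->', name)

# Membership tests check the keys.
print('AAPL' in my_dic, 'MSFT' in my_dic)

# get() returns a default instead of raising KeyError.
missing = my_dic.get('MSFT', 'unknown')
print(missing)

# Assigning to a new key adds a pair; del removes one.
my_dic['MSFT'] = 'Microsoft'
del my_dic['FB']
print(list(my_dic.keys()))
```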


# Common String Operations
A string is an immutable sequence of characters. It can be sliced by index just like a tuple:

In [266]:
my_str = 'Welcome to QuantConnect'
print(my_str[8:])


# ++

my_str2 = "a1Ab2Bc3Cd4De5Ef6F"

# Slicing up to the element indexed by 5.
print(my_str2[:6])

# Printing every third character from index 2 up to index 8 (the
# stop index 9 itself is excluded), that is, from the third character
# to the ninth.
print(my_str2[2:9:3]) 

to QuantConnect
a1Ab2B
ABC


There are many methods associated with strings. We can use string.count() to count the occurrences of a substring, string.find() to return the index of its first occurrence, and string.replace() to replace occurrences with another string.

In [267]:
print('Counting the number of e appears in this sentence'.count('e'))
print('The first time e appears in this sentence'.find('e'))
print('all the a in this sentence now becomes e'.replace('a','e'))


# ++

# These methods also accept substrings longer than one character.
print('up down down up down up down up up'.count('up'))

# A third argument to replace sets an upper bound on the number of
# occurrences to be replaced.
print('up down down up down up down up up'.replace('down','up', 3))

# The find method returns the index of the first match.
print('hanarebanare no machi o tsunagu ressha'.find('na'))

# A search range can be established by adding two more arguments.
print('hanarebanare no machi o tsunagu ressha'.find('na', 3, 12))

7
2
ell the e in this sentence now becomes e
5
up up up up up up down up up
2
8


One of the most commonly used string methods is string.split(). It splits the string at the given separator and returns a list of the parts:

In [268]:
timestamp = '2016-04-01 09:43:00'
split_list = timestamp.split(' ')
date = split_list[0]
time = split_list[1]
print(date, time)
hour = time.split(':')[0]
print(hour)

2016-04-01 09:43:00
09
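Splitting works well for simple cases. For timestamps specifically, the standard library's datetime module can parse the string into an object with named fields. The sketch below is an alternative to the split-based approach above, not a replacement for it; the format string is chosen to match the example timestamp.

```python
from datetime import datetime

timestamp = '2016-04-01 09:43:00'

# Split-based extraction: every piece is still a string.
date_part, time_part = timestamp.split(' ')
hour_text, minute_text, second_text = time_part.split(':')
print(hour_text, minute_text)

# join() is the inverse of split().
rejoined = ':'.join([hour_text, minute_text, second_text])
print(rejoined)

# strptime() parses the same text into a datetime object.
parsed = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')
print(parsed.year, parsed.hour, parsed.minute)
```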


We can insert the values of variables into a string. This is called string formatting.

In [269]:
my_time = 'Hour: {}, Minute:{}'.format('09','43')
print(my_time)

Hour: 09, Minute:43


Another way to format a string is to use the % symbol.

In [270]:
print('the pi number is %f'%3.14)
print('%s to %s'%('Welcome','Quantconnect'))


# ++

# the value between the dot and the 'f' designates the number of
# decimal places for the float expression. Here we can appreciate
# it rounding our float number.
print('the pi number is %.3f'%3.14159)

# A third way of formatting is the f-string (Python 3.6+): prefix the
# string with 'f' and place expressions inside braces.
word = "etiolate"
print(f"The word \"{word}\" has {len(word)} characters.")

the pi number is 3.140000
Welcome to Quantconnect
the pi number is 3.142
The word "etiolate" has 8 characters.
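Format specifications work the same way in str.format() and in f-strings. The sketch below shows a few that tend to be useful when printing prices; the ticker and the numbers are made-up sample values.

```python
ticker = 'AAPL'
price = 144.0949
change = 0.00278

# Two decimal places.
line1 = 'Price: {:.2f}'.format(price)

# Percentage with one decimal place.
line2 = f'Change: {change:.1%}'

# Left-align the ticker in 6 characters, right-align the price in 10.
line3 = f'{ticker:<6}{price:>10.2f}'

# Thousands separator.
line4 = f'Volume: {20109375:,}'

print(line1, line2, line3, line4, sep='\n')
```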


# Summary

We have seen the basic data types and data structures in Python. It's important to keep practicing to become familiar with these data structures. In the next tutorial, we will cover for and while loops and logical operations in Python.


# 02 Logical Operations and Loops

# Introduction
We discussed the basic data types and data structures in Python in the last tutorial. This chapter covers logical operations and loops in Python, which are very common in programming.

# Logical Operations
Like most programming languages, Python has comparison operators:

In [271]:
# Comparison operators return boolean values.

# == checks if the values are the same.
print(1 == 0)
print(1 == 1)
# != checks if the values differ.
print(1 != 0)
# >= checks if the first value is greater than or equal to the second.
print(5 >= 5)
print(5 >= 6, end="\n"*2)


# ++

# <= checks if the first value is less than or equal to the second.
print(4 <= 4.5)

# > checks if the first value is greater than the second.
print(-5 > -12)

# < checks if the first value is less than the second.
# Many objects support comparison. For example, characters are compared
# using their Unicode code points.
print("a" < "b")
print("c" < "b")

False
True
True
True
False

True
True
True
False


Each statement above has a boolean value, which must be either True or False, but not both.
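One caveat when comparing floats with ==: floating point arithmetic is not exact, so two values that are mathematically equal may compare as different. The sketch below shows the issue and one common remedy, math.isclose() from the standard library. Chained comparisons are included as a related convenience; the price is a sample value.

```python
import math

total = 0.1 + 0.2
print(total == 0.3)               # False, because of rounding error
print(math.isclose(total, 0.3))   # True, compares within a tolerance

# Comparisons can be chained.
price = 144.09
in_range = 100 <= price < 150     # same as (100 <= price) and (price < 150)
print(in_range)
```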

We can combine simple statements P and Q to form complex statements using logical operators:

- The statement "P and Q" is true if both P and Q are true, otherwise it is false.
- The statement "P or Q" is false if both P and Q are false, otherwise it is true.
- The statement "not P" is true if P is false, and vice versa.

In [272]:
print(2 > 1 and 3 > 2)
print(2 > 1 and 3 < 2) 
print(2 > 1 or 3 < 2)
print(2 < 1 and 3 < 2, end="\n"*2)

# ++

# A statement is negated by placing 'not' at the beginning
# of the expression.
print(not(2 > 1 and 3 > 2))
print(not(2 > 1 and 3 < 2))
print(not(2 > 1 or 3 < 2))
print(not(2 < 1 and 3 < 2))

True
False
True
False

False
True
False
True


When dealing with a complex logical statement that involves several conditions, we can use parentheses to separate and combine them.

In [273]:
print((3 > 2 or 1 < 3) and (1 != 3 and 4 > 3) and
      not (3 < 2 or 1 < 3 and (1 != 3 and 4 > 3)))
print(3 > 2 or 1 < 3 and (1 != 3 and 4 > 3) and 
      not (3 < 2 or 1 < 3 and (1 != 3 and 4 > 3)))

False
True


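Comparing the two statements above, we can see that the parentheses change the result: "and" binds more tightly than "or", so without the first pair of parentheses the second statement is evaluated as 3 > 2 or (everything else), which is true. It's wise to use parentheses when writing a complex logical statement.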

# If Statement
An if statement executes a segment of code only if its condition is true. A full if statement can consist of three kinds of segments: if, elif and else.

```python
if statement1:
    # if statement1 is true, execute the code here.
    # code.....
    # code.....
elif statement2:
    # if statement1 is false and statement2 is true, execute the code here.
    # code......
    # code......
else:
    # if none of the above statements is true, execute the code here.
    # code......
```

An if statement doesn't necessarily have elif and else parts. If they are omitted, the indented block of code is executed when the condition is true; otherwise the whole if statement is skipped.

In [274]:
i = 0
if i == 0:
    print('i==0 is True')
if i == 1:
    print("This won't be printed.")

i==0 is True


As mentioned above, the condition of an if statement can be a complex logical statement:

In [275]:
p = 1 > 0
q = 2 > 3
if p and q:
    # q is false, so the next line is ignored.
    print('p and q is true')
elif p and not q:
    # p is true and q is false, so the next line is run and 
    # the remaining elif and else segments are ignored.
    print('q is false')
elif q and not p:
    print('p is false')
else:
    print('Neither p nor q is true')


# ++

# A one-line if-else can be written with a conditional expression,
# which follows the structure: x if condition else y.
print("p and q is true" if p and q else "p is false or q is false")

q is false
p is false or q is false
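As a further sketch, the same if/elif/else structure can be used to classify a numeric value. The variable names, the 1% thresholds and the labels below are hypothetical and chosen only for illustration.

```python
rate = 0.004   # a made-up daily return

if rate > 0.01:
    label = 'up'
elif rate < -0.01:
    label = 'down'
else:
    # Reached only when both conditions above are false.
    label = 'flat'

print(label)
```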


# Loop Structure
Loops are an essential part of programming. The "for" and "while" loops run a block of code repeatedly.

## While Loop
A "while" loop runs repeatedly as long as its condition is true.

In [276]:
i = 0
# 5 iterations of the command i += 1 are needed
# for i to fail the condition, thus stopping the loop.
while i < 5:
    print(i)
    i += 1 

0
1
2
3
4


When making a while loop, we need to ensure that something changes from iteration to iteration so that the while loop will terminate, otherwise, it will run forever. Here we used i += 1 (short for i = i + 1) to make i larger after each iteration. This is the most commonly used method to control a while loop.
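The loop variable does not have to be a simple counter. In the sketch below, the condition depends on a value that grows on every iteration, which also guarantees termination. The starting capital and the 7% growth rate are hypothetical numbers.

```python
capital = 1000.0
target = 2 * capital
rate = 0.07        # hypothetical annual growth rate
years = 0

# The loop ends because capital grows on every iteration.
while capital < target:
    capital *= 1 + rate
    years += 1

print(years, round(capital, 2))
```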

## For Loop
A "for" loop iterates over a sequence of values and terminates when the sequence is exhausted.

In [277]:
print("Iteration over list:")
for i in [1,2,3,4,5]:
    print(i)


# ++

# The sequence in question can be any iterable.
my_tuple = (1,2,3)
my_str = "123"
my_dict = {'a':1, 'b':2, 'c':3}
my_set = {1, 2, 3}
my_map = map(int, my_tuple)

print("Iteration over tuple:")
for i in my_tuple:
    print(i, end=" ")
print("\nIteration over string:")
for i in my_str:
    print(i, end=" ")
print("\nIteration over dictionary keys:")
for i in my_dict:
    print(i, end=" ")
print("\nIteration over set:")
for i in my_set:
    print(i, end=" ")
print("\nIteration over map object:")
for i in my_map:
    print(i, end=" ")

Iteration over list:
1
2
3
4
5
Iteration over tuple:
1 2 3 
Iteration over string:
1 2 3 
Iteration over dictionary keys:
a b c 
Iteration over set:
1 2 3 
Iteration over map object:
1 2 3 

We can also add if statements in a for loop. Here is a real example from our pairs trading algorithm:

In [278]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
selected = ['AAPL','IBM']
new_list = []
# This code creates a list of stocks' elements that aren't part 
# of the selected list.
for i in stocks:
    if i not in selected:
        new_list.append(i)
print(stocks)


# ++

print("Not selected stocks:", new_list)

['AAPL', 'GOOG', 'IBM', 'FB', 'F', 'V', 'G', 'GE']
Not selected stocks: ['GOOG', 'FB', 'F', 'V', 'G', 'GE']


Here we iterated over all the elements in the list 'stocks'. Later in this chapter, we will introduce a more concise way to do this in a single line of code.

## Break and continue
These are two commonly used commands in a for loop. If "break" is triggered while a loop is executing, the loop will terminate immediately:

In [279]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
for i in stocks:
    print(i)
    # The for loop will be terminated as soon as i
    # is assigned 'FB'.
    if i == 'FB':
        break

AAPL
GOOG
IBM
FB


The "continue" command tells the loop to end this iteration and skip to the next iteration:

In [280]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
for i in stocks:
    # When 'FB' is assigned to the variable i, the lines of the for
    # loop below the continue will be skipped.
    if i == 'FB':
        continue
    print(i)


# ++

# The following code will print all the positive numbers in a list by
# skipping the print instruction when facing a negative number or 0.
my_nums = [-2, 3.3, 4, 0, 34, -2, -83, 2.99993, -0.5, -20, 8, 39]
print("Positive numbers:")
for i in my_nums:
    if i <= 0:
        continue
    print(i, end=" ")

AAPL
GOOG
IBM
F
V
G
GE
Positive numbers:
3.3 4 34 2.99993 8 39 

# List Comprehension
List comprehension is a Pythonic way to create lists. A common application is to make a new list where each element is the result of some operation applied to each member of another sequence. For example, we can create a list of squares using a for loop:

In [281]:
squares = []
for i in [1,2,3,4,5]:
    squares.append(i**2)
print(squares)

[1, 4, 9, 16, 25]


Using list comprehension:

In [282]:
numbers = [1,2,3,4,5]
squares = [x**2 for x in numbers]
print(squares)

[1, 4, 9, 16, 25]


Recall the example above where we used a for loop to filter stocks. Here we use a list comprehension, this time keeping the stocks that are in the selected list:

In [283]:
stocks = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
selected = ['AAPL','IBM']
new_list = [x for x in stocks if x in selected]
print(new_list)


# ++

# Naturally, this works on other iterable objects.
# The code below turns a string into a list of characters 
# omitting the spaces.
my_str = "that sky is the wrong color"
print([x for x in my_str if x != " "])

# Here we have an easy way of creating a 4x4 matrix of zeros.
print([[0 for j in range(4)] for i in range(4)])

['AAPL', 'IBM']
['t', 'h', 'a', 't', 's', 'k', 'y', 'i', 's', 't', 'h', 'e', 'w', 'r', 'o', 'n', 'g', 'c', 'o', 'l', 'o', 'r']
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]


A list comprehension consists of square brackets containing an expression followed by a "for" clause, then zero or more additional "for" or "if" clauses. For example:

In [284]:
print([(x, y) for x in [1,2,3] for y in [3,1,4] if x != y])
print([str(x)+' vs '+str(y) for x in ['AAPL','GOOG','IBM','FB'] for y in ['F','V','G','GE'] if x!=y])


# ++

# Below, an example containing all the integers from 2 to 10 paired
# with each of their divisors that is greater than or equal to 2.
print([(x,y) for x in range(2,11) for y in range(2,11) if x % y == 0])

[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
['AAPL vs F', 'AAPL vs V', 'AAPL vs G', 'AAPL vs GE', 'GOOG vs F', 'GOOG vs V', 'GOOG vs G', 'GOOG vs GE', 'IBM vs F', 'IBM vs V', 'IBM vs G', 'IBM vs GE', 'FB vs F', 'FB vs V', 'FB vs G', 'FB vs GE']
[(2, 2), (3, 3), (4, 2), (4, 4), (5, 5), (6, 2), (6, 3), (6, 6), (7, 7), (8, 2), (8, 4), (8, 8), (9, 3), (9, 9), (10, 2), (10, 5), (10, 10)]


List comprehension is an elegant way to organize one or more for loops when creating a list.
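To make the reading order explicit, the first comprehension above can be unrolled into nested for loops. The clauses appear in the same order, left to right, as the loops nest from the outside in. A sketch:

```python
# Comprehension form.
pairs = [(x, y) for x in [1, 2, 3] for y in [3, 1, 4] if x != y]

# Equivalent nested loops.
pairs_loop = []
for x in [1, 2, 3]:
    for y in [3, 1, 4]:
        if x != y:
            pairs_loop.append((x, y))

print(pairs == pairs_loop)
```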

# Summary
This chapter has introduced logical operations, loops, and list comprehension. In the next chapter, we will introduce functions and object-oriented programming, which will enable us to make our code clean and versatile.


# 03 Functions and Object-Oriented Programming

# Introduction

In the last tutorial we introduced logical operations, loops and list comprehension. We will introduce functions and object-oriented programming in this chapter, which will enable us to build complex algorithms in more flexible ways.

# Functions
A function is a reusable block of code. We can use a function to output a value, or do anything else we want. We can easily define our own function by using the keyword "def".

In [285]:
def product(x,y):
    # Returns the product of x and y 
    # return specifies the output of the function
    return x*y
print(product(2,3))
print(product(5,10))


# ++

# Default values can be given to parameters so that the corresponding
# arguments don't always have to be passed, as shown in the next example:

# When given only one argument x, the root function returns an
# approximation of the square root of x. When a second argument n
# is provided, the function returns the approximate n-th root of x.
def root(x, n=2):
    """
    Returns the n-th root of x.
    """
    return x**(1/n)

print(root(25))
print(root(64, 3))

6
50
5.0
3.9999999999999996


The keyword "def" is followed by the function name and the parenthesized list of formal parameters. The statements that form the body of the function start at the next line, and must be indented. The product() function above has "x" and "y" as its parameters. A function doesn't necessarily have parameters:

In [286]:
def say_hi():
    """
    Prints 'Welcome to QuantConnect'
    """
    print('Welcome to QuantConnect')
say_hi()

Welcome to QuantConnect
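Arguments can also be passed by name, and a function can return several values at once as a tuple. The sketch below uses a hypothetical helper, price_change(), to show both; it is not part of any QuantConnect API.

```python
def price_change(open_price, close_price, as_percent=False):
    """
    Returns the absolute change and the rate of return.
    If as_percent is True, the rate is multiplied by 100.
    """
    change = close_price - open_price
    rate = close_price / open_price - 1
    if as_percent:
        rate *= 100
    return change, rate

# Positional arguments.
change, rate = price_change(100, 102)
print(change, rate)

# Keyword arguments can be given in any order.
change, percent = price_change(close_price=102, open_price=100, as_percent=True)
print(change, percent)
```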


# Built-in Functions
**range()** is a function that creates an arithmetic sequence of integers. In Python 3 it returns a range object rather than a list. It's often used in for loops. The arguments must be integers. If the "step" argument is omitted, it defaults to 1.

In [287]:
print(range(10))
print(range(1,11))
print(range(1,11,2))

# list and tuple allow the sequence to be easily visualized.
print(list(range(10)))
print(list(range(1,11)))
print(tuple(range(1,11,2)))


# ++

# Step can be negative.
print(list(range(10, 0, -1)))

# The resultant sequence is iterable, here chr turns integers 
# into characters.
print(tuple(map(chr, range(65, 90, 2))))

range(0, 10)
range(1, 11)
range(1, 11, 2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(1, 3, 5, 7, 9)
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
('A', 'C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S', 'U', 'W', 'Y')


**len()** is another function, often used together with range() in a for loop. It returns the length of an object. The argument must be a sequence or a collection.

In [288]:
tickers = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
print('The length of tickers is {}'.format(len(tickers)))
# Traversing a list using its length.
for i in range(len(tickers)):
    print(tickers[i])

The length of tickers is 8
AAPL
GOOG
IBM
FB
F
V
G
GE


Note: If you only need the elements and not their indexes, you can iterate over the list directly: "for ticker in tickers: print(ticker)"
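When both the position and the element are needed, the built-in enumerate() avoids indexing by hand, and zip() walks through two sequences in parallel. A short sketch; the prices are made-up sample values.

```python
tickers = ['AAPL', 'GOOG', 'IBM']
prices = [144.09, 911.71, 152.5]

# enumerate() yields (index, element) pairs.
for i, ticker in enumerate(tickers):
    print(i, ticker)

# zip() pairs up elements from several sequences.
for ticker, price in zip(tickers, prices):
    print(ticker, price)

# The pairs can be collected into a dictionary.
price_by_ticker = dict(zip(tickers, prices))
print(price_by_ticker['IBM'])
```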

**map()** is a function that applies a given function to every item of a sequence or collection. In Python 3 it returns a map object (an iterator), which can be converted to a list with list().

Note: assigning a value to the name list, for example list = [1,2,3,4,5], shadows the built-in list() function. Deleting the name with del list makes the built-in accessible again:

In [289]:
list = [1,2,3,4,5]
print(list)
del list
list

[1, 2, 3, 4, 5]


list

In [290]:
tickers = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
# The following code turns a map object into a list, which gives 
# a handy visual representation when printed.
list(map(len,tickers))

[4, 4, 3, 2, 1, 1, 1, 2]

In [291]:
tickers = ['AAPL','GOOG','IBM','FB','F','V', 'G', 'GE']
print(list(map(len,tickers)))


# ++

# Example with a tuple and a set; the latter removes duplicates.
print(tuple(map(len,tickers)))
print(set(map(len,tickers)))

[4, 4, 3, 2, 1, 1, 1, 2]
(4, 4, 3, 2, 1, 1, 1, 2)
{1, 2, 3, 4}
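In Python 3, map() returns a lazy iterator rather than a list, which is why the examples above wrap it in list(), tuple() or set(). One consequence, sketched below, is that a map object can only be consumed once.

```python
tickers = ['AAPL', 'GOOG', 'IBM', 'FB']
lengths = map(len, tickers)

first_pass = list(lengths)    # consumes the iterator
second_pass = list(lengths)   # nothing is left to consume

print(first_pass)
print(second_pass)

# The same result can be written as a list comprehension.
print([len(t) for t in tickers])
```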


The **lambda** keyword is a way to create small anonymous functions. They are handy when a function is needed only at the place where it is created. For example:

In [292]:
list(map(lambda x: x**2, range(10)))


# ++

def distance(x1, y1, x2=0, y2=0):
    """
    Returns the distance between two points (x1, y1) and (x2, y2).
    When given two parameters, returns distance to the origin.
    """
    return ((x1 - x2)**2 + (y1 - y2)**2)**(1/2)

# The following code prints a list of the distances between the origin
# and several points, namely: (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), 
# (6,0), (7,0), (8,0), (9,0) and (10,0).
# Notice how y1 is frozen to be 0.
print(list(map(lambda x: distance(x, 0), range(11))))

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]


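map() can be applied to more than one list, in which case the function must accept one argument per list. If the lists differ in length, map() stops when the shortest one is exhausted.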

In [293]:
list(map(lambda x, y: x+y, [1,2,3,4,5],[5,4,3,2,1]))


# ++

# Here we freeze x2 and y2 to be 3 and 8 respectively, so the
# result is a list of the distances between the point (3,8) and
# the points (1,5), (2,3), (3,1), (4, -1) and (5, -3).
print(list(map(lambda x, y: distance(x, y, 3, 8), 
               range(1, 6), 
               range(5, -4, -2))))

[3.605551275463989, 5.0990195135927845, 7.0, 9.055385138137417, 11.180339887498949]


**sorted()** takes any iterable, such as a list, tuple or set, and returns a new sorted list.

In [294]:
print(sorted([5,2,3,4,1]))

# ++

# Sorting tuples.
print(sorted((5,2,3,4,1)))
# Sorting sets.
print(sorted({5,2,3,4,1}))

[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]


We can add a "key" parameter to specify a function to be called on each list element prior to making comparisons. For example:

In [295]:
price_list = [('AAPL',144.09),('GOOG',911.71),('MSFT',69),('FB',150),('WMT',75.32)]
print(sorted(price_list, key = lambda x: x[1]))


# ++

# Strings are sorted lexicographically, as we can see when the key
# function returns a string:
print(sorted(price_list, key = lambda x: x[0]))

# Here the key is the length of the first element in every tuple, so sorting
# will result in a list with the shortest word at the beginning.
print(sorted(price_list, key = lambda x: len(x[0])))

[('MSFT', 69), ('WMT', 75.32), ('AAPL', 144.09), ('FB', 150), ('GOOG', 911.71)]
[('AAPL', 144.09), ('FB', 150), ('GOOG', 911.71), ('MSFT', 69), ('WMT', 75.32)]
[('FB', 150), ('WMT', 75.32), ('AAPL', 144.09), ('GOOG', 911.71), ('MSFT', 69)]


By default the values are sorted in ascending order. We can change it to descending by adding the optional parameter "reverse".

In [296]:
price_list = [('AAPL',144.09),('GOOG',911.71),('MSFT',69),('FB',150),('WMT',75.32)]
print(sorted(price_list, key = lambda x: x[1], reverse = True))


# ++

# Inverse lexicographic order (z-a):
print(sorted(price_list, key = lambda x: x[0], reverse = True))

[('GOOG', 911.71), ('FB', 150), ('AAPL', 144.09), ('WMT', 75.32), ('MSFT', 69)]
[('WMT', 75.32), ('MSFT', 69), ('GOOG', 911.71), ('FB', 150), ('AAPL', 144.09)]


Lists also have a method, list.sort(). It takes the same "key" and "reverse" arguments as sorted(), but it sorts the list in place and returns None instead of a new list.

In [297]:
price_list = [('AAPL',144.09),('GOOG',911.71),('MSFT',69),('FB',150),('WMT',75.32)]
price_list.sort(key = lambda x: x[1])
print(price_list)

[('MSFT', 69), ('WMT', 75.32), ('AAPL', 144.09), ('FB', 150), ('GOOG', 911.71)]
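Because list.sort() works in place, it returns None; assigning its result to a variable is a common mistake. The sketch below contrasts it with sorted(), which leaves the original list untouched.

```python
price_list = [('AAPL', 144.09), ('GOOG', 911.71), ('MSFT', 69)]

# sorted() returns a new list; price_list is unchanged.
by_price = sorted(price_list, key=lambda x: x[1])
print(by_price[0], price_list[0])

# sort() reorders price_list itself and returns None.
result = price_list.sort(key=lambda x: x[1], reverse=True)
print(result)
print(price_list[0])
```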


# Object-Oriented Programming
Python is an object-oriented programming language. It's important to understand the concept of "objects" because almost every kind of data from the QuantConnect API is an object.

## Class
A class is a type of data, just like a string, float, or list. When we create an object of that data type, we call it an instance of a class.

In Python, everything is an object - everything is an instance of some class. The data stored inside an object are called attributes, and the functions which are associated with the object are called methods.

For example, as mentioned above, a list is an object of the "list" class, and it has a method list.sort().

We can create our own objects by defining a class. We would do this when it's helpful to group certain data and functions together. For example, we define a class named "stock" here:

In [298]:
class stock:
    def __init__(self, ticker, open, close, volume):
        """
        Constructor: initializes every instance of the class.
        """
        self.ticker = ticker
        self.open = open
        self.close = close
        self.volume = volume
        self.rate_return = float(close)/open - 1
 
    def update(self, open, close):
        """
        Example of a method that updates instance variables.
        """
        self.open = open
        self.close = close
        self.rate_return = float(self.close)/self.open - 1
 
    def print_return(self):
        print(self.rate_return)

The "stock" class has attributes "ticker", "open", "close", "volume" and "rate_return". Inside the class body, the first method is called __init__, which is a special method. When we create a new instance of the class, the __init__ method is immediately executed with all the parameters that we pass to the "stock" class. The purpose of this method is to set up a new "stock" object using the data we have provided.

Here we create two stock objects named "apple" and "google".

In [299]:
apple = stock('AAPL', 143.69, 144.09, 20109375)
google = stock('GOOG', 898.7, 911.7, 1561616)

Objects of the stock class also have two other methods: update() and print_return(). We can access the attributes of a stock object and call its methods:

In [300]:
print(apple.ticker)
# The following lines call the methods print_return and update on the
# google object, which Python passes to them as self.
google.print_return()
google.update(912.8,913.4)
google.print_return()

AAPL
0.014465338822744034
0.0006573181419806673


By calling the update() method, we updated the open and close prices of a stock. Please note that when we use the attributes or call the methods **inside a class**, we need to refer to them as self.attribute or self.method(); otherwise Python looks for a local or global name instead and raises a NameError if none exists.

We can add an attribute to an object anywhere:

In [301]:
# Adding an attribute to the apple object outside the class.
apple.ceo = 'Tim Cook'
print(apple.ceo)


# ++

# Attaching a function to the apple object outside the class. Because it is
# set on the instance, it behaves as a plain function: no self is passed.
def difference(a, b):
    """
    Computes the difference between a and b.
    """
    return abs(a - b)

apple.difference = difference
print(apple.difference(apple.open, apple.close))

Tim Cook
0.4000000000000057


We can check what names (i.e. attributes and methods) are defined on an object using the dir() function:

In [302]:
# Notice how the method 'difference' is listed.
dir(apple)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'ceo',
 'close',
 'difference',
 'open',
 'print_return',
 'rate_return',
 'ticker',
 'update',
 'volume']
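The full dir() listing is dominated by special names that start with double underscores. As a small sketch, we can filter those out, or use the built-in vars() to see only the attributes stored on the instance itself. The Quote class below is a hypothetical stand-in for the Stock class, used so the snippet runs on its own.

```python
class Quote:
    """Hypothetical stand-in for the Stock class above."""

    def __init__(self, ticker, close):
        self.ticker = ticker
        self.close = close

    def describe(self):
        return f"{self.ticker}: {self.close}"


q = Quote('AAPL', 144.09)

# Keep only the names that do not start with a double underscore.
public_names = [name for name in dir(q) if not name.startswith('__')]
print(public_names)

# vars() returns the instance's own attribute dictionary (no methods).
instance_attrs = vars(q)
print(instance_attrs)
```

Methods such as describe appear in dir() because they are found on the class, but they are absent from vars(q), which lists only per-instance data.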

## Inheritance
Inheritance is a way of arranging classes in a hierarchy from the most general to the most specific. A "child" class is a more specific type of a "parent" class because a child class inherits all the attributes and methods of its parent. For example, we define a class named "child" which inherits from "stock":

In [303]:
class child(stock):
    def __init__(self,name):
        self.name = name

In [304]:
# Creation of an object of type child, with the property name set to 'aa'.
aa = child('aa')
print(aa.name)
aa.update(100,102)
print(aa.open)
print(aa.close)
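# print_return() prints the value itself and returns None, so wrapping it
# in print() also displays None.
print(aa.print_return())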


# ++

# Since the function difference was attached only to the apple instance, not
# to the stock class, it is not inherited by child. Notice that it does not
# appear in dir(aa):
print("difference" in dir(aa), "difference" in dir(apple))

aa
100
102
0.020000000000000018
None
False True


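As seen above, the new class "child" has inherited the methods of "stock". Because child defines its own __init__, the parent's __init__ is not run, so attributes such as open and close only exist after update() has been called.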
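A child class often needs both the parent's setup and some extra attributes of its own. One common way to get both is to call the parent constructor through super(). The sketch below uses hypothetical classes (Asset and Etf) and a made-up expense ratio, so it should be read as an illustration rather than as part of the tutorial's code.

```python
class Asset:
    """Hypothetical parent class holding basic price data."""

    def __init__(self, ticker, open_price, close_price):
        self.ticker = ticker
        self.open_price = open_price
        self.close_price = close_price

    def rate_return(self):
        return self.close_price / self.open_price - 1


class Etf(Asset):
    """Hypothetical child class that adds one attribute."""

    def __init__(self, ticker, open_price, close_price, expense_ratio):
        # Run the parent's constructor first, then add the new attribute.
        super().__init__(ticker, open_price, close_price)
        self.expense_ratio = expense_ratio


spy = Etf('SPY', 100.0, 102.0, 0.0009)
print(spy.ticker, spy.expense_ratio)
print(spy.rate_return())       # inherited method
print(isinstance(spy, Asset))  # an Etf is also an Asset
```

With this pattern the child object is fully initialized as soon as it is created, without a separate call to an update method.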

# Summary

In this chapter we have introduced functions and classes. When we write a QuantConnect algorithm, we define it as a class that inherits from QCAlgorithm. This means our algorithm inherits the QC API methods from the QCAlgorithm class.

In the next chapter, we will introduce NumPy and Pandas, which enable us to conduct scientific calculations in Python.

# 04 NumPy and Basic Pandas

# Introduction

Now that we have introduced the fundamentals of Python, it's time to learn about NumPy and Pandas.

# NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It also has strong integration with Pandas, which is another powerful tool for manipulating financial data.

Python packages like NumPy and Pandas contain classes and methods which we can use by importing the package:

In [305]:
import numpy as np

## Basic NumPy Arrays
A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. Here we make an array by passing a list of Apple stock prices:

In [306]:
price_list = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
price_array = np.array(price_list)
print(price_array, type(price_array))

[143.73 145.83 143.68 144.02 143.5  142.62] <class 'numpy.ndarray'>


Notice that the type of array is "ndarray" which is a multi-dimensional array. If we pass np.array() a list of lists, it will create a 2-dimensional array.

In [307]:
Ar = np.array([[1,3],[2,4]])
print(Ar, type(Ar), "\n"*2)


# ++

# np.array is a function; the next line binds it to a shorter name.
array = np.array

# Three-dimensional NumPy array.
arr3d = array([[[1,1,1],[2,2,2],[3,3,3]],[[4,4,4],[5,5,5],[6,6,6]],[[7,7,7],[8,8,8],[9,9,9]]])
print(arr3d, type(arr3d))

[[1 3]
 [2 4]] <class 'numpy.ndarray'> 


[[[1 1 1]
  [2 2 2]
  [3 3 3]]

 [[4 4 4]
  [5 5 5]
  [6 6 6]]

 [[7 7 7]
  [8 8 8]
  [9 9 9]]] <class 'numpy.ndarray'>


We get the dimensions of an ndarray using the .shape attribute:

In [308]:
print(Ar.shape)


# ++

# Dimensions of the 3-dimensional array from the previous example.
print(arr3d.shape)

(2, 2)
(3, 3, 3)


If we create a 2-dimensional array (i.e. a matrix), each row can be accessed by index:

In [309]:
print(Ar[0])
print(Ar[1], end="\n"*2)


# ++

# Accessing "layers" from our 3-dimensional array.
print("Second layer:\n", arr3d[1])
print("\nFirst layer:\n", arr3d[0])

[1 3]
[2 4]

Second layer:
 [[4 4 4]
 [5 5 5]
 [6 6 6]]

First layer:
 [[1 1 1]
 [2 2 2]
 [3 3 3]]


If we want to access the matrix by column instead:

In [310]:
# A bare colon selects all rows, and the value after the comma selects the
# column to take from each row.
print('the first column: ', Ar[:,0])
print('the second column: ', Ar[:,1])


# ++

# To access the middle element in our "cubic" array, we must provide 3 indexes:
print(arr3d[1,1,1])

the first column:  [1 2]
the second column:  [3 4]
5


## Array Functions
NumPy has built-in functions that allow us to perform calculations on arrays. For example, we can apply the natural logarithm to each element of an array:

In [311]:
print(np.log(price_array))


# ++

# Here we apply the natural logarithm to the middle element of each of the
# three "layers" of our 3-dimensional array, selected with a slice:
print(arr3d[:,1,1])
print(np.log(arr3d[:,1,1]))

[4.96793654 4.98244156 4.9675886  4.96995218 4.96633504 4.96018375]
[2 5 8]
[0.69314718 1.60943791 2.07944154]
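A common financial use of np.log is computing log returns from a price series. The sketch below reuses the Apple prices from price_list; the variable names are our own, and np.diff is used to take differences between consecutive elements.

```python
import numpy as np

prices = np.array([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

# Log return between consecutive prices: log(p[t]) - log(p[t-1]).
log_returns = np.diff(np.log(prices))

# Simple return between consecutive prices: p[t] / p[t-1] - 1.
simple_returns = prices[1:] / prices[:-1] - 1

print(log_returns)
print(simple_returns)

# Log returns add up over time: their sum equals the log of the total growth.
print(np.isclose(log_returns.sum(), np.log(prices[-1] / prices[0])))
```

Both arrays have one element fewer than the price array, since the first price has no predecessor.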


Other functions return a single value:

In [312]:
# Returns the result of adding all the array elements and dividing the
# result by the number of elements.
print(np.mean(price_array))

# Returns the population standard deviation (NumPy's default is ddof=0).
print(np.std(price_array))

# Returns the result of adding all the elements in the array.
print(np.sum(price_array))

# Returns the greatest element in the array.
print(np.max(price_array))


# ++

# Taking the first element of every row across all layers of our
# 3-dimensional array gives the numbers from 1 to 9:
print(arr3d[:,:,0])

# We can compare the result of their addition with a known formula:
print(np.sum(arr3d[:,:,0]) == 9*(10)/2)

143.89666666666668
0.9673790478515796
863.38
145.83
[[1 2 3]
 [4 5 6]
 [7 8 9]]
True


The functions above return the mean, standard deviation, total and maximum value of an array.
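On multi-dimensional arrays, these functions also accept an axis argument, and np.std accepts ddof to switch from the population to the sample standard deviation. The following sketch uses a small illustrative price matrix (one column per stock, chosen by us for the example).

```python
import numpy as np

# Rows are days, columns are two stocks (illustrative values).
prices = np.array([[143.73, 898.70],
                   [145.83, 911.71],
                   [143.68, 906.69]])

# axis=0 aggregates down the rows, giving one value per column.
col_means = np.mean(prices, axis=0)

# axis=1 aggregates across the columns, giving one value per row.
row_max = np.max(prices, axis=1)

# Population (ddof=0, the default) versus sample (ddof=1) standard deviation.
pop_std = np.std(prices[:, 0])
sample_std = np.std(prices[:, 0], ddof=1)

print(col_means)
print(row_max)
print(pop_std, sample_std)
```

The sample standard deviation is slightly larger because it divides by n - 1 instead of n; this is also the default used by Pandas.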

# Pandas
Pandas is one of the most powerful tools for dealing with financial data. 

First we need to import Pandas:

In [313]:
import pandas as pd

## Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

We create a Series by calling pd.Series(data), where data can be a dictionary, an array or just a scalar value.

In [314]:
price = [143.73, 145.83, 143.68, 144.02, 143.5, 142.62]
s = pd.Series(price)
print(s)


# ++

# A Series with the cubes of the integers from 1 to 10:
my_cubes = pd.Series([x**3 for x in range(1, 11)])
print("Cubes:\n", my_cubes, sep="")

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
dtype: float64
Cubes:
0       1
1       8
2      27
3      64
4     125
5     216
6     343
7     512
8     729
9    1000
dtype: int64
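The examples above build a Series from a list. As a brief sketch of the other inputs mentioned, here is a Series built from a dictionary (whose keys become the index) and one built from a scalar (which is repeated for every index label). The tickers and prices are illustrative.

```python
import pandas as pd

# Dictionary keys become the index labels.
from_dict = pd.Series({'AAPL': 143.5, 'GOOG': 898.7, 'IBM': 155.58})
print(from_dict)

# A scalar is broadcast to every label in the given index.
from_scalar = pd.Series(0.0, index=['AAPL', 'GOOG', 'IBM'])
print(from_scalar)
```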


We can customize the indices of a new Series:

In [315]:
s = pd.Series(price, index = ['a','b','c','d','e','f'])
print(s)

a    143.73
b    145.83
c    143.68
d    144.02
e    143.50
f    142.62
dtype: float64


Or we can change the indices of an existing Series:

In [316]:
s.index = [6,5,4,3,2,1]
print(s)
# ++

# Below we assign to the index attribute to shift the labels of our cubes
# Series by 1:
my_cubes.index = range(1, 11)
print("Cubes:\n", my_cubes, sep="")

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64
Cubes:
1        1
2        8
3       27
4       64
5      125
6      216
7      343
8      512
9      729
10    1000
dtype: int64


A Series is like a list in that it can be sliced by position:

In [317]:
print(s[1:])
print(s[:-2])


# ++

# The next line displays the odd cubes of my_cubes.
print(my_cubes[::2])

# The next line displays the even cubes of my_cubes in descending order.
print(my_cubes[10:0:-2])

5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
dtype: float64
6    143.73
5    145.83
4    143.68
3    144.02
dtype: float64
1      1
3     27
5    125
7    343
9    729
dtype: int64
10    1000
8      512
6      216
4       64
2        8
dtype: int64


Series is also like a dictionary whose values can be set or fetched by index label:

In [318]:
print(s[4])
s[4] = 0
print(s)


# ++

# Notice how the first element is accessed with the number 1 instead of 0.
print(my_cubes[1])

143.68
6    143.73
5    145.83
4      0.00
3    144.02
2    143.50
1    142.62
dtype: float64
1


A Series can also have a name attribute, which is used as the column name when we build a Pandas DataFrame from several Series.

In [319]:
s = pd.Series(price, name = 'Apple Price List')
print(s)
print(s.name)


# ++

# The next line adds a name to my_cubes by setting the attribute directly.
my_cubes.name = "First 10 integer cubes"
print(my_cubes)

0    143.73
1    145.83
2    143.68
3    144.02
4    143.50
5    142.62
Name: Apple Price List, dtype: float64
Apple Price List
1        1
2        8
3       27
4       64
5      125
6      216
7      343
8      512
9      729
10    1000
Name: First 10 integer cubes, dtype: int64


We can get the statistical summaries of a Series:

In [320]:
print(s.describe(), end="\n"*2)


# ++

# Description of the odd cubes.
print(my_cubes[::2].describe())

# Description of the even cubes.
print(my_cubes[1::2].describe())

count      6.000000
mean     143.896667
std        1.059711
min      142.620000
25%      143.545000
50%      143.705000
75%      143.947500
max      145.830000
Name: Apple Price List, dtype: float64

count      5.000000
mean     245.000000
std      302.208537
min        1.000000
25%       27.000000
50%      125.000000
75%      343.000000
max      729.000000
Name: First 10 integer cubes, dtype: float64
count       5.000000
mean      360.000000
std       407.725398
min         8.000000
25%        64.000000
50%       216.000000
75%       512.000000
max      1000.000000
Name: First 10 integer cubes, dtype: float64


## Time Index
Pandas has a built-in function specifically for creating date indices: pd.date_range(). We use it to create a new index for our Series:

In [321]:
time_index = pd.date_range('2017-01-01', periods = len(s), freq = 'D')
print(time_index)
s.index = time_index
print(s)


# ++

# By setting freq = 'M' the index labels are month ends (recent pandas
# versions prefer the alias 'ME' for month end).
cube_index = pd.date_range('2022-01-03', periods = len(my_cubes), freq = 'M')
my_cubes.index = cube_index
print(my_cubes)

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
2017-01-01    143.73
2017-01-02    145.83
2017-01-03    143.68
2017-01-04    144.02
2017-01-05    143.50
2017-01-06    142.62
Freq: D, Name: Apple Price List, dtype: float64
2022-01-31       1
2022-02-28       8
2022-03-31      27
2022-04-30      64
2022-05-31     125
2022-06-30     216
2022-07-31     343
2022-08-31     512
2022-09-30     729
2022-10-31    1000
Freq: M, Name: First 10 integer cubes, dtype: int64


Series elements are usually accessed with the iloc[] and loc[] indexers. iloc[] selects elements by integer position, and loc[] selects them by index label.

iloc[] is necessary when the index of a Series consists of integers. Take our previously defined Series as an example:

In [322]:
s.index = [6,5,4,3,2,1]
print(s)
print(s[1])

6    143.73
5    145.83
4    143.68
3    144.02
2    143.50
1    142.62
Name: Apple Price List, dtype: float64
142.62


If we intended to take the second element of the Series, this would be a mistake: because the index labels are integers, s[1] returns the element labeled 1. To access the element we want, we use iloc[]:

In [323]:
print(s.iloc[1])


# ++

# The next code accesses the first element of the my_cubes series using
# zero-based numbering.
print(my_cubes.iloc[0])

# Negative positions work as well. The next line prints the last element
# of the my_cubes series.
print(my_cubes.iloc[-1])

145.83
1
1000
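The difference between the two indexers is easiest to see side by side. In this minimal sketch (with a short illustrative Series and an integer index in descending order), the same number 1 means a label for loc[] and a position for iloc[].

```python
import pandas as pd

prices = pd.Series([143.73, 145.83, 143.68], index=[3, 2, 1])

# loc[] looks up the index label 1, which is the last element.
by_label = prices.loc[1]
print(by_label)      # 143.68

# iloc[] looks up position 1, which is the second element.
by_position = prices.iloc[1]
print(by_position)   # 145.83
```

Using loc[] and iloc[] explicitly removes the ambiguity of plain square brackets when the index holds integers.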


While working with time series data, we often use time as the index. Pandas provides us with various ways to access the data by time index:

In [324]:
s.index = time_index
print(s['2017-01-03'])

143.68


We can even access a range of dates:

In [325]:
print(s['2017-01-02':'2017-01-05'])

2017-01-02    145.83
2017-01-03    143.68
2017-01-04    144.02
2017-01-05    143.50
Freq: D, Name: Apple Price List, dtype: float64


Indexing a Series with square brackets is very flexible. We can also pass a boolean condition inside the brackets:

In [326]:
print(s[s < np.mean(s)] )
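# Note: the next line prints the boolean mask itself (wrapped in a list),
# not the filtered values. To select the matching prices, use s[mask].
print([(s > np.mean(s)) & (s < np.mean(s) + 1.64*np.std(s))])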


# ++

# The statement below prints the even cubes within half a standard deviation
# of the mean, using the logical operator &.
print(my_cubes.describe())
print(my_cubes[(my_cubes % 2 == 0) &
               (abs(np.mean(my_cubes)-my_cubes) < np.std(my_cubes)/2)])

2017-01-01    143.73
2017-01-03    143.68
2017-01-05    143.50
2017-01-06    142.62
Name: Apple Price List, dtype: float64
[2017-01-01    False
2017-01-02    False
2017-01-03    False
2017-01-04     True
2017-01-05    False
2017-01-06    False
Freq: D, Name: Apple Price List, dtype: bool]
count      10.000000
mean      302.500000
std       343.728333
min         1.000000
25%        36.250000
50%       170.500000
75%       469.750000
max      1000.000000
Name: First 10 integer cubes, dtype: float64
2022-06-30    216
Freq: M, Name: First 10 integer cubes, dtype: int64


As demonstrated, we can use logical operators like & (and), | (or) and ~ (not) to group multiple conditions.
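The examples above only use &. Here is a short sketch of | and ~ on the same Apple prices; the thresholds 143.0 and 145.0 are arbitrary values picked for illustration. Each condition needs its own parentheses, because these operators bind more tightly than comparisons such as < and >.

```python
import pandas as pd

prices = pd.Series([143.73, 145.83, 143.68, 144.02, 143.5, 142.62])

# | (or): keep prices below 143.0 or above 145.0.
outside = (prices < 143.0) | (prices > 145.0)
extreme = prices[outside]
print(extreme)

# ~ (not): invert the mask to keep everything else.
not_extreme = prices[~outside]
print(not_extreme)
```

Together, the two selections split the Series into two parts with no overlap.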

# Summary
Here we have introduced NumPy and Pandas for scientific computing in Python. In the next chapter, we will dive deeper into Pandas to learn about resampling and manipulating DataFrames, which are commonly used in financial data analysis.

<div align="center">
<img style="display: block; margin: auto;" alt="photo" src="https://cdn.quantconnect.com/web/i/icon.png"> <img style="display: block; margin: auto;" alt="photo" src="https://www.marketing-branding.com/wp-content/uploads/2020/07/google-colaboratory-colab-guia-completa.jpg " width="50" height="50">
<img style="display: block; margin: auto;" alt="photo" src="https://upload.wikimedia.org/wikipedia/commons/d/da/Yahoo_Finance_Logo_2019.svg" width="50" height="50">  

Quantconnect -> Google Colab with Yahoo Finance data

Introduction to Financial Python
</div>

# 05 Pandas-Resampling and DataFrame

# Introduction
In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.

# Fetching Data
Here we use Yahoo Finance to retrieve data.


In [327]:
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [328]:
import yfinance as yf

aapl = yf.Ticker("AAPL")

# get stock info
print(aapl.info)

# get historical market data
aapl_table = aapl.history(start="2016-01-01",  end="2017-12-31")
aapl_table

{'zip': '95014', 'sector': 'Technology', 'fullTimeEmployees': 154000, 'longBusinessSummary': 'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. In addition, the company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; AirPods Max, an over-ear wireless headphone; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, and iPod touch. Further, it provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. Additionally, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with o

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2016-01-04 | 23.523350 | 24.156081 | 23.383508 | 24.151495 | 270597600 | 0.0 | 0 |
| 2016-01-05 | 24.243193 | 24.266117 | 23.477498 | 23.546272 | 223164000 | 0.0 | 0 |
| 2016-01-06 | 23.053389 | 23.468333 | 22.895207 | 23.085484 | 273829600 | 0.0 | 0 |
| 2016-01-07 | 22.622398 | 22.954810 | 22.106586 | 22.111170 | 324377600 | 0.0 | 0 |
| 2016-01-08 | 22.592597 | 22.720976 | 22.182239 | 22.228088 | 283192000 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-12-22 | 41.594672 | 41.770882 | 41.551813 | 41.673252 | 65397600 | 0.0 | 0 |
| 2017-12-26 | 40.670771 | 40.830311 | 40.404075 | 40.616005 | 132742000 | 0.0 | 0 |
| 2017-12-27 | 40.504090 | 40.666010 | 40.411224 | 40.623150 | 85992800 | 0.0 | 0 |
| 2017-12-28 | 40.718396 | 40.920799 | 40.594573 | 40.737446 | 65920800 | 0.0 | 0 |


We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:

In [329]:
# Select the closing prices for the year 2017.
aapl = aapl_table['Close'].loc['2017']

In [330]:
print(aapl)

Date
2017-01-03    27.219833
2017-01-04    27.189369
2017-01-05    27.327633
2017-01-06    27.632294
2017-01-09    27.885389
                ...    
2017-12-22    41.673252
2017-12-26    40.616005
2017-12-27    40.623150
2017-12-28    40.737446
2017-12-29    40.296925
Name: Close, Length: 251, dtype: float64


Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].

In [331]:
print(aapl['2017-3'])


# ++

# The next line takes the data from August 2017 and keeps only prices above 36.
print(aapl['2017-8'][aapl > 36])

Date
2017-03-01    32.901905
2017-03-02    32.706566
2017-03-03    32.899559
2017-03-06    32.796001
2017-03-07    32.838367
2017-03-08    32.715973
2017-03-09    32.640652
2017-03-10    32.748932
2017-03-13    32.763046
2017-03-14    32.713631
2017-03-15    33.059612
2017-03-16    33.113743
2017-03-17    32.948990
2017-03-20    33.294983
2017-03-21    32.913692
2017-03-22    33.285564
2017-03-23    33.167881
2017-03-24    33.101978
2017-03-27    33.158459
2017-03-28    33.845737
2017-03-29    33.921047
2017-03-30    33.876335
2017-03-31    33.812794
Name: Close, dtype: float64
Date
2017-08-02    37.138184
2017-08-03    36.767151
2017-08-04    36.960934
2017-08-07    37.532875
2017-08-08    37.833035
2017-08-09    38.064640
2017-08-10    36.852200
2017-08-11    37.364700
2017-08-14    37.927025
2017-08-15    38.342239
2017-08-16    38.188007
2017-08-17    37.454872
2017-08-18    37.369453
2017-08-21    37.300640
2017-08-22    37.910416
2017-08-23    37.957859
2017-08-24    37.789413
20

In [332]:
print(aapl['2017-2':'2017-4'])


# ++

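# The code below selects data points between July 1st and September 24th.
# Note that abs() wraps the whole comparison, so the mask is simply
# aapl - mean < .5: it keeps every price below the slice mean plus 0.5.
# To keep only prices within 0.5 of the mean, write abs(aapl - mean) < .5.
print(aapl['2017-7':'2017-9-24'][abs(aapl - np.mean(aapl['2017-7':'2017-9-24']) < .5)])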

Date
2017-02-01    30.172651
2017-02-02    30.121099
2017-02-03    30.249989
2017-02-06    30.533548
2017-02-07    30.824135
                ...    
2017-04-24    33.808075
2017-04-25    34.017559
2017-04-26    33.817497
2017-04-27    33.843384
2017-04-28    33.810436
Name: Close, Length: 61, dtype: float64
Date
2017-07-03    33.914536
2017-07-05    34.053978
2017-07-06    33.732555
2017-07-07    34.075241
2017-07-10    34.283226
2017-07-11    34.394299
2017-07-12    34.443939
2017-07-13    34.923706
2017-07-14    35.223846
2017-07-17    35.346748
2017-07-18    35.469646
2017-07-19    35.691795
2017-07-20    35.531086
2017-07-21    35.514549
2017-07-24    35.944679
2017-07-25    36.098309
2017-07-26    36.268475
2017-07-27    35.583088
2017-07-28    35.332569
2017-07-31    35.150589
2017-08-01    35.462551
2017-08-02    37.138184
2017-08-03    36.767151
2017-08-04    36.960934
2017-08-10    36.852200
2017-09-20    37.030167
2017-09-21    36.394287
2017-09-22    36.038372
Name: Close, d

.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.

In [333]:
print(aapl.head(5))
print(aapl.tail(10))


# ++

# The same can be achieved with integer indexes.
print(aapl.iloc[:5])
print(aapl.iloc[-10:])

Date
2017-01-03    27.219833
2017-01-04    27.189369
2017-01-05    27.327633
2017-01-06    27.632294
2017-01-09    27.885389
Name: Close, dtype: float64
Date
2017-12-15    41.425610
2017-12-18    42.009003
2017-12-19    41.561340
2017-12-20    41.516098
2017-12-21    41.673252
2017-12-22    41.673252
2017-12-26    40.616005
2017-12-27    40.623150
2017-12-28    40.737446
2017-12-29    40.296925
Name: Close, dtype: float64
Date
2017-01-03    27.219833
2017-01-04    27.189369
2017-01-05    27.327633
2017-01-06    27.632294
2017-01-09    27.885389
Name: Close, dtype: float64
Date
2017-12-15    41.425610
2017-12-18    42.009003
2017-12-19    41.561340
2017-12-20    41.516098
2017-12-21    41.673252
2017-12-22    41.673252
2017-12-26    40.616005
2017-12-27    40.623150
2017-12-28    40.737446
2017-12-29    40.296925
Name: Close, dtype: float64


# Resampling
**_series.resample(freq)_** returns an object of the class "DatetimeIndexResampler", which groups the data of a Series into regular time intervals. The argument "freq" determines the length of each interval.

**_series.resample(freq).mean()_** is a complete statement that groups data into intervals and then computes the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:

In [334]:
by_month = aapl.resample('M').mean()
print(by_month)


# ++

# The next line groups data by month and takes the minimum of each month.
by_month_min = aapl.resample("M").min()
print(by_month_min)

Date
2017-01-31    28.021314
2017-02-28    31.430153
2017-03-31    33.096759
2017-04-30    33.630810
2017-05-31    35.924379
2017-06-30    34.938204
2017-07-31    35.048843
2017-08-31    37.686052
2017-09-30    37.395190
2017-10-31    37.444725
2017-11-30    41.004146
2017-12-31    40.930681
Freq: M, Name: Close, dtype: float64
Date
2017-01-31    27.189369
2017-02-28    30.121099
2017-03-31    32.640652
2017-04-30    33.111389
2017-05-31    34.488289
2017-06-30    33.623848
2017-07-31    33.732555
2017-08-31    35.462551
2017-09-30    35.720436
2017-10-31    36.415638
2017-11-30    39.597378
2017-12-31    40.244545
Freq: M, Name: Close, dtype: float64


We can also aggregate the data by week:

In [335]:
by_week = aapl.resample('W').mean()
print(by_week.head())

Date
2017-01-08    27.342282
2017-01-15    27.941166
2017-01-22    28.108611
2017-01-29    28.394868
2017-02-05    29.497255
Freq: W-SUN, Name: Close, dtype: float64
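To see exactly which days fall into each interval, it helps to resample a small synthetic Series where the answer can be checked by hand. The sketch below uses made-up values 1 to 14 on 14 consecutive days starting Monday, 2017-01-02; it does not use the Apple data.

```python
import numpy as np
import pandas as pd

# 14 consecutive days starting on a Monday, with values 1.0 .. 14.0.
idx = pd.date_range('2017-01-02', periods=14, freq='D')
daily = pd.Series(np.arange(1.0, 15.0), index=idx)

# 'W' groups by weeks ending on Sunday and labels each bin with that Sunday.
weekly_mean = daily.resample('W').mean()
print(weekly_mean)

# '2D' groups by consecutive two-day bins labeled with the first day.
two_day_sum = daily.resample('2D').sum()
print(two_day_sum)
```

The first week covers the values 1 to 7 (mean 4.0) and the second covers 8 to 14 (mean 11.0), which shows that weekly bins are labeled by their last day, while the two-day bins are labeled by their first day.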


We can also aggregate the data by month with max:

In [336]:
# Data is grouped into months and then the maximum of each month is computed.
aapl.resample('M').max()

Date
2017-01-31    28.579069
2017-02-28    32.271137
2017-03-31    33.921047
2017-04-30    34.074047
2017-05-31    36.892403
2017-06-30    36.738777
2017-07-31    36.268475
2017-08-31    38.911682
2017-09-30    38.923538
2017-10-31    40.107498
2017-11-30    41.815819
2017-12-31    42.009003
Freq: M, Name: Close, dtype: float64

We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is a frequency alias such as M for month, W for week or D for day.

In [337]:
three_day = aapl.resample('3D').mean()
two_week = aapl.resample('2W').mean()
two_month = aapl.resample('2M').mean()


print(three_day)
print(two_week)
print(two_month)

Date
2017-01-03    27.245612
2017-01-06    27.632294
2017-01-09    27.954131
2017-01-12    27.921719
2017-01-15    28.122086
                ...    
2017-12-17    41.785172
2017-12-20    41.620867
2017-12-23          NaN
2017-12-26    40.658867
2017-12-29    40.296925
Freq: 3D, Name: Close, Length: 121, dtype: float64
Date
2017-01-08    27.342282
2017-01-22    28.015586
2017-02-05    28.946062
2017-02-19    31.341191
2017-03-05    32.413922
2017-03-19    32.833895
2017-04-02    33.437847
2017-04-16    33.661105
2017-04-30    33.603545
2017-05-14    35.498712
2017-05-28    36.292809
2017-06-11    36.311532
2017-06-25    34.336639
2017-07-09    34.074194
2017-07-23    35.082284
2017-08-06    36.070653
2017-08-20    37.692905
2017-09-03    38.244486
2017-09-17    38.069384
2017-10-01    36.635582
2017-10-15    36.865018
2017-10-29    37.546923
2017-11-12    40.803592
2017-11-26    40.974770
2017-12-10    40.639344
2017-12-24    41.388940
2018-01-07    40.568381
Freq: 2W-SUN, Name: Close, 

Besides the mean() method, other methods can also be used with the resampler:



In [338]:
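# Descriptive names avoid shadowing Python's built-in max() and min().
weekly_std = aapl.resample('W').std()
weekly_max = aapl.resample('W').max()
weekly_min = aapl.resample('W').min()


print(weekly_std)
print(weekly_max)
print(weekly_min)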

Date
2017-01-08    0.202235
2017-01-15    0.072124
2017-01-22    0.025410
2017-01-29    0.243920
2017-02-05    0.938004
2017-02-12    0.250597
2017-02-19    0.230104
2017-02-26    0.059015
2017-03-05    0.338195
2017-03-12    0.075863
2017-03-19    0.176841
2017-03-26    0.156385
2017-04-02    0.318029
2017-04-09    0.127973
2017-04-16    0.211289
2017-04-23    0.173703
2017-04-30    0.089524
2017-05-07    0.234327
2017-05-14    0.351016
2017-05-21    0.533099
2017-05-28    0.060056
2017-06-04    0.279656
2017-06-11    0.616582
2017-06-18    0.380433
2017-06-25    0.128173
2017-07-02    0.262638
2017-07-09    0.158005
2017-07-16    0.402029
2017-07-23    0.124289
2017-07-30    0.382300
2017-08-06    0.919238
2017-08-13    0.464869
2017-08-20    0.432820
2017-08-27    0.274228
2017-09-03    0.250321
2017-09-10    0.379509
2017-09-17    0.292502
2017-09-24    0.731101
2017-10-01    0.352871
2017-10-08    0.203713
2017-10-15    0.118070
2017-10-22    0.514117
2017-10-29    0.676665
2017-1

Often we want to calculate the monthly returns of a stock based on the prices on the last trading day of each month. To fetch those prices, we use the series.resample(freq).agg() method:

In [339]:
# The lambda receives each month's prices as a Series x; x.iloc[-1] is that
# month's last price.
last_day = aapl.resample('M').agg(lambda x: x.iloc[-1])
print(last_day)


# ++

# The next line fetches the prices on the last day of each month and computes
# their absolute difference from the last price in January.
january_diff = aapl.resample('M').agg(lambda x: abs(x.iloc[-1] - aapl["2017-01"].iloc[-1]))
print(january_diff)

Date
2017-01-31    28.438459
2017-02-28    32.242882
2017-03-31    33.812794
2017-04-30    33.810436
2017-05-31    36.103024
2017-06-30    34.037430
2017-07-31    35.150589
2017-08-31    38.911682
2017-09-30    36.567490
2017-10-31    40.107498
2017-11-30    40.920811
2017-12-31    40.296925
Freq: M, Name: Close, dtype: float64
Date
2017-01-31     0.000000
2017-02-28     3.804422
2017-03-31     5.374334
2017-04-30     5.371977
2017-05-31     7.664564
2017-06-30     5.598970
2017-07-31     6.712130
2017-08-31    10.473223
2017-09-30     8.129030
2017-10-31    11.669039
2017-11-30    12.482351
2017-12-31    11.858465
Freq: M, Name: Close, dtype: float64


Or directly calculate the monthly rates of return using the data for the first day and the last day:

In [340]:
monthly_return = aapl.resample('M').agg(lambda x: x.iloc[-1]/x.iloc[0] - 1)
print(monthly_return)

Date
2017-01-31    0.044770
2017-02-28    0.068613
2017-03-31    0.027685
2017-04-30   -0.000348
2017-05-31    0.046463
2017-06-30   -0.059799
2017-07-31    0.036446
2017-08-31    0.097261
2017-09-30   -0.060530
2017-10-31    0.099018
2017-11-30    0.033422
2017-12-31   -0.010640
Freq: M, Name: Close, dtype: float64
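The same first-to-last return can also be written with the resampler's built-in first() and last() methods instead of a lambda. The sketch below checks that both approaches agree on a synthetic price series (values 100 to 113, made up for the example) and uses weekly intervals to keep the data small.

```python
import numpy as np
import pandas as pd

# 14 daily prices from 100.0 to 113.0, starting Monday 2017-01-02.
idx = pd.date_range('2017-01-02', periods=14, freq='D')
prices = pd.Series(np.linspace(100.0, 113.0, 14), index=idx)

# Return from the first to the last price of each week.
weekly = prices.resample('W')
weekly_return = weekly.last() / weekly.first() - 1
print(weekly_return)

# Equivalent calculation with agg() and a lambda.
same = prices.resample('W').agg(lambda x: x.iloc[-1] / x.iloc[0] - 1)
print(np.allclose(weekly_return, same))
```

The first week runs from 100 to 106, so its return is 0.06. Note that this measures the change within each interval, which differs from the change between consecutive period-end prices.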


Series objects also provide some convenient methods for quick calculations.

In [341]:
# Compute average by adding the data and dividing the result by the number 
# of data elements.
print(monthly_return.mean())

# Compute standard deviation.
print(monthly_return.std())

# Find maximum.
print(monthly_return.max())


# ++

# Find minimum.
print(monthly_return.min())

# Add all the data.
print(monthly_return.sum())

# Compute product.
print(monthly_return.prod())

0.02686341469547739
0.05225850941003253
0.09901830993350358
-0.06053017511906611
0.3223609763457287
6.205670946207166e-19


Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.

In [342]:
print(last_day.diff())
print(last_day.pct_change())


# ++

# The diff method accepts a parameter that sets the separation between
# the elements whose difference is taken.
# The following code takes the differences between the 3rd and 1st
# elements, the 4th and 2nd elements, the 5th and 3rd elements, etc.
last_day_spaced_diff = last_day.diff(2)
print(last_day_spaced_diff)

Date
2017-01-31         NaN
2017-02-28    3.804422
2017-03-31    1.569912
2017-04-30   -0.002357
2017-05-31    2.292587
2017-06-30   -2.065594
2017-07-31    1.113159
2017-08-31    3.761093
2017-09-30   -2.344193
2017-10-31    3.540009
2017-11-30    0.813313
2017-12-31   -0.623886
Freq: M, Name: Close, dtype: float64
Date
2017-01-31         NaN
2017-02-28    0.133777
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067807
2017-06-30   -0.057214
2017-07-31    0.032704
2017-08-31    0.106999
2017-09-30   -0.060244
2017-10-31    0.096808
2017-11-30    0.020278
2017-12-31   -0.015246
Freq: M, Name: Close, dtype: float64
Date
2017-01-31         NaN
2017-02-28         NaN
2017-03-31    5.374334
2017-04-30    1.567554
2017-05-31    2.290230
2017-06-30    0.226994
2017-07-31   -0.952435
2017-08-31    4.874252
2017-09-30    1.416901
2017-10-31    1.195816
2017-11-30    4.353321
2017-12-31    0.189426
Freq: M, Name: Close, dtype: float64


Notice that we introduced a NaN value while calculating percentage changes, i.e. returns: the first element has no previous value to compare against.
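The relationship between the two methods can be checked directly: the percentage change is the difference divided by the previous value. The following sketch uses a short made-up price series and the shift() method, which moves the values down by one position.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 108.9])

diff = prices.diff()        # p[t] - p[t-1]
pct = prices.pct_change()   # p[t] / p[t-1] - 1

# Manual percentage change: difference divided by the previous price.
manual = prices.diff() / prices.shift(1)

print(diff)
print(pct)
print(np.allclose(pct.dropna(), manual.dropna()))
```

Both results start with NaN, because there is no earlier price for the first element.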

When dealing with NaN values, we usually either remove the data point or fill it with a specific value. Here we fill it with 0:

In [343]:
daily_return = last_day.pct_change()
print(daily_return.fillna(0))


# ++

# fillna replaces all NaN values with the given value:
print(last_day_spaced_diff.head(5))
last_day_spaced_diff = last_day_spaced_diff.fillna(0)
print(last_day_spaced_diff.head(5))

Date
2017-01-31    0.000000
2017-02-28    0.133777
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067807
2017-06-30   -0.057214
2017-07-31    0.032704
2017-08-31    0.106999
2017-09-30   -0.060244
2017-10-31    0.096808
2017-11-30    0.020278
2017-12-31   -0.015246
Freq: M, Name: Close, dtype: float64
Date
2017-01-31         NaN
2017-02-28         NaN
2017-03-31    5.374334
2017-04-30    1.567554
2017-05-31    2.290230
Freq: M, Name: Close, dtype: float64
Date
2017-01-31    0.000000
2017-02-28    0.000000
2017-03-31    5.374334
2017-04-30    1.567554
2017-05-31    2.290230
Freq: M, Name: Close, dtype: float64


Alternatively, we can fill a NaN with the next valid value. This is called 'backward fill', or 'bfill' for short:

In [344]:
daily_return = last_day.pct_change()
# bfill() is equivalent to fillna(method = 'bfill'), which newer pandas
# versions deprecate.
print(daily_return.bfill())


# ++

# The amount of filled NaN values can be limited.
last_day_spaced_diff = last_day.diff(2)
# The next line only deals with one NaN value.
print(last_day_spaced_diff.bfill(limit = 1))

Date
2017-01-31    0.133777
2017-02-28    0.133777
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067807
2017-06-30   -0.057214
2017-07-31    0.032704
2017-08-31    0.106999
2017-09-30   -0.060244
2017-10-31    0.096808
2017-11-30    0.020278
2017-12-31   -0.015246
Freq: M, Name: Close, dtype: float64
Date
2017-01-31         NaN
2017-02-28    5.374334
2017-03-31    5.374334
2017-04-30    1.567554
2017-05-31    2.290230
2017-06-30    0.226994
2017-07-31   -0.952435
2017-08-31    4.874252
2017-09-30    1.416901
2017-10-31    1.195816
2017-11-30    4.353321
2017-12-31    0.189426
Freq: M, Name: Close, dtype: float64


As you might expect, since there is a 'backward fill' method, there is also a 'forward fill' method, or 'ffill' for short. However, it has no effect here because the NaN is the first value, so there is no earlier value to carry forward.
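
The difference between the two fill directions can be sketched on a short made-up Series (`s` is a hypothetical example with one NaN at the start and one in the middle):

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a NaN at the start and one in the middle.
s = pd.Series([np.nan, 1.0, np.nan, 3.0])

# Backward fill copies the next valid value into each gap.
print(s.bfill().tolist())   # [1.0, 1.0, 3.0, 3.0]

# Forward fill copies the previous valid value; the leading NaN has
# no previous value, so it stays NaN.
print(s.ffill().tolist())   # [nan, 1.0, 1.0, 3.0]
```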

We can also simply remove NaN values with **_.dropna()_**:

In [345]:
daily_return = last_day.pct_change()
print(daily_return.dropna())


# ++

# The dropna method simply removes the missing data.
last_day_spaced_diff = last_day.diff(2)
# This will display only 10 months.
print(last_day_spaced_diff.dropna())

Date
2017-02-28    0.133777
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067807
2017-06-30   -0.057214
2017-07-31    0.032704
2017-08-31    0.106999
2017-09-30   -0.060244
2017-10-31    0.096808
2017-11-30    0.020278
2017-12-31   -0.015246
Freq: M, Name: Close, dtype: float64
Date
2017-03-31    5.374334
2017-04-30    1.567554
2017-05-31    2.290230
2017-06-30    0.226994
2017-07-31   -0.952435
2017-08-31    4.874252
2017-09-30    1.416901
2017-10-31    1.195816
2017-11-30    4.353321
2017-12-31    0.189426
Freq: M, Name: Close, dtype: float64


# DataFrame
The **DataFrame** is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.

More precisely, a DataFrame is a collection of Series objects, each of which may contain a different data type. A DataFrame can be created from various data types: a dictionary, a 2-D numpy.ndarray, a Series or another DataFrame.

## Create DataFrames
The most common method of creating a DataFrame is passing a dictionary:

In [346]:
import pandas as pd

# Named price_dict to avoid shadowing Python's built-in dict type.
price_dict = {'AAPL': [143.5, 144.09, 142.73, 144.18, 143.77],
              'GOOG': [898.7, 911.71, 906.69, 918.59, 926.99],
              'IBM': [155.58, 153.67, 152.36, 152.94, 153.49]}
data_index = pd.date_range('2017-07-03', periods = 5, freq = 'D')
df = pd.DataFrame(price_dict, index = data_index)
print(df)


# ++

# The following dictionary maps each exponent from 2 to 6 to a list
# holding that power of the integers 1 through 10.
powers = {2:[x**2 for x in range(1, 11)], 
          3:[x**3 for x in range(1, 11)], 
          4:[x**4 for x in range(1, 11)],
          5:[x**5 for x in range(1, 11)],
          6:[x**6 for x in range(1, 11)],}
power_index = range(1, 11)
power_df = pd.DataFrame(powers, index = power_index)
print(power_df)

              AAPL    GOOG     IBM
2017-07-03  143.50  898.70  155.58
2017-07-04  144.09  911.71  153.67
2017-07-05  142.73  906.69  152.36
2017-07-06  144.18  918.59  152.94
2017-07-07  143.77  926.99  153.49
      2     3      4       5        6
1     1     1      1       1        1
2     4     8     16      32       64
3     9    27     81     243      729
4    16    64    256    1024     4096
5    25   125    625    3125    15625
6    36   216   1296    7776    46656
7    49   343   2401   16807   117649
8    64   512   4096   32768   262144
9    81   729   6561   59049   531441
10  100  1000  10000  100000  1000000
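
Besides a dictionary, the other input types mentioned above work in much the same way. A minimal sketch, with made-up values and hypothetical names (`arr`, `df_from_array`, `df_from_series`, `df_copy`):

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: rows and columns are labelled explicitly.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['a', 'b'], index=['x', 'y', 'z'])
print(df_from_array)

# From a named Series: the Series name becomes the single column name.
s = pd.Series([10.0, 20.0, 30.0], index=['x', 'y', 'z'], name='price')
df_from_series = pd.DataFrame(s)
print(df_from_series)

# From another DataFrame: copy() returns an independent object.
df_copy = df_from_array.copy()
print(df_copy.equals(df_from_array))   # True
```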


## Manipulating DataFrames
We can fetch values in a DataFrame by columns and index. Each column in a DataFrame is essentially a Pandas Series. We can fetch a column by square brackets: **df['column_name']**

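If a column name is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute, then we can also use df.column_name to fetch the column: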

In [347]:
df = aapl_table
print(df.Close.tail(5))
print(df['Volume'].tail(5))


# ++

# The next line returns the results of adding: 1^2 + 1^4, 2^2 + 2^4...
print(power_df[2] + power_df[4])

Date
2017-12-22    41.673252
2017-12-26    40.616005
2017-12-27    40.623150
2017-12-28    40.737446
2017-12-29    40.296925
Name: Close, dtype: float64
Date
2017-12-22     65397600
2017-12-26    132742000
2017-12-27     85992800
2017-12-28     65920800
2017-12-29    103999600
Name: Volume, dtype: int64
1         2
2        20
3        90
4       272
5       650
6      1332
7      2450
8      4160
9      6642
10    10100
dtype: int64
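
The two ways of fetching a column can be compared on a small made-up table (`df_demo` and its values are hypothetical; the 'Adj Close' column contains a space on purpose):

```python
import pandas as pd

# Hypothetical table; 'Adj Close' contains a space on purpose.
df_demo = pd.DataFrame({'Close': [10.0, 11.0], 'Adj Close': [9.5, 10.5]})

# Both forms return the same Series for a simple column name.
print(df_demo['Close'].equals(df_demo.Close))   # True

# A name with a space cannot be written with dot syntax,
# so square brackets are needed.
print(df_demo['Adj Close'])
```

Square brackets work for every column name, which makes them the safer default in reusable code.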


All the methods we applied to a Series, such as iloc[], loc[] and the resampling methods, can also be applied to a DataFrame:

In [348]:
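# Use loc for partial-string row selection; df['2016'] is deprecated in newer pandas.
aapl_2016 = df.loc['2016']
# iloc[-1] picks the last row of each month by position.
aapl_month = aapl_2016.resample('M').agg(lambda x: x.iloc[-1])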
print(aapl_month)


# ++

# Below, we slice our powers DataFrame and compute the remainder of its
# values when divided by 5.
mini_power_df = power_df[:8].agg(lambda x: x % 5)
print(mini_power_df)

                 Open       High        Low      Close     Volume  Dividends  \
Date                                                                           
2016-01-31  21.730620  22.315207  21.629749  22.315207  257666000        0.0   
2016-02-29  22.325658  22.641435  22.277254  22.286474  140865200        0.0   
2016-03-31  25.289805  25.331294  25.096189  25.121544  103553600        0.0   
2016-04-30  21.664134  21.832395  21.323003  21.606510  274126000        0.0   
2016-05-31  23.096979  23.282497  22.916099  23.157272  169228800        0.0   
2016-06-30  21.900390  22.208812  21.867924  22.169390  143345600        0.0   
2016-07-31  24.161392  24.244875  24.043124  24.166029  110934800        0.0   
2016-08-31  24.635012  24.847181  24.630348  24.737598  118649600        0.0   
2016-09-30  26.220457  26.432628  26.066577  26.358019  145516400        0.0   
2016-10-31  26.497912  26.633142  26.392992  26.472265  105677600        0.0   
2016-11-30  26.153537  26.294148  25.841
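
The same pattern can be tried without the AAPL data. The sketch below uses made-up daily values (`daily` and `weekly` are hypothetical names) and resamples to weekly frequency only to keep the example small:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering exactly two Monday-to-Sunday weeks.
idx = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.DataFrame({'Close': np.arange(14.0),
                      'Volume': np.arange(14) * 100}, index=idx)

# Keep the last observation of each week for every column.
weekly = daily.resample('W').last()
print(weekly)

# Partial-string selection of rows works on a DataFrame through loc;
# both endpoints of the label slice are included.
print(daily.loc['2024-01-08':'2024-01-10'])
```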


We may select certain columns of a DataFrame using their names:

In [349]:
aapl_bar = aapl_month[['Open', 'High', 'Low', 'Close']]
print(aapl_bar)


# ++

# The next two lines select the 2nd and 4th columns.
powers3n5 = power_df[[3, 5]]
print(powers3n5)

                 Open       High        Low      Close
Date                                                  
2016-01-31  21.730620  22.315207  21.629749  22.315207
2016-02-29  22.325658  22.641435  22.277254  22.286474
2016-03-31  25.289805  25.331294  25.096189  25.121544
2016-04-30  21.664134  21.832395  21.323003  21.606510
2016-05-31  23.096979  23.282497  22.916099  23.157272
2016-06-30  21.900390  22.208812  21.867924  22.169390
2016-07-31  24.161392  24.244875  24.043124  24.166029
2016-08-31  24.635012  24.847181  24.630348  24.737598
2016-09-30  26.220457  26.432628  26.066577  26.358019
2016-10-31  26.497912  26.633142  26.392992  26.472265
2016-11-30  26.153537  26.294148  25.841851  25.900438
2016-12-31  27.337010  27.465901  27.051101  27.142498
       3       5
1      1       1
2      8      32
3     27     243
4     64    1024
5    125    3125
6    216    7776
7    343   16807
8    512   32768
9    729   59049
10  1000  100000


We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:

In [350]:
print(aapl_month.loc['2016-03':'2016-06',['Open', 'High', 'Low', 'Close']])


# ++

# The next lines use loc and iloc to print the 3rd and 5th powers of
# the odd numbers from 1 to 9.
print(power_df.loc[::2, [3, 5]])
# iloc selects by zero-based integer position instead of by label.
print(power_df.iloc[::2, [1, 3]])

                 Open       High        Low      Close
Date                                                  
2016-03-31  25.289805  25.331294  25.096189  25.121544
2016-04-30  21.664134  21.832395  21.323003  21.606510
2016-05-31  23.096979  23.282497  22.916099  23.157272
2016-06-30  21.900390  22.208812  21.867924  22.169390
     3      5
1    1      1
3   27    243
5  125   3125
7  343  16807
9  729  59049
     3      5
1    1      1
3   27    243
5  125   3125
7  343  16807
9  729  59049
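
One difference between the two indexers is easy to miss: a loc slice includes its end label, while an iloc slice excludes its stop position. A minimal sketch on a made-up frame (`demo` is a hypothetical name; its row labels start at 1, like power_df):

```python
import pandas as pd

# Hypothetical frame whose row labels start at 1, like power_df.
demo = pd.DataFrame({'sq': [1, 4, 9, 16], 'cube': [1, 8, 27, 64]},
                    index=[1, 2, 3, 4])

# loc slices by label and includes both endpoints: labels 1, 2 and 3.
by_label = demo.loc[1:3, 'sq']

# iloc slices by position and excludes the stop: positions 1 and 2,
# which correspond to labels 2 and 3.
by_position = demo.iloc[1:3, 0]

print(by_label.tolist())      # [1, 4, 9]
print(by_position.tolist())   # [4, 9]
```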


The subsetting methods of a DataFrame are quite useful. By writing logical statements in square brackets, we can make customized subsets:

In [351]:
import numpy as np

# Filters so that only the rows with a Close value over the mean remain.
above = aapl_bar[aapl_bar.Close > np.mean(aapl_bar.Close)]
print(above)

                 Open       High        Low      Close
Date                                                  
2016-03-31  25.289805  25.331294  25.096189  25.121544
2016-08-31  24.635012  24.847181  24.630348  24.737598
2016-09-30  26.220457  26.432628  26.066577  26.358019
2016-10-31  26.497912  26.633142  26.392992  26.472265
2016-11-30  26.153537  26.294148  25.841851  25.900438
2016-12-31  27.337010  27.465901  27.051101  27.142498
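
Several conditions can be combined in one filter. The sketch below uses made-up bar data (`bars`, `up_day` and `selected` are hypothetical names) to show the intermediate boolean mask and a combined condition:

```python
import pandas as pd

# Hypothetical bar data.
bars = pd.DataFrame({'Open': [10.0, 12.0, 11.0, 13.0],
                     'Close': [11.0, 11.5, 12.0, 12.5]})

# A comparison on a column yields a boolean Series (a mask).
up_day = bars.Close > bars.Open
print(up_day.tolist())   # [True, False, True, False]

# Masks combine with & (and), | (or) and ~ (not); each condition
# needs its own parentheses.
selected = bars[(bars.Close > bars.Open) & (bars.Close > 11.5)]
print(selected)
```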


## Data Validation
As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:

In [352]:
aapl_bar['rate_return'] = aapl_bar.Close.pct_change()
print(aapl_bar)


# ++

# Column 6 is recomputed (overwriting the existing values) by multiplying
# the 5th powers, stored in the 4th column, by the index values.
power_df[6] = power_df[5] * power_df.index
print(power_df)

                 Open       High        Low      Close  rate_return
Date                                                               
2016-01-31  21.730620  22.315207  21.629749  22.315207          NaN
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.042660
2016-07-31  24.161392  24.244875  24.043124  24.166029     0.090063
2016-08-31  24.635012  24.847181  24.630348  24.737598     0.023652
2016-09-30  26.220457  26.432628  26.066577  26.358019     0.065504
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334
2016-11-30  26.153537  26.294148  25.841851  25.900438    -0.021601
2016-12-31  27.337010  27.465901  27.051101  27.142498     0.047955
      2     3      4       5        6
1     1   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

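The warning in the output above appears because aapl_bar was created by selecting columns from aapl_month, so Pandas cannot tell whether the assignment is meant to change a copy or the original table. One common way to avoid it, sketched here on a small made-up table (`monthly` and `bar` are hypothetical names), is to take an explicit copy before adding the column:

```python
import pandas as pd

# Hypothetical monthly table standing in for aapl_month.
monthly = pd.DataFrame({'Close': [20.0, 22.0, 21.0],
                        'Volume': [100, 120, 90]})

# Selecting columns and calling copy() gives an independent DataFrame,
# so adding a column to it is unambiguous.
bar = monthly[['Close']].copy()
bar['rate_return'] = bar.Close.pct_change()
print(bar)

# The original table is left untouched.
print(monthly.columns.tolist())   # ['Close', 'Volume']
```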

Here the calculation introduced a NaN value. If the DataFrame is large, we would not be able to spot it by eye. **isnull()** provides a convenient way to check for missing values.

In [353]:
missing = aapl_bar.isnull()
print(missing)
print('---------------------------------------------')
print(missing.describe(), end="\n"*2)


# ++

# The following code adds a column with the difference between each 
# 5th power and the value 4 rows above.
power_df["diff"] = power_df[5].diff(4)
print(power_df)

# The missing data shows up as True in the upper right corner.
print(power_df.isnull())
print('---------------------------------------------')

# The describe method reports two unique values in the "diff" column,
# implying the presence of NaN values.
print(power_df.isnull().describe())

             Open   High    Low  Close  rate_return
Date                                               
2016-01-31  False  False  False  False         True
2016-02-29  False  False  False  False        False
2016-03-31  False  False  False  False        False
2016-04-30  False  False  False  False        False
2016-05-31  False  False  False  False        False
2016-06-30  False  False  False  False        False
2016-07-31  False  False  False  False        False
2016-08-31  False  False  False  False        False
2016-09-30  False  False  False  False        False
2016-10-31  False  False  False  False        False
2016-11-30  False  False  False  False        False
2016-12-31  False  False  False  False        False
---------------------------------------------
         Open   High    Low  Close rate_return
count      12     12     12     12          12
unique      1      1      1      1           2
top     False  False  False  False       False
freq       12     12     12     12          11

The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values, it has at least one missing value.

We can deduce the number of missing values by comparing "count" with "freq". There are 12 counts and 11 False values, so there is one True value which corresponds to the missing value.
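
A more direct way to count missing values is to sum the boolean table, since each True counts as 1. A minimal sketch on made-up data (`table` and `missing_counts` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing return.
table = pd.DataFrame({'Close': [10.0, 11.0, 12.0],
                      'rate_return': [np.nan, 0.10, 0.09]})

# True counts as 1 when summed, so this gives the NaN count per column.
missing_counts = table.isnull().sum()
print(missing_counts)
```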

We can also find the rows with missing values easily:

In [354]:
print(missing[missing.rate_return == True])


# ++

# Spot the rows with NaN values in the "diff" column.
power_missing = power_df.isnull()
print(power_missing[power_missing["diff"] == True])

# Drops the "diff" column.
power_df = power_df.drop("diff", axis=1)

             Open   High    Low  Close  rate_return
Date                                               
2016-01-31  False  False  False  False         True
       2      3      4      5      6  diff
1  False  False  False  False  False  True
2  False  False  False  False  False  True
3  False  False  False  False  False  True
4  False  False  False  False  False  True
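
When we do not know in advance which column holds the missing values, we can ask for every row that has a NaN in any column. A sketch on made-up data (`tbl` and `rows_with_nan` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical table with NaN values in different columns.
tbl = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, 5.0, 6.0]})

# any(axis=1) is True for a row if at least one of its columns is NaN.
rows_with_nan = tbl[tbl.isnull().any(axis=1)]
print(rows_with_nan.index.tolist())   # [0, 1]
```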


Usually when dealing with missing data, we either delete the whole row or fill it with some value. As we introduced in the Series chapter, the same methods **dropna()** and **fillna()** can be applied to a DataFrame.

In [355]:
# Drops every row that contains a NaN value (here, only the first row).
drop = aapl_bar.dropna()
print(drop)
print('\n--------------------------------------------------\n')
# Replaces NaN values with zeros.
fill = aapl_bar.fillna(0)
print(fill)

                 Open       High        Low      Close  rate_return
Date                                                               
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.042660
2016-07-31  24.161392  24.244875  24.043124  24.166029     0.090063
2016-08-31  24.635012  24.847181  24.630348  24.737598     0.023652
2016-09-30  26.220457  26.432628  26.066577  26.358019     0.065504
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334
2016-11-30  26.153537  26.294148  25.841851  25.900438    -0.021601
2016-12-31  27.337010  27.465901  27.051101  27.142498     0.047955

--------------------------------------------------

                 Open       High        Low      Close  rate_re
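
On a DataFrame, both methods can also be applied column by column. The sketch below uses made-up values (`frame`, `kept` and `filled` are hypothetical names); filling 'Close' with its mean is only an illustration, not a recommendation for price data:

```python
import numpy as np
import pandas as pd

# Hypothetical table with one NaN in each column.
frame = pd.DataFrame({'Close': [10.0, np.nan, 12.0],
                      'Volume': [100.0, 200.0, np.nan]})

# Drop a row only if 'Close' is missing; NaN in other columns is kept.
kept = frame.dropna(subset=['Close'])
print(kept.index.tolist())   # [0, 2]

# Fill each column with its own replacement value.
filled = frame.fillna({'Close': frame.Close.mean(), 'Volume': 0})
print(filled)
```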

## DataFrame Concat
We have seen how to extract a Series from a DataFrame. Now we consider how to merge a Series or a DataFrame into another one.

In Pandas, the function **concat()** allows us to merge multiple Series into a DataFrame:

In [356]:
s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1,s2], axis = 1)
print(data_frame)


# ++

# Creates a series with the first 10 7th powers.
seventh = pd.Series([x**7 for x in range(1, 11)], name = 7)
seventh.index = range(1, 11)

# Merges the power_df DataFrame and the seventh Series side by side (column-wise).
power_df = pd.concat([power_df, seventh], axis = 1)
print(power_df)

     AAPL    GOOG
0  143.50  898.70
1  144.09  911.71
2  142.73  906.69
3  144.18  918.59
4  143.77  926.99
      2     3      4       5        6         7
1     1     1      1       1        1         1
2     4     8     16      32       64       128
3     9    27     81     243      729      2187
4    16    64    256    1024     4096     16384
5    25   125    625    3125    15625     78125
6    36   216   1296    7776    46656    279936
7    49   343   2401   16807   117649    823543
8    64   512   4096   32768   262144   2097152
9    81   729   6561   59049   531441   4782969
10  100  1000  10000  100000  1000000  10000000


The "axis = 1" parameter will join two DataFrames by columns:

In [357]:
log_price = np.log(aapl_bar.Close)
log_price.name = 'log_price'
print(log_price)
print('\n---------------------- separate line--------------------\n')
concat = pd.concat([aapl_bar, log_price], axis = 1)
print(concat)

Date
2016-01-31    3.105268
2016-02-29    3.103980
2016-03-31    3.223726
2016-04-30    3.072995
2016-05-31    3.142309
2016-06-30    3.098712
2016-07-31    3.184948
2016-08-31    3.208324
2016-09-30    3.271773
2016-10-31    3.276098
2016-11-30    3.254260
2016-12-31    3.301101
Freq: M, Name: log_price, dtype: float64

---------------------- separate line--------------------

                 Open       High        Low      Close  rate_return  log_price
Date                                                                          
2016-01-31  21.730620  22.315207  21.629749  22.315207          NaN   3.105268
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288   3.103980
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210   3.223726
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921   3.072995
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773   3.142309
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.04266

We can also join two DataFrames by rows. Consider these two DataFrames:

In [358]:
df_volume = aapl_table.loc['2016-10':'2017-04',['Volume', 'Stock Splits']].resample('M').agg(lambda x: x.iloc[-1])
print(df_volume)
print('\n---------------------- separate line--------------------\n')
df_2017 = aapl_table.loc['2016-10':'2017-04',['Open', 'High', 'Low', 'Close']].resample('M').agg(lambda x: x.iloc[-1])
print(df_2017)

               Volume  Stock Splits
Date                               
2016-10-31  105677600             0
2016-11-30  144649200             0
2016-12-31  122345200             0
2017-01-31  196804000             0
2017-02-28   93931600             0
2017-03-31   78646800             0
2017-04-30   83441600             0

---------------------- separate line--------------------

                 Open       High        Low      Close
Date                                                  
2016-10-31  26.497912  26.633142  26.392992  26.472265
2016-11-30  26.153537  26.294148  25.841851  25.900438
2016-12-31  27.337010  27.465901  27.051101  27.142498
2017-01-31  28.391590  28.447834  28.267384  28.438459
2017-02-28  32.264064  32.348796  32.174623  32.242882
2017-03-31  33.826915  33.956368  33.659803  33.812794
2017-04-30  33.913998  33.963427  33.720999  33.810436


Now we merge df_volume with our DataFrame 'aapl_bar':

In [359]:
concat = pd.concat([aapl_bar, df_volume], axis = 1)
print(concat)

                 Open       High        Low      Close  rate_return  \
Date                                                                  
2016-01-31  21.730620  22.315207  21.629749  22.315207          NaN   
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288   
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210   
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921   
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773   
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.042660   
2016-07-31  24.161392  24.244875  24.043124  24.166029     0.090063   
2016-08-31  24.635012  24.847181  24.630348  24.737598     0.023652   
2016-09-30  26.220457  26.432628  26.066577  26.358019     0.065504   
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334   
2016-11-30  26.153537  26.294148  25.841851  25.900438    -0.021601   
2016-12-31  27.337010  27.465901  27.051101  27.142498     0.047955   
2017-0

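By default, the DataFrames are joined on the union of their indexes (an 'outer join'). This default option results in zero information loss, and entries that exist in only one DataFrame are filled with NaN. We can also merge them by intersection, which is called an 'inner join':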

In [360]:
# Only rows whose index appears in both DataFrames are kept.
concat = pd.concat([aapl_bar,df_volume],axis = 1, join = 'inner')
print(concat)

                 Open       High        Low      Close  rate_return  \
Date                                                                  
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334   
2016-11-30  26.153537  26.294148  25.841851  25.900438    -0.021601   
2016-12-31  27.337010  27.465901  27.051101  27.142498     0.047955   

               Volume  Stock Splits  
Date                                 
2016-10-31  105677600             0  
2016-11-30  144649200             0  
2016-12-31  122345200             0  
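
The two join modes can be compared side by side on small made-up frames (`left`, `right`, `outer` and `inner` are hypothetical names) whose row labels only partly overlap:

```python
import pandas as pd

# Hypothetical frames sharing the labels 'b' and 'c'.
left = pd.DataFrame({'Close': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'Volume': [20, 30, 40]}, index=['b', 'c', 'd'])

# Default join='outer': union of the row labels, gaps become NaN.
outer = pd.concat([left, right], axis=1)
print(outer.index.tolist())   # ['a', 'b', 'c', 'd']

# join='inner': only labels present in both inputs survive.
inner = pd.concat([left, right], axis=1, join='inner')
print(inner.index.tolist())   # ['b', 'c']
```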


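Only the intersection is kept when we use the 'inner join' method. Now let's try to append one DataFrame to another. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the following cell only runs on older versions: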

In [361]:
append = aapl_bar.append(df_2017)
print(append)


# ++

# Creates a Series with the 2nd through 7th powers of 11, indexed by
# exponent to match the columns of power_df.
elevenp = pd.Series([11**x for x in range(2, 8)], name = 11)
elevenp.index = range(2, 8)

# Uses append to add the row with the powers of 11.
power_df = power_df.append(elevenp)
print(power_df)

                 Open       High        Low      Close  rate_return
Date                                                               
2016-01-31  21.730620  22.315207  21.629749  22.315207          NaN
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.042660
2016-07-31  24.161392  24.244875  24.043124  24.166029     0.090063
2016-08-31  24.635012  24.847181  24.630348  24.737598     0.023652
2016-09-30  26.220457  26.432628  26.066577  26.358019     0.065504
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334
2016-11-30  26.153537  26.294148  25.841851  25.900438    -0.021601
2016-12-31  27.337010  27.465901  27.051101  27.142498     0.047955
2016-10-31  26.497912  26.633142  26.392992  26.

'Append' is essentially a concat of two DataFrames along axis = 0, so here is an alternative way to append, and the one to use in current versions of Pandas:

In [362]:
# One DataFrame is "stacked" over the other.
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print(concat)

                 Open       High        Low      Close  rate_return
Date                                                               
2016-01-31  21.730620  22.315207  21.629749  22.315207          NaN
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.042660
2016-07-31  24.161392  24.244875  24.043124  24.166029     0.090063
2016-08-31  24.635012  24.847181  24.630348  24.737598     0.023652
2016-09-30  26.220457  26.432628  26.066577  26.358019     0.065504
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334
2016-11-30  26.153537  26.294148  25.841851  25.900438    -0.021601
2016-12-31  27.337010  27.465901  27.051101  27.142498     0.047955
2016-10-31  26.497912  26.633142  26.392992  26.
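
The same idea covers both uses of append seen earlier: stacking two DataFrames and adding a single Series as a new row. A minimal sketch with made-up values (`first`, `second`, `stacked`, `new_row` and `with_row` are hypothetical names):

```python
import pandas as pd

# Hypothetical frames with identical columns.
first = pd.DataFrame({'Open': [1.0, 2.0], 'Close': [1.5, 2.5]})
second = pd.DataFrame({'Open': [3.0], 'Close': [3.5]})

# Stack the rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([first, second], axis=0, ignore_index=True)
print(stacked)

# A Series can be added as one extra row by turning it into a
# one-row DataFrame first; its name becomes the row label.
new_row = pd.Series({'Open': 4.0, 'Close': 4.5}, name='extra')
with_row = pd.concat([stacked, new_row.to_frame().T], axis=0)
print(with_row)
```

Without ignore_index, the original row labels are kept, which is why the dates 2016-10-31 to 2016-12-31 appear twice in the output above.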

Please note that if the two DataFrames have columns with the same names, these columns are considered to be the same and will be merged. It's therefore very important to have the right column names. Suppose we change a column name here:

In [363]:
df_2017.columns = ['Change', 'High','Low','Close']
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print(concat)

                 Open       High        Low      Close  rate_return     Change
Date                                                                          
2016-01-31  21.730620  22.315207  21.629749  22.315207          NaN        NaN
2016-02-29  22.325658  22.641435  22.277254  22.286474    -0.001288        NaN
2016-03-31  25.289805  25.331294  25.096189  25.121544     0.127210        NaN
2016-04-30  21.664134  21.832395  21.323003  21.606510    -0.139921        NaN
2016-05-31  23.096979  23.282497  22.916099  23.157272     0.071773        NaN
2016-06-30  21.900390  22.208812  21.867924  22.169390    -0.042660        NaN
2016-07-31  24.161392  24.244875  24.043124  24.166029     0.090063        NaN
2016-08-31  24.635012  24.847181  24.630348  24.737598     0.023652        NaN
2016-09-30  26.220457  26.432628  26.066577  26.358019     0.065504        NaN
2016-10-31  26.497912  26.633142  26.392992  26.472265     0.004334        NaN
2016-11-30  26.153537  26.294148  25.841851  25.9004

Since the column name 'Open' has been changed, the new DataFrame has a new column named 'Change'. The 'Open' and 'Change' columns each contain NaN in the rows that came from the other DataFrame.
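
If the names differ only by accident, renaming before the concat makes the columns line up again. A sketch on two one-row made-up frames (`a`, `b`, `mismatched` and `aligned` are hypothetical names):

```python
import pandas as pd

# Hypothetical frames whose first columns are named differently.
a = pd.DataFrame({'Open': [1.0], 'Close': [2.0]})
b = pd.DataFrame({'Change': [3.0], 'Close': [4.0]})

# Mismatched names produce an extra column padded with NaN.
mismatched = pd.concat([a, b], axis=0, ignore_index=True)
print(mismatched)

# Renaming first makes the columns line up again.
aligned = pd.concat([a, b.rename(columns={'Change': 'Open'})],
                    axis=0, ignore_index=True)
print(aligned)
```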

# Summary

Here we introduced some of the most important parts of Python for financial analysis: resampling and DataFrame manipulation. We only covered the methods most commonly used in financial data analysis. Many other methods, such as those used in data mining, are also useful. You can always check the official [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) documentation for help.