# Python for Data Analysis - Basics

This tutorial is based on the open-source [notebooks](https://github.com/manujeevanprakash/Python-and-Numpy-Basics) of Manu Jeevan Prakash.

While working on this notebook, an instructor will guide you through the basic concepts of programming languages. 

Self-learners should take the first eight lessons of the free Python course on [Codecademy](https://www.codecademy.com/learn/learn-python) before going through this condensed guide.
If you have 10 hours and enough coffee, just take all 12 lessons! ;-)

Manu also made a good compilation of more advanced resources in his [blog](http://bigdataexaminer.com/2015/05/18/5-amazingly-powerful-python-libraries-for-data-science/).


At this point, there are two important things to remember:

* Python places an emphasis on **readability**, **simplicity** and **explicitness**.

* Everything is an **object** in Python, and objects have attributes.


## Conditionals, Control Flow and Loops

In this tutorial we will review the basics of *conditionals*, *control flow* and *loops* while working on examples of different functions of the Python language.

We will only briefly mention the commonly used format of **ternary expressions**.

In [1]:
# These are called ternary expressions:
s1 = 'I can not read.'
s2 = 'This is a beautiful quote.' if s1 == 'I can not read.' else 'I am a rabbit.'
print(s2)
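
A ternary expression is shorthand for a full `if`/`else` statement that assigns one of two values. The sketch below shows both forms side by side; the names `temperature`, `label_long` and `label_short` are made up for illustration.

```python
temperature = 31  # Hypothetical value chosen for illustration.

# Full if/else statement:
if temperature > 25:
    label_long = 'warm'
else:
    label_long = 'cold'

# Equivalent ternary expression:
label_short = 'warm' if temperature > 25 else 'cold'

print(label_long, label_short)
```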

## Built-In Functions and Types

First, we will start with some of the Python **Built-In** [functions](https://docs.python.org/3/library/functions.html) and [types](https://docs.python.org/3/library/stdtypes.html).
You can find all the available modules in the official documentation of the Python [Standard Library](https://docs.python.org/3/library/).

### Booleans

In [2]:
# As we already saw, boolean values in python are written as True and False.
print(True and True)
print(True or False)
print(True and False)

In [3]:
# Empty containers (lists, dictionaries, strings, tuples, etc.) are "falsy".
# Once they contain at least one element, they are considered "truthy".
# Among numbers, zero is falsy and every other value is truthy.
print(bool([]), bool([1, 2, 3]))
print(bool(''), bool('Hi'))
print(bool(0), bool(1))
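
Truthiness is most useful inside conditions, where it replaces an explicit length check. A minimal sketch, assuming a hypothetical helper called `describe`:

```python
def describe(items):
    # An empty container is falsy, so no explicit len() check is needed.
    if items:
        return 'has {} item(s)'.format(len(items))
    return 'is empty'

print(describe([]))
print(describe(['a', 'b']))
```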

You can check the type of an object using the `type()` function.

You can check whether an object is an instance of a particular type using the `isinstance()` function.

In [4]:
a = 'This is a string.'
print(type(a))
print(isinstance(a, str))
print(isinstance(a, float))
print(isinstance(a, int))
print(isinstance(a, (str, float, int)))

b = 4.5 
print(type(b))
print(isinstance(b, str))
print(isinstance(b, float))
print(isinstance(b, int))
print(isinstance(b, (str, float, int)))

The attributes of a Python object can be accessed using `object.attribute_name`.

In [5]:
# In Jupyter or IPython, type `a.` and press the Tab key to list the attributes of a.
# Left as literal text, the line below would raise a SyntaxError:
#a.<tab>

## Mutable and Immutable Objects

Objects whose *value can be changed* once they are created are called **mutable** objects.

For example:
* Lists
* Dictionaries
* Arrays

Objects whose *value cannot be changed* once they are created are called **immutable** objects.

For example:
* Strings
* Tuples

In [6]:
animals = ['Penguin', 'Dog', 'Cow', 'Python']
animals[2] = 'Cat'
print(animals)

numbers = (10, 20, 1, 123)
#numbers[1] = 1984    # This will raise a TypeError.
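
Mutability matters most when two names refer to the same object. The following sketch (variable names are illustrative) shows that a change made through one name is visible through the other, while an explicit copy stays independent:

```python
original = ['Penguin', 'Dog']
alias = original                 # Both names refer to the same list object.
shallow_copy = list(original)    # A new, independent list with the same items.

alias.append('Cow')

print(original)       # The change made through alias is visible here.
print(shallow_copy)   # The copy is unaffected.
print(alias is original, shallow_copy is original)
```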

## Mutable

### Lists

Now we will discuss in more detail some useful concepts:

* Adding and removing elements from a list
* Combining and concatenating lists
* Sorting
* List slicing

In [7]:
# Notice the difference between append() and extend():
countries = ['Germany', 'Zambia', 'United Kingdom','Mexico']
countries.append(['Portugal', 'Canada'])
print(countries)

countries = ['Germany', 'Zambia', 'United Kingdom','Mexico']
countries.extend(['Portugal', 'Canada'])
print(countries)

In [8]:
# sort() alphabetically (the default order for strings):
countries.sort()
print(countries)

# sort() according to number of characters:
countries.sort(key=len)
print(countries)

Unlike the **method** `.sort()`, which sorts the list in place, the **function** `sorted()` returns a new list.

In [9]:
countries = [8, 5, 1, 3, 9, 7, 2, 4, 6]
print(countries)

countries = [8, 5, 1, 3, 9, 7, 2, 4, 6]
countries.sort()
print(countries)

[8, 5, 1, 3, 9, 7, 2, 4, 6]
[1, 2, 3, 4, 5, 6, 7, 8, 9]


In [10]:
countries = [8, 5, 1, 3, 9, 7, 2, 4, 6]
sorted(countries)
print(countries)

countries = [8, 5, 1, 3, 9, 7, 2, 4, 6]
countries = sorted(countries)
print(countries)

[8, 5, 1, 3, 9, 7, 2, 4, 6]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
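
A common pitfall is assigning the result of `.sort()`, which is always `None`. The sketch below illustrates this and also shows that `sorted()` accepts the same `key` and `reverse` arguments; the sample lists are made up for illustration.

```python
values = [8, 5, 1, 3]

result = values.sort()      # Sorts in place and returns None.
print(result, values)

words = ['pear', 'fig', 'banana']
by_length_desc = sorted(words, key=len, reverse=True)
print(by_length_desc)       # A new list, longest word first.
print(words)                # The original list is unchanged.
```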


`bisect` finds the index at which an element should be inserted into a sorted list so that the list stays sorted.

Remember that Python has **zero-based** indexing.

In [11]:
from bisect import bisect

a = [0, 1, 2, 3, 4, 5]
print(a)

b = bisect(a, 2)
print(b)

c = bisect(a, 6)
print(c)

[0, 1, 2, 3, 4, 5]
3
6
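
`bisect` only reports the position. To actually insert the element, the same module provides `insort`. A short sketch with an illustrative `scores` list:

```python
from bisect import bisect, insort

scores = [10, 20, 30, 40]

position = bisect(scores, 25)  # Index where 25 would be inserted.
insort(scores, 25)             # Insert 25 while keeping the list sorted.

print(position)
print(scores)
```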


When iterating over a sequence, you can keep track of the index of the current element with `enumerate`.

In [12]:
languages = ['Italian', 'Farsi', 'Spanish', 'Mandarin']

for index, value in enumerate(languages):
    print(index, value)

0 Italian
1 Farsi
2 Spanish
3 Mandarin


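You can use `zip()` to pair up the elements of several lists into tuples. In Python 3, `zip()` returns an iterator, so wrap it in `list()` to see the result.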

In [13]:
jobs = ['magician', 'singer', 'sales man', 'priest']
languages = ['c', 'c++', 'java', 'javascript']
stats = ['mean', 'median', 'mode', 'skewness']

print(list(zip(jobs, languages, stats)))

for i, (x, y, z) in enumerate(zip(jobs, languages, stats)):
    print('{0}: {1}, {2}, {3}'.format(i, x, y, z))

[('magician', 'c', 'mean'), ('singer', 'c++', 'median'), ('sales man', 'java', 'mode'), ('priest', 'javascript', 'skewness')]
0: magician, c, mean
1: singer, c++, median
2: sales man, java, mode
3: priest, javascript, skewness


You can *unzip* a zipped sequence as follows:

In [14]:
singers = [('Elvis', 'Presley'), ('Frank', 'Sinatra'), ('Placido', 'Domingo')]

first_names, last_names = zip(*singers)

print(first_names)
print(last_names)

('Elvis', 'Frank', 'Placido')
('Presley', 'Sinatra', 'Domingo')


Use `reversed()` to reverse a sequence:

In [15]:
print(list(range(10)))
print(list(reversed(range(10))))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


### Dictionaries

Some key concepts to remember about dictionaries are:

* How to access elements in a dictionary
* `.keys()` and `.values()` methods
* The `.pop()` method and the `del` statement
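
The concepts listed above can be sketched in a few lines. The `capitals` dictionary is a made-up example, not part of the original notebook:

```python
capitals = {'Germany': 'Berlin', 'Zambia': 'Lusaka', 'Mexico': 'Mexico City'}

print(capitals['Germany'])            # Access by key; raises KeyError if missing.
print(capitals.get('Canada', 'n/a'))  # get() returns a default instead.

print(list(capitals.keys()))
print(list(capitals.values()))

removed = capitals.pop('Zambia')      # Removes the key and returns its value.
del capitals['Mexico']                # Removes the key without returning it.
print(removed, capitals)
```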

You can combine two dictionaries using `update()`.

In [16]:
d1 = {'foo': 'bar', 123: 'abc'}
d2 = {'baz': '123', 'abc': 123}
d1.update(d2)
print(d1)

# The new dictionary overrides the previous value of 'foo':
d1 = {'foo': 'bar', 123: 'abc'}
d2 = {'foo': '123', 'abc': 123}
d1.update(d2)
print(d1)

{'foo': 'bar', 123: 'abc', 'baz': '123', 'abc': 123}
{'foo': '123', 123: 'abc', 'abc': 123}


The `dict()` function accepts an iterable of key-value tuples; therefore, you can use `zip()` to create key-value pairs.

In [17]:
keys = range(10)
values = reversed(range(10))
wow_dict = dict(zip(keys, values))
print(wow_dict)

{0: 9, 1: 8, 2: 7, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1, 9: 0}


The keys of a dictionary must be *hashable*. Immutable built-in types (int, float, string, and tuples that contain only hashable items) qualify; mutable ones such as lists and dictionaries do not.
See [hashability](https://docs.python.org/3/glossary.html) in the official glossary.
An object is hashable if it has a hash value that never changes during its lifetime and it can be compared to other objects.
Hashable objects that compare equal must have the same hash value.

In [18]:
print(hash('string'))
print(hash((1, 2, 3)))
#print(hash([1, 2, 3]))  # This will raise a TypeError.

6920973180326063016
2528502973977326415
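
The rule that equal objects share a hash value can be checked directly, and the failure for unhashable keys can be caught rather than commented out. A minimal sketch (the exact wording of the error message may vary between Python versions):

```python
print(1 == 1.0, hash(1) == hash(1.0))   # Equal values share a hash.

try:
    {[1, 2, 3]: 'value'}                # A list is mutable, hence unhashable.
except TypeError as error:
    message = str(error)
    print(message)
```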


If necessary, you can cast a `list` into a `tuple` and use it as a key in a dictionary.

In [19]:
d = {}
l = [4, 5, 6]
t = tuple(l)
d[t] = 789
print(d)

{(4, 5, 6): 789}


Let's take a look at building a dictionary from a list. Here a plain loop groups the clubs by their initial letter:

In [20]:
football_list = ['Manchester', 'Liverpool', 'Arsenal', 'Chelsea',
                  'Mancity', 'Tottenham', 'Barcelona', 'Dortmund']
football_dict = {}

for club in football_list:
    # Get the initial letter of the club name:
    initial = club[0]
    if initial not in football_dict:
        football_dict[initial] = [club]
    else:
        football_dict[initial].append(club)
        
print(football_dict)

{'M': ['Manchester', 'Mancity'], 'L': ['Liverpool'], 'A': ['Arsenal'], 'C': ['Chelsea'], 'T': ['Tottenham'], 'B': ['Barcelona'], 'D': ['Dortmund']}
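
When each key maps to a single value, a dictionary comprehension builds the dictionary in one expression. A short sketch, using an illustrative `clubs` list:

```python
clubs = ['Manchester', 'Liverpool', 'Arsenal', 'Chelsea']

# General form: {key_expression: value_expression for item in iterable}
name_lengths = {club: len(club) for club in clubs}
print(name_lengths)

# An optional condition filters the items:
long_names = {club: len(club) for club in clubs if len(club) > 7}
print(long_names)
```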


The same operation can be done using `defaultdict`.
Normally, a Python dictionary raises a `KeyError` if you try to get an item with a key that is not currently in the dictionary.
In contrast, `defaultdict` creates any missing item that you try to access.
`defaultdict` is part of the `collections` module.
To create the default value, `defaultdict` calls the object that you pass to its constructor with no arguments (any such callable works, including functions and types like `list` or `int`).

In [21]:
from collections import defaultdict

football_defdict = defaultdict(list)

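# Iterate over the list of clubs (not over the keys of football_dict):
for club in football_list:
    football_defdict[club[0]].append(club)

print(football_defdict)

defaultdict(<class 'list'>, {'M': ['Manchester', 'Mancity'], 'L': ['Liverpool'], 'A': ['Arsenal'], 'C': ['Chelsea'], 'T': ['Tottenham'], 'B': ['Barcelona'], 'D': ['Dortmund']})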
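
Passing `int` instead of `list` gives a default of `0`, which makes `defaultdict` a convenient counter. A minimal sketch with a made-up input string:

```python
from collections import defaultdict

letter_counts = defaultdict(int)   # int() returns 0, the default value.

for letter in 'banana':
    letter_counts[letter] += 1     # Missing keys start at 0.

print(dict(letter_counts))
```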


### Arrays

Arrays are not a built-in data type in Python.
For this we need to use the **NumPy** [library](https://scipy.org/install.html).

You can rename imported modules with the keyword `as`.

In [22]:
import numpy as np

# Define a list of numbers:
data_list = [6, 7.5, 8, 0, 1]
print(type(data_list))
print(data_list)

# Convert the list into an array using numpy:
data_array = np.array(data_list)
print(type(data_array))
print(data_array)

<class 'list'>
[6, 7.5, 8, 0, 1]
<class 'numpy.ndarray'>
[6.  7.5 8.  0.  1. ]


Try these operators; they are self-explanatory:

In [23]:
x = 2
y = x 
z = float(x)
print(x, y, z)

print(x is y)
print(z is not x)
print(x / y)

2 2 2.0
True
True
1.0


You can also use the following operators:

In [24]:
# This is called floor divide, it drops the fractional remainder:
print(x // y)

# Raise x to the power of y:
print(x**y)

# True if x is less than or equal to y:
print(x <= y, x < y)

1
4
True False


The same applies to the bitwise operators `&`, `|` and `^`, and to the comparison operators `==` and `!=`.
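
A brief sketch of these operators in action; the operands are arbitrary values chosen for illustration:

```python
bitwise = (6 & 3, 6 | 3, 6 ^ 3)                        # AND, OR, XOR on integers.
on_bools = (True & False, True | False, True ^ True)   # On booleans they act logically.
comparisons = (2 == 2.0, 2 != 3)                       # Comparisons return booleans.

print(bitwise)
print(on_bools)
print(comparisons)
```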

## Immutable

### Strings

You can write multiline strings using triple quotes, `'''` or `"""`.

In [25]:
print("""
Hello myself!
Do I want another cup of coffee?
""")


Hello myself!
Do I want another cup of coffee?



In [26]:
# As we said, Python strings don't support item assignment:
s = 'This is a profound string.'
#s[10] = 'X'  # This will raise a TypeError.

In [27]:
# However, there is a way around this:
s2 = s.replace('profound', 'great')
print(s2)

This is a great string.


In [28]:
# Many Python objects can be converted to a string using the str() function:
x = 123456
y = str(x)
print(y)
print(type(y))

123456
<class 'str'>


In [29]:
# Strings act like other sequences, such as lists and tuples:
s = 'Superman'
print(list(s))

# So, you can also slice a python string:
print(s[:3])
print(s[3:])
print(s[-3:])

['S', 'u', 'p', 'e', 'r', 'm', 'a', 'n']
Sup
erman
man


In [30]:
# String concatenation is very useful:
p = 'Python is a funky language'
q = ", I'm loving it."
z = p + q
print(z)

Python is a funky language, I'm loving it.


We often need to do string formatting while analysing data.
You can format a string using the method `format()`, and set the width and precision of floats with a format specification placed after the colon. For example, use `{:.3f}` to format a number with 3 decimal places (the older `%` operator uses `%.3f` for the same purpose).
You can find more options in the official [documentation](https://docs.python.org/3.4/library/string.html#format-specification-mini-language).

In [31]:
print('Amount: {:2.4f}. Currency: {:s}'.format(1234.56789, 'EUR'))
print('Amount: {:12.4f}. Currency: {:s}'.format(1234.56789, 'USD'))
print('Amount: {:2.4}. Currency: {:s}'.format(1234.56789, 'CNY'))

Amount: 1234.5679. Currency: EUR
Amount:    1234.5679. Currency: USD
Amount: 1.235e+03. Currency: CNY
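
The same format specification works in three places: `str.format()`, the older `%` operator, and f-strings. The sketch below assumes Python 3.6 or later for the f-string; the variable names are illustrative.

```python
amount = 1234.56789
currency = 'EUR'

with_format = '{:.3f}'.format(amount)        # str.format() with 3 decimal places.
with_percent = '%.3f' % amount               # Older printf-style formatting.
with_fstring = f'{amount:10.2f} {currency}'  # f-string: width 10, 2 decimals.

print(with_format)
print(with_percent)
print(with_fstring)
```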


### Tuples

Tuples are defined by separating values with commas; the parentheses are optional.
Once created, a tuple has a fixed length and content.

In [32]:
t1 = 1, 2, 3
print(t1)
print(t1[1])

t2 = (1, 2, 3), (4, 5, 6)  # Nested tuples.
print(t2)
print(t2[1])
print(t2[1][2])

(1, 2, 3)
2
((1, 2, 3), (4, 5, 6))
(4, 5, 6)
6


You can cast any sequence into a tuple by using `tuple()`:

In [33]:
t1 = tuple([4, 0, 2])
print(t1)

t2 = tuple('Python')
print(t2)

(4, 0, 2)
('P', 'y', 't', 'h', 'o', 'n')


However, if a tuple contains a mutable object such as a list, you can still modify that object, for example with `append()`:

In [34]:
x = tuple(['Conchita', [1, 2], 'Wurst'])
print(x)

x[1].append(3)
print(x)

('Conchita', [1, 2], 'Wurst')
('Conchita', [1, 2, 3], 'Wurst')


You can concatenate tuples using the `+` operator:

In [35]:
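# Without a trailing comma, ('Mean') is just a string, so + joins strings:
y = ('Mean') + ('Median') + ('Mode')
print(y)

# With a trailing comma, each operand is a one-element tuple:
y = ('Mean',) + ('Median',) + ('Mode',)
print(y)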

MeanMedianMode
('Mean', 'Median', 'Mode')


You can also repeat a tuple with the `*` operator:

In [36]:
print(y * 2)

('Mean', 'Median', 'Mode', 'Mean', 'Median', 'Mode')


Often we need to **unpack** tuples:

In [37]:
deep_learning = ('Theano', 'Open cv', 'Torch')

x, y, z = deep_learning
print(x)
print(y)
print(z)

Theano
Open cv
Torch


In [38]:
countries = 'Germany', ('Mexico','China')  

a, b = countries
print(a, b)

a, (b, c) = countries
print(a, b, c)

Germany ('Mexico', 'China')
Germany Mexico China
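
Unpacking also supports a starred name that collects the remaining items, and it makes swapping two variables a one-liner. A short sketch with made-up values:

```python
first, *rest = (1, 2, 3, 4)   # rest collects the remaining items as a list.
print(first, rest)

left, right = 'a', 'b'
left, right = right, left     # Swap without a temporary variable.
print(left, right)
```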


You can use `.count()` to count how many times a value occurs in a tuple:

In [39]:
# Only top-level elements are counted; the 'Peru' inside the nested tuple is not.
countries = 'Canada', ('Peru', 'France'), 'Peru', 'Peru'
print(countries.count('Peru'))

2


### Sets

A set is an unordered collection of unique elements.

In [40]:
print(    {2, 2, 3, 3, 3, 4, 4, 4, 4} )
print(set([2, 2, 3, 3, 3, 4, 4, 4, 4]))

{2, 3, 4}
{2, 3, 4}


Sets support mathematical operations like **union**, **intersection**, **difference**, and **symmetric difference**.

In [41]:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7}

print(a | b)  # Union
print(a & b)  # Intersection
print(a - b)  # Difference
print(a ^ b)  # Symmetric difference
print({1, 2, 3} == {3, 3, 3, 2, 2, 1})  # Comparison of values

{1, 2, 3, 4, 5, 6, 7}
{3, 4, 5}
{1, 2}
{1, 2, 6, 7}
True
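
In data analysis, sets are often used to remove duplicates and to test membership. A minimal sketch with a hypothetical `visitors` list:

```python
visitors = ['ana', 'bob', 'ana', 'cid', 'bob']

unique_visitors = set(visitors)          # Duplicates are dropped.
print(len(unique_visitors))
print('ana' in unique_visitors)          # Fast membership test.
print({'ana', 'bob'}.issubset(unique_visitors))
print(sorted(unique_visitors))           # Sets are unordered; sort for display.
```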


### Date-Time

Dates and times are not built-in data types in Python.
For this we need to import the `datetime` module from the **Standard Library**.
You can find all the available directives for parsing datetime strings in the official [documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [42]:
from datetime import datetime

# Write the number 6 as 6, not 06; a leading zero in an integer literal raises a SyntaxError.
dt = datetime(2018, 12, 31, 15, 30, 45)
print(dt.day)
print(dt.minute)
print(dt.date())
print(dt.time())

31
30
2018-12-31
15:30:45


`strftime()` converts the date and time into a string with the specified format:

In [43]:
dt.strftime('%d/%m/%Y %H:%M:%S')

'31/12/2018 15:30:45'

Strings can be converted to datetime objects using `strptime()`:

In [44]:
dt = datetime.strptime('2018-12-31', '%Y-%m-%d') 
print(dt)

2018-12-31 00:00:00


You can use `replace()` to edit a datetime object:

In [45]:
dt2 = dt.replace(hour=15, minute=30, second=45)
print(dt2)

2018-12-31 15:30:45


You can also perform arithmetic operations on datetime objects:

In [46]:
dt1 = datetime(2000, 1, 1, 12, 0, 0)
print(dt1)

dt2 = datetime(2001, 1, 1, 13, 30, 30)
print(dt2)

2000-01-01 12:00:00
2001-01-01 13:30:30


In [47]:
time_delta = dt2 - dt1

print(type(time_delta))
print(time_delta)
print(dt1 + time_delta)

<class 'datetime.timedelta'>
366 days, 1:30:30
2001-01-01 13:30:30
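
A `timedelta` can also be built directly and inspected through its attributes. The sketch below reuses the dates from this section; the variable names are illustrative.

```python
from datetime import datetime, timedelta

start = datetime(2018, 12, 31, 15, 30, 45)
one_week_later = start + timedelta(weeks=1)
print(one_week_later)

elapsed = datetime(2001, 1, 1, 13, 30, 30) - datetime(2000, 1, 1, 12, 0, 0)
print(elapsed.days, elapsed.seconds)   # Whole days plus the leftover seconds.
print(elapsed.total_seconds())         # The full duration in seconds.
```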


## Handling Exceptions

Handling exceptions is only a fancy name for handling **code errors**.

In Python, many functions only work on certain types of input.
For example, the `float()` function raises a `ValueError` when you pass it a string that does not represent a number.

In [48]:
print(float('123'))
#print(float('one two three'))  # This will raise a ValueError.
#print(float((9,8)))  # This will raise a TypeError.

123.0


Suppose that we want to get the input value back whenever `float()` cannot convert it.
We can do this using **try / except** handlers:

In [49]:
def return_float(x):
    try:
        return float(x)
    except Exception:  # Unlike a bare except, this lets KeyboardInterrupt through.
        return x

print(return_float('123'))
print(return_float('one two three'))  # This time it won't raise a ValueError.

123.0
one two three


`ValueError` and `TypeError` are **exception** types. You can catch specific types and bind the raised exception to a name:

In [50]:
def return_float(x):
    try:
        return float(x)
    except (ValueError, TypeError) as exception:
        print(exception)
        return x

print(return_float('one two three'))
print(return_float((9,8)))

could not convert string to float: 'one two three'
one two three
float() argument must be a string or a number, not 'tuple'
(9, 8)
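
A `try` statement can also carry an `else` branch, which runs only when no exception was raised, and a `finally` branch, which always runs. A minimal sketch, assuming a hypothetical helper named `parse_number`:

```python
def parse_number(text):
    try:
        value = float(text)
    except ValueError:
        # Runs only if float() failed.
        value = None
        status = 'invalid'
    else:
        # Runs only if float() succeeded.
        status = 'ok'
    finally:
        # Runs in both cases, typically used for cleanup.
        print('Finished parsing {!r}'.format(text))
    return value, status

print(parse_number('1.5'))
print(parse_number('abc'))
```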


## Functions

Functions can take zero or more arguments, and return zero or more values.

In [51]:
def f(n):
    x = 11 * n
    y = 22 * n
    z = 33 * n
    return x, y, z

print(f(2))

(22, 44, 66)


**Closure functions** are dynamically generated functions returned by another function.
Their main property is that the returned function has access to the variables in the namespace where it was created.
In layman's terms, a closure is a function defined inside another function that remembers the enclosing function's variables.

In [52]:
# Return True if an element is repeated in a list:
def parent_function():
    
    new_dict = {}
    def existing_element(element):
        if element in new_dict:
            return True
        else:
            new_dict[element] = True
            return False
        
    return existing_element

f = parent_function()
numbers = [1, 2, 1, 2, 3, 4]
existing = [f(n) for n in numbers]

print(existing)

[False, False, True, True, False, False]
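
Each call to the outer function creates a fresh namespace, so separate closures keep separate state. The sketch below uses a hypothetical `make_counter` function and the `nonlocal` keyword, which is needed when the inner function rebinds (rather than mutates) an enclosing variable:

```python
def make_counter():
    count = 0

    def increment():
        nonlocal count   # Rebind the variable of the enclosing function.
        count += 1
        return count

    return increment

counter_a = make_counter()
counter_b = make_counter()

results = (counter_a(), counter_a(), counter_b())
print(results)   # counter_b has its own, independent count.
```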


### Generator expressions

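Generator functions use `yield` instead of `return` and produce their values one at a time, only when requested. A generator expression is a compact way of writing the same thing.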

In [53]:
def generator():
    for i in range(5):
        yield i**2

gen_object = generator()
print(gen_object)
print(list(gen_object))

# The same generator can be written as a generator expression:
gen_object = (x**2 for x in range(5))
print(gen_object)
print(list(gen_object))

# Generator expressions can be used inside other functions:
d = dict((x, x**0.5) for x in generator())
print(d)

<generator object generator at 0x1103a9f68>
[0, 1, 4, 9, 16]
<generator object <genexpr> at 0x1103d7b48>
[0, 1, 4, 9, 16]
{0: 0.0, 1: 1.0, 4: 2.0, 9: 3.0, 16: 4.0}
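
Generators are lazy and can be consumed only once. The following sketch, with an illustrative `squares` function, pulls values one by one with `next()` and shows that an exhausted generator yields nothing more:

```python
def squares(limit):
    for i in range(limit):
        yield i**2

gen = squares(3)
first_three = (next(gen), next(gen), next(gen))  # Values are computed on demand.
print(first_three)

leftover = list(gen)                             # The generator is now exhausted.
print(leftover)

total = sum(x**2 for x in range(5))              # No intermediate list is built.
print(total)
```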


## Cleaning data

Real-world data is messy.
Often, the most time-consuming part of data analysis is cleaning the data set to make it ready for analysis.

First we define a list of strings that contain unnecessary punctuation, capitalization and white space.
Then, we import the Python [regular expression](https://docs.python.org/2/library/re.html) module, `re`, and use `re.sub` to remove the characters that we don't want.
Afterwards, we create a list of functions to clean the data:
`remove_characters`,
[`str.strip`](http://www.tutorialspoint.com/python/string_strip.htm) and
[`str.title`](http://www.tutorialspoint.com/python/string_title.htm).

In [54]:
import re

data = ['     Green', 'RED!', 'yellow  ', 'Pink', 
        'blue', 'Light  Brown##', 'dark purple?']

# Regular expression function:
def remove_characters(string): 
    return re.sub('[!#?]', '', string) 

# Create a list of functions:
cleaners = [remove_characters, str.strip, str.title]

Now we create one last function to apply all the cleaner functions to each element in the data:

In [55]:
def clean_data(data, cleaners):      # The function takes two arguments
    clean = []                       # Create an empty list
    for datum in data:               # Loop over the data elements
        for function in cleaners:    # Loop over the list of functions
            datum = function(datum)  # Apply each function to the elements
        clean.append(datum)          # Store the cleaned elements in a new list
    return clean                     # Return the clean list 
    
clean = clean_data(data, cleaners)
print(clean)

['Green', 'Red', 'Yellow', 'Pink', 'Blue', 'Light  Brown', 'Dark Purple']
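
Note that `'Light  Brown'` still contains two spaces, because `str.strip` only removes leading and trailing white space. One possible extra cleaner, sketched here under the hypothetical name `collapse_whitespace`, replaces every run of white space with a single space:

```python
import re

def collapse_whitespace(string):
    # Replace every run of whitespace characters with a single space.
    return re.sub(r'\s+', ' ', string)

print(collapse_whitespace('Light  Brown'))
print(collapse_whitespace('dark \t purple').title())
```

Such a function could be added to the list of cleaners, after `str.strip`.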


**Lambda** functions are a short way of writing a small, anonymous function in a single expression.

In [56]:
def f(x):
    return x**2
print(f(5))

# Now the same function using lambda:
lambda_f = lambda x: x**2 
print(lambda_f(5))

25
25
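
Lambda functions are most often passed as arguments to other functions, for example as the `key` of `sorted()`. A short sketch with made-up data:

```python
pairs = [('mean', 3), ('median', 1), ('mode', 2)]

# Sort the tuples by their second element:
by_number = sorted(pairs, key=lambda pair: pair[1])
print(by_number)

# Apply a function to every element with map():
squared = list(map(lambda x: x**2, [1, 2, 3]))
print(squared)
```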
