# Introduction to Python

## Miguel Ángel Canela, IESE Business School

******

### General introduction

**Python** is a programming language, introduced in 1991. We find it everywhere, and it is actually ranked second in the list of the most used programming languages, after JavaScript. It is said to be the preferred choice of the developers of malicious software (and those people are knowledgeable). 

There are currently two versions of the Python language: Python 2 and Python 3. This course uses version 3. Although all beginners are currently adopting Python 3, there is a lot of Python 2 code still in circulation, and many books base their explanations on Python 2. Although most Python 2 code runs in Python 3, one finds trouble from time to time. So, it is recommended to check Python's version before starting to read a book or before copying and pasting somebody else's code.
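As a minimal sketch of such a version check, the standard module `sys` reports the version of the running interpreter. The error message below is only an illustration, not part of any library.

```python
import sys

# sys.version_info describes the interpreter that is running this code
major = sys.version_info.major
minor = sys.version_info.minor
print("Running Python {}.{}".format(major, minor))

# Stop early if the code is run under Python 2 by mistake
if major < 3:
    raise RuntimeError("This code expects Python 3")
```

From a command-line interface, `python --version` gives the same information without opening the console.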

In this course, we do not look at Python as a programming language, that is, for developing software applications, but from a data science perspective. About 2008, three libraries were added to the Python portfolio: Pandas, for managing data sets, Matplotlib, for plotting, and scikit-learn, for machine learning. This trio put Python in the data analytics arena. Since then, Python's popularity has been growing steadily among data analysts and, nowadays, it is the preferred tool for developing **machine learning models**. 

There are many distributions of Python. In the data science community, **Anaconda** (`anaconda.com`) is the favorite one. The Anaconda distribution comes with the three libraries mentioned above. Downloading and installing Anaconda (choose Python 3 when the choice is presented to you) will leave you with the **Anaconda Navigator**, a graphical launcher that allows you to choose among different interfaces to Python. Once Anaconda is installed, you can bypass the navigator using a **command-line interface** (CLI), like Terminal on Mac computers or the Anaconda prompt on Windows.

Among the many interfaces offered by Anaconda, I use for these notes the **Jupyter QtConsole**, which is an input/output text interface. Jupyter is a new name for an older project called **IPython**, so you may find in many places references to the "IPython console", which is the same as the Jupyter QtConsole.

An alternative approach is based on the **notebook** concept. In a notebook, you can combine input, output and ordinary text. In the notebook arena, the **Jupyter Notebook** is the leading choice, followed by **Apache Zeppelin**. These two are multilingual, that is, they can be used with other languages, like R, besides Python. Jupyter has powerful supporters and very smart people in the development team, so we will probably see plenty of Jupyter notebooks in the immediate future. Most pythonistas prefer the console for developing their code, but use notebooks for diffusion, especially for posting their work on platforms like GitHub. This document is a Jupyter notebook.

In the console, you can type or paste your code. When you open it, you find an input prompt (such as `In[1]:`), where you can type a command and press `Return`. Then Python returns an output (preceded by `Out[1]:`), a (typically long and difficult) error message or no answer at all. A supersimple example:

In [1]:
2 + 2

4

So, if we input `2 + 2`, the output is the result of this calculation. But, when we want to store this result, we input it with a name, as follows.

In [2]:
a = 2 + 2

Note that the value of `2 + 2` is not outputted now. If we want it to be outputted, we have to ask for that explicitly:

In [3]:
a

4

*Note*. In some programming environments, you should type `print(a)`, or similar, to print `a` on the screen.

If you copy and paste code from a text editor (which is what you would do if you were working in the console, so that you could save your code), you can input several lines of code at once. In that case, you will get the output only for the last line. If the cursor is not at the end of the last line, you now have to press `Shift + Return` to get the output. A simple example:

In [4]:
b = 2 * 3
b - 1
b**2

36

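*Note*. You would probably have written `b^2` for the square of `b`, but the "hat" symbol does not mean exponentiation in Python: `^` is the bitwise XOR operator, so `b^2` runs without error but gives a different result.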
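To make the difference between `**` and `^` concrete, here is a minimal sketch. In Python, the caret is the bitwise XOR operator, which compares the binary digits of two integers. The variable names `power` and `xor` are just illustrative.

```python
b = 2 * 3

power = b ** 2   # exponentiation: 6 squared
xor = b ^ 2      # bitwise XOR of 0b110 and 0b010, not a power

print(power)     # 36
print(xor)       # 4
```

Since `b ^ 2` does not raise an error, this kind of mistake can go unnoticed.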

As said above, Python is a programming language to which many additional resources have been added in the form of **modules**. A module is just a text file that contains Python code (extension `.py`). Modules are grouped in libraries. These libraries are also called **packages**, because their elements are packed according to some specific rules that allow you to install and call them together. Python can be extended by more than 60,000 packages.

The basic Python (without any package) is quite limited, so you need additional modules for practically everything. For instance, suppose that your math work goes beyond the above calculations, and you want to calculate the square root of 2. You first import the module `math`, whose resources include the square root and many other mathematical functions, and then apply the **function** `math.sqrt`. This notation indicates that `sqrt` is a function of the module `math`. In the console, the square root calculation shows up as:

In [5]:
import math
math.sqrt(2)

1.4142135623730951
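The `import` statement admits a few variants, which you will meet in other people's code. The following sketch shows three equivalent ways of reaching the same function; the alias `m` and the variable names are arbitrary choices made for this illustration.

```python
import math              # the function is called as math.sqrt
import math as m         # same module, under a shorter alias
from math import sqrt    # imports only the function sqrt

root_full = math.sqrt(2)
root_alias = m.sqrt(2)
root_name = sqrt(2)

# The three calls use the same function, so the results coincide
print(root_full == root_alias == root_name)
```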

### Learning about Python

There are many books for learning about Python, but most of them would not be appropriate for learning how to work with data in Python. Many of them do not even mention data. Bear in mind that Python has so many applications that the intersection of the know-how of all Python users is relatively narrow. For an introduction to Python as a programming language, in a computer science context, I would recommend Zelle (2010). For the self-learning data scientist, McKinney (2017) and VanderPlas (2017) are both worth their price. To those who are not afraid of manuals, I would recommend the Pandas manual (which is free).

There are also plenty of learning materials on the Internet, including MOOCs. For instance, **Coursera** has a pack of courses on Python (`coursera.org/courses?query=python`). But, probably, the most attractive marketplace for data science courses is **DataCamp**. They offer, under subscription or academic license, an impressive collection of courses, most of them focused on either R or Python (there are also a few on SQL). In addition to following DataCamp courses, you can benefit from the **DataCamp Community Tutorials**, which are free and cover a wide range of topics. Also, a good place to start is `learningpython.org`.

### Numbers

As seen in our first example, the equal sign (`=`) is used to assign a value to a **variable**. For the variable `a` defined above:

In [6]:
type(a)

int

So, `a` has **integer type**. Another numeric type is **float**:

In [7]:
b = math.sqrt(2)
type(b)

float

There are subdivisions of integers and floats (e.g. `int64`, found in libraries such as NumPy), but I skip them in this brief introduction. Note that, in Python, integers are not, as in a mathematics course, a subset of the real numbers, but a different type:

In [8]:
type(2)

int

In [9]:
type(2.0)

float

Note that, in the above square root calculation, `b` is a float because this is what the function `math.sqrt` returns (try `math.sqrt(1)`). The functions `int` and `float` can be used to convert numbers from one type to another type (sometimes at a loss): 

In [10]:
float(2)

2.0

In [11]:
int(2.3)

2
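The loss mentioned above comes from the fact that `int` drops the decimal part instead of rounding. A short sketch, with illustrative variable names, comparing `int` with the built-in function `round`:

```python
truncated_pos = int(2.7)    # the decimal part is dropped: 2
truncated_neg = int(-2.7)   # truncation goes toward zero: -2
rounded = round(2.7)        # rounding to the nearest integer: 3
from_text = float('3.5')    # float also converts strings that look like numbers

print(truncated_pos, truncated_neg, rounded, from_text)
```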

### Boolean

We also have **Boolean** variables, which are either `True` or `False`:

In [12]:
d = 5 < a
d

False

In [13]:
type(d)

bool

So, if I define a variable with an expression like the one above, it has Boolean type. Warning: to test equality in an expression, we need two equal signs, since a single equal sign is reserved for assignment (this may surprise you).

In [14]:
a == 4

True

Boolean variables can be transformed into integers and floats with the functions mentioned above, but this also happens automatically when a mathematical function or operator is applied to them (`True` counts as 1 and `False` as 0):

In [15]:
math.sqrt(d)

0.0

In [16]:
1 - d

1
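The conversions just described can be summarized in a short sketch. The variable names are illustrative; the point is that `True` behaves as 1 and `False` as 0 in arithmetic.

```python
a = 2 + 2
d = 5 < a               # False, since a is 4

as_int = int(d)         # explicit conversion: 0
as_float = float(d)     # explicit conversion: 0.0
total = True + True     # implicit conversion by the operator: 2

print(as_int, as_float, total)
```

This is what makes it possible to count how many conditions are satisfied by simply adding Boolean values.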

### Strings

Besides numbers, we can also manage **strings**:

In [17]:
c = 'Messi'
type(c)

str

The quote marks indicate string type. You can use single or double quotes, but take care to use the same on both sides of the string. Strings come in Python with many methods attached, but I postpone their discussion to the context of Pandas data frame methods. The same goes for date and time types.
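A brief sketch of the quoting rule: both kinds of quotes produce the same string, and having two kinds is handy when the text itself contains one of them. The example strings are made up for illustration.

```python
c1 = 'Messi'
c2 = "Messi"
same = c1 == c2                    # True: the quotes are not part of the string

# Double quotes outside allow an apostrophe inside
with_apostrophe = "Messi's goal"
n_chars = len(with_apostrophe)     # 12 characters, counting the apostrophe and the space

print(same, with_apostrophe, n_chars)
```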

### Lists

Python has several **data container** classes, which are used to group together other values. The most versatile is the **list**, which can be written as a sequence of comma-separated values between square brackets. Other container classes are dictionaries, sets and tuples.
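For orientation only, here is a minimal sketch of the three other containers just mentioned. All the values are hypothetical and chosen only to show the syntax.

```python
pair = (3, 4)                   # tuple: like a list, but it cannot be modified
counts = {'yes': 3, 'no': 1}    # dictionary: maps keys to values
letters = {'a', 'b', 'a'}       # set: the repeated 'a' is stored only once

first = pair[0]                 # tuples are indexed by position
n_yes = counts['yes']           # dictionaries are indexed by key
n_letters = len(letters)        # 2

print(first, n_yes, n_letters)
```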

Lists can contain items of different types, although this is not usual. A simple example of a list, of length 4, is:

In [18]:
x = ['Messi', 'Cristiano', 'Neymar', 'Coutinho']

In [19]:
len(x)

4

Lists can be concatenated in a very simple way in Python:

In [20]:
y = x + [2, 3]
y

['Messi', 'Cristiano', 'Neymar', 'Coutinho', 2, 3]

Now, the length of the list `y` is 6:

In [21]:
len(y)

6

The first item of `y` can be extracted as `y[0]`, the second item as `y[1]`, etc. The last item can be extracted as `y[5]` or as `y[-1]`. Sublists can be extracted putting a colon within the brackets, as in:

In [22]:
y[0:2]

['Messi', 'Cristiano']

Note that `0:2` includes 0 but not 2. This is a general rule for indexing in Python. Other examples:

In [23]:
y[3:]

['Coutinho', 2, 3]

In [24]:
y[:3]

['Messi', 'Cristiano', 'Neymar']
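Negative indexes, as in `y[-1]`, can also be used in sublists. The following sketch rebuilds the list `y` so that it runs on its own; the names `last_two` and `all_but_last_two` are illustrative.

```python
y = ['Messi', 'Cristiano', 'Neymar', 'Coutinho', 2, 3]

last_two = y[-2:]            # from the second-to-last item to the end
all_but_last_two = y[:-2]    # everything before the second-to-last item
middle = y[1:3]              # includes position 1 but not position 3

print(last_two)              # [2, 3]
print(all_but_last_two)      # ['Messi', 'Cristiano', 'Neymar', 'Coutinho']
print(middle)                # ['Cristiano', 'Neymar']
```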

In Pandas data frames, there are other ways of extracting parts of the data, based on expressions such as `y > 0`, as we will see later in this course. The items of a list are ordered, and can be repeated. This is not so in other compound data types, like **sets**:

In [25]:
set(y)

{2, 3, 'Coutinho', 'Cristiano', 'Messi', 'Neymar'}

Note that the items of the set are not printed in the order of the list `y`, which reflects that a set has no order. Also, repeated items are dropped, which some coders use to extract the unique values of a list with repeated items:

In [26]:
list(set([1, 0, 1, 0, 7]))

[0, 1, 7]
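Since a set has no order, the order of the list obtained from it should not be relied upon. Two hedged alternatives are sketched below: sorting the set explicitly, or using `dict.fromkeys`, which keeps the order of first appearance (this assumes Python 3.7 or later, where dictionaries preserve insertion order).

```python
items = [1, 0, 1, 0, 7]

# Unique values in increasing order
unique_sorted = sorted(set(items))

# Unique values in order of first appearance (assumes Python 3.7+)
unique_in_order = list(dict.fromkeys(items))

print(unique_sorted)      # [0, 1, 7]
print(unique_in_order)    # [1, 0, 7]
```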

### For loops

The **for loop** exists in practically all current programming languages as a way to avoid repetition. I give first a supersimple example:


In [27]:
squares = [0]
for i in range(1, 4):
    squares = squares + [i**2]
squares

[0, 1, 4, 9]

When typing this in the console, you may have noted that, in the definition of `squares`, the third line comes indented. This is triggered by the colon at the end of the second line. The indentation is not cosmetic: it tells Python which lines belong to the loop. Also, note that `range(1, 4)` contains `1` but not `4`, as is the rule in Python.
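The same loop can also be written with the list method `append`, which adds an item to the existing list instead of creating a new list at every step. This is a sketch of an alternative, not a correction of the loop above.

```python
squares = [0]
for i in range(1, 4):
    squares.append(i**2)    # modifies the list in place

print(squares)              # [0, 1, 4, 9]
```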

The loop for generating `squares` has been presented in a standard form. But in Python, you can do better since it is possible to transform a list into another list with a loop, in one line of code:

In [28]:
squares = [i**2 for i in range(0,4)]
squares

[0, 1, 4, 9]

**Ranges**, generated by the function `range`, are not the same as lists, but similar. I skip the details. Instead of a range, we can also use a list, as in the following example:

In [29]:
[len(name) for name in x]

[5, 9, 6, 8]
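Coming back to ranges, the following sketch shows in which sense they are similar to lists without being lists. The variable names are illustrative.

```python
r = range(1, 4)

is_list = isinstance(r, list)    # False: a range is a type of its own
as_list = list(r)                # converted explicitly: [1, 2, 3]
n = len(r)                       # 3, as for a list
second = r[1]                    # indexing works as in a list: 2

print(is_list, as_list, n, second)
```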

Note that, applied to a string, the function `len` returns the number of characters. Now, something a bit more difficult. The following loop generates a sequence of Fibonacci numbers (quite popular since they appeared in *The Da Vinci Code*, where they are used to unlock a safe).

In [30]:
fib = [1, 1]
for i in range(2, 10):
    fib = fib + [fib[i-1] + fib[i-2]]

In [31]:
fib

[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

### Functions

Python allows you to define your own functions. Part of its real power comes from defining the operations that we wish to perform as **functions**, so they can be applied many times. A simple example of the definition of a function follows. Again, note the indent after the colon.

In [32]:
def f(x):
    y = 1/(1 - x**2)
    return y

When we define a function, Python just takes note of the definition, accepting it when it is syntactically correct (parentheses, commas, etc). The function can be applied later to different arguments.

In [33]:
f(2)

-0.3333333333333333

If we apply the function to an argument for which it does not make sense, Python will return an error message which depends on the values supplied for the argument.

In [34]:
f(1)

ZeroDivisionError: division by zero

In [35]:
f('Mary')

TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
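When such errors are expected, they can be intercepted with a `try`/`except` block, so that the rest of the code keeps running. The sketch below is one possible approach; the helper `safe_f` and the choice of returning `None` are assumptions made for this illustration.

```python
def f(x):
    y = 1/(1 - x**2)
    return y

def safe_f(x):
    # Hypothetical helper: returns None when f cannot be evaluated at x
    try:
        return f(x)
    except (ZeroDivisionError, TypeError):
        return None

results = [safe_f(v) for v in [2, 1, 'Mary']]
print(results)    # [-0.3333333333333333, None, None]
```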

Functions can have more than one argument, as in:

In [36]:
def g(x, y): return x*y/(x**2 + y**2)
g(1, 1)

0.5

Note that, in the definition of `g`, I have used a shorter way. Most programmers would make it longer, as I did previously for `f`. 

**Lambda expressions** provide an alternative way to define functions. They are practical for functions given by an expression that can be written in one line and that are not going to be reused. To define the function `f` by means of a lambda expression, I would use:

In [37]:
f = lambda x: 1/(1 - x**2)
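A typical use of a lambda expression is as an argument of another function, without giving it a name. In this sketch, the built-in function `sorted` receives a lambda expression through its `key` argument, in order to sort the list of names by length; the list `x` is rebuilt so that the example runs on its own.

```python
x = ['Messi', 'Cristiano', 'Neymar', 'Coutinho']

# The lambda expression tells sorted what to compare: the length of each name
by_length = sorted(x, key=lambda name: len(name))

print(by_length)    # ['Messi', 'Neymar', 'Coutinho', 'Cristiano']
```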

### References

1. W McKinney (2017), *Python for Data Analysis --- Data Wrangling with Pandas, NumPy, and IPython*, O'Reilly.

2. W McKinney & PyData Development Team (2018), *pandas --- powerful data analysis toolkit*.

3. J VanderPlas (2017), *Python Data Science Handbook*, O'Reilly.

4. J Zelle (2010), *Python Programming --- An Introduction to Computer Science*, Franklin, Beedle & Associates.
