
# [DATA-01] Python and Pandas

## What is Python?

**Python** is a programming language, born in 1991. The latest stable version (when this is being written) is Python 3.14. As a programming language, Python can be used by a programmer to write a program that performs a task. This program can be later executed many times without being modified. But you can use Python in other ways, for instance to examine the variation of the stock price of a specific company, or the structure of the vacation rental market in a specific region.

The basic tool is the **Python interactive interpreter**. It can be imagined as a translator: we speak Python and the interpreter translates our commands to **machine language**. We can call several instances of the interpreter, or **kernels**, and have them running simultaneously. To send our commands to the kernel and (eventually) get results and other messages from the kernel, we use an app as an interface. In the examples of this course, the interface is assumed to be a **notebook**, but everything can be easily transported to a different interface, *e.g*. to a **console**.

## Python modules and packages

Many additional resources have been added to Python in the form of **modules**. A module is just a text file containing Python code. Modules are grouped in **libraries**. The **Python Standard Library**, distributed with Python, contains **built-in modules** providing standardized solutions for many problems that occur in everyday programming. For instance, the module `math` provides mathematical functions, while the module `datetime` provides functions for manipulating dates and times.

Other libraries are typically called **packages**, because their elements are packed according to some specific rules which allow you to install and call them together. Python can be extended by more than 500,000 packages. Some big packages like scikit-learn (used in machine learning) have **subpackages**.

Since the basic Python toolkit (without any module) is quite limited, you will need to **import** additional resources for practically everything. Once a module has been imported, all its functions are available. Alternatively, you can import a single function from a module. Resources are imported just for the current kernel. You can only import from packages which are already **installed** on your computer.
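
As a minimal sketch of these two import styles, the following cell uses the standard library module `datetime`. The dates are arbitrary values chosen for illustration.

```python
# Import the whole module, then call its resources with the module prefix
import datetime
start = datetime.date(2024, 1, 1)

# Import a single resource from the module, then call it without the prefix
from datetime import date
end = date(2024, 1, 31)

# Subtracting two dates gives a time difference; .days is its length in days
elapsed = (end - start).days
print(elapsed)
```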

Almost everything in the Python world is open-source. In particular, Python packages are contributed by various agents, such as university professors, freelance programmers, or industry giants like Google. This leads to a dynamic ecosystem, where you may find overlaps, dependencies and multiple versions.

## Python distributions

A **Python distribution** is a software bundle, containing, at least, a Python interpreter and the corresponding version of the standard library. It also includes a collection of packages and one or more package managers for installing, uninstalling or updating packages. All distributions have a **package manager** called `pip`, and some distributions have a specific package manager. A few details about one of the most popular Python distributions come in the next section.

One option for working with Python is to start a Python kernel directly in a **shell** application associated with the operating system of your computer. A shell app is called **Terminal** in Mac/Linux computers and **Command Prompt** in Windows computers. For this approach to work, the shell app has to find the Python files. This is automatic when the folder where the Python distribution is installed is in the **path** of that shell. Otherwise, you have to know where to find it.

If you are just starting with Python, you will prefer a friendlier approach. Python distributions provide various interfaces to the Python interpreter. All include a **command line interface** (CLI), which may be a shell-like application or an **integrated development environment** (IDE). A Python IDE provides a Python-aware code editor integrated with the ability to run code from that editor.

## The Anaconda distribution

In the data science community, **Anaconda** (`anaconda.com`) is the favorite distribution. The current Anaconda distribution comes with Python 3.12. Besides `pip`, Anaconda has a specific package manager, called `conda`. You may need `conda` for more advanced work, because it reviews the packages that are already installed on your computer, keeping track of the dependencies, and solving version conflicts between packages. On the downside, `conda` is much slower than `pip`.

After downloading and installing Anaconda, you can start your Python experience with the **Anaconda Navigator**, a desktop application which allows you to choose among different interfaces to the Python interpreter. First, you have **Jupyter Qt Console**, which is a shell-like app with some extra features. Jupyter (Julia/Python/R) is a new name for an older project called **IPython** (Interactive Python). IPython's contribution was the IPython shell, which added some features to the plain Python interpreter. Qt Console is the result of adding a **graphical user interface** (GUI), with drop-down menus, mouse-clicking, etc., to the IPython shell, by means of a toolkit called Qt.

Jupyter provides an alternative approach, based on the **notebook** concept. A notebook is a document where you can combine input, output and ordinary text. A notebook is stored in a file with extension `ipynb` (IPython notebook). In the notebook arena, **Jupyter Notebook** is the leading choice. Notebooks are multilingual, that is, they can be used, not only with Python, but also with other languages like R. Many data scientists prefer the console for developing their code, but use notebooks for dissemination, especially for posting their work on platforms like **GitHub**.

Besides the Jupyter apps, Anaconda also provides a Python IDE called **Spyder**, where you can manage together a console and a text editor for your code. If you have previous experience with IDEs, for instance from working with R in RStudio, you may prefer Spyder to Qt Console.

Once Anaconda is installed, you can bypass the navigator by calling your preferred interface from a shell. To start Qt Console, enter `jupyter qtconsole`. To get access to the notebooks in the default browser (*e.g*. Google Chrome), enter `jupyter notebook`. To start Spyder, enter `spyder`.

*Note*. Use *Anaconda Prompt* in Windows, instead of the standard Windows prompt, whose path does not contain the Anaconda apps.

## Colab notebooks

**Colaboratory** is a Google app which allows you to run a remote Python kernel in a notebook displayed in a browser (not necessarily Google Chrome), with some advantages. It requires neither installation nor configuration, and it gives you an easy way to share content. In Colaboratory, you work with documents called **Colab notebooks**. Though they are not exactly the same, Colab notebooks are pretty similar to Jupyter notebooks, and they are usually stored in Google Drive as files with extension `ipynb`.

You can access a Colab notebook from any device connected to the Internet, such as an iPad. The only thing you need to start working with Colab notebooks is a Google account, typically a `gmail.com` address and its password. You will also have to install Colaboratory in your drive. To do this, open the *My Drive* page, click on the *Settings* button, select *Settings >> Manage apps* and click on *Connect more apps*.

## Typing in a notebook

In these notes, we assume that you are working in a notebook. Nevertheless, most of the comments are valid for a console.

The notebook is a sequence of **cells**. There are two types of cells, the Markdown cells and the code cells. In the **Markdown cells**, you write comments, while in the **code cells** you write the Python commands. The content of each cell makes a single **input**, either Markdown (Markdown is a language for formatting text in a webpage) or Python.

When a cell is selected, you can enter that cell to edit it by pressing *Return*. Conversely, when you are inside the cell, you go out by pressing *Esc*. When you are inside the cell, pressing *Return* starts a new line *within the same cell*. In both situations, inside and outside, with *Shift+Return*, you execute the input, passing to the following cell.

The input in a code cell can get the corresponding **output**, an **error message** or no answer at all. Error messages are typically long and unfriendly. A simple example follows. Note the white space in the input, around the *plus* sign (`+`), which is ignored by the Python kernel, but improves the **readability** of our code.

In [None]:
2 + 2

So, if you enter `2 + 2`, the output will be the result of this calculation. But, if you want to store this result for later use (in the same session), you will enter it with a name, as follows:

In [None]:
a = 2 + 2

This creates the **variable** `a`. Note that the value of `2 + 2` is not outputted now. But you can call it:

In [None]:
a

In Python, when you assign a value to a variable which has already been created, the previous assignment is forgotten:

In [None]:
a = 7 - 2

In [None]:
a

You can also input several code lines at once. In that case, the output will be the output for the last line of the input. An example follows.

In [None]:
b = 2 * 3
b - 1
b**2

*Note*. You would probably have written `b^2` for the square of `b`, but the caret symbol (`^`) plays a different role in Python, where it denotes the bitwise exclusive OR.
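
The difference between the two operators can be checked with a short sketch. The variable names `power` and `xor` are just illustrative.

```python
b = 2 * 3

power = b ** 2   # exponentiation: 6 squared
xor = b ^ 2      # bitwise exclusive OR of the binary forms 110 and 010

print(power, xor)
```

Since the caret is a valid operator, `b^2` raises no error, so this mistake can go unnoticed.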

## Python packages

Additional Python resources come in **packages** and modules. For instance, suppose that you try a bit of math, calculating the square root of 2. You will then **import** the module `math`, whose resources include the square root and many other mathematical functions. Once the module has been imported, all its functions are available, so you can call the function as `math.sqrt()`. This notation indicates that `sqrt()` is a function of the module `math`.

So, the square root calculation is carried out as:

In [None]:
import math
math.sqrt(2)

Alternatively, you can import only the functions that you plan to use:

In [None]:
from math import sqrt
sqrt(2)

## Data types

As in other languages, data can have different **data types** in Python. The data type can be learned with the function `type()`. Let us start with the numeric types, which are three: **integer** (`int`), **floating-point** (`float`) and **complex** (`complex`). In addition, we have **Booleans** (`bool`), which are a subtype of integers.

Note that, in Python, integers are not, as in the mathematics textbook, a subset of the real numbers, but a different type:

In [None]:
type(2)

In [None]:
type(2.0)
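
The other two types mentioned above can be explored in the same way. This is a minimal sketch, in which the variable `z` and the built-in function `isinstance()` are additions not used elsewhere in these notes.

```python
z = 1 + 2j                     # the suffix j marks the imaginary part
print(type(z))

# Booleans are a subtype of integers
print(isinstance(True, int))

# An integer and a float can be equal in value, although their types differ
print(2 == 2.0)
```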

The value of a Boolean variable is either `True` or `False`:

In [None]:
d = (5 < a)
d

In [None]:
type(d)

When you enter an expression involving a comparison such as `5 < a`, the Python interpreter evaluates it, returning either `True` or `False`.  Here, we have defined a variable by means of such an expression, so we got a Boolean variable. Warning: as a comparison operator, equality is denoted by two equal signs. This may surprise you.

In [None]:
a == 4

Boolean variables can be converted to `int` and `float` type explicitly, with the functions `int()` and `float()`, but also implicitly, by a mathematical function or operator:

In [None]:
math.sqrt(d)

In [None]:
1 - d
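
A short sketch of the explicit conversion, using the built-in functions `int()` and `float()`. The Boolean is defined here from literal values, so that the cell does not depend on the previous ones.

```python
flag = (5 < 7)            # a Boolean with value True

as_int = int(flag)        # explicit conversion to int
as_float = float(flag)    # explicit conversion to float
total = flag + flag       # arithmetic treats True as 1

print(as_int, as_float, total)
```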

Besides numbers, we can also manage **strings** with type `str`:

In [None]:
c = 'Lamine Yamal'
type(c)

The quote marks indicate type `str`. You can use single or double quotes, but take care to use the same on both sides of the string.

## Functions

A **function** takes a collection of **arguments** and performs an action. Let me present a couple of examples of value-returning functions. They are easily distinguished from other functions, because the definition typically ends with a `return` statement.

A first example follows. Note the **indentation** after the colon, which is created automatically by the notebook.

In [None]:
def f(x):
    y = 1/(1 - x**2)
    return y

When we define a function, Python just takes note of the definition, accepting it when it is syntactically correct (parentheses, commas, etc). The function can be applied later to different arguments.

In [None]:
f(2)

If we apply the function to an argument for which it does not make sense, Python will return an error message which depends on the values supplied for the argument.

In [None]:
f('Mary')
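
To see how the error depends on the argument, the following sketch applies the same function to three values and records either the result or the name of the error. The `try`/`except` construct is not covered in these notes, and it is used here only to collect the three outcomes in a single cell.

```python
def f(x):
    y = 1/(1 - x**2)
    return y

outcomes = []
for arg in [2, 1, 'Mary']:
    try:
        # A valid argument gives a number
        outcomes.append(f(arg))
    except (ZeroDivisionError, TypeError) as err:
        # An invalid argument gives an error, whose type we keep
        outcomes.append(type(err).__name__)

print(outcomes)
```

The argument 1 makes the denominator zero, while the string cannot be squared, so the two errors are of different types.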

Functions can have more than one argument, as in:

In [None]:
def g(x, y):
    return x*y/(x**2 + y**2)
g(1, 1)

*Note*. We follow in this course a common practice in Python learning materials, writing functions as `func()`. The parentheses remind you that this is an object that takes arguments.

## Lists, ranges and dictionaries

Python has several types for objects that work as **data containers**. The most versatile is the **list**, which is represented as a sequence of comma-separated values inside square brackets. The items of a list are ordered, and can be repeated. Also, a list can contain items of different types.

A simple example of a list, of length 4, follows.

In [None]:
mylist = ['Vinicius', 'Messi', 'Yamal', 'Mbappé']

In [None]:
len(mylist)

Lists can be concatenated in a very simple way in Python:

In [None]:
newlist = mylist + [2, 3]
newlist

Now, the length of `newlist` is 6:

In [None]:
len(newlist)

The first item of `mylist` can be extracted as `mylist[0]`, the second item as `mylist[1]`, etc. The last item can be extracted either as `mylist[3]` or as `mylist[-1]`. Sublists can be extracted by using a colon inside the brackets, as in:

In [None]:
mylist[0:2]

Note that `0:2` includes `0` but not `2`. This is a general rule for indexing in Python. Other examples follow.

In [None]:
mylist[2:]

In [None]:
mylist[:3]
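
Negative indexes can also be used in the extraction of sublists. A short sketch, with illustrative variable names:

```python
mylist = ['Vinicius', 'Messi', 'Yamal', 'Mbappé']

last = mylist[-1]        # negative indexes count from the end
last_two = mylist[-2:]   # from the second-to-last item to the end
middle = mylist[1:3]     # items 1 and 2, since the end index 3 is excluded

print(last, last_two, middle)
```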

A **range** is a sequence of integers which in many aspects works as a list, but the terms of the sequence are not saved as in a list. Instead, only the procedure to create the sequence is saved. Example:

In [None]:
myrange = range(0, 10)
list(myrange)

Note that the items of a range are not displayed when the range itself is printed. So, we have converted the range to a list with the function `list()`. Note that `range(start, end)` contains `start` but not `end`. If `start` is omitted, it is assumed to be zero.
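
These rules can be checked with a short sketch. The last line illustrates that a range stores the procedure and not the terms, so a very long range can be created at no cost.

```python
short = list(range(5))        # start omitted, so it is taken as 0
shifted = list(range(3, 7))   # contains 3 but not 7
big = range(10**9)            # no list of a billion integers is built

print(short, shifted, len(big))
```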

A **dictionary** is a set of **key/value pairs**. For instance, the following dictionary contains three features of an individual:

In [None]:
my_dict = {'name': 'Joan', 'gender': 'F', 'age': 32}

In the dictionary, a value is not extracted using an index which indicates its order in a sequence, as in the list, but using the corresponding key:

In [None]:
my_dict['name']
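
Keys are also used to modify a dictionary. In the following sketch, the key `'city'` and its value are made up for illustration.

```python
my_dict = {'name': 'Joan', 'gender': 'F', 'age': 32}

my_dict['age'] = 33             # an existing key gets a new value
my_dict['city'] = 'Barcelona'   # a new key adds a new pair

keys = list(my_dict.keys())     # the keys, in insertion order
print(keys, my_dict['age'])
```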

## NumPy arrays

In mathematics, a **vector** is a sequence of numbers, and a **matrix** is a rectangular arrangement of numbers. Operations with vectors and matrices are the subject of a branch of mathematics called linear algebra. In Python (and in many other languages), vectors are called one-dimensional (1D) **arrays**, while matrices are called two-dimensional (2D) arrays. Arrays of more than two dimensions can be managed in Python without pain.

Python arrays are not necessarily numeric. Indeed, vectors of dates and strings appear frequently in data science. In principle, all the terms of an ordinary array must have the same type, so the array itself can have a type, though you can relax this constraint using mixed types (not covered by this course). Arrays were already implemented in plain Python, but the functionality of the Python arrays was enlarged in **NumPy**, intended to be the fundamental library for scientific computing in Python.

The usual way to import NumPy is:

In [None]:
import numpy as np

A 1D array can be created from a list with the NumPy function `array()`. If the items of the list have different types, they are converted to a common type when creating the array. A simple example follows.

In [None]:
arr1 = np.array([2, 7, 14, 5, 9])
arr1
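
The conversion to a common type can be inspected with the attribute `.dtype`, which is not discussed in these notes. A minimal sketch, with made-up lists:

```python
import numpy as np

ints = np.array([2, 7, 14])          # all integers: integer array
floats = np.array([2, 7.5, 14])      # the integers are converted to float
strings = np.array([2, 'seven'])     # everything is converted to string

print(ints.dtype, floats.dtype, strings.dtype)
```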

A 2D array can be directly created from a list of lists of equal length. The terms are entered row-by-row:

In [None]:
arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])
arr2

Although we visualize a vector as a column (or as a row) and a matrix as a rectangular arrangement, with rows and columns, it is not so in the computer. The 1D array is just a sequence of elements of the same type, neither horizontal nor vertical. It has one **axis**, which is the 0-axis.

In a similar way, a 2D array is a sequence of 1D arrays of the same length and type. It has two axes. When we visualize it as rows and columns, `axis=0` runs down the rows (vertically), while `axis=1` runs across the columns (horizontally).
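
The role of the axes is easier to see in a calculation. The following sketch uses the array method `.sum()`, which is not covered in these notes, on the 2D array of the example above.

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

col_sums = arr2.sum(axis=0)   # adds along axis 0: one sum per column
row_sums = arr2.sum(axis=1)   # adds along axis 1: one sum per row

print(col_sums, row_sums)
```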

The number of terms stored along an axis is the **dimension** of that axis. The dimensions are collected in the attribute `.shape`.

In [None]:
arr1.shape

In [None]:
arr2.shape

**Subsetting** a 1D array is done as for a list:

In [None]:
arr1[:3]

The same applies to 2D arrays, but we need two indexes within the square brackets. The first index selects the rows (`axis=0`), and the second index the columns (`axis=1`):

In [None]:
arr2[:1, 1:]

When a comparison involving an array is evaluated by the Python kernel, the comparison is applied term by term and a Boolean array with the same shape is returned:

In [None]:
arr2 > 2
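
Such a Boolean array can be used inside the square brackets to select the terms for which the condition holds. A minimal sketch, in which the name `mask` is just illustrative:

```python
import numpy as np

arr2 = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

mask = arr2 > 2
selected = arr2[mask]   # 1D array with the terms where the mask is True
count = mask.sum()      # True counts as 1, so this is the number of hits

print(selected, count)
```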

## The package Pandas

**Pandas** provides a wide range of data wrangling tools. It is typically imported as

In [None]:
import pandas as pd

There are two data container classes in Pandas, the series (one-dimensional) and the data frames (two-dimensional). A **series** can be seen as the combination of a 1D array containing the **values** and a list containing the names of the values, called the **index**. These components can be extracted as the attributes `.values` and `.index`.

A **data frame** can be seen as formed by one or several series with the same index (hence, with the same length). It can also be seen as a table for which the index provides the row names. In a Pandas data frame, each column has its own data type. The numeric types work as usual, but Pandas uses the data type `object` for many things, in particular for strings.

## Pandas series

Although we rarely do it in data science, a Pandas series can be created directly, for instance from an array, with the Pandas function `Series()`:

In [None]:
s1 = pd.Series(arr1)
s1

Now, the values of the series are extracted as:

In [None]:
s1.values

As shown above, when a series is printed, the index appears on the left. Since the index of `s1` has not been specified, a range of consecutive integers has been assigned as the index.

In [None]:
s1.index

Instead of an array, a list can be used to provide the values of a series. In the list, the items can have different types, but Pandas converts them to a common type, as shown in the following example. Here, instead of letting Pandas create an index automatically, as a `RangeIndex`, we specify an index directly:

In [None]:
s2 = pd.Series([1, 5, 'Messi'], index = ['a', 'b', 'c'])
s2

Indexes are useful for combining, filtering and joining data sets. There are many types of indexes, which allow for specific operations. So, don't look at the index as a nuisance, which is what it seems at first sight, but as a tool for data management.
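
A small sketch of the index at work: when two series are added, the values are matched by index label, not by position. The series `x` and `y` are made up for illustration.

```python
import pandas as pd

x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([10, 20], index=['b', 'a'])   # same labels, different order

total = x + y   # 'a' is added to 'a', and 'b' to 'b'
print(total)
```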

## Pandas data frames

A Pandas **data frame** can be seen as a collection of series with the same index (hence, with the same length). Data frames can be built in many ways with the Pandas function `DataFrame()`, for instance from a dictionary of vector-like objects of the same length, as in

In [None]:
df = pd.DataFrame({'v1': range(5), 'v2': ['a', 'b', 'c', 'd', 'e'], 'v3': np.repeat(-1.3, 5)})
df

Like the series, the data frames have the attributes `.values` and `.index`. Without an explicit specification, the index is automatically created as a `RangeIndex`. The third component of the data frame is a list with the column names, which can be extracted as the attribute `.columns`:

In [None]:
df.columns

A data frame has the same shape as its array of values. Having rows and columns, a data frame looks like a 2D array with row and column names. But not all data frames are so simple. While a NumPy 2D array has a single data type, in a Pandas data frame every column has its own data type.
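
Both points can be checked on the data frame of the example. The sketch below rebuilds it, so that the cell is self-contained, and uses the attribute `.dtypes`, which is not discussed in these notes, to list the data type of every column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(5), 'v2': ['a', 'b', 'c', 'd', 'e'], 'v3': np.repeat(-1.3, 5)})

# The data frame and its array of values have the same shape
print(df.shape, df.values.shape)

# One data type per column
print(df.dtypes)
```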


## Exploring Pandas objects

The methods `.head()` and `.tail()` extract the first and the last rows of a data frame, respectively. The default number of rows extracted is 5, but you can pass a custom number.

In [None]:
df.head(2)

The content of a data frame can also be explored with the **method** `.info()`. It reports the dimensions, the data type and the number of non-missing values of every column of the data frame. Note that the data type of the second column, for which you would have expected `str`, is reported as `object`.

In [None]:
df.info()

*Note*. A method is a function that belongs to a certain type of object. So, there are string methods, list methods, etc.

The method `.describe()` extracts a conventional statistical summary of a Pandas object. The string columns are omitted, except when all the columns have that type. Then the report contains the count, the number of unique values, the most frequent value and its frequency.

In [None]:
df.describe()
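
The two kinds of report can be compared in a short sketch. It uses a reduced version of the data frame of the example, and the names `num_summary` and `str_summary` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'v1': range(5), 'v2': ['a', 'b', 'c', 'd', 'e']})

num_summary = df.describe()           # the string column v2 is omitted
str_summary = df[['v2']].describe()   # summary of a subframe with only strings

print(num_summary)
print(str_summary)
```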

## Subsetting data frames

Pandas offers multiple ways for subsetting data frames. First, you can extract a column, as a series:

In [None]:
df['v2']

Note that the syntax is the same as for extracting the value of a key from a dictionary (not by chance). You can also extract a subframe containing a subset of complete columns from a data frame. You can specify this with a list containing the names of those columns:

In [None]:
print(df[['v1', 'v2']])

*Note*. You can extract a subframe with a single column. Beware that this is not the same as a series. `df['v2']` is a series with shape `(5,)`, and `df[['v2']]` is a data frame with shape `(5, 1)`.

In data science, rows are typically filtered by means of an **expression**. Example:

In [None]:
expr = df['v1'] > 2
df[expr]

Combining a row filter and a column selection:

In [None]:
print(df[df['v1'] > 2][['v1', 'v2']])

Besides this, there are two additional ways to carry out a selection, specifying rows and columns in one shot:

* **Selection by label** is specified by adding `.loc` after the name of the data frame. The selection of the rows is based on the index, and that of the columns is based on the column names.

* **Selection by position** uses `.iloc`. The selection of the rows is based on the row number and that of the columns on the column number.

In both cases, if you enter a single specification inside the brackets, it refers to the rows. If you enter two specifications, the first one refers to the rows and the second one to the columns. We don't use `.loc` and `.iloc` in this course, since you can work in Pandas without them, selecting first the rows and then the columns. Sticking to this simple approach saves you the time needed to learn additional methods.
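
For reference only, since the rest of the course does not rely on them, a minimal sketch of both selections follows. It rebuilds the data frame of the previous examples, so that the cell is self-contained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(5), 'v2': ['a', 'b', 'c', 'd', 'e'], 'v3': np.repeat(-1.3, 5)})

# By label: rows given by a Boolean filter, columns by name
by_label = df.loc[df['v1'] > 2, ['v1', 'v2']]

# By position: first two rows, first two columns
by_position = df.iloc[:2, :2]

print(by_label)
print(by_position)
```

The first selection returns the same subframe as `df[df['v1'] > 2][['v1', 'v2']]`.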

## Importing data from CSV files

Data sets in tabular form can be imported as Pandas data frames from many file formats. In particular, data from a CSV file can be imported to a data frame with the Pandas function `read_csv()`. The (default) syntax is

```
dfname = pd.read_csv(fname)
```

The data frame name is chosen by the user, and the file name has to contain the **path** of that file (either local or remote). `read_csv()` works the same way for CSV files and for **zipped CSV files**. Although the default syntax works in most cases satisfactorily, this function can be customized in order to meet various requirements.
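
To try `read_csv()` without a file at hand, the following sketch reads the data from a string through `io.StringIO`, which stands in for a file. The data are made up for illustration. The second call shows one possible customization, the parameter `sep`, for a file in which the separator is a semicolon.

```python
import io
import pandas as pd

# CSV content with a header row and comma-separated values
csv_text = "name,age\nJoan,32\nPere,41\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Same data, with semicolons as separators
semi_text = "name;age\nJoan;32\nPere;41\n"
df_semi = pd.read_csv(io.StringIO(semi_text), sep=';')

print(df_csv)
```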

## Summary statistics

The method `.describe()` extracts a conventional statistical summary. Columns of type `object` are omitted, except when all the columns have that type. Then the report contains the count, the number of unique values, the most frequent value and its frequency.

Basic statistics can also be calculated separately. For instance, the method `.mean()` returns the column means of a data frame. Correlations are also pretty easy:

* For a pair of series, `s1` and `s2`, `s1.corr(s2)` returns the **correlation** of `s1` and `s2`.

* For a data frame `df`, `df.corr()` returns the **correlation matrix** for the columns of `df`.
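
A minimal sketch of these calculations, on a made-up data frame with three numeric columns. The column `y` is twice the column `x`, and the column `z` decreases as `x` increases, so the correlations are easy to anticipate.

```python
import pandas as pd

df_num = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]})

means = df_num.mean()                   # a series with the column means
r_xy = df_num['x'].corr(df_num['y'])    # correlation of two series
corr_matrix = df_num.corr()             # correlation matrix, as a data frame

print(means)
print(r_xy)
print(corr_matrix)
```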

## Plotting

We typically visualize the data with bar plots, histograms, scatter plots and line plots. They can be obtained directly from a Pandas object. Suppose first that `df` is a Pandas data frame and set `cname1` as the *x*-column and `cname2` as the *y*-column (numeric). To explore the dependence of the *y*-column on the *x*-column, we use a **bar plot**, when the *x*-column is categorical, and a **scatter plot**, when it is numeric:

* `df.plot.bar(x=cname1, y=cname2)` returns a bar plot. The bars represent the values of the column `cname2` for the different values of the column `cname1`. If you do not specify the *x*-column, the index is used instead.

* `df.plot.scatter(x=cname1, y=cname2)` returns a scatter plot.

Suppose now that `s` is a numeric Pandas series. To explore the distribution of `s`, you use a **histogram**. Alternatively, to explore a trend, you use a **line plot**. This is pretty easy in Pandas:

* `s.plot.hist()` returns a histogram.

* `s.plot.line()` returns a line plot.

Pandas uses the functions of the package Matplotlib, but not explicitly. If you are satisfied with a basic functionality, you can skip Matplotlib in your code.
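
As a minimal sketch of the data frame plots, the following cell draws a bar plot and a scatter plot from a small made-up data frame. The column names `team`, `goals` and `shots` are hypothetical. Matplotlib has to be installed, although it is not imported explicitly.

```python
import pandas as pd

df_plot = pd.DataFrame({'team': ['A', 'B', 'C'], 'goals': [3, 1, 2], 'shots': [10, 4, 7]})

# Bar plot: one bar per team, with height given by the goals
ax_bar = df_plot.plot.bar(x='team', y='goals')

# Scatter plot: one point per team
ax_scatter = df_plot.plot.scatter(x='shots', y='goals')
```

In a notebook, it is better to draw the plots of a series, such as `df_plot['goals'].plot.hist()`, in a separate cell, because a series plot is drawn on the axes that are active at that moment.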
