# Module 1. Samples

Most Python scripts (or Jupyter Notebooks) for analyzing data use the same libraries. The most important ones are:

- `numpy` - multidimensional arrays, linear algebra, etc.
- `scipy` - mathematics, science, engineering etc.
- `pandas` - data analysis and manipulation
- `matplotlib`, `seaborn`, `altair` - data visualisation

You can install these, together with `statsmodels` (imported below for mosaic plots), using `pip`:

```console
> pip install numpy scipy pandas matplotlib seaborn altair statsmodels
```

Since you usually need the same packages every time, it is best to import them at the top of every script you write. By convention, package names are usually abbreviated on import, e.g. `np` for `numpy`, `sns` for `seaborn`, etc.

In [None]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation
import altair as alt                                # Alternative visualisation system

## Opening a Dataset, General Information

You can read a dataset from a variety of sources (Rajagopalan, 2021, p. 158), for example by specifying a path to a local file or a URL:

In [None]:
# Importing the Titanic dataset. (Rajagopalan, 2021, p. 106)
titanic = pd.read_csv('https://raw.githubusercontent.com/DataRepo2019/Data-files/master/titanic.csv')
# Show the first few records of the Data Frame
titanic.head()
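
Besides paths and URLs, `pd.read_csv` also accepts file-like objects. The sketch below is a minimal illustration that reads a small, made-up CSV from a string, so it runs without network access. The column values are hypothetical and are not taken from the real Titanic dataset.

```python
import io

import pandas as pd

# Hypothetical miniature CSV with a few Titanic-like columns (made-up values)
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,26
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# Show the first two records
print(sample.head(2))
```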

In [None]:
# How many rows does the DataFrame have?
print(f"Number of rows: {len(titanic)}")
# How many columns?
print(f"Number of columns: {len(titanic.columns)}")
# How many rows and columns, i.e. the shape
print(f"The shape of the Data Frame is: {titanic.shape}")
# General information about the DataFrame
print("*"*50)
titanic.info()

# Give the data type of each column.
print("*"*50)
print(titanic.dtypes)

# How many columns of each data type are there?
#   Note: the book says to use get_dtype_counts(), but that method has been removed from pandas
print("*"*50)
print(titanic.dtypes.value_counts())

## Indices

The column `PassengerId` is not an actual variable, but contains a number that identifies each observation. You can mark this column as the index:

In [None]:
# set_index returns a new DataFrame; assign the result (titanic = ...) to keep the index
titanic.set_index(['PassengerId'])
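
Note that `set_index` does not modify the DataFrame it is called on. The following sketch, using a small made-up DataFrame named `toy`, shows that the original keeps its column while the returned copy uses it as the index for label-based lookups.

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic records
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
})

# set_index returns a new DataFrame; toy itself is left unchanged
indexed = toy.set_index("PassengerId")

print(toy.columns.tolist())    # PassengerId is still a regular column here
print(indexed.index.name)      # the new DataFrame is indexed by PassengerId
print(indexed.loc[2, "Name"])  # look up a row by its PassengerId label
```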

## Qualitative variables

Some of the variables, such as `Survived` and `Pclass`, are incorrectly treated as quantitative because they are stored as numbers. You can correct this by explicitly converting them to **qualitative** (categorical) variables:

In [None]:
# Describe the variable Survived -> is considered to be quantitative
print(titanic.Survived.describe())
# Convert to a categorical variable
titanic.Survived = titanic.Survived.astype('category')
# Ask to describe once more -> now it is considered to be qualitative
print(titanic.Survived.describe())
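
To make the difference concrete, here is a minimal sketch on a made-up Series (the values are hypothetical). It compares which statistics `describe()` reports before and after the conversion.

```python
import pandas as pd

# Hypothetical 0/1 outcomes, stored as integers
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = survived.describe()

# Categorical summary: count, unique, top (most frequent value), freq
categorical_summary = survived.astype("category").describe()

print(list(numeric_summary.index))
print(list(categorical_summary.index))
```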

You can also mark variables as **ordinal**, that is, with an ordering. As an example, we will do this with the variable `Embarked` and order the ports in the order of departure. The Titanic departed from Southampton, and then picked up passengers first at Cherbourg and then at Queenstown.

For cases like this, define your own datatype specifying the order:

In [None]:
print(titanic.Embarked.unique())

embarked_type = CategoricalDtype(categories=['S', 'C', 'Q'], ordered=True)
titanic.Embarked = titanic.Embarked.astype(embarked_type)
titanic.Embarked.describe()
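
An ordered categorical type affects sorting, `min`/`max`, and comparisons. The sketch below illustrates this on a short, made-up Series. The names `port_type` and `ports` are chosen here for illustration only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same category order as above: Southampton, Cherbourg, Queenstown
port_type = CategoricalDtype(categories=["S", "C", "Q"], ordered=True)

# Hypothetical values
ports = pd.Series(["Q", "S", "C", "S"]).astype(port_type)

# Sorting follows the category order, not the alphabet
sorted_ports = ports.sort_values().tolist()

# min/max and comparisons are only defined because ordered=True
first_port = ports.min()
before_queenstown = (ports < "Q").tolist()

print(sorted_ports)
print(first_port)
print(before_queenstown)
```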

This order will then always be respected, e.g. in graphs:

In [None]:
sns.countplot(data=titanic, x='Embarked');

## Selecting Data

In [None]:
# Select all observations for a single variable (i.e. a DataFrame column)
titanic.Age
# This also works (and is preferable, as it also works when the column name contains a space):
# titanic['Age']
# This also works, but isn't very nice
# titanic.loc[:, 'Age']

In [None]:
# Select adjacent columns
titanic.iloc[:, 2:4]

You can also select multiple columns based on their names.
This is often clearer than selecting based on position, and the columns do not
need to be adjacent.

In [None]:
titanic[['Name', 'Age', 'Cabin']] # Note: two sets of square brackets!
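
The double brackets matter because the inner pair is an ordinary Python list of column names. As a small sketch on made-up data: a single name returns a Series, while a list of names returns a DataFrame whose columns follow the order of the list.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Age": [22, 38],
    "Cabin": ["C85", None],
})

one_column = toy["Age"]          # single name -> Series
as_frame = toy[["Age"]]          # list with one name -> DataFrame
subset = toy[["Cabin", "Name"]]  # column order follows the list

print(type(one_column).__name__)
print(type(as_frame).__name__)
print(subset.columns.tolist())
```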

In [None]:
# Observation with row number 5 (counting from zero)
print(titanic.iloc[5])

# The first 4 observations
titanic.iloc[0:4]
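
`iloc` selects by position, whereas `loc` selects by index label. The two only differ visibly when the index is not the default 0, 1, 2, ... numbering. The following sketch uses a made-up DataFrame with hypothetical index labels to show the difference, including the different treatment of the slice end.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels
toy = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22, 38, 26, 35]},
    index=[10, 20, 30, 40],
)

# Positional slice: rows at positions 0 and 1 (end position excluded)
by_position = toy.iloc[0:2]

# Label slice: rows labelled 10 up to and including 30
by_label = toy.loc[10:30]

print(by_position.index.tolist())
print(by_label.index.tolist())
```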

In [None]:
# Select observations where the value of Age is less than 18
titanic[titanic.Age < 18]  

# The same, but only keep the column 'Embarked'
titanic[titanic.Age < 18].Embarked

# The same, but keep columns 'Age' and 'Embarked'
titanic[titanic['Age'] < 18][['Age', 'Embarked']]
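
The expression inside the brackets is a boolean Series (a mask) that can be stored, combined with `&` or `|`, and passed to `loc` together with a list of columns. The sketch below shows one possible way to do this on made-up data. The names `is_minor`, `minors`, and `female_minors` are illustrative.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [8, 30, 40, 15],
    "Embarked": ["S", "C", "Q", "S"],
})

# A boolean mask: True for every row where Age is below 18
is_minor = toy["Age"] < 18

# Select rows and columns in a single step with loc
minors = toy.loc[is_minor, ["Age", "Embarked"]]

# Combine conditions with & (and); each condition needs its own parentheses
female_minors = toy.loc[is_minor & (toy["Sex"] == "female")]

print(minors)
print(female_minors)
```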

In [None]:
# Select all boys younger than 18
titanic.query("(Sex=='male') and (Age < 18)")
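
A `query` string is an alternative notation for a boolean mask. As a minimal sketch on made-up data, the example below checks that both forms select the same rows. It also uses the `@` prefix to refer to a Python variable inside the query string. The variable name `max_age` is chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [8, 30, 40, 15],
})

max_age = 18

# Query string; @max_age refers to the Python variable defined above
with_query = toy.query("(Sex == 'male') and (Age < @max_age)")

# Equivalent boolean mask
with_mask = toy[(toy["Sex"] == "male") & (toy["Age"] < max_age)]

print(with_query)
print(with_query.equals(with_mask))
```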