# 2.3 Lab: Introduction to Python

This notebook replaces the lab session in Section 2.3 of the book and teaches you how to use the corresponding methods in Python. The lab has the following sections:
1. Basic Commands
2. Graphics
3. Indexing Data
4. Loading Data
5. Additional Graphical and Numerical Summaries

Before getting started, we need access to the data. To get it, we clone a GitLab repository containing the data. This makes the data available in this Colab in the folder `data`.

Run the code block below to perform this action. (**NB** make sure to run all the code blocks in the notebook when they appear.)

In [None]:
!git clone https://git.wur.nl/koots006/msc-course-machine-learning.git data

We will use a number of libraries in this lab session. A brief description of each is given as a comment in the code.

**DO:**
* Read the description to understand the purpose of using the library

In [1]:
# Import libraries
import numpy as np                      # Numpy offers several math functions and methods to deal with arrays and matrices
import pandas as pd                     # Pandas is the weapon of choice to manage data. It has good formats to store data in tables and to process the data
import math                             # Some more math functions
import matplotlib.pyplot as plt         # With matplotlib, we can make the plots and graphs
%matplotlib inline


## 2.3.1 Basic Commands
Python uses functions to perform operations. To run a function called funcname, we type `funcname(input1, input2)`, where the inputs (or arguments) `input1` and `input2` tell Python how to run the function. A function can have any number of inputs. 
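
As a small illustration of calling functions, the sketch below passes inputs by position and by name. The numbers are arbitrary and chosen only for this example.

```python
import numpy as np

# Positional arguments: the order determines which input is which
a = np.power(2, 3)                  # 2 raised to the power 3

# Keyword argument: the input is passed by its name
b = np.round(3.14159, decimals=2)   # round to 2 decimals

print(a, b)
```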

The numpy library has several functions to deal with data. The function `np.array(...)` can be used to create a numpy array, which stores an array of numbers.
The command in the next code block instructs Python to create a list with the numbers 1, 3, 2, and 5, and to save them as an array named `x`. When we type `x`, it gives us back the vector.

In [None]:
x = np.array([1, 3, 2, 5])  # Create array/vector
# A bare expression such as `x` is displayed only when it is the last line of a cell; otherwise use print(x)
x


In [None]:
x = np.array([1, 6, 2])
print('x contains:', x)
y = np.array([1, 4, 3])
print('y contains:', y)

You can get the size of the array using `len()` or by using `x.shape`:

In [None]:
print('The length of the array is:', len(x)  )
print('The shape of the array is:',  x.shape )

Using `help(function_name)` returns additional information about a function: its input parameters and their meaning, the function output, and usually some examples that you can try out to get a better understanding of how the function works. <br/>
**Exercise:**
* Run the code below to figure out what the function `np.sum` does, which parameters it uses and what it returns

In [None]:
help(np.sum)

**Exercise:**
* Complete the code below to use the function `np.sum(x)` to sum all the elements in the array `x`:

In [None]:
x = np.array([1, 6, 2])
sum_x = ..
print('The sum of the elements in x is:', sum_x)

We can tell Python to add two numpy arrays of numbers together using `x + y`. It will then add the first number from the array `x` to the first number from the array `y`, and so on. However, `x` and `y` should be the same length. We can check their length using the `len()` function.

**Exercise**
* Run the code below
* Now add one more element to the array `y` and run the code again. What happens?
* Fix the problem by making the arrays of the same length again

In [None]:
x = np.array([1, 6, 2])
y = np.array([1, 4, 3])

print("Length of x: ", len(x))
print("Length of y: ", len(y))

x + y 

The `np.matrix()` function can be used to create a matrix of numbers. A matrix is basically a two-dimensional array, or an array-of-arrays.
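
As a side note, the NumPy documentation no longer recommends the `np.matrix` class and suggests regular two-dimensional arrays instead. The sketch below, using small made-up values, shows the main practical difference: with `np.matrix` the `*` operator is a matrix product, while with a plain 2D array it works element by element and `@` gives the matrix product.

```python
import numpy as np

M = np.matrix([[1, 2], [3, 4]])   # matrix class, as used in this lab
B = np.array([[1, 2], [3, 4]])    # plain two-dimensional array

print(M * M)   # matrix product
print(B * B)   # element-wise product
print(B @ B)   # matrix product for plain arrays
```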

**Exercise:**
* Run the code below to construct a matrix with 2 rows and 3 columns, with the values 1-6 in the cells of the matrix.
* Change the code to create a matrix with 3 rows and 2 columns instead. 

In [None]:
x = np.matrix([[1,2,3], [4,5,6]])

print(x)

You can get the shape of the array using `x.shape`. For a 2D matrix, this gives a tuple containing height and width of the matrix:

In [None]:
print('The shape of the matrix is:',  x.shape )
print('Number of rows:', x.shape[0])
print('Number of columns:', x.shape[1])


Sometimes, you want to **transpose** a matrix.

**Exercise:**
* Run the code below
* If you forgot what the transpose of a matrix is: look at the two matrices and observe how `x` is changed into `x_t`.

In [None]:
x_t = x.transpose()

print('x:\n', x, end='\n\n')
print('x_t:\n', x_t)

With both numpy-arrays and numpy-matrices, we can perform some mathematical operations on the elements in the array/matrix.

**Exercise:**
* Run the code below, study the code and look at the results to understand what the functions `np.sqrt()`, `np.power()`, and `np.round()` do. Remember that you can use `help(...)` if you need more info on the functions.


In [None]:
x = np.random.randint(0,10,size=(2,3))
print('x: \n', x, end='\n\n')

# Sqrt: np.sqrt()
print("sqrt(x): \n", np.sqrt(x), end='\n\n') 

# Round: np.round() applied to the square roots
print("sqrt(x): \n", np.round(np.sqrt(x), decimals=2), end='\n\n') 

# Power: np.power()
print("x^2: \n", np.power(x, 2), end='\n\n')

The `np.random.normal()` function generates an array of **random values** drawn from a normal distribution. With the argument `size` the sample size can be set.  Each time we call this function, we will get a different answer. 

**Exercise:**
* Run the code several times to see that you indeed get different random numbers every time.


In [None]:
x = np.random.normal(size=6)
print(x)

Sometimes, you want to be able to reproduce results exactly, even when you use random numbers. In that case, you can set a so-called **seed**. This makes sure that the same random numbers are picked every time.

**Exercise:**
* Run the code below a number of times to see the results
* Change the seed from 123 to a different value. What do you see now?

In [None]:
np.random.seed(123)
x = np.random.normal(size=6)
print(x)

We use `np.random.seed()` throughout the labs whenever we perform calculations involving random quantities. In general this should allow the user to reproduce our results. However,  as new versions of `Python` become available,  small discrepancies may arise between this book and the output from `Python`.

**NB**: Be aware when using a seed that the results might be very specific to the random values that were chosen. Especially when the set of random values is small, it is wise to run the code several times with different random sequences (so remove the seed) to see if the results are similar for different runs.
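
`np.random.seed()` sets one global seed for the whole notebook. NumPy also offers generator objects, created with `np.random.default_rng(seed)`, that keep the random state local. The sketch below is an illustration and not part of the original lab; the variable names are arbitrary, and the numbers it produces differ from those obtained with `np.random.seed(123)`.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng1 = np.random.default_rng(123)
rng2 = np.random.default_rng(123)

a = rng1.normal(size=6)
b = rng2.normal(size=6)
print(np.array_equal(a, b))   # identical draws

# Drawing again from the same generator continues the sequence with new values
c = rng1.normal(size=6)
print(np.array_equal(a, c))
```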

By default, `np.random.normal()` creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the arguments `loc` (for the mean) and `scale` (for the standard deviation).

**Exercise:**
* Study the code below and run it. Don't worry about how to make the plot at this moment. That will be explained later.
* Change the mean and the standard deviation for `y` and observe the new histogram

In [None]:
x = np.random.normal(size=1000)
y = np.random.normal(loc=6, scale=0.5, size=1000)

# Plot a histogram of the random values 
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.hist(x)
plt.xlim(-5,10)
plt.subplot(1,2,2)
plt.hist(y)
plt.xlim(-5,10)
plt.show()

The functions `np.mean()` and `np.var()` compute the mean and variance of an array of numbers. `np.std()` gives the standard deviation.
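
One detail worth knowing: by default, `np.var()` and `np.std()` divide by the number of values `n` (the population formulas). Passing `ddof=1` divides by `n - 1` instead (the sample formulas), which is also the default of the pandas `.var()` and `.std()` methods. A minimal sketch with made-up numbers:

```python
import numpy as np

y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(y)              # divides by n (default, ddof=0)
sample_var = np.var(y, ddof=1)   # divides by n - 1

print("Population variance:", pop_var)
print("Sample variance:", sample_var)
```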

**Exercise:**
* Run the code
* Change the mean and the standard deviation in `np.random.normal()` and see if the calculated mean and standard deviation of the array approximate these values 
* Add a line of code to check if the standard deviation is equal to the square root of the variance.

In [None]:
np.random.seed(3)
y = np.random.normal(size=100)

# Compute mean of y
print("Mean of y: ", np.mean(y))

# Compute variance of y
print("Variance of y: ", np.var(y))

# And now using np.std()
print("Standard deviation of y: ", np.std(y))


Here we create two correlated sets of numbers, `x` and `y`, and use
the `pearsonr()` function to compute the correlation between
them. A plot is created to show the relation between the two variables.

In [None]:
from scipy.stats import pearsonr  # Statistical methods, specifically Pearson correlation

x = np.random.normal(size=100)
y = x + np.random.normal(size=100, loc=50, scale=0.1)

r = np.round(pearsonr(x, y)[0], 3) # pearsonr returns (1) correlation coefficient and (2) the p-value - which is not needed here.
print('The Pearson correlation coefficient:', r)


plt.plot(x,y, '.')
plt.show()

## 2.3.2 Graphics

The Matplotlib library includes many functions to make plots and it is the default way in Python to visualize data. The function `plt.plot()` is the primary way to plot data in `Python`. For instance, `plt.plot(x, y)` produces a plot of the numbers in `x` versus the numbers in `y`. There are many additional options that can be passed to the `plt.plot()` function. To find out more about the `plt.plot()` function, type `help(plt.plot)`.


In [None]:
# For documentation --> help(plt.plot)
x = np.random.normal(size=100)
y = np.random.normal(size=100)

plt.plot(x, y, '.') # Produces scatter plot 
plt.title("Using plt.plot")
plt.show()

Note that you should add `plt.show()` after the code that creates the plot. If you don't, the notebook also prints some additional output, namely the text representation of the object returned by the last plotting command. In particular, you need to use `plt.show()` if you want to create multiple plots in one code block.

**Exercise:**
* Run the code
* Put a comment (`#`) in front of `plt.show()` to see what happens if you don't use the show function

In [None]:
plt.plot(x, y, '.') # Produces scatter plot 
plt.title("Plot 1")
plt.show()

plt.plot(x, y+10, '.') # Produces scatter plot 
plt.title("Plot 2")
plt.show()

Matplotlib includes different functions to change the layout of the plot. Below, you can find an example. You can read more on the plot function on https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

**Exercise:**
* Run the code and answer the following questions
  * How do you create a new figure of a given size?
  * How do you create a number of subplots?
  * How can you define the shape and the color of a point?
  * How do you set a title for a (sub)plot?
  * How do you set labels on the axes?
  * How do you define the limits for each axis?
  * How to set grid lines?
  * How do you create a tight layout?
* Change:
  * Now change to a 2x2 layout of the subplots (was 1x4)
  * Remove the lines in the fourth subplot
  * Change the labels on the x and y axis in the 2nd plot
  * Change the markers in the first plot to triangles with an orange color

In [None]:
plt.figure(figsize=(12,3))   # Create a figure with a given size (width of 12, height of 3)

plt.subplot(1,4,1)           # Create subplots. In this case 1 row, 4 columns and we will now create the first subplot
plt.plot(x, y, 'r.')         # Create a plot with red dots 
plt.title("The first plot")  # Give a title to the (sub)plot
plt.xlabel("The x-axis")     # Put a label on the x axis
plt.ylabel("The y-axis")     # Put a label on the y axis
plt.xlim(-5,5)               # Set the lower and upper limits of the x axis
plt.ylim(-3,3)               # Set the lower and upper limits of the y axis

plt.subplot(1,4,2)           # Create the second subplot
plt.plot(x, y, 'bs')         # Create a plot with blue squares
plt.title("The second plot")  
plt.xlabel("The x-axis")     
plt.ylabel("The y-axis")     

plt.subplot(1,4,3)           # Create the third subplot
plt.plot(x, y, linestyle='none', marker='^', color='purple')    # Create a plot with purple triangles
plt.grid()
plt.title("The third plot")  
plt.xlabel("The x-axis")     
plt.ylabel("The y-axis")     

plt.subplot(1,4,4)           # Create the fourth subplot
plt.plot(x, y, linestyle='solid', marker='+', color='green')    # Create a plot with green + and connecting lines
plt.title("The fourth plot")  
plt.xlabel("The x-axis")     
plt.ylabel("The y-axis")     

plt.tight_layout()           # To make a nice layout of the four plots
plt.show()


The plot above is called a scatter plot. An alternative way to make this is by using `plt.scatter()`:

In [None]:
plt.scatter(x, y) # Produces scatter plot 
plt.title("Using plt.scatter")
plt.show()

Sometimes, we need to create a vector with a series of sequential numbers. There are multiple ways to do this:
* The function `np.arange(a,b)` can be used to create a sequence of integers from `a` up to, but not including, `b`.
* The function `np.linspace(a,b,num=n)` creates an array of `n` equally spaced sequential numbers from `a` to `b`.
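
The two functions treat the end value differently: `np.arange` stops before `b`, while `np.linspace` includes `b` by default. `np.arange` also accepts an optional third argument for the step size. A small sketch with arbitrary values:

```python
import numpy as np

a = np.arange(1, 6)              # integers 1, 2, 3, 4, 5 (6 is not included)
b = np.linspace(1, 5, num=5)     # 5 equally spaced floats from 1 to 5 (5 is included)
c = np.arange(0, 10, 2)          # step size 2: 0, 2, 4, 6, 8

print(a)
print(b)
print(c)
```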

In [None]:
x = np.arange(1, 11)
print(x)

x = np.linspace(-np.pi, np.pi, num=50)
print(x)

Here is an example how this can be used:

In [None]:
x = np.linspace(-np.pi, np.pi, num=50)
y = np.sin(x)

plt.plot(x,y,'r-')
plt.show()

Above, you have seen how to create 2D plots. In machine learning, we typically use these to visualize the relation between one independent variable (x) and one dependent variable (y). When we have two independent variables (x1 and x2), we need to plot the data in a different way. Here is an example:

**Exercise:**
* Look in the code how the plots are produced
* Print `y` to see how it looks

In [None]:
# Create some 3D data using two independent variables (x1, x2) and a dependent variable (y)
x1 = np.linspace(-2*np.pi,2*np.pi,50)
x2 = np.linspace(-2*np.pi,2*np.pi,50)
xx1, xx2 = np.meshgrid(x1,x2)
y = np.cos(xx1)+np.cos(0.5*xx2)

# Plot the data
plt.figure(figsize=(12,4))
plt.subplot(1,3,1)
plt.imshow(y)
plt.title('Plot as an image')

plt.subplot(1,3,2)
cp = plt.contour(x1, x2, y, levels=5)
plt.clabel(cp, inline=1, fontsize=10)
plt.title('A contour plot with 5 levels')

ax = plt.subplot(1,3,3)
cp = plt.contour(x1, x2, y, levels=10)
plt.clabel(cp, inline=1, fontsize=10)
plt.title('A contour plot with 10 levels')

plt.tight_layout()
plt.show()


Or as actual 3D plots:

In [None]:
plt.figure(figsize=(12,5))

ax = plt.subplot(1,3,1, projection='3d')
ax.plot_surface(xx1, xx2, y, rstride=1, cstride=1,
                cmap='viridis', edgecolor='none')
ax.set_title('surface')

ax = plt.subplot(1,3,2, projection='3d')
ax.plot_wireframe(xx1, xx2, y)
ax.set_title('wireframe');

ax = plt.subplot(1,3,3, projection='3d')
ax.contour3D(xx1, xx2, y, 25)
ax.set_title('3D contour with 25 levels');

plt.tight_layout()
plt.show()


## 2.3.3 Indexing Data
We often wish to examine part of a set of data. Suppose that our data is stored in the matrix `A`.

In [None]:
A = np.matrix(np.arange(1, 13).reshape(3, 4))
A

We can access the value of a specific cell in the matrix using an **index** that provides the row index and the column index. Mind that we start counting from 0, so the second row has index 1 and the third column has index 2. We can index the array using `A[row_id, col_id]`. For instance, to access the value on the second row (row index 1) and third column (column index 2) we use:

In [None]:
A[1, 2]


**Exercise:**
* Change the code above to get the value on the 3rd row and the 2nd column 

We can also select multiple rows and columns at a time, by **slicing** the matrix. In that case, we use `A[r1:r2, c1:c2]`, where we select the rows starting at index `r1` until (not including) `r2` and the columns starting at index `c1` until (not including) `c2`.

For instance, the code below selects the rows with index 0,1 and the columns with index 1,2,3.

**Exercise:**
* Run the code below
* Change the code to select only the third row and the columns with index 0,1.

In [None]:
# select a range of rows and columns
A[0:2, 1:4]

To select a few rows and all associated columns, we can use `:` as column index:

In [None]:
# select a range of rows and all columns
A[0:2, :]

Similarly, we can slice all rows for only a few columns:

In [None]:
A[:,1:3]

If you want to select not a consecutive range of rows (or columns), but a few selected rows, you can use a vector as index. For instance, to select row 0 and row 2, you can use:

In [None]:
A[[0,2],:]

An alternative way is to use a **mask**. We can, for instance, use a mask to select rows. The mask is an array with a size equal to the number of rows, containing `True` for every row that should be selected and `False` for every row that should not. To select rows 0 and 2:

In [None]:
# Using a boolean mask, we keep only the rows marked True
# and leave out the rows marked False.

mask = [True, False, True]
A[mask, :]

By providing a negative index, you can select the last row or column:

In [None]:
print('Last row')
print(A[-1,:])
print()
print('Last column')
print(A[:, -1])

Or if you want to select the last two columns:

In [None]:
A[:, -2:]

## 2.3.4 Loading Data

For most analyses, the first step involves importing a data set into `Python`. To this end, we make use of the Pandas library: https://pandas.pydata.org/

Data can be stored in different formats. A very commonly used format is the **comma separated values** or **csv** file. We can load these files using the pandas function `pd.read_csv()`. You can get more info on that function using `help(pd.read_csv)`.

At the start of this notebook, you executed a command to make a clone of a git repository. This made all data files used in this course available in this Colab environment. The data files can be found in the directory `data/islr_data/`. At the left side of the screen, you see a folder icon (looking similar to &#x1F4C1;). If you press that icon, you can see the file and folder structure and inspect which data files are available. If you later want to use your own files, you can upload them there.

We begin by loading in the `Auto.csv` data set. This data is part of the `ISLR2` library, discussed in Chapter 3. It is available in the Colab environment at `data/islr_data/Auto.csv`. The next code block loads the data and puts it into a pandas data frame called `Auto`. 

The file `Auto.csv` contains missing values that are indicated by a question mark. We need to tell pandas that a question mark means missing data by using `na_values=['?']`.
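
To see what `na_values` does without touching the real file, the sketch below reads a tiny made-up CSV from a string (the car names and numbers are invented for this illustration). Without `na_values`, the question mark is kept as text and the column is not numeric; with it, the question mark becomes a missing value.

```python
import io
import pandas as pd

# A tiny, made-up CSV with a '?' marking a missing horsepower value
csv_text = "name,horsepower\ncar_a,130\ncar_b,?\ncar_c,95\n"

raw = pd.read_csv(io.StringIO(csv_text))                      # '?' is kept as text
clean = pd.read_csv(io.StringIO(csv_text), na_values=['?'])   # '?' becomes NaN

print(raw['horsepower'].tolist())
print(clean['horsepower'].tolist())
print('Missing values after using na_values:', clean['horsepower'].isna().sum())
```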

**Exercise:**
* Run the code below. 
* How many samples are there in the dataset?
* Which variables are in the dataset?




In [None]:
Auto = pd.read_csv('data/islr_data/Auto.csv', na_values=['?'])
Auto

We can get all the column or variable names using:

In [None]:
Auto.columns

To check the type used for every variable, we can use `Auto.info()`.

**Exercise:**
* Check for every variable which datatype is used. What is the difference?

In [None]:
Auto.info()

Most of the columns contain numerical values either as 'float64', representing real numbers, or as 'int64', representing integer numbers. One column is of type 'object'. 

The variable 'cylinders' is stored as a numeric integer value. That is, it is a quantitative variable. However, we might want to treat it as a categorical/qualitative variable. We can use pandas to change the type of the variable using:

In [None]:
Auto['cylinders'] = Auto['cylinders'].astype('category')
Auto.info()

We can select specific rows of the data frame in a similar way as we slice numpy matrices if we use `.iloc`. For instance, to select rows 11 to 15 (excluding 15), we can use:

In [None]:
Auto.iloc[11:15]

Or to select the columns with index 3 to 6 (excluding 6):

In [None]:
Auto.iloc[:, 3:6]

**Exercise:**
* Use the code block below to slice the data frame to get rows 100 to 110 and columns 1 to 5.

In [None]:
## Add your code here
..

Data frames also allow to select columns based on their name. If you want to select a single column (single variable), there are several ways to do it. For instance, if you want to select the variable `mpg`:
* `Auto.mpg`
* `Auto['mpg']`
* `Auto.loc[:,'mpg']`

In [None]:
print('Using Auto.mpg')
print( Auto.mpg )
print(20*'=','\n')

print('Using Auto[\'mpg\']')
print( Auto['mpg'] )
print(20*'=','\n')

print('Using Auto.loc[:,\'mpg\']')
print( Auto.loc[:,'mpg'] )


You can also select multiple variables at once. In that case, you index by passing a list of variables:
* `df[[col1, col2, ..., coln]]`

Mind that you now need two opening brackets and two closing brackets. The outer pair of brackets is to index the data frame, the inner pair is to define the list.
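
The number of brackets also determines the type of the result. The sketch below uses a small made-up data frame (not the Auto data) to show that single brackets return a pandas Series, while double brackets return a data frame, even when only one column is selected.

```python
import pandas as pd

# Small, made-up data frame for illustration
df = pd.DataFrame({'mpg': [18.0, 15.0],
                   'weight': [3504, 3693],
                   'year': [70, 70]})

one = df['mpg']               # single brackets: a Series
one_df = df[['mpg']]          # double brackets: a data frame with one column
two = df[['mpg', 'weight']]   # double brackets: a data frame with two columns

print(type(one).__name__, type(one_df).__name__, two.shape)
```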

**Exercise:**
* Run the code to select the columns mpg, cylinders and weight
* Change the code to select 'displacement' and 'acceleration'

In [None]:
Auto[['mpg','cylinders', 'weight']]

We can inspect the first few entries in the data frame using `.head()`:

In [None]:
Auto.head(5)

We can look at the size of the data frame using `Auto.shape`:

In [None]:
Auto.shape

You can see the data has 397 observations, or rows, and nine variables, or columns.



Remember that we indicated to mark all missing values when we loaded the data file. We can check if a cell in the table contains a missing value (not a number) using:

In [None]:
Auto.isna()

We can find columns that contain one or more missing values using 
* `df.isna().any()`

**Exercise:**
* Run the code. Which column contains missing data?

In [None]:
Auto.isna().any()

We typically want to find the data samples (rows) that contain a missing value in any of the columns. We can do this by applying 
* `df.isna().any(axis=1)`

This gives True for every row with at least one missing value. When all columns for a row contain valid data, it returns False:



In [None]:
invalid_rows = Auto.isna().any(axis=1)
print(invalid_rows)


We can use this to count the number of samples with missing data and to index the data frame to view these samples.
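
The same steps can be tried on a small made-up data frame, where the outcome is easy to check by eye. The values below are invented for this sketch; `~` inverts the boolean mask so that only complete rows remain.

```python
import numpy as np
import pandas as pd

# Made-up data frame with two incomplete rows
df = pd.DataFrame({'mpg': [18.0, np.nan, 16.0],
                   'horsepower': [130.0, 165.0, np.nan]})

rows_with_na = df.isna().any(axis=1)   # True for rows with at least one NaN
n_missing = rows_with_na.sum()         # True counts as 1, False as 0
complete = df.loc[~rows_with_na]       # keep only the rows without NaN

print('Rows with missing data:', n_missing)
print(complete)
```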

**Exercise:**
* Run the code below
* Which samples contain missing data? 

In [None]:
print('Nr of samples with missing data:', invalid_rows.sum())
Auto.loc[invalid_rows,:]

Often, we simply want to **remove rows with missing data**. This can be done with the function `.dropna()`.<br>
**NB**: removing missing data is important, because otherwise many machine-learning algorithms will fail!
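
Note that `.dropna()` keeps the original row labels, so the index can have gaps afterwards. This matters when rows are later addressed by label. A minimal sketch on made-up data; `reset_index(drop=True)` is one optional way to renumber the rows and is not used in this lab.

```python
import numpy as np
import pandas as pd

# Made-up data frame with one missing value in row 1
df = pd.DataFrame({'mpg': [18.0, np.nan, 16.0, 17.0]})

cleaned = df.dropna()              # returns a new data frame; df itself is unchanged
print(list(cleaned.index))         # [0, 2, 3] : label 1 is gone

renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))      # [0, 1, 2]
```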

**Exercise:**
* Run the code below
* How many samples are in the data frame before and after removing the missing data?


In [None]:
print('Shape before removal of NAs:',Auto.shape)

Auto.dropna(axis=0, inplace=True)

print('Shape after removal of NAs:',Auto.shape)


We can also remove specific observations/samples (rows) or predictors (columns) from the data frame using `.drop()`. Here are a few examples. 

**Exercise:**
* Study the code and observe the results
* Answer the following questions
  * How do you remove a single sample/observation with a given row index?
  * How can you remove multiple samples at once?
  * How do you remove a specific column from the data frame?

In [None]:
#Auto.index[10:85]
print('Shape of Auto:',Auto.shape)

# Removing the 11th observation (NB index = 10)
Auto_new = Auto.drop(10)
print('Shape of Auto_new:',Auto_new.shape)

# Removing the 11th,29th,101st observation 
Auto_new = Auto.drop([10,28,100])
print('Shape of Auto_new:',Auto_new.shape)

# Removing the 10th until and including the 30th observation
Auto_new = Auto.drop(Auto.index[9:30])
print('Shape of Auto_new:',Auto_new.shape)

# Removing specific columns:
Auto_new = Auto.drop(columns=['mpg', 'weight'])
print('Shape of Auto_new:',Auto_new.shape)

Remember that our data frame contains one categorical variable ('cylinders') and one text variable ('name'), while all the other variables are numeric. We might want to create a new data frame with only the numerical variables, or only the categorical variables. This can be done using `df.select_dtypes()`:

In [None]:
Auto_num = Auto.select_dtypes(include='number')
print(Auto_num.head())

print()
Auto_cat = Auto.select_dtypes(include='category')
print(Auto_cat.head())

## 2.3.5a Additional Graphical Summaries

We can use the matplotlib function `plt.scatter()` to produce scatterplots of the quantitative variables:

In [None]:
plt.scatter(x=Auto['cylinders'], y=Auto['mpg'])
plt.xlabel("Cylinders (#)")
plt.ylabel("Mpg")
plt.show()

An alternative to the `matplotlib` library is the `seaborn` library for plotting (https://seaborn.pydata.org/). Seaborn is especially useful in combination with pandas data frames, as it contains many functions that work on data frames directly.

Let's import the library.

In [None]:
import seaborn as sns                   # Library containing many functions for plotting the data

We can use it, for instance, to make a boxplot (https://seaborn.pydata.org/generated/seaborn.boxplot.html) using:
* `sns.boxplot(x,y,data)`

With `data=Auto`, we tell seaborn to make the plots using the data frame 'Auto'. We can then pass the column names that we want to put on the x and y axis to the function. 

**Exercise:**
* Run the code to make a box plot with miles per gallon (mpg) as a function of the number of cylinders.
* Change the code to make a box plot with the weight as a function of the number of cylinders. Do you see a relation?

In [None]:
sns.boxplot(x='cylinders', y='mpg', data=Auto);

The `sns.histplot()` function can be used to plot a histogram with density plot. https://seaborn.pydata.org/generated/seaborn.histplot.html.

**Exercise:**
* Run the code below. Of which variable is the histogram made?
* Make histograms of different variables by changing `x=...`.


In [None]:
sns.histplot(data=Auto, x='mpg', bins=15, kde=True); 

We can create a figure with a few subplots using `plt.subplots(nr_rows,nr_cols,...)`:

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 10))

fig.suptitle('Showing some histograms')

sns.histplot(data=Auto, ax=axes[0, 0], x='mpg', bins=15, kde=True)
sns.histplot(data=Auto, ax=axes[0, 1], x='displacement', bins=15, kde=True) 
sns.histplot(data=Auto, ax=axes[1, 0], x='weight', bins=10, kde=True)
sns.histplot(data=Auto, ax=axes[1, 1], x='horsepower', bins=25, kde=True)


It can be interesting to plot two variables against each other as a scatter plot to see if there are correlations visible. This can be done with `sns.scatterplot()`, https://seaborn.pydata.org/generated/seaborn.scatterplot.html.

**Exercise:**
* Run the code below to make a scatter plot of displacement vs mpg.
* Do you see a relation between these two variables?
* Try a few other combinations of variables. Can you find a few that are correlated?



In [None]:
sns.scatterplot(x='displacement', y='mpg', data=Auto)

Additionally, you can color the points in the scatterplot using a categorical variable, such as 'cylinders':

In [None]:
sns.scatterplot(x='displacement', y='mpg', hue='cylinders', data=Auto)

It is very tedious to manually create scatterplots for all possible combinations of two variables. This can be automated with `sns.pairplot()` (https://seaborn.pydata.org/generated/seaborn.pairplot.html). This function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables.

**Exercise:**
* Run the code. Note that it takes some time to create the pairplot if you use all variables.
* Can you find some pairs of variables that correlate positively, negatively, or that don't correlate well?
* Do you see correlations between the number of cylinders and the different variables?

In [None]:
sns.pairplot(data=Auto, hue='cylinders')

Most of the subplots are scatterplots, plotting one variable against the others. On the diagonal, you see density plots.

We can also produce scatterplots for just a subset of the variables using `, vars=...` and the density plots can be changed to histograms using `, diag_kind='hist'`.

**Exercise:**
* Run the code below.
* Change the code and pick a few variables that seem most interesting to explore based on the full pair plot.

In [None]:
sns.pairplot(data=Auto, vars=['mpg', 'displacement', 'horsepower', 'acceleration'], hue='cylinders', diag_kind='hist')

Instead of using the names of the columns, you can also select based on the column number. For instance to make a pair plot of predictor/variable 2:6 (3rd until 6th), we can use the code below:

In [None]:
sns.pairplot(data=Auto, vars=Auto.columns[2:6], hue='cylinders')

Note that `sns.pairplot()` cannot deal with categorical variables apart from using it to color the scatter points. A few code blocks ago, we made the predictor 'cylinders' categorical. This is the 2nd predictor in the data frame.

**Exercise:**
* In the code below, we try to make a pair plot of the first four predictors. Run it and see what happens. 
* To make this plot with the categorical variable 'cylinders' included, we should change it back to a numerical predictor using `Auto['cylinders'] = Auto['cylinders'].astype('int64')`. Uncomment the code to make this transformation and run the code again to make the plot.

In [None]:
# Remove the comments (#) below to change the type of cylinders to integer values
#Auto['cylinders'] = Auto['cylinders'].astype('int64')
#Auto.info()

sns.pairplot(data=Auto, vars=Auto.columns[0:4], hue='cylinders')


If you are interested in the distribution of the different variables, you can use a boxplot:

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(data=Auto, orient='h')

Because the range of each variable is quite different, it can be more insightful to have separate boxplots per variable. Mind that we explicitly have to select only the quantitative variables using `Auto.select_dtypes(include='number')`.<br>
We can now also mark the values for one specific sample in the dataset, for instance row 99:  

**Exercise:**
* Study the code below. 
  * How is a data frame containing only numerical variables obtained?
  * How are the separate box plots created?
  * How is the red line drawn for one specific sample? Try a few different samples.

In [None]:
Auto_numeric = Auto.select_dtypes(include='number')
show_sample_id = 99

fig, axes = plt.subplots(1, len(Auto_numeric.columns), figsize=(20, 5))
for i in range(len(Auto_numeric.columns)):
  g = sns.boxplot(data=Auto_numeric, y=Auto_numeric.columns[i], ax=axes[i])
  axes[i].plot(0, Auto_numeric.iloc[show_sample_id,i], marker='_', markersize=50, markeredgewidth=5, color='red', linestyle='')
plt.tight_layout()


## 2.3.5b Additional Numerical Summaries

The `.describe()` function produces a numerical summary of each variable in a particular data set. With `include='number'`, we get a summary of all the numerical variables:

In [None]:
Auto.describe(include='number')

Here is the interpretation:
* 'count': number of values
* 'mean': the numerical mean
* 'std': the standard deviation
* 'min': minimum value
* '25%': 25th percentile or 1st quartile
* '50%': 50th percentile or 2nd quartile or the median
* '75%': 75th percentile or 3rd quartile
* 'max': maximum value
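
These entries can be checked on a tiny made-up series, where the values are easy to verify by hand. The sketch below is an illustration only and does not use the Auto data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])   # made-up values
summary = s.describe()

print(summary)
print('50% equals the median:', summary['50%'] == s.median())
print('std equals s.std():', np.isclose(summary['std'], s.std()))
```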

**Exercise:**
* What is the min, mean and maximum value for the mpg and acceleration?


You can get access to specific statistics made by `.describe()` by indexing the results using `.loc[]`. For instance, to get the standard deviation for all numerical predictors, you can use:

In [None]:
Auto.describe(include='number').loc['std']

We can also look at the variables that are not numerical:

In [None]:
Auto.describe(exclude='number')

You can see that there are 301 unique names and that the most frequent name is 'toyota corolla', which appears 5 times.

We can also produce a summary of just a single variable.

In [None]:
Auto['mpg'].describe()

Pandas also provides functions to calculate additional statistics of the variables, for instance, the median, variance, and skewness:

In [None]:
print('Median:\t\t', Auto['mpg'].median() )
print('Variance:\t', Auto['mpg'].var() )
print('Skewness:\t', Auto['mpg'].skew() )

**Exercise:**
* Based on the 5 code blocks above answer the following questions:
  * What is the minimum, maximum, 25%, 50% and 75% percentile for the horsepower?
  * Which variable has the highest value for the mean?
  * What is the skewness of the weight variable?

#### Correlations in the data
To see if there are relationships in the data, it is often interesting to calculate the **correlation coefficients** between variables in the data.

You can calculate the full correlation matrix, where every variable is correlated with every other variable.

**Exercise:**
* Run the code below
* Answer some questions
  * What is the correlation between horsepower and miles per gallon? Does that make sense?
  * Which two variables have the strongest positive correlation? Which the strongest negative correlation?

In [None]:
Auto.select_dtypes(include='number').corr()  # Correlations are only defined for the numerical columns

Here, we see, for instance, high positive correlations between 'weight' and 'horsepower' and between 'weight' and 'displacement'. This shows that higher weight usually relates to more horsepower and a larger displacement. There is a high negative correlation between 'weight' and 'mpg', showing that a heavier car drives fewer miles per gallon.
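
To make clear what such a coefficient measures, the sketch below computes the Pearson correlation by hand from its definition and compares it with the pandas result. The five weight and mpg values are made up for this illustration and are not taken from the Auto data.

```python
import numpy as np
import pandas as pd

# Made-up values: heavier cars get a lower mpg
df = pd.DataFrame({'weight': [2000.0, 2500.0, 3000.0, 3500.0, 4000.0],
                   'mpg': [35.0, 30.0, 24.0, 20.0, 15.0]})

# Center both variables around their mean
xc = df['weight'] - df['weight'].mean()
yc = df['mpg'] - df['mpg'].mean()

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
r_pandas = df['weight'].corr(df['mpg'])

print('Manual:', r_manual)
print('Pandas:', r_pandas)
```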

It is also possible to calculate only the correlations of one variable against the other. For instance, to calculate all correlations with 'weight':

In [None]:
Auto.select_dtypes(include='number').corrwith(Auto['weight'])

And you might want to sort these values:

In [None]:
Auto.select_dtypes(include='number').corrwith(Auto['weight']).sort_values()

#### Getting the samples with highest or lowest values
You might be interested in selecting the samples/observations in the data with particularly high or low values for certain variables. Here are some examples of how to do so.

**Exercise:**
* What are the names of the lightest cars and what is their weight?
* How much heavier is the heaviest car compared to the lightest and how much lower is the mpg?
* Get the top-10 cars with the highest mpg.

In [None]:
pd.set_option('expand_frame_repr', False)
# Selecting the car with the largest weight

auto_heaviest = Auto.loc[Auto['weight'].idxmax()]

print('Heaviest car:\n',  auto_heaviest, end='\n\n')

# Selecting the 5 heaviest cars
auto_top5_weight = Auto.loc[Auto['weight'].nlargest(5).index]
print('Top 5 heaviest cars:\n',  auto_top5_weight, end='\n\n')

# Selecting the 5 cars with least horsepower
auto_low5_hp = Auto.loc[Auto['horsepower'].nsmallest(5).index]
print('5 cars with lowest horsepower:\n', auto_low5_hp, end='\n\n')
