# 2.3 Lab: Introduction to Python

This notebook replaces the lab session in Section 2.3 of the book and shows how to carry out the same steps in Python. The lab has the following sections:
1. Basic Commands
2. Graphics
3. Indexing Data
4. Loading Data
5. Additional Graphical and Numerical Summaries

Before getting started, we need to get access to the data. To do so, we clone a gitlab repository containing the data. It will make the data available in this Colab in the folder `data`.

Run the code block below to perform this action. (**NB** make sure to run all the code blocks in the notebook when they appear.)

In [None]:
!git clone https://git.wur.nl/koots006/msc-course-machine-learning.git data

We will use a number of libraries in this lab session. A brief description of each is given as a comment in the code:

In [None]:
# Import libraries
import numpy as np                      # Numpy offers several math functions and methods to deal with arrays and matrices
import pandas as pd                     # Pandas is the weapon of choice to manage data
import math                             # Some more math functions
import matplotlib.pyplot as plt         # With matplotlib, we can make the plots and graphs
from scipy.stats import pearsonr        # Statistical methods, specifically Pearson correlation
import seaborn as sns                   # Library containing many functions for plotting the data
%matplotlib inline

## 2.3.1 Basic Commands
`Python` uses functions to perform operations. To run a function called `funcname`, we type `funcname(input1, input2)`, where the inputs (or arguments) `input1` and `input2` tell `Python` how to run the function. A function can have any number of inputs. For example, to create an array of numbers, we use the function `np.asarray()`. The numbers in the list passed to it are joined together. The following command instructs `Python` to join together the numbers 1, 3, 2, and 5, and to save them as an array named `x`. When we type `x`, it gives us back the array.

In [None]:
x = np.asarray([1, 3, 2, 5])  # Create array/vector
# Returns the array. Note: this only works if it is the last statement of the cell; otherwise use `print(x)`.
x


In [None]:
x = np.asarray([1, 6, 2])
print(x)
y = np.asarray([1, 4, 3])

#### Getting help or additional function information
Calling `help(function_name)` returns additional information about a function: its input parameters and their meaning, its output, and usually some examples that you can try out to get a better understanding of how the function works. <br/>
**Example:**

In [None]:
help(np.asarray)

We can tell `Python` to add two sets of numbers together. It will then add the first number from `x` to the first number from `y`, and so on. However, `x` and `y` should be the same length. We can check their length using the `len()` function.

In [None]:
print("Length of x: ", len(x))
print("Length of y: ", len(y))

x + y 
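The element-wise behaviour described above can be checked on small arrays. The sketch below is illustrative only; the array `c` is a hypothetical example that is not part of the lab. It shows that adding two arrays of different lengths raises a `ValueError` (apart from scalars and length-1 arrays, which NumPy broadcasts).

```python
import numpy as np

a = np.asarray([1, 6, 2])
b = np.asarray([1, 4, 3])
total = a + b  # element-wise sum: [1+1, 6+4, 2+3]

c = np.asarray([1, 2])  # hypothetical array with a different length
try:
    a + c
    mismatch_raised = False
except ValueError:
    # shapes (3,) and (2,) cannot be combined element by element
    mismatch_raised = True

print(total, mismatch_raised)
```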

The `np.asmatrix()` function can be used to create a matrix of numbers.
Before we use the `np.asmatrix()` function, we can learn more about it:

In [None]:
help(np.asmatrix)

In [None]:
# Create a numpy matrix with 2 rows and 2 columns.
x = np.asmatrix(data=(1,2,3,4)) \
                    .reshape(2, 2) \
                    .transpose()
x

In [None]:
np.asmatrix(data=(1,2,3,4)) \
            .reshape(2, 2)

Notice that in the above command we did not assign the matrix to a variable such as `x`. In this case the matrix is printed to the screen but is not saved for future calculations. The `np.sqrt()` function returns the square root of each element of a vector or matrix. The command `np.power(x, 2)` raises each element of `x` to the power 2; any powers are possible, including fractional or negative powers (for negative powers, the array must hold floats rather than integers).

In [None]:
# Sqrt: np.sqrt()
print("sqrt(x): \n", np.round(np.sqrt(x), 2)) # Round fractions to two decimals.

# Power: np.power()
print("x^2: \n", np.power(x, 2))
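Readers coming from R may be tempted to write `x^2` or `x**2`. The following sketch, which uses a small matrix `m` chosen for illustration, suggests why `np.power()` is the safer choice here: for an `np.matrix`, `**` is a matrix power, and `^` is the bitwise XOR operator in Python.

```python
import numpy as np

m = np.asmatrix([[1, 3], [2, 4]])

elementwise = np.power(m, 2)  # squares every entry
matrix_product = m ** 2       # for np.matrix, ** means m multiplied by m

arr = np.asarray(m)           # plain ndarray view of the same numbers
arr_squared = arr ** 2        # for ndarray, ** is element-wise
xor_result = arr ^ 2          # ^ is bitwise XOR, not a power

print(elementwise)
print(matrix_product)
print(xor_result)
```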

The `np.random.normal()` function generates an array of random
normal variables; the keyword argument `size` sets the sample size. Each
time we call this function, we will get a different answer. Here we
create two correlated sets of numbers, `x` and `y`, and use
the `pearsonr()` function to compute the correlation between
them.


In [None]:
x = np.random.normal(size=50)
y = x + np.random.normal(size=50, loc=50, scale=0.1)

np.round(pearsonr(x, y)[0], 3) # pearsonr returns (1) correlation coefficient and (2) the p-value - which is not needed here.

By default, `np.random.normal()` creates standard normal random variables with a mean of 0 and a standard deviation of 1. However, the mean and standard deviation can be altered using the `loc` and `scale` arguments, as illustrated above. Sometimes we want our code to reproduce the exact same set of random numbers; we can use the `np.random.seed()` function to do this. The `np.random.seed()` function takes an (arbitrary) integer argument.

In [None]:
np.random.seed(42) # -> sets seed for anything that uses np.random.*
np.random.normal(size=50)
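To see what the seed does, the sketch below draws numbers twice with the same seed and once more without re-seeding. It also shows NumPy's newer `np.random.default_rng()` generator as an alternative; the lab itself uses `np.random.seed()`, so the generator part is an addition for illustration.

```python
import numpy as np

np.random.seed(42)
first = np.random.normal(size=5)

np.random.seed(42)                 # same seed, so the stream restarts
second = np.random.normal(size=5)
third = np.random.normal(size=5)   # no re-seed: the stream simply continues

# Alternative: a local generator object that carries its own seed
rng_a = np.random.default_rng(42).normal(size=5)
rng_b = np.random.default_rng(42).normal(size=5)

print(np.array_equal(first, second), np.array_equal(second, third))
```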

We use `np.random.seed()` throughout the labs whenever we perform calculations involving random quantities. In general this should allow the user to reproduce our results. However,  as new versions of `Python` become available,  small discrepancies may arise between this book and the output from `Python`.

The `np.mean()` and `np.var()` functions can be used to compute the mean and variance of a vector of numbers. Applying `np.sqrt()` to the output of `np.var()` will give the standard deviation. Or we can simply use the
`np.std()` function.

In [None]:
np.random.seed(3)
y = np.random.normal(size=100)

# Compute mean of y
print("Mean of y: ", np.round(np.mean(y), 4))

# Compute variance of y
print("Variance of y: ", np.round(np.var(y), 4))

# Compute standard deviation of y using np.sqrt() and np.var()
print("Standard deviation of y: ", np.round(np.sqrt(np.var(y)), 4))

# And now using np.std()
print("Standard deviation of y: ", np.round(np.std(y), 4))
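One difference from R is worth keeping in mind: `np.var()` and `np.std()` divide by `n` by default (`ddof=0`), whereas R's `var()` and pandas' `.var()` divide by `n - 1`. The sketch below uses a small made-up array to show both conventions.

```python
import numpy as np

y_small = np.asarray([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

pop_var = np.var(y_small)             # divides by n (ddof=0, NumPy default)
sample_var = np.var(y_small, ddof=1)  # divides by n - 1, as R's var() does
pop_std = np.std(y_small)             # square root of pop_var

print(pop_var, sample_var, pop_std)
```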


## 2.3.2 Graphics

The plt.plot() function is the primary way to plot data in `Python`. For instance, plt.plot(x, y) produces a plot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the `plt.plot()` function. To find out more information about the plt.plot() function, type help(plt.plot).

In [None]:
# For documentation --> help(plt.plot)
x = np.random.normal(size=100)
y = np.random.normal(size=100)

plt.scatter(x, y) # Produces scatter plot 
plt.title("Using plt.scatter")
plt.show()
# OR
plt.plot(x, y, '.') # Produces scatter plot 
plt.title("Using plt.plot")
plt.show()

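The function `np.arange()` can be used to create a sequence of numbers. For instance, `np.arange(a, b)` makes an array of the integers from `a` up to, but not including, `b`. The function `np.linspace(a, b, num=n)` instead creates `n` equally spaced numbers from `a` to `b`, with both endpoints included.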

In [None]:
x = np.arange(1, 11)
print(x)

x = np.linspace(-np.pi, np.pi, num=50)
print(x)
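The endpoint behaviour of the two functions is easy to mix up. A minimal sketch with small inputs:

```python
import numpy as np

ints = np.arange(1, 11)              # 1, 2, ..., 10 (the stop value 11 is excluded)
grid = np.linspace(0.0, 1.0, num=5)  # 5 equally spaced values, both endpoints included

print(ints)
print(grid)
```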

We will now create some more sophisticated plots. The `plt.contour()` function produces a *contour plot* in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:
 
 * An array of the `x` values (the first dimension),
 * An array of the `y` values (the second dimension), and
 * A matrix whose elements correspond to the `z` value (the third dimension) for each pair of (`x`, `y`) coordinates.

As with the `plt.plot()` function, there are many other inputs that can be used to fine-tune the output of the `plt.contour()` function. To learn more about these, take a look at  the help file by typing `help(plt.contour)`.

In [None]:

y = x

def outer(a, b):
  return math.cos(b) / (1 + a**2)

f = np.empty((len(x), len(y)))

for i in range(len(x)):
    for j in range(len(y)):
        f[i,j] = outer(x[i], y[j])

        
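# contour plot
# plt.contour expects z[row, col] to belong to (x[col], y[row]).
# Here f[i, j] belongs to (x[i], y[j]), so we pass the transpose.
cp = plt.contour(x, y, f.T, levels=45)
plt.clabel(cp, inline=1, fontsize=10)
plt.show()

fa = (f - f.transpose()) / 2
cp = plt.contour(x, y, fa.T, levels=15)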
plt.clabel(cp, inline=1, fontsize=10)
plt.show()
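The double loop above is easy to read but slow for large grids. As a sketch of an alternative, NumPy broadcasting can fill the same array in one expression; the names `f_loop` and `f_vec` are introduced here only for the comparison.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, num=50)
y = x

# Loop version, as in the lab: f[i, j] = cos(y[j]) / (1 + x[i]^2)
f_loop = np.empty((len(x), len(y)))
for i in range(len(x)):
    for j in range(len(y)):
        f_loop[i, j] = np.cos(y[j]) / (1 + x[i] ** 2)

# Broadcasting version: x becomes a column, y becomes a row
f_vec = np.cos(y)[np.newaxis, :] / (1 + x[:, np.newaxis] ** 2)

print(np.allclose(f_loop, f_vec))
```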

The `plt.contourf()` function works the same way as `plt.contour()`, except that it produces a color-coded plot whose colors depend on the `z` value. This is known as a *heatmap*, and is sometimes used to plot temperature in weather forecasts. Alternatively, `ax.plot_wireframe()`, a method of a 3D axes object, can be used to produce a three-dimensional plot. The arguments `elev` and `azim` in `view_init()` control the angles at which the plot is viewed.

In [None]:
cp = plt.contourf(x, y, fa.T, 15)  # transpose: plt.contourf expects z[row, col] at (x[col], y[row])
plt.clabel(cp, inline=1, fontsize=10)
plt.colorbar()
plt.title("Heatmap")
plt.show()

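# Equivalent of R's persp() function in Python
X, Y = np.meshgrid(x, y)  # plot_wireframe needs 2-D coordinate grids
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X, Y, fa.T)  # fa[i, j] belongs to (x[i], y[j]), so transpose to match the grid
ax.view_init(30)
#ax.view_init(30, 20)
#ax.view_init(30, 70)
#ax.view_init(30, 40)
plt.show()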

## 2.3.3 Indexing Data
We often wish to examine part of a set of data. Suppose that our data is stored in the matrix `A`.

In [None]:
A = np.asmatrix(np.arange(1, 13).reshape(3, 4))
A

We can get the shape of the matrix using `.shape`. It shows the number of rows and columns:

In [None]:
A.shape

We can access the value on the second row (row index 1) and third column (column index 2) using:

In [None]:
A[1, 2]

The first number after the open-bracket symbol `[` always refers to the row, and the second number always refers to the column.

We can also select multiple rows and columns at a time, by **slicing** the matrix. To select rows 0,1 and columns 1,2,3:

In [None]:
# select a range of rows and columns
A[0:2, 1:4]

To select a few rows and all associated columns, we can use `:` as column index:

In [None]:
# select a range of rows and all columns
A[0:2, :]

Or to select a few columns and all rows:

In [None]:
# select all rows and a range of columns
A[:,0:2]

If you want to select not a consecutive range of rows (or columns), but a few selected rows, you can use a vector as index. For instance, to select row 0 and row 2, you can use:

In [None]:
A[[0,2],:]

An alternative is to use a **mask**. For instance, we can use a mask to select rows. The mask is a list or array whose length equals the number of rows, containing `True` for every row that should be selected and `False` for every row that should not. To select rows 0 and 2:

In [None]:
# A boolean mask keeps the rows marked True and drops the rows marked False.
# Here rows 0 and 2 are kept and row 1 is dropped.

mask = [True, False, True]
A[mask, :]
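In practice a mask is usually computed from a condition rather than typed by hand. The sketch below assumes a plain NumPy array (not an `np.matrix`) with the same values as `A`, and selects the rows whose first column is larger than 1; the condition is chosen only for illustration.

```python
import numpy as np

A_arr = np.arange(1, 13).reshape(3, 4)  # same values as A, as a plain ndarray

row_mask = A_arr[:, 0] > 1      # first column is [1, 5, 9] -> [False, True, True]
selected = A_arr[row_mask, :]   # rows where the condition holds
the_rest = A_arr[~row_mask, :]  # ~ inverts the mask

print(selected)
print(the_rest)
```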

By providing a negative index, you can select the last row or column:

In [None]:
print('Last row')
print(A[-1,:])
print('Last column')
print(A[:, -1])

Or if you want to select the last two columns:

In [None]:
A[:, -2:]

## 2.3.4 Loading Data

For most analyses, the first step involves importing a data set into `Python`. To this end, we make use of the Pandas library: https://pandas.pydata.org/

Data can be stored in different formats. A very commonly used format is the **comma separated values** or **csv** file. We can load these files using the pandas function `pd.read_csv()`. You can get more info on that function using `help(pd.read_csv)`.

At the start of this notebook, you executed a command to make a clone of a git repository. This made all data files used in this course available in this Colab environment. The data files can be found in the directory `data/islr_data/`. At the left side of the screen, you see a folder icon (looking similar to &#x1F4C1;). If you press that icon, you can see the file and folder structure and inspect which data files are available. If you later want to use your own files, you can upload them there.

We begin by loading in the `Auto.csv` data set. This data is part of the `ISLR2` library, discussed in Chapter 3. It is available in the Colab environment at `data/islr_data/Auto.csv`. The next code block loads the data and puts it into a pandas data frame called `Auto`. The file Auto.csv contains missing values that are indicated by a question mark. We need to tell pandas that a question mark means missing data by using `na_values=['?']`:


In [None]:
Auto = pd.read_csv('data/islr_data/Auto.csv', na_values=['?'])
Auto

We can get all the column or variable names using:

In [None]:
Auto.columns

To check the type used for every variable:

In [None]:
Auto.info()

Most of the columns contain numerical values either as 'float64', representing real numbers, or as 'int64', representing integer numbers. One column is of type 'object'. 

We can select specific rows of the data frame in a similar way as we slice numpy matrices if we use `.iloc`. For instance, to select rows 11 to 15 (excluding 15), we can use:

In [None]:
Auto.iloc[11:15]

Or to select columns 3 to 6 (excluding 6):

In [None]:
Auto.iloc[:, 3:6]

Data frames also allow you to select columns based on their name. For instance, to select the columns mpg, cylinders and weight:

In [None]:
Auto[['mpg','cylinders', 'weight']]
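Besides `.iloc` (selection by position), pandas also offers `.loc` (selection by index label and column name). The sketch below uses a small hypothetical data frame `df`, not the `Auto` data, to show the difference. Note that `.loc` slices include the end label, while `.iloc` slices exclude the end position.

```python
import pandas as pd

# Hypothetical toy data; the index labels deliberately start at 10
df = pd.DataFrame(
    {'mpg': [18.0, 15.0, 36.0, 24.0],
     'cylinders': [8, 8, 4, 4],
     'weight': [3504, 3693, 2130, 2372]},
    index=[10, 11, 12, 13],
)

by_position = df.iloc[0:2]             # positions 0 and 1 (end excluded)
by_label = df.loc[10:11]               # labels 10 and 11 (end included)
cols = df.loc[:, ['mpg', 'weight']]    # all rows, two columns by name

print(by_position)
print(by_label)
print(cols)
```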

We can inspect the first few entries in the data frame using `.head()`:

In [None]:
Auto.head(5)

We can look at the size of the data frame using `Auto.shape`:

In [None]:
Auto.shape

You can see the data has  397  observations, or rows, and nine variables, or columns. 

We would like to see which of the rows contain missing information (na). To inspect which columns/variables contain na values, we can use:

In [None]:
Auto.isna().any()

We see that there are missing data in the 'horsepower' column. To continue with our analysis, we want to remove the rows/samples that have missing data. We can find rows in the data frame that have one or more missing values using `.any(axis=1)`:

In [None]:
Auto.isna().any(axis=1)

To count how many rows contain missing data, we can use:

In [None]:
Auto.isna().any(axis=1).sum()

And to view the five rows missing data, we can use the row information to index the data frame: 

In [None]:
Auto[Auto.isna().any(axis=1)]

We simply want to remove these rows. This can be done with the function `.dropna()`:

In [None]:
print('Shape before removal of NAs:',Auto.shape)

Auto.dropna(axis=0, inplace=True)

print('Shape after removal of NAs:',Auto.shape)
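The call above modifies `Auto` in place. As a sketch of an alternative, `.dropna()` without `inplace=True` returns a new data frame and leaves the original untouched. The toy data frame below is hypothetical. Note that the remaining rows keep their original index labels unless the index is reset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing horsepower value
toy = pd.DataFrame({'mpg': [18.0, 25.0, 30.0],
                    'horsepower': [130.0, np.nan, 70.0]})

n_missing = toy.isna().any(axis=1).sum()     # number of rows with at least one NA
clean = toy.dropna(axis=0)                   # new data frame; toy is unchanged
clean_reset = clean.reset_index(drop=True)   # renumber the rows 0, 1, ...

print(n_missing, clean.index.tolist(), clean_reset.index.tolist())
```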


We can remove observations (rows) or predictors (columns) from the data frame using `.drop()`. Here are a few examples. Study them and observe the results.

In [None]:
#Auto.index[10:85]
print('Shape of Auto:',Auto.shape)

# Removing the observation with index label 10 (.drop() works with index labels, not positions)
Auto_new = Auto.drop(10)
print('Shape of Auto_new:',Auto_new.shape)

# Removing the observations with index labels 10, 28 and 100
Auto_new = Auto.drop([10,28,100])
print('Shape of Auto_new:',Auto_new.shape)

# Removing the 10th until and including the 30th observation
Auto_new = Auto.drop(Auto.index[9:30])
print('Shape of Auto_new:',Auto_new.shape)

# Removing specific columns:
Auto_new = Auto.drop(columns=['mpg', 'weight'])
print('Shape of Auto_new:',Auto_new.shape)
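Because `.drop()` works with index labels, labels and positions no longer coincide once rows have been removed (for example by `.dropna()`). The sketch below uses a hypothetical data frame whose label 2 is already gone to illustrate the difference.

```python
import pandas as pd

# Hypothetical data frame in which the row with label 2 was removed earlier
toy = pd.DataFrame({'weight': [3504, 3693, 2130, 2372]}, index=[0, 1, 3, 4])

by_label = toy.drop(3)                 # removes the row labelled 3 (the third row)
by_position = toy.drop(toy.index[3])   # removes the fourth row (label 4)

try:
    toy.drop(2)                        # label 2 does not exist any more
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(by_label.index.tolist(), by_position.index.tolist(), missing_label_raised)
```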

## 2.3.5a Additional graphical summaries

We can use the matplotlib function `plt.scatter()` to produce scatterplots of the quantitative variables:

In [None]:
plt.scatter(x=Auto['cylinders'], y=Auto['mpg'])
plt.xlabel("Cylinders (#)")
plt.ylabel("Mpg")
plt.show()

The *cylinders* variable is stored as a numeric vector, so Python has treated it as quantitative. However, since there are only a small number of possible values for *cylinders*, one may prefer to treat it as a qualitative variable. The function `.astype('category')` converts quantitative variables into qualitative variables. We use this to convert 'cylinders' to categorical:

In [None]:
Auto['cylinders'] = Auto['cylinders'].astype('category')
Auto.info()
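What the conversion does can be inspected on a small example. The sketch below uses a hypothetical series of cylinder counts; the `.cat.categories` attribute lists the distinct levels, and `.astype('int64')` converts back to numbers.

```python
import pandas as pd

cyl = pd.Series([8, 8, 4, 6, 4])           # hypothetical cylinder counts
cyl_cat = cyl.astype('category')           # qualitative version

levels = list(cyl_cat.cat.categories)      # distinct levels, in sorted order
cyl_back = cyl_cat.astype('int64')         # back to a numeric column

print(cyl_cat.dtype, levels, cyl_back.dtype)
```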

An alternative to the `matplotlib` library is to use the `seaborn` library for plotting. Seaborn is especially interesting to use in combination with pandas data frames: https://seaborn.pydata.org/.

Let's import the library.

In [None]:
import seaborn as sns

We can use it, for instance, to make a boxplot (https://seaborn.pydata.org/generated/seaborn.boxplot.html). With `,data=Auto`, we tell seaborn to make the plots using the data frame 'Auto'. 

In [None]:
sns.boxplot(x='cylinders', y='mpg', data=Auto);

The `sns.histplot()` function can be used to plot a histogram with density plot. https://seaborn.pydata.org/generated/seaborn.histplot.html.<br> 
Try inspecting different variables by changing `x=...`.

In [None]:
sns.histplot(data=Auto, x='mpg', bins=15, kde=True); 

We can create a figure with a few subplots using `plt.subplots(nr_rows,nr_cols,...)`:

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(18, 10))

fig.suptitle('Showing some histograms')

sns.histplot(data=Auto, ax=axes[0, 0], x='mpg', bins=15, kde=True)
sns.histplot(data=Auto, ax=axes[0, 1], x='displacement', bins=15, kde=True) 
sns.histplot(data=Auto, ax=axes[1, 0], x='weight', bins=10, kde=True)
sns.histplot(data=Auto, ax=axes[1, 1], x='horsepower', bins=25, kde=True)


It can be interesting to plot two variables against each other to see if there are correlations visible. This can be done with `sns.scatterplot()`, https://seaborn.pydata.org/generated/seaborn.scatterplot.html.
We can, for instance, create a scatterplot of the displacement against mpg:



In [None]:
sns.scatterplot(x='displacement', y='mpg', data=Auto)

Additionally, you can color the points in the scatterplot using a categorical variable, such as 'cylinders':

In [None]:
sns.scatterplot(x='displacement', y='mpg', hue='cylinders', data=Auto)

It is very tedious to manually create the scatterplots for all possible combinations of two variables. This can be automated using a pairplot with `sns.pairplot()` (https://seaborn.pydata.org/generated/seaborn.pairplot.html). This function creates a scatterplot matrix, i.e. a scatterplot for every pair of variables.

In [None]:
sns.pairplot(data=Auto, hue='cylinders')

Most of the subplots are scatterplots, plotting one variable against the others. On the diagonal, you see density plots.

We can also produce scatterplots for just a subset of the variables using `vars=...`, and the density plots can be changed to histograms using `diag_kind='hist'`:

In [None]:
sns.pairplot(data=Auto, vars=['mpg', 'displacement', 'horsepower', 'acceleration'], hue='cylinders', diag_kind='hist')

Instead of using the names of the columns, you can also select based on the column number. For instance, to make a pair plot of the columns with index 2 up to (but not including) 6, i.e. the 3rd until the 6th variable, we can use the code below:

In [None]:
sns.pairplot(data=Auto, vars=Auto.columns[2:6], hue='cylinders')

Note that `sns.pairplot()` cannot deal with categorical variables. A few code blocks ago, we made the predictor 'cylinders' categorical. This is the 2nd column in the data frame. In the code below, we try to make a pair plot of the first four columns. Run it and see what happens. To make this plot, we should change cylinders back to a numerical predictor using `Auto['cylinders'] = Auto['cylinders'].astype('int64')`. Uncomment the code to make this transformation and run the code again to make the plot.

In [None]:
# Remove the comments (#) below to change the type of cylinders to integer values
#Auto['cylinders'] = Auto['cylinders'].astype('int64')
#Auto.info()

sns.pairplot(data=Auto, vars=Auto.columns[0:4], hue='cylinders')


If you are interested in the distribution of the different variables, you can use a boxplot:

In [None]:
plt.figure(figsize=(20,5))
sns.boxplot(data=Auto, orient='h')

Because the range of each variable is quite different, it can be more insightful to have separate boxplots per variable. Mind that we explicitly have to select only the quantitative variables using `Auto.select_dtypes(include='number')`.<br>
We can now also mark the values for one specific sample in the dataset, for instance the row at position 99:

In [None]:
Auto_numeric = Auto.select_dtypes(include='number')
show_sample_id = 99

fig, axes = plt.subplots(1, len(Auto_numeric.columns), figsize=(20, 5))
for i in range(len(Auto_numeric.columns)):
  g = sns.boxplot(data=Auto_numeric, y=Auto_numeric.columns[i], ax=axes[i])
  axes[i].plot(0, Auto_numeric.iloc[show_sample_id,i], marker='_', markersize=50, markeredgewidth=5, color='red', linestyle='')
plt.tight_layout()


## 2.3.5b Additional Numerical Summaries

The `.describe()` function produces a numerical summary of each variable in a particular data set. With `include='number'`, we show a summary of all the numerical variables:

In [None]:
Auto.describe(include='number')

Here is the interpretation:
* 'count': number of values
* 'mean': the numerical mean
* 'std': the standard deviation
* 'min': minimum value
* '25%': 25th percentile or 1st quartile
* '50%': 50th percentile or 2nd quartile or the median
* '75%': 75th percentile or 3rd quartile
* 'max': maximum value


You can get access to specific statistics made by `.describe()` by indexing the results using `.loc[]`. For instance, to get the standard deviation for all numerical predictors, you can use:

In [None]:
Auto.describe(include='number').loc['std']
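Each entry of the summary can also be computed directly. The sketch below uses a small hypothetical series to show how the rows of `.describe()` relate to `.median()`, `.quantile()` and `.std()`. Note that pandas' `.std()` divides by `n - 1`, unlike NumPy's default.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values
summary = s.describe()

median_matches = np.isclose(summary['50%'], s.median())
q1_matches = np.isclose(summary['25%'], s.quantile(0.25))
std_matches = np.isclose(summary['std'], s.std())   # sample std (ddof=1)

print(summary)
print(median_matches, q1_matches, std_matches)
```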

We can also look at the variables that are not numerical:

In [None]:
Auto.describe(exclude='number')

You can see that there are 301 unique names and that the most frequent name is 'toyota corolla', which appears 5 times.

We can also produce a summary of just a single variable.

In [None]:
Auto['mpg'].describe()

Pandas also provides functions to calculate additional statistics of the variables, for instance, the median, variance, and skewness:

In [None]:
print('Median:\t\t', Auto['mpg'].median() )
print('Variance:\t', Auto['mpg'].var() )
print('Skewness:\t', Auto['mpg'].skew() )


#### Correlations in the data
To see if there are relationships in the data, it is often interesting to calculate the **correlation coefficients** between variables in the data.

You can calculate the full correlation matrix, where every variable is correlated with every other variable:

In [None]:
Auto.corr(numeric_only=True)  # skip non-numeric columns such as 'name'

Here we see, for instance, high positive correlations between 'weight' and 'cylinders' and between 'weight' and 'displacement'. This shows that a higher weight usually goes together with more cylinders and a larger displacement. There is a high negative correlation between 'weight' and 'mpg', showing that heavier cars tend to travel fewer miles per gallon.

It is also possible to calculate only the correlations of one variable against the others. For instance, to calculate all correlations with 'weight':

In [None]:
Auto.corrwith(Auto['weight'], numeric_only=True)  # skip non-numeric columns such as 'name'

And you might want to sort these values:

In [None]:
Auto.corrwith(Auto['weight'], numeric_only=True).sort_values()
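The numbers returned above are Pearson correlation coefficients. As a sketch of what is being computed, the example below works out the coefficient by hand for two columns of a small hypothetical data frame and compares it with the pandas result. The values are made up and are not taken from `Auto`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: heavier cars with lower mpg
toy = pd.DataFrame({'weight': [2000.0, 2500.0, 3000.0, 3500.0],
                    'mpg': [35.0, 30.0, 22.0, 18.0]})

# Centre both variables around their mean
w = toy['weight'] - toy['weight'].mean()
m = toy['mpg'] - toy['mpg'].mean()

# Pearson r = sum(w * m) / sqrt(sum(w^2) * sum(m^2))
r_manual = (w * m).sum() / np.sqrt((w ** 2).sum() * (m ** 2).sum())
r_pandas = toy['weight'].corr(toy['mpg'])

print(r_manual, r_pandas)
```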

#### Getting the samples with highest or lowest values
You might be interested in selecting the samples/observations in the data with particularly high or low values for certain variables. Here are some examples of how to do so:

In [None]:
# Selecting the car with the largest weight
print('Heaviest car:', Auto.loc[Auto['weight'].idxmax()] )

# Selecting the 5 heaviest cars
print()
print('Top 5 heaviest cars:\n', Auto.loc[Auto['weight'].nlargest(5).index] )

# Selecting the 5 cars with the lowest horsepower
print()
print('5 cars with lowest horsepower:\n', Auto.loc[Auto['horsepower'].nsmallest(5).index] )
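The same selections can also be written by sorting the data frame, or with `DataFrame.nlargest()` applied to the whole frame. A minimal sketch on hypothetical data (the names and weights below are made up):

```python
import pandas as pd

toy = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                    'weight': [3000, 4500, 2100, 3900]})

top2 = toy.nlargest(2, 'weight')                                # two heaviest rows
top2_sorted = toy.sort_values('weight', ascending=False).head(2)
heaviest_name = toy.loc[toy['weight'].idxmax(), 'name']         # a single cell

print(top2)
print(heaviest_name)
```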




# **Credits**:

---
Text is based on R-notebook by the authors, the code for the Chapter 6 section comes mostly directly from [here](http://www.science.smith.edu/~jcrouser/SDS293/labs/lab11-py.html) with some minor changes, the code for the Chapter 12 sections is based on work that can be found [here](https://github.com/jcrouser/islr-python/blob/master/Lab%2018%20-%20PCA%20in%20Python.ipynb) and [here](https://github.com/hardikkamboj/An-Introduction-to-Statistical-Learning/blob/master/Chapter_10/). <br/>


**Links to Python versions of the labs by others:** <br/>
* [Jcrouser Python Lab on KNN](https://github.com/jcrouser/islr-python/blob/master/Lab%203%20-%20K-Nearest%20Neighbors%20in%20Python.ipynb)

* [Jcrouser Python Lab on LDA and QDA](https://github.com/jcrouser/islr-python/blob/master/Lab%205%20-%20LDA%20and%20QDA%20in%20Python.ipynb)

* [Mscaudill Labs Ch 4](https://github.com/mscaudill/IntroStatLearn/blob/master/notebooks/Ch4_Classification/Ch4_Lab_Classification.ipynb)

*  [Hardikkamboj Labs Ch 4](https://github.com/mscaudill/IntroStatLearn/blob/master/notebooks/Ch4_Classification/Ch4_Lab_Classification.ipynb)

Additional (generally) useful links: <br/>
* https://opensourcelibs.com/lib/kulbear-islr-python

* https://github.com/Kulbear/ISLR-Python/blob/master/labs

* https://github.com/jcrouser/islr-python

* https://botlnec.github.io/islp/

* https://github.com/mscaudill/IntroStatLearn/blob/master/notebooks

* https://github.com/a-martyn/ISL-python/blob/master/Notebooks

* https://github.com/hardikkamboj/An-Introduction-to-Statistical-Learning

* https://github.com/hyunblee/ISLR-with-Python/tree/master
