# 2.3 Lab: Introduction to ~~R~~ Python

## 2.3.1 Basic Commands

In this lab, we will work through some basic commands in Python that mirror the functionality that we need in R. 

### Libraries

Python has native structures for storing data, but other libraries provide more advanced data structures and data manipulation. The two most popular ones, plus an additional library that helps with R data sets, are:

* `numpy` (NumPy) is a package for scientific computing with powerful arrays. https://numpy.org/
* `pandas` is a library for high-performance, easy-to-use data structures and data analysis tools. https://pandas.pydata.org/pandas-docs/stable/index.html
* `statsmodels` is a library that can be used to load R data sets and replicate some of the functionality in R.

Common graphic libraries used within Python are: 

* `matplotlib` is a comprehensive library for creating static, animated, and interactive visualizations in Python. https://matplotlib.org/
* `seaborn` is a library for making statistical graphics in Python and is built on top of matplotlib. https://seaborn.pydata.org/introduction.html

The main library used for statistics and machine learning is: 

* `sklearn` (scikit-learn) is a library for machine learning. https://scikit-learn.org/stable/


We will map R objects and functions to their Python counterparts.

### Imports
In R, we use the `library` function to import objects, functions, and data into the local namespace for usage. In Python, we will use the `import` statement to import libraries. 

There are a few built-in libraries that we can use along with third-party libraries. Many of the third-party libraries have commonly used aliases that can be confusing for beginners to remember or to look up. Most of the examples will not use an alias for the library, both to reduce confusion and to avoid future conflicts: aliases are social conventions, not enforced or mutually exclusive designations.

You can also import specific classes from a namespace directly into the local namespace so you do not have to deep reference them. This is used sparingly, where it improves code readability and does not create conflicts within the local namespace.

**Example** 
```Python
import sklearn  # specific namespace import
from sklearn.model_selection import GridSearchCV  # single class import

# now the GridSearchCV class is referenceable by the full namespace or by the class name
sklearn.model_selection.GridSearchCV(estimator = None, param_grid = {})
GridSearchCV(estimator = None, param_grid = {})
```
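
To illustrate the point about aliases, here is a minimal sketch. The alias `np` is only the common community convention for `numpy` (it is not required by the language), and the reassignment at the end is a contrived example of how an alias can be lost.

```python
import numpy
import numpy as np  # "np" is only a conventional alias, nothing enforces it

# both names are bound to the same module object
print (np is numpy)            # True

# the alias and the full name reach the same function
print (np.sqrt is numpy.sqrt)  # True

# an alias is an ordinary variable, so a later assignment silently replaces it
np = "not a module anymore"
print (type (np))              # <class 'str'>
```

Using the full library name, as most of the examples here do, avoids this kind of accidental collision.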



In [1]:
from typing import List, Dict, Tuple, Sequence, Union # annotation library
import sys # system library 

import statsmodels # root namespace for intellisense and deep reference
import statsmodels.api as statsModels # alias to hide the extra api namespace

import scipy
import numpy   # as np
import pandas  # as pd

from sklearn import * # import all child namespaces 

import matplotlib.pyplot as pyplot
import seaborn 


### Vectors, Arrays, and NumPy

Vectors are 1-dimensional arrays that (generally) hold primitive data types and can dynamically resize (shrink or grow) as needed. Python has a native `list` type, which is an ordered, mutable, one-dimensional collection. NumPy provides additional support for arrays and more advanced manipulation.

**R Code**
```R
x = c (1, 3, 2, 5)   # concatenate into vector 
x                    # print the vector
### output ###
[1] 1 3 2 5 

x = c (1, 6, 2)
x 
### output ###
[1] 1 6 2 

y = c (1, 4, 3)
```

In Python, we can use the `typing` library to optionally annotate the element type of a list. Type annotations are documentation only: they do not enforce the data type, and each element in the list can have a different data type. See the style guide on type hinting and annotations for where they are useful and where they are redundant.

In [2]:
# creating list collections, showing annotations, and mixed list
list1: List [int] = [1, 3, 2, 5]
list2: List [float] = [1.0, 3.0, 2.0, 5.0]
list3: List = ["a", 1, 2.0, numpy.array ([1, 2, 3, 4])]

print ("list 1")
print (list1)
print (type (list1 [0]))

print ("\nlist 2")
print (list2)
print (type (list2 [0]))

print ("\nlist 3 data types")
print (type (list3 [0]))
print (type (list3 [1]))
print (type (list3 [2]))
print (type (list3 [3]))

list 1
[1, 3, 2, 5]
<class 'int'>

list 2
[1.0, 3.0, 2.0, 5.0]
<class 'float'>

list 3 data types
<class 'str'>
<class 'int'>
<class 'float'>
<class 'numpy.ndarray'>
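
As a small illustration of the statement that annotations are not enforced, the sketch below deliberately assigns values that contradict the annotation. The variable names are made up for this example. A static type checker would be expected to flag the assignment, but the interpreter itself runs it without complaint.

```python
from typing import List

# the annotation promises a list of int, but nothing checks it at run time
numbers: List [int] = ["one", 2.0, 3]
print ([type (item).__name__ for item in numbers])  # ['str', 'float', 'int']

# an explicit check is needed if the element types really matter
all_ints = all (isinstance (item, int) for item in numbers)
print (all_ints)  # False
```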


So lists may not be the best way to store data for most of our purposes. We can use the `numpy.array` function (a factory function) to create instances of the `numpy.ndarray` class, a multi-dimensional array whose elements all share the same data type and which supports high-performance operations. Both `SciPy` and `NumPy` have support for optimized functions and operations such as linear algebra.

```Python

# creates an instance of the numpy.ndarray class
numpy.array (object, dtype = None, copy = True, order = "K", subok = False, ndmin = 0) -> numpy.ndarray

```
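
To make the "same data type" property concrete, the following sketch shows how `numpy.array` picks a single `dtype` for mixed input. The variable names are illustrative, and the exact integer type depends on the platform, so it is not asserted here.

```python
import numpy

ints = numpy.array ([1, 3, 2, 5])                     # all integers
mixed = numpy.array ([1, 2.5, 3])                     # integers are upcast to float
text = numpy.array ([1, "a", 2.0])                    # everything becomes a string
floats = numpy.array ([1, 3, 2, 5], dtype = float)    # dtype requested explicitly

print (ints.dtype)       # an integer type (platform dependent width)
print (mixed.dtype)      # float64
print (text.dtype.kind)  # U (unicode string)
print (floats)           # [1. 3. 2. 5.]
```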

In [3]:
# let's create the same code as in R here as a vector
x: numpy.ndarray = numpy.array ([1, 3, 2, 5])
print (x)

x = numpy.array ([1, 6, 2])
print (x)

y = numpy.array ([1, 4, 3])
print (y)

[1 3 2 5]
[1 6 2]
[1 4 3]


When we want to know the dimensions of an array (an `ndarray`), we use the `shape` property of the class, which returns a tuple with one element per dimension containing that dimension's length. In R, with vectors, we simply use the `length()` function. In Python, it is better to use the `shape` property consistently, even for 1-dimensional arrays.

When we use an operator on 2 vectors in R, we get that operation over the elements of the vector, so that `x + y` becomes `1 + 1, 6 + 4, 2 + 3`. Python supports elementwise operations for ndarray objects, too.

**R Code**
```R
length (x)
[1] 3

length (y)
[1] 3

x + y
[1] 2 10 5 
```

In [4]:
# get the length of the first dimension of the array 
print ("length of x is {:d}".format (x.shape [0]))
print ("length of y is {:d}".format (y.shape [0]))

print ("x + y elementwise operation")
print (x + y)

length of x is 3
length of y is 3
x + y elementwise operation
[ 2 10  5]
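
One pitfall worth sketching here: the `+` operator on native lists concatenates instead of adding elementwise, which is a further reason to prefer `ndarray` for vector arithmetic. The example also contrasts `shape`, `len`, and `size` on a 2-dimensional array. All variable names are illustrative.

```python
import numpy

# native lists: + concatenates
list_sum = [1, 6, 2] + [1, 4, 3]
print (list_sum)   # [1, 6, 2, 1, 4, 3]

# ndarrays: + adds element by element
array_sum = numpy.array ([1, 6, 2]) + numpy.array ([1, 4, 3])
print (array_sum)  # [ 2 10  5]

# the list equivalent needs an explicit loop or comprehension
pairwise = [a + b for a, b in zip ([1, 6, 2], [1, 4, 3])]
print (pairwise)   # [2, 10, 5]

# shape reports every dimension, len only the first, size the element count
matrix = numpy.zeros ((2, 3))
print (matrix.shape)  # (2, 3)
print (len (matrix))  # 2
print (matrix.size)   # 6
```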


### Variables in Memory and Removal 

R allows us to inspect the variables defined in memory (the global environment) and remove them. Python has similar functionality but requires a little more work to separate functions and variables.

**R Code** 
```R
ls ()
### output ###
[1] "x" "y"

rm (x, y)
ls ()
### output ###
character (0) 

rm (list = ls ())
```

Python offers a couple of ways to inspect in-memory variables, but not quite as simply as R. You can use the `dir()` function to list the objects within a namespace, or `globals()` and `locals()` to list those scopes. All of them return more than just variables: they also return class definitions, modules, and functions.

In IPython (as in notebooks), we can use the built-in `%whos` magic command, which provides a more detailed list.

In [5]:
# get a list of tuples (variable name, type) from globals 
variables = list (map (lambda variableName: (variableName, str (type (globals () [variableName]))), dir ()))

# filter the list (as much as possible) to just variables
variables = list (filter (lambda variable: (
        not (variable[0].startswith ("_"))      # no private/protected variables
        and not (variable [0] in ["In", "Out"]) # IPython variables
        and not ("'IPython" in variable [1])    # no IPython 
        and not ("'module'" in variable [1])    # no modules/namespaces
        and not ("'method'" in variable [1])    # no methods
        and not ("'function'" in variable [1])  # no functions
        and not ("'typing"  in variable [1])  # no typing support
        ), variables))

# print variable list 
variables 

[('list1', "<class 'list'>"),
 ('list2', "<class 'list'>"),
 ('list3', "<class 'list'>"),
 ('x', "<class 'numpy.ndarray'>"),
 ('y', "<class 'numpy.ndarray'>")]

In [6]:
# remove variables from memory 
del variables, x, y
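
As a brief sketch of what `del` does: it removes the name, not necessarily the object. Using the name afterwards raises a `NameError`, while any other name bound to the same object keeps it alive. The variable names below are made up for the illustration.

```python
temporary = [1, 2, 3]
del temporary

# the name is gone, so referencing it raises NameError
try:
    print (temporary)
    still_defined = True
except NameError as error:
    still_defined = False
    print ("NameError:", error)

# del only unbinds a name; the object survives while another name refers to it
data = [1, 2, 3]
alias = data
del data
print (alias)  # [1, 2, 3]
```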

### Matrix

R has the `matrix()`, `as.matrix()`, and `is.matrix()` functions. Matrices are an extension of numeric and character vectors.

In Python, we will use `numpy.ndarray` for matrices. This allows multi-dimensional arrays of various data types, with every element of a given array sharing one data type.

To retrieve the documentation (generally the docstring, if available), we can use the `help()` function. In this example, we could use `help (numpy.ndarray)`. The output is not included here due to length requirements and `textOutputLimit` settings.

The matrix operations shown here are elementwise operations. There are operators and functions that match between R and Python (using functions in the `numpy` namespace). It is important to note, however, that in Python the `^` operator is the bitwise XOR, not the power operator, which is `**`.
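
The difference between `^` and `**` is easy to miss because `^` does not fail on integers; it quietly returns a different result. A minimal sketch with illustrative variable names:

```python
import numpy

print (2 ^ 3)   # 1, bitwise XOR of binary 10 and 11
print (2 ** 3)  # 8, the power operator

x = numpy.array ([1, 2, 3, 4])
xor_result = x ^ 2     # elementwise XOR, not a square
power_result = x ** 2  # elementwise square
print (xor_result)     # [3 0 1 6]
print (power_result)   # [ 1  4  9 16]

# on float arrays the XOR operator is not defined at all
try:
    numpy.array ([1.0, 2.0]) ^ 2
    xor_on_floats_failed = False
except TypeError as error:
    xor_on_floats_failed = True
    print ("TypeError:", error)
```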

**R Code**
```R
# display help about matrix function 
?matrix 

# initialize a 2x2 matrix with values by column (default)
x = matrix (data = c (1, 2, 3, 4), nrow = 2, ncol = 2) 

# initialize a 2x2 matrix with values by row (must be specified)
x = matrix (data = c (1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE) 

# square root the matrix, elementwise
sqrt (x)

# raise the matrix to the power of 2, elementwise
x^2
```


In [7]:
# help (numpy.ndarray)

# initialize a 2x2 matrix with values by column
# we will use the single dimension array, reshape it into 2x2 and change the default 
# order from `order = "C"` to `order = "F"`
x = numpy.array ([1, 2, 3, 4]).reshape (2, 2, order = "F")
print ("\ncolumn ordered matrix")
print (x)

# initialize a 2x2 matrix with values by row, allow the default order 
x = numpy.array ([1, 2, 3, 4]).reshape (2, 2)
print ("\nrow ordered matrix")
print (x)

# we can specify -1 in the reshape to specify an indeterminate length 
# indeterminate number of rows, fixed number of columns
x = numpy.arange (0, 10).reshape (-1, 2, order = "F")
print ("\nindeterminate row length")
print (x)

# fixed number of rows, indeterminate number of columns
x = numpy.arange (0, 10).reshape (2, -1)
print ("\nindeterminate column length")
print (x)


column ordered matrix
[[1 3]
 [2 4]]

row ordered matrix
[[1 2]
 [3 4]]

indeterminate row length
[[0 5]
 [1 6]
 [2 7]
 [3 8]
 [4 9]]

indeterminate column length
[[0 1 2 3 4]
 [5 6 7 8 9]]
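
Two details of `reshape` may be worth a short sketch. First, filling by column (`order = "F"`) gives the same result as filling the transposed shape by row and then transposing. Second, the `-1` placeholder only works when the remaining dimensions divide the number of elements evenly. The variable names are illustrative.

```python
import numpy

values = numpy.arange (0, 10)

# column-major fill equals row-major fill of the transposed shape, transposed
by_column = values.reshape (5, 2, order = "F")
via_transpose = values.reshape (2, 5).T
same = numpy.array_equal (by_column, via_transpose)
print (same)  # True

# -1 lets NumPy infer the missing dimension: 10 elements / 5 columns = 2 rows
inferred = values.reshape (-1, 5)
print (inferred.shape)  # (2, 5)

# 10 elements cannot be split into rows of 3, so this raises ValueError
try:
    values.reshape (-1, 3)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print ("ValueError:", error)
```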


In [8]:
# reset: initialize a 2x2 matrix with values by column (order = "F")
x = numpy.array ([1, 2, 3, 4]).reshape (2, 2, order = "F")
print ("\ncolumn ordered matrix")
print (x)

# matrix operations sqrt by element 
print ("\n sqrt of matrix, by element")
print (numpy.sqrt (x).round (2))

# matrix operation square by element
print ("\n square of matrix, by element using square function")
print (numpy.square (x).round (2))

# matrix operation square by element
print ("\n square of matrix, by element using power function")
print (numpy.power (x, 2).round (2))

# matrix operation square by element
print ("\n square of matrix, by element using * operator")
print ((x * x).round (2))

# matrix operation square by element
print ("\n square of matrix, by element using ** operator")
print ((x ** 2).round (2))



column ordered matrix
[[1 3]
 [2 4]]

 sqrt of matrix, by element
[[1.   1.73]
 [1.41 2.  ]]

 square of matrix, by element using square function
[[ 1  9]
 [ 4 16]]

 square of matrix, by element using power function
[[ 1  9]
 [ 4 16]]

 square of matrix, by element using * operator
[[ 1  9]
 [ 4 16]]

 square of matrix, by element using ** operator
[[ 1  9]
 [ 4 16]]
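
Since `*` and `**` work element by element, they do not compute the matrix product (which R writes as `%*%`). As a hedged aside that goes slightly beyond the R code in this lab, NumPy provides the `@` operator for the matrix product, and the sketch below contrasts the two on the same 2x2 matrix.

```python
import numpy

x = numpy.array ([1, 2, 3, 4]).reshape (2, 2, order = "F")  # [[1 3], [2 4]]

elementwise = x * x     # each entry multiplied by itself
matrix_product = x @ x  # rows times columns

print (elementwise)
# [[ 1  9]
#  [ 4 16]]
print (matrix_product)
# [[ 7 15]
#  [10 22]]
```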


### Random Numbers 

R provides a set of functions for generating random numbers based on a distribution. The naming convention for these functions is a prefix followed by the distribution name.

```
d{distribution-name} - probability density function (PDF)
p{distribution-name} - cumulative distribution function (CDF)
q{distribution-name} - quantile function (CDF inverse)
r{distribution-name} - random sample 
```

The `rnorm()` function generates random numbers from a normal distribution with the parameters `n` for sample size, `mean` for the mean, and `sd` for the standard deviation. The default is `mean = 0` and `sd = 1`. 

We can use the `set.seed` function to set a starting seed for generating random numbers, which allows us to generate the same exact numbers every time.

In Python, we can use the distributions in the `numpy.random` namespace to generate random samples. The equivalent of `rnorm` is `numpy.random.normal`, which supports the location (`loc`, the mean), the `scale` (the standard deviation), and the `size` of the sample.
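
The `d`/`p`/`q`/`r` prefixes have no direct counterpart in `numpy.random`, which only samples. One possible mapping, sketched here as an assumption rather than something the lab itself uses, is the `scipy.stats.norm` distribution object with its `pdf`, `cdf`, `ppf`, and `rvs` methods. The seed passed to `random_state` is arbitrary.

```python
import scipy.stats

# a "frozen" standard normal distribution (mean 0, standard deviation 1)
standard_normal = scipy.stats.norm (loc = 0, scale = 1)

density = standard_normal.pdf (0)        # like dnorm (0)
cumulative = standard_normal.cdf (1.96)  # like pnorm (1.96)
quantile = standard_normal.ppf (0.975)   # like qnorm (0.975)
sample = standard_normal.rvs (size = 5, random_state = 1303)  # like rnorm (5)

print ("pdf {:.4f}".format (density))     # pdf 0.3989
print ("cdf {:.4f}".format (cumulative))  # cdf 0.9750
print ("ppf {:.4f}".format (quantile))    # ppf 1.9600
print (sample.shape)                      # (5,)
```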

**R Code**
```R
x = rnorm (50)
y = x + rnorm (50, mean = 50, sd = 0.1)
cor (x, y)

set.seed (1303)
rnorm (50)
```

In [13]:
# sample the normal distribution 
x = numpy.random.normal (size = 50)
y = x + numpy.random.normal (loc = 50, scale = 0.1, size = 50)

# get the correlation matrix between the x and y variables
# and get the pearson correlation coefficient
print ("scipy correlation {:.3f}".format (scipy.stats.pearsonr (x, y)[0]))
print ("numpy correlation {:.3f}".format (numpy.corrcoef (x, y)[0, 1]))

# using pandas data frames
# correlation between 2 variables in a dataframe 
df = pandas.DataFrame ({ "x": x, "y": y })
print ("pandas correlation {:.3f}".format (df ["x"].corr (df ["y"])))

scipy correlation 0.992
numpy correlation 0.992
pandas correlation 0.992


In [10]:
# setting the global seed 
numpy.random.seed (1303)
numpy.random.normal (size = 50)

array([-0.03425693,  0.06035959,  0.45511859, -0.36593175, -1.6773304 ,
        0.5910023 ,  0.41090101,  0.46972388, -1.50462476, -0.70082238,
        1.43196963,  0.35474484,  1.67574682,  1.62741373,  0.27015354,
        0.15248539,  0.11593596,  0.89272237, -2.16627436,  0.26787192,
        0.36658207,  2.72335408,  0.44060293,  0.36036757,  0.38119264,
       -0.27845602,  1.73458476, -1.48138111, -0.47556927, -0.1932596 ,
        0.68115816, -0.05143463, -0.59151688,  0.02292374, -0.12259196,
        0.50633508,  0.63181139, -0.2443932 ,  0.39847385, -1.2716468 ,
        0.43167303, -1.36491646,  0.91004701,  0.65707308, -0.080445  ,
       -1.12057881, -1.31479423,  0.26394714, -0.59459381, -0.07624482])
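
To check that seeding really makes the draws repeatable, the sketch below reseeds and compares two samples. It also shows `numpy.random.default_rng`, NumPy's generator-based interface, as an alternative to the global seed; that part is an addition not used elsewhere in this lab. Note that the same seed value is not expected to reproduce the numbers R generates, since the two systems use different generation procedures.

```python
import numpy

# global seed: reseeding restarts the same sequence
numpy.random.seed (1303)
first = numpy.random.normal (size = 5)

numpy.random.seed (1303)
second = numpy.random.normal (size = 5)
print (numpy.array_equal (first, second))  # True

# generator objects keep their own state instead of relying on a global seed
rng_a = numpy.random.default_rng (1303)
rng_b = numpy.random.default_rng (1303)
draw_a = rng_a.normal (loc = 0.0, scale = 1.0, size = 5)
draw_b = rng_b.normal (loc = 0.0, scale = 1.0, size = 5)
print (numpy.array_equal (draw_a, draw_b))  # True
```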

### Mean, Variance, and Standard Deviation

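When dealing with arrays, we can use the `numpy` functions `mean`, `var`, and `std`. When the data is in a `pandas.DataFrame`, we can reference the individual column and use its `mean`, `var`, and `std` methods. Note that the two libraries use different defaults: NumPy divides by `n` (`ddof = 0`, the population variance), while pandas divides by `n - 1` (`ddof = 1`, the sample variance, which is also what R's `var` and `sd` compute). This is why the NumPy and pandas results below differ slightly.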

**R Code**
```R
set.seed (3)
y = rnorm (100)
mean (y)
var (y)
sqrt (var (y))
sd (y)
```

In [11]:
# set random seed and generate a 100 number sample
numpy.random.seed (3)
y = numpy.random.normal (size = 100)

# mean, variance, standard deviation
print ("numpy mean {:.3f}".format (numpy.mean (y)))
print ("numpy var  {:.3f}".format (numpy.var (y)))
print ("numpy sqrt (var) {:.3f}".format (numpy.sqrt (numpy.var (y))))
print ("numpy sd {:.3f}".format (numpy.std (y)))

numpy mean -0.109
numpy var  1.132
numpy sqrt (var) 1.064
numpy sd 1.064


In [12]:
# pandas version of mean, variance, standard deviation
df = pandas.DataFrame ({ "y": y }) # recreate data frame with just y to match
print ("pandas mean {:.3f}".format (df ["y"].mean ()))
print ("pandas var  {:.3f}".format (df ["y"].var ()))
print ("pandas sqrt (var)  {:.3f}".format (numpy.sqrt (df ["y"].var ())))
print ("pandas std  {:.3f}".format (df ["y"].std ()))

pandas mean -0.109
pandas var  1.144
pandas sqrt (var)  1.069
pandas std  1.069
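
The gap between the NumPy and pandas results comes from the divisor: `numpy.var` and `numpy.std` default to `ddof = 0` (divide by `n`), while pandas defaults to `ddof = 1` (divide by `n - 1`). A minimal sketch of how to get the sample statistics from NumPy, cross-checked against the standard library's `statistics.variance`:

```python
import statistics
import numpy

numpy.random.seed (3)
y = numpy.random.normal (size = 100)

population_var = numpy.var (y)         # divides by n (NumPy default)
sample_var = numpy.var (y, ddof = 1)   # divides by n - 1 (pandas and R default)
sample_sd = numpy.std (y, ddof = 1)

print ("numpy var (ddof = 0) {:.3f}".format (population_var))
print ("numpy var (ddof = 1) {:.3f}".format (sample_var))
print ("numpy sd  (ddof = 1) {:.3f}".format (sample_sd))

# the two variances differ exactly by the factor n / (n - 1)
n = y.shape [0]
print (numpy.isclose (sample_var, population_var * n / (n - 1)))  # True

# the standard library computes the sample variance as well
print (numpy.isclose (sample_var, statistics.variance (y.tolist ())))  # True
```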
