# Introduction to Python - Lecture 11 (May 7th 2020)
<br>

## ---------- Numpy & Pandas ---------

- easy-to-use data structures
- advantage for handling large and multivariate data
- performant tools for data analysis
- efficient for mathematical operations
- state of the art for all Data Science


<br>
The goal of this lecture is to give you an overview of the data structures accessible with Numpy and Pandas.

<br>

**All future lessons and most probably all your future projects will use these two modules!**
## ----------------------------------

<br>

- links for this class

Stack Overflow:
https://stackoverflow.com/c/nyumc-coding-courses/questions

Online Courses Page:
http://fenyolab.org/presentations/Bioinformatics_2020/

<br>
<br>

- some useful documentation

Numpy and scipy<br>                   
https://scipy-lectures.org/<br>
https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html<br>
<br>
Pandas<br>
https://pandas.pydata.org/docs/user_guide/index.html

<br>

- Tons of examples for data science based on Numpy and Pandas<br>

https://towardsdatascience.com/

https://realpython.com/

<br>


- A few more examples related to this lecture ...    <br>                

https://towardsdatascience.com/top-python-libraries-numpy-pandas-8299b567d955

https://towardsdatascience.com/10-python-pandas-tricks-that-make-your-work-more-efficient-2e8e483808ba

https://realpython.com/python-pandas-tricks/


---------

<br>

## Let's get ready ....

This lesson will cover two aspects of Numpy and Pandas modules:

1. How to use their data structures
 - arrays for Numpy
 - dataframes for Pandas

<br>
2. Examples from the large collection of their functions


They are not new languages but libraries that add data-oriented tools to Python.


-------

## cheat sheets

+ Numpy <br>                                  
https://ugoproto.github.io/ugo_py_doc/pdf/Numpy_Python_Cheat_Sheet.pdf

<br> 

+ Pandas <br>
https://ugoproto.github.io/ugo_py_doc/pdf/pandas-cheat-sheet.pdf
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

------
<br>


### Why do we need new data structures other than lists, dictionaries, or sets?


**Numpy (and its add-on Scipy)**

- NumPy extends Python, a high-level language, with tools for manipulating numerical data

- NumPy is a library for efficient array computations. Arrays differ from Python lists in the way they are stored and handled. Array elements are stored contiguously in memory, so they can be accessed quickly.

- NumPy provides vectorized mathematical functions. When you call `numpy.sin(a)`, the sine function is applied to every element of the array `a`. This is typically much faster than a Python for loop, and also faster than a list comprehension such as `[sin(i) for i in l]`

- NumPy also supports quick subindexing

- NumPy is optimized for **homogeneous datasets**: signal processing (convolution, Fourier transform, ...), analysis of multi-dimensional data, or simulations.
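
As a rough illustration of the vectorization point above, the sketch below applies the sine function to the same values with a list comprehension and with `np.sin`. The variable names and the array size are arbitrary choices for this example. Timings depend on your machine, so none are quoted here.

```python
import math
import numpy as np

values = [0.1 * i for i in range(1000)]   # plain Python list
arr = np.array(values)                    # same data as a NumPy array

# Python-level loop: math.sin is called once per element
sines_list = [math.sin(v) for v in values]

# Vectorized call: the loop runs inside NumPy's compiled code
sines_arr = np.sin(arr)

# Both approaches give the same numbers
print(np.allclose(sines_list, sines_arr))
```

In a notebook, running `%timeit` on each of the two computations is one way to compare their speed yourself.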


---------

**Pandas**

+ Pandas makes it easy to work with **Dataframes**, much as you would in **R** or **Matlab**. 

+ Pandas does a lot of things under the hood. You will get familiar with the functions and methods that it contains. 

+ Pandas is efficient to load or to save data

+ Pandas handles different types of data (**heterogeneous dataset**) and deals with incomplete data (n/a)

+ concatenate/merge/join different datasets

+ powerful way to parse/filter the data by column or by row and subindexing (a bit like Structured Query Language SQL).
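
The points above can be sketched with two tiny tables. The animal names and numbers are made up for illustration; only the pandas calls (`notna`, `merge`) are real API.

```python
import numpy as np
import pandas as pd

# Mixed column types, with one missing measurement (NaN)
animals = pd.DataFrame({
    "name": ["sparrow", "eagle", "mouse"],
    "kind": ["bird", "bird", "mammal"],
    "speed": [24.0, np.nan, 8.0],
})

# A second table that shares the "kind" column
legs = pd.DataFrame({"kind": ["bird", "mammal"], "legs": [2, 4]})

# SQL-like filter: keep the rows where speed is known
known = animals[animals["speed"].notna()]

# SQL-like join on the shared column
merged = animals.merge(legs, on="kind")
print(merged)
```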

--------

**It will pave the way for the use of other important modules for plots, statistics calculations or learning methods**

+ scipy
+ matplotlib
+ seaborn
+ SciKit-Learn

... with just a couple of lines of code!

In [50]:
## type your code here:

l=[[0,1,2,3],[2,3,4,5]]
print(l)
print(l[1][-1])

# d={}
# d['Mary']=[1,2,3]
# d['Peter']=[2,3,4]
# print(d)
# print(d['Peter'][2])

[[0, 1, 2, 3], [2, 3, 4, 5]]
5


# 1. Numpy

+ A package for scientific computing
+ A more powerful version of lists
+ All of the methods are optimized to run fast
+ Great for linear algebra

<br>

Numpy is not part of Python's standard library and needs to be installed.

This can be done using the conda command if you are using Anaconda:

```bash
conda install numpy
```
or
```bash
conda install --name itp_2020 numpy
```

Alternatively, this can be done using pip, Python's built-in package manager

```bash
pip install numpy
```


<br>

Once numpy is installed it then needs to be imported to use its functionality

```python
import numpy as np

```

In [3]:
## type your code here:

import numpy as np

### 1.1 Lists recap

Lists are created using '[]'

```python
l1 = [1, 2, 3, 4]
```

Lists can contain mixed types

```python
l2 = [1, 'a', {'abs': abs}, (1, 2)]
```


Lists can be joined using the + operator which creates a new list

```python
l3 = l1 + l2
```

Lists can be extended using the .extend() method which happens in place

```python
l1.extend(l2)
```

The range() function can be used to initialize numeric lists

```python
base = list(range(0, 101, 2))
```

To perform a calculation on every element of a list, a for loop (or a list comprehension) is required

```python
base_squared = []
for x in base:
    base_squared.append(x**2)
```

In [4]:
## type your code here:

l1 = [1, 2, 3, 4]
print(l1)
l2 = [1, 'a', {'abs': abs}, (1, 2)]
print(l2)
l3 = l1 + l2
print(l3)
l1.extend(l2)
print(l1)
print(l1[3:])

[1, 2, 3, 4]
[1, 'a', {'abs': <built-in function abs>}, (1, 2)]
[1, 2, 3, 4, 1, 'a', {'abs': <built-in function abs>}, (1, 2)]
[1, 2, 3, 4, 1, 'a', {'abs': <built-in function abs>}, (1, 2)]
[4, 1, 'a', {'abs': <built-in function abs>}, (1, 2)]


In [5]:
## type your code here:

base = list(range(0, 101, 2))
base_squared = []
for x in base:
    base_squared.append(x**2)

print(base)
print(base_squared)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100]
[0, 4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576, 676, 784, 900, 1024, 1156, 1296, 1444, 1600, 1764, 1936, 2116, 2304, 2500, 2704, 2916, 3136, 3364, 3600, 3844, 4096, 4356, 4624, 4900, 5184, 5476, 5776, 6084, 6400, 6724, 7056, 7396, 7744, 8100, 8464, 8836, 9216, 9604, 10000]


### 1.2 Creating arrays with Numpy

There are a number of ways of creating numpy arrays.

##### Converting a regular python list into a numpy array:

```python
var = np.array(< list >)
```

This is useful when the original list needs to be constructed from a file. Numpy arrays do not have a simple in-place method for appending elements. For this reason it is sometimes easier to build a normal python list before converting it to a numpy array.

```python
mylist=list(range(10))
myarray= np.array(mylist)
```
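
A minimal sketch of that workflow, with made-up values: collect the data in a list, convert once, and note that `np.append` returns a new array rather than growing the existing one.

```python
import numpy as np

# Collect values in a plain list first (cheap appends) ...
readings = []
for i in range(5):
    readings.append(i * 1.5)

# ... then convert once at the end
arr = np.array(readings)

# np.append exists, but it returns a new array and leaves the original unchanged
longer = np.append(arr, 99.0)
print(arr.shape, longer.shape)
```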


You can make an array of characters as well.
```python
characters = []
for i in range(32, 100):
    characters.append(chr(i))
print(characters)
np_char = np.array(characters)
print(np_char)
```


In [6]:
## type your code here:

mylist=list(range(10))
#mylist.pop()
myarray= np.array(mylist)

#mylist.pop()
#myarray.pop()

print(myarray)

[0 1 2 3 4 5 6 7 8 9]


### 1.3 Initializing arrays using numpy

1. Creating an array with *n* zeros

```python
lst = np.zeros(25)
lst_2d = np.zeros((5, 5))
```

2. Creating an array with *n* ones

```python
lst = np.ones(25)
lst_2d = np.ones((5, 5))
```

3. Using a range(start, end (exclusive), increment)

```python
lst = np.arange(10, 20, 2)
```

4. Linspace is similar to range - linspace(start, end (inclusive), number_of_elements) 

```python
lst = np.linspace(10, 20, 2)
```

5. Filling an array with random numbers in the interval [0, 1)

```python
lst = np.random.rand(5)
lst_2d = np.random.rand(5, 5)
```

6. Filling an array with samples from a distribution, here normal(mean, standard deviation, size)

```python
lst = np.random.normal(5, 2, 20)
lst_2d = np.random.normal(5, 2, (20,20))
```
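
The short sketch below contrasts the end-point behaviour of `arange` and `linspace`, and shows one way to make random draws repeatable. The seeded generator (`np.random.default_rng`) is an addition for this example and is not used elsewhere in the lecture.

```python
import numpy as np

print(np.zeros((2, 3)).shape)      # a 2x3 array of zeros
print(np.arange(10, 20, 2))        # end value 20 is excluded
print(np.linspace(10, 20, 5))      # end value 20 is included, 5 points in total

# Seeding a generator makes the random draws repeatable
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=5, scale=2, size=1000)
print(samples.mean(), samples.std())   # close to 5 and 2, but not exact
```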

In [21]:
## type your code here:

lst = np.zeros(25)
lst_2d = np.zeros((5, 5))
lst_2d = np.random.rand(5, 5)
print(lst_2d )

[[0.03176476 0.22493852 0.11702611 0.77617366 0.33418575]
 [0.92849337 0.36903751 0.69579402 0.49656674 0.19121759]
 [0.71586893 0.57530182 0.84146784 0.53517594 0.97561907]
 [0.76426939 0.53091586 0.3021782  0.48793313 0.81012863]
 [0.18545599 0.48244402 0.86812389 0.8008814  0.43474165]]


### 1.4 Numpy mathematical operations

Numpy simplifies numerical work, including linear algebra, in Python. Arithmetic operators such as `+` and `*` act element by element on arrays; matrix products use `@` or `.dot()`.

1. Addition
    1. Adding a scalar to a matrix does not follow normal mathematical rules as the scalar is 
    added to each object in the matrix
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} + C
        $
        $ =
            \begin{bmatrix} 
            a_{0,0} + C & a_{0,1} + C & \cdots & a_{0,n} + C \\
            a_{1,0} + C & a_{1,1} + C & \cdots & a_{1,n} + C \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} + C & a_{m,1} + C & \cdots & a_{m,n} + C \\
            \end{bmatrix}
        $

    ```python
    lst_2d = np.ones((n, m)) + C
    ```

In [23]:
## type your code here:

lst_2d = np.zeros((2, 3))



1. 
    2. Adding two numpy arrays element by element requires them to have the same shape, or shapes that numpy can broadcast to a common shape
  
    ```
    ```
        $
            \begin{bmatrix} 
            a_{0,0} & a_{0,1} & \cdots & a_{0,n} \\
            a_{1,0} & a_{1,1} & \cdots & a_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} & a_{m,1} & \cdots & a_{m,n} \\
            \end{bmatrix} + 
            \begin{bmatrix} 
            b_{0,0} & b_{0,1} & \cdots & b_{0,n} \\
            b_{1,0} & b_{1,1} & \cdots & b_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            b_{m,0} & b_{m,1} & \cdots & b_{m,n} \\
            \end{bmatrix} =
            \begin{bmatrix} 
            a_{0,0} + b_{0,0} & a_{0,1} + b_{0,1} & \cdots & a_{0,n} + b_{0,n} \\
            a_{1,0} + b_{1,0} & a_{1,1} + b_{1,1} & \cdots & a_{1,n} + b_{1,n} \\
            \vdots & \vdots & \ddots & \vdots \\
            a_{m,0} + b_{m,0} & a_{m,1} + b_{m,1} & \cdots & a_{m,n} + b_{m,n} \\
            \end{bmatrix}
        $

    ```python
    lst_2d = np.ones((n, m)) + np.ones((n, m))
    ```
    

In [25]:
## type your code here:

lst_2d = np.ones((2, 3)) + 1 #+ np.ones((2, 3))
lst_2d

array([[2., 2., 2.],
       [2., 2., 2.]])

In [9]:
## type your code here:

lst_2d = np.ones((2, 3)) * 5
lst_2d

array([[5., 5., 5.],
       [5., 5., 5.]])

In [10]:
## type your code here:

lst_2d = np.ones((2, 3)) * (np.ones((2, 3)) * 2)
lst_2d

array([[2., 2., 2.],
       [2., 2., 2.]])

In [11]:
## type your code here:

lst_2d = np.ones((6, 5)) + np.array([1, 2, 3, 2, 1])

#print(np.ones((6, 5)))

#print(np.array([1, 2, 3, 2, 1]))

print(lst_2d)

[[2. 3. 4. 3. 2.]
 [2. 3. 4. 3. 2.]
 [2. 3. 4. 3. 2.]
 [2. 3. 4. 3. 2.]
 [2. 3. 4. 3. 2.]
 [2. 3. 4. 3. 2.]]
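
The last cell works because of broadcasting: numpy stretches the smaller array to match the larger one when the shapes are compatible. The following sketch, with arbitrary example arrays, shows two shapes that broadcast and one that does not.

```python
import numpy as np

grid = np.ones((6, 5))
row = np.array([1, 2, 3, 2, 1])      # shape (5,): matches the last axis of grid
col = np.arange(6).reshape(6, 1)     # shape (6, 1): one value per row

print((grid + row).shape)            # row is repeated down the 6 rows
print((grid + col).shape)            # col is repeated across the 5 columns

# Shapes that cannot be aligned raise an error
try:
    grid + np.arange(6)              # shape (6,) does not match the last axis (5)
except ValueError as err:
    print("cannot broadcast:", err)
```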


### 1.5 Accessing elements/rows/columns

<br>

**Constructing a 2d array for demonstration purposes**


```python
lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))
```

This is a 2D array in which the first row contains all the odd numbers from 1 to 99 and the second row contains the even numbers from 100 down to 2.
<br>

In [12]:
## type your code here:

lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))
lst_2d

# a1=np.arange(1, 100, 2)
# a2=np.arange(100, 1, -2)
# l=np.array([a1,a2])

array([[  1,   3,   5,   7,   9,  11,  13,  15,  17,  19,  21,  23,  25,
         27,  29,  31,  33,  35,  37,  39,  41,  43,  45,  47,  49,  51,
         53,  55,  57,  59,  61,  63,  65,  67,  69,  71,  73,  75,  77,
         79,  81,  83,  85,  87,  89,  91,  93,  95,  97,  99],
       [100,  98,  96,  94,  92,  90,  88,  86,  84,  82,  80,  78,  76,
         74,  72,  70,  68,  66,  64,  62,  60,  58,  56,  54,  52,  50,
         48,  46,  44,  42,  40,  38,  36,  34,  32,  30,  28,  26,  24,
         22,  20,  18,  16,  14,  12,  10,   8,   6,   4,   2]])

### 1.6 Accessing individual values

1. Using standard list indexing
```python
# lst_2d[row_index][column_index]
lst_2d[1][2]
```

2. Numpy has more advanced indexing which allows both values to be specified together
```python
# lst_2d[row_index, column_index]
lst_2d[1, 2]
```

##### Retrieving a row

```python
# lst_2d[row_index, :]
lst_2d[0, :] # return an array of all the elements in the first row
lst_2d[1, :] # return an array of all the elements in the second row
```

##### Retrieving a column

```python
# lst_2d[:, column_index]
lst_2d[:, 5] # return an array of all the elements in the sixth column (index 5)
lst_2d[:, 20] # return an array of all the elements in the twenty-first column (index 20)
```
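
To check these indexing forms side by side, the sketch below rebuilds the demonstration array and prints the shape of each selection. The variable names `row`, `col` and `block` are chosen for this example only.

```python
import numpy as np

lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))

# Both indexing styles return the same element
print(lst_2d[1][2], lst_2d[1, 2])

row = lst_2d[0, :]       # first row, shape (50,)
col = lst_2d[:, 5]       # column at index 5, shape (2,)
block = lst_2d[:, 0:3]   # first three columns of both rows, shape (2, 3)
print(row.shape, col.shape, block.shape)
```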


In [31]:
## type your code here:
lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))
print(lst_2d)

lst_2d[:,0]

[[  1   3   5   7   9  11  13  15  17  19  21  23  25  27  29  31  33  35
   37  39  41  43  45  47  49  51  53  55  57  59  61  63  65  67  69  71
   73  75  77  79  81  83  85  87  89  91  93  95  97  99]
 [100  98  96  94  92  90  88  86  84  82  80  78  76  74  72  70  68  66
   64  62  60  58  56  54  52  50  48  46  44  42  40  38  36  34  32  30
   28  26  24  22  20  18  16  14  12  10   8   6   4   2]]


array([  1, 100])

### 1.7 Numpy functions 

#### Note - There are often two ways of using Numpy functions

1. np.<< function >>()
2. np_array.<< function >>()

#### Example

1. Using the np function
    ```python
lst = np.arange(10)
m = np.mean(lst)
    ```
2. Using the array method
    ```python
lst = np.arange(10)
m = lst.mean()
    ```
    
There are some functions which are only available as one of these two options. But when both are available they are generally interchangeable.
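
As one example of a function that exists only in the `np.` form, the sketch below uses `np.median`; arrays have a `.mean()` method but no `.median()` method.

```python
import numpy as np

lst = np.arange(10)

# Available both as a function and as a method
print(np.mean(lst) == lst.mean())

# np.median is available only as a function
print(np.median(lst))
print(hasattr(lst, "median"))
```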

In [32]:
## type your code here:

lst = np.arange(10)
m = np.mean(lst)
print(m)

lst = np.arange(10)
m = lst.mean()
print(m)

4.5
4.5


#### Min and Max

The names are self-explanatory: they return the minimum and maximum values of the array.


```python
lst = np.array([21, 19, 11, 17, 20])
print(lst.min())
print(lst.max())
```

In [15]:
## type your code here:

lst = np.array([21, 19, 11, 17, 20])
print(lst.min())
print(lst.max())

11
21


### 1.8 2D Arrays

1. Getting the min and max for the entire 2D array

```python
# How I generated the random 2D array
# lst = np.random.randint(5, 25, (5, 5))
lst = np.array(
      [[15, 21,  7, 21, 24],
       [13, 19, 21, 23,  7],
       [ 8, 17, 13, 21, 14],
       [11, 16, 20, 20, 12],
       [22,  5,  8, 23,  8]]
)

min_v = lst.min()
max_v = lst.max()

print('The min is {}'.format(min_v))
print('The max is {}'.format(max_v))
```

In [33]:
## type your code here:

lst = np.array(
      [[15, 21,  7, 21, 24],
       [13, 19, 21, 23,  7],
       [ 8, 17, 13, 21, 14],
       [11, 16, 20, 20, 12],
       [22,  5,  8, 23,  8]]
)

min_v = lst.min()
max_v = lst.max()

print('The min is {}'.format(min_v))
print('The max is {}'.format(max_v))

The min is 5
The max is 24


### 1.8.1       The axis argument in 2D Arrays
```
      [[15, 21,  7, 21, 24],  |
       [13, 19, 21, 23,  7],  |
       [ 8, 17, 13, 21, 14],  | axis = 0
       [11, 16, 20, 20, 12],  |
       [22,  5,  8, 23,  8]]  V
       
       ------------------->
               axis=1
```

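- axis 0 runs down the rows, so the operation collapses the rows and returns one result per column
- axis 1 runs across the columns, so the operation collapses the columns and returns one result per row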

If the data has a third dimension then axis 2 will apply the function in that dimension and so on...
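
A small sketch of the axis argument, using a 2x3 array so that the two directions give results of different lengths. The 3D array `cube` is a made-up example for the third axis.

```python
import numpy as np

lst = np.array([[15, 21, 7],
                [13, 19, 21]])     # 2 rows, 3 columns

print(lst.min(axis=0))   # collapses the rows: one value per column
print(lst.min(axis=1))   # collapses the columns: one value per row

cube = np.zeros((2, 3, 4))
print(cube.sum(axis=2).shape)   # the third axis is removed
```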

In [35]:
## type your code here:


lst = np.array(
      [[15, 21,  7, 21, 24],
       [13, 19, 21, 23,  7],
       [ 8, 17, 13, 21, 14],
       [11, 16, 20, 20, 12],
       [22,  5,  8, 23,  8]]
)

#lst = np.arange(25).reshape((5,5)).transpose()

print(lst)

min_v = lst.min(axis=1)
max_v = lst.max(axis=1)

#min_v = lst.min(axis=1)
#max_v = lst.max(axis=1)

print('The min for each row is {}'.format(min_v))
print('The max for each row is {}'.format(max_v))

[[15 21  7 21 24]
 [13 19 21 23  7]
 [ 8 17 13 21 14]
 [11 16 20 20 12]
 [22  5  8 23  8]]
The min for each row is [ 7  7  8 11  5]
The max for each row is [24 23 21 20 23]


In [18]:
## type your code here:

# indexing
lst[2:5]  # rows 2 to 4; indices in NumPy arrays start from 0
lst[2::2] # every second row from row 2 to the end
lst[::-1] # the rows in reverse order
lst[1:]   # from row 1 to the end
lst[0:4,0:4] # rows 0 to 3 and columns 0 to 3
lst[0:4,0:4:2] # rows 0 to 3, columns 0 and 2
#....

array([[15,  7],
       [13, 21],
       [ 8, 13],
       [11, 20]])

In [42]:
lst_2d = np.vstack((np.arange(1, 100, 2), np.arange(100, 1, -2)))
print(lst)
lst[0:4,0:4:2]

[[15 21  7 21 24]
 [13 19 21 23  7]
 [ 8 17 13 21 14]
 [11 16 20 20 12]
 [22  5  8 23  8]]


array([[15,  7],
       [13, 21],
       [ 8, 13],
       [11, 20]])

### 1.8.2 Example:  Min Max normalization

Subtract the minimum value from each element and divide it by the difference between the maximum and the minimum.

$$ \frac{ X - X_{min}}{X_{max} - X_{min}} $$

Using numpy it is easy to perform these types of calculations

```python
lst = np.array([21, 19, 11, 17, 20])
lst_norm = (lst - lst.min()) / (lst.max() - lst.min())
lst_norm
```

In [19]:
## type your code here:

lst = np.array([21, 19, 11, 17, 20])
lst_norm = (lst - lst.min()) / (lst.max() - lst.min())
lst_norm

array([1. , 0.8, 0. , 0.6, 0.9])

The same method will also work even if the data is 2 dimensional

```python
lst = np.array(
      [[15, 21,  7, 21, 24],
       [13, 19, 21, 23,  7],
       [ 8, 17, 13, 21, 14],
       [11, 16, 20, 20, 12],
       [22,  5,  8, 23,  8]]
)
lst_norm = (lst - lst.min()) / (lst.max() - lst.min())
lst_norm
```
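
The version above uses a single global minimum and maximum. If each column should instead be scaled on its own, one possible variation (an extension of the lecture example, shown on a smaller made-up array) combines the `axis` argument with broadcasting:

```python
import numpy as np

lst = np.array([[15, 21, 7],
                [13, 19, 21],
                [8, 17, 13]])

col_min = lst.min(axis=0)     # one minimum per column
col_max = lst.max(axis=0)     # one maximum per column

# Broadcasting applies each column's min and max to that column
lst_norm = (lst - col_min) / (col_max - col_min)
print(lst_norm.min(axis=0), lst_norm.max(axis=0))
```

This sketch assumes no column is constant; a constant column would make the denominator zero.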

In [20]:
## type your code here:

lst = np.array(
      [[15, 21,  7, 21, 24],
       [13, 19, 21, 23,  7],
       [ 8, 17, 13, 21, 14],
       [11, 16, 20, 20, 12],
       [22,  5,  8, 23,  8]]
)
lst_norm = (lst - lst.min()) / (lst.max() - lst.min())
lst_norm

array([[0.52631579, 0.84210526, 0.10526316, 0.84210526, 1.        ],
       [0.42105263, 0.73684211, 0.84210526, 0.94736842, 0.10526316],
       [0.15789474, 0.63157895, 0.42105263, 0.84210526, 0.47368421],
       [0.31578947, 0.57894737, 0.78947368, 0.78947368, 0.36842105],
       [0.89473684, 0.        , 0.15789474, 0.94736842, 0.15789474]])

#### Some additional functions to test on 2D arrays
<br>



```python
A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[3,4]])
A + B            # addition of two arrays
np.add(A, B)     # addition of two arrays (function form)
A * B            # elementwise product
A @ B            # matrix product
A.dot(B)         # matrix product (method form)
B.T              # transpose of B
A.flatten()      # 1-D copy of the array
B < 3            # boolean array, True where an element of B is less than 3
A.sum()          # sum of all elements of A
A.sum(axis=0)    # sum of each column
A.sum(axis=1)    # sum of each row
A.cumsum(axis=1) # cumulative sum along each row
A.min()          # min value of all elements
A.max()          # max value of all elements
np.exp(B)        # exponential
np.sqrt(B)       # square root
A.argmin()       # position of the min value in the flattened array
A.argmax()       # position of the max value in the flattened array
A[1,1]           # element at position (1,1)
np.concatenate((A, B))  # join arrays along the first axis
```

<br>

These are just a few of the functions which numpy offers. All of the available functions are listed in their documentation [https://docs.scipy.org/doc/numpy-1.15.1/reference/#]
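
Two of the calls listed above are worth a closer look. `argmin` returns a position in the flattened array, which `np.unravel_index` can convert back to a (row, column) pair, and a comparison such as `B < 3` can be used directly as a mask. The sketch below is illustrative only.

```python
import numpy as np

B = np.array([[2, 0], [3, 4]])

flat_pos = B.argmin()                            # position in the flattened array
row_col = np.unravel_index(flat_pos, B.shape)    # convert to (row, column)
print(flat_pos, row_col)

# Elementwise comparison gives a boolean array that can be used as a mask
print(B[B < 3])
```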


In [21]:
## type your code here:

A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[3,4]])

print(A)
print(B)

print(A * B)            # elementwise product
print(A @ B)            # matrix product

A.cumsum(axis=1)
np.exp(A)

[[1 1]
 [0 1]]
[[2 0]
 [3 4]]
[[2 0]
 [0 4]]
[[5 4]
 [3 4]]


array([[2.71828183, 2.71828183],
       [1.        , 2.71828183]])


-----------------------------------------------------------------------------------------
<br>


# 2. Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.


### 2.1 The History of Pandas

Pandas was created by Wes McKinney in 2008 while he was working for AQR Capital Management. He managed to convince the company to open source the project.

Chang She, who also worked for AQR joined the project in 2012 and is the second major contributor.

Pandas signed onto NumFOCUS, a nonprofit charity, in 2015 (https://en.wikipedia.org/wiki/Pandas_(software)).

A book written by Wes McKinney (Reviews indicate that the examples are not great):
+ McKinney, Wes. Python for data analysis : data wrangling with pandas, NumPy, and IPython. Sebastopol, CA: O'Reilly Media, Inc, 2018. Print.

This book covers using python for data science and includes some machine learning:
+ Vanderplas, Jacob T. Python data science handbook : essential tools for working with data. Sebastopol, CA: O'Reilly Media, Inc, 2016. Print.

There are also many excellent blog posts on pandas.

In [43]:
## type your code here:

import pandas as pd

df = pd.DataFrame([('bird', 389.0),
                   ('bird', 24.0),
                   ('mammal', 80.5),
                   ('mammal', np.nan)])

df

| |0|1|
|-|-|-|
|0|bird|389.0|
|1|bird|24.0|
|2|mammal|80.5|
|3|mammal|NaN|


### 2.2 Learning Pandas

#### Pandas Use Cases

Pandas' primary use case is data analysis.
It provides data structures and a ton of functions for quickly manipulating data.
Some example usage:
  + cleaning data (removing/imputing missing values)
  + transforming (changing the form of the dataframe)
  + visualizing (Creating plots that summarize the data)
  
  
 
#### Pandas has extensive documentation with examples
If you go for a few months without coding, it's hard to remember the little details. If you remember the function names, you can read up on them when you need them.

https://pandas.pydata.org/pandas-docs/stable/

I find it hard to navigate their documentation, so I Google 'pandas \< functionality \>' and choose the link to the pandas site.

Eg:
+ Google 'pandas merge'
+ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html


#### Use Cheat sheets!

### 2.3  How to install Pandas


- Pandas is an external Python Package
  + This means that it is not included with the Python Standard Libraries
  + It can be installed using conda (https://anaconda.org/anaconda/pandas)
    + `conda install -c anaconda pandas`
  + The latest version dropped support for Python 2
 
<br> 
- Importing Pandas

In order to use the functionality provided by pandas the package will need to be imported.
The standard practice is to name this import **pd**.


```python
import pandas as pd
```

In [23]:
## type your code here:
import pandas as pd

### 2.4 Creating a Dataframe

#### A dataframe is a collection of data where each row consists of a collection of observations.

#### There are many ways to create dataframes:

+ Converting a list to a dataframe (see above)


+ Converting a dictionary to a dataframe
    ```python
df = pd.DataFrame.from_dict( << dict >> )
    ```
+ Loading the data from a csv file
    ```python
df = pd.read_csv( << csv_path >> )
    ```
+ Load the data from a url
    ```python
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)
    ```
+ Load the data from an excel file
    ```python
df = pd.read_excel('tmp.xlsx', index_col=None, header=None, sheet_name='Sheet3') 
    ```
     Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html


Some of these examples are taken from the pandas documentation

(https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html#pandas.DataFrame.from_dict)
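
The sketch below builds the same small table in three of the ways listed above. The data values are placeholders, and an in-memory string stands in for a csv file so that the block runs without any file or network access.

```python
import io
import pandas as pd

# From a list of tuples, naming the columns explicitly
df_list = pd.DataFrame([("bird", 389.0), ("mammal", 80.5)],
                       columns=["kind", "speed"])

# From a dictionary: each key becomes a column
df_dict = pd.DataFrame.from_dict({"kind": ["bird", "mammal"],
                                  "speed": [389.0, 80.5]})

# read_csv accepts a path, a URL, or a file-like object;
# an in-memory string stands in for a file here
csv_text = "kind,speed\nbird,389.0\nmammal,80.5\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
```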

### 2.5 Columns and Indices

###### By default, each item in the dictionary will represent a column
```python
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
```

In [44]:
## type your code here:

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame.from_dict(data)
df

| |col_1|col_2|
|-|-|-|
|0|3|a|
|1|2|b|
|2|1|c|
|3|0|d|


###### This can be changed by setting the orient parameter to 'index' (the default is 'columns')
```python
data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data, orient='index')
```

In [25]:
## type your code here:

data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data, orient='index')

| |0|1|2|3|
|-|-|-|-|-|
|row_1|3|2|1|0|
|row_2|a|b|c|d|


###### The names of the columns can be set using the columns parameter
```python
data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data, orient='index',
                        columns=['A', 'B', 'C', 'D'])
```

In [26]:
## type your code here:

data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data, orient='index',
                        columns=['A', 'B', 'C', 'D'])

| |A|B|C|D|
|-|-|-|-|-|
|row_1|3|2|1|0|
|row_2|a|b|c|d|


###### Alternatively you can specify the column names in the dictionary
```python
data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
pd.DataFrame.from_dict(data, orient='index')
```

https://en.wikipedia.org/wiki/Tree_girth_measurement

In [45]:
## type your code here:

data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
pd.DataFrame.from_dict(data, orient='index')



| |girth|height|volume|
|-|-|-|-|
|Tree1|8.3|70|10.3|
|Tree2|8.6|65|10.3|
|Tree3|8.8|63|10.2|


**Note** Each row needs to have a unique identifier; in the above example this is represented by '**Tree#**'. By default this is an integer ranging from 0 to n-1. 
+ In the above example we can reset the index to be the integers using the reset_index() function.
```python
data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
pd.DataFrame.from_dict(data, orient='index').reset_index()
```
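
After `reset_index()` the old labels end up in a column named `index`. The sketch below shows one way to give that column a clearer name and to go back again with `set_index`. The column name `tree` is an arbitrary choice for this example.

```python
import pandas as pd

data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
}
df = pd.DataFrame.from_dict(data, orient='index')

# reset_index moves the labels into a regular column called 'index'
flat = df.reset_index().rename(columns={'index': 'tree'})
print(flat)

# set_index does the reverse and restores the labels as the row index
restored = flat.set_index('tree')
print(restored.index.tolist())
```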

In [28]:
## type your code here:

data = {
  'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
  'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
  'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
pd.DataFrame.from_dict(data, orient='index').reset_index()


| |index|girth|height|volume|
|-|-|-|-|-|
|0|Tree1|8.3|70|10.3|
|1|Tree2|8.6|65|10.3|
|2|Tree3|8.8|63|10.2|


###### Loading Data from a URL
```python
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)
df
```

And there are many csv files available from https://opendata.cityofnewyork.us/! 


In [56]:
## type your code here:

url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)   # URL or LocalFile
df.head()

| |Unnamed: 0|Girth|Height|Volume|
|-|-|-|-|-|
|0|1|8.3|70|10.3|
|1|2|8.6|65|10.3|
|2|3|8.8|63|10.2|
|3|4|10.5|72|16.4|
|4|5|10.7|81|18.8|


In [57]:
## type your code here:

df = df.drop(columns=['Unnamed: 0'])


### 2.6 Accessing values in the dataframe
```python
data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
df = pd.DataFrame.from_dict(data, orient='index')
```

In [60]:
## type your code here:

data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}

df = pd.DataFrame.from_dict(data, orient='index')
print(df)

       girth  height  volume
Tree1    8.3      70    10.3
Tree2    8.6      65    10.3
Tree3    8.8      63    10.2


#### Columns

##### Getting Column Names

+ To get a list of column names you can convert the dataframe into a list
```python
list(df) # returns a list of column names
# or
df.columns.values # returns a numpy array of column names
```

As with most of programming there are multiple methods of doing the same thing.
It is generally preferable to use the option that is easier to understand. In this case `df.columns.values` as this explicitly states what values are being extracted


In [61]:
## type your code here:

list(df)

df.columns.values

array(['girth', 'height', 'volume'], dtype=object)

##### Pulling a column out of a dataframe

+ Access columns using the column name in square brackets to return a series containing the data.
```python
df["column_name"]
```

+ A series is a 1D array of (index, value) pairs

In [63]:
## type your code here:

df['volume']

#df.volume
df['volume'].mean()

10.266666666666667

##### Subsetting a Dataframe using a List of Columns

Use a list of column names to extract those columns as a dataframe

```python
df[["column_1", "column_2"]]
```

A useful way to remember this is 1D indexing \[ \] returns a series and 2D indexing \[\[ \]\] returns a dataframe.

#### Example
Extracting the girth and height from the dataframe of tree measurements.

```python
df[["girth", "height"]]
```

Extracting the volume as a series (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)
```python
df["volume"]
```

Extracting the volume as a dataframe (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame)
```python
df[["volume"]]
```
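
The difference between the two bracket styles can be checked directly. The sketch below rebuilds the tree table so that it runs on its own, then prints the type and shape of each selection.

```python
import pandas as pd

df = pd.DataFrame({"girth": [8.3, 8.6, 8.8],
                   "height": [70, 65, 63],
                   "volume": [10.3, 10.3, 10.2]},
                  index=["Tree1", "Tree2", "Tree3"])

as_series = df["volume"]      # single brackets: Series
as_frame = df[["volume"]]     # double brackets: one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```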

In [66]:
## type your code here:

df[["girth", "height"]]

type(df[["volume","height"]])

#type(df["volume"])


pandas.core.frame.DataFrame

### 2.7 Pandas and Numpy

+ A package for scientific computing
+ A more powerful version of lists
+ All of the methods are optimized to run fast
+ Great for linear algebra, statistical analysis

Pandas makes use of a lot of Numpy functionality under the hood!

#### Converting Series and Dataframe types into Arrays

The `.values` attribute returns a numpy array containing the data.

For series type data the array will be 1-dimensional and for dataframes the array will be 2-dimensional.

```python
df["volume"].values
```

```python
df[["girth", "height"]].values
```
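
A short sketch of the conversion, using a reduced copy of the tree table. It also shows `to_numpy()`, the method form that the pandas documentation recommends over `.values`; the lecture itself uses `.values`.

```python
import pandas as pd

df = pd.DataFrame({"girth": [8.3, 8.6, 8.8],
                   "height": [70, 65, 63]})

col = df["girth"].values                 # 1-dimensional array
table = df[["girth", "height"]].values   # 2-dimensional array

print(col.shape, table.shape)

# Mixing float and int columns gives a single common dtype (float here)
print(table.dtype)

# to_numpy() returns the same data as .values
print((df.to_numpy() == table).all())
```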

In [35]:
## type your code here:

df[["girth", "height"]].values

array([[ 8.3, 70. ],
       [ 8.6, 65. ],
       [ 8.8, 63. ]])

#### Accessing rows from the dataframe

Rows are accessed using either **loc** or **iloc**

###### iloc
+ This will access rows depending on their integer index
+ The first row will have index 0
+ The next will have index 1, ...
+ To extract the first row you would use the following command
    + This will return a series containing the information from that row
```python
df.iloc[0]
```
+ To extract multiple rows you can pass a list of indices
    + This will return a dataframe containing the specified rows
    + The dataframe will be in the order of the indices
      + \[0, 1, 2\] - original order
      + \[2, 1, 0\] - reversed
      + \[0, 0, 1\] - repeats
      
```python
df.iloc[[0, 1, 2]]
```

In [68]:
## type your code here:

df
df.iloc[[2, 1, 0]]

| |girth|height|volume|
|-|-|-|-|
|Tree3|8.8|63|10.2|
|Tree2|8.6|65|10.3|
|Tree1|8.3|70|10.3|


###### loc

Various arguments will work with loc to extract rows from a dataframe

+ A single index label
    + Returns a series for that specific **row**
    ```python
df.loc["Tree2"]
    ```
+ A list of index labels
    + Returns a dataframe containing those **rows**
    ```python
df.loc[["Tree1", "Tree3"]]
    ```
+ A boolean list
    + Returns a dataframe for **rows** that are labeled true
      + The number of booleans should match the number of rows
      + row 0 - False
      + row 1 - True
      + row 2 - False
    ```python
df.loc[[False, True, False]]
    ```
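
To compare `iloc` and `loc` on the same data, the sketch below selects the same row and the same cell both ways. Passing a column selector after the comma is a small extension of the row-only examples above.

```python
import pandas as pd

df = pd.DataFrame({"girth": [8.3, 8.6, 8.8],
                   "height": [70, 65, 63],
                   "volume": [10.3, 10.3, 10.2]},
                  index=["Tree1", "Tree2", "Tree3"])

# Same row, selected by position and by label
print(df.iloc[1].equals(df.loc["Tree2"]))

# Both accessors also take a column selector after the comma
print(df.iloc[0, 1])              # first row, second column, by position
print(df.loc["Tree1", "height"])  # the same cell, by label

# A boolean list keeps the rows marked True
print(df.loc[[False, True, False]].index.tolist())
```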

In [73]:
## type your code here:

#print(df)


print(df.loc[["Tree1", "Tree3"]])


       girth  height  volume
Tree1    8.3      70    10.3
Tree3    8.8      63    10.2


### 2.8 Extracting data by value

Comparison operators can be applied to series objects (which are backed by numpy arrays).
For each value, the comparison returns either True or False.

**eg**
```python
pd.Series([1, 1, 1, 5, 5, 5]) > 3
# returns a boolean series: False, False, False, True, True, True
```

This is convenient as **.loc** can use an array of booleans to extract rows.


This allows for specific rows to be extracted from the dataframe depending on their value


In [38]:
## type your code here:

s=pd.Series(np.arange(10,100,10))

cond1=s>80
cond2=s<50

~(cond1 | ~cond2)

0     True
1     True
2     True
3     True
4    False
5    False
6    False
7    False
8    False
dtype: bool
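
The combined condition in the cell above can be read step by step. The sketch below rebuilds the same series and also writes the mask in an equivalent form (by De Morgan's law). The names `mask` and `same_mask` are only illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, 100, 10))  # 10, 20, ..., 90

cond1 = s > 80  # True only for 90
cond2 = s < 50  # True for 10, 20, 30, 40

mask = ~(cond1 | ~cond2)    # not (greater than 80, or not less than 50)
same_mask = ~cond1 & cond2  # equivalent: not greater than 80, and less than 50

print(mask.equals(same_mask))  # True
print(s[mask].tolist())        # [10, 20, 30, 40]
```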

+ Trees that are shorter than 70
    + `df["height"]` will return a series
    ```python
    df["height"]
    ```
    + `df["height"] < 70` will return a boolean series
    ```python
    df["height"] < 70
    # Tree1 False, Tree2 True, Tree3 True
    ```
    + We can then use this to extract those rows from the dataframe
    ```python
    df.loc[df["height"] < 70]
    ```
    

|name|girth|height|volume|
|-|-|-|-|
|Tree1|8.3|70|10.3|
|Tree2|8.6|65|10.3|
|Tree3|8.8|63|10.2|

In [39]:
## type your code here:

df.loc[df["height"] < 70]
# df[["height"]] < 70

| |girth|height|volume|
|-|-|-|-|
|Tree2|8.6|65|10.3|
|Tree3|8.8|63|10.2|


In [40]:
##### How would you get the rows where the volume is equal to 10.3?

In [41]:
df.loc[df["volume"] == 10.3]

| |girth|height|volume|
|-|-|-|-|
|Tree1|8.3|70|10.3|
|Tree2|8.6|65|10.3|


##### Combining conditions

Numpy has various bitwise operators which work element by element on boolean arrays (bitwise operations work on binary sequences, and a boolean array is a binary sequence)
```python
import numpy as np
```
+ **&**
    + This is equivalent to the **and** logical operator or intersection set operator
    + The resulting array will only be true where both conditions are true
    ```python
l1 = np.array([True, False])
l2 = np.array([True, True])
l1 & l2
    ```
+ **|**
    + This is equivalent to the **or** logical operator or union set operator
    + The resulting array will be true where any of the conditions is true
    ```python
l1 = np.array([True, False])
l2 = np.array([True, True])
l1 | l2
    ```
+ **~**
    + This is the **negation** operator and is equivalent to the **not** logical operator
    + The resulting True/False values will be flipped
    ```python
l1 = np.array([True, False])
~l1
    ```
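
The three operators above can be run together, along with a check of why the plain Python keyword `and` is not a substitute for `&` on arrays. This is a minimal sketch using the same `l1` and `l2` as above.

```python
import numpy as np

l1 = np.array([True, False])
l2 = np.array([True, True])

print(l1 & l2)  # [ True False]
print(l1 | l2)  # [ True  True]
print(~l1)      # [False  True]

# `and` asks for one truth value for the whole array, which is not
# defined for an array with more than one element, so it raises an error.
try:
    l1 and l2
except ValueError as err:
    print("ValueError:", err)
```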
    
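**When combining multiple conditions, each one should be put in its own parentheses**

This avoids ambiguity: **&** and **|** bind more tightly than comparison operators such as **==** and **<**, so without parentheses the expression is not grouped as intended. With parentheses, the boolean array for each condition is calculated first, then the results are combined.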

```python
l = np.array([1,1,1, 10,10,10])
# l % 2 == 0 & l < 10   # without parentheses: raises a ValueError
(l % 2 == 0) & (l < 10)
```
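
To see what goes wrong without the parentheses, the sketch below runs both versions of the expression. The variable name `even_and_small` is only illustrative.

```python
import numpy as np

l = np.array([1, 1, 1, 10, 10, 10])

# With parentheses: each comparison is evaluated first, then the results are combined
even_and_small = (l % 2 == 0) & (l < 10)
print(even_and_small.tolist())  # [False, False, False, False, False, False]

# Without parentheses Python reads the expression as  l % 2 == (0 & l) < 10,
# a chained comparison that needs a single truth value and so fails on arrays.
try:
    l % 2 == 0 & l < 10
except ValueError as err:
    print("ValueError:", err)
```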


In [42]:
## type your code here:

l1 = np.array([True, False])
l2 = np.array([True, True])
print(l1, l2)

[ True False] [ True  True]


In [43]:
l1 & l2

array([ True, False])

In [44]:
l = np.array([1,1,1, 10,10,10])
(l % 2 == 0) & (l < 10)

array([False, False, False, False, False, False])

###### Using this how can we extract all rows with height < 70 and volume equal to 10.3?

In [45]:
## type your code here:

height_cond = df['height'] < 70
volume_cond = df['volume'] == 10.3
df.loc[height_cond & volume_cond]

| |girth|height|volume|
|-|-|-|-|
|Tree2|8.6|65|10.3|


###### Using this how can we extract all rows except those with height < 70 and volume equal to 10.3?


In [46]:
## type your code here:

height_cond = df['height'] < 70
volume_cond = df['volume'] == 10.3
df.loc[~(height_cond & volume_cond)]

| |girth|height|volume|
|-|-|-|-|
|Tree1|8.3|70|10.3|
|Tree3|8.8|63|10.2|
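
The two exercises above can be put side by side, together with the **|** operator. This is a sketch that assumes a stand-in `df` rebuilt from the three tree rows shown in this lecture. The names `both`, `not_both`, `either` and `also_not_both` are only illustrative.

```python
import pandas as pd

# Stand-in for the lecture's dataframe (assumption: only the three rows shown in the notes)
df = pd.DataFrame(
    {"girth": [8.3, 8.6, 8.8], "height": [70, 65, 63], "volume": [10.3, 10.3, 10.2]},
    index=["Tree1", "Tree2", "Tree3"],
)

height_cond = df["height"] < 70
volume_cond = df["volume"] == 10.3

both = df.loc[height_cond & volume_cond]         # rows meeting both conditions
not_both = df.loc[~(height_cond & volume_cond)]  # every other row
either = df.loc[height_cond | volume_cond]       # rows meeting at least one condition

# Negating "A and B" selects the same rows as "not A or not B"
also_not_both = df.loc[~height_cond | ~volume_cond]

print(list(both.index))                # ['Tree2']
print(list(not_both.index))            # ['Tree1', 'Tree3']
print(list(either.index))              # ['Tree1', 'Tree2', 'Tree3']
print(not_both.equals(also_not_both))  # True
```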


### 2.9 Other functions

In [74]:
## type your code here:

_  = df['volume'].hist()

- save dataframe to a csv file

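`df.to_csv(path)`, where `path` is the name of the output file (called without a path, `to_csv` returns the CSV text as a string instead of saving it)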


- save dataframe to an excel file

`df.to_excel(path)`, where `path` is the name of the output file <br>
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
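
As a minimal sketch of saving and reloading, the cell below writes a stand-in `df` (rebuilt from the three tree rows shown in this lecture) to a CSV file and reads it back. The file name `trees.csv` is hypothetical, and a temporary folder is used so nothing is left on disk. `to_excel` is called the same way with a path, but it needs an Excel writer package such as openpyxl to be installed, so it is left out here.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the lecture's dataframe (assumption: only the three rows shown in the notes)
df = pd.DataFrame(
    {"girth": [8.3, 8.6, 8.8], "height": [70, 65, 63], "volume": [10.3, 10.3, 10.2]},
    index=["Tree1", "Tree2", "Tree3"],
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "trees.csv")  # hypothetical file name
    df.to_csv(path)                           # the index is written as the first column
    df_back = pd.read_csv(path, index_col=0)  # use that first column as the index again

print(list(df_back.index))    # ['Tree1', 'Tree2', 'Tree3']
print(list(df_back.columns))  # ['girth', 'height', 'volume']
```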


- apply a function to an entire dataframe

```python
print(df)
print(df.apply(lambda x: x*2))
```
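
By default `apply` passes one column at a time to the function, and with `axis=1` it passes one row at a time. The sketch below assumes a stand-in `df` rebuilt from the three tree rows shown in this lecture. The names `doubled`, `col_max` and `row_sum` are only illustrative.

```python
import pandas as pd

# Stand-in for the lecture's dataframe (assumption: only the three rows shown in the notes)
df = pd.DataFrame(
    {"girth": [8.3, 8.6, 8.8], "height": [70, 65, 63], "volume": [10.3, 10.3, 10.2]},
    index=["Tree1", "Tree2", "Tree3"],
)

doubled = df.apply(lambda x: x * 2)                # x is one column (a series) at a time
col_max = df.apply(lambda col: col.max())          # one result per column
row_sum = df.apply(lambda row: row.sum(), axis=1)  # one result per row

print(doubled.loc["Tree1", "height"])  # 140
print(list(col_max.index))             # ['girth', 'height', 'volume']
print(list(row_sum.index))             # ['Tree1', 'Tree2', 'Tree3']
```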

In [49]:
## type your code here:



## Summary in the cheat sheets

+ Numpy <br>     

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

<br> 

+ Pandas        <br>                      

http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

 <br>   

Many more here:  https://towardsdatascience.com/collecting-data-science-cheat-sheets-d2cdff092855

------
<br>
