# Lab 2

## Handling Data in Python

*Credit note:*

Part of this lab was adapted from a tutorial by [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) and [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) for Stanford University's `CS228`, which was itself adapted from the Python tutorial written by Justin Johnson for Stanford University's `CS231n` (http://cs231n.github.io/python-numpy-tutorial/).

## Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this [tutorial](http://wiki.scipy.org/NumPy_for_Matlab_Users) useful to get started with Numpy. If you are familiar with C++/Java, working with these arrays can be a big sigh of relief :-) 

To use Numpy, we first need to import the `numpy` package:

In [3]:
import numpy as np

What we are doing here with **as** is giving an "alias" to the numpy package that we are importing. From now on, we can just write _np_ instead of the full name _numpy_ when calling its functions.

You may check for its version...

In [4]:
np.__version__       # double underscore on both sides

'1.18.5'

### Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

In [None]:
a = np.array([1, 2, 3])  # Create a rank 1 array
print (type(a), a.shape, a[0], a[1], a[2])        # observe what is printed out
a[0] = 5                 # Change an element of the array
print(a)                # ..oh this is easier..               

In [None]:
b = np.array([[1,2,3],[4,5,6]])   # Create a rank 2 array
print(b)

In [None]:
print(b.shape)                   
print(b[0, 0], b[0, 1], b[1, 0])
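
As a quick check of the terminology above, the sketch below inspects the rank (`ndim`), the `shape`, and the shared element type (`dtype`) of an array. The variable names `m` and `mixed` are only illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative rank 2 array

print(m.ndim)    # number of dimensions (the "rank"): 2
print(m.shape)   # size along each dimension: (2, 3)
print(m.dtype)   # one element type shared by every value

# Mixing integers and floats still produces a single common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
```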

Numpy also provides many functions to create arrays in specific formats:

In [None]:
a = np.zeros((2,2))  # Create an array of all zeros. Note that the values are floats (printed as "0.")
print(a)

In [None]:
b = np.ones((1,2))   # Create an array of all ones
print(b)

In [None]:
c = np.full((2,2), 7) # Create a constant array
print(c) 

In [None]:
d = np.eye(2)        # Create a 2x2 identity matrix. eye literally means 'I'
print(d)

In [None]:
e = np.random.random((2,2)) # Create an array filled with random values
print(e)

### Array indexing

Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

In [None]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]
print(b)

Important note: A slice of an array is a "view" into the same data, so modifying it will modify the original array.

In [None]:
print(a[0, 1])  
b[0, 0] = 77    # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1]) 
print(a)
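
If an independent subarray is needed, one option is to call `.copy()` on the slice. The sketch below contrasts a view with a copy; the names `view` and `copy` are illustrative, and `np.shares_memory` is used only to confirm which arrays share data.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]          # a slice: shares memory with a
copy = a[:2, 1:3].copy()   # an independent copy of the same values

copy[0, 0] = -1            # changes only the copy
print(a[0, 1])             # still 2

view[0, 0] = 77            # changes a as well
print(a[0, 1])             # now 77

print(np.shares_memory(a, view), np.shares_memory(a, copy))  # True False
```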

You can also mix integer indexing with slice indexing. However, doing so will yield an array of lower rank than the original array. Note that this is quite different from the way that MATLAB handles array slicing:

In [None]:
# Create the following rank 2 array with shape (3, 4)
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)

Two ways of accessing the data in the middle row of the array.
Mixing integer indexing with slices yields an array of lower _rank_,
while using only slices yields an array of the same _rank_ as the
original array (Note that "rank" here is slightly different from the matrix rank you learn in linear algebra. Read: https://www.pythoninformer.com/python-libraries/numpy/anatomy-of-array/):

In [None]:
row_r1 = a[1, :]    # Rank 1 view of the second row of a  
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
row_r3 = a[[1], :]  # Rank 2 view of the second row of a
print (row_r1, row_r1.shape) 
print (row_r2, row_r2.shape)
print (row_r3, row_r3.shape)

In [None]:
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print (col_r1, col_r1.shape)
print ()
print (col_r2, col_r2.shape)

**Integer array indexing**: When you index into numpy arrays using slicing, the resulting array view will always be a subarray of the original array. In contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array. Here is an example:

In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])

# An example of integer array indexing.
# The returned array will have shape (3,).
print (a[[0, 1, 2], [0, 1, 0]])       # ... fancy!
print (a.shape)
print()

# The above example of integer array indexing is equivalent to this:
print (np.array([a[0, 0], a[1, 1], a[2, 0]]))

In [None]:
# When using integer array indexing, you can reuse the same
# element from the source array:
print (a[[0, 0], [1, 1]])

# Equivalent to the previous integer array indexing example
print (np.array([a[0, 1], a[0, 1]]))

One useful trick with integer array indexing is selecting or mutating one element from each row of a matrix:

In [None]:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print (a)

In [None]:
# Create an array of indices
b = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
print (a[np.arange(4), b])       # prints "[ 1  6  7 11]"

In [None]:
# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10
print (a)
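
Unlike a slice, the result of reading with integer array indexing is a new array, so changing it leaves the source untouched. A minimal sketch (the name `picked` is illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

picked = a[[0, 1, 2], [0, 1, 0]]   # integer array indexing -> new array
picked[0] = 100                    # does not touch a

print(picked)      # values 100, 4, 5
print(a[0, 0])     # still 1
```

Assigning through the index expression itself, as in `a[np.arange(4), b] += 10` above, does modify the original array.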

**Boolean array indexing**: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

In [None]:
import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print (bool_idx)

In [None]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print (a[bool_idx])

# We can do all of the above in a single concise statement:
print (a[a > 2])
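
Conditions can also be combined, and a boolean mask can appear on the left-hand side of an assignment. The following is a small sketch of both ideas; the name `mid` is illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions element-wise with & (and) or | (or);
# the parentheses around each comparison are required.
mid = a[(a > 2) & (a < 6)]
print(mid)            # [3 4 5]

# A boolean mask can also select the elements to overwrite
a[a > 4] = 0
print(a)              # the last row is now all zeros
```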

There are many other useful functions in Numpy such as `mean`, `min`, `max` which can help us find useful information from data. Explore here: https://docs.scipy.org/doc/numpy/reference/routines.statistics.html
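
As a brief illustration, these functions can summarise the whole array or work along one axis. In the sketch below, `axis=0` works down the columns and `axis=1` works across the rows.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.mean())          # mean of all elements: 3.5
print(a.min(), a.max())  # 1 6
print(a.mean(axis=0))    # mean of each column: [2.5 3.5 4.5]
print(a.max(axis=1))     # max of each row: [3 6]
```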

## Getting Data

You can also explicitly read from and write to files directly in your code. Python makes working with files pretty simple. Let's first try with a text file: obtain a *file object* using `open`, then read the contents of the file.

In [None]:
#To open a text file, use: 
fh = open("hello.txt", "r") 

#To read the whole file through the open file object, use: 
print(fh.read()) 
print()
fh.close() 

#To read one line at a time, use: 
fh = open("hello.txt", "r") 
print(fh.readline()) 
fh.close() 

#To read a list of lines use: 
fh = open("hello.txt", "r") 
print(fh.readlines()) 
fh.close()

We can also attempt to write some content to the file:

In [None]:
#To write to a file, use: 
fh = open("hello2.txt","w") 
fh.write("Hello World") 
fh.close() 

#To write a list of lines to a file, use: 
fh = open("hello2.txt", "w") 
lines_of_text = ["a line of text", "another line of text", "a third line"] 
lot = [line + '\n' for line in lines_of_text]       # this is necessary to ensure we go to next line. It is not auto!
fh.writelines(lot) 
fh.close() 

#To append to file, use: 
fh = open("hello2.txt", "a") 
fh.write("Hello World again") 
fh.close()      # the parentheses are needed to actually call close

fh = open("hello2.txt", "r") 
print(fh.read())
fh.close()
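
Calling `close()` by hand is easy to forget. A common alternative is the `with` statement, which closes the file automatically when the block ends. The sketch below writes to a temporary folder so that no existing file is touched; the file name `demo.txt` is hypothetical.

```python
import os
import tempfile

# Work in a temporary folder so no existing file is overwritten
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")   # hypothetical file name

    with open(path, "w") as fh:
        fh.write("first line\n")
        fh.write("second line\n")
    # fh is closed here, even if an error had occurred inside the block

    with open(path, "r") as fh:
        lines = [line.rstrip("\n") for line in fh]  # iterate line by line

print(lines)        # ['first line', 'second line']
print(fh.closed)    # True
```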

Another commonly used data file is a comma-separated values, or **CSV**, file. CSV files contain lines (or rows) whose individual pieces of data are separated by commas.

![](http://www.manifold.net/doc/mfd9/images/eg_formats_csv01_01.png)

Of course, the position of each piece of data should be consistent across lines: the data before the first comma should carry the same kind of information on every line, as should the data before the second comma, and so on.

You can also open CSV files in spreadsheet editors such as MS Excel, and view them in the form of columns.

In [None]:
import csv

The following code opens a CSV file and reads all of the data into a list of rows. Then a loop goes through each row (itself a list) in `csv_reader` and prints it. Observe the output. Compare with the same data when opened in a spreadsheet editor.

In [None]:
with open('clinic1M.csv') as csv_file:
    csv_reader = list(csv.reader(csv_file, delimiter=','))
    # ...puts each line of data from csv into a list
    for row in csv_reader:     
        print(row)
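
One detail worth noticing is that `csv.reader` returns every field as a string, including numbers. The sketch below uses made-up CSV text held in memory (via `io.StringIO`) rather than `clinic1M.csv`; the column names `item` and `price` are hypothetical.

```python
import csv
import io

# Hypothetical CSV content, held in memory instead of a file
text = "item,price\napple,1.50\nbanana,0.25\n"

rows = list(csv.reader(io.StringIO(text), delimiter=','))
header, data = rows[0], rows[1:]
print(header)      # ['item', 'price']
print(data[0])     # ['apple', '1.50']  (every field is a string)

# Convert the numeric column explicitly before doing arithmetic
total = sum(float(row[1]) for row in data)
print(total)       # 1.75
```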

**Quick Question 1**: Write some code (a) to extract all the names of the clinics from the data into another new list. (You can use list comprehension!), (b) count the total number of clinics 

In [None]:
with open('clinic1M.csv') as csv_file:
    csv_reader = list(csv.reader(csv_file, delimiter=','))
    # fill in your code
    
    

Once you are done with the question above, let's try to write the extracted clinic names data to a new CSV file called "`clinic_list.csv`". The following code does just that. Check the contents of the CSV file to ensure the data is intact.

In [None]:
with open('clinic_list.csv', mode='w', newline='') as clinic_list:
    clinic_writer = csv.writer(clinic_list, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for n in clinic_names:
        clinic_writer.writerow([n])

For more information, look up the documentation: https://docs.python.org/3/library/csv.html
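
To see what `quotechar` and `csv.QUOTE_MINIMAL` do, the sketch below writes two made-up names to an in-memory buffer instead of a file. Only the value that contains the delimiter gets wrapped in quotes, and reading it back recovers the original text.

```python
import csv
import io

buffer = io.StringIO()   # in-memory stand-in for an open file
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["Plain Clinic"])            # hypothetical names
writer.writerow(["Smith, Jones and Co"])     # contains the delimiter

print(buffer.getvalue())   # the second name is written inside double quotes

# Reading it back recovers the original values
buffer.seek(0)
names = [row[0] for row in csv.reader(buffer)]
print(names)               # ['Plain Clinic', 'Smith, Jones and Co']
```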

## Pandas

Pandas is a powerful library written for Python for data manipulation and analysis. Its name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

**Note**: Pandas needs Numpy to be installed.

In [None]:
import pandas as pd

_Series_ and _DataFrame_ are two important structures in Pandas that you will use frequently, often together.

#### Series

A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

In [None]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!']) 
print(s)

The numbers in the first "column" (which is not actually a column of the data) are the index labels of the items. The other column holds the actual data.

Alternatively, you can specify an index of your choice when creating the Series, but it isn't very intuitive.

In [None]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'], index=['A', 'Z', 'C', 'Y', 'E']) 
print(s)

You can use `iloc` and `loc` to access data. `iloc` takes an integer position (0 for the first item, regardless of the labels), which may not be very intuitive. `loc` returns the data based on the index label.

In [None]:
s.iloc[4]

In [None]:
s.loc['E']
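
The difference between position and label is easiest to see when the labels are themselves integers. A minimal sketch with made-up values (the Series `t` is illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['A', 'Z', 'C'])  # illustrative values

print(s.iloc[1])       # by position: second item -> 20
print(s.loc['Z'])      # by label 'Z' -> 20

# With integer labels the two can differ
t = pd.Series([10, 20, 30], index=[2, 0, 1])
print(t.iloc[0])       # first position -> 10
print(t.loc[0])        # label 0 -> 20
```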

The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index. How convenient! Pandas structures behave much like dictionaries in this respect, as you will gradually notice.

In [None]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100, 'Austin': 450, 'Boston': None} 
cities = pd.Series(d) 
print(cities)

Note: A value of `None` shows up as "NaN" (Not a Number), which Pandas uses to mark missing data. Think "Null" in other languages you are familiar with.

You can use the index (or a bunch of indices) to select specific items from the Series.

In [None]:
cities[['Chicago', 'Portland', 'San Francisco']]

Think dictionaries...use a key to get the value

In [None]:
cities['Chicago']

`cities < 1000` returns a Series of True/False values, which we then pass to our Series `cities` as an index, returning only the items where the value is True.

**Recall**: We came across this form of boolean indexing earlier when dealing with Numpy. Looks like this also works in Pandas.

In [None]:
less_than_1000 = cities < 1000 
print(less_than_1000) 
print() 
print(cities[less_than_1000])

In [None]:
print('Seattle' in cities) 
print('San Francisco' in cities)

All these are quick ways of filtering through the data to retrieve or search for something. If you want to find cities whose values are less than 1000, you may use boolean indexing. If you want to check whether a specific city exists in the index, you can use the `in` operator.

Mathematical operations can be done using scalars and functions.

In [None]:
# square city values 
np.square(cities)

You can add two Series together, which returns a union of the two Series with the addition occurring on the shared index values. Values on either Series that did not have a shared index will produce a NULL/NaN (not a number).

In [None]:
print(cities[['Chicago', 'New York', 'Portland']])
print() 
print(cities[['Austin', 'New York']]) 
print() 
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])
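
The alignment rule can be checked on two small Series. The sketch below uses made-up Series `x` and `y`, and also shows `Series.add` with `fill_value=0`, one possible way to treat a missing entry as zero instead of producing NaN.

```python
import numpy as np
import pandas as pd

x = pd.Series({'Chicago': 1000, 'New York': 1300})
y = pd.Series({'New York': 1300, 'Austin': 450})

total = x + y
print(total)
# Austin and Chicago appear in only one Series, so they become NaN;
# New York appears in both, so its values are added (2600.0).

# Series.add with fill_value treats a missing entry as 0 instead
filled = x.add(y, fill_value=0)
print(filled)
```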

NULL checking can be performed with `isnull` and `notnull`.

In [None]:
print(cities.isnull()) 
print() 
print(cities[cities.isnull()])
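
`notnull` works the same way in the opposite direction, and summing the result of `isnull` counts the missing entries. A short sketch on a made-up two-item Series:

```python
import pandas as pd

s = pd.Series({'Austin': 450, 'Boston': None})

print(s.dtype)              # float64 here, because NaN is a floating-point value
print(s.isnull().sum())     # number of missing entries: 1
print(s[s.notnull()])       # only the entries that have a value
```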

#### DataFrame

A DataFrame is similar to an Excel worksheet: you have column names referring to *columns*, and you have *rows*, which can be accessed by row number. The essential difference is that, in a DataFrame, the column names and row numbers are known as the column index and row index.

Pandas actually makes things even more convenient. Reading a CSV is as simple as calling the `read_csv` function.

In [None]:
df = pd.read_csv('clinic1M.csv')      # or you may use any CSV file

We can use the `head` or `tail` method to print the first/last N rows of our data frame. We could accomplish the same thing using the `iloc` method, which allows us to retrieve rows and columns by position. To do that, we need to specify the positions of the rows that we want, and the positions of the columns that we want as well.

In [None]:
print(df.head()) #prints first N rows. Default: 5 

In [None]:
print(df.tail()) #prints last N rows. Default: 5

In [None]:
print(df.iloc[0:5,:])   # SLICING: accesses first 5 rows of data, and all columns

In [None]:
print(df.iloc[0:5,1:3]) # SLICING: first 5 rows of data, and 2 selected columns
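
If `clinic1M.csv` is not at hand, the same calls can be tried on a small DataFrame built in code. The data and the column names (`name`, `city`, `beds`) below are hypothetical.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'city': ['X', 'Y', 'X', 'Z'],
    'beds': [10, 20, 30, 40],
})

print(toy.head(2))          # first 2 rows
print(toy.iloc[0:2, :])     # the same rows, selected by position
print(toy.iloc[0:2, 1:3])   # rows 0-1, columns at positions 1 and 2
print(toy.iloc[-1, 0])      # last row, first column -> 'D'
```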

The Pandas library is huge. You can have a look at the DataFrame reference here: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html

Some attributes (properties) are quite useful for getting information about the dataframe.

In [None]:
df.columns

In [None]:
df.values

In [None]:
df.dtypes

The `dtypes` property tells us the data type of each column. Notice that string data is typically denoted as "object". That is because each string is stored as a general Python object rather than as a fixed-size numeric type.

In [None]:
df.shape

Just like Numpy, `shape` gets us the dimensions of the data. Think of dataframes like a two-dimensional array, the first number is the number of rows, and the second number is the number of columns.

Now, what if we want to access a particular slice of the data but cannot remember the indices? (Numbers are difficult to remember and unintuitive, since they might change when the data changes.) Slicing an entire column out of a dataframe by name is simple:

In [None]:
sr = df['city']
print(sr)
type(sr)

**Note**: Observe closely that this sliced-out column is actually...a Series! We can consider the column name as the key or index used to access the Series. So, anatomically, you can think of a DataFrame (table) as consisting of many Series (columns). Here's a nice labelled diagram showing this concept:

![](https://geo-python.github.io/site/_images/pandas-structures-annotated.png)
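
A related detail is that the kind of brackets decides what comes back. In the sketch below (made-up data), a single column name gives a Series, while a list of names gives a DataFrame even when the list has only one entry.

```python
import pandas as pd

toy = pd.DataFrame({'city': ['X', 'Y'], 'beds': [10, 20]})  # hypothetical data

col = toy['city']        # single brackets -> Series
sub = toy[['city']]      # list of names -> DataFrame with one column

print(type(col).__name__, col.shape)   # Series (2,)
print(type(sub).__name__, sub.shape)   # DataFrame (2, 1)
```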

In [None]:
len(df)

Let's create a new DataFrame, with this Series `sr` in it... (very easy)

In [None]:
fr = pd.DataFrame(sr)
fr

Let's generate more Series data and add them to the DataFrame...

In [None]:
x = [np.random.randint(500, 1500) for p in range(0, len(fr))]
fr['patients'] = pd.Series(x)
fr['ratings'] = pd.Series(np.random.randn(len(fr)))

In [None]:
fr

We can remove rows and columns from a DataFrame using the `drop` function. To remove rows, just specify (in a list) the index labels of the rows that you want dropped.

In [None]:
fr2 = fr.drop([50, 51])
fr2

Rows with index 50 and 51 have now been removed. Data removal can be a very serious thing, and a mistake might be costly. By default, `drop` works in a safe way: it returns a new dataframe and leaves the original untouched. Notice that the rows were only dropped in the new dataframe `fr2`. The original dataframe `fr` remains the same.

In [None]:
fr
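
The same behaviour can be confirmed on a tiny made-up DataFrame. Note also that the remaining rows keep their original index labels; they are not renumbered.

```python
import pandas as pd

toy = pd.DataFrame({'city': ['X', 'Y', 'Z']})   # hypothetical data

smaller = toy.drop([0, 1])      # returns a new DataFrame
print(len(toy), len(smaller))   # 3 1
print(list(smaller.index))      # [2]  (remaining labels are not renumbered)
```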

If you ever need to make permanent changes (overwriting the original data), you can use the parameter `inplace=True` to perform this drop operation "in place", in the original dataframe.

**Quick Question 2**: Write the code (one line only!) to drop the "patients" column. Ensure that the modification is done on the original dataframe. 

In [None]:
# fill in here



In [None]:
fr2

We can also explore the data (getting some summary statistics) by using the `describe` method:

In [None]:
print(df.describe())

Be careful with using `describe()` if your data has not been cleaned yet! It can often give you misleading results.
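
As one possible illustration, suppose a column uses `-999` as a placeholder for a missing reading (an assumption made up for this sketch, not something taken from `clinic1M.csv`). The summary statistics change considerably once the placeholder is filtered out.

```python
import pandas as pd

# Hypothetical column in which -999 was used to mark a missing reading
raw = pd.Series([10, 12, 11, -999, 13])

print(raw.describe())            # mean and min are distorted by -999

clean = raw[raw != -999]
print(clean.describe())          # statistics for the valid readings only
```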

That's all for now. We will do more handling and manipulation of data in the next lab.