# Lab 1: Basic Data Analysis

The purpose of this lab is to get you started running Jupyter notebooks and familiar with loading and analyzing data in a notebook. In particular, we will explore the basics of NumPy and Pandas by exploring some public policy datasets from the Chicago Data Portal. Some of the examples below are adapted from *Machine Learning with Python Cookbook* by Chris Albon.

Parts 0, 1 & 2 involve running provided code to become familiar with the libraries. Don't rush through these parts because the code already works. Instead, make sure you understand what is happening and how to replicate it if you were given similar data without the provided code. Part 3 involves writing your own code to explore a new dataset.



# 0. Jupyter Notebooks

Elements in a notebook are divided into cells, which might be markdown text (such as this cell) or code (such as the next cell). You can execute either type of cell by clicking on it and pressing "Shift + Enter". Execute the following cell to run the "Hello World" program.

In [None]:
print("Hello, World")

You can also define functions and variables in cells. Once you have run the cell, the functions and variables are available to any cell in the notebook, *even cells above where the variables or functions were defined.* This can be a source of bugs if you start running cells out of order and forget that a later cell overwrote a function or variable value. If you want to reset Python and remove all defined values, use the Kernel > Restart Kernel option from the menu bar. 

# 1. NumPy


NumPy is the foundation of the Python machine learning stack. It allows for efficient operations on the data structures often used in machine learning: vectors, arrays, and matrices.

This section covers the most common NumPy operations you are likely to run into when preparing data for ML:

* 1.1 Creating a Vector
* 1.2 Creating an Array or Matrix
* 1.3 Creating a Sparse Matrix
* 1.4 Selecting Elements in an Array
* 1.5 Describing an Array
* 1.6 Applying Operations to Elements
* 1.7 Calculating Basic Statistics
* 1.8 Reshaping Arrays
* 1.9 Other Array Operations


### 1.1 Creating a Vector

Use NumPy's built-in array type to create a one-dimensional vector.

In [None]:
# load library
import numpy as np

# create a row vector
vector_row = np.array([1, 2, 3])

# create a column vector
vector_column = np.array([[1],
                          [2],
                          [3]])

# display
print(vector_row)
print(vector_column)

### 1.2 Creating an Array or Matrix

To create a 2D or higher-dimensional array, we can also use NumPy's array type. In the code below, the array (matrix) contains three rows and two columns (a column of 1s and a column of 2s).

NumPy also has a built-in `matrix` data type separate from `array`. However, the matrix data type is not recommended for two reasons. First, `array` is the de facto standard data structure of NumPy. Second, the vast majority of NumPy operations return arrays, not matrix objects.

In [None]:
# create a matrix
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])

# display
matrix

### 1.3 Creating a Sparse Matrix
Sparse matrices allow you to efficiently represent data with very few nonzero values.

In [None]:
# load libraries
from scipy import sparse

# create a matrix
matrix = np.array([[0, 0],
                  [0, 1],
                  [3, 0]])

# create compressed sparse row (CSR) matrix
matrix_sparse = sparse.csr_matrix(matrix)

A frequent situation in machine learning occurs when you have a huge amount of data but most of the elements in the data are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows. However, since most users do not watch most movies, the vast majority of elements would be zero.

Sparse matrices only store nonzero elements and assume all other values will be zero, leading to significant computational savings. In the above code, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:

In [None]:
# view sparse matrix
print(matrix_sparse)

There are a number of ways to represent sparse matrices. In the compressed sparse row (CSR) output above, (1, 1) and (2, 0) are the (zero-indexed) row and column indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column. We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare this larger matrix with our original sparse matrix:

In [None]:
# create larger matrix
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# create compressed sparse row (CSR) matrix
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# view larger sparse matrix
print(matrix_large_sparse)

In [None]:
# view original sparse matrix
print(matrix_sparse)

Despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. 
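One way to check this claim numerically is sketched below. It rebuilds the two matrices from the cells above and compares the `nnz` attribute (the count of stored values) and the `data` array of each CSR matrix with the `size` of the dense arrays. This is only an illustrative check, not part of the original lab.

```python
import numpy as np
from scipy import sparse

# the small and large matrices from the cells above
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

matrix_sparse = sparse.csr_matrix(matrix)
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# number of values explicitly stored in each sparse matrix
print(matrix_sparse.nnz, matrix_large_sparse.nnz)

# the stored values themselves
print(matrix_sparse.data, matrix_large_sparse.data)

# the dense arrays, by contrast, store every element
print(matrix.size, matrix_large.size)
```

Both sparse matrices store the same two values, while the dense arrays grow with the total number of elements.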

#### See Also
* Sparse matrices, SciPy documentation (https://docs.scipy.org/doc/scipy/reference/sparse.html)
* 101 Ways to Store a Sparse Matrix (https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229)

### 1.4 Selecting Elements in an Array
You often need to select one or more elements in a vector or array. Fortunately, NumPy arrays make that easy. Like most things in Python, NumPy arrays are zero-indexed, meaning that the index of the first element is 0, not 1. With that caveat, NumPy offers a wide variety of methods for selecting (i.e., indexing and slicing) elements or groups of elements in arrays. Some examples are shown below:

In [None]:
# create row vector
vector = np.array([1, 2, 3, 4, 5, 6])

# create array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# select the third element of vector
vector[2]

In [None]:
# select second row, second column of array
matrix[1,1]

In [None]:
# Select all elements of vector
vector[:]

In [None]:
# select everything up to and including the third element of vector
vector[:3]

In [None]:
# select the last element of vector
vector[-1]

In [None]:
# select the first two rows and all columns of array
matrix[:2, :]

In [None]:
# select all rows and the second column of array
matrix[:,1:2]

These techniques generalize beyond 2D arrays to arbitrary dimensions.
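As a brief sketch of that generalization, the cell below applies the same indexing and slicing syntax to a 3D array. The array `tensor` and its shape are made up for illustration.

```python
import numpy as np

# a 3D array with shape (2, 3, 4): 2 blocks, each with 3 rows and 4 columns
tensor = np.arange(24).reshape(2, 3, 4)

# select a single element: second block, third row, fourth column
print(tensor[1, 2, 3])

# select the first row of every block, all columns
first_rows = tensor[:, 0, :]
print(first_rows.shape)

# select the second column of the first two rows in the last block
print(tensor[-1, :2, 1])
```

Each axis gets its own index or slice, separated by commas, exactly as in the 2D examples.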

### 1.5 Describing an Array

It is often helpful to be able to describe the shape, size, and dimensions of a NumPy array. The `shape`, `size`, and `ndim` attributes allow you to do this.

In [None]:
# create array
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# view number of rows and columns
matrix.shape

In [None]:
# view number of elements (rows * columns)
matrix.size

In [None]:
# view number of dimensions
matrix.ndim

Although this might seem basic, it is frequently valuable to check the shape and size of an array before further calculations (e.g., to make sure it is the right shape for input into an ML algorithm) or simply as a gut check after performing an operation.
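A minimal sketch of such a gut check is shown below. The names `features` and `weights` are hypothetical, standing in for a data array and a vector it will be multiplied by.

```python
import numpy as np

# hypothetical data: 3 samples with 4 features each
features = np.array([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12]])

# hypothetical vector with one weight per feature
weights = np.array([0.1, 0.2, 0.3, 0.4])

# gut checks: the data is 2D and has one column per weight
assert features.ndim == 2
assert features.shape[1] == weights.shape[0]

# matrix-vector product: one score per sample
scores = features @ weights
print(scores.shape)
```

If the shapes did not line up, the assertion would fail with a clear location instead of an error deeper in the calculation.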

### 1.6 Applying Operations to Elements
Some data preprocessing operations involve applying a function to multiple elements in an array. This is easy using NumPy's vectorize() function, which behaves much like the Python map() function, except that it uses the broadcasting rules of NumPy arrays.

In [None]:
# create array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# create function that adds 1000 to something
add_1000 = lambda i: i + 1000

# create vectorized function
vectorized_add_1000 = np.vectorize(add_1000)

# apply function to all elements in matrix
vectorized_add_1000(matrix)

Vectorize converts an argument function into a new function that can apply to all elements in an array or slice of an array. It’s worth noting that vectorize() is essentially a for loop over the elements and does not increase performance. Furthermore, NumPy arrays allow us to perform operations between arrays even if their dimensions are not the same (a process called broadcasting). For example, we can create a much simpler version of our solution using broadcasting:

In [None]:
# add 1000 to all elements
matrix + 1000

Broadcasting makes performing operations that combine arrays and constants very concise. A common preprocessing operation involves adding 1 to arrays with very small values to avoid doing math on small fractions. 
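Broadcasting also applies between arrays of different shapes, not just between an array and a constant. The sketch below, which is an illustration rather than part of the original lab, subtracts the column means from every row and adds a different offset to each row.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# mean of each column, shape (3,)
column_means = matrix.mean(axis=0)

# (3, 3) minus (3,): the means are broadcast across every row
centered = matrix - column_means
print(centered)

# a column vector of per-row offsets, shape (3, 1)
row_offsets = np.array([[10],
                        [20],
                        [30]])

# (3, 3) plus (3, 1): each offset is broadcast across its row
shifted = matrix + row_offsets
print(shifted)
```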

### 1.7 Calculating Basic Statistics

NumPy also provides functions for computing basic statistics about data arrays. It is often helpful to inspect these statistics before ML training to better understand your data. 

In [None]:
# create array
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# find maximum element
np.max(matrix)

In [None]:
# find minimum element
np.min(matrix)

In [None]:
# find maximum element in each column
np.max(matrix, axis=0)

In [None]:
# find maximum element in each row
np.max(matrix, axis=1)

In [None]:
# find mean across entire matrix
np.mean(matrix)

In [None]:
# find variance across entire matrix
np.var(matrix)

In [None]:
# find standard deviation across entire matrix
np.std(matrix)

Just like with max and min, we can easily get descriptive statistics about the whole matrix or just do calculations along a single axis:

In [None]:
# find the mean value in each column
np.mean(matrix, axis=0)

In [None]:
# find the mean value in each row
np.mean(matrix, axis=1)

### 1.8 Reshaping Arrays

Machine learning algorithms often expect input data in a particular shape or dimensionality. NumPy makes it easy to change the shape (number of rows and columns) of an array without changing the element values.

In [None]:
# create 4x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# reshape matrix into 2x6 matrix
matrix.reshape(2, 6)

The reshape() function allows us to restructure an array so that we maintain the same data but it is organized as a different number of rows and columns. The only requirement is that the shape of the original and new matrix contain the same number of elements (i.e., the same size). We can see the size of a matrix using size:

In [None]:
matrix.size

One useful argument to reshape is -1, which effectively means “as many as needed,” so reshape(1, -1) means one row and as many columns as needed:

In [None]:
matrix.reshape(1, -1)

Finally, if we provide one integer, reshape will return a 1D array of that length. The flatten() function will also do the same thing. 

In [None]:
matrix.reshape(12)

In [None]:
matrix.flatten()

Occasionally, you may want to transpose a vector or matrix. Transposing is a common operation in linear algebra where the column and row indices of each element are swapped. Transposition is performed in NumPy using the T attribute:

In [None]:
matrix.T

One nuanced point is that, technically, a vector cannot be transposed because it is just a collection of values:

In [None]:
# transpose vector
np.array([1, 2, 3, 4, 5, 6]).T

However, it is common to refer to transposing a vector as converting a row vector (notice the second pair of brackets) to a column vector or vice versa:

In [None]:
# transpose row vector into column vector
np.array([[1, 2, 3, 4, 5, 6]]).T

Using a row vector when you should have a column vector (or vice versa) is a common gotcha when supplying data to ML algorithms.
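One way to catch this gotcha is to print the shape at each step. The sketch below uses the same six values as the cells above and shows how `reshape` with -1 converts a 1D array into an explicit row or column vector.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])

# a 1D array has a single axis, so transposing leaves its shape unchanged
print(values.shape, values.T.shape)

# explicit row vector: 1 row, as many columns as needed
row = values.reshape(1, -1)
print(row.shape)

# explicit column vector: as many rows as needed, 1 column
column = values.reshape(-1, 1)
print(column.shape)

# transposing the 2D row vector gives the column vector
print(row.T.shape)
```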

### 1.9 Other Array Operations

NumPy provides many more built-in functions and methods for performing operations on data arrays. Descriptions are all available in the NumPy documentation. 

**Final recommendation:** If you are ever considering writing a loop to process a data array one element at a time, there are probably more efficient methods using array operations and array broadcasting. Stack Overflow is very helpful for finding these tricks.
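As a small illustration of this recommendation, the sketch below computes the same result twice: once with an explicit loop and once with a single array expression. The calculation itself (square each value and add 1) is arbitrary and chosen only for the example.

```python
import numpy as np

# the integers 1 through 10
values = np.arange(1, 11)

# loop version: process one element at a time
result_loop = np.empty_like(values)
for i in range(len(values)):
    result_loop[i] = values[i] ** 2 + 1

# array version: one expression applied to every element at once
result_array = values ** 2 + 1

# both approaches give the same answer
print(np.array_equal(result_loop, result_array))
```

The array version is shorter, and on large arrays it is typically much faster because the work happens inside NumPy rather than in the Python interpreter.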

# 2. Pandas

The Pandas library simplifies the import and manipulation of labeled data arrays (arrays where the rows and columns have names). 

This section provides examples of some common Pandas operations you are likely to run into when preparing data for ML:

* 2.1 Loading (Importing) Data
* 2.2 Basic Data Exploration
* 2.3 Handling Time Series
* 2.4 Plotting Data
* 2.5 Additional Pandas Operations




In [None]:
# load library
import pandas as pd

### 2.1 Loading (Importing) Data

Pandas lets you easily import data from CSV, Excel, and JSON files.

In [None]:
# CSV Example 

# path to CSV file (URL or path to file on your hard drive)
path = "https://people.sc.fsu.edu/~jburkardt/data/csv/cities.csv"

# load data
df = pd.read_csv(path)

# view the first five rows
df.head(5)

In [None]:
# Excel Example

# path to Excel file (URL or path to file on your hard drive)
path = "https://www.sample-videos.com/xls/Sample-Spreadsheet-10-rows.xls"

# load data
df = pd.read_excel(path, sheet_name=0, header=None)

# view the first five rows
df.head(5)

### 2.2 Basic Data Exploration

Now let's start to explore the cta-ridership.csv dataset from the Chicago Data Portal with information about rides on the Chicago "L" system. The first step is to import the data and see what it looks like.

In [None]:
df = pd.read_csv('cta-ridership.csv')
df.head(5)

Immediately from looking at this, we can get an understanding of the type of data we're looking at.  There's a station identifier and name, a date on which the statistic takes place, the day type, and the number of rides for that date.

It's not immediately clear what the A/W values in the daytype column mean, but the description of the dataset here:
https://data.cityofchicago.org/Transportation/CTA-Ridership-L-Station-Entries-Daily-Totals/5neh-572f

tells us that this column indicates whether we are looking at a weekday, weekend, or holiday.

We can then explore some basic characteristics of the data, including the size of the dataset and summary statistics such as the min and max. This basic exploration allows us to spot potential outliers and mistakes in the data.

In [None]:
df.shape

So the data has 1028040 rows and 5 columns.

In [None]:
df.describe()

We notice that there is a station with no rides (minimum is zero!). Also, the station with the maximum number of rides appears to be about 10x the mean and median. Let's have a look at what station that is.

In [None]:
# find the station with the most rides. This expression takes some work to parse, so make sure you understand what's going on
df[df['rides'] == max(df['rides'])].head(1)

We can also find the stations with 0 rides on particular days:

In [None]:
# find stations/day combos with 0 rides 
zero = df[df['rides'] == 0]

# display 5 of them
zero.head(5)

Looks like a lot of weekends and holidays.  We can group by columns and types to get a better understanding of what might be going on. We then use these groups to count how many dates a station had zero rides and sort these in descending order.

In [None]:
zerogroups = zero.groupby(['stationname','daytype']).count()
zerogroups.sort_values(by=['date'],ascending=False)

It should be clear from a little bit of research why some of the stations at the top of the list report dates with 0 rides. If you're curious, do some searching to find out!

**Note:** The groupby() function is very powerful, enabling statistics and feature extraction from combinations (groups) of data attributes. However, groupby() does not return an array directly, but a GroupBy object that must have other functions applied to it to be useful. Again, the Pandas documentation and Stack Overflow are your friends. 
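The sketch below illustrates this note on a tiny made-up data frame. The column names mirror the CTA data, but the values are invented for the example.

```python
import pandas as pd

# toy data (hypothetical values, not taken from the CTA dataset)
toy = pd.DataFrame({
    'stationname': ['A', 'A', 'B', 'B', 'B'],
    'daytype': ['W', 'U', 'W', 'W', 'U'],
    'rides': [100, 0, 250, 300, 0],
})

# groupby() alone returns a GroupBy object, not a table of results
groups = toy.groupby('stationname')
print(type(groups))

# applying an aggregate function produces one value per group
total_rides = groups['rides'].sum()
print(total_rides)

# grouping by several columns gives one value per combination
mean_rides = toy.groupby(['stationname', 'daytype'])['rides'].mean()
print(mean_rides)
```

Selecting the `rides` column before aggregating keeps the result focused on the statistic of interest.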

### 2.3 Handling Time Series

Many ML problems involve identifying temporal patterns in data. Pandas makes it easy to treat columns of a dataset as dates and handle them appropriately (although some ML algorithms will require conversion into a specific time format, e.g., UNIX time).
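As a minimal sketch of such a conversion, the cell below parses two date strings and converts them to UNIX time (seconds since January 1, 1970). The example strings and the MM/DD/YYYY format are assumptions made for illustration.

```python
import pandas as pd

# hypothetical date strings, assumed to be in MM/DD/YYYY format
dates = pd.Series(['01/01/2001', '12/31/2019'])

# parse the strings into proper datetime values
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

# UNIX time: whole seconds elapsed since the epoch (1970-01-01)
unix_seconds = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
print(unix_seconds)
```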

Let's continue with the CTA ridership data for this section. First, let's see what the time range is in the dataset.

In [None]:
min(df['date'])

In [None]:
max(df['date'])

So we have all rides from January 1, 2001 to December 31, 2019. Let's do some statistics that group rides by date. First we need to tell Pandas that the date column is in fact a date. So, we convert the column to a proper 'DateTime' type, and then set the index to this column.

In [None]:
# change the 'date' column to a datetime object and set that column as the index of the dataset
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

df.head(10)

Now the date column is the index, but the rows are not sorted chronologically. We can fix this with a sort function.

In [None]:
# Sort the rows by date
rides_by_date = df.sort_values(by='date')

rides_by_date.head(10)

That looks better. But it sure would be easier to visualize this data in the form of a plot...

### 2.4 Plotting Data

Plotting data is extremely useful for understanding its properties. We will be using a combination of the Matplotlib, Seaborn, and Pandas libraries for plotting data in this class. 
Plotting is as much an art as a science, and everything you learned in middle and high school science classes about making plots clear, nondeceptive, and well-labeled still applies here.

In [None]:
# import libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Tell Jupyter notebooks to put the plot below the cell rather than in a separate window
%matplotlib inline 

# set the size of the plot
sns.set(rc={'figure.figsize':(11, 4)})

The following example shows one way to plot the CTA ridership from the Garfield Green station over time:

In [None]:
# select rides from the Garfield Green Line station
garfield_green = rides_by_date[rides_by_date['stationname'] == 'Garfield-South Elevated']

# plot the rides using the Pandas plot method
garfield_green['rides'].plot(linewidth=0.5)

# Label the plot
plt.ylabel("Rides")
plt.xlabel("Date")
# There are many more matplotlib functions for making plots prettier. The Matplotlib documentation has examples of their use, and we will see more in future class exercises

This indicates that something unusual may have happened near the Garfield Green Line station in the early part of 2014. Alternatively, there may have been a problem with the data collection during that period that resulted in inflated ridership counts.

### 2.5 Additional Pandas Operations
Like NumPy, Pandas comes with many more built-in functions and methods for manipulating data. Descriptions are all available in the Pandas documentation, and Google searches are a good place to start if you want to know how to do something specific.

# Part 3. Try It Yourself

In this section, you will try writing your own code to analyze data about Divvy bike sharing trips in Chicago - one of the many public datasets from the Chicago Data Portal.

The first step is to download the [Divvy Trip data](https://data.cityofchicago.org/Transportation/Divvy-Trips/fg6s-gzvg) as a CSV file and import it into a Pandas data frame.

**Note**: The file is large (5 GB), and so this will possibly take a fair bit of time to download/load. Be patient!

In [None]:
# YOUR CODE HERE

### 3.1 Basic Data Analysis
Now that you have the data loaded into a dataframe, you can write code to answer the following questions about the data using Pandas. Your code should print the answer to the questions with as little extra "fluff" as possible.

#### 3.1.1 What is the number of rows in the data frame? 

This question is intended to help you understand one of the most basic questions about your data: How many data points does it have?

In [None]:
# YOUR CODE HERE

#### 3.1.2 What are the start and end dates of the rides in the data set?

It is typically important to understand basic information about the data, such as when it starts and ends.  This is also an example of _looking for outliers_. The Divvy program started in Chicago somewhat recently (find out when!) and so if the earliest ride in the dataset predates that, you know the dataset has a problem. Performing these kinds of basic sanity checks on the data is critical and something you should always do when exploring a new dataset.

In [None]:
# YOUR CODE HERE

#### 3.1.3 What is the mean duration of all trips?

Calculating this statistic will involve (1) selecting a column from the Pandas dataframe and (2) applying an aggregate function (e.g., a mean) to a column of the dataframe. 

In [None]:
# YOUR CODE HERE

#### 3.1.4 Do men or women take longer trips on average?

The goal of this question is to give you experience with the groupby function in Pandas, as well as how to combine groupby with an aggregation operation.

In [None]:
# YOUR CODE HERE

#### 3.1.5 Birth Year Sanity Check

We know anecdotally that the birth year column (`BIRTH YEAR`) has several missing values. How many rows are missing a birth year? 

In [None]:
# YOUR CODE HERE

#### 3.1.6 Do women under 40 or women 40+ take longer trips on average?

This will require grouping by more attributes and then applying an aggregate function.

In [None]:
# YOUR CODE HERE

### 3.2 Basic Plotting and Visualization

Now you will get some practice plotting data.

#### 3.2.1 Set a time index in the data frame
Recall the example above and set one of the columns in the data frame to be a datetime index.

In [None]:
# YOUR CODE HERE

#### 3.2.2 Plot the total trip duration by day of the week

Visualize whether people spend more time riding Divvy bikes on particular days of the week. While there are a number of ways to perform this operation, you may find the `resample` function in Pandas useful. Remember to label your plot!

In [None]:
# YOUR CODE HERE

### 3.3 Data Exploration on Your Own

Pick a question or hypothesis related to the Divvy bike share data, justify **why** you picked that question (i.e., why it might be an interesting question to some audience, such as city officials), and present a simple analysis. 

Some example questions might include:
* Adjusting for seasons, is ridership increasing? (You could use conditional selection on dates or months.)
* Are rides getting longer? (on average? max?)
* Do ride characteristics differ by user type?
* Are certain trip routes (e.g. pairs of start and end stations) more popular than others? Does this change during peak and non-peak "rush" hours (defined loosely)?

In [None]:
# YOUR CODE HERE (add cells below as needed)