# Introduction to NumPy and Pandas
***

**NumPy** is an open source Python package for scientific computing. NumPy supports large, multidimensional arrays and matrices. It also supports a large collection of mathematical functions not found in Python's standard math library.

**Pandas** is an open source data analysis library in Python used for storing, cleaning, wrangling, and analyzing data.  

We will begin by importing the NumPy and Pandas libraries. It's customary to import Pandas and NumPy with the aliases $\texttt{pd}$ and $\texttt{np}$, respectively. We can then access any function in the NumPy or Pandas libraries by prefixing the function name with $\texttt{np.}$ or $\texttt{pd.}$.

In [None]:
import pandas as pd
import numpy as np 

## The Numpy ndarray

The main workhorse of NumPy is its multidimensional array class, called the $\texttt{ndarray}$. This allows us to build and work with one-dimensional, two-dimensional, or even $N$-dimensional arrays of numbers.

<div>
<img src="attachment:arrays.png" style="width: 500px;"/>
</div>
    
Image Source: https://towardsdatascience.com/numpy-array-cookbook-generating-and-manipulating-arrays-in-python-2195c3988b09
    
Let's begin with a few examples using the NumPy $\texttt{np.array}$ function.

In [None]:
# An example of a 1D array created by passing in a Python list to the np.array function
oneD = np.array([8, 6, 7, 5, 3, 0, 9])
print(oneD)

In [None]:
# An example of a 2D array
# Note that we are now passing in a list of lists into the np.array function. 
# Each list becomes a row in the 2D array

twoD = np.array([[1, 1, 1, 1, 1],[2, 2, 2, 2, 2]])
print(twoD)

In [None]:
# An example of a 3D array
# This will be put into the np.array function as a list of twoD arrays (a list of lists of lists!)

threeD = np.array([[[1, 1, 1, 1],[2, 2, 2, 2]],[[3, 3, 3, 3],[4, 4, 4, 4]],[[5, 5, 5, 5],[6, 6, 6, 6]]])
print(threeD)

We can check the type of the array that we've just created with Python's built-in $\texttt{type}$ function.

In [None]:
type(threeD)

**Arrays versus Lists:**
- Less memory is required for an array
- We can have elements of any type in a list
- Elements in an array must have the same type
- Easier to do arithmetic on arrays

In Module 1, we saw various examples of lists and how to generate lists (e.g. list comprehensions). Now we see that we can convert any list to a NumPy array and take advantage of the functionality of the NumPy package.
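As a small illustration of the conversion and of the "easier arithmetic" point above, the sketch below builds a list with a list comprehension, converts it to an array, and compares what $\texttt{+}$ does in each case. The variable names are made up for this example.

```python
import numpy as np

# A plain Python list built with a list comprehension
squares_list = [n**2 for n in range(5)]

# Convert the list to a NumPy array
squares_array = np.array(squares_list)

# "+" concatenates two lists, but adds two arrays element by element
list_result = squares_list + squares_list
array_result = squares_array + squares_array

print(list_result)
print(array_result)
```

The list version ends up twice as long, while the array version keeps its length and doubles each entry.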

In [None]:
# Remember that a NumPy array needs to have data of all one type.
# Note what happens when this cell is executed. 
x = np.array([1, 2, 3, "string"])
print(x)
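To see what happened in the cell above, it can help to inspect the array's $\texttt{dtype}$. The sketch below (with made-up variable names) suggests that NumPy converts every element to a common type: strings when a string is present, floats when integers and floats are mixed.

```python
import numpy as np

mixed = np.array([1, 2, 3, "string"])

# The integers were converted to strings so the array has a single type.
# A dtype kind of "U" means a Unicode string type.
print(mixed.dtype.kind)

# Because the entries are now strings, "+" concatenates them
joined = mixed[0] + mixed[1]
print(joined)

# Mixing integers and floats gives an array of floats (kind "f")
numbers = np.array([1, 2.5])
print(numbers.dtype.kind)
```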

Using **np.arange** is a quick and convenient way to generate an array of values that are evenly spaced within some specified interval.

In [None]:
# np.arange takes in arguments np.arange(begin_interval, end_interval, step_size)
x = np.arange(0,10, 1)
print(x)
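Two details of $\texttt{np.arange}$ are easy to miss: the end of the interval is not included, and the start and step are optional. A short sketch, with made-up variable names:

```python
import numpy as np

# Step of 2: the end value 10 is excluded
evens = np.arange(0, 10, 2)
print(evens)

# With a single argument, the start defaults to 0 and the step to 1
first_five = np.arange(5)
print(first_five)
```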

Using Numpy arrays to store numerical values makes it relatively easy to perform mathematical operations on them.  

In [None]:
# Suppose we define an array of all integer values from 0 to 8.
x = np.arange(0,9,1)

# Now take this array and subtract the mean of x from each element.

x_mean = x.mean()
y = x-x_mean

print(y)

# Now, for fun, let's compute the standard deviation of array x and divide each entry in
# y by that number

std_dev_x = x.std(ddof=1)
print(std_dev_x)

z = y/std_dev_x
print(z)

# Out of curiosity, what are the mean and standard deviation of z?
print(z.std(ddof=1))
print(z.mean())
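The $\texttt{ddof=1}$ argument used above matters: by default, $\texttt{.std()}$ on a NumPy array divides by $n$ (the population standard deviation), while $\texttt{ddof=1}$ divides by $n-1$ (the sample standard deviation). The sketch below compares the two on the same array; the names $\texttt{pop_std}$ and $\texttt{sample_std}$ are made up for this example.

```python
import numpy as np

x = np.arange(0, 9, 1)

# Default: divide the sum of squared deviations by n
pop_std = x.std()

# ddof=1: divide by n - 1 instead
sample_std = x.std(ddof=1)

print(pop_std, sample_std)

# Standardizing with the sample standard deviation gives an array
# with mean (approximately) 0 and sample standard deviation 1
z = (x - x.mean()) / sample_std
print(z.mean(), z.std(ddof=1))
```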

As the cell above shows, we can combine an array with a single number and the operation is applied to every element. We can also add two arrays together element by element. This is far more concise than writing a for loop to handle this.

In [None]:
# Add arrays a and b together to get element-wise addition. Note that we have to have
# two arrays of the same length to do this.
a = np.array([1, 1, 1, 1, 1])
b = np.array([1, 2, 3, 4, 5])
c = a + b 
print(c)

We can access elements of NumPy arrays in a way similar to the way we access elements in Python lists. For instance, if we want to get the first $3$ elements of the array $\texttt{c}$, we can do so as follows:

In [None]:
c[0:3]

Just like with Python lists, if we're indexing from the start of the array there is no need to include the $0$ in the index range.  We can simply do 

In [None]:
c[:3]

Similarly, if we want to access everything from the third entry to the end of the array, we can do 

In [None]:
c[2:]

Indexing in multidimensional arrays is similar. First, let's build a two-dimensional array with our arrays $\texttt{a}, \texttt{b},$ and $\texttt{c}$ as the rows.

In [None]:
B = np.array([a, b, c])
print(B)

We can slice up the two dimensional array by doing slices along rows and columns.  Let's suppose we wanted to carve out the second row of $\texttt{B}$. 

In [None]:
# Matrices are indexed as rows x columns in mathematics. The same convention holds true for 
# NumPy. If we have a matrix B, then we index B in the following way: B[rows, columns]

# So B[i, :] selects the row with index i and all columns within that row.
# Indexing starts at 0, so the second row has index 1.
B[1, :]


In [None]:
# Suppose we want to look at the second row of matrix B again, but we only want to 
# view the second and third column entries within that row.
B[1, 1:3]


Now, suppose that instead of isolating a row, we'd like to isolate a column. Let's play with the syntax for doing that: 

In [None]:
# B[:, j] selects all rows in the column with index j. 
# Thus the following line of code isolates the 4th column (index 3).
B[:, 3]


Now, let's slice our 2D array, B, by both rows **and** columns. Let's pull out the 1st, 2nd, and 3rd rows along with the 3rd and 4th columns.

In [None]:
print(B)
print(B[0:3, 2:4])  # rows 1-3 (indices 0-2), columns 3-4 (indices 2-3)

We can also apply mathematical functions to the two dimensional array.  For instance, if we want to sum all of the entries in $\texttt{B}$ we can do 

In [None]:
np.sum(B)

If instead we just want to sum along the rows or columns of the array we can add the $\texttt{axis}$ parameter.  

If we choose $\texttt{axis=0}$, then we obtain a column sum. If we choose $\texttt{axis=1}$, then we obtain a row sum.

In [None]:
# Compute the column sum
np.sum(B, axis=0)

In [None]:
# Compute the row sum
np.sum(B, axis=1)

We can also index into an array by conditions.  For instance, let's recall the vector $c$ from above. We can set a condition $c > 4$. This will return a boolean array with boolean values of True if a particular entry of $c$ satisfies $c>4$ and a boolean value of False if $c \leq 4$.

In [None]:
print(c)

In [None]:
c > 4

Now, suppose we want to keep only those entries of $c$ that satisfy the condition.

To do so, we can simply index $c$ using this condition:

In [None]:
c[c>4]

### Numpy Mathematical Functions

NumPy includes many of the functions and mathematical constants in Python's standard math library, like logarithms, exponentiation, and even $\pi$.

In [None]:
print(np.log(np.exp(1)))
print(np.log2(16))
print(np.log10(1000))
print(np.pi)

Numpy's mathematical functions can be applied to arrays as well as scalars.

In [None]:
u = [10, 100, 1000, 10000]
np.log10(u)
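Note that $\texttt{u}$ above is a plain Python list, not an array. The sketch below suggests what is going on: the NumPy function converts the list and returns an $\texttt{ndarray}$, whereas the corresponding function in Python's standard $\texttt{math}$ library only accepts a single number. The flag $\texttt{math_accepts_list}$ is made up for this example.

```python
import math
import numpy as np

u = [10, 100, 1000, 10000]

# The NumPy version works element by element and returns an ndarray
result = np.log10(u)
print(type(result))
print(result)

# The standard library version rejects a list with a TypeError
try:
    math.log10(u)
    math_accepts_list = True
except TypeError:
    math_accepts_list = False

print(math_accepts_list)
```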

## Pandas Data Structures

Pandas has two types of data structures: 
- The **Series**: A one dimensional array with labeled indices. 
- The **DataFrame**: A tabular spreadsheet-like structure with rows and columns. 


### The Pandas Series

Let's begin exploring the Pandas library by creating our own Pandas Series.

In [None]:
# Note that data is presented in a Python list. We then pass this list into 
# pd.Series to convert from a list to a Series
data = [37, 52, 1, 3, 31, 29, 10, 2]

my_series = pd.Series(data)

Print data to the screen, then print my_series. What are the differences?

In [None]:
print("data:", data)
my_series

Note that printing my_series results in multiple outputs being displayed on the screen.  The first column lists the default indices of the entries in the Series.  The second column lists the actual values of the Series.  Finally, the last line of output tells us the type of the values of the Series. 

If we want a simple array containing the value of the Series we can access this with the $\texttt{.values}$ attribute. 

In [None]:
my_series.values

Note that this returns a simple array type containing the values of the Series.  In fact, this is actually a Numpy ndarray, which we can verify by asking Python for the type of $\texttt{my_series.values}$. 

In [None]:
type(my_series.values)

We can also obtain an array of the indices associated with the Series using the $\texttt{.index}$ attribute.  

In [None]:
my_series.index

Another helpful function is Python's built-in $\texttt{len}$. Calling it on a Pandas Series tells you how many entries the Series has.

In [None]:
len(my_series)

OK, so far we've defined a basic Pandas Series that uses the default indices.  

What if we would like to create custom indices?

For example, we could create a Series listing the distance from each planet in the solar system to the sun. The units are millions of miles.

In [None]:
planets = pd.Series([35.98,67.24,92.96,141.6,483.8,890.8,1784,2793], 
                         index=["Mercury", "Venus", "Earth", "Mars", "Jupiter","Saturn","Uranus","Neptune"])

In [None]:
planets

Notice that the indices of the Series are now strings corresponding to the names of the planets, while the values are floats representing the distance from each planet to the sun in millions of miles.  

Now suppose that we only care about the planets that are farther than 1 billion miles from the sun (1000 million).  We can index into the Series using a boolean condition, much like we did with Numpy arrays: 

In [None]:
planets[planets > 1000]

When Series are very large it's important to be able to check if particular elements are in the Series without having to scan through with our eyeballs.  Here we can check if Pluto and Earth are part of the series. 

In [None]:
"Pluto" in planets

In [None]:
"Earth" in planets
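One subtlety worth knowing: the $\texttt{in}$ operator on a Series checks the **index labels**, not the values. The sketch below uses a shortened, made-up Series called $\texttt{distances}$ to show the difference, and uses $\texttt{.values}$ to test for a value instead.

```python
import pandas as pd

distances = pd.Series([35.98, 67.24, 92.96],
                      index=["Mercury", "Venus", "Earth"])

# "in" looks at the index labels
label_found = "Earth" in distances
print(label_found)

# 92.96 is a value, not a label, so this membership test fails
value_in_labels = 92.96 in distances
print(value_in_labels)

# To test for a value, check against the underlying array
value_found = 92.96 in distances.values
print(value_found)
```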

### The Pandas DataFrame

The key data structure in Pandas is the **DataFrame** object. A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, which can be any name or value. The columns are just Pandas Series. The Pandas DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. 

First, let's create a DataFrame from scratch by passing in a Python dictionary. The following data records distance from the sun in millions of miles, year in earth days, day in earth days, and diameter in thousands of miles. 

In [None]:
# We begin by using a dictionary called data. The "keys" in this dictionary are planet, 
# distance, year, day, and diameter. The values are given in lists for each key.
data = {'planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus',
                 'Neptune'],
        'distance': [35.98,67.24,92.96,141.6,483.8,890.8,1784,2793],
        'year':   [87.97, 224.7, 365.26, 686.69, 4331.98, 10760.55, 30685.49, 60191.19],
        'day':  [58.6, 243, 0.99, 1.03, 0.41, 0.45, 0.72, 0.67],
        'diameter': [3.031, 7.521, 7.926, 4.222, 88.846, 74.9, 31.763, 30.779]
        }

We'll convert this data to a Pandas DataFrame as follows: 

In [None]:
planet_stats = pd.DataFrame(data)
print(planet_stats)

Note that each list in the dictionary has become a column in our DataFrame with the associated dictionary key as the column name.  

Now, suppose we want to change the order of the columns in the data frame. If instead we want the columns in the order _planet_, _diameter_, _day_, _year_, _distance_ we can pass this ordering of the columns in when we create the DataFrame. 

In [None]:
planet_stats = pd.DataFrame(data, columns=['planet', 'diameter', 'day', 'year', 'distance'])
print(planet_stats)

Above, we've displayed the DataFrame using the standard Python $\texttt{print()}$ function, but Pandas has its own spreadsheet-like way of displaying data. If we want to see the first 5 rows of the DataFrame we can use the $\texttt{.head()}$ method.

In [None]:
planet_stats.head()

If we want to see more rows we can pass in an optional argument into $\texttt{.head()}$. 

In [None]:
planet_stats.head(6)

If we want to see the last $5$ rows of the DataFrame we can use the $\texttt{.tail()}$ method. 

In [None]:
planet_stats.tail()

### Adding Columns to Pandas DataFrames

Suppose we want to use the information in our data frame to compute an estimate for the length of each planet's orbit around the sun. If we treat each orbit as a circle whose radius is the planet's distance from the sun, we can do this by multiplying the distance column by $2\pi$. We will store this information in a new column called "orbit".

In [None]:
# Create a new column called orbit
planet_stats["orbit"] = 2 * np.pi * planet_stats["distance"]

There are several takeaways from this exercise. First, if we want to create a new column in our DataFrame, we simply have to assign the DataFrame indexed by the new column name to some valid Pandas Series. The second takeaway is that we can scale and add Pandas Series in the same way we can with NumPy arrays.

In [None]:
# Let's display the dataframe to the screen to see the new column.
planet_stats

### Sorting in Pandas

Now suppose we want to rank the planets from longest to shortest length of a day.  One way we could do this would be to sort the DataFrame by the **day** column.  We can do this as follows:

In [None]:
planets_sorted = planet_stats.sort_values(by="day", ascending=False)
planets_sorted

It looks like Venus has the longest "day" of any planet and Jupiter has the shortest.  

There are a couple of things to notice here. First, the $\texttt{sort_values}$ method actually returns a new DataFrame, with the data sorted as requested. If instead we wanted to modify the existing DataFrame so that its values are sorted, we could have added the $\texttt{inplace=True}$ parameter to the $\texttt{sort_values}$ method. For now, though, we'll leave the original $\texttt{planet_stats}$ DataFrame as it is.

The second interesting thing to notice is that the DataFrame indices (the first column) have been rearranged as well. In truth, the indices of a DataFrame are only numbers by default. We could also set them to be other unique identifiers like strings. In that setting, it makes perfect sense that they should tag along with their original data.
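Here is a minimal sketch of that idea. It uses a small made-up DataFrame called $\texttt{toy}$ and the $\texttt{set_index}$ method (which this notebook does not otherwise cover) to turn the planet names into the row labels, and then sorts. The labels travel with their rows.

```python
import pandas as pd

toy = pd.DataFrame({"planet": ["Earth", "Mars", "Venus"],
                    "day": [0.99, 1.03, 243]})

# Use the planet names as the row labels instead of 0, 1, 2
by_name = toy.set_index("planet")

# Sorting returns a new DataFrame; each label stays attached to its row
sorted_days = by_name.sort_values(by="day", ascending=False)

print(sorted_days)
print(by_name.index.tolist())  # the original order is unchanged
```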

### Slicing and Dicing Pandas DataFrames

The sorted $\texttt{planet_stats}$ DataFrame gives us an opportunity to talk about the intricacies of selecting data from Pandas DataFrames. There are two main methods that we will use to slice and dice data from DataFrames: $\texttt{loc}$ and $\texttt{iloc}$. Note that if you've used Pandas in the past, you might have seen the method $\texttt{ix}$, which was deprecated and has since been removed from Pandas, so we shall avoid it.

We'll explicitly write down the differences between $\texttt{loc}$ and $\texttt{iloc}$ here, because they're a bit hard to remember:

- $\texttt{loc}$ slices rows and columns by their **names** 
- $\texttt{iloc}$ slices rows and columns by their **integer index order**, much like slicing in Numpy 

At first, we'll just worry about selecting certain rows. Shift-enter these two cells and see if you can pick up on the different behaviors.

In [None]:
planets_sorted.iloc[:4]

In [None]:
planets_sorted.loc[:4]

Do you see the difference?  $\texttt{iloc[:4]}$ sliced out the first 4 rows of the DataFrame, ignoring what the rows happened to be named. 

On the other hand, $\texttt{loc[:4]}$ sliced out the rows in the sorted DataFrame from the beginning until we reached the row **named** 4 by its index, and it included that row.
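The sketch below repeats the comparison on a small made-up DataFrame called $\texttt{shuffled}$, whose row labels are deliberately out of order. It also highlights that an $\texttt{iloc}$ slice excludes its endpoint (like NumPy), while a $\texttt{loc}$ slice includes the row with the ending label.

```python
import pandas as pd

# Row labels 3, 1, 0, 2 are intentionally not in sorted order
shuffled = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[3, 1, 0, 2])

# iloc counts positions: the first two rows, whatever their labels are
by_position = shuffled.iloc[:2]
print(by_position)

# loc uses labels: every row from the start up to AND including label 2,
# which here happens to be the last row
by_label = shuffled.loc[:2]
print(by_label)
```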


Now suppose we want the first four rows of the DataFrame, but only the first three columns. Since we're slicing by integer order, we'll use $\texttt{iloc}$. We have

In [None]:
planets_sorted.iloc[:4,:3]

Now, suppose that we want to mix slicing by numerical order and slicing by name.  This can get kinda tricky. For instance, suppose we wanted the first four rows of the DataFrame, and just the columns corresponding to **diameter** and **distance**.  So we're really doing two slices: one by integer index order, and one by name.  

In [None]:
planets_sorted.iloc[:4].loc[:,["diameter", "distance"]]

This works because $\texttt{planets_sorted.iloc[:4]}$ returns a DataFrame which we then apply $\texttt{loc[:,["diameter", "distance"]]}$ to.  The problem with this is that the actual object that is returned by this can be unpredictable.  We won't go into too much detail about this, but if you're interested in reading more, you can do so in the Pandas documentation [here](https://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy). Rest assured, if you ever go this route, Pandas will scream a lot of fun warnings at you. 

A better thing to do is use $\texttt{loc}$ and then use a trick to get the named indices of the first four rows.  We can do this as follows: 

In [None]:
planets_sorted.loc[planets_sorted.index[:4], ["diameter", "distance"]]

This worked because $\texttt{planets_sorted.index[:4]}$ returns the names of the first four rows, which you can check here:

In [None]:
planets_sorted.index[:4]
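The same mixed selection can also be done the other way around: stay with $\texttt{iloc}$ and translate the column names into integer positions using $\texttt{columns.get_indexer}$. The sketch below shows this on a small made-up DataFrame called $\texttt{small}$; it is offered as an alternative, not as the approach this notebook relies on.

```python
import pandas as pd

small = pd.DataFrame({"planet": ["Mercury", "Venus", "Earth"],
                      "diameter": [3.031, 7.521, 7.926],
                      "distance": [35.98, 67.24, 92.96]})

# Look up the integer positions of the columns we want by name
col_positions = small.columns.get_indexer(["diameter", "distance"])
print(col_positions)

# Now both the row slice and the column selection are by position
subset = small.iloc[:2, col_positions]
print(subset)
```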

We can also select rows of a DataFrame using conditions, much like you can do with Numpy arrays.  For this, let's go back to the original $\texttt{planet_stats}$ DataFrame, which we reprint here for your recollection. 

In [None]:
planet_stats

Suppose we want to select all of the planets where length of a "year" is longer than an Earth year (365.26 days).  Much like in Numpy, we can simply write down a logical comparison of the **year** column with the desired value. 

In [None]:
# Compare the year column with 365.26 days to see which planets have longer years
planet_stats["year"] > 365.26

In this case, what we get out is a Pandas Series with boolean values, indicating whether each row of the DataFrame passes the condition. We can then pass this into the $\texttt{loc}$ method to get all of the rows of the DataFrame that satisfy the condition. 

In [None]:
# loc method to filter the data frame based on year length exceeding 365.26 days
planet_stats.loc[planet_stats["year"] > 365.26]

Note that we could have just passed the condition into the DataFrame itself, as 

In [None]:
# now try to pass the conditional into the DataFrame
planet_stats[planet_stats["year"] > 365.26]

Suppose now that we also want to filter out all columns except for planet, year, and day. 

In [None]:
# filter on year length and the three columns: planet, day, year
planet_stats.loc[planet_stats["year"] > 365.26, ["planet", "day", "year"]]

We can also combine logical conditions together.  Suppose we want to know if any planets have both a year length that is longer than Earth and a day length that is longer than Earth. We can do that as follows:

In [None]:
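# filter on day length and year length both exceeding Earth's and show only the columns planet, day, year
planet_stats.loc[(planet_stats["year"] > 365.26) & (planet_stats["day"] > 0.99), ["planet", "day", "year"]]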

**IMPORTANT NOTE**: When doing logical operations in Pandas, we use $\texttt{&}$ and $\texttt{|}$ for _and_ and _or_, respectively.  This differs from the traditional $\texttt{and}$ and $\texttt{or}$ connectives used in Python. 
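A short sketch of this note on a made-up three-row DataFrame called $\texttt{demo}$. Each comparison is wrapped in parentheses because $\texttt{&}$ and $\texttt{|}$ bind more tightly than $\texttt{>}$. The last part shows that Python's $\texttt{and}$ raises a $\texttt{ValueError}$ when applied to two boolean Series; the flag $\texttt{and_works}$ is made up for this example.

```python
import pandas as pd

demo = pd.DataFrame({"year": [224.7, 365.26, 686.69],
                     "day": [243, 0.99, 1.03]},
                    index=["Venus", "Earth", "Mars"])

# Both conditions must hold (element-wise "and")
both = demo[(demo["year"] > 365.26) & (demo["day"] > 0.99)]
print(both)

# At least one condition must hold (element-wise "or")
either = demo[(demo["year"] > 365.26) | (demo["day"] > 0.99)]
print(either)

# Python's "and" tries to reduce each Series to a single True/False
try:
    demo[(demo["year"] > 365.26) and (demo["day"] > 0.99)]
    and_works = True
except ValueError:
    and_works = False

print(and_works)
```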

In [None]:
planet_stats

# Further Reading

We've just barely scratched the surface of what Pandas and Numpy can do.  As we move forward in the course we'll learn how to apply various statistical functions to data, how to make plots and charts based on data, and how to clean and wrangle messy data.  

If you want to explore more of what Pandas can do, check out the (very readable) Pandas docs for more information:

http://pandas.pydata.org/pandas-docs/stable/