# Working with tabular data

## Introducing `numpy`

In the last module you saw some of the limitations of built-in Python functionality for quick quantitative analysis:

In [1]:
# It's hard to calculate on lists!
my_list = [4,1,5,2]

my_list * 2

[4, 1, 5, 2, 4, 1, 5, 2]

Fortunately you also learned about packages -- they'll come to our rescue!

Let's store these same numbers in what's called a `numpy` *array*. 

This involves importing the `numpy` package. 

`numpy` is short for "numerical Python."

Generally, the first time we use a package, we need to install it:

In [1]:
# Install numpy
#!pip install numpy



However, `numpy` was installed already when we installed `pandas`.

We *do* still need to import `numpy` before using it: 

In [9]:
import numpy

We can use `numpy.array()` to create an array.

In [None]:
my_array = numpy.array([4,1,5,2])
print(my_array)

[4 1 5 2]

We can also convert our list to an array:

In [None]:
my_list_to_array = numpy.array(my_list)
print(my_list_to_array)

[4 1 5 2]

`numpy` arrays work in many ways like ranges of a spreadsheet...

In [10]:
# Isn't this what you were expecting earlier?

print(my_array * 2)
print(my_list_to_array * 2)

[ 8  2 10  4]
[ 8  2 10  4]
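
To make the contrast with lists concrete, here is a minimal sketch of the same arithmetic done both ways. The names `doubled_list`, `doubled_array` and `plus_ten` are made up for this illustration.

```python
import numpy

my_list = [4, 1, 5, 2]
my_array = numpy.array(my_list)

# With a list, we have to loop over the elements ourselves
doubled_list = [value * 2 for value in my_list]

# With an array, the arithmetic is applied to every element at once
doubled_array = my_array * 2
plus_ten = my_array + 10

print(doubled_list)
print(doubled_array)
print(plus_ten)
```

This is the same idea as typing a formula once in a spreadsheet and filling it down a column.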


Lists and arrays may *look* the same to you, but they are different data types to Python:

In [14]:
my_list = [4,1,5,2]
my_array = numpy.array([4,1,5,2])

print(type(my_list))
print(type(my_array))

<class 'list'>
<class 'numpy.ndarray'>


Based on what we're seeing, we will probably be calling on `numpy` *quite* often. 

Let's look at a cool "hack" for doing so...

### Aliasing modules

Remember that each time we use a function or method associated with `numpy`, we need to tell Python where to look for it: 

In [11]:
# Create another array...
my_other_array = numpy.array([4,16,25,100])

# numpy has a square root function of its own...
numpy.sqrt(my_other_array)

array([ 2.,  4.,  5., 10.])

I am already getting sick of typing `numpy` each time I want to use something from it! Can't we make this easier?

Yes. Yes, we can.

Turns out we can temporarily rename, or *alias*, the `numpy` module when we import it. We will use the format:

```
import [name of module] as [alias]
```

`np` is a popular alias for `numpy`. Rather than typing `numpy` each time you use a function from that library, you can simply type `np`. 



In [2]:
import numpy as np

# Create another array...
my_other_array = np.array([4,16,25,100])

# numpy has a square root function of its own...
np.sqrt(my_other_array)

array([ 2.,  4.,  5., 10.])
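
If you are wondering whether the alias changes anything about the package itself, the short sketch below suggests it does not: `np` and `numpy` are two names for the same module. The variable names `same_module` and `roots` are invented for this example.

```python
import numpy
import numpy as np

# Both names point at the very same module object
same_module = np is numpy
print(same_module)

# So anything available as numpy.something is also available as np.something
roots = np.sqrt(np.array([4, 16, 25, 100]))
print(roots)
```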

### Drill

Take a shot at assigning an array and finding its square root using this aliasing method. 

In [None]:
# Import and alias the module

import ___ ___ ___

# Create an array
my_new_array = ___.___([36, 49, 64, 81])

# Take its square root
np.___(___)

Aliasing saved you some keystrokes, huh?

![Life hackz](life-hackz.gif)

## Accessing and reshaping arrays

Python indexes *everything* at zero, not just lists. This includes `numpy` arrays!

In [15]:
my_array = np.array([4,1,5,2])

# Access first element of the array
print(my_array[1])

# Oh sorry... NOW I'm accessing the first element! 🤦‍♂️
print(my_array[0])

1
4
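
Here is a small sketch of a few more ways to reach into a one-dimensional array, assuming the same four numbers as above. The names `first`, `last` and `middle` are just for illustration.

```python
import numpy as np

my_array = np.array([4, 1, 5, 2])

first = my_array[0]      # position 0 is the first element
last = my_array[-1]      # negative positions count back from the end
middle = my_array[1:3]   # positions 1 and 2; the stop position is excluded

print(first)
print(last)
print(middle)
```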


You've already sweated through zero-based indexing, so let's move on... to two-dimensional arrays. 

(You will see that you'll never truly escape zero-based indexing in Python, however... 😼)

## Two-dimensional arrays in `numpy`

So far, we have been working on one-dimensional sets of data. But what if we wanted to mix that up? 

![Illustration of numpy arrays](numpy-arrays.png)



Source: Nunez-Iglesias, Juan, Stéfan Van Der Walt, and Harriet Dashnow. *Elegant SciPy: The Art of Scientific Python.* O'Reilly Media, 2017.


`numpy` can create arrays with three or more dimensions, but let's focus on two: this is a familiar way to shape data, as it's how data is often stored in spreadsheets (as rows and columns).


We can create a two-dimensional array in `numpy` with the `array()` function. This time we will pass it a list of lists, with one inner list for each row.

In [3]:
# Create a two-dimensional array with `np.array()`

my_2d_array = np.array([[3,4,1],[2,5,0]])
print(my_2d_array)

[[3 4 1]
 [2 5 0]]


We can also reshape an existing one-dimensional array into a two-dimensional array using `np.reshape()`:

In [6]:
# New array
my_array = np.array([1,2,3,4,5,6])
print(my_array)

# Let's make a 2 x 3 array
my_reshaped_array = np.reshape(my_array, (2, 3))
print(my_reshaped_array)

[1 2 3 4 5 6]
[[1 2 3]
 [4 5 6]]
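
One thing worth knowing about reshaping: the new shape has to use exactly the same number of elements as the original. The sketch below tries two shapes on the six-element array; the names `three_by_two` and `reshape_failed` are made up for this example.

```python
import numpy as np

my_array = np.array([1, 2, 3, 4, 5, 6])

# 3 rows x 2 columns also uses all six elements, so this works
three_by_two = np.reshape(my_array, (3, 2))
print(three_by_two)

# 4 rows x 2 columns would need eight elements, so numpy raises a ValueError
try:
    np.reshape(my_array, (4, 2))
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("Could not reshape:", error)
```

Note that `np.reshape()` hands back a new arrangement and leaves `my_array` itself one-dimensional.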


A two-dimensional array is starting to look like the kind of dataset that you might actually work with as a spreadsheet user, with rows and columns.

## Inspecting our arrays

Objects in Python carry different *attributes*, which we can access using the format 

`variable.[attribute]`


Some attributes we can use to learn more about our `numpy` arrays are:

`shape`: gives us the dimensions of the array.  
`size`: gives us the number of elements of the array.   
`dtype`: gives us the data type of the elements of the array.   

In [8]:
print(my_reshaped_array.shape)
print(my_reshaped_array.size)
print(my_reshaped_array.dtype)

(2, 3)
6
int32
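
As a quick comparison, the sketch below checks the same three attributes on an array of whole numbers and an array of decimals. The array names are invented for this example. The exact integer type you see (`int32` or `int64`) can vary with your operating system and `numpy` version, so don't worry if yours differs from the output above.

```python
import numpy as np

whole_numbers = np.array([[1, 2, 3], [4, 5, 6]])
decimals = np.array([1.5, 2.0, 3.25])

# Two rows and three columns, six elements, some integer type
print(whole_numbers.shape, whole_numbers.size, whole_numbers.dtype)

# One dimension with three elements, stored as floating-point numbers
print(decimals.shape, decimals.size, decimals.dtype)
```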


### Indexing and slicing our arrays

Remember when I said that zero-based indexing never really goes away? I wasn't kidding. Now we have to index on *two* counts: the row and the column. 
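
A minimal sketch of what that looks like, reusing the values from `my_2d_array` above: the row position comes first, then the column position, and a bare `:` means "everything along this direction." The variable names on the left are just for illustration.

```python
import numpy as np

my_2d_array = np.array([[3, 4, 1], [2, 5, 0]])

top_left = my_2d_array[0, 0]        # row 0, column 0
bottom_right = my_2d_array[1, 2]    # row 1, column 2
first_row = my_2d_array[0, :]       # every column of row 0
second_column = my_2d_array[:, 1]   # every row of column 1

print(top_left)
print(bottom_right)
print(first_row)
print(second_column)
```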



## Drill

Practice your `numpy` skills by operating on a pretty significant array! I will get you started; complete the operations based on what the comments are asking for. 

In [18]:
# Don't worry about this part -- I am reading the file into Python.
# You will learn how to read files into Python in the next unit. 

my_array = np.genfromtxt('numpy-drill.csv')
print(my_array)

[47. 21. 23. 24. 45.  6. 30. 43. 45. 23.  2. 46.  4. 34. 42.  2. 47. 14.
 18.  9. 50. 34. 12. 24. 42. 24.  3. 39. 17. 15. 37. 18. 46. 25.  9. 41.
 45. 34. 22. 26. 27. 44. 28.  4. 15. 31.  3. 39. 15. 23.  5. 27. 11. 25.
 16. 11.  2. 43. 35. 45. 27. 48. 44. 20.  4. 21.  8. 48. 29. 20. 15. 20.
 37. 17.  6. 13. 39. 25.  5. 11.  4. 20. 47.  9.  2.  8. 44. 40.  8.  1.
 45. 26. 43. 10. 22. 24.  3. 48. 29. 49.]


In [23]:
# What is the size of this array?

np.size(my_array)

100

In [None]:
# What is its datatype?

In [20]:
# Assign the square roots of this array to
# another array, `my_array_sqrt`.

my_array_sqrt = np.sqrt(my_array)
print(my_array_sqrt)

(100,)

array([6.8556546 , 4.58257569, 4.79583152, 4.89897949, 6.70820393,
       2.44948974, 5.47722558, 6.55743852, 6.70820393, 4.79583152,
       1.41421356, 6.78232998, 2.        , 5.83095189, 6.4807407 ,
       1.41421356, 6.8556546 , 3.74165739, 4.24264069, 3.        ,
       7.07106781, 5.83095189, 3.46410162, 4.89897949, 6.4807407 ,
       4.89897949, 1.73205081, 6.244998  , 4.12310563, 3.87298335,
       6.08276253, 4.24264069, 6.78232998, 5.        , 3.        ,
       6.40312424, 6.70820393, 5.83095189, 4.69041576, 5.09901951,
       5.19615242, 6.63324958, 5.29150262, 2.        , 3.87298335,
       5.56776436, 1.73205081, 6.244998  , 3.87298335, 4.79583152,
       2.23606798, 5.19615242, 3.31662479, 5.        , 4.        ,
       3.31662479, 1.41421356, 6.55743852, 5.91607978, 6.70820393,
       5.19615242, 6.92820323, 6.63324958, 4.47213595, 2.        ,
       4.58257569, 2.82842712, 6.92820323, 5.38516481, 4.47213595,
       3.87298335, 4.47213595, 6.08276253, 4.12310563, 2.44948

In [22]:
my_array

array([47., 21., 23., 24., 45.,  6., 30., 43., 45., 23.,  2., 46.,  4.,
       34., 42.,  2., 47., 14., 18.,  9., 50., 34., 12., 24., 42., 24.,
        3., 39., 17., 15., 37., 18., 46., 25.,  9., 41., 45., 34., 22.,
       26., 27., 44., 28.,  4., 15., 31.,  3., 39., 15., 23.,  5., 27.,
       11., 25., 16., 11.,  2., 43., 35., 45., 27., 48., 44., 20.,  4.,
       21.,  8., 48., 29., 20., 15., 20., 37., 17.,  6., 13., 39., 25.,
        5., 11.,  4., 20., 47.,  9.,  2.,  8., 44., 40.,  8.,  1., 45.,
       26., 43., 10., 22., 24.,  3., 48., 29., 49.])

# Working with `pandas`

When you think of "tabular data" in Python, think of `pandas`. 

This package is built on top of `numpy`, but adds some extra functionality for us. 

We will focus on the `pandas` DataFrame, which is a two-dimensional, tabular data structure with labeled rows and columns. 

![`pandas` DataFrame example](images/pandas-data-frame.jpg)
Source: "Operations in Pandas," [O'Reilly Media blog](https://www.oreilly.com/content/operations-in-pandas/)


*Look familiar?* This is very much the way we often store data in a spreadsheet.

## Importing `pandas`

As with `numpy`, we need to import `pandas` in each session before we can use it.

Similarly to `numpy`, it is common to *alias* `pandas` when we import it. This alias usually takes the form:

`import pandas as pd`

Go ahead and try it yourself in the cell below!

In [10]:
import pandas as pd

### Creating DataFrames

There are several ways to create a DataFrame. We could, for example, convert a `numpy` array into one, using the `DataFrame` function.

In [15]:
# Create our array
numpy_data = np.array([[1,2,3], [4,5,6]])

# Rows and columns get default integer labels unless we name them explicitly
df = pd.DataFrame(data=numpy_data)

df

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
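
If we would rather have descriptive labels than integers, one option is to pass them in when we create the DataFrame. This is a minimal sketch: the column names `a`, `b`, `c` and the row labels `row_1`, `row_2` are invented for this example.

```python
import numpy as np
import pandas as pd

numpy_data = np.array([[1, 2, 3], [4, 5, 6]])

# The labels below are made up for illustration
df_named = pd.DataFrame(
    data=numpy_data,
    columns=["a", "b", "c"],
    index=["row_1", "row_2"],
)

print(df_named)

# Labels let us look up a value by name rather than by position
print(df_named.loc["row_2", "c"])
```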


By default, our 