<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Notes" data-toc-modified-id="Notes-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Notes</a></span><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Pandas" data-toc-modified-id="Pandas-1.0.0.1"><span class="toc-item-num">1.0.0.1&nbsp;&nbsp;</span>Pandas</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Notes
**NumPy Library**: The NumPy library takes advantage of a processor feature called Single Instruction Multiple Data (SIMD) to process data faster. SIMD allows a processor to perform the same operation on multiple data points with a single instruction.

**Vectorization**: This concept of replacing for loops with operations applied to multiple data points at once is called vectorization.
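
As a rough illustration of the idea, the sketch below adds two lists of made-up numbers, first with a for loop and then with a single vectorized expression. The variable names and values are assumptions chosen only for this example.

```python
import numpy as np

# Made-up values for illustration only.
col_a = [6, 1, 5, 9]
col_b = [6, 3, 7, 2]

# Loop version: add the elements one pair at a time.
loop_sums = []
for a, b in zip(col_a, col_b):
    loop_sums.append(a + b)

# Vectorized version: one expression operates on every element.
vector_sums = np.array(col_a) + np.array(col_b)

print(loop_sums)
print(vector_sums)
```

Both versions produce the same sums; the vectorized one simply leaves the looping to NumPy.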

**ndarray**: The core data structure in NumPy that makes vectorization possible is the ndarray, or n-dimensional array. In programming, an array describes a collection of elements, similar to a list. The word n-dimensional refers to the fact that ndarrays can have one or more dimensions. We'll start by working with one-dimensional (1D) ndarrays.

In [1]:
import numpy as np

We can directly convert a list to an ndarray using the `numpy.array()` constructor. To create a 1D ndarray, we pass in a single list:

In [3]:
data_ndarray = np.array([5, 10, 15, 20])
type(data_ndarray)
print(data_ndarray.shape)

(4,)


As you can see from the above, this is a 1D array. It has a single axis containing 4 items, which gives the shape `(4,)`. Let's work with 2D arrays next.

A 2D array is an array with two dimensions: rows and columns.

In [4]:
data_ndarray = np.array([[5, 10, 15], 
                         [20, 25, 30]])
print(data_ndarray.shape)

(2, 3)


The value returned by `shape` is a tuple. Tuples are very similar to Python lists, but can't be modified. The output gives us a few important pieces of information:

    - The first number tells us that there are `2 rows` in data_ndarray.
    - The second number tells us that there are `3 columns` in data_ndarray.

We can select rows in ndarrays very similarly to how we select them in lists of lists. In reality, that is a kind of shortcut. For any 2D array, the full syntax for selecting data is:

`ndarray[row_index,column_index]`

or, if you want to select all columns for a given set of rows:

`ndarray[row_index]`

With a list of lists, we use two separate pairs of square brackets back-to-back. With a NumPy ndarray, we use a single pair of brackets with comma-separated row and column locations.

Let's practice selecting one row, multiple rows, and single items from our taxi ndarray.

**LIST**

`data[1][3]`

**NumPy**

`data[1,3]`
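
To make the comparison concrete, here is a minimal sketch using a small made-up 3x4 dataset. The names `data_list` and `data` and their values are assumptions for illustration.

```python
import numpy as np

# Hypothetical 3x4 data used only for this example.
data_list = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12]]
data = np.array(data_list)

item_from_list = data_list[1][3]   # list of lists: two pairs of brackets
item_from_array = data[1, 3]       # ndarray: one pair, comma-separated
second_row = data[1]               # shortcut for all columns of row 1

print(item_from_list, item_from_array, second_row)
```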

Now let's look at more selections:

`data[:,3]` - Produces a 1D ndarray containing a single column (column 3)

`data[:,1:3]` - Produces a 2D ndarray containing columns 1 and 2 (0 index)

`data[:, [1,3,4]]` - Produces a 2D ndarray containing columns 1, 3 and 4 (0 index)

With a list of lists, we need to use a for loop to extract specific column(s) and append them back to a new list. With ndarrays, the process is much simpler. We again use single brackets with comma-separated row and column locations, but we use a colon (:) for the row locations, which gives us all of the rows.

If we want to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:

`data[2, 1:4]` - Produces a 1D ndarray containing row 2, columns 1, 2, and 3 (0 index)

`data[1:,4]` - Produces a 1D ndarray containing column 4 for row 1 through the last row (0 index)

Lastly, if we want to select a 2D slice, we can use slices for both dimensions:

`data[1:4,:3]` - Produces a 2D ndarray containing rows 1-3 and columns 0-2 (0 index)
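
The selections above can be tried out on a small array. This sketch assumes a made-up 5x5 `data` array built with `np.arange`, so the shapes are easy to check.

```python
import numpy as np

# Hypothetical 5x5 array; the value at (row, col) is 5 * row + col.
data = np.arange(25).reshape(5, 5)

col_3 = data[:, 3]               # 1D: every row, column 3
cols_1_2 = data[:, 1:3]          # 2D: every row, columns 1 and 2
cols_1_3_4 = data[:, [1, 3, 4]]  # 2D: every row, columns 1, 3 and 4
row_2_part = data[2, 1:4]        # 1D: row 2, columns 1 to 3
col_4_part = data[1:, 4]         # 1D: column 4, row 1 to the last row
block = data[1:4, :3]            # 2D: rows 1 to 3, columns 0 to 2

for selection in (col_3, cols_1_2, cols_1_3_4, row_2_part, col_4_part, block):
    print(selection.shape)
```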

Earlier, we only talked about how vectorized operations make code faster; however, they also make our code easier to write and read. A task such as adding two columns, which needs a loop with lists, becomes a single expression with ndarrays.

The result of adding two 1D ndarrays is a 1D ndarray of the same shape (or dimensions) as the originals. In this context, ndarrays can also be called vectors, a term taken from a branch of mathematics called linear algebra. Adding two vectors together is called vector addition.

When we perform these operations on two 1D vectors, both vectors must have the same shape.
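
A minimal sketch of vector addition follows, using made-up values. It also shows what happens when the two vectors have different lengths; the names are assumptions for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Vector addition: elements are added position by position.
total = a + b
print(total, total.shape)

# Vectors of length 3 and length 2 cannot be added element by element.
c = np.array([1, 2])
try:
    a + c
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```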

**Functions** act as standalone segments of code that usually take an input, perform some processing, and return some output. For example, we can use the `len()` function to calculate the length of a list or the number of characters in a string.

**Methods** are special functions that belong to a specific type of object. This means that, for instance, when we work with list objects, there are special functions or methods that can only be used with lists. For example, we can use the list.append() method to add an item to the end of a list. If we try to use that method on a string, we will get an error:
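
The difference can be sketched with a short example. The variable names are arbitrary; the error is caught here only so the example runs to the end.

```python
my_list = [1, 2, 3]
my_list.append(4)          # append() is a list method
n = len(my_list)           # len() is a function that works on many types
print(my_list, n)

my_string = "hello"
try:
    my_string.append("!")  # strings have no append() method
    error_name = None
except AttributeError as err:
    error_name = type(err).__name__
    print(error_name, "-", err)
```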

Next, we'll calculate statistics for 2D ndarrays. If we use the ndarray.max() method on a 2D ndarray without any additional parameters, it will return a single value, just like with a 1D array:

But what if we wanted to find the maximum value of each row? We'd need to use the axis parameter and specify a value of 1 to indicate we want to calculate the maximum value for each row (axis=1).

If we want to find the maximum value of each column, we'd use an axis value of 0 (axis=0):
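
The three cases described above can be sketched with the small 2x3 array from earlier in these notes:

```python
import numpy as np

data_ndarray = np.array([[5, 10, 15],
                         [20, 25, 30]])

overall_max = data_ndarray.max()      # single value for the whole array
row_max = data_ndarray.max(axis=1)    # one value per row
col_max = data_ndarray.max(axis=0)    # one value per column

print(overall_max)
print(row_max)
print(col_max)
```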

**Read a text file using NumPy**

`np.genfromtxt(filename, delimiter=None, skip_header=0)`

`filename`: A positional argument, usually a string representing the path to the text file to be read.

`delimiter`: A named argument, specifying the string used to separate each value.

`skip_header`: A named argument that accepts an integer, the number of rows to skip from the start of the file.

Because we have a CSV file, the delimiter is a comma. Here's how we'd read in a file named `data.csv`:

`data = np.genfromtxt('data.csv', delimiter=',')`

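If the first row of the file contains text column headers, `genfromtxt` reads those values as `nan` (not a number) by default. That's because NumPy ndarrays can contain only one datatype, and text can't be converted to a number.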

We can use the ndarray.dtype attribute to see the internal datatype that has been used.

`print(data.dtype)`
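
As a self-contained sketch, the example below reads CSV text from an in-memory buffer instead of a file on disk, since `np.genfromtxt` also accepts file-like objects. The column names and values are made up for illustration.

```python
import io
import numpy as np

# Made-up CSV content standing in for a file such as data.csv.
csv_text = "year,fare\n2016,11.5\n2016,7.0\n"

# Without skip_header, the text header row is read as nan values.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# With skip_header=1, the header row is skipped entirely.
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(with_header)
print(without_header)
print(without_header.dtype)
```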

Previously, we learned how to index, or select, data from ndarrays. Here, we're going to focus on arguably the most powerful indexing method: the boolean array. A boolean array, as the name suggests, is an array of boolean values. Boolean arrays are sometimes called boolean vectors or boolean masks.

The boolean array acts as a filter, so that the values corresponding to True become part of the result and the values corresponding to False are removed.

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned earlier. The only limitation is that the boolean array must have the same length as the dimension you're indexing. Let's look at some examples.
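
Here is a minimal sketch of boolean indexing on a 1D and a 2D array. All names and values are made up for illustration.

```python
import numpy as np

values = np.array([2, 4, 6, 8])
mask = values > 5            # comparison produces a boolean array
filtered = values[mask]      # keeps only the values where mask is True
print(mask, filtered)

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# The mask has length 3, matching the number of rows in grid.
row_mask = np.array([True, False, True])
picked = grid[row_mask, 1:]  # boolean mask for rows, slice for columns
print(picked)
```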

#### Pandas

**Dataframe**: the primary pandas data structure, a 2D pandas object.

Recall that one of the features that makes pandas better for working with data is its support for string column and row labels:

    - Axis values can have string labels, not just numeric ones.
    - Dataframes can contain columns with multiple data types, including integer, float, and string.
    
We can use the `DataFrame.dtypes` attribute (similar to NumPy's ndarray.dtype attribute) to return information about the types of each column.    
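
A small sketch of a dataframe with string labels and mixed column types is shown below. The column names, index labels, and values are all hypothetical.

```python
import pandas as pd

# Hypothetical data: string row labels and three differently typed columns.
df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenue": [10.5, 20.0, 7.25],
        "country": ["USA", "China", "Japan"],
    },
    index=["company_a", "company_b", "company_c"],
)

# One dtype per column, returned as a series.
print(df.dtypes)
```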

**Series** is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a **series**. Anytime you see a *2D pandas object*, it will be a dataframe.

Think of a dataframe as a collection of series objects, which is similar to how pandas stores the data behind the scenes.

**Selecting Columns**

**`df['column_label']`** - select single column

**`df[['column_label_1', 'column_label_2']]`** - select a list of columns

**`df.loc[:, 'column_label_1':'column_label_2']`** - slice columns

**Selecting Rows**

Now that we've learned how to select columns by label, let's learn how to select rows using the labels of the index axis:

**`df.loc['row']`** - select single row

**`df.loc[['row_1', 'row_2', 'row_3']]`** - select multiple rows

**`df.loc['row_1':'row_5']`** - slice rows
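
The column and row selections above can be sketched on a tiny dataframe. The labels `row_1`, `col_1`, and so on are placeholders, as in the syntax summaries.

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [1, 2, 3], "col_2": [4, 5, 6], "col_3": [7, 8, 9]},
    index=["row_1", "row_2", "row_3"],
)

single_col = df["col_1"]                 # one column, returned as a series
two_cols = df[["col_1", "col_3"]]        # list of columns, returned as a dataframe
col_slice = df.loc[:, "col_1":"col_2"]   # label slices include both endpoints
single_row = df.loc["row_2"]             # one row, returned as a series
row_slice = df.loc["row_1":"row_2"]      # slice of rows, returned as a dataframe

print(col_slice)
print(single_row)
```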

Because series and dataframes are two distinct objects, they have their own unique methods. Let's look at an example of a series method next: the `Series.value_counts()` method. This method returns each unique non-null value in a column along with its count, sorted from most to least frequent.

In the resulting series, we can see each unique non-null value in the column and their counts.

However, what if we wanted to select just the count for a specific row?

As with dataframes, we can use `Series.loc[]` to select rows from a series using single labels, a list, or a slice object. We can also omit `loc[]` and use bracket shortcuts for all three:

**`s['row_1']`** - select single row

**`s[['row_1', 'row_2', 'row_3']]`** - select multiple rows

**`s['row_1':'row_5']`** - slice rows
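
As a sketch of how these pieces fit together, the example below counts values in a made-up series and then selects from the result by label. The country names are illustrative only.

```python
import pandas as pd

countries = pd.Series(["USA", "China", "USA", "Japan", "USA", "China"])

# The unique values become the index labels of the resulting series.
counts = countries.value_counts()
print(counts)

usa_count = counts["USA"]            # single label
subset = counts[["USA", "Japan"]]    # list of labels
sliced = counts["USA":"China"]       # label slice, both endpoints included
```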

**Different Label Selection Methods**

**Single column from dataframe** - `df.loc[:,"col1"]` OR `df["col1"]`

**List of columns from dataframe** - `df.loc[:,["col1","col7"]]` OR `df[["col1","col7"]]`

**Slice of columns from dataframe** - `df.loc[:,"col1":"col4"]`

**Single row from dataframe** - `df.loc["row4"]`

**List of rows from dataframe** - `df.loc[["row1", "row8"]]`

**Slice of rows from dataframe** - `df.loc["row3":"row5"]` OR `df["row3":"row5"]`

**Single item from series** - `s.loc["item8"]` OR `s["item8"]`

**List of items from series** - `s.loc[["item1","item7"]]` OR `s[["item1","item7"]]`

**Slice of items from series** - `s.loc["item2":"item4"]` OR `s["item2":"item4"]`

Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported. Recall that one of the ways NumPy makes working with data easier is with vectorized operations, or operations applied to multiple data points at once.

Vectorization not only improves our code's performance, but also enables us to write code more quickly.

Because pandas is built on top of NumPy, it also supports vectorized operations. Let's look at an example of how this would work with a pandas series:
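
A minimal sketch with two made-up series that share the same index labels:

```python
import pandas as pd

# Hypothetical figures for three companies.
revenues = pd.Series([100, 250, 80], index=["a", "b", "c"])
profits = pd.Series([10, 50, -5], index=["a", "b", "c"])

# Each operation is applied to every item at once; no loop is needed.
costs = revenues - profits
doubled = revenues * 2

print(costs)
print(doubled)
```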

Like NumPy, pandas supports many descriptive statistics methods. Here are a few of the most useful ones:

`Series.max()`

`Series.min()`

`Series.mean()`

`Series.median()`

`Series.mode()`

`Series.sum()`

`Series.describe()`

Unlike their series counterparts, dataframe methods take an axis parameter that controls which axis to calculate across. While you can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings "index" and "columns" for the axis parameter:

`DataFrame.method(axis=0)` OR `DataFrame.method(axis='index')` - Calculates along the row axis

`DataFrame.method(axis=1)` OR `DataFrame.method(axis='columns')` - Calculates along the column axis
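
The sketch below applies a few of these methods to a tiny made-up dataframe, using `sum()` to show the effect of each axis value.

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [1, 2, 3], "col_2": [10, 20, 30]},
    index=["row_1", "row_2", "row_3"],
)

# Series methods: one result for the whole column.
col_2_max = df["col_2"].max()
col_1_mean = df["col_1"].mean()

# Dataframe methods: the axis decides the direction of the calculation.
col_sums = df.sum(axis="index")     # moves down the rows: one result per column
row_sums = df.sum(axis="columns")   # moves across the columns: one result per row

print(col_sums)
print(row_sums)
```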

**Assignments**

`df['col_1'] = 0` - Assigns `0` to every row in `col_1`

`df.loc['row_1', 'col_1'] = 0` - Assigns `0` to `row_1` in `col_1`
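
Both kinds of assignment can be sketched on a small made-up dataframe. The value `99` is used for the single cell only to make the change easy to spot.

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [1, 2, 3], "col_2": [4, 5, 6]},
    index=["row_1", "row_2", "row_3"],
)

df["col_2"] = 0                    # every row in col_2 becomes 0
df.loc["row_1", "col_1"] = 99      # only the cell at row_1, col_1 changes

print(df)
```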

Recall that when we worked with a dataframe with string index labels, we used `loc[]` to select data:

`loc` - label based selection

`iloc` - integer position based selection

`df.iloc[row_index, column_index]` - general syntax for `.iloc[]`
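
To contrast the two, this sketch selects the same cell by label and by integer position on a made-up dataframe.

```python
import pandas as pd

df = pd.DataFrame(
    {"col_1": [1, 2, 3], "col_2": [4, 5, 6]},
    index=["row_1", "row_2", "row_3"],
)

by_label = df.loc["row_2", "col_2"]   # label based selection
by_position = df.iloc[1, 1]           # integer position based selection

# iloc slices exclude the end position, like Python list slices.
first_two_rows = df.iloc[0:2]

print(by_label, by_position)
print(first_two_rows)
```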