# Introduction to pandas Data Structures


---

In this chapter, you will become familiar with the two primary data structures of `pandas`: `Series` and `DataFrames`. We use `Series` and `DataFrames` rather than native `python` data structures to hold our data because they come with additional attributes and methods that are useful for wrangling and analytics. We will see that one of their primary benefits over native `python` data structures is that they offer a very natural way to describe a data set in an Excel-like manner, by referencing the rows and columns of our data with labels of our choosing.


## Preparing our Environment

---


At the start of every chapter we will import any additional modules needed for our execution environment. In this chapter all we need is the `pandas` `Python` package, commonly aliased as `pd`.


In [0]:
import pandas as pd

## About the Data

---

We will be manually constructing `Series` and `DataFrames` to gain a deeper understanding of these data structures, specifically how they are organized and how we can interact with them. To illustrate points in this chapter we will be using a framework for a data set that describes prescription orders made by doctors. The data structures we will be building will contain one or more of the following features:

| Feature |Description|
|:----------|-----------|
| `unique_id`| A unique identifier for a Medicare claim to the Centers for Medicare & Medicaid Services (CMS) |
| `doctor_id` | The unique identifier of the doctor who <br/> prescribed the medicine  |
| `specialty` | The specialty of the doctor who prescribed the medicine |
| `medication` | The medication prescribed |
| `nb_beneficiaries` | The number of beneficiaries the <br/> medicine was prescribed to  |
| `spending` | The total cost of the medicine prescribed <br/>for the CMS |

## Pandas `Series` Vs. `DataFrames`

---

Pandas has two principal data structures, `Series` and `DataFrames`. If you are familiar with Microsoft's Excel application then you can liken `Series` to single columns (or rows) in an Excel sheet and `DataFrames` to entire tables (or spreadsheets).

![](images/ExcelToPandas.png)

We see in the image above that a `Series` in the context of Excel could be the first row of the spreadsheet, while a `DataFrame` would be the entire spreadsheet. In other words, a `DataFrame` is simply a collection of labeled `Series`.
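
The relationship between the two structures can be sketched in a few lines. The variable names below are illustrative only, and the `Series()` and `DataFrame()` constructors used here are introduced properly later in this chapter.

```python
import pandas as pd

# Two labeled Series, each holding one feature of the running example.
doctor_id = pd.Series([1234, 3210])
medication = pd.Series(['DIAZEPAM', 'CLONAZEPAM'])

# Each Series becomes one labeled column of the DataFrame.
df = pd.DataFrame({'doctor_id': doctor_id, 'medication': medication})
print(df)

# Pulling a single column back out gives a Series again.
column = df['medication']
print(type(column).__name__)  # Series
```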



## `Series`

---

Relating to native `python` data structures, `Series` are most like `python` `lists` in that both are ordered collections of items. However, there are two important differences between `python` `lists` and `pandas` `Series`: all the items stored in a `Series` are of the same datatype, and a `Series` contains a user-definable array of labels, one for each data entry, called the `Index`.

![](images/series.png)

In the image above each data entry in the right hand column has a corresponding label stored in the left hand column. These labels, as we will see, are in fact flexible and may be whatever we wish. 

Also note that the data entries look mixed in the image: some entries are strings, like 'DIAZEPAM', while other entries look like integers, for example '3'. Yet we said earlier that a `Series` is made up of data of the same type. This is still true: `pandas` handles data that would be mixed in native `python` by storing every entry under a single catch-all type that it calls `object`.

### `pandas` Data Types

When creating a `Series` `pandas` will store all the data as the same type. The mapping from the native `python` types to what they would be in `pandas` is summarized below. 

| Python Type | Equivalent `pandas` Type | Description | 
|:-------------|:------------------------|:-------------|
| `str` or mixed | `object` | Columns made up partially or completely of strings |
| `int`   | `int64` | Columns with numeric (integer) values. The 64 here refers <br/>to the size of the memory space allocated to this type |
| `float` | `float64` | Columns with floating point numbers (numbers with decimal points) |
|`bool`| `bool` | True/False values |

* The `datetime64` type will not be discussed here.
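
The mapping in the table can be checked directly through the `dtype` attribute. The following is a small sketch with made-up values; the variable names are illustrative and not part of the running example.

```python
import pandas as pd

# Each Series is built from a list of a single Python type,
# except the last one, which mixes integers and a string.
ints = pd.Series([3, 50, 113])
floats = pd.Series([32.0, 102.5, 54.25])
bools = pd.Series([True, False, True])
mixed = pd.Series([1234, 'DIAZEPAM', 3])

print(ints.dtype)    # int64
print(floats.dtype)  # float64
print(bools.dtype)   # bool
print(mixed.dtype)   # object
```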

# Exercise 1.1 `pandas` Data Types

What will the `pandas` data type of the following `Series` be?

|  |  |
|:-------------|:------------------------|
| `food` | 'poke' |
| `drink`   | 'tea' |
| `price` | `10.00` | 

A: object

B: int64

C: float64

D: bool

### Creating Series From Python Data Structures

Two ways we can create a pandas `Series` from native `Python` data structures are from `lists` and from `dicts`. To construct a `Series` we will be using the `pandas` `Series()` function. The `Series` we build in these examples can be thought of as single row entries of the running example dataset discussed earlier in the *About the Data* section of this chapter.

 1. To create a pandas `Series` from a Python `list` we can use the following syntax:
 
```python
>>> s1 = pd.Series([1234, 'DIAZEPAM', 3, '$32'])
>>> s1
0        1234
1    DIAZEPAM
2           3
3         $32
dtype: object
```

Note that the values in the left hand column are the indices labeling the data in the right hand column. Since we only passed a list of data entries to the `pandas` `Series()` function, `pandas` had to infer an index. By default `pandas` will index the data using a range of integers starting from 0. Also notice that the `pandas` data type of the entries is printed on the last line. In this example each data entry is stored as a pandas `object` because the provided list was a list of mixed native `python` types: strings and integers.

2. To create a pandas `Series` from a Python `dict` we can use the following syntax: 

```python
>>> s2 = pd.Series({'doctor_id': 1234, 'medication': 'DIAZEPAM', 'nb_beneficiaries': 3, 'spending': '$32'})
>>> s2
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
 ```
 
Notice this time the keys of the `Python` `dict` are used to build the `Index` and the values of the `Python` `dict` are the `Series'` data entries. 


In [0]:
s1 = pd.Series([1234, 'DIAZEPAM', 3, '$32'])
s1

0        1234
1    DIAZEPAM
2           3
3         $32
dtype: object

In [0]:
s2 = pd.Series({'doctor_id': 1234, 'medication': 'DIAZEPAM', 
                'nb_beneficiaries': 3, 'spending': '$32'})
s2

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object

### The pandas `Index` Object

As mentioned, `pandas` `Series` have an associated `Index` object which labels each entry of the `Series`. We saw in the example of creating `Series` from Python lists that the default `Index` object of the `Series`, if not specified when created, is the range of integers $0$ through $N-1$ where $N$ is the number of data entries in the `Series`. To access a `Series'` associated `Index` object, we use the `index` attribute of the `Series`. Thus, printing the `Index` object of the `Series` `s1` and `s2` yields the following: 

```python
>>> s1.index
RangeIndex(start=0, stop=4, step=1)
>>> s2.index
Index(['doctor_id', 'medication', 'nb_beneficiaries', 'spending'], dtype='object')
```

The index of the first `Series`, `s1`, is the range of integers starting from 0, stopping before 4, with a step of 1 (in other words, the list \[0,1,2,3\]). The index of the second `Series`, `s2`, is a list of strings: \['doctor_id', 'medication', 'nb_beneficiaries', 'spending'\].



In [0]:
s1.index

RangeIndex(start=0, stop=4, step=1)

In [0]:
s2.index

Index(['doctor_id', 'medication', 'nb_beneficiaries', 'spending'], dtype='object')

# Exercise 1.2: Creating Series

Using the code cell below, create and print a `Series` from a `Python dict` that has two entries: '2' labeled by the index 'spam musubi' and '1' labeled with the index 'surf wax' . Save the `Series` using the variable name 'shopping_list'.

Hint: Remember that with Jupyter Notebooks all we need to do to print the `Series` saved by the variable 'shopping_list' is type the name of the variable on the last line of the code cell and then run the code cell.

In [0]:
# Type your answer to exercise 1.2 here


### Indexing `Series`
The `Index` object of a `Series` is flexible and we can make it whatever set of labels we want. There are three ways to customize the `Index` object of a pandas `Series`.

1. Instantiating the `Series` using a dictionary (as we saw in the earlier section: *Creating `Series` From Python Data Structures*):

```python
>>> s1 = pd.Series({'doctor_id': 1234, 'medication': 'DIAZEPAM', 'nb_beneficiaries': 3, 'spending': '$32'})
```

2. Passing a list to the "index" argument of the `pandas` `Series` function:

```python
>>> s1 = pd.Series([1234, 'DIAZEPAM', 3, '$32'], index=['doctor_id', 'medication', 'nb_beneficiaries', 'spending'])
```

3. By changing the index after instantiation. Please note that the new index must be a list of the same length as the object it modifies:

```python
>>> s1 = pd.Series([1234, 'DIAZEPAM', 3, '$32'])
>>> s1.index = ['doctor_id', 'medication', 'nb_beneficiaries', 'spending']
```

In [0]:
s1 = pd.Series({'doctor_id': 1234, 'medication': 'DIAZEPAM', 
                'nb_beneficiaries': 3, 'spending': '$32'})
print(s1)
print('----------------------------')
s1 =  pd.Series( [1234, 'DIAZEPAM', 3, '$32'], 
                index= ['doctor_id', 'medication', 'nb_beneficiaries', 
                        'spending'])
print(s1)
print('----------------------------')
s1 = pd.Series([1234, 'DIAZEPAM', 3, '$32'])
s1.index = ['doctor_id', 'medication', 'nb_beneficiaries', 'spending']
print(s1)

doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
----------------------------
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
----------------------------
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
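
The length requirement mentioned in the third approach can be seen in a short sketch. The `Series` name `s` is illustrative; the point is only that `pandas` rejects a replacement index whose length does not match the number of entries.

```python
import pandas as pd

s = pd.Series([1234, 'DIAZEPAM', 3, '$32'])

try:
    # Only three labels for four entries: pandas rejects the assignment.
    s.index = ['doctor_id', 'medication', 'nb_beneficiaries']
except ValueError as err:
    print('ValueError:', err)

# The failed assignment leaves the default integer index untouched.
print(list(s.index))  # [0, 1, 2, 3]
```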


### Accessing `Series` Data

Data in a `Series` may be accessed by *integer location* or by *label location*. 

1. To access the data in a `pandas` `Series` by its integer location we use the same approach seen in `python` `list` indexing: we specify the integer location, $i$, of the data we want to retrieve from the `Series`, `s`, within square brackets: `s[i]`. As with lists, `Series` integer locations start at 0. For instance, to access the entry at integer location 1 (the second entry) of the `Series` `s1` we would type:

```python
>>> s1[1]
'DIAZEPAM'
```

Some of the conveniences seen in native `python` list indexing are also available with `Series`. For example, to access the last element of the `Series` we would type:

```python
>>> s1[-1]
'$32'
```

2. To access the data in a `pandas` `Series` by label we can use the same approach seen in `python` `dicts`: we specify the label (the key, for dicts), $l$, of the data we want to retrieve from the `Series`, `s`, within square brackets: `s[l]`, or using *dot* notation: `s.l`. For instance, if we want the data labeled "medication" from the `Series` `s1` we would type:

```python
>>> s1["medication"]
'DIAZEPAM'
```

OR

```python
>>> s1.medication
'DIAZEPAM'
```

Based on that explanation, it is fair to think of a `Series` as a hybrid between lists and dictionaries.

In [0]:
print(s1[1])
print(s1["medication"])
print(s1.medication)
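
As far as we are aware, recent `pandas` 2.x releases emit a `FutureWarning` when a plain integer in square brackets is used as a position on a `Series` whose labels are not integers, so the behaviour of `s1[1]` may depend on the installed version. The sketch below uses the explicit `iloc` (integer location) and `loc` (label) accessors, which also exist on `Series` and make the intent unambiguous. The name `s` is illustrative.

```python
import pandas as pd

s = pd.Series([1234, 'DIAZEPAM', 3, '$32'],
              index=['doctor_id', 'medication', 'nb_beneficiaries', 'spending'])

# Explicit integer-location access.
print(s.iloc[1])            # DIAZEPAM
print(s.iloc[-1])           # $32

# Explicit label access.
print(s.loc['medication'])  # DIAZEPAM
```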

# Exercise 1.3: Indexing Series and Accessing Series Data

Which lines of code will create the `Series`, `shopping_list`, that has two entries: '2' labeled by 'musubi' and '1' labeled by 'wax' , and then print the value labeled by 'wax'?

The `Series` should have the following form:

|  | |
|:----------|-----------|
| `musubi` | 2 |
| `wax` | 1  |


A:
```python
shopping_list = pd.Series({'musubi': 2, 'wax': 1}) 
shopping_list[0]
```

B:
```python
shopping_list = pd.Series( ['musubi', 'wax'], index=[2,1]) 
shopping_list[0]
```

C:
```python
shopping_list = pd.Series( [2,1], index=['musubi', 'wax']) 
shopping_list['wax']
```

D:
```python
shopping_list = pd.Series({2:1, 'index':['musubi', 'wax']}) 
shopping_list.wax
```

*Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.*

In [0]:
# Exercise 1.3 scratch code cell

### `Series` Attributes and Methods

`Series` objects have many useful attributes and methods; some that you may find most helpful are listed below.

| Attribute |Description|
|:----------|-----------|
| `dtype`| Return the dtype object of the underlying data |
| `name`| Return the name of the `Series` |
| `size`| Return the number of elements in the underlying data |
| `values`| Return the `Series` as an ndarray or ndarray-like |


| Method |Description|
|:----------|-----------|
| `add(other[, level, fill_value, axis])` | Addition of series and other, element-wise (binary operator add) |
| `nlargest([n, keep])`| Return the largest n elements. |
| `sort_values([axis, ascending, inplace, ...])`| Sort by the values along either axis |
| `sum([axis, skipna, level, numeric_only, ...])`| Return the sum of the values for the requested axis |
| `unique()`| Return unique values in the object |

To call a Method or access an Attribute of the `Series` object, we use *dot* notation:

```python
>>> s1.size
4
>>> s1.unique()
array([1234, 'DIAZEPAM', 3, '$32'], dtype=object)
```

More attributes and methods can be found at the <a href="https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.html">pandas documentation</a> webpage.

In [0]:
print(s1.size)
print(s1.unique())
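
Several of the listed methods are easier to appreciate on numeric data. The sketch below reuses the beneficiary counts from the running example; the `Series` name `counts` and its `name` value are chosen here for illustration.

```python
import pandas as pd

# Beneficiary counts labeled by medication name.
counts = pd.Series([3, 50, 113, 95],
                   index=['DIAZEPAM', 'CLONAZEPAM', 'NADOLOL', 'OXYCODONE HCL'],
                   name='nb_beneficiaries')

print(counts.name)           # nb_beneficiaries
print(counts.sum())          # 261
print(counts.nlargest(2))    # NADOLOL (113), then OXYCODONE HCL (95)
print(counts.sort_values())  # ascending order: 3, 50, 95, 113
print(counts.add(1))         # every count increased by one
```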

## `DataFrames`

---

`DataFrames` are essentially ordered collections of `Series` with two associated `Index` objects, one to label rows and another to label columns. Reiterating the example mentioned earlier, it helps to think of `DataFrames` as MS Excel spreadsheets where each row (or column) is an individual `Series`.

![](df_index_cols.png)


### Creating `DataFrames` From Python Data Structures

There are many ways to create a `DataFrame` from native `Python` data structures including, but not limited to, a `list` of `dicts`, a `list` of `lists`, and a `list` of `tuples`. However, possibly the most common way to build a `DataFrame` from native `Python` data structures is using a `dict` of equal-length `lists`. To do this we will be using the `pandas` `DataFrame()` function:

```python
>>> data = {'doctor_id': [1234, 3210, 6789, 5678],
...         'medication': ['DIAZEPAM', 'CLONAZEPAM', 'NADOLOL', 'OXYCODONE HCL'],
...         'nb_beneficiaries': [3, 50, 113, 95],
...         'spending': ['$32', '$102', '$54', '$43']}
>>> df1 = pd.DataFrame(data)
>>> df1
   doctor_id     medication  nb_beneficiaries spending
0       1234       DIAZEPAM                 3      $32
1       3210     CLONAZEPAM                50     $102
2       6789        NADOLOL               113      $54
3       5678  OXYCODONE HCL                95      $43

In the example above we first constructed a dictionary, called data, of key value pairs where each key is a label for its corresponding value, a list of data entries. Notice that by default the `dict's` keys are used as the column labels and the values are aligned by index, that is its location in the list, to fill the rows.  The rows of the `DataFrame`, similar to `Series`, are automatically indexed with a range of integers starting from 0. 


In [0]:
data = {'doctor_id': [1234, 3210, 6789, 5678],
        'medication': ['DIAZEPAM', 'CLONAZEPAM', 'NADOLOL',
                       'OXYCODONE HCL'],
        'nb_beneficiaries': [3, 50, 113, 95],
        'spending': ['$32', '$102', '$54', '$43']}
df1 = pd.DataFrame(data)
df1

   doctor_id     medication  nb_beneficiaries spending
0       1234       DIAZEPAM                 3      $32
1       3210     CLONAZEPAM                50     $102
2       6789        NADOLOL               113      $54
3       5678  OXYCODONE HCL                95      $43
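
For comparison, here is a sketch of two of the other construction routes mentioned above, a `list` of `dicts` and a `list` of `lists`. The variable names are illustrative, and only two rows and two columns of the running example are used to keep the output short.

```python
import pandas as pd

# A list of dicts: each dict is one row, keyed by column label.
rows_as_dicts = [
    {'doctor_id': 1234, 'medication': 'DIAZEPAM'},
    {'doctor_id': 3210, 'medication': 'CLONAZEPAM'},
]
df_from_dicts = pd.DataFrame(rows_as_dicts)

# A list of lists: each inner list is one row, and the column
# labels are passed separately through the "columns" argument.
rows_as_lists = [[1234, 'DIAZEPAM'], [3210, 'CLONAZEPAM']]
df_from_lists = pd.DataFrame(rows_as_lists, columns=['doctor_id', 'medication'])

print(df_from_dicts)
print(df_from_lists)
```

Both routes describe the data row by row, whereas the `dict` of `lists` used above describes it column by column.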


# Exercise 1.4: Creating `DataFrames`

Using the code cell below, create and print a `DataFrame` from a `Python dict` of equal-length `lists` with two columns and two rows. The first column label should be the string: 'item', and the second column label should be the string: 'quantity'.  The first row should have the entries 'poke bowl' and the integer value 1 in the first and second columns respectively. The second row should have the entries 'pineapple' and the integer value 2 in the first and second columns respectively. Save the `DataFrame` using the variable name 'shopping_list_df'.

`shopping_list_df` should have the following form:

|  | item | quantity|
|:----------|-----------|:----------|
|0| 'poke bowl' | 1|
|1| 'pineapple' | 2 |


Note: Remember that with Jupyter Notebooks all we need to do to print the `DataFrame` saved by the variable 'shopping_list_df' is type the name of the variable on the last line of the code cell and then run the code cell.


In [0]:
# Type your answer to exercise 1.4 here

### Indexing `DataFrames`

A `DataFrame` has two `Index` objects associated with it: one labeling the columns, accessible via the `columns` attribute, and the other labeling the rows, accessible via the `index` attribute. Continuing with the previous example, we access the `columns` and `index` attributes of the `df1` `DataFrame` like so:

```python
>>> df1.columns
Index(['doctor_id', 'medication', 'nb_beneficiaries', 'spending'], dtype='object')
>>> df1.index
RangeIndex(start=0, stop=4, step=1)
```

Notice that, similar to `Series`, the `DataFrame's` default indexing is the range of integers starting from 0. In the example above the row index, `df1.index` is the range of integers starting from 0, stopping before 4, and with a step of size 1, i.e. the list of integers: \[0,1,2,3\]. 

The `columns` and `index` labels may be customized in essentially the same 3 ways one could set the `Series` `Index` object:

1. Instantiating the `DataFrame` using a `dict` of `dicts` (similar to what we saw in the previous section):

```python
>>> data2 = {'doctor_id': {'row_1':1234, 
                           'row_2':3210, 
                            }, 
             'medication': {'row_1':'DIAZEPAM',
                           'row_2':'CLONAZEPAM'}
               }
>>> df2 = pd.DataFrame(data2)
```

2. Using lists which specify the row names and column names using the "index"  and "columns" argument of the `pandas DataFrame()` function respectively:

```python
>>> data3 = [[1234, 'DIAZEPAM'], [3210, 'CLONAZEPAM']]
>>> df2 = pd.DataFrame(data3, 
                       index=['row_1', 'row_2'], 
                       columns=['doctor_id', 'medication'])
```

3. By changing the index and columns after instantiation:

```python
>>> df2 = pd.DataFrame(data3)
>>> df2.columns = ['doctor_id', 'medication']
>>> df2.index=['row_1', 'row_2']
```

All three methods result in equivalent `DataFrames`:

```python
>>> print(df2)
       doctor_id  medication
row_1       1234    DIAZEPAM
row_2       3210  CLONAZEPAM
```
 

In [0]:
print(df1.columns)
print(df1.index)

In [0]:
data2 = {'doctor_id': {'row_1':1234, 'row_2':3210}, 
            'medication': {'row_1':'DIAZEPAM', 'row_2':'CLONAZEPAM'}}
df2 = pd.DataFrame(data2)
print(df2)
print('-----------------------------------')
data3 = [[1234, 'DIAZEPAM'],
         [3210, 'CLONAZEPAM']]
df2 = pd.DataFrame(data3, 
                   index=['row_1', 'row_2'], 
                   columns=['doctor_id', 'medication'])
print(df2)
print('-----------------------------------')
df2 = pd.DataFrame(data3)
df2.columns = ['doctor_id', 'medication']
df2.index=['row_1', 'row_2']
print(df2)

### Accessing Data from the `DataFrame`

#### Singular Entries:

Similar to `pandas` `Series`, Rows and Columns of `DataFrames` can be accessed using either their *label location* or *integer location*. The Row and Column *labels* are the labels stored in the `DataFrames'` `Index` and `Columns` attributes, respectively. Row and Column *locations* are the integer values of the ordered positions of the Row or Column being referenced. 

There are two attributes we will be using to access rows and columns of our `DataFrame`: `iloc` and `loc`. To access rows and columns by their integer location, `iloc` (short for "**I**nteger **loc**ation") is used. To access rows and columns by their labels, `loc` (short for "**loc**ation") is used.

Once we call one of the two attributes (`iloc` or `loc`), accessing rows and columns of a `DataFrame` is similar to `python` `list` indexing. For instance, if we wanted to access row $i$ column $j$, of a `DataFrame`, `df`, then we would use either `loc` or `iloc`, depending on whether we are using the integer location or labels, and square brackets, i.e., `df.loc[i,j]` or `df.iloc[i,j]`. 

Let us continue with our example using the `DataFrame` `df2`. Suppose we wanted to access the entry labeled by row 'row_1' and by column 'medication'. We could do this in two ways:

1. **By Integer Location**

```python
>>> df2.iloc[0,1]
'DIAZEPAM'
```
2. **By Label**

```python
>>> df2.loc['row_1', 'medication']
'DIAZEPAM'
```


In [0]:
print(df2.iloc[0,1])
print(df2.loc['row_1', 'medication'])

#### Accessing Data from the `DataFrame` - Continued

#### Subsets of Rows and Columns

Multiple rows and columns may be accessed at once by passing a list of *labels* or *integer locations* to the `loc` and `iloc` attributes, rather than single labels or integers. What is returned for each of these commands is a new `pandas` object that keeps the row and column labels of the selected entries but contains only that subset of the data. The returned object will be either a `Series` or a `DataFrame`, depending on how the subset is specified: if a single label or integer (rather than a list) is given for the rows or for the columns, the result is a `Series`; if lists are given for both, the result is a `DataFrame`.

For example, if we wanted to access the first 2 rows of the `DataFrame` `df2` but only the column labeled 'medication' we could do this in two ways:

1. **By *integer locations*** 

```python
>>> df2.iloc[[0,1], 1]
row_1      DIAZEPAM
row_2    CLONAZEPAM
Name: medication, dtype: object
```

2. **By *labels*** 

```python
>>> df2.loc[['row_1', 'row_2'], 'medication']
row_1      DIAZEPAM
row_2    CLONAZEPAM
Name: medication, dtype: object
```

We see from the above examples that the returned object is indeed a `Series`. Notice that the index labels match the index labels of the original `DataFrame`. Also note that there is a new attribute associated with the `Series` being displayed, the `name`, which is the label of the single column we are accessing.

Furthermore, suppose we wanted to access the first 2 columns of the `DataFrame` `df2` but only the row at integer location 1. We could do this in the following two ways.

1. **By *integer locations*** 

```python
>>> df2.iloc[1,[0,1]]
doctor_id           3210
medication    CLONAZEPAM
Name: row_2, dtype: object
```

2. **By *labels*** 

```python
>>> df2.loc['row_2', ['doctor_id','medication']]
doctor_id           3210
medication    CLONAZEPAM
Name: row_2, dtype: object
```

The above example again shows that the returned object is a `Series`, as expected. Notice this time though that the index labels match the column labels of the original `DataFrame`. Also note that the `name` of the `Series` is the label of the single row we were accessing.

Subsetting rows and columns can in fact be done simultaneously, if you desire. For instance, if we want the first two rows and columns of the `DataFrame` `df2` then we could type it in one of two ways:

1. **By *integer locations***

```python
>>> df2.iloc[[0,1], [0,1]]
       doctor_id  medication
row_1       1234    DIAZEPAM
row_2       3210  CLONAZEPAM
```

2. **By *labels***


```python
>>> df2.loc[['row_1', 'row_2'], ['doctor_id','medication']]
       doctor_id  medication
row_1       1234    DIAZEPAM
row_2       3210  CLONAZEPAM
```

In the example above we access a subset of the `DataFrame` with more than one column and row so we get back a new `DataFrame`. The new `DataFrame` has all the same index and column labels as the original. 

Note that this is the start of an important data wrangling skill called *subsetting*, which we will cover more deeply in the upcoming chapter: *Exploring Data*.


In [0]:
print(df2.iloc[[0,1], 1])
print('----------------------------')
print(df2.loc[['row_1', 'row_2'], 'medication'])

In [0]:
print(df2.iloc[1,[0,1]])
print('----------------------------')
print(df2.loc['row_2',['doctor_id','medication']])

In [0]:
print(df2.iloc[[0,1],[0,1]])
print('----------------------------')
print(df2.loc[['row_1', 'row_2'],['doctor_id','medication']])
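
The rule for whether a `Series` or a `DataFrame` comes back can be checked with `type()`. The sketch below rebuilds the same small table under the illustrative name `df` so that it runs on its own; the other variable names are also made up for this example.

```python
import pandas as pd

df = pd.DataFrame([[1234, 'DIAZEPAM'], [3210, 'CLONAZEPAM']],
                  index=['row_1', 'row_2'],
                  columns=['doctor_id', 'medication'])

# A single column label with a list of rows gives a Series.
one_column = df.loc[['row_1', 'row_2'], 'medication']
# A single row location with a list of columns gives a Series.
one_row = df.iloc[1, [0, 1]]
# Lists for both rows and columns give a DataFrame.
block = df.loc[['row_1', 'row_2'], ['doctor_id', 'medication']]
# A one-element list still counts as a list, so this is a DataFrame too.
one_row_frame = df.loc[['row_1'], ['doctor_id', 'medication']]

print(type(one_column).__name__)     # Series
print(type(one_row).__name__)        # Series
print(type(block).__name__)          # DataFrame
print(type(one_row_frame).__name__)  # DataFrame
print(one_column.name, one_row.name) # medication row_2
```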

#### Accessing Data from the `DataFrame` - Continued

#### Entire Rows and Columns

Accessing an entire row or column is very similar to accessing subsets of the `DataFrame`, which we covered in the previous cell. In fact, by providing a list of all the integer locations or all the labels of the `DataFrame`, the syntax is identical to that in the previous cell. For instance, if we wanted to access the entire first row of the `DataFrame` `df2`, we could use the label of the first row together with the `columns` attribute of the `DataFrame`, `df2.columns`, which gives the entire list of column labels. We would type:

```python
>>> df2.loc['row_1', df2.columns]
doctor_id         1234
medication    DIAZEPAM
Name: row_1, dtype: object
```

This method works just fine, but since this operation is so common, `pandas` lets us use a special shorter syntax to say "give me all the columns (or rows)": the slicing operator (a colon, ":"). For instance, if we wanted to access the row labeled 'row_1' of the `DataFrame` `df2`, we could do this in the following two ways:

1. **Row by *integer location***

```python
>>> df2.iloc[0, :]
doctor_id         1234
medication    DIAZEPAM
Name: row_1, dtype: object
```

2. **Row by *label***

```python
>>> df2.loc['row_1', :]
doctor_id         1234
medication    DIAZEPAM
Name: row_1, dtype: object
```

What is returned in the example above is the zeroth row as a `pandas` `Series` with an associated `Index` object whose labels are the `DataFrame's` columns and whose values are the entries in the `DataFrame's` zeroth row. Again, we see that `pandas` saved the label of the single row we were accessing in the returned `Series'` `name` attribute.

Similarly, the following expressions can be used to access the column at integer location 1 ('medication') by integer location and by label, respectively:

1. **Column by *integer location***

```python
>>> df2.iloc[:, 1]
row_1      DIAZEPAM
row_2    CLONAZEPAM
Name: medication, dtype: object
```

2. **Column by *label***

```python
>>> df2.loc[:, 'medication']
row_1      DIAZEPAM
row_2    CLONAZEPAM
Name: medication, dtype: object
```

What is returned above is the column at integer location 1 as a `pandas` `Series` with an `Index` object whose labels are the `DataFrame's` row labels and whose values are the entries in that column. The `name` attribute of the `Series` is set to the label of the single column we were accessing.

Since accessing entire columns is a common practice, `pandas` also allows us to simply use either dot notation or square-bracket indexing to access a column by label.

```python
>>> df2.medication
row_1      DIAZEPAM
row_2    CLONAZEPAM
Name: medication, dtype: object
>>> df2['medication']
row_1      DIAZEPAM
row_2    CLONAZEPAM
Name: medication, dtype: object
```

We cover this notation since it is often seen in examples you may find from other resources, but do not worry about memorizing all the tricks `pandas` provides. Rather, practice the basics first and then once you are comfortable you should start experimenting with the different syntactic sugar `pandas` has implemented.  

In [0]:
df2.loc['row_1', df2.columns]

In [0]:
print('------------Rows------------\n\n')
print(df2.iloc[0, :])
print('----------------------------')
print(df2.loc['row_1', :])

In [0]:
print('------------Columns------------\n\n')
print(df2.iloc[:,1])
print('-------------------------------')
print(df2.loc[:,'medication'])
print('-------------------------------')
print(df2.medication)
print('-------------------------------')
print(df2['medication'])
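
The four column-access styles shown above return the same `Series`. The sketch below checks this with the `equals` method, rebuilding the small table under the illustrative name `df` so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame([[1234, 'DIAZEPAM'], [3210, 'CLONAZEPAM']],
                  index=['row_1', 'row_2'],
                  columns=['doctor_id', 'medication'])

by_position = df.iloc[:, 1]          # integer location
by_label = df.loc[:, 'medication']   # label with the slicing operator
by_brackets = df['medication']       # square-bracket shorthand
by_dot = df.medication               # dot-notation shorthand

# Each comparison checks that both the labels and the values match.
print(by_position.equals(by_label))     # True
print(by_label.equals(by_brackets))     # True
print(by_brackets.equals(by_dot))       # True
```

One caveat on the shorthand: dot notation only works when the column label is a valid `python` identifier that does not clash with an existing `DataFrame` attribute or method, so square brackets or `loc` are the safer general choice.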

### Accessing Data From the DataFrame Summary
![](images/DataFrameColumns.png) 

# Exercise 1.5: Indexing `DataFrames` and Accessing `DataFrame` Data

Which lines of code will create and print the `DataFrame`, `lunch_order_df`,  with column labels: 'item' and 'price' and row labels: 'Kalani' and 'Tito', where the first row labeled by 'Kalani' has the entries 'poke bowl' and 8.00 in the columns 'item' and 'price' respectively and the second row labeled by 'Tito' has the entries 'steak plate' and 10.00 in the columns 'item' and 'price' respectively?

`lunch_order_df` should have the following form:

|  | item | price|
|:----------|-----------|:----------|
|Kalani| 'poke bowl' | 8.0 |
|Tito| 'steak plate' | 10.0 |

A:
```python
lunch_order_df = pd.DataFrame({'item': ['poke bowl', 'steak plate'], 'price':[8.00, 10.00] })
lunch_order_df.index = ['Kalani', 'Tito']
lunch_order_df
```

B:
```python
lunch_order_df = pd.DataFrame({'Kalani': ['poke bowl',  8.00], 'Tito':['steak plate', 10.00] })
lunch_order_df.index = ['item', 'price']
lunch_order_df
```

C:
```python
lunch_order_df = pd.DataFrame({'Kalani': {'item': 'poke bowl', 'price': 8.00}, 'Tito':{'item': 'steak plate', 'price': 10.00}})
lunch_order_df
```

D:
```python
data = [['poke bowl', 'steak plate'], [8.00, 10.00]]
lunch_order_df = pd.DataFrame(data, index = ['Kalani', 'Tito'] , columns = ['items', 'price'])
lunch_order_df
```


---


Which lines of code will access and print the zero'th row of `lunch_order_df`?

A:
```python
lunch_order_df.loc[['items', 'price'], 'Kalani']
```

B:
```python
lunch_order_df.iloc[0, :]
```

C:
```python
lunch_order_df.Kalani
```

D:
```python
lunch_order_df['Kalani']
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.


In [0]:
# Exercise 1.5 scratch code cell


### `DataFrame` Attributes and Methods

In addition to the methods and attributes we have covered so far (e.g. the `iloc` and `loc` attributes), `DataFrames` have numerous attributes and methods that you will find useful. Some of the most common are summarized in the tables below:

| Attribute |Description|
|:----------|-----------|
| `T`|  Transpose index and columns |
| `dtypes`| Return the dtypes in this object |
| `shape`| Return a tuple representing the dimensionality of the `DataFrame` |
| `size`| Number of elements in the `DataFrame` |
| `values`| NumPy representation of the `DataFrame` |

| Method |Description|
|:----------|-----------|
| `add(other[, axis, level, fill_value])`| Addition of dataframe and other, element-wise (binary operator add) |
| `count([axis, level, numeric_only])`| Return Series with number of non-NA/null observations over requested axis |
| `describe([percentiles, include, exclude])`| Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.|
|`head([n])`|Return the first n rows.|

To call a method or access an attribute we again use *dot* notation.

```python
>>> df2.values
array([[1234, 'DIAZEPAM'],
       [3210, 'CLONAZEPAM']], dtype=object)
>>> df2.count()
doctor_id     2
medication    2
dtype: int64
```

All `DataFrame` attributes and methods can be found at the <a href="https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.html#pandas.DataFrame">pandas documentation</a> webpage.

In [0]:
print(df2.values)
print('---------------------')
print(df2.count())
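
The remaining attributes and methods from the tables can be tried on a small numeric table. This is a sketch using two columns of the running example; the name `df` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'doctor_id': [1234, 3210, 6789, 5678],
                   'nb_beneficiaries': [3, 50, 113, 95]})

print(df.shape)       # (4, 2): four rows, two columns
print(df.size)        # 8: total number of entries
print(df.dtypes)      # one dtype per column
print(df.T)           # rows and columns swapped
print(df.head(2))     # first two rows only
print(df.describe())  # count, mean, std, min, quartiles, max per column
```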

## Modifying `Series` and `DataFrames`


---



### *In-place* Modification Vs. Copying 

Most of the `pandas` methods for modifying the data in a `Series` or `DataFrame` will, by default, not alter the original object; they return a modified copy instead. This practice is primarily a form of damage control: if methods changed the `DataFrame` directly and you accidentally performed the wrong operation, the original information could be lost. For instance, if we wanted to sort the rows of the `DataFrame` `df2` by the values in the 'medication' column, we could use the `sort_values` method like so:
 
```python
>>> df2.sort_values(by='medication')
         doctor_id  medication
row_2       3210  CLONAZEPAM
row_1       1234    DIAZEPAM
>>> df2
         doctor_id  medication
row_1       1234    DIAZEPAM
row_2       3210  CLONAZEPAM
```
Notice that a `DataFrame` is returned from the method call and when `df2` is printed there is no change.

Numerous `pandas` methods allow you to change the `DataFrame` in place by passing the optional argument `inplace=True`. For instance, the command from the last example could have been executed as shown below to modify the `DataFrame` in place.
 
```python
>>> df2.sort_values(by='medication', inplace=True)
>>> df2
         doctor_id  medication
row_2       3210  CLONAZEPAM
row_1       1234    DIAZEPAM
```

This time `None` is returned from the method call rather than a `DataFrame`, and when we print `df2` we see that it is sorted by the 'medication' column.

Please note that the above examples are meant to illustrate the concept of in-place modification versus copying, rather than the functionality of the `sort_values()` method. Do not worry about understanding this method just yet, as it will be covered in coming chapters.
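The difference between the two calling styles can also be checked programmatically. The sketch below assumes a hypothetical stand-in `DataFrame`, `demo_df`, that mirrors `df2`, and captures the return value of each call.

```python
import pandas as pd

# demo_df is a hypothetical stand-in that mirrors the chapter's df2
demo_df = pd.DataFrame(
    {'doctor_id': [1234, 3210], 'medication': ['DIAZEPAM', 'CLONAZEPAM']},
    index=['row_1', 'row_2'],
)

# Default behaviour: a sorted copy is returned, the original is untouched
sorted_copy = demo_df.sort_values(by='medication')
order_after_copy = list(demo_df.index)
print(order_after_copy)         # original row order
print(list(sorted_copy.index))  # sorted row order

# In-place behaviour: the original is reordered and the call returns None
result = demo_df.sort_values(by='medication', inplace=True)
print(result)
print(list(demo_df.index))
```

Because the in-place call returns `None`, writing `demo_df = demo_df.sort_values(by='medication', inplace=True)` would replace the `DataFrame` with `None`, so the two styles should not be mixed.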


In [0]:
print(df2.sort_values(by='medication'))
print('----------------------------')
print(df2)

In [0]:
print(df2.sort_values(by='medication', inplace=True))
print('----------------------------')
print(df2)

### Adding Columns to `DataFrames`

To add columns to `DataFrames` we have two options: Modify the `DataFrame` in place or make a new `DataFrame` from a copy of the original with the desired additions. To add a column to the `DataFrame` in place, we can first reference and then designate values to columns that don't yet exist. To make a copy of the original `DataFrame` and then make the desired additions, we can use the `assign()` `DataFrame` method.

To make a new `DataFrame`, the `assign()` method of the `DataFrame`, `df`, takes a keyword argument whose name is the label of the new column, `new_label`, and whose value is the value(s) the new column entries will be set to, `val`: `df.assign(new_label = val)`. To modify the original `DataFrame`, `df`, we first "access" the new column by its new label, `new_label`, and assign it its new value(s), `val`, using the '=' operator: `df.loc[:, 'new_label'] = val`.

Note that `val` can be either a list with as many elements as there are rows, or a single value. With a list, each entry of the new column is set individually. With a single value, `pandas` uses *Broadcasting* to assign that same value to every row, which behaves as if a list of the correct length had been built from the single specified value.

For example, if we wanted to add a new column called 'nb_beneficiaries' with each entry assigned the same value, 45, we have two options:
1: use the assign method to create a new `DataFrame` with the desired addition
2: modify the original `DataFrame`.

1. Making a **new `DataFrame` using *Broadcasting***: 

```python
>>> df2.assign(nb_beneficiaries = 45)
       doctor_id  medication nb_beneficiaries
row_2       3210  CLONAZEPAM               45
row_1       1234    DIAZEPAM               45
```

We see in the above example that the method actually outputs a new `DataFrame` with the additional column.

2. Modifying **in-place using *Broadcasting***

```python 
>>> df2.loc[:, 'nb_beneficiaries'] = 45
>>> print(df2)
       doctor_id  medication  nb_beneficiaries
row_2       3210  CLONAZEPAM                45
row_1       1234    DIAZEPAM                45
```

In the second example above, a new `DataFrame` is not returned. Instead, we need to print the original `DataFrame` to see that it has changed.

Similarly, to add a new column, 'specialty' with each entry assigned individually, ` ['Psychiatry', 'Family']`,  we can do it in one of two ways as above.

1: Making a **new `DataFrame` using a list**

```python 
>>> df2.assign(specialty =  ['Psychiatry', 'Family'])
       doctor_id  medication  nb_beneficiaries   specialty
row_2       3210  CLONAZEPAM                45  Psychiatry
row_1       1234    DIAZEPAM                45      Family
```

The list values are assigned in order by integer location, i.e. the $0^{th}$ entry of the list is assigned to the $0^{th}$ entry of the new column.

2: Modifying **in-place using a list**

```python
>>> df2.loc[:, 'specialty'] = ['Psychiatry', 'Family']
>>> print(df2)
       doctor_id  medication  nb_beneficiaries   specialty
row_2       3210  CLONAZEPAM                45  Psychiatry
row_1       1234    DIAZEPAM                45      Family
```

Now, in the example above, we observe that we again had to print out the original `DataFrame` to see that `df2` was changed.
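The two approaches, and the requirement that a list have one entry per row, can be sketched as follows. The `DataFrame` `demo_df` is a hypothetical stand-in for `df2`, and the other variable names are illustrative only.

```python
import pandas as pd

# demo_df is a hypothetical stand-in that mirrors the chapter's df2
demo_df = pd.DataFrame(
    {'doctor_id': [3210, 1234], 'medication': ['CLONAZEPAM', 'DIAZEPAM']},
    index=['row_2', 'row_1'],
)

# assign() returns a new DataFrame; the original has no new column yet
with_count = demo_df.assign(nb_beneficiaries=45)
original_changed = 'nb_beneficiaries' in demo_df.columns
print(original_changed)

# Assigning through loc adds the column to the original DataFrame
demo_df.loc[:, 'nb_beneficiaries'] = 45
print(demo_df)

# A list with the wrong number of entries is rejected
length_error = False
try:
    demo_df.assign(specialty=['Psychiatry'])  # 1 value for 2 rows
except ValueError as err:
    length_error = True
    print('ValueError:', err)
```

A single value is broadcast to every row, but a list is never stretched or truncated to fit, so its length must match the number of rows.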

In [0]:
df2.assign(nb_beneficiaries = 45)

In [0]:
df2.loc[:,'nb_beneficiaries'] = 45
df2

In [0]:
df2.assign(specialty =  ['Psychiatry', 'Family'])

In [0]:
df2.loc[:, 'specialty'] = ['Psychiatry', 'Family']
df2

# Exercise 1.6: Adding Columns to DataFrames

Which lines of code will add a column with the label 'drink', with all values set to 'water', to the `lunch_order_df` `DataFrame` from Exercise 1.4 **in place** (the modification should be made to the original `DataFrame`)?

A:
```python
lunch_order_df.assign(drink = 'water')
```

B:
```python
lunch_order_df.loc[:, 'drink'] = 'water'
```

C:
```python
lunch_order_df.iloc[:, 2] = ['water', 'water']
```

D:
```python
lunch_order_df.assign(drink = ['water', 'water'])
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

In [0]:
# Exercise 1.6 scratch code cell

### Adding Rows to `DataFrames`

Similar to the `assign()` method for adding columns, the `append()` method is used to add rows without altering the original `DataFrame`. We can also modify the `DataFrame` in place by first referencing and then assigning values to a row that does not yet exist.

The `append()` `DataFrame` method for rows is slightly different from the `assign()` method for columns in that it takes `Series` or `dict`-like objects. This is because `Series` and `dict`-like objects have labels which can be aligned with the columns of the `DataFrame`. We will be showing examples of appending `Series` to `DataFrames`.

When using the `append()` method to append a `Series`, `s1`, to a `DataFrame`, `df`, we must be sure that the `Series`, `s1`, has its `name` attribute assigned, since this will be the new row label in the `DataFrame` returned by `append()`. To assign `s1` the name 'new_name' we would type: `s1.name = 'new_name'`.

Suppose we wanted to add a row to the `DataFrame` `df2` with the labels and value pairs: 'medication': 'DIAZEPAM', and  'doctor_id': 5678, we would have two options: making a new `DataFrame` with the new row and modifying the original `DataFrame` in place.

1. Making a **new `DataFrame`**

```python
>>> temp_series = pd.Series({'medication': 'DIAZEPAM', 'doctor_id': 5678}, name='row_3')
>>> print(df2.append(temp_series))
         doctor_id  medication  nb_beneficiaries   specialty
row_2       3210  CLONAZEPAM              45.0  Psychiatry
row_1       1234    DIAZEPAM              45.0      Family
row_3       5678    DIAZEPAM               NaN         NaN
```

In the example above we first create a `Series` object with the name 'row_3' using the `name` argument of the `pandas` `Series` constructor and save it into the `temp_series` variable. Then we use the `append()` method of the `df2` `DataFrame` to append the `Series` `temp_series` to `df2`. Observe that the method `append()` returns a new `DataFrame` where the `Series`' `name` attribute is used as the row index label, 'row_3'. The labels that do not exist in the index of the new `Series` but do exist in the columns of the `DataFrame` are filled with `NaN`. These are referred to as *missing* values in `pandas`.

2. Modifying **in-place**
```python
>>> df2.loc['row_3', :] = temp_series
>>> print(df2)
         doctor_id  medication  nb_beneficiaries   specialty
row_2       3210  CLONAZEPAM              45.0  Psychiatry
row_1       1234    DIAZEPAM              45.0      Family
row_3       5678    DIAZEPAM               NaN         NaN
```

To modify the `DataFrame` in-place we reference the new row to be assigned by its label using the `loc` attribute and assign to it a `Series`. Also notice that we had to print `df2` since appending the new row was done in place and a new `DataFrame` is not returned from that operation.
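Readers on a recent `pandas` release should note that `DataFrame.append()` was deprecated in version 1.4 and removed in version 2.0. The sketch below shows one possible replacement for the copy-making approach using `pd.concat()`. It is an alternative to, not a reproduction of, the `append()` call above, and `demo_df`, `new_row`, and `extended` are hypothetical names.

```python
import pandas as pd

# demo_df is a hypothetical stand-in that mirrors the chapter's df2
demo_df = pd.DataFrame(
    {'doctor_id': [3210, 1234], 'medication': ['CLONAZEPAM', 'DIAZEPAM']},
    index=['row_2', 'row_1'],
)
new_row = pd.Series({'medication': 'DIAZEPAM', 'doctor_id': 5678}, name='row_3')

# to_frame() turns the Series into a one-column DataFrame named 'row_3';
# .T transposes it into a one-row DataFrame labeled 'row_3'
new_row_df = new_row.to_frame().T

# concat aligns on column labels and returns a new DataFrame
extended = pd.concat([demo_df, new_row_df])
print(extended)
print(demo_df)  # the original is left unchanged
```

One caveat of this approach is that the transposed row holds mixed types, so the resulting columns may end up with the generic `object` dtype.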


In [0]:
temp_series = pd.Series({'medication': 'DIAZEPAM', 'doctor_id': 5678}, name='row_3')
df2.append(temp_series)

In [0]:
df2.loc['row_3', :] = temp_series
df2

# Exercise 1.7: Adding Rows to DataFrames

Which lines of code will make a **copy** of the `lunch_order_df` `DataFrame` from Exercise 1.4, and add a row with the label 'Moana' and values 'Salad', 9.00, 'tea' to the columns 'item', 'price', and 'drink' respectively?

A:
```python
data = pd.Series({'item': 'Salad', 'price':9.00, 'drink': 'tea'}, name='Moana' )
lunch_order_df.iloc[2, :] = data
```

B:
```python
data = pd.Series({'item': 'Salad', 'price':9.00, 'drink': 'tea'})
lunch_order_df.append(data)
```

C:
```python
data = pd.Series({'item': 'Salad', 'price':9.00, 'drink': 'tea'})
lunch_order_df.loc['Moana', :] = data
```

D:
```python
data = pd.Series({'item': 'Salad', 'price':9.00, 'drink': 'tea'}, name='Moana')
lunch_order_df.append(data)
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

In [0]:
# Exercise 1.7 scratch code cell

### Dropping Rows and Columns from `DataFrames`

Rows and Columns from a `DataFrame` can be discarded using the `drop()` method. The `drop()` method takes one positional parameter which is the label of the row or column that is going to be dropped. 

The `drop()` method will drop a row or a column depending on the value that the `axis` parameter is set to. If the `axis` parameter is set to 'columns' then we are dropping a column with the specified label, otherwise if the `axis` parameter is set to 'rows' then we are dropping a row. By default, `drop()` operates on rows.

As with `sort_values()` and many other methods, `drop()` can modify the input `DataFrame`, rather than making a new `DataFrame` with the desired changes, by setting the `inplace` parameter to `True`.

Suppose, for instance, we wanted to drop the row labeled 'row_3' from the `DataFrame` `df2`. We again have two options, we can create a new `DataFrame` with the dropped row without modifying the original `df2`, or, we can drop the row in place.

1. **New DataFrame with Dropped Row**

```python
>>> df2.drop('row_3', axis='rows')
         doctor_id  medication  nb_beneficiaries   specialty
row_2       3210  CLONAZEPAM              45.0  Psychiatry
row_1       1234    DIAZEPAM              45.0      Family
```

We see in the example above that the `drop()` method returns a new `DataFrame` with the desired results.

2. **Drop Row In-place**

```python
>>> df2.drop('row_3', axis='rows', inplace=True)
>>> df2
       doctor_id  medication  nb_beneficiaries   specialty
row_2       3210  CLONAZEPAM              45.0  Psychiatry
row_1       1234    DIAZEPAM              45.0      Family
```

In the second example, because we set `inplace=True`, nothing is returned from the method call, so we print `df2` afterwards to verify the desired result.

Now suppose we want to drop the entire column labeled 'specialty' from the `DataFrame` `df2`. We again have two options: we can create a new `DataFrame` without the column, leaving the original `df2` unmodified, or we can drop the column in place.

1. **New DataFrame with Dropped Column**

```python
>>> df2.drop('specialty', axis='columns')
       doctor_id  medication  nb_beneficiaries
row_2       3210  CLONAZEPAM              45.0
row_1       1234    DIAZEPAM              45.0
```

The example above shows that the `drop()` method will return a new `DataFrame` unless `inplace` is explicitly set to `True`. Notice also that we had to set the `axis` parameter to 'columns' so that the method knows that we are dropping a column.

2. **Drop Column In-place**

```python
>>> df2.drop('specialty', axis='columns', inplace=True)
>>> print(df2)
       doctor_id  medication  nb_beneficiaries
row_2       3210  CLONAZEPAM              45.0
row_1       1234    DIAZEPAM              45.0
```

This last example above demonstrates that if `inplace=True` in the `drop()` method call, then `df2` is modified in place and the `drop()` method does not return anything.
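The four `drop()` variants above can be condensed into one runnable sketch. It assumes a hypothetical stand-in `DataFrame`, `demo_df`, with illustrative values, and it also shows what happens when the requested label does not exist.

```python
import pandas as pd

# demo_df is a hypothetical stand-in with illustrative values
demo_df = pd.DataFrame(
    {
        'doctor_id': [3210, 1234, 5678],
        'medication': ['CLONAZEPAM', 'DIAZEPAM', 'DIAZEPAM'],
        'specialty': ['Psychiatry', 'Family', 'Family'],
    },
    index=['row_2', 'row_1', 'row_3'],
)

# Copies: the original keeps all 3 rows and 3 columns
without_row = demo_df.drop('row_3', axis='rows')
without_col = demo_df.drop('specialty', axis='columns')
shape_after_copies = demo_df.shape
print(shape_after_copies)

# In place: the original loses the row and the call returns None
result = demo_df.drop('row_3', axis='rows', inplace=True)
print(result, demo_df.shape)

# Dropping a label that does not exist raises a KeyError
missing_label_error = False
try:
    demo_df.drop('row_9', axis='rows')
except KeyError as err:
    missing_label_error = True
    print('KeyError:', err)
```

A common source of that `KeyError` is forgetting `axis='columns'` when the label refers to a column, since `drop()` then looks for the label among the rows.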

In [0]:
df2.drop('row_3', axis= 'rows')

In [0]:
df2.drop('row_3', inplace=True)
df2

In [0]:
df2.drop('specialty', axis='columns')

In [0]:
df2.drop('specialty', axis='columns', inplace=True)
df2

# Exercise 1.8: Dropping Data from DataFrames

What is the result of the following line of code?

```python
lunch_order_df.drop('price', axis='columns', inplace=True)
```

A: There is a syntax error in this line of code and it will not execute.

B: The column 'price' is dropped from the original `lunch_order_df DataFrame`.

C: A copy of `lunch_order_df` is made without the column 'price' from the original `DataFrame`. The original `DataFrame` is left unchanged.

D: A copy of `lunch_order_df` is made without the row 'price' from the original `DataFrame`. The original `DataFrame` is left unchanged.


### Adding Data to `Series`

We can add data with a corresponding index to a `Series` by referencing the new index and assigning it a value. This is similar to adding columns and rows to `DataFrames`, however dot notation cannot be used when initializing the data. 

If we wanted to add a new data entry, $i$, to the `Series`, `s`, with label, `l`, we would write `s[l] = i`.

For instance to add a new data entry 'test_data' labeled 'new_index' to the `Series s1` in-place we would type:

```python
>>> s1['new_index'] = 'test_data'
>>> s1
doctor_id                1234
medication           DIAZEPAM
nb_beneficiaries            3
spending                  $32
new_index           test_data
dtype: object
```

To demonstrate that dot notation does not work for `Series`, we can try dot notation and compare to the previous results: 

```python
>>> s1.new_index2 = 'test_data2'
>>> s1
doctor_id                1234
medication           DIAZEPAM
nb_beneficiaries            3
spending                  $32
new_index           test_data
dtype: object
>>> s1.new_index2
'test_data2'
```

We see that a new index and value are not added to the `Series`, but rather a new attribute has been saved on `s1`. This is a feature that may be convenient for other purposes, but not for adding data to a `Series`.
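The contrast between bracket notation and dot notation can be verified by inspecting the index directly. The sketch below assumes a hypothetical stand-in `Series`, `demo_s`, that mirrors `s1`.

```python
import pandas as pd

# demo_s is a hypothetical stand-in that mirrors the chapter's s1
demo_s = pd.Series({
    'doctor_id': 1234,
    'medication': 'DIAZEPAM',
    'nb_beneficiaries': 3,
    'spending': '$32',
})

# Bracket notation adds a new labeled entry to the data
demo_s['new_index'] = 'test_data'
print('new_index' in demo_s.index)

# Dot notation with a new name only sets a Python attribute on the object
demo_s.new_index2 = 'test_data2'
print('new_index2' in demo_s.index)
print(len(demo_s))
```

Checking membership in `demo_s.index` is a quick way to confirm whether a value actually became part of the data.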


In [0]:
s1['new_index'] = 'test_data'
s1

In [0]:
s1.new_index2 = 'test_data2'
print(s1)

In [0]:
s1.new_index2

### Dropping Data From Series

To remove a particular data entry, `pandas` `Series` have a `drop()` method, just like `pandas` `DataFrames`, that may be used both for dropping data in place and for making a copy. The `drop()` method takes one positional argument, `labels`, which specifies the entry (or entries) we are trying to drop. If we wanted to drop the data entry with label $l$ from the `Series` $s$, then we would type `s.drop(l)`. For instance, to drop the data entry labeled 'new_index' in the example `Series s1`, we would type:

```python
>>> s1.drop('new_index')
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
```

We see in the above example that the `drop()` method returns a new `Series` without the data entry labeled by the positional argument. To modify the original `Series` in place we can set the optional argument `inplace` to `True`. For example, if we wanted to make the same modification, dropping the data entry labeled 'new_index', but in place, then we would type:

```python
>>> s1.drop('new_index', inplace = True)
>>> print(s1)
doctor_id               1234
medication          DIAZEPAM
nb_beneficiaries           3
spending                 $32
dtype: object
```

In the example above we see that the `drop()` method returns `None` and that the original `Series`, which we had to print after the method call, no longer has the `new_index` data entry. 
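Both behaviours, along with dropping several labels at once by passing a list, can be sketched as follows. The `Series` `demo_s` is a hypothetical stand-in for `s1`.

```python
import pandas as pd

# demo_s is a hypothetical stand-in that mirrors the chapter's s1
demo_s = pd.Series({
    'doctor_id': 1234,
    'medication': 'DIAZEPAM',
    'nb_beneficiaries': 3,
    'spending': '$32',
    'new_index': 'test_data',
})

# Copy: one label, or a list of labels, can be dropped
trimmed = demo_s.drop('new_index')
smaller = demo_s.drop(['new_index', 'spending'])
length_after_copies = len(demo_s)
print(length_after_copies)  # the original still has every entry

# In place: the original shrinks and the call returns None
result = demo_s.drop('new_index', inplace=True)
print(result)
print(demo_s)
```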


In [0]:
s1.drop('new_index')

In [0]:
s1.drop('new_index', inplace = True)
print(s1)

# Summary

---
**`Series` Vs. `DataFrames`**
  1. `Series`: list-like objects that store data in a given order.
  2. `DataFrames`: spreadsheet-like tables that contain one or more Series.
  
**`Series`**
* To create a series from native `Python`  data structures we use the `pandas` `Series()` function
  
  ![](images/seriesIndex.png)

**`DataFrames`**
* To create a `DataFrame` from native `Python`  data structures we use the `pandas` `DataFrame()` function
  
  ![](images/pandasDataFrame.png)

**Accessing Data from the `DataFrame`**

| Syntax                |   Meaning    |
|:--------------------------|:-------------------|
| `dataframe["col_name"]`    |  Returns the column "col_name" as a `Series` |
| `dataframe[["col_name_1", "col_name_2"] ]`    |  Returns a `DataFrame` with columns "col_name_1" and "col_name_2" |
| `dataframe.loc["label",:]`| Returns the row indexed by "label" |
| `dataframe.loc["label", ["col_3", 'col_5']]`| Returns the row indexed by "label", subset to only columns "col_3" and "col_5" |
| `dataframe.loc[["label_1", "label_2"], ['col_3', 'col_5']]`| Returns the rows indexed by "label_1" and "label_2", subset to only columns "col_3" and "col_5" |
| `dataframe.iloc[23, [0, 1]]`| Returns the row at integer position 23, with only the values of columns 0 and 1 |
| `dataframe.iloc[[1,2], [0, 1]]`| Returns the rows at integer positions 1 and 2, with only the values of columns 0 and 1 |

![](images/DataFrameColumns.png)

**Modifying Series and DataFrames**

* Most of the pandas methods for modifying the data in a `Series` and `DataFrame` will, by default, not alter the original object, rather a **copy** with the desired changes is made. If you are confident of the operation you are performing on the object and do not care to make a copy, then many times you can pass the optional argument  "`inplace=True`" to the pandas method you are calling and the original object will be altered **inplace**.

* To add columns to `DataFrames` we have 2 options: 
  
    1. Modify the `DataFrame` in place: first reference and then assign values to columns that don't yet exist.
    2. Make a new `DataFrame` from a copy of the original with the desired additions: use the `assign()` `DataFrame` method.
  
* We may drop both Rows and Columns from a `DataFrame` using the `drop()` method. 
  * The `drop()` method takes a positional parameter which is the label of the row or column that is going to be dropped. 
  * The `drop()` method will drop a row or a column depending on the value that the `axis` parameter is set to. 
    * If the `axis` parameter is set to 'columns' then we are dropping a column with the specified name, otherwise if the `axis` parameter is set to 'rows' then we are dropping a row with the specified name.
  
