# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!

### Import `numpy` with the alias `np` and `pandas` with the alias `pd`
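
A minimal sketch of the import cell, using the conventional aliases named in the heading above. The version printout is only an added sanity check and is not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Quick sanity check that both libraries loaded (the versions will vary)
print(np.__version__, pd.__version__)
```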

<br>

## <span style="color:blue">1. Loading Data and Initial Exploration</span>

<hr>

##### Mount Drive - **Google Colab Only Step**

When using Google Colab, we need to mount our Google Drive before we can access the files on it. Run the Python cell below, click the link it generates, and paste the authorization code into the cell's prompt.



```python
from google.colab import drive
drive.mount('/content/drive')
```



##### Change Directory To Access The Dependent Files - **Google Colab Only Step**

```python
directory = "student"
if directory == "student":
    %cd drive/Colab\ Notebooks/intro-to-python/
else:
    %cd drive/Shared\ drives/Rubrik/Data\ Science\ Track/intro-to-python
```



#### `.read_csv()`: Load `csv` data into a `pandas` DataFrame called `ri`.

Directory system keys:

`./` - current directory 

`../` - parent directory; **Note** that the current folder lives inside its parent folder.



Your first challenge! Use what you know about file structures to put the appropriate path into the `path_to_file` variable and load the Rhode Island Police Stops dataset.

**Hint:**
The data is in the `data` folder inside a folder called `data` which lives inside the current directory. The filename is `rhode-island-police-stops.csv`  

```python
# Fix Me!
path_to_file = "path/to/rhode-island-police-stops.csv"

ri = pd.read_csv(path_to_file)
```

<br>

### Basic Dataframe `.attributes` and `.methods()`

**References:** 
- [Pandas DataFrame Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
- [Pandas DataFrame Guide](https://www.geeksforgeeks.org/python-pandas-dataframe/)

#### `.shape`: How many rows? How many columns?
This will be important once we start manipulating the DataFrame. It'll allow us to check if our manipulation was performed correctly.

**Key:**
`(rows, columns)`

```python
ri.shape
```

<br>

#### `.dtypes`: List the `dtype` of each column.
* `object` == Python `str` (usually; an `object` column can also hold mixed Python objects)
* `float64` == Python `float`
* `bool` == Python `bool`
* `int64` == Python `int`

```python
ri.dtypes
```
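
To see the dtype mapping on something small, here is a sketch using a hypothetical toy DataFrame (the column names and values are made up and are not part of the Rhode Island dataset). The text column is typically reported as `object`, although newer pandas versions may report a dedicated string dtype instead.

```python
import pandas as pd

# Hypothetical toy data, not the Rhode Island dataset
toy = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],        # text
    "age": [34, 27, 45],                 # whole numbers
    "height_m": [1.62, 1.80, 1.75],      # decimals
    "is_member": [True, False, True],    # booleans
})

toy_dtypes = toy.dtypes
print(toy.shape)     # (rows, columns)
print(toy_dtypes)    # one dtype per column
```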

<br>

#### `.columns`: List all of the column_names of `dataframe`

```python
ri.columns
```

<br>

#### `.head()`: Take a "peek" into the *first* 'n' rows of `pandas` `dataframe`.

We can pass a number to the `head` method to state how many of the first rows we would like to peek at. The default is 5 rows.

This is useful for getting a feel for what the data set looks like. It lets us quickly check whether certain columns have inconsistent or unwanted values, e.g., `NaN` (missing) values.

```python
ri.head(10)
```

**Note:** Look at row 2; it looks like dirty data.

<br>

#### `.tail()`: Take a "peek" into the *last* 'n' rows of `pandas` `dataframe`.

We can pass a number to the `tail` method to state how many of the last rows we would like to peek at. The default is 5 rows.

This is useful for getting a feel for what the data set looks like. It lets us quickly check whether certain columns have inconsistent or unwanted values, e.g., `NaN` (missing) values.

```python
ri.tail()
```

<br>

#### `.describe()`: Summary statistics of numerical `pandas` `dataframe` columns.



```python
# Will default to "numerical columns"
ri.describe()
```

### Summary statistics of categorical features

Display summary statistics for categorical features.

- Use the dataframe's `describe()` method
- The `describe()` method accepts a parameter called `include`. Passing the value `"object"` shows summary information for the categorical (string) features only.
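
As a rough illustration of the difference, the sketch below runs both forms of `describe()` on a hypothetical toy DataFrame (made-up values; the text column is explicitly created with the `object` dtype so the example behaves the same across pandas versions).

```python
import pandas as pd

# Hypothetical toy data, not the Rhode Island dataset
toy = pd.DataFrame({
    "violation": pd.Series(
        ["Speeding", "Speeding", "Equipment", "Speeding"], dtype="object"
    ),
    "driver_age": [25, 31, 19, 40],
})

# Default: only the numerical columns are summarized
numeric_summary = toy.describe()

# include="object": only the object (string) columns are summarized
categorical_summary = toy.describe(include="object")

print(numeric_summary)
print(categorical_summary)
```

For object columns the summary rows are `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs).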

<hr>
<br>
<br>

## <span style="color:blue"> 2. 'Naive' Selection and Indexing </span>

Let's learn the various methods to grab data from a DataFrame

## <span style="color:red"> Columns </span>

### Selecting Columns

#### Grab a *single* column.

Here, we grab a single column by indexing the `ri` variable that holds our `dataframe`. We do this by passing the name of the column we're interested in, as a string, inside square brackets `[]`.

Afterwards, we chain `.head()` onto the resulting pandas `series` (the column), because we don't want to see all `500000+` rows of the column that we picked.

<hr>

```python
# Pass a column name
ri['driver_age_raw'].head()
```

<hr>

**Your Turn:**
<br>
Pass in another `column name` to `index` `ri`, be sure to follow it with a `.head()`!

<br>

#### Grab *multiple* columns.

To grab multiple columns, we pass the column names as a list, like `['col1', 'col2']`, inside the indexing brackets on `ri[]`. The result is a confusing mix of brackets, so try to differentiate the outer brackets of the indexing call from the inner brackets of the list containing the columns you are interested in.

<hr>

```python
# create a list of column names
col_names= ['driver_age_raw','driver_age']
# Pass a list of column names
ri[col_names].head()
```

Or in one line:
```python
# Pass a list of column names
ri[ ['driver_age_raw','driver_age'] ].head()
```

<hr>

**Your Turn:**
<br>
Pass in a list of `column names` to `index` `ri`, be sure to follow it with a `.head()`!

<br>

#### We can also pass in a variable containing the list of columns, for readability. Try it out to get into the right mindset.

<hr>

```python
cols_i_want = ['driver_age_raw','driver_age']
ri[cols_i_want].head()
```

<hr>

<br>

#### All `DataFrame` Columns Are `pandas` `series` Objects.

```python
type(ri['driver_age_raw'])
```
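
The bracket style decides what comes back. In this sketch on a hypothetical toy DataFrame (made-up values), a single column name returns a `Series`, while a list of names returns a `DataFrame`, even when the list holds only one name.

```python
import pandas as pd

# Hypothetical toy data, not the Rhode Island dataset
toy = pd.DataFrame({
    "driver_age": [25, 31],
    "violation": ["Speeding", "Equipment"],
})

single = toy["driver_age"]       # string inside brackets -> Series
wrapped = toy[["driver_age"]]    # list inside brackets -> DataFrame

print(type(single).__name__, type(wrapped).__name__)
```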

<br>

### Creating Columns

#### <span style="color:orange"> Remember this pattern! </span> 

This pattern will allow us to make sure our operations were performed correctly.
<hr>

1. `print` the `shape` of a `dataframe` 
2. Try to make some change to that dataframe
3. `print` the `shape` of the `dataframe` once more to make sure we performed the operation properly.

<hr>

```python
ri.shape
```

#### Make the change. 
*We can create new columns by exploiting the fact that `pandas` `series` behave like `numpy` `arrays`.*

**Note:** We can save over an existing column or create a new one with identical syntax!

>`Important Condition:` The `index` of the column that is either replacing an existing column or being added to the end of a dataframe MUST MATCH the `index` of the `dataframe`. That means we need the same number of rows after the operation. We cannot add a new column or update an existing column using a different number of rows.
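
A small sketch of the shape-check pattern and the row-count condition, using a hypothetical toy DataFrame (made-up values). A new column with one value per row is accepted, while a list with too few values is rejected with a `ValueError`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "violation": ["Speeding", "Equipment", "Seat belt"],
    "stop_outcome": ["Citation", "Warning", "Citation"],
})
shape_before = toy.shape

# One value per existing row: the new column is accepted
toy["event"] = toy["violation"] + " " + toy["stop_outcome"]
shape_after = toy.shape

# A list with the wrong number of rows is rejected
try:
    toy["too_short"] = ["only", "two"]
    rejected = False
except ValueError:
    rejected = True

print(shape_before, shape_after, rejected)
```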

### Create a new series that combines the strings from the `violation` series and the `stop_outcome` series adding a space `' '` between the two strings
Like so:
```python
ri['violation'] + " " + ri['stop_outcome']
```

**Note:** It is good practice to see if the operation has the desired results before updating a column or creating a new column. 
Once we have confirmed that we have the right logic, we will update or create the column.

Because we know that we have the desired results, let's create a new series in the DataFrame called `event`.

Store the following logic inside of the series of the DataFrame like so:

```python
ri['event'] = ri['violation'] + " " + ri['stop_outcome']
```

This will create the new column AND populate it.

#### Always verify that your changes took place!

Do we see an additional column being added to the column component of the shape `tuple`?

Key:
(rows, columns)

```python
ri.shape
```

<br>

### Removing Columns

#### Remember the pattern? `print` the `shape` of a `dataframe` before AND after you try to alter anything in pandas.

```python
ri.shape
```

#### Make the change

We will use the DataFrame's `drop` method.
The first argument will be the column/series name to drop. 
The second argument will be the axis we want to drop along. If we specify `axis=1` (or `axis='columns'`), we drop the named column, that is, the whole series. If we specify `axis=0` (the default), we drop rows by their index label instead.

```python
ri.drop('event', axis=1)

# same (only run one!)
ri.drop('event', axis='columns')
```

#### Always verify that your changes took place!

```python
ri.shape
```

#### We still have 17 columns because `drop` returns a new DataFrame by default; we neither saved the result nor specified dropping the column/series `inplace`.


**Your Turn:**
<br>
To actually drop a column and keep the change, we must update the variable containing the original `dataframe`. There are 2 ways of achieving this:
1. **overwrite:** `ri = ri.drop(...)`
2. **inplace param:** `ri.drop(..., inplace=True)`
3. **DON'T do BOTH!:** `ri = ri.drop(..., inplace=True)` (with `inplace=True`, `drop` returns `None`, so `ri` would be overwritten with `None`)
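
The options above can be sketched on a hypothetical toy DataFrame (made-up column names `a`, `b`, `c`), including what goes wrong when overwrite and `inplace=True` are combined.

```python
import pandas as pd

# Hypothetical toy data, not the Rhode Island dataset
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without overwrite or inplace, the original is left untouched
toy.drop("c", axis=1)
shape_unchanged = toy.shape

# Option 1: overwrite the variable with the returned DataFrame
toy = toy.drop("c", axis=1)

# Option 2: modify the existing DataFrame in place
toy.drop("b", axis="columns", inplace=True)
shape_after = toy.shape

# Doing both: the in-place call returns None, so the variable ends up as None
combined = toy.copy().drop("a", axis=1, inplace=True)

print(shape_unchanged, shape_after, combined)
```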

### Reprint Shape To Confirm

**Note:** `shape` is a DataFrame attribute, not a method, so it does not take parentheses `()`.

<br>
<br>

## <span style="color:red"> Rows </span>

### Selecting Rows

Select rows based on their position (the integer row number).

#### When selecting rows use the following syntax `DataFrame.iloc[]`

You can remember this by remembering that `i` stands for `integer` and `loc` stands for `location`.

```python
ri.iloc[3:10]
```

<br>

#### All `DataFrame` Rows Are *Also* `pandas` `series` Objects.

```python
type(ri.iloc[3])
```
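
To make the position-based behaviour visible, this sketch uses a hypothetical toy DataFrame whose index labels (10, 20, 30, 40) are deliberately different from the row positions (0, 1, 2, 3).

```python
import pandas as pd

# Hypothetical toy data with index labels that differ from row positions
toy = pd.DataFrame({"driver_age": [19, 25, 31, 40]}, index=[10, 20, 30, 40])

# A slice of positions returns a DataFrame (the end position is excluded)
first_two = toy.iloc[0:2]

# A single position returns that row as a Series
third_row = toy.iloc[2]

print(first_two)
print(type(third_row).__name__, third_row["driver_age"])
```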

<hr>
<br>
<br>

## <span style="color:blue"> 3. Conditional Selection </span>

An important feature of `pandas` is conditional selection using bracket notation, very similar to `numpy`:
```python
dataframe[some_condition applied to dataframe] 
```

### <span style="color:red"> Single Conditions </span>

#### Quick reminder: this is what our `dataframe` looks like and its `shape`.

```python
print("Rows:", ri.shape[0])
print("Columns:", ri.shape[1])
ri.head()
```

#### Construct a `boolean series` to `index` the `dataframe`.

Notice how the result is a `boolean` `series`, with a matching `True` or `False` for every value in `driver_age`. We will use this returned series (a boolean mask) to view all the other data points across the dataframe for those people older than 21.

Remember, it is good practice to prefix the name of a boolean mask with `is_` to signal that the series is a boolean mask.

```python
is_older_than_21 = ri['driver_age'] > 21
print(is_older_than_21)
```

#### Index the original `dataframe` with `boolean series`.

This is called filtering because we create a new dataframe with possibly fewer rows than the original, keeping only the rows where `driver_age` was `>` 21. If every row satisfies the condition, the new DataFrame has the same number of rows as the original, but it can never have more rows than the original.

<br>
The rows that are kept are a result of the column and condition that you passed on that column. 
<br>

Could have easily been `driver_age < 21`, or `stop_outcome == specific-citation-reason`, etc...
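
Here is the same idea on a hypothetical toy DataFrame (made-up values), small enough to check by eye. Note that the filtered result keeps the original index labels of the surviving rows.

```python
import pandas as pd

# Hypothetical toy data, not the Rhode Island dataset
toy = pd.DataFrame({
    "driver_age": [19, 25, 17, 40],
    "violation": ["Speeding", "Equipment", "Speeding", "Seat belt"],
})

# Boolean mask: one True/False per row
is_older_than_21 = toy["driver_age"] > 21

# Keep only the rows where the mask is True
older = toy[is_older_than_21]

print(is_older_than_21.tolist())
print(older)
```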


#### Create a new DataFrame filtering the original DataFrame `ri` and show only the first ten rows of the new DataFrame:
```python
ri[is_older_than_21].head(10)
```

**Important Note:** The indexes start at 4, and there are no index numbers `9` or `14`. We will handle this in the cells below.

#### Now that we know the operation had desired results save this filtered DataFrame in a DataFrame called `over_21_ri`

**Important Note:** Make sure not to invoke the `head` method here, or else the result will be a DataFrame with at most five rows.

```python 
over_21_ri = ri[is_older_than_21]
```



#### Because we removed a few rows from the original DataFrame when filtering let's reset the index of the `over_21_ri` DataFrame

The Pandas' `reset_index()` method will generate a new DataFrame or Series with the index reset. This is useful when the index needs to be treated as a column, or when the index is meaningless and needs to be reset to the default before another operation.

#### `reset_index()` parameters:
- `drop` (bool): default `False`, which resets the index and keeps the old index as a new column. If `True`, the method discards the current index and replaces it with an index of increasing integers, so no copy of the old index is kept.
- `inplace` (bool): default `False`, which returns a new DataFrame and leaves the original unchanged. If set to `True`, the DataFrame is modified in place and the method returns `None`.

### Reset the index of the DataFrame:
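- either reassign the `over_21_ri` variable with the result of the DataFrame's `reset_index()` method, or call the method with the `inplace` argument set to `True` (not both, because the in-place call returns `None`)
- set the `reset_index()` `drop` argument to `True` so the old index is not kept as a column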


### Re-Print The DataFrame's `over_21_ri` first five rows

If the indexes start at 0, this would confirm that we did this operation correctly.
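
A short sketch of the effect of `drop`, using a hypothetical toy DataFrame (made-up values) rather than `over_21_ri` itself.

```python
import pandas as pd

# Hypothetical toy data, not the Rhode Island dataset
toy = pd.DataFrame({"driver_age": [19, 25, 17, 40]})
older = toy[toy["driver_age"] > 21]      # index labels are 1 and 3

# drop=False (default): the old index is kept as a column named "index"
with_old_index = older.reset_index()

# drop=True: the old index is discarded
clean = older.reset_index(drop=True)

print(with_old_index)
print(clean)
```

Neither call changes `older`, because `inplace` defaults to `False`.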

#### Grab a column from this new `dataframe`

We can index the resulting dataframe directly.

Here, we are asking for the `violation` column for only those people in the `dataframe` whose `driver_age > 21`
```python
over_21_ri['violation']
```

**Your Turn:**

**Note:** Let's use the DataFrame `ri` for the below code cells

Hint: you can view the unique values you could base your boolean mask on by invoking the `.unique()` method on a column.
<br>
In this case I ran a `ri['search_type'].unique()` to find `Probable Cause`.

<hr>

1. Create a Boolean Series, `is_prob_cause`: only those stops where `search_type` was `'Probable Cause'`
2. Index `ri` with `is_prob_cause`, something about brackets... `[]`
3. Save the resulting `dataframe` to the variable `ri_probable_cause`
4. You are now ready to start performing some analysis on only those police stops in Rhode Island where the `search_type` was `'Probable Cause'`

<hr>

<br>
<br>

### <span style="color:red"> Multiple Conditions </span>

> **Hint:** You will need to use the `&` operator between both of the boolean checks to get the appropriate selection

**For two (or more) conditions, use the `|` and `&` operators, and make sure to use `()` parentheses to separate each condition.**
Example:
```python 
# condition_one and condition_two
((condition_one) & (condition_two)) 

# condition_one or condition_two
((condition_one) | (condition_two)) 
```
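
The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` and `<`. The sketch below uses a hypothetical toy DataFrame; the `stop_minutes` column is made up for illustration. Without parentheses, Python evaluates `21 & toy["stop_minutes"]` first and the expression fails.

```python
import pandas as pd

# Hypothetical toy data; "stop_minutes" is an invented column
toy = pd.DataFrame({
    "driver_age": [19, 25, 40],
    "stop_minutes": [5, 30, 10],
})

# With parentheses: each comparison is evaluated first, then combined
both = toy[(toy["driver_age"] > 21) & (toy["stop_minutes"] < 20)]

# Without parentheses: & is applied before the comparisons, which raises
try:
    toy[toy["driver_age"] > 21 & toy["stop_minutes"] < 20]
    unparenthesized_failed = False
except (ValueError, TypeError):
    unparenthesized_failed = True

print(both)
print(unparenthesized_failed)
```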



### `&`: Logical `and`

#### Create A New DataFrame containing observations recorded where the driver is `white` **and** `is older than 21`

##### Note: `&` in `pandas` is the element-wise equivalent of python `and`
Python's `and` does not work element-wise on these data structures; using it on two Series raises a `ValueError`.
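
A minimal sketch of the difference, using two hypothetical boolean Series (made-up values): `&` and `|` combine them row by row, while `and` asks for a single truth value for a whole Series and fails.

```python
import pandas as pd

# Hypothetical boolean masks
is_a = pd.Series([True, False, True])
is_b = pd.Series([True, True, False])

# Element-wise combinations
both = is_a & is_b
either = is_a | is_b

# Python's `and` needs one True/False per operand, so a Series is ambiguous
and_error = None
try:
    is_a and is_b
except ValueError as err:
    and_error = err

print(both.tolist(), either.tolist())
print(type(and_error).__name__)
```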

```python
# create boolean masks
is_white = ri['driver_race'] == 'White'
is_over_21 = ri['driver_age'] > 21
is_white_and_over_21 = ( ( is_white ) & ( is_over_21 ) )

# Print the filter to confirm logic
ri[is_white_and_over_21]
```

#### If the above filter has the desired results store the results in a variable called `white_over_21`

```python 
white_over_21 = ri[is_white_and_over_21]
```

##### Explore new Dataframe!

```python
print(white_over_21.shape)
print(white_over_21.head())
```

**Important Note:** The indexes do not start at 0; we might need to reset them depending on future operations.

<br>
<br>

### `|`: Logical `or`

#### Create a New DataFrame containing observations recorded where the driver's violation is `Speeding` **or** `is not wearing a seatbelt`

##### Note: `|` in `pandas` is the element-wise equivalent of python `or`

```python
# Create boolean masks
is_speeding = ri['violation'] == 'Speeding'
is_no_seat_belt = ri['violation'] == 'Seat belt'
is_speeding_or_no_seat_belt = ( (is_speeding) | (is_no_seat_belt) )

# Print the filter to confirm logic
print(ri[is_speeding_or_no_seat_belt].head())
```


#### If the above filter has the desired results store the results in a variable called `speeding_or_no_belt`

```python
# New DataFrame
speeding_or_no_belt = ri[is_speeding_or_no_seat_belt]
```

##### Explore new Dataframe!

**Note:** Instead of printing the DataFrame using python's `print` function, we can use the `display` function that Jupyter/IPython notebooks provide to show the DataFrame in a cleaner form.

```python
display(speeding_or_no_belt.shape)
display(speeding_or_no_belt.head())
```

**Important Note:** The indexes do not start at 0; we might need to reset them depending on future operations.