<a href="https://colab.research.google.com/github/NIP-Data-Computation/show-and-tell/blob/master/piercel_week2_notes2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Author**: Pierce Lopez <br>
**Date Created**: August 11, 2020 <br>
**Last Updated**: August 12, 2020 <br> 
**Description**: Contains my notes on the Data Analyst lesson: _Merging DataFrames with pandas_.

# Merging DataFrames with pandas
This chapter uses `pandas` and `NumPy` functions, so do not forget to import the necessary modules!

```
# import modules
import pandas as pd
import numpy as np
```
## Chapter 1: Preparing Data

### Section 1: Reading multiple data files

1. Tools for pandas data import
  * `pd.read_csv()`
  * `pd.read_excel()`
  * `pd.read_html()`
  * `pd.read_json()`

2. Load files using loops

```
# make list of filenames
filenames = ["file1.csv", "file2.csv"]

# initialize an empty list that will hold one DataFrame per file
df = []

# loop pd.read_csv() over filenames list
for f in filenames:
    df.append(pd.read_csv(f))
```

3. Load files using list comprehensions

```
# make list of filenames
filenames = ["file1.csv", "file2.csv"]

# read several files using list comprehensions
df = [pd.read_csv(f) for f in filenames]
```

4. Load files using `glob`
  * `glob` is a Python module which can be handy for instances where filenames have similar patterns!

```
# import module
import glob

# make list of filenames that start with 'sales' and end with '.csv'
# (the function lives inside the module, hence glob.glob)
filenames = glob.glob('sales*.csv')

# read several files using list comprehensions
df = [pd.read_csv(f) for f in filenames]
```
  * The `*` is a wildcard that can match any number of characters.
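
To make the wildcard matching concrete, here is a minimal sketch. The file names (`sales-jan.csv`, `sales-feb.csv`, `returns.csv`) and the `units` column are made up for illustration, and the files are written to a temporary folder so the example can run anywhere.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical setup: write three small CSV files into a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name, amount in [("sales-jan.csv", 10), ("sales-feb.csv", 20)]:
        pd.DataFrame({"units": [amount]}).to_csv(os.path.join(tmp, name), index=False)
    # this file does not start with 'sales', so the pattern should skip it
    pd.DataFrame({"units": [99]}).to_csv(os.path.join(tmp, "returns.csv"), index=False)

    # glob.glob() returns the paths that match the pattern;
    # sorting gives a predictable order
    filenames = sorted(glob.glob(os.path.join(tmp, "sales*.csv")))

    # read every matching file into its own DataFrame
    dataframes = [pd.read_csv(f) for f in filenames]

print(len(dataframes))  # 2
```

Only the two `sales*.csv` files are read; `returns.csv` is ignored because it does not match the pattern.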

**Additional Insights:**

1. You can copy a DataFrame using the `copy()` method.

<br>

### Section 2: Reindexing DataFrames

1. Setting conventions
  * Indices - refers to several row labels within one data structure
  * Indexes - refers to several row labels from several data structures

2. The index of a DataFrame can be accessed using the `.index` attribute (it is not a method, so no parentheses).
3. Manually arrange indices using `.reindex([list that contains the new order])`
  * **Note:** We can reindex using index labels of another DataFrame!
4. Arrange indices in order using `.sort_index()`
5. Reindexing with missing labels will create new rows with missing values.
6. Using the `index_col` argument of `pd.read_csv()`, we can set one of the columns as the index of the DataFrame.

**Additional insights:**

1. The `.ffill()` method replaces each `NaN` value with the last non-null value before it.
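
The reindexing tools from this section can be sketched on a tiny DataFrame. The `weather` data and its month labels are invented for illustration.

```python
import pandas as pd

# Hypothetical data: temperatures indexed by month, deliberately out of order
weather = pd.DataFrame({"temp": [30, 25, 28]}, index=["Mar", "Jan", "Feb"])

# .index is an attribute holding the row labels
print(weather.index)

# arrange the rows in an order we choose
ordered = weather.reindex(["Jan", "Feb", "Mar"])

# arrange the rows by sorted label (alphabetical here: Feb, Jan, Mar)
sorted_df = weather.sort_index()

# reindexing with a label that does not exist ("Apr") creates a row of NaN
with_gap = weather.reindex(["Jan", "Feb", "Mar", "Apr"])

# forward-fill: "Apr" takes the last non-null value, which is the one for "Mar"
filled = with_gap.ffill()
print(filled)
```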

<br>

### Section 3: Arithmetic with Series and DataFrames

1. `dataframe.divide(col_name, axis = "rows")`
  * `col_name` is a Series that acts as the divisor. With `axis = "rows"`, the values in each row of the DataFrame are divided by the `col_name` value that has the same index label.

2. `dataframe.pct_change()`
$$ \%_{change} = {current\;row\;value - previous\;row\;value\over previous\;row\;value}.$$

3. We can add Series values using the `.add()` method.

```
# arithmetic
series1 + series2

# .add()
series1.add(series2, fill_value = 0)
```
  * **Note:** When a label is missing from one of the Series (or its value there is `NaN`), the plain `+` sum for that label is also `NaN`. To avoid this, we use the `fill_value = 0` argument.
  * We can also chain `.add()` to add more Series data!
  * **Note:** Adding is index-based!
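
The three operations in this section can be tried on small made-up data. The names `prices`, `s1`, and `s2` are placeholders, not data from the lesson.

```python
import pandas as pd

# Hypothetical data
prices = pd.DataFrame({"a": [10.0, 20.0, 40.0], "b": [1.0, 2.0, 4.0]})

# divide every column by column "a", row by row (matched on index label)
ratios = prices.divide(prices["a"], axis="rows")

# (current row - previous row) / previous row; the first row has no previous row
growth = prices["a"].pct_change()

# two Series that share only the label "y"
s1 = pd.Series([1, 2], index=["x", "y"])
s2 = pd.Series([10, 20], index=["y", "z"])

plain_sum = s1 + s2                    # NaN for "x" and "z"
filled_sum = s1.add(s2, fill_value=0)  # missing side is treated as 0

print(plain_sum)
print(filled_sum)
```

Because addition is index-based, only the shared label `"y"` gets a value from both Series; `fill_value = 0` keeps the labels that appear in just one of them.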

**Additional insights:** 
1. We can replace substrings in string values with the `.str.replace()` method.
2. `.resample('A')` groups a Series or DataFrame that has a datetime index into time bins.
  * 'A' for annual frequency.
3. Chaining `.last()` to `.resample()` picks the last element of each bin.
4. `dataframe.multiply(series, axis = "rows")` is for row-by-row multiplication.
5. `parse_dates = True` in `pd.read_csv()` converts the index from strings into `datetime` objects.

<br>



## Chapter 2: Concatenating Data

### Section 1: Appending and concatenating Series

1. `series1.append(series2)`
  * Stacks rows.
  * Works for Series and DataFrames.
  * **Note:** `.append()` was removed in pandas 2.0; use `pd.concat()` there instead.
2. `pd.concat([s1, s2, s3], ignore_index = True)`
  * Can stack row-wise or column-wise.
  * Works for Series and DataFrames.
  * `ignore_index = True` is the argument equivalent of chaining `.reset_index(drop = True)`.

**Note:** When stacking rows, the original index labels are kept, so the stacked version can contain several rows with the same label. `.reset_index(drop = True)` takes care of that.
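
A small sketch of the repeated-label issue and the two ways to fix it, using made-up Series `s1` and `s2`:

```python
import pandas as pd

# Hypothetical Series, each with the default index 0, 1
s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])

# stacking keeps the original labels, so 0 and 1 each appear twice
stacked = pd.concat([s1, s2])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# option 1: ask pd.concat() to build a fresh index
clean = pd.concat([s1, s2], ignore_index=True)

# option 2: reset the index afterwards and drop the old labels
same = stacked.reset_index(drop=True)

print(clean.index.tolist())  # [0, 1, 2, 3]
```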

<br>

### Section 2: Appending and concatenating DataFrames

1. Appending is similar to Series appending.
  * In the case of different indexes and column names, the final DataFrame keeps all index labels (even repeated ones) and has one column per unique column name.
  * Missing values (`NaN`) fill the cells where an original DataFrame had no such column.

2. Concatenating is also similar to Series concatenation.
  * `axis = 0` for row concatenation.
  * `axis = 1` for column concatenation.
    * **Note:** When stacking row-wise, all index labels are kept, even if they repeat. When stacking column-wise, rows that share an index label are aligned into a single row, so the result has one row per unique label.
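
The difference between the two axes shows up clearly when the DataFrames share only some labels. The frames `df1` and `df2` below are invented for illustration.

```python
import pandas as pd

# Hypothetical DataFrames that share only the label "y"
df1 = pd.DataFrame({"A": [1, 2]}, index=["x", "y"])
df2 = pd.DataFrame({"B": [3, 4]}, index=["y", "z"])

# row-wise: four rows ("y" appears twice), two columns, NaN where a column is absent
rows = pd.concat([df1, df2], axis=0)

# column-wise: one row per unique label, "y" holds values from both frames
cols = pd.concat([df1, df2], axis=1)

print(rows)
print(cols)
```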

**Additional insights:** 
1. `"%s_top5.csv" % "string"` replaces `%s` with `"string"`. 
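2. The `header = 0` argument of `pd.read_csv()` marks the first row of the file as the header row. Combined with `names`, that row is discarded and replaced.
3. The `names = [list_of_column_names]` argument of `pd.read_csv()` sets the column names of the resulting DataFrame.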
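
These three insights fit together when reading similarly named files. The sketch below uses an in-memory CSV (via `io.StringIO`) instead of a real file, and the medal counts in it are placeholder numbers.

```python
import io

import pandas as pd

medal = "bronze"

# %s is replaced by the string on the right-hand side
filename = "%s_top5.csv" % medal
print(filename)  # bronze_top5.csv

# Hypothetical file content standing in for the file named above
csv_text = "Country,Total\nUSA,1052\nURS,584\n"

# header=0 says the first row holds the old column names;
# names=... replaces them, so "Total" becomes "bronze"
df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,
    index_col="Country",
    names=["Country", medal],
)
print(df)
```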

<br>

### Section 3: Concatenation, keys, and multi-indexes

1. Multi-level indexing on row concatenations
  * `pd.concat([s1, s2], keys = [s1_out_index, s2_out_index], axis = 0)`
2. Multi-level indexing on column concatenations
  * `pd.concat([s1, s2], keys = [s1_out_index, s2_out_index], axis = 1)`  

**Note:** Keys help us distinguish data to avoid confusion!

**Additional insights:** 
1. The `keys` argument plays the same role as the keys of a dictionary, so `pd.concat()` also works on a dictionary of Series or DataFrames!

```
# make dictionary (avoid the name `dict`, which would shadow the built-in)
data = {s1_out_index:s1, s2_out_index:s2}

# concatenate
pd.concat(data, axis = 1)
```
2. `pd.IndexSlice` is required when slicing on the inner levels of multilevel indices.
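
A short sketch of keys, dictionaries, and `pd.IndexSlice` working together. The quarterly `sales` figures and the year keys are made up for illustration.

```python
import pandas as pd

# Hypothetical quarterly sales for two years
df_2013 = pd.DataFrame({"sales": [10, 20]}, index=["Q1", "Q2"])
df_2014 = pd.DataFrame({"sales": [30, 40]}, index=["Q1", "Q2"])

# row-wise: the keys become the outer level of a two-level row index
rows = pd.concat([df_2013, df_2014], keys=[2013, 2014], axis=0)

# column-wise: the keys become the outer level of the column labels
cols = pd.concat([df_2013, df_2014], keys=[2013, 2014], axis=1)

# a dictionary supplies the keys and the data in one go
from_dict = pd.concat({2013: df_2013, 2014: df_2014}, axis=1)

# slice on the inner level: every year, but only "Q1"
idx = pd.IndexSlice
q1 = rows.loc[idx[:, "Q1"], :]

print(rows)
print(q1)
```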

<br>

### Section 4: Outer and inner joins

1. Horizontal array stacking
  * `np.hstack([array1, array2])`
  * `np.concatenate([array1, array2], axis = 1)`
2. Vertical array stacking
  * `np.vstack([array1, array2])`
  * `np.concatenate([array1, array2], axis = 0)`
3. Outer joins include all indexes from the original tables without repetition (Set Union).
4. Inner joins only include common indexes from different tables (Set Intersection).

**Note:** We can specify which type of join we want to apply to our data using the `join` argument of `pd.concat()`.

**Note:** A `ValueError` is raised when the arrays have different sizes along an axis other than the axis of concatenation!

**Additional insights:**
1. `np.array(Series_or_DataFrame)` converts a Series or a DataFrame into a NumPy array.
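
The stacking functions and the two join types can be checked side by side. All arrays and DataFrames below are invented for illustration.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

h = np.hstack([a, b])  # side by side: shape (2, 4)
v = np.vstack([a, b])  # on top of each other: shape (4, 2)

# horizontal stacking needs the same number of rows; 2 rows vs 3 rows fails
try:
    np.concatenate([a, np.zeros((3, 2))], axis=1)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Hypothetical DataFrames that share only the label "y"
left = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"B": [4, 5]}, index=["y", "w"])

outer = pd.concat([left, right], axis=1, join="outer")  # union of labels
inner = pd.concat([left, right], axis=1, join="inner")  # intersection of labels

print(outer)
print(inner)
```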

<br>

## Chapter 3: Merging Data

### Section 1: Merging DataFrames

`pd.merge(df1, df2)` performs an inner join by default.
  * By default, this function merges on all columns that the two DataFrames have in common.
  * We can select which columns to merge on using the `on = []` argument.
    * If the key columns have different names in the two DataFrames, we can use the `left_on = []` and `right_on = []` arguments.
  * We can choose the suffixes appended to overlapping column names (other than the merge keys) using the `suffixes = []` argument.
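
As a rough sketch of these arguments, the two DataFrames below (`revenue` and `managers`, with invented values) share the column `state`, while the city is stored under different names (`city` and `branch`).

```python
import pandas as pd

# Hypothetical data
revenue = pd.DataFrame({
    "city": ["Austin", "Denver", "Mendocino"],
    "state": ["TX", "CO", "CA"],
    "revenue": [100, 83, 200],
})
managers = pd.DataFrame({
    "branch": ["Austin", "Denver", "Springfield"],
    "state": ["TX", "CO", "MO"],
    "manager": ["Charles", "Joel", "Sally"],
})

# default: inner join on the only common column, "state"
default = pd.merge(revenue, managers)

# key columns with different names; "state" now overlaps without being a key,
# so it receives the suffixes
by_city = pd.merge(
    revenue,
    managers,
    left_on="city",
    right_on="branch",
    suffixes=("_rev", "_mgr"),
)

print(default)
print(by_city.columns.tolist())
```

In `by_city`, the overlapping `state` column shows up twice, as `state_rev` and `state_mgr`.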

### Section 2: Joining DataFrames

Extending what we've learned from Section 1 of this Chapter:

1. Besides inner and outer, there are two more join types, selected with the `how` argument:
  * Left join: all rows from the left DataFrame are kept.
    * `how = "left"`
  * Right join: all rows from the right DataFrame are kept.
    * `how = "right"`
2. The combination of the left and right join yields the outer join!
3. DataFrames also have a `.join()` method, which joins on the index and performs a left join by default.
```
# use .join() with an explicit join type ("left", "right", "inner", or "outer")
df1.join(df2, how = "left")
```
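
The join types can be compared on two small made-up DataFrames whose `key` columns only partly overlap:

```python
import pandas as pd

# Hypothetical data: "a" exists only on the left, "d" only on the right
left = pd.DataFrame({"key": ["a", "b", "c"], "L": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "R": [20, 30, 40]})

left_join = pd.merge(left, right, on="key", how="left")    # keeps a, b, c
right_join = pd.merge(left, right, on="key", how="right")  # keeps b, c, d
outer_join = pd.merge(left, right, on="key", how="outer")  # keeps a, b, c, d

# .join() works on the index, so move "key" into the index first
joined = left.set_index("key").join(right.set_index("key"), how="inner")

print(left_join)
print(joined)
```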

**Additional insights: Which should you use?**
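1. `.append()` for simple vertical stacking of Series or DataFrames
2. `pd.concat()` for stacking many Series or DataFrames row-wise or column-wise, with simple inner/outer joins on the indexes
3. `df1.join(df2)` for inner, outer, left, and right joins on the indexes
4. `pd.merge(df1, df2)` for joins on one or more columns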

<br>

### Section 3: Ordered merges

1. A merged DataFrame can be sorted by chaining `.sort_values('col_name')` to `pd.merge()`.
  * We can use `pd.merge_ordered()` instead, which returns a result ordered by the merge keys, so `.sort_values('col_name')` is no longer needed.
  * **Note:** `pd.merge()` does an INNER merge while `pd.merge_ordered()` does an OUTER merge by default.
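
A minimal comparison of the two defaults, using invented time series `a` and `b` that share only one date:

```python
import pandas as pd

# Hypothetical data: both frames contain 2016-01-01, the other dates differ
a = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-01-03"]), "x": [1, 3]})
b = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-01-02"]), "y": [10, 20]})

# pd.merge(): inner join by default, so only the shared date survives
inner = pd.merge(a, b, on="date").sort_values("date")

# pd.merge_ordered(): outer join by default, rows ordered by date
ordered = pd.merge_ordered(a, b, on="date")

print(inner)
print(ordered)
```

The ordered merge keeps all three dates and leaves `NaN` where one of the DataFrames has no value for a date.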

<br>




## Tasks from this lesson (self-assessment)

1. Preparing Data
  * Reading multiple data files
    * Did I read multiple data files using loops and/or the `glob` module?
  * Reindexing DataFrames
    * Did I manipulate DataFrame indices using the learned tools?
  * Arithmetic with Series and DataFrames
    * Did I do operations using methods like `.multiply()` or `.divide()`?

2. Concatenating Data
  * Appending and concatenating Series
    * Did I append Series data?
  * Appending and concatenating DataFrames
    * Did I append/concatenate different DataFrames?
  * Concatenation, keys, multi-indexes
    * Did I concatenate DataFrames to create a multilevel DataFrame?

3. Merging Data
  * Merging DataFrames
    * Did I merge DataFrames?
  * Joining DataFrames
    * Did I join DataFrames using one of the various joining methods?
  * Ordered merges
    * Did I do an ordered merge?