# Chapter 1
---
## Data Merging Basics

### Inner Join

An inner join returns only the rows whose key values appear in both tables.

`new_df = df1.merge(df2, on='column_name')`

This gives us the inner-joined table: all the columns of both tables, matched on the specified `column_name`. **DO NOTE** that columns with the same name in both tables (other than the join key) are repeated with **_x / _y** attached at the end.
<img src=attachment:image.png style="width:40%;margin-left:230px;">

We can replace the default **_x / _y** with suffixes of our own choice by passing them in the *suffixes* argument.

`suffixes=('_ward', '_cen')`
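
A minimal sketch of the inner join and the `suffixes` argument. The `wards` and `census` tables and their values are made up for illustration; only `.merge()`, `on` and `suffixes` come from the notes above.

```python
import pandas as pd

# Hypothetical tables: both have the key "ward" and a non-key column "zip".
wards = pd.DataFrame({"ward": [1, 2, 3], "zip": ["60647", "60622", "60612"]})
census = pd.DataFrame({"ward": [1, 2, 4], "zip": ["60691", "60622", "60601"]})

# Inner join is the default: only wards 1 and 2 exist in both tables.
# The overlapping "zip" column gets our suffixes instead of _x / _y.
wards_census = wards.merge(census, on="ward", suffixes=("_ward", "_cen"))
print(wards_census)
```

The join key `ward` appears once, while `zip` shows up as `zip_ward` and `zip_cen`.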

### One to Many Relationship

![image.png](attachment:image.png)

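From our end there is no change in syntax. Whether the result is one-to-one or one-to-many depends only on the data: if a key value in the left table matches several rows in the right table, the left row is repeated once per match.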
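
A small sketch of a one-to-many merge, using made-up `wards` and `licenses` tables (both names and all values are assumptions for illustration). The call is the same as for a one-to-one merge; only the row count changes.

```python
import pandas as pd

# One row per ward on the left, several businesses per ward on the right.
wards = pd.DataFrame({"ward": [1, 2], "alderman": ["A", "B"]})
licenses = pd.DataFrame({
    "ward": [1, 1, 2, 2, 2],
    "business": ["b1", "b2", "b3", "b4", "b5"],
})

# Same syntax as before; each ward row is repeated once per matching license.
ward_licenses = wards.merge(licenses, on="ward")
print(wards.shape, ward_licenses.shape)
```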

## Exercise

![image-2.png](attachment:image-2.png)

### Merging Multiple DataFrames

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)
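
One common way to merge more than two tables is to chain `.merge()` calls. The sketch below assumes three hypothetical tables (`licenses`, `zip_demo`, `wards`) with invented values; it is not taken from the screenshots.

```python
import pandas as pd

licenses = pd.DataFrame({"account": [10, 11], "ward": [1, 2], "zip": ["60647", "60622"]})
zip_demo = pd.DataFrame({"zip": ["60647", "60622"], "income": [50000, 60000]})
wards = pd.DataFrame({"ward": [1, 2], "alderman": ["A", "B"]})

# Each .merge() returns a DataFrame, so the next merge can be chained onto it.
merged = (
    licenses
    .merge(zip_demo, on="zip")
    .merge(wards, on="ward")
)
print(merged)
```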


---

# Chapter 2
## Merging Tables with Different Join Types
---
### Left Join

<img src=attachment:image.png style="width:40%;margin-left:230px;">

Pass the **how = "left"** parameter to `.merge()` to get a left join.

![image-2.png](attachment:image-2.png)


**NOTE**: A left join returns all of the rows from the left table. If a row in the left table matches multiple rows in the right table, then all of those matches are returned. Therefore, the result has at least as many rows as the left table. Knowing what to expect is useful in troubleshooting any suspicious merges.
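
The row-count rule in the note can be checked with a quick sketch. The `movies` and `taglines` tables here are hypothetical and their values are invented.

```python
import pandas as pd

movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
# id 1 has two taglines, ids 2 and 3 have none, id 4 is not a known movie.
taglines = pd.DataFrame({"id": [1, 1, 4], "tagline": ["t1", "t2", "t4"]})

left = movies.merge(taglines, on="id", how="left")

# Every movie is kept; unmatched movies get NaN in the right-hand columns.
print(left)
print(len(left) >= len(movies))
```

Movie 1 appears twice (one row per matching tagline), and tagline `t4` is dropped because its `id` is not in the left table.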

### Right Join

<img src=attachment:image.png style="width:40%;margin-left:230px;">

If the key columns have different names in the two tables, then we can specify them explicitly using `left_on` and `right_on`.

![image-2.png](attachment:image-2.png)
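
A minimal sketch of a right join on differently named key columns. The table names, column names and values are assumptions for illustration.

```python
import pandas as pd

movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
genres = pd.DataFrame({"movie_id": [2, 3], "genre": ["TV Movie", "Drama"]})

# Keep every row of the right table; the key is "id" on the left
# and "movie_id" on the right.
right = movies.merge(genres, how="right", left_on="id", right_on="movie_id")
print(right)
```

Both key columns are kept in the result, and the row for `movie_id` 3 has missing values in the columns that come from `movies`.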

### Outer Join

<img src=attachment:image.png style="width:40%;margin-left:230px;">

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

## Merging a Table to Itself:  (SELF JOIN)
---

When would you join a table to itself?
1. Graph data
2. Hierarchical relationships
3. Sequential relationships

**By default this kind of merge is an INNER JOIN** (as with any `.merge()` call; pass `how` to change it)

![image.png](attachment:ca0cc442-8e99-4447-8df8-47b5fb35f70b.png)

![image.png](attachment:6f9684b9-45c2-4e10-9589-e36ff325b70d.png)
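
A sketch of a self join on a hierarchical relationship. The `staff` table, its columns, and the use of `0` as a "no manager" marker are all hypothetical.

```python
import pandas as pd

staff = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Ana", "Ben", "Cy", "Di"],
    "manager_id": [0, 1, 1, 2],  # 0 = no manager (assumed convention)
})

# Match each employee's manager_id to the id of another row of the same table.
pairs = staff.merge(
    staff,
    left_on="manager_id",
    right_on="id",
    suffixes=("_emp", "_mgr"),
)
print(pairs[["name_emp", "name_mgr"]])
```

Because the default is an inner join, Ana (who has no manager in the table) drops out of the result; `how="left"` would keep her with missing manager columns.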

## Merging on Indexes
---

Merging on an index is much the same as merging on a column **when** both tables share the index name: pass that name to `on`. E.g.:
![image.png](attachment:52acbcdd-f694-4590-8b3f-ff906407648f.png)

Merging on a **MultiIndex** whose level names match in both tables is similar: pass the list of level names to `on`, like this:
![image.png](attachment:4b9c37a5-3760-4d04-b89d-2b6e0bb60a55.png)

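The method differs when the two indexes have different names. To join index to index, set **left_index=True** and **right_index=True**. To join a column of one table to the index of the other, combine `left_on` with `right_index=True` (or `right_on` with `left_index=True`). Older course material sets `left_index`/`right_index` to **True** alongside `left_on`/`right_on`; recent pandas versions reject `left_on` together with `left_index=True` (and likewise on the right side), so use one or the other for each table.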

![image.png](attachment:89a60ef4-403c-4c63-b896-137825cb81ea.png)
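
A sketch of the three index-based variants, assuming hypothetical `movies`, `taglines`, `genres` and `ratings` tables with invented values. It sticks to argument combinations that current pandas accepts.

```python
import pandas as pd

movies = pd.DataFrame({"title": ["A", "B", "C"]}, index=pd.Index([1, 2, 3], name="id"))
taglines = pd.DataFrame({"tagline": ["t1", "t3"]}, index=pd.Index([1, 3], name="id"))
genres = pd.DataFrame({"genre": ["Drama", "Comedy"]}, index=pd.Index([2, 3], name="movie_id"))
ratings = pd.DataFrame({"movie_id": [1, 3, 3], "rating": [7, 8, 9]})

# 1) Same index name in both tables: refer to it through `on`.
same_name = movies.merge(taglines, on="id", how="left")

# 2) Different index names: join index to index.
by_index = movies.merge(genres, left_index=True, right_index=True)

# 3) A column on the left matched against the index on the right.
col_to_index = ratings.merge(movies, left_on="movie_id", right_index=True)

print(same_name, by_index, col_to_index, sep="\n\n")
```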


## USE CASE (Examples)
![image.png](attachment:f0172387-3d41-4bce-9690-607ef02c284a.png)

# Chapter 3
## Advanced Merging and Concatenation
---

First we have to look at the difference between **mutating** and **filtering** joins.

#### Mutating Joins:
* Combine data from two tables based on matching observations in both tables

#### Filtering Joins:
* Filter observations from table based on whether or not they match an observation in another table.

But before this, let's see what a **semi-join** is.
A semi-join filters the left table down to the rows that have a match in the right table. It is similar to an INNER JOIN, but only the columns of the left table are returned and no left row is duplicated. Like this:

![image.png](attachment:a9da3563-785d-4c14-8356-7bcf04167341.png)

![image.png](attachment:95bcca86-9ec8-49ca-8d49-e6e0f4b9afc1.png)
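
pandas has no dedicated semi-join argument, so one way to build it is an inner merge followed by an `.isin()` filter. A minimal sketch, with hypothetical `genres` and `top_tracks` tables:

```python
import pandas as pd

genres = pd.DataFrame({"gid": [1, 2, 3], "name": ["Rock", "Jazz", "Metal"]})
top_tracks = pd.DataFrame({"tid": [10, 11, 12], "gid": [1, 1, 3]})

# Step 1: inner merge to find which keys have a match.
genres_tracks = genres.merge(top_tracks, on="gid")

# Step 2: keep the left rows whose key is among the matched keys.
semi = genres[genres["gid"].isin(genres_tracks["gid"])]
print(semi)
```

Rock matches two tracks but still appears only once, and only the columns of `genres` are returned.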

---
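Opposite to the semi-join, the anti-join excludes the intersection: it returns the rows of the left table that have no match in the right table, again with only the left table's columns.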

![image.png](attachment:0c9dbbf0-0582-4e36-b0c7-c8716087ee6d.png)

![image.png](attachment:8124640b-b569-40a3-bb05-017ad832c5be.png)
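
An anti-join can be sketched with a left merge plus `indicator=True`, which adds a `_merge` column saying where each row came from. The tables are the same hypothetical ones as in a semi-join example and are redefined here so the block runs on its own.

```python
import pandas as pd

genres = pd.DataFrame({"gid": [1, 2, 3], "name": ["Rock", "Jazz", "Metal"]})
top_tracks = pd.DataFrame({"tid": [10, 11, 12], "gid": [1, 1, 3]})

# Step 1: left merge with an indicator column.
genres_tracks = genres.merge(top_tracks, on="gid", how="left", indicator=True)

# Step 2: collect the keys that exist only in the left table.
gid_list = genres_tracks.loc[genres_tracks["_merge"] == "left_only", "gid"]

# Step 3: filter the original left table with those keys.
anti = genres[genres["gid"].isin(gid_list)]
print(anti)
```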

## Concatenate DataFrames together vertically

`pd.concat()` combines the tables in the order in which they are passed in. The result is a single, vertically stacked table.
If the index contains no valuable information, then we can simply discard it with `ignore_index=True`.

![image.png](attachment:35843b84-f279-4a3f-8dbe-458c11ee153e.png)

**NOTE: Providing `keys` and setting `ignore_index=True` can NOT be done simultaneously; resetting the index throws the keys away**

![image.png](attachment:c97c6a49-5606-4397-83ed-513f99dbcf3b.png)

`sort = True` sorts the column names in the combined table<br>
`join='inner'` keeps only the columns common to all the tables, and sort doesn't work with it.
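
A short sketch of the `pd.concat()` options mentioned above. The monthly invoice tables `jan` and `feb` are hypothetical, with invented values.

```python
import pandas as pd

jan = pd.DataFrame({"iid": [1, 2], "total": [5.0, 6.0]})
feb = pd.DataFrame({"iid": [3, 4], "total": [7.0, 8.0], "bill_ctry": ["US", "DE"]})

# Stack vertically and replace the old index with 0..n-1.
stacked = pd.concat([jan, feb], ignore_index=True)

# Label each source table; the result has a two-level index.
keyed = pd.concat([jan, feb], keys=["jan", "feb"])

# Keep only the columns that exist in every table.
common = pd.concat([jan, feb], join="inner", ignore_index=True)

# Default outer join, with the column names sorted alphabetically.
sorted_cols = pd.concat([jan, feb], sort=True, ignore_index=True)

print(stacked, keyed, common, sorted_cols, sep="\n\n")
```

With the default `join='outer'`, the `bill_ctry` column is kept and filled with missing values for the January rows.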

`.append()` :
* Simplified version of `pd.concat()`
* Supports: `ignore_index`, and `sort`
* Does Not Support: `keys` and `join`
    * Always `join = outer`
* Deprecated in pandas 1.4 and removed in pandas 2.0, so use `pd.concat()` in current code

---
### Verifying Integrity

Merging or concatenating tables can easily introduce unintended duplicates, for example when a relationship we assumed to be one-to-one is actually one-to-many or many-to-many. So we need to verify the integrity of our data.

![image.png](attachment:702520c0-b66d-4bbc-9cca-20d39831239f.png)

![image.png](attachment:c52b50b8-c079-43bb-acb2-b23b13eb9388.png)

If the merge is not of the type specified in the `validate` argument, pandas raises a `MergeError`.

![image.png](attachment:07d2cfa7-f514-4010-bb99-d551c1311b6c.png)
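
A minimal sketch of checking a merge with the `validate` argument of `.merge()`. The `tracks` and `specs` tables are hypothetical; `specs` deliberately contains a duplicated key.

```python
import pandas as pd
from pandas.errors import MergeError

tracks = pd.DataFrame({"tid": [1, 2], "name": ["x", "y"]})
specs = pd.DataFrame({"tid": [1, 2, 2], "kbps": [128, 256, 320]})  # tid 2 twice

try:
    # We claim one-to-one, but the right table repeats a key.
    tracks.merge(specs, on="tid", validate="one_to_one")
    outcome = "merged"
except MergeError as err:
    outcome = "MergeError"
    print(err)

# The relationship really is one-to-many, so this check passes.
ok = tracks.merge(specs, on="tid", validate="one_to_many")
print(outcome, len(ok))
```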

For concatenation, `verify_integrity = True` checks **only the index values for duplicates**, not the column values, and raises an error if an index value is repeated.

![image.png](attachment:4a7aa39d-1bdb-4391-9c35-8c9df99b381f.png)
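
A sketch of `verify_integrity=True` on two hypothetical invoice tables that share the index value `9`:

```python
import pandas as pd

inv_feb = pd.DataFrame({"total": [1.0, 2.0]}, index=pd.Index([9, 10], name="iid"))
inv_mar = pd.DataFrame({"total": [3.0, 4.0]}, index=pd.Index([9, 11], name="iid"))

try:
    pd.concat([inv_feb, inv_mar], verify_integrity=True)
    outcome = "concatenated"
except ValueError as err:
    outcome = "ValueError"
    print(err)

# Without the check, the duplicated index value is silently kept.
unchecked = pd.concat([inv_feb, inv_mar])
print(outcome, unchecked.index.tolist())
```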



---
# Chapter 4
---

## merge_ordered()

This function merges time-series or other ordered data. By default it performs an outer join and **gives a result sorted by the merge key**.

![image.png](attachment:590dc9d1-ab58-4559-be06-11b5c1914516.png)

### Merging 2 tables and then Forward Filling the missing values:

![image.png](attachment:7d194be7-c6d1-45c9-b31d-4f992547e161.png)
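
A sketch of `pd.merge_ordered()` with forward filling. The two price tables and all their values are invented for illustration.

```python
import pandas as pd

aapl = pd.DataFrame({
    "date": pd.to_datetime(["2007-02-01", "2007-03-01", "2007-04-01"]),
    "close": [12.0, 13.0, 14.0],
})
mcd = pd.DataFrame({
    "date": pd.to_datetime(["2007-01-01", "2007-02-01", "2007-04-01"]),
    "close": [44.0, 43.0, 48.0],
})

# Outer join by default, sorted by date; gaps are filled from the row above.
merged = pd.merge_ordered(
    aapl, mcd, on="date", suffixes=("_aapl", "_mcd"), fill_method="ffill"
)
print(merged)
```

The first `close_aapl` value stays missing because forward fill has no earlier row to copy from.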

### When to Use:
* Dealing with Ordered data/ time series
* Filling in missing values

**Keep note: when merging on more than one column, choose the order of those columns wisely, because the result is sorted by them in the order given.**

---
## merge_asof()

Another function for ordered or time-series data. It is similar to an *ordered left join*,<br>
**However, the match doesn't need to be exact; the nearest key value is used instead**<br>
**NOTE: The merged 'on' columns must be sorted !!**

DEFAULT VALUE ==> `direction = 'backward'`, which picks the last row of the right table whose key is less than or equal to the left key. We can change this to `direction = 'forward'`, which picks the first row whose key is greater than or equal to the left key.
One more value that we can set is `direction = 'nearest'`, which chooses the closest key irrespective of forward or backward.

![image.png](attachment:602f3b87-875a-4d3a-9554-d54a9b853aaf.png)
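
A sketch comparing the three `direction` values of `pd.merge_asof()`. The `trades` and `quotes` tables are hypothetical, and both are already sorted by `time`.

```python
import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 10:00:04", "2024-01-01 10:00:17"]),
    "qty": [100, 200],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-01-01 10:00:00", "2024-01-01 10:00:10", "2024-01-01 10:00:20",
    ]),
    "price": [1.0, 1.1, 1.2],
})

back = pd.merge_asof(trades, quotes, on="time")                       # latest quote at or before
fwd = pd.merge_asof(trades, quotes, on="time", direction="forward")   # first quote at or after
near = pd.merge_asof(trades, quotes, on="time", direction="nearest")  # closest quote either way

print(back["price"].tolist(), fwd["price"].tolist(), near["price"].tolist())
```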

### When to Use:
* Working with Data sampled from a process, where the data is not exactly aligned
* Working on a training set (no data leakage) ==> when you do not want values from the future to be visible at any point in time.

### Using merge_asof() to create dataset
The `merge_asof()` function can be used to create datasets where you have a table of start and stop dates, and you want to use them to create a flag in another table.

**Example:**

![image.png](attachment:6530edc0-aae9-4d78-a0b0-f94af2c749ad.png)

![image.png](attachment:3c54c589-918e-477f-9f12-bd955dcf059d.png)
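
One possible way to build such a flag, assuming a hypothetical `periods` table with one row per start/stop date and a status label, and a `series` table of dated measurements (all names and values are invented):

```python
import pandas as pd

# One row per change of status: the start of a recession and its end.
periods = pd.DataFrame({
    "date": pd.to_datetime(["2008-01-01", "2009-07-01"]),
    "econ_status": ["recession", "expansion"],
})
series = pd.DataFrame({
    "date": pd.to_datetime(["2007-10-01", "2008-04-01", "2009-01-01", "2009-10-01"]),
    "value": [1.0, 2.0, 3.0, 4.0],  # illustrative numbers only
})

# Backward direction: each row picks up the most recent status at or before its date.
flagged = pd.merge_asof(series, periods, on="date")

# Rows before the first period have no status and are flagged False.
flagged["is_recession"] = flagged["econ_status"].eq("recession")
print(flagged)
```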

---
## .query()

![image.png](attachment:11f30bbb-ce18-4dd1-81c4-736f9e27b19f.png)

![image.png](attachment:8aae9ce7-8909-4534-9fdf-e09582ecacc9.png)

![image.png](attachment:b058ce96-2543-49f6-897d-8e5ee8425d4c.png)

We can also use the `.query()` method to select rows by string values; the string has to be quoted inside the query text.

![image.png](attachment:1446cafa-fd5c-4f76-8c69-efe58d552468.png)
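
A small sketch of `.query()` with `and`, `or` and a quoted string value. The `stocks` table and its values are made up.

```python
import pandas as pd

stocks = pd.DataFrame({
    "stock": ["disney", "nike", "nike", "disney"],
    "close": [100.0, 90.0, 95.0, 88.0],
})

# Single quotes around the query, double quotes around the string value.
nike_high = stocks.query('stock == "nike" and close > 92')
either = stocks.query('stock == "disney" or close < 95')

print(nike_high)
print(either)
```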


---
## .melt()

Useful for reshaping our table into a more computer-friendly format.

Wide data is easier for people to read, while long data is often easier for computers to work with.

![image.png](attachment:1ca4299a-dd25-4cf7-9dcf-821e17c3cc71.png)

![image.png](attachment:e70d1d06-8b68-4a2c-854e-37def1a1dc4d.png)

![image.png](attachment:ae448c69-4160-458f-a362-9f2440ae1db8.png)

**id_vars**: this argument stands for identifier variables: the columns that are kept as they are, i.e. they are not unfolded/unpivoted.

![image.png](attachment:a469fadd-f3b7-47a6-955d-c3c5c737c22f.png)

**value_vars**: this argument selects which columns are unpivoted into the result; columns that are in neither `id_vars` nor `value_vars` are removed from the data. The order in which the column names are listed is also the order in which they are unpivoted.

![image.png](attachment:2f97ecde-c320-4ae7-83d7-81ed58ce3886.png)

**var_name** and **value_name**: by default the two new columns are named `variable` and `value` respectively. With these arguments we can rename them in the unpivoted result.

![image.png](attachment:64c70be6-0b66-47a2-86d1-e173c7fdfb3a.png)
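
A sketch that uses all four `.melt()` arguments together. The `wide` table, the company name `acme` and every number in it are invented for illustration.

```python
import pandas as pd

wide = pd.DataFrame({
    "financial": ["revenue", "net_income"],
    "company": ["acme", "acme"],
    "2019": [3.0, 1.0],
    "2018": [2.0, 0.5],
    "2017": [1.5, 0.2],
})

long = wide.melt(
    id_vars=["financial", "company"],  # kept as identifier columns
    value_vars=["2018", "2017"],       # only these are unpivoted; "2019" is dropped
    var_name="year",                   # instead of the default "variable"
    value_name="dollars",              # instead of the default "value"
)
print(long)
```

The rows follow the order given in `value_vars`: all the 2018 rows first, then the 2017 rows.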



---
---