# Joining Data with pandas
- The city of Chicago is divided into fifty local neighborhoods called wards. We have a table with data about the local government offices in each ward. In this example, we want to merge the local government data with census data about the population of each ward. 

##### One-to-one
- In a one-to-one relationship, every row in the left table is related to one and only one row in the right table.

##### One-to-many
- In a one-to-many relationship, every row in the left table is related to one or more rows in the right table.
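
The two relationship types can be sketched as follows. This is only an illustration: the `wards`, `census`, and `licenses` tables, their columns, and their values are made up and are not the course data.

```python
import pandas as pd

# Hypothetical data: exactly one row per ward in both tables
wards = pd.DataFrame({"ward": [1, 2, 3],
                      "alderman": ["A", "B", "C"]})
census = pd.DataFrame({"ward": [1, 2, 3],
                       "population": [100, 200, 300]})

# One-to-one: each ward matches exactly one census row
wards_census = wards.merge(census, on="ward")

# Hypothetical data: a ward can hold several business licenses
licenses = pd.DataFrame({"ward": [1, 1, 2, 3, 3, 3],
                         "business": ["b1", "b2", "b3", "b4", "b5", "b6"]})

# One-to-many: each ward row repeats once per matching license
ward_licenses = wards.merge(licenses, on="ward")

print(wards_census.shape)   # (3, 3)
print(ward_licenses.shape)  # (6, 3)
```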

##### Merging multiple DataFrames
- `df1.merge(df2, on='col').merge(df3, on='col')`
- `df1.merge(df2, on='col').merge(df3, on='col').merge(df4, on='col')`
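
A minimal runnable sketch of a chained merge, assuming three small made-up tables that share a key column named `col`:

```python
import pandas as pd

# Hypothetical tables that all share the key column 'col'
df1 = pd.DataFrame({"col": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"col": [1, 2, 3], "b": [10, 20, 30]})
df3 = pd.DataFrame({"col": [1, 2], "c": [True, False]})

# Wrapping the chain in parentheses avoids trailing backslashes
merged = (
    df1.merge(df2, on="col")
       .merge(df3, on="col")
)
print(merged)
```

Each `.merge()` returns a new DataFrame, so the next call in the chain operates on the result of the previous one.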

##### Merging a table to itself or self join
- To complete this merge, we pass the sequels table to the merge method as both the left and the right table. We can think of it as merging two copies of the same table. Everything we have covered about merging two tables still applies, so we can merge the tables on different columns. We use the `left_on` and `right_on` arguments to match rows where the sequel's id matches the original movie's id. Finally, setting the `suffixes` argument allows us to identify which columns describe the original movie and which describe the sequel.
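
The self join described above might look like the sketch below. The `sequels` table here is invented for illustration; in particular, the column names `id`, `title`, and `sequel`, and the use of `0` to mean "no sequel", are assumptions.

```python
import pandas as pd

# Hypothetical sequels table: 'sequel' holds the id of the follow-up movie
# (0 is a made-up placeholder meaning "no sequel")
sequels = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Movie A", "Movie A 2", "Movie A 3"],
    "sequel": [2, 3, 0],
})

# Match each movie's 'sequel' value to the 'id' of another row
original_sequels = sequels.merge(
    sequels, left_on="sequel", right_on="id", suffixes=("_org", "_seq")
)
print(original_sequels[["title_org", "title_seq"]])
```

Because both copies have the same column names, every column receives a suffix, which makes it clear which side of the pair a value belongs to.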

##### Merging on indexes
- `df_merge_index = df1.merge(df2, on='column_name', how='left')`
    - The `on` argument accepts the name of an index level as well as a column name.
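
As a small sketch of merging on an index, assume two made-up tables, `movies` and `taglines`, that are both indexed by an index named `id`:

```python
import pandas as pd

# Hypothetical tables that share an index named 'id'
movies = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C"]},
                      index=pd.Index([1, 2, 3], name="id"))
taglines = pd.DataFrame({"tagline": ["Tag A", "Tag C"]},
                        index=pd.Index([1, 3], name="id"))

# 'on' names the index level, so the call looks like a column merge
movies_taglines = movies.merge(taglines, on="id", how="left")
print(movies_taglines)
```

With `how='left'`, the movie without a tagline is kept and its `tagline` value is missing.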

##### MultiIndex datasets
- `df_multiIndex = df1.merge(df2, on=['column_1','column_2'])`
- `df_multiIndex = df1.merge(df2, left_on='column_1', right_on='column_2')`
    - Use `left_on` and `right_on` when the index levels are named differently in the two tables. Newer pandas versions raise a `MergeError` if `left_on` is combined with `left_index=True` (or `right_on` with `right_index=True`).
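
A minimal sketch of merging on a MultiIndex, assuming two made-up tables whose index levels are both named `column_1` and `column_2`:

```python
import pandas as pd

# Hypothetical tables that share a two-level index
idx1 = pd.MultiIndex.from_tuples([(1, "a"), (1, "b"), (2, "a")],
                                 names=["column_1", "column_2"])
idx2 = pd.MultiIndex.from_tuples([(1, "a"), (2, "a"), (3, "c")],
                                 names=["column_1", "column_2"])
df1 = pd.DataFrame({"left_val": [10, 20, 30]}, index=idx1)
df2 = pd.DataFrame({"right_val": [100, 200, 300]}, index=idx2)

# Pass both level names as a list to match on the full MultiIndex
df_multi = df1.merge(df2, on=["column_1", "column_2"])
print(df_multi)
```

Only the index pairs present in both tables survive the default inner join.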

#### Filtering joins
- **Mutating joins:**
    - Combine data from two tables based on matching observations in both tables
- **Filtering joins:**
    - Filter observations from one table based on whether or not they match an observation in another table
- **Semi-join:**
    - Returns the intersection, similar to an inner join
    - Returns only columns from the left table and **not** the right
    - No duplicates
- **Anti-join:**
    - Returns the left table, excluding the intersection
    - Returns only columns from the left table and **not** the right
- **Example**
    - Merge employees and top_cust
        - `empl_cust = employees.merge(top_cust, on='srid', how='left', indicator=True)`
    - Select the srid column where _merge is left_only
        - `srid_list = empl_cust.loc[empl_cust['_merge'] == 'left_only', 'srid']`
    - Get employees not working with top customers
        - `print(employees[employees['srid'].isin(srid_list)])`
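
Both filtering joins can be sketched on toy data as follows. The rows in `employees` and `top_cust` are made up; only the `srid` key and the anti-join steps come from the example above, and the semi-join via `.isin()` is one possible way to express it.

```python
import pandas as pd

# Hypothetical data: 'srid' is the sales-rep id shared by both tables
employees = pd.DataFrame({"srid": [1, 2, 3, 4],
                          "name": ["Ann", "Bob", "Cara", "Dev"]})
top_cust = pd.DataFrame({"srid": [1, 1, 3],
                         "customer": ["c1", "c2", "c3"]})

# Semi-join: employees with at least one top customer,
# left columns only, no duplicated rows
semi = employees[employees["srid"].isin(top_cust["srid"])]

# Anti-join: employees with no match in top_cust
empl_cust = employees.merge(top_cust, on="srid", how="left", indicator=True)
srid_list = empl_cust.loc[empl_cust["_merge"] == "left_only", "srid"]
anti = employees[employees["srid"].isin(srid_list)]

print(semi["name"].tolist())  # ['Ann', 'Cara']
print(anti["name"].tolist())  # ['Bob', 'Dev']
```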

##### Concatenate DataFrames together vertically
- **Ignoring the index**
    - If the index contains no valuable information, then we can ignore it in the concat method by setting ignore_index to True. The result is that the index will go from 0 to n-1.
    - `pd.concat([df1, df2, df3], ignore_index=True)`
    
##### Setting labels to original tables
- Now, suppose we want to associate a specific key with each of our three original tables. We can provide a list of labels to the `keys` argument. Leave `ignore_index` at `False`, because ignoring the index would discard the keys. The result is a table with a multi-index, with the label on the first level.
- `pd.concat([df_1, df_2, df_3], ignore_index=False, keys=['label_1','label_2','label_3'])`
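
A short sketch of `ignore_index` and `keys`, assuming two made-up invoice tables with identical columns:

```python
import pandas as pd

# Hypothetical monthly invoice tables with identical columns
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [1.5, 2.5]})
inv_feb = pd.DataFrame({"iid": [3, 4], "total": [3.5, 4.5]})

# ignore_index=True discards the old row labels and renumbers 0..n-1
stacked = pd.concat([inv_jan, inv_feb], ignore_index=True)

# keys adds an outer index level that records each row's source table
labeled = pd.concat([inv_jan, inv_feb], keys=["jan", "feb"])

print(stacked.index.tolist())  # [0, 1, 2, 3]
print(labeled.loc["feb"])      # only the February rows
```

The outer level added by `keys` makes it possible to pull one source table back out with `.loc`.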

##### Concatenate DataFrames tables with different column names
- **Concatenate tables with different column names**
- The concat method by default will include all of the columns in the different tables it's combining. The sort argument, if true, will alphabetically sort the different column names in the result.
- `pd.concat([inv_jan, inv_feb], sort=True)` 
- **Concatenate only the matching columns**
- If we only want the matching columns between tables, we set the join argument to "inner". Its default value is "outer", which is why concat by default includes all of the columns. Additionally, the sort argument has no effect when join equals "inner". The order of the columns will be the same as the input tables.
- `pd.concat([df_1, df_2], join='inner')`
- **Using the `.append()` method**
    - Simplified version of `pd.concat()`
    - Supports: `ignore_index` and `sort`
    - Does not support: `keys` and `join`
        - Always `join='outer'`
    - Deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat()` in newer versions
- **Append the tables**
- `df_1.append([df_2, df_3], ignore_index=True, sort=True)`
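
The `join` and `sort` arguments can be illustrated with `pd.concat()`, which also covers what `.append()` does. The invoice tables below, including the extra `bill_ctry` column, are made up for this sketch.

```python
import pandas as pd

# Hypothetical invoice tables; only inv_feb has a 'bill_ctry' column
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [1.5, 2.5]})
inv_feb = pd.DataFrame({"iid": [3, 4], "total": [3.5, 4.5],
                        "bill_ctry": ["US", "DE"]})

# Default join='outer': keep every column and fill the gaps with NaN;
# sort=True orders the column names alphabetically
outer = pd.concat([inv_jan, inv_feb], sort=True)

# join='inner': keep only the columns present in every table
inner = pd.concat([inv_jan, inv_feb], join="inner")

print(outer.columns.tolist())  # ['bill_ctry', 'iid', 'total']
print(inner.columns.tolist())  # ['iid', 'total']
```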

- **Validating merges**
<br>`.merge(validate=None)`:
    - Checks if merge is of specific type
        - `one_to_one`
        - `one_to_many`
        - `many_to_many`
        - `many_to_one`
- **Verifying concatenations**
<br> `pd.concat(verify_integrity=False)`:
    - Checks whether the new concatenated index contains duplicates
    - Default value is `False`
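
Both checks can be sketched on toy data. The `tracks` and `specs` tables are made up; the flags `merge_ok` and `concat_ok` exist only to record what happened in this illustration.

```python
import pandas as pd

# Hypothetical tables: 'tid' 2 appears twice in specs
tracks = pd.DataFrame({"tid": [1, 2], "name": ["t1", "t2"]})
specs = pd.DataFrame({"tid": [1, 2, 2], "kb": [100, 200, 300]})

# validate raises MergeError when the relationship is not the declared one
try:
    tracks.merge(specs, on="tid", validate="one_to_one")
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print("merge check failed:", err)

# verify_integrity raises ValueError when the combined index has duplicates
try:
    pd.concat([tracks, specs], verify_integrity=True)
    concat_ok = True
except ValueError as err:
    concat_ok = False
    print("concat check failed:", err)
```

Here both checks fail: the merge is really one-to-many, and both tables carry the default row labels 0 and 1.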

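- **merge_ordered() caution, multiple columns**
    - When using merge_ordered() to merge on multiple columns, the order of the columns in `on` is important when you combine it with the forward fill feature. The result is sorted by those columns in the order given, and forward fill copies values down the sorted rows, so a different column order can fill a missing value from a different row.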
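
The effect of the column order can be seen in a small sketch. The `gdp` and `pop` tables and all of their values are invented for illustration.

```python
import pandas as pd

# Hypothetical yearly data; population is only known for 2015
gdp = pd.DataFrame({"date": [2015, 2016, 2015, 2016],
                    "country": ["AUS", "AUS", "SWE", "SWE"],
                    "gdp": [1, 2, 3, 4]})
pop = pd.DataFrame({"date": [2015, 2015],
                    "country": ["AUS", "SWE"],
                    "population": [10, 20]})

# Sorted by date first: the 2016 AUS row follows the 2015 SWE row,
# so forward fill copies Sweden's population into Australia's row
by_date = pd.merge_ordered(gdp, pop, on=["date", "country"],
                           fill_method="ffill")

# Sorted by country first: each country's rows stay together,
# so forward fill stays within the same country
by_country = pd.merge_ordered(gdp, pop, on=["country", "date"],
                              fill_method="ffill")

print(by_date[["date", "country", "population"]])
print(by_country[["date", "country", "population"]])
```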
    
- **Using merge_asof()**
    - Similar to a `merge_ordered()` left join
        - Has similar features to `merge_ordered()`
    - Matches on the nearest key column value rather than on exact matches
        - The merged "on" columns must be sorted
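
A minimal sketch of `merge_asof()`, assuming two made-up price tables, `prices_a` and `prices_b`, that are both sorted by a `date_time` column:

```python
import pandas as pd

# Hypothetical price ticks from two sources, both sorted by 'date_time'
prices_a = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-11-17 16:00", "2017-11-17 17:00",
                                 "2017-11-17 18:00"]),
    "close": [1.0, 1.1, 1.2],
})
prices_b = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-11-17 15:35", "2017-11-17 16:25",
                                 "2017-11-17 17:50"]),
    "close": [2.0, 2.1, 2.2],
})

# Default direction='backward': for each left row, take the last right row
# whose key is less than or equal to the left key
merged_prices = pd.merge_asof(prices_a, prices_b, on="date_time",
                              suffixes=("_a", "_b"))
print(merged_prices)
```

None of the timestamps match exactly, yet every left row finds a partner because the nearest earlier right row is used.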
        
- **Selecting data with .query()**
<br>`.query('SOME SELECTION STATEMENT')`
    - Accepts an input string
        - Input string determines which rows are returned
        - Input string is similar to the statement after the **WHERE** clause in an **SQL** statement
            - Prior knowledge of SQL is not necessary
            - `df.query('column_name >= 90')`
            - `df.query('column_1 > 90 and column_2 < 140')`
            - `df.query('column_1 > 96 or column_2 < 98')`
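
The numeric selections above can be tried on a toy table. The `stocks` table and its values are made up for this sketch.

```python
import pandas as pd

# Hypothetical daily closing prices
stocks = pd.DataFrame({"nike": [95, 97, 99], "disney": [112, 103, 96]})

# One condition
high_nike = stocks.query("nike >= 97")

# Two conditions combined with 'and' / 'or'
both = stocks.query("nike > 96 and disney < 100")
either = stocks.query("nike > 96 or disney < 98")

print(len(high_nike), len(both), len(either))  # 2 1 2
```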

##### Using .query() to select text
- We are interested in selecting all of the rows where the column stock equals "disney", or where the column stock equals "nike" and close is less than 90. Let's pause for a moment to look at our query string. Within the parentheses, we check whether the stock column is nike and the close column is less than 90. Both conditions have to be true for the parenthesized section to return true. We then combine that with `or` and a check for whether stock is "disney". When checking text, we use the double equals sign, similar to an if statement in Python. Also, we surround the text value with double quotes. This avoids unintentionally ending the query string, since we used single quotes to start it.
- `stocks_long.query('stock == "disney" or (stock == "nike" and close < 90)')`
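
The same query can be run on a small made-up `stocks_long` table to see which rows it keeps:

```python
import pandas as pd

# Hypothetical long-format table: one row per stock observation
stocks_long = pd.DataFrame({
    "stock": ["disney", "nike", "disney", "nike"],
    "close": [110, 95, 104, 88],
})

# Double quotes inside the single-quoted string mark the text values
selected = stocks_long.query(
    'stock == "disney" or (stock == "nike" and close < 90)'
)
print(selected)
```

Both disney rows are kept, and of the nike rows only the one with a close below 90 survives.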

##### Reshaping data with .melt()
- **Wide versus long data**
    - Sometimes we will come across data where every row relates to one subject, and each column has different information about an attribute of that subject. Data formatted in this way is often called wide. There are other times when the information about one subject is found over many rows, and each row has one attribute about that subject. Data formatted in this way is often called long or tall. In general, wide formatted data is easier to read by people than long formatted. However, long formatted data is often more accessible for computers to work with.

- **What does the .melt() method do?**
    - The melt method will allow us to unpivot, or change the format of, our dataset. For example, it can move height and weight columns from their wide horizontal placement to a long vertical placement.

- **Example of .melt()**
    - Here we call the melt() method on the table social_fin. The first input argument to the method is id_vars. These are columns to be used as identifier variables. We can also think of them as columns in our original dataset that we do not want to change.
    - `social_fin_tall = social_fin.melt(id_vars=['financials','company'])`
    - **Melting with value_vars**
        - This time, let's use the argument value_vars with the melt() method. This argument will allow us to control which columns are unpivoted. Here, we unpivot only the 2018 and 2017 columns. Our output now only has data for the years 2018 and 2017. Additionally, the order of the value_vars was kept: the output starts with 2018, then moves to 2017. Finally, notice that the column with the years is now named variable, and our values column is named value.
            - `social_fin_tall = social_fin.melt(id_vars=['financials','company'], value_vars=['2018','2017'])`
        - **Melting with column names**
            - In this example, we have added some additional inputs to our melt() method. The var_name argument will allow us to set the name of the year column in the output. Similarly, the value_name argument will allow us to set the name of the value column in the output. It is the same as before, except our variable and value columns are renamed year and dollars, respectively. We have seen how the melt() method is useful for reshaping our tables. Imagine a situation where you have merged many columns, making your table very wide. The melt() method can then be used to reshape that table into a more computer-friendly format.
            - `social_fin_tall = social_fin.melt(id_vars=['financials','company'], value_vars=['2018','2017'], var_name='year', value_name='dollars')`
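
A runnable sketch of the final `melt()` call, assuming a small made-up `social_fin` table (the figures are placeholders, not real financial data):

```python
import pandas as pd

# Hypothetical wide table: one column per year
social_fin = pd.DataFrame({
    "financials": ["total_revenue", "net_income"],
    "company": ["company_x", "company_x"],
    "2018": [300, 120],
    "2017": [250, 90],
    "2016": [200, 60],
})

# Unpivot only the 2018 and 2017 columns and name the two new columns
social_fin_tall = social_fin.melt(
    id_vars=["financials", "company"],
    value_vars=["2018", "2017"],
    var_name="year",
    value_name="dollars",
)
print(social_fin_tall)
```

The 2016 column is dropped because it is neither an identifier nor listed in `value_vars`.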