## Chapter IV

### Using merge_ordered()

This method can merge time-series and other ordered data.

<img src='pictures\merge_ordered.png' alt='merge_ordered' width=750 />

We can see the output of the merge when we merge on the 'C' column. The results are similar to the standard merge method with an outer join, except that the results are sorted.

#### Method comparison

<img src='pictures\Method_comparison.png' alt='comparison' width=750 />

merge_ordered() has many of the same arguments we have already covered with the merge method. Both methods accept arguments that let us merge two tables on different columns, and both support different types of joins. However, **the default for the merge method is "inner", while it is "outer" for the merge_ordered() method.** Both methods also support suffixes for overlapping column names. **How you call each method is also different.** Earlier in the course, we called the merge method by first listing a table and calling the method afterward. For merge_ordered(), you first call **pandas** and then merge_ordered().
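
The difference in defaults and calling style can be sketched on two tiny tables. The tables `left` and `right` and their values are hypothetical and are not taken from the course data.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration.
left = pd.DataFrame({"C": [3, 1, 2], "A": ["a3", "a1", "a2"]})
right = pd.DataFrame({"C": [2, 4, 1], "B": ["b2", "b4", "b1"]})

# merge is called on a table; its default is how="inner",
# so only keys present in both tables survive.
inner = left.merge(right, on="C")

# merge_ordered is called on pandas itself; its default is how="outer",
# and the result is ordered by the key column.
ordered = pd.merge_ordered(left, right, on="C")

print(inner)
print(ordered)
```

Here `inner` keeps only the keys 1 and 2, while `ordered` keeps all four keys in ascending order, with missing values where a table has no matching row.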

#### Stock Data

In this chapter, we will be working with financial, macroeconomic, and stock market data.

<img src='pictures\Stock_data.png' alt='stock' width=700 />

We have a table of the stock prices of the Apple corporation from February to June 2007. We also have a table of the stock price for McDonald's corporation from January to May 2007, and we want to merge them.
 
```python
import pandas as pd
pd.merge_ordered(aapl, mcd, on='date', suffixes=('_aapl', '_mcd'))
```

<img src='pictures\Merging_data.png' alt='merging' width=500 />

The first two arguments are the left and right tables. We set the **"on"** argument equal to date. Finally, we set the **suffixes** argument to indicate which table each column came from. This results in a table **sorted by date**. There **isn't** a value for Apple in January or a value for McDonald's in June, since values for these time periods are not available in the two original tables.

#### Forward Fill

We can fill in this missing data using a technique called forward filling. It fills each missing value with the most recent previous value.

```python
pd.merge_ordered(aapl, mcd, on='date', suffixes=('_aapl', '_mcd'), fill_method='ffill')
```

<img src='pictures\Forward_fill.png' alt='example' width=500 />

In the result, notice that the missing value for McDonald's in the last row is now filled in with the row before it. Apple in the first row is still missing since there isn't a row before the first row to copy into the missing value for Apple.
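
A minimal sketch of this behavior, assuming two small tables with made-up prices (the real `aapl` and `mcd` tables contain different values and more rows):

```python
import pandas as pd

# Made-up prices for illustration only.
aapl = pd.DataFrame({
    "date": pd.to_datetime(["2007-02-01", "2007-03-01", "2007-04-01"]),
    "close": [12.0, 13.0, 14.0],
})
mcd = pd.DataFrame({
    "date": pd.to_datetime(["2007-01-01", "2007-02-01", "2007-03-01"]),
    "close": [44.0, 43.0, 50.0],
})

# Without fill_method, rows missing from one table hold NaN.
plain = pd.merge_ordered(aapl, mcd, on="date", suffixes=("_aapl", "_mcd"))

# With fill_method="ffill", each gap takes the value from the row above it.
filled = pd.merge_ordered(aapl, mcd, on="date",
                          suffixes=("_aapl", "_mcd"), fill_method="ffill")

print(plain)
print(filled)
```

In `filled`, the last McDonald's value repeats the previous row, while the first Apple value stays missing because no earlier row exists to copy from.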

```python
# Exercise I
# Correlation between GDP and S&P500

# DataFrames sp500, gdp
import pandas as pd
sp500 = pd.read_csv('datasets\\S&P500.csv')
gdp = pd.read_csv('datasets\\WorldBank_GDP.csv')

# Use merge_ordered() to merge gdp and sp500 on Year and Date
gdp_sp500 = pd.merge_ordered(gdp, sp500, left_on='Year', right_on='Date', how='left', fill_method='ffill')

# Subset the GDP and Returns columns
gdp_returns = gdp_sp500[['GDP', 'Returns']]

# Print gdp_returns correlation
gdp_returns.corr()
```

|         | GDP      | Returns  |
| :-----: | :------: | :------: |
| GDP     | 1.0      | 0.040669 |
| Returns | 0.040669 | 1.0      |


### Using merge_asof()

* Similar to a ```merge_ordered()``` left join
    - Similar features to ```merge_ordered()```
* Matches on the nearest key value rather than only on exact matches
    - The merged 'on' column must be sorted


<img src='pictures\Merged_asof.png' width=500 />

The **merge_asof()** method is similar to an **ordered left join**. It has similar features to **merge_ordered()**. However, unlike an ordered left join, merge_asof() will match on the **nearest key values rather than only on equal values**. This brings up an important point: whatever columns you merge on **must be sorted**. In the table shown here, when we merge on column "C", we bring back all of the rows from the left table.
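
The sorting requirement can be checked with a small sketch. The tables and the names `left`, `right_unsorted`, and `sorted_required` are hypothetical; the sketch assumes that pandas raises a `ValueError` when a key column is not sorted.

```python
import pandas as pd

# Hypothetical toy tables; the right key column is deliberately unsorted.
left = pd.DataFrame({"C": [1, 5, 10], "left_val": ["a", "b", "c"]})
right_unsorted = pd.DataFrame({"C": [7, 2, 3, 1, 6],
                               "right_val": [7, 2, 3, 1, 6]})

# merge_asof refuses unsorted keys instead of silently mis-matching rows.
try:
    pd.merge_asof(left, right_unsorted, on="C")
    sorted_required = False
except ValueError:
    sorted_required = True

# Sorting the key column first makes the merge valid.
right = right_unsorted.sort_values("C")
result = pd.merge_asof(left, right, on="C")

print(sorted_required)
print(result)
```

Every row of `left` is kept, and each one is paired with the right-hand row whose key is the closest value that does not exceed it.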

#### DataSets

For this example, we will look at merging two tables. The first is stock price data for the **Visa** company with entries for **every hour** on Nov 11, 2017. The second table is **IBM** stock prices on the same day with entries for **roughly every five minutes**.

<img src='pictures\Tables.png' width=700 />

Let's use merge_asof() to merge the tables. The input arguments are very similar to what we have already seen in the course. Here we list the left and right tables first. Then we define that we want to merge on the "date_time" column. Finally, we provide a set of suffixes. Our output is similar to a left join, so we see all of the rows from the left Visa table. However, the values from the IBM table are based on how close the date_time values match with the Visa table. 

```python
pd.merge_asof(visa, ibm, on='date_time', suffixes=('_visa', '_ibm'))
```

<img src='pictures\asof.png' width=500 />

Notice the first row and the IBM price of 149.11. Let's show the IBM table again and see why this value was chosen in the merge. It comes from the row indexed as 4. This row has the closest date_time that is less than or equal to the date_time in the Visa table.

#### merge_asof() example with direction

This time in our merge_asof() method, we list the **direction argument** as "forward". This will change the behavior of the method to select the first row in the right table whose "on" key column is greater than or equal to the left's key column. The **default value** for the direction argument is **"backward"**.

```python
pd.merge_asof(visa, ibm, on='date_time', suffixes=('_visa', '_ibm'), direction='forward')
```

<img src='pictures\asof_direction.png' width=500 />

When we look at our results, we see different values for the IBM column. Let's again look at the first IBM value and trace it back to the IBM table. We see it in the row indexed as 5. Its date_time is slightly greater than the date_time in the Visa table.
Finally, you can set the **direction argument to "nearest"**, which returns the nearest row in the right table regardless of whether it is forward or backward.
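
The three directions can be compared side by side. This is a sketch on invented prices and timestamps, so the values differ from those in the course tables.

```python
import pandas as pd

# Hypothetical intraday prices, invented for this sketch.
visa = pd.DataFrame({
    "date_time": pd.to_datetime(["2017-11-11 01:00:00", "2017-11-11 02:00:00"]),
    "close": [110.0, 111.0],
})
ibm = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2017-11-11 00:58:00", "2017-11-11 01:01:00",
        "2017-11-11 01:57:00", "2017-11-11 02:04:00",
    ]),
    "close": [149.1, 149.2, 150.0, 150.1],
})

# Run the same merge once per direction and collect the matched IBM prices.
matches = {}
for direction in ("backward", "forward", "nearest"):
    merged = pd.merge_asof(visa, ibm, on="date_time",
                           suffixes=("_visa", "_ibm"), direction=direction)
    matches[direction] = merged["close_ibm"].tolist()

print(matches)
```

For the 01:00 Visa row, "backward" picks the 00:58 IBM row, "forward" picks 01:01, and "nearest" also picks 01:01 because it is only one minute away. For the 02:00 row, "nearest" agrees with "backward" instead, since 01:57 is closer than 02:04.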

### Selecting data with .query()

Now that you have learned quite a bit about combining data from different data sources, let's review a pandas method for selecting data from a table: the query() method. pandas provides many methods for selecting data, and query() is one of them.

```python
.query('SOME SELECTION STATEMENT')
```

* Accepts an input string
    - Input string used to determine what rows are returned
    - Input string similar to statement after **WHERE** clause in **SQL** statement
        - **Prior knowledge of SQL is not necessary**
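
To make the idea of an input string concrete, the sketch below compares query() with ordinary boolean indexing. The `stocks` table here is a hypothetical stand-in with invented prices.

```python
import pandas as pd

# Hypothetical prices, invented for illustration.
stocks = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-01", "2019-08-01",
                            "2019-09-01", "2019-10-01"]),
    "disney": [143.0, 137.3, 130.3, 129.9],
    "nike": [86.0, 84.5, 93.9, 89.5],
})

# The string plays the role of the condition after WHERE in SQL.
by_query = stocks.query("nike >= 90")

# The same selection written as a boolean mask.
by_mask = stocks[stocks["nike"] >= 90]

print(by_query)
```

Both selections return the same rows; query() simply lets the condition be written as a readable string that refers to column names directly.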


#### Querying on a single condition

We have the following table named stocks with the stock price of Disney and Nike on different days. Now imagine we would like to select the rows where Nike is equal to or above 90.

<img src='pictures\stocks.png' width=450 />

---

```python
stocks.query('nike >= 90')
```
The method returns all rows in stocks where Nike is greater than or equal to 90.

<img src='pictures\nike_90.png' width=450 />

---

#### Querying on multiple conditions, "and", "or"

```python
stocks.query('nike > 90 and disney < 140')

stocks.query('nike > 96 or disney < 98')
```
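
As a rough illustration of what these two queries return, here they are run on a hypothetical `stocks` table with invented prices:

```python
import pandas as pd

# Hypothetical prices, invented so that each query selects different rows.
stocks = pd.DataFrame({
    "disney": [143.0, 97.5, 130.3, 129.9],
    "nike": [86.0, 84.5, 93.9, 97.0],
})

# "and": a row must satisfy both conditions.
both = stocks.query("nike > 90 and disney < 140")

# "or": a row needs to satisfy at least one condition.
either = stocks.query("nike > 96 or disney < 98")

print(both)
print(either)
```

With this data, `both` keeps the last two rows, while `either` keeps the second row (Disney below 98) and the last row (Nike above 96).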

---

#### Using .query() to select text

```python
stocks_long.query('stock == "disney" or (stock == "nike" and close < 90)')
```
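
A small sketch of this text query, assuming a long-format table `stocks_long` with the columns `stock` and `close` (the rows below are invented):

```python
import pandas as pd

# Hypothetical long-format table: one row per stock per date.
stocks_long = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-01", "2019-07-01",
                            "2019-09-01", "2019-09-01"]),
    "stock": ["disney", "nike", "disney", "nike"],
    "close": [143.0, 86.0, 130.3, 93.9],
})

# Text values go in double quotes inside the single-quoted query string.
# Parentheses group the nike condition with its price filter.
selected = stocks_long.query(
    'stock == "disney" or (stock == "nike" and close < 90)'
)

print(selected)
```

All Disney rows are returned, along with only those Nike rows whose closing price is below 90.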

### Reshaping data with .melt()

This method will unpivot a table from wide to long format. This is often a much more computer-friendly format.

| Wide      | Long  |
| :------:  |:----:|
| <img src='pictures\wide.png' width=250 /> | <img src='pictures\long.png' width=250 /> |    

The melt method will allow us to unpivot, or change the format of, our dataset. In this image, we change the height and weight columns from their wide horizontal placement to a long vertical placement.
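
The height and weight case can be sketched as follows. The table `wide` and its values are hypothetical and only mirror the shape of the table in the image.

```python
import pandas as pd

# Hypothetical wide table: one column per measurement.
wide = pd.DataFrame({
    "first": ["John", "Mary"],
    "last": ["Doe", "Bo"],
    "height": [5.5, 6.0],
    "weight": [130, 150],
})

# Keep the name columns as identifiers and stack the rest vertically.
long = wide.melt(id_vars=["first", "last"])

print(long)
```

Each person now appears once per measurement: the former column names land in a `variable` column and their numbers in a `value` column.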

<img src='pictures\melt.png' width=500 />

#### Dataset in wide format

To demonstrate the melt method, let's start with this dataset of financial metrics of two popular social media companies. Notice that the years are horizontal. Let's change them so that they are vertically placed.

<div><p style="float: left;"><img src='pictures\social_fin.png' width=500></p>
<p>This table is called <code>social_fin</code>.</p>
</div>
<div style="clear: left;"></div>

#### Example of .melt()

```python
social_fin_tall = social_fin.melt(id_vars=['financial', 'company'])
social_fin_tall.head(10)
```

<img src='pictures\social_fin_tall.png' width=500 />

Here we call the **melt()** method on the table ```social_fin```. The first input argument to the method is **id_vars**. These are the columns to be used as **identifier variables**; we can also think of them as columns in our original dataset that we do not want to change. In our output, we print the first ten rows. Our years are now listed vertically, and our final column holds all of our values in one column instead of several. Again, this is a much more computer-friendly format than our original table. We unpivoted each of the separate columns 2016 through 2019, so the output has data for every year in our starting table, although only the first ten rows are shown here. In the next example, we will look at how to control which columns are unpivoted.

#### Melting with value_vars

```python
social_fin_tall = social_fin.melt(id_vars=['financial', 'company'],
                                    value_vars=['2018', '2017'])
```

<img src='pictures\value_vars.png' width=500 />

This time, let's use the argument **value_vars** with the **melt()** method. This argument will allow us to control which columns are unpivoted. Here, we unpivot only the **2018 and 2017** columns. Our output now **only** has data for the years 2018 and 2017. Additionally, the **order** of the value_vars list was kept: the output starts with 2018, then moves to 2017. Finally, notice that the column with the years is now named variable, and our values column is named value. We will adjust that in our next example.
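
A sketch of the subset and ordering behavior, assuming a stand-in `social_fin` table whose figures are invented and do not match the real dataset:

```python
import pandas as pd

# Stand-in for social_fin; all figures are invented for illustration.
social_fin = pd.DataFrame({
    "financial": ["total_revenue", "net_income"] * 2,
    "company": ["twitter", "twitter", "facebook", "facebook"],
    "2019": [4.0, 2.0, 70.0, 18.0],
    "2018": [3.0, 1.0, 55.0, 22.0],
    "2017": [2.0, -0.1, 40.0, 15.0],
    "2016": [2.5, -0.4, 27.0, 10.0],
})

# Only the listed year columns are unpivoted, in the order given.
tall = social_fin.melt(id_vars=["financial", "company"],
                       value_vars=["2018", "2017"])

print(tall)
```

The result has eight rows (four identifier combinations times two years), and the 2019 and 2016 columns are dropped entirely.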

#### Melting with column names

```python
social_fin_tall = social_fin.melt(id_vars=['financial', 'company'],
                                    value_vars=['2018', '2017'],
                                    var_name='year', value_name='dollars')
```

<img src='pictures\column_names.png' width=500 />

In this example, we have added some additional inputs to our **melt()** method. The **var_name** argument will allow us to set the name of the **variable column** (here, year) in the output. Similarly, the **value_name** argument will allow us to set the name of the **value column** in the output. The output is the same as before, except our variable and value columns are renamed year and dollars, respectively.
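
The effect of the two naming arguments can be seen by comparing column names with and without them. The two-row `social_fin` stand-in below uses invented figures, and `var_name` is passed as a plain string.

```python
import pandas as pd

# Two-row stand-in for social_fin; figures are invented.
social_fin = pd.DataFrame({
    "financial": ["total_revenue", "net_income"],
    "company": ["twitter", "twitter"],
    "2018": [3.0, 1.0],
    "2017": [2.0, -0.1],
})

# Default names for the two new columns: "variable" and "value".
default_names = social_fin.melt(id_vars=["financial", "company"],
                                value_vars=["2018", "2017"])

# Custom names supplied through var_name and value_name.
renamed = social_fin.melt(id_vars=["financial", "company"],
                          value_vars=["2018", "2017"],
                          var_name="year", value_name="dollars")

print(list(default_names.columns))
print(list(renamed.columns))
```

Only the column labels change; the rows and their order are identical in both results.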

We have seen how the melt() method is useful for reshaping our tables. Imagine a situation where you have merged many columns, making your table very wide. The melt() method can then be used to reshape that table into a more computer-friendly format.