In [18]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)
        
co2 = pd.read_csv('data/co2_mm_mlo.txt', header=None, skiprows=72,
                  sep=r'\s+',  # raw string: columns are separated by whitespace
                  names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'days'])
# Treat -99.99 as a missing-value code
co2_NA = co2.replace(-99.99, np.nan)

(ch:wrangling_structure)=
# Modify Structure

We don't want to struggle with the structure of a data frame every time we analyze our data. The wrangling process often reshapes the data frame in some way to make the analysis easier and more natural. These changes can be as simple as taking a subset of the rows or columns of the table, or they can involve changing its granularity or reshaping it. The techniques from {numref}`Chapter %s <ch:pandas>` and {numref}`Chapter %s <ch:sql>` are useful here. We describe them briefly and mention the methods in pandas you might use to achieve the new structure.

**Simplify Structure.** If a data frame has many features that we don't need in our analysis, we may want to drop these extraneous columns to make the data frame easier to handle. Or, if we want to focus on a particular period of time or geographic area, we may want to take a subset of the rows. {numref}`Section %s <ch:pandas_subsetting>` demonstrated the flexibility of pandas for taking subsets of the rows and columns of a data frame. In {numref}`Section %s <ch:wrangling_granularity>`, we read into our data frame a small set of features from the hundreds available in the DAWN survey because we were interested in understanding how the type of ER visit varies with the demographics of the patient. Later, in {numref}`Section %s <ch:eda_example>`, we restrict our analysis of home sale prices to one year and a few cities in an effort to limit the impact of inflation and to better study the effect of location on sale price.
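As a minimal sketch of simplifying structure, the following cell keeps a few columns and the rows for one year and two cities. The `sales` table, its column names, and its values are all made up for illustration; they are not the home sales data analyzed later in the book.

```python
import pandas as pd

# Hypothetical sales records; every value here is invented for illustration
sales = pd.DataFrame({
    'city': ['Berkeley', 'Oakland', 'Richmond', 'Berkeley', 'Oakland'],
    'year': [2004, 2004, 2004, 2005, 2005],
    'price': [650_000, 480_000, 390_000, 700_000, 510_000],
    'bedrooms': [3, 2, 3, 4, 2],
    'agent_id': [17, 42, 8, 17, 23],
})

# Columns we need for the analysis; the rest are dropped
keep_cols = ['city', 'year', 'price']

# Boolean mask for the rows in scope: one year and a few cities
in_scope = (sales['year'] == 2004) & sales['city'].isin(['Berkeley', 'Oakland'])

# .loc subsets the rows and the columns in a single step
sales_small = sales.loc[in_scope, keep_cols]
sales_small
```

Subsetting rows and columns together with `.loc` leaves the original `sales` data frame untouched, so we can return to the full table if the analysis later calls for another feature.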

**Adjust Granularity.** In {numref}`Section %s <ch:wrangling_co2>`, we provided an example where we aggregated monthly averages to a yearly average in order to visualize annual trends. In {numref}`Section %s <ch:wrangling_restaurants>`, we provide another example, where we aggregate violation-level data to the inspection level so that it can be combined with the inspection scores. In both of these examples, we adjust the data frame to a coarser granularity by grouping together records and then aggregating values. Recall that for the CO2 measurements, we grouped the monthly values from the same year and then averaged the CO2 measurements:

In [19]:
co2_NA.groupby('Yr')['Avg'].mean()

Yr
1958    315.33
1959    315.97
1960    316.91
         ...  
2017    406.55
2018    408.52
2019    412.27
Name: Avg, Length: 62, dtype: float64

Other common aggregations are: the size of the group; the sum, minimum, and maximum value in a group; and the first or last value in a group. The details of adjusting granularity of pandas data frames can be found in {numref}`Section %s <ch:pandas_aggregating>`, including how to group by multiple column values.
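To make these aggregations concrete, here is a small sketch on a made-up table of monthly readings. The `readings` data frame, its column names, and its values are hypothetical. The first group-by computes several aggregations at once, and the second groups by two columns.

```python
import pandas as pd

# Hypothetical monthly readings at two sites; values are invented
readings = pd.DataFrame({
    'site': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'yr':   [2000, 2000, 2001, 2001, 2000, 2000, 2001, 2001],
    'mo':   [1, 2, 1, 2, 1, 2, 1, 2],
    'avg':  [10.0, 12.0, 11.0, 15.0, 20.0, 22.0, 21.0, 25.0],
})

# One row per year, with one column per aggregation of 'avg'
by_year = (readings
           .groupby('yr')['avg']
           .agg(['size', 'sum', 'min', 'max', 'first', 'last']))

# One value per (site, year) pair: group by multiple columns
by_site_year = readings.groupby(['site', 'yr'])['avg'].mean()
```

Here `by_year` has a row for each year, and `by_site_year` is a series indexed by the pairs of site and year, so its granularity is finer than yearly but coarser than the original monthly records.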

At times a dataset might have mixed granularity, where records are at different levels of detail. A common case is data provided by government agencies, where records at the county and state levels are included in the same file. When this happens, we usually want to split the data frame into two, one at the county level and the other at the state level. This makes county-level and state-level analyses much easier, and sometimes makes them feasible at all.
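One way to split such a table is sketched below. We assume, hypothetically, that the file carries a `level` column that marks each record as a state or county row; real files often signal the level differently (for example, with a missing county name), so the filter would need to be adapted. All names and values are invented.

```python
import pandas as pd

# Hypothetical file that mixes state-level and county-level records
mixed = pd.DataFrame({
    'level':  ['state', 'county', 'county', 'state', 'county'],
    'state':  ['CA', 'CA', 'CA', 'NV', 'NV'],
    'county': [None, 'Alameda', 'Marin', None, 'Washoe'],
    'cases':  [30, 20, 10, 5, 5],
})

is_state = mixed['level'] == 'state'

# State-level table: the county column carries no information, so drop it
states = (mixed[is_state]
          .drop(columns=['level', 'county'])
          .reset_index(drop=True))

# County-level table: keep the state name to identify each county
counties = (mixed[~is_state]
            .drop(columns='level')
            .reset_index(drop=True))
```

After the split, each data frame has a single granularity: every row of `states` is a state, and every row of `counties` is a county.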

**Reshape.** Data are sometimes shared as pivot tables (see {numref}`Section %s <ch:pandas_aggregating>`). These **"wide"** tables have data values as column names, and are often difficult to use in analysis. We may need to reshape them into a **"long"** form. {numref}`Figure %s <Wide_Long>` gives a heuristic of what is meant by wide and long forms.

```{figure} figures/wide_long.png
---
name: Wide_Long
---

Place Holder for Wide - Long graphic. 
```

:::{note}

**Note**: The long form is often referred to as **tidy data**. [REFERENCE]

:::

To illustrate, we reshape two years of the CO2 data into a wide data frame that has the shape of a pivot table, with a column for each month and a row for each year.

In [20]:
co2_pivot = pd.pivot_table(
    co2[10:34],     # Rows for 1959 and 1960
    index='Yr',     # Column to turn into new index
    columns='Mo',   # Column to turn into new columns
    values='Avg',   # Column to aggregate for values
    aggfunc='sum')  # Aggregation function

co2_wide = co2_pivot.reset_index()

co2_wide

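| Mo | Yr | 1 | 2 | 3 | ... | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1959 | 315.62 | 316.38 | 316.71 | ... | 313.84 | 313.26 | 314.8 | 315.58 |
| 1 | 1960 | 316.43 | 316.97 | 317.58 | ... | 314.16 | 313.83 | 315.0 | 316.19 |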


The column headings are months, and the cell values in the grid are the CO2 monthly averages. We can turn this data frame back into a "long" data frame, where the column names become a feature, called `month`, and the values in the grid are reorganized into a second feature, called `average`.

In [21]:
co2_long = co2_wide.melt(id_vars=['Yr'], var_name='month', value_name='average')
co2_long

| | Yr | month | average |
|---|---|---|---|
| 0 | 1959 | 1 | 315.62 |
| 1 | 1960 | 1 | 316.43 |
| 2 | 1959 | 2 | 316.38 |
| ... | ... | ... | ... |
| 21 | 1960 | 11 | 315.00 |
| 22 | 1959 | 12 | 315.58 |
| 23 | 1960 | 12 | 316.19 |


Notice that the data have been returned to their original long shape, with one row per month, although the rows are not in their original order. This reshaping enables us, for example, to examine changes over time.
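If chronological order matters, we can sort the melted rows. The sketch below is self-contained: it rebuilds a three-month excerpt of the wide table from the values displayed above, melts it, sorts by year and month, and then pivots back to check that the round trip loses no values. The variable names are ours and are chosen only for this example.

```python
import pandas as pd

# Three-month excerpt of the wide CO2 table, typed in from the output above
wide = pd.DataFrame({
    'Yr': [1959, 1960],
    1: [315.62, 316.43],
    2: [316.38, 316.97],
    3: [316.71, 317.58],
})

# Wide to long: month labels become values in a 'month' column
long = wide.melt(id_vars=['Yr'], var_name='month', value_name='average')

# Make sure months are integers so they sort numerically
long['month'] = long['month'].astype(int)

# Sort by year, then month, to put the rows in chronological order
long_sorted = long.sort_values(['Yr', 'month']).reset_index(drop=True)

# Long back to wide, to confirm that no values were lost
wide_again = (long_sorted
              .pivot(index='Yr', columns='month', values='average')
              .reset_index())
```

Because each pair of year and month appears exactly once, `pivot` needs no aggregation function here; `pivot_table` is the better choice when a pair can repeat.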