# Reshaping Data


In [None]:
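import pandas as pd

# Read only the columns we need, rename the measurement column,
# parse the dates, and derive a Fahrenheit column from the Celsius one
long_df = pd.read_csv(
    '/content/long_data.csv',
    usecols=['date', 'datatype', 'value']
).rename(
    columns={'value': 'temp_C'}
).assign(
    date=lambda x: pd.to_datetime(x.date),
    temp_F=lambda x: (x.temp_C * 9/5) + 32
)
long_df.head()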

# Transposing

In [None]:
long_df.T
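
As a minimal sketch of what transposing does, the example below uses a small invented frame (`toy_df` is a hypothetical stand-in for `long_df`, and its values are made up) so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for long_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'datatype': ['TMAX', 'TMIN', 'TOBS'],
    'temp_C': [21.1, 8.9, 13.9],
})

# .T swaps the axes: column labels become the index and row labels become columns
transposed = toy_df.T

print(toy_df.shape)      # 3 rows, 2 columns
print(transposed.shape)  # 2 rows, 3 columns
print(transposed)
```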

# Pivoting

In [None]:
pivoted_df = long_df.pivot(
  index='date', columns='datatype', values='temp_C'
)
pivoted_df.head()

Now that the data is pivoted, we have wide-format data, which makes it easy to get summary statistics for each measurement type:

In [None]:
pivoted_df.describe()
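
To see the long-to-wide step in isolation, here is a small sketch on invented data (`toy_long` and its temperatures are assumptions, not values from the CSV). Each distinct `datatype` becomes its own column, so `describe()` reports one set of statistics per measurement type:

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, datatype) pair
toy_long = pd.DataFrame({
    'date': pd.to_datetime(['2018-10-01'] * 3 + ['2018-10-02'] * 3),
    'datatype': ['TMAX', 'TMIN', 'TOBS'] * 2,
    'temp_C': [21.1, 8.9, 13.9, 23.9, 13.9, 17.2],
})

# One row per date, one column per datatype
toy_wide = toy_long.pivot(index='date', columns='datatype', values='temp_C')
print(toy_wide)

# Summary statistics are now computed per measurement type
summary = toy_wide.describe()
print(summary)
```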

We can also provide multiple values to pivot on, which will result in a hierarchical index on the columns:

In [None]:
pivoted_df = long_df.pivot(
  index='date', columns='datatype', values=['temp_C', 'temp_F']
)
pivoted_df.head()

With the hierarchical index, if we want to select `TMIN` in Fahrenheit, we will first need to select `temp_F` and then `TMIN`:

In [None]:
pivoted_df['temp_F']['TMIN'].head()
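
The chained selection can also be written as a single lookup with a tuple of column labels. The sketch below shows both forms on invented data (`toy_long` and its values are assumptions):

```python
import pandas as pd

# Hypothetical long-format data with both Celsius and Fahrenheit columns
toy_long = pd.DataFrame({
    'date': pd.to_datetime(
        ['2018-10-01', '2018-10-01', '2018-10-02', '2018-10-02']
    ),
    'datatype': ['TMAX', 'TMIN', 'TMAX', 'TMIN'],
    'temp_C': [20.0, 10.0, 25.0, 15.0],
}).assign(temp_F=lambda x: x.temp_C * 9 / 5 + 32)

# Pivoting on two value columns gives two levels of column labels
toy_pivot = toy_long.pivot(
    index='date', columns='datatype', values=['temp_C', 'temp_F']
)

# Select the outer level first, then the inner level
chained = toy_pivot['temp_F']['TMIN']

# Equivalent single-step selections using a tuple of labels
direct = toy_pivot[('temp_F', 'TMIN')]
via_loc = toy_pivot.loc[:, ('temp_F', 'TMIN')]

print(chained)
```

The tuple form performs one lookup instead of two, which can be preferable when the result will be assigned to.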

We have been working with a single index throughout this chapter; however, we can create an index from any number of columns with `set_index()`. This gives us a `MultiIndex` where the outermost level corresponds to the first element in the list provided to `set_index()`:

In [None]:
multi_index_df = long_df.set_index(['date', 'datatype'])
multi_index_df.index

Notice there are now two levels in the index of the dataframe:

In [None]:
multi_index_df.head()

With the `MultiIndex`, we can no longer use `pivot()`. We must now use `unstack()`, which by default moves the innermost index level onto the columns:

In [None]:
unstacked_df = multi_index_df.unstack()
unstacked_df.head()
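
A small sketch of the same steps on invented data (`toy_long` is a hypothetical stand-in). It also shows the `level` argument of `unstack()`, which selects a different index level to move instead of the innermost one:

```python
import pandas as pd

# Hypothetical long-format data
toy_long = pd.DataFrame({
    'date': pd.to_datetime(
        ['2018-10-01', '2018-10-01', '2018-10-02', '2018-10-02']
    ),
    'datatype': ['TMAX', 'TMIN', 'TMAX', 'TMIN'],
    'temp_C': [20.0, 10.0, 25.0, 15.0],
})

# 'date' is the outermost level, 'datatype' the innermost
toy_multi = toy_long.set_index(['date', 'datatype'])

# Default: the innermost level (datatype) moves to the columns
toy_unstacked = toy_multi.unstack()
print(toy_unstacked)

# Naming a level moves that one instead; here the dates become columns
toy_by_date = toy_multi.unstack(level='date')
print(toy_by_date)
```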

The `unstack()` method also provides the `fill_value` parameter, which lets us fill in any `NaN` values that might arise from this restructuring of the data. Consider the case where we have data for the average temperature on October 1, 2018, but for no other date:

In [None]:
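# append() was removed in pandas 2.0, so use pd.concat(); a Timestamp keeps the dates sortable
extra_data = pd.concat([
  long_df,
  pd.DataFrame([{
    'datatype': 'TAVG', 'date': pd.Timestamp('2018-10-01'), 'temp_C': 10, 'temp_F': 50
  }])
]).set_index(['date', 'datatype']).sort_index()
extra_data.head(8)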

If we use `unstack()` in this case, we will have `NaN` for the `TAVG` columns every day but October 1, 2018:

In [None]:
extra_data.unstack().head()

To address this, we can pass in an appropriate `fill_value`. However, we are restricted to passing in a single value, not a strategy (like we saw with `fillna()`). So while -40 is definitely not the best value, we can use it to illustrate how this works, since this is the temperature at which Fahrenheit and Celsius are equal:

In [None]:
extra_data.unstack(fill_value=-40).head()
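
The sketch below reproduces the situation on invented data (`toy_long` and its values are assumptions): one date has a `TAVG` reading and the other does not. It also shows one possible way to apply a fill strategy after unstacking, here a forward fill, which is only an illustration and not necessarily the right choice for real temperature data:

```python
import pandas as pd

# Hypothetical data: TAVG exists only for the first date
toy_long = pd.DataFrame({
    'date': pd.to_datetime([
        '2018-10-01', '2018-10-01', '2018-10-01',
        '2018-10-02', '2018-10-02',
    ]),
    'datatype': ['TAVG', 'TMAX', 'TMIN', 'TMAX', 'TMIN'],
    'temp_C': [10.0, 20.0, 5.0, 22.0, 7.0],
}).set_index(['date', 'datatype']).sort_index()

# Missing (date, datatype) combinations become NaN
with_nan = toy_long.unstack()
print(with_nan)

# fill_value replaces those holes with a single constant
filled = toy_long.unstack(fill_value=-40)
print(filled)

# A strategy has to be applied as a separate step after unstacking
forward_filled = with_nan.ffill()
print(forward_filled)
```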

# Melting

Going from wide to long format.

## Setup

In [None]:
wide_df = pd.read_csv('/content/sample_data/wide_data.csv')
wide_df.head()

In order to go from wide format to long format, we use the `melt()` method. We have to specify:
- `id_vars`: the column(s) that uniquely identify each row (`date`, here)
- `value_vars`: the column(s) that contain the values (`TMAX`, `TMIN`, and `TOBS`, here)

Optionally, we can also provide:
- `value_name`: what to call the column that will contain all the values once melted
- `var_name`: what to call the column that will contain the names of the variables being measured


In [None]:
melted_df = wide_df.melt(
    id_vars='date',
    value_vars=['TMAX', 'TMIN', 'TOBS'],
    value_name='temp_C',
    var_name='measurement'
)
melted_df.head()

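Equivalently, we can call the top-level `pd.melt()` function and pass the dataframe as the first argument:

In [None]: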
pd.melt(
  wide_df,
  id_vars='date',
  value_vars=['TMAX', 'TMIN', 'TOBS'],
  value_name='temp_C',
  var_name='measurement'
).head()
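
As a self-contained sketch on invented data (`toy_wide` and its values are assumptions), melting turns each value column into rows, and pivoting the result recovers the wide layout:

```python
import pandas as pd

# Hypothetical wide-format data: one row per date, one column per measurement
toy_wide = pd.DataFrame({
    'date': ['2018-10-01', '2018-10-02'],
    'TMAX': [21.1, 23.9],
    'TMIN': [8.9, 13.9],
})

# Wide to long: one row per (date, measurement) pair
toy_melted = toy_wide.melt(
    id_vars='date',
    value_vars=['TMAX', 'TMIN'],
    value_name='temp_C',
    var_name='measurement'
)
print(toy_melted)

# Long back to wide: pivot() reverses the melt
round_trip = toy_melted.pivot(
    index='date', columns='measurement', values='temp_C'
)
print(round_trip)
```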

Another option is `stack()`, which will pivot the columns of the dataframe into the innermost level of a `MultiIndex`. To illustrate this, let's set our index to be the `date` column:

In [None]:
wide_df.set_index('date', inplace=True)
wide_df.head()

By running `stack()` now, we will create a second level in our index which will contain the column names of our dataframe (`TMAX`, `TMIN`, `TOBS`). This will leave us with a `Series` containing the values:

In [None]:
stacked_series = wide_df.stack()
stacked_series.head()

We can use the `to_frame()` method on our `Series` object to turn it into a `DataFrame`. Since the series doesn't have a name at the moment, we will pass in the name as an argument:

In [None]:
stacked_df = stacked_series.to_frame('values')
stacked_df.head()
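
The same sequence can be sketched on invented data (`toy_wide` and its values are assumptions) to make the resulting shapes easier to follow:

```python
import pandas as pd

# Hypothetical wide-format data indexed by date
toy_wide = pd.DataFrame({
    'date': ['2018-10-01', '2018-10-02'],
    'TMAX': [21.1, 23.9],
    'TMIN': [8.9, 13.9],
}).set_index('date')

# stack() moves the column labels into a new innermost index level,
# leaving a Series with one value per (date, column) pair
toy_stacked = toy_wide.stack()
print(toy_stacked)

# The Series has no name, so we supply the column name to to_frame()
toy_frame = toy_stacked.to_frame('values')
print(toy_frame)
```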

Once again, we have a `MultiIndex`:

In [None]:
stacked_df.index

We can inspect the names of the index levels:

In [None]:
stacked_df.index.names

The level created by `stack()` has no name, so we rename the levels to `date` and `datatype`:

In [None]:
stacked_df.index.rename(['date', 'datatype'], inplace=True)
stacked_df.index.names
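
As an alternative sketch that avoids modifying the index in place, `rename_axis()` returns a new dataframe with the index levels named. With both levels named, `reset_index()` turns them back into ordinary columns, which completes the trip to long format. The data and the names `toy_frame`, `toy_named`, and `toy_long` are invented for illustration:

```python
import pandas as pd

# Hypothetical stacked data, built the same way as in the cells above
toy_frame = pd.DataFrame({
    'date': ['2018-10-01', '2018-10-02'],
    'TMAX': [21.1, 23.9],
    'TMIN': [8.9, 13.9],
}).set_index('date').stack().to_frame('values')

# The level created by stack() is unnamed
print(list(toy_frame.index.names))

# rename_axis() returns a new object and leaves toy_frame unchanged
toy_named = toy_frame.rename_axis(['date', 'datatype'])
print(list(toy_named.index.names))

# Named levels become regular columns again
toy_long = toy_named.reset_index()
print(toy_long)
```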