# Data Wrangling: Join, Combine and Reshape
## DAT540 Introduction to Data Science
## University of Stavanger
### L11.2
#### Antorweep Chakravorty (antorweep.chakravorty@uis.no)

In [1]:
# Import the libraries needed to pull stock prices from Yahoo Finance
from pandas_datareader import data
import pandas as pd
import numpy as np
import datetime
from datetime import date

### *Concatenating Along an Axis*
- Concatenation is another data combination operation, also called binding or stacking, that joins pandas objects along an axis
- Concatenation raises the following questions:
  - If the objects are indexed differently on the other axes, should the result combine the distinct elements of those axes or use only the shared values?
  - Do the concatenated chunks of data need to be identifiable in the resulting object?
  - Does the concatenation axis contain data that needs to be preserved?
- The *pandas.concat* top-level method provides a mechanism to concatenate multiple pandas objects
- Called with a list of pandas objects, concat stacks them along axis=0 (the default)
- The *join* argument controls how the other axes are combined: *join='outer'* (the default) takes the union of their labels, *join='inner'* the intersection
- The *join_axes* argument accepted a list of indices to use for the other axes; it was removed in pandas 1.0, and reindexing the result is the replacement
- To identify the concatenated chunks in the result, the *keys* argument can be supplied with a list of values, which creates a hierarchical index on the concatenation axis
- When concatenating DataFrames with *axis=1*, the *keys* become the top level of the column headers
- If a dict of objects is passed to concat instead of a list, the dict's keys are used as the values for the *keys* argument
- The *names* argument allows us to name the created axis levels
- The *ignore_index=True* argument (default False) does not preserve the indices along the concatenation axis, producing a new range(total_length) index instead
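
The dict input, *names*, *ignore_index* and *join='inner'* options listed above can be sketched as follows. This is a minimal illustration, assuming two small Series; the level names `chunk` and `label` are hypothetical and chosen only for this example.

```python
import pandas as pd

s1 = pd.Series([-1, 1], index=['a', 'b'])
s2 = pd.Series([2, -1, 4], index=['c', 'a', 'f'])

# A dict's keys play the role of the keys argument;
# names labels the two levels of the resulting hierarchical index
# ('chunk' and 'label' are hypothetical names for this sketch)
labelled = pd.concat({'c1': s1, 'c2': s2}, names=['chunk', 'label'])
print(labelled)

# ignore_index=True discards the original labels and renumbers from 0
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered)

# join='inner' along axis=1 keeps only the labels shared by every input
shared = pd.concat([s1, s2], axis=1, join='inner')
print(shared)
```

Here only the label `a` appears in both Series, so `shared` has a single row.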

In [2]:
s1 = pd.Series([-1, 1], index=['a', 'b'])
s2 = pd.Series([2, -1, 4], index=['c', 'a', 'f'])
s3 = pd.Series([5, -3], index=['f', 'a'])

In [3]:
pd.concat([s1, s2, s3], sort=True)

a   -1
b    1
c    2
a   -1
f    4
f    5
a   -3
dtype: int64

In [4]:
# Concatenating along axis=1 aligns on the index; the default join='outer' keeps the union of labels
pd.concat([s1, s2, s3], axis=1, sort=True)

     0    1    2
a -1.0 -1.0 -3.0
b  1.0  NaN  NaN
c  NaN  2.0  NaN
f  NaN  4.0  5.0


In [9]:
# Passing join='outer' explicitly gives the same union of index labels as the default
pd.concat([s1, s2, s3], axis=1, sort=True, join='outer')

     0    1    2
a -1.0 -1.0 -3.0
b  1.0  NaN  NaN
c  NaN  2.0  NaN
f  NaN  4.0  5.0


In [10]:
# Creating a hierarchical index for each concatenated piece
pd.concat([s1, s2, s3], keys=['c1', 'c2', 'c3'])

c1  a   -1
    b    1
c2  c    2
    a   -1
    f    4
c3  f    5
    a   -3
dtype: int64

In [11]:
# Concatenating DataFrames
df1 = pd.DataFrame(np.arange(6).reshape(3,2), index=['a', 'b', 'c'], columns=['c1', 'c2'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2,2), index=['a', 'c'], columns=['c3', 'c4'])

In [12]:
pd.concat([df1, df2], axis=1, keys=['df1', 'df2'], sort=True, join="inner")

  df1    df2
   c1 c2  c3 c4
a   0  1   5  6
c   4  5   7  8


- *Combining Data with Overlap*
- When two datasets have indices that overlap in full or in part, we can take values from one wherever the other has null values
- In other words, we patch the missing data in one object with values from another
- Similar to ```numpy.where(pd.isnull(a), b, a)```
- Series has a *combine_first* instance method for performing equivalent operations
- With DataFrames, *combine_first* does the same thing column by column
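
As a minimal sketch of the column-by-column behaviour on DataFrames, the example below patches the holes in one frame from another and compares the result with the `numpy.where` form mentioned above. The frames `left` and `right` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two hypothetical frames with the same labels; left has missing values
left = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, np.nan]})
right = pd.DataFrame({'x': [10.0, 20.0, 30.0], 'y': [40.0, np.nan, 60.0]})

# Keep left's values and fall back to right's where left is null
patched = left.combine_first(right)
print(patched)

# The same result written with numpy.where (valid here because labels match)
manual = pd.DataFrame(np.where(pd.isnull(left), right, left),
                      index=left.index, columns=left.columns)
print(patched.equals(manual))
```

Unlike the `numpy.where` form, `combine_first` aligns on labels first, so it also works when the two objects are indexed differently.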

In [13]:
a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan])
b = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.])

In [14]:
# combine applies an arbitrary element-wise function; here we take a's value wherever b is null
b.combine(a, lambda bi, ai: ai if pd.isnull(bi) else bi)

0    0.0
1    2.5
2    2.0
3    3.5
4    4.5
5    5.0
dtype: float64

In [15]:
# The same result can also be achieved using combine_first
b.combine_first(a)

0    0.0
1    2.5
2    2.0
3    3.5
4    4.5
5    5.0
dtype: float64

- **Reshaping and Pivoting**
- Allows rearrangement of tabular data. Also referred to as *reshape* or *pivot* operations
- *Reshaping with Hierarchical Indexing*
- Hierarchical indexing provides a consistent way to rearrange data in a DataFrame
- It provides two primary actions through instance methods:
  - *stack*: rotates or pivots the columns into rows
  - *unstack*: pivots the rows into columns
- By default the innermost level is stacked or unstacked. A different level can be chosen by passing a level number or name
- Stacking filters out missing data by default (*dropna=True*), whereas unstacking may introduce missing data when not every value of the level appears in each subgroup
- When unstacking a DataFrame, the unstacked level becomes the lowest level of the result's columns
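
The points about missing data and level selection can be sketched as below. This is an illustrative example only; the Series, the level names `group` and `letter`, and the column names `left` and `right` are assumptions made for the sketch.

```python
import pandas as pd

# Two groups whose inner labels only partly overlap
s = pd.concat(
    [pd.Series([0, 1, 2], index=['a', 'b', 'c']),
     pd.Series([3, 4], index=['b', 'd'])],
    keys=['one', 'two'],
    names=['group', 'letter'],
)

# Unstacking the innermost level ('letter') introduces NaN
# wherever a letter is absent from a group
wide = s.unstack()
print(wide)

# A different level can be selected by name (or by number)
by_group = s.unstack('group')
print(by_group)

# With a DataFrame, the unstacked level becomes the lowest column level
frame = pd.DataFrame({'left': s, 'right': s * 10})
frame_wide = frame.unstack('group')
print(frame_wide.columns.names)
```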

In [16]:
#Let us have a simple DataFrame
sdf = pd.DataFrame(np.arange(9).reshape(3,3), index=list('abc'), columns=list('xyz'))
sdf

   x  y  z
a  0  1  2
b  3  4  5
c  6  7  8


In [17]:
# Stacking the data would pivot the columns into rows producing a Series
s1 = sdf.stack()
print('s1 type: ', type(s1))
s1

s1 type:  <class 'pandas.core.series.Series'>


a  x    0
   y    1
   z    2
b  x    3
   y    4
   z    5
c  x    6
   y    7
   z    8
dtype: int64

In [18]:
# Unstacking transforms the Series by pivoting rows into columns. By default, it uses the innermost level
s1.unstack()

   x  y  z
a  0  1  2
b  3  4  5
c  6  7  8


In [19]:
# This can be changed, by providing the level argument with the level that we want to pivot by
s1.unstack(0)

   a  b  c
x  0  3  6
y  1  4  7
z  2  5  8


- *Pivoting "Wide" to "Long" Format*
- It merges multiple columns into one, producing a DataFrame that is longer than the input
- The *pandas.melt* top-level method allows us to perform this transformation
- It accepts a DataFrame and, through the *id_vars* argument, a list of columns to keep as group identifiers; the remaining columns are melted into rows
- The *value_vars* argument accepts a list that selects which columns are represented as values
- *pandas.melt* can also be used without any group identifiers to create a long format without any identifier columns
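
A minimal sketch of these arguments on a small frame follows. The prices are invented for illustration and the column names `ticker` and `close` are hypothetical choices passed to *var_name* and *value_name*.

```python
import pandas as pd

# Hypothetical closing prices, invented purely for illustration
prices = pd.DataFrame({
    'Date': ['2021-01-04', '2021-01-05'],
    'aapl': [1.0, 2.0],
    'qqq': [3.0, 4.0],
    'bbby': [5.0, 6.0],
})

# id_vars are kept as identifiers; every other column is melted into rows
long_all = pd.melt(prices, id_vars=['Date'])
print(long_all)

# value_vars restricts which columns are melted;
# var_name and value_name rename the two generated columns
long_some = pd.melt(prices, id_vars=['Date'], value_vars=['aapl', 'qqq'],
                    var_name='ticker', value_name='close')
print(long_some)

# Without id_vars the result has only the variable and value columns
no_ids = pd.melt(prices, value_vars=['aapl', 'qqq'])
print(no_ids)
```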


In [20]:
# Date range for the price history
start_date = datetime.datetime(2021, 1, 1)
end_date = datetime.datetime(2021, 9, 30)
# Fetch the daily closing prices for aapl
aapl = data.DataReader(name='aapl', data_source='yahoo', start=start_date, end=end_date)[['Close']]
# Fetch the daily closing prices for qqq
qqq = data.DataReader(name='qqq', data_source='yahoo', start=start_date, end=end_date)[['Close']]
# Fetch the daily closing prices for bbby
bbby = data.DataReader(name='bbby', data_source='yahoo', start=start_date, end=end_date)[['Close']]

In [21]:
# Join the three Close columns on Date: Close_l is aapl, Close_r is qqq, Close is bbby
df = aapl.join(qqq, lsuffix='_l', rsuffix='_r').join(bbby).reset_index()
# Wide Format
df.head()

        Date     Close_l     Close_r      Close
0 2021-01-04  129.410004  309.309998  18.030001
1 2021-01-05  131.009995  311.859985  19.760000
2 2021-01-06  126.599998  307.540009  21.030001
3 2021-01-07  130.919998  314.980011  18.730000
4 2021-01-08  132.050003  319.029999  18.940001


In [22]:
# To Long Format
df_long = pd.melt(df, ['Date'])
df_long.sample(10)

          Date variable       value
391 2021-01-26    Close   36.869999
226 2021-03-01  Close_r  323.589996
472 2021-05-21    Close   24.150000
291 2021-06-02  Close_r  333.470001
280 2021-05-17  Close_r  324.410004
375 2021-09-30  Close_r  357.959991
488 2021-06-15    Close   29.459999
400 2021-02-08    Close   26.270000
1   2021-01-05  Close_l  131.009995
446 2021-04-15    Close   24.540001


- *Pivoting "Long" to "Wide" Format*
- Long or *stacked* format is a way of storing data in which each row of the table represents a single observation
- In such a format, one or more column values in a row might repeat in another row
- Alternatively, such data can be transformed into a Wide or *unstacked* format, where the DataFrame contains one column per distinct item value
- The *pivot* instance method is used to perform this operation
- Its first two arguments, *index* and *columns*, name the columns that supply the row and column labels of the pivoted table. The third argument, *values*, is optional and selects the value column that fills the DataFrame
- Omitting the third argument results in a DataFrame with hierarchical columns
- The *pivot* operation is equivalent to creating a hierarchical index using *set_index* followed by a call to *unstack*
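
The equivalence between *pivot* and *set_index* followed by *unstack* can be checked on a small frame. This is a sketch under stated assumptions: the data is invented for illustration, and the arguments are passed by keyword, which recent pandas releases require.

```python
import pandas as pd

# Hypothetical long-format data, invented purely for illustration
long_df = pd.DataFrame({
    'Date': ['2021-01-04', '2021-01-04', '2021-01-05', '2021-01-05'],
    'variable': ['aapl', 'qqq', 'aapl', 'qqq'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# One row per Date, one column per distinct item in 'variable'
wide = long_df.pivot(index='Date', columns='variable', values='value')
print(wide)

# The same table built from a hierarchical index and unstack
via_unstack = long_df.set_index(['Date', 'variable'])['value'].unstack('variable')
print(wide.equals(via_unstack))

# Omitting values yields hierarchical columns with 'value' as the top level
hier = long_df.pivot(index='Date', columns='variable')
print(hier.columns.nlevels)
```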

In [23]:
# Long (stacked) format back to wide format; recent pandas versions require keyword arguments here
df_long.pivot(index='Date', columns='variable').head()

                value
variable        Close     Close_l     Close_r
Date
2021-01-04  18.030001  129.410004  309.309998
2021-01-05  19.760000  131.009995  311.859985
2021-01-06  21.030001  126.599998  307.540009
2021-01-07  18.730000  130.919998  314.980011
2021-01-08  18.940001  132.050003  319.029999
