### Data Unification and Transformation
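### Data Concatenation
- **pd.concat( [ DataFrames or Series ], axis = 0/1 )** - concatenates DataFrames or Series along rows (axis = 0) or columns (axis = 1). By default the original indexes are kept, so duplicate index labels can appear; pass **ignore_index = True** to renumber the result
- **pd.concat( [ DataFrames or Series ], axis = 0/1, join = ' inner/outer ' )** - **join** controls how the other axis is aligned: ' outer ' keeps the union of labels, ' inner ' keeps only the intersection. By default **join = ' outer '**
- **df.append( [ DataFrames or Series ], ignore_index = True )** - appends rows instead of concatenating. **append** was deprecated in pandas 1.4 and removed in pandas 2.0, so use **pd.concat( )** in current versions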
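The effect of **join** is easiest to see on frames whose columns only partly overlap. The sketch below is a minimal illustration; the frames `left` and `right` and their column names are made up for the example.

```python
import pandas as pd

# Two small frames that share only column 'b' (illustrative data)
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
right = pd.DataFrame({'b': [5, 6], 'c': [7, 8]})

# join='outer' (the default): union of columns, gaps are filled with NaN
outer = pd.concat([left, right], ignore_index=True)

# join='inner': only the columns present in every frame are kept
inner = pd.concat([left, right], join='inner', ignore_index=True)

print(outer)
print('\n' + str(inner))
```

With `ignore_index=True` the rows are renumbered from 0, so no duplicate index labels appear in either result.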

### Data Merging
It unites several objects by matching column values or index labels
- **df.merge( df, on = [ ' column_names ' ] )** - merges two DataFrames on matching values in the given columns. The **on** parameter names the key columns; if it is omitted, all columns with the same name in both DataFrames are used (**pd.merge( )** and **df.merge( )** are equivalent)
- **pd.merge( df_1, df_2, left_index = True, right_index = True )** - merges on index labels instead of columns **(set both parameters to True to use both indexes)**

The **how** parameter is ' inner ' by default and can be:
- **how = ' inner '** - keeps only the key values that exist in both DataFrames
- **how = ' outer '** - keeps the key values from both tables and fills the cells without a match with NaN
- **how = ' right '** - keeps all key values from the right table; columns from the left table are NaN where there is no match
- **how = ' left '** - keeps all key values from the left table; columns from the right table are NaN where there is no match

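- **df.join( df, lsuffix = ' _left ', rsuffix = ' _right ', how = ' inner/outer/left/right ' )** - similar to merge, but joins on the index by default and uses **how = ' left '** by default; **lsuffix** and **rsuffix** are added to overlapping column names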
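A minimal sketch of **df.join( )** with suffixes, next to a roughly equivalent **merge** call. The frames and the column name `val` are assumptions made up for this example.

```python
import pandas as pd

# Both frames have a column called 'val', so suffixes are required
left = pd.DataFrame({'val': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'val': [10, 20]}, index=['a', 'b'])

# join aligns on the index and keeps every row of the left frame by default
joined = left.join(right, lsuffix='_left', rsuffix='_right')
print(joined)

# The same operation written with merge on both indexes
merged = left.merge(right, left_index=True, right_index=True,
                    how='left', suffixes=('_left', '_right'))
print('\n' + str(merged))
```

Row `c` has no partner in `right`, so its `val_right` cell is NaN in both results.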

### Data Rotation 
It is often convenient to turn the values of a column into separate columns
- **df.pivot( index = ' col_name ', columns = ' col_name ', values = ' col_name ' )** - rotates a DataFrame, turning the unique values of one column into columns
- **df.stack( )** - rotates columns into rows by moving the column labels into the index; a value can then be looked up with an index tuple
- **df.unstack( level = [ level_names ] )** - rotates index levels into columns. The index is expected to be **hierarchical**

When **stacking** or **unstacking**, the moved levels become the innermost levels. **Rotation reorganizes the data rather than changing it**, although unstacking can produce NaN for label combinations that have no value.
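
The following sketch shows how the moved level ends up innermost and where NaN can come from when unstacking. The index names `row` and `col` and the data are assumptions for illustration. `dropna()` is called explicitly after `stack()` because, as far as I know, whether `stack()` keeps NaN depends on the pandas version.

```python
import pandas as pd

# A Series with a two-level index; the pair ('two', 'b') is deliberately missing
idx = pd.MultiIndex.from_tuples(
    [('one', 'a'), ('one', 'b'), ('two', 'a')], names=['row', 'col'])
s = pd.Series([1, 2, 3], index=idx)

# unstack moves the innermost index level ('col') into the columns;
# the missing ('two', 'b') combination shows up as NaN
wide = s.unstack()
print(wide)

# stack moves the columns back into the innermost index level;
# dropna() removes the placeholder so the original three values remain
long = wide.stack().dropna()
print('\n' + str(long))
```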
### Data Melting 
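Melting reorganizes data from **wide format** into **long format**.
- **pd.melt( df, id_vars = [ ' col_names ' ], value_vars = [ ' col_names ' ] )** - melts a DataFrame: the **id_vars** columns are kept as identifiers and the **value_vars** columns are turned into **variable** / **value** pairs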
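
A short sketch of the wide-to-long round trip: melt a small table, then pivot it back. The column names `measure` and `value` passed to `var_name` and `value_name` are choices made for this example.

```python
import pandas as pd

wide = pd.DataFrame({'Name': ['Vlad', 'Max'],
                     'Height': [6.1, 6.0],
                     'Weight': [220, 185]})

# Wide -> long: one row per (Name, measure) pair
long = pd.melt(wide, id_vars=['Name'], value_vars=['Height', 'Weight'],
               var_name='measure', value_name='value')
print(long)

# Long -> wide: pivot restores one column per measure
back = long.pivot(index='Name', columns='measure', values='value')
print('\n' + str(back))
```

After the round trip `Name` is the index and all values are floats, because Height and Weight shared a single `value` column in the long format.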


In [1]:
import pandas as pd
import numpy as np

### Data Concatenation and Merging

In [14]:
# Series concatenation
s1 = pd.Series(np.arange(5))
s2 = pd.Series(np.arange(5,10))
s3 = pd.concat([s1,s2],ignore_index=True)
print(s3)

# DataFrames Concatenation (by rows) + provide hierarchical index
df_1 = pd.DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))
df_2 = pd.DataFrame(np.arange(12,24).reshape(3,4),columns=list('abcd'))
df_3 = pd.concat([df_1,df_2],keys=['df_1','df_2'])
print('\n'+str(df_3))

# DataFrames Concatenation (by columns)
df_1 = pd.DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))
df_2 = pd.DataFrame(np.arange(12,24).reshape(3,4),columns=list('abcd'))
df_3 = pd.concat([df_1,df_2],axis=1,ignore_index=True)
print('\n'+str(df_3))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32

         a   b   c   d
df_1 0   0   1   2   3
     1   4   5   6   7
     2   8   9  10  11
df_2 0  12  13  14  15
     1  16  17  18  19
     2  20  21  22  23

   0  1   2   3   4   5   6   7
0  0  1   2   3  12  13  14  15
1  4  5   6   7  16  17  18  19
2  8  9  10  11  20  21  22  23


In [19]:
# Row-wise concatenation of several DataFrames
# (df.append was removed in pandas 2.0; pd.concat gives the same result)
pd.concat([df_1, df_2, df_3])

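      a     b     c     d    0    1     2     3     4     5     6     7
0   0.0   1.0   2.0   3.0  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN
1   4.0   5.0   6.0   7.0  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN
2   8.0   9.0  10.0  11.0  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN
0  12.0  13.0  14.0  15.0  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN
1  16.0  17.0  18.0  19.0  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN
2  20.0  21.0  22.0  23.0  NaN  NaN   NaN   NaN   NaN   NaN   NaN   NaN
0   NaN   NaN   NaN   NaN  0.0  1.0   2.0   3.0  12.0  13.0  14.0  15.0
1   NaN   NaN   NaN   NaN  4.0  5.0   6.0   7.0  16.0  17.0  18.0  19.0
2   NaN   NaN   NaN   NaN  8.0  9.0  10.0  11.0  20.0  21.0  22.0  23.0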


In [27]:
# DataFrames Merging using one column 
customers = {'CustomerID':[10,11],
             'Name':['Vlad','Max'],
             'Address':['Addres for Vlad','Address for Max']}

customers = pd.DataFrame(customers)
print(customers)

orders = {'CustomerID':[10,11,10],
          'Stuff':['Sneakers','Socks','Hoodie']}

orders = pd.DataFrame(orders)
print('\n'+str(orders))

print('\n'+str(customers.merge(orders)))

   CustomerID  Name          Address
0          10  Vlad  Addres for Vlad
1          11   Max  Address for Max

   CustomerID     Stuff
0          10  Sneakers
1          11     Socks
2          10    Hoodie

   CustomerID  Name          Address     Stuff
0          10  Vlad  Addres for Vlad  Sneakers
1          10  Vlad  Addres for Vlad    Hoodie
2          11   Max  Address for Max     Socks


In [41]:
# Data Merging using several columns 

data_1 = pd.DataFrame({'key_1':list('abc'),
                       'key_2':list('xyz'),
                       'val_1':[0,1,2]})

data_2 = pd.DataFrame({'key_1':list('abc'),
                       'key_2':list('xaz'),
                       'val_2':[6,7,8]},index=[1,2,3]) 
print(str(data_1)+'\n'+str(data_2))

# Merge by default (Pandas merged the data by using columns: key_1  and key_2)
print('\n'+str(data_1.merge(data_2)))

# Let's merge using only one column
print('\n'+str(data_1.merge(data_2,on=['key_1'])))

# Merge on the index using pd.merge()
print('\n'+str(pd.merge(data_1,data_2,left_index=True,right_index=True)))

# Merge on the index using df.merge()
print('\n'+str(data_1.merge(data_2,left_index=True,right_index=True)))

  key_1 key_2  val_1
0     a     x      0
1     b     y      1
2     c     z      2
  key_1 key_2  val_2
1     a     x      6
2     b     a      7
3     c     z      8

  key_1 key_2  val_1  val_2
0     a     x      0      6
1     c     z      2      8

  key_1 key_2_x  val_1 key_2_y  val_2
0     a       x      0       x      6
1     b       y      1       a      7
2     c       z      2       z      8

  key_1_x key_2_x  val_1 key_1_y key_2_y  val_2
1       b       y      1       a       x      6
2       c       z      2       b       a      7

  key_1_x key_2_x  val_1 key_1_y key_2_y  val_2
1       b       y      1       a       x      6
2       c       z      2       b       a      7


In [55]:
# Inner 
print(data_1.merge(data_2))

# Outer (rows from both tables are kept)
print('\n'+str(data_1.merge(data_2,how='outer')))

# Right (all rows from the right table are kept; unmatched left values are NaN)
print('\n'+str(data_1.merge(data_2,how='right')))

# Left (all rows from the left table are kept; unmatched right values are NaN)
print('\n'+str(data_1.merge(data_2,how='left')))

  key_1 key_2  val_1  val_2
0     a     x      0      6
1     c     z      2      8

  key_1 key_2  val_1  val_2
0     a     x    0.0    6.0
1     b     y    1.0    NaN
2     c     z    2.0    8.0
3     b     a    NaN    7.0

  key_1 key_2  val_1  val_2
0     a     x    0.0      6
1     c     z    2.0      8
2     b     a    NaN      7

  key_1 key_2  val_1  val_2
0     a     x      0    6.0
1     b     y      1    NaN
2     c     z      2    8.0


### Data Rotation

In [43]:
path = 'D:/ML/Books/Learning_Pandas_russian_translation-1-master/Notebooks/Data/accel.csv'
data = pd.read_csv(path)
data.head()

   interval axis  reading
0         0    X      0.0
1         0    Y      0.5
2         0    Z      1.0
3         1    X      0.1
4         1    Y      0.4


In [91]:
# Transform column values into columns
data.pivot(index='interval',columns='axis',values='reading')

axis        X    Y    Z
interval
0         0.0  0.5  1.0
1         0.1  0.4  0.9
2         0.2  0.3  0.8
3         0.3  0.2  0.7


In [77]:
# Stacking
df = pd.DataFrame({'a':[1,2],'b':[3,4]},index=['one','two'])
print(df)

stacked = df.stack()
print('\n'+str(stacked))

# Values can be found by using a tuple
print('\n'+str(stacked[('one','a')]))

     a  b
one  1  3
two  2  4

one  a    1
     b    3
two  a    2
     b    4
dtype: int64

1


In [108]:
# Create data for 2 users 
user_1 = data.copy()
user_2 = data.copy()

# Add their names
user_1['who'] = 'Vlad'
user_2['who'] = 'Max'

# Scale Max's readings
user_2['reading'] = user_2['reading']*100

# Build a hierarchical index for easier data retrieval
user_data = pd.concat([user_1,user_2])
user_data = user_data.set_index(['who','interval','axis'])
print(user_data.head())

# Now we can extract the data for a given person and interval
user_data.xs('Vlad').xs(0)

                    reading
who  interval axis         
Vlad 0        X         0.0
              Y         0.5
              Z         1.0
     1        X         0.1
              Y         0.4


      reading
axis
X         0.0
Y         0.5
Z         1.0


In [117]:
# Unstacking 
unstacked_user_data = user_data.unstack(level=['who','axis'])
unstacked_user_data

         reading
who         Vlad              Max
axis           X    Y    Z      X     Y      Z
interval
0            0.0  0.5  1.0    0.0  50.0  100.0
1            0.1  0.4  0.9   10.0  40.0   90.0
2            0.2  0.3  0.8   20.0  30.0   80.0
3            0.3  0.2  0.7   30.0  20.0   70.0


In [119]:
# Stacking
stacked_user_data = unstacked_user_data.stack(level='who')
stacked_user_data

            reading
axis              X     Y      Z
interval who
0        Max    0.0  50.0  100.0
         Vlad   0.0   0.5    1.0
1        Max   10.0  40.0   90.0
         Vlad   0.1   0.4    0.9
2        Max   20.0  30.0   80.0
         Vlad   0.2   0.3    0.8
3        Max   30.0  20.0   70.0
         Vlad   0.3   0.2    0.7


### Data Melting

In [123]:
df = pd.DataFrame({'Name':['Vlad','Max'],
                   'Height':[6.1,6.0],
                   'Weight':[220,185]})
print(df)

melted = pd.melt(df,id_vars=['Name'],value_vars=['Height','Weight'])
print('\n'+str(melted))

   Name  Height  Weight
0  Vlad     6.1     220
1   Max     6.0     185

   Name variable  value
0  Vlad   Height    6.1
1   Max   Height    6.0
2  Vlad   Weight  220.0
3   Max   Weight  185.0


In [49]:
data.head()

   interval axis  reading
0         0    X      0.0
1         0    Y      0.5
2         0    Z      1.0
3         1    X      0.1
4         1    Y      0.4


In [53]:
data.pivot(index='axis',columns='interval',values='reading')

interval    0    1    2    3
axis
X         0.0  0.1  0.2  0.3
Y         0.5  0.4  0.3  0.2
Z         1.0  0.9  0.8  0.7
