### This is part of the `Python07` session, but it is also useful as a refresher on material already covered.

Let's dive into it:

#### Combining Data in pandas With `merge()`, `.join()`, and `concat()`.

In [6]:
import pandas as pd
import numpy as np
import os

In [8]:
work_dir = os.getcwd()
print(work_dir)

c:\Users\Admin\dss03Python\PY\Python07\climate_data


### **1.** pandas merge(): Combining Data on Common Columns or Indices

`merge()` is most useful when you want to combine rows that share data.

You can achieve both many-to-one and many-to-many joins with `merge()`. In a many-to-one join, one of your datasets will have many rows in the merge column that repeat the same values.

For example, the values could be 1, 1, 3, 5, and 5. At the same time, the merge column in the other dataset won’t have repeated values. Take 1, 3, and 5 as an example.
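
The many-to-one case described above can be sketched with two tiny DataFrames. The frames and column names (`key`, `left_val`, `right_val`) are made up for illustration and are not part of the climate data. The `validate` argument is optional; it only asks pandas to check that the right-hand keys really are unique.

```python
import pandas as pd

# Hypothetical toy data: "key" repeats on the left and is unique on the right.
left = pd.DataFrame({"key": [1, 1, 3, 5, 5], "left_val": list("abcde")})
right = pd.DataFrame({"key": [1, 3, 5], "right_val": ["x", "y", "z"]})

# validate="many_to_one" raises a MergeError if the right-hand keys repeat.
many_to_one = pd.merge(left, right, on="key", validate="many_to_one")
print(many_to_one)
```

Each left row picks up the single matching right row, so the result keeps all five left rows.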

When you use `merge()`, you’ll provide two required arguments:

1. The left DataFrame
2. The right DataFrame


After that, you can provide a number of optional arguments to define how your datasets are merged:

- `how` defines what kind of merge to make. It defaults to `'inner'`, but other possible options include `'outer', 'left', and 'right'`.

- `on` tells `merge()` which columns or indices, also called key columns or key indices, you want to join on.

This is optional. If it isn’t specified, and `left_index` and `right_index` (covered below) are `False`, then columns from the two DataFrames that share names will be used as join keys. If you use `on`, then the column or index that you specify must be present in both objects.
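
As a small sketch of these defaults, the two hypothetical frames below (`temps` and `rain`, invented for this example) share only the column name `station`. Calling `merge()` without `how` or `on` should then behave like an inner join on that shared column.

```python
import pandas as pd

# Hypothetical toy frames that share only the "station" column name.
temps = pd.DataFrame({"station": ["A", "B", "C"], "temp": [10, 20, 30]})
rain = pd.DataFrame({"station": ["B", "C", "D"], "rain": [1.5, 0.0, 2.5]})

# No `on` given: pandas joins on the shared column name, "station".
default_merge = pd.merge(temps, rain)

# Spelling out the defaults explicitly gives the same result.
explicit_merge = pd.merge(temps, rain, how="inner", on="station")

print(default_merge)
print(default_merge.equals(explicit_merge))
```

Only stations `B` and `C` appear in both frames, so only those two rows survive the inner join.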



#### 1a. How to Use merge()

### Outer Join

With outer joins, you’ll merge your data based on all the keys in the left object, the right object, or both. For keys that only exist in one object, unmatched columns in the other object will be filled in with `NaN`, which stands for Not a Number.

In [11]:
climate_temp = pd.read_csv('climate_temp.csv')
climate_precip = pd.read_csv('climate_precip.csv')

In [47]:
climate_temp.head()


Unnamed: 0,STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,DLY-CLDD-BASE45,DLY-CLDD-BASE50,DLY-CLDD-BASE55,DLY-CLDD-BASE57,...,DLY-CLDD-NORMAL,DLY-CLDD-BASE70,DLY-CLDD-BASE72,DLY-HTDD-BASE40,DLY-HTDD-BASE45,DLY-HTDD-BASE50,DLY-HTDD-BASE55,DLY-HTDD-BASE57,DLY-HTDD-BASE60,DLY-HTDD-NORMAL
0,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100101,6,2,-7777,-7777,...,0,0,0,-7777,1,2,6,7,10,15
1,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100102,6,2,1,-7777,...,0,0,0,-7777,1,2,6,7,10,15
2,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100103,6,2,1,-7777,...,0,0,0,-7777,1,2,5,7,10,15
3,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100104,6,2,1,-7777,...,0,0,0,-7777,1,2,5,7,10,15
4,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100105,6,2,1,-7777,...,0,0,0,-7777,-7777,2,5,7,10,15


In [18]:
climate_precip.head()


Unnamed: 0,STATION,STATION_NAME,DATE,DLY-PRCP-25PCTL,DLY-SNWD-25PCTL,DLY-SNOW-25PCTL,DLY-PRCP-50PCTL,DLY-SNWD-50PCTL,DLY-SNOW-50PCTL,DLY-PRCP-75PCTL,...,DLY-PRCP-PCTALL-GE100HI,DLY-SNWD-PCTALL-GE001WI,DLY-SNWD-PCTALL-GE010WI,DLY-SNWD-PCTALL-GE003WI,DLY-SNWD-PCTALL-GE005WI,DLY-SNOW-PCTALL-GE001TI,DLY-SNOW-PCTALL-GE010TI,DLY-SNOW-PCTALL-GE100TI,DLY-SNOW-PCTALL-GE030TI,DLY-SNOW-PCTALL-GE050TI
0,GHCND:USC00049099,TWENTYNINE PALMS CA US,20100101,-6.66,-666,-66.6,-6.66,-666,-66.6,-6.66,...,3,-9999,0,-9999,-9999,-9999,-9999,0,-9999,-9999
1,GHCND:USC00049099,TWENTYNINE PALMS CA US,20100102,-6.66,-666,-66.6,-6.66,-666,-66.6,-6.66,...,3,-9999,0,-9999,-9999,-9999,-9999,0,-9999,-9999
2,GHCND:USC00049099,TWENTYNINE PALMS CA US,20100103,-6.66,-666,-66.6,-6.66,-666,-66.6,-6.66,...,3,-9999,0,-9999,-9999,-9999,-9999,0,-9999,-9999
3,GHCND:USC00049099,TWENTYNINE PALMS CA US,20100104,-6.66,-9999,-9999.0,-6.66,-9999,-9999.0,-6.66,...,3,0,0,0,0,0,0,0,0,0
4,GHCND:USC00049099,TWENTYNINE PALMS CA US,20100105,-6.66,-9999,-9999.0,-6.66,-9999,-9999.0,-6.66,...,3,0,0,0,0,0,0,0,0,0


### Inner Join

In this example, I will use `merge()` with its default arguments, which will result in an inner join.

Remember that in an inner join, I will lose rows that don’t have a match in the other DataFrame’s key column.

With the two datasets loaded into DataFrame objects, I will select a small slice of the precipitation dataset and then use a plain `merge()` call to do an inner join. This will result in a smaller, more focused dataset:

In [48]:
precip_one_station = climate_precip.query("STATION == 'GHCND:USC00045721'")

# Or with boolean indexing, as I did previously
# (more Pythonic, I guess :)

precip_one_station = climate_precip[climate_precip['STATION'].isin(['GHCND:USC00045721'])]
precip_one_station


Unnamed: 0,STATION,STATION_NAME,DATE,DLY-PRCP-25PCTL,DLY-SNWD-25PCTL,DLY-SNOW-25PCTL,DLY-PRCP-50PCTL,DLY-SNWD-50PCTL,DLY-SNOW-50PCTL,DLY-PRCP-75PCTL,...,DLY-PRCP-PCTALL-GE100HI,DLY-SNWD-PCTALL-GE001WI,DLY-SNWD-PCTALL-GE010WI,DLY-SNWD-PCTALL-GE003WI,DLY-SNWD-PCTALL-GE005WI,DLY-SNOW-PCTALL-GE001TI,DLY-SNOW-PCTALL-GE010TI,DLY-SNOW-PCTALL-GE100TI,DLY-SNOW-PCTALL-GE030TI,DLY-SNOW-PCTALL-GE050TI
1460,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666,-66.6,0.16,-666,-66.6,0.44,...,11,4,0,3,3,9,6,0,-9999,-9999
1461,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100102,0.05,-666,-66.6,0.16,-666,-66.6,0.44,...,11,4,0,3,3,10,6,0,-9999,-9999
1462,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100103,0.05,-666,-66.6,0.16,-666,-66.6,0.45,...,11,4,0,3,3,10,6,0,-9999,-9999
1463,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100104,0.05,-666,-66.6,0.16,-666,-66.6,0.45,...,11,4,0,3,2,10,6,0,-9999,-9999
1464,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100105,0.05,-666,-66.6,0.17,-666,-66.6,0.46,...,11,4,0,3,2,10,6,0,-9999,-9999
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1820,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101227,0.04,-666,-66.6,0.15,-666,-66.6,0.44,...,12,4,0,3,3,9,6,0,2,2
1821,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101228,0.04,-666,-66.6,0.15,-666,-66.6,0.43,...,12,4,0,3,3,9,6,0,2,2
1822,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101229,0.04,-666,-66.6,0.15,-666,-66.6,0.43,...,11,4,0,3,3,9,6,0,2,2
1823,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101230,0.04,-666,-66.6,0.15,-666,-66.6,0.43,...,11,4,0,3,3,9,6,0,2,2


Let's merge it:

In [None]:
inner_join = pd.merge(precip_one_station, climate_temp)
inner_join

<bound method NDFrame.head of                STATION            STATION_NAME      DATE  DLY-PRCP-25PCTL  \
0    GHCND:USC00045721  MITCHELL CAVERNS CA US  20100101             0.04   
1    GHCND:USC00045721  MITCHELL CAVERNS CA US  20100102             0.05   
2    GHCND:USC00045721  MITCHELL CAVERNS CA US  20100103             0.05   
3    GHCND:USC00045721  MITCHELL CAVERNS CA US  20100104             0.05   
4    GHCND:USC00045721  MITCHELL CAVERNS CA US  20100105             0.05   
..                 ...                     ...       ...              ...   
360  GHCND:USC00045721  MITCHELL CAVERNS CA US  20101227             0.04   
361  GHCND:USC00045721  MITCHELL CAVERNS CA US  20101228             0.04   
362  GHCND:USC00045721  MITCHELL CAVERNS CA US  20101229             0.04   
363  GHCND:USC00045721  MITCHELL CAVERNS CA US  20101230             0.04   
364  GHCND:USC00045721  MITCHELL CAVERNS CA US  20101231             0.04   

     DLY-SNWD-25PCTL  DLY-SNOW-25PCTL  DLY-PR

`merge()` defaults to an inner join, and an inner join discards only those rows that don’t have a match.

The number of rows stayed the same as in `precip_one_station`, so every one of its rows found a match in `climate_temp`.

With `merge()`, you also have control over which column(s) to join on. Let’s say that I want to merge both entire datasets, but only on `STATION` and `DATE`, since the combination of the two will yield a unique value for each row. To do so, you can use the `on` parameter:

In [36]:
inner_merged_total = pd.merge(climate_temp, climate_precip, how='inner', on=['STATION', 'DATE'])
inner_merged_total

Unnamed: 0,STATION,STATION_NAME_x,ELEVATION,LATITUDE,LONGITUDE,DATE,DLY-CLDD-BASE45,DLY-CLDD-BASE50,DLY-CLDD-BASE55,DLY-CLDD-BASE57,...,DLY-PRCP-PCTALL-GE100HI,DLY-SNWD-PCTALL-GE001WI,DLY-SNWD-PCTALL-GE010WI,DLY-SNWD-PCTALL-GE003WI,DLY-SNWD-PCTALL-GE005WI,DLY-SNOW-PCTALL-GE001TI,DLY-SNOW-PCTALL-GE010TI,DLY-SNOW-PCTALL-GE100TI,DLY-SNOW-PCTALL-GE030TI,DLY-SNOW-PCTALL-GE050TI
0,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100101,6,2,-7777,-7777,...,3,-9999,0,-9999,-9999,-9999,-9999,0,-9999,-9999
1,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100102,6,2,1,-7777,...,3,-9999,0,-9999,-9999,-9999,-9999,0,-9999,-9999
2,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100103,6,2,1,-7777,...,3,-9999,0,-9999,-9999,-9999,-9999,0,-9999,-9999
3,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100104,6,2,1,-7777,...,3,0,0,0,0,0,0,0,0,0
4,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100105,6,2,1,-7777,...,3,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
123000,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101227,4,2,1,-7777,...,62,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999
123001,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101228,4,2,1,-7777,...,62,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999
123002,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101229,4,2,1,-7777,...,63,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999
123003,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101230,4,2,1,-7777,...,64,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999,-9999
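
The `STATION_NAME_x` column in the output above appears because `STATION_NAME` exists in both DataFrames but is not one of the join keys, so pandas adds its default `_x` and `_y` suffixes. Here is a minimal sketch of the same situation on made-up data. The station codes, values, and the suffix names `_temp` and `_precip` are assumptions chosen for readability, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two climate tables.
temp = pd.DataFrame({
    "STATION": ["S1", "S1", "S2"],
    "DATE": [20100101, 20100102, 20100101],
    "STATION_NAME": ["ONE", "ONE", "TWO"],
    "TEMP": [6, 7, 3],
})
precip = pd.DataFrame({
    "STATION": ["S1", "S1", "S2"],
    "DATE": [20100101, 20100102, 20100101],
    "STATION_NAME": ["ONE", "ONE", "TWO"],
    "PRCP": [0.04, 0.05, 0.00],
})

# Join on two keys; the overlapping non-key column gets a readable suffix.
merged = pd.merge(temp, precip, on=["STATION", "DATE"],
                  suffixes=("_temp", "_precip"))
print(merged.columns.tolist())
```

If the duplicated column carries the same information on both sides, adding it to the `on` list is another way to avoid the suffixed copies.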


### Outer Join

An outer join, also known as a full outer join, keeps all rows from both DataFrames in the new DataFrame.

- If a row doesn’t have a match in the other DataFrame based on the key column(s), then you won’t lose the row like you would with an inner join.

- Instead, the row will be in the merged DataFrame, with `NaN` values filled in where appropriate.
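
To see where the `NaN` values come from, here is a small sketch on invented data. The frames and column names are hypothetical. The optional `indicator=True` argument adds a `_merge` column that records whether each row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

# Hypothetical toy frames: key 1 exists only on the left, key 3 only on the right.
left = pd.DataFrame({"key": [1, 2], "left_val": ["a", "b"]})
right = pd.DataFrame({"key": [2, 3], "right_val": ["x", "y"]})

# indicator=True adds a "_merge" column describing each row's origin.
outer = pd.merge(left, right, how="outer", on="key", indicator=True)
print(outer)
```

Rows marked `left_only` or `right_only` are exactly the ones an inner join would have dropped.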

In [43]:
outer_merged = pd.merge(precip_one_station, climate_temp, how='outer', on=['STATION', 'DATE'])
outer_merged.dropna()

Unnamed: 0,STATION,STATION_NAME_x,DATE,DLY-PRCP-25PCTL,DLY-SNWD-25PCTL,DLY-SNOW-25PCTL,DLY-PRCP-50PCTL,DLY-SNWD-50PCTL,DLY-SNOW-50PCTL,DLY-PRCP-75PCTL,...,DLY-CLDD-NORMAL,DLY-CLDD-BASE70,DLY-CLDD-BASE72,DLY-HTDD-BASE40,DLY-HTDD-BASE45,DLY-HTDD-BASE50,DLY-HTDD-BASE55,DLY-HTDD-BASE57,DLY-HTDD-BASE60,DLY-HTDD-NORMAL
0,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666.0,-66.6,0.16,-666.0,-66.6,0.44,...,0,0,0,1,3,6,10,12,14,19
1,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100102,0.05,-666.0,-66.6,0.16,-666.0,-66.6,0.44,...,0,0,0,1,3,6,10,11,14,19
2,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100103,0.05,-666.0,-66.6,0.16,-666.0,-66.6,0.45,...,0,0,0,1,2,5,9,11,14,19
3,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100104,0.05,-666.0,-66.6,0.16,-666.0,-66.6,0.45,...,0,0,0,1,2,5,9,11,14,19
4,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100105,0.05,-666.0,-66.6,0.17,-666.0,-66.6,0.46,...,0,0,0,1,2,5,9,11,14,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101227,0.04,-666.0,-66.6,0.15,-666.0,-66.6,0.44,...,-7777,0,0,1,3,6,10,12,15,20
361,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101228,0.04,-666.0,-66.6,0.15,-666.0,-66.6,0.43,...,-7777,0,0,1,3,6,10,12,15,20
362,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101229,0.04,-666.0,-66.6,0.15,-666.0,-66.6,0.43,...,-7777,0,0,1,3,6,10,12,15,20
363,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101230,0.04,-666.0,-66.6,0.15,-666.0,-66.6,0.43,...,-7777,0,0,1,3,6,10,12,15,20


### Left Join 

Using a left outer join will leave your new merged DataFrame with all rows from the left DataFrame, while discarding rows from the right DataFrame that don’t have a match in the key column of the left DataFrame. 

Example in action:

In [50]:
left_join = pd.merge(climate_temp, precip_one_station, how='left', on='DATE')
# Every row of climate_temp is kept; the columns from precip_one_station
# are attached wherever the DATE values match.

left_join

Unnamed: 0,STATION_x,STATION_NAME_x,ELEVATION,LATITUDE,LONGITUDE,DATE,DLY-CLDD-BASE45,DLY-CLDD-BASE50,DLY-CLDD-BASE55,DLY-CLDD-BASE57,...,DLY-PRCP-PCTALL-GE100HI,DLY-SNWD-PCTALL-GE001WI,DLY-SNWD-PCTALL-GE010WI,DLY-SNWD-PCTALL-GE003WI,DLY-SNWD-PCTALL-GE005WI,DLY-SNOW-PCTALL-GE001TI,DLY-SNOW-PCTALL-GE010TI,DLY-SNOW-PCTALL-GE100TI,DLY-SNOW-PCTALL-GE030TI,DLY-SNOW-PCTALL-GE050TI
0,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100101,6,2,-7777,-7777,...,11,4,0,3,3,9,6,0,-9999,-9999
1,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100102,6,2,1,-7777,...,11,4,0,3,3,10,6,0,-9999,-9999
2,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100103,6,2,1,-7777,...,11,4,0,3,3,10,6,0,-9999,-9999
3,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100104,6,2,1,-7777,...,11,4,0,3,2,10,6,0,-9999,-9999
4,GHCND:USC00049099,TWENTYNINE PALMS CA US,602,34.12806,-116.03694,20100105,6,2,1,-7777,...,11,4,0,3,2,10,6,0,-9999,-9999
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127015,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101227,4,2,1,-7777,...,12,4,0,3,3,9,6,0,2,2
127016,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101228,4,2,1,-7777,...,12,4,0,3,3,9,6,0,2,2
127017,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101229,4,2,1,-7777,...,11,4,0,3,3,9,6,0,2,2
127018,GHCND:USC00046006,MOUNT WILSON CBS CA US,1740.4,34.2308,-118.0711,20101230,4,2,1,-7777,...,11,4,0,3,3,9,6,0,2,2


In [51]:
climate_temp.shape

(127020, 21)

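`left_join` has 127,020 rows, matching the number of rows in the left DataFrame, `climate_temp`. To see what changes when the DataFrames trade places, run the same code with `precip_one_station` on the left and `climate_temp` on the right. Because `DATE` alone is not unique in `climate_temp`, each row of `precip_one_station` is then repeated once for every `climate_temp` row that shares its date:

In [53]:
# same code as above, with the two DataFrames swapped:

left_join = pd.merge(precip_one_station, climate_temp, how='left', on='DATE')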
left_join


Unnamed: 0,STATION_x,STATION_NAME_x,DATE,DLY-PRCP-25PCTL,DLY-SNWD-25PCTL,DLY-SNOW-25PCTL,DLY-PRCP-50PCTL,DLY-SNWD-50PCTL,DLY-SNOW-50PCTL,DLY-PRCP-75PCTL,...,DLY-CLDD-NORMAL,DLY-CLDD-BASE70,DLY-CLDD-BASE72,DLY-HTDD-BASE40,DLY-HTDD-BASE45,DLY-HTDD-BASE50,DLY-HTDD-BASE55,DLY-HTDD-BASE57,DLY-HTDD-BASE60,DLY-HTDD-NORMAL
0,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666,-66.6,0.16,-666,-66.6,0.44,...,0,0,0,-7777,1,2,6,7,10,15
1,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666,-66.6,0.16,-666,-66.6,0.44,...,0,0,0,1,2,6,10,12,15,20
2,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666,-66.6,0.16,-666,-66.6,0.44,...,-7777,0,0,0,-7777,-7777,2,3,5,10
3,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666,-66.6,0.16,-666,-66.6,0.44,...,-7777,0,0,0,0,-7777,1,2,4,9
4,GHCND:USC00045721,MITCHELL CAVERNS CA US,20100101,0.04,-666,-66.6,0.16,-666,-66.6,0.44,...,0,0,0,1,3,6,10,12,14,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127015,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101231,0.04,-666,-66.6,0.15,-666,-66.6,0.44,...,0,0,0,1,4,8,12,14,17,22
127016,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101231,0.04,-666,-66.6,0.15,-666,-66.6,0.44,...,0,0,0,-7777,1,3,7,9,12,17
127017,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101231,0.04,-666,-66.6,0.15,-666,-66.6,0.44,...,0,0,0,18,23,28,33,35,38,43
127018,GHCND:USC00045721,MITCHELL CAVERNS CA US,20101231,0.04,-666,-66.6,0.15,-666,-66.6,0.44,...,0,0,0,-7777,2,5,10,11,14,19
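
The repeated `20100101` rows in the output above are what happens when the join key is unique on the left but repeated on the right. The sketch below reproduces the effect on made-up data. The frame names and values are hypothetical. It also shows the optional `validate` argument, which makes pandas raise a `MergeError` when the keys are not as unique as expected.

```python
import pandas as pd

# Hypothetical data: one station with unique dates, three stations sharing dates.
one_station = pd.DataFrame({"DATE": [20100101, 20100102], "PRCP": [0.04, 0.05]})
all_stations = pd.DataFrame({
    "STATION": ["S1", "S2", "S3", "S1", "S2", "S3"],
    "DATE": [20100101] * 3 + [20100102] * 3,
    "TEMP": [6, 3, 4, 6, 3, 5],
})

# Each left row is repeated once per matching right row: 2 dates x 3 stations.
left_join = pd.merge(one_station, all_stations, how="left", on="DATE")
print(left_join.shape)

# Asking pandas to check for a one-to-one relationship exposes the duplication.
try:
    pd.merge(one_station, all_stations, how="left", on="DATE",
             validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Joining on both `STATION` and `DATE`, as in the earlier inner join, is one way to keep a left join from multiplying rows like this.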


### Right Join

The right join, or right outer join, is the __mirror-image__ version of the left join. With this join, all rows from the right DataFrame will be retained, while rows in the left DataFrame without a match in the key column of the right DataFrame will be discarded.

In [55]:
right_join = pd.merge(climate_temp, precip_one_station, how='right', on='STATION_NAME')
right_join

Unnamed: 0,STATION_x,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE_x,DLY-CLDD-BASE45,DLY-CLDD-BASE50,DLY-CLDD-BASE55,DLY-CLDD-BASE57,...,DLY-PRCP-PCTALL-GE100HI,DLY-SNWD-PCTALL-GE001WI,DLY-SNWD-PCTALL-GE010WI,DLY-SNWD-PCTALL-GE003WI,DLY-SNWD-PCTALL-GE005WI,DLY-SNOW-PCTALL-GE001TI,DLY-SNOW-PCTALL-GE010TI,DLY-SNOW-PCTALL-GE100TI,DLY-SNOW-PCTALL-GE030TI,DLY-SNOW-PCTALL-GE050TI
0,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20100101,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,-9999,-9999
1,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20100102,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,-9999,-9999
2,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20100103,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,-9999,-9999
3,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20100104,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,-9999,-9999
4,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20100105,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,-9999,-9999
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133220,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20101227,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,2,2
133221,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20101228,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,2,2
133222,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20101229,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,2,2
133223,GHCND:USC00045721,MITCHELL CAVERNS CA US,1325.9,34.9436,-115.5469,20101230,3,1,-7777,-7777,...,11,4,0,3,3,9,6,0,2,2
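
The "mirror image" idea can be checked on a small example: a right join of `left` with `right` should contain the same rows as a left join with the arguments swapped, with only the column order differing. The frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames: key 1 is only on the left, key 4 only on the right.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Keep every row of `right`; unmatched left rows (key 1) are dropped.
right_join = pd.merge(left, right, how="right", on="key")

# The same rows, obtained by swapping the arguments and using how="left".
swapped_left_join = pd.merge(right, left, how="left", on="key")

print(right_join)
print(swapped_left_join)
```

Key `4` has no partner on the left, so its `left_val` is `NaN` in both results.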


## pandas `concat()`: Combining Data Across Rows or Columns

When you concatenate datasets, you can specify the axis along which you’ll concatenate.

### __pd.concat()__ parameters:

- `objs` takes any sequence, typically a list, of Series or DataFrame objects to be concatenated. You can also provide a dictionary.

- `axis` represents the axis that you’ll concatenate along. The default value is 0, which concatenates along the index, or row axis.

- `join` is similar to the `how` parameter in the other techniques, but it only accepts the values `inner` or `outer`. The default value is `outer`, which preserves data, while `inner` would eliminate data that doesn’t have a match in the other dataset.
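
A minimal sketch of how `join` changes the result when stacking rows. The two one-row frames are hypothetical and share only the `STATION` and `DATE` columns.

```python
import pandas as pd

# Hypothetical frames with two shared columns and one unique column each.
a = pd.DataFrame({"STATION": ["S1"], "DATE": [20100101], "TEMP": [6]})
b = pd.DataFrame({"STATION": ["S1"], "DATE": [20100101], "PRCP": [0.04]})

# Default: axis=0 and join="outer" keep every column, filling gaps with NaN.
outer_rows = pd.concat([a, b])

# join="inner" keeps only the columns present in both frames.
inner_rows = pd.concat([a, b], join="inner")

print(outer_rows.shape)  # (2, 4)
print(inner_rows.shape)  # (2, 2)
```

Note that along `axis=0` the `join` setting decides which columns survive, not which rows.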

#### Examples

First of all, __basic concatenation__ along the default axis, using the DataFrames that I've been working with throughout this tutorial:

In [85]:
double_precip = pd.concat([precip_one_station, precip_one_station], ignore_index=True)


# ignore_index=True resets the index so that it starts at 0 and increases by 1


0      False
365     True
dtype: bool
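
The effect of `ignore_index`, and of passing a dictionary instead of a list, can be sketched on a two-row frame. The frame and the dictionary labels `first` and `second` are invented for this example.

```python
import pandas as pd

# Hypothetical two-row frame, concatenated with itself in three ways.
df = pd.DataFrame({"PRCP": [0.04, 0.05]})

stacked = pd.concat([df, df])                        # original labels repeat
renumbered = pd.concat([df, df], ignore_index=True)  # fresh 0..n-1 index
labelled = pd.concat({"first": df, "second": df})    # dict keys become an outer index level

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(labelled.index.nlevels)

# The second copy of each row is flagged as a duplicate of the first.
print(renumbered.duplicated().tolist())
```

With a dictionary, the keys label which input each row came from, which can be handy when the index is otherwise ambiguous.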

If you concatenate along axis 0 (rows) but have labels in axis 1 (columns) that don’t match, then those columns will be added and filled in with `NaN` values. This results in an outer join:

In [86]:
outer_joined = pd.concat([climate_precip, climate_temp])
outer_joined.shape

(278130, 47)

With these two DataFrames, since you’re just concatenating along rows, very few columns have the same name. That means you’ll see a lot of columns with `NaN` values.

To drop the columns that are not shared, and the `NaN` values that come with them, I can perform an inner join instead.

In [88]:
inner_joined = pd.concat([climate_precip, climate_temp], join='inner')
inner_joined.shape


(278130, 3)

Previous shape: (278130, 47)

Current shape: (278130, 3) -> with `inner` join


Using the inner join, I am left with only those columns that the original DataFrames have in common: `STATION`, `STATION_NAME`, and `DATE`.

The new DataFrame keeps only the columns that appear in both DataFrames.

Using the __axis__ parameter:

In [None]:
outer_joined = pd.concat([climate_precip, climate_temp], axis=1, join='inner')
outer_joined.shape

# Now the DataFrame has the columns from both DataFrames. I did this by setting axis to 1, which means 'columns'; 0 is for 'rows'.

# I could also write it as axis='columns', but I prefer the numerical form.

(127020, 50)
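
When concatenating with `axis=1`, pandas lines the rows up by index label, and `join` then decides which labels to keep. A small sketch on invented data follows. The frames and their index values are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap.
temp = pd.DataFrame({"TEMP": [6, 7, 8]}, index=[0, 1, 2])
precip = pd.DataFrame({"PRCP": [0.04, 0.05]}, index=[1, 2])

# Outer (default): all index labels are kept, with NaN where a frame has no row.
side_by_side = pd.concat([temp, precip], axis=1)

# Inner: only index labels present in both frames are kept.
shared_rows = pd.concat([temp, precip], axis=1, join="inner")

print(side_by_side.shape)  # (3, 2)
print(shared_rows.shape)   # (2, 2)
```

This matches the shape above: the column count is the sum of both inputs, and the row count is limited to the index labels the two DataFrames share.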