# Combining Data for Analysis

Large datasets are often split across several files because smaller files are easier to store and share, or because new data arrives on a schedule, e.g. one file per day of time-series data for stocks.

We therefore need to be able to combine these datasets (before or after cleaning) so that we carry out our analysis on a single dataset.

### Concatenating Dataframes

Concatenation combines two or more dataframes into a single dataframe. The pandas `concat()` function takes a list of dataframes and stacks them one on top of the other.

By default, the combined dataframe keeps the original row index labels, e.g. you'll have two rows labelled `0`, two labelled `1`, etc. This causes problems when you want to select rows by index label, because a single label now matches several rows.

In [1]:
import pandas as pd
import numpy as np

df_tips1 = pd.read_csv('data2/tips1.csv')
df_tips2 = pd.read_csv('data2/tips2.csv')
df_tips_comb = pd.concat([df_tips1, df_tips2])
df_tips_comb.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
0,14.78,3.23,Male,No,Sun,Dinner,2
1,10.27,1.71,Male,No,Sun,Dinner,2
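
As a minimal sketch of why the repeated labels matter, the example below stacks two small made-up frames (stand-ins for the tips files, not the real data) and then selects by label. The frame names and values are assumptions for illustration only.

```python
import pandas as pd

# Two small stand-in frames (hypothetical values, not the tips files)
df_a = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})
df_b = pd.DataFrame({'total_bill': [14.78, 10.27], 'tip': [3.23, 1.71]})

df_stacked = pd.concat([df_a, df_b])

# The index now holds each label twice
print(df_stacked.index.tolist())    # [0, 1, 0, 1]
print(df_stacked.index.is_unique)   # False

# Label-based selection returns every row carrying that label
rows_labelled_0 = df_stacked.loc[0]
print(rows_labelled_0.shape)        # (2, 2): two rows, not one
```

Code that expects `.loc[0]` to return a single row would silently receive two here, which is the problem the next cells address.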


To get consecutive row index labels, pass the `ignore_index=True` parameter when concatenating the dataframes.

In [2]:
df_tips_comb2 = pd.concat([df_tips1, df_tips2], ignore_index=True)
df_tips_comb2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,14.78,3.23,Male,No,Sun,Dinner,2
4,10.27,1.71,Male,No,Sun,Dinner,2


Alternatively, call `reset_index()` with the `drop=True` parameter on the combined dataframe; without `drop=True`, the original duplicate labels are saved to an `index` column.

In [3]:
# replace the duplicate labels with a fresh 0..n-1 index and discard the old ones
df_tips_comb = df_tips_comb.reset_index(drop=True)
df_tips_comb.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,14.78,3.23,Male,No,Sun,Dinner,2
4,10.27,1.71,Male,No,Sun,Dinner,2
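
To make the effect of `drop` concrete, here is a small sketch on made-up frames (the names and values are assumptions, not the tips data) that compares `reset_index()` with and without `drop=True`.

```python
import pandas as pd

# Hypothetical stand-in frames
df_a = pd.DataFrame({'tip': [1.01, 1.66]})
df_b = pd.DataFrame({'tip': [3.23, 1.71]})
df_stacked = pd.concat([df_a, df_b])

# Without drop=True the old labels are kept as a new 'index' column
df_kept = df_stacked.reset_index()
print(df_kept.columns.tolist())     # ['index', 'tip']
print(df_kept['index'].tolist())    # [0, 1, 0, 1]

# With drop=True the old labels are discarded
df_dropped = df_stacked.reset_index(drop=True)
print(df_dropped.columns.tolist())  # ['tip']
print(df_dropped.index.tolist())    # [0, 1, 2, 3]
```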


### Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same `pd.concat()` function, but this time with the keyword argument `axis=1`. The default, `axis=0`, is for a row-wise concatenation.

In the first dataframe below, the status and country of a patient are contained in a single column, `status_country`. This column is then parsed into a new dataframe, `df_ebola_status_country`, which has separate columns for `status` and `country`.

In [4]:
import pandas as pd
import numpy as np

df_ebola = pd.read_csv('data2/ebola.csv')
df_ebola_melt = pd.melt(df_ebola, id_vars=['Date', 'Day'], var_name='status_country', value_name='counts')
df_ebola_melt.head()

Unnamed: 0,Date,Day,status_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [6]:
df_ebola_melt['str_split'] = df_ebola_melt['status_country'].str.split('_')
df_ebola_status_country = pd.DataFrame()
df_ebola_status_country['status'] = df_ebola_melt['str_split'].str.get(0)
df_ebola_status_country['country'] = df_ebola_melt['str_split'].str.get(1)
df_ebola_status_country.head()

Unnamed: 0,status,country
0,Cases,Guinea
1,Cases,Guinea
2,Cases,Guinea
3,Cases,Guinea
4,Cases,Guinea


In [7]:
# combine dataframes column wise
df_ebola_combined = pd.concat([df_ebola_melt, df_ebola_status_country], axis=1)
df_ebola_combined.head()

Unnamed: 0,Date,Day,status_country,counts,str_split,status,country
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]",Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]",Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,"[Cases, Guinea]",Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,"[Cases, Guinea]",Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,"[Cases, Guinea]",Cases,Guinea
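
A more compact route to the same result, offered as an alternative sketch rather than the approach used in the cells above, is to pass `expand=True` to `str.split()` so the pieces come back as columns directly. The frame below is a made-up stand-in for the melted ebola data, and its counts are placeholder values.

```python
import pandas as pd

# Hypothetical stand-in for the melted frame (placeholder counts)
df_melt = pd.DataFrame({
    'status_country': ['Cases_Guinea', 'Cases_Liberia', 'Deaths_Guinea'],
    'counts': [10.0, 20.0, 5.0],
})

# expand=True returns a DataFrame with one column per split piece
df_parts = df_melt['status_country'].str.split('_', expand=True)
df_parts.columns = ['status', 'country']

# Stitch the new columns onto the side; rows are paired by index label
df_wide = pd.concat([df_melt, df_parts], axis=1)
print(df_wide.columns.tolist())
# ['status_country', 'counts', 'status', 'country']
```

This avoids the intermediate `str_split` column of lists, at the cost of assuming every value splits into exactly two pieces.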


### Concatenating multiple dataframes using Globbing

When you have hundreds or even thousands of files to combine, use the `glob()` function from the **glob** module (part of the Python standard library) to find the files and a `for` loop to read them all.

Globbing is the process of finding files based on pattern matching of the file names.

We can use the `*` and `?` wildcards:

- `*` will match any number of characters, including none.
- `?` will match exactly **one** character.
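
The two wildcards can be tried out without touching the file system. The sketch below uses `fnmatchcase()` from the standard-library `fnmatch` module, which implements the same wildcard rules that `glob` relies on; the file names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the wildcards
names = ['tip.csv', 'tips1.csv', 'tips2.csv', 'tips10.csv', 'notes.txt']

# '*' matches any run of characters, including an empty one
star_matches = [n for n in names if fnmatchcase(n, 'tip*.csv')]
print(star_matches)      # ['tip.csv', 'tips1.csv', 'tips2.csv', 'tips10.csv']

# '?' matches exactly one character, so 'tips10.csv' is excluded
question_matches = [n for n in names if fnmatchcase(n, 'tips?.csv')]
print(question_matches)  # ['tips1.csv', 'tips2.csv']
```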

`glob()` returns a list of file names matching the pattern. We can then iterate over the list with a `for` loop, loading each file into a pandas dataframe, and finally concatenate the dataframes together.

By using **globbing** we can programmatically combine datasets that are broken up into many smaller parts. You'll find many datasets in the wild will be stored this way, particularly data that is collected incrementally.

In [9]:
import pandas as pd
import numpy as np
import glob

# search for all csv files
csv_files = glob.glob('data2/more/tip*.csv')

# load the files into a list of pandas dataframes
df_list = []
for file in csv_files:
    df = pd.read_csv(file)
    df_list.append(df)
    
# concat the dataframes into a single dataframe
df_combined = pd.concat(df_list)
df_combined.shape

(59, 7)

In [11]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 0 to 8
Data columns (total 7 columns):
total_bill    59 non-null float64
tip           59 non-null float64
sex           59 non-null object
smoker        59 non-null object
day           59 non-null object
time          59 non-null object
size          59 non-null int64
dtypes: float64(2), int64(1), object(4)
memory usage: 3.7+ KB
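
The index range `0 to 8` in the output above shows that the combined frame still carries repeated labels from the individual files. The sketch below runs the same glob-and-concatenate pattern end to end on throwaway files in a temporary directory, adding two optional refinements that are assumptions on my part rather than part of the original cell: sorting the file list (glob does not guarantee an order) and passing `ignore_index=True`. File names and values are made up.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three small hypothetical part files to a temporary directory
    for i in range(1, 4):
        part = pd.DataFrame({'total_bill': [10.0 * i, 10.0 * i + 1],
                             'size': [i, i]})
        part.to_csv(os.path.join(tmp_dir, f'tips{i}.csv'), index=False)

    # Sort the matches so the row order is reproducible
    csv_files = sorted(glob.glob(os.path.join(tmp_dir, 'tip*.csv')))

    # Read every file, then concatenate once with a fresh 0..n-1 index
    df_all = pd.concat([pd.read_csv(f) for f in csv_files],
                       ignore_index=True)

print(df_all.shape)            # (6, 2)
print(df_all.index.is_unique)  # True
```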


### Merge Data

Concatenating dataframes by columns, using `axis=1`, pairs rows by index label, so it gives wrong results if the rows in the different dataframes are NOT in the same order. In this case we can **merge** the dataframes using the pandas `merge()` function.

Similar to joins in SQL, pandas supports merging on common columns. The columns do not need to share a name, e.g. one dataframe has the column `state`, which holds state names, while in a 2nd dataframe the equivalent column is called `name`.

![Table 4](img/table-04.png)

By default, only state names that appear in both dataframes will be matched together to create a new dataframe.
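
To see why row order matters for column-wise concatenation but not for a merge, consider this sketch. Both frames are hypothetical and the population figures are made-up placeholders.

```python
import pandas as pd

# Hypothetical frames describing the same states in a different row order
df_pop = pd.DataFrame({'state': ['Texas', 'Ohio', 'Utah'],
                       'population': [30, 12, 3]})
df_cap = pd.DataFrame({'state': ['Utah', 'Texas', 'Ohio'],
                       'capital': ['Salt Lake City', 'Austin', 'Columbus']})

# concat pairs rows by index label (0, 1, 2), not by state name
df_side_by_side = pd.concat([df_pop, df_cap['capital']], axis=1)
print(df_side_by_side)   # Texas ends up next to Salt Lake City

# merge pairs rows by the values in the key column
df_merged = pd.merge(left=df_pop, right=df_cap, on='state')
print(df_merged)         # Texas is paired with Austin
```

Note that the concatenation raises no error; the mismatch only shows up when you inspect the result.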

**`merge()`**

- `left` and `right`: the two dataframes to merge.
- `on`: when the two columns you want to merge on have the same name, set `on='column_name'`; otherwise set `on=None`.
- `left_on` and `right_on`: when `on=None`, specify the names of the 'left' and 'right' columns you wish to merge on.
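
The parameters above can be sketched on two small frames. Everything here (frame names, columns, values) is hypothetical and only meant to contrast `on` with `left_on`/`right_on`.

```python
import pandas as pd

# Hypothetical frames; names and values are illustrative only
df_visits = pd.DataFrame({'state': ['Ohio', 'Texas', 'Iowa'],
                          'visits': [5, 8, 2]})
df_same = pd.DataFrame({'state': ['Texas', 'Ohio'],
                        'region': ['South', 'Midwest']})
df_diff = df_same.rename(columns={'state': 'name'})

# Key column has the same name in both frames: use on=
df_on = pd.merge(left=df_visits, right=df_same, on='state')

# Key columns are named differently: use left_on= and right_on=
df_left_right = pd.merge(left=df_visits, right=df_diff,
                         left_on='state', right_on='name')

print(df_on['state'].tolist())         # ['Ohio', 'Texas'] (Iowa has no match)
print(df_left_right.columns.tolist())  # ['state', 'visits', 'name', 'region']
```

When the key columns have different names, both are kept in the result, so you may want to drop one of them afterwards.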

There are three types of merges:

- **one-to-one** - the left and right dataframes have a one-to-one corresponding key (there are no duplicate values in either column that we are merging on).
- **many-to-one/one-to-many** - there are duplicate values in one of the keys (columns); the matching row from the other dataframe is repeated for each duplicate. Many-to-one and one-to-many are the same sort of merge; which one you get just depends on the order in which you specify the two dataframes.
- **many-to-many** - there are duplicate values in both keys. For each duplicated key value, every pairwise combination of the matching rows is created.

The syntax is exactly the same no matter which type of merge is being performed.

```py
df_merged = pd.merge(left=df_one, right=df_two, on=None, left_on='name', right_on='state')
```
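
The three merge types described above differ only in the data, which is easiest to see from the row counts. The sketch below uses hypothetical frames with a shared `key` column; the names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames; 'key' values are illustrative only
df_unique_a = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df_unique_b = pd.DataFrame({'key': ['a', 'b'], 'y': [10, 20]})
df_dup_a = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1, 2, 3]})
df_dup_b = pd.DataFrame({'key': ['a', 'a', 'b'], 'y': [10, 20, 30]})

# Same call each time; only the duplication in the keys changes
one_to_one = pd.merge(df_unique_a, df_unique_b, on='key')
many_to_one = pd.merge(df_dup_a, df_unique_b, on='key')
many_to_many = pd.merge(df_dup_a, df_dup_b, on='key')

print(len(one_to_one))    # 2
print(len(many_to_one))   # 3
print(len(many_to_many))  # 5: 'a' gives 2 x 2 rows, 'b' gives 1 x 1
```

Because a many-to-many merge can inflate the row count without warning, it can be worth passing the `validate` argument of `pd.merge()` (for example `validate='one_to_one'`), which raises an error if the keys do not have the expected relationship.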