# Combining data for analysis


## 1. Combining rows of data
The dataset we'll be working with here is a sample of NYC Uber pickup data. The original dataset records every Uber pickup by time, latitude, and longitude. For teaching purposes, we'll work with a very small portion of the actual data.

Concatenate DataFrames together such that the resulting DataFrame has the data for all three months.

In [20]:
import pandas as pd

uber = pd.read_csv("datasets/nyc_uber_data.csv")

In [102]:
uber = uber.drop(columns=["Unnamed: 0"])

In [103]:
uber.tail()

Unnamed: 0,Date/Time,Lat,Lon,Base
292,6/1/2014 6:27:00,40.7554,-73.9738,B02512
293,6/1/2014 6:35:00,40.7543,-73.9817,B02512
294,6/1/2014 6:37:00,40.7751,-73.9633,B02512
295,6/1/2014 6:46:00,40.6952,-74.1784,B02512
296,6/1/2014 6:51:00,40.7621,-73.9817,B02512


In [104]:
uber.shape

(297, 4)

In [176]:
# Split each month's data into a separate DataFrame for concatenation later
uber1 = uber[uber["Date/Time"].str[0] == "4"]
uber2 = uber[uber["Date/Time"].str[0] == "5"]
uber3 = uber[uber["Date/Time"].str[0] == "6"]

In [177]:
print(uber1.head())
print(uber2.head())
print(uber3.head())

          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
          Date/Time      Lat      Lon    Base
0  5/1/2014 0:02:00  40.7521 -73.9914  B02512
1  5/1/2014 0:06:00  40.6965 -73.9715  B02512
2  5/1/2014 0:15:00  40.7464 -73.9838  B02512
3  5/1/2014 0:17:00  40.7463 -74.0011  B02512
4  5/1/2014 0:17:00  40.7594 -73.9734  B02512
          Date/Time      Lat      Lon    Base
0  6/1/2014 0:00:00  40.7293 -73.9920  B02512
1  6/1/2014 0:01:00  40.7131 -74.0097  B02512
2  6/1/2014 0:04:00  40.3461 -74.6610  B02512
3  6/1/2014 0:04:00  40.7555 -73.9833  B02512
4  6/1/2014 0:07:00  40.6880 -74.1831  B02512
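Selecting months by the first character of `Date/Time` works here only because the sample covers April through June. As a hedged alternative, the sketch below parses the strings with `pd.to_datetime()` and compares on the month number. The `rides` frame and its October row are made up for illustration and are not part of the Uber sample.

```python
import pandas as pd

# Toy rows in the same "month/day/year time" layout as the Uber data.
# The October row is hypothetical; it shows why a first-character test is fragile.
rides = pd.DataFrame({
    "Date/Time": ["4/1/2014 0:11:00", "5/1/2014 0:02:00",
                  "6/1/2014 0:00:00", "10/1/2014 0:05:00"],
})

# Parse the strings once, then compare on the month number.
month = pd.to_datetime(rides["Date/Time"], format="%m/%d/%Y %H:%M:%S").dt.month

april = rides[month == 4]
october = rides[month == 10]

# The first-character test would read the October row as month "1".
first_char = rides["Date/Time"].str[0]
print(first_char.tolist())   # ['4', '5', '6', '1']
print(month.tolist())        # [4, 5, 6, 10]
```

For a three-month sample either approach gives the same split; the parsed version simply keeps working if two-digit months are added later.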


In [107]:
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

In [108]:
# Print the shape of row_concat
row_concat.shape

(297, 4)

In [179]:
print(row_concat.head())
print(uber1.head())

          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512


In [180]:
# Print the tail of row_concat
print(row_concat.tail())
print(uber3.tail())

            Date/Time      Lat      Lon    Base
292  6/1/2014 6:27:00  40.7554 -73.9738  B02512
293  6/1/2014 6:35:00  40.7543 -73.9817  B02512
294  6/1/2014 6:37:00  40.7751 -73.9633  B02512
295  6/1/2014 6:46:00  40.6952 -74.1784  B02512
296  6/1/2014 6:51:00  40.7621 -73.9817  B02512
           Date/Time      Lat      Lon    Base
94  6/1/2014 6:27:00  40.7554 -73.9738  B02512
95  6/1/2014 6:35:00  40.7543 -73.9817  B02512
96  6/1/2014 6:37:00  40.7751 -73.9633  B02512
97  6/1/2014 6:46:00  40.6952 -74.1784  B02512
98  6/1/2014 6:51:00  40.7621 -73.9817  B02512


Now, we have concatenated the three uber DataFrames! Notice that the head of `row_concat` is the same as the head of `uber1`, while the tail of `row_concat` is the same as the tail of `uber3`.
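By default, `pd.concat()` keeps the index labels of each input, so labels can repeat in the result when every piece starts counting from 0. The following is a minimal sketch with two made-up frames (`a` and `b` are illustrative names) showing how `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two tiny stand-ins for the monthly frames; each has its own 0-based index.
a = pd.DataFrame({"Lat": [40.7690, 40.7267]})
b = pd.DataFrame({"Lat": [40.7521, 40.6965]})

kept = pd.concat([a, b])                       # original labels are kept
fresh = pd.concat([a, b], ignore_index=True)   # labels are renumbered from 0

print(kept.index.tolist())    # [0, 1, 0, 1]
print(fresh.index.tolist())   # [0, 1, 2, 3]
```

Repeated labels are not an error, but they make label-based selection such as `kept.loc[0]` return more than one row.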

## 2. Combining columns of data
Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, we use the same `pd.concat()` function, but this time with the keyword argument `axis=1`. The default, `axis=0`, is for a row-wise concatenation.
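To make the difference between the two axes concrete, here is a small sketch using two made-up frames (`left` and `right` are illustrative names, not part of the exercise data). It compares the shapes produced by the default row-wise concatenation and by `axis=1`.

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2, 3]})
right = pd.DataFrame({"y": ["a", "b", "c"]})

# axis=0 (default): rows are appended, and columns missing from a piece are filled with NaN.
stacked = pd.concat([left, right])

# axis=1: columns are placed side by side, with rows matched on their index labels.
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)        # (6, 2)
print(side_by_side.shape)   # (3, 2)
```

Because `axis=1` matches rows by index label rather than by position, both frames should share the same index for the columns to line up.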

Let's return to the Ebola dataset. Melt its DataFrame so that the `status` and `country` of a patient are contained in a single column. Parse this column into a new DataFrame, `status_country`, where there are separate columns for `status` and `country`.

Explore the `ebola_melt` and `status_country` DataFrames. Concatenate them column-wise in order to obtain a final, clean DataFrame.

In [110]:
# Load ebola dataset into a dataframe
ebola = pd.read_csv("datasets/ebola.csv")
ebola.columns

Index(['Date', 'Day', 'Cases_Guinea', 'Cases_Liberia', 'Cases_SierraLeone',
       'Cases_Nigeria', 'Cases_Senegal', 'Cases_UnitedStates', 'Cases_Spain',
       'Cases_Mali', 'Deaths_Guinea', 'Deaths_Liberia', 'Deaths_SierraLeone',
       'Deaths_Nigeria', 'Deaths_Senegal', 'Deaths_UnitedStates',
       'Deaths_Spain', 'Deaths_Mali'],
      dtype='object')

In [111]:
# Melt all columns except date and day
ebola_melt = ebola.melt(id_vars = ["Date", "Day"], var_name="status_country", value_name="counts")
ebola_melt.tail()

Unnamed: 0,Date,Day,status_country,counts
1947,3/27/2014,5,Deaths_Mali,
1948,3/26/2014,4,Deaths_Mali,
1949,3/25/2014,3,Deaths_Mali,
1950,3/24/2014,2,Deaths_Mali,
1951,3/22/2014,0,Deaths_Mali,


In [112]:
# Create a new dataframe using status_country and extract the status/country
status = ebola_melt["status_country"].str.split("_").str.get(0)
country = ebola_melt["status_country"].str.split("_").str.get(1)
status_country = pd.DataFrame({"status": status, "country": country})

In [113]:
status_country.tail()

Unnamed: 0,status,country
1947,Deaths,Mali
1948,Deaths,Mali
1949,Deaths,Mali
1950,Deaths,Mali
1951,Deaths,Mali


In [114]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis=1)

In [115]:
# Print the shape of ebola_tidy
ebola_tidy.shape

(1952, 6)

In [116]:
# Print the head of ebola_tidy
ebola_tidy.head()

Unnamed: 0,Date,Day,status_country,counts,status,country
0,1/5/2015,289,Cases_Guinea,2776.0,Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,Cases,Guinea


The concatenated DataFrame has 6 columns, as it should. Notice how the `status` and `country` columns have been concatenated column-wise.
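As an alternative to splitting twice and picking each piece with `.str.get()`, `str.split()` accepts `expand=True`, which returns the pieces as a DataFrame directly. The sketch below uses a two-row stand-in for the melted data (the `melted` frame is made up for illustration) and then concatenates column-wise as before.

```python
import pandas as pd

# Two-row stand-in for the melted Ebola data.
melted = pd.DataFrame({
    "status_country": ["Cases_Guinea", "Deaths_Mali"],
    "counts": [2776.0, None],
})

# expand=True returns a DataFrame with one column per split piece.
parts = melted["status_country"].str.split("_", expand=True)
parts.columns = ["status", "country"]

# Stitch the new columns onto the side of the original frame.
tidy = pd.concat([melted, parts], axis=1)
print(tidy.columns.tolist())
# ['status_country', 'counts', 'status', 'country']
```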

## 3. Finding files that match a pattern
Use the `glob` module to find all csv files in the workspace. Programmatically load them into DataFrames.

The `glob` module has a function called `glob` that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if we know the file names consist of `part_`, a single digit, and `.csv`, we can write the pattern as `'part_?.csv'` (which would match `part_1.csv`, `part_2.csv`, `part_3.csv`, etc.).

Similarly, we can find all `.csv` files with `'*.csv'`, or all parts with `'part_*'`. The `?` wildcard represents any 1 character, and the `*` wildcard represents any number of characters.
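The wildcard rules can be checked on a handful of throwaway files. This sketch creates empty files with hypothetical names in a temporary directory and compares what each pattern matches; the `match` helper is introduced here only for illustration. The results are sorted because `glob.glob()` does not promise any particular order.

```python
import glob
import os
import tempfile

# Create a few empty, hypothetical files in a throwaway directory.
tmp_dir = tempfile.mkdtemp()
for name in ["part_1.csv", "part_2.csv", "part_10.csv", "notes.txt"]:
    open(os.path.join(tmp_dir, name), "w").close()

def match(pattern):
    """Return the sorted base names of files in tmp_dir matching pattern."""
    paths = glob.glob(os.path.join(tmp_dir, pattern))
    return sorted(os.path.basename(p) for p in paths)

single = match("part_?.csv")   # '?' matches exactly one character
any_len = match("part_*")      # '*' matches any number of characters
all_csv = match("*.csv")

print(single)    # ['part_1.csv', 'part_2.csv']
print(any_len)   # ['part_1.csv', 'part_10.csv', 'part_2.csv']
print(all_csv)   # ['part_1.csv', 'part_10.csv', 'part_2.csv']
```

Note that `part_10.csv` is missed by `'part_?.csv'` because `10` is two characters.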

In [117]:
uber1.to_csv(path_or_buf="datasets/uber_april.csv", index=False)
uber2.to_csv(path_or_buf="datasets/uber_may.csv", index=False)
uber3.to_csv(path_or_buf="datasets/uber_june.csv",index=False)

In [138]:
# Import necessary modules
import glob

# Write the pattern: pattern
pattern = 'datasets/uber*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

print(csv_files)

['datasets\\uber_april.csv', 'datasets\\uber_june.csv', 'datasets\\uber_may.csv']


In [123]:
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
csv2.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,6/1/2014 0:00:00,40.7293,-73.992,B02512
1,6/1/2014 0:01:00,40.7131,-74.0097,B02512
2,6/1/2014 0:04:00,40.3461,-74.661,B02512
3,6/1/2014 0:04:00,40.7555,-73.9833,B02512
4,6/1/2014 0:07:00,40.688,-74.1831,B02512


The next step is to iterate through this list of filenames, load each file into a DataFrame, and add it to a list of DataFrames that we can then concatenate.

## 4. Iterating and concatenating all matches
Now that we have a list of filenames, we can load all the files into a list of DataFrames that can then be concatenated.

Start with an empty list called `frames`, then use a for loop to:

- iterate through each of the filenames
- read each filename into a DataFrame, and then
- append it to the frames list.

Then concatenate this list of DataFrames using `pd.concat()`.

In [132]:
# Create an empty list: frames
frames = []

#  Iterate over csv_files
for csv in csv_files:

    #  Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frames
    frames.append(df)

# Check how many DataFrames are in frames
len(frames)

3

In [133]:
# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
uber.shape

(297, 4)

In [136]:
# Print the tail of uber
uber.tail()

Unnamed: 0,Date/Time,Lat,Lon,Base
94,5/1/2014 6:03:00,40.7753,-73.9901,B02512
95,5/1/2014 6:07:00,40.7204,-74.0085,B02512
96,5/1/2014 6:07:00,40.7175,-74.0022,B02512
97,5/1/2014 6:07:00,40.7321,-73.9885,B02512
98,5/1/2014 6:08:00,40.7273,-73.9922,B02512


Now we can combine datasets that are broken up into many smaller parts. Many datasets in the wild will be stored this way, particularly data that is collected incrementally.
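The loop above can also be written as a list comprehension. The sketch below is one possible variant under stated assumptions: it writes three small made-up files to a temporary directory, sorts the matches so the load order is reproducible, and adds a hypothetical `source` column so each row can be traced back to its file.

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small, made-up monthly files to a temporary directory.
tmp_dir = tempfile.mkdtemp()
for month, lat in [("april", 40.7690), ("may", 40.7521), ("june", 40.7293)]:
    pd.DataFrame({"Lat": [lat], "Base": ["B02512"]}).to_csv(
        os.path.join(tmp_dir, f"uber_{month}.csv"), index=False
    )

# Sort the matches so the load order does not depend on the file system.
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "uber_*.csv")))

# Read every file, tagging each row with the file it came from.
frames = [
    pd.read_csv(path).assign(source=os.path.basename(path))
    for path in csv_files
]

combined = pd.concat(frames, ignore_index=True)
print(combined["source"].tolist())
# ['uber_april.csv', 'uber_june.csv', 'uber_may.csv']
```

Sorting by name puts June before May, so data that must be in calendar order needs an explicit sort on the parsed date after loading.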

## 5. 1-to-1 data merge
Merging data combines disparate datasets into a single dataset to do more complex analysis.

Here, we'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition towards Antarctica. The dataset was taken from a sqlite database from the [Software Carpentry](http://swcarpentry.github.io/sql-novice-survey/) SQL lesson.

Perform a 1-to-1 merge of these DataFrames using the `'name'` column of `site` and the `'site'` column of `visited`.

In [164]:
site = pd.DataFrame({"name": ["DR-1", "DR-3", "MSK-4"],
                    "lat": [-49.85, -47.15, -48.87],
                    "long": [-128.57, -126.72, -123.40]})

site

Unnamed: 0,name,lat,long
0,DR-1,-49.85,-128.57
1,DR-3,-47.15,-126.72
2,MSK-4,-48.87,-123.4


In [145]:
visited = pd.DataFrame({"ident": [619, 734, 837],
                    "site": ["DR-1", "DR-3", "MSK-4"],
                    "dated": ["1927-02-08","1939-01-07" ,"1932-01-14" ]})
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,734,DR-3,1939-01-07
2,837,MSK-4,1932-01-14


In [147]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on="name", right_on="site")

# Print o2o
o2o

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
2,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


Notice the 1-to-1 correspondence between the `name` column of the `site` DataFrame and the `site` column of the `visited` DataFrame. This is what made the 1-to-1 merge possible.
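If a merge is expected to be 1-to-1, `pd.merge()` can check that for us through its `validate` argument. The following is a minimal sketch on trimmed-down copies of `site` and `visited` (some columns are omitted for brevity).

```python
import pandas as pd

site = pd.DataFrame({"name": ["DR-1", "DR-3", "MSK-4"],
                     "lat": [-49.85, -47.15, -48.87]})
visited = pd.DataFrame({"ident": [619, 734, 837],
                        "site": ["DR-1", "DR-3", "MSK-4"]})

# validate="one_to_one" raises pandas.errors.MergeError if either key repeats.
o2o = pd.merge(site, visited, left_on="name", right_on="site",
               validate="one_to_one")
print(o2o.shape)   # (3, 4)
```

Here both key columns are unique, so the check passes and the result has one row per site.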

## 6. Many-to-1 data merge
In a many-to-one (or one-to-many) merge, one of the values will be duplicated and recycled in the output. That is, one of the keys in the merge is not unique.

Note that this time, `visited` has multiple entries for the `site` column. 

The `pd.merge()` call is the same as in the 1-to-1 merge from the previous exercise, but the data and output will be different.

In [157]:
import numpy as np

In [171]:
visited = pd.DataFrame({"ident": [619, 622, 734, 735, 751, 752, 837, 844],
                    "site": ["DR-1", "DR-1", "DR-3", "DR-3", "DR-3", "DR-3", "MSK-4", "DR-1"],
                    "dated": ["1927-02-08", "1927-02-10", "1939-01-07", "1930-01-12", 
                              "1930-02-26", np.nan, "1932-01-14", "1932-03-22"]})
visited

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,735,DR-3,1930-01-12
4,751,DR-3,1930-02-26
5,752,DR-3,
6,837,MSK-4,1932-01-14
7,844,DR-1,1932-03-22


In [172]:
# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on="name", right_on="site")

# Print m2o
m2o

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-1,-49.85,-128.57,622,DR-1,1927-02-10
2,DR-1,-49.85,-128.57,844,DR-1,1932-03-22
3,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
4,DR-3,-47.15,-126.72,735,DR-3,1930-01-12
5,DR-3,-47.15,-126.72,751,DR-3,1930-02-26
6,DR-3,-47.15,-126.72,752,DR-3,
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


Notice how the site data is duplicated during this many-to-1 merge!
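The duplication can be seen on a smaller scale, together with the `validate` argument of `pd.merge()`, which lets us state which side is allowed to repeat. This is a sketch on trimmed-down, partly made-up copies of the frames, not the exercise data itself.

```python
import pandas as pd

site = pd.DataFrame({"name": ["DR-1", "DR-3"], "lat": [-49.85, -47.15]})
visited = pd.DataFrame({"ident": [619, 622, 734],
                        "site": ["DR-1", "DR-1", "DR-3"]})

# The left key is unique and the right key repeats: a one-to-many merge.
m2o = pd.merge(site, visited, left_on="name", right_on="site",
               validate="one_to_many")
print(m2o["lat"].tolist())   # [-49.85, -49.85, -47.15]

# Claiming the merge is one-to-one is rejected because "DR-1" repeats.
try:
    pd.merge(site, visited, left_on="name", right_on="site",
             validate="one_to_one")
    rejected = False
except pd.errors.MergeError:
    rejected = True
print(rejected)   # True
```

The latitude of `DR-1` appears twice in the output, once for each matching visit.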

## 7. Many-to-many data merge
The final merging scenario occurs when neither DataFrame has unique keys for the merge. In that case, every pairwise combination of matching rows is created for each duplicated key.

Look at the output and notice how pairwise combinations have been created to develop your intuition for many-to-many merges.

Work with the `site` and `visited` DataFrames from before, and a new `survey` DataFrame. Merge `site` and `visited`. Then, merge this merged DataFrame with `survey`.

In [173]:
survey = pd.DataFrame({"taken": [619, 619, 622, 622, 734, 734, 734, 735, 735, 735, 751, 751, 751, 752, 752, 752,
                                 752, 837, 837, 837, 844],
                       "person": ["dyer", "dyer", "dyer", "dyer", "pb", "lake", "pb", "pb", np.nan, np.nan,
                                 "pb", "pb", "lake", "lake", "lake", "lake", "roe", "lake", "lake", "lake", "roe"],
                       "quant": ["rad", "sal", "rad", "sal", "rad", "temp", "rad", "sal", "temp", "rad", "sal",
                                "rad", "temp", "rad", "sal", "sal", "temp", "rad", "sal", "temp", "temp"],
                       "reading": [1.34, 2.42, 2.32, 2.3, 3.4, 3.5, 3.5, 1.3, 4.5, 3.5, 3.6, 3.5, 3.5, 6.7, 4.6, 3.5,
                                  4.3,3.4, 5.6, 3.4, 2.5]})

survey

Unnamed: 0,taken,person,quant,reading
0,619,dyer,rad,1.34
1,619,dyer,sal,2.42
2,622,dyer,rad,2.32
3,622,dyer,sal,2.3
4,734,pb,rad,3.4
5,734,lake,temp,3.5
6,734,pb,rad,3.5
7,735,pb,sal,1.3
8,735,,temp,4.5
9,735,,rad,3.5


In [184]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on="name", right_on="site")

# Display m2m
m2m

Unnamed: 0,name,lat,long,ident,site,dated
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08
1,DR-1,-49.85,-128.57,622,DR-1,1927-02-10
2,DR-1,-49.85,-128.57,844,DR-1,1932-03-22
3,DR-3,-47.15,-126.72,734,DR-3,1939-01-07
4,DR-3,-47.15,-126.72,735,DR-3,1930-01-12
5,DR-3,-47.15,-126.72,751,DR-3,1930-02-26
6,DR-3,-47.15,-126.72,752,DR-3,
7,MSK-4,-48.87,-123.4,837,MSK-4,1932-01-14


In [185]:
# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on="ident", right_on="taken")

# Display m2m
m2m

Unnamed: 0,name,lat,long,ident,site,dated,taken,person,quant,reading
0,DR-1,-49.85,-128.57,619,DR-1,1927-02-08,619,dyer,rad,1.34
1,DR-1,-49.85,-128.57,619,DR-1,1927-02-08,619,dyer,sal,2.42
2,DR-1,-49.85,-128.57,622,DR-1,1927-02-10,622,dyer,rad,2.32
3,DR-1,-49.85,-128.57,622,DR-1,1927-02-10,622,dyer,sal,2.3
4,DR-1,-49.85,-128.57,844,DR-1,1932-03-22,844,roe,temp,2.5
5,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,pb,rad,3.4
6,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,lake,temp,3.5
7,DR-3,-47.15,-126.72,734,DR-3,1939-01-07,734,pb,rad,3.5
8,DR-3,-47.15,-126.72,735,DR-3,1930-01-12,735,pb,sal,1.3
9,DR-3,-47.15,-126.72,735,DR-3,1930-01-12,735,,temp,4.5


Notice how the keys are duplicated in this many-to-many merge!
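The pairwise-combination rule is easiest to verify by counting rows. In this sketch, the `left` and `right` frames are made up for illustration: key `"a"` appears twice on one side and three times on the other, so it should contribute 2 × 3 = 6 rows to the result.

```python
import pandas as pd

# Key "a" appears twice on the left and three times on the right.
left = pd.DataFrame({"key": ["a", "a", "b"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "a", "b"], "r": [10, 20, 30, 40]})

m2m = pd.merge(left, right, on="key")

# 2 x 3 pairings for "a" plus 1 x 1 for "b" gives 7 rows.
print(len(m2m))                              # 7
print(m2m["key"].value_counts().to_dict())   # {'a': 6, 'b': 1}
```

Row counts that grow this way are a useful signal that a key assumed to be unique actually repeats on both sides.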