In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Combining rows of data
The dataset you'll be working with here relates to NYC Uber data. The original dataset records every Uber pickup by time, latitude, and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

In [3]:
# Load uber1, uber2, and uber3 from their CSV files
uber1 = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/uber1.csv')
uber2 = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/uber2.csv')
uber3 = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/uber3.csv')

In [4]:
print(uber1.head())
print(uber2.head())
print(uber3.head())

       Date/Time      Lat      Lon    Base
0  4/1/2014 0:11  40.7690 -73.9549  B02512
1  4/1/2014 0:17  40.7267 -74.0345  B02512
2  4/1/2014 0:21  40.7316 -73.9873  B02512
3  4/1/2014 0:28  40.7588 -73.9776  B02512
4  4/1/2014 0:33  40.7594 -73.9722  B02512
       Date/Time      Lat      Lon    Base
0  5/1/2014 0:02  40.7521 -73.9914  B02512
1  5/1/2014 0:06  40.6965 -73.9715  B02512
2  5/1/2014 0:15  40.7464 -73.9838  B02512
3  5/1/2014 0:17  40.7463 -74.0011  B02512
4  5/1/2014 0:17  40.7594 -73.9734  B02512
       Date/Time      Lat      Lon    Base
0  6/1/2014 0:00  40.7293 -73.9920  B02512
1  6/1/2014 0:01  40.7131 -74.0097  B02512
2  6/1/2014 0:04  40.3461 -74.6610  B02512
3  6/1/2014 0:04  40.7555 -73.9833  B02512
4  6/1/2014 0:07  40.6880 -74.1831  B02512


In [5]:
# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())


(297, 4)
       Date/Time      Lat      Lon    Base
0  4/1/2014 0:11  40.7690 -73.9549  B02512
1  4/1/2014 0:17  40.7267 -74.0345  B02512
2  4/1/2014 0:21  40.7316 -73.9873  B02512
3  4/1/2014 0:28  40.7588 -73.9776  B02512
4  4/1/2014 0:33  40.7594 -73.9722  B02512
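
One detail worth knowing about row-wise concatenation: by default `pd.concat()` keeps each input's original row labels, so the combined index contains repeats. The sketch below illustrates this with two tiny hypothetical frames (`april` and `may` are made-up stand-ins, not the real Uber files) and shows how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Hypothetical miniature stand-ins for two monthly files (not the real data)
april = pd.DataFrame({'Date/Time': ['4/1/2014 0:11', '4/1/2014 0:17'],
                      'Base': ['B02512', 'B02512']})
may = pd.DataFrame({'Date/Time': ['5/1/2014 0:02', '5/1/2014 0:06'],
                    'Base': ['B02512', 'B02512']})

# Default behaviour keeps each frame's original index, so labels repeat
stacked = pd.concat([april, may])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([april, may], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Whether to renumber depends on what you do next: label-based selection such as `.loc[0]` returns several rows when the index has repeats.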


## Combining columns of data
Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.

In [6]:
# Load the ebola_melt and status_country CSV files as DataFrames
ebola_melt = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/ebola_melt3.csv')
status_country = pd.read_csv('D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/status_country.csv')

In [7]:
print(ebola_melt.head())
print(status_country.head())

         Date  Day status_country  counts
0    1/5/2015  289   Cases_Guinea  2776.0
1    1/4/2015  288   Cases_Guinea  2775.0
2    1/3/2015  287   Cases_Guinea  2769.0
3    1/2/2015  286   Cases_Guinea     NaN
4  12/31/2014  284   Cases_Guinea  2730.0
  status country
0  Cases  Guinea
1  Cases  Guinea
2  Cases  Guinea
3  Cases  Guinea
4  Cases  Guinea


In [8]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt,status_country], axis=1)

# Print the shape of ebola_tidy
ebola_tidy.shape

(1952, 6)

In [9]:
ebola_tidy.head()

         Date  Day status_country  counts status country
0    1/5/2015  289   Cases_Guinea  2776.0  Cases  Guinea
1    1/4/2015  288   Cases_Guinea  2775.0  Cases  Guinea
2    1/3/2015  287   Cases_Guinea  2769.0  Cases  Guinea
3    1/2/2015  286   Cases_Guinea     NaN  Cases  Guinea
4  12/31/2014  284   Cases_Guinea  2730.0  Cases  Guinea
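
Column-wise concatenation lines rows up by index label, not by position. The exercise works cleanly because both DataFrames share the same default index. The following minimal sketch uses two hypothetical frames (`left` and `right`, invented for illustration) to show what happens when the indexes differ.

```python
import pandas as pd

# Hypothetical frames whose indexes only partly overlap
left = pd.DataFrame({'counts': [10.0, 20.0, 30.0]}, index=[0, 1, 2])
right = pd.DataFrame({'country': ['Guinea', 'Liberia']}, index=[1, 2])

# axis=1 stitches columns side by side, matching rows on index labels;
# labels missing from one frame are filled with NaN (default join='outer')
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

If the rows are meant to match by position instead, resetting both indexes with `reset_index(drop=True)` before concatenating is one way to make that explicit.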


## Finding files that match a pattern
You're now going to practice using the **glob** module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

The glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the file names consist of `part_`, then a single character, then `.csv`, you can write the pattern as **`'part_?.csv'`** (which would match part_1.csv, part_2.csv, part_3.csv, etc.).

Similarly, you can find all .csv files with `'*.csv'`, or all parts with `'part_*'`. The ? wildcard matches exactly one character (any character, not only a digit), and the * wildcard matches any number of characters, including none.
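
To see the wildcards side by side, here is a small self-contained sketch. It creates a temporary directory with a few empty, hypothetical files (the names are invented for illustration) and runs several patterns against it. The results are sorted because `glob.glob()` does not guarantee any particular order.

```python
import glob
import os
import tempfile

def match(folder, pattern):
    """Return the sorted base names in folder that match a glob pattern."""
    hits = glob.glob(os.path.join(folder, pattern))
    return sorted(os.path.basename(p) for p in hits)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical empty files, created only for this demonstration
    for name in ['part_1.csv', 'part_2.csv', 'part_10.csv', 'part_a.csv', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()

    single = match(tmp, 'part_?.csv')      # ? = exactly one character
    digits = match(tmp, 'part_[0-9].csv')  # [0-9] = exactly one digit
    any_csv = match(tmp, '*.csv')          # * = any run of characters
    parts = match(tmp, 'part_*')

print(single)   # ['part_1.csv', 'part_2.csv', 'part_a.csv']
print(digits)   # ['part_1.csv', 'part_2.csv']
print(any_csv)
print(parts)
```

Note that `part_10.csv` is not matched by `'part_?.csv'`, because `?` stands for one character only.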

In [13]:
# Import necessary modules
import glob
import pandas as pd

# Write the pattern: pattern
pattern = 'D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/Uber/*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)



['D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/Uber\\uber1.csv', 'D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/Uber\\uber2.csv', 'D:/Springboard_DataCamp/data/Cleaning_Data_in_Python/Uber\\uber3.csv']


In [14]:
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())

       Date/Time      Lat      Lon    Base
0  5/1/2014 0:02  40.7521 -73.9914  B02512
1  5/1/2014 0:06  40.6965 -73.9715  B02512
2  5/1/2014 0:15  40.7464 -73.9838  B02512
3  5/1/2014 0:17  40.7463 -74.0011  B02512
4  5/1/2014 0:17  40.7594 -73.9734  B02512


## Iterating and concatenating all matches
Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

1. iterate through each of the filenames
2. read each filename into a DataFrame, and then
3. append it to the **frames** list.

You can then concatenate this list of DataFrames using pd.concat().

In [15]:
# Create an empty list: frames
frames = []

# Iterate over csv_files (csv_file avoids shadowing the standard csv module)
for csv_file in csv_files:

    # Read the file into a DataFrame: df
    df = pd.read_csv(csv_file)

    # Append df to frames
    frames.append(df)

In [16]:
# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())

(297, 4)
       Date/Time      Lat      Lon    Base
0  4/1/2014 0:11  40.7690 -73.9549  B02512
1  4/1/2014 0:17  40.7267 -74.0345  B02512
2  4/1/2014 0:21  40.7316 -73.9873  B02512
3  4/1/2014 0:28  40.7588 -73.9776  B02512
4  4/1/2014 0:33  40.7594 -73.9722  B02512
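
The loop-and-append steps can also be condensed into a list comprehension passed straight to `pd.concat()`. The sketch below is one possible variant, not the exercise's required solution: it writes two tiny hypothetical CSV files to a temporary directory so that it runs anywhere, sorts the matched paths for a predictable order, and uses `ignore_index=True` to renumber the rows.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two hypothetical one-row files (stand-ins for the monthly data)
    for i, month in enumerate(['4', '5'], start=1):
        sample = pd.DataFrame({'Date/Time': [f'{month}/1/2014 0:00'],
                               'Base': ['B02512']})
        sample.to_csv(os.path.join(tmp, f'uber{i}.csv'), index=False)

    # Sort the matches so the files are always read in the same order
    paths = sorted(glob.glob(os.path.join(tmp, '*.csv')))

    # Read every file and stack the results in a single expression
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(combined.shape)
print(combined)
```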


## 1-to-1 data merge
Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition towards Antarctica. The dataset was taken from a SQLite database from the Software Carpentry SQL lesson.

Your task is to perform a 1-to-1 merge of the site and visited DataFrames using the 'name' column of site and the 'site' column of visited.

In [17]:
# Create site and visited dataframes from scratch
site = pd.DataFrame({'name': ['DR-1','DR-3','MSK-4'], 'lat':[-49.85,-47.15,-48.87], 'long':[-128.57,-126.72,-123.40]})
visited = pd.DataFrame({'ident': [619,734,837], 'site': ['DR-1','DR-3','MSK-4'], 'dated': ['1927-02-08', '1939-01-07', '1932-01-14']})

In [19]:
site

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40


In [20]:
visited

   ident   site       dated
0    619   DR-1  1927-02-08
1    734   DR-3  1939-01-07
2    837  MSK-4  1932-01-14


In [18]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
o2o

    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
2  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
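
A merge is only 1-to-1 if the key values are unique on both sides, and pandas can check this for you through the `validate` argument of `pd.merge()`. The sketch below rebuilds the same `site` and `visited` frames, then adds one hypothetical extra visit (`visited_dup`, with an invented `ident` and date) to show the check failing.

```python
import pandas as pd

site = pd.DataFrame({'name': ['DR-1', 'DR-3', 'MSK-4'],
                     'lat': [-49.85, -47.15, -48.87],
                     'long': [-128.57, -126.72, -123.40]})
visited = pd.DataFrame({'ident': [619, 734, 837],
                        'site': ['DR-1', 'DR-3', 'MSK-4'],
                        'dated': ['1927-02-08', '1939-01-07', '1932-01-14']})

# validate='one_to_one' raises MergeError if either key column has duplicates
checked = pd.merge(site, visited, left_on='name', right_on='site',
                   validate='one_to_one')
print(checked.shape)

# Hypothetical second visit to DR-1 (made-up row) breaks the 1-to-1 assumption
extra = pd.DataFrame({'ident': [999], 'site': ['DR-1'], 'dated': ['1930-01-01']})
visited_dup = pd.concat([visited, extra], ignore_index=True)

try:
    pd.merge(site, visited_dup, left_on='name', right_on='site',
             validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True

print(raised)
```

Without `validate`, the second merge would succeed silently and return four rows, with the DR-1 site information repeated.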
