In [2]:
import pandas as pd
import os

os.chdir('c:\\datacamp\\data\\')
#Import Pittsburgh 2013 Weather CSV File
weather = pd.read_csv('pittsburgh2013.csv', index_col=['Date'], parse_dates = True)

#Separate the Mean Temp column and the Max Temp column
pmeantemp = weather[['Mean TemperatureF']]
pmaxtemp = weather[['Max TemperatureF']]

#Year Variable needed for later exercises and dictionary creation
year = ['Jan','Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

#Find the monthly temperature values (note: .max() returns the highest daily mean in each month, not the monthly average)
Janmean = pmeantemp.loc['2013-01'].max().iloc[0]
Febmean = pmeantemp.loc['2013-02'].max().iloc[0]
Marmean = pmeantemp.loc['2013-03'].max().iloc[0]
Aprmean = pmeantemp.loc['2013-04'].max().iloc[0]
Maymean = pmeantemp.loc['2013-05'].max().iloc[0]
Junmean = pmeantemp.loc['2013-06'].max().iloc[0]
Julmean = pmeantemp.loc['2013-07'].max().iloc[0]
Augmean = pmeantemp.loc['2013-08'].max().iloc[0]
Sepmean = pmeantemp.loc['2013-09'].max().iloc[0]
Octmean = pmeantemp.loc['2013-10'].max().iloc[0]
Novmean = pmeantemp.loc['2013-11'].max().iloc[0]
Decmean = pmeantemp.loc['2013-12'].max().iloc[0]

#Create a  dictionary and convert to dataframe of monthly mean temperatures
mmeandict = {'Month':year,
'Mean TemperatureF':[Janmean, Febmean, Marmean, Aprmean, Maymean, Junmean, Julmean, Augmean, Sepmean, Octmean, Novmean, Decmean]}
weather_mean = pd.DataFrame(mmeandict).set_index('Month')

#Find the max temperature by quarter
Q1max = pmaxtemp.loc['2013-01-01':'2013-03-31'].max().iloc[0]
Q2max = pmaxtemp.loc['2013-04-01':'2013-06-30'].max().iloc[0]
Q3max = pmaxtemp.loc['2013-07-01':'2013-09-30'].max().iloc[0]
Q4max = pmaxtemp.loc['2013-10-01':'2013-12-31'].max().iloc[0]

#Create a  dictionary and convert to dataframe of quarterly max temperatures
qmaxdict = {'Month':['Jan','Apr','Jul','Oct'],'Max TemperatureF':[Q1max,Q2max,Q3max,Q4max]}
weather_max = pd.DataFrame(qmaxdict).set_index('Month')

# Merging DataFrames with Pandas

## Chapter 2 - Concatenating Data

### Appending and Concatenating Series

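Series and DataFrames can be combined with the .append() method or with the pandas function concat(). Note that .append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the .append() examples in this chapter require an older pandas version; pd.concat() works in all versions.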

#### .append()

The .append() method will stack rows of Series and DataFrames. s1.append(s2) will stack rows from the Series s2 below the rows in the Series s1. 

#### concat()

The pandas function concat() accepts a list or other sequence of Series or DataFrames, e.g. pd.concat([s1, s2, s3]), and can stack them row-wise or column-wise depending on the options provided. When stacking multiple Series, concat() is equivalent to chaining multiple .append() calls:

results1 = pd.concat([s1, s2, s3])
results2 = s1.append(s2).append(s3)
# results1 and results2 are equal element-wise: (results1 == results2).all() is True
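
As a quick check of that equivalence, here is a minimal sketch using three small made-up Series (s1, s2, s3 and their values are assumptions for illustration). Because .append() is not available in pandas 2.0 and later, the chained form is written as nested pd.concat() calls, which stack the rows in the same order.

```python
import pandas as pd

# Three small, made-up Series with default integer indexes
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd', 'e'])
s3 = pd.Series(['f'])

# Stack all three in one call
results1 = pd.concat([s1, s2, s3])

# Stack pairwise, mimicking s1.append(s2).append(s3)
results2 = pd.concat([pd.concat([s1, s2]), s3])

# Same values and the same (repeated) index labels
print(results1.equals(results2))
print(list(results1.index))
```

Both forms keep the original labels of each input, which is why the index restarts at 0 for every piece.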

#### Series of US States

By default, the four Series of US regions below are indexed with integers starting at 0. Using .append() on the northeast Series to append the south Series produces a new Series in which the first 9 rows are the items from northeast and the remaining rows are from south. Notice that .append() stacks rows without adjusting the index values.

In [3]:
import pandas as pd
northeast = pd.Series(['CT', 'ME', 'MA', 'NH', 'RI', 'VT', 'NJ', 'NY', 'PA'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA', 'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR', 'LA', 'OK', 'TX'])
midwest = pd.Series(['IL', 'IN', 'MN', 'MO', 'NE', 'ND', 'SD', 'IA', 'KS', 'MI', 'OH', 'WI'])
west = pd.Series(['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT', 'WY', 'AK', 'CA', 'HI', 'OR', 'WA'])

east = northeast.append(south)
print(east)

0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
0     DE
1     FL
2     GA
3     MD
4     NC
5     SC
6     VA
7     DC
8     WV
9     AL
10    KY
11    MS
12    TN
13    AR
14    LA
15    OK
16    TX
dtype: object


#### The Appended Index

Notice that the index of the east Series has duplicate values, which means that when .loc[] is used on, say, index 3, two values are returned: one from the northeast Series and one from the south Series.

In [3]:
print(east.index)

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  0,  1,  2,  3,  4,  5,  6,  7,
             8,  9, 10, 11, 12, 13, 14, 15, 16],
           dtype='int64')


In [4]:
print(east.loc[3])

3    NH
3    MD
dtype: object
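
Duplicate labels like these can be detected programmatically. The sketch below uses shortened versions of the two region Series (the truncation is an assumption to keep the example small) and shows three checks: Index.is_unique, Index.duplicated(), and the verify_integrity option of pd.concat(), which raises an error instead of silently repeating labels.

```python
import pandas as pd

# Shortened stand-ins for the northeast and south Series
northeast = pd.Series(['CT', 'ME', 'MA', 'NH'])
south = pd.Series(['DE', 'FL', 'GA', 'MD', 'NC'])

east = pd.concat([northeast, south])

# The stacked index repeats 0..3, so it is not unique
print(east.index.is_unique)

# Labels that occur more than once
dupes = east.index[east.index.duplicated()].tolist()
print(dupes)

# verify_integrity=True makes concat raise on overlapping labels
try:
    pd.concat([northeast, south], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print('concat refused:', err)
```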


#### Using .reset_index()

Having unique index values is important. The .reset_index() method with the option drop=True discards the old index of each Series and creates a new index for the combined Series. The new index is a RangeIndex with entries 0 to 25.

In [5]:
new_east = northeast.append(south).reset_index(drop=True)
print(new_east.head(11))

0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
9     DE
10    FL
dtype: object


In [6]:
print(new_east.index)

RangeIndex(start=0, stop=26, step=1)


#### Using concat()

The function concat() can construct an equivalent Series from a list of Series (it accepts DataFrames as well). The resulting index, as before, contains repeated values.

In [7]:
east = pd.concat([northeast, south])
print(east.head(11))

0    CT
1    ME
2    MA
3    NH
4    RI
5    VT
6    NJ
7    NY
8    PA
0    DE
1    FL
dtype: object


In [8]:
print(east.index)

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  0,  1,  2,  3,  4,  5,  6,  7,
             8,  9, 10, 11, 12, 13, 14, 15, 16],
           dtype='int64')


#### Using ignore_index

Similar to .reset_index(drop=True), the concat() function has the ignore_index option. When set to True, concat() discards the original index values and labels the result 0 to n-1.

In [9]:
new_east=pd.concat([northeast, south], ignore_index = True)
print(new_east.head(11))

0     CT
1     ME
2     MA
3     NH
4     RI
5     VT
6     NJ
7     NY
8     PA
9     DE
10    FL
dtype: object
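
The two ways of getting a fresh index can be compared directly. This is a small sketch with shortened region Series (an assumption for brevity); it only confirms that both routes give the same result.

```python
import pandas as pd

northeast = pd.Series(['CT', 'ME', 'MA'])
south = pd.Series(['DE', 'FL'])

# Option 1: stack, then throw away the old labels
via_reset = pd.concat([northeast, south]).reset_index(drop=True)

# Option 2: let concat build the fresh 0..n-1 index itself
via_ignore = pd.concat([northeast, south], ignore_index=True)

print(via_reset.equals(via_ignore))
print(list(via_ignore.index))
```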


### Exercise 1

#### Appending pandas Series

In this exercise, you'll load sales data from the months January, February, and March into DataFrames. Then, you'll extract Series with the 'Units' column from each and append them together with method chaining using .append().

To check that the stacking worked, you'll print slices from these Series, and finally, you'll add the result to figure out the total units sold in the first quarter.

__Instructions:__
* Read the files 'sales-jan-2015.csv', 'sales-feb-2015.csv' and 'sales-mar-2015.csv' into the DataFrames jan, feb, and mar respectively.
* Use parse_dates=True and index_col='Date'.
* Extract the 'Units' column of jan, feb, and mar to create the Series jan_units, feb_units, and mar_units respectively.
* Construct the Series quarter1 by appending feb_units to jan_units and then appending mar_units to the result. Use chained calls to the .append() method to do this.
* Verify that quarter1 has the individual Series stacked vertically. To do this:
  * Print the slice containing rows from jan 27, 2015 to feb 2, 2015.
  * Print the slice containing rows from feb 26, 2015 to mar 7, 2015.
* Compute and print the total number of units sold from the Series quarter1. This has been done for you, so hit 'Submit Answer' to see the result!

In [4]:
# Import pandas
import pandas as pd

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('sales-jan-2015.csv', parse_dates = True, index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('sales-feb-2015.csv', parse_dates = True, index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('sales-mar-2015.csv', parse_dates = True, index_col='Date')

# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
642


#### Concatenating pandas Series along row axis
Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. You'll continue to work with the sales data you've seen previously. This time, the DataFrames jan, feb, and mar have been pre-loaded.

Your job is to use pd.concat() with a list of Series to achieve the same result that you would get by chaining calls to .append().

You may be wondering about the difference between pd.concat() and pandas' .append() method. One way to think of the difference is that .append() is a specific case of a concatenation, while pd.concat() gives you more flexibility, as you'll see in later exercises.

__Instructions:__
* Create an empty list called units. This has been done for you.
* Use a for loop to iterate over [jan, feb, mar]:
  * In each iteration of the loop, append the 'Units' column of each DataFrame to units.
* Concatenate the Series contained in the list units into a longer Series called quarter1 using pd.concat().
* Specify the keyword argument axis='rows' to stack the Series vertically.
* Verify that quarter1 has the individual Series stacked vertically by printing slices. This has been done for you, so hit 'Submit Answer' to see the result!

In [5]:
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month["Units"])

# Concatenate the list: quarter1
quarter1 = pd.concat(units, axis='rows')

# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-01-27 07:11:55    18
2015-02-02 08:33:01     3
2015-02-02 20:54:49     9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45     4
2015-02-26 08:58:51     1
2015-03-06 10:11:45    17
2015-03-06 02:03:56    17
Name: Units, dtype: int64
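
The slicing step works because the concatenated Series keeps its DatetimeIndex, so a date range can span the boundary between two input Series. A minimal sketch, assuming made-up unit counts and timestamps in place of the CSV files:

```python
import pandas as pd

# Made-up 'Units' Series standing in for the January and February sales data
jan_units = pd.Series(
    [10, 18],
    index=pd.to_datetime(['2015-01-05 08:00:00', '2015-01-27 07:11:55']),
    name='Units',
)
feb_units = pd.Series(
    [3, 7],
    index=pd.to_datetime(['2015-02-02 08:33:01', '2015-02-20 12:00:00']),
    name='Units',
)

# Stack the two months vertically (axis=0 is the default)
quarter = pd.concat([jan_units, feb_units])

# Partial-string slicing: both end dates are inclusive, and the window
# crosses the boundary between the two original Series
window = quarter.loc['2015-01-27':'2015-02-02']
print(window)
print(quarter.sum())
```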


### Appending and Concatenating DataFrames

A review of two 2010 population datasets shows that they are the same object type with the same shape, index name and column name. When appended, these DataFrames are stacked row-wise, just like Series. Both .append() and concat() preserve the original row index values, and since the two DataFrames share no zip codes, the result has no duplicate labels.

In [6]:
import pandas as pd
pop1 = pd.read_csv('population_01.csv', index_col=0)
pop2 = pd.read_csv('population_02.csv', index_col=0)
print(type(pop1), pop1.shape)
print(type(pop2), pop2.shape)

<class 'pandas.core.frame.DataFrame'> (4, 1)
<class 'pandas.core.frame.DataFrame'> (4, 1)


In [13]:
print(pop1)
print(pop2)

                2010 Census Population
Zip Code ZCTA                         
66407                              479
72732                             4716
50579                             2405
46241                            30670
                2010 Census Population
Zip Code ZCTA                         
12766                             2180
76092                            26669
98360                            12221
49464                            27481


In [14]:
pop1.append(pop2)

                2010 Census Population
Zip Code ZCTA                         
66407                              479
72732                             4716
50579                             2405
46241                            30670
12766                             2180
76092                            26669
98360                            12221
49464                            27481


#### DataFrames with Different Indices

The population DataFrame below has the same column name as the previous population DataFrames, but the unemployment DataFrame has a different shape, different column names and a different index. Notice that zip code 2860 is the only index value common to both DataFrames.

In [7]:
population = pd.read_csv('population_00.csv', index_col=0)
unemployment = pd.read_csv('unemployment.csv', index_col=0)
print(population)
print(unemployment)

                2010 Census Population
Zip Code ZCTA                         
57538                              322
59916                              130
37660                            40038
2860                             45199
        Unemployment   Participants
Zip                                
2860            0.11          34447
46167           0.02           4800
1097            0.33             42
80808           0.07           4310


When the population and unemployment DataFrames are appended, the resulting DataFrame has 8 rows and 3 columns. The columns are the union of the columns from the input. The top 4 rows are from the population DataFrame and the .append() method has filled in Null values for columns that came over from the unemployment DataFrame and vice versa for the bottom 4 rows that are from the unemployment DataFrame.

In addition, there are two rows with the same index number, 2860, one from the population DataFrame and one from the unemployment DataFrame. 

In [16]:
population.append(unemployment)

       2010 Census Population  Unemployment  Participants
57538                   322.0           NaN           NaN
59916                   130.0           NaN           NaN
37660                 40038.0           NaN           NaN
2860                  45199.0           NaN           NaN
2860                      NaN          0.11       34447.0
46167                     NaN          0.02        4800.0
1097                      NaN          0.33          42.0
80808                     NaN          0.07        4310.0


Concatenating the population and unemployment DataFrames along axis=0 (also written axis='rows') stacks the rows vertically; the result is identical to using the .append() method. axis=0 is the default for pd.concat().

In [17]:
pd.concat([population, unemployment])

       2010 Census Population  Unemployment  Participants
57538                   322.0           NaN           NaN
59916                   130.0           NaN           NaN
37660                 40038.0           NaN           NaN
2860                  45199.0           NaN           NaN
2860                      NaN          0.11       34447.0
46167                     NaN          0.02        4800.0
1097                      NaN          0.33          42.0
80808                     NaN          0.07        4310.0


Using the axis=1 option, also written axis='columns', stacks the DataFrames horizontally, across the columns. This results in a DataFrame with 7 rows and 3 columns: the one common index label, 2860, has all three columns filled in, and every other row has NaN values inserted for the columns that come from the other DataFrame.

In [18]:
pd.concat([population, unemployment], axis = 1)

       2010 Census Population  Unemployment  Participants
1097                      NaN          0.33          42.0
2860                  45199.0          0.11       34447.0
37660                 40038.0           NaN           NaN
46167                     NaN          0.02        4800.0
57538                   322.0           NaN           NaN
59916                   130.0           NaN           NaN
80808                     NaN          0.07        4310.0
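
The row-wise and column-wise results can be reproduced without the CSV files. The sketch below rebuilds the two DataFrames by hand from the values printed above and compares the shapes. The join='inner' call is an addition not used in the text: it keeps only the index labels present in every input, whereas the default (join='outer') keeps the union.

```python
import pandas as pd

# Small frames rebuilt by hand from the values printed above
population = pd.DataFrame(
    {'2010 Census Population': [322, 130, 40038, 45199]},
    index=[57538, 59916, 37660, 2860],
)
unemployment = pd.DataFrame(
    {'Unemployment': [0.11, 0.02, 0.33, 0.07],
     'Participants': [34447, 4800, 42, 4310]},
    index=[2860, 46167, 1097, 80808],
)

# Default axis=0: rows are stacked, columns are the union of the inputs
stacked = pd.concat([population, unemployment])
# axis=1: columns are placed side by side, rows are the union of the labels
side_by_side = pd.concat([population, unemployment], axis=1)
# join='inner': keep only labels shared by both inputs
common_only = pd.concat([population, unemployment], axis=1, join='inner')

print(stacked.shape)        # 8 rows: every input row is kept
print(side_by_side.shape)   # 7 rows: 2860 appears once
print(common_only)          # only the shared label 2860
```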


### Exercise 2

#### Appending DataFrames with ignore_index

In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981 and names_1881 are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).

You'll use the DataFrame .append() method to make a DataFrame combined_names. To distinguish rows from the original two DataFrames, you'll add a 'year' column to each with the year (1881 or 1981 in this case). In addition, you'll specify ignore_index=True so that the index values are not used along the concatenation axis. The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.

__Instructions:__
* Create a 'year' column in the DataFrames names_1881 and names_1981, with values of 1881 and 1981 respectively. Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
* Create a new DataFrame called combined_names by appending the rows of names_1981 underneath the rows of names_1881. Specify the keyword argument ignore_index=True to make a new RangeIndex of unique integers for each row.
* Print the shapes of all three DataFrames. This has been done for you.
* Extract all rows from combined_names that have the name 'Morgan'. To do this, use the .loc[] accessor with an appropriate filter. The relevant column of combined_names here is 'name' (loaded as 'Name' in the code below).

In [8]:
names_1881 = pd.read_csv('names1881.csv', parse_dates=True, names=['Name', 'Gender','Count'])
names_1981 = pd.read_csv('names1981.csv', parse_dates=True, names=['Name', 'Gender','Count'])

# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index = True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)

# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names['Name']=='Morgan'])

(19455, 4)
(1935, 4)
(21390, 4)
         Name Gender  Count  year
1283   Morgan      M     23  1881
2096   Morgan      F   1769  1981
14390  Morgan      M    766  1981
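
The same pattern (tag each table, stack, filter) can be tried on a toy scale. In this sketch the names and counts are made up, and pd.concat() with ignore_index=True stands in for .append(..., ignore_index=True), which is unavailable in pandas 2.0 and later.

```python
import pandas as pd

# Toy stand-ins for the two baby-name tables (counts are made up)
names_1881 = pd.DataFrame({'Name': ['Mary', 'Morgan'],
                           'Gender': ['F', 'M'],
                           'Count': [100, 5]})
names_1981 = pd.DataFrame({'Name': ['Morgan', 'Morgan', 'Anna'],
                           'Gender': ['F', 'M', 'F'],
                           'Count': [40, 20, 60]})

# Assigning a scalar broadcasts it to every row
names_1881['year'] = 1881
names_1981['year'] = 1981

# Stack the rows and relabel them 0..n-1
combined_names = pd.concat([names_1881, names_1981], ignore_index=True)

# Boolean filter on the 'Name' column
morgans = combined_names.loc[combined_names['Name'] == 'Morgan']
print(morgans)
```

The 'year' column is what lets the rows be told apart after ignore_index=True has discarded the original labels.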


In [20]:
combined_names.columns

Index(['Name', 'Gender', 'Count', 'year'], dtype='object')

In [21]:
names_1881

           Name Gender  Count  year
0          Mary      F   6919  1881
1          Anna      F   2698  1881
2          Emma      F   2034  1881
3     Elizabeth      F   1852  1881
4      Margaret      F   1658  1881
...         ...    ...    ...   ...
1930     Wiliam      M      5  1881
1931     Wilton      M      5  1881
1932       Wing      M      5  1881
1933       Wood      M      5  1881


#### Concatenating pandas DataFrames along column axis
The function pd.concat() can concatenate DataFrames horizontally as well as vertically (vertical is the default). To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 or axis='columns'.

In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates (quarterly versus monthly). You'll concatenate the rows of both and see that, where rows are missing in the coarser DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join (which you will explore in more detail in later exercises).

The files 'quarterly_max_temp.csv' and 'monthly_mean_temp.csv' have been pre-loaded into the DataFrames weather_max and weather_mean respectively, and pandas has been imported as pd.

__Instructions:__
* Create weather_list, a list of the DataFrames weather_max and weather_mean.
* Create a new DataFrame called weather by concatenating weather_list horizontally.
* Pass the list to pd.concat() and specify the keyword argument axis=1 to stack them horizontally.
* Print the new DataFrame weather.

In [22]:
# Create a list of weather_max and weather_mean
weather_list = [weather_max, weather_mean]

# Concatenate weather_list horizontally
weather = pd.concat(weather_list, axis=1)

# Print weather
print(weather)

      Max TemperatureF  Mean TemperatureF
Jan               68.0                 62
Apr               89.0                 72
Jul               91.0                 80
Oct               84.0                 74
Feb                NaN                 48
Mar                NaN                 55
May                NaN                 77
June               NaN                 78
Aug                NaN                 77
Sep                NaN                 79
Nov                NaN                 60
Dec                NaN                 62
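
In the printed result the rows appear in the order they were first encountered (the quarterly labels first), not in calendar order. One possible follow-up, not part of the exercise, is to restore the order with .reindex(). The sketch rebuilds both DataFrames from the values printed above; the months list mirrors the year variable defined in the setup cell.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Values copied from the printed result above
weather_max = pd.DataFrame({'Max TemperatureF': [68, 89, 91, 84]},
                           index=['Jan', 'Apr', 'Jul', 'Oct'])
weather_mean = pd.DataFrame(
    {'Mean TemperatureF': [62, 48, 55, 72, 77, 78, 80, 77, 79, 74, 60, 62]},
    index=months,
)

# Outer join on the index: months without a quarterly value get NaN
weather = pd.concat([weather_max, weather_mean], axis=1)

# Put the rows back into calendar order
weather = weather.reindex(months)
print(weather)
```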


#### Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.

Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.

pandas has been imported as pd and the list medal_types has been pre-loaded for you, which contains the strings 'bronze', 'silver', and 'gold'.

__Instructions:__
* Iterate over medal_types in the for loop.
* Inside the for loop:
  * Create file_name using string interpolation with the loop variable medal. This has been done for you. The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.
  * Create the list of column names called columns. This has been done for you.
  * Read file_name into a DataFrame called medal_df. Specify the keyword arguments header=0, index_col='Country', and names=columns to get the correct row and column Indexes.
  * Append medal_df to medals using the list .append() method.
* Concatenate the list of DataFrames medals horizontally (using axis='columns') to create a single DataFrame called medals_df. Print it in its entirety.

In [9]:
medal_types = ['bronze', 'silver', 'gold']
#Initialize an empty list: medals
medals =[]

for medal in medal_types:
    # Create the file name: file_name
    file_name = '%s_top5.csv' % medal
    # Create list of column names: columns
    columns = ['Country', medal]
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header = 0, index_col='Country', names=columns)
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals horizontally: medals_df
medals_df = pd.concat(medals, axis='columns')

# Print medals_df
print(medals_df)

                bronze  silver    gold
United States   1052.0  1195.0  2088.0
Soviet Union     584.0   627.0   838.0
United Kingdom   505.0   591.0   498.0
France           475.0   461.0     NaN
Germany          454.0     NaN   407.0
Italy              NaN   394.0   460.0


### Concatenation, Keys and MultiIndexes

Examining the rain2013 and rain2014 DataFrames, we see that they share the same row labels and column names.

In [10]:
import pandas as pd
file1 = 'rainQ12013.csv'
file2 = 'rainQ12014.csv'

rain2013 = pd.read_csv(file1, index_col = 'Month', parse_dates = True)
rain2014 = pd.read_csv(file2, index_col = 'Month', parse_dates = True)

print("Rain 2013 Q1")
print(rain2013)
print('Rain 2014 Q1')
print(rain2014)

Rain 2013 Q1
        Precipitation
Month                
Jan          0.096129
Feb          0.067143
Mar          0.061613
Rain 2014 Q1
        Precipitation
Month                
Jan          0.050323
Feb          0.082143
Mar          0.070968


Using concat() with the default axis=0 creates a single DataFrame with repeated index labels, which obscures the fact that the top 3 rows come from the 2013 DataFrame and the bottom 3 from the 2014 DataFrame.

In [25]:
pd.concat([rain2013, rain2014])

        Precipitation
Month                
Jan          0.096129
Feb          0.067143
Mar          0.061613
Jan          0.050323
Feb          0.082143
Mar          0.070968


#### Using a Multi-index on Rows

One way to address the issue above is to build a multi-level index on the rows by passing a list of outer index labels to the keys option of concat(). Each key becomes the outer index label for the rows of the corresponding input DataFrame, so the order of the keys must match the order of the input DataFrames.

In [26]:
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis = 0)
print(rain1314)

             Precipitation
     Month                
2013 Jan          0.096129
     Feb          0.067143
     Mar          0.061613
2014 Jan          0.050323
     Feb          0.082143
     Mar          0.070968


#### Accessing a Multi-index

As expected, the outermost index level can be used to select rows from the combined DataFrame.

In [27]:
print(rain1314.loc[2014])

        Precipitation
Month                
Jan          0.050323
Feb          0.082143
Mar          0.070968
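
Selection can also go one level deeper by passing a tuple to .loc[]. The sketch below rebuilds the two rain DataFrames from the printed values; the names= argument and the level label 'Year' are additions for illustration and are not used in the text.

```python
import pandas as pd

months = pd.Index(['Jan', 'Feb', 'Mar'], name='Month')
rain2013 = pd.DataFrame({'Precipitation': [0.096129, 0.067143, 0.061613]},
                        index=months)
rain2014 = pd.DataFrame({'Precipitation': [0.050323, 0.082143, 0.070968]},
                        index=months)

# names= labels the index levels; 'Year' is a label chosen for this sketch
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014],
                     names=['Year', 'Month'])

print(rain1314.index.names)
print(rain1314.loc[2014])                            # all rows for 2014
print(rain1314.loc[(2014, 'Feb'), 'Precipitation'])  # a single cell
```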


#### Concatenating Columns

A different approach would be to add the 2014 data across the columns, next to the 2013 data. This is done using axis=1 or axis='columns'.

In [28]:
rain1314 = pd.concat([rain2013, rain2014], axis = 'columns')
print(rain1314)

        Precipitation   Precipitation
Month                                
Jan          0.096129        0.050323
Feb          0.067143        0.082143
Mar          0.061613        0.070968


Again, we see that the resulting DataFrame obscures which Precipitation column comes from which year. This is addressed using the keys = option.

In [29]:
rain1314 = pd.concat([rain2013, rain2014], keys = [2013, 2014], axis = 'columns')
print(rain1314)

                2013           2014
       Precipitation  Precipitation
Month                              
Jan         0.096129       0.050323
Feb         0.067143       0.082143
Mar         0.061613       0.070968


#### pd.concat() with Dict

Finally, the concat() function can accept a dictionary rather than a list of DataFrames. In this scenario, the dictionary keys are automatically treated as values for the keys = argument when building a multi-index on the columns. 

In [30]:
rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis = 'columns')
print(rain1314)

                2013           2014
       Precipitation  Precipitation
Month                              
Jan         0.096129       0.050323
Feb         0.067143       0.082143
Mar         0.061613       0.070968
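
Once the columns carry a MultiIndex, each column is addressed by a (key, column name) tuple. A short sketch of how the result might be used, with the rain DataFrames rebuilt from the printed values; the year-over-year difference is an illustration, not part of the original notes.

```python
import pandas as pd

months = pd.Index(['Jan', 'Feb', 'Mar'], name='Month')
rain2013 = pd.DataFrame({'Precipitation': [0.096129, 0.067143, 0.061613]},
                        index=months)
rain2014 = pd.DataFrame({'Precipitation': [0.050323, 0.082143, 0.070968]},
                        index=months)

# Dictionary keys become the outer column level
rain1314 = pd.concat({2013: rain2013, 2014: rain2014}, axis='columns')

# Columns are (year, variable) tuples
print(rain1314.columns.tolist())

# Selecting the outer label returns that year's sub-frame
print(rain1314[2013])

# A full tuple selects one column; here, the change from 2013 to 2014
change = rain1314[(2014, 'Precipitation')] - rain1314[(2013, 'Precipitation')]
print(change.round(6))
```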


### Exercise 3

#### Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate the DataFrame from which each row originated. This can be done by specifying the keys parameter in the call to pd.concat(), which generates a hierarchical index with the labels from keys as the outermost index label. So you don't have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs to be specified.

Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset. Once again, pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types, which contains the strings 'bronze', 'silver', and 'gold'.

__Instructions:__
* Within the for loop:
 * Read file_name into a DataFrame called medal_df. Specify the index to be 'Country'.
 * Append medal_df to medals.
* Concatenate the list of DataFrames medals into a single DataFrame called medals. Be sure to use the keyword argument keys=['bronze', 'silver', 'gold'] to create a vertically stacked DataFrame with a MultiIndex.
* Print the new DataFrame medals. This has been done for you, so hit 'Submit Answer' to see the result!

In [11]:
medals =[]
for medal in medal_types:

    file_name = '%s_top5.csv' % medal
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)
    
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals in entirety
print(medals)

                        Total
       Country               
bronze United States   1052.0
       Soviet Union     584.0
       United Kingdom   505.0
       France           475.0
       Germany          454.0
silver United States   1195.0
       Soviet Union     627.0
       United Kingdom   591.0
       France           461.0
       Italy            394.0
gold   United States   2088.0
       Soviet Union     838.0
       United Kingdom   498.0
       Italy            460.0
       Germany          407.0


#### Slicing MultiIndexed DataFrames
This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).

You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. Your task is to sort the DataFrame and to use the pd.IndexSlice to extract specific slices. Check out this exercise from Manipulating DataFrames with pandas to refresh your memory on how to deal with MultiIndexed DataFrames.

pandas has been imported for you as pd and the DataFrame medals is already in your namespace.

__Instructions:__
 * Create a new DataFrame medals_sorted with the entries of medals sorted. Use .sort_index(level=0) to ensure the Index is sorted suitably.
 * Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
 * Create an alias for pd.IndexSlice called idx. A slicer pd.IndexSlice is required when slicing on the inner level of a MultiIndex.
 * Slice all the data on medals won by the United Kingdom in the DataFrame medals_sorted. To do this, use the .loc[] accessor with idx[:,'United Kingdom'], :.

In [13]:
#Behind the scene data preparation for the exercise

gold = pd.read_csv('gold_top5.csv', index_col = 'Country')
silver = pd.read_csv('silver_top5.csv', index_col = 'Country')
bronze =pd.read_csv('bronze_top5.csv', index_col = 'Country')

medals = pd.concat([gold, silver, bronze], axis = 0, keys = ['gold', 'silver','bronze'])

In [72]:
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])

# Print data about silver medals
print(medals_sorted.loc['silver'])

# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'United Kingdom'],:])

Total    454
Name: (bronze, Germany), dtype: int64
                Total
Country              
France            461
Italy             394
Soviet Union      627
United Kingdom    591
United States    1195
                       Total
       Country              
bronze United Kingdom    505
gold   United Kingdom    498
silver United Kingdom    591
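
The inner-level slice can be reproduced without the CSV files. In this sketch the helper top5() is hypothetical and the totals are copied from the medal tables printed earlier. The .xs() call at the end is an alternative not used in the exercise: it selects on the inner level as well, but drops that level from the result.

```python
import pandas as pd

def top5(countries, totals):
    """Build one medal table indexed by country (hypothetical helper)."""
    return pd.DataFrame({'Total': totals},
                        index=pd.Index(countries, name='Country'))

bronze = top5(['United States', 'Soviet Union', 'United Kingdom', 'France', 'Germany'],
              [1052, 584, 505, 475, 454])
silver = top5(['United States', 'Soviet Union', 'United Kingdom', 'France', 'Italy'],
              [1195, 627, 591, 461, 394])
gold = top5(['United States', 'Soviet Union', 'United Kingdom', 'Italy', 'Germany'],
            [2088, 838, 498, 460, 407])

medals = pd.concat([gold, silver, bronze], keys=['gold', 'silver', 'bronze'])

# Sorting the index first keeps MultiIndex slicing well defined
medals_sorted = medals.sort_index(level=0)

# Slice the inner level: every medal type, one country
idx = pd.IndexSlice
uk = medals_sorted.loc[idx[:, 'United Kingdom'], :]
print(uk)

# .xs() selects on the inner level too, and drops that level from the result
uk_xs = medals_sorted.xs('United Kingdom', level='Country')
print(uk_xs)
```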


#### Concatenating horizontally to get MultiIndexed columns
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise, you'll start with pandas imported and a list of three DataFrames called dataframes. All three DataFrames contain 'Company', 'Product', and 'Units' columns with a 'Date' column as the index pertaining to sales transactions during the month of February, 2015. The first DataFrame describes Hardware transactions, the second describes Software transactions, and the third, Service transactions.

Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns. From there, you can summarize the resulting DataFrame and slice some information from it.

__Instructions:__
* Construct a new DataFrame february with MultiIndexed columns by concatenating the list dataframes.
* Use axis=1 to stack the DataFrames horizontally and the keyword argument keys=['Hardware', 'Software', 'Service'] to construct a hierarchical Index from each DataFrame.
* Print summary information from the new DataFrame february using the .info() method. This has been done for you.
* Create an alias called idx for pd.IndexSlice.
* Extract a slice called slice_2_8 from february (using .loc[] & idx) that comprises rows between Feb. 2, 2015 to Feb. 8, 2015 from columns under 'Company'.
* Print the slice_2_8. This has been done for you, so hit 'Submit Answer' to see the sliced data!

In [14]:
#Behind the scene data preparation for the exercise

jan = pd.read_csv('sales-jan-2015.csv', parse_dates = True, index_col = 'Date')
feb = pd.read_csv('sales-feb-2015.csv', parse_dates = True, index_col = 'Date')
mar = pd.read_csv('sales-mar-2015.csv', parse_dates = True, index_col = 'Date')


dataframes = [jan[jan['Product'] == 'Hardware'].append(feb[feb['Product']=='Hardware']).append(mar[mar['Product']=='Hardware']),
              jan[jan['Product'] == 'Software'].append(feb[feb['Product']=='Software']).append(mar[mar['Product']=='Software']),
              jan[jan['Product'] == 'Service'].append(feb[feb['Product']=='Service']).append(mar[mar['Product']=='Service'])]

In [75]:
february = pd.concat(dataframes, axis = 1, keys=['Hardware', 'Software', 'Service'])
print(february.info())

# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['Feb.2, 2015':'Feb.8,2015', idx[:, 'Company']]

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 60 entries, 2015-01-01 07:31:00 to 2015-03-28 19:20:00
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   (Hardware, Company)  21 non-null     object 
 1   (Hardware, Product)  21 non-null     object 
 2   (Hardware, Units)    21 non-null     float64
 3   (Software, Company)  23 non-null     object 
 4   (Software, Product)  23 non-null     object 
 5   (Software, Units)    23 non-null     float64
 6   (Service, Company)   16 non-null     object 
 7   (Service, Product)   16 non-null     object 
 8   (Service, Units)     16 non-null     float64
dtypes: float64(3), object(6)
memory usage: 4.7+ KB
None


In [166]:
# Concatenate dataframes: february
february = pd.concat(dataframes, axis = 1, keys=['Hardware', 'Software', 'Service'])

# Print february.info()
print(february.info())

# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['Feb.4, 2015':'Feb.8,2015', idx[:, 'Company']]

# Print slice_2_8
print(slice_2_8)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53 entries, 2015-01-01 07:31:00 to 2015-03-28 19:20:00
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   (Hardware, Company)  21 non-null     object 
 1   (Hardware, Product)  21 non-null     object 
 2   (Hardware, Units)    21 non-null     float64
 3   (Software, Company)  22 non-null     object 
 4   (Software, Product)  22 non-null     object 
 5   (Software, Units)    22 non-null     float64
 6   (Service, Company)   10 non-null     object 
 7   (Service, Product)   10 non-null     object 
 8   (Service, Units)     10 non-null     float64
dtypes: float64(3), object(6)
memory usage: 4.1+ KB
None
                             Hardware          Software Service
                              Company           Company Company
Date                                                           
2015-02-04 15:36:00               NaN         Streeplex     N
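
Since the printed slice above is cut off, the following minimal sketch shows the same pattern on a small, made-up pair of DataFrames. The transactions, dates, and the names `hardware`, `software`, `combined`, and `company_slice` are assumptions for illustration only, not the exercise data.

```python
import pandas as pd

# Made-up transactions, for illustration only
hardware = pd.DataFrame(
    {'Company': ['Acme', 'Hooli', 'Initech', 'Acme'], 'Units': [3, 5, 2, 7]},
    index=pd.to_datetime(['2015-02-01', '2015-02-03', '2015-02-05', '2015-02-09']),
)
software = pd.DataFrame(
    {'Company': ['Hooli', 'Acme', 'Initech', 'Hooli'], 'Units': [1, 4, 6, 8]},
    index=pd.to_datetime(['2015-02-02', '2015-02-03', '2015-02-08', '2015-02-10']),
)

# keys= becomes the outer level of the column MultiIndex;
# sort_index() makes sure date-string slicing on the rows is well defined
combined = pd.concat([hardware, software], axis=1,
                     keys=['Hardware', 'Software']).sort_index()

# idx[:, 'Company'] means: every outer key, inner column 'Company'
idx = pd.IndexSlice
company_slice = combined.loc['2015-02-02':'2015-02-08', idx[:, 'Company']]
print(company_slice)
```

Rows that exist in only one of the inputs show NaN in the other product's column, which is the same effect visible in the non-null counts reported by .info().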

#### Concatenating DataFrames from a dict
You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames jan, feb, and mar have been pre-loaded for you. Your task is to aggregate the sum of all sales over the 'Company' column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then concatenating them.

__Instructions:__
* Create a list called month_list consisting of the tuples ('january', jan), ('february', feb), and ('march', mar).
* Create an empty dictionary called month_dict.
* Inside the for loop:
 * Group month_data by 'Company' and use .sum() to aggregate.
* Construct a new DataFrame called sales by concatenating the DataFrames stored in month_dict.
* Create an alias for pd.IndexSlice and print all sales by 'Mediacore'. This has been done for you, so hit 'Submit Answer' to see the result!

In [76]:
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])

                           Units
         Company                
january  Acme Coporation      76
         Hooli                70
         Initech              37
         Mediacore            15
         Streeplex            50
february Acme Corporation     34
         Hooli                30
         Initech              13
         Initech Service      10
         InitechSoftware       7
         Mediacore            45
         Streeplex            37
march    Acme Corporation      5
         Hooli                37
         Initech              68
         Mediacore            68
         Streeplex            40
                    Units
         Company         
january  Mediacore     15
february Mediacore     45
march    Mediacore     68
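
As a side note, concatenating a dict behaves like concatenating a list with the dict's keys passed as keys=. The sketch below illustrates this using a few of the totals printed above; the names `jan_totals`, `mar_totals`, `from_dict`, and `from_keys` are hypothetical, and .xs() is shown only as one possible alternative to pd.IndexSlice.

```python
import pandas as pd

# A subset of the grouped totals printed above
jan_totals = pd.DataFrame({'Units': [15, 50]},
                          index=pd.Index(['Mediacore', 'Streeplex'], name='Company'))
mar_totals = pd.DataFrame({'Units': [68, 40]},
                          index=pd.Index(['Mediacore', 'Streeplex'], name='Company'))

# The dict keys become the outer level of the row MultiIndex
from_dict = pd.concat({'january': jan_totals, 'march': mar_totals})

# Same result with a list plus an explicit keys= argument
from_keys = pd.concat([jan_totals, mar_totals], keys=['january', 'march'])
print(from_dict.equals(from_keys))

# .xs() selects a single label from the inner 'Company' level
print(from_dict.xs('Mediacore', level='Company'))
```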


### Outer and Inner Joins

This lesson explains what append and concatenate are doing when they join DataFrames and Series together.

Let's start with stacking arrays. First we create three arrays, A, B and C, sized 2 by 4, 2 by 3 and 3 by 4 respectively. The decimal constant added to each array (.1, .2 or .3) lets us see which array each number came from as we proceed to join these arrays.

In [77]:
import numpy as np
import pandas as pd
A = np.arange(8).reshape(2,4) + .1
print(A)

[[0.1 1.1 2.1 3.1]
 [4.1 5.1 6.1 7.1]]


In [78]:
B = np.arange(6).reshape(2,3) + .2
print(B)

[[0.2 1.2 2.2]
 [3.2 4.2 5.2]]


In [79]:
C = np.arange(12).reshape(3,4) + .3
print(C)

[[ 0.3  1.3  2.3  3.3]
 [ 4.3  5.3  6.3  7.3]
 [ 8.3  9.3 10.3 11.3]]


#### Stacking Arrays Horizontally

We can stack the 2x4 matrix A with the 2x3 matrix B horizontally using np.hstack(). Equivalently, np.concatenate() with the option axis = 1 creates the same array. In both cases, A and B must have the same number of rows, but the number of columns can differ.

In [80]:
np.hstack([B,A])

array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
       [3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])

In [81]:
np.concatenate([B, A], axis = 1)

array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
       [3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])

#### Stacking Arrays Vertically

We can also stack the 2x4 matrix A and the 3x4 matrix C vertically using np.vstack() or np.concatenate() with axis = 0, the default for np.concatenate(). In this case, both matrices must have 4 columns, but the number of rows can differ.

In [82]:
np.vstack([A,C])

array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1],
       [ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])

In [83]:
np.concatenate([A,C], axis = 0)

array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1],
       [ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])

A ValueError is raised when you try to concatenate arrays whose sizes do not match along the axis that is not being concatenated.

In [85]:
np.concatenate([A,B], axis = 0)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 4 and the array at index 1 has size 3

In [86]:
np.concatenate([A,C], axis = 1)

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3
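
One way to guard against these errors is to compare the shapes along the other axis before joining. The sketch below is illustrative only; the helper name `can_concatenate` is hypothetical and not part of NumPy, and it assumes 2-D inputs.

```python
import numpy as np

A = np.arange(8).reshape(2, 4) + .1
B = np.arange(6).reshape(2, 3) + .2
C = np.arange(12).reshape(3, 4) + .3


def can_concatenate(x, y, axis):
    """Hypothetical helper: True if 2-D arrays x and y can be joined along axis."""
    other = 1 - axis  # for 2-D arrays, the axis that must match
    return x.shape[other] == y.shape[other]


# A (2x4) and B (2x3) share a row count, so only horizontal stacking works
print(can_concatenate(A, B, axis=1))
print(can_concatenate(A, B, axis=0))

# The same mismatch surfaces as a ValueError when attempted directly
try:
    np.concatenate([A, B], axis=0)
except ValueError as err:
    print('ValueError:', err)
```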

Returning to the population and unemployment data, we see that population has 4 rows and 1 data column, and unemployment has 4 rows and 2 data columns. Each DataFrame is indexed by zip code.

In [87]:
print(population)
print(unemployment)

                2010 Census Population
Zip Code ZCTA                         
57538                              322
59916                              130
37660                            40038
2860                             45199
        Unemployment   Participants
Zip                                
2860            0.11          34447
46167           0.02           4800
1097            0.33             42
80808           0.07           4310


#### Converting to Arrays

Recall that pandas is built on NumPy, so first let's convert the above DataFrames to NumPy arrays. Notice that with the arrays, the index information is discarded. We now have 2 NumPy arrays, with dimensions 4x1 and 4x2.

In [88]:
population_array = np.array(population)
print(population_array)

[[  322]
 [  130]
 [40038]
 [45199]]


In [89]:
unemployment_array = np.array(unemployment)
print(unemployment_array)

[[1.1000e-01 3.4447e+04]
 [2.0000e-02 4.8000e+03]
 [3.3000e-01 4.2000e+01]
 [7.0000e-02 4.3100e+03]]


#### Manipulating Data as Arrays

We could apply np.concatenate() or np.hstack() to these arrays, but the new array, with dimensions 4x3, would be meaningless. The rows are simply glued together by position, independent of the zip code each row originally described. It would be necessary to store the zip codes in other lists or arrays to align rows properly. Combining labeled tables so that rows are aligned by their labels corresponds to what is called a database join.

In [90]:
np.concatenate([population_array, unemployment_array], axis = 1)

array([[3.2200e+02, 1.1000e-01, 3.4447e+04],
       [1.3000e+02, 2.0000e-02, 4.8000e+03],
       [4.0038e+04, 3.3000e-01, 4.2000e+01],
       [4.5199e+04, 7.0000e-02, 4.3100e+03]])
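
To make the alignment problem concrete, here is a rough sketch of what keeping the zip codes in separate arrays would involve. The values are copied from the printed DataFrames above; the variable names are assumptions, and np.intersect1d with return_indices=True is just one possible way to find matching rows.

```python
import numpy as np

# Zip codes and values copied from the printed DataFrames above
pop_zips = np.array([57538, 59916, 37660, 2860])
pop_values = np.array([322, 130, 40038, 45199])
unemp_zips = np.array([2860, 46167, 1097, 80808])
unemp_values = np.array([[0.11, 34447], [0.02, 4800], [0.33, 42], [0.07, 4310]])

# Find zip codes present in both arrays, plus their positions in each
common, pop_pos, unemp_pos = np.intersect1d(pop_zips, unemp_zips,
                                            return_indices=True)

# Glue together only the rows that refer to the same zip code
aligned = np.column_stack([common, pop_values[pop_pos], unemp_values[unemp_pos]])
print(aligned)
```

This bookkeeping is exactly what pandas does automatically when it aligns on index labels.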

#### Joins

Joining tables involves meaningfully gluing rows together by their index labels. An outer join preserves every index label from the original tables, filling the missing entries with null values (NaN). An outer joined table has all the index labels of the original tables without any repetition, like a set union. Conversely, an inner join only has index labels common to both tables, like a set intersection.
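
The set analogy can be checked directly on the index labels. This is a minimal sketch using the zip codes from the population and unemployment tables printed earlier; the variable names are assumptions for illustration.

```python
import pandas as pd

pop_index = pd.Index([57538, 59916, 37660, 2860])
unemp_index = pd.Index([2860, 46167, 1097, 80808])

# Labels an outer join would keep: everything, with no repetition
outer_labels = pop_index.union(unemp_index)

# Labels an inner join would keep: only those present in both
inner_labels = pop_index.intersection(unemp_index)

print(outer_labels.tolist())
print(inner_labels.tolist())
```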

#### Concatenation and Inner Join
If we call pd.concat() with axis = 1 (columns) and join = 'inner', we get an inner join. The result has only 1 row, for zip code 2860, the one index label that occurs in both DataFrames.

In [91]:
pd.concat([population, unemployment], axis = 1, join='inner')

       2010 Census Population  Unemployment  Participants
2860                    45199          0.11         34447


#### Concatenation and Outer Join
If we specify join = 'outer' with axis = 1, we get the default behavior of pd.concat(), since the join parameter defaults to 'outer' when it is not specified. When a row label occurs in one DataFrame and not the other, the missing entries are filled with a null value (NaN).

In [92]:
pd.concat([population, unemployment], axis = 1, join='outer')

       2010 Census Population  Unemployment  Participants
1097                      NaN          0.33          42.0
2860                  45199.0          0.11       34447.0
37660                 40038.0           NaN           NaN
46167                     NaN          0.02        4800.0
57538                   322.0           NaN           NaN
59916                   130.0           NaN           NaN
80808                     NaN          0.07        4310.0


#### Inner Join on the Other Axis

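We could also do an inner join along axis = 0. No column label occurs in both population and unemployment, so the result keeps all 8 row labels but has no columns at all. Because .info is referenced without parentheses in the cell below, the output is the repr of the bound method, which still reveals the empty DataFrame.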

In [93]:
pd.concat([population, unemployment], join = 'inner', axis = 0).info

<bound method DataFrame.info of Empty DataFrame
Columns: []
Index: [57538, 59916, 37660, 2860, 2860, 46167, 1097, 80808]>
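
The same result can be reproduced in a self-contained way. The sketch below rebuilds the two DataFrames from the values printed earlier (the construction code is an assumption, not how the originals were loaded) and checks the shape of the stacked result.

```python
import pandas as pd

# Rebuilt from the printed output above
population = pd.DataFrame(
    {'2010 Census Population': [322, 130, 40038, 45199]},
    index=pd.Index([57538, 59916, 37660, 2860], name='Zip Code ZCTA'),
)
unemployment = pd.DataFrame(
    {'Unemployment': [0.11, 0.02, 0.33, 0.07],
     'Participants': [34447, 4800, 42, 4310]},
    index=pd.Index([2860, 46167, 1097, 80808], name='Zip'),
)

# Stacking vertically with an inner join keeps only shared column labels
stacked = pd.concat([population, unemployment], join='inner', axis=0)

print(stacked.shape)            # rows are kept, columns are not
print(stacked.index.tolist())   # zip code 2860 appears twice
```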

### Exercise 4

#### Concatenating DataFrames with inner join

Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset.

The DataFrames bronze, silver, and gold have been pre-loaded for you.

Your task is to compute an inner join.

__Instructions:__
* Construct a list of DataFrames called medal_list with entries bronze, silver, and gold.
* Concatenate medal_list horizontally with an inner join to create medals.
* Use the keyword argument keys=['bronze', 'silver', 'gold'] to yield suitable hierarchical indexing.
* Use axis=1 to get horizontal concatenation.
* Use join='inner' to keep only rows that share common index labels.
* Print the new DataFrame medals.

In [94]:
# Create the list of DataFrames: medal_list
medal_list = [bronze,silver,gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, axis = 1, join = 'inner', keys = ['bronze', 'silver', 'gold'])

# Print medals
print(medals)

               bronze silver  gold
                Total  Total Total
Country                           
United States    1052   1195  2088
Soviet Union      584    627   838
United Kingdom    505    591   498


Well done! France, Italy, and Germany got dropped as part of the join since they are not present in each of bronze, silver, and gold. Therefore, the final DataFrame has only the United States, Soviet Union, and United Kingdom.

#### Resampling & concatenating DataFrames with inner join
In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts in 1961 and is recorded annually.

You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an appropriate offset alias for resampling, and the method .resample() must be chained with some kind of aggregation method (.last() in this case) before .pct_change() is applied to the resampled values.

pandas has been imported as pd, and the DataFrames china and us have been pre-loaded, with the output of china.head() and us.head() printed in the IPython Shell.

__Instructions:__
* Make a new DataFrame china_annual by resampling the DataFrame china with .resample('A').last() (i.e., with annual frequency) and chaining two method calls:
 * Chain .pct_change(10) to compute the percentage change with an offset of ten years.
 * Chain .dropna() to eliminate rows containing null values.
* Make a new DataFrame us_annual by resampling the DataFrame us exactly as you resampled china.
* Concatenate china_annual and us_annual to construct a DataFrame called gdp. Use join='inner' to perform an inner join and use axis=1 to concatenate horizontally.
* Print the result of resampling gdp every decade (i.e., using .resample('10A')) and aggregating with the method .last(). This has been done for you, so hit 'Submit Answer' to see the result!

In [18]:
import pandas as pd
#Behind the scenes work to load data that was already loaded in DataCamp
china = pd.read_csv('gdp_china.csv')
us = pd.read_csv('gdp_usa.csv')
china['Year'] = pd.to_datetime(china['Year'])
us['Year'] = pd.to_datetime(us['Year'])
us = us.set_index('Year')
china = china.set_index('Year')

# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()

# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)

# Resample gdp and print
print(gdp.resample('10A').last())

                 GDP     Value
Year                          
1970-12-31  0.546128  1.017187
1980-12-31  1.072537  1.742556
1990-12-31  0.892820  1.012126
2000-12-31  2.357522  0.738632
2010-12-31  4.011081  0.454332
2020-12-31  3.789936  0.361780
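
To see what .pct_change(10) computes, the following sketch compares it with the equivalent manual calculation on a made-up annual series. The values and the name `gdp_level` are assumptions for illustration only and are unrelated to the real GDP data.

```python
import pandas as pd

# Made-up annual values indexed by year, for illustration only
gdp_level = pd.Series([100.0 + 10 * i for i in range(15)],
                      index=range(1961, 1976))

# Growth relative to the value ten rows (here, ten years) earlier
ten_year_growth = gdp_level.pct_change(10)

# Equivalent manual calculation: x_t / x_(t-10) - 1
manual = gdp_level / gdp_level.shift(10) - 1

# The first ten years have no earlier value to compare against, hence dropna()
print(ten_year_growth.dropna())
```

A value of 1.0 means the series doubled over the preceding ten years, which is how the GDP and Value columns in the output above can be read.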
