# Preparing data


## 1. Reading Multiple Data Files

The data files for this example have been derived from a list of Olympic medals awarded between 1896 & 2008 compiled by the [Guardian](https://www.theguardian.com/sport/datablog/2012/jun/25/olympic-medal-winner-list-data).

The column labels of each DataFrame are NOC, Country, & Total where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won (bronze, silver, or gold).

In [2]:
# Import pandas
import pandas as pd

In [62]:
base_path = "data/Summer Olympic medals/"

In [63]:
# Read 'Bronze.csv' into a DataFrame: bronze
bronze = pd.read_csv(base_path + "Bronze.csv")

# Read 'Silver.csv' into a DataFrame: silver
silver = pd.read_csv(base_path + "Silver.csv")

# Read 'Gold.csv' into a DataFrame: gold
gold = pd.read_csv(base_path + "Gold.csv")

# Print the first five rows of gold
gold.head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


Well done! Reading CSV files into DataFrames like this should now be second nature to you!



## 2. Reading DataFrames from multiple files in a loop
As you saw in the video, loading data from multiple files into DataFrames is more efficient in a loop or a list comprehension.

Notice that this approach is not restricted to working with CSV files. That is, even if your data comes in other formats, as long as pandas has a suitable data import function, you can apply a loop or comprehension to generate a list of DataFrames imported from the source files.

In [64]:
# Create the list of file names: filenames
filenames = [base_path + 'Gold.csv', base_path + 'Silver.csv', base_path + 'Bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print top 5 rows of 1st DataFrame in dataframes
dataframes[0].head()

Unnamed: 0,NOC,Country,Total
0,USA,United States,2088.0
1,URS,Soviet Union,838.0
2,GBR,United Kingdom,498.0
3,FRA,France,378.0
4,GER,Germany,407.0


Great work! When you are dealing with multiple CSV files like this, it is more efficient to read them into DataFrames using a loop.
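
The same list can also be built with a list comprehension. The sketch below is not the exercise's code: it reads from hypothetical in-memory strings (two rows each, with values taken from the tables in this chapter) so that it runs without the CSV files on disk. With the real files, you would iterate over the paths in `filenames` instead.

```python
import io
import pandas as pd

# Hypothetical stand-ins for Gold.csv, Silver.csv and Bronze.csv,
# so the example runs without the files on disk.
csv_texts = [
    "NOC,Country,Total\nUSA,United States,2088.0\nURS,Soviet Union,838.0\n",
    "NOC,Country,Total\nUSA,United States,1195.0\nURS,Soviet Union,627.0\n",
    "NOC,Country,Total\nUSA,United States,1052.0\nURS,Soviet Union,584.0\n",
]

# One DataFrame per source, built in a single expression
dataframes = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# Show the first DataFrame in the list
print(dataframes[0].head())
```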



## 3. Combining DataFrames from multiple data files
In this exercise, you'll combine the three DataFrames from earlier exercises - gold, silver, & bronze - into a single DataFrame called medals. The approach you'll use here is clumsy. Later, you'll see various powerful methods that are frequently used in practice for concatenating or merging DataFrames.

Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC is a three-letter code for the name of the country and Total is the number of medals of that type won.

In [65]:
# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver["Total"]
medals['Bronze'] = bronze["Total"]

# Print the head of medals
medals.head()

Unnamed: 0,NOC,Country,Gold,Silver,Bronze
0,USA,United States,2088.0,1195.0,1052.0
1,URS,Soviet Union,838.0,627.0,584.0
2,GBR,United Kingdom,498.0,591.0,505.0
3,FRA,France,378.0,461.0,475.0
4,GER,Germany,407.0,350.0,454.0


Excellent! Later in this course, you'll learn far more powerful tools for combining DataFrames!
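
One reason the column-assignment approach is fragile: assigning a Series as a new column matches rows by index label, and with the default index that label is just the row number. If the three files listed countries in a different order, the totals would be paired with the wrong country. A minimal sketch of a label-based alternative follows, using `pd.concat` on Series indexed by `NOC`. The tiny DataFrames are hypothetical stand-ins (values from the table above, with the silver rows deliberately reordered), not the real files.

```python
import pandas as pd

# Hypothetical two-row stand-ins for the medal tables
gold = pd.DataFrame({"NOC": ["USA", "URS"], "Total": [2088.0, 838.0]})
# Rows deliberately listed in a different order to show label alignment
silver = pd.DataFrame({"NOC": ["URS", "USA"], "Total": [627.0, 1195.0]})
bronze = pd.DataFrame({"NOC": ["USA", "URS"], "Total": [1052.0, 584.0]})

# Index each table by country code, keep only the medal totals
totals = [df.set_index("NOC")["Total"] for df in (gold, silver, bronze)]

# Concatenate column-wise; rows are matched on the NOC label, not on position
medals = pd.concat(totals, axis=1, keys=["Gold", "Silver", "Bronze"])
print(medals)
```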



## 4. Sorting DataFrame with the Index & columns
It is often useful to rearrange the sequence of the rows of a DataFrame by sorting. You don't have to implement these yourself; the principal methods for doing this are `.sort_index()` and `.sort_values()`.

In this exercise, you'll use these methods with a DataFrame of temperature values indexed by month names. You'll sort the rows alphabetically using the Index and numerically using a column. Notice, for this data, the original ordering is probably most useful and intuitive: the purpose here is for you to understand what the sorting methods do.
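
As a self-contained sketch of what the two methods do, the example below builds a small DataFrame by hand (five monthly values from this dataset; the variable names are illustrative). Both methods return a new DataFrame and leave the original untouched. The `kind="mergesort"` argument is an optional addition here: the default sorting algorithm does not promise to keep tied rows in their original order, while merge sort does.

```python
import pandas as pd

# Hand-built stand-in for the first five rows of the monthly data
weather = pd.DataFrame(
    {"Max TemperatureF": [68, 60, 68, 84, 88]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr", "May"], name="Month"),
)

# Sort rows alphabetically by their index labels
by_label = weather.sort_index()

# Sort rows by column value; mergesort keeps tied rows (Jan, Mar) in input order
by_value = weather.sort_values("Max TemperatureF", kind="mergesort")

print(by_label)
print(by_value)
```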

In [66]:
# Read 'monthly_max_temp.csv' into a DataFrame: weather1
weather1 = pd.read_csv("data/monthly_max_temp.csv", index_col = "Month")

# Print the head of weather1
weather1.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Jan,68
Feb,60
Mar,68
Apr,84
May,88


In [67]:
# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
weather2.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Apr,84
Aug,86
Dec,68
Feb,60
Jan,68


In [68]:
# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending = False)

# Print the head of weather3
weather3.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Sep,90
Oct,84
Nov,72
May,88
Mar,68


In [69]:
# Sort weather3 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather3.sort_values("Max TemperatureF")

# Print the head of weather4
weather4.head()

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Feb,60
Mar,68
Jan,68
Dec,68
Nov,72


## 5. Reindexing DataFrame from a list
Sorting methods are not the only way to change DataFrame Indexes. There is also the `.reindex()` method.

In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values to contain monthly samples (this is an example of upsampling or increasing the rate of samples).

The original data has the first month's abbreviation of the quarter (three-month interval) on the Index, namely Apr, Jan, Jul, and Oct. 

You'll initially use a list of all twelve month abbreviations and subsequently apply the `.ffill()` method to forward-fill the null entries when upsampling. 

In [70]:
# Defining index list
year = ['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']

In [73]:
# Slicing weather1 df for this exercise
weather1 = weather1.loc[["Apr", "Jan", "Jul", "Oct"], :]

In [74]:
# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
weather2

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Jan,68.0
Feb,
Mar,
Apr,84.0
May,
Jun,
Jul,91.0
Aug,
Sep,
Oct,84.0


In [75]:
# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
weather3

Unnamed: 0_level_0,Max TemperatureF
Month,Unnamed: 1_level_1
Jan,68.0
Feb,68.0
Mar,68.0
Apr,84.0
May,84.0
Jun,84.0
Jul,91.0
Aug,91.0
Sep,91.0
Oct,84.0


Great work! Notice that values corresponding to months missing from `weather1` are filled with `NaN` values in `weather2`. This does not happen in `weather3`, since you used forward-fill.
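
The two steps can be reproduced without the CSV file. The sketch below builds the four quarterly rows by hand (values as shown above; the names `quarterly`, `monthly`, and `filled` are illustrative). `.reindex()` looks rows up by label, so the alphabetical order of the original Index does not matter; `.ffill()` then fills each gap from the closest preceding row in the new calendar order.

```python
import pandas as pd

year = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hand-built quarterly data, in the same alphabetical order as the exercise
quarterly = pd.DataFrame(
    {"Max TemperatureF": [84, 68, 91, 84]},
    index=pd.Index(["Apr", "Jan", "Jul", "Oct"], name="Month"),
)

# Months absent from the original index become NaN rows
monthly = quarterly.reindex(year)

# Forward-fill propagates each quarter's value to the following two months
filled = monthly.ffill()

print(monthly)
print(filled)
```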



## 6. Reindexing using another DataFrame Index
Another common technique is to reindex a DataFrame using the Index of another DataFrame. The DataFrame `.reindex()` method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its `.index` attribute.

The [Baby Names Dataset](https://www.data.gov/developers/baby-names-dataset/) from [data.gov](http://data.gov/) summarizes counts of names (with genders) from births registered in the US since 1881. 

The DataFrames `names_1981` and `names_1881` both have a MultiIndex with levels `name` and `gender` giving unique labels to counts in each row. If you're interested in seeing how the MultiIndexes were set up, `names_1981` and `names_1881` were read in using the following commands:

```python
names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))
```

As you can see by looking at their shapes, the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 as compared to 1881.

Your job here is to use the DataFrame `.reindex()` and `.dropna()` methods to make a DataFrame `common_names` counting names from 1881 that were still popular in 1981.

In [9]:
# Loading dataframes
names_1981 = pd.read_csv('data/Baby names/names1981.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1981.shape

(19455, 1)

In [10]:
# Loading dataframes
names_1881 = pd.read_csv('data/Baby names/names1881.csv', header=None, names=['name','gender','count'], index_col=(0,1))
names_1881.shape

(1935, 1)

In [11]:
# Print tail of names_1981
names_1981.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Zeferino,M,5
Zerrick,M,5
Zimbabwe,M,5
Zoltan,M,5
Zuriel,M,5


In [15]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
common_names.shape

(1935, 1)

In [20]:
# Print tail of common_names after reindexing
common_names.tail()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
name,gender,Unnamed: 2_level_1
Wiliam,M,11.0
Wilton,M,33.0
Wing,M,8.0
Wood,M,
Wright,M,6.0


In [14]:
# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
common_names.shape

(1587, 1)

Excellent work! It looks like 348 names fell out of fashion between 1881 and 1981!
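
The reindex-then-drop pattern can be tried on a toy example. In the sketch below, the two small DataFrames are hypothetical: the names come from the output above, but most counts are invented for illustration. Reindexing with the 1881 Index keeps exactly the 1881 labels, inserting `NaN` where 1981 has no matching `(name, gender)` pair, and `.dropna()` then removes those rows.

```python
import pandas as pd

levels = ["name", "gender"]

# Hypothetical miniature versions of the two tables (counts are illustrative)
names_1881 = pd.DataFrame(
    {"count": [100, 5, 6]},
    index=pd.MultiIndex.from_tuples(
        [("Mary", "F"), ("Wood", "M"), ("Wright", "M")], names=levels
    ),
)
names_1981 = pd.DataFrame(
    {"count": [500, 200, 6]},
    index=pd.MultiIndex.from_tuples(
        [("Jennifer", "F"), ("Mary", "F"), ("Wright", "M")], names=levels
    ),
)

# Keep the 1881 labels; ('Wood', 'M') has no 1981 entry, so its count is NaN
reindexed = names_1981.reindex(names_1881.index)

# Drop the labels that did not appear in 1981
common_names = reindexed.dropna()

print(reindexed)
print(common_names)
```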



## 7. Broadcasting in arithmetic formulas
In this exercise, you'll work with weather data pulled from [wunderground.com](https://www.wunderground.com/). The DataFrame `weather` has 365 rows (one for each day of 2013 in Pittsburgh, PA) and 22 columns reflecting different weather measurements each day.

You'll subset a collection of columns related to temperature measurements in degrees Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame to reflect the change of units.

Remember, ordinary arithmetic operators (like `+`, `-`, `*`, and `/`) broadcast scalar values to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions. Broadcasting also works with pandas Series and NumPy arrays.

In [24]:
# Loading data 
weather = pd.read_csv("data/pittsburgh2013.csv", index_col = "Date", parse_dates = True)

# Show info
weather.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 365 entries, 2013-01-01 to 2013-12-31
Data columns (total 22 columns):
Max TemperatureF             365 non-null int64
Mean TemperatureF            365 non-null int64
Min TemperatureF             365 non-null int64
Max Dew PointF               365 non-null int64
MeanDew PointF               365 non-null int64
Min DewpointF                365 non-null int64
Max Humidity                 365 non-null int64
Mean Humidity                365 non-null int64
Min Humidity                 365 non-null int64
Max Sea Level PressureIn     365 non-null float64
Mean Sea Level PressureIn    365 non-null float64
Min Sea Level PressureIn     365 non-null float64
Max VisibilityMiles          365 non-null int64
Mean VisibilityMiles         365 non-null int64
Min VisibilityMiles          365 non-null int64
Max Wind SpeedMPH            365 non-null int64
Mean Wind SpeedMPH           365 non-null int64
Max Gust SpeedMPH            244 non-null float64
Prec

In [30]:
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[["Min TemperatureF", "Mean TemperatureF", "Max TemperatureF"]]

# Print first 5 rows of temps_f
temps_f.head()

Unnamed: 0_level_0,Min TemperatureF,Mean TemperatureF,Max TemperatureF
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,21,28,32
2013-01-02,17,21,25
2013-01-03,16,24,32
2013-01-04,27,28,30
2013-01-05,25,30,34


In [31]:
# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace("F", "C")

# Print first 5 rows of temps_c
temps_c.head()

Unnamed: 0_level_0,Min TemperatureC,Mean TemperatureC,Max TemperatureC
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2013-01-01,-6.111111,-2.222222,0.0
2013-01-02,-8.333333,-6.111111,-3.888889
2013-01-03,-8.888889,-4.444444,0.0
2013-01-04,-2.777778,-2.222222,-1.111111
2013-01-05,-3.888889,-1.111111,1.111111


Well done! In only three lines of code, you converted 365 daily observations in each of three columns from degrees Fahrenheit to degrees Celsius.
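
A compact, self-contained version of the conversion is sketched below, using the first two rows of `temps_f` typed in by hand. The scalars `32`, `5`, and `9` are broadcast to every cell. One deliberate difference from the exercise, offered as a suggestion rather than a correction: the rename uses the regular expression `F$` so that only a trailing `F` is replaced, which would matter for a column label containing another capital `F`.

```python
import pandas as pd

# First two rows of temps_f, entered by hand
temps_f = pd.DataFrame(
    {
        "Min TemperatureF": [21, 17],
        "Mean TemperatureF": [28, 21],
        "Max TemperatureF": [32, 25],
    },
    index=pd.to_datetime(["2013-01-01", "2013-01-02"]),
)

# Scalars are broadcast across all rows and columns
temps_c = (temps_f - 32) * 5 / 9

# Replace only the final 'F' of each label with 'C'
temps_c.columns = temps_c.columns.str.replace("F$", "C", regex=True)

print(temps_c)
```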



## 8. Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP ([Gross Domestic Product](https://en.wikipedia.org/wiki/Gross_domestic_product)) since 2008.

The data has been obtained from the [Federal Reserve Bank of St. Louis](https://fred.stlouisfed.org/series/GDP) and is available in the file `GDP.csv`, which contains quarterly data; you will resample it to annual sampling and then compute the annual growth of GDP.

In [33]:
# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv("data/GDP/gdp_usa.csv", index_col = "DATE", parse_dates = True)

# Print info of gdp
gdp.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 278 entries, 1947-01-01 to 2016-04-01
Data columns (total 1 columns):
VALUE    278 non-null float64
dtypes: float64(1)
memory usage: 4.3 KB


In [36]:
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc["2008":, :]

# Print the last 8 rows of post2008
post2008.tail(8)

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2014-07-01,17569.4
2014-10-01,17692.2
2015-01-01,17783.6
2015-04-01,17998.3
2015-07-01,18141.9
2015-10-01,18222.8
2016-01-01,18281.6
2016-04-01,18436.5


In [37]:
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample("A").last()

# Print yearly
yearly

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
2008-12-31,14549.9
2009-12-31,14566.5
2010-12-31,15230.2
2011-12-31,15785.3
2012-12-31,16297.3
2013-12-31,16999.9
2014-12-31,17692.2
2015-12-31,18222.8
2016-12-31,18436.5


In [38]:
# Compute percentage growth of yearly: yearly['growth']
yearly['growth'] = yearly['VALUE'].pct_change() * 100

# Print yearly again
yearly

Unnamed: 0_level_0,VALUE,growth
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1
2008-12-31,14549.9,
2009-12-31,14566.5,0.11409
2010-12-31,15230.2,4.556345
2011-12-31,15785.3,3.644732
2012-12-31,16297.3,3.243524
2013-12-31,16999.9,4.311144
2014-12-31,17692.2,4.072377
2015-12-31,18222.8,2.999062
2016-12-31,18436.5,1.172707


Fantastic! Note that the first row of the `'growth'` column is `NaN` because there is no data for the year 2007.
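
To see what `.pct_change()` computes, the sketch below compares it with the explicit formula `(value / previous_value - 1) * 100`, written with `.shift(1)`. It uses the first three yearly values from the table above, entered by hand with plain integer years as the index; the variable names are illustrative.

```python
import pandas as pd

# First three year-end GDP values from the table above
yearly = pd.Series([14549.9, 14566.5, 15230.2], index=[2008, 2009, 2010], name="VALUE")

# Built-in percent change, scaled to percent
growth = yearly.pct_change() * 100

# Same calculation spelled out: divide each value by the previous row's value
manual = (yearly / yearly.shift(1) - 1) * 100

print(growth)
print(manual)
```

The first entry is `NaN` in both versions because 2008 has no preceding row to compare against.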



## 9. Converting currency of stocks
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained from [Yahoo Finance](https://consent.yahoo.com/collectConsent?sessionId=3_cc-session_363d0eb2-dd05-47c5-806d-a5e1755139b4&lang=&inline=false). 

Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and Close column prices.

In [41]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv("data/sp500.csv", index_col = "Date", parse_dates = True)

# Print tail of sp500
sp500.tail()

Unnamed: 0_level_0,Open,High,Low,Close,Volume,Adj Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2015-12-24,2063.52002,2067.360107,2058.72998,2060.98999,1411860000,2060.98999
2015-12-28,2057.77002,2057.77002,2044.199951,2056.5,2492510000,2056.5
2015-12-29,2060.540039,2081.560059,2060.540039,2078.360107,2542000000,2078.360107
2015-12-30,2077.340088,2077.340088,2061.969971,2063.360107,2367430000,2063.360107
2015-12-31,2060.590088,2062.540039,2043.619995,2043.939941,2655330000,2043.939941


In [42]:
# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv("data/exchange.csv", index_col = "Date", parse_dates = True)

# Print tail of exchange
exchange.tail()

Unnamed: 0_level_0,GBP/USD
Date,Unnamed: 1_level_1
2015-12-23,0.67285
2015-12-24,0.66926
2015-12-29,0.67597
2015-12-30,0.67427
2015-12-31,0.6782


In [43]:
# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[["Open", "Close"]]

# Print the head of dollars
dollars.head()

Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,2058.899902,2058.199951
2015-01-05,2054.439941,2020.579956
2015-01-06,2022.150024,2002.609985
2015-01-07,2005.550049,2025.900024
2015-01-08,2030.609985,2062.139893


In [61]:
# Convert dollars to pounds using * operator: pounds
pounds = dollars * exchange["GBP/USD"]

# Print the tail of pounds to show it won't work
pounds.tail()

Unnamed: 0_level_0,2015-01-02 00:00:00,2015-01-05 00:00:00,2015-01-06 00:00:00,2015-01-07 00:00:00,2015-01-08 00:00:00,2015-01-09 00:00:00,2015-01-12 00:00:00,2015-01-13 00:00:00,2015-01-14 00:00:00,2015-01-15 00:00:00,...,2015-12-18 00:00:00,2015-12-21 00:00:00,2015-12-22 00:00:00,2015-12-23 00:00:00,2015-12-24 00:00:00,2015-12-29 00:00:00,2015-12-30 00:00:00,2015-12-31 00:00:00,Close,Open
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-12-24,,,,,,,,,,,...,,,,,,,,,,
2015-12-28,,,,,,,,,,,...,,,,,,,,,,
2015-12-29,,,,,,,,,,,...,,,,,,,,,,
2015-12-30,,,,,,,,,,,...,,,,,,,,,,
2015-12-31,,,,,,,,,,,...,,,,,,,,,,


In [44]:
# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange["GBP/USD"], axis = "rows")

# Print the head of pounds
pounds.head()

Unnamed: 0_level_0,Open,Close
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2015-01-02,1340.364425,1339.90875
2015-01-05,1348.616555,1326.389506
2015-01-06,1332.51598,1319.639876
2015-01-07,1330.562125,1344.063112
2015-01-08,1343.268811,1364.126161


Excellent! Now that you've become familiar with how to share information between DataFrames, you'll learn about concatenating DataFrames next.
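
Why does `*` fail while `.multiply()` works? With the `*` operator, pandas matches the Series Index against the DataFrame's column labels. Here the Series is indexed by dates and the columns are `Open` and `Close`, so nothing matches and every cell is `NaN`. Passing `axis="index"` (the documented spelling; the exercise uses `"rows"`) matches the Series against the row labels instead. The sketch below uses three rows of `dollars` and two exchange rates copied by hand from the tables above.

```python
import pandas as pd

# Three trading days from sp500, entered by hand
dollars = pd.DataFrame(
    {
        "Open": [2063.52002, 2057.77002, 2060.540039],
        "Close": [2060.98999, 2056.5, 2078.360107],
    },
    index=pd.to_datetime(["2015-12-24", "2015-12-28", "2015-12-29"]),
)

# Exchange rates; as in the table above, 2015-12-28 has no entry
rate = pd.Series(
    [0.66926, 0.67597],
    index=pd.to_datetime(["2015-12-24", "2015-12-29"]),
    name="GBP/USD",
)

# Aligns the dates with the column labels 'Open'/'Close': no matches, all NaN
wrong = dollars * rate

# Aligns the dates with the row labels, as intended
pounds = dollars.multiply(rate, axis="index")

print(wrong)
print(pounds)
```

Rows without a matching exchange rate, such as 2015-12-28 here, come out as `NaN` in `pounds`.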

