## Re-indexing

In [1]:
import pandas as pd
import numpy as np
from glob import glob

A dataframe's index is the means by which a dataframe's rows are identified. On occasion dataframes need to be re-indexed (the sequence of rows re-arranged) so that they can be combined. Two pandas methods are commonly used to re-order rows: `sort_index` and `sort_values`.
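The difference between the two methods can be sketched on a small inline frame. The name `weather_demo` and its five rows are assumptions made for illustration, so the snippet runs without any CSV file.

```python
import pandas as pd

# Hypothetical five-row sample, indexed by month label
weather_demo = pd.DataFrame(
    {'Max TemperatureF': [68, 60, 68, 84, 88]},
    index=pd.Index(['Jan', 'Feb', 'Mar', 'Apr', 'May'], name='Month'),
)

# sort_index orders rows by their labels (alphabetical for strings)
by_label = weather_demo.sort_index()
by_label_desc = weather_demo.sort_index(ascending=False)

# sort_values orders rows by the contents of a column
by_value = weather_demo.sort_values('Max TemperatureF')

print(by_label.index.tolist())
print(by_label_desc.index.tolist())
print(by_value.index[0], by_value.index[-1])
```

Both methods return a new dataframe and leave `weather_demo` unchanged.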

**Note**:

`indices` refers to two or more row labels within a single dataframe.
`indexes` refers to the indexes of two or more dataframes.

In [2]:
weather = pd.read_csv('./data/monthly_max_temp.csv', index_col='Month')
weather.head()

Month,Max TemperatureF
Jan,68
Feb,60
Mar,68
Apr,84
May,88


In [3]:
# sort index alphabetically using the sort_index
weather.sort_index()

Month,Max TemperatureF
Apr,84
Aug,86
Dec,68
Feb,60
Jan,68
Jul,91
Jun,89
Mar,68
May,88
Nov,72


In [4]:
# sort index in reverse alphabetical order
weather.sort_index(ascending=False)

Month,Max TemperatureF
Sep,90
Oct,84
Nov,72
May,88
Mar,68
Jun,89
Jul,91
Jan,68
Feb,60
Dec,68


In [5]:
# sort rows numerically using a column
weather.sort_values('Max TemperatureF')

Month,Max TemperatureF
Feb,60
Jan,68
Mar,68
Dec,68
Nov,72
Apr,84
Oct,84
Aug,86
May,88
Jun,89


In [6]:
# If we import the weather data and use the default 'range' index
weather = pd.read_csv('./data/monthly_max_temp.csv')
weather.head()

,Month,Max TemperatureF
0,Jan,68
1,Feb,60
2,Mar,68
3,Apr,84
4,May,88


In [7]:
# sorting the index using 'sort_index' 
weather.sort_index().head()

,Month,Max TemperatureF
0,Jan,68
1,Feb,60
2,Mar,68
3,Apr,84
4,May,88


In [8]:
weather.sort_index(ascending=False).head()

,Month,Max TemperatureF
11,Dec,68
10,Nov,72
9,Oct,84
8,Sep,90
7,Aug,86


We can also re-order a dataframe's index using the `reindex` method, which applies a deliberate order when passed a list of labels in the order required. You can also pass another dataframe's index.

In [9]:
mean_temps = pd.read_csv('./data/monthly_mean_temps.csv', index_col='Month')
mean_temps.head()

Month,Mean TemperatureF
Jan,32.133333
Apr,61.956044
Jul,68.934783
Oct,43.434783


In [10]:
# impose the index order, returns a new dataframe 
reindexed = mean_temps.reindex(['Oct', 'Jan', 'Apr', 'Jul'])
reindexed

Month,Mean TemperatureF
Oct,43.434783
Jan,32.133333
Apr,61.956044
Jul,68.934783


`sort_index` sorts the index labels, which for string labels means alphabetical order.

In [11]:
# re-orders alphabetically
reindexed.sort_index()

Month,Mean TemperatureF
Apr,61.956044
Jan,32.133333
Jul,68.934783
Oct,43.434783


When `reindex` is given labels that do not exist in the original index, pandas adds rows for them filled with `NaN` values.

In [12]:
months = [
 'Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec'
]

mean_temps.reindex(months) #  returns a new dataframe

Month,Mean TemperatureF
Jan,32.133333
Feb,
Mar,
Apr,61.956044
May,
Jun,
Jul,68.934783
Aug,
Sep,
Oct,43.434783


We can replace the `NaN` values by chaining the `ffill` method, which replaces each null value with the last preceding non-null value.

In [13]:
mean_temps.reindex(months).ffill()

Month,Mean TemperatureF
Jan,32.133333
Feb,32.133333
Mar,32.133333
Apr,61.956044
May,61.956044
Jun,61.956044
Jul,68.934783
Aug,68.934783
Sep,68.934783
Oct,43.434783
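The reindex-then-fill pattern can be reproduced without the CSV file. This is a minimal sketch that rebuilds the four quarterly means shown earlier; `mean_demo` and `months_demo` are assumed names used only here.

```python
import pandas as pd

# The four quarterly mean temperatures shown earlier, rebuilt inline
mean_demo = pd.DataFrame(
    {'Mean TemperatureF': [32.133333, 61.956044, 68.934783, 43.434783]},
    index=pd.Index(['Jan', 'Apr', 'Jul', 'Oct'], name='Month'),
)

months_demo = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Labels missing from mean_demo come back as NaN rows
expanded = mean_demo.reindex(months_demo)

# Forward fill copies the last non-null value down into each gap
filled = expanded.ffill()

print(expanded['Mean TemperatureF'].isna().sum())  # number of gaps created
print(filled.loc['Dec', 'Mean TemperatureF'])      # inherits the Oct value
```

Forward fill can only look backwards, so a gap at the very start of the index would stay `NaN`.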


Another common technique is to reindex a DataFrame using the Index of another DataFrame. This allows you to discover where two dataframes overlap.

The DataFrame `.reindex()` method can accept the Index of a DataFrame or Series as input. You can access the Index of a DataFrame with its `.index` attribute.

In [14]:
names_1881 = pd.read_csv(
    './data/Baby names/names1881.csv',
    header=None,
    names=['name', 'gender', 'count'],
    index_col=(0,1)
)

names_1981 = pd.read_csv(
    './data/Baby names/names1981.csv',
    header=None,
    names=['name', 'gender', 'count'],
    index_col=(0,1)
)
print(names_1881.shape)
print(names_1981.shape)
names_1881.head(10)

(1935, 1)
(19455, 1)


name,gender,count
Mary,F,6919
Anna,F,2698
Emma,F,2034
Elizabeth,F,1852
Margaret,F,1658
Minnie,F,1653
Ida,F,1439
Annie,F,1326
Bertha,F,1324
Alice,F,1308


The DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity of names in 1981 compared to 1881. We'll use the `.reindex()` and `.dropna()` methods to make a DataFrame `common_names` counting names from 1881 that were still in use in 1981.

First, create a new DataFrame `common_names` by reindexing `names_1981` using the Index of the DataFrame `names_1881` of older names.

In [15]:
# keep those rows in 'names_1981', present in 'names_1881'
common_names = names_1981.reindex(names_1881.index)
print(common_names.shape)
common_names.head(10)

(1935, 1)


name,gender,count
Mary,F,11030.0
Anna,F,5182.0
Emma,F,532.0
Elizabeth,F,20168.0
Margaret,F,2791.0
Minnie,F,56.0
Ida,F,206.0
Annie,F,973.0
Bertha,F,209.0
Alice,F,745.0


Drop the rows of `common_names` that have null counts using the `.dropna()` method. These rows correspond to names that fell out of fashion between 1881 and 1981.

In [16]:
common_names = common_names.dropna()
print(common_names.shape)

(1587, 1)
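The reindex-then-drop pattern can be sketched with two tiny inline frames. The frames `old` and `new` are hypothetical: the Mary and Anna counts are taken from the outputs above, while the 'Zilpha' and 'Jennifer' rows are invented purely to create a mismatch.

```python
import pandas as pd

# Hypothetical stand-in for names_1881 (the 'Zilpha' row is invented)
old = pd.DataFrame(
    {'count': [6919, 2698, 120]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('Anna', 'F'), ('Zilpha', 'F')],
        names=['name', 'gender'],
    ),
)

# Hypothetical stand-in for names_1981 (the 'Jennifer' row is invented)
new = pd.DataFrame(
    {'count': [11030, 5182, 32000]},
    index=pd.MultiIndex.from_tuples(
        [('Mary', 'F'), ('Anna', 'F'), ('Jennifer', 'F')],
        names=['name', 'gender'],
    ),
)

# Look up the newer counts using the older labels; unmatched labels become NaN
aligned = new.reindex(old.index)

# Dropping the NaN rows leaves only the labels present in both frames
common = aligned.dropna()

print(aligned)
print(common)
```

Because `NaN` is a float, the `count` column becomes `float64` after reindexing, which is why the counts above print as `11030.0` rather than `11030`.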


## Working with TimeSeries Index

In [17]:
weather = pd.read_csv(
    './data/pittsburgh2013.csv',
    index_col='Date',
    parse_dates=True
)
weather.columns

Index(['Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'Mean Dew PointF', 'Min DewpointF', 'Max Humidity',
       'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn',
       'Mean Sea Level PressureIn', 'Min Sea Level PressureIn',
       'Max VisibilityMiles', 'Mean VisibilityMiles', 'Min VisibilityMiles',
       'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH',
       'PrecipitationIn', ' CloudCover', 'Events', 'WindDirDegrees'],
      dtype='object')

In [18]:
weather.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10',
               ...
               '2013-12-22', '2013-12-23', '2013-12-24', '2013-12-25',
               '2013-12-26', '2013-12-27', '2013-12-28', '2013-12-29',
               '2013-12-30', '2013-12-31'],
              dtype='datetime64[ns]', name='Date', length=365, freq=None)

When the index consists of datetime objects, we can use date strings to select rows.

In [19]:
weather.loc['2013-01-05':'2013-01-10', 'PrecipitationIn']

Date
2013-01-05    0.21
2013-01-06    0.26
2013-01-07    0.06
2013-01-08    0.00
2013-01-09    0.03
2013-01-10    0.00
Name: PrecipitationIn, dtype: float64
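String-based selection on a `DatetimeIndex` can be tried on generated data. This is a sketch under assumptions: `precip` is a made-up series of 60 daily values, not the Pittsburgh measurements.

```python
import pandas as pd

# Hypothetical daily series covering 60 days from 1 January 2013
idx = pd.date_range('2013-01-01', periods=60, freq='D', name='Date')
precip = pd.Series(range(60), index=idx, name='PrecipitationIn', dtype='float64')

# A slice of date strings includes both endpoints
window = precip.loc['2013-01-05':'2013-01-10']

# A partial string such as year-month selects every row in that period
january = precip.loc['2013-01']

print(len(window))
print(len(january))
```

Unlike positional slicing, label-based slicing with `.loc` keeps the end label, so the six-day window above has six rows.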

We can apply arithmetic operations to our selection, applied element-wise.

In [20]:
weather.loc['2013-01-05':'2013-01-10', 'PrecipitationIn'] * 2.54

Date
2013-01-05    0.5334
2013-01-06    0.6604
2013-01-07    0.1524
2013-01-08    0.0000
2013-01-09    0.0762
2013-01-10    0.0000
Name: PrecipitationIn, dtype: float64

Use the `divide` method to divide a dataframe by a Series row by row. For example, to express a temperature range relative to the mean, divide both the `Min` and `Max` temperature columns by the `Mean` temperature column; multiplying the result by 100 gives percentages.

In [21]:
# 'grab' the temp range for the date range in question
temp_range = weather.loc[
    '2013-07-01':'2013-07-07', 
    ['Min TemperatureF', 'Max TemperatureF']
]
temp_range

Date,Min TemperatureF,Max TemperatureF
2013-07-01,66,79
2013-07-02,66,84
2013-07-03,71,86
2013-07-04,70,86
2013-07-05,69,86
2013-07-06,70,89
2013-07-07,70,77


In [22]:
# 'grab' the mean temperatures for the date range
mean_range = weather.loc[
    '2013-07-01':'2013-07-07',
    'Mean TemperatureF'
]
mean_range

Date
2013-07-01    72
2013-07-02    74
2013-07-03    78
2013-07-04    77
2013-07-05    76
2013-07-06    78
2013-07-07    72
Name: Mean TemperatureF, dtype: int64

Dividing the two with `/` yields only `NaN` values: by default pandas aligns the index of the Series (the dates) with the column labels of the dataframe, and none of them match.

In [23]:
temp_range / mean_range



Date,2013-07-01 00:00:00,2013-07-02 00:00:00,2013-07-03 00:00:00,2013-07-04 00:00:00,2013-07-05 00:00:00,2013-07-06 00:00:00,2013-07-07 00:00:00,Min TemperatureF,Max TemperatureF
2013-07-01,,,,,,,,,
2013-07-02,,,,,,,,,
2013-07-03,,,,,,,,,
2013-07-04,,,,,,,,,
2013-07-05,,,,,,,,,
2013-07-06,,,,,,,,,
2013-07-07,,,,,,,,,


The answer is the `divide` method: the `axis='rows'` argument aligns the Series with the dataframe's row index, so each column is divided by the mean temperature for the matching date.

In [24]:
temp_range.divide(mean_range, axis='rows')

Date,Min TemperatureF,Max TemperatureF
2013-07-01,0.916667,1.097222
2013-07-02,0.891892,1.135135
2013-07-03,0.910256,1.102564
2013-07-04,0.909091,1.116883
2013-07-05,0.907895,1.131579
2013-07-06,0.897436,1.141026
2013-07-07,0.972222,1.069444
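The contrast between the `/` operator and `divide` can be reproduced with the first three days of the selection above. The names `temp_demo` and `mean_series` are assumptions; `axis='index'` is used here as an equivalent spelling of `axis='rows'`.

```python
import warnings

import pandas as pd

dates = pd.date_range('2013-07-01', periods=3, freq='D', name='Date')

# First three rows of the July selection shown above
temp_demo = pd.DataFrame(
    {'Min TemperatureF': [66, 66, 71], 'Max TemperatureF': [79, 84, 86]},
    index=dates,
)
mean_series = pd.Series([72, 74, 78], index=dates, name='Mean TemperatureF')

# The operator matches the Series index (dates) against the column labels,
# finds nothing in common, and returns a frame of NaN values
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas may warn that mixed labels cannot be sorted
    misaligned = temp_demo / mean_series

# divide with axis='index' matches the Series index against the row labels
ratio = temp_demo.divide(mean_series, axis='index')

print(misaligned.isna().all().all())
print(ratio.round(6))
```

The same `axis` argument is available on the sibling methods `add`, `subtract` and `multiply`.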


The `pct_change` method calculates the fractional change between each value in a column and the value before it: the difference between the current and previous values, divided by the previous value. For example, the entry for '2013-07-03' is (78 - 74) / 74 * 100, or about 5.41. The first row has no previous value, so its result is `NaN`.

In [25]:
mean_range.pct_change() * 100 # multiply by 100 to yield a % value

Date
2013-07-01         NaN
2013-07-02    2.777778
2013-07-03    5.405405
2013-07-04   -1.282051
2013-07-05   -1.298701
2013-07-06    2.631579
2013-07-07   -7.692308
Name: Mean TemperatureF, dtype: float64
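To check what `pct_change` computes, the same figures can be derived by hand with `shift`. This is a small sketch using the first four mean temperatures from the output above; `mean_temp` and `manual` are names chosen for illustration.

```python
import pandas as pd

# First four mean temperatures from the July selection
mean_temp = pd.Series([72, 74, 78, 77], name='Mean TemperatureF')

# shift(1) moves every value down one row, giving the "previous" value
previous = mean_temp.shift(1)

# Change relative to the previous value, expressed as a percentage
manual = (mean_temp - previous) / previous * 100

# The built-in method performs the same calculation
builtin = mean_temp.pct_change() * 100

print(manual.round(6).tolist())
print(builtin.round(6).tolist())
```

The denominator is the previous row's value, not the column mean, which is why a drop from 78 to 77 is reported as roughly -1.28%.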

### Adding dataframes where the indexes differ

In [26]:
filenames = glob('./data/Summer Olympic medals/*_top5.csv')
medal_list = [pd.read_csv(f, index_col='Country') for f in filenames]
medal_list

[                 Total
 Country               
 United States   1195.0
 Soviet Union     627.0
 United Kingdom   591.0
 France           461.0
 Italy            394.0,                  Total
 Country               
 United States   2088.0
 Soviet Union     838.0
 United Kingdom   498.0
 Italy            460.0
 Germany          407.0,                  Total
 Country               
 United States   1052.0
 Soviet Union     584.0
 United Kingdom   505.0
 France           475.0
 Germany          454.0]

pandas provides the `+` operator and the `add` method to add dataframes together in an element-wise fashion.

In [27]:
medal_list[0] + medal_list[1]

Country,Total
France,
Germany,
Italy,854.0
Soviet Union,1465.0
United Kingdom,1089.0
United States,3283.0


When using the `+` operator, the result is `NaN` for any index label that is not found in every dataframe. Arithmetic is only carried out on rows whose index labels are common to all operands.

In [28]:
medal_list[0] + medal_list[1] + medal_list[2]

Country,Total
France,
Germany,
Italy,
Soviet Union,2049.0
United Kingdom,1594.0
United States,4335.0


The `add` method is more flexible: with the `fill_value=0` argument, a label that is missing from one of the two dataframes is treated as `0` rather than `NaN`, so the addition still returns a value.

In [29]:
medal_list[0].add(medal_list[1], fill_value=0).add(medal_list[2], fill_value=0)

Country,Total
France,936.0
Germany,861.0
Italy,854.0
Soviet Union,2049.0
United Kingdom,1594.0
United States,4335.0
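When there are more than two dataframes, chaining `add` by hand becomes tedious. One possible approach, sketched here with `functools.reduce`, folds the whole list into a single total. The three frames are rebuilt inline from the printed medal tables; the helper `make_medals` is hypothetical.

```python
from functools import reduce

import pandas as pd


def make_medals(totals):
    """Build a one-column medal table from a {country: total} mapping."""
    return pd.DataFrame({'Total': pd.Series(totals, dtype='float64')}).rename_axis('Country')


medals_demo = [
    make_medals({'United States': 1195, 'Soviet Union': 627,
                 'United Kingdom': 591, 'France': 461, 'Italy': 394}),
    make_medals({'United States': 2088, 'Soviet Union': 838,
                 'United Kingdom': 498, 'Italy': 460, 'Germany': 407}),
    make_medals({'United States': 1052, 'Soviet Union': 584,
                 'United Kingdom': 505, 'France': 475, 'Germany': 454}),
]

# Plain addition: any country missing from one table ends up as NaN
plain = medals_demo[0] + medals_demo[1] + medals_demo[2]

# Fold the list pairwise, treating a missing country as 0 at each step
total = reduce(lambda left, right: left.add(right, fill_value=0), medals_demo)

print(plain)
print(total)
```

Note that `fill_value` only helps when a label is missing from one side; a label absent from both operands would not appear in the result at all.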


In [30]:
print(weather.shape)
weather.head().T

(365, 22)


Date,2013-01-01 00:00:00,2013-01-02 00:00:00,2013-01-03 00:00:00,2013-01-04 00:00:00,2013-01-05 00:00:00
Max TemperatureF,32,25.0,32.0,30.0,34.0
Mean TemperatureF,28,21.0,24.0,28.0,30.0
Min TemperatureF,21,17.0,16.0,27.0,25.0
Max Dew PointF,30,14.0,19.0,21.0,23.0
Mean Dew PointF,27,12.0,15.0,19.0,20.0
Min DewpointF,16,10.0,9.0,17.0,16.0
Max Humidity,100,77.0,77.0,75.0,75.0
Mean Humidity,89,67.0,67.0,68.0,68.0
Min Humidity,77,55.0,56.0,59.0,61.0
Max Sea Level PressureIn,30.1,30.27,30.25,30.28,30.42


Convert the max, min and mean temperatures from `F` to `C`.

**Note**: 

Ordinary arithmetic operators (such as `+`, `-`, `*` and `/`) broadcast scalar values across a DataFrame when scalars and DataFrames are combined in an arithmetic expression. Broadcasting also works with pandas Series and NumPy arrays.

In [31]:
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5 / 9

# Replace 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')

temps_c.head()

Date,Min TemperatureC,Mean TemperatureC,Max TemperatureC
2013-01-01,-6.111111,-2.222222,0.0
2013-01-02,-8.333333,-6.111111,-3.888889
2013-01-03,-8.888889,-4.444444,0.0
2013-01-04,-2.777778,-2.222222,-1.111111
2013-01-05,-3.888889,-1.111111,1.111111
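Broadcasting can be sketched on a two-row frame built from the first two days shown above. `temps_f_demo` and `offsets` are assumed names. The rename here uses a regular expression anchored to the end of the label, a variation that only touches a trailing `F`.

```python
import numpy as np
import pandas as pd

# Min and max temperatures for the first two days shown above
temps_f_demo = pd.DataFrame(
    {'Min TemperatureF': [21, 17], 'Max TemperatureF': [32, 25]}
)

# Scalars are broadcast to every cell of the frame
temps_c_demo = (temps_f_demo - 32) * 5 / 9

# Replace only a trailing 'F', leaving any other 'F' in a label alone
temps_c_demo.columns = temps_c_demo.columns.str.replace(r'F$', 'C', regex=True)

# A 1-D NumPy array with one entry per column is broadcast down the rows
offsets = np.array([0, 10])  # hypothetical per-column offsets
shifted = temps_f_demo + offsets

print(temps_c_demo.round(6))
print(shifted)
```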


In [32]:
gdp = pd.read_csv('./data/GDP/gdp_usa.csv', index_col='DATE', parse_dates=True)
gdp.head()

DATE,VALUE
1947-01-01,243.1
1947-04-01,246.3
1947-07-01,250.1
1947-10-01,260.3
1948-01-01,266.2


In [33]:
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp['2008':]

# Print the last 8 rows of post2008
print(post2008.tail(8))

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5


Create the DataFrame `yearly` by resampling the slice `post2008` by year. Remember, you need to chain `.resample()` (using the alias 'A' for annual frequency) with some kind of aggregation; use the aggregation method `.last()` to select the last element in each year. Newer pandas releases (2.2 and later) deprecate the 'A' alias in favor of 'YE'.

In [34]:
# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()

# Print yearly
print(yearly)

              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5


Compute the percentage growth of the resampled DataFrame yearly with `.pct_change() * 100`.

In [35]:
# Compute percentage growth of the VALUE column: yearly['growth']
yearly['growth'] = yearly['VALUE'].pct_change() * 100

# Print yearly again
print(yearly)

              VALUE    growth
DATE                         
2008-12-31  14549.9       NaN
2009-12-31  14566.5  0.114090
2010-12-31  15230.2  4.556345
2011-12-31  15785.3  3.644732
2012-12-31  16297.3  3.243524
2013-12-31  16999.9  4.311144
2014-12-31  17692.2  4.072377
2015-12-31  18222.8  2.999062
2016-12-31  18436.5  1.172707
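The resample-then-`pct_change` steps can be reproduced without the GDP file. This sketch uses the last eight quarterly values printed above; `gdp_demo` and `yearly_demo` are assumed names. It tries the newer 'YE' alias first and falls back to 'A' on older pandas versions.

```python
import pandas as pd

# The last eight quarterly GDP values printed above
gdp_demo = pd.DataFrame(
    {'VALUE': [17569.4, 17692.2, 17783.6, 17998.3,
               18141.9, 18222.8, 18281.6, 18436.5]},
    index=pd.DatetimeIndex(
        ['2014-07-01', '2014-10-01', '2015-01-01', '2015-04-01',
         '2015-07-01', '2015-10-01', '2016-01-01', '2016-04-01'],
        name='DATE',
    ),
)

try:
    # Year-end alias accepted by recent pandas versions
    yearly_demo = gdp_demo.resample('YE').last()
except ValueError:
    # Older pandas versions only know the 'A' alias
    yearly_demo = gdp_demo.resample('A').last()

# Growth of each year's final value relative to the previous year's
yearly_demo['growth'] = yearly_demo['VALUE'].pct_change() * 100

print(yearly_demo)
```

Each resampled row is labelled with the last day of its year, even though the final observation in the data is dated earlier.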


Using the files `sp500.csv` for the S&P 500 prices and `exchanges.csv` for the exchange rates, convert the prices in both the Open and Close columns from US dollars to British pounds.

In [37]:
# Read 'sp500.csv' into a DataFrame: sp500
sp500 = pd.read_csv('./data/sp500.csv', index_col='Date', parse_dates=True)

# Read 'exchanges.csv' into a DataFrame: exchange
exchange = pd.read_csv('./data/exchanges.csv', index_col='Date', parse_dates=True)

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
print(dollars.head())

# Construct a new DataFrame pounds by converting US dollars to British pounds. 
# use the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'
pounds = dollars.multiply(exchange['GBP/USD'], axis='rows')

# Print the head of pounds
print(pounds.head())

                   Open        Close
Date                                
2015-01-02  2058.899902  2058.199951
2015-01-05  2054.439941  2020.579956
2015-01-06  2022.150024  2002.609985
2015-01-07  2005.550049  2025.900024
2015-01-08  2030.609985  2062.139893
                   Open        Close
Date                                
2015-01-02  1340.364425  1339.908750
2015-01-05  1348.616555  1326.389506
2015-01-06  1332.515980  1319.639876
2015-01-07  1330.562125  1344.063112
2015-01-08  1343.268811  1364.126161
