# Introduction

This is an exploratory analysis of US freight data with the following purposes:
* to make animated visualizations with the help of matplotlib and pyplot,
* to find some interesting trends and to search the web for additional explanatory information that supports or challenges them.

I was interested in the monetary measurement of the data, so I took the columns with values (value_2012, value_2013 and so on).
The code below can be adapted to other measurements like tons or miles.
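
As a minimal sketch of that adaptation, the column names for any measure can be generated from a prefix and a list of years. The helper name `measure_columns` is hypothetical, and the `tons` prefix is an assumption about how the other columns are named; only the `value_` prefix appears in this notebook.

```python
# Hypothetical helper: build column names for one measure.
# Only the 'value' prefix is confirmed by this notebook; 'tons' is an assumption.
def measure_columns(prefix, years):
    """Return column names such as 'value_2012' for each year."""
    return ['{}_{}'.format(prefix, year) for year in years]

sketch_years = ['2012', '2013', '2014', '2015', '2020',
                '2025', '2030', '2035', '2040', '2045']

value_cols = measure_columns('value', sketch_years)
tons_cols = measure_columns('tons', sketch_years)  # assumed prefix
print(value_cols[:2])
```

Swapping the prefix in one place would then be enough to rerun the same aggregation for another measure, provided the columns really follow this naming pattern.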

In [None]:
# load libraries and set basic options
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import shapefile as shp
import plotly.graph_objs as go
import random
import io
import base64

from matplotlib.animation import FFMpegWriter
from matplotlib import gridspec
from matplotlib.patches import Polygon
from matplotlib import animation, rc, rcParams
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from itertools import chain

#%matplotlib inline
from IPython.display import HTML, Image
rc('animation', html='html5')

init_notebook_mode(connected=True)
np.set_printoptions(formatter={'int_kind': '{:,}'.format})

The file "FAF4 User Guide" provides detailed explanations of the data in the tables. For this analysis I use the file "FAF4_Regional", as it provides data by region, which I think is more interesting for visualization on maps.

Data from the User Guide is implemented through the dictionaries and lists below.
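
The commodity codes are not consecutive: the commodity dictionary in this notebook ends up with keys 1 to 41, then 43 and 99. A minimal sketch of an alternative construction pairs the names with an explicit code list instead of renumbering afterwards. The function name `build_code_dict` and the placeholder names are hypothetical.

```python
# Sketch: pair commodity names with an explicit list of codes.
# The code list (1-41, 43, 99) mirrors the keys the notebook's com_dict ends up with.
def build_code_dict(names):
    codes = list(range(1, 42)) + [43, 99]
    if len(names) != len(codes):
        raise ValueError('expected {} names, got {}'.format(len(codes), len(names)))
    return dict(zip(codes, names))

# placeholder names stand in for the real commodity list
placeholder_names = ['commodity {}'.format(k) for k in range(43)]
sketch_dict = build_code_dict(placeholder_names)
print(sorted(sketch_dict)[-3:])
```

The length check makes a missing or duplicated name fail loudly instead of silently shifting every later code.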

In [None]:
# prepare commodity dictionary based on User Guide
com_list = ['Animals and Fish (live)', 'Cereal Grains (includes seed)',
            'Agricultural Products (excludes Animal Feed, Cereal Grains, and Forage Products)',
            'Animal Feed, Eggs, Honey, and Other Products of Animal Origin',
            'Meat, Poultry, Fish, Seafood, and Their Preparations',
            'Milled Grain Products and Preparations, and Bakery Products',
            'Other Prepared Foodstuffs, Fats and Oils',
            'Alcoholic Beverages and Denatured Alcohol',
            'Tobacco Products', 'Monumental or Building Stone',
            'Natural Sands', 'Gravel and Crushed Stone (excludes Dolomite and Slate)',
            'Other Non-Metallic Minerals not elsewhere classified',
            'Metallic Ores and Concentrates', 'Coal', 'Crude Petroleum',
            'Gasoline, Aviation Turbine Fuel, and Ethanol (includes Kerosene, and Fuel Alcohols)',
            'Fuel Oils (includes Diesel, Bunker C, and Biodiesel)',
            'Other Coal and Petroleum Products, not elsewhere classified',
            'Basic Chemicals', 'Pharmaceutical Products', 'Fertilizers',
            'Other Chemical Products and Preparations', 'Plastics and Rubber',
            'Logs and Other Wood in the Rough', 'Wood Products',
            'Pulp, Newsprint, Paper, and Paperboard', 'Paper or Paperboard Articles',
            'Printed Products', 'Textiles, Leather, and Articles of Textiles or Leather',
            'Non-Metallic Mineral Products', 
            'Base Metal in Primary or Semi-Finished Forms and in Finished Basic Shapes',
            'Articles of Base Metal', 'Machinery', 
            'Electronic and Other Electrical Equipment and Components, and Office Equipment',
            'Motorized and Other Vehicles (includes parts)', 'Transportation Equipment, not elsewhere classified',
            'Precision Instruments and Apparatus', 
            'Furniture, Mattresses and Mattress Supports, Lamps, Lighting Fittings, and Illuminated Signs',
            'Miscellaneous Manufactured Products', 'Waste and Scrap (excludes of agriculture or food)',
            'Mixed Freight', 'Commodity unknown']

com_dict = {}
for i in range(1,len(com_list)+1):
    com_dict[i] = com_list[i-1]

# make sure that dictionary has same numeration as User Guide
com_dict[99] = com_dict.pop(len(com_list))
com_dict[43] = com_dict.pop(42)

# dictionary of foreign trading partners
fr_dict = {801: "Canada", 802: "Mexico",
          803: "Rest of Americas", 804: "Europe",
          805: "Africa", 806: "SW & Central Asia",
          807: "Eastern Asia", 808: "SE Asia & Oceania",
          'total': "Total"}

# transportation mode dictionary and list
mode_dict = {1: "Truck", 2: "Rail", 3: "Water", 4: "Air",
             5: "Multimode and mail", 6: "Pipeline",
             7: "Other/unknown", 8: "No domestic mode"}

mode_list = ["Truck", "Rail", "Water", "Air",
             "Multimode and mail", "Pipeline",
             "Other/unknown"]

years = ['2012', '2013', '2014','2015','2020',
         '2025', '2030', '2035', '2040', '2045']

In [None]:
regions_df = pd.read_csv('../input/FAF4_Regional.csv',
                         dtype = {'dms_orig': str, 'dms_dest': str})
reader = shp.Reader('../input/CFS_AREA_shapefile_010215/CFS_AREA_shapefile_010215')

# Choropleth maps

Choropleth maps provide good visualization for data that can be combined with polygons from shapefiles, as is the case with this database.

I decided to make animated choropleth maps to show changes from 2012 to 2045 (forecast data is provided).
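
The core of a choropleth is turning each region's value into a color index. A minimal sketch of that step, using the same `np.linspace` plus `np.digitize` approach as the map cells in this notebook; the function name `color_bins` and the toy values are hypothetical.

```python
import numpy as np

def color_bins(values, num_colors):
    """Return evenly spaced bin edges and a 0-based color index per value."""
    bins = np.linspace(values.min(), values.max(), num_colors)
    # np.digitize returns 1-based positions, so subtract 1 to index a color list
    idx = np.digitize(values, bins) - 1
    return bins, idx

toy_values = np.array([0.0, 50.0, 100.0])
toy_bins, toy_idx = color_bins(toy_values, 5)
print(toy_bins, toy_idx)
```

With `num_colors` edges the maximum value lands in the last index, so a color list of length `num_colors` is long enough.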

The following visualizations were made:
* Dynamics of the total value of commodities originating in each region. This gives a sense of production activity and clearly shows production clusters around Los Angeles, Houston, Dallas, Chicago, New York and other big cities.
* Dynamics of the total value of commodities by destination region. This shows consumption potential, which may differ from production capabilities if the region's primary source of income is services.
* Balance of production and consumption for each region. Here production imbalances can be traced. Once again, these imbalances can be covered by revenue from services or by capital inflows in any form.
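
The third item, the balance, is outflow from a region minus inflow to it. A minimal sketch on toy data; the column names `dms_orig` and `dms_dest` come from the dataset, while the function name `regional_balance` and the numbers are hypothetical.

```python
import pandas as pd

# toy flows: each row is freight moving from dms_orig to dms_dest
flows = pd.DataFrame({
    'dms_orig': ['011', '011', '012'],
    'dms_dest': ['012', '011', '011'],
    'value_2012': [10.0, 5.0, 3.0],
})

def regional_balance(df, value_col):
    """Outflow minus inflow per region for one value column."""
    outflow = df.groupby('dms_orig')[value_col].sum()
    inflow = df.groupby('dms_dest')[value_col].sum()
    # fill_value keeps regions that appear on only one side
    balance = outflow.sub(inflow, fill_value=0.0)
    balance.index.name = 'region'
    return balance

balance = regional_balance(flows, 'value_2012')
print(balance)
```

A positive number means the region ships out more than it receives, a negative number the opposite.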


A few notes regarding the animated maps.

The best way to present the animation code is to write a function or class and then call it for each set of parameters (dataframe, color, subtitle text etc.). This is what I did in a Jupyter notebook, but for some reason it doesn't work within this kernel: the output is only the last frame, and it seems that the animation function doesn't work if it is nested inside another function.
Thus, the code below is repetitive. If you would like to reuse it for other purposes, it can be done properly through a nested function (or maybe a class).
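
One possible explanation, stated here as an assumption that has not been verified against this kernel, is that the animation object created inside the function is discarded when the function returns. Matplotlib's documentation asks callers to keep a reference to the animation object. The sketch below returns both the figure and the animation; the name `make_animation` and the text-only frames are hypothetical stand-ins for the map code.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib import animation

def make_animation(frame_labels, interval=1000):
    """Build an animation that shows one label per frame and return it."""
    fig, ax = plt.subplots(figsize=(4, 2))
    ax.set_axis_off()
    label = ax.text(0.5, 0.5, '', ha='center', va='center')

    def animate(j):
        # j runs from 0 to len(frame_labels) - 1
        label.set_text('year {}'.format(frame_labels[j]))
        return label,

    anim = animation.FuncAnimation(fig, animate, frames=len(frame_labels),
                                   interval=interval)
    # returning both objects lets the caller keep references to them
    return fig, anim

sketch_fig, sketch_anim = make_animation(['2012', '2013', '2014'])
```

The returned `sketch_anim` could then be saved or displayed by the caller in the same way as the module-level `anim` objects in the cells below.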

The maps below use one colorbar for all years (2012-2045). This is good for showing the dynamics through the final year: some regions are predicted to grow their production while others are expected to stall. But this approach makes the first few years pallid. So, if you are interested in a shorter horizon, the list of years should be shortened, as well as the number of frames.
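
A small sketch of why the horizon matters: the bin edges span every value in the selected year columns, so dropping the far-future years shrinks the scale and gives the early years more contrast. The function name `shared_bins` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd

def shared_bins(df, year_cols, num_colors):
    """Bin edges spanning all values in the selected year columns."""
    values = df[year_cols].values
    return np.linspace(values.min(), values.max(), num_colors)

toy = pd.DataFrame({'2012': [1.0, 3.0], '2015': [2.0, 4.0], '2045': [10.0, 20.0]},
                   index=['region_a', 'region_b'])

full_bins = shared_bins(toy, ['2012', '2015', '2045'], 5)
short_bins = shared_bins(toy, ['2012', '2015'], 5)
print(full_bins.max(), short_bins.max())
```

On the full horizon every 2012 value falls into the lowest bin or two, while on the short horizon the same values spread over the whole scale.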

In [None]:
#Following sources helped a lot in making this animation:
#https://jakevdp.github.io/blog/2012/08/18/matplotlib-animation-tutorial/
#http://louistiao.me/posts/notebooks/embedding-matplotlib-animations-in-jupyter-notebooks/
#http://louistiao.me/posts/notebooks/save-matplotlib-animations-as-gifs/
#http://matplotlib.org/api/_as_gen/matplotlib.animation.FuncAnimation.html#matplotlib.animation.FuncAnimation
#https://www.kaggle.com/kostyabahshetsyan/animated-choropleth-map-for-crime-statistic/code
#https://www.kaggle.com/jaeyoonpark/heatmap-animation-us-drought-map/code

In [None]:
# domestic origin by value 2012-2045 
dom_origin_df = regions_df.loc[pd.isnull(regions_df['fr_orig'])]

In [None]:
dff = dom_origin_df
category = 'dms_orig'
map_color = 'YlOrRd'
num_colors = 20
text = 'commodities with domestic origin'

# animated plot
cat_val_df = dff[[category,'value_2012',
                     'value_2013', 'value_2014',
                     'value_2015', 'value_2020',
                     'value_2025', 'value_2030',
                     'value_2035', 'value_2040',
                     'value_2045']].groupby(category, as_index = False).sum()

cat_val_df.columns = [category] + years

# color map for values origins (applicable to all periods)
values_df = cat_val_df[years]
values = values_df.values.reshape((cat_val_df.shape[0]*10,1))
    
# set up steady layer for the map
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111)
    
# shapefile data and map setups
m = Basemap(width = 6000000, height = 4000000, projection = 'lcc',
                resolution = None, lat_1=27.,lat_2=32,lat_0=37,lon_0=-97.)
m.shadedrelief()
m.readshapefile('../input/CFS_AREA_shapefile_010215//CFS_AREA_shapefile_010215',
                    'regions', drawbounds = True, linewidth=.01)

# set up steady legend for the map
cm = plt.get_cmap(map_color)
scheme = [cm(q*1.0/num_colors) for q in range(num_colors)]
bins = np.linspace(values.min(), values.max(), num_colors)
ax_legend = fig.add_axes([0.8,0.12,0.02,0.77])
cmap = mpl.colors.ListedColormap(scheme)
cb = mpl.colorbar.ColorbarBase(ax_legend, cmap = cmap, ticks = bins,
                               boundaries = bins, orientation = 'vertical')
cb.ax.tick_params(labelsize=9)

# initial set up for polygons
chor_map = ax.plot([],[])[0]
for shape in m.regions:
    patches = [Polygon(np.array(shape), closed=True)]
    pc = PatchCollection(patches)
    ax.add_collection(pc)

# animated elements of the map
def animate(j):
    year = years[j]  # frames are numbered from 0, so j indexes years directly
    fig.suptitle('Freight value of {} in year {}, USD million'.format(text, year), fontsize=20, y=.95)
    dms_orig_val_animated = cat_val_df.loc[:,[category, year]]
    dms_orig_val_animated.set_index(category, inplace = True)
    dms_orig_val_animated['bin'] = np.digitize(dms_orig_val_animated[year], bins) - 1
        
    for info, shape in zip(m.regions_info, m.regions):
        name = info['CFS07DDGEO']
        # some region codes differ between the database and the shapefile;
        # the mapping below makes them equal
        # -------------------------------------------------------------------
        name = {'349': '342', '100': '101',
                '330': '339', '310': '311'}.get(name, name)
        # -------------------------------------------------------------------
        # Alaska and Hawaii were excluded from analysis to make map size reasonable
        if name not in ['020', '159', '151']: 
            color = scheme[dms_orig_val_animated.loc[name]['bin'].astype(int)]
            patches = [Polygon(np.array(shape), closed=True)]
            pc = PatchCollection(patches)
            pc.set_facecolor(color)
            ax.add_collection(pc)
    return chor_map,

anim = animation.FuncAnimation(fig, func = animate, frames = 10,
                               repeat_delay = 2000, interval = 1000)

anim.save('freight_d_d.gif', writer='imagemagick')
plt.close()
Image(url='freight_d_d.gif', width = 1200, height = 1000)

In [None]:
# domestic destination by value 2012-2045 
dom_dest_df = regions_df.loc[pd.isnull(regions_df['fr_dest'])]

In [None]:
dff = dom_dest_df
category = 'dms_dest'
map_color = 'Greens'
num_colors = 20
text = 'commodities with domestic destination'

# animated plot
cat_val_df = dff[[category,'value_2012',
                     'value_2013', 'value_2014',
                     'value_2015', 'value_2020',
                     'value_2025', 'value_2030',
                     'value_2035', 'value_2040',
                     'value_2045']].groupby(category, as_index = False).sum()

cat_val_df.columns = [category] + years

# color map for values origins (applicable to all periods)
values_df = cat_val_df[years]
values = values_df.values.reshape((cat_val_df.shape[0]*10,1))
    
# set up steady layer for the map
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111)
    
# shapefile data and map setups
m = Basemap(width = 6000000, height = 4000000, projection = 'lcc',
                resolution = None, lat_1=27.,lat_2=32,lat_0=37,lon_0=-97.)
m.shadedrelief()
m.readshapefile('../input/CFS_AREA_shapefile_010215//CFS_AREA_shapefile_010215',
                    'regions', drawbounds = True, linewidth=.01)

# set up steady legend for the map
cm = plt.get_cmap(map_color)
scheme = [cm(q*1.0/num_colors) for q in range(num_colors)]
bins = np.linspace(values.min(), values.max(), num_colors)
ax_legend = fig.add_axes([0.8,0.12,0.02,0.77])
cmap = mpl.colors.ListedColormap(scheme)
cb = mpl.colorbar.ColorbarBase(ax_legend, cmap = cmap, ticks = bins,
                               boundaries = bins, orientation = 'vertical')
cb.ax.tick_params(labelsize=9)

# initial set up for polygons
chor_map = ax.plot([],[])[0]
for shape in m.regions:
    patches = [Polygon(np.array(shape), closed=True)]
    pc = PatchCollection(patches)
    ax.add_collection(pc)

# animated elements of the map
def animate(j):
    year = years[j]  # frames are numbered from 0, so j indexes years directly
    fig.suptitle('Freight value of {} in year {}, USD million'.format(text, year), fontsize=20, y=.95)
    dms_orig_val_animated = cat_val_df.loc[:,[category, year]]
    dms_orig_val_animated.set_index(category, inplace = True)
    dms_orig_val_animated['bin'] = np.digitize(dms_orig_val_animated[year], bins) - 1
        
    for info, shape in zip(m.regions_info, m.regions):
        name = info['CFS07DDGEO']
        # some region codes differ between the database and the shapefile;
        # the mapping below makes them equal
        # -------------------------------------------------------------------
        name = {'349': '342', '100': '101',
                '330': '339', '310': '311'}.get(name, name)
        # -------------------------------------------------------------------
        # Alaska and Hawaii were excluded from analysis to make map size reasonable
        if name not in ['020', '159', '151']: 
            color = scheme[dms_orig_val_animated.loc[name]['bin'].astype(int)]
            patches = [Polygon(np.array(shape), closed=True)]
            pc = PatchCollection(patches)
            pc.set_facecolor(color)
            ax.add_collection(pc)
    return chor_map,

anim = animation.FuncAnimation(fig, func = animate, frames = 10,
                               repeat_delay = 2000, interval = 1000)

anim.save('freight_d_o.gif', writer='imagemagick')
plt.close()
Image(url='freight_d_o.gif', width = 1200, height = 1000)

In [None]:
# dataframe for freight balance (outflow from region minus inflow to region)
dom_origin_bal_df = dom_origin_df[['dms_orig','value_2012',
                                'value_2013', 'value_2014',
                                'value_2015', 'value_2020',
                                'value_2025', 'value_2030',
                                'value_2035', 'value_2040',
                                'value_2045']].groupby('dms_orig', as_index = True).sum()

dom_dest_bal_df = dom_dest_df[['dms_dest','value_2012',
                                'value_2013', 'value_2014',
                                'value_2015', 'value_2020',
                                'value_2025', 'value_2030',
                                'value_2035', 'value_2040',
                                'value_2045']].groupby('dms_dest', as_index = True).sum()

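dom_dest_bal_df = dom_dest_bal_df * (-1)

balance_df = dom_origin_bal_df.add(dom_dest_bal_df, fill_value = 0.0)
# the two inputs have differently named indexes ('dms_orig' and 'dms_dest');
# name the result explicitly so that reset_index yields a 'dms_orig' column
balance_df.index.name = 'dms_orig'
balance_df.reset_index(inplace = True)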

In [None]:
dff = balance_df
category = 'dms_orig'
map_color = 'RdYlGn'
num_colors = 20
text = 'commodities input/output balance'

# animated plot
cat_val_df = dff[[category,'value_2012',
                     'value_2013', 'value_2014',
                     'value_2015', 'value_2020',
                     'value_2025', 'value_2030',
                     'value_2035', 'value_2040',
                     'value_2045']].groupby(category, as_index = False).sum()

cat_val_df.columns = [category] + years

# color map for values origins (applicable to all periods)
values_df = cat_val_df[years]
values = values_df.values.reshape((cat_val_df.shape[0]*10,1))
    
# set up steady layer for the map
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111)
    
# shapefile data and map setups
m = Basemap(width = 6000000, height = 4000000, projection = 'lcc',
                resolution = None, lat_1=27.,lat_2=32,lat_0=37,lon_0=-97.)
m.shadedrelief()
m.readshapefile('../input/CFS_AREA_shapefile_010215//CFS_AREA_shapefile_010215',
                    'regions', drawbounds = True, linewidth=.01)

# set up steady legend for the map
cm = plt.get_cmap(map_color)
scheme = [cm(q*1.0/num_colors) for q in range(num_colors)]
bins = np.linspace(values.min(), values.max(), num_colors)
ax_legend = fig.add_axes([0.8,0.12,0.02,0.77])
cmap = mpl.colors.ListedColormap(scheme)
cb = mpl.colorbar.ColorbarBase(ax_legend, cmap = cmap, ticks = bins,
                               boundaries = bins, orientation = 'vertical')
cb.ax.tick_params(labelsize=9)

# initial set up for polygons
chor_map = ax.plot([],[])[0]
for shape in m.regions:
    patches = [Polygon(np.array(shape), closed=True)]
    pc = PatchCollection(patches)
    ax.add_collection(pc)

# animated elements of the map
def animate(j):
    year = years[j]  # frames are numbered from 0, so j indexes years directly
    fig.suptitle('Freight value of {} in year {}, USD million'.format(text, year), fontsize=20, y=.95)
    dms_orig_val_animated = cat_val_df.loc[:,[category, year]]
    dms_orig_val_animated.set_index(category, inplace = True)
    dms_orig_val_animated['bin'] = np.digitize(dms_orig_val_animated[year], bins) - 1
        
    for info, shape in zip(m.regions_info, m.regions):
        name = info['CFS07DDGEO']
        # some region codes differ between the database and the shapefile;
        # the mapping below makes them equal
        # -------------------------------------------------------------------
        name = {'349': '342', '100': '101',
                '330': '339', '310': '311'}.get(name, name)
        # -------------------------------------------------------------------
        # Alaska and Hawaii were excluded from analysis to make map size reasonable
        if name not in ['020', '159', '151']: 
            color = scheme[dms_orig_val_animated.loc[name]['bin'].astype(int)]
            patches = [Polygon(np.array(shape), closed=True)]
            pc = PatchCollection(patches)
            pc.set_facecolor(color)
            ax.add_collection(pc)
    return chor_map,

anim = animation.FuncAnimation(fig, func = animate, frames = 10,
                               repeat_delay = 2000, interval = 1000)

anim.save('freight_bal.gif', writer='imagemagick')
plt.close()
Image(url='freight_bal.gif', width = 1200, height = 1000)

Thoughts and conclusions:

(1). US production is concentrated around big cities:

 * New York (CSA population 23.7 million people)
 * Los Angeles (CSA population 18.7 million people)
 * Chicago (CSA population 9.9 million people)
 * Dallas (CSA population 7.7 million people)
 * Houston (CSA population 7 million people)

[A CSA (combined statistical area) is composed of adjacent metropolitan (MSA) and micropolitan (µSA) statistical areas in the United States and Puerto Rico that can demonstrate economic or social linkage (source: Wikipedia).]


(2). Consumption concentration is similar to that of production. I see two main reasons behind this:
 * population is concentrated around production facilities,
 * production centers consume raw materials for subsequent production.


(3). According to the forecasts provided in the database, the Detroit and New York CSAs are heading into a production deficit over the following 30 years. At the same time, the Houston and Dallas CSAs are expected to have a production surplus.

(4). Texas has many regions with higher production of commodities than nearby states. Information is provided for each small region. A brief web search suggests that Texas has a low tax burden and a business-friendly regulatory environment.

# Analysis by commodities

The next section is an analysis by commodity. I analyzed the top 10 commodities in terms of consumption, production, and the balance between the two.

I selected plotly as the tool for this part, as it provides a very clear way to make buttons and sliders, which are very convenient for presenting multi-period data.
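
Picking the commodities that rank in the top N in either 2012 or 2045 amounts to taking the union of two top-N lists. A minimal sketch on toy data; the column name `sctg2` comes from the dataset, while the helper names `top_codes` and `ordered_union` and the numbers are hypothetical.

```python
import pandas as pd

def top_codes(df, year, n):
    """Commodity codes of the n largest rows for one year column."""
    return df.nlargest(n, year)['sctg2'].tolist()

def ordered_union(first, second):
    """Items of first, followed by items of second that are not in first."""
    return first + [x for x in second if x not in first]

toy = pd.DataFrame({'sctg2': [1, 2, 3, 4],
                    '2012': [40.0, 30.0, 20.0, 10.0],
                    '2045': [10.0, 50.0, 20.0, 60.0]})

top_2012 = top_codes(toy, '2012', 2)
top_2045 = top_codes(toy, '2045', 2)
combined = ordered_union(top_2012, top_2045)
print(combined)
```

The union can hold more than N items when the rankings change between the two years, which is exactly the case worth plotting.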

In [None]:
# origination dataframe
dom_origin_comm_df = dom_origin_df[['sctg2','value_2012',
                                'value_2013', 'value_2014',
                                'value_2015', 'value_2020',
                                'value_2025', 'value_2030',
                                'value_2035', 'value_2040',
                                'value_2045']].groupby('sctg2', as_index = False).sum()

dom_origin_comm_df.columns = ['sctg2'] + years
dom_origin_comm_df.loc['total'] = dom_origin_comm_df.sum()

In [None]:
# destination dataframe
dom_dest_comm_df = dom_dest_df[['sctg2','value_2012',
                                'value_2013', 'value_2014',
                                'value_2015', 'value_2020',
                                'value_2025', 'value_2030',
                                'value_2035', 'value_2040',
                                'value_2045']].groupby('sctg2', as_index = False).sum()

dom_dest_comm_df.columns = ['sctg2'] + years
dom_dest_comm_df.loc['total'] = dom_dest_comm_df.sum()

# Top 10 originations/destinations in 2012 and 2045

In [None]:
# list of top 10 commodities by consumption (2012 and 2045)
top_10_dest_2012 = dom_dest_comm_df[['sctg2','2012','2045']].sort_values('2012', axis = 0, ascending = False)
top_10_dest_2012 = top_10_dest_2012.head(11)
top_10_dest_2012.drop('total', inplace = True)

top_10_dest_2045 = dom_dest_comm_df[['sctg2','2012','2045']].sort_values('2045', axis = 0, ascending = False)
top_10_dest_2045 = top_10_dest_2045.head(11)
top_10_dest_2045.drop('total', inplace = True)

top_2012_dest = top_10_dest_2012['sctg2'].values.flatten().tolist()
top_2045_dest = top_10_dest_2045['sctg2'].values.flatten().tolist()
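# union of both top-10 lists (order preserved, duplicates dropped);
# "a or b" on two lists would simply return the first non-empty list
top_dest_list = top_2012_dest + [x for x in top_2045_dest if x not in top_2012_dest]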

top_dest_2012_2045 = dom_dest_comm_df.loc[dom_dest_comm_df['sctg2'].isin(top_dest_list), 
                                          ['sctg2']+years]
top_dest_2012_2045.reset_index(inplace = True, drop = True)
top_dest_2012_2045['Commodity'] = top_dest_2012_2045['sctg2'].apply(lambda x: com_dict[x])

In [None]:
# list of top 10 commodities by production (2012 and 2045)
top_10_origin_2012 = dom_origin_comm_df[['sctg2','2012','2045']].sort_values('2012', axis = 0, ascending = False)
top_10_origin_2012 = top_10_origin_2012.head(11)
top_10_origin_2012.drop('total', inplace = True)

top_10_origin_2045 = dom_origin_comm_df[['sctg2','2012','2045']].sort_values('2045', axis = 0, ascending = False)
top_10_origin_2045 = top_10_origin_2045.head(11)
top_10_origin_2045.drop('total', inplace = True)

top_2012_origin = top_10_origin_2012['sctg2'].values.flatten().tolist()
top_2045_origin = top_10_origin_2045['sctg2'].values.flatten().tolist()
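# union of both top-10 lists (order preserved, duplicates dropped);
# "a or b" on two lists would simply return the first non-empty list
top_origin_list = top_2012_origin + [x for x in top_2045_origin if x not in top_2012_origin]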

top_origin_2012_2045 = dom_origin_comm_df.loc[dom_origin_comm_df['sctg2'].isin(top_origin_list), 
                                              ['sctg2'] + years]
top_origin_2012_2045.reset_index(inplace = True, drop = True)
top_origin_2012_2045['Commodity'] = top_origin_2012_2045['sctg2'].apply(lambda x: com_dict[x])

In [None]:
# check assumption that top origin and top destination commodities lists contain the same items
print(set(top_origin_list) ^ set(top_dest_list))
# If the printed set is empty, both lists contain the same items and
# can be used interchangeably as parameters of the function "line_plot".
# This helps to preserve the color scheme across different plots.

In [None]:
# function for plotting top 2012/2045 destinations/originations by commodity
# plotly web page with examples was the best source of information https://plot.ly/python/
def line_plot(category, sel_list, df, sel_dict, clustering_criterion, xAxis):
    data = []
    buttons = []
    for i in sel_list:
        # trace colors come from plotly's default color cycle, so passing
        # the same sel_list keeps colors consistent between plots
        trace = go.Scatter(x = ["year {}".format(x) for x in xAxis],
                           y = df.loc[df[category] == i, xAxis].apply(lambda x: x/1000000).values.flatten(),
                           name = '{}_{}'.format(" ".join(sel_dict[i].split(" ")[:2]), i),
                           line = dict(width = 2,
                                       dash = 'longdash'))
        data.extend([trace])

        buttons_upd = list([dict(label = '{}'.format(sel_dict[i]),
                                 method = 'update',
                                 args = [{'visible': [x==i for x in sel_list]}])])
        buttons.extend(buttons_upd)

    # button for reset / all items
    buttons_all = list([dict(label = 'All selected items',
                                 method = 'update',
                                 args = [{'visible': [True for x in sel_list]}])])
    buttons.extend(buttons_all)
    
    # set menus inside the plot
    update_menus = list([dict(active=-5,
                              buttons = buttons,
                              direction = 'down',
                              pad = {'r': 10, 't': 10},
                              showactive = True,
                              x = 0.001,
                              xanchor = 'left',
                              y = 1.1,
                              yanchor = 'top')])
    # Edit the layout
    layout = dict(title = '{}'.format(clustering_criterion),
                  xaxis = dict(title = 'Years',
                               nticks = len(xAxis)),
                  yaxis = dict(title = 'Value, trillion USD'),
                  updatemenus = update_menus,
                  showlegend = True)
         
    fig_top_10 = dict(data = data, layout = layout)
    iplot(fig_top_10)

In [None]:
# plot top 2012 vs 2045 production (origination) by commodity
line_plot('sctg2', top_dest_list, top_origin_2012_2045, com_dict,
          'Top 10 commodities by production', years)

This chart gives interesting results that are worth further analysis. 

I ignored mixed freight as it contains many other commodities that are just not significant enough to get their own group. 

Two related groups, "**Machinery**" and "**Electronic and Other Electrical Equipment and Components, and Office Equipment**", are expected to grow drastically up to 2045. According to the estimates provided, production of these commodities is going to triple (taking 2012 as the base year). Considering the automation trend, this forecast seems very reasonable. It also shows that the Department of Transportation expects the USA to take a significant share of the world's new equipment production.

The projected growth of "**Pharmaceutical Products**" seems to be supported by population growth and a longer life span.

Growth of "**Other Prepared Foodstuffs, Fats and Oils**" may have many reasons behind it: population growth, a shift to higher-quality and thus higher-priced food, and an increase in food exports.

All trends from the last three paragraphs have their logic, but what perplexed me is that all of them start booming right after the factual period ends (2015) and the forecast starts. I think these expectations are overestimated and probably get so high because of a compound annual growth rate. But this is my speculation. It would be interesting to see what forecasting methodology was applied and how.
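
To put the compounding remark into numbers: a value that triples between 2012 and 2045 corresponds to a constant growth rate of roughly 3.4% per year. The sketch below is only this arithmetic; it says nothing about the methodology actually used for the forecast, and the function name `cagr` is hypothetical.

```python
def cagr(start_value, end_value, n_years):
    """Constant annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / n_years) - 1.0

# tripling over the 33 years from 2012 to 2045
rate = cagr(1.0, 3.0, 2045 - 2012)
print('{:.2%}'.format(rate))

# compounding the rate over the same period recovers the factor of 3
factor = (1.0 + rate) ** (2045 - 2012)
```

A seemingly modest yearly rate is therefore enough to produce the steep curves seen in the chart.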

In contrast, the steady increase of "**Motorized and Other Vehicles (includes parts)**" continues the trend of the factual period (2012-2015). Because of this it looks much more plausible.

Now let's look at counterintuitive trends:

"**Gasoline, Aviation Turbine Fuel, and Ethanol (includes Kerosene, and Fuel Alcohols)**" and "**Fuel Oils (includes Diesel, Bunker C, and Biodiesel)**". Both are expected to climb at first and then drop below the 2012 level. This is perfectly in line with the anticipated dominance of electric cars and probably trucks. Another big fuel consumer, aircraft, improves fuel efficiency with every new model.

At the same time, "**Other Coal and Petroleum Products, not elsewhere classified**" shows a strong upward trend. A clue can be found in the old version of the User Guide (https://ops.fhwa.dot.gov/freight/freight_analysis/faf/faf3/userguide/index.htm). In 2012 this group was named "Coal and petroleum products, n.e.c. (includes Natural gas)". So the answer is quite simple: the group includes natural gas. This growth reflects the rise of the fracking industry, as well as USA plans to become one of the world's top gas exporters.

In [None]:
# plot top 2012 vs 2045 consumption (destination) by commodity
line_plot('sctg2', top_dest_list, top_dest_2012_2045, com_dict,
          'Top 10 commodities by consumption', years)

Consumption trends are very similar to production trends and seem to be guided by the same forecasting logic.

The visible difference is in the consumption trend for "**Other Coal and Petroleum Products, not elsewhere classified**": it is much less aggressive than the production trend. This supports the expectation that a significant part of the natural gas produced will be exported.

# Origination / destination balance

The following chart compares domestic origination and domestic consumption for each commodity group. Imbalances are covered by exports or imports. I took all commodity groups whose imbalance reaches at least 150 billion USD in at least one year.
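
The selection rule can be sketched as follows: a commodity is kept when its largest absolute yearly balance reaches the threshold. Values in this dataset are in USD million, so 150 billion USD corresponds to 150000. The function name `select_large_imbalances` and the toy numbers are hypothetical.

```python
import pandas as pd

def select_large_imbalances(balance, threshold):
    """Keep rows whose largest absolute yearly balance reaches the threshold."""
    abs_max = balance.abs().max(axis=1)
    return balance.loc[abs_max >= threshold]

# toy balances in USD million, indexed by a made-up commodity code
toy_balance = pd.DataFrame({'2012': [-200000.0, 5000.0, 100000.0],
                            '2045': [-50000.0, -8000.0, 160000.0]},
                           index=[10, 20, 30])

selected = select_large_imbalances(toy_balance, 150000)
print(selected)
```

Taking the absolute value first means that large deficits and large surpluses are both selected.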

In [None]:
# plot biggest commodity deficits and surpluses
dom_dest_comm_df_2 = dom_dest_comm_df.set_index(['sctg2'])
dom_origin_comm_df_2 = dom_origin_comm_df.set_index(['sctg2'])
dom_dest_comm_df_neg = dom_dest_comm_df_2.apply(lambda x: x*(-1))

comm_balance_df = dom_origin_comm_df_2.add(dom_dest_comm_df_neg, fill_value = 0.0)
comm_balance_df['max'] = comm_balance_df[years].max(axis = 1)
comm_balance_df['min'] = comm_balance_df[years].min(axis = 1)
comm_balance_df['abs_max'] = comm_balance_df[['max','min']].apply(lambda x: abs(x)).max(axis = 1)

selected_comm_bal = comm_balance_df.loc[comm_balance_df['abs_max'] >= 150000, years]
selected_comm_bal.reset_index(inplace = True)

bal = selected_comm_bal['sctg2'].values.flatten().tolist()
com_dict[1003] = "Total for all commodities"

In [None]:
# function for plotting balance
def balance_plot(category, sel_list, df, sel_dict, heading):
    data = []
    buttons = []

    # rgb for surplus
    r_s = 200
    g_s = 100
    b_s = 20

    # rgb for deficit
    r_d = 50
    g_d = 100
    b_d = 200

    for i in sel_list:
        y = df.loc[df[category] == i,years].apply(lambda x: x/1000000).values
        if np.sum(y)>=0:
            rgb = 'rgb({}, {}, {})'.format(r_s, g_s, b_s)
            r_s += 5
            g_s += 10
            b_s += 10
        else:
            rgb = 'rgb({}, {}, {})'.format(r_d, g_d, b_d)
            r_d += 10
            g_d += 10
            b_d += 5
        trace = go.Scatter(x = ["year {}".format(x) for x in years],
                       y = y.flatten(),
                       name = '{}_{}'.format(" ".join(sel_dict[i].split(" ")[:2]), i),
                       line = dict(color = (rgb), 
                                   width = 2))
        
    
        data.append(trace)
    
        buttons_upd = list([dict(label = '{}'.format(sel_dict[i]),
                                 method = 'update',
                                 args = [{'visible': [x==i for x in sel_list]}])])
        buttons.extend(buttons_upd)

    # button that shows every trace except the last one (the total)
    buttons_all = list([dict(label = 'All except total',
                         method = 'update',
                         args = [{'visible': [True for x in sel_list[:-1]]+[False]}])]) 

    buttons.extend(buttons_all)

    # set menus inside the plot
    update_menus = list([dict(active=-5,
                              buttons = buttons,
                              direction = 'down',
                              pad = {'r': 10, 't': 10},
                              showactive = True,
                              x = 0.001,
                              xanchor = 'left',
                              y = 1.1,
                              yanchor = 'top')])

    # Edit the layout
    layout = dict(title = heading,
                  xaxis = dict(title = 'Years'),
                  yaxis = dict(title = 'Value, trillion USD'),
                  updatemenus = update_menus,
                  showlegend = True)
         
    fig = dict(data = data, layout = layout)
    iplot(fig)
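
The dropdown buttons in `balance_plot` work by passing a list of booleans, one per trace, to the `visible` property. Below is a minimal sketch of how those lists are built, without plotly. The helper name `visibility_masks` and the codes are hypothetical, and the sketch assumes, as the function above does, that the last item in the list is the total series.

```python
def visibility_masks(sel_list):
    """Return one boolean mask per item plus an 'all except last' mask."""
    # one mask per item: only the trace of that item is visible
    masks = {item: [other == item for other in sel_list] for item in sel_list}
    # every trace visible except the last one (assumed to be the total)
    masks['all_except_total'] = [True] * (len(sel_list) - 1) + [False]
    return masks


# Hypothetical codes; the last one stands for the total series
codes = [17, 19, 1003]
masks = visibility_masks(codes)
print(masks[19])                  # [False, True, False]
print(masks['all_except_total'])  # [True, True, False]
```

Each mask must have exactly as many entries as there are traces in the figure, which is why the mask length is derived from the list passed to the function.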

In [None]:
balance_plot('sctg2', bal, selected_comm_bal, com_dict, 
             'Balance of commodities consumed and produced')

The plot above shows that most of the selected commodity groups with production imbalances are in deficit and are projected to remain so. Only "**Other Coal and Petroleum Products, not elsewhere classified**" and "**Precision Instruments and Apparatus**" are expected to run a surplus.

To see the total deficit across all commodities, select the "Total for all commodities" button. The chart shows that the US production deficit is anticipated to decrease until 2025. The trends for the individual selected commodities make it clear that the main reason behind this is export-oriented natural gas production.

The overall conclusion is that the US is going to continue its trend of producing fewer goods than its population consumes. Such an imbalance can be covered by the production of services and intangible goods (like software). Another way to cover it is through capital inflows from other countries (purchases of corporate or government debt and investments in ownership rights).

# Analysis by foreign trading partners

This part breaks down the provided data by foreign trading partner. Most countries are grouped by geographical region (see the User Guide or "fr_dict"). Only Canada and Mexico are presented as single countries.

In [None]:
# domestically originated goods exported to other countries
fr_dest_df = regions_df.loc[pd.notnull(regions_df['fr_dest'])]
fr_dest_list = fr_dest_df['fr_dest'].unique().flatten().tolist()
fr_dest_df = fr_dest_df[['fr_dest','value_2012',
                           'value_2013', 'value_2014',
                           'value_2015', 'value_2020',
                           'value_2025', 'value_2030',
                           'value_2035', 'value_2040',
                           'value_2045']].groupby('fr_dest', as_index = True).sum()

fr_dest_df.columns = years

In [None]:
# goods imported from other countries
fr_orig_df = regions_df.loc[pd.notnull(regions_df['fr_orig'])]
fr_orig_list = fr_orig_df['fr_orig'].unique().flatten().tolist()
fr_orig_df = fr_orig_df[['fr_orig','value_2012',
                           'value_2013', 'value_2014',
                           'value_2015', 'value_2020',
                           'value_2025', 'value_2030',
                           'value_2035', 'value_2040',
                           'value_2045']].groupby('fr_orig', as_index = True).sum()

fr_orig_df.columns = years

In [None]:
# fr balance df
fr_balance_df = fr_dest_df.add(fr_orig_df.apply(lambda x: x*(-1)))
fr_balance_df.loc['total'] = fr_balance_df.sum()
fr_balance_df.reset_index(inplace = True)

In [None]:
# reset index for fr dataframes
fr_dest_df.reset_index(inplace = True)
fr_orig_df.reset_index(inplace = True)

In [None]:
def fr_dest_orig_plot(category, list_fr, df, clustering_criterion):
    data = []
    for i in list_fr:

        trace = go.Bar(x = ["year {}".format(x) for x in years],
                       y = df.loc[df[category] == i,years].apply(lambda x: x/1000000).values.flatten(),
                       name = '{}'.format(fr_dict[i]))
        
        data.append(trace)
                       
    # Edit the layout
    layout = dict(title = 'Trading partners grouped by {}'.format(clustering_criterion),
                  xaxis = dict(title = 'Years'),
                  yaxis = dict(title = 'Value, trillion USD'),
                  showlegend = True,
                  barmode='stack')
         
    fig = dict(data = data, layout = layout)
    iplot(fig, filename='stacked-bar')

In [None]:
# plot overall change and structure of trading partners by export
fr_dest_orig_plot('fr_dest', fr_dest_list, fr_dest_df, 'export')

In [None]:
# plot trading partners by export
line_plot('fr_dest', fr_dest_list, fr_dest_df, fr_dict,
          'Trading partners by export', years)

In [None]:
# plot overall change and structure of trading partners by import
fr_dest_orig_plot('fr_orig', fr_dest_list, fr_orig_df, 'import') # fr_dest_list was applied to match colors with previous plots

In [None]:
# plot trading partners by import
line_plot('fr_orig', fr_dest_list, fr_orig_df, fr_dict,
          'Trading partners by import', years) # fr_dest_list was applied to match colors with previous plots

In [None]:
# plot balance for foreign destination / origination
fr_dest_list_total = fr_dest_list+['total']
balance_plot('fr_dest', fr_dest_list_total, fr_balance_df, fr_dict, 'Balance of foreign trade by trading partners')

Here are my thoughts and conclusions on this part:

(1) In my opinion, both trends (export and import) look overly optimistic. The 2012-2015 dynamics are very moderate, while future periods look too promising.

(2) Eastern Asia is expected to become the largest source of goods imported into the US and only the second-largest export partner. This makes the projected deficit quite impressive.

(3) Europe is going to import more US goods than any other region. This is mostly driven by US plans to dominate the European natural gas market. An additional chart supporting this conclusion is presented in the next cell.

In [None]:
# commodities exported to Europe (fr_dest code 804), ranked by 2045 value
ng_europe = regions_df[['value_2012',
                        'value_2013', 'value_2014',
                        'value_2015', 'value_2020',
                        'value_2025', 'value_2030',
                        'value_2035', 'value_2040',
                        'value_2045', 'fr_dest', 'sctg2']].copy()
# make sure destination codes are numeric before filtering
ng_europe['fr_dest'] = pd.to_numeric(ng_europe['fr_dest'], errors='coerce')
# rename value columns to plain year labels
ng_europe.columns = years + ['fr_dest','sctg2']

ng_europe = ng_europe.loc[ng_europe['fr_dest'] == 804, years+['sctg2']]
ng_europe = ng_europe.groupby('sctg2', as_index = False).sum()
ng_europe = ng_europe.sort_values('2045', axis = 0, ascending = False)
ng_europe_list = ng_europe['sctg2'].head(10).values.flatten().tolist()

# plot the ten largest commodity groups selected above
line_plot('sctg2', ng_europe_list, ng_europe, com_dict,
          'Natural gas export to Europe', years)
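
Point (1) above compares the moderate 2012-2015 dynamics with the steeper projected periods. One possible way to put a number on that comparison is the compound annual growth rate, $\left(V_{end}/V_{start}\right)^{1/n} - 1$. The sketch below is only an illustration: the helper name `cagr` is hypothetical and the three values are made up, not taken from FAF4.

```python
def cagr(start_value, end_value, n_years):
    """Compound annual growth rate between two values n_years apart."""
    return (end_value / start_value) ** (1.0 / n_years) - 1.0


# Made-up values for illustration only, not taken from FAF4
value_2012, value_2015, value_2045 = 100.0, 103.0, 250.0

observed = cagr(value_2012, value_2015, 3)    # historical period
projected = cagr(value_2015, value_2045, 30)  # forecast period
print('2012-2015: {:.2%} per year'.format(observed))
print('2015-2045: {:.2%} per year'.format(projected))
```

Applied to the real totals in `fr_dest_df` and `fr_orig_df`, a projected rate well above the observed one would support the "overly optimistic" reading.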

# Analysis of transportation mode structure

The last part of my analysis describes the structure of the transport modes used for carrying goods. A pie chart is a good choice for this purpose.

In [None]:
# domestic freight type (includes only transportation between entry/exit
# point and destination/origin point)
domestic_mode_df = regions_df[['dms_mode','value_2012',
                               'value_2013', 'value_2014',
                               'value_2015', 'value_2020',
                               'value_2025', 'value_2030',
                               'value_2035', 'value_2040',
                               'value_2045']].groupby('dms_mode', as_index = False).sum()

domestic_mode_df.columns = ['dms_mode'] + years

# exclude mode '8' as it is not a domestic transportation mode
# (copy to avoid modifying a view of the filtered frame below)
domestic_mode_df = domestic_mode_df.loc[domestic_mode_df['dms_mode']!=8,].copy()
domestic_mode_df[years] = domestic_mode_df[years].div(domestic_mode_df[years].sum(axis=0),
                                                      axis=1).multiply(100)

In [None]:
# foreign freight type (both import and export)
# domestic origin for export by value 2012-2045
for_dest_df = regions_df.loc[pd.notnull(regions_df['fr_dest'])]
for_outmode_df = for_dest_df[['fr_outmode','value_2012',
                               'value_2013', 'value_2014',
                               'value_2015', 'value_2020',
                               'value_2025', 'value_2030',
                               'value_2035', 'value_2040',
                               'value_2045']].groupby('fr_outmode', as_index = True).sum()

# domestic destination for import by value 2012-2045
for_orig_df = regions_df.loc[pd.notnull(regions_df['fr_orig'])]
for_inmode_df = for_orig_df[['fr_inmode','value_2012',
                                  'value_2013', 'value_2014',
                                  'value_2015', 'value_2020',
                                  'value_2025', 'value_2030',
                                  'value_2035', 'value_2040',
                                  'value_2045']].groupby('fr_inmode', as_index = True).sum()

# treat a mode that is missing on one side as zero instead of NaN
for_mode_df = for_outmode_df.add(for_inmode_df, fill_value = 0.0)
for_mode_df.reset_index(inplace = True)
for_mode_df.columns = ['for_mode'] + years

for_mode_df[years] = for_mode_df[years].div(for_mode_df[years].sum(axis=0), axis=1).multiply(100)
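
The last line of the cell above converts absolute values into percentage shares per year. Here is a minimal sketch of the same `div`/`multiply` pattern on made-up numbers; the mode codes and values are assumptions for illustration only.

```python
import pandas as pd

years = ['2012', '2045']
# Hypothetical mode codes and values (not taken from FAF4)
mode_df = pd.DataFrame({'mode': [1, 2, 3],
                        '2012': [60.0, 30.0, 10.0],
                        '2045': [100.0, 60.0, 40.0]})

# total value per year (one number per column)
totals = mode_df[years].sum(axis=0)

# divide every row by the yearly totals and express the result in percent;
# axis=1 aligns the totals with the year columns
mode_df[years] = mode_df[years].div(totals, axis=1).multiply(100)
print(mode_df)
```

After this step each year column sums to 100, so the columns can be passed directly as pie chart values.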

In [None]:
# plotting pie charts for transportation modes
data_pie = []
pie_colors = ['rgb(100, 100, 100)',
              'rgb(230, 120, 40)',
              'rgb(110, 210, 220)',
              'rgb(220, 220, 220)',
              'rgb(180, 60, 110)',
              'rgb(80, 120, 150)',
              'rgb(125, 30, 120)']

# add year 2012 manually to ensure it appears on plot right after the code is executed
data_2012 = [{"values": domestic_mode_df['2012'].values.flatten().tolist(),
                 "labels": mode_list,
                 "domain": {"x": [0, .48]},
                 "marker": {"colors": pie_colors},
                 "hoverinfo": "label+percent",
                 "hole": .4,
                 "type": 'pie',
                 "visible": True},
           
                {"values": for_mode_df['2012'].values.flatten().tolist(),
                 "labels": mode_list,
                 "domain": {"x": [.52, 1]},
                 "marker": {"colors": pie_colors},
                 "hoverinfo": "label+percent",
                 "hole": .4,
                 "type": 'pie',
                 "visible": True}]
data_pie.extend(data_2012)

for i in years[1:]:    
    data_upd = [{"values": domestic_mode_df[i].values.flatten().tolist(),
                 "labels": mode_list,
                 "domain": {"x": [0, .48]},
                 "marker": {"colors": pie_colors},
                 "hoverinfo": "label+percent",
                 "hole": .4,
                 "type": 'pie',
                 "visible": False},
           
                {"values": for_mode_df[i].values.flatten().tolist(),
                 "labels": mode_list,
                 "domain": {"x": [.52, 1]},
                 "marker": {"colors": pie_colors},
                 "hoverinfo": "label+percent",
                 "hole": .4,
                 "type": 'pie',
                 "visible": False}]
    
    data_pie.extend(data_upd)
    
# build slider steps: each step shows the two pies of one year
steps = []
yr = 0

for i in range(0,len(data_pie),2):
    step = dict(method = "restyle",
                args = ["visible", [False]*len(data_pie)],
                label = years[yr]) 
    step['args'][1][i] = True
    step['args'][1][i+1] = True
    steps.append(step)
    yr += 1

sliders = [dict(active = 0,
                currentvalue = {"prefix": "Year: ",
                               "visible": True},
                pad = {"t": 50},
                steps = steps)]

# Set the layout
layout = dict(title = 'Structure of transportation mode',
              annotations = [{"font": {"size": 20},
                              "showarrow": False,
                              "text": "DMT",
                              "x": 0.20,
                              "y": 0.5},
                             {"font": {"size": 20},
                              "showarrow": False,
                              "text": "FMT",
                              "x": 0.8,
                              "y": 0.5}],
              sliders = sliders)
         
fig = dict(data = data_pie, layout = layout)
iplot(fig, filename='donut')
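
The slider logic above can be easier to follow in isolation: every year owns two consecutive traces (domestic and foreign pie), and each slider step switches on only that pair. Below is a minimal sketch without plotly; the helper name `slider_steps` and its `traces_per_step` parameter are hypothetical, while the step dictionaries have the same keys as in the cell above.

```python
def slider_steps(labels, traces_per_step=2):
    """Build one 'restyle' step per label, showing only that label's traces."""
    n_traces = len(labels) * traces_per_step
    steps = []
    for k, label in enumerate(labels):
        # start with everything hidden, then switch on this label's traces
        visible = [False] * n_traces
        start = k * traces_per_step
        for j in range(start, start + traces_per_step):
            visible[j] = True
        steps.append(dict(method='restyle',
                          args=['visible', visible],
                          label=label))
    return steps


steps = slider_steps(['2012', '2013', '2014'])
print(steps[1]['args'][1])  # [False, False, True, True, False, False]
```

Creating a fresh `visible` list inside the loop matters: reusing one list across steps would make every step share the same mask.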

**Domestic transportation** is and will remain dominated by Trucks. Air transportation is expected to almost triple by 2045 (compared with the 2012 base). A decrease is forecast for the Pipeline mode, which is in line with the decrease in consumption of oil products.


The geographical location of the US defines the structure of **Foreign transportation**. The Truck mode applies almost exclusively to trade with Mexico and Canada. Trade with the rest of the world goes through the Water and Air modes.

The Water mode is the most important one and is expected to remain so. It is well suited to goods that are heavy and have a low price per ton.

The Air mode, according to the estimates, is going to increase its share to as much as 30% at the expense of the Water, Truck and Rail modes. This shift may be driven by growth in shipments of expensive, light goods.