# Daily Weather in the U.S., 2017

Three questions that I want to investigate:

- Where is it hottest in summer?
- Where is it dryest in summer?
- Where is it windiest in summer?

I will limit the investigation by looking only at the continental US.

In [1]:
import altair as alt
import folium
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from datetime import datetime
from vega_datasets import data

In [2]:
weather = pd.read_csv('weather.csv')

## Preliminary exploratory visual analysis

Where are the weather stations?

In [3]:
stations = weather.groupby(['station', 'latitude', 'longitude']).count().reset_index()

In [4]:
m = folium.Map(location=[40, -100], zoom_start=1)
for i in range(0, len(stations)):
    folium.Marker([stations['latitude'].iloc[i], stations['longitude'].iloc[i]]).add_to(m)
m

I start by deleting the stations that are not in the contiguous 48 states.

In [5]:
subdata = weather[~weather['state'].isin(['AB', 'AK', 'BC', 'GU', 'HI', 'MP', 'MB', 'NB', 'NL', 'NS', 'NT', 'ON', 'PE', 'PR', 'QC', 'VI'])]

In [6]:
stations = subdata.groupby(['station', 'latitude', 'longitude']).count().reset_index()
m = folium.Map(location=[40, -100], zoom_start=4)
for i in range(0, len(stations)):
    folium.Marker([stations['latitude'].iloc[i], stations['longitude'].iloc[i]]).add_to(m)
m

I do a little sanity check, and look for data that are too big / too small.

In [7]:
subdata.min()

station      ABERDEEN
state              AL
latitude       24.555
longitude    -124.555
elevation       -36.0
date         20170101
TMIN           -98.86
TMAX           -10.84
TAVG           -20.56
AWND              0.0
WDF5              2.0
WSF5         4.026492
SNOW              0.0
SNWD              0.0
PRCP              0.0
dtype: object

A temperature of -98.86 seems very low. Let us at rows with small values of TMIN.

In [8]:
subdata.loc[subdata.TMIN < -50]

Unnamed: 0,station,state,latitude,longitude,elevation,date,TMIN,TMAX,TAVG,AWND,WDF5,WSF5,SNOW,SNWD,PRCP
91571,DECATUR PRYOR FLD,AL,34.6525,-86.9453,180.4,20170804,-71.86,87.98,,6.039738,240.0,16.105968,0.0,0.0,0.0
148276,ALTUS AFB,OK,34.3622,-98.9761,386.2,20170915,-98.86,96.08,,13.869028,160.0,29.974996,0.0,0.0,0.0
154693,MAYPORT PILOT STN,FL,30.4,-81.4167,4.9,20170517,-70.78,86.0,,13.869028,150.0,27.066974,,,0.0
154694,MAYPORT PILOT STN,FL,30.4,-81.4167,4.9,20170518,-79.78,84.02,,12.750558,150.0,23.935258,,,0.0
168315,MAYPORT PILOT STN,FL,30.4,-81.4167,4.9,20170516,-50.8,80.06,,9.171454,130.0,23.040482,,,0.0


This is weird. Maybe there is a negative sign where there should be a positive sign. I will replace these values by NaN.

In [9]:
subdata.at[subdata['TMIN'] < -50, 'TMIN'] = np.NaN

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [10]:
subdata.max()

station      Younts Peak
state                 WY
latitude           48.98
longitude       -67.7928
elevation         3541.8
date            20170921
TMIN               98.96
TMAX              129.92
TAVG              105.26
AWND          112.518082
WDF5               360.0
WSF5           180.07367
SNOW           67.992163
SNWD          280.000151
PRCP            26.03151
dtype: object

An average daily wind speed of 112 mph is very big. Let us at rows with high values of AWND.

In [11]:
subdata.loc[subdata.AWND > 70]

Unnamed: 0,station,state,latitude,longitude,elevation,date,TMIN,TMAX,TAVG,AWND,WDF5,WSF5,SNOW,SNWD,PRCP
185713,PINE RIDGE AP,SD,43.0206,-102.5183,999.1,20170830,98.96,98.96,,99.096442,99.0,99.096442,,,0.0
236985,Lake Irene,CO,40.4100,-105.8200,3261.4,20170215,10.04,36.50,21.56,72.253162,,,,77.992168,0.0
236986,Lake Irene,CO,40.4100,-105.8200,3261.4,20170216,14.90,43.88,28.22,72.700550,,,,75.984293,0.0
237044,Lake Irene,CO,40.4100,-105.8200,3261.4,20170415,25.16,44.60,34.16,73.147938,,,,62.992160,0.0
237045,Lake Irene,CO,40.4100,-105.8200,3261.4,20170416,28.22,46.94,36.32,70.910998,,,,60.984285,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
264252,El Diente Peak,CO,37.7900,-108.0200,3048.0,20170501,28.58,46.94,37.58,93.951480,,,,22.992138,0.0
264253,El Diente Peak,CO,37.7900,-108.0200,3048.0,20170502,24.80,51.62,39.02,88.135436,,,,22.007886,0.0
264254,El Diente Peak,CO,37.7900,-108.0200,3048.0,20170503,27.86,51.44,40.82,105.359874,,,,20.000011,0.0
264255,El Diente Peak,CO,37.7900,-108.0200,3048.0,20170504,27.32,60.80,44.78,79.411370,,,,17.992136,0.0


I will replace the big values of AWND by NaN.

In [12]:
subdata.at[subdata['AWND'] > 70, 'AWND'] = np.NaN

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


I transform the date column into a datetime format.

In [13]:
date = pd.to_datetime(subdata['date'], format='%Y%m%d')

In [14]:
subdata['date'] = date

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  subdata['date'] = date


I select the data corresponding to summer.

In [15]:
summer = subdata.loc[(date >= datetime(2017, 6, 21)) & (date <= datetime(2017, 9, 20))]

Before starting visualizing the data, let us look at how many data points are missing in the dataset.

In [16]:
len(summer)

134525

In [17]:
summer.isna().sum()

station          0
state            0
latitude         0
longitude        0
elevation        0
date             0
TMIN           776
TMAX           763
TAVG         51927
AWND         59091
WDF5         61834
WSF5         61800
SNOW         90714
SNWD         31739
PRCP          1403
dtype: int64

## Visualization

I first get a background map of USA for plotting.

In [18]:
usa = data.us_10m.url

### Where is it hottest during summer?

In [19]:
Tmin = summer.groupby(['station', 'latitude', 'longitude']).agg({'TMIN': 'min'}).reset_index()

In [20]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='Minimum temperature in summer').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(Tmin).mark_circle(size=15).encode(
        latitude='latitude:Q',
        longitude='longitude:Q',
        color=alt.Color('TMIN:Q', scale=alt.Scale(domain=[0, 80], clamp=True, scheme='plasma'),
        legend=alt.Legend(title='Temperature (F)'))
    )
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

It can get cold in the Rocky Mountains, but it never gets cold (in summer) in the southernmost part of the USA.

In [21]:
Tmax = summer.groupby(['station', 'latitude', 'longitude']).agg({'TMAX': 'max'}).reset_index()

In [22]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='Maximum temperature in summer').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(Tmax).mark_circle(size=15).encode(
        latitude='latitude:Q',
        longitude='longitude:Q',
        color=alt.Color('TMAX:Q', scale=alt.Scale(domain=[65, 130], clamp=True, scheme='plasma'),
        legend=alt.Legend(title='Temperature (F)'))
    )
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

It can get very hot in California, Arizona, and the central part of the USA.

In [23]:
Tavg = summer.groupby(['station', 'latitude', 'longitude']).agg({'TAVG': 'mean'}).reset_index()

In [24]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='Average temperature in summer').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(Tavg).mark_circle(size=15).encode(
        latitude='latitude:Q',
        longitude='longitude:Q',
        color=alt.Color('TAVG:Q', scale=alt.Scale(domain=[45, 100], clamp=True, scheme='plasma'),
        legend=alt.Legend(title='Temperature (F)'))
    )
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

There are many missing data for the average temperature, so this map may not give reliable information.

I read about a definition of canicule as a time when it does not get cool during the night. I will define a hot night as a day where the minimum temperature stays higher than 70. How many hot nights were there during summer 2017?

In [25]:
canicule = summer.loc[summer['TMIN'] >= 70]

In [26]:
canicule = canicule.groupby(['station', 'latitude', 'longitude', 'state']).agg({'TMIN': 'count'}).reset_index()

In [27]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='Number of days of canicule in summer').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(canicule).mark_circle(size=15).encode(
        latitude='latitude:Q',
        longitude='longitude:Q',
        color=alt.Color('TMIN:Q', scale=alt.Scale(domain=[0, 90], clamp=True, scheme='plasma'),
        legend=alt.Legend(title='Days'))
    )
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

If you like the temperature to gets cooler during the night, you should avoid the southeastern part of the USA during summer.

### Where is it dryest during summer?

In [28]:
rain = summer.groupby(['station', 'latitude', 'longitude']).agg({'PRCP': 'sum'}).reset_index()

In [29]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='Precipitation during summer').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(rain).mark_circle(size=15).encode(
        latitude='latitude:Q',
        longitude='longitude:Q',
        color=alt.Color('PRCP:Q', scale=alt.Scale(domain=[0, 20], clamp=True, scheme='plasma'),
        legend=alt.Legend(title='Precipitation (in)'))
    )
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

I note that the rainiest part of the USA is again the southeastern part. Let us look at how well temperature and precipitation correlate in summer.

In [30]:
both = summer.groupby(['station', 'latitude', 'longitude']).agg({'PRCP': 'sum', 'TMIN': 'mean'}).reset_index()

In [31]:
alt.Chart(both, title='Are high temperatures correlated with high precipitation?').mark_circle(size=15).encode(
    alt.X('TMIN:Q', title='Minimum temperature (F)', scale=alt.Scale(domain=[30, 90])),
    alt.Y('PRCP:Q', title='Precipitation (in)', scale=alt.Scale(domain=[0, 70]))
).properties(
    width=700,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_axis(titleFontSize=20, labelFontSize=20)

Maybe.

### Where is it windiest during summer?

There are many missing data for the wind speed and direction, so I am not sure making a map would provide a reliable answer to this question.

Hurricane Harvey hit Texas and Louisiana in August 2017. Instead, let us look whether we can follow its trajectory using the weather dataset.

In [32]:
august24 = subdata.loc[subdata.date == datetime(2017, 8, 24)]

In [33]:
august25 = subdata.loc[subdata.date == datetime(2017, 8, 25)]

In [34]:
august26 = subdata.loc[subdata.date == datetime(2017, 8, 26)]

In [35]:
august27 = subdata.loc[subdata.date == datetime(2017, 8, 27)]

In [36]:
august28 = subdata.loc[subdata.date == datetime(2017, 8, 28)]

In [37]:
august29 = subdata.loc[subdata.date == datetime(2017, 8, 29)]

In [38]:
august30 = subdata.loc[subdata.date == datetime(2017, 8, 30)]

In [39]:
august31 = subdata.loc[subdata.date == datetime(2017, 8, 31)]

In [40]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 24 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august24).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

On August 24th, average wind speeds are still low. The hurricane has not arrived yet.

In [41]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 25 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august25).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

On August 25th, wind speeds get higher on the coastal areas of Texas.

In [42]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 26 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august26).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

The hurrican is hitting Texas on August 26th.

In [43]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 27 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august27).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

In [44]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 28 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august28).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

In [45]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 29 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august29).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

The hurricane starts moving to the east.

In [46]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 30 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august30).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

The hurrican is hitting Louisian now, but with lower wind strength.

In [47]:
alt.layer(
    alt.Chart(alt.topo_feature(usa, 'states'), title='August 31 2017').mark_geoshape(
        fill='#ddd', stroke='#fff', strokeWidth=1
    ),
    alt.Chart(august31).mark_circle().encode(
        alt.Latitude('latitude:Q'),
        alt.Longitude('longitude:Q'),
        alt.Size('AWND:Q', scale=alt.Scale(domain=[0, 45]), legend=None),
        alt.Color('AWND:Q', scale=alt.Scale(domain=[0, 45], scheme='turbo'),
        legend=alt.Legend(title='Wind (mph)'))
)
).project(
    type='albersUsa'
).properties(
    width=750,
    height=500
).configure_view(
    stroke=None
).configure_title(fontSize=24, anchor="middle").configure_legend(titleFontSize=20, labelFontSize=20)

The hurricane seems to be over now.