<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Week-2:-Python-and-Metro" data-toc-modified-id="Week-2:-Python-and-Metro-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Week 2: Python and Metro</a></span><ul class="toc-item"><li><span><a href="#A-quick-geopandas-teaser" data-toc-modified-id="A-quick-geopandas-teaser-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>A quick geopandas teaser</a></span></li><li><span><a href="#Pandas-Data-Types" data-toc-modified-id="Pandas-Data-Types-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Pandas Data Types</a></span></li><li><span><a href="#Data-exploration" data-toc-modified-id="Data-exploration-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Data exploration</a></span><ul class="toc-item"><li><span><a href="#Counting-unique-values-in-a-column" data-toc-modified-id="Counting-unique-values-in-a-column-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Counting unique values in a column</a></span></li><li><span><a href="#Trimming-the-data" data-toc-modified-id="Trimming-the-data-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Trimming the data</a></span></li><li><span><a href="#Subsetting/Querying-the-data" data-toc-modified-id="Subsetting/Querying-the-data-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Subsetting/Querying the data</a></span></li></ul></li><li><span><a href="#Plotting" data-toc-modified-id="Plotting-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Plotting</a></span></li><li><span><a href="#Back-to-mapping" data-toc-modified-id="Back-to-mapping-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Back to mapping</a></span></li><li><span><a href="#Projections" data-toc-modified-id="Projections-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Projections</a></span></li><li><span><a href="#Iterating-through-rows-in-a-dataframe" data-toc-modified-id="Iterating-through-rows-in-a-dataframe-1.7"><span 
class="toc-item-num">1.7&nbsp;&nbsp;</span>Iterating through rows in a dataframe</a></span></li><li><span><a href="#Get-average-lat/lon's" data-toc-modified-id="Get-average-lat/lon's-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Get average lat/lon's</a></span></li><li><span><a href="#Unique-values-in-a-column" data-toc-modified-id="Unique-values-in-a-column-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Unique values in a column</a></span></li><li><span><a href="#Update-a-column-based-on-a-query-on-another-column" data-toc-modified-id="Update-a-column-based-on-a-query-on-another-column-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Update a column based on a query on another column</a></span></li></ul></li></ul></div>

# Week 2: Python and Metro

## A quick geopandas teaser
Following our Python bootcamp last week (was it boring? exhilarating? a bit of both?), let's put that programming knowledge into action, using and creating data that reflects a real urban situation.

We start by importing a new module, `geopandas`. This is a pretty high-level geospatial library, widely used by spatial data scientists all over the world. Don't worry about it too much for now, but know that it allows us to import a variety of spatial data formats and plot them on a map.

In [1]:
import geopandas as gpd

Next, we import some data. In this case, it is a shapefile I downloaded from the [LA Metro's Developer web portal](https://developer.metro.net/bus-rail-gis-data/). Notice that I am using a relative path to point to where the data is located. Each `../` means "go up one folder level", so `../../` takes us two levels up (to the project root), and from there we go into the `data` folder.

In [2]:
metro = gpd.read_file('../../data/MetroStations/Stations_All_0715.shp')
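To see how a relative path like the one above resolves, here is a small sketch that uses only the standard library. The starting folder `/home/student/project/weeks/week02` is a made-up assumption standing in for wherever this notebook lives on your machine.

```python
import posixpath

# Hypothetical folder that holds this notebook (an assumption for illustration)
notebook_dir = "/home/student/project/weeks/week02"

# The relative path used in the cell above
relative_path = "../../data/MetroStations/Stations_All_0715.shp"

# Join the two, then collapse each "../" into one step up the folder tree
resolved = posixpath.normpath(posixpath.join(notebook_dir, relative_path))
print(resolved)
```

Two `../` steps climb from `week02` to `weeks` and then to `project`, which is why the path ends up inside `project/data`.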

<div class="alert alert-info">

Note that the reason we use `geopandas` instead of `pandas` (other than the fact that we love maps) is that `pandas` cannot read shapefiles, whereas `geopandas` can.
    
</div>

In [3]:
# what's the data type?
type(metro)

geopandas.geodataframe.GeoDataFrame

In [4]:
# what does the data look like? 
metro.head()

| | LINE | LINENUM | LINENUM2 | STNSEQ | STNSEQ2 | DIR | STOPNUM | STATION | LAT | LONG | TPIS_NAME | POINT_X | POINT_Y | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Blue | 801 | 0 | 21 | 0 | | 80101 | Downtown Long Beach Station | 33.768071 | -118.192921 | Long Bch | 6503030.0 | 1738034.0 | POINT (6503030.095 1738033.828) |
| 1 | Blue | 801 | 0 | 22 | 0 | North | 80102 | Pacific Ave Station | 33.772258 | -118.1937 | Pacific | 6502796.0 | 1739558.0 | POINT (6502796.262 1739558.050) |
| 2 | Blue | 801 | 0 | 18 | 0 | | 80105 | Anaheim Street Station | 33.78183 | -118.189384 | Anaheim | 6504115.0 | 1743039.0 | POINT (6504114.567 1743039.068) |
| 3 | Blue | 801 | 0 | 17 | 0 | | 80106 | Pacific Coast Hwy Station | 33.78909 | -118.189382 | PCH | 6504120.0 | 1745681.0 | POINT (6504120.152 1745681.179) |
| 4 | Blue | 801 | 0 | 16 | 0 | | 80107 | Willow Street Station | 33.807079 | -118.189834 | Willow | 6503995.0 | 1752228.0 | POINT (6503995.170 1752228.119) |


Ah! Surprise, surprise. Welcome to your first look at a pandas dataframe (a GeoDataFrame is a pandas dataframe with an added `geometry` column). We will cover dataframes more extensively in later sessions, but know that a pandas dataframe is like an Excel spreadsheet.

The `head()` command shows us the first 5 rows of the dataframe. You can also use `tail()` (the last 5 rows) and `sample()` (one randomly chosen row by default). Try these commands in the cells below:

In [5]:
# try tail()


In [6]:
# try sample()
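If you want to check your understanding first, here is a minimal sketch on a tiny stand-in dataframe (built by hand from the station names shown above, not loaded from the shapefile). The variable names are my own, and `random_state` is only there to make the random draw repeatable.

```python
import pandas as pd

# A tiny stand-in dataframe (station names copied from the output above)
toy = pd.DataFrame({
    "STATION": [
        "Downtown Long Beach Station",
        "Pacific Ave Station",
        "Anaheim Street Station",
        "Pacific Coast Hwy Station",
        "Willow Street Station",
    ],
    "LINE": ["Blue"] * 5,
})

first_two = toy.head(2)                     # first 2 rows
last_two = toy.tail(2)                      # last 2 rows
random_two = toy.sample(2, random_state=1)  # 2 random rows, repeatable draw

print(first_two)
print(last_two)
print(random_two)
```

All three accept a number of rows, so `metro.head(10)` or `metro.sample(3)` work the same way on the real data.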


## Pandas Data Types

Let's look at the data types for each column. You can get the data types of all columns in a dataframe at once using the `dtypes` attribute.

In [7]:
metro.dtypes

LINE           object
LINENUM         int64
LINENUM2        int64
STNSEQ          int64
STNSEQ2         int64
DIR            object
STOPNUM         int64
STATION        object
LAT           float64
LONG          float64
TPIS_NAME      object
POINT_X       float64
POINT_Y       float64
geometry     geometry
dtype: object

But there is a better command that will get you more info. Yes, the `info()` command.

In [8]:
# dataframe info
metro.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 83 entries, 0 to 82
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   LINE       83 non-null     object  
 1   LINENUM    83 non-null     int64   
 2   LINENUM2   83 non-null     int64   
 3   STNSEQ     83 non-null     int64   
 4   STNSEQ2    83 non-null     int64   
 5   DIR        3 non-null      object  
 6   STOPNUM    83 non-null     int64   
 7   STATION    83 non-null     object  
 8   LAT        83 non-null     float64 
 9   LONG       83 non-null     float64 
 10  TPIS_NAME  83 non-null     object  
 11  POINT_X    83 non-null     float64 
 12  POINT_Y    83 non-null     float64 
 13  geometry   83 non-null     geometry
dtypes: float64(4), geometry(1), int64(5), object(4)
memory usage: 9.2+ KB


Wait. That looks different from what we have worked on! As it turns out, pandas data types are slightly different from the native Python data types. Check out the table below:

<table class="table table-striped">
  <thead>
    <tr>
      <th>Pandas Type</th>
      <th>Native Python Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>object</td>
      <td>string</td>
      <td>The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).</td>
    </tr>
    <tr>
      <td>int64</td>
      <td>int</td>
      <td>Whole numbers. The 64 refers to the number of bits of memory allocated to hold each value.</td>
    </tr>
    <tr>
      <td>float64</td>
      <td>float</td>
      <td>Numbers with decimals. If a column contains numbers and NaNs (missing values), pandas will default to float64, because NaN is itself a floating-point value.</td>
    </tr>
    <tr>
      <td>datetime64[ns], timedelta64[ns]</td>
      <td>N/A (but see the <a href="https://docs.python.org/3/library/datetime.html">datetime</a> module in Python’s standard library)</td>
      <td>Values meant to hold time data. Look into these for time series experiments.</td>
    </tr>
  </tbody>
</table>
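To make the table concrete, here is a minimal sketch that builds a tiny dataframe by hand and lets pandas infer the types. The column names are invented for illustration and are not part of the Metro data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "whole_numbers": [1, 2, 3],      # only integers
    "with_missing": [1, np.nan, 3],  # integers plus a missing value
    "mixed": [1, "two", 3.0],        # numbers and a string together
})

# One dtype per column
print(toy.dtypes)
```

The column with a missing value comes back as `float64` even though the other entries are whole numbers, and the mixed column falls back to `object`.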

## Data exploration

Part of data exploration is learning what is in your data. How many rows are there? What are the columns? How many rows represent a particular slice of the data?

### Counting unique values in a column

First, learn how to get values for a single column.

In [21]:
# single column
metro['LINE']

0     Blue
1     Blue
2     Blue
3     Blue
4     Blue
      ... 
78    Gold
79    Gold
80    Gold
81    Gold
82    Gold
Name: LINE, Length: 83, dtype: object

But what if you want to know how many stations there are for each line?

In [9]:
metro['LINE'].value_counts()

Gold          21
Blue          20
Green         14
EXPO          10
Red            8
Red/Purple     6
Purple         2
Blue/EXPO      2
Name: LINE, dtype: int64
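Here is a small, self-contained sketch of the same idea on a hand-made series, so you can see what `value_counts()` does without the shapefile. The values are invented; `normalize=True` and `nunique()` are standard pandas options shown here as optional extras.

```python
import pandas as pd

# A hand-made column of line names (invented values for illustration)
lines = pd.Series(["Gold", "Blue", "Gold", "Green", "Gold", "Blue"], name="LINE")

counts = lines.value_counts()                # rows per unique value, largest first
shares = lines.value_counts(normalize=True)  # same counts, as fractions of the total
n_unique = lines.nunique()                   # how many distinct values there are

print(counts)
print(shares)
print(n_unique)
```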

In [10]:
# try it yourself. find the unique values for LINENUM


### Trimming the data
Oftentimes, we import data and it has too many columns. It is always good practice to eliminate those columns that you are sure you will not use, and keep your data "clean" and "mean."
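A minimal sketch of two common ways to trim columns, shown on a small hand-made dataframe that borrows a few column names from the Metro data. Which columns are worth keeping is an assumption made for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "LINE": ["Blue", "EXPO"],
    "LINENUM": [801, 806],
    "STATION": ["Willow Street Station", "Expo / Western Station"],
    "LAT": [33.807079, 34.018331],
    "LONG": [-118.189834, -118.30891],
    "TPIS_NAME": ["Willow", "Ex/Wstrn"],
})

# Option 1: list the columns you want to keep
keep = ["LINE", "STATION", "LAT", "LONG"]
trimmed = toy[keep].copy()  # .copy() gives an independent dataframe

# Option 2: list the columns you want to drop
also_trimmed = toy.drop(columns=["LINENUM", "TPIS_NAME"])

print(trimmed.columns.tolist())
print(also_trimmed.columns.tolist())
```

Neither option changes `toy` itself; both return a new, narrower dataframe.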

### Subsetting/Querying the data

In [11]:
expo = metro[metro.LINE == 'EXPO']
expo.head()

| | LINE | LINENUM | LINENUM2 | STNSEQ | STNSEQ2 | DIR | STOPNUM | STATION | LAT | LONG | TPIS_NAME | POINT_X | POINT_Y | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | EXPO | 806 | 0 | 10 | 0 | | 80123 | LATTC / Ortho Institute Station | 34.029112 | -118.273603 | LATTC/Ortho | 6478766.0 | 1833089.0 | POINT (6478766.175 1833089.413) |
| 21 | EXPO | 806 | 0 | 9 | 0 | | 80124 | Jefferson / USC Station | 34.022123 | -118.278118 | Jeff/USC | 6477391.0 | 1830550.0 | POINT (6477391.116 1830549.740) |
| 22 | EXPO | 806 | 0 | 8 | 0 | | 80125 | Expo Park / USC Station | 34.018227 | -118.285734 | Expo Pk | 6475079.0 | 1829138.0 | POINT (6475079.285 1829138.386) |
| 23 | EXPO | 806 | 0 | 7 | 0 | | 80126 | Expo / Vermont Station | 34.018245 | -118.29154 | Ex/Vrmnt | 6473320.0 | 1829150.0 | POINT (6473319.888 1829149.989) |
| 24 | EXPO | 806 | 0 | 6 | 0 | | 80127 | Expo / Western Station | 34.018331 | -118.30891 | Ex/Wstrn | 6468056.0 | 1829197.0 | POINT (6468056.285 1829197.007) |
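The query above works in two steps: the comparison produces a column of `True`/`False` values, and the dataframe keeps only the rows marked `True`. Here is a minimal sketch on a hand-made dataframe (the Gold row is invented), which also shows how to combine conditions.

```python
import pandas as pd

toy = pd.DataFrame({
    "LINE": ["Blue", "EXPO", "EXPO", "Gold"],
    "STNSEQ": [16, 10, 6, 3],
})

# Step 1: the comparison gives one True/False per row
is_expo = toy["LINE"] == "EXPO"

# Step 2: use it to keep only the matching rows
expo = toy[is_expo]

# Combine conditions with & (and) or | (or); each condition needs parentheses
expo_late = toy[(toy["LINE"] == "EXPO") & (toy["STNSEQ"] > 8)]

# Match any value from a list with isin()
blue_or_gold = toy[toy["LINE"].isin(["Blue", "Gold"])]

print(is_expo.tolist())
print(expo_late)
print(blue_or_gold)
```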


In [None]:
# try it yourself. create variables for each unique LINE in the dataframe


## Plotting

In [None]:
metro.plot()

In [None]:
expo.plot()

Now let's make the plot more informative. Don't worry about the intricacies of the syntax just yet (we will learn this in much more detail later), but remember, "command, bracket, arguments!"

In [None]:
metro.plot(figsize=(20, 12),  # size of the plot (a bit bigger than the default)
           column='LINE',     # column that defines the color of the dots
           legend=True,       # add a legend
           legend_kwds={
               'loc': 'upper right',
               'bbox_to_anchor': (1.3, 1)
           }                  # this puts the legend to the side
          )


## Back to mapping
We can't finish our lesson without a map :). Let's go back to the original metro dataset that was the inspiration for this notebook. Recall that we used the module `geopandas` to define the data.

In [None]:
metro.head()

## Projections

In [None]:
# what is the projection?
metro.crs

In [None]:
# let's reproject it
metro_gcs = metro.to_crs("EPSG:4326")

In [None]:
type(metro_gcs.geometry)
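Reprojecting changes the numbers that describe each point, not the place it refers to. The sketch below builds a one-row GeoDataFrame by hand from a latitude/longitude pair shown earlier, converts it to a projected coordinate system, and converts it back. `EPSG:2229` (a California State Plane zone measured in feet) is my assumption for illustration, not necessarily the projection of the Metro shapefile.

```python
import geopandas as gpd
import pandas as pd

toy = pd.DataFrame({
    "STATION": ["Downtown Long Beach Station"],
    "LAT": [33.768071],
    "LONG": [-118.192921],
})

# Build point geometry from the columns; x is longitude, y is latitude
stations = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["LONG"], toy["LAT"]),
    crs="EPSG:4326",  # geographic coordinates in degrees
)

# Reproject to a projected CRS (EPSG:2229 is an assumed example)
projected = stations.to_crs("EPSG:2229")

# ...and back to latitude/longitude
round_trip = projected.to_crs("EPSG:4326")

print(projected.geometry.iloc[0])
print(round_trip.geometry.iloc[0])
```

The projected coordinates are large numbers in feet, while the round trip returns the original degrees (up to tiny rounding differences).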

Now it's time for another module. Everybody, please welcome `folium`. Folium brings Leaflet, an open source JavaScript mapping library, into our Python environment, allowing you to create instant interactive maps. We will use it shortly. First, let's plot the reprojected data:

In [None]:
metro_gcs.plot()

In [None]:
# what did it look like before we reprojected it?
metro.plot()

## Iterating through rows in a dataframe

You learned how to loop through a Python list. Looping over rows in a dataframe is similar, but the syntax is slightly different.

In [None]:
for index, row in metro_gcs.iterrows():
    print(row.STATION, row.LINE)
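In the loop above, `iterrows()` hands you two things on every pass: the row's index label and the row itself. Here is a minimal sketch on a hand-made dataframe; the `itertuples()` variant is an optional alternative I am adding for comparison, and it is usually faster on large dataframes.

```python
import pandas as pd

toy = pd.DataFrame({
    "STATION": ["Pacific Ave Station", "Expo / Vermont Station"],
    "LINE": ["Blue", "EXPO"],
})

# iterrows() yields (index, row) pairs; row values are accessed by column name
labels = []
for index, row in toy.iterrows():
    labels.append(f"{index}: {row.STATION} ({row.LINE})")

# itertuples() yields one namedtuple per row; the index is stored in .Index
labels_fast = [f"{t.Index}: {t.STATION} ({t.LINE})" for t in toy.itertuples()]

print(labels)
print(labels_fast)
```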

In [None]:
metro_gcs.head()

## Get average lat/lon's

In [None]:
latitude = metro.LAT.mean()
latitude

In [None]:
longitude = metro.LONG.mean()
longitude

In [None]:
import folium

In [None]:
# initialize map
# ('Stamen Terrain' tiles are not bundled with recent folium releases, so we use OpenStreetMap)
m = folium.Map(location=[latitude, longitude], tiles='OpenStreetMap', zoom_start=10)
m

In [None]:
# add the stations
for index, row in metro.iterrows():
    folium.Marker([row.LAT,row.LONG], popup=row.STATION, tooltip=row.STATION).add_to(m)
m

In [None]:
# add a column
metro['color'] = ''

In [None]:
metro.head()

## Unique values in a column

In [None]:
metro.LINE.unique()

In [None]:
# display rows that match a query
metro.loc[metro['LINE'] == 'EXPO']

## Update a column based on a query on another column

In [None]:
metro.loc[metro['LINE'] == 'EXPO', 'color'] = 'orange'

In [None]:
metro.loc[metro['LINE'] == 'Blue', 'color'] = 'blue'
metro.loc[metro['LINE'] == 'Blue/EXPO', 'color'] = 'cadetblue'
metro.loc[metro['LINE'] == 'Red', 'color'] = 'red'
metro.loc[metro['LINE'] == 'Red/Purple', 'color'] = 'darkred'
metro.loc[metro['LINE'] == 'Purple', 'color'] = 'purple'
metro.loc[metro['LINE'] == 'Green', 'color'] = 'green'
metro.loc[metro['LINE'] == 'Gold', 'color'] = 'beige'
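The repeated `.loc` assignments above can also be written as a single lookup. This is a sketch of an alternative, not a replacement for the approach above: it stores the line-to-color pairs in a dictionary and applies them with `map()`. The `Silver` row and the `gray` fallback are invented to show what happens to a line that is missing from the dictionary.

```python
import pandas as pd

# Same line-to-color pairs as in the cells above
line_colors = {
    "EXPO": "orange",
    "Blue": "blue",
    "Blue/EXPO": "cadetblue",
    "Red": "red",
    "Red/Purple": "darkred",
    "Purple": "purple",
    "Green": "green",
    "Gold": "beige",
}

# Hand-made stand-in for the metro dataframe ("Silver" is a made-up line)
toy = pd.DataFrame({"LINE": ["EXPO", "Blue", "Gold", "Silver"]})

# map() looks up each LINE in the dictionary; unmatched lines become NaN,
# which fillna() then replaces with a fallback color
toy["color"] = toy["LINE"].map(line_colors).fillna("gray")

print(toy)
```

Keeping the pairs in one dictionary makes it easier to spot a missing line or change a color later.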



In [None]:
metro.sample(5)

In [None]:
# add the stations with color icons
for index, row in metro.iterrows():
    folium.Marker([row.LAT,row.LONG], popup=row.STATION, tooltip=row.STATION,icon=folium.Icon(color=row.color)).add_to(m)
m

In [None]:
m.save('metro.html')