# Lab02 - Data manipulation, geostatistical analysis, and mapping

Learning goals

- Develop skills in importing and modifying data using Python.
- Perform geographic statistical operations and transformations using Python.
- Develop skills in new and innovative means of mapping.

---

This lab will look at an incredibly esoteric dataset and question: how has the geographic center of baseball teams shifted since 1900?

*Disclaimer* - I do not really even like baseball, but it is a cool dataset and also tells us a bit about how we have shifted geographically as a society.

*Disclaimer* - Liam, last year's TA, really likes baseball, go Cubs

*Disclaimer* - Maddie knows nothing about baseball, and is more of a hockey fan, go AVS!

In [None]:
# This will install all of the required packages and modules
!apt install libspatialindex-dev --quiet
!pip install folium --quiet # folium is a Python package, so it is installed with pip rather than apt
!pip install rtree --quiet
!pip install geopandas --quiet
!pip install -U mapclassify --quiet # geopandas is already installed above; GitHub no longer serves the old git:// install URL
!pip install descartes --quiet # Helps geopandas plot polygons
!pip install pysal --quiet # Map classifiers for choropleth maps
!apt-get install software-properties-common python-software-properties > /dev/null
# !add-apt-repository ppa:ubuntugis/ppa -y > /dev/null
!apt-get update > /dev/null
#!apt-get install -y --fix-missing python-gdal gdal-bin libgdal-dev > /dev/null
#!pip2 install OpticalRS > /dev/null

In [2]:
# First let's import the packages we will need. Always put these up front!
%matplotlib inline
import numpy as np
import math
import folium #Folium is a library that allows us to create webmaps
# This will get all of the libraries (subcomponents of the packages and modules).  
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statistics
import scipy
from scipy.stats import kurtosis
from scipy.stats import skew
from scipy.stats import variation
import geopandas as gpd
import mapclassify
import os



In [3]:
# Change your computer's directory to where you have saved the baseball .csv file
# GitHub is set up so that you can download the code and change the specification on where your data is stored
# For me, I have all of the data saved in my GPHY504 folder
os.chdir("/Users/f67f911/Desktop/GPHY504_Lab/")

In [4]:
# read in the csv file and look at it
# Since you have changed your directory, this should automatically be read in
df = pd.read_csv('Baseball_Decades.csv')
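# Print the dataframe. Note: df.head without parentheses prints the bound method (and the whole frame with it);
# call df.head() with parentheses to see only the first five rows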
print(df.head)
# Look at the different types of data that are stored in the dataframe
print(df.dtypes)

<bound method NDFrame.head of      Decade        Team   State  Latitude  Longitude
0      1900      Boston      MA  42.35866   -71.0567
1      1900    Brooklyn      NY  40.69245   -73.9904
2      1900     Chicago      IL  41.88425   -87.6324
3      1900  Cincinnati      OH  39.10713   -84.5041
4      1900    New York      NY  40.78200   -73.8317
..      ...         ...     ...       ...        ...
137    2010   St. Louis      MO  38.62775   -90.1996
138    2010   Tampa Bay      FL  27.58300   -82.6330
139    2010       Tempe      AZ  33.42551  -111.9370
140    2010     Toronto  Canada  43.65740   -79.4328
141    2010  Washington      DC  38.89037   -77.0320

[142 rows x 5 columns]>
Decade         int64
Team          object
State         object
Latitude     float64
Longitude    float64
dtype: object


### It is important to look at the data types and understand the data; overlooking them can trip you up later on!

### Indexing

[Indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) on a pandas DataFrame allows you to subset or filter the data based on one or more conditions that you provide.

A couple of new concepts: To access all rows of one or more columns, we can simply specify the column names in brackets. Note that the printout doesn't show all of our values, but it tells us the total length of the column and the data type of its elements. Since each pandas DataFrame column is a Series (typically backed by a NumPy array), all values in a column share the same data type.

---
Understanding how to quickly index data will help you immensely with using coding languages in the future!
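As a minimal sketch of column selection, the toy DataFrame below (hypothetical values, not the lab's baseball file) shows that one column name in brackets gives back a Series, while a list of names gives back a smaller DataFrame with the columns in the order requested.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Decade": [1900, 1900, 1910],
    "Team": ["Boston", "Chicago", "Boston"],
    "Latitude": [42.36, 41.88, 42.36],
})

one_column = toy["Team"]               # single name -> a pandas Series
two_columns = toy[["Team", "Decade"]]  # list of names -> a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(list(two_columns.columns))       # columns come back in the order we asked for
print(toy["Latitude"].dtype)           # every value in a column shares one dtype
```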

In [None]:
# Let's look at all of the cities that had a team
print(df['Team']) # just print out the column of the cities that had a team

In [None]:
# We can call the function 'unique' to look at all the unique team names
df.Team.unique()

In [None]:
# Let's look at all of the years we have data for as well
print(df['Decade'].unique())

### To access data by index labels, we must use the `.loc` attribute before the brackets. Recall from above that we currently have a `RangeIndex` with numbers that go from 0 to 141 with a step size of 1.

In [None]:
df.loc[0] # just grab the first row

In [None]:
df.loc[70:80] # grab rows 70 - 80

In [None]:
df.loc[10:20, ['Team','Decade']] # grab rows 10 - 20, but just the team and decade columns. 
# Notice how the column order was switched based on the order we specified to grab column data from

#### We can use conditional statements and boolean logic in combination with `.loc` as well. Instead of the English words `and` and `or` like we used in Lab 1 conditional statements, use `&` and `|` characters for bitwise comparisons in the `.loc` statements. Also note that multiple logical statements each need to be enclosed in their own parentheses:
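A small sketch of both operators, using a hypothetical toy DataFrame rather than the lab data, may help before moving on. Each condition sits in its own parentheses, `|` keeps rows that match either condition, and `&` keeps rows that match both.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "Decade": [1900, 1900, 1960, 1960],
    "Team": ["Boston", "Brooklyn", "Cincinnati", "Cleveland"],
    "State": ["MA", "NY", "OH", "OH"],
})

# OR: rows that are in Ohio or in New York
oh_or_ny = toy.loc[(toy["State"] == "OH") | (toy["State"] == "NY")]

# AND: rows from 1960 that are also in Ohio
oh_1960 = toy.loc[(toy["Decade"] == 1960) & (toy["State"] == "OH")]

print(oh_or_ny)
print(oh_1960)
```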

Let's find all of the data for 1960...

In [None]:
# Create a new data frame just with team data from the 1960's
teams_1960s = df.loc[df['Decade'] == 1960]
teams_1960s

### Let's just look at the teams in the 1960s that were in Ohio.

In [None]:
# Combine a numerical value (Decade) with a text string (State)
# We can use the & sign to specify when we want information that fits BOTH conditions
data_1960_OH = df.loc[(df['Decade'] == 1960) & df.State.str.contains('OH')]
data_1960_OH

In [None]:
# Do the same as above but for the year 1900 and teams that were in the state of New York!
data_1900_NY = # enter your code here
data_1900_NY

### One thing to notice is the difference between the following two lines of code:

```
df.loc[df['Decade'] == 1960]
```

```
teams_1960s = df.loc[df['Decade'] == 1960]
```

While both code blocks might look the same, the first code block simply displays the selection, like a filter in Excel. The second code block actually creates a new variable from the subset.

## Questions
### 1. What team is saved in the 75th row? 
Your answer here
### 2. How many baseball teams were in NY state in 1900? How many teams were in NY state in 2000?
Your answer here

--- 
## Getting the Latitude and Longitude into Cartesian space

While lat and lon are easy for us to visualize and work well on a flat map, they are not great for measuring distance on something like a sphere. Because of that, we need to convert the Latitude and Longitude columns into radians and then into Cartesian coordinates. This gives each location a numeric X, Y, Z position rather than a pair of angular coordinates.

Convert Lat and Lon from degrees to radians using the general formulas.

```
lat_rad = lat * PI/180
lon_rad = lon * PI/180
```
Now we need to get into a Cartesian space.

```
X = cos(lat_rad)*cos(lon_rad)
Y = cos(lat_rad)*sin(lon_rad)
Z = sin(lat_rad)
```
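As a quick worked example of these formulas, the sketch below converts a single point (Boston's coordinates from the data preview above) using only the standard `math` module. The `radius` check is an addition for illustration: because the formulas place every point on a unit sphere, it should come out as 1.

```python
import math

# Boston's coordinates as listed in the data preview above
lat, lon = 42.35866, -71.0567

# Degrees to radians
lat_rad = lat * math.pi / 180
lon_rad = lon * math.pi / 180

# Radians to Cartesian coordinates on a unit sphere
x = math.cos(lat_rad) * math.cos(lon_rad)
y = math.cos(lat_rad) * math.sin(lon_rad)
z = math.sin(lat_rad)

# Sanity check: the point should sit at distance 1 from the origin
radius = math.sqrt(x * x + y * y + z * z)
print(x, y, z, radius)
```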

So now let's turn that into executable code. To see how that might work, you need to run the following code in a new code cell:


```
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
demo["shiny_new_column"] = demo["a"] * demo["b"]

print(demo)
```

### Now let's roll with this on our Baseball dataset. 

In [None]:
# Python has some nice built in math variables, such as pi
PI = math.pi # set the PI variable using the math package for simplicity 
# Look at the variable
PI

In [None]:
# Finish this cell block for lon_rad
df["lat_rad"] = df["Latitude"] * PI/180

# You insert code for lon_rad here


### Now convert to X, Y, Z Cartesian space.

In [None]:
# Convert to Z Cartesian space
df['Z'] = np.sin(df.lat_rad)

# You complete the code for X and Y

Alright, so now we have our dataframe all set. Now we just have to compute the geographic mean using the formulas:

$\overline{X} = \frac{\sum_{i=1}^{n} X_{i}}{n}$


$\overline{Y} = \frac{\sum_{i=1}^{n} Y_{i}}{n}$

$\overline{Z} = \frac{\sum_{i=1}^{n} Z_{i}}{n}$

for each decade...
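To see the mean formulas in action on something small, here is a sketch with two hypothetical points on the equator (longitudes 0 and 90 degrees), not taken from the baseball data. The variable names `x_bar`, `y_bar`, and `z_bar` are chosen for illustration only.

```python
import numpy as np

# Two hypothetical points on the equator, at longitudes 0 and 90 degrees
lat_rad = np.radians([0.0, 0.0])
lon_rad = np.radians([0.0, 90.0])

# Cartesian coordinates on a unit sphere
X = np.cos(lat_rad) * np.cos(lon_rad)
Y = np.cos(lat_rad) * np.sin(lon_rad)
Z = np.sin(lat_rad)

# Mean of each coordinate: sum of the values divided by how many there are
x_bar = X.sum() / len(X)
y_bar = Y.sum() / len(Y)
z_bar = Z.sum() / len(Z)

print(x_bar, y_bar, z_bar)
```

Summing and dividing by `n` is the same thing `X.mean()` computes, which is what the pandas `.mean()` call does for each column.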

---
So we need to do our analysis on the subset for each decade.

Hmm...

We need to go back to our indexing expertise. 

In [None]:
# Let's look at our dataframe. Make sure it has the columns that you calculated (lat_rad, lon_rad, X, Y, Z)!
df

In [None]:
# Let's take the mean values for the data from the 1960s
# numeric_only=True skips the text columns (Team, State), which cannot be averaged
data_1960X = df.loc[df['Decade'] == 1960].mean(numeric_only=True)
data_1960X

---
While that was pretty sweet, doing this for all of the decades will be a pain. Plus, you would end up with a separate variable (like `data_1960X`) for each of the decades.

This is where the power of computing really shows its strengths.

Pandas `groupby` groups data according to categories and applies a function to each category.

Let me show you what I mean.

In [None]:
geogMean = df.groupby(['Decade']).mean(numeric_only=True) # numeric_only=True skips the text columns (Team, State)
geogMean

Holy Cow - that was slick. But it looks like `Decade` is not a column name, and is instead an index. (This caused me a lot of suffering while developing this lab!)

> Let's redo our work, but clean up our dataset.

In [None]:
geogMean = df.groupby(['Decade'], as_index=False).mean(numeric_only=True) # Don't make decade an Index! Instead it should be a column name.
geogMean

## Questions
### 3. What are the mean x, y, and z coordinates for 2010?

---
# Converting back to lat and long

As we discussed in class, lat and lon are great for mapping things on a flat map, but do not do so well with measuring distances on a sphere.

To display our mean centers on a webmap, we have to take our x, y, and z measurements and return them to lat and lon.

Convert the average x, y, z coordinates to latitude and longitude (in radians).

```
meanLon = arctan2(y, x)
hyp = sqrt(x * x + y * y)
meanLat = arctan2(z, hyp)
```

Then convert lat and lon from radians to degrees.

```
meanLat = meanLat * 180/PI
meanLon = meanLon * 180/PI
```
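One way to convince yourself that these formulas undo the earlier conversion is a round trip on a single point. The sketch below, using only the standard `math` module and Boston's coordinates from the data preview, goes from degrees to x, y, z and back again. The names `lat_back` and `lon_back` are made up for this illustration.

```python
import math

# Boston's coordinates as listed in the data preview
lat, lon = 42.35866, -71.0567

# Forward: degrees -> radians -> Cartesian
lat_rad = math.radians(lat)
lon_rad = math.radians(lon)
x = math.cos(lat_rad) * math.cos(lon_rad)
y = math.cos(lat_rad) * math.sin(lon_rad)
z = math.sin(lat_rad)

# Backward: Cartesian -> radians -> degrees
lon_back = math.degrees(math.atan2(y, x))
hyp = math.sqrt(x * x + y * y)
lat_back = math.degrees(math.atan2(z, hyp))

print(lat_back, lon_back)  # should match the starting lat and lon
```

The same steps apply to the per-decade means, with `np.arctan2` and `np.sqrt` working on whole columns instead of single numbers.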

In [None]:
geogMean["meanLon"] = (np.arctan2(geogMean.Y, geogMean.X)) * (180/PI)
Hyp = np.sqrt(geogMean.X * geogMean.X + geogMean.Y * geogMean.Y)

geogMean

In [None]:
# Create a new dataframe to hold the mean latitude and mean longitude from our geogMean dataframe here
# .copy() makes geogClean an independent dataframe, so modifying it later does not trigger a SettingWithCopyWarning
geogClean = geogMean[['Decade', 'meanLon', 'meanLat']].copy()
# Look at the new geogClean dataframe
geogClean

# HINT: you may need to make an additional column to calculate mean latitude

### Now on to mapping the data! This is a geospatial class

In [None]:
# First, we need to make sure that the decade column in the cleaned dataframe is of type string!
geogClean['Decade'] = geogClean['Decade'].apply(str) # Converts the decade values to strings

Now let's map out these data using [Folium](https://python-visualization.github.io/folium/quickstart.html#Vincent/Vega-and-Altair/VegaLite-Markers). <- Click here for more info!

In [None]:
# Create a webmap here that displays the mean centers of MLB teams over time.
# Note: recent folium releases no longer bundle the "Stamen Toner" tiles,
# so a built-in CartoDB basemap is used here instead.
m = folium.Map(prefer_canvas=True, location=[40.57861, -85.186529],
               zoom_start=6, tiles="CartoDB positron")

# Add one text label per decade at that decade's mean center
for i in range(len(geogClean)):
    row = geogClean.iloc[i]
    folium.Marker(
        location=[row['meanLat'], row['meanLon']],
        # popup=row['Decade'],
        icon=folium.DivIcon(html=f"""<div style="font-family: courier new; color: blue">{row['Decade']}</div>"""),
    ).add_to(m)
m

## Questions
### 4. Our map shows an unweighted spatial **Mean** of professional baseball teams over time. What do you think might change if we graphed the spatial **Median** of the teams? Why?


### 5. Liam is a huge baseball fan. According to our map, where should he move to be as close to as many professional teams as possible? 


### 6. Liam wants to minimize his travel time to see as many games as possible. Do you think he should decide where to move using the spatial mean of the teams or the spatial median? Why?

