# Introduction to pyBaseball
For the first part of this class, we'll use the pyBaseball package to access data. pyBaseball is a Python package that provides a convenient API for the Baseball Savant website and the Lahman database. The Lahman database is a collection of .csv files that you download to your local machine.

There are lots of examples of pyBaseball queries in the project's GitHub repo. I encourage you to browse the repo and look at what's available.

If you haven't already done so, you need to install pyBaseball.

Open a terminal and type the following commands to pull and install the latest pybaseball:

* git clone https://github.com/jldbc/pybaseball
* cd pybaseball
* python setup.py install --user

To test that pybaseball installed correctly, run the following. If you get back data, pybaseball is working.

In [1]:
from pybaseball import statcast
data = statcast(start_dt='2017-06-24', end_dt='2017-06-27')
data.head(2)

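If you instead see the error below, pybaseball is not installed in the Python environment that the notebook is running in:

ModuleNotFoundError: No module named 'pybaseball'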
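
When the import fails, it can help to check which interpreter the notebook is using and whether that interpreter can find the package. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for illustration and is not part of pybaseball.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the current interpreter can locate the package."""
    return importlib.util.find_spec(package_name) is not None


# Path of the Python interpreter that the notebook kernel is running
print(sys.executable)

# True if pybaseball can be imported from this environment
print(is_installed("pybaseball"))
```

If the second line prints `False`, the package was probably installed for a different Python than the one shown on the first line.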

## Lahman Database ##
The Lahman database was created by Sean Lahman and contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2017. It includes data from the American and National Leagues, as well as other leagues from 1871-1875.

The version that we will work with here is a collection of .csv files. The files are all publicly available through several sources, including http://www.seanlahman.com. We're going to access it through the pybaseball Python package. You can find documentation on what's in each .csv on the Lahman website.

A quick pybaseball Lahman query to download the data:

In [17]:
# Import the Lahman helper functions
from pybaseball.lahman import *

# Download the entire Lahman database to your current working directory
download_lahman()

# The data is divided by category:
# .csv files for Batting, Fielding, Managers, Pitching, etc.

## Dataframe ##
A dataframe in Python is a 2d data structure with rows and columns. Think of dataframes as data tables where each row is an observation and each column is a feature of the observation.
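
To make the row and column idea concrete, here is a small sketch that builds a dataframe by hand with pandas. The values are made up for illustration and are not taken from the Lahman data.

```python
import pandas as pd

# Made-up rows for illustration; these are not Lahman records
toy = pd.DataFrame(
    {
        "playerID": ["player01", "player02", "player03"],
        "yearID": [2015, 2016, 2016],
        "HR": [12, 30, 4],
    }
)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(toy.dtypes)         # data type stored in each column
```

Each dictionary key becomes a column (a feature), and each position in the lists becomes a row (an observation).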

In the Lahman database, we can create a dataframe from the Batting.csv file. Once we create the dataframe, we should look at it:

In [18]:
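# Note: this rebinds the name `batting` from the function to the dataframe it returns,
# so running this cell a second time raises a TypeError until you re-import
batting = batting()
# First, inspect the data
# Get the first 10 rows
batting.head(10)

# Or, to show the column headers only
list(batting)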

['playerID',
 'yearID',
 'stint',
 'teamID',
 'lgID',
 'G',
 'AB',
 'R',
 'H',
 '2B',
 '3B',
 'HR',
 'RBI',
 'SB',
 'CS',
 'BB',
 'SO',
 'IBB',
 'HBP',
 'SH',
 'SF',
 'GIDP']

### Things to do with dataframes ###
It's rare that we want to use all of the data that we have in our dataframe. More likely, you'll want to pull out certain rows and columns that you can use to answer interesting questions.

#### Filter rows ####
Access the rows in the dataframe using the column name and set criteria for the values in those columns. For example, to pull the 2016 data out of the batting dataframe:

In [19]:
batting2016 = batting.loc[batting["yearID"] == 2016]  # one year only
batting2016.head(5)

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
101333,abadfe01,2016,1,MIN,AL,39,1,0,0,0,...,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
101334,abadfe01,2016,2,BOS,AL,18,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
101335,abreujo02,2016,1,CHA,AL,159,624,67,183,32,...,100.0,0.0,2.0,47,125.0,7.0,15.0,0.0,9.0,21.0
101336,achteaj01,2016,1,LAA,AL,27,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
101337,ackledu01,2016,1,NYA,AL,28,61,6,9,0,...,4.0,0.0,0.0,8,9.0,0.0,0.0,0.0,1.0,0.0
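
Row filters can also combine several criteria. The sketch below uses a small made-up dataframe (not Lahman data) so it runs on its own; the same pattern should carry over to the batting dataframe.

```python
import pandas as pd

# Made-up rows for illustration; not Lahman data
toy = pd.DataFrame(
    {
        "playerID": ["player01", "player02", "player03", "player04"],
        "yearID": [2015, 2016, 2016, 2016],
        "teamID": ["BOS", "BOS", "MIN", "BOS"],
        "AB": [410, 0, 520, 135],
    }
)

# Each comparison yields a boolean Series; wrap each one in parentheses
# and combine them with & (and) or | (or)
bos_2016 = toy.loc[(toy["yearID"] == 2016) & (toy["teamID"] == "BOS")]

# isin() keeps rows whose value matches any entry in the list
selected_teams = toy.loc[toy["teamID"].isin(["BOS", "NYA"])]

print(bos_2016)
print(selected_teams)
```

The parentheses matter because `&` and `|` bind more tightly than `==` in Python.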


#### Filter columns ####
If you only want to see certain columns in the dataframe, you can pull just those columns:

In [20]:
batting.loc[:, ["teamID", "lgID"]].head(10)
# Or filter rows and select columns in one step
batting2016 = batting.loc[batting["yearID"] == 2016, ["teamID", "lgID"]]
batting2016.head(10)

Unnamed: 0,teamID,lgID
101333,MIN,AL
101334,BOS,AL
101335,CHA,AL
101336,LAA,AL
101337,NYA,AL
101338,COL,NL
101339,CLE,AL
101340,SLN,NL
101341,CIN,NL
101342,SFN,NL
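
One detail worth knowing when selecting columns is that a single column name returns a Series, while a list of names returns a dataframe. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up rows for illustration; not Lahman data
toy = pd.DataFrame(
    {"teamID": ["MIN", "BOS"], "lgID": ["AL", "AL"], "G": [39, 18]}
)

one_col = toy["teamID"]                 # a single name gives a Series
one_col_df = toy[["teamID"]]            # a list of names gives a dataframe
two_cols = toy.loc[:, ["teamID", "G"]]  # .loc with all rows and two columns

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(list(two_cols.columns))
```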


### Another dataframe example ###
If you look at the pyBaseball lahman.py code, you'll see the functions that return the data stored in each .csv file. For example, you can get information about different ballparks by looking at the Parks data in Parks.csv.

In [21]:
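# Note: as with batting, this rebinds the name `parks` from the function to the dataframe
parks = parks()
parks.head(10)
# Keep only the rows whose park name matches exactly
p = parks.loc[parks["park.name"] == "Riverside Park"]
print(p)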

  park.key       park.name park.alias    city state country
0    ALB01  Riverside Park        NaN  Albany    NY      US


#### Dataframes in memory ####
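In these examples, each new variable holds the result of a selection from the original dataframe. Plain assignment (for example, `b = batting`) does not copy anything: both names refer to the same dataframe, so a change made through one name is visible through the other. A selection made with `.loc` may be a view or a copy depending on the operation, and some pandas versions emit a `SettingWithCopyWarning` if you modify it. If you want to make changes, copy the dataframe first with `.copy()`.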
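
The difference between a second name for the same dataframe and an independent copy can be sketched with a toy dataframe. The values are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not Lahman data
original = pd.DataFrame({"teamID": ["MIN", "BOS"], "G": [39, 18]})

alias = original               # second name for the same object, nothing is copied
independent = original.copy()  # separate dataframe with its own data

alias.loc[0, "G"] = 100        # also changes `original`
independent.loc[1, "G"] = 0    # leaves `original` untouched

print(alias is original)
print(original)
print(independent)
```

After this runs, `original` shows the change made through `alias` but not the change made to `independent`.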

#### For more information about .loc, .iloc, and indexing ####
Here is a good article about accessing rows in a dataframe: <a href="https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix" target="_blank">Accessing data in a dataframe</a>

## Questions ##
1. Generate a dataframe of all players who played second base in 2010.
2. Generate a dataframe of all pitchers with an ERA < 4.0 since 2010.