# Lecture 4: Data Collection and Manipulation (Cont.)

### This lecture continues from the last one. Reminder: please go to the class GitHub page and scroll to the bottom to register a single email address for JupyterHub; otherwise your account will be cleared. There are also a few GitHub tutorial videos that the professor encourages you to watch.

### These notes involve a lot of webpage output that would take up more than 10 pages, so I will not print it in the PDF version. Most of that output is JSON returned by the webpage.

## Deciphering the NBA stats API

The NBA provides a nice website for all data related to the tournament: [http://stats.nba.com](http://stats.nba.com). For example, to reach the shooting records for Stephen Curry, you navigate their menus to this page:

> [http://stats.nba.com/player/201939/shooting/?Season=2016-17&SeasonType=Regular%20Season](http://stats.nba.com/player/201939/shooting/?Season=2016-17&SeasonType=Regular%20Season)

Here, we see some information related to our choices:
- Season: 2016-17
- SeasonType: Regular Season ([%20 is character code for space](https://en.wikipedia.org/wiki/Percent-encoding#Character_data))
- Player: 201939 (less obvious)

This type of URL uses the [GET method](https://www.w3schools.com/tags/ref_httpmethods.asp). A very long URL usually means that a series of variables and values is being passed to the web page. Tools such as this [online URL parser](https://www.freeformatter.com/url-parser-query-string-splitter.html) split them apart. Try pasting in the URL above.
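The same splitting can be done in Python. Here is a minimal sketch using the standard library's `urllib.parse`, applied to the Stephen Curry URL above. The variable names are only for illustration.

```python
from urllib.parse import urlparse, parse_qs

url = ("http://stats.nba.com/player/201939/shooting/"
       "?Season=2016-17&SeasonType=Regular%20Season")

parts = urlparse(url)

# parse_qs decodes percent-encoding and returns a list of values per key;
# each key appears once here, so keep only the first value
query = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.netloc)  # stats.nba.com
print(parts.path)    # /player/201939/shooting/
print(query)         # {'Season': '2016-17', 'SeasonType': 'Regular Season'}
```

Note that the player ID is part of the path rather than the query string, which is why it is less obvious than the other two choices.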

Knowledge of how web sites work is useful for data science since there is so much interaction through the web.

In [1]:
useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""
playerurl = "\"http://stats.nba.com/stats/commonallplayers?LeagueID=00&Season=2015-16&IsOnlyCurrentSeason=0\""
json_str = !wget -q -O - --user-agent={useragent} {playerurl}

The cell above defines the URL to download data from, along with a string for what is called the User Agent. Setting the user agent lets you mimic any browser, which is useful because websites can return different content depending on the browser a visitor is using.

In the case of NBA data, the site blocks programmatic scraping by plain `wget`. However, by passing in the user agent string, we pretend that our connection comes from a user on a Mozilla-type browser on OS X.
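For systems without `wget`, the same idea can be sketched with Python's built-in `urllib.request`. This is an alternative to the notebook's approach, not what was used in lecture. The sketch only builds the request and does not send it, since whether the server answers is outside our control.

```python
from urllib.request import Request

# Same browser-like string as above, without the extra quotes needed for the shell
useragent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
             "AppleWebKit/601.3.9 (KHTML, like Gecko) "
             "Version/9.0.2 Safari/601.3.9")
playerurl = ("http://stats.nba.com/stats/commonallplayers"
             "?LeagueID=00&Season=2015-16&IsOnlyCurrentSeason=0")

# Build the request object; nothing is sent until urlopen(req) is called
req = Request(playerurl, headers={"User-Agent": useragent})

print(req.get_method())              # GET
print(req.get_header("User-agent"))  # urllib stores header names in this capitalization
```

To actually download, one would call `urllib.request.urlopen(req).read().decode()`, assuming the site accepts the request.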

In [2]:
json_str[0]

"'wget' is not recognized as an internal or external command,"

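When `wget` is available, `json_str[0]` holds the response as text in what is called the JSON format (JavaScript Object Notation), which has become one of the most widely used data formats. (The message above is what Windows prints when `wget` is not installed.)

In fact, Jupyter notebooks are stored entirely in JSON format.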

In [3]:
! head 03-Data-collection-and-manipulation.ipynb 

'head' is not recognized as an internal or external command,
operable program or batch file.
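Since `head` is a Unix command, it is unavailable on a plain Windows shell. A portable way to peek at a notebook file is to read it from Python. The sketch below writes a tiny stand-in notebook to a temporary file (the file name and contents are made up for illustration) and then reads it back.

```python
import json
import os
import tempfile

# A stand-in for a real .ipynb file: a very small notebook-shaped JSON document
minimal_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Lecture 4"]}
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

path = os.path.join(tempfile.mkdtemp(), "example.ipynb")
with open(path, "w") as f:
    json.dump(minimal_notebook, f, indent=1)

# Portable replacement for `head`: print the first few lines of the file
with open(path) as f:
    first_lines = [next(f).rstrip("\n") for _ in range(3)]
print("\n".join(first_lines))

# Because the file is JSON, it also loads straight into a dictionary
with open(path) as f:
    nb = json.load(f)
print(sorted(nb.keys()))  # ['cells', 'metadata', 'nbformat', 'nbformat_minor']
```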


The JSON format is very similar to a Python dictionary: it consists of keys and values.

Python has a built-in library for working with JSON. We read the output of the `wget` command into a Python variable, `json_str`. Now we can parse that string with the `json` library.
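As a small illustration of the dictionary-like structure, here is a sketch that parses a hand-written string shaped like the API response. The string is made up for the example and is not real output.

```python
import json

# A hand-written string shaped like the API response (illustrative only)
sample_str = ('{"resource": "commonallplayers", '
              '"parameters": {"LeagueID": "00"}, '
              '"resultSets": []}')

sample = json.loads(sample_str)          # JSON text -> Python objects
print(type(sample).__name__)             # dict
print(sample["parameters"]["LeagueID"])  # 00
print(list(sample.keys()))               # ['resource', 'parameters', 'resultSets']

# json.dumps goes the other way: Python objects -> JSON text
print(json.loads(json.dumps(sample)) == sample)  # True
```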

In [7]:
import json
data = json.loads(json_str[0])
#data

In [18]:
data.keys()

dict_keys(['resource', 'parameters', 'resultSets'])

In [19]:
data['resultSets'][0].keys()

dict_keys(['name', 'headers', 'rowSet'])

In [8]:
data['resultSets'][0]  # prints the name, headers, and rowSet of the player table (raw JSON)

In [21]:
import pandas as pd

h = data['resultSets'][0]['headers']
d = data['resultSets'][0]['rowSet']
players = pd.DataFrame(d, columns=h)
players

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,FROM_YEAR,TO_YEAR,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GAMES_PLAYED_FLAG
0,76001,"Abdelnaby, Alaa",Alaa Abdelnaby,0,1990,1994,HISTADD_alaa_abdelnaby,0,,,,,Y
1,76002,"Abdul-Aziz, Zaid",Zaid Abdul-Aziz,0,1968,1977,HISTADD_zaid_abdul-aziz,0,,,,,Y
2,76003,"Abdul-Jabbar, Kareem",Kareem Abdul-Jabbar,0,1969,1988,HISTADD_kareem_abdul-jabbar,0,,,,,Y
3,51,"Abdul-Rauf, Mahmoud",Mahmoud Abdul-Rauf,0,1990,2000,mahmoud_abdul-rauf,0,,,,,Y
4,1505,"Abdul-Wahad, Tariq",Tariq Abdul-Wahad,0,1997,2003,tariq_abdul-wahad,0,,,,,Y
5,949,"Abdur-Rahim, Shareef",Shareef Abdur-Rahim,0,1996,2007,shareef_abdur-rahim,0,,,,,Y
6,76005,"Abernethy, Tom",Tom Abernethy,0,1976,1980,HISTADD_tom_abernethy,0,,,,,Y
7,76006,"Able, Forest",Forest Able,0,1956,1956,HISTADD_frosty_able,0,,,,,Y
8,76007,"Abramovic, John",John Abramovic,0,1946,1947,HISTADD_brooms_abramovic,0,,,,,Y
9,203518,"Abrines, Alex",Alex Abrines,0,2016,2017,alex_abrines,0,,,,,Y


What other data can we download using these types of URLs? The NBA does not seem to publish official documentation (I wasn't able to find any), but people have put together [community documentation](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation).

Let's work with the [shot chart](https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation#shotchartdetail). The site kindly tells us [which parameters are required if none are passed](http://stats.nba.com/stats/shotchartdetail).
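Query strings like the ones these endpoints expect can be built from a dictionary with `urlencode`. The short sketch below uses two of the parameters seen earlier and shows how the space in `Regular Season` gets encoded. The name `example_params` is only for illustration.

```python
from urllib.parse import urlencode, quote

example_params = {'Season': '2016-17', 'SeasonType': 'Regular Season'}

# Default: spaces become '+', the form-encoding convention
print(urlencode(example_params))
# Season=2016-17&SeasonType=Regular+Season

# quote_via=quote produces %20 instead, matching the URL seen in the browser
print(urlencode(example_params, quote_via=quote))
# Season=2016-17&SeasonType=Regular%20Season
```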

In [22]:
from urllib.parse import urlencode
from urllib.request import urlretrieve

params = {'LeagueID':'00'}
teamurl = 'http://stats.nba.com/stats/commonTeamYears?' + urlencode(params)
!wget -q -O - --user-agent={useragent} {teamurl}

{"resource":"commonteamyears","parameters":{"LeagueID":"00"},"resultSets":[{"name":"TeamYears","headers":["LEAGUE_ID","TEAM_ID","MIN_YEAR","MAX_YEAR","ABBREVIATION"],"rowSet":[["00",1610612737,"1949","2017","ATL"],["00",1610612738,"1946","2017","BOS"],["00",1610612739,"1970","2017","CLE"],["00",1610612740,"2002","2017","NOP"],["00",1610612741,"1966","2017","CHI"],["00",1610612742,"1980","2017","DAL"],["00",1610612743,"1976","2017","DEN"],["00",1610612744,"1946","2017","GSW"],["00",1610612745,"1967","2017","HOU"],["00",1610612746,"1970","2017","LAC"],["00",1610612747,"1948","2017","LAL"],["00",1610612748,"1988","2017","MIA"],["00",1610612749,"1968","2017","MIL"],["00",1610612750,"1989","2017","MIN"],["00",1610612751,"1976","2017","BKN"],["00",1610612752,"1946","2017","NYK"],["00",1610612753,"1989","2017","ORL"],["00",1610612754,"1976","2017","IND"],["00",1610612755,"1949","2017","PHI"],["00",1610612756,"1968","2017","PHX"],["00",1610612757,"1970","2017","POR"],["00",1610612758,"1948","2

Now that we know what a general request looks like, we can create a function to make our requests simpler.

The function will do the following:
1. Set User Agent
1. Set base URL with appropriate end point
1. Set parameters required for query
1. Read JSON string into python variable
1. Parse JSON string into python object
1. Convert the object into a pandas data frame
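The steps above can be sketched with the standard library alone, which may help on machines where `wget` is not installed. This is an alternative sketch, not the function used in lecture: the helper names `build_url`, `fetch_json_text`, and `first_result_set` are hypothetical, and the parsing is demonstrated on a shortened copy of the `commonTeamYears` response so that nothing is downloaded.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) "
              "AppleWebKit/601.3.9 (KHTML, like Gecko) "
              "Version/9.0.2 Safari/601.3.9")

def build_url(endpt, params):
    """Steps 2-3: base URL with the endpoint, plus the encoded parameters."""
    return "http://stats.nba.com/stats/" + endpt + "?" + urlencode(params)

def fetch_json_text(url, timeout=30):
    """Steps 1 and 4: request the URL with a browser-like User Agent.

    Defined but not called here; whether the server answers is not guaranteed.
    """
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def first_result_set(json_text):
    """Step 5: parse the JSON text and pair each row with the headers."""
    data = json.loads(json_text)
    result = data["resultSets"][0]
    return [dict(zip(result["headers"], row)) for row in result["rowSet"]]

# Shortened copy of the commonTeamYears response, used instead of a download
canned = json.dumps({
    "resource": "commonteamyears",
    "parameters": {"LeagueID": "00"},
    "resultSets": [{
        "name": "TeamYears",
        "headers": ["LEAGUE_ID", "TEAM_ID", "MIN_YEAR", "MAX_YEAR", "ABBREVIATION"],
        "rowSet": [["00", 1610612737, "1949", "2017", "ATL"],
                   ["00", 1610612738, "1946", "2017", "BOS"]],
    }],
})

print(build_url("commonTeamYears", {"LeagueID": "00"}))
records = first_result_set(canned)
print(records[0]["ABBREVIATION"], records[1]["ABBREVIATION"])  # ATL BOS
```

For step 6, a list of dictionaries like `records` can be passed directly to `pandas.DataFrame(records)`.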

In [23]:
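def get_nba_data(endpt, params, return_url=False):
    """Download one stats.nba.com endpoint and return its first result set as a DataFrame.

    endpt: endpoint name; see https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation
    params: dictionary of parameters: i.e., {'LeagueID':'00'}
    return_url: if True, return the quoted URL instead of downloading
    """
    from pandas import DataFrame
    from urllib.parse import urlencode
    import json
    
    # browser-like User Agent, wrapped in literal quotes for the shell
    useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""

    # base URL + endpoint + encoded parameters, also quoted for the shell
    dataurl = "\"" + "http://stats.nba.com/stats/" + endpt + "?" + urlencode(params) + "\""
    
    # for debugging: just return the url
    if return_url:
        return dataurl
    
    # IPython shell escape: the lines printed by wget are captured as a list
    jsonstr = !wget -q -O - --user-agent={useragent} {dataurl}
    
    # the response is expected on the first line; parse it into Python objects
    data = json.loads(jsonstr[0])
    
    # column names and rows of the first result set
    h = data['resultSets'][0]['headers']
    d = data['resultSets'][0]['rowSet']
    
    return DataFrame(d, columns=h)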

To see the URL that would be requested, without downloading anything, set `return_url=True`.

In [24]:
params = {'LeagueID':'00'}
get_nba_data('commonTeamYears', params, return_url=True)

'"http://stats.nba.com/stats/commonTeamYears?LeagueID=00"'

In [25]:
params = {'LeagueID':'00'}
teamdata = get_nba_data('commonTeamYears', params)
teamdata

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION
0,0,1610612737,1949,2017,ATL
1,0,1610612738,1946,2017,BOS
2,0,1610612739,1970,2017,CLE
3,0,1610612740,2002,2017,NOP
4,0,1610612741,1966,2017,CHI
5,0,1610612742,1980,2017,DAL
6,0,1610612743,1976,2017,DEN
7,0,1610612744,1946,2017,GSW
8,0,1610612745,1967,2017,HOU
9,0,1610612746,1970,2017,LAC


In [26]:
params = {'LeagueID':'00', 'Season': '2016-17', 'IsOnlyCurrentSeason': '0'}
plyrdata = get_nba_data('commonallplayers', params)
plyrdata

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,FROM_YEAR,TO_YEAR,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GAMES_PLAYED_FLAG
0,76001,"Abdelnaby, Alaa",Alaa Abdelnaby,0,1990,1994,HISTADD_alaa_abdelnaby,0,,,,,Y
1,76002,"Abdul-Aziz, Zaid",Zaid Abdul-Aziz,0,1968,1977,HISTADD_zaid_abdul-aziz,0,,,,,Y
2,76003,"Abdul-Jabbar, Kareem",Kareem Abdul-Jabbar,0,1969,1988,HISTADD_kareem_abdul-jabbar,0,,,,,Y
3,51,"Abdul-Rauf, Mahmoud",Mahmoud Abdul-Rauf,0,1990,2000,mahmoud_abdul-rauf,0,,,,,Y
4,1505,"Abdul-Wahad, Tariq",Tariq Abdul-Wahad,0,1997,2003,tariq_abdul-wahad,0,,,,,Y
5,949,"Abdur-Rahim, Shareef",Shareef Abdur-Rahim,0,1996,2007,shareef_abdur-rahim,0,,,,,Y
6,76005,"Abernethy, Tom",Tom Abernethy,0,1976,1980,HISTADD_tom_abernethy,0,,,,,Y
7,76006,"Able, Forest",Forest Able,0,1956,1956,HISTADD_frosty_able,0,,,,,Y
8,76007,"Abramovic, John",John Abramovic,0,1946,1947,HISTADD_brooms_abramovic,0,,,,,Y
9,203518,"Abrines, Alex",Alex Abrines,1,2016,2017,alex_abrines,1610612760,Oklahoma City,Thunder,OKC,thunder,Y


Finally, we can get the shot chart detail, here for James Harden (PlayerID 201935).

In [27]:
params = {'PlayerID':'201935',
          'PlayerPosition':'',
          'Season':'2016-17',
          'ContextMeasure':'FGA',
          'DateFrom':'',
          'DateTo':'',
          'GameID':'',
          'GameSegment':'',
          'LastNGames':'0',
          'LeagueID':'00',
          'Location':'',
          'Month':'0',
          'OpponentTeamID':'0',
          'Outcome':'',
          'Period':'0',
          'Position':'',
          'RookieYear':'',
          'SeasonSegment':'',
          'SeasonType':'Regular Season',
          'TeamID':'0',
          'VsConference':'',
          'VsDivision':''}

shotdata = get_nba_data('shotchartdetail', params)
shotdata

Unnamed: 0,GRID_TYPE,GAME_ID,GAME_EVENT_ID,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,PERIOD,MINUTES_REMAINING,SECONDS_REMAINING,...,SHOT_ZONE_AREA,SHOT_ZONE_RANGE,SHOT_DISTANCE,LOC_X,LOC_Y,SHOT_ATTEMPTED_FLAG,SHOT_MADE_FLAG,GAME_DATE,HTM,VTM
0,Shot Chart Detail,0021600013,9,201935,James Harden,1610612745,Houston Rockets,1,10,58,...,Center(C),Less Than 8 ft.,2,-24,8,1,1,20161026,LAL,HOU
1,Shot Chart Detail,0021600013,13,201935,James Harden,1610612745,Houston Rockets,1,10,15,...,Center(C),Less Than 8 ft.,2,-13,16,1,1,20161026,LAL,HOU
2,Shot Chart Detail,0021600013,22,201935,James Harden,1610612745,Houston Rockets,1,9,14,...,Left Side Center(LC),24+ ft.,25,-142,217,1,1,20161026,LAL,HOU
3,Shot Chart Detail,0021600013,77,201935,James Harden,1610612745,Houston Rockets,1,4,16,...,Center(C),Less Than 8 ft.,2,13,21,1,0,20161026,LAL,HOU
4,Shot Chart Detail,0021600013,89,201935,James Harden,1610612745,Houston Rockets,1,3,31,...,Center(C),Less Than 8 ft.,1,5,11,1,1,20161026,LAL,HOU
5,Shot Chart Detail,0021600013,99,201935,James Harden,1610612745,Houston Rockets,1,2,21,...,Center(C),Less Than 8 ft.,0,0,8,1,0,20161026,LAL,HOU
6,Shot Chart Detail,0021600013,111,201935,James Harden,1610612745,Houston Rockets,1,0,29,...,Left Side Center(LC),24+ ft.,26,-139,222,1,0,20161026,LAL,HOU
7,Shot Chart Detail,0021600013,221,201935,James Harden,1610612745,Houston Rockets,2,4,34,...,Center(C),Less Than 8 ft.,0,-3,6,1,1,20161026,LAL,HOU
8,Shot Chart Detail,0021600013,281,201935,James Harden,1610612745,Houston Rockets,3,10,31,...,Right Side(R),24+ ft.,24,243,1,1,0,20161026,LAL,HOU
9,Shot Chart Detail,0021600013,308,201935,James Harden,1610612745,Houston Rockets,3,7,48,...,Right Side(R),8-16 ft.,13,135,36,1,1,20161026,LAL,HOU
