# Lecture 5 - 4/16 - PandasDataFrame

# Pandas for Data Analysis

## On the course GitHub site there are a few links on regular expressions (the ones we went over in last Friday's section) and one article the professor encourages you to read. The links are at the bottom, under the April 16th tab.

### The lecture starts with a review of the NBA data and the JSON script.

The pandas package implements data frames similar to those in R. There are many similarities, but also differences. We will go over some of the differences in the context of working with the basketball data.

NBA data can be obtained using the function developed previously.

In [25]:
import pandas as pd

def get_nba_data(endpt, params, return_url=False): #function to get NBA data

    ## endpt: https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation
    ## params: dictionary of parameters: i.e., {'LeagueID':'00'}
    
    from pandas import DataFrame
    from urllib.parse import urlencode
    import json
    
    useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""
    dataurl = "\"" + "http://stats.nba.com/stats/" + endpt + "?" + urlencode(params) + "\""
    
    # for debugging: just return the url
    if return_url:
        return(dataurl)
    
    jsonstr = !wget -q -O - --user-agent={useragent} {dataurl}
    
    data = json.loads(jsonstr[0])
    
    h = data['resultSets'][0]['headers']
    d = data['resultSets'][0]['rowSet']
    
    return(DataFrame(d, columns=h))

In [28]:
useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""
playerurl = "\"http://stats.nba.com/stats/commonallplayers?LeagueID=00&Season=2015-16&IsOnlyCurrentSeason=0\""
json_str = !wget -q -O - --user-agent={useragent} {playerurl}
import json
data = json.loads(json_str[0])
h = data['resultSets'][0]['headers']
d = data['resultSets'][0]['rowSet']
players = pd.DataFrame(d, columns=h)


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
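
The `JSONDecodeError` above is what `json.loads` raises when it is handed an empty string, which is what would happen if the request returned nothing (an assumption about this failure; the notes do not say why the call failed). Here is a minimal offline sketch of the parsing step. The payload is hand-written with made-up values that only mimic the `resultSets` / `headers` / `rowSet` layout used above, and `parse_result_set` is a hypothetical helper, not part of the course code.

```python
import json
import pandas as pd

# Hand-written stand-in for a stats.nba.com response (values are made up).
sample = (
    '{"resultSets": [{"headers": ["TEAM_ID", "ABBREVIATION"], '
    '"rowSet": [[1, "AAA"], [2, "BBB"]]}]}'
)

def parse_result_set(jsonstr):
    """Turn the first result set of a response string into a DataFrame."""
    if not jsonstr.strip():
        # Fail with a clearer message than the raw JSONDecodeError.
        raise ValueError("empty response: the request returned no data")
    data = json.loads(jsonstr)
    first = data['resultSets'][0]
    return pd.DataFrame(first['rowSet'], columns=first['headers'])

demo = parse_result_set(sample)
print(demo)

# An empty string reproduces the error seen above.
try:
    json.loads('')
except json.JSONDecodeError as err:
    print("JSONDecodeError:", err)
```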

Using the function, we can get data about teams and players.

**In RStudio you can try the Shiny app:**

`install.packages("shiny")
library(shiny)
runGitHub("ballr","toddwschneider")`

In [27]:
## get all teams
params = {'LeagueID':'00'}
teams = get_nba_data('commonTeamYears', params)

## get all players
params = {'LeagueID':'00', 'Season': '2016-17', 'IsOnlyCurrentSeason': '0'}
players = get_nba_data('commonallplayers', params)

## Programming style

A programming language really is like a language: you get better with practice. It is worth thinking about good programming style and about better ways to do the same thing. By better, I mean more readable, concise, efficient (computationally), etc.

For example, there are guides and articles such as these:
- http://docs.python-guide.org/en/latest/writing/style/#short-ways-to-manipulate-lists
- https://google.github.io/styleguide/pyguide.html?showone=List_Comprehensions#List_Comprehensions
- https://google.github.io/styleguide/pyguide.html?showone=Naming#Naming
- https://www.python.org/dev/peps/pep-0008/

In [9]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


The easter egg above is the Zen of Python: https://www.python.org/dev/peps/pep-0020/.

- https://www.quora.com/What-do-different-aphorisms-in-The-Zen-of-Python-mean 
- 20th aphorism?: https://www.reddit.com/r/Python/comments/3cjhlo/this_disobeys_the_zen_of_python/

# Pandas

Pandas has an extensive set of functions. Refer to [Chapter 3 in PDSH](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) and the [official website](https://pandas.pydata.org). Latest stable release documentation is here: [http://pandas.pydata.org/pandas-docs/stable/api.html](http://pandas.pydata.org/pandas-docs/stable/api.html).

## Pandas Series 

The section on `Series` is here: http://pandas.pydata.org/pandas-docs/stable/api.html#series. These methods and properties are available by placing a dot after the object.

### Data frames are made of Series
A pandas data frame is a different type of object from its rows and columns:

In [13]:
print("data frame object :", type(teams))
print("data row object   :", type(teams.iloc[0]))
print("data column object:", type(teams.ABBREVIATION))

Note that rows as well as columns of pandas data frame are `Series` objects. (In R, rows would be a smaller data frame.)

There are categories of functions that are applicable to certain object types:

- Pandas general functions: http://pandas.pydata.org/pandas-docs/stable/api.html#general-functions   
    e.g., [`pandas.melt()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html#pandas-melt) takes a `DataFrame` as input. 
- Series methods: http://pandas.pydata.org/pandas-docs/stable/api.html#series
- DataFrame methods: http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe

### Pandas (often) shows you views

Recall that Python variables are names that refer to objects, so two names are often _views_ of the same instance in memory. Assignment does not copy anything. The following shows that `temp` and `teams` are the same object in memory:

In [None]:
temp = teams
print(id(temp) == id(teams))

So, if you change one, you see the change in the other:

In [None]:
s1 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
s2 = s1
print("id of s1:", id(s1))
print("id of s2:", id(s2))
print("s1 is s2:", s1 is s2)

In [None]:
s1.iloc[0] = 10000   # positional access; s1[0] on a label-indexed Series is deprecated

print("s1 changed:", s1.iloc[0])
print("s2 also   :", s2.iloc[0])
#print("s1 is s2:", s1.iloc[0] is s2.iloc[0])

The object needs to be copied in order to make an independent variable:

In [None]:
abbr = teams.ABBREVIATION.copy()
abbr is teams.ABBREVIATION
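
The difference between a second name and a copy can be checked without the NBA data. This is a small sketch on a toy `Series`; the names `alias` and `independent` are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

alias = s1               # a second name for the same object
independent = s1.copy()  # a new object with its own data

s1.loc['a'] = 10000      # change the original through its label

print("alias is s1      :", alias is s1)
print("independent is s1:", independent is s1)
print("alias['a']       :", alias.loc['a'])        # follows the change
print("independent['a'] :", independent.loc['a'])  # keeps the old value
```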

### Indexing

There are many different ways to index `Series` and `DataFrames` in pandas: https://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing.

- `.loc` is primarily for using labels and booleans: e.g., column and row indices, comparison operators, etc
- `.iloc` is primarily for using integer positions: i.e., like you would matrices
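
To see the difference between the two indexers, it helps to use an index whose labels do not match the positions. The sketch below uses a toy `Series` with integer labels in reverse order (made-up values, not the team data).

```python
import pandas as pd

# Labels run 3, 2, 1, 0 while positions run 0, 1, 2, 3.
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s.loc[0]       # the value whose label is 0  -> 40
by_position = s.iloc[0]   # the value in the first slot -> 10
large = s.loc[s > 15]     # .loc also accepts a boolean Series

print(by_label, by_position)
print(large)
```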

In [None]:
abbr #Some of the older teams do not have abbreviations, thus "NONE"

In [None]:
dict(abbr.head().items()) # dictionary mapping each index label to its abbreviation

### Series as lists

In [None]:
list(abbr.head().items()) # the same pairs as a list of (index, value) tuples

In [None]:
abbr.keys()

There are many more useful functions and properties. Refer to [Chapter 3 in PDSH](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html). Latest stable release documentation is here: [http://pandas.pydata.org/pandas-docs/stable/api.html](http://pandas.pydata.org/pandas-docs/stable/api.html).

In [None]:
abbr.unique()

A convenient accessor is [`str`](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling). It applies string functions to each value separately. For example, we can search for patterns, such as abbreviations that end with the letter `S`: 

In [None]:
abbr.str.contains('S$')

__Exercise__: how would you use this to pick out team names that end with S? Can you use the resulting boolean `Series`?

In [None]:
## abbr.loc[abbr.str.contains('S$')] ## what is the problem?

__Exercise__: what is `dir()` function?

In [None]:
## dir(abbr)

## Data Frames


### Getting columns

The following two ways of accessing a column are equivalent. The *dot notation* is easier to read.

In [None]:
temp = teams.copy()

print(temp['MIN_YEAR'].head())
print(temp.MIN_YEAR.head())

### Setting columns

Note that you cannot create a new column with dot notation. Consider the following, where `new_column_2` does not show up as a column:

In [None]:
temp['new_column_1'] = temp.MAX_YEAR
temp.new_column_2 = temp.MAX_YEAR
temp.head()

However, you can set an existing column with dot notation.

In [None]:
temp.LEAGUE_ID = 'ZZ'
temp.head()
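
The two cases can be checked on a toy data frame. This is a sketch with made-up values; pandas emits a warning when a list-like value is assigned to a new attribute name, which is silenced here only to keep the output short.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'MIN_YEAR': [1946, 1970], 'MAX_YEAR': [2017, 2017]})

df['new_column_1'] = df.MAX_YEAR        # bracket notation creates a column

with warnings.catch_warnings():
    warnings.simplefilter('ignore')     # pandas warns about this pattern
    df.new_column_2 = df.MAX_YEAR       # sets a plain attribute, not a column

print(list(df.columns))                 # new_column_2 is missing

df.MIN_YEAR = 0                         # existing column: dot assignment works
print(df)
```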

### Data Frame, Series, dtype

This differs from R, where each data frame column has its own data type: e.g., `factor`, `integer`, `numeric`, etc. Pandas data frame columns are *all* `Series`, each with its own dtype. When column types are not specified, pandas infers them, and columns holding strings (such as `ABBREVIATION`) typically end up with the generic dtype `object`:

In [None]:
print(teams.ABBREVIATION.dtype)

In [None]:
teams.ABBREVIATION = teams.ABBREVIATION.astype('category')
teams.TEAM_ID      = teams.TEAM_ID.astype('category')
teams.MIN_YEAR     = teams.MIN_YEAR.astype('int')
teams.MAX_YEAR     = teams.MAX_YEAR.astype('int')
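
The same conversions can be tried on a toy data frame. This sketch assumes, as the casts above suggest, that the years arrive as strings; the values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'ABBREVIATION': ['AAA', 'BBB', 'AAA'],
    'MIN_YEAR': ['1946', '1970', '1988'],   # years stored as strings
})
print(df.dtypes)

# astype returns a new Series, so the result is assigned back to the column.
df['ABBREVIATION'] = df['ABBREVIATION'].astype('category')
df['MIN_YEAR'] = df['MIN_YEAR'].astype('int')
print(df.dtypes)

# Integer columns support arithmetic that the string column did not.
print(df['MIN_YEAR'].max() - df['MIN_YEAR'].min())
```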

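Note that `object` is a catch-all dtype for arbitrary Python objects. A row that mixes strings, numbers, and categories is a `Series` of dtype `object`: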

In [None]:
print(type(teams.iloc[0]))
print(teams.iloc[0])

### Condition based slicing

Subset just the current teams:

In [None]:
teams = teams[teams.MAX_YEAR == 2017].copy()   # copy so the new column is not set on a slice
teams['TEAM_AGE'] = teams.MAX_YEAR - teams.MIN_YEAR

teams_clean = teams.copy() ## make a copy for later
teams

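Note the difference between the two indexers: `.iloc[1]` returns the second row by position, while `.loc[14]` returns the row whose index label is 14. After subsetting, the rows keep the labels of the original data frame, so labels and positions generally no longer match: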

In [None]:
print('*** indexing with .iloc:\n', teams.iloc[1])
print('\n*** indexing with .loc :\n', teams.loc[14])
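
The effect of subsetting on row labels is easy to reproduce. Below is a sketch on a toy data frame (made-up teams); `reset_index(drop=True)` is one option for renumbering the rows if the old labels are not needed.

```python
import pandas as pd

df = pd.DataFrame({
    'TEAM': ['A', 'B', 'C', 'D'],
    'MAX_YEAR': [2017, 1950, 2017, 2017],
})

current = df[df.MAX_YEAR == 2017]
print(list(current.index))              # [0, 2, 3]: the original labels survive

second_by_position = current.iloc[1]    # second remaining row (team C)
row_with_label_3 = current.loc[3]       # row labeled 3 (team D)
print(second_by_position.TEAM, row_with_label_3.TEAM)

renumbered = current.reset_index(drop=True)
print(list(renumbered.index))           # [0, 1, 2]
```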

Subset just the players in current teams:

In [None]:
players = players[players.TEAM_ID.isin(teams.TEAM_ID)]
players.tail()

List players grouped by teams:

In [None]:
players.groupby('TEAM_CODE')

The object above is an iterable. Iterating over it yields each group name together with a _view_ of the rows that belong to that group:

In [None]:
for t, p in players.groupby('TEAM_NAME'):
    print("***", t)
    print('; '.join(p.DISPLAY_LAST_COMMA_FIRST.values), '\n')
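
The same loop can be tried offline. This sketch uses a toy data frame with made-up team and player names and collects the joined rosters in a dictionary (`rosters` is a name introduced here for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    'TEAM_NAME': ['Reds', 'Blues', 'Reds', 'Blues', 'Reds'],
    'PLAYER': ['Ann', 'Bob', 'Cy', 'Di', 'Ed'],
})

rosters = {}
# Each iteration gives the group key and the rows belonging to that key.
for team, group in df.groupby('TEAM_NAME'):
    rosters[team] = '; '.join(group.PLAYER.values)
    print("***", team, "->", rosters[team])
```

By default `groupby` sorts the group keys, so `Blues` is visited before `Reds` here.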

### Merging data frames

First, we can create a table of unique rows with full team names:

In [None]:
team_names = players[['TEAM_ABBREVIATION', 'TEAM_CODE']].drop_duplicates()#.set_index('TEAM_ABBREVIATION')
team_names.head()

We have team codes (names) as a new column.

In [None]:
teams_clean.head()

In [None]:
teams = pd.merge(teams_clean, team_names, left_on='ABBREVIATION', right_on='TEAM_ABBREVIATION')
teams.tail()
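
A small sketch of what `pd.merge` does when the key columns have different names, using toy tables with made-up values. With the default `how='inner'`, rows without a match are dropped; `how='left'` keeps them and fills the missing side with `NaN`.

```python
import pandas as pd

left = pd.DataFrame({
    'ABBREVIATION': ['AAA', 'BBB', 'CCC'],
    'MIN_YEAR': [1946, 1970, 1988],
})
right = pd.DataFrame({
    'TEAM_ABBREVIATION': ['AAA', 'BBB'],
    'TEAM_CODE': ['alphas', 'betas'],
})

# Inner join: only abbreviations present in both tables survive.
merged = pd.merge(left, right,
                  left_on='ABBREVIATION', right_on='TEAM_ABBREVIATION')
print(merged)

# Left join: every row of `left` is kept.
merged_left = pd.merge(left, right, how='left',
                       left_on='ABBREVIATION', right_on='TEAM_ABBREVIATION')
print(merged_left)
```

Both key columns appear in the result because their names differ.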

We can apply a `str` method:

In [None]:
teams.TEAM_CODE = teams.TEAM_CODE.str.capitalize() # returns values so needs to be reassigned
teams.sort_values('ABBREVIATION', inplace=True)    # modifies object
teams.tail()
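
The two comments in the cell above point at a difference worth checking: `str` methods return a new `Series`, while `sort_values(inplace=True)` changes the object and returns `None`. A sketch on a toy data frame with made-up codes:

```python
import pandas as pd

df = pd.DataFrame({'CODE': ['betas', 'alphas']})

df.CODE.str.capitalize()               # result is discarded; df is unchanged
before = list(df.CODE)

df['CODE'] = df.CODE.str.capitalize()  # assign back to keep the result
result = df.sort_values('CODE', inplace=True)  # sorts df itself

print(before)
print(list(df.CODE))
print(result)                          # None
```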

In [None]:
players.head()

## Interaction with Widgets

One of the advantages of Jupyter notebooks is that they are browser-based. Browsers are highly interactive, and we can also interact with the data using the interactive widgets that the `ipywidgets` package provides.

We will digress a little bit, and talk about widgets. Widgets take user input by waiting for some action. We can create a simple slider to select some number:

In [None]:
from ipywidgets import interact, FloatSlider, Dropdown, Button

def selected_val(x):
    print('Selected value is', x)

xslider = FloatSlider(min=0.0, max=10.0, step=0.05)
interact(selected_val, x=xslider);

In [None]:
def f(x, y):
    print(x, y)
    
drop1 = {'Galileo': 10, 'Brahe': 11, 'Hubble': 12}
drop2 = {'Apple': 345, 'Orange': 234, 'Banana': 123}

interact(f, x=drop1, y=drop2);

In [None]:
menu = {
    'juice':['apple', 'peach', 'grape'],
    'tea':['ginger', 'green', 'earl grey'],
}

selected = 'tea'

flavor = Dropdown(options=menu[selected], value=menu[selected][0])
drink = Dropdown(options=menu.keys(), value=selected)
order = Button(description='Order!', icon='check')

def update_drink(change):
    flavor.options = menu[change['new']]
    flavor.value = menu[change['new']][0]   # Dropdown's selection lives in .value
    
def make_order(change):
    print(flavor.value, drink.value)
    
drink.observe(update_drink, names='value')
order.on_click(make_order)
 
display(flavor, drink, order)

__Exercise__: Can you add a widget for selecting the size? Size is independent of flavors; however, it should be included when the order is made. Allow for sizes small, regular, and large.