# STA 141B Lecture 5

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Assignment 2 will be posted later today
* Information about the project and groups later this week

### Topics

* NumPy Example
* Pandas

### Data Sets

* Dogs (in repository)

### References

* Python for Data Analysis, Ch. 5, 10
* [Python Data Science Handbook][PDSH], Ch. 3

[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/

## Example - Using NumPy for Monte Carlo Integration

Consider a circle with radius 1 circumscribed by a square with side length 2.

The area of the circle is $\pi$ and the area of the square is 4, so for a uniform distribution on the square, the probability that a point falls in the circle is $\pi / 4$.

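We can estimate this probability by simulation: sample many points uniformly on the square, compute the fraction that land inside the circle, and multiply by 4 to get an estimate of $\pi$.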
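A minimal sketch of this estimate, assuming a recent NumPy that provides `np.random.default_rng`. The seed, the sample size `n`, and the variable names are illustrative choices, not part of the lecture.

```python
import numpy as np

# Illustrative seed and sample size (assumptions, chosen for reproducibility).
rng = np.random.default_rng(141)
n = 100_000

# Sample points uniformly on the square [-1, 1] x [-1, 1].
x = rng.uniform(-1, 1, size=n)
y = rng.uniform(-1, 1, size=n)

# A point is inside the unit circle when x^2 + y^2 <= 1.
inside = x**2 + y**2 <= 1

# The mean of a boolean array is the fraction of True values,
# which estimates pi / 4. Multiply by 4 to estimate pi.
pi_hat = 4 * inside.mean()
print(pi_hat)
```

The whole computation is vectorized, so no Python loop over points is needed. The estimate gets closer to $\pi$ as `n` grows, with error shrinking roughly like $1 / \sqrt{n}$.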

## Pandas

Pandas is a Python package that provides tools for manipulating tabular data. The name "pandas" is short for "**PAN**el **DA**ta", an econometrics term. Since we're using Anaconda, Pandas is already installed.

Pandas is documented [here](http://pandas.pydata.org/pandas-docs/stable/).

In [2]:
import pandas as pd
import numpy as np

### Series

A Pandas Series is a generalization of a NumPy array.

In addition to elements, every series includes an _index_.

A series can be indexed in all of the same ways as a NumPy array, but also by index values.

This means a series can also be used like an ordered dictionary.
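The points above can be sketched with a small series. The values and index labels here are made up for illustration.

```python
import pandas as pd

# A series with an explicit index of string labels.
s = pd.Series([4.0, 5.5, 7.1], index=["a", "b", "c"])

first = s.iloc[0]        # by position, like a NumPy array
by_label = s["b"]        # by index value, like a dictionary key
big = s[s > 5]           # by condition (boolean mask), like a NumPy array

has_c = "c" in s         # membership tests look at the index, like a dictionary
pairs = list(s.items())  # (index, value) pairs, in order
```

Positional access uses `.iloc` here because it behaves the same way regardless of what the index contains.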

What if a series has integer indexes?

For indexing a series (and as we'll see later, also data frames):

* `[ ]` is by position, name, or condition. **Exception:** for an integer index it's by name or condition only.
* `.iloc[ ]` is by position
* `.loc[ ]` is by name or condition
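To see why the integer-index exception matters, here is a small sketch with a series whose index values are the integers 10, 20, 30 (made-up labels). With `[ ]`, the integer is treated as a name, so position 0 is not found.

```python
import pandas as pd

s = pd.Series(["x", "y", "z"], index=[10, 20, 30])

by_name = s[10]       # [ ] uses the index value (name) for an integer index
by_loc = s.loc[10]    # .loc is always by name
by_pos = s.iloc[0]    # .iloc is always by position

# There is no element named 0, so s[0] raises a KeyError.
try:
    s[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

Using `.loc` and `.iloc` explicitly avoids this ambiguity.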

In [19]:
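s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range('20130101', periods=6)
s[0::2]  # every other element, by position
s = pd.Series("123")  # a single string becomes a one-element series
s.loc[0] == s.iloc[0]  # name 0 and position 0 refer to the same element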

True

In [23]:
s[0]
s.loc[0]
s.iloc[0]

'123'

In [25]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

In [33]:
len(df)
type(df)

pandas.core.frame.DataFrame

df.index

In [45]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df

|   | col1 | col2 |
| --- | ---- | ---- |
| 0 | 1    | 3    |
| 1 | 2    | 4    |


In [47]:
df = pd.DataFrame(data=d, dtype=np.int8)

In [54]:
d

{'col1': [1, 2], 'col2': [3, 4]}

In [50]:
df.dtypes

col1    int8
col2    int8
dtype: object

In [56]:
# The original call had an unbalanced parenthesis and a positional argument after a keyword.
# The 3x5 shape is an assumption: the original call did not specify a size.
df2 = pd.DataFrame(np.random.randint(low=0, high=5, size=(3, 5)), columns=['a', 'b', 'c', 'd', 'e'])

In [59]:
np.random.rand(3,2)

array([[0.24485683, 0.74389253],
       [0.48737505, 0.15322604],
       [0.14826117, 0.22101086]])

### Data Frames

A Pandas Data Frame represents tabular data as a collection of Series.

Data frames support indexing methods similar to those for series. However, for indexing with `[ ]`,

* Scalar values get columns
* Conditions or slices get rows
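A short sketch of these rules on a made-up data frame (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]},
    index=["a", "b", "c"],
)

ages = df["age"]              # scalar value: gets a column (as a Series)
over_30 = df[df["age"] > 30]  # condition: gets the rows where it is True
first_two = df[0:2]           # slice: gets rows

# .loc and .iloc take a row selector and a column selector.
bob_age = df.loc["b", "age"]  # by name
same_cell = df.iloc[1, 1]     # by position
```

Because `[ ]` means different things for scalars and slices, `.loc` and `.iloc` are often the clearer choice when selecting rows.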

### Missing Data

Pandas represents missing data with `NaN` and `None`, but these values do not exclusively mean missing data. For instance, `NaN` stands for "Not a Number" and is also the result of undefined computations. Pay attention to your data and code to determine whether values are missing or have some other meaning.

You can create `NaN` values with NumPy.

Use the `.isna()` or `.isnull()` methods to detect missing values.
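As an illustration, the sketch below builds a series with both kinds of missing value and detects them. The values are made up. It also shows `NaN` arising from an undefined computation rather than from missing data.

```python
import numpy as np
import pandas as pd

# np.nan creates a NaN; None is converted to NaN in a numeric series.
s = pd.Series([1.0, np.nan, 3.0, None])

missing = s.isna()        # boolean series: True where a value is missing
n_missing = missing.sum() # True counts as 1, so this counts missing values

# NaN is also the result of undefined computations, such as inf - inf.
undefined = float("inf") - float("inf")
is_nan = np.isnan(undefined)

# NaN is not equal to itself, so test with .isna() or np.isnan(), not ==.
nan_equals_itself = np.nan == np.nan
```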

### Data Alignment

Pandas supports vectorized operations, but elements are automatically aligned by index. **Beware!!** This is a major difference compared to R.

You can use the `.reset_index()` method to reset the indexes on a series or data frame.
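The following sketch shows alignment in action with two made-up series. Elements are matched by index value, not by position, and index values present in only one operand produce `NaN`.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Aligned by index: x gets 1 + 30, y gets 2 + 20, z gets 3 + 10.
total = a + b

# Index values that appear in only one series give NaN.
c = pd.Series([10, 20], index=["x", "w"])
partial = a + c

# To add by position instead, discard the indexes first.
by_position = a.reset_index(drop=True) + b.reset_index(drop=True)
```

Passing `drop=True` to `.reset_index()` throws away the old index instead of keeping it as a column.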

### Reading Data

Pandas provides functions for reading (and writing) a variety of common formats. Most of their names begin with `read_`. For instance, we can read the dogs data from a CSV file:
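The path to the dogs file is not shown here, so the sketch below uses a small made-up CSV held in memory to illustrate how `pd.read_csv()` behaves. The column names and values are hypothetical and are not the dogs data. With a real file, you would pass its path in place of the `io.StringIO` object.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk.
csv_text = """name,weight_kg,group
Rex,30.5,working
Bella,8.2,toy
Max,,sporting
"""

# read_csv accepts a path or any file-like object.
pets = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as missing values (NaN).
n_missing = pets["weight_kg"].isna().sum()

# The matching writer is a method on the data frame.
round_trip = pets.to_csv(index=False)
```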

### Inspecting Data

Series and data frames provide many of the same methods and attributes as NumPy arrays.

For a data frame, the `.dtypes` attribute gives the column types.

The type "object" means some non-numeric Python object, often a string.

There are also several methods for quickly summarizing data.
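A brief sketch of some commonly used inspection tools, on a made-up data frame (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "breed": ["Beagle", "Boxer", "Pug", "Collie"],
    "weight": [10.0, 30.0, 8.0, 25.0],
    "kids": [1, 2, 0, 3],
})

shape = df.shape        # (rows, columns), as for a NumPy array
types = df.dtypes       # one type per column
top = df.head(2)        # first 2 rows
summary = df.describe() # count, mean, std, quartiles for numeric columns
df.info()               # prints column types and non-missing counts
```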

### Aggregation

Pandas also provides several methods for aggregating data, such as `.mean()`, `.median()`, `.std()`, and `.value_counts()`. They ignore missing values by default.

For counting one group against another (crosstabulating), use `pd.crosstab()`.
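For example, using a made-up data frame with one missing weight (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "small", "large", "small"],
    "kids": ["yes", "yes", "no", "yes", "no"],
    "weight": [5.0, 30.0, np.nan, 40.0, 7.0],
})

# The missing weight is ignored: (5 + 30 + 40 + 7) / 4.
mean_weight = df["weight"].mean()

# Set skipna=False to propagate missing values instead.
mean_strict = df["weight"].mean(skipna=False)

# Count how often each value occurs.
size_counts = df["size"].value_counts()

# Crosstabulate: rows are sizes, columns are kids, cells are counts.
table = pd.crosstab(df["size"], df["kids"])
```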

### Applying Functions

You can also use Pandas to apply your own aggregation functions to columns or rows.

* `.apply()` applies a function column-by-column or row-by-row.
* `.applymap()` applies a function element-by-element. In pandas 2.1 and later, this method is deprecated in favor of `.map()`.
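A minimal sketch of both methods on a made-up data frame. The helper `value_range` is a hypothetical function written for this example. Because the element-by-element method is named differently across pandas versions, the sketch picks whichever one is available.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

def value_range(x):
    """Hypothetical aggregation: the largest value minus the smallest."""
    return x.max() - x.min()

# Column-by-column (the default): one result per column.
col_ranges = df.apply(value_range)

# Row-by-row: one result per row.
row_ranges = df.apply(value_range, axis=1)

# Element-by-element: use .map() where it exists, else .applymap().
elementwise = df.map if hasattr(df, "map") else df.applymap
squared = elementwise(lambda v: v ** 2)
```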


### Grouping

Use the `.groupby()` method to group data before computing aggregate statistics.

By default, the groups become the index. You can keep them as regular columns by setting `as_index = False` when grouping.

You can group by multiple columns.
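These options can be sketched on a made-up data frame (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "small", "large"],
    "kids": ["yes", "yes", "no", "no"],
    "weight": [5.0, 30.0, 7.0, 40.0],
})

# Mean weight per size. The groups become the index of the result.
by_size = df.groupby("size")["weight"].mean()

# Keep the groups as a regular column instead.
flat = df.groupby("size", as_index=False)["weight"].mean()

# Group by two columns. The result is indexed by (size, kids) pairs.
by_both = df.groupby(["size", "kids"])["weight"].mean()
```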

On groups, the `.apply()` method computes group-by-group. It is the most general form of two other methods:

* `.agg()`, which applies a function to each group to compute summary statistics
* `.transform()`, which applies a function to each group to compute transformations (such as standardization)
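The difference between the three methods shows up in the shape of the result. In this sketch on made-up data, `.agg()` returns one row per group, `.transform()` returns one value per original row, and `.apply()` returns whatever the function produces for each group.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "small", "large"],
    "weight": [5.0, 30.0, 7.0, 40.0],
})
groups = df.groupby("size")["weight"]

# Summary statistics: one row per group, one column per statistic.
summary = groups.agg(["mean", "max"])

# Standardize within each group: the result lines up with the rows of df.
z = groups.transform(lambda w: (w - w.mean()) / w.std())

# General form: any function of the group's values.
spread = groups.apply(lambda w: w.max() - w.min())
```

Since the output of `.transform()` has the same index as the original data, it can be assigned straight back as a new column, for example `df["z"] = z`.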

## Static Visualizations in Python

Discussion and part of the next lecture will cover the plotnine package, an implementation of ggplot2 for Python. Unlike packages we've seen so far, plotnine is not included with Anaconda. To install the package:

* On Windows, run `conda install -c conda-forge plotnine` in an Anaconda Prompt (find it in the start menu)
* On macOS or Linux, run `conda install -c conda-forge plotnine` in the Terminal

You may have to restart Jupyter after installing. To test your installation, try running

In [2]:
import plotnine

plotnine.__version__

'0.5.1'

## More About Packages and Modules

Which of the built-in modules are important?

Module      | Description
----------- | -----------
sys         | info about Python (version, etc)
pdb         | Python debugger
pathlib     | tools for file paths
collections | additional data structures
string      | string processing
re          | regular expressions
datetime    | date processing
urllib.parse | tools for URLs (named `urlparse` in Python 2)
itertools   | tools for iterators
functools   | tools for functions

Python's built-in `math` and `statistics` modules are missing features we need for serious scientific computing, so we use the "SciPy Stack" instead.

The SciPy Stack is a collection of packages for scientific computing (marked with a `*` below). Most scientists working in Python use the SciPy Stack. The 3 most important packages in the stack are:

Package      | Description
------------ | -----------
numpy\*      | arrays, matrices, math/stat functions
scipy\*      | additional math/stat functions
pandas\*     | data frames

There are also several packages available for creating static plots. We'll see these soon:

Package      | Description
------------ | -----------
matplotlib\* | visualizations
seaborn      | "statistical" visualizations
plotnine     | ggplot2 for Python

Finally, there are many other packages we may use for specific statistical tasks. Some of these are:

Package      | Description
------------ | -----------
requests     | web (HTTP) requests
lxml         | web page parsing (XML & HTML)
beautifulsoup | web page parsing (HTML)
nltk         | natural language processing
spacy        | natural language processing
textblob     | natural language processing
statsmodels  | classical statistical models
scikit-learn | machine learning models
pillow       | image processing
scikit-image | image processing
opencv       | image processing