# Chapter 1: Preliminaries

## 1.1 What Is This Book About?
- Nuts and bolts of data manipulation, processing, cleaning, and crunching
- Focuses mostly on Python programming and libraries rather than on methodology

### What Kinds of Data?
- Structured data
  - Tabular / spreadsheet data (CSVs)
  - Multidimensional arrays (vectors, matrices, tensors)
  - Relational data: multiple tables interrelated by key columns (primary or
    foreign keys)
  - Time-series data
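
To make "relational data interrelated by key columns" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`customers`, `orders`, `customer_id`) and the values are invented for illustration and do not come from the book.

```python
import sqlite3

# In-memory database; the schema and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id), "
    "amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 17.5), (12, 2, 40.0)],
)

# The key column customer_id relates each order row to exactly one customer row.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers AS c JOIN orders AS o ON o.customer_id = c.customer_id "
    "GROUP BY c.customer_id, c.name ORDER BY c.customer_id"
).fetchall()
conn.close()

print(rows)
```

Each table on its own is tabular data; the shared key column is what makes the pair relational.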

## 1.2 Why Python for Data Analysis?
- Python has a large and active scientific computing and data analysis community
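- Compared to domain-specific tools such as R, MATLAB, SAS, and Stata, Python is
    also a general-purpose language, so the same language can serve both data
    analysis and application development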

### Python as Glue
- Python can easily call C, C++, and FORTRAN code, giving the user Python's ease
    of use along with the efficiency and power of those languages.
- Python is used as the "glue code" tying together computations and libraries.
    The glue code typically accounts for a small share of total run time, so the
    gain in development speed can far outweigh the cost in computation speed.
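
As a rough illustration of this idea, the sketch below compares an explicit Python loop with the built-in `sum`. It assumes CPython, where `sum` is implemented in C; the function names `python_loop_sum` and `builtin_sum` are made up for this example. The Python code only sets up the data and delegates the loop, which is the "glue" role described above.

```python
import timeit

values = [float(i) for i in range(100_000)]


def python_loop_sum(xs):
    """Sum the values with an explicit loop executed by the interpreter."""
    total = 0.0
    for x in xs:
        total += x
    return total


def builtin_sum(xs):
    """Delegate the loop to the built-in sum (implemented in C in CPython)."""
    return sum(xs)


# Both approaches agree on the result; only where the loop runs differs.
loop_result = python_loop_sum(values)
builtin_result = builtin_sum(values)

# Timings vary by machine, so they are printed rather than assumed.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: builtin_sum(values), number=20)
print(f"Python loop: {loop_time:.4f}s, built-in sum: {builtin_time:.4f}s")
```

Libraries such as NumPy apply the same pattern on a larger scale: Python code describes the computation, and compiled code carries it out.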

### Solving the "Two-Language" Problem
- In many organizations, it's common to have a research / prototyping language
    and a separate production language
    - Research languages tend to be SAS or R
    - Production languages tend to be C, C++, C#, FORTRAN, Java, etc.
- Python can fill both roles because the heavy lifting is done by compiled
    C/C++ code.

### Why Not Python?
- For applications that need very low latency (e.g. high-frequency trading),
    the extra time spent programming directly in C/C++ to achieve maximum
    performance might be worthwhile
- Python struggles with highly concurrent, CPU-bound multithreaded code because
    of the global interpreter lock (GIL), which prevents the interpreter from
    executing more than one Python instruction at a time
    - Python C extensions can run truly parallel code, but only as long as
        they do not need to interact with Python objects.
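
The threading limitation can be observed with a small experiment. The sketch below runs the same CPU-bound, pure-Python task serially and then in a thread pool; `count_primes`, `LIMIT`, and `TASKS` are hypothetical names chosen for this example. On a standard CPython build with the global interpreter lock (GIL), the threaded run usually takes about as long as the serial one, but the exact timings depend on the machine and interpreter, so none are quoted here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound pure-Python work: count primes below limit by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


LIMIT = 20_000
TASKS = 4

# Run the tasks one after another.
start = time.perf_counter()
serial_results = [count_primes(LIMIT) for _ in range(TASKS)]
serial_seconds = time.perf_counter() - start

# Run the same tasks in a thread pool.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=TASKS) as pool:
    threaded_results = list(pool.map(count_primes, [LIMIT] * TASKS))
threaded_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.3f}s, threaded: {threaded_seconds:.3f}s")
```

The results are identical either way; the threads simply take turns running Python bytecode instead of running it simultaneously.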

## 1.3 Essential Python Libraries
### NumPy
- Short for Numerical Python
- Provides data structures and algorithms required for most scientific 
    applications
    - ndarray: optimized multidimensional array
    - Functions for element-wise computations on arrays and for mathematical
        operations between arrays
    - Tools for reading / writing array-based datasets to disk
    - Linear algebra operations, Fourier transforms, and random number generation
- NumPy's data structures are used widely across different computing libraries
- Additionally, code written in lower-level languages such as C and FORTRAN can
    operate on the data in a NumPy array directly, without copying it into
    another representation
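
The features listed above can be sketched in a few lines. This is a minimal illustration with made-up values, not an example from the book; an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# ndarray: a 2 x 3 array of floats
data = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Element-wise arithmetic with a scalar and between two arrays
doubled = data * 2
summed = data + doubled

# Linear algebra: matrix product of a (2 x 3) array with its (3 x 2) transpose
gram = data @ data.T

# Random number generation with a seeded generator
rng = np.random.default_rng(seed=0)
noise = rng.standard_normal(size=data.shape)

# Write the array in .npy format and read it back
buffer = io.BytesIO()
np.save(buffer, data)
buffer.seek(0)
restored = np.load(buffer)

print(gram)
```

Passing a filename to `np.save` and `np.load` instead of the buffer would write to and read from disk.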

### pandas
- Provides high-level data structures and functions
- Has two primary objects:
    - DataFrame: tabular, column-oriented data structure
    - Series: a one-dimensional labelled array object
- Blends the high performance of NumPy with the flexible data manipulation of
    spreadsheets and relational databases
- Provides sophisticated indexing, which makes it easy to reshape, slice,
    perform aggregations, and select subsets of data
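
A short sketch of the two primary objects and the operations mentioned above. The column names and values are invented for illustration.

```python
import pandas as pd

# DataFrame: tabular, column-oriented data (hypothetical values)
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Lima", "Lima"],
        "year": [2020, 2021, 2020, 2021],
        "sales": [10.0, 12.0, 7.0, 9.0],
    }
)

# Series: selecting one column yields a one-dimensional labelled array
sales = df["sales"]

# Select a subset of rows with a boolean mask
recent = df[df["year"] == 2021]

# Aggregate: total sales per city
total_by_city = df.groupby("city")["sales"].sum()

# Reshape from long to wide: one row per city, one column per year
wide = df.pivot(index="city", columns="year", values="sales")

print(total_by_city)
print(wide)
```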

### matplotlib
- Most popular Python library for plotting and data visualization

### IPython & Jupyter
- IPython provides an enhanced interactive Python interpreter
- Instead of edit-compile-run, IPython encourages an execute-explore workflow
- Very effective for analyzing, prototyping, and experimenting
- IPython became a component of the much broader, multi-language Jupyter project
- In a Jupyter notebook, code is held in cells which can be run individually
    and out of order, and their output is displayed next to them
- Additionally, Markdown and HTML can be used to create rich explanations

### SciPy
- A collection of packages addressing a number of different problems in 
    scientific computing:
    - scipy.integrate: Numerical integration routines and differential equation
        solvers
    - scipy.linalg: Linear algebra routines and matrix decompositions
    - scipy.optimize: Function optimizers (minimizers) and root-finding algorithms
    - scipy.signal: Signal processing tools
    - scipy.sparse: Sparse matrices and sparse linear system solvers
    - scipy.special: Wrapper around SPECFUN, a FORTRAN library with many common
        mathematical functions
    - scipy.stats: Standard continuous and discrete probability distributions,
        statistical tests, and more descriptive statistics
- Combined with NumPy, SciPy forms a reasonably complete computational
    foundation for scientific applications.
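
As a hedged illustration of how a few of these subpackages are used, the sketch below solves four small problems with known answers. The specific problems are chosen for this example and are not taken from the book.

```python
import math

import numpy as np
from scipy import integrate, linalg, optimize, stats

# scipy.integrate: integral of sin(x) from 0 to pi (exact value is 2)
area, abs_error = integrate.quad(math.sin, 0.0, math.pi)

# scipy.optimize: root of x**2 - 2 on the interval [0, 2] (exact value is sqrt(2))
root = optimize.brentq(lambda x: x**2 - 2.0, 0.0, 2.0)

# scipy.linalg: solve the linear system A x = b (exact solution is [2, 3])
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# scipy.stats: cumulative probability of the standard normal at 0 (exact value is 0.5)
p = stats.norm.cdf(0.0)

print(area, root, x, p)
```

Note that the inputs and outputs of `linalg.solve` are NumPy arrays, which is one way the two libraries fit together.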

### scikit-learn
- Premier general-purpose machine learning toolkit for Python.
- It includes submodules for classification, regression, clustering,
    dimensionality reduction, model selection, preprocessing, etc.

### statsmodels
- A statistical analysis package that grew out of implementations of
    regression models popular in R.
- Compared with scikit-learn, statsmodels contains models for more classical
    statistics and econometrics, such as regression models, analysis of
    variance, time series analysis, nonparametric methods, and visualization
    of statistical model results
- More focused on statistical inference than on prediction

## 1.4 Installation and Setup
(already set up; see book)

## 1.5 Community and Conferences
(see book)

## 1.6 Navigating this Book
(see book)