# What Is This Book About?

This book is concerned with the nuts and bolts of manipulating, processing, cleaning,
and crunching data in Python.

## What Kinds of Data?

When I say “data,” what am I referring to exactly? The primary focus is on structured
data, a deliberately vague term that encompasses many different common forms of
data, such as:

- Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.
- Multidimensional arrays (matrices).
- Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user).
- Evenly or unevenly spaced time series.
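
As a rough illustration of these forms, the sketch below builds one small example of each using NumPy and pandas. All of the table names, column names, and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Tabular data: each column can hold a different type.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer_id": [10, 11, 10],
        "amount": [25.0, 12.5, 40.0],
        "placed": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-02-01"]),
    }
)

# A multidimensional array (here a 2 x 3 matrix).
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# A second table related to the first through a key column.
customers = pd.DataFrame({"customer_id": [10, 11], "name": ["Ana", "Ben"]})
joined = orders.merge(customers, on="customer_id", how="left")

# An evenly spaced (daily) time series.
daily = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

print(orders.dtypes)
print(matrix.shape)
print(joined[["order_id", "name"]])
print(daily)
```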

# Why Python for Data Analysis?

## Why Not Python?

As Python is an interpreted programming language, in general most Python code will run substantially slower than code written in a compiled language like Java or C++.
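
One way to get a feel for this overhead, without leaving Python, is to time an explicit loop against the built-in sum function, which the standard CPython interpreter implements in C. This is only a suggestive sketch, not a benchmark against Java or C++; the function name python_loop_sum is made up for the example, and the timings will vary by machine and Python version.

```python
import timeit

values = list(range(100_000))


def python_loop_sum(items):
    """Add the items one at a time in interpreted Python bytecode."""
    total = 0
    for item in items:
        total += item
    return total


# Both approaches compute the same result.
assert python_loop_sum(values) == sum(values)

# Time 20 repetitions of each approach.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)

print(f"explicit loop: {loop_seconds:.4f} s")
print(f"built-in sum:  {builtin_seconds:.4f} s")
```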

# Essential Python Libraries
## NumPy
NumPy, short for Numerical Python, has long been a cornerstone of numerical
computing in Python. NumPy contains, among other things:
- A fast and efficient multidimensional array object, ndarray
- Functions for performing element-wise computations with arrays or mathematical operations between arrays
- Tools for reading and writing array-based datasets to disk
- Linear algebra operations, Fourier transform, and random number generation
- A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities
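
A minimal sketch of several of these features follows. The array values and the seed are arbitrary choices for the example.

```python
import numpy as np

# A 2 x 3 ndarray of 64-bit floats.
data = np.array([[1.5, -0.1, 3.0], [0.0, -3.0, 6.5]])

# Element-wise computation: every element is multiplied by 10.
scaled = data * 10

# A mathematical operation between two arrays of the same shape.
doubled = data + data

# Linear algebra: the product of a 2 x 3 matrix and its 3 x 2 transpose.
gram = data @ data.T

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=12345)
samples = rng.standard_normal(size=(2, 3))

print(data.shape, data.dtype)
print(scaled)
print(gram.shape)
print(samples.shape)
```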

## pandas
pandas provides high-level data structures and functions designed to make working
with structured or tabular data fast, easy, and expressive.

The pandas name itself is derived from panel data, an econometrics term for multidimensional structured datasets, and a play on the phrase Python data analysis itself.
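
To give a sense of what "high-level" means here, the following sketch filters and aggregates a small table. The sales data is hypothetical and exists only for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["east", "west", "east", "west"],
        "product": ["apple", "apple", "pear", "pear"],
        "units": [10, 7, 3, 12],
    }
)

# Select rows with a boolean condition.
large_orders = sales[sales["units"] >= 10]

# Aggregate by group: total units sold in each region.
units_by_region = sales.groupby("region")["units"].sum()

print(large_orders)
print(units_by_region)
```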

## matplotlib
matplotlib is the most popular Python library for producing plots and other two-dimensional data visualizations. 
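
A simple line plot might be produced as in the sketch below. It assumes a non-interactive setting, so it selects the Agg backend and writes the image to an in-memory buffer rather than opening a window; in a notebook or desktop session you would more likely display the figure directly.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.legend()

# Render the figure as PNG bytes in memory instead of writing a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
print(f"rendered {len(buffer.getvalue())} bytes")
```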

## IPython and Jupyter
The IPython project began in 2001 as Fernando Pérez’s side project to make a better
interactive Python interpreter. It encourages an execute-explore workflow instead of the typical edit-compile-run workflow of many other programming languages. Since much of data analysis coding involves exploration, trial and error, and iteration, IPython can help you get the job done faster.

In 2014, Fernando and the IPython team announced the Jupyter project, a broader
initiative to design language-agnostic interactive computing tools. The IPython web
notebook became the Jupyter notebook, with support now for over 40 programming
languages.

## SciPy
SciPy is a collection of packages addressing a number of foundational problems
in scientific computing. Here is a sampling of the packages included:
- scipy.integrate: Numerical integration routines and differential equation solvers
- scipy.linalg: Linear algebra routines and matrix decompositions extending beyond those provided in numpy.linalg
- scipy.optimize: Function optimizers (minimizers) and root finding algorithms
- scipy.signal: Signal processing tools
- scipy.sparse: Sparse matrices and sparse linear system solvers
- scipy.special: Wrapper around SPECFUN, a Fortran library implementing many common mathematical functions, such as the gamma function
- scipy.stats: Standard continuous and discrete probability distributions (density functions, samplers, cumulative distribution functions), various statistical tests, and more descriptive statistics
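
As a brief sketch, here is one call from each of three of these packages. The particular integrand, equation, and distribution are arbitrary choices for the example.

```python
import numpy as np
from scipy import integrate, optimize, stats

# scipy.integrate: integrate sin(x) from 0 to pi (the exact value is 2).
area, error_estimate = integrate.quad(np.sin, 0, np.pi)

# scipy.optimize: find the root of x**2 - 2 in the interval [0, 2].
root = optimize.brentq(lambda x: x**2 - 2, 0, 2)

# scipy.stats: cumulative distribution function of the standard normal at 0.
probability = stats.norm.cdf(0.0)

print(area)
print(root)
print(probability)
```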

## scikit-learn
Since the project’s inception in 2007, scikit-learn has become the premier general-purpose machine learning toolkit for Python programmers. It includes submodules for such models as:
- Classification: SVM, nearest neighbors, random forest, logistic regression, etc.
- Regression: Lasso, ridge regression, etc.
- Clustering: k-means, spectral clustering, etc.
- Dimensionality reduction: PCA, feature selection, matrix factorization, etc.
- Model selection: Grid search, cross-validation, metrics
- Preprocessing: Feature extraction, normalization
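
The sketch below combines three of these pieces: preprocessing, classification, and a simple held-out evaluation. The data is synthetic, and the choice of logistic regression and a 25% test split is an assumption made for the example rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a quarter of the rows to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Preprocessing (normalization) followed by a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```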

## statsmodels
statsmodels is a statistical analysis package that was seeded by work from Stanford
University statistics professor Jonathan Taylor, who implemented a number of regression analysis models popular in the R programming language.

Compared with scikit-learn, statsmodels contains algorithms for classical (primarily
frequentist) statistics and econometrics. This includes such submodules as:
- Regression models: Linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
- Analysis of variance (ANOVA)
- Time series analysis: AR, ARMA, ARIMA, VAR, and other models
- Nonparametric methods: Kernel density estimation, kernel regression
- Visualization of statistical model results

statsmodels is more focused on statistical inference, providing uncertainty estimates
and p-values for parameters. scikit-learn, by contrast, is more prediction-focused.

# Navigating This Book
## Import Conventions
It’s considered bad practice in Python software development to import everything
(from numpy import *) from a large package like NumPy.
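
One reason is that names can collide. For instance, both Python and NumPy provide a function called sum, and the two behave differently on a two-dimensional array. The sketch below keeps NumPy behind the conventional np prefix, so it is always clear which one is being called; after a wildcard import, a reader would have to guess what a bare sum refers to.

```python
import numpy as np

table = np.array([[1, 2], [3, 4]])

# Python's built-in sum iterates over the rows and adds them together.
builtin_result = sum(table)

# NumPy's sum adds every element unless an axis is given.
numpy_result = np.sum(table)

print(builtin_result)  # [4 6]
print(numpy_result)    # 10
```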

## Jargon
I’ll use some terms common both to programming and data science that you may not
be familiar with. Thus, here are some brief definitions:

- *Munge/munging/wrangling*: Describes the overall process of manipulating unstructured and/or messy data into a structured or clean form. The word has snuck its way into the jargon of many modern-day data hackers. “Munge” rhymes with “grunge.”
- *Pseudocode*: A description of an algorithm or process that takes a code-like form while likely not being actual valid source code.
- *Syntactic sugar*: Programming syntax that does not add new features, but makes something more convenient or easier to type.
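
To make the last term concrete, here are three pieces of Python syntax next to roughly equivalent longer forms. The helper names shout, greet, and greet_plain are invented for the example, and the equivalences are close but not exact in every detail (a list comprehension, for instance, has its own scope).

```python
values = [3, 1, 4, 1, 5]

# Slice syntax is sugar for passing a slice object to __getitem__.
assert values[1:4] == values.__getitem__(slice(1, 4))

# A list comprehension is a compact form of an explicit loop.
squares_loop = []
for value in values:
    squares_loop.append(value ** 2)
squares_sugar = [value ** 2 for value in values]
assert squares_sugar == squares_loop


def shout(func):
    """Wrap func so that its string result is upper-cased."""
    def wrapper(*args):
        return func(*args).upper()
    return wrapper


# Decorator syntax ...
@shout
def greet(name):
    return f"hello, {name}"


# ... is shorthand for defining a function and then rebinding its name.
def greet_plain(name):
    return f"hello, {name}"


greet_plain = shout(greet_plain)

assert greet("ada") == greet_plain("ada") == "HELLO, ADA"
```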