## Data Science

Eric Schmidt, the former CEO of Google: "Data Science is the Future of Everything" (https://www.youtube.com/watch?v=9hDnO_ykC7Y).

## Data analytics pipeline

Data modeling is the process of using data to build predictive models. Data can also be used for descriptive and prescriptive analysis. But before we can make use of data, it has to be fetched from several sources, stored, assimilated, cleaned, and engineered to suit our goal. The sequential operations that need to be performed on data are akin to a manufacturing pipeline, where each step adds value to the potential end product and each stage often requires a different person or skill set.

The various steps in a data analytics pipeline are shown in the following diagram: 

<img src="../images/ds-pipe.png" alt="ds-steps" width=600 align="left" />

These steps can be combined into three high-level categories:
1. Data engineering
2. Data science
3. Product development



**Data Engineering** deals with sourcing data from a variety of sources, creating a suitable database and table schema, and loading the data into that database. There can be many approaches to this step, depending on the following:

- Type of data: Structured (tabular data) versus unstructured (such as images and text) versus semi-structured (such as JSON and XML)
- Velocity of data updates: Batch processing versus real-time data streaming
- Volume of data: Distributed (or cluster-based) storage versus single instance databases
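
As a minimal sketch of the loading step, assume a small batch of semi-structured JSON records and a single-instance SQLite database. The `readings` table, its columns, and the sample records are hypothetical and chosen only for illustration:

```python
import json
import sqlite3

# Hypothetical semi-structured input: one JSON document per record.
raw_records = [
    '{"sensor": "a1", "reading": {"temp_c": 21.5, "humidity": 40}}',
    '{"sensor": "b2", "reading": {"temp_c": 19.0}}',  # humidity is missing
]


def load_records(conn, records):
    """Flatten nested JSON records and load them into a tabular schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "sensor TEXT NOT NULL, temp_c REAL, humidity REAL)"
    )
    rows = []
    for line in records:
        doc = json.loads(line)
        reading = doc.get("reading", {})
        # Missing fields become NULL in the table.
        rows.append((doc["sensor"], reading.get("temp_c"), reading.get("humidity")))
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # single-instance, in-memory database
loaded = load_records(conn, raw_records)
stored = conn.execute(
    "SELECT sensor, temp_c, humidity FROM readings ORDER BY sensor"
).fetchall()
print(loaded, stored)
```

A streaming source or a distributed store would change the loading code considerably; this sketch covers only the batch, single-instance case.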


**Data Science** is the phase where the data is made usable and used to predict the future, learn patterns, and extrapolate these patterns. Data science can further be sub-divided into two phases.
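
To make this concrete, here is a small sketch in plain Python, assuming a hypothetical set of (spend, sales) observations with some missing values. It first makes the data usable by dropping incomplete rows, then learns a linear pattern with ordinary least squares and extrapolates it to an unseen input. The data and function names are illustrative only:

```python
# Hypothetical raw observations: (spend, sales); None marks a missing value.
raw = [(1.0, 3.1), (2.0, None), (3.0, 7.0), (None, 4.0), (4.0, 8.9), (5.0, 11.0)]


def clean(rows):
    """Make the data usable: drop rows with a missing field."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def fit_line(rows):
    """Learn a pattern: least-squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


usable = clean(raw)
slope, intercept = fit_line(usable)
prediction = slope * 6.0 + intercept  # extrapolate to an unseen input
print(len(usable), round(slope, 3), round(intercept, 3), round(prediction, 3))
```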

**Product Development** is the phase where all the hard work bears fruit and all the insights, results, and patterns are served to the users in a way that they can consume, understand, and act upon. It might range from building a dashboard on data with additional derived fields to an API that calls a trained model and returns an output on incoming data.
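
The API case can be sketched as a plain function that accepts a JSON request body, calls a trained model, and returns a JSON response. The `MODEL` parameters, the `x` field, and the `handle_request` name are assumptions made for illustration; a real service would wrap this in a web framework of your choice:

```python
import json

# Hypothetical parameters of an already-trained linear model.
MODEL = {"slope": 2.0, "intercept": 1.0}


def predict(x):
    """Apply the trained model to a single numeric input."""
    return MODEL["slope"] * x + MODEL["intercept"]


def handle_request(body):
    """Handler for a hypothetical prediction endpoint: JSON string in, JSON string out."""
    try:
        payload = json.loads(body)
        x = float(payload["x"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a non-numeric value.
        return json.dumps({"error": "expected a JSON object with a numeric 'x'"})
    return json.dumps({"x": x, "prediction": predict(x)})


print(handle_request('{"x": 3}'))
print(handle_request('{"y": 3}'))
```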

Apart from these steps in the pipeline, there are some additional steps that might come into the picture. This is due to the highly evolving nature of the data landscape. For example, deep learning, which is used extensively to build intelligent products around image, text, and audio data, often requires the training data to be labeled into a category or augmented if the quantity is too small to create an accurate model.
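
As one simple illustration of augmentation, the sketch below doubles a tiny labeled image set by adding a horizontally mirrored copy of each image. The images (nested lists of pixel values) and labels are hypothetical, and mirroring is only one of many possible augmentation techniques:

```python
# Hypothetical labeled training set: tiny grayscale images as nested lists.
training_set = [
    ([[0, 1, 2], [3, 4, 5]], "cat"),
    ([[9, 8, 7], [6, 5, 4]], "dog"),
]


def flip_horizontal(image):
    """Mirror an image left-to-right by reversing each row."""
    return [row[::-1] for row in image]


def augment(dataset):
    """Return the original examples plus a mirrored copy of each; labels are unchanged."""
    return dataset + [(flip_horizontal(img), label) for img, label in dataset]


augmented = augment(training_set)
print(len(augmented), augmented[2])
```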

## Why Python

Among the characteristics that make Python popular for data science are its very user-friendly (human-readable) syntax, the fact that it is interpreted rather than compiled (leading to faster development time), its comprehensive libraries for parsing and analyzing data, and its capacity for numerical and statistical computation.

Python has libraries that provide a complete toolkit for data science and analysis. The major ones are as follows:

- **NumPy**: General-purpose array functionality with an emphasis on numeric computation
- **SciPy**: Scientific computing (optimization, integration, linear algebra, and statistics routines)
- **Matplotlib**: Graphics
- **pandas**: Series and data frames (1D and 2D array-like types)
- **scikit-learn**: Machine learning
- **NLTK**: Natural language processing
- **statsmodels**: Statistical analysis
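
As a brief taste of the first and fourth entries in the list, the following sketch assumes NumPy and pandas are installed and uses made-up temperature readings. It shows a vectorized NumPy computation and the 1D `Series` and 2D `DataFrame` types from pandas:

```python
import numpy as np
import pandas as pd

# NumPy: vectorized numeric computation on an array (no explicit loop).
temps_c = np.array([21.5, 19.0, 23.5])  # hypothetical readings
temps_f = temps_c * 9 / 5 + 32

# pandas: a Series is a labeled 1D array; a DataFrame is a labeled 2D table.
series = pd.Series(temps_c, index=["mon", "tue", "wed"], name="temp_c")
frame = pd.DataFrame({"temp_c": temps_c, "temp_f": temps_f}, index=series.index)

mean_c = series.mean()
warm_days = frame[frame["temp_c"] > 20].index.tolist()  # boolean-mask filtering
print(temps_f, mean_c, warm_days)
```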