<a href="https://colab.research.google.com/github/Flux159/Polymorph/blob/master/01_Data_Science_and_Machine_Learning_Notebooks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Data Science and Machine Learning Notebooks

This is a starting point for a series of Jupyter (Colab) notebooks about Data Science, Data Engineering, and Machine Learning. The notebooks should cover theory, practical applications (with executable code), and modern industry workflows.

### Background

Whenever applicable, keep references to books and articles inline within a markdown cell. Material drawn from interviews with industry professionals or from current best practices should also be marked with references.

### Data

All data used in these notebooks should be public and open source, with the exception of the practical exercises that involve data scraping and cleaning. Because these notebooks cover both theory and practical realities, the data scraping material assumes a thorough understanding of what reproducibility in analytics means.

In practice, data also rarely arrives cleaned and structured in a usable format. Many academic contexts skip the structuring of data because the datasets are already mostly prepared. The data engineering and pipeline notebooks cover these topics in depth, along with an overview of modern solutions to these problems.

### Reproducible analytics

What does it mean for an analysis to be reproducible?

In the context of these notebooks, reproducible analysis does not mean "get the same result." In an ideal scenario, you would have the exact same dataset, run the same deterministic code across it, and produce an analysis that could be rerun at any time in the future to get the same result.

As these notebooks cover practical realities, it will readily become evident that the above definition simply won't work.

Data changes constantly: new and existing users post content, sensors and other systems gather new data, laws and regulations require that user-identifiable data be deleted after 90 days, storing data for all time becomes prohibitively expensive, and large systems face any number of other constraints.

Since data changes over time, the result of an analysis changes along with the underlying data. However, to compare different analysis frameworks in a consistent way, we need a way to run analyses against a static dataset. This is why static data is used in academia and when learning theory: it makes results comparable.
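One way to check that two runs really used the same static dataset is to fingerprint the data before analyzing it. The sketch below is only an illustration: the function name `dataset_fingerprint`, the toy `(user_id, score)` rows, and the choice to ignore row order are all assumptions, not something these notebooks prescribe.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest over rows, independent of row order."""
    digest = hashlib.sha256()
    # Sort the serialized rows so the same data in a different order
    # still produces the same fingerprint (an assumption of this sketch).
    for line in sorted(repr(tuple(row)) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical toy dataset: (user_id, score) pairs.
snapshot = [(1, 0.5), (2, 0.75), (3, 0.25)]
reordered = [(3, 0.25), (1, 0.5), (2, 0.75)]
changed = [(1, 0.5), (2, 0.75), (3, 0.30)]

same_data = dataset_fingerprint(snapshot) == dataset_fingerprint(reordered)
different_data = dataset_fingerprint(snapshot) != dataset_fingerprint(changed)
print(same_data, different_data)
```

Recording the fingerprint alongside a result makes it possible to tell later whether a different number came from different code or from different data.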

We may also need to run the same analysis across different datasets or datasets that change with time. In this case, we want to ensure that the analysis we are running across the datasets remains the same (is invariant) while the data changes. We can then analyze the results and compare them with some guarantees that the things we are comparing are relevant (apples to apples).
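As a minimal sketch of this idea, the analysis can be written as a single function that is applied unchanged to every snapshot of the data. The function name `analyze`, the snapshot labels, and the values are hypothetical and exist only for illustration.

```python
from statistics import mean, median


def analyze(values):
    """The fixed analysis: summary statistics of a list of numbers."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# Hypothetical snapshots of the same data source at two points in time.
snapshots = {
    "2020-01": [10, 12, 14],
    "2020-02": [10, 12, 14, 20],
}

# The analysis stays the same; only the data it runs on changes.
results = {label: analyze(values) for label, values in snapshots.items()}
for label, summary in results.items():
    print(label, summary)
```

Because `analyze` is identical for every snapshot, any difference between the summaries can be attributed to the data rather than to the method.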

### Programming

A basic understanding of Python and SQL is assumed. While these notebooks could be written in any language (R, Julia, MATLAB, etc.), Python is a good starting point since it is applicable to a number of different domains. SQL is also necessary for querying data stores, and the basics of SQL will be covered when discussing Relational Algebra.
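For a rough sense of the level assumed, the following sketch combines the two using Python's built-in `sqlite3` module. The `events` table and its contents are made up for illustration and are not part of any dataset used in these notebooks.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, action) VALUES (?, ?)",
    [(1, "view"), (1, "click"), (2, "view")],
)

# Count events per user; ORDER BY makes the row order deterministic.
rows = conn.execute(
    "SELECT user_id, COUNT(*) AS n_events "
    "FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()
print(rows)
```

If the `GROUP BY` query and the list of tuples it returns both read naturally, the programming background should be sufficient.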

### Contributing

Contributions are welcome under the MIT License. Data shared should also be under a similar open license.

## Table of contents

1. What is Data Science, and required background knowledge

  a. Statistics and Probability, Linear Algebra, Relational Algebra

  b. Programming (Python)

  c. Data Engineering, Data Science, and Machine Learning

2. Statistics and Probability

3. Linear Algebra

  a. Vectors, Matrices, Tensors

  b. Relationship to differential equations, 3D modeling

  c. PCA, t-SNE

4. Relational Algebra

5. Programming (Python and SQL)

  a. Python

  b. SQL

6. Data Engineering

  a. Logging and data pipelines

  b. Unstructured vs structured data

  c. Cleaning up data

  d. Getting data to work with
    - i. Web scraping
    - ii. API requests / User input
    - iii. IoT / Sensors
    - iv. Mobile devices, web extensions
    - v. Business data sharing agreements

7. Data Science

  a. What is the question you want to answer?

  b. Where is the data to answer said question? If you don't have it, how would you get it and clean it up (Data Engineering)? 

  c. What tools (regressions, plots) do you need to answer your question and produce a result?

  d. Modeling and prediction
  - i. Classification
  - ii. Prediction

8. Machine Learning

  a. What is Machine Learning?

  b. Theory

  c. Bayesian learning, Genetic algorithms, non-neural networks
  - i. Classifiers
  - ii. Predictive
  - iii. Generative

  d. Neural Networks
  - i. Computer Vision
  - ii. Text processing
  
