# Introduction to Data Science

We will explore three aspects of Data Science:

1. Foundations
2. Processes
3. Tools

## Story: John Snow and the Broad Street Pump

In the 1850s, London faced severe cholera outbreaks. The prevailing theory was that "miasmas" (bad smells from decaying matter) caused the disease. Dr. John Snow doubted this and observed that while cholera wiped out entire households, neighboring houses remained unaffected despite sharing the same air. Snow also noted that cholera victims suffered from vomiting and diarrhea, suggesting water contamination as the culprit.

In August 1854, when cholera hit Soho, Snow mapped the location of cholera deaths and found that many victims lived near the Broad Street pump.

- He also noted that the deaths near the Rupert Street pump were of residents who preferred the more convenient Broad Street pump.
- No deaths occurred at the Lion Brewery, where workers drank only brewed beer and water from their own well.
- Deaths in distant houses involved children who drank from the Broad Street pump on their way to school.

Snow's observations led him to conclude that the Broad Street pump was the source of the cholera outbreak. He convinced local authorities to remove the pump handle, and sure enough, the outbreak subsided, preventing further deaths.

![Death Counts Mapped in the neighborhood](../assets/snows-mapped-death-frequency.png)

Source: [Data 8](https://inferentialthinking.com/chapters/02/1/observation-and-visualization-john-snow-and-the-broad-street-pump.html)

## Parkinson's Law

Individual observations are grouped together to form a sample, from which one can draw conclusions about the population.

Example: **Parkinson's Law** states that "Work expands to fill the time available for its completion."

![](../assets/parkinsons-law.png)

Image Source: consuunt.com

# Statistics

**Statistics** is a tool that applies to many fields, including business, finance, economics, biology, sociology, psychology, education, public health, and sports.

- **Snow** applied statistics to the data on cholera deaths to trace the root cause of the outbreak and to make a scientifically sound case to the authorities.
- **Parkinson's Law** is a humorous example of how statistical plots can be used to express a joke or a real observation.

There are two major fields of statistics:

1. Descriptive statistics
2. Inferential statistics

### Descriptive statistics

***Descriptive statistics*** summarizes qualities of a group (of people or things) numerically and visually.

Numerical summaries are divided into these main categories:

1. **Frequency** (proportion)
1. **Measures of Central Tendency** (mean, median, mode)
1. **Measures of Dispersion** (range, variance, standard deviation)
1. **Measures of Shape** (skewness, kurtosis)
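As a minimal sketch, the four categories above can be computed with pandas. The `ages` values are made up purely for illustration, and the variable names are assumptions rather than part of the course material.

```python
import pandas as pd

# Hypothetical sample: ages of ten people (made-up values for illustration).
ages = pd.Series([23, 25, 25, 29, 31, 35, 35, 35, 40, 52])

# Frequency: proportion of the sample taking each distinct value.
proportions = ages.value_counts(normalize=True)

# Measures of central tendency.
mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]  # mode() returns a Series; take the first mode

# Measures of dispersion.
age_range = ages.max() - ages.min()
variance = ages.var()  # sample variance (divides by n - 1)
std_dev = ages.std()   # sample standard deviation

# Measures of shape.
skewness = ages.skew()
kurtosis = ages.kurt()

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
print(f"range={age_range}, variance={variance:.2f}, std={std_dev:.2f}")
print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

For a quick overview, `ages.describe()` prints several of these summaries at once.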

Visual summaries use graphical representations such as:
- Histogram
- Bar chart
- Box plot

![](../assets/data-viz-examples.png)

Example: In Snow's case, the data are the locations of the cholera deaths. The frequency is the number of deaths at each location. The central tendency is the location where most deaths occurred (the mode). The dispersion is how widely the locations of the deaths are spread.

### Inferential Statistics

***Inferential Statistics*** makes statements about a population based on samples:

1. ***Estimation***: make an informed guess about population parameters. Examples:
    - How many people in the population are obese?
    - What is the median income of households in region X?
2. ***Comparison***: find out whether differences are actually caused by some variable(s) or are just due to random chance. Examples:
    - What is the impact of a new drug on the recovery time of patients?
    - What is the impact of a new teaching method on student performance?
3. ***Relationships***: quantify the magnitude of the relationship between two variables, which allows a “what if” analysis. Examples:
    - How much more can a house sell for an additional bedroom?
    - What is the impact of lot size on housing price?

![Inferential Statistics](../assets/inferential-statistics.png)
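To make estimation and comparison more concrete, here is a small sketch using SciPy. The recovery times are simulated, not real data, and the group names, sample sizes, and the choice of a Welch t-test are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Simulated recovery times in days (made-up numbers, not real trial data).
control = rng.normal(loc=10.0, scale=2.0, size=50)
treated = rng.normal(loc=8.0, scale=2.0, size=50)

# Estimation: a 95% confidence interval for the mean recovery time
# of the treated population, based on the treated sample.
mean_treated = treated.mean()
sem_treated = stats.sem(treated)  # standard error of the mean
ci_low, ci_high = stats.t.interval(
    0.95, len(treated) - 1, loc=mean_treated, scale=sem_treated
)

# Comparison: Welch's t-test asks whether the difference between the
# two sample means could plausibly be due to random chance alone.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

print(f"Treated mean: {mean_treated:.2f} days (95% CI {ci_low:.2f} to {ci_high:.2f})")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be explained by chance alone; it does not by itself prove that the drug caused the difference.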

# What is Data Science?

A **Data Point** is an **Observation**. Collected together, data points describe the phenomenon we are analyzing. Before we can analyze them, we store them on computers.

- Data can be numbers, words, images, sounds, etc.
- Data also falls into types.

There is a lot to say about data; hence, we have Data Science.

**Data Science** is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines principles and techniques from statistics, computer science, and domain-specific knowledge to analyze, interpret, and leverage data for decision-making and predictive analytics.

- **Statistics** is essential since it studies how to make robust conclusions based on incomplete information.
- **Computer science** is essential because data are stored on computers and analyzed by algorithms.
- **Domain knowledge** is essential for asking the right questions and for understanding the answers produced by computational tools.

![](../assets/Data-Science-Venn-Diagram.png)

Image Source: https://www.researchgate.net/figure/Data-Science-Venn-Diagram_fig1_365946272

How did Snow do it?

- **Domain Knowledge**: Snow was a doctor.
- **Math & Statistics**: Snow collected data, plotted it, and drew conclusions.
- **Computer Science**: Snow did not have computers; the data was stored on paper, and he did not need to do much computation.

Computers help us deal with large amounts of data. They serve **governments, the police, the military, businesses, etc**.

## Processes of Data Science

Data Science consists of three processes:

1) **Exploration**: Identifying patterns in information. It involves:
    - *Descriptive Statistics*
    - *Visualization*

2) **Inference**: Drawing conclusions about a population based on samples of data taken from that population. It involves:
    - *Estimation*
    - *Comparison*
    - *Relationship*

3) **Prediction**: Filling in missing values based on other values. It involves:
    - *Machine Learning*
    - ~~*Optimization*~~

### Prediction

We mentioned exploration and inference, but what is *Prediction*?

Consider the time it takes to commute to the bootcamp location:

- Last week on Sunday it took me 50 minutes
- On Monday it took 48 minutes
- On Tuesday: 52 minutes
- On Wednesday: 49 minutes
- On Thursday: ... (how much time will it take?)

**Time-wise**: a prediction can be about the future, the past, or a missing value in between. **In general**: missing values can be inferred from other values once the pattern has been learned from the data.
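One simple way to answer the Thursday question is sketched below with NumPy. The two methods shown (the average of past days, and a straight-line trend) are illustrative assumptions, not the only or best way to predict, and four data points are far too few for a reliable estimate.

```python
import numpy as np

# Commute times in minutes, Sunday to Wednesday, taken from the list above.
commute_minutes = np.array([50, 48, 52, 49])

# Baseline prediction: assume Thursday looks like the average of past days.
mean_estimate = commute_minutes.mean()

# Alternative: fit a straight line (day index -> minutes) and extend it by one day.
day_index = np.arange(len(commute_minutes))  # 0, 1, 2, 3
slope, intercept = np.polyfit(day_index, commute_minutes, deg=1)
trend_estimate = slope * len(commute_minutes) + intercept

print(f"Mean-based estimate:  {mean_estimate:.2f} minutes")
print(f"Trend-based estimate: {trend_estimate:.2f} minutes")
```

Both estimates land close to 50 minutes, which matches the intuition that the commute time is roughly stable from day to day.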

# Data Science Tools

Python tools for data science are built on top of the following fundamental packages/libraries:

- **NumPy**: The fundamental **package** for scientific computing with Python.
- **SciPy**: Fundamental **algorithms** for scientific computing in Python.
- **Matplotlib**: A comprehensive library for creating static, animated, and interactive **visualizations** in Python.


![Numpy Languages](../assets/numpy-languages.png)

Such Python libraries use **C** underneath to achieve high performance, yet provide a simple Pythonic interface (API).
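The following sketch illustrates what that interface looks like in practice: a NumPy expression operates on a whole array at once, with the loop running in compiled code rather than in Python. The array contents and variable names are assumptions chosen for illustration.

```python
import numpy as np

# The numbers 0..999 as a NumPy array of floats.
values = np.arange(1000, dtype=np.float64)

# Pure Python: an explicit loop over every element.
loop_total = 0.0
for v in values:
    loop_total += v * v

# NumPy: the same computation as a single vectorized expression.
vectorized_total = (values ** 2).sum()

print(loop_total, vectorized_total)
```

Both approaches give the same result; on large arrays the vectorized form is typically much faster because the element-wise work is done outside the Python interpreter.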

## A. Data Wrangling and Exploration

### 1. Pandas

**Pandas** provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It is built on top of **NumPy**, whose core is implemented in **C**, so many common operations are fast.
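As a brief sketch of what "labeled" data looks like, the example below builds a small `DataFrame` and summarizes it. The household data, column names, and the 50,000 threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical household data (made-up values for illustration).
households = pd.DataFrame({
    "region": ["X", "X", "X", "Y", "Y"],
    "income": [30_000, 45_000, 60_000, 52_000, 58_000],
})

# Select a column by its label and summarize it per group.
median_income = households.groupby("region")["income"].median()

# Filter rows with a boolean condition.
high_income = households[households["income"] > 50_000]

print(median_income)
print(high_income)
```

Columns are addressed by name rather than by position, which is what makes this kind of grouping and filtering readable.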

#### Get started

To get started with Pandas, follow the official [Getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html).

#### Exercises

For practice, follow these Kaggle courses:

- Learn Pandas: https://www.kaggle.com/learn/pandas
- Data Cleaning: https://www.kaggle.com/learn/data-cleaning
- Feature Engineering: https://www.kaggle.com/learn/feature-engineering

#### Mindmap (Cheat Sheet)

You may also look at the following Pandas Mind Map:

- https://xmind.ai/share/ugVH30g4

### 2. Seaborn

**Matplotlib**, the plotting library underlying Seaborn, lets you:

- Create publication quality plots.
- Make interactive figures that can zoom, pan, update.
- Customize visual style and layout.
- Export to many file formats.
- Embed in JupyterLab and Graphical User Interfaces.
- Use a rich array of third-party packages built on Matplotlib.

**Seaborn** is a library for making statistical graphics in Python. It builds on top of **matplotlib** and integrates closely with **pandas** data structures.
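A minimal sketch of that integration is shown below: Seaborn takes a pandas `DataFrame` and column names, and returns a Matplotlib `Axes` that can be customized further. The commute times are the ones from the Prediction section; the column names, the title, and the use of the non-interactive `Agg` backend are assumptions made so the example runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened

import pandas as pd
import seaborn as sns

# Commute times from the Prediction section, as a labeled table.
commutes = pd.DataFrame({
    "day": ["Sun", "Mon", "Tue", "Wed"],
    "minutes": [50, 48, 52, 49],
})

# Seaborn draws from the DataFrame using column names.
ax = sns.barplot(data=commutes, x="day", y="minutes")

# The returned object is a Matplotlib Axes, so Matplotlib methods apply.
ax.set_title("Commute time by day")

# Export the figure as PNG into an in-memory buffer.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```

To write the image to disk instead, pass a file name such as `"commute.png"` to `savefig`.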

- Seaborn Intro: https://seaborn.pydata.org/tutorial/introduction.html
- Matplotlib (low-level) user guide: https://matplotlib.org/stable/users/index.html#users-guide-index

Kaggle Courses:

- Data Visualization: https://www.kaggle.com/learn/data-visualization

## B. Statistical Packages

**Statsmodels**: statistical models, hypothesis tests, and data exploration.
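As a hedged illustration, the sketch below uses Statsmodels to fit an ordinary least squares model to the "additional bedroom" question from the Inferential Statistics section. The house prices are made up, and the column names and the simple one-variable model are assumptions, not a recommended analysis.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical houses: number of bedrooms and price in thousands (made-up values).
houses = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 3, 4, 4, 5],
    "price": [150, 195, 210, 240, 260, 300, 310, 345],
})

# Ordinary least squares: price as a linear function of bedrooms.
model = smf.ols("price ~ bedrooms", data=houses).fit()

# The slope estimates how much the price changes per additional bedroom.
price_per_bedroom = model.params["bedrooms"]
slope_p_value = model.pvalues["bedrooms"]

print(f"Estimated change in price per bedroom: {price_per_bedroom:.1f} thousand")
print(f"p-value for the slope: {slope_p_value:.4f}")
```

Calling `model.summary()` prints a fuller report, including confidence intervals for each coefficient.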

## C. Machine Learning (Pattern Recognition)

**Scikit-learn**:

- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
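A minimal scikit-learn sketch of the fit-then-predict pattern follows. The lot sizes and prices are made up and deliberately lie on a perfect straight line so the result is easy to check; real data would be noisier, and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data (made-up values on an exact straight line):
# lot size in hundreds of square meters, price in thousands.
lot_size = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per row
price = np.array([120.0, 140.0, 160.0, 180.0])

# Learn the pattern from known examples.
model = LinearRegression()
model.fit(lot_size, price)

# Predict the missing value for a lot size that was not in the data.
predicted_price = model.predict(np.array([[5.0]]))[0]

print(f"Predicted price: {predicted_price:.1f} thousand")
```

Most scikit-learn estimators follow the same `fit` and `predict` interface, so swapping in a different model usually changes only the line that creates it.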

**PyCaret**: an alternative low-code library that can replace many lines of code with just a few. This can make experiments much faster to set up and run. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, Optuna, Hyperopt, Ray, and many more.