# What is Data Science?

**Data** are descriptions of the world around us, collected through observation and stored on computers.

Data can be numbers, words, images, sounds, etc.

**Data Science** is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines principles and techniques from statistics, computer science, and domain-specific knowledge to analyze, interpret, and leverage data for decision-making and predictive analytics.

- **Statistics** is essential because it studies how to draw robust conclusions from incomplete information.
- **Computer science** is essential because data are stored on computers and analyzed by algorithms.
- **Domain knowledge** is essential for asking the right questions and for understanding the answers produced by computational tools.

<img src="https://www.researchgate.net/publication/365946272/figure/fig1/AS:11431281104229617@1669985347028/Data-Science-Venn-Diagram.png" width="400">

Image Source: https://www.researchgate.net/figure/Data-Science-Venn-Diagram_fig1_365946272

## Processes of Data Science

Data science involves three processes:

1) **Exploration**: summarize facts and identify patterns in samples. It involves:
    - *Descriptive Statistics*
    - *Visualization*

2) **Inference**: draw conclusions about a population based on samples of data taken from that population. It involves:
    - *Estimation*
    - *Comparison*
    - *Relationship*

3) **Prediction**: fill in the missing values based on other values. It involves:
    - *Machine Learning*
    - *Optimization*
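
As a rough illustration of the third process, the sketch below fills in missing values from other values by fitting a straight line. The data, the column names, and the choice of a linear fit are all assumptions made for this example; real prediction tasks usually need more careful modeling.

```python
import numpy as np
import pandas as pd

# Hypothetical data: house size and price; the last two prices are unknown.
df = pd.DataFrame({
    "size": [50.0, 60.0, 80.0, 100.0, 120.0, 70.0, 110.0],
    "price": [150.0, 180.0, 240.0, 300.0, 360.0, np.nan, np.nan],
})

# Use only the rows where the price is known to fit the model.
known = df.dropna(subset=["price"])

# Fit price = slope * size + intercept by least squares.
slope, intercept = np.polyfit(known["size"], known["price"], deg=1)

# Keep observed prices; fill the missing ones with the fitted line.
df["price_filled"] = df["price"].fillna(slope * df["size"] + intercept)
print(df)
```

The observed prices are left untouched; only the gaps are replaced by model output, which keeps it clear which values are data and which are predictions.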

## John Snow and the Broad Street Pump

In the 1850s, London faced severe cholera outbreaks. The prevailing theory was that "miasmas" (bad smells from decaying matter) caused the disease. Dr. John Snow doubted this and observed that while cholera wiped out entire households, neighboring houses remained unaffected despite sharing the same air. Snow also noted that cholera victims suffered from vomiting and diarrhea, suggesting water contamination as the culprit.

In August 1854, when cholera hit Soho, Snow mapped the location of cholera deaths and found that many victims lived near the Broad Street pump.

- He also noted that the deaths in houses nearer the Rupert Street pump were among residents who found the Broad Street pump more convenient to reach.
- No deaths occurred at the Lion Brewery, where workers drank only brewed beer and water from their own well.
- Deaths in distant houses involved children who drank from the Broad Street pump on their way to school.

Snow's observations led him to conclude that the Broad Street pump was the source of the cholera outbreak. He convinced local authorities to remove the pump handle, and sure enough, the outbreak subsided, preventing further deaths.

[Source: data8](https://inferentialthinking.com/chapters/02/1/observation-and-visualization-john-snow-and-the-broad-street-pump.html)

# Statistics

**Statistics** is a tool that applies to many fields, including business, finance, economics, biology, sociology, psychology, education, public health, and sports.

There are two major fields of statistics:

1. Descriptive statistics
2. Inferential statistics

### Descriptive Statistics

***Descriptive statistics*** summarizes qualities of a group (of people or things) numerically and visually.

Numerical summaries are divided into these main categories:

1. **Frequency** (proportion)
1. **Measures of Central Tendency** (mean, median, mode)
1. **Measures of Dispersion** (range, variance, standard deviation)
1. **Measures of Shape** (skewness, kurtosis)
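
The four categories above can be computed with pandas, as in the minimal sketch below. The exam scores are made-up values chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample: exam scores for ten students.
scores = pd.Series([55, 60, 60, 65, 70, 70, 70, 80, 85, 95], name="score")

# Frequency: proportion of observations at each distinct value.
proportions = scores.value_counts(normalize=True)

summary = {
    # Measures of central tendency
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().tolist(),  # a sample can have more than one mode
    # Measures of dispersion
    "range": scores.max() - scores.min(),
    "variance": scores.var(),        # sample variance (ddof=1) by default
    "std": scores.std(),
    # Measures of shape
    "skewness": scores.skew(),
    "kurtosis": scores.kurt(),       # excess kurtosis (normal distribution = 0)
}

print(proportions)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For a quick overview, `scores.describe()` reports several of these values at once.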

Visual summaries use graphical representations such as:
- Histogram
- Bar chart
- Box plot
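
The three chart types listed above might be drawn with matplotlib roughly as follows. The small data set and its column names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one categorical and one numeric column.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "score": [55, 60, 60, 65, 70, 70, 70, 80, 85, 95],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of a numeric variable.
axes[0].hist(df["score"], bins=5)
axes[0].set_title("Histogram")

# Bar chart: count of observations per category.
counts = df["group"].value_counts().sort_index()
axes[1].bar(counts.index, counts.values)
axes[1].set_title("Bar chart")

# Box plot: median, quartiles, and spread of a numeric variable.
axes[2].boxplot(df["score"])
axes[2].set_title("Box plot")

fig.tight_layout()
# In a notebook the figure renders automatically; in a script, call plt.show().
```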

A humorous example (Parkinson's Law):

<img src="https://baba-blog.com/wp-content/uploads/2022/08/parkinsons-law-plotted-on-a-graph..jpg" width="500">

Image Source: baba-blog.com

### Inferential Statistics

***Inferential Statistics*** draws conclusions about a population based on samples of data taken from that population:

1. ***Estimation***: make an informed guess about population parameters. Examples:
    - How many people in the population are obese?
    - What is the median income of households in region X?
2. ***Comparison***: determine whether observed differences are actually caused by some variable(s) or are just due to random chance. Examples:
    - What is the impact of a new drug on the recovery time of patients?
    - What is the impact of a new teaching method on student performance?
3. ***Relationships***: quantify the magnitude of the relationship between two variables, which makes “what if” analysis possible. Examples:
    - How much more can a house sell for with an additional bedroom?
    - What is the impact of lot size on housing price?
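
The sketch below shows one possible technique for each of the three tasks: a bootstrap interval for estimation, a permutation test for comparison, and a least-squares slope for relationships. All data are synthetic, and the distributions, sample sizes, and effect sizes are assumptions chosen for illustration, not real measurements. Other methods (for example t-tests or regression with standard errors) are equally valid choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimation: bootstrap 95% interval for a population median.
incomes = rng.lognormal(mean=10.5, sigma=0.5, size=200)  # synthetic sample
boot_medians = np.array([
    np.median(rng.choice(incomes, size=incomes.size, replace=True))
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])

# Comparison: permutation test for a difference in mean recovery time.
control = rng.normal(loc=10.0, scale=2.0, size=50)  # synthetic, days
treated = rng.normal(loc=9.0, scale=2.0, size=50)   # synthetic, days
observed = control.mean() - treated.mean()
pooled = np.concatenate([control, treated])
n = control.size
diffs = np.empty(5000)
for i in range(diffs.size):
    shuffled = rng.permutation(pooled)  # break any link between group and outcome
    diffs[i] = shuffled[:n].mean() - shuffled[n:].mean()
# Two-sided p-value: share of shuffles at least as extreme as the observed gap.
p_value = np.mean(np.abs(diffs) >= abs(observed))

# Relationship: least-squares slope of price on number of bedrooms.
bedrooms = rng.integers(1, 6, size=100)                    # values 1 to 5
price = 100 + 40 * bedrooms + rng.normal(0, 10, size=100)  # synthetic prices
slope, intercept = np.polyfit(bedrooms, price, deg=1)

print(f"median income 95% interval: ({ci_low:.0f}, {ci_high:.0f})")
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
print(f"estimated price change per extra bedroom: {slope:.1f}")
```

The slope answers the “what if” question directly: it estimates how much the price changes when the number of bedrooms increases by one, under the assumption that the relationship is linear.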

# Pandas

**Pandas** provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
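
A minimal sketch of what working with labeled, relational-style data looks like in practice. The tables and column names below are invented for the example.

```python
import pandas as pd

# Hypothetical sales table with labeled columns.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
        "year": [2022, 2023, 2022, 2023],
        "sales": [100, 120, 80, 90],
    }
)

# Filter rows with a boolean mask built from a labeled column.
recent = df[df["year"] == 2023]

# Aggregate like a SQL GROUP BY: total sales per city.
totals = df.groupby("city")["sales"].sum()

# Join against a second (hypothetical) table, like a SQL LEFT JOIN.
regions = pd.DataFrame({"city": ["Oslo", "Bergen"], "region": ["East", "West"]})
merged = df.merge(regions, on="city", how="left")

print(recent)
print(totals)
print(merged)
```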

### Get started

To get started with Pandas, follow the official [Getting started tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html).

### Exercises

To practice, work through these Kaggle courses:

- Learn Pandas: https://www.kaggle.com/learn/pandas
- Data Cleaning: https://www.kaggle.com/learn/data-cleaning
- Feature Engineering: https://www.kaggle.com/learn/feature-engineering

### Mindmap

You may also look at the following Pandas Mind Map:

- https://xmind.ai/share/ugVH30g4

# Seaborn and Matplotlib

**Seaborn** is a library for making statistical graphics in Python. It builds on top of **matplotlib** and integrates closely with **pandas** data structures.
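
The short sketch below illustrates both points: seaborn reads columns by name from a pandas `DataFrame`, and it draws onto an ordinary matplotlib `Axes`. The data and labels are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data in "long" form: one row per observation.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "score": [55, 60, 60, 65, 70, 70, 70, 80, 85, 95],
})

# Create a matplotlib figure and axes, then let seaborn draw on the axes.
fig, ax = plt.subplots(figsize=(5, 3.5))
sns.boxplot(data=df, x="group", y="score", ax=ax)

# Because ax is a regular matplotlib Axes, matplotlib methods still apply.
ax.set_title("Score by group")
ax.set_ylabel("Score (points)")
fig.tight_layout()
```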

- Seaborn Intro: https://seaborn.pydata.org/tutorial/introduction.html
- Matplotlib (low-level) user guide: https://matplotlib.org/stable/users/index.html#users-guide-index

Kaggle Courses:

- Data Visualization: https://www.kaggle.com/learn/data-visualization

# NumPy

- What is NumPy: https://numpy.org/doc/stable/user/whatisnumpy.html
- NumPy Quickstart: https://numpy.org/doc/stable/user/quickstart.html
- Absolute Beginners (with visuals): https://numpy.org/doc/stable/user/absolute_beginners.html
- NumPy Array Creation: https://numpy.org/doc/stable/user/basics.creation.html
    - Routines: https://numpy.org/doc/stable/reference/routines.html
- NumPy Broadcasting: https://numpy.org/doc/stable/user/basics.broadcasting.html
- Copies vs Views: https://numpy.org/doc/stable/user/basics.copies.html
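- Performance (run in IPython or Jupyter, since `%timeit` is a magic command):
    ```python
    import numpy as np
    arr = np.arange(1_000_000)  # example array; the size is arbitrary
    pylist = arr.tolist()       # the same values as a plain Python list
    %timeit arr.min()           # vectorized NumPy reduction
    %timeit min(pylist)         # built-in min over a Python list
    ```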
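
The topics linked above (array creation, broadcasting, and copies versus views) can be tried out with a small sketch like the following. The array values are arbitrary.

```python
import numpy as np

# Array creation: a 2x3 array holding 0..5.
a = np.arange(6).reshape(2, 3)

# Column means have shape (3,).
col_means = a.mean(axis=0)

# Broadcasting: the (3,) vector is applied to each row of the (2, 3) array.
centered = a - col_means

# Basic slicing returns a view that shares memory with the original array.
view = a[:, 0]
view[0] = 99              # this also changes a[0, 0]

# .copy() returns independent data, so the original is not affected.
col_copy = a[:, 1].copy()
col_copy[0] = -1          # a[0, 1] keeps its original value

print(centered)
print(a)
print(np.shares_memory(a, view), np.shares_memory(a, col_copy))
```

Checking `np.shares_memory` is a handy way to confirm whether an operation produced a view or a copy before modifying the result in place.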