# Introduction to Data Science

Foundations & Tools of Data Science.

## Story: John Snow and the Broad Street Pump

In the 1850s, London faced severe cholera outbreaks. The prevailing theory was that "miasmas" (bad smells from decaying matter) caused the disease. Dr. John Snow doubted this and observed that while cholera wiped out entire households, neighboring houses remained unaffected despite sharing the same air. Snow also noted that cholera victims suffered from vomiting and diarrhea, suggesting water contamination as the culprit.

In August 1854, when cholera hit Soho, Snow mapped the location of cholera deaths and found that many victims lived near the Broad Street pump.

- He also noted that deaths near the Rupert Street pump were from residents who preferred using the more convenient Broad Street pump.
- No deaths occurred at the Lion Brewery, where workers drank only brewed beer and water from their own well.
- Deaths in distant houses involved children who drank from the Broad Street pump on their way to school.

Snow's observations led him to conclude that the Broad Street pump was the source of the cholera outbreak. He convinced local authorities to remove the pump handle, and sure enough, the outbreak subsided, preventing further deaths.

![Death Counts Mapped in the neighborhood](../assets/snows-mapped-death-frequency.png)

Source: [Data 8](https://inferentialthinking.com/chapters/02/1/observation-and-visualization-john-snow-and-the-broad-street-pump.html)

## Parkinson's Law

Individual observations are grouped together to form a sample, from which one can draw conclusions about the population.

Example: **Parkinson's Law** states that "Work expands to fill the time available for its completion."

![](../assets/parkinsons-law.png)

Image Source: consuunt.com

# Statistics

**Statistics** is a tool that applies to many fields, including business, finance, economics, biology, sociology, psychology, education, public health, and sports.

- **Snow** applied statistics to trace the root cause of the cholera outbreak: by analyzing data on the deaths, he made a case that was scientifically sound and convincing to the authorities.
- **Parkinson's Law** is a humorous example of how statistical plots can express a joke or a real observation.

There are two major fields of statistics:

1. Descriptive statistics
2. Inferential statistics

### Descriptive statistics

***Descriptive statistics*** summarizes qualities of a group (of people or things) numerically and visually.

Visual summaries use graphical representations such as:

- Histogram
- Bar chart
- Box plot

![](../assets/data-viz-examples.png)

Example: In Snow's case, the data are the locations of the cholera deaths. The frequency is the number of deaths at each location. A measure of central tendency is the location with the most deaths (the mode). A measure of dispersion is the range of locations over which the deaths were spread.
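
A minimal sketch of these summaries in pandas. The records, the `street` column, and the `distance_to_pump_m` column are made up for illustration; they are not Snow's actual data.

```python
import pandas as pd

# Hypothetical records, one row per death. These are NOT Snow's actual data.
deaths = pd.DataFrame({
    "street": ["Broad St", "Broad St", "Rupert St", "Broad St", "Broad St", "Poland St"],
    "distance_to_pump_m": [10, 25, 300, 40, 5, 150],
})

# Frequency: number of deaths recorded at each location.
frequency = deaths["street"].value_counts()

# Central tendency for a categorical variable: the mode (most frequent location).
most_affected = deaths["street"].mode()[0]

# Dispersion: how far apart the deaths were spread (range of distances).
distance_range = deaths["distance_to_pump_m"].max() - deaths["distance_to_pump_m"].min()

print(frequency)
print("Most affected location:", most_affected)
print("Range of distances (m):", distance_range)
```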

### Inferential Statistics

***Inferential Statistics*** goes beyond description and into:

1. making an informed guess about a group of people or items
2. verifying claims through data; e.g., whether coffee affects sleep

<img src="https://datatab.net/assets/tutorial/Descriptive_statistics_and_inferential_statistics.png">

Source: Datatab.
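
One way to verify a claim such as "coffee drinkers sleep less" is a permutation test. The sketch below uses made-up sleep measurements and a fixed random seed; the sample values, group sizes, and number of shuffles are all assumptions chosen for illustration.

```python
import numpy as np

# Made-up hours of sleep for two small samples (illustrative only).
coffee = np.array([6.1, 5.8, 6.5, 5.9, 6.0, 6.3])
no_coffee = np.array([7.2, 6.9, 7.5, 6.8, 7.1, 7.0])

# Observed difference in mean sleep between the two groups.
observed = no_coffee.mean() - coffee.mean()

# If coffee made no difference, the group labels would be interchangeable.
# Shuffle the pooled values many times and see how often a difference
# at least as large as the observed one arises by chance.
rng = np.random.default_rng(0)
pooled = np.concatenate([coffee, no_coffee])
n_shuffles = 5_000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    diff = shuffled[len(coffee):].mean() - shuffled[:len(coffee)].mean()
    if diff >= observed - 1e-12:  # small tolerance for floating-point rounding
        count += 1

p_value = count / n_shuffles
print(f"Observed difference: {observed:.2f} hours, p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone, under the assumptions of this toy setup.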

# What is Data Science?

A **Data Point** is an **Observation**. When collected, data describe the phenomenon we are analyzing. But, first, we store them on computers.

- Data can be numbers, words, images, sounds, etc.
- Data also falls into types.

There is a lot to say about data; hence, we have Data Science.

> **Data Science** is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines principles and techniques from statistics, computer science, and domain-specific knowledge to analyze, interpret, and leverage data for decision-making and predictive analytics.

- **Domain Knowledge** is essential for asking the right questions and for understanding the answers produced by computational tools.
- **Math & Statistics** studies how to make robust conclusions based on incomplete information.
- **Computer Science** is needed since data are stored on computers and processed by algorithms.

![](../assets/Data-Science-Venn-Diagram.png)

Image Source: https://www.researchgate.net/figure/Data-Science-Venn-Diagram_fig1_365946272

How did Snow do it?

- **Domain Knowledge**: Snow was a doctor.
- **Math & Statistics**: Snow used data, plotted it, and drew conclusions.
- **Computer Science**: Snow did not have computers; his data were stored on paper, and he did not need to do much computation.

Computers help us deal with large amounts of data, in terms of both speed and storage.

# Data Analysis

> **Data analysis** is the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Two approaches to data analysis:

1. **Top-Down** (Confirmatory - CDA)
    - Start with a question and use data to answer it.
2. **Bottom-Up** (Exploratory - EDA)
    - Start with the data and try to find something interesting.

## The 4 Types of Data Analytics

1. **Descriptive Analytics** (What happened?): This is the foundation of business reporting. It uses historical data to summarize performance.
   - Example: A retail monthly sales report showing which regions met their quotas.

2. **Diagnostic Analytics** (Why did it happen?): This involves "drilling down" into the data to find dependencies and causes.
   - Example: Investigating why sales dropped in May and discovering it was due to a specific supply chain bottleneck.

3. **Predictive Analytics** (What will happen?): This uses statistical models and machine learning to forecast future trends.
   - Example: A bank estimating the likelihood of a customer defaulting on a loan based on their credit history.

4. **Prescriptive Analytics** (How can we make it happen?): The most advanced stage, where the analysis recommends specific actions to achieve an optimal outcome.
   - Example: An airline's pricing algorithm automatically adjusting ticket prices in real time based on demand and weather patterns.

# Key Steps in Data Analysis

## 1. Data Wrangling

### 1.1 Data Loading

Gather relevant data from sources:

1. Files (csv, excel, txt, json, xml, pdf, etc.)
2. Surveys (Google Forms, SurveyMonkey)
3. Web Scraping (BeautifulSoup, Scrapy, Selenium)
4. APIs (Twitter, Facebook, Google Maps, etc.)
5. Databases (SQL, NoSQL)
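
As a minimal sketch of loading data from files with pandas: the CSV and JSON contents below are made up and held in memory so the example is self-contained; in practice you would pass a file path such as the hypothetical `sales.csv`.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = """region,month,sales
North,2024-01,120
South,2024-01,95
North,2024-02,130
"""
# With a real file: pd.read_csv("sales.csv")  (hypothetical path)
sales = pd.read_csv(io.StringIO(csv_text))

# Made-up JSON content: a list of records.
json_text = '[{"region": "North", "manager": "A"}, {"region": "South", "manager": "B"}]'
managers = pd.read_json(io.StringIO(json_text))

print(sales.head())
print(managers.head())
```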

### 1.2 Data Cleaning

Prepare data for analysis:

 1. Handle missing values
     1. Remove rows with missing values
     2. Impute missing values
         - Mean, Median, Mode
         - Forward fill, Backward fill
         - Interpolation
     3. Drop columns
 2. Remove duplicates
 3. Correct errors (invalid values)
 4. Standardize feature names
 5. Remove irrelevant and redundant features
 6. Standardize data:
     1. Convert data types (e.g., `str -> datetime`)
     2. Resolve inconsistencies (e.g., `Female` and `F` both present in `gender` column)
     3. Use one format for dates, phone numbers, etc.


## 2. Data Transformation

### 2.1 Scaling & Outlier Treatment

1. Scale the data:
    1. [**Normalize**](../techniques/02_transformation/normalize.ipynb)
        - Log Transformation
    2. [**Standardize**](../techniques/02_transformation/feature_scaling.ipynb)
        - Z-score
        - Min-max Scaling
        - Robust Scaling
2. [**Handle Outliers**](./L05_outliers.ipynb)
    - Z-score method
    - IQR method
    - 95th percentile
    - 99th percentile
    - Domain knowledge
3. Handle data imbalance
4. Encode categorical variables (e.g., one-hot encoding)
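
The transformations above can be sketched with pandas and NumPy. The income values are made up, and the 1.5 × IQR cutoff is the conventional rule of thumb rather than a fixed requirement.

```python
import numpy as np
import pandas as pd

# Made-up incomes (in thousands); 300 is deliberately extreme.
income = pd.Series([30, 32, 35, 38, 40, 42, 45, 300], name="income")

# Log transformation: compresses the long right tail.
log_income = np.log1p(income)

# Z-score: mean 0, standard deviation 1.
z_score = (income - income.mean()) / income.std()

# Min-max scaling: rescales values to the [0, 1] interval.
min_max = (income - income.min()) / (income.max() - income.min())

# Robust scaling: center on the median, divide by the interquartile range.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
robust = (income - income.median()) / iqr

# IQR method: flag values more than 1.5 * IQR beyond the quartiles.
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)
print(income[is_outlier].tolist())

# One-hot encoding of a categorical variable.
colors = pd.Series(["red", "blue", "red"], name="color")
one_hot = pd.get_dummies(colors)
print(one_hot)
```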

### 2.2 [Feature Engineering](../techniques/02_transformation/feature_engineering.ipynb)

Based on domain knowledge and questions to be answered:

1. Create new variables
    1. `age` from `date_of_birth`
    2. `BMI` from `weight` and `height`
    3. `season` from `date`
2. *Data Enrichment* - Add new data from external sources
    1. Geocoding (convert address to latitude and longitude using **Google Maps API**)
    2. Sentiment analysis (analyze text data using **OpenAI API**)
3. *Binning*
    1. `age_group` from `age`
    2. `income_group` from `income`
    3. `weight_category` from `weight`
4. Handle date-time variables
    1. Extract year, month, day, day of week, etc.
    2. Time since last purchase
    3. Time since first visit
5. Aggregate data
    1. `total_sales` from `sales` table
    2. `orders_count` from `orders` table
    3. `maximum_amount` from `transactions` table
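
A minimal sketch of a few of these features in pandas. The tables, the fixed reference date `as_of`, and the age-group boundaries are assumptions chosen for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1990-06-15", "2005-01-20", "1958-11-02"]),
    "weight": [70.0, 55.0, 82.0],  # kg
    "height": [1.75, 1.60, 1.68],  # m
})

# Fixed reference date so the result is reproducible (assumption).
as_of = pd.Timestamp("2024-01-01")
dob = people["date_of_birth"]

# age from date_of_birth: completed years as of the reference date.
before_birthday = (as_of.month < dob.dt.month) | (
    (as_of.month == dob.dt.month) & (as_of.day < dob.dt.day)
)
people["age"] = as_of.year - dob.dt.year - before_birthday.astype(int)

# BMI from weight and height.
people["bmi"] = people["weight"] / people["height"] ** 2

# Binning: age_group from age (boundaries are illustrative).
people["age_group"] = pd.cut(
    people["age"], bins=[0, 17, 64, 120], labels=["child", "adult", "senior"]
)

# Date-time parts: extract the birth month.
people["birth_month"] = dob.dt.month

# Aggregation: per-customer totals from a made-up orders table.
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [20.0, 35.0, 50.0]})
per_customer = orders.groupby("customer_id")["amount"].agg(
    total_sales="sum", orders_count="count", maximum_amount="max"
)

print(people)
print(per_customer)
```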

### 2.3 [Feature Selection](../techniques/02_transformation/feature_selection.ipynb)

1. Select Features
    1. Correlation Coefficient
    2. Mutual information
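
Selection by correlation coefficient can be sketched as below. The data are made up, and the 0.5 threshold is an arbitrary assumption; mutual information is not shown here.

```python
import pandas as pd

# Made-up data: one feature related to the target, one unrelated.
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [42, 38, 44, 39, 41, 40],
    "exam_score": [52, 58, 61, 70, 74, 79],
})

# Absolute Pearson correlation of each feature with the target.
correlations = (
    data.corr()["exam_score"]
    .drop("exam_score")
    .abs()
    .sort_values(ascending=False)
)

# Keep features whose correlation exceeds the (arbitrary) threshold.
threshold = 0.5
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print("Selected features:", selected)
```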

## 3. [Analysis, Modeling & Interpretation](L04_statistics.ipynb)

- Descriptive Analytics: what happened?
- Diagnostic Analytics: why did it happen?
- Predictive Analytics: what will happen?
- Prescriptive Analytics: how can we make it happen?

### 3.1 Uni-variate Analysis

Understand the distribution of each variable.

1. *Descriptive Statistics*
    - Measure of central tendency (mean, median, mode)
    - Dispersion (range, variance, standard deviation)
    - Shape (skewness, kurtosis)
2. *Visualizations*
    - Histogram
    - Boxplot
    - Density plot
    - Violin plot
    - Bar plot
    - Pie chart
    - Frequency table
    - Word cloud
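
The descriptive statistics for a single variable can be computed directly on a pandas `Series`. The ages below are made up for illustration.

```python
import pandas as pd

# Made-up ages; the value 60 creates a long right tail.
ages = pd.Series([22, 25, 25, 28, 31, 35, 60], name="age")

summary = {
    # Central tendency
    "mean": ages.mean(),
    "median": ages.median(),
    "mode": ages.mode().tolist(),
    # Dispersion
    "range": ages.max() - ages.min(),
    "variance": ages.var(),
    "std": ages.std(),
    # Shape
    "skewness": ages.skew(),
    "kurtosis": ages.kurt(),
}

for name, value in summary.items():
    print(f"{name}: {value}")

# Frequency table: how often each value occurs.
print(ages.value_counts().sort_index())
```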



### 3.2 Multi-variate Analysis

Understand the relationship between multiple variables.

1. *Descriptive Statistics*
   - Covariance
   - Correlation
2. *Visualizations*
   - Scatter plot
   - Line plot
   - Heatmap
   - Pairplot
   - Boxplot
   - Violin plot
   - Bar plot
   - Stacked bar plot
   - Grouped bar plot
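
Covariance and correlation between variables are available as `DataFrame` methods in pandas. The measurements below are made up for illustration.

```python
import pandas as pd

# Made-up heights and weights for five people.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 88],
})

# Covariance matrix: direction of the joint variation, in the variables' units.
cov = df.cov()

# Correlation matrix: the same relationship rescaled to the range [-1, 1].
corr = df.corr()

print(cov)
print(corr)
```

A correlation matrix like `corr` is also the usual input for a heatmap.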


### 3.3 Modeling

1. Make **inference** about population from sample
2. **Quantify relationships** (regression)
3. **Hypothesis testing**
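
Quantifying a relationship can be sketched with a least-squares line. This example uses NumPy's `polyfit` on made-up data to stay lightweight; a statistical modeling library would additionally report standard errors and test results.

```python
import numpy as np

# Made-up data: hours studied (x) and a score (y) that rises roughly linearly.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted line to predict y for an unseen x.
prediction = slope * 6 + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction at x=6: {prediction:.2f}")
```

The slope estimates how much `y` changes, on average, for each one-unit increase in `x`.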

## 4. Communication

1. *Reports*: Jupyter Notebook
2. *Dashboards*: Tableau, Power BI, Looker Studio (formerly Google Data Studio)

# Python Packages for Data Analysis


Python tools for data science are built on top of the following fundamental packages/libraries:

- **NumPy**: The fundamental **package** for scientific computing with Python.
- **SciPy**: Fundamental **algorithms** for scientific computing in Python.
- **Matplotlib**: A comprehensive library for creating static, animated, and interactive **visualizations** in Python.

Such Python libraries use **C** underneath to achieve high performance, yet expose it through a simple Pythonic interface (API), as shown in the image below for NumPy:

![Numpy Languages](../assets/numpy-languages.png)
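
To illustrate the Pythonic interface, the sketch below computes the same sum of squares twice: once with an explicit Python loop and once with a NumPy array expression, where the element-wise work runs in compiled code. The numbers are arbitrary.

```python
import numpy as np

values = list(range(1, 1001))

# Pure Python: the interpreter executes one multiplication per loop iteration.
loop_total = 0
for v in values:
    loop_total += v * v

# NumPy: one vectorized expression; the loop runs inside compiled code.
arr = np.array(values)
vector_total = int((arr * arr).sum())

print(loop_total, vector_total)
```

Both approaches give the same result; the vectorized form is shorter and typically much faster on large arrays.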

We will use the following libraries in this course:

1. **Pandas**: for data wrangling and analysis.
2. **Seaborn**: for data visualization.
3. **statsmodels**: for statistical modeling and inference.

![](../assets/intro/ds_stack.png)