# DSC 80 Review Session

## Overview
The goals of this course are to understand real-world data, clean it, and extract meaningful insight from it.
### Real world data types
* tabular data (csv file)
* unstructured text (create your own parser)
* json (geojson)
* web data (scrape) 
* time series data


### Tools
* numpy: numerical computation library for arrays and matrices; serves as the underlying data structure for pandas
* pandas: tabular data manipulation
* matplotlib and folium: visualization 
* regex: pattern matching on text
* requests and BeautifulSoup: scraping tools
* scikit-learn: modeling and analysis

### Data science lifecycle
* Ask questions, define metrics, form hypotheses
* Find data (if it doesn't already exist)
* Clean data into organized format for analysis
* Find anomalies, biases and simplify data (imputation, smoothing, etc.)
* Model data (build classifier, conduct A/B experiment or hypothesis testing)
* Evaluate results (and reiterate!)

## Pandas
* Know the difference between DataFrame and Series, how indexing works, and the functions associated with them
* Know which data type each column should have, how to convert between data types, and the properties of each dtype
    - object
    - int
    - datetime
    - float
* Group by and aggregate function
* Filter rows with conditions and merge (left_on, right_on, etc.)
* Write lambda functions (df.apply())
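
The operations listed above can be sketched on a pair of toy tables. The tables, column names and values here are hypothetical, invented only to illustrate dtype conversion, group by, filtering, merging and `apply`.

```python
import pandas as pd

# Hypothetical toy tables, invented for illustration only.
students = pd.DataFrame({
    "pid": ["A1", "A2", "A3", "A4"],
    "college": ["Muir", "Muir", "Warren", "Warren"],
    "score": ["90", "80", "60", "100"],  # stored as strings (object dtype)
})
advisors = pd.DataFrame({
    "college_name": ["Muir", "Warren"],
    "advisor": ["Kim", "Lee"],
})

# Convert the object column to an integer dtype before doing arithmetic.
students["score"] = students["score"].astype(int)

# Group by and aggregate: a Series of mean scores indexed by college.
mean_by_college = students.groupby("college")["score"].mean()

# Filter rows with a boolean condition.
high = students[students["score"] >= 80]

# Merge two tables whose key columns have different names.
merged = students.merge(advisors, left_on="college", right_on="college_name")

# Apply a lambda function to every row (axis=1).
merged["label"] = merged.apply(lambda row: f"{row['pid']}-{row['advisor']}", axis=1)
```

Note that `students["score"]` is a Series while `students[["score"]]` is a one-column DataFrame.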

## Data Cleaning
* Data types: quantitative, ordinal, nominal
* Unfaithful data (data that doesn't represent the data generating process being measured)
    - solution -> case specific but we can drop them or replace them
* Outliers (extreme values)
    - we can correct them if we know for sure they are wrong, smooth them out, or filter out unreasonable ones
* Handling NaNs -> fillna() or dropna() depending on the situation
* Understand the cause of missingness -> missing by design (MD), missing completely at random (MCAR), missing at random (MAR) or not missing at random (NMAR)
* Domain knowledge guides how to combine datasets: whether to concatenate or merge/join, along rows or columns, and on which key (e.g. College Scorecard dataset example)

## Smoothing
* Extract trend from data -> separate signal from noise
* Distribution -> likelihood that a given value x will occur. Can be quantitative or categorical
* Empirical distribution -> how frequently each value occurs in the observed data
    - Represented as histogram
    - To plot histogram we can put data into bins or percentiles
* Smoothing reduces the influence of extreme values such as outliers
* Noise causes more error in events that occur rarely -> motivation for additive smoothing
* Additive smoothing -> define alpha as how uncertain we are about the data collected.
    - $p_i = \frac{x_i + \alpha}{N + \alpha d}$, where $x_i$ is the count of category $i$, $N$ is the total number of observations and $d$ is the number of categories
    - As alpha grows, the result converges to the uniform distribution; as alpha shrinks to 0, it converges to the empirical distribution
* Rolling windows -> apply a function on a rolling basis of the data, generally used for time series
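
A minimal sketch of additive smoothing and a rolling window, assuming the counts and the short series below (both hypothetical). The function name `additive_smoothing` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

def additive_smoothing(counts, alpha):
    """Return p_i = (x_i + alpha) / (N + alpha * d) for each category."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()   # total number of observations
    d = len(counts)    # number of categories
    return (counts + alpha) / (N + alpha * d)

counts = [8, 2, 0]  # hypothetical category counts

empirical = additive_smoothing(counts, alpha=0)       # plain empirical distribution
smoothed = additive_smoothing(counts, alpha=1)        # the rare category gets nonzero mass
near_uniform = additive_smoothing(counts, alpha=1e6)  # very large alpha: close to uniform

# Rolling window: mean over each window of 3 consecutive values.
series = pd.Series([1, 2, 3, 4, 5])
rolling_mean = series.rolling(window=3).mean()  # first two entries are NaN
```

With `alpha=1` the category that was never observed gets probability 1/13 instead of 0.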

## Total Variation Distance
* Measure how similar two distributions are -> help us determine the type of missingness
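* The sum of the absolute differences between the proportions in $X$ and the corresponding proportions in $Y$, divided by two
    - $TVD = \frac{1}{2} \sum_i |x_i - y_i|$

* Absolute values because only the magnitude of each difference matters. We divide by 2 because any probability mass missing from one category must show up in another, so each discrepancy is counted twice; after halving, the TVD lies between 0 and 1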
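
A minimal sketch of the TVD computation, assuming both inputs are arrays of proportions over the same categories in the same order. The example distributions are hypothetical.

```python
import numpy as np

def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.abs(p - q).sum() / 2

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

tvd = total_variation_distance(p, q)                     # 0.6 / 2
tvd_same = total_variation_distance(p, p)                # identical distributions
tvd_disjoint = total_variation_distance([1, 0], [0, 1])  # no overlap at all
```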

## Imputation
* Imputation with a single value, e.g. the mean or median, reduces the variance of the data
    - Mean imputation -> preserves the mean of the observed data, e.g. when we know the missing values are MCAR
    - Group-wise mean imputation -> preserves the mean within each group, e.g. when the data is MAR
* Imputation by sampling from the distribution -> use the empirical distribution of observed values to sample fill values for fillna() when data is MCAR
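
The three imputation strategies above might look like this in pandas. This is a sketch on a hypothetical six-row table; the column names `group` and `value` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, np.nan, 10.0, np.nan, 20.0],
})

# Mean imputation: every NaN gets the overall observed mean.
mean_imputed = df["value"].fillna(df["value"].mean())

# Group-wise mean imputation: every NaN gets the mean of its own group.
group_imputed = df.groupby("group")["value"].transform(lambda s: s.fillna(s.mean()))

# Sampling imputation: draw fill values from the observed values.
rng = np.random.default_rng(0)
observed = df["value"].dropna()
n_missing = df["value"].isna().sum()
sampled = df["value"].copy()
sampled.loc[sampled.isna()] = rng.choice(observed.to_numpy(), size=n_missing)
```

Mean imputation fills both gaps with 8.5, while the group-wise version fills them with 2.0 and 15.0, which is the difference that matters when missingness depends on the group.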

## JSON
* key-value data structure -> dictionary like
* GeoJson
* Request/Response format
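
As a small illustration of the dictionary-like structure, the sketch below parses a GeoJSON-style string with the standard `json` module. The feature itself (name and coordinates) is hypothetical.

```python
import json

# A hypothetical GeoJSON-style feature, written as a JSON string.
text = (
    '{"type": "Feature",'
    ' "geometry": {"type": "Point", "coordinates": [-117.23, 32.88]},'
    ' "properties": {"name": "UCSD"}}'
)

feature = json.loads(text)  # JSON object -> Python dict

# Nested keys are accessed like nested dictionaries and lists.
lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: longitude, latitude
name = feature["properties"]["name"]

round_trip = json.dumps(feature)  # Python dict -> JSON string
```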

## HTTP
* HTTP request/response consist of header and body. Header has metadata and body has the content
* Difference between get and post requests
    - GET retrieves data; params are passed in the URL, which limits their length and makes them unsuitable for sensitive info
    - POST sends data; params are embedded in the request body, with no length limit, and can contain any data format
* HTTP status code tells us the result of the request e.g. 200 means ok
* To send a request, determine which API endpoints to call and what params are required
* Sending too many get requests may get you blocked 
    - Check robots.txt file regarding the scraping policy
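
To see where the parameters end up in each request type, the sketch below builds a GET and a POST request with `requests` without sending them. The endpoint URLs and parameters are hypothetical.

```python
import requests

# Hypothetical endpoints and parameters; nothing is sent over the network here.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "dsc80"}
).prepare()
post_req = requests.Request(
    "POST", "https://example.com/login", data={"user": "me"}
).prepare()

print(get_req.url)    # GET: the parameters are appended to the URL
print(get_req.body)   # GET: no body
print(post_req.url)   # POST: the URL carries no parameters
print(post_req.body)  # POST: the parameters travel in the request body
```

When actually sending a request, e.g. with `requests.get(url, params=...)`, the returned response object exposes the result code as `status_code`.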
    

## Parsing HTML
* HTML consists of nested tags that form a tree, the Document Object Model (DOM)
* Parse using BeautifulSoup 
* Traverse the objects to extract the information
    - Tree traversal e.g. DFS, BFS
    - Using attributes such as `.parent`, `.children`, `.next_sibling`
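
A minimal traversal sketch with BeautifulSoup on a hypothetical HTML snippet. The snippet has no whitespace between tags, because whitespace would otherwise show up as extra text nodes among the siblings and children.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written without whitespace between tags.
html = "<html><body><div id='main'><h1>Title</h1><p>First</p><p>Second</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")

parent_name = h1.parent.name                # tag that encloses the <h1>
sibling_text = h1.next_sibling.get_text()   # text of the node right after the <h1>
child_names = [child.name for child in soup.find("div").children]
paragraphs = [p.get_text() for p in soup.find_all("p")]
```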

## Cleaning on Unstructured Data 

- We've seen how to extract and collect data from different sources
- This data is in unstructured form 
- We need to give it structure to be able to work with it

## Canonicalization

- A sequence of steps that transforms both columns into a single, shared form
- Replace each string with a unique representation
- In the following image, we try to transform the 2 columns of `county`

<img src="image_1.png">


Cons:
- Used string methods
- Very brittle procedure; may only work for X% of the data.
- Hard to verify correctness.
- Also *parse* data using a data model if given the choice!
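
A sketch of canonicalization with pandas string methods. The county names and the specific cleaning steps are assumptions chosen for illustration and are not taken from the image above.

```python
import pandas as pd

def canonicalize(names):
    """Map each county string to a lowercase form with no spaces or punctuation."""
    return (
        names.str.lower()
             .str.replace("&", "and", regex=False)
             .str.replace("county", "", regex=False)
             .str.replace(".", "", regex=False)
             .str.replace(" ", "", regex=False)
    )

# Hypothetical spellings of the same counties in two different tables.
left = pd.Series(["De Witt County", "Lac qui Parle County", "Lewis and Clark County"])
right = pd.Series(["DeWitt", "Lac Qui Parle", "Lewis & Clark"])

left_canon = canonicalize(left)
right_canon = canonicalize(right)
```

Both columns now share one representation and could be used as a merge key, but each rule only covers the cases it was written for, which is why the procedure is brittle.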

## Regular Expressions (`regexp`)

* Fast, compact way of matching patterns in text
* Python library: `import re`


* Pros: powerful; capable of matching very complex patterns.
* Cons: 
    - It's still text processing, so brittle and likely to break.
    - Hard to understand: "write-only" language.

### Regexp Expressions
* Parsing the expression:
```
'\[([0-9]{2}\/[A-Z]{1}[a-z]{2}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} -[0-9]{4})\]'
```

* `[0-9]{2}` matches any 2-digit number.
* `[A-Z]{1}` matches any single occurrence of any upper-case letter.
* `[a-z]{2}` matches any 2 consecutive occurrences of lower-case letters.
* Certain special characters (`[`, `]`) need to be escaped with `\`. Python's `re` does not require escaping `/`, but doing so is harmless
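
The expression above can be tried on a web-server log line. The log line here is hypothetical; the pattern is the one shown above, written as a raw string.

```python
import re

pattern = r'\[([0-9]{2}\/[A-Z]{1}[a-z]{2}\/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2} -[0-9]{4})\]'

# Hypothetical log line, for illustration only.
line = '169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /index.html HTTP/1.1" 200 2585'

match = re.search(pattern, line)

# The parentheses capture everything between the literal square brackets.
timestamp = match.group(1)
print(timestamp)  # 26/Jan/2014:10:47:58 -0800
```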

### Basic Patterns
The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

- **a, X, 9, <** -- ordinary characters just match themselves exactly
- The meta-characters which do not match themselves because they have special meanings are: **. ^ $ * + ? { [ ] \ | ( )** (details below)

-  **. (a period)** -- matches any single character except newline '\n'
- **\w** -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. **Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word**. \W (upper case W) matches any non-word character.

- **\s** -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

- **\t, \n, \r** -- tab, newline, return

- **\d** -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

- **^ = start, $ = end** -- match the start or end of the string

- **\** -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a punctuation character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character. (Do not do this with letters or digits, where the backslash may create a special sequence such as \d.)

### Python `re` library functions
* `re.search`:
    - `m = re.search(r, s); m.groups()`
* `re.findall`
* `re.sub`

Also in Pandas!
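
A short sketch of the three functions on a hypothetical string, followed by the pandas counterpart `Series.str.contains`. The phone-number pattern is an assumption made for the example.

```python
import re
import pandas as pd

s = "Call 858-555-0100 or 619-555-0199 before 5pm."  # hypothetical text
pattern = r"\d{3}-\d{3}-\d{4}"

# re.search returns the first match as a match object (or None if no match).
first = re.search(pattern, s)
first_text = first.group(0)

# re.findall returns all non-overlapping matches as a list of strings.
all_numbers = re.findall(pattern, s)

# re.sub replaces every match with the given replacement.
redacted = re.sub(pattern, "XXX-XXX-XXXX", s)

# The same pattern applied element-wise to a pandas Series.
has_number = pd.Series([s, "no phone here"]).str.contains(pattern)
```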

### Regexp Groups (Briefly)
* Use `(` regex `)` to *extract* text from a string.
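
A minimal sketch of groups, assuming a hypothetical date format `YYYY-MM-DD`. Each pair of parentheses becomes one extracted piece, both in `re` and in pandas `str.extract`.

```python
import re
import pandas as pd

date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

m = re.search(date_pattern, "Due on 2019-03-15 at noon")
whole = m.group(0)             # the entire matched text
year, month, day = m.groups()  # one string per parenthesized group

# In pandas, each group becomes a column of the resulting DataFrame.
dates = pd.Series(["2019-03-15", "2020-12-01"])
parts = dates.str.extract(date_pattern)
```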

## Information Extraction 

### Bag of Words

* Create an index out of *all* distinct words 
    - The basis for the vector space of words.
* Create vectors for each text entry by computing the counts of words in the entry.
* The cosine of the angle between two vectors (their normalized dot product) measures their 'similarity':
    - The **cosine similarity** is $\cos(\theta) = \frac{v \cdot w}{|v||w|}$, and it defines the cosine distance between vectors via: $$dist(v, w) = 1 - \cos(\theta) = 1 - \frac{v \cdot w}{|v||w|}$$


* Downside of `bag of words`: treats all words as *equally important*
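
A bag-of-words sketch on three hypothetical sentences, using whitespace tokenization (an assumption; real text usually needs more cleaning first).

```python
import numpy as np

docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]  # hypothetical texts

# Index of all distinct words: one vector coordinate per word.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    words = doc.split()
    return np.array([words.count(w) for w in vocab])

vectors = np.array([bag_of_words(doc) for doc in docs])

def cosine_similarity(v, w):
    """Normalized dot product of two vectors."""
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

sim_01 = cosine_similarity(vectors[0], vectors[1])  # share several words
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no words
```

The first two sentences get a similarity of about 0.82, and a good part of that comes from the uninformative word "the", which is the downside noted above.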

### Term Frequency, Inverse Document Frequency

* The *term frequency* of a word $t$ in a document $d$, denoted ${\rm tf}(t,d)$, is the proportion of words in $d$ that are equal to $t$, i.e. the likelihood of the term appearing in the document.
* The *inverse document frequency* of a word $t$ in a set of documents $\{d_i\}$, denoted ${\rm idf}(t)$ is: 

$$\log(\frac{{\rm\ total\ number\ of\ documents}}{{\rm number\ of\ documents\ in\ which\ t\ appears}})$$

* The *tf-idf* of a term $t$ in document $d$ is given by the product: 

$${\rm tfidf}(t,d) = {\rm tf}(t,d) \cdot {\rm idf}(t)$$

* Term Frequency, Inverse Document Frequency balances:
    - how often a word appears in a document/sentence, with
    - how often a word appears *across* documents.
* For a given document, the word with the highest TF-IDF best summarizes that document.
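
The definitions above can be computed by hand on a tiny corpus. This sketch assumes whitespace tokenization and the natural logarithm (the base is not specified above); the three documents are hypothetical.

```python
import numpy as np
import pandas as pd

docs = ["the cat sat", "the dog barked", "the cat and the dog"]  # hypothetical corpus
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

def tf(term, doc):
    """Proportion of the words in doc that are equal to term."""
    return doc.count(term) / len(doc)

def idf(term):
    """log(total number of documents / number of documents containing term)."""
    n_containing = sum(term in doc for doc in tokenized)
    return np.log(len(tokenized) / n_containing)

# One row per document, one column per word.
tfidf = pd.DataFrame(
    [[tf(term, doc) * idf(term) for term in vocab] for doc in tokenized],
    columns=vocab,
)

# The highest-scoring word in each document.
best_words = tfidf.idxmax(axis=1)
```

The word "the" appears in every document, so its idf is `log(1) = 0` and it never summarizes any document.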

## Feature Engineering

### What is a feature?

* A **feature** is a measurable property or characteristic of a phenomenon being observed.
* Synonyms: (explanatory) variable, attribute
* Examples include:
    - a column of a dataset.
    - a derived value from a dataset, perhaps using additional information.
    
We have been creating features to summarize data!

### Feature Engineering

* We already engineered features to summarize and understand data.
    - smoothing, transformations, ad hoc derived properties of data

* What can we do with it?
    - Visualization and summarization
    - Modeling (prediction; inference)

### Modeling setup

Want to estimate a relationship between X and Y.
* X is the observed data (almost anything!)
* Y is a quantitative value (e.g. a correlation coefficient; a predicted value)

<img src="image_2.png">

### The missing step: data to models

* Modeling techniques typically require *quantitative* input.
* Models require (strong) relationships between X and Y.

<img src="image_3.png">

There is work to be done transforming data into effective features!

### The goal of feature engineering

* Find transformations that turn data into effective quantitative variables

* Find functions $\phi:X\to\mathbb{R}^d$ where similar points $x,y\in X$ have close images $\phi(x), \phi(y)\in \mathbb{R}^d$

* A "good" choice of features depends on many factors:
    - data type (quantitative, ordinal, nominal),
    - the relationship(s) and association(s) being modeled,
    - the model type (e.g. linear models, decision tree models, neural networks).
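
As a small illustration of how the data type drives the choice of transformation, the sketch below encodes an ordinal column with an ordered mapping and a nominal column with one-hot encoding. The table and the ordering of sizes are hypothetical.

```python
import pandas as pd

# Hypothetical table with one ordinal and one nominal column.
df = pd.DataFrame({
    "size": ["small", "large", "medium"],
    "color": ["red", "blue", "red"],
})

# Ordinal: the categories have an order, so map them to increasing integers.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_ord"] = df["size"].map(size_order)

# Nominal: no order, so give each category its own 0/1 indicator column.
one_hot = pd.get_dummies(df["color"], prefix="color")
```

Mapping colors to integers instead would impose an order that does not exist in the data, which is why the nominal column gets indicator columns.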

## The modeling pipeline


### The steps of the modeling pipeline

1. We have already seen Feature Engineering, i.e. creating features that best reflect the meaning behind the data
2. Create a model appropriate to capture relationships between features
    - e.g. linear, non-linear
3. Select a loss function and fit the model (determine $\hat{\theta}$).
4. Evaluate model (e.g. using RMSE)

After these steps, use the model for prediction and/or inference.

* The pipeline above represents a single attempt at a model
    - May have thousands of feature/model/parameter combinations to choose from!
    - Remember the Data Science Life Cycle!


### Features and Models using `Scikit Learn`

* Scikit-Learn implements many common steps in the feature/model creation pipeline.
* It works with `numpy` arrays internally; Pandas dataframes are accepted as input, but outputs come back as arrays by default :(
    - Some work required keeping track of columns in scikit

### Example of a Scikit-Learn Model 

* Initialize a model with (perhaps zero) parameters:
    - e.g. `lr = LinearRegression()`
* Fit model to given dataset using `.fit`
    - e.g. `lr.fit(data, outcomes)` fits the model weights using `data` and `outcomes`.
* Use the model to predict using `.predict` method
    - e.g. `lr.predict(newdata)` predicts outcomes for `newdata`.
* Inspect model attributes, like model weights.
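
The four steps above, sketched on synthetic data that follows `y = 2x + 1` exactly (an assumption chosen so the fitted weights are easy to check).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 2x + 1.
data = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
outcomes = 2 * data[:, 0] + 1

lr = LinearRegression()   # initialize
lr.fit(data, outcomes)    # fit the model weights

newdata = np.array([[4.0], [5.0]])
predictions = lr.predict(newdata)  # predict outcomes for unseen inputs

# Inspect the fitted weights.
slope = lr.coef_[0]
intercept = lr.intercept_
```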

### Evaluating the quality of a model

* Given a model fit on a dataset, calculate e.g. the root-mean-square error.
* If the error is low, do you think it's a good model?
    - It fits the given *data* well, but is it a good model? (Is the sample representative?)
    - E.g. will it give good predictions on similar, unknown, data?

### Fundamental Concepts of the quality of a 'fit model'

* **Bias**: the expected deviation between the predicted value and true value
* **Variance**:
    - **Observation Variance**: the variability of the random noise in the process we are trying to model. 
    - **Estimated Model Variance**: the variability in the predicted value across different datasets. 

### A more detailed understanding of 'model quality'

* Accuracy is defined as the proportion of predictions that are correct.
    - treats all incorrect guesses equally
    - treats all correct guesses equally
* Ignores *how* the predictions were (in)correct!

### Binary Classification Outcomes

* **True positive (TP)**: the model correctly predicts the positive class.

* **True negative (TN)**: the model correctly predicts the negative class.

* **False positive (FP)**:the model incorrectly predicts the positive class. 

* **False negative (FN)**: the model incorrectly predicts the negative class. 



In order to evaluate the classifier, we need to know the number of actual positive cases in the data, **P** = TP + FN, and the number of actual negative cases, **N** = TN + FP.
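
A minimal sketch that counts the four outcomes from hypothetical label arrays, where 1 is the positive class and 0 the negative class.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

TP = int(((y_pred == 1) & (y_true == 1)).sum())  # predicted positive, actually positive
TN = int(((y_pred == 0) & (y_true == 0)).sum())  # predicted negative, actually negative
FP = int(((y_pred == 1) & (y_true == 0)).sum())  # predicted positive, actually negative
FN = int(((y_pred == 0) & (y_true == 1)).sum())  # predicted negative, actually positive

P = TP + FN  # actual positives
N = TN + FP  # actual negatives
accuracy = (TP + TN) / (P + N)
```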


### F1-score: combining precision and recall


* **F1-score** combines precision and recall via the 'harmonic mean'.

* The F1 score is a measure of how well a test labels positive instances. 

* It considers both the precision and the recall (TPR) of the test to compute the score.

$$
F1 = 2 \times \frac{{\rm Precision} \cdot {\rm Recall}}{{\rm Precision} + {\rm Recall}}
$$

### False Discovery Rate (FDR)

* The proportion of positive identifications that were false (positives).
* Terrorism Example: the proportion of people flagged as terrorists who are normal passengers.
    - A high FDR leads to a lot of average people inconvenienced.
* Related to precision (FDR = 1 - Precision).

$$
{\rm FDR} =\frac{FP}{TP + FP}
$$

### Sensitivity, Specificity, Precision and Recall

Sensitivity:
- What proportion of actual positives were correctly identified?
- Also called: True positive rate (TPR), hit-rate, recall.

$${\rm Sensitivity/ Recall} = {\rm TPR} = \frac{TP}{P} =\frac{TP}{TP + FN}$$


Specificity:
- What proportion of actual negatives were correctly identified?
- Also called: Selectivity, true negative rate (TNR).

$$
{\rm Specificity} =\frac{TN}{N} = \frac{TN}{TN + FP} 
$$

Precision:
- What proportion of positive identifications were actually correct?

$$
{\rm Precision} =\frac{TP}{TP + FP}
$$
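
The rates defined in this and the preceding sections can be computed directly from the four outcome counts. The function name `classification_metrics` and the counts below are hypothetical.

```python
def classification_metrics(TP, TN, FP, FN):
    """Compute the rates defined above from the four outcome counts."""
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)          # sensitivity, true positive rate
    specificity = TN / (TN + FP)     # true negative rate
    fdr = FP / (TP + FP)             # equals 1 - precision
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "fdr": fdr,
        "f1": f1,
    }

# Hypothetical counts for illustration.
metrics = classification_metrics(TP=30, TN=50, FP=10, FN=10)
```

This sketch assumes the denominators are nonzero; a classifier that never predicts the positive class would need special handling for precision and FDR.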


### Train-Test Split

To assess your model for overfitting to the data, randomly split the data into a "training set" and a "test set".

- The training set is used to fit the model (train the predictor).
- The test set is used to test the goodness-of-fit of the fit model on data it has not seen.
- This is *similar* to bootstrapping when estimating a regression model.
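
A sketch of a train-test split evaluated with RMSE, on synthetic data (a noisy line, generated here purely for illustration). The helper `rmse` is defined in the block.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3x plus Gaussian noise with standard deviation 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=200)

# Randomly hold out 25% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)  # fit on the training set only

def rmse(actual, predicted):
    """Root-mean-square error between two arrays."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
```

A test error much larger than the training error would suggest overfitting; for this simple linear setup the two should be similar.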

### The machine learning training pipeline:

<img src="train-test.png" width="50%">

Scikit-Learn has functions that help us do this.

### How to put together Scikit-Learn Pipelines
- Put together feature transformers and models using `sklearn.pipeline.Pipeline` objects
- Create a pipeline from a list of `(name, step)` tuples: `pl = Pipeline([('feat', feat), ('mdl', mdl)])`
- Fit the model(s) in the pipeline using `pl.fit(data, target)`
- Predict from raw input data through the pipeline using `pl.predict`

### Simple Example of a Pipeline

- We use the iris dataset
- We preprocess the data by standardizing it
- We use logistic regression to classify each sample into its iris species

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
```

```python
# Load and split the data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
X_train.shape
```

```
(120, 4)
```

```python
pipe_lr = Pipeline([
    # Standardize features by removing the mean and scaling to unit variance
    ('stdscr', StandardScaler()),
    # Logistic Regression
    # (the `multi_class` argument is deprecated in recent scikit-learn releases)
    ('clf', LogisticRegression(solver='newton-cg', multi_class='ovr')),
])
```

```python
pipe_lr.fit(X_train, y_train)
```

```
Pipeline(memory=None,
     steps=[('stdscr', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False))])
```

```python
score = pipe_lr.score(X_test, y_test)
print('Logistic Regression pipeline test accuracy: %.3f' % score)
```

```
Logistic Regression pipeline test accuracy: 0.967
```


### Function Transformer
- Recall what a function transformer is
- It forwards the X (and optionally y) arguments to a user-defined function or function object and returns the result of this function
- Used in Data Pre-processing
- Somewhat like an `apply` in pandas
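
A minimal `FunctionTransformer` sketch, assuming a log transform (`np.log1p`) as the user-defined function and a small hypothetical array as input.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap an ordinary function so it can be used as a transformer step.
log_transformer = FunctionTransformer(np.log1p)

X = np.array([[0.0, 9.0], [99.0, 999.0]])  # hypothetical skewed features

# The wrapped function is applied to the whole array X.
X_log = log_transformer.fit_transform(X)
```

Because it exposes `fit` and `transform`, the wrapped function can be placed inside a `Pipeline` like any other transformer.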

But what if we cannot apply the same transformations to every individual feature of a data point in X?
This is why we need `Column Transformers`. 

###  Column Transformer
- Datasets can often contain components that require different feature extraction and processing pipelines
- Datasets may have a mix of Categorical columns and Continuous Numeric columns, which will almost always need separate transformations
- Datasets may be stored in a Pandas DataFrame and different columns require different processing pipelines

For example:
- Your dataset consists of heterogeneous data types (e.g. raster images and text captions)
- You want to standardize the numerical columns but one-hot-encode the categorical ones

`ColumnTransformer` allows you to choose which columns get which transformations.

- The ColumnTransformer takes a list of tuples, where each tuple has the following 3 entries:
    - The first value in the tuple is a name that labels it, 
    - the second is an instantiated estimator or transformation, 
    - and the third is a list of columns you want to apply the transformation to. 
- The tuple will look like this:
     `('name', SomeTransformer(parameters), columns)`
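
A sketch of a `ColumnTransformer` that standardizes numeric columns and one-hot-encodes a categorical one. The DataFrame and its column names are hypothetical; `sparse_threshold=0` is set only so the result is always a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset mixing numeric and categorical columns.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
    "color": ["red", "blue", "red", "green"],
})

# Each tuple: (name, transformer, list of columns it applies to).
ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["height", "weight"]),
        ("onehot", OneHotEncoder(), ["color"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Result: 2 standardized columns followed by 3 one-hot columns.
features = ct.fit_transform(df)
```

The fitted `ct` can also serve as the first step of a `Pipeline`, followed by a model.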