# 01 | Data Science Introduction 

Session recording: https://youtu.be/gqK_qOdHUl0

Delivering business value depends on your ability to answer clients’ 
questions with data.

- **Decisions:** Successful analytical projects are those that support decision making or automation
- **Questions:** The key is to understand which questions need to be answered to make a decision
- **Approach:** We choose the analytical approach that is most suitable for answering these questions
- **Technology:** It is important to pay attention to the technology that allows the solution to be implemented in the decision process
- **Implementation:** In cooperation with IT, we ensure the implementation of the solution and take care of the maintenance process

Different types of analytics solve different business needs. 

DESCRIPTIVE ANALYTICS: Interactive visualizations are commonly used to present complex data
and allow for a certain level of user self-service


DIAGNOSTIC ANALYTICS: In diagnostic analytics we want to uncover hidden factors that drive
certain behavior (for example, differences in death tolls between countries)

PREDICTIVE ANALYTICS: Forecasting techniques make it possible to predict the outlook of chosen KPI(s) or even
the probability of customer actions

PRESCRIPTIVE ANALYTICS: Prescriptive analytics accounts for constraints when making decisions


## Data Structures and Terminology

- Table = Data
- Header = Metadata
- Columns = Values
- Row = Observation

Data Structures
- Wide --> 1 observation = all information about one employee
- Long --> 1 observation = PersonID + one variable
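
To make the difference concrete, here is a minimal sketch that reshapes a small wide table into long form. The employee records and the helper name `wide_to_long` are made up for illustration. In practice a library routine such as pandas `DataFrame.melt` does the same job.

```python
# Hypothetical wide table: one row (observation) per employee.
wide = [
    {"PersonID": 1, "Age": 34, "Income": 52000},
    {"PersonID": 2, "Age": 45, "Income": 61000},
]


def wide_to_long(rows, id_col):
    """Turn every (row, non-id column) pair into its own observation."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            if name != id_col:
                long_rows.append(
                    {id_col: row[id_col], "Variable": name, "Value": value}
                )
    return long_rows


long_table = wide_to_long(wide, "PersonID")
for record in long_table:
    print(record)
```

Two wide rows with two variables each become four long rows, one per PersonID and variable.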

### Data Types

**Cross-sectional data**

- Characterized by individual units - people, companies, countries, ...

**Time series**
- Data collected at several time points
- Stock prices, interest rates, exchange rates, GDP,...
- Many different frequencies (hourly, daily, monthly, quarterly,..)

**Panel data (longitudinal)**
- Combines cross-sectional and time series data
- The same individuals (persons, firms, cities, etc.) are observed at several points in time (days, years, ..)

**Other data types**
- Images 
- Videos (Video == a sequence of images == a sequence of RGB arrays)
- Text (news articles etc)




### Data Terminology
Example: predict income based on person’s age, education, and gender

Income is:
- Dependent variable
- Target

Age, Education, and Gender are:
- Independent variables
- Explanatory variables
- Features


### Column Types

- numerical (100, 200)
- string/factor (M/F, CZ, SK...)
- numerical string

**ONE-HOT ENCODING OF NON-NUMERIC DATA**

Categorical variables need to be transformed into numeric ones.

Usually, one category is dropped (for gender, for example, keep only the male indicator) to avoid
multicollinearity.
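
As an illustration, the sketch below one-hot encodes a categorical column and drops the first category, which then acts as the baseline. The function name `one_hot`, the `is_` column prefix, and the sample values are assumptions made for this example only.

```python
def one_hot(values, drop_first=True):
    """Encode category labels as 0/1 indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the implicit baseline.
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


gender = ["M", "F", "F", "M"]  # hypothetical column
encoded = one_hot(gender)
print(encoded)
```

With two categories, a single `is_M` column is enough: a row with `is_M = 0` is necessarily female, so a second column would add no information.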

## Data Science Problem Types

- Hypothesis testing (statistics)
    - with cross-sectional data
    - with panel data
- Supervised learning (ML)
    - classification
    - regression
- Unsupervised learning (ML)
    - dimensionality reduction
    - clustering
- Other
    - image, video processing
    - anomaly detection
    - optimizations
    - simulations

## Hypothesis Testing

Hypothesis to test: Women have lower income than men.

`mean(Income_F) < mean(Income_M) ?`


Methodology: 
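- Calculate `mean(Income_F)` and `mean(Income_M)` from the table
- Assume `mean(Income_F) = mean(Income_M)` (the null hypothesis) and assess how probable the observed difference would be under that assumption
- This probability is the `P-value`; if it is too small, reject the null hypothesis and conclude: `mean(Income_F) < mean(Income_M)`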
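
One way to sketch these steps without extra libraries is a permutation test. This is a swapped-in technique chosen for illustration, not necessarily the test used in the session (a t-test would be the more common choice). The income figures, the function name, and the 0.05 threshold are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """One-sided p-value for the alternative mean(group_a) < mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff <= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical incomes in thousands.
income_f = [30, 32, 35, 31, 29, 33]
income_m = [36, 38, 34, 40, 37, 39]

p_value = permutation_p_value(income_f, income_m)
print("p-value:", p_value)
if p_value < 0.05:  # assumed significance level
    print("Reject equal means: mean(Income_F) < mean(Income_M)")
```

The p-value here is the share of random relabelings that produce a difference at least as extreme as the observed one.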


## Unsupervised Learning
### Clustering

Goal: Identify groups of observations with similar patterns

- Labelling unlabelled data
- Automatic labelling
- Anomaly detection
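
As a rough sketch of how automatic labelling can work, the block below runs plain k-means on one-dimensional data. k-means is only one of many clustering methods and is picked here as an example. The spend values and the starting centers are made up.

```python
from statistics import mean


def kmeans_1d(values, centers, n_iter=10):
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: label each value with its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = mean(members)
    return labels, centers


spend = [10, 12, 11, 95, 100, 98, 13]  # hypothetical monthly spend per customer
labels, centers = kmeans_1d(spend, centers=[0, 50])
print(labels, centers)
```

The resulting labels separate low spenders from high spenders without any labelled examples being provided.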

## Supervised Learning

- Classification
    - e.g. credit default = a binary variable

- Regression
    - e.g. sales = a continuous variable


## Statistics vs. Machine Learning

**Statistics**
- Look at past trends
- Support/Reject your hypothesis
- Examine effects of individual factors
- Assumptions are important

Methods
- Linear regression (OLS)
- Logistic regression
- Panel regressions (Fixed effects, Random effects)
- Time Series models

**Machine Learning**
- Main goal: predict the future using past data
- Usually no strict assumptions
- Often no examination of impacts of individual factors
- Maximize accuracy/precision of prediction (classification)
- Minimize errors of prediction (regression)
- Use train/test split

Methods
- Classification:
    - Logistic regression
    - Decision Trees
    - Support Vector Machines etc.
- Regression:
    - Linear regression (OLS)
    - Regression Trees
    - Support Vector Regression etc.




## Model Performance Evaluation

**Hypothesis Testing**
- $R^2$
    - Indicates the percentage of the variance in the dependent variable that the independent variables explain collectively
    - 0-100% scale (the higher the better)

- p-values:
    - Show whether individual variables in the model have a statistically significant relationship with the dependent variable
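
For reference, $R^2$ can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up observed and predicted values. Note that the 0-100% range is guaranteed for an OLS model with an intercept evaluated on its own training data; on other data the value can fall below zero.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]   # hypothetical actual values
fitted = [1.1, 1.9, 3.2, 3.8]     # hypothetical model predictions

r2 = r_squared(observed, fitted)
print(f"R^2 = {r2:.2%}")
```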

**Train/test Data Split**
- train/test data split - cross-sectional data
    - Before we use the model to predict new values, we need to use the already labeled data to find the relationships between variables.
    - Use only a part of the data for finding those relationships, i.e., for model training (~60-80%). Keep another part of the data for testing.

- train/test data split - time series
    - train sample: long enough to capture seasonality; some models might need more data to train than others
    - test sample: the size of your forecasting horizon

*We want both training and testing model performance to be similarly good.*
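
The two kinds of split can be sketched as follows. The helper names, the 80% training share, and the sample data are assumptions for illustration. Libraries such as scikit-learn provide ready-made equivalents.

```python
import random


def random_split(rows, train_share=0.8, seed=0):
    """Cross-sectional data: shuffle the rows, then cut."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def time_series_split(series, horizon):
    """Time series: keep the order; the last `horizon` points are the test set."""
    if horizon <= 0 or horizon >= len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return series[:-horizon], series[-horizon:]


rows = list(range(10))                      # 10 hypothetical observations
train, test = random_split(rows)

monthly_sales = list(range(100, 136))       # 36 hypothetical monthly values
ts_train, ts_test = time_series_split(monthly_sales, horizon=6)

print(len(train), len(test), len(ts_train), len(ts_test))
```

The time series split never shuffles: the model is trained on the past and tested on the most recent points, which mirrors how it will be used for forecasting.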

**Metrics**
- regression performance
    - compare observed values vs. predicted values
    - Root Mean Squared Error (RMSE)
        - `RMSE = sqrt((1/n) * Σ(yᵢ - ŷᵢ)²)`
        - n is the number of data points.
        - yᵢ refers to the observed (actual) values.
        - ŷᵢ represents the predicted values.

        - RMSE is a commonly used metric to evaluate the accuracy of a regression model. It measures the average magnitude of the differences between the predicted values and the actual values. The square of the differences is taken to ensure that positive and negative errors do not cancel each other out. Finally, the square root is taken to obtain the RMSE value, which is in the same units as the dependent variable.

        - Please note that RMSE is sensitive to outliers and larger errors have a greater impact on the overall value.
    - Mean Absolute Percentage Error (MAPE)
        - `MAPE = (1/n) * Σ(|(yᵢ - ŷᵢ)/yᵢ|) * 100`
        - n is the number of data points.
        - yᵢ refers to the observed (actual) values.
        - ŷᵢ represents the predicted values.

        - MAPE is a commonly used metric to assess the accuracy of a forecasting or prediction model, especially in the context of time series analysis. It calculates the average percentage difference between the predicted values and the actual values. The absolute difference between each pair of values is divided by the actual value, and the result is multiplied by 100 to express the error as a percentage. Finally, the average of these percentage errors is taken to obtain the MAPE value.

        - It's important to note that MAPE is not defined when one or more of the actual values (yᵢ) is zero. Additionally, MAPE is not symmetric and can be influenced by extreme values.
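
Both regression metrics compare observed with predicted values and can be sketched in a few lines. The sample numbers are made up, and the zero check reflects the fact that MAPE is undefined when an actual value is zero.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Squared Error, in the units of the dependent variable."""
    n = len(actual)
    return sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    if any(y == 0 for y in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(actual)
    return sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted)) / n * 100


actual = [100, 200, 400]      # hypothetical observed values
predicted = [110, 180, 400]   # hypothetical predictions

rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
print(f"RMSE = {rmse_value:.2f}, MAPE = {mape_value:.2f}%")
```

In this example the absolute errors are 10, 20 and 0. The error of 20 dominates RMSE because errors are squared, while in MAPE the first two errors weigh the same because each is 10% of its actual value.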

- classification performance
    - Confusion Matrix
        - true positives, false positives, false negatives, true negatives
    - Accuracy  = `(TP + TN) / (TP + TN + FP + FN)`
        - the proportion of correctly classified instances out of the total number of instances in the dataset
    - Precision = `TP / (TP + FP)`
        - How many observations predicted as positive are really positive?
    - Recall (Sensitivity) = `TP / (TP + FN)`
        - How many observations out of all positive observations have we classified as positive?
    - other classification metrics
        - ROC curve and AUC
        - F1 score
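
The formulas above can be checked on a small example. The labels below are made up, `1` is assumed to be the positive class, and the sketch assumes that no denominator is zero.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification."""
    tp = fp = fn = tn = 0
    for y, y_hat in zip(actual, predicted):
        if y_hat == positive and y == positive:
            tp += 1
        elif y_hat == positive:
            fp += 1
        elif y == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical predicted labels

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

The F1 score is the harmonic mean of precision and recall, which makes it useful when the classes are imbalanced and accuracy alone would be misleading.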







