# Revision of stats and ML for interview

## Statistics

### Distributions

- Normal
    - Chi-squared
    - F distribution
- Bernoulli
- Binomial

### Common questions:

- What’s the difference between descriptive and inferential statistics?
    - Descriptive statistics summarize the data at hand with measures such as mean, median, standard deviation, IQR, percentiles, kurtosis
        - Characteristics and distributions of the sample
        - Doesn’t involve assumptions about the distribution
        - It describes the sample
    - Inferential statistics lets us test hypotheses about the population (infer from the sample)
        - Often involves assumptions regarding the distribution
- What are the main measures used to describe the central tendency of data?
    - Mean, Median, Mode
- Measures of variability
    - Variance
    - IQR
    - Range
    - Standard deviation
- Skewness
    - Measures the asymmetry of the distribution
    - Positive — right tail is longer
    - Negative — left tail is longer
- Kurtosis
    - Measures how heavy the tails are compared to the normal distribution
- Correlation vs autocorrelation
    - Correlation — measure of linear relationship between two variables
    - Autocorrelation — measure of linear relationship between values of the same variable at different points in time (e.g. consecutive periods)
- Probability distribution vs sampling distribution
    - Probability distribution — how the values of a random variable are distributed in the population
    - Sampling distribution — distribution of a statistic computed from a sample (sample mean, variance, etc.) across repeated samples
- Normal distribution
    - Symmetric, bell-shaped distribution, fully described by its mean and standard deviation; crucial because of the central limit theorem
- Assumptions of linear regression
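    - Depends on which set is meant; the classical linear model assumptions are:
    - MLR1 — linear in parameters
    - MLR2 — random sampling (i.i.d.)
    - MLR3 — no perfect collinearity
    - MLR4 — zero conditional mean of the error
    - MLR5 — homoskedasticity (constant error variance)
    - MLR6 — normality of the error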
- Hypothesis testing
    - Evaluate a hypothesis about a population using sample data
    - H0 (null hypothesis), H1 (alternative hypothesis)
- Tests
    - F test, t test, ANOVA
- P value
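    - Probability, assuming H0 is true, of observing a test statistic at least as extreme as the one in the sample. It is not the probability that H0 is true. We reject H0 if the p-value is lower than the significance level α, which is the probability of a type 1 error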
- Confidence interval
    - Range built from the sample that, over repeated sampling, contains the true parameter with probability 1 − α (e.g. 95% for significance level α = 0.05)
- LLN
    - Sample mean → population mean as the sample size grows
- CLT
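    - Assumes i.i.d. observations with finite variance
    - For large n, the sampling distribution of the mean is approximately N(μ, σ²/n); the standardized mean √n(x̄ − μ)/σ converges to N(0, 1)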
- Probability vs likelihood
    - Probability — chance of a particular outcome occurring, given fixed parameters
    - Likelihood — how well given parameter values explain the observed result (a function of the parameters, with the data fixed)
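
The LLN and CLT points above can be checked with a small simulation. This is a minimal sketch, assuming an Exponential(1) population (mean 1, standard deviation 1); the function name `standardized_sample_means` and the sample sizes are illustrative choices, not part of the original notes.

```python
import random
import statistics


def standardized_sample_means(n, trials, seed=0):
    """Draw `trials` i.i.d. samples of size n from Exponential(1)
    and return the standardized sample mean of each one."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and st. dev of Exponential(rate=1)
    z_values = []
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        xbar = statistics.fmean(sample)
        # Standardize: sqrt(n) * (xbar - mu) / sigma
        z_values.append((xbar - mu) / (sigma / n ** 0.5))
    return z_values


z = standardized_sample_means(n=100, trials=2000)
# If the CLT holds, these should be close to 0 and 1 respectively.
print(round(statistics.fmean(z), 2), round(statistics.stdev(z), 2))
```

The population here is skewed, yet the standardized means should still look roughly standard normal, which is the point of the theorem.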
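
To make the probability vs likelihood distinction concrete, here is a small sketch using a coin-toss (binomial) model. The numbers (7 heads in 10 tosses) and the helper `binomial_pmf` are assumptions made up for illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(k successes in n trials) for success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Probability: the parameter is fixed (p = 0.5), we ask about an outcome.
prob_7_heads = binomial_pmf(7, 10, 0.5)

# Likelihood: the outcome is fixed (7 heads in 10 tosses),
# we compare how well different parameter values explain it.
candidates = [i / 100 for i in range(1, 100)]
likelihoods = {p: binomial_pmf(7, 10, p) for p in candidates}
p_hat = max(likelihoods, key=likelihoods.get)  # maximum likelihood estimate on the grid

print(prob_7_heads, p_hat)
```

The same formula is used both times; only what is held fixed changes.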
    

## Machine Learning

- Random forests
    - Single decision trees tend to overfit heavily
    - We aggregate the results of many decision trees, each trained on a bootstrap sample with a random subset of features, to get good predictions by removing dependency on a particular set of features
- Gradient boosting vs random forests
    - Both are decision-tree based
    - Random forest uses bagging — aggregating results of many trees trained on bootstrap samples
    - Gradient boosting uses boosting — trees are built sequentially, each tree trying to minimize the error of the previous ones
    - Random forests — independent trees; gradient boosting — dependent on the previous
    - Gradient boosting:
        - Often more accurate when well tuned, since each tree corrects the remaining error
        - Can capture complex patterns
        - Often works better on unbalanced data sets
        - Susceptible to overfitting
        - More complex tuning of hyperparameters
    - Random forests:
        - Less prone to overfitting
        - Faster training, since trees can be built in parallel
- K-means clustering
    - Partition the dataset into K clusters, starting from K arbitrarily chosen centroids
    - Repeatedly assign points to the nearest centroid and update each centroid to the mean of its points, until convergence
    - Minimizes within-cluster variance
    - Elbow method for determining K — pick the K after which the decrease in variance stops being sharp
- Dimensionality reduction
    - Reduce dimensions of our data without sacrificing much variance / predictive power
    - One popular method is Principal Components Analysis
        - Combines highly correlated variables into a new, smaller set, capturing most of the variance
        - Looks for linear combinations of the variables that explain the variance
        - The first component has the highest variance; the second is uncorrelated with the first and has the second-highest variance. The number of components kept depends on the threshold for the percentage of variance explained
- L1, L2 regularization
    - Regularization to reduce overfitting
    - The usual (unregularized) cost function is MSE
    - L1 — Lasso. Adds the sum of absolute values of the weights to the loss function we try to minimize
    - L2 — ridge regression — adds the sum of squared weights
    - L1 helps with feature selection as it tends to push parameters straight to zero, rather than holding them at some small level as L2 does
- Overfitting vs underfitting
    - Overfitting — performs well on training data, badly on test data
        - Learns the noise in the data, generalizes badly
    - Underfitting — bad on both; too simple to learn any patterns
    - Ways to avoid overfitting:
        - Reduce number of features
        - More representative training data
        - Data preprocessing
        - Regularization
        - Techniques that tend not to overfit
        - Use a validation set
- Bias and variance
    - Bias — error from the simplifying assumptions a model makes to make the target easier to learn
    - Variance — amount that the estimate will change if different training data were used (variance error)
    - Irreducible error (noise that no model can remove)
    - Balance bias and variance
    - Linear regression — high bias, low variance
- Precision, recall, and F1
    - Measures for evaluating classification
    - Precision — ratio of correct predictions of class A to all predictions of class A, TP / (TP + FP) (how likely a given prediction is correct)
    - Recall — ratio of correctly classified class A samples to all class A samples, TP / (TP + FN) (how well the model can detect the class)
    - Tradeoff between them
    - F1 — harmonic mean of precision and recall. Used when both are equally important
    - Choose what to optimise for based on the problem
- Missing / corrupt data
    - Delete rows with missing values
    - Learning algorithms that support missing values
        - K-NN (nearest neighbours)
        - Naive Bayes
        - Random forest — can work on non-linear / categorical data
    - Imputation
        - The most frequent value
        - Mean-median-mode imputation
            - Risk of data leakage if computed on the full dataset; doesn’t factor in covariance between features
        - Use an ML model to learn the pattern between the other features and predict the missing value
- Making models robust to outliers
    - Regularization
    - Tree-based models
    - Transform the data (e.g. log)
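
The K-means steps from the notes above (arbitrary centroids, assign, update, repeat) can be sketched in plain Python. This is a minimal illustration on made-up 2D points, assuming squared Euclidean distance; the names `kmeans`, `sq_dist` and `inertia` are hypothetical helpers, not a library API.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=100, seed=0):
    """Basic K-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # arbitrary initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances (the quantity used in the elbow method)."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), inertia(points, centroids, labels))
```

Running `kmeans` for several values of K and comparing `inertia` is one way to draw the elbow plot.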
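
Why L1 pushes parameters to exactly zero while L2 only shrinks them can be seen in a one-parameter toy problem. This sketch assumes the loss 0.5·(w − a)² plus a penalty, where `a` is the unregularized estimate and `lam` the penalty strength; both function names are made up for illustration.

```python
def lasso_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1 + lam)


for a in (0.3, 2.0):
    print(a, lasso_shrink(a, 0.5), ridge_shrink(a, 0.5))
```

With L1, any coefficient whose magnitude is below `lam` is dropped entirely; with L2 it only gets smaller.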
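
Precision, recall and F1 from the notes above can be computed directly from the counts of true positives, false positives and false negatives. A minimal sketch on made-up labels; the function name and the `positive` parameter are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here the model is right about two of its three positive predictions but finds only two of the four actual positives, so precision is higher than recall.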