## Concepts

* PMF:
<br>
A probability mass function (PMF) represents a distribution by mapping each value to its probability.  
It is useful for discrete features.


* CDF:  
The cumulative distribution function (CDF) maps each value to its percentile rank.  
It gives the probability that a random variable is less than or equal to a given value.
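
As a rough illustration of the two definitions above, the sketch below builds an empirical PMF and CDF from a small made-up sample of die rolls. The helper names `pmf` and `cdf` and the data are assumptions for this example only.

```python
from collections import Counter

def pmf(values):
    """Map each distinct value to its relative frequency in the sample."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}

def cdf(values, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

rolls = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]  # made-up die rolls

print(pmf(rolls))     # probability of each face within this sample
print(cdf(rolls, 3))  # share of rolls that are <= 3
```

The PMF values sum to 1, and the CDF at a value is the running total of the PMF up to and including that value.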


* PDF:  
It is a function that describes the probability distribution of a continuous random variable.  
For a continuous variable the probability of any single exact value is zero, so the PDF is used to calculate the probability of the variable falling within a certain range of values.  
The PDF is defined as the derivative of the cumulative distribution function (CDF) of a random variable.  The PDF gives the rate at which the CDF changes with respect to the random variable.  
The PDF has several important properties, including:


    - The area under the PDF curve is equal to 1, since the total probability of all possible outcomes is 1.
    - The PDF is non-negative, since a probability can never be negative (the density itself may be greater than 1).
    - The PDF can be used to calculate the expected value and variance of a random variable.
    - The PDF can be used to calculate the probability of a random variable falling within a certain range of values, by integrating the PDF over that range.
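
These properties can be checked numerically. The sketch below assumes SciPy is available and uses a normal distribution with mean 10 and standard deviation 2 purely as an example; the integration bounds `lo` and `hi` are an arbitrary choice wide enough that the tails beyond them are negligible.

```python
from scipy import stats
from scipy.integrate import quad

# Illustrative distribution: normal with mean 10 and standard deviation 2.
dist = stats.norm(loc=10, scale=2)
lo, hi = -10, 30  # mean +/- 10 standard deviations

# Property 1: total area under the PDF is 1.
total_area, _ = quad(dist.pdf, lo, hi)

# Property 4: probability of a range is the integral of the PDF over it,
# which matches the difference of the CDF at the two endpoints.
p_8_to_12, _ = quad(dist.pdf, 8, 12)
p_via_cdf = dist.cdf(12) - dist.cdf(8)

# Property 3: expected value and variance as integrals against the PDF.
mean, _ = quad(lambda x: x * dist.pdf(x), lo, hi)
variance, _ = quad(lambda x: (x - mean) ** 2 * dist.pdf(x), lo, hi)

# The PDF is the derivative of the CDF (central finite difference).
h = 1e-5
cdf_slope = (dist.cdf(10.5 + h) - dist.cdf(10.5 - h)) / (2 * h)

print(total_area, p_8_to_12, p_via_cdf, mean, variance)
print(cdf_slope, dist.pdf(10.5))
```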


* Kernel density estimation (KDE) is an algorithm that takes a sample and finds an appropriately smooth PDF that fits the data.
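
A minimal KDE sketch, assuming SciPy's `gaussian_kde` and a synthetic sample (the sample parameters and the evaluation grid are made up for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=500)  # synthetic data

kde = gaussian_kde(sample)  # bandwidth picked by the library default
grid = np.linspace(sample.min() - 5, sample.max() + 5, 400)
density = kde(grid)         # estimated PDF evaluated on the grid

# The estimate is itself a PDF, so it should integrate to about 1.
area = kde.integrate_box_1d(-np.inf, np.inf)
print(area, grid[density.argmax()])
```

The bandwidth controls how smooth the estimate is; the default is usually a reasonable starting point but may over-smooth multimodal data.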

* PDFs:
    - Gaussian (normal) distribution: 
        + z-score: standardizes data by subtracting the mean and dividing by the standard deviation
        + QQ-plot: a plot to visualize how close a sample distribution is to a specified distribution
        + Unimodal: one mode
        + Symmetrical: left and right halves are mirror images
        + Bell-shaped: maximum height (mode) at the mean
        + Mean, mode, and median are all located at the center
        + Asymptotic: the tails approach the horizontal axis but never touch it
        + About 68% of the data falls within one standard deviation of the mean (which equals the median for a normal distribution).  
        For example, if the median of a dataset is 10 and the standard deviation is 2, "1 standard deviation from the median" refers to the range of values from 8 to 12.
    - Uniform distribution
    - Exponential distribution
    - Lognormal distribution
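
The z-score, the 68% rule and the QQ-plot idea from the Gaussian notes above can be sketched as follows. The data is synthetic (mean 10, standard deviation 2, matching the example in the notes), and `scipy.stats.probplot` is used here only to obtain the QQ-plot coordinates and their correlation, without drawing anything.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=2, size=10_000)  # synthetic normal data

# z-score: subtract the mean, divide by the standard deviation.
z = (data - data.mean()) / data.std()

# Share of observations within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(z) <= 1)

# QQ-plot data: theoretical normal quantiles vs. ordered sample values.
# r close to 1 means the points lie near a straight line.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(data, dist="norm")

print(within_one_sd, r)
```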

* Central Limit Theorem:  
<br>
The theorem says that the means of sufficiently large samples are approximately normally distributed, whatever distribution the data itself comes from (as long as its variance is finite).  
If we take many samples from our data and calculate the mean of each sample, the means form an approximately normal distribution.
<br>
If we now take the interval that includes 95% of the means, we have a 95% confidence interval.  
A value outside this interval is significantly different from the mean of the data.  
If the confidence intervals of two samples don't overlap, the samples are significantly different (overlapping intervals do not by themselves prove there is no difference).    
This is useful when we don't know what distribution our data comes from.   
So we calculate the mean, find the confidence interval, and run a t-test or ANOVA to find out whether there is a difference between two or more samples.
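
A small simulation can make the theorem concrete. The sketch below draws many samples from an exponential distribution (clearly not normal), takes the mean of each, and reads off the interval holding 95% of those means. The sample size of 50, the 5,000 repetitions and the scale of 2.0 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, sample_size = 5_000, 50
# Each row is one sample from a right-skewed exponential distribution (mean 2.0).
samples = rng.exponential(scale=2.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# Interval that contains the central 95% of the sample means.
low, high = np.percentile(sample_means, [2.5, 97.5])

print(sample_means.mean())  # close to the true mean of 2.0
print(low, high)
```

A histogram of `sample_means` looks roughly bell-shaped even though a histogram of the raw draws does not.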

* P-value:
<br>
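The purpose is to account for the effect of random chance on an observed result.
<br>
The p-value is the probability of seeing a result at least as extreme as the observed one if the null hypothesis is true. The closer the p-value is to 0, the stronger the evidence that the two samples are different.  
A threshold of 0.05 means that, when there is no real difference, we expect about 5 false positives in every 100 tests.  
alpha = significance level = 0.05
<br>
A p-value below alpha lets us reject the null hypothesis.
<br>
The problem with the p-value is that it says nothing about how large the difference is: with a big enough sample, even a tiny difference gives a small p-value. So we also need the effect size.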

* Calculating p-value:  
probability of the observed event + probability of events that are equally rare (area under the distribution) + probability of events that are rarer (area under the distribution)
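
As a worked example of this sum, the sketch below computes a two-sided p-value for coin flips using only the standard library. The scenario (8 heads in 10 flips of a supposedly fair coin) and the function names are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_p_value(k_obs, n, p=0.5):
    """Sum the probabilities of every outcome that is as rare as,
    or rarer than, the observed one under the null hypothesis."""
    p_obs = binom_pmf(k_obs, n, p)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        prob
        for prob in (binom_pmf(k, n, p) for k in range(n + 1))
        if prob <= p_obs * tolerance
    )

# 8 heads in 10 flips: outcomes 8, 9, 10 and the equally rare 2, 1, 0.
print(two_sided_p_value(8, 10))  # 0.109375
```

With alpha = 0.05 this result would not be enough to reject the hypothesis that the coin is fair.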

* Power:  
the probability that a hypothesis test finds an effect when there is an effect to be found,  
or equivalently the probability of correctly getting a small p-value and correctly rejecting the null hypothesis. 


* Power analysis:  
tells how large a sample size we need to reach a desired power.
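
One way to get a feel for power and sample size is simulation. The sketch below is a Monte Carlo estimate, not a closed-form power analysis: it assumes two normally distributed groups with unit variance, a standardized effect size of 0.5 and an independent t-test, all chosen for illustration. The function name `simulated_power` is hypothetical.

```python
import numpy as np
from scipy import stats

def simulated_power(n_per_group, effect_size, alpha=0.05, n_sims=2_000, seed=0):
    """Fraction of simulated experiments in which the t-test rejects
    the null hypothesis when a real effect of the given size exists."""
    rng = np.random.default_rng(seed)
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treated = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))
    p_values = stats.ttest_ind(control, treated, axis=1).pvalue
    return float(np.mean(p_values < alpha))

# Power grows with sample size; pick the smallest n that reaches the target.
for n in (20, 40, 64, 100):
    print(n, simulated_power(n, effect_size=0.5))
```

Scanning sample sizes like this and stopping at the first one that reaches the target power (0.8 is a common convention) is a simple form of power analysis.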

>Compared to the variance of the Maximum Likelihood Estimate (MLE), the variance of the Maximum A Posteriori (MAP) estimate is lower, because the prior acts as a regularizer.


>Bayes’ Theorem gives you the posterior probability of an event given what is known as prior knowledge.  
Mathematically, it’s expressed as the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of a condition. Say a flu test comes back positive 60% of the time for people who have the flu, it also comes back positive 50% of the time for people who do not have the flu, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after a positive test?  
Bayes’ Theorem says no. It says that you have a (0.6 × 0.05) (true positive rate of a condition sample) / ((0.6 × 0.05) (true positive rate of a condition sample) + (0.5 × 0.95) (false positive rate of a population)) = 0.0594, or a 5.94% chance of having the flu.  
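
The flu calculation can be reproduced in a few lines. The function name `posterior` and its parameter names are hypothetical; the numbers are the ones used in the example above (5% prevalence, 0.6 true positive rate, 0.5 false positive rate).

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """P(condition | positive test) from Bayes' theorem."""
    # Total probability of a positive test, with or without the condition.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence

p_flu = posterior(prior=0.05, true_positive_rate=0.6, false_positive_rate=0.5)
print(round(p_flu, 4))  # 0.0594
```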

---

## Statistical Tests

- __T-test (Independent and Paired)__:  
 Used to compare the means of two groups or samples. Independent t-test is for comparing two independent groups, while the paired t-test is for comparing two related groups (e.g., before and after measurements).
    + Assumes normality of the data.
    + Assumes homogeneity of variances (equal variances) for the independent t-test.
    + Sensitive to outliers.
    + Not suitable for categorical data.
- __ANOVA (Analysis of Variance)__:  
 Used to compare the means of three or more groups or samples. One-way ANOVA is for comparing groups with a single independent variable, while two-way ANOVA is for comparing groups with two independent variables.
    + Assumes normality of the data.
    + Assumes homogeneity of variances (equal variances) among groups.
    + Sensitive to outliers.
    + Not suitable for categorical data.
- __Chi-square test__:  
 Used to test the association between two categorical variables in a contingency table.  

    + Requires categorical data.
    + Assumes that observations are independent.
    + May not be accurate for small sample sizes or when expected frequencies are very low.
- __Fisher's exact test__:  
 Similar to the Chi-square test, but used when the sample size is small, and the Chi-square test assumptions may not hold. 

    + Requires categorical data.
    + Assumes that observations are independent.
    + Computationally intensive for large sample sizes or large contingency tables.
- __Mann-Whitney U test__:  
 A non-parametric test used to compare the distributions of two independent groups when the data is not normally distributed.

    + Assumes that the distributions of both groups are similar in shape.
    + Not suitable for paired or related samples.
    + Less powerful than parametric tests when data is normally distributed.
- __Wilcoxon signed-rank test__:  
 A non-parametric test used to compare the distributions of two related groups when the data is not normally distributed.

    + Assumes that the differences between paired observations are symmetric.
    + Less powerful than parametric tests when data is normally distributed.
- __Kruskal-Wallis test__:  
 A non-parametric test used to compare the distributions of three or more independent groups when the data is not normally distributed.

    + Assumes that the distributions of all groups are similar in shape.
    + Less powerful than parametric tests when data is normally distributed.
- __Spearman's rank correlation__:  
 A non-parametric measure of the strength and direction of the association between two ranked variables.  

    + Assumes a monotonic relationship between variables.
    + Less powerful than Pearson's correlation when the relationship is linear.
- __Pearson's correlation__:  
 A measure of the strength and direction of the linear relationship between two continuous variables.  

    + Assumes a linear relationship between variables.
    + Sensitive to outliers.
    + Not suitable for non-linear relationships or non-continuous data.
- __Linear regression__:  
 A method for modeling the relationship between a dependent variable and one or more independent variables.  

    + Assumes a linear relationship between dependent and independent variables.
    + Assumes normality of residuals.
    + Assumes homoscedasticity (constant variance of residuals).
    + Sensitive to outliers and multicollinearity.
- __Logistic regression__:  
 A method for modeling the probability of a binary outcome based on one or more independent variables.  

    + Assumes a linear relationship between the logit of the outcome and the independent variables.
    + Requires a large sample size for accurate estimates.
    + Not suitable for continuous outcomes or non-binary categorical outcomes.
- __Time series analysis__:
    + Assumes that the time series is stationary or can be made stationary through transformations.
    + Requires a sufficiently long time series for accurate modeling.
    + May not be suitable for time series with complex or non-linear patterns.
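
For orientation, most of the tests in the list above have a direct counterpart in `scipy.stats`. The sketch below runs several of them on synthetic data; the group means, sample sizes and the 2x2 contingency table are made up, and regression and time series models are left out because they usually live in other libraries.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=40)  # synthetic group A
b = rng.normal(11.0, 2.0, size=40)  # synthetic group B (same length, so it can be paired)
c = rng.normal(12.0, 2.0, size=40)  # synthetic group C

results = {
    "independent t-test": stats.ttest_ind(a, b).pvalue,
    "paired t-test": stats.ttest_rel(a, b).pvalue,
    "one-way ANOVA": stats.f_oneway(a, b, c).pvalue,
    "Mann-Whitney U": stats.mannwhitneyu(a, b, alternative="two-sided").pvalue,
    "Wilcoxon signed-rank": stats.wilcoxon(a, b).pvalue,
    "Kruskal-Wallis": stats.kruskal(a, b, c).pvalue,
}

# Categorical data: rows are groups, columns are outcome counts.
table = np.array([[12, 8], [5, 15]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
odds_ratio, fisher_p = stats.fisher_exact(table)
results["chi-square"] = chi2_p
results["Fisher's exact"] = fisher_p

# Correlation: linear (Pearson) vs. rank-based (Spearman).
pearson_r, pearson_p = stats.pearsonr(a, b)
spearman_rho, spearman_p = stats.spearmanr(a, b)

for name, p in results.items():
    print(f"{name}: p = {p:.4f}")
```

Which test is appropriate still depends on the assumptions listed for each one; the code only shows how they are called.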

----

## Questions

1. __In an A/B test, how can you check if assignment to the various buckets was truly random?__  
Plot the distributions of multiple features for both A and B and make sure that they have the same shape.  
More rigorously, we can conduct a permutation test to see if the distributions are the same.  
We can also run a MANOVA to compare the means of several features across the buckets at once.
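
A permutation test for one feature might look like the sketch below. It compares the mean of a hypothetical pre-experiment covariate (user age) between the two buckets; the data, the function name and the choice of the difference in means as the test statistic are assumptions for illustration.

```python
import numpy as np

def permutation_p_value(x_a, x_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = np.random.default_rng(seed)
    observed = abs(x_a.mean() - x_b.mean())
    pooled = np.concatenate([x_a, x_b])
    n_a = len(x_a)
    count = 0
    for _ in range(n_permutations):
        # Reassign bucket labels at random and recompute the statistic.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:n_a].mean() - shuffled[n_a:].mean())
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (count + 1) / (n_permutations + 1)

rng = np.random.default_rng(1)
age_a = rng.normal(35, 10, size=500)  # hypothetical covariate in bucket A
age_b = rng.normal(35, 10, size=500)  # hypothetical covariate in bucket B

print(permutation_p_value(age_a, age_b))
```

A very small p-value on a feature that was fixed before the experiment started suggests the assignment was not random with respect to that feature.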

2. __How would you run an A/B test if the observations are extremely right-skewed?__  
- Choose an appropriate metric:  
Instead of using the mean, which can be heavily influenced by extreme values in right-skewed data, consider using the median or other robust metrics that are less sensitive to outliers.  
- Transform the data:  
Apply a transformation, such as the Box-Cox or log transformation, to reduce skewness and make the data more symmetric. This can help stabilize variance and make the data more suitable for statistical analysis.  
- Use non-parametric tests:  
Non-parametric tests, such as the Mann-Whitney U test or the Kolmogorov-Smirnov test, do not rely on assumptions about the underlying data distribution and can be more appropriate for analyzing skewed data.  
- Bootstrap or permutation tests:  
These resampling-based methods can provide valid inferences without relying on distributional assumptions. They can be particularly useful when dealing with skewed data or small sample sizes. 
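
The options above can be sketched on synthetic right-skewed data. The lognormal parameters, the sample sizes and the function name `bootstrap_median_diff` are made up for illustration, and the bootstrap shown is the simple percentile variant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=3.0, sigma=1.0, size=2_000)  # synthetic, right-skewed
variant = rng.lognormal(mean=3.1, sigma=1.0, size=2_000)

# Transform the data: t-test on the log scale.
log_t_p = stats.ttest_ind(np.log(control), np.log(variant)).pvalue

# Non-parametric test on the raw values.
mwu_p = stats.mannwhitneyu(control, variant, alternative="two-sided").pvalue

def bootstrap_median_diff(a, b, n_boot=2_000, seed=0):
    """Percentile bootstrap interval for median(b) - median(a)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and compare medians.
        resampled_a = rng.choice(a, size=len(a))
        resampled_b = rng.choice(b, size=len(b))
        diffs[i] = np.median(resampled_b) - np.median(resampled_a)
    return np.percentile(diffs, [2.5, 97.5])

ci_low, ci_high = bootstrap_median_diff(control, variant)
print(log_t_p, mwu_p, (ci_low, ci_high))
```

If the bootstrap interval for the median difference excludes 0, that points to a difference between the variants that is not driven by a few extreme values.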
