# Regression

<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-MML-AM_CHTML"></script>

<h1>Root Mean Square Error (RMSE)</h1> 
<ul>
    <li>Measures the typical size of the model's prediction error for an observation, in the same units as the outcome</li>
    <li>The lower the better</li>
</ul>
<p>\(RMSE = \sqrt{\frac{{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}}{n}}\)</p>
<p>where \(y_i\) represents the observed values, \(\hat{y}_i\) represents the predicted values, and \(n\) represents the total number of observations.</p>
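
The RMSE formula above can be sketched in plain Python. The function name `rmse` and the sample values are made up for illustration; libraries such as scikit-learn provide equivalent helpers.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

# Hypothetical observed and predicted values
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(rmse(y_true, y_pred))  # square root of 3.5 / 4
```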


In [4]:
from IPython.display import Latex


<h1>Residual Standard Error (RSE)</h1> 
<ul>
    <li>Similar to RMSE; the only difference is that the denominator is the degrees of freedom (\(n-p-1\)) rather than \(n\). The difference between RMSE and RSE is very small in big-data applications, where \(n\) is much larger than \(p\).</li>
    <li>The lower the better</li>
</ul>
<p>\(RSE = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-p-1}}\)</p>
<p>where \(y_i\) represents the observed values, \(\hat{y}_i\) represents the predicted values, \(n\) represents the total number of observations, and \(p\) represents the number of predictors.</p>
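
A minimal sketch of the RSE formula, assuming the residuals come from a model with `n_predictors` predictors plus an intercept. The function name and data are hypothetical. The last lines show why the gap to RMSE vanishes for large \(n\): the two differ only by the factor \(\sqrt{n/(n-p-1)}\).

```python
import math

def rse(y_true, y_pred, n_predictors):
    """Residual standard error with n - p - 1 degrees of freedom."""
    n = len(y_true)
    dof = n - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than estimated coefficients")
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    return math.sqrt(rss / dof)

# Hypothetical values, assuming a model with a single predictor
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
print(rse(y_true, y_pred, n_predictors=1))

# Ratio RSE / RMSE for a large data set: very close to 1
n, p = 1_000_000, 10
print(math.sqrt(n / (n - p - 1)))
```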


<h1>Mean Absolute Error (MAE)</h1> 
<ul>
    <li>Measures the average absolute error made by the model in predicting the outcome for an observation</li>
    <li>The lower the better</li>
    <li>Less sensitive to outliers than RMSE, because the errors are not squared</li>
</ul>

<p>\(MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\)</p>
<p>where \(y_i\) represents the observed values, \(\hat{y}_i\) represents the predicted values, and \(n\) represents the total number of observations.</p>
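
To illustrate the outlier point, here is a small sketch comparing MAE and RMSE on two hypothetical sets of predictions that differ in a single large error. All names and numbers are invented for the example.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true))

y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
y_pred_clean = [11.0, 11.0, 12.0, 12.0, 13.0]    # every error has size 1
y_pred_outlier = [11.0, 11.0, 12.0, 12.0, 22.0]  # last error has size 10

# Without the outlier both metrics agree
print(mae(y_true, y_pred_clean), rmse(y_true, y_pred_clean))
# With the outlier RMSE grows much more than MAE
print(mae(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier))
```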


<ul>
    <li><b>Prediction Interval</b>: uncertainty around a single predicted value</li>
    <li><b>Confidence Interval</b>: uncertainty around a mean or other statistic calculated from multiple values</li>
</ul>
<img src="https://biologyforfun.files.wordpress.com/2015/06/bootglmm1.png" alt="Plot Image">
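
One common way to obtain a confidence interval for a statistic is the percentile bootstrap: resample the data with replacement many times and take quantiles of the recomputed statistic. The sketch below assumes a simple one-sample setting; the function name, the sample, and the defaults (`n_resamples`, `alpha`, `seed`) are illustrative choices, not a prescribed method.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = sorted(
        stat([rng.choice(values) for _ in range(n)]) for _ in range(n_resamples)
    )
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = max(lower_idx, int(n_resamples * (1 - alpha / 2)) - 1)
    return estimates[lower_idx], estimates[upper_idx]

# Hypothetical sample
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 5.1, 4.4, 6.2, 5.0]
lower, upper = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Passing a different `stat` (for example `statistics.median`) gives an interval for that statistic instead.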


# Classification

<table>
    <tr>
        <td colspan="2"></td>
        <td colspan="2"><strong>Actual Class</strong></td>
    </tr>
    <tr>
        <td colspan="2"></td>
        <td>Positive</td>
        <td>Negative</td>
    </tr>
    <tr>
        <td rowspan="2"><strong>Predicted Class</strong></td>
        <td>Positive</td>
        <td>True Positive (TP)</td>
        <td>False Positive (FP)</td>
    </tr>
    <tr>
        <td>Negative</td>
        <td>False Negative (FN)</td>
        <td>True Negative (TN)</td>
    </tr>
</table>


<ul>
    <li><b>Recall (sensitivity):</b> the percent or proportion of all 1's that are correctly classified as 1 (how good is the model at finding the positives?) $recall = \frac{TP}{y=1} = \frac{TP}{(TP+FN)}$</li>
    <li><b>Accuracy:</b> the percent or proportion of cases classified correctly (how often is the model correct?) $accuracy = \frac{(TP + TN)}{total}$</li>
    <li><b>Precision:</b> the percent or proportion of predicted 1's that are actually 1 (how many of the predicted positives are truly positive?) $precision = \frac{TP}{\hat{y}=1} = \frac{TP}{(TP+FP)}$</li>
    <li><b>Specificity:</b> the percent or proportion of all 0's that are correctly classified as 0 $specificity = \frac{TN}{y=0} = \frac{TN}{(TN+FP)}$</li>
    <li><b>Prevalence:</b> the proportion of positive cases in the actual dataset $prevalence = \frac{y=1}{total} = \frac{TP+FN}{total}$</li>
    <li><b>F1-Score:</b> the harmonic mean of precision and recall. It accounts for both false positives and false negatives, and it is a good measurement for imbalanced data $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$</li>
    <li><b>ROC (Receiver Operating Characteristic) curve:</b> a graphical representation of the performance of a binary classification model. It illustrates the trade-off between the true positive rate (TPR, the recall) and the false positive rate (FPR, equal to 1 - specificity) for different classification thresholds</li>
    <li><b>Area Under the Curve (AUC):</b> the area under the ROC curve; the larger the value, the more effective the classifier (range 0-1, where 0.5 corresponds to random guessing)</li>
    <li><b>Lift:</b> measures how effective a model is at identifying the 1's. It is often calculated decile by decile, starting with the most probable 1's</li>
    <li><b>Type I Error (False Positive)</b>: It is rejecting a true null hypothesis. In other words, concluding 
        there is a significant effect or difference between groups when in fact there is none (mistakenly 
        concluding an effect is real when it is due to chance)</li>
    <ul>
        <li>Causes: small sample size, inadequate study design, biased sampling, confounding variables, 
            measurement errors</li>
    </ul>
    <li><b>Type II Error (False Negative)</b>: failing to reject a false null hypothesis. In other words, the 
        analysis says that there is no difference between the groups when there actually is a difference 
        (mistakenly concluding an effect is due to chance when it is real)</li> 
    <ul>
        <li>Causes: small sample size (low statistical power), small effect size (small real differences 
            may fail to reach statistical significance)</li> 
    </ul>
</ul>
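
The confusion-matrix formulas above can be sketched as one small function. The name `classification_metrics` and the counts are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "recall": recall,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "prevalence": (tp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 100 cases, 20 of them actually positive
metrics = classification_metrics(tp=15, fp=10, fn=5, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy (0.85) looks better than the precision (0.60), which is the usual reason to look beyond accuracy when positives are rare.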

![image.png](attachment:image.png) 

<a href="https://towardsdatascience.com/roc-curve-explained-using-a-covid-19-hypothetical-example-binary-multi-class-classification-bab188ea869c">Source</a>
![image.png](attachment:image.png)
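
As a rough sketch of how ROC points and AUC relate to model scores: each threshold gives one (FPR, TPR) point, and the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties counted as one half). The function names and the four-row data set are made up for illustration, and the pairwise loop is only practical for small data.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_point(y_true, scores, threshold=0.5))  # → (0.0, 0.5)
print(roc_auc(y_true, scores))                   # → 0.75
```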

<h1>Strategies for imbalanced data</h1>
<ul>
    <li>Undersample/downsample the dominant class</li>
    <li>Oversample/upsample the rare class by using bootstrapping</li>
    <li>Up/down weight the classes (in the loss function being optimized)</li>
    <li>Use SMOTE to create synthetic data (similar to existing rare cases)</li>
</ul>
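
A minimal sketch of the oversampling strategy, assuming a binary target: rare-class rows are drawn with replacement (bootstrapped) until both classes have the same size. The function name and the toy data are hypothetical. In practice this should be applied to the training split only, so that duplicated rows do not leak into validation data.

```python
import random

def oversample_minority(rows, labels, minority_label, seed=0):
    """Bootstrap the minority class until it matches the majority class size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no rows with the minority label")
    majority_count = len(labels) - len(minority)  # assumes a binary target
    n_extra = max(majority_count - len(minority), 0)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(rows) + extra, list(labels) + [minority_label] * n_extra

# Toy data: 8 majority rows (label 0) and 2 rare rows (label 1)
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = oversample_minority(rows, labels, minority_label=1)
print(new_labels.count(0), new_labels.count(1))  # → 8 8
```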

<ul>
    <li><b>The standard error (SE)</b> of the coefficients can be used to measure the reliability of a variable's 
        contribution to a model</li>
    <li><b>R Squared:</b> $R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$
    <ul>
        <li>The proportion of variation in the outcome that is explained by the predictor variables in multiple 
            regression models</li>
        <li>It also corresponds to the squared correlation between the observed outcome values and the values 
            predicted by the model</li>
    </ul></li>
    <li><b>Adjusted R Squared:</b> 
        <ul>
            <li>It penalizes the addition of variables that don't contribute significantly to the model's 
                predictive power</li>
            <li>It also helps avoid selecting complex models that might suffer from overfitting</li>
        </ul></li>
    <li><b>Akaike Information Criterion (AIC):</b></li>
    <ul>
        <li>Penalizes adding terms to a model</li>
        <li>The goal is to find the model that minimizes AIC (by dropping variables - backward 
            elimination/forward selection)</li>
        </ul>
    <li><b>Bayesian Information Criterion (BIC)</b></li>
    <ul>
        <li>It is a variant of AIC with a stronger penalty for adding variables to the model</li>
        <li>The lower the better</li>
    </ul>
    <li><b>Step-wise regression:</b> is a way to automatically determine which variables should be included in 
        the model</li>
    <li><b>Standardized residuals:</b> play a crucial role in diagnosing the adequacy of a regression model and 
        identifying potential issues such as outliers, violations of model assumptions, and nonlinear 
        relationships</li>
    <li><b>Weighted regression:</b> is used to give certain records more or less weight in fitting the equation</li>
    <li><b>Confounding variables:</b> an important predictor that, when omitted, leads to spurious relationships in 
        a regression equation</li>
    <li><b>Heteroskedasticity:</b> when some ranges of the outcome experience residuals with higher variance (may 
        indicate a predictor is missing from the equation)</li>
    <li><b>Cook's distance:</b> is a measure used in linear regression analysis to assess the influence of 
        individual data points on the regression model. It quantifies how much the predicted values of the 
        response variable change when a particular observation is excluded from the model.
        $D_i = \frac{\sum_{j=1}^{n}(\hat{y}_j - \hat{y}_{j(-i)})^2}{(p+1) \cdot \text{MSE}}$ <p>where 
        \(\hat{y}_{j(-i)}\) is the prediction for observation \(j\) from the model fitted without observation 
        \(i\), and \(p\) represents the number of predictors (so \(p+1\) coefficients are estimated, including 
        the intercept).</p></li>
    <li><b>Bootstrapped data:</b> is used to estimate the stability (variability) of the model parameters, or to 
    improve the predictive power.
    <ul>
        <li>An effective way to construct confidence intervals</li>
        <li>It is particularly useful when the underlying distribution of the statistic is unknown or when the 
            assumption of normality may not hold.</li>
        <li>Helps to communicate the potential error in an estimate, and perhaps to learn whether a larger sample is 
            needed</li>
    </ul></li>
</ul>
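
The model-selection statistics in this list can be sketched from a model's predictions. The function name and data are hypothetical. The AIC and BIC expressions used here are the common least-squares forms, \(n \ln(RSS/n) + 2k\) and \(n \ln(RSS/n) + k \ln(n)\), which drop additive constants; counting \(k\) as the predictors plus the intercept is an assumption of this sketch, and software packages may count parameters differently.

```python
import math

def fit_statistics(y_true, y_pred, n_predictors):
    """R^2, adjusted R^2, and least-squares forms of AIC/BIC (up to a constant)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    k = n_predictors + 1  # assumption: coefficients plus the intercept
    aic = n * math.log(ss_res / n) + 2 * k
    bic = n * math.log(ss_res / n) + k * math.log(n)
    return {"r2": r2, "adj_r2": adj_r2, "aic": aic, "bic": bic}

# Hypothetical observed values and predictions
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9, 7.1, 7.9]

one_predictor = fit_statistics(y_true, y_pred, n_predictors=1)
# Same predictions, but pretending a second, useless predictor was added
two_predictors = fit_statistics(y_true, y_pred, n_predictors=2)
print(one_predictor)
print(two_predictors)
```

Because the predictions are identical, R squared does not change between the two calls, while adjusted R squared drops and AIC and BIC rise for the larger model.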
        

<h1><b>Cross Validation (CV)</b></h1>
<ul>
<li><b>K-Fold</b>: the training data is split into K smaller sets (folds). The model is trained on K-1 folds of 
    the training set, and the remaining fold is used as a validation set to evaluate the model. This is repeated 
    K times so that each fold serves as the validation set once</li>
<li><b>Stratified K-fold</b>: in cases where classes are imbalanced we need a way to account for the imbalance in 
    both the train and validation sets. To do so we can stratify the target classes, meaning that both sets will 
    have approximately the same proportion of each class as the full data set. With the same number of folds, 
    the average CV score often improves on basic K-fold once the classes are stratified</li>
<li><b>Leave One Out (LOO)</b>: instead of selecting the number of splits in the training data set as in K-fold, 
    LOO uses 1 observation to validate and n-1 observations to train. This method is an exhaustive technique.</li>
<li><b>Leave P Out (LPO)</b>: a slight variation on the LOO idea, in that we can select the number of 
    observations P to use in our validation set.</li>
<li><b>Shuffle Split</b>: unlike K-fold, shuffle split can leave out a percentage of the data, so that it is used 
    in neither the train nor the validation set. To do so we must decide what the train and test sizes are, as 
    well as the number of splits.</li>
</ul>

There are more techniques, but the ones above are the most commonly used.
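
A minimal sketch of how K-fold splitting works, using index lists only. The generator name `k_fold_indices` is hypothetical, and the sketch does not shuffle, so the data are assumed to be in random order already. Setting `k` equal to the number of samples gives Leave One Out. In practice, scikit-learn's `sklearn.model_selection` module provides `KFold`, `StratifiedKFold`, `LeaveOneOut`, `LeavePOut`, and `ShuffleSplit`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(n_samples=10, k=3):
    print("train:", train, "validation:", validation)
```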


<h1>How to fix high bias/variance</h1>
<ul>
<li><b>High bias</b>
    <ul>
        <li>try adding polynomial features</li>
        <li>try getting additional features</li>
        <li>try decreasing the regularization parameter</li>
    </ul></li>
<li><b>High variance</b>
    <ul>
        <li>try increasing the regularization parameter</li>
        <li>try smaller sets of features</li>
        <li>try getting more training samples</li>
    </ul></li>
</ul>
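
To make the role of the regularization parameter concrete, here is a small sketch using ridge regression on a one-feature model without an intercept, where the closed-form estimate is \(w = \sum x_i y_i / (\sum x_i^2 + \lambda)\). The function name and data are made up for illustration. A larger \(\lambda\) shrinks the coefficient toward zero (a simpler model: less variance, more bias), and a smaller \(\lambda\) does the opposite.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for the model y ≈ w * x (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

# Hypothetical data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda = {lam:>5}: w = {ridge_slope(x, y, lam):.3f}")
```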