#### Correlation Coefficient:

    Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables measured on an interval or ratio scale.

    When we speak simply of a correlation coefficient, we are referring to the Pearson product-moment correlation. Generally, the correlation coefficient of a sample is denoted by r, and the correlation coefficient of a population is denoted by ρ (rho).
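
For reference, the sample Pearson correlation is commonly written as shown here. The notation is an assumption of this note rather than something stated above: $\bar{x}$ and $\bar{y}$ are the sample means, and $n$ is the number of paired observations.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$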
    

#### How to Interpret a Correlation Coefficient

    The sign and the absolute value of a correlation coefficient describe the direction and the magnitude of the relationship between two variables.
    
    (The absolute value of a number is its magnitude, without regard to its sign. For example, the absolute value of both -1 and 1 is 1, and the absolute value of both -2 and 2 is 2.)
    
    > The value of a correlation coefficient ranges between -1 and 1.
    > The greater the absolute value of the Pearson product-moment correlation coefficient, the stronger the linear relationship.
    > The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
    > The weakest linear relationship is indicated by a correlation coefficient equal to 0.
    > A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
    > A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.
    
    
    Keep in mind that the Pearson product-moment correlation coefficient only measures linear relationships. Therefore, a correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)
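
The point about linear versus curvilinear relationships can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `pearson_r` and the data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative x values, symmetric around zero.
xs = [-3, -2, -1, 0, 1, 2, 3]

r_line_up = pearson_r(xs, [2 * x + 1 for x in xs])     # exact increasing line
r_line_down = pearson_r(xs, [-2 * x + 1 for x in xs])  # exact decreasing line
r_parabola = pearson_r(xs, [x ** 2 for x in xs])       # exact curve, not a line

print(r_line_up, r_line_down, r_parabola)
```

In the third case, y is completely determined by x, yet the correlation is zero because the relationship has no linear component over this symmetric range.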
    
    
    

#### Scatterplots and Correlation Coefficients

    The scatterplots below show how different patterns of data produce different degrees of correlation.
    
   ![](img1.png)
   
       Several points are evident from the scatterplots.
       
       > When the slope of the line in the plot is negative, the correlation is negative; and vice versa.
       > The strongest correlations (r = 1.0 and r = -1.0) occur when data points fall exactly on a straight line.
       > The correlation becomes weaker as the data points become more scattered.
       > If the data points fall in a random pattern, the correlation is close to zero.
       > Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot. The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).
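
The effect of a single outlier can be sketched in a few lines. The data below are hypothetical and are not the data behind the scatterplots, so the resulting numbers will differ from those in the figure; `pearson_r` is an illustrative helper.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five points lying exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = pearson_r(xs, ys)

# The same points plus one observation far from that line.
r_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 2), round(r_outlier, 2))
```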





### Linear Regression:

    In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.
    
    


#### Prerequisites for Regression

    Simple linear regression is appropriate when the following conditions are satisfied.
    
    > The dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
    
    > For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
    
    > For any given value of X,
            > The Y values are independent, as indicated by a random pattern on the residual plot.
            > The Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is ok if the sample size is large. A histogram or a dotplot will show the shape of the distribution.
    
    

#### The Least Squares Regression Line

    Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set. Suppose Y is a dependent variable, and X is an independent variable. The regression line is:
    
                            y = mx + b
                            
                                or, as some books write it,
                               
                            y = b0 + b1x (the same line, with b0 = b and b1 = m)
                            
          where b0 is a constant (the intercept), b1 is the regression coefficient (the slope), x is the value of the independent variable, and y is the value of the dependent variable.
          
          
                                
                            

#### How to Define a Regression Line

    The regression parameters are computed from the sample data as follows. The two expressions for b1 are equivalent.

    b1 = Σ [ (xi - x̅)(yi - y̅) ] / Σ [ (xi - x̅)² ]
    
    b1 = r * (sy / sx)

    b0 = y̅ - b1 * x̅
    
    where b0 is the constant in the regression equation, b1 is the regression coefficient, r is the correlation between x and y, xi is the X value of observation i, yi is the Y value of observation i, x̅ is the mean of X, y̅ is the mean of Y, sx is the standard deviation of X, and sy is the standard deviation of Y.
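
The formulas above translate almost directly into code. This is a minimal sketch, not a production implementation: the function names `fit_line` and `slope_from_r` and the data are hypothetical, and sx and sy are taken to be sample standard deviations (dividing by n - 1).

```python
import math

def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

def slope_from_r(xs, ys):
    """Compute the same slope via b1 = r * (sy / sx)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    sx = math.sqrt(sxx / (n - 1))  # sample standard deviation of X
    sy = math.sqrt(syy / (n - 1))  # sample standard deviation of Y
    return r * sy / sx

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
print(b0, b1, slope_from_r(xs, ys))
```

Both routes give the same slope up to floating-point rounding, which is a quick way to check an implementation.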

#### Properties of the Regression Line

    When the regression parameters (b0 and b1) are defined as described above, the regression line has the following properties.
    
    > The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
    
    > The regression line passes through the point (x̅, y̅), that is, through the mean of the X values and the mean of the Y values.
    
    > The regression constant (b0) is equal to the y intercept of the regression line.
    
    > The regression coefficient (b1) is the average change in the dependent variable (Y) for a 1-unit change in the independent variable (X). It is the slope of the regression line.
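
Two of these properties are easy to check numerically. The sketch below uses hypothetical data and illustrative helper names (`fit_line`, `sse`); it confirms that the fitted line passes through the point of means and that nudging either parameter increases the sum of squared errors.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def sse(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted y values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)

# Property: the line passes through (mean of X, mean of Y).
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
passes_through_means = abs((b0 + b1 * mean_x) - mean_y) < 1e-9

# Property: any small change to b0 or b1 gives a larger sum of squares.
best = sse(xs, ys, b0, b1)
nudged = [
    sse(xs, ys, b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]

print(passes_through_means, all(s > best for s in nudged))
```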
    
  

#### The Coefficient of Determination

    The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
    
    > The coefficient of determination ranges from 0 to 1.
    
    > An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
    
    > An R² of 1 means the dependent variable can be predicted without error from the independent variable.
    
    > An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.
    
    If you know the linear correlation (r) between two variables, then the coefficient of determination (R²) is easily computed using the following formula: R² = r².
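
As a rough check of the relation R² = r² for a simple linear regression, the sketch below computes R² from the fitted line as 1 - SSE/SST and compares it with the squared correlation. The function name `r_and_r_squared` and the data are hypothetical.

```python
import math

def r_and_r_squared(xs, ys):
    """Return (r, R^2), with R^2 computed from the fitted line as 1 - SSE/SST."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sst = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    r = sxy / math.sqrt(sxx * sst)
    return r, 1 - sse / sst

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, r_squared = r_and_r_squared(xs, ys)
print(r ** 2, r_squared)
```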



#### Standard Error

    The standard error about the regression line (often denoted by SE) is a measure of the average amount that the regression equation over- or under-predicts. For a given data set, the higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.
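
The text above gives no formula for SE. One common definition for simple linear regression, assumed here, is the square root of the sum of squared residuals divided by n - 2. The sketch below uses that definition with hypothetical data; `standard_error` is an illustrative name.

```python
import math

def standard_error(xs, ys):
    """Standard error about the least squares line, using n - 2 degrees of freedom."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual: observed y minus the value predicted by the line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# Hypothetical data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

se = standard_error(xs, ys)
print(se)
```

The result is in the same units as Y, so it can be read as a typical size of a prediction error.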



| Stretch/Untouched | ProbDistribution | Accuracy |
| --- | --- | --- |
| Stretched | Gaussian | .843 |

### Fundamentals of Statistics


| <h3>Function</h3> | <h3>Symbol</h3> | <h3>Formula</h3> | <h3>Description</h3> |
| --- | --- | --- | --- |
| <h3>Average/Mean</h3> | <h3>$\large \mu \space or \space \bar {x}$</h3> | <h3>$\large \frac {\sum_{i = 1}^n {x_i}} {n}$</h3> | <h3>Average value of a series; n = number of observations</h3> |
| <hr></hr> | <hr></hr> | <hr></hr> | <hr></hr> |
| <h3>Variance</h3> | <h3>$\large s^2 \space or \space \sigma^2$</h3> | <h4>$\large \frac {\sum_{i \space = \space 1}^{n}(x_i \space - \bar{x})^2} {n \space - \space 1}$</h4> | <h3>Average squared deviation of the data from the mean (sample form, dividing by n - 1)</h3> |
| <hr></hr> | <hr></hr> | <hr></hr> | <hr></hr> |
| <h3>Std. Deviation</h3> | <h3>$\large s \space or \space \sigma$</h3> | <h4>$\large \sqrt {\frac {\sum_{i \space = \space 1}^{n}(x_i \space - \bar{x})^2} {n \space - \space 1}}$</h4> | <h3>Spread of the data around the mean, in the same units as the data</h3> |
| <hr></hr> | <hr></hr> | <hr></hr> | <hr></hr> |
| <h3>Z Score or Standardized Score</h3> | <h3>$\large z$</h3> | <h3>$\large \frac {x \space - \space \bar {x}} {s}$</h3> | <h3>How many standard deviations a value lies from the mean</h3> |
| <hr></hr> | <hr></hr> | <hr></hr> | <hr></hr> |
| <h3>Skew</h3> | <h3>$\large sk$</h3> | <h3>$\large \frac {\sum_{i \space = \space 1}^{n}(z_i)^3} {n \space - \space 1}$</h3> | <h4>0 means the distribution is symmetric<br>Usually a score between -1 and +1<br>Positive sk indicates positively skewed data<br>Negative sk indicates negatively skewed data</h4> |
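
The formulas in the table can be sketched in plain Python as follows. The function name `describe` and the data are hypothetical; the variance, standard deviation, and skew use the n - 1 denominators shown in the table, which is one of several conventions in use.

```python
import math

def describe(values):
    """Mean, sample variance, sample std. deviation, z-scores, and skew."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    std_dev = math.sqrt(variance)
    # z-score: how many standard deviations each value lies from the mean.
    z_scores = [(v - mean) / std_dev for v in values]
    # Skew as defined in the table: sum of cubed z-scores over n - 1.
    skew = sum(z ** 3 for z in z_scores) / (n - 1)
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "z_scores": z_scores,
        "skew": skew,
    }

# Hypothetical data with a longer right tail.
data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(data)
print(summary["mean"], summary["variance"], summary["std_dev"], summary["skew"])
```

For this data the skew comes out positive, consistent with the longer tail on the right.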