<style>
    .small {
        font-size: 5px;
    }
</style>
<h2>Structuring a problem</h2>
<h4 style="font-weight: 400; line-height:1.5;">
    The null hypothesis has to be related to a business problem. Moreover, in most scenarios, the alternate hypothesis presented is what the statistician or data scientist has set out to prove. 
    <strong>Therefore, the hypothesis testing should be approached from an alternate hypothesis point of view.</strong>
</h4>
<h4 style="font-weight:400; line-height: 1.5;">
    A company has data describing the efficiency scores of the in-person business meetings it holds company-wide. It is looking to use more online options, such as Google Meet and Zoom, to see whether these services could help improve meeting efficiency scores. As a data scientist, your job is to predict how effective these services could be. (Assume the efficiency scores of the in-person meetings have a mean of 50 (out of 100), with a standard deviation of 7.)
    <br />
    <br />
    <strong>H<span class="small">o</span></strong>: The efficiency of the online services will result in an efficiency score equal to or less than 50.
    <br />
    <strong>H<span class="small">a</span></strong>: The efficiency of the online services will result in an efficiency score greater than 50.
    <br />
    <i>*Only the null hypothesis can have an equality sign because we never prove the null hypothesis; we either reject it or fail to reject it.</i>
</h4>

<h2>Obtaining information about the new situation in the proposed problem</h2>
<h4 style="font-weight: 400; line-height: 1.5;">
    <ol>
        <li>Find a sample to gather information about the proposed problem.<br />Ex) Take a sample of 35 company-wide meetings that used online services and collect data about their efficiencies.</li>
        <li>Perform some analysis on the sample data. <br /> Ex) The mean efficiency score of the online meetings was 75. With a standard deviation of 7 and a sample size of 35, the standard error is 7 / sqrt(35), which is about 1.18.</li>
        <li>Establish an alpha value (level of significance) that will be used in the hypothesis test. <br />Ex) 0.05</li>
    </ol>
</h4>
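
<p>The standard error in the example can be reproduced with a few lines of Python. This is a minimal sketch that assumes the in-person standard deviation of 7 also applies to the online meetings; the variable names are illustrative only.</p>

```python
import math

population_sd = 7   # standard deviation given in the problem statement
sample_size = 35    # number of online meetings in the sample

# Standard error of the mean: sd / sqrt(n)
standard_error = population_sd / math.sqrt(sample_size)
print(round(standard_error, 2))  # 1.18
```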

<h2>Performing the hypothesis test</h2>
<h4 style="font-weight: 400; line-height: 1.5;">
    <ol>
        <li>For the example stated above, a z-test is the best way to compare the mean scores because the standard deviation is assumed to be known and the sample has more than 30 meetings. When we run the z-test, we find that the p-value (significance value) is 2.34<span class="small">E</span>-99.</li>
        <li>Since our p-value is much lower than our level of significance of 0.05, we reject the null hypothesis and conclude that the online services do increase the efficiency of company meetings.</li>
    </ol>
</h4>
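
<p>One way to run this one-sided z-test by hand is sketched below. It assumes the sample values from the example (mean 75, standard deviation 7, 35 meetings) and uses only the standard library; <code>scipy.stats.norm.sf</code> could be used for the upper-tail probability instead. The variable names are illustrative.</p>

```python
import math

sample_mean = 75     # mean score of the sampled online meetings
null_mean = 50       # mean under the null hypothesis
population_sd = 7    # standard deviation assumed known
sample_size = 35
alpha = 0.05         # level of significance

standard_error = population_sd / math.sqrt(sample_size)
z_score = (sample_mean - null_mean) / standard_error

# One-sided (upper-tail) p-value: P(Z >= z) = 0.5 * erfc(z / sqrt(2))
p_value = 0.5 * math.erfc(z_score / math.sqrt(2))

reject_null = p_value < alpha
print(f"z = {z_score:.2f}, p-value = {p_value:.3g}, reject H0: {reject_null}")
```

<p>The test is one-sided because the alternate hypothesis only claims that the score is greater than 50.</p>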

In [1]:
import scipy.stats as stats

<h3>Here we calculate what the critical value was for the z-test done above.</h3>

In [2]:
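alpha = 0.05          # level of significance
original_mean = 50    # mean under the null hypothesis
scale = 2             # spread of the sampling distribution (note: the worked example above gives a standard error of 1.18)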

In [3]:
# Upper-tail critical value: the point with probability alpha to its right
print(stats.norm.isf(alpha, loc=original_mean, scale=scale))

53.28970725390295


<h4 style="font-weight: 600; line-height: 2;">Any sample mean above 53.29 is statistically significant given an alpha of 0.05, an original mean of 50, and a scale of 2. Since our new mean was 75, the improvement from the online services is statistically significant.</h4>

<h2>Important Hypothesis Tests</h2>
<ul>
    <li>Z-test</li>
    <li>T-test</li>
    <li>ANOVA</li>
    <li>Welch's T-test</li>
    <li>Mann-Whitney U-Test</li>
    <li>Kruskal-Wallis H-test</li>
    <li>Pearson's Chi-square test</li>
    <li>Shapiro-Wilk Test</li>
</ul>
<i>Note: These hypothesis tests will be used in conjunction with the output variables to perform feature selection. We keep the features whose hypothesis tests have p-values less than 0.05. We do not immediately drop the features whose p-values are 0.05 or greater; instead, we combine them to create new features that could be more meaningful.</i>

<h2>Important Wrapper Methods</h2>
<ul>
    
</ul>
<a href="https://sebastianraschka.com/faq/docs/feature_sele_categories.html">Extra information about wrapper, embedded, and filter methods for feature selection</a>

<h2>Ways to Calculate Relationships between Variables</h2>
<ul>
    <li>Numerical - Numerical: Correlation</li>
    <li>Numerical - Categorical: ANOVA</li>
    <li>Categorical - Categorical: Chi-Squared</li>
</ul>

<h2>ANOVA</h2>
<ul>
    <li>Used for comparing means between more than 2 groups</li>
    <li>
        Key Terminology
        <ul>
            <li><strong>F-statistic</strong>: The variance between groups divided by the variance within groups. It will be close to 1 if the group means are equal. Under the null hypothesis, it follows an F-distribution.</li>
        </ul>
    </li>
</ul>
<p>
    H<span class="small">o</span>: U<span class="small">1</span> = U<span class="small">2</span> = U<span class="small">3</span>
</p>
<p>
    H<span class="small">a</span>: Not all the population means are the same
</p>
<p>
    <i>U is a population mean</i>
</p>
<h4>Steps</h4>
<ol>
    <li>Calculate the mean of each feature</li>
    <li>Calculate the <strong>within-group sample variance (from the Sum of Squares Within - SSW)</strong> across the features
        <br /> 
        <i>Equation: SSW = Sum over all features of Sum of((X<span class="small">i</span> - X<span class="small">mean</span>)**2), and the within-group variance is SSW / (nk - k)</i>
        <br />
        <i>Note: the degrees of freedom (dof) is number of instances per column * number of columns - number of columns (nk - k)</i>
    </li>
    <li>Steps to calculate the <strong>between sample variances (Sum of Squares Between - SSB)</strong>
        <ol>
            <li>Sum the data points of all the features together and divide by the total number of data points. <i>This is called the grand mean</i></li>
            <li>For each feature, subtract the grand mean from the feature mean and square the result. Then add the squared values together</li>
            <li>Multiply this value by the number of observations in one feature. <br />
                <i>Equation: Sum of((feature_mean-grand_mean)**2)*number_of_instances</i></li>
            <li>Calculate the Total Sum of Squares by adding the sum of squares between and the sum of squares within. <br />
                <i>Equation: SSW + SSB = TSS</i>
            </li>
        </ol>
        <br />
        <i>Note: the degrees of freedom (dof) is number of features - 1</i>
    </li>
    <li>
        Calculate the <strong>F-statistic</strong>
        <br />
        <i>Equation: (SSB / (k - 1)) / (SSW / (nk - k))</i>
        <br />
        <i>**k is the number of columns and n is the number of instances per column</i>
    </li>
    <li>Compare the F-statistic to the corresponding critical value in the F-statistic chart. If the F-statistic is larger than the critical value, reject the null hypothesis.
        <br />
        <a href="http://www.socr.ucla.edu/Applets.dir/F_Table.html">Link of F-statistic chart</a>
    </li>
</ol>

<i>**To calculate the proportion of variance explained by the features, divide SSB by TSS</i>
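
<p>The ANOVA steps above can be sketched in plain Python. The data below is hypothetical (three features with four instances each) and only illustrates the arithmetic. The critical value is read from an F-table for (2, 9) degrees of freedom at an alpha of 0.05.</p>

```python
from statistics import mean

# Hypothetical data: k = 3 features (groups), n = 4 instances per feature
groups = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [5, 6, 7, 8],
]
k = len(groups)       # number of columns
n = len(groups[0])    # number of instances per column

# Step 1: mean of each feature, plus the grand mean of all data points
group_means = [mean(g) for g in groups]
grand_mean = mean([x for g in groups for x in g])

# Step 2: Sum of Squares Within (SSW), dof = nk - k
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

# Step 3: Sum of Squares Between (SSB), dof = k - 1
ssb = n * sum((m - grand_mean) ** 2 for m in group_means)
tss = ssw + ssb

# Step 4: F-statistic = between-group variance / within-group variance
f_stat = (ssb / (k - 1)) / (ssw / (n * k - k))

# Step 5: compare with the F-table value for dof (2, 9) at alpha = 0.05
critical_value = 4.26
print(f"SSW = {ssw:.2f}, SSB = {ssb:.2f}, F = {f_stat:.2f}")
print("Reject H0:", f_stat > critical_value)
print(f"SSB / TSS = {ssb / tss:.3f}")
```

<p>This sketch assumes every feature has the same number of instances, which matches the equations above. With unequal group sizes, each squared difference in SSB would be weighted by its own group size.</p>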

<style>
    
    .indent {
        margin-left: 40px;
    }
    .small {
        font-size: 5px;
    }
</style>
<h2>Z-Test/T-Test</h2>
<div>
    <p>
        <span class="index">Often</span>, when you add or drop features from your dataset, you are doing so to improve the performance of your machine learning model. However, it is also important to determine whether those choices have a statistically significant impact on the final results. Therefore, it is common practice to gather scores from the machine learning models before and after adding or dropping certain features by running stratified splits multiple times.
     </p>
     <p>
    <span class="index">After</span> obtaining these scores, we have two samples: one with scores from before adjusting the features and one with scores from after adjusting them. We can use a 2-sample t-test (if the number of scores in each sample is less than 30) or a 2-sample z-test (if it is 30 or more) to determine whether the change was significant.
    </p>
</div>
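
<p>A 2-sample t-test on such scores could look like the sketch below. The scores are hypothetical, and the pooled-variance form of the t-statistic assumes both samples have similar variances. The critical value is read from a t-table for 8 degrees of freedom at a two-sided alpha of 0.05. <code>scipy.stats.ttest_ind</code> offers the same test with a p-value.</p>

```python
import math
from statistics import mean, variance

# Hypothetical model scores from repeated stratified splits
scores_before = [0.81, 0.79, 0.82, 0.80, 0.78]   # before adjusting features
scores_after = [0.84, 0.83, 0.86, 0.85, 0.82]    # after adjusting features

n1, n2 = len(scores_before), len(scores_after)
dof = n1 + n2 - 2

# Pooled variance: each sample variance weighted by its degrees of freedom
pooled_var = (
    (n1 - 1) * variance(scores_before) + (n2 - 1) * variance(scores_after)
) / dof

# t-statistic: difference in means divided by its standard error
std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(scores_after) - mean(scores_before)) / std_error

critical_value = 2.306   # t-table value for dof = 8, two-sided alpha = 0.05
print(f"t = {t_stat:.2f}, dof = {dof}")
print("Significant change:", abs(t_stat) > critical_value)
```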