__Problem Statement 1:__<br>
Is gender independent of education level? A random sample of 395 people was
surveyed, and each person was asked to report the highest education level they had
obtained. The survey data are summarized in the table below.<br>

Question: Are gender and education level dependent at the 5% level of significance? In
other words, given the data collected, is there a relationship between the
gender of an individual and the level of education they have obtained?
<table>
    <thead>
        <tr>
            <th>Gender</th>
            <th>High School</th>
            <th>Bachelors</th>
            <th>Masters</th>
            <th>Ph.D.</th>
            <th>Total</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Female</td>
            <td>60</td>
            <td>54</td>
            <td>46</td>
            <td>41</td>
            <td>201</td>
        </tr>
        <tr>
            <td>Male</td>
            <td>40</td>
            <td>44</td>
            <td>53</td>
            <td>57</td>
            <td>194</td>
        </tr>
        <tr>
            <td>Total</td>
            <td>100</td>
            <td>98</td>
            <td>99</td>
            <td>98</td>
            <td>395</td>
        </tr>
    </tbody>
</table>


In [1]:
import pandas as pd
import numpy as np

In [2]:
female_list = [60,54,46,41]
male_list = [40,44,53,57]

number_of_indiv = male_list + female_list
sex =  ['Male','Male','Male','Male','Female','Female','Female','Female']
edu = ['High School', 'Bachelors', 'Masters', 'Ph.d.','High School', 'Bachelors', 'Masters', 'Ph.d.']
data = pd.DataFrame({"Sex":sex,"Edu":edu,"No_of_People":number_of_indiv})

In [3]:
data

Unnamed: 0,Sex,Edu,No_of_People
0,Male,High School,40
1,Male,Bachelors,44
2,Male,Masters,53
3,Male,Ph.d.,57
4,Female,High School,60
5,Female,Bachelors,54
6,Female,Masters,46
7,Female,Ph.d.,41


In [4]:
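# Without `values`, crosstab counts rows of `data`, so every cell is 1
# (one row per Sex/Edu pair), not the number of people.
cross_tab = pd.crosstab(data.Sex, data.Edu, margins=True)
cross_tab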

Edu,Bachelors,High School,Masters,Ph.d.,All
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Female,1,1,1,1,4
Male,1,1,1,1,4
All,2,2,2,2,8


In [5]:

# Pass the counts as `values` and sum them to recover the contingency table.
data_new = pd.crosstab(data.Sex, data.Edu, values=data.No_of_People, aggfunc="sum", margins=True)

data_new.columns = ["Bachelors","High School","Masters","Ph.d.","Total_People"]

data_new.index = ["Female","Male","Total_columns"]

data_new

Unnamed: 0,Bachelors,High School,Masters,Ph.d.,Total_People
Female,54,60,46,41,201
Male,44,40,53,57,194
Total_columns,98,100,99,98,395


In [20]:
data_observed = data_new.iloc[:2,:-1]

In [21]:
data_observed

Unnamed: 0,Bachelors,High School,Masters,Ph.d.
Female,54,60,46,41
Male,44,40,53,57


Gender and education level are both categorical variables, so we can test for an association
between them with the chi-squared test of independence.<br>

Here,<br>
H0: Gender and education level are independent (there is no dependency between them).<br>
H1: Gender and education level are dependent.

Chi-squared test statistic - <math xmlns="http://www.w3.org/1998/Math/MathML">
  <msup>
    <mi>&#x3C7;</mi>
    <mn>2</mn>
  </msup>
  <mo>=</mo>
  <mo data-mjx-texclass="OP">&#x2211;</mo>
  <mo stretchy="false">(</mo>
  <mi>O</mi>
  <mo>&#x2212;</mo>
  <mi>E</mi>
  <msup>
    <mo stretchy="false">)</mo>
    <mn>2</mn>
  </msup>
  <mrow data-mjx-texclass="ORD">
    <mo>/</mo>
  </mrow>
  <mi>E</mi>
</math>
<br>
where,<br> 
O is observed data,<br>
E is expected frequency under null hypothesis<br>
E = (row total x column total)/sample size
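
As a minimal sketch of the expected-frequency formula above, the snippet below rebuilds the expected counts directly from the survey table with NumPy. The variable names (`observed`, `row_totals`, `col_totals`) are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1)  # one total per gender
col_totals = observed.sum(axis=0)  # one total per education level
n = observed.sum()                 # sample size (395)

# E[i, j] = row_total[i] * col_total[j] / n
expected = np.outer(row_totals, col_totals) / n

# Expected count for Female / High School: 201 * 100 / 395
print(round(expected[0, 0], 2))  # 50.89
```

The expected counts keep the same row and column totals as the observed table, which is a quick sanity check on the calculation.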

In [22]:
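# Outer product of the row totals (Female, Male) and the column totals
# (the four education levels), divided by the grand total (395).
data_expected = np.outer(data_new['Total_People'].iloc[:2], data_new.loc['Total_columns'].iloc[:4]) / data_new.loc['Total_columns', 'Total_People']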

In [23]:
data_expected

array([[49.86835443, 50.88607595, 50.37721519, 49.86835443],
       [48.13164557, 49.11392405, 48.62278481, 48.13164557]])

In [24]:
data_expected = pd.DataFrame(data_expected)
data_expected.columns= ["Bachelors","High School","Masters","Ph.d."]
data_expected.index = ["Female","Male"]
data_expected

Unnamed: 0,Bachelors,High School,Masters,Ph.d.
Female,49.868354,50.886076,50.377215,49.868354
Male,48.131646,49.113924,48.622785,48.131646


In [34]:
chi_square_test = (((data_observed-data_expected)**2)/data_expected).sum().sum()
print(chi_square_test)

8.006066246262538


The first `.sum()` returns the column sums; the second `.sum()` adds those column sums together, giving the total over all cells.

Degrees of freedom for the test of independence = (number of rows - 1) x (number of columns - 1). In the above
scenario, with 2 genders and 4 education levels, this is (2-1) x (4-1) = 1 x 3 = 3.
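
A small sketch, assuming the observed counts are held in a NumPy array named `observed` (an illustrative name), shows how the degrees of freedom can be derived from the shape of the table instead of being hard-coded:

```python
import numpy as np

# Observed counts (rows: Female, Male; columns: four education levels).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

n_rows, n_cols = observed.shape

# Degrees of freedom for a test of independence: (rows - 1) * (columns - 1)
dof = (n_rows - 1) * (n_cols - 1)
print(dof)  # 3
```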

In [39]:
import scipy.stats as stats

critical_value = stats.chi2.ppf(q=0.95, df = 3)

print("Critical Value: {}".format(critical_value))

pvalue = 1 - stats.chi2.cdf(x=chi_square_test, df=3) 
print("P-Value: {}".format(pvalue))

Critical Value: 7.814727903251179
P-Value: 0.04588650089174717


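Since the test statistic (8.0061) exceeds the critical value (7.8147) and, equivalently, the p-value (0.0459) is below 0.05, we have enough evidence to reject the null hypothesis at the 5% level of significance.<br>
We conclude that there is a dependency between gender and education level.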
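
As a cross-check on the manual calculation, the same test can be run with `scipy.stats.chi2_contingency`. This is an alternative route, not the method used in the cells above, and the variable names are illustrative. For a table with more than one degree of freedom SciPy applies no continuity correction, so the statistic should agree with the hand-computed value.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the table of expected counts under independence.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Chi-squared statistic: {:.4f}".format(chi2_stat))
print("P-value: {:.4f}".format(p_value))
print("Degrees of freedom: {}".format(dof))

alpha = 0.05
reject_null = p_value < alpha
print("Reject H0 at the 5% level:", reject_null)
```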

__Problem Statement 2:__<br>
Using the following data, perform a one-way analysis of variance using α = .05. Write up the results in APA format.

[Group1: 51, 45, 33, 45, 67]<br> [Group2: 23, 43, 23, 43, 45]<br> [Group3: 56, 76, 74, 87, 56]

We use ANOVA when the independent variable is categorical and the dependent variable is continuous.<br>
It lets us compare several groups at once by testing whether the mean of a numerical feature differs across the levels of a categorical feature.
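
As a quick reference point for the step-by-step calculation, a one-way ANOVA on the three groups can be run with `scipy.stats.f_oneway`. This is a sketch of a library shortcut, not the notebook's own derivation, and the lowercase variable names are illustrative.

```python
from scipy.stats import f_oneway

group1 = [51, 45, 33, 45, 67]
group2 = [23, 43, 23, 43, 45]
group3 = [56, 76, 74, 87, 56]

# One-way ANOVA: tests H0 that all three group means are equal.
f_stat, p_value = f_oneway(group1, group2, group3)

# Degrees of freedom: between groups = k - 1, within groups = N - k.
k = 3
n_total = len(group1) + len(group2) + len(group3)
df_between = k - 1
df_within = n_total - k

print("F({}, {}) = {:.2f}".format(df_between, df_within, f_stat))
print("P-value: {:.4f}".format(p_value))
print("Reject H0 at alpha = .05:", p_value < 0.05)
```

The F statistic, both degrees of freedom, and the p-value printed here are the quantities an APA-style write-up reports.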

In [40]:
Group1 = [51, 45, 33, 45, 67]
Group2 = [23, 43, 23, 43, 45]
Group3 = [56, 76, 74, 87, 56]

In [42]:
mean_group1 = np.mean(Group1)
mean_group2 = np.mean(Group2)
mean_group3 = np.mean(Group3)



35.4