# Exercises (Overview)

### For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then describe what a true positive, a true negative, a type I error, and a type II error would look like.

#### Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

#### Has the network latency gone up since we switched internet service providers?

* **null_hypothesis** = average network latency is **the same or lower** since switching internet service providers

* **alternate_hypothesis** = average network latency is **higher** since switching internet service providers

* **True Positive**: We conclude that latency **went up** after the switch, and it really did (we correctly reject the null hypothesis)

* **True Negative**: We conclude that latency **did NOT go up** after the switch, and it really did not (we correctly fail to reject the null hypothesis)

* **Type I Error (False Positive)**: We conclude that latency **went up** after the switch, but in reality it did not, so we were **WRONG**

* **Type II Error (False Negative)**: We conclude that latency **did NOT go up** after the switch, but in reality it did, so we were **WRONG**
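
The four outcomes above depend on two things: whether the null hypothesis is actually true, and whether the test rejected it. A minimal sketch of that mapping follows; the function name `classify_outcome` and its arguments are hypothetical and used only for illustration.

```python
def classify_outcome(null_is_true: bool, rejected_null: bool) -> str:
    """Name the outcome of a hypothesis test given the (usually unknown) truth."""
    if rejected_null and not null_is_true:
        return "true positive"
    if not rejected_null and null_is_true:
        return "true negative"
    if rejected_null and null_is_true:
        return "type I error (false positive)"
    return "type II error (false negative)"


# In the latency example, rejecting the null means concluding the switch changed latency.
print(classify_outcome(null_is_true=True, rejected_null=True))
print(classify_outcome(null_is_true=False, rejected_null=False))
```

In practice we never observe `null_is_true` directly, which is why alpha is chosen in advance to cap the type I error rate.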


#### Is the website redesign any good?

#### Is our television ad driving more sales?

# Exercises (Comparison of Groups)
## chi_squared_exercises

Continue working in your hypothesis_testing notebook.

In [1]:
import pandas as pd
import numpy as np

from pydataset import data

from scipy import stats
import env

### 1. Use the following contingency table to help answer the question of whether using a Macbook and being a Codeup student are independent of each other.

                             Codeup Student    Not Codeup Student
    Uses a Macbook                 49                  20
    Doesn't Use a Macbook           1                  30

In [47]:
data_dict = {'Codeup Student': [49,1], 'Not a Codeup Student':[20,30]}
data_dict
#could also make a list of lists

{'Codeup Student': [49, 1], 'Not a Codeup Student': [20, 30]}

In [48]:
observed = pd.DataFrame(data_dict, index=['Uses a Macbook','doesn\'t Use a Macbook'])
observed

Unnamed: 0,Codeup Student,Not a Codeup Student
Uses a Macbook,49,20
doesn't Use a Macbook,1,30


#### set my hypothesis and alpha

$H_0$: There is **no** relationship between being a Codeup student and using a Macbook.

$H_a$: There **is** a relationship between being a Codeup student and using a Macbook.

In [49]:
alpha = 0.05

##### calculate

In [50]:
stats.chi2_contingency(observed)

Chi2ContingencyResult(statistic=36.65264142122487, pvalue=1.4116760526193828e-09, dof=1, expected_freq=array([[34.5, 34.5],
       [15.5, 15.5]]))

In [51]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [52]:
expected

array([[34.5, 34.5],
       [15.5, 15.5]])
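
The expected frequencies are what each cell would hold if the two variables were independent: row total times column total, divided by the grand total. A small sketch that reproduces the array above using only NumPy (the variable names are illustrative):

```python
import numpy as np

# Observed counts from the contingency table
# (rows: Macbook use, columns: Codeup status)
observed_counts = np.array([[49, 20],
                            [1, 30]])

row_totals = observed_counts.sum(axis=1)   # one total per row
col_totals = observed_counts.sum(axis=0)   # one total per column
grand_total = observed_counts.sum()

# Under independence, expected[i, j] = row_total[i] * col_total[j] / grand_total
expected_manual = np.outer(row_totals, col_totals) / grand_total
print(expected_manual)
```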

In [53]:
observed.values

array([[49, 20],
       [ 1, 30]])

In [54]:
chi2, dof

(36.65264142122487, 1)

In [55]:
p

1.4116760526193828e-09

##### conclude

The p-value is **less than** alpha, so we reject the null hypothesis.

We can conclude there is a relationship between being a Codeup student and using a Macbook.
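
One detail worth knowing: for a 2x2 table (one degree of freedom), `stats.chi2_contingency` applies Yates's continuity correction by default, which is why the statistic above is 36.65 rather than the uncorrected value. The sketch below recomputes the corrected statistic by hand and compares it with `correction=False`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

observed_counts = np.array([[49, 20],
                            [1, 30]])

# Default call: Yates's correction is applied because dof == 1
corrected_stat, corrected_p, dof, expected_counts = stats.chi2_contingency(observed_counts)

# Yates's correction shrinks each |observed - expected| by 0.5 before squaring
manual_stat = (((np.abs(observed_counts - expected_counts) - 0.5) ** 2) / expected_counts).sum()

# Same test without the continuity correction
uncorrected_stat, uncorrected_p, _, _ = stats.chi2_contingency(observed_counts, correction=False)

print(f'corrected   chi^2 = {corrected_stat:.4f}')
print(f'manual      chi^2 = {manual_stat:.4f}')
print(f'uncorrected chi^2 = {uncorrected_stat:.4f}')
```

Either way the p-value is far below alpha, so the conclusion does not change.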

### 2. Choose another 2 categorical variables from the mpg dataset.

In [56]:
df = data('mpg')
df.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [57]:
df.nunique()

manufacturer    15
model           38
displ           35
year             2
cyl              4
trans           10
drv              3
cty             21
hwy             27
fl               5
class            7
dtype: int64

##### setup

#### Q: Does the class of the car affect how many cylinders it has?

##### State your null and alternative hypotheses.

$H_0$: Car class is **independent** of number of cylinders.

$H_a$: Car class is **dependent** on number of cylinders.

##### State your alpha.

In [58]:
alpha = 0.05

In [59]:
df.cyl.value_counts()

cyl
4    81
6    79
8    70
5     4
Name: count, dtype: int64

In [60]:
df.class.value_counts() # fails: class is a reserved word in Python, so use df['class'] instead

SyntaxError: invalid syntax (3473782885.py, line 1)

In [61]:
(df['class']).value_counts()

class
suv           62
compact       47
midsize       41
subcompact    35
pickup        33
minivan       11
2seater        5
Name: count, dtype: int64

In [62]:
observed = pd.crosstab(df.cyl, df['class'])
observed

class,2seater,compact,midsize,minivan,pickup,subcompact,suv
cyl,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
4,0,32,16,1,3,21,8
5,0,2,0,0,0,2,0
6,0,13,23,10,10,7,16
8,5,0,2,0,20,5,38


##### Perform a chi2 test of independence.

In [63]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [64]:
expected

array([[ 1.73076923, 16.26923077, 14.19230769,  3.80769231, 11.42307692,
        12.11538462, 21.46153846],
       [ 0.08547009,  0.8034188 ,  0.7008547 ,  0.18803419,  0.56410256,
         0.5982906 ,  1.05982906],
       [ 1.68803419, 15.86752137, 13.84188034,  3.71367521, 11.14102564,
        11.81623932, 20.93162393],
       [ 1.4957265 , 14.05982906, 12.26495726,  3.29059829,  9.87179487,
        10.47008547, 18.54700855]])

In [65]:
chi2, dof

(138.02824375973248, 18)

In [66]:
chi2

138.02824375973248

In [67]:
p

1.5351076620141522e-20

In [68]:
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 0 32 16  1  3 21  8]
 [ 0  2  0  0  0  2  0]
 [ 0 13 23 10 10  7 16]
 [ 5  0  2  0 20  5 38]]

Expected
[[ 1 16 14  3 11 12 21]
 [ 0  0  0  0  0  0  1]
 [ 1 15 13  3 11 11 20]
 [ 1 14 12  3  9 10 18]]

----
chi^2 = 138.0282
p     = 0.0000


##### State your conclusion

My p-value is **less** than alpha, therefore, we reject the null hypothesis.

We can conclude that car class is dependent on number of cylinders.
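
A caveat on this result: a common rule of thumb for the chi-squared test is that expected counts should be at least 5 in most cells, and several cells in the expected table above are well below that (the 5-cylinder row especially). The sketch below re-enters the observed crosstab by hand so it runs without `pydataset`, then counts the small expected cells. The threshold of 5 is the conventional guideline, not something stated in this exercise.

```python
import numpy as np
from scipy import stats

# Observed crosstab of cyl (rows: 4, 5, 6, 8) by class, copied from the output above
observed_counts = np.array([
    [0, 32, 16, 1, 3, 21, 8],
    [0, 2, 0, 0, 0, 2, 0],
    [0, 13, 23, 10, 10, 7, 16],
    [5, 0, 2, 0, 20, 5, 38],
])

chi2_stat, p_value, dof, expected_counts = stats.chi2_contingency(observed_counts)

# Count how many expected cells fall below the rule-of-thumb threshold
small_cells = int((expected_counts < 5).sum())
print(f'{small_cells} of {expected_counts.size} expected counts are below 5')
```

With this many sparse cells, the p-value is best treated as approximate. One option is to combine or drop rare categories (such as the four 5-cylinder cars) before testing.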

### 3. Use the data from the employees database to answer these questions:

In [10]:
from env import user, password, host  # credentials live in env.py

# This helper is defined in my env.py file. Importing user, password, and host
# lets the default arguments resolve without the `env.` prefix.
def get_db_url(db, user=user, password=password, host=host):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'


# Alternative from the Advanced DataFrames lesson: build the URL string directly.
# url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/employees'

# Alternative: pass every value explicitly (how I've done it before, needs more practice).
# def get_db_url(user, password, host, database):
#     return f'mysql+pymysql://{user}:{password}@{host}/{database}'

In [11]:
url = env.get_db_url('employees')

In [12]:
pd.read_sql('show tables', url)

Unnamed: 0,Tables_in_employees
0,departments
1,dept_emp
2,dept_manager
3,employees
4,salaries
5,titles


### Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

In [None]:
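query = '''
select *
from employees as e
    join dept_emp as de
        using (emp_no)
    join departments as d
        using (dept_no)
where de.to_date > now()
    and d.dept_name in ('Sales', 'Marketing')
'''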

In [6]:
df = pd.read_sql(query, url)


In [None]:
df.head()

##### setup

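$H_0$: Gender is **independent** of whether a current employee works in sales or marketing.

$H_a$: Gender is **dependent** on whether a current employee works in sales or marketing.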

##### calculate

In [None]:
df.gender.value_counts()

In [None]:
df.dept_name.value_counts()

In [None]:
observed = pd.crosstab(df.gender, df.dept_name)
observed

In [None]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [None]:
chi2, dof

In [None]:
p

##### conclude

My p-value is greater than alpha, therefore, we fail to reject the null hypothesis.

We do not have evidence of a relationship between gender and department (sales or marketing only).

### Is an employee's gender independent of whether or not they are or have been a manager?

In [None]:
query = '''
select *
from dept_manager
    right join employees
        using (emp_no)
;
'''

In [None]:
df = pd.read_sql(query, url)

In [None]:
df.head()

In [None]:
df.gender.value_counts()

In [None]:
df.to_date.value_counts(dropna=False) # dropna=False keeps the null values in the counts

# there are many columns we don't need, so this frame could be cleaned up

In [None]:
df.to_date.isnull() # True where to_date is null (the employee was never a manager)

In [None]:
df['is_manager'] = df.to_date.notnull()

In [None]:
df

In [None]:
df.is_manager.value_counts()
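
The `is_manager` flag works because a right join keeps every employee and leaves the `dept_manager` columns null for anyone who never managed a department. A toy version of the same idea with two hand-made DataFrames (the rows below are made up for illustration and are not from the `employees` database):

```python
import pandas as pd

# Hypothetical stand-ins for the employees and dept_manager tables
employees = pd.DataFrame({
    'emp_no': [1, 2, 3, 4],
    'gender': ['M', 'F', 'F', 'M'],
})
dept_manager = pd.DataFrame({
    'emp_no': [2, 4],
    'to_date': ['9999-01-01', '1991-10-01'],
})

# Equivalent of `dept_manager right join employees using (emp_no)`
toy = dept_manager.merge(employees, on='emp_no', how='right')

# A non-null to_date means the employee appears in dept_manager (is or was a manager)
toy['is_manager'] = toy.to_date.notnull()
print(toy[['emp_no', 'gender', 'is_manager']])
```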

##### setup

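$H_0$: There is **no** relationship between gender and being (or having been) a manager.

$H_a$: There **is** a relationship between gender and being (or having been) a manager.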

##### calculate

In [None]:
observed = pd.crosstab(df.is_manager, df.gender)
observed

In [None]:
chi2, p, dof, expected = stats.chi2_contingency(observed)

In [None]:
expected

In [None]:
chi2, dof

In [None]:
p

##### conclude

My p-value is greater than alpha, so we **fail** to reject the null hypothesis.

We do **NOT** have evidence of a relationship between gender and being a manager.

# Exercises (Correlation)

Continue working in your `hypothesis_testing` notebook.

### 1. Answer with the type of stats test you would use (assume normal distribution):

* Is there a relationship between the length of your arm and the length of your foot?

    -A: correlation (Pearson's r)

* Does smoking affect whether or not someone has lung cancer?

    -A: chi2

* Is gender independent of a person’s blood type?

    -A: chi2

* Does whether or not a person has a cat or dog affect whether they live in an apartment?

    -A: chi2

* Does the length of time of the lecture correlate with a student's grade?

    -A: correlation
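
For the two continuous-versus-continuous questions above, Pearson's r is available as `stats.pearsonr`, which returns the correlation coefficient and a p-value. A minimal sketch with made-up arm and foot lengths (the numbers are invented purely for illustration):

```python
from scipy import stats

# Hypothetical measurements in centimeters
arm_length = [60, 62, 65, 70, 72]
foot_length = [24, 25, 25.5, 27, 28]

# r is the strength and direction of the linear relationship; p tests H_0: r = 0
r, p = stats.pearsonr(arm_length, foot_length)
print(f'r = {r:.3f}, p = {p:.4f}')
```

The questions with two categorical variables use a chi-squared test instead, because Pearson's r assumes both variables are numeric.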

### 2. Use the `telco_churn` data.

In [13]:
url = env.get_db_url('telco_churn')

In [14]:
pd.read_sql('show tables', url)

Unnamed: 0,Tables_in_telco_churn
0,contract_types
1,customer_churn
2,customer_contracts
3,customer_details
4,customer_payments
5,customer_signups
6,customer_subscriptions
7,customers
8,internet_service_types
9,payment_types


* Does tenure correlate with monthly charges?

* Does tenure correlate with total charges?

* What happens if you control for phone and internet service?
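
One way to read "control for" here is to split the customers by service type and compute the correlation within each subgroup, so that differences between the groups do not drive the result. A sketch on a tiny made-up frame follows; the column names `tenure`, `monthly_charges`, and `internet_service` and all values are assumptions, not the actual `telco_churn` schema.

```python
import pandas as pd
from scipy import stats

# Hypothetical customers; replace with the real query result
customers = pd.DataFrame({
    'internet_service': ['DSL'] * 4 + ['Fiber'] * 4,
    'tenure': [1, 12, 24, 48, 2, 10, 30, 60],
    'monthly_charges': [45, 50, 55, 60, 80, 85, 95, 105],
})

# Pearson's r between tenure and monthly charges within each service type
results = {}
for service, group in customers.groupby('internet_service'):
    r, p = stats.pearsonr(group.tenure, group.monthly_charges)
    results[service] = (r, p)
    print(f'{service}: r = {r:.3f}, p = {p:.4f}')
```

Comparing the per-group values with the correlation on the whole frame shows how much of the overall relationship comes from the grouping variable itself.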

### 3. Use the `employees` database.

* Is there a relationship between how long an employee has been with the company and their salary?

* Is there a relationship between how long an employee has been with the company and the number of titles they have had?

### 4. Use the `sleepstudy` data.

* Is there a relationship between days and reaction time?