# =======================================================================================================
# Overview (3) START
# =======================================================================================================

##### For each of the following questions, formulate a null and alternative hypothesis (be as specific as you can be), then give an example of what a true positive, true negative, type I and type II errors would look like. Note that some of the questions are intentionally phrased in a vague way. It is your job to reword these as more precise questions that could be tested.

In [172]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from pydataset import data
import env
import datetime

### 1. Has the network latency gone up since we switched internet service providers?

In [None]:
# H0: Mean network latency is the same or lower since
# switching internet providers

# Ha: Mean network latency is higher since switching
# internet providers

# True Pos: Latency really increased, and we reject H0

# True Neg: Latency stayed the same or decreased, and
# we fail to reject H0

# Type I Error: Latency stayed the same or decreased,
# but we reject H0 and conclude that it increased

# Type II Error: Latency increased, but we fail to
# reject H0 and conclude that it stayed the same or decreased

### 2. Is the website redesign any good?

In [None]:
# H0: Reviews of the redesigned website are the same as
# or less positive than reviews of the previous design

# Ha: Reviews of the redesigned website are more positive
# than reviews of the previous design

# True Pos: Reviews really are more positive, and we
# reject H0

# True Neg: Reviews are the same or less positive, and
# we fail to reject H0

# Type I Error: Reviews are the same or less positive,
# but we reject H0 and conclude the redesign is better

# Type II Error: Reviews are more positive, but we fail
# to reject H0 and conclude the redesign is no better

### 3. Is our television ad driving more sales?

In [None]:
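# H0: Sales under the current TV ad are the same as or
# lower than sales under the last TV ad

# Ha: Sales under the current TV ad are higher than
# sales under the last TV ad

# True Pos: Sales really are higher, and we reject H0

# True Neg: Sales are the same or lower, and we fail
# to reject H0

# Type I Error: Sales are the same or lower, but we
# reject H0 and conclude the ad is driving more sales

# Type II Error: Sales are higher, but we fail to
# reject H0 and conclude the ad made no difference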
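
The four outcomes used in the answers above depend on two things: whether H0 is actually true, and whether the test rejected it. A minimal sketch of that mapping follows; the function name `classify_outcome` is hypothetical and only for illustration.

```python
def classify_outcome(h0_is_true: bool, rejected_h0: bool) -> str:
    """Name the outcome of a hypothesis test given the truth and the decision."""
    if rejected_h0 and not h0_is_true:
        return "true positive"
    if not rejected_h0 and h0_is_true:
        return "true negative"
    if rejected_h0 and h0_is_true:
        return "type I error"   # false alarm: rejected a true H0
    return "type II error"      # miss: failed to reject a false H0


# Question 1 (network latency), where H0 = latency is the same or lower.
# Latency did not go up, but the test said it did:
print(classify_outcome(h0_is_true=True, rejected_h0=True))
# Latency went up, but the test did not detect it:
print(classify_outcome(h0_is_true=False, rejected_h0=False))
```

The same mapping applies to the website and TV ad questions once their H0 is fixed.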

# =======================================================================================================
# Overview (3) END
# Overview (3) TO Comparison of Means (3)
# Comparison of Means (3) START
# =======================================================================================================

### 1. Answer with the type of test you would use (assume normal distribution):

##### 1a. Is there a difference in grades of students on the second floor compared to grades of all students?

In [None]:
# H0: Mean grade of 2nd-floor students equals the mean grade of all students
# Ha: Mean grade of 2nd-floor students differs from the mean grade of all students
# one sample
# two tail
# stats.ttest_1samp(second_floor_grades, overall_mean)

##### 1b. Are adults who drink milk taller than adults who don't drink milk?

In [None]:
# H0: Adults who drink milk are the same height as or shorter than those who do not
# Ha: Adults who drink milk are taller than those who do not
# two sample
# one tail
# stats.ttest_ind(drinkmilk, nomilk), then check t > 0 and p / 2 < alpha
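
A sketch of the one-tailed version of this test. It assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument. The height samples are simulated for illustration and are not real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical heights in cm (simulated, not real measurements).
drinkmilk = rng.normal(loc=172, scale=7, size=200)
nomilk = rng.normal(loc=170, scale=7, size=200)

# alternative='greater' tests Ha: mean(drinkmilk) > mean(nomilk).
t_stat, p_one_sided = stats.ttest_ind(drinkmilk, nomilk, alternative='greater')

# The default call is two-sided; its p-value is twice the one-sided
# p-value whenever the t-statistic points in the direction of Ha.
t_two, p_two_sided = stats.ttest_ind(drinkmilk, nomilk)

print(t_stat, p_one_sided, p_two_sided)
```

Using `alternative` avoids having to remember to check the sign of the t-statistic by hand.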

##### 1c. Is the price of gas higher in Texas or in New Mexico?

In [None]:
# H0: Gas price in New Mexico is equal to or lower than in Texas
# Ha: Gas price in New Mexico is higher than in Texas
# two sample
# one tail
# stats.ttest_ind(new_mexico, texas)

##### 1d. Are there differences in stress levels between students who take data science vs students who take web development vs students who take cloud academy?

In [None]:
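# H0: Mean stress level is the same for data science, web development and cloud academy students
# Ha: At least one program's mean stress level differs from the others
# three samples
# one-way ANOVA (F-test)
# stats.f_oneway(data_science, web_dev, cloud_academy)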
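
A minimal sketch of the ANOVA call for three groups. The stress scores are simulated and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stress scores for each program (not real data).
data_science = rng.normal(loc=60, scale=10, size=50)
web_dev = rng.normal(loc=60, scale=10, size=50)
cloud_academy = rng.normal(loc=60, scale=10, size=50)

# f_oneway takes one array per group and tests H0: all group means are equal.
f_stat, p_value = stats.f_oneway(data_science, web_dev, cloud_academy)

alpha = 0.05
print(f_stat, p_value, p_value < alpha)
```

A significant result only says that at least one mean differs; it does not say which one.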

### 2. Ace Realty wants to determine whether the average time it takes to sell homes is different for its two offices. A sample of 40 sales from office #1 revealed a mean of 90 days and a standard deviation of 15 days. A sample of 50 sales from office #2 revealed a mean of 100 days and a standard deviation of 20 days. Use a .05 level of significance.

In [52]:
# office 1 sample: 40
# office 1 mean: 90
# office 1 std: 15
# office 2 sample: 50
# office 2 mean: 100
# office 2 std: 20
# alpha: 0.05
office_stat, office_p = stats.ttest_ind_from_stats(90, 15, 40, 100, 20, 50)

In [54]:
office_stat < 0

True

In [55]:
office_p < 0.05

True

In [56]:
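# t-statistic < 0: True (office 1's sample mean is lower)
# p < alpha: True (reject H0 that the mean selling times are equal)
# The average time to sell differs; office 1 takes less time than office 2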
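
With its default `equal_var=True`, `ttest_ind_from_stats` uses the pooled-variance t-statistic. The sketch below recomputes that statistic by hand from the summary figures given in the question, as a check on the sign and size of the result.

```python
import math

# Summary statistics from the question.
n1, mean1, std1 = 40, 90, 15    # office 1
n2, mean2, std2 = 50, 100, 20   # office 2

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2)

# Standard error of the difference in means.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_stat = (mean1 - mean2) / std_err
dof = n1 + n2 - 2

print(round(t_stat, 3), dof)  # prints -2.625 88
```

Because the two standard deviations differ (15 vs 20), passing `equal_var=False` to run Welch's test would also be a reasonable choice.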

### 3. Load the mpg dataset and use it to answer the following questions:

In [3]:
mpg = data('mpg')

In [4]:
mpg.sample()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
12,audi,a4 quattro,2.8,1999,6,auto(l5),4,15,25,p,compact


##### 3a. Is there a difference in fuel-efficiency in cars from 2008 vs 1999?

In [None]:
# H0: Mean combined mpg of 2008 cars is equal to or lower than that of 1999 cars
# Ha: Mean combined mpg of 2008 cars is higher than that of 1999 cars
# Two samples
# One tail

In [9]:
mpg['combined_mpg'] = (mpg.cty + mpg.hwy) / 2

In [11]:
mpg.sample()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,combined_mpg
188,toyota,camry solara,2.2,1999,4,manual(m5),f,21,29,r,compact,25.0


In [13]:
mpg_1999 = mpg[mpg.year == 1999].combined_mpg
mpg_2008 = mpg[mpg.year == 2008].combined_mpg

In [39]:
# Levene p < 0.05: False (no evidence of unequal variances)
stats.levene(mpg_2008, mpg_1999)

LeveneResult(statistic=0.033228136671080453, pvalue=0.855517569468803)

In [40]:
mpg_stat, mpg_p = stats.ttest_ind(mpg_2008, mpg_1999, equal_var=False)

In [41]:
mpg_stat > 0

False

In [42]:
(mpg_p / 2) < 0.05

False

In [19]:
# Based on ttest_ind:
# we fail to reject the null hypothesis
# There is no evidence that 2008 cars get better mpg than 1999 cars
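
One way to tie the Levene result to the `equal_var` argument is to compute the flag from the Levene p-value instead of typing it by hand. The sketch below uses simulated groups, not the mpg data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated samples (placeholders for two groups of mpg values).
group_a = rng.normal(loc=20, scale=5, size=120)
group_b = rng.normal(loc=20, scale=5, size=110)

alpha = 0.05

# Levene's H0: the groups have equal variances.
_, levene_p = stats.levene(group_a, group_b)

# Keep the pooled-variance test unless Levene rejects equal variances;
# otherwise fall back to Welch's t-test.
equal_var = bool(levene_p >= alpha)

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
print(equal_var, t_stat, p_value)
```

Always using Welch's test (`equal_var=False`) is also a defensible default, since it remains valid when the variances happen to be equal.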

##### 3b. Are compact cars more fuel-efficient than the average car?

In [20]:
# H0: Compact cars have equal or less mpg than non-compact cars
# Ha: Compact cars have better mpg than non-compact cars
# Two samples
# One Tail

In [28]:
mpg['class'].value_counts()

suv           62
compact       47
midsize       41
subcompact    35
pickup        33
minivan       11
2seater        5
Name: class, dtype: int64

In [34]:
compact_mpg = mpg[mpg['class'] == 'compact'].combined_mpg
noncompact_mpg = mpg[mpg['class'] != 'compact'].combined_mpg

In [38]:
# Levene p < 0.05: True, so the variances are unequal
stats.levene(compact_mpg, noncompact_mpg)[1] < 0.05

True

In [46]:
# Levene indicated unequal variances, so use Welch's t-test
compact_stat, compact_p = stats.ttest_ind(compact_mpg, noncompact_mpg, equal_var=False)

In [49]:
(compact_p / 2) < 0.05

True

In [47]:
compact_stat > 0

True

In [None]:
# Levene p < 0.05: True (unequal variances)
# p / 2 < alpha: True
# t-statistic > 0: True
# Reject H0: compact cars do have better mpg than non-compact cars
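
Question 3b compares compact cars with "the average car", which can also be read as a one-sample test against the overall mean rather than a two-sample test against non-compact cars. A sketch of that alternative reading follows. It assumes SciPy 1.6 or later for the `alternative` argument and uses simulated values in place of the mpg data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated combined mpg values (not the real mpg dataset).
all_cars_mpg = rng.normal(loc=20, scale=5, size=234)
compact_mpg_sim = rng.normal(loc=24, scale=3, size=47)

# Treat the overall mean as the fixed reference value.
overall_mean = all_cars_mpg.mean()

# H0: compact mean <= overall mean; Ha: compact mean > overall mean.
t_stat, p_one_sided = stats.ttest_1samp(
    compact_mpg_sim, overall_mean, alternative='greater'
)

reject_h0 = bool(p_one_sided < 0.05)
print(t_stat, p_one_sided, reject_h0)
```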

##### 3c. Do manual cars get better gas mileage than automatic cars?

In [71]:
# H0: Manual cars have equal or less mpg than auto cars
# Ha: Manual cars have higher mpg than auto cars

# Levene p < 0.05: False (no evidence of unequal variances)
# t-statistic > 0: True
# p / 2 < alpha: True

# Reject H0 in favor of Ha:
# Manual cars get better mpg than auto cars

In [65]:
auto_mpg = mpg[mpg.trans.str.startswith('auto')].combined_mpg
manual_mpg = mpg[mpg.trans.str.startswith('manual')].combined_mpg

In [66]:
stats.levene(manual_mpg, auto_mpg)

LeveneResult(statistic=0.20075824847529639, pvalue=0.6545276355131857)

In [68]:
trans_stat, trans_p = stats.ttest_ind(manual_mpg, auto_mpg, equal_var=False)

In [69]:
trans_stat > 0

True

In [70]:
(trans_p / 2) < 0.05

True

# =======================================================================================================
# Comparison of Means (3) END
# Comparison of Means (3) TO Correlation (4)
# Correlation (4) START
# =======================================================================================================

### 1. Answer with the type of stats test you would use (assume normal distribution):

##### 1a. Is there a relationship between the length of your arm and the length of your foot?

In [None]:
# two variables
# Continuous vs Continuous
# stats.pearsonr(arm, foot)

##### 1b. Do guys and gals quit their jobs at the same rate?

In [None]:
# two samples
# Discrete vs Discrete
# stats.chi2_contingency(observed), where observed = pd.crosstab(gender, quit)

##### 1c. Does the length of time of the lecture correlate with a student's grade?

In [None]:
# two variables
# Continuous vs Continuous
# stats.pearsonr(lecture_length, grade)

### 2. Use the telco_churn data.

In [92]:
telco = pd.read_csv('telco_data.csv', sep = '\t', encoding='UTF-16')

In [93]:
telco.head()

Unnamed: 0,Add-on Count,Churn,Contract,Customer ID,Dependents,Device Protection,Gender,Internet Service,Multiple Lines,Online Backup,...,Tech Support,with Online Backup,with Online Security,with Streaming Movies,with Streaming TV,with Tech Support,Estimated Tenure(months),Monthly Charges,Tenure,Total Charges
0,0,No,Month-to-month,7590-VHVEG,,Device Wiithout Protection,Female,DSL,No phone service,Online Backup,...,Support Not Contacted,0,0,0,0,0,1.0,29.85,1,29.85
1,0,No,One year,5575-GNVDE,,Protected Device,Male,DSL,Single Line,Internet With No Backup,...,Support Not Contacted,0,0,0,0,0,33.0,56.95,34,1889.5
2,0,Yes,Month-to-month,3668-QPYBK,,Device Wiithout Protection,Male,DSL,Single Line,Online Backup,...,Support Not Contacted,0,0,0,0,0,2.0,53.85,2,108.15
3,0,No,One year,7795-CFOCW,,Protected Device,Male,DSL,No phone service,Internet With No Backup,...,Has a Support Ticket,0,0,0,0,0,43.0,42.3,45,1840.75
4,0,Yes,Month-to-month,9237-HQITU,,Device Wiithout Protection,Female,Fiber optic,Single Line,Internet With No Backup,...,Support Not Contacted,0,0,0,0,0,2.0,70.7,2,151.65


##### 2a. Does tenure correlate with monthly charges?

In [95]:
# Tenure: Discrete
# Monthly Charges: Continuous
# Correlation: weak positive (r ≈ 0.25), statistically significant
stats.pearsonr(telco.Tenure, telco['Monthly Charges'])

PearsonRResult(statistic=0.24789985628615002, pvalue=4.0940449915016345e-99)

##### 2b. Does tenure correlate with total charges?

In [105]:
telco['Total Charges'].isna().value_counts()

False    7032
True       11
Name: Total Charges, dtype: int64

In [108]:
total_charges_nona = telco[telco['Total Charges'].notna()]

In [110]:
# Tenure: Discrete
# Total Charges: Continuous
# Correlation: Yes, strong positive (r ≈ 0.83)
stats.pearsonr(total_charges_nona.Tenure, total_charges_nona['Total Charges'])

PearsonRResult(statistic=0.825880460933202, pvalue=0.0)

##### 2c. What happens if you control for phone and internet service?

In [117]:
# Variables to control for: Internet Service, Phone Service
# First check how customers are distributed across them
telco['Internet Service'].value_counts()

Fiber optic    3096
DSL            2421
No             1526
Name: Internet Service, dtype: int64

In [118]:
telco['Phone Service'].value_counts()

Subscribed      6361
Unsubscribed     682
Name: Phone Service, dtype: int64

In [120]:
intphonecross = pd.crosstab(telco['Internet Service'], telco['Phone Service'])
intphonecross

Internet Service,Phone Service: Subscribed,Phone Service: Unsubscribed
DSL,1739,682
Fiber optic,3096,0
No,1526,0


In [121]:
stats.chi2_contingency(intphonecross)

(1441.6233871976854,
 0.0,
 2,
 array([[2186.56552605,  234.43447395],
        [2796.20275451,  299.79724549],
        [1378.23171944,  147.76828056]]))
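
The chi-square test above shows that internet service and phone service are not independent of each other, but it does not say how the tenure correlation behaves within each service combination. Controlling for the two variables usually means recomputing the correlation inside each subgroup. The sketch below does that on a simulated frame that reuses the telco column names; the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
n = 300

# Simulated stand-in for the telco data (same column names, fake values).
df = pd.DataFrame({
    'Internet Service': rng.choice(['DSL', 'Fiber optic', 'No'], size=n),
    'Phone Service': rng.choice(['Subscribed', 'Unsubscribed'], size=n),
    'Tenure': rng.integers(1, 73, size=n),
})
df['Monthly Charges'] = 20 + 0.5 * df['Tenure'] + rng.normal(0, 10, size=n)

rows = []
for (internet, phone), group in df.groupby(['Internet Service', 'Phone Service']):
    if len(group) < 3:
        continue  # too few rows for a meaningful correlation
    r, p = stats.pearsonr(group['Tenure'], group['Monthly Charges'])
    rows.append({'internet': internet, 'phone': phone,
                 'n': len(group), 'r': r, 'p': p})

by_group = pd.DataFrame(rows)
print(by_group)
```

On the real data, the same loop over `telco` would give one correlation per service combination, which can then be compared with the overall value.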

### 3. Use the employees database.

##### 3a. Is there a relationship between how long an employee has been with the company and their salary?

In [200]:
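# Employment length
#     DATEDIFF(NOW(), hire_date)
# Salary
#     MAX(salary) - MIN(salary)
# Titles
#     COUNT(title)
# Correlation: none detected (r ≈ 0, p = 0.95)
# Caution: the three subqueries are listed without a join condition
# on emp_no, so SQL returns their cross product and LIMIT only
# truncates it; a row does not describe a single employee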
employees = pd.read_sql(
    '''
    SELECT 
        A.emp_no, 
        A.empdiffday, 
        B.salarydiff, 
        C.totaltitles
    FROM
        (SELECT emp_no, DATEDIFF(NOW(), hire_date) AS empdiffday FROM employees) AS A,
        (SELECT emp_no, (MAX(salary) - MIN(salary)) AS salarydiff FROM salaries GROUP BY emp_no) AS B,
        (SELECT emp_no, COUNT(title) AS totaltitles FROM titles GROUP BY emp_no) AS C
    LIMIT 300024    
    ''', env.get_db_url('employees'))

In [201]:
employees.shape

(300024, 4)

In [202]:
employees.sample(5)

Unnamed: 0,emp_no,empdiffday,salarydiff,totaltitles
197795,31063,11796,7046,1
35401,38865,10957,3693,1
257031,37931,11024,7017,1
158645,31189,12766,28901,1
286097,40813,12469,1413,1


In [203]:
stats.pearsonr(employees.empdiffday, employees.salarydiff)

PearsonRResult(statistic=0.00011117410060224185, pvalue=0.9514430346501251)

##### 3b. Is there a relationship between how long an employee has been with the company and the number of titles they have had?

In [204]:
employees.totaltitles.value_counts()

1    300024
Name: totaltitles, dtype: int64

In [206]:
test = pd.read_sql(
    '''
    SELECT emp_no, COUNT(title)
    FROM titles
    GROUP BY emp_no
    ''', env.get_db_url('employees'))

In [210]:
test

Unnamed: 0,emp_no,COUNT(title)
0,10001,1
1,10002,1
2,10003,1
3,10004,2
4,10005,2
...,...,...
300019,499995,1
300020,499996,2
300021,499997,2
300022,499998,2


In [213]:
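# Employment length
# Total titles
# Correlation: none detected (r ≈ 0.001, p = 0.44)
# Titles were queried separately because the combined query has no
# join condition, which left totaltitles at 1 on every row; the rows
# of `employees` and `test` are also not aligned on emp_no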
stats.pearsonr(employees.empdiffday, test['COUNT(title)'])

PearsonRResult(statistic=0.0014090830341339786, pvalue=0.4402242423263949)
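
The combined query in 3a lists its three subqueries separated by commas with no join condition, which in SQL produces a Cartesian product. The sketch below reproduces the effect on a tiny in-memory SQLite database and shows an explicit join on `emp_no` as one possible fix. The toy rows are invented, and SQLite stands in for the MySQL server, so `DATEDIFF` is left out.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE employees (emp_no INTEGER, hire_date TEXT);
    CREATE TABLE salaries (emp_no INTEGER, salary INTEGER);
    CREATE TABLE titles (emp_no INTEGER, title TEXT);
    INSERT INTO employees VALUES (1, '1990-01-01'), (2, '1995-06-15');
    INSERT INTO salaries VALUES (1, 50000), (1, 60000), (2, 40000);
    INSERT INTO titles VALUES (1, 'Engineer'), (1, 'Senior Engineer'), (2, 'Staff');
''')

# Comma-separated tables with no join condition: every combination of rows.
cross_rows = conn.execute('''
    SELECT COUNT(*)
    FROM employees AS A,
         (SELECT emp_no FROM salaries GROUP BY emp_no) AS B,
         (SELECT emp_no FROM titles GROUP BY emp_no) AS C
''').fetchone()[0]

# Explicit joins on emp_no: one row per employee.
rows = conn.execute('''
    SELECT A.emp_no, B.salarydiff, C.totaltitles
    FROM employees AS A
    JOIN (SELECT emp_no, MAX(salary) - MIN(salary) AS salarydiff
          FROM salaries GROUP BY emp_no) AS B ON A.emp_no = B.emp_no
    JOIN (SELECT emp_no, COUNT(title) AS totaltitles
          FROM titles GROUP BY emp_no) AS C ON A.emp_no = C.emp_no
    ORDER BY A.emp_no
''').fetchall()

print(cross_rows)  # 8 rows for only 2 employees
print(rows)        # [(1, 10000, 2), (2, 0, 1)]
```

With the join in place, the title count and salary range on each row belong to the same employee, so a single DataFrame can feed both correlation tests.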

### 4. Use the sleepstudy data.

##### 4a. Is there a relationship between days and reaction time?

In [123]:
slpstdy = data('sleepstudy')

In [125]:
# Correlation: Yes, moderate positive (r ≈ 0.54)
stats.pearsonr(slpstdy.Days, slpstdy.Reaction)

PearsonRResult(statistic=0.5352302262650253, pvalue=9.894096322214812e-15)

# =======================================================================================================
# Correlation (4) END
# Correlation (4) TO Comparison of Groups (4)
# Comparison of Groups (4) START
# =======================================================================================================

### 1. Answer with the type of stats test you would use (assume normal distribution):

##### 1a. Do students get better test grades if they have a rubber duck on their desk?

##### 1b. Does smoking affect whether or not someone has lung cancer?

##### 1c. Is gender independent of a person’s blood type?

##### 1d. A farming company wants to know if a new fertilizer has improved crop yield or not

##### 1e. Does the length of time of the lecture correlate with a student's grade?

##### 1f. Do people with dogs live in apartments more than people with cats?

### 2. Use the following contingency table to help answer the question of whether using a macbook and being a codeup student are independent of each other.

| | Codeup Student | Not Codeup Student |
| ----- | ------ | ------ |
| Uses a Macbook | 49 | 20 |
| Doesn't Use A Macbook | 1 | 30 |
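
A sketch of how the table above could be tested for independence with `stats.chi2_contingency`. H0 is that using a Macbook and being a Codeup student are independent. Note that SciPy applies Yates' continuity correction to 2x2 tables by default.

```python
import numpy as np
from scipy import stats

# Rows: uses a Macbook / doesn't; columns: Codeup student / not.
observed = np.array([[49, 20],
                     [1, 30]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05
print(chi2, p, dof)
print(expected)
print(p < alpha)  # True means we reject H0 of independence
```

The `expected` array holds the counts that would be expected if the two variables were independent, which makes it easy to see where the observed table departs from that.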

### 3. Choose another 2 categorical variables from the mpg dataset and perform a chi^2 contingency table test with them. Be sure to state your null and alternative hypotheses.

### 4. Use the data from the employees database to answer these questions:

##### 4a. Is an employee's gender independent of whether an employee works in sales or marketing? (only look at current employees)

##### 4b. Is an employee's gender independent of whether or not they are or have been a manager?

# =======================================================================================================
# Comparison of Groups (4) END
# Comparison of Groups (4) TO More Examples
# More Examples START
# =======================================================================================================

##### Choose several continuous and categorical variables that were not covered in the lesson and perform each type of test on them. You may use another data set if you wish.

# =======================================================================================================
# More Examples END
# =======================================================================================================