# Week 7 - DSC 530 - Emilio Flores

## Chapter 9 - Exercise 9.1
As sample size increases, the power of a hypothesis test increases, which means it is more likely to be positive if the effect is real. Conversely, as sample size decreases, the test is less likely to be positive even if the effect is real.

To investigate this behavior, run the tests in this chapter with different subsets of the NSFG data. You can use thinkstats2.SampleRows to select a random subset of the rows in a DataFrame.

What happens to the p-values of these tests as sample size decreases? What is the smallest sample size that yields a positive test?

In [57]:
import thinkstats2
import random
import first
import numpy as np
live, firsts, others = first.MakeFrames()

In [59]:
class PregLengthTest(thinkstats2.HypothesisTest):

    def MakeModel(self):
        firsts, others = self.data
        self.n = len(firsts)
        self.pool = np.hstack((firsts, others))

        pmf = thinkstats2.Pmf(self.pool)
        self.values = range(35, 44)
        self.expected_probs = np.array(pmf.Probs(self.values))

    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data
    
    def TestStatistic(self, data):
        firsts, others = data
        stat = self.ChiSquared(firsts) + self.ChiSquared(others)
        return stat

    def ChiSquared(self, lengths):
        hist = thinkstats2.Hist(lengths)
        observed = np.array(hist.Freqs(self.values))
        expected = self.expected_probs * len(lengths)
        stat = sum((observed - expected)**2 / expected)
        return stat

In [61]:
class CorrelationPermute(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        xs, ys = data
        test_stat = abs(thinkstats2.Corr(xs, ys))
        return test_stat

    def RunModel(self):
        xs, ys = self.data
        xs = np.random.permutation(xs)
        return xs, ys

In [63]:
class DiffMeansPermute(thinkstats2.HypothesisTest):

    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = abs(group1.mean() - group2.mean())
        return test_stat

    def MakeModel(self):
        group1, group2 = self.data
        self.n, self.m = len(group1), len(group2)
        self.pool = np.hstack((group1, group2))

    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data

In [41]:
def RunTests(live, iters=1000):
    """Runs the tests from Chapter 9 with a subset of the data.

    live: DataFrame
    iters: how many iterations to run
    """
    n = len(live)
    firsts = live[live.birthord == 1]
    others = live[live.birthord != 1]

    # compare pregnancy lengths
    data = firsts.prglngth.values, others.prglngth.values
    ht = DiffMeansPermute(data)
    p1 = ht.PValue(iters=iters)

    data = (firsts.totalwgt_lb.dropna().values,
            others.totalwgt_lb.dropna().values)
    ht = DiffMeansPermute(data)
    p2 = ht.PValue(iters=iters)

    # test correlation
    live2 = live.dropna(subset=['agepreg', 'totalwgt_lb'])
    data = live2.agepreg.values, live2.totalwgt_lb.values
    ht = CorrelationPermute(data)
    p3 = ht.PValue(iters=iters)

    # compare pregnancy lengths (chi-squared)
    data = firsts.prglngth.values, others.prglngth.values
    ht = PregLengthTest(data)
    p4 = ht.PValue(iters=iters)

    print('%d\t%0.2f\t%0.2f\t%0.2f\t%0.2f' % (n, p1, p2, p3, p4))

In [65]:
n = len(live)
for _ in range(7):
    sample = thinkstats2.SampleRows(live, n)
    RunTests(sample)
    n //= 2

9148	0.16	0.00	0.00	0.00
4574	0.43	0.01	0.00	0.00
2287	0.35	0.01	0.00	0.00
1143	0.07	0.13	0.09	0.01
571	0.33	0.58	0.06	0.15
285	0.77	0.70	0.01	0.04
142	0.09	0.76	0.06	0.00


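Each row shows the sample size followed by four p-values: difference in mean pregnancy length, difference in mean birth weight, correlation between mother's age and birth weight, and the chi-squared test of pregnancy length.

Results are more likely to be positive (p < 0.05) when the sample is larger, but the pattern is noisy and some smaller samples still yield positive results. For example, the smallest sample here, 142 rows, still gave a positive chi-squared test (p = 0.00), and the sample of 285 rows gave positive results for both the correlation test (p = 0.01) and the chi-squared test (p = 0.04). Because the subsets are random, these values change from run to run.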
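
The link between sample size and power can also be checked without the NSFG data. The sketch below is a minimal simulation, not the thinkstats2 implementation: it assumes two normally distributed groups whose means differ by a hypothetical effect of 0.3 standard deviations, runs a permutation test on the difference in means many times, and reports the fraction of runs with p < 0.05. The function names, the effect size, and the 5% threshold are illustrative assumptions.

```python
import numpy as np

def perm_pvalue(group1, group2, rng, iters=200):
    """Permutation p-value for the absolute difference in means."""
    observed = abs(group1.mean() - group2.mean())
    pool = np.concatenate([group1, group2])  # new array, inputs are not modified
    n = len(group1)
    count = 0
    for _ in range(iters):
        rng.shuffle(pool)
        stat = abs(pool[:n].mean() - pool[n:].mean())
        if stat >= observed:
            count += 1
    return count / iters

def estimate_power(n, effect=0.3, trials=100, alpha=0.05, seed=0):
    """Fraction of simulated experiments with p < alpha when the effect is real."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)     # group with mean 0
        b = rng.normal(effect, 1.0, n)  # group with mean shifted by `effect`
        if perm_pvalue(a, b, rng) < alpha:
            hits += 1
    return hits / trials

# Estimated power for three group sizes (hypothetical values of n)
power = {n: estimate_power(n) for n in (20, 80, 320)}
for n, value in power.items():
    print(n, value)
```

With a real effect of fixed size, the estimated power should rise with n, which is why small subsets produce positive results only some of the time.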

## Chapter 10 - Exercise 10.1

### Linear least squares fit for log(weight) vs. height
Using the data from the BRFSS, compute the linear least squares fit for log(weight) versus height. How would you best present the estimated parameters for a model like this where one of the variables is log-transformed? If you were trying to guess someone’s weight, how much would it help to know their height?

In [69]:
import brfss

df = brfss.ReadBrfss(nrows=None)
df = df.dropna(subset=['htm3', 'wtkg2'])
heights, weights = df.htm3, df.wtkg2
log_weights = np.log10(weights)

inter, slope = thinkstats2.LeastSquares(heights, log_weights)
inter, slope

(0.9930804163932863, 0.005281454169417785)

```
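The estimated parameters can be written as the fitted equation, with weight in kg and height in cm:
log10(weight) = 0.9931 + (0.0053 * height)
Because the response is log-transformed, the slope is easiest to read as a multiplicative (percentage) change in weight per additional cm of height.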
```
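
To make the log-scale slope easier to read, it can be converted to a percentage change. The short sketch below reuses the intercept and slope printed in the output above; the variable names for the derived quantities are illustrative.

```python
# Fitted values copied from the LeastSquares output above
inter = 0.9930804163932863
slope = 0.005281454169417785

# On a log10 scale, each extra cm adds `slope` to log10(weight),
# which multiplies the predicted weight by 10**slope.
factor_per_cm = 10 ** slope
pct_per_cm = (factor_per_cm - 1) * 100
pct_per_10cm = (10 ** (10 * slope) - 1) * 100

print(f"Each additional cm multiplies predicted weight by {factor_per_cm:.4f} "
      f"(about {pct_per_cm:.1f}% heavier)")
print(f"Each additional 10 cm: about {pct_per_10cm:.0f}% heavier")
```

Reporting the slope as a percentage per cm (or per 10 cm) avoids asking the reader to think in log10 units.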

```
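Height is helpful for guessing someone's weight because the fitted slope is positive: taller respondents tend to be heavier.
A positive slope alone does not show how much it helps; comparing the spread of the residuals with the spread of the weights would quantify the improvement.
As an example, what weight does the model predict for someone who is 170 cm tall?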
```

In [77]:
height = 170
weight = 10 ** (inter + (slope * height))
print(f"The weight of someone that is {height} cm tall is {round(weight, 2)} kg according to this model")

The weight of someone that is 170 cm tall is 77.79 kg according to this model
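
One way to measure how much height helps is to compare the spread of the weights (the error when guessing the mean for everyone) with the spread of the residuals from the fitted model. The sketch below shows the calculation on synthetic data, so the numbers it prints are not BRFSS results; the generating parameters are assumptions chosen only to resemble the fit above, and `np.polyfit` is swapped in for `thinkstats2.LeastSquares` to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the BRFSS columns (NOT real data)
heights = rng.normal(170, 10, 5000)                              # cm
log_weights = 1.0 + 0.005 * heights + rng.normal(0, 0.1, 5000)   # log10(kg)
weights = 10 ** log_weights                                      # kg

# Least squares fit of log10(weight) on height
slope, inter = np.polyfit(heights, log_weights, 1)

# Residuals on the original kg scale
predicted = 10 ** (inter + slope * heights)
residuals = weights - predicted

std_weights = np.std(weights)      # spread of errors when guessing the mean weight
std_residuals = np.std(residuals)  # spread of errors when guessing from height
reduction = 1 - std_residuals / std_weights

print(f"Std of weights:   {std_weights:.1f} kg")
print(f"Std of residuals: {std_residuals:.1f} kg")
print(f"Reduction:        {reduction:.1%}")
```

Running the same comparison on `weights` and the residuals of the BRFSS fit would give the corresponding figure for the real data.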


### Resampling
Like the NSFG, the BRFSS oversamples some groups and provides a sampling weight for each respondent. In the BRFSS data, the variable name for these weights is totalwt (in the DataFrame returned by brfss.ReadBrfss the column is named finalwt, which is what the code below uses). Use resampling, with and without weights, to estimate the mean height of respondents in the BRFSS, the standard error of the mean, and a 90% confidence interval. How much does correct weighting affect the estimates?

In [85]:
import numpy as np

# Function that calculates mean height, standard error of the mean, and 90% confidence interval

def Summarize(estimates):
    mean = np.mean(estimates)
    stderr = np.std(estimates)
    ci = np.percentile(estimates, [5, 95])  # 90% confidence interval
    return mean, stderr, ci


# Resampling without weight
estimates_unweighted = [thinkstats2.ResampleRows(df).htm3.mean() for _ in range(100)]

mean_unweighted, stderr_unweighted, ci_unweighted = Summarize(estimates_unweighted)

print("Unweighted:")
print("Mean Height:", round(mean_unweighted, 2))
print("Standard Error:", round(stderr_unweighted, 2))
print("90% Confidence Interval:", round(ci_unweighted[0], 2), "-", round(ci_unweighted[1], 2))


Unweighted:
Mean Height: 168.95
Standard Error: 0.02
90% Confidence Interval: 168.92 - 168.99


In [95]:
def ResampleRowsWeighted(df, weight_col):
    weights = df[weight_col] / df[weight_col].sum()
    return df.sample(n=len(df), replace=True, weights=weights)

estimates_weighted = [ResampleRowsWeighted(df, 'finalwt').htm3.mean() for _ in range(100)]
mean_weighted, stderr_weighted, ci_weighted = Summarize(estimates_weighted)
print("\nWeighted:")
print("Mean Height:", round(mean_weighted, 2))
print("Standard Error:", round(stderr_weighted, 2))
print("90% Confidence Interval:", round(ci_weighted[0], 2), "-", round(ci_weighted[1], 2))



Weighted:
Mean Height: 170.5
Standard Error: 0.02
90% Confidence Interval: 170.47 - 170.52


In [97]:
# Compare the effect of weighting
effect_on_mean = mean_weighted - mean_unweighted
effect_on_stderr = stderr_weighted - stderr_unweighted

print("\nEffect of Weighting:")
print("Difference in Mean Height:", round(effect_on_mean, 2))
print("Difference in Standard Error:", round(effect_on_stderr, 2))


Effect of Weighting:
Difference in Mean Height: 1.54
Difference in Standard Error: -0.0


```
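Correct weighting raises the estimated mean height by about 1.5 cm (from 168.95 cm unweighted to 170.5 cm weighted).
That shift is far larger than the standard error of about 0.02 cm, which is essentially the same with and without weights.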
```
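
To see why weighting can move the mean this much, the sketch below builds a small hypothetical survey in which a shorter group is oversampled and therefore carries a smaller sampling weight. The data, column names, and the helper `resample_means` are all made up for illustration; only `DataFrame.sample` with `weights` and `replace=True` mirrors the approach used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical survey: group A is oversampled, so each A row represents
# fewer people (smaller sampling weight) than each B row.
group_a = pd.DataFrame({"height": rng.normal(165, 7, 3000), "wt": 1.0})
group_b = pd.DataFrame({"height": rng.normal(175, 7, 1000), "wt": 3.0})
survey = pd.concat([group_a, group_b], ignore_index=True)

def resample_means(frame, weights=None, iters=100, seed=0):
    """Bootstrap the mean height; rows are drawn in proportion to `weights` if given."""
    means = []
    for i in range(iters):
        sample = frame.sample(n=len(frame), replace=True,
                              weights=weights, random_state=seed + i)
        means.append(sample["height"].mean())
    return np.array(means)

unweighted = resample_means(survey)
weighted = resample_means(survey, weights=survey["wt"])

for label, est in [("Unweighted", unweighted), ("Weighted", weighted)]:
    low, high = np.percentile(est, [5, 95])  # 90% confidence interval
    print(f"{label}: mean={est.mean():.2f}, stderr={est.std():.2f}, "
          f"90% CI=({low:.2f}, {high:.2f})")
```

The unweighted estimate is pulled toward the oversampled group, while the weighted estimate reflects the population the weights describe, which is the same mechanism behind the shift seen in the BRFSS results.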