**Q1. Calculate the 95% confidence interval for a sample of data with a mean of 50 and a standard deviation
of 5 using Python. Interpret the results.**

To calculate the 95% confidence interval for a sample mean in Python, you can use the `scipy.stats` module. The question does not state the sample size, so the example below assumes a sample of 100 observations:

```python
import scipy.stats as stats

# Given data
mean = 50
std_dev = 5
sample_size = 100  # Replace with your actual sample size

# Calculate the standard error of the mean (SEM)
sem = std_dev / (sample_size ** 0.5)

# Set the confidence level
confidence_level = 0.95

# Calculate the margin of error
margin_of_error = stats.norm.ppf((1 + confidence_level) / 2) * sem

# Calculate the confidence interval
lower_limit = mean - margin_of_error
upper_limit = mean + margin_of_error

# Display the results
print(f"95% Confidence Interval: ({lower_limit:.2f}, {upper_limit:.2f})")
```

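This code uses the critical z-value of the standard normal distribution (about 1.96 for 95% confidence) to calculate the margin of error, and then constructs the confidence interval as the sample mean plus or minus that margin. A z-based interval is reasonable for large samples; for small samples where the standard deviation is estimated from the data, a t-based interval is the usual choice.

Interpretation:
With the assumed sample size of 100, the standard error is 5 / sqrt(100) = 0.5, the margin of error is about 0.98, and the 95% confidence interval is approximately (49.02, 50.98). We are 95% confident that the true population mean lies within this interval: if the sampling were repeated many times, about 95% of the intervals built this way would contain the true mean. A different sample size changes the width of the interval, so replace `sample_size` with the actual value.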
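As a cross-check on the manual calculation, the sketch below builds the same interval with `stats.norm.interval` and compares it with a t-based interval from `stats.t.interval`. The sample size of 100 is still an assumption carried over from the example above, not a value given in the question.

```python
import scipy.stats as stats

mean, std_dev = 50, 5      # values from the question
sample_size = 100          # assumed; the question does not give it
sem = std_dev / sample_size ** 0.5

# z-based interval (normal approximation)
z_low, z_high = stats.norm.interval(0.95, loc=mean, scale=sem)

# t-based interval with n - 1 degrees of freedom
t_low, t_high = stats.t.interval(0.95, sample_size - 1, loc=mean, scale=sem)

print(f"z-based 95% CI: ({z_low:.2f}, {z_high:.2f})")
print(f"t-based 95% CI: ({t_low:.2f}, {t_high:.2f})")
```

The t-based interval is slightly wider because it accounts for the extra uncertainty of estimating the standard deviation from the sample; the difference shrinks as the sample size grows.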

**Q2. Conduct a chi-square goodness of fit test to determine if the distribution of colors of M&Ms in a bag
matches the expected distribution of 20% blue, 20% orange, 20% green, 10% yellow, 10% red, and 20%
brown. Use Python to perform the test with a significance level of 0.05.**

To conduct a chi-square goodness of fit test in Python, you can use the `scipy.stats` module. Here's an example assuming you have observed frequencies of each color in your sample:

```python
import scipy.stats as stats
import numpy as np

# Observed counts per color (hypothetical placeholder values; replace with the counts from your bag)
observed_blue, observed_orange, observed_green = 22, 18, 21
observed_yellow, observed_red, observed_brown = 9, 11, 19
observed_frequencies = np.array([observed_blue, observed_orange, observed_green, observed_yellow, observed_red, observed_brown])

# Expected counts: expected proportions multiplied by the total number of candies
expected_proportions = np.array([0.2, 0.2, 0.2, 0.1, 0.1, 0.2])
total_samples = observed_frequencies.sum()
expected_frequencies = expected_proportions * total_samples

# Perform the chi-square test
chi2_stat, p_value = stats.chisquare(f_obs=observed_frequencies, f_exp=expected_frequencies)

# Display the results
print(f"Chi-square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Check the significance level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The distribution of colors is significantly different from the expected distribution.")
else:
    print("Fail to reject the null hypothesis. The distribution of colors is not significantly different from the expected distribution.")
```

The question does not provide observed counts, so the values assigned to `observed_blue`, `observed_orange`, etc. are placeholders. Replace them with the actual counts of each color in your sample; `total_samples` is computed from those counts, so the observed and expected totals always agree.

Interpretation:
- The null hypothesis (H0) states that the observed distribution matches the expected distribution.
- The alternative hypothesis (H1) states that there is a significant difference between the observed and expected distributions.

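If the p-value is less than the significance level of 0.05, you reject the null hypothesis and conclude that the distribution of colors in the bag differs significantly from the expected distribution. Otherwise, you fail to reject the null hypothesis. With six color categories, the test has 6 - 1 = 5 degrees of freedom, and it must be run on counts rather than percentages.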
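The same decision can be made by comparing the test statistic with a critical value instead of using the p-value. The short sketch below looks up the critical value for the six color categories at the 0.05 level; the variable names are illustrative.

```python
import scipy.stats as stats

alpha = 0.05
num_categories = 6            # blue, orange, green, yellow, red, brown
df = num_categories - 1       # degrees of freedom for a goodness of fit test

# Reject H0 when the chi-square statistic exceeds this value
critical_value = stats.chi2.ppf(1 - alpha, df)
print(f"Critical value (df={df}, alpha={alpha}): {critical_value:.3f}")
```

A chi-square statistic larger than this critical value corresponds to a p-value below 0.05, so both approaches lead to the same conclusion.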

![image.png](attachment:3dabd305-d57f-40a8-a7f3-6a8ac9046b6e.png)

The Python code below calculates the chi-square statistic and p-value for a contingency table with the following data:

```python
# Import libraries
from scipy.stats import chi2_contingency

# Data
data = [[20, 15],
        [10, 25],
        [15, 20]]

# Chi-square test
chi2, p, dof, expected = chi2_contingency(data)

# Print results
print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p:.4f}")
```

This code outputs the following:

```
Chi-square statistic: 5.8333
Degrees of freedom: 2
p-value: 0.0541
```

Every row total is 35, so the expected counts are 15 and 20 in each row, and the statistic is 25/15 + 25/20 + 25/15 + 25/20 = 5.83 with (3 - 1) * (2 - 1) = 2 degrees of freedom. The p-value is 0.054, which is slightly above the usual significance level of 0.05, so we fail to reject the null hypothesis that there is no association between the two variables. Because the result is close to the threshold, it should be read as weak, inconclusive evidence rather than as proof of independence.

In other words, there is not enough evidence to conclude that there is a relationship between the groups and the outcomes.

Here are some additional things to keep in mind when interpreting the results of a chi-square test:

* The chi-square test detects association, not causation. A significant result does not show that one variable causes the other, and a non-significant result does not prove that the variables are unrelated.
* The chi-square test is sensitive to sample size. A large sample size can lead to a significant chi-square even if the effect size is small.
* The chi-square test treats categories as unordered. It can be applied to ordinal data, but it ignores the ordering, so a test designed for ordered categories is usually more powerful in that case.
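Because the statistic grows with sample size, an effect size is a useful companion to the p-value. The sketch below computes Cramér's V for the same table; Cramér's V is not part of the original answer and is shown only as one common choice. It also prints the smallest expected count, since the chi-square approximation is generally considered reliable when expected counts are at least about 5.

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[20, 15],
                 [10, 25],
                 [15, 20]])

chi2, p, dof, expected = chi2_contingency(data)

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = data.sum()
min_dim = min(data.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print(f"Cramér's V: {cramers_v:.3f}")
print(f"Smallest expected count: {expected.min():.1f}")
```

Cramér's V ranges from 0 (no association) to 1 (perfect association), so it can be compared across tables of different sizes.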



**Q4. A study of the prevalence of smoking in a population of 500 individuals found that 60 individuals
smoked. Use Python to calculate the 95% confidence interval for the true proportion of individuals in the
population who smoke.**

To calculate the 95% confidence interval for the true proportion of individuals in the population who smoke, you can use the formula for the confidence interval of a proportion. Here's how you can do it in Python:

```python
import scipy.stats as stats
import numpy as np

# Given data
sample_size = 500  # Number of individuals in the study
smokers = 60

# Calculate the sample proportion
sample_proportion = smokers / sample_size

# Calculate the standard error of the proportion
standard_error = np.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Set the confidence level
confidence_level = 0.95

# Calculate the margin of error
margin_of_error = stats.norm.ppf((1 + confidence_level) / 2) * standard_error

# Calculate the confidence interval
lower_limit = sample_proportion - margin_of_error
upper_limit = sample_proportion + margin_of_error

# Display the results
print(f"95% Confidence Interval for Proportion: ({lower_limit:.4f}, {upper_limit:.4f})")
```

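Interpretation:
The sample proportion is 60 / 500 = 0.12, the standard error is about 0.0145, and the margin of error is about 0.0285, so the 95% confidence interval is approximately (0.0915, 0.1485). In other words, we are 95% confident that between about 9.2% and 14.8% of individuals in the population smoke. This normal-approximation interval is appropriate here because both the number of smokers (60) and the number of non-smokers (440) are large.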
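The interval above uses the normal approximation (often called the Wald interval). As a hedged alternative, the sketch below computes the Wilson score interval by hand for the same data; this method is not part of the original answer and is included only for comparison, since it tends to behave better for small samples or proportions near 0 or 1.

```python
import numpy as np
import scipy.stats as stats

n = 500            # individuals in the study
smokers = 60
p_hat = smokers / n
z = stats.norm.ppf(0.975)   # critical value for 95% confidence

# Wilson score interval
denominator = 1 + z**2 / n
center = (p_hat + z**2 / (2 * n)) / denominator
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denominator

wilson_low, wilson_high = center - half_width, center + half_width
print(f"95% Wilson interval: ({wilson_low:.4f}, {wilson_high:.4f})")
```

With a sample this large the Wilson and normal-approximation intervals are close, so either supports the same conclusion.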

**Q5. Calculate the 90% confidence interval for a sample of data with a mean of 75 and a standard deviation
of 12 using Python. Interpret the results.**

To calculate the 90% confidence interval for a sample mean in Python, you can use the `scipy.stats` module. As in Q1, the question does not state the sample size, so the example assumes 100 observations:

```python
import scipy.stats as stats

# Given data
mean = 75
std_dev = 12
sample_size = 100  # Replace with your actual sample size

# Calculate the standard error of the mean (SEM)
sem = std_dev / (sample_size ** 0.5)

# Set the confidence level
confidence_level = 0.90

# Calculate the margin of error
margin_of_error = stats.norm.ppf((1 + confidence_level) / 2) * sem

# Calculate the confidence interval
lower_limit = mean - margin_of_error
upper_limit = mean + margin_of_error

# Display the results
print(f"90% Confidence Interval: ({lower_limit:.2f}, {upper_limit:.2f})")
```

Interpretation:
With the assumed sample size of 100, the standard error is 12 / sqrt(100) = 1.2, the critical z-value is about 1.645, and the margin of error is about 1.97, giving a 90% confidence interval of approximately (73.03, 76.97). We are 90% confident that the true population mean is within this interval. This means that if we were to take many samples and calculate a 90% confidence interval for each, approximately 90% of those intervals would contain the true population mean.
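The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical normal population with mean 75 and standard deviation 12 (values chosen to mirror the question, not given as population parameters), draws many samples of size 100, and counts how often the 90% interval contains the true mean.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)       # fixed seed for reproducibility
true_mean, true_sd = 75, 12          # hypothetical population parameters
n, n_trials, confidence = 100, 2000, 0.90
z = stats.norm.ppf((1 + confidence) / 2)

# Each row is one simulated sample of size n
samples = rng.normal(true_mean, true_sd, size=(n_trials, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Check whether each interval covers the true mean
covered = (means - z * sems <= true_mean) & (true_mean <= means + z * sems)
coverage = covered.mean()
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.90, which is what the confidence level promises about the procedure rather than about any single interval.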

**Q6. Use Python to plot the chi-square distribution with 10 degrees of freedom. Label the axes and shade the
area corresponding to a chi-square statistic of 15.**

You can use the `matplotlib` library to plot the chi-square distribution. The example below plots the chi-square distribution with 10 degrees of freedom, labels the axes, and shades the area to the right of a chi-square statistic of 15:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Degrees of freedom
df = 10

# Generate x values for the chi-square distribution
x = np.linspace(0, 30, 1000)

# Plot the chi-square distribution
plt.plot(x, stats.chi2.pdf(x, df), label=f'Chi-square Distribution (df={df})')

# Shade the right-tail area beyond a chi-square statistic of 15
x_shaded = np.linspace(15, 30, 1000)
plt.fill_between(x_shaded, stats.chi2.pdf(x_shaded, df), color='blue', alpha=0.3, label='Area to the right of 15')

# Label axes and add legend
plt.xlabel('Chi-square Statistic')
plt.ylabel('Probability Density Function')
plt.title('Chi-square Distribution and Shaded Area')
plt.legend()

# Show the plot
plt.show()
```

This code uses `numpy` to generate x values and `scipy.stats` to compute the probability density function (PDF) of the chi-square distribution. The `fill_between` function from `matplotlib` is used to shade the area corresponding to a chi-square statistic of 15. Adjust the parameters as needed for your specific use case.
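The size of the shaded region can also be computed directly. As a brief sketch, the survival function `stats.chi2.sf` gives the probability of observing a chi-square value of 15 or more with 10 degrees of freedom, which is the p-value such a statistic would have.

```python
import scipy.stats as stats

df = 10
chi2_value = 15

# Right-tail probability P(X >= 15) for a chi-square distribution with 10 df
right_tail_area = stats.chi2.sf(chi2_value, df)
print(f"Area to the right of {chi2_value}: {right_tail_area:.4f}")
```

The area is roughly 0.13, so a statistic of 15 with 10 degrees of freedom would not be significant at the 0.05 level.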

**Q7. A random sample of 1000 people was asked if they preferred Coke or Pepsi. Of the sample, 520
preferred Coke. Calculate a 99% confidence interval for the true proportion of people in the population who
prefer Coke.**

To calculate the 99% confidence interval for the true proportion of people in the population who prefer Coke, you can use the same formula as mentioned before for calculating confidence intervals for proportions. Here's the Python code to do that:

```python
import scipy.stats as stats
import numpy as np

# Given data
sample_size = 1000
coke_preference = 520

# Calculate the sample proportion
sample_proportion = coke_preference / sample_size

# Calculate the standard error of the proportion
standard_error = np.sqrt((sample_proportion * (1 - sample_proportion)) / sample_size)

# Set the confidence level
confidence_level = 0.99

# Calculate the margin of error
margin_of_error = stats.norm.ppf((1 + confidence_level) / 2) * standard_error

# Calculate the confidence interval
lower_limit = sample_proportion - margin_of_error
upper_limit = sample_proportion + margin_of_error

# Display the results
print(f"99% Confidence Interval for Proportion: ({lower_limit:.4f}, {upper_limit:.4f})")
```

Interpretation:
The sample proportion is 520 / 1000 = 0.52, the standard error is about 0.0158, and the margin of error at 99% confidence is about 0.0407, so the 99% confidence interval is approximately (0.4793, 0.5607). In other words, we are 99% confident that the true proportion of people who prefer Coke is between about 47.9% and 56.1%. Because the interval includes 0.5, the data do not show that a majority of the population prefers Coke.
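To see how the confidence level affects the width of the interval, the sketch below wraps the same formula in a small helper and evaluates it at three levels. The function name `proportion_ci` is illustrative and not taken from any library.

```python
import numpy as np
import scipy.stats as stats

def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    z = stats.norm.ppf((1 + confidence) / 2)
    return p_hat - z * se, p_hat + z * se

# Coke preference data from the question: 520 of 1000
intervals = {level: proportion_ci(520, 1000, level) for level in (0.90, 0.95, 0.99)}
for level, (low, high) in intervals.items():
    print(f"{level:.0%} CI: ({low:.4f}, {high:.4f}), width = {high - low:.4f}")
```

Higher confidence requires a wider interval: asking to be more certain of capturing the true proportion means accepting a less precise range.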

**Q8. A researcher hypothesizes that a coin is biased towards tails. They flip the coin 100 times and observe
45 tails. Conduct a chi-square goodness of fit test to determine if the observed frequencies match the
expected frequencies of a fair coin. Use a significance level of 0.05.**

To conduct a chi-square goodness of fit test in Python, you can use the `scipy.stats` module. In this case, you want to test if the observed frequencies (tails) match the expected frequencies of a fair coin. Here's the Python code to perform the test:

```python
import scipy.stats as stats
import numpy as np

# Given data
observed_tails = 45
expected_tails = 50  # For a fair coin

# Calculate the observed and expected frequencies
observed_frequencies = np.array([observed_tails, 100 - observed_tails])  # Heads is the complement
expected_frequencies = np.array([expected_tails, 100 - expected_tails])

# Perform the chi-square test
chi2_stat, p_value = stats.chisquare(f_obs=observed_frequencies, f_exp=expected_frequencies)

# Display the results
print(f"Chi-square Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Check the significance level
# Note: this test detects any departure from 50/50, not its direction
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The observed frequencies differ significantly from those of a fair coin.")
else:
    print("Fail to reject the null hypothesis. The observed frequencies are consistent with a fair coin.")
```

Interpretation:
- The null hypothesis (H0) states that the observed frequencies match the expected frequencies of a fair coin (50 tails and 50 heads).
- The alternative hypothesis (H1) states that there is a significant difference between the observed and expected frequencies.

For 45 tails and 55 heads, the chi-square statistic is (45 - 50)^2 / 50 + (55 - 50)^2 / 50 = 1.0 with 1 degree of freedom, and the p-value is about 0.317. Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: there is not enough evidence to claim that the coin is unfair. Note that the chi-square goodness of fit test is not directional, so it cannot by itself support the claim of a bias towards tails; in addition, the observed number of tails (45) is below the 50 expected for a fair coin, so the data do not point towards tails at all.
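Since the researcher's hypothesis is directional, a one-sided test is a natural complement. The sketch below computes an exact one-sided binomial p-value for "biased towards tails" using `stats.binom.sf`; this exact test is an addition for illustration and is not the chi-square test the question asks for.

```python
import scipy.stats as stats

n_flips = 100
observed_tails = 45

# P(X >= 45) for X ~ Binomial(100, 0.5): probability of seeing at least
# this many tails if the coin were fair. sf(k) gives P(X > k), so use k - 1.
p_one_sided = stats.binom.sf(observed_tails - 1, n_flips, 0.5)
print(f"One-sided p-value (bias towards tails): {p_one_sided:.4f}")
```

The one-sided p-value is far above 0.05, which agrees with the conclusion that the data give no support for a bias towards tails.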

**Q9. A study was conducted to determine if there is an association between smoking status (smoker or
non-smoker) and lung cancer diagnosis (yes or no). The results are shown in the contingency table below.
Conduct a chi-square test for independence to determine if there is a significant association between
smoking status and lung cancer diagnosis.**

The solution below applies the chi-square test for independence to the data from the study:

**Null hypothesis:** There is no association between smoking status and lung cancer diagnosis.

**Alternative hypothesis:** There is an association between smoking status and lung cancer diagnosis.

**Data:**

| Smoking Status | Lung Cancer: Yes | Lung Cancer: No | Total |
|---|---|---|---|
| Smoker | 60 | 140 | 200 |
| Non-smoker | 30 | 170 | 200 |
| Total | 90 | 310 | 400 |

**Calculation of the chi-square statistic:**

The chi-square statistic can be calculated using the following formula:

```
chi-square = sum((observed - expected)^2 / expected)
```

where:

* observed is the number of observations in each cell of the contingency table.
* expected is the number of observations that would be expected in each cell if the null hypothesis were true.

The expected number of observations in each cell can be calculated using the following formula:

```
expected = (row total * column total) / grand total
```

For example, the expected number of smokers with lung cancer is:

```
expected = (200 * 90) / 400 = 45
```

Each row contains 200 people, so the expected counts are 45 (lung cancer) and 155 (no lung cancer) in both rows, and every observed count differs from its expected count by 15. The chi-square statistic is therefore 225/45 + 225/155 + 225/45 + 225/155 = 5 + 1.45 + 5 + 1.45, which is approximately 12.90. With Yates' continuity correction, which some software applies to 2x2 tables by default, the statistic is approximately 12.06.

**Degrees of freedom:**

The degrees of freedom for a chi-square test of independence are (r - 1) * (c - 1), where r is the number of rows and c is the number of columns. For this 2x2 table, the degrees of freedom are (2 - 1) * (2 - 1) = 1.

**p-value:**

The p-value is the probability of obtaining a chi-square statistic at least as extreme as the one calculated, assuming that the null hypothesis is true. The p-value can be obtained from a chi-square distribution table or a statistical software program.

With 1 degree of freedom, the critical value at a significance level of 0.05 is 3.841. The statistic of 12.90 is far above it, and the corresponding p-value is approximately 0.0003 (about 0.0005 with the continuity correction).

**Interpretation:**

The chi-square statistic of 12.90 is significant at the 0.05 level (p < 0.001). This means that we can reject the null hypothesis and conclude that there is a statistically significant association between smoking status and lung cancer diagnosis.

In other words, the data provide strong evidence that smokers are more likely to be diagnosed with lung cancer than nonsmokers.

It is important to note that a statistically significant association does not necessarily mean that there is a causal relationship between the two variables. However, the results of this chi-square test suggest that there is a link between smoking and lung cancer that is worthy of further investigation.
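The hand calculation can be checked with `scipy.stats.chi2_contingency`. The sketch below assumes the "no lung cancer" counts are the row totals minus the lung cancer counts (140 and 170), and runs the test both without and with Yates' continuity correction, which SciPy applies to 2x2 tables by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: smoker, non-smoker; columns: lung cancer yes, lung cancer no
observed = np.array([[60, 140],
                     [30, 170]])

# Pearson chi-square without continuity correction
chi2_plain, p_plain, dof, expected = chi2_contingency(observed, correction=False)

# With Yates' continuity correction (the default for 2x2 tables)
chi2_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)

print("Expected counts:\n", expected)
print(f"Uncorrected: chi2 = {chi2_plain:.4f}, p = {p_plain:.6f}, dof = {dof}")
print(f"Yates:       chi2 = {chi2_yates:.4f}, p = {p_yates:.6f}")
```

Both versions give a p-value well below 0.05, so the choice of correction does not change the conclusion for this table.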

![image.png](attachment:72c45d17-73b1-4d83-96bb-85541f2844a0.png)

**Data:**

The image shows a contingency table with the following data:

| Chocolate Preference | US | UK | Total |
|---|---|---|---|
| Milk Chocolate | 200 | 225 | 425 |
| Dark Chocolate | 150 | 175 | 325 |
| White Chocolate | 150 | 100 | 250 |
| Total | 500 | 500 | 1000 |

**Null hypothesis:** There is no association between chocolate preference and country of origin.

**Alternative hypothesis:** There is an association between chocolate preference and country of origin.

**Chi-square test:**

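Under the null hypothesis, each chocolate type is expected to split evenly between the two countries, giving expected counts of 212.5 (milk), 162.5 (dark), and 125 (white) per country. Applying the chi-square formula, the statistic is 2 * (12.5^2 / 212.5) + 2 * (12.5^2 / 162.5) + 2 * (25^2 / 125) = 1.47 + 1.92 + 10.00, which is approximately 13.39.

**Degrees of freedom:**

The degrees of freedom for this chi-square test are (3-1)*(2-1) = 2.

**p-value:**

With 2 degrees of freedom, the critical value at a significance level of 0.05 is 5.991. The statistic of 13.39 exceeds it, and the corresponding p-value is approximately 0.0012.

**Interpretation:**

The chi-square statistic of 13.39 is significant at the 0.05 level (p = 0.0012). This means we can reject the null hypothesis and conclude that there is a statistically significant association between chocolate preference and country of origin. Most of the statistic comes from white chocolate, which 150 US respondents preferred compared with 100 UK respondents.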

**In other words, the data suggests that there is a relationship between where people are from and what type of chocolate they prefer.**

It's important to note that a statistically significant association doesn't necessarily imply a causal relationship between the two variables. However, the results of this chi-square test indicate that there is a link between chocolate preference and country of origin that warrants further investigation.
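The test can be reproduced in Python with `scipy.stats.chi2_contingency`, using the counts from the table above. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: milk, dark, white chocolate; columns: US, UK
observed = np.array([[200, 225],
                     [150, 175],
                     [150, 100]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Expected counts:\n", expected)
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: preference and country are associated.")
else:
    print("Fail to reject the null hypothesis.")
```

No continuity correction is involved here, because SciPy applies it only to tables with one degree of freedom.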

**Q11. A random sample of 30 people was selected from a population with an unknown mean and standard
deviation. The sample mean was found to be 72 and the sample standard deviation was found to be 10.
Conduct a hypothesis test to determine if the population mean is significantly different from 70. Use a
significance level of 0.05.**

To conduct a hypothesis test for the population mean, you can use a t-test since the population standard deviation is unknown. Here's how you can perform the test using Python:

```python
import scipy.stats as stats

# Given data
sample_size = 30
sample_mean = 72
sample_std_dev = 10
population_mean = 70

# Set the significance level
alpha = 0.05

# Calculate the t-statistic
t_statistic = (sample_mean - population_mean) / (sample_std_dev / (sample_size ** 0.5))

# Calculate degrees of freedom
degrees_of_freedom = sample_size - 1

# Calculate the critical t-value for a two-tailed test
critical_t_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)

# Calculate the p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), degrees_of_freedom))

# Display the results
print(f"T-Statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Check the significance level
if abs(t_statistic) > critical_t_value or p_value < alpha:
    print("Reject the null hypothesis. The population mean is significantly different from 70.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest a significant difference from 70.")
```

Interpretation:
- The null hypothesis (H0) assumes that the population mean is equal to 70.
- The alternative hypothesis (H1) states that the population mean is different from 70.

The standard error is 10 / sqrt(30), about 1.826, so the t-statistic is (72 - 70) / 1.826, about 1.095, with 29 degrees of freedom. The critical t-value for a two-tailed test at the 0.05 level is about 2.045, and the p-value is about 0.28. Because the absolute value of the t-statistic is below the critical value (equivalently, the p-value is above 0.05), we fail to reject the null hypothesis: the sample does not provide enough evidence that the population mean differs from 70. For a two-tailed test, the critical-value rule and the p-value rule always give the same decision.
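A two-tailed t-test at the 0.05 level is equivalent to checking whether the hypothesized mean falls inside the 95% t-based confidence interval. The sketch below builds that interval from the same summary statistics as a cross-check.

```python
import scipy.stats as stats

sample_size, sample_mean, sample_std_dev = 30, 72, 10
population_mean = 70   # hypothesized value under H0

# Standard error of the mean
sem = sample_std_dev / sample_size ** 0.5

# 95% confidence interval based on the t-distribution with n - 1 df
low, high = stats.t.interval(0.95, sample_size - 1, loc=sample_mean, scale=sem)
contains_null = low <= population_mean <= high

print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
print(f"Interval contains {population_mean}: {contains_null}")
```

Because the interval contains 70, the test fails to reject the null hypothesis, matching the conclusion from the t-statistic and p-value.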