# Mini Project 5-3 Explore Sampling

## Introduction
In this project, you will sample a dataset so that it is easier to analyze. As a data professional, you will often work with extremely large datasets, and proper sampling techniques make that work more efficient.

For this project, you are a member of an analytics team for the Environmental Protection Agency. You are assigned to analyze data on air quality with respect to carbon monoxide—a major air pollutant—and report your findings. The data utilized in this project includes information from over 200 sites, identified by their state name, county name, city name, and local site name. You will use effective sampling within this dataset. 

## Step 1: Imports

### Import packages

Import `pandas`,  `numpy`, `matplotlib`, `statsmodels`, and `scipy`. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as st

### Load the dataset

As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# Display the variables available in this session
print("Available variables:", dir())

# Try loading the dataset (replace 'dataset' with the actual variable name if needed)
try:
    df = dataset  # Assumes the preloaded data is stored in a variable called 'dataset'
    epa_data = df  # Later cells refer to the same DataFrame as 'epa_data'
    print("Dataset loaded successfully!")
    print(df.head())  # Display the first few rows
except NameError:
    print("Dataset variable not found. Check the provided variable name.")


## Step 2: Data exploration

### Examine the data

To understand how the dataset is structured, examine the first 10 rows of the data.

In [None]:
# Display the first 10 rows of the dataset
df.head(10)


### Generate a table of descriptive statistics

Generate a table of some descriptive statistics about the data. Specify that all columns of the input be included in the output.

In [None]:
import pandas as pd

def generate_descriptive_stats(data):
    """
    Generates a table of descriptive statistics for all columns in a DataFrame.

    Args:
        data (pd.DataFrame): The input DataFrame.

    Returns:
        pd.DataFrame: A table containing descriptive statistics for each column.
    """
    return data.describe(include='all')  # 'all' ensures all columns are included

# Apply the function to the EPA data loaded above (df is not replaced with toy data)
descriptive_stats = generate_descriptive_stats(df)
print(descriptive_stats)

**Question:** Based on the preceding table of descriptive statistics, what is the mean value of the `aqi` column? 

A: The mean appears in the `mean` row of the `aqi` column in the descriptive statistics table. You can confirm it by computing it directly with `df['aqi'].mean()`, which should return the same value.


**Question:** Based on the preceding table of descriptive statistics, what do you notice about the count value for the `aqi` column?

A: To answer this question, you need to examine the "count" value for the aqi column in the descriptive statistics table. The count represents the number of non-missing values in that column.

What you’re looking for is whether the count matches the expected number of rows in your dataset. If it’s less than the total number of rows, it means there are missing values in the aqi column.
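The comparison between the count and the number of rows can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (the name `toy` and its values are assumptions, not the EPA data); the same three calls apply to the real DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the EPA data: 6 rows, one missing AQI value.
toy = pd.DataFrame({
    "state_name": ["Ohio", "Texas", "Ohio", "Utah", "Texas", "Iowa"],
    "aqi": [5.0, 12.0, np.nan, 3.0, 7.0, 9.0],
})

n_rows = len(toy)                     # total number of rows
aqi_count = toy["aqi"].count()        # non-missing values only
n_missing = toy["aqi"].isna().sum()   # missing values

print("rows:", n_rows, "| count:", aqi_count, "| missing:", n_missing)
```

If the count equals the number of rows, the column has no missing values.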


### Use the `mean()` function on the `aqi`  column

Now, use the `mean()` function on the `aqi`  column and assign the value to a variable `population_mean`. The value should be the same as the one generated by the `describe()` method in the above table. 

In [None]:
# Calculate the mean of the 'aqi' column and assign it to population_mean
population_mean = df['aqi'].mean()

# Print the result
print("Population Mean of AQI:", population_mean)


## Step 3: Statistical tests

### Sample with replacement

First, name a new variable `sampled_data`. Then, use the `sample()` DataFrame method to draw a sample of 50 rows from `epa_data`. Set `replace` equal to `True` (the Boolean, not the string `'True'`) to specify sampling with replacement. For `random_state`, choose an arbitrary number as the random seed; this project uses `42`.

In [None]:
sampled_data = epa_data.sample(n=50, replace=True, random_state=42) 

### Output the first 10 rows

Output the first 10 rows of the DataFrame. 

In [None]:
# Display the first 10 rows of the sampled data
sampled_data.head(10)

**Question:** In the DataFrame output, why is the row index 102 repeated twice? 

A: Because the sample was drawn with replacement (`replace=True`), each row goes back into the pool after it is selected, so the same row can be drawn more than once. A row index such as 102 appearing twice means that row was selected twice; it does not mean the original dataset contains duplicate rows.
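To see repeated index labels arise from sampling with replacement, here is a small sketch. It assumes a hypothetical 10-row DataFrame named `toy` rather than the EPA data; drawing 50 rows from only 10 forces some rows to be picked more than once.

```python
import pandas as pd

# Hypothetical population of 10 rows; the lab itself uses the EPA DataFrame.
toy = pd.DataFrame({"aqi": [2, 5, 7, 3, 9, 4, 6, 8, 1, 10]})

# Drawing 50 rows from only 10 with replacement must reuse some rows.
with_repl = toy.sample(n=50, replace=True, random_state=42)
repeated = with_repl.index[with_repl.index.duplicated()].unique()

# Without replacement, each index label can appear at most once.
without_repl = toy.sample(n=10, replace=False, random_state=42)

print("Labels drawn more than once:", sorted(repeated.tolist()))
print("Any repeats without replacement:", without_repl.index.duplicated().any())
```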

**Question:** What does `random_state` do?

A: The random_state parameter is used in many functions in libraries like Pandas, Scikit-learn, and others to control the randomness of operations like shuffling data, splitting data into training and test sets, or generating random numbers.

Setting the random_state ensures that the results are reproducible. When you use the same random_state value, you will get the same result every time you run the code. If you don't set a value for random_state, the operation will produce different results each time it is executed (because the randomness will be different on each run).
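The reproducibility described above can be checked directly. The sketch below assumes a made-up DataFrame `toy`; the seeds `42` and `7` are arbitrary.

```python
import pandas as pd

# Hypothetical data standing in for the EPA DataFrame.
toy = pd.DataFrame({"aqi": range(100)})

# Two draws with the same seed, and one with a different seed.
a = toy.sample(n=5, replace=True, random_state=42)
b = toy.sample(n=5, replace=True, random_state=42)
c = toy.sample(n=5, replace=True, random_state=7)

print("Same seed gives identical rows:", a.equals(b))
print("Different seed gives identical rows:", a.index.equals(c.index))
```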

### Compute the mean value from the `aqi` column

Compute the mean value from the `aqi` column in `sampled_data` and assign the value to the variable `sample_mean`.

In [None]:
# Calculate the mean of the 'aqi' column in sampled_data
sample_mean = sampled_data['aqi'].mean()

# Print the result
print("Sample Mean of AQI:", sample_mean)



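**Question:** Why is `sample_mean` different from `population_mean`?

A: The sample contains only 50 observations from the population, chosen at random, so its mean is an estimate of the population mean rather than an exact copy of it. A different random sample would give a slightly different mean; this is sampling variability.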
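A short simulation can make this sampling variability concrete. The sketch below assumes a synthetic, right-skewed population generated with NumPy (not the EPA data) and draws five samples of 50 from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed "population" of 1,000 AQI-like values.
population = rng.exponential(scale=6.0, size=1000)
population_mean = population.mean()

# Five independent samples of 50 values, each with its own mean.
sample_means = [
    rng.choice(population, size=50, replace=True).mean() for _ in range(5)
]

print("Population mean:", round(population_mean, 2))
print("Sample means:   ", [round(m, 2) for m in sample_means])
```

Each sample mean lands near the population mean, but none is expected to match it exactly.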

### Apply the central limit theorem

Imagine repeating the earlier sample with replacement 10,000 times and obtaining 10,000 point estimates of the mean. In other words, imagine taking 10,000 random samples of 50 AQI values and computing the mean for each sample. According to the **central limit theorem**, the mean of a sampling distribution should be roughly equal to the population mean. Complete the following steps to compute the mean of the sampling distribution with 10,000 samples.

* Create an empty list and assign it to a variable called `estimate_list`. 
* Iterate through a `for` loop 10,000 times. To do this, make sure to utilize the `range()` function to generate a sequence of numbers from 0 to 9,999. 
* In each iteration of the loop, use the `sample()` function to take a random sample (with replacement) of 50 AQI values from the population. Do not set `random_state` to a value.
* Use the list `append()` method to add each sample's mean to `estimate_list`.


In [None]:
# Collect the mean of each of 10,000 random samples of 50 AQI values
estimate_list = []

for _ in range(10000):
    # Sample 50 AQI values with replacement (random_state is deliberately not set)
    estimate_list.append(epa_data['aqi'].sample(n=50, replace=True).mean())

# The mean of the sampling distribution (approximates the population mean)
mean_of_estimates = sum(estimate_list) / len(estimate_list)
print("Mean of the sampling distribution:", mean_of_estimates)

### Create a new DataFrame

Next, create a new DataFrame from the list of 10,000 estimates. Name the new variable `estimate_df`.

In [None]:
import pandas as pd

# 'estimate_list' holds the 10,000 sample means collected in the loop above
estimates = estimate_list

estimate_df = pd.DataFrame({'estimates': estimates})

print(estimate_df.head())  # Display the first few rows
print(estimate_df.shape)   # Display the dimensions of the DataFrame

### Compute the mean() of the sampling distribution

Next, compute the `mean()` of the sampling distribution of 10,000 random samples and store the result in a new variable `mean_sample_means`.

In [None]:
import numpy as np

# Set the number of samples
num_samples = 10000

# Create an array to store sample means
sample_means = []

# Generate 10,000 random samples and compute their means
for _ in range(num_samples):
    sample = df['aqi'].sample(n=len(sampled_data), replace=True)  # Bootstrapping
    sample_means.append(sample.mean())

# Compute the mean of the sampling distribution
mean_sample_means = np.mean(sample_means)

# Print the result
print("Mean of the Sampling Distribution:", mean_sample_means)


**Question:** What is the mean for the sampling distribution of 10,000 random samples?

In [None]:
import numpy as np

# The answer is the value computed above and stored in mean_sample_means
print("Mean of sampling distribution:", mean_sample_means)
print("Population mean:", population_mean)

# Toy illustration with its own variable names, so the real results are not overwritten
toy_population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_population_mean = np.mean(toy_population)

# Generate 10,000 random samples of size n (with replacement) from the toy population
n = 10  # Sample size
toy_sample_means = [np.mean(np.random.choice(toy_population, size=n)) for _ in range(10000)]

# The mean of the sampling distribution will be close to the population mean
print("Toy mean of sampling distribution:", np.mean(toy_sample_means))
print("Toy population mean:", toy_population_mean)


**Question:** How are the central limit theorem and random sampling (with replacement) related?

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate a skewed population
population = np.random.exponential(scale=1, size=10000)

# Sample means from different sample sizes
sample_sizes = [10, 50, 100]
sample_means = []

for size in sample_sizes:
    means = []
    for _ in range(1000):
        sample = np.random.choice(population, size=size, replace=True)
        means.append(np.mean(sample))
    sample_means.append(means)

# Plot the distributions of sample means
for i, means in enumerate(sample_means):
    plt.subplot(1, 3, i+1)
    plt.hist(means, bins=20)
    plt.title(f"Sample size: {sample_sizes[i]}")

plt.tight_layout()
plt.show()


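A: The central limit theorem describes the distribution of the sample mean when the observations in each sample are independent and drawn from the same population. Random sampling with replacement provides exactly that: every draw comes from the full population and does not affect the next one. Under those conditions, the sampling distribution of the mean is centered on the population mean and becomes approximately normal as the sample size grows, even when the population itself is skewed. The histograms above show this as the sample size increases from 10 to 100.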

### Output the distribution using a histogram

Output the distribution of these estimates using a histogram. This provides an idea of the sampling distribution.

In [None]:
import matplotlib.pyplot as plt

# Plot the 10,000 sample means stored in estimate_df
plt.hist(estimate_df['estimates'], bins=20)  # Adjust "bins" as needed to control the number of bars
plt.xlabel("Estimate Value")
plt.ylabel("Frequency")
plt.title("Sampling Distribution Histogram")
plt.show()

### Calculate the standard error

Calculate the standard error of the mean AQI using the initial sample of 50. The **standard error** of a statistic measures the sample-to-sample variability of the sample statistic. It provides a numerical measure of sampling variability and answers the question: How far is a statistic based on one particular sample from the actual value of the statistic?

In [None]:

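import math

# Standard error of the mean: sample standard deviation divided by the
# square root of the sample size, computed from the initial sample of 50
sample_standard_deviation = sampled_data['aqi'].std()
sample_size = len(sampled_data)

standard_error = sample_standard_deviation / math.sqrt(sample_size)

print(standard_error)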
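One way to check what the standard error measures is to compare the formula-based value from a single sample with the observed spread of many simulated sample means. The sketch below is an illustration under assumptions: it uses a synthetic exponential population rather than the EPA data, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed population of AQI-like values.
population = rng.exponential(scale=6.0, size=5000)

n = 50
one_sample = rng.choice(population, size=n, replace=True)

# Formula-based estimate from a single sample: s / sqrt(n).
se_formula = one_sample.std(ddof=1) / np.sqrt(n)

# Empirical check: standard deviation of 10,000 simulated sample means.
means = rng.choice(population, size=(10000, n), replace=True).mean(axis=1)
se_empirical = means.std(ddof=1)

print("SE estimated from one sample:", se_formula)
print("SD of 10,000 sample means:   ", se_empirical)
```

The two numbers should be of similar size; the first is an estimate from one sample, so it varies from run to run when the seed changes.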



## Step 4: Results and evaluation

###  Visualize the relationship between the sampling and normal distributions

Visualize the relationship between your sampling distribution of 10,000 estimates and the normal distribution.

1. Plot a histogram of the 10,000 sample means 
2. Add a vertical line indicating the mean of the first single sample of 50
3. Add another vertical line indicating the mean of the means of the 10,000 samples 
4. Add a third vertical line indicating the mean of the actual population

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st

# Plot the histogram of the 10,000 sample means (sampling distribution)
plt.figure(figsize=(8, 5))
plt.hist(estimate_df['estimates'], bins=30, density=True, alpha=0.4, label="Sampling Distribution")

# Overlay a normal curve with the same mean and standard deviation as the sample means
x = np.linspace(estimate_df['estimates'].min(), estimate_df['estimates'].max(), 200)
normal_pdf = st.norm.pdf(x, estimate_df['estimates'].mean(), estimate_df['estimates'].std())
plt.plot(x, normal_pdf, color='black', label="Normal Distribution")

# Add vertical lines for the three means computed earlier
plt.axvline(sample_mean, color='red', linestyle=':', label="Mean of First Sample")
plt.axvline(mean_sample_means, color='blue', linestyle='--', label="Mean of Sample Means")
plt.axvline(population_mean, color='green', label="Population Mean")

plt.xlabel("Sample Mean")
plt.ylabel("Density")
plt.title("Sampling Distribution vs. Normal Distribution")
plt.legend()
plt.show()




**Question:** What insights did you gain from the preceding sampling distribution?

A: Shape: Is the distribution normal, skewed, or something else? The shape can give you an idea of the underlying population's characteristics.

Central Tendency: Look at the mean and median of the sampling distribution. If they are close to each other, the distribution is symmetric. If they are far apart, the distribution might be skewed.

Spread: What is the variance or standard deviation of the sampling distribution? This can tell you about the consistency of the sample estimates relative to the population.

Sample Size: The larger the sample size, the more the sampling distribution will resemble a normal distribution (according to the Central Limit Theorem).
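The shape, central tendency, and spread mentioned above can also be put into numbers. The following sketch assumes a synthetic skewed population and a simulated sampling distribution (neither is the EPA data) and summarizes it with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical right-skewed population.
population = rng.exponential(scale=6.0, size=5000)

# Simulated sampling distribution: 10,000 means of samples of size 50.
estimates = pd.Series(
    rng.choice(population, size=(10000, 50), replace=True).mean(axis=1)
)

summary = {
    "mean": estimates.mean(),          # central tendency
    "median": estimates.median(),      # close to the mean if symmetric
    "std": estimates.std(),            # spread of the sample means
    "skew": estimates.skew(),          # shape: near 0 for a symmetric distribution
    "population_skew": pd.Series(population).skew(),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this setup the sample means are much less skewed than the population they were drawn from.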

# Considerations

**What are some key takeaways that you learned from this project?**

A: Central Limit Theorem (CLT): One of the most significant learnings is how the Central Limit Theorem works in practice. As the sample size increases, the sampling distribution of the sample mean tends to become more normal, regardless of the shape of the original population distribution. This concept is foundational for making inferences about a population based on sample data.

Reproducibility and Randomness: I learned the importance of setting a random_state when performing operations like data splitting or bootstrapping. This ensures that results are reproducible, which is essential for validating findings and conducting consistent analyses.

Impact of Sample Size on Precision: Increasing the sample size generally reduces the variance of the sampling distribution, leading to more reliable estimates. This confirms the practical value of collecting larger sample sizes for more accurate results.

Understanding Bias and Variability: The shape, spread, and central tendency of the sampling distribution reveal critical information about bias and variability in the data. If the sample distribution deviates significantly from expectations, it suggests potential issues like selection bias or insufficient sample size.

Importance of Descriptive Statistics: The descriptive statistics of a sampling distribution, such as mean, variance, and standard deviation, provided crucial insights into the behavior of the data and the accuracy of our sample estimates.

Real-World Application: Through this project, I gained a deeper understanding of how sampling distributions can be used in real-world scenarios, like hypothesis testing and estimation, to make reliable decisions based on data.
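The takeaway about sample size and precision can be illustrated with a short simulation. This sketch assumes a synthetic population and three arbitrary sample sizes; it compares the observed spread of the sample means with the value `sigma / sqrt(n)`.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population and its standard deviation.
population = rng.exponential(scale=6.0, size=5000)
sigma = population.std()

spread_by_n = {}
for n in (10, 50, 200):
    # 2,000 samples of size n, drawn with replacement; one mean per sample.
    means = rng.choice(population, size=(2000, n), replace=True).mean(axis=1)
    spread_by_n[n] = means.std(ddof=1)
    print(f"n={n:>3}  observed spread={spread_by_n[n]:.3f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(n):.3f}")
```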

**What findings would you share with others?**

A: Shape of the Distribution:

* If the sampling distribution is roughly normal, the sample size is large enough for the Central Limit Theorem to apply, and the sample means cluster symmetrically around the population mean.
* If the distribution is skewed, the sample size may be too small for the Central Limit Theorem to apply effectively.

Central Tendency:

* The mean of the sampling distribution should be close to the population mean. This reflects that the sample mean is an unbiased estimator of the population mean.
* A significant difference between the mean of the sample means and the population mean could suggest bias in the sampling process or other factors.

Variance and Spread:

* A smaller variance (or standard deviation) in the sampling distribution indicates more consistency and precision in estimating the population parameter.
* A larger variance suggests more variability between the sample means, potentially due to a small sample size or greater inherent variability in the data.

Impact of Sample Size:

* Increasing the sample size reduces the spread of the sampling distribution, because the standard error of the mean is the standard deviation divided by the square root of the sample size. Larger samples therefore give more precise estimates of population parameters.

Convergence to Normality:

* If the sampling distribution tends toward normality as the sample size increases, that is consistent with the Central Limit Theorem: with large enough samples, the distribution of the sample mean is approximately normal even when the population itself is not.
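The point about bias in the sampling process can be sketched by comparing a random sampling procedure with a deliberately biased one. Everything here is an assumption for illustration: the population is synthetic, `mean_of_sample_means` is a hypothetical helper, and the "biased" procedure simply restricts sampling to values above the median.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population and its true mean.
population = rng.exponential(scale=6.0, size=5000)
population_mean = population.mean()

def mean_of_sample_means(pool, n=50, reps=2000):
    """Average of `reps` sample means, each from `n` draws with replacement."""
    return rng.choice(pool, size=(reps, n), replace=True).mean(axis=1).mean()

# Random sampling: every value in the population can be drawn.
random_center = mean_of_sample_means(population)

# Biased sampling (hypothetical): only values above the median can be drawn.
biased_pool = population[population > np.median(population)]
biased_center = mean_of_sample_means(biased_pool)

print("Population mean:          ", round(population_mean, 2))
print("Center, random sampling:  ", round(random_center, 2))
print("Center, biased sampling:  ", round(biased_center, 2))
```

A sampling distribution centered far from the population mean, as in the biased case, points to a problem with how the samples were drawn rather than with the sample size.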

**What would you convey to external readers?**

A: Overview of the Analysis:

* Briefly describe the context of the sampling distribution analysis (e.g., sample size, population, purpose of the study).
* Mention any methods used, such as random sampling or statistical techniques, that are relevant for transparency.

Key Insights and Findings:

* Shape of the Distribution: Emphasize whether the distribution is normal, skewed, or has any other noteworthy characteristics. Explain the significance of the shape; for example, a normal distribution suggests reliability and consistency of sample means in estimating the population mean.
* Central Tendency: Highlight whether the mean of the sampling distribution aligns with the population mean. If they match closely, the findings support the accuracy of your sample estimates.
* Spread/Variance: Comment on the spread of the sampling distribution (variance or standard deviation). A smaller spread indicates higher precision, while a larger spread suggests greater variability and potential uncertainty.
* Sample Size Impact: If relevant, discuss how increasing sample size reduced variability or led to a more normal distribution. This reinforces the Central Limit Theorem's importance in ensuring more accurate and consistent results as sample size grows.

Implications:

* Discuss how these findings contribute to understanding the reliability and representativeness of your sample. If the sampling distribution demonstrates low variability and a close match to the population, it builds confidence in your estimates and conclusions.
* If there are any concerns, such as a skewed distribution or high variance, address the potential reasons and what might be done to improve the sampling process.

Actionable Recommendations:

* Based on the findings, suggest potential actions or next steps. For example, you might recommend increasing the sample size if variability is high, or using a different sampling strategy if the distribution shows bias.