Sure, here's the updated cheatsheet:

1. **Creating a sample from normal, binomial, poisson, exponential distributions using numpy:**

```python
import numpy as np

# Normal distribution sample
# loc: Mean (“centre”) of the distribution.
# scale: Standard deviation (spread or “width”) of the distribution.
# size: Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn.
normal_sample = np.random.normal(loc=0, scale=1, size=100)

# Binomial distribution sample
# n: Parameter of the distribution, >= 0. Floats are also accepted, but they will be truncated to integers.
# p: Parameter of the distribution, >= 0 and <=1.
# size: Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn.
binomial_sample = np.random.binomial(n=10, p=0.5, size=100)

# Poisson distribution sample
# lam: Expectation of interval, should be >= 0. A sequence of expectation intervals must be broadcastable over the requested size.
# size: Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn.
poisson_sample = np.random.poisson(lam=5, size=100)

# Exponential distribution sample
# scale: The scale parameter, > 0.
# size: Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn.
exponential_sample = np.random.exponential(scale=1.0, size=100)
```
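
The calls above use NumPy's legacy global random state. As a hedged alternative, the sketch below draws the same four samples through a seeded `Generator` created with `np.random.default_rng()`, so the draws are reproducible. The seed value 42 and the `grid` variable are illustrative assumptions, not part of the original cheatsheet.

```python
import numpy as np

# Seeded Generator: the same seed reproduces the same draws
rng = np.random.default_rng(seed=42)

normal_sample = rng.normal(loc=0, scale=1, size=100)
binomial_sample = rng.binomial(n=10, p=0.5, size=100)
poisson_sample = rng.poisson(lam=5, size=100)
exponential_sample = rng.exponential(scale=1.0, size=100)

# A tuple for size gives a multi-dimensional array: 3 * 4 = 12 draws
grid = rng.normal(size=(3, 4))
```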

2. **Performing hypothesis tests and interpreting the results:**

```python
from scipy import stats
from statsmodels.stats.weightstats import ztest

# data, data1, data2 and expected_mean are placeholders for your own arrays and value

# One-sample t-test
one_sample = stats.ttest_1samp(a=data, popmean=expected_mean)

# Two-sample t-test (independent samples)
two_sample = stats.ttest_ind(a=data1, b=data2)

# One-sample z-test (returns a (statistic, p-value) tuple)
one_sample_z = ztest(x1=data, value=expected_mean)

# Two-sample z-test; value=0 is the hypothesized difference in means
two_sample_z = ztest(x1=data1, x2=data2, value=0)
```
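Use a t-test when the population standard deviation is unknown and has to be estimated from the sample, which matters most when the sample is small (a common rule of thumb is fewer than 30 observations). Use a z-test when the population standard deviation is known, or when the sample is large enough that the t and normal distributions nearly coincide.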

In hypothesis testing, we start with a null hypothesis (H0) and an alternative hypothesis (H1 or Ha). If the p-value is less than or equal to the significance level (usually 0.05), we reject the null hypothesis. If the p-value is greater than the significance level, we fail to reject the null hypothesis. The p-value is the probability of seeing data at least as extreme as what was observed if the null hypothesis were true, so a smaller p-value means stronger evidence against the null hypothesis. It is not the probability that the null hypothesis is true.
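
The decision rule can be sketched end to end as follows. The data here are synthetic (25 draws centred on 52, tested against a hypothesized mean of 50); the seed, sample size, and `alpha` are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: 25 measurements drawn around 52, tested against H0: mean == 50
data = rng.normal(loc=52, scale=3, size=25)
expected_mean = 50
alpha = 0.05

result = stats.ttest_1samp(a=data, popmean=expected_mean)

# Reject H0 only when the p-value is at or below the significance level
reject_null = result.pvalue <= alpha
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}, reject H0: {reject_null}")
```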

3. **Shortcut for common p-values for normal distribution and t-distribution:**

```python
# Normal distribution
p_value_norm = stats.norm.sf(x= z_score) #one-sided

# T-distribution
p_value_t = stats.t.sf(x= t_score, df= degrees_of_freedom) #one-sided
```
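
The survival function `sf` gives only the upper-tail area. A minimal sketch of the two-sided and lower-tail variants follows; the numeric values of `z_score`, `t_score`, and `degrees_of_freedom` are illustrative assumptions.

```python
from scipy import stats

z_score = 1.96            # illustrative value
t_score = 2.0             # illustrative value
degrees_of_freedom = 10   # illustrative value

# Two-sided p-value: double the upper-tail area of the absolute statistic
p_two_sided_norm = 2 * stats.norm.sf(abs(z_score))
p_two_sided_t = 2 * stats.t.sf(abs(t_score), df=degrees_of_freedom)

# Lower-tail (left-sided) p-value uses the CDF instead of the survival function
p_left_norm = stats.norm.cdf(z_score)
```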

4. **Using np.random.choice:**

```python
# np.random.choice generates a random sample from a given 1-D array
sample = np.random.choice(a= [0, 1], size= 100, replace= True, p= [0.5, 0.5])
```
Use np.random.choice when you want to generate a random sample from a given 1-D array.

5. **Using df.sample():**

```python
# df.sample() returns a random sample of items from an axis of object.
sample = df.sample(n=10)
```
Use df.sample() when you want to get a random sample of items from your DataFrame.

6. **Type 1 and Type 2 errors:**
   - Type 1 error (False Positive): Rejecting the null hypothesis when it is true.
   - Type 2 error (False Negative): Failing to reject the null hypothesis when it is false.
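
One way to see what the Type 1 error rate means is to simulate many experiments in which the null hypothesis is true and count how often it is rejected anyway. This is a rough sketch; the number of experiments, the sample size, and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments, n_obs = 2000, 30

# Every row is one experiment in which H0 (mean == 0) is actually true
samples = rng.normal(loc=0, scale=1, size=(n_experiments, n_obs))
p_values = stats.ttest_1samp(samples, popmean=0, axis=1).pvalue

# Fraction of true nulls that were rejected: an estimate of the Type 1 error rate
type_1_rate = np.mean(p_values <= alpha)
print(f"Estimated Type 1 error rate: {type_1_rate:.3f}")
```

The estimated rate should land close to `alpha`, since the significance level is by construction the probability of a false positive when the null hypothesis holds.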

7. **Using .cdf() and .pdf():**

```python
# .cdf() gives the cumulative distribution function
cdf_value = stats.norm.cdf(x= value, loc= mean, scale= std_dev)

# .pdf() gives the probability density function
pdf_value = stats.norm.pdf(x= value, loc= mean, scale= std_dev)
```
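Use .cdf() when you want to find the probability that a random observation taken from the population will be less than or equal to a certain value. Use .pdf() when you want the height of the density curve at a particular value. For a continuous distribution this is a relative likelihood, not a probability: the probability of any single exact value is zero, and probabilities come from areas under the curve (differences of .cdf() values).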
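
A short sketch of how these pieces fit together, assuming an illustrative normal distribution with mean 100 and standard deviation 15. The `.ppf()` call (the inverse of `.cdf()`) is an addition not covered in the text above.

```python
from scipy import stats

mean, std_dev = 100, 15   # illustrative parameters

# P(X <= 115): cumulative probability up to one standard deviation above the mean
p_below = stats.norm.cdf(115, loc=mean, scale=std_dev)

# P(85 <= X <= 115): difference of two CDF values
p_between = p_below - stats.norm.cdf(85, loc=mean, scale=std_dev)

# Density at the mean: a height of the curve, not a probability
density_at_mean = stats.norm.pdf(mean, loc=mean, scale=std_dev)

# .ppf() inverts .cdf(): the value below which 97.5% of the distribution lies
upper_cutoff = stats.norm.ppf(0.975, loc=mean, scale=std_dev)
```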

8. **Bootstrapping distribution:**

```python
# Bootstrapping is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
bootstrap_sample = np.random.choice(a= data, size= len(data), replace= True)
```
Use bootstrapping when you want to understand the uncertainty associated with a given machine learning model or statistical analysis.
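
A single resample is only the building block. The sketch below repeats the resampling many times to build a bootstrap distribution of the mean and reads a percentile confidence interval from it. The data are synthetic, and the choice of 2000 resamples and the percentile method are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical observed sample
data = rng.exponential(scale=2.0, size=200)

n_resamples = 2000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Resample with replacement, same size as the original data
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means[i] = resample.mean()

# Percentile bootstrap: the middle 95% of the resampled means
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
```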


Sure, here's a cheat sheet for the requested topics:

**Statistical Distributions:**

1. **Binomial Distribution:** A discrete probability distribution of the number of successes in a sequence of n independent experiments. It's characterized by two parameters: the number of trials (n) and the probability of success on each trial (p).

```python
from scipy.stats import binom
# n: number of trials, p: probability of success
binom.rvs(n=10, p=0.5)
```

2. **Normal Distribution:** A continuous probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.

```python
from scipy.stats import norm
# loc: mean, scale: standard deviation
norm.rvs(loc=0, scale=1)
```

3. **Poisson Distribution:** A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space.

```python
from scipy.stats import poisson
# mu: mean number of events per interval
poisson.rvs(mu=5)
```

4. **T-Distribution:** A type of probability distribution that is symmetric and bell-shaped, like the standard normal distribution, but has heavier tails, meaning it is more prone to producing values that fall far from its mean.

```python
from scipy.stats import t
# df: degrees of freedom
t.rvs(df=10)
```

5. **Chi-Square Distribution:** A distribution of a sum of the squares of k independent standard normal random variables. It's often used in hypothesis testing.

```python
from scipy.stats import chi2
# df: degrees of freedom
chi2.rvs(df=2)
```

6. **F-Distribution:** A distribution that arises in the testing of whether two observed samples have the same variance.

```python
from scipy.stats import f
# dfn: degrees of freedom in numerator, dfd: degrees of freedom in denominator
f.rvs(dfn=1, dfd=48)
```

**Hypothesis Testing:**

1. **Null Hypothesis (H0):** The hypothesis that there is no significant difference between specified populations, any observed difference being due to sampling or experimental error.

2. **Alternative Hypothesis (H1 or Ha):** The hypothesis that there is a significant difference between specified populations.

3. **One-Tailed Hypothesis Test:** A test where the region of rejection is on only one side of the sampling distribution.

4. **Two-Tailed Hypothesis Test:** A test where the region of rejection is on both sides of the sampling distribution.

**Experimental Design:**

1. **Control Group:** A group in an experiment or study that does not receive treatment by the researchers and is then used as a benchmark to measure how the other tested subjects do.

2. **Randomization:** The practice of assigning subjects to treatments in such a way that each subject has an equal chance of being assigned to any treatment.

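3. **Confounding Variables:** Variables that influence both the treatment (or explanatory variable) and the outcome. If they are not controlled or eliminated, they can create a spurious association and damage the internal validity of an experiment.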

**Parameter Estimation and Confidence Intervals:**

1. **Parameter Estimation:** The process of using sample data to estimate the parameters of the selected distribution.

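2. **Confidence Intervals:** A range of values, derived from a statistical estimation process, that is likely to contain the value of an unknown parameter. For example, a 95% confidence level means that if the sampling and interval construction were repeated many times, about 95% of the resulting intervals would contain the true parameter. The parameter itself is fixed; it is the interval that varies from sample to sample.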

```python
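from scipy.stats import norm
# Calculate a 95% confidence interval for a mean.
# scale must be the standard error (std_dev / sqrt(n), n = sample size);
# passing std_dev itself gives the range holding 95% of individual values instead.
confidence_interval = norm.interval(0.95, loc=mean, scale=std_dev / n ** 0.5)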
```
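
When the population standard deviation is unknown, a confidence interval for a mean is usually built from the t-distribution and the standard error. A minimal sketch on synthetic data (the sample itself and the seed are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=40)  # hypothetical sample

n = len(sample)
sample_mean = sample.mean()
standard_error = stats.sem(sample)  # sample std (ddof=1) divided by sqrt(n)

# 95% t-based confidence interval for the population mean
ci_lower, ci_upper = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)
```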

Sure, here's a cheat sheet for different types of random sampling techniques, how to sample data from a statistical distribution, and how to calculate a probability from a statistical distribution using Python:

**Random Sampling Techniques:**

1. **Simple Random Sampling:** Every member of the population has an equal chance of being selected. It's implemented in Python using the `random.sample()` function.

```python
import random

population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sample = random.sample(population, k=5)
```

2. **Stratified Sampling:** The population is divided into subgroups (strata) and random samples are taken from each stratum. This can be implemented in Python by grouping with pandas' `groupby()` and then drawing rows from each group with `sample()`.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9]
})

# Draw one row at random from each group (GroupBy.sample needs pandas 1.1 or later)
stratified_sample = df.groupby('group').sample(n=1)
```

3. **Cluster Sampling:** The population is divided into clusters (groups) and a set of clusters are selected at random. This can be implemented in Python using the `random.sample()` function to select clusters.

```python
clusters = [['A1', 'A2', 'A3'], ['B1', 'B2', 'B3'], ['C1', 'C2', 'C3']]
sample_clusters = random.sample(clusters, k=2)
```

**Sampling from a Statistical Distribution:**

1. **Normal Distribution:**

```python
import numpy as np

normal_sample = np.random.normal(loc=0, scale=1, size=100)
```

2. **Binomial Distribution:**

```python
binomial_sample = np.random.binomial(n=10, p=0.5, size=100)
```

3. **Poisson Distribution:**

```python
poisson_sample = np.random.poisson(lam=5, size=100)
```

4. **Exponential Distribution:**

```python
exponential_sample = np.random.exponential(scale=1.0, size=100)
```

**Calculating a Probability from a Statistical Distribution:**

1. **Normal Distribution:**

```python
from scipy.stats import norm

# Probability that a random variable from a standard normal distribution is less than 1
prob = norm.cdf(1, loc=0, scale=1)
```

2. **Binomial Distribution:**

```python
from scipy.stats import binom

# Probability of getting 5 successes in 10 trials with a success probability of 0.5
prob = binom.pmf(k=5, n=10, p=0.5)
```

3. **Poisson Distribution:**

```python
from scipy.stats import poisson

# Probability of getting exactly 5 events in an interval where the average rate of events is 5
prob = poisson.pmf(k=5, mu=5)
```

4. **Exponential Distribution:**

```python
from scipy.stats import expon

# Probability that a random variable from an exponential distribution with scale parameter 1 is less than 1
prob = expon.cdf(1, scale=1)
```
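
The same methods cover "at least", "between", and interval questions. The sketch below is a hedged illustration; the particular thresholds are arbitrary. For discrete distributions the off-by-one matters, because `sf(k)` is P(X > k) and `cdf(k)` is P(X <= k).

```python
from scipy.stats import binom, poisson, norm

# P(X >= 5) for a binomial: sf(k) is P(X > k), so pass k - 1
p_at_least_5 = binom.sf(4, n=10, p=0.5)

# P(3 <= X <= 7) for a Poisson: difference of CDFs, with the lower CDF taken at 2
p_3_to_7 = poisson.cdf(7, mu=5) - poisson.cdf(2, mu=5)

# P(-1 < Z < 1) for a standard normal
p_within_1_sd = norm.cdf(1) - norm.cdf(-1)
```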

Both `np.random.binomial()` and `scipy.stats.binom.rvs()` are used to generate random numbers from a binomial distribution. However, they belong to different libraries and have slightly different functionalities.

`np.random.binomial(n, p, size=None)` is a function from the NumPy library. It generates random numbers from a binomial distribution defined by the number of trials `n` and the probability of success `p`. The `size` parameter determines the shape of the output.

`scipy.stats.binom.rvs(n, p, size=1, random_state=None)` is a function from the SciPy library. It also generates random numbers from a binomial distribution defined by `n` and `p`. The `size` parameter determines the shape of the output, and `random_state` parameter can be used to seed the random number generator for reproducibility.

Which one to use mostly depends on which library the rest of your code already relies on. If you're already using NumPy, `np.random.binomial()` is convenient; if you're working with other `scipy.stats` distribution methods such as `pmf()` and `cdf()`, `binom.rvs()` keeps the interface consistent. Both can be made reproducible: `scipy.stats.binom.rvs()` through its `random_state` argument, and NumPy through `np.random.seed()` or a seeded generator from `np.random.default_rng()`. Both functions draw from the same distribution when given the same parameters.
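
Reproducible draws are possible with either library. A minimal sketch, with the seed value 42 chosen arbitrarily:

```python
import numpy as np
from scipy.stats import binom

# NumPy: two generators created with the same seed produce identical draws
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
numpy_draws_a = rng_a.binomial(n=10, p=0.5, size=5)
numpy_draws_b = rng_b.binomial(n=10, p=0.5, size=5)

# SciPy: the same random_state produces identical draws
scipy_draws_a = binom.rvs(n=10, p=0.5, size=5, random_state=42)
scipy_draws_b = binom.rvs(n=10, p=0.5, size=5, random_state=42)
```

The NumPy and SciPy draws are not expected to match each other, only to repeat within each library.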

Sure, here's a cheat sheet for handling outliers in a data sample using Python:

**1. Z-Score:**

The Z-score measures how many standard deviations a data point lies from the mean. A common rule of thumb treats a point whose absolute Z-score exceeds 3 as a possible outlier, because it is far from the bulk of the data.

```python
from scipy import stats
import numpy as np

data = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])
z_scores = stats.zscore(data)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3)
new_data = data[filtered_entries]
```

**2. IQR Method:**

The Interquartile Range (IQR) method uses the IQR to detect outliers. The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). Any data points that fall below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR` are considered outliers.

```python
import numpy as np

data = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
new_data = data[(data > lower_bound) & (data < upper_bound)]
```

**3. Removing Outliers using Standard Deviation:**

If a data point is more than 3 standard deviations from the mean, it is considered an outlier.

```python
import numpy as np

data = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])
mean = np.mean(data)
std_dev = np.std(data)
cut_off = std_dev * 3
lower_bound, upper_bound = mean - cut_off, mean + cut_off
new_data = data[(data > lower_bound) & (data < upper_bound)]
```

Remember, before removing outliers, it's important to investigate the nature of each one. Some outliers are legitimate data points that point to important findings, while others are the result of errors in data collection.

Sure, here's a cheat sheet for using aggregate functions in Python with the pandas library:

**1. Sum:**

The `sum()` function returns the sum of the values for the requested axis.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
total = df['A'].sum()
```

**2. Mean:**

The `mean()` function returns the mean of the values for the requested axis.

```python
average = df['A'].mean()
```

**3. Min:**

The `min()` function returns the minimum of the values for the requested axis.

```python
minimum = df['A'].min()
```

**4. Max:**

The `max()` function returns the maximum of the values for the requested axis.

```python
maximum = df['A'].max()
```

**5. Count:**

The `count()` function returns the number of non-NA/null values in the Series.

```python
count = df['A'].count()
```

**6. Median:**

The `median()` function returns the median of the values for the requested axis.

```python
median = df['A'].median()
```

**7. Mode:**

The `mode()` function returns the mode(s) of the dataset.

```python
mode = df['A'].mode()
```

**8. Standard Deviation:**

The `std()` function returns the standard deviation of the values for the requested axis.

```python
std_dev = df['A'].std()
```

**9. Variance:**

The `var()` function returns the variance of the values for the requested axis.

```python
variance = df['A'].var()
```

**10. Aggregate:**

The `aggregate()` method, or its alias `agg()`, applies one or more aggregation operations over the specified axis and returns all the results together.

```python
aggregation = df['A'].agg(['sum', 'min', 'max', 'mean', 'median', 'std', 'var'])
```

Remember to replace `'A'` with the name of your column.

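When calculating the standard deviation of a sample, we use `ddof=1` to perform what is known as Bessel's correction. This correction is necessary because we're trying to estimate the standard deviation of a population based on a sample. Note that the defaults differ between libraries: pandas' `std()` and `var()` use `ddof=1`, while NumPy's `np.std()` and `np.var()` use `ddof=0`.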

In statistics, the standard deviation is calculated slightly differently for populations and samples. For a population, we calculate the standard deviation by taking the square root of the average of the squared deviations from the mean. However, when we're working with a sample, we typically want to estimate the standard deviation of the population, not the sample itself.

If we use the same formula for a sample as we do for a population (which is what happens when `ddof=0`), we tend to underestimate the population standard deviation. This is because the deviations are measured from the sample mean, which is itself computed from the same data and therefore sits closer to the sample values than the true population mean does.

Bessel's correction compensates for this by dividing the sum of squared deviations by `n-1` instead of `n` (which is what happens when `ddof=1`). This slightly increases the result and makes the variance estimate unbiased, giving a better estimate of the population standard deviation.
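
The effect of `ddof` is easy to check on a small example. The values below are an illustrative assumption chosen so the population standard deviation comes out to a round number.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]

population_std = np.std(values)          # NumPy default ddof=0, divides by n: 2.0
sample_std = np.std(values, ddof=1)      # Bessel's correction, divides by n - 1
pandas_std = pd.Series(values).std()     # pandas default ddof=1, matches sample_std
```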

Sure, here's an entry on when to use a paired vs unpaired t-test:

**Paired t-test:**

A paired t-test is used when each observation in one sample is matched with exactly one observation in the other, so the two samples are not independent. This typically occurs when the observations are taken from the same individual or object at different times or under different conditions. For example, you might use a paired t-test to compare the weights of individuals before and after a diet, or to compare the performance of a machine before and after a maintenance procedure.

In Python, you can perform a paired t-test using the `scipy.stats.ttest_rel()` function:

```python
from scipy import stats

# data1 and data2 are arrays of the paired observations
t_statistic, p_value = stats.ttest_rel(data1, data2)
```

**Unpaired t-test:**

An unpaired t-test (also known as an independent t-test) is used when the observations are independent of each other. This typically occurs when the observations are taken from two different groups. For example, you might use an unpaired t-test to compare the weights of individuals in two different diet groups, or to compare the performance of two different machines.

In Python, you can perform an unpaired t-test using the `scipy.stats.ttest_ind()` function:

```python
from scipy import stats

# data1 and data2 are arrays of the independent observations
t_statistic, p_value = stats.ttest_ind(data1, data2)
```

Remember, the choice between a paired and unpaired t-test depends on the nature of your observations and your experimental design. Always make sure to understand your data and the assumptions of the statistical test before choosing which one to use.
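
To see why the choice matters, the sketch below runs both tests on the same made-up before/after weights. The numbers are hypothetical and chosen so that each person changes by a small, consistent amount while the people differ a lot from one another.

```python
import numpy as np
from scipy import stats

# Hypothetical weights (kg) for the same five people before and after a diet
before = np.array([80.0, 85.0, 90.0, 95.0, 100.0])
after = before - np.array([2.0, 3.0, 2.0, 3.0, 2.0])

# Paired test works on the per-person differences
paired = stats.ttest_rel(before, after)

# Unpaired test ignores the pairing and compares the two group means
unpaired = stats.ttest_ind(before, after)

print(f"paired p = {paired.pvalue:.4f}, unpaired p = {unpaired.pvalue:.4f}")
```

The paired test detects the consistent change because it removes the person-to-person variation; the unpaired test treats that variation as noise and fails to find a difference.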

Sure, here's a short cheat sheet on how to find Interquartile Ranges (IQRs) in Python:

**Interquartile Range (IQR):**

The IQR is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles. It's often used to find outliers in the data.

```python
import numpy as np

# Given data
data = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])

# Calculate Q1 (25th percentile)
Q1 = np.percentile(data, 25)

# Calculate Q3 (75th percentile)
Q3 = np.percentile(data, 75)

# Calculate IQR as difference between Q3 and Q1
IQR = Q3 - Q1
```

In this code, `np.percentile()` is used to find the first quartile (Q1) and the third quartile (Q3). The IQR is then calculated as the difference between Q3 and Q1.

Sure, here's a cheat sheet for the requested topics:

**1. Imputation Methods:**

Imputation is the process of replacing missing data with substituted values. When a value is missing because it was not observed, imputation can be used to fill in the "gap" in the data. In Python, the `pandas` library provides the `fillna()` method, which can be used for this purpose.

```python
import pandas as pd

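# Assuming df is your DataFrame and 'column' is the column with missing values.
# Pick ONE of the options below; after the first one runs, no missing values remain.
# Assigning the result back avoids the chained inplace=True pattern, which newer pandas versions warn about.

# Use mean of the column for imputation
df['column'] = df['column'].fillna(df['column'].mean())

# Use median of the column for imputation
df['column'] = df['column'].fillna(df['column'].median())

# Use mode of the column for imputation
df['column'] = df['column'].fillna(df['column'].mode()[0])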
```

**2. Variable Transformations:**

Transformations can be used to stabilize variance, make the data more closely align with the normal distribution, improve the interpretability of the data, or satisfy the assumptions of a statistical model. In Python, `numpy` provides functions for common transformations.

```python
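import numpy as np
from scipy import stats

# data is assumed to be a 1-D array of strictly positive values

# Log transformation (undefined for zero or negative values)
log_data = np.log(data)

# Square root transformation (requires non-negative values)
sqrt_data = np.sqrt(data)

# Box-Cox transformation (requires strictly positive values);
# the second return value is the fitted lambda
boxcox_data, fitted_lambda = stats.boxcox(data)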
```

**3. Types of Missingness:**

There are three types of missingness: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR).

- MCAR: The missingness of data is not related to any other variable. This is the ideal scenario but is very rare in practice.
- MAR: The missingness of data is related to other observed variables but, once those are accounted for, not to the unobserved values of the variable with missing data.
- MNAR: The missingness of data is related to the values of the variable that has missing data.

Handling of missing data depends on the type of missingness. With MCAR data, dropping the incomplete rows does not bias the results, although it reduces the sample size and statistical power. For MAR, you can use methods like multiple imputation or algorithms that support missing values (like XGBoost). For MNAR, it's important to investigate why the data is missing, and it may be necessary to make untestable assumptions about the missing data.

**4. Handling Outliers:**

Outliers are data points that are significantly different from other observations. They can be genuine or due to errors. In Python, the Z-score method or the IQR method can be used to detect and handle outliers.

```python
import numpy as np
from scipy import stats

# Z-score method
z_scores = stats.zscore(data)
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3)
new_data = data[filtered_entries]

# IQR method
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
new_data = data[(data > lower_bound) & (data < upper_bound)]
```

Remember, before removing outliers, it's important to investigate the nature of each one. Some outliers are legitimate data points that point to important findings, while others are the result of errors in data collection.

Sure, here's how you can calculate these statistics in Python using the pandas and numpy libraries:

```python
import pandas as pd
import numpy as np

# Assuming df is your DataFrame and 'column' is the column for which you want to calculate these statistics

# Measures of Center
mean = df['column'].mean()
median = df['column'].median()
mode = df['column'].mode()

# Measures of Spread
value_range = df['column'].max() - df['column'].min()  # avoids shadowing the built-in range()
std_dev = df['column'].std()
variance = df['column'].var()

# Skewness
skewness = df['column'].skew()

# Missingness
missing_values = df['column'].isnull().sum()
total_values = len(df['column'])
missingness_percentage = (missing_values / total_values) * 100

# Correlation between two variables 'column1' and 'column2'
correlation = df['column1'].corr(df['column2'])

# Print the calculated statistics
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Range: {value_range}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
print(f"Skewness: {skewness}")
print(f"Missingness: {missingness_percentage}%")
print(f"Correlation between column1 and column2: {correlation}")
```

In this code:

- The mean, median, and mode are measures of center. They give you a sense of the "typical" value of the variable.
- The range, standard deviation, and variance are measures of spread. They give you a sense of how much the values of the variable vary.
- Skewness is a measure of the asymmetry of the probability distribution.
- Missingness is the percentage of missing values in the variable. High missingness can bias the results of your analysis.
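- The correlation is a measure of how two variables move in relation to each other. `Series.corr()` computes the Pearson correlation by default, which captures the strength of a linear relationship on a scale from -1 to 1.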

There are several common methods for handling missing data in statistical analysis:

1. **Listwise Deletion:** Also known as complete case analysis, the simplest method is to remove entire observations that have missing values. However, this method gives unbiased results only when the data is missing completely at random (MCAR), and it reduces the sample size.

```python
df.dropna(inplace=True)
```

2. **Pairwise Deletion:** This method, also known as available case analysis, uses all the available data to compute the statistical analysis. For example, if you are correlating multiple variables, it will use every case where each pair of variables is available.
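
In pandas, `DataFrame.corr()` already behaves this way: it excludes missing values pair by pair. The small DataFrame below is a made-up example used to contrast pairwise and listwise deletion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0],
    'b': [2.0, np.nan, 6.0, 8.0, 10.0],
    'c': [5.0, 4.0, 3.0, np.nan, 1.0],
})

# DataFrame.corr() drops missing values pair by pair (available case analysis)
pairwise_corr = df.corr()

# Listwise deletion first keeps only the fully observed rows
listwise_corr = df.dropna().corr()

# Number of rows each approach can use for the ('a', 'b') pair
rows_a_b = df[['a', 'b']].dropna().shape[0]
rows_complete = df.dropna().shape[0]
```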

3. **Mean/Median/Mode Imputation:** Replace the missing values with the mean, median, or mode of that variable. This is a quick and easy method, but it can distort the distribution of the data.

```python
# Pick ONE of the options below and assign the result back to the column

# Mean imputation
df['column'] = df['column'].fillna(df['column'].mean())

# Median imputation
df['column'] = df['column'].fillna(df['column'].median())

# Mode imputation
df['column'] = df['column'].fillna(df['column'].mode()[0])
```

4. **Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB):** These methods are used in time series data. LOCF replaces a missing value with the last valid observation before it, and NOCB replaces it with the next valid observation after it.
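
In pandas, these two strategies correspond to `ffill()` and `bfill()`. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

locf = series.ffill()   # carry the last valid observation forward
nocb = series.bfill()   # carry the next valid observation backward
```

Values with no valid neighbour in the fill direction stay missing: here the final element of `nocb` remains `NaN` because nothing follows it.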

5. **Interpolation:** In time series data, missing values can be replaced by interpolating between valid observations. The `interpolate()` method in pandas supports various methods, such as linear, time, and polynomial, to interpolate missing values.

```python
# interpolate() returns a new DataFrame, so keep the result
df_interpolated = df.interpolate(method='linear', limit_direction='forward')
```

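6. **Multiple Imputation:** Multiple imputation is a more sophisticated method that fills missing values multiple times to create several "complete" datasets. Analysis is then performed on all datasets and the results are pooled. This can be done using the `mice` package in R. In Python, scikit-learn's `IterativeImputer` class produces a single imputed dataset by default; running it repeatedly with `sample_posterior=True` and different random seeds can yield multiple imputed datasets.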

```python
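import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Assumes every column of df is numeric
imputer = IterativeImputer(random_state=0)
# fit_transform returns a NumPy array, so rebuild the DataFrame
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)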
```

Remember, the method you choose depends on the nature of your data and the reason why the data is missing. Always try to understand why data is missing before choosing a method to handle it.