#STATISTICS BASICS

1. What is statistics, and why is it important?
   - Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions. It’s important because it helps in understanding patterns, making predictions, and guiding decision-making in fields like science, business, health, and economics.

2. What are the two main types of statistics?
   - The two main types are descriptive statistics, which summarize data, and inferential statistics, which use data from a sample to make generalizations about a population.

3. What are descriptive statistics?
   - Descriptive statistics involve methods like mean, median, mode, and standard deviation to summarize and describe the main features of a dataset.

4. What is inferential statistics?
   - Inferential statistics draw conclusions about a population based on sample data using techniques like hypothesis testing, confidence intervals, and regression.

5. What is sampling in statistics?
   - Sampling is the process of selecting a subset of individuals from a population to analyze and draw conclusions about the entire group.

6. What are the different types of sampling methods?
   - Common methods include random sampling, systematic sampling, stratified sampling, cluster sampling, and convenience sampling.

7. What is the difference between random and non-random sampling?
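   - Random (probability) sampling gives each member of the population a known, nonzero chance of selection (an equal chance in simple random sampling), which reduces selection bias. Non-random sampling relies on convenience or the researcher's judgment, which may introduce bias.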

8. Define qualitative and quantitative data, and give examples of each.
   - Qualitative data describes categories or qualities (e.g., gender, color), while quantitative data involves numbers and measurements (e.g., age, weight).

9. What are the different types of data in statistics?
   - Data types include nominal, ordinal, interval, and ratio, which determine the level of measurement and appropriate statistical methods.

10. Explain nominal, ordinal, interval, and ratio levels of measurement.
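   - Nominal data are unordered categories (e.g., blood type), ordinal data have a meaningful order but unequal or unknown spacing (e.g., rankings), interval data have equal spacing without a true zero (e.g., temperature in Celsius), and ratio data have a true zero, so ratios are meaningful (e.g., height, weight).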

11. What is the measure of central tendency?
   - It refers to values that represent the center or average of a dataset, mainly using mean, median, and mode.

12. Define mean, median, and mode.
   - Mean is the average, median is the middle value, and mode is the most frequent value in a dataset.
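As a minimal sketch of these three definitions, the snippet below uses Python's built-in `statistics` module on a small made-up list of salaries (the data and variable names are illustrative assumptions). The single large value pulls the mean upward while the median and mode stay near the bulk of the data.

```python
import statistics

# Hypothetical salaries in thousands; the last value is a deliberate outlier.
salaries = [30, 32, 32, 35, 38, 40, 250]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # middle value of the sorted data
mode_salary = statistics.mode(salaries)      # most frequent value

print("Mean:", round(mean_salary, 2))
print("Median:", median_salary)
print("Mode:", mode_salary)
```

When a dataset contains outliers like this, the median is usually the more representative summary.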

13. What is the significance of the measure of central tendency?
   - It helps summarize large data sets with a single representative value, making data easier to interpret and compare.

14. What is variance, and how is it calculated?
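   - Variance measures how far data points spread around the mean. It’s calculated as the average of the squared differences from the mean: divide by N for a population, or by n - 1 for a sample (which corrects the bias from estimating the mean from the same data).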
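The calculation can be sketched step by step. The dataset below is an arbitrary illustration; the hand-computed results are compared against `statistics.pvariance` (population, divides by N) and `statistics.variance` (sample, divides by n - 1).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

n = len(data)
mean = sum(data) / n

# Squared difference of each point from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

population_variance = sum(squared_diffs) / n       # divide by N
sample_variance = sum(squared_diffs) / (n - 1)     # divide by n - 1

print("Population variance:", population_variance)
print("Sample variance:", sample_variance)

# Cross-check against the standard library
print("pvariance:", statistics.pvariance(data))
print("variance:", statistics.variance(data))
```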

15. What is standard deviation, and why is it important?
   - Standard deviation is the square root of variance and shows how spread out the data is. It’s important for understanding variability and consistency.

16. Define and explain the term range in statistics.
   - Range is the difference between the highest and lowest values in a dataset. It gives a quick sense of the data’s spread.

17. What is the difference between variance and standard deviation?
   - Both measure spread, but variance uses squared units while standard deviation returns values in the original units of the data, making it more interpretable.
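One way to see the difference in units is to change the unit of measurement. In this sketch (hypothetical heights, population formulas), converting centimetres to metres divides the standard deviation by 100 but divides the variance by 100 squared.

```python
import statistics

# Hypothetical heights in centimetres
heights_cm = [150, 160, 165, 170, 180]
heights_m = [h / 100 for h in heights_cm]

var_cm2 = statistics.pvariance(heights_cm)  # units: cm^2
std_cm = statistics.pstdev(heights_cm)      # units: cm

var_m2 = statistics.pvariance(heights_m)    # units: m^2
std_m = statistics.pstdev(heights_m)        # units: m

print("Variance:", var_cm2, "cm^2 vs", var_m2, "m^2")
print("Std dev:", std_cm, "cm vs", std_m, "m")
```

The standard deviation of 10 cm can be read directly against the heights themselves, while the variance of 100 cm^2 cannot.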

18. What is skewness in a dataset?
   - Skewness measures the asymmetry of a data distribution. A skewed dataset has a longer tail on one side.

19. What does it mean if a dataset is positively or negatively skewed?
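   - Positively skewed means the longer tail is on the right: most values sit at the low end, and a few large values typically pull the mean above the median. Negatively skewed means the longer tail is on the left: most values sit at the high end, and the mean typically falls below the median.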
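The sign of the skewness coefficient can be checked by hand. The sketch below implements the moment-based coefficient (third central moment divided by the second central moment raised to the power 1.5) in plain Python; the helper name `skewness` and both datasets are assumptions made for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 4, 20]        # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]      # mirror image: tail on the left

skew_right = skewness(right_tailed)
skew_left = skewness(left_tailed)

print("Right-tailed skewness:", skew_right)   # positive
print("Left-tailed skewness:", skew_left)     # negative
```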

20. Define and explain kurtosis.
   - Kurtosis measures the "tailedness" of a distribution relative to a normal distribution. High kurtosis means heavier tails and more extreme outliers, while low kurtosis means lighter tails and fewer extreme values.
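A small sketch of excess kurtosis (fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores 0). The helper name `excess_kurtosis` and the two datasets are illustrative assumptions.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

# Mostly identical values with two extreme outliers: heavy tails
heavy_tailed = [5, 5, 5, 5, 5, 5, 5, 5, -20, 30]

# Evenly spread values with no outliers: light tails
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]

kurt_heavy = excess_kurtosis(heavy_tailed)
kurt_flat = excess_kurtosis(flat)

print("Heavy-tailed:", kurt_heavy)  # positive
print("Flat:", kurt_flat)           # negative
```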

21. What is the purpose of covariance?
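   - Covariance indicates the direction of the linear relationship between two variables: positive if they tend to increase together, negative if one tends to increase while the other decreases. Its magnitude depends on the units of the variables, so it is hard to interpret on its own.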
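To illustrate, the sketch below computes the sample covariance by hand for made-up data (the helper `sample_covariance` and all values are assumptions). The sign reflects the direction of the relationship, and re-expressing one variable in different units rescales the covariance.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Hypothetical data
study_hours = [1, 2, 3, 4, 5]
exam_scores = [52, 55, 61, 70, 72]   # rises with study time
errors_made = [20, 18, 15, 9, 8]     # falls with study time

cov_pos = sample_covariance(study_hours, exam_scores)
cov_neg = sample_covariance(study_hours, errors_made)

# Same study time expressed in minutes: the covariance is 60 times larger
study_minutes = [h * 60 for h in study_hours]
cov_minutes = sample_covariance(study_minutes, exam_scores)

print("Hours vs scores:", cov_pos)
print("Hours vs errors:", cov_neg)
print("Minutes vs scores:", cov_minutes)
```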

22. What does correlation measure in statistics?
   - Correlation quantifies the strength and direction of a linear relationship between two variables. The Pearson correlation coefficient always lies between -1 and +1.

23. What is the difference between covariance and correlation?
   - Both describe how two variables move together, but the magnitude of covariance depends on the units of the variables, so it mainly conveys direction. Correlation is covariance standardized by the two standard deviations, so it is unitless and shows both direction and strength.

24. What are some real-world applications of statistics?
   - Statistics is used in healthcare (clinical trials), economics (market trends), business (customer analysis), sports (performance metrics), and government (policy making).



#PRACTICAL QUESTIONS

In [2]:
#  How do you calculate the mean, median, and mode of a dataset?

import statistics

data = [4, 5, 6, 8, 9, 9]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)



Mean: 6.833333333333333
Median: 7.0
Mode: 9


In [3]:
#  Write a Python program to compute the variance and standard deviation of a dataset

import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

# statistics.variance and statistics.stdev compute the sample versions (dividing by n - 1)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print("Variance:", variance)
print("Standard Deviation:", std_dev)


Variance: 27.428571428571427
Standard Deviation: 5.237229365663818


In [None]:
# Implement sampling techniques like random sampling and stratified sampling

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'Group': ['A']*5 + ['B']*5, 'Value': range(10)})

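# Random Sampling (random_state fixed so the result is reproducible)
random_sample = data.sample(n=4, random_state=42)

# Stratified Sampling: keep half of the rows while preserving the Group proportions
stratified_sample, _ = train_test_split(data, stratify=data['Group'], test_size=0.5, random_state=42)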

print("Random Sample:\n", random_sample)
print("Stratified Sample:\n", stratified_sample)


In [None]:
# Write a Python function to calculate the range of a dataset

def calculate_range(data):
    return max(data) - min(data)

data = [2, 8, 4, 10, 6]
print("Range:", calculate_range(data))


In [None]:
# Create a dataset and plot its histogram to visualize skewness

import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 4, 10, 15]

plt.hist(data, bins=5)
plt.title("Histogram")
plt.xlabel("Values")
plt.ylabel("Frequency")
plt.show()


In [None]:
# Calculate skewness and kurtosis of a dataset using Python libraries

from scipy.stats import skew, kurtosis

data = [1, 2, 2, 3, 4, 10, 15]

# kurtosis() returns excess kurtosis by default (a normal distribution gives 0)
print("Skewness:", skew(data))
print("Kurtosis:", kurtosis(data))


In [None]:
# Generate a dataset and demonstrate positive and negative skewness

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

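# Seeded generator so the plots are reproducible
rng = np.random.default_rng(0)

# Positive skew: exponential data has a long right tail
pos_skew = rng.exponential(scale=2, size=1000)

# Negative skew: negating the values mirrors the tail to the left
neg_skew = -rng.exponential(scale=2, size=1000)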

sns.histplot(pos_skew, kde=True)
plt.title("Positive Skew")
plt.show()

sns.histplot(neg_skew, kde=True)
plt.title("Negative Skew")
plt.show()


In [None]:
#  Write a Python script to calculate covariance between two datasets

import numpy as np

x = [2, 4, 6, 8]
y = [1, 3, 2, 5]

cov_matrix = np.cov(x, y)
print("Covariance:", cov_matrix[0, 1])


In [None]:
# Write a Python script to calculate the correlation coefficient between two datasets

import numpy as np

x = [2, 4, 6, 8]
y = [1, 3, 2, 5]

corr_matrix = np.corrcoef(x, y)
print("Correlation Coefficient:", corr_matrix[0, 1])


In [None]:
#  Create a scatter plot to visualize the relationship between two variables

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 8, 7]

plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()


In [None]:
# Implement and compare simple random sampling and systematic sampling

import pandas as pd

data = pd.DataFrame({'Value': range(1, 21)})

# Simple random sampling (random_state fixed for reproducibility)
simple_random = data.sample(n=5, random_state=0)

# Systematic sampling (every 4th element, starting from the first row)
systematic = data.iloc[::4]

print("Simple Random:\n", simple_random)
print("Systematic:\n", systematic)


In [None]:
# Calculate the mean, median, and mode of grouped data

import pandas as pd

# Grouped data
data = pd.Series([10, 10, 20, 20, 20, 30, 30, 40])

print("Mean:", data.mean())
print("Median:", data.median())
print("Mode:", data.mode()[0])
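The cell above works on raw values. If the data instead arrives as class intervals with frequencies, the mean, median, and mode can only be estimated. The following is a minimal sketch under that assumption, using the usual midpoint and interpolation formulas; the frequency table and variable names are hypothetical.

```python
# Hypothetical frequency table: (lower bound, upper bound, frequency)
classes = [(0, 10, 3), (10, 20, 5), (20, 30, 2)]

total = sum(freq for _, _, freq in classes)

# Mean: weight each class midpoint by its frequency
grouped_mean = sum((lo + hi) / 2 * freq for lo, hi, freq in classes) / total

# Median: interpolate inside the first class whose cumulative frequency reaches N/2
cumulative = 0
for lo, hi, freq in classes:
    if cumulative + freq >= total / 2:
        grouped_median = lo + (total / 2 - cumulative) / freq * (hi - lo)
        break
    cumulative += freq

# Modal class: the class with the highest frequency
modal_class = max(classes, key=lambda c: c[2])[:2]

print("Mean:", grouped_mean)
print("Median:", grouped_median)
print("Modal class:", modal_class)
```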


In [None]:
#  Simulate data using Python and calculate its central tendency and dispersion

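import numpy as np
import statistics

# Convert to a plain Python list so the statistics module works with built-in ints
# rather than NumPy integer types, which can lead to truncated results
data = np.random.randint(1, 100, 20).tolist()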

print("Mean:", statistics.mean(data))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
print("Range:", max(data) - min(data))
print("Variance:", statistics.variance(data))
print("Std Dev:", statistics.stdev(data))
