# Statistics Basics | Assignment

### Question 1: What is the difference between descriptive statistics and inferential statistics? Explain with examples.

**Descriptive Statistics:**

Descriptive statistics are used to summarize and describe the main features of a dataset. They provide simple summaries of the sample and the observations that have been made. These summaries may be either quantitative (e.g., calculating the average height of students in a class) or visual (e.g., creating a bar chart to show the distribution of test scores).

Examples of descriptive statistics include:

*   **Measures of central tendency:** Mean, median, mode (e.g., the average income of a city's residents).
*   **Measures of dispersion:** Range, variance, standard deviation (e.g., the spread of ages in a survey group).
*   **Frequency distributions:** Tables or graphs showing how often each value or range of values appears in a dataset (e.g., a histogram showing the distribution of grades on an exam).

**Inferential Statistics:**

Inferential statistics are used to make inferences or predictions about a larger population based on a sample of data taken from that population. They allow us to draw conclusions beyond the immediate data alone and to make generalizations.

Examples of inferential statistics include:

*   **Hypothesis testing:** Determining if there is enough evidence to reject a null hypothesis (e.g., testing if a new drug is effective in treating a disease).
*   **Confidence intervals:** Estimating a range of values within which a population parameter is likely to fall (e.g., estimating the average height of all adult males in a country based on a sample).
*   **Regression analysis:** Examining the relationship between two or more variables to make predictions (e.g., predicting a student's test score based on the number of hours they studied).

In essence, descriptive statistics describe what is in the data, while inferential statistics allow us to make educated guesses and draw conclusions about a larger group based on that data.
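
The contrast can be sketched in a few lines of Python. The heights below are hypothetical, and the confidence interval uses a normal approximation (z of about 1.96) as a simplifying assumption; for a sample this small, a t-based interval would be somewhat wider.

```python
import statistics

# Hypothetical sample: heights (cm) of 10 students drawn from a larger school.
sample_heights = [160, 165, 158, 172, 168, 170, 163, 175, 167, 162]

# Descriptive: summarize the sample itself.
sample_mean = statistics.mean(sample_heights)
sample_sd = statistics.stdev(sample_heights)  # sample standard deviation (n - 1)

# Inferential: approximate 95% confidence interval for the population mean.
# Assumption: normal approximation; a t-interval would be wider for n = 10.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * sample_sd / len(sample_heights) ** 0.5
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"Sample mean: {sample_mean:.1f} cm, sample SD: {sample_sd:.2f} cm")
print(f"Approximate 95% CI for the population mean: ({ci_low:.1f}, {ci_high:.1f}) cm")
```

The first two values describe only the ten students measured; the interval is a statement about the whole school, which is what makes it inferential.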

### Question 2: What is sampling in statistics? Explain the differences between random and stratified sampling.

**Sampling in Statistics:**

Sampling in statistics is the process of selecting a subset of individuals or observations from a larger population. It is often impractical or impossible to collect data from every member of a population, so researchers use sampling to gather data from a representative group that can be used to make inferences about the entire population.

**Random Sampling:**

Random sampling (also known as simple random sampling) is a method where every member of the population has an equal chance of being selected for the sample. This method helps to minimize selection bias and makes it likely, though not guaranteed, that the sample is representative of the population.

*   **Example:** Imagine you want to survey students' opinions on a new school policy. In random sampling, you could put all students' names into a hat and draw a certain number of names randomly to be your sample.

**Stratified Sampling:**

Stratified sampling is a method where the population is divided into subgroups or "strata" based on shared characteristics (e.g., age, gender, income level). Then, a random sample is taken from each stratum, most commonly in proportion to its size in the population (proportionate allocation). This method ensures that each subgroup is adequately represented in the sample, which can be important when the characteristic being studied varies significantly across subgroups.

*   **Example:** If you are conducting a survey about voting preferences in a city, you might divide the population into strata based on age groups (e.g., 18-25, 26-40, 41-60, 61+). You would then randomly sample from each age group in proportion to their representation in the city's population. This ensures that your sample reflects the age distribution of the city.

**Key Differences:**

| Feature           | Random Sampling                          | Stratified Sampling                                 |
| :---------------- | :--------------------------------------- | :-------------------------------------------------- |
| **Process**       | Every member has an equal chance.        | Population divided into strata, then random sampling within strata. |
| **Population Division** | No division of the population.         | Population is divided into subgroups (strata).      |
| **Representation** | Relies on randomness for representation. | Ensures representation of specific subgroups.       |
| **Use Case**      | Suitable for homogeneous populations.    | Suitable for heterogeneous populations with important subgroups. |
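
As a rough illustration of the two methods, the sketch below draws both kinds of sample from a hypothetical population of 1,000 voters. The group sizes, the sample size, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 voters, each tagged with an age group.
population = (
    [("18-25", i) for i in range(200)]
    + [("26-40", i) for i in range(300)]
    + [("41-60", i) for i in range(350)]
    + [("61+", i) for i in range(150)]
)
sample_size = 100

# Simple random sampling: every voter has the same chance of selection.
simple_sample = random.sample(population, sample_size)

# Proportionate stratified sampling: sample within each age group, giving
# each group a share of the sample equal to its share of the population.
strata = {}
for group, voter_id in population:
    strata.setdefault(group, []).append((group, voter_id))

stratified_sample = []
for group, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    stratified_sample.extend(random.sample(members, k))

simple_counts = Counter(group for group, _ in simple_sample)
stratified_counts = Counter(group for group, _ in stratified_sample)
print("Simple random:", dict(simple_counts))
print("Stratified:   ", dict(stratified_counts))
```

The stratified counts match the population shares exactly (20, 30, 35, 15), while the simple random counts only approximate them and change from draw to draw.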

### Question 3: Define mean, median, and mode. Explain why these measures of central tendency are important.

**Measures of Central Tendency:**

Measures of central tendency are statistical values that represent the center or typical value of a dataset. They provide a single value that summarizes the distribution of data. The three most common measures of central tendency are the mean, median, and mode.

*   **Mean:** The mean (or arithmetic average) is calculated by summing all the values in a dataset and dividing by the number of values. It is sensitive to extreme values (outliers).

    *   **Example:** For the dataset [10, 15, 20, 25, 30], the mean is (10 + 15 + 20 + 25 + 30) / 5 = 100 / 5 = 20.

*   **Median:** The median is the middle value in a dataset that has been ordered from least to greatest. If the dataset has an even number of values, the median is the average of the two middle values. The median is robust to extreme values: changing the size of an outlier usually leaves it unchanged.

    *   **Example:** For the dataset [10, 15, 20, 25, 30], the ordered dataset is [10, 15, 20, 25, 30]. The middle value is 20, so the median is 20.
    *   **Example:** For the dataset [10, 15, 20, 25, 30, 35], the ordered dataset is [10, 15, 20, 25, 30, 35]. The two middle values are 20 and 25, so the median is (20 + 25) / 2 = 22.5.

*   **Mode:** The mode is the value that appears most frequently in a dataset. A dataset can have one mode (unimodal), more than one mode (multimodal), or no mode.

    *   **Example:** For the dataset [10, 15, 20, 20, 25, 30], the mode is 20 because it appears most often.
    *   **Example:** For the dataset [10, 10, 15, 20, 20, 25], the modes are 10 and 20.
    *   **Example:** For the dataset [10, 15, 20, 25, 30], there is no mode as each value appears only once.

**Importance of Measures of Central Tendency:**

These measures are important for several reasons:

*   **Summarizing data:** They provide a single, representative value that can summarize a large dataset, making it easier to understand and interpret.
*   **Comparing datasets:** They allow for easy comparison between different datasets by looking at their typical values.
*   **Identifying typical values:** They help identify what a "typical" observation in the dataset looks like.
*   **Basis for further analysis:** They are often the starting point for more advanced statistical analyses.

Choosing the appropriate measure of central tendency depends on the type of data and the presence of outliers. The mean is suitable for symmetrical distributions without outliers, while the median is preferred for skewed distributions or datasets with outliers. The mode is useful for categorical or discrete data.
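
A short sketch shows why outliers matter for this choice. It reuses the example dataset from above and appends one hypothetical extreme value (500) purely for illustration.

```python
import statistics

values = [10, 15, 20, 25, 30]
with_outlier = values + [500]  # hypothetical extreme value added for illustration

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"Without outlier: mean = {mean_before}, median = {median_before}")
print(f"With outlier:    mean = {mean_after}, median = {median_after}")
```

The single extreme value moves the mean from 20 to 100, while the median only shifts from 20 to 22.5.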

### Question 4: Explain skewness and kurtosis. What does a positive skew imply about the data?

**Skewness:**

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. In simpler terms, it indicates the extent to which a dataset is not symmetrical.

*   **Positive Skew (Right Skew):** When a distribution has a positive skew, its tail extends further to the right. Extreme values occur mainly on the higher end of the distribution, and the mean is typically greater than the median.

    *   **What a positive skew implies:** The majority of the data points are clustered towards the lower end, and a small number of higher values pull the mean towards the right. For example, income data often exhibits a positive skew, where most people earn a lower income, but a few individuals earn significantly higher incomes.

*   **Negative Skew (Left Skew):** When a distribution has a negative skew, its tail extends further to the left. Extreme values occur mainly on the lower end of the distribution, and the mean is typically less than the median.

*   **Zero Skew:** A perfectly symmetrical distribution, such as a normal distribution, has zero skew. The converse does not always hold: an asymmetrical distribution can also have a skewness of zero. In a symmetrical distribution with a single peak, the mean, median, and mode are all equal.

**Kurtosis:**

Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. It describes how heavy the distribution's tails are relative to the tails of a normal distribution. It is often also described as a measure of peakedness, but its value is driven mainly by the tails.

*   **Leptokurtic (High Kurtosis):** Distributions that are leptokurtic have heavier tails than a normal distribution, often accompanied by a sharper peak. This means there is a higher probability of extreme values (outliers).

*   **Mesokurtic (Medium Kurtosis):** Distributions that are mesokurtic have kurtosis similar to a normal distribution (a kurtosis of 3, or an excess kurtosis of 0).

*   **Platykurtic (Low Kurtosis):** Distributions that are platykurtic have lighter tails than a normal distribution, often accompanied by a flatter peak. This means there is a lower probability of extreme values.

**In summary:**

*   **Skewness** tells us about the symmetry of the data.
*   **Kurtosis** tells us about the heaviness of the tails, and therefore how prone the data is to extreme values.

Understanding skewness and kurtosis helps in describing the shape of a distribution and in choosing appropriate statistical methods for analysis.
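
Both measures can be computed from the central moments of the data. The sketch below uses the moment-based (population) definitions; the helper name `skewness_and_excess_kurtosis` and the right-skewed dataset are hypothetical, and statistical libraries may apply small-sample bias corrections that give slightly different numbers.

```python
import numpy as np

def skewness_and_excess_kurtosis(values):
    """Return moment-based (population) skewness and excess kurtosis."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    m4 = np.mean(deviations ** 4)  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]  # hypothetical data with one large value

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)

print(f"Symmetric:    skewness = {skew_sym:.3f}, excess kurtosis = {kurt_sym:.3f}")
print(f"Right-skewed: skewness = {skew_right:.3f}, excess kurtosis = {kurt_right:.3f}")
```

The symmetric list has a skewness of zero and a negative excess kurtosis (platykurtic), while the list with one large value has a positive skewness.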

### Question 5: Implement a Python program to compute the mean, median, and mode of a given list of numbers.


In [None]:
import statistics

def compute_mean_median_mode(data):
  """
  Computes the mean, median, and mode of a list of numbers.

  Args:
    data: A list of numbers.

  Returns:
    A dictionary containing the mean, median, and mode.
  """
  mean_value = statistics.mean(data)
  median_value = statistics.median(data)

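  # statistics.mode() returns the first mode on ties (Python 3.8+) instead of
  # raising, so use multimode() to handle ties and the no-repeat case.
  modes = statistics.multimode(data)
  if len(modes) == len(data):
    mode_value = "No mode"  # every value appears exactly once
  elif len(modes) == 1:
    mode_value = modes[0]
  else:
    mode_value = modes  # list of all tied modes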

  return {
      "mean": mean_value,
      "median": median_value,
      "mode": mode_value
  }

# Example usage:
my_list = [10, 15, 20, 20, 25, 30]
results = compute_mean_median_mode(my_list)
print(f"The list is: {my_list}")
print(f"Mean: {results['mean']}")
print(f"Median: {results['median']}")
print(f"Mode: {results['mode']}")

my_list_2 = [1, 2, 3, 4, 5]
results_2 = compute_mean_median_mode(my_list_2)
print(f"\nThe list is: {my_list_2}")
print(f"Mean: {results_2['mean']}")
print(f"Median: {results_2['median']}")
print(f"Mode: {results_2['mode']}")

my_list_3 = [1, 1, 2, 3, 3, 4]
results_3 = compute_mean_median_mode(my_list_3)
print(f"\nThe list is: {my_list_3}")
print(f"Mean: {results_3['mean']}")
print(f"Median: {results_3['median']}")
print(f"Mode: {results_3['mode']}")

### Question 6: Compute the covariance and correlation coefficient between the following two datasets provided as lists in Python:

In [None]:
import numpy as np

# Define the two datasets
dataset1 = [1, 2, 3, 4, 5]
dataset2 = [5, 4, 3, 2, 1]

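# Compute the sample covariance (np.cov divides by n - 1 by default);
# element [0, 1] of the 2x2 covariance matrix is cov(dataset1, dataset2)
covariance = np.cov(dataset1, dataset2)[0, 1]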

# Compute the correlation coefficient
correlation_coefficient = np.corrcoef(dataset1, dataset2)[0, 1]

print(f"Dataset 1: {dataset1}")
print(f"Dataset 2: {dataset2}")
print(f"Covariance: {covariance}")
print(f"Correlation Coefficient: {correlation_coefficient}")

### Question 7: Write a Python script to draw a boxplot for the following numeric list and identify its outliers. Explain the result:


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# The assignment's numeric list is not reproduced here, so this list is
# an illustrative example; replace it with the list from the assignment.
data_list = [10, 15, 20, 22, 25, 30, 35, 40, 100]

# Create a pandas Series from the list for easier plotting with seaborn
data_series = pd.Series(data_list)

# Create the boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(x=data_series)
plt.title("Boxplot of the Data")
plt.xlabel("Values")
plt.show()

**Explanation of the Boxplot and Outlier Identification:**

The boxplot provides a visual representation of the distribution of the data and helps in identifying potential outliers.

*   **The Box:** The box in the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). Q1 is the 25th percentile, and Q3 is the 75th percentile of the data. The line inside the box represents the median (the 50th percentile).
*   **The Whiskers:** The lines extending from the box are called whiskers. They typically extend to the most extreme data points that lie within 1.5 times the IQR of the box edges, that is, no lower than Q1 - 1.5 * IQR and no higher than Q3 + 1.5 * IQR.
*   **Outliers:** Data points that fall outside the whiskers are considered potential outliers. They are often plotted as individual points (like circles or diamonds) beyond the whiskers.

In the boxplot generated for the list `[10, 15, 20, 22, 25, 30, 35, 40, 100]`:

*   The box covers the middle 50% of the data. With quartiles computed by linear interpolation between data points (the NumPy default), Q1 = 20, the median is 25, and Q3 = 35, so the IQR is 15.
*   The whiskers extend to the most extreme values inside the fences Q1 - 1.5 * IQR = -2.5 and Q3 + 1.5 * IQR = 57.5, which here are 10 and 40.
*   The data point `100` is drawn as a separate marker beyond the upper whisker. Because it lies above the upper fence of 57.5, it is flagged as an outlier: it is far larger than the other values and falls outside the typical range of the data.

Outliers can significantly affect statistical calculations like the mean, so identifying and understanding them is an important step in data analysis.
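
The plot shows outliers visually, but the 1.5 * IQR rule described above can also be applied numerically to list them explicitly. This is a minimal sketch that assumes the same example list and NumPy's default linear-interpolation percentiles; other quartile conventions can give slightly different fences.

```python
import numpy as np

data_list = [10, 15, 20, 22, 25, 30, 35, 40, 100]  # same example list as above

# Quartiles and interquartile range
q1, q3 = np.percentile(data_list, [25, 75])
iqr = q3 - q1

# Fences: points beyond these are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data_list if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```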

### Question 8: You are working as a data analyst in an e-commerce company. The marketing team wants to know if there is a relationship between advertising spend and daily sales.

*   **Explain how you would use covariance and correlation to explore this relationship.**

In this e-commerce scenario, covariance and correlation are valuable tools for understanding the relationship between advertising spend and daily sales.

*   **Covariance:** Covariance measures the direction of the linear relationship between two variables. A positive covariance indicates that as advertising spend increases, daily sales tend to increase as well. A negative covariance suggests that as advertising spend increases, daily sales tend to decrease. A covariance close to zero implies a weak or no linear relationship. However, the magnitude of covariance depends on the units of both variables, which makes it difficult to judge the strength of the relationship or to compare it across different datasets.

*   **Correlation Coefficient:** The correlation coefficient (specifically, Pearson's correlation coefficient) is a standardized measure of the linear relationship between two variables. It ranges from -1 to +1.
    *   A correlation coefficient of +1 indicates a perfect positive linear relationship (all points fall exactly on an upward-sloping line).
    *   A correlation coefficient of -1 indicates a perfect negative linear relationship (all points fall exactly on a downward-sloping line).
    *   A correlation coefficient of 0 indicates no linear relationship, although a nonlinear relationship may still exist.
    *   Values strictly between 0 and +1, or between 0 and -1, indicate imperfect positive or negative linear relationships, respectively; the closer the value is to 0, the weaker the relationship.

By computing the correlation coefficient between advertising spend and daily sales, the marketing team can get a clear, standardized measure of how strongly and in what direction these two variables are linearly related. A strong correlation shows that the two move together, but it does not by itself prove that advertising causes the change in sales. Even so, this information can inform their marketing strategies and point to where a controlled experiment would be worthwhile.
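
The difference between the two measures can be illustrated with the lists given in this question. The sketch below re-expresses daily sales in thousands (the `sales_in_thousands` list is introduced only for this illustration) to show that covariance changes with the units while the correlation coefficient does not.

```python
import numpy as np

advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

# Same sales figures expressed in thousands: identical relationship, new units
sales_in_thousands = [s / 1000 for s in daily_sales]

# Sample covariance (n - 1 denominator) in each unit system
cov_original = np.cov(advertising_spend, daily_sales)[0, 1]
cov_rescaled = np.cov(advertising_spend, sales_in_thousands)[0, 1]

# Pearson correlation in each unit system
r_original = np.corrcoef(advertising_spend, daily_sales)[0, 1]
r_rescaled = np.corrcoef(advertising_spend, sales_in_thousands)[0, 1]

print(f"Covariance (sales in units):     {cov_original:.3f}")
print(f"Covariance (sales in thousands): {cov_rescaled:.3f}")
print(f"Correlation (sales in units):     {r_original:.4f}")
print(f"Correlation (sales in thousands): {r_rescaled:.4f}")
```

The covariance shrinks by a factor of 1,000 after the change of units, while the correlation stays the same, which is why correlation is the better choice for judging strength.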

*   **Write Python code to compute the correlation between the two lists:**
    `advertising_spend = [200, 250, 300, 400, 500]`
    `daily_sales = [2200, 2450, 2750, 3200, 4000]`

In [None]:
import numpy as np

# Define the two datasets
advertising_spend = [200, 250, 300, 400, 500]
daily_sales = [2200, 2450, 2750, 3200, 4000]

# Compute the correlation coefficient
correlation_coefficient = np.corrcoef(advertising_spend, daily_sales)[0, 1]

print(f"Advertising Spend: {advertising_spend}")
print(f"Daily Sales: {daily_sales}")
print(f"Correlation Coefficient: {correlation_coefficient}")

### Question 9: Your team has collected customer satisfaction survey data on a scale of 1-10 and wants to understand its distribution before launching a new product.

*   **Explain which summary statistics and visualizations (e.g. mean, standard deviation, histogram) you’d use.**

To understand the distribution of customer satisfaction survey data on a scale of 1-10, several summary statistics and visualizations would be useful:

**Summary Statistics:**

*   **Mean:** The average satisfaction score provides a single measure of the typical satisfaction level.
*   **Median:** The middle satisfaction score when the data is ordered helps understand the central tendency, especially if the data is skewed.
*   **Mode:** The most frequent satisfaction score indicates the most common response.
*   **Standard Deviation:** This measures the spread or variability of the satisfaction scores around the mean. A higher standard deviation indicates more variation in responses.
*   **Range:** The difference between the maximum and minimum scores gives a simple measure of the spread.
*   **Percentiles (e.g., 25th, 75th):** These can help understand the distribution of scores and identify quartiles, which are useful for creating boxplots.

**Visualizations:**

*   **Histogram:** A histogram is excellent for visualizing the distribution of the data. It shows the frequency of each satisfaction score (or ranges of scores) and helps identify the shape of the distribution (e.g., normal, skewed), the most common scores, and the spread of the data.
*   **Boxplot:** A boxplot provides a visual summary of the distribution, showing the median, quartiles, and potential outliers. It's useful for comparing the distribution across different groups if you had additional categorical variables (e.g., satisfaction scores by region).
*   **Bar Chart:** While a histogram groups values into bins, a bar chart can be used to show the frequency of each individual satisfaction score from 1 to 10.

In this case, a **histogram** would be particularly insightful to see the frequency of each score and the overall shape of the distribution. Calculating the **mean**, **median**, **mode**, and **standard deviation** would provide key numerical summaries of the data.
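
The numerical summaries listed above can be computed with Python's standard `statistics` module. This is a minimal sketch using the survey data from the question; the quartiles use the `inclusive` method, which is one of several common conventions and is chosen here as an assumption.

```python
import statistics

survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

mean_score = statistics.mean(survey_scores)
median_score = statistics.median(survey_scores)
mode_score = statistics.mode(survey_scores)
std_dev = statistics.stdev(survey_scores)  # sample standard deviation (n - 1)
score_range = max(survey_scores) - min(survey_scores)

# Quartiles; method="inclusive" treats the data as the full set of interest
q1, q2, q3 = statistics.quantiles(survey_scores, n=4, method="inclusive")

print(f"Mean: {mean_score:.2f}")
print(f"Median: {median_score}")
print(f"Mode: {mode_score}")
print(f"Standard deviation: {std_dev:.2f}")
print(f"Range: {score_range}")
print(f"Q1: {q1}, Q3: {q3}")
```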

*   **Write Python code to create a histogram using Matplotlib for the survey data:**
    `survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]`

In [None]:
import matplotlib.pyplot as plt

# The given survey data
survey_scores = [7, 8, 5, 9, 6, 7, 8, 9, 10, 4, 7, 6, 9, 8, 7]

# Create the histogram
plt.figure(figsize=(8, 6))
plt.hist(survey_scores, bins=range(4, 12), align='left', rwidth=0.8) # Bins for scores 4 to 10
plt.title("Distribution of Customer Satisfaction Scores")
plt.xlabel("Satisfaction Score (1-10)")
plt.ylabel("Frequency")
plt.xticks(range(4, 11)) # Set x-axis ticks to be the survey scores
plt.grid(axis='y', alpha=0.75)
plt.show()