# QUIZ: EDA DATASET
---

## 1. What does the `data.head()` function do in pandas?
1. Displays the shape of the dataset 
2. Displays the first few rows of the dataset 
3. Displays the last few rows of the dataset 
4. Displays the data types of each column

The correct answer is:
**2. Displays the first few rows of the dataset**

### Explanation:

In **Pandas**, the `data.head()` function returns the **first 5 rows** of the DataFrame by default.
You can also specify the number of rows, e.g., `data.head(10)` will return the first 10 rows.

---

Here’s an example:

```python
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())
```

This helps in **previewing the dataset**, especially when it's large.
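
For a version that runs without a CSV file, here is a minimal sketch on a made-up DataFrame (the column names and values are illustrative assumptions). It contrasts `head()` with the pandas calls that correspond to the other three options:

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({
    "id": list(range(1, 9)),
    "score": [88.5, 92.0, 79.5, 65.0, 70.5, 95.0, 81.0, 77.5],
})

first_default = df.head()   # first 5 rows (default n=5)
first_two = df.head(2)      # first 2 rows
last_rows = df.tail()       # last 5 rows (what option 3 describes)

print(first_default)
print(df.shape)             # (rows, columns), what option 1 describes
print(df.dtypes)            # data type of each column, what option 4 describes
```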


## 2. Which plot type is used to show the frequency distribution of a single categorical variable? 
1. Histogram 
2. Bar Chart 
3. Bar Chart 
4. Bar Chart

The correct answer is:
**2. Bar Chart**

Note that options 2 to 4 are identical in the quiz as given, which looks like a typo. A more plausible set of options would be:

1. Histogram
2. Bar Chart
3. Pie Chart
4. Line Chart

### Explanation:

* A **bar chart** is used to show the **frequency distribution** of a **single categorical variable**.
* It displays **categories** on the x-axis and their **frequencies** (counts) on the y-axis.

### Note:

* **Histogram** is for **numerical (continuous)** data, not categorical.




## 3. Which function is used in seaborn to create a heatmap? 
1. sns.heatmap() 
2. sns.boxplot() 
3. sns.scatterplot() 
4. sns.hisplot()

The correct answer is:
**1. `sns.heatmap()`**

### Explanation:

In the **Seaborn** library, `sns.heatmap()` is used to create a **heatmap**, which is a graphical representation of data where individual values are represented as colors.

### Example:

```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Example data
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

sns.heatmap(data, annot=True)
plt.show()
```

---

Other options:

* `sns.boxplot()` → Used for box plots (distribution and outliers)
* `sns.scatterplot()` → Used for scatter plots (relationship between two numeric variables)
* `sns.hisplot()` → **Incorrect spelling**, the correct one is `sns.histplot()` (for histograms)



## 4. Which of the following plots would best display the count of passengers who survived? 
1. Scatter Plot 
2. Bar Chart 
3. Box Plot 
4. Density Plot

The correct answer is:
**2. Bar Chart**

---

### ✅ Explanation:

To display the **count** of passengers who **survived**, you're working with **categorical data** (e.g., "Survived" = Yes/No or 0/1).

A **bar chart** is best suited for:

* **Categorical variables**
* Showing the **frequency/count** of each category

---

### Other Options:

1. **Scatter Plot** – Used to show relationships between two **numeric** variables
2. **Box Plot** – Shows distribution, median, and outliers of a **numerical** variable
3. **Density Plot** – Estimates the **distribution** of a **continuous** variable

---

### Example in code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Example: Titanic dataset
titanic = sns.load_dataset('titanic')
sns.countplot(x='survived', data=titanic)
plt.title('Count of Passengers Who Survived')
plt.show()
```



## 5. In which plot is the area of slices proportional to the values they represent? 
1. Scatter plot 
2. Histogram 
3. Box Plot 
4. Pie Chart

The correct answer is:
**4. Pie Chart**

---

### ✅ Explanation:

In a **pie chart**, the **area of each slice** (or sector) is **proportional** to the value it represents relative to the whole. It's used to show **percentage or part-to-whole relationships** for **categorical data**.

---

### Other Options:

1. **Scatter Plot** – Displays individual data points, showing relationships between two variables
2. **Histogram** – Shows frequency distribution of **continuous** numerical data
3. **Box Plot** – Displays distribution, median, quartiles, and outliers of a **numerical** variable

---

### Example:

```python
import matplotlib.pyplot as plt

labels = ['Apples', 'Bananas', 'Cherries']
sizes = [30, 45, 25]

plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Fruit Distribution')
plt.show()
```



## 6. What type of plot would you use to compare the median fare across different passenger classes? 
1. Scatter plot 
2. Box plot 
3. Pie Chart 
4. Histogram

The correct answer is:
**2. Box Plot**

---

### ✅ Explanation:

A **box plot** is ideal for comparing **distributions** (like **median**, **quartiles**, and **outliers**) across **different categories**, such as **passenger classes**.

In this case, it helps visualize how the **fare** varies across **passenger classes**, including the **median fare**.

---

### Why Not the Others?

1. **Scatter Plot** – Shows relationships between two continuous variables
2. **Pie Chart** – Shows proportions, not distributions or medians
3. **Histogram** – Shows frequency distribution of a **single** continuous variable, not grouped comparisons

---

### Example in code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Titanic dataset example
titanic = sns.load_dataset("titanic")

sns.boxplot(x='pclass', y='fare', data=titanic)
plt.title('Median Fare by Passenger Class')
plt.show()
```


## 7. Which of the following represents the most frequent value in a dataset? 
1. Mean 
2. Median 
3. Mode 
4. Standard Deviation

The correct answer is:
**3. Mode**

---

### ✅ Explanation:

* **Mode** is the value that **appears most frequently** in a dataset.
* It can be used for **both categorical and numerical** data.

---

### Other Options:

1. **Mean** – The **average** value
2. **Median** – The **middle** value when data is sorted
3. **Standard Deviation** – Measures the **spread or dispersion** of the data

---

### Example in Python:

```python
import pandas as pd

data = [1, 2, 2, 3, 4, 4, 4, 5]

mode_value = pd.Series(data).mode()
print("Mode:", mode_value.values)
```

This prints `Mode: [4]`, since `4` appears most often. `mode()` returns a Series rather than a single number because a dataset can have more than one mode.


## 8. What does a high standard deviation indicate? 
1. Data points are close to the mean 
2. Data points are spread out over a wide range 
3. There is no variability in the data 
4. Data points are normally distributed

The correct answer is:
**2. Data points are spread out over a wide range**

---

### ✅ Explanation:

* A **high standard deviation** means that the **data points vary greatly** from the **mean** — they are **more spread out**.
* It indicates **greater variability** in the dataset.

---

### Other Options:

1. **Data points are close to the mean** → This is true for **low standard deviation**
2. **No variability in the data** → This would mean the **standard deviation is zero**
3. **Data points are normally distributed** → Standard deviation doesn't **guarantee** normal distribution; it just describes **spread**

---

### Visual Tip:

* **Low SD** → Tall, narrow bell curve
* **High SD** → Wide, flat bell curve
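
As a rough numeric illustration (both samples below are made up), the two Series share the same mean but have very different standard deviations. Note that pandas computes the sample standard deviation (`ddof=1`) by default:

```python
import pandas as pd

# Two hypothetical samples with the same mean (50)
tight = pd.Series([48, 49, 50, 51, 52])   # values close to the mean
wide = pd.Series([10, 30, 50, 70, 90])    # values spread over a wide range

print("Means:", tight.mean(), wide.mean())
print("Low SD:", tight.std())
print("High SD:", wide.std())
```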


## 9. Which Pandas function is used to identify numerical columns in a dataset? 
1. select_columns() 
2. select_dtypes(include=['object']) 
3. select_dtypes(include=['int64', 'float64']) 
4. get_dtypes()

The correct answer is:
**3. `select_dtypes(include=['int64', 'float64'])`**

---

### ✅ Explanation:

To identify or select **numerical columns** in a Pandas DataFrame, you use:

```python
df.select_dtypes(include=['int64', 'float64'])
```

This keeps only the columns whose dtype is `int64` or `float64`. Other numeric dtypes (for example `int32` or `float32`) are not matched by this list.

---

### Other Options:

1. **`select_columns()`** – ❌ Not a valid Pandas function
2. **`select_dtypes(include=['object'])`** – ✅ Valid call, but it selects **categorical/text** columns, not numerical ones
3. **`get_dtypes()`** – ❌ No such function; the correct one is `df.dtypes` to view column types

---

### Example:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 22],
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000.0, 60000.5, 52000.0]
})

numerical_cols = df.select_dtypes(include=['int64', 'float64'])
print(numerical_cols)
```
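
Listing `int64` and `float64` explicitly matches the quiz answer, but it can miss numeric columns stored with another width. As a hedged alternative, `select_dtypes` also accepts `include="number"`, which matches all numeric dtypes. The sketch below forces `age` to `int32` (an assumption made only to show the difference):

```python
import pandas as pd

df = pd.DataFrame({
    "age": pd.Series([25, 30, 22], dtype="int32"),  # deliberately not int64
    "name": ["Alice", "Bob", "Charlie"],
    "salary": [50000.0, 60000.5, 52000.0],
})

explicit = df.select_dtypes(include=["int64", "float64"])
generic = df.select_dtypes(include="number")

print(list(explicit.columns))  # 'age' is int32, so the explicit list skips it
print(list(generic.columns))   # every numeric column is selected
```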



## 10. Why is it important to separate the dataset into numerical and categorical columns? 
1. To perform data quality checks 
2. To perform data quality checks 
3. To handle missing values 
4. To merge the data with other datasets

The correct answer is:
**3. To handle missing values**

---

### ✅ Explanation:

Separating a dataset into **numerical** and **categorical** columns is important because:

* **Different data types require different preprocessing techniques.**
* For example:

  * **Numerical columns** may use **mean/median imputation** for missing values.
  * **Categorical columns** may use **mode imputation** or **fill with 'Unknown'**.

---

### Why Not the Others?

* **To perform data quality checks** (options 1 and 2, duplicated in the quiz) – While true, it's **not the main reason** for separation.
* **To merge the data with other datasets** (option 4) – Merging is based on keys and structure, not necessarily data type separation.

---

### Summary:

Separating columns helps you apply the **right preprocessing methods**, especially for:

* Handling **missing values**
* **Encoding categorical** variables
* **Scaling** numerical variables
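
A minimal sketch of type-specific imputation, assuming a small made-up DataFrame: numerical columns are filled with the median and all other columns with the mode. This is one common convention, not the only reasonable choice:

```python
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Split the column names by data type
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numerical columns: fill with the column median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with the most frequent value (mode)
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```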



## 11. What is the purpose of examining unique values in categorical columns? 
1. To find the average of numerical columns 
2. To identify the distribution of data within each category 
3. To check for missing values 
4. To calculate variance and standard deviation

The correct answer is:
**2. To identify the distribution of data within each category**

---

### ✅ Explanation:

* Examining **unique values** in **categorical columns** helps you:

  * Understand how many **distinct categories** exist
  * Identify **data distribution**, imbalance, or unexpected entries
  * Detect potential **typos** or **inconsistencies** in categories

---

### Other Options:

1. **To find the average of numerical columns** → Not related to categorical data
2. **To check for missing values** → You’d use `.isnull()` or `.info()`, not `.unique()`
3. **To calculate variance and standard deviation** → These apply to **numerical** data only

---

### Example in code:

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Red', 'Green', 'Blue', 'Green', 'Blue']
})

print(df['Color'].unique())  # Shows: ['Red' 'Blue' 'Green']
print(df['Color'].value_counts())  # Shows count for each
```



## 12. What does a data dictionary provide in the context of a dataset? 
1. It lists the columns and their unique values 
2. It provides an overview, data types, and descriptions of the columns
3. It displays the statistical summary of the dataset 
4. It checks for missing values and duplicates

The correct answer is:
**2. It provides an overview, data types, and descriptions of the columns**

---

### ✅ Explanation:

A **data dictionary** is a document or table that describes the structure and details of a dataset. It typically includes:

* **Column names**
* **Data types**
* **Descriptions or definitions** of each column
* **Units** (if applicable)
* **Allowed or expected values** (especially for categorical fields)

---

### Other Options:

1. **It lists the columns and their unique values** → That’s only a small part of what a data dictionary may contain
2. **It displays the statistical summary** → That’s what `df.describe()` does
3. **It checks for missing values and duplicates** → That’s part of **data cleaning**, not what a data dictionary does

---

### Example:

| Column Name | Data Type | Description              | Example Values   |
| ----------- | --------- | ------------------------ | ---------------- |
| age         | int       | Age of the person        | 25, 30, 45       |
| gender      | object    | Gender of the individual | Male, Female     |
| salary      | float     | Monthly salary in USD    | 50000.0, 62000.5 |
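
Part of such a table can be drafted from the DataFrame itself. The sketch below is one possible approach, not a standard pandas feature: the descriptions are written by hand (they cannot be inferred from the data), and the `descriptions` dict and `data_dict` name are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 45],
    "gender": ["Male", "Female", "Male"],
    "salary": [50000.0, 62000.5, 58000.0],
})

# Hand-written descriptions (assumed for this example)
descriptions = {
    "age": "Age of the person",
    "gender": "Gender of the individual",
    "salary": "Monthly salary in USD",
}

# One row per column: name, dtype, description, and a sample value
data_dict = pd.DataFrame({
    "column": df.columns,
    "dtype": df.dtypes.astype(str).values,
    "description": [descriptions[c] for c in df.columns],
    "example": [df[c].iloc[0] for c in df.columns],
})

print(data_dict)
```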



## 13. What does data.isnull().sum() in Pandas do? 
1. Identifies missing values in the dataset 
2. Removes duplicates from the dataset 
3. Converts data types to the correct format 
4. Creates a summary of the dataset

The correct answer is:
**1. Identifies missing values in the dataset**

---

### ✅ Explanation:

In **Pandas**, the function:

```python
data.isnull().sum()
```

does the following:

* `data.isnull()` → Returns a **DataFrame of booleans** (True where the value is missing/NaN)
* `.sum()` → Adds up the `True` values **column-wise**, giving the **count of missing values** in each column

---

### Example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'age': [25, None, 30]
})

print(df.isnull().sum())
```

**Output:**

```
name    1
age     1
dtype: int64
```

---

### Other Options:

2. **Removes duplicates** → Done using `data.drop_duplicates()`
3. **Converts data types** → Use `data.astype()`
4. **Creates a summary** → Use `data.describe()` or `data.info()`



## 14. Why is it important to check for duplicates in a dataset? 
1. To increase the size of the dataset 
2. To ensure each observation is unique 
3. To standardize categorical values 
4. To identify outliers

The correct answer is:
**2. To ensure each observation is unique**

---

### ✅ Explanation:

Checking for **duplicates** in a dataset is important because:

* **Duplicate rows** can **skew analysis**, especially in:

  * Aggregations (like sums, averages)
  * Machine learning models (they may overfit or bias results)
* Ensuring each **observation (row)** is **unique** helps maintain **data integrity and accuracy**

---

### Other Options:

1. **To increase the size of the dataset** → Duplicates do this, but it's **not desirable**
2. **To standardize categorical values** → That's a separate preprocessing step
3. **To identify outliers** → Outlier detection involves statistical techniques, not duplication checks

---

### Example in code:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice'],
    'age': [25, 30, 25]
})

# Check for duplicates
print(df.duplicated())
print(df[df.duplicated()])

# Remove duplicates
df_cleaned = df.drop_duplicates()
```



## 15. What kind of plot is used to visualize potential outliers in numerical columns?
1. Histogram 
2. Box Plot 
3. Scatter Plot 
4. Heatmap

The correct answer is:
**2. Box Plot**

---

### ✅ Explanation:

A **box plot** (also known as a **box-and-whisker plot**) is used to:

* Show the **distribution** of a numerical column
* Display the **median**, **quartiles**, and **range**
* **Visualize potential outliers**, which appear as **points outside the whiskers**

---

### How it works:

* **Box** = Interquartile Range (IQR: Q1 to Q3)
* **Whiskers** = Extend to the most extreme data points within 1.5 × IQR of the box edges (Q1 and Q3)
* **Outliers** = Points **outside the whiskers**
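
The whisker rule can be reproduced by hand. The sketch below uses made-up fare values and the common 1.5 × IQR convention to flag the points a box plot would draw as outliers:

```python
import pandas as pd

# Hypothetical fares, with one extreme value
fares = pd.Series([7, 8, 8, 10, 13, 15, 26, 30, 512])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Fences: points beyond these limits are treated as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers.tolist())
```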

---

### Other Options:

1. **Histogram** – Shows frequency, but not clearly outliers
2. **Scatter Plot** – Can show data spread but not ideal for detecting outliers in **one column**
3. **Heatmap** – Shows correlations or patterns in **matrix-like data**, not outliers

---

### Example in code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Example with Titanic dataset
titanic = sns.load_dataset("titanic")
sns.boxplot(x=titanic['fare'])
plt.title("Box Plot of Fare (to Detect Outliers)")
plt.show()
```


## 16. How can you address inconsistencies in categorical data? 
1. By calculating the mean and median 
2. By removing missing values 
3. By standardizing the format of categorical values 
4. By checking for duplicates

The correct answer is:
**3. By standardizing the format of categorical values**

---

### ✅ Explanation:

**Inconsistencies in categorical data** often come from:

* Different spellings (e.g., "Male", "male", "MALE")
* Use of abbreviations (e.g., "NY" vs. "New York")
* Typos or unwanted symbols

To fix this, you should **standardize** the format by:

* Converting to lowercase or uppercase
* Stripping whitespace
* Replacing or mapping inconsistent values

---

### Example in code:

```python
import pandas as pd

df = pd.DataFrame({'gender': ['Male', 'male ', ' FEMALE', 'female', 'MALE']})

# Standardize format
df['gender'] = df['gender'].str.strip().str.lower()
print(df['gender'].unique())
```

**Output:**

```
['male' 'female']
```
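
Case and whitespace fixes do not resolve abbreviations such as "NY" vs. "New York"; those need an explicit mapping. A minimal sketch, assuming a made-up `state` column and a hand-written `abbreviations` dict:

```python
import pandas as pd

df = pd.DataFrame({"state": ["NY", "New York", "new york ", "CA", "California"]})

# Hypothetical mapping from abbreviations to full names (lowercase)
abbreviations = {"ny": "new york", "ca": "california"}

# Normalize whitespace and case first, then map the abbreviations
cleaned = df["state"].str.strip().str.lower().replace(abbreviations)
print(cleaned.unique())
```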

---

### Other Options:

1. **Mean/median** → Only for numerical data
2. **Removing missing values** → Helps with nulls, not inconsistencies
3. **Checking for duplicates** → Helps with repeated rows, not inconsistent labels


## 17. What is an outlier? 
1. A common data point within the range 
2. A data point that deviates significantly from other observations 
3. A duplicated data point 
4. A missing data point

The correct answer is:
**2. A data point that deviates significantly from other observations**

---

### ✅ Explanation:

An **outlier** is a value in a dataset that is **significantly different** from most of the other data points. It can result from:

* Data entry errors
* Measurement errors
* Genuine extreme values (e.g., extremely high salary)

Outliers can **skew results**, affect **mean**, and impact **machine learning models** if not handled properly.

---

### Other Options:

1. **A common data point** → This is the opposite of an outlier
2. **A duplicated data point** → Not the same as an outlier
3. **A missing data point** → Referred to as `NaN` or null, not an outlier

---

### Example:

In the dataset: `[10, 12, 11, 13, 14, 200]`
→ `200` is an outlier (far from the rest)
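
To see how this single value distorts a summary, the sketch below compares the mean and median of the same list with and without `200`, using Python's standard `statistics` module:

```python
import statistics

data = [10, 12, 11, 13, 14, 200]
without_outlier = [x for x in data if x != 200]

mean_all = statistics.mean(data)
median_all = statistics.median(data)
mean_clean = statistics.mean(without_outlier)
median_clean = statistics.median(without_outlier)

# The mean moves a lot when the outlier is included; the median barely changes
print("With outlier:    mean =", mean_all, " median =", median_all)
print("Without outlier: mean =", mean_clean, " median =", median_clean)
```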


## 18. Which measure of central tendency represents the average value of a dataset? 
1. Mean 
2. Median 
3. Mode 
4. Range

The correct answer is:
**1. Mean**

---

### ✅ Explanation:

The **mean** is the **average** of a dataset and is calculated by:

$$
\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}
$$

---

### Other Options:

2. **Median** – The **middle value** when data is sorted
3. **Mode** – The value that **occurs most frequently**
4. **Range** – The **difference between the maximum and minimum** values, not a central tendency

---

### Example:

For the dataset: `[2, 4, 6, 8]`

* **Mean** = (2 + 4 + 6 + 8) / 4 = **5**
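
The same calculation in Python, first directly from the formula and then with pandas (a small sketch using the dataset above):

```python
import pandas as pd

values = [2, 4, 6, 8]

# Sum of all values divided by the number of values
manual_mean = sum(values) / len(values)

# Same result using pandas
pandas_mean = pd.Series(values).mean()

print(manual_mean, pandas_mean)
```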


## 19. What does the median represent in a dataset? 
1. The most frequent value 
2. The middle value when data is sorted 
3. The average value 
4. The difference between maximum and minimum values

The correct answer is:
**2. The middle value when data is sorted**

---

### ✅ Explanation:

The **median** is the value that **divides the dataset into two equal halves** when it is **sorted in order**.

* If the number of values is **odd**, it's the middle number.
* If it's **even**, it's the **average of the two middle numbers**.

---

### Other Options:

1. **Most frequent value** → That’s the **mode**
2. **Average value** → That’s the **mean**
3. **Difference between max and min** → That’s the **range**

---

### Example:

For the sorted dataset `[3, 5, 7]` → **Median = 5**
For `[3, 5, 7, 9]` → **Median = (5 + 7)/2 = 6**
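
The odd/even rule can be written out as a short function. The `median` helper below is a hypothetical illustration; in practice `statistics.median` or pandas `Series.median()` does the same job:

```python
def median(values):
    """Return the middle value of the sorted data (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element
        return ordered[mid]
    # Even count: average of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([7, 3, 5])       # unsorted input is sorted first
even_median = median([9, 3, 7, 5])
print(odd_median, even_median)
```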



## 20. Which of the following measures the spread of data around the mean? 
1. Range 
2. Variance 
3. Mode 
4. Mode

The correct answer is:
**2. Variance**

---

### ✅ Explanation:

* **Variance** measures how **spread out** the data is around the **mean**.
* It calculates the **average of the squared differences** from the mean.

$$
\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}
$$

---

### Other Options:

* **Range** (option 1) – Difference between **maximum and minimum** (a simple spread, not measured around the mean)
* **Mode** (options 3 and 4, duplicated in the quiz) – The **most frequent value**, not a measure of spread

---

### Related Concept:

* **Standard Deviation** is the **square root of variance** and is also a key measure of spread.

---

### Example in Python:

```python
import numpy as np

data = [4, 8, 6, 5, 3]
variance = np.var(data)
print("Variance:", variance)
```
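
One detail worth knowing: `np.var` divides by `n` by default (population variance, matching the formula above), while pandas `Series.var()` divides by `n - 1` (sample variance). A short sketch comparing the two on the same data:

```python
import numpy as np
import pandas as pd

data = [4, 8, 6, 5, 3]

population_var = np.var(data)          # divides by n (ddof=0)
sample_var = np.var(data, ddof=1)      # divides by n - 1
pandas_var = pd.Series(data).var()     # pandas default is ddof=1

# Standard deviation is the square root of the variance
std_from_var = np.sqrt(population_var)

print("Population variance:", population_var)
print("Sample variance:", sample_var, pandas_var)
print("Population standard deviation:", std_from_var)
```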



## 21. What is skewness in the context of distribution analysis?
1. It measures the symmetry of the data distribution 
2. It measures the spread of data 
3. It measures the central tendency 
4. It measures the correlation between variables

The correct answer is:
**1. It measures the symmetry of the data distribution**

---

### ✅ Explanation:

**Skewness** tells us whether the data distribution is **symmetrical** or **asymmetrical**:

* **Skewness = 0** → Perfectly symmetrical (for example, a normal distribution)
* **Positive skew** (Right-skewed) → Tail on the **right**, more values on the left
* **Negative skew** (Left-skewed) → Tail on the **left**, more values on the right

---

### Other Options:

2. **Spread of data** → Measured by **variance** or **standard deviation**
3. **Central tendency** → Includes **mean, median, mode**
4. **Correlation** → Measures **relationship between variables**, not shape

---

### Example in Python:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 100])
print("Skewness:", data.skew())
```

This will show **positive skewness** due to the outlier `100`.



## 22. What does a positive kurtosis value indicate? 
1. A distribution with heavy tails and a sharper peak than normal 
2. A distribution with lighter tails 
3. A perfectly normal distribution 
4. A distribution with a single mode

The correct answer is:
**1. A distribution with heavy tails and a sharper peak than normal**

---

### ✅ Explanation:

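**Kurtosis** measures the **"tailedness"** of a distribution (traditionally also described as its **peakedness**). The values here refer to **excess kurtosis**, where a normal distribution scores 0:

* **Positive excess kurtosis (> 0)** →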

  * **Leptokurtic** distribution
  * **Sharper peak** and **heavier tails** than a normal distribution
  * Indicates more **extreme outliers**

---

### Other Options:

2. **Lighter tails** → This describes **negative kurtosis** (platykurtic)
3. **Perfectly normal distribution** → Has **zero excess kurtosis** (mesokurtic)
4. **Single mode** → Refers to **modality**, not kurtosis

---

### Quick Summary:

| Kurtosis Type | Excess Kurtosis | Characteristics                         |
| ------------- | --------------- | --------------------------------------- |
| Leptokurtic   | > 0             | Sharp peak, heavy tails (many outliers) |
| Mesokurtic    | = 0             | Normal distribution                     |
| Platykurtic   | < 0             | Flat peak, light tails (few outliers)   |
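
In pandas, `Series.kurt()` reports excess kurtosis (a normal distribution scores roughly 0). The sketch below uses two made-up samples, one evenly spread and one dominated by a single extreme value, to show the sign difference:

```python
import pandas as pd

# Hypothetical samples
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])      # evenly spread, light tails
heavy = pd.Series([5, 5, 5, 5, 5, 5, 5, 5, 5, 50])     # one extreme value, heavy tail

flat_kurt = flat.kurt()
heavy_kurt = heavy.kurt()

print("Evenly spread sample:", flat_kurt)    # negative (platykurtic)
print("Sample with extreme value:", heavy_kurt)  # positive (leptokurtic)
```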


## 23. Which of the following is NOT a measure of dispersion? 
1. Standard Deviation 
2. Range 
3. Mode 
4. Variance

The correct answer is:
**3. Mode**

---

### ✅ Explanation:

* **Mode** is a **measure of central tendency**, not dispersion.
* It represents the **most frequent value** in a dataset.

---

### The Others Are Measures of Dispersion:

1. **Standard Deviation** → Measures how much data varies from the mean
2. **Range** → Difference between the maximum and minimum values
3. **Variance** → Average of the squared differences from the mean

---

### Summary Table:

| Concept                | Type             | Description                         |
| ---------------------- | ---------------- | ----------------------------------- |
| **Mode**               | Central Tendency | Most frequent value                 |
| **Standard Deviation** | Dispersion       | Spread around the mean              |
| **Range**              | Dispersion       | Max - Min                           |
| **Variance**           | Dispersion       | Average squared deviation from mean |


## 24. How is the range of a dataset calculated? 
1. Maximum value minus the minimum value 
2. Mean value divided by the standard deviation 
3. Sum of all data points 
4. Number of data points

The correct answer is:
**1. Maximum value minus the minimum value**

---

### ✅ Explanation:

The **range** of a dataset measures the **spread** by calculating:

$$
\text{Range} = \text{Maximum value} - \text{Minimum value}
$$

---

### Example:

For the dataset: `[3, 7, 10, 15]`

$$
\text{Range} = 15 - 3 = 12
$$

---

### Why the Others Are Incorrect:

2. **Mean ÷ Standard Deviation** → Not related to range
3. **Sum of all data points** → That gives the **total**, not the range
4. **Number of data points** → That’s the **count**, not the range
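
The same calculation in pandas, as a short sketch using the dataset from the example:

```python
import pandas as pd

values = pd.Series([3, 7, 10, 15])

# Range = maximum value minus minimum value
data_range = values.max() - values.min()
print("Range:", data_range)
```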



## 25. What does the data['Age'].skew() function return? 
1. The variance of the 'Age' column 
2. The Skewness of the 'Age' column 
3. The median of the 'Age' column 
4. The standard deviation of the 'Age' column

The correct answer is:
**2. The Skewness of the 'Age' column**

---

### ✅ Explanation:

In **Pandas**, the function:

```python
data['Age'].skew()
```

returns the **skewness** of the `'Age'` column — a measure of the **asymmetry** of the distribution:

* **Positive skew** → Tail on the **right**
* **Negative skew** → Tail on the **left**
* **Skew = 0** → Symmetrical distribution

---

### Other Options:

1. **Variance** → Use `data['Age'].var()`
2. **Median** → Use `data['Age'].median()`
3. **Standard deviation** → Use `data['Age'].std()`

---

### Example:

```python
import pandas as pd

data = pd.DataFrame({'Age': [22, 25, 29, 34, 120]})
print(data['Age'].skew())  # Likely positive due to outlier (120)
```



## 26. Which plot is best for visualizing the distribution of a single numerical variable?
1. Histogram 
2. Scatter Plot 
3. Bar Chart 
4. Pie Chart

The correct answer is:
**1. Histogram**

---

### ✅ Explanation:

A **histogram** is best for visualizing the **distribution** of a **single numerical variable**. It:

* Divides data into **bins (intervals)**
* Shows **frequency** (how many values fall into each bin)
* Helps detect **shape**, **spread**, and **outliers** in the data

---

### Other Options:

2. **Scatter Plot** – Shows relationships between **two numerical** variables
3. **Bar Chart** – Used for **categorical** data, not continuous distributions
4. **Pie Chart** – Shows **proportions** of **categorical** data, not distribution

---

### Example in Python:

```python
import matplotlib.pyplot as plt

ages = [22, 25, 29, 34, 22, 40, 45, 29, 33, 37]
plt.hist(ages, bins=5)
plt.title("Histogram of Ages")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
```



## 27. Which plot is used to visualize the relationship between two numerical variables? 
1. Bar Chart 
2. Histogram 
3. Scatter Plot 
4. Pie Chart

The correct answer is:
**3. Scatter Plot**

---

### ✅ Explanation:

A **scatter plot** is used to visualize the **relationship** or **correlation** between **two numerical variables** by plotting data points on an x-y coordinate system.

Each point represents an observation with:

* One variable on the **x-axis**
* Another on the **y-axis**

This helps identify:

* **Trends**
* **Clusters**
* **Outliers**
* **Correlations** (positive, negative, or none)

---

### Other Options:

1. **Bar Chart** – For comparing **categories**, not relationships
2. **Histogram** – For showing **distribution** of **one** numerical variable
3. **Pie Chart** – For showing **proportions** of **categorical** data

---

### Example in Python:

```python
import matplotlib.pyplot as plt

# Example data
height = [150, 160, 170, 180, 190]
weight = [50, 60, 65, 80, 90]

plt.scatter(height, weight)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Height vs Weight")
plt.show()
```



## 28. What does a heatmap visualize? 
1. The distribution of a single variable 
2. The relationship between two variables 
3. The correlation matrix of several variables 
4. The frequency of categorical data

The correct answer is:
**3. The correlation matrix of several variables**

---

### ✅ Explanation:

A **heatmap** is commonly used to visualize a **correlation matrix**, which shows the **pairwise correlation coefficients** between **multiple numerical variables** in a dataset.

* Each cell in the heatmap shows the **correlation value** (e.g., Pearson correlation) between two variables.
* Colors indicate the **strength and direction** of the correlation:

  * Dark or intense colors → strong correlations (positive or negative)
  * Light or neutral colors → weak or no correlation

---

### Other Options:

1. **Distribution of a single variable** → Use **histogram** or **box plot**
2. **Relationship between two variables** → Use **scatter plot**
3. **Frequency of categorical data** → Use **bar chart** or **countplot**

---

### Example in Python:

```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
data = sns.load_dataset('iris')
correlation_matrix = data.corr(numeric_only=True)

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
```


## 29. Which type of plot is best for identifying outliers in the dataset? 
1. Histogram 
2. Bar Chart 
3. Box Plot 
4. Scatter Plot

The correct answer is:
**3. Box Plot**

---

### ✅ Explanation:

A **box plot** (or **box-and-whisker plot**) is the best plot for identifying **outliers** in a dataset. It displays:

* **Median (Q2)**
* **First (Q1) and third (Q3) quartiles**
* **Whiskers** (typically extending up to 1.5 × IQR beyond Q1 and Q3)
* **Outliers** as **individual points** outside the whiskers

---

### Why Not the Others?

1. **Histogram** – Shows distribution but doesn’t clearly highlight outliers
2. **Bar Chart** – Best for categorical data, not suitable for detecting outliers
3. **Scatter Plot** – Can show unusual points, but not specifically designed for outlier detection in **one variable**

---

### Example in Python:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Example data
data = [10, 12, 11, 13, 14, 100]  # 100 is an outlier

sns.boxplot(data=data)
plt.title("Box Plot to Detect Outliers")
plt.show()
```


## 30. What kind of relationship does a scatter plot show? 
1. Categorical vs. categorical 
2. Numerical vs. numerical 
3. Categorical vs. numerical 
4. Distribution of a single variable

The correct answer is:
**2. Numerical vs. numerical**

---

### ✅ Explanation:

A **scatter plot** is used to show the relationship between **two numerical variables**. Each point represents one observation with:

* One variable plotted on the **x-axis**
* The other on the **y-axis**

It helps in identifying:

* **Positive or negative correlations**
* **Linear or non-linear trends**
* **Clusters**
* **Outliers**

---

### Why Not the Others?

1. **Categorical vs. categorical** → Use **heatmaps** or **grouped bar charts**
2. **Categorical vs. numerical** → Use **box plots** or **violin plots**
3. **Distribution of a single variable** → Use **histograms** or **density plots**

---

### Example in Python:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

plt.scatter(x, y)
plt.xlabel("X values")
plt.ylabel("Y values")
plt.title("Scatter Plot: X vs Y")
plt.show()
```

