# **Data Toolkit**

1. What is NumPy, and why is it widely used in Python?


**NumPy** (Numerical Python) is a powerful open-source Python library used primarily for **numerical computing**. It provides support for **large, multi-dimensional arrays and matrices**, along with a collection of **high-level mathematical functions** to operate on these arrays.

### Why NumPy is Widely Used:

1. **Efficient Array Handling**:
   NumPy arrays (`ndarray`) store elements of a single type in one block of memory, so they are more compact and faster than Python’s built-in lists for numerical operations.

2. **Vectorized Operations**:
   It supports vectorized operations (element-wise operations without explicit loops), leading to cleaner and faster code.

3. **Mathematical and Statistical Functions**:
   Includes a vast array of functions for linear algebra, Fourier transforms, random number generation, statistics, and more.

4. **Interoperability**:
   Works well with other scientific computing libraries like SciPy, pandas, scikit-learn, and TensorFlow.

5. **C/C++ Integration**:
   NumPy's core is written in C, which makes heavy numerical work much faster than equivalent pure-Python code.

6. **Foundation for Data Science & Machine Learning**:
   Many libraries in data science and machine learning use NumPy internally or require NumPy arrays as inputs.
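
As a minimal sketch of the points above, the following uses made-up temperature readings (the values and variable names are illustrative only) to show a typed array, loop-free arithmetic, and built-in statistics:

```python
import numpy as np

# Hypothetical temperature readings in degrees Celsius
celsius = np.array([21.5, 23.0, 19.8, 25.1])

# Vectorized arithmetic: the whole array is converted without a Python loop
fahrenheit = celsius * 9 / 5 + 32

print(celsius.dtype)                  # float64: every element shares one type
print(fahrenheit)
print(celsius.mean(), celsius.std())  # built-in statistical functions
```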



2. How does broadcasting work in NumPy?

**Broadcasting** in NumPy is a powerful mechanism that allows NumPy to perform arithmetic operations on arrays of **different shapes** in a **vectorized** way, without making unnecessary data copies.

### How Broadcasting Works:

When performing operations on two arrays, NumPy compares their shapes element-wise from **right to left** and applies the following **broadcasting rules**:

---

### **Broadcasting Rules**:

1. If the arrays have **different numbers of dimensions**, the shape of the smaller array is padded with **ones on the left**.

2. Two dimensions are **compatible** when they are **equal** or when **one of them is 1**; a dimension of size 1 is stretched to match the other.

3. If the dimensions are **not compatible**, NumPy will raise a `ValueError`.

---

### **Example 1: Basic Broadcasting**

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])       # shape: (2, 3)

B = np.array([10, 20, 30])      # shape: (3,)

C = A + B                       # shape: (2, 3)
print(C)
```

**Output:**

```
[[11 22 33]
 [14 25 36]]
```

> `B` is broadcast across each row of `A`.

---

### **Example 2: With Scalars**

```python
A = np.array([[1, 2], [3, 4]])  # shape: (2, 2)
B = 10                         # scalar

C = A + B
print(C)
```

**Output:**

```
[[11 12]
 [13 14]]
```

> A scalar is broadcast to the shape of `A`.

---

### **Visual Summary**:

| Shape A | Shape B | Resulting Shape | Broadcasting Possible? |
| ------- | ------- | --------------- | ---------------------- |
| (3, 1)  | (1, 4)  | (3, 4)          | ✅ Yes                  |
| (2, 3)  | (2,)    | ❌ Error         | ❌ No (2 ≠ 3)           |
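
Both rows of the table can be checked directly. The sketch below uses small illustrative arrays, and also shows one possible way to make the failing case work by adding an explicit axis with `np.newaxis`:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)
grid = col + row                   # both size-1 axes are stretched
print(grid.shape)                  # (3, 4)

a = np.ones((2, 3))
b = np.array([10, 20])             # shape (2,)
try:
    a + b                          # trailing dimensions 3 and 2 do not match
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)

# Adding an axis turns b into shape (2, 1), which is compatible with (2, 3)
fixed = a + b[:, np.newaxis]
print(fixed)
```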




3. What is a Pandas DataFrame?


A **Pandas DataFrame** is a **two-dimensional**, **labeled data structure** in Python, similar to a **spreadsheet**, **SQL table**, or **dictionary of Series objects**. It is one of the core data structures provided by the **pandas** library, widely used for data manipulation and analysis.

---

### **Key Features of a DataFrame:**

* **Rows and Columns**:
  Data is organized in **rows** and **columns**, each with labels (index and column names).

* **Heterogeneous Data**:
  Columns can hold **different data types** (e.g., int, float, string).

* **Indexing & Labeling**:
  Flexible and powerful indexing using **labels** and **integer positions**.

* **Missing Data Handling**:
  Built-in tools to **detect**, **filter**, and **fill** missing data (`NaN`).

* **Powerful Data Operations**:
  Supports **filtering**, **grouping**, **aggregation**, **joining**, and **reshaping**.

---

### **Example: Creating a DataFrame**

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)
```

**Output:**

```
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
```

---

A Pandas DataFrame is the **go-to structure** for most data analysis tasks in Python due to its **flexibility**, **efficiency**, and wide **integration with data tools**.




4.  Explain the use of the groupby() method in Pandas?


The `groupby()` method in Pandas is used to **split** a DataFrame into groups based on the values in one or more columns, and then **apply aggregate functions** (like `sum()`, `mean()`, `count()`, etc.) to each group. It's a powerful tool for **data analysis**, especially for summarizing and transforming data.

---

### **The GroupBy Process: Split-Apply-Combine**

1. **Split**: Divide the data into groups based on some criteria (e.g., a column's values).
2. **Apply**: Apply a function to each group independently (e.g., aggregation, transformation).
3. **Combine**: Merge the results into a new Series or DataFrame.

---

### **Example: Basic Usage**

```python
import pandas as pd

data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 55000, 60000, 58000, 70000]
}

df = pd.DataFrame(data)

# Group by Department and calculate average salary
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)
```

**Output:**

```
Department
HR       59000.0
IT       70000.0
Sales    52500.0
Name: Salary, dtype: float64
```

---

### **Multiple Aggregations**

```python
df.groupby('Department')['Salary'].agg(['mean', 'max', 'min'])
```

This returns a DataFrame with mean, max, and min salary per department.

---

### **Use Cases**:

* Finding totals or averages per category (e.g., total sales per region).
* Counting records per group (e.g., number of employees per department).
* Applying custom functions to grouped data.
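
The use cases above might look like this in practice. The data mirrors the earlier department example, and the "salary range" metric is just an illustrative custom function:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT'],
    'Salary': [50000, 55000, 60000, 58000, 70000],
})

# Counting records per group
headcount = df.groupby('Department').size()

# Custom function: salary range (max - min) within each department
salary_range = df.groupby('Department')['Salary'].agg(lambda s: s.max() - s.min())

# transform() returns one value per original row instead of one per group
df['DeptMean'] = df.groupby('Department')['Salary'].transform('mean')

print(headcount)
print(salary_range)
print(df)
```

`agg()` reduces each group to a single value, while `transform()` broadcasts the group result back onto the original rows, which is convenient for comparing each row with its group.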



5. Why is Seaborn preferred for statistical visualizations?


**Seaborn** is a high-level Python data visualization library built on top of **Matplotlib**. It is widely preferred for **statistical visualizations** because it provides a **simpler interface**, **aesthetic default styles**, and **powerful tools** for exploring relationships in data.

---

### **Key Reasons Seaborn Is Preferred:**

1. **Built-in Statistical Plots**:
   Seaborn offers ready-to-use statistical charts like:

   * `boxplot()`, `violinplot()` (distribution)
   * `barplot()` (categorical stats with confidence intervals)
   * `lmplot()`, `regplot()` (linear regression)

2. **Beautiful Default Styles**:
   Automatically applies attractive color palettes and layouts, saving time on styling.

3. **Simplified Syntax**:
   Easier to use for common tasks compared to raw Matplotlib.

4. **Works Well with Pandas**:
   Accepts Pandas DataFrames directly, using column names for axes, hue, etc.

5. **Multi-variable Visualization**:
   Easily visualize complex relationships using:

   * `hue` for color grouping, and `row`/`col` for faceting into subplots
   * `pairplot()`, `heatmap()`, `catplot()`

6. **Integration with Matplotlib**:
   Can be extended or fine-tuned using Matplotlib for custom needs.

---

### **Example: Quick Visualization with Seaborn**

```python
import seaborn as sns
import pandas as pd

# Load example dataset
tips = sns.load_dataset('tips')

# Show average tip by day
sns.barplot(x='day', y='tip', data=tips)
```

This creates a bar chart showing the **mean tip amount** per day with **confidence intervals**, all with minimal code.

---

In short, Seaborn is preferred for statistical visualizations because it makes complex plots **easy to create**, **aesthetically pleasing**, and **statistically informative**.



6. What are the differences between NumPy arrays and Python lists?


While both **NumPy arrays** and **Python lists** can store sequences of elements, they have significant differences in terms of **performance**, **functionality**, and **behavior**, especially for numerical and scientific computing.

---

### 🔍 **Key Differences:**

| Feature                | **NumPy Array**                                 | **Python List**                          |
| ---------------------- | ----------------------------------------------- | ---------------------------------------- |
| **Data Type**          | Homogeneous (all elements of same type)         | Heterogeneous (can store mixed types)    |
| **Performance**        | Much faster (implemented in C, vectorized ops)  | Slower for numeric work (interpreted loops over boxed objects) |
| **Memory Usage**       | More efficient (contiguous memory block)        | Higher memory usage                      |
| **Functionality**      | Supports mathematical operations & broadcasting | No direct support for math operations    |
| **Operations**         | Element-wise (e.g., `array1 + array2`)          | Must loop manually (e.g., `for` loops)   |
| **Slicing & Indexing** | More powerful (multi-dimensional support)       | Basic slicing, no true multi-dim support |
| **Best Use Case**      | Large-scale numeric data processing             | General-purpose, smaller mixed-type data |

---

### ✅ **Example Comparison**:

```python
import numpy as np

# NumPy array
a = np.array([1, 2, 3])
print(a + 10)     # Output: [11 12 13]

# Python list
b = [1, 2, 3]
print(b + [10])   # Output: [1, 2, 3, 10] — list concatenation, not element-wise addition
```

---

### 📌 Summary:

* Use **NumPy arrays** when you're working with **large numeric data**, need **performance**, and want access to **powerful math operations**.
* Use **Python lists** for **general-purpose programming**, especially when the data types are mixed or performance isn’t critical.
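
The data-type and memory rows of the table can be illustrated with a short sketch. The input list is arbitrary, and the byte counts assume the default `float64` dtype:

```python
import numpy as np

mixed = [1, 2.5, 3]               # a list may hold ints and floats side by side
arr = np.array(mixed)             # NumPy upcasts everything to one dtype

print(arr.dtype)                  # float64
print(arr.itemsize, arr.nbytes)   # 8 bytes per element, 24 bytes in total

# Element-wise doubling: a comprehension for the list, plain arithmetic for the array
doubled_list = [x * 2 for x in mixed]
doubled_arr = arr * 2

print(mixed * 2)                  # list repetition: six elements, not doubled values
print(doubled_arr)
```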




7. What is a heatmap, and when should it be used?


A **heatmap** is a **data visualization** technique that uses **color to represent values** in a two-dimensional matrix or table. Each cell’s color intensity indicates the magnitude of the data point it represents.

---

### 🔥 **Key Features of a Heatmap:**

* Displays **numerical data** in a matrix form.
* Uses **color gradients** (e.g., light to dark, cold to warm) to show **data magnitude**.
* Easy to spot **patterns, correlations, or anomalies**.

---

### ✅ **When to Use a Heatmap:**

1. **To Visualize Correlation Matrices**:

   * Quickly see how features are related (positive or negative correlation).
   * Example: `sns.heatmap(df.corr(numeric_only=True))` in Seaborn (`numeric_only=True` skips non-numeric columns).

2. **To Display Matrix-Like Data**:

   * E.g., confusion matrices in classification models, distance matrices, etc.

3. **To Detect Outliers or Trends**:

   * Use color intensity to highlight high or low values in datasets.

4. **For Time-Series or Pivoted Data**:

   * Show metrics over time (e.g., sales per day/hour across months).

---

### 🧪 **Example in Python (Seaborn):**

```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Sample correlation heatmap
df = sns.load_dataset('iris')
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```

This produces a heatmap showing correlation coefficients between numeric columns of the Iris dataset.

---

### 📌 Summary:

Use a **heatmap** when you want a **quick, intuitive overview** of **how values vary** across two dimensions, especially when spotting **relationships** or **anomalies** visually is more effective than using raw numbers.



8. What does the term “vectorized operation” mean in NumPy?

In NumPy, a **vectorized operation** refers to performing operations on entire arrays (or vectors) element-wise without the need for explicit loops in Python.

### Key Points:

* **Efficient**: Vectorized operations are implemented in low-level C, making them much faster than manual Python loops.
* **Concise**: Code using vectorized operations is typically shorter and easier to read.
* **Element-wise**: Operations are applied to each element in the array automatically.

### Example:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Vectorized addition
c = a + b  # Output: array([5, 7, 9])
```

The vectorized form replaces an explicit Python loop such as:

```python
c = []
for i in range(len(a)):
    c.append(a[i] + b[i])
```
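
Vectorization also covers conditional logic. As a sketch, assume a hypothetical rule that prices above 100 get a 10% discount; `np.where` applies it to the whole array and gives the same result as the loop:

```python
import numpy as np

prices = np.array([120.0, 80.0, 45.0, 200.0])   # hypothetical prices

# Loop version: one Python-level iteration per element
discounted_loop = []
for p in prices:
    discounted_loop.append(p * 0.9 if p > 100 else p)

# Vectorized version: condition and arithmetic run over the whole array at once
discounted_vec = np.where(prices > 100, prices * 0.9, prices)

print(discounted_vec)
print(np.allclose(discounted_loop, discounted_vec))
```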




9. How does Matplotlib differ from Plotly?

**Matplotlib** and **Plotly** are both powerful Python libraries for data visualization, but they differ significantly in style, functionality, and use cases:

### 🔹 1. **Interactivity**

* **Matplotlib**: Primarily **static** plots (e.g., for reports or publications). Interactivity is limited unless combined with other tools (like `mpl_interactions` or `nbagg` in Jupyter).
* **Plotly**: Built for **interactive** plots—users can zoom, hover, and pan by default. Great for dashboards and web apps.

### 🔹 2. **Ease of Use**

* **Matplotlib**: More **low-level**; gives fine-grained control but can be verbose.
* **Plotly**: Higher-level API with more **user-friendly** functions for complex plots.

### 🔹 3. **Output and Integration**

* **Matplotlib**: Best for generating **static images** (PNG, PDF, etc.). Works well in Jupyter notebooks and for academic publishing.
* **Plotly**: Outputs **interactive HTML**—ideal for embedding in web applications or sharing visualizations online.

### 🔹 4. **Customization**

* **Matplotlib**: Extremely **customizable** through detailed configuration.
* **Plotly**: Also customizable, especially for layout and interactivity, though slightly more abstracted.

### 🔹 5. **Types of Charts**

* **Matplotlib**: Great for basic charts (line, bar, scatter, etc.); 3D plots use the bundled `mplot3d` toolkit, and more specialized charts often need extra code or third-party packages.
* **Plotly**: Excellent support for a wide range of **advanced and interactive charts** like choropleths, 3D surface plots, and sunbursts.

### Summary Table:

| Feature        | Matplotlib             | Plotly                |
| -------------- | ---------------------- | --------------------- |
| Interactivity  | Limited                | High                  |
| Plot type      | Static                 | Interactive           |
| Learning curve | Moderate to high       | Easier for many cases |
| Output formats | PNG, PDF               | HTML, PNG, PDF        |
| Customization  | Very detailed          | High, but abstracted  |
| Ideal use case | Publications, academia | Web apps, dashboards  |



10. What is the significance of hierarchical indexing in Pandas?

**Hierarchical indexing** (also called **MultiIndexing**) in **Pandas** allows you to have multiple levels of indexes on a single axis (row or column), enabling more complex data structures and easier analysis of high-dimensional data within a 2D `DataFrame`.

---

### 🔹 Significance of Hierarchical Indexing:

1. **Organizes Complex Data**
   Lets you represent data with multiple dimensions (like time and location) in a single DataFrame, while still retaining powerful indexing and slicing capabilities.

2. **Enhanced Data Grouping and Aggregation**
   Makes operations like `groupby`, pivot tables, and aggregation more powerful and intuitive.

3. **Efficient Data Slicing**
   Allows subsetting data at multiple levels without restructuring the DataFrame.

4. **Facilitates Panel Data Handling**
   Useful for time series data across different entities (e.g., stock prices for multiple companies over time).

---

### 🔹 Example:

```python
import pandas as pd

# Create a MultiIndex
index = pd.MultiIndex.from_tuples([
    ('USA', 'New York'),
    ('USA', 'Chicago'),
    ('Canada', 'Toronto'),
    ('Canada', 'Vancouver')
], names=['Country', 'City'])

data = pd.DataFrame({'Population': [8_000_000, 2_700_000, 2_900_000, 630_000]}, index=index)

print(data)
```

#### Output:

```
                 Population
Country City               
USA     New York     8000000
        Chicago      2700000
Canada  Toronto      2900000
        Vancouver     630000
```

Now you can easily access data like:

```python
data.loc['USA']
```
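
A few more access patterns, sketched on the same data. Sorting the index first is an optional step added here because lookups on a sorted MultiIndex are faster and avoid performance warnings:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('USA', 'New York'), ('USA', 'Chicago'),
     ('Canada', 'Toronto'), ('Canada', 'Vancouver')],
    names=['Country', 'City'],
)
data = pd.DataFrame(
    {'Population': [8_000_000, 2_700_000, 2_900_000, 630_000]},
    index=index,
).sort_index()

# Select one row with a full (Country, City) key
toronto = data.loc[('Canada', 'Toronto'), 'Population']

# Cross-section on the inner level only
chicago = data.xs('Chicago', level='City')

# Aggregate over one level of the index
per_country = data.groupby(level='Country')['Population'].sum()

# Pivot the inner level into columns
wide = data['Population'].unstack('City')

print(toronto)
print(chicago)
print(per_country)
print(wide)
```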

---

### 🔹 Summary:

Hierarchical indexing brings structure and flexibility, especially when dealing with multidimensional datasets in a 2D format. It’s key for tidy, powerful data manipulation in complex real-world scenarios.



11. What is the role of Seaborn’s pairplot() function?

The `pairplot()` function in **Seaborn** is used to create a matrix of scatter plots (and histograms or KDEs on the diagonal) that visualizes **pairwise relationships** between numeric variables in a dataset.

---

### 🔹 **Role and Purpose of `pairplot()`**

1. **Exploratory Data Analysis (EDA)**
   Helps quickly identify relationships, correlations, or patterns between multiple numeric features.

2. **Visualizing Distributions**
   Shows univariate distributions on the diagonal and bivariate relationships off-diagonal.

3. **Class/Group Differentiation**
   With the `hue` parameter, it can show how relationships vary by category (e.g., species, gender).

---

### 🔹 **Example:**

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset
df = sns.load_dataset('iris')

# Create a pairplot
sns.pairplot(df, hue='species')
plt.show()
```

This will plot scatterplots between all combinations of numerical columns in the Iris dataset, colored by species.

---

### 🔹 **Key Parameters:**

* `data`: DataFrame containing the data.
* `hue`: Categorical variable to separate data by color.
* `kind`: Type of off-diagonal plot (`'scatter'`, `'kde'`, `'hist'`, or `'reg'`).
* `diag_kind`: Type of plot on the diagonal (`'auto'`, `'hist'`, `'kde'`).

---

### 🔹 Summary:

Seaborn’s `pairplot()` is a quick and powerful tool for **visualizing multidimensional relationships**, making it a go-to for initial data exploration.



12. What is the purpose of the describe() function in Pandas?

The `describe()` function in **Pandas** is used to generate **summary statistics** of a DataFrame or Series, providing a quick overview of the **central tendency, dispersion, and shape** of a dataset's distribution.

---

### 🔹 **Purpose of `describe()`**

1. **Quick Summary**
   Gives essential descriptive statistics like **count, mean, standard deviation**, and **percentiles** for each numeric column.

2. **Data Understanding**
   Helps identify missing data, outliers, or skewed distributions during **exploratory data analysis (EDA)**.

3. **Works on Numeric and Categorical Data**

   * By default, it describes **numeric** columns.
   * You can include **object/categorical** columns with `include='object'` or `include='all'`.

---

### 🔹 **Example:**

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [40000, 50000, 60000, 70000, 80000]
})

print(df.describe())
```

#### Output:

```
             age        salary
count   5.000000      5.000000
mean   35.000000  60000.000000
std     7.905694  15811.388301
min    25.000000  40000.000000
25%    30.000000  50000.000000
50%    35.000000  60000.000000
75%    40.000000  70000.000000
max    45.000000  80000.000000
```

---

### 🔹 **Custom Usage:**

```python
df.describe(include='all')  # Includes non-numeric columns too
df['age'].describe()        # Describes a single column
```
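
As a further sketch, `describe()` accepts a `percentiles` argument, and on a text column it reports counts and the most frequent value instead of numeric statistics. The small dataset below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'dept': ['HR', 'IT', 'IT', 'Sales', 'IT'],
})

# Request the 10th and 90th percentiles instead of the default quartiles
custom = df['age'].describe(percentiles=[0.1, 0.9])

# For a text column: count, unique, top (most frequent value), freq
cat_summary = df['dept'].describe()

print(custom)
print(cat_summary)
```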

---

### 🔹 Summary:

`describe()` is a **foundational EDA tool** that helps you understand your dataset's basic statistical properties with just one line of code.




13. Why is handling missing data important in Pandas?

Handling **missing data** in Pandas is critically important because it directly affects the **accuracy, integrity, and usability** of your data analysis and machine learning models.

---

### 🔹 Why Handling Missing Data Matters:

1. **Prevents Errors in Analysis**
   Pandas reductions such as `mean()` and `sum()` skip `NaN` by default, which can silently change results, while plain NumPy functions propagate `NaN` and many machine learning algorithms raise errors when they encounter it.

2. **Ensures Data Integrity**
   Missing values can bias your analysis if not treated carefully—for example, when calculating averages or correlations.

3. **Improves Model Performance**
   Machine learning models often require complete datasets. Unhandled missing values can lead to poor predictions or the model refusing to train.

4. **Reveals Underlying Issues**
   Patterns of missing data can indicate data collection problems, system errors, or hidden structure in the data.

5. **Supports Better Decision-Making**
   Clean, complete data allows for more reliable insights and informed decisions.

---

### 🔹 Common Techniques to Handle Missing Data:

| Method                 | Use Case                                        |
| ---------------------- | ----------------------------------------------- |
| `dropna()`             | Remove rows/columns with missing data           |
| `fillna(value)`        | Fill missing values with a constant or strategy |
| `interpolate()`        | Estimate values based on neighboring data       |
| Conditional Imputation | Fill based on logic or grouped statistics       |

---

### 🔹 Example:

```python
import pandas as pd
df = pd.DataFrame({'A': [1, None, 3], 'B': [4, 5, None]})

# Check for missing data
df.isnull()

# Fill missing values with the column mean
df.fillna(df.mean(numeric_only=True))
```
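
The techniques from the table can be compared side by side. This is a sketch on a made-up frame; the column names `group`, `value`, and `reading` are purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, np.nan],
    'reading': [10.0, np.nan, 30.0, 40.0],
})

# Count missing values per column
missing_per_column = df.isna().sum()

# Drop every row that contains at least one NaN
complete_rows = df.dropna()

# Linear interpolation between neighboring values
interpolated = df['reading'].interpolate()

# Conditional imputation: fill each gap with the mean of its own group
group_mean = df.groupby('group')['value'].transform('mean')
group_filled = df['value'].fillna(group_mean)

print(missing_per_column)
print(complete_rows)
print(interpolated.tolist())
print(group_filled.tolist())
```

None of these calls modifies `df` in place; each returns a new object, so the result has to be assigned if it should be kept.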

---

### 🔹 Summary:

Handling missing data in Pandas is essential for ensuring **reliable, valid, and usable** results in any data-driven project.



14. What are the benefits of using Plotly for data visualization?

The **benefits of using Plotly** for data visualization include its powerful interactivity, ease of use, and flexibility, making it a top choice for both exploratory analysis and presentation-ready visuals. Here’s a detailed look:

---

### 🔹 1. **Built-in Interactivity**

* Plotly charts support zooming, panning, hovering, and dynamic legends by default.
* Enables deeper **exploratory data analysis (EDA)** with minimal effort.

### 🔹 2. **High-Quality, Web-Ready Visuals**

* Produces sleek, responsive, and professional-looking plots.
* Charts can be exported as **interactive HTML** or **static images (PNG, SVG, etc.)**.

### 🔹 3. **Wide Range of Chart Types**

* Supports:

  * Basic plots: line, bar, scatter, pie
  * Statistical plots: box, violin, histogram
  * Advanced visuals: 3D plots, choropleths, Sankey diagrams, sunbursts, heatmaps

### 🔹 4. **Dash Integration for Web Apps**

* Seamless integration with **Dash**, Plotly’s framework for creating full **data apps** without needing JavaScript or front-end code.

### 🔹 5. **Cross-Platform and Language Support**

* Available in **Python**, **R**, **JavaScript**, and **Julia**—useful for teams working across tech stacks.

### 🔹 6. **Customizable and Themeable**

* Layouts, tooltips, colors, and axes are **highly customizable**.
* Offers built-in themes and full control over plot appearance.

### 🔹 7. **Easy Sharing and Collaboration**

* Share interactive visualizations through:

  * HTML files
  * Jupyter notebooks
  * Dash apps
  * Online platforms (like Plotly Cloud or Dash Enterprise)

---

### ✅ Summary:

| Benefit                | Description                                         |
| ---------------------- | --------------------------------------------------- |
| Interactivity          | Built-in hover, zoom, filter without extra code     |
| Professional visuals   | High-resolution, responsive, and presentation-ready |
| Chart variety          | Basic to advanced 2D and 3D plots                   |
| Web app integration    | Easy integration with Dash for full dashboards      |
| Multi-language support | Use in Python, R, JS, and Julia                     |
| Easy sharing           | Export to HTML or embed in notebooks/webpages       |

---



15. How does NumPy handle multidimensional arrays?

**NumPy** handles **multidimensional arrays** (called **ndarrays**) with great flexibility and efficiency, making it ideal for scientific computing and data analysis in Python.

---

### 🔹 Key Features of NumPy’s Multidimensional Array Handling:

#### 1. **n-Dimensional Arrays (ndarray)**

* NumPy's `ndarray` can represent arrays of **any number of dimensions** (1D, 2D, 3D, etc.).
* Example of creation:

  ```python
  import numpy as np
  a = np.array([[1, 2], [3, 4]])  # 2D array (matrix)
  b = np.array([[[1], [2]], [[3], [4]]])  # 3D array
  ```

#### 2. **Shape and Dimensions**

* Use `.shape` to get the dimensions (rows, columns, etc.).
* Use `.ndim` to get the number of dimensions.

  ```python
  a.shape  # (2, 2)
  a.ndim   # 2
  ```

#### 3. **Broadcasting**

* Allows arithmetic operations between arrays of different shapes (e.g., 2D + 1D) by automatically expanding dimensions where possible.

#### 4. **Indexing and Slicing**

* Multidimensional arrays support advanced indexing and slicing:

  ```python
  a[1, 0]      # Access element at row 1, column 0
  a[:, 1]      # Get all rows from column 1
  ```

#### 5. **Reshaping and Transposing**

* Use `.reshape()` to change the shape without changing the data.
* Use `.T` or `.transpose()` to change axes.

  ```python
  a.reshape(4, 1)
  a.T
  ```

#### 6. **Axis-Based Operations**

* NumPy functions like `sum()`, `mean()`, etc., can operate along specific axes:

  ```python
  a.sum(axis=0)  # Collapse the rows: one sum per column
  a.sum(axis=1)  # Collapse the columns: one sum per row
  ```
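
The same operations extend to more than two dimensions. The sketch below uses an arbitrary 3D array of the integers 0 to 23 so the results are easy to verify by hand:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # 2 blocks, each with 3 rows and 4 columns
print(cube.ndim, cube.shape)            # 3 (2, 3, 4)

print(cube[1, 2, 3])                    # last element: 23
print(cube[:, 0, :].shape)              # first row of every block: (2, 4)

print(cube.sum(axis=0).shape)           # collapse the block axis: (3, 4)
print(cube.sum(axis=(1, 2)))            # one total per block: 66 and 210

print(cube.transpose(2, 0, 1).shape)    # reorder axes: (4, 2, 3)
print(cube.reshape(6, -1).shape)        # -1 lets NumPy infer the size: (6, 4)
```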

---

### 🔹 Summary:

NumPy provides powerful tools for creating, manipulating, and performing operations on **multidimensional arrays**, making it foundational for data science, machine learning, and numerical computing.



16. What is the role of Bokeh in data visualization?

**Bokeh** is a powerful Python library for **interactive and web-ready data visualization**. Its primary role is to help users create **rich, interactive plots and dashboards** that can be **rendered in modern web browsers**, making it ideal for data exploration and presentation.

---

### 🔹 **Role and Benefits of Bokeh in Data Visualization**

#### 1. **Interactive Visualizations**

* Provides interactive features like **zoom, pan, hover tooltips, sliders**, and **widgets**.
* Enables deeper data exploration directly in the browser.

#### 2. **Web Integration**

* Bokeh plots render as **HTML/JavaScript**, making them suitable for embedding in web applications, reports, or dashboards.
* Can integrate with Flask, Django, and Jupyter Notebooks.

#### 3. **Custom Dashboards**

* Allows building interactive **data apps** with controls (like dropdowns or checkboxes) using **Bokeh Server**.

#### 4. **High-Performance for Large Datasets**

* Efficiently handles large and streaming datasets using **WebGL** rendering and downsampling.

#### 5. **Pythonic API**

* Offers a clean, intuitive, object-oriented API in Python, making it easy for developers and analysts to adopt.

---

### 🔹 **Example Use Cases**

* Real-time data dashboards
* Financial data visualization
* Interactive reports for stakeholders
* Scientific data exploration

---

### 🔹 **Comparison with Other Tools**

| Feature           | Bokeh               | Matplotlib    | Plotly        |
| ----------------- | ------------------- | ------------- | ------------- |
| Interactivity     | ✅ Built-in          | 🚫 Limited    | ✅ Built-in    |
| Web Integration   | ✅ Native HTML/JS    | 🚫 Not native | ✅ HTML, Dash  |
| Dashboard Support | ✅ With Bokeh Server | 🚫            | ✅ With Dash   |
| Ease of Use       | 👍 Medium           | 👍 Simple     | 👍 High-level |

---

### ✅ **In Summary:**

**Bokeh** is best used when you need **interactive, web-based visualizations** that are easy to build with Python and can be integrated into dashboards or applications.



17. Explain the difference between apply() and map() in Pandas.

Both `apply()` and `map()` in **Pandas** are used to **apply functions** to data, but they differ in **where** and **how** they are typically used.

---

### 🔹 **`map()`**

* **Used with:** Primarily **Series** (pandas 2.1 and later also provide an element-wise `DataFrame.map()`, which replaces `applymap()`).
* **Purpose:** Map **values in a Series** according to an input correspondence (like a dict, Series, or function).
* **Returns:** A new Series with mapped values.
* **Common use cases:**

  * Replacing values using a dictionary.
  * Applying a simple function element-wise.
* **Example:**

  ```python
  import pandas as pd

  s = pd.Series(['cat', 'dog', 'bird'])
  s.map({'cat': 'kitten', 'dog': 'puppy'})
  # Result: Series(['kitten', 'puppy', NaN]); 'bird' has no mapping, so it becomes NaN
  ```

---

### 🔹 **`apply()`**

* **Used with:** Both **Series** and **DataFrames**.
* **Purpose:** Apply a function along an axis (rows or columns) of a DataFrame, or element-wise on a Series.
* **Returns:** Depends on the function — can be a scalar, Series, or DataFrame.
* **Common use cases:**

  * Complex row/column-wise operations on DataFrames.
  * Applying functions that need access to full rows or columns.
* **Example:**

  ```python
  import pandas as pd

  df = pd.DataFrame({
      'A': [1, 2, 3],
      'B': [4, 5, 6]
  })
  df.apply(lambda row: row['A'] + row['B'], axis=1)
  # Output: Series([5, 7, 9])
  ```

---

### 🔹 **Summary Table**

| Feature  | `map()`                              | `apply()`                                 |
| -------- | ------------------------------------ | ----------------------------------------- |
| Works on | Primarily Series                     | Series and DataFrames                     |
| Purpose  | Element-wise mapping/replacement     | Apply function along axis or element-wise |
| Input    | Dict, Series, function               | Function                                  |
| Use case | Value substitution or transformation | Complex row/column-wise computations      |
| Returns  | Series                               | Varies (scalar, Series, DataFrame)        |
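
To make the contrast concrete, here is a small sketch with illustrative data. It also shows the plain vectorized alternative, which is usually preferable to `apply()` when the operation is simple arithmetic:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
squared = s.map(lambda x: x ** 2)             # element-wise on a Series

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# axis=0 (the default): the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_sum = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Same result as row_sum without calling a Python function per row
vectorized = df['A'] + df['B']

print(squared.tolist(), col_range.to_dict(), row_sum.tolist())
```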

---


18. What are some advanced features of NumPy?

Here are some **advanced features of NumPy** that go beyond basic array creation and arithmetic, making it a powerful tool for scientific computing and data analysis:

---

### 🔹 1. **Broadcasting**

* Enables arithmetic operations on arrays of different shapes by automatically expanding the smaller array along missing dimensions.
* Allows writing concise and efficient code without explicit loops.

### 🔹 2. **Fancy Indexing and Boolean Masking**

* Access or modify array elements using arrays of indices or boolean conditions.
* Useful for filtering, selecting, or updating subsets of data efficiently.

### 🔹 3. **Structured Arrays and Record Arrays**

* Store heterogeneous data (like tables) with named fields.
* Useful for handling complex datasets with multiple data types.

### 🔹 4. **Memory Mapping (np.memmap)**

* Access large datasets stored on disk as if they were in memory, without loading the entire file.
* Enables working with datasets larger than your RAM.

### 🔹 5. **Universal Functions (ufuncs)**

* Vectorized functions that operate element-wise on arrays with optimized C implementation.
* Every ufunc also exposes methods such as `reduce()` and `accumulate()`, for example `np.add.reduce()` and `np.multiply.accumulate()`.

### 🔹 6. **Linear Algebra Module (`numpy.linalg`)**

* Provides functions for matrix operations, eigenvalues, singular value decomposition (SVD), solving linear systems, and more.

### 🔹 7. **Random Sampling (`numpy.random`)**

* Powerful suite for generating random numbers, random sampling, and probabilistic distributions.

### 🔹 8. **FFT (Fast Fourier Transform)**

* Perform fast Fourier transforms for signal processing tasks.

### 🔹 9. **Advanced Broadcasting with `np.newaxis` and `np.expand_dims`**

* Manipulate array dimensions to enable broadcasting in complex operations.

### 🔹 10. **Masked Arrays (`numpy.ma`)**

* Handle arrays with missing or invalid entries by masking elements rather than removing them.
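
Several of these features fit in one short sketch. The array values and the `name`/`age` fields are made up for illustration:

```python
import numpy as np

a = np.array([5, -2, 9, -7, 3])

# Boolean masking and fancy indexing
positives = a[a > 0]                     # elements where the condition holds
picked = a[[0, 2, 4]]                    # elements at the listed positions

# ufunc methods
total = np.add.reduce(a)                 # same as a.sum()
running = np.multiply.accumulate([1, 2, 3, 4])   # running product

# Structured array with named, differently typed fields
people = np.array([('Alice', 25), ('Bob', 30)],
                  dtype=[('name', 'U10'), ('age', 'i4')])
ages = people['age']

# np.newaxis: all pairwise differences via broadcasting, shape (5, 5)
outer = a[:, np.newaxis] - a[np.newaxis, :]

# Masked array: negative entries are hidden rather than removed
masked = np.ma.masked_less(a, 0)
masked_mean = masked.mean()              # mean of 5, 9, and 3 only

print(positives, picked, total, running)
print(ages, outer.shape, masked_mean)
```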

---



19. How does Pandas simplify time series analysis?

Pandas simplifies **time series analysis** by providing powerful, intuitive tools and data structures tailored specifically for handling date and time data. Here’s how:

---

### 🔹 Key Ways Pandas Simplifies Time Series Analysis:

#### 1. **Datetime Indexing**

* You can easily convert columns to `DatetimeIndex`, allowing you to **index, slice, and filter** data based on dates and times.
* Example:

  ```python
  df['date'] = pd.to_datetime(df['date'])
  df.set_index('date', inplace=True)
  df.loc['2023-01-01':'2023-01-31']  # Slicing by date range
  ```

#### 2. **Resampling and Frequency Conversion**

* Easily resample data to different frequencies (daily, monthly, yearly) using `.resample()`.
* Aggregations like sum, mean, or custom functions can be applied during resampling.

  ```python
  df.resample('M').mean()  # Monthly average; pandas 2.2+ prefers the alias 'ME'
  ```

#### 3. **Handling Missing Dates**

* Missing timestamps are not inserted automatically, but `.asfreq()` or `.reindex()` can expose them as NaN rows.
* You can then fill the gaps with `.interpolate()`, `.ffill()` (forward-fill), or `.bfill()` (backfill).
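
A small sketch of exposing and filling missing dates. The series below is made up, with January 3 and 4 deliberately absent:

```python
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05'])
s = pd.Series([10.0, 20.0, 50.0], index=dates)

# Conform to a daily frequency; the two missing days appear as NaN
daily = s.asfreq('D')

ffilled = daily.ffill()               # carry the last known value forward
interpolated = daily.interpolate()    # linear interpolation between neighbours

print(daily)
print(ffilled)
print(interpolated)
```

Forward-fill repeats 20.0 for the missing days, while linear interpolation fills them with evenly spaced values between 20.0 and 50.0.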

#### 4. **Date Offsets and Shifting**

* Shift data forward or backward in time with `.shift()`.
* Generate date ranges with `pd.date_range()`.
* Perform time-based arithmetic easily.
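
The sketch below combines these three ideas on a made-up daily series:

```python
import pandas as pd

idx = pd.date_range('2023-01-01', periods=5, freq='D')
s = pd.Series([100, 102, 101, 105, 107], index=idx)

# Shift values down by one row to line up each day with the previous day
previous = s.shift(1)
change = s - previous                 # day-over-day change, same as s.diff()

# Time-based arithmetic on the index and on single timestamps
one_week_later = idx + pd.Timedelta(days=7)
month_end = pd.Timestamp('2023-01-15') + pd.offsets.MonthEnd()

print(change)
print(one_week_later[0])   # 2023-01-08 00:00:00
print(month_end)           # 2023-01-31 00:00:00
```

The first entry of `change` is NaN because there is no earlier day to compare against.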

#### 5. **Rolling Window Calculations**

* Calculate moving averages, rolling sums, and other window functions using `.rolling()`.

  ```python
  df['rolling_mean'] = df['value'].rolling(window=7).mean()  # 7-row moving average (7 days for daily data)
  ```

#### 6. **Time Zone Handling**

* Attach a time zone to naive timestamps with `.tz_localize()`, then convert between time zones with `.tz_convert()`.
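
A minimal sketch of the two steps, assuming the original timestamps were recorded in UTC without time zone information:

```python
import pandas as pd

naive = pd.Series(
    [1, 2],
    index=pd.to_datetime(['2023-06-01 12:00', '2023-06-01 18:00']),
)

# Step 1: declare which time zone the naive timestamps are in
utc = naive.tz_localize('UTC')

# Step 2: convert to another zone; the instants stay the same,
# only their wall-clock representation changes
new_york = utc.tz_convert('America/New_York')

print(utc.index[0])        # 2023-06-01 12:00:00+00:00
print(new_york.index[0])   # 2023-06-01 08:00:00-04:00
```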

#### 7. **Powerful Time-based Grouping**

* Group data by time periods using `.groupby()` with time-based keys, such as `pd.Grouper(freq=...)` or attributes of the index like `df.index.month`.
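
Two possible ways to build time-based keys, shown on a made-up DataFrame with a `value` column:

```python
import pandas as pd

idx = pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03', '2023-02-25'])
df = pd.DataFrame({'value': [10, 20, 30, 50]}, index=idx)

# Key 1: the month number taken from the DatetimeIndex
by_month = df.groupby(df.index.month)['value'].sum()

# Key 2: monthly periods, which keep the year as part of the key
by_period = df.groupby(df.index.to_period('M'))['value'].mean()

print(by_month)    # month 1 -> 30, month 2 -> 80
print(by_period)   # 2023-01 -> 15.0, 2023-02 -> 40.0
```

Grouping by month number merges the same month across different years, so the period-based key is usually safer for multi-year data.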

---

### 🔹 Summary:

Pandas offers a comprehensive toolkit for **cleaning, manipulating, aggregating, and analyzing** time series data with simple syntax, making it a go-to library for time-dependent datasets.



20. What is the role of a pivot table in Pandas?

A **pivot table** in Pandas is a powerful tool used to **summarize, aggregate, and reshape data**—especially useful for exploratory data analysis and generating reports.

---

### 🔹 Role of a Pivot Table in Pandas:

1. **Data Summarization**
   It helps **aggregate data** based on one or more categorical variables (like grouping by columns and rows) using aggregation functions such as sum, mean, count, etc.

2. **Reshaping Data**
   Transforms data from a **long format** to a **wide format**, making it easier to analyze relationships between variables.

3. **Multi-level Grouping**
   Supports grouping by multiple indices (rows and columns), enabling complex summaries and comparisons.

4. **Handling Missing Data**
   Missing combinations appear as NaN by default, or as the value passed to `fill_value`.

---

### 🔹 Example:

```python
import pandas as pd

data = {
    'Region': ['East', 'East', 'West', 'West', 'East'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 250, 300]
}

df = pd.DataFrame(data)

pivot = df.pivot_table(index='Region', columns='Product', values='Sales', aggfunc='sum')
print(pivot)
```

#### Output:

```
Product    A    B
Region           
East     400  150
West     200  250
```
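
The missing-data behaviour can be sketched with a smaller, made-up dataset in which the West region has no sales of product B:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'Product': ['A', 'B', 'A'],
    'Sales': [100, 150, 200],
})

# Default: the missing (West, B) combination shows up as NaN
with_nan = df.pivot_table(index='Region', columns='Product',
                          values='Sales', aggfunc='sum')

# fill_value replaces the missing combination with a chosen value
with_zero = df.pivot_table(index='Region', columns='Product',
                           values='Sales', aggfunc='sum', fill_value=0)

print(with_nan)
print(with_zero)
```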

---

### 🔹 Summary:

Pivot tables in Pandas let you **quickly summarize** large datasets, revealing insights by **aggregating and restructuring** data along multiple dimensions.



21. Why is NumPy’s array slicing faster than Python’s list slicing?

NumPy’s array slicing is faster than Python’s list slicing mainly because of how data is stored and accessed under the hood:

---

### 🔹 Reasons Why NumPy Array Slicing is Faster:

1. **Contiguous Memory Storage**

* NumPy arrays store data in a **contiguous block of memory** (like C arrays), allowing efficient access and manipulation.
* Python lists are arrays of **pointers to objects scattered in memory**, requiring extra dereferencing.

2. **Homogeneous Data Types**

* NumPy arrays have a fixed, uniform data type (e.g., all floats or all ints), enabling **optimized low-level operations**.
* Python lists can hold objects of different types, adding overhead to access and operations.

3. **View vs. Copy**

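* Basic NumPy slicing (`arr[a:b]`) returns a **view** (a window into the same data), so no element data is copied. Fancy and boolean indexing return copies instead.
* Python list slicing always creates a **new list** (a shallow copy of the references), which takes time and memory proportional to the slice length.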

4. **Vectorized Operations**

* Operations applied to a NumPy slice run in **highly optimized compiled C code** over the whole slice at once.
* List slicing is also implemented in C, but processing the resulting elements usually requires a Python-level loop, which is slower.
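
The view-versus-copy difference can be checked directly. This is a small sketch with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)
view = arr[2:6]           # basic slice: a view on the same buffer
view[0] = 99              # writing through the view changes arr too

lst = list(range(10))
sub = lst[2:6]            # list slice: a new list
sub[0] = 99               # the original list is unaffected

fancy = arr[[2, 3]]       # fancy indexing: a copy, not a view

print(arr[2], lst[2])                  # 99 2
print(np.shares_memory(arr, view))     # True
print(np.shares_memory(arr, fancy))    # False
```

Because a view shares memory with its source, call `.copy()` on the slice when an independent array is needed.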

---

### 🔹 Summary:

| Feature          | NumPy Array                    | Python List                      |
| ---------------- | ------------------------------ | -------------------------------- |
| Memory Layout    | Contiguous, homogeneous        | Array of pointers, heterogeneous |
| Slicing Behavior | Returns view (no data copy)    | Returns new list (copy of data)  |
| Performance      | Fast, optimized low-level code | Slower, Python-level operations  |

---




22. What are some common use cases for Seaborn?

Seaborn is a popular Python visualization library built on top of Matplotlib that simplifies creating attractive and informative statistical graphics. Here are some common use cases for Seaborn:

---

### 🔹 Common Use Cases for Seaborn

1. **Exploratory Data Analysis (EDA)**

   * Quickly visualize distributions, relationships, and trends in data.
   * Functions like `pairplot()`, `histplot()`, and `heatmap()` help uncover patterns and outliers.

2. **Statistical Visualizations**

   * Plot statistical relationships with regression lines (`lmplot()`), box plots (`boxplot()`), violin plots (`violinplot()`), and swarm plots (`swarmplot()`).

3. **Visualizing Categorical Data**

   * Easily create bar plots, count plots, and point plots that compare categories and show confidence intervals.

4. **Correlation Analysis**

   * Use `heatmap()` with correlation matrices to visualize the strength and direction of relationships between variables.

5. **Time Series Visualization**

   * Use line plots with confidence intervals to analyze trends over time.

6. **Multi-Variable Plots**

   * Facet grids (`FacetGrid`) and categorical plots allow visualization across multiple subsets or categories.

---

### 🔹 Summary Table:

| Use Case                         | Seaborn Functions                         |
| -------------------------------- | ----------------------------------------- |
| Distribution of single variables | `histplot()`, `kdeplot()`, `displot()`    |
| Relationship between variables   | `scatterplot()`, `pairplot()`, `lmplot()` |
| Categorical comparisons          | `barplot()`, `countplot()`, `boxplot()`   |
| Correlation visualization        | `heatmap()`                               |
| Multi-plot layouts               | `FacetGrid`, `catplot()`                  |

---



# **Practical**

1. How do you create a 2D NumPy array and calculate the sum of each row?

You can create a 2D NumPy array using `np.array()` and then calculate the sum of each row using the `.sum()` method with `axis=1`. Here’s how:

```python
import numpy as np

# Create a 2D NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Calculate the sum of each row
row_sums = arr.sum(axis=1)

print(row_sums)
```

**Output:**

```
[ 6 15 24]
```

* `axis=1` sums along the column axis, producing one total per row.



2. Write a Pandas script to find the mean of a specific column in a DataFrame.

Here’s a simple Pandas script to find the mean of a specific column in a DataFrame:

```python
import pandas as pd

# Sample DataFrame
data = {
    'A': [10, 20, 30, 40],
    'B': [5, 15, 25, 35]
}

df = pd.DataFrame(data)

# Calculate the mean of column 'A'
mean_A = df['A'].mean()

print("Mean of column A:", mean_A)
```

This will output:

```
Mean of column A: 25.0
```



3. Create a scatter plot using Matplotlib.

Here’s a simple example of creating a scatter plot using Matplotlib:

```python
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 8]

# Create scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Add title and labels
plt.title('Sample Scatter Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Show plot
plt.show()
```



4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

To calculate the correlation matrix and visualize it using Seaborn’s heatmap, you can follow these steps:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}

df = pd.DataFrame(data)

# Calculate correlation matrix
corr_matrix = df.corr()

# Visualize correlation matrix using Seaborn heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

# Add title
plt.title('Correlation Matrix Heatmap')

# Show plot
plt.show()
```

* `.corr()` computes the correlation matrix.
* `sns.heatmap()` visualizes it with colors; `annot=True` shows correlation coefficients on the heatmap.



5. Generate a bar plot using Plotly.

Here’s how you can generate a simple bar plot using Plotly in Python:

```python
import plotly.graph_objects as go

# Sample data
categories = ['A', 'B', 'C', 'D']
values = [10, 15, 7, 12]

# Create bar plot
fig = go.Figure(data=[go.Bar(x=categories, y=values)])

# Add title and axis labels
fig.update_layout(
    title='Sample Bar Plot',
    xaxis_title='Category',
    yaxis_title='Value'
)

# Show plot
fig.show()
```

This will open an interactive bar chart where you can zoom, hover, and pan.


6. Create a DataFrame and add a new column based on an existing column.

Here’s how you can create a Pandas DataFrame and add a new column based on an existing one:

```python
import pandas as pd

# Create a sample DataFrame
data = {
    'A': [10, 20, 30, 40],
    'B': [1, 2, 3, 4]
}
df = pd.DataFrame(data)

# Add a new column 'C' which is double the values of column 'A'
df['C'] = df['A'] * 2

print(df)
```

**Output:**

```
    A  B   C
0  10  1  20
1  20  2  40
2  30  3  60
3  40  4  80
```


7. Write a program to perform element-wise multiplication of two NumPy arrays.

Here’s a simple Python program to perform element-wise multiplication of two NumPy arrays:

```python
import numpy as np

# Define two arrays
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])

# Element-wise multiplication
result = arr1 * arr2

print(result)
```

**Output:**

```
[ 5 12 21 32]
```



8. Create a line plot with multiple lines using Matplotlib.

Here’s how you can create a line plot with multiple lines using Matplotlib:

```python
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]

# Create the plot
plt.plot(x, y1, label='Line 1', marker='o')
plt.plot(x, y2, label='Line 2', marker='s')

# Add title and labels
plt.title('Multiple Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Show legend
plt.legend()

# Show plot
plt.show()
```

This will plot two lines with different markers and a legend.



9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

Here’s how you can create a Pandas DataFrame and filter rows where a column’s value is greater than a specific threshold:

```python
import pandas as pd

# Create sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 78, 92]
}

df = pd.DataFrame(data)

# Filter rows where 'Age' is greater than 30
filtered_df = df[df['Age'] > 30]

print(filtered_df)
```

**Output:**

```
      Name  Age  Score
2  Charlie   35     78
3    David   40     92
```



10. Create a histogram using Seaborn to visualize a distribution.

Here’s a simple example of creating a histogram using Seaborn to visualize a data distribution:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [12, 15, 14, 10, 13, 17, 19, 18, 14, 15, 16, 20, 21, 22, 20]

# Create histogram
sns.histplot(data, bins=5, kde=False, color='skyblue')

# Add title and labels
plt.title('Sample Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show plot
plt.show()
```

* `bins=5` sets the number of bins.
* `kde=False` disables the kernel density estimate curve (set to True if you want a smooth curve).



11. Perform matrix multiplication using NumPy.

Here's how you perform matrix multiplication in NumPy using the `@` operator or `np.dot()`:

```python
import numpy as np

# Define two 2D arrays (matrices)
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication using @ operator
result = A @ B

# Or using np.dot()
# result = np.dot(A, B)

print(result)
```

**Output:**

```
[[19 22]
 [43 50]]
```



12. Use Pandas to load a CSV file and display its first 5 rows.

Here’s how you can use Pandas to load a CSV file and display its first 5 rows:

```python
import pandas as pd

# Load CSV file into DataFrame
df = pd.read_csv('your_file.csv')

# Display first 5 rows
print(df.head())
```

Replace `'your_file.csv'` with the path to your actual CSV file.



13. Create a 3D scatter plot using Plotly.

Here’s a simple example of creating a 3D scatter plot using Plotly:

```python
import plotly.graph_objects as go

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 6, 7, 8, 9]
z = [9, 8, 7, 6, 5]

# Create 3D scatter plot
fig = go.Figure(data=[go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=8,
        color=z,          # set color to the z values
        colorscale='Viridis',   # choose a colorscale
        opacity=0.8
    )
)])

# Update layout
fig.update_layout(
    title='3D Scatter Plot',
    scene=dict(
        xaxis_title='X Axis',
        yaxis_title='Y Axis',
        zaxis_title='Z Axis'
    )
)

fig.show()
```

This will display an interactive 3D scatter plot you can rotate and zoom.

