**Theoretical Questions**

1. What is NumPy, and why is it widely used in Python?
A. NumPy is a Python library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them efficiently. NumPy is widely used because of its speed, ease of use, and integration with other libraries such as SciPy, pandas, and scikit-learn. Its core routines are implemented in optimized C code, which makes array operations much faster than equivalent pure-Python loops. NumPy is essential for data science, machine learning, and scientific computing, and it forms the foundation for many advanced data manipulation and analysis tasks.
2. How does broadcasting work in NumPy?
A. Broadcasting in NumPy allows element-wise operations on arrays of different shapes without explicitly replicating data. NumPy compares the shapes from right to left (treating missing leading dimensions as size 1) and virtually stretches any size-1 dimension to match the other array. Two dimensions are compatible if they are equal or if one of them is 1; if any pair is incompatible, NumPy raises a ValueError. This mechanism simplifies code and improves performance by avoiding unnecessary data duplication. Broadcasting is especially useful in vectorized computations, such as adding a scalar to an array or combining arrays of different but compatible shapes.
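
As a minimal sketch of the shape rule described above (the array names and values are made up for illustration), a column of shape (3, 1) and a row of shape (1, 4) combine into a (3, 4) result, while shapes whose trailing dimensions differ and are not 1 fail:

```python
import numpy as np

# Shapes are compared right to left; each size-1 dimension is stretched.
column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([[1, 2, 3, 4]])         # shape (1, 4)

table = column + row                   # broadcast result has shape (3, 4)
print(table.shape)

# Trailing dimensions 3 and 4 are unequal and neither is 1, so this fails.
failed = False
try:
    np.ones((2, 3)) + np.ones((2, 4))
except ValueError as exc:
    failed = True
    print("broadcast failed:", exc)
```
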
3. What is a Pandas DataFrame?
A. A Pandas DataFrame is a two-dimensional, labeled data structure in Python, similar to a table or spreadsheet. It consists of rows and columns, where each column can hold data of different types (e.g., integers, floats, strings). DataFrames are part of the Pandas library, which is widely used for data manipulation, analysis, and cleaning. They offer powerful tools for indexing, filtering, grouping, and handling missing data. You can create a DataFrame from various sources like dictionaries, CSV files, or databases, making it ideal for data science and machine learning tasks.
4.  Explain the use of the groupby() method in Pandas.
A. The groupby() method in Pandas is used to group data based on one or more columns. It allows you to split a DataFrame into groups, apply a function to each group (like sum(), mean(), or custom functions), and then combine the results. This is useful for analyzing and summarizing large datasets.
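
The split-apply-combine pattern above could look like the following sketch. The `sales` DataFrame and its column names are hypothetical and exist only for this example:

```python
import pandas as pd

# Hypothetical data: one row per sale.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [100, 150, 200, 50, 75],
})

# Split by region, apply sum to each group, combine into one Series.
totals = sales.groupby("region")["amount"].sum()

# Several aggregations at once, one column per function.
summary = sales.groupby("region")["amount"].agg(["mean", "count"])

print(totals)
print(summary)
```
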
5. Why is Seaborn preferred for statistical visualizations?
A. Seaborn is preferred for statistical visualizations because it builds on Matplotlib and provides a high-level, user-friendly interface for creating attractive and informative graphics. It simplifies complex plots like heatmaps, box plots, and violin plots with minimal code. Seaborn integrates well with Pandas DataFrames, making it easy to visualize relationships in datasets. It also comes with built-in themes and color palettes for aesthetically pleasing plots and supports statistical estimation, such as confidence intervals in line plots. This makes it a powerful tool for data analysis and exploration.
6. What are the differences between NumPy arrays and Python lists?
A. NumPy arrays and Python lists differ in several key ways:

Performance: NumPy arrays are faster and more memory-efficient than lists due to their fixed data type and use of optimized C-based implementations.

Data Type: NumPy arrays hold elements of the same data type, while Python lists can store mixed types.

Functionality: NumPy provides a wide range of mathematical and statistical operations on arrays, which are not available with basic Python lists.

Multidimensional Support: NumPy arrays support multi-dimensional data natively; lists require nesting and are harder to manipulate.
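
A short sketch of two of these differences, using made-up values: the same `*` operator repeats a list but multiplies an array element by element, and mixed numeric input is coerced to a single dtype.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

doubled_list = values * 2   # list repetition: [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2       # element-wise arithmetic: array([2, 4, 6])

# A list keeps mixed types; an array upcasts everything to one dtype.
mixed = np.array([1, 2.5, 3])
print(doubled_list, doubled_arr, mixed.dtype)
```
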
7. What is a heatmap, and when should it be used?
A. A **heatmap** is a data visualization tool that uses color to represent the magnitude of values in a matrix or 2D dataset. Each cell's color intensity reflects the corresponding data value, making it easy to identify patterns, correlations, and outliers.

**When to use a heatmap:**

* To visualize correlations between variables (e.g., a correlation matrix).
* To detect trends or clusters in large datasets.
* To compare values across two categorical variables (like time vs. activity).

Heatmaps are especially useful in exploratory data analysis and statistical modeling.
8.  What does the term “vectorized operation” mean in NumPy?
A. A vectorized operation in NumPy means applying a mathematical or logical operation to an entire array or set of arrays at once, rather than processing elements one by one using loops. These operations are optimized and executed at a low level, making them significantly faster and more efficient than traditional Python loops.

Vectorized operations are used for tasks like element-wise addition, multiplication, or comparisons across arrays. They make code more concise, readable, and better performing, especially when working with large datasets in scientific or data-intensive applications.
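
To make the contrast concrete, here is a small sketch (with illustrative arrays named `prices` and `quantities`) that computes the same result with a Python loop and with a vectorized expression:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])
quantities = np.array([1, 2, 3, 4])

# Loop version: one Python-level multiplication per element.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Vectorized version: a single expression over the whole arrays.
totals_vec = prices * quantities

# Comparisons are vectorized too and yield a boolean array.
expensive = totals_vec > 50
print(totals_vec, expensive)
```
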
9. How does Matplotlib differ from Plotly?
A. Matplotlib and Plotly are both powerful Python libraries for data visualization, but they differ in several key ways:

Interactivity:

Matplotlib creates static plots by default (e.g., PNG, PDF).

Plotly generates interactive, web-based plots with features like zooming and tooltips.

Ease of Use:

Matplotlib offers fine-grained control but requires more code for customization.

Plotly has a higher-level API that's more intuitive for interactive visuals.

Output:

Matplotlib is ideal for static reports and publications.

Plotly excels in dashboards and web apps.
10. What is the significance of hierarchical indexing in Pandas?
A. Hierarchical indexing (MultiIndex) in Pandas enables handling multi-dimensional data within 1D or 2D structures by using multiple index levels. It organizes complex datasets naturally, such as grouping by categories like country and city. This allows easier, more intuitive data selection, slicing, and aggregation at different hierarchy levels without reshaping data. MultiIndex supports efficient pivoting and reshaping (via stack()/unstack()), facilitating flexible data transformations. It also improves memory efficiency by compactly storing repeated labels. Overall, hierarchical indexing enhances data clarity, manipulation, and analysis, making it essential for working with complex, structured datasets in Pandas.
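
The country and city example mentioned above might be sketched as follows. The numbers are placeholder values, not real data, and the name `sales` is an assumption for illustration:

```python
import pandas as pd

# Two index levels: country (outer) and city (inner).
index = pd.MultiIndex.from_tuples(
    [("India", "Delhi"), ("India", "Mumbai"),
     ("Japan", "Osaka"), ("Japan", "Tokyo")],
    names=["country", "city"],
)
sales = pd.Series([10, 20, 30, 40], index=index, name="sales")

india = sales.loc["India"]                        # select by outer level
by_country = sales.groupby(level="country").sum() # aggregate at one level
wide = sales.unstack("city")                      # inner level becomes columns

print(india)
print(by_country)
print(wide)
```

Cities that do not belong to a country appear as NaN in the unstacked table.
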
11. What is the role of Seaborn’s pairplot() function?
A. Seaborn’s `pairplot()` function is used to visualize relationships between multiple variables in a dataset by creating a matrix of scatter plots and histograms (or KDE plots). Each variable is plotted against every other variable, showing pairwise relationships, while the diagonal displays the distribution of each variable. It helps quickly explore data patterns, correlations, and potential clusters. `pairplot()` can also color points by a categorical variable, aiding in visual classification. Overall, it’s a powerful, easy-to-use tool for exploratory data analysis, providing a comprehensive overview of variable interactions in a single visualization.
12. What is the purpose of the describe() function in Pandas?
A. The describe() function in Pandas provides a quick summary of the statistical properties of a DataFrame or Series. It returns key descriptive statistics such as count, mean, standard deviation, minimum, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and maximum values for numerical columns. For categorical data, it can show count, unique values, top (most frequent) value, and frequency. This function helps you quickly understand the distribution, central tendency, and spread of your data, making it essential for initial exploratory data analysis.
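
As an illustration with a small made-up DataFrame, `describe()` summarizes numeric columns by default, and calling it on a text column returns the count, unique, top, and freq entries:

```python
import pandas as pd

df = pd.DataFrame({
    "score": [50, 60, 70, 80, 90],
    "grade": ["C", "B", "B", "B", "A"],
})

# Numeric columns only by default: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# A non-numeric Series gives count, unique, top, freq.
categorical_summary = df["grade"].describe()

print(numeric_summary)
print(categorical_summary)
```
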
13. Why is handling missing data important in Pandas?
A. Handling missing data in Pandas is crucial because missing values can lead to inaccurate analysis, biased results, or errors in calculations. Many statistical methods and machine learning algorithms require complete data to function properly. Ignoring or improperly dealing with missing data can distort insights, reduce model performance, and cause runtime errors. Pandas provides tools to detect, remove, or impute missing values, ensuring data quality and reliability. Proper handling helps maintain dataset integrity, enables accurate analysis, and supports better decision-making based on clean, consistent data.
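
The detect, remove, and impute steps could be sketched like this. The DataFrame is invented for the example, and filling with the column mean is just one possible imputation choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.0, 24.0],
    "city": ["A", "B", None, "D"],
})

missing_counts = df.isna().sum()   # detect: missing values per column
dropped = df.dropna()              # remove: keep only complete rows
filled = df.fillna({               # impute: one fill value per column
    "temp": df["temp"].mean(),
    "city": "unknown",
})

print(missing_counts)
print(filled)
```
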
14. What are the benefits of using Plotly for data visualization?
A. Plotly offers several benefits for data visualization:

1. **Interactive Graphics**: Plots are highly interactive, allowing zooming, panning, hovering, and clicking for detailed data exploration.
2. **Wide Range of Chart Types**: Supports diverse visualizations including scatter plots, line charts, 3D plots, maps, and more.
3. **Easy Integration**: Works well with Python, R, and JavaScript, and integrates smoothly with Jupyter notebooks and web apps.
4. **Customizability**: Highly customizable visuals with control over colors, layouts, annotations, and animations.
5. **Web-Ready**: Visualizations are rendered as HTML, easily shareable or embedded in websites.
6. **Open Source with Enterprise Options**: Free to use with options for enterprise-level features.

Overall, Plotly enables rich, dynamic, and shareable visual storytelling.
15. How does NumPy handle multidimensional arrays?
A. NumPy handles multidimensional arrays using its **ndarray** (n-dimensional array) object, which can represent arrays with any number of dimensions (1D, 2D, 3D, and beyond). Each dimension is called an axis. NumPy arrays are fixed-size, homogeneous (all elements share the same data type), and stored in contiguous memory for fast computation.

You can create, index, slice, and manipulate these arrays efficiently. Operations like broadcasting allow arithmetic between arrays of different shapes. NumPy’s multidimensional arrays form the foundation for scientific computing in Python, enabling complex mathematical and linear algebra operations across multiple dimensions.
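
A brief sketch of a three-dimensional array, using an illustrative array named `cube`, shows how axes, indexing, and axis-wise reductions fit together:

```python
import numpy as np

# 24 consecutive integers arranged as 2 blocks x 3 rows x 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape, cube.dtype)

first_block = cube[0]            # index along axis 0 -> shape (3, 4)
first_column = cube[:, :, 0]     # slice along axis 2 -> shape (2, 3)
sum_over_blocks = cube.sum(axis=0)  # reduce axis 0 -> shape (3, 4)
```
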
16. What is the role of Bokeh in data visualization?
A. Bokeh is a Python library designed for creating interactive, web-ready data visualizations. Its role is to enable users to build rich, dynamic plots that can be easily embedded in web browsers or dashboards. Bokeh supports features like zooming, panning, tooltips, and real-time streaming, making data exploration intuitive. It handles large or streaming datasets efficiently and integrates well with other Python tools. Overall, Bokeh helps turn data into interactive, shareable visual stories suitable for web applications and detailed data analysis.
17.  Explain the difference between apply() and map() in Pandas.
A. The difference between `apply()` and `map()` in Pandas is as follows:

* **`map()`** is mainly used for **element-wise transformations** on a Series. It applies a function, dictionary, or Series to each element individually. It’s great for replacing or mapping values in a Series.

* **`apply()`** is more flexible and can be used on both **Series and DataFrames**. It lets you apply a function along an axis (rows or columns) of a DataFrame or to each element of a Series. It supports more complex operations and aggregation.

In short:

* Use **`map()`** for simple element-wise mapping on Series.
* Use **`apply()`** for more complex row/column-wise or element-wise operations on Series or DataFrames.
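
The distinction can be illustrated with a small hypothetical DataFrame: `map()` translates each value of one Series, while `apply()` runs a function per row or per column.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["a", "b", "a"],
    "math": [80, 65, 90],
    "art": [70, 85, 65],
})

# map(): element-wise lookup on a single Series using a dictionary.
df["grade_label"] = df["grade"].map({"a": "pass", "b": "retake"})

# apply() with axis=1: the function receives one row at a time.
df["total"] = df[["math", "art"]].apply(
    lambda row: row["math"] + row["art"], axis=1
)

# apply() with the default axis=0: the function receives one column at a time.
column_max = df[["math", "art"]].apply(lambda col: col.max())

print(df)
print(column_max)
```
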
18. What are some advanced features of NumPy?
A. Some advanced features of NumPy include:

1. **Broadcasting** — Enables arithmetic operations between arrays of different shapes without explicit looping.
2. **Fancy Indexing and Boolean Indexing** — Allows selection and manipulation of array elements using arrays of indices or boolean masks.
3. **Vectorization** — Performs batch operations on arrays without Python loops, making computations faster.
4. **Structured Arrays** — Support for arrays with heterogeneous data types, similar to database tables.
5. **Memory Mapping** — Handles large datasets by mapping files on disk directly to memory.
6. **Linear Algebra and FFT** — Built-in functions for advanced math like matrix decompositions, eigenvalues, and Fourier transforms.
7. **Masked Arrays** — Handle arrays with missing or invalid entries gracefully.

These features make NumPy powerful for scientific and numerical computing.
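
A few of the features listed above can be sketched in a handful of lines. All names and values here are illustrative:

```python
import numpy as np

data = np.array([5, 12, 7, 20, 3])

large = data[data > 6]    # boolean indexing: keep elements matching a mask
picked = data[[0, 3]]     # fancy indexing: select by a list of positions

# Structured array: each record has a text field and an integer field.
people = np.array(
    [("Ana", 30), ("Ben", 25)],
    dtype=[("name", "U10"), ("age", "i4")],
)
ages = people["age"]

# Masked array: the masked entry is ignored by reductions such as mean().
readings = np.ma.masked_array([1.0, -999.0, 3.0], mask=[False, True, False])
mean_valid = readings.mean()

print(large, picked, ages, mean_valid)
```
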
19.  How does Pandas simplify time series analysis?
A. Pandas simplifies time series analysis by providing specialized tools to handle date and time data efficiently. It offers:

* **DatetimeIndex** and **PeriodIndex** for easy indexing and slicing by time periods.
* Built-in support for date parsing, frequency conversion, and resampling (e.g., converting daily data to monthly).
* Powerful **rolling** and **expanding** window functions for moving averages and other statistics.
* Convenient handling of missing time points and time zone-aware timestamps.
* Easy alignment and merging of time series data with different frequencies.

Together, these features streamline cleaning, manipulating, and analyzing time-stamped data in a flexible, intuitive way.
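
As a rough sketch of date-based slicing, resampling, and rolling windows, the following uses a made-up six-day series named `ts`:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Label-based slicing with date strings includes both endpoints.
first_days = ts.loc["2024-01-01":"2024-01-03"]

# Resample daily values into 2-day bins and sum each bin.
two_day_totals = ts.resample("2D").sum()

# 3-day moving average; the first two entries are NaN (window not yet full).
rolling_mean = ts.rolling(window=3).mean()

print(two_day_totals)
print(rolling_mean)
```
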
20. What is the role of a pivot table in Pandas?
A. A pivot table in Pandas summarizes and reshapes data by aggregating values based on one or more categorical variables. It allows you to transform a long-form dataset into a wide-format table, making it easier to analyze relationships between variables. Using `pivot_table()`, you can specify rows, columns, values, and aggregation functions (like sum, mean, count). This helps quickly generate insights, compare groups, and perform multi-dimensional summarizations without complex coding. Essentially, pivot tables simplify data exploration and reporting by turning raw data into organized, easy-to-interpret summaries.
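
A minimal example of the long-to-wide summarization described above, assuming a hypothetical `orders` table:

```python
import pandas as pd

# Long-form data: one row per order line.
orders = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["pen", "ink", "pen", "ink", "pen"],
    "units": [10, 5, 7, 3, 4],
})

# Rows = region, columns = product, cells = total units.
table = pd.pivot_table(
    orders, index="region", columns="product", values="units", aggfunc="sum"
)
print(table)
```
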
21. Why is NumPy’s array slicing faster than Python’s list slicing?
A. NumPy’s array slicing is faster than Python’s list slicing because:

1. **Contiguous Memory Storage**: A NumPy array stores its elements in a single typed memory buffer (contiguous by default, like a C array), so a slice can be described by an offset and strides without copying data.

2. **Homogeneous Data Types**: All elements in a NumPy array share the same data type, allowing optimized, low-level operations directly on the data buffer.

3. **View vs Copy**: Basic NumPy slicing returns a **view** (a new array object that references the original buffer), avoiding data duplication. Python list slicing creates a new list (a shallow copy of the references), which takes time and memory proportional to the slice length. Note that indexing with index arrays or boolean masks returns a copy, not a view.

These factors combined make NumPy slicing significantly faster and more memory-efficient than Python list slicing.
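
The view-versus-copy behaviour can be checked directly. In this sketch (names are illustrative), writing through a list slice leaves the original list alone, while writing through an array slice changes the original array:

```python
import numpy as np

numbers = list(range(5))
list_slice = numbers[1:4]   # new list (copy)
list_slice[0] = 99          # original list is unchanged

arr = np.arange(5)
view = arr[1:4]             # view onto the same buffer
view[0] = 99                # also changes arr[1]

shares = np.shares_memory(arr, view)
independent = arr[1:4].copy()   # explicit copy when isolation is needed

print(numbers, arr, shares)
```
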
22. What are some common use cases for Seaborn?
A. Some common use cases for Seaborn include:

1. **Exploratory Data Analysis (EDA)** — Quickly visualize distributions, relationships, and patterns in data.
2. **Statistical Visualization** — Plot regression lines, confidence intervals, and statistical summaries easily.
3. **Categorical Data Visualization** — Create bar plots, box plots, violin plots, and swarm plots to compare groups.
4. **Correlation Analysis** — Use heatmaps and pair plots to explore correlations between variables.
5. **Time Series Visualization** — Plot trends and seasonal patterns with line plots and area plots.
6. **Multivariate Analysis** — Visualize interactions between multiple variables using pairplots and facet grids.

Seaborn simplifies making attractive, informative statistical graphics with minimal code.





**Practical Questions**

1. How do you create a 2D NumPy array and calculate the sum of each row?
A. import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

row_sums = arr.sum(axis=1)
print(row_sums)
2. Write a Pandas script to find the mean of a specific column in a DataFrame.
A. import pandas as pd

# Example DataFrame
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [5, 15, 25, 35]
})

mean_value = df['A'].mean()
print(mean_value)
3. Create a scatter plot using Matplotlib.
A. import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 8]

plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')
plt.show()
4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
A. import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Example DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
})

# Calculate the correlation matrix (pandas computes it; Seaborn only draws it)
corr = df.corr()

# Visualize with heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()
5. Generate a bar plot using Plotly.
A. import plotly.express as px

# Sample data
data = {'Fruits': ['Apples', 'Oranges', 'Bananas'],
        'Quantity': [10, 15, 7]}

fig = px.bar(data, x='Fruits', y='Quantity', title='Fruit Quantity')
fig.show()
6. Create a DataFrame and add a new column based on an existing column.
A. import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4]
})

# Add new column 'B' as double of column 'A'
df['B'] = df['A'] * 2

print(df)
7. Write a program to perform element-wise multiplication of two NumPy arrays.
A. import numpy as np

# Define two arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Element-wise multiplication
result = arr1 * arr2

print(result)
8. Create a line plot with multiple lines using Matplotlib.
A. import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]

plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')

plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Multiple Lines Plot')
plt.legend()
plt.show()
9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.
A. import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40]
})

# Filter rows where Age > 30
filtered_df = df[df['Age'] > 30]

print(filtered_df)
10. Create a histogram using Seaborn to visualize a distribution.
A. import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [12, 15, 14, 10, 8, 12, 15, 17, 20, 18, 15, 14, 10]

# Create histogram with a KDE curve overlaid
sns.histplot(data, bins=5, kde=True)

plt.title('Histogram with KDE')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
11. Perform matrix multiplication using NumPy.
A. import numpy as np

# Define two matrices
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication (for 2D arrays, np.dot(A, B) is equivalent to A @ B)
result = np.dot(A, B)

print(result)
12. Use Pandas to load a CSV file and display its first 5 rows.
A. import pandas as pd

# Load CSV file ('your_file.csv' is a placeholder path)
df = pd.read_csv('your_file.csv')

# Display first 5 rows
print(df.head())
13. Create a 3D scatter plot using Plotly.
A. import plotly.express as px
import pandas as pd

# Sample data
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [5, 6, 7, 8, 9],
    'z': [9, 8, 7, 6, 5],
    'category': ['A', 'B', 'A', 'B', 'A']
})

fig = px.scatter_3d(df, x='x', y='y', z='z', color='category', title='3D Scatter Plot')
fig.show()
