1. What is NumPy, and why is it widely used in Python?
-> NumPy is a Python library used for numerical computing. It provides support for multi-dimensional arrays and matrices, along with mathematical functions to operate on them. It's widely used because:

1. **Efficient Arrays**: Provides fast, flexible arrays (`ndarray`) for numerical data.
2. **Performance**: Operations are faster than regular Python lists due to C-based implementation.
3. **Mathematical Functions**: Offers functions for linear algebra, statistics, and more.
4. **Integration**: Works well with other libraries like Pandas, SciPy, and Matplotlib.
5. **Memory Efficiency**: More memory-efficient than Python lists.
6. **Multidimensional Arrays**: Handles multi-dimensional arrays for complex data.
7. **Data Science**: Essential for data science, machine learning, and scientific computing.

2. How does broadcasting work in NumPy?
-> Broadcasting in NumPy is a feature that allows arrays of different shapes to be used in arithmetic operations. It automatically expands smaller arrays to match the shape of larger arrays without the need for explicit replication of data. This makes operations more memory-efficient and faster.

- Key Concepts of Broadcasting:

- Shape Compatibility: For broadcasting to occur, the dimensions of the arrays must be compatible. This means:

- If the arrays have different dimensions, NumPy will pad the smaller array's shape with 1s on the left.

- In each dimension, the two sizes must either be equal, or one of them must be 1.

- Rules of Broadcasting:

- Align dimensions: NumPy aligns the dimensions of both arrays from the right.

- Size compatibility: If the sizes of the dimensions do not match, one of the arrays must have a size of 1 in that dimension to allow broadcasting.
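
The rules above can be sketched with a small example. The array names (`matrix`, `row`, `col`) are made up for illustration; the sketch shows two compatible shape pairs and one incompatible pair.

```python
import numpy as np

# A (3, 4) matrix and a (4,) vector: the vector's shape is treated as (1, 4)
matrix = np.arange(12).reshape(3, 4)
row = np.array([10, 20, 30, 40])
summed = matrix + row            # result has shape (3, 4)

# A (3, 1) column is stretched across the 4 columns of the matrix
col = np.array([[100], [200], [300]])
shifted = matrix + col           # result has shape (3, 4)

# Incompatible: trailing dimensions are 4 and 3, and neither is 1
try:
    matrix + np.array([1, 2, 3])
    compatible = True
except ValueError:
    compatible = False

print(summed.shape, shifted.shape, compatible)
```

No data is physically replicated here; NumPy only treats the size-1 dimensions as if they were repeated.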

3. What is a Pandas DataFrame?
-> A Pandas DataFrame is a two-dimensional, labeled data structure in Python. It stores data in rows and columns, similar to a table.

- Key Features:

 - Tabular structure: Rows and columns with labels.

 - Heterogeneous: Can contain different data types in different columns.

 - Indexing: Both rows and columns are indexed.

 - Mutable: You can add or remove rows/columns.
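
A minimal sketch of these features, using a hypothetical `df_demo` table with invented names and ages:

```python
import pandas as pd

# Columns of different types, with explicit row labels
df_demo = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy"], "age": [31, 25, 40]},
    index=["r1", "r2", "r3"],
)

# Mutable: add a column derived from an existing one, then remove a row
df_demo["senior"] = df_demo["age"] > 30
df_demo = df_demo.drop(index="r2")

# Label-based access by row label and column label
age_r3 = df_demo.loc["r3", "age"]

print(df_demo.dtypes)
print(age_r3)
```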

4. Explain the use of the groupby() method in Pandas.

-> The groupby() method in Pandas is used to:

- Split data into groups based on values in one or more columns.

- Apply a function (like sum(), mean(), etc.) to each group.

- Combine the results into a new Series or DataFrame.
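
The split-apply-combine steps can be sketched as follows. The `sales` data and its column names are assumptions made up for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100, 200, 150, 50, 250],
})

# Split by region, apply sum() to each group, combine into one Series
totals = sales.groupby("region")["amount"].sum()

# Several aggregations at once give a DataFrame with one column per function
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])

print(totals)
print(summary)
```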

5. Why is Seaborn preferred for statistical visualizations?

-> Seaborn is preferred for statistical visualizations because:

- It has built-in support for complex plots (e.g., boxplots, violin plots, pair plots).

- It automatically handles statistical aggregation and visual themes.

- It integrates well with Pandas DataFrames.

- It makes beautiful, informative charts with minimal code.

6. What are the differences between NumPy arrays and Python lists?

-> Data Type:
 - NumPy arrays: Only one data type (homogeneous).
 - Python lists: Can contain different data types (heterogeneous).

-> Speed:
 - NumPy arrays: Faster for numerical operations.
 - Python lists: Slower in comparison.

-> Memory Efficiency:
 - NumPy arrays: Use less memory.
 - Python lists: Use more memory.

-> Functionality:
 - NumPy arrays: Support vectorized operations (e.g., array * 2).
 - Python lists: Need loops for such operations.

-> Multidimensional Support:
 - NumPy arrays: Support multi-dimensional arrays easily.
 - Python lists: Require nested lists, which are harder to manage.

-> Built-in Functions:
 - NumPy arrays: Have many built-in functions for stats, algebra, etc.
 - Python lists: Limited built-in functionality.
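
A short sketch of two of these differences (data type and vectorized functionality). The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# Lists need a loop or comprehension to double each element
doubled_list = [v * 2 for v in values]

# Arrays apply the operation to every element at once
doubled_arr = arr * 2

# Note: multiplying a list by 2 repeats it instead of doubling its values
repeated = values * 2

# Arrays are homogeneous: mixing ints and floats upcasts everything to float
mixed = np.array([1, 2.5, 3])

print(doubled_arr, repeated, mixed.dtype)
```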

7. What is a heatmap, and when should it be used?

-> A heatmap is a data visualization that uses color to show the magnitude of values in a matrix.

-> When to use:

 - To visualize correlations between variables.
 - To spot patterns, trends, or outliers in a dataset.
 - To display confusion matrices or pivot tables clearly.

8. What does the term “vectorized operation” mean in NumPy?

-> A vectorized operation in NumPy means performing operations on entire arrays without using loops.
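
As an illustration, here is the same temperature conversion written with an explicit loop and as a vectorized expression. The conversion task is just an assumed example; both versions should produce the same values.

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])

# Loop version: one element at a time in Python
fahrenheit_loop = np.empty_like(celsius)
for i in range(len(celsius)):
    fahrenheit_loop[i] = celsius[i] * 9 / 5 + 32

# Vectorized version: one expression over the whole array
fahrenheit_vec = celsius * 9 / 5 + 32

print(fahrenheit_vec)
```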

9. How does Matplotlib differ from Plotly?

-> Interactivity:
 - Matplotlib: Static plots (basic interactivity with add-ons).
 - Plotly: Highly interactive plots (zoom, hover, tooltips).

-> Ease of Use:
 - Matplotlib: More code-heavy; flexible but manual.
 - Plotly: Easier for interactive and polished visuals.

-> Output Format:
 - Matplotlib: Best for static images (PNG, PDF).
 - Plotly: Best for web-based, interactive HTML plots.

-> Customization:
 - Matplotlib: Highly customizable with more control.
 - Plotly: Limited low-level control but great visuals.

-> Use Case:
 - Matplotlib: Good for academic/static reports.
 - Plotly: Great for dashboards and web apps.

10. What is the significance of hierarchical indexing in Pandas?

-> Hierarchical indexing (or MultiIndex) in Pandas allows you to have multiple levels of index on rows or columns.

-> Significance:

 - Organizes data in a nested structure.

 - Makes it easier to analyze multi-dimensional data.

 - Enables complex grouping and slicing operations.

 - Helps in reshaping data (e.g., with stack() and unstack()).
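
A minimal sketch of a two-level row index, assuming made-up yearly and quarterly revenue figures:

```python
import pandas as pd

# Two index levels: year (outer) and quarter (inner)
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

# Selecting on the outer level returns a Series indexed by quarter
year_2024 = revenue.loc["2024"]

# unstack() moves the quarter level into columns (years x quarters)
wide = revenue.unstack("quarter")

print(year_2024)
print(wide)
```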

11. What is the role of Seaborn’s pairplot() function?

-> Seaborn’s pairplot() function is used to:

 - Visualize relationships between all pairs of features in a dataset.

 - Show scatter plots, histograms, and KDEs for numerical variables.

 - Help detect patterns, correlations, or outliers.

12. What is the purpose of the describe() function in Pandas?

-> The describe() function in Pandas is used to:

 - Generate summary statistics for numerical columns.

 - Show metrics like count, mean, std, min, max, and quartiles.
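
For example, on a small hypothetical `scores` table:

```python
import pandas as pd

scores = pd.DataFrame({"score": [50, 60, 70, 80, 90]})

# One row per statistic, one column per numerical column
stats = scores.describe()

print(stats)
```

The rows of `stats` are labeled `count`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, and `max`, so a single value can be read with `stats.loc["mean", "score"]`.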

13. Why is handling missing data important in Pandas?

-> Handling missing data in Pandas is important because:

 - Accuracy: Prevents biased or incorrect analysis.

 - Model Requirements: Many algorithms need complete data.

 - Error Prevention: Avoids calculation errors.

 - Efficiency: Ensures faster, more reliable processing.

 - Consistency: Ensures reproducibility of results.
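
A short sketch of three common ways to inspect and handle missing values. The `readings` table is invented for illustration, and filling with the column mean is only one possible strategy.

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0],
    "city": ["A", "B", None],
})

# Count the missing values in each column
missing_counts = readings.isna().sum()

# Option 1: drop every row that contains a missing value
dropped = readings.dropna()

# Option 2: fill missing temperatures with the column mean
filled = readings["temp"].fillna(readings["temp"].mean())

print(missing_counts)
print(filled)
```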

14. What are the benefits of using Plotly for data visualization?

-> The benefits of using Plotly for data visualization include:

 - Interactive Plots: Allows zooming, panning, and hover functionality.

 - Beautiful Visuals: High-quality, visually appealing charts.

 - Wide Range of Plots: Supports a variety of chart types (e.g., 3D, maps, heatmaps).

 - Integration: Easily integrates with tools like Jupyter Notebooks and Dash.

 - Customization: Highly customizable for personalized styling and layout.

 - Web-Based: Interactive plots can be shared online or embedded in websites.

15. How does NumPy handle multidimensional arrays?

-> NumPy handles multidimensional arrays through its ndarray object, which can represent arrays of any number of dimensions. Key features include:

-> Shape: The shape of an array is a tuple that defines the size of the array along each dimension.

-> Indexing: You can index and slice arrays in multiple dimensions using commas (e.g., array[i, j] for 2D arrays).

-> Broadcasting: NumPy automatically handles operations between arrays of different shapes by "broadcasting" smaller arrays across larger ones.

-> Efficient Storage: A newly created array is stored in a single contiguous block of memory, which optimizes speed and memory usage (views such as slices may be non-contiguous).

-> Vectorized Operations: Allows fast element-wise operations across entire arrays without explicit loops.
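
These features can be seen on a small 3D array. The name `cube` and its shape are arbitrary choices for this sketch.

```python
import numpy as np

# 24 values arranged as 2 blocks x 3 rows x 4 columns
cube = np.arange(24).reshape(2, 3, 4)

# Shape and number of dimensions
shape, ndim = cube.shape, cube.ndim

# Indexing with one index per dimension
element = cube[1, 2, 3]

# Slicing across dimensions: first column of every row in every block
first_cols = cube[:, :, 0]

# Reducing along an axis: sum the two blocks element-wise
block_sums = cube.sum(axis=0)

print(shape, ndim, element, first_cols.shape, block_sums.shape)
```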

16. What is the role of Bokeh in data visualization?

-> Bokeh is used in data visualization for:

-> Interactive Visuals: It allows the creation of highly interactive, web-based visualizations (e.g., zooming, panning, and hover tools).

-> Web Integration: Bokeh visualizations can be easily embedded in web applications, Jupyter notebooks, or dashboards.

-> Customizability: Offers extensive customization options for charts, layouts, and tools.

-> Real-Time Data: Supports streaming and updating data in real-time.

-> Wide Range of Plots: Includes line, bar, scatter, heatmaps, and more advanced charts (e.g., geographic maps, network graphs).

17. Explain the difference between apply() and map() in Pandas.

-> In Pandas, both apply() and map() are used for applying functions to data, but they differ in their usage and functionality:

 - apply():

 -> Works on both Series and DataFrame objects.
 -> Can apply a function along an axis (rows or columns) in a DataFrame.
 -> Allows more complex operations, such as applying a function to multiple columns at once.

 -> Syntax: df.apply(func, axis=0) for DataFrames (axis=0 for columns, axis=1 for rows).

 - map():

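 -> Series.map() works element-wise on a Series (DataFrame.map(), added in pandas 2.1, is the element-wise counterpart for DataFrames).

 -> Primarily used to map or transform values in a Series with a dictionary, a Series, or a function.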

 -> More limited in functionality compared to apply() (usually for element-wise operations).
 -> Syntax: series.map(func).
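
A side-by-side sketch of both methods on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# apply() along axis=0: the function receives each column as a Series
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# apply() along axis=1: the function receives each row, so it can use several columns
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# map() with a dictionary: values without a matching key become NaN
labels = df["a"].map({1: "one", 2: "two"})

# map() with a function: applied to each element of the Series
squared = df["a"].map(lambda v: v ** 2)

print(col_ranges, row_sums, labels, squared, sep="\n")
```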

18. What are some advanced features of NumPy?

-> Some advanced features of NumPy include:

 - Broadcasting: Enables operations on arrays of different shapes by automatically adjusting dimensions to make them compatible for element-wise operations.

 - Vectorization: Allows for the efficient application of operations across entire arrays without the need for explicit loops, leading to faster execution.

 - Advanced Indexing:
   - Supports fancy indexing, where you can index with arrays or lists, not just single integers.
   - Provides boolean indexing to filter data based on conditions.

 - Linear Algebra: NumPy has a suite of functions for linear algebra operations like matrix multiplication (dot), eigenvalues, determinants, and singular value decomposition (SVD).

 - Random Module: Provides tools for generating random numbers and sampling from various probability distributions.

 - Memory Layout Control: Offers fine-grained control over memory (e.g., using C vs. Fortran-order arrays).

 - Strides: Allows for advanced manipulation of array memory and control over how data is accessed in memory.

 - Masked Arrays: Supports arrays where some elements are masked (ignored), allowing for handling missing or invalid data efficiently.
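
A few of these features in one short sketch. The data values and the seed are arbitrary; the random sample is drawn with the `default_rng` generator interface.

```python
import numpy as np

data = np.array([5, -1, 8, -3, 2])

# Fancy indexing: select positions 0, 2 and 4 with a list of indices
picked = data[[0, 2, 4]]

# Boolean indexing: keep only the positive values
positives = data[data > 0]

# Masked array: negative values are ignored by the mean
masked = np.ma.masked_less(data, 0)
masked_mean = masked.mean()

# Linear algebra: determinant of a 2x2 diagonal matrix
a = np.array([[2.0, 0.0], [0.0, 3.0]])
det = np.linalg.det(a)

# Random module: three draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=3)

print(picked, positives, masked_mean, det, sample.shape)
```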

19. How does Pandas simplify time series analysis?

-> Pandas simplifies time series analysis with features like:

 - Datetime Indexing: You can easily convert date strings into DatetimeIndex, enabling time-based indexing and easy access to data based on dates.

 - Resampling: Pandas allows you to resample data at different frequencies (e.g., daily, monthly, yearly) using .resample(), which is useful for aggregating or downsampling time series.

 - Shifting and Lagging: With .shift(), you can easily create lagged variables or calculate changes over time (e.g., differences between current and previous time periods).

 - Rolling Windows: Provides methods like .rolling() to calculate moving averages, sums, and other statistics over a rolling window.

 - Time Zone Handling: Pandas supports time zone conversion and localization, making it easier to work with data from different time zones.

 - Date Range Generation: The pd.date_range() function allows you to generate sequences of dates, making it easy to create time series data for modeling.

 - Easy Plotting: Pandas integrates with libraries like Matplotlib, allowing you to quickly plot time series data for visual analysis.
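
Several of these features can be sketched on a six-day series. The dates and values are made up for this example.

```python
import pandas as pd

# Date range generation and datetime indexing
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Time-based access by date
jan_3 = ts.loc[pd.Timestamp("2024-01-03")]

# Resampling: aggregate daily values into two-day sums
two_day_sum = ts.resample("2D").sum()

# Shifting: change relative to the previous day (first value is NaN)
change = ts - ts.shift(1)

# Rolling window: three-day moving average (first two values are NaN)
rolling_mean = ts.rolling(window=3).mean()

print(two_day_sum)
print(rolling_mean)
```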

20. What is the role of a pivot table in Pandas?

-> In Pandas, a pivot table is used to summarize, aggregate, and reshape data. It helps in:

 - Aggregating Data: Pivot tables allow you to group and aggregate data based on certain categorical variables (e.g., sum, mean, count).

 - Reshaping Data: You can transform long-format data into a more readable wide format, where the unique values of one column become column headers, making it easier to compare values.

 - Multi-dimensional Analysis: By using multiple columns as indices and aggregating over others, pivot tables allow for multi-dimensional analysis of data.

 - Data Summarization: It helps to condense large datasets into smaller, more manageable summaries, often with statistics like averages, sums, or counts.
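
A minimal sketch using `pd.pivot_table()` on a hypothetical `orders` table:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["pen", "ink", "pen", "ink", "pen"],
    "amount": [10, 20, 30, 40, 50],
})

# Rows = regions, columns = products, cells = summed amounts
table = pd.pivot_table(
    orders,
    values="amount",
    index="region",
    columns="product",
    aggfunc="sum",
)

print(table)
```

The two East/pen orders (10 and 50) are combined into a single cell, which shows the aggregation and the long-to-wide reshaping in one step.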

21. Why is NumPy’s array slicing faster than Python’s list slicing?

-> NumPy's array slicing is faster than Python's list slicing because:

 - Contiguous Memory: NumPy arrays are stored in a single block, making slicing faster.

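 - No Copying: Basic slicing creates a view of the original data, not a copy, saving time and memory. A list slice, by contrast, builds a new list and copies a reference for every element in the range.

 - C Implementation: NumPy's slicing runs in compiled C and only adjusts the array's offset, shape, and strides, while list slicing must still copy one reference per element.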

 - Vectorization: NumPy handles operations on slices more efficiently.
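
The view-versus-copy difference can be checked directly. This sketch uses `np.shares_memory()` to test whether two arrays overlap in memory; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)
view = arr[2:5]          # basic slice: a view onto the same memory
view[0] = 99             # writing through the view changes the original

shares = np.shares_memory(arr, view)

lst = list(range(10))
lst_slice = lst[2:5]     # list slice: a new list with copied references
lst_slice[0] = 99        # the original list is unaffected

# Fancy indexing, unlike basic slicing, returns a copy
fancy = arr[[2, 3, 4]]
fancy_shares = np.shares_memory(arr, fancy)

print(arr[2], lst[2], shares, fancy_shares)
```

Because a view shares memory with its source, call `.copy()` on the slice when an independent array is needed.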

22. What are some common use cases for Seaborn?

-> Statistical Plots: Creating plots like bar charts, box plots, violin plots, and pair plots to visualize statistical relationships.

-> Correlation Heatmaps: Displaying correlations between variables in a dataset with color-coded matrices.

-> Categorical Data Visualization: Visualizing categorical data distributions with plots like count plots, bar plots, and strip plots.

-> Regression Plots: Creating scatter plots with fitted regression lines using regplot() or lmplot().

-> Distribution Plots: Visualizing the distribution of data with histograms, KDE plots, or ECDFs (empirical cumulative distribution functions).

-> Faceted Grid: Creating grid layouts for plotting multiple subplots based on a categorical variable using FacetGrid.

Practical

1. How do you create a 2D NumPy array and calculate the sum of each row?

In [None]:
import numpy as np

# Create a 2D NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Calculate the sum of each row
row_sums = arr.sum(axis=1)

print(row_sums)

2. Write a Pandas script to find the mean of a specific column in a DataFrame

In [None]:
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calculate the mean of column 'A'
mean_value = df['A'].mean()

print(f"The mean of column 'A' is: {mean_value}")

3. Create a scatter plot using Matplotlib

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# Create a scatter plot
plt.scatter(x, y)

# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')

# Show the plot
plt.show()

4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [5, 4, 3, 2, 1],
        'C': [2, 3, 4, 5, 6],
        'D': [7, 8, 9, 10, 11]}

df = pd.DataFrame(data)

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Create a heatmap to visualize the correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')

# Show the plot
plt.title('Correlation Matrix Heatmap')
plt.show()

5. Generate a bar plot using Plotly

In [None]:
import plotly.graph_objects as go

# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 15, 7, 12, 9]

# Create a bar plot
fig = go.Figure(data=[go.Bar(x=categories, y=values)])

# Add title and labels
fig.update_layout(title='Bar Plot Example', xaxis_title='Category', yaxis_title='Value')

# Show the plot
fig.show()

6. Create a DataFrame and add a new column based on an existing column

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Add a new column 'B' which is double the values of column 'A'
df['B'] = df['A'] * 2

# Display the updated DataFrame
print(df)

7. Write a program to perform element-wise multiplication of two NumPy arrays

In [None]:
import numpy as np

# Create two NumPy arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])

# Perform element-wise multiplication
result = array1 * array2

# Display the result
print("Result of element-wise multiplication:", result)

8. Create a line plot with multiple lines using Matplotlib

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]  # Line 1
y2 = [25, 20, 15, 10, 5]  # Line 2

# Create a line plot with multiple lines
plt.plot(x, y1, label='y = x^2', color='blue')  # Line 1
plt.plot(x, y2, label='y = 30 - 5x', color='red')  # Line 2

# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Multiple Lines Plot')

# Show legend
plt.legend()

# Display the plot
plt.show()

9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'A': [5, 12, 8, 20, 3],
        'B': [15, 20, 25, 10, 30]}
df = pd.DataFrame(data)

# Define a threshold
threshold = 10

# Filter rows where values in column 'A' are greater than the threshold
filtered_df = df[df['A'] > threshold]

# Display the filtered DataFrame
print(filtered_df)

10. Create a histogram using Seaborn to visualize a distribution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]

# Create a histogram with Seaborn
sns.histplot(data, kde=True, color='blue', bins=5)

# Adding title and labels
plt.title('Histogram with Seaborn')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Show the plot
plt.show()

11. Perform matrix multiplication using NumPy

In [None]:
import numpy as np

# Define two matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Perform matrix multiplication
result = np.dot(matrix1, matrix2)

# Alternatively, you can use the @ operator
# result = matrix1 @ matrix2

# Display the result
print("Matrix multiplication result:")
print(result)

12. Use Pandas to load a CSV file and display its first 5 rows

In [None]:
import pandas as pd

# Load the CSV file into a DataFrame (replace 'file.csv' with the actual file path)
df = pd.read_csv('file.csv')

# Display the first 5 rows of the DataFrame
print(df.head())

13. Create a 3D scatter plot using Plotly

In [None]:
import plotly.express as px
import pandas as pd

# Sample data for the 3D scatter plot
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [5, 4, 3, 2, 1],
    'Z': [2, 3, 4, 5, 6],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Create a 3D scatter plot
fig = px.scatter_3d(df, x='X', y='Y', z='Z', title='3D Scatter Plot')

# Show the plot
fig.show()