# Data Toolkit Questions and Answers

1. What is NumPy, and why is it widely used in Python?<br>
- NumPy (Numerical Python) is a library used for numerical computing in Python. It provides powerful support for handling large arrays, performing mathematical operations efficiently, and enabling vectorized computations, making it faster than Python lists.



2. How does broadcasting work in NumPy?<br>
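- Broadcasting lets NumPy apply an operation to arrays of different shapes by virtually stretching the smaller array to match the larger one, without copying data. Shapes are compared starting from the trailing dimension, and two dimensions are compatible when they are equal or one of them is 1. This avoids explicit loops and improves efficiency.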
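
To make the rule concrete, here is a minimal sketch with made-up arrays (the names `matrix`, `row`, and `column` are illustrative only):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The 1-D row is stretched across both rows of the matrix.
shifted = matrix + row
print(shifted)

# The (2, 1) column is stretched across the three columns instead.
offset = matrix + column
print(offset)
```

Adding an array of shape `(2,)` to `matrix` would raise a `ValueError`, because the trailing dimensions (2 and 3) are neither equal nor 1.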

3. What is a Pandas DataFrame?<br>
- A Pandas DataFrame is a two-dimensional labeled data structure, similar to a table with rows and columns. It is widely used for data analysis and manipulation.



4. Explain the use of the groupby() method in Pandas.<br>
- The groupby() method splits a DataFrame into groups based on the values of one or more columns, applies a function such as sum() or mean() to each group, and combines the results.
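
As a small illustration, the sketch below groups an invented `sales` table by region; the column names and numbers are assumptions made up for this example:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "revenue": [100, 200, 150, 60],
})

# One aggregation per group: total revenue by region.
totals = sales.groupby("region")["revenue"].sum()
print(totals)

# Several aggregations at once with agg().
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)
```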

5. Why is Seaborn preferred for statistical visualizations?<br>
- Seaborn provides high-level visualization functions with beautiful themes, built-in statistical plotting capabilities, and smooth integration with Pandas.

6. What are the differences between NumPy arrays and Python lists?<br>
- NumPy arrays store elements of a single data type in a contiguous block of memory, whereas Python lists hold references to objects of any type, so arrays usually use less memory. Array operations run in optimized C code and are vectorized: arithmetic applies element-wise without an explicit loop, while lists need a loop or comprehension. NumPy also ships a large set of mathematical functions that operate on whole arrays. Arrays have a fixed size once created, whereas lists can grow and shrink, and for large numerical datasets arrays are typically much faster.
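
One easy way to see the difference is that the same operator behaves differently on the two types. The values below are arbitrary sample data:

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# On a list, * repeats the list; element-wise work needs a loop or comprehension.
repeated = values * 2
doubled_list = [v * 2 for v in values]

# On an array, * is element-wise arithmetic.
doubled_arr = arr * 2

print(repeated)      # [1, 2, 3, 4, 1, 2, 3, 4]
print(doubled_list)  # [2, 4, 6, 8]
print(doubled_arr)   # same values as doubled_list, as an ndarray
```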

7. What is a heatmap, and when should it be used?<br>
- A heatmap is a color-coded matrix representation of data used to visualize correlations, frequencies, or distributions in datasets.

8. What does the term “vectorized operation” mean in NumPy?<br>
- A vectorized operation performs computations on entire arrays without explicit loops, making it much faster than traditional loops.
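
The sketch below computes the same quantity twice, once with a Python-level loop and once vectorized; the input numbers are arbitrary:

```python
import math
import numpy as np

x = np.array([3.0, 5.0, 8.0])
y = np.array([4.0, 12.0, 15.0])

# Loop version: one Python-level iteration per element.
loop_result = [math.sqrt(a * a + b * b) for a, b in zip(x, y)]

# Vectorized version: each operation runs over the whole array in compiled code.
vector_result = np.sqrt(x**2 + y**2)

print(loop_result)
print(vector_result)  # values 5, 13, 17
```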

9. How does Matplotlib differ from Plotly?<br>
- Matplotlib: Used mainly for static visualizations, with fine-grained control over every element of a figure.<br>
- Plotly: Used for interactive visualizations with zooming, hover tooltips, and 3D plotting.


10. What is the significance of hierarchical indexing in Pandas?<br>
- Hierarchical indexing allows multi-level row and column indexing in Pandas, making it useful for handling multi-dimensional data.
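
A minimal sketch of a two-level index follows; the `year`/`quarter` labels and the numbers are invented for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index)

# Select everything under one outer label.
year_2024 = revenue.loc["2024"]

# Select a single value with a full (outer, inner) key.
q2_2023 = revenue.loc[("2023", "Q2")]

# Move the inner level into columns to get a year-by-quarter table.
wide = revenue.unstack("quarter")
print(wide)
```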



11. What is the role of Seaborn’s pairplot() function?<br>
- pairplot() draws a grid of scatter plots for every pair of numerical variables in a dataset, with each variable's distribution on the diagonal, which helps reveal relationships between variables.

12. What is the purpose of the describe() function in Pandas?<br>
- The describe() function provides summary statistics such as count, mean, standard deviation, min, max, and quartiles for the numerical columns in a DataFrame.
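
For example, on a tiny made-up DataFrame the result is itself a DataFrame indexed by statistic name:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [5, 15, 25]})

stats = df.describe()
print(stats)

# Individual statistics can be looked up by row label.
mean_a = stats.loc["mean", "A"]
median_a = stats.loc["50%", "A"]
```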

13. Why is handling missing data important in Pandas?<br>
- Missing data can bias results and reduce accuracy. Pandas provides methods such as isna() (detect missing values), dropna() (remove them), and fillna() (replace them with specified values).
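
A short sketch of these methods on invented data (the `temp` and `city` columns are assumptions for this example, and filling with the column mean is just one possible strategy):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.0],
    "city": ["Oslo", "Rome", None],
})

# Count missing values per column.
missing_counts = df.isna().sum()

# Drop every row that contains at least one missing value.
dropped = df.dropna()

# Fill each column with its own replacement value.
filled = df.fillna({"temp": df["temp"].mean(), "city": "unknown"})
print(filled)
```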



14. What are the benefits of using Plotly for data visualization?<br>
- Interactive charts (zooming, hovering, etc.).
- Web-based visualizations with minimal code.
- Supports 3D and animated plots.

15. How does NumPy handle multidimensional arrays?<br>
- NumPy supports ndarrays (n-dimensional arrays) and allows operations like reshaping, slicing, and broadcasting.
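
The following sketch builds a 3-D array from a flat range and applies a few of these operations; the shape `(2, 3, 4)` is an arbitrary choice:

```python
import numpy as np

# Reshape 24 consecutive integers into 2 blocks of 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape)  # 3 (2, 3, 4)

# Slicing works per axis: the last column of every block.
last_column = cube[:, :, -1]  # shape (2, 3)

# Reductions take an axis argument: sum the two blocks together.
block_sum = cube.sum(axis=0)  # shape (3, 4)
```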



16. What is the role of Bokeh in data visualization?<br>
- Bokeh is used for interactive and web-friendly visualizations, often preferred for real-time data dashboards.



17. Explain the difference between apply() and map() in Pandas.<br>
- apply() works on a DataFrame's rows or columns (or on a Series) and can apply custom functions.
- map() works only on a Series (a single column) and transforms it one value at a time, using a function or a dictionary lookup.
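
A minimal sketch of the difference, using a made-up `orders` table (the column names and the "bulk"/"single" labels are illustrative assumptions):

```python
import pandas as pd

orders = pd.DataFrame({"price": [10.0, 20.0, 5.0], "qty": [3, 1, 4]})

# DataFrame.apply with the default axis=0 passes each column to the function.
column_ranges = orders.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1 passes each row to the function.
totals = orders.apply(lambda row: row["price"] * row["qty"], axis=1)

# Series.map transforms one value at a time.
sizes = orders["qty"].map(lambda q: "bulk" if q >= 3 else "single")

print(column_ranges)
print(totals)
print(sizes)
```

For simple arithmetic like the row totals, `orders["price"] * orders["qty"]` is usually faster than a row-wise apply(); the lambda is used here only to show how apply() is called.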

18. What are some advanced features of NumPy?<br>
- Linear Algebra (numpy.linalg)
- Fourier Transform (numpy.fft)
- Random Number Generation (numpy.random)
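
Each of these submodules can be tried in a few lines. The sketch below uses small invented inputs; the seed value is arbitrary and only makes the random draw repeatable:

```python
import numpy as np

# Linear algebra: solve the system 2x + y = 3, x + 3y = 5.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
solution = np.linalg.solve(A, b)

# Fourier transform: discrete Fourier transform of a 4-sample signal.
signal = np.array([1.0, 0.0, -1.0, 0.0])
spectrum = np.fft.fft(signal)

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
sample = rng.normal(size=5)

print(solution)
print(spectrum)
print(sample)
```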


19. How does Pandas simplify time series analysis?<br>
Pandas provides:
- Datetime parsing and indexing (pd.to_datetime(), DatetimeIndex)
- Resampling to a different frequency (resample())
- Rolling windows for moving averages (rolling())
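
These three features can be combined as in the sketch below, which uses four invented daily readings:

```python
import pandas as pd

# Parse strings into datetimes and use them as the index.
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"])
readings = pd.Series([10, 20, 30, 40], index=dates)

# Resample daily data into 2-day bins and sum each bin.
two_day_totals = readings.resample("2D").sum()

# Moving average over a window of 2 observations; the first value is NaN
# because the window is not yet full.
moving_avg = readings.rolling(window=2).mean()

print(two_day_totals)
print(moving_avg)
```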


20. What is the role of a pivot table in Pandas?<br>
- A pivot table summarizes data by grouping values along row and column keys and applying an aggregation function such as sum, mean, or count.
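
As an illustration, the sketch below pivots an invented `sales` table so that regions become rows and products become columns:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "product": ["A", "A", "B", "A", "B"],
    "units": [5, 4, 3, 2, 7],
})

# Rows are regions, columns are products, cells hold summed units.
table = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)
print(table)
```

The two East/A rows are combined into a single cell, which is the aggregation step that separates pivot_table() from a plain reshape.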

21. Why is NumPy’s array slicing faster than Python’s list slicing?<br>
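- Basic slicing of a NumPy array returns a view onto the same memory rather than a copy, whereas slicing a Python list creates a new list and copies its element references. Skipping the copy makes NumPy slicing faster, especially for large arrays.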
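
The view behaviour has a visible side effect, which the sketch below demonstrates with small throwaway arrays:

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]          # a view: no data is copied
view[0] = 99             # writes through to the original array
print(arr)               # the element at index 1 is now 99
print(np.shares_memory(arr, view))  # True

items = list(range(5))
part = items[1:4]        # a new list
part[0] = 99             # the original list is unaffected
print(items)             # [0, 1, 2, 3, 4]

# Use .copy() when an independent array is needed.
independent = arr[1:4].copy()
```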



22. What are some common use cases for Seaborn?<br>
- Correlation heatmaps
- Pair plots for exploratory analysis
- Distribution and categorical data visualization

## Practical Questions

1. How do you create a 2D NumPy array and calculate the sum of each row?

In [None]:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
row_sums = arr.sum(axis=1)
print(row_sums)  # Output: [ 6 15]


2. Write a Pandas script to find the mean of a specific column in a DataFrame.

In [None]:
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 15, 25]})
mean_value = df['A'].mean()
print(mean_value)  # Output: 20.0


3. Create a scatter plot using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 15, 7, 20, 25]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')
plt.show()


4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

In [None]:
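import matplotlib.pyplot as plt  # needed for plt.show()
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(5, 5), columns=list('ABCDE'))
# Pandas computes the correlation matrix; Seaborn draws it as a heatmap
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()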


5. Generate a bar plot using Plotly.

In [None]:
import plotly.express as px
data = {'Category': ['A', 'B', 'C'], 'Values': [10, 15, 7]}
fig = px.bar(data, x='Category', y='Values', title="Bar Plot Example")
fig.show()


6. Create a DataFrame and add a new column based on an existing column.

In [None]:
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30]})
df['B'] = df['A'] * 2  # Creating new column B based on A
print(df)


7. Write a program to perform element-wise multiplication of two NumPy arrays.

In [None]:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a * b
print(result)  # Output: [ 4 10 18]


8. Create a line plot with multiple lines using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y1 = [10, 15, 7, 20, 25]
y2 = [5, 10, 3, 15, 20]
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()
plt.title('Multiple Line Plot')
plt.show()


9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

In [None]:
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [5, 15, 25, 35]})
filtered_df = df[df['A'] > 20]
print(filtered_df)


10. Create a histogram using Seaborn to visualize a distribution.

In [None]:
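import matplotlib.pyplot as plt  # needed for plt.show()
import seaborn as sns
import numpy as np
data = np.random.randn(1000)  # 1000 samples from a standard normal distribution
sns.histplot(data, bins=30, kde=True)
plt.show()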


11. Perform matrix multiplication using NumPy.

In [None]:
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = np.dot(A, B)  # Or A @ B
print(result)


12. Use Pandas to load a CSV file and display its first 5 rows.

In [None]:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())


13. Create a 3D scatter plot using Plotly.

In [None]:
import plotly.express as px
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': np.random.rand(50), 'y': np.random.rand(50), 'z': np.random.rand(50)})
fig = px.scatter_3d(df, x='x', y='y', z='z', title='3D Scatter Plot')
fig.show()
