# Data Toolkit

## Theoretical Questions

1. What is NumPy, and why is it widely used in Python?
-> NumPy is a Python library for numerical computing. It provides efficient array operations, mathematical functions, and integration with other libraries. Its speed comes from contiguous memory storage and vectorized operations.

2. How does broadcasting work in NumPy?
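-> Broadcasting lets NumPy apply arithmetic to arrays of different shapes by virtually stretching the smaller array to match the larger one, without copying its data. Rule: shapes are compared from the trailing dimension backward, and each pair of dimensions must either be equal or contain a 1.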
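
A minimal sketch of the rule, using made-up arrays (`matrix`, `row`, and `col` are illustrative names, not part of the original answer):

```python
import numpy as np

# Hypothetical example data: a (3, 2) matrix and a (2,) row vector.
matrix = np.array([[1, 2], [3, 4], [5, 6]])
row = np.array([10, 20])

# Trailing dimensions match (2 and 2), so `row` is stretched across all three rows.
shifted = matrix + row
print(shifted)

# A (3, 1) column has a 1 in the trailing dimension, so it is stretched across columns.
col = np.array([[100], [200], [300]])
shifted_cols = matrix + col
print(shifted_cols)
```

Shapes such as (3, 2) and (3,) would fail here, because the trailing dimensions 2 and 3 are neither equal nor 1.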

3. What is a Pandas DataFrame?
-> A DataFrame is a 2D labeled data structure with columns of different data types, similar to a spreadsheet or SQL table.

4. Explain the use of the groupby() method in Pandas.
-> Groups data by a column's values and applies aggregations (e.g., sum, mean). Example:

In [None]:
df.groupby('category')['sales'].sum()
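
The one-liner above assumes an existing `df`. A self-contained sketch with hypothetical sales figures might look like this:

```python
import pandas as pd

# Hypothetical sales data; the column names mirror the snippet above.
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'sales': [100, 300, 150, 50],
})

# Split the rows by category, then sum the sales within each group.
totals = df.groupby('category')['sales'].sum()
print(totals)
```

The result is a Series indexed by category, so `totals['A']` returns the total for that group.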

5. Why is Seaborn preferred for statistical visualizations?
-> Seaborn offers high-level, attractive statistical plots (e.g., regression, distribution) with minimal code, built on Matplotlib.

6. What are the differences between NumPy arrays and Python lists?
-> NumPy arrays: homogeneous, fixed size, faster for math. Python lists: heterogeneous, dynamic size.
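
A short sketch of how these differences show up in practice (the variable names are illustrative only):

```python
import numpy as np

values_list = [1, 2, 3]
values_arr = np.array([1, 2, 3])

# `*` repeats a list but multiplies an array element-wise.
repeated = values_list * 2
doubled = values_arr * 2
print(repeated)
print(doubled)

# A list can hold mixed types; an array coerces everything to one dtype.
mixed_list = [1, 'two', 3.0]
mixed_arr = np.array([1, 2.5, 3])
print(mixed_arr.dtype)  # float64
```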

7. What is a heatmap, and when should it be used?
-> A color-coded matrix for visualizing patterns (e.g., correlations, confusion matrices). Use it to show relationships between variables.

8. What does “vectorized operation” mean in NumPy?
-> Operations applied to entire arrays without loops, optimized in C for speed. Example: arr + 1.
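
To make the contrast concrete, here is a small sketch that computes `arr + 1` both with an explicit loop and in vectorized form:

```python
import numpy as np

arr = np.arange(5)  # array([0, 1, 2, 3, 4])

# Explicit Python loop: one interpreted iteration per element.
looped = np.array([x + 1 for x in arr])

# Vectorized: a single expression, with the loop running in compiled code.
vectorized = arr + 1

print(np.array_equal(looped, vectorized))  # True
```

Both give the same values; the vectorized form is typically much faster on large arrays.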

9. How does Matplotlib differ from Plotly?
-> Matplotlib: static plots, more code for interactivity. Plotly: interactive, web-ready plots.

10. What is hierarchical indexing in Pandas?
-> Multi-level indexing for grouping data hierarchically. Example:

In [None]:
df.set_index(['region', 'year'])
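
A self-contained sketch, assuming a hypothetical table with `region`, `year`, and `sales` columns, shows how the two index levels can then be used for selection:

```python
import pandas as pd

# Hypothetical data; only the column names come from the snippet above.
df = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'year': [2022, 2023, 2022, 2023],
    'sales': [10, 20, 30, 40],
})

indexed = df.set_index(['region', 'year'])

# Select every row for one outer-level key.
east = indexed.loc['East']
print(east)

# Select a single value by giving a key for each level.
west_2023 = indexed.loc[('West', 2023), 'sales']
print(west_2023)  # 40
```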

11. Role of Seaborn’s pairplot() function?
-> Draws a grid of plots for a dataset: a scatter plot for each pair of numeric columns, and each column's own distribution (a histogram by default) on the diagonal:

In [None]:
sns.pairplot(df)

12. Purpose of describe() in Pandas?
-> Generates summary statistics (mean, std, min, max, etc.):

In [None]:
df.describe()

13. Why handle missing data?
-> Missing values can skew statistics and propagate NaN through calculations. Use fillna() to fill them in or dropna() to remove the affected rows or columns.
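
A minimal sketch of both options on a hypothetical one-column DataFrame (filling with the column mean is just one possible strategy):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

# Option 1: drop rows that contain a missing value.
dropped = df.dropna()

# Option 2: fill gaps, here with the column mean (mean() skips NaN by default).
filled = df.fillna(df['A'].mean())

print(dropped)
print(filled)
```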

14. Benefits of Plotly?
-> Interactive plots, 3D support, dashboards, web integration.

15. How does NumPy handle multidimensional arrays?
-> Uses ndarray for n-dimensional data with efficient memory storage.
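
As an illustration, a 3D array can be built and inspected like this (the name `cube` and its shape are arbitrary choices):

```python
import numpy as np

# A 3D array built by reshaping 24 consecutive integers.
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)   # 3
print(cube.shape)  # (2, 3, 4)

# Index with one position per axis.
print(cube[1, 2, 3])  # 23

# Reduce along one axis; the result drops that axis.
summed = cube.sum(axis=0)
print(summed.shape)  # (3, 4)
```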

16. Role of Bokeh?
-> Interactive visualization library for web browsers, handles large datasets.

17. Difference between apply() and map()?

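* apply(): Applies a function to each value of a Series, or along an axis (each column or each row) of a DataFrame.

* map(): Substitutes each value of a Series using a dict, another Series, or a function.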
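
The two methods can be compared side by side in a short sketch (the `grade` and `score` columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'grade': ['a', 'b', 'a'], 'score': [90, 75, 60]})

# map(): element-wise substitution on a Series via a dict.
df['grade_label'] = df['grade'].map({'a': 'Excellent', 'b': 'Good'})

# apply() on a Series: call a function on each value.
df['curved'] = df['score'].apply(lambda s: min(s + 5, 100))

# apply() on a DataFrame (default axis=0): call a function once per column.
col_range = df[['score', 'curved']].apply(lambda col: col.max() - col.min())

print(df)
print(col_range)
```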

18. Advanced NumPy features?
-> Broadcasting, ufuncs, structured arrays, advanced indexing.
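
A brief sketch touching three of these features; the arrays and field names are made up for illustration:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Advanced indexing: a boolean mask and an integer index list.
masked = arr[arr > 25]
picked = arr[[0, 2, 4]]
print(masked, picked)

# ufunc method: np.add.reduce folds the array into a single sum.
total = np.add.reduce(arr)
print(total)  # 150

# Structured array: named fields, each with its own dtype.
people = np.array([('Ann', 31), ('Bob', 25)],
                  dtype=[('name', 'U10'), ('age', 'i4')])
mean_age = people['age'].mean()
print(mean_age)  # 28.0
```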

19. How Pandas simplifies time series?
-> Date ranges, resampling, time zone handling:

In [None]:
pd.date_range('2023-01-01', periods=5)
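
Resampling and time zone handling can be sketched in the same spirit; the readings and the target zone `Europe/Berlin` are arbitrary assumptions:

```python
import pandas as pd

# Hypothetical daily readings over six days.
idx = pd.date_range('2023-01-01', periods=6, freq='D')
series = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Resample into 2-day bins and sum each bin.
two_day = series.resample('2D').sum()
print(two_day)

# Attach a time zone to the naive index, then convert to another zone.
localized = series.tz_localize('UTC').tz_convert('Europe/Berlin')
print(localized.index[0])
```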

20. Role of pivot table?
-> Reshapes data by grouping and aggregating:

In [None]:
df.pivot_table(values='sales', index='region', aggfunc='sum')
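
A self-contained variant, assuming hypothetical data and adding a `columns` argument to spread years across the table:

```python
import pandas as pd

# Hypothetical data; `year` is an extra column added for illustration.
df = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'year': [2022, 2023, 2022, 2023],
    'sales': [10, 20, 30, 40],
})

# One row per region, one column per year, summed sales in each cell.
table = df.pivot_table(values='sales', index='region',
                       columns='year', aggfunc='sum')
print(table)
```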

21. Why NumPy slicing is faster?
-> Basic slicing of a NumPy array returns a view onto the existing memory instead of copying elements, whereas slicing a Python list builds a new list.
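
The view-versus-copy behaviour can be checked directly; this sketch writes through both kinds of slice and compares the originals:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
lst = [1, 2, 3, 4]

arr_slice = arr[:2]  # view: shares memory with arr
lst_slice = lst[:2]  # new list: independent copy

arr_slice[0] = 99
lst_slice[0] = 99

print(arr.tolist())  # [99, 2, 3, 4]: the original array changed
print(lst)           # [1, 2, 3, 4]: the original list did not
print(np.shares_memory(arr, arr_slice))  # True
```

Call `arr[:2].copy()` when an independent array is needed.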

22. Common Seaborn use cases?
-> Heatmaps, distribution plots, regression plots, pairplots.

## Practical Questions

1. How do you create a 2D NumPy array and calculate the sum of each row?
-> Create 2D NumPy array and sum rows

In [None]:
import numpy as np
arr = np.array([[1, 2], [3, 4]])
row_sums = arr.sum(axis=1)
print(row_sums)

2. Write a Pandas script to find the mean of a specific column in a DataFrame?
-> Pandas script for column mean

In [None]:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
mean = df['A'].mean()
print(mean)

3. Create a scatter plot using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.scatter(x, y)
plt.show()

4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

In [None]:
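import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random data, so the correlations themselves carry no meaning.
df = pd.DataFrame(np.random.rand(5, 5))
# pandas computes the correlation matrix; Seaborn draws it as a heatmap.
sns.heatmap(df.corr())
plt.show()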

5. Generate a bar plot using Plotly.

In [None]:
import plotly.express as px
df = px.data.tips()
fig = px.bar(df, x='day', y='total_bill')
fig.show()

6. Create a DataFrame and add a new column based on an existing column.

In [None]:
df = pd.DataFrame({'A': [1, 2, 3]})
df['B'] = df['A'] * 2
print(df)

7. Write a program to perform element-wise multiplication of two NumPy arrays.

In [None]:
a = np.array([1, 2])
b = np.array([3, 4])
result = a * b
print(result)

8. Create a line plot with multiple lines using Matplotlib.

In [None]:
x = [1, 2, 3]
y1 = [4, 5, 6]
y2 = [7, 8, 9]
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()
plt.show()

9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

In [None]:
df = pd.DataFrame({'A': [10, 20, 30]})
filtered = df[df['A'] > 15]
print(filtered)

10. Create a histogram using Seaborn to visualize a distribution.

In [None]:
# Reuses df (column 'A') from the filtering example in question 9.
sns.histplot(data=df, x='A')
plt.show()

11. Perform matrix multiplication using NumPy.

In [None]:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
result = a @ b
print(result)

12. Use Pandas to load a CSV file and display its first 5 rows.

In [None]:
df = pd.read_csv('data.csv')
print(df.head())

13. Create a 3D scatter plot using Plotly.

In [None]:
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_length', color='species')
fig.show()