# Data Toolkit Assignment

1. What is NumPy, and why is it widely used in Python?
  -> NumPy is the core library for numerical computing in Python. It is widely used because its ndarray type stores large, multi-dimensional arrays compactly and supports fast, vectorized operations on them.


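2. How does broadcasting work in NumPy?
  -> Broadcasting lets NumPy apply element-wise operations to arrays of different shapes. Shapes are compared starting from the trailing dimension; two dimensions are compatible when they are equal or one of them is 1, and the size-1 dimension is stretched virtually, without copying data.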
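
A minimal sketch of the broadcasting rule described above. The array names `col`, `row`, and `grid` are made up for illustration.

```python
import numpy as np

# A (3, 1) column and a (4,) row: shapes are aligned from the right,
# so they are compared as (3, 1) and (1, 4).
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])

# Each size-1 dimension is stretched, so the result has shape (3, 4).
grid = col + row
print(grid.shape)  # (3, 4)
print(grid)

# Incompatible trailing dimensions (2 vs 3) raise an error.
try:
    np.ones((3, 2)) + np.ones((3,))
except ValueError as exc:
    print("cannot broadcast:", exc)
```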


3. What is a Pandas DataFrame?
  -> A DataFrame is a two-dimensional, labeled data structure in Pandas, similar to a database table or an Excel spreadsheet. Each column can hold a different data type.


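4. Explain the use of the groupby() method in Pandas.
  -> groupby() splits a DataFrame into groups based on one or more keys, applies a function (such as sum or mean) to each group, and combines the results. This split-apply-combine pattern is the usual way to aggregate data.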
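
The split-apply-combine idea can be sketched as follows. The `sales` data and its column names are hypothetical sample values, not part of the assignment.

```python
import pandas as pd

# Hypothetical sample data for illustration.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [10, 4, 6, 8, 2],
})

# Split by region, apply sum to each group, combine into one Series.
totals = sales.groupby("region")["units"].sum()
print(totals)

# Several aggregations at once give one column per function.
summary = sales.groupby("region")["units"].agg(["sum", "mean"])
print(summary)
```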


5. Why is Seaborn preferred for statistical visualizations?
  -> Seaborn builds on Matplotlib and works directly with Pandas DataFrames. It offers high-level functions for statistical plots and attractive default styles, so complex visualizations need little code.


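6. What are the differences between NumPy arrays and Python lists?
  -> NumPy arrays store elements of a single data type in a contiguous block of memory, which makes them faster and more compact than lists and enables element-wise operations. Python lists can hold mixed types and grow or shrink freely, but have no built-in element-wise arithmetic.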
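
A small comparison, using made-up values, of how the same operator behaves on a list and on an array:

```python
import numpy as np

values_list = [1, 2, 3]
values_arr = np.array(values_list)

# On a list, * repeats the sequence.
doubled_list = values_list * 2
print(doubled_list)  # [1, 2, 3, 1, 2, 3]

# On an array, * multiplies every element.
doubled_arr = values_arr * 2
print(doubled_arr)  # [2 4 6]

# A list can mix types; an array has one dtype for all elements.
mixed = [1, "two", 3.0]
print(values_arr.dtype)  # an integer dtype (exact width depends on the platform)
```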


7. What is a heatmap, and when should it be used?
  -> A heatmap displays a matrix of values as a grid of colored cells. It is well suited to showing correlation matrices or the intensity of a value across two dimensions.


8. What does the term “vectorized operation” mean in NumPy?
  -> A vectorized operation applies to an entire array at once instead of through an explicit Python loop. The looping happens in NumPy's compiled code, which makes the computation much faster.
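
To illustrate, here is the same temperature conversion written with an explicit loop and as a vectorized expression. The variable names are illustrative only.

```python
import numpy as np

temps_c = np.array([0.0, 25.0, 100.0])

# Explicit Python loop: one element at a time.
loop_f = np.array([c * 9 / 5 + 32 for c in temps_c])

# Vectorized: the arithmetic is applied to the whole array at once.
vec_f = temps_c * 9 / 5 + 32

print(vec_f)  # [ 32.  77. 212.]
print(np.allclose(loop_f, vec_f))  # True
```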


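9. How does Matplotlib differ from Plotly?
  -> Matplotlib is a lower-level library that mainly produces static figures and gives fine-grained control over every plot element. Plotly produces interactive figures (zoom, pan, hover) that render in the browser, which suits web-based visualizations.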


10. What is the significance of hierarchical indexing in Pandas?
  -> Hierarchical indexing (MultiIndex) allows multiple index levels on an axis. It lets higher-dimensional data be represented in a Series or DataFrame and makes it easy to select subsets by level.
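
A minimal sketch of a two-level index. The `revenue` figures and the level names `year` and `quarter` are hypothetical.

```python
import pandas as pd

# Two index levels: year (outer) and quarter (inner).
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([100, 120, 130, 150], index=index)

# Select everything under one outer label.
year_2024 = revenue.loc["2024"]
print(year_2024)

# Select one inner label across all outer labels.
q1_all_years = revenue.xs("Q1", level="quarter")
print(q1_all_years)
```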


11. What is the role of Seaborn’s pairplot() function?
  -> pairplot() draws a grid of scatter plots for every pair of numeric variables in a dataset, with each variable's own distribution shown on the diagonal.


12. What is the purpose of the describe() function in Pandas?
  -> describe() generates summary statistics for numeric columns by default: count, mean, standard deviation, minimum, quartiles, and maximum.
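
For example, on a small made-up DataFrame with one numeric and one text column, the default call summarizes only the numeric column:

```python
import pandas as pd

# Hypothetical sample data: one numeric column, one text column.
df = pd.DataFrame({
    "height": [150, 160, 170, 180],
    "label": ["a", "b", "c", "d"],
})

# By default only numeric columns are summarized.
stats = df.describe()
print(stats)
```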


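13. Why is handling missing data important in Pandas?
  -> Missing values can distort statistics and cause errors during analysis or model training. Handling them explicitly, by detecting, dropping, or filling them, keeps results reliable.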
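
A short sketch of three common ways to handle missing values. The data and the fill values chosen here are assumptions for illustration; the right strategy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame({
    "score": [90.0, np.nan, 70.0],
    "city": ["Pune", None, "Delhi"],
})

# Detect: count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop: remove every row that has any missing value.
dropped = df.dropna()

# Fill: replace missing values column by column.
filled = df.fillna({"score": df["score"].mean(), "city": "unknown"})
print(filled)
```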


14. What are the benefits of using Plotly for data visualization?
  -> Plotly offers interactive plots, easy integration with web apps (for example through Dash), and a wide range of chart types.


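15. How does NumPy handle multidimensional arrays?
  -> NumPy represents data of any dimensionality with the n-dimensional array type (ndarray). Each array has a shape, and operations such as indexing, slicing, reshaping, and reductions can target specific axes.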
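
As an illustration, a 3D array (here called `cube`, a made-up name) can be indexed and reduced along any axis:

```python
import numpy as np

# 24 consecutive integers arranged as 2 blocks of 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim)   # 3
print(cube.shape)  # (2, 3, 4)

# One index per dimension selects a single element.
last = cube[1, 2, 3]
print(last)  # 23

# Reducing along an axis removes that axis from the shape.
sum_over_blocks = cube.sum(axis=0)  # shape (3, 4)
sum_over_cols = cube.sum(axis=2)    # shape (2, 3)
```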


16. What is the role of Bokeh in data visualization?
  -> Bokeh is a library for creating interactive, web-based plots. It renders in the browser and supports streaming and real-time data updates.


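17. Explain the difference between apply() and map() in Pandas.
  -> apply() works on both Series and DataFrames; on a DataFrame it passes each column (or each row, with axis=1) to a function. Series.map() transforms a Series element by element and also accepts a dict or Series as a lookup table. Since pandas 2.1 there is also DataFrame.map() for element-wise use, replacing applymap().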
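
The difference can be sketched on a small made-up DataFrame. Only `DataFrame.apply` and `Series.map` are used here so the example does not depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# apply on a DataFrame: the function receives one whole column at a time.
col_ranges = df.apply(lambda col: col.max() - col.min())

# apply with axis=1: the function receives one whole row at a time.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# map on a Series: element by element, here with a dict as lookup table.
grades = pd.Series(["x", "y", "x"]).map({"x": "pass", "y": "fail"})

# map also accepts a function.
squared = df["a"].map(lambda v: v ** 2)
```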


18. What are some advanced features of NumPy?
  -> Advanced features include broadcasting, vectorization, boolean masking, linear algebra (numpy.linalg), Fourier transforms (numpy.fft), and random number generation (numpy.random).
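
A brief tour of a few of these features, with made-up inputs. It is a sketch of how the submodules are typically called, not an exhaustive overview.

```python
import numpy as np

# Boolean masking: keep only the elements where the condition holds.
data = np.array([3, -1, 7, -5])
positives = data[data > 0]
print(positives)  # [3 7]

# Linear algebra: solve the system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

# Fourier transform of a real-valued signal.
spectrum = np.fft.rfft(np.ones(4))

# Random numbers from a seeded generator (reproducible between runs).
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)
```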


19. How does Pandas simplify time series analysis?
  -> Pandas has built-in datetime indexing, resampling, and frequency conversion, along with date-based slicing and rolling windows.
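
A minimal sketch of datetime indexing and resampling. The dates and values are invented for illustration.

```python
import pandas as pd

# Six daily observations starting on 2024-01-01.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Resample: aggregate the daily values into 2-day bins.
two_day_totals = daily.resample("2D").sum()
print(two_day_totals)

# Date-based slicing with strings; both endpoints are included.
window = daily.loc["2024-01-02":"2024-01-04"]
print(window)
```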


20. What is the role of a pivot table in Pandas?
  -> A pivot table reshapes and summarizes data: it groups rows by one or more keys, spreads another key across the columns, and aggregates the values, which makes patterns easier to see.
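
For instance, with a hypothetical `sales` table, `pivot_table()` can put regions on the rows and products on the columns:

```python
import pandas as pd

# Hypothetical sample data in long format.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["pen", "ink", "pen", "ink"],
    "units": [5, 3, 2, 7],
})

# Rows are regions, columns are products, cells are summed units.
table = sales.pivot_table(
    index="region", columns="product", values="units", aggfunc="sum"
)
print(table)
```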


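21. Why is NumPy’s array slicing faster than Python’s list slicing?
  -> A basic NumPy slice returns a view onto the same memory instead of copying elements, so its cost does not depend on the slice length. A list slice builds a new list and copies a reference for every element.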
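
The view-versus-copy behavior can be checked directly. The variable names below are illustrative.

```python
import numpy as np

# A NumPy slice is a view: writing through it changes the original array.
arr = np.arange(5)
arr_slice = arr[1:4]
arr_slice[0] = 99
print(arr)  # [ 0 99  2  3  4]
print(np.shares_memory(arr, arr_slice))  # True

# A list slice is a new list: the original stays unchanged.
lst = list(range(5))
lst_slice = lst[1:4]
lst_slice[0] = 99
print(lst)  # [0, 1, 2, 3, 4]
```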


22. What are some common use cases for Seaborn?
  -> Common uses include correlation heatmaps, distribution plots, categorical plots, and pairwise relationships.


# Practical Questions

In [None]:
#1. How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np
arr = np.array([[1, 2], [3, 4]])
row_sums = arr.sum(axis=1)  # axis=1 sums across the columns of each row
print(row_sums)  # [3 7]


In [None]:
#2. Write a Pandas script to find the mean of a specific column in a DataFrame
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
mean_b = df['B'].mean()  # mean of column 'B' only
print(mean_b)  # 5.0


In [None]:
#3. Create a scatter plot using Matplotlib
import matplotlib.pyplot as plt
x = [1, 2, 3]
y = [4, 5, 6]
plt.scatter(x, y)
plt.show()


In [None]:
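#4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Placeholder sample data; replace with your own DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 9], 'C': [4, 3, 2, 1]})
# Pandas computes the correlations; Seaborn only draws them
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True)
plt.show()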

In [None]:
#5. Generate a bar plot using Plotly
import plotly.express as px
data = {'x': ['A', 'B', 'C'], 'y': [10, 20, 15]}
fig = px.bar(data, x='x', y='y')
fig.show()

In [None]:
#6. Create a DataFrame and add a new column based on an existing column
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df['B'] = df['A'] * 2  # new column computed element-wise from 'A'
print(df)


In [None]:
#7. Write a program to perform element-wise multiplication of two NumPy arrays
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = a * b  # * multiplies matching elements; it is not a dot product
print(result)  # [ 4 10 18]


In [None]:
#8. Create a line plot with multiple lines using Matplotlib
import matplotlib.pyplot as plt
x = [1, 2, 3]
y1 = [2, 4, 6]
y2 = [1, 2, 3]
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()
plt.show()


In [None]:
#9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4]})
filtered_df = df[df['A'] > 2]  # boolean mask keeps the rows where A is 3 or 4
print(filtered_df)


In [None]:
#10. Create a histogram using Seaborn to visualize a distribution
import seaborn as sns
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4]
sns.histplot(data, kde=True)  # kde=True overlays a kernel density estimate
plt.show()


In [None]:
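#11. Perform matrix multiplication using NumPy
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
result = a @ b  # matrix product; same as np.matmul(a, b) or np.dot(a, b) for 2D arrays
print(result)  # [[19 22]
               #  [43 50]]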


In [None]:
#12. Use Pandas to load a CSV file and display its first 5 rows
import pandas as pd
df = pd.read_csv('file.csv')  # 'file.csv' is a placeholder path
print(df.head())  # head() returns the first 5 rows by default


In [None]:
#13. Create a 3D scatter plot using Plotly
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_length', color='species')
fig.show()
