# **Data Toolkit**



1. **What is NumPy, and why is it widely used in Python?**  
   NumPy is the core library for numerical computing in Python. It provides an n-dimensional array type, fast vectorized mathematical functions, and efficient storage for large numeric datasets; libraries such as Pandas are built on top of it.  

2. **How does broadcasting work in NumPy?**  
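   Broadcasting lets NumPy combine arrays of different shapes in element-wise operations. Shapes are compared from the trailing dimension backward; two dimensions are compatible when they are equal or when one of them is 1, and the size-1 dimension is stretched virtually, without copying data.  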
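
As a minimal sketch of the rule, the example below adds a column of shape `(3, 1)` to a row of shape `(4,)`. The array values are made up for illustration.

```python
import numpy as np

# Shapes are aligned from the right: (3, 1) against (4,) is treated as
# (3, 1) against (1, 4), and every length-1 axis is stretched to match.
column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])           # shape (4,)

table = column + row                   # broadcast result has shape (3, 4)
print(table.shape)
print(table)
```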

3. **What is a Pandas DataFrame?**  
   A DataFrame is a tabular data structure in Pandas, similar to a spreadsheet or SQL table, used for data manipulation and analysis.  

4. **Explain the use of the groupby() method in Pandas.**  
   The `groupby()` method splits rows into groups by one or more keys (typically column names), so that an aggregation or transformation can be applied to each group and the results combined.  
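
The sketch below groups a small, hypothetical sales table by one column and sums another. The column names `region` and `units` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sales records; the data is invented for this example.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [5, 3, 7, 2, 1],
})

# Split the rows by region, then sum the units within each group.
units_per_region = sales.groupby("region")["units"].sum()
print(units_per_region)
```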

5. **Why is Seaborn preferred for statistical visualizations?**  
   Seaborn is built on top of Matplotlib and provides high-level functions for statistical plotting, built-in themes, and attractive defaults, so common statistical charts need less code than in plain Matplotlib.  

6. **What are the differences between NumPy arrays and Python lists?**  
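   NumPy arrays store elements of a single data type in a compact block of memory, which makes them faster and more memory-efficient for numerical work and lets them support element-wise mathematical operations. Python lists can hold mixed types and are more flexible, but they are slower for numerical computations.  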
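
A small illustration of the behavioral difference, using made-up values: the same `*` operator repeats a list but multiplies an array element-wise.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

# For a list, `*` means repetition; for an array, it is element-wise arithmetic.
repeated = values_list * 2
doubled = values_array * 2

# A list may mix types, while an array stores one dtype for every element.
mixed = [1, "two", 3.0]
print(repeated, doubled, values_array.dtype)
```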

7. **What is a heatmap, and when should it be used?**  
   A heatmap represents the values of a matrix as colors. It is useful for visualizing correlation matrices and other patterns in tabular data.  

8. **What does the term "vectorized operation" mean in NumPy?**  
   Vectorized operations allow computations to be performed element-wise on entire arrays without explicit loops, leading to faster execution.  
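
To make the contrast concrete, the sketch below converts hypothetical Celsius readings to Fahrenheit twice: once with an explicit Python loop and once as a single vectorized expression.

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])

# Explicit loop: Python handles one element per iteration.
fahrenheit_loop = np.empty_like(celsius)
for i in range(len(celsius)):
    fahrenheit_loop[i] = celsius[i] * 9 / 5 + 32

# Vectorized form: the same arithmetic is applied to the whole array at once.
fahrenheit_vec = celsius * 9 / 5 + 32

print(np.array_equal(fahrenheit_loop, fahrenheit_vec))
```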

9. **How does Matplotlib differ from Plotly?**  
   Matplotlib is primarily used for static plots, while Plotly produces interactive visualizations with zooming, panning, and hover tooltips.  

10. **What is the significance of hierarchical indexing in Pandas?**  
    Hierarchical indexing allows multi-level index structures, making complex data organization and selection easier.  
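
The following sketch builds a two-level index and shows three common selections. The city names, years, and temperatures are invented for illustration.

```python
import pandas as pd

# Hypothetical temperature readings indexed by (city, year).
index = pd.MultiIndex.from_tuples(
    [("Delhi", 2023), ("Delhi", 2024), ("Pune", 2023), ("Pune", 2024)],
    names=["city", "year"],
)
temps = pd.Series([31.5, 32.0, 27.0, 27.5], index=index)

# Select every row under one outer-level key.
delhi = temps.loc["Delhi"]

# Select a single value with a full (city, year) key.
pune_2024 = temps.loc[("Pune", 2024)]

# Cross-section on the inner level: all cities for one year.
year_2023 = temps.xs(2023, level="year")
print(delhi, pune_2024, year_2023, sep="\n")
```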

11. **What is the role of Seaborn's pairplot() function?**  
    `pairplot()` creates pairwise plots of numerical features, helping in understanding relationships between variables.  

12. **What is the purpose of the describe() function in Pandas?**  
    `describe()` provides summary statistics for numerical columns: count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.  
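
A minimal sketch of what `describe()` returns for a single made-up numeric column. Note that the median appears as the `50%` row rather than under its own name.

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30, 40]})

# One row per statistic, one column per numeric column of the input.
summary = df.describe()
print(summary)
```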

13. **Why is handling missing data important in Pandas?**  
    Missing data can skew analyses, so handling it properly (for example, by dropping or imputing the affected values) helps keep insights and model performance reliable.  
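
The sketch below shows three common steps on a small, invented table: counting missing values, dropping incomplete rows, and filling gaps. Filling with the column mean and the placeholder `"unknown"` are illustrative choices, not recommendations for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "city": ["Pune", "Delhi", None]})

# Count the missing values in each column.
missing_counts = df.isna().sum()

# Option 1: keep only the rows with no missing values.
complete_rows = df.dropna()

# Option 2: fill each column with a chosen replacement value.
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})
print(missing_counts, complete_rows, filled, sep="\n\n")
```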

14. **What are the benefits of using Plotly for data visualization?**  
    Plotly enables interactive, web-based, and aesthetically appealing visualizations with ease.  

15. **How does NumPy handle multidimensional arrays?**  
    NumPy supports n-dimensional arrays (`ndarray`) and provides functions for reshaping, slicing, and performing operations on them.  
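
As a brief illustration, the sketch below reshapes 24 consecutive integers into a 3-dimensional array, slices it, and reduces along chosen axes.

```python
import numpy as np

# 24 values arranged as 2 blocks of 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape)

first_block = cube[0]                # shape (3, 4)
last_column = cube[:, :, -1]         # last column of every block, shape (2, 3)
block_sums = cube.sum(axis=(1, 2))   # one total per block
print(last_column)
print(block_sums)
```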

16. **What is the role of Bokeh in data visualization?**  
    Bokeh is a Python library for creating interactive and web-friendly visualizations.  

17. **Explain the difference between apply() and map() in Pandas.**  
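    - `apply()` works on both DataFrames and Series. On a DataFrame it passes each column (or each row, with `axis=1`) to the function; on a Series it passes each value.  
    - `map()` is the element-wise Series method: it transforms each value using a function, a dict, or another Series as a lookup.  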
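
The sketch below contrasts the two on a small made-up DataFrame: `map()` with a dict lookup on one column, and `apply()` along each axis.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: element-wise, here using a dict as the lookup.
labels = df["a"].map({1: "one", 2: "two", 3: "three"})

# DataFrame.apply with the default axis=0: the function receives each column.
column_ranges = df.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives each row.
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(labels, column_ranges, row_totals, sep="\n\n")
```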

18. **What are some advanced features of NumPy?**  
    NumPy supports linear algebra, Fourier transforms, masked arrays, and random number generation.  
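
One short call for each feature named above, as a rough sketch. The equations, signal, and sentinel value `-999.0` are invented for illustration.

```python
import numpy as np

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
coeffs = np.array([[2.0, 1.0], [1.0, 3.0]])
solution = np.linalg.solve(coeffs, np.array([5.0, 10.0]))

# Fourier transform: a sine with 4 cycles over 32 samples peaks at bin 4.
t = np.arange(32)
signal = np.sin(2 * np.pi * 4 * t / 32)
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum))

# Masked arrays: ignore a sentinel value when computing statistics.
readings = np.ma.masked_values([1.0, -999.0, 3.0], -999.0)
masked_mean = readings.mean()

# Random number generation with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)

print(solution, dominant_bin, masked_mean, samples.shape)
```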

19. **How does Pandas simplify time series analysis?**  
    Pandas provides time-based indexing, resampling, rolling computations, and date-handling utilities.  
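
A minimal sketch of these utilities on six days of made-up daily values; the dates and numbers are assumptions for illustration.

```python
import pandas as pd

# Six consecutive days of hypothetical daily values.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Time-based indexing: slice by date strings (both endpoints are included).
first_three = series.loc["2024-01-01":"2024-01-03"]

# Resampling: aggregate the daily values into 2-day bins.
two_day_totals = series.resample("2D").sum()

# Rolling computation: 3-day moving average (the first two values are NaN).
rolling_mean = series.rolling(window=3).mean()
print(first_three, two_day_totals, rolling_mean, sep="\n\n")
```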

20. **What is the role of a pivot table in Pandas?**  
    Pivot tables summarize and reshape data by grouping rows on one or more keys, spreading another key across the columns, and aggregating the values that fall into each cell.  
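
The sketch below pivots a small, hypothetical sales table so that regions become rows and products become columns. The column names and the choice of `sum` as the aggregation are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sales records; the data is invented for this example.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["pen", "ink", "pen", "pen"],
    "units": [5, 2, 3, 4],
})

# Rows: region. Columns: product. Cells: summed units, with 0 where no rows match.
table = sales.pivot_table(
    index="region", columns="product", values="units",
    aggfunc="sum", fill_value=0,
)
print(table)
```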

21. **Why is NumPy's array slicing faster than Python's list slicing?**  
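    Basic slicing of a NumPy array returns a view onto the same memory buffer instead of copying elements, whereas slicing a Python list creates a new list. Combined with homogeneous storage and a C-based implementation, this makes array slicing cheap even for large arrays.  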
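
One concrete difference worth seeing: a basic slice of an array is a view that shares memory with the original, while a list slice is an independent copy. A minimal sketch with made-up values:

```python
import numpy as np

numbers = np.arange(5)
view = numbers[1:4]        # basic slice: a view, no data is copied
view[0] = 99               # the write is visible through `numbers`

items = [0, 1, 2, 3, 4]
copy = items[1:4]          # list slice: a new list with copied references
copy[0] = 99               # `items` is unchanged

shares = np.shares_memory(numbers, view)
print(numbers, items, shares)
```

Because a view aliases the original data, call `.copy()` on the slice when an independent array is needed.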

22. **What are some common use cases for Seaborn?**  
    Seaborn is used for statistical data visualization, correlation matrices, categorical plots, and distribution plots.  


# **Practical**

In [1]:
# 1. Create a 2D NumPy array and calculate the sum of each row:

import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_sums = arr.sum(axis=1)
print(row_sums)

[ 6 15 24]


In [2]:
# 2. Find the mean of a specific column in a Pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})
mean_value = df['A'].mean()
print(mean_value)

20.0


In [None]:
# 3. Create a scatter plot using Matplotlib:

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
plt.scatter(x, y)
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Scatter Plot')
plt.show()

In [None]:
# 4. Calculate the correlation matrix with Pandas and visualize it with a Seaborn heatmap:

import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

In [4]:
# 5. Generate a bar plot using Plotly:

import plotly.express as px
data = {'Category': ['A', 'B', 'C'], 'Values': [10, 20, 30]}
fig = px.bar(data, x='Category', y='Values', title='Bar Plot')
fig.show()

In [3]:
# 6. Create a DataFrame and add a new column based on an existing column:

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df['B'] = df['A'] * 2
print(df)

   A  B
0  1  2
1  2  4
2  3  6


In [None]:
# 7. Perform element-wise multiplication of two NumPy arrays:

import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
result = arr1 * arr2
print(result)

In [None]:
# 8. Create a line plot with multiple lines using Matplotlib:

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [1, 3, 5, 7, 9]
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.legend()
plt.show()

In [None]:
# 9. Filter rows where a column value is greater than a threshold in Pandas:

import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})
filtered_df = df[df['A'] > 15]
print(filtered_df)

In [None]:
# 10. Create a histogram using Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(100)
sns.histplot(data, bins=20, kde=True)
plt.show()

In [None]:
# 11. Perform matrix multiplication using NumPy:

import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = A @ B  # matrix product; equivalent to np.dot(A, B) for 2-D arrays
print(result)

In [None]:
# 12. Load a CSV file using Pandas and display its first 5 rows:

import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())

In [None]:
# 13. Create a 3D scatter plot using Plotly:

import plotly.express as px
import pandas as pd
import numpy as np

data = pd.DataFrame({'X': np.random.rand(100),
                     'Y': np.random.rand(100),
                     'Z': np.random.rand(100)})

fig = px.scatter_3d(data, x='X', y='Y', z='Z', title='3D Scatter Plot')
fig.show()