# Data Toolkit Assignment



1. What is NumPy, and why is it widely used in Python?  
  ==> NumPy is a library for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them. It is widely used because of its efficiency, ease of use, and powerful capabilities for scientific computing, including vectorized operations and broadcasting.



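2. How does broadcasting work in NumPy?  
  ==> Broadcasting lets NumPy perform element-wise operations on arrays of different shapes. Shapes are compared dimension by dimension, starting from the right; two dimensions are compatible when they are equal or one of them is 1, and the size-1 dimension is stretched virtually (without copying data) to match the other. This avoids explicit Python loops and improves performance.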
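
The rule above can be sketched with two small cases. The array values here are made up for illustration only.

```python
import numpy as np

# A (3, 3) matrix plus a (3,) vector: the vector is treated as shape (1, 3)
# and virtually repeated along axis 0, so it is added to every row.
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row = np.array([10, 20, 30])
shifted = matrix + row
print(shifted)

# A (3, 1) column combined with a (3,) row broadcasts to a (3, 3) grid.
column = np.array([[0], [1], [2]])
grid = column * row
print(grid.shape)
```

Shapes that are neither equal nor 1 in some dimension, such as `(3,)` and `(4,)`, cannot be broadcast and raise a `ValueError`.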



3. What are the differences between NumPy arrays and Python lists?  
==>
- **Performance:** NumPy arrays store data in a contiguous, typed buffer, so they are usually faster for numerical work and use less memory than lists of Python objects.  
- **Functionality:** Arrays support element-wise mathematical operations directly, whereas lists need loops or comprehensions.  
- **Fixed Size:** An array's size is fixed at creation (resizing produces a new array), while lists grow and shrink dynamically.  
- **Homogeneous Elements:** NumPy arrays store elements of the same type, whereas lists can hold mixed types.  
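
A short sketch of two of these differences, using made-up values: the `*` operator behaves differently on the two types, and an array coerces its elements to a single dtype.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

# `*` repeats a list but multiplies an array element by element.
repeated = values_list * 2
doubled = values_array * 2

# A list can mix types; an array converts everything to one dtype.
mixed_list = [1, "two", 3.0]
coerced = np.array([1, 2.5, 3])

print(repeated)
print(doubled)
print(coerced.dtype)
```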



4. How does NumPy handle multidimensional arrays?
   ==> NumPy provides the `ndarray` object, which supports multi-dimensional arrays with efficient indexing, slicing, reshaping, and broadcasting. Functions like `reshape()` and `transpose()`, together with the `axis` argument of reductions such as `sum()`, make handling these arrays easier.
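
As an illustration of these operations, here is a minimal sketch on a small 3x4 array; the variable names are arbitrary.

```python
import numpy as np

# Build a 2D array with 3 rows and 4 columns from the integers 0..11.
arr = np.arange(12).reshape(3, 4)
print(arr.ndim, arr.shape)

# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row).
col_sums = arr.sum(axis=0)
row_means = arr.mean(axis=1)

# Transposing swaps the axes; slicing selects a sub-block.
transposed = arr.T
block = arr[1:, :2]
print(col_sums, row_means, transposed.shape)
print(block)
```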



5. What are some advanced features of NumPy?
==> Some advanced features include:  
- **Vectorized operations** for fast computations  
- **Broadcasting** for handling different-sized arrays  
- **Masked arrays** for missing data handling  
- **Linear algebra functions** (`numpy.linalg`)  
- **Random sampling** (`numpy.random`)  
- **FFT (Fast Fourier Transform)** for signal processing  
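
The sketch below touches three of the listed features (masked arrays, `numpy.linalg`, and `numpy.random`). The data, the choice of `-1.0` as a missing-value marker, and the seed are assumptions made for the example.

```python
import numpy as np

# Masked array: treat the sentinel value -1.0 as missing data.
readings = np.ma.masked_equal([3.0, -1.0, 5.0], -1.0)
mean_valid = readings.mean()  # average of the unmasked values only

# Linear algebra: solve the system 2x + y = 5, x + 3y = 10.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)

# Random sampling with a seeded generator for reproducibility.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)

print(mean_valid, solution, samples.shape)
```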



6. What is a Pandas DataFrame?
==> A DataFrame is a two-dimensional labeled data structure in Pandas, similar to an Excel spreadsheet or SQL table. It allows data manipulation, indexing, filtering, and analysis.



7. Explain the use of the `groupby()` method in Pandas.
==> `groupby()` splits the data into groups based on one or more columns (or other keys), applies a function such as `sum`, `mean`, or `count` to each group, and combines the results. It is useful for summarizing large datasets.
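
A minimal sketch of grouping and aggregating; the `sales` table and its column names are hypothetical.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "amount": [100, 200, 150, 70],
})

# One aggregate per group.
totals = sales.groupby("region")["amount"].sum()

# Several aggregates per group at once.
summary = sales.groupby("region")["amount"].agg(["mean", "count"])

print(totals)
print(summary)
```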



8. What is the significance of hierarchical indexing in Pandas?
==> Hierarchical indexing allows multiple index levels in a DataFrame or Series, enabling multi-dimensional data representation and efficient data retrieval.
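
To illustrate, the following sketch builds a two-level index on a Series. The level names and revenue figures are invented for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 14, 16], index=index)

# Selecting on the outer level returns the inner level as the index.
year_2024 = revenue.loc["2024"]

# A full tuple selects a single value.
single = revenue.loc[("2023", "Q2")]

# unstack() pivots an index level into columns, giving a 2D view.
wide = revenue.unstack("quarter")

print(year_2024)
print(single)
print(wide)
```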



9. What is the purpose of the `describe()` function in Pandas?
==> `describe()` provides summary statistics (count, mean, standard deviation, min, max, and quartiles) for the numerical columns of a DataFrame by default; passing `include='all'` also summarizes non-numeric columns.
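
A small sketch of the default behavior on a frame with one numeric and one text column (both columns are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [10, 20, 30, 40],
    "label": ["w", "x", "y", "z"],
})

# With mixed column types, describe() summarizes only the numeric ones.
stats = df.describe()
print(stats)

# Individual statistics can be read by row label.
print(stats.loc["mean", "A"], stats.loc["50%", "A"])
```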



10. Why is handling missing data important in Pandas?
==> Missing values can distort statistics and cause errors in downstream computations, so handling them protects the integrity of an analysis. Pandas provides `dropna()`, `fillna()`, and `interpolate()` to manage missing values.
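
The three approaches can be compared on a toy Series. The values and the fill value `0` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.isna().sum())  # number of missing entries

dropped = s.dropna()            # remove missing entries
filled = s.fillna(0)            # replace them with a constant
interpolated = s.interpolate()  # linear interpolation between neighbors

print(dropped.tolist())
print(filled.tolist())
print(interpolated.tolist())
```

Which strategy is appropriate depends on the data: dropping loses rows, a constant can bias averages, and interpolation assumes the values vary smoothly.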




11. Explain the difference between `apply()` and `map()` in Pandas.
==>
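- `apply()`: Available on both DataFrames and Series. On a DataFrame it passes each column (or each row, with `axis=1`) to the function.  
- `map()`: Available on a Series for element-wise transformations; it accepts a function, a dictionary, or another Series.  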
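
A minimal sketch contrasting the two, on a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# apply() on a DataFrame: the function receives a whole column...
col_ranges = df.apply(lambda col: col.max() - col.min())

# ...or a whole row when axis=1.
row_totals = df.apply(lambda row: row["A"] + row["B"], axis=1)

# map() on a Series: the mapping is applied to each element.
labels = df["A"].map({1: "one", 2: "two", 3: "three"})
squared = df["A"].map(lambda v: v ** 2)

print(col_ranges.tolist(), row_totals.tolist())
print(labels.tolist(), squared.tolist())
```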




12. How does Pandas simplify time series analysis?  
==> Pandas provides built-in date-time functions, resampling, frequency conversion, time zone handling, and rolling-window operations, making time series analysis easier.
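
A brief sketch of resampling, rolling windows, and time zone handling on a six-day toy series (dates and values are made up):

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Resample daily data into 2-day bins and sum each bin.
two_day = ts.resample("2D").sum()

# 3-day rolling mean; the first two entries are NaN (window not yet full).
rolling = ts.rolling(window=3).mean()

# Attach a time zone to the naive timestamps.
localized = ts.tz_localize("UTC")

print(two_day.tolist())
print(rolling.tolist())
print(localized.index.tz)
```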




13. What is the role of a pivot table in Pandas?  
==> Pivot tables summarize data by aggregating values based on specified columns, similar to Excel pivot tables.  
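
As a sketch, the hypothetical `sales` table below is pivoted so that regions become rows, products become columns, and duplicate combinations are summed.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "product": ["pen", "pen", "ink", "pen", "ink"],
    "amount": [10, 5, 20, 30, 40],
})

# Rows: region, columns: product, cells: total amount.
table = pd.pivot_table(
    sales, values="amount", index="region", columns="product", aggfunc="sum"
)
print(table)
```

The two East/pen rows are combined into a single cell, which is the aggregation step that distinguishes a pivot table from a plain reshape.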




14. Why is Seaborn preferred for statistical visualizations?
==> Seaborn provides built-in themes, beautiful visualizations, and functions for statistical data exploration, such as correlation heatmaps and regression plots.



15. What is a heatmap, and when should it be used?
==> A heatmap is a graphical representation of data using colors to indicate values. It is used for correlation matrices, confusion matrices, and visualizing large datasets.



16. What is the role of Seaborn’s `pairplot()` function?
==> `pairplot()` draws a grid of scatter plots for every pair of numerical features in a dataset, with each feature's own distribution on the diagonal, helping to identify relationships and patterns.




17. What are some common use cases for Seaborn?
==>  
- Correlation heatmaps  
- Distribution analysis (`histplot()`, `kdeplot()`; the older `distplot()` is deprecated)  
- Regression analysis (`regplot()`)  
- Categorical plots (`boxplot()`, `violinplot()`)  
- Pairwise relationships (`pairplot()`)  



18. How does Matplotlib differ from Plotly?
==>  
- **Matplotlib**: Primarily static plots, with fine-grained control over every element of the figure.  
- **Plotly**: Interactive, web-based visualizations with built-in interactivity.  




19. What are the benefits of using Plotly for data visualization?
==>
- Interactive plots (zoom, hover, filter)  
- Web compatibility (HTML-based)  
- Supports 3D visualizations  
- Easy to integrate with Dash for dashboards




20. What is the role of Bokeh in data visualization?  
==> Bokeh is a Python library for interactive visualizations, focusing on web-based dashboards and streaming data support.




21. Why is NumPy’s array slicing faster than Python’s list slicing?
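==> Basic NumPy slices return **views** onto the original buffer instead of copies, so slicing takes constant time regardless of the slice length, and modifying the view modifies the original array. Slicing a Python list builds a new list and copies a reference to every selected element, which costs time proportional to the slice length.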
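
The difference in behavior can be checked directly. This is a small sketch with arbitrary values:

```python
import numpy as np

# NumPy: a basic slice is a view that shares memory with the original.
arr = np.arange(5)
view = arr[1:4]
view[0] = 99
print(arr)                           # the original array is modified
print(np.shares_memory(arr, view))

# Python list: a slice is a new list, so the original is untouched.
lst = [0, 1, 2, 3, 4]
sub = lst[1:4]
sub[0] = 99
print(lst)
```

When an independent copy of an array slice is needed, `arr[1:4].copy()` makes one explicitly.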




22. What does the term “vectorized operation” mean in NumPy?
==> A vectorized operation allows element-wise computations without explicit loops, leveraging optimized C implementations for speed.
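
For comparison, here is the same computation written as an explicit Python loop and as a vectorized expression. The array size is an arbitrary choice.

```python
import numpy as np

values = np.arange(1000)

# Explicit loop: each element is processed by the Python interpreter.
squares_loop = np.array([v * v for v in values])

# Vectorized: one expression, with the loop running in compiled code.
squares_vec = values ** 2

print(np.array_equal(squares_loop, squares_vec))
```

Both produce the same result; on large arrays the vectorized form is typically much faster, though the exact speedup depends on the machine and array size.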



# 1. How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Sum of each row
row_sums = arr.sum(axis=1)
print(row_sums)



# 2. Write a Pandas script to find the mean of a specific column in a DataFrame
import pandas as pd

data = {'A': [10, 20, 30, 40], 'B': [50, 60, 70, 80]}
df = pd.DataFrame(data)

# Mean of column 'A'
mean_A = df['A'].mean()
print(mean_A)




# 3. Create a scatter plot using Matplotlib
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]

plt.scatter(x, y, color='red', marker='o')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot Example')
plt.show()


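# 4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
# The correlation matrix itself is computed with pandas; Seaborn draws the heatmap.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt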


df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40], 'C': [5, 15, 25, 35]})

# Compute correlation matrix
corr_matrix = df.corr()

# Create a heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix Heatmap")
plt.show()



# 5. Generate a bar plot using Plotly
import plotly.express as px

df = pd.DataFrame({'Category': ['A', 'B', 'C'], 'Values': [10, 20, 30]})
fig = px.bar(df, x='Category', y='Values', title="Bar Plot Example")
fig.show()


# 6. Create a DataFrame and add a new column based on an existing column
df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [50, 60, 70, 80]})
df['C'] = df['A'] * 2  # New column 'C' is twice column 'A'
print(df)



# 7. Write a program to perform element-wise multiplication of two NumPy arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Element-wise multiplication
result = arr1 * arr2
print(result)



# 8. Create a line plot with multiple lines using Matplotlib
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.plot(x, y1, label='sin(x)', color='blue')
plt.plot(x, y2, label='cos(x)', color='green')

plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot with Multiple Lines')
plt.legend()
plt.show()



# 9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold
df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [50, 60, 70, 80]})
filtered_df = df[df['A'] > 15]  # Keep rows where 'A' is greater than 15
print(filtered_df)



# 10. Create a histogram using Seaborn to visualize a distribution
df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [50, 60, 70, 80]})
sns.histplot(df['A'], bins=5, kde=True, color='blue')
plt.title("Histogram of Column A")
plt.show()


# 11. Perform matrix multiplication using NumPy
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix multiplication
result = np.dot(matrix1, matrix2)
print(result)



# 12. Use Pandas to load a CSV file and display its first 5 rows
df = pd.read_csv("data.csv")  # Replace with actual CSV file path
print(df.head())



# 13. Create a 3D scatter plot using Plotly
import plotly.graph_objects as go

fig = go.Figure(data=[go.Scatter3d(
    x=[1, 2, 3, 4, 5],
    y=[10, 20, 30, 40, 50],
    z=[5, 15, 25, 35, 45],
    mode='markers',
    marker=dict(size=8, color=[5, 10, 15, 20, 25], colorscale='Viridis')
)])

fig.update_layout(title="3D Scatter Plot")
fig.show()
