#Data Toolkit

1. What is NumPy, and why is it widely used in Python?
 - NumPy, short for Numerical Python, is a fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a rich collection of mathematical functions to operate on these arrays. It's widely used for its efficiency, powerful features, and seamless integration with other libraries.

2. How does broadcasting work in NumPy?
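 - Broadcasting lets NumPy perform element-wise operations on arrays of different but compatible shapes. Shapes are compared from the trailing dimension backward, and two dimensions are compatible when they are equal or when one of them is 1. The size-1 (or missing) dimension is treated as if it were stretched to match the other array, without actually copying data. This keeps computations memory-efficient and removes the need for explicit loops.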
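
A minimal sketch of broadcasting in practice. The array names and values are illustrative assumptions, not part of the original answer:

```python
import numpy as np

# Illustrative arrays (assumed for this sketch).
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The 1-D row is broadcast across both rows of the matrix.
row_shifted = matrix + row
print(row_shifted)      # [[11 22 33] [14 25 36]]

# The (2, 1) column is broadcast across all three columns.
column_shifted = matrix + column
print(column_shifted)   # [[101 102 103] [204 205 206]]
```

Shapes that cannot be aligned this way, such as (2, 3) with (2,), raise a ValueError instead of being stretched.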


3. What is a Pandas DataFrame?
 - A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a table in a database or an Excel spreadsheet and is a core data structure in the Pandas library, used for data manipulation and analysis.


4. Explain the use of the groupby() method in Pandas.
 - The groupby() method in Pandas is used to split data into groups based on some criteria. Once the data is split, we can apply various aggregation functions (e.g., sum, mean) to each group independently, making it easy to analyze and summarize data.
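
A short split-apply-combine sketch. The sales DataFrame and its column names are made up for illustration:

```python
import pandas as pd

# Hypothetical sales data (assumed for this sketch).
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [100, 150, 200, 50, 75],
})

# Split by region, then sum the amounts within each group.
totals = sales.groupby("region")["amount"].sum()
print(totals)

# Several aggregations at once, one column per statistic.
summary = sales.groupby("region")["amount"].agg(["mean", "count"])
print(summary)
```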


5. Why is Seaborn preferred for statistical visualizations?
 - Seaborn is preferred for statistical visualizations because it provides a high-level interface for drawing attractive and informative statistical graphics. It is built on top of Matplotlib and integrates well with Pandas DataFrames. Seaborn simplifies the process of creating complex visualizations like heatmaps, violin plots, and pair plots.

6. What are the differences between NumPy arrays and Python lists?

 - NumPy arrays are more efficient for numerical computations because they have a fixed size and store elements of a single (homogeneous) data type in a contiguous block of memory.

 - NumPy arrays support element-wise arithmetic, whereas Python lists do not: on a list, * repeats the list and + concatenates.

 - NumPy arrays offer broadcasting, multidimensional slicing, and vectorized operations, which improve both performance and productivity.
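
The difference is easiest to see side by side. This is a small sketch with assumed example values:

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

# On a list, * means repetition; on an array, it is element-wise.
repeated_list = values_list * 2
doubled_array = values_array * 2
print(repeated_list)   # [1, 2, 3, 1, 2, 3]
print(doubled_array)   # [2 4 6]

# Arrays hold one dtype: mixing ints and floats upcasts everything to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)     # float64
```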


7. What is a heatmap, and when should it be used?
 - A heatmap is a graphical representation of data where individual values are represented as colors. Heatmaps are useful for visualizing the intensity of data across two dimensions and are commonly used in correlation matrices, geographical data, and visualizing clustering results.


8. What does the term “vectorized operation” mean in NumPy?
 - Vectorized operations refer to performing element-wise computations on entire arrays without explicit loops. These operations leverage low-level optimizations and hardware acceleration, making them significantly faster and more efficient than traditional looping methods.
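
As a rough illustration, the same temperature conversion can be written with an explicit loop or as one vectorized expression. The data and variable names are assumptions for this sketch:

```python
import numpy as np

celsius = np.array([0.0, 25.0, 100.0])

# Explicit Python loop: one element at a time.
loop_result = np.empty_like(celsius)
for i, c in enumerate(celsius):
    loop_result[i] = c * 9 / 5 + 32

# Vectorized: the arithmetic is applied to the whole array at once.
vectorized_result = celsius * 9 / 5 + 32

print(vectorized_result)   # [ 32.  77. 212.]
```

Both versions give the same values; on large arrays the vectorized form is typically much faster because the loop runs in compiled code rather than in the Python interpreter.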


9. How does Matplotlib differ from Plotly?

 - Matplotlib primarily produces static figures and is ideal for creating publication-quality charts and visualizations; any interactivity depends on the backend in use and is comparatively limited.

 - Plotly, on the other hand, is an interactive plotting library that supports zooming, panning, and real-time data updates. Plotly is more suitable for web applications and interactive dashboards.

10. What is the significance of hierarchical indexing in Pandas?
 - Hierarchical indexing allows for multi-level indexing of data, enabling us to work with data in a more flexible and powerful way. It is particularly useful for working with higher-dimensional data in a two-dimensional DataFrame, allowing for easier selection, aggregation, and reshaping.
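
A minimal sketch of a two-level index, using made-up revenue figures and level names:

```python
import pandas as pd

# Hypothetical (year, quarter) index, assumed for illustration.
index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

# Selection: every quarter of one year via the outer level.
revenue_2024 = revenue.loc["2024"]
print(revenue_2024)

# Aggregation: total per year.
yearly = revenue.groupby(level="year").sum()
print(yearly)

# Reshaping: move the inner level into columns.
wide = revenue.unstack("quarter")
print(wide)
```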

11. What is the role of Seaborn’s pairplot() function?
 - The pairplot() function in Seaborn creates a matrix of scatter plots (and optionally histograms or KDE plots) for each pair of variables in a dataset. It is useful for exploring relationships between variables and identifying patterns or correlations.

12. What is the purpose of the describe() function in Pandas?
 - The describe() function in Pandas generates descriptive statistics for a DataFrame; by default it covers only the numerical columns. For each column it reports the count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum, summarizing the central tendency and dispersion of the data. The median is not listed under its own name; it appears as the 50% row.
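
A quick sketch of the output layout, using an assumed single-column DataFrame:

```python
import pandas as pd

# Illustrative data (assumed for this sketch).
scores = pd.DataFrame({"score": [1, 2, 3, 4, 5]})

stats = scores.describe()
print(stats)

# Individual statistics can be read by row label.
print(stats.loc["mean", "score"])   # 3.0
print(stats.loc["50%", "score"])    # 3.0 (the median)
```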


13. Why is handling missing data important in Pandas?
 - Handling missing data is crucial because, left unaddressed, missing values can lead to inaccurate analyses, misleading results, and biased models. Pandas provides methods to detect (isna()), remove (dropna()), and impute (fillna(), interpolate()) missing values, helping preserve the integrity and reliability of the dataset.
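
The three steps (detect, remove, impute) might look like this on a small example. The DataFrame and the choice of fill values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
people = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["Pune", "Delhi", None],
})

# Detect: count missing values per column.
missing_counts = people.isna().sum()
print(missing_counts)

# Remove: keep only rows with no missing values.
complete_rows = people.dropna()

# Impute: mean for the numeric column, a placeholder for the text column.
filled = people.fillna({"age": people["age"].mean(), "city": "unknown"})
print(filled)
```

Whether to drop or impute depends on how much data is missing and why; mean imputation is only one simple option.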

14. What are the benefits of using Plotly for data visualization?

 - Interactive visualizations with zooming, panning, and tooltips.

 - Supports a wide range of chart types and 3D plots.

 - Seamless integration with web applications and Jupyter notebooks.

 - Easy to create and share dashboards and reports.


15. How does NumPy handle multidimensional arrays?
 - NumPy efficiently handles multidimensional arrays using the ndarray object. It supports advanced indexing, slicing, and reshaping operations, allowing for flexible manipulation of multi-dimensional data. NumPy also provides functions for mathematical operations across different axes.
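
A small sketch of reshaping, indexing, and axis-wise reduction on a 3-D array. The shape and names are chosen only for illustration:

```python
import numpy as np

# 24 consecutive integers arranged as 2 blocks of 3 rows by 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape)    # 3 (2, 3, 4)

# Indexing with one index per axis.
last_item = cube[1, 2, 3]
print(last_item)                # 23

# Slicing: the first row of every block.
first_rows = cube[:, 0, :]
print(first_rows.shape)         # (2, 4)

# Reductions along a chosen axis.
block_sum = cube.sum(axis=0)    # collapses the block axis
row_sums = cube.sum(axis=2)     # collapses the column axis
print(block_sum.shape, row_sums.shape)   # (3, 4) (2, 3)
```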


16. What is the role of Bokeh in data visualization?
 - Bokeh is a powerful interactive visualization library for creating web-based dashboards and plots. It offers a high-level interface for creating visually appealing and interactive plots, making it easy to build complex visualizations and integrate them into web applications.


17. Explain the difference between apply() and map() in Pandas.

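 - apply() applies a function along an axis of a DataFrame (each column by default, or each row with axis=1) and also works on a Series. It is versatile and can be used for both element-wise and aggregate operations.

 - map() performs element-wise transformations on a Series and accepts a function, a dictionary, or another Series, which makes it convenient for replacing or recoding values. The element-wise equivalent for a whole DataFrame is applymap(), which pandas 2.1 deprecated in favor of DataFrame.map().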
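
The contrast can be sketched on a small DataFrame. The column names and the lookup dictionary are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# apply() on a DataFrame: the function receives a whole column (axis=0)...
col_ranges = df.apply(lambda col: col.max() - col.min())

# ...or a whole row (axis=1).
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)

# map() on a Series: element by element, with a dict or a function.
labels = df["a"].map({1: "one", 2: "two", 3: "three"})
squared = df["a"].map(lambda x: x ** 2)

print(col_ranges)
print(row_totals)
print(labels)
print(squared)
```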

18. What are some advanced features of NumPy?

 - Broadcasting for efficient computation on arrays of different shapes.

 - Linear algebra functions (linalg module) for matrix operations.

 - Fourier transform capabilities for signal processing.

 - Random number generation for simulations and statistical modeling.

 - Support for masked arrays, enabling operations on incomplete or invalid data.
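
A brief tour of four of these features. All inputs (the linear system, the 3-cycle sine wave, the seed, and the -999.0 sentinel) are assumptions chosen for this sketch:

```python
import numpy as np

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
constants = np.array([5.0, 10.0])
solution = np.linalg.solve(coefficients, constants)
print(solution)            # approximately [1. 3.]

# Fourier transform: a sine wave with 3 cycles over 32 samples
# peaks at frequency bin 3.
t = np.arange(32) / 32
signal = np.sin(2 * np.pi * 3 * t)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))
print(peak_bin)            # 3

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

# Masked arrays: the masked sentinel value is ignored by the mean.
readings = np.ma.masked_array([1.0, -999.0, 3.0], mask=[False, True, False])
print(readings.mean())     # 2.0
```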

19. How does Pandas simplify time series analysis?
 - Pandas simplifies time series analysis with dedicated data structures (DatetimeIndex, PeriodIndex, TimedeltaIndex) and functions for date parsing, resampling, shifting, and rolling window calculations. It also supports time zone handling, frequency conversion, and datetime operations, making it a powerful tool for working with time series data.
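
A minimal sketch of resampling, rolling windows, and shifting on a daily series. The dates and prices are invented for illustration:

```python
import pandas as pd

# Six consecutive days of hypothetical prices.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([10, 11, 12, 13, 14, 15], index=dates)

# Resampling: aggregate daily values into 2-day bins.
two_day_totals = prices.resample("2D").sum()
print(two_day_totals)

# Rolling window: 3-day moving average (the first two entries are NaN).
moving_avg = prices.rolling(window=3).mean()
print(moving_avg)

# Shifting: day-over-day change.
daily_change = prices - prices.shift(1)
print(daily_change)
```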


20. What is the role of a pivot table in Pandas?
 - A pivot table in Pandas is used to summarize and aggregate data, transforming it from a long format to a wide format. It allows for easy grouping, aggregation, and reshaping of data, making it easier to analyze and interpret complex datasets.
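
A small long-to-wide sketch. The DataFrame, its column names, and the choice of sum as the aggregation are assumptions for illustration:

```python
import pandas as pd

# Long format: one row per (region, product) sale.
long_df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "North"],
    "product": ["A", "B", "A", "B", "A"],
    "sales": [10, 20, 30, 40, 5],
})

# Wide format: one row per region, one column per product, sales summed.
wide = long_df.pivot_table(
    index="region", columns="product", values="sales", aggfunc="sum"
)
print(wide)
```

The two North/A rows (10 and 5) are combined into a single cell because of aggfunc="sum".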

21. Why is NumPy’s array slicing faster than Python’s list slicing?
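 - Slicing a NumPy array is faster mainly because basic slicing returns a view: a new array object that shares the original’s memory buffer, so no elements are copied. Slicing a Python list, by contrast, builds a new list and copies a reference to every selected element. NumPy’s compiled C implementation and contiguous, fixed-type storage also keep memory access efficient. Because a view shares memory, modifying it also modifies the original array.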
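
The view-versus-copy behavior can be checked directly. This is a short sketch with assumed example values:

```python
import numpy as np

# NumPy: a basic slice is a view onto the same memory.
numbers = np.arange(5)
view = numbers[1:4]
view[0] = 99
print(numbers)                            # [ 0 99  2  3  4]
print(np.shares_memory(numbers, view))    # True

# Python list: a slice is a new, independent list.
items = list(range(5))
sliced = items[1:4]
sliced[0] = 99
print(items)                              # [0, 1, 2, 3, 4]

# Use .copy() when an independent array is needed.
independent = numbers[1:4].copy()
```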

22. What are some common use cases for Seaborn?

 - Creating statistical visualizations like box plots, violin plots, and pair plots.

 - Visualizing distributions with histograms, KDE plots, and rug plots.

 - Plotting categorical data with bar plots, count plots, and point plots.

 - Generating correlation matrices and heatmaps.

 - Enhancing visualizations with built-in themes and color palettes.


#Practical

1. How do you create a 2D NumPy array and calculate the sum of each row?

'''
import numpy as np

# Create a 2D NumPy array
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Calculate the sum of each row
row_sums = np.sum(array_2d, axis=1)
print(row_sums)
'''

2. Write a Pandas script to find the mean of a specific column in a DataFrame.

'''
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calculate the mean of column 'B'
mean_B = df['B'].mean()
print(mean_B)
'''

3. Create a scatter plot using Matplotlib.

'''
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 35]

# Create scatter plot
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()
'''

4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

'''
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50], 'C': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Calculate the correlation matrix (pandas computes it; Seaborn only draws it)
corr_matrix = df.corr()

# Visualize it with a heatmap
sns.heatmap(corr_matrix, annot=True)
plt.show()
'''

5. Generate a bar plot using Plotly.

'''
import pandas as pd
import plotly.express as px

# Sample data
data = {'Category': ['A', 'B', 'C', 'D'], 'Values': [10, 20, 30, 40]}
df = pd.DataFrame(data)

# Create bar plot
fig = px.bar(df, x='Category', y='Values', title='Bar Plot')
fig.show()
'''

6. Create a DataFrame and add a new column based on an existing column.

'''
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# Add a new column 'B' based on column 'A'
df['B'] = df['A'] * 2
print(df)
'''

7. Write a program to perform element-wise multiplication of two NumPy arrays.

'''
import numpy as np

# Sample arrays
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Element-wise multiplication
result = np.multiply(array1, array2)
print(result)
'''

8. Create a line plot with multiple lines using Matplotlib.

'''
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [10, 20, 25, 30, 35]
y2 = [15, 25, 20, 35, 45]

# Create line plots
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Line Plot with Multiple Lines')
plt.legend()
plt.show()
'''

9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

'''
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Filter rows where column 'B' is greater than 25
filtered_df = df[df['B'] > 25]
print(filtered_df)
'''

10. Create a histogram using Seaborn to visualize a distribution.

'''
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Create histogram
sns.histplot(data, bins=5, kde=True)
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
'''

11. Perform matrix multiplication using NumPy.

'''
import numpy as np

# Sample matrices
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Matrix multiplication
result = np.dot(matrix1, matrix2)
print(result)
'''

12. Use Pandas to load a CSV file and display its first 5 rows.

'''
import pandas as pd

# Load CSV file
df = pd.read_csv('sample.csv')

# Display the first 5 rows
print(df.head())
'''

13. Create a 3D scatter plot using Plotly.

'''
import pandas as pd
import plotly.express as px

# Sample data
data = {'X': [1, 2, 3, 4, 5], 'Y': [10, 20, 30, 40, 50], 'Z': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

# Create 3D scatter plot
fig = px.scatter_3d(df, x='X', y='Y', z='Z', title='3D Scatter Plot')
fig.show()
'''