# Data Toolkit

1. What is NumPy, and why is it widely used in Python?

-> NumPy (Numerical Python) is a Python library for numerical computations. It provides powerful N-dimensional arrays, mathematical functions, linear algebra, Fourier transforms, and random number generation.
It is widely used because:

Faster than Python lists (implemented in C).

Supports vectorized operations.

Foundation for data science and ML libraries (Pandas, SciPy, Scikit-learn, TensorFlow).

2. How does broadcasting work in NumPy?

-> Broadcasting lets NumPy perform operations on arrays of different but compatible shapes by virtually stretching the smaller array (along dimensions of size 1) to match the larger one, without copying data.
Example:

import numpy as np

a = np.array([1, 2, 3])
b = 2
a + b   # → [3,4,5]


Here, b is broadcast to match the shape of a.
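Broadcasting also applies between arrays, not only between an array and a scalar. The sketch below uses made-up values (matrix, row and col are illustrative names) to show a row vector and a column vector each being stretched to fit a 3x3 matrix.

```python
import numpy as np

# Illustrative data only: a 3x3 matrix, a row vector and a column vector.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
row = np.array([10, 20, 30])            # shape (3,)
col = np.array([[100], [200], [300]])   # shape (3, 1)

# (3, 3) + (3,): the row is repeated for every row of the matrix.
row_added = matrix + row

# (3, 3) + (3, 1): the column is repeated for every column of the matrix.
col_added = matrix + col

print(row_added)
print(col_added)
```

Shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.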

3. What is a Pandas DataFrame?

-> A DataFrame is a two-dimensional labeled data structure in Pandas, like a table (rows + columns).

Columns = Series objects

Supports indexing, filtering, aggregation, merging, reshaping.

Similar to Excel or SQL tables.
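As a minimal sketch of these points, the example below builds a small DataFrame from invented data, pulls out one column as a Series, and filters rows with a boolean condition.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Marks": [85, 90, 95],
})

marks = df["Marks"]           # selecting a single column returns a Series
high = df[df["Marks"] > 85]   # boolean filtering keeps only matching rows

print(type(marks).__name__)
print(df.shape)
print(high)
```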

4. Explain the use of the groupby() method in Pandas.

-> groupby() splits data into groups, applies functions, and combines results.

Split → Apply → Combine.
Example:

df.groupby("Category")["Sales"].mean()


This gives average sales per category.
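To make the one-liner above runnable, here is a small sketch with invented sales figures; the column names match the example, but the data is purely illustrative.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Category": ["Toys", "Books", "Toys", "Books"],
    "Sales": [100, 40, 300, 60],
})

# Split by Category, apply mean() to Sales, combine into one Series.
avg_sales = df.groupby("Category")["Sales"].mean()
print(avg_sales)
```

The result is a Series indexed by category, so avg_sales["Toys"] looks up a single group's mean.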

5. Why is Seaborn preferred for statistical visualizations?

-> Seaborn is built on Matplotlib but provides:

Predefined themes & styles.

High-level functions (boxplot, violinplot, pairplot).

Automatic handling of Pandas DataFrames.

Clearer, more attractive statistical plots.

6. Differences between NumPy arrays and Python lists:

-> NumPy Arrays: Fixed size, homogeneous data type, faster, support vectorized operations.

Python Lists: Dynamic size, heterogeneous data, slower for numerical work (element-wise operations need explicit loops).
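One quick way to see the difference is to apply the same operator to both types. The sketch below uses illustrative variable names and shows that * repeats a list but multiplies the elements of an array.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

doubled_list = py_list * 2     # repetition: the list is concatenated with itself
doubled_array = np_array * 2   # arithmetic: each element is multiplied by 2

mixed = [1, "two", 3.0]        # a list can hold values of different types

print(doubled_list)
print(doubled_array)
print(np_array.dtype)          # one dtype for the whole array (exact integer type is platform-dependent)
```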

7. What is a heatmap, and when should it be used?

-> A heatmap is a graphical representation of data using colors.
Use when:

Visualizing correlation matrices.

Representing intensity values in grids.

Highlighting patterns in large datasets.

8. What does “vectorized operation” mean in NumPy?

-> Vectorization means applying operations on entire arrays without loops.
Example:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b   # → [5,7,9]


This is faster than looping in Python.
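For comparison, the sketch below computes the same squares twice, once with an explicit Python loop and once with a single vectorized expression. The variable names are illustrative, and no timings are claimed here since they depend on the machine and array size.

```python
import numpy as np

values = np.arange(5)   # [0 1 2 3 4]

# Loop version: one Python-level operation per element.
squared_loop = np.empty_like(values)
for i in range(len(values)):
    squared_loop[i] = values[i] ** 2

# Vectorized version: one expression, with the loop running inside NumPy.
squared_vec = values ** 2

print(squared_loop)
print(squared_vec)
```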

9. How does Matplotlib differ from Plotly?

-> Matplotlib: Produces static plots by default (bar, line, scatter) and offers fine-grained control, at the cost of more code for customization.

Plotly: Interactive plots (zoom, hover, tooltips) with dashboards and web integration.

10. Significance of hierarchical indexing in Pandas:

-> Hierarchical indexing (MultiIndex) allows multiple index levels on rows or columns.
Useful for:

Working with multi-dimensional data.

Simplifying group and pivot operations.
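A minimal sketch of a two-level row index follows, using invented regional sales data. It shows selecting by the outer level, selecting a single cell with a tuple key, and moving the inner level into columns with unstack().

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Year": [2023, 2024, 2023, 2024],
    "Sales": [100, 120, 80, 95],
}).set_index(["Region", "Year"])

east = df.loc["East"]                          # all rows under the outer key "East"
west_2024 = df.loc[("West", 2024), "Sales"]    # one cell, addressed by both levels
wide = df["Sales"].unstack("Year")             # inner level becomes the columns

print(east)
print(west_2024)
print(wide)
```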

11. Role of Seaborn’s pairplot() function:

-> pairplot() creates pairwise scatterplots for all numerical columns and histograms on the diagonal.

Useful for exploring relationships & distributions in datasets.

12. Purpose of describe() function in Pandas:

-> df.describe() generates summary statistics (count, mean, std, min, quartiles, max) for numerical columns.
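As a small illustration with invented data, the sketch below calls describe() on a DataFrame that has one text column and one numeric column. By default only the numeric column appears in the summary.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [85, 90, 95]})

summary = df.describe()   # rows: count, mean, std, min, 25%, 50%, 75%, max
print(summary)
```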

13. Why is handling missing data important in Pandas?

-> Missing data can bias results, cause errors, or reduce accuracy.

Pandas provides fillna(), dropna(), and interpolate() for handling it.
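The three methods behave differently on the same input. The sketch below applies each to a short Series with two gaps; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()             # removes the missing entries
filled = s.fillna(0)             # replaces them with a constant
interpolated = s.interpolate()   # linear interpolation between neighbours

print(dropped.tolist())
print(filled.tolist())
print(interpolated.tolist())
```

Which one is appropriate depends on the data: dropping loses rows, a constant fill can distort statistics, and interpolation assumes the values change smoothly.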

14. Benefits of using Plotly for data visualization:

-> Interactive and web-friendly.

Supports dashboards and real-time updates.

Easy export to HTML.

Rich chart types (3D plots, maps, gauges).

15. How does NumPy handle multidimensional arrays?

-> NumPy provides ndarray objects which can be 1D, 2D, 3D, or higher.

Uses efficient memory storage.

Supports indexing, slicing, reshaping, transposing.
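The sketch below builds a 3D array from a flat range and applies the operations listed above. The name cube and its shape are illustrative choices.

```python
import numpy as np

# 24 consecutive integers arranged as 2 blocks of 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)

last = cube[1, 2, 3]                  # indexing: one index per dimension
first_block = cube[0]                 # a 2D slice of shape (3, 4)
first_cols = cube[:, :, 0]            # first column of every block, shape (2, 3)
swapped = cube.transpose(2, 0, 1)     # reorder the axes, shape (4, 2, 3)
table = cube.reshape(6, 4)            # same data viewed as a 6x4 table

print(cube.ndim, cube.shape)
print(last)
print(swapped.shape, table.shape)
```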

16. Role of Bokeh in data visualization:

-> Bokeh is a Python library for interactive visualizations in the browser.

Generates plots in HTML/JavaScript.

Good for dashboards, streaming, and big data.

17. Difference between apply() and map() in Pandas:

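-> map(): A Series method that transforms values element-wise using a function, dict, or Series.

apply(): Works on a Series (element-wise) or on a DataFrame, where the function receives each column (axis=0) or each row (axis=1).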
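A short sketch of both methods on invented data follows: map() with a lookup dict on one column, then apply() on a DataFrame along each axis. The column names and the lookup table are hypothetical.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Math": [50, 70, 90],
    "Art": [60, 80, 40],
})

# Series.map: element-wise lookup through a dict.
full_names = df["Name"].map({"A": "Alice", "B": "Bob", "C": "Cara"})

scores = df[["Math", "Art"]]

# DataFrame.apply with the default axis=0: the function receives each column.
col_range = scores.apply(lambda col: col.max() - col.min())

# DataFrame.apply with axis=1: the function receives each row.
row_total = scores.apply(lambda row: row["Math"] + row["Art"], axis=1)

print(full_names.tolist())
print(col_range)
print(row_total.tolist())
```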

18. Some advanced features of NumPy:

-> Linear algebra operations (np.linalg).

FFT (Fast Fourier Transform).

Random number generation (np.random).

Memory-efficient broadcasting.

Integration with C/C++ and Fortran.
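A few of these features can be sketched in a handful of lines. The example below solves a small linear system, finds the dominant frequency of a synthetic sine wave, and draws seeded random samples; all inputs are invented for illustration.

```python
import numpy as np

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)

# FFT: a sine wave completing 3 cycles over 32 samples peaks at frequency bin 3.
t = np.arange(32) / 32
signal = np.sin(2 * np.pi * 3 * t)
spectrum = np.fft.rfft(signal)
peak_bin = int(np.argmax(np.abs(spectrum)))

# Random numbers: a seeded Generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=1000)

print(solution)
print(peak_bin)
print(samples.shape)
```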

19. How does Pandas simplify time series analysis?

-> Date parsing & indexing (pd.to_datetime).

Resampling (resample() for weekly, monthly).

Rolling & expanding windows.

Easy shifting & lagging of data.
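The sketch below runs these operations on six invented daily readings. It resamples into 2-day bins rather than weeks or months only to keep the example small; the same resample() call accepts other frequencies.

```python
import pandas as pd

# Illustrative daily readings starting on a parsed date.
start = pd.to_datetime("2024-01-01")
index = pd.date_range(start, periods=6, freq="D")
readings = pd.Series([10, 20, 30, 40, 50, 60], index=index)

two_day_totals = readings.resample("2D").sum()     # aggregate into 2-day bins
rolling_mean = readings.rolling(window=3).mean()   # 3-day moving average
previous_day = readings.shift(1)                   # lag the series by one day
day_over_day = readings - previous_day             # change from the previous day

print(two_day_totals)
print(rolling_mean)
print(day_over_day)
```

The first entries of rolling_mean and previous_day are NaN because there is not yet enough history to fill the window or the lag.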

20. Role of a pivot table in Pandas:

-> A pivot table summarizes data by grouping on row and column keys and aggregating the values, much like a pivot table in Excel.

df.pivot_table(values="Sales", index="Region", columns="Category", aggfunc="sum")


This gives sales summary by region & category.
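To make the call above runnable, here is a sketch with invented sales records. The column names match the example, but the data is purely illustrative.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Category": ["Toys", "Books", "Toys", "Books", "Toys"],
    "Sales": [100, 40, 300, 60, 50],
})

# One row per Region, one column per Category, cells hold summed Sales.
table = df.pivot_table(values="Sales", index="Region",
                       columns="Category", aggfunc="sum")
print(table)
```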

21. Why is NumPy’s array slicing faster than Python’s list slicing?

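-> NumPy arrays store their elements in a single contiguous block of memory.

Basic slicing returns a view onto that block instead of a copy, so no data is duplicated.

Python lists store references to objects, and slicing a list builds a new list by copying those references.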
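The view-versus-copy difference is easy to observe directly. In the sketch below (variable names are illustrative), writing through an array slice changes the original array, while writing to a list slice does not change the original list.

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[2:5]         # a view: shares memory with arr
arr_slice[0] = 99            # the write is visible through arr as well

lst = list(range(10))
lst_slice = lst[2:5]         # a new list holding copied references
lst_slice[0] = 99            # lst is unaffected

independent = arr[5:8].copy()   # an explicit copy when isolation is needed

print(arr[2], lst[2])
print(np.shares_memory(arr, arr_slice))
print(np.shares_memory(arr, independent))
```

Because a view aliases the original data, call copy() before modifying a slice that should stay independent.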

22. Common use cases for Seaborn:

-> Visualizing correlations (heatmap).

Exploring distributions (histogram, KDE, boxplot, violinplot).

Pairwise relationships (pairplot).

Categorical comparisons (barplot, countplot).

# How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np

# Creating a 2D NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Sum of each row
row_sum = np.sum(arr, axis=1)
print("2D Array:\n", arr)
print("Sum of each row:", row_sum)


Output:

2D Array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Sum of each row: [ 6 15 24]

# Write a Pandas script to find the mean of a specific column in a DataFrame
import pandas as pd

# Create DataFrame
data = {'Name': ['A', 'B', 'C'], 'Marks': [85, 90, 95]}
df = pd.DataFrame(data)

# Find mean of "Marks"
mean_marks = df['Marks'].mean()
print("Mean of Marks:", mean_marks)


Output:

Mean of Marks: 90.0

# Create a scatter plot using Matplotlib
import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 6, 9]
y = [99, 86, 87, 88, 100, 86]

plt.scatter(x, y)
plt.title("Scatter Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()


Output: Scatter plot of the six points (shown inline in a notebook, or in a separate window when run as a script).

# How do you calculate a correlation matrix with Pandas and visualize it with a Seaborn heatmap?
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame
df = pd.DataFrame(np.random.rand(5, 4), columns=list("ABCD"))

# Correlation matrix + Heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()


Output: Heatmap with correlation values.

# Generate a bar plot using Plotly
import plotly.express as px

data = {'Fruits': ['Apple', 'Banana', 'Orange'], 'Count': [10, 15, 7]}
fig = px.bar(data, x='Fruits', y='Count', title="Fruit Count")
fig.show()


Output: Interactive bar chart of fruit counts.

# Create a DataFrame and add a new column based on an existing column
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Marks': [50, 70, 90]})
df['Grade'] = df['Marks'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')
print(df)


Output:

  Name  Marks Grade
0    A     50  Fail
1    B     70  Pass
2    C     90  Pass

# Write a program to perform element-wise multiplication of two NumPy arrays
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = a * b
print("Element-wise multiplication:", result)


Output:

Element-wise multiplication: [ 4 10 18]

# Create a line plot with multiple lines using Matplotlib
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [1, 2, 3, 4, 5]

plt.plot(x, y1, label="Line 1")
plt.plot(x, y2, label="Line 2")
plt.legend()
plt.title("Multiple Line Plot")
plt.show()


Output: Line plot with two lines.

# Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Marks': [50, 75, 40, 90]})
filtered = df[df['Marks'] > 60]
print(filtered)


Output:

  Name  Marks
1    B     75
3    D     90

# Create a histogram using Seaborn to visualize a distribution
import seaborn as sns
import matplotlib.pyplot as plt

data = [1,2,2,3,3,3,4,4,4,4,5,5,6]
sns.histplot(data, bins=5, kde=True)
plt.title("Histogram Example")
plt.show()


Output: Histogram with KDE curve.

# Perform matrix multiplication using NumPy
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

result = a @ b  # matrix product; same as np.dot(a, b) for 2-D arrays
print("Matrix Multiplication:\n", result)


Output:

Matrix Multiplication:
 [[19 22]
 [43 50]]

# Use Pandas to load a CSV file and display its first 5 rows
import pandas as pd

# Assuming a CSV file "data.csv" exists
df = pd.read_csv("data.csv")
print(df.head())


Output: First 5 rows of the CSV file.

# Create a 3D scatter plot using Plotly
import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    'x': [1,2,3,4,5],
    'y': [10,20,30,40,50],
    'z': [5,15,25,35,45],
    'color': ['A','B','A','B','A']
})

fig = px.scatter_3d(df, x='x', y='y', z='z', color='color', size='z', title="3D Scatter Plot")
fig.show()


Output: Interactive 3D scatter plot.