# Data Toolkit



1. What is NumPy, and why is it widely used in Python?

   - NumPy (Numerical Python) is the core library for numerical and scientific computing in Python.
   - Provides an efficient ndarray object for storing large, homogeneous data.
   - Includes fast mathematical functions, linear algebra, random sampling, FFT, etc.
   - Underlies most data-science packages (Pandas, scikit-learn, TensorFlow, etc.).
   - Chosen for speed, memory efficiency, and integration with other scientific libraries.

2. How does broadcasting work in NumPy?

  - Broadcasting allows NumPy to perform operations on arrays of different shapes by automatically expanding them without making explicit copies.

Example:

import numpy as np
a = np.array([1, 2, 3])      # shape (3,)
b = 5                        # scalar
print(a + b)  # [6 7 8]  (b is broadcast to [5,5,5])


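Broadcasting works when the shapes are compatible: comparing dimensions from the trailing (rightmost) one, each pair must be equal or one of them must be 1. NumPy then virtually stretches the size-1 dimensions to match the larger array, without copying data.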
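As a minimal sketch of broadcasting beyond the scalar case, the snippet below adds a column of shape (3, 1) to a row of shape (3,), and shows that incompatible shapes raise an error. The array names are illustrative only.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

# Shapes are compared from the right: (3, 1) and (3,) broadcast to (3, 3).
grid = col + row
print(grid.shape)
print(grid)

# Shapes (3,) and (2,) are not compatible, so NumPy raises ValueError.
try:
    np.array([1, 2, 3]) + np.array([1, 2])
except ValueError as exc:
    print("Cannot broadcast:", exc)
```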

3. What is a Pandas DataFrame?

  - A DataFrame is a 2D labeled data structure in Pandas, similar to an Excel sheet or SQL table.

  - Rows = records
  - Columns = variables/features
  - It supports heterogeneous data types (each column can hold a different type).

import pandas as pd
df = pd.DataFrame({
    "Name": ["Asha", "Raj", "Meena"],
    "Age": [25, 30, 28]
})

4. Explain the use of the groupby() method in Pandas.

  - groupby() is used to split data into groups, apply a function (like sum, mean), and combine the results.

df.groupby("Category")["Sales"].sum()


 Useful for aggregation, summarization, and analysis.
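The one-liner above assumes a DataFrame with "Category" and "Sales" columns. A small self-contained sketch, using made-up sample data, might look like this:

```python
import pandas as pd

# Hypothetical sample data for illustration.
sales = pd.DataFrame({
    "Category": ["Toys", "Books", "Toys", "Books", "Games"],
    "Sales": [100, 250, 150, 50, 300],
})

# Split by Category, apply sum to each group, combine into one Series.
totals = sales.groupby("Category")["Sales"].sum()
print(totals)

# Several aggregations at once with agg().
summary = sales.groupby("Category")["Sales"].agg(["sum", "mean", "count"])
print(summary)
```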

5. Why is Seaborn preferred for statistical visualizations?

  - Built on top of Matplotlib but with simpler syntax.
  - Integrates with Pandas DataFrames directly.
  - Provides statistical plots like violin plots, boxplots, pairplots, and heatmaps with ease.
  - Default themes improve readability.

6. Differences between NumPy arrays and Python lists?

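| Feature | NumPy Array | Python List |
|---|---|---|
| Speed | Much faster for numerical work | Slower |
| Memory | Compact, fixed type | Larger, mixed types |
| Operations | Vectorized, elementwise | Requires loops |
| Dimensionality | Multi-dimensional | 1D; needs nested lists for more dimensions |
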
7. What is a heatmap, and when should it be used?

  - A heatmap is a visualization where values are represented by colors in a grid.
  - Useful for showing correlations, frequency tables, or matrix data.
  - Example: correlation heatmap in data analysis.

8. What does “vectorized operation” mean in NumPy?

  - Performing operations on entire arrays without explicit loops.
Example:

a = np.array([1,2,3])
b = np.array([4,5,6])
print(a + b)   # [5 7 9]


  Faster, cleaner, and optimized in C under the hood.
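To make the contrast with explicit loops concrete, here is a small sketch that computes the same result both ways. The variable names are illustrative.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Explicit Python loop: one multiplication per iteration.
loop_totals = [p * q for p, q in zip(prices, quantities)]

# Vectorized: a single expression over the whole arrays.
vec_totals = prices * quantities

print(loop_totals)
print(vec_totals)
```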

9. How does Matplotlib differ from Plotly?

  - Matplotlib: Static, publication-quality plots, highly customizable.
  - Plotly: Interactive plots, zoom/pan/hover, better for dashboards and web apps.

10. Significance of hierarchical indexing in Pandas?

  - Also called MultiIndexing.
  - Allows multiple levels of indexing for rows/columns.
  - Useful for complex datasets like panel data or pivot tables.
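A minimal sketch of a two-level index, assuming made-up revenue figures indexed by region and year:

```python
import pandas as pd

# Two index levels: Region (outer) and Year (inner).
index = pd.MultiIndex.from_product(
    [["North", "South"], [2023, 2024]], names=["Region", "Year"]
)
revenue = pd.Series([10, 12, 7, 9], index=index, name="Revenue")

# Select everything under one outer label.
north = revenue.loc["North"]

# Take a cross-section on the inner level.
y2024 = revenue.xs(2024, level="Year")

# Move the inner level into columns to get a wide table.
wide = revenue.unstack("Year")
print(wide)
```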

11. Role of Seaborn’s pairplot() function?

  - Creates a grid of scatterplots for each pair of variables, with histograms on the diagonal.
  Useful for exploring relationships in multivariate datasets.

12. Purpose of the describe() function in Pandas?

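  - Provides a quick statistical summary of a DataFrame or Series. By default it covers numeric columns; pass include="all" to summarize categorical columns as well.
  - Numeric output: count, mean, std, min, quartiles (25%, 50%, 75%), and max.
  - Categorical output: count, unique, top, and freq.

Example: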

df.describe()

13. Why is handling missing data important in Pandas?

  - Missing values can bias results or break models.
  - Pandas provides tools like dropna(), fillna(), and interpolate() to manage missing data.
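The three approaches can be compared on a small Series. This is only a sketch; the right strategy depends on the data and on why values are missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.isna().sum())         # number of missing values

dropped = s.dropna()          # remove missing entries
filled = s.fillna(s.mean())   # replace with the mean of the observed values
interp = s.interpolate()      # linear interpolation between neighbours

print(dropped.tolist())
print(filled.tolist())
print(interp.tolist())
```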

14. Benefits of using Plotly for data visualization?

  - Interactive visualizations (hover, zoom, filter).
  - Works well in web apps (Dash).
  - Handles 3D, geo, and time-series data easily.

15. How does NumPy handle multidimensional arrays?

  - NumPy’s ndarray supports n-dimensional arrays (tensors).
  - Efficient storage, slicing, broadcasting, and linear algebra operations.
  - Example: 3D arrays for images or scientific data.
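As an illustration, the sketch below treats a small 3D array as a hypothetical image with 2 rows, 3 columns, and 4 channels, and shows indexing and reduction along chosen axes.

```python
import numpy as np

# Hypothetical tiny "image": 2 rows x 3 columns x 4 channels.
img = np.arange(24).reshape(2, 3, 4)
print(img.ndim, img.shape)

pixel = img[1, 2]          # all channels of one pixel, shape (4,)
channel0 = img[:, :, 0]    # one channel across the image, shape (2, 3)

# Reduce over rows and columns, leaving one total per channel.
per_channel_sum = img.sum(axis=(0, 1))
print(per_channel_sum)
```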

16. Role of Bokeh in data visualization?

  - Focused on interactive visualizations in web browsers.
  - Generates plots in HTML/JS, integrates with Flask/Django.
  - Good for dashboards.

17. Difference between apply() and map() in Pandas?

  - map() → Element-wise transformation of a Series; accepts a function, a dict, or another Series.
  - apply() → Applies a function along an axis of a DataFrame: axis=0 passes each column, axis=1 passes each row.

s.map(lambda x: x*2)  
df.apply(sum, axis=0)
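The two calls above assume an existing Series s and DataFrame df. A self-contained sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# map(): element-wise on a single Series, with a function or a dict.
doubled = df["A"].map(lambda x: x * 2)
labels = df["A"].map({1: "low", 2: "mid", 3: "high"})

# apply(): the function receives a whole column (axis=0) or row (axis=1).
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)
row_total = df.apply(lambda row: row["A"] + row["B"], axis=1)

print(doubled.tolist(), labels.tolist())
print(col_range.to_dict(), row_total.tolist())
```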

18. Advanced features of NumPy?

  - Linear algebra (numpy.linalg)
  - Random numbers (numpy.random)
  - FFT (Fast Fourier Transform, numpy.fft)
  - Masking & fancy indexing
  - Memory-mapped files (for big data)
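A brief sketch touching several of these features; the equations and data are made up for illustration.

```python
import numpy as np

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=5)

# Boolean masking and fancy (integer-array) indexing.
data = np.array([3, -1, 7, -4, 2])
positives = data[data > 0]
picked = data[[0, 2, 4]]

# FFT: a cosine with 2 cycles over 8 samples peaks at frequency bin 2.
t = np.arange(8)
spectrum = np.abs(np.fft.rfft(np.cos(2 * np.pi * 2 * t / 8)))
peak_bin = int(spectrum.argmax())

print(solution, positives, picked, peak_bin)
```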

19. How does Pandas simplify time series analysis?

  - Built-in support for date-time indexing, resampling, rolling windows.
  - Easy operations like shifting, lagging, or frequency conversion.
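A short sketch of these operations on a made-up daily series:

```python
import pandas as pd

# Six consecutive days with a DatetimeIndex.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day = ts.resample("2D").sum()       # frequency conversion: 2-day totals
rolling = ts.rolling(window=3).mean()   # 3-day moving average
lagged = ts.shift(1)                    # previous day's value

print(two_day.tolist())
print(rolling.tolist())
print(lagged.tolist())
```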

20. Role of a pivot table in Pandas?

  - Summarizes and aggregates data across categories.
  - Similar to Excel pivot tables.

df.pivot_table(values="Sales", index="Region", columns="Product", aggfunc="sum")
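The call above assumes a DataFrame with "Sales", "Region", and "Product" columns. A runnable sketch with made-up data:

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Product": ["A", "B", "A", "B", "A"],
    "Sales": [100, 150, 200, 50, 25],
})

# One row per Region, one column per Product, cells hold summed Sales.
table = df.pivot_table(values="Sales", index="Region",
                       columns="Product", aggfunc="sum")
print(table)
```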

21. Why is NumPy’s array slicing faster than Python’s list slicing?

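  - NumPy arrays store their elements in a single contiguous, typed memory block.
  - Basic slicing returns a view onto that block (no data is copied), while slicing a list builds a new list and copies every element reference into it.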
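The view-versus-copy behaviour can be checked directly. This sketch also shows that fancy indexing, unlike basic slicing, does return a copy.

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]          # basic slice: a view on the same memory
view[0] = 99             # writing through the view changes arr too
print(arr, np.shares_memory(arr, view))

lst = [0, 1, 2, 3, 4]
part = lst[1:4]          # list slice: a new list
part[0] = 99             # the original list is unaffected
print(lst)

fancy = arr[[1, 2, 3]]   # fancy indexing: an independent copy
print(np.shares_memory(arr, fancy))
```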

22. Common use cases for Seaborn?

  - Correlation heatmaps
  - Distribution plots (histograms, KDEs)
  - Categorical comparisons (boxplot, violin)
  - Regression analysis (lmplot)
  - Pairwise relationships (pairplot)



# Practical


1. How do you create a 2D NumPy array and calculate the sum of each row?

 - We can create a 2D NumPy array with numpy.array() (or np.arange().reshape()), and then use the .sum() method with axis=1 to get the sum of each row.

Example
import numpy as np

# 1️⃣ Create a 2D array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print("Array:")
print(arr)

# 2️⃣ Sum of each row (axis=1 sums across the columns of every row)
row_sums = arr.sum(axis=1)
print("Sum of each row:", row_sums)

2. Write a Pandas script to find the mean of a specific column in a DataFrame.



3. Create a scatter plot using Matplotlib.

import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 6, 9, 5]
y = [7, 6, 7, 8, 7, 9, 8]

plt.scatter(x, y, color="teal", marker="o")
plt.title("Scatter Plot Example")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

 - Seaborn does not compute correlations itself. Calculate the correlation matrix with Pandas’ .corr() method, then pass it to Seaborn’s heatmap() to visualize it.

Example: Correlation Matrix + Heatmap
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1️⃣ Create a sample DataFrame
df = pd.DataFrame({
    "Math": [90, 80, 70, 85],
    "Science": [88, 78, 67, 90],
    "English": [75, 85, 80, 70]
})

# 2️⃣ Compute the correlation matrix
corr_matrix = df.corr()       # default = Pearson correlation

# 3️⃣ Visualize it with a heatmap
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix Heatmap")
plt.show()

5. Generate a bar plot using Plotly.

import plotly.express as px

data = {"Fruits": ["Apple", "Banana", "Cherry"],
        "Quantity": [30, 45, 25]}

fig = px.bar(data, x="Fruits", y="Quantity",
             title="Fruit Quantities")
fig.show()

6. Create a DataFrame and add a new column based on an existing column.
  
import pandas as pd

df = pd.DataFrame({"Price": [100, 200, 300]})
df["Price_with_Tax"] = df["Price"] * 1.18
print(df)


7. Write a program to perform element-wise multiplication of two NumPy arrays.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = a * b
print(result)   # [ 4 10 18]

8. Create a line plot with multiple lines using Matplotlib.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y1 = [1, 4, 9, 16]
y2 = [1, 2, 3, 4]

plt.plot(x, y1, label="Squares")
plt.plot(x, y2, label="Linear")
plt.legend()
plt.title("Line Plot with Multiple Lines")
plt.show()

9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C"],
                   "Score": [45, 80, 65]})

filtered = df[df["Score"] > 60]

print(filtered)

10. Create a histogram using Seaborn to visualize a distribution.

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")
sns.histplot(tips["total_bill"], bins=20, kde=True)
plt.title("Distribution of Total Bill")
plt.show()

11. Perform matrix multiplication using NumPy.

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B   # matrix product; same result as np.dot(A, B) for 2D arrays
print(C)

12. Use Pandas to load a CSV file and display its first 5 rows.

import pandas as pd

df = pd.read_csv("your_file.csv")
print(df.head())

13. Create a 3D scatter plot using Plotly.

import plotly.express as px
import pandas as pd

df = pd.DataFrame({
    "x": [1,2,3,4,5],
    "y": [10,15,13,17,12],
    "z": [5,3,8,4,7]
})

fig = px.scatter_3d(df, x="x", y="y", z="z",
                    size="z", color="y",
                    title="3D Scatter Plot")
fig.show()








Solution to question 2 (mean of a specific column):

import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Raj", "Meena"],
                   "Marks": [85, 92, 78]})

mean_marks = df["Marks"].mean()
print("Mean of Marks:", mean_marks)