# Data Toolkit

1. What is NumPy, and why is it widely used in Python ?
  -  NumPy (Numerical Python) is a powerful Python library for numerical computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on them efficiently. NumPy is widely used because it offers fast array operations, vectorization, and integration with other scientific computing libraries, making it essential for data science, machine learning, and engineering applications.

  Example Usage:

In [None]:
import numpy as np

# Creating an array
arr = np.array([1, 2, 3, 4, 5])

# Performing operations
print(arr * 2)  # Output: [ 2  4  6  8 10]
print(np.mean(arr))  # Output: 3.0


2. How does broadcasting work in NumPy ?
   - Broadcasting allows NumPy to perform element-wise operations on arrays of different shapes without explicit loops.

   Example:

In [None]:
import numpy as np

# 1D array
arr1 = np.array([1, 2, 3])

# Scalar broadcasting
print(arr1 + 5)
# Output: [6 7 8]

# 2D and 1D array broadcasting
arr2 = np.array([[1], [2], [3]])  # Shape (3,1)
arr3 = np.array([10, 20, 30])     # Shape (3,)

print(arr2 + arr3)
# Output:
# [[11 21 31]
#  [12 22 32]
#  [13 23 33]]


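  Rules of Broadcasting

   1. Shapes are compared starting from the trailing (rightmost) dimension; if the arrays have different numbers of dimensions, 1s are prepended to the shorter shape.
   2. Two dimensions are compatible when they are equal or when one of them is 1.
   3. A dimension of size 1 is stretched to match the other array without copying data; if any pair of dimensions is incompatible, NumPy raises a ValueError.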
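
As a rough illustration of these rules, the sketch below checks shape compatibility with `np.broadcast_shapes` (available in NumPy 1.20 and later). The shapes used are arbitrary examples chosen here, not taken from the notes above.

```python
import numpy as np

# Shapes are aligned from the right; missing leading dimensions count as 1.
print(np.broadcast_shapes((3, 1), (3,)))       # (3, 3)
print(np.broadcast_shapes((2, 3, 4), (3, 1)))  # (2, 3, 4)

# Trailing dimensions 3 and 4 are neither equal nor 1, so this fails.
try:
    np.array([1, 2, 3]) + np.array([1, 2, 3, 4])
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print("Broadcast error:", exc)
```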

3. What is a Pandas DataFrame ?
   -  A DataFrame in Pandas is a two-dimensional, tabular data structure similar to an Excel spreadsheet or SQL table. It consists of rows and columns, where each column can hold different data types.

   Example:

In [None]:
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)

print(df)

# Output:
#       Name  Age      City
# 0    Alice   25  New York
# 1      Bob   30    London
# 2  Charlie   35     Paris


4. Explain the use of the groupby() method in Pandas .
   - The groupby() method in Pandas is used to group data based on one or more columns and perform aggregate operations like sum, mean, count, etc.
   
   Example:

In [None]:
import pandas as pd

# Creating a DataFrame
data = {
    'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 60000, 55000, 70000, 65000]
}

df = pd.DataFrame(data)

# Grouping by 'Department' and calculating the average salary
grouped_df = df.groupby('Department')['Salary'].mean()

print(grouped_df)

#Output

#Department
#Finance    65000.0
#HR         52500.0
#IT         65000.0
#Name: Salary, dtype: float64
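
groupby() can also compute several aggregates in one pass. The following is a minimal sketch using named aggregation on the same sample data; the output column names (`avg_salary`, `max_salary`, `headcount`) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT', 'Finance'],
    'Salary': [50000, 60000, 55000, 70000, 65000]
})

# Named aggregation: each keyword becomes an output column,
# defined as (source column, aggregation function).
summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    headcount=('Salary', 'count'),
)

print(summary)
```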



5. Why is Seaborn preferred for statistical visualizations ?
   -  Seaborn is preferred for statistical visualizations because it:
   1. Provides built-in themes for better aesthetics.
   2.  Supports complex visualizations like heatmaps, violin plots, and pair plots.
   3. Works seamlessly with Pandas DataFrames.
   4.  Automatically handles statistical aggregation and categorical data.

   Example: Creating a Seaborn Scatter Plot with Regression Line

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Sample Data
tips = sns.load_dataset("tips")

# Scatter plot with regression line
sns.lmplot(x="total_bill", y="tip", data=tips)

plt.show()


6. What are the differences between NumPy arrays and Python lists ?
   - ### **Differences Between NumPy Arrays and Python Lists**  

| Feature          | NumPy Array (`ndarray`) | Python List |
|-----------------|----------------|-------------|
| **Speed**       | Faster (uses C-based optimizations) | Slower (interpreted in Python) |
| **Memory**      | More efficient (fixed type, contiguous memory) | Less efficient (stores references) |
| **Operations**  | Supports vectorized operations | Requires loops for element-wise operations |
| **Data Type**   | Homogeneous (all elements must be of the same type) | Heterogeneous (can have mixed types) |
| **Functionality** | Supports mathematical functions, broadcasting, and indexing | Limited mathematical operations |

Example:


In [None]:
import numpy as np

# NumPy Array
arr = np.array([1, 2, 3, 4, 5])

# Python List
lst = [1, 2, 3, 4, 5]

# NumPy operation (vectorized)
print(arr * 2)
# Output: [ 2  4  6  8 10]

# List operation (loop required)
print([x * 2 for x in lst])
# Output: [2, 4, 6, 8, 10]
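
The data type and operator rows of the table can be checked directly. This is a small sketch with made-up values, not a benchmark.

```python
import numpy as np

lst = [1, 2.5, 3]    # a list keeps each element's own Python type
arr = np.array(lst)  # the array upcasts everything to a single dtype

print([type(x).__name__ for x in lst])  # ['int', 'float', 'int']
print(arr.dtype)                         # float64

# The same operator means different things for the two containers.
print([1, 2] + [3, 4])                      # [1, 2, 3, 4]  (concatenation)
print(np.array([1, 2]) + np.array([3, 4]))  # [4 6]         (element-wise addition)

# Raw data size is itemsize * number of elements.
print(arr.itemsize, arr.nbytes)  # 8 24
```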


7. What is a heatmap, and when should it be used ?
   -  A heatmap is a data visualization technique that uses colors to represent values in a matrix or table. It helps identify patterns, correlations, and trends in large datasets.

   ## When to Use a Heatmap:

    1. To visualize correlations between numerical variables.
    2. To represent density or intensity in a dataset.
    3. For analyzing confusion matrices in machine learning.
    4.  To detect missing values in a dataset.

    Example: Creating a Heatmap in Seaborn



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Creating a random dataset
data = np.random.rand(5, 5)

# Plotting the heatmap
sns.heatmap(data, annot=True, cmap="coolwarm")

plt.show()


8.  What does the term “vectorized operation” mean in NumPy ?
    - A vectorized operation in NumPy means performing operations on entire arrays without using explicit loops. It leverages optimized C and Fortran code, making it much faster than traditional Python loops.

    Example: Vectorized vs. Loop-Based Operations



In [None]:
import numpy as np

# NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Vectorized operation (fast)
print(arr * 2)
# Output: [ 2  4  6  8 10]

# Loop-based operation (slow)
lst = [1, 2, 3, 4, 5]
print([x * 2 for x in lst])
# Output: [2, 4, 6, 8, 10]
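
Vectorization also covers arithmetic expressions and conditionals. The sketch below, with hypothetical temperature values and an arbitrary threshold of 24, shows a loop and its vectorized equivalent producing the same labels.

```python
import numpy as np

temps_c = np.array([12.0, 25.5, 31.0, 18.2])  # hypothetical readings

# Loop-based version: one Python-level iteration per element.
labels_loop = []
for t in temps_c:
    labels_loop.append("hot" if t > 24 else "mild")

# Vectorized versions: the whole array is processed in compiled code.
temps_f = temps_c * 9 / 5 + 32
labels_vec = np.where(temps_c > 24, "hot", "mild")

print(temps_f)
print(labels_vec)
```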


9. How does Matplotlib differ from Plotly ?
   -  ### **Differences Between Matplotlib and Plotly**  

| Feature         | Matplotlib 📊 | Plotly 📈 |
|---------------|-------------|---------|
| **Interactivity** | Static plots (by default) | Interactive plots |
| **Customization** | Highly customizable but manual | Easy customization with built-in themes |
| **Ease of Use** | Requires more coding for complex plots | User-friendly with simpler syntax |
| **3D Support** | Limited 3D support | Better 3D visualization |
| **Best For** | Static reports, scientific plotting | Dashboards, web apps, interactive visualization |

Example: Line Plot in Matplotlib




In [None]:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 50]

plt.plot(x, y, marker='o')
plt.title("Matplotlib Line Plot")
plt.show()


Example: Line Plot in Plotly

In [None]:
import plotly.express as px

data = {'x': [1, 2, 3, 4, 5], 'y': [10, 20, 25, 30, 50]}
fig = px.line(data, x='x', y='y', markers=True, title="Plotly Line Plot")
fig.show()


10.  What is the significance of hierarchical indexing in Pandas ?
     -  Hierarchical indexing (MultiIndex) in Pandas allows multiple levels of indexing, making it easier to work with multi-dimensional data in a tabular format.

     ### Why Use Hierarchical Indexing
     1. Organizes data with multiple dimensions efficiently.
     2.  Enables grouping & aggregation on multiple levels.
     3. Facilitates slicing and subsetting complex datasets.

     Example: Creating a MultiIndex DataFrame
  

     

In [None]:
import pandas as pd

# Creating a MultiIndex DataFrame
data = {
    'Sales': [200, 300, 400, 500],
    'Profit': [50, 80, 100, 120]
}

index = pd.MultiIndex.from_tuples([
    ('Store A', '2024-Q1'), ('Store A', '2024-Q2'),
    ('Store B', '2024-Q1'), ('Store B', '2024-Q2')
], names=['Store', 'Quarter'])

df = pd.DataFrame(data, index=index)

print(df)

# Output:
#                  Sales  Profit
# Store   Quarter
# Store A 2024-Q1    200      50
#         2024-Q2    300      80
# Store B 2024-Q1    400     100
#         2024-Q2    500     120
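
Slicing and level-wise aggregation on a MultiIndex might look like the following sketch, which rebuilds the same store and quarter data. The variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('Store A', '2024-Q1'), ('Store A', '2024-Q2'),
    ('Store B', '2024-Q1'), ('Store B', '2024-Q2')
], names=['Store', 'Quarter'])
df = pd.DataFrame({'Sales': [200, 300, 400, 500],
                   'Profit': [50, 80, 100, 120]}, index=index)

# All rows for one outer-level key (the result is indexed by Quarter).
store_a = df.loc['Store A']

# A single cell, addressed by a full index tuple and a column.
a_q2_sales = df.loc[('Store A', '2024-Q2'), 'Sales']

# Cross-section on the inner level (the result is indexed by Store).
q1 = df.xs('2024-Q1', level='Quarter')

# Aggregate over one level of the index.
sales_by_store = df.groupby(level='Store')['Sales'].sum()

print(store_a, a_q2_sales, q1, sales_by_store, sep="\n\n")
```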


11. What is the role of Seaborn’s pairplot() function ?
    -  The pairplot() function in Seaborn is used to visualize pairwise relationships between numerical variables in a dataset. It creates a grid of scatter plots and histograms, helping in data exploration and correlation analysis.

    Example: Using pairplot() in Seaborn
    

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
df = sns.load_dataset("iris")

# Create a pairplot
sns.pairplot(df, hue="species", diag_kind="kde")

plt.show()


12.  What is the purpose of the describe() function in Pandas ?
    -  The describe() function in Pandas provides summary statistics for the numerical columns of a DataFrame by default (and for categorical or text columns on request), which helps with data exploration.

    Example: Using describe() in Pandas

In [None]:
import pandas as pd

# Sample DataFrame
data = {'Age': [25, 30, 35, 40, 45], 'Salary': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)

# Get summary statistics
print(df.describe())

#Output:
#             Age       Salary
#count   5.00000      5.00000
#mean   35.00000  70000.00000
#std     7.90569  15811.38830
#min    25.00000  50000.00000
#25%    30.00000  60000.00000
#50%    35.00000  70000.00000
#75%    40.00000  80000.00000
#max    45.00000  90000.00000
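
For non-numeric columns, describe() reports count, unique, top, and freq instead. Here is a minimal sketch on a made-up DataFrame with one text column.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'London', 'Paris', 'Rome'],
    'Age': [25, 30, 35, 40]
})

# Default: only numeric columns are summarized.
numeric_summary = df.describe()

# On a text column: count, unique, top (most frequent value), freq.
city_summary = df['City'].describe()

# include='all' summarizes every column; statistics that do not apply are NaN.
full_summary = df.describe(include='all')

print(numeric_summary)
print(city_summary)
print(full_summary)
```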


13.  Why is handling missing data important in Pandas ?
    -  Handling missing data in Pandas is important to ensure accurate analysis and prevent errors in computations. Missing values can distort statistical results and affect machine learning models.

    Example:

In [None]:
import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {'A': [1, 2, np.nan], 'B': [4, np.nan, 6]}
df = pd.DataFrame(data)

# Handling missing values
df.fillna(0, inplace=True)  # Replace NaN with 0

print(df)


#Output

#     A    B
# 0  1.0  4.0
# 1  2.0  0.0
# 2  0.0  6.0
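
Filling with 0 is only one option. The sketch below, on the same small DataFrame, shows a few other common approaches; which one is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})

# Count missing values per column before deciding what to do.
missing_counts = df.isna().sum()

# Option 1: drop any row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each column's gaps with that column's mean.
filled_mean = df.fillna(df.mean())

print(missing_counts)
print(dropped)
print(filled_mean)
```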


14.  What are the benefits of using Plotly for data visualization ?
     - 1. Interactive Plots – Enables zooming, hovering, and panning.
       2. High-Quality Visuals – Produces aesthetically appealing charts.
       3. Supports Multiple Chart Types – Includes scatter plots, bar charts, 3D plots, etc.
       4. Easy Integration – Works with Jupyter Notebooks, Dash, and web applications.
       5. Customizable – Allows detailed styling and annotations.

       Example:

In [None]:
import plotly.express as px

# Sample data
data = px.data.iris()

# Creating an interactive scatter plot
fig = px.scatter(data, x="sepal_width", y="sepal_length", color="species")

# Show plot
fig.show()


15.  How does NumPy handle multidimensional arrays ?
     -  NumPy uses ndarrays (N-dimensional arrays) to efficiently store and manipulate multidimensional data. It supports indexing, slicing, reshaping, and broadcasting operations.

     Example:

In [None]:
import numpy as np

# Creating a 2D array (matrix)
arr = np.array([[1, 2, 3], [4, 5, 6]])

# Accessing elements
print(arr[1, 2])  # Output: 6

# Reshaping into a 3D array
arr_3d = arr.reshape(2, 1, 3)
print(arr_3d)
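
The sketch below looks at the basic ndarray attributes and at per-axis operations on a 3D array. The shape (2, 3, 4) is an arbitrary choice for illustration.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)  # 2 blocks, 3 rows, 4 columns

print(arr.ndim, arr.shape, arr.size)  # 3 (2, 3, 4) 24

# Indexing uses one index per axis; slices can be mixed in.
print(arr[1, 2, 3])  # 23
print(arr[0, :, 1])  # [1 5 9]

# Reducing along an axis removes that axis from the shape.
print(arr.sum(axis=0).shape)  # (3, 4)
print(arr.sum(axis=2))
# [[ 6 22 38]
#  [54 70 86]]
```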


16. What is the role of Bokeh in data visualization ?
    -   Bokeh is a Python library designed for creating interactive and web-friendly visualizations. It allows users to build dashboards and plots that can be embedded in web applications.

    Example:

In [None]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()  # Display in Jupyter Notebook

# Create a figure
p = figure(title="Bokeh Line Plot", x_axis_label="X", y_axis_label="Y")

# Add a line plot
p.line([1, 2, 3, 4], [10, 20, 25, 30], line_width=2)

show(p)


17. Explain the difference between apply() and map() in Pandas .
    -  ### **Difference Between `apply()` and `map()` in Pandas:**

| Function  | Use Case | Works On | Example |
|-----------|---------|----------|---------|
| `apply()` | Applies a function to each row/column | Series & DataFrame | `df["col"].apply(func)` or `df.apply(func, axis=1)` |
| `map()` | Applies a function (or a dict/Series lookup) element-wise | Series (and DataFrame in pandas 2.1+, where `DataFrame.map` replaces `applymap`) | `df["col"].map(func)` |

Example:


In [None]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Using apply() on a column
df['A_squared'] = df['A'].apply(lambda x: x**2)

# Using map() on a Series
df['B_double'] = df['B'].map(lambda x: x * 2)

print(df)


#Output

#    A  B  A_squared  B_double
# 0  1  4          1         8
# 1  2  5          4        10
# 2  3  6          9        12
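
Two cases where the two methods differ in practice: map() accepts a dictionary as a lookup table, and apply() with axis=1 can combine several columns per row. In this sketch the `Grade` column and its labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'Grade': ['x', 'y', 'z']})

# map() with a dict: values without a matching key become NaN.
df['Grade_label'] = df['Grade'].map({'x': 'low', 'y': 'high'})

# apply() with axis=1: the function receives one row at a time.
df['A_plus_B'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

print(df)
```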


18. What are some advanced features of NumPy ?
    -  1. Broadcasting – Perform operations on arrays of different shapes.
       2. Vectorization – Fast element-wise operations without loops.
       3. Masked Arrays – Handle missing or invalid values.
       4. Linear Algebra Functions – Solve equations, find eigenvalues, etc.
       5. FFT (Fast Fourier Transform) – Perform signal processing.
       6. Random Sampling – Generate random numbers efficiently.

       Example:

In [None]:
import numpy as np

# Broadcasting example
arr1 = np.array([[1], [2], [3]])  # 3x1 array
arr2 = np.array([4, 5, 6])  # 1D array, shape (3,); treated as (1, 3) when broadcasting
result = arr1 + arr2  # Broadcasts to 3x3 shape

# Linear algebra example: solving Ax = B
A = np.array([[3, 1], [1, 2]])
B = np.array([9, 8])
x = np.linalg.solve(A, B)  # Solves for x

print("Broadcasted Array:\n", result)
print("Solution to Ax = B:", x)
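
The remaining items in the list (masked arrays, FFT, random sampling) can be sketched as follows. The sentinel value -999, the 5 Hz test signal, and the seed are assumptions made for this example.

```python
import numpy as np

# Masked array: treat the sentinel -999.0 as "missing" and skip it in statistics.
readings = np.array([20.5, -999.0, 22.0, 21.5])
masked = np.ma.masked_values(readings, -999.0)
print("Mean of valid readings:", masked.mean())

# Random sampling with a seeded Generator (reproducible dice rolls from 1 to 6).
rng = np.random.default_rng(seed=0)
sample = rng.integers(low=1, high=7, size=5)
print("Dice rolls:", sample)

# FFT: find the dominant frequency of a 5 Hz sine sampled at 100 Hz for 1 second.
t = np.arange(100) / 100
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / 100)
dominant = freqs[np.argmax(spectrum)]
print("Dominant frequency (Hz):", dominant)
```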


19. How does Pandas simplify time series analysis ?
    - 1. Datetime Indexing – Easily handle dates as index.
      2. Resampling – Aggregate data at different time intervals.
      3. Time-based Slicing – Filter data by specific dates.
      4. Rolling & Expanding Windows – Compute moving averages.
      5. Shifting & Lagging – Analyze trends over time.

      Example:

In [None]:
import pandas as pd

# Creating a time series DataFrame
dates = pd.date_range(start="2024-01-01", periods=5, freq="D")
df = pd.DataFrame({"date": dates, "value": [10, 20, 15, 25, 30]})
df.set_index("date", inplace=True)

# Resampling: Weekly mean
weekly_avg = df.resample("W").mean()

# Rolling window: 2-day moving average
df["rolling_avg"] = df["value"].rolling(window=2).mean()

print(df)

#Output

#            value  rolling_avg
# date
# 2024-01-01     10          NaN
# 2024-01-02     20         15.0
# 2024-01-03     15         17.5
# 2024-01-04     25         20.0
# 2024-01-05     30         27.5
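
The other items in the list (time-based slicing, shifting, expanding windows) could look like this on the same five daily values. This is a sketch and the variable names are illustrative.

```python
import pandas as pd

dates = pd.date_range(start="2024-01-01", periods=5, freq="D")
ts = pd.Series([10, 20, 15, 25, 30], index=dates, name="value")

# Time-based slicing with date strings (both endpoints are included).
window = ts.loc["2024-01-02":"2024-01-04"]

# Shifting: compare each day with the previous one.
daily_change = ts - ts.shift(1)

# Expanding window: running mean over all values seen so far.
running_mean = ts.expanding().mean()

# Resampling into 2-day bins.
two_day_sum = ts.resample("2D").sum()

print(window, daily_change, running_mean, two_day_sum, sep="\n\n")
```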


20.  What is the role of a pivot table in Pandas ?
     -  A pivot table in Pandas is used to summarize, aggregate, and reshape data efficiently. It helps in analyzing large datasets by grouping and applying functions like sum, mean, or count.

     Example:

In [None]:
import pandas as pd

# Sample DataFrame
data = {
    "Category": ["A", "A", "B", "B", "C"],
    "Salesperson": ["X", "Y", "X", "Y", "X"],
    "Sales": [100, 200, 150, 250, 300]
}

df = pd.DataFrame(data)

# Creating a pivot table
pivot = df.pivot_table(values="Sales", index="Category", columns="Salesperson", aggfunc="sum", fill_value=0)

print(pivot)

#Output

# Salesperson    X    Y
# Category
# A            100  200
# B            150  250
# C            300    0
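
A pivot table is closely related to groupby(). The sketch below reproduces the same table with groupby() plus unstack(), then adds row and column totals via `margins=True`. The label "Total" is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B", "B", "C"],
    "Salesperson": ["X", "Y", "X", "Y", "X"],
    "Sales": [100, 200, 150, 250, 300]
})

# Equivalent reshaping: group on both keys, then move one key to the columns.
via_groupby = (df.groupby(["Category", "Salesperson"])["Sales"]
                 .sum()
                 .unstack(fill_value=0))

# margins=True appends totals for each row and column.
with_totals = df.pivot_table(values="Sales", index="Category", columns="Salesperson",
                             aggfunc="sum", fill_value=0,
                             margins=True, margins_name="Total")

print(via_groupby)
print(with_totals)
```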



21. Why is NumPy’s array slicing faster than Python’s list slicing ?
    - 1. Memory Efficiency – NumPy arrays use contiguous memory blocks, making operations faster.
      2. No Copying (Views Instead of Copies) – Slicing a NumPy array creates a view (not a copy), avoiding redundant memory allocation.
      3. Optimized C Backend – NumPy is implemented in C, enabling low-level optimizations.

      Example:

In [None]:
import numpy as np
import time

# NumPy array slicing
arr = np.arange(1000000)
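start = time.perf_counter()  # perf_counter() is better suited to short timings
sliced_arr = arr[:500000]  # Creates a view (no data copied)
end = time.perf_counter()
print("NumPy slicing time:", end - start)

# Python list slicing
lst = list(range(1000000))
start = time.perf_counter()
sliced_lst = lst[:500000]  # Creates a new list (copies 500,000 references)
end = time.perf_counter()
print("List slicing time:", end - start)

#Output (illustrative; exact timings vary by machine and run)

#NumPy slicing time: ~0.00001 (seconds)
#List slicing time: ~0.005 (seconds)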
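
The view-versus-copy behaviour can be verified without timing anything. The sketch below uses `np.shares_memory` and shows the practical consequence: writing through a view changes the original array.

```python
import numpy as np

arr = np.arange(10)
view = arr[:5]                      # basic slicing returns a view
print(np.shares_memory(arr, view))  # True

view[0] = 99                        # writes through to the original array
print(arr[0])                       # 99

independent = arr[:5].copy()        # request a copy when isolation is needed
independent[1] = -1
print(arr[1])                       # 1 (unchanged)

lst = list(range(10))
sliced = lst[:5]                    # list slicing always builds a new list
sliced[0] = 99
print(lst[0])                       # 0 (unchanged)
```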




22. What are some common use cases for Seaborn?
    - 1. Statistical Data Visualization – Built-in support for statistical plots.
      2. Beautiful and Informative Graphs – Enhances Matplotlib visuals.
      3. Easy Handling of Categorical Data – Supports box plots, violin plots, etc.
      4. Correlation Analysis – Heatmaps for relationships between variables.
      5. Regression Analysis – Visualizes trends and relationships.

      Example:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load sample dataset
tips = sns.load_dataset("tips")

# Create a boxplot to analyze total bill distribution by day
sns.boxplot(x="day", y="total_bill", data=tips, palette="coolwarm")

# Show plot
plt.show()


# Practical Questions

1.  How do you create a 2D NumPy array and calculate the sum of each row ?

In [1]:
import numpy as np

# Creating a 2D NumPy array
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Calculating the sum of each row
row_sums = arr.sum(axis=1)

print("2D Array:\n", arr)
print("Sum of each row:", row_sums)

#Output

#2D Array:
# [[1 2 3]
# [4 5 6]
# [7 8 9]]
#Sum of each row: [ 6 15 24]





2. Write a Pandas script to find the mean of a specific column in a DataFrame .

In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}

df = pd.DataFrame(data)

# Calculating the mean of the 'Salary' column
mean_salary = df['Salary'].mean()

print("Mean Salary:", mean_salary)

#Output

#Mean Salary: 65000.0


3. Create a scatter plot using Matplotlib .

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [10, 20, 30, 40, 50]
y = [5, 15, 25, 35, 45]

# Creating the scatter plot
plt.scatter(x, y, color='blue', marker='o', label="Data Points")

# Adding labels and title
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Simple Scatter Plot")
plt.legend()

# Show the plot
plt.show()


4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap ?

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Creating a sample DataFrame
data = {
    'A': np.random.rand(10),
    'B': np.random.rand(10),
    'C': np.random.rand(10),
    'D': np.random.rand(10)
}

df = pd.DataFrame(data)

# Calculating the correlation matrix
corr_matrix = df.corr()

# Visualizing with a heatmap
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)

# Show the plot
plt.title("Correlation Matrix Heatmap")
plt.show()


5.  Generate a bar plot using Plotly .

In [None]:
import plotly.express as px

# Sample data
data = {"Category": ["A", "B", "C", "D"], "Values": [10, 20, 15, 25]}

# Creating a bar plot
fig = px.bar(data, x="Category", y="Values", title="Bar Plot using Plotly", color="Category")

# Show plot
fig.show()


6.  Create a DataFrame and add a new column based on an existing column .

In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40]}

df = pd.DataFrame(data)

# Adding a new column 'Age Category' based on the 'Age' column
df['Age Category'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult')

print(df)

#Output

#      Name  Age Age Category
#0    Alice   25        Young
#1      Bob   30        Adult
#2  Charlie   35        Adult
#3    David   40        Adult


7. Write a program to perform element-wise multiplication of two NumPy arrays .

In [None]:
import numpy as np

# Creating two NumPy arrays
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([5, 6, 7, 8])

# Element-wise multiplication
result = arr1 * arr2

print("Array 1:", arr1)
print("Array 2:", arr2)
print("Element-wise Multiplication:", result)

#output

#Array 1: [1 2 3 4]
#Array 2: [5 6 7 8]
#Element-wise Multiplication: [ 5 12 21 32]


8. Create a line plot with multiple lines using Matplotlib .

In [None]:
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [10, 20, 25, 30, 40]
y2 = [5, 15, 20, 25, 35]

# Creating the line plot
plt.plot(x, y1, marker='o', linestyle='-', color='b', label="Line 1")
plt.plot(x, y2, marker='s', linestyle='--', color='r', label="Line 2")

# Adding labels and title
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Multiple Line Plot")

# Adding legend
plt.legend()

# Show the plot
plt.show()


9.  Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold .

In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 35, 40, 22]}

df = pd.DataFrame(data)

# Filtering rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]

print(filtered_df)

#Output

#      Name  Age
#2  Charlie   35
#3    David   40
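
Filters can be combined, and the threshold can be kept in a variable. The following is a sketch on the same data; the bounds 22, 30, and 35 are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 30, 35, 40, 22]})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
between = df[(df['Age'] > 22) & (df['Age'] <= 35)]

# query() expresses the same kind of filter as a string; @ refers to a Python variable.
threshold = 30
older = df.query("Age > @threshold")

print(between)
print(older)
```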


10. Create a histogram using Seaborn to visualize a distribution .

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generating sample data
data = np.random.randn(1000)  # 1000 random values from a normal distribution

# Creating the histogram
sns.histplot(data, bins=30, kde=True, color="blue")

# Adding labels and title
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Random Data")

# Show the plot
plt.show()


11.  Perform matrix multiplication using NumPy .

In [None]:
import numpy as np

# Creating two matrices
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication
result = np.dot(A, B)  # or A @ B (alternative syntax)

print("Matrix A:\n", A)
print("Matrix B:\n", B)
print("Matrix Multiplication Result:\n", result)

#Output

#Matrix A:
# [[1 2]
# [3 4]]
#Matrix B:
# [[5 6]
# [7 8]]
#Matrix Multiplication Result:
# [[19 22]
# [43 50]]
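
Note that `*` on arrays is element-wise, not a matrix product. A short sketch contrasting the two on the same matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

elementwise = A * B  # multiplies matching entries
matmul = A @ B       # row-by-column matrix product, same as np.dot(A, B) for 2D arrays

print(elementwise)
# [[ 5 12]
#  [21 32]]
print(matmul)
# [[19 22]
#  [43 50]]

# Matrix multiplication is generally not commutative.
print(np.array_equal(A @ B, B @ A))  # False
```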


12.  Use Pandas to load a CSV file and display its first 5 rows .

In [None]:
import pandas as pd

# Load the CSV file
df = pd.read_csv("data.csv")  # Replace "data.csv" with your actual file path

# Display the first 5 rows
print(df.head())
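
If no CSV file is at hand, the same call can be tried on in-memory text. In this sketch the column names and rows are invented, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; read_csv accepts any file-like object.
csv_text = """name,age,city
Alice,25,New York
Bob,30,London
Charlie,35,Paris
David,40,Berlin
Eve,22,Rome
Frank,28,Madrid
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first 5 of the 6 rows
print(df.shape)   # (6, 3)
```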


13. Create a 3D scatter plot using Plotly .

In [None]:
import plotly.express as px
import numpy as np
import pandas as pd

# Generating sample 3D data
np.random.seed(42)
df = pd.DataFrame({
    "X": np.random.randn(50),
    "Y": np.random.randn(50),
    "Z": np.random.randn(50),
    "Category": np.random.choice(["A", "B", "C"], 50)
})

# Creating a 3D scatter plot
fig = px.scatter_3d(df, x="X", y="Y", z="Z", color="Category",
                     title="3D Scatter Plot", size_max=5)

# Show the plot
fig.show()
