## **Data Toolkit**

**Q1.** What is NumPy, and why is it widely used in Python?
    
    NumPy (Numerical Python) is an open source Python library that's widely
    used in science and engineering. The NumPy library contains
    multidimensional array data structures, such as the homogeneous,
    N-dimensional ndarray, and a large library of functions that operate
    efficiently on these data structures.

**Q2.** How does broadcasting work in NumPy?
    
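    Broadcasting in NumPy allows us to perform arithmetic operations on arrays
    of different shapes without reshaping them manually. Shapes are compared
    from the trailing dimension backwards, and two dimensions are compatible
    when they are equal or one of them is 1. The smaller array is then treated
    as if its values were repeated along the size-1 (or missing) dimensions to
    match the larger array's shape, without actually copying the data.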
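
As a minimal sketch of broadcasting, the example below adds a 1-D row and a 2-D column to a small matrix. The array names (`matrix`, `row`, `column`) are illustrative and not taken from any particular dataset.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

# (2, 3) + (3,): the row is applied to every row of the matrix.
shifted = matrix + row

# (2, 3) + (2, 1): the column is applied to every column of the matrix.
combined = matrix + column

print(shifted)
print(combined)
```

Shapes that cannot be aligned this way, such as (2, 3) and (2,), raise a `ValueError` instead of being stretched.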

**Q3.** What is a Pandas DataFrame?

    A Pandas DataFrame is a two-dimensional, size-mutable, and potentially
    heterogeneous tabular data structure in Python. It is a core component of
    the Pandas library, widely used for data manipulation and analysis.
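
The short sketch below illustrates what "heterogeneous" and "size-mutable" mean in practice. The column names and values are made up for illustration.

```python
import pandas as pd

# Each column can hold a different data type (heterogeneous).
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "score": [88.5, 92.0],
})
print(df.dtypes)

# Columns can be added after creation (size-mutable).
df["passed"] = df["score"] >= 90
print(df.shape)
```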

**Q4.** Explain the use of the groupby() method in Pandas.

    The groupby() method in Pandas is a fundamental tool for data analysis, enabling the "split-apply-combine" strategy. This approach involves:

**Splitting:**
Dividing a DataFrame into groups based on the unique values in one or more specified columns. When you call df.groupby('column_name'), it returns a GroupBy object, which conceptually represents these distinct groups.

**Applying:**
Performing an operation on each of these independent groups. This operation can be an aggregation (e.g., sum(), mean(), count(), min(), max()), a transformation (e.g., rank(), shift()), or a custom function defined by the user.

**Combining:**
Merging the results of the applied operations back into a single Series or DataFrame. The structure of the combined result depends on the applied function.
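
The three steps can be sketched on a small, made-up sales table. The `region` and `amount` columns are assumptions for illustration only.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "amount": [100, 200, 150, 50, 250],
})

# Split: one group per unique value in 'region'.
grouped = sales.groupby("region")

# Apply + combine (aggregation): one total per group, returned as a Series.
totals = grouped["amount"].sum()
print(totals)

# Apply + combine (transformation): the result keeps the original row count,
# so it can be used to compute each row's share of its group's total.
sales["share"] = sales["amount"] / grouped["amount"].transform("sum")
print(sales)
```

An aggregation returns one row per group, while a transformation returns a result aligned with the original rows, which is why the structure of the combined output depends on the function applied.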

**Q5.** Why is Seaborn preferred for statistical visualizations?

    Seaborn is preferred for statistical visualizations because it offers a
    high-level interface built on Matplotlib that simplifies statistical
    plotting, integrates directly with pandas DataFrames, and supports a wide
    range of customization options. This makes it a practical choice for
    anyone involved in data analysis and visualization.

**Q6.** What are the differences between NumPy arrays and Python lists?

    The main difference is that NumPy arrays are typically much faster for
    numerical work and have strict requirements on the homogeneity of their
    elements. For example, a NumPy array of strings can only contain strings
    and no other data types, but a Python list can contain a mixture of
    strings, numbers, booleans, and other objects.
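
A small sketch of the homogeneity difference, plus one behavioral difference that often surprises newcomers: the `*` operator. All variable names are illustrative.

```python
import numpy as np

# A list can mix types freely.
mixed_list = ["a", 1, True]

# An array has a single dtype; mixing ints and floats upcasts to float.
upcast = np.array([1, 2.5, 3])
print(upcast.dtype)

# '*' means element-wise arithmetic on arrays but repetition on lists.
doubled_arr = np.array([1, 2, 3]) * 2
doubled_list = [1, 2, 3] * 2
print(doubled_arr)
print(doubled_list)
```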

**Q7.** What is a heatmap, and when should it be used?

    A heatmap is a graphical representation of data that uses a system of
    color coding to represent different values. Heatmaps are used in various
    forms of analytics: in data analysis they are a common way to display a
    correlation matrix or other grid of values, and in web analytics they are
    used to show user behavior on specific web pages or webpage templates.

**Q8.** What does the term “vectorized operation” mean in NumPy?

    Vectorized operations in NumPy apply efficient, pre-compiled functions and
    mathematical operations to whole NumPy arrays at once. Vectorization is a
    method of performing array operations without writing explicit Python
    for loops.
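
To make the contrast concrete, the sketch below computes the same result with an explicit loop and with a vectorized expression. The formula `v * v + 1` is arbitrary and chosen only for illustration.

```python
import numpy as np

values = np.arange(1, 6)  # array([1, 2, 3, 4, 5])

# Explicit Python loop: one element at a time.
loop_result = [v * v + 1 for v in values]

# Vectorized: the whole array is processed in compiled code.
vectorized_result = values ** 2 + 1

print(loop_result)
print(vectorized_result)
```

Both approaches give the same numbers; the vectorized form is shorter and, on large arrays, usually much faster.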

**Q9.** How does Matplotlib differ from Plotly?

    Matplotlib is often preferred for academic or highly customized static
    plots because you can fine-tune just about any aspect of the figure,
    including fonts, margins, and axis scales. Plotly is also highly
    customizable, but its real strength lies in interactivity and web-based
    visuals.

**Q10.** What is the significance of hierarchical indexing in Pandas?

    Hierarchical indexing, also known as MultiIndexing, is a powerful feature
    in Pandas that allows you to have multiple levels of indexing on an axis
    (row or column). This capability is particularly useful when dealing with
    high-dimensional data.
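
As a hedged sketch, the example below builds a two-level index (year, quarter) on a Series and shows three common ways to work with it. The data and level names are invented for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1"), ("2024", "Q2")],
    names=["year", "quarter"],
)
revenue = pd.Series([10, 12, 15, 18], index=index, name="revenue")

# Select everything under one outer-level key.
revenue_2024 = revenue.loc["2024"]

# Take a cross-section on the inner level.
q1_all_years = revenue.xs("Q1", level="quarter")

# Pivot the inner level into columns to get a 2-D table.
wide = revenue.unstack("quarter")

print(revenue_2024)
print(q1_all_years)
print(wide)
```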

**Q11.**  What is the role of Seaborn’s pairplot() function?

    pairplot() plots pairwise relationships in a dataset. By default, this
    function creates a grid of Axes such that each numeric variable in the
    data is shared across the y-axes of a single row and the x-axes of a
    single column. The diagonal plots show the univariate distribution of
    each variable.

**Q12.** What is the purpose of the describe() function in Pandas?

    The describe() method returns a statistical summary of the data in a
    DataFrame. For numerical columns, the summary contains the following
    for each column: count (the number of non-missing values), mean (the
    average value), std (the standard deviation), min, the 25%, 50%, and 75%
    percentiles, and max.
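
A minimal sketch of describe() on a small numeric DataFrame. The column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
})

# One summary column per numeric input column.
summary = df.describe()
print(summary)

# Individual statistics can be read back by row label.
print(summary.loc["mean", "age"])
```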

**Q13.** Why is handling missing data important in Pandas?

    Handling missing data matters because many machine learning models raise
    an error when they are given NaN values, and statistics computed on
    incomplete data can be misleading. Missing values therefore usually need
    to be filled in or dropped before analysis.
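
The sketch below shows the usual first steps: counting missing values, then either dropping or filling them. The DataFrame and the fill choices (column mean, the placeholder string "unknown") are assumptions for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.0, 24.0],
    "city": ["A", "B", None, "D"],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop any row that contains a missing value.
dropped = df.dropna()

# Option 2: fill each column with a column-specific replacement.
filled = df.fillna({"temp": df["temp"].mean(), "city": "unknown"})
print(filled)
```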

**Q14.** What are the benefits of using Plotly for data visualization?

    The main benefits of Plotly, particularly through Plotly Express, include:
    Ease of Use: With minimal code, you can generate complex plots.
    Interactivity: Plots are not just static images; they are interactive and can be easily exported as HTML files.

**Q15.**  How does NumPy handle multidimensional arrays?

    An ndarray is a (usually fixed-size) multidimensional container of items
    of the same type and size. The number of dimensions and items in an array
    is defined by its shape, a tuple of non-negative integers that specify
    the size of each dimension.
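
A short sketch of how shape, indexing, and axis-wise operations fit together for a 3-D array. The name `cube` and the shape (2, 3, 4) are arbitrary.

```python
import numpy as np

# 24 consecutive integers arranged as 2 blocks of 3 rows by 4 columns.
cube = np.arange(24).reshape(2, 3, 4)
print(cube.ndim, cube.shape, cube.size)

# One index per dimension selects a single element.
element = cube[1, 2, 3]

# Fewer indices select a lower-dimensional sub-array.
first_block = cube[0]

# Reducing along an axis removes that dimension.
block_sums = cube.sum(axis=0)

print(element, first_block.shape, block_sums.shape)
```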

**Q16.** What is the role of Bokeh in data visualization?

    Bokeh is a Python library used to make highly interactive graphs and
    visualizations. It renders them in the browser using HTML and JavaScript,
    which makes it a powerful tool for interactive projects, custom charts,
    and web-based applications.

**Q17.** Explain the difference between apply() and map() in Pandas?

    map() is for element-wise substitution or mapping on a Pandas Series,
    while apply() is more versatile, allowing application of functions along an
    axis (rows or columns) of both Series and DataFrames, handling more complex
    operations.
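
The difference can be sketched on a small, made-up grades table: map() substitutes values in one Series, while apply() runs a function over each row or column. The column names and the grade-to-points dictionary are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": ["A", "B", "C"],
    "math": [90, 75, 60],
    "physics": [85, 70, 65],
})

# map(): element-wise substitution on a single Series using a dictionary.
df["grade_points"] = df["grade"].map({"A": 4, "B": 3, "C": 2})

# apply() with axis=1: the function receives one row at a time.
df["total"] = df[["math", "physics"]].apply(
    lambda row: row["math"] + row["physics"], axis=1
)

# apply() with the default axis=0: the function receives one column at a time.
column_ranges = df[["math", "physics"]].apply(lambda col: col.max() - col.min())

print(df)
print(column_ranges)
```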

**Q18.** What are some advanced features of NumPy?

    Features of NumPy include:
    - Multi-dimensional arrays (one of the most essential features of NumPy)
    - Broadcasting
    - Vectorized operations
    - Indexing and slicing
    - Array manipulation
    - Linear algebra
    - Random number generation
    - Performance optimization

**Q19.**  How does Pandas simplify time series analysis?

    Pandas simplifies time series analysis by providing the DatetimeIndex for
    efficient time-based indexing and operations, enabling features like
    resample() for aggregation and rolling() for moving window calculations,
    all within a familiar DataFrame or Series structure.
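
A minimal sketch of these three features on six days of made-up daily values. The dates and numbers are illustrative.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Time-based indexing: label slices on a DatetimeIndex include both endpoints.
first_three_days = series["2024-01-01":"2024-01-03"]

# resample(): aggregate into 2-day bins.
two_day_sums = series.resample("2D").sum()

# rolling(): 3-day moving average (the first two values are NaN).
rolling_mean = series.rolling(window=3).mean()

print(first_three_days)
print(two_day_sums)
print(rolling_mean)
```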

**Q20.** What is the role of a pivot table in Pandas?

    In Pandas, a pivot table, created using the pivot_table() method, plays a
    crucial role in summarizing and analyzing data, especially large datasets.
    It groups rows by one or more keys, aggregates the values in each group
    (by mean, unless another function is given), and reshapes the data from a
    "long" format into a "wide" format for easier analysis and visualization.

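As a hedged sketch of pivot_table(), the example below turns a long table of made-up sales records into a region-by-product grid of totals. The column names and the choice of `aggfunc="sum"` are assumptions for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 150, 200, 50, 25],
})

# Rows become regions, columns become products, cells hold summed amounts.
table = sales.pivot_table(
    index="region", columns="product", values="amount", aggfunc="sum"
)
print(table)
```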
**Q21.** Why is NumPy’s array slicing faster than Python’s list slicing?

    NumPy array slicing is faster than Python list slicing for two reasons.
    First, a basic NumPy slice returns a view onto the existing data instead
    of copying it, whereas slicing a list creates a new list. Second, NumPy
    arrays store homogeneous data in contiguous memory blocks, allowing
    efficient, C-optimized operations and direct memory access, unlike Python
    lists, which store references to objects scattered in memory.
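
The view-versus-copy behavior of slices can be checked directly. In the sketch below, writing through a NumPy slice changes the original array, while writing to a list slice does not. Variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]   # a view onto arr's memory
lst_slice = lst[2:5]   # a new list holding copies of the references

arr_slice[0] = 99
lst_slice[0] = 99

print(arr[2])   # the original array was modified through the view
print(lst[2])   # the original list is unchanged
print(np.shares_memory(arr, arr_slice))
```

Because the slice shares memory with the original array, call `.copy()` on it when an independent array is needed.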

**Q22.** What are some common use cases for Seaborn?

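    Seaborn is a library mostly used for statistical plotting in Python. It
    is built on top of Matplotlib and provides attractive default styles and
    color palettes. Common use cases include distribution plots such as
    histograms with KDE curves, correlation heatmaps, and pairwise
    relationship plots with pairplot().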

    

## Practical

#Answer1.
import numpy as np

# Step 1: Create a 2D NumPy array
arr = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Step 2: Calculate the sum of each row
row_sums = np.sum(arr, axis=1)

# Print result
print("Original array:")
print(arr)
print("Sum of each row:")
print(row_sums)

#Answer2.
import pandas as pd

# Step 1: Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}

df = pd.DataFrame(data)

# Step 2: Calculate the mean of the 'Salary' column
mean_salary = df['Salary'].mean()

# Print the result
print("Mean Salary:", mean_salary)

#Answer3.
import matplotlib.pyplot as plt

# Step 1: Sample data
x = [1, 2, 3, 4, 5]
y = [5, 7, 4, 6, 8]

# Step 2: Create a scatter plot
plt.scatter(x, y, color='blue', marker='o', label='Data Points')

# Step 3: Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Scatter Plot')
plt.legend()

# Step 4: Show the plot
plt.show()

#Answer4.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Create a sample DataFrame
data = {
    'Math': [80, 85, 78, 90, 88],
    'Physics': [82, 79, 84, 92, 85],
    'Chemistry': [78, 81, 76, 89, 84]
}

df = pd.DataFrame(data)

# Step 2: Calculate correlation matrix
corr_matrix = df.corr()

# Step 3: Plot the heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Step 4: Show the plot
plt.title('Correlation Matrix Heatmap')
plt.show()

#Answer5.
import plotly.express as px

# Step 1: Sample data
data = {
    'Fruits': ['Apples', 'Bananas', 'Cherries', 'Dates'],
    'Quantity': [10, 15, 7, 12]
}

# Step 2: Create a bar plot
fig = px.bar(data, x='Fruits', y='Quantity', title='Fruit Quantity Bar Plot')

# Step 3: Show the plot
fig.show()

#Answer6.
import pandas as pd

# Step 1: Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Marks': [85, 62, 90, 70]
}

df = pd.DataFrame(data)

# Step 2: Add a new column 'Result' based on 'Marks'
df['Result'] = df['Marks'].apply(lambda x: 'Pass' if x >= 75 else 'Fail')

# Step 3: Print the updated DataFrame
print(df)

#Answer7.
import numpy as np

# Step 1: Create two NumPy arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([10, 20, 30, 40])

# Step 2: Perform element-wise multiplication
result = array1 * array2

# Step 3: Print the result
print("Array 1:", array1)
print("Array 2:", array2)
print("Element-wise Multiplication:", result)

#Answer8.
import matplotlib.pyplot as plt

# Step 1: Sample data for multiple lines
x = [1, 2, 3, 4, 5]
y1 = [1, 4, 9, 16, 25]      # Line 1: y = x^2
y2 = [1, 2, 3, 4, 5]        # Line 2: y = x
y3 = [2, 3, 5, 7, 11]       # Line 3: y = some primes

# Step 2: Plot multiple lines
plt.plot(x, y1, label='x²', color='blue', linestyle='-', marker='o')
plt.plot(x, y2, label='x', color='green', linestyle='--', marker='s')
plt.plot(x, y3, label='Primes', color='red', linestyle='-.', marker='^')

# Step 3: Add labels, title, and legend
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Multiple Line Plot Example')
plt.legend()

# Step 4: Display the plot
plt.grid(True)
plt.show()

#Answer9.
import pandas as pd

# Create a sample DataFrame and display it
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, 30, 22, 35, 28],
        'Salary': [50000, 65000, 48000, 72000, 60000]}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

#Answer10.
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset
df = sns.load_dataset("iris")

# Create a histogram of the 'sepal_length' column with KDE
sns.histplot(data=df, x="sepal_length", kde=True)

# Add title and labels
plt.title("Distribution of Sepal Length in Iris Dataset")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Frequency")

# Display the plot
plt.show()

#Answer11.
import numpy as np

# Example matrices
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

# Matrix multiplication using np.matmul()
matrix_product_matmul = np.matmul(matrix_a, matrix_b)
print("Matrix product using np.matmul():\n", matrix_product_matmul)

# Matrix multiplication using the @ operator
matrix_product_at = matrix_a @ matrix_b
print("\nMatrix product using the @ operator:\n", matrix_product_at)

# Element-wise multiplication using np.multiply()
elementwise_product = np.multiply(matrix_a, matrix_b)
print("\nElement-wise product:\n", elementwise_product)

# Dot product using np.dot()
dot_product = np.dot(matrix_a, matrix_b)
print("\nDot product:\n", dot_product)

#Answer12.
import pandas as pd

# Load the CSV file named 'data.csv'
df = pd.read_csv('data.csv')

# Display the first 5 rows of the DataFrame
print(df.head())

#Answer13.
##Using Plotly Express:

import plotly.express as px
import pandas as pd

# Create a sample DataFrame (replace with your data)
data = {
    'x_data': [1, 2, 3, 4, 5],
    'y_data': [5, 4, 3, 2, 1],
    'z_data': [2, 4, 1, 5, 3],
    'category': ['A', 'B', 'A', 'B', 'A']
}
df = pd.DataFrame(data)

# Create the 3D scatter plot
fig = px.scatter_3d(df, x='x_data', y='y_data', z='z_data',
                    color='category',  # Color points by 'category' column
                    size='z_data',     # Vary marker size by 'z_data'
                    title='3D Scatter Plot with Plotly Express')

# Show the plot
fig.show()

##Using Plotly Graph Objects:

import plotly.graph_objects as go
import numpy as np

# Generate sample data
np.random.seed(42)
x = np.random.rand(50)
y = np.random.rand(50)
z = np.random.rand(50)

# Create the 3D scatter plot trace
scatter_trace = go.Scatter3d(
    x=x,
    y=y,
    z=z,
    mode='markers',
    marker=dict(
        size=8,
        color=z,  # Color points based on 'z' values
        colorscale='Viridis',
        opacity=0.8
    )
)

# Create the figure and add the trace
fig = go.Figure(data=[scatter_trace])

# Update layout for title and axis labels
fig.update_layout(
    title='3D Scatter Plot with Graph Objects',
    scene=dict(
        xaxis_title='X-axis',
        yaxis_title='Y-axis',
        zaxis_title='Z-axis'
    )
)

# Show the plot
fig.show()
