DATA TOOLKIT

1 What is NumPy, and why is it widely used in Python?
* NumPy (short for Numerical Python) is a powerful library in Python used for numerical computing. It provides support for working with large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.

 Why NumPy is widely used:

 1 Efficient Array Operations:

NumPy arrays (ndarray) are more compact and faster than Python lists.

Supports vectorized operations (no need for slow Python loops).

 2 Mathematical Functions:

Includes many functions for linear algebra, statistics, Fourier transforms, etc.

3 Broadcasting:

Enables operations between arrays of different shapes, making code cleaner and faster.

4 Interoperability:

Works well with other Python libraries like Pandas, Matplotlib, SciPy, scikit-learn, etc.

 5 Used in Data Science and Machine Learning:
Acts as the foundation for most data science tools in Python.


2 How does broadcasting work in NumPy?
* Broadcasting is the set of rules NumPy uses to apply element-wise operations to arrays of different shapes.

NumPy compares the two shapes starting from the trailing dimension. Two dimensions are compatible when they are equal or when one of them is 1. A dimension of size 1 is stretched virtually to match the other array, so the smaller array is never actually copied. If the shapes are not compatible, NumPy raises a ValueError.
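
As an illustration of how shapes are matched during broadcasting, here is a minimal sketch. The array names (col, row, grid) are made up for this example and do not come from the question.

```python
import numpy as np

# Hypothetical sample arrays chosen only to illustrate shape matching.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# Shapes are aligned from the right: (3, 1) with (4,) gives (3, 4).
# The length-1 axis of col and the missing leading axis of row are stretched.
grid = col + row
print(grid.shape)
print(grid)

# Shapes that cannot be aligned raise a ValueError instead of guessing.
try:
    np.ones((3, 2)) + np.ones(3)    # trailing dimensions 2 and 3 do not match
except ValueError as exc:
    print("incompatible shapes:", exc)
```

Every cell of grid is the sum of one element from col and one from row, and neither input is physically expanded in memory.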

3 What is a Pandas DataFrame?
* A Pandas DataFrame is a 2-dimensional labeled data structure in Python — similar to a table in a database or an Excel spreadsheet.

It is part of the Pandas library, which is widely used for data manipulation and analysis.
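
A small sketch of what "2-dimensional and labeled" means in practice. The column names and row labels below are invented for illustration.

```python
import pandas as pd

# Hypothetical data: columns are labeled by name, rows by the index.
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]},
    index=["a", "b", "c"],
)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # each column has its own data type

# Label-based lookup: row "b", column "age".
age_of_b = df.loc["b", "age"]
mean_age = df["age"].mean()
print(age_of_b, mean_age)
```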


4 Explain the use of the groupby() method in Pandas?
* The groupby() method in Pandas is used to group data based on one or more columns, so you can then perform operations like sum, mean, count, etc. on each group separately.

It’s very useful for analyzing categorized or grouped data — similar to SQL's GROUP BY.

Split the data into groups.

Apply a function (e.g., sum, mean).

Combine the results into a new Series or DataFrame.
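
The split, apply, and combine steps can be sketched as follows. The sales table and its column names are hypothetical sample data.

```python
import pandas as pd

# Hypothetical sales records used only for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "amount": [100, 80, 150, 120, 90],
})

# Split by region, apply sum to each group, combine into one Series.
totals = sales.groupby("region")["amount"].sum()
print(totals)

# Several aggregations at once give a DataFrame with one row per group.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
print(summary)
```

The group keys become the index of the result, sorted alphabetically by default.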

5 Why is Seaborn preferred for statistical visualizations?
* Seaborn is a powerful Python library built on top of Matplotlib, and it's preferred for statistical visualizations because it makes it easier, cleaner, and more effective to create attractive and informative plots.

1 High-level API:

Simple syntax for complex plots (e.g., box plots, violin plots, heatmaps).
2 Built-in Support for DataFrames:

Works directly with Pandas DataFrames and column names.

3 Automatic Statistical Aggregation:

Automatically computes means, medians, confidence intervals, etc.

Great for visualizing distributions and trends.

4 Beautiful Default Styles:

Cleaner and more visually appealing than raw Matplotlib.

Ready for presentations and reports.

5 Rich Variety of Plots:

 Supports statistical plots like:

boxplot, violinplot, pairplot, histplot (which replaces the deprecated distplot), heatmap, regplot

6 Built-in Themes and Color Palettes:

Easy to set visual themes

6 What are the differences between NumPy arrays and Python lists?
* Here are the key differences between NumPy arrays and Python lists:

Feature	NumPy Array	Python List
Performance	Faster for numeric work (element-wise loops run in compiled C)	Slower for numeric work (element-wise loops run in the Python interpreter)
Memory Usage	Less memory (homogeneous data type)	More memory (each element is a separate Python object)
Data Type	Homogeneous (all elements same type)	Heterogeneous (different data types allowed)
Functionality	Supports many mathematical operations directly (e.g. +, *, matrix ops)	No built-in vectorized math operations
Slicing/Indexing	More powerful (e.g. multi-dimensional slicing, boolean indexing)	Basic slicing and indexing
Broadcasting	Supports broadcasting (auto expand shapes for ops)	No broadcasting
Convenience	Requires the numpy library	Built into Python
Use Case	Scientific computing, large datasets	General-purpose collections, small or mixed-type data
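
A few of the rows in the table can be checked directly. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# The same operators mean different things for the two types.
doubled_list = py_list * 2    # repetition: [1, 2, 3, 1, 2, 3]
doubled_arr = np_arr * 2      # element-wise arithmetic
joined = py_list + [4]        # concatenation: [1, 2, 3, 4]
shifted = np_arr + 4          # broadcasting: 4 is added to every element

# Arrays are homogeneous: mixing ints and floats upcasts everything to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Boolean indexing has no direct list equivalent.
greater_than_one = np_arr[np_arr > 1]
print(doubled_list, doubled_arr, joined, shifted, greater_than_one)
```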

7 What is a heatmap, and when should it be used?
* A heatmap is a data visualization technique that uses color to represent values in a matrix or 2D dataset. Each cell in the heatmap shows a value with a corresponding color, making it easy to spot patterns, correlations, and outliers.
 When to Use a Heatmap:
1 To Show Correlation Between Variables

Example: Correlation matrix of features in a dataset.

Use it when you want to understand relationships between multiple variables.

2 To Visualize a Matrix or 2D Table

Example: Student scores (rows = students, columns = subjects).

Makes it easy to see highs, lows, and trends.

3 To Represent Density or Frequency

Example: Website clicks on a page or population density.

Highlights hotspots or areas with more activity.

4 To Compare Values Across Categories

Example: Sales across regions and months.

Useful when you want to compare categories over time or groups.

8 What does the term “vectorized operation” mean in NumPy?
* In NumPy, a vectorized operation means performing operations on entire arrays (vectors, matrices, etc.) without using explicit loops (like for or while).

 Key Points:

Operates on whole arrays at once

Faster and more efficient than using loops

Takes advantage of low-level optimizations (written in C)
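
To make the contrast concrete, the sketch below computes the same squares once with an explicit loop and once with a vectorized expression. The values are arbitrary sample data.

```python
import numpy as np

values = np.arange(1, 6)   # array of 1, 2, 3, 4, 5

# Loop version: one Python-level iteration per element.
squares_loop = []
for v in values:
    squares_loop.append(int(v) ** 2)

# Vectorized version: a single expression, the loop runs inside NumPy.
squares_vec = values ** 2
total = np.sum(squares_vec)

print(squares_loop)
print(squares_vec, total)
```

Both versions give the same numbers; the vectorized form is shorter and, on large arrays, typically much faster.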


9 How does Matplotlib differ from Plotly?
* Matplotlib and Plotly are both Python libraries for data visualization, but they differ significantly in features, interactivity, and usage. Here's a comparison:

🔷 1. Interactivity
Matplotlib:

Primarily static plots (images).

Limited interactivity using tools like %matplotlib notebook or external widgets.

Plotly:

Highly interactive plots by default (zoom, pan, hover tooltips).

Great for dashboards and web apps.

🔷 2. Output Format
Matplotlib:

Generates images (PNG, PDF, SVG).

Good for print-ready figures or reports.

Plotly:

Generates interactive HTML/JavaScript visualizations.

Ideal for web or Jupyter Notebook.

🔷 3. Ease of Use
Matplotlib:

Syntax is more traditional and lower-level.

More code is often required for customization.

Plotly:

Higher-level API (especially with plotly.express).

Less code for attractive, interactive plots.


10 What is the significance of hierarchical indexing in Pandas?
* Hierarchical indexing (also called MultiIndexing) in Pandas allows you to work with higher-dimensional data in a lower-dimensional (2D) DataFrame or Series. It is a powerful feature that lets you group and organize data with multiple levels of indexes.

🔑 Significance of Hierarchical Indexing in Pandas:
1. Better Organization of Data
It allows for multi-level grouping, which is helpful when you want to organize data by multiple categories.

Example:
You can index sales data by Country, then by Year.

import pandas as pd

data = {
    ('India', 2023): 100,
    ('India', 2024): 120,
    ('USA', 2023): 150,
    ('USA', 2024): 170
}

index = pd.MultiIndex.from_tuples(data.keys(), names=["Country", "Year"])
series = pd.Series(data, index=index)
print(series)
2. Easy Subsetting and Slicing
You can easily select data from one level of the index or filter based on multiple levels.
print(series.loc["India"])
3. Supports Complex Grouping
It works very well with groupby() operations, especially when analyzing grouped statistics.
df = series.reset_index(name='Sales')
grouped = df.groupby(['Country'])['Sales'].mean()
4. Efficient Reshaping
With functions like unstack(), stack(), and pivot_table(), you can reshape the data easily using the levels in a MultiIndex.

df = series.unstack()  # Converts 'Year' to columns
5. Compact Data Representation
Instead of creating multiple columns for hierarchy, a MultiIndex can represent complex relationships in a more compact format.

11 What is the role of Seaborn’s pairplot() function?
* The pairplot() function in Seaborn is used to create a matrix of scatter plots for visualizing relationships between multiple variables in a dataset, especially useful for exploratory data analysis (EDA).

 Role of pairplot():
Visualizes pairwise relationships:

Plots scatter plots for each pair of numerical features.

Helps identify correlations, clusters, and outliers.

Shows distributions:

The diagonal typically displays histograms or KDE plots of individual variables.

Supports categorical distinction:

You can color the plots based on a categorical variable using the hue parameter.

12 What is the purpose of the describe() function in Pandas?
* The describe() function in Pandas is used to generate summary statistics of a DataFrame or Series. It provides useful insights about the data, especially numerical columns.

 Purpose of describe():
Quickly understand the distribution, central tendency, and spread of your data.

Helpful for data exploration and initial analysis.
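
A short sketch of what describe() returns, using a hypothetical DataFrame with one numeric and one text column.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "score": [85, 90, 95, 88],
    "city": ["Pune", "Delhi", "Pune", "Goa"],
})

# By default only numeric columns are summarized:
# count, mean, std, min, 25%, 50%, 75%, max.
stats = df.describe()
print(stats)

# include="all" also summarizes non-numeric columns (unique, top, freq).
all_stats = df.describe(include="all")
print(all_stats)
```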


13 Why is handling missing data important in Pandas?
* Handling missing data is important in Pandas because missing values can:

Distort analysis results: Pandas aggregations such as mean() and sum() skip missing values by default, which can silently bias the result, while NumPy functions such as np.mean() return NaN as soon as one value is missing.

Cause errors: Some models or visualizations may not work if missing data is present.

Impact data quality: Clean and complete data leads to better insights, predictions, and decisions.
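
The points above can be illustrated with a tiny Series containing one missing value. This is a sketch with made-up numbers; filling with the mean is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
s = pd.Series([10.0, np.nan, 30.0])

n_missing = int(s.isna().sum())       # count the missing entries
pandas_mean = s.mean()                # NaN is skipped (skipna=True by default)
numpy_mean = np.mean(s.to_numpy())    # NaN propagates through plain NumPy

dropped = s.dropna()                  # remove missing entries
filled = s.fillna(s.mean())           # or replace them, here with the mean

print(n_missing, pandas_mean, numpy_mean)
print(dropped.tolist(), filled.tolist())
```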

14 What are the benefits of using Plotly for data visualization?
* Plotly is a powerful library for data visualization in Python (and other languages), especially known for its interactive and web-friendly capabilities. Here are the main benefits of using Plotly:

1. Interactivity
Plotly charts are interactive by default (zoom, pan, hover tooltips, etc.).

Useful for exploring data dynamically and making dashboards.

2. High-Quality Visuals
Produces publication-ready and aesthetically pleasing visualizations.

Great for reports, presentations, and web apps.

3. Wide Range of Chart Types
Supports a variety of charts:

Line, bar, scatter

Heatmaps, 3D plots

Maps (geo, choropleth)

Financial charts (candlestick, OHLC)

15 How does NumPy handle multidimensional arrays?
* NumPy handles multidimensional arrays using a powerful object called the ndarray (N-dimensional array). Here's how it works:

1. Creation
You can create multidimensional arrays using functions like np.array(), np.zeros(), np.ones(), and the np.random module (for example np.random.rand()).
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array
2. Shape and Dimensions
.shape tells you the dimensions (rows, columns, etc.).

.ndim gives the number of dimensions.
.size gives the total number of elements.
a.shape    # (2, 3)
a.ndim     # 2
a.size     # 6
3. Indexing and Slicing
You can access elements using a[row, column] (a[row][column] also works but is slower) or advanced slicing.
a[0, 1]     # 2
a[:, 1]     # [2, 5] (second column)
4. Broadcasting
NumPy allows operations on arrays of different shapes by broadcasting (automatic dimension matching).
a + 10   # Adds 10 to every element
5. Vectorized Operations
Operations like a + b, a * 2, np.mean(a), etc., work element-wise and efficiently without loops.

6. Reshaping
You can change the shape using .reshape() or .ravel().
a.reshape(3, 2)  # reshapes into 3x2
Summary:
NumPy’s multidimensional arrays are memory-efficient, fast, and support complex mathematical operations with ease. They are the foundation for scientific computing in Python.

16 What is the role of Bokeh in data visualization?
* Bokeh is a powerful Python library for interactive and web-based data visualizations. Its main role in data visualization includes:

🔹 1. Interactive Plots
Bokeh creates interactive charts like zoomable, pannable, hoverable plots.

Useful for dashboards and real-time data exploration.

🔹 2. Web Integration
Generates HTML, JSON, or JavaScript files that can be embedded in web apps.

Perfect for creating interactive visualizations in Flask, Django, or standalone pages.

🔹 3. High-performance for Large Data
Can handle large datasets efficiently in the browser.

Supports server-side downsampling and streaming of data.

🔹 4. Easy Syntax with Python
Designed for Python developers; similar to matplotlib or pandas plotting syntax.

You can create complex visuals with relatively simple code.

🔹 5. Versatile Plot Types
Supports line plots, bar charts, scatter plots, heatmaps, maps, and more.

Also allows linked brushing, custom tooltips, and widgets (sliders, buttons).

17 Explain the difference between apply() and map() in Pandas?
* In Pandas, both apply() and map() are used to apply functions to data, but they are used in different contexts and have different capabilities. Here's the main difference:

🔹 map()
Used with: Series (single column)

Purpose: Applies a function element-wise to each value in the Series.

Common use: Simple operations on each element, like formatting or replacing.

✅ Example:
import pandas as pd

s = pd.Series([1, 2, 3])
s.map(lambda x: x * 10)
Output:
0    10
1    20
2    30
dtype: int64
🔹 apply()
Used with: Both Series and DataFrame

Purpose:

On Series: similar to map(), but more flexible.

On DataFrame: applies a function to rows or columns.

Common use: Row/column-wise operations, aggregations, custom logic.

✅ Example on Series:
s.apply(lambda x: x * 10)
Same output as map().

✅ Example on DataFrame:
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.apply(sum, axis=0)  # column-wise sum
Output:
a    3
b    7
dtype: int64
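
Two further cases show where the methods differ: apply() with axis=1 works row by row, and map() also accepts a dictionary for value replacement. This is a minimal sketch; the grade labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# apply with axis=1 passes each row (as a Series) to the function.
row_totals = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_totals.tolist())

# map on a Series can take a dict and use it as a lookup table.
grades = pd.Series(["A", "B", "A"])
labels = grades.map({"A": "excellent", "B": "good"})
print(labels.tolist())
```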

18 What are some advanced features of NumPy?
* NumPy offers several advanced features that make it powerful for numerical and scientific computing. Here are some of the key ones:

1. Broadcasting
Allows operations between arrays of different shapes without copying data.

Example:

import numpy as np

a = np.array([1, 2, 3])
b = 2
print(a + b)  # Output: [3 4 5]
2. Vectorization
Speeds up operations by applying functions to entire arrays without loops.

Example:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a * b)  # Output: [ 4 10 18]
3. Masked Arrays
Useful for handling missing or invalid data in arrays.

Example:
import numpy.ma as ma
data = ma.array([1, 2, -999, 4], mask=[0, 0, 1, 0])
print(data.mean())  # Ignores the masked value
4. Structured Arrays
Allows you to define custom data types (like records or C structs).

Example:
dt = np.dtype([('name', 'U10'), ('age', 'i4')])
data = np.array([('Alice', 25), ('Bob', 30)], dtype=dt)
print(data['name'])  # Output: ['Alice' 'Bob']




19 How does Pandas simplify time series analysis?
* Pandas simplifies time series analysis in Python through several built-in features that make working with date and time data more intuitive and powerful. Here's how:

 1. Date and Time Handling with DatetimeIndex
Pandas can convert strings to datetime objects using pd.to_datetime().

Once converted, you can set a column as an index (DatetimeIndex) which enables time-based operations.
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
 2. Easy Resampling
You can change the frequency of time series data easily — e.g., daily to monthly, hourly to weekly.
df.resample('M').mean()  # Monthly average (pandas 2.2+ prefers the alias 'ME')
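
Putting the two steps together, here is a self-contained sketch on a hypothetical four-row sales table. It uses the "MS" (month start) alias and also shows partial-string indexing and a rolling mean, which become available once the index is a DatetimeIndex.

```python
import pandas as pd

# Hypothetical sample data; dates start out as plain strings.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-15", "2024-02-01", "2024-02-20"],
    "sales": [100, 200, 300, 500],
})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Resample to one value per month, labeled by the first day of the month.
monthly = df["sales"].resample("MS").mean()
print(monthly)

# Partial-string indexing: all rows from January 2024.
january = df.loc["2024-01"]

# Rolling mean over a window of two observations.
rolling = df["sales"].rolling(window=2).mean()
print(len(january), rolling.tolist())
```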

20 What is the role of a pivot table in Pandas?
* A pivot table in Pandas is a powerful tool used for summarizing, reorganizing, and analyzing data. It allows you to transform long-format data into a more readable format by grouping and aggregating based on one or more keys.
 Role of a Pivot Table in Pandas:
Data Summarization
It helps in summarizing data by grouping based on categories and applying aggregation functions (like sum, mean, count, etc.).

Data Reshaping
A pivot table can transform data from a "long" format to a "wide" format.

Multi-dimensional Analysis
You can analyze data across multiple dimensions (rows and columns), like comparing sales by region and product.

Custom Aggregations
Allows the use of custom aggregation functions using the aggfunc parameter.
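
The roles listed above can be seen in a small example. The sales data and column names are hypothetical; aggfunc="sum" is just one possible aggregation.

```python
import pandas as pd

# Hypothetical long-format data: one row per sale.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North"],
    "product": ["Pen", "Book", "Pen", "Book", "Pen"],
    "amount":  [10, 40, 15, 30, 20],
})

# Wide format: one row per region, one column per product, summed amounts.
table = pd.pivot_table(
    sales,
    index="region",
    columns="product",
    values="amount",
    aggfunc="sum",
)
print(table)
```

The two North/Pen rows are combined into a single cell, which is the summarizing step that a plain reshape would not perform.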

21 Why is NumPy’s array slicing faster than Python’s list slicing?
* NumPy’s array slicing is faster than Python’s list slicing due to several key reasons:

 1. Homogeneous Data Types
NumPy arrays store elements of the same data type in a contiguous block of memory.

Python lists can hold mixed data types and are more like arrays of pointers.

 This makes NumPy arrays more memory-efficient and enables faster access and slicing.

 2. C Implementation
NumPy's core is implemented in C, so operations like slicing are executed in compiled code.

This reduces the overhead of Python's interpreted execution.

 3. No Data Copy on Slice (View vs Copy)
NumPy slicing returns a view (not a copy) of the original array, avoiding extra memory allocation.

Python list slicing returns a new list (a shallow copy of the selected elements), which takes more time and memory.
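
The view-versus-copy difference is easy to observe. In this sketch (variable names are illustrative), writing to a NumPy slice changes the original array, while writing to a list slice does not change the original list.

```python
import numpy as np

arr = np.arange(5)           # array of 0, 1, 2, 3, 4
view = arr[1:4]              # basic slicing returns a view
view[0] = 99                 # writes through to arr

lst = [0, 1, 2, 3, 4]
part = lst[1:4]              # list slicing builds a new list
part[0] = 99                 # lst is unchanged

shares = np.shares_memory(arr, view)
independent = arr[1:4].copy()   # explicit copy when a view is not wanted

print(arr, lst, shares)
```

Note that this applies to basic slicing; indexing with a boolean mask or an integer array returns a copy.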

22 What are some common use cases for Seaborn?
* Seaborn is a Python data visualization library built on top of Matplotlib, and it's designed for making statistical graphics easier and more attractive. Here are some common use cases for Seaborn:

1. Exploratory Data Analysis (EDA)
Seaborn is widely used to explore datasets visually before applying any modeling. It helps identify patterns, correlations, and outliers.

2. Visualizing Relationships Between Variables
scatterplot(): to show relationships between two numeric variables.

lineplot(): to show trends over time or another continuous variable.

relplot(): a figure-level interface for scatter and line plots.

3. Comparing Categories
boxplot() and violinplot(): to compare distributions across categories.

barplot(): to compare means or aggregated values for categories.

PRACTICAL

In [None]:
1 How do you create a 2D NumPy array and calculate the sum of each row?
import numpy as np

# Step 1: Create a 2D NumPy array
array_2d = np.array([[1, 2, 3],
                     [4, 5, 6],
                     [7, 8, 9]])

# Step 2: Calculate sum of each row
row_sums = np.sum(array_2d, axis=1)

print("Original Array:")
print(array_2d)

print("\nSum of Each Row:")
print(row_sums)

Original Array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]

Sum of Each Row:
[ 6 15 24]

In [None]:
2 Write a Pandas script to find the mean of a specific column in a DataFrame.
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Score': [85, 90, 95, 88]
}

df = pd.DataFrame(data)

# Calculate mean of the 'Score' column
mean_score = df['Score'].mean()

print("Mean of the 'Score' column:", mean_score)

Mean of the 'Score' column: 89.5

In [None]:
3 Create a scatter plot using Matplotlib.
import matplotlib.pyplot as plt

# Sample data
x = [5, 7, 8, 7, 2, 17, 2, 9]
y = [99, 86, 87, 88, 100, 86, 103, 87]

# Create scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Add titles and labels
plt.title("Simple Scatter Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")

# Show the plot
plt.show()

[Figure: scatter plot titled "Simple Scatter Plot" with blue circular markers]

In [None]:
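4 How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)

# Pandas computes the correlation matrix; Seaborn only draws it
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()

[Figure: annotated heatmap of the 3x3 correlation matrix; A and C correlate at 1.0, and B correlates at -1.0 with both]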

In [None]:
5 Generate a bar plot using plotly.
import plotly.graph_objects as go

# Sample data
fruits = ['Apples', 'Oranges', 'Bananas', 'Grapes']
quantities = [10, 15, 7, 12]

# Create the bar plot
fig = go.Figure(data=[
    go.Bar(name='Fruit Count', x=fruits, y=quantities)
])

# Customize layout
fig.update_layout(
    title='Fruit Quantities',
    xaxis_title='Fruit',
    yaxis_title='Quantity',
    template='plotly_dark'
)

# Show the plot
fig.show()


[Figure: interactive bar chart titled "Fruit Quantities" on a dark theme]

In [None]:
6 Create a DataFrame and add a new column based on an existing column.
import pandas as pd

# Step 1: Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 62, 90]
}

df = pd.DataFrame(data)

# Step 2: Add a new column based on the 'Score' column
df['Passed'] = df['Score'] >= 70

# Display the updated DataFrame
print(df)

      Name  Score  Passed
0    Alice     85    True
1      Bob     62   False
2  Charlie     90    True

In [None]:
7  Write a program to perform element-wise multiplication of two NumPy arrays.
import numpy as np

# Define two NumPy arrays
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])

# Perform element-wise multiplication
result = array1 * array2

# Display the result
print("Array 1:", array1)
print("Array 2:", array2)
print("Element-wise Multiplication:", result)

Array 1: [1 2 3 4]
Array 2: [5 6 7 8]
Element-wise Multiplication: [ 5 12 21 32]

In [None]:
8 Create a line plot with multiple lines using Matplotlib.
import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 3, 5, 7, 11]
y2 = [1, 4, 6, 8, 10]
y3 = [3, 5, 2, 6, 9]

# Create a line plot with multiple lines
plt.plot(x, y1, label='Line 1', color='blue', marker='o')
plt.plot(x, y2, label='Line 2', color='green', marker='s')
plt.plot(x, y3, label='Line 3', color='red', marker='^')

# Add labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Multiple Line Plot Example')

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()

[Figure: line plot with three labeled lines, a legend, and a grid]

In [None]:
9 Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.
import pandas as pd

# Step 1: Create the DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Score': [85, 42, 77, 90, 66]
}
df = pd.DataFrame(data)

# Step 2: Define the threshold
threshold = 70

# Step 3: Filter rows where 'Score' > threshold
filtered_df = df[df['Score'] > threshold]

# Display the filtered DataFrame
print(filtered_df)

      Name  Score
0    Alice     85
2  Charlie     77
3    David     90


In [None]:
10 Create a histogram using Seaborn to visualize a distribution.
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data: random distribution
import numpy as np
data = np.random.randn(1000)  # 1000 values from a normal distribution

# Create histogram
sns.histplot(data, bins=30, kde=True, color='skyblue')

# Add labels and title
plt.title("Histogram of Normally Distributed Data")
plt.xlabel("Value")
plt.ylabel("Frequency")

# Show the plot
plt.show()

[Figure: histogram of normally distributed data with a KDE curve]

In [None]:
11 Perform matrix multiplication using NumPy.
import numpy as np

# Define two matrices
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication
result = np.matmul(A, B)

# Or you can use: result = A @ B

print("Matrix A:\n", A)
print("Matrix B:\n", B)
print("Result of A @ B:\n", result)

Matrix A:
 [[1 2]
 [3 4]]
Matrix B:
 [[5 6]
 [7 8]]
Result of A @ B:
 [[19 22]
 [43 50]]

In [None]:
12  Use Pandas to load a CSV file and display its first 5 rows.

import pandas as pd

# Load the CSV file
df = pd.read_csv('your_file.csv')  # Replace with your actual file name or path

# Display the first 5 rows
print(df.head())

(The output depends on the contents of the CSV file.)

In [None]:
13 Create a 3D scatter plot using Plotly.
import plotly.express as px
import pandas as pd

# Sample data
data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [10, 11, 12, 13, 14],
    'Z': [5, 6, 7, 8, 9],
    'Label': ['A', 'B', 'C', 'D', 'E']
}
df = pd.DataFrame(data)

# Create 3D scatter plot
fig = px.scatter_3d(df, x='X', y='Y', z='Z', color='Label', size_max=10)

# Show plot
fig.show()

[Figure: interactive 3D scatter plot with one colored point per label, A to E]