# Data Science with Python: NumPy, Pandas, Matplotlib, Seaborn, and Plotly

This notebook contains comprehensive answers to theoretical questions and practical exercises covering the essential Python libraries for data science.

## Table of Contents
1. [SKILLS - Theoretical Questions](#skills)
2. [Practical - Coding Exercises](#practical)

# SKILLS - Theoretical Questions

## 1. What is NumPy, and why is it widely used in Python?

NumPy (Numerical Python) is a fundamental library for scientific computing in Python. It provides:

- **Efficient N-dimensional array objects**: NumPy arrays are homogeneous and stored in contiguous memory blocks, making operations faster than Python lists
- **Mathematical functions**: Broadcasting, linear algebra, Fourier transforms, and random number generation
- **Foundation for other libraries**: Pandas, Matplotlib, SciPy, and scikit-learn are built on NumPy
- **Performance**: Written in C, NumPy operations are vectorized and significantly faster than pure Python
- **Memory efficiency**: Uses less memory than Python lists due to homogeneous data types

NumPy is widely used because it enables efficient numerical computations essential for data science, machine learning, and scientific computing.

## 2. How does broadcasting work in NumPy?

Broadcasting is NumPy's ability to perform element-wise operations on arrays with different shapes without explicitly reshaping them. The rules are:

1. **Shape comparison**: Arrays are aligned from the rightmost dimension
2. **Dimension compatibility**: Dimensions are compatible if:
   - They are equal, OR
   - One of them is 1, OR
   - One array has fewer dimensions (prepend 1s to its shape)
3. **Result shape**: The resulting array has the maximum size along each dimension

**Examples**:
- `(3, 4) + (4,)` → broadcasts to `(3, 4) + (1, 4)` → result: `(3, 4)`
- `(2, 3, 4) + (3, 1)` → broadcasts to `(2, 3, 4) + (1, 3, 1)` → result: `(2, 3, 4)`

Broadcasting avoids materializing expanded copies of the smaller array, which saves memory and computation time.
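
The two shape examples above can be checked directly. The following is a minimal sketch; the array contents are arbitrary illustrative values, and `np.broadcast_shapes` assumes NumPy 1.20 or newer.

```python
import numpy as np

# Shapes (3, 4) and (4,): the 1D array is treated as (1, 4) and reused for every row.
a = np.arange(12).reshape(3, 4)
row = np.array([10, 20, 30, 40])
sum_rows = a + row
print(sum_rows.shape)  # (3, 4)

# Shapes (2, 3, 4) and (3, 1): the second array is treated as (1, 3, 1).
b = np.ones((2, 3, 4))
col = np.array([[1.0], [2.0], [3.0]])
sum_cols = b + col
print(sum_cols.shape)  # (2, 3, 4)

# np.broadcast_shapes reports the result shape without allocating any data.
result_shape = np.broadcast_shapes((2, 3, 4), (3, 1))
print(result_shape)  # (2, 3, 4)

# Incompatible trailing dimensions (4 vs 3) raise a ValueError.
try:
    a + np.array([1, 2, 3])
    failed = False
except ValueError as exc:
    failed = True
    print("Broadcast failed:", exc)
```

The failing case shows that broadcasting never silently truncates or pads: the trailing dimensions must be equal or one of them must be 1.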

## 3. What is a Pandas DataFrame?

A Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It's similar to:
- A spreadsheet or SQL table
- A dictionary of Series objects sharing the same index

**Key characteristics**:
- **Heterogeneous data**: Different columns can have different data types (int, float, string, etc.)
- **Labeled axes**: Both rows (index) and columns have labels
- **Size-mutable**: Rows and columns can be added or removed
- **Data alignment**: Automatic alignment based on labels during operations

DataFrames are the primary data structure in Pandas for data manipulation, analysis, and cleaning tasks.
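
As an illustration of these characteristics, here is a small sketch with made-up data (the column names and values are hypothetical). It shows per-column dtypes, label-based access, adding a column, and label alignment.

```python
import pandas as pd

# Hypothetical sample data: each column has its own dtype.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85.5, 92.0, 78.5],
    },
    index=["a", "b", "c"],
)

print(df.dtypes)    # one dtype per column
print(df.loc["b"])  # label-based row access

# Size-mutable: add a derived column in place.
df["passed"] = df["score"] >= 80

# Data alignment: values are matched by index label, not by position.
bonus = pd.Series({"c": 5.0, "a": 1.0})
df["score_with_bonus"] = df["score"] + bonus  # "b" has no bonus, so it becomes NaN
print(df)
```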

## 4. Explain the use of the groupby() method in Pandas.

The `groupby()` method in Pandas splits data into groups based on specified criteria, applies functions to each group, and combines results. It implements the "split-apply-combine" strategy:

1. **Split**: Divide data into groups based on one or more columns
2. **Apply**: Apply a function (aggregation, transformation, or filtering) to each group
3. **Combine**: Merge results into a new DataFrame/Series

**Common uses**:
- `df.groupby('column').sum()` - Sum values for each group
- `df.groupby(['col1', 'col2']).mean()` - Multiple grouping columns
- `df.groupby('column').agg({'col1': 'sum', 'col2': 'mean'})` - Different functions per column

It's essential for data aggregation, statistical analysis, and creating summary reports.
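
The split-apply-combine steps can be sketched on a tiny, made-up sales table (the `region`, `units`, and `price` columns are assumptions for illustration):

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West", "West"],
        "units": [10, 20, 5, 15, 10],
        "price": [2.0, 4.0, 3.0, 5.0, 4.0],
    }
)

# Aggregation: one row per group.
totals = sales.groupby("region")["units"].sum()
print(totals)

# Different aggregation per column.
summary = sales.groupby("region").agg({"units": "sum", "price": "mean"})
print(summary)

# Transformation: the group mean is broadcast back to every original row.
group_mean = sales.groupby("region")["units"].transform("mean")
sales["units_vs_group_mean"] = sales["units"] - group_mean
print(sales)
```

Aggregation shrinks the result to one row per group, while `transform` keeps the original row count, which makes it convenient for group-relative features.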

## 5. Why is Seaborn preferred for statistical visualizations?

Seaborn is preferred because it:

- **Built on Matplotlib**: Leverages Matplotlib's power while providing a higher-level interface
- **Statistical focus**: Designed specifically for statistical data visualization
- **Beautiful defaults**: Attractive color palettes and styling out-of-the-box
- **Complex plots simplified**: Creates complex statistical plots with minimal code
- **Data integration**: Works seamlessly with Pandas DataFrames
- **Statistical functions**: Built-in statistical estimations and confidence intervals

**Key advantages**:
- Automatic handling of categorical data
- Built-in themes and color palettes
- Statistical plot types (violin plots, box plots, regression plots)
- Easy faceting and subplot creation
- Better handling of grouped data visualization

## 6. What are the differences between NumPy arrays and Python lists?

| Feature | NumPy Arrays | Python Lists |
|---------|--------------|--------------|
| **Data Types** | Homogeneous (single type) | Heterogeneous (mixed types) |
| **Memory** | Contiguous memory block | Scattered memory locations |
| **Performance** | Much faster for numerical operations | Slower for large datasets |
| **Size** | Fixed size after creation | Dynamic size |
| **Operations** | Vectorized operations | Element-by-element loops |
| **Memory Usage** | More memory efficient | Higher memory overhead |
| **Functionality** | Mathematical operations built-in | Limited mathematical functions |
| **Broadcasting** | Supports broadcasting | No broadcasting |

**Example**: Multiplying all elements by 2:
- NumPy: `arr * 2` (vectorized, fast)
- List: `[x * 2 for x in lst]` (requires loop, slower)
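
A few rows of the table can be checked directly. This is a minimal sketch with arbitrary values:

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array(lst)

# The same operator means different things for the two types.
doubled_list = lst * 2  # list repetition: [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2   # element-wise arithmetic: [2 4 6]
print(doubled_list, doubled_arr)

# Homogeneous dtype: mixing int and float upcasts every element to float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Lists grow in place; np.append returns a new array and leaves the original untouched.
lst.append(4)
arr2 = np.append(arr, 4)
print(arr.shape, arr2.shape)  # (3,) (4,)
```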

## 7. What is a heatmap, and when should it be used?

A heatmap is a data visualization technique that represents values in a matrix using colors. Darker/lighter colors or different hues represent higher/lower values.

**When to use heatmaps**:
- **Correlation matrices**: Show relationships between variables
- **Confusion matrices**: Visualize classification model performance
- **Time series data**: Display patterns over time and categories
- **Geographic data**: Show data intensity across regions
- **Large datasets**: Identify patterns in high-dimensional data
- **Missing data patterns**: Visualize data completeness

**Benefits**:
- Quick pattern identification
- Easy to spot outliers or clusters
- Effective for large datasets
- Intuitive color-based interpretation
- Good for presentations and reports

## 8. What does the term "vectorized operation" mean in NumPy?

Vectorized operations in NumPy are operations that are applied element-wise to entire arrays without explicit Python loops. The operations are implemented in optimized C code.

**Key characteristics**:
- **No explicit loops**: Operations apply to entire arrays at once
- **C-level implementation**: Much faster than Python loops
- **Element-wise**: Operations performed on corresponding elements
- **Broadcasting**: Automatic handling of different array shapes
- **Memory efficient**: Minimal memory allocation during operations

**Examples**:
```python
# Vectorized (fast)
result = arr1 + arr2  # Adds corresponding elements
result = np.sin(arr)  # Applies sin to all elements

# Non-vectorized (slow)
result = [arr1[i] + arr2[i] for i in range(len(arr1))]
```

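As a rough guide, vectorized NumPy operations are often 10-100x faster than equivalent pure Python loops on large arrays; the actual speedup depends on the operation, the dtype, and the array size.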

## 9. How does Matplotlib differ from Plotly?

| Aspect | Matplotlib | Plotly |
|--------|------------|---------|
| **Interactivity** | Static plots (by default) | Interactive plots (by default) |
| **Learning Curve** | Steeper, more verbose | Gentler, more intuitive |
| **Output** | Images (PNG, JPG, PDF, SVG) | HTML, web-based |
| **Customization** | Highly customizable | Good customization with easier syntax |
| **Performance** | Better for large datasets | Can be slower with huge datasets |
| **3D Plotting** | Basic 3D capabilities | Excellent 3D visualization |
| **Web Integration** | Requires additional work | Built for web deployment |
| **Animation** | Possible but complex | Easy animation support |
| **Ecosystem** | Mature, extensive ecosystem | Growing ecosystem |

**Choose Matplotlib for**: Publication-quality static plots, fine-grained control, scientific publications
**Choose Plotly for**: Interactive dashboards, web applications, business presentations

## 10. What is the significance of hierarchical indexing in Pandas?

Hierarchical indexing (MultiIndex) allows multiple index levels on axes, enabling:

**Benefits**:
- **Higher-dimensional data**: Represent 3D+ data in 2D DataFrames
- **Grouping and aggregation**: Natural data organization for complex grouping
- **Efficient storage**: Avoid creating separate DataFrames for different categories
- **Easy subsetting**: Select data at different granularity levels
- **Pivot operations**: Simplify reshaping between wide and long formats

**Use cases**:
- Time series with multiple frequencies (year, month, day)
- Geographic data (country, state, city)
- Experimental data (treatment, subject, measurement)
- Financial data (asset, date, metric)

**Example structure**:
```
Index levels:     Data
Country  City     Population
USA      NYC      8.4M
         LA       3.9M
UK       London   9.0M
         Edinburgh 0.5M
```
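
The structure above can be built as a Series with a two-level index. This sketch reuses the population figures from the example (in millions); the Series name is an assumption added for readability.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("USA", "NYC"), ("USA", "LA"), ("UK", "London"), ("UK", "Edinburgh")],
    names=["Country", "City"],
)
population = pd.Series(
    [8.4, 3.9, 9.0, 0.5], index=index, name="Population (millions)"
).sort_index()  # sorting the index keeps label lookups efficient

# Subsetting at different granularity levels.
usa = population.loc["USA"]                # all cities in one country
london = population.loc[("UK", "London")]  # a single value
print(usa)
print(london)

# Aggregating by one level, and reshaping to a wide table.
by_country = population.groupby(level="Country").sum()
wide = population.unstack("City")
print(by_country)
print(wide)
```

`unstack` turns the inner level into columns, so city/country combinations that do not exist show up as NaN in the wide table.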

## 11. What is the role of Seaborn's pairplot() function?

The `pairplot()` function creates a grid of scatter plots for every pair of numerical variables in a dataset, with each variable's own distribution on the diagonal (histograms by default, KDE curves when `hue` is set).

**Key features**:
- **Pairwise relationships**: Shows correlation patterns between all variable pairs
- **Distribution visualization**: Diagonal shows individual variable distributions
- **Categorical grouping**: Can color-code points by categorical variables
- **Quick exploration**: Rapid overview of multivariate relationships

**When to use**:
- Initial data exploration
- Identifying correlations and patterns
- Detecting outliers across multiple variables
- Understanding data structure before modeling
- Comparing groups within categorical variables

**Benefits**:
- Comprehensive view of dataset relationships
- Easy identification of linear/non-linear patterns
- Quick detection of clustering or separation by groups
- Foundation for feature selection decisions

## 12. What is the purpose of the describe() function in Pandas?

The `describe()` function provides a comprehensive statistical summary of numerical data in a DataFrame or Series.

**For numerical data, it returns**:
- **count**: Number of non-null values
- **mean**: Average value
- **std**: Standard deviation
- **min**: Minimum value
- **25%**: First quartile (Q1)
- **50%**: Median (Q2)
- **75%**: Third quartile (Q3)
- **max**: Maximum value

**For categorical data** (when `include='object'`):
- **count**: Number of non-null values
- **unique**: Number of unique values
- **top**: Most frequent value
- **freq**: Frequency of the most common value

**Uses**:
- Quick data exploration and quality assessment
- Identifying outliers and data distribution
- Understanding data ranges and central tendencies
- Checking for missing values and data completeness
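
A short sketch on hypothetical data shows both summaries. Calling `describe()` on the text column itself is used here instead of the `include` argument, only to keep the example independent of how a given pandas version labels string dtypes.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25, 30, 35, 28, None],  # one missing value
        "dept": ["Eng", "Sales", "Eng", "Eng", "Sales"],
    }
)

# Numerical summary: count excludes the missing value.
numeric_summary = df.describe()
print(numeric_summary)

# Categorical summary for a single text column.
dept_summary = df["dept"].describe()
print(dept_summary)
```

Comparing `count` with `len(df)` is a quick way to spot columns with missing values.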

## 13. Why is handling missing data important in Pandas?

Missing data handling is crucial because:

**Impact on analysis**:
- **Biased results**: Missing data can skew statistical measures
- **Reduced sample size**: Decreases statistical power
- **Algorithm failures**: Many ML algorithms can't handle NaN values
- **Incorrect conclusions**: Missing patterns might be informative

**Common strategies**:
1. **Detection**: `isnull()`, `isna()`, `info()` to identify missing values
2. **Removal**: `dropna()` to remove rows/columns with missing data
3. **Imputation**: `fillna()` to replace with a constant, mean, median, or mode; `ffill()`/`bfill()` for forward/backward fill
4. **Interpolation**: `interpolate()` for time series data
5. **Indicator variables**: Create binary columns indicating missingness

**Best practices**:
- Understand why data is missing (MCAR, MAR, MNAR)
- Choose appropriate strategy based on missing data mechanism
- Document and justify missing data handling decisions
- Consider multiple imputation for critical analyses
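
The strategies listed above can be sketched on a small hypothetical frame (the `temp` and `city` columns are made up). Which strategy is appropriate depends on why the data is missing, so treat this as a tour of the API rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "temp": [20.0, np.nan, 22.0, np.nan, 26.0],
        "city": ["A", "B", None, "D", "E"],
    }
)

# 1. Detection: count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# 5. Indicator variable: record missingness before filling it in.
df["temp_missing"] = df["temp"].isna()

# 2. Removal: keep only rows where both columns are present.
dropped = df.dropna(subset=["temp", "city"])

# 3. Imputation: column mean, or forward fill.
filled_mean = df["temp"].fillna(df["temp"].mean())
filled_ffill = df["temp"].ffill()

# 4. Interpolation: linear between known neighbours.
interpolated = df["temp"].interpolate()
print(interpolated.tolist())  # [20.0, 21.0, 22.0, 24.0, 26.0]
```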

## 14. What are the benefits of using Plotly for data visualization?

**Key benefits**:

1. **Interactivity by default**: Zoom, pan, hover, select without additional code
2. **Web-ready**: Plots render in browsers, easy to share and embed
3. **Professional appearance**: Publication-quality plots with minimal effort
4. **3D visualization**: Excellent 3D plotting capabilities
5. **Animation support**: Easy to create animated plots for time series
6. **Cross-platform**: Works in Jupyter, web apps, desktop applications
7. **Multiple languages**: Python, R, JavaScript, Julia support
8. **Dashboard integration**: Easy integration with Dash for web apps

**Specific advantages**:
- **Hover information**: Rich tooltips with detailed data
- **Responsive design**: Plots adapt to different screen sizes
- **Export options**: Save as HTML, PNG, PDF, SVG
- **Real-time updates**: Support for streaming data
- **Statistical charts**: Built-in support for statistical visualizations
- **Geographic mapping**: Excellent mapping capabilities

**Use cases**: Interactive dashboards, web applications, business presentations, exploratory data analysis

## 15. How does NumPy handle multidimensional arrays?

NumPy handles multidimensional arrays through:

**Array structure**:
- **ndarray object**: N-dimensional array with homogeneous elements
- **Shape**: Tuple describing array dimensions (e.g., (3, 4, 5) for 3D)
- **Axes**: Each dimension is called an axis (axis 0, axis 1, etc.)
- **Strides**: Memory layout information for efficient access

**Key capabilities**:
- **Flexible indexing**: `arr[i, j, k]` or `arr[i][j][k]`
- **Slicing**: `arr[:, 1:3, :]` for multidimensional slices
- **Broadcasting**: Operations across different shaped arrays
- **Axis-specific operations**: `sum(axis=0)` operates along specific dimensions
- **Reshaping**: `reshape()`, `flatten()`, `ravel()` for changing dimensions

**Memory efficiency**:
- Contiguous memory layout for fast access
- Views vs copies for memory optimization
- C and Fortran ordering options
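
These properties can be inspected on a small 3D array. This is a minimal sketch; the stride values are not shown as expected output because they depend on the platform's default integer size.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)
print(arr.shape, arr.ndim)  # (2, 3, 4) 3
print(arr.strides)          # bytes to step along each axis (dtype-dependent)

# Indexing and slicing across several axes at once.
element = arr[1, 2, 3]      # 1*12 + 2*4 + 3 = 23
block = arr[:, 1:3, :]      # shape (2, 2, 4)

# Axis-specific reduction: summing over axis 0 removes that axis.
summed = arr.sum(axis=0)    # shape (3, 4)

# ravel() returns a view when it can; flatten() always copies.
flat_view = arr.ravel()
flat_copy = arr.flatten()
print(np.shares_memory(arr, flat_view))  # True
print(np.shares_memory(arr, flat_copy))  # False

# Fortran (column-major) ordering is available on request.
f_ordered = np.asfortranarray(arr)
print(f_ordered.flags["F_CONTIGUOUS"])   # True
```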

## 16. What is the role of Bokeh in data visualization?

Bokeh is a Python library for creating interactive visualizations for web browsers.

**Key features**:
- **Web-native**: Generates JavaScript and HTML for web deployment
- **Server applications**: Build interactive web applications
- **Large datasets**: Handles big data efficiently with data decimation
- **Streaming data**: Real-time data visualization capabilities
- **Custom interactions**: Complex user interactions and widgets

**Advantages**:
- **Performance**: Efficient rendering of large datasets
- **Flexibility**: From simple plots to complex applications
- **Interactivity**: Rich interaction tools (selection, panning, zooming)
- **Integration**: Works with NumPy, Pandas, and other Python libraries
- **Deployment**: Easy web deployment without web development expertise

**Use cases**:
- Interactive dashboards for business intelligence
- Real-time monitoring applications
- Scientific data exploration tools
- Financial trading interfaces
- Geographic information systems

## 17. Explain the difference between apply() and map() in Pandas.

| Feature | apply() | map() |
|---------|---------|--------|
| **Scope** | DataFrames and Series | Series (an element-wise `DataFrame.map` was added in pandas 2.1) |
| **Functionality** | Can apply any function | Maps values using dict, Series, or function |
| **Axis parameter** | Available for DataFrames (axis=0/1) | Not applicable |
| **Performance** | Slower, more flexible | Faster for simple transformations |
| **Return type** | Can return different types | Returns Series |
| **Use case** | Complex transformations | Simple value mapping |

**apply() examples**:
```python
df.apply(lambda x: x.sum(), axis=1)  # Row-wise sum
df['col'].apply(lambda x: x**2)      # Square each value
```

**map() examples**:
```python
series.map({1: 'one', 2: 'two'})     # Dictionary mapping
series.map(lambda x: x**2)           # Function mapping
```

**When to use**:
- **map()**: Simple transformations, dictionary lookups, better performance
- **apply()**: Complex functions, DataFrame operations, multiple columns

## 18. What are some advanced features of NumPy?

**Advanced features include**:

1. **Advanced indexing**:
   - Boolean indexing: `arr[arr > 5]`
   - Fancy indexing: `arr[[1, 3, 5]]`
   - Multi-dimensional indexing: `arr[rows, cols]`

2. **Broadcasting and vectorization**:
   - Universal functions (ufuncs)
   - Wrapping Python functions with `numpy.vectorize()` (a convenience that still loops in Python, so it adds broadcasting but not speed)
   - Efficient element-wise operations

3. **Linear algebra**:
   - Matrix operations: `numpy.linalg`
   - Eigenvalues, SVD, matrix decomposition
   - Solving linear systems

4. **Memory-mapped files**:
   - `numpy.memmap` for large datasets
   - Work with data larger than RAM

5. **Structured arrays**:
   - Arrays with named fields
   - Mixed data types in single array

6. **Performance optimization**:
   - Views vs copies
   - Memory layout optimization
   - Integration with compiled code (Cython, Numba)
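
A few of these features are shown in the sketch below, using arbitrary values. The linear system and the structured-array field names are illustrative assumptions.

```python
import numpy as np

arr = np.array([3, 8, 1, 9, 4, 7])

# Boolean and fancy indexing (both return copies).
large = arr[arr > 5]        # [8 9 7]
picked = arr[[1, 3, 5]]     # [8 9 7]

# Multi-dimensional fancy indexing pairs up row and column indices.
grid = np.arange(9).reshape(3, 3)
pairs = grid[[0, 2], [1, 2]]  # elements (0, 1) and (2, 2): [1 8]

# Linear algebra: solve 2x + y = 5 and x + 3y = 10.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)  # approximately [1. 3.]

# Structured array: named fields with different dtypes in one array.
people = np.array(
    [("Alice", 25, 55.0), ("Bob", 30, 72.5)],
    dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
)
ages = people["age"]  # [25 30]
print(large, picked, pairs, solution, ages)
```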

## 19. How does Pandas simplify time series analysis?

Pandas provides specialized tools for time series:

**Time indexing**:
- **DatetimeIndex**: Automatic date parsing and indexing
- **Period and timedelta support**: Various time frequencies
- **Time zone handling**: Localization and conversion

**Resampling and grouping**:
- `resample()`: Change frequency (daily to monthly)
- `groupby()` with time periods
- Automatic time-based aggregation

**Missing data handling**:
- Forward fill (`ffill`) and backward fill (`bfill`)
- Interpolation methods for time series
- Automatic handling of irregular time series

**Time-based operations**:
- Rolling windows: `rolling()` for moving averages
- Expanding windows: `expanding()` for cumulative statistics
- Time shifts: `shift()` for lag/lead operations

**Benefits**:
- Simplified date parsing and manipulation
- Automatic alignment of time series data
- Easy frequency conversion and resampling
- Built-in time zone support
- Integration with visualization libraries
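
The resampling, rolling, and shifting tools can be sketched on a synthetic daily series. The dates and values below are made up, and `"MS"` (month start) is used as the resampling frequency.

```python
import numpy as np
import pandas as pd

# 60 daily observations covering January and February 2024.
dates = pd.date_range("2024-01-01", periods=60, freq="D")
ts = pd.Series(np.arange(60, dtype=float), index=dates)

# Resampling: daily values aggregated to one mean per month.
monthly_mean = ts.resample("MS").mean()
print(monthly_mean)

# Rolling window: 7-day moving average (the first 6 values are NaN).
rolling_7d = ts.rolling(window=7).mean()

# Shift: compare each day with the previous one.
lagged = ts.shift(1)
daily_change = ts - lagged

# Label-based slicing with partial date strings.
january = ts.loc["2024-01"]
print(len(january))  # 31
```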

## 20. What is the role of a pivot table in Pandas?

Pivot tables reorganize and summarize data by:

**Functionality**:
- **Reshaping**: Transform long-format data to wide-format
- **Aggregation**: Group and summarize data automatically
- **Cross-tabulation**: Create contingency tables
- **Multi-level indexing**: Handle multiple grouping variables

**Key parameters**:
- `values`: Column to aggregate
- `index`: Row grouping variable(s)
- `columns`: Column grouping variable(s)
- `aggfunc`: Aggregation function (mean, sum, count, etc.)

**Use cases**:
- Sales analysis by region and time period
- Survey data analysis by demographics
- Financial reporting by category and date
- A/B testing results analysis

**Benefits**:
- Quick data summarization
- Easy comparison across categories
- Foundation for further analysis
- Excel-like functionality in Python
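
The key parameters map onto a call like the one below. This is a sketch on a hypothetical sales table, with `region`, `quarter`, and `revenue` as assumed column names.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West", "East", "West"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
        "revenue": [100, 150, 80, 120, 50, 30],
    }
)

# Long format in, wide format out: one row per region, one column per quarter.
table = sales.pivot_table(
    values="revenue",   # column to aggregate
    index="region",     # row groups
    columns="quarter",  # column groups
    aggfunc="sum",      # how duplicates within a cell are combined
)
print(table)
```

Rows that share the same region and quarter are combined by `aggfunc`, which is what separates `pivot_table` from a plain reshape.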

## 21. Why is NumPy's array slicing faster than Python's list slicing?

**Reasons for speed advantage**:

1. **Memory layout**:
   - NumPy: Contiguous memory blocks
   - Lists: Scattered memory locations with pointers

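2. **Implementation**:
   - NumPy: Basic slicing only computes a new offset, shape, and strides; no element is touched
   - Lists: Slicing is also implemented in C (in CPython), but it must copy one pointer per element and update each object's reference count

3. **Views vs copies**:
   - NumPy: Basic slicing creates views that share memory with the original (fancy and boolean indexing create copies)
   - Lists: Slicing creates a new list (a shallow copy)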

4. **Data types**:
   - NumPy: Homogeneous, fixed-size elements
   - Lists: Heterogeneous, variable-size objects

5. **Cache efficiency**:
   - NumPy: Better CPU cache utilization
   - Lists: Poor cache performance due to indirection

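**Performance difference**: A NumPy basic slice is a constant-time view, while a list slice copies every selected element, so the gap grows with slice length. The often-quoted 10-100x figure is a rough guide for large arrays, not a guarantee.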
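
The view-versus-copy behaviour behind this difference can be observed directly. The sketch below uses arbitrary values and checks memory sharing with `np.shares_memory`.

```python
import numpy as np

arr = np.arange(10)
lst = list(range(10))

arr_slice = arr[2:5]  # view: no data copied
lst_slice = lst[2:5]  # new list: elements copied

# Writing through the NumPy slice changes the original; the list slice is independent.
arr_slice[0] = 99
lst_slice[0] = 99
print(arr[2])  # 99
print(lst[2])  # 2

print(np.shares_memory(arr, arr_slice))  # True

# Fancy indexing is not a view, and .copy() gives an explicit independent array.
fancy = arr[[2, 3, 4]]
independent = arr[2:5].copy()
print(np.shares_memory(arr, fancy))        # False
print(np.shares_memory(arr, independent))  # False
```

Because views alias the original data, call `.copy()` whenever a slice must be modified without affecting the source array.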

## 22. What are some common use cases for Seaborn?

**Statistical visualizations**:
1. **Distribution plots**: Histograms, density plots, rug plots
2. **Relationship plots**: Scatter plots with regression lines
3. **Categorical plots**: Box plots, violin plots, bar plots
4. **Matrix plots**: Heatmaps, correlation matrices

**Specific use cases**:
- **Exploratory data analysis**: Quick statistical summaries
- **A/B testing**: Comparing distributions between groups
- **Feature relationships**: Understanding variable correlations
- **Data quality**: Identifying outliers and patterns
- **Presentation**: Publication-ready statistical plots
- **Regression analysis**: Visualizing model fits and residuals
- **Time series**: Trend analysis and seasonal patterns
- **Clustering**: Visualizing cluster separation and characteristics

**Advantages for these use cases**:
- Minimal code for complex statistical plots
- Automatic statistical computations
- Beautiful default styling
- Integration with statistical testing
- Easy handling of categorical data
- Built-in support for grouped comparisons

# Practical - Coding Exercises

## 1. How do you create a 2D NumPy array and calculate the sum of each row?

In [None]:
import numpy as np

# Create a 2D NumPy array
array_2d = np.array([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12]])

print("Original 2D array:")
print(array_2d)

# Calculate the sum of each row using axis=1
row_sums = np.sum(array_2d, axis=1)

print("\nSum of each row:")
print(row_sums)

# Alternative method using array.sum()
row_sums_alt = array_2d.sum(axis=1)
print("\nUsing alternative method:")
print(row_sums_alt)

**Explanation:**
- `np.array()` creates a 2D NumPy array from a nested list
- `axis=1` sums across the columns within each row, giving one total per row (`axis=0` would instead give one total per column)
- The result is a 1D array containing the sum of each row

**Expected Output:**
```
Original 2D array:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Sum of each row:
[10 26 42]

Using alternative method:
[10 26 42]
```

## 2. Write a Pandas script to find the mean of a specific column in a DataFrame.

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Score': [85.5, 92.0, 78.5, 88.0, 95.5]
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

# Find the mean of specific columns
age_mean = df['Age'].mean()
salary_mean = df['Salary'].mean()
score_mean = df['Score'].mean()

print(f"\nMean of Age column: {age_mean}")
print(f"Mean of Salary column: {salary_mean}")
print(f"Mean of Score column: {score_mean}")

# Alternative way using describe() to get multiple statistics
print("\nUsing describe() for comprehensive statistics:")
print(df['Salary'].describe())

**Explanation:**
- `pd.DataFrame()` creates a DataFrame from a dictionary
- `df['column_name'].mean()` calculates the mean of a specific column
- The `.mean()` method skips NaN values by default (`skipna=True`)
- `describe()` provides comprehensive statistics including mean, std, min, max, and quartiles

**Expected Output:**
```
Mean of Age column: 30.0
Mean of Salary column: 60000.0
Mean of Score column: 87.9
```

## 3. Create a scatter plot using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(42)  # For reproducible results
x = np.random.normal(50, 15, 100)  # 100 points with mean=50, std=15
y = 2 * x + np.random.normal(0, 10, 100)  # Linear relationship with noise

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y, alpha=0.6, color='blue', s=50)

# Add labels and title
plt.xlabel('X values', fontsize=12)
plt.ylabel('Y values', fontsize=12)
plt.title('Sample Scatter Plot', fontsize=14, fontweight='bold')

# Add grid for better readability
plt.grid(True, alpha=0.3)

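# Add a trend line (evaluated on sorted x so the dashed line is drawn once, left to right)
z = np.polyfit(x, y, 1)  # Linear fit
p = np.poly1d(z)
x_sorted = np.sort(x)
plt.plot(x_sorted, p(x_sorted), "r--", alpha=0.8, linewidth=2, label='Trend line')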

# Add legend
plt.legend()

# Adjust layout and display
plt.tight_layout()
plt.show()

# Print correlation coefficient
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation coefficient: {correlation:.3f}")

**Explanation:**
- `plt.scatter()` creates the scatter plot with x and y coordinates
- `alpha=0.6` makes points semi-transparent to show overlapping points
- `s=50` sets the size of the scatter points
- `np.polyfit()` and `np.poly1d()` create a linear trend line
- `plt.grid()` adds a grid for better readability
- `plt.tight_layout()` optimizes the spacing of plot elements

**Expected Output:** A scatter plot showing the relationship between x and y variables with a red dashed trend line, demonstrating a positive correlation.

## 4. How do you calculate the correlation matrix using Seaborn and visualize it with a heatmap?

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create sample dataset
np.random.seed(42)
data = {
    'Height': np.random.normal(170, 10, 100),
    'Weight': np.random.normal(70, 15, 100),
    'Age': np.random.randint(18, 65, 100),
    'Income': np.random.normal(50000, 20000, 100),
    'Score': np.random.normal(75, 12, 100)
}

# Add some correlations to make it more interesting
data['Weight'] = data['Height'] * 0.8 + np.random.normal(0, 5, 100)
data['Income'] = data['Age'] * 800 + np.random.normal(0, 5000, 100)

df = pd.DataFrame(data)

# Calculate correlation matrix
correlation_matrix = df.corr()

print("Correlation Matrix:")
print(correlation_matrix.round(3))

# Create heatmap using Seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, 
            annot=True,           # Show correlation values
            cmap='coolwarm',      # Color palette
            center=0,             # Center colormap at 0
            square=True,          # Square cells
            fmt='.3f',            # Format to 3 decimal places
            linewidths=0.5,       # Add lines between cells
            cbar_kws={'label': 'Correlation Coefficient'})

plt.title('Correlation Matrix Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

**Explanation:**
- `df.corr()` calculates the Pearson correlation coefficient between all numeric columns
- `sns.heatmap()` creates a color-coded matrix visualization
- `annot=True` displays the correlation values on each cell
- `cmap='coolwarm'` uses a blue-to-red color scheme (blue=negative, red=positive)
- `center=0` centers the colormap at zero correlation
- `fmt='.3f'` formats numbers to 3 decimal places

**Expected Output:** A heatmap showing correlations between variables, with values ranging from -1 to 1. Strong positive correlations appear in red, strong negative in blue, and weak correlations in a pale, near-neutral shade.

## 5. Generate a bar plot using Plotly.

In [None]:
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd

# Create sample data for bar plot
categories = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
sales_2022 = [120, 135, 98, 156, 142]
sales_2023 = [145, 142, 110, 178, 155]

# Method 1: Using Plotly Express (simpler)
df_sales = pd.DataFrame({
    'Product': categories * 2,
    'Sales': sales_2022 + sales_2023,
    'Year': ['2022'] * 5 + ['2023'] * 5
})

fig1 = px.bar(df_sales,
              x='Product',
              y='Sales',
              color='Year',
              barmode='group',  # px.bar stacks bars by default; group them side by side
              title='Sales Comparison: 2022 vs 2023',
              color_discrete_sequence=['lightblue', 'darkblue'])

fig1.update_layout(
    xaxis_title='Products',
    yaxis_title='Sales (in thousands)',
    font=dict(size=12),
    title_font_size=16
)

fig1.show()

# Method 2: Using Plotly Graph Objects (more control)
fig2 = go.Figure()

fig2.add_trace(go.Bar(
    name='2022',
    x=categories,
    y=sales_2022,
    marker_color='lightcoral'
))

fig2.add_trace(go.Bar(
    name='2023',
    x=categories,
    y=sales_2023,
    marker_color='darkred'
))

fig2.update_layout(
    title='Product Sales Comparison - Custom Styling',
    xaxis_title='Products',
    yaxis_title='Sales (in thousands)',
    barmode='group',  # Side by side bars
    template='plotly_white',
    font=dict(size=12),
    title_font_size=16
)

fig2.show()

print("Data used for visualization:")
print(df_sales.head(10))

**Explanation:**
- **Method 1 (Plotly Express)**: Higher-level interface, easier syntax for common plots
- **Method 2 (Graph Objects)**: More control over styling and customization
- `barmode='group'` creates side-by-side bars for comparison
- The `color` parameter in `px.bar` splits the bars by year; Plotly Express stacks them by default, so pass `barmode='group'` for side-by-side bars
- `template='plotly_white'` applies a clean white background theme
- Both methods create interactive plots with hover information, zoom, and pan capabilities

**Expected Output:** Two interactive bar charts showing product sales comparison between 2022 and 2023, with hover tooltips and interactive features.

## 6. Create a DataFrame and add a new column based on an existing column.

In [None]:
import pandas as pd

# Create initial DataFrame
employee_data = {
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Prince', 'Eve Wilson'],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Marketing'],
    'Salary': [75000, 55000, 82000, 62000, 58000],
    'Years_Experience': [5, 3, 7, 4, 2],
    'Performance_Score': [4.2, 3.8, 4.5, 4.0, 3.9]
}

df = pd.DataFrame(employee_data)
print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Method 1: Simple arithmetic operation
df['Annual_Bonus'] = df['Salary'] * 0.1  # 10% bonus
print("Added Annual_Bonus column (10% of salary):")
print(df[['Name', 'Salary', 'Annual_Bonus']])
print("\n" + "="*50 + "\n")

# Method 2: Using conditional logic with np.where
import numpy as np
df['Salary_Category'] = np.where(df['Salary'] >= 70000, 'High', 'Standard')
print("Added Salary_Category column using conditional logic:")
print(df[['Name', 'Salary', 'Salary_Category']])
print("\n" + "="*50 + "\n")

# Method 3: Using pandas apply() with lambda function
df['Experience_Level'] = df['Years_Experience'].apply(
    lambda x: 'Senior' if x >= 6 else 'Mid' if x >= 3 else 'Junior'
)
print("Added Experience_Level column using apply():")
print(df[['Name', 'Years_Experience', 'Experience_Level']])
print("\n" + "="*50 + "\n")

# Method 4: Complex calculation using multiple columns
df['Total_Compensation'] = df['Salary'] + df['Annual_Bonus'] + (df['Performance_Score'] * 1000)
print("Added Total_Compensation column (salary + bonus + performance bonus):")
print(df[['Name', 'Salary', 'Annual_Bonus', 'Performance_Score', 'Total_Compensation']])
print("\n" + "="*50 + "\n")

# Method 5: Using string operations
df['First_Name'] = df['Name'].str.split().str[0]  # Extract first name
print("Added First_Name column using string operations:")
print(df[['Name', 'First_Name']])

# Display final DataFrame
print("\n" + "="*50 + "\n")
print("Final DataFrame with all new columns:")
print(df)

**Explanation:**

This example demonstrates 5 different ways to create new columns:

1. **Simple arithmetic**: `df['new_col'] = df['existing_col'] * 0.1`
2. **Conditional logic**: `np.where()` for if-else conditions
3. **Apply function**: `df['col'].apply(lambda x: ...)` for complex transformations
4. **Multiple columns**: Combine multiple existing columns in calculations
5. **String operations**: `str.split()` and `str[]` for text manipulation

**Key Methods:**
- Direct assignment with arithmetic operations
- `np.where(condition, value_if_true, value_if_false)`
- `apply()` with lambda functions for complex logic
- String accessor methods with `.str`

**Expected Output:** A DataFrame showing the progressive addition of new columns based on existing data, demonstrating various transformation techniques.

## 7. Write a program to perform element-wise multiplication of two NumPy arrays.

In [None]:
import numpy as np

# Create two 1D arrays
array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([2, 3, 4, 5, 6])

print("1D Arrays:")
print(f"Array 1: {array1}")
print(f"Array 2: {array2}")

# Element-wise multiplication for 1D arrays
result_1d = array1 * array2
print(f"Element-wise multiplication: {result_1d}")
print()

# Create two 2D arrays
array_2d_1 = np.array([[1, 2, 3],
                       [4, 5, 6]])

array_2d_2 = np.array([[2, 2, 2],
                       [3, 3, 3]])

print("2D Arrays:")
print("Array 1:")
print(array_2d_1)
print("Array 2:")
print(array_2d_2)

# Element-wise multiplication for 2D arrays
result_2d = array_2d_1 * array_2d_2
print("Element-wise multiplication result:")
print(result_2d)
print()

# Alternative method using np.multiply()
result_multiply = np.multiply(array_2d_1, array_2d_2)
print("Using np.multiply() function:")
print(result_multiply)
print()

# Broadcasting example (different shapes)
array_broadcast_1 = np.array([[1, 2, 3],
                              [4, 5, 6]])
array_broadcast_2 = np.array([10, 20, 30])  # 1D array

print("Broadcasting example:")
print("Array 1 (2x3):")
print(array_broadcast_1)
print("Array 2 (1D, shape (3,)):")
print(array_broadcast_2)

result_broadcast = array_broadcast_1 * array_broadcast_2
print("Element-wise multiplication with broadcasting:")
print(result_broadcast)
print()

# Verify element-wise vs matrix multiplication difference
print("Comparison: Element-wise (*) vs Matrix multiplication (@):")
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print("Matrix A:")
print(A)
print("Matrix B:")
print(B)
print("Element-wise multiplication (A * B):")
print(A * B)
print("Matrix multiplication (A @ B):")
print(A @ B)

**Explanation:**
- **Element-wise multiplication**: `*` operator multiplies corresponding elements
- **Alternative method**: `np.multiply()` function does the same operation
- **Broadcasting**: NumPy automatically handles arrays of different shapes when possible
- **Difference from matrix multiplication**: `*` is element-wise, `@` or `np.dot()` is matrix multiplication

**Key Points:**
- Arrays must have compatible shapes (same shape or broadcastable)
- Element-wise: `[1,2] * [3,4] = [3,8]`
- Matrix multiplication: `[[1,2]] @ [[3],[4]] = [[11]]`

**Expected Output:** Demonstration of element-wise multiplication for 1D, 2D arrays, broadcasting, and comparison with matrix multiplication.
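The two identities under Key Points, and the shape requirement, can be checked directly. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

# Element-wise product: [1*3, 2*4]
elementwise = a * b

# Matrix product of a 1x2 row with a 2x1 column: 1*3 + 2*4
row = np.array([[1, 2]])
col = np.array([[3], [4]])
matrix_product = row @ col

# Shapes (2,) and (3,) are neither equal nor broadcastable
try:
    a * np.array([1, 2, 3])
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(elementwise)        # [3 8]
print(matrix_product)     # [[11]]
print(shapes_compatible)  # False
```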

## 8. Create a line plot with multiple lines using Matplotlib.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Generate x values
x = np.linspace(0, 10, 100)

# Generate multiple y series
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.sin(x) * np.exp(-x/5)  # Damped sine wave
y4 = x * 0.1  # Linear trend

# Create the plot
plt.figure(figsize=(12, 8))

# Plot multiple lines with different styles
plt.plot(x, y1, label='sin(x)', color='blue', linewidth=2, linestyle='-')
plt.plot(x, y2, label='cos(x)', color='red', linewidth=2, linestyle='--')
plt.plot(x, y3, label='damped sin(x)', color='green', linewidth=2, linestyle='-.')
plt.plot(x, y4, label='linear trend', color='orange', linewidth=2, linestyle=':')

# Customize the plot
plt.title('Multiple Line Plot Example', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('X values', fontsize=14)
plt.ylabel('Y values', fontsize=14)

# Add grid
plt.grid(True, alpha=0.3)

# Add legend
plt.legend(loc='upper right', fontsize=12, framealpha=0.9)

# Set axis limits
plt.xlim(0, 10)
plt.ylim(-1.5, 1.5)

# Add annotations for interesting points
plt.annotate('Maximum of cos(x)', 
             xy=(0, 1), xytext=(2, 1.3),
             arrowprops=dict(arrowstyle='->', color='red', alpha=0.7),
             fontsize=10)

# Customize tick parameters
plt.tick_params(axis='both', which='major', labelsize=10)

# Adjust layout
plt.tight_layout()
plt.show()

# Alternative approach: subplots for comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Left subplot: All lines together
ax1.plot(x, y1, 'b-', label='sin(x)', linewidth=2)
ax1.plot(x, y2, 'r--', label='cos(x)', linewidth=2)
ax1.plot(x, y3, 'g-.', label='damped sin(x)', linewidth=2)
ax1.set_title('Trigonometric Functions')
ax1.set_xlabel('X values')
ax1.set_ylabel('Y values')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right subplot: Linear vs exponential trends
x_trend = np.linspace(0, 5, 50)
linear = x_trend
exponential = np.exp(x_trend/2)

ax2.plot(x_trend, linear, 'b-', label='Linear (x)', linewidth=2)
ax2.plot(x_trend, exponential, 'r-', label='Exponential (e^(x/2))', linewidth=2)
ax2.set_title('Linear vs Exponential Growth')
ax2.set_xlabel('X values')
ax2.set_ylabel('Y values')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Line styles used:")
print("- Solid line: '-'")
print("- Dashed line: '--'")
print("- Dash-dot line: '-.'")
print("- Dotted line: ':'")
print("- Colors: 'blue', 'red', 'green', 'orange' or 'b', 'r', 'g', etc.")

**Explanation:**
- Multiple `plt.plot()` calls create multiple lines on the same axes
- Different line styles: solid (-), dashed (--), dash-dot (-.), dotted (:)
- `label` parameter creates entries for the legend
- `plt.legend()` displays the legend with line labels
- `plt.annotate()` adds text annotations with arrows
- Subplots allow side-by-side comparison of different data

**Expected Output:** Two figures: one showing four curves (sine, cosine, damped sine, and a linear trend) on a single set of axes, and another with two side-by-side subplots comparing trigonometric functions with linear and exponential growth.
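When the number of series grows, the repeated plotting calls can be replaced by a loop. The sketch below uses Matplotlib's object-oriented interface (`fig, ax = plt.subplots()`); the `series` dictionary and its contents are hypothetical, introduced only for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Map each legend label to (y values, line style); illustrative data only.
series = {
    "sin(x)": (np.sin(x), "-"),
    "cos(x)": (np.cos(x), "--"),
    "0.1 * x": (0.1 * x, ":"),
}

fig, ax = plt.subplots(figsize=(8, 5))
for label, (y, style) in series.items():
    # Each call adds one line to the same axes
    ax.plot(x, y, linestyle=style, label=label)

ax.set_xlabel("X values")
ax.set_ylabel("Y values")
ax.grid(True, alpha=0.3)
ax.legend()  # Builds the legend from the label of each line
```

In a notebook the figure is rendered at the end of the cell; in a script, call `plt.show()` afterwards.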

## 9. Generate a Pandas DataFrame and filter rows where a column value is greater than a threshold.

In [None]:
import pandas as pd
import numpy as np

# Generate a sample DataFrame
np.random.seed(42)
data = {
    'Student_ID': range(1, 21),
    'Name': [f'Student_{i}' for i in range(1, 21)],
    'Math_Score': np.random.randint(60, 100, 20),
    'Science_Score': np.random.randint(55, 95, 20),
    'English_Score': np.random.randint(50, 100, 20),
    'Age': np.random.randint(18, 25, 20),
    'GPA': np.round(np.random.uniform(2.0, 4.0, 20), 2)
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.head(10))
print(f"\nTotal students: {len(df)}")
print("\n" + "="*60 + "\n")

# Method 1: Simple filtering - Math score > 80
threshold_math = 80
filtered_math = df[df['Math_Score'] > threshold_math]
print(f"Students with Math Score > {threshold_math}:")
print(filtered_math[['Name', 'Math_Score']])
print(f"Number of students: {len(filtered_math)}")
print("\n" + "="*60 + "\n")

# Method 2: Multiple conditions - GPA > 3.5 AND Age < 22
high_gpa_young = df[(df['GPA'] > 3.5) & (df['Age'] < 22)]
print("Students with GPA > 3.5 AND Age < 22:")
print(high_gpa_young[['Name', 'GPA', 'Age']])
print(f"Number of students: {len(high_gpa_young)}")
print("\n" + "="*60 + "\n")

# Method 3: Using query() method - more readable for complex conditions
high_performers = df.query('Math_Score > 85 and Science_Score > 80 and English_Score > 75')
print("High performers (Math>85, Science>80, English>75):")
print(high_performers[['Name', 'Math_Score', 'Science_Score', 'English_Score']])
print(f"Number of high performers: {len(high_performers)}")
print("\n" + "="*60 + "\n")

# Method 4: Using isin() for filtering with multiple values
target_ages = [20, 21, 22]
students_target_age = df[df['Age'].isin(target_ages)]
print(f"Students aged {target_ages}:")
print(students_target_age[['Name', 'Age']])
print(f"Number of students: {len(students_target_age)}")
print("\n" + "="*60 + "\n")

# Method 5: Filtering with string operations
# Let's filter students whose names contain specific patterns
pattern_students = df[df['Name'].str.contains('1')]  # Names containing '1'
print("Students with '1' in their name:")
print(pattern_students[['Name']])
print("\n" + "="*60 + "\n")

# Summary statistics for filtered data
print("Summary statistics for high performers:")
print(high_performers[['Math_Score', 'Science_Score', 'English_Score', 'GPA']].describe())

# Advanced filtering: Top 25% by GPA
gpa_75th_percentile = df['GPA'].quantile(0.75)
top_25_percent = df[df['GPA'] >= gpa_75th_percentile]
print(f"\nTop 25% students by GPA (GPA >= {gpa_75th_percentile:.2f}):")
print(top_25_percent[['Name', 'GPA']].sort_values('GPA', ascending=False))

**Explanation:**

**Filtering Methods Demonstrated:**
1. **Simple filtering**: `df[df['column'] > threshold]` - Basic comparison
2. **Multiple conditions**: `df[(condition1) & (condition2)]` - Use `&` for AND, `|` for OR
3. **Query method**: `df.query('condition')` - More readable for complex conditions
4. **isin() method**: `df[df['column'].isin(values)]` - Check if values are in a list
5. **String operations**: `df[df['column'].str.contains('pattern')]` - Text filtering
6. **Quantile-based**: Using percentiles for dynamic thresholds

**Key Points:**
- Parentheses are required around each condition when using `&` and `|`, because these operators bind more tightly than comparisons such as `>` and `<`
- `query()` method uses string expressions and is often more readable
- String methods are accessed via `.str` accessor
- Filtering returns a new DataFrame with matching rows

**Expected Output:** Various filtered subsets of student data based on different criteria, demonstrating the flexibility of Pandas filtering operations.
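The sketch below puts the boolean-mask and `query()` forms of one filter side by side and checks that they agree. The `scores` DataFrame is hypothetical, with values chosen so the result is easy to verify by eye. Inside `query()`, the `@` prefix refers to a local Python variable.

```python
import pandas as pd

# Hypothetical scores; values chosen only to make the result easy to check.
scores = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Math_Score": [92, 75, 88, 60],
    "GPA": [3.8, 3.9, 3.2, 2.5],
})

threshold = 80

# Boolean mask: each comparison is wrapped in parentheses because
# & binds more tightly than > in Python.
by_mask = scores[(scores["Math_Score"] > threshold) & (scores["GPA"] > 3.5)]

# query(): the same filter as a string; @threshold is the local variable.
by_query = scores.query("Math_Score > @threshold and GPA > 3.5")

# | means OR: rows matching at least one condition.
either = scores[(scores["Math_Score"] > threshold) | (scores["GPA"] > 3.5)]

print(by_mask["Name"].tolist())
print(by_query.equals(by_mask))
print(either["Name"].tolist())
```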

## 10. Create a histogram using Seaborn to visualize a distribution.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set style for better-looking plots
sns.set_style("whitegrid")

# Generate sample data for visualization
np.random.seed(42)

# Create different types of distributions
normal_data = np.random.normal(100, 15, 1000)  # Normal distribution
exponential_data = np.random.exponential(2, 1000)  # Exponential distribution
bimodal_data = np.concatenate([np.random.normal(70, 10, 500), 
                               np.random.normal(130, 10, 500)])  # Bimodal

# Create a DataFrame for grouped analysis
data = {
    'values': np.concatenate([normal_data[:300], normal_data[:300] + 20, normal_data[:300] + 40]),
    'group': ['Group A'] * 300 + ['Group B'] * 300 + ['Group C'] * 300
}
df_grouped = pd.DataFrame(data)

# Create subplots for different histogram examples
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Histogram Examples using Seaborn', fontsize=16, fontweight='bold')

# 1. Basic histogram
sns.histplot(normal_data, bins=30, ax=axes[0, 0])
axes[0, 0].set_title('Basic Histogram\n(Normal Distribution)')
axes[0, 0].set_xlabel('Values')
axes[0, 0].set_ylabel('Frequency')

# 2. Histogram with KDE (Kernel Density Estimation)
sns.histplot(normal_data, bins=30, kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Histogram with KDE\n(Density Curve)')
axes[0, 1].set_xlabel('Values')
axes[0, 1].set_ylabel('Frequency')

# 3. Multiple distributions comparison
sns.histplot(normal_data, bins=30, alpha=0.5, label='Normal', ax=axes[0, 2])
sns.histplot(exponential_data * 20 + 60, bins=30, alpha=0.5, label='Exponential', ax=axes[0, 2])
axes[0, 2].set_title('Comparing Multiple Distributions')
axes[0, 2].set_xlabel('Values')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].legend()

# 4. Grouped histogram
sns.histplot(data=df_grouped, x='values', hue='group', bins=25, ax=axes[1, 0])
axes[1, 0].set_title('Grouped Histogram by Category')
axes[1, 0].set_xlabel('Values')
axes[1, 0].set_ylabel('Frequency')

# 5. Normalized histogram (density)
sns.histplot(bimodal_data, bins=40, stat='density', ax=axes[1, 1])
axes[1, 1].set_title('Normalized Histogram\n(Bimodal Distribution)')
axes[1, 1].set_xlabel('Values')
axes[1, 1].set_ylabel('Density')

# 6. Step histogram
sns.histplot(normal_data, bins=25, element='step', ax=axes[1, 2])
axes[1, 2].set_title('Step Histogram')
axes[1, 2].set_xlabel('Values')
axes[1, 2].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Additional examples with different plot types
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Histogram with KDE overlay (histplot replaces the deprecated distplot)
sns.histplot(normal_data, kde=True, ax=axes[0])
axes[0].set_title('Distribution with KDE')

# Box plot for comparison
sns.boxplot(data=df_grouped, x='group', y='values', ax=axes[1])
axes[1].set_title('Box Plot of Groups')

# Violin plot combining a kernel density estimate with a box plot
sns.violinplot(data=df_grouped, x='group', y='values', ax=axes[2])
axes[2].set_title('Violin Plot of Groups')

plt.tight_layout()
plt.show()

# Print statistical summary
print("Statistical Summary of the Normal Distribution:")
print(f"Mean: {np.mean(normal_data):.2f}")
print(f"Standard Deviation: {np.std(normal_data):.2f}")
print(f"Min: {np.min(normal_data):.2f}")
print(f"Max: {np.max(normal_data):.2f}")
print(f"Median: {np.median(normal_data):.2f}")

print("\nGroup Statistics:")
print(df_grouped.groupby('group')['values'].describe())

**Explanation:**

**Histogram Types Demonstrated:**
1. **Basic histogram**: `sns.histplot(data, bins=30)` - Simple frequency distribution
2. **Histogram with KDE**: `kde=True` adds a smooth density curve
3. **Multiple distributions**: Overlaying histograms with `alpha` for transparency
4. **Grouped histogram**: `hue` parameter creates separate histograms by category
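5. **Normalized histogram**: `stat='density'` rescales the bars so that their total area equals 1 (use `stat='probability'` for heights that sum to 1)
6. **Step histogram**: `element='step'` draws a single stepped outline instead of separate bars (add `fill=False` for an unfilled outline)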

**Key Parameters:**
- `bins`: Number of bins (bars) in the histogram
- `kde`: Add kernel density estimation curve
- `stat`: Type of statistic ('count', 'frequency', 'probability', 'percent', 'density')
- `hue`: Group data by categorical variable
- `alpha`: Transparency (0-1)
- `element`: How to draw histogram ('bars', 'step', 'poly')

**Expected Output:** Six different histogram visualizations showing various ways to display distribution data, plus box plots and violin plots for comparison. Statistical summaries provide numerical context for the visualizations.
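The difference between counts, probabilities, and densities is easy to blur. The sketch below reproduces the three normalizations with plain NumPy arithmetic. It illustrates the definitions and does not call Seaborn, and the generator seed and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(100, 15, 1000)

# Raw counts per bin (what stat='count' displays)
counts, edges = np.histogram(values, bins=30)
widths = np.diff(edges)

# Probability: counts divided by the number of observations,
# so the bar heights sum to 1
probability = counts / counts.sum()

# Density: probability divided by bin width,
# so the total bar area (height * width) is 1
density = probability / widths

print(probability.sum())
print((density * widths).sum())
```

Density is the right scale when overlaying a fitted probability density function, because both then integrate to 1.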

## 11. Perform matrix multiplication using NumPy.

In [None]:
import numpy as np

print("Matrix Multiplication Examples using NumPy")
print("=" * 50)

# Example 1: Basic 2x2 matrix multiplication
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

print("Example 1: Basic 2x2 Matrix Multiplication")
print("Matrix A:")
print(A)
print("\nMatrix B:")
print(B)

# Method 1: Using @ operator (recommended)
result_at = A @ B
print("\nA @ B (using @ operator):")
print(result_at)

# Method 2: Using np.dot() function
result_dot = np.dot(A, B)
print("\nnp.dot(A, B):")
print(result_dot)

# Method 3: Using .dot() method
result_method = A.dot(B)
print("\nA.dot(B):")
print(result_method)

print("\n" + "=" * 50)

# Example 2: Different sized matrices
C = np.array([[1, 2, 3],
              [4, 5, 6]])  # 2x3 matrix

D = np.array([[7, 8],
              [9, 10],
              [11, 12]])  # 3x2 matrix

print("Example 2: Different sized matrices (2x3) × (3x2)")
print("Matrix C (2x3):")
print(C)
print("\nMatrix D (3x2):")
print(D)

result_cd = C @ D  # Results in 2x2 matrix
print("\nC @ D (result is 2x2):")
print(result_cd)

# Show the calculation step by step for first element
print("\nStep-by-step calculation for result[0,0]:")
print(f"C[0,:] = {C[0,:]}")
print(f"D[:,0] = {D[:,0]}")
print(f"Dot product: {C[0,:]} · {D[:,0]} = {np.dot(C[0,:], D[:,0])}")

print("\n" + "=" * 50)

# Example 3: Matrix-vector multiplication
E = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

v = np.array([1, 0, -1])

print("Example 3: Matrix-Vector Multiplication")
print("Matrix E:")
print(E)
print(f"\nVector v: {v}")

result_ev = E @ v
print(f"\nE @ v = {result_ev}")

print("\n" + "=" * 50)

# Example 4: Multiple matrix multiplications
F = np.array([[2, 0],
              [1, 3]])

G = np.array([[1, 4],
              [2, 1]])

H = np.array([[1, 0],
              [0, 2]])

print("Example 4: Chain Matrix Multiplication")
print("Matrix F:")
print(F)
print("\nMatrix G:")
print(G)
print("\nMatrix H:")
print(H)

# Chain multiplication: (F @ G) @ H
result_chain = F @ G @ H
print("\nF @ G @ H:")
print(result_chain)

# Show intermediate step
intermediate = F @ G
print("\nIntermediate result (F @ G):")
print(intermediate)
print("\nFinal result ((F @ G) @ H):")
print(intermediate @ H)

print("\n" + "=" * 50)

# Example 5: Identity matrix and inverse
I = np.eye(3)  # 3x3 identity matrix
M = np.array([[2, 1, 0],
              [1, 2, 1],
              [0, 1, 2]])

print("Example 5: Identity Matrix and Matrix Properties")
print("Identity matrix I:")
print(I)
print("\nMatrix M:")
print(M)

# Multiplication with identity
result_identity = M @ I
print("\nM @ I (should equal M):")
print(result_identity)

# Matrix inverse (if it exists)
try:
    M_inv = np.linalg.inv(M)
    print("\nInverse of M:")
    print(M_inv)
    
    # Verify: M @ M^(-1) should equal identity
    verification = M @ M_inv
    print("\nM @ M^(-1) (should be close to identity):")
    print(np.round(verification, 10))  # Round to hide floating-point noise
    
except np.linalg.LinAlgError:
    print("\nMatrix M is not invertible (singular matrix)")

print("\n" + "=" * 50)

# Example 6: Batch matrix multiplication
print("Example 6: Batch Operations")
# Create stacks of matrices (seeded so the output is reproducible)
np.random.seed(42)
batch_A = np.random.randint(1, 5, (3, 2, 2))  # 3 matrices of size 2x2
batch_B = np.random.randint(1, 5, (3, 2, 2))  # 3 matrices of size 2x2

print("Batch of 3 matrices A:")
for i, matrix in enumerate(batch_A):
    print(f"A[{i}]:")
    print(matrix)

print("\nBatch matrix multiplication using np.matmul:")
batch_result = np.matmul(batch_A, batch_B)  # Multiply corresponding matrices
print("Results:")
for i, result in enumerate(batch_result):
    print(f"A[{i}] @ B[{i}]:")
    print(result)
    print()

**Explanation:**

**Matrix Multiplication Methods:**
1. **@ operator**: `A @ B` - Recommended modern approach (Python 3.5+)
2. **np.dot()**: `np.dot(A, B)` - Traditional function approach (matches `@` for 1-D and 2-D arrays, but behaves differently for arrays with more dimensions)
3. **dot() method**: `A.dot(B)` - Object method approach
4. **np.matmul()**: `np.matmul(A, B)` - Explicit matrix multiplication function

**Key Rules:**
- Matrix dimensions must be compatible: (m×n) × (n×p) → (m×p)
- Order matters: A @ B ≠ B @ A (generally)
- Identity matrix: A @ I = I @ A = A
- Inverse: A @ A⁻¹ = A⁻¹ @ A = I (if A is invertible)

**Applications Shown:**
- Basic 2×2 multiplication
- Different sized matrices
- Matrix-vector multiplication
- Chain multiplication
- Identity and inverse operations
- Batch processing of multiple matrices

**Expected Output:** Comprehensive examples showing various matrix multiplication scenarios with step-by-step calculations and verification of mathematical properties.
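The shape rule and the non-commutativity noted under Key Rules can be confirmed with a short check, which also shows where `np.dot` and `np.matmul` diverge on stacked matrices. This is a minimal sketch with illustrative names.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])         # shape (3, 2)

# (m x n) @ (n x p) -> (m x p)
print((A @ B).shape)   # (2, 2)
print((B @ A).shape)   # (3, 3), so the two orders differ even in shape

# Square matrices of the same size still do not commute in general
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
commutes = np.array_equal(P @ Q, Q @ P)

# Stacks of matrices: matmul pairs up matrix i with matrix i,
# while dot combines every matrix in the first stack with every one in the second
batch_a = np.ones((3, 2, 2))
batch_b = np.ones((3, 2, 2))
matmul_shape = np.matmul(batch_a, batch_b).shape
dot_shape = np.dot(batch_a, batch_b).shape

print(commutes, matmul_shape, dot_shape)
```

For batch work, `@` or `np.matmul` is therefore the safer choice.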

## 12. Use Pandas to load a CSV file and display its first 5 rows. (Create sample CSV first)

In [None]:
import pandas as pd
import numpy as np
import os

# Step 1: Create a sample CSV file first
np.random.seed(42)

# Generate sample employee data
sample_data = {
    'Employee_ID': range(1001, 1051),
    'Name': [f'Employee_{i:02d}' for i in range(1, 51)],
    'Department': np.random.choice(['HR', 'Engineering', 'Marketing', 'Sales', 'Finance'], 50),
    'Salary': np.random.randint(40000, 120000, 50),
    'Years_Experience': np.random.randint(0, 20, 50),
    'Performance_Rating': np.round(np.random.uniform(2.0, 5.0, 50), 1),
    'City': np.random.choice(['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle'], 50),
    'Hire_Date': pd.date_range('2010-01-01', '2023-12-31', periods=50).strftime('%Y-%m-%d')
}

# Create DataFrame and save to CSV
df_sample = pd.DataFrame(sample_data)
csv_filename = 'sample_employee_data.csv'
df_sample.to_csv(csv_filename, index=False)

print(f"✓ Created sample CSV file: {csv_filename}")
print(f"File size: {os.path.getsize(csv_filename)} bytes")
print("\n" + "="*60 + "\n")

# Step 2: Load the CSV file using Pandas
print("Loading CSV file using pd.read_csv()...")

# Method 1: Basic loading
df = pd.read_csv(csv_filename)
print(f"Successfully loaded {csv_filename}")
print(f"DataFrame shape: {df.shape} (rows, columns)")
print("\n" + "="*60 + "\n")

# Step 3: Display first 5 rows
print("First 5 rows using head():")
print(df.head())
print("\n" + "="*60 + "\n")

# Additional useful methods when working with CSV files
print("Basic information about the dataset:")
print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")
print(f"Column names: {list(df.columns)}")
print("Data types:")
print(df.dtypes)
print("\n" + "="*60 + "\n")

# Show different ways to explore the data
print("First 3 rows:")
print(df.head(3))
print("\nLast 5 rows:")
print(df.tail())
print("\n" + "="*60 + "\n")

# Method 2: Loading with specific parameters
print("Advanced CSV loading options:")

# Load with custom parameters
df_custom = pd.read_csv(csv_filename, 
                       parse_dates=['Hire_Date'],  # Parse date column
                       dtype={'Employee_ID': str})  # Specify data type

print("After custom loading with date parsing:")
print(df_custom.dtypes)
print("\nFirst 5 rows with parsed dates:")
print(df_custom.head())
print("\n" + "="*60 + "\n")

# Show summary statistics
print("Summary statistics of numerical columns:")
print(df.describe())
print("\n" + "="*60 + "\n")

# Show missing values check
print("Missing values check:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"Total missing values: {missing_values.sum()}")
print("\n" + "="*60 + "\n")

# Sample queries on the loaded data
print("Sample data analysis:")
print("1. Average salary by department:")
avg_salary = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print(avg_salary)

print("\n2. Top 5 highest paid employees:")
top_earners = df.nlargest(5, 'Salary')[['Name', 'Department', 'Salary']]
print(top_earners)

print("\n3. Employee count by city:")
city_counts = df['City'].value_counts()
print(city_counts)

# Note: the sample CSV file is left on disk; delete it manually if it is no longer needed
print(f"\n{'='*60}")
print(f"Note: Sample CSV file '{csv_filename}' has been created in the current directory.")
print("You can examine it or delete it after running this example.")

**Explanation:**

**Process Demonstrated:**
1. **Create sample CSV**: Generate realistic employee data and save to CSV file
2. **Load CSV**: Use `pd.read_csv()` to load the file into a DataFrame
3. **Display data**: Use `head()` to show first 5 rows
4. **Explore data**: Various methods to understand the dataset

**Key Methods for CSV Loading:**
- `pd.read_csv(filename)`: Basic CSV loading
- `parse_dates`: Automatically parse date columns
- `dtype`: Specify data types for columns
- `head(n)`: Display first n rows (default 5)
- `tail(n)`: Display last n rows
- `info()`: Overview of DataFrame structure
- `describe()`: Statistical summary

**Additional Analysis Methods:**
- `shape`: Get dimensions (rows, columns)
- `columns`: Get column names
- `dtypes`: Get data types
- `isnull().sum()`: Check for missing values
- `groupby()`: Group data for analysis

**Expected Output:** Complete workflow from creating a CSV file to loading and analyzing it, including data exploration and summary statistics.
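The effect of `dtype` and `parse_dates` is easiest to see on an input where the defaults lose information. The sketch below reads a hypothetical two-row CSV from memory with `io.StringIO`, so no file is written. The IDs with leading zeros are an assumption made for illustration and do not appear in the exercise data.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk.
csv_text = """Employee_ID,Name,Hire_Date,Salary
0042,Employee_01,2015-03-01,55000
0107,Employee_02,2019-11-15,72000
"""

# Default loading: IDs are inferred as integers, dates stay as text
plain = pd.read_csv(io.StringIO(csv_text))

# Explicit options: keep IDs as strings and parse the date column
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Employee_ID": str},
    parse_dates=["Hire_Date"],
)

print(plain["Employee_ID"].tolist())         # leading zeros are lost
print(typed["Employee_ID"].tolist())         # leading zeros are kept
print(typed["Hire_Date"].dt.year.tolist())   # .dt works only on parsed dates
```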

## 13. Create a 3D scatter plot using Plotly.

In [None]:
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import pandas as pd

# Set random seed for reproducible results
np.random.seed(42)

print("Creating 3D Scatter Plots using Plotly")
print("=" * 50)

# Example 1: Basic 3D scatter plot with random data
n_points = 100

# Generate 3D data with some patterns
x = np.random.normal(0, 1, n_points)
y = np.random.normal(0, 1, n_points)
z = x**2 + y**2 + np.random.normal(0, 0.1, n_points)  # Paraboloid with noise

# Create basic 3D scatter plot
fig1 = px.scatter_3d(x=x, y=y, z=z,
                     title='Basic 3D Scatter Plot',
                     labels={'x': 'X Axis', 'y': 'Y Axis', 'z': 'Z Axis'})

fig1.update_traces(marker=dict(size=5, opacity=0.8))
fig1.show()

print("Example 1: Basic 3D scatter plot created")
print("- Interactive plot with zoom, rotate, and pan capabilities")
print("- Data shows a paraboloid pattern (z = x² + y² + noise)")
print()

# Example 2: 3D scatter plot with color coding and categories
categories = np.random.choice(['Group A', 'Group B', 'Group C'], n_points)
colors = np.random.uniform(0, 100, n_points)  # Color scale values

# Create DataFrame for easier handling
df_3d = pd.DataFrame({
    'X': x,
    'Y': y, 
    'Z': z,
    'Category': categories,
    'Color_Value': colors,
    'Size': np.random.uniform(5, 15, n_points)
})

fig2 = px.scatter_3d(df_3d, x='X', y='Y', z='Z',
                     color='Category',  # Color by category
                     size='Size',       # Size by value
                     hover_data=['Color_Value'],  # Additional hover info
                     title='3D Scatter Plot with Categories and Sizing',
                     color_discrete_sequence=['red', 'blue', 'green'])

fig2.update_layout(
    scene=dict(
        xaxis_title='X Coordinate',
        yaxis_title='Y Coordinate', 
        zaxis_title='Z Coordinate',
        bgcolor='white'
    )
)

fig2.show()

print("Example 2: Enhanced 3D scatter plot with:")
print("- Color coding by category")
print("- Variable point sizes")
print("- Custom hover information")
print("- Styled axes and background")
print()

# Example 3: Mathematical surface visualization
# Create a more complex 3D dataset
theta = np.linspace(0, 2*np.pi, 50)
phi = np.linspace(0, np.pi, 50)
theta_grid, phi_grid = np.meshgrid(theta, phi)

# Convert spherical coordinates to Cartesian (sphere surface)
x_sphere = np.sin(phi_grid) * np.cos(theta_grid)
y_sphere = np.sin(phi_grid) * np.sin(theta_grid) 
z_sphere = np.cos(phi_grid)

# Flatten arrays for scatter plot
x_flat = x_sphere.flatten()
y_flat = y_sphere.flatten()
z_flat = z_sphere.flatten()

# Add color based on height (z-coordinate)
colors_sphere = z_flat

fig3 = go.Figure(data=[go.Scatter3d(
    x=x_flat,
    y=y_flat,
    z=z_flat,
    mode='markers',
    marker=dict(
        size=3,
        color=colors_sphere,
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title="Height (Z)")
    ),
    text=[f'Point ({xi:.2f}, {yi:.2f}, {zi:.2f})' for xi, yi, zi in zip(x_flat, y_flat, z_flat)],
    hovertemplate='<b>Coordinates</b><br>' +
                  'X: %{x:.3f}<br>' +
                  'Y: %{y:.3f}<br>' +
                  'Z: %{z:.3f}<br>' +
                  '<extra></extra>'
)])

fig3.update_layout(
    title='3D Sphere Surface with Color Gradient',
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Z',
        aspectmode='cube',  # Cube-shaped axis box; equal scaling here because all axes span [-1, 1]
        camera=dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=0),
            eye=dict(x=1.5, y=1.5, z=1.5)
        )
    ),
    width=800,
    height=600
)

fig3.show()

print("Example 3: Mathematical sphere surface")
print("- Points arranged on a sphere surface")
print("- Color gradient based on height (z-coordinate)")
print("- Custom hover templates")
print("- Equal aspect ratio for true sphere appearance")
print()

# Example 4: Real-world data simulation (3D clustering)
print("Example 4: Simulated real-world data (customer segmentation)")

# Simulate customer data with 3 clusters
np.random.seed(123)

# Cluster centers
centers = [(2, 2, 2), (6, 6, 2), (2, 6, 6)]
cluster_data = []

for i, (cx, cy, cz) in enumerate(centers):
    # Generate points around each center
    n_cluster = 50
    cluster_x = np.random.normal(cx, 1, n_cluster)
    cluster_y = np.random.normal(cy, 1, n_cluster) 
    cluster_z = np.random.normal(cz, 1, n_cluster)
    
    cluster_df = pd.DataFrame({
        'Income': cluster_x * 10000,  # Scale to realistic income
        'Spending': cluster_y * 1000,  # Scale to spending
        'Age': cluster_z * 8 + 20,     # Scale to age range
        'Cluster': f'Segment {i+1}',
        'Customer_ID': range(i*n_cluster, (i+1)*n_cluster)
    })
    cluster_data.append(cluster_df)

# Combine all clusters
df_customers = pd.concat(cluster_data, ignore_index=True)

fig4 = px.scatter_3d(df_customers, 
                     x='Income', y='Spending', z='Age',
                     color='Cluster',
                     title='Customer Segmentation Analysis (3D)',
                     labels={
                         'Income': 'Annual Income ($)',
                         'Spending': 'Annual Spending ($)', 
                         'Age': 'Age (years)'
                     },
                     hover_data=['Customer_ID'])

fig4.update_traces(marker=dict(size=6, opacity=0.8))
fig4.update_layout(
    scene=dict(
        xaxis_title='Annual Income ($)',
        yaxis_title='Annual Spending ($)',
        zaxis_title='Age (years)'
    ),
    width=900,
    height=700
)

fig4.show()

print("Customer segmentation results:")
print(f"Total customers: {len(df_customers)}")
print("Segments identified:", df_customers['Cluster'].unique())
print("\nSegment characteristics:")
print(df_customers.groupby('Cluster')[['Income', 'Spending', 'Age']].mean().round(2))

print("\n" + "=" * 50)
print("3D Scatter Plot Features Demonstrated:")
print("1. Basic 3D scatter plots with Plotly Express")
print("2. Color coding and sizing by variables")
print("3. Mathematical surface visualization")
print("4. Real-world clustering example")
print("5. Interactive features: zoom, rotate, pan, hover")
print("6. Custom styling and layouts")
print("7. Multiple data encoding methods")

**Explanation:**

**3D Scatter Plot Methods:**
1. **Plotly Express**: `px.scatter_3d()` - High-level, easy-to-use interface
2. **Graph Objects**: `go.Scatter3d()` - More control and customization options

**Key Features Demonstrated:**
- **Basic 3D plotting**: x, y, z coordinates with interactive controls
- **Color coding**: `color` parameter for categorical or continuous variables
- **Size variation**: `size` parameter for additional data dimension
- **Hover information**: Custom tooltips with `hover_data` and `hovertemplate`
- **Mathematical surfaces**: Sphere generation using spherical coordinates
- **Real-world applications**: Customer segmentation clustering

**Interactive Features:**
- **Rotation**: Click and drag to rotate the 3D view
- **Zoom**: Scroll to zoom in/out
- **Pan**: Shift+drag to pan the view
- **Hover**: Detailed information on mouse hover
- **Camera control**: Programmatic view angle setting

**Styling Options:**
- Color scales (`colorscale='Viridis'`)
- Marker properties (size, opacity)
- Axis labels and titles
- Background colors and layout
- Cube-shaped axis box (`aspectmode='cube'`), which gives equal scaling when all three axes span the same range

**Expected Output:** Four different 3D scatter plots showcasing various visualization techniques, from basic plots to complex real-world data analysis with customer segmentation.
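The spherical-to-Cartesian conversion behind the sphere example can be checked numerically before plotting. The sketch below uses only NumPy and confirms that every generated point lies at distance 1 from the origin. The variable names mirror the example, but the check itself is an addition for illustration.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 50)   # azimuthal angle
phi = np.linspace(0, np.pi, 50)         # polar angle
theta_grid, phi_grid = np.meshgrid(theta, phi)

# Unit sphere: x = sin(phi)cos(theta), y = sin(phi)sin(theta), z = cos(phi)
x = np.sin(phi_grid) * np.cos(theta_grid)
y = np.sin(phi_grid) * np.sin(theta_grid)
z = np.cos(phi_grid)

# Every point should lie at distance 1 from the origin
radius = np.sqrt(x**2 + y**2 + z**2)

# Scatter traces expect 1-D coordinate arrays, hence the flattening
x_flat, y_flat, z_flat = x.ravel(), y.ravel(), z.ravel()

print(x_flat.shape)
print(np.allclose(radius, 1.0))
```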