**Phase 4: Applications and Projects with NumPy for Data Manipulation**. This phase applies NumPy to practical data cleaning, analysis, and visualization tasks through short examples, a mini-project, and an exercise.

---

## **NumPy for Data Manipulation**

### **1. Applying NumPy Operations to Clean, Filter, and Transform Datasets**

- **Data Cleaning**:
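  - Use `np.nan` to represent missing values, `np.isnan()` to detect them, and NaN-aware functions such as `np.nanmean()` to skip them in calculations.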
  - Replace invalid or corrupted data using conditions (`np.where()`).
  - Remove outliers using statistical methods like z-score (`scipy.stats.zscore`) to filter extreme values.
  - Example:
    ```python
    import numpy as np
    
    # Replace NaN values with mean
    data = np.array([1, 2, np.nan, 4, 5])
    cleaned_data = np.where(np.isnan(data), np.nanmean(data), data)
    ```
    
- **Data Filtering**:
  - Use boolean indexing to select only the rows or values that satisfy a condition.
  - Example: Filtering values greater than a threshold.
    ```python
    # Example of filtering values
    data = np.array([10, 20, 30, 40, 50])
    filtered_data = data[data > 25]  # [30, 40, 50]
    ```
    
- **Data Transformation**:
  - Use vectorized operations for speed (`+`, `-`, `*`, `/` directly on arrays).
  - Apply mathematical functions (`np.log()`, `np.sqrt()`, etc.) across datasets.
  - Example: Normalizing data between 0 and 1.
    ```python
    # Normalization example
    data = np.array([10, 20, 30, 40, 50])
    normalized_data = (data - data.min()) / (data.max() - data.min())
    ```
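
The cleaning bullets mention `np.where()` for invalid values and z-scores for outliers without showing either. A minimal sketch follows, assuming hypothetical readings in which negative values are invalid and a z-score cutoff of 2; the data and the cutoff are illustrative choices, not part of the original material. The z-score is computed by hand here. On NaN-free data, `scipy.stats.zscore` with its default settings should give the same values.

```python
import numpy as np

# Hypothetical readings: -1 marks a corrupted measurement, 250 is an extreme value
readings = np.array([12.0, 15.0, -1.0, 14.0, 13.0, 250.0, 16.0, 15.0])

# Replace invalid (negative) entries with NaN so they are excluded from statistics
readings = np.where(readings < 0, np.nan, readings)

# Fill the NaN entries with the median of the valid values
filled = np.where(np.isnan(readings), np.nanmedian(readings), readings)

# z-score: distance from the mean in units of (population) standard deviation
z_scores = (filled - filled.mean()) / filled.std()

# Keep only values whose |z| is below the cutoff (2 is an assumed threshold)
inliers = filled[np.abs(z_scores) < 2]
print(inliers)
```

With only eight points, the largest possible |z| is about 2.65, so the commonly quoted cutoff of 3 would never flag anything here. The threshold should be chosen with the sample size in mind.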

### **2. Integration with Pandas**

- **Pandas DataFrame Creation**:
  - Convert NumPy arrays into Pandas DataFrames using `pd.DataFrame()`.
  - Use Pandas for more advanced operations such as merging datasets, grouping, or dealing with missing data.

- **Conversion Examples**:
  ```python
  import pandas as pd

  # Convert NumPy array to Pandas DataFrame
  np_array = np.array([[1, 2, 3], [4, 5, 6]])
  df = pd.DataFrame(np_array, columns=['A', 'B', 'C'])
  ```

- **Operations**:
  - Numeric Pandas columns are typically backed by NumPy arrays, so NumPy functions such as `np.log()` or `np.sqrt()` can be applied directly to a Series or DataFrame.
  - Integrate filtering, cleaning, and transforming NumPy data within a DataFrame.
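
To make the Operations bullets concrete, here is a small sketch of moving between the two libraries. The column names and values are made up for illustration. `fillna()`, boolean indexing, and `to_numpy()` are standard Pandas features.

```python
import numpy as np
import pandas as pd

# Illustrative data: column names and values are assumptions for this sketch
np_array = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [4.0, 220.0]])
df = pd.DataFrame(np_array, columns=['id', 'score'])

# Cleaning: fill the missing score with the column mean (NaN is skipped by default)
df['score'] = df['score'].fillna(df['score'].mean())

# Transforming: NumPy functions apply element-wise to a column
df['log_score'] = np.log(df['score'])

# Filtering: boolean indexing works the same way as with NumPy arrays
high_scores = df[df['score'] > 190]

# Convert back to a NumPy array when only the raw numbers are needed
scores = df['score'].to_numpy()
print(high_scores)
```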

### **3. Mini-Project 4: Data Cleaning and Analysis using NumPy**

#### **Objective**:
- Load a dataset from a CSV file using NumPy.
- Perform basic data cleaning tasks.
- Analyze the dataset for patterns or trends.

#### **Step-by-Step**:
1. **Loading Data**:
   ```python
   # Load CSV using NumPy
   data = np.genfromtxt('data.csv', delimiter=',', skip_header=1)
   ```

2. **Cleaning Data**:
   - Replace missing values with median or mean.
   - Normalize numeric data for consistency.
   - Remove outliers based on a threshold.

3. **Analyzing Data**:
   - Calculate mean, median, or any statistical measure.
   - Use NumPy functions like `np.mean()`, `np.median()`, `np.sum()` for quick analysis.
   
4. **Sample Analysis Code**:
   ```python
   # Example analysis - mean and median of the third column (index 2).
   # The NaN-aware variants ignore any missing values left by genfromtxt.
   column_mean = np.nanmean(data[:, 2])
   column_median = np.nanmedian(data[:, 2])
   ```
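
The cleaning step is described only in bullets, so here is one way the whole mini-project could look end to end. This is a sketch under stated assumptions: the CSV text is an in-memory stand-in for `data.csv` (the real file's columns are not given), and the median fill, the 1.5 * IQR outlier rule, and min-max normalization are assumed choices among the options listed in the steps.

```python
import io
import numpy as np

# In-memory stand-in for 'data.csv'; the columns and values are hypothetical
csv_text = """id,income,age
1,40000,25
2,,32
3,52000,41
4,48000,29
5,900000,37
6,61000,45
"""

# Empty fields are read as NaN by genfromtxt
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
income = data[:, 1]

# 1. Replace missing values with the column median (NaN-aware)
income = np.where(np.isnan(income), np.nanmedian(income), income)

# 2. Remove outlier rows: keep values within 1.5 * IQR of the quartiles
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
keep = (income >= q1 - 1.5 * iqr) & (income <= q3 + 1.5 * iqr)
cleaned = data[keep]             # boolean row filter returns a new array
cleaned[:, 1] = income[keep]     # store the filled income values

# 3. Min-max normalize the cleaned income column to the range [0, 1]
col = cleaned[:, 1]
normalized_income = (col - col.min()) / (col.max() - col.min())

# 4. Quick analysis on the cleaned data
print("Rows kept:", cleaned.shape[0])
print("Mean income:", cleaned[:, 1].mean())
print("Median age:", np.median(cleaned[:, 2]))
```

The outlier step is done before normalization on purpose: a single extreme value would otherwise compress all other normalized values toward 0.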

---

## **Visualization with NumPy Data**

### **1. Using Libraries Like Matplotlib and Seaborn to Visualize Data**

- **Matplotlib**:
  - Create simple plots (`plt.plot()`, `plt.scatter()`, etc.).
  - Use histograms (`plt.hist()`) to understand data distribution.
  - Example: Plotting a line graph.
    ```python
    import matplotlib.pyplot as plt

    x = np.arange(0, 10, 0.1)
    y = np.sin(x)
    
    plt.plot(x, y)
    plt.title('Sine Wave')
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.show()
    ```

- **Seaborn**:
  - High-level visualization library built on top of Matplotlib.
  - Ideal for visualizing statistical relationships with fewer lines of code.
  - Use Seaborn for scatter plots, box plots, and heatmaps.

### **2. Plotting Data Distributions, Scatter Plots, and Trends**

- **Data Distribution**:
  - Use `plt.hist()` to visualize the frequency distribution of data.
  - Use `sns.kdeplot()` for Kernel Density Estimation (KDE) to show probability density.

- **Scatter Plots**:
  - `plt.scatter(x, y)` for visualizing the relationship between two numerical variables.
  - Use color and size for additional data dimensions (`c` and `s` parameters).

- **Trend Analysis**:
  - Plot trends over time using line graphs.
  - Apply a rolling average to smooth noisy data (`pd.Series(data).rolling(window).mean()`).
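
The rolling average from the last bullet can be sketched as follows. The synthetic series and the window length of 7 are assumptions for illustration. The block also shows a NumPy-only equivalent using `np.convolve`, which is a swapped-in technique rather than something the original notes prescribe.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: an upward trend plus noise (seeded for reproducibility)
rng = np.random.default_rng(0)
days = np.arange(60)
values = 0.5 * days + rng.normal(scale=5, size=days.size)

window = 7  # assumed window length

# Pandas: the first window-1 results are NaN because the window is incomplete
rolling_pd = pd.Series(values).rolling(window).mean().to_numpy()

# NumPy equivalent: convolve with a uniform kernel, keeping only full windows
kernel = np.ones(window) / window
rolling_np = np.convolve(values, kernel, mode='valid')

print(rolling_pd.shape, rolling_np.shape)
```

The two results agree wherever the window is full. The NumPy version is shorter by `window - 1` elements because it drops the incomplete windows instead of filling them with NaN.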

### **3. Exercise 8: Visualizing Data Trends with Matplotlib**

#### **Objective**:
- Use `matplotlib` to visualize trends in data from a NumPy dataset.

#### **Instructions**:
1. **Prepare Data**:
   - Create a dataset using NumPy, or use a pre-existing dataset.
   - Example:
     ```python
     data = np.random.normal(loc=50, scale=10, size=1000)  # Random normal data
     ```

2. **Visualize Data**:
   - Plot a histogram to see data distribution.
   - Create a scatter plot for two variables.
   - Show data trends using line plots.
   
3. **Sample Visualization Code**:
   ```python
   import numpy as np
   import matplotlib.pyplot as plt

   # Example visualization - Histogram (uses `data` from step 1)
   plt.hist(data, bins=30, color='blue', edgecolor='black')
   plt.title('Data Distribution')
   plt.xlabel('Value')
   plt.ylabel('Frequency')
   plt.show()
   
   # Scatter plot
   x = np.arange(0, 100)
   y = np.random.randint(0, 50, size=100)
   
   plt.scatter(x, y, color='green')
   plt.title('Random Data Scatter Plot')
   plt.xlabel('X-axis')
   plt.ylabel('Y-axis')
   plt.show()
   ```
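
The sample code covers the histogram and the scatter plot but not the line plot from step 2. A possible sketch follows, assuming a synthetic noisy series and a straight-line trend fitted with `np.polyfit`. Both the data and the choice of a linear fit are illustrative, not part of the original exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic series: linear trend plus noise (seeded so the plot is reproducible)
rng = np.random.default_rng(42)
t = np.arange(200)
series = 0.1 * t + rng.normal(scale=3, size=t.size)

# Fit a degree-1 polynomial; polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(t, series, deg=1)
trend = slope * t + intercept

fig, ax = plt.subplots()
ax.plot(t, series, color='lightgray', label='Raw data')
ax.plot(t, trend, color='red', label='Fitted linear trend')
ax.set_title('Data Trend Over Time')
ax.set_xlabel('Time step')
ax.set_ylabel('Value')
ax.legend()
plt.show()
```

Plotting the raw data in a muted color underneath the fitted line keeps the noise visible while making the overall direction easy to read.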

---