# Data Processing Approach for Portfolio Project
--------------------------------------------------------------------------------
## **Project Title**:  **Lung Segmentation of Normal, Viral Pneumonia, and COVID-affected Lungs**
--------------------------------------------------------------------------------

## **Student Name**: **Esther MBANZABIGWI**

---
1. **Data Sources and Aggregation:**

Data Sources:

The primary dataset for this project is the COVID-19 Radiography Dataset, which is publicly available. Additional sources can include:

- Peer-reviewed articles that provide annotated radiography datasets.
- Open medical imaging repositories such as MedPix and The Cancer Imaging Archive.
- Research papers detailing pneumonia cases and lung scans.


Data Aggregation:

Aggregating data from multiple sources broadens the range of cases available for modeling and analysis. For instance, combining COVID-19 radiographs with general pneumonia and healthy lung datasets can improve model robustness.





2. **Data Format Transformation:**
Current Data Format:
The dataset consists of images stored in .png format, organized by folder labels for each category (Normal, Pneumonia, and COVID).

Planned Transformation:
Images will be resized to a fixed size (e.g., 224x224 pixels) so that every training input has the same dimensions. File paths will be mapped to their corresponding labels in a structured format (e.g., a CSV file or a Pandas DataFrame).
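The path-to-label mapping could be built along the following lines. This is a minimal sketch that uses only the standard library and assumes one subfolder per class under a single root directory; the function names and the `file_path` and `labels` column names are illustrative, not taken from the project. Resizing is not shown here.

```python
import csv
from pathlib import Path


def build_label_index(root_dir, extensions=(".png",)):
    """Return sorted (file_path, label) pairs, using each subfolder name as the label."""
    root = Path(root_dir)
    rows = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.rglob("*")):
            # Keep only image files with an accepted extension
            if image_path.is_file() and image_path.suffix.lower() in extensions:
                rows.append((str(image_path), class_dir.name))
    return rows


def write_label_index(rows, csv_path):
    """Write the (file_path, label) pairs to a CSV file with a header row."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file_path", "labels"])
        writer.writerows(rows)
```

The resulting CSV can be loaded with `pd.read_csv` if a DataFrame is preferred.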

3. **Data Exploration:**
   
Features:

Input Features: Pixel intensities of radiograph images.

Target Feature: Class labels (Normal, Pneumonia, COVID).

Exploratory Data Analysis (EDA):
Image histograms will be analyzed to understand brightness and contrast distributions, and heatmaps will be used to identify correlations among pixel values.

In [2]:
#Include plots for EDA
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import os

# Load a sample image for EDA
sample_image_path = 'path_to_sample_image.png'
# Replace 'path_to_sample_image.png' with the actual path to your image
sample_image = cv2.imread(sample_image_path, cv2.IMREAD_GRAYSCALE)

# Check if the image was loaded successfully
if sample_image is None:
    print(f"Error: Could not load image from {sample_image_path}. Please check the file path and ensure the image exists.")
else:
    # Display the image histogram
    plt.hist(sample_image.ravel(), bins=256, color='blue')
    plt.title('Pixel Intensity Distribution')
    plt.xlabel('Pixel Intensity')
    plt.ylabel('Frequency')
    plt.show()

Error: Could not load image from path_to_sample_image.png. Please check the file path and ensure the image exists.




4. **Hypothesis Testing:**
   
  

Hypotheses:

- COVID-affected lungs exhibit unique radiographic patterns compared to normal and pneumonia-affected lungs.
- Augmented datasets improve model performance by mitigating overfitting.

Methodology:

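Test the model's ability to distinguish lung categories based on image patterns, using classification metrics such as accuracy and per-class recall. For the second hypothesis, compare the validation performance of models trained with and without augmentation.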
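As an illustration of the two metrics named above, the sketch below computes overall accuracy and per-class recall in plain Python. The label lists are made-up toy values used only to show the calculation; they are not project results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_per_class(y_true, y_pred):
    """Recall per class: true positives divided by all samples of that true class."""
    support = Counter(y_true)
    true_positives = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: true_positives[label] / count for label, count in support.items()}


# Toy labels for illustration only
y_true = ["COVID", "COVID", "Normal", "Pneumonia", "Normal", "COVID"]
y_pred = ["COVID", "Normal", "Normal", "Pneumonia", "Normal", "COVID"]

overall_accuracy = accuracy(y_true, y_pred)
class_recall = recall_per_class(y_true, y_pred)
```

Per-class recall matters here because overall accuracy can look high even when an underrepresented class is frequently missed.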

5. **Handling Sparse/Dense Data and Outliers:**
   

Density Assessment:
Underrepresented categories (e.g., COVID scans) contain relatively few images, so the data for those classes is sparse.

Strategies:

- Data augmentation (e.g., rotation, flipping).
- Outlier detection using visual analysis and pixel intensity thresholds.
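The augmentation strategy could look roughly like the sketch below, which uses only NumPy. It is an assumption-laden illustration: the function name `augment_image` is hypothetical, and it applies horizontal flips and 90-degree rotations because they need no extra library. For chest radiographs, small-angle rotations from an image library may be more anatomically plausible than quarter turns or mirroring.

```python
import numpy as np


def augment_image(image, rng):
    """Return a randomly flipped and rotated copy of a 2-D grayscale image."""
    augmented = image
    # Flip left-right with probability 0.5
    if rng.random() < 0.5:
        augmented = np.fliplr(augmented)
    # Rotate by 0, 90, 180, or 270 degrees
    quarter_turns = int(rng.integers(0, 4))
    augmented = np.rot90(augmented, k=quarter_turns)
    return augmented.copy()


rng = np.random.default_rng(seed=0)
image = np.arange(16, dtype=np.uint8).reshape(4, 4)
augmented = augment_image(image, rng)
```

Passing a seeded generator keeps the augmentation reproducible between runs.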


   

In [3]:
# Flag outlier images with extreme mean pixel intensities (near-black or near-white)
import numpy as np

def detect_outliers(image_array):
    """Return True if the image's mean intensity falls outside the 10-245 range."""
    mean_intensity = np.mean(image_array)
    # Thresholds assume 8-bit pixel values in [0, 255]
    return bool(mean_intensity < 10 or mean_intensity > 245)


6. **Data Splitting:**
   
Methodology:
Use an 80-10-10 split for training, validation, and testing datasets. Employ stratified sampling to preserve label proportions.

7. **Bias Mitigation:**
  
Techniques:

- Oversample minority classes, using SMOTE for any tabular data and image augmentation for the radiographs.
- Check that radiographs are distributed equitably across demographic groups, where such metadata is available.
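To plan the oversampling, it helps to know how many extra samples each class needs. The sketch below is one simple way to compute that, assuming the goal is to bring every class up to the size of the largest one; the function name and the toy label counts are hypothetical.

```python
from collections import Counter


def copies_needed_per_class(labels):
    """Number of extra (augmented) samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - count for label, count in counts.items()}


# Toy label list for illustration only
labels = ["Normal"] * 5 + ["Pneumonia"] * 3 + ["COVID"] * 2
extra = copies_needed_per_class(labels)
```

Oversampling should be applied to the training split only, so that augmented copies of an image do not leak into the validation or test sets.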

   



In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Assuming your data is in a CSV file named 'your_data.csv' with a 'labels' column
# Replace 'your_data.csv' with the actual file path
data = pd.read_csv('your_data.csv')

# 80-10-10 stratified split: hold out 20%, then halve it into validation and test sets
train, holdout = train_test_split(data, test_size=0.2, stratify=data['labels'], random_state=42)
val, test = train_test_split(holdout, test_size=0.5, stratify=holdout['labels'], random_state=42)

8. **Features for Model Training:**
   
Relevant Features:

Pixel intensity values.

Augmented image variations (rotated, flipped, etc.).

Ranking Features:
Rank features by inspecting visualizations of the trained CNN (for example, saliency maps) to see which image regions drive its predictions.
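One model-agnostic way to produce such a visualization is occlusion sensitivity: mask one patch of the image at a time and record how much the model's score drops. The sketch below is illustrative only. `predict_fn` is a hypothetical callable standing in for the trained CNN, and `toy_predict` is a stand-in used to make the example runnable; the source does not specify which visualization technique will be used.

```python
import numpy as np


def occlusion_sensitivity(image, predict_fn, patch=8, fill=0.0):
    """Map of score drops observed when each patch of the image is masked out.

    predict_fn is a hypothetical callable that maps a 2-D array to a scalar
    class score; with a real CNN it would wrap the model's forward pass.
    """
    baseline = predict_fn(image)
    height, width = image.shape
    heatmap = np.zeros((height, width), dtype=float)
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            # A large drop means the model relied on this region
            heatmap[top:top + patch, left:left + patch] = baseline - predict_fn(occluded)
    return heatmap


def toy_predict(img):
    """Toy stand-in for a model: the score is the mean brightness of the top-left corner."""
    return float(img[:8, :8].mean())


image = np.ones((16, 16), dtype=float)
heatmap = occlusion_sensitivity(image, toy_predict, patch=8)
```

In this toy case only the top-left patch receives a non-zero value, because that is the only region the stand-in model looks at.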

9. **Types of Data Handling:**

Data Types:

Numerical: Pixel intensities.

Categorical: Class labels.

Preprocessing:

Normalize pixel intensities (0-1 scaling) and encode labels numerically.
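A minimal sketch of both preprocessing steps follows, assuming 8-bit grayscale images. The class order in `CLASS_NAMES` and the helper names are assumptions made for illustration; any fixed order works as long as it is used consistently.

```python
import numpy as np

# Assumed class order; the integer assigned to each class follows this list
CLASS_NAMES = ["Normal", "Pneumonia", "COVID"]
LABEL_TO_INDEX = {name: index for index, name in enumerate(CLASS_NAMES)}


def normalize_image(image):
    """Scale 8-bit pixel intensities from [0, 255] to [0.0, 1.0]."""
    return image.astype(np.float32) / 255.0


def encode_labels(labels):
    """Convert class-name strings to integer indices."""
    return np.array([LABEL_TO_INDEX[label] for label in labels], dtype=np.int64)


pixels = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = normalize_image(pixels)
encoded = encode_labels(["COVID", "Normal", "Pneumonia"])
```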



   


In [None]:
#print out relevant features


10. **Data Transformation for Modeling:**

Methods:

Normalization of pixel values.

Encoding categorical labels.

11. **Data Storage:**

Storage Solution:

Processed data will be stored either in an AWS S3 bucket with access controls or in a local directory structured for reproducibility.

---
