<a href="https://colab.research.google.com/github/Sadickachuli/data_preprocessing/blob/main/ml_pipeline_%5BAchuli_Mustapha_Sadick%5D_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Processing Approach for Portfolio Project

## Project Title: Garbage Classification for Recyclable and Non-Recyclable Waste
## [Company Logo]

## Student Name: Achuli Mustapha Sadick

---

1. **Data Sources and Aggregation:**
   - List all sources of data for the project. **You must consider sources outside Kaggle and Google Datasets** (insert links to online platforms, research papers, etc. where necessary)

   **I searched Kaggle and IEEE for datasets and chose this one (https://www.kaggle.com/datasets/asdasdasasdas/garbage-classification/data)**
   
   - Determine if data aggregation from multiple sources is necessary for comprehensive analysis.

   **- The dataset includes multiple classes like "cardboard," "metal," "plastic," etc., for recyclable materials, and only one class, "trash," for non-recyclable items.**

  **- To address class imbalance and enhance model accuracy, additional datasets may be aggregated, particularly for the "trash" class. This could involve combining different datasets that focus on household waste or non-recyclable waste.**
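
Before aggregating extra data, it helps to quantify how lopsided the classes are. The sketch below groups per-class counts into recyclable and non-recyclable totals and reports the ratio between them. The counts, the function names, and the choice of "trash" as the only non-recyclable class are assumptions for illustration, not measurements from the dataset.

```python
def group_counts(class_counts, non_recyclable=("trash",)):
    """Sum per-class image counts into recyclable and non-recyclable totals."""
    grouped = {"recyclable": 0, "non_recyclable": 0}
    for name, count in class_counts.items():
        key = "non_recyclable" if name in non_recyclable else "recyclable"
        grouped[key] += count
    return grouped


def imbalance_ratio(counts):
    """Return the largest count divided by the smallest count."""
    values = list(counts.values())
    if not values or min(values) == 0:
        raise ValueError("every group needs at least one image")
    return max(values) / min(values)


# Hypothetical counts for illustration only; replace with the real folder counts.
example_counts = {"cardboard": 400, "metal": 400, "plastic": 500, "trash": 100}

grouped = group_counts(example_counts)
ratio = imbalance_ratio(grouped)
print(grouped, f"ratio={ratio:.1f}")
```

A ratio far above 1 suggests that the "trash" class is a reasonable target for additional data.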



In [1]:
# Importing essential libraries
import os
import numpy as np
import shutil
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import pickle
import json

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam, RMSprop

from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split



# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
from google.colab import drive
drive.mount('/content/drive')
print('Drive mounted')

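# Base directory that will hold the train/validation/test split
base_dir = '/content/drive/MyDrive/Garbage_Classification_Data_Split'

# Input directory with one sub-folder of images per class
input_data_dir = '/content/drive/MyDrive/Garbage_classification_data'

# Output directories used by the data-splitting cell
# (the sub-folder names are assumed; adjust them to match your Drive layout)
train_dir = os.path.join(base_dir, 'train')
val_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')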



2. **Data Format Transformation:**
   - Describe the current format of the data.
   - Outline the planned transformation to a unified format suitable for analysis and modeling.

 **- The dataset consists of JPG images, with labels stored in CSV format.**

 **- Resize images to a standard resolution (224x224 in the preprocessing code below) to keep training inputs consistent.**

 **- Data augmentation can be applied to improve generalization, especially for classes with fewer samples.**
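
To make the resize-and-rescale step concrete, here is a minimal NumPy-only sketch. It uses nearest-neighbour sampling purely for illustration. The helper names and the random stand-in image are assumptions, and a real pipeline would normally let the image loader handle resizing.

```python
import numpy as np


def resize_nearest(image, out_height, out_width):
    """Resize an H x W x C array with nearest-neighbour sampling."""
    in_height, in_width = image.shape[:2]
    # For every output row/column, pick the index of the source row/column.
    rows = np.arange(out_height) * in_height // out_height
    cols = np.arange(out_width) * in_width // out_width
    return image[rows[:, None], cols]


def to_unit_range(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return image.astype(np.float32) / 255.0


# Random stand-in for a decoded JPG (hypothetical 384 x 512 RGB image).
raw = np.random.default_rng(0).integers(0, 256, size=(384, 512, 3), dtype=np.uint8)

prepared = to_unit_range(resize_nearest(raw, 224, 224))
print(prepared.shape, prepared.dtype)
```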

3. **Data Exploration:**
   - Enumerate the features included in the dataset.
   
   - Summarize findings from exploratory data analysis (EDA) including distributions, correlations, and outliers.
   
  **Insert code for data exploration below**


In [None]:
# Load and preprocess the dataset
img_height, img_width = 224, 224
batch_size = 32

# ImageDataGenerator for training with augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

In [None]:
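# Plots for EDA
# Class folders inside the input directory (split folders are excluded)
class_names = sorted(
    name for name in os.listdir(input_data_dir)
    if os.path.isdir(os.path.join(input_data_dir, name))
    and name not in ('train', 'validation', 'test')
)

# Count the files in each class folder and plot the class distribution
class_counts = {
    class_name: len(os.listdir(os.path.join(input_data_dir, class_name)))
    for class_name in class_names
}
sns.barplot(x=list(class_counts.keys()), y=list(class_counts.values()))
plt.title("Class Distribution")
plt.xlabel("Class")
plt.ylabel("Number of images")
plt.show()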




4. **Hypothesis Testing:**
   - State any preexisting hypotheses about the data.
   - Explain methodologies to empirically test these hypotheses.

**1. Hypotheses:**

   **- Hypothesis 1: The model may face difficulty distinguishing between similar-looking classes like "metal" and "cardboard" on a similar-colored background.**

   **- Hypothesis 2: Adding more samples to the "trash" class (non-recyclable) will improve model accuracy by reducing bias toward recyclable materials.**

**2. Testing Methodologies:**

   **- Evaluate the model on a subset of images with similar backgrounds and compare its per-class performance on that subset with its performance on the full test set.**

   **- Retrain with a balanced number of samples for the "trash" category and compare accuracy and per-class recall against the original model.**
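
One way to put a number on Hypothesis 1 is to measure how often two classes are mistaken for each other. The sketch below is a minimal illustration. The function name, the three-class label list, and the toy predictions are assumptions, not results from the project.

```python
import numpy as np


def pairwise_confusion(y_true, y_pred, class_names, class_a, class_b):
    """Fraction of class_a/class_b samples predicted as the other class."""
    a = class_names.index(class_a)
    b = class_names.index(class_b)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    a_as_b = np.sum((y_true == a) & (y_pred == b))
    b_as_a = np.sum((y_true == b) & (y_pred == a))
    total = np.sum((y_true == a) | (y_true == b))
    return float((a_as_b + b_as_a) / total) if total else 0.0


# Toy labels for illustration only (indices into example_classes).
example_classes = ["cardboard", "metal", "trash"]
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2]

rate = pairwise_confusion(y_true, y_pred, example_classes, "metal", "cardboard")
print(f"metal/cardboard confusion rate: {rate:.2f}")
```

Comparing this rate on the similar-background subset against the full test set would show whether the background is what drives the confusion.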




5. **Handling Sparse/Dense Data and Outliers:**
   - Assess the density of the data.
   - Propose strategies to handle missing data and outliers while maintaining dataset integrity.

   **Insert code for Handling Sparse/Dense Data and Outliers below**

In [None]:
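# Removing corrupted or non-image files
def filter_invalid_images(file_list):
    """Return only the paths that matplotlib can decode as images."""
    valid_files = []
    for file in file_list:
        try:
            plt.imread(file)  # raises if the file is corrupted or not an image
            valid_files.append(file)
        except Exception:
            # Skip unreadable files instead of stopping the whole run
            continue
    return valid_files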
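
A cheaper pre-check, sketched here under the assumption that only JPEG and PNG files are expected, is to look at the first bytes of each file before decoding it. The helper name and the temporary demo files are hypothetical. This only catches files of the wrong type, so it complements full decoding rather than replacing it.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of JPEG and PNG files.
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def has_image_signature(path):
    """Return True if the file starts with a JPEG or PNG signature."""
    try:
        with open(path, "rb") as handle:
            header = handle.read(8)
    except OSError:
        return False
    return header.startswith(IMAGE_SIGNATURES)


# Demo with two throwaway files: one with a JPEG header, one plain text.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "sample.jpg")
    bad = os.path.join(tmp, "notes.jpg")
    with open(good, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
    with open(bad, "wb") as f:
        f.write(b"not an image")
    results = {os.path.basename(p): has_image_signature(p) for p in (good, bad)}

print(results)
```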

6. **Data Splitting:**
   - Define a methodology to split the dataset into training, validation, and testing sets.
   - Ensure randomness and representativeness in each subset.

7. **Bias Mitigation:**
   - Implement techniques to identify and mitigate biases in the dataset.
   - Ensure fairness and equity in data representation.
   
    **A per-class split of 70% training, 15% validation, and 15% testing keeps every class represented in the same proportions in each subset.**



In [None]:
# Get class names (sub-folders of the input directory, excluding any split folders)
class_names = os.listdir(input_data_dir)
class_names = [class_name for class_name in class_names
               if os.path.isdir(os.path.join(input_data_dir, class_name))
               and class_name not in ['train', 'validation', 'test']]
print(f"Classes found: {class_names}")


def copy_files(file_list, src_dir, dest_dir, class_name):
    """Copy each file from src_dir into dest_dir/class_name, reporting failures."""
    for file in file_list:
        src_file = os.path.join(src_dir, file)
        dest_file = os.path.join(dest_dir, class_name, file)
        if os.path.isfile(src_file):
            try:
                shutil.copy(src_file, dest_file)
            except Exception as e:
                print(f"Error copying {src_file} to {dest_file}: {e}")
        else:
            print(f"Skipping {src_file}, not a file.")


# Split data into train, validation, and test sets, one class at a time.
# train_dir, val_dir, and test_dir must already point to the output folders.
for class_name in class_names:
    class_dir = os.path.join(input_data_dir, class_name)

    # Keep regular files only, sorted so the seeded shuffle is reproducible
    # (os.listdir returns entries in arbitrary order)
    files = sorted(f for f in os.listdir(class_dir)
                   if os.path.isfile(os.path.join(class_dir, f)))

    if not files:
        print(f"No files found in class '{class_name}'. Skipping...")
        continue

    # Shuffle files
    np.random.seed(42)  # For reproducibility
    np.random.shuffle(files)

    # Calculate split sizes (the test set takes the remainder)
    total = len(files)
    train_size = int(0.7 * total)
    val_size = int(0.15 * total)
    test_size = total - train_size - val_size

    # Split files
    train_files = files[:train_size]
    val_files = files[train_size:train_size + val_size]
    test_files = files[train_size + val_size:]

    # Create class directories in train, validation, and test folders
    os.makedirs(os.path.join(train_dir, class_name), exist_ok=True)
    os.makedirs(os.path.join(val_dir, class_name), exist_ok=True)
    os.makedirs(os.path.join(test_dir, class_name), exist_ok=True)

    # Copy files to their respective directories
    copy_files(train_files, class_dir, train_dir, class_name)
    copy_files(val_files, class_dir, val_dir, class_name)
    copy_files(test_files, class_dir, test_dir, class_name)

    print(f"Class '{class_name}' split into Train: {train_size}, Validation: {val_size}, Test: {test_size}")

print("Data split completed.")
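
The split arithmetic is easy to check in isolation. The sketch below repeats the same size calculation for a hypothetical class of 137 images. The function name and the class size are assumptions for illustration.

```python
def split_sizes(total, train_frac=0.7, val_frac=0.15):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train_size = int(train_frac * total)
    val_size = int(val_frac * total)
    test_size = total - train_size - val_size
    return train_size, val_size, test_size


# Hypothetical class with 137 images.
sizes = split_sizes(137)
print(sizes)  # (95, 20, 22)
```

Because `int()` truncates, the rounding leftovers go to the test set, so the three sizes always add up to the class total.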

8. **Features for Model Training:**
   - Identify relevant features for training the model.
   - Rank features based on their significance to project objectives.

 **Your answer for features must be plotted or shown in working code.**

9. **Types of Data Handling:**
   - Classify the types of data (categorical, numerical, etc.) present in the dataset.
   - Plan preprocessing steps for each data type.

   **Types of Data:**

   - Categorical data: image data with categorical labels (recyclable, non-recyclable).
   - Numerical data: not applicable, as the data is image-based.

   **Preprocessing:**

   - Images: rescale, resize, and augment the image data.
   - Labels: one-hot encode the class labels if necessary for classification.
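
As a small illustration of the label step, the sketch below one-hot encodes class names with NumPy. The three-class list and the helper name are assumptions for the example. In practice the class order would come from the dataset folders.

```python
import numpy as np


def one_hot(label_indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    label_indices = np.asarray(label_indices)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[label_indices]


# Hypothetical class list and labels for illustration.
example_classes = ["cardboard", "metal", "trash"]
labels = ["metal", "trash", "cardboard"]

indices = [example_classes.index(label) for label in labels]
encoded = one_hot(indices, len(example_classes))
print(encoded)
```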


In [None]:
# The number of images to be displayed for each class
num_images_to_display = 2

# Function to display images from each class
def display_images_from_classes(base_dir, class_names, num_images):
    plt.figure(figsize=(15, 10))

    for i, class_name in enumerate(class_names):
        class_dir = os.path.join(base_dir, class_name)
        images = os.listdir(class_dir)

        # Shuffle images for random display
        np.random.shuffle(images)

        # Display images for each class
        for j in range(min(num_images, len(images))):
            img_path = os.path.join(class_dir, images[j])
            img = mpimg.imread(img_path)
            plt.subplot(len(class_names), num_images, i * num_images + j + 1)
            plt.imshow(img)
            plt.axis('off')
            plt.title(class_name)

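    plt.tight_layout()
    plt.show()

# Show sample images from each class folder
# (assumes the class folders live directly under input_data_dir)
display_images_from_classes(input_data_dir, class_names, num_images_to_display)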


10. **Data Transformation for Modeling:**
    - Specify methods for transforming raw data into a model-friendly format.
    - Detail steps for normalization, scaling, or encoding categorical variables.

    ***Raw Data Transformation:***

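- Image Transformation: Resize images to 224x224, rescale pixel values to the [0, 1] range, and augment the training images (rotation, shifts, shear, zoom, horizontal flip) to prepare them for model training.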

11. **Data Storage:**
    - Determine where and how processed data will be stored.
    - Choose suitable storage solutions ensuring accessibility and security.

    ***Data Storage:***
    
Storage Solutions:

- Store processed images in the local filesystem or cloud storage (e.g., AWS S3) for easy access during training.
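
For the local-filesystem option, one possible approach is to save preprocessed arrays as a compressed NumPy archive. This is a sketch under assumptions: the helper names, the file name, and the placeholder arrays are hypothetical, and the temporary directory stands in for a real storage path.

```python
import os
import tempfile

import numpy as np


def save_processed(path, images, labels):
    """Write images and labels into one compressed .npz archive."""
    np.savez_compressed(path, images=images, labels=labels)


def load_processed(path):
    """Read the images and labels back from the archive."""
    with np.load(path) as data:
        return data["images"], data["labels"]


# Placeholder batch: four blank 224 x 224 RGB images with integer labels.
images = np.zeros((4, 224, 224, 3), dtype=np.float32)
labels = np.array([0, 1, 2, 1])

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "processed_batch.npz")
    save_processed(archive, images, labels)
    loaded_images, loaded_labels = load_processed(archive)

print(loaded_images.shape, loaded_labels.tolist())
```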


---

#### Notes:
- This template provides a structured framework for documenting your data processing approach for the portfolio project.
- Fill out each section with specific details relevant to your project's requirements and objectives.
- Use additional cells as needed to provide comprehensive information.