Distributed K-Means Clustering with a Custom MapReduce Framework

Overview

This project implements a distributed K-means clustering algorithm using a custom-built MapReduce framework. It is designed to handle potentially large datasets by distributing the clustering workload across multiple processes or machines.

Problem Statement

The project addresses the problem of clustering large datasets with the K-means algorithm in a distributed environment.

Key Features

  • Master Process: Coordinates the overall MapReduce workflow, including input splitting, process management, and centroid compilation.
  • Mapper Processes: Perform data loading, centroid assignment, and output partitioning.
  • Reducer Processes: Aggregate data points, recalculate centroids, and write new centroids to output files.
  • gRPC Communication: Facilitates efficient communication and data exchange among the master, mappers, and reducers.
  • Fault Tolerance: Implements mechanisms to detect and restart failed mapper or reducer processes, ensuring reliable completion.
  • Convergence Tracking: Tracks centroid convergence across iterations and terminates the algorithm early once convergence is reached (a minimal sketch of this check follows this list).
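
The reduce and convergence steps are plain K-means arithmetic. Below is a minimal, self-contained sketch of both, assuming Euclidean distance and a small shift threshold; the repository's actual logic in reducer.py and master.py may differ in detail:

    import math

    def recompute_centroid(points):
        """Reduce step: the new centroid is the component-wise mean
        of all points assigned to this cluster."""
        dims = len(points[0])
        return [sum(p[d] for p in points) / len(points) for d in range(dims)]

    def has_converged(old_centroids, new_centroids, epsilon=1e-4):
        """Convergence check: True when every centroid has moved less
        than epsilon (Euclidean distance) since the previous iteration."""
        for old, new in zip(old_centroids, new_centroids):
            shift = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
            if shift >= epsilon:
                return False
        return True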

Prerequisites

  • Python 3.x
  • grpcio, grpcio-tools (pip install grpcio grpcio-tools)

Directory Structure

Data/
    Input/
        points.txt                 # Input data file 
    Mappers/
        M1/ 
            partition_1.txt        # Output from Mapper 1
            partition_2.txt
            ...
        ...
    Reducers/
        R1.txt                     # Output from Reducer 1
        ...
    centroids.txt                  # Final converged centroids
    dump/                          # Log files
master.py
mapper.py
reducer.py
kmeans.proto                       # Protobuf definitions
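
The partition files under Data/Mappers/M*/ indicate that each mapper writes one partition per reducer. As an illustrative sketch only, assuming the conventional MapReduce K-means scheme, where each point is keyed by the index of its nearest centroid and routed to a reducer by key modulo the reducer count:

    def nearest_centroid(point, centroids):
        """Map step: index of the centroid closest to the point
        (squared Euclidean distance suffices for the argmin)."""
        best, best_dist = 0, float("inf")
        for i, c in enumerate(centroids):
            dist = sum((p - q) ** 2 for p, q in zip(point, c))
            if dist < best_dist:
                best, best_dist = i, dist
        return best

    def partition(points, centroids, num_reducers):
        """Group (centroid_index, point) pairs into one bucket per
        reducer; under this assumed scheme, bucket r would become
        partition_{r+1}.txt."""
        buckets = {r: [] for r in range(num_reducers)}
        for point in points:
            key = nearest_centroid(point, centroids)
            buckets[key % num_reducers].append((key, point))
        return buckets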

Running the Code

Execute the master:

    python master.py <number_of_mappers> <number_of_reducers> <number_of_centroids> <number_of_iterations>

Usage Example

  • Input: Data/Input/points.txt (see the provided sample in the repository for the format).
  • Command: python master.py 3 2 5 10 (3 mappers, 2 reducers, 5 initial centroids, 10 iterations).
  • Output:
    • Data/Reducers/R*.txt: New centroids from each reducer.
    • Data/centroids.txt: Final compiled list of converged centroids.

Protobuf (kmeans.proto)

The kmeans.proto file defines the data structures and gRPC services used for communication within the system. See the separate README within the proto directory or the generated code for details.
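
The Python stubs are generated from kmeans.proto with grpcio-tools; the standard invocation, which produces kmeans_pb2.py and kmeans_pb2_grpc.py, is:

    python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. kmeans.proto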

Logging

Log files for the master, mappers, and reducers are stored in the Data/dump/ directory.

Customization

  • Input Splitting: Modify master.py to implement a different input data-splitting strategy (a simple example appears after this list).
  • Convergence Criteria: Adjust the convergence-checking logic if needed.
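
For reference, a contiguous-chunk split is one straightforward strategy. A sketch under that assumption follows; the repository's actual splitting logic lives in master.py:

    def split_input(lines, num_mappers):
        """Divide the input lines into num_mappers contiguous chunks,
        handing out any remainder one extra line at a time."""
        chunk, extra = divmod(len(lines), num_mappers)
        splits, start = [], 0
        for i in range(num_mappers):
            end = start + chunk + (1 if i < extra else 0)
            splits.append(lines[start:end])
            start = end
        return splits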
