GitHub - DeepanshuDabas03/K-Means_Map-Reduce: This project implements a distributed K-means clustering algorithm using a custom-built MapReduce framework. It is designed to handle potentially large datasets by distributing the clustering workload across multiple processes or machines. Uses gRPC for the communication between mapper, reducer, master

Distributed K-Means Clustering with a Custom MapReduce Framework

Overview

This project implements a distributed K-means clustering algorithm using a custom-built MapReduce framework. It is designed to handle potentially large datasets by distributing the clustering workload across multiple processes or machines.

Problem Statement

The project aims to solve the problem of clustering large datasets using the K-means algorithm in a distributed environment.

Key Features

Master Process: Coordinates the overall MapReduce workflow, including input splitting, process management, and centroid compilation.
Mapper Processes: Perform data loading, centroid assignment, and output partitioning.
Reducer Processes: Aggregate data points, recalculate centroids, and write new centroids to output files.
gRPC Communication: Facilitates efficient communication and data exchange among the master, mappers, and reducers.
Fault Tolerance: Implements mechanisms to detect and restart failed mapper or reducer processes, ensuring reliable completion.
Convergence Tracking: Includes the ability to track centroid convergence and terminate the algorithm early if convergence is reached.

Prerequisites

Python 3.x
grpcio, grpcio-tools (pip install grpcio grpcio-tools)

Directory Structure

Data/
    Input/
        points.txt                 # Input data file 
    Mappers/
        M1/ 
            partition_1.txt        # Output from Mapper 1
            partition_2.txt
            ...
        ...
    Reducers/
        R1.txt                     # Output from Reducer 1
        ...
    centroids.txt                  # Final converged centroids
    dump/                          # Log files
master.py
mapper.py
reducer.py
kmeans.proto                       # Protobuf definitions

Running the Code

Execute Master: bash python master.py <number_of_mappers> <number_of_reducers> <number_of_centroids> <number_of_iterations>

Usage Example

Input: Data/Input/points.txt (See provided sample in the repository for the format).
Command: python master.py 3 2 5 10 (3 mappers, 2 reducers, 5 initial centroids, 10 iterations).
Output:
- Data/Reducers/R*.txt: New centroids from each reducer.
- Data/centroids.txt: Final compiled list of converged centroids.

Protobuf (kmeans.proto)

The kmeans.proto file defines the data structures and gRPC services used for communication within the system. See the separate README within the proto directory or the generated code for details.

Logging

Log files for the master, mappers, and reducers are stored in the Data/dump/ directory.

Customization

Input Splitting: Modify the master.py if you wish to implement a different input data-splitting strategy.
Convergence Criteria: Adjust the convergence-checking logic if needed.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
Data/Input		Data/Input
__pycache__		__pycache__
README.md		README.md
clean_up.py		clean_up.py
kmeans.proto		kmeans.proto
kmeans_pb2.py		kmeans_pb2.py
kmeans_pb2_grpc.py		kmeans_pb2_grpc.py
mapper.py		mapper.py
master.py		master.py
reducer.py		reducer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Contributors 2

Languages

DeepanshuDabas03/K-Means_Map-Reduce

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages