Soham-droid-pixel/Data-Mining-Algorithms

Data Mining Algorithms From Scratch

This repository contains from-scratch Python implementations of fundamental data mining and machine learning algorithms. The code is kept intentionally simple, both for educational use and as preparation material for my Data Warehousing and Mining (DWM) lab exam.

Each implementation is self-contained, heavily commented, and uses only basic Python libraries.


📚 Algorithms Included

The repository covers the following algorithms:

  • 1. K-Means Clustering

    • File: k_means.py
    • Purpose: A popular partitioning algorithm that groups data points into $k$ clusters, where each point belongs to the cluster with the nearest mean (centroid).
    • Distance: Uses Euclidean Distance.
  • 2. K-Medoids Clustering (PAM)

    • File: k_medoids.py
    • Purpose: A variation of K-Means that is more robust to outliers because it uses an actual data point (medoid) as the cluster center.
    • Distance: Uses Manhattan Distance.
  • 3. Naive Bayes (Categorical)

    • File: naive_bayes.py
    • Purpose: A probabilistic classifier based on Bayes' theorem with the "naive" assumption of feature independence. This implementation is designed for categorical data (e.g., "Sunny," "Hot").
    • Features: Includes Laplace (add-1) smoothing to handle zero-probability cases.
  • 4. Apriori Algorithm

    • File: apriori.py
    • Purpose: The classic algorithm for Association Rule Mining. It discovers frequent itemsets in a transactional dataset (e.g., "items frequently bought together").
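    • Logic: Implements the join-and-prune cycle L(k-1) -> C(k) -> L(k): the frequent (k-1)-itemsets L(k-1) are joined into candidate k-itemsets C(k), candidates with an infrequent subset are pruned, and the remaining candidates that meet the minimum support form L(k).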
  • 5. PageRank

    • File: pagerank.py
    • Purpose: An algorithm that scores the importance of each node in a directed graph based on its incoming links. It is best known as the method Google introduced for ranking web pages.
    • Features: Includes the damping factor and handles dangling nodes (nodes with no outgoing links).
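For item 1 (K-Means), the assign-then-update loop can be sketched as follows. This is a minimal illustration rather than the contents of k_means.py: the function names, the random initialization with a fixed seed, and the choice to leave an empty cluster's centroid unchanged are all assumptions.

```python
import math
import random


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, labels). Hypothetical helper, not the repo's API."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `k_means(points, 2)` on a list of coordinate tuples returns the final centroids and one cluster index per point. The result can depend on the initial centroids, which is why the sketch fixes a seed.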
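For item 2 (K-Medoids), a swap-based search with Manhattan distance might look like the sketch below. It is a simplified variant that accepts any swap that lowers the total cost, not a faithful copy of PAM or of k_medoids.py; the function names and the choice of the first k points as initial medoids are assumptions.

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def total_cost(points, medoids):
    """Sum of each point's distance to its closest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def k_medoids(points, k):
    """Return (medoids, labels, cost). Hypothetical helper for illustration."""
    medoids = list(points[:k])  # assumption: first k points start as medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with a non-medoid data point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:  # keep the swap only if it lowers the cost
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: manhattan(p, medoids[j]))
              for p in points]
    return medoids, labels, best
```

Because every medoid is an actual data point, a single extreme value cannot drag a cluster center away from the bulk of its members the way it can shift a mean.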
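For item 3 (Naive Bayes), the counting and add-1 smoothing can be sketched as below. The `train` and `predict` names and the tuple-per-row data layout are assumptions made for this illustration, not the interface of naive_bayes.py.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes and per-class feature values from categorical rows."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the class with the highest smoothed posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace (add-1): (count + 1) / (class size + distinct values)
            count = value_counts[(i, label)][value]
            score *= (count + 1) / (n + len(feature_values[i]))
        scores[label] = score
    return max(scores, key=scores.get)
```

With smoothing, a feature value never seen with a class contributes a small non-zero factor instead of forcing the whole product to zero.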
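For item 4 (Apriori), the L(k-1) -> C(k) -> L(k) cycle can be sketched like this. The function name is hypothetical, and treating `min_support` as an absolute transaction count (not a fraction) is an assumption of this sketch; apriori.py may define it differently.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def keep_frequent(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    singles = {frozenset([item]) for t in transactions for item in t}
    level = keep_frequent(singles)  # L(1)
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: union pairs from L(k-1) that produce a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)  # L(k)
        frequent.update(level)
        k += 1
    return frequent
```

The prune step relies on the Apriori property: every subset of a frequent itemset must itself be frequent, so a candidate with an infrequent subset can be discarded without counting it.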
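For item 5 (PageRank), a power-iteration sketch with a damping factor and dangling-node handling is shown below. The adjacency-dict input format, the default damping of 0.85, and the convergence tolerance are assumptions; pagerank.py may use different values or a different graph representation.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iters=100):
    """graph maps each node to a list of the nodes it links to.

    Assumes every link target also appears as a key in graph.
    """
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(max_iters):
        # Rank held by dangling nodes is spread evenly over all nodes.
        dangling = sum(ranks[node] for node in nodes if not graph[node])
        new_ranks = {node: (1 - damping) / n + damping * dangling / n
                     for node in nodes}
        # Each node shares its rank equally among its outgoing links.
        for node in nodes:
            for target in graph[node]:
                new_ranks[target] += damping * ranks[node] / len(graph[node])
        delta = sum(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if delta < tol:  # stop once the ranks have stabilized
            break
    return ranks
```

Redistributing the dangling nodes' rank keeps the scores summing to 1 on every iteration; without it, rank would leak out of the graph at each node that has no outgoing links.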

🎯 Project Goal

The primary goal of this repository is not to create optimized, production-ready code. Instead, the focus is on clarity and readability. Each file is heavily commented to explain the core logic step by step, making it an effective study guide for understanding how these algorithms work internally.


🚀 How to Use

Each algorithm is a standalone Python script. Each script includes a simple, static dataset directly in the file (under the if __name__ == "__main__": block) for demonstration.

To run any of the algorithms, simply execute the file using Python:

```bash
# Example for K-Means
python k_means.py

# Example for Apriori
python apriori.py

# Example for Naive Bayes
python naive_bayes.py
```

You can modify the dataset variable within any of the files to test the algorithms with your own simple data.
