Precioux/Data-Mining

Data Mining Projects

This repository contains seven distinct projects in the realm of data analysis, machine learning, and data mining. Each project is designed to explore different aspects of these fields, showcasing the application of various techniques, algorithms, and methodologies.

  • Data Mining Course - Fall 2023
  • Amirkabir University of Technology

1) Introduction to Python Libraries:

This project explores the Python libraries commonly used in data mining. The report covers the installation, general characteristics, and main functions of each library, with a focus on how they make project steps such as pre-processing faster and more accurate.

2) EDA and Visualization:

The goal of this project is to analyze a dataset describing people's biological characteristics in order to classify the occurrence of heart attacks. The emphasis is on statistical analysis, visualization, and in-depth exploration of the dataset.

3) Data Cleaning and Feature Engineering:

This project delves into feature engineering methods, including reduction, selection, and extraction, after data cleaning. The impact of these methods on linear regression, decision tree, and random forest algorithms is explored, providing insights into the effectiveness of feature engineering.
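As a rough illustration of this kind of comparison, the sketch below scores a linear regression with no feature engineering, with feature selection, and with feature extraction. It is not the project's code: the synthetic dataset from `make_regression`, the choice of `SelectKBest` and `PCA`, and `k=5` are all assumptions made for the example, and scikit-learn is assumed to be installed.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 20 features, only 5 of them informative.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Each candidate wraps the same model behind a different feature step.
candidates = {
    "all features": make_pipeline(LinearRegression()),
    "selection (SelectKBest, k=5)": make_pipeline(
        SelectKBest(f_regression, k=5), LinearRegression()
    ),
    "extraction (PCA, 5 components)": make_pipeline(
        PCA(n_components=5), LinearRegression()
    ),
}

# 5-fold cross-validated R^2; the feature step is refit inside every fold.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Putting the feature step inside the pipeline keeps selection and extraction from seeing the validation fold. Swapping `LinearRegression` for `DecisionTreeRegressor` or `RandomForestRegressor` would extend the same comparison to the other two algorithms.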

4) Frequent Pattern Detection:

This project compares the Apriori and FP-Growth algorithms on different datasets and conducts a sensitivity analysis. FP-Growth, which is known for handling large datasets efficiently, outperforms Apriori in speed and efficiency. The project also reveals useful patterns and relationships in the data.
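To make the level-wise idea behind Apriori concrete, here is a minimal pure-Python sketch. It is an illustration only, not the implementation used in the project; the function name `apriori`, the toy basket data, and the 0.6 support threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Toy transactions (assumption), not one of the project's datasets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, min_support=0.6)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Every level rescans all transactions to count candidates, which is the cost FP-Growth avoids by compressing the data into a prefix tree and mining that tree without generating candidates. Both algorithms return the same frequent itemsets for a given support threshold.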

5) Advanced Methods in Classification:

This project prepares a dataset of gas sensor readings, classifies it with algorithms such as Random Forest, SVM, and Naive Bayes, and evaluates each classifier in detail. Combining the models through stacking enhances efficiency, and a sensitivity analysis of the hyperparameters is performed.
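The stacking setup described here might look roughly like the following sketch. It is not the project's code: the synthetic data from `make_classification` stands in for the gas sensor dataset, and the logistic-regression meta-learner, the hyperparameters, and the 75/25 split are assumptions. scikit-learn is assumed to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the sensor data (assumption): 3 classes, 16 features.
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base learners named in the text; the SVM gets scaled inputs.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
    ("nb", GaussianNB()),
]

# The meta-learner is trained on out-of-fold predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"stacked accuracy: {accuracy:.3f}")
```

Using `cv=5` means the meta-learner never sees predictions that a base model made on its own training rows, which is what keeps stacking from simply rewarding the most overfit base learner.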

6) Advanced Methods in Clustering:

Cluster analysis is performed on an insurance dataset using methods like KMeans, Agglomerative Clustering, and DBSCAN. The clustering results are compared using metrics like Silhouette Score, and dimensionality reduction techniques such as PCA and t-SNE are applied for visualization.
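A comparison along these lines can be sketched as follows. This is an illustration, not the project's code: `make_blobs` data replaces the insurance dataset, and the cluster count, `eps`, and `min_samples` values are assumptions. scikit-learn is assumed to be installed.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 3 blobs in 5 dimensions.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise points as -1; leave them out of the score.
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    if n_clusters >= 2:
        scores[name] = silhouette_score(X[mask], labels[mask])
    else:
        # The silhouette score is undefined for fewer than two clusters.
        scores[name] = None

for name, score in scores.items():
    print(name, "n/a" if score is None else f"{score:.3f}")

# 2-D projection for plotting; t-SNE could be substituted here.
X_2d = PCA(n_components=2).fit_transform(X)
```

Dropping noise points before scoring flatters DBSCAN somewhat, so its score is best read alongside the number of points it labeled as noise.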

7) Identifying Data Outliers and Anomalies, Comparing Data Balancing Methods, and Providing Evaluation:

This project covers several topics in data analysis and machine learning. It includes anomaly detection with techniques such as One-Class SVM and Local Outlier Factor, along with data augmentation and de-emphasis. Balancing methods, including oversampling (Random OverSampler, SMOTE) and undersampling (Random UnderSampler), are applied and compared. The project concludes with an LSTM network for identifying temporal patterns in the data.
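For the anomaly detection part, the two detectors named above can be compared on the same data as in the sketch below. It is not the project's code: the two-dimensional synthetic points, the 5% contamination rate, and the `nu` value are assumptions, and the balancing and LSTM steps are not shown. NumPy and scikit-learn are assumed to be installed.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic stand-in data (assumption): a dense cloud plus a few distant points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X = np.vstack([inliers, outliers])

# Both detectors return +1 for inliers and -1 for flagged anomalies.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
svm_labels = ocsvm.fit_predict(X)

print("LOF flagged:", int((lof_labels == -1).sum()))
print("One-Class SVM flagged:", int((svm_labels == -1).sum()))
print("flagged by both:", int(((lof_labels == -1) & (svm_labels == -1)).sum()))
```

Local Outlier Factor scores each point against the density of its neighbors, while One-Class SVM learns a single boundary around the bulk of the data, so the two methods can disagree on points near the edge of a cluster.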

These projects collectively showcase a diverse range of techniques in data analysis, addressing challenges such as class imbalance, anomaly detection, and temporal pattern recognition. They provide valuable insights into improving performance and reliability in complex data analysis, especially in climate-related datasets.