[🇧🇷 Português] [🇺🇸 English]
Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4
📺 For better resolution, watch the video on YouTube.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access Data Mining Main Repository
If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.
- Introduction
- What is the Affinity Propagation Algorithm?
- Unsupervised vs. Supervised Learning
- Silhouette Score: Evaluating Clustering
- Correlation Matrix
- Affinity Propagation vs. KMeans: Use Case Table
- Code Examples: Correlation Matrix with 5 Variables
- References
This repository provides an in-depth review of the Affinity Propagation Algorithm as applied in data mining tasks, emphasizing both theory and practical implementation. It is designed as a companion to professional and academic courses focusing on unsupervised learning and practical clustering methods.
Affinity Propagation is a clustering algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. Unlike algorithms like KMeans, it does not require the number of clusters to be specified beforehand. Instead, it uses the concept of "message passing" between data points to find a number of clusters that best reflects the data structure.
- Exchanges real-valued messages between data points.
- Determines exemplars based on similarities.
- Forms clusters around exemplars.
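The message-passing steps above are commonly written as two update rules. As a sketch in the conventional notation (the symbols are not taken from this repository): $s(i,k)$ is the similarity between point $i$ and candidate exemplar $k$, $r(i,k)$ is the "responsibility" sent from $i$ to $k$, and $a(i,k)$ is the "availability" sent from $k$ to $i$.

```latex
\begin{aligned}
r(i,k) &\leftarrow s(i,k) - \max_{k' \neq k}\bigl\{a(i,k') + s(i,k')\bigr\} \\
a(i,k) &\leftarrow \min\Bigl(0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0,\, r(i',k)\bigr)\Bigr), \quad i \neq k \\
a(k,k) &\leftarrow \sum_{i' \neq k} \max\bigl(0,\, r(i',k)\bigr)
\end{aligned}
```

After the messages stabilize, point $k$ is treated as an exemplar when $r(k,k) + a(k,k) > 0$, and each remaining point $i$ is assigned to the exemplar $k$ that maximizes $a(i,k) + r(i,k)$. Implementations typically damp both updates to avoid oscillation.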
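Affinity Propagation does not take the number of clusters as an input. It looks for exemplars that maximize the total similarity between points and their exemplars, and the resulting count depends on the preference value given to each point: higher preferences tend to yield more clusters, lower preferences fewer. KMeans, by contrast, needs a predefined number of clusters.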
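A minimal sketch of this behavior with scikit-learn's `AffinityPropagation`, assuming synthetic data from `make_blobs` (the dataset, `damping` value, and variable names are illustrative choices, not part of the course material). Note that no cluster count is passed to the estimator.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.6, random_state=0)

# No n_clusters argument. With preference=None (the default), scikit-learn
# uses the median of the input similarities as every point's preference.
ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=0)
labels = ap.fit_predict(X)

exemplar_idx = ap.cluster_centers_indices_  # row indices of exemplars in X
n_found = len(exemplar_idx)
print("clusters found:", n_found)
print("first exemplars:\n", X[exemplar_idx][:3])
```

Passing a lower `preference` (a more negative number, since the default similarity is the negative squared Euclidean distance) generally reduces the number of clusters found.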
Unsupervised learning algorithms are designed to identify patterns or groupings in datasets without labeled responses. The model explores the input data to find structure (like clustering or association).
- Supervised learning uses labeled data to train predictive models for regression or classification.
- Unsupervised learning (like Affinity Propagation, KMeans, Hierarchical Clustering) finds patterns without explicit feedback.
| Criteria | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labeled data | Yes | No |
| Tasks | Classification, Regression | Clustering, Association |
| Example Algorithms | Decision Tree, SVM | KMeans, Affinity Propagation |
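The difference in the table shows up directly in how the models are fitted. The sketch below assumes synthetic data and uses scikit-learn's `DecisionTreeClassifier` and `KMeans` as stand-ins for the two families; the dataset and parameter values are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed): X holds the features, y the known labels.
X, y = make_blobs(n_samples=120, centers=3, cluster_std=0.8, random_state=1)

# Supervised: the labels y are passed to fit(), and the model learns X -> y.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)

# Unsupervised: only X is passed; the model has to find the groups itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("supervised training accuracy:", train_accuracy)
print("unsupervised cluster ids (first 10):", cluster_ids[:10])
```

The cluster ids returned by KMeans are arbitrary integers, so they are not expected to match the numbering used in `y` even when the grouping is the same.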
The Silhouette Score evaluates how well an object lies within its cluster compared to other clusters. It ranges from -1 to 1:
- Close to 1: sample is well matched to its own cluster.
- Close to 0: sample is on or very close to the decision boundary between two clusters.
- Close to -1: sample might have been assigned to the wrong cluster.
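For a single sample, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points of the nearest other cluster. The sketch below, on a small made-up dataset, computes this by hand and compares it with scikit-learn's `silhouette_samples` and `silhouette_score`; the helper `silhouette_by_hand` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data (assumed for illustration): two tight groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_by_hand(X, labels):
    """Per-sample s = (b - a) / max(a, b) using Euclidean distances.

    Assumes every cluster has at least two points.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster mean
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


manual = silhouette_by_hand(X, labels)
per_sample = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)  # mean of the per-sample values

print("by hand:     ", np.round(manual, 3))
print("scikit-learn:", np.round(per_sample, 3))
print("overall score:", round(overall, 3))
```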
A Correlation Matrix shows the statistical relationship (correlation) between pairs of variables. It helps to identify whether variables move together (correlate) and how strongly.
- Diagonal values: always 1 (a variable's correlation with itself).
- Off-diagonal values: range from -1 to 1.
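These properties can be checked directly on any correlation matrix. A small sketch, assuming random data and pandas' default Pearson correlation (the column names are placeholders):

```python
import numpy as np
import pandas as pd

# Random data (assumed for illustration): 50 rows, 3 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])

corr = df.corr()  # Pearson correlation by default

diag_is_one = np.allclose(np.diag(corr), 1.0)
is_symmetric = np.allclose(corr, corr.T)
# Small tolerance guards against floating-point rounding at the bounds.
in_range = bool(((corr.values >= -1 - 1e-12) & (corr.values <= 1 + 1e-12)).all())

print(corr.round(2))
print(diag_is_one, is_symmetric, in_range)
```

The matrix is also symmetric, since the correlation of x1 with x2 equals that of x2 with x1, so only one triangle carries distinct information.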
Two variables have positive correlation if they increase in tandem.
Tip
Example: Height and weight in humans typically show positive correlation.
Negative correlation occurs when one variable tends to decrease as the other increases.
Tip
Example: The amount of time spent watching TV and academic grades may have a negative correlation.
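The sign of the correlation coefficient can be illustrated with simulated data. The sketch below assumes two made-up variables built from a common `x` plus noise; the slopes and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)

y_pos = 2.0 * x + noise   # tends to rise when x rises
y_neg = -2.0 * x + noise  # tends to fall when x rises

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print(f"positive case: r = {r_pos:.2f}")
print(f"negative case: r = {r_neg:.2f}")
```

A coefficient near zero would indicate no linear relationship, which does not rule out a nonlinear one.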
| Feature/Aspect | Affinity Propagation | KMeans |
|---|---|---|
| Cluster count | Determined automatically | Must be specified |
| Speed / Scalability | Slower for large datasets | Fast for large datasets |
| Sensitive to initialization | No | Yes |
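| Suitable for | Clusters centered on representative points (exemplars); accepts any similarity measure | Roughly spherical clusters of similar size |
| Handles outliers | Somewhat better: exemplars are actual data points, and isolated points may form their own small clusters | Poorly: centroids are means and shift toward outliers |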
| Core principle | Message Passing | Centroid Minimization |
| Memory requirement | Higher | Lower |
| Illustrative use cases | Small/medium data, unknown cluster count | Large data, known K |
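To make the first rows of the table concrete, the sketch below fits both algorithms on the same synthetic data and reports the number of clusters and the silhouette score for each. The dataset, `damping`, and other parameter values are assumptions for illustration; results on real data will differ.

```python
from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumed): 200 points drawn around 3 centers.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.7, random_state=7)

# KMeans needs the cluster count up front; Affinity Propagation does not.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=0).fit(X)

results = {"KMeans": km.labels_, "AffinityPropagation": ap.labels_}
summary = {}
for name, labels in results.items():
    k = len(set(labels))
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    score = silhouette_score(X, labels) if 2 <= k < len(X) else float("nan")
    summary[name] = (k, score)
    print(f"{name}: {k} clusters, silhouette = {score:.3f}")
```

With the default preference, Affinity Propagation may return more clusters than the three used to generate the data; lowering `preference` is the usual way to obtain a coarser grouping.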
Below is a Python example that builds and plots a correlation matrix for each of two dataframes (df1, df2), each with five variables. Because the values are drawn independently at random, the off-diagonal correlations should come out close to zero.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Example with 5 random variables for df1
np.random.seed(42)
df1 = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])
corr_matrix1 = df1.corr()
sns.heatmap(corr_matrix1, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - df1')
plt.show()
# To generate another correlation matrix for another dataframe df2:
df2 = pd.DataFrame(np.random.rand(100, 5), columns=['V', 'W', 'X', 'Y', 'Z'])
corr_matrix2 = df2.corr()
sns.heatmap(corr_matrix2, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix - df2')
plt.show()

Note: To plot a different dataset, pass a different dataframe and update the title. To reuse the same dataframe under another label (for example, "df1 for another experiment"), only the title needs to change.
1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd Ed. LTC.
3. Larson & Farber (2015). Applied Statistics. Pearson.
4. Thomas, C. (2018). Data Mining. IntechOpen.
5. Hutter, F., Kotthoff, L. & Vanschoren, J. (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer Nature.
6. Netto, A. & Maciel, F. (2021). Python para Data Science e Machine Learning Descomplicado. Alta Books.
7. Russell, S. J. & Norvig, P. (2022). Artificial Intelligence: A Modern Approach. GEN LTC.
8. Sud, K., Erdogmus, P. & Kadry, S. (2020). Introduction to Data Science and Machine Learning. IntechOpen.
Copyright 2025 Quantum Software Development. Code released under the MIT License.