### 1 a)

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

file_path = '/home/pedro_loureiro/Aprendizagem/Proj4_Aprendizagem/accounts.csv'
df = pd.read_csv(file_path)

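# One-hot encode the first 8 columns (categorical features become indicator columns)
df_dummies = pd.get_dummies(df.iloc[:, :8], drop_first=True)
# Drop duplicate rows and rows with missing values
df_dummies.drop_duplicates(inplace=True)
df_dummies.dropna(inplace=True)
# Scale every feature to the [0, 1] range
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df_dummies)

# Fit k-means for k = 2..8 and record the SSE (inertia) of each model
sse = []
k_values = range(2, 9)

for k in k_values:
    kmeans = KMeans(n_clusters=k, max_iter=500, random_state=42)
    kmeans.fit(df_scaled)
    sse.append(kmeans.inertia_)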

plt.figure(figsize=(10, 6))
plt.plot(k_values, sse, marker='o')
plt.title('Sum of Squared Errors (SSE) vs Number of Clusters (k)')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('SSE')
plt.grid(True)
plt.show()


### 1 b)

In the SSE (Sum of Squared Errors) plot, the number of clusters can be chosen with the elbow method. The SSE decreases sharply from k = 2 to k = 4, so each additional cluster in that range substantially reduces the within-cluster error. From k = 4 onward the decrease slows down, meaning that extra clusters bring only marginal improvements in fit. Based on this trade-off between the number of clusters and inertia, k = 4 appears to be the most suitable choice: it is the point beyond which adding clusters no longer yields a significant reduction in SSE, balancing model simplicity against explained variability.
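
The elbow reading can be made less subjective by looking at how much the SSE drops at each step. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_blobs` as a stand-in (not `accounts.csv`), and the names `drops` and `rel_drops` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 4 blobs, NOT the accounts dataset
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)

k_values = list(range(2, 9))
sse = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, max_iter=500, random_state=42)
    km.fit(X)
    sse.append(km.inertia_)

# Absolute SSE reduction obtained when moving from k-1 to k clusters
drops = -np.diff(sse)
# Same reduction expressed as a fraction of the previous SSE
rel_drops = drops / np.array(sse[:-1])

for k, d, r in zip(k_values[1:], drops, rel_drops):
    print(f"k={k}: SSE drop={d:.1f} ({r:.1%} of previous SSE)")
```

A candidate elbow is the last k for which the relative drop is still large; after it, the printed percentages should flatten out.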

### 1 c)

K-modes is designed specifically for categorical data. It measures dissimilarity with the Hamming distance (the number of attributes on which two observations differ) and represents each cluster by its mode (the most frequent category of each feature) instead of the mean. K-means, in contrast, minimizes the sum of squared Euclidean distances, which is meaningful mainly for numerical data. Since the majority of the features in this dataset are categorical, K-modes could be a better approach than traditional k-means.
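
To illustrate the two ingredients of K-modes mentioned above, here is a small sketch of the Hamming dissimilarity and a mode-based cluster representative. It is not a full K-modes implementation; the `toy` records and the `hamming` helper are hypothetical and not taken from `accounts.csv`.

```python
import pandas as pd

# Hypothetical toy records (assumption): illustrative values only
toy = pd.DataFrame({
    "job": ["retired", "student", "retired", "technician"],
    "education": ["primary", "secondary", "primary", "tertiary"],
    "marital": ["married", "single", "married", "married"],
})

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return int((a != b).sum())

# Mode-based representative: the most frequent category in each column
centroid = toy.mode().iloc[0]

# Dissimilarity of every record to that representative
distances = [hamming(row, centroid) for _, row in toy.iterrows()]

print(centroid.to_dict())
print(distances)  # [0, 3, 0, 2]
```

Unlike a mean, the representative is always a valid combination of categories, which is why it suits one-hot-free categorical data.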

### Exercise 2

### 2 a) 

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset
file_path = 'accounts.csv'
data = pd.read_csv(file_path)

# Create dummy variables (categorical variables to numeric)
df_dummies = pd.get_dummies(data.iloc[:, :8], drop_first=True)

# Remove duplicates and missing values
df_dummies.drop_duplicates(inplace=True)
df_dummies.dropna(inplace=True)

# Standardize the data with StandardScaler
scaler = StandardScaler()
data_transformed = scaler.fit_transform(df_dummies)

# Apply PCA
pca = PCA(n_components=2)  # Keep the top 2 principal components
pca_components = pca.fit_transform(data_transformed)

# Compute the variance explained by the top 2 principal components
explained_variance = pca.explained_variance_ratio_
total_variance_explained = explained_variance.sum() * 100  # As a percentage

# Print the explained variance
print(f"Explained variance by top 2 components: {explained_variance}")
print(f"Total variance explained by the top 2 components: {total_variance_explained:.2f}%")


### 2 b)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

file_path = 'accounts.csv'
data = pd.read_csv(file_path)

df_dummies = pd.get_dummies(data.iloc[:, :8], drop_first=True)
df_dummies.drop_duplicates(inplace=True)
df_dummies.dropna(inplace=True)
scaler = StandardScaler()
data_transformed = scaler.fit_transform(df_dummies)
pca = PCA(n_components=2)
pca_components = pca.fit_transform(data_transformed)
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(data_transformed)

plt.figure(figsize=(8, 6))
plt.scatter(pca_components[:, 0], pca_components[:, 1], c=clusters, cmap='viridis', s=50)
plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.title('PCA Scatterplot with k=3 clusters')
plt.colorbar(label='Cluster')
plt.show()

There is significant overlap between the clusters, especially in the central region of the plot. The boundary between the purple and teal clusters is not well defined, indicating that the points in these two clusters are not well separated in this projection. The yellow cluster is more distinguishable from the other two, but there is still some mixing in the lower areas.
The substantial overlap suggests that the clusters cannot be clearly separated in this 2D representation. This is consistent with the first two principal components explaining only 22.76% of the total variance: k-means was fitted on the full standardized feature space, so much of the information that separates the clusters may lie outside these two components.
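
One way to check whether overlap in the scatterplot reflects the projection rather than the clustering itself is to compare a separation measure in both spaces. The sketch below uses the silhouette score, which is a technique swapped in for illustration and not part of the original analysis, and it runs on synthetic `make_blobs` data as a stand-in for the accounts dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, 3 groups
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=42)
X_std = StandardScaler().fit_transform(X)

# Cluster in the full standardized space, as in the cell above
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_std)

# Project to 2 components only for visualization
X_2d = PCA(n_components=2).fit_transform(X_std)

# Same labels, evaluated in the full space and in the 2D projection
score_full = silhouette_score(X_std, labels)
score_2d = silhouette_score(X_2d, labels)

print(f"Silhouette in full space:    {score_full:.3f}")
print(f"Silhouette in 2D projection: {score_2d:.3f}")
```

If the full-space score were clearly higher than the 2D score, the overlap would be mostly an artifact of the projection; similar low scores in both would point to genuinely weak cluster structure.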

### 2 c)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

file_path = '/home/pedro_loureiro/Aprendizagem/Proj4_Aprendizagem/accounts.csv'
data = pd.read_csv(file_path)


# One-hot encode the first 8 columns, then drop duplicates and missing values
df_dummies = pd.get_dummies(data.iloc[:, :8], drop_first=True)
df_dummies.drop_duplicates(inplace=True)
df_dummies.dropna(inplace=True)
scaler = StandardScaler()
data_transformed = scaler.fit_transform(df_dummies)
kmeans = KMeans(n_clusters=3, random_state=42)
df_dummies['cluster'] = kmeans.fit_predict(data_transformed)
# Align labels with the original rows; rows dropped above get NaN as cluster
data['cluster'] = df_dummies['cluster'].reindex(data.index)

# displot is a figure-level function that creates its own figure,
# so the size is set with height/aspect instead of plt.figure
sns.displot(data=data, x='job', hue='cluster', multiple="dodge", stat='density', shrink=0.8, common_norm=False, height=6, aspect=10/6)
plt.xticks(rotation=90)
plt.title('Frequency distribution of "job" per Cluster')
plt.show()

sns.displot(data=data, x='education', hue='cluster', multiple="dodge", stat='density', shrink=0.8, common_norm=False, height=6, aspect=10/6)
plt.xticks(rotation=90)
plt.title('Frequency distribution of "education" per Cluster')
plt.show()


The plots above show the following main differences between the clusters:

Job Distribution Differences:

Cluster 0: This cluster has a higher representation of blue-collar, management, and technician roles. There's a more evenly distributed presence of other professions, like admin., services, and retired, with no single job type dominating overwhelmingly.

Cluster 1: This cluster has a noticeable presence of management and technician jobs, similar to Cluster 0, but it is also the only cluster with a slightly higher share of the student category. It covers a broad spectrum of job categories, with none dominating.

Cluster 2: This cluster is heavily dominated by the retired category, showing an extreme concentration here compared to other clusters. Other job categories are underrepresented, making it clear that Cluster 2 primarily groups individuals who are retired.

Education Distribution Differences:
         
Cluster 0: This cluster has a predominant representation of individuals with secondary education. There's also a reasonable proportion of individuals with tertiary education, and a smaller, but notable presence of those with primary education.

Cluster 1: This cluster has a strong presence of individuals with secondary and tertiary education, with tertiary being relatively more common here than in Cluster 0. It suggests that this cluster might be more associated with higher educational levels.

Cluster 2: There’s a higher proportion of individuals with primary education in this cluster compared to others. It also has a considerable portion of individuals with secondary education but a much lower representation of tertiary education, indicating lower educational levels overall.

In summary, Cluster 2 stands out for its dominance of the retired category and its higher share of primary education, indicating an older demographic with lower educational attainment. Clusters 0 and 1 have more diverse job categories, with Cluster 0 leaning slightly more towards manual or technical jobs and Cluster 1 towards professional or academic roles. In terms of education, Cluster 1 has a higher level of attainment than Cluster 0, especially in tertiary education.
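
The visual comparison can be backed by a table of per-cluster proportions. The following is a minimal sketch using `pd.crosstab`; the `toy` frame holds hypothetical values for illustration, and in the notebook it would be replaced by the `data` frame with its `cluster`, `job`, and `education` columns.

```python
import pandas as pd

# Hypothetical toy assignments (assumption): NOT results from accounts.csv
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2, 2, 2],
    "job": ["technician", "blue-collar", "management", "management",
            "student", "retired", "retired", "services"],
})

# Share of each job category within each cluster (every row sums to 1)
job_share = pd.crosstab(toy["cluster"], toy["job"], normalize="index")

print(job_share.round(2))
```

Normalizing by row mirrors `common_norm=False` in the plots: each cluster is compared on its own internal composition rather than on its size.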