##### Hypothesis Testing

The right way to test the validity and reliability of clusters depends on the type of data and the nature of the clustering. One approach is hypothesis testing: check whether the separation between clusters is larger than random label assignment would produce. Here, we outline the logic and implement a simple version that combines silhouette analysis with a permutation test.

Permutation Test:

- Null Hypothesis (𝐻0): The clusters are not significantly different from each other, and any observed separation is by chance.

- Alternative Hypothesis (𝐻𝑎): The clusters are significantly different from each other, and the observed separation is not by chance.

Silhouette Analysis:
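- Measures how similar each data point is to its own cluster (cohesion) compared to other clusters (separation). Scores range from -1 to 1: values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points that sit closer to another cluster than to their own.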
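
To make the cohesion/separation idea concrete, here is a minimal sketch of the per-point silhouette value, s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The function name `silhouette_values` and the toy points are illustrative assumptions, not part of the notebook, and the sketch assumes at least two clusters and Euclidean distance.

```python
from math import dist

def silhouette_values(points, labels):
    """Return one silhouette value per point, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Toy 1-D data (illustrative only, not from the news-story clusters)
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_values(points, [0, 0, 1, 1])   # labels match the two groups
bad = silhouette_values(points, [0, 1, 0, 1])    # labels cut across the groups

print(sum(good) / len(good))  # high: tight, well-separated clusters
print(sum(bad) / len(bad))    # negative: points sit closer to the other cluster
```

The average of these per-point values is the kind of quantity that `silhouette_score` reports for a whole clustering.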

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
import numpy as np

# Load data from text files
data_files = ['cluster_0.0.txt', 'cluster_1.0.txt', 'cluster_2.0.txt', 'cluster_3.0.txt']
data = []
labels = []

for i, file in enumerate(data_files):
    with open(file, 'r') as f:
        stories = f.readlines()
        data.extend(stories)  # Add stories to the data list
        labels.extend([i] * len(stories))  # Assign a label to each story based on the cluster

# Convert text data to TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)

# Calculate silhouette scores for the clustering
silhouette_avg = silhouette_score(X, labels)
print(f"Average Silhouette Score: {silhouette_avg}")

# Permutation Test
def permutation_test(data, labels, num_permutations=1000):
    observed_score = silhouette_score(data, labels)
    perm_scores = []
    
    for _ in range(num_permutations):
        permuted_labels = np.random.permutation(labels)
        score = silhouette_score(data, permuted_labels)
        perm_scores.append(score)
    
    perm_scores = np.array(perm_scores)
    p_value = np.sum(perm_scores >= observed_score) / num_permutations
    return observed_score, p_value

observed_score, p_value = permutation_test(X, labels)
print(f"Observed Silhouette Score: {observed_score}")
print(f"P-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("The clustering is statistically significant (reject the null hypothesis).")
else:
    print("The clustering is not statistically significant (fail to reject the null hypothesis).")


Average Silhouette Score: -0.006230251811177641
Observed Silhouette Score: -0.006230251811177641
P-value: 0.004
The clustering is statistically significant (reject the null hypothesis).


##### Explanation:

Vectorize Text:

Use TfidfVectorizer to convert the text data into TF-IDF vectors, which are numerical representations of the text.

Calculate Silhouette Scores:

Compute the average silhouette score for the clustering to measure the quality of the clustering.

Permutation Test:

Perform a permutation test by randomly permuting the cluster labels and computing the silhouette scores for the permuted labels. Compare the observed silhouette score with the distribution of permuted scores to calculate the p-value.
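
The same logic can be sketched without any dependencies. The version below is an illustration under stated assumptions: `permutation_p_value` and `mean_gap` are hypothetical names, the statistic is a simple gap between two group means rather than the silhouette score, and the p-value uses the common "+1" correction (counting the observed labeling as one of the permutations) so that it can never be exactly zero. The notebook's own function does not apply that correction.

```python
import random

def permutation_p_value(values, labels, statistic, n_permutations=1000, seed=0):
    """Share of label shufflings whose statistic is at least the observed one."""
    rng = random.Random(seed)          # seeded so the result is reproducible
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)          # break any link between values and labels
        if statistic(values, shuffled) >= observed:
            hits += 1
    # "+1" correction: the observed labeling counts as one permutation
    return observed, (hits + 1) / (n_permutations + 1)

def mean_gap(values, labels):
    """Toy statistic: absolute gap between the means of groups 0 and 1."""
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

# Toy data (illustrative only): two clearly separated groups
values = [0.1, 0.3, 0.2, 0.4, 5.1, 5.3, 5.2, 5.4]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

observed, p_value = permutation_p_value(values, labels, mean_gap)
print(f"Observed gap: {observed:.2f}, p-value: {p_value:.4f}")
```

Swapping `mean_gap` for a function that returns the silhouette score would reproduce the structure of the test used in the notebook.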

Interpretation:

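- Because the p-value (0.004) is less than 0.05, we reject the null hypothesis: the observed labels produce a higher silhouette score than almost all random relabelings. Note, however, that the observed score itself is slightly negative (about -0.006), so the result points to structure that is statistically detectable rather than to well-separated clusters.

Note: If the p-value were greater than or equal to 0.05, we would fail to reject the null hypothesis, meaning the observed separation could plausibly be due to chance.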

This approach provides a statistical check on the validity of the clusters by testing whether the observed separation exceeds what random label assignment would produce.

##### Bootstrapping Test

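- Bootstrapping assesses the stability of the clusters by resampling the data with replacement and re-clustering each resample. The code below re-runs K-Means on every resample and records its silhouette score, so it measures how consistent the clustering quality is rather than how often the same points end up in the same clusters.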

In [6]:
from sklearn.utils import resample
from sklearn.cluster import KMeans

# Function to perform bootstrapping and evaluate cluster stability
def bootstrap_clusters(data, labels, n_clusters=4, n_iterations=100):
    stability_scores = []

    for _ in range(n_iterations):
        # Resample data with replacement (labels are drawn alongside but not used below)
        resampled_data, _ = resample(data, labels)
        
        # Vectorize resampled data
        vectorizer = TfidfVectorizer(stop_words='english')
        X_resampled = vectorizer.fit_transform(resampled_data)
        
        # Perform K-Means clustering (n_init set explicitly to avoid the default-change warning)
        kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
        resampled_cluster_labels = kmeans.fit_predict(X_resampled)
        
        # Calculate silhouette score for resampled clusters
        score = silhouette_score(X_resampled, resampled_cluster_labels)
        stability_scores.append(score)

    return np.mean(stability_scores), np.std(stability_scores)

# Calculate bootstrap stability scores
mean_stability, std_stability = bootstrap_clusters(data, labels)
print()
print(f"Bootstrap mean stability score: {mean_stability}")
print(f"Bootstrap stability score standard deviation: {std_stability}")



Bootstrap mean stability score: 0.3339838353782525
Bootstrap stability score standard deviation: 0.021235792441031902



##### Interpretation 

Cluster Separation:

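The mean silhouette score of ~0.334 across the bootstrap resamples suggests that the re-fitted K-Means clusters have some separation but are not highly distinct; they may overlap or be poorly defined. Because sampling with replacement duplicates documents, this score may be somewhat optimistic, and it is not directly comparable to the score of about -0.006 computed for the original labels.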

Cluster Robustness:

The low standard deviation (~0.021) indicates that the silhouette score is stable across resamples: K-Means produces clusterings of similar quality even when the data is slightly varied. It does not by itself show that individual stories are assigned to the same clusters each time.
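
To check directly how often the same points end up grouped together, two labelings of the same points can be compared pair by pair. The sketch below computes the plain Rand index, the fraction of point pairs on which two clusterings agree; `rand_index` and the toy labelings are illustrative assumptions, not part of the notebook. In a bootstrap setting, the comparison would be restricted to points that appear in both the original data and the resample.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two points
    together, or both keep them apart. Label names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    if len(labels_a) < 2:
        raise ValueError("need at least two points to form a pair")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Toy labelings (illustrative only)
reference = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same grouping, different label names
perturbed = [0, 0, 1, 1, 1, 1]    # one point moved to the other cluster

print(rand_index(reference, relabelled))  # identical groupings
print(rand_index(reference, perturbed))   # partial agreement
```

The plain Rand index is not corrected for chance agreement; scikit-learn's `adjusted_rand_score` is one chance-corrected alternative.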

##### Pairwise Statistical Test

Mann-Whitney U Test: This non-parametric test checks whether the values in two independent samples (here, two clusters) come from the same distribution, without assuming that the data is normally distributed.
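
As a rough illustration of what the test measures, the U statistic for a sample can be computed by counting, over all cross-sample pairs, how often its value is the larger one (ties count as one half). The function `mann_whitney_u` and the toy samples below are hypothetical and only sketch the statistic; they do not compute a p-value, which is what `scipy.stats.mannwhitneyu` adds.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of (x_i, y_j) pairs with x_i > y_j,
    counting ties as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Toy samples (illustrative only)
low = [1, 2, 3]
high = [4, 5, 6]
mixed_a = [1, 4, 5]
mixed_b = [2, 3, 6]

print(mann_whitney_u(low, high))        # no value in `low` exceeds any in `high`
print(mann_whitney_u(high, low))        # every value in `high` exceeds every one in `low`
print(mann_whitney_u(mixed_a, mixed_b)) # interleaved samples give a middling U
```

A U value near 0 or near `len(x) * len(y)` signals that one sample tends to have systematically smaller or larger values than the other; values near the middle are what the null hypothesis predicts.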

In [7]:
import numpy as np  # needed for np.array below
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.stats import mannwhitneyu

# Load data from text files
data_files = ['cluster_0.0.txt', 'cluster_1.0.txt', 'cluster_2.0.txt', 'cluster_3.0.txt']
data = []
labels = []

for i, file in enumerate(data_files):
    with open(file, 'r') as f:
        stories = f.readlines()
        data.extend(stories)  # Add stories to the data list
        labels.extend([i] * len(stories))  # Assign a label to each story based on the cluster

# Convert text data to TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data).toarray()

# Perform pairwise Mann-Whitney U test between clusters
def perform_mannwhitneyu_test(X, labels):
    n_clusters = len(set(labels))
    results = {}

    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
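            cluster_i = X[labels == i]
            cluster_j = X[labels == j]
            # With 2-D inputs, mannwhitneyu tests each column (TF-IDF feature)
            # separately, so p_value is an array with one entry per feature.
            u_statistic, p_value = mannwhitneyu(cluster_i, cluster_j, alternative='two-sided')
            results[(i, j)] = p_value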

    return results

# Convert labels list to a numpy array for indexing
labels = np.array(labels)

# Calculate pairwise p-values for cluster comparisons
p_values = perform_mannwhitneyu_test(X, labels)

# Display the p-values
for (cluster_i, cluster_j), p_value in p_values.items():
    print(f"P-value for clusters {cluster_i} and {cluster_j}: {p_value}")

# Interpretation
# A low p-value for a feature indicates that its TF-IDF weights differ significantly between the two clusters.


P-value for clusters 0 and 1: [0.34065391 0.95065632 0.34065391 ... 0.29820349 0.17633212 1.        ]
P-value for clusters 0 and 2: [1.         0.52378051 1.         ... 0.52378051 0.12253111 0.12253111]
P-value for clusters 0 and 3: [1.         0.62269912 0.04699584 ... 0.62269912 1.         1.        ]
P-value for clusters 1 and 2: [0.54168218 0.54168218 0.54168218 ... 1.         0.81983732 0.10656469]
P-value for clusters 1 and 3: [0.6376465  0.6376465  0.26084874 ... 1.         0.50069487 1.        ]
P-value for clusters 2 and 3: [1.         1.         0.2033176  ... 1.         0.44528822 0.44528822]


##### Overall Interpretation
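Each pairwise comparison returns one p-value per TF-IDF feature rather than a single p-value per cluster pair. In the values shown, nearly all p-values are well above 0.05; only one (about 0.047, for clusters 0 and 3) falls below it, and that is before any correction for multiple comparisons. This suggests that the clusters are not well separated: the clustering algorithm has not produced clearly distinct groups based on the features of the news stories.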
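
Since the output above lists one p-value per TF-IDF feature for each pair of clusters, a summary is easier to read than the raw arrays. One possible sketch, assuming a Bonferroni correction (dividing the significance level by the number of tests), counts how many features remain significant. The function `count_significant` and the toy p-values are illustrative assumptions and are not taken from the notebook's results.

```python
def count_significant(p_values, alpha=0.05):
    """Count p-values below a Bonferroni-adjusted threshold (alpha / number of tests)."""
    threshold = alpha / len(p_values)
    n_significant = sum(p < threshold for p in p_values)
    return n_significant, threshold

# Toy p-values (illustrative only, not the notebook's output)
toy_p_values = [0.0001, 0.04, 0.2, 0.9, 1.0]

n_significant, threshold = count_significant(toy_p_values)
print(f"{n_significant} of {len(toy_p_values)} features significant "
      f"at adjusted threshold {threshold:.3f}")
```

Applied to each array in `p_values`, this would report how many features, if any, distinguish a given pair of clusters after accounting for the large number of tests. Bonferroni is conservative; other corrections could be substituted.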

##### Which Methods to Use?

- Permutation Test: Use this if you want to statistically confirm that your clustering is not due to random chance.
- Silhouette Analysis: This is essential for assessing the quality of your clusters in terms of separation and cohesion.
- Bootstrapping: Use this to assess the stability and reliability of your clustering results.
- Pairwise Statistical Test: Use this to understand the differences between clusters and check whether they are significantly different from each other.

##### Recommended Approach

While you don't necessarily need to use all of these methods, combining a few can give you a more thorough evaluation of your clustering results. Here's a practical approach:

- Silhouette Analysis: Start with this to get a quick sense of cluster quality.
- Permutation Test: Use this to statistically validate the significance of your clustering.
- Bootstrapping: Assess the stability and robustness of your clustering results.
- Pairwise Statistical Test: Finally, use this to check for significant differences between clusters.