## Semantic Exploration of ACS Variables

The ACS offers a vast array of variables describing the demographic, social, economic, and housing characteristics of the U.S. population, so choosing the right variables for a specific analysis can be challenging. This notebook is designed to help you explore and select the appropriate variables for your analysis.

#### Objective - Overview
To simplify the exploration, we employ Natural Language Processing (NLP) techniques to generate word embeddings for the variable descriptions. This helps in identifying semantic similarities and grouping variables meaningfully.

We apply clustering algorithms to these embeddings, allowing us to explore groups of related variables and their interactions more efficiently. This method facilitates a deeper understanding of the dataset and assists in pinpointing relevant variables for further analysis.

In [1]:
# Importing Necessary Libraries for Exploring ACS Variables

# If you are running this notebook for the first time or do not have the censusdis library installed, 
# uncomment and run the line below to install the censusdis package.

# !pip install censusdis

# Importing the census data interface from censusdis library
import censusdis.data as ced

# Importing pandas for data manipulation and analysis
import pandas as pd

# Note: The censusdis library is used to interface with the Census API, 
# allowing for straightforward extraction and manipulation of the ACS dataset.

### Step 0: Exploration

In [2]:
#discovering all the datasets
df_datasets = ced.variables.all_data_sets()

df_datasets

Unnamed: 0,YEAR,SYMBOL,DATASET,TITLE,DESCRIPTION,API BASE URL
0,1986,CBP,cbp,1986 County Business Patterns: Business Patterns,County Business Patterns (CBP) is an annual se...,http://api.census.gov/data/1986/cbp
1,1987,CBP,cbp,1987 County Business Patterns: Business Patterns,County Business Patterns (CBP) is an annual se...,http://api.census.gov/data/1987/cbp
2,1988,CBP,cbp,1988 County Business Patterns: Business Patterns,County Business Patterns (CBP) is an annual se...,http://api.census.gov/data/1988/cbp
3,1989,CBP,cbp,1989 County Business Patterns: Business Patterns,County Business Patterns (CBP) is an annual se...,http://api.census.gov/data/1989/cbp
4,1989,CPS_BASIC_APR,cps/basic/apr,Apr 1989 Current Population Survey: Basic Monthly,"To provide estimates of employment, unemployme...",http://api.census.gov/data/1989/cps/basic/apr
...,...,...,...,...,...,...
1602,timeseries,TIMESERIES_PSEO,timeseries/pseo/flows,Post-Secondary Employment Outcomes (PSEO): Pos...,Post-Secondary Employment Outcomes (PSEO) are ...,http://api.census.gov/data/timeseries/pseo/flows
1603,timeseries,,timeseries/qwi/rh,Time Series Longitudinal Employer-Household Dy...,The Quarterly Workforce Indicators (QWI) are a...,http://api.census.gov/data/timeseries/qwi/rh
1604,timeseries,,timeseries/qwi/sa,Time Series Longitudinal Employer-Household Dy...,The Quarterly Workforce Indicators (QWI) are a...,http://api.census.gov/data/timeseries/qwi/sa
1605,timeseries,TIMESERIES_QWI,timeseries/qwi/se,Time Series Longitudinal Employer-Household Dy...,The Quarterly Workforce Indicators (QWI) are a...,http://api.census.gov/data/timeseries/qwi/se


Census datasets such as the American Community Survey (ACS) and the Decennial Census (DEC) are organized hierarchically. At the top are the datasets (surveys) themselves: the ACS provides annual demographic details, while the Decennial Census is conducted every ten years. Within each dataset, data is organized into groups based on themes such as demographic estimates or economic characteristics. Each group contains variables, the specific attributes or fields measured, such as employment status or income levels. This hierarchy makes it easier to locate the data relevant to a specific research question.

For example:

- **Dataset**
  - **Group 1**
    - Variable 1.1
    - Variable 1.2
    - Variable 1.3
  - **Group 2**
    - Variable 2.1
    - Variable 2.2
  - **Group 3**
    - Variable 3.1
    - Variable 3.2
    - Variable 3.3
    - Variable 3.4
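
As a rough illustration of this hierarchy, the sketch below models it as a nested dictionary. The dataset, group, and variable identifiers appear elsewhere in this notebook, but the dictionary and the helper functions `list_variables` and `find_group` are hypothetical and are not part of censusdis.

```python
# Toy model of the dataset > group > variable hierarchy.
# The nested dict is only an illustration, not a censusdis data structure.
hierarchy = {
    "acs/acs5": {
        "B19013": ["B19013_001E"],
        "B25119": ["B25119_001E", "B25119_002E", "B25119_003E"],
    },
    "dec/pl": {
        "P2": ["P2_001N", "P2_002N", "P2_003N"],
    },
}


def list_variables(dataset, group):
    """Return the variables recorded for one group of one dataset."""
    return hierarchy[dataset][group]


def find_group(variable):
    """Return every (dataset, group) pair whose variable list contains `variable`."""
    return [
        (dataset, group)
        for dataset, groups in hierarchy.items()
        for group, variables in groups.items()
        if variable in variables
    ]


print(list_variables("acs/acs5", "B25119"))
print(find_group("P2_002N"))
```

Walking from dataset to group to variable mirrors the order in which the exploration calls in this notebook are made.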


The Census variable code name, such as `B01003_001E`, follows a specific structure that provides information about the variable. Here's a breakdown of its components:

1. Table ID (B01003):

    * This part identifies the specific table in which the variable is found.
    * B indicates it is a detailed table, whereas C indicates a collapsed table.
    * The numbers (01003 in this case) further specify the table. Each table ID corresponds to a specific topic or set of related data.

2. Sequence Number (001):

    * This identifies the specific line within the table.
    * It indicates the position of the variable within the table.

3. Modifier (E):

    * This part indicates the type of data:
        + E stands for estimate.
        + M stands for margin of error.

So, B01003_001E can be interpreted as follows:

* B01003: The table ID, which refers to the total population table.
* 001: The first line within this table, typically representing the total population count.
* E: The estimate for the total population.
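
The breakdown above can be sketched as a small parser. This is a minimal illustration rather than an official specification: the regular expression and the `parse_variable` helper are assumptions, and only the `E` and `M` suffixes described above are given a meaning. Other suffixes are reported as "unknown".

```python
import re

# Assumed pattern: table ID (letters, digits, optional trailing letters),
# an underscore, the sequence digits, then a trailing letter suffix.
VARIABLE_PATTERN = re.compile(
    r"^(?P<table>[A-Z]+\d+[A-Z]*)_(?P<sequence>\d+)(?P<suffix>[A-Z]+)$"
)

# Only the two modifiers described in the text above are mapped.
SUFFIX_MEANINGS = {"E": "estimate", "M": "margin of error"}


def parse_variable(name):
    """Split a variable code into its parts, or return None if it does not match."""
    match = VARIABLE_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["meaning"] = SUFFIX_MEANINGS.get(parts["suffix"], "unknown")
    return parts


print(parse_variable("B01003_001E"))
print(parse_variable("GEO_ID"))  # not a table variable, so no match
```

Names such as `GEO_ID` and `NAME` do not follow the table pattern, so the sketch returns `None` for them instead of guessing.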


In [3]:
#discovering the groups under acs/acs5
ced.variables.all_groups('acs/acs5', 2010)

Unnamed: 0,DATASET,YEAR,GROUP,DESCRIPTION
0,acs/acs5,2010,B00001,UNWEIGHTED SAMPLE COUNT OF THE POPULATION
1,acs/acs5,2010,B00002,UNWEIGHTED SAMPLE HOUSING UNITS
2,acs/acs5,2010,B01001,SEX BY AGE
3,acs/acs5,2010,B01001A,SEX BY AGE (WHITE ALONE)
4,acs/acs5,2010,B01001B,SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...,...,...,...,...
934,acs/acs5,2010,C24040,"SEX BY INDUSTRY FOR THE FULL-TIME, YEAR-ROUND ..."
935,acs/acs5,2010,C24050,INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLO...
936,acs/acs5,2010,C24060,OCCUPATION BY CLASS OF WORKER FOR THE CIVILIAN...
937,acs/acs5,2010,C24070,INDUSTRY BY CLASS OF WORKER FOR THE CIVILIAN E...


In [4]:
#discovering the variables under a group
ced.variables.all_variables('acs/acs5', 2020, 'B19013')[['YEAR', 'DATASET','GROUP','VARIABLE','LABEL']]


Unnamed: 0,YEAR,DATASET,GROUP,VARIABLE,LABEL
0,2020,acs/acs5,B19013,B19013A_001E,Estimate!!Median household income in the past ...
1,2020,acs/acs5,B19013,B19013B_001E,Estimate!!Median household income in the past ...
2,2020,acs/acs5,B19013,B19013C_001E,Estimate!!Median household income in the past ...
3,2020,acs/acs5,B19013,B19013D_001E,Estimate!!Median household income in the past ...
4,2020,acs/acs5,B19013,B19013E_001E,Estimate!!Median household income in the past ...
5,2020,acs/acs5,B19013,B19013F_001E,Estimate!!Median household income in the past ...
6,2020,acs/acs5,B19013,B19013G_001E,Estimate!!Median household income in the past ...
7,2020,acs/acs5,B19013,B19013H_001E,Estimate!!Median household income in the past ...
8,2020,acs/acs5,B19013,B19013I_001E,Estimate!!Median household income in the past ...
9,2020,acs/acs5,B19013,B19013_001E,Estimate!!Median household income in the past ...


In [5]:
#discovering the variables under a group
ced.variables.all_variables('acs/acs5', 2020, 'B25119')


Unnamed: 0,YEAR,DATASET,GROUP,VARIABLE,LABEL,SUGGESTED_WEIGHT,VALUES
0,2020,acs/acs5,B25119,B25119_001E,Estimate!!Median household income in the past ...,,
1,2020,acs/acs5,B25119,B25119_002E,Estimate!!Median household income in the past ...,,
2,2020,acs/acs5,B25119,B25119_003E,Estimate!!Median household income in the past ...,,
3,2020,acs/acs5,B25119,GEO_ID,Geography,,
4,2020,acs/acs5,B25119,NAME,Geographic Area Name,,


In [6]:
#variables in a group as tree - group > variable hierarchy as a tree structure
ced.variables.group_tree('acs/acs5', 2020, "B25119")


+ Estimate
    + Median household income in the past 12 months (in 2020 inflation-adjusted dollars) --
        + Total: (B25119_001E)
            + Owner occupied (dollars) (B25119_002E)
            + Renter occupied (dollars) (B25119_003E)
+ Geography (GEO_ID)
+ Geographic Area Name (NAME)

In [7]:
#Variables in a group as a tree - group > variable hierarchy as a tree structure

ced.variables.group_tree('dec/pl', 2020, 'P2')


+  
    + Total: (P2_001N)
        + Hispanic or Latino (P2_002N)
        + Not Hispanic or Latino: (P2_003N)
            + Population of one race: (P2_004N)
                + White alone (P2_005N)
                + Black or African American alone (P2_006N)
                + American Indian and Alaska Native alone (P2_007N)
                + Asian alone (P2_008N)
                + Native Hawaiian and Other Pacific Islander alone (P2_009N)
                + Some Other Race alone (P2_010N)
            + Population of two or more races: (P2_011N)
                + Population of two races: (P2_012N)
                    + White; Black or African American (P2_013N)
                    + White; American Indian and Alaska Native (P2_014N)
                    + White; Asian (P2_015N)
                    + White; Native Hawaiian and Other Pacific Islander (P2_016N)
                    + White; Some Other Race (P2_017N)
                    + Black or African American; American Indian and Alaska N

### Step 1: Exploring the ACS 5-Year Profile

In this section, we will start by exploring the variable groups from the American Community Survey (ACS) 5-Year Profile for the year 2020. This exploration helps us understand the types of data collected and available for analysis, providing a foundation for further detailed studies.

The ACS 5-year data profiles cover the following geographies: the nation, all states (including DC and Puerto Rico), all metropolitan areas, all congressional districts, all counties, all places, and all tracts. The data profiles contain broad social, economic, housing, and demographic information, presented as both counts and percentages.


In [8]:
# Find the ACS 5-Year datasets by title; the DATASET column shows that the profile tables are under 'acs/acs5/profile'
df_datasets[df_datasets['TITLE'].str.contains('American Community Survey: 5-Year')].head()

Unnamed: 0,YEAR,SYMBOL,DATASET,TITLE,DESCRIPTION,API BASE URL
636,2009,ACS5,acs/acs5,American Community Survey: 5-Year Estimates: D...,The American Community Survey (ACS) is an ongo...,http://api.census.gov/data/2009/acs/acs5
637,2009,ACS5_PROFILE,acs/acs5/profile,American Community Survey: 5-Year Estimates: D...,The American Community Survey (ACS) is an ongo...,http://api.census.gov/data/2009/acs/acs5/profile
638,2009,ACS5_PUMS,acs/acs5/pums,2005-2009 American Community Survey: 5-Year Es...,The American Community Survey (ACS) Public Use...,http://api.census.gov/data/2009/acs/acs5/pums
639,2009,ACS5_PUMSPR,acs/acs5/pumspr,2005-2009 American Community Survey: 5-Year Es...,The Public Use Microdata Sample (PUMS) for Pue...,http://api.census.gov/data/2009/acs/acs5/pumspr
675,2010,ACS5_AIAN,acs/acs5/aian,American Community Survey: 5-Year Estimates: A...,The American Indian and Alaska Native (AIAN) t...,http://api.census.gov/data/2010/acs/acs5/aian


In [9]:
# Fetch ACS 5-Year Profile variable groups for 2020
acs5_profile_groups = ced.variables.all_groups('acs/acs5/profile', 2020)

# Display variable groups
acs5_profile_groups

Unnamed: 0,DATASET,YEAR,GROUP,DESCRIPTION
0,acs/acs5/profile,2020,DP02,SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ...
1,acs/acs5/profile,2020,DP02PR,SELECTED SOCIAL CHARACTERISTICS IN PUERTO RICO
2,acs/acs5/profile,2020,DP03,SELECTED ECONOMIC CHARACTERISTICS
3,acs/acs5/profile,2020,DP04,SELECTED HOUSING CHARACTERISTICS
4,acs/acs5/profile,2020,DP05,ACS DEMOGRAPHIC AND HOUSING ESTIMATES


In [10]:
# Initializing an empty list to store dataframes and converting the 'GROUP' column to a list
group_id_list = acs5_profile_groups['GROUP'].tolist()
df_list = []

# Looping through each group ID to fetch and store variables from the ACS5 profile for 2020
for group_id in group_id_list:
    # Fetching all variables for the specific group ID
    df_ = ced.variables.all_variables('acs/acs5/profile', 2020, group_id)
    # Appending each set of variables to the list
    df_list.append(df_)

# Concatenating all dataframes into one for comprehensive analysis, resetting index to avoid duplication
df = pd.concat(df_list, ignore_index=True)


In [11]:
df.head()

Unnamed: 0,YEAR,DATASET,GROUP,VARIABLE,LABEL,SUGGESTED_WEIGHT,VALUES
0,2020,acs/acs5/profile,DP02,DP02_0001E,Estimate!!HOUSEHOLDS BY TYPE!!Total households,,
1,2020,acs/acs5/profile,DP02,DP02_0001PE,Percent!!HOUSEHOLDS BY TYPE!!Total households,,
2,2020,acs/acs5/profile,DP02,DP02_0001PM,Percent Margin of Error!!HOUSEHOLDS BY TYPE!!T...,,
3,2020,acs/acs5/profile,DP02,DP02_0002E,Estimate!!HOUSEHOLDS BY TYPE!!Total households...,,
4,2020,acs/acs5/profile,DP02,DP02_0002PE,Percent!!HOUSEHOLDS BY TYPE!!Total households!...,,


### Step 2: Tokenizing Descriptions

In this section, we will process the textual descriptions associated with each variable from the ACS dataset. Tokenization breaks a string into smaller pieces, or "tokens". This step facilitates subsequent text analysis tasks such as removing stopwords and computing word embeddings.
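
To make the idea concrete before bringing in NLTK, here is a dependency-free sketch of tokenizing a single ACS label. It is a simplified stand-in, not the NLTK tokenizer: the regular expression, the `tokenize_label` helper, and the short `STOP_WORDS` set are all assumptions chosen for illustration.

```python
import re

# Small illustrative stopword set (hypothetical); NLTK's English list is much longer.
STOP_WORDS = {"of", "by", "in", "the", "with", "and", "or", "under"}


def tokenize_label(label):
    """Lowercase a label, split it into word tokens, and drop stopwords."""
    # Keep runs of letters/digits, allowing internal hyphens ("married-couple").
    # Separators such as "!!" and spaces are discarded by the pattern.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", label.lower())
    return [token for token in tokens if token not in STOP_WORDS]


label = "Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households"
print(tokenize_label(label))
```

The `!!` separators in ACS labels carry the label hierarchy, and this sketch simply discards them along with other punctuation.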


In [12]:
# Installing required libraries for text processing, clustering, and embeddings
# Uncomment the lines below to install these packages if they are not already installed in your environment.
# Note: stopwords ship with nltk, KMeans with scikit-learn, and pre-trained embeddings are loaded through gensim.

# !pip install nltk
# !pip install scikit-learn
# !pip install gensim

# Importing necessary libraries and modules for text tokenization and processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Ensure that the NLTK components are available
# This step is necessary to download stopwords and tokenizers if they aren't already available in your NLTK installation.
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/mia694/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mia694/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
# Converting the 'LABEL' column into a list of descriptions
descriptions = df['LABEL'].tolist()

# Preprocessing descriptions to remove stopwords and punctuation
stop_words = set(stopwords.words('english'))  # Creating a set of English stopwords from NLTK

preprocessed_descriptions = []  # Initializing an empty list to store the processed descriptions

# Iterating through each description in the list
for description in descriptions:
    # Tokenizing the description and converting it to lowercase to ensure uniformity
    tokens = word_tokenize(description.lower())
    
    # Filtering out stopwords and punctuation from the tokens
    filtered_tokens = [w for w in tokens if w not in stop_words and w not in string.punctuation]
    
    # Appending the cleaned tokens to the preprocessed_descriptions list
    preprocessed_descriptions.append(filtered_tokens)

# Displaying the list of preprocessed descriptions
preprocessed_descriptions



[['estimate', 'households', 'type', 'total', 'households'],
 ['percent', 'households', 'type', 'total', 'households'],
 ['percent', 'margin', 'error', 'households', 'type', 'total', 'households'],
 ['estimate',
  'households',
  'type',
  'total',
  'households',
  'married-couple',
  'household'],
 ['percent',
  'households',
  'type',
  'total',
  'households',
  'married-couple',
  'household'],
 ['percent',
  'margin',
  'error',
  'households',
  'type',
  'total',
  'households',
  'married-couple',
  'household'],
 ['estimate',
  'households',
  'type',
  'total',
  'households',
  'married-couple',
  'household',
  'children',
  'householder',
  '18',
  'years'],
 ['percent',
  'households',
  'type',
  'total',
  'households',
  'married-couple',
  'household',
  'children',
  'householder',
  '18',
  'years'],
 ['percent',
  'margin',
  'error',
  'households',
  'type',
  'total',
  'households',
  'married-couple',
  'household',
  'children',
  'householder',
  '18',
  'ye

### Step 4: Word Embeddings

In this section, we will import word embeddings to represent the preprocessed text data in a form suitable for machine learning. Word embeddings provide a dense representation of words and their relative meanings.
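
One common way to turn word vectors into a single vector per description is to average them. The sketch below shows that idea with toy 3-dimensional vectors. The values in `TOY_EMBEDDINGS` are made up for illustration and are not GloVe values, and the helper names are hypothetical.

```python
import math

# Made-up 3-dimensional vectors for illustration only (not real GloVe values).
TOY_EMBEDDINGS = {
    "households": [1.0, 0.0, 0.0],
    "total": [0.0, 1.0, 0.0],
    "income": [0.0, 0.0, 1.0],
}
DIMENSIONS = 3


def average_vector(tokens):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [TOY_EMBEDDINGS[token] for token in tokens if token in TOY_EMBEDDINGS]
    if not known:
        return [0.0] * DIMENSIONS
    return [sum(component) / len(known) for component in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


first = average_vector(["total", "households"])
second = average_vector(["households", "income"])
print(first, second, cosine_similarity(first, second))
```

Averaging ignores word order, so two descriptions that use the same words in a different order receive the same vector.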


In [14]:
# Importing necessary library for handling arrays
import numpy as np

# Installing gensim, a library used for loading pre-trained word embeddings
# Uncomment the following line if gensim is not already installed in your environment
# !pip install gensim

# Importing the downloader module from gensim to access pre-trained models
from gensim import downloader as api

# Downloading and loading the 'glove-wiki-gigaword-300' model
# This model provides a 300-dimensional vector representation of words
model = api.load("glove-wiki-gigaword-300")



In [15]:
# Convert tokenized descriptions to vector representations using pre-trained word embeddings
vectors = []  # Initialize a list to store the averaged vectors for each description

for description in preprocessed_descriptions:
    # Filter out words not in the model's vocabulary
    valid_words = [word for word in description if word in model.key_to_index]
    
    # Compute the average vector for descriptions with at least one valid word
    if valid_words:
        word_vectors = [model[word] for word in valid_words]
        description_vector = np.mean(word_vectors, axis=0)
        vectors.append(description_vector)
    else:
        # Append a zero vector for descriptions without valid words
        vectors.append(np.zeros(model.vector_size))

# The 'vectors' list now contains the averaged word vectors for each description,
# with out-of-vocabulary words excluded.

vectors

[array([-6.44796014e-01,  5.12863994e-01,  3.25675815e-01,  9.17840078e-02,
        -2.75257200e-01,  4.88986015e-01, -2.78351814e-01,  2.60584056e-03,
        -2.89463997e-01, -1.10335994e+00, -2.10775405e-01,  1.83662012e-01,
        -1.03880046e-02,  3.18938553e-01,  2.41633981e-01, -1.18675612e-01,
        -8.93072039e-02, -1.10810399e-01, -3.58807981e-01, -4.12108004e-01,
        -1.19426008e-02,  1.01291001e-01,  1.69059202e-01,  4.45246041e-01,
        -1.17086984e-01,  2.12961599e-01,  6.00379426e-03,  5.15800007e-02,
         3.79202038e-01,  6.49600010e-03, -6.84720054e-02,  3.88995975e-01,
        -4.40880001e-01,  6.85943961e-02, -5.60473800e-01, -3.42877209e-01,
         1.19535998e-01, -2.24000216e-03,  5.95584884e-02,  9.76260081e-02,
         2.05910012e-01, -1.49828002e-01,  1.68763995e-01,  8.16839933e-02,
        -1.68619994e-02, -3.95344019e-01,  3.14175785e-02,  2.33285613e-02,
        -2.36446813e-01,  1.42795593e-01,  7.73700029e-02,  9.75900516e-02,
         1.7

### Step 5: Cluster the vectors using KMeans


In this section, we will apply the K-means clustering algorithm to group our preprocessed text data based on their vector representations. Clustering can help us identify underlying patterns or groups within the ACS variable descriptions, revealing insights that may not be immediately obvious. We will use the scikit-learn library to perform this clustering.
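
For intuition about what K-means does, here is a from-scratch sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its members). This is not scikit-learn's implementation: the `kmeans` function, its explicit `initial_centroids` argument, and the toy 2D points are assumptions for illustration, and scikit-learn chooses its starting centroids differently.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = [list(centroid) for centroid in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))
            for point in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Two well-separated toy groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

In the notebook the same idea is applied to the 300-dimensional description vectors, with each description playing the role of a point.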

In [16]:
# Importing the KMeans class from scikit-learn for clustering analysis
from sklearn.cluster import KMeans

In [17]:
# Cluster the vectors using KMeans
num_clusters = 15  # Set the number of clusters; adjust this number based on your specific analysis needs
kmeans = KMeans(n_clusters=num_clusters, random_state=0).fit(vectors)  # Fitting the KMeans model to the vectors

# Assign descriptions to clusters
clusters = kmeans.labels_  # The labels_ attribute of the model provides the cluster assignment for each description


In [18]:
# Output the results
# This loop prints each description along with its corresponding cluster number
for i, cluster in enumerate(clusters):
    print(f"Description: {descriptions[i]} -> Cluster: {cluster}")
    # Here, `descriptions[i]` refers to the i-th description in the original list
    # `cluster` is the cluster number assigned to that description by the KMeans algorithm


Description: Estimate!!HOUSEHOLDS BY TYPE!!Total households -> Cluster: 4
Description: Percent!!HOUSEHOLDS BY TYPE!!Total households -> Cluster: 4
Description: Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households -> Cluster: 13
Description: Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household -> Cluster: 4
Description: Percent!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household -> Cluster: 4
Description: Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household -> Cluster: 4
Description: Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household!!With children of the householder under 18 years -> Cluster: 4
Description: Percent!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household!!With children of the householder under 18 years -> Cluster: 4
Description: Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household!!With children of the householder under 18 years ->

In [19]:

# Assuming `clusters` contains the cluster labels for each description
# And `descriptions` is the list of all descriptions from your dataset

# Organize descriptions by cluster
clustered_descriptions = {}
for i, cluster in enumerate(clusters):
    # Check if the cluster key already exists in the dictionary; if not, initialize it
    if cluster not in clustered_descriptions:
        clustered_descriptions[cluster] = []
    # Append the description to the correct cluster list
    clustered_descriptions[cluster].append(descriptions[i])

# Sample or show a few entries from each cluster to inspect the clustering results
samples_per_cluster = 10  # Define how many samples you want to display per cluster
for cluster, descs in clustered_descriptions.items():
    print(f"Cluster {cluster}:")  # Print the cluster number
    # If the number of descriptions in the cluster is greater than the desired sample size, randomly select samples
    if len(descs) > samples_per_cluster:
        sample_descs = np.random.choice(descs, samples_per_cluster, replace=False)
    else:
        sample_descs = descs  # If not, use all descriptions in the cluster
    # Print each sample description
    for desc in sample_descs:
        print(f" - {desc}")
    print()



Cluster 4:
 - Estimate!!RELATIONSHIP!!Population in households
 - Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Households with one or more people under 18 years
 - Percent!!HOUSEHOLDS BY TYPE!!Total households!!Female householder, no spouse/partner present
 - Percent Margin of Error!!HOUSEHOLDS BY TYPE!!Total households!!Male householder, no spouse/partner present
 - Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Male householder, no spouse/partner present!!With children of the householder under 18 years
 - Percent!!RELATIONSHIP!!Population in households!!Other nonrelatives
 - Percent!!RELATIONSHIP!!Population in households
 - Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Female householder, no spouse/partner present!!Householder living alone
 - Estimate!!HOUSEHOLDS BY TYPE!!Total households!!Married-couple household!!With children of the householder under 18 years
 - Percent!!HOUSEHOLDS BY TYPE!!Total households!!Male householder, no spouse/partner present

Cluste

### Step 6: Visualizing Clusters

In this section, we will visualize the clusters generated by the K-means algorithm in the previous section using a 2D scatter plot. Each point in the plot represents a description, with colors indicating the cluster to which it belongs. This visualization will help us understand how well the algorithm has grouped similar descriptions together.

In [21]:
import plotly.express as px
from sklearn.manifold import TSNE

# Convert lists to numpy arrays if they are not already
vectors = np.array(vectors)
cluster_tags = np.array(clusters)
descriptions = np.array(descriptions)

# Reduce dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Create a DataFrame for plotting
# (named plot_df so the variables DataFrame `df` from Step 1 is not overwritten)
plot_df = pd.DataFrame(reduced_vectors, columns=['x', 'y'])
plot_df['cluster_tag'] = cluster_tags
plot_df['description'] = descriptions

print("Hover over the graph to see the details")

# Plot using Plotly
fig = px.scatter(plot_df, x='x', y='y', color='cluster_tag', hover_data=['description'])
fig.update_traces(marker=dict(size=12),
                  selector=dict(mode='markers'))
fig.show()


Hover over the graph to see the details


In [22]:
# Vector embedding data
vectors = np.array(vectors)  # Ensure vectors is a numpy array

# Define different values of k
k_values = [5, 10, 15]

# Reduce dimensions to 2D once; the t-SNE projection does not depend on k
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Dictionary to store the cluster labels for each k
cluster_results = {}

for k in k_values:
    # Cluster the vectors using KMeans
    kmeans = KMeans(n_clusters=k, random_state=0).fit(vectors)
    cluster_results[k] = kmeans.labels_  # Cluster labels for each description

# Plot results for each k using Plotly
for k, labels in cluster_results.items():
    plot_df = pd.DataFrame(reduced_vectors, columns=['x', 'y'])
    # Cast labels to strings so Plotly treats clusters as discrete categories
    plot_df['cluster'] = labels.astype(str)
    fig = px.scatter(plot_df, x='x', y='y', color='cluster', title=f"t-SNE and KMeans Clustering with k={k}",
                     color_discrete_sequence=px.colors.qualitative.Vivid)
    fig.update_traces(marker=dict(size=12),
                      selector=dict(mode='markers'))
    fig.show()
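
The plots compare different values of k visually. One possible numeric complement is the within-cluster sum of squares: the total squared distance from each point to the mean of its cluster. The sketch below computes it from points and labels in plain Python. The function name and toy data are assumptions for illustration. A fitted scikit-learn `KMeans` model reports a comparable quantity, measured against its own fitted centroids, through its `inertia_` attribute.

```python
def within_cluster_sum_of_squares(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # Centroid of this cluster: the component-wise mean of its members.
        centroid = [sum(component) / len(members) for component in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total


# Toy data: four 2D points, scored as one cluster and as two clusters.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(within_cluster_sum_of_squares(toy_points, [0, 0, 0, 0]))
print(within_cluster_sum_of_squares(toy_points, [0, 0, 1, 1]))
```

This value generally shrinks as k grows, so it is usually read by looking for the point where adding clusters stops helping much, not by picking the smallest value.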
