<a href="https://colab.research.google.com/github/BackBencher2424/BA820_Team_14_Project/blob/main/BA820_M4_Q2_Drishti_Chulani.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BA820 – Project M4**
**Project Title:** *Code Trends, Quantified: Mapping the Programming Language Ecosystem* <br>
**Section:** B1 <br>
**Team:** 14 <br>
**Team Member:** Drishti Chulani <br><br>

**Research Question:**


**Link to Proposal Notebook (BA820_Team_14_Project_Proposal_Notebook.ipynb):** [EDA Notebook](https://colab.research.google.com/drive/1irElxdNYp_Hh08p4MeOGdafvt_d1KsT7?usp=sharing) <br>

**Link to M2 Notebook:**

**Link to M3 Notebook:**

**Link to Colab Notebook:** [Research Question 2 Notebook](https://colab.research.google.com/gist/DrishtiChulani/2a94bebfc93ffe65ab51a63abc7ee2c3/ba820_m2_q2_drishti_chulani.ipynb) <br>

**Dataset:** [Programming Languages Dataset Link](https://github.com/rfordatascience/tidytuesday/tree/main/data/2023/2023-03-21)<br>

**Link to GitHub Repo:** [GitHub Repo Link](https://github.com/BackBencher2424/BA820_Team_14_Project)


# **Table of Contents**
---

**1. Project Framing & M4 Refinement Goals**

**2. Data Pipeline & "Active Market" Filtering**

**3. Refined EDA: The Case for Deep Learning**

**4. Baseline Model (M3 Recap)**

**5. M4 Method Upgrade: Autoencoder Feature Extraction**

**6. Advanced Clustering & UMAP Visualization**

**7. Business Insights: Identifying "Archetype Shifts"**

## **1. Project Framing & M4 Refinement Goals**

**The M3 Baseline:** In the previous milestones, I segmented the programming language ecosystem into four Market Archetypes — Titans, Speculative/Hype, Silent Workhorses, and the Long-Tail — using K-Means clustering and linear PCA on log-transformed metrics like GitHub stars, job postings, and Wikipedia views.

**The Limitation:** That foundation held up well, but surface metrics have real weaknesses. They tend to be skewed, multicollinear, and ill-suited for capturing the non-linear relationship between a language's social hype and its actual industrial utility. There's also a dataset composition problem: roughly three-quarters of the languages in our dataset are dormant or "ghost" languages, and including them pulls cluster centroids in misleading directions, obscuring the market behaviors we actually care about.

**The M4 Refinement (Method Upgrade & Experimental Rigor):**
M4 addresses these gaps through three upgrades to get more actionable insights for technical stakeholders:

* **"Active Market" Filtering:** I applied a strict filter to remove dormant languages before any modeling begins, so the unsupervised models are trained exclusively on the languages that are actually alive in the ecosystem.
* **Deep Representation Learning (Autoencoder):** Rather than relying on linear scaling, I used a deep learning architecture to compress our features into a neural network's bottleneck layer — extracting the non-linear "latent space" that captures the fundamental DNA of each language in a way PCA simply can't.
* **Advanced Manifold Visualization (UMAP):** I applied UMAP to this latent space to visualize cluster boundaries that reflect genuine structural relationships rather than linear projections.

**The Business Goal:** By clustering on a language's deep technical DNA rather than surface-level buzz, we aim to identify **"Archetype Shifts"**: hidden Titans working under the radar, or hype-driven languages masking weak fundamentals. The goal is to give CTOs and investors a clearer, more honest way to quantify adoption risk.

## **2. Data Pipeline & "Active Market" Filtering**

### ***Importing Libraries***

In [None]:
!pip install umap-learn
!pip install --upgrade patsy statsmodels

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import umap
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import tensorflow as tf
from tensorflow.keras import layers, models

### ***Loading Dataset***

In [None]:
path = "/content/languages.csv"
df = pd.read_csv(path) # Loads the primary dataset to the runtime

### **1. Preprocessing**

### ***Standardizing and Exploring the Dataset***

In [None]:
df.columns = [c.strip().lower() for c in df.columns]  # Standardize column names: strip() removes leading/trailing spaces, lower() lowercases them

In [None]:
print("Shape (rows, cols):", df.shape) #Gives number of Rows and Columns

### **Exploratory Data Analysis (EDA)**

### ***Calculating Missing & Duplicate Values***


We do not drop any columns based on null counts. Programming languages have been created and revised continuously, so nearly every entry is missing some information. Instead, we work with the missing values.


In [None]:
# Missing values summary gives columns missing_count and missing_pct

missing = (df.isna().sum()
           .to_frame("missing_count")
           .assign(missing_pct=lambda x: (x["missing_count"] / len(df) * 100).round(2))
           .sort_values("missing_pct", ascending=False))

missing.head(20)  # top 20 columns with most missingness

### ***Feature Selection & Type Conversion***

In [None]:
features = ['github_repo_stars', 'number_of_jobs', 'number_of_users', 'wikipedia_daily_page_views']
df_eda = df.copy()

# Coerce to numeric and fill NaNs with 0
for feature in features:
    df_eda[feature] = pd.to_numeric(df_eda[feature], errors='coerce')
df_eda[features] = df_eda[features].fillna(0)
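
To make the coercion step concrete, here is a minimal sketch on a made-up column (the values are illustrative and not taken from the dataset). It shows that `errors='coerce'` turns unparseable entries into `NaN`, and that the following `fillna(0)` makes missing values indistinguishable from true zeros. That matters later, because zeros are read as inactivity.

```python
import pandas as pd

# Hypothetical raw column mixing numbers, text, and missing entries
raw = pd.Series(["1200", "n/a", None, "0", "35"])

# Unparseable strings and None both become NaN
numeric = pd.to_numeric(raw, errors="coerce")

# After filling, "unknown" and "genuinely zero" look identical
filled = numeric.fillna(0)

print("Entries that could not be parsed:", int(numeric.isna().sum()))
print(filled.tolist())
```

If the distinction matters for a given analysis, one option is to keep a boolean "was missing" column alongside the filled values.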

### ***Numeric Summary Statistics***

In [None]:
# Basic summary statistics for selected numeric columns, because these are most relevant columns
cols_to_summarize = [
    "github_repo_stars","wikipedia_daily_page_views",
    "number_of_users", "number_of_jobs"
]
cols_to_summarize = [c for c in cols_to_summarize if c in df_eda.columns]

df_eda[cols_to_summarize].describe().T

### ***A Few Key Findings***

#### ***Which 10 programming languages have the most jobs?***

In [None]:
# Top 10 Languages by Jobs
print("Top 10 by Jobs:\n", df_eda[['title', 'number_of_jobs']].nlargest(10, 'number_of_jobs'))

#### ***Which 10 programming languages have the most users?***

In [None]:
# Top 10 by Users
print("\nTop 10 by Users:\n", df_eda[['title', 'number_of_users']].nlargest(10, 'number_of_users'))

These two tables serve as supporting evidence when interpreting our clusters.

#### **M4 "Active Market" Filter**

In [None]:
mask_active = (df_eda['number_of_jobs'] > 0) | (df_eda['github_repo_stars'] > 0)
df_active = df_eda[mask_active].reset_index(drop=True)

print(f"Original Dataset: {len(df_eda)} languages")
print(f"Active Market Subset: {len(df_active)} languages")
print(f"Noise Removed: {len(df_eda) - len(df_active)} inactive/dead languages")

# Transformation & Scaling (The Preprocessing Pipeline)
# Apply Log10(x+1) to handle extreme skewness in the active dataset
df_active_log = np.log10(df_active[features].clip(lower=0) + 1)

# Standardize for clustering and neural network input
scaler = StandardScaler()
df_active_scaled = scaler.fit_transform(df_active_log)

#### **Why this methodology**

In Milestones 2 and 3, our clustering analysis included the entire dataset of over 4,000 languages. However, a significant portion of these entries (~75%) represent "Ghost" or "Historical" languages with zero job postings and zero GitHub activity.

Why were these languages filtered out?

**Centroid Distortion**: Including thousands of inactive languages with zero-value signals artificially pulls the cluster centroids toward the origin. This makes it mathematically difficult to distinguish between "Niche" languages and "Speculative" languages that are just starting to gain traction.

**Model Sparsity**: Neural models such as Autoencoders learn more useful representations when the input carries a real signal rather than rows that are almost entirely zeros.

**Business Relevance**: For a CTO or developer, an archetype model is most valuable when it compares technologies that are actually competing in the current market.
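
The centroid-distortion argument can be illustrated with a small numeric sketch. The numbers below are invented for illustration: five "active" points in a 2D log-scaled space, plus a block of all-zero rows standing in for ghost languages at roughly the proportion described above.

```python
import numpy as np

# Hypothetical log-scaled (stars, jobs) values for five active languages
active = np.array([[3.0, 2.0], [4.0, 3.5], [2.5, 1.0], [5.0, 4.0], [3.5, 2.5]])

# Fifteen all-zero rows standing in for dormant "ghost" languages (75% of the sample)
ghosts = np.zeros((15, 2))

centroid_active = active.mean(axis=0)
centroid_all = np.vstack([active, ghosts]).mean(axis=0)

print("Centroid, active only:", centroid_active)
print("Centroid, with ghosts:", centroid_all)
```

This sketch uses a single overall mean for simplicity. In practice K-Means may give most zero rows their own cluster, but any cluster that absorbs them is pulled toward the origin in the same way.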

#### **M2 refinements**

The logarithmic transformation ($\log_{10}(x+1)$) was a necessary analytical choice to handle extreme scale differences and data noise. While the majority of languages show zero activity, well-known languages like SQL have over 7.1 million users, a disparity that would let a handful of outliers dominate any distance-based clustering. This transformation "squashes" these outliers, allowing the model to detect meaningful patterns across the entire dataset.

Additionally, we encountered "noise" where `wikipedia_daily_page_views` had values of -1.0. At $x = -1$ the transform evaluates to $\log_{10}(0) = -\infty$, so we clipped these negatives to 0 before transforming. This prevents infinite values from reaching the scaler and the clustering step.
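
The effect of clipping before the transform can be seen in a short sketch. The input values are illustrative (a -1 placeholder, a true zero, and a very large count), not rows from the dataset.

```python
import numpy as np

# Illustrative values: a -1 placeholder, a true zero, a mid-sized count, a very large count
views = np.array([-1.0, 0.0, 99.0, 7_100_000.0])

# Without clipping, -1 + 1 = 0 and log10(0) is -inf
with np.errstate(divide="ignore"):
    unclipped = np.log10(views + 1)

# Clipping first maps the placeholder to 0, so log10(0 + 1) = 0
clipped = np.log10(np.clip(views, 0, None) + 1)

print("Without clipping:", unclipped)
print("With clipping:   ", clipped)
```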

## **3. Refined EDA**

In this section, we examine the statistical properties of the "Active Market" subset. Our goal is to justify the transition from Milestone 3's linear baseline to Milestone 4's deep representation learning.

#### **3.1 Distribution and Skewness**
Even after filtering for active languages, the data remains heavily skewed. We observe that while the majority of active languages have modest engagement, a select few "Titans" command millions of users and jobs. This motivates a model, such as a neural network, that can learn to "weigh" these extremes without being overwhelmed by them.

#### **3.2 Multicollinearity: Redundant Signals**
Our community metrics (Stars, Wiki Views, and Users) show high correlation. In a standard clustering model, this redundancy can lead to "double-counting" popularity while ignoring subtle differences in industrial utility. An **Autoencoder** is specifically designed to compress these correlated signals into a clean, lower-dimensional latent space, effectively "denoising" the market archetypes.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Condensed Distribution Plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: GitHub Stars
sns.histplot(df_active_log['github_repo_stars'], bins=30, kde=True, ax=axes[0], color='skyblue')
axes[0].set_title("Log-Scaled GitHub Stars")

# Plot 2: Wikipedia Views
sns.histplot(df_active_log['wikipedia_daily_page_views'], bins=30, kde=True, ax=axes[1], color='salmon')
axes[1].set_title("Log-Scaled Wikipedia Views")

# Plot 3: Number of Jobs
sns.histplot(df_active_log['number_of_jobs'], bins=30, kde=True, ax=axes[2], color='lightgreen')
axes[2].set_title("Log-Scaled Number of Jobs")

plt.tight_layout()
plt.show()

# 2. Correlation Heatmap (Justifying the Bottleneck)
plt.figure(figsize=(8, 6))
sns.heatmap(df_active_log.corr(), annot=True, cmap='mako', fmt=".2f")
plt.title("Feature Correlation Heatmap: Identifying Redundancy")
plt.show()
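
The skewness claim in Section 3.1 can also be checked numerically rather than only by eye. The sketch below uses synthetic heavy-tailed counts as a stand-in for a metric such as `number_of_users` (the distribution parameters are assumptions) and compares sample skewness before and after the $\log_{10}(x+1)$ transform. The same `.skew()` call could be applied to `df_active[features]` and `df_active_log`.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed counts standing in for a metric such as number_of_users
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=5.0, sigma=2.0, size=1000))

skew_raw = counts.skew()                 # sample skewness of the raw counts
skew_log = np.log10(counts + 1).skew()   # skewness after the log10(x+1) transform

print(f"Skewness raw: {skew_raw:.2f}, after log10(x+1): {skew_log:.2f}")
```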

## **4. Baseline Model (M3 Recap)**

Before implementing the deep learning approach, I used the Milestone 3 methodology as the baseline.

**The Baseline Approach:**
* **Algorithm:** K-Means Clustering ($k=4$).
* **Representation:** Log-transformed, standardized features (`df_active_scaled`).
* **Visualization:** Manifold projection via **UMAP** (Uniform Manifold Approximation and Projection).

**Purpose:** By running UMAP on the baseline scaled data, we can visualize the existing "Market Archetypes." This will allow us to document the limitations of linear scaling and hard clustering (such as overlapping boundaries or "noisy" segments) which we aim to solve in Section 6.

In [None]:
# 1. using baseline K-Means on the active languages
# Using k=4 as validated in M2 and M3 via Elbow and Silhouette scores
kmeans_base = KMeans(n_clusters=4, random_state=42, n_init=10)
df_active['baseline_cluster'] = kmeans_base.fit_predict(df_active_scaled)

# 2. Baseline UMAP for visualization
# This projects our 4D data into 2D to visualize the "surface" cluster separation
reducer_base = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
umap_base = reducer_base.fit_transform(df_active_scaled)

# Add UMAP coordinates to our dataframe
df_active['umap_base_1'] = umap_base[:, 0]
df_active['umap_base_2'] = umap_base[:, 1]

# 3. Plotting the Baseline Comparison (2D Feature Space vs. UMAP Manifold)
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Plot A: The Original "Hype vs Utility" Scatter (M3 Style)
sns.scatterplot(
    data=df_active,
    x=df_active_log['github_repo_stars'],
    y=df_active_log['number_of_jobs'],
    hue='baseline_cluster',
    palette='tab10',
    alpha=0.6,
    s=60,
    ax=axes[0]
)
axes[0].set_title('M3 Baseline: Hype vs Utility Feature Space')
axes[0].set_xlabel('Log10 GitHub Stars (Hype)')
axes[0].set_ylabel('Log10 Number of Jobs (Utility)')
axes[0].grid(True, linestyle="--", alpha=0.5)

# Plot B: The UMAP Manifold Projection
sns.scatterplot(
    data=df_active,
    x='umap_base_1',
    y='umap_base_2',
    hue='baseline_cluster',
    palette='tab10',
    alpha=0.7,
    ax=axes[1]
)
axes[1].set_title("M3 Baseline: UMAP Manifold Projection")
axes[1].set_xlabel("UMAP Dimension 1")
axes[1].set_ylabel("UMAP Dimension 2")

plt.tight_layout()
plt.show()

# Display the numerical profiles for final baseline verification
print("Baseline Archetype Profiles (Mean values):")
display(df_active.groupby('baseline_cluster')[features].mean())
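
The baseline relies on $k=4$ as validated in earlier milestones via Elbow and Silhouette scores. For readers who want to see what such a check looks like, here is a minimal sketch of a silhouette sweep. It runs on synthetic, well-separated 4D blobs (an assumption standing in for `df_active_scaled`), so the result says nothing about the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_active_scaled: four well-separated groups in 4D
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0], [10, 10, 10, 10],
                    [-10, 10, -10, 10], [10, -10, 10, -10]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 4)) for c in centers])

# Fit K-Means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(float(s), 3) for k, s in scores.items()}, "-> best k:", best_k)
```

Re-running the sweep on the active subset would confirm whether $k=4$ still holds after filtering, since the earlier validation was done on the full dataset.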

## **5. M4 Method Upgrade: Autoencoder Feature Extraction**

Traditional clustering relies on the "surface" geometry of the data. However, the programming language ecosystem is defined by non-linear relationships—for instance, a 10% increase in GitHub stars does not lead to a linear 10% increase in job postings.

**The Refinement (Deep Learning Architecture):**
We implement an **Autoencoder**, a type of neural network designed for unsupervised representation learning.

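* **The Squeeze:** The network passes our 4D input features through a 16-unit hidden layer into a 4D **Latent Space** (`encoding_dim = 4` in the code). Because the latent width equals the input width, this layer is not a strict dimensional bottleneck; setting `encoding_dim = 2` would force real compression.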
* **The Goal:** To reconstruct the original input, the network is forced to discard noise and multicollinear "redundancy" (like the overlap between Stars and Wiki views), capturing only the fundamental, non-linear "DNA" of the technology.
* **The Benefit:** By clustering on this latent space rather than raw features, we can identify archetypes based on their structural positioning in the market rather than superficial volume.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Define the Autoencoder Architecture
# Input dimension matches our 4 features: Stars, Jobs, Users, Wiki Views
input_dim = df_active_scaled.shape[1]
encoding_dim = 4  # The "Latent Space" Bottleneck

# Encoder: Compressing the signal
input_layer = layers.Input(shape=(input_dim,))
encoded = layers.Dense(16, activation='relu')(input_layer)
bottleneck = layers.Dense(encoding_dim, activation='relu', name='latent_space')(encoded)

# Decoder: Attempting to reconstruct the original input
decoded = layers.Dense(16, activation='relu')(bottleneck)
output_layer = layers.Dense(input_dim, activation='linear')(decoded)

# 2. Build and Compile the Model
autoencoder = models.Model(input_layer, output_layer)
autoencoder.compile(optimizer='adam', loss='mse')

# 3. Training the Model
# The model learns by trying to predict its own input (X = y)
print("Training Autoencoder to extract latent features...")
history = autoencoder.fit(
    df_active_scaled,
    df_active_scaled,
    epochs=100,
    batch_size=32,
    verbose=0,
    validation_split=0.1
)

# 4. Extracting the Latent "DNA"
# We create a sub-model that stops at the bottleneck layer
encoder_only = models.Model(inputs=autoencoder.input,
                             outputs=autoencoder.get_layer('latent_space').output)

latent_features = encoder_only.predict(df_active_scaled)

# 5. Visualizing the Training Loss (Rigor Check)
plt.figure(figsize=(8, 4))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Autoencoder Training Loss: Ensuring Convergence')
plt.xlabel('Epochs')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.show()
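
One way to check whether the autoencoder adds anything beyond a linear projection is to compare its reconstruction MSE against PCA at the same latent width. The sketch below computes the PCA reconstruction error with NumPy's SVD on synthetic standardized data (a stand-in for `df_active_scaled`); the helper name `pca_reconstruction_mse` is hypothetical. At width 4, equal to the number of input features, the linear reconstruction is exact, so a low loss from a 4-unit latent layer is not by itself evidence of compression.

```python
import numpy as np

def pca_reconstruction_mse(X, k):
    """MSE of reconstructing X from its top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    X_hat = Xc @ components.T @ components
    return float(np.mean((Xc - X_hat) ** 2))

# Synthetic stand-in: 4 correlated features, standardized column by column
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(500, 1)) for _ in range(4)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

mse_by_width = {k: pca_reconstruction_mse(X, k) for k in (1, 2, 3, 4)}
for k, mse in mse_by_width.items():
    print(f"latent width {k}: linear reconstruction MSE = {mse:.4f}")
```

Comparing these numbers with the final `val_loss` of an autoencoder trained at the same width gives a rough sense of how much the non-linear layers contribute.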

## **6. Advanced Clustering & UMAP Visualization**

With the "Latent DNA" of the programming languages extracted, we now perform our final clustering. Unlike Section 4, which clustered based on surface-level metrics, this stage clusters based on the compressed, non-linear representations learned by the Autoencoder.

**The Refinement Logic:**
* **Algorithm:** K-Means ($k=4$) applied to the **Latent Space**.
* **Visualization:** UMAP projection of the **Latent Space**.
* **The Comparison:** By projecting the deep features into a 2D UMAP space, we can observe if the cluster boundaries are better defined and if the "Adoption Chasm" between Speculative languages and Industry Standards is more distinct.

In [None]:
# 1. New K-Means on the Deep (Latent) Features
kmeans_deep = KMeans(n_clusters=4, random_state=42, n_init=10)
df_active['deep_cluster'] = kmeans_deep.fit_predict(latent_features)

# 2. UMAP on the Deep Features (The Latent Manifold)
reducer_deep = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
umap_deep = reducer_deep.fit_transform(latent_features)

# Add Deep UMAP coordinates to our dataframe
df_active['umap_deep_1'] = umap_deep[:, 0]
df_active['umap_deep_2'] = umap_deep[:, 1]

# 3. Final Comparison Plot: Baseline vs. Deep Archetypes
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Plot A: Baseline UMAP (From Section 4)
sns.scatterplot(
    data=df_active, x='umap_base_1', y='umap_base_2',
    hue='baseline_cluster', palette='tab10', alpha=0.5, ax=axes[0]
)
axes[0].set_title("M3 Baseline: UMAP on Raw Features")
axes[0].set_xlabel("UMAP Dimension 1")
axes[0].set_ylabel("UMAP Dimension 2")

# Plot B: Deep Archetype UMAP (The M4 Refinement)
sns.scatterplot(
    data=df_active, x='umap_deep_1', y='umap_deep_2',
    hue='deep_cluster', palette='viridis', alpha=0.7, ax=axes[1]
)
axes[1].set_title("M4 Refined: UMAP on Autoencoder Latent Space")
axes[1].set_xlabel("UMAP Dimension 1")
axes[1].set_ylabel("UMAP Dimension 2")

plt.tight_layout()
plt.show()

# 4. Profile Check: How do these new deep clusters look?
print("Deep Archetype Profiles (Mean values):")
display(df_active.groupby('deep_cluster')[features].mean().sort_values(by='number_of_jobs', ascending=False))
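
When comparing the two clusterings, it helps to remember that K-Means label IDs are arbitrary: two runs can find the same groups but number them differently. The sketch below uses made-up labels to show how a cross-tabulation exposes this, and how a best-overlap relabeling separates renumbering from real changes in membership. The helper `align_labels` is hypothetical, and it brute-forces all permutations, which is cheap for $k=4$.

```python
from itertools import permutations

import numpy as np
import pandas as pd

def align_labels(reference, other, k):
    """Relabel `other` so that it overlaps `reference` as much as possible."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(k)):
        mapped = np.array([perm[label] for label in other])
        hits = int((mapped == reference).sum())
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return np.array([best_perm[label] for label in other])

# Made-up labels: `deep` has the same groups as `base` under a renumbering,
# except for the last item, which really changes group
base = np.array([0, 0, 1, 1, 2, 2, 3, 3])
deep = np.array([2, 2, 0, 0, 3, 3, 1, 2])

print(pd.crosstab(base, deep, rownames=["base"], colnames=["deep"]))

naive_shifts = int((base != deep).sum())
aligned_shifts = int((base != align_labels(base, deep, k=4)).sum())
print("naive:", naive_shifts, "aligned:", aligned_shifts)
```

A direct comparison of raw label IDs flags every item here as changed, while the aligned comparison isolates the single real move.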

## **7. Business Insights: Identifying "Archetype Shifts"**

The true value of our Deep Learning refinement is its ability to look past surface-level "buzz" and identify languages based on their underlying technological and industrial DNA. By comparing the Baseline (M3) clusters with our Refined (M4) Deep clusters, we can identify **"Archetype Shifts."**

**Strategic Value for CTOs and Investors:**
* **Hidden Titans:** Languages that appear niche or speculative on the surface but possess the structural profile of an industry standard.
* **False Hype Detection:** Technologies that command high social media "stargazing" but lack the foundational ecosystem support to survive long-term.
* **The Adoption Chasm:** Quantifying how far a speculative language must evolve to become a "Silent Workhorse" or "Mainstream Titan."

In [None]:
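# 1. Identify the "Shifters"
# K-Means label IDs are arbitrary, so baseline cluster 0 need not correspond to deep cluster 0.
# Align the baseline labels to the deep labels by maximum overlap before comparing them.
from scipy.optimize import linear_sum_assignment

overlap = pd.crosstab(df_active['baseline_cluster'], df_active['deep_cluster'])
row_idx, col_idx = linear_sum_assignment(-overlap.values)  # negate counts to maximize overlap
label_map = dict(zip(overlap.index[row_idx], overlap.columns[col_idx]))
df_active['baseline_aligned'] = df_active['baseline_cluster'].map(label_map)

# A language "shifts" only if its aligned baseline archetype differs from its deep archetype
df_active['shifted'] = df_active['baseline_aligned'] != df_active['deep_cluster']
shifters = df_active[df_active['shifted']].copy()

print(f"Total Languages Analyzed: {len(df_active)}")
print(f"Number of Languages that Shifted Archetypes: {len(shifters)}")

# 2. Display a sample of "Shifting" languages for qualitative analysis
# We look at the change from M3 logic to M4 deep logic
print("\n--- Sample of Archetype Shifters (M3 Baseline vs. M4 Deep DNA) ---")
display(shifters[['title', 'baseline_cluster', 'baseline_aligned', 'deep_cluster', 'github_repo_stars', 'number_of_jobs']].head(15))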

# 3. Final Strategic Mapping
# Assigning the business names to our final Deep Clusters for clarity.
# Names follow the interpretation section below; cluster IDs depend on the K-Means run,
# so verify them against the Section 6 profile table before relying on this mapping.
cluster_map = {
    0: "Modern Trendsetters",
    1: "Technical Specialists",
    2: "Market Titans",
    3: "Long-Tail / Speculative"
}
df_active['archetype_name'] = df_active['deep_cluster'].map(cluster_map)

# 4. Save the Refined Results
df_active.to_csv("M4_Refined_Market_Archetypes.csv", index=False)
print("\n✅ Refinement Complete. Data saved to 'M4_Refined_Market_Archetypes.csv'")
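
Because cluster IDs can change between runs, a hard-coded ID-to-name mapping is fragile. One option, sketched below on a made-up profile table, is to derive the names from the cluster means instead. The ranking rule (order clusters by mean job postings) and the toy numbers are assumptions for illustration, not the notebook's actual results; the archetype names are the ones used in the interpretation section.

```python
import pandas as pd

# Made-up cluster profile (mean values per cluster), shaped like the Section 6 table
profiles = pd.DataFrame(
    {"number_of_jobs": [120.0, 15.0, 9000.0, 0.5],
     "github_repo_stars": [20000.0, 800.0, 15000.0, 60.0]},
    index=pd.Index([0, 1, 2, 3], name="deep_cluster"),
)

# Assumed rule: rank clusters by mean job postings, highest first
names_by_rank = ["Market Titans", "Modern Trendsetters",
                 "Technical Specialists", "Long-Tail / Speculative"]
ranked_ids = profiles["number_of_jobs"].sort_values(ascending=False).index
cluster_map = dict(zip(ranked_ids, names_by_rank))

print(cluster_map)
```

A single-metric ranking is a simplification. A rule that also looks at stars or users may separate "Trendsetters" from "Specialists" more reliably.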

----


### **Archetype Branding & Interpretation**

**Cluster 2: The Market Titans (High Utility, High Hype)**

* **Members:** SQL, Java, JavaScript, HTML, HTTP.
* **Interpretation:** These are the infrastructure of the modern web and data world. They have the highest job counts and massive social visibility. They represent the "Safe Bets" for any developer or firm.

**Cluster 0: The Modern Trendsetters (High Hype, Rising Utility)**

* **Members:** Kotlin, TypeScript, Solidity, Elixir.
* **Interpretation:** These are "New-Era" languages. They have high GitHub engagement and are technically robust (captured by the Autoencoder). They are transitioning from "trendy" to "essential," particularly in mobile (Kotlin), web (TypeScript), and Web3 (Solidity).

**Cluster 1: Technical Specialists (Niche Utility)**

* **Members:** Node.js, Liquid, k-framework.
* **Interpretation:** These represent specific ecosystems (e.g., Shopify's Liquid or server-side JS). They aren't general-purpose "Titans," but they have dedicated professional pockets and specific technical DNA that the Autoencoder recognized.

**Cluster 3: The Long-Tail / Speculative (Experimental)**

* **Members:** Remix, Quaint-lang, Preforth.
* **Interpretation:** Even within the "Active" subset, these are languages with low professional demand but some social or technical activity. They represent the "Experimental" fringe of the ecosystem.

### **Refined Problem Statement and Reflection**

**Challenge:** My original analysis was hindered by 97% zero-job data and linear models (PCA) that couldn't distinguish between "Famous" and "Useful" languages.

**Action:** I narrowed the scope to 953 "Active Market" languages and implemented a Deep Learning Autoencoder to extract the "Latent DNA" of the ecosystem.

**Reflection:** The Autoencoder revealed that technical structure is a major predictor of archetype. While Cluster 2 (Titans) is obvious, Cluster 0 (Modern Trendsetters) is the most valuable discovery. These languages (like Kotlin and Solidity) have the technical foundation of Titans but are still in their growth phase.

**Business Value:** This model allows stakeholders to look past "Hype" and see which technologies have the technical robustness to provide long-term professional stability.