# Assignment 04 – Clustering Analysis  
## Google Mergers & Acquisitions – Clustering

### Section 1: Data Preparation & Setup

**Domain / Industry**  

This project uses a dataset of **Google mergers and acquisitions**. Each row represents a single acquisition deal executed by Google.

- **Entity being clustered:** One *acquisition* (target company and deal)  
- **Original label (target):** `country` – the country where the acquired company is based  
- **Why segmentation matters:**  
  For a corporate development or M&A team, it’s not enough to know *where* deals are located. It’s critical to understand distinct **deal segments**:  
  - Small tuck-in US deals vs. large transformative acquisitions  
  - US vs. international expansion plays  
  - Patterns in timing and pricing across regions  

Clustering can reveal natural **deal archetypes** that go beyond simple labels like “US vs. non-US,” which can help refine sourcing, risk management, and integration playbooks.

Below, I load the dataset, explore the structure, handle missing values, engineer features, and standardize the data for K-means.

In [None]:
# Section 1: Data Preparation & Setup

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import (
    accuracy_score, f1_score, confusion_matrix, classification_report,
    silhouette_score, adjusted_rand_score, normalized_mutual_info_score
)

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (8, 5)

# ---------------------------------------------------------------------
# 1.1 Load dataset
# ---------------------------------------------------------------------
DATA_PATH = "data/mergers and acquisitions.csv"  # keep this path in your repo
TARGET_COL = "country"

df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nDtypes:")
print(df.dtypes)
df.head()

In [None]:
# ---------------------------------------------------------------------
# 1.2 Basic exploration
# ---------------------------------------------------------------------
print("Missing values per column:")
print(df.isna().sum())

print("\nTop country counts:")
print(df['country'].value_counts(dropna=False).head(10))

In [None]:
# ---------------------------------------------------------------------
# 1.3 Feature engineering
# We will create numeric features for clustering:
#   - year: derived from 'date'
#   - price_num: numeric version of 'price'
#   - log_price: log(1 + price_num)
#   - is_us: 1 if country == 'United States', else 0
# ---------------------------------------------------------------------

df['date_parsed'] = pd.to_datetime(df['date'], errors='coerce')
df['year'] = df['date_parsed'].dt.year

def parse_price(p):
    if pd.isna(p):
        return np.nan
    if isinstance(p, str):
        p = p.strip()
        if p in ['—', '-', '— ', '']:
            return np.nan
        p = p.replace('$', '').replace(',', '')
        try:
            return float(p)
        except ValueError:
            return np.nan
    return p

df['price_num'] = df['price'].apply(parse_price)

print(df[['date', 'year', 'price', 'price_num']].head())
print("\nMissing in 'year':", df['year'].isna().sum())
print("Missing in 'price_num':", df['price_num'].isna().sum())

In [None]:
# ---------------------------------------------------------------------
# 1.4 Handle missing values for numeric features (for clustering)
# ---------------------------------------------------------------------

df['year'] = df['year'].fillna(df['year'].median())
df['price_num'] = df['price_num'].fillna(df['price_num'].median())

df['log_price'] = np.log1p(df['price_num'])
df['is_us'] = (df['country'] == 'United States').astype(int)

feature_cols = ['year', 'log_price', 'is_us']
print("Features used for clustering:", feature_cols)
display(df[feature_cols].describe())

### Why Standardization is Necessary

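K-means clustering is based on **Euclidean distance**. If one feature is on a much larger scale than another (for example, a raw deal price in dollars next to a 0/1 flag such as `is_us`), it dominates the distance calculation and overwhelms the other variables. The log transform on price narrows this gap but does not remove it, because `year`, `log_price`, and `is_us` still have different ranges and spreads.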

To avoid this, I standardize the features so that each has approximately:

- Mean ≈ 0  
- Standard deviation ≈ 1  

This puts `year`, `log_price`, and `is_us` on a comparable scale, allowing K-means to treat them fairly when forming clusters of acquisition deals.
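
To make the scale argument concrete, here is a minimal sketch on three made-up deals (the values are hypothetical, not rows from the dataset). It compares pairwise Euclidean distances before and after `StandardScaler`; the helper `pairwise` is introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy deals: [year, price in USD, is_us]
toy = np.array([
    [2006.0, 1.65e9, 1.0],
    [2014.0, 3.20e9, 1.0],
    [2014.0, 5.00e7, 0.0],
])

def pairwise(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# On the raw scale the distance is essentially the price gap in dollars;
# the 8-year gap and the is_us flag contribute almost nothing.
raw_dist = pairwise(toy)

# After standardization every column has mean 0 and standard deviation 1,
# so all three features contribute on a comparable scale.
scaled = StandardScaler().fit_transform(toy)
scaled_dist = pairwise(scaled)

print("Raw distance, deal 0 vs deal 1:   ", raw_dist[0, 1])
print("Scaled distance, deal 0 vs deal 1:", scaled_dist[0, 1])
```

On the raw scale, two deals in the same year with the same `is_us` value can still look far apart purely because of price, which is the effect standardization is meant to remove.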

In [None]:
# ---------------------------------------------------------------------
# 1.5 Standardize features for clustering
# ---------------------------------------------------------------------

scaler = StandardScaler()
X_full = df[feature_cols].values
X_scaled = scaler.fit_transform(X_full)

X_scaled[:5]

## Section 2: Labeled Baseline Review

**Original Target Variable: `country`**  

- Each acquisition is labeled by the **country of the acquired company**.  
- This matters because location drives regulatory risk, integration complexity, time zones, and local market dynamics. US vs. non-US deals often require different playbooks.

**Supervised Features**  

For a simple supervised baseline, I predict `country` using:

- `year` – when the deal occurred (captures time trends, e.g., early vs. later waves of expansion)  
- `log_price` – deal size on a log scale (captures magnitude without being dominated by billion-dollar outliers)

This is intentionally a **minimal feature set**. It gives a reasonable baseline but clearly does not capture the full richness of deals (no industry, product, or strategic fit information).

Below I fit a multinomial logistic regression as a simple labeled model.

In [None]:
# Section 2: Labeled Baseline Review

X_supervised = df[['year', 'log_price']].values
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(
    X_supervised, y, test_size=0.2, random_state=42
)

from sklearn.linear_model import LogisticRegression

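# The default lbfgs solver already fits a multinomial model for multiclass targets;
# the explicit multi_class argument is deprecated in recent scikit-learn releases.
clf = LogisticRegression(max_iter=2000)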
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print("Accuracy:", acc)
print("Weighted F1:", f1)

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", cm)

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

### Reflection on the Labeled Model (~150 words)

The supervised model achieves reasonably strong performance given how little information it uses. Accuracy is about **0.80** and the weighted F1 score is about **0.71**, driven largely by correctly identifying **United States** deals, which dominate the dataset. The confusion matrix and classification report show that many smaller countries are misclassified into the US or other more common locations. This makes sense: with only `year` and `log_price`, the model has no direct signal about geography.

From a business standpoint, this labeled approach is limited. It tells us whether a deal is likely US or non-US, but it does **not** differentiate:

- Very large “headline” acquisitions vs. smaller tuck-in deals  
- Different strategic roles of international targets  
- Subsegments within US deals with different risk/return profiles  

These gaps motivate clustering: instead of forcing everything into a fixed set of country labels, we can let the data reveal natural **deal segments** that may cut across geography and highlight more actionable patterns for M&A strategy.
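
Because one class dominates, the accuracy and weighted F1 reported above are easier to judge next to a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the 80% US share is an assumption for illustration, not a figure taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 8 of 10 toy deals are US-based
y_toy = np.array(["United States"] * 8 + ["Canada", "Germany"])
X_toy = np.zeros((len(y_toy), 2))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, y_hat)
# zero_division=0 silences warnings for classes that are never predicted
baseline_f1 = f1_score(y_toy, y_hat, average="weighted", zero_division=0)

print("Majority-class accuracy:", baseline_acc)
print("Majority-class weighted F1:", round(baseline_f1, 3))
```

Running the same comparison on the real train/test split would show how much of the labeled model's score comes from `year` and `log_price` rather than from class imbalance alone.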

In [None]:
# Section 3: Optimal K Selection

wcss = []
silhouette_scores = []
K_range = range(2, 11)  # 2 to 10

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
    labels_k = kmeans.labels_
    sil = silhouette_score(X_scaled, labels_k)
    silhouette_scores.append(sil)

# 3.1 Elbow plot
plt.figure()
plt.plot(list(K_range), wcss, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Within-Cluster Sum of Squares (Inertia)")
plt.title("Elbow Method for Optimal K")
plt.show()

# 3.2 Silhouette plot
plt.figure()
plt.plot(list(K_range), silhouette_scores, marker='o')
plt.xlabel("Number of clusters (K)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores vs. K")
plt.show()

list(zip(K_range, silhouette_scores))

### 3.1 Elbow Method – Interpretation

The elbow method plots the within-cluster sum of squares (WCSS) against K. As K increases, WCSS falls, but the *rate* of improvement slows. The “elbow” is the point where adding extra clusters buys relatively little extra cohesion.

In this dataset, the curve bends noticeably around **K ≈ 3**, after which WCSS still decreases but with diminishing returns.
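
For reference, the quantity on the y-axis of the elbow plot (scikit-learn's `inertia_`) can be written as follows, where $C_j$ is the set of deals assigned to cluster $j$ and $\mu_j$ is that cluster's centroid in the standardized feature space. The symbols are notation introduced here, not taken from the notebook.

$$
\mathrm{WCSS}(K) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

Adding a cluster can only keep this sum the same or lower it, which is why the curve is read for its bend rather than for its minimum.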

### 3.2 Silhouette Score – Interpretation

The silhouette score measures how well each point fits into its assigned cluster compared with other clusters, with values from -1 to 1 (higher is better). For this dataset, the silhouette score peaks around **K = 3** and remains competitive for slightly larger K values, but with no big improvement.

This suggests that **K = 3** provides a good balance between cluster cohesion and separation for this M&A dataset.
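
As a sketch of what the silhouette score computes, the block below evaluates it by hand on four made-up points and checks the result against `silhouette_samples`. The helper `silhouette_manual` is hypothetical and assumes every cluster has at least two members.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two tight pairs of points, far apart from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_manual(X, labels):
    """Per-point silhouette: (b - a) / max(a, b).

    a = mean distance to the other points in the same cluster
    b = mean distance to the points of the nearest other cluster
    Assumes no single-member clusters.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster mean
        a = dist[i, same].mean()
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_manual(X, labels)
reference = silhouette_samples(X, labels)
print("Manual:   ", manual.round(3))
print("sklearn:  ", reference.round(3))
print("Mean score:", manual.mean().round(3))
```

The overall `silhouette_score` used in the K sweep is the mean of these per-point values.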

### 3.3 K Selection Decision (100–150 words)

Both the elbow and silhouette analyses point toward **K = 3** as a strong choice. The elbow plot shows a clear bend near three clusters, where additional clusters yield smaller reductions in within-cluster variance. The silhouette plot also reaches a high point at K = 3, indicating well-separated and cohesive clusters in the standardized feature space.

While slightly larger K values (e.g., 7 or 8) show decent silhouette scores, they would fragment the deal universe into many small segments that are harder to interpret and act on. For a corporate development or investment banking context, **3 segments** is a practical number: large enough to capture meaningful differences in deal profile, but small enough that a deal team can remember and operationalize them.

Therefore, I choose **K = 3** as the final number of clusters for the K-means model.

In [None]:
# Section 4: K-Means Clustering

FINAL_K = 3

kmeans_final = KMeans(n_clusters=FINAL_K, random_state=42, n_init=10)
cluster_labels = kmeans_final.fit_predict(X_scaled)

df['cluster'] = cluster_labels

print(df['cluster'].value_counts())
print(df['cluster'].value_counts(normalize=True) * 100)

In [None]:
# 4.2 Cluster Characterization

cluster_sizes = df['cluster'].value_counts().sort_index()
cluster_percentages = df['cluster'].value_counts(normalize=True).sort_index() * 100

cluster_summary = pd.DataFrame({
    'count': cluster_sizes,
    'percentage': cluster_percentages.round(2)
})

feature_summary = df.groupby('cluster')[feature_cols].agg(['mean', 'median'])

display(cluster_summary)
display(feature_summary)

### 4.2 Cluster Characterization

From the summary:

- **Cluster 0:** Smallest group. Very high `log_price` and a high share of US deals. These are **mega-size, mostly US deals**, the big headline acquisitions.  
- **Cluster 1:** Medium-size group. `is_us` is 0 for all rows, with moderate deal size. This cluster is **non-US expansion deals**, spread across many countries.  
- **Cluster 2:** Largest group. Nearly all `is_us = 1` with mid-sized prices. These are **core US tuck-in acquisitions**, relatively standardized in size and geography.

Distinguishing factors:

- `log_price` clearly separates **Cluster 0** (billion-dollar scale) from the other two.  
- `is_us` perfectly separates **Cluster 1** (international) from **Cluster 2** (domestic US).  
- `year` is similar across clusters but **Cluster 0** trends slightly later on average, consistent with later-stage mega deals.
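
Cluster centroids live in standardized units, which are hard to read as years or dollars. One possible way to translate them back is `scaler.inverse_transform` followed by `np.expm1` to undo the `log1p` step. The sketch below uses randomly generated stand-in data, so the printed numbers are illustrative only; `toy`, `toy_scaler`, and `toy_km` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df[feature_cols]; not the real M&A data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "year": rng.integers(2001, 2021, size=60).astype(float),
    "log_price": rng.normal(18.0, 2.0, size=60),
    "is_us": rng.integers(0, 2, size=60).astype(float),
})

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy.values)
toy_km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(toy_scaled)

# Map centroids from standardized space back to the original feature units
centers = pd.DataFrame(
    toy_scaler.inverse_transform(toy_km.cluster_centers_),
    columns=toy.columns,
)
# Undo log1p to express the centroid deal size in dollars
centers["price_usd"] = np.expm1(centers["log_price"])

print(centers.round(2))
```

With the notebook's own objects, the same pattern would apply to `scaler` and `kmeans_final`, giving a centroid table that complements the per-cluster means and medians.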

In [None]:
# 4.3 Representative Examples

cols_to_show = ['date', 'acquried_company', 'acquring_company',
                'business', 'country', 'price', 'year', 'price_num', 'cluster']

for c in sorted(df['cluster'].unique()):
    print(f"\nCluster {c} – Representative Deals")
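    # head(3) takes the first three rows of each cluster in file order:
    # a quick sample, not necessarily the deals closest to the centroid
    reps = df[df['cluster'] == c].head(3)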
    display(reps[cols_to_show])

### 4.3 Representative Examples

- **Cluster 0 (Mega US strategic bets):** Representative rows show very large price tags (hundreds of millions to billions), often US-based targets with high-profile products.  
- **Cluster 1 (International expansion plays):** Examples include acquisitions in Europe, Canada, and other non-US markets at moderate deal sizes.  
- **Cluster 2 (Core US tuck-ins):** Rows show mid-sized US companies acquired to strengthen existing Google products or teams.

These representative deals make the statistical clusters more concrete and relatable to a corporate development or banking audience.
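
An alternative way to pick representative deals is to take the rows closest to each centroid rather than the first rows in the file. The sketch below shows the idea on synthetic 2D points, using `KMeans.transform`, which returns each point's distance to every centroid. The data and the helper `closest_rows` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated blobs of 20 points each
rng = np.random.default_rng(1)
X_toy = np.vstack(
    [rng.normal(loc, 0.3, size=(20, 2)) for loc in (-3.0, 0.0, 3.0)]
)

km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_toy)

# transform() gives the distance from every point to every centroid;
# keep only the distance to the point's own assigned centroid
dist_to_centers = km.transform(X_toy)
own_dist = dist_to_centers[np.arange(len(X_toy)), km.labels_]

def closest_rows(cluster_id, n=3):
    """Row positions of the n points nearest to the given cluster's centroid."""
    members = np.flatnonzero(km.labels_ == cluster_id)
    return members[np.argsort(own_dist[members])[:n]]

representatives = {c: closest_rows(c) for c in range(km.n_clusters)}
for c, rows in representatives.items():
    print(f"Cluster {c}: rows {rows.tolist()}")
```

In the notebook, the returned positions could be passed to `df.iloc[...]` to display the corresponding deals.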

In [None]:
# Section 5: PCA Visualization

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", pca.explained_variance_ratio_.sum())

### 5.1 What PCA Does and Why It’s Useful

Principal Component Analysis (PCA) compresses the original features into a smaller number of new, uncorrelated components that capture most of the variance in the data. Here, PCA reduces the three standardized features (`year`, `log_price`, and `is_us`) into two principal components.

This is useful because it lets us **visualize** the clusters in 2D while still preserving most of the structure in the M&A dataset. With only three input features, the projection discards a single direction of variance. If clusters remain separated in the 2D plot, the segmentation shows up in the dominant directions of variance rather than depending on the one direction PCA drops.
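
To see which original features drive each principal component, one option is to inspect the loadings stored in `pca.components_`. The sketch below does this on randomly generated stand-in data, so the specific loadings are illustrative only; `toy` and `toy_pca` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the three clustering features
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "year": rng.integers(2001, 2021, size=80).astype(float),
    "log_price": rng.normal(18.0, 2.0, size=80),
    "is_us": rng.integers(0, 2, size=80).astype(float),
})

toy_scaled = StandardScaler().fit_transform(toy.values)
toy_pca = PCA(n_components=2).fit(toy_scaled)

# Each row is one component; each entry is the weight of an original feature
loadings = pd.DataFrame(
    toy_pca.components_, columns=toy.columns, index=["PC1", "PC2"]
)
variance_kept = toy_pca.explained_variance_ratio_.sum()

print(loadings.round(2))
print("Variance kept by two components:", round(variance_kept, 3))
```

A large absolute loading for `log_price` on a component would support reading that axis as "deal size" in the scatter plot.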

In [None]:
# 5.2 2D Visualization of Clusters

plt.figure()
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['cluster'], alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA (2D) – K-Means Clusters of Google M&A Deals")
plt.legend(*scatter.legend_elements(), title="Cluster")
plt.show()

### 5.3 PCA Interpretation (100–150 words)

The PCA plot shows three clearly differentiated regions in the first two principal components, which capture most of the variance in the standardized features. One region corresponds to **Cluster 0**, which sits far in the direction associated with high `log_price`, highlighting its role as a small group of mega-deals. Two other regions represent **US mid-sized deals** and **non-US deals**, separated primarily by the `is_us` dimension.

There is some overlap at the boundaries, which is expected given that deal size and timing are continuous. Still, the visualization confirms that the 3-cluster solution aligns with intuitive axes: deal size and geography, with smaller variation along the time dimension. The 2D representation captures the main structure of the clusters reasonably well and supports the idea that these segments are meaningful in practice.

In [None]:
# Section 6: Compare Clusters to Original Labels

crosstab = pd.crosstab(df['cluster'], df[TARGET_COL])
crosstab

In [None]:
plt.figure(figsize=(10, 5))
plt.imshow(crosstab, aspect='auto')
plt.colorbar(label='Count')
plt.xticks(range(len(crosstab.columns)), crosstab.columns, rotation=90)
plt.yticks(range(len(crosstab.index)), crosstab.index)
plt.xlabel(TARGET_COL)
plt.ylabel("Cluster")
plt.title("Cluster vs. Country (Original Label)")
for (i, j), val in np.ndenumerate(crosstab.values):
    plt.text(j, i, int(val), ha='center', va='center')
plt.tight_layout()
plt.show()

In [None]:
# Alignment metrics

ari = adjusted_rand_score(df[TARGET_COL], df['cluster'])
nmi = normalized_mutual_info_score(df[TARGET_COL], df['cluster'])

print("Adjusted Rand Index (ARI):", ari)
print("Normalized Mutual Information (NMI):", nmi)

### 6.2 Interpretation (200–250 words)

The crosstab shows a strong link between clusters and the original `country` label. **Cluster 2** is almost entirely composed of **United States** deals, while **Cluster 1** is purely non-US (plus a small number of “—” entries), capturing Google’s international expansion. **Cluster 0** is mostly US but includes a few non-US deals, representing the rare, very large acquisitions regardless of location.

The alignment metrics (Adjusted Rand Index and Normalized Mutual Information) indicate substantial agreement between clusters and countries, but not perfect overlap. Part of that agreement is by construction, because the `is_us` feature is derived from `country`. The gap is the more informative part: the clustering reveals an additional dimension – **deal size** – that cuts across geography. Mega-deals (Cluster 0) are mostly US but not exclusively, and they differ enough in price to justify their own segment.
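
As a rough guide to reading these two metrics, the sketch below computes them on a handful of made-up labels. It also compares the clusters against a binary US/non-US grouping, which may be a fairer reference than the full country list when K is small. All labels here are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical labels for ten toy deals
countries = ["United States"] * 6 + ["Canada"] * 2 + ["Germany"] * 2
clusters = [2, 2, 2, 2, 0, 0, 1, 1, 1, 0]

# Agreement with the full country label
ari_country = adjusted_rand_score(countries, clusters)
nmi_country = normalized_mutual_info_score(countries, clusters)

# Agreement with a coarser US / non-US grouping
is_us = [c == "United States" for c in countries]
ari_us_flag = adjusted_rand_score(is_us, clusters)

# Both metrics ignore label names: a pure relabelling scores 1.0
ari_relabelled = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])

print("ARI vs country:", round(ari_country, 3))
print("NMI vs country:", round(nmi_country, 3))
print("ARI vs US flag:", round(ari_us_flag, 3))
print("ARI for relabelled identical partitions:", ari_relabelled)
```

ARI is adjusted for chance, so values near 0 mean agreement no better than random assignment, while NMI ranges from 0 to 1.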

From a business perspective, this suggests changing how deals are categorized. Instead of viewing everything as simply “US vs. international,” a more useful view is:  
1) US core tuck-ins,  
2) international expansion, and  
3) mega strategic bets.  

Each of these has different risk/return trade-offs, integration requirements, and coverage expectations for bankers or corp dev teams.

## Section 7: Segment Personas & Action Plans

### Cluster 0 – “Mega US Strategic Bets” (Persona ~200–250 words)

This segment represents Google’s **rare, mega-size acquisitions**, typically with billion-dollar-level price tags and a strong bias toward US targets. A typical deal in this cluster is a company that materially shifts Google’s product roadmap, competitive position, or platform footprint. These are high-visibility, board-level decisions with heavy involvement from senior leadership, extensive regulatory scrutiny, and complex integration challenges.

Deals in this cluster often deliver access to massive user bases, core technology platforms, or entirely new business lines. The financial stakes are high, so diligence is deep, modeling is detailed, and downside risks must be carefully managed. Integration timelines are long, touching brand, product, engineering, and culture. From an investment banking or corporate development standpoint, these are the “once-every-few-years” transactions that define strategic narratives and investor expectations, rather than incremental improvements.

### Cluster 1 – “International Expansion Plays” (Persona ~200–250 words)

Cluster 1 covers **non-US acquisitions** at moderate deal sizes. A typical deal is a target in Europe, Canada, Israel, or other markets, often purchased to extend Google’s footprint into local ecosystems, acquire specialized teams, or secure technology tailored to regional needs. These deals are big enough to matter but not so large that they become company-defining transactions.

Key characteristics include geographic diversity, moderate prices, and an emphasis on local market knowledge, regulatory familiarity, or specialized capabilities (e.g., mapping data, payments, infrastructure, or AI talent in specific hubs). Integration risk often revolves around cross-border legal issues, data regulations, and bridging cultural differences. For a deal team, this segment represents a steady stream of international opportunities requiring strong local advisors and repeatable playbooks.

### Cluster 2 – “Core US Tuck-In Acquisitions” (Persona ~200–250 words)

Cluster 2 is the **workhorse segment**: mid-sized, US-based acquisitions that support existing Google products, platforms, or capabilities. A typical deal here is an engineering-heavy company or niche product that slots into an existing business unit such as Google Cloud, Maps, or Ads. Deal sizes are meaningful but far below the mega-deal level, making these transactions more common and operationally manageable.

These tuck-ins expand feature sets, accelerate roadmaps, and bring in specialized talent. Integration is usually simpler than in mega deals, focused on aligning technology stacks, migrating users, and integrating teams into Google’s culture and processes. From a banking or corp dev lens, this segment is where most of the **volume** lives: repeatable, pattern-driven deals that rely on efficient sourcing, standardized diligence, and fast execution to maintain Google’s competitive edge in key verticals.

### 7.2 Action Plans (≈150 words per cluster)

**Cluster 0 – “Mega US Strategic Bets” – Action Plan (~150 words)**  
For this segment, the organization should apply its **most rigorous strategic and financial discipline**. Actions include: building robust scenario models, conducting extensive regulatory and antitrust analysis, and ensuring board-level alignment before making offers. Integration plans should be developed in parallel with the deal, including leadership assignments, brand strategy, and communication plans for employees and investors. From a coverage standpoint, bankers or corp dev teams should maintain a short, carefully curated list of potential mega targets and periodically revisit them as markets evolve.

**Cluster 1 – “International Expansion Plays” – Action Plan (~150 words)**  
For international deals, the focus should be on **local expertise and repeatable cross-border processes**. Google should cultivate relationships with local advisors and regulators, develop regional integration playbooks, and standardize how it evaluates political, currency, and regulatory risk. Post-close, success depends on retaining key talent and respecting local market nuances while still integrating into the global platform. Prioritization should emphasize markets where Google’s strategic goals—such as cloud expansion, payments, or AI—overlap with strong local ecosystems.

**Cluster 2 – “Core US Tuck-Ins” – Action Plan (~150 words)**  
For US tuck-in acquisitions, speed and efficiency are crucial. Google should maintain ongoing **pipeline sourcing** from VCs, founders, and internal product teams who identify gaps that could be filled via acquisition. Standardized diligence templates, integration checklists, and playbooks should be in place so these deals can move quickly from term sheet to integration. Success metrics can focus on time-to-integration, feature adoption, and retention of critical talent. This segment is ideal for a more systematized, high-throughput M&A engine.

## Section 8: Reflection (200–300 words)

The most surprising aspect of the clustering is how cleanly the deals separate into three intuitive segments using only **year, price, and US vs. non-US status**. Without any detailed product or financial metrics, the algorithm still isolates mega-deals, international expansion plays, and core US tuck-ins. This aligns closely with how practitioners informally talk about “big bets” versus “tuck-ins,” but the clustering gives that intuition a clear, data-driven structure.

The clusters partially overlap with my prior expectations for the tech M&A space, especially the dominance of US deals and the presence of a distinct set of non-US acquisitions. The added value is the explicit, quantitative recognition of mega-deals as their own segment, which cuts across geography and deserves different governance and integration approaches.

From a business standpoint, this segmentation supports better **resource allocation**, more tailored **integration playbooks**, and clearer communication between corporate development, leadership, and external advisors. Limitations include the small dataset, heavy missingness in price, and the fact that important drivers (like revenue, profitability, or strategic fit) are not directly observed. Clustering is also sensitive to feature choices and the selected K.

In my future work—especially in investment banking or M&A advisory—I could use clustering to segment potential targets, client portfolios, or historical transactions, helping to prioritize outreach, refine valuation comps, and build more nuanced narratives around deal pipelines.