# Comprehensive Research: Hybrid Transfer Learning

## 1. Environment & Concept
**Objective**: Use deep-learning features without paying the cost of training a deep network.
**Method**: VGG16 (frozen) -> feature-vector extraction -> PCA -> XGBoost.
**Hypothesis**: A pre-trained CNN has learned generic visual patterns (edges, textures) that remain useful on datasets it was never trained on.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Set Style
sns.set(style="darkgrid")

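# --- 1. MOCK DATA GENERATION ---
# In real research, the images would be loaded from disk (e.g. with `flow_from_directory`)
# and passed through the feature extractor. Here we only simulate the extracted features.
# Class 0: vectors centered at -1. Class 1: vectors centered at +1 (std 2 in every dimension).
def make_mock_embeddings(n=200, dim=4096, seed=42):
    rng = np.random.default_rng(seed)  # seeded so every run produces the same data
    X0 = rng.normal(-1, 2, (n // 2, dim))
    X1 = rng.normal(1, 2, (n // 2, dim))
    X = np.vstack([X0, X1])
    # Integer labels, so reports and classifiers see classes 0 and 1 rather than 0.0 and 1.0
    y = np.hstack([np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)])
    return X, y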

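# Build the VGG16 extractor for reference. It is not run on images in this notebook:
# the features used below are simulated. `include_top=True` is required because
# 'fc1' belongs to the fully connected head.
base_model = VGG16(weights='imagenet', include_top=True)
# Cut the network at 'fc1', the first fully connected layer (4096-dimensional output),
# which sits before the final 1000-class prediction layer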
feature_extractor = Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)
feature_extractor.summary()

print("\nSimulating Feature Extraction Step...")
X_features, y = make_mock_embeddings()
print(f"Extracted Feature Matrix: {X_features.shape}")
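
The mock generator above stands in for the real extraction step. A minimal sketch of that step, assuming the images are already loaded into an `(n, H, W, 3)` array, might look like the following. The helper name `extract_in_batches` and its parameters are hypothetical, and the demo uses a stand-in function instead of VGG16 so that it runs without downloading weights.

```python
import numpy as np

def extract_in_batches(images, predict_fn, preprocess_fn=None, batch_size=32):
    """Run `predict_fn` over `images` in batches and stack the outputs.

    images:        array of shape (n, H, W, C)
    predict_fn:    callable mapping a batch to an (m, d) feature array
    preprocess_fn: optional callable applied to each batch first
    """
    chunks = []
    for start in range(0, len(images), batch_size):
        # Copy as float32 so in-place preprocessing never touches the caller's data
        batch = np.array(images[start:start + batch_size], dtype="float32")
        if preprocess_fn is not None:
            batch = preprocess_fn(batch)
        chunks.append(np.asarray(predict_fn(batch)))
    return np.vstack(chunks)

# Stand-in extractor (an assumption for the demo): mean colour per image, 3 features
def fake_predict(batch):
    return batch.mean(axis=(1, 2))

demo_images = np.random.default_rng(0).integers(0, 256, size=(70, 8, 8, 3))
demo_features = extract_in_batches(demo_images, fake_predict, batch_size=32)
print(demo_features.shape)  # (70, 3)
```

With the objects defined in this notebook, the call would be `extract_in_batches(imgs, feature_extractor.predict, preprocess_input)`, where `imgs` is a hypothetical array of RGB images resized to 224x224, the input size VGG16 expects when `include_top=True`.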

## 2. EDA: Zero-Shot Visualization
Before training a classifier, let's check whether the frozen, pre-trained features already separate the classes. We use t-SNE to project the 4096-dimensional vectors down to 2 dimensions for plotting.

In [None]:
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_embedded = tsne.fit_transform(X_features)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_embedded[:,0], y=X_embedded[:,1], hue=y, palette="viridis")
plt.title("t-SNE of VGG16 Features (Zero-Shot)")
plt.show()

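**Observation**: If the clusters are partially separable but overlap, the pre-trained features carry class signal, yet a classifier on top is still needed to draw the boundary. Note that the vectors plotted here are simulated, so this plot demonstrates the workflow rather than VGG16's actual behavior. t-SNE also distorts global distances, so treat apparent gaps or overlap as a hint, not a measurement.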
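
A t-SNE plot is a qualitative check. One way to put a number on the same question is the silhouette score computed in the original feature space, which ranges from -1 to 1 and is higher when samples sit closer to their own class than to the other. The sketch below is an illustration on simulated vectors (with a smaller `dim` than the notebook, chosen for speed); it is not a measurement of real VGG16 features.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n_per_class, dim = 100, 256  # assumed sizes, smaller than the notebook's 4096 for speed

# Same recipe as the mock embeddings: class means at -1 and +1, std 2
X_demo = np.vstack([
    rng.normal(-1, 2, (n_per_class, dim)),
    rng.normal(1, 2, (n_per_class, dim)),
])
y_demo = np.repeat([0, 1], n_per_class)

# Baseline: shuffled labels show what "no class structure" scores
y_shuffled = rng.permutation(y_demo)

score_true = silhouette_score(X_demo, y_demo)
score_shuffled = silhouette_score(X_demo, y_shuffled)
print(f"silhouette (true labels):     {score_true:.3f}")
print(f"silhouette (shuffled labels): {score_shuffled:.3f}")
```

A score for the true labels that is clearly above the shuffled baseline suggests the features separate the classes before any classifier is trained.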

## 3. Dimensionality Reduction (PCA)
With 4096 features and only 200 samples, there are far more dimensions than data points, a classic "curse of dimensionality" setting in which XGBoost can overfit. We compress the features first.

In [None]:
pca = PCA(n_components=0.95) # Keep 95% of variance
X_pca = pca.fit_transform(X_features)
print(f"Compressed shape: {X_pca.shape} (Retained {pca.n_components_} components)")

plt.figure(figsize=(6, 4))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Variance Explained')
plt.grid()
plt.show()
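
Passing a float to `n_components` asks PCA for the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below reproduces that count by hand from the cumulative curve. It also shows a hard ceiling: after centering, `n` samples span at most `n - 1` directions, so 200 samples can never yield more than 199 informative components, however many input dimensions there are. The data and sizes here are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(0, 2, (200, 512))  # assumed: 200 samples, 512 features

# Full PCA: keeps min(n_samples, n_features) = 200 components
full = PCA(svd_solver="full").fit(X_demo)
cumvar = np.cumsum(full.explained_variance_ratio_)

# Smallest k whose cumulative variance exceeds 95%
k_manual = int(np.searchsorted(cumvar, 0.95, side="right")) + 1

# The same threshold expressed through the float shortcut
k_sklearn = PCA(n_components=0.95, svd_solver="full").fit(X_demo).n_components_

# Centering removes one degree of freedom, so the last component carries ~0 variance
last_is_empty = full.explained_variance_ratio_[-1] < 1e-12
print(k_manual, k_sklearn, last_is_empty)
```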

## 4. Hybrid Training (XGBoost)
Now we train the gradient booster on the compressed vectors.

In [None]:
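# Note: X_pca came from a PCA fitted on all samples, so the test rows influenced the
# projection. That is tolerable for this mock demo; on real data, split first and fit
# PCA on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, random_state=42, stratify=y  # keep classes balanced
)

# `use_label_encoder` is deprecated in recent XGBoost releases and no longer needed
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)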

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
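
When PCA is fitted before the train/test split, the test rows help shape the projection. A common way to avoid that is to split the raw features first and wrap PCA and the classifier in a scikit-learn `Pipeline`, so that PCA is fitted on the training rows only. The sketch below is one possible arrangement, not the notebook's method: the data is simulated, the sizes are assumptions, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in booster to keep the example light on dependencies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n_per_class, dim = 100, 256  # assumed sizes for a quick demo
X_raw = np.vstack([
    rng.normal(-1, 2, (n_per_class, dim)),
    rng.normal(1, 2, (n_per_class, dim)),
])
y_raw = np.repeat([0, 1], n_per_class)

# Split the raw features first, keeping the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

pipe = Pipeline([
    ("pca", PCA(n_components=0.95)),
    ("clf", GradientBoostingClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)  # PCA sees only the 160 training rows

test_accuracy = pipe.score(X_te, y_te)
n_fit_rows = pipe.named_steps["pca"].n_samples_
print(f"PCA fitted on {n_fit_rows} rows, test accuracy {test_accuracy:.2f}")
```

An `XGBClassifier` can take the place of the `clf` step, since it follows the scikit-learn estimator interface.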

## 5. Failure Analysis (Correct vs Error)
In image research, it pays to inspect the images the model got wrong. Since we operate on vectors here, we instead look for "high-confidence errors": test samples that were misclassified with a predicted probability far from 0.5.

In [None]:
probs = model.predict_proba(X_test)[:, 1]
mistakes = np.where(y_test != y_pred)[0]

print(f"Total Mistakes: {len(mistakes)}")
if len(mistakes) > 0:
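    # Pick the most confident mistake: the one whose probability is farthest from 0.5
    idx = mistakes[np.argmax(np.abs(probs[mistakes] - 0.5))]
    print(f"Most confident mistake, test index: {idx}")
    print(f"True Label: {y_test[idx]}, Predicted P(class 1): {probs[idx]:.4f}")
    print("This indicates a 'Hard Sample' that looks like the other class in VGG Space.")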
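
To look at the actual images behind the errors, each test row has to be traced back to its source file. The sketch below ranks misclassified samples by how confident the model was in the wrong class and returns their identifiers. The function name `confident_errors`, the file names, and the probabilities are all made up for illustration.

```python
import numpy as np

def confident_errors(y_true, probs, sample_ids, threshold=0.5):
    """List (sample_id, true_label, prob_class1) for each error, most confident first."""
    y_true = np.asarray(y_true).astype(int)
    probs = np.asarray(probs, dtype=float)
    y_pred = (probs >= threshold).astype(int)
    wrong = np.flatnonzero(y_pred != y_true)
    # Confidence the model placed in the (wrong) class it predicted
    conf = np.where(y_pred[wrong] == 1, probs[wrong], 1.0 - probs[wrong])
    order = wrong[np.argsort(-conf, kind="stable")]
    return [(sample_ids[i], int(y_true[i]), float(probs[i])) for i in order]

# Hypothetical test-set identifiers, labels, and predicted P(class 1)
ids = ["img_003.jpg", "img_017.jpg", "img_042.jpg", "img_058.jpg"]
y_true_demo = [0, 1, 1, 0]
probs_demo = [0.92, 0.40, 0.85, 0.55]

errors = confident_errors(y_true_demo, probs_demo, ids)
for sample_id, label, p in errors:
    print(f"{sample_id}: true={label}, P(class 1)={p:.2f}")
```

In the notebook's flow, the identifiers could be obtained by passing an extra array such as `np.arange(len(y))` to `train_test_split` alongside `X_pca` and `y`. It is split with the same shuffle, so the returned test indices line up with `y_test`.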