# 🧠📚 JobInterviewGuide_Workshop

![Online job interview](./images/OnlineJobInterview.png)



## 🎯 Learning Objectives
- Practice **AI-mediated interview preparation** with a mix of *technical* and *behavioral/scenario-based* questions.
- Reinforce core ML topics from the course through **targeted exercises**.
- Produce a **graded, personalized study notebook** tailored to your quiz results.
- Demonstrate professionalism with **clear documentation** and **version control** (GitHub).



## 📘 Topics Covered (Review Scope)
1. **Supervised vs. Unsupervised** learning algorithms  
2. **Dependent vs. Independent Variables**  
3. **Train / Validation / Test Split** (data leakage, stratification)  
4. **Linear Regression**: residuals, linearization  
5. **Regression Analysis**: parametric vs. non-parametric, **R²**, **MSE**  
6. **Logistic Regression**: intercept, slope, **cross-entropy**  
7. **K-Nearest Neighbors (KNN)**: hyperparameters  
8. **Decision Trees**: leaf nodes & predictions  



## 🧭 Workflow (What You Will Do)
1. **Collect materials**  
   - Copy the workshop folders: `DataStreamVisualization_Workshop`, `LinearRegressionArchitecture_Workshop`, `PerformanceMetricsClassification`, `KNearestNeighbors_Workshop`, `LogisticRegressionClassifier`  
   - Copy the study guide from the course shell and save as **`StudyGuide.txt`**.  
   - ZIP all *workshop* `.ipynb` files into **`StudyMaterials.zip`**.

2. **Open a new LLM session** (model of your choice).  
   Upload **`StudyGuide.txt`** and **`StudyMaterials.zip`**.

3. **Run the prompt** from the section below.  
   - The LLM will act as **interviewer + evaluator + tutor**.  
   - It will **analyze materials**, **create a 500-word content summary**, **create a 100-word interview-topic summary**, and **check coverage vs. gaps**.  
   - It will then **start a 15-question quiz** (one question at a time) **as soon as you're ready.**
   - After the quiz, it will generate **JobInterviewGuide_Workshop.ipynb** tailored to your gaps.  
   - In this notebook (the one you're reading), you'll **record your results** and complete **extra practice**.

4. **Submit deliverable**  
   - Push this notebook (updated with your results & exercises) to your GitHub repo.
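
Step 1 of the workflow above asks for a ZIP of the workshop notebooks. The following is a minimal sketch of one way to build it, assuming the workshop folders sit under a common parent directory. The `zip_notebooks` helper and its `root` argument are hypothetical names used for illustration, not part of the course materials.

```python
import zipfile
from pathlib import Path


def zip_notebooks(root, out_name="StudyMaterials.zip"):
    """Collect every .ipynb under `root` into one ZIP and return the archived names."""
    root = Path(root)
    archived = []
    with zipfile.ZipFile(root / out_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for nb in sorted(root.rglob("*.ipynb")):
            # Skip Jupyter autosave copies so only the real notebooks are uploaded
            if ".ipynb_checkpoints" in nb.parts:
                continue
            arcname = nb.relative_to(root).as_posix()
            zf.write(nb, arcname)
            archived.append(arcname)
    return archived
```

Calling `zip_notebooks(".")` from the parent directory would write `StudyMaterials.zip` there and return the list of notebooks it contains, which is worth a quick look before uploading.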



## 📝 Copy-Paste Prompt for Your LLM Session
> Paste the following prompt into your LLM after uploading `StudyGuide.txt` and `StudyMaterials.zip`.

```text
You are a seasoned Data Scientist, Machine Learning Engineer, and technical interviewer.
I am a Data Scientist and ML Engineer, fresh out of college. You will interview me for an ML Specialist role.

1) Unzip and read StudyMaterials.zip. Understand the workshop notebooks it contains. Produce a **500-word summary** of the ML learning content and coding patterns.
2) Read StudyGuide.txt. Produce a **100-word summary** of interview topics emphasized.
3) **Match** the study guide topics to the workshop materials. Create a **table** listing each topic, whether it is covered by the materials, and any **gaps**.
4) Create **15 multiple-choice questions** (A–E) spanning: supervised vs. unsupervised, variables, train/val/test, linear & logistic regression (R², MSE, cross-entropy), KNN (hyperparams), decision trees (leaf nodes/predictions), plus **scenario-based/behavioral** items (e.g., imbalanced data, data leakage, model choice trade-offs). Ask **one question at a time**. After I answer all, **score me**.
5) Based on questions I get wrong, generate a **new Jupyter Notebook** named **JobInterviewGuide_Workshop.ipynb** inside a folder **JobInterviewGuide_Workshop**. Include:
   - Clear **Markdown explanations** of weak topics
   - **Python code scaffolding** with exercises and TODOs
   - Small, realistic examples and sanity checks
   - A short **reflection** prompt about what I learned
   - Use the style and structure of the workshop notebooks in the zip as inspiration.
Stop here and **wait for my command to start the quiz**.
```

> When the LLM is ready, tell it to **begin the quiz**.



## 🧮 Grading Rubric
| Component | Description | Weight |
|---|---|---|
| **Quiz Performance** | Score across 15 technical + scenario/behavioral questions | **40%** |
| **Generated Study Notebook Quality** | Accuracy, completeness, clarity, and relevance of the personalized notebook | **40%** |
| **Reflection** | Insightful, concise self-assessment in the Reflection section of this notebook | **20%** |
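
To see how the weights in the table combine, here is a small sketch. It assumes each component is scored on a 0-100 scale and combined as a weighted sum. The rubric does not state the exact procedure, so treat both the scale and the `weighted_grade` helper as assumptions.

```python
# Weights copied from the rubric table above (they sum to 1.0)
RUBRIC_WEIGHTS = {"quiz": 0.40, "notebook": 0.40, "reflection": 0.20}


def weighted_grade(scores):
    """Combine per-component scores (each assumed to be 0-100) into one 0-100 grade."""
    missing = set(RUBRIC_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(RUBRIC_WEIGHTS[name] * scores[name] for name in RUBRIC_WEIGHTS)


# Hypothetical component scores, for illustration only
example = {"quiz": 80, "notebook": 90, "reflection": 70}
print(round(weighted_grade(example), 1))
```

Because the quiz and the generated notebook carry equal weight, a strong study notebook can offset a weak quiz score.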



## ✅ Record Your Results (Complete After Quiz)
> Fill in the fields below using your quiz report from the LLM.


In [None]:

# 👉 Enter your results here after completing the quiz in your LLM session.
quiz_results = {
    "name": "",                  # Your name
    "date": "",                  # YYYY-MM-DD
    "model_used": "",            # e.g., GPT, Claude, Gemini
    "overall_score": None,       # 0-100
    "num_correct": None,         # out of 15
    "topics_missed": [           # e.g., ["Train/Val/Test", "Cross-Entropy", "KNN hyperparameters"]
    ],
    "behavioral_notes": "",      # e.g., feedback on communication, trade-off reasoning
    "next_steps_from_llm": ""    # LLM's guidance on what to practice
}

print('Recorded. You can re-run this cell to update later.')
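
Before committing, it can help to check that the record is filled in consistently. The sketch below is one possible check, not a required part of the workshop. The `check_quiz_results` helper is hypothetical, and the only rules it applies are the ones stated in the comments of the cell above (a `YYYY-MM-DD` date, a score from 0 to 100, and a count out of 15).

```python
from datetime import datetime

TOTAL_QUESTIONS = 15  # quiz length stated in the workflow


def check_quiz_results(results):
    """Return a list of problems; an empty list means the record looks consistent."""
    problems = []
    for key in ("name", "model_used"):
        if not results.get(key):
            problems.append(f"'{key}' is empty")
    try:
        datetime.strptime(results.get("date") or "", "%Y-%m-%d")
    except ValueError:
        problems.append("'date' is not in YYYY-MM-DD format")
    num_correct = results.get("num_correct")
    if not isinstance(num_correct, int) or not 0 <= num_correct <= TOTAL_QUESTIONS:
        problems.append(f"'num_correct' must be an integer from 0 to {TOTAL_QUESTIONS}")
    score = results.get("overall_score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 100:
        problems.append("'overall_score' must be a number from 0 to 100")
    return problems


# Hypothetical record, for illustration only
example = {"name": "A. Student", "date": "2024-05-01", "model_used": "Claude",
           "overall_score": 80, "num_correct": 12}
print(check_quiz_results(example))
```

In the notebook, you would pass your own `quiz_results` dictionary instead of `example`.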



## 🗂 Coverage vs. Gaps (Paste from LLM)
> Paste or recreate the **coverage table** generated by your LLM here.



## 🪞 Reflection (Required)
Answer briefly (3–6 sentences):
1. Which 1–2 concepts were most challenging, and why?
2. What trade-offs or assumptions did you overlook in the interview?
3. What is your plan to improve over the next week?



## 🧪 Targeted Practice (Complete These Exercises)
> The exercises below scaffold practice for common weak areas. If your LLM-generated notebook adds more, do those as well.



### 1) Train / Validation / Test Split & Data Leakage 🔪
- Implement a **stratified** train/val/test split for a binary classification dataset.
- Show how **leakage** can happen if scaling is fit improperly.
- Evaluate with **accuracy** and **cross-entropy** on val and test.


In [None]:

# TODO: Implement proper stratified train/val/test split and demonstrate leakage vs. correct pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# ✅ Correct: fit scaler on train only via Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression(max_iter=1000))])

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

pipe.fit(X_train, y_train)
val_pred_proba = pipe.predict_proba(X_val)[:,1]
test_pred_proba = pipe.predict_proba(X_test)[:,1]

val_logloss = log_loss(y_val, val_pred_proba)
test_logloss = log_loss(y_test, test_pred_proba)

val_acc = accuracy_score(y_val, pipe.predict(X_val))
test_acc = accuracy_score(y_test, pipe.predict(X_test))

print({'val_logloss': val_logloss, 'test_logloss': test_logloss, 'val_acc': val_acc, 'test_acc': test_acc})

# ❌ Leakage demo (anti-pattern): scaler fit on the full data before splitting
scaler = StandardScaler().fit(X)  # BAD: mean/std are computed with val/test rows included
X_scaled = scaler.transform(X)
# Same test_size, stratify, and random_state as above, so both runs are scored on the same test rows
X_tr_bad, X_tmp_bad, y_tr_bad, y_tmp_bad = train_test_split(X_scaled, y, test_size=0.3, stratify=y, random_state=42)
_, X_te_bad, _, y_te_bad = train_test_split(X_tmp_bad, y_tmp_bad, test_size=0.5, stratify=y_tmp_bad, random_state=42)
clf_bad = LogisticRegression(max_iter=1000).fit(X_tr_bad, y_tr_bad)
logloss_bad = log_loss(y_te_bad, clf_bad.predict_proba(X_te_bad)[:,1])
acc_bad = accuracy_score(y_te_bad, clf_bad.predict(X_te_bad))
print({'leakage_logloss': logloss_bad, 'leakage_acc': acc_bad})
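
The cross-entropy that `log_loss` reports for binary labels is the mean negative log-likelihood of the true labels under the predicted probabilities. A minimal pure-Python sketch of that formula follows. The clipping constant `eps` and the example numbers are assumptions made for illustration.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and probabilities, for illustration only
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]
hedging = [0.5, 0.5, 0.5, 0.5]

print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, hedging))  # every term is -log(0.5), so this equals ln 2
```

Both probability lists can yield similar accuracy at a 0.5 threshold, yet the cross-entropy separates them. This is why the exercise asks for both metrics.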



### 2) Linear vs. Logistic Regression 📈➡️📊
- Fit **linear regression** to a synthetic dataset; inspect **residuals** and discuss **linearization**.
- Fit **logistic regression** to a classification dataset; interpret **intercept** and **slope** (weights) and compute **cross-entropy**.
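
One common reading of **linearization** is transforming a nonlinear relationship so that a straight-line fit applies. The sketch below assumes an exponential relationship `y = a * exp(b * x)` with multiplicative noise. The data and the parameter values are invented for illustration. Taking logs gives `log(y) = log(a) + b * x`, which is linear in `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
a_true, b_true = 2.0, 0.8  # hypothetical parameters
# Multiplicative noise becomes additive noise after the log transform
y = a_true * np.exp(b_true * x) * np.exp(rng.normal(0, 0.05, size=x.size))

# Fit a straight line to (x, log y): the slope estimates b, the intercept estimates log(a)
slope, intercept = np.polyfit(x, np.log(y), deg=1)
a_hat, b_hat = np.exp(intercept), slope
print(f"a_hat={a_hat:.3f}, b_hat={b_hat:.3f}")

# Residuals on the log scale should show no curvature if the transform is appropriate
log_residuals = np.log(y) - (intercept + slope * x)
print("mean log-residual:", log_residuals.mean())
```

Plotting `log_residuals` against `x` and comparing with the residuals of a straight-line fit to the raw `y` is a useful way to discuss when a transform is warranted.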


In [None]:

# TODO: Residual analysis + logistic interpretation
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, mean_squared_error, log_loss
from matplotlib import pyplot as plt

# Linear regression synthetic data
Xr, yr = make_regression(n_samples=300, n_features=1, noise=15, random_state=7)
lin = LinearRegression().fit(Xr, yr)
yr_pred = lin.predict(Xr)

print('Linear R^2:', r2_score(yr, yr_pred), 'MSE:', mean_squared_error(yr, yr_pred))

# Residuals plot
plt.figure()
plt.scatter(yr_pred, yr - yr_pred, alpha=0.6)  # residual = actual - predicted
plt.axhline(0)
plt.xlabel('Predicted')
plt.ylabel('Residual')
plt.title('Residuals vs Predicted')
plt.show()

# Logistic regression
Xc, yc = make_classification(n_samples=400, n_features=4, n_informative=3, n_redundant=0, random_state=3)
logit = LogisticRegression(max_iter=1000).fit(Xc, yc)
proba = logit.predict_proba(Xc)[:,1]
print('Cross-Entropy (log loss):', log_loss(yc, proba))
print('Intercept:', logit.intercept_)  # β0
print('Coefficients:', logit.coef_)    # β (slopes)
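
To interpret the printed intercept and coefficients, recall that logistic regression models the log-odds as a linear function of the features. The sketch below uses a one-feature model with made-up parameters `b0` and `b1`, not the values fitted above. It shows that the intercept sets the probability at `x = 0` and that `exp(b1)` is the odds ratio for a one-unit increase in `x`.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters for a one-feature model (not from the cell above)
b0, b1 = -1.0, 0.5


def predict_proba(x):
    """P(y=1 | x) for the one-feature logistic model."""
    return sigmoid(b0 + b1 * x)


def odds(p):
    return p / (1.0 - p)


p_at_0 = predict_proba(0.0)  # the intercept alone determines the probability at x = 0
odds_ratio = odds(predict_proba(3.0)) / odds(predict_proba(2.0))

print(f"P(y=1 | x=0) = {p_at_0:.3f}")
print(f"odds ratio per unit of x = {odds_ratio:.3f}, exp(b1) = {math.exp(b1):.3f}")
```

With several features, as in `logit.coef_`, each coefficient has the same reading with the other features held fixed.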



### 3) KNN Hyperparameters & Decision Trees 🌳
- For **KNN**, compare performance across different **k** and **distance metrics**; visualize validation performance.
- For a **Decision Tree**, show **leaf nodes** and predicted labels; discuss **overfitting** and **max_depth**.


In [None]:

# TODO: KNN sweep + Decision Tree visualization
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
import numpy as np
from matplotlib import pyplot as plt

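X, y = load_iris(return_X_y=True)
# Hold out a test set first, then carve a validation set out of the rest for tuning k
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

ks = range(1, 21)
val_scores = []
for k in ks:
    # metric='minkowski' with the default p=2 is Euclidean distance
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance', metric='minkowski')
    knn.fit(X_train, y_train)
    val_scores.append(knn.score(X_val, y_val))  # tune on validation data, never on the test set

plt.figure()
plt.plot(list(ks), val_scores, marker='o')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('KNN Validation Accuracy vs k')
plt.show()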

# Decision Tree
dt = DecisionTreeClassifier(random_state=0, max_depth=3)
dt.fit(X_train, y_train)
print('Test accuracy:', accuracy_score(y_test, dt.predict(X_test)))

plt.figure(figsize=(10,6))
plot_tree(dt, feature_names=['sepal len','sepal wid','petal len','petal wid'], class_names=[str(i) for i in np.unique(y)], filled=False)
plt.title('Decision Tree (max_depth=3)')
plt.show()
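
The plot shows the tree structure, but the second bullet also asks which **leaf** each sample lands in and what that leaf predicts. The sketch below is one way to list this, assuming scikit-learn's fitted `tree_` arrays, where a node whose `children_left` entry is `-1` is a leaf. It rebuilds a small tree so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
dt = DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_train, y_train)

tree = dt.tree_
# A node with no left child (-1) is a leaf
leaf_ids = np.flatnonzero(tree.children_left == -1)

# apply() returns the index of the leaf each sample ends up in
test_leaves = dt.apply(X_test)
test_preds = dt.predict(X_test)

for leaf in leaf_ids:
    # tree.value[leaf][0] holds the per-class counts or fractions (depending on the
    # scikit-learn version); either way, argmax picks the majority class of the leaf
    majority = dt.classes_[np.argmax(tree.value[leaf][0])]
    n_test = int(np.sum(test_leaves == leaf))
    print(f"leaf {leaf}: predicts class {majority}, receives {n_test} test samples")
```

Raising `max_depth` increases the number of leaves and shrinks the number of training samples behind each prediction, which is a concrete way to discuss overfitting.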



## 🧰 (Optional) Local Material Checks
If you also have `StudyGuide.txt` and `StudyMaterials.zip` locally, you can use the helpers below to inspect them.


In [None]:

# OPTIONAL: Inspect local files (if present)
import os, zipfile, textwrap

for fname in ['StudyGuide.txt', 'StudyMaterials.zip']:
    print(f'{fname}:', 'FOUND' if os.path.exists(fname) else 'NOT FOUND')

if os.path.exists('StudyGuide.txt'):
    with open('StudyGuide.txt', 'r', encoding='utf-8', errors='ignore') as f:
        content = f.read(700)
    print('\n--- StudyGuide (first 700 chars) ---\n')
    print(textwrap.shorten(content, width=700))

if os.path.exists('StudyMaterials.zip'):
    with zipfile.ZipFile('StudyMaterials.zip', 'r') as z:
        print('\n--- ZIP Contents ---')
        for info in z.infolist()[:30]:
            print(info.filename, info.file_size, 'bytes')



## 🚢 Submission (GitHub)
- Ensure the notebook is **executed** and **saved** with your recorded results and completed exercises.
- Push to your GitHub repository (replace remote and branch names as appropriate):

```bash
git add JobInterviewGuide_Workshop.ipynb
git commit -m "Add graded JobInterviewGuide_Workshop notebook"
git push origin main
```
