## Question 1
**Explain the differences between AI, ML, Deep Learning (DL), and Data Science (DS).**

**Answer:**
- **Artificial Intelligence (AI):** The broad field of building systems that can perform tasks that typically require human intelligence (reasoning, planning, perception, language, decision‑making). AI includes symbolic systems, search, rule‑based systems, ML, DL, etc.
- **Machine Learning (ML):** A subfield of AI focused on algorithms that **learn patterns from data** to make predictions/decisions without being explicitly programmed. Examples: linear/logistic regression, decision trees, SVMs, clustering.
- **Deep Learning (DL):** A subfield of ML that uses **multi‑layer neural networks** to learn complex representations (e.g., CNNs, RNNs, Transformers). It excels at perception tasks and unstructured data (images, audio, text) but typically needs more data and compute than classical ML.
- **Data Science (DS):** An end‑to‑end discipline that combines **statistics, ML, data engineering, and domain knowledge** to extract insights and support decisions. DS covers the whole lifecycle: problem framing, data collection, cleaning, analysis, modeling, evaluation, communication, and deployment.

## Question 2
**What are the types of machine learning? Describe each with one real‑world example.**

**Answer:**
- **Supervised Learning:** Learn from labeled data (input → known target). *Example:* Fraud detection (transaction → fraud/not‑fraud).
- **Unsupervised Learning:** Discover structure in unlabeled data. *Example:* Customer segmentation with clustering (K‑means).
- **Semi‑Supervised Learning:** Train on small labeled + large unlabeled data. *Example:* Classifying product reviews when only some are labeled.
- **Self‑Supervised Learning:** Create surrogate labels from data itself to pretrain models. *Example:* Masked‑word prediction for language models.
- **Reinforcement Learning (RL):** Learn actions by interacting with an environment to maximize reward. *Example:* Recommendation systems optimizing long‑term engagement.
- **Online/Incremental Learning:** Continuously update model with data stream. *Example:* Spam filters adapting to new spam patterns.
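
To make the supervised/unsupervised distinction concrete, here is a minimal sketch on synthetic data. The blob centers, sample size, and model choices are illustrative assumptions, not part of the question.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two well-separated synthetic groups (hypothetical data, chosen for illustration)
X, y = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

# Supervised: the labels y are used during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
supervised_acc = clf.score(X_test, y_test)

# Unsupervised: only X is seen; K-means assigns cluster ids without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("Supervised test accuracy:", round(supervised_acc, 3))
print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(2)])
```

Note that the K-means cluster ids are arbitrary: cluster 0 need not correspond to class 0.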

## Question 3
**Define overfitting, underfitting, and the bias‑variance trade‑off.**

**Answer:**
- **Overfitting:** Model learns noise/idiosyncrasies; very low training error, high test error.
- **Underfitting:** Model too simple; high error on both train and test.
- **Bias‑Variance Trade‑off:** Increasing model complexity reduces **bias** (systematic error) but increases **variance** (sensitivity to data). Aim for a sweet spot (via regularization, more data, cross‑validation, early stopping, ensembling).
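
The trade-off can be sketched by fitting polynomials of increasing degree to a noisy sine curve. The degrees, noise level, and seed below are assumptions picked for illustration; a low degree typically underfits (high error on both splits), while a very high degree tends to show low training error and higher test error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
for degree in (1, 4, 15):
    # Higher degree = more flexible model (lower bias, higher variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (mse_train, mse_test)
    print(f"degree={degree:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```

Exact numbers depend on the seed and split; the pattern across degrees is what matters.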

## Question 4
**What are outliers in a dataset, and list three common techniques for handling them.**

**Answer:**
Outliers are observations that **deviate markedly** from the rest of the data (genuine extremes or errors). Handling techniques:
1. **Winsorizing/Capping** using IQR or percentile bounds.
2. **Transformation** (e.g., log, Box‑Cox) to reduce skew.
3. **Robust Modeling** or **robust scalers** (e.g., tree‑based models, Huber loss, RobustScaler).
4. **Removal** if proven erroneous and justified (document carefully).
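
A minimal sketch of the first two techniques on a small hypothetical series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative values with one obvious extreme
s = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 500.0], name="amount")

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# 1. Capping: pull flagged values back to the fences
capped = s.clip(lower, upper)

# 2. Transformation: log1p compresses the right tail without dropping rows
logged = np.log1p(s)

print("Flagged as outliers:", s[is_outlier].tolist())
print("Max after capping:", capped.max())
print("Max after log1p:", round(float(logged.max()), 3))
```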

## Question 5
**Explain handling missing values and mention one imputation technique for numerical and one for categorical data.**

**Answer:**
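- **Process:** Inspect missingness patterns (MCAR, MAR, or MNAR: missing completely at random, at random, or not at random) and quantify missingness per feature. Then decide per feature whether to drop rows/columns (if safe) or impute, add a missingness indicator if the fact that a value is missing is itself informative, and validate the effect via cross‑validation.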
- **Numerical imputation:** Mean/median imputation (e.g., `SimpleImputer(strategy='median')`).
- **Categorical imputation:** Most‑frequent value (`SimpleImputer(strategy='most_frequent')`) or a new category such as `'Missing'` (`SimpleImputer(strategy='constant', fill_value='Missing')`).
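
The two imputation strategies can be sketched as follows. The tiny DataFrame and its column names are hypothetical; `add_indicator=True` is one way to keep the missingness flag mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with gaps in one numerical and one categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": pd.Series(["Pune", "Delhi", np.nan, "Delhi", np.nan], dtype=object),
})

# Quantify missingness per column
print(df.isna().mean())

# Numerical: median imputation, plus a 0/1 column marking originally missing rows
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
age_out = num_imputer.fit_transform(df[["age"]])

# Categorical: replace missing values with an explicit 'Missing' category
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
city_out = cat_imputer.fit_transform(df[["city"]])

print(age_out)
print(city_out.ravel())
```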

## Question 6
**Create a synthetic imbalanced dataset with `make_classification()` and print the class distribution.**

In [None]:
from sklearn.datasets import make_classification
from collections import Counter

X, y = make_classification(n_samples=2000, n_features=20, n_informative=3,
                           n_redundant=2, n_repeated=0, n_clusters_per_class=1,
                           weights=[0.95, 0.05], flip_y=0.01, random_state=42)

print("Class distribution:", Counter(y))

## Question 7
**Implement one‑hot encoding using pandas for the list:** `['Red', 'Green', 'Blue', 'Green', 'Red']`.

In [None]:
import pandas as pd

colors = ['Red', 'Green', 'Blue', 'Green', 'Red']
df = pd.DataFrame({'color': colors})
# dtype=int gives 0/1 columns; recent pandas versions default to True/False
encoded = pd.get_dummies(df, columns=['color'], prefix='', prefix_sep='', dtype=int)
print(encoded)

## Question 8
**Generate 1000 samples from a normal distribution; introduce 50 random missing values; fill with the column mean; plot a histogram before and after imputation.**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(loc=0.0, scale=1.0, size=1000)
df = pd.DataFrame({'x': data})

# Introduce 50 missing values at random positions
missing_idx = np.random.choice(df.index, size=50, replace=False)
df.loc[missing_idx, 'x'] = np.nan

# Histogram before imputation
plt.figure()
df['x'].hist(bins=30)
plt.title('Histogram BEFORE imputation')
plt.xlabel('x')
plt.ylabel('Frequency')
plt.show()

# Impute with mean
mean_val = df['x'].mean()
df['x_filled'] = df['x'].fillna(mean_val)

# Histogram after imputation
plt.figure()
df['x_filled'].hist(bins=30)
plt.title('Histogram AFTER imputation (mean)')
plt.xlabel('x')
plt.ylabel('Frequency')
plt.show()

## Question 9
**Implement Min‑Max scaling on `[2, 5, 10, 15, 20]` using `MinMaxScaler`.**

In [None]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

arr = np.array([[2],[5],[10],[15],[20]], dtype=float)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(arr)
print(scaled.ravel())

## Question 10
**Retail fraud dataset: data preparation plan (missing ages, outliers in amount, imbalanced target, categorical payment method).**

**Step‑by‑step plan:**
1. **Explore & audit:** Column types, missingness heatmap, basic stats, class imbalance, target leakage check.
2. **Missing data (ages):** Impute with domain‑appropriate method (median for skewed ages) and optionally add a missing‑indicator.
3. **Outliers (amount):** Cap using IQR/percentile bounds; alternatively apply a log transform; prefer **robust scalers**.
4. **Imbalance (fraud vs non‑fraud):** Use metrics that reflect the minority class (PR‑AUC, precision/recall; ROC‑AUC can look optimistic under heavy imbalance) with stratified CV; mitigate via **class weights**, **upsampling minority**, **downsampling majority** (resampling applied to training data only), or threshold tuning and focal loss.
5. **Categoricals (payment method):** One‑Hot Encode with `handle_unknown='ignore'`.
6. **Pipelines:** Build a `ColumnTransformer` with imputation, encoding, and scaling; wrap in a supervised estimator; evaluate with stratified CV and a hold‑out set.
7. **Monitoring:** Check for drift, recalibrate thresholds, and retrain on a schedule.

**Illustrative code on a synthetic dataset:**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils import resample

np.random.seed(42)

# Synthetic data
n = 5000
ages = np.random.normal(35, 10, size=n).clip(18, 80)
# introduce missing ages (10%)
mask_missing = np.random.rand(n) < 0.1
ages[mask_missing] = np.nan

amount = np.random.lognormal(mean=3.2, sigma=0.7, size=n)
# introduce outliers
outlier_idx = np.random.choice(np.arange(n), size=15, replace=False)
amount[outlier_idx] *= 10

payment_methods = np.random.choice(['card', 'upi', 'netbanking', 'wallet'], size=n, p=[0.6, 0.25, 0.1, 0.05])

# Imbalanced target: ~2% fraud
y = (np.random.rand(n) < 0.02).astype(int)

df = pd.DataFrame({
    'age': ages,
    'amount': amount,
    'payment_method': payment_methods,
    'fraud': y
})

# ====== Split first so the test set keeps the original class distribution ======
feature_cols = ['age', 'amount', 'payment_method']
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df['fraud'], test_size=0.2, stratify=df['fraud'], random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()

# ====== Simple outlier capping for `amount` via IQR (bounds from training data only) ======
Q1, Q3 = X_train['amount'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
X_train['amount_capped'] = X_train['amount'].clip(lower, upper)
X_test['amount_capped'] = X_test['amount'].clip(lower, upper)

print("Class distribution BEFORE resampling (train):", y_train.value_counts().to_dict())

# ====== Rebalance the training set only (simple illustration) ======
train = X_train.assign(fraud=y_train)
train_major = train[train.fraud == 0]
train_minor = train[train.fraud == 1]
n_per_class = len(train_major) // 10
# Upsample the minority (with replacement) and downsample the majority to the same size (1:1)
train_minor_up = resample(train_minor, replace=True, n_samples=n_per_class, random_state=42)
train_major_down = train_major.sample(n_per_class, random_state=42)
train_bal = pd.concat([train_major_down, train_minor_up], axis=0).sample(frac=1, random_state=42)

print("Class distribution AFTER resampling (train):", train_bal['fraud'].value_counts().to_dict())

num_features = ['age', 'amount_capped']
cat_features = ['payment_method']
model_cols = num_features + cat_features

numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer([
    ('num', numeric_pipeline, num_features),
    ('cat', categorical_pipeline, cat_features)
])

clf = Pipeline(steps=[
    ('prep', preprocess),
    ('model', LogisticRegression(max_iter=1000))
])

# Fit on the rebalanced training data; evaluate on the untouched, still-imbalanced test set
clf.fit(train_bal[model_cols], train_bal['fraud'])
pred = clf.predict(X_test[model_cols])
proba = clf.predict_proba(X_test[model_cols])[:, 1]

# The synthetic target is random (independent of the features), so ROC-AUC near 0.5 is expected
print("\nClassification report (untouched test set):\n", classification_report(y_test, pred, digits=3))
print("ROC-AUC:", round(roc_auc_score(y_test, proba), 3))
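
The plan also lists class weights and stratified CV with PR‑AUC as an alternative to resampling. A minimal sketch under stated assumptions: it uses a separate `make_classification` dataset (so the target actually depends on the features), and the sample size, class ratio, and model settings are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (~3% positives), for illustration only
X_cw, y_cw = make_classification(n_samples=5000, n_features=10, n_informative=4,
                                 weights=[0.97, 0.03], random_state=42)

# class_weight='balanced' reweights the loss instead of duplicating or dropping rows
model_cw = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Stratified folds keep the fraud ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model_cw, X_cw, y_cw, cv=cv,
                        scoring=["roc_auc", "average_precision"])

print("ROC-AUC:", round(float(scores["test_roc_auc"].mean()), 3))
print("PR-AUC (average precision):",
      round(float(scores["test_average_precision"].mean()), 3))
```

Because no rows are resampled, every fold is scored on the original class distribution, which avoids leaking duplicated minority rows into the evaluation data.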