**Machine Learning Intro | Assignment**

## Q.1 : Explain the differences between AI, ML, Deep Learning (DL), and Data Science (DS)

**Answer:**

- **Artificial Intelligence (AI):** The broad field that aims to create machines or systems that can perform tasks that normally require human intelligence (e.g., reasoning, planning, language understanding). Example: a chatbot that answers customer queries.

- **Machine Learning (ML):** A subset of AI that trains algorithms on data so they can make predictions or decisions without being explicitly programmed for every rule. Example: a model that predicts house prices from historical data.

- **Deep Learning (DL):** A subset of ML that uses multi-layered (deep) neural networks to learn complex patterns, typically from very large datasets. DL usually requires more data and computation than classical ML methods. Example: image recognition (identifying objects in photos).

- **Data Science (DS):** An interdisciplinary field that uses statistics, programming, domain knowledge, and ML to extract insights and build data-driven solutions. Data science includes data cleaning, exploration, visualization, modeling, and communication. Example: analyzing customer purchase data to recommend products.

**Short relation:** AI ⊇ ML ⊇ DL. Data Science uses ML/DL and also covers data preparation, analysis, and communication.


## Q.2 : What are the types of machine learning? Describe each with one real-world example

**Answer:**

1. **Supervised Learning:** Trains on labelled data (inputs paired with correct outputs). Used for classification and regression.
   - *Example:* Email spam detection (labels: spam / not spam).

2. **Unsupervised Learning:** Finds patterns in unlabelled data (no explicit output labels). Used for clustering and dimensionality reduction.
   - *Example:* Customer segmentation in marketing (group customers by behavior).

3. **Semi-supervised Learning:** Uses a small amount of labelled data plus a larger amount of unlabelled data to improve learning.
   - *Example:* Web page classification where only a few pages are labelled but many are unlabelled.

4. **Reinforcement Learning (RL):** An agent learns by taking actions in an environment to maximize cumulative reward.
   - *Example:* Training a robot to navigate a maze or training agents for game-playing (AlphaGo).

5. **Self-supervised Learning (emerging / specialised):** The model creates labels from the data itself (e.g., predict missing words) to learn useful representations.
   - *Example:* BERT-style pretraining in NLP (predict masked words) used before fine-tuning.
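
The contrast between the first two types can be illustrated with a minimal sketch. It assumes a toy two-blob dataset (the centers, sample size, and variable names are illustrative choices, not part of the assignment): the supervised model is given the labels, while the clustering model only sees the features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: two well-separated groups (illustrative assumption)
X, y = make_blobs(n_samples=300, centers=[[-3, -3], [3, 3]],
                  cluster_std=1.0, random_state=0)

# Supervised: the model is trained on (X, y) pairs and scored on held-out labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = LogisticRegression().fit(X_train, y_train)
supervised_acc = clf.score(X_test, y_test)

# Unsupervised: KMeans never sees y; it only groups similar points together
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary (0/1 may be swapped), so compare up to relabelling
match = np.mean(kmeans.labels_ == y)
cluster_agreement = max(match, 1 - match)

print('Supervised test accuracy:', round(supervised_acc, 3))
print('Cluster/label agreement :', round(cluster_agreement, 3))
```

On data this cleanly separated both approaches recover the groups; on real data the clusters found without labels need not correspond to any target of interest.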


## Q.3 : Define overfitting, underfitting, and the bias-variance tradeoff in machine learning

**Answer:**

- **Overfitting:** When a model learns the training data too well, including noise and minor fluctuations, so that it performs very well on training data but poorly on new/unseen data. It often happens with overly complex models and small datasets.
  - *Fixes:* Use simpler models, regularization, more data, or early stopping; use cross-validation to detect it.

- **Underfitting:** When a model is too simple to capture the underlying pattern in the data and performs poorly on both training and test data.
  - *Fixes:* Use a more complex model, add relevant features, reduce regularization.

- **Bias-Variance Tradeoff:** Bias is the error caused by overly simple or wrong assumptions (it leads to underfitting). Variance is the error caused by sensitivity to the particular training sample (it leads to overfitting). The tradeoff is choosing a model complexity that minimizes the expected total error (bias^2 + variance + irreducible error). Simple models → high bias, low variance. Complex models → low bias, high variance.
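
The effect of model complexity can be seen in a small experiment. This is only a sketch under assumed settings (a noisy sine curve, 30 training points, and polynomial degrees 1, 4, and 15 are all illustrative choices): a low degree underfits, a very high degree tends to fit the training noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a sine curve (illustrative ground truth)."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit of the given degree
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f'degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}')
```

Training error keeps falling as the degree grows, so it cannot be used on its own to pick the model; the gap between training and test error is the signal to watch.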

---


## Q.4 : What are outliers in a dataset? List three common techniques for handling them.

**Answer:**

- **Outliers:** Data points that differ significantly from other observations. They can come from measurement errors, data-entry mistakes, or true extreme behaviour.

**Three common techniques:**
1. **Remove outliers:** If outliers are errors or irrelevant, drop them. Be careful with small datasets.
2. **Cap/Winsorize:** Replace extreme values with boundary values (e.g., set values above the 95th percentile to the 95th percentile value).
3. **Transform data:** Apply log or Box–Cox transforms to reduce skew and the effect of extreme values.

Other approaches: use robust statistics and methods (median, robust scaling), or treat outliers separately as a special case.
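
The three techniques can be sketched on a tiny series. The data, the 1.5 × IQR rule for removal, and the 5th/95th percentile caps are illustrative assumptions; suitable thresholds depend on the dataset.

```python
import numpy as np
import pandas as pd

# Small illustrative sample with one obvious extreme value
s = pd.Series([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 13.0, 15.0, 500.0])

# 1) Remove: keep only values inside the 1.5 * IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
inside = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
removed = s[inside]

# 2) Cap (winsorize): pull values beyond the 5th/95th percentiles back to them
capped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))

# 3) Transform: log1p compresses large values and reduces right skew
logged = np.log1p(s)

print('Rows kept after removal:', len(removed), 'of', len(s))
print('Max before/after capping:', s.max(), capped.max())
print('Max after log1p:', round(logged.max(), 3))
```

With so few points the 95th percentile is itself pulled upward by the extreme value, which is one reason percentile caps are better suited to larger samples.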

---


## Q.5 : Explain the process of handling missing values and mention one imputation technique for numerical and one for categorical data.

**Answer:**

**Process:**
1. **Identify** missing values and compute the percentage missing per column.
2. **Understand missingness type:** MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random).
3. **Decide strategy:** Drop rows/columns (if too many missing), impute, or model the missingness.
4. **Impute/handle:** Apply chosen methods (mean/median/mode, regression imputation, KNN, or domain-specific rules).
5. **Validate:** Check results and whether imputations introduce bias; test model sensitivity.

**One imputation technique (numerical):** **Median imputation** — replace missing numeric values with the median (robust to outliers).

**One imputation technique (categorical):** **Mode imputation** — replace missing categorical values with the most frequent category.
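
A minimal sketch of the identify and impute steps, using scikit-learn's `SimpleImputer`. The toy DataFrame and its column names (`income`, `city`) are hypothetical and serve only to illustrate the two techniques.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with gaps in one numeric and one categorical column
df = pd.DataFrame({
    'income': [30.0, 45.0, np.nan, 52.0, 400.0, np.nan],
    'city': pd.Series(['Pune', 'Delhi', 'Pune', np.nan, 'Pune', 'Delhi'], dtype=object),
})

# Step 1: percentage of missing values per column
missing_pct = df.isna().mean() * 100
print(missing_pct.round(1))

# Numerical: median imputation (the extreme value 400 barely moves the median)
num_imputer = SimpleImputer(strategy='median')
df['income_filled'] = num_imputer.fit_transform(df[['income']]).ravel()

# Categorical: mode imputation (most frequent category)
cat_imputer = SimpleImputer(strategy='most_frequent')
df['city_filled'] = cat_imputer.fit_transform(df[['city']]).ravel()

print(df)
```

Fitting the imputer as a separate object makes it possible to learn the median or mode on training data and reuse the same values on new data.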

---


## Q.6 : Python code (create imbalanced dataset and print class distribution)

In [None]:

# Q6: Create a synthetic imbalanced dataset with make_classification and print class distribution.
from sklearn.datasets import make_classification
import pandas as pd
from collections import Counter

X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.95, 0.05], flip_y=0, random_state=42)

df_q6 = pd.DataFrame(X, columns=[f'feat_{i+1}' for i in range(X.shape[1])])
df_q6['target'] = y

print('Q6: Class distribution (label: count):')
print(Counter(df_q6['target']))

print('\nQ6: Sample rows:')
print(df_q6.head())


## Q.7 : Python code (one-hot encoding using pandas)

In [None]:

# Q7: One-hot encoding using pandas for a list of colors ['Red', 'Green', 'Blue', 'Green', 'Red'].
import pandas as pd

colors = ['Red', 'Green', 'Blue', 'Green', 'Red']
df_colors = pd.DataFrame({'color': colors})

print('Q7: Original dataframe:')
print(df_colors)

# dtype=int gives 0/1 columns instead of booleans (the default in recent pandas versions)
df_onehot = pd.get_dummies(df_colors, columns=['color'], prefix='', prefix_sep='', dtype=int)
# sort columns alphabetically for a consistent order
df_onehot = df_onehot[sorted(df_onehot.columns)]

print('\nQ7: One-hot encoded dataframe:')
print(df_onehot)


## Q.8 : Python code (generate normal samples, introduce missing values, impute with mean, plot histograms)

In [None]:

# Q8: Generate 1000 samples from a normal distribution, introduce 50 random missing values, fill with mean, and plot histograms before and after imputation.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
values = np.random.normal(loc=50, scale=10, size=1000)  # mean 50, sd 10
df_q8 = pd.DataFrame({'value': values})

# introduce 50 random missing values
missing_idx = np.random.choice(df_q8.index, size=50, replace=False)
df_q8.loc[missing_idx, 'value'] = np.nan

print('Q8: Missing values count before imputation:', df_q8['value'].isna().sum())

# Histogram BEFORE imputation
plt.figure(figsize=(7,4))
plt.hist(df_q8['value'].dropna(), bins=30)
plt.title('Q8: Histogram BEFORE imputation (missing removed)')
plt.xlabel('value')
plt.ylabel('frequency')
plt.show()

# Fill missing with column mean
mean_val = df_q8['value'].mean()
df_q8['value'] = df_q8['value'].fillna(mean_val)
print('Q8: Missing values count after imputation:', df_q8['value'].isna().sum())

# Histogram AFTER imputation (separate plot)
plt.figure(figsize=(7,4))
plt.hist(df_q8['value'], bins=30)
plt.title('Q8: Histogram AFTER imputation (filled with mean)')
plt.xlabel('value')
plt.ylabel('frequency')
plt.show()


## Q.9 : Python code (Min-Max scaling using sklearn)

In [None]:

# Q9: Implement Min-Max scaling on [2,5,10,15,20] using sklearn.preprocessing.MinMaxScaler
import numpy as np
from sklearn.preprocessing import MinMaxScaler

arr = np.array([2, 5, 10, 15, 20]).reshape(-1,1)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(arr)
print('Q9: Original array:', arr.flatten().tolist())
print('Q9: Scaled array:', scaled.flatten().tolist())
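
As a cross-check, Min-Max scaling to the range [0, 1] is the formula `(x - min) / (max - min)`, so the same result can be computed by hand. The sketch below assumes the default `feature_range=(0, 1)` used above.

```python
import numpy as np

x = np.array([2, 5, 10, 15, 20], dtype=float)

# Manual Min-Max scaling: shift by the minimum, divide by the range (20 - 2 = 18)
scaled_manual = (x - x.min()) / (x.max() - x.min())

print(scaled_manual.round(4).tolist())  # [0.0, 0.1667, 0.4444, 0.7222, 1.0]
```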


## Q.10 : Data preparation plan for a retail customer transaction dataset (step-by-step)

**Scenario:** Dataset contains missing ages, outliers in transaction amount, highly imbalanced target (fraud vs non-fraud), and categorical variables like payment method.

**Step-by-step plan (simple and practical):**

1. **Initial exploration:** Examine column types, missing value counts, basic statistics (mean, median, min, max) and target class distribution.

2. **Missing data handling:**
   - For **age (numerical)**: check the distribution; if it is roughly symmetric, mean or median imputation both work. Prefer **median** if it is skewed or outliers are present.
   - For **categorical (payment method)**: impute with **mode** (most frequent) or add a special category like 'Missing'.

3. **Outliers in transaction amount:**
   - Detect with IQR or z-score. If outliers are data errors, remove them. If they are valid but extreme, consider capping (winsorizing) or transforming (log transform) to reduce impact.

4. **Encoding categorical variables:**
   - Use **one-hot encoding** for nominal categories with limited cardinality (e.g., payment_method).
   - For high-cardinality categories, consider target-encoding or embedding approaches.

5. **Handle class imbalance:**
   - Options: oversample minority (upsampling), undersample majority, or use algorithmic methods (class weights, specialized algorithms, or synthetic sampling like SMOTE).
   - A simple approach: **upsample minority** using resampling or set `class_weight` in models like RandomForest/LogisticRegression.

6. **Feature scaling:**
   - Scale numeric features (StandardScaler or MinMaxScaler) where required (e.g., for distance-based models or regularized models).

7. **Feature engineering & selection:**
   - Create derived features (e.g., transaction_hour, user_age_group), and remove/reduce redundant features.

8. **Validation strategy:**
   - Use a stratified train-test split so that both sets keep the original fraud/non-fraud ratio. Use stratified cross-validation and monitor metrics appropriate for imbalance (precision, recall, F1, ROC-AUC, PR-AUC).

9. **Pipeline & reproducibility:**
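   - Build an automated pipeline (sklearn Pipeline) that handles imputation, encoding, and scaling, fitted on training data only, to avoid data leakage. Resampling steps are not supported by the plain sklearn Pipeline; use imbalanced-learn's Pipeline for them and apply resampling to training folds only.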

10. **Final checks:**
   - Evaluate models using relevant metrics and perform post-processing (threshold tuning) before deployment.
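
Steps 2, 4, 5, 6, 8, and 9 of the plan can be combined in one leakage-safe pipeline. The following is a sketch under stated assumptions, not a definitive implementation: the synthetic data, the rule that makes fraud depend on the amount, and the choice of `class_weight='balanced'` (instead of resampling) are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (hypothetical): missing ages, skewed amounts, missing categories
age = rng.normal(35, 12, n)
age[rng.choice(n, size=200, replace=False)] = np.nan
amount = rng.exponential(100, n) + 20
payment = rng.choice(['Card', 'Cash', 'UPI'], size=n).astype(object)
payment[rng.choice(n, size=60, replace=False)] = np.nan

# Assumed rule for illustration: the top 5% of a noisy amount score is fraud
score = amount + rng.normal(0, 50, n)
y = (score > np.quantile(score, 0.95)).astype(int)

X = pd.DataFrame({'age': age, 'transaction_amount': amount, 'payment_method': payment})

# Stratified split keeps the fraud ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
prep = ColumnTransformer([
    ('num', numeric, ['age', 'transaction_amount']),
    ('cat', categorical, ['payment_method']),
])

# class_weight='balanced' reweights the rare class instead of resampling rows
model = Pipeline([('prep', prep),
                  ('clf', LogisticRegression(class_weight='balanced', max_iter=1000))])
model.fit(X_train, y_train)  # all preprocessing statistics come from the training set

fraud_scores = model.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, fraud_scores)
print('Train/test fraud rate:', y_train.mean(), y_test.mean())
print('PR-AUC on test set:', round(pr_auc, 3))
```

Because the imputers, encoder, and scaler live inside the pipeline, cross-validation refits them on each training fold, which is what keeps test information out of the preprocessing.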

---


In [None]:

# Q10: Demonstration of data preparation steps for a retail transaction dataset.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

np.random.seed(42)

n = 500

# Create synthetic data
ages = np.random.normal(loc=35, scale=12, size=n).astype(float)
# Introduce missing ages (10% missing)
missing_age_idx = np.random.choice(n, size=int(0.10*n), replace=False)
ages[missing_age_idx] = np.nan

# Transaction amount: right-skewed (exponential with scale 100, shifted by 20), so most values fall well below 500
tx_amount = np.random.exponential(scale=100, size=n) + 20  # positive skew
# Introduce some extreme outliers
outlier_idx = np.random.choice(n, size=5, replace=False)
tx_amount[outlier_idx] = tx_amount[outlier_idx] * 15

# Target: highly imbalanced (fraud = 0.03)
targets = np.zeros(n, dtype=int)
fraud_idx = np.random.choice(n, size=int(0.03*n), replace=False)
targets[fraud_idx] = 1

# Payment method categorical with some missing values
# (cast to object so that None is stored as a real missing value, not as the string 'None')
payment_methods = np.random.choice(['Card', 'Cash', 'UPI', 'NetBanking'], size=n, p=[0.45, 0.25, 0.25, 0.05]).astype(object)
# introduce a few missing payment methods
pm_missing_idx = np.random.choice(n, size=int(0.03*n), replace=False)
payment_methods[pm_missing_idx] = None

df_q10 = pd.DataFrame({
    'age': ages,
    'transaction_amount': tx_amount,
    'payment_method': payment_methods,
    'is_fraud': targets
})

print('Q10: Initial dataset shape and first rows:')
print(df_q10.shape)
print(df_q10.head())

# 1) Initial exploration
print('\nQ10: Missing values per column:')
print(df_q10.isna().sum())

print('\nQ10: Transaction amount summary (before handling outliers):')
print(df_q10['transaction_amount'].describe())

print('\nQ10: Target distribution:')
print(df_q10["is_fraud"].value_counts())

# 2) Handling missing data
# Age: median imputation
age_median = df_q10['age'].median()
df_q10['age_imputed'] = df_q10['age'].fillna(age_median)

# Payment method: mode imputation, if no mode available set to 'Missing'
mode_pm = df_q10['payment_method'].mode()
mode_pm_val = mode_pm[0] if len(mode_pm)>0 else 'Missing'
df_q10['payment_method_imputed'] = df_q10['payment_method'].fillna(mode_pm_val)

print('\nQ10: After imputation - missing counts:')
print(df_q10[['age_imputed','payment_method_imputed']].isna().sum())

# 3) Outlier detection & handling for transaction_amount using IQR capping (winsorizing)
Q1 = df_q10['transaction_amount'].quantile(0.25)
Q3 = df_q10['transaction_amount'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f'\nQ10: IQR lower_bound={lower_bound:.2f}, upper_bound={upper_bound:.2f}')

# Count outliers
outliers_lower = (df_q10['transaction_amount'] < lower_bound).sum()
outliers_upper = (df_q10['transaction_amount'] > upper_bound).sum()
print('Q10: Outliers lower count:', outliers_lower)
print('Q10: Outliers upper count:', outliers_upper)

# Cap values to bounds (winsorize)
df_q10['tx_capped'] = df_q10['transaction_amount'].clip(lower=lower_bound, upper=upper_bound)

print('\nQ10: Transaction amount summary (after capping):')
print(df_q10['tx_capped'].describe())

# 4) Encoding categorical variable - one-hot encoding for payment_method_imputed
df_encoded = pd.get_dummies(df_q10, columns=['payment_method_imputed'], prefix='pm')

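# 5) Handle class imbalance - upsample minority class (fraud) to balance
# Note: in a real workflow, split into train/test first and resample only the training set;
# upsampling before the split would put copies of the same fraud rows in both sets.
df_majority = df_encoded[df_encoded['is_fraud']==0]
df_minority = df_encoded[df_encoded['is_fraud']==1]
print('\nQ10: Before resampling class distribution:')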
print(df_encoded['is_fraud'].value_counts())

df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
df_balanced = pd.concat([df_majority, df_minority_upsampled]).sample(frac=1, random_state=42).reset_index(drop=True)

print('Q10: After upsampling class distribution:')
print(df_balanced['is_fraud'].value_counts())

# 6) Feature scaling (StandardScaler) for numeric features 'age_imputed' and 'tx_capped'
scaler = StandardScaler()
df_balanced[['age_scaled', 'tx_scaled']] = scaler.fit_transform(df_balanced[['age_imputed', 'tx_capped']])

print('\nQ10: Final prepared dataset shape and sample rows:')
print(df_balanced.shape)
print(df_balanced[['age_imputed','tx_capped','age_scaled','tx_scaled','is_fraud'] + [c for c in df_balanced.columns if c.startswith('pm_')]].head())

