In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression


In [5]:
# --- Task 1: House Prices Prediction (Regression) ---
data = pd.read_csv("train.csv")
print("Shape:", data.shape)
display(data.head())
print(data.info())


Shape: (1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [3]:
# Split features and target
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]

# Identify categorical and numerical columns
num_cols = X.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X.select_dtypes(include=["object"]).columns

# Define preprocessing pipeline
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, num_cols),
    ("cat", categorical_transformer, cat_cols)
])


In [4]:
# --- Train model and evaluate ---
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=200, random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate metrics (safe for all sklearn versions)
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

mse = mean_squared_error(y_test, pred)       # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error (manual)
mae = mean_absolute_error(y_test, pred)      # mean absolute error
r2 = r2_score(y_test, pred)                  # R² score

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"R²: {r2:.3f}")


RMSE: 28480.78
MAE: 17494.26
R²: 0.894


---

## Task 1 - House Prices Prediction: Control Questions

**Q1. What are the differences between MAE, MSE, and RMSE metrics, and when should each be used?**  
- **MAE (Mean Absolute Error):** average of absolute errors, less sensitive to outliers.  
- **MSE (Mean Squared Error):** squares the errors, penalizing large deviations more heavily.  
- **RMSE (Root Mean Squared Error):** square root of MSE, interpretable in the same units as the target.  
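*MAE is the robust choice when outliers should not dominate; RMSE emphasizes large errors while staying in the target's units; MSE is convenient for gradient-based optimization because it is smooth and differentiable everywhere.*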
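
As a small illustration of the definitions above, the sketch below computes all three metrics by hand with NumPy. The numbers are made-up toy values (not from the house prices data), with one deliberately large miss to show how squaring inflates its influence.

```python
import numpy as np

# Toy targets and predictions (illustrative values only).
# The last prediction misses by 100, the others by 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 300.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of the target

print(f"MAE:  {mae:.2f}")   # MAE:  32.50
print(f"MSE:  {mse:.2f}")   # MSE:  2575.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 50.74
```

The single large error pulls RMSE well above MAE, which is the outlier sensitivity described in the answer.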

---

**Q2. Why is one-hot encoding often preferable for categorical features?**  
Most ML models require numeric input, and one-hot encoding turns each category into its own binary indicator variable.  
Unlike integer label encoding, this prevents algorithms from assuming an ordinal relationship between categories (e.g., "red > blue").

---

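**Q3. How does feature scaling affect linear regression?**  
For ordinary least squares, scaling changes the coefficients but not the predictions, because each coefficient simply absorbs the units of its feature.  
Scaling matters when the model is fit by gradient descent (features with larger numerical ranges dominate the updates and slow convergence) or when it is regularized (Ridge, Lasso), since the penalty treats all coefficients alike. It also makes coefficient magnitudes comparable across features.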
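
The sketch below checks the effect of scaling on plain `LinearRegression` using synthetic data; the feature names `area` and `rooms` and the coefficients used to generate `y` are assumptions chosen for illustration. It fits the model on raw and standardized features and compares predictions and coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)
rooms = rng.integers(1, 8, size=200).astype(float)
X = np.column_stack([area, rooms])
y = 100 * area + 5000 * rooms + rng.normal(0, 1000, size=200)

raw = LinearRegression().fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

# Ordinary least squares gives the same fitted values either way...
same_predictions = np.allclose(raw.predict(X), scaled.predict(X_scaled))
print("Same predictions:", same_predictions)

# ...but the coefficients are expressed in different units.
print("Raw coefficients:   ", raw.coef_)
print("Scaled coefficients:", scaled.coef_)
```

With a regularized or gradient-descent-based model the two fits would generally no longer agree, which is where scaling starts to matter in practice.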

---

**Q4. What sources of target leakage might occur when working with datasets containing many features?**  
Target leakage occurs when a feature carries information about the target that would not be available at prediction time.  
Examples:
- Including variables recorded at or after the sale (like "SaleCondition" or a derived "Price per SqFt") when predicting `SalePrice`.  
- Using future information (e.g., tax assessed value from the next year).  
*This leads to unrealistically high validation performance and poor generalization to new data.*
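
To make the effect concrete, here is a sketch on synthetic data. The feature `assessed_next_year` is hypothetical: it is constructed directly from the target to mimic a value that only exists after the sale. Comparing held-out R² with and without it shows how leakage inflates the score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
area = rng.uniform(500, 3000, size=n)
price = 100 * area + rng.normal(0, 50_000, size=n)

# Hypothetical leaky feature: derived from the target itself.
assessed_next_year = 0.9 * price + rng.normal(0, 5_000, size=n)

X_clean = area.reshape(-1, 1)
X_leaky = np.column_stack([area, assessed_next_year])


def holdout_r2(X, y):
    """Fit a linear model on 80% of the rows and score it on the rest."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_clean = holdout_r2(X_clean, price)
r2_leaky = holdout_r2(X_leaky, price)
print(f"R² without leaky feature: {r2_clean:.3f}")
print(f"R² with leaky feature:    {r2_leaky:.3f}")
```

The leaky model looks far better on the hold-out set, yet it could not be used in practice because the extra feature is unknown when a prediction is needed.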

---



In [None]:
# --- Task 2: Titanic Passenger Classification ---

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report

# Load Titanic dataset
data = pd.read_csv("train.csv")
print("Shape:", data.shape)
display(data.head())
print(data.info())


In [None]:
# --- Data preprocessing ---

# Separate the target and drop identifier-like or mostly-missing columns
X = data.drop(["Survived", "PassengerId", "Name", "Ticket", "Cabin"], axis=1)
y = data["Survived"]

# Identify column types
num_cols = X.select_dtypes(include=["int64", "float64"]).columns
cat_cols = X.select_dtypes(include=["object"]).columns

# Pipelines
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, num_cols),
    ("cat", categorical_transformer, cat_cols)
])


In [None]:
# --- Train Logistic Regression model ---
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000, random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

model.fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print("Accuracy:", round(accuracy_score(y_test, pred), 3))
print("F1-score:", round(f1_score(y_test, pred), 3))
print("ROC-AUC:", round(roc_auc_score(y_test, proba), 3))


In [None]:
---

## 🚢 Task 2 - Titanic Passenger Classification: Control Questions

**Q1. What is the difference between ROC-AUC and PR-AUC, and when is PR-AUC preferable?**  
- **ROC-AUC** measures how well the model ranks positives above negatives across all thresholds; it can look optimistic when negatives vastly outnumber positives.  
- **PR-AUC** (Precision-Recall AUC) summarizes precision versus recall, focuses on positive-class performance, and is better for **imbalanced datasets** (e.g., rare events).  
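
The two metrics can be compared side by side on an imbalanced problem. The sketch below uses a synthetic dataset from `make_classification` (about 3% positives, an assumption for illustration) and uses `average_precision_score` as a common estimate of PR-AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 97% negatives, 3% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)  # estimate of PR-AUC
baseline = y_test.mean()  # PR-AUC of a random classifier equals the positive rate

print(f"ROC-AUC: {roc_auc:.3f} (random baseline 0.5)")
print(f"PR-AUC:  {pr_auc:.3f} (random baseline {baseline:.3f})")
```

Reading each score against its own random baseline is what makes PR-AUC informative when positives are rare.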

---

**Q2. Why is stratified splitting important for imbalanced classes?**  
Because it preserves the same proportion of each class in train and test sets, preventing bias in model evaluation.
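
A minimal sketch with toy labels (90 negatives, 10 positives, chosen only for illustration) shows what `stratify=y` guarantees compared with a plain random split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 10% positive class.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print("Stratified test positives:", y_test_s.sum())  # 2 of 20, i.e. 10%

# Plain random split: the positive count in the test set depends on the seed.
_, _, y_train_r, y_test_r = train_test_split(X, y, test_size=0.2, random_state=0)
print("Random test positives:    ", y_test_r.sum())
```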

---

**Q3. How does feature scaling affect logistic regression?**  
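Scaling speeds up convergence of the iterative solver and puts coefficients on a comparable scale. It also matters because scikit-learn's `LogisticRegression` applies L2 regularization by default, and the penalty affects features unevenly unless they share a scale.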

---

**Q4. What methods for handling class imbalance do you know?**  
- **Resampling:** oversampling the minority or undersampling the majority class  
- **Class weights:** giving more importance to the minority class  
- **Synthetic data:** using SMOTE or ADASYN to create new minority samples  
- **Threshold tuning:** adjusting decision thresholds after training  
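
Two of the methods above, class weights and threshold tuning, need nothing beyond scikit-learn. The sketch below applies them to synthetic data (about 5% positives); the threshold of 0.2 is an arbitrary value for illustration, and in practice it would be chosen on a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% negatives, 5% positives.
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Baseline: default weights and the default 0.5 decision threshold.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
recall_plain = recall_score(y_test, plain.predict(X_test))

# Class weights: errors on the minority class cost more during training.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
weighted.fit(X_train, y_train)
recall_weighted = recall_score(y_test, weighted.predict(X_test))

# Threshold tuning: keep the baseline model, lower the decision threshold.
proba = plain.predict_proba(X_test)[:, 1]
pred_low = (proba >= 0.2).astype(int)
recall_low = recall_score(y_test, pred_low)

print(f"Recall, plain model:          {recall_plain:.3f}")
print(f"Recall, class_weight=balanced: {recall_weighted:.3f}")
print(f"Recall, threshold 0.2:         {recall_low:.3f}")
```

Both adjustments trade precision for recall on the minority class, so they are best judged together with a precision-based metric such as F1 or PR-AUC.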

---
