# Copper ML Project: Consolidated, Refactored Notebook

This notebook merges the useful, non-duplicative parts of the existing notebooks into a single, well-commented pipeline. Each code block is preceded by a short explanation so a newcomer can follow the flow block-by-block.

## 1. Imports and global configuration
We load the common data science stack (pandas/numpy/matplotlib/seaborn) and the ML tools used for preprocessing and modeling. Warnings are muted to keep the notebook output readable.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier


## 2. Load data
Set a single source of truth for the dataset path so the notebook is easy to reuse. Replace the placeholder path with the real file location.

In [None]:
DATA_PATH = 'path/to/processed_Copper_Set_cleaned.xlsx'  # TODO: update with your local path

df = pd.read_excel(DATA_PATH)
df.head()

## 3. Initial inspection
Quickly review the shape, column names, and dtypes to understand the raw input. This matches the EDA approach used in the existing notebooks.

In [None]:
print('Shape:', df.shape)
df.info()

## 4. Basic cleaning and normalization
We normalize column names and drop columns with excessive missing values. This reflects the cleaning steps used in the industrial data cleaning notebooks.

In [None]:
# Standardize column names
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]

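# Drop columns with more than 50% missing values.
# `thresh` is the minimum number of non-null values a column needs to be kept;
# it is rounded up to an integer because `dropna` expects an int.
missing_threshold = 0.5
min_non_null = int(np.ceil(df.shape[0] * (1 - missing_threshold)))
df = df.dropna(thresh=min_non_null, axis=1)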

df.head()
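To make the column-dropping rule concrete, here is a minimal sketch on a hypothetical toy frame (the column names `full`, `half_empty`, and `mostly_empty` are invented for illustration). It assumes the same 50% threshold and rounds the required non-null count up to an integer. Note that a column that is exactly 50% missing is kept; only columns with more than 50% missing are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'half_empty' is exactly 50% missing, 'mostly_empty' is 75% missing.
toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0, 4.0],
    'half_empty': [1.0, np.nan, 3.0, np.nan],
    'mostly_empty': [np.nan, np.nan, np.nan, 4.0],
})

missing_threshold = 0.5
# Minimum number of non-null values a column needs in order to survive.
min_non_null = int(np.ceil(len(toy) * (1 - missing_threshold)))

kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))  # ['full', 'half_empty']
```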

## 5. Missing data overview
We quantify missing values so we can decide whether to impute, drop, or model around them.

In [None]:
missing_counts = df.isnull().sum().sort_values(ascending=False)
missing_counts.head(15)
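Raw counts are easier to act on next to percentages. The following is a small sketch, not part of the original notebooks; `missing_summary` is a hypothetical helper name, shown on invented toy data so it runs on its own.

```python
import pandas as pd

def missing_summary(dataframe):
    """Return per-column missing counts and percentages, most-missing first."""
    counts = dataframe.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(dataframe) * 100).round(1),
    })
    return summary.sort_values('missing_count', ascending=False)

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'a': [1, None, 3, None],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df).head(15)` would give the same ranking as the cell above with a percentage column added.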

## 6. Outlier detection (IQR method)
We use the IQR method to flag outliers in numerical columns, mirroring the EDA notebook's approach. This helps decide whether to clip, transform, or keep extreme values.

In [None]:
numeric_cols = df.select_dtypes(include='number').columns

def iqr_outlier_counts(dataframe, columns):
    results = {}
    for col in columns:
        q1 = dataframe[col].quantile(0.25)
        q3 = dataframe[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        outliers = dataframe[(dataframe[col] < lower) | (dataframe[col] > upper)]
        results[col] = len(outliers)
    return pd.Series(results).sort_values(ascending=False)

iqr_outlier_counts(df, numeric_cols).head(10)
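If the counts suggest clipping, one option is to cap values at the same IQR fences instead of dropping rows. This is a sketch of that one choice under the same 1.5 × IQR rule; `clip_to_iqr_fences` is a hypothetical helper and the series is invented toy data.

```python
import pandas as pd

def clip_to_iqr_fences(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest fence."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical toy series with one extreme value.
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 1000.0])
clipped = clip_to_iqr_fences(toy)

# Q1 = 11.5, Q3 = 14.5, IQR = 3.0, so the fences are 7.0 and 19.0.
print(clipped.max())  # 19.0
```

Clipping keeps every row, which matters when the same rows carry the classification target; whether that trade-off is right depends on the column.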

## 7. Log transformation for skewed numeric features
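We apply `np.log1p` to reduce skewness and handle zeros safely, following the EDA-ML notebook. Negative values are clipped to zero first, because `log1p` returns `-inf` at -1 and `NaN` below it.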

In [None]:
skewed_cols = df[numeric_cols].skew().sort_values(ascending=False)
skewed_cols.head(10)

In [None]:
# Example: apply log1p to the most skewed numeric columns
top_skewed = skewed_cols.head(5).index
for col in top_skewed:
    df[f'log1p_{col}'] = np.log1p(df[col].clip(lower=0))

df[[*top_skewed, *[f'log1p_{c}' for c in top_skewed]]].head()
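As a quick check of what the transform does, the sketch below uses hypothetical values chosen so that `log1p` spaces them evenly. It compares skewness before and after, and shows that `np.expm1` inverts the transform, which is useful if a model is ever trained on a log-scaled target and predictions need to be mapped back.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: each is 2**k - 1, so log1p maps them to k * ln(2).
values = pd.Series([0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0, 127.0, 255.0])

logged = np.log1p(values)
restored = np.expm1(logged)  # inverse of log1p

print('skew before:', values.skew())
print('skew after :', logged.skew())
print('round trip ok:', np.allclose(restored, values))
```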

## 8. Feature/target selection
We prepare two targets: selling price (regression) and lead status (classification). Adjust the target column names to match your dataset.

In [None]:
REG_TARGET = 'selling_price'  # TODO: update if different
CLS_TARGET = 'status'          # TODO: update if different

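# Exclude the targets and any log1p columns derived from them in section 7,
# otherwise a transformed copy of the target leaks into the features.
excluded_cols = {REG_TARGET, CLS_TARGET, f'log1p_{REG_TARGET}', f'log1p_{CLS_TARGET}'}
feature_cols = [col for col in df.columns if col not in excluded_cols]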
X = df[feature_cols]
y_reg = df[REG_TARGET]
y_cls = df[CLS_TARGET]

## 9. Preprocessing pipeline
Categorical features are one-hot encoded and numeric features are scaled. This keeps the modeling pipeline consistent across regression and classification tasks.

In [None]:
categorical_cols = X.select_dtypes(include=['object', 'category']).columns
numeric_cols = X.select_dtypes(include='number').columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)
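The preprocessor above does not impute, so any missing values left after section 4 are passed on to the model. One possible variant, sketched here as an assumption and not as the method used in the original notebooks, nests a `SimpleImputer` in front of each transformer. The column names `quantity`, `thickness`, and `item_type` are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric branch: fill gaps with the median, then scale.
numeric_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# Categorical branch: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor_with_impute = ColumnTransformer(
    transformers=[
        ('num', numeric_steps, ['quantity', 'thickness']),  # hypothetical columns
        ('cat', categorical_steps, ['item_type']),          # hypothetical column
    ],
    sparse_threshold=0.0,  # always return a dense array, for easy inspection
)

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    'quantity': [1.0, np.nan, 3.0, 4.0],
    'thickness': [0.5, 0.7, np.nan, 0.9],
    'item_type': ['W', 'S', np.nan, 'W'],
})
transformed = preprocessor_with_impute.fit_transform(toy)
print(transformed.shape)  # (4, 4): two scaled numeric columns + two one-hot columns
```

Because the imputers sit inside the pipeline, their fill values are learned from the training split only when the full model is fitted.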

## 10. Regression modeling (selling price prediction)
We train a RandomForestRegressor and evaluate with MAE, RMSE, and R².

In [None]:
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)

regressor = RandomForestRegressor(random_state=42)
reg_model = Pipeline(steps=[('preprocess', preprocessor), ('model', regressor)])

reg_model.fit(X_train_reg, y_train_reg)
reg_preds = reg_model.predict(X_test_reg)

print('MAE:', mean_absolute_error(y_test_reg, reg_preds))
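# Take the square root explicitly: the `squared=False` argument was deprecated
# in scikit-learn 1.4 and removed in later releases.
print('RMSE:', np.sqrt(mean_squared_error(y_test_reg, reg_preds)))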
print('R2:', r2_score(y_test_reg, reg_preds))
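A single 80/20 split gives one noisy estimate of each metric. As a sketch of how cross-validation could complement it, the block below scores a random forest with 5-fold CV on synthetic data from `make_regression`. The data, the reduced `n_estimators=50`, and the variable names are assumptions chosen so the example runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real feature matrix and selling price.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers are "higher is better", so MAE is reported negated.
neg_mae = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='neg_mean_absolute_error')
mae_per_fold = -neg_mae

print('MAE per fold:', np.round(mae_per_fold, 2))
print('Mean MAE:', mae_per_fold.mean(), '+/-', mae_per_fold.std())
```

With the real data, passing the full `reg_model` pipeline to `cross_val_score` would refit the preprocessing inside each fold.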

## 11. Classification modeling (lead outcome prediction)
We train a RandomForestClassifier and evaluate with a classification report and confusion matrix.

In [None]:
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X, y_cls, test_size=0.2, random_state=42, stratify=y_cls
)

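from sklearn.base import clone

classifier = RandomForestClassifier(random_state=42)
# Use a fresh copy of the preprocessor: Pipeline fits its steps in place, so sharing
# one object would refit the regression pipeline's preprocessor on this split.
cls_model = Pipeline(steps=[('preprocess', clone(preprocessor)), ('model', classifier)])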

cls_model.fit(X_train_cls, y_train_cls)
cls_preds = cls_model.predict(X_test_cls)

print(classification_report(y_test_cls, cls_preds))
confusion_matrix(y_test_cls, cls_preds)
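The raw confusion matrix is an unlabeled array, with rows for true classes and columns for predicted classes. A small sketch of wrapping it in a labeled DataFrame follows; the class names `Won` and `Lost` and the predictions are hypothetical, so substitute the actual values of the status column.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for illustration.
y_true = ['Won', 'Lost', 'Won', 'Won', 'Lost', 'Won']
y_pred = ['Won', 'Won', 'Won', 'Lost', 'Lost', 'Won']

labels = ['Lost', 'Won']  # fixes the row/column order explicitly
cm = confusion_matrix(y_true, y_pred, labels=labels)

cm_table = pd.DataFrame(
    cm,
    index=[f'true_{label}' for label in labels],
    columns=[f'pred_{label}' for label in labels],
)
print(cm_table)
```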

## 12. Next steps
- Replace placeholder paths with the real data location.
- Tune models (e.g., hyperparameters, cross-validation).
- Add domain-specific feature engineering for price and lead classification.
- Save the best models for downstream use.
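
For the last item, one common option is `joblib`, which is installed as a scikit-learn dependency. The sketch below trains a small model on synthetic data, saves it to a temporary directory, and reloads it. The file name `reg_model.joblib` and the data are assumptions for illustration; a fitted `Pipeline` can be saved the same way.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data so the example runs on its own.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'reg_model.joblib')  # hypothetical file name
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print('Predictions match after reload:', same_predictions)
```

Saved models are generally only safe to reload with the same scikit-learn version that produced them, so it helps to record the version alongside the file.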