# 🧱 KDD — Text-Aware Colab Template
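The template detects a free-text column automatically. For classification with a text column present, it trains TF-IDF + Logistic Regression on that column; otherwise (clustering, or no text column found) it falls back to generic tabular processing.

Phases, following KDD (Knowledge Discovery in Databases): Selection → Preprocessing → Transformation → Data Mining → Interpretation/Evaluation.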

## Selection — Upload and choose columns

In [1]:
#@title Upload data (encoding-friendly) and select useful columns
from google.colab import files
import pandas as pd, io
up = files.upload()
assert up, 'No file uploaded.'
fname = list(up.keys())[0]
print('Using file:', fname)
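def read_any(b, name=''):
    """Read Excel files by extension; otherwise parse CSV as UTF-8, then Latin-1."""
    if name.lower().endswith(('.xls', '.xlsx')):
        return pd.read_excel(io.BytesIO(b))
    try:
        return pd.read_csv(io.BytesIO(b), encoding='utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so decoding itself cannot fail here.
        return pd.read_csv(io.BytesIO(b), encoding='latin1')
df = read_any(up[fname], fname)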
print('Columns:', df.columns.tolist())
subset = input('Enter comma-separated columns to KEEP (empty = keep all): ').strip()
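if subset:
    keep = [c.strip() for c in subset.split(',') if c.strip()]
    # Fail early with a readable message instead of a KeyError on a mistyped name.
    missing = [c for c in keep if c not in df.columns]
    assert not missing, f'Unknown columns: {missing}'
    df = df[keep]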
print('Shape after selection:', df.shape)

Saving spam.csv to spam.csv
Using file: spam.csv
Columns: ['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']
Enter comma-separated columns to KEEP (empty = keep all): v1, v2
Shape after selection: (5572, 2)
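
The encoding fallback used in the upload cell can be tried in isolation, without Colab. The following is a minimal sketch under stated assumptions: `read_csv_bytes` and its `encodings` parameter are illustrative names that do not appear in the notebook, and the two-row payload is invented for the demonstration.

```python
import io

import pandas as pd


def read_csv_bytes(raw: bytes, encodings=("utf-8", "latin1")) -> pd.DataFrame:
    """Try each encoding in order and return the first successful parse.

    Hypothetical helper for illustration; not part of the notebook.
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw), encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err
    raise ValueError(f"Could not decode with any of {encodings}") from last_error


# A tiny Latin-1 payload: the byte 0xA3 (pound sign) is not valid UTF-8,
# so the first attempt fails and the second one succeeds.
raw = "v1,v2\nspam,Win \u00a3100 now\nham,See you at 5\n".encode("latin1")
df_demo = read_csv_bytes(raw)
print(df_demo)
```

Catching only `UnicodeDecodeError` keeps genuine parsing problems (for example a malformed row) visible instead of silently retrying with another encoding.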


## Preprocessing — Basic cleaning

In [2]:
#@title Drop empty columns and safe renames (SMS spam friendly)
df = df.dropna(axis=1, how='all')
df = df.rename(columns={'v1':'label', 'v2':'message'})
print('Columns now:', df.columns.tolist())
display(df.head())

Columns now: ['label', 'message']


|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
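
Before modelling, it can help to check class balance, missing messages, and duplicate rows, since the classification report further down shows far fewer spam than ham rows in the test split. The sketch below uses a small invented DataFrame (`toy`) as a stand-in for the renamed `df`; the variable names are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the renamed DataFrame (columns 'label' and 'message').
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"],
    "message": ["ok", "see you", "win cash", "ok", "lunch?", "free prize", None, "call me"],
})

# Share of each class; with a strong skew, accuracy alone can mislead.
class_share = toy["label"].value_counts(normalize=True)

# Rows with a missing message and fully duplicated rows are common cleanup targets.
n_missing = int(toy["message"].isna().sum())
n_duplicates = int(toy.duplicated().sum())

print(class_share)
print("missing messages:", n_missing, "| duplicate rows:", n_duplicates)
```

On the real data the same three lines would run against `df` once the columns have been renamed.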


## Transformation — Detect text column

In [3]:
#@title Find a free-text column heuristically
text_col = None
obj_cols = [c for c in df.columns if df[c].dtype == 'object']
for c in obj_cols:
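    sample = df[c].dropna().astype(str).head(200)
    avg_len = sample.str.len().mean() if len(sample) > 0 else 0
    # Treat a column as free text when its values average more than 20 characters
    # and its name does not look like a target column.
    if avg_len > 20 and c.lower() not in ('label','target','class','y'):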
        text_col = c
        break
print('Detected text column:', text_col)

Detected text column: message
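
The heuristic above can be wrapped in a function so it is easy to test on small inputs. This is a sketch, not the notebook's code: `detect_text_column`, `min_avg_len`, and `sample_size` are hypothetical names, and unlike the cell above it also accepts pandas' dedicated string dtype in addition to `object`.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

TARGET_NAMES = ("label", "target", "class", "y")


def detect_text_column(frame, min_avg_len=20, sample_size=200):
    """Return the first string-like column whose values average more than
    min_avg_len characters, skipping likely target columns; None if no match."""
    for col in frame.columns:
        series = frame[col]
        if not (is_object_dtype(series) or is_string_dtype(series)):
            continue
        if col.lower() in TARGET_NAMES:
            continue
        sample = series.dropna().astype(str).head(sample_size)
        if len(sample) > 0 and sample.str.len().mean() > min_avg_len:
            return col
    return None


toy = pd.DataFrame({
    "label": ["ham", "spam"],
    "sender": ["alice", "bob"],
    "message": [
        "Are we still meeting for lunch tomorrow at noon?",
        "Congratulations, you have won a free holiday! Reply now.",
    ],
})
print(detect_text_column(toy))                         # message
print(detect_text_column(toy[["label", "sender"]]))    # None
```

Short categorical columns such as `sender` fall below the length threshold, which is why only `message` qualifies.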


## Data Mining — Choose task

In [4]:
#@title Classification/Clustering with text-aware path
mode = input('Type "classification" or "clustering": ').strip().lower()
if mode == 'classification' and text_col is not None:
    print('Using TF-IDF + Logistic Regression on', text_col)
    label_candidates = [c for c in df.columns if c.lower() in ('label','target','class','y')]
    if label_candidates:
        target = label_candidates[0]
        print('Using target column:', target)
    else:
        print('No obvious target; choose one from:', df.columns.tolist())
        target = input('Enter TARGET column: ').strip()
    y_raw = df[target].astype(str).str.lower()
    # Map the usual spam labels explicitly; encode any other label set as category codes.
    known = {'ham':0, 'spam':1, '0':0, '1':1}
    if y_raw.isin(list(known)).all():
        y = y_raw.map(known).astype(int)
    else:
        y = y_raw.astype('category').cat.codes
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report
    X_text = df[text_col].astype(str)
    # Note: TF-IDF is fit on all rows before the split, so the vocabulary and IDF
    # weights see the test messages; fit on the training split for a stricter estimate.
    tfidf = TfidfVectorizer(stop_words='english', max_features=10000, ngram_range=(1,2))
    X = tfidf.fit_transform(X_text)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y if len(set(y))>1 else None
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, preds))
    print('\nClassification report:\n', classification_report(y_test, preds, digits=4))
else:
    print('Using generic tabular processing...')
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    X = pd.get_dummies(df, drop_first=True)
    num_cols = X.select_dtypes(include=['number']).columns
    if len(num_cols) > 0:
        scaler = StandardScaler()
        X[num_cols] = scaler.fit_transform(X[num_cols])
        print('Scaled numeric columns:', list(num_cols))
    else:
        print('No numeric columns to scale; skipping.')
    print('Transformed X shape:', X.shape)
    from sklearn.cluster import KMeans
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score, classification_report
    # Compare the full word: both 'classification' and 'clustering' start with 'c'.
    choice = 'classification' if mode == 'classification' else 'clustering'
    if choice == 'classification':
        target = input('Enter TARGET column (must exist): ').strip()
        y = df[target]
        X_ = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
        X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.2, random_state=42)
        clf = GradientBoostingClassifier(random_state=42)
        clf.fit(X_train, y_train)
        preds = clf.predict(X_test)
        print('Accuracy:', accuracy_score(y_test, preds))
        print('\nClassification report:\n', classification_report(y_test, preds, digits=4))
    else:
        k = int(input('Number of clusters (e.g., 3): ') or 3)
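        # Include dummy (bool) columns and fill gaps with column means, since KMeans rejects NaN.
        num_only = X.select_dtypes(include=['number', 'bool']).astype(float)
        num_only = num_only.fillna(num_only.mean())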
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(num_only)
        labels = km.labels_
        import numpy as np
        print('Cluster counts:', dict(zip(*np.unique(labels, return_counts=True))))
        try:
            from sklearn.decomposition import PCA
            import matplotlib.pyplot as plt
            pts = PCA(n_components=2).fit_transform(num_only)
            plt.figure(); plt.scatter(pts[:,0], pts[:,1], c=labels)
            plt.title('PCA of clusters'); plt.xlabel('PC1'); plt.ylabel('PC2'); plt.show()
        except Exception as e:
            print('Could not plot clusters:', e)

Type "classification" or "clustering": classification
Using TF-IDF + Logistic Regression on message
Using target column: label




Accuracy: 0.9488789237668162

Classification report:
               precision    recall  f1-score   support

           0     0.9443    1.0000    0.9713       966
           1     1.0000    0.6174    0.7635       149

    accuracy                         0.9489      1115
   macro avg     0.9721    0.8087    0.8674      1115
weighted avg     0.9517    0.9489    0.9436      1115
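
The report can be turned back into raw counts, which makes the weak spot easier to see. The sketch below uses only the numbers printed above (supports and recalls) and standard metric definitions; the variable names are illustrative.

```python
# Values copied from the classification report above (class 0 = ham, class 1 = spam).
support_ham, support_spam = 966, 149
recall_ham, recall_spam = 1.0000, 0.6174

# Recall = correctly predicted / support, so the counts follow by rounding.
tp_spam = round(recall_spam * support_spam)   # spam correctly flagged
fn_spam = support_spam - tp_spam              # spam predicted as ham
tn_ham = round(recall_ham * support_ham)      # ham correctly kept
fp_spam = support_ham - tn_ham                # ham flagged as spam

# Recompute the headline metrics from the counts as a consistency check.
accuracy = (tp_spam + tn_ham) / (support_ham + support_spam)
precision_ham = tn_ham / (tn_ham + fn_spam)
f1_spam = 2 * tp_spam / (2 * tp_spam + fp_spam + fn_spam)

print(f"spam caught: {tp_spam}, spam missed: {fn_spam}, ham flagged: {fp_spam}")
print(f"accuracy={accuracy:.4f} precision_ham={precision_ham:.4f} f1_spam={f1_spam:.4f}")
# spam caught: 92, spam missed: 57, ham flagged: 0
# accuracy=0.9489 precision_ham=0.9443 f1_spam=0.7635
```

The recomputed values agree with the report: the model never flags ham as spam, but it lets 57 of the 149 spam messages through, which the 0.9489 accuracy hides.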



## Interpretation/Evaluation — Checklist
- If text classification: discuss important n-grams and classes with lower precision/recall.
- If clustering: describe clusters qualitatively; inspect feature averages.
- Record parameters (e.g., `max_features=10000`, `ngram_range=(1,2)`, `random_state=42`).