<a href="https://colab.research.google.com/github/NdumisoButhelezi/00-Login/blob/master/Planet09AI_Hackathon.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🌟 Planet 09 AI × ArmLab Hackathon Notebook**

**Objective:** Build and enhance AI models for DUT Clubs & Societies events.
**Participants:** FAI Students / Planet 09 AI Members
**Duration:** 1 Day

---

## **1️⃣ Install & Import Libraries**

In [1]:
!pip install xgboost --quiet

import pandas as pd
import numpy as np
import json
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.utils.class_weight import compute_class_weight

---

**Before you start get the data by running this script**

In [None]:
# -----------------------------
# Starter Notebook - Zindi Booking Dataset Pipeline
# -----------------------------

# 1️⃣ Install libraries
!pip install xgboost scikit-learn --quiet

# -----------------------------
# 2️⃣ Import libraries
# -----------------------------
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import LabelEncoder

# -----------------------------
# 3️⃣ Load data from Google Sheet
# -----------------------------
sheet_url = 'https://docs.google.com/spreadsheets/d/14m0YbtfwWO81bE-zHsFdfrvRMB_Rg_lh/edit?usp=drive_link&ouid=116988699228228254119&rtpof=true&sd=true'
csv_url = sheet_url.replace('/edit?usp=drive_link&ouid=116988699228228254119&rtpof=true&sd=true', '/export?format=csv')
df = pd.read_csv(csv_url)
display(df.head())

# -----------------------------
# 4️⃣ Detect target column automatically
# -----------------------------
# Heuristic: column containing 'status' or 'target'
possible_targets = [col for col in df.columns if 'status' in col.lower() or 'target' in col.lower()]
if len(possible_targets) == 0:
    raise ValueError("No target column found! Please specify manually.")
target_col = possible_targets[0]
print("Target column detected:", target_col)

# -----------------------------
# 5️⃣ Split Train/Test (70/30) and assign IDs
# -----------------------------
train, test = train_test_split(df, test_size=0.3, random_state=42)

def generate_ids(df):
    np.random.seed(42)
    ids = ['ID_' + ''.join(np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'), 6)) for _ in range(len(df))]
    df = df.copy()
    df['ID'] = ids
    return df

train = generate_ids(train)
test = generate_ids(test)

# Save Train and Test CSVs
train.to_csv("Train.csv", index=False)
test.to_csv("Test.csv", index=False)
print("Train.csv and Test.csv saved!")

# -----------------------------
# 6️⃣ Generate VariableDefinitions.csv
# -----------------------------
variable_definitions = pd.DataFrame({
    "Variable Name": train.columns,
    "Description": ["Auto-generated description" if col != target_col else "Target variable for prediction" for col in train.columns]
})
variable_definitions.to_csv("VariableDefinitions.csv", index=False)
print("VariableDefinitions.csv generated!")

# -----------------------------
# 7️⃣ Generate Reference.csv
# -----------------------------
reference = test[['ID', target_col]].copy()
reference.to_csv("Reference.csv", index=False)
print("Reference.csv generated!")

# -----------------------------
# 8️⃣ Prepare Features & Target
# -----------------------------
X = train.drop(columns=["ID", target_col])
y = train[target_col]

# Encode categorical features
X = pd.get_dummies(X)
test_features = pd.get_dummies(test.drop(columns=["ID"]))
test_features = test_features.reindex(columns=X.columns, fill_value=0)

# -----------------------------
# 9️⃣ Train/Test split for validation
# -----------------------------
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# -----------------------------
# 🔟 Train XGBoost baseline model
# -----------------------------
if y.nunique() <= 20:  # Classification
    # Encode target labels
    le = LabelEncoder()
    y_train_encoded = le.fit_transform(y_train)
    y_val_encoded = le.transform(y_val)

    print("Classes mapping:", dict(zip(le.classes_, le.transform(le.classes_))))

    model = XGBClassifier(use_label_encoder=False, eval_metric="mlogloss")
    model.fit(X_train, y_train_encoded)

    # Validation predictions
    preds_val_encoded = model.predict(X_val)
    preds_val = le.inverse_transform(preds_val_encoded)

    # Test predictions
    preds_test_encoded = model.predict(test_features)
    preds_test = le.inverse_transform(preds_test_encoded)

    # Evaluation Metrics
    precision = precision_score(y_val, preds_val, average='weighted')
    recall = recall_score(y_val, preds_val, average='weighted')
    f1 = f1_score(y_val, preds_val, average='weighted')
    accuracy = accuracy_score(y_val, preds_val)

    print(f"\nValidation Accuracy: {accuracy:.4f}")
    print(f"Validation Precision: {precision:.4f}")
    print(f"Validation Recall: {recall:.4f}")
    print(f"Validation F1 Score: {f1:.4f}")

else:  # Regression
    model = XGBRegressor()
    model.fit(X_train, y_train)

    preds_val = model.predict(X_val)
    preds_test = model.predict(test_features)

    mae = mean_absolute_error(y_val, preds_val)
    rmse = mean_squared_error(y_val, preds_val, squared=False)

    print(f"Validation MAE: {mae:.4f}")
    print(f"Validation RMSE: {rmse:.4f}")

# -----------------------------
# 1️⃣1️⃣ Generate SampleSubmission.csv
# -----------------------------
sample_submission = pd.DataFrame({"ID": test["ID"], target_col: preds_test})
sample_submission.to_csv("SampleSubmission.csv", index=False)
print("SampleSubmission.csv generated successfully!")
display(sample_submission.head())

# -----------------------------
# ✅ Summary of generated files
# -----------------------------
print("\nFiles ready for Zindi:")
print("- Train.csv")
print("- Test.csv")
print("- Reference.csv")
print("- SampleSubmission.csv")
print("- VariableDefinitions.csv")


Unnamed: 0,Booking ID,Event Name,Club,Venue,Campus,Date,Start Time,End Time,Status,Attendees,Created At
0,AROTCJwNJeUsOMQV2EFT,Prayer meeting,First Love Church,Ritson Campus DC1010,Durban,2025-10-03,18:00,19:00,confirmed,40,2025-09-05 08:39
1,jdwBUiRDdh0f6QH1gGNH,Prayer meeting,First Love Church,DC1012,Durban,2025-09-29,18:00,19:00,confirmed,40,2025-09-05 08:37
2,xl0OfpOcp8BdpLwUZkYZ,Prayer meeting,First Love Church,DC1012,Durban,2025-09-22,18:00,19:00,confirmed,40,2025-09-05 08:37
3,E6MajSK4qBB3xGlePZrp,Prayer meeting,First Love Church,DC1012,Durban,2025-09-15,18:00,19:00,confirmed,50,2025-09-05 08:35
4,4pz9GWByJJdwhTY7AT60,Sunday Service,First Love Church,Ritson Campus DC1010,Durban,2025-10-05,11:00,14:00,confirmed,50,2025-09-05 08:33


Target column detected: Status
Train.csv and Test.csv saved!
VariableDefinitions.csv generated!
Reference.csv generated!
Classes mapping: {'cancelled': np.int64(0), 'confirmed': np.int64(1), 'pending': np.int64(2), 'rejected': np.int64(3)}


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)



Validation Accuracy: 1.0000
Validation Precision: 1.0000
Validation Recall: 1.0000
Validation F1 Score: 1.0000
SampleSubmission.csv generated successfully!


Unnamed: 0,ID,Status
30,ID_2OHUSW,confirmed
126,ID_KKX9XC,confirmed
199,ID_VBX3BU,confirmed
142,ID_6LVY01,confirmed
253,ID_POCGUI,confirmed



Files ready for Zindi:
- Train.csv
- Test.csv
- Reference.csv
- SampleSubmission.csv
- VariableDefinitions.csv


## **2️⃣ Upload Data Files**

**Instructions:** Upload `Train.csv` and `Test.csv`.

In [3]:
from google.colab import files
uploaded = files.upload()
print("Uploaded files:", list(uploaded.keys()))

Saving Train (2).csv to Train (2).csv
Uploaded files: ['Train (2).csv']


---

## **3️⃣ Load and Preview Data**

In [8]:
train_df = pd.read_csv("Train (2).csv")
test_df  = pd.read_csv("Test.csv")

print("Train Shape:", train_df.shape)
print("Test Shape :", test_df.shape)

display(train_df.head())
display(test_df.head())

Train Shape: (194, 12)
Test Shape : (84, 12)


Unnamed: 0,Booking ID,Event Name,Club,Venue,Campus,Date,Start Time,End Time,Status,Attendees,Created At,ID
0,5EyGlsCHK9b8R5FoOU3z,Meeting,University of Life Changers,DC1013,Durban,2025-09-11,19:00,22:00,confirmed,50,2025-08-29 17:57,ID_2OHUSW
1,7xELtdzs2fcKGAcMrf1x,Sunday Service,AFM campus ministries,DB0001,Durban,2025-08-17,13:00,15:30,confirmed,100,2025-07-23 09:14,ID_KKX9XC
2,W4DUvdBOE5DKXMAOUGJC,Clash of the Choirs (Clash Bash Round),University of Life Changers,Mansfield Hall,Durban,2025-08-21,19:00,22:00,rejected,300,2025-07-30 19:57,ID_VBX3BU
3,LoMFB8aqecgEUr0P1Tyj,ACTS Wednesday service,Association of Catholic Tertiary Students(ACTS),BC0226,Durban,2025-09-03,19:00,22:00,confirmed,50,2025-08-24 17:38,ID_6LVY01
4,AzHeClynyMBDjbPr3YWC,Wednesday services,Student Christian Organisation,DB0001,Durban,2025-05-21,19:00,21:30,confirmed,120,2025-05-03 16:05,ID_POCGUI


Unnamed: 0,Booking ID,Event Name,Club,Venue,Campus,Date,Start Time,End Time,Status,Attendees,Created At,ID
0,pjYOq5D6LDxtZfRMWwui,Prayer Meeting,First Love Church,DC1010,Durban,2025-09-19,18:00,19:00,confirmed,50,2025-08-30 07:07,ID_2OHUSW
1,fhPMmCGCzavPuOruwqPA,CRC Takeover - Revival Night,Rise for Christ Campus Impact,BC0202,Durban,2025-08-13,19:00,21:30,confirmed,150,2025-08-04 23:41,ID_KKX9XC
2,uwuxcnYrpM8xvEhHfidq,Acts Wednesday,Association of Catholic Tertiary Students(ACTS),Ab0402,Durban,2025-07-30,19:00,22:00,confirmed,50,2025-07-09 11:08,ID_VBX3BU
3,AsDT4oGBXwAkxJpq942G,ACTS Wednesday Service,Association of Catholic Tertiary Students(ACTS),BC0209,Durban,2025-08-13,19:00,22:00,confirmed,50,2025-07-31 00:41,ID_6LVY01
4,usaK2XMbEWPKVAJ6otVN,Sunday service,First Love Church,DC1010,Durban,2025-05-25,11:00,14:00,confirmed,50,2025-05-17 21:32,ID_POCGUI


**Tip:** Look at the columns and check for missing values.

---

## **4️⃣ Handle Missing Values**

In [9]:
train_df['Club'].fillna('Unknown', inplace=True)
test_df['Club'].fillna('Unknown', inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  train_df['Club'].fillna('Unknown', inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  test_df['Club'].fillna('Unknown', inplace=True)


**Challenge #1:** Think of ways missing values in other columns could affect predictions.

---

## **5️⃣ Feature Engineering**

In [16]:
def feature_engineering(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['day_of_week'] = df['Date'].dt.dayofweek
    df['month'] = df['Date'].dt.month
    df['is_weekend'] = df['day_of_week'].isin([5,6]).astype(int)

    df['start_minutes'] = pd.to_datetime(df['Start Time']).dt.hour * 60 + pd.to_datetime(df['Start Time']).dt.minute
    df['end_minutes'] = pd.to_datetime(df['End Time']).dt.hour * 60 + pd.to_datetime(df['End Time']).dt.minute
    df['event_duration'] = df['end_minutes'] - df['start_minutes']

    # Drop original datetime columns
    df = df.drop(columns=['Date', 'Start Time', 'End Time', 'Created At'])

    return df

train_df = feature_engineering(train_df)
test_df  = feature_engineering(test_df)

  df['start_minutes'] = pd.to_datetime(df['Start Time']).dt.hour * 60 + pd.to_datetime(df['Start Time']).dt.minute
  df['start_minutes'] = pd.to_datetime(df['Start Time']).dt.hour * 60 + pd.to_datetime(df['Start Time']).dt.minute
  df['end_minutes'] = pd.to_datetime(df['End Time']).dt.hour * 60 + pd.to_datetime(df['End Time']).dt.minute
  df['end_minutes'] = pd.to_datetime(df['End Time']).dt.hour * 60 + pd.to_datetime(df['End Time']).dt.minute
  df['start_minutes'] = pd.to_datetime(df['Start Time']).dt.hour * 60 + pd.to_datetime(df['Start Time']).dt.minute
  df['start_minutes'] = pd.to_datetime(df['Start Time']).dt.hour * 60 + pd.to_datetime(df['Start Time']).dt.minute
  df['end_minutes'] = pd.to_datetime(df['End Time']).dt.hour * 60 + pd.to_datetime(df['End Time']).dt.minute
  df['end_minutes'] = pd.to_datetime(df['End Time']).dt.hour * 60 + pd.to_datetime(df['End Time']).dt.minute


**Challenge #2:** Add your own features, like `attendance_ratio` or `venue_popularity`.

---

## **6️⃣ Prepare Features & Encode Categorical Data**

In [17]:
TARGET = "Status"
ID_COL = "Booking ID"

common_cols = list(set(train_df.columns).intersection(set(test_df.columns)))
features = [c for c in common_cols if c not in [TARGET, ID_COL]]

full_df  = pd.concat([train_df[features], test_df[features]], axis=0)

cat_cols = full_df.select_dtypes(include=["object"]).columns
full_encoded = pd.get_dummies(full_df, columns=cat_cols)

X = full_encoded.iloc[:len(train_df)]
X_test = full_encoded.iloc[len(train_df):]

le = LabelEncoder()
y = le.fit_transform(train_df[TARGET])

---

## **7️⃣ Handle Class Imbalance**

In [18]:
classes = np.unique(y)
weights = compute_class_weight('balanced', classes=classes, y=y)
class_weights = dict(zip(classes, weights))
print("Class weights:", class_weights)

Class weights: {np.int64(0): np.float64(12.125), np.int64(1): np.float64(0.26795580110497236), np.int64(2): np.float64(16.166666666666668), np.int64(3): np.float64(8.083333333333334)}


**Tip:** Rare classes like `rejected` or `cancelled` might need special attention.

---

## **8️⃣ Train/Validation Split**

In [19]:
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

---

## **9️⃣ Model Training**

In [20]:
model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    random_state=42,
    eval_metric='mlogloss'
)
model.fit(X_train, y_train)

**Challenge #3:** Experiment with hyperparameters or try other models like LightGBM.

---

## **🔟 Validation Metrics**

In [21]:
y_pred = model.predict(X_valid)
print("Accuracy :", accuracy_score(y_valid, y_pred))
print("Precision:", precision_score(y_valid, y_pred, average='weighted'))
print("Recall   :", recall_score(y_valid, y_pred, average='weighted'))
print("F1 Score :", f1_score(y_valid, y_pred, average='weighted'))

Accuracy : 0.9743589743589743
Precision: 0.9494109494109495
Recall   : 0.9743589743589743
F1 Score : 0.9617140850017563


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


**Tip:** F1-score is important when classes are imbalanced.

---

## **1️⃣1️⃣ Generate Predictions for Submission**

In [22]:
test_preds = model.predict(X_test)
test_labels = le.inverse_transform(test_preds)

sample_submission = pd.DataFrame({
    "ID": test_df[ID_COL],
    TARGET: test_labels
})
sample_submission.to_csv("SampleSubmission.csv", index=False)
print("✅ SampleSubmission.csv saved!")

✅ SampleSubmission.csv saved!


---

## **1️⃣2️⃣ Enhanced JSON Export for Ollama**

In [23]:
EXTRA_COLS = ["Event Name", "Venue", "day_of_week", "month", "event_duration"]
valid_extra_cols = [c for c in EXTRA_COLS if c in test_df.columns or c in X_test.columns]

enhanced_json = {
    "predictions": [
        {
            "Booking ID": row["Booking ID"],
            "Predicted_Status": pred,
            **{col: row[col] if col in row else X_test.iloc[idx][col] for col in valid_extra_cols}
        }
        for idx, (row, pred) in enumerate(zip(test_df.to_dict(orient="records"), test_labels))
    ]
}

with open("ollama_predictions_enhanced.json", "w") as f:
    json.dump(enhanced_json, f, indent=4)
print("✅ ollama_predictions_enhanced.json saved!")

✅ ollama_predictions_enhanced.json saved!


**Challenge #4:** Add extra columns to the JSON and see how it improves context for Ollama’s AI applications.

---

### **🔹 Hackathon Learning Points**

1. **Feature Engineering:** Extract temporal, duration, or attendance features.
2. **Class Imbalance:** Learn weighted F1 and rare class handling.
3. **Model Tuning:** Experiment with XGBoost or other classifiers.
4. **AI Integration:** Use enhanced JSON as context for Ollama for smarter AI apps.

---