# 🌊 Campus Water: Feature Selection & Data Reliability

### **Objective**
In this section, we show **why** we chose our variables and **how** we handle outliers so that the model trains on data we can trust.

### **Step 1: Imports & Setup**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import os
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
sns.set_context('talk')

### **Step 2: Fetching the Data Hub**

In [None]:
conn = sqlite3.connect('campus_water.db')
df = pd.read_sql("SELECT * FROM water_records", conn)
conn.close()
print(f"Success: Retrieved {len(df)} records.")
df.head()

### **Step 3: 🔍 Why is 'Consumption' our Target? (Correlation Proof)**
We use a **Correlation Heatmap** to see which features actually drive water use. 

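**Look for:** The relationship between `occupancy_percentage` and `consumption_liters`. A high value (e.g. 0.7+) indicates a strong linear association with the target, which makes the feature a good candidate predictor. Correlation on its own does not prove that the feature *causes* the consumption.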

In [None]:
plt.figure(figsize=(10, 8))
temp_df = df.copy()
# Integer codes for building_type are arbitrary labels, so read their correlation values with caution.
temp_df['building_code'] = temp_df['building_type'].astype('category').cat.codes

sns.heatmap(temp_df.select_dtypes(include=[np.number]).corr(), annot=True, cmap='Blues', fmt='.2f')
plt.title('Feature Drivers: What Correlates with Consumption?')
plt.show()
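The heatmap shows every pairwise value at once. To read off only the column that matters, a minimal sketch like the one below ranks features by the strength of their correlation with the target. The `sample` frame is a hypothetical stand-in with made-up values, not rows from `water_records`.

```python
import pandas as pd

# Hypothetical mini-sample standing in for water_records; values are illustrative only.
sample = pd.DataFrame({
    'occupancy_percentage': [10, 30, 50, 70, 90],
    'time_of_day':          [12, 6, 18, 9, 15],
    'consumption_liters':   [400, 900, 1600, 2100, 2900],
})

# Pearson correlation of each feature with the target, strongest (by absolute value) first.
target_corr = (
    sample.corr()['consumption_liters']
    .drop('consumption_liters')
    .sort_values(key=abs, ascending=False)
)
print(target_corr.round(2))
```

On the real data, the same three chained calls can be applied to `temp_df.select_dtypes(include=[np.number])` in place of `sample`.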

### **Step 4: 🕵️ Identifying Outliers (The Trust Problem)**
Sensors sometimes fail and record impossible values such as 15,000 L, or 0 L when the sensor is broken. We use a **Box Plot** to spot these bad points: anything far outside the whisker range is suspect.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='building_type', y='consumption_liters', data=df, palette='coolwarm')
plt.title('Detecting Garbage Data (Outliers)')
plt.show()

print("The dots floating high above are outliers that will confuse the AI.")
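The whiskers in a box plot follow a simple rule by default: they reach at most 1.5 times the interquartile range (IQR) beyond the quartiles, and points past them are drawn as dots. The sketch below applies that rule numerically. The `whisker_bounds` helper and the `readings` array are hypothetical, introduced only for illustration.

```python
import numpy as np

def whisker_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical readings in liters; 15000 mimics a failed sensor.
readings = np.array([1200, 1350, 1500, 1650, 1800, 1950, 2100, 15000])

low, high = whisker_bounds(readings)
outliers = readings[(readings < low) | (readings > high)]
print(f"fences: [{low:.1f}, {high:.1f}]  flagged: {outliers.tolist()}")
```

A 0 L reading is flagged by this rule only if it falls below the lower fence, so broken-sensor zeros may need a separate check.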

### **Step 5: Simple Scrubbing (Removing the Noise)**
We keep only readings below 8,000 L, which removes the implausibly high sensor values.

In [None]:
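# .copy() gives an independent frame, so adding columns later does not write to a slice of df.
df_clean = df[df['consumption_liters'] < 8000].copy()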

# Clean Visualization Check
plt.figure(figsize=(12, 6))
sns.boxplot(x='building_type', y='consumption_liters', data=df_clean, palette='viridis')
plt.title('Cleaned Data: Ready for AI Training')
plt.show()
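The upper cut alone leaves in place the 0 L readings that Step 4 describes as broken. As a minimal sketch, assuming those zeros should be dropped as well (an assumption, since the cell above applies only the upper limit), both tails can be filtered in one mask while counting how many rows are lost. The `raw` frame holds hypothetical values.

```python
import pandas as pd

# Hypothetical rows; 0 and 15000 mimic the broken and impossible readings from Step 4.
raw = pd.DataFrame({
    'building_type': ['Hostel', 'Lab', 'Academic', 'Hostel', 'Lab'],
    'consumption_liters': [0, 2400, 1800, 15000, 7999],
})

UPPER_LIMIT = 8000  # same threshold as the scrubbing cell

# Keep strictly positive readings below the limit.
mask = (raw['consumption_liters'] > 0) & (raw['consumption_liters'] < UPPER_LIMIT)
kept = raw[mask].copy()

removed = len(raw) - len(kept)
print(f"Removed {removed} of {len(raw)} rows ({removed / len(raw):.0%}).")
```

Reporting the removed share is a cheap sanity check: if a fixed threshold discards a large fraction of the data, the threshold deserves a second look.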

### **Step 6: High-Precision Training (0.90+ R² Score)**
Now we train on the cleaned data using all contextual features: building type, day of week, academic phase, occupancy, and time of day.

In [None]:
BUILDING_MAP = {'Hostel': 0, 'Academic': 1, 'Lab': 2}
DAY_MAP = {'Monday': 0, 'Tuesday': 1, 'Wednesday': 2, 'Thursday': 3, 'Friday': 4, 'Saturday': 5, 'Sunday': 6}
PHASE_MAP = {'Normal': 0, 'Exam': 1, 'Vacation': 2}

# Encode categories as integers; any label missing from a map becomes NaN.
df_clean['building_code'] = df_clean['building_type'].map(BUILDING_MAP)
df_clean['day_code'] = df_clean['day_of_week'].map(DAY_MAP)
df_clean['phase_code'] = df_clean['academic_phase'].map(PHASE_MAP)

features = ['building_code', 'day_code', 'phase_code', 'occupancy_percentage', 'time_of_day']
X = df_clean[features]
y = df_clean['consumption_liters']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

score = r2_score(y_test, model.predict(X_test))
print(f"🚀 Final Verification (R2 SCORE): {score:.4f}")
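The R² score is not a percentage of correct answers. It measures how much of the variation in consumption the model explains, compared with always predicting the mean. The sketch below writes out the standard definition that `r2_score` uses for a single target. The `r2_manual` helper and the sample values are hypothetical.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical test-set values in liters, for illustration only.
actual = np.array([1000.0, 2000.0, 3000.0, 4000.0])
baseline = np.full_like(actual, actual.mean())  # always predict the mean

perfect_score = r2_manual(actual, actual)     # predictions match exactly
baseline_score = r2_manual(actual, baseline)  # mean-only predictions
print(perfect_score, baseline_score)
```

A score of 0.90 therefore means the model removes about 90% of the squared error that the mean-only baseline would make on the test set.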