# THE BIG PROJECT
## Papas-Piazzeria
### Flood Data Investigation Model

Maya Chai-Foo 1006946405,
Aaron Lyimo 1007483108,
Daniel Rivera Naraez 1007790455,
Aziz Yussupov 1007252759,

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

## 0. Import Data
Let's start by importing the data. We will look at the contents of each table for inspection.

Loading the web-scraped neighbourhood income data

In [2]:
# Load the data file
income_file_path = 'median_income_by_neighbourhood.csv'
income_data = pd.read_csv(income_file_path)
income_data.head()

Unnamed: 0,Neighbourhood,Median Income Before Tax
0,Toronto,97000
1,Agincourt North,89000
2,Agincourt South-Malvern West,89000
3,Malvern East,89000
4,Malvern West,89000


Finding the water levels at Black Creek and the Don River for the flooding event of interest (July 8, 2013)

In [3]:
import pandas as pd

# Load the data file
file_name = 'flow data.csv'
data = pd.read_csv(file_name, header=None, skiprows=1)

# Assign column names
data.columns = ['ID', 'PARAM', 'Date', 'Value', 'SYM']

# Convert 'Date' to datetime format and filter 'PARAM' column for valid values
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')
data = data[data['PARAM'].isin(['1', '2'])].copy()
data['PARAM'] = data['PARAM'].astype(int)

# Filter for July 8, 2013, and parameter 2 (level data)
july_8_2013_level_data = data[(data['PARAM'] == 2) & (data['Date'] == '2013-07-08')]

july_8_2013_level_data = july_8_2013_level_data[['ID', 'Date', 'Value']]

# Display the filtered data
print(july_8_2013_level_data)


            ID       Date   Value
25207  02HC027 2013-07-08   1.426
55249  02HC024 2013-07-08  13.026


Finding the baseline water levels for these rivers

In [4]:
# Filter only for PARAM = 2 (water level data) and convert 'Value' to numeric
# Take an explicit copy so the column assignment below does not modify a view of `data`
water_level_data = data[data['PARAM'] == 2].copy()
water_level_data['Value'] = pd.to_numeric(water_level_data['Value'], errors='coerce')

# Filter data for July across all years
july_water_level_data = water_level_data[water_level_data['Date'].dt.month == 7]

# Calculate the average (baseline) water level for each gauge for July
baseline_july_water_levels = july_water_level_data.groupby('ID')['Value'].mean().reset_index()
baseline_july_water_levels.columns = ['ID', 'Baseline_July_Water_Level']

# Display the result
print(baseline_july_water_levels)

        ID  Baseline_July_Water_Level
0  02HC024                  12.195959
1  02HC027                   0.381672
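
To put the event-day readings in context, the July 8 levels can be compared against the July baselines. The sketch below is an illustration rather than part of the original analysis: it rebuilds the two printed tables by hand, and the names `event_levels`, `baseline_levels`, and `Rise_Above_Baseline` are assumptions.

```python
import pandas as pd

# Values copied from the printed outputs above (units as recorded in the source file)
event_levels = pd.DataFrame({
    "ID": ["02HC027", "02HC024"],
    "Value": [1.426, 13.026],
})
baseline_levels = pd.DataFrame({
    "ID": ["02HC024", "02HC027"],
    "Baseline_July_Water_Level": [12.195959, 0.381672],
})

# Join on gauge ID so each event reading sits beside its own baseline
comparison = event_levels.merge(baseline_levels, on="ID", how="inner")

# Hypothetical column: how far the July 8 level sat above the July mean
comparison["Rise_Above_Baseline"] = (
    comparison["Value"] - comparison["Baseline_July_Water_Level"]
)
print(comparison.round(3))
```

In the notebook itself the same merge could be applied to `july_8_2013_level_data` and `baseline_july_water_levels`, after converting `Value` with `pd.to_numeric` if it is still stored as text. Note that the July baseline as computed includes the event day itself.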


Extracting the climate data

In [5]:
# Load the data file
climate_file = 'climate-daily.csv'
climate_data = pd.read_csv(climate_file)

# Convert 'LOCAL_DATE' to datetime format
climate_data['LOCAL_DATE'] = pd.to_datetime(climate_data['LOCAL_DATE'], errors='coerce')

# Filter for data from July 7th and 8th, 2013
july_7_8_data = climate_data[(climate_data['LOCAL_DATE'] == '2013-07-07') | (climate_data['LOCAL_DATE'] == '2013-07-08')]

# Select only the columns of interest: 'x', 'y', 'STATION_NAME', 'LOCAL_DATE', and 'TOTAL_PRECIPITATION'
july_7_8_selected_columns = july_7_8_data[['x', 'y', 'STATION_NAME', 'LOCAL_DATE', 'TOTAL_PRECIPITATION']]

# Display the filtered data
print(july_7_8_selected_columns)


      x          y  STATION_NAME LOCAL_DATE  TOTAL_PRECIPITATION
6 -79.4  43.666667  TORONTO CITY 2013-07-07                 38.5
7 -79.4  43.666667  TORONTO CITY 2013-07-08                 96.8
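
Because the filter keeps both July 7 and July 8, a natural follow-up is the two-day total per station. This is a minimal sketch, assuming the same column names as the printed output; the `storm` frame is rebuilt by hand and `two_day_total` is a hypothetical name.

```python
import pandas as pd

# Rows copied from the printed output above
storm = pd.DataFrame({
    "STATION_NAME": ["TORONTO CITY", "TORONTO CITY"],
    "LOCAL_DATE": pd.to_datetime(["2013-07-07", "2013-07-08"]),
    "TOTAL_PRECIPITATION": [38.5, 96.8],
})

# Sum precipitation over the two days for each station
two_day_total = storm.groupby("STATION_NAME")["TOTAL_PRECIPITATION"].sum()
print(two_day_total.round(1))
```

Grouping by `STATION_NAME` keeps the sketch usable if more stations are added to the climate file later.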


# 1. Data Cleaning and Preprocessing
Let's clean and preprocess the data so the model is trained on reliable inputs.

## 1.a First let's check for missing values

In [6]:
missing_counts = income_data.isnull().sum()
print(missing_counts)


Neighbourhood               0
Median Income Before Tax    0
dtype: int64


Looks like there are no missing values. 

In [7]:
income_data.describe()

Unnamed: 0,Median Income Before Tax
count,146.0
mean,87932.876712
std,20457.731393
min,59200.0
25%,77000.0
50%,84000.0
75%,92000.0
max,184000.0


The incomes range from 59,200 to 184,000, which appears reasonable, and no outliers are immediately apparent.
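
One way to test that impression more formally is Tukey's 1.5 × IQR rule. The notebook does not apply this rule; it is swapped in here purely as an illustration. The sketch uses only the quartiles printed by `describe()`, and the variable names are assumptions.

```python
# Quartiles and extremes copied from the describe() output above
q1, q3 = 77000.0, 92000.0
observed_min, observed_max = 59200.0, 184000.0

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr:.0f}")
print(f"Fences: [{lower_fence:.0f}, {upper_fence:.0f}]")
print(f"Min below lower fence: {observed_min < lower_fence}")
print(f"Max above upper fence: {observed_max > upper_fence}")
```

Under this rule the maximum (184,000) lies above the upper fence of 114,500, so at least one neighbourhood would be flagged. Whether that reflects a data error or simply a high-income area is a judgment for section 1.c.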

## 1.b Convert Units

## 1.c Remove outliers

## 1.d Remove duplicates
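
A minimal sketch of what this step might look like for the income table, assuming each neighbourhood should appear only once. The toy frame reuses rows from the `head()` output plus one deliberately repeated row that is not in the real data; `toy_income`, `is_repeat`, and `deduped` are hypothetical names.

```python
import pandas as pd

# First rows of the income table, plus one artificial repeat for illustration
toy_income = pd.DataFrame({
    "Neighbourhood": ["Toronto", "Agincourt North", "Malvern East", "Agincourt North"],
    "Median Income Before Tax": [97000, 89000, 89000, 89000],
})

# Key on the name only: several neighbourhoods legitimately share an income value
is_repeat = toy_income.duplicated(subset="Neighbourhood", keep="first")
deduped = toy_income[~is_repeat].reset_index(drop=True)

print(f"Repeated rows removed: {is_repeat.sum()}")
print(deduped)
```

Keying on `Neighbourhood` rather than on the income column matters here, since the `head()` output already shows four neighbourhoods with the same median income of 89,000.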

# Data Cleaning Complete Function 

# 2. Exploratory Data Analysis
Next, let's explore the dataset. 

### 2.a Creation of temporal features for data analysis

# 3. Feature Engineering
Once we've acquired a comprehensive understanding of the dataset, we can proceed with feature engineering.

### 3.a Time feature engineering

### 3.b Location Feature Engineering


### 3.c Weather Feature Engineering

### 3.d Feature engineering comparison and evaluation

# Feature Engineering Complete Function

# 4. Train The Model

## 4.a Run code through pipeline

In [None]:
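from sklearn.model_selection import train_test_split

# CleanData and feature_engineering are expected to be defined under the
# "Complete Function" headings above
FloodData = pd.read_csv('example.csv')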
train = CleanData(FloodData)
X = feature_engineering(train)

# Prepare the features and target variable
X = X.drop(columns=['TargetColumn'])  # Drop target and non-features
y = train['TargetColumn']  # Target variable
X.info()

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## 4.b Create & tune the model
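
The evaluation cell refers to `best_rf_model`, which suggests a random forest selected by some tuning procedure. The following is only a sketch of one common way to obtain such a model with scikit-learn's `GridSearchCV`. The synthetic data, the parameter grid, and the scoring choice are all assumptions, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features would come from feature_engineering()
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small illustrative grid; a real search would likely cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# Names chosen to match the evaluation cell
best_rf_model = search.best_estimator_
y_pred_best = best_rf_model.predict(X_test)
print(search.best_params_)
```

Cross-validating on the training split only keeps the 20% test split untouched until the final evaluation.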

## Evaluate Model
Let's get the predictions on our held-out test set and evaluate them.

In [None]:
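from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Predict with the tuned model from 4.b (best_rf_model must already be fitted)
y_pred_best = best_rf_model.predict(X_test)

# Evaluate the model's performance on the test data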
mae_best = mean_absolute_error(y_test, y_pred_best)
mse_best = mean_squared_error(y_test, y_pred_best)
rmse_best = np.sqrt(mse_best)  # RMSE
r2_best = r2_score(y_test, y_pred_best)

# Print evaluation metrics for the best model
print(f"Best Model MAE: {mae_best}")
print(f"Best Model MSE: {mse_best}")
print(f"Best Model RMSE: {rmse_best}")
print(f"Best Model R^2: {r2_best}")

# Get feature importances from the best model
feature_importance = best_rf_model.feature_importances_

# Create a DataFrame for visualization
feature_names = X.columns
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
})

# Sort the features by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot the top 15 most important features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(15))
plt.title('Top 15 Feature Importances')
plt.tight_layout()
plt.show()

# Predict on the test dataset