# THE BIG PROJECT
## Papas-Piazzeria
### Flood Data Investigation Model

Maya Chai-Foo 1006946405,
Aaron Lyimo 1007483108,
Daniel Rivera Naraez 1007790455,
Aziz Yussupov 1007252759,

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

# 0. Import Data
Let's start by importing the data and inspecting each table.

Loading the web-scraped neighbourhood income data

In [2]:
# Load the data file
income_file_path = 'median_income_by_neighbourhood.csv'
income_data = pd.read_csv(income_file_path)
income_data.head()

Unnamed: 0,Neighbourhood,Median Income Before Tax
0,Toronto,97000
1,Agincourt North,89000
2,Agincourt South-Malvern West,89000
3,Malvern East,89000
4,Malvern West,89000


Finding the water levels at Black Creek and the Don River for the flooding event of interest (July 8, 2013)

In [3]:
import pandas as pd

# Load the data file
file_name = 'flow data.csv'
data = pd.read_csv(file_name, header=None, skiprows=1)

# Assign column names
data.columns = ['ID', 'PARAM', 'Date', 'Value', 'SYM']

# Convert 'Date' to datetime format and filter 'PARAM' column for valid values
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')
data = data[data['PARAM'].isin(['1', '2'])].copy()
data['PARAM'] = data['PARAM'].astype(int)

# Filter for July 8, 2013, and parameter 2 (level data)
july_8_2013_level_data = data[(data['PARAM'] == 2) & (data['Date'] == '2013-07-08')]

july_8_2013_level_data = july_8_2013_level_data[['ID', 'Date', 'Value']]

# Display the filtered data
print(july_8_2013_level_data)


            ID       Date   Value
25207  02HC027 2013-07-08   1.426
55249  02HC024 2013-07-08  13.026


In [4]:
data.describe()

Unnamed: 0,PARAM,Date
count,58712.0,58712
mean,1.267492,1999-01-16 03:17:17.471044992
min,1.0,1962-10-01 00:00:00
25%,1.0,1984-09-19 18:00:00
50%,1.0,2003-05-29 00:00:00
75%,2.0,2013-06-14 06:00:00
max,2.0,2023-12-31 00:00:00
std,0.442655,


In [5]:
data.head()

Unnamed: 0,ID,PARAM,Date,Value,SYM
1,02HC027,1,1966-07-04,0.357,
2,02HC027,1,1966-07-05,0.249,
3,02HC027,1,1966-07-06,0.275,
4,02HC027,1,1966-07-07,0.337,
5,02HC027,1,1966-07-08,0.252,


Finding the baseline water levels for these rivers

In [6]:
# Filter only for PARAM = 2 (water level data) and convert 'Value' to numeric
# (.copy() avoids modifying a slice of `data` in place)
water_level_data = data[data['PARAM'] == 2].copy()
water_level_data['Value'] = pd.to_numeric(water_level_data['Value'], errors='coerce')

# Filter data for July across all years
july_water_level_data = water_level_data[water_level_data['Date'].dt.month == 7]

# Calculate the average (baseline) water level for each gauge for July
baseline_july_water_levels = july_water_level_data.groupby('ID')['Value'].mean().reset_index()
baseline_july_water_levels.columns = ['ID', 'Baseline_July_Water_Level']

# Display the result
print(baseline_july_water_levels)

        ID  Baseline_July_Water_Level
0  02HC024                  12.195959
1  02HC027                   0.381672
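To put the July 8 readings in context, the event-day levels can be compared against these July baselines. The sketch below retypes the two printed tables by hand rather than reusing the notebook's variables, and the column name `Rise_Above_Baseline` is our own label, not part of the dataset.

```python
import pandas as pd

# Values copied from the outputs printed above
event_levels = pd.DataFrame({
    'ID': ['02HC027', '02HC024'],
    'Value': [1.426, 13.026],
})
baseline = pd.DataFrame({
    'ID': ['02HC024', '02HC027'],
    'Baseline_July_Water_Level': [12.195959, 0.381672],
})

# Join on gauge ID and compute the rise above the July baseline
comparison = event_levels.merge(baseline, on='ID', how='left')
comparison['Rise_Above_Baseline'] = (
    comparison['Value'] - comparison['Baseline_July_Water_Level']
)
print(comparison.round(3))
```

With these numbers, gauge 02HC027 sits about 1.04 above its July mean and gauge 02HC024 about 0.83 above. In the notebook itself the same merge should work on `july_8_2013_level_data` and `baseline_july_water_levels`, provided `Value` is first converted to numeric as in the baseline cell.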


Extracting the climate data

In [7]:
# Load the data file
climate_file = 'climate-daily.csv'
climate_data = pd.read_csv(climate_file)

# Convert 'LOCAL_DATE' to datetime format
climate_data['LOCAL_DATE'] = pd.to_datetime(climate_data['LOCAL_DATE'], errors='coerce')

# Filter for data from July 7th and 8th, 2013
july_7_8_data = climate_data[(climate_data['LOCAL_DATE'] == '2013-07-07') | (climate_data['LOCAL_DATE'] == '2013-07-08')]

# Select only the columns of interest: 'x', 'y', 'STATION_NAME', 'LOCAL_DATE', and 'TOTAL_PRECIPITATION'
july_7_8_selected_columns = july_7_8_data[['x', 'y', 'STATION_NAME', 'LOCAL_DATE', 'TOTAL_PRECIPITATION']]

# Display the filtered data
print(july_7_8_selected_columns)

climate_data.head()

      x          y  STATION_NAME LOCAL_DATE  TOTAL_PRECIPITATION
6 -79.4  43.666667  TORONTO CITY 2013-07-07                 38.5
7 -79.4  43.666667  TORONTO CITY 2013-07-08                 96.8


Unnamed: 0,x,y,MAX_REL_HUMIDITY,MIN_REL_HUMIDITY_FLAG,MEAN_TEMPERATURE,SNOW_ON_GROUND,MAX_TEMPERATURE,TOTAL_PRECIPITATION_FLAG,TOTAL_SNOW_FLAG,MIN_REL_HUMIDITY,...,HEATING_DEGREE_DAYS,MAX_TEMPERATURE_FLAG,MIN_TEMPERATURE,TOTAL_RAIN,CLIMATE_IDENTIFIER,TOTAL_RAIN_FLAG,COOLING_DEGREE_DAYS,MAX_REL_HUMIDITY_FLAG,MIN_TEMPERATURE_FLAG,PROVINCE_CODE
0,-79.4,43.666667,56.0,,21.1,,24.1,,M,41.0,...,0.0,,18.1,,6158355,M,3.1,,,ON
1,-79.4,43.666667,84.0,,19.8,,22.1,,M,56.0,...,0.0,,17.5,,6158355,M,1.8,,,ON
2,-79.4,43.666667,,,,,,,M,,...,,M,18.5,,6158355,M,,,E,ON
3,-79.4,43.666667,94.0,,23.7,,27.0,,M,66.0,...,0.0,,20.4,,6158355,M,5.7,,,ON
4,-79.4,43.666667,95.0,,22.7,,25.0,,M,79.0,...,0.0,,20.4,,6158355,M,4.7,,,ON


# 1. Data Cleaning and Preprocessing
Let's clean the data and process it so we can make a great model. 

## 1.a First, let's check for missing values

In [8]:
missing_counts = income_data.isnull().sum()
print(missing_counts)


Neighbourhood               0
Median Income Before Tax    0
dtype: int64


Looks like there are no missing values. 

In [9]:
income_data.describe()

Unnamed: 0,Median Income Before Tax
count,146.0
mean,87932.876712
std,20457.731393
min,59200.0
25%,77000.0
50%,84000.0
75%,92000.0
max,184000.0


The range of incomes listed appears reasonable, with no outliers immediately apparent.
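One quick way to test that impression is the 1.5 × IQR rule (Tukey's fences). The sketch below uses only the quartiles printed by `describe()` above. The choice of this rule, the multiplier `k=1.5`, and the helper name `iqr_fences` are our assumptions, not something the notebook prescribes.

```python
# Quartiles and extremes copied from income_data.describe() above
q1, q3 = 77000.0, 92000.0
min_income, max_income = 59200.0, 184000.0

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_fences(q1, q3)
print(f"Lower fence: {lower:.0f}, upper fence: {upper:.0f}")
print(f"Minimum below lower fence: {min_income < lower}")
print(f"Maximum above upper fence: {max_income > upper}")
```

Under this rule the fences are 54,500 and 114,500, so the 184,000 maximum would be flagged while the 59,200 minimum would not. Whether that value is a genuinely high-income neighbourhood or a data problem is not something this check can decide.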

# Data Cleaning Complete Function 
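The pipeline cell in section 4 calls a function named `CleanData`, which is not defined in this part of the notebook. As a placeholder, here is a minimal sketch of what such a function might look like. The specific cleaning rules (trimming whitespace, dropping duplicate and empty rows) are assumptions chosen for illustration, not the project's actual logic.

```python
import pandas as pd

def CleanData(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step; the rules applied here are assumptions."""
    cleaned = df.copy()
    # Trim stray whitespace in text columns
    for col in cleaned.columns:
        if pd.api.types.is_string_dtype(cleaned[col]):
            cleaned[col] = cleaned[col].str.strip()
    # Drop exact duplicate rows and rows that are entirely empty
    cleaned = cleaned.drop_duplicates().dropna(how='all')
    return cleaned.reset_index(drop=True)

# Toy input: the second row duplicates the first apart from whitespace
raw = pd.DataFrame({
    'Neighbourhood': ['Agincourt North', ' Agincourt North ', 'Malvern East'],
    'Median Income Before Tax': [89000, 89000, 89000],
})
cleaned = CleanData(raw)
print(cleaned)
```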

# 2. Exploratory Data Analysis
Next, let's explore the dataset. 

In [10]:
unique_values = income_data['Neighbourhood'].unique()
print(unique_values)

['Toronto' 'Agincourt North' 'Agincourt South-Malvern West' 'Malvern East'
 'Malvern West' 'Alderwood' 'Banbury-Don Mills' 'York Mills'
 'Bathurst Manor' 'Bay-Cloverhill' 'Yonge-Bay Corridor' 'Bayview Village'
 'Bayview Woods-Steeles' 'Hillcrest Village' 'Bedford Park-Nortown'
 'Beechborough-Greenbrook' 'Bendale South' 'Bendale-Glen Andrew'
 'Birchcliffe-Cliffside' 'Black Creek' 'Briar Hill-Belgravia'
 'Broadview North' 'Brookhaven-Amesbury'
 'Cabbagetown-South St. James Town' 'Caledonia-Fairbank' 'Casa Loma'
 'Church-Wellesley' 'Downtown Yonge East' 'Clairlea-Birchmount'
 'Clanton Park' 'Cliffcrest' 'Danforth Village-East York'
 'Don Valley Village' 'Pleasant View' 'Dorset Park' 'Dovercourt Village'
 'Junction-Wallace Emerson' 'Junction' 'Downsview'
 'Oakdale-Beverly Heights' 'Dufferin Grove' 'Little Portugal'
 'East End-Danforth' 'Edenbridge-Humber Valley' 'Eglinton East'
 'Elms-Old Rexdale' 'Englemount-Lawrence' 'Eringate-Centennial-West Deane'
 'Etobicoke West Mall' 'Flemingdon Par

In [11]:
# List of neighborhoods
neighborhoods = ['Toronto', 'Agincourt North', 'Agincourt South-Malvern West', 'Malvern East', 'Malvern West',
                 'Alderwood', 'Banbury-Don Mills', 'York Mills', 'Bathurst Manor', 'Bay-Cloverhill', 'Yonge-Bay Corridor',
                 'Bayview Village', 'Bayview Woods-Steeles', 'Hillcrest Village', 'Bedford Park-Nortown', 'Beechborough-Greenbrook',
                 'Bendale South', 'Bendale-Glen Andrew', 'Birchcliffe-Cliffside', 'Black Creek', 'Briar Hill-Belgravia',
                 'Broadview North', 'Brookhaven-Amesbury', 'Cabbagetown-South St. James Town', 'Caledonia-Fairbank', 'Casa Loma',
                 'Church-Wellesley', 'Downtown Yonge East', 'Clairlea-Birchmount', 'Clanton Park', 'Cliffcrest',
                 'Danforth Village-East York', 'Don Valley Village', 'Pleasant View', 'Dorset Park', 'Dovercourt Village',
                 'Junction-Wallace Emerson', 'Junction', 'Downsview', 'Oakdale-Beverly Heights', 'Dufferin Grove', 'Little Portugal',
                 'East End-Danforth', 'Edenbridge-Humber Valley', 'Eglinton East', 'Elms-Old Rexdale', 'Englemount-Lawrence',
                 'Eringate-Centennial-West Deane', 'Etobicoke West Mall', 'Flemingdon Park', 'Forest Hill North', 'Forest Hill South',
                 'Guildwood', 'Henry Farm', 'High Park North', 'High Park-Swansea', 'Humber Heights-Westmount', 'Humewood-Cedarvale',
                 'Ionview', 'Islington', 'Keelesdale-Eglinton West', 'Kennedy Park', 'Kensington-Chinatown', 'Kingsview Village-The Westway',
                 'Kingsway South', 'Lambton Baby Point', "East L'Amoreaux", "West L'Amoreaux", 'Steeles', 'Milliken',
                 'Lansing-Westgate', 'Lawrence Park North', 'Lawrence Park South', 'Leaside-Bennington', 'Little Italy',
                 'Trinity-Bellwoods', 'Long Branch', 'Maple Leaf', 'Humber Bay Shores', 'Mimico-Queensway', 'Morningside Heights',
                 'Moss Park', 'Regent Park', 'Mount Dennis', 'Mount Olive-Silverstone-Jamestown', 'Thistletown', 'Mount Pleasant East',
                 'North Toronto', 'South Eglinton-Davisville', 'New Toronto', 'North St. James Town', 'Oakridge', 'Oakwood-Vaughan',
                 "O'Connor-Parkview", 'Old East York', 'Fenside-Parkwoods', "Parkwoods-O'Connor Hills", 'Playter Estates-Danforth',
                 'Princess-Rosethorn', 'Rexdale-Kipling', 'North Riverdale', 'South Riverdale', 'Rockcliffe-Smythe', 'Roncesvalles',
                 'Rosedale', 'Morningside', 'Rouge', 'Rustic', 'Pelmo Park', 'Humberlea', 'Scarborough Village', 'South Parkdale',
                 'St. Andrew-Windfields', 'Stonegate-Queensway', "Tam O'Shanter-Sullivan", 'The Beaches', 'Thorncliffe Park',
                 'University', 'Annex', 'Victoria Village', 'Harbourfront-CityPlace', 'St Lawrence-East Bayfront The Islands',
                 'Wellington Place', 'West Hill', 'West Humber-Clairville', 'Westminster-Branson', 'Weston', 'Weston-Pelham Park',
                 'Wexford-Maryvale', 'Avondale', 'Willowdale East', 'Yonge-Doris', 'Newtonbrook East', 'Willowdale West',
                 'Newtonbrook West', 'Willowridge-Martingrove-Richview', 'Golfdale-Cedarbrae-Woburn', 'Woburn North', 'Woodbine Corridor',
                 'Greenwood-Coxwell', 'Woodbine-Lumsden', 'Wychwood', 'Yonge-Eglinton', 'Yonge-St. Clair']

# Example groupings (you can adjust this to your needs)
groupings = {
    'Central Toronto': ['Toronto', 'Yonge-Bay Corridor', 'Bayview Village', 'Church-Wellesley', 'Downtown Yonge East',
                        'Cabbagetown-South St. James Town', 'Kensington-Chinatown', 'Little Italy', 'Trinity-Bellwoods', 'Annex', 'Victoria Village'],
    'Scarborough': ['Agincourt North', 'Agincourt South-Malvern West', 'Malvern East', 'Malvern West', 'Bendale South', 'Bendale-Glen Andrew',
                    'Birchcliffe-Cliffside', 'Scarborough Village', 'East End-Danforth', 'Morningside', 'Rouge', 'Thorncliffe Park', 'The Beaches'],
    'Etobicoke': ['Alderwood', 'Islington', 'Kingsway South', 'Steeles', 'Mimico-Queensway', 'West Humber-Clairville', 'Eringate-Centennial-West Deane'],
    'North York': ['Bathurst Manor', 'York Mills', 'Bay-Cloverhill', 'Bayview Woods-Steeles', 'Bedford Park-Nortown', 'Don Valley Village', 'Pleasant View',
                   'Dorset Park', 'Dovercourt Village', 'Junction-Wallace Emerson', 'Junction', 'Downsview', 'Lansing-Westgate', 'Lawrence Park North',
                   'Lawrence Park South', 'Leaside-Bennington', 'North Toronto', 'South Eglinton-Davisville', 'Willowdale East',
                   'Willowdale West', 'Newtonbrook East', 'Newtonbrook West', 'Fenside-Parkwoods', "Parkwoods-O'Connor Hills"]
}

# Create a mapping for each neighborhood
neighborhood_group = {}

# Assign neighborhoods to their respective groups
for group, neighborhoods_in_group in groupings.items():
    for neighborhood in neighborhoods_in_group:
        neighborhood_group[neighborhood] = group

# Display the result
for neighborhood in neighborhoods:
    print(f"{neighborhood}: {neighborhood_group.get(neighborhood, 'Uncategorized')}")

Toronto: Central Toronto
Agincourt North: Scarborough
Agincourt South-Malvern West: Scarborough
Malvern East: Scarborough
Malvern West: Scarborough
Alderwood: Etobicoke
Banbury-Don Mills: Uncategorized
York Mills: North York
Bathurst Manor: North York
Bay-Cloverhill: North York
Yonge-Bay Corridor: Central Toronto
Bayview Village: Central Toronto
Bayview Woods-Steeles: North York
Hillcrest Village: Uncategorized
Bedford Park-Nortown: North York
Beechborough-Greenbrook: Uncategorized
Bendale South: Scarborough
Bendale-Glen Andrew: Scarborough
Birchcliffe-Cliffside: Scarborough
Black Creek: Uncategorized
Briar Hill-Belgravia: Uncategorized
Broadview North: Uncategorized
Brookhaven-Amesbury: Uncategorized
Cabbagetown-South St. James Town: Central Toronto
Caledonia-Fairbank: Uncategorized
Casa Loma: Uncategorized
Church-Wellesley: Central Toronto
Downtown Yonge East: Central Toronto
Clairlea-Birchmount: Uncategorized
Clanton Park: Uncategorized
Cliffcrest: Uncategorized
Danforth Village-Eas
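Because the mapping loop assigns with `neighborhood_group[neighborhood] = group`, a name listed under two groups silently keeps whichever group comes last. A small check like the following can surface such overlaps before the mapping is used. Both `find_overlaps` and the `toy_groupings` dictionary are hypothetical; the toy dictionary was built with one overlap on purpose.

```python
from collections import defaultdict

# Toy excerpt, deliberately built with one overlap for illustration
toy_groupings = {
    'Central Toronto': ['Annex', 'Bayview Woods-Steeles'],
    'North York': ['York Mills', 'Bayview Woods-Steeles'],
}

def find_overlaps(groupings):
    """Map each name listed under more than one group to those groups."""
    membership = defaultdict(list)
    for group, names in groupings.items():
        for name in set(names):  # ignore repeats within a single group
            membership[name].append(group)
    return {name: groups for name, groups in membership.items() if len(groups) > 1}

overlaps = find_overlaps(toy_groupings)
print(overlaps)
```

Running the same function on the full `groupings` dictionary should return an empty dictionary once every neighbourhood belongs to exactly one group.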

In [None]:
neighborhoods = 

# 3. Feature Engineering
Once we've acquired a comprehensive understanding of the dataset, we can proceed with feature engineering.

### 3.a Time feature engineering
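One possible starting point, assuming the features are derived from a date column such as `Date` in the flow data: extract calendar fields and add a cyclic encoding of the month. The helper name `add_time_features` and the chosen feature names are our own and are not taken from the project.

```python
import numpy as np
import pandas as pd

def add_time_features(df, date_col='Date'):
    """Append calendar features derived from a datetime column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors='coerce')
    out['Year'] = dates.dt.year
    out['Month'] = dates.dt.month
    out['DayOfYear'] = dates.dt.dayofyear
    # Cyclic encoding so that December and January end up close together
    out['Month_sin'] = np.sin(2 * np.pi * out['Month'] / 12)
    out['Month_cos'] = np.cos(2 * np.pi * out['Month'] / 12)
    return out

# Two dates that appear in the flow data above
sample = pd.DataFrame({'Date': ['1966-07-04', '2013-07-08']})
time_features = add_time_features(sample)
print(time_features[['Year', 'Month', 'DayOfYear']])
```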

### 3.b Location Feature Engineering
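A natural location feature, given the groupings built in section 2, is the region each neighbourhood belongs to. The sketch below attaches that region to a few income rows and averages income per region. The data frame is retyped from the `head()` output in section 0, the mapping is a deliberately partial subset of `neighborhood_group`, and the column name `Region` is an assumption.

```python
import pandas as pd

# A few rows of the income table, as printed in section 0
income_sample = pd.DataFrame({
    'Neighbourhood': ['Toronto', 'Agincourt North', 'Malvern East', 'Malvern West'],
    'Median Income Before Tax': [97000, 89000, 89000, 89000],
})

# Partial subset of the neighbourhood-to-region mapping from section 2
# ('Malvern West' is left out on purpose to show the fallback)
neighborhood_group = {
    'Toronto': 'Central Toronto',
    'Agincourt North': 'Scarborough',
    'Malvern East': 'Scarborough',
}

# Attach the region; anything unmapped falls back to 'Uncategorized'
income_sample['Region'] = (
    income_sample['Neighbourhood'].map(neighborhood_group).fillna('Uncategorized')
)
region_income = income_sample.groupby('Region')['Median Income Before Tax'].mean()
print(region_income)
```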


### 3.c Weather Feature Engineering
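Since the climate cell in section 0 isolates both July 7 and July 8, one plausible weather feature is a trailing multi-day precipitation total. The following sketch assumes one row per consecutive day for a single station. The helper `add_rolling_precip` and the `PRECIP_2D` column name are hypothetical.

```python
import pandas as pd

# The two storm days printed in section 0 (TORONTO CITY station)
rain = pd.DataFrame({
    'LOCAL_DATE': pd.to_datetime(['2013-07-07', '2013-07-08']),
    'TOTAL_PRECIPITATION': [38.5, 96.8],
})

def add_rolling_precip(df, window_days=2):
    """Add a trailing precipitation total over `window_days` daily rows."""
    out = df.sort_values('LOCAL_DATE').copy()
    out[f'PRECIP_{window_days}D'] = (
        out['TOTAL_PRECIPITATION'].rolling(window=window_days, min_periods=1).sum()
    )
    return out

rain_features = add_rolling_precip(rain)
print(rain_features)
```

For July 8 the two-day total comes to 135.3, combining the 38.5 recorded on July 7 with the 96.8 recorded on the day of the flood. With several stations in the table, the rolling sum would need to be computed per station, for example inside a `groupby('STATION_NAME')`.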

### 3.d Feature engineering comparison and evaluation

# Feature Engineering Complete Function

# 4. Train The Model

## 4.a Run code through pipeline

In [12]:
from sklearn.model_selection import train_test_split

FloodData = pd.read_csv('example.csv')
train = CleanData(FloodData)
X = feature_engineering(train)

# Prepare the features and target variable
X = X.drop(columns=['TargetColumn'])  # Drop target and non-features
y = train['TargetColumn']  # Target variable
X.info()

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


FileNotFoundError: [Errno 2] No such file or directory: 'example.csv'

## 4.b Create & tune the model
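The evaluation cell refers to `best_rf_model` and `y_pred_best` and reports regression metrics, which suggests a tuned random forest regressor. A minimal sketch of how those two names could be produced is shown here. The use of `GridSearchCV`, the parameter grid, the scoring choice, and the synthetic data from `make_regression` are all assumptions standing in for the real pipeline output.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real X and y come from the pipeline above
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid; these values are placeholders, not tuned choices
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring='neg_mean_absolute_error',
)
search.fit(X_train, y_train)

# Names match those used by the evaluation cell
best_rf_model = search.best_estimator_
y_pred_best = best_rf_model.predict(X_test)
print(search.best_params_)
```

Cross-validating on the training split only keeps the test set untouched until the final evaluation.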

## 4.c Evaluate Model
Let's get the predictions on our test dataset.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Evaluate the model's performance on the test data
mae_best = mean_absolute_error(y_test, y_pred_best)
mse_best = mean_squared_error(y_test, y_pred_best)
rmse_best = np.sqrt(mse_best)  # RMSE
r2_best = r2_score(y_test, y_pred_best)

# Print evaluation metrics for the best model
print(f"Best Model MAE: {mae_best}")
print(f"Best Model MSE: {mse_best}")
print(f"Best Model RMSE: {rmse_best}")
print(f"Best Model R^2: {r2_best}")
# Get feature importances from the best model
feature_importance = best_rf_model.feature_importances_

# Create a DataFrame for visualization
feature_names = X.columns
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
})

# Sort the features by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot the top 15 most important features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df.head(15))
plt.title('Top 15 Feature Importances')
plt.tight_layout()
plt.show()

# Predict on the test dataset