# Data wrangling‬ for 311 Service requests

## Objective:

   Data wrangling and visualization tasks will be performed to ensure that raw data is transformed into a‬ clean format and then visualized effectively to develop valuable insights. Data wrangling‬ involves handling null values by imputing mean or median values in missing values of columns longitude‬ or latitude‬. Duplicate records are identified and removed, standardizing data types to‬ ensure consistency, such as converting Floating Timestamp to datetime for the date columns‬. Data transformations like encoding categorical variables (e.g., encoding status description “open”=> 0 and “closed” => 1) and normalizing numerical data for easier analysis. Feature engineering to derive‬ new columns, such as calculating response time (e.g., the difference between closed date and‬ requested date will be the response time), and additionally, grouping and mapping subdivisions into‬ broader categories, such as combining detailed service name into high-level service groups for providing‬ clearer insights‬. When unusual values are found in numerical columns, these are outliers,‬ which can be treated by capping or removal to prevent skewed results‬.
    

In [7]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

import re
import pandas as pd
from tabulate import tabulate


### Part 1: Data Cleaning and Preprocessing

### 1.1 Load and Inspect the Dataset

• Load data: Load the dataset and display its shape, column names, and data types.

• Inspect data: Identify and list the number of missing values in each column.

### 1.2 Handling Missing and Unwanted Data

• Handling Missing Data: Drop columns with more than 10% missing values.

• Handling Unwanted Data: Drop requestes created before Jan-1-2023 and after Dec-31-2024.

### 1.3 Date and Time Handling:

•	Convert the Date column to a datetime object.

•	Create new columns for the year, month, and day of the week for requested, updated and closed date columns.

•	Replacing null values in derived date related columns with 0 and converting the column values to int type.

•	Add a column indicating whether each date falls on a weekend.

•	Add a column for time duration to calculate the time took to close the request.

### 1.4 Create additional columns:

•	Add a column to see if the request is duplicate or not(Yes means duplicate and No means not a duplicate request).



In [68]:
# Load data
df = pd.read_csv('/Users/anithajoseph/Documents/UofC/DATA601/CSVFiles/311_Service_Requests_2yrs.csv')
print("----------------------------------------------------------------------------")
print("\033[1m"+"Data Analysis and Visualization of Building Emergency Benchmarking"+"\033[0m")
print("----------------------------------------------------------------------------")

#display shape, columns, and data types
print("1.\tShape of the Dataset:", df.shape)
print("2.\tNumber of records or rows of the DataFrame:", df.shape[0])
print("3.\tColumns and Data types of each column:\n", df.dtypes)

----------------------------------------------------------------------------
[1mData Analysis and Visualization of Building Emergency Benchmarking[0m
----------------------------------------------------------------------------
1.	Shape of the Dataset: (1093918, 15)
2.	Number of records or rows of the DataFrame: 1093918
3.	Columns and Data types of each column:
 service_request_id     object
requested_date         object
updated_date           object
closed_date            object
status_description     object
source                 object
service_name           object
agency_responsible     object
address               float64
comm_code              object
comm_name              object
location_type          object
longitude             float64
latitude              float64
point                  object
dtype: object


In [70]:
# Inspect data
missingDataSum = df.isna().sum()
missingDataPercentage = (df.isnull().mean() * 100).round(2)
missingData = pd.DataFrame({
    "Missing Count": missingDataSum,
    "Missing Percentage": missingDataPercentage
})

pd.options.display.float_format = '{:.2f}'.format
print("\n\033[1m"+"Missing Count per column:"+"\033[0m")
print(tabulate(missingData, headers='keys', tablefmt='fancy_grid'))

#The dataframe(DF) is copied to another DF variable if in case there is a need for original DF
originalDF =df


[1mMissing Count per column:[0m
╒════════════════════╤═════════════════╤══════════════════════╕
│                    │   Missing Count │   Missing Percentage │
╞════════════════════╪═════════════════╪══════════════════════╡
│ service_request_id │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ requested_date     │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ updated_date       │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ closed_date        │ 39714           │                 3.63 │
├────────────────────┼─────────────────┼──────────────────────┤
│ status_description │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ source             │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ ser

In [72]:
# Handling Missing Data
columnNameDropped = missingDataPercentage[missingDataPercentage >= 40].index.tolist()
print("\nColumns with missing percentage more than 40% missing values are:\n", columnNameDropped)
df = df.drop(columns = missingDataPercentage[missingDataPercentage > 40].index)

# Handling Unwanted Data
beforeCount = df.shape[0]
df = df[(df['requested_date'] < '2025-01-01') & (df['requested_date'] > '2023-01-01')]
afterCount =df.shape[0]
deletedCount = beforeCount - afterCount
print(f"\nDeleted {deletedCount} number of unwanted records")


Columns with missing percentage more than 40% missing values are:
 ['address']

Deleted 31076 number of unwanted records


In [37]:
# Date and Time Handling:
#------------------------------------------------------------------------------------------

#Convert the Date column to a datetime object
df['requested_date'] = pd.to_datetime(df['requested_date'], format = '%Y/%m/%d %I:%M:%S %p')
df['updated_date'] = pd.to_datetime(df['updated_date'], format = '%Y/%m/%d %I:%M:%S %p')
df['closed_date'] = pd.to_datetime(df['closed_date'], format = '%Y/%m/%d %I:%M:%S %p')
print(f"Data type of 'requested_date': {df['requested_date'].dtype}")
print(f"Data type of 'updated_date': {df['updated_date'].dtype}")
print(f"Data type of 'closed_date': {df['closed_date'].dtype}")

# Converting null values to NaT
df['closed_date'] = df['closed_date'].fillna(pd.NaT)

#Create new columns for the year, month, and day of the week for requested, updated and closed date columns
df['request_year'] = df['requested_date'].dt.year
df['request_month'] = df['requested_date'].dt.month
df['request_day'] = df['requested_date'].dt.day
df['update_year'] = df['updated_date'].dt.year
df['update_month'] = df['updated_date'].dt.month
df['update_day'] = df['updated_date'].dt.day
df['closed_year'] = df['closed_date'].dt.year
df['closed_month'] = df['closed_date'].dt.month
df['closed_day'] = df['closed_date'].dt.day


# Replacing null values in derived date related columns with 0 and converting the column values to int type
df.loc[df['closed_date'].isna(), ['closed_year', 'closed_month', 'closed_day']] = 0
df[['closed_year', 'closed_month', 'closed_day']] = df[['closed_year', 'closed_month', 'closed_day']].astype('Int32')

Data type of 'requested_date': datetime64[ns]
Data type of 'updated_date': datetime64[ns]
Data type of 'closed_date': datetime64[ns]



Count 39714
0


Unnamed: 0,service_request_id,requested_date,updated_date,closed_date,status_description,source,service_name,agency_responsible,comm_code,comm_name,...,update_month,update_day,closed_year,closed_month,closed_day,is_weekend,time_closing,duplicate_request,new_serviceNames,new_agencyResponsible
0,23-00000797,2023-01-02,2023-01-10,2023-01-10,Closed,Other,Finance - ONLINE TIPP Agreement Request,CFOD - Finance,,,...,1,10,2023,1,10,False,8 days,No,Finance,CFOD
1,23-00001045,2023-01-02,2024-01-11,2024-01-11,Closed,Other,Active Living Program Application,CS - Recreation and Social Programs,,,...,1,11,2024,1,11,False,374 days,No,Active Living Program Application,CS
2,23-00001163,2023-01-02,2023-01-06,2023-01-06,Closed,Phone,CN - Registered Social Worker Letter,CS - Calgary Neighbourhoods,,,...,1,6,2023,1,6,False,4 days,No,CN,CS
3,23-00001191,2023-01-02,2024-05-19,2023-01-10,Closed,Other,CT - Lost Property,OS - Calgary Transit,,,...,5,19,2023,1,10,False,8 days,No,CT,OS
4,23-00001584,2023-01-02,2023-01-04,2023-01-04,Closed,Other,Recreation - Arena Booking Application,CS - Calgary Recreation,,,...,1,4,2023,1,4,False,2 days,No,Recreation,CS


In [None]:
#Add a column indicating whether each request date falls on a weekend
df['is_weekend_request'] = df['request_day'].dt.weekday >= 5

#Add a column for time duration to calculate the time took to close the request
df['time_closing'] = df['closed_date'] - df['requested_date']


In [48]:

# Filter rows where status_description is "open"
filtered_df = df[(df['status_description'] == 'Open') & (df['closed_date'].notnull())]

# Sort by requested_date in descending order
sorted_df = filtered_df.sort_values(by='requested_date', ascending=False)
count = sorted_df.count()
print("\n\n\nCount",count)

display(sorted_df.head(100))








Count service_request_id       2164
requested_date           2164
updated_date             2164
closed_date              2164
status_description       2164
source                   2164
service_name             2164
agency_responsible       2162
comm_code                2053
comm_name                2053
location_type            2053
longitude                2053
latitude                 2053
point                    2053
request_year             2164
request_month            2164
request_day              2164
update_year              2164
update_month             2164
update_day               2164
closed_year              2164
closed_month             2164
closed_day               2164
is_weekend               2164
time_closing             2164
duplicate_request        2164
new_serviceNames         2164
new_agencyResponsible    2162
dtype: int64


Unnamed: 0,service_request_id,requested_date,updated_date,closed_date,status_description,source,service_name,agency_responsible,comm_code,comm_name,...,update_month,update_day,closed_year,closed_month,closed_day,is_weekend,time_closing,duplicate_request,new_serviceNames,new_agencyResponsible
1092725,25-00062439,2025-01-27,2025-01-27,2025-01-27,Open,Other,WATS - Sewage Back-up,OS - Water Services,HOU,HOUNSFIELD HEIGHTS/BRIAR HILL,...,1,27,2025,1,27,False,0 days,No,WATS,OS
1092432,25-00061084,2025-01-26,2025-01-26,2025-01-26,Open,Other,Roads - Debris on Street/Sidewalk/Boulevard,OS - Mobility,PAN,PANORAMA HILLS,...,1,26,2025,1,26,False,0 days,No,Roads,OS
1092205,25-00061709,2025-01-26,2025-01-27,2025-01-27,Open,Other,ACPL - Lost and Found Animal,CS - Emergency Management and Community Safety,RIV,RIVERBEND,...,1,27,2025,1,27,False,1 days,No,ACPL,CS
1091006,25-00057868,2025-01-24,2025-01-27,2025-01-24,Open,Other,WATS - Water Meter Issues,OS - Water Services,NGM,NORTH GLENMORE PARK,...,1,27,2025,1,24,False,0 days,No,WATS,OS
1090389,25-00059730,2025-01-24,2025-01-27,2025-01-24,Open,Other,WATS - Water Meter Issues,OS - Water Services,MPL,MAPLE RIDGE,...,1,27,2025,1,24,False,0 days,No,WATS,OS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1071457,25-00018023,2025-01-09,2025-01-13,2025-01-10,Open,Other,Roads - Traffic Signal Timing Inquiry,OS - Mobility,MDH,MEDICINE HILL,...,1,13,2025,1,10,False,1 days,No,Roads,OS
1071867,25-00019825,2025-01-09,2025-01-09,2025-01-09,Open,Other,WATS - Spills Entering Storm System,OS - Water Services,FLN,FOREST LAWN,...,1,9,2025,1,9,False,0 days,No,WATS,OS
1072534,25-00019940,2025-01-09,2025-01-11,2025-01-11,Open,Other,Bylaw - Waste and Recycling Infractions,CS - Emergency Management and Community Safety,HUN,HUNTINGTON HILLS,...,1,11,2025,1,11,False,2 days,No,Bylaw,CS
1072729,25-00020663,2025-01-09,2025-01-20,2025-01-15,Open,Other,WRS - Cart Management,OS - Waste and Recycling Services,FAI,FAIRVIEW,...,1,20,2025,1,15,False,6 days,No,WRS,OS


In [13]:
# Create additional columns:
df['duplicate_request'] = df['status_description'].str.contains(r'Duplicate \(Closed\)', regex=True)

# Convert the boolean values to 'yes'/'no'
df['duplicate_request'] = df['duplicate_request'].replace({True: 'Yes', False: 'No'})

# Checking if converted to yes or no
df_subset = df.iloc[150:166]  # Python slicing includes 150 but excludes 166
display(df_subset)

Unnamed: 0,service_request_id,requested_date,updated_date,closed_date,status_description,source,service_name,agency_responsible,comm_code,comm_name,...,request_day,update_year,update_month,update_day,closed_year,closed_month,closed_day,is_weekend,time_closing,duplicate_request
150,23-00000917,2023-01-02,2023-01-04,2023-01-04,Closed,Other,CT - Lost Property,TRAN - Calgary Transit,,,...,1970-01-01 00:00:00.000000002,2023,1,4,2023,1,4,False,2 days,No
151,23-00000924,2023-01-02,2024-05-18,2023-01-05,Duplicate (Closed),Other,Roads - Snow and Ice Control,OS - Mobility,MCT,MCKENZIE TOWNE,...,1970-01-01 00:00:00.000000002,2024,5,18,2023,1,5,False,3 days,Yes
152,23-00000925,2023-01-02,2023-01-02,2023-01-02,Closed,App,Roads - Streetlight Maintenance,TRAN - Roads,ROY,ROYAL OAK,...,1970-01-01 00:00:00.000000002,2023,1,2,2023,1,2,False,0 days,No
153,23-00000926,2023-01-02,2023-01-05,2023-01-05,Closed,Other,CBS - Planning and Development - After Hours SR,PD - Calgary Building Services,MID,MIDNAPORE,...,1970-01-01 00:00:00.000000002,2023,1,5,2023,1,5,False,3 days,No
154,23-00000942,2023-01-02,2023-01-03,2023-01-03,Closed,Phone,CBS Inspection - Residential Improvement Proje...,PD - Calgary Building Services,TAR,TARADALE,...,1970-01-01 00:00:00.000000002,2023,1,3,2023,1,3,False,1 days,No
155,23-00000931,2023-01-02,2023-01-04,2023-01-04,Closed,Other,WRS - Cart Management,UEP - Waste and Recycling Services,VAR,VARSITY,...,1970-01-01 00:00:00.000000002,2023,1,4,2023,1,4,False,2 days,No
156,23-00000943,2023-01-02,2023-01-16,2023-01-02,Closed,Phone,WATS - Sewage Back-up,UEP - Water Services,CED,CEDARBRAE,...,1970-01-01 00:00:00.000000002,2023,1,16,2023,1,2,False,0 days,No
157,23-00000930,2023-01-02,2023-01-02,2023-01-02,Closed,Other,CT AC - Trip Feedback - Checker Taxi,Tranc - Calgary Transit,,,...,1970-01-01 00:00:00.000000002,2023,1,2,2023,1,2,False,0 days,No
158,23-00000927,2023-01-02,2023-01-17,2023-01-17,Closed,Other,CT - Bus Stops,TRAN - Calgary Transit,BRI,BRIDLEWOOD,...,1970-01-01 00:00:00.000000002,2023,1,17,2023,1,17,False,15 days,No
159,23-00000938,2023-01-02,2023-01-30,2023-01-30,Closed,Other,WATS - Water Meter Issues,UEP - Water Services,BRI,BRIDLEWOOD,...,1970-01-01 00:00:00.000000002,2023,1,30,2023,1,30,False,28 days,No


## Service name analysis 

In [15]:
service_nunique_values=df['service_name'].nunique()
print("Count of distinct of all service names:", service_nunique_values)

service_unique_values=df['service_name'].unique()
#display(service_unique_values)

service_unique_values_cnt=df['service_name'].value_counts()
print(service_unique_values_cnt)

df['new_serviceNames'] = df['service_name'].str.split(' -').str[0] #extracted main service name and removed sub division of that.
sn_unique_values = df['new_serviceNames'].unique()
print(sn_unique_values)

unique_values = df['new_serviceNames'].nunique()
print("Number of Unique values: ",unique_values)
unique_values_cnt = df['new_serviceNames'].value_counts()
print(unique_values_cnt)

#count of particular value
count = (df['new_serviceNames'] == 'AS').sum()
count

Count of distinct of all service names: 640
service_name
WRS - Cart Management                      51648
Finance - Property Tax Account Inquiry     37459
Bylaw - Snow and Ice on Sidewalk           34438
Finance - ONLINE TIPP Agreement Request    28318
Corporate - Graffiti Concerns              22096
                                           ...  
CAI - Employee Complaint - Compliment          1
WATR - Water Brochure                          1
PSD - Major Mobility - Paving Program          1
WRS - Chatbot Feedback                         1
CPI - Employee Complaint - Compliment          1
Name: count, Length: 640, dtype: int64
['Finance' 'Active Living Program Application' 'CN' 'CT' 'Recreation'
 'CT AC' 'REC' 'Parks' 'AS' 'WATS' 'Bylaw' 'Roads' 'Corporate'
 'CBS Inspection' '311 Contact Us' 'WATR' 'WRS' 'After Hours Transit'
 'Compliance' 'Opinions on Business Units' 'HR' 'CFD' 'CSC'
 'Animal / Bylaw' 'CBS' 'Partnerships' 'Customer Service & Communications'
 'UEP' 'CAI' 'Law' 'GFL' 'T

24176

## Agency analysis

In [17]:
# Filter rows where agency_responsible is null
result = df[df['agency_responsible'].isnull()]['service_name']
print("rows where agency_responsible is null",result.count())
#display(result)

df['service_name'].str.split(' -').str[0] 

# Filter rows where the 'agency_responsible' column contains 'CPI'
filtered_df = df[df['service_name'].str.split(' -').str[0].str.contains('CPI', na=False)]

display(filtered_df)



agency_nunique_values=df['agency_responsible'].nunique()
print(agency_nunique_values)


agency_unique_values=df['agency_responsible'].unique()
print(agency_unique_values)

agency_unique_values_cnt=df['agency_responsible'].value_counts()
print(agency_unique_values_cnt)

df['new_agencyResponsible'] = df['agency_responsible'].str.split(' -').str[0] 
ag_unique_values = df['new_agencyResponsible'].unique()
print(ag_unique_values)


ag_unique_values = df['new_agencyResponsible'].nunique()
print("Number of Unique values: ",ag_unique_values)
ag_unique_values_cnt = df['new_agencyResponsible'].value_counts()
print(ag_unique_values_cnt)


rows where agency_responsible is null 205


Unnamed: 0,service_request_id,requested_date,updated_date,closed_date,status_description,source,service_name,agency_responsible,comm_code,comm_name,...,update_year,update_month,update_day,closed_year,closed_month,closed_day,is_weekend,time_closing,duplicate_request,new_serviceNames
262006,23-00446573,2023-06-17,2024-05-19,2023-07-06,Closed,Other,CPI - Water and Sewer Main Condition Inquiries,,RDL,ROSEDALE,...,2024,5,19,2023,7,6,False,19 days,No,CPI
321324,23-00542552,2023-07-19,2024-12-06,2024-03-13,Closed,Other,CPI - Bridge - Tunnel - Underpass Concern,,THO,THORNCLIFFE,...,2024,12,6,2024,3,13,False,238 days,No,CPI
341661,23-00577368,2023-08-01,2024-08-07,2023-08-10,Open,Phone,CPI - Water and Sewer Main Condition Inquiries,,THO,THORNCLIFFE,...,2024,8,7,2023,8,10,False,9 days,No,CPI
423595,23-00716133,2023-09-22,2024-12-05,2023-10-13,Closed,Phone,CPI - Bridge - Tunnel - Underpass Concern,,VIS,VISTA HEIGHTS,...,2024,12,5,2023,10,13,False,21 days,No,CPI
532539,23-00916982,2023-12-13,2024-12-06,2023-12-21,Closed,Phone,CPI - Water and Sewer Main Condition Inquiries,,WIN,WINSTON HEIGHTS/MOUNTVIEW,...,2024,12,6,2023,12,21,False,8 days,No,CPI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1089308,25-00055030,2025-01-23,2025-01-27,2025-01-27,Closed,Other,CPI - Neighbourhood Streets Projects,,UND,UNIVERSITY DISTRICT,...,2025,1,27,2025,1,27,False,4 days,No,CPI
1089496,25-00055805,2025-01-23,2025-01-23,NaT,Open,Other,CPI - Future Road Planning and Upgrades,,STD,STARFIELD,...,2025,1,23,0,0,0,False,NaT,No,CPI
1091838,25-00060930,2025-01-25,2025-01-27,2025-01-27,Closed,Other,CPI - Bridge - Tunnel - Underpass Concern,,09H,09H,...,2025,1,27,2025,1,27,False,2 days,No,CPI
1092241,25-00061327,2025-01-26,2025-01-26,NaT,Open,Other,CPI - Bridge - Tunnel - Underpass Concern,,BOW,BOWNESS,...,2025,1,26,0,0,0,False,NaT,No,CPI


77
['CFOD - Finance' 'CS - Recreation and Social Programs'
 'CS - Calgary Neighbourhoods' 'OS - Calgary Transit'
 'CS - Calgary Recreation' 'TRAN - Calgary Transit' 'CS - Calgary Parks'
 'CS - Calgary Community Standards' 'UEP - Water Services'
 'OS - Parks and Open Spaces' 'TRAN - Roads'
 'PD - Calgary Building Services'
 'CFOD - Customer Services and Communications' 'UEP - Water Resources'
 'UEP - Waste and Recycling Services' 'CPFS - Assessment and Tax'
 'Corporate Wide Service Requests'
 'CS - Emergency Management and Community Safety'
 'OS - Waste and Recycling Services' 'PICS - Human Resources'
 'CS - Calgary Fire' 'OS - Mobility' 'Tranc - Calgary Transit'
 'Partnerships' 'DCMO - Corporate Analytics and Innovation' 'LL - Law'
 'Uepc - Waste and Recycling Services' 'TRAN - Transportation Planning'
 'OS - Water Services' 'CS - Calgary Housing'
 'Recreation and Social Programs'
 'PDS - Development, Business and Building Services'
 'DCMO - Facility Management' 'Elected Officials'
 'P