# Data wrangling‬ for 311 Service requests

## Objective:

   Data wrangling and visualization tasks will be performed to ensure that raw data is transformed into a‬ clean format and then visualized effectively to develop valuable insights. Data wrangling‬ involves handling null values by imputing mean or median values in missing values of columns longitude‬ or latitude‬. Duplicate records are identified and removed, standardizing data types to‬ ensure consistency, such as converting Floating Timestamp to datetime for the date columns‬. Data transformations like encoding categorical variables (e.g., encoding status description “open”=> 0 and “closed” => 1) and normalizing numerical data for easier analysis. Feature engineering to derive‬ new columns, such as calculating response time (e.g., the difference between closed date and‬ requested date will be the response time), and additionally, grouping and mapping subdivisions into‬ broader categories, such as combining detailed service name into high-level service groups for providing‬ clearer insights‬. When unusual values are found in numerical columns, these are outliers,‬ which can be treated by capping or removal to prevent skewed results‬.
    

In [302]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

import numpy as np
import re
import pandas as pd
from tabulate import tabulate

from datetime import datetime
import pytz

### Part 1: Data Cleaning and Preprocessing

### 1.1 Load and Inspect the Dataset

• Load data: Load the dataset and display its shape, column names, and data types.

• Inspect data: Identify and list the number of missing values in each column.

### 1.2 Handling Missing and Unwanted Data

• Handling Missing Data: Drop columns with more than 10% missing values.

• Handling Unwanted Data: Drop requestes created before Jan-1-2023 and after Dec-31-2024.

• Handling Missing Community Code: Fill Community Code with Community name.

• Handling Missing Longitude: Fill Longitude with Median value.

• Handling Missing Latitude: Fill Latitude with median value.

• Handling Missing Point with it's mode value

### 1.3 Date and Time Handling:

•	Convert the Date column to a datetime object.

•	Create new columns for the year, month, and day of the week for requested, updated and closed date columns.

•	Replacing null values in derived date related columns with 0 and converting the column values to int type.

### 1.4 Create additional columns:

•	Add a column indicating whether each date falls on a weekend.

•	Add a column for time duration to calculate the time took to close the request.

•	Add a column to see if the request is duplicate or not(Yes means duplicate and No means not a duplicate request).

•	Add a column for Season Categorisation of Requests

•	Add a column for Divisions of Agency assigned for the requests

•	Add a column for Community Sector for identifying the communities using community sector csv file.

### 1.5 Handling Missing for Community related columns

•	Fill value "Community Centrepoint" for null community code, name and sector.


In [304]:
# Load data
df = pd.read_csv('/Users/anithajoseph/Documents/UofC/DATA601/CSVFiles/311_Service_Requests_2yrs.csv')
print("----------------------------------------------------------------------------")
print("\033[1m"+"Data Analysis and Visualization of Building Emergency Benchmarking"+"\033[0m")
print("----------------------------------------------------------------------------")

#display shape, columns, and data types
print("1.\tShape of the Dataset:", df.shape)
print("2.\tNumber of records or rows of the DataFrame:", df.shape[0])
print("3.\tColumns and Data types of each column:\n", df.dtypes)

----------------------------------------------------------------------------
[1mData Analysis and Visualization of Building Emergency Benchmarking[0m
----------------------------------------------------------------------------
1.	Shape of the Dataset: (1093918, 15)
2.	Number of records or rows of the DataFrame: 1093918
3.	Columns and Data types of each column:
 service_request_id     object
requested_date         object
updated_date           object
closed_date            object
status_description     object
source                 object
service_name           object
agency_responsible     object
address               float64
comm_code              object
comm_name              object
location_type          object
longitude             float64
latitude              float64
point                  object
dtype: object


In [305]:
# Inspect data
missingDataSum = df.isna().sum()
missingDataPercentage = (df.isnull().mean() * 100).round(2)
missingData = pd.DataFrame({
    "Missing Count": missingDataSum,
    "Missing Percentage": missingDataPercentage
})

pd.options.display.float_format = '{:.2f}'.format
print("\n\033[1m"+"Missing Count per column:"+"\033[0m")
print(tabulate(missingData, headers='keys', tablefmt='fancy_grid'))

#The dataframe(DF) is copied to another DF variable if in case there is a need for original DF
originalDF =df


[1mMissing Count per column:[0m
╒════════════════════╤═════════════════╤══════════════════════╕
│                    │   Missing Count │   Missing Percentage │
╞════════════════════╪═════════════════╪══════════════════════╡
│ service_request_id │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ requested_date     │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ updated_date       │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ closed_date        │ 39714           │                 3.63 │
├────────────────────┼─────────────────┼──────────────────────┤
│ status_description │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ source             │     0           │                 0    │
├────────────────────┼─────────────────┼──────────────────────┤
│ ser

In [306]:
# Handling Missing Data
columnNameDropped = missingDataPercentage[missingDataPercentage >= 40].index.tolist()
print("\nColumns with missing percentage more than 40% missing values are:", columnNameDropped)
df = df.drop(columns = missingDataPercentage[missingDataPercentage > 40].index)

# Handling Unwanted Data
beforeCount = df.shape[0]
df = df[(df['requested_date'] < '2025-01-01') & (df['requested_date'] > '2023-01-01')]
afterCount =df.shape[0]
deletedCount = beforeCount - afterCount
print(f"\nCount of deleted request which are recieved on or after 2025-01-01 and before 2023-01-01: {deletedCount}")

#Handling Missing Community Code
communityNames = df[df['comm_code'].isnull() & df['comm_name'].notnull()]['comm_name']
print(f"\nCommunity name with community code null and community name exists: {communityNames}")

df['comm_code'].fillna(df['comm_name'], inplace=True)
print(f"\nCommunity Code is filled with Community name for {communityNames} community")

#Handling Missing Longitude and Latitude with their median 
df['longitude'] = df['longitude'].fillna(df['longitude'].median())
df['latitude'] = df['latitude'].fillna(df['latitude'].median())
print("\nLongitude and latitude missing values are replaced with its corresponding median")

#Handling Missing Point with the mode
df['point'] = df['point'].fillna(df['point'].mode()[0])
print("\nPoint missing values are replaced with its mode")


Columns with missing percentage more than 40% missing values are: ['address']

Count of deleted request which are recieved on or after 2025-01-01 and before 2023-01-01: 31076

Community name with community code null and community name exists: 808139    05E
Name: comm_name, dtype: object

Community Code is filled with Community name for 808139    05E
Name: comm_name, dtype: object community

Longitude and latitude missing values are replaced with its corresponding median

Point missing values are replaced with its mode


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['comm_code'].fillna(df['comm_name'], inplace=True)


In [307]:
# Date and Time Handling:
#------------------------------------------------------------------------------------------

#Convert the Date column to a datetime object
df['requested_date'] = pd.to_datetime(df['requested_date'], format = '%Y/%m/%d %I:%M:%S %p')
df['updated_date'] = pd.to_datetime(df['updated_date'], format = '%Y/%m/%d %I:%M:%S %p')
df['closed_date'] = pd.to_datetime(df['closed_date'], format = '%Y/%m/%d %I:%M:%S %p')

print("\n\033[1m"+"Date and Time Handling: Modified data type:"+"\033[0m")
print(f"Data type of 'requested_date': {df['requested_date'].dtype}")
print(f"Data type of 'updated_date': {df['updated_date'].dtype}")
print(f"Data type of 'closed_date': {df['closed_date'].dtype}")

# Converting null values to NaT
df['closed_date'] = df['closed_date'].fillna(pd.NaT)

#Create new columns for the year, month, and day of the week for requested, updated and closed date columns
df['request_year'] = df['requested_date'].dt.year
df['request_month'] = df['requested_date'].dt.month
df['request_day'] = df['requested_date'].dt.day
df['update_year'] = df['updated_date'].dt.year
df['update_month'] = df['updated_date'].dt.month
df['update_day'] = df['updated_date'].dt.day
df['closed_year'] = df['closed_date'].dt.year
df['closed_month'] = df['closed_date'].dt.month
df['closed_day'] = df['closed_date'].dt.day


# Replacing null values in derived date related columns with 0 and converting the column values to int type
df.loc[df['closed_date'].isna(), ['closed_year', 'closed_month', 'closed_day']] = 0
df[['request_year', 'request_month', 'request_day']] = df[['request_year', 'request_month', 'request_day']].astype('Int32')
df[['update_year', 'update_month', 'update_day']] = df[['update_year', 'update_month', 'update_day']].astype('Int32')
df[['closed_year', 'closed_month', 'closed_day']] = df[['closed_year', 'closed_month', 'closed_day']].astype('Int32')


print("\n\033[1m"+"Date and Time Handling: newly created columns are:"+"\033[0m")
print("For requested_date: request_year, request_month, request_day")
print("For updated_date: update_year, update_month, update_day")
print("For closed_date: closed_year, closed_month, closed_day")


[1mDate and Time Handling: Modified data type:[0m
Data type of 'requested_date': datetime64[ns]
Data type of 'updated_date': datetime64[ns]
Data type of 'closed_date': datetime64[ns]

[1mDate and Time Handling: newly created columns are:[0m
For requested_date: request_year, request_month, request_day
For updated_date: update_year, update_month, update_day
For closed_date: closed_year, closed_month, closed_day


In [308]:
# Create additional columns:
#-------------------------------------------------------------------------

#Add a column indicating whether each request date falls on a weekend
df['is_weekend_request'] = df['request_day']>= 5

#Add a column for time duration to calculate the time took to close the request
df['closing_delay'] = df['closed_date'] - df['requested_date']
df['closing_delay'] = df['closing_delay'].dt.days

#Add a column to see if the request is duplicate or not(Yes means duplicate and No means not a duplicate request)
df['duplicate_request'] = df['status_description'].str.contains(r'Duplicate \(Closed\)', regex=True)
df['duplicate_request'] = df['duplicate_request'].replace({True: 'Yes', False: 'No'})

print("\n\033[1m"+"Additional Columns created are:"+"\033[0m")
print("\tis_weekend_request")
print("\tclosing_delay")
print("\tduplicate_request")



[1mAdditional Columns created are:[0m
	is_weekend_request
	closing_delay
	duplicate_request


In [309]:
# Season Categorisation of "Requests"

# Defining Calgary's timezone
calgary_tz = pytz.timezone('America/Edmonton')  

# Exact UTC times for solstices and equinoxes (taken from Govt of Canada Website)
seasons_utc = {
    'Spring_2023': '2023-03-20 21:24:00',
    'Summer_2023': '2023-06-21 14:57:00',
    'Autumn_2023': '2023-09-23 06:50:00',
    'Winter_2023': '2023-12-22 03:27:00',
    'Spring_2024': '2024-03-20 03:06:00',
    'Summer_2024': '2024-06-20 20:50:00',
    'Autumn_2024': '2024-09-22 12:43:00',
    'Winter_2024': '2024-12-21 09:20:00'
}

# Converting the UTC times to Calgary local time
seasons = {}
for season, utc_time_str in seasons_utc.items():
    
    # Converting the UTC string into a datetime object   
    utc_time = datetime.strptime(utc_time_str, '%Y-%m-%d %H:%M:%S')
    utc_time = pytz.utc.localize(utc_time) 
    
    # Converting to Calgary local time
    local_time = utc_time.astimezone(calgary_tz)
    
    # Saving the result in the dictionary
    seasons[season] = local_time
    
for key, value in seasons.items():
#print(f"{key}: {value.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"{key}: {value}")


# Keeping the local time but making it aware for requested_date columns

if df['requested_date'].dt.tz is None:
    df['new_requested_date'] = df['requested_date'].dt.tz_localize('America/Edmonton')

print(df['new_requested_date'].head())



# Categorizing into seasons and creating a new 'season' column

# Assigning seasons based on request date

def get_season(request_date):
    for season, season_date in seasons.items():
        if request_date < season_date:
            return season
    return 'Winter_2024'  # Default to the latest season

# Creating new season column 

df['Season'] = df['new_requested_date'].apply(get_season)

#display(df)

print("\n\033[1m"+"Additional Columns created are:"+"\033[0m")
print("\tnew_requested_date")

Spring_2023: 2023-03-20 15:24:00-06:00
Summer_2023: 2023-06-21 08:57:00-06:00
Autumn_2023: 2023-09-23 00:50:00-06:00
Winter_2023: 2023-12-21 20:27:00-07:00
Spring_2024: 2024-03-19 21:06:00-06:00
Summer_2024: 2024-06-20 14:50:00-06:00
Autumn_2024: 2024-09-22 06:43:00-06:00
Winter_2024: 2024-12-21 02:20:00-07:00
0   2023-01-02 00:00:00-07:00
1   2023-01-02 00:00:00-07:00
2   2023-01-02 00:00:00-07:00
3   2023-01-02 00:00:00-07:00
4   2023-01-02 00:00:00-07:00
Name: new_requested_date, dtype: datetime64[ns, America/Edmonton]

[1mAdditional Columns created are:[0m
	new_requested_date


In [310]:
#Add column for Community Sector using the community sector csv file
community_data=pd.read_csv("/Users/anithajoseph/Documents/UofC/DATA601/601Project/311ServiceRequests/CSV_SECTORS.csv")
def merge_community_sector(main_data, community_data):
    # Rename the relevant columns in the community_data for clarity and consistency
    community_data.rename(columns={'COMM_CODE': 'comm_code', 'SECTOR': 'community_sector'}, inplace=True)

    # Merge the datasets based on the 'comm_code'
    merged_data = main_data.merge(community_data[['comm_code', 'community_sector']], on='comm_code', how='left')

    return merged_data

df = merge_community_sector(df, community_data)
print("\n\033[1m"+"Additional Columns created are:"+"\033[0m")
print("\tcommunity_sector")

#Handling Missing for Community related columns
df['comm_code'].fillna("Community Centrepoint", inplace=True)
df['comm_name'].fillna("Community Centrepoint", inplace=True)
df['community_sector'].fillna("Community Centrepoint", inplace=True)


[1mAdditional Columns created are:[0m
	community_sector


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['comm_code'].fillna("Community Centrepoint", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['comm_name'].fillna("Community Centrepoint", inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object 

In [311]:
# Add a column for Divisions of Agency assigned for the requests

#Unassigned agencies are assigned to corresponding divisions
df.loc[df['agency_responsible'].isnull() & df['service_name'].str.contains('WATR -'), 'agency_responsible'] = 'UEP - Utilities & Environmental Protection'
df.loc[df['agency_responsible'].isnull() & df['service_name'].str.contains('PSD -'), 'agency_responsible'] = 'PDS - Planning & Development Services'
df.loc[df['agency_responsible'].isnull() & df['service_name'].str.contains('CPI -'), 'agency_responsible'] = 'OSC - Operational Services and Compliance'

# agency abbreviations are extracted
def extract_division(value):
    if pd.isna(value):
        return np.nan
    parts = value.split('-')
    resultStr = parts[0].strip() if '-' in value else value.strip()
    return resultStr


df['agency_division'] = df['agency_responsible'].apply(extract_division)

#Actual agencies or divisions under Calgary Government
agency_division = {
    'agency_name': ['Affiliated Organizations', 'Chief Financial Officer Department', 'Corporate Wide Service Requests',
                    'Calgary Police & Fire Services', 'Community Services', "Deputy City Manager's Office",
                   'Elected Officials', 'Fleet and Inventory', 'Information Services','Legal or Legislative Services',
                   'Office of the City Auditor','Operational Services and Compliance', 'Partnerships',
                   'Planning & Development Services','Project Information and Control Systems', 'Recreation and Social Programs',
                    'Transportation', 'Utilities & Environmental Protection'],
    'abbreviations': [['AO', 'Affiliated Organizations'], ['CFOD'], ['Corporate Wide Service Requests'], 
                      ['CPFS'],['CS'], ['DCMO'], 
                      ['Elected Officials'], ['Fleet and Inventory'], ['IS'], ['LL','LLSS'],
                      ['Office of the City Auditor'],['OS','OSC'],['Partnerships'],
                      ['PD','PDS'],['PICS'],['Recreation and Social Programs'],
                      ['TRAN','Tranc'], ['UEP','Uepc']]
}


# Create a mapping dictionary
mapping = {abbreviation: agency_name 
           for agency_name, abbreviations in zip(agency_division['agency_name'], agency_division['abbreviations']) 
           for abbreviation in abbreviations}


# Replace the agency_division values with actual agency_name or divisions
df['agency_division'] = df['agency_division'].map(mapping)

#noDivisionDF = df[df['agency_division'].isnull()]
#display(noDivisionDF)

agencies= df['agency_division'].unique()
    
# Iterate through each agency division in the list
for division in agencies:
    subset_df = df[df['agency_division'] == division]
    
    # Split the 'agency_responsible' column at the first hyphen and create 'agency_subdivision'
    df.loc[df['agency_division'] == division, 'agency_subdivision'] = subset_df['agency_responsible'].apply(
        lambda x: x.split('-', 1)[1] if '-' in x else division
    )

    # Split the 'service_name' column at the first hyphen and create 'service_category'
    df.loc[df['agency_division'] == division, 'service_category'] = subset_df['service_name'].apply(
        lambda x: x.split('-', 1)[0] if '-' in x else x
    )

    # Split the 'service_name' column at the first hyphen and create 'service_request'
    df.loc[df['agency_division'] == division, 'service_request'] = subset_df['service_name'].apply(
        lambda x: x.split('-', 1)[1] if '-' in x else x
    )
    
# Display the updated DataFrame
#print("Updated DataFrame:")
display(df.head(100))


print("\n\033[1m"+"Additional Columns created are:"+"\033[0m")
print("\tagency_division")
print("\tagency_subdivision")
print("\tservice_category")
print("\tservice_request")

Unnamed: 0,service_request_id,requested_date,updated_date,closed_date,status_description,source,service_name,agency_responsible,comm_code,comm_name,...,is_weekend_request,closing_delay,duplicate_request,new_requested_date,Season,community_sector,agency_division,agency_subdivision,service_category,service_request
0,23-00000797,2023-01-02,2023-01-10,2023-01-10,Closed,Other,Finance - ONLINE TIPP Agreement Request,CFOD - Finance,Community Centrepoint,Community Centrepoint,...,False,8.00,No,2023-01-02 00:00:00-07:00,Spring_2023,Community Centrepoint,Chief Financial Officer Department,Finance,Finance,ONLINE TIPP Agreement Request
1,23-00001045,2023-01-02,2024-01-11,2024-01-11,Closed,Other,Active Living Program Application,CS - Recreation and Social Programs,Community Centrepoint,Community Centrepoint,...,False,374.00,No,2023-01-02 00:00:00-07:00,Spring_2023,Community Centrepoint,Community Services,Recreation and Social Programs,Active Living Program Application,Active Living Program Application
2,23-00001163,2023-01-02,2023-01-06,2023-01-06,Closed,Phone,CN - Registered Social Worker Letter,CS - Calgary Neighbourhoods,Community Centrepoint,Community Centrepoint,...,False,4.00,No,2023-01-02 00:00:00-07:00,Spring_2023,Community Centrepoint,Community Services,Calgary Neighbourhoods,CN,Registered Social Worker Letter
3,23-00001191,2023-01-02,2024-05-19,2023-01-10,Closed,Other,CT - Lost Property,OS - Calgary Transit,Community Centrepoint,Community Centrepoint,...,False,8.00,No,2023-01-02 00:00:00-07:00,Spring_2023,Community Centrepoint,Operational Services and Compliance,Calgary Transit,CT,Lost Property
4,23-00001584,2023-01-02,2023-01-04,2023-01-04,Closed,Other,Recreation - Arena Booking Application,CS - Calgary Recreation,Community Centrepoint,Community Centrepoint,...,False,2.00,No,2023-01-02 00:00:00-07:00,Spring_2023,Community Centrepoint,Community Services,Calgary Recreation,Recreation,Arena Booking Application
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,23-00001462,2023-01-02,2023-01-02,2023-01-02,Closed,Other,WATR - Erosion and Sediment Control,UEP - Water Resources,12I,12I,...,False,0.00,No,2023-01-02 00:00:00-07:00,Spring_2023,SOUTH,Utilities & Environmental Protection,Water Resources,WATR,Erosion and Sediment Control
96,23-00001465,2023-01-02,2023-01-02,2023-01-02,Closed,Phone,CFD - Fire Code and General Inquiries - FHB,CS - Calgary Fire,CPF,COPPERFIELD,...,False,0.00,No,2023-01-02 00:00:00-07:00,Spring_2023,SOUTHEAST,Community Services,Calgary Fire,CFD,Fire Code and General Inquiries - FHB
97,23-00001464,2023-01-02,2023-01-10,2023-01-10,Closed,Phone,Finance - TIPP Agreement Request,CFOD - Finance,AUB,AUBURN BAY,...,False,8.00,No,2023-01-02 00:00:00-07:00,Spring_2023,SOUTHEAST,Chief Financial Officer Department,Finance,Finance,TIPP Agreement Request
98,23-00001479,2023-01-02,2023-01-02,2023-01-02,Closed,Other,WATR - Erosion and Sediment Control,UEP - Water Resources,CRA,CRANSTON,...,False,0.00,No,2023-01-02 00:00:00-07:00,Spring_2023,SOUTHEAST,Utilities & Environmental Protection,Water Resources,WATR,Erosion and Sediment Control



[1mAdditional Columns created are:[0m
	agency_division
	agency_subdivision
	service_category
	service_request


### Part 2: Information of New Processed Data Set for 311 requests

• Shape of the new Processed Dataset.

• Count of 311 service requests.

• Columns and Data types of each column.

• 311 Request 'Status' available.

• Agencies resposible for handling service requests.

• Count of all distinct service requests.

• Count of all distinct service requests for all agencies.

In [313]:
print("1.\tShape of the new Processed Dataset:", df.shape)
print("2.\tCount of 311 service requests:", df.shape[0])
print("3.\tColumns and Data types of each column:")

# Get the columns that are in df but not in originalDF
newColumns = list(set(df.columns) - set(originalDF.columns))
dtypes = df[newColumns].dtypes
width = max(len(column) for column in df.columns) + 2
for column, dtype in dtypes.items():
    print(f"\t\t{column.ljust(width)}{dtype}")
print(f"4.\t311 Request Status available: {df['status_description'].unique()}")
#display(df['status_description'].unique())

status_counts = df['status_description'].value_counts()
dupClosedReqCnt = status_counts['Duplicate (Closed)']
dupOpenReqCnt = status_counts['Duplicate (Open)']
closedReqCnt = status_counts['Closed']
openReqCnt = status_counts['Open']
print(f"\ti.\tCount of Open requests: {openReqCnt}")

# Filter requests where status_description is "open" but has closed date
openButClosedDF = df[(df['status_description'] == 'Open') & (df['closed_date'].notnull())]
openReqWithClosedDateCnt = openButClosedDF['status_description'].value_counts()
print(f"\t\ta.\tCount of Open requests with closed date: {openReqWithClosedDateCnt['Open']}")
openDF = df[(df['status_description'] == 'Open') & (df['closed_date'].isnull())]
openDF = openDF['status_description'].value_counts()
print(f"\t\tb.\tCount of Open requests with no closed date: {openDF['Open']}")

print(f"\tii.\tCount of Closed requests: {closedReqCnt}")
print(f"\tiii.\tCount of Duplicate (Open) requests: {dupOpenReqCnt}")
print(f"\tiv.\tCount of Duplicate (Closed) requests: {dupClosedReqCnt}")

# Inspect data
missingDataSum = df.isna().sum()
missingDataPercentage = (df.isnull().mean() * 100).round(2)
missingData = pd.DataFrame({
    "Missing Count": missingDataSum,
    "Missing Percentage": missingDataPercentage
})

pd.options.display.float_format = '{:.2f}'.format
print("\n\033[1m"+"Missing Count per column:"+"\033[0m")
print(tabulate(missingData, headers='keys', tablefmt='fancy_grid'))


1.	Shape of the new Processed Dataset: (1062919, 33)
2.	Count of 311 service requests: 1062919
3.	Columns and Data types of each column:
		agency_division     object
		Season              object
		closing_delay       float64
		request_day         Int32
		duplicate_request   object
		is_weekend_request  boolean
		closed_day          Int32
		community_sector    object
		request_year        Int32
		service_request     object
		update_day          Int32
		update_year         Int32
		request_month       Int32
		closed_year         Int32
		closed_month        Int32
		new_requested_date  datetime64[ns, America/Edmonton]
		service_category    object
		agency_subdivision  object
		update_month        Int32
4.	311 Request Status available: ['Closed' 'Open' 'Duplicate (Closed)' 'Duplicate (Open)']
	i.	Count of Open requests: 32782
		a.	Count of Open requests with closed date: 2024
		b.	Count of Open requests with no closed date: 30758
	ii.	Count of Closed requests: 1019748
	iii.	Count of Duplicat

In [314]:
# Entire Unique service names
unique_service_name_df = df['service_name'].unique()
print("\n5.\tCount of all distinct service requests:",len(unique_service_name_df))

agency_vise_distinct_req = df[['agency_division','agency_subdivision', 'agency_responsible','service_name']].drop_duplicates()
print("\n6.\tCount of all distinct service requests for all agencies:",len(agency_vise_distinct_req))

#agencies= df['agency_division'].unique()
print("\n7.\tAgencies resposible for handling service requests are:")
for agency in agencies:
    print(f"\t\t- {agency}")


5.	Count of all distinct service requests: 638

6.	Count of all distinct service requests for all agencies: 1037

7.	Agencies resposible for handling service requests are:
		- Chief Financial Officer Department
		- Community Services
		- Operational Services and Compliance
		- Transportation
		- Utilities & Environmental Protection
		- Planning & Development Services
		- Calgary Police & Fire Services
		- Corporate Wide Service Requests
		- Project Information and Control Systems
		- Partnerships
		- Deputy City Manager's Office
		- Legal or Legislative Services
		- Recreation and Social Programs
		- Elected Officials
		- Information Services
		- Affiliated Organizations
		- Fleet and Inventory
		- Office of the City Auditor


In [348]:
#Community related informations
comm_code_counts = df['comm_code'].value_counts().sort_values(ascending=False)
display(comm_code_counts.head(10))

comm_code_counts = df['comm_name'].value_counts().sort_values(ascending=False)
display(comm_code_counts.head(10))

comm_code_counts = df['community_sector'].value_counts().sort_values(ascending=False)
display(comm_code_counts.head(10))


comm_code
Community Centrepoint    73833
DNC                      21102
BLN                      17854
SAD                      14633
BOW                      12880
CNS                      11527
CRA                      10958
BRD                      10920
MAH                      10885
VAR                       9307
Name: count, dtype: int64

comm_name
Community Centrepoint       73833
DOWNTOWN COMMERCIAL CORE    21102
BELTLINE                    17854
SADDLE RIDGE                14633
BOWNESS                     12880
CORNERSTONE                 11527
CRANSTON                    10958
BRIDGELAND/RIVERSIDE        10920
MAHOGANY                    10885
VARSITY                      9307
Name: count, dtype: int64

community_sector
CENTRE                   252587
SOUTH                    158839
NORTHEAST                134395
NORTHWEST                116332
NORTH                    102057
SOUTHEAST                 96932
WEST                      80684
Community Centrepoint     73833
EAST                      47260
Name: count, dtype: int64

In [360]:
#Community service related informations
comm_code_counts = df['service_category'].value_counts().sort_values(ascending=False)
display(comm_code_counts.head(10))

comm_code_counts = df['service_request'].value_counts().sort_values(ascending=False)
display(comm_code_counts.head(10))

comm_code_counts = df['service_name'].value_counts().sort_values(ascending=False)
display(comm_code_counts.head(10))


service_category
Roads               179044
WRS                 135146
Bylaw               112279
Parks                92403
Finance              87306
WATS                 76392
CBS Inspection       41782
DBBS Inspection      40874
CT                   38462
Corporate            37037
Name: count, dtype: int64

service_request
 Property Tax Account Inquiry     56167
 Cart Management                  48307
 Snow and Ice on Sidewalk         32435
 ONLINE TIPP Agreement Request    29868
 Graffiti Concerns                21235
 Pothole Maintenance              21202
311 Contact Us                    20377
 Snow and Ice Control             19657
 Electrical                       19286
 Long Grass - Weeds Infraction    17157
Name: count, dtype: int64

service_name
WRS - Cart Management                      48307
Finance - Property Tax Account Inquiry     37463
Bylaw - Snow and Ice on Sidewalk           32435
Finance - ONLINE TIPP Agreement Request    28318
Corporate - Graffiti Concerns              21235
Roads - Pothole Maintenance                21202
311 Contact Us                             20377
AT - Property Tax Account Inquiry          18704
Roads - Snow and Ice Control               18648
Bylaw - Long Grass - Weeds Infraction      17157
Name: count, dtype: int64

Unnamed: 0,service_category,service_request,service_name
0,Finance,ONLINE TIPP Agreement Request,Finance - ONLINE TIPP Agreement Request
1,Active Living Program Application,Active Living Program Application,Active Living Program Application
2,CN,Registered Social Worker Letter,CN - Registered Social Worker Letter
3,CT,Lost Property,CT - Lost Property
4,Recreation,Arena Booking Application,Recreation - Arena Booking Application
5,Active Living Program Application,Active Living Program Application,Active Living Program Application
6,CT AC,Trip Feedback - CTA,CT AC - Trip Feedback - CTA
7,REC,Southland Leisure Centre Inquiry,REC - Southland Leisure Centre Inquiry
8,Parks,Snow and Ice Concerns - WAM,Parks - Snow and Ice Concerns - WAM
9,AS,Pick Up Stray,AS - Pick Up Stray
