# Data Wrangling for 311 Service Requests

## Objective:

Data wrangling and visualization tasks are performed to transform the raw data into a clean format and then visualize it effectively to develop valuable insights. Data wrangling starts with handling null values, for example by imputing the mean or median for missing values in the longitude and latitude columns. Duplicate records are identified and removed, and data types are standardized to ensure consistency, such as converting the Floating Timestamp date columns to datetime. Data transformations include encoding categorical variables (e.g., encoding status description “open” => 0 and “closed” => 1) and normalizing numerical data for easier analysis. Feature engineering derives new columns, such as response time (the difference between closed date and requested date), and groups and maps subdivisions into broader categories, such as combining detailed service names into high-level service groups that give clearer insights. Unusual values in numerical columns are treated as outliers and handled by capping or removal to prevent skewed results.
   

Focus on data cleaning, such as handling null values, removing duplicates, standardizing data types, and treating outliers:

1.  Handle null values by imputing the mean or median for missing values in the longitude and latitude columns.
2.  Identify and remove duplicate records, and standardize data types to ensure consistency, such as converting the Floating Timestamp date columns to datetime.
3.  Apply data transformations such as encoding categorical variables (e.g., encoding status description “open” => 0 and “closed” => 1) and normalizing numerical data for easier analysis.
4.  Use feature engineering to derive new columns, such as response time (the difference between closed date and requested date).
5.  Group and map subdivisions into broader categories, such as combining detailed service names into high-level service groups that give clearer insights.
6.  Treat unusual values in numerical columns as outliers and handle them by capping or removal to prevent skewed results.
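Steps 2 to 6 of the list above could be sketched roughly as follows. This is a minimal illustration on a made-up toy DataFrame, not the notebook's actual implementation: the request IDs and dates are invented, the date format is inferred from the `requested_date` values printed later in this notebook, taking the text before the first " - " as the service group is an assumption, and the 1.5 × IQR cap is one common capping rule chosen here for illustration.

```python
import pandas as pd

# Toy rows shaped like the 311 extract; IDs and dates are invented for illustration.
requested = "2023/01/01 12:00:00 AM"
rows = [
    ("23-A", requested, "2023/01/02 12:00:00 AM", "Closed", "ACPL - Lost and Found Animal"),
    ("23-A", requested, "2023/01/02 12:00:00 AM", "Closed", "ACPL - Lost and Found Animal"),  # exact duplicate
    ("23-B", requested, "2023/01/03 12:00:00 AM", "Closed", "ACPL - Animal Licence - Inquiries"),
    ("23-C", requested, "2023/01/03 12:00:00 AM", "Closed", "WRS - Waste - Residential"),
    ("23-D", requested, "2023/01/04 12:00:00 AM", "Closed", "WRS - Waste Requirements for Businesses"),
    ("23-E", requested, "2023/03/02 12:00:00 AM", "Closed", "311 Contact Us"),
    ("23-F", requested, None, "Open", "311 Contact Us"),
]
toy = pd.DataFrame(rows, columns=[
    "service_request_id", "requested_date", "closed_date",
    "status_description", "service_name",
])

# Step 2: remove exact duplicate rows, then parse the date strings into datetimes.
toy = toy.drop_duplicates().reset_index(drop=True)
date_format = "%Y/%m/%d %I:%M:%S %p"  # assumed from the printed sample values
for col in ["requested_date", "closed_date"]:
    toy[col] = pd.to_datetime(toy[col], format=date_format, errors="coerce")

# Step 3: encode the status description as 0 (open) / 1 (closed).
toy["status_code"] = toy["status_description"].str.lower().map({"open": 0, "closed": 1})

# Step 4: response time in whole days; open requests have no closed date, so they stay NaN.
toy["response_days"] = (toy["closed_date"] - toy["requested_date"]).dt.days

# Step 5: map detailed service names to a broader group (assumed: prefix before " - ").
toy["service_group"] = toy["service_name"].str.split(" - ").str[0]

# Step 6: cap high outliers at Q3 + 1.5 * IQR instead of dropping the rows.
q1, q3 = toy["response_days"].quantile([0.25, 0.75])
upper_cap = q3 + 1.5 * (q3 - q1)
toy["response_days_capped"] = toy["response_days"].clip(upper=upper_cap)

print(toy[["service_request_id", "status_code", "response_days",
           "response_days_capped", "service_group"]])
```

Capping keeps the 60-day request in the data while limiting its influence; removal would instead filter on `response_days <= upper_cap`.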

In [3]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

import re
import pandas as pd
from tabulate import tabulate


### Part 1: Data Cleaning and Preprocessing

### 1.1 Load and Inspect the Dataset

• Load the dataset and display its shape, column names, and data types.

• Identify and list the number of missing values in each column.

In [5]:
# load data
df = pd.read_csv('/Users/anithajoseph/Documents/UofC/DATA601/CSVFiles/311_Service_Requests_2yrs.csv')
print("----------------------------------------------------------------------------")
print("\033[1m"+"Data Analysis and Visualization of Building Emergency Benchmarking"+"\033[0m")
print("----------------------------------------------------------------------------")

#display shape, columns, and data types
print("1.\tShape of the Dataset:", df.shape)
print("2.\tNumber of records or rows of the DataFrame:", df.shape[0])
print("3.\tColumns and Data types of each column:\n", df.dtypes)

# Inspect data
missingDataSum = df.isna().sum()
missingDataPercentage = (df.isnull().mean() * 100).round(2)
missingData = pd.DataFrame({
    "Missing Count": missingDataSum,
    "Missing Percentage": missingDataPercentage
})

pd.options.display.float_format = '{:.2f}'.format
print("\n\033[1m"+"Missing Count per column:"+"\033[0m")
print(tabulate(missingData, headers='keys', tablefmt='fancy_grid'))

----------------------------------------------------------------------------
Data Analysis and Visualization of Building Emergency Benchmarking
----------------------------------------------------------------------------
1.	Shape of the Dataset: (1093918, 15)
2.	Number of records or rows of the DataFrame: 1093918
3.	Columns and Data types of each column:
 service_request_id     object
requested_date         object
updated_date           object
closed_date            object
status_description     object
source                 object
service_name           object
agency_responsible     object
address               float64
comm_code              object
comm_name              object
location_type          object
longitude             float64
latitude              float64
point                  object
dtype: object

Missing Count per column:
╒════════════════════╤═════════════════╤══════════════════════╕
│                    │   Missing Count │   Missing Percentage │
╞══════

### 1.2 Handling Missing Data

• Drop columns with more than 40% missing values.

• For numerical columns, fill missing values with the median of their respective column.

• For categorical columns, fill missing values with the mode of their respective column.

In [7]:
# Drop columns with more than 40% missing values
columnNameDropped = missingDataPercentage[missingDataPercentage > 40].index.tolist()
print("Columns with missing percentage more than 40% missing values are:\n", columnNameDropped)

# df.drop returns a new DataFrame, so originalDF keeps the full set of columns
originalDF = df

df = df.drop(columns=columnNameDropped)

Columns with missing percentage more than 40% missing values are:
 ['address']
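
The other two steps listed for this section, filling numerical columns with the median and categorical columns with the mode, might look like the sketch below. It runs on a small made-up DataFrame; the coordinate values and community codes are hypothetical, and the column selection by dtype is an assumption about how numerical and categorical columns would be told apart.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps; the coordinates and community codes are hypothetical.
toy = pd.DataFrame({
    "longitude": [-114.07, np.nan, -114.10, -113.95],
    "latitude": [51.05, 51.08, np.nan, 51.01],
    "comm_code": ["BLN", None, "BLN", "DNC"],
})

# Split columns by dtype: numeric columns get the median, all others the mode.
numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# fillna with a Series of medians fills each column with its own median.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# mode() can return several values (ties) or none (all-NaN column),
# so take the first value and skip columns that have no mode.
for col in categorical_cols:
    modes = toy[col].mode(dropna=True)
    if not modes.empty:
        toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
```

Whether the mode is a sensible fill for a column such as `comm_code` depends on why the values are missing, which the per-service counts in the next cells help to investigate.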


In [22]:
# Rows where the community code is missing
nan_df = df[df['comm_code'].isna()]
#display(nan_df.head(4))

# Count missing community codes per service name
nan_count = df.groupby('service_name')['comm_code'].apply(lambda x: x.isna().sum())
nan_count_sorted = nan_count.sort_values(ascending=False)

# Shown in index (alphabetical) order; display nan_count_sorted to rank by count instead
display(nan_count)

service_name
311 Contact Us                                         1534
ACPL - Animal Licence - Inquiries                         5
ACPL - Bite Prevention Program Application                0
ACPL - Lost and Found Animal                             10
ACPL - Magpie Trap Information Request                    4
                                                       ... 
WRS - Waste - Residential                                 0
WRS - Waste Requirements for Businesses                  57
ZZZ Business Safety - Business Licence Concern            4
ZZZ VFH - Taxi or Limousine Compliment                    0
ZZZ VFH - Taxi/Limousine/TNC (Ride Sharing) Concern     135
Name: comm_code, Length: 640, dtype: int64

In [62]:
communityNotNullDF = df[df['comm_code'].notna()]
communityNullDF = df[df['comm_code'].isna()]
#display(communityNotNullDF.head(3))
#display(communityNullDF.head(3))


# Group the rows with a missing community code by service name
ServiceNameOfCommNullDF = communityNullDF.groupby('service_name')
display(ServiceNameOfCommNullDF)

# Display each group of the grouped DataFrame
for name, group in ServiceNameOfCommNullDF:
    print(f"Service Name: {name}")
    print(group)
    print()

#service_name = df[df['service_name'] == 'ACPL - Animal Licence - Inquiries'].filter(lambda x: x.notna().any())

#display(service_name)
'''
filtered_services = grouped.filter(lambda x: x.isna().any() and x.notna().any())

display(unique_service_names)
# Get unique Service Names that satisfy the condition
unique_service_names = filtered_services.index.unique()

# Display the result'''

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x16c47e660>

Service Name: 311 Contact Us
        service_request_id          requested_date            updated_date  \
1273           23-00003793  2023/01/03 12:00:00 AM  2023/01/03 12:00:00 AM   
6357           23-00013048  2023/01/06 12:00:00 AM  2023/01/06 12:00:00 AM   
6411           23-00015197  2023/01/06 12:00:00 AM  2023/01/08 12:00:00 AM   
7823           23-00015877  2023/01/07 12:00:00 AM  2023/01/07 12:00:00 AM   
9427           23-00020603  2023/01/09 12:00:00 AM  2023/01/10 12:00:00 AM   
...                    ...                     ...                     ...   
1091925        25-00061837  2025/01/26 12:00:00 AM  2025/01/26 12:00:00 AM   
1092682        25-00064319  2025/01/27 12:00:00 AM  2025/01/27 12:00:00 AM   
1093110        25-00062965  2025/01/27 12:00:00 AM  2025/01/27 12:00:00 AM   
1093326        25-00063149  2025/01/27 12:00:00 AM  2025/01/27 12:00:00 AM   
1093532        25-00062866  2025/01/27 12:00:00 AM  2025/01/27 12:00:00 AM   

                    closed_date st

'\nfiltered_services = grouped.filter(lambda x: x.isna().any() and x.notna().any())\n\ndisplay(unique_service_names)\n# Get unique Service Names that satisfy the condition\nunique_service_names = filtered_services.index.unique()\n\n# Display the result'
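
The commented-out block in the cell above appears to be looking for service names where `comm_code` is missing on some rows but present on others. One possible way to get that list is sketched below on a made-up sample. The community codes are hypothetical, and the `summary` and `mixed_service_names` names are introduced here for illustration only.

```python
import pandas as pd

# Made-up sample; service names follow the dataset's style, community codes are hypothetical.
toy = pd.DataFrame({
    "service_name": [
        "311 Contact Us", "311 Contact Us",
        "WRS - Waste - Residential",
        "ACPL - Lost and Found Animal",
    ],
    "comm_code": [None, "BLN", "DNC", None],
})

# Per service name: how many rows lack a community code, and how many rows exist in total.
is_missing = toy["comm_code"].isna()
summary = is_missing.groupby(toy["service_name"]).agg(missing="sum", total="count")

# "Mixed" services have at least one missing and at least one non-missing community code.
mixed = summary[(summary["missing"] > 0) & (summary["missing"] < summary["total"])]
mixed_service_names = mixed.index.tolist()

print(summary)
print(mixed_service_names)
```

Services in this list are candidates for filling `comm_code` from their own non-missing rows, while services with no community code at all cannot be filled that way.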