# Data Wrangling for 311 Service Requests

## Objective:

Data wrangling and visualization tasks transform the raw data into a clean format and then visualize it to develop useful insights. Data wrangling here involves handling null values, for example by imputing the mean or median for missing values in the longitude or latitude columns. Duplicate records are identified and removed, and data types are standardized for consistency, such as converting the Floating Timestamp date columns to datetime. Data transformations include encoding categorical variables (e.g., encoding status description “open” => 0 and “closed” => 1) and normalizing numerical data for easier analysis. Feature engineering derives new columns, such as response time (the difference between the closed date and the requested date). Subdivisions are also grouped and mapped into broader categories, such as combining detailed service names into high-level service groups, for clearer insights. Unusual values in numerical columns are treated as outliers, which can be capped or removed to prevent skewed results.
   

The focus is on data cleaning: handling null values, removing duplicates, standardizing data types, and treating outliers.

1.  Handle null values by imputing the mean or median for missing values in the longitude or latitude columns.
2.  Identify and remove duplicate records, and standardize data types for consistency, such as converting the Floating Timestamp date columns to datetime.
3.  Apply data transformations, such as encoding categorical variables (e.g., encoding status description “open” => 0 and “closed” => 1) and normalizing numerical data for easier analysis.
4.  Engineer new columns, such as response time (the difference between the closed date and the requested date).
5.  Group and map subdivisions into broader categories, such as combining detailed service names into high-level service groups, for clearer insights.
6.  Treat outliers (unusual values in numerical columns) by capping or removal to prevent skewed results.
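The steps listed above can be sketched on a small toy DataFrame. This is a minimal illustration, not the notebook's actual pipeline: the rows are made up, the new column names (`status_code`, `response_days`, `service_group`) are hypothetical, the date format is assumed from the sample rows shown later in this notebook, and taking the prefix before `" - "` as the service group is just one possible grouping rule.

```python
import pandas as pd

# Toy stand-in for the 311 data; all values are made up for illustration.
toy = pd.DataFrame({
    "service_request_id": ["A1", "A2", "A2", "A3", "A4"],
    "requested_date": ["2023/01/02 12:00:00 AM"] * 5,
    "closed_date": ["2023/01/07 12:00:00 AM", "2023/01/03 12:00:00 AM",
                    "2023/01/03 12:00:00 AM", None, "2023/01/02 12:00:00 AM"],
    "status_description": ["Closed", "Closed", "Closed", "Open", "Closed"],
    "service_name": ["Parks - Snow and Ice Concerns - WAM", "WATS - Water Quality",
                     "WATS - Water Quality", "AS - Pick Up Stray",
                     "Parks - Snow and Ice Concerns - WAM"],
    "longitude": [-113.99, -114.09, -114.09, None, -114.08],
    "latitude": [50.91, 51.01, 51.01, 51.04, None],
})

# Remove exact duplicate records (the second "A2" row).
toy = toy.drop_duplicates().reset_index(drop=True)

# Standardize the date columns to datetime (format assumed from the sample rows).
date_format = "%Y/%m/%d %I:%M:%S %p"
for col in ["requested_date", "closed_date"]:
    toy[col] = pd.to_datetime(toy[col], format=date_format)

# Impute missing coordinates with the column median.
for col in ["longitude", "latitude"]:
    toy[col] = toy[col].fillna(toy[col].median())

# Encode status: "open" => 0, "closed" => 1 (case-insensitive).
toy["status_code"] = toy["status_description"].str.lower().map({"open": 0, "closed": 1})

# Feature engineering: response time in days (NaN while a request is still open).
toy["response_days"] = (toy["closed_date"] - toy["requested_date"]).dt.days

# Group detailed service names into broader groups (assumed rule: prefix before " - ").
toy["service_group"] = toy["service_name"].str.split(" - ").str[0]

# Cap outliers in response time using the 1.5 * IQR rule.
q1, q3 = toy["response_days"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["response_days_capped"] = toy["response_days"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(toy[["service_request_id", "status_code", "response_days", "service_group"]])
```

Capping with `clip` keeps every row and only limits extreme values, whereas removal would shrink the dataset; which one fits depends on the analysis.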

In [47]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

import re
import pandas as pd
from tabulate import tabulate


### Part 1: Data Cleaning and Preprocessing

### 1.1 Load and Inspect the Dataset

• Load the dataset and display its shape, column names, and data types.

• Identify and list the number of missing values in each column.

In [49]:
# load data
df = pd.read_csv('/Users/anithajoseph/Documents/UofC/DATA601/CSVFiles/311_Service_Requests_2yrs.csv')
print("----------------------------------------------------------------------------")
print("\033[1m"+"Data Analysis and Visualization of Building Emergency Benchmarking"+"\033[0m")
print("----------------------------------------------------------------------------")

#display shape, columns, and data types
print("1.\tShape of the Dataset:", df.shape)
print("2.\tNumber of records or rows of the DataFrame:", df.shape[0])
print("3.\tColumns and Data types of each column:\n", df.dtypes)

# Inspect data
missingDataSum = df.isna().sum()
missingDataPercentage = (df.isnull().mean() * 100).round(2)
missingData = pd.DataFrame({
    "Missing Count": missingDataSum,
    "Missing Percentage": missingDataPercentage
})

pd.options.display.float_format = '{:.2f}'.format
print("\n\033[1m"+"Missing Count per column:"+"\033[0m")
print(tabulate(missingData, headers='keys', tablefmt='fancy_grid'))

----------------------------------------------------------------------------
Data Analysis and Visualization of Building Emergency Benchmarking
----------------------------------------------------------------------------
1.	Shape of the Dataset: (1093918, 15)
2.	Number of records or rows of the DataFrame: 1093918
3.	Columns and Data types of each column:
 service_request_id     object
requested_date         object
updated_date           object
closed_date            object
status_description     object
source                 object
service_name           object
agency_responsible     object
address               float64
comm_code              object
comm_name              object
location_type          object
longitude             float64
latitude              float64
point                  object
dtype: object

Missing Count per column:
╒════════════════════╤═════════════════╤══════════════════════╕
│                    │   Missing Count │   Missing Percentage │
╞══════

### 1.2 Handling Missing Data

• Drop columns with more than 40% missing values.

• For numerical columns, fill missing values with the median of their respective column.

• For categorical columns, fill missing values with the mode of their respective column.
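The median and mode fills described in the bullets above could look like the following. This is a minimal sketch on a made-up two-column frame, assuming that numeric columns are detected with `select_dtypes` and that every non-numeric column is treated as categorical; it is not necessarily how the notebook itself performs the fill.

```python
import pandas as pd

# Made-up sample with one categorical and one numerical column.
toy = pd.DataFrame({
    "comm_code": ["MCK", None, "MCK", "FLN"],
    "longitude": [-113.99, None, -114.09, -113.97],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# Numerical columns: fill missing values with the column median.
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical columns: fill missing values with the most frequent value.
for col in categorical_cols:
    modes = toy[col].mode(dropna=True)
    if not modes.empty:  # an all-missing column has no mode to fill with
        toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
```

When a column has several equally frequent values, `mode()` returns them sorted, so `iloc[0]` picks the first one; that tie-breaking choice is arbitrary.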

In [51]:
# Drop columns with more than 40% missing values
highMissingMask = missingDataPercentage > 40
columnNameDropped = missingDataPercentage[highMissingMask].index.tolist()
print("Columns with missing percentage more than 40% missing values are:\n", columnNameDropped)

# df.drop returns a new DataFrame, so originalDF still holds all original columns
originalDF = df

df = df.drop(columns=columnNameDropped)

Columns with missing percentage more than 40% missing values are:
 ['address']


Unnamed: 0,service_request_id,requested_date,updated_date,closed_date,status_description,source,service_name,agency_responsible,comm_code,comm_name,location_type,longitude,latitude,point
8,23-00002077,2023/01/02 12:00:00 AM,2023/01/07 12:00:00 AM,2023/01/07 12:00:00 AM,Closed,Phone,Parks - Snow and Ice Concerns - WAM,CS - Calgary Parks,MCK,MCKENZIE LAKE,Community Centrepoint,-113.99,50.91,POINT (-113.988181270219 50.914880843208)
9,23-00002084,2023/01/02 12:00:00 AM,2023/01/02 12:00:00 AM,2023/01/02 12:00:00 AM,Closed,Other,AS - Pick Up Stray,CS - Calgary Community Standards,FLN,FOREST LAWN,Community Centrepoint,-113.97,51.04,POINT (-113.971230283867 51.037847718454)
11,23-00002080,2023/01/02 12:00:00 AM,2023/01/03 12:00:00 AM,2023/01/03 12:00:00 AM,Closed,Phone,WATS - Water Quality,UEP - Water Services,BRT,BRITANNIA,Community Centrepoint,-114.09,51.01,POINT (-114.086314872035 51.012674541829)
12,23-00002085,2023/01/02 12:00:00 AM,2023/01/03 12:00:00 AM,2023/01/03 12:00:00 AM,Closed,App,Parks - Snow and Ice Concerns - WAM,CS - Calgary Parks,BED,BEDDINGTON HEIGHTS,Community Centrepoint,-114.08,51.13,POINT (-114.084911140805 51.131643985322)


In [None]:
# Despite its name, nan_df keeps only the rows where comm_code is NOT missing
nan_df = df[df['comm_code'].notna()]
display(nan_df.head(4))