# **Shaghayegh Bagheri**

# Load Libraries

In [None]:
import dask.dataframe as dd
import numpy as np



# Data Extraction

The dataset for group B was uploaded to Google Drive. I mounted my Drive in Colab and first tried to read the CSV file with the Pandas library, but this raised an error because the file was in tar.gz format.

After some research, I found that this type of file can be handled by first creating a directory and then extracting the archive into it with the following command.

Resource: https://stackoverflow.com/questions/67910522/extract-tar-gz-file-and-save-again-extracted-file-using-google-colaboratory

In [None]:
!mkdir -p /content/dataset
!tar -xvzf /content/drive/MyDrive/TrafficEvents_Aug16_Dec20_Publish.tar.gz -C /content/dataset


TrafficEvents_Aug16_Dec20_Publish.csv


Next, I listed the contents of the directory to check which files were extracted.

In [None]:
!ls /content/dataset


TrafficEvents_Aug16_Dec20_Publish.csv
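
The same extract-and-list steps can also be done from Python with the standard-library `tarfile` module. The following is only a minimal sketch: the helper name `extract_tar_gz` and the file names `demo.csv` and `demo.tar.gz` are hypothetical, and the demo builds a throwaway archive so that it runs anywhere rather than touching the real dataset.

```python
import os
import tarfile
import tempfile


def extract_tar_gz(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return the sorted names found there."""
    os.makedirs(target_dir, exist_ok=True)  # like `mkdir -p`: no error if it already exists
    # Use the safer "data" extraction filter when this Python version provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir, **kwargs)
    return sorted(os.listdir(target_dir))


# Self-contained demo on a temporary archive (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    with open(csv_path, "w") as f:
        f.write("EventId,Type\nT-1,Congestion\n")
    archive_path = os.path.join(tmp, "demo.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(csv_path, arcname="demo.csv")
    extracted = extract_tar_gz(archive_path, os.path.join(tmp, "dataset"))
    print(extracted)
```

In Colab, the two arguments would be the archive path on Drive and `/content/dataset`.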


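In this part, I used dask.dataframe instead of the Pandas library to handle the large file. Note that Dask splits the CSV into partitions and processes them lazily and in parallel on the CPU; it does not use the GPU by itself, so the variable name gpu below is only a label.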

In [None]:
gpu = dd.read_csv(
    '/content/dataset/TrafficEvents_Aug16_Dec20_Publish.csv',
    # when an integer column has missing values in some partitions,
    # Dask cannot correctly infer its type and throws an error
    dtype={'ZipCode': 'float64'}
)

print(gpu.head())



   EventId        Type  Severity  TMC  \
0  T-38768  Congestion         2   73   
1  T-38772  Congestion         2   72   
2  T-38775  Congestion         2   72   
3  T-38777  Congestion         1   75   
4  T-38781  Congestion         2   75   

                                         Description       StartTime(UTC)  \
0  Severe delays of 18 minutes on US-101 Redwood ...  2016-08-01 00:03:00   
1  Delays of eight minutes on CA-92 San Mateo Rd ...  2016-08-01 00:07:00   
2  Severe delays of 20 minutes and delays increas...  2016-08-01 00:00:00   
3  Delays of two minutes on Valley Fwy Southbound...  2016-08-01 00:08:00   
4  Delays of five minutes on CA-37 Sears Point Rd...  2016-08-01 00:13:00   

          EndTime(UTC)    TimeZone  LocationLat  LocationLng  Distance(mi)  \
0  2016-08-01 00:14:28  US/Pacific    38.214657  -122.602669           0.0   
1  2016-08-01 00:18:44  US/Pacific    37.477329  -122.415703           0.0   
2  2016-08-01 00:18:44  US/Pacific    36.985863  -121.98
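
The `dtype={'ZipCode': 'float64'}` override in the `read_csv` call above matters because type inference from a sample of rows can see only integers, while a later part of the file contains a missing value that a plain integer column cannot hold. The sketch below reproduces the effect with Pandas on a tiny made-up CSV (the `City` column and all values are hypothetical); it illustrates the cause and is not Dask's actual inference code.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the ZipCode value is missing in the last row only.
csv_text = "City,ZipCode\nA,94043\nB,10001\nC,\n"

# Reading only the first two rows mimics type inference from a sample without gaps.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
# Reading everything exposes the missing value, which forces a float column.
full = pd.read_csv(io.StringIO(csv_text))

sample_dtype = str(sample["ZipCode"].dtype)
full_dtype = str(full["ZipCode"].dtype)
print(sample_dtype, full_dtype)
```

The sample is inferred as `int64` while the full column is `float64`, which is the mismatch that declaring the column as `float64` up front avoids.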

Total number of rows and columns

In [None]:
print("Number of total rows : " , len(gpu))


Number of total rows :  31355575


In [None]:
print(f"Number of total columns : {len(gpu.columns)}")  # computed instead of hardcoded

Number of total columns : 19


# Sampling

In [22]:
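# Random sampling: Dask's sample() takes a fraction rather than a row count,
# so frac is chosen to keep roughly 5,000 rows
sample_df = gpu.sample(frac=5000/len(gpu), random_state=42).compute()
# Save the sample so later steps can work on the smaller file
sample_df.to_csv("Traffic.csv", index=False)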

# Data Cleaning

In [27]:
df = gpu.drop(columns=["EventId"])  # drop the entire EventId column
print(df.head())

         Type  Severity  TMC  \
0  Congestion         2   73   
1  Congestion         2   72   
2  Congestion         2   72   
3  Congestion         1   75   
4  Congestion         2   75   

                                         Description       StartTime(UTC)  \
0  Severe delays of 18 minutes on US-101 Redwood ...  2016-08-01 00:03:00   
1  Delays of eight minutes on CA-92 San Mateo Rd ...  2016-08-01 00:07:00   
2  Severe delays of 20 minutes and delays increas...  2016-08-01 00:00:00   
3  Delays of two minutes on Valley Fwy Southbound...  2016-08-01 00:08:00   
4  Delays of five minutes on CA-37 Sears Point Rd...  2016-08-01 00:13:00   

          EndTime(UTC)    TimeZone  LocationLat  LocationLng  Distance(mi)  \
0  2016-08-01 00:14:28  US/Pacific    38.214657  -122.602669           0.0   
1  2016-08-01 00:18:44  US/Pacific    37.477329  -122.415703           0.0   
2  2016-08-01 00:18:44  US/Pacific    36.985863  -121.981026           0.0   
3  2016-08-01 00:19:44  US/Pacif

In [28]:
df.info()

<class 'dask.dataframe.dask_expr.DataFrame'>
Columns: 18 entries, Type to ZipCode
dtypes: float64(5), int64(2), string(11)

I encoded the Side column, which contained only the letters L and R, as 0 and 1 in a new Side_binary column, and then dropped the original Side column.

In [32]:
df['Side_binary'] = df['Side'].map({'L': 0, 'R': 1})
df1 = df.drop(columns=["Side"])
print(df1.head())

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map function that you are using.
  Before: .map(func)
  After:  .map(func, meta=('Side', 'float64'))
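
The message above suggests `float64` rather than an integer type because any value that the mapping does not cover becomes NaN, and NaN needs a float column. The sketch below shows this with plain Pandas on made-up values; it is an illustration of the dtype behaviour, not a claim about what the Side column actually contains.

```python
import pandas as pd

side_codes = {"L": 0, "R": 1}

# Hypothetical values: every entry is covered by the mapping.
complete = pd.Series(["L", "R", "R"]).map(side_codes)
# Hypothetical values: one entry is missing, so it maps to NaN.
with_gap = pd.Series(["L", None, "R"]).map(side_codes)

with_gap_dtype = str(with_gap.dtype)
print(complete.tolist(), with_gap_dtype)
```

Passing `meta=('Side', 'float64')` to `.map`, as the message proposes, declares this output type up front so Dask does not have to guess it.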



         Type  Severity  TMC  \
0  Congestion         2   73   
1  Congestion         2   72   
2  Congestion         2   72   
3  Congestion         1   75   
4  Congestion         2   75   

                                         Description       StartTime(UTC)  \
0  Severe delays of 18 minutes on US-101 Redwood ...  2016-08-01 00:03:00   
1  Delays of eight minutes on CA-92 San Mateo Rd ...  2016-08-01 00:07:00   
2  Severe delays of 20 minutes and delays increas...  2016-08-01 00:00:00   
3  Delays of two minutes on Valley Fwy Southbound...  2016-08-01 00:08:00   
4  Delays of five minutes on CA-37 Sears Point Rd...  2016-08-01 00:13:00   

          EndTime(UTC)    TimeZone  LocationLat  LocationLng  Distance(mi)  \
0  2016-08-01 00:14:28  US/Pacific    38.214657  -122.602669           0.0   
1  2016-08-01 00:18:44  US/Pacific    37.477329  -122.415703           0.0   
2  2016-08-01 00:18:44  US/Pacific    36.985863  -121.981026           0.0   
3  2016-08-01 00:19:44  US/Pacif

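# Finding outliers by Lat and Lng

To find outliers, I used the Z-score method on the sampled file: a row is flagged when its latitude or longitude lies more than 2 standard deviations from the column mean.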
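
As a small illustration of the Z-score rule, here is a minimal sketch in plain Python. The function name `zscore_outliers` and the latitude values are hypothetical; the threshold of 2 matches the one used in the cell below.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return (index, z) pairs for values whose |z-score| exceeds threshold."""
    mean = statistics.mean(values)
    # Sample standard deviation (ddof=1), the same default as pandas and Dask.
    std = statistics.stdev(values)
    flagged = []
    for i, v in enumerate(values):
        z = (v - mean) / std
        if abs(z) > threshold:
            flagged.append((i, z))
    return flagged


# Hypothetical latitudes: nine points near 37-38 degrees and one far away.
latitudes = [37.1, 37.3, 37.5, 37.7, 37.9, 38.0, 37.2, 37.6, 37.8, 25.7]
outlier_rows = zscore_outliers(latitudes)
print([i for i, _ in outlier_rows])
```

Only the last value is flagged here. Because the mean and standard deviation are themselves pulled toward extreme values, a Z-score threshold flags points that are unusual relative to the whole sample, which for geographic data can simply mean a less-represented region.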

In [37]:
tf = dd.read_csv("/content/Traffic.csv")


lat_mean = tf['LocationLat'].mean().compute()
lat_std = tf['LocationLat'].std().compute()

lon_mean = tf['LocationLng'].mean().compute()
lon_std = tf['LocationLng'].std().compute()

# Z-score
tf['lat_z'] = (tf['LocationLat'] - lat_mean) / lat_std
tf['lon_z'] = (tf['LocationLng'] - lon_mean) / lon_std


outliers = tf[(tf['lat_z'].abs() > 2) | (tf['lon_z'].abs() > 2)].compute()

print(outliers.head())


      EventId           Type  Severity   TMC  \
89   T-979367     Congestion         0    72   
90   T-954962     Congestion         1    72   
91   T-941512     Congestion         0    76   
93  T-1040039  Flow-Incident         3  1804   
97  T-1075629     Congestion         1    73   

                                          Description       StartTime(UTC)  \
89  Delays increasing and delays of two minutes on...  2016-12-30 21:30:00   
90  Delays of five minutes on Bird Rd Westbound be...  2016-12-12 21:05:00   
91  Delays of two minutes on SW 177th Ave Southbou...  2016-12-02 22:57:00   
93  Traffic signal failure on I-95 Northbound at E...  2016-11-14 12:28:21   
97  Delays of two minutes on Tamiami Trl Northboun...  2016-09-22 21:28:00   

           EndTime(UTC)    TimeZone  LocationLat  LocationLng  ...  \
89  2016-12-30 21:55:16  US/Eastern    26.528053   -81.869850  ...   
90  2016-12-12 21:21:08  US/Eastern    25.734070   -80.287308  ...   
91  2016-12-02 23:09:38  US/East