# Data preprocessing

## Summary of data analysis and findings


### Data exploration and analysis
- 11 columns from raw dataset
- 87489 rows (transactions)
- No null values
- Mix of float (3) and object (8) dtypes
- There are some outliers in parking_fee
    - min = 0 (how does a zero fee make sense?)
    - max = 983 (very high compared to the 75th percentile, which is 39)
    - mean = 31
    - std = 39 (quite high compared to the mean => high data variance)
    - Could, for example, remove outliers and standardize the data for easier training
- Latitude values are not all in the correct range
    - valid latitude is [-90, 90], but we observe a minimum of about -180
    - offsetting won't help since the observed spread (about -180 to 68) is wider than 180 degrees, meaning I do not trust these values to be correctly measured
- Longitude is essentially in the correct range [-180, 180]; only the minimum (-180.0068) falls marginally outside it
- Latitude and longitude on their own do not give much information and would require some kind of embedding model in between that maps the coordinate point (latitude, longitude) to a feature space that includes information about the location and parking behaviour
- Not an exact 50/50 split between private and corporate transactions (57239 vs 30250), but they are in the same order of magnitude, so stratified sampling is not strictly necessary
- There are 4 different currencies in the dataset; the vast majority of transactions are in SEK. Two options, both viable:
    - convert all to SEK equivalent
    - remove transactions in other currencies
- area_type has 7 different values; "OnStreet" and "SurfaceLot" are the most common, while "EVC" and "CameraParkArea" are the least common
    - Before removing "EVC" and "CameraParkArea", we should check whether there are any patterns (correlations) in the data that we could use to predict the account_type from those samples
- area_type would require categorical encoding (for example one-hot encoding) before being used as input to the model.
- parking_id is unique per row (it is a primary key, with as many distinct values as there are rows), so it carries no information for the model
- There are a total of 300 registered private and corporate parking users (accounts)
- There are a total of 1652 unique car ids used for parking transactions. Since this is larger than the number of parking users, some users have multiple cars, which is indicative of a business account (a business has a fleet of cars). It could be useful to create a feature for the number of used/registered cars per user
- Each parkinguser_id has only one account_type. If a person has both a private and a corporate account, then they make transactions using different parkinguser_ids!
- The time span of the dataset is very long (2791 days, about 7.6 years), so normalizing with this global time span is not a good idea: some users have been active for a long time while others have been active only briefly.
- Better to normalize with the time span of each parkinguser_id


### Data cleaning
- no duplicate rows found
- removed parking_fee outliers (rows above the 99th percentile)
- removed rows with a parking_fee of 0
- converted time data into datetime

### Feature engineering
- currency conversion
- parking duration (in hours)
- parking weekday/weekend
- registered cars per user
- parking count per user
- parking activity per car (parkings per day equivalent)


## Data exploration and analysis

In [268]:
import df_utils
import functionals
import pandas as pd

In [269]:
# Load data into Pandas DataFrame
data_path = "assignment-sample-data.csv"
df = pd.read_csv(data_path)

In [270]:
from IPython.display import display
pd.set_option('display.width', 1000)

In [271]:
# Show first entries
df.head()
#print(df.tail())

Unnamed: 0,parking_id,area_type,parking_start_time,parking_end_time,parking_fee,currency,parkinguser_id,car_id,lat,lon,account_type
0,fake_c28a323810,SurfaceLot,2015-03-06 19:55:41,2015-03-06 20:07:00,8.5,SEK,fake_bf5d9b530e,fake_130ae2aeb1,59.24637,18.077019,corporate
1,fake_76c21cf355,SurfaceLot,2015-03-06 18:08:20,2015-03-06 19:46:00,15.67,SEK,fake_bf5d9b530e,fake_130ae2aeb1,59.231789,18.083995,corporate
2,fake_995ed971a6,OnStreet,2017-07-21 09:55:42,2017-07-21 14:23:50,67.0,SEK,fake_3ba346a0cd,fake_f7a9d564d9,59.350331,18.096649,corporate
3,fake_6b81ea4f35,SurfaceLot,2017-07-24 07:21:12,2017-07-24 07:34:31,4.34,SEK,fake_ea19a50003,fake_fae7e31b34,59.315826,18.098355,corporate
4,fake_424b61e0eb,SurfaceLot,2015-03-09 12:05:46,2015-03-09 13:57:54,50.5,SEK,fake_1cc1970582,fake_0755f3c71f,59.320919,18.047513,corporate


- 11 columns (10 features, 1 target)
- I suspect that the rows are sorted by account_type, since the first 50 are corporate and the last 50 are private. This means we should shuffle before training

In [272]:
shuffle_rows = True
if shuffle_rows:
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

In [273]:
print(df.columns)

Index(['parking_id', 'area_type', 'parking_start_time', 'parking_end_time', 'parking_fee', 'currency', 'parkinguser_id', 'car_id', 'lat', 'lon', 'account_type'], dtype='object')


In [274]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   parking_id          87489 non-null  object 
 1   area_type           87489 non-null  object 
 2   parking_start_time  87489 non-null  object 
 3   parking_end_time    87489 non-null  object 
 4   parking_fee         87489 non-null  float64
 5   currency            87489 non-null  object 
 6   parkinguser_id      87489 non-null  object 
 7   car_id              87489 non-null  object 
 8   lat                 87489 non-null  float64
 9   lon                 87489 non-null  float64
 10  account_type        87489 non-null  object 
dtypes: float64(3), object(8)
memory usage: 7.3+ MB
None


- 11 columns from raw dataset
- 87489 rows (transactions)
- No null values
- Mix of float (3) and object (8) dtypes

In [275]:
print(df.describe())

        parking_fee           lat           lon
count  87489.000000  87489.000000  87489.000000
mean      31.786153     58.786174     17.035432
std       39.338515      2.925073      2.639340
min        0.000000   -180.006219   -180.006783
25%        9.000000     59.292461     17.137638
50%       18.750000     59.332389     18.000587
75%       39.100000     59.360741     18.066440
max      983.000000     67.871132     54.009971


- There are some outliers in parking_fee
    - min = 0 (how does a zero fee make sense?)
    - max = 983 (very high compared to the 75th percentile, which is 39)
    - mean = 31
    - std = 39 (quite high compared to the mean => high data variance)
    - Could, for example, remove outliers and standardize the data for easier training
- Latitude values are not all in the correct range
    - valid latitude is [-90, 90], but we observe a minimum of about -180
    - offsetting won't help since the observed spread (about -180 to 68) is wider than 180 degrees, meaning I do not trust these values to be correctly measured
- Longitude is essentially in the correct range [-180, 180]; only the minimum (-180.0068) falls marginally outside it
- Latitude and longitude on their own do not give much information and would require some kind of embedding model in between that maps the coordinate point (latitude, longitude) to a feature space that includes information about the location and parking behaviour
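The range check above could be made explicit with a small validity mask. This is only a sketch on a toy frame (the `coords` frame and the `coords_valid` column are illustrative names, not part of the notebook); the second row mimics the out-of-range minimum seen in `describe()`.

```python
import pandas as pd

# Toy rows; the second one mimics the suspicious minimum from describe().
coords = pd.DataFrame({
    "lat": [59.33, -180.006219, 67.87],
    "lon": [18.06, -180.006783, 54.01],
})

# Valid geographic ranges: latitude in [-90, 90], longitude in [-180, 180].
# Series.between is inclusive on both ends by default.
valid_lat = coords["lat"].between(-90, 90)
valid_lon = coords["lon"].between(-180, 180)

# Flag rows instead of dropping them, so the decision can be made later.
coords["coords_valid"] = valid_lat & valid_lon
n_invalid = int((~coords["coords_valid"]).sum())
print(f"Rows with out-of-range coordinates: {n_invalid}")
```

Applied to the real `df`, the same mask would show how many transactions are affected before deciding whether to drop them or simply ignore the coordinate columns.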

In [276]:
print(df["account_type"].unique())
print(df["account_type"].value_counts())
print(5*"-")
print(df["currency"].unique())
print(df["currency"].value_counts())
print(5*"-")
print(df["area_type"].unique())
print(df["area_type"].value_counts())
print(5*"-")
print(f"Number of rows in df: {len(df)}")
print(f"Number of unique parking_ids: {df['parking_id'].nunique()}")
print(f"Number of unique parkinguser_ids: {df['parkinguser_id'].nunique()}")
print(f"Number of unique car_ids: {df['car_id'].nunique()}")

['private' 'corporate']
account_type
private      57239
corporate    30250
Name: count, dtype: int64
-----
['SEK' 'NOK' 'EUR' 'DKK']
currency
SEK    86918
NOK      465
DKK       87
EUR       19
Name: count, dtype: int64
-----
['SurfaceLot' 'OnStreet' 'Administrative' 'AboveGroundGarage'
 'UndergroundGarage' 'EVC' 'CameraParkArea']
area_type
OnStreet             52747
SurfaceLot           26749
Administrative        6269
UndergroundGarage     1248
AboveGroundGarage      459
EVC                     14
CameraParkArea           3
Name: count, dtype: int64
-----
Number of rows in df: 87489
Number of unique parking_ids: 87489
Number of unique parkinguser_ids: 300
Number of unique car_ids: 1652


- Not an exact 50/50 split between private and corporate transactions (57239 vs 30250), but they are in the same order of magnitude, so stratified sampling is not strictly necessary
- There are 4 different currencies in the dataset; the vast majority of transactions are in SEK. Two options, both viable:
    - convert all to SEK equivalent
    - remove transactions in other currencies
- area_type has 7 different values; "OnStreet" and "SurfaceLot" are the most common, while "EVC" and "CameraParkArea" are the least common
    - Before removing "EVC" and "CameraParkArea", we should check whether there are any patterns (correlations) in the data that we could use to predict the account_type from those samples
- area_type would require categorical encoding (for example one-hot encoding) before being used as input to the model.
- parking_id is unique per row (it is a primary key, with as many distinct values as there are rows), so it carries no information for the model
- There are a total of 300 registered private and corporate parking users (accounts)
- There are a total of 1652 unique car ids used for parking transactions. Since this is larger than the number of parking users, some users have multiple cars, which is indicative of a business account (a business has a fleet of cars). It could be useful to create a feature for the number of used/registered cars per user
- It is also possible that some cars are shared between several users (shared car)
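As a sketch of the two area_type points above (checking rare categories against the target, then encoding), something like the following could be used. The `sample` frame is toy data and the `area` prefix is an arbitrary choice; nothing here is taken from the notebook's helper modules.

```python
import pandas as pd

# Toy sample standing in for the real transactions.
sample = pd.DataFrame({
    "area_type": ["OnStreet", "SurfaceLot", "EVC", "OnStreet"],
    "account_type": ["private", "corporate", "corporate", "private"],
})

# Cross-tabulate area_type against the target to see whether a rare
# category leans towards one class before deciding to drop it.
print(pd.crosstab(sample["area_type"], sample["account_type"]))

# One-hot encode area_type; dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(sample, columns=["area_type"], prefix="area", dtype=int)
print(encoded.columns.tolist())
```

With only 14 "EVC" and 3 "CameraParkArea" rows, the cross-tabulation on the real data would be too small to draw strong conclusions from, but it is a cheap check before removing them.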

### Check if any "parkinguser_id" has transactions with multiple account types (corporate and private)

In [277]:
parkinguser_ids = {} # parkinguser_id -> account_type

# Loop over all transactions
for index, row in df.iterrows(): 
    user_id = row['parkinguser_id']
    account_type = row['account_type']

    if user_id not in parkinguser_ids: # If user_id is not in the dictionary, add it
        parkinguser_ids[user_id] = account_type
    else: # If user_id is already in the dictionary, check if the account_type is the same

        if parkinguser_ids[user_id] != account_type:
            print(f"User {user_id} has multiple account types: {parkinguser_ids[user_id]} and {account_type}")

- Each parkinguser_id has only one account_type. If a person has both a private and a corporate account, then they make transactions using different parkinguser_ids!
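The same check can be done without looping over rows. A minimal sketch on toy data (the `tx` frame and user ids are made up for illustration), using `groupby` and `nunique`:

```python
import pandas as pd

# Toy transactions: u3 deliberately appears with both account types.
tx = pd.DataFrame({
    "parkinguser_id": ["u1", "u1", "u2", "u3", "u3"],
    "account_type": ["private", "private", "corporate", "private", "corporate"],
})

# Number of distinct account types seen for each user.
types_per_user = tx.groupby("parkinguser_id")["account_type"].nunique()

# Users with more than one account type; an empty list means the
# one-account-type-per-user assumption holds.
mixed_users = types_per_user[types_per_user > 1].index.tolist()
print(mixed_users)
```

On the real `df` this should give an empty list, matching the finding above, and it avoids the cost of `iterrows` on 87489 rows.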

### Check if private accounts use cars that corporate accounts also have used (shared car_id)

In [278]:
violations = functionals.corporate_car(df)
violations.drop_duplicates(subset=[col for col in violations.columns if col != 'parking_id'], inplace=True)


for index, row in violations.iterrows():
    car_id = row['car_id']
    parkinguser_id = row['parkinguser_id']
    matching_transactions = df[(df['car_id'] == car_id) & (df['parkinguser_id'] != parkinguser_id)] # transactions with the same car_id but different parkinguser_id
    print(f"User {parkinguser_id} has parked with car_id {car_id} that other accounts have used:")
    print(matching_transactions[['parking_id', 'parkinguser_id', 'account_type']])
    print("\n")

violations.head()

User fake_c8f3930a70 has parked with car_id fake_3c010b1ca5 that other accounts have used:
            parking_id   parkinguser_id account_type
6105   fake_29e57ca780  fake_c28849a708    corporate
49177  fake_0885e9aca0  fake_c28849a708    corporate
67311  fake_5aea0993d9  fake_c28849a708    corporate




Unnamed: 0,parking_id,parkinguser_id,car_id,account_type
616,fake_4481c6c321,fake_c8f3930a70,fake_3c010b1ca5,private


### Check for cars that are used by multiple accounts

In [279]:
shared_cars = functionals.shared_car(df)
shared_cars = shared_cars.reset_index(drop=True)
shared_cars

Unnamed: 0,car_id,user_count,user_ids
0,fake_3c010b1ca5,2,"[fake_c8f3930a70, fake_c28849a708]"
1,fake_a46999824b,2,"[fake_cef59702cd, fake_b34f71b684]"
2,fake_e877be7731,2,"[fake_b34f71b684, fake_cef59702cd]"
3,fake_f52eb84892,2,"[fake_cef59702cd, fake_b34f71b684]"


### Max datetime range

In [280]:
# Data time range
df['parking_start_time'] = pd.to_datetime(df["parking_start_time"])
df['parking_end_time'] = pd.to_datetime(df["parking_end_time"])

# Find the earliest start time and latest end time
earliest_start = df['parking_start_time'].min()
latest_end = df['parking_end_time'].max()

# Calculate the total time span
time_span = latest_end - earliest_start

print(f"Earliest transaction start: {earliest_start}")
print(f"Latest transaction end: {latest_end}")
print(f"Total time span: {time_span.days} days, {time_span.seconds // 3600} hours")

Earliest transaction start: 2013-02-08 10:40:58
Latest transaction end: 2020-09-30 14:34:35
Total time span: 2791 days, 3 hours


- The time span of the dataset is very long (2791 days, about 7.6 years), so normalizing with this global time span is not a good idea: some users have been active for a long time while others have been active only briefly.
- Better to normalize with the time span of each parkinguser_id

## Data cleaning

In [281]:
## Check for duplicate rows (nothing is removed here, the check is only reported)

# Create a temporary DataFrame without the parking_id column and check for duplicates
temp_df = df.drop(columns=['parking_id'])
duplicate_rows = temp_df[temp_df.duplicated(keep=False)]
print(f"Number of duplicate rows (excluding parking_id): {len(duplicate_rows)}")
if not duplicate_rows.empty:
    print("\nDuplicate rows (excluding parking_id):")
    display(duplicate_rows.sort_values(by=temp_df.columns.tolist()))
else:
    print("No duplicate rows found when excluding parking_id")


print(5*"-")

## Remove outliers in some features

# Calculate the 99th percentile of parking_fee
percentile = 0.99
fee_cutoff_value = df['parking_fee'].quantile(percentile)
rows_before = len(df)
df = df[(df['parking_fee'] > 0) & (df['parking_fee'] <= fee_cutoff_value)].copy()  # .copy() so later column assignments do not act on a view
rows_removed = rows_before - len(df)
rows_removed_pct = (rows_removed / rows_before) * 100
print(f"Number of rows before filtering: {rows_before}")
print(f"Number of rows after filtering: {len(df)}")
print(f"Number of rows removed: {rows_removed} ({rows_removed_pct:.2f}%)")

print(5*"-")

## Convert date columns to datetime
df['parking_start_time'] = pd.to_datetime(df['parking_start_time'])
df['parking_end_time'] = pd.to_datetime(df['parking_end_time'])

Number of duplicate rows (excluding parking_id): 0
No duplicate rows found when excluding parking_id
-----
Number of rows before filtering: 87489
Number of rows after filtering: 81428
Number of rows removed: 6061 (6.93%)
-----


- no duplicate rows found
- removed parking_fee outliers (rows above the 99th percentile)
- removed rows with a parking_fee of 0
- converted time data into datetime

## Feature Engineering

### Currency conversion

In [282]:
df["parking_fee_sek"] = df[["currency", "parking_fee"]].apply(df_utils.convert_currency, axis=1)
print(df[df["currency"] == "EUR"].head())

            parking_id       area_type  parking_start_time    parking_end_time  parking_fee currency   parkinguser_id           car_id        lat        lon account_type  parking_fee_sek
652    fake_adc2bd373e        OnStreet 2019-08-06 07:20:04 2019-08-06 14:26:31         5.47      EUR  fake_ade3f3f432  fake_e996210230  46.530397  12.146991      private          60.7717
13045  fake_b8b80c5daf  Administrative 2016-04-07 11:09:54 2016-04-07 12:38:00         3.21      EUR  fake_f78d5f4769  fake_33a47b6e0a  52.370624   9.731980      private          35.6631
26727  fake_b6a291edfe        OnStreet 2019-01-02 14:51:02 2019-01-02 18:21:29         3.50      EUR  fake_ade3f3f432  fake_e996210230  46.532441  12.132596      private          38.8850
28479  fake_572f64c80c        OnStreet 2015-07-23 16:24:09 2015-07-24 08:40:00         7.34      EUR  fake_bcd016dbea  fake_ec45f6cb62  44.137471   9.649915      private          81.5474
58507  fake_3f124ae99a        OnStreet 2020-07-29 07:19:44 2020-0
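The implementation of `df_utils.convert_currency` is not shown in this notebook. A vectorized equivalent might look like the sketch below. The EUR and NOK rates are inferred from outputs in this notebook (5.47 EUR -> 60.7717 and 13.89 NOK -> 13.0566); the DKK rate is a placeholder assumption, and `RATES_TO_SEK` and `convert_to_sek` are hypothetical names.

```python
import pandas as pd

# Fixed conversion rates to SEK.
# EUR and NOK are inferred from the notebook output; DKK is a placeholder.
RATES_TO_SEK = {"SEK": 1.0, "EUR": 11.11, "NOK": 0.94, "DKK": 1.5}

def convert_to_sek(fees: pd.DataFrame) -> pd.Series:
    """Return parking_fee converted to SEK using a fixed rate per currency."""
    rates = fees["currency"].map(RATES_TO_SEK)
    if rates.isna().any():
        # Fail loudly rather than silently producing NaN fees.
        unknown = sorted(fees.loc[rates.isna(), "currency"].unique())
        raise ValueError(f"No exchange rate for: {unknown}")
    return fees["parking_fee"] * rates

fees = pd.DataFrame({
    "currency": ["SEK", "EUR", "NOK"],
    "parking_fee": [8.5, 5.47, 13.89],
})
fees["parking_fee_sek"] = convert_to_sek(fees)
print(fees)
```

A single fixed rate per currency ignores exchange-rate movements over the years covered by the data, which is probably acceptable given how few non-SEK transactions there are.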

### Parking duration (in hours)

In [283]:
# Vectorized timedelta arithmetic: duration in hours
df["parking_duration"] = (df["parking_end_time"] - df["parking_start_time"]).dt.total_seconds() / 3600

### Weekday calculation (weekday = 1, weekend = 0)



In [284]:
df["weekday"] = df.apply(df_utils.get_weekday, axis=1)
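`df_utils.get_weekday` is not shown here. A vectorized sketch of the same idea follows, assuming the flag is derived from parking_start_time (the notebook does not say whether the start or end time is used); `starts` and `is_weekday` are illustrative names.

```python
import pandas as pd

# Toy start times: a Tuesday, a Saturday and a Sunday.
starts = pd.to_datetime(pd.Series([
    "2018-03-20 14:01:47",
    "2018-03-24 09:00:00",
    "2018-03-25 23:59:59",
]))

# dt.dayofweek is 0 for Monday through 6 for Sunday,
# so values below 5 are weekdays (1) and the rest weekend (0).
is_weekday = (starts.dt.dayofweek < 5).astype(int)
print(is_weekday.tolist())
```

Using the `.dt` accessor avoids a row-wise `apply`, which matters little at this data size but keeps the step consistent with the other vectorized features.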

### Unique cars used per account

In [285]:
# Count unique cars per user
user_car_counts = df.groupby("parkinguser_id")["car_id"].nunique()

# Map the counts back to the original dataframe
df["registered_cars"] = df["parkinguser_id"].map(user_car_counts)

# Display the first few rows to verify
df[["parking_id","parkinguser_id", "car_id", "registered_cars", "account_type"]].head()

Unnamed: 0,parking_id,parkinguser_id,car_id,registered_cars,account_type
0,fake_3f4411b532,fake_e764113cde,fake_14677cad5f,9,private
1,fake_7f26967cc9,fake_87f457ddef,fake_c50560f229,20,private
2,fake_ec85885056,fake_61d32bf6c5,fake_dd73ef56b8,2,corporate
3,fake_aefbf3f285,fake_dcec7e9cf0,fake_e7b12f6c21,7,private
5,fake_b265b02599,fake_256473c6ae,fake_8f045d4dab,1,private


### Parking count

In [286]:
# Count total parking transactions per user
user_parking_counts = df['parkinguser_id'].value_counts()
df['n_parkings'] = df['parkinguser_id'].map(user_parking_counts)
df[['parkinguser_id', 'n_parkings', "account_type"]].head()

Unnamed: 0,parkinguser_id,n_parkings,account_type
0,fake_e764113cde,893,private
1,fake_87f457ddef,429,private
2,fake_61d32bf6c5,132,corporate
3,fake_dcec7e9cf0,321,private
5,fake_256473c6ae,584,private


### Parking activity (normalized w.r.t days)

In [287]:
# Normalize parking frequency with respect to days used


# Ensure the timestamp columns are in datetime format
df["parking_start_time"] = pd.to_datetime(df["parking_start_time"])
df["parking_end_time"] = pd.to_datetime(df["parking_end_time"])

# Calculate account age in days for each user
user_first_last = df.groupby("parkinguser_id").agg(
    first_parking=("parking_start_time", "min"),
    last_parking=("parking_end_time", "max")
)

# Calculate account age in days (add 1 to avoid division by zero for single transactions)
user_first_last["account_age_days"] = (user_first_last["last_parking"] - user_first_last["first_parking"]).dt.days + 1

# Get total parkings per user (we already have this in "n_parkings")
# Calculate parking activity (parkings per day)
user_first_last["parking_activity"] = df.groupby("parkinguser_id").size() / user_first_last["account_age_days"]

# Map the parking_activity back to the original dataframe
df = df.merge(
    user_first_last[["parking_activity"]],
    left_on="parkinguser_id",
    right_index=True,
    how="left"
)

# Display the results
print("\nSummary statistics for parking_activity:")
print(df["parking_activity"].describe())
print("Sample of user parking activity (parkings per day):")
df[["parkinguser_id", "n_parkings", "parking_activity", "account_type"]].head(10)


Summary statistics for parking_activity:
count    81428.000000
mean         0.471321
std          0.330562
min          0.005571
25%          0.195626
50%          0.370934
75%          0.742671
max          1.234296
Name: parking_activity, dtype: float64
Sample of user parking activity (parkings per day):


Unnamed: 0,parkinguser_id,n_parkings,parking_activity,account_type
0,fake_e764113cde,893,0.786092,private
1,fake_87f457ddef,429,0.161036,private
2,fake_61d32bf6c5,132,0.211878,corporate
3,fake_dcec7e9cf0,321,0.370242,private
5,fake_256473c6ae,584,0.804408,private
6,fake_fb0b824df1,209,0.197729,private
8,fake_835dae552c,280,0.154525,corporate
9,fake_7b6d5c5995,211,0.387868,corporate
11,fake_118052c1ce,145,0.384615,corporate
12,fake_c9c716cee4,241,0.206159,private


- This feature has pros and cons. If the parking activity is uniform, it measures average activity pretty well. However, if the activity is not uniform (such as a long break between parkings), the activity will be underestimated.
- Parking activity (parkings per day) is likely a better measure than the total number of parking transactions, since a private user who has used the app for a long time will have more parking transactions than a new business user who has just started using the app.
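One possible way to soften the underestimation caused by long breaks is to normalize by the number of distinct days with at least one parking, rather than by the full account span. This is a sketch of an alternative, not what the notebook does; `active_days`, `per_span_day` and `per_active_day` are hypothetical names and the data is made up.

```python
import pandas as pd

# Toy data: u1 parks twice on one day, then once a year later.
tx = pd.DataFrame({
    "parkinguser_id": ["u1", "u1", "u1", "u2", "u2"],
    "parking_start_time": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 17:00", "2020-01-01 08:00",
        "2019-06-01 09:00", "2019-06-02 09:00",
    ]),
})

per_user = tx.groupby("parkinguser_id").agg(
    n_parkings=("parking_start_time", "size"),
    first_start=("parking_start_time", "min"),
    last_start=("parking_start_time", "max"),
    # Number of distinct calendar days with at least one parking.
    active_days=("parking_start_time", lambda s: s.dt.normalize().nunique()),
)

# Span-based activity (similar in spirit to the feature above) versus
# activity per active day.
per_user["span_days"] = (per_user["last_start"] - per_user["first_start"]).dt.days + 1
per_user["per_span_day"] = per_user["n_parkings"] / per_user["span_days"]
per_user["per_active_day"] = per_user["n_parkings"] / per_user["active_days"]
print(per_user[["n_parkings", "span_days", "active_days", "per_span_day", "per_active_day"]])
```

For u1 the span-based value is close to zero because of the one-year gap, while the per-active-day value stays at 1.5. Which of the two is more informative for separating private from corporate accounts would have to be tested.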

## Before

In [288]:
df.head()

Unnamed: 0,parking_id,area_type,parking_start_time,parking_end_time,parking_fee,currency,parkinguser_id,car_id,lat,lon,account_type,parking_fee_sek,parking_duration,weekday,registered_cars,n_parkings,parking_activity
0,fake_3f4411b532,SurfaceLot,2018-03-20 14:01:47,2018-03-20 15:56:11,11.5,SEK,fake_e764113cde,fake_14677cad5f,56.663799,12.85768,private,11.5,1.906667,1,9,893,0.786092
1,fake_7f26967cc9,SurfaceLot,2016-04-05 05:19:15,2016-04-05 06:12:00,13.89,NOK,fake_87f457ddef,fake_c50560f229,59.909424,10.781888,private,13.0566,0.879167,1,20,429,0.161036
2,fake_ec85885056,OnStreet,2020-05-19 05:50:14,2020-05-19 14:30:00,173.33,SEK,fake_61d32bf6c5,fake_dd73ef56b8,59.362391,18.01822,corporate,173.33,8.662778,1,2,132,0.211878
3,fake_aefbf3f285,SurfaceLot,2020-03-11 17:10:11,2020-03-11 19:03:00,20.02,SEK,fake_dcec7e9cf0,fake_e7b12f6c21,59.359427,17.896455,private,20.02,1.880278,1,7,321,0.370242
5,fake_b265b02599,OnStreet,2019-04-03 10:30:35,2019-04-03 16:30:00,12.0,SEK,fake_256473c6ae,fake_8f045d4dab,60.676206,17.157247,private,12.0,5.990278,1,1,584,0.804408


## Final Visualization (After)

In [289]:
df = df.drop(columns=['parking_id', "parking_fee", 'lat', 'lon', "currency", "car_id", "parking_start_time", "parking_end_time"])
df = df[['parkinguser_id'] + [col for col in df.columns if col not in ['parkinguser_id', 'account_type']] + ['account_type']]
df.head()

Unnamed: 0,parkinguser_id,area_type,parking_fee_sek,parking_duration,weekday,registered_cars,n_parkings,parking_activity,account_type
0,fake_e764113cde,SurfaceLot,11.5,1.906667,1,9,893,0.786092,private
1,fake_87f457ddef,SurfaceLot,13.0566,0.879167,1,20,429,0.161036,private
2,fake_61d32bf6c5,OnStreet,173.33,8.662778,1,2,132,0.211878,corporate
3,fake_dcec7e9cf0,SurfaceLot,20.02,1.880278,1,7,321,0.370242,private
5,fake_256473c6ae,OnStreet,12.0,5.990278,1,1,584,0.804408,private


## Save processed data

In [290]:
save_df = False
# Save cleaned dataframe
if save_df:
    clean_path = data_path[:-4] + '-cleaned.csv'
    df.to_csv(clean_path, index=False)

## Future work 

- Longitude and latitude coordinates could be useful if we had some kind of embedding model that translates them into a richer representation. Right now the raw coordinates are not very useful to the model, especially for standard ML models.
- Including the start time in the model input could be useful, but one needs to be careful about how to encode it. Using the absolute date is bad since it has a large range of values. One should instead exploit the periodicity of the 24-hour clock, for example by adding a feature for the offset from midnight (2018-11-27 15:17:50 -> about 15.3 hours). This is still an issue, since 23:59 and 00:01 end up far apart even though they are only two minutes from each other. Adding a trigonometric transformation such as sine and cosine with a 24-hour period would capture this periodicity and limit the range of values. Since we have the parking duration, we do not need to include the end time, as it is redundant information.
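The sine/cosine encoding of the time of day mentioned above could look like this. It is a minimal sketch; the column names `hour_sin` and `hour_cos` and the helper `dist` are illustrative.

```python
import numpy as np
import pandas as pd

starts = pd.to_datetime(pd.Series([
    "2018-11-27 15:17:50",
    "2018-11-27 23:59:00",
    "2018-11-28 00:01:00",
]))

# Fractional hour of the day in [0, 24).
hours = starts.dt.hour + starts.dt.minute / 60 + starts.dt.second / 3600

# Map the hour onto the unit circle with a 24-hour period.
angle = 2 * np.pi * hours / 24
features = pd.DataFrame({"hour_sin": np.sin(angle), "hour_cos": np.cos(angle)})

def dist(i: int, j: int) -> float:
    """Euclidean distance between two encoded time points."""
    diff = features.iloc[i].to_numpy() - features.iloc[j].to_numpy()
    return float(np.linalg.norm(diff))

# 23:59 and 00:01 end up close together; 15:17 and 23:59 do not.
print(dist(1, 2), dist(0, 1))
```

Both components are needed: sine alone (or cosine alone) maps two different times of day to the same value, while the pair identifies each time uniquely and keeps values in [-1, 1].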