# Data preprocessing

## Summary of data analysis and findings


### Data exploration and analysis

### Data cleaning

### Feature engineering

## Data exploration and analysis

In [167]:
import df_utils
import functionals
import pandas as pd

In [168]:
# Load data into Pandas DataFrame
data_path = "assignment-sample-data.csv"
df = pd.read_csv(data_path)

In [169]:
from IPython.display import display
pd.set_option('display.width', 1000)

In [170]:
# Show first entries
df.head()
#print(df.tail())

Unnamed: 0,parking_id,area_type,parking_start_time,parking_end_time,parking_fee,currency,parkinguser_id,car_id,lat,lon,account_type
0,fake_c28a323810,SurfaceLot,2015-03-06 19:55:41,2015-03-06 20:07:00,8.5,SEK,fake_bf5d9b530e,fake_130ae2aeb1,59.24637,18.077019,corporate
1,fake_76c21cf355,SurfaceLot,2015-03-06 18:08:20,2015-03-06 19:46:00,15.67,SEK,fake_bf5d9b530e,fake_130ae2aeb1,59.231789,18.083995,corporate
2,fake_995ed971a6,OnStreet,2017-07-21 09:55:42,2017-07-21 14:23:50,67.0,SEK,fake_3ba346a0cd,fake_f7a9d564d9,59.350331,18.096649,corporate
3,fake_6b81ea4f35,SurfaceLot,2017-07-24 07:21:12,2017-07-24 07:34:31,4.34,SEK,fake_ea19a50003,fake_fae7e31b34,59.315826,18.098355,corporate
4,fake_424b61e0eb,SurfaceLot,2015-03-09 12:05:46,2015-03-09 13:57:54,50.5,SEK,fake_1cc1970582,fake_0755f3c71f,59.320919,18.047513,corporate


- 11 columns (10 features, 1 target)
- I suspect that the rows are sorted by account_type, since the first 50 are corporate and the last 50 are private. This means we should shuffle the rows before training

In [171]:
shuffle_rows = True
if shuffle_rows:
    df = df.sample(frac=1).reset_index(drop=True)

In [172]:
print(df.columns)

Index(['parking_id', 'area_type', 'parking_start_time', 'parking_end_time', 'parking_fee', 'currency', 'parkinguser_id', 'car_id', 'lat', 'lon', 'account_type'], dtype='object')


In [173]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87489 entries, 0 to 87488
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   parking_id          87489 non-null  object 
 1   area_type           87489 non-null  object 
 2   parking_start_time  87489 non-null  object 
 3   parking_end_time    87489 non-null  object 
 4   parking_fee         87489 non-null  float64
 5   currency            87489 non-null  object 
 6   parkinguser_id      87489 non-null  object 
 7   car_id              87489 non-null  object 
 8   lat                 87489 non-null  float64
 9   lon                 87489 non-null  float64
 10  account_type        87489 non-null  object 
dtypes: float64(3), object(8)
memory usage: 7.3+ MB
None


- 11 columns from raw dataset
- 87489 rows (transactions)
- No null values
- Mix of float (3) and object (8) dtypes

In [174]:
print(df.describe())

        parking_fee           lat           lon
count  87489.000000  87489.000000  87489.000000
mean      31.786153     58.786174     17.035432
std       39.338515      2.925073      2.639340
min        0.000000   -180.006219   -180.006783
25%        9.000000     59.292461     17.137638
50%       18.750000     59.332389     18.000587
75%       39.100000     59.360741     18.066440
max      983.000000     67.871132     54.009971


- parking_fee has some outliers
    - min = 0 (how does this make sense?)
    - max = 983 (very high compared to the 75th percentile, which is 39.1)
    - mean = 31.8
    - std = 39.3 (larger than the mean => the fees are highly dispersed)
    - We could, for example, remove outliers and standardize the data for easier training
- Latitude values are not in the valid range
    - valid latitudes lie in [-90, 90], but we observe a minimum of -180.006
    - offsetting won't help, since the observed latitudes span about 248 degrees (from -180.006 to 67.871), which is more than the 180 degrees a valid latitude range covers. I therefore do not trust these values to be correctly measured
- Longitude is almost entirely in the valid range [-180, 180]; only the minimum (-180.007) falls marginally outside it
- Latitude and longitude by themselves do not give much information. Using them would require some kind of embedding model in between that maps the coordinate point (latitude, longitude) to a feature space that includes information about the location and parking behaviour
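
The range check described above can be sketched as follows. This is a minimal illustration on toy rows, not part of the original notebook; the `coords` frame and the `valid_coords` column name are assumptions made for the example, and the out-of-range values are copied from the `df.describe()` output.

```python
import pandas as pd

# Toy rows; the second row reuses the minimum lat/lon reported by df.describe().
coords = pd.DataFrame({
    "lat": [59.24637, -180.006219, 67.871132],
    "lon": [18.077019, -180.006783, 54.009971],
})

# Valid geographic ranges: latitude in [-90, 90], longitude in [-180, 180].
valid_lat = coords["lat"].between(-90, 90)
valid_lon = coords["lon"].between(-180, 180)

# Hypothetical flag column: True only when both coordinates are plausible.
coords["valid_coords"] = valid_lat & valid_lon

print(coords)
print(f"Rows with invalid coordinates: {(~coords['valid_coords']).sum()}")
```

The same two `between` masks could be applied to `df` to count or drop the affected transactions.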

In [175]:
print(df["account_type"].unique())
print(df["account_type"].value_counts())
print(5*"-")
print(df["currency"].unique())
print(df["currency"].value_counts())
print(5*"-")
print(df["area_type"].unique())
print(df["area_type"].value_counts())
print(5*"-")
print(f"Number of rows in df: {len(df)}")
print(f"Number of unique parking_ids: {df['parking_id'].nunique()}")
print(f"Number of unique parkinguser_ids: {df['parkinguser_id'].nunique()}")
print(f"Number of unique car_ids: {df['car_id'].nunique()}")

['private' 'corporate']
account_type
private      57239
corporate    30250
Name: count, dtype: int64
-----
['SEK' 'NOK' 'DKK' 'EUR']
currency
SEK    86918
NOK      465
DKK       87
EUR       19
Name: count, dtype: int64
-----
['SurfaceLot' 'OnStreet' 'Administrative' 'UndergroundGarage'
 'AboveGroundGarage' 'EVC' 'CameraParkArea']
area_type
OnStreet             52747
SurfaceLot           26749
Administrative        6269
UndergroundGarage     1248
AboveGroundGarage      459
EVC                     14
CameraParkArea           3
Name: count, dtype: int64
-----
Number of rows in df: 87489
Number of unique parking_ids: 87489
Number of unique parkinguser_ids: 300
Number of unique car_ids: 1652


- The split between private (57239) and corporate (30250) transactions is not exactly 50/50, but the classes are of the same order of magnitude, so stratified sampling is not strictly necessary
- There are 4 different currencies in the dataset; the vast majority of transactions are in SEK. Two options, both viable:
    - convert all fees to their SEK equivalent
    - remove transactions in other currencies
- area_type has 7 different values. "OnStreet" and "SurfaceLot" are the most common, while "EVC" and "CameraParkArea" are the least common
    - Before removing "EVC" and "CameraParkArea", we should check whether those samples show any pattern (correlation) that we could use to predict the account_type
- area_type requires categorical encoding (for example one-hot encoding) before it can be used as input to the model.
- parking_id is unique for every row (87489 unique values for 87489 rows), as expected for a primary key, so it carries no information for the model
- There are 300 unique parking users (accounts) in total, private and corporate combined
- There are 1652 unique car ids used for parking transactions. Since this is larger than the number of parking users, some users have multiple cars, which is indicative of a business account (a business has a fleet of cars). It could be useful to create a feature for the number of used/registered cars per user
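
As a minimal sketch of the one-hot encoding mentioned above, `pd.get_dummies` can expand area_type into one indicator column per value. The `sample` frame and the `area` prefix are assumptions chosen for this example; the notebook itself does not show which encoding is finally used.

```python
import pandas as pd

# Toy sample with a few of the area_type values seen in the dataset.
sample = pd.DataFrame({
    "area_type": ["OnStreet", "SurfaceLot", "OnStreet", "EVC"],
    "account_type": ["private", "corporate", "private", "corporate"],
})

# One indicator column per area_type value; the original column is dropped.
encoded = pd.get_dummies(sample, columns=["area_type"], prefix="area", dtype=int)

print(encoded)
```

With only 14 "EVC" and 3 "CameraParkArea" rows, the corresponding indicator columns would be almost always zero, which is one reason to consider dropping or merging those categories.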

### Check if any "parkinguser_id" exists with multiple account types (corporate and private)

In [176]:
parkinguser_ids = {} # parkinguser_id -> account_type

# Loop over all transactions
for index, row in df.iterrows(): 
    user_id = row['parkinguser_id']
    account_type = row['account_type']

    if user_id not in parkinguser_ids: # If user_id is not in the dictionary, add it
        parkinguser_ids[user_id] = account_type
    else: # If user_id is already in the dictionary, check if the account_type is the same

        if parkinguser_ids[user_id] != account_type:
            print(f"User {user_id} has multiple account types: {parkinguser_ids[user_id]} and {account_type}")

- The loop prints nothing, so each parkinguser_id has only one account_type. If a person has both a private and a corporate account, then they make transactions using different parkinguser_ids!
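
The same check can be written without an explicit loop. The sketch below is an alternative formulation on toy data (the `tx` frame and its user ids are made up for illustration): it counts the distinct account types per user and lists every user that has more than one.

```python
import pandas as pd

# Toy transactions; user "u3" deliberately appears with both account types.
tx = pd.DataFrame({
    "parkinguser_id": ["u1", "u1", "u2", "u3", "u3"],
    "account_type": ["private", "private", "corporate", "private", "corporate"],
})

# Number of distinct account types seen for each user.
types_per_user = tx.groupby("parkinguser_id")["account_type"].nunique()

# Users that appear with more than one account type.
mixed_users = types_per_user[types_per_user > 1].index.tolist()
print(mixed_users)
```

Applied to `df`, an empty list would correspond to the loop above printing nothing.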

### Check if private accounts use cars that corporate accounts have used

In [177]:
violations = functionals.corporate_car(df)

for index, row in violations.iterrows():
    car_id = row['car_id']
    parkinguser_id = row['parkinguser_id']
    matching_transactions = df[(df['car_id'] == car_id) & (df['parkinguser_id'] != parkinguser_id)]
    print(f"Transactions with car_id {car_id}:")
    print(matching_transactions[['parking_id', 'parkinguser_id', 'car_id', 'account_type']])
    print()


Transactions with car_id fake_3c010b1ca5:
            parking_id   parkinguser_id           car_id account_type
5390   fake_29e57ca780  fake_c28849a708  fake_3c010b1ca5    corporate
28501  fake_5aea0993d9  fake_c28849a708  fake_3c010b1ca5    corporate
70408  fake_0885e9aca0  fake_c28849a708  fake_3c010b1ca5    corporate

Transactions with car_id fake_3c010b1ca5:
            parking_id   parkinguser_id           car_id account_type
5390   fake_29e57ca780  fake_c28849a708  fake_3c010b1ca5    corporate
28501  fake_5aea0993d9  fake_c28849a708  fake_3c010b1ca5    corporate
70408  fake_0885e9aca0  fake_c28849a708  fake_3c010b1ca5    corporate



### Check cars that are used by more than one account

In [190]:
shared_cars = functionals.shared_car(df)
len(shared_cars)
shared_cars

Unnamed: 0,car_id,user_count,user_ids
373,fake_3c010b1ca5,2,"[fake_c28849a708, fake_c8f3930a70]"
1021,fake_a46999824b,2,"[fake_b34f71b684, fake_cef59702cd]"
1520,fake_f52eb84892,2,"[fake_b34f71b684, fake_cef59702cd]"
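
The implementation of `functionals.shared_car` is not shown in this notebook. A minimal sketch of how such a helper could work, assuming it simply groups by car_id and keeps cars used by more than one parkinguser_id, is given below; `find_shared_cars` is a hypothetical name, and the output columns are chosen to mirror the table above.

```python
import pandas as pd

def find_shared_cars(frame):
    """Return one row per car_id that is used by more than one parkinguser_id."""
    grouped = frame.groupby("car_id")["parkinguser_id"]
    per_car = pd.DataFrame({
        "user_count": grouped.nunique(),           # distinct users per car
        "user_ids": grouped.unique().map(sorted),  # sorted list of those users
    })
    shared = per_car[per_car["user_count"] > 1]
    return shared.rename_axis("car_id").reset_index()

# Toy data: car "c1" is used by two different users, car "c2" by one.
tx = pd.DataFrame({
    "car_id": ["c1", "c1", "c2", "c1"],
    "parkinguser_id": ["u1", "u2", "u3", "u1"],
})
shared = find_shared_cars(tx)
print(shared)
```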


### Max datetime range

In [178]:
# Data time range
df['parking_start_time'] = pd.to_datetime(df["parking_start_time"])
df['parking_end_time'] = pd.to_datetime(df["parking_end_time"])

# Find the earliest start time and latest end time
earliest_start = df['parking_start_time'].min()
latest_end = df['parking_end_time'].max()

# Calculate the total time span
time_span = latest_end - earliest_start

print(f"Earliest transaction start: {earliest_start}")
print(f"Latest transaction end: {latest_end}")
print(f"Total time span: {time_span.days} days, {time_span.seconds // 3600} hours")

Earliest transaction start: 2013-02-08 10:40:58
Latest transaction end: 2020-09-30 14:34:35
Total time span: 2791 days, 3 hours


- The dataset spans a very long period (2791 days, about 7.6 years), so normalizing by this global time span is not a good idea. Some users could have been active for a long time while others have only been active for a short time.
- It is better to normalize by the time span of each parkinguser_id

## Data cleaning

In [179]:
## Remove duplicate rows

# Create a temporary DataFrame without the parking_id column
temp_df = df.drop(columns=['parking_id'])

# Check for duplicates
duplicate_rows = temp_df[temp_df.duplicated(keep=False)]

# Display the number of duplicates and the duplicate rows
print(f"Number of duplicate rows (excluding parking_id): {len(duplicate_rows)}")
if not duplicate_rows.empty:
    print("\nDuplicate rows (excluding parking_id):")
    display(duplicate_rows.sort_values(by=temp_df.columns.tolist()))
else:
    print("No duplicate rows found when excluding parking_id")



# Remove outliers in some features

# Calculate the 99th percentile of parking_fee
fee_99th_percentile = df['parking_fee'].quantile(0.99)

# Store the number of rows before filtering
rows_before = len(df)

# Filter the DataFrame: keep fees above zero and at most the 99th percentile.
# .copy() avoids SettingWithCopyWarning when new columns are added later.
df = df[(df['parking_fee'] > 0) & (df['parking_fee'] <= fee_99th_percentile)].copy()

# Calculate the number of rows removed
rows_removed = rows_before - len(df)
rows_removed_pct = (rows_removed / rows_before) * 100

# Print the results
print(f"99th percentile of parking_fee: {fee_99th_percentile:.2f}")
print(f"Number of rows before filtering: {rows_before}")
print(f"Number of rows after filtering: {len(df)}")
print(f"Number of rows removed: {rows_removed} ({rows_removed_pct:.2f}%)")

# Show the new distribution
print("\nNew parking_fee statistics:")
print(df['parking_fee'].describe())



## Convert date columns to datetime

# Convert the time columns to datetime format
df['parking_start_time'] = pd.to_datetime(df['parking_start_time'])
df['parking_end_time'] = pd.to_datetime(df['parking_end_time'])

# Verify the conversion
print("Data types after conversion:")
print(df[['parking_start_time', 'parking_end_time']].dtypes)

# Display a few samples to verify the conversion
print("\nSample of converted times:")
print(df[['parking_start_time', 'parking_end_time']].head())


Number of duplicate rows (excluding parking_id): 0
No duplicate rows found when excluding parking_id
99th percentile of parking_fee: 180.00
Number of rows before filtering: 87489
Number of rows after filtering: 81428
Number of rows removed: 6061 (6.93%)

New parking_fee statistics:
count    81428.000000
mean        31.591334
std         32.614929
min          0.070000
25%         10.280000
50%         20.000000
75%         40.000000
max        180.000000
Name: parking_fee, dtype: float64
Data types after conversion:
parking_start_time    datetime64[ns]
parking_end_time      datetime64[ns]
dtype: object

Sample of converted times:
   parking_start_time    parking_end_time
0 2018-05-17 12:47:02 2018-05-17 14:08:00
1 2020-02-25 09:40:35 2020-02-25 10:34:55
2 2020-08-10 20:57:50 2020-08-11 09:58:59
3 2016-07-08 13:09:36 2016-07-08 15:30:00
5 2018-02-13 09:08:31 2018-02-13 10:16:41


## Feature Engineering

### Currency conversion

In [180]:
df["parking_fee_sek"] = df[["currency", "parking_fee"]].apply(df_utils.convert_currency, axis=1)
print(df[df["currency"] == "EUR"].head())

            parking_id area_type  parking_start_time    parking_end_time  parking_fee currency   parkinguser_id           car_id        lat        lon account_type  parking_fee_sek
7610   fake_336bec8abf  OnStreet 2018-06-05 15:13:50 2018-06-05 15:30:00         1.05      EUR  fake_7047370fa5  fake_1fbe5a7dbe  60.397872  25.669604      private          11.6655
8592   fake_d7e00b2a72  OnStreet 2017-07-08 17:54:45 2017-07-08 22:29:59        10.58      EUR  fake_c66fd0c84a  fake_4e7288eccf  52.507050  13.452954      private         117.5438
13729  fake_ecde0ab81b  OnStreet 2018-09-27 07:59:18 2018-09-27 08:47:09         1.57      EUR  fake_9937b70a6a  fake_52096b23b4  54.008250  10.769583      private          17.4427
15376  fake_3f124ae99a  OnStreet 2020-07-29 07:19:44 2020-07-29 13:27:18         4.98      EUR  fake_ade3f3f432  fake_644ff1469a  46.548255  12.134466      private          55.3278
26989  fake_98fe609f77  OnStreet 2020-08-04 07:53:43 2020-08-04 09:10:00         1.28      EUR 
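
`df_utils.convert_currency` is a project helper whose code is not shown here. The sketch below illustrates one way such a conversion could be done with a fixed rate table. The EUR rate of 11.11 is inferred from the output above (1.05 EUR maps to 11.6655 SEK); the rates for NOK and DKK are not recoverable from the notebook, so they are deliberately left out, and `to_sek` and `RATES_TO_SEK` are hypothetical names.

```python
import pandas as pd

# Fixed conversion rates to SEK. EUR is inferred from the notebook output;
# NOK and DKK would have to be added before running this on the full dataset.
RATES_TO_SEK = {"SEK": 1.0, "EUR": 11.11}

def to_sek(frame, rates=RATES_TO_SEK):
    """Convert parking_fee to SEK using a per-currency multiplier."""
    factor = frame["currency"].map(rates)
    if factor.isna().any():
        missing = sorted(frame.loc[factor.isna(), "currency"].unique())
        raise ValueError(f"No exchange rate for: {missing}")
    return frame["parking_fee"] * factor

fees = pd.DataFrame({"currency": ["SEK", "EUR"], "parking_fee": [8.5, 1.05]})
fees["parking_fee_sek"] = to_sek(fees)
print(fees)
```

A single fixed rate per currency is an approximation, since the transactions span several years of changing exchange rates.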

### Parking duration (in hours)

In [181]:
# Vectorized difference between end and start time, converted from seconds to hours
df["parking_duration"] = (df["parking_end_time"] - df["parking_start_time"]).dt.total_seconds() / 3600

### Weekday calculation (0: Monday, 6: Sunday)



In [182]:
df["weekday"] = df.apply(df_utils.get_weekday, axis=1)
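
`df_utils.get_weekday` is not shown in this notebook. Assuming it derives the weekday from parking_start_time (which is consistent with the weekday values in the final table), an equivalent vectorized sketch uses the pandas `.dt.weekday` accessor, where Monday is 0 and Sunday is 6. The `times` frame below reuses two start times from the dataset.

```python
import pandas as pd

# Two start times taken from the sample rows of the dataset.
times = pd.DataFrame({
    "parking_start_time": pd.to_datetime(
        ["2018-05-17 12:47:02", "2020-08-10 20:57:50"]
    )
})

# Monday=0 ... Sunday=6, computed for the whole column at once.
times["weekday"] = times["parking_start_time"].dt.weekday
print(times)
```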

### Unique cars used per account

In [183]:
# Count unique cars per user
user_car_counts = df.groupby("parkinguser_id")["car_id"].nunique()

# Map the counts back to the original dataframe
df["registered_cars"] = df["parkinguser_id"].map(user_car_counts)

# Display the first few rows to verify
print(df[["parking_id","parkinguser_id", "car_id", "registered_cars", "account_type"]].head(10))

         parking_id   parkinguser_id           car_id  registered_cars account_type
0   fake_fa405290aa  fake_e764113cde  fake_14677cad5f                9      private
1   fake_b56ea92537  fake_99c739c331  fake_e182ab235b                3      private
2   fake_b53acafccd  fake_108e7f0a70  fake_7c4ff2a0fd               16      private
3   fake_297fa2cdfd  fake_c28849a708  fake_5aaf55d5f8               22    corporate
5   fake_32e012bcdb  fake_0cbece604e  fake_71d099142e               10    corporate
6   fake_b9daadc4b8  fake_0cbece604e  fake_cc41f3668e               10    corporate
7   fake_7f2571f018  fake_142265a97f  fake_c8868fabc4                1      private
8   fake_25c8613e9d  fake_b4c70dbc62  fake_33b1fa2d81                4      private
9   fake_1c87599e58  fake_108e7f0a70  fake_7c4ff2a0fd               16      private
10  fake_dca924d9b9  fake_efa0218a6c  fake_d089346a2a               10      private


### Parking count

In [184]:
# Count total parking transactions per user
user_parking_counts = df['parkinguser_id'].value_counts()
df['n_parkings'] = df['parkinguser_id'].map(user_parking_counts)
print(df[['parkinguser_id', 'n_parkings', "account_type"]].head(10))

     parkinguser_id  n_parkings account_type
0   fake_e764113cde         893      private
1   fake_99c739c331         291      private
2   fake_108e7f0a70        1519      private
3   fake_c28849a708        1952    corporate
5   fake_0cbece604e        2848    corporate
6   fake_0cbece604e        2848    corporate
7   fake_142265a97f          52      private
8   fake_b4c70dbc62         722      private
9   fake_108e7f0a70        1519      private
10  fake_efa0218a6c         216      private


### Parking activity (normalized w.r.t days)

In [185]:
# Normalize parking frequency with respect to days used


# Ensure the timestamp columns are in datetime format
df["parking_start_time"] = pd.to_datetime(df["parking_start_time"])
df["parking_end_time"] = pd.to_datetime(df["parking_end_time"])

# Calculate account age in days for each user
user_first_last = df.groupby("parkinguser_id").agg(
    first_parking=("parking_start_time", "min"),
    last_parking=("parking_end_time", "max")
)

# Calculate account age in days (add 1 to avoid division by zero for single transactions)
user_first_last["account_age_days"] = (user_first_last["last_parking"] - user_first_last["first_parking"]).dt.days + 1

# Get total parkings per user (we already have this in "n_parkings")
# Calculate parking activity (parkings per day)
user_first_last["parking_activity"] = df.groupby("parkinguser_id").size() / user_first_last["account_age_days"]

# Map the parking_activity back to the original dataframe
df = df.merge(
    user_first_last[["parking_activity"]],
    left_on="parkinguser_id",
    right_index=True,
    how="left"
)

# Display the results
print("Sample of user parking activity (parkings per day):")
print(df[["parkinguser_id", "n_parkings", "parking_activity", "account_type"]].head(10))
print("\nSummary statistics for parking_activity:")
print(df["parking_activity"].describe())

Sample of user parking activity (parkings per day):
     parkinguser_id  n_parkings  parking_activity account_type
0   fake_e764113cde         893          0.786092      private
1   fake_99c739c331         291          0.334483      private
2   fake_108e7f0a70        1519          0.757228      private
3   fake_c28849a708        1952          1.097863    corporate
5   fake_0cbece604e        2848          1.032632    corporate
6   fake_0cbece604e        2848          1.032632    corporate
7   fake_142265a97f          52          0.067885      private
8   fake_b4c70dbc62         722          0.287879      private
9   fake_108e7f0a70        1519          0.757228      private
10  fake_efa0218a6c         216          0.147239      private

Summary statistics for parking_activity:
count    81428.000000
mean         0.471321
std          0.330562
min          0.005571
25%          0.195626
50%          0.370934
75%          0.742671
max          1.234296
Name: parking_activity, dtype: float64

- This feature has pros and cons. If a user's parking activity is uniform over time, it measures the average activity well. However, if the activity is not uniform (for example, a long break between parkings), the activity will be underestimated.
- Parkings per day is a better measure than the total number of parking transactions, since a private user who has used the app for a long time can have more parking transactions than a business user who has just started using the app.
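
The underestimation caused by a long break can be illustrated with a small worked example. The user below is hypothetical: ten parkings on ten consecutive days, followed by a single parking a year later. The "parkings per active day" figure is shown only for comparison and is not a feature used in this notebook.

```python
import pandas as pd

# Hypothetical user: 10 parkings on consecutive days, then one more a year later.
starts = pd.to_datetime(
    [f"2019-01-{day:02d} 08:00:00" for day in range(1, 11)]
    + ["2020-01-10 08:00:00"]
)
ends = starts + pd.Timedelta(hours=1)

# Same definition as above: days between first start and last end, plus one.
account_age_days = (ends.max() - starts.min()).days + 1
parking_activity = len(starts) / account_age_days

# For comparison: parkings divided by the number of distinct days with a parking.
active_days = starts.normalize().nunique()
parkings_per_active_day = len(starts) / active_days

print(f"account_age_days: {account_age_days}")
print(f"parking_activity: {parking_activity:.4f}")
print(f"parkings_per_active_day: {parkings_per_active_day:.4f}")
```

Although the user parked once a day whenever they were active, the year-long gap pulls parking_activity down to roughly 0.03 parkings per day.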

## Final Visualization

In [186]:
# Drop unwanted columns
df = df.drop(columns=['parking_id', "parking_fee", 'lat', 'lon'])


In [187]:
df.head()

Unnamed: 0,area_type,parking_start_time,parking_end_time,currency,parkinguser_id,car_id,account_type,parking_fee_sek,parking_duration,weekday,registered_cars,n_parkings,parking_activity
0,SurfaceLot,2018-05-17 12:47:02,2018-05-17 14:08:00,SEK,fake_e764113cde,fake_14677cad5f,private,8.1,1.349444,3,9,893,0.786092
1,OnStreet,2020-02-25 09:40:35,2020-02-25 10:34:55,SEK,fake_99c739c331,fake_e182ab235b,private,18.45,0.905556,1,3,291,0.334483
2,SurfaceLot,2020-08-10 20:57:50,2020-08-11 09:58:59,SEK,fake_108e7f0a70,fake_7c4ff2a0fd,private,69.33,13.019167,0,16,1519,0.757228
3,OnStreet,2016-07-08 13:09:36,2016-07-08 15:30:00,SEK,fake_c28849a708,fake_5aaf55d5f8,corporate,35.25,2.34,4,22,1952,1.097863
5,SurfaceLot,2018-02-13 09:08:31,2018-02-13 10:16:41,SEK,fake_0cbece604e,fake_71d099142e,corporate,11.33,1.136111,1,10,2848,1.032632


## Save processed data

In [188]:
save_df = False
# Save cleaned dataframe
if save_df:
    clean_path = data_path[:-4] + '-cleaned.csv'
    df.to_csv(clean_path, index=False)