In [1]:
# library
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# # ML
# from sklearn.model_selection import train_test_split
# # Classification
# from sklearn.metrics import accuracy_score
# from sklearn.metrics import confusion_matrix
# # Regression
# from statsmodels.tools.eval_measures import rmse
# from statsmodels.stats.outliers_influence import variance_inflation_factor
# # KNN & Decision tree
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.tree import DecisionTreeClassifier
# # MinMax Scaler (Normalisation)
# from sklearn.preprocessing import MinMaxScaler
# from sklearn.preprocessing import StandardScaler
# Warnings
import warnings
warnings.filterwarnings('ignore')

# **DTI-DS Capstone 2 (Logic Testing)**

## General Overview

1. <mark>**Background:**</mark>
    
Transjakarta is the first Bus Rapid Transit (BRT) system in Southeast Asia and has operated in Jakarta, Indonesia, since 2004. It was designed as a mass transportation mode to support the around-the-clock activities of Indonesia’s capital city.

Transjakarta has the longest track in the world (251.2 km), with 260 bus stops spread across 13 corridors. It initially operated from 05.00 to 22.00 WIB; it now operates 24 hours a day, but only on certain corridors. With its extensive network of routes and ease of use, Transjakarta has become the leading and favorite mode of transportation for the so-called “Jakartans”.

However, several problems persist and must be handled: sexual harassment, which still often affects women; pickpocketing; bus crashes; poor schedule punctuality; accumulation of passengers at bus stops; and overcrowding inside the buses themselves. These all stem from overcrowding and from less-than-optimal fleet distribution and scheduling.

2. <mark>**Problem Statement:**</mark>

Overcrowding has led to several of the Transjakarta problems described in the Background. Transjakarta therefore wants to study its “overcrowding” problem in relation to fleet distribution and scheduling, in order to evaluate and improve its services to passengers (“Jakartans”).

3. <mark>**Data:**</mark>

The data covers passengers for the month of April 2023. It initially consists of 37,900 rows (reduced to 35,476 after preprocessing) and 22 columns. The data can be seen as follows:

4. <mark>**Data Analysis:**</mark>

Overcrowding can be identified through several variables that help us measure the overall quantity of passengers. These variables can also help describe the demography of our passengers in the form of customer segmentation. For this research, we focus on the variables that can be associated with “overcrowding”. These variables are <mark>**highlighted**</mark> with the <mark>**arrow (->)**</mark> notation below:

#### -> Biodata:
1.	<mark>**transID:**</mark> <u>Unique transaction ID for every transaction</u>
2.	<mark>**payCardID:**</mark> <u>Customer's main identifier: the card customers use as a ticket for entrance and exit.</u>
3.	<mark>**payCardBank:**</mark> <u>Name of the bank that issued the customer's card</u> <mark>**-> Payment Gateway Analysis**</mark>
4.	<mark>**payCardName:**</mark> <u>Customer's name as embedded in the card.</u>
5.	<mark>**payCardSex:**</mark> <u>Customer's sex as embedded in the card</u> <mark>**-> Gender Analysis**</mark>
6.	<mark>**payCardBirthDate:**</mark> <u>Customer's birth year</u> <mark>**-> Customer Segmentation by Age**</mark>
#### -> Journey (Trip Details):
7.	<mark>**corridorID:**</mark> <u>Corridor ID / Route ID as key for route grouping.</u> <mark>**-> Corridor Analysis**</mark>
8.	<mark>**corridorName:**</mark> <u>Corridor Name / Route Name contains Start and Finish for each route.</u> <mark>**-> Corridor Analysis**</mark>
9.	<mark>**direction:**</mark> <u>0 for Go, 1 for Back. Direction of the route. (0: Right_address -> Left_address & 1: Left_address -> Right_address)</u> <mark>**-> In/Out Analysis**</mark>
#### -> Journey (Tap-In details):
10.	<mark>**tapInStops:**</mark> <u>Tap In (entrance) Stops ID for identifying stops name</u>
11.	<mark>**tapInStopsName:**</mark> <u>Tap In (entrance) Stops Name where customers tap in.</u> <mark>**-> Bus Stop Analysis**</mark>
12.	<mark>**tapInStopsLat:**</mark> <u>Latitude of Tap In Stops</u> <mark>**-> Geo Analysis**</mark>
13.	<mark>**tapInStopsLon:**</mark> <u>Longitude of Tap In Stops</u>
14.	<mark>**stopStartSeq:**</mark> <u>Sequence of the stops, 1st stop, 2nd stops etc. Related to direction. (the N-th startingStop to the endingStop from Right_address (direc: 0) OR Left_address (direc: 1))</u> <mark>**-> stopCount Analysis**</mark>
15.	<mark>**tapInTime:**</mark> <u>Time of tap in. Date and time</u> <mark>**-> Time-Based Analysis**</mark>
#### -> Journey (Tap-Out details):
16.	<mark>**tapOutStops:**</mark> <u>Tap Out (Exit) Stops ID for identifying stops name</u>
17.	<mark>**tapOutStopsName:**</mark> <u>Tap out (exit) Stops Name where customers tap out.</u> <mark>**-> Bus Stop Analysis**</mark>
18.	<mark>**tapOutStopsLat:**</mark> <u>Latitude of Tap Out Stops</u> <mark>**-> Geo Analysis**</mark>
19.	<mark>**tapOutStopsLon:**</mark> <u>Longitude of Tap Out Stops</u>
20.	<mark>**stopEndSeq:**</mark> <u>Sequence of the stops, 1st stop, 2nd stops etc. Related to direction.(the N-th startingStop to the endingStop from Right_address (direc: 0) OR Left_address (direc: 1))</u> <mark>**-> stopCount Analysis**</mark>
21.	<mark>**tapOutTime:**</mark> <u>Time of tap out. Date and time</u> <mark>**-> Time-Based Analysis**</mark>
#### -> Journey (Trip Details):
22.	<mark>**payAmount:**</mark> <u>The amount customers pay. Some trips are free; others are not.</u> <mark>**-> Revenue Analysis**</mark>
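
The time and sequence columns listed above can be turned into simple derived measures for the Time-Based and stopCount analyses. The following is a minimal sketch, using two trips copied from the first rows of `Transjakarta.csv`; the column names `duration_min`, `stops_travelled` and `tapInHour` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Two trips copied from the first rows of Transjakarta.csv
trips = pd.DataFrame({
    "tapInTime": ["2023-04-03 05:21:44", "2023-04-03 05:42:44"],
    "tapOutTime": ["2023-04-03 06:00:53", "2023-04-03 06:40:01"],
    "stopStartSeq": [7, 13],
    "stopEndSeq": [12.0, 21.0],
})

# Parse the timestamp strings so that date arithmetic is possible
trips["tapInTime"] = pd.to_datetime(trips["tapInTime"])
trips["tapOutTime"] = pd.to_datetime(trips["tapOutTime"])

# Hypothetical derived columns (the names are assumptions)
trips["duration_min"] = (trips["tapOutTime"] - trips["tapInTime"]).dt.total_seconds() / 60
trips["stops_travelled"] = trips["stopEndSeq"] - trips["stopStartSeq"]
trips["tapInHour"] = trips["tapInTime"].dt.hour

print(trips[["duration_min", "stops_travelled", "tapInHour"]])
```

Grouping by `tapInHour` (and by corridor) would then give a rough picture of when demand peaks.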

<br>
<br>
<br>
5. <mark>**Initial Hypothetical Thoughts (guiding concerns):**</mark>

Overcrowding can best be described as a phenomenon in which the number of people exceeds the threshold of collective comfort for a group. To mitigate it, we can list the findings of the data analysis together with recommendations, such as addressing the customer segments that contribute most to “overcrowding” and adjusting fleet distribution and schedules to accommodate peak demand hours on the busiest corridors (and their stops), while accommodating concerns such as female passenger safety by way of dedicated female rows on buses and female-only buses. Such measures require a certain degree of supervision, hence the need for an optimal number of staff, suitably distributed, along with comprehensive CCTV coverage.

“Overcrowding” stems from congestion, and congestion does not always happen on the bus; it also happens at the bus stops themselves. This can cause problems such as pickpocketing and uncomfortable waiting conditions. To reduce such congestion, Transjakarta can increase the size of its fleet, alongside the aforementioned changes to fleet distribution and scheduling. This would also lessen the already high operating hours, which might otherwise result in unexpected vehicle breakdowns and further worsen the problem, since a delay in supply can ripple through the whole system’s operational efficiency.

In [11]:
# Import Data
df_tj = pd.read_csv('Transjakarta.csv', sep= ',')
pd.set_option("display.max_columns", None)
df_tj

Unnamed: 0,transID,payCardID,payCardBank,payCardName,payCardSex,payCardBirthDate,corridorID,corridorName,direction,tapInStops,tapInStopsName,tapInStopsLat,tapInStopsLon,stopStartSeq,tapInTime,tapOutStops,tapOutStopsName,tapOutStopsLat,tapOutStopsLon,stopEndSeq,tapOutTime,payAmount
0,EIIW227B8L34VB,180062659848800,emoney,Bajragin Usada,M,2008,5,Matraman Baru - Ancol,1.0,P00142,Pal Putih,-6.184631,106.84402,7,2023-04-03 05:21:44,P00253,Tegalan,-6.203101,106.85715,12.0,2023-04-03 06:00:53,3500.0
1,LGXO740D2N47GZ,4885331907664776,dki,Gandi Widodo,F,1997,6C,Stasiun Tebet - Karet via Patra Kuningan,0.0,B01963P,Kemenkes 2,-6.228700,106.83302,13,2023-04-03 05:42:44,B03307P,Sampoerna Strategic,-6.217152,106.81892,21.0,2023-04-03 06:40:01,3500.0
2,DJWR385V2U57TO,4996225095064169,dki,Emong Wastuti,F,1992,R1A,Pantai Maju - Kota,0.0,B00499P,Gg. Kunir II,-6.133132,106.81435,38,2023-04-03 05:59:06,B04962P,Simpang Kunir Kemukus,-6.133731,106.81475,39.0,2023-04-03 06:50:55,3500.0
3,JTUZ800U7C86EH,639099174703,flazz,Surya Wacana,F,1978,11D,Pulo Gebang - Pulo Gadung 2 via PIK,0.0,B05587P,Taman Elok 1,-6.195743,106.93526,23,2023-04-03 05:44:51,B03090P,Raya Penggilingan,-6.183068,106.93194,29.0,2023-04-03 06:28:16,3500.0
4,VMLO535V7F95NJ,570928206772,flazz,Embuh Mardhiyah,M,1982,12,Tanjung Priok - Pluit,0.0,P00239,Sunter Boulevard Barat,-6.149650,106.88900,5,2023-04-03 06:17:35,P00098,Kali Besar Barat,-6.135355,106.81143,15.0,2023-04-03 06:57:03,3500.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37895,ZWEC949B8Q87QG,4685818286724028395,brizzi,Kamila Mahendra,F,2004,6B,Ragunan - MH Thamrin via Semanggi,1.0,P00261,Tosari,-6.196892,106.82309,2,2023-04-21 18:18:37,P00228,SMK 57,-6.290967,106.82365,13.0,2023-04-21 19:55:49,3500.0
37896,YHHK837P6Y95GN,6502902290603767,dki,Titi Siregar,M,1974,9N,Pinang Ranti - Pramuka,1.0,P00064,Garuda Taman Mini,-6.290154,106.88116,1,2023-04-18 21:52:31,P00179,Pinang Ranti,-6.291075,106.88634,2.0,2023-04-18 22:28:22,3500.0
37897,YXPP627N4G95HO,213159426675861,emoney,drg. Zahra Nashiruddin,F,1976,1T,Cibubur - Balai Kota,1.0,B02873P,Plaza Sentral,-6.216247,106.81676,12,2023-04-04 10:29:47,B00226P,Buperta Cibubur,-6.370321,106.89628,14.0,2023-04-04 13:27:25,20000.0
37898,RGVK175U2U98UV,377840859133591,emoney,Ana Agustina,M,1976,JAK.13,Tanah Abang - Jembatan Lima,1.0,B02505P,Museum Textile,-6.188656,106.80954,33,2023-04-15 19:59:26,B01787P,JPO Blok G,-6.188861,106.81135,34.0,2023-04-15 20:27:50,0.0


In [None]:
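# Compare training accuracy and testing accuracy across tree depths
# (more depth = more rules).
# Assumes x_train, x_test, y_train and y_test already exist,
# e.g. from train_test_split.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

depths = np.arange(1,150,1)
training_accuracies = []
testing_accuracies = []
score = 0
best_depth = None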

for i in depths:
    tree = DecisionTreeClassifier( 
        # HYPER-PARAMETER
        criterion='entropy',
        max_depth=i,
        min_samples_leaf= i,
        min_samples_split= i+1
        )
    # Fit the model on the training data
    tree.fit(x_train, y_train)
    
    # Predict on the training data and record the accuracy
    y_predict_train = tree.predict(x_train)
    training_accuracies.append(accuracy_score(y_train, y_predict_train))
    
    # Predict on the testing data and record the accuracy
    y_predict_test = tree.predict(x_test)
    acc_score = accuracy_score(y_test, y_predict_test)
    testing_accuracies.append(acc_score)
    
    # Keep the depth with the highest testing accuracy seen so far
    if score < acc_score:
        best_depth = i
        score = acc_score
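
The cell above relies on `x_train`, `x_test`, `y_train` and `y_test`, which are not defined in this notebook. The following self-contained sketch runs the same kind of depth sweep on synthetic data from `make_classification`; the data, the depth range of 1 to 10 and the fixed `random_state` are all assumptions made for illustration, and only `max_depth` is varied here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the notebook's own features are not shown)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

depths = np.arange(1, 11)
training_accuracies, testing_accuracies = [], []
for depth in depths:
    tree = DecisionTreeClassifier(
        criterion="entropy", max_depth=int(depth), random_state=0
    )
    tree.fit(x_train, y_train)
    training_accuracies.append(accuracy_score(y_train, tree.predict(x_train)))
    testing_accuracies.append(accuracy_score(y_test, tree.predict(x_test)))

# np.argmax returns the first maximum, i.e. the shallowest best depth
best_depth = int(depths[int(np.argmax(testing_accuracies))])
print(best_depth, max(testing_accuracies))
```

A training accuracy that keeps rising while the testing accuracy stalls or falls is the usual sign of overfitting at larger depths.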

In [None]:
# Populate a dictionary of dictionaries, then convert it to a DataFrame,
# to be utilised with "DecisionTreeClassifier"

# Fields stored for each combination:
# 'min_samples_leaf': int,
# 'min_samples_split': int,
# 'training_accuracies': float,
# 'testing_accuracies': float,

# Example entry:
# 1: {'min_samples_leaf': 1, 'min_samples_split': 2,
#     'training_accuracies': 0.7345, 'testing_accuracies': 0.8563}

# Initialise one empty dictionary per combination (150 * 150 = 22500 keys)
data = {i: {} for i in range(1, 22501)}

# Outer loop: min_samples_leaf; inner loop: min_samples_split
# Note: scikit-learn requires an integer min_samples_split to be at least 2
cnt = 1
for i in range(1, 151):
    for j in range(1, 151):
        data[cnt]['min_samples_leaf'] = i
        data[cnt]['min_samples_split'] = j
        # Placeholder values: the counter stands in for real accuracies
        data[cnt]['training_accuracies'] = cnt
        data[cnt]['testing_accuracies'] = cnt + 1
        cnt += 1


# # Convert dictionary to DataFrame
# df = pd.DataFrame(data) 
# # Optionally rename columns if needed
# df.columns = ['min_samples_leaf', 'min_samples_split', 'training_accuracies', 'testing_accuracies']
# print(df)

# Convert nested dictionary to DataFrame
df_data = pd.DataFrame.from_dict(data, orient='index')

# Print the DataFrame
print(df_data)

display(df_data[df_data['training_accuracies'] == df_data['training_accuracies'].max()])

# Condition to identify the specific row
condition = (df_data['training_accuracies'] == df_data['training_accuracies'].max()) & (df_data['testing_accuracies'] == df_data['testing_accuracies'].max())
# Get the index of the specific row
specific_index = df_data.loc[condition].index[0]
# Print the index
print(specific_index)


# 3D Column Insert Example
# df = pd.DataFrame(data, columns=[f'Col{col}' for col in range(cols)])
# data_3d[f'Layer{layer}'] = df
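
The same grid can be built without a counter or nested dictionaries. This is a minimal sketch using `itertools.product`; the variable names `grid` and `best_row` are assumptions, and the accuracy columns are filled with the same placeholder counter values as the cell above.

```python
from itertools import product

import pandas as pd

# Every (min_samples_leaf, min_samples_split) pair for 1..150
grid = pd.DataFrame(
    list(product(range(1, 151), range(1, 151))),
    columns=["min_samples_leaf", "min_samples_split"],
)
# Match the 1-based keys used in the dictionary version
grid.index = range(1, len(grid) + 1)

# Placeholder scores, mirroring the counter values used above
grid["training_accuracies"] = grid.index.to_numpy()
grid["testing_accuracies"] = grid.index.to_numpy() + 1

# idxmax returns the index label of the first maximum
best_row = grid["testing_accuracies"].idxmax()
print(len(grid), best_row)
```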

In [3]:
# DataFrame with missing values
data = {
    'A': [1, None, 3, None, 5],
    'B': [None, 2, None, 4, None]
}
df = pd.DataFrame(data)

# List of DataFrames to use for filling missing values
data1 = {
    'A': [10, None, 30, None, 50],
    'B': [60, 70, None, 90, None]
}
df1 = pd.DataFrame(data1)

data2 = {
    'A': [None, 20, None, 40, None],
    'B': [None, None, 80, None, 100]
}
df2 = pd.DataFrame(data2)

dfs = [df1, df2]

# Start with the original DataFrame
df_filled = df.copy()

# Iterate over the list of DataFrames to fill missing values
for df_fill in dfs:
    df_filled = df_filled.fillna(df_fill)

print("DataFrame after filling missing values with multiple DataFrames:")
print(df_filled)

DataFrame after filling missing values with multiple DataFrames:
      A      B
0   1.0   60.0
1  20.0    2.0
2   3.0   80.0
3  40.0    4.0
4   5.0  100.0
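
The loop over `dfs` can also be written as a chain of `combine_first` calls, which keeps the caller's values and takes the other frame's values only where the caller is missing. A minimal sketch with the same three frames:

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({"A": [1, None, 3, None, 5], "B": [None, 2, None, 4, None]})
df1 = pd.DataFrame({"A": [10, None, 30, None, 50], "B": [60, 70, None, 90, None]})
df2 = pd.DataFrame({"A": [None, 20, None, 40, None], "B": [None, None, 80, None, 100]})

# reduce applies combine_first left to right: df, then df1, then df2
df_filled = reduce(lambda left, right: left.combine_first(right), [df1, df2], df)
print(df_filled)
```

The order of the list matters: an earlier frame's value takes priority over a later one for the same cell.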


In [8]:
# import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)

print(df)

# Method 1: iterrows
print("Using iterrows:")
for index, row in df.iterrows():
    # print(f"Index: {index}, Row data: {row.to_dict()}")
    print(f"Index: {index}, Row data: {row}")

# Method 2: itertuples
print("\nUsing itertuples:")
for row in df.itertuples(index=True, name='Pandas'):
    print(f"Index: {row.Index}, A: {row.A}, B: {row.B}, C: {row.C}")

# Method 3: apply
print("\nUsing apply to create a new column:")
df['Sum'] = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
print(df)

# Method 4: applymap
print("\nUsing applymap to add 1 to each element:")
df_applied = df.applymap(lambda x: x + 1)
print(df_applied)

# Method 5: items
print("\nUsing items to iterate over columns:")
for column_name, series in df.items():
    print(f"Column: {column_name}, Data: {series.to_list()}")

# Method 6: Vectorized operation
print("\nUsing vectorized operation to create a new column:")
df['Product'] = df['A'] * df['B'] * df['C']
print(df)


   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
Using iterrows:
Index: 0, Row data: A    1
B    4
C    7
Name: 0, dtype: int64
Index: 1, Row data: A    2
B    5
C    8
Name: 1, dtype: int64
Index: 2, Row data: A    3
B    6
C    9
Name: 2, dtype: int64

Using itertuples:
Index: 0, A: 1, B: 4, C: 7
Index: 1, A: 2, B: 5, C: 8
Index: 2, A: 3, B: 6, C: 9

Using apply to create a new column:
   A  B  C  Sum
0  1  4  7   12
1  2  5  8   15
2  3  6  9   18

Using applymap to add 1 to each element:
   A  B   C  Sum
0  2  5   8   13
1  3  6   9   16
2  4  7  10   19

Using items to iterate over columns:
Column: A, Data: [1, 2, 3]
Column: B, Data: [4, 5, 6]
Column: C, Data: [7, 8, 9]
Column: Sum, Data: [12, 15, 18]

Using vectorized operation to create a new column:
   A  B  C  Sum  Product
0  1  4  7   12       28
1  2  5  8   15       80
2  3  6  9   18      162
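
One caveat on Method 4: `DataFrame.applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`. The sketch below picks whichever method the installed version offers; the name `elementwise` is an assumption used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Use DataFrame.map where it exists (pandas 2.1+), otherwise applymap
elementwise = df.map if hasattr(df, "map") else df.applymap
df_plus_one = elementwise(lambda x: x + 1)

# For plain arithmetic the vectorized form gives the same values
df_vectorized = df + 1
print(df_plus_one)
```

For simple arithmetic like this, the vectorized form (Method 6) is usually the better choice on large frames.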


In [9]:
# import pandas as pd

# DataFrame with missing values
data1 = {
    'A': [1, None, 3, None, 5],
    'B': [None, 2, None, 4, None]
}
df1 = pd.DataFrame(data1)

# DataFrame to use for filling missing values
data2 = {
    'A': [10, 20, 30, 40, 50],
    'B': [60, 70, 80, 90, 100]
}
df2 = pd.DataFrame(data2)

# Create a copy of df1 to store the filled values
df_filled = df1.copy()

# Iterate over the rows of df1
for index, row in df_filled.iterrows():
    # Check each column in the row
    for column in df_filled.columns:
        # If the value is missing in df1, fill it using the corresponding value from df2
        if pd.isnull(row[column]):
            df_filled.at[index, column] = df2.at[index, column]

print("DataFrame after filling missing values:")
print(df_filled)


DataFrame after filling missing values:
      A      B
0   1.0   60.0
1  20.0    2.0
2   3.0   80.0
3  40.0    4.0
4   5.0  100.0
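
The nested loop above spells out, cell by cell, a rule that pandas can apply in one step: keep the value where it is present, otherwise take the value from the other frame. A minimal sketch with `DataFrame.where`, checked against `fillna`:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, None, 3, None, 5], "B": [None, 2, None, 4, None]})
df2 = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [60, 70, 80, 90, 100]})

# Keep df1 where it has a value; otherwise take the aligned value from df2
df_filled = df1.where(df1.notna(), df2)

# fillna with a DataFrame expresses the same rule
same_as_fillna = df_filled.equals(df1.fillna(df2))
print(df_filled)
```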


In [12]:
# DataFrame with missing values
data1 = {
    'A': [1, None, 3, None, 5],
    'B': [None, 2, None, 4, None]
}
df1 = pd.DataFrame(data1)

# DataFrame to use for filling missing values
data2 = {
    'A': [10, 20, 30, 40, 50],
    'B': [60, 70, 80, 90, 100]
}
df2 = pd.DataFrame(data2)

# Method 1: Iterating over a specific column 'A'
df_filled = df1.copy()
for index, row in df1.iterrows():
    if pd.isnull(row['A']):
        df_filled.at[index, 'A'] = df2.at[index, 'A']

print("DataFrame after filling missing values in column 'A':")
print(df_filled)

# Method 2: Iterating over multiple specific columns 'A' and 'B'
df_filled = df1.copy()
for index, row in df1.iterrows():
    for column in ['A', 'B']:
        if pd.isnull(row[column]):
            df_filled.at[index, column] = df2.at[index, column]

print("\nDataFrame after filling missing values in columns 'A' and 'B':")
print(df_filled)

DataFrame after filling missing values in column 'A':
      A    B
0   1.0  NaN
1  20.0  2.0
2   3.0  NaN
3  40.0  4.0
4   5.0  NaN

DataFrame after filling missing values in columns 'A' and 'B':
      A      B
0   1.0   60.0
1  20.0    2.0
2   3.0   80.0
3  40.0    4.0
4   5.0  100.0
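
Restricting the fill to chosen columns does not require a row loop either. A minimal sketch, where `only_a`, `both` and `cols` are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, None, 3, None, 5], "B": [None, 2, None, 4, None]})
df2 = pd.DataFrame({"A": [10, 20, 30, 40, 50], "B": [60, 70, 80, 90, 100]})

# Fill only column 'A'; column 'B' keeps its gaps
only_a = df1.copy()
only_a["A"] = df1["A"].fillna(df2["A"])

# Fill a chosen list of columns in one assignment
cols = ["A", "B"]
both = df1.copy()
both[cols] = df1[cols].fillna(df2[cols])

print(only_a)
print(both)
```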


In [13]:
# Retrieving (filling in corridorName based on corridorID),
# since corridorID has fewer null values than corridorName

df_temp = df_tj.groupby(['corridorName', 'corridorID']).agg({'direction':'sum'}).reset_index()
df_temp.head(30)

Unnamed: 0,corridorName,corridorID,direction
0,Andara - Stasiun Universitas Pancasila,JAK.44,120.0
1,BKN - Blok M,M7B,154.0
2,BSD - Jelambar,S11,60.0
3,BSD Serpong - Fatmawati,S12,50.0
4,Batusari - Grogol,8K,127.0
5,Bekasi Barat - Blok M,B13,62.0
6,Bekasi Barat - Kuningan,B14,109.0
7,Bekasi Timur - Cawang,B21,109.0
8,Bintara - Cipinang Indah,JAK.85,68.0
9,Bintaro - Blok M,8E,85.0


In [14]:
# Logic Creation
# fill CorridorName based on CorridorID

# Accessing a value
# value = df.at[row_label, column_label]
# Setting a value
# df.at[row_label, column_label] = new_value

# Fill missing corridorName by looking up the row's corridorID in df_temp
df_filled = df_tj.copy()
for idx1, row1 in df_tj.iterrows():
    # Only rows with a missing name and a known ID can be filled
    if pd.isnull(row1["corridorName"]) and pd.notnull(row1["corridorID"]):
        # Scan the lookup table; if several pairs match, the last one wins
        for idx2, row2 in df_temp.iterrows():
            if row1["corridorID"] == row2["corridorID"]:
                df_filled.at[idx1, "corridorName"] = df_temp.at[idx2, "corridorName"]
    


In [15]:
df_filled.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1257
corridorName        1125
direction              0
tapInStops          1213
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops         2289
tapOutStopsName     1344
tapOutStopsLat      1344
tapOutStopsLon      1344
stopEndSeq          1344
tapOutTime          1344
payAmount           1007
dtype: int64

In [16]:
df_tj.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1257
corridorName        1930
direction              0
tapInStops          1213
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops         2289
tapOutStopsName     1344
tapOutStopsLat      1344
tapOutStopsLon      1344
stopEndSeq          1344
tapOutTime          1344
payAmount           1007
dtype: int64
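
Comparing the two outputs, the fill reduced the corridorName nulls from 1930 to 1125. The nested `iterrows` loops scan the whole lookup table once for every row that needs filling; a lookup with `Series.map` gives the same kind of fill in one vectorized step. The sketch below runs on toy rows and assumes each corridorID has a single name; `df_toy` and `id_to_name` are illustrative names.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "corridorID": ["M7B", "M7B", "S11", None, None],
    "corridorName": ["BKN - Blok M", None, "BSD - Jelambar",
                     "BSD - Jelambar", None],
})

# ID -> name lookup built from rows where both values are present
id_to_name = (
    df_toy.dropna(subset=["corridorID", "corridorName"])
    .drop_duplicates("corridorID")
    .set_index("corridorID")["corridorName"]
)

# map() translates each ID to its name; fillna() uses it only for the gaps
df_toy["corridorName"] = df_toy["corridorName"].fillna(
    df_toy["corridorID"].map(id_to_name)
)
print(df_toy)
```

If an ID could map to more than one name, the name chosen here may differ from the one chosen by the loop, so the uniqueness of the pairs is worth checking first.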

In [17]:
# Logic Creation
# fill CorridorID based on CorridorName

# Accessing a value
# value = df.at[row_label, column_label]
# Setting a value
# df.at[row_label, column_label] = new_value

# Fill missing corridorID by looking up the row's corridorName in df_temp
# (continues on df_filled from the previous step, so no fresh copy is made)
for idx1, row1 in df_filled.iterrows():
    # Only rows with a missing ID and a known name can be filled
    if pd.isnull(row1["corridorID"]) and pd.notnull(row1["corridorName"]):
        # Scan the lookup table; if several pairs match, the last one wins
        for idx2, row2 in df_temp.iterrows():
            if row1["corridorName"] == row2["corridorName"]:
                df_filled.at[idx1, "corridorID"] = df_temp.at[idx2, "corridorID"]

In [18]:
df_filled.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1125
corridorName        1125
direction              0
tapInStops          1213
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops         2289
tapOutStopsName     1344
tapOutStopsLat      1344
tapOutStopsLon      1344
stopEndSeq          1344
tapOutTime          1344
payAmount           1007
dtype: int64
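
After both passes, corridorID and corridorName each still report 1125 nulls. One plausible explanation, not verified here, is that these rows lack both values, so neither column can be recovered from the other. A sketch on toy rows for separating the two cases; `both_missing` and `one_missing` are illustrative names.

```python
import pandas as pd

df_toy = pd.DataFrame({
    "corridorID": ["M7B", None, None, "S11"],
    "corridorName": ["BKN - Blok M", "BSD - Jelambar", None, None],
})

id_missing = df_toy["corridorID"].isna()
name_missing = df_toy["corridorName"].isna()

# Rows with neither value cannot be filled from the other column
both_missing = int((id_missing & name_missing).sum())
# Rows with exactly one value missing are candidates for a lookup
one_missing = int((id_missing ^ name_missing).sum())
print(both_missing, one_missing)
```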

In [24]:
# ===== Locating (matching tapInStops with tapInStopsName) =====

df_temp = df_tj.groupby(['tapInStops', 'tapInStopsName']).agg({'direction':'sum'}).reset_index()
df_temp.head(30)

Unnamed: 0,tapInStops,tapInStopsName,direction
0,B00001P,18 Office Park,19.0
1,B00004P,ACC Simatupang,1.0
2,B00005P,ACE Hardware,6.0
3,B00008P,Adam Malik 1,0.0
4,B00017P,Akper Fatmawati Pondok Labu,3.0
5,B00018P,AKR Tower,0.0
6,B00027P,Al Izhar Pondok Labu 2,0.0
7,B00028P,Al Khairiyah School,0.0
8,B00030P,Al Mukhlisin,19.0
9,B00031P,Al Wathoniyah 1,0.0


In [25]:
# Logic Creation
# fill tapInStops based on tapInStopsName

# Accessing a value
# value = df.at[row_label, column_label]
# Setting a value
# df.at[row_label, column_label] = new_value

# Fill missing tapInStops by looking up the row's tapInStopsName in df_temp
# (continues on df_filled from the previous step, so no fresh copy is made)
for idx1, row1 in df_filled.iterrows():
    # Only rows with a missing stop ID and a known stop name can be filled
    if pd.isnull(row1["tapInStops"]) and pd.notnull(row1["tapInStopsName"]):
        # Scan the lookup table; if several pairs match, the last one wins
        for idx2, row2 in df_temp.iterrows():
            if row1["tapInStopsName"] == row2["tapInStopsName"]:
                df_filled.at[idx1, "tapInStops"] = df_temp.at[idx2, "tapInStops"]

In [26]:
df_filled.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1125
corridorName        1125
direction              0
tapInStops            34
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops         2289
tapOutStopsName     1344
tapOutStopsLat      1344
tapOutStopsLon      1344
stopEndSeq          1344
tapOutTime          1344
payAmount           1007
dtype: int64
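
The tapInStops nulls drop from 1213 to 34. Since the same fill pattern is repeated for several column pairs in this notebook, it could be wrapped in one function. `fill_from_pair` below is a hypothetical helper, not something defined elsewhere in the notebook, and it assumes each key maps to a single target value.

```python
import pandas as pd


def fill_from_pair(df, target, key):
    """Fill missing `target` values using the `key` column of the same frame.

    Hypothetical helper: assumes each key maps to a single target value.
    """
    lookup = (
        df.dropna(subset=[target, key])
        .drop_duplicates(key)
        .set_index(key)[target]
    )
    out = df.copy()
    out[target] = out[target].fillna(out[key].map(lookup))
    return out


# Toy rows using stop pairs shown in the output above
stops = pd.DataFrame({
    "tapInStops": ["B00001P", None, "B00005P", None],
    "tapInStopsName": ["18 Office Park", "18 Office Park",
                       "ACE Hardware", "Adam Malik 1"],
})
filled = fill_from_pair(stops, target="tapInStops", key="tapInStopsName")
print(filled)
```

In the toy data the last row stays empty because its stop name never appears together with an ID. That is one way a small residue of nulls can survive the fill.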

In [27]:
# Drop rows with no tap-out record (expected: 37900 - 1344 = 36556)
rows_before = len(df_filled)
df_filled.dropna(subset=['tapOutStopsName', 'tapOutStopsLat', 'tapOutStopsLon', 'stopEndSeq', 'tapOutTime'], inplace=True)

print(f"Case Validated - Rows decreased from {rows_before} -> {len(df_filled)}")

Case Validated - Rows decreased from 37900 -> 36556
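
The expected count of 36556 holds only if the 1344 nulls in the five tap-out columns fall on the same rows. A minimal sketch of that check on toy data, using two of the tap-out columns; `tap_out_cols`, `same_rows` and `rows_dropped` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "tapOutStopsName": ["Tegalan", None, "Pinang Ranti", None],
    "tapOutTime": ["2023-04-03 06:00:53", None, "2023-04-18 22:28:22", None],
    "payAmount": [3500.0, 3500.0, None, 0.0],
})
tap_out_cols = ["tapOutStopsName", "tapOutTime"]

missing = toy[tap_out_cols].isna()
# If the nulls always co-occur, "any" and "all" flag exactly the same rows
same_rows = bool((missing.any(axis=1) == missing.all(axis=1)).all())

rows_before = len(toy)
toy_clean = toy.dropna(subset=tap_out_cols)
rows_dropped = rows_before - len(toy_clean)
print(same_rows, rows_dropped)
```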


In [28]:
df_filled.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1078
corridorName        1078
direction              0
tapInStops            32
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops          945
tapOutStopsName        0
tapOutStopsLat         0
tapOutStopsLon         0
stopEndSeq             0
tapOutTime             0
payAmount            968
dtype: int64

In [30]:
# ===== Locating (matching tapOutStops with tapOutStopsName) =====

df_temp = df_tj.groupby(['tapOutStops', 'tapOutStopsName']).agg({'direction':'sum'}).reset_index()
df_temp.head(30)

Unnamed: 0,tapOutStops,tapOutStopsName,direction
0,B00002P,ABA,0.0
1,B00003P,Acacia Residence,1.0
2,B00004P,ACC Simatupang,1.0
3,B00005P,ACE Hardware,1.0
4,B00013P,Ahmad Yani Pisangan Baru,0.0
5,B00015P,Akademi Farmasi Mahadhika,0.0
6,B00022P,Akses Jembatan Ciliwung Balekambang,0.0
7,B00028P,Al Khairiyah School,0.0
8,B00029P,Al Mahbubiyah,1.0
9,B00030P,Al Mukhlisin,1.0


In [31]:
# Logic Creation
# fill tapOutStops based on tapOutStopsName

# Accessing a value
# value = df.at[row_label, column_label]
# Setting a value
# df.at[row_label, column_label] = new_value

# Fill missing tapOutStops by looking up the row's tapOutStopsName in df_temp
# (continues on df_filled from the previous step, so no fresh copy is made)
for idx1, row1 in df_filled.iterrows():
    # Only rows with a missing stop ID and a known stop name can be filled
    if pd.isnull(row1["tapOutStops"]) and pd.notnull(row1["tapOutStopsName"]):
        # Scan the lookup table; if several pairs match, the last one wins
        for idx2, row2 in df_temp.iterrows():
            if row1["tapOutStopsName"] == row2["tapOutStopsName"]:
                df_filled.at[idx1, "tapOutStops"] = df_temp.at[idx2, "tapOutStops"]

In [32]:
df_filled.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1078
corridorName        1078
direction              0
tapInStops            32
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops           19
tapOutStopsName        0
tapOutStopsLat         0
tapOutStopsLon         0
stopEndSeq             0
tapOutTime             0
payAmount            968
dtype: int64

In [37]:
# ===== Corridor Prices =====

df_temp = df_tj.groupby(['payAmount', 'corridorID']).agg({'direction':'sum'}).reset_index()
df_temp.head(30)

Unnamed: 0,payAmount,corridorID,direction
0,0.0,10A,73.0
1,0.0,10B,72.0
2,0.0,11B,45.0
3,0.0,11C,31.0
4,0.0,11K,65.0
5,0.0,11M,82.0
6,0.0,11N,66.0
7,0.0,11P,166.0
8,0.0,12C,66.0
9,0.0,12F,25.0


In [38]:
# Logic Creation
# fill payAmount (float64) based on corridorID

# Accessing a value
# value = df.at[row_label, column_label]
# Setting a value
# df.at[row_label, column_label] = new_value

# Fill missing payAmount by looking up the row's corridorID in df_temp
# (continues on df_filled from the previous step, so no fresh copy is made)
for idx1, row1 in df_filled.iterrows():
    # Only rows with a missing fare and a known corridor ID can be filled
    if pd.isnull(row1["payAmount"]) and pd.notnull(row1["corridorID"]):
        # Scan the lookup table; if several pairs match, the last one wins
        for idx2, row2 in df_temp.iterrows():
            if row1["corridorID"] == row2["corridorID"]:
                df_filled.at[idx1, "payAmount"] = float(df_temp.at[idx2, "payAmount"])

In [39]:
df_filled.isna().sum()

transID                0
payCardID              0
payCardBank            0
payCardName            0
payCardSex             0
payCardBirthDate       0
corridorID          1078
corridorName        1078
direction              0
tapInStops            32
tapInStopsName         0
tapInStopsLat          0
tapInStopsLon          0
stopStartSeq           0
tapInTime              0
tapOutStops           19
tapOutStopsName        0
tapOutStopsLat         0
tapOutStopsLon         0
stopEndSeq             0
tapOutTime             0
payAmount            968
dtype: int64
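
payAmount still reports 968 nulls, the same count as before the fill, so this pass filled nothing. One possible reason, not verified here, is that the affected corridors never carry a fare in any row, so they never appear in the `payAmount`/`corridorID` lookup. The sketch below shows how to test for that on toy rows; treating `M7B` as a corridor without any fare is purely an assumption for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "corridorID": ["5", "5", "M7B", "M7B", None],
    "payAmount": [3500.0, None, None, None, None],
})

# For each corridor: does at least one row carry a fare?
has_fare = toy["payAmount"].notna().groupby(toy["corridorID"]).any()
no_fare_corridors = has_fare[~has_fare].index.tolist()

# Missing fares that a corridor lookup could never fill
unfillable = toy["payAmount"].isna() & (
    toy["corridorID"].isna() | toy["corridorID"].isin(no_fare_corridors)
)
print(no_fare_corridors, int(unfillable.sum()))
```

If the count of unfillable rows equals the number of remaining nulls, those fares would have to come from a source outside this dataset.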