## Load and Inspect Datasets

In this section, we:
- Import necessary libraries.
- Load the training and testing datasets.
- Print and compare the column names of both datasets to ensure consistency.

In [8]:
import pandas as pd

# Load the datasets; re-raise on failure so later code never runs without the data
try:
    train_data = pd.read_csv("../data/fraudTrain.csv")
    test_data = pd.read_csv("../data/fraudTest.csv")
    print("Datasets loaded successfully.")
except Exception as e:
    print(f"Error loading datasets: {e}")
    raise

# Print column names for training data
print("Columns in fraudTrain.csv:")
print(train_data.columns)

# Print column names for testing data
print("\nColumns in fraudTest.csv:")
print(test_data.columns)

# Compare the column names
if list(train_data.columns) == list(test_data.columns):
    print("\nThe column names match exactly!")
else:
    print("\nThe column names do NOT match. Please inspect further.")

Datasets loaded successfully.
Columns in fraudTrain.csv:
Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

Columns in fraudTest.csv:
Index(['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'category',
       'amt', 'first', 'last', 'gender', 'street', 'city', 'state', 'zip',
       'lat', 'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

The column names match exactly!
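
When the names do not match, the check above only says so; it does not say which columns differ. A minimal sketch of a more detailed comparison follows. The helper `compare_columns` and the toy DataFrames are hypothetical and only stand in for `train_data` and `test_data`.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report how the column names of two DataFrames differ (hypothetical helper)."""
    left_cols, right_cols = list(left.columns), list(right.columns)
    return {
        # Columns present in one frame but missing from the other
        "only_in_left": [c for c in left_cols if c not in right_cols],
        "only_in_right": [c for c in right_cols if c not in left_cols],
        # True only when both the names and their order are identical
        "same_order": left_cols == right_cols,
    }


# Toy frames standing in for train_data and test_data
toy_train = pd.DataFrame({"amt": [1.0], "zip": [12345], "is_fraud": [0]})
toy_test = pd.DataFrame({"amt": [2.0], "is_fraud": [1], "city": ["Austin"]})

report = compare_columns(toy_train, toy_test)
print(report)
```

Calling `compare_columns(train_data, test_data)` on the real data should return two empty lists and `same_order` set to `True`, given the matching output shown above.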


## Drop Unnecessary Columns

We drop the `Unnamed: 0` column from both datasets. It appears to be a leftover row index written into the CSV files, so it carries no information about the transactions, and removing it streamlines the data for further processing.

In [3]:
# Drop the 'Unnamed: 0' column from both datasets
train_data.drop(columns=["Unnamed: 0"], inplace=True)
test_data.drop(columns=["Unnamed: 0"], inplace=True)

print("Dropped unnecessary columns.")

Dropped unnecessary columns.
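
Because the cell above drops the column in place, running it a second time raises a `KeyError`. One possible way to make the step safe to re-run is sketched below. The helper `drop_index_column` and the toy DataFrame are hypothetical; `errors="ignore"` is a standard argument of `DataFrame.drop`.

```python
import pandas as pd


def drop_index_column(df: pd.DataFrame, name: str = "Unnamed: 0") -> pd.DataFrame:
    """Return a copy of df without the given column (hypothetical helper)."""
    # errors="ignore" skips the drop when the column is already gone
    return df.drop(columns=[name], errors="ignore")


# Toy frame standing in for train_data
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "amt": [4.5, 9.9]})

once = drop_index_column(toy)
twice = drop_index_column(once)  # second call is a no-op, not an error
print(list(once.columns))
```

The helper returns a new DataFrame rather than modifying its input, so the original stays available if the column is needed later. An alternative, under the same assumption that the column is a saved index, is to pass `index_col=0` to `pd.read_csv` so the column becomes the index at load time.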


## Check Data Types and Missing Values

We perform the following checks:
- Examine the data types and overall structure of the training and testing datasets.
- Identify any missing values in the datasets and quantify them.

These checks show whether the data needs cleaning before preprocessing.

In [4]:
# Check data types and info (info() prints its report directly and returns None)
print("Training Data Info:")
train_data.info()

print("\nTesting Data Info:")
test_data.info()

# Check for missing values
print("\nMissing Values in Training Data:")
print(train_data.isnull().sum())

print("\nMissing Values in Testing Data:")
print(test_data.isnull().sum())

Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   trans_date_trans_time  1296675 non-null  object 
 1   cc_num                 1296675 non-null  int64  
 2   merchant               1296675 non-null  object 
 3   category               1296675 non-null  object 
 4   amt                    1296675 non-null  float64
 5   first                  1296675 non-null  object 
 6   last                   1296675 non-null  object 
 7   gender                 1296675 non-null  object 
 8   street                 1296675 non-null  object 
 9   city                   1296675 non-null  object 
 10  state                  1296675 non-null  object 
 11  zip                    1296675 non-null  int64  
 12  lat                    1296675 non-null  float64
 13  long                   1296675 non-null  float64
 14
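
With 22 columns, the raw `isnull().sum()` listing is long and mostly zeros. A compact summary that keeps only the columns with gaps could look like the sketch below. The helper `missing_summary` and the toy data are hypothetical and do not reflect the actual contents of the fraud datasets.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for affected columns only (hypothetical helper)."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually have missing values, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with deliberate gaps; not taken from the real data
toy = pd.DataFrame({
    "amt": [1.0, np.nan, 3.0, 4.0],
    "city": ["a", None, None, "d"],
    "zip": [1, 2, 3, 4],
})

summary = missing_summary(toy)
print(summary)
```

An empty result from `missing_summary(train_data)` would mean that no column in the training data has missing values.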

## Define Features and Labels

Here we separate the features (`X`) and labels (`y`) for the training and testing datasets:
- **Features**: All columns except `is_fraud`.
- **Labels**: The `is_fraud` column, indicating whether the transaction is fraudulent.

We also check the shapes of the resulting feature and label datasets to confirm that both splits have the same number of feature columns and exactly one label per row.

In [6]:
# Define features and labels
X_train = train_data.drop(columns=["is_fraud"])
y_train = train_data["is_fraud"]

X_test = test_data.drop(columns=["is_fraud"])
y_test = test_data["is_fraud"]

print("Features and labels prepared.")
print(f"Training Features Shape: {X_train.shape}, Labels Shape: {y_train.shape}")
print(f"Testing Features Shape: {X_test.shape}, Labels Shape: {y_test.shape}")

Features and labels prepared.
Training Features Shape: (1296675, 21), Labels Shape: (1296675,)
Testing Features Shape: (555719, 21), Labels Shape: (555719,)
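
The shape check above is done by eye. As a rough sketch, the same split can be wrapped in a function that asserts the two properties automatically. The helper `split_features_labels` and the toy data are hypothetical, and the fraud-share line at the end is an extra sanity check that is not part of the original cell.

```python
import pandas as pd


def split_features_labels(df: pd.DataFrame, label: str = "is_fraud"):
    """Split df into features X and labels y, with basic checks (hypothetical helper)."""
    X = df.drop(columns=[label])
    y = df[label]
    # The label must not leak into the features, and rows must stay aligned
    assert label not in X.columns
    assert len(X) == len(y)
    return X, y


# Toy frame standing in for train_data
toy = pd.DataFrame({
    "amt": [10.0, 250.0, 7.5, 99.0],
    "category": ["food", "travel", "food", "misc"],
    "is_fraud": [0, 0, 1, 0],
})

X_toy, y_toy = split_features_labels(toy)
print(X_toy.shape, y_toy.shape)

# Share of rows labelled as fraud (assumes labels are 0/1)
fraud_rate = y_toy.mean()
print(f"Fraud share: {fraud_rate:.2%}")
```

Applying the same function to `train_data` and `test_data` should reproduce the shapes printed above.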
