# Feature Engineering
### What we do here:
   * Transaction frequency and velocity: number of transactions per user within time windows.
   * Time-based features:
       * hour_of_day: hour of purchase_time (0-23)
       * day_of_week: day of purchase_time (0 = Monday)
       * time_since_signup: duration between signup_time and purchase_time, in seconds
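
The features listed above can be sketched in plain pandas. This is only an illustration: `add_time_features` is a hypothetical helper, not the code inside `TimeBehaviorFeatures`, and it assumes `signup_time` and `purchase_time` are already parsed as datetimes. The two sample rows reuse values that appear in the `fraud_df.head()` output further down.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive the time-based columns described above."""
    out = df.copy()
    out["hour_of_day"] = out["purchase_time"].dt.hour
    out["day_of_week"] = out["purchase_time"].dt.dayofweek  # Monday = 0
    # Elapsed time between signup and purchase, expressed in seconds
    out["time_since_signup"] = (
        out["purchase_time"] - out["signup_time"]
    ).dt.total_seconds()
    # Total number of rows (transactions) recorded for each user
    out["user_transaction_count"] = out.groupby("user_id")["user_id"].transform("count")
    return out


sample = pd.DataFrame(
    {
        "user_id": [22058, 1359],
        "signup_time": pd.to_datetime(["2015-02-24 22:55:49", "2015-01-01 18:52:44"]),
        "purchase_time": pd.to_datetime(["2015-04-18 02:47:11", "2015-01-01 18:52:45"]),
    }
)
features = add_time_features(sample)
print(features[["hour_of_day", "day_of_week", "time_since_signup"]])
```

A purchase one second after signup, as in the second row, is the kind of pattern `time_since_signup` is meant to expose.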

#### Import Custom and Other Libraries

In [1]:
# import libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import sys
sys.path.append("..")  # make the project-level src package importable

In [2]:
# import custom libraries
from src.data_cleaning import clean_fraud_data
from src.fraud_feature_engineerimg import TimeBehaviorFeatures


#### Load The Data

In [3]:
fraud_df=pd.read_csv("../data/processed/fraud_cleaned.csv")
fraud_df=clean_fraud_data(fraud_df)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["age"].fillna(df["age"].median(),inplace=True)# fill the age column missing Value with median if any
  df["sex"].fillna(df["sex"].mode(),inplace=True) # fill the sex column msiing value with mode if any
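
The warning comes from calling `fillna(..., inplace=True)` on a column selection. A minimal sketch of the assignment form that pandas recommends is shown below; `fill_missing_demographics` is a hypothetical name, not the actual body of `clean_fraud_data`. Note also that `Series.mode()` returns a Series, and `fillna` aligns a Series argument on the index, so the sketch takes the first mode value explicitly.

```python
import numpy as np
import pandas as pd


def fill_missing_demographics(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill age with the median and sex with the most frequent value."""
    out = df.copy()
    # Assign the result back instead of using inplace=True on a column selection
    out["age"] = out["age"].fillna(out["age"].median())
    modes = out["sex"].mode()
    if not modes.empty:  # mode() is empty when the column holds only missing values
        out["sex"] = out["sex"].fillna(modes.iloc[0])
    return out


demo = pd.DataFrame({"age": [30.0, np.nan, 50.0], "sex": ["M", None, "M"]})
filled = fill_missing_demographics(demo)
print(filled)
```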


In [4]:
fraud_df.dtypes

user_id                    int64
signup_time       datetime64[ns]
purchase_time     datetime64[ns]
purchase_value             int64
device_id                 object
source                    object
browser                   object
sex                       object
age                        int64
ip_address                 int64
class                      int64
country                   object
dtype: object

#### Feature Engineering 

In [5]:
# Create time based features with custom module
time_features=TimeBehaviorFeatures()
fraud_df=time_features.apply_all(fraud_df)


In [6]:
# check the changes 
fraud_df.head()

| | user_id | signup_time | purchase_time | purchase_value | device_id | source | browser | sex | age | ip_address | class | country | hour_of_day | day_of_week | time_since_signup | user_transaction_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22058 | 2015-02-24 22:55:49 | 2015-04-18 02:47:11 | 34 | QVPSPJUOCKZAR | SEO | Chrome | M | 39 | 732758368 | 0 | Japan | 2 | 5 | 4506682.0 | 1 |
| 1 | 333320 | 2015-06-07 20:39:50 | 2015-06-08 01:38:54 | 16 | EOGFQPIZPYXFZ | Ads | Chrome | F | 53 | 350311387 | 0 | United States | 1 | 0 | 17944.0 | 1 |
| 2 | 1359 | 2015-01-01 18:52:44 | 2015-01-01 18:52:45 | 15 | YSSKYOSJHPPLJ | SEO | Opera | M | 53 | 2621473820 | 1 | United States | 18 | 3 | 1.0 | 1 |
| 3 | 150084 | 2015-04-28 21:13:25 | 2015-05-04 13:54:50 | 44 | ATGTXKYKUDUQN | SEO | Safari | M | 41 | 3840542443 | 0 | unknown | 13 | 0 | 492085.0 | 1 |
| 4 | 221365 | 2015-07-21 07:09:52 | 2015-09-09 18:40:53 | 39 | NAUITBZFJKHWW | Ads | Safari | M | 45 | 415583117 | 0 | United States | 18 | 2 | 4361461.0 | 1 |
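
`user_transaction_count` is a per-user total. A windowed variant (transaction velocity) could be sketched with a time-based rolling count, as below. The function name `add_transaction_velocity`, the output column `txn_count_24h`, and the 24-hour window are all assumptions made for illustration; the notebook's custom module may compute velocity differently.

```python
import pandas as pd


def add_transaction_velocity(df: pd.DataFrame, window: str = "24h") -> pd.DataFrame:
    """Hypothetical helper: count each user's purchases in the trailing time window."""
    # Sort so that the grouped rolling result lines up row-for-row with `out`
    out = df.sort_values(["user_id", "purchase_time"]).copy()
    counts = (
        out.set_index("purchase_time")
        .groupby("user_id")["purchase_value"]
        .rolling(window)  # trailing window that includes the current purchase
        .count()          # counts non-missing purchase_value entries
    )
    out["txn_count_24h"] = counts.to_numpy().astype(int)
    return out


tx = pd.DataFrame(
    {
        "user_id": [1, 2, 1, 1],
        "purchase_time": pd.to_datetime(
            [
                "2015-01-01 00:00:00",
                "2015-01-01 12:00:00",
                "2015-01-01 01:00:00",
                "2015-01-02 06:00:00",
            ]
        ),
        "purchase_value": [10, 20, 30, 40],
    }
)
velocity = add_transaction_velocity(tx)
print(velocity)
```

The returned frame is sorted by user and time, so it should be re-sorted or re-indexed if the original row order matters downstream.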


#### Data Transformation and Class Imbalance Handling
#### What we do here:
  * Normalize/scale numerical features (StandardScaler or MinMaxScaler)
  * Encode categorical features (One-Hot Encoding)
  * Apply SMOTE or undersampling to training data only
  * Document the class distribution before and after resampling
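
The scaling and encoding steps above can be sketched with scikit-learn's `ColumnTransformer`. This is a minimal illustration on toy data, not the internals of `FraudDataTransformer`; the choice of `StandardScaler`, `handle_unknown="ignore"`, and `sparse_threshold=0` (to force a dense array) are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

demo_num_cols = ["purchase_value", "age"]
demo_cat_cols = ["source", "browser"]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), demo_num_cols),
        # Categories unseen during fit are encoded as all zeros instead of raising
        ("cat", OneHotEncoder(handle_unknown="ignore"), demo_cat_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

train = pd.DataFrame(
    {
        "purchase_value": [10, 20, 30, 40],
        "age": [25, 35, 45, 55],
        "source": ["SEO", "Ads", "SEO", "Direct"],
        "browser": ["Chrome", "Safari", "Chrome", "Opera"],
    }
)
test = pd.DataFrame(
    {"purchase_value": [25], "age": [40], "source": ["Ads"], "browser": ["IE"]}
)

X_train_demo = preprocessor.fit_transform(train)  # statistics learned from train only
X_test_demo = preprocessor.transform(test)        # test reuses the train statistics
print(X_train_demo.shape, X_test_demo.shape)
```

Fitting on the training rows and only transforming the test rows keeps test-set statistics out of the scaler and encoder.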

##### Define Feature Groups

In [7]:
# define numerical and categorical columns for preprocessing
num_cols=[
    "purchase_value",
    "age",
    "time_since_signup",
    "user_transaction_count"
]
cat_cols=[
    "source",
    "browser",
    "sex",
    "country"
]

##### Check class distribution BEFORE resampling

In [8]:
from src.data_trasformer import FraudDataTransformer # import custom module for transforming data
transformer=FraudDataTransformer(num_cols,cat_cols)

In [9]:
# check class distribution before resampling
y_original=fraud_df['class']
class_dist_before=transformer.get_class_distribution(y_original)
class_dist_before

class
0    90.635423
1     9.364577
Name: proportion, dtype: float64
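
A percentage breakdown like the one above can be produced with `value_counts(normalize=True)`. The helper below is a sketch under that assumption; `class_distribution` is a hypothetical name and may not match how `get_class_distribution` is written.

```python
import pandas as pd


def class_distribution(y: pd.Series) -> pd.Series:
    """Hypothetical helper: share of each class label, in percent."""
    return y.value_counts(normalize=True) * 100


labels = pd.Series([0] * 9 + [1], name="class")
dist = class_distribution(labels)
print(dist)
```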

##### Train-test split before resampling to prevent data leakage

In [10]:
from sklearn.model_selection import train_test_split
# split data into X (independent features) and Y (dependent target)
X=fraud_df.drop(columns=['class'])
Y=fraud_df['class']
x_train,x_test,y_train,y_test=train_test_split(X,
                                               Y,
                                               test_size=0.2,
                                               stratify=Y, # very important for imbalanced data
                                               random_state=42)
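
To see what `stratify` does, the following sketch runs the same kind of split on synthetic data with a 10% positive rate and compares the class rate in each part. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))      # synthetic features
y_demo = np.array([1] * 100 + [0] * 900)  # 10% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

With stratification both parts keep roughly the original positive rate, which matters when the minority class is as small as it is here.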


##### Recombine TRAIN data (for transformer input)

In [11]:
train_df = x_train.copy() # since our transformer expects 'class' column
train_df['class'] = y_train

#### Apply preprocessing + SMOTE (TRAIN ONLY)


In [12]:
transformer = FraudDataTransformer(
    numerical_features=num_cols,
    categorical_features=cat_cols
)

X_train_resampled, y_train_resampled = transformer.fit_resample(train_df)
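
The notebook balances the training set through the custom `fit_resample` call above. For comparison, the undersampling alternative mentioned in the plan can be sketched with NumPy alone. This is plain random undersampling, not SMOTE, and `random_undersample` is a hypothetical helper that is not part of `FraudDataTransformer`.

```python
import numpy as np


def random_undersample(X: np.ndarray, y: np.ndarray, random_state: int = 42):
    """Hypothetical helper: keep as many rows of each class as the rarest class has."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Draw n_min row positions per class without replacement
    keep = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False) for c in classes]
    )
    rng.shuffle(keep)  # avoid returning the rows grouped by class
    return X[keep], y[keep]


X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_undersample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Like SMOTE, this should be applied to the training split only, so the test set keeps the real class ratio.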


#### Transform TEST data (NO SMOTE)

In [14]:
X_test_transformed = transformer.preprocessor.transform(x_test)


#### 📊 Class Distribution Documentation (Task Requirement)

In [15]:
transformer.get_class_distribution(y_train_resampled)


class
0    50.0
1    50.0
Name: proportion, dtype: float64

#### 📌 Save the Datasets for Future Modeling

In [17]:
import numpy as np

# Save training data
np.save("../data/processed/X_train.npy", X_train_resampled)
np.save("../data/processed/y_train.npy", y_train_resampled)

# Save test data
np.save("../data/processed/X_test.npy", X_test_transformed)
np.save("../data/processed/y_test.npy", y_test.values)
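
One thing worth checking before saving: a scikit-learn preprocessor that includes one-hot encoding may return a SciPy sparse matrix, which `np.save` does not store as a regular numeric array. Whether that applies here depends on how `FraudDataTransformer` is configured, which this notebook does not show. The sketch below uses a hypothetical `save_dense` helper and a temporary directory to illustrate a save and reload round trip.

```python
import tempfile
from pathlib import Path

import numpy as np
from scipy import sparse


def save_dense(path, X):
    """Hypothetical helper: convert sparse input to a dense array before saving."""
    dense = X.toarray() if sparse.issparse(X) else np.asarray(X)
    np.save(path, dense)
    return dense.shape


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "X_demo.npy"
    saved_shape = save_dense(target, sparse.csr_matrix(np.eye(3)))
    restored = np.load(target)  # loads without needing allow_pickle

print(saved_shape, restored.dtype)
```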
