### 1. Loading the Data

In this step, we load the two datasets required for the fraud detection analysis:
1. **Fraud_Data.csv**: Contains the transaction records, including user details and a `class` label that marks whether the transaction is fraudulent (1) or not (0).
2. **IpAddress_to_Country.csv**: Maps IP address ranges to countries, which lets us attach a geographical location to each transaction.

In [3]:
# Importing necessary libraries
import pandas as pd

# Load datasets
fraud_data = pd.read_csv('../Data/Raw/Fraud_Data.csv')
ip_data = pd.read_csv('../Data/Raw/IpAddress_to_Country.csv')

# Display first few rows of Fraud Data
fraud_data.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,22058,2015-02-24 22:55:49,2015-04-18 02:47:11,34,QVPSPJUOCKZAR,SEO,Chrome,M,39,732758400.0,0
1,333320,2015-06-07 20:39:50,2015-06-08 01:38:54,16,EOGFQPIZPYXFZ,Ads,Chrome,F,53,350311400.0,0
2,1359,2015-01-01 18:52:44,2015-01-01 18:52:45,15,YSSKYOSJHPPLJ,SEO,Opera,M,53,2621474000.0,1
3,150084,2015-04-28 21:13:25,2015-05-04 13:54:50,44,ATGTXKYKUDUQN,SEO,Safari,M,41,3840542000.0,0
4,221365,2015-07-21 07:09:52,2015-09-09 18:40:53,39,NAUITBZFJKHWW,Ads,Safari,M,45,415583100.0,0


In [4]:
# Display first few rows of Ip Data
ip_data.head()

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216.0,16777471,Australia
1,16777472.0,16777727,China
2,16777728.0,16778239,China
3,16778240.0,16779263,Australia
4,16779264.0,16781311,China
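
The bounds above appear to be IPv4 addresses stored as integers. As a minimal sketch, Python's standard `ipaddress` module converts between the two forms. The helper names `ip_to_int` and `int_to_ip` are illustrative and not part of the original notebook.

```python
import ipaddress


def ip_to_int(dotted: str) -> int:
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(dotted))


def int_to_ip(value: int) -> str:
    """Convert an integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(value))


# Bounds of the first range shown in ip_data.head()
print(int_to_ip(16777216))  # lower bound
print(int_to_ip(16777471))  # upper bound
print(ip_to_int('1.0.0.0'))
```

Under this reading, the first row covers the addresses 1.0.0.0 through 1.0.0.255.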


### 2. Data Cleaning

Data cleaning ensures that our dataset is free from errors and inconsistencies. In this step, we will:
- Handle missing values by removing or imputing them.
- Remove any duplicate rows to avoid biased analysis.
- Correct data types: parse `signup_time` and `purchase_time` as datetimes and convert `ip_address` to an integer.

In [5]:
# Check for missing values
fraud_data.isnull().sum()

user_id           0
signup_time       0
purchase_time     0
purchase_value    0
device_id         0
source            0
browser           0
sex               0
age               0
ip_address        0
class             0
dtype: int64

In [7]:
# Check for missing values
ip_data.isnull().sum()

lower_bound_ip_address    0
upper_bound_ip_address    0
country                   0
dtype: int64

In [8]:
# Remove duplicate rows
fraud_data = fraud_data.drop_duplicates()

In [None]:
# Remove duplicate rows
ip_data = ip_data.drop_duplicates()

In [None]:
# Drop missing values (if any)
fraud_data = fraud_data.dropna()

# Correct data types
fraud_data['signup_time'] = pd.to_datetime(fraud_data['signup_time'])
fraud_data['purchase_time'] = pd.to_datetime(fraud_data['purchase_time'])
# Use int64 explicitly: on some platforms the default 'int' is 32-bit,
# and values above 2**31 - 1 (e.g. 3840542000) would not fit
fraud_data['ip_address'] = fraud_data['ip_address'].astype('int64')

# Display cleaned data information
fraud_data.info()
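
The missing-value check above found nothing to fix, so the difference between removing and imputing is easier to see on a toy example. The sketch below uses a small made-up DataFrame (the values are not taken from the dataset) and compares `dropna` with median imputation. Median imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (illustrative only)
toy = pd.DataFrame({
    'purchase_value': [34.0, np.nan, 15.0, 44.0],
    'age': [39.0, 53.0, np.nan, 41.0],
})

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each missing value with the median of its column
imputed = toy.fillna(toy.median(numeric_only=True))

print(len(dropped))
print(imputed)
```

Removing rows keeps only observed values but shrinks the dataset, while imputing keeps every row at the cost of introducing estimated values.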

### 3. Exploratory Data Analysis (EDA)

EDA helps us understand the underlying patterns in the data. Here, we will:
- Conduct **Univariate Analysis** to explore individual variables.
- Perform **Bivariate Analysis** to explore relationships between features, especially how they correlate with fraudulent transactions.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Univariate analysis: Age distribution
plt.figure(figsize=(8,6))
sns.histplot(fraud_data['age'], bins=20, kde=True)
plt.title('Distribution of User Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Bivariate analysis: Purchase value vs. fraud status
plt.figure(figsize=(8,6))
sns.boxplot(x='class', y='purchase_value', data=fraud_data)
plt.title('Purchase Value by Fraud Status')
plt.xlabel('Fraud Status (0 = Non-fraudulent, 1 = Fraudulent)')
plt.ylabel('Purchase Value')
plt.show()
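
For categorical features, a bivariate view can be as simple as the fraud rate per category. The sketch below rebuilds the five rows shown in `fraud_data.head()` and groups them by `browser`. With only five rows the numbers are not representative, and the same pattern would need to be applied to the full `fraud_data` to be meaningful.

```python
import pandas as pd

# The five sample rows displayed earlier (browser and class columns only)
sample = pd.DataFrame({
    'browser': ['Chrome', 'Chrome', 'Opera', 'Safari', 'Safari'],
    'class': [0, 0, 1, 0, 0],
})

# The mean of a 0/1 label is the share of fraudulent transactions per group
fraud_rate = sample.groupby('browser')['class'].mean().sort_values(ascending=False)
print(fraud_rate)
```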


### 4. Geolocation Analysis

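By merging the IP address data, we can incorporate geographical information into our fraud detection model. Because `ip_data` stores ranges rather than individual addresses, we match each integer `ip_address` in `fraud_data` to the range whose bounds contain it, which adds a `country` column.

In [None]:
# Match each IP address to the range that contains it
# (an exact merge on the lower bound only matches addresses equal to that bound)
ip_data = ip_data.astype({'lower_bound_ip_address': 'int64', 'upper_bound_ip_address': 'int64'}).sort_values('lower_bound_ip_address')
fraud_data = fraud_data.astype({'ip_address': 'int64'}).sort_values('ip_address')
fraud_data = pd.merge_asof(fraud_data, ip_data, left_on='ip_address', right_on='lower_bound_ip_address')
# Clear the country when the address lies above the matched range's upper bound
fraud_data.loc[fraud_data['ip_address'] > fraud_data['upper_bound_ip_address'], 'country'] = None

# Display the merged dataset
fraud_data[['ip_address', 'country']].head()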
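
To check how a range lookup behaves at the boundaries, here is a minimal sketch on toy data. It uses the first two ranges from `ip_data.head()` and four hypothetical addresses chosen to fall below, inside, and above those ranges. `pd.merge_asof` is one way to do the lookup and is an assumption here, not necessarily the approach the original analysis relies on.

```python
import pandas as pd

# First two ranges from ip_data.head()
ranges = pd.DataFrame({
    'lower_bound_ip_address': [16777216, 16777472],
    'upper_bound_ip_address': [16777471, 16777727],
    'country': ['Australia', 'China'],
})

# Hypothetical addresses: below all ranges, inside each range, above both ranges
tx = pd.DataFrame({'ip_address': [100, 16777300, 16777500, 16777728]})

# For each address, pick the last range whose lower bound is <= the address
matched = pd.merge_asof(
    tx.sort_values('ip_address'),
    ranges.sort_values('lower_bound_ip_address'),
    left_on='ip_address',
    right_on='lower_bound_ip_address',
)

# Discard matches where the address lies beyond the range's upper bound
outside = matched['ip_address'] > matched['upper_bound_ip_address']
matched.loc[outside, 'country'] = None

print(matched[['ip_address', 'country']])
```

Only the two addresses that fall inside a range receive a country. The other two stay empty, which is the behaviour we want for addresses that the lookup table does not cover.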

### 5. Feature Engineering

In this step, we create new features that can provide more insight into fraudulent behavior:
- **Time-based features**: Extract hour of the day and day of the week from `purchase_time`.
- **Transaction count**: Count the number of transactions per user.
- **Transaction velocity**: Calculate the time difference between signup and purchase.

In [None]:
# Create time-based features
fraud_data['transaction_hour'] = fraud_data['purchase_time'].dt.hour
fraud_data['transaction_day'] = fraud_data['purchase_time'].dt.dayofweek

# Transaction count per user
fraud_data['transaction_count'] = fraud_data.groupby('user_id')['purchase_time'].transform('count')

# Transaction velocity (time between signup and purchase)
fraud_data['time_to_purchase'] = (fraud_data['purchase_time'] - fraud_data['signup_time']).dt.total_seconds()

# Display updated dataframe with new features
fraud_data[['transaction_hour', 'transaction_day', 'transaction_count', 'time_to_purchase']].head()
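
As a worked example of the time-based features, the sketch below recomputes them for two of the sample rows shown in `fraud_data.head()` (users 22058 and 1359). The DataFrame name `sample` is illustrative.

```python
import pandas as pd

# Two rows copied from the fraud_data.head() output
sample = pd.DataFrame({
    'user_id': [22058, 1359],
    'signup_time': pd.to_datetime(['2015-02-24 22:55:49', '2015-01-01 18:52:44']),
    'purchase_time': pd.to_datetime(['2015-04-18 02:47:11', '2015-01-01 18:52:45']),
})

sample['transaction_hour'] = sample['purchase_time'].dt.hour
sample['transaction_day'] = sample['purchase_time'].dt.dayofweek  # Monday = 0
sample['time_to_purchase'] = (
    sample['purchase_time'] - sample['signup_time']
).dt.total_seconds()

print(sample[['user_id', 'transaction_hour', 'transaction_day', 'time_to_purchase']])
```

User 1359, the one sample row labeled as fraudulent, purchased one second after signing up, which is the kind of pattern `time_to_purchase` is meant to expose.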


### 6. Encoding Categorical Features and Scaling

Most machine learning models require numeric input, so we convert the categorical features to numeric columns with one-hot encoding. We also standardize the numerical features to zero mean and unit variance so that features measured in different units contribute on a comparable scale.

In [None]:
from sklearn.preprocessing import StandardScaler

# One-hot encoding categorical features
categorical_features = ['source', 'browser', 'sex']
fraud_data = pd.get_dummies(fraud_data, columns=categorical_features)

# Scale numerical features
scaler = StandardScaler()
numerical_features = ['purchase_value', 'age', 'transaction_count', 'time_to_purchase']
fraud_data[numerical_features] = scaler.fit_transform(fraud_data[numerical_features])

# Display final processed data
fraud_data.head()
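
To see what these two transformations produce, the sketch below applies them to a toy frame built from the `source`, `purchase_value`, and `age` values of the five sample rows. Five rows are far too few for meaningful statistics, so the example only illustrates the shape of the output.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Values taken from the five rows of fraud_data.head(), stored as floats
toy = pd.DataFrame({
    'source': ['SEO', 'Ads', 'SEO', 'SEO', 'Ads'],
    'purchase_value': [34.0, 16.0, 15.0, 44.0, 39.0],
    'age': [39.0, 53.0, 53.0, 41.0, 45.0],
})

# One-hot encoding replaces 'source' with one indicator column per category
encoded = pd.get_dummies(toy, columns=['source'])

# Standardization: subtract the column mean, divide by the population std
num = ['purchase_value', 'age']
scaled = encoded.copy()
scaled[num] = StandardScaler().fit_transform(encoded[num])

print(list(encoded.columns))
print(scaled[num].mean().round(6))
print(scaled[num].std(ddof=0).round(6))
```

`StandardScaler` divides by the population standard deviation, which is why the check uses `ddof=0`.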

### 7. Save the Processed Data

After completing data preprocessing, we will save the cleaned and processed data to a new CSV file, which will be used for model training in the next steps.

In [None]:
# Save cleaned and processed data to a new file
fraud_data.to_csv('../Data/Processed/cleaned_fraud_data.csv', index=False)
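
One thing to keep in mind for the next step: CSV files do not store column types, so datetime columns come back as plain text unless they are parsed again on load. The sketch below shows the round trip on a one-row toy frame written to a temporary directory. The file name `toy.csv` and the values are illustrative.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'purchase_time': pd.to_datetime(['2015-04-18 02:47:11']),
    'ip_address': [732758400],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path, index=False)

    # Without parse_dates the timestamp is read back as text
    plain = pd.read_csv(path)
    # With parse_dates it is restored to a datetime column
    parsed = pd.read_csv(path, parse_dates=['purchase_time'])

print(plain['purchase_time'].dtype)
print(parsed['purchase_time'].dtype)
```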