## Task 1: Data Analysis and Preprocessing

**Load Libraries and Data**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

fraud = pd.read_csv("../data/raw/Fraud_Data.csv")
ip_country = pd.read_csv("../data/raw/IpAddress_to_Country.csv")


**Data Cleaning**

We inspect missing values, duplicate rows, and data types, then drop exact duplicates, fill missing `age` values with the median, and parse the two timestamp columns as datetimes.


In [None]:
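fraud.info()
print(fraud.isnull().sum())  # missing values per column; printed because it is not the cell's last expression

# Drop exact duplicate rows
fraud = fraud.drop_duplicates()
# Assign back instead of using inplace=True on a column selection,
# which may not update the DataFrame in recent pandas versions
fraud['age'] = fraud['age'].fillna(fraud['age'].median())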

fraud['signup_time'] = pd.to_datetime(fraud['signup_time'])
fraud['purchase_time'] = pd.to_datetime(fraud['purchase_time'])
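
The checks described above can also be gathered into a single summary table. The following is a minimal sketch, not part of the original notebook: `quality_report` is a hypothetical helper, and the toy frame with made-up values stands in for `fraud`.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and distinct count for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
    })

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": [25.0, 31.0, 31.0, None],
    "signup_time": ["2015-01-01 10:00:00", "2015-01-02 11:30:00",
                    "2015-01-02 11:30:00", "2015-01-03 09:15:00"],
})

n_duplicates = int(toy.duplicated().sum())  # fully identical rows
report = quality_report(toy)
print(report)
print("duplicate rows:", n_duplicates)
```

Calling `quality_report(fraud)` before and after cleaning would give a quick way to confirm that the missing counts dropped as intended.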


**Exploratory Data Analysis (EDA)**

We examine the balance between the fraud and non-fraud classes and compare the distribution of `purchase_value` across the two.

In [None]:
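# Draw each plot on its own figure so the second does not land on the first
plt.figure()
sns.countplot(x='class', data=fraud)
plt.title("Class Distribution (Fraud vs Non-Fraud)")
plt.show()

plt.figure()
sns.boxplot(x='class', y='purchase_value', data=fraud)
plt.title("Purchase Value vs Fraud")
plt.show()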
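
The plots can be backed by numbers. The following is a minimal sketch on made-up data, assuming the same `class` and `purchase_value` column names; it computes the share of each class and a per-class summary of purchase value.

```python
import pandas as pd

# Toy data standing in for the fraud table (values are made up).
toy = pd.DataFrame({
    "class": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
    "purchase_value": [30, 45, 12, 60, 25, 38, 51, 70, 19, 44],
})

# Share of rows in each class, which quantifies the imbalance.
class_share = toy["class"].value_counts(normalize=True).sort_index()

# Count, mean, and median purchase value per class.
value_by_class = toy.groupby("class")["purchase_value"].agg(["count", "mean", "median"])

print(class_share)
print(value_by_class)
```

On the real data, the same two expressions with `fraud` in place of `toy` would show how skewed the classes are, which matters when choosing evaluation metrics later.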


**Geolocation Integration**

IP addresses are converted to integers and mapped to countries through the IP ranges in the lookup table, so that fraud patterns can be analyzed by geography.

In [None]:
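# IPv4 values can exceed the int32 range, so cast explicitly to int64
fraud['ip_address'] = fraud['ip_address'].astype('int64')
ip_country[['lower_bound_ip_address','upper_bound_ip_address']] = \
ip_country[['lower_bound_ip_address','upper_bound_ip_address']].astype('int64')

# Range lookup: match each IP to the nearest lower bound at or below it.
# An exact-equality merge would only match IPs that equal a range start.
fraud = pd.merge_asof(
    fraud.sort_values('ip_address'),
    ip_country.sort_values('lower_bound_ip_address'),
    left_on='ip_address',
    right_on='lower_bound_ip_address',
    direction='backward'
)

# Discard matches where the IP lies above the matched range's upper bound
in_range = fraud['ip_address'] <= fraud['upper_bound_ip_address']
fraud['country'] = fraud['country'].where(in_range)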
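
Because the lookup table stores lower and upper bounds, mapping an address to a country is a range lookup rather than an exact key match. The following is a minimal sketch on made-up ranges and addresses, using `pd.merge_asof` as one possible approach. The column names follow the lookup table, and everything else is illustrative.

```python
import pandas as pd

# Made-up ranges and addresses, for illustration only.
ranges = pd.DataFrame({
    "lower_bound_ip_address": [100, 200, 400],
    "upper_bound_ip_address": [199, 299, 499],
    "country": ["A", "B", "C"],
})
tx = pd.DataFrame({"ip_address": [250, 100, 350, 450, 50]})

# merge_asof needs both keys sorted; 'backward' picks the last lower bound <= ip.
matched = pd.merge_asof(
    tx.sort_values("ip_address"),
    ranges.sort_values("lower_bound_ip_address"),
    left_on="ip_address",
    right_on="lower_bound_ip_address",
    direction="backward",
)

# Addresses that fall in a gap between ranges, or below the first range,
# must not inherit a country, so blank those out.
in_range = matched["ip_address"] <= matched["upper_bound_ip_address"]
matched["country"] = matched["country"].where(in_range)

lookup = dict(zip(matched["ip_address"], matched["country"]))
print(matched[["ip_address", "country"]])
```

Here 250 falls inside the second range and 100 sits exactly on a lower bound, so both get a country. 350 lies in a gap and 50 lies below every range, so both stay unmatched.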


**Fraud by Country Analysis**

In [None]:
fraud.groupby('country')['class'].mean().sort_values(ascending=False).head()
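
A mean taken over very few transactions can put a country at the top of this ranking by chance. The following is a minimal sketch on made-up data that reports the transaction count next to the fraud rate and filters out small groups. It assumes `class` equals 1 for fraud, and `MIN_TRANSACTIONS` is a hypothetical threshold that would need tuning on the real data.

```python
import pandas as pd

# Toy data: country C has a single transaction, which happens to be fraud.
toy = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 4 + ["C"],
    "class":   [0, 0, 1, 0, 0,  1, 1, 0, 0,  1],
})

MIN_TRANSACTIONS = 3  # hypothetical cutoff, chosen only for this example

# Transaction count and fraud rate per country.
country_stats = toy.groupby("country")["class"].agg(
    n_transactions="count",
    fraud_rate="mean",
)

# Keep countries with enough transactions, then rank by fraud rate.
country_stats = (
    country_stats[country_stats["n_transactions"] >= MIN_TRANSACTIONS]
    .sort_values("fraud_rate", ascending=False)
)
print(country_stats)
```

Without the filter, country C would rank first with a fraud rate of 1.0 based on a single row.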
