![](https://i.imgur.com/OKFBWmc.jpeg)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Day 1: Explore the dataset's basic structure (columns, rows, types)

In [None]:
#read the csv into a Pandas dataframe
fraud_data_train = pd.read_csv("/kaggle/input/fraud-detection/fraudTrain.csv")
fraud_data_test = pd.read_csv("/kaggle/input/fraud-detection/fraudTest.csv")

In [None]:
#check the dataframe by printing the first 5 rows
fraud_data_train.head(5)


In [None]:
fraud_data_test.head(5)

In [None]:
#check the columns in the dataframe
fraud_data_train.columns

In [None]:
fraud_data_test.columns

In [None]:
#we can find out about columns types and null values using info()
fraud_data_train.info()

In [None]:
fraud_data_test.info()

# Day 2: Identify missing values and basic data characteristics


In [None]:
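# perform basic statistics
# a cell only renders its last expression, so display() is used to show both tables
from IPython.display import display
display(fraud_data_train.describe(include = 'all').transpose())
display(fraud_data_test.describe(include = 'all').transpose())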
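
The heading for this day mentions missing values, which `info()` and `describe()` only reveal indirectly through non-null counts. A minimal sketch of an explicit check follows. The frame `toy` and its values are made up for illustration; on the real data the same calls would be applied to `fraud_data_train` and `fraud_data_test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for fraud_data_train
toy = pd.DataFrame({
    "amt": [4.97, np.nan, 220.11, 45.00],
    "category": ["misc_net", "grocery_pos", None, "gas_transport"],
    "is_fraud": [0, 0, 1, 0],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Share of missing values per column, as a percentage
missing_pct = toy.isna().mean() * 100

print(pd.DataFrame({"missing": missing_counts, "percent": missing_pct}))
```

A column with a non-zero count would then need a decision: drop the rows, impute a value, or keep the gap as its own category.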

In [None]:
# print the mean, standard deviation, min, and max for the amt column
print("Transaction Amount Statistics (all transactions)")

avg_amt = fraud_data_train["amt"].mean()
std_dev = fraud_data_train["amt"].std()
min_amt = fraud_data_train["amt"].min()
max_amt = fraud_data_train["amt"].max()

print(f"The average amount is {avg_amt}")
print(f"The std deviation for amount is {std_dev}")
print(f"The min amount is {min_amt}")
print(f"The max amount is {max_amt}")

In [None]:
# find the distribution for the is_fraud variable
fraud_data_train['is_fraud'].value_counts()

In [None]:
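# find the share of non-fraud transactions (1289169 not fraud vs. 7506 fraud)
# computed from the data instead of hard-coding the counts
class_counts = fraud_data_train['is_fraud'].value_counts()
class_counts.loc[0] / class_counts.sum()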

**As you can see, about 99.4% of the transactions are not fraud: 1,289,169 legitimate versus 7,506 fraudulent, a ratio of roughly 172:1. The data is heavily imbalanced, so a binary classification model trained on it without any correction would be biased toward the majority (not fraud) class.**
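
To make the effect of the imbalance concrete, here is a small sketch using the two class counts from this notebook. The "always predict not fraud" baseline is a hypothetical illustration, not a model used in this analysis.

```python
# Class counts taken from the value_counts() output in this notebook
not_fraud = 1289169
fraud = 7506
total = not_fraud + fraud

fraud_share = fraud / total            # fraction of fraudulent transactions
imbalance_ratio = not_fraud / fraud    # legitimate transactions per fraud case

# A hypothetical "model" that always predicts not-fraud is correct on every
# legitimate transaction, so its accuracy equals the majority-class share
baseline_accuracy = not_fraud / total
baseline_recall = 0 / fraud            # it never catches a single fraud case

print(f"Fraud share: {fraud_share:.4%}")
print(f"Imbalance ratio: {imbalance_ratio:.1f}:1")
print(f"Always-not-fraud accuracy: {baseline_accuracy:.4f}, recall: {baseline_recall:.1f}")
```

The baseline scores above 99% accuracy while detecting no fraud at all, which is why accuracy alone is a poor metric for this dataset.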

In [None]:
# create a new column with a human-readable timestamp (unix_time is in seconds)
fraud_data_train['standard_time'] = pd.to_datetime(fraud_data_train['unix_time'], unit='s')
fraud_data_test['standard_time'] = pd.to_datetime(fraud_data_test['unix_time'], unit='s')
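
`pd.to_datetime(..., unit='s')` interprets integers as seconds since 1970-01-01 00:00:00 UTC. A small sketch with made-up values shows the conversion and derives the hour of day from it. The `hour` column is illustrative only and is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up timestamps: the epoch, one day later, and one day plus one hour later
toy = pd.DataFrame({"unix_time": [0, 86400, 90000]})

# Same conversion as in the cell above
toy["standard_time"] = pd.to_datetime(toy["unix_time"], unit="s")

# Datetime columns expose components through the .dt accessor
toy["hour"] = toy["standard_time"].dt.hour

print(toy)
```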


# Day 3: Data Visualization

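**Several columns in this dataset contain personally identifiable information (PII), such as names, addresses, and card numbers. Datasets like this are often anonymized before sharing, for example with a principal component analysis (PCA) transform. For our data visualization, we will simply drop those columns and use only non-PII information.**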

In [None]:
# drop pii and irrelevant columns
columns_to_drop = ['trans_date_trans_time','cc_num','first','last','street','zip','lat','long','city_pop', 'dob','trans_num','unix_time','merch_lat','merch_long']
fraud_data_train.drop(columns = columns_to_drop, inplace=True)


In [None]:
fraud_data_train.info()

In [None]:
columns_to_drop = ['trans_date_trans_time','cc_num','first','last','street','zip','lat','long','city_pop', 'dob','trans_num','unix_time','merch_lat','merch_long']
fraud_data_test.drop(columns = columns_to_drop, inplace=True)
fraud_data_test.info()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
sns.barplot(x = "is_fraud", y = "amt", data = fraud_data_train)
plt.show()

In [None]:
sns.barplot(x = "amt", y = "category", data = fraud_data_train, hue = "is_fraud")
plt.show()

In [None]:
sns.barplot(x = "amt", y = "category", data = fraud_data_train[fraud_data_train['is_fraud'] == 1])
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x = "category", y = "amt", data = fraud_data_train, hue = "is_fraud")
plt.grid(True)
plt.xticks(rotation=90)
plt.show()

In [None]:
# histogram of transaction amounts for fraud cases only
sns.displot(fraud_data_train[fraud_data_train['is_fraud'] == 1]['amt'])
plt.xlabel('Amount')
plt.ylabel('Number of fraud transactions')
plt.title('Distribution of Fraud Transaction Amounts')
plt.show()

In [None]:
sns.scatterplot(x = 'amt', y = 'category', data = fraud_data_train, hue = 'is_fraud')
plt.xlabel('Amount')
plt.ylabel('Category')
plt.title('Relationship between Amount and Category')
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x = "state",data = fraud_data_train, hue = "is_fraud" )
plt.xticks(rotation=90)
plt.xlabel('State')
plt.ylabel('Number of transactions')
plt.title('Relationship between Fraud Transactions and State')
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x = "category",data = fraud_data_train, hue = "is_fraud" )
plt.xticks(rotation=90)
plt.xlabel('Category')
plt.ylabel('Number of transactions')
plt.title('Relationship between Category and Fraud Transactions')
plt.show()

In [None]:
plt.figure(figsize=(12,10))
sns.barplot(x = "amt", y = "category", data = fraud_data_train, hue = "is_fraud")
plt.xticks(rotation=90)
plt.xlabel('Mean amount')
plt.ylabel('Category')
plt.title('Mean Amount by Category and Fraud Status')
plt.show()

# Day 4: Identify basic patterns or trends in the data

In [None]:
fraud_count = fraud_data_train[fraud_data_train['is_fraud']==1].shape[0]
total_count = fraud_data_train.shape[0]
fraud_rate = fraud_count / total_count
print('Overall fraud rate:', fraud_rate)

In [None]:
fraud_by_category = fraud_data_train.groupby('category')['is_fraud'].mean()
fraud_by_category = fraud_by_category.sort_values(ascending=False)
print('Fraud rate by category:')
print(fraud_by_category)

In [None]:
fraud_by_state = fraud_data_train.groupby('state')['is_fraud'].mean()
fraud_by_state = fraud_by_state.sort_values(ascending=False)
print('Fraud rate by state:')
print(fraud_by_state)

In [None]:
fraud_by_city = fraud_data_train.groupby('city')['is_fraud'].mean()
fraud_by_city = fraud_by_city.sort_values(ascending=False)
print('Fraud rate by city:')
print(fraud_by_city)

In [None]:
fraud_by_merchant = fraud_data_train.groupby('merchant')['is_fraud'].mean()
fraud_by_merchant = fraud_by_merchant.sort_values(ascending=False)
print('Fraud rate by merchant:')
print(fraud_by_merchant)
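
A fraud rate computed with `groupby(...).mean()` can be misleading for groups with very few transactions, since a single fraud case in a tiny city or merchant yields a rate of 100%. The sketch below, on made-up data, reports the group size next to the rate. The threshold `min_transactions` is an arbitrary assumption for this toy example.

```python
import pandas as pd

# Hypothetical toy data: "travel" has a single transaction, which is fraud
toy = pd.DataFrame({
    "category": ["grocery_pos"] * 4 + ["travel"] * 1 + ["shopping_net"] * 5,
    "is_fraud": [0, 0, 0, 1] + [1] + [0, 0, 0, 0, 1],
})

# Fraud rate, number of fraud cases, and group size in one pass
summary = (
    toy.groupby("category")["is_fraud"]
       .agg(fraud_rate="mean", fraud_count="sum", n_transactions="count")
       .sort_values("fraud_rate", ascending=False)
)
print(summary)

# Keep only groups with enough transactions for the rate to be meaningful
min_transactions = 4  # arbitrary threshold chosen for this toy example
reliable = summary[summary["n_transactions"] >= min_transactions]
print(reliable)
```

Here `travel` tops the ranking with a 100% fraud rate based on one transaction, and it drops out once the size filter is applied.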

# Day 5: Descriptive Statistics: calculate basic statistics

In [None]:
# print the mean, median, mode, standard deviation, min, max, and variance for the amt column
print("Transaction Amount Statistics (all transactions)")

mean_amt = fraud_data_train["amt"].mean()
median_amt = fraud_data_train['amt'].median()
mode_amt = fraud_data_train['amt'].mode().values  # Mode can be multiple values
std_amt = fraud_data_train["amt"].std()
min_amt = fraud_data_train["amt"].min()
max_amt = fraud_data_train["amt"].max()
var_amt = fraud_data_train['amt'].var()


print("Mean: ", mean_amt)
print("Median: ", median_amt)
print("Mode: ", mode_amt)
print("Standard Deviation: ", std_amt)
print("Minimum: ", min_amt)
print("Maximum: ", max_amt)
print("Variance: ", var_amt)
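
The statistics above cover all transactions together. To compare fraudulent and legitimate transactions, the same measures can be computed per class. This is a sketch on made-up values; on the real data `toy` would be replaced by `fraud_data_train`.

```python
import pandas as pd

# Hypothetical toy data: three legitimate and two fraudulent transactions
toy = pd.DataFrame({
    "amt": [10.0, 20.0, 30.0, 300.0, 500.0],
    "is_fraud": [0, 0, 0, 1, 1],
})

# One row per class, one column per statistic
amt_by_class = toy.groupby("is_fraud")["amt"].agg(
    ["count", "mean", "median", "std", "min", "max"]
)
print(amt_by_class)
```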


# Day 6: Descriptive Statistics: Discuss Findings and any surprising elements in the data

**Quantiles are cut points that divide a sorted dataset into equal-sized parts, which helps you understand how the values are distributed.**

In [None]:
fraud_data_train['amt'].quantile([0.25, 0.5, 0.75])

25th Percentile (Q1): 25% of the transactions have amounts less than or equal to $9.65.

50th Percentile (Q2 or Median): 50% of the transactions have amounts less than or equal to $47.52.

75th Percentile (Q3): 75% of the transactions have amounts less than or equal to $83.14.
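
As a worked example of how quantiles are read, the sketch below computes the three quartiles for a small made-up series and uses the interquartile range (IQR) to flag unusually large values. The 1.5 × IQR rule is a common convention, assumed here for illustration and not part of the analysis above.

```python
import pandas as pd

# Made-up amounts with one obviously large value
amounts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Quartiles (pandas interpolates linearly between neighbouring values by default)
q1, q2, q3 = amounts.quantile([0.25, 0.5, 0.75])

# Interquartile range and the conventional upper fence for outliers
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = amounts[amounts > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}, upper fence={upper_fence}")
print(outliers)
```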

**Cramer's V measures the strength of association between two categorical variables, here between a categorical feature and the binary target variable (fraud or not fraud). A commonly used guideline for interpreting Cramer's V:**

* 0.1 to 0.3: Small or weak correlation
* 0.3 to 0.5: Moderate correlation
* 0.5 and above: Strong or high correlation
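
For reference, the general formula is V = sqrt(chi2 / (n * min(r - 1, c - 1))), where n is the number of observations and r and c are the numbers of rows and columns of the contingency table. A minimal sketch on made-up data follows. The helper name `cramers_v` is hypothetical, and the Yates correction is switched off so that a perfectly associated 2x2 table gives exactly 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramer's V for two categorical sequences (hypothetical helper)."""
    table = pd.crosstab(pd.Series(x), pd.Series(y))
    # correction=False disables the Yates continuity correction, which SciPy
    # applies by default to tables with one degree of freedom (2x2)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Category fully determines the target: strongest possible association
perfect = cramers_v(["a", "a", "b", "b"], [0, 0, 1, 1])

# Target is split evenly within each category: no association
independent = cramers_v(["a", "a", "b", "b"], [0, 1, 0, 1])

print(perfect, independent)
```

With a binary target, min(r - 1, c - 1) equals 1, so the formula reduces to sqrt(chi2 / n).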

In [None]:
from scipy.stats import chi2_contingency

# Contingency table
contingency = pd.crosstab(fraud_data_train['category'], fraud_data_train['is_fraud'])

# Chi-square test 
chi2, p, dof, expected = chi2_contingency(contingency)

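# Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))
# with a binary target the min(...) term is 1, so this reduces to sqrt(chi2 / n)
observations = contingency.sum().sum()
phi2 = chi2 / observations
cramer_v = np.sqrt(phi2 / (min(contingency.shape) - 1))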


print("Cramer's V = ", cramer_v)

# Visualize the association using a heatmap
sns.heatmap(contingency, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Contingency Table: Fraud vs. Category')
plt.show()

In [None]:
from scipy.stats import chi2_contingency

# Contingency table
contingency = pd.crosstab(fraud_data_train['merchant'], fraud_data_train['is_fraud'])

# Chi-square test 
chi2, p, dof, expected = chi2_contingency(contingency)

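# Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))
# with a binary target the min(...) term is 1, so this reduces to sqrt(chi2 / n)
observations = contingency.sum().sum()
phi2 = chi2 / observations
cramer_v = np.sqrt(phi2 / (min(contingency.shape) - 1))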


print("Cramer's V = ", cramer_v)

# Visualize the association using a heatmap
sns.heatmap(contingency, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Contingency Table: Fraud vs. Merchant')
plt.show()

In [None]:
from scipy.stats import chi2_contingency

# Contingency table
contingency = pd.crosstab(fraud_data_train['state'], fraud_data_train['is_fraud'])

# Chi-square test 
chi2, p, dof, expected = chi2_contingency(contingency)

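# Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))
# with a binary target the min(...) term is 1, so this reduces to sqrt(chi2 / n)
observations = contingency.sum().sum()
phi2 = chi2 / observations
cramer_v = np.sqrt(phi2 / (min(contingency.shape) - 1))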


print("Cramer's V = ", cramer_v)

# Visualize the association using a heatmap
sns.heatmap(contingency, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Contingency Table: Fraud vs. State')
plt.show()

In [None]:
from scipy.stats import chi2_contingency

# Contingency table
contingency = pd.crosstab(fraud_data_train['city'], fraud_data_train['is_fraud'])

# Chi-square test 
chi2, p, dof, expected = chi2_contingency(contingency)

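# Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))
# with a binary target the min(...) term is 1, so this reduces to sqrt(chi2 / n)
observations = contingency.sum().sum()
phi2 = chi2 / observations
cramer_v = np.sqrt(phi2 / (min(contingency.shape) - 1))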


print("Cramer's V = ", cramer_v)

# Visualize the association using a heatmap
sns.heatmap(contingency, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Contingency Table: Fraud vs. City')
plt.show()

**ANOVA, or Analysis of Variance, is a statistical test used to compare means among multiple groups. In this dataset, we have several categorical variables and we want to know whether the mean of a numerical value (in this case, amt) differs significantly across the groups of each variable. Let's analyze each categorical variable:**

In [None]:
from scipy.stats import f_oneway
# List of categorical variables
categorical_vars = ['merchant', 'category', 'state', 'city']

# One-way ANOVA for each categorical variable
for cat_var in categorical_vars:
    # collect the amt values of every group in a single groupby pass
    groups = [group.to_numpy() for _, group in fraud_data_train.groupby(cat_var)['amt']]
    anova_result = f_oneway(*groups)
    print(f'ANOVA for {cat_var}: F-statistic = {anova_result.statistic:.2f}, p-value = {anova_result.pvalue:.4f}')

# Create a heatmap for the correlation between the numerical variable 'amt' and the binary variable 'is_fraud'
# (Pearson correlation with a binary variable is the point-biserial correlation)
point_biserial_corr = fraud_data_train['amt'].corr(fraud_data_train['is_fraud'])
plt.figure(figsize=(8, 6))
sns.heatmap([[1, point_biserial_corr], [point_biserial_corr, 1]],
            annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1,
            xticklabels=['amt', 'is_fraud'], yticklabels=['amt', 'is_fraud'])
plt.title('Point-Biserial Correlation Heatmap: Amount vs. Fraud')
plt.show()
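
`Series.corr` computes Pearson's r by default, and when one of the two variables is binary, Pearson's r coincides with the point-biserial correlation. A quick check on made-up data using `scipy.stats.pointbiserialr`:

```python
import pandas as pd
from scipy.stats import pointbiserialr

# Hypothetical toy data: larger amounts for the two fraud cases
toy = pd.DataFrame({
    "amt": [5.0, 12.0, 8.0, 300.0, 450.0, 20.0],
    "is_fraud": [0, 0, 0, 1, 1, 0],
})

# Pearson correlation, as used in the cell above
r_pearson = toy["amt"].corr(toy["is_fraud"])

# Point-biserial correlation: binary variable first, continuous variable second
r_pb, p_value = pointbiserialr(toy["is_fraud"], toy["amt"])

print(f"Pearson r = {r_pearson:.6f}, point-biserial r = {r_pb:.6f}")
```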

ANOVA results for each categorical variable. With roughly 1.3 million rows, even small differences in group means reach statistical significance, so the p-values alone say little about how large the differences are.

ANOVA for Merchant:

* F-statistic = 32.58
* p-value < 0.0001 (printed as 0.0000)
* Analysis: The F-statistic of 32.58 is relatively high, and the p-value is extremely low. This suggests that there are significant differences in the average transaction amounts among different merchants.

ANOVA for Category:

* F-statistic = 1679.16
* p-value < 0.0001 (printed as 0.0000)
* Analysis: The F-statistic of 1679.16 is very high, and the p-value is extremely low. This indicates strong evidence that there are significant differences in the average transaction amounts across different categories.

ANOVA for State:

* F-statistic = 16.78
* p-value < 0.0001 (printed as 0.0000)
* Analysis: The F-statistic of 16.78 is relatively high, and the p-value is extremely low. This suggests significant differences in the average transaction amounts among different states.

ANOVA for City:

* F-statistic = 18.86
* p-value < 0.0001 (printed as 0.0000)
* Analysis: The F-statistic of 18.86 is relatively high, and the p-value is extremely low. This indicates significant differences in the average transaction amounts among different cities.
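
To show what the F-statistic measures, the sketch below runs a one-way ANOVA on a tiny made-up dataset and also computes eta squared (between-group sum of squares divided by total sum of squares). Eta squared is a common effect-size measure, added here as an assumption for illustration; it was not computed in the analysis above.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: three categories with three amounts each
toy = pd.DataFrame({
    "category": ["a"] * 3 + ["b"] * 3 + ["c"] * 3,
    "amt": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# One array of amounts per category
groups = [g.to_numpy() for _, g in toy.groupby("category")["amt"]]
f_stat, p_value = f_oneway(*groups)

# Eta squared: share of the total variance explained by the grouping
grand_mean = toy["amt"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((toy["amt"] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.2f}, p = {p_value:.4f}, eta squared = {eta_squared:.4f}")
```

An eta squared near 0 alongside a tiny p-value would indicate a difference that is statistically detectable but small in practice.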

# Day 7: Summary

**Over the past week, I embarked on a thorough exploratory data analysis (EDA) journey as part of Women Who Code Data Science. Starting with a basic overview of the dataset's structure and a check for missing values, I progressed to creating simple visualizations like bar charts and count plots to reveal data distributions and patterns. Transitioning to descriptive statistics on Day 5, I calculated fundamental measures such as mean, median, and mode, providing a quantitative lens on central tendencies. I challenged myself on Day 6 to gain a deeper understanding of more advanced concepts around association and correlation, such as Cramer's V and the point-biserial correlation. This journey not only honed my skills in data analysis but also highlighted the importance of iterative exploration for meaningful insights.**