# **Introduction**

Given our shared interest in financial crime and security, we chose a credit card transaction dataset with over a million entries. Initially, we were excited to independently explore how variables associated with credit card usage affect the authenticity of a transaction. However, as we dove deeper into the variability of the dataset, our project scope narrowed to two questions: whether relationships exist among the variables available to us, and whether one can make effective, consistent predictions about the occurrence of credit card fraud. Our individual focuses were:

- How can one ascertain a fraudulent transaction which used a pin number or a chip by analysing the distance from the previous transaction, the ratio of the purchase price to the median price, and whether the transaction was made at a recurring retailer?
- How does the distance from home affect the likelihood of credit card fraud in both online and offline transactions?
- What relationship do distance from home and distance from last transaction have when a transaction is made using a chip, and how does that affect the likelihood of the transaction being considered fraudulent?

---

# **Exploratory Data Analysis**

We began by working with an extensive dataset of a million data points. To simplify the data, we narrowed our focus to several variables, such as distance from home, distance from last transaction, and ratio to median purchase price. We then explored the data with fundamental methods such as `.describe()`, `.info()`, and `.corr()`. However, reducing the dataset this way did not yield any clear correlations or substantial insights. Consequently, we turned to the `groupby` function, with each team member assigned a specific variable to examine, which gave us a much more comprehensive understanding of the data. We created various illustrative plots, including violin plots, bar plots, heatmaps, and scatter plots.
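
The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names are assumed, and the six rows are made-up toy values standing in for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; column names are assumed, values are made up.
df = pd.DataFrame({
    "distance_from_home": [2.1, 57.9, 5.1, 310.4, 12.7, 140.2],
    "distance_from_last_transaction": [0.3, 0.2, 0.8, 1.2, 5.6, 0.1],
    "ratio_to_median_purchase_price": [1.9, 1.3, 0.4, 6.2, 0.9, 4.8],
    "online_order": [0, 0, 1, 1, 0, 1],
    "fraud": [0, 0, 0, 1, 0, 1],
})

summary = df.describe()  # count, mean, std, and quartiles per column
df.info()                # dtypes and non-null counts (prints to stdout)
corr = df.corr()         # pairwise Pearson correlations

# Group by one variable and compare the averages of the others.
by_fraud = df.groupby("fraud").mean()
print(by_fraud)
```

Grouping by a single variable turns a large table into a handful of comparable rows, which is often easier to read than a full correlation matrix.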

## Figure 1
The heatmap below was built after grouping the data by distance from home. It helped us identify several strong correlations right away, such as fraud vs. distance from home and used chip vs. distance from the last transaction. This gave us a solid sense of what to expect and pointed us toward a more in-depth analysis.

![Heatmap](../images/karim_heatmap.png)
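
One possible way to build a correlation table from data grouped by distance from home is sketched below. It is only one reading of the approach, under stated assumptions: the column names are assumed, the data is synthetic, and the choice of eight equal-sized bands is arbitrary.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
distance = [float(d) for d in range(5, 405, 10)]  # 40 values: 5, 15, ..., 395
df = pd.DataFrame({
    "distance_from_home": distance,
    "distance_from_last_transaction": [(i * 7) % 13 + 0.5 for i in range(40)],
    "used_chip": [float(i % 3 == 0) for i in range(40)],
    "fraud": [float(d > 100 and i % 2 == 0) for i, d in enumerate(distance)],
})

# Split distance from home into 8 equal-sized bands and average each column per band.
df["home_band"] = pd.qcut(df["distance_from_home"], q=8)
numeric_cols = ["distance_from_home", "distance_from_last_transaction", "used_chip", "fraud"]
band_means = df.groupby("home_band", observed=True)[numeric_cols].mean()

# Correlations between the per-band averages; this is the matrix a heatmap would show,
# e.g. seaborn.heatmap(corr, annot=True) if seaborn is installed.
corr = band_means.corr()
print(corr.round(2))
```

Correlating per-band averages tends to produce stronger coefficients than correlating raw rows, because averaging removes most of the row-level noise.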

## Figure 2

![Stacked](images/Ru_stackedbarplot.png)

Another focus was on transactions authenticated with either a chip or a PIN. In the initial stages of EDA, this stacked bar plot shows how fraudulent and legitimate transactions are distributed across chip and PIN usage. The proportion of transactions that used a PIN is about the same for fraudulent and legitimate transactions, so we concluded that PIN usage alone cannot be used to ascertain the likelihood of a fraudulent transaction. The same cannot be said for chips: 90% of fraudulent transactions used a chip, compared with close to 85% of valid transactions.
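
The proportions behind a stacked bar plot like this can be computed with a normalized cross-tabulation. The sketch below assumes the column names `used_chip`, `used_pin_number`, and `fraud`, and uses ten made-up rows, so its numbers do not reproduce the percentages quoted above.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
df = pd.DataFrame({
    "used_chip":       [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "used_pin_number": [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    "fraud":           [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# normalize="index" makes each row (fraud class) sum to 1,
# so each cell is the share of that class that did or did not use the feature.
chip_share = pd.crosstab(df["fraud"], df["used_chip"], normalize="index")
pin_share = pd.crosstab(df["fraud"], df["used_pin_number"], normalize="index")
print(chip_share)
print(pin_share)

# With matplotlib installed, chip_share.plot(kind="bar", stacked=True) draws the stacked bars.
```

Note that this table answers "what share of fraudulent transactions used a chip?", which is a different question from "what share of chip transactions were fraudulent?".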

## Figure 3

The figure below shows the count of fraudulent and legitimate transactions with and without a chip. The count of frauds is much lower when a chip is used than when it is not. Hence we can say that the likelihood of a fraudulent transaction being made with a chip is very low.

![chip vs fraud](../images/adi/chip_fraud.png)
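
Raw counts depend on how many transactions fall in each group, so it can help to look at the fraud rate per group as well. A minimal sketch, assuming columns named `used_chip` and `fraud` and using made-up toy rows:

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
df = pd.DataFrame({
    "used_chip": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "fraud":     [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# For each chip group: number of frauds, number of transactions, and their ratio.
fraud_by_chip = df.groupby("used_chip")["fraud"].agg(
    n_fraud="sum",
    n_total="count",
    fraud_rate="mean",
)
print(fraud_by_chip)
```

Comparing `fraud_rate` across the two rows separates the effect of chip use from the effect of one group simply being larger than the other.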

---

# **Research Questions**

### **How can one ascertain a fraudulent transaction which used a pin number or a chip by analysing the distance from the previous transaction, the ratio of the purchase price to the median price, and whether the transaction was made at a recurring retailer?**

I decided to explore sub-research questions, each focused on an individual variable. The following explores one of those variables in answering the broader research question. To see how the research question was answered through exploration, see the full analysis [here](analysis/analysis1.ipynb).

![Violin plot](images/Ru_ViolinPlot.png)

In the violin plot, most of the density sits where the ratio to median purchase price is between roughly -20 and 20, for both fraudulent and non-fraudulent transactions. Although more legitimate transactions (yellow) than fraudulent ones (blue) fall at lower ratios, fraudulent transactions are spread across a larger range of purchase ratios, from approximately -55 to 95, compared with -25 to 90 for authentic transactions. One can infer that fraudulent transactions become more likely as the ratio to median purchase price increases. Note that a violin plot mirrors a kernel density estimate on each side of its central axis, and the width of the violin is proportional to the estimated density of the data at that value.
Because the estimate smooths each observation over a bandwidth, the violin can extend past the smallest and largest observed values. The negative values on the axis are most likely such a smoothing artifact, since a ratio of purchase prices should not be negative.
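
The smoothing effect can be checked directly with a kernel density estimate. The sketch below uses `scipy.stats.gaussian_kde` on made-up, strictly positive ratios; it is an illustration of the general behaviour, not a reproduction of the plot above.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy ratios, made up for illustration; every value is strictly positive.
ratios = np.array([0.2, 0.5, 0.8, 1.0, 1.3, 2.0, 4.5, 9.0, 20.0, 60.0])

kde = gaussian_kde(ratios)  # Gaussian kernels with the default bandwidth rule

# The estimated density is non-zero outside the observed range,
# even though no data point lies there.
density_below_zero = kde([-5.0])[0]
density_above_max = kde([70.0])[0]
print(ratios.min(), density_below_zero, density_above_max)
```

If the plot was drawn with seaborn, passing `cut=0` to `violinplot` limits each violin to the observed data range, which would remove the negative tail.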

### Highlights & Takeaways

- Valid transactions are concentrated within the range of -25 to 20, while fraudulent transactions are more dispersed, from -55 to 95.
- As the ratio to median purchase price increases, fraudulent transactions become more likely than they are at lower ratios.

---

### **How does the distance from home affect the likelihood of credit card fraud in both online and offline transactions?**

To answer this research question, I dove deeper into the dataset using the variables distance from home, fraud, and online transactions.

## Figure 1:

![Heatmap](../images/karim_RQ1.png)

Using this graph, we can summarize a few important claims about my research question:

1) When the distance from home is <25, the probability of fraud is approximately 25-30%.
2) When the distance from home is >25 and <100, there is almost a 50% chance of fraud.
3) When the distance from home is >100, the probability increases to 80-90%.
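
Probabilities per distance band like the ones listed above can be computed by binning the distance and averaging the fraud flag. This is a minimal sketch: the column names are assumed, the twelve rows are made up, and the resulting rates are not the report's figures.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
df = pd.DataFrame({
    "distance_from_home": [3, 10, 18, 24, 30, 60, 75, 99, 120, 250, 400, 800],
    "fraud":              [0,  0,  0,  1,  0,  1,  0,  1,   1,   1,   0,   1],
})

# Bands follow the thresholds in the list above. pd.cut closes each interval on the right,
# so a distance of exactly 25 or 100 falls in the lower band.
bands = pd.cut(
    df["distance_from_home"],
    bins=[0, 25, 100, float("inf")],
    labels=["<25", "25-100", ">100"],
)

# The mean of a 0/1 fraud flag within a band is the observed fraud probability.
fraud_rate = df.groupby(bands, observed=True)["fraud"].mean()
print(fraud_rate)
```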

## Figure 2:

![Boxplot](../images/karim_RQ2.png)

This boxplot tells us quite a few things. First, the chance of a transaction being fraudulent is minimal when it is offline and much greater when it is online. Looking at the boxplot in more detail, the chance of fraud is much higher when the transaction is online and the distance from home is less than 100. This graph brings us close to answering our research question, because it shows that frauds mainly occur when the payment was made online and the distance from home is less than 100.
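
The same comparison can be tabulated by crossing the online flag with a distance band. The sketch below assumes the column names `online_order`, `distance_from_home`, and `fraud`, uses made-up rows, and takes the cutoff of 100 from the discussion above.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
df = pd.DataFrame({
    "online_order":       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "distance_from_home": [5, 40, 150, 300, 10, 60, 90, 120, 200, 500],
    "fraud":              [0, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Label each transaction by whether it happened within 100 of home.
df["home_band"] = df["distance_from_home"].lt(100).map(
    {True: "under_100", False: "100_plus"}
)

# Fraud rate for every combination of online/offline and distance band.
rate = df.pivot_table(
    index="online_order", columns="home_band", values="fraud", aggfunc="mean"
)
print(rate)
```

A table like this makes it easy to see whether the online effect holds in both distance bands or only in one of them.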

### **What relationship do distance from home and distance from last transaction have when a transaction is made using a chip, and how does that affect the likelihood of the transaction being considered fraudulent?**

Let us look at a plot of fraudulent transactions made with and without a chip at different distances from home. The plot below shows that fraudulent transactions made without a chip fell within a much greater range of distance from home (approximately 0 to 250) than non-fraudulent transactions made without a chip (approximately 0 to 50). We can also see that fraudulent transactions made with a chip had a smaller range of distance from home than non-fraudulent transactions made with a chip.

![dfh vs fraud for chip transactions](../images/adi/DFH_chip.png)

Now let us look at a plot of fraudulent transactions made with and without a chip at different values of distance from last transaction. In general, all fraudulent transactions, regardless of chip use, fall in a very similar range of distance from last transaction. Chip use does, however, very slightly increase the possibility of fraud: as the plot shows, transactions that used a chip have a slightly greater range of distance from last transaction. Still, the deeper analysis of this attribute shows that there isn't a very strong correlation between distance from last transaction and fraud.

<img src="../images/adi/DFLT_chip.png" width="800" />
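
The ranges read off the two plots above could also be computed directly. A minimal sketch, assuming the column names shown and using eight made-up rows rather than the real data:

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
df = pd.DataFrame({
    "used_chip": [0, 0, 0, 0, 1, 1, 1, 1],
    "fraud":     [0, 0, 1, 1, 0, 0, 1, 1],
    "distance_from_home":             [2, 45, 10, 240, 1, 180, 5, 40],
    "distance_from_last_transaction": [0.5, 3.0, 0.2, 9.0, 0.1, 4.0, 0.3, 11.0],
})

# Minimum and maximum of each distance for every (chip, fraud) combination.
ranges = df.groupby(["used_chip", "fraud"])[
    ["distance_from_home", "distance_from_last_transaction"]
].agg(["min", "max"])
print(ranges)

# Linear correlation between distance from last transaction and the fraud flag.
dflt_corr = df["distance_from_last_transaction"].corr(df["fraud"])
print(round(dflt_corr, 3))
```

Minimum and maximum are sensitive to single outliers, so quantiles such as the 5th and 95th percentile may give a steadier picture on a dataset of this size.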

---

# **Conclusions**

After analyzing the data, we found that the combination of a higher ratio to median purchase price and a shorter distance from last transaction contributes to a higher probability of fraud. The analysis also indicates that as the distance from home increases, both the likelihood of an order being online and the chance of fraud increase, whereas almost no frauds were detected among offline transactions. Combining these findings, the likelihood of a fraudulent transaction rises significantly when the transaction is made online and far from home.

Finally, the exploratory data analysis reveals that distance from home and distance from last transaction are important variables to consider when detecting fraudulent transactions, so flagging transactions with larger values of these distances is a reasonable way to surface likely fraud. Additionally, fraudulent transactions that did not use a chip have a maximum distance from home of approximately 250, whereas fraudulent transactions that used a chip were made much closer to home, with a maximum of a little less than 50. Therefore, the possibility of fraud is very low when the transaction is made using a chip, which provides useful insight for designing fraud detection algorithms.
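
As a rough illustration of how these findings might feed a detection rule, the sketch below flags online transactions made far from home and measures how well that flag matches the fraud label. Everything here is hypothetical: the column names, the `DISTANCE_THRESHOLD` value, and the toy rows are assumptions for illustration, not a rule derived or validated in this project.

```python
import pandas as pd

# Toy data, made up for illustration; column names are assumed.
df = pd.DataFrame({
    "distance_from_home": [5, 20, 150, 300, 80, 400, 12, 220],
    "online_order":       [0, 1, 1, 1, 1, 0, 1, 1],
    "fraud":              [0, 0, 1, 1, 1, 0, 0, 0],
})

DISTANCE_THRESHOLD = 100  # hypothetical cutoff, not tuned on real data

# Flag a transaction when it is online AND far from home.
flagged = (df["online_order"] == 1) & (df["distance_from_home"] > DISTANCE_THRESHOLD)

true_pos = int((flagged & (df["fraud"] == 1)).sum())
precision = true_pos / int(flagged.sum())          # share of flags that are real fraud
recall = true_pos / int((df["fraud"] == 1).sum())  # share of fraud that gets flagged
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Reporting both precision and recall matters for fraud data, where legitimate transactions heavily outnumber fraudulent ones and accuracy alone can look good for a rule that flags nothing.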