## Data Understanding: Exploring the Dataset

This notebook explores the Credit Card Transactions dataset to identify potential feature issues, fulfilling the Data Understanding phase of the CRISP-DM project planning framework.

### Objectives:
1. **Understand the Dataset**: Gain familiarity with the dataset structure, including feature types, distributions, and relationships.
2. **Identify Potential Issues**:
   - Missing or incomplete data
   - Outliers or anomalies
   - Features that may require transformation 
3. **Document Observations**: Note any data quality issues and potential corrective actions.

The aim is to clearly understand the dataset and necessary preprocessing steps, ensuring readiness for the feature engineering and modeling stages.

In [11]:
# loading libraries
import numpy as np
import pandas as pd
import geopandas as gpd
from geodatasets import get_path
import matplotlib.pyplot as plt
import seaborn as sns
from rapidfuzz import fuzz, process
import os

%matplotlib inline

In [14]:
# path to local csv file
data = r"C:\Users\eelil\OneDrive\Desktop\Capstone\Machine_Learning_Analysis_of_Bank_Fraud\data\1\credit_card_transactions.csv"

# making dataframe
df = pd.read_csv(data)

I am working with a local copy of the CSV file instead of accessing the wandb artifact to ensure that any changes made during the data understanding phase remain temporary and do not affect the original data. When I start on the preprocessing and feature creation phase, I will access the wandb artifact. Additionally, I am not incorporating logging into this file, as this notebook will serve as my log for this phase.

In [None]:
# viewing the first 5 rows of data
df.head()

In [None]:
df.columns

### Column Descriptions and Initial Observations

The dataset contains the following columns:

1. **Unnamed**: Original index column, redundant in this analysis.
2. **trans_date_trans_time**: Timestamp of the transaction.
3. **cc_num**: Credit card number (hashed/anonymized).
4. **merchant**: Merchant or store where the transaction occurred.
5. **category**: Type of transaction.
6. **amt**: Amount of the transaction.
7. **first**: First name of the cardholder.
8. **last**: Last name of the cardholder.
9. **gender**: Gender of the cardholder.
10. **street**: Address details of the cardholder.
11. **city**: Address details of the cardholder.
12. **state**: Address details of the cardholder.
13. **zip**: Address details of the cardholder.
14. **lat**: Geographical coordinates of the transaction.
15. **long**: Geographical coordinates of the transaction.
16. **city_pop**: Population of the city where the transaction occurred.
17. **job**: Occupation of the cardholder.
18. **dob**: Date of birth of the cardholder.
19. **trans_num**: Unique transaction number.
20. **unix_time**: Unix timestamp of the transaction.
21. **merch_lat**: Geographical coordinates of the merchant.
22. **merch_long**: Geographical coordinates of the merchant.
23. **is_fraud**: Indicator of whether the transaction is fraudulent.
24. **merch_zipcode**: Zip code of the merchant.

**Please note, these column descriptions are taken directly from [Kaggle](https://www.kaggle.com/datasets/priyamchoksi/credit-card-transactions-dataset) (Choksi, n.d.).**    

Based on the column descriptions and the initial inspection of the data (first five rows of the dataframe), the following columns will be removed:

1. **Unnamed**: This column served as the original index and is redundant.

2. **first**, **last**, **gender**, **unix_time**, **street**, **zip**, **merch_zipcode**:
   - **first** and **last** are not necessary for analysis, as the credit card number (`cc_num`) already identifies the account.
   - **gender** is unlikely to contribute to identifying fraudulent transactions and could introduce bias or discrimination into the model.
   - **unix_time** will be dropped if 'trans_date_trans_time' is complete.
   - **street** is too specific for this project. I will use city and state for the customer's location.
   - **zip** and **merch_zipcode** will be dropped for redundancy if the other location-based columns are complete.

**trans_num** will be retained so that transactions with identical values in the other columns are not mistakenly flagged as duplicates. It will not be adjusted unless it contains duplicate values.


In [None]:
# dropping unnamed column
df.drop(
    df.columns[df.columns.str.contains("unnamed", case=False)], axis=1, inplace=True
)

In [None]:
# dropping 'first', 'last', and 'gender'
df.drop(["first", "last", "gender"], axis=1, inplace=True)

In [None]:
# checking the number of records and columns
df.shape

In [None]:
# dropping any duplicate rows
df.drop_duplicates(inplace=True)
# checking if the number of records changed
df.shape

There were no duplicates in this dataset. However, I will add duplicate-checking in the data preparation phase to accommodate future datasets.
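A minimal sketch of what that duplicate check could look like, assuming `trans_num` is the key that should be unique. The helper name `count_duplicates` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def count_duplicates(frame: pd.DataFrame, key: str = "trans_num") -> dict:
    """Count fully duplicated rows and repeated values of a key column."""
    return {
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }


# toy data (made up for illustration): the second and third rows are identical
toy = pd.DataFrame(
    {
        "trans_num": ["a1", "b2", "b2", "c3"],
        "amt": [10.0, 25.5, 25.5, 7.25],
    }
)
dup_report = count_duplicates(toy)
print(dup_report)
```

Reporting the counts before calling `drop_duplicates` keeps a record of how many rows a future dataset would lose.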

In [None]:
df.info()

In [None]:
# dropping redundant columns
df.drop(["street", "zip", "merch_zipcode", "unix_time"], axis=1, inplace=True)

### Based on the datatypes and null values:

**Location Columns**
Due to redundancy, several location columns ('street', 'zip', 'merch_zipcode') were dropped.
- The latitude and longitude columns for merchants are complete and will be utilized in the next phase for merchant location, resulting in 'merch_zipcode' being dropped. This also removed all null values from the dataframe.
- The 'street' column was dropped since 'city' and 'state' will represent the customer's location.

**Unix Time Column**
The 'unix_time' column was dropped for redundancy since 'trans_date_trans_time' is complete. I intend to make several new datetime columns.

**Columns with Unusual Data Types** 
- trans_date_trans_time  |  object 
- merchant               |  object 
- category               |  object 
- city                   |  object   
- state                  |  object  
- job                    |  object  
- dob                    |  object  
- is_fraud               |  int64    


These will be changed to:
1. **'trans_date_trans_time', 'dob'** - datetime
2. **'merchant', 'category', 'city', 'state', 'job', 'is_fraud'** - category

In [None]:
# number of unique values in each column
df.nunique()

In [None]:
# changing data types of some columns for easier plotting
# first the datetime columns
df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
df["trans_date_trans_time"].info()

In [None]:
df["dob"] = pd.to_datetime(df["dob"])
df["dob"].info()

In [None]:
# now the category columns
df[["category", "merchant", "job", "is_fraud", "state", "city"]] = df[
    ["category", "merchant", "job", "is_fraud", "state", "city"]
].astype("category")
df[["category", "merchant", "job", "is_fraud", "state", "city"]].info()

In [None]:
# checking data types before plotting
df.info()

## Column Analysis

### Column - 'is_fraud'

In [None]:
# grouping fraud accounts
fraud = df.groupby("is_fraud", observed=True)

In [None]:
# visualizing what portion of the data is fraudulent transactions
fraud_counts = fraud.size()
ax = fraud_counts.plot(kind="bar", rot=0, color=["blue", "red"], figsize=(13, 6))

plt.xlabel("Fraud")
plt.ylabel("Count in Millions")
ax.set_title("Transaction Counts: Fraud vs. Non-Fraud", fontsize=16)

plt.show()

In [None]:
# 1 = fraudulent transaction, 0 = non-fraudulent transaction
fraud_counts

In [None]:
fraud_percentages = (fraud_counts / fraud_counts.sum() * 100).round(2)
fraud_percentages

**Less than 1% of the transactions are fraudulent.**

In [None]:
# calculating the total transaction amount for fraud vs. non-fraud
fraud_amount = fraud["amt"].sum()
fraud_amount

In [None]:
ax = fraud_amount.plot(kind="bar", rot=0, color=["blue", "red"], figsize=(13, 6))

plt.xlabel("Fraud Classification")
plt.ylabel("Transaction Amount (Tens of Millions)")
ax.set_title("Transaction Amount: Fraud vs. Non-Fraud", fontsize=16)

plt.show()

In [None]:
fraud_amount_per = (fraud_amount / fraud_amount.sum() * 100).round(2)
fraud_amount_per

**Fraudulent transactions nonetheless make up over 4% of the total transaction amount.** The 'is_fraud' column does not need to be adjusted.

### Column - 'amt'

In [None]:
# looking further into transaction amount
amt = df["amt"]
amt.describe()

In [None]:
# for easier viewing
print(
    f"Min Transactional Amount ${amt.min():,.2f} and Max Transactional Amount ${amt.max():,.2f}"
)

In [None]:
# applying a logarithmic transformation to the amount column for better visualization in the boxplot
amt_trans = np.log(amt[amt > 0])

# boxplot
plt.boxplot(amt_trans)
plt.xlabel("Transactions")
plt.ylabel("Log-Transformed Amount")
plt.title("Boxplot of Log-Transformed Transaction Amounts")
plt.show()

This plot highlights a relatively large number of outliers at the higher end of the transaction amounts. No issues requiring changes to the 'amt' column were identified at this stage.
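To put a number on the points drawn beyond the whiskers, the usual 1.5 × IQR rule (matplotlib's default whisker length) can be applied directly. The sketch below uses made-up amounts and a hypothetical helper `iqr_outlier_mask`; on the real data it would be applied to the log-transformed 'amt' values.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# toy amounts (made up for illustration)
toy_amt = pd.Series([1.0, 5.0, 8.0, 12.0, 20.0, 35.0, 60.0, 28000.0])
log_amt = np.log(toy_amt)

mask = iqr_outlier_mask(log_amt)
print(f"{mask.sum()} of {len(toy_amt)} amounts fall outside the whiskers")
print(toy_amt[mask])
```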

### Column - 'category'

In [None]:
# plotting category frequency
df["category"].value_counts().plot(
    kind="bar",
    figsize=(10, 6),
    ylabel="Number of Transactions",
    title="Transaction Frequency by Category",
)
plt.show()

The 'category' column contains 14 unique values representing transaction types. Based on the names:

1. **Online Transactions:** Categories such as 'shopping_net', 'misc_net', and 'grocery_net' include 'net', likely indicating online transactions.
2. **In-Person Transactions:** Categories such as 'grocery_pos', 'shopping_pos', 'misc_pos', and 'gas_transport' include 'pos' or are typical in-person transactions (e.g., gas purchases).
3. **Unknown or Mixed:** Categories like 'home', 'kids_pets', 'entertainment', 'food_dining', 'personal_care', 'health_fitness', and 'travel' lack clear indicators and could represent either online or in-person transactions.

Additional research on Kaggle yielded no further details about the column's interpretation. This column will not be adjusted further, but ideas have been generated for potential new columns to be added during preprocessing.

### New Column Ideas
**Online/In-Person/Mixed Indicator**
- Name: 'transaction_type'
- Type: Category
- Values: 0: Unknown/Mixed, 1: Online, 2: In-Person
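
A rough sketch of how the proposed codes could be derived from the category names, assuming the '_net' and '_pos' suffixes mean what they appear to mean. The function name `transaction_type` is hypothetical, and the rule for 'gas_transport' is my assumption from the grouping above.

```python
import pandas as pd


def transaction_type(category: str) -> int:
    """Map a category name to 0 (unknown/mixed), 1 (online), or 2 (in-person)."""
    if category.endswith("_net"):
        return 1
    if category.endswith("_pos") or category == "gas_transport":
        return 2
    return 0


# toy rows using category names that appear in the dataset
toy = pd.DataFrame(
    {"category": ["shopping_net", "grocery_pos", "gas_transport", "travel"]}
)
toy["transaction_type"] = toy["category"].map(transaction_type).astype("category")
print(toy)
```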

**Distance From Transaction to Card Holder**
- Distance between the transaction's latitude/longitude and the user's city/state coordinates
- Name: 'transactional_distance'
- Type: Float

**Merchant Distance**
- Distance between the merchant's latitude/longitude (merch_long, merch_lat) and the user's city/state.
- Name: 'merch_distance'
- Type: Float

Processing resources and time will determine if columns are added. This will be researched further during feature creation. 
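
Both distance features need a distance between two latitude/longitude pairs. One common choice is the haversine (great-circle) formula, sketched below with NumPy so it works on whole columns at once. The function name `haversine_km`, the Earth radius constant, and the toy coordinates are assumptions for illustration, not a decision about the final method.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; inputs are degrees (scalars or arrays)."""
    lat1, lon1 = np.radians(lat1), np.radians(lon1)
    lat2, lon2 = np.radians(lat2), np.radians(lon2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# toy coordinates (made up): one degree of latitude apart, then identical points
cust_lat = np.array([0.0, 40.0])
cust_long = np.array([0.0, -75.0])
merch_lat = np.array([1.0, 40.0])
merch_long = np.array([0.0, -75.0])

merch_distance = haversine_km(cust_lat, cust_long, merch_lat, merch_long)
print(merch_distance)
```

Because the function is vectorized, it can take DataFrame columns directly, which matters for processing time on a dataset of this size.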

### Location Columns - 'lat', 'long', 'merch_lat', 'merch_long', 'city', 'state'

In [None]:
# the shapefiles used for this process were downloaded from
# https://www.sciencebase.gov/catalog/item/5d150464e4b0941bde5b7654
map_file = r"/map/US_Map_3"

In [None]:
# checking the ranges
lat_long_columns = ["lat", "long", "merch_lat", "merch_long"]
df[lat_long_columns].describe()

In [None]:
# seeing the min and max values for each column
for col in lat_long_columns:
    print(f"Smallest value {col} = {df[col].min()}")
    print(f"Largest value {col} = {df[col].max()}")

In [None]:
# plotting the transaction locations using 'long' and 'lat' columns
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long, df.lat))
# named us_map to avoid shadowing Python's built-in map()
us_map = gpd.read_file(map_file)
ax = us_map.plot(color="lightgrey", edgecolor="black", figsize=(15, 15))
plt.xlabel("Longitude")
plt.ylabel("Latitude")
gdf.plot(ax=ax, color="red", markersize=1)
plt.xlim(-170, -65)
ax.set_title("Transaction Locations", fontsize=16)
plt.show()

In [None]:
# plotting merchant locations using 'merch_long' and 'merch_lat' columns
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.merch_long, df.merch_lat))
# named us_map to avoid shadowing Python's built-in map()
us_map = gpd.read_file(map_file)
ax = us_map.plot(color="lightgrey", edgecolor="black", figsize=(15, 15))
plt.xlabel("Longitude")
plt.ylabel("Latitude")
gdf.plot(ax=ax, color="blue", markersize=1)
plt.xlim(-170, -65)
ax.set_title("Merchant Locations", fontsize=16)
plt.show()

In [None]:
# referencing the unique values for these columns
df[lat_long_columns].nunique()

The data appears to have been anonymized or artificially generated, as indicated by the square clustering observed on the Merchant Locations map and the relatively low number of unique values in the 'lat' and 'long' columns. No adjustments will be made to these columns. 

In [None]:
# investigating customer location columns
cust_loc = ["city", "state"]
df[cust_loc].nunique()

In [None]:
# checking the state codes to see why there are 51 unique values
states = df["state"].unique()
states = states.tolist()
states.sort()
states

In [None]:
# calculating unique city, state pair
loc_pairs = df[["city", "state"]].drop_duplicates()
loc_pairs.shape[0]

- The 'state' column has 51 unique values: the 50 US states plus DC (Washington, DC).
- The 'city' column has 894 unique values.
- There are 928 unique city and state pairs representing customer locations, so some city names appear in more than one state. Using 'city' alone would merge those distinct locations and could suggest a false pattern.
- There are 983 unique customer credit card numbers, so some cities have multiple customers.
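
The city names that occur in more than one state can be listed directly. This is a small sketch on made-up rows; on the real data the same steps would run on `df[["city", "state"]]`.

```python
import pandas as pd

# toy customer locations (made up for illustration)
toy = pd.DataFrame(
    {
        "city": ["Springfield", "Springfield", "Portland", "Portland", "Austin", "Austin"],
        "state": ["IL", "MO", "OR", "ME", "TX", "TX"],
    }
)

# one row per distinct (city, state) pair
loc_pairs = toy[["city", "state"]].drop_duplicates()

# number of distinct states per city name; observed=True matters for category dtypes
states_per_city = loc_pairs.groupby("city", observed=True)["state"].nunique()
shared_names = states_per_city[states_per_city > 1]
print(shared_names)
```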

#### New Column Ideas
**Latitude and Longitude for Customer's City**
- Name: 'cust_lat' & 'cust_long'
- Type: Float

Create two new columns populated from the 'city' and 'state' pair values. Once they are created, 'city' and 'state' will be dropped. No further changes will be made to the 'state' column.
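
One possible way to populate these columns is a left merge against a lookup table keyed on the city and state pair. The lookup table `city_coords` below is hypothetical, with approximate coordinates typed in by hand; where the real coordinates will come from is still open.

```python
import pandas as pd

# hypothetical lookup table: one row per (city, state) pair, approximate coordinates
city_coords = pd.DataFrame(
    {
        "city": ["Springfield", "Springfield"],
        "state": ["IL", "MO"],
        "cust_lat": [39.80, 37.21],
        "cust_long": [-89.65, -93.29],
    }
)

# toy customer rows (made up); the last pair is deliberately missing from the lookup
toy = pd.DataFrame(
    {
        "cc_num": [1, 2, 3],
        "city": ["Springfield", "Springfield", "Nowhere"],
        "state": ["MO", "IL", "ZZ"],
    }
)

# validate="many_to_one" raises if the lookup has duplicate (city, state) keys
merged = toy.merge(city_coords, on=["city", "state"], how="left", validate="many_to_one")
unmatched = int(merged["cust_lat"].isna().sum())
print(merged)
print(f"{unmatched} row(s) without coordinates")
```

Merging on both keys, rather than 'city' alone, keeps same-named cities in different states apart.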

### Column - 'trans_date_trans_time'

In [None]:
# renaming 'trans_date_trans_time' column to something easier
df.rename(columns={"trans_date_trans_time": "trans_dt"}, inplace=True)
df.columns

In [None]:
# calculating date ranges
(df["trans_dt"].min(), df["trans_dt"].max())

In [None]:
# creating a quick and easy viz of transactions over time
# using bins=18 so there can be approximately 1 bin for each month
# this isn't accurately separated by month, using this for ideas
n, bins, patches = plt.hist(df["trans_dt"], bins=18)

labels = np.arange(1, 19)
midpoints = (bins[:-1] + bins[1:]) / 2
plt.xticks(ticks=midpoints, labels=labels)

plt.title("Transaction Volume by Month")
plt.ylabel("Number of Transactions")
plt.xlabel("Approx. Month")

plt.show()

From this plot, we can roughly see that transaction volume can vary from month to month. 
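
Since the 18 equal-width bins only approximate calendar months, exact monthly counts could be computed by converting each timestamp to a monthly period. A small sketch on made-up timestamps:

```python
import pandas as pd

# toy timestamps (made up for illustration)
toy_dt = pd.Series(
    pd.to_datetime(
        [
            "2019-01-01 08:00:00",
            "2019-01-15 12:30:00",
            "2019-02-03 09:45:00",
            "2020-06-21 17:20:00",
        ]
    )
)

# convert each timestamp to its calendar month, then count per month
monthly_counts = toy_dt.dt.to_period("M").value_counts().sort_index()
print(monthly_counts)
```

On the real column, `monthly_counts.plot(kind="bar")` would give one bar per calendar month.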

In [None]:
df["trans_dt"].head(10)

  
The 'trans_dt' column packs a lot of information into every value. The format is Year - Month - Day - Hour - Minute - Second. I want to split it into new columns, after which 'trans_dt' can be dropped.

### New Column Ideas
**Separate fields for Date, Year, Month, Day**
- Name: 'date', 'year', 'month', 'day'
- Type: Various
- To analyze specific transaction patterns by specific date and time components. 

**Week & Quarter**
- Name: 'week' & 'quarter'
- Type: Category
- To analyze seasonal patterns and time-related outliers. 
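
The date, week, and quarter columns listed so far can all be pulled from the pandas `.dt` accessor. A minimal sketch on made-up timestamps, using the proposed column names:

```python
import pandas as pd

# toy timestamps (made up for illustration)
toy = pd.DataFrame(
    {
        "trans_dt": pd.to_datetime(
            ["2019-01-01 00:05:00", "2019-12-31 23:59:59", "2020-06-21 12:00:00"]
        )
    }
)

dt = toy["trans_dt"].dt
toy["date"] = dt.date
toy["year"] = dt.year
toy["month"] = dt.month
toy["day"] = dt.day
# ISO week number; note that 2019-12-31 belongs to ISO week 1 of 2020
toy["week"] = dt.isocalendar().week.astype("category")
toy["quarter"] = dt.quarter.astype("category")
print(toy)
```

The ISO week edge case in the comment is worth keeping in mind if 'week' is ever combined with 'year'.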

**Running Transaction Averages and Counts**
- Name: 'trans_by_last_hr','trans_by_last_day', 'amt_last_hour', 'amt_last_day'
- Type: Float
- To track spending patterns for individual customers, enabling the detection of unusual spikes in transaction frequency or amount. 
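
A sketch of how the running counts and amounts might be computed with time-based rolling windows, one customer at a time. The helper `add_window_features` and the toy rows are hypothetical. Each window here includes the current transaction, which is an assumption that may need revisiting.

```python
import pandas as pd


def add_window_features(frame, window, count_col, amt_col):
    """Add per-customer transaction count and amount sum over a trailing time window."""
    parts = []
    # sort by time so each customer's rows are in chronological order
    for _, grp in frame.sort_values("trans_dt").groupby("cc_num"):
        win = grp.set_index("trans_dt")["amt"].rolling(window)
        parts.append(
            grp.assign(
                **{count_col: win.count().to_numpy(), amt_col: win.sum().to_numpy()}
            )
        )
    # restore the original row order
    return pd.concat(parts).sort_index()


# toy transactions (made up for illustration)
toy = pd.DataFrame(
    {
        "cc_num": [1, 2, 1, 1],
        "trans_dt": pd.to_datetime(
            ["2019-01-01 10:00", "2019-01-01 10:15", "2019-01-01 10:30", "2019-01-01 12:00"]
        ),
        "amt": [10.0, 100.0, 20.0, 5.0],
    }
)
toy = add_window_features(toy, "1h", "trans_by_last_hr", "amt_last_hour")
toy = add_window_features(toy, "1D", "trans_by_last_day", "amt_last_day")
print(toy)
```

A per-customer loop is easy to read but may be slow across 983 accounts and the full transaction history, so this is one of the places where processing time will need to be checked.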

### Column - 'cc_num'

In [None]:
# getting the number of transactions for each cc account
cc_counts = df.groupby("cc_num", observed=True).size()

In [None]:
# the number of unique accounts
df["cc_num"].nunique()

In [None]:
cc_counts.describe()

This dataset has records for 983 customers, and the number of transactions per customer ranges from 7 to 3,123.

In [None]:
# sorting the values
sorted_cc = cc_counts.sort_values()

In [None]:
# plotting the number of transactions from low to high to see
# if there are typical transaction rates
plt.figure(figsize=(12, 10))
plt.bar(range(len(sorted_cc)), sorted_cc)
plt.xlabel("CC Account")
plt.ylabel("Number of Transactions")
plt.title("Number of Transactions for Each Account", fontsize=16)

plt.show()

The 'cc_num' column reveals that transaction frequencies tend to cluster together. No changes will be made to the 'cc_num' column, as it is needed for identifying individual customers. Plans to use this column for creating new features, such as transaction patterns and spending behaviors, have already been outlined.

### Column - 'merchant'

In [None]:
# creating a variable of merchant names and finding the number of merchants
merchants = df["merchant"].unique()
len(merchants)

In [None]:
# finding the number of transactions for each merchant
merchant_count = df["merchant"].value_counts()
merchant_count.describe()

The number of transactions per merchant ranges from 727 to 4,403.

In [None]:
# calculating the total monetary amount per merchant, sorted from low to high
merchant_amount = df.groupby("merchant", observed=True)["amt"].sum()
merchant_amount = merchant_amount.sort_values()

In [None]:
# the ten merchants with the lowest amount
low = merchant_amount.head(10)
low

In [None]:
# the ten merchants with the highest amounts
high = merchant_amount.tail(10)
high

In [None]:
# box plot showing the range
plt.boxplot(merchant_amount)
plt.xlabel("Merchants")
plt.ylabel("Transaction Amount (USD)")
plt.title("Transaction Amount Distribution Across Merchants")
plt.show()

This plot shows that merchant transaction amounts can vary significantly from one merchant to the next, as does the number of transactions per merchant. This column will not be adjusted at this time. 

### Column - 'city_pop'

In [None]:
# checking unique values
pop_values = df["city_pop"].drop_duplicates()
pop_values.shape[0]

In [None]:
# checking unique city, state, city_pop values
pop_triad = df[["city", "state", "city_pop"]].drop_duplicates()
pop_triad.shape[0]

There are 879 unique city population values but 928 unique customer locations, so some cities have the same population value. 

In [None]:
# looking at the low and high city populations
pop_city = pop_triad.sort_values(by="city_pop")
pop_city

The city_pop column displays a wide variation, with values ranging from 23 to 2,906,700. Concerned about potential inaccuracies, I conducted further investigation. A cursory search for Notrees, Texas, found its population to be approximately 20 in 2009, aligning closely with the dataset. However, another search for West Bethel, ME, revealed a significant discrepancy, as its population was 2,504 according to [Wikipedia](https://en.wikipedia.org/wiki/Bethel,_Maine). Similar inconsistencies were found for other cities. Given the unreliable nature of the city_pop data, I have decided to drop this column from the analysis.

### Column - 'job'

In [None]:
# checking how many unique values are in the 'job' column
df["job"] = df["job"].str.lower().str.strip()
df["job"].nunique()

In [None]:
# reducing data down to one transaction per account and job title
jobs_db = df[["cc_num", "job"]].drop_duplicates()
jobs_db = jobs_db.sort_values(by="job")
jobs_db

In [None]:
# seeing how many customers have the same job title
job_count = jobs_db["job"].value_counts()
job_count.describe()

In [None]:
# each job title
jobs = jobs_db["job"].drop_duplicates()
len(jobs)

In [None]:
# checking how many job titles have similar names using RapidFuzz
similar_jobs = []

for i, job in enumerate(jobs):
    matches = process.extract(
        job, jobs[i + 1 :], scorer=fuzz.token_sort_ratio, limit=10
    )
    for match in matches:
        if match[1] >= 80:  # has similarity score of at least 80
            similar_jobs.append((job, match[0], match[1]))

len(similar_jobs)  # seeing how many matches

In [None]:
# looking at the matching jobs
similar_jobs

Although all of these pairs score as similar, pairs scoring below 95 appear to be genuinely different jobs, while pairs scoring 95 or higher appear to be the same job written slightly differently. I will raise the similarity threshold to 95.

In [None]:
# adjusting the similarity score
similar_jobs = []

for i, job in enumerate(jobs):
    matches = process.extract(
        job, jobs[i + 1 :], scorer=fuzz.token_sort_ratio, limit=10
    )
    for match in matches:
        if match[1] >= 95:  # has similarity score of at least 95
            similar_jobs.append((job, match[0], match[1]))

len(similar_jobs)  # seeing how many matches

In [None]:
similar_jobs

In [None]:
job_count

The dataset has 494 unique job titles, with each title appearing one to six times. After sorting the column values, I noticed similar but slightly different job titles. I ran RapidFuzz with a similarity threshold of 80 to identify potential matches, but this resulted in overly broad groupings. To improve accuracy, I changed the similarity threshold to 95, which reduced the matches from 144 to 81. During the preprocessing phase, I plan to replace the less common variations of job titles with the most frequently used version for standardization.
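
A rough sketch of that replacement step, assuming the matched pairs come in the same `(title_a, title_b, score)` form as `similar_jobs` and that per-title frequencies are available. The names `toy_pairs` and `toy_counts`, the titles, and the score are all made up for illustration.

```python
import pandas as pd

# toy stand-ins for the matched pairs and the title frequencies (values made up)
toy_pairs = [("surveyor, minerals", "surveyor, mineral", 97.0)]
toy_counts = pd.Series({"surveyor, minerals": 5, "surveyor, mineral": 2, "teacher": 3})

# for each pair, keep the more frequent title and map the other one onto it
replacements = {}
for job_a, job_b, _score in toy_pairs:
    if toy_counts[job_a] >= toy_counts[job_b]:
        keep, drop = job_a, job_b
    else:
        keep, drop = job_b, job_a
    replacements[drop] = keep

toy = pd.DataFrame({"job": ["surveyor, mineral", "teacher", "surveyor, minerals"]})
toy["job"] = toy["job"].replace(replacements)
print(toy["job"].tolist())
```

This simple mapping does not resolve chains (A matched to B, B matched to C), so the 81 matched pairs should be checked for overlaps before it is applied.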

### Column - 'dob'

In [None]:
# checking the range of values for date of birth
df["dob"].describe()

One visible issue is that the maximum date of birth is 2005-01-29. Given the dataset's date range (January 2019 to June 2020), that cardholder would have been roughly 13 to 15 years old at the time of the transactions. This could be an indicator of fraud or an incorrect value.

In [None]:
# grouping and sorting the birthday values for plotting
birthdays = df[["cc_num", "dob"]].drop_duplicates()
sorted_dob = birthdays.sort_values(by="dob")

In [None]:
# creating a year and decade column for easier plotting
birthdays["year"] = birthdays["dob"].dt.year
birthdays["decade"] = (birthdays["year"] // 10) * 10

In [None]:
# grouping every ten years together
decade_counts = birthdays.groupby("decade").size()

In [None]:
decade_counts.plot(kind="bar")
plt.title("Number of Birthdays by Decade")
plt.xlabel("Decade")
plt.ylabel("Count")
plt.show()

This plot highlights that the majority of account holders were born between the 1960s and 1990s. Birthdays at either extreme could indicate potential fraud, especially for account holders under 18 at the time of the transaction.

**New Column Idea**
- Name: 'age_at_trans'
- Type: Float
- This column will have the customer's age (in years) at the time of the transaction to see if it is an indicator of fraud.
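
A minimal sketch of the age calculation, assuming both columns are already datetime. Dividing the day difference by 365.25 is an approximation of age in years, which should be close enough for a float feature. The toy rows pair the youngest date of birth with the first and last dates in the dataset.

```python
import pandas as pd

# toy rows: youngest dob in the data, paired with the first and last transaction dates
toy = pd.DataFrame(
    {
        "trans_dt": pd.to_datetime(["2019-01-01", "2020-06-21"]),
        "dob": pd.to_datetime(["2005-01-29", "2005-01-29"]),
    }
)

# approximate age in years at the time of each transaction
toy["age_at_trans"] = (toy["trans_dt"] - toy["dob"]).dt.days / 365.25
print(toy["age_at_trans"].round(2).tolist())
```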


---
### Summary

I explored the dataset to understand its structure and quality and to generate preprocessing and feature engineering ideas. Key takeaways include:

---

#### **Data Cleaning and Preparation**

1. **Column Renaming**

- 'trans_date_trans_time' will be renamed to 'trans_dt'.

  
2. **Columns to Drop**  
- **Before feature creation**: 'unnamed', 'first', 'last', 'gender', 'street', 'zip', 'merch_zipcode', 'unix_time', and 'city_pop'. 
- **After creating features**: 'city', 'state', 'trans_dt', and 'dob'. 

  
3. **Data Type Adjustments**:  
- Convert to datetime: 'trans_dt', 'dob'.
- Convert to category: 'merchant', 'category', 'city', 'state', 'job', 'is_fraud'.

  
4. **Duplicates**:  
- No duplicate rows were found in the dataset.


---
  
#### **Takeaways**

1. **Fraudulent Transactions**  
- Less than 1% of transactions are fraudulent but account for over 4% of the transaction amount.
- 'is_fraud' column will not be adjusted.

2. **Transaction Amounts (amt)**:  
- Range: \$1.00 to \$28,948.90.
- The column contains outliers at the higher end, but no adjustments will be made. 

3. **Transaction Categories**:  
- 14 unique categories with potential to classify transactions as online, in-person, or mixed based on naming conventions.
- No further changes to the column, but a new "transaction_type" column will be added.

4. **Location Data**:  
- Latitude and longitude values for both transactions and merchants suggest anonymization due to square clustering or a low number of unique values.
- **Range of values**:
    - **Transaction** latitudes: 20.0271 to 66.6933, longitudes: -165.6723 to -67.9503.
    - **Merchant** latitudes: 19.027785 to 67.510267, longitudes: -166.671242 to -66.950902.
- Columns remain unchanged, but new features such as 'merch_distance' will be added.

5. **City Population (city_pop)**:  
- Values range from 23 to 2,906,700 across 879 unique entries. However, several inconsistencies were identified during validation.
- The column will be dropped due to unreliability.

6. **Job Titles (job)**:  
- 494 unique job titles with inconsistencies in naming conventions.
- During preprocessing, less common variations will be replaced with their most frequently used counterparts using a similarity threshold of 95 in RapidFuzz.

7. **Birth Dates (dob)**:  
- Majority of account holders were born between the 1960s and 1990s.
- Extreme birth dates could indicate potential fraud, especially for account holders under 18 at the time of the transaction.

8. **Credit Card Numbers (cc_num)**:  
- Range: 7 to 3,123 transactions per customer.
- No adjustments planned, as this column is needed for identifying patterns in spending behaviors.

---

#### **Potential New Features**
- **Transaction Features**: 'transaction_type', 'trans_by_last_hr', 'trans_by_last_day', 'amt_last_hour', 'amt_last_day'
- **Location Features**: 'merch_distance', 'cust_lat', 'cust_long'.
- **Time Features**: 'date', 'year', 'month', 'day', 'week', 'quarter'.
- **Age Feature**: "age_at_trans", representing the customer's age at the time of the transaction (float).

---

#### **Transaction Timeline**
- **Data spans**: January 1, 2019, to June 21, 2020.
- **Counts**:
  - Customers: 983.
  - Merchants: 693.
- **Transaction frequency**:
  - Customers: 7 to 3,123 transactions.
  - Merchants: 727 to 4,403 transactions.

This comprehensive understanding of the dataset lays the foundation for effective preprocessing, feature engineering, and model development in subsequent phases.


### Sources

Choksi, P. (n.d.). Credit card transactions dataset: Using transactional data for financial analysis and fraud detection. Kaggle. Retrieved November 21, 2024, from https://www.kaggle.com/datasets/priyamchoksi/credit-card-transactions-dataset

National Atlas of the United States. (2014). 1:1,000,000-Scale State Boundaries of the United States [Vector digital data]. Rolla, MO: National Atlas of the United States. Retrieved from https://www.sciencebase.gov/catalog/item/581d052de4b08da350d524e5

Wikipedia contributors. (n.d.). Bethel, Maine. In Wikipedia. Retrieved December 6, 2024, from https://en.wikipedia.org/wiki/Bethel,_Maine