# **Methodology**

#### This project takes an exploratory approach. The dataset lends itself to exploration because it records the behaviour of customers from various locations around the world.

* Data Collection - The dataset has been sourced from the Kaggle website and is named ecommerce_transactions.

* Method - The dataset has been downloaded to a local computer as a .csv file.

#### Data Cleaning - The dataset will be cleaned with pandas methods such as isnull() and dropna(), and examined in more depth with info() and describe(). It will also be modified where this supports deeper insights. The dataset will be balanced at a later stage, when fraud detection is performed.

* Tools - I will mainly use pandas and, to a lesser extent, NumPy.

* Purpose - To ensure that the data is fit for purpose before insights are derived.
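
As a minimal sketch of the cleaning checks named above, the snippet below runs isnull() and dropna() on a small made-up DataFrame. The column names mirror the dataset, but the values (including the missing Purchase_Amount) are illustrative assumptions, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Illustrative sample only: the third row has a missing Purchase_Amount
# so that isnull() and dropna() have something to report.
sample_df = pd.DataFrame({
    "Transaction_ID": [33554, 9428, 200],
    "Country": ["Japan", "Germany", "UK"],
    "Purchase_Amount": [579.51, 78.18, np.nan],
})

null_counts = sample_df.isnull().sum()  # number of missing values per column
cleaned_df = sample_df.dropna()         # drop rows that contain any missing value

print(null_counts)
print(cleaned_df)
```

Here dropna() removes whole rows, which is the simplest option; filling missing values instead would be a different design choice.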

#### Exploratory Data Analysis (EDA) - The aim is to derive insights from the data in textual and visual formats.

* Tools - Matplotlib, Plotly, Seaborn, and Scikit-learn.

* Justification - To sharpen the analysis and test a number of hypotheses.

#### Outcomes

* Relate outcomes to business requirements and hypotheses.



# **Clean and Modify Dataset**

## Objectives

* The objective now is to prepare the dataset for analysis so that insights can be derived from its contents. First, information about the data will be shown and its spread will be described. Then duplicate rows and null values will be removed.

## Inputs

* The ecommerce_transactions_resized dataset will be loaded into the Jupyter notebook.

## Outputs
 
* There will be information about the dataset's contents and whether null and duplicate values exist.

## Additional Comments

* On completion, the dataset will be saved under a new name in the cleaned_data folder.



---

# Change working directory

* We assume that the notebooks are stored in a subfolder, so the working directory needs to be changed when a notebook is run in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\rayaf\\OneDrive\\Documents\\global-store\\online_store\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\rayaf\\OneDrive\\Documents\\global-store\\online_store'
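
Re-running the cell that calls os.chdir() moves up one more level each time, which can leave the notebook outside the project folder. One possible guard, sketched below, only changes directory when the current folder is the notebooks subfolder. The helper name move_to_parent and its default folder name are assumptions made for illustration.

```python
import os
from pathlib import Path


def move_to_parent(folder_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent directory only if the current folder is `folder_name`.

    Hypothetical helper: calling it a second time leaves the directory unchanged.
    """
    cwd = Path.cwd()
    if cwd.name == folder_name:
        os.chdir(cwd.parent)
    return Path.cwd()


current_dir = move_to_parent()
print(current_dir)
```

Because the check is on the folder name, the cell can be run repeatedly without drifting further up the directory tree.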

In [4]:
import pandas as pd

# Stage 1

### Loading of the ecommerce_transactions_resized dataset

In [None]:
sized_df = pd.read_csv('../online_store/data/cleaned_data/ecommerce_transactions_resized.csv') # Load the resized dataset
sized_df.head() # Display the first few rows of the DataFrame

| | Transaction_ID | User_Name | Age | Country | Product_Category | Purchase_Amount | Payment_Method | Transaction_Date |
|---|---|---|---|---|---|---|---|---|
| 0 | 33554 | Isabella Lewis | 24 | Japan | Toys | 579.51 | Cash on Delivery | 2024-01-16 |
| 1 | 9428 | Elijah Rodriguez | 52 | Germany | Electronics | 78.18 | PayPal | 2023-04-19 |
| 2 | 200 | Ava Hall | 62 | UK | Toys | 713.08 | Debit Card | 2024-03-05 |
| 3 | 12448 | Ava Allen | 63 | Brazil | Grocery | 474.14 | Credit Card | 2024-12-01 |
| 4 | 39490 | Emma Lewis | 52 | USA | Home & Kitchen | 266.15 | Debit Card | 2024-01-19 |


---

# Stage 2

### Obtaining information about the dataset

In [None]:
sized_df.info() # Display the DataFrame's structure and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Transaction_ID    10000 non-null  int64  
 1   User_Name         10000 non-null  object 
 2   Age               10000 non-null  int64  
 3   Country           10000 non-null  object 
 4   Product_Category  10000 non-null  object 
 5   Purchase_Amount   10000 non-null  float64
 6   Payment_Method    10000 non-null  object 
 7   Transaction_Date  10000 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 625.1+ KB


#### The output above shows that the DataFrame has 10,000 rows and 8 columns, with no missing values in any column.

# Stage 3

### Obtaining a description about the dataset

In [None]:
sized_df.describe() # Display summary statistics of the DataFrame

| | Transaction_ID | Age | Purchase_Amount |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 24833.8751 | 43.7669 | 498.417247 |
| std | 14404.169374 | 15.367628 | 287.676742 |
| min | 5.0 | 18.0 | 5.1 |
| 25% | 12456.5 | 30.0 | 248.1325 |
| 50% | 24895.5 | 43.0 | 497.12 |
| 75% | 37151.25 | 57.0 | 747.555 |
| max | 49993.0 | 70.0 | 999.98 |


#### The output above summarises the central tendency (mean and median) and the spread (standard deviation, quartiles, minimum, and maximum) of the numeric columns.
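
By default, describe() covers only the numeric columns. A possible complement, assuming the text columns are also of interest, is to exclude the numeric columns so that pandas reports the count, number of unique values, most frequent value, and its frequency instead. The sketch below uses the five rows from the dataset preview rather than the full file.

```python
import pandas as pd

# Five rows copied from the dataset preview shown in Stage 1.
sample_df = pd.DataFrame({
    "Age": [24, 52, 62, 63, 52],
    "Country": ["Japan", "Germany", "UK", "Brazil", "USA"],
    "Product_Category": ["Toys", "Electronics", "Toys", "Grocery", "Home & Kitchen"],
    "Payment_Method": ["Cash on Delivery", "PayPal", "Debit Card", "Credit Card", "Debit Card"],
})

# Excluding numeric columns leaves only the text columns in the summary.
text_summary = sample_df.describe(exclude=["number"])
print(text_summary)
```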

# Stage 4

### Identifying null values within the dataset

In [None]:
sized_df.isnull().sum() # Check for null values in the DataFrame

Transaction_ID      0
User_Name           0
Age                 0
Country             0
Product_Category    0
Purchase_Amount     0
Payment_Method      0
Transaction_Date    0
dtype: int64

#### The output above shows that there are no null values in the dataset.

# Stage 5

### Removing any duplicate rows from the dataset

In [9]:
sized_df.drop_duplicates(inplace=True) # Remove duplicate rows from the DataFrame
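
drop_duplicates() removes rows silently, so on its own it does not show whether any duplicates existed. A minimal sketch of counting them first is shown below; the sample rows are illustrative, with one row repeated on purpose.

```python
import pandas as pd

# Illustrative sample: the second and third rows are identical.
sample_df = pd.DataFrame({
    "Transaction_ID": [33554, 9428, 9428, 200],
    "Country": ["Japan", "Germany", "Germany", "UK"],
    "Purchase_Amount": [579.51, 78.18, 78.18, 713.08],
})

# duplicated() marks every repeat of an earlier row as True.
duplicate_count = int(sample_df.duplicated().sum())
deduped_df = sample_df.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows found: {duplicate_count}")
print(f"Rows remaining: {len(deduped_df)}")
```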

# Stage 6

### Reformatting the date to datetime format

In [10]:
sized_df['Transaction_Date'] = pd.to_datetime(sized_df['Transaction_Date'], format='%Y-%m-%d') # Convert 'Transaction_Date' (stored as YYYY-MM-DD) to datetime format
sized_df['Year'] = sized_df['Transaction_Date'].dt.year # Extract year from 'Transaction_Date'
sized_df['Month'] = sized_df['Transaction_Date'].dt.month # Extract month from 'Transaction_Date'
sized_df['Day'] = sized_df['Transaction_Date'].dt.day # Extract day from 'Transaction_Date'
sized_df.head() # Display the first few rows of the DataFrame after modifications


| | Transaction_ID | User_Name | Age | Country | Product_Category | Purchase_Amount | Payment_Method | Transaction_Date | Year | Month | Day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33554 | Isabella Lewis | 24 | Japan | Toys | 579.51 | Cash on Delivery | 2024-01-16 | 2024 | 1 | 16 |
| 1 | 9428 | Elijah Rodriguez | 52 | Germany | Electronics | 78.18 | PayPal | 2023-04-19 | 2023 | 4 | 19 |
| 2 | 200 | Ava Hall | 62 | UK | Toys | 713.08 | Debit Card | 2024-03-05 | 2024 | 3 | 5 |
| 3 | 12448 | Ava Allen | 63 | Brazil | Grocery | 474.14 | Credit Card | 2024-12-01 | 2024 | 12 | 1 |
| 4 | 39490 | Emma Lewis | 52 | USA | Home & Kitchen | 266.15 | Debit Card | 2024-01-19 | 2024 | 1 | 19 |


#### The DataFrame above shows that Transaction_Date has been converted to datetime format and split into separate Year, Month, and Day columns.
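
Since the dates in the preview are written as YYYY-MM-DD, one option is to pass that format explicitly and count any values that fail to parse. The sketch below assumes this approach; the series is illustrative and its last entry is deliberately invalid.

```python
import pandas as pd

# Two dates from the dataset preview plus one made-up invalid entry.
raw_dates = pd.Series(["2024-01-16", "2023-04-19", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
invalid_count = int(parsed.isna().sum())

print(parsed)
print(f"Unparseable dates: {invalid_count}")
```

Counting the NaT values afterwards makes it visible when a date did not match the expected format, rather than leaving it unnoticed.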

# Stage 7

### Saving the cleaned and modified dataset to the cleaned data folder

In [11]:
sized_df.to_csv('../online_store/data/cleaned_data/ecommerce_transactions_cleaned.csv', index=False) # Save the cleaned dataset to a new CSV file
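
A CSV file stores dates as plain text, so the datetime type is not preserved when the cleaned file is read back. The sketch below shows one way to restore it on load with parse_dates. It writes a small illustrative DataFrame to a temporary folder rather than to the project's cleaned_data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows from the dataset preview, with the date already converted.
sample_df = pd.DataFrame({
    "Transaction_ID": [33554, 9428],
    "Purchase_Amount": [579.51, 78.18],
    "Transaction_Date": pd.to_datetime(["2024-01-16", "2023-04-19"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample_cleaned.csv"  # hypothetical file name
    sample_df.to_csv(csv_path, index=False)

    plain = pd.read_csv(csv_path)  # dates come back as text
    reloaded = pd.read_csv(csv_path, parse_dates=["Transaction_Date"])  # dates restored

print(plain["Transaction_Date"].dtype)
print(reloaded["Transaction_Date"].dtype)
```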