# 1. Problem Context
## 1.1 The Existing Problem
Currently, the retailer simply groups their international customers by country. As you'll see in the project, this is quite inefficient because:
* There's a large number of countries (which kind of defeats the purpose of creating groups).
* Some countries have very few customers.
* This approach treats large and small customers the same, regardless of their purchase patterns.
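The imbalance between countries can be checked with a quick count of unique customers per country. The following is a minimal sketch on a small made-up DataFrame; the values are illustrative and not taken from the retailer's data, and only the `CustomerID` and `Country` column names come from the dataset.

```python
import pandas as pd

# Toy transaction-level data; values are made up for illustration only
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3, 3, 4],
    'Country': ['France', 'France', 'France', 'Germany', 'Germany', 'Malta'],
})

# Number of distinct customers in each country, largest first
customers_per_country = (
    toy.groupby('Country')['CustomerID']
       .nunique()
       .sort_values(ascending=False)
)
print(customers_per_country)
```

On the real data, a long tail of countries with only one or two customers would support the points listed above.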

## 1.2 Proposed Solution
The retailer needs help creating customer clusters, a.k.a. **"customer segments"**, through a data-driven approach.
* They've provided us with a dataset of past purchase data at the transaction level.
* Our task is to build a clustering model using that dataset.
* Our clustering model should factor in both aggregate sales patterns and specific items purchased.


# 2. Technical Specifications and Aspects
## 2.1 Data Overview
For this project:
* The dataset has 35,116 observations of past international transactions.
* The observations span 37 different countries.
* **Note:** There is no target variable.

We have the following features:

Invoice information
* 'InvoiceNo' – Unique ID for invoice
* 'InvoiceDate' – Invoice date

Item information
* 'StockCode' – Unique ID for item
* 'Description' – Text description for item
* 'Quantity' – Units per pack for item
* 'UnitPrice' – Price per unit in GBP

Customer information
* 'CustomerID' – Unique ID for customer
* 'Country' – Country of customer

## 2.2 Type of ML Problem
This is an unsupervised learning task: given the features of each transaction, we need to segment the customers based on their buying patterns.
* It is important to note that the given data is transaction-level, while the clusters (or segments) we need to create are customer-level.
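Moving from transaction-level rows to one row per customer can be sketched with a pandas `groupby`. This is only an illustration on made-up rows, not the notebook's actual feature engineering: the `Sales` column (computed as `Quantity * UnitPrice`) and the output names `total_transactions` and `total_sales` are assumptions introduced here.

```python
import pandas as pd

# Toy transaction-level rows; values are made up for illustration only
tx = pd.DataFrame({
    'CustomerID': [12583, 12583, 12583, 12600],
    'InvoiceNo': [536370, 536370, 536371, 536400],
    'Quantity': [24, 12, 10, 5],
    'UnitPrice': [3.75, 0.85, 1.00, 2.00],
})

# Assumption: the sales value of a row is Quantity * UnitPrice
tx['Sales'] = tx['Quantity'] * tx['UnitPrice']

# Collapse to one row per customer (hypothetical feature names)
customer_df = tx.groupby('CustomerID').agg(
    total_transactions=('InvoiceNo', 'nunique'),  # distinct invoices
    total_sales=('Sales', 'sum'),                 # summed sales value
)
print(customer_df)
```

Each row of `customer_df` now describes a customer rather than a single purchase line, which is the level at which clusters are needed.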


# 3. Exploratory Data Analysis

Import the required libraries, with a comment describing why each one is used.

In [7]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd
pd.set_option('display.max_columns', 100)

# Visualizing missing data more effectively
import missingno as msno

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline 

# Seaborn for easier visualization
import seaborn as sns

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# StandardScaler from Scikit-Learn
from sklearn.preprocessing import StandardScaler

# PCA from Scikit-Learn
from sklearn.decomposition import PCA

# Scikit-Learn's KMeans algorithm
from sklearn.cluster import KMeans

# Adjusted Rand index
from sklearn.metrics import adjusted_rand_score

## 3.1 Loading the dataset

In [8]:
df = pd.read_csv('C:/Users/satvi/OneDrive/Desktop/Customer_segmentation_project2/int_online_tx.csv')

Load international online transactions data from CSV

In [9]:
df.shape

(35116, 8)

In [10]:
df.head(10)

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536370 | 22728 | ALARM CLOCK BAKELIKE PINK | 24 | 12-01-2022 08:45 | 3.75 | 12583.0 | France |
| 1 | 536370 | 22727 | ALARM CLOCK BAKELIKE RED | 24 | 12-01-2022 08:45 | 3.75 | 12583.0 | France |
| 2 | 536370 | 22726 | ALARM CLOCK BAKELIKE GREEN | 12 | 12-01-2022 08:45 | 3.75 | 12583.0 | France |
| 3 | 536370 | 21724 | PANDA AND BUNNIES STICKER SHEET | 12 | 12-01-2022 08:45 | 0.85 | 12583.0 | France |
| 4 | 536370 | 21883 | STARS GIFT TAPE | 24 | 12-01-2022 08:45 | 0.65 | 12583.0 | France |
| 5 | 536370 | 22002 | INFLATABLE POLITICAL GLOBE | 48 | 12-01-2022 08:45 | 0.85 | 12583.0 | France |
| 6 | 536370 | 21791 | VINTAGE HEADS AND TAILS CARD GAME | 24 | 12-01-2022 08:45 | 1.25 | 12583.0 | France |
| 7 | 536370 | 22235 | SET/2 RED RETROSPOT TEA TOWELS | 18 | 12-01-2022 08:45 | 2.95 | 12583.0 | France |
| 8 | 536370 | 22326 | ROUND SNACK BOXES SET OF4 WOODLAND | 24 | 12-01-2022 08:45 | 2.95 | 12583.0 | France |
| 9 | 536370 | 22629 | SPACEBOY LUNCH BOX | 24 | 12-01-2022 08:45 | 1.95 | 12583.0 | France |


Print the dimensions and the first 10 rows of the dataset.