# Online Retail Dataset

### Overview

**Dataset**: Online Retail Dataset

- **Source**: UCI Machine Learning Repository
- **Link**: [Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/online+retail)

**Explanation**:

- **Size and Features**: Contains over 500,000 transaction rows with features such as Quantity, UnitPrice, and CustomerID.
- **Suitability**: Suited to clustering (customer segmentation), anomaly detection (for example, flagging unusual orders as potential fraud), and association rule mining (market basket analysis).
- **Reason**: The dataset is large and mixes numeric, categorical, and date features, which makes it a practical testbed for several unsupervised learning tasks.


# Step 1: Data Acquisition

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Define the URL to the CSV file
url = 'https://archive.ics.uci.edu/static/public/352/data.csv'

# Load the CSV file into a DataFrame
df = pd.read_csv(url)

In [None]:
print(df.head(20))

# Step 2: Data Cleaning

## 1. Identifying and Handling Missing Values

In [None]:
# Check for missing values

missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

In [None]:
# Handling missing values: drop rows with a missing CustomerID or Description

df = df.dropna(subset=['CustomerID', 'Description'])
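
Dropping rows is one of two common ways to handle the gaps; the other is to keep the rows and fill in a placeholder. The sketch below contrasts the two on a toy frame. The values and the `'Unknown'` placeholder are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Description': ['MUG', None, 'LAMP'],
    'CustomerID': [17850.0, 13047.0, np.nan],
})

# Option A: drop any row missing CustomerID or Description
dropped = toy.dropna(subset=['CustomerID', 'Description'])

# Option B: keep every row and fill missing descriptions with a placeholder
filled = toy.assign(Description=toy['Description'].fillna('Unknown'))

print(f"Rows after dropping: {len(dropped)}")
print(f"Rows after filling:  {len(filled)}")
print(filled['Description'].tolist())
```

Dropping is the safer choice for CustomerID, because a made-up identifier would group unrelated customers together during segmentation.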

In [None]:
# Verify that no missing values remain after cleaning

missing_values = df.isnull().sum()
print("Missing values per column:\n", missing_values)

## 2. Identifying and Handling Duplicate Values

In [None]:
# Check for duplicate entries

duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")


In [None]:
duplicate_rows = df[df.duplicated(keep=False)]
print("Duplicate rows:")
print(duplicate_rows)
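
The cells above only count and display duplicates. If exact duplicates are judged to be data-entry repeats rather than legitimate repeated line items (an assumption that should be checked against the data), they could be removed with `drop_duplicates`. A minimal sketch on invented rows:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'InvoiceNo': ['536365', '536365', '536366'],
    'StockCode': ['85123A', '85123A', '22633'],
    'Quantity': [6, 6, 6],
})

n_before = len(toy)

# Keep the first occurrence of each fully identical row
deduped = toy.drop_duplicates(keep='first').reset_index(drop=True)

print(f"Removed {n_before - len(deduped)} duplicate rows")
```

On the notebook's DataFrame the equivalent call would be `df = df.drop_duplicates()`.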


## 3. Checking for Data Type Inconsistencies

In [None]:
print(df.dtypes)

In [None]:
# Convert 'InvoiceDate' to datetime type
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')

# Convert 'CustomerID' to string (since it's more of an identifier)
df['CustomerID'] = df['CustomerID'].astype('Int64').astype(str)

# Convert 'InvoiceNo', 'StockCode', and 'Country' to categorical type
df['InvoiceNo'] = df['InvoiceNo'].astype('category')
df['StockCode'] = df['StockCode'].astype('category')
df['Country'] = df['Country'].astype('category')
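
To see what these conversions do, the sketch below applies the same two calls to a toy frame. The rows are invented; only the column names match the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'InvoiceDate': ['12/1/2010 8:26', 'not a date'],
    'CustomerID': [17850.0, 13047.0],
})

# errors='coerce' turns unparseable strings into NaT instead of raising
dates = pd.to_datetime(toy['InvoiceDate'], errors='coerce')

# float -> nullable integer -> string removes the trailing ".0"
ids = toy['CustomerID'].astype('Int64').astype(str)

print(dates.isna().tolist())
print(ids.tolist())
```

Because `errors='coerce'` hides parsing failures, it is worth checking `df['InvoiceDate'].isna().sum()` after the conversion.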

In [None]:
print(df.dtypes)

In [None]:
print(df.describe())

## 4. Removing Negative Quantities and Outliers

In [None]:
df = df[df['Quantity'] >= 0]

In [None]:
print(df.describe())

In [None]:
# Negative quantities were already removed above; now drop very large orders.
# The cutoff of 1000 is an arbitrary threshold (adjust as needed).
df = df[df['Quantity'] <= 1000]

# Remove rows with UnitPrice equal to 0 or extremely high values (>1000)
df = df[(df['UnitPrice'] > 0) & (df['UnitPrice'] <= 1000)]

# Summary of cleaned data
cleaned_summary = df.describe()
print("Cleaned Data Summary:")
print(cleaned_summary)

print(f"\nDataset shape after cleaning: {df.shape}")
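
The fixed cutoffs of 1000 are a judgment call. One data-driven alternative is an upper fence derived from the interquartile range (IQR). The sketch below uses invented quantities, and the 1.5 multiplier is the conventional box-plot rule rather than anything prescribed by this notebook.

```python
import pandas as pd

# Toy quantities, invented for illustration; 5000 plays the role of an outlier
quantity = pd.Series([1, 2, 2, 3, 4, 5, 6, 8, 12, 5000])

# Interquartile range and the conventional 1.5 * IQR upper fence
q1, q3 = quantity.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr

# Keep non-negative values at or below the fence
kept = quantity[(quantity >= 0) & (quantity <= upper)]

print(f"Upper fence: {upper}")
print(f"Kept {len(kept)} of {len(quantity)} values")
```

For heavy-tailed retail data the IQR fence can be aggressive, so a high quantile such as `quantity.quantile(0.99)` is another reasonable option.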

In [None]:
# Recompute missing values on the cleaned DataFrame
print("Missing values in each column:")
print(df.isnull().sum())

In [None]:
# Check distribution of key features
print("Distribution of Quantity:")
print(df['Quantity'].describe())

print("\nDistribution of UnitPrice:")
print(df['UnitPrice'].describe())

## 5. Identifying Unique Values in Categorical columns

In [None]:
# Check unique values in categorical columns
print("Unique values in Description:")
print(df['Description'].unique()[:10])

print("\nNumber of unique CustomerID values:")
print(df['CustomerID'].nunique())

print("\nUnique values in Country:")
print(df['Country'].unique())


# Step 3: Feature Engineering

### 1. TotalPrice
- `Rationale`: TotalPrice captures the monetary value of each invoice line as Quantity × UnitPrice, which neither column conveys on its own.

In [None]:
# TotalPrice

df['TotalPrice'] = df['Quantity'] * df['UnitPrice']

### 2. TransactionMonth
- `Rationale`: Extracting the month from InvoiceDate can help in analyzing seasonal trends and patterns.

In [None]:
# Extracting month from InvoiceDate

df['TransactionMonth'] = df['InvoiceDate'].dt.month

### 3. IsDiscounted
- `Rationale`: A binary feature that flags rows whose UnitPrice falls below a fixed threshold. This is a rough proxy for discounting, since the dataset has no explicit discount field.

In [None]:
# Define a threshold for discount
discount_threshold = 2.0  

# Create IsDiscounted feature
df['IsDiscounted'] = df['UnitPrice'] < discount_threshold
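
A single global threshold marks every inexpensive product as discounted. A possible refinement, shown here as a sketch and not as part of the notebook's method, compares each row's price with the median price of the same product. The column name `BelowTypicalPrice` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'StockCode': ['A', 'A', 'A', 'B', 'B'],
    'UnitPrice': [2.5, 2.5, 1.9, 10.0, 10.0],
})

# Median price per product, broadcast back to every row of that product
typical = toy.groupby('StockCode')['UnitPrice'].transform('median')

# Flag rows priced below the product's own typical price
toy['BelowTypicalPrice'] = toy['UnitPrice'] < typical

print(toy)
```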

#### Summary of New Features:
  1. `TotalPrice`: The monetary value of each invoice line (Quantity × UnitPrice), useful for gauging how much each row contributes to revenue.
  2. `TransactionMonth`: Helps capture seasonal patterns and trends in transactions.
  3. `IsDiscounted`: Flags rows with a UnitPrice below the chosen threshold, a rough proxy for analyzing promotions and price sensitivity.

In [None]:
# Display the first few rows of the DataFrame with the new features

print(df[['TotalPrice', 'TransactionMonth', 'IsDiscounted']].head())

# Step 4: Data Visualization

### 1. Distribution of Features
- Visualization: Histograms for Quantity, TotalPrice, and UnitPrice.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetics
sns.set(style="whitegrid")

# Create a figure and axes
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot histograms
sns.histplot(df['Quantity'], kde=True, ax=axes[0])
axes[0].set_title('Distribution of Quantity')

sns.histplot(df['TotalPrice'], kde=True, ax=axes[1])
axes[1].set_title('Distribution of TotalPrice')

sns.histplot(df['UnitPrice'], kde=True, ax=axes[2])
axes[2].set_title('Distribution of UnitPrice')

plt.tight_layout()
plt.show()

In [None]:
# Calculate the correlation matrix
corr_matrix = df[['Quantity', 'TotalPrice', 'UnitPrice']].corr()

# Plot heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Heatmap of Feature Correlations')
plt.show()

In [None]:
# Plot pairplot
sns.pairplot(df[['Quantity', 'TotalPrice', 'UnitPrice']])
plt.suptitle('Pairwise Relationships Between Features', y=1.02)
plt.show()

In [None]:
# Create a figure and axes
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot box plots
sns.boxplot(x=df['Quantity'], ax=axes[0])
axes[0].set_title('Box Plot of Quantity')

sns.boxplot(x=df['TotalPrice'], ax=axes[1])
axes[1].set_title('Box Plot of TotalPrice')

sns.boxplot(x=df['UnitPrice'], ax=axes[2])
axes[2].set_title('Box Plot of UnitPrice')

plt.tight_layout()
plt.show()

In [None]:
from sklearn.decomposition import PCA

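# Perform PCA (with two input features and two components this rotates the data; it does not reduce dimensionality)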
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df[['TotalPrice', 'Quantity']])

# Create a DataFrame for PCA results
pca_df = pd.DataFrame(data=pca_result, columns=['PC1', 'PC2'])

# Plot PCA results
plt.figure(figsize=(10, 7))
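# Color points by the integer code of each country (codes are arbitrary, not ordinal)
plt.scatter(pca_df['PC1'], pca_df['PC2'], alpha=0.5, c=df['Country'].astype('category').cat.codes)
plt.colorbar(label='Country (category code)')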
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA - Cluster Visualization')
plt.show()

# Step 5: Unsupervised Learning Models

### 1. K-Means Clustering

#### Step 1: Train the Model

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

# Prepare data for clustering
X = df[['TotalPrice', 'Quantity']].values

# Train K-Means model (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans.fit(X)

# Assign cluster labels
df['KMeans_Cluster'] = kmeans.labels_
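
The choice of four clusters and the use of unscaled features are both assumptions worth testing. K-Means relies on Euclidean distance, so a feature with a larger numeric range dominates the result. The sketch below standardizes the features and compares silhouette scores over a range of `k`. It runs on synthetic blobs standing in for `[TotalPrice, Quantity]`, so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for X: three blobs whose two features live on different scales
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=[2.0, 0.2], size=(100, 2)) for c in centers])

# Standardize so both features contribute comparably to the distance
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for several k and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(f"Best k by silhouette: {best_k}")
```

On the real data, pass `sample_size` to `silhouette_score` inside the loop to keep the run time manageable.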

#### Step 2: Visualize Results

In [None]:
# Visualize K-Means Clustering results
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', marker='o', alpha=0.6)
plt.title('K-Means Clustering Results')
plt.xlabel('TotalPrice')
plt.ylabel('Quantity')
plt.show()

#### Step 3: Evaluate the Model

In [None]:
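# Evaluate using silhouette score on a random sample
# (the full pairwise computation is quadratic in the number of rows)
silhouette_kmeans = silhouette_score(X, kmeans.labels_, sample_size=10000, random_state=42)
print(f'Silhouette Score for K-Means: {silhouette_kmeans}')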

### 2. Hierarchical Clustering

#### Step 1: Train the Model

In [None]:
from sklearn.cluster import AgglomerativeClustering

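# Train Hierarchical Clustering model
# Note: memory use grows roughly quadratically with the number of rows, so this may fail on the full dataset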
hierarchical = AgglomerativeClustering(n_clusters=4)
df['Hierarchical_Cluster'] = hierarchical.fit_predict(X)
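
Agglomerative clustering scales poorly with the number of rows, so fitting it on several hundred thousand points may exhaust memory. One workaround is to cluster a random sample instead. The sketch below uses synthetic data, and the sample size of 2,000 is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Synthetic stand-in for X
X_demo = rng.normal(size=(20_000, 2))

# Draw a random sample without replacement (size is an assumed budget)
sample_size = 2_000
idx = rng.choice(len(X_demo), size=sample_size, replace=False)
X_sample = X_demo[idx]

# Cluster only the sampled rows
labels_sample = AgglomerativeClustering(n_clusters=4).fit_predict(X_sample)

print(np.bincount(labels_sample))
```

Labels obtained this way apply only to the sampled rows. Extending them to the rest of the data requires an extra step, such as assigning each remaining point to the nearest cluster centroid.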

#### Step 2: Visualize Results

In [None]:
# Visualize Hierarchical Clustering results
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=df['Hierarchical_Cluster'], cmap='plasma', marker='o', alpha=0.6)
plt.title('Hierarchical Clustering Results')
plt.xlabel('TotalPrice')
plt.ylabel('Quantity')
plt.show()

#### Step 3: Evaluate the Model

In [None]:
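# Evaluate using silhouette score on a random sample (the full computation is quadratic in the number of rows)
silhouette_hierarchical = silhouette_score(X, df['Hierarchical_Cluster'], sample_size=10000, random_state=42)
print(f'Silhouette Score for Hierarchical Clustering: {silhouette_hierarchical}')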


### 3. DBSCAN

#### Step 1: Train the Model

In [None]:
from sklearn.cluster import DBSCAN

# Train DBSCAN model
dbscan = DBSCAN(eps=5, min_samples=10)
df['DBSCAN_Cluster'] = dbscan.fit_predict(X)
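
The values `eps=5` and `min_samples=10` are set by hand. A common heuristic for choosing `eps` is to look at the sorted distance from each point to its k-th nearest neighbor and pick a value near the bend of that curve. The sketch below computes the curve on synthetic data. Taking the 95th percentile is a crude stand-in for reading the bend off a plot and is an assumption of this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Synthetic stand-in for X
X_demo = rng.normal(size=(500, 2))

min_samples = 10

# Query the fitted points themselves, so each point is its own first neighbor;
# this matches DBSCAN, where min_samples counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the farthest of the min_samples neighbors
k_dist = np.sort(distances[:, -1])

# Crude candidate for eps: a high percentile of the k-distance curve
eps_candidate = np.quantile(k_dist, 0.95)

print(f"Candidate eps: {eps_candidate:.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` and choosing the bend by eye is the more usual way to apply this heuristic.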

#### Step 2: Visualize Results

In [None]:
# Visualize DBSCAN Clustering results
plt.figure(figsize=(10, 7))
plt.scatter(X[:, 0], X[:, 1], c=df['DBSCAN_Cluster'], cmap='coolwarm', marker='o', alpha=0.6)
plt.title('DBSCAN Clustering Results')
plt.xlabel('TotalPrice')
plt.ylabel('Quantity')
plt.show()

#### Step 3: Evaluate the Model

In [None]:
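# Evaluate using silhouette score, excluding noise points (label -1) and sampling for speed
# (raises ValueError if fewer than two clusters remain after removing noise)
mask = (df['DBSCAN_Cluster'] != -1).to_numpy()
silhouette_dbscan = silhouette_score(X[mask], df['DBSCAN_Cluster'].to_numpy()[mask], sample_size=10000, random_state=42)
print(f'Silhouette Score for DBSCAN: {silhouette_dbscan}')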

### 4. Principal Component Analysis (PCA)

#### Step 1: Train the Model

In [None]:
from sklearn.decomposition import PCA

# Fit PCA on the same two features used for clustering
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)
df['PCA1'] = pca_result[:, 0]
df['PCA2'] = pca_result[:, 1]

#### Step 2: Visualize Results

In [None]:
# Visualize PCA Results
plt.figure(figsize=(10, 7))
plt.scatter(df['PCA1'], df['PCA2'], c=kmeans.labels_, cmap='Spectral', marker='o', alpha=0.6)
plt.title('PCA Results')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.show()

#### Step 3: Evaluate the Model

In [None]:
# Evaluate using explained variance ratio
explained_variance = pca.explained_variance_ratio_
print(f'Explained Variance Ratio: {explained_variance}')
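
When the number of components equals the number of input features, as it does here with two of each, PCA keeps all of the variance and the ratios sum to 1. The individual ratios then only show how the variance is split between the two rotated axes. A small sketch on synthetic data, with feature scales chosen arbitrarily:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Synthetic stand-in for X: two features with very different spreads
X_demo = rng.normal(size=(300, 2)) * [50.0, 5.0]

# Two components on two features: a rotation that keeps all variance
pca_demo = PCA(n_components=2).fit(X_demo)
ratio = pca_demo.explained_variance_ratio_

print(f"Explained variance ratio: {ratio}")
print(f"Sum of ratios: {ratio.sum()}")
```

Since the feature with the larger spread dominates the first component, standardizing the inputs before PCA is worth considering.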

## Model Comparison

In [None]:
print(f"Silhouette Score for K-Means: {silhouette_kmeans}")
print(f"Silhouette Score for Hierarchical Clustering: {silhouette_hierarchical}")
print(f"Silhouette Score for DBSCAN: {silhouette_dbscan}")
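# PCA is not a clustering model; its explained variance is not comparable to the silhouette scores above
print(f"Explained Variance Ratio for PCA: {explained_variance}")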