In [6]:
import pandas as pd

# Load the dataset
file_path = 'data.csv'  # Replace with your actual file path
data = pd.read_csv(file_path, encoding='ISO-8859-1')  # Using a different encoding if needed

# Dataset size in terms of rows and columns
rows, columns = data.shape

# Brief description of each column
column_descriptions = data.describe(include='all').T

# Assuming 'InvoiceDate' is the column to check for the time period
# Convert 'InvoiceDate' to datetime
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], errors='coerce')

# Time period covered by the dataset
time_period = {
    'Start Date': data['InvoiceDate'].min(),
    'End Date': data['InvoiceDate'].max()
}

# Printing the results
print(f"Dataset Size: {rows} rows, {columns} columns")
print("\nColumn Descriptions:")
print(column_descriptions)
print("\nTime Period Covered by the Dataset:")
print(f"From {time_period['Start Date']} to {time_period['End Date']}")


Dataset Size: 541909 rows, 8 columns

Column Descriptions:
                count unique                                 top    freq  \
InvoiceNo      541909  25900                              573585    1114   
StockCode      541909   4070                              85123A    2313   
Description    540455   4223  WHITE HANGING HEART T-LIGHT HOLDER    2369   
Quantity     541909.0    NaN                                 NaN     NaN   
InvoiceDate    541909  23260                    10/31/2011 14:41    1114   
UnitPrice    541909.0    NaN                                 NaN     NaN   
CustomerID   406829.0    NaN                                 NaN     NaN   
Country        541909     38                      United Kingdom  495478   

                    mean          std       min      25%      50%      75%  \
InvoiceNo            NaN          NaN       NaN      NaN      NaN      NaN   
StockCode            NaN          NaN       NaN      NaN      NaN      NaN   
Description          N

**Size of the Dataset:**

Number of Rows: 541,909
Number of Columns: 8


**Brief Description of Each Column:**

InvoiceNo: Identifier for each invoice (25,900 unique values).
StockCode: Product item code (4,070 unique values).
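Description: Product description (4,223 unique descriptions; most frequent is "WHITE HANGING HEART T-LIGHT HOLDER"). Only 540,455 of 541,909 rows have a value, so 1,454 descriptions are missing.
Quantity: Quantity of each product per invoice line (mean: ~9.55, min: -80,995, max: 80,995). The negative minimum suggests that returns or cancellations are recorded as negative quantities.
InvoiceDate: Date and time of the invoice (23,260 unique values).
UnitPrice: Price per unit (mean: ~4.61, min: -11,062.06, max: 38,970). A negative price is not a valid sale price, so these rows warrant inspection before any revenue analysis.
CustomerID: Identifier for each customer (min: 12,346, max: 18,287). Only 406,829 of 541,909 rows have a value, so 135,080 rows lack a customer ID. The mean (~15,287.69) carries no meaning for an identifier.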
Country: Country name (38 unique countries; most frequent is the United Kingdom).
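
The `count` column in the output above is lower than the row count for Description and CustomerID, which points to missing values. A minimal sketch of how the gaps could be tallied per column, using a small made-up frame (`sample`, with illustrative values that are not taken from data.csv) in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
sample = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "Description": ["ITEM A", None, "ITEM B", "ITEM C"],
    "CustomerID": [17850.0, 17850.0, np.nan, np.nan],
})

# isna() flags missing cells; sum() counts them and mean() gives their share.
missing_report = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean().round(2),
})
print(missing_report)
```

Running the same two calls on `data` would give the per-column gaps for the full dataset.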

**Time Period Covered by the Dataset:**

The dataset covers just over one year, from 2010-12-01 08:26:00 to 2011-12-09 12:50:00.
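
The most frequent raw value in the output, `10/31/2011 14:41`, suggests that dates are stored as month/day/year with a 24-hour time. Assuming that format holds for every row (the source does not confirm it), passing it explicitly avoids ambiguous day/month guessing. The sketch below uses a few hypothetical strings rather than the real column:

```python
import pandas as pd

# Hypothetical raw strings in the assumed month/day/year hour:minute layout.
raw_dates = pd.Series(["12/1/2010 8:26", "10/31/2011 14:41", "not a date"])

# An explicit format removes day/month ambiguity; errors="coerce" turns
# anything that does not match into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %H:%M", errors="coerce")

print(parsed.min(), parsed.max())
print("Unparseable rows:", parsed.isna().sum())
```

Counting the `NaT` values after conversion shows whether `errors='coerce'` silently discarded any dates.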

In [9]:
import pandas as pd

# Load your dataset
file_path = 'data.csv'  # Replace with your dataset file path
data = pd.read_csv(file_path, encoding='ISO-8859-1')  # Adjust encoding if necessary

# 1. Count the number of unique customers
unique_customers = data['CustomerID'].nunique()

# 2. Distribution of the number of orders per customer
# Group by CustomerID and count the unique InvoiceNo for each customer
orders_per_customer = data.groupby('CustomerID')['InvoiceNo'].nunique()

# Descriptive statistics for the distribution of orders per customer
orders_distribution = orders_per_customer.describe()

# 3. Identify the top 5 customers by order count
top_5_customers = orders_per_customer.sort_values(ascending=False).head(5)

# Output the results
print(f"Number of Unique Customers: {unique_customers}")
print("\nDistribution of Orders per Customer:\n", orders_distribution)
print("\nTop 5 Customers by Order Count:\n", top_5_customers)


Number of Unique Customers: 4372

Distribution of Orders per Customer:
 count    4372.000000
mean        5.075480
std         9.338754
min         1.000000
25%         1.000000
50%         3.000000
75%         5.000000
max       248.000000
Name: InvoiceNo, dtype: float64

Top 5 Customers by Order Count:
 CustomerID
14911.0    248
12748.0    224
17841.0    169
14606.0    128
13089.0    118
Name: InvoiceNo, dtype: int64


**Based on the analysis of the dataset:**

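Number of Unique Customers: There are 4,372 unique customers in the dataset. Rows with a missing CustomerID are not counted, because `nunique()` ignores missing values by default.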

**Distribution of the Number of Orders per Customer:**

Count: 4,372 customers have placed at least one order.

Mean: On average, each customer has placed about 5.08 orders.

Standard Deviation: The standard deviation is approximately 9.34, nearly twice the mean, which indicates wide variation in the number of orders per customer.

Minimum: The minimum number of orders by a customer is 1.

25th Percentile: At least 25% of the customers have placed exactly 1 order, which is also the minimum.

Median (50th Percentile): The median number of orders per customer is 3. The mean (5.08) is higher than the median, which points to a right-skewed distribution pulled up by a small number of frequent buyers.

75th Percentile: 75% of the customers have placed 5 or fewer orders.

Maximum: The maximum number of orders by a single customer is 248.
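
The summary statistics above can be complemented with the share of one-time customers and a few upper quantiles, which describe a skewed distribution more directly than the mean. A minimal sketch on a made-up frame (`toy`, hypothetical values) that mirrors the `groupby(...).nunique()` step from the cell above:

```python
import pandas as pd

# Toy invoice lines; customer and invoice values are illustrative only.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3, 4],
    "InvoiceNo": ["A", "A", "B", "C", "D", "E", "F"],
})

# Same aggregation as in the analysis: distinct invoices per customer.
orders_per_customer = toy.groupby("CustomerID")["InvoiceNo"].nunique()

# Fraction of customers with exactly one order.
share_single = (orders_per_customer == 1).mean()

# Upper quantiles show how far the tail extends.
upper_quantiles = orders_per_customer.quantile([0.5, 0.9, 0.99])

print(f"Share of one-time customers: {share_single:.0%}")
print(upper_quantiles)
```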

**Top 5 Customers by Order Count:**

Customer ID 14911: 248 orders

Customer ID 12748: 224 orders

Customer ID 17841: 169 orders

Customer ID 14606: 128 orders

Customer ID 13089: 118 orders
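
The order counts above include every invoice number, and the customer IDs print as floats (for example `14911.0`) because the column contains missing values. If rows with a negative Quantity represent returns or cancellations (an assumption; the dataset does not state this), the ranking could be recomputed on positive-quantity rows only. The sketch below uses a made-up frame (`toy`) and is one possible variant, not the method used in the analysis:

```python
import numpy as np
import pandas as pd

# Toy invoice lines; all values are illustrative only.
toy = pd.DataFrame({
    "InvoiceNo": ["1001", "1002", "1003", "1004", "1005"],
    "CustomerID": [14911.0, 14911.0, 14911.0, 12748.0, np.nan],
    "Quantity": [6, -6, 3, 2, 1],
})

# Keep rows with a known customer and a positive quantity
# (assumption: negative quantities are returns or cancellations).
purchases = toy[(toy["Quantity"] > 0) & toy["CustomerID"].notna()].copy()

# Nullable integer dtype so IDs display as 14911 instead of 14911.0.
purchases["CustomerID"] = purchases["CustomerID"].astype("Int64")

top_customers = (
    purchases.groupby("CustomerID")["InvoiceNo"]
    .nunique()
    .sort_values(ascending=False)
    .head(5)
)
print(top_customers)
```

Comparing this ranking with the unfiltered one would show how much of each customer's order count comes from negative-quantity invoices.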