<a href="https://www.kaggle.com/code/buketzdamar/e-commerce?scriptVersionId=222730539" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np
import pandas as pd
import re

In [None]:
data = pd.read_csv("/kaggle/input/e-commerce-datasets/data.csv", encoding='unicode_escape')

**InvoiceNo: Invoice Number**  
A unique number identifying each transaction. A single invoice can contain multiple products, so the same number may appear on several rows.

**StockCode: Product Code**  
A unique code for the product. It is used to track the product's stock and variety.

**Description: Product Description**  
The name or description of the product. It provides text explaining what the product is.

**Quantity: Quantity**  
The number of items purchased. This specifies the quantity sold.

**InvoiceDate: Invoice Date**  
The date and time when the invoice was issued. This can be used to analyze the timing of sales and seasonal trends.

**UnitPrice: Unit Price**  
The selling price of the product. It represents the price of a single unit of the product.

**CustomerID: Customer ID**  
A unique identification number for each customer. This is important for tracking each customer's transaction history.

**Country: Country**  
The country where the customer is located. This information is used for geographical analyses or market segmentation.
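
To make the column descriptions concrete, here is a minimal sketch on a few made-up rows. The frame `toy` and all of its values are invented for illustration and are not taken from the dataset. It shows how one invoice can span several rows, one per product.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's columns; every value is invented.
toy = pd.DataFrame({
    "InvoiceNo": ["A001", "A001", "A002"],
    "StockCode": ["10001", "10002", "10001"],
    "Description": ["MUG", "PLATE", "MUG"],
    "Quantity": [6, 2, 1],
    "InvoiceDate": ["2011-01-04 10:00", "2011-01-04 10:00", "2011-01-05 09:30"],
    "UnitPrice": [2.5, 4.0, 2.5],
    "CustomerID": [1.0, 1.0, 2.0],
    "Country": ["United Kingdom", "United Kingdom", "France"],
})

# Count the distinct products listed on each invoice.
products_per_invoice = toy.groupby("InvoiceNo")["StockCode"].nunique()
print(products_per_invoice)
```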


In [None]:
data.head()

In [None]:
# data information
data.info()


In [None]:
# missing values

print(data.isnull().sum())

In [None]:
print(data['Description'])

In [None]:
# Drop rows with a missing CustomerID or Description

data = data.dropna(subset=['CustomerID', 'Description'])
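
As a small illustration of what `dropna(subset=...)` does, the sketch below uses an invented frame (`toy` is hypothetical). A row is removed when either listed column is missing, while missing values in other columns would not matter.

```python
import numpy as np
import pandas as pd

# Invented example: row 1 lacks a CustomerID, row 2 lacks a Description.
toy = pd.DataFrame({
    "CustomerID": [1.0, np.nan, 3.0, 4.0],
    "Description": ["MUG", "PLATE", None, "BOWL"],
    "Quantity": [1, 2, 3, 4],
})

# Keep only rows where both columns are present.
cleaned = toy.dropna(subset=["CustomerID", "Description"])
print(len(toy), "->", len(cleaned))
```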


In [None]:
print(data.isnull().sum())

In [None]:
# Non-null count per column (count is a method, so it needs parentheses)
data.count()

In [None]:
# Summary statistics for numerical columns

print(data.describe())


In [None]:
data.describe(include='all').T  # include='all' summarizes categorical columns as well as numerical ones

In [None]:
# Convert the invoice date to datetime format

data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

In [None]:
data.info()

In [None]:
# Sales revenue calculation: Quantity * Unit Price

data['Sales'] = data['Quantity'] * data['UnitPrice']


In [None]:
# Total revenue

total_sales = data['Sales'].sum()
print(f"Total Sales Revenue: {total_sales}")
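
Because returned items appear with a negative `Quantity`, the total above is a net figure. A minimal sketch of separating gross sales from returns follows, using an invented frame (`toy` and its values are hypothetical).

```python
import pandas as pd

# Invented rows: two sales and one return (negative Quantity).
toy = pd.DataFrame({"Quantity": [10, -2, 5], "UnitPrice": [1.5, 1.5, 4.0]})
toy["Sales"] = toy["Quantity"] * toy["UnitPrice"]

gross = toy.loc[toy["Quantity"] > 0, "Sales"].sum()     # revenue from sales only
returned = toy.loc[toy["Quantity"] < 0, "Sales"].sum()  # negative value of returns
net = toy["Sales"].sum()                                # gross plus the negative returns

print(f"Gross: {gross}, Returns: {returned}, Net: {net}")
```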


In [None]:
# List best-selling products by total revenue

top_products = data.groupby('Description').agg({'Sales': 'sum'}).sort_values(by='Sales', ascending=False)
print(top_products)


In [None]:
# Sales by Country

c_sales = data.groupby('Country').agg({'Sales': 'sum'}).sort_values(by='Sales', ascending=False)
print(c_sales)

In [None]:
data.head()

In [None]:
# Top customers by total sales revenue

best_customer = data.groupby('CustomerID').agg({'Sales': 'sum'}).sort_values(by='Sales', ascending=False)

print(best_customer)
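
Summing `Sales` ranks customers by how much they spend, not by how often they buy. A sketch of a frequency ranking, assuming each distinct `InvoiceNo` counts as one purchase, is shown below on invented data (`toy` is hypothetical).

```python
import pandas as pd

# Invented rows: customer 1 has two invoices, customer 2 has three, customer 3 has one.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "B3", "C1"],
})

# Count distinct invoices per customer, most frequent buyers first.
purchase_frequency = (
    toy.groupby("CustomerID")["InvoiceNo"]
       .nunique()
       .sort_values(ascending=False)
)
print(purchase_frequency)
```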

In [None]:
# Invoice timestamps with the highest sales revenue

date_sales = data.groupby('InvoiceDate').agg({'Sales':'sum'}).sort_values(by='Sales', ascending=False)
print(date_sales)
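
Grouping on `InvoiceDate` directly keeps the time of day, so each timestamp forms its own group. To find the busiest calendar day, one option is to group on the date part and count distinct invoices. The sketch below uses invented data (`toy` is hypothetical).

```python
import pandas as pd

# Invented rows: two invoices on 4 January, one on 5 January.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A2", "A3"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-04 10:00", "2011-01-04 15:30", "2011-01-05 09:00"]
    ),
    "Sales": [10.0, 5.0, 12.0],
})

# Aggregate per calendar day: total revenue and number of distinct invoices.
daily = (
    toy.groupby(toy["InvoiceDate"].dt.date)
       .agg(Sales=("Sales", "sum"), Invoices=("InvoiceNo", "nunique"))
       .sort_values(by="Invoices", ascending=False)
)
print(daily)
```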

In [None]:
# Return analysis

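# .copy() gives an independent frame, so adding columns later does not trigger a SettingWithCopyWarning
returns = data[data['Quantity'] < 0].copy()
return_details = returns.groupby(['CustomerID', 'StockCode']).agg({'Sales': 'sum', 'Quantity': 'sum'}).reset_index()

# Returns carry negative Sales, so ascending order lists the largest returns first
return_details_sorted = return_details.sort_values(by='Sales', ascending=True)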
print(return_details_sorted)


In [None]:
# Customers with the most returned units (Quantity is negative for returns, so sort ascending)

customer_returns = returns.groupby('CustomerID').agg({'Sales': 'sum', 'Quantity': 'sum'}).sort_values(by='Quantity', ascending=True)
print(customer_returns)


In [None]:
# Distribution of returns by calendar day, most returned units first

# assign() returns a new frame instead of modifying a slice in place
returns = returns.assign(Date=returns['InvoiceDate'].dt.date)
return_by_date = returns.groupby('Date').agg({'Sales': 'sum', 'Quantity': 'sum'}).sort_values(by='Quantity', ascending=True)
print(return_by_date)


In [None]:
data.nunique()

In [None]:
# Use .copy() so that changes to data2 do not also modify data
data2 = data.copy()

In [None]:
data2.to_csv('data2.csv', index=False)  # save the cleaned data (no missing values)


In [None]:
# Flag cancelled orders (InvoiceNo contains 'C'); the counts below are per row, not per invoice

data2['order_canceled'] = data2['InvoiceNo'].apply(lambda x:int('C' in x))
display(data2[:5])

n1 = data2['order_canceled'].sum()
n2 = data2.shape[0]
print('Number of orders canceled: {}/{} ({:.2f}%) '.format(n1, n2, n1/n2*100))
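
The percentage above is computed over rows, and one invoice can span many rows. A sketch of the same share at invoice level follows. It assumes the cancellation marker is a leading `C` in `InvoiceNo`, and the frame `toy` is invented for illustration.

```python
import pandas as pd

# Invented rows: invoice CA2 is a cancellation with three line items.
toy = pd.DataFrame({"InvoiceNo": ["A1", "A1", "CA2", "CA2", "CA2", "A3"]})

# Assumption: a cancelled invoice number starts with the letter C.
toy["order_canceled"] = toy["InvoiceNo"].str.startswith("C").astype(int)

# Share of rows that belong to a cancelled invoice.
row_share = toy["order_canceled"].mean()

# Collapse to one flag per invoice, then take the share of cancelled invoices.
invoice_flags = toy.groupby("InvoiceNo")["order_canceled"].max()
invoice_share = invoice_flags.mean()

print(f"Rows: {row_share:.2%}, invoices: {invoice_share:.2%}")
```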

In [None]:
# Countries with the most cancellations

c_country = data2.groupby('Country').agg({'order_canceled': 'sum'}).sort_values(by='order_canceled', ascending=False)
print(c_country)

In [None]:
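# Stock codes that start with a letter: special entries such as postage or discounts rather than products
stok_inf = data2[data2['StockCode'].str.contains('^[a-zA-Z]+', regex=True)]['StockCode'].unique()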
stok_inf

**Special stock codes**  
POST -> POSTAGE  
D -> Discount  
C2 -> CARRIAGE  
M -> Manual  
BANK CHARGES -> Bank Charges  
PADS -> PADS TO MATCH ALL CUSHIONS  
DOT -> DOTCOM POSTAGE
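
Since these codes describe fees and adjustments rather than merchandise, product-level analyses may want to exclude them. A minimal sketch on invented rows follows. The names `toy`, `special_codes`, and `products_only` are hypothetical, and the code list is copied from the mapping above.

```python
import pandas as pd

# Invented rows mixing ordinary product codes with special codes.
toy = pd.DataFrame({
    "StockCode": ["10001", "10002A", "POST", "D", "M", "C2"],
    "Sales": [15.0, 8.0, 18.0, -5.0, 2.0, 50.0],
})

# Special codes taken from the mapping listed above.
special_codes = ["POST", "D", "C2", "M", "BANK CHARGES", "PADS", "DOT"]

# Keep only rows whose StockCode is not a special code.
products_only = toy[~toy["StockCode"].isin(special_codes)]
print(products_only)
```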

In [None]:
data2.head()