**Objective:** Gain insights from a retail dataset through exploratory data analysis, data visualization, and data modelling.

**Dataset Columns:**

- **InvoiceNo:** Invoice number. A unique number per invoice.
- **StockCode:** Product code. A unique number per product.
- **Description:** Product description.
- **Quantity:** The number of products sold per invoice.
- **InvoiceDate:** The date and time of the invoice.
- **UnitPrice:** The price of one unit of the product.
- **CustomerID:** Customer identification number.
- **Country:** The country where the customer resides.


## 1. Data Preprocessing and Cleaning:


1.1. Import necessary libraries and read the dataset:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

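# 'ANSI' is only recognised as a codec name on Windows; ISO-8859-1 reads the same single-byte text on any platform
df = pd.read_csv('Sales_data.csv', encoding='ISO-8859-1')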


In [2]:
import numpy as np
import sklearn

1.2. Display the top 10 rows of the dataframe:

In [None]:
df.head(10)


1.3. Check for missing values:



In [None]:
missing_values = df.isnull().sum()
print(missing_values)
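
Raw counts are easier to judge as a share of all rows, and rows without a `CustomerID` cannot be used in the customer-level steps later on. The sketch below shows both on a small made-up frame; `toy` and `toy_known` are hypothetical names, and dropping the rows is only one possible policy, not necessarily the right one for this dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data; the values are made up for illustration
toy = pd.DataFrame({
    'InvoiceNo': ['A1', 'A1', 'A2', 'A3'],
    'Description': ['MUG', None, 'LAMP', 'MUG'],
    'CustomerID': [17850.0, 17850.0, np.nan, 13047.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100

# One possible policy: keep only rows with a known customer
toy_known = toy.dropna(subset=['CustomerID'])

print(missing_pct)
print(len(toy_known))
```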


1.4. Convert the InvoiceDate column to datetime format:



In [5]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])


1.5. Add a new column 'TotalPrice' to the dataframe which is the product of 'UnitPrice' and 'Quantity':



In [None]:
df['TotalPrice'] = df['UnitPrice'] * df['Quantity']
df.head()

## 2. Exploratory Data Analysis:


2.1. How many unique products are there in the dataset?


In [None]:
unique_products = df['StockCode'].nunique()
print(unique_products)


2.2. Which are the top 10 products (using StockCode) sold by quantity?



In [None]:
# Select 'Quantity' before summing so text and date columns are not aggregated
top_products = df.groupby('StockCode')['Quantity'].sum().sort_values(ascending=False).head(10)
print(top_products)


2.3. How many unique customers are there in the dataset?



In [None]:
unique_customers = df['CustomerID'].nunique()
print(unique_customers)


2.4. Which country has the maximum number of unique customers?



In [None]:
top_country = df.groupby('Country')['CustomerID'].nunique().idxmax()
print(top_country)


2.5. Visualize the distribution of 'TotalPrice' using a histogram.



In [None]:
plt.hist(df['TotalPrice'], bins=50, range=(df['TotalPrice'].min(), df['TotalPrice'].max()))
plt.xlabel('Total Price')
plt.ylabel('Frequency')
plt.title('Distribution of Total Price')
plt.show()


## 3. Data Aggregation:


3.1. Compute the total sales (TotalPrice) per country.



In [None]:
sales_per_country = df.groupby('Country')['TotalPrice'].sum()
print(sales_per_country)


3.2. Identify the month in which the sales were highest.



In [None]:
df['Month'] = df['InvoiceDate'].dt.month
month_sales = df.groupby('Month')['TotalPrice'].sum()
top_month = month_sales.idxmax()
print(top_month)
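
Grouping on the calendar month pools the same month from different years, which can change the answer when the data spans more than twelve months. As a hedged sketch on made-up numbers (the frame `toy` is hypothetical), grouping on a year-month period keeps each month separate:

```python
import pandas as pd

toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2010-12-01', '2010-12-15', '2011-11-03', '2011-12-02']),
    'TotalPrice': [200.0, 50.0, 400.0, 200.0],
})

# Calendar-month totals: both Decembers are pooled together
by_month = toy.groupby(toy['InvoiceDate'].dt.month)['TotalPrice'].sum()

# Year-month totals: each month of each year is kept separate
by_period = toy.groupby(toy['InvoiceDate'].dt.to_period('M'))['TotalPrice'].sum()

print(by_month.idxmax())
print(by_period.idxmax())
```

In this toy data the pooled view picks December, while the year-month view picks November 2011.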


3.3. Compute the average unit price per product.



In [None]:
avg_price_per_product = df.groupby('StockCode')['UnitPrice'].mean()
print(avg_price_per_product)


3.4. Compute the total quantity sold per customer.



In [None]:
quantity_per_customer = df.groupby('CustomerID')['Quantity'].sum()
print(quantity_per_customer)


## 4. Data Visualization:


4.1. Create a bar chart showcasing the sales (TotalPrice) for each country.



In [None]:
sales_per_country.plot(kind='bar', figsize=(10, 6))
plt.ylabel('Total Sales')
plt.title('Total Sales per Country')
plt.show()


4.2. Plot a line graph to showcase the trend of sales over time.



In [None]:
df.set_index('InvoiceDate', inplace=True)
df.resample('M')['TotalPrice'].sum().plot()
plt.ylabel('Total Sales')
plt.title('Sales Trend Over Time')
plt.show()


4.3. Use a scatter plot to visualize the relationship between UnitPrice and Quantity.



In [None]:
sns.scatterplot(x='UnitPrice', y='Quantity', data=df)
plt.title('Relationship between UnitPrice and Quantity')
plt.show()


4.4. Plot a heatmap to display the correlation between numeric columns.



In [None]:
# Restrict to numeric columns; text columns cannot be correlated
correlation_matrix = df.select_dtypes(include='number').corr()
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()


## 5. Advanced Analysis:


5.1. Identify potential outliers in the dataset for the Quantity and UnitPrice columns using appropriate visualization techniques.



In [None]:
sns.boxplot(x=df['Quantity'])
plt.show()

sns.boxplot(x=df['UnitPrice'])
plt.show()
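
A box plot marks points beyond the whiskers, which by default sit 1.5 interquartile ranges outside the quartiles. The same rule can be applied numerically to list the flagged values. This is a minimal sketch on made-up quantities; `iqr_outliers` is a hypothetical helper, and `k=1.5` is the conventional default rather than a value taken from this dataset.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up quantities, including one large order and one large return
quantity = pd.Series([1, 2, 2, 3, 4, 6, 10, 500, -300])
mask = iqr_outliers(quantity)

print(quantity[mask].tolist())
```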


5.2. Segment customers based on their purchase history (Consider factors like total purchases, frequency of purchases, etc.).



In [None]:
# Reuse 'TotalPrice' (already Quantity * UnitPrice) instead of adding a duplicate column to df
customer_segment = df.groupby('CustomerID').agg(TotalSpent=('TotalPrice', 'sum'),
                                                NumPurchases=('InvoiceNo', 'nunique'))
print(customer_segment)



5.3. For the top 5 products (by quantity sold), visualize their monthly sales trend.



In [None]:
top_5_products = df.groupby('StockCode')['Quantity'].sum().nlargest(5).index
filtered_df = df[df['StockCode'].isin(top_5_products)]
pivot_data = filtered_df.groupby(['Month', 'StockCode'])['Quantity'].sum().unstack()
pivot_data.plot()
plt.ylabel('Quantity Sold')
plt.title('Monthly Sales Trend for Top 5 Products')
plt.show()


# Advanced Machine Learning Analysis with Retail Dataset



## 6. Feature Engineering:

6.1 Extract 'Year', 'Month', 'Day', and 'Hour' from the InvoiceDate and create separate columns for each.


In [None]:
df.reset_index(inplace=True)
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['Day'] = df['InvoiceDate'].dt.day
df['Hour'] = df['InvoiceDate'].dt.hour
df.head()

6.2 Create a new column `'ReturnFlag'` where if `'Quantity'` is less than zero, it's 1, otherwise 0. This will indicate whether an item was returned.


In [None]:
df['ReturnFlag'] = (df['Quantity'] < 0).astype(int)
df.head()

## 7. Customer Segmentation using Clustering:

7.1. Create an RFM (Recency, Frequency, Monetary) matrix with one row per customer:
- Recency: Number of days since the last purchase
- Frequency: Number of purchases
- Monetary: Total money spent

In [None]:
# Calculate Recency, Frequency and Monetary for each customer
current_date = df['InvoiceDate'].max()

rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (current_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',                                  # Frequency (distinct invoices, not invoice lines)
    'TotalPrice': 'sum'                                      # Monetary
}).rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalPrice': 'Monetary'
})

rfm.head()

7.2. Normalize the RFM matrix:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
rfm_normalized = scaler.fit_transform(rfm)
rfm_normalized

7.3. Use K-Means clustering to segment customers into different groups. Determine the optimal number of clusters using the Elbow method.

In [None]:
!pip install -U threadpoolctl

In [None]:
from sklearn.cluster import KMeans

# Determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(rfm_normalized)
    wcss.append(kmeans.inertia_)

# Plot the Elbow method
plt.figure(figsize=(10,5))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title('K-means clustering Elbow Method')
plt.show()

# Based on the elbow point, choose optimal clusters and run KMeans
optimal_clusters = 3  # this can change based on your elbow plot
kmeans = KMeans(n_clusters=optimal_clusters, init='k-means++', random_state=42)
clusters = kmeans.fit_predict(rfm_normalized)

rfm['Cluster'] = clusters
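
The elbow in a WCSS plot is often ambiguous, so a second criterion can help. One common option is the silhouette score, which is higher when clusters are compact and well separated. The sketch below uses synthetic blobs as a stand-in for the scaled RFM matrix; the data, the range of `k`, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix: three well-separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(best_k)
```

On real RFM data the peak is usually less clear-cut, so it is worth reading the score alongside the elbow plot rather than instead of it.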


## 8. Predictive Analytics:

8.1. Can you predict if a customer will return an item? Use the 'ReturnFlag' as the target variable and build a classification model.

- Split data into training and test sets.
- Use features like 'UnitPrice', 'Quantity', etc.
- Evaluate model accuracy, precision, recall, and F1-score.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Note: 'ReturnFlag' is derived from the sign of 'Quantity', so keeping 'Quantity' as a feature
# leaks the target and the scores will look near-perfect; swap in other features for a realistic test
features = df[['UnitPrice', 'Quantity']]  # you can add more features
target = df['ReturnFlag']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)

print(classification_report(y_test, predictions))


8.2. Predict the `'TotalPrice'` for an invoice using regression models.
- Consider relevant features and handle categorical ones appropriately (e.g., with one-hot encoding).
- Split data, train the model, and evaluate its performance using metrics like MAE, RMSE, and R^2.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Drop the target, identifier/text columns, and 'TotalSpent' (a copy of the target, if present).
# Categorical columns are dropped here rather than encoded.
drop_cols = ['TotalPrice', 'TotalSpent', 'InvoiceDate', 'InvoiceNo', 'StockCode',
             'Description', 'Country', 'CustomerID']
features = df.drop(columns=drop_cols, errors='ignore')
target = df['TotalPrice']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)

predictions = reg.predict(X_test)

print('MAE:', mean_absolute_error(y_test, predictions))
# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))
print('R^2:', r2_score(y_test, predictions))
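
The task mentions one-hot encoding for categorical features. A minimal sketch of that step with `pd.get_dummies` is shown below on a made-up frame (`toy` and `encoded` are hypothetical names). Using `drop_first=True` is one common choice for linear models, not the only valid one.

```python
import pandas as pd

toy = pd.DataFrame({
    'Quantity': [6, 2, 12, 1],
    'UnitPrice': [2.55, 7.95, 0.85, 4.25],
    'Country': ['United Kingdom', 'France', 'United Kingdom', 'Germany'],
})

# One-hot encode Country; drop_first removes one level to avoid a perfectly collinear column
encoded = pd.get_dummies(toy, columns=['Country'], drop_first=True)

print(encoded.columns.tolist())
```

With many distinct countries this adds one column per country minus one, so rare levels may be worth grouping before encoding.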


## 9. Association Rule Mining:
- 9.1 Identify products that are frequently bought together. Use the Apriori algorithm to extract meaningful association rules.
- 9.2 Based on the rules, suggest product bundling strategies for the retail store.

In [31]:
!pip install mlxtend --quiet

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

# Creating a basket: one row per invoice, one column per product, values are total quantities
basket = (df.groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack(fill_value=0))

# One-hot encode: True if the invoice contains at least one unit of the product
basket_encoded = basket > 0

# Build frequent items
frequent_itemsets = apriori(basket_encoded, min_support=0.03, use_colnames=True)

# Association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules
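
To read the rules table it helps to see how the three main metrics are defined. For a rule A -> C, support is the share of invoices containing both items, confidence is that share divided by the support of A, and lift is confidence divided by the support of C. The sketch below computes them by hand on a made-up basket; `rule_metrics` is a hypothetical helper, not part of mlxtend.

```python
import pandas as pd

# Toy one-hot basket: rows are invoices, columns are products (made-up data)
basket = pd.DataFrame({
    'TEA':    [1, 1, 1, 0, 1],
    'MUG':    [1, 1, 0, 0, 1],
    'CANDLE': [0, 1, 0, 1, 0],
}).astype(bool)

def rule_metrics(basket: pd.DataFrame, antecedent: str, consequent: str) -> dict:
    """Support, confidence and lift for the rule antecedent -> consequent."""
    support_a = basket[antecedent].mean()
    support_c = basket[consequent].mean()
    support_ac = (basket[antecedent] & basket[consequent]).mean()
    confidence = support_ac / support_a
    return {
        'support': support_ac,
        'confidence': confidence,
        'lift': confidence / support_c,
    }

metrics = rule_metrics(basket, 'TEA', 'MUG')
print(metrics)
```

A lift above 1 means the two products appear together more often than they would if bought independently, which is the usual starting point for a bundling suggestion.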


# Advanced EDA Techniques:

## 11. Pareto Analysis (80/20 Rule):
- Identify the 20% of the products that generate 80% of the revenue.
- Conversely, identify the 20% of the customers responsible for 80% of the sales.    

In [None]:
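# Revenue per product, highest first
product_revenue = df.groupby('StockCode')['TotalPrice'].sum().sort_values(ascending=False)
top_20_percent_products = int(0.2 * len(product_revenue))
top_products = product_revenue.head(top_20_percent_products)
print('Revenue share of the top 20% of products:', top_products.sum() / product_revenue.sum())
top_products.head(20).plot(kind='bar')
plt.show()

# Revenue per customer, highest first
customer_revenue = df.groupby('CustomerID')['TotalPrice'].sum().sort_values(ascending=False)
top_20_percent_customers = int(0.2 * len(customer_revenue))
top_customers = customer_revenue.head(top_20_percent_customers)
print('Revenue share of the top 20% of customers:', top_customers.sum() / customer_revenue.sum())
top_customers.head(20).plot(kind='bar')
plt.show()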
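
The 80/20 question can also be asked the other way round: how many of the top products are needed to reach 80% of revenue? A cumulative share answers that directly. This is a sketch on made-up figures; `revenue`, `cum_share`, and the 0.8 threshold are illustrative, and it assumes per-product revenue is non-negative.

```python
import pandas as pd

# Made-up revenue per product
revenue = pd.Series(
    {'P1': 500.0, 'P2': 320.0, 'P3': 80.0, 'P4': 50.0, 'P5': 30.0, 'P6': 20.0}
).sort_values(ascending=False)

# Cumulative share of total revenue, highest earners first
cum_share = revenue.cumsum() / revenue.sum()

# Smallest set of top products whose combined revenue reaches 80%
n_needed = int((cum_share < 0.8).sum()) + 1
pareto_products = revenue.index[:n_needed].tolist()

print(pareto_products, n_needed / len(revenue))
```

Here two of six products (about 33%) cover 82% of revenue, so the toy data is close to, but not exactly, an 80/20 split.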


## 14. Time-Series Anomalies:
- Detect any anomalies or outliers in the sales data over time using rolling averages or other advanced methods.

In [None]:
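# Aggregate to daily sales so the rolling window spans days rather than individual invoice lines
daily_sales = df.set_index('InvoiceDate')['TotalPrice'].resample('D').sum()
rolling_mean = daily_sales.rolling(window=5).mean()

# Flag days whose sales exceed the rolling mean by more than 1.96 standard deviations
anomalies = daily_sales[daily_sales > rolling_mean + 1.96 * daily_sales.std()]
plt.plot(daily_sales.index, daily_sales)
plt.plot(anomalies.index, anomalies, 'ro')
plt.show()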
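
A variant of the rolling-average idea is a rolling z-score, which compares each day with the mean and spread of the preceding days only, so a spike does not inflate its own baseline. The sketch below runs on a synthetic daily series; the 7-day window and the threshold of 3 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up daily sales with one injected spike
dates = pd.date_range('2011-01-01', periods=30, freq='D')
sales = pd.Series(100.0 + 5.0 * np.sin(np.arange(30)), index=dates)
sales.iloc[20] = 400.0

window = 7
# Shift by one day so each day is compared only with the days before it
baseline_mean = sales.rolling(window).mean().shift(1)
baseline_std = sales.rolling(window).std().shift(1)
z_scores = (sales - baseline_mean) / baseline_std

# Days more than 3 rolling standard deviations from the recent mean
anomalies = sales[z_scores.abs() > 3]
print(anomalies)
```

Using the absolute value also flags unusually low days, which the one-sided threshold would miss.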


# Advanced Modeling Techniques:

## 15. Market Basket Analysis Enhancements:
- Dig deeper into association rules. For instance, find rules with a high lift and high confidence.
- Analyze antecedents with more than one item, which can give bundled product suggestions.

In [None]:
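# A lift above 1 means the items co-occur more often than expected by chance
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# 0.5 is an illustrative confidence cut-off; tune it to the number of rules you get
top_rules = rules[rules['confidence'] > 0.5]
sns.scatterplot(x='support', y='confidence', size='lift', data=top_rules)
plt.show()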
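
Rules with more than one item in the antecedent can be isolated by checking the size of each `antecedents` set. The sketch below works on a hand-made table that mimics the columns returned by `association_rules`; the rows, the name `rules_toy`, and the cut-offs are made up for illustration.

```python
import pandas as pd

# Hand-made rules table mimicking the columns returned by association_rules
rules_toy = pd.DataFrame({
    'antecedents': [frozenset({'TEA'}), frozenset({'TEA', 'MUG'}), frozenset({'CANDLE', 'HOLDER'})],
    'consequents': [frozenset({'MUG'}), frozenset({'SAUCER'}), frozenset({'MATCHES'})],
    'support': [0.10, 0.04, 0.03],
    'confidence': [0.60, 0.80, 0.30],
    'lift': [2.0, 5.0, 1.1],
})

# Keep rules whose antecedent contains more than one item
multi_item = rules_toy[rules_toy['antecedents'].apply(len) > 1]

# Illustrative cut-offs for "high" lift and confidence; tune them for real data
strong = multi_item[(multi_item['lift'] >= 2) & (multi_item['confidence'] >= 0.5)]

# Each surviving rule suggests a bundle made of its antecedent and consequent items
bundles = [sorted(a | c) for a, c in zip(strong['antecedents'], strong['consequents'])]
print(bundles)
```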


## 19. Product Recommendation Systems:
- Develop a system to recommend products to users.
- Consider collaborative filtering techniques, matrix factorization, or deep learning-based approaches like neural collaborative filtering.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Create a user-product matrix of total quantity bought (pivot_table averages by default, so sum explicitly)
user_product_matrix = df.pivot_table(index='CustomerID', columns='StockCode', values='Quantity',
                                     aggfunc='sum', fill_value=0)
# Compute the cosine similarity between customers
cosine_sim = cosine_similarity(user_product_matrix)

# Get product recommendations for a user based on the most similar customer's purchases
def get_recommendations(user_id, cosine_sim=cosine_sim):
    idx = user_product_matrix.index.get_loc(user_id)
    # Get pairwise similarity scores, excluding the user themselves
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort users based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get products bought by the most similar user that this user has not bought yet
    user_idx = sim_scores[0][0]
    similar_user_products = user_product_matrix.iloc[user_idx]
    already_bought = user_product_matrix.iloc[idx] > 0
    recommended_products = similar_user_products[(similar_user_products > 0) & ~already_bought].index.tolist()

    return recommended_products

stockcodelist = get_recommendations(13113.0)  # Replace 13113.0 with an actual CustomerID
print(f"Stockcode list for recommended products are - {stockcodelist}")
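
Relying on a single nearest customer makes the recommendations sensitive to one person's basket. A common extension in user-based collaborative filtering is to score each product by the similarity-weighted purchases of the `k` most similar customers. The following is a minimal sketch on a made-up matrix; `recommend`, `k`, and `top_n` are hypothetical names and defaults.

```python
import numpy as np
import pandas as pd

# Made-up customer x product purchase quantities
matrix = pd.DataFrame(
    [[2, 1, 0, 0],
     [1, 1, 1, 0],
     [2, 0, 1, 0],
     [0, 0, 0, 5]],
    index=[101, 102, 103, 104],
    columns=['A', 'B', 'C', 'D'],
    dtype=float,
)

def recommend(matrix: pd.DataFrame, user_id, k: int = 2, top_n: int = 3) -> list:
    """Score unseen products by the similarity-weighted purchases of the k nearest customers."""
    values = matrix.to_numpy()
    norms = np.linalg.norm(values, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for empty rows
    unit = values / norms[:, None]
    pos = matrix.index.get_loc(user_id)
    sims = unit @ unit[pos]              # cosine similarity to every customer
    sims[pos] = -np.inf                  # never pick the customer themselves
    neighbours = np.argsort(sims)[::-1][:k]
    scores = sims[neighbours] @ values[neighbours]
    scores[values[pos] > 0] = 0.0        # drop products the customer already bought
    ranked = pd.Series(scores, index=matrix.columns).sort_values(ascending=False)
    return ranked[ranked > 0].head(top_n).index.tolist()

print(recommend(matrix, 101))
```

Customer 104 shares no products with anyone else, so `recommend(matrix, 104)` returns an empty list instead of guessing.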

## 20. Churn Prediction:
- Predict if a customer will stop buying products in the near future.
- Features can include Recency, Frequency, Monetary values, average time between purchases, total categories bought, etc.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

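# For this example, assume a customer has churned if they have not purchased in the last 6 months (180 days)
# Build one row per customer so the model predicts churn per customer rather than per invoice line
max_date = df['InvoiceDate'].max()
customers = df.groupby('CustomerID').agg(
    LastPurchase=('InvoiceDate', 'max'),
    TotalPrice=('TotalPrice', 'sum'),
    Quantity=('Quantity', 'sum'),
)
customers['Churn'] = (max_date - customers['LastPurchase']).dt.days > 180

X = customers[['TotalPrice', 'Quantity']]  # Add more relevant features
y = customers['Churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Features are unscaled, so allow more iterations for the solver to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)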

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
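
One of the suggested features, the average time between purchases, can be derived from invoice dates alone. The sketch below computes it on a made-up frame; `toy`, `invoices`, and `avg_gap` are hypothetical names, and it treats each distinct `InvoiceNo` as one purchase.

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 1, 2, 2, 3],
    'InvoiceNo': ['A', 'A', 'B', 'C', 'D', 'E', 'F'],
    'InvoiceDate': pd.to_datetime([
        '2011-01-01', '2011-01-01', '2011-01-11', '2011-01-31',
        '2011-03-01', '2011-03-08', '2011-05-05',
    ]),
})

# One row per invoice, so multi-line invoices do not count as repeat purchases
invoices = (toy.drop_duplicates(['CustomerID', 'InvoiceNo'])
               .sort_values(['CustomerID', 'InvoiceDate']))

# Days between consecutive invoices of the same customer
gaps = invoices.groupby('CustomerID')['InvoiceDate'].diff().dt.days

# Customers with a single invoice have no gap and end up as NaN
avg_gap = gaps.groupby(invoices['CustomerID']).mean()
print(avg_gap)
```

The NaN values for one-time buyers need an explicit decision (for example a separate indicator column) before the feature is passed to a model.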


## 21. Hyperparameter Tuning and Model Optimization:
- For any given machine learning model you use, apply techniques like grid search or random search for hyperparameter tuning.
- Use ensemble methods (e.g., stacking, bagging, boosting) to enhance prediction performance.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

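# The default 'lbfgs' solver does not support the 'l1' penalty; 'liblinear' handles both 'l1' and 'l2'
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, verbose=3)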
grid.fit(X_train, y_train)

print(grid.best_params_)

grid_predictions = grid.predict(X_test)
print(classification_report(y_test, grid_predictions))
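
The section also mentions random search and ensemble methods. As a hedged sketch, the block below combines the two by tuning a gradient boosting classifier with `RandomizedSearchCV`. It runs on synthetic data from `make_classification` as a stand-in for the churn features, and the parameter ranges are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the churn features
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # sample 5 of the 36 possible combinations
    cv=3,
    scoring='f1',
    random_state=42,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```

Random search evaluates a fixed number of sampled combinations, so its cost stays bounded as the grid grows, at the price of possibly missing the best setting.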
