<a href="https://colab.research.google.com/github/Merlin0205/TEST/blob/main/FINAL_TESTY_LEDEN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Segmentation for a Clothing Company

## Objective


The objective of this project is to build an automated customer segmentation system for a fashion business that operates both online and through physical stores with a loyalty program.

The system aims to:
- Segment customers based on their purchasing behavior, demographics, and preferences.
- Connect this data with inventory information to generate personalized product recommendations.

### Key Goals
1. **Customer Segmentation**  
   Group customers into meaningful segments using clustering techniques based on relevant attributes, such as:
   - Spending habits
   - Purchase frequency
   - Product preferences
   - Engagement metrics

2. **Personalized Product Offers**  
   Recommend products tailored to each customer segment by analyzing:
   - Customer preferences
   - Inventory data (stock availability and profit margins)

3. **Inventory Optimization**  
   Prioritize promoting products that:
   - Have sufficient stock levels
   - Provide higher profit margins  
   This ensures maximum revenue while maintaining relevance to the target segments.

4. **Scalable Design**  
   Create a solution that:
   - Can scale to handle larger datasets
   - Maintains efficiency and adaptability for various marketing strategies

## Generate, check and prepare data

*   K-means -- dataset name **clustering_dataset_scaled**
*   Inventory -- dataset name **inventory_dataset**


Install required libraries

In [None]:
!pip install faker
!pip install gradio
!pip install google-generativeai --upgrade

### Generate dataset 1 "behavioral_dataset" and "behavioral_dataset_noEMAIL"

The Behavioral Dataset provides information about customer purchasing behavior, serving as the foundation for segmentation. It combines transactional data from the e-shop and loyalty program.

In [None]:
# Importing required libraries
import pandas as pd
import random
from faker import Faker

# Initialize Faker
fake = Faker()

# Setting the number of customers
num_customers = 2000

# Generate Behavioral Dataset
behavioral_data = {
    "customer_id": [i for i in range(1, num_customers + 1)],  # Unique customer IDs
    "email": [fake.email() for _ in range(num_customers)],  # Realistic email addresses
    "total_spent": [round(random.uniform(50, 5000), 2) for _ in range(num_customers)],  # Random total spending
    "total_orders": [random.randint(1, 50) for _ in range(num_customers)],  # Number of orders placed
    "avg_order_value": [],  # To be calculated based on total_spent and total_orders
    "last_purchase_days_ago": [random.randint(0, 365) for _ in range(num_customers)],  # Days since last purchase
    "categories_bought": [random.randint(1, 6) for _ in range(num_customers)],  # Number of unique categories
    "brands_bought": [random.randint(1, 6) for _ in range(num_customers)],  # Number of unique brands
}

# Calculate avg_order_value and add errors deliberately
for i in range(num_customers):
    if i % 50 == 0:  # Every 50th row will have a missing avg_order_value
        behavioral_data["avg_order_value"].append(None)
    else:
        total_orders = behavioral_data["total_orders"][i]
        total_spent = behavioral_data["total_spent"][i]
        avg_value = total_spent / total_orders if total_orders > 0 else 0
        behavioral_data["avg_order_value"].append(round(avg_value, 2))

# Introduce specific errors into the dataset
for i in range(20):  # Add invalid email addresses for the first 20 customers
    behavioral_data["email"][i] = "invalid_email.com" if i % 2 == 0 else "user@@example.com"

for i in range(2):  # Add negative total_spent for 2 customers
    behavioral_data["total_spent"][i] = -random.uniform(100, 500)

for i in range(num_customers - 70, num_customers):  # Customers with no purchase data
    behavioral_data["total_spent"][i] = None
    behavioral_data["total_orders"][i] = 0
    behavioral_data["avg_order_value"][i] = None
    behavioral_data["categories_bought"][i] = None
    behavioral_data["brands_bought"][i] = None
    behavioral_data["last_purchase_days_ago"][i] = None

# Convert to DataFrame
behavioral_df = pd.DataFrame(behavioral_data)

# Display the first few rows of the dataset
behavioral_df.head()

# Save the dataset to a variable for further use
behavioral_dataset = behavioral_df
behavioral_dataset.sample(20)

Analyze the dataset for inconsistencies

In [None]:
# Import required library
import numpy as np

# Generate basic descriptive statistics for numeric columns
print("Descriptive Statistics for Numeric Columns:")
print(behavioral_dataset.describe())

# Check for unique values in categorical columns
print("\nUnique values in 'email':")
print(behavioral_dataset['email'].nunique(), "unique email addresses out of", len(behavioral_dataset))

# Identify invalid email formats
invalid_emails = behavioral_dataset[~behavioral_dataset['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)]
print("\nInvalid email addresses:")
print(invalid_emails)

# Check for negative values in total_spent
negative_spent = behavioral_dataset[behavioral_dataset['total_spent'] < 0]
print("\nRows with negative 'total_spent':")
print(negative_spent)

# Check for customers with zero total_orders but non-zero total_spent
inconsistent_data = behavioral_dataset[(behavioral_dataset['total_orders'] == 0) & (behavioral_dataset['total_spent'] > 0)]
print("\nRows where 'total_orders' == 0 but 'total_spent' > 0:")
print(inconsistent_data)

# Analyze 'categories_bought' and 'brands_bought' for unrealistic values
print("\nAnalysis of 'categories_bought' and 'brands_bought':")
print("Unique values in 'categories_bought':", behavioral_dataset['categories_bought'].unique())
print("Unique values in 'brands_bought':", behavioral_dataset['brands_bought'].unique())


What to Remove from the Dataset

Based on the analysis of the dataset, the following data should be removed to ensure consistency and accuracy:

- **Rows with invalid email addresses:** Rows with emails that do not follow a valid email format (e.g., invalid_email.com, user@@example.com) should be removed, as these cannot be used for communication or further analysis.
- **Rows with negative total_spent:** Rows where total_spent is negative should be removed, as negative spending is illogical and likely indicates data entry errors.
- **Rows with missing values:** Rows where any critical field (total_spent, categories_bought, brands_bought) is missing (NaN) should be removed to maintain dataset integrity. These include customers who registered but have no purchase data (e.g., rows with categories_bought and brands_bought as NaN).
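The removal rules above can be tried on a handful of rows before they are applied to the full dataset. The following is a minimal sketch: the sample values are hypothetical, and only the email pattern is taken from the notebook.

```python
import pandas as pd

# Same pattern the notebook uses for email validation
EMAIL_PATTERN = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# Hypothetical sample rows, including the two injected invalid email formats
sample = pd.DataFrame({
    "email": ["anna@example.com", "invalid_email.com", "user@@example.com", None],
    "total_spent": [120.0, 80.0, -15.0, 300.0],
})

# Missing emails count as invalid because of na=False
valid_email = sample["email"].str.contains(EMAIL_PATTERN, na=False)
non_negative = sample["total_spent"] >= 0

cleaned = sample[valid_email & non_negative]
print(cleaned)
```

Only the first row should survive: the other three fail the email check, and the third also has negative spending.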

In [None]:
# Remove rows with invalid email addresses
behavioral_dataset = behavioral_dataset[
    behavioral_dataset['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)
]

# Remove rows with negative total_spent
behavioral_dataset = behavioral_dataset[
    behavioral_dataset['total_spent'] >= 0
]

# Remove rows with missing critical values
behavioral_dataset = behavioral_dataset.dropna(
    subset=['total_spent', 'categories_bought', 'brands_bought']
)

# Display summary of cleaned dataset
print("Summary of cleaned dataset:")
print(behavioral_dataset.info())

# Display the first few rows of the cleaned dataset
behavioral_dataset.head()


Analyze the dataset again to confirm that the inconsistencies were removed

In [None]:
# Import required library
import numpy as np

# Function to check and report issues in the dataset
def check_data_issues(dataset):
    issues_found = False  # Flag to track if any issues are found

    print("=== Dataset Integrity Check ===")

    # Descriptive statistics
    print("\n[INFO] Basic Descriptive Statistics:")
    print(dataset.describe())

    # Check for unique emails
    unique_emails = dataset['email'].nunique()
    total_records = len(dataset)
    print(f"\n[INFO] Unique email addresses: {unique_emails} out of {total_records}")

    # Invalid email formats
    invalid_emails = dataset[~dataset['email'].str.contains(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', na=False)]
    if not invalid_emails.empty:
        issues_found = True
        print("\n[WARNING] Invalid email addresses found:")
        print(invalid_emails)
    else:
        print("\n[INFO] All email addresses are valid.")

    # Negative total_spent
    negative_spent = dataset[dataset['total_spent'] < 0]
    if not negative_spent.empty:
        issues_found = True
        print("\n[WARNING] Rows with negative 'total_spent':")
        print(negative_spent)
    else:
        print("\n[INFO] No negative values in 'total_spent'.")

    # Inconsistencies: zero total_orders with non-zero total_spent
    inconsistent_data = dataset[(dataset['total_orders'] == 0) & (dataset['total_spent'] > 0)]
    if not inconsistent_data.empty:
        issues_found = True
        print("\n[WARNING] Rows where 'total_orders' == 0 but 'total_spent' > 0:")
        print(inconsistent_data)
    else:
        print("\n[INFO] No inconsistencies in 'total_orders' and 'total_spent'.")

    # Check categories_bought and brands_bought for missing values
    if dataset['categories_bought'].isnull().any() or dataset['brands_bought'].isnull().any():
        issues_found = True
        print("\n[WARNING] Missing values found in 'categories_bought' or 'brands_bought'.")
    else:
        print("\n[INFO] No missing values in 'categories_bought' or 'brands_bought'.")

    # Final report
    if issues_found:
        print("\n[RESULT] Issues detected in the dataset. Please review the warnings above.")
    else:
        print("\n[RESULT] All data checks passed. Dataset is clean and ready for analysis.")

# Run the function to check the dataset
check_data_issues(behavioral_dataset)



In this step, we remove the email column from the dataset to ensure customer privacy and focus on anonymized data analysis. The resulting dataset, named behavioral_dataset_noEMAIL, retains all other customer information but excludes email addresses. **This step is essential for data protection and compliance with privacy regulations.**

In [None]:
# Create a new dataset without the email column
behavioral_dataset_noEMAIL = behavioral_dataset.drop(columns=['email'])

# Display the first few rows of the new dataset
behavioral_dataset_noEMAIL.head(20)


### Generate dataset 2 "Preference_dataset"

The Preference Dataset focuses on customer purchasing preferences, capturing insights such as favorite product categories, brands, and price ranges. This dataset will provide valuable information for targeted marketing and segmentation.

Generate dataset

In [None]:
# Import required libraries
import pandas as pd
import random
from faker import Faker

# Initialize Faker
fake = Faker()

# Categories and Brands for the clothing and accessories e-shop
CATEGORIES = ["Tops", "Bottoms", "Dresses", "Outerwear", "Shoes", "Accessories", "Sportswear"]
BRANDS = [
    "Nike", "Adidas", "Puma", "Zara", "H&M", "Gucci", "Prada", "Levi's", "Ralph Lauren", "Under Armour",
    "Calvin Klein", "New Balance", "Tommy Hilfiger", "Versace", "Burberry"
]

# Number of customers (same as in Behavioral Dataset)
num_customers = 2000

# Generate Preference Dataset
preference_data = {
    "customer_id": [i for i in range(1, num_customers + 1)],  # Unique customer IDs
    "top_category": [random.choice(CATEGORIES) for _ in range(num_customers)],  # Most frequent category
    "top_brand": [random.choice(BRANDS) for _ in range(num_customers)],  # Most frequent brand
    "price_preference_range": [random.randint(1, 3) for _ in range(num_customers)],  # 1 = Low, 2 = Mid, 3 = High
    "discount_sensitivity": [round(random.uniform(0.0, 1.0), 2) for _ in range(num_customers)],  # Sensitivity to discounts
    "luxury_preference_score": [random.randint(1, 5) for _ in range(num_customers)]  # Preference for luxury (1-5)
}

# Convert to DataFrame
preference_df = pd.DataFrame(preference_data)


# Save the dataset to a variable for further use
preference_dataset_names = preference_df

# Display a random sample of 20 rows
preference_dataset_names.sample(20)


This code converts categorical features (top category and brand) in the preference dataset to numerical IDs for use in the K-means clustering algorithm, which requires numerical input.
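Integer IDs give K-means numeric input, but they also imply an ordering and a distance between categories that has no real meaning (ID 0 and ID 6 look "further apart" than ID 0 and ID 1). One common alternative, shown here only as a sketch and not as the method this notebook uses, is one-hot encoding with `pandas.get_dummies`. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "top_category": ["Tops", "Shoes", "Tops"],
})

# One binary column per category value; no artificial ordering is introduced
one_hot = pd.get_dummies(toy, columns=["top_category"], dtype=int)
print(one_hot)
```

The trade-off is dimensionality: with 7 categories and 15 brands, one-hot encoding would add 22 binary columns, which changes how much weight these features carry in the distance calculation.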

In [None]:
from IPython.display import display

# Create mapping tables for 'top_category' and 'top_brand'
category_mapping = {category: idx for idx, category in enumerate(CATEGORIES)}
brand_mapping = {brand: idx for idx, brand in enumerate(BRANDS)}

# Save the mapping tables to DataFrames for future use
category_mapping_df = pd.DataFrame(list(category_mapping.items()), columns=["category_name", "category_id"])
brand_mapping_df = pd.DataFrame(list(brand_mapping.items()), columns=["brand_name", "brand_id"])

# Display the mapping tables as tables
print("Category Mapping Table:")
display(category_mapping_df)

print("\nBrand Mapping Table:")
display(brand_mapping_df)

# Convert 'top_category' and 'top_brand' to numeric values in preference_dataset_names
preference_dataset = preference_dataset_names.copy()  # Create a copy to store the results
preference_dataset["top_category"] = preference_dataset_names["top_category"].map(category_mapping)
preference_dataset["top_brand"] = preference_dataset_names["top_brand"].map(brand_mapping)

# Display 20 random rows of the updated dataset as a table
print("\nUpdated Preference Dataset with Numeric Values (Sample of 20 rows):")
display(preference_dataset.sample(20))


### Generate dataset 3 "Inventory_dataset"

The Inventory Dataset contains detailed information about the products available in stock for an e-commerce clothing and accessories store. It is logically connected to the categories and brands used in the Preference Dataset to maintain consistency across datasets. This dataset is essential for identifying products to promote, optimizing stock levels, and calculating profit margins for personalized offers.
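As a rough illustration of how stock levels and profit margins could be combined to pick products to promote, the sketch below computes the margin as a percentage of the retail price and keeps only products with enough stock. The toy rows and the `MIN_STOCK` threshold are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy inventory
toy_inventory = pd.DataFrame({
    "product_id": [1, 2, 3],
    "stock_quantity": [0, 40, 75],
    "retail_price": [1000.0, 2000.0, 500.0],
    "cost_price": [400.0, 500.0, 250.0],
})

# Profit margin as a percentage of the retail price
toy_inventory["profit_margin"] = (
    (toy_inventory["retail_price"] - toy_inventory["cost_price"])
    / toy_inventory["retail_price"] * 100
).round(2)

MIN_STOCK = 10  # hypothetical threshold for "sufficient stock"

# Keep in-stock products and rank them by margin, highest first
candidates = (
    toy_inventory[toy_inventory["stock_quantity"] >= MIN_STOCK]
    .sort_values("profit_margin", ascending=False)
)
print(candidates)
```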

In [None]:
# Import required libraries
import pandas as pd
import random

# Categories, Brands, Colors, and Adjectives
CATEGORIES = ["Tops", "Bottoms", "Dresses", "Outerwear", "Shoes", "Accessories", "Sportswear"]
BRANDS = [
    "Nike", "Adidas", "Puma", "Zara", "H&M", "Gucci", "Prada", "Levi's", "Ralph Lauren", "Under Armour",
    "Calvin Klein", "New Balance", "Tommy Hilfiger", "Versace", "Burberry"
]
ADJECTIVES = ["Classic", "Modern", "Stylish", "Luxury", "Casual", "Comfortable", "Premium"]
COLORS = ["Red", "Blue", "Black", "White", "Green", "Beige", "Pink", "Grey"]

# Number of products
num_products = 1000

# Generate unique product names
unique_product_names = set()

def generate_unique_product_name():
    """Generates a unique product name."""
    while True:
        brand = random.choice(BRANDS)
        category = random.choice(CATEGORIES)
        adjective = random.choice(ADJECTIVES)
        color = random.choice(COLORS)
        product_name = f"{brand} {color} {adjective} {category}"
        if product_name not in unique_product_names:
            unique_product_names.add(product_name)
            return product_name

# Generate Inventory Dataset
inventory_data = {
    "product_id": [i for i in range(1, num_products + 1)],
    "product_name": [generate_unique_product_name() for _ in range(num_products)],
    "category": [],
    "brand": [],
    "stock_quantity": [random.randint(0, 100) for _ in range(num_products)],
    "retail_price": [round(random.uniform(300, 5000), 2) for _ in range(num_products)],
    "cost_price": [],
    "profit_margin": []
}

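# Populate category and brand based on product_name
# Brands can contain spaces (e.g. "Ralph Lauren"), so match the brand by prefix
# instead of taking the first word; categories are always a single word
for product_name in inventory_data["product_name"]:
    brand = next(b for b in BRANDS if product_name.startswith(b + " "))
    inventory_data["brand"].append(brand)
    inventory_data["category"].append(product_name.rsplit(" ", 1)[-1])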

# Calculate cost_price and profit_margin
for i in range(num_products):
    retail_price = inventory_data["retail_price"][i]
    profit_margin = round(random.uniform(50, 100), 2) / 100
    cost_price = round(retail_price * (1 - profit_margin), 2)
    inventory_data["cost_price"].append(cost_price)
    inventory_data["profit_margin"].append(round(profit_margin * 100, 2))

# Convert to DataFrame
inventory_df = pd.DataFrame(inventory_data)

# Save the dataset to a variable for further use
inventory_dataset = inventory_df

# Display a random sample of 20 rows
inventory_dataset.sample(20)


### Merging Preference and Behavioral Datasets

**clustering_dataset_scaled** is the final dataset for segmentation.

Data scaling is essential before applying K-means clustering because the algorithm relies on Euclidean distance to measure the similarity between data points. Features with larger numerical ranges, such as total_spent, will dominate the distance calculations, making features with smaller ranges, like discount_sensitivity, insignificant. By scaling the dataset, all features contribute equally to the clustering process, ensuring balanced and meaningful results.
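The effect of scaling on Euclidean distance can be seen on three hypothetical customers described by total_spent and discount_sensitivity. This is a small sketch with made-up values, not output from the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: columns are total_spent and discount_sensitivity
X = np.array([
    [1000.0, 0.10],
    [1010.0, 0.90],   # similar spending, very different discount sensitivity
    [3000.0, 0.12],   # very different spending, similar discount sensitivity
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: the distance is driven almost entirely by total_spent
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])

# Standardized features: both columns have mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

On the raw values, customer 1 looks almost identical to customer 0 despite the large gap in discount sensitivity. After standardization the two distances are of comparable size, so both features influence the cluster assignment.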

In [None]:
# Merge preference_dataset and behavioral_dataset_noEMAIL on 'customer_id'
combined_dataset = pd.merge(preference_dataset, behavioral_dataset_noEMAIL, on="customer_id", how="inner")

# Display 20 random rows from the merged dataset
print("\nCombined Dataset (Sample of 20 rows):")
combined_dataset.sample(20)



This code prepares a dataset for clustering by performing several key preprocessing steps. It begins by checking for missing values and removing incomplete rows to ensure data integrity. Next, numeric features are standardized using StandardScaler to ensure all variables contribute equally to the clustering analysis, while preserving the customer_id column to retain customer identity. The final standardized dataset is displayed for verification and further use in clustering algorithms.

In [None]:
# Import required libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 1: Check the dataset structure
#print("Combined Dataset Overview:")
#print(combined_dataset.info())

# Step 2: Check for missing values in each column
print("\nChecking for missing values in the dataset:")
print(combined_dataset.isnull().sum())

# Step 3: Summary statistics to identify outliers or inconsistencies
print("\nSummary statistics for numeric columns:")
print(combined_dataset.describe())

# Step 4: Handle missing values (keep customer_id intact)
clustering_dataset = combined_dataset.dropna()
print("\nAfter removing rows with missing values:")
print(clustering_dataset.isnull().sum())

# Step 5: Standardize numeric columns to prepare for clustering (ignore customer_id)
columns_to_scale = clustering_dataset.drop(columns=["customer_id"]).columns
scaler = StandardScaler()
scaled_data = scaler.fit_transform(clustering_dataset[columns_to_scale])

# Step 6: Combine scaled data with customer_id
clustering_dataset_scaled = pd.DataFrame(scaled_data, columns=columns_to_scale)
clustering_dataset_scaled.insert(0, "customer_id", clustering_dataset["customer_id"].values)

# Step 7: Display a random sample of 20 rows of the standardized dataset
print("\nClustering Dataset after Scaling (Sample of 20 rows):")
clustering_dataset_scaled.sample(20)

## Identify Ideal Number of Clusters and Perform K-means Analysis



*   Determine the optimal number of clusters using the Elbow Method and Silhouette Score. Send the results of both analyses to GEMINI for cluster count recommendations.
*   Perform the K-means clustering algorithm on the prepared dataset.
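Both diagnostics are easier to read on data where the right answer is known. The sketch below uses synthetic blobs with three well-separated centers (an assumption made purely for illustration) and computes inertia and silhouette score for several values of k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

inertia_by_k = {}
silhouette_by_k = {}
for k in range(2, 7):  # silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    inertia_by_k[k] = model.inertia_                   # within-cluster sum of squares
    silhouette_by_k[k] = silhouette_score(X, labels)   # cohesion vs. separation

# The k with the highest silhouette score is one reasonable candidate
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print("Best k by silhouette:", best_k)
```

Inertia keeps falling as k grows, which is why the elbow is read as the point where the decrease flattens out, while the silhouette score peaks at a specific k. On real customer data the two methods often disagree or give a weak signal, which is the situation the AI recommendation step is meant to help interpret.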





### Determine the optimal number of clusters --- GEMINI - gemini-2.0-flash-exp model

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import google.generativeai as genai
import gradio as gr
from google.colab import userdata

# Step 1: Load the dataset
# Assume clustering_dataset_scaled is already in memory (Colab environment)
# Drop the 'customer_id' column since it is not used for clustering
features = clustering_dataset_scaled.drop(columns=["customer_id"]).values

# Step 2: Elbow Method to calculate Inertia
# Inertia measures how tightly clustered the data points are around the centroids
inertia = []
K = range(1, 11)  # Check for cluster counts from 1 to 10
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(features)
    inertia.append(kmeans.inertia_)

# Step 3: Silhouette Score for cluster quality
# Silhouette Score measures how similar a point is to its own cluster compared to other clusters
silhouette_scores = []
k_values = range(2, 11)  # Silhouette Score is undefined for k=1
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(features)
    score = silhouette_score(features, labels)
    silhouette_scores.append(score)

# Step 4: Gemini API to Recommend Optimal Clusters
# Load the API key from Colab secrets instead of hardcoding it in the notebook.
# Assumption: the key is stored under the secret name "GOOGLE_API_KEY".
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

def get_optimal_clusters_from_ai(elbow_data, silhouette_data):
    """
    Uses Gemini API to analyze Elbow and Silhouette results and recommend the optimal k.
    """
    # Prompt for the Gemini model
    prompt = f"""
    Based on the following clustering data:
    Elbow Method Inertia Values: {elbow_data}
    Silhouette Scores: {silhouette_data}

    Please analyze and recommend the optimal number of clusters (k).
    Provide the most appropriate number, but if it is unclear, mention the range of k values like '5-6'.
    Use the following format for your answer:
    - **Elbow Method**: [number]
    - **Silhouette Score**: [number]

    Provide a maximum of 2 bullet points explaining each method and finish with one clear concluding sentence.
    """
    # Initialize Gemini model
    model = genai.GenerativeModel("gemini-2.0-flash-exp")


    # Generate content using the Gemini model
    # response = model.generate_content(prompt)
    response = model.generate_content(prompt, generation_config={"temperature": 0.1})
    # Return the generated response text
    return response.text


# Call the API to get the optimal number of clusters
ai_response = get_optimal_clusters_from_ai(inertia, silhouette_scores)

# Step 5: Visualization Function
def generate_visuals(k):
    """
    Generates Elbow and Silhouette Score plots with user-selected k value.
    """
    # Elbow Method Visualization
    fig1, ax1 = plt.subplots()
    ax1.plot(K, inertia, 'bx-', label='Elbow Method')
    ax1.axvline(x=k, color='r', linestyle='--', label=f'Selected k={k}')
    ax1.set_xlabel('Number of Clusters (k)')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method')
    ax1.legend()

    # Silhouette Score Visualization
    fig2, ax2 = plt.subplots()
    ax2.plot(k_values, silhouette_scores, 'bx-', label='Silhouette Score')
    ax2.axvline(x=k, color='r', linestyle='--', label=f'Selected k={k}')
    ax2.set_xlabel('Number of Clusters (k)')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Score')
    ax2.legend()

    return fig1, fig2, f"You selected k={k} as the number of clusters.\n\nAI Analysis:\n{ai_response}"

# Step 6: Gradio Interface for Visualization and AI Recommendation
def gradio_interface(k):
    fig1, fig2, message = generate_visuals(k)
    return message, fig1, fig2

# Step 7: Launch Gradio Interface
initial_k = 5  # Default initial k to display when Gradio loads

# Global variable to store the slider value; it starts at the default so that
# it holds a usable value even if the slider is never moved
selected_k = initial_k

with gr.Blocks() as interface:
    # AI Explanation Section
    gr.Markdown(f"### AI Analysis Explanation\n{ai_response}")

    # Row for Graphs
    with gr.Row():
        elbow_plot = gr.Plot(label="Elbow Method")
        silhouette_plot = gr.Plot(label="Silhouette Score")

    # Slider Section
    with gr.Row():
        slider = gr.Slider(2, 10, step=1, label="Select Number of Clusters", value=initial_k)

    # Store the slider value in a global variable
    def update_selected_k(k):
        global selected_k
        selected_k = k
        print(f"Selected value of k: {selected_k}")  # Visible in the Colab log

    # Update function for slider
    def update_visuals(k):
        update_selected_k(k)  # Store the slider value
        fig1, fig2, _ = generate_visuals(k)
        return fig1, fig2

    # Trigger updates on slider change
    slider.change(update_visuals, inputs=[slider], outputs=[elbow_plot, silhouette_plot])

# Launch Gradio
interface.launch(prevent_thread_lock=False)

The selected number of clusters is stored in the variable selected_k.
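Because the value comes from a UI callback, it may be worth validating it before passing it to K-means: it could be unset if the callback never ran, or arrive as a float. The helper below is a hypothetical sketch (the name `resolve_k` and the fallback rule are assumptions, not part of the notebook) that falls back to the k with the best silhouette score.

```python
def resolve_k(selected, silhouette_by_k, k_min=2, k_max=10):
    """Return a valid integer k, falling back to the best silhouette score."""
    if selected is None:
        # No user choice available: use the k with the highest silhouette score
        return max(silhouette_by_k, key=silhouette_by_k.get)
    k = int(selected)  # a slider value may arrive as a float such as 5.0
    if not k_min <= k <= k_max:
        raise ValueError(f"k={k} is outside the supported range {k_min}-{k_max}")
    return k

# Hypothetical silhouette scores keyed by k
scores = {2: 0.10, 3: 0.40, 4: 0.20}
print(resolve_k(None, scores))
print(resolve_k(5.0, scores))
```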

In [None]:
print(f"Selected value of k: {selected_k}")

### Clustering K-means

This code performs K-means clustering on the customer dataset to group customers into distinct segments. It takes the number of clusters chosen in the previous step (selected_k, informed by the Elbow and Silhouette analyses and the AI recommendation) and applies the K-means algorithm to assign each customer to a cluster.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.cluster import KMeans

# Load the dataset and drop the 'customer_id' column
data = clustering_dataset_scaled.drop(columns=['customer_id'])

# Get the number of clusters from the selected_k variable
n_clusters = selected_k

# Initialize the KMeans model
kmeans = KMeans(n_clusters=n_clusters, random_state=42)

# Fit the model to the data
kmeans.fit(data)

# Get the cluster labels for each data point
labels = kmeans.labels_

# Create a copy of the original dataset to avoid overwriting
clustered_data = clustering_dataset_scaled.copy()

# Add the cluster labels to the new dataset
clustered_data['cluster'] = labels

# Display the first few rows of the new dataset with cluster assignments
clustered_data.head()


Create a new dataset **FINAL_DATASET** in which the scaled values used for K-means clustering are replaced by the original values, joined back on customer_id, for easier interpretation and cluster naming.
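When the original tables are not at hand, the fitted scaler itself can also map standardized values back to their original units. The sketch below shows `StandardScaler.inverse_transform` on hypothetical values; it is an alternative for illustration, whereas the notebook recovers the originals by joining on customer_id.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical original features (e.g. total_spent, luxury_preference_score)
original = np.array([
    [100.0, 1.0],
    [200.0, 3.0],
    [600.0, 5.0],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(original)      # values used for clustering
restored = scaler.inverse_transform(scaled)  # back to original units

print(np.round(restored, 2))
```

The join used in the notebook has one advantage over this approach: it also brings back the readable category and brand names, which the scaler never saw.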

In [None]:
# Creating the FINAL_CLUSTERED dataset with customer_id and cluster
FINAL_CLUSTERED = clustered_data[['customer_id', 'cluster']]

# Performing LEFT JOIN with behavioral_dataset_noEMAIL, using 'customer_id' for both
FINAL_DATASET = pd.merge(
    FINAL_CLUSTERED,
    behavioral_dataset_noEMAIL,
    left_on='customer_id',
    right_on='customer_id',  # Explicitly joining on 'customer_id' in both DataFrames
    how='left'
)

# Performing LEFT JOIN with preference_dataset_names, using 'customer_id' for both
FINAL_DATASET = pd.merge(
    FINAL_DATASET,
    preference_dataset_names,
    left_on='customer_id',
    right_on='customer_id',  # Explicitly joining on 'customer_id' in both DataFrames
    how='left'
)

# Displaying the first 5 rows of the resulting dataset
FINAL_DATASET.head()


AI ANALYSIS - This code leverages Google's Gemini AI to generate descriptive names and characteristics for each customer cluster identified through K-means clustering. It prepares a prompt describing the customer features and data structure, then sends it to the AI for analysis and name suggestions.
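Asking the model for a fixed output format makes the reply machine-readable. As a minimal sketch, the regular expression below extracts cluster IDs, names, and characteristics from a hypothetical response. The sample text is invented, and real model output may deviate from the requested format (for example by adding bold markers), in which case no match is found for that cluster.

```python
import re

# Hypothetical model response following the requested structure
sample_response = (
    "Cluster 0: Bargain Hunters\n"
    "- Key Characteristics:\n"
    "  - High discount sensitivity\n"
    "  - Low average order value\n"
    "Cluster 1: Premium Loyalists\n"
    "- Key Characteristics:\n"
    "  - High luxury preference"
)

PATTERN = r"Cluster (\d+): (.*?)\n- Key Characteristics:\n((?:  - .*?\n)+)"

def parse_clusters(text):
    """Return {cluster_id: (name, characteristics)} parsed from the response."""
    # Ensure a trailing newline so the last bullet line is also matched
    text = text.rstrip() + "\n"
    parsed = {}
    for cluster_id, name, bullets in re.findall(PATTERN, text):
        parsed[int(cluster_id)] = (name.strip(), bullets.strip())
    return parsed

parsed = parse_clusters(sample_response)
for cluster_id, (name, bullets) in parsed.items():
    print(cluster_id, name)
    print(bullets)
```

Clusters that fail to parse end up with missing names after mapping, so it may be useful to compare the parsed IDs against the cluster labels actually present in the data.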

In [None]:
import re
import google.generativeai as genai
import pandas as pd
from google.colab import userdata

# Configure Gemini AI
# Assumption: the key is stored in Colab secrets under the name "GOOGLE_API_KEY"
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')  # Load the API key from stored secrets
genai.configure(api_key=GOOGLE_API_KEY)

# Prompt template for AI
prompt_template = """
You are an expert in customer behavior analysis and clustering.
We have performed K-means clustering on customer data, and the dataset contains the following features for each customer:

- **Total spent**: Total amount spent by the customer.
- **Total orders**: Total number of orders placed by the customer.
- **Average order value**: Average amount spent per order.
- **Last purchase days ago**: Number of days since the customer's last purchase.
- **Categories bought**: Number of unique product categories purchased.
- **Brands bought**: Number of unique brands purchased.
- **Top category**: Most frequently purchased product category (e.g., Dresses, Sportswear).
- **Top brand**: Most frequently purchased brand (e.g., Levi's, Zara).
- **Price preference range**:
  - 1 = Low (up to 500 CZK)
  - 2 = Mid (501–2000 CZK)
  - 3 = High (>2000 CZK)
- **Discount sensitivity**: A value between 0–1 indicating how often the customer buys discounted products.
- **Luxury preference score**: A score from 1–5 indicating preference for premium products.

**Objective**: Propose descriptive names for each cluster based on customer behavior and preferences. Provide the output in the following structured format:
Cluster [ID]: [Cluster Name]
- Key Characteristics:
  - [Characteristic 1]
  - [Characteristic 2]
  - ...
"""

# Generate the prompt dynamically based on cluster data
def generate_prompt(data):
    cluster_summary = []
    clusters = data["cluster"].unique()

    for cluster in clusters:
        cluster_data = data[data["cluster"] == cluster]

        # Calculate average values for numeric columns
        avg_values = cluster_data[
            ["total_spent", "total_orders", "avg_order_value", "last_purchase_days_ago",
             "categories_bought", "brands_bought", "price_preference_range",
             "discount_sensitivity", "luxury_preference_score"]
        ].mean()

        # Get dominant values for categorical columns
        dominant_values = {
            "top_category": cluster_data["top_category"].mode()[0] if not cluster_data["top_category"].mode().empty else None,
            "top_brand": cluster_data["top_brand"].mode()[0] if not cluster_data["top_brand"].mode().empty else None
        }

        # Combine numeric and categorical aggregates
        summary = {**avg_values.to_dict(), **dominant_values}

        # Create cluster summary
        cluster_summary.append(
            f"Cluster {cluster}:\n- Values: {summary}"
        )

    return prompt_template + "\n" + "\n".join(cluster_summary)

# Call Gemini API for cluster naming
def call_gemini_api(prompt):
    model = genai.GenerativeModel("gemini-pro")
    response = model.generate_content(prompt, generation_config={"temperature": 0.2})
    return response.text

# Extract cluster names and descriptions from AI response
def parse_ai_response(response):
    cluster_names = {}
    cluster_descriptions = {}

    # Match structured format for cluster information
    clusters = re.findall(r"Cluster (\d+): (.*?)\n- Key Characteristics:\n((?:  - .*?\n)+)", response)
    for cluster_id, cluster_name, cluster_description in clusters:
        cluster_names[int(cluster_id)] = cluster_name
        cluster_descriptions[int(cluster_id)] = cluster_description.strip()

    return cluster_names, cluster_descriptions

# Main execution
def main():
    global FINAL_DATASET_AI_CLUSTER_NAMES  # Ensure the dataset is globally accessible

    # Assume FINAL_DATASET is already loaded in Colab
    if 'FINAL_DATASET' not in globals():
        raise ValueError("FINAL_DATASET is not loaded. Please ensure the dataset is available.")

    # Generate the prompt
    prompt = generate_prompt(FINAL_DATASET)
    print("\nGenerated Prompt for AI:\n")
    print(prompt)

    # Call Gemini API and get the response
    print("\nAI Response:\n")
    ai_response = call_gemini_api(prompt)
    print(ai_response)

    # Parse the AI response
    cluster_names, cluster_descriptions = parse_ai_response(ai_response)

    # Create a new dataset with added columns
    FINAL_DATASET_AI_CLUSTER_NAMES = FINAL_DATASET.copy()
    FINAL_DATASET_AI_CLUSTER_NAMES["cluster_name"] = FINAL_DATASET["cluster"].map(cluster_names)
    FINAL_DATASET_AI_CLUSTER_NAMES["cluster_description"] = FINAL_DATASET["cluster"].map(cluster_descriptions)

    # Display the updated dataset
    print("\nUpdated Dataset with Cluster Names and Descriptions (First 10 Rows):\n")
    display(FINAL_DATASET_AI_CLUSTER_NAMES.head(10))

# Run the main function
main()
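
The parser above only succeeds when the model reply follows the structured format exactly. The sketch below runs the same regular expression on a hypothetical reply (the cluster names and characteristics are invented for illustration) to show what the two dictionaries look like.

```python
import re

# Hypothetical model reply that follows the structured format the parser expects.
sample_response = (
    "Cluster 0: Bargain Hunters\n"
    "- Key Characteristics:\n"
    "  - High discount sensitivity\n"
    "  - Low average order value\n"
    "Cluster 1: Luxury Loyalists\n"
    "- Key Characteristics:\n"
    "  - High luxury preference score\n"
)

# Same pattern as in parse_ai_response
pattern = r"Cluster (\d+): (.*?)\n- Key Characteristics:\n((?:  - .*?\n)+)"

cluster_names = {}
cluster_descriptions = {}
for cluster_id, name, description in re.findall(pattern, sample_response):
    cluster_names[int(cluster_id)] = name
    cluster_descriptions[int(cluster_id)] = description.strip()

print(cluster_names)
print(cluster_descriptions)
```

If the reply deviates from this layout (for example, extra markdown around the cluster line), the pattern finds nothing, both dictionaries stay empty, and the mapped `cluster_name` and `cluster_description` columns end up as NaN.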


#### Analyse if the segmentation is correct

**PCA Visualization (2D Cluster Plot)**

What to look for:
- Well-separated clusters indicate high segmentation quality.
- Overlapping clusters suggest potential issues with the segmentation logic or the number of clusters.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

def visualize_clusters(data):
    # Select numeric columns for PCA
    numeric_columns = [
        "total_spent", "total_orders", "avg_order_value",
        "last_purchase_days_ago", "categories_bought", "brands_bought",
        "price_preference_range", "discount_sensitivity", "luxury_preference_score"
    ]

    # Perform PCA once and reuse the projection for both components
    pca = PCA(n_components=2)
    components = pca.fit_transform(data[numeric_columns])
    data["pca_1"] = components[:, 0]
    data["pca_2"] = components[:, 1]

    # Plot the clusters
    plt.figure(figsize=(12, 8))
    sns.scatterplot(
        x="pca_1", y="pca_2", hue="cluster_name",
        data=data, palette="Set2", s=100, alpha=0.8
    )

    # Add cluster names as text annotations
    for cluster, row in data.groupby("cluster_name"):
        plt.text(
            row["pca_1"].mean(), row["pca_2"].mean(),
            cluster, fontsize=10, weight="bold", ha="center", va="center",
            bbox=dict(facecolor="white", alpha=0.8, boxstyle="round,pad=0.3")
        )

    # Improved axis labels and title
    plt.title("Customer Segments (PCA Visualization)", fontsize=16)
    plt.xlabel("Principal Component 1 (Explains Most Variance)", fontsize=12)
    plt.ylabel("Principal Component 2 (Secondary Variance Explanation)", fontsize=12)
    plt.legend(title="Cluster Name", bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.show()

# Call the function
visualize_clusters(FINAL_DATASET_AI_CLUSTER_NAMES)



**Silhouette Score**

What to look for:
- A value close to 1 means clusters are well separated and cohesive.
- A value close to 0 means clusters are overlapping.
- A negative value indicates poorly assigned points.

In [None]:
from sklearn.metrics import silhouette_score

# Silhouette Score Calculation
numeric_columns = [
    "total_spent", "total_orders", "avg_order_value",
    "last_purchase_days_ago", "categories_bought", "brands_bought",
    "price_preference_range", "discount_sensitivity", "luxury_preference_score"
]

silhouette_avg = silhouette_score(FINAL_DATASET_AI_CLUSTER_NAMES[numeric_columns],
                                  FINAL_DATASET_AI_CLUSTER_NAMES["cluster"])
print(f"Silhouette Score: {silhouette_avg:.3f}")
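
To make the interpretation of the score concrete, here is a minimal pure-Python sketch of the silhouette value, (b - a) / max(a, b), where a is the mean distance to points in the same cluster and b is the mean distance to the nearest other cluster. The four points and both labelings are made up for illustration; the notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette_values(points, labels):
    """Return the silhouette value (b - a) / max(a, b) for every point."""
    values = []
    for i, p in enumerate(points):
        # Collect distances from point i to all other points, grouped by cluster label.
        by_label = {}
        for j, q in enumerate(points):
            if i != j:
                by_label.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            values.append(0.0)  # singleton cluster or a single cluster: treated as 0 here
            continue
        a = sum(own) / len(own)                              # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_label.values())  # mean distance to nearest other cluster
        values.append((b - a) / max(a, b))
    return values

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(points, [0, 0, 1, 1])  # labels follow the two visible groups
bad = silhouette_values(points, [0, 1, 0, 1])   # labels cut across the groups

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print(f"good labeling: {mean_good:.3f}, bad labeling: {mean_bad:.3f}")
```

The labeling that follows the two groups scores close to 1, while the labeling that cuts across them scores below 0, which matches the reading guide above.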


**Cluster Feature Heatmap**

What to look for:
- The heatmap shows average values of features for each cluster.
- Significant differences between clusters indicate well-defined segmentation.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate average values for each cluster
cluster_means = FINAL_DATASET_AI_CLUSTER_NAMES.groupby("cluster_name")[numeric_columns].mean()

# Heatmap Visualization
plt.figure(figsize=(12, 8))
sns.heatmap(cluster_means, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title("Cluster Averages Heatmap")
plt.show()


**Inter-Cluster Distance Map**

What to look for:
- A distance matrix shows how far apart clusters are from each other.
- Well-separated clusters will have larger distances.

In [None]:
from scipy.spatial.distance import cdist
import numpy as np

# Calculate cluster centroids
centroids = FINAL_DATASET_AI_CLUSTER_NAMES.groupby("cluster")[numeric_columns].mean()

# Compute pairwise distances between centroids
distance_matrix = cdist(centroids, centroids, metric="euclidean")

# Display Distance Matrix as Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(distance_matrix, annot=True, cmap="coolwarm", fmt=".2f", xticklabels=centroids.index, yticklabels=centroids.index)
plt.title("Inter-Cluster Distance Heatmap")
plt.show()
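
As a small worked example of what the distance matrix contains, the sketch below computes two centroids and their Euclidean distance with plain Python. The rows are invented (two features standing in for `total_spent` and `total_orders`); the notebook itself uses `groupby(...).mean()` and `cdist`.

```python
import math
from collections import defaultdict

# Hypothetical (cluster, [total_spent, total_orders]) rows
rows = [
    (0, [100.0, 2.0]),
    (0, [120.0, 4.0]),
    (1, [900.0, 20.0]),
    (1, [1100.0, 24.0]),
]

# Group feature vectors by cluster label
grouped = defaultdict(list)
for cluster, features in rows:
    grouped[cluster].append(features)

# Centroid = column-wise mean of the members of each cluster
centroids = {
    cluster: [sum(col) / len(col) for col in zip(*members)]
    for cluster, members in grouped.items()
}

distance = math.dist(centroids[0], centroids[1])
print(centroids, round(distance, 1))
```

Note that in this toy data the distance is driven almost entirely by the first feature, because its values are much larger. If the features passed to the distance computation are not scaled, large-magnitude columns can dominate the matrix in the same way.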


**Box Plot for Feature Distributions**

What to look for:
- Box plots show how feature values are distributed across clusters.
- Clearly different distributions confirm meaningful segmentation.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot for Total Spent by Cluster
plt.figure(figsize=(12, 6))
sns.boxplot(x="cluster_name", y="total_spent", data=FINAL_DATASET_AI_CLUSTER_NAMES)
plt.title("Distribution of Total Spent by Cluster")
plt.xticks(rotation=45)
plt.show()


# Identifying Target Customers and Creating Customized Email Marketing

## STEP 1 - SELECT PRODUCTS FROM STOCK

This code lets users filter the inventory dataset by stock quantity and/or profit margin and displays the top products in an interactive Gradio table. In the version below, the filtered table itself is stored in the global `selected_items` dataset, which later steps (customer targeting and email campaigns) read as the list of products to promote.

In [None]:
import gradio as gr
import pandas as pd

# Initialize a global variable to store selected items as a dataset
selected_items = pd.DataFrame()

# Function to filter inventory
def filter_inventory(filter_stock, filter_profit, num_products):
    """
    Filter Inventory based on stock and/or profit margin and store as a dataset.
    """
    # Copy the dataset to avoid modifying the original
    filtered_data = inventory_dataset.copy()

    # Apply filters
    if filter_stock:
        filtered_data = filtered_data.nlargest(num_products, "stock_quantity")
    if filter_profit:
        filtered_data = filtered_data.nlargest(num_products, "profit_margin")

    # Reset index for clean output
    filtered_data.reset_index(drop=True, inplace=True)

    # Store the filtered dataset globally
    global selected_items
    selected_items = filtered_data

    # Return filtered results as a DataFrame for display
    return filtered_data[["product_id", "product_name", "category", "brand",
                          "stock_quantity", "profit_margin", "retail_price", "cost_price"]]

# Gradio interface
def gradio_interface():
    """
    Gradio interface for interactive inventory filtering and selection.
    """
    with gr.Blocks() as demo:
        # Title
        gr.Markdown("## Inventory Analysis and Selection for Promotion")

        # Row for filters
        with gr.Row():
            filter_stock = gr.Checkbox(label="Filter Top by Stock Quantity", value=True)
            filter_profit = gr.Checkbox(label="Filter Top by Profit Margin", value=False)
            num_products = gr.Slider(1, 10, step=1, label="Number of Products", value=10)

        # Table for displaying filtered results
        filtered_table = gr.DataFrame(label="Filtered Products",
                                      headers=["Product ID", "Product Name", "Category", "Brand",
                                               "Stock Quantity", "Profit Margin (%)",
                                               "Retail Price", "Cost Price"])

        # Button to trigger filtering
        filter_button = gr.Button("Apply Filters")

        # Functionality: Apply filters and display results
        def filter_and_display(filter_stock, filter_profit, num_products):
            filtered_results = filter_inventory(filter_stock, filter_profit, num_products)
            return filtered_results

        # Connect the filter button to the filtering function
        filter_button.click(filter_and_display,
                            inputs=[filter_stock, filter_profit, num_products],
                            outputs=[filtered_table])

    # Launch Gradio interface
    demo.launch()

# Launch the Gradio interface
gradio_interface()
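
When both checkboxes are ticked, `filter_inventory` applies the two "top N" selections one after the other, so the second selection only reorders the rows that survived the first. The sketch below mimics that logic with plain Python lists and `heapq.nlargest` in place of the DataFrame; the four products are made up for illustration.

```python
import heapq

# Hypothetical inventory rows
inventory = [
    {"product_id": 1, "stock_quantity": 500, "profit_margin": 10},
    {"product_id": 2, "stock_quantity": 400, "profit_margin": 35},
    {"product_id": 3, "stock_quantity": 50, "profit_margin": 60},
    {"product_id": 4, "stock_quantity": 300, "profit_margin": 25},
]

def top_n(rows, n, key):
    """Return the n rows with the largest value of `key`, largest first."""
    return heapq.nlargest(n, rows, key=lambda row: row[key])

# Step 1: top 2 by stock keeps products 1 and 2
by_stock = top_n(inventory, 2, "stock_quantity")
# Step 2: top 2 by margin, taken only from the survivors of step 1
both = top_n(by_stock, 2, "profit_margin")

ids = [row["product_id"] for row in both]
print(ids)
```

Product 3 has the highest margin but never reaches the second step because of its low stock. If the intent is a combined ranking rather than a sequential one, the two criteria would need to be merged into a single score first.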





Check selected items in selected_items dataset

In [None]:
selected_items.head(10)

## STEP 2 - AI TO LINK CUSTOMERS BASED ON SEGMENTATION WITH SELECTED PRODUCTS

**AI limit: only 100 records.** This code identifies customers for the selected products using Gemini AI. It dynamically generates a prompt that combines product details (product name, brand, category, stock quantity, and profit margin) from `selected_items` with customer segmentation data (cluster names, preferences, and behavioral metrics) from a random sample of 100 rows of `FINAL_DATASET_AI_CLUSTER_NAMES`. The AI analyzes this information and returns a ranked list of the top 5 customers per product, along with the reason for each selection. The results, including product details, customer IDs, brands, categories, and AI-provided reasons, are stored in the `selected_customers_for_selected_items` dataset for further analysis and use.

In [None]:
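import os
import pandas as pd
import google.generativeai as genai
import re

# Configure Gemini AI
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")  # Load API key from environment; never hardcode it
if not GOOGLE_API_KEY:
    raise ValueError("Set the GOOGLE_API_KEY environment variable before running this cell.")
genai.configure(api_key=GOOGLE_API_KEY)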

# Ensure required datasets are loaded
if 'selected_items' not in globals() or 'FINAL_DATASET_AI_CLUSTER_NAMES' not in globals():
    raise ValueError("Make sure `selected_items` and `FINAL_DATASET_AI_CLUSTER_NAMES` datasets are available.")

# Limit the data to 100 random rows
filtered_customers = FINAL_DATASET_AI_CLUSTER_NAMES.sample(n=100, random_state=42)

# Updated Prompt Template
prompt_template = """
You are an expert in customer segmentation and personalized marketing.

We have the following customer data:
- Cluster Name and Description: High-level segmentation of customers.
- Top Category and Top Brand: Preferences for product categories and brands.
- Behavioral Metrics: Total spent, discount sensitivity, luxury preference score.

We also have the following products to promote:
- Category and Brand: The type of product.
- Stock Quantity and Profit Margin: Inventory details.

Your task:
1. Match customers to products based on category and brand preferences.
2. Rank customers based on likelihood to purchase, using behavioral metrics.
3. For each product, recommend the top 5 customers and provide reasons for each recommendation.

Your response should follow this exact structured format for each product:

Product [Product Name], Product ID [Product ID]:
- Customer ID [Customer ID]: [Reason for selection]
"""

# Generate the AI prompt dynamically
def generate_prompt(selected_items, filtered_customers):
    product_details = []
    for _, product in selected_items.iterrows():
        details = f"Product {product['product_name']}, Product ID {product['product_id']} (Category: {product['category']}, Brand: {product['brand']}, " \
                  f"Stock Quantity: {product['stock_quantity']}, Profit Margin: {product['profit_margin']}%)"
        product_details.append(details)

    customer_data = filtered_customers[[
        "customer_id", "cluster_name", "cluster_description",
        "top_category", "top_brand", "total_spent",
        "discount_sensitivity", "luxury_preference_score"
    ]].to_dict(orient="records")

    prompt = prompt_template + "\n\nProducts:\n" + "\n".join(product_details) + "\n\nCustomers:\n" + str(customer_data)
    return prompt

# Call Gemini AI to recommend customers
def call_gemini_ai(prompt):
    model = genai.GenerativeModel("gemini-pro")
    response = model.generate_content(prompt, generation_config={"temperature": 0.3})
    return response.text

# Parse AI response into a structured format
def parse_ai_response(response):
    recommendations = []
    # Find all products
    product_matches = re.findall(r"Product (.*?), Product ID (\d+)", response)

    for product_name, product_id in product_matches:
        # Find the section for the given product
        product_section = re.search(
            rf"Product {re.escape(product_name)}, Product ID {product_id}.*?:(.*?)(?=Product|$)",
            response, re.DOTALL
        )

        if product_section:
            # Find all customers and their reasons
            customer_lines = re.findall(r"- Customer ID (\d+): (.*?)\n", product_section.group(1))
            for customer_id, reason in customer_lines:
                # Look up the product details in `selected_items`
                product_details = selected_items[selected_items["product_id"] == int(product_id)].iloc[0]
                recommendations.append({
                    "product_name": product_name,
                    "product_id": int(product_id),
                    "customer_id": int(customer_id),
                    "reason": reason.strip(),
                    "brand": product_details["brand"],
                    "category": product_details["category"]
                })

    return pd.DataFrame(recommendations)


# Main execution
def main():
    # Generate AI prompt
    print("Generating prompt for AI...")
    prompt = generate_prompt(selected_items, filtered_customers)
    print("\nGenerated Prompt:\n", prompt)

    # Call Gemini AI and get recommendations
    print("\nCalling Gemini AI...")
    ai_response = call_gemini_ai(prompt)
    print("\nAI Response:\n", ai_response)

    # Parse the AI response into a structured dataset
    print("\nParsing AI Response...")
    global selected_customers_for_selected_items
    selected_customers_for_selected_items = parse_ai_response(ai_response)

    # Display the dataset
    if not selected_customers_for_selected_items.empty:
        print("\nRecommended Customers for Selected Products (First 10 Rows):")
        print(selected_customers_for_selected_items.head(10))
    else:
        print("\nNo recommendations could be parsed from the AI response.")

# Run the main function
main()
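
The response format requested in the prompt (a product header followed by `- Customer ID ...` lines) can also be read line by line. The sketch below is an alternative to the regex-section approach above, not the notebook's own parser; the sample reply and the helper name `parse_recommendations` are hypothetical.

```python
import re

# Hypothetical model reply in the format requested by the prompt
sample_response = (
    "Product Slim Jeans, Product ID 101:\n"
    "- Customer ID 7: Top brand matches and low discount sensitivity\n"
    "- Customer ID 12: Frequently buys this category\n"
    "\n"
    "Product Wool Coat, Product ID 205:\n"
    "- Customer ID 3: High luxury preference score\n"
)

header_re = re.compile(r"^Product (.+), Product ID (\d+):\s*$")
customer_re = re.compile(r"^- Customer ID (\d+):\s*(.+)$")

def parse_recommendations(text):
    """Walk the reply line by line, tracking which product header is active."""
    rows, current = [], None
    for line in text.splitlines():
        line = line.strip()
        header = header_re.match(line)
        if header:
            current = (header.group(1), int(header.group(2)))
            continue
        customer = customer_re.match(line)
        if customer and current:
            rows.append({
                "product_name": current[0],
                "product_id": current[1],
                "customer_id": int(customer.group(1)),
                "reason": customer.group(2).strip(),
            })
    return rows

rows = parse_recommendations(sample_response)
for row in rows:
    print(row)
```

Because each line is matched on its own, a reason that happens to contain the word "Product" does not cut the section short, and the last customer line is picked up even without a trailing newline.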


In [None]:
selected_customers_for_selected_items.head(20)

## STEP 3 - Generate Personalized Marketing Emails TEXT for Selected Products

This code dynamically generates personalized email campaigns for customers based on the selected products. It uses Gemini AI to create an email template (subject line and HTML body) for each product, emphasizing a 20% discount and the product's features. The emails are customized for each customer by replacing the `{{customer_id}}` placeholder with the customer's ID, and are stored in the `selected_customers_for_selected_items` dataset in two new columns, `subject` and `body`. This dataset can be used for further actions, such as sending the emails.

In [None]:
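import os
import re
import pandas as pd
import google.generativeai as genai

# Configure Gemini AI
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")  # Load API key from environment; never hardcode it
if not GOOGLE_API_KEY:
    raise ValueError("Set the GOOGLE_API_KEY environment variable before running this cell.")
genai.configure(api_key=GOOGLE_API_KEY)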

# Ensure required datasets are loaded
if 'selected_customers_for_selected_items' not in globals() or 'inventory_dataset' not in globals():
    raise ValueError("Ensure `selected_customers_for_selected_items` and `inventory_dataset` datasets are available.")

# Function to generate email prompt
def generate_email_prompt(product_name, category, brand, retail_price, discounted_price):
    return f"""
You are an expert email marketing copywriter. Create an engaging email in HTML format to promote a product.

Product Details:
- Name: {product_name}
- Category: {category}
- Brand: {brand}
- Original Price: {retail_price} CZK
- Discounted Price: {discounted_price} CZK

Requirements:
1. Write a subject line for the email that grabs attention.
2. Write an HTML body with the following sections:
   - A warm greeting for the customer.
   - Highlight the 20% discount and display the original and discounted prices clearly.
   - Emphasize the product's features and appeal (name, category, brand).
   - Include a call-to-action to "Buy Now" with a placeholder link (#).

Ensure the email is visually appealing and professional.
"""

# Call Gemini AI to generate the email
def call_gemini_ai(prompt):
    model = genai.GenerativeModel("gemini-pro")
    response = model.generate_content(prompt, generation_config={"temperature": 0.3})
    return response.text

# Main function to generate emails
def main():
    # Extract unique products from the dataset
    unique_products = selected_customers_for_selected_items[["product_id", "product_name", "category", "brand"]].drop_duplicates()

    # Prepare a mapping of product_id to email templates
    email_templates = {}

    for _, product in unique_products.iterrows():
        product_id = product["product_id"]
        product_name = product["product_name"]
        category = product["category"]
        brand = product["brand"]

        # Get retail price and calculate discounted price
        retail_price = inventory_dataset[inventory_dataset["product_id"] == product_id]["retail_price"].iloc[0]
        discounted_price = round(retail_price * 0.8, 2)

        # Generate the AI prompt
        prompt = generate_email_prompt(product_name, category, brand, retail_price, discounted_price)

        # Call Gemini AI
        print(f"Generating email for product: {product_name}...")
        ai_response = call_gemini_ai(prompt)

        # Parse AI response
        subject_match = re.search(r"Subject:\s*(.*)", ai_response)
        body_match = re.search(r"<html>.*</html>", ai_response, re.DOTALL)

        subject = subject_match.group(1).strip() if subject_match else "Special Offer for You!"
        body = body_match.group(0).strip() if body_match else ai_response.strip()

        email_templates[product_id] = {"subject": subject, "body": body}

    # Update the dataset with personalized emails
    def create_email_row(row):
        product_id = row["product_id"]
        customer_id = row["customer_id"]
        email_template = email_templates.get(product_id, {})
        subject = email_template.get("subject", "")
        body = email_template.get("body", "").replace("{{customer_id}}", str(customer_id))
        return pd.Series([subject, body])

    selected_customers_for_selected_items[["subject", "body"]] = selected_customers_for_selected_items.apply(create_email_row, axis=1)

    # Display the updated dataset
    print("\nUpdated Dataset with Marketing Emails (First 30 Rows):\n")
    display(selected_customers_for_selected_items.head(30))

# Run the main function
main()
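
The sketch below walks through the post-processing steps of the cell above on a hypothetical model reply: computing the discounted price, extracting the subject and HTML body, and substituting the customer ID. The reply text is invented, and it contains a `{{customer_id}}` placeholder by assumption; the prompt above does not explicitly ask the model to include one.

```python
import re

# Hypothetical model reply; a real reply may be laid out differently
sample_response = (
    "Subject: 20% off the Wool Coat this week\n"
    "<html><body><p>Hello customer {{customer_id}},</p>"
    "<p>Now 1600.0 CZK instead of 2000.0 CZK.</p></body></html>\n"
)

retail_price = 2000.0
discounted_price = round(retail_price * 0.8, 2)  # 20% discount

# Same extraction rules as in the cell above, including the fallbacks
subject_match = re.search(r"Subject:\s*(.*)", sample_response)
body_match = re.search(r"<html>.*</html>", sample_response, re.DOTALL)

subject = subject_match.group(1).strip() if subject_match else "Special Offer for You!"
body = body_match.group(0).strip() if body_match else sample_response.strip()

# Personalize the template for one customer
personalised = body.replace("{{customer_id}}", str(42))
print(subject)
print(personalised)
```

If the reply contains no `{{customer_id}}` placeholder, `replace` leaves the body unchanged, so every customer linked to the same product receives identical text.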


## STEP 4 - SEND EMAILS TO CUSTOMERS

Send Unique Marketing Emails
This code sends one email for each unique product in the selected_customers_for_selected_items dataset. The emails, formatted in HTML, are sent from romecapstone@gmail.com to jan.krejci@krejca.eu with content dynamically extracted from the dataset. It ensures no duplicate emails are sent for multiple customers linked to the same product.

In [None]:
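import os
import pandas as pd
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

# Gmail SMTP server configuration
smtp_server = "smtp.gmail.com"
port = 587
sender_email = "romecapstone@gmail.com"  # Replace with your email address
# Gmail App Password, read from the environment (variable name is an assumption); never hardcode it
password = os.environ.get("GMAIL_APP_PASSWORD")
if not password:
    raise ValueError("Set the GMAIL_APP_PASSWORD environment variable before running this cell.")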

# Ensure the dataset is loaded
if 'selected_customers_for_selected_items' not in globals():
    raise ValueError("Ensure `selected_customers_for_selected_items` dataset is available.")

# Extract unique products from the dataset
unique_products = selected_customers_for_selected_items.drop_duplicates(subset=["product_id"])

# Function to send an email
def send_email(subject, body, receiver_email):
    try:
        # Create the email message
        msg = MIMEMultipart()
        msg["From"] = sender_email
        msg["To"] = receiver_email
        msg["Subject"] = subject
        msg.attach(MIMEText(body, "html"))

        # Connect to the SMTP server and send the email
        with smtplib.SMTP(smtp_server, port) as server:
            server.starttls()  # Enable encrypted connection
            server.login(sender_email, password)
            server.sendmail(sender_email, receiver_email, msg.as_string())

        print(f"Email sent successfully to {receiver_email} for subject: {subject}")
    except Exception as e:
        print(f"Error sending email: {e}")

# Iterate over unique products and send emails
receiver_email = "jan.krejci@krejca.eu"

for _, product in unique_products.iterrows():
    subject = product["subject"]
    body = product["body"]

    print(f"Preparing to send email for product: {product['product_name']}...")
    send_email(subject, body, receiver_email)
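
Before sending anything over SMTP, it can help to build a message and inspect it locally. The following is a minimal dry-run sketch that uses the same `MIMEMultipart` and `MIMEText` classes as the cell above; the helper name `build_message` and the example addresses are hypothetical, and no network connection is made.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message(sender, receiver, subject, html_body):
    """Assemble the same kind of message that send_email creates, without sending it."""
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["To"] = receiver
    msg["Subject"] = subject
    msg.attach(MIMEText(html_body, "html"))
    return msg

msg = build_message(
    "sender@example.com",
    "receiver@example.com",
    "Special Offer",
    "<html><body><p>Hello</p></body></html>",
)

# The first attached part carries the HTML body
html_part = msg.get_payload()[0]
print(msg["Subject"], msg.get_content_type(), html_part.get_content_type())
```

Checking the headers and the content type of the attached part this way makes it easier to catch an empty subject or a plain-text body before any email leaves the notebook.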


## Streamlit Test

In [None]:
# Install Streamlit and ngrok
!pip install streamlit pyngrok --quiet




In [None]:
%%writefile app.py
import streamlit as st

st.title("My App in Colab")
st.write("This is a sample showing how to run Streamlit in Google Colab.")

if st.button("Click"):
    st.success("Hello from Colab!")


In [None]:
import os
from pyngrok import ngrok
import time
import threading
import subprocess

# Read the ngrok authtoken from the environment instead of hardcoding it
ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

# Open an ngrok tunnel to port 8501
public_url = ngrok.connect(8501)
print("Public URL of the app:", public_url.public_url)

# Function that runs Streamlit in the background
def run_streamlit():
    # Start Streamlit on port 8501
    subprocess.call(["streamlit", "run", "app.py", "--server.port", "8501",
                     "--server.enableCORS", "false",
                     "--server.enableXsrfProtection", "false"])

# Start Streamlit in a separate thread
thread = threading.Thread(target=run_streamlit)
thread.start()

# Optionally wait so that everything has time to start.
time.sleep(3)
print("The app should be available at the address above (ngrok).")
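
A fixed three-second pause may be too short or too long depending on how quickly the server starts. One possible alternative, sketched here with only the standard library, is to poll the local port until it accepts connections. The helper name `wait_for_port` and its default values are assumptions, not part of the notebook.

```python
import socket
import time

def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the cell above, a call such as `wait_for_port("127.0.0.1", 8501)` could stand in for `time.sleep(3)`, with the final message printed only when it returns `True`.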
