## Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

df = pd.read_csv('/content/drive/Shareddrives/FINALS DATASETS/sales_data.csv')

## EDA

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df)

## Modelling

### Customer Segmentation

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Define the number of rows for your dataset
num_rows = 1000

# Seed for reproducibility
np.random.seed(0)

# Generate customer IDs
customer_ids = np.random.randint(1000, 2000, size=num_rows)

# Generate food item names and corresponding types
food_items = ['Cheeseburger', 'French Fries', 'Pizza Slice', 'Salad', 'Soft Drink', 'Ice Cream']
food_types = ['Fast Food', 'Fast Food', 'Fast Food', 'Healthy', 'Beverage', 'Dessert']
food_type_dict = dict(zip(food_items, food_types))

# Generate random data for food sales
food_data = {
    'Customer ID': customer_ids,
    'Food Item': np.random.choice(food_items, size=num_rows),
    'Price': np.round(np.random.uniform(1.0, 10.0, size=num_rows), 2),
    'Quantity Sold': np.random.randint(1, 10, size=num_rows)
}

# Calculate sales
food_data['Sales'] = food_data['Price'] * food_data['Quantity Sold']

# Create a DataFrame (note: this replaces the `df` loaded from sales_data.csv above)
df = pd.DataFrame(food_data)

# Aggregate data by customer
customer_data = df.groupby('Customer ID').agg({
    'Sales': 'sum',
    'Quantity Sold': 'sum'
}).reset_index()

# Perform K-means clustering
X = customer_data[['Sales', 'Quantity Sold']]

# Choose the number of clusters; n_init is set explicitly because its
# default differs across scikit-learn versions
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
customer_data['Cluster'] = kmeans.fit_predict(X)

# Print the cluster centers
print("Cluster Centers:")
print(kmeans.cluster_centers_)

# Plot the clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(data=customer_data, x='Sales', y='Quantity Sold', hue='Cluster', palette='viridis')
plt.xlabel('Total Sales')
plt.ylabel('Total Quantity Sold')
plt.title('Customer Segmentation using K-means Clustering')
plt.legend(title='Cluster')
plt.show()


## Questions

### What can you say about the model?

The model implemented here is a K-means clustering model. K-means clustering is an unsupervised learning algorithm used to segment a dataset into distinct groups (clusters) based on feature similarities. Here’s a breakdown of what the model does:

1. Data Preparation:
- The dataset is aggregated by `Customer ID` to calculate the total sales and total quantity sold for each customer.

2. Clustering:
- K-means is applied to segment customers into four clusters based on their total sales and total quantity sold.
- The number of clusters (`n_clusters=4`) is chosen arbitrarily and might need tuning based on the data's characteristics.

3. Visualization:
- The resulting clusters are visualized using a scatter plot, which helps understand how customers are segmented based on sales and quantity sold.
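To make the aggregation step concrete, here is a minimal sketch on a tiny hand-made frame. The rows and the names `toy` and `per_customer` are made up for illustration and are not part of the dataset above.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the dataset above)
toy = pd.DataFrame({
    'Customer ID': [1001, 1001, 1002],
    'Sales': [10.0, 5.0, 8.0],
    'Quantity Sold': [2, 1, 4],
})

# Collapse to one row per customer: total sales and total quantity sold
per_customer = toy.groupby('Customer ID').agg({
    'Sales': 'sum',
    'Quantity Sold': 'sum'
}).reset_index()

print(per_customer)
```

Customer 1001 has two transactions, so its row holds the sums (15.0 in sales, 3 units), while customer 1002 keeps its single transaction's values. These per-customer totals are the two features K-means sees.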

### Is it a good model?

The goodness of a K-means model can be evaluated using several metrics and considerations:

1. Cluster Centers:
- The cluster centers (centroids) give insights into the average characteristics of each cluster. For example, some clusters might have higher average sales and quantities sold, indicating high-value customers.

2. Silhouette Score:
- A higher silhouette score indicates that the clusters are well-separated and distinct. It ranges from -1 to 1, where values closer to 1 mean better-defined clusters.

3. Elbow Method:
- The elbow method helps determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS) against the number of clusters. WCSS generally decreases as clusters are added; the point where the decrease slows sharply and the curve flattens is the "elbow point," suggesting an appropriate number of clusters.

4. Domain Knowledge:
- Understanding the business context and ensuring that the clusters make practical sense is crucial. The clusters should provide actionable insights.
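As a rough illustration of reading cluster centers, the sketch below turns the centroids into a labelled table with the number of customers in each cluster. The `features` frame is stand-in random data (an assumption so the block runs by itself); in the notebook, `customer_data[['Sales', 'Quantity Sold']]` would take its place.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in per-customer features (hypothetical, randomly generated)
rng = np.random.default_rng(0)
features = pd.DataFrame({
    'Sales': rng.uniform(5, 300, size=200),
    'Quantity Sold': rng.integers(1, 60, size=200),
})

km = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = km.fit_predict(features[['Sales', 'Quantity Sold']])
features['Cluster'] = labels

# Centroids as a labelled table, plus how many customers fall in each cluster
centers = pd.DataFrame(km.cluster_centers_, columns=['Sales', 'Quantity Sold'])
centers['Customers'] = np.bincount(labels, minlength=4)

# Highest-sales cluster first
print(centers.sort_values('Sales', ascending=False))
```

A cluster with a high `Sales` centroid but few customers would be a candidate "high-value" segment, though whether that label is meaningful depends on the business context.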

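Without running these evaluations, it’s hard to say definitively whether it’s a good model. The data preparation and clustering steps are implemented correctly, so it’s a reasonable starting point. Note, however, that the clustering cell runs on randomly generated data rather than the loaded `sales_data.csv`, so the resulting clusters illustrate the method and do not describe real customers.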
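The elbow and silhouette checks described above can be sketched as follows. This is a minimal example under stated assumptions: `X_demo` is a stand-in random matrix so the block runs by itself, and the range of `k` values is an arbitrary choice. In the notebook, `X = customer_data[['Sales', 'Quantity Sold']]` would be used instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in feature matrix (hypothetical); replace with the real X in the notebook
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))

wcss = {}         # within-cluster sum of squares per k (for the elbow method)
silhouettes = {}  # mean silhouette score per k

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(X_demo)
    wcss[k] = km.inertia_          # scikit-learn exposes WCSS as inertia_
    silhouettes[k] = silhouette_score(X_demo, labels)

for k in wcss:
    print(f"k={k}  WCSS={wcss[k]:.1f}  silhouette={silhouettes[k]:.3f}")
```

Plotting `wcss` against `k` gives the elbow curve, and the `k` with the highest silhouette score is a second opinion on the cluster count. On data without real group structure, neither measure is likely to single out a clear winner.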

### What happens to Revenue when Expense increases according to the model?

The model does not answer this directly: it clusters customers on total sales and total quantity sold only, the generated data has no expense column, and K-means groups similar points rather than estimating how one variable changes with another.

If we assume that higher sales volumes (quantity sold) lead to higher expenses due to increased production or procurement costs, then net revenue (profit) depends on the margin between revenue and expenses. If revenue grows faster than expenses, profit still rises. Because the clustering uses total sales, customers with higher revenue will tend to fall into the higher-sales clusters regardless of their expenses.
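The assumption that expenses scale with quantity sold can be made explicit with a small sketch. Everything here is hypothetical: the `demo` rows, the `unit_cost` value, and the `Expense` and `Profit` columns are invented for illustration and do not exist in the dataset above.

```python
import pandas as pd

# Hypothetical per-customer totals (not from the dataset above)
demo = pd.DataFrame({
    'Sales': [120.0, 45.0, 300.0],
    'Quantity Sold': [20, 10, 40],
})

unit_cost = 3.0  # assumed cost per unit sold

# Expense grows linearly with quantity; profit is the remaining margin
demo['Expense'] = demo['Quantity Sold'] * unit_cost
demo['Profit'] = demo['Sales'] - demo['Expense']

print(demo)
```

Under this assumed cost, the customer with the highest expense (120.0) also has the highest profit (180.0), because revenue rises faster than expense. Adding a column like `Profit` to the clustering features would be one way to let the segmentation reflect margins, if real expense data were available.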