### 🧪 PROJECT: Facebook Live sellers in Thailand

#### ✅ STEP 1: Load the Data

In [22]:
import pandas as pd

df = pd.read_csv("../dataset/Live.csv")

#### 🔍 STEP 2: Understand the Data

In [None]:
print(df.head())  # look at the first few rows
df.info()  # check data types and missing values (prints directly)
print(df.describe())  # summary stats; only the last expression of a cell is shown automatically

##### Check for missing values in dataset


In [None]:
df.isnull().sum()

##### We can see that there are 4 redundant columns in the dataset. We should drop them before proceeding further.

In [25]:
df.drop(["Column1", "Column2", "Column3", "Column4"], axis=1, inplace=True)
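
Instead of hard-coding the column names, the redundant columns could also be found programmatically. The sketch below assumes they are redundant because they contain only missing values; the toy frame and its values are made up for illustration and are not taken from `Live.csv`.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame(
    {
        "num_reactions": [10, 250, 3],
        "num_comments": [0, 12, 1],
        "Column1": [None, None, None],
        "Column2": [None, None, None],
    }
)

# Columns in which every value is missing
empty_cols = [col for col in toy.columns if toy[col].isnull().all()]
print(empty_cols)

# Drop them without listing the names by hand
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

`dropna(axis=1, how="all")` only removes columns that are entirely empty, so columns with a few missing values are left alone.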

In [None]:
# View the summary of the dataset again
df.info()

##### As we can see, 3 columns have the object dtype and the remaining columns are integers, so the dataset does contain some categorical data

##### Let's explore the 3 variables which are of object datatype

##### Explore Status ID variable

In [None]:
len(df["status_id"].unique())

##### There are 6997 unique values of status_id, but the dataset has 7050 records. status_id is therefore close to a unique identifier for each record and carries no information that clustering can use. Hence, I will drop it.

##### Explore Status published variable

In [None]:
df["status_published"].unique()

In [None]:
# count the distinct values in status_published
len(df["status_published"].unique())


##### Again, we can see that there are 6913 unique labels in the status_published variable, while the total number of instances in the dataset is 7050. So it is also approximately a unique identifier for each instance and not a variable that we can use. Hence, I will drop it as well.
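
The same reasoning can be expressed as a ratio of distinct values to rows: 6997 / 7050 ≈ 0.99 for status_id and 6913 / 7050 ≈ 0.98 for status_published. A minimal sketch of that check follows; the toy frame and the 0.8 threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): one ID-like column and one real category
toy = pd.DataFrame(
    {
        "status_id": ["a1", "a2", "a3", "a4", "a5", "a5"],
        "status_type": ["video", "photo", "photo", "video", "video", "photo"],
    }
)

# Share of distinct values per column; values close to 1.0 behave like identifiers
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)

# Flag columns above an arbitrary threshold (0.8 is an assumption, tune as needed)
id_like = unique_ratio[unique_ratio > 0.8].index.tolist()
print(id_like)
```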

##### Explore status_type variable

In [None]:
df["status_type"].unique()

##### We can see that there are 4 categories of labels in the status_type variable.



In [31]:
# Drop the status_id and status_published variables from the dataset
df.drop(["status_id", "status_published"], axis=1, inplace=True)

#### 🛠️ STEP 3: Feature Engineering

##### Convert categorical variable into integers

In [32]:
from sklearn.preprocessing import LabelEncoder

# Work on a copy so that df keeps the original string labels
X = df.copy()

le = LabelEncoder()
# Encode status_type as integers and reuse the same codes for y
y = le.fit_transform(df["status_type"])
X["status_type"] = y


In [None]:
X.info()

In [None]:
X["status_type"].unique()
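
`LabelEncoder` assigns integer codes in sorted order of the labels, so the codes imply an ordering that the categories do not really have, and a distance-based method such as K-Means will treat some categories as closer together than others. The sketch below inspects the mapping and shows one-hot encoding with `pd.get_dummies` as a possible alternative. The four labels are illustrative assumptions, not values read from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative labels (assumed, not taken from the dataset)
status = pd.Series(["video", "photo", "link", "status", "photo"])

le = LabelEncoder()
codes = le.fit_transform(status)

# classes_ is sorted, and each integer code is the position in that array
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)  # {'link': 0, 'photo': 1, 'status': 2, 'video': 3}

# One-hot alternative: every pair of categories ends up equally far apart
one_hot = pd.get_dummies(status, prefix="status_type")
print(one_hot.columns.tolist())
```

Whether one-hot encoding is preferable here depends on how much weight the post type should carry relative to the engagement counts.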

##### Preview the final dataset now

In [None]:
X.head()

##### Feature scaling is important for K-Means clustering because it groups data based on Euclidean distance.
##### Features with larger ranges would otherwise dominate the distance calculation, so it's highly recommended to scale the features first.
##### Here we use min-max scaling, which rescales every feature to the range 0 to 1.

In [36]:
from sklearn.preprocessing import MinMaxScaler

cols = X.columns
ms = MinMaxScaler()
X = ms.fit_transform(X)

In [None]:
X = pd.DataFrame(X, columns=cols)
X
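
To see why scaling changes what K-Means considers "close", here is a small worked example in plain Python. It applies the min-max formula `x' = (x - min) / (max - min)` to a handful of made-up posts described by (num_reactions, num_shares); the numbers are hypothetical and only chosen to make the effect visible.

```python
import math

# Hypothetical posts described by (num_reactions, num_shares)
posts = [(0, 0), (1000, 10), (100, 0), (110, 10), (300, 0)]


def min_max_scale(rows):
    """Rescale each column to [0, 1] with x' = (x - min) / (max - min).

    Assumes every column has at least two distinct values.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


scaled = min_max_scale(posts)
a, b, c = posts[2], posts[3], posts[4]
sa, sb, sc = scaled[2], scaled[3], scaled[4]

# Raw distances: the reactions column dominates, so a looks closest to b
print(math.dist(a, b), math.dist(a, c))
# Scaled distances: the gap in shares now counts, so a is closest to c
print(math.dist(sa, sb), math.dist(sa, sc))
```

Before scaling, post `a` is about 14 units from `b` and 200 from `c`; after scaling the order flips, because the difference between 0 and 10 shares now spans the whole range of that feature.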

#### 📊 STEP 4: Modeling & Clustering

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0)

kmeans.fit(X)

In [None]:
kmeans.cluster_centers_
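
Because the model was fitted on min-max scaled data, the centroids are expressed in the scaled 0 to 1 space. One way to read them in original units is `MinMaxScaler.inverse_transform`. The sketch below demonstrates this on toy data; the array values and the names `raw`, `scaler` and `km` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): two obvious groups of posts
raw = np.array(
    [[10, 1], [12, 0], [11, 2], [900, 80], [950, 90], [1000, 100]], dtype=float
)

scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Centroids live in the scaled space; map them back to the original units
centers_original = scaler.inverse_transform(km.cluster_centers_)
print(centers_original)
```

Note that an integer-encoded column such as status_type comes back as a fractional value, which has no direct interpretation as a category.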


In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for k = 1..10 and record the inertia
# (within-cluster sum of squared distances) for each k.
# A separate name (km) is used so the fitted `kmeans` model is not overwritten.
cs = []
for i in range(1, 11):
    km = KMeans(
        n_clusters=i, init="k-means++", max_iter=300, n_init=10, random_state=0
    )
    km.fit(X)
    cs.append(km.inertia_)
plt.plot(range(1, 11), cs)
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.savefig("../images/elbow_method_optimal_k_value.png")
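
Reading the elbow off a plot is somewhat subjective. One complementary check, which is not part of the original analysis, is the silhouette score: it compares how close each point is to its own cluster versus the nearest other cluster, and higher is better. The sketch below runs it on synthetic blobs generated with `make_blobs` as stand-in data; the blob centers and the range of k are assumptions for the demo.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs (an assumption for the demo)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_ids = km.fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, cluster_ids)

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On the real data the same loop could be run over the scaled feature table; if the elbow and the silhouette score disagree, that is a sign the cluster structure is not clear-cut.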

#### 📈 STEP 5: Visualization

In [None]:
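import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Refit with k=3, the value the elbow plot suggests, so the plot does not
# depend on whichever model happened to be fitted last
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Reduce dimensions to 2D for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Predict cluster labels and project the centroids into the same 2D space
labels = kmeans.predict(X)
centers_pca = pca.transform(kmeans.cluster_centers_)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap="viridis", alpha=0.6)
plt.scatter(
    centers_pca[:, 0],
    centers_pca[:, 1],
    c="red",
    marker="X",
    s=200,
    label="Centroids",
)
# No emoji in the title: the default matplotlib font has no glyph for it
plt.title("K-Means Clustering Results (2D PCA View)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend()
plt.grid(True)
plt.show()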
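
A 2D PCA view shows only part of the structure, so it is worth checking how much variance the two components retain before reading too much into the picture. The sketch below does this with `explained_variance_ratio_` on random stand-in data; the data and the names `X_demo` and `pca_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: 200 rows, 5 features; the first two carry most of the variance
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])

pca_demo = PCA(n_components=2)
pca_demo.fit(X_demo)

# Fraction of the total variance captured by each of the two components
ratios = pca_demo.explained_variance_ratio_
print(ratios)
print("variance kept in 2D:", ratios.sum())
```

If the two components keep only a small share of the variance, clusters that overlap in the scatter plot may still be well separated in the full feature space.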


## 📌 Conclusion: What We Learned from K-Means Clustering

In this project, we applied **K-Means clustering** to segment Facebook Live Seller posts based on various engagement metrics such as reactions, comments, shares, and different reaction types (likes, loves, wows, etc.).

### 🎯 Purpose of Clustering:
- Since we had no predefined labels, clustering helped us uncover **natural groupings** within the data.
- It allowed us to identify **patterns of user interaction** without supervision.

### 🧠 Key Insights:
- Using the **Elbow Method**, we determined that the optimal number of clusters is likely **3**, as the inertia drops sharply up to that point and flattens afterward.
- Each cluster represents a distinct **behavioral profile** of Facebook posts:
  - Some clusters contain **highly engaging posts** (many reactions, likes, shares).
  - Others consist of **low-engagement or ignored posts**.
  - Some may be dominated by a specific **status type** (e.g., images, videos, links), which perform differently.
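
One way to check such profiles is to average every feature per cluster. The sketch below uses a small made-up table with made-up cluster labels, so the numbers are not results from this project; on the real data the labels would come from the fitted K-Means model and the features from the unscaled table.

```python
import pandas as pd

# Hypothetical engagement counts and cluster labels (not real results)
posts = pd.DataFrame(
    {
        "num_reactions": [5, 8, 900, 1100, 12, 950],
        "num_shares": [0, 1, 120, 150, 0, 100],
        "cluster": [0, 0, 1, 1, 0, 1],
    }
)

# Mean of every feature per cluster, plus the number of posts in each cluster
profile = posts.groupby("cluster").mean()
profile["n_posts"] = posts.groupby("cluster").size()
print(profile)
```

A table like this makes it easy to label clusters as high or low engagement and to see whether a cluster is very small compared with the others.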

### 📊 Business Value:
- These clusters can help **content creators or marketers** tailor their strategies:
  - Focus on content types and styles that fall into **high-performing clusters**.
  - Re-evaluate or avoid formats associated with **low-engagement clusters**.
- **Cluster labels** can also be used as a new feature in future **supervised models** (e.g., predicting post performance).

### ✅ Summary:
K-Means clustering revealed underlying groupings in user engagement behavior. This insight can inform better content planning, audience targeting, and platform strategies.

