## Clustering Model 

In this project, we will be using the [Facebook Live Sellers Dataset from UCI](https://archive.ics.uci.edu/dataset/488/facebook+live+sellers+in+thailand). 

The steps are:
1. Loading, Cleaning, and Exploring
    - Loading the Data
    - Fixing the Formats
    - Visualizing the Data
2. Building the Model
    - K-Means Clustering (from scratch)
    - K-Means Clustering (sklearn)

In [1]:
## first the imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## ML Packages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

### 1.1. Loading the Data

In [5]:
data = pd.read_csv('https://archive.ics.uci.edu/static/public/488/data.csv')
data.head()

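|   | status_id | status_type | status_published | num_reactions | num_comments | num_shares | num_likes | num_loves | num_wows | num_hahas | num_sads | num_angrys |
|---|-----------|-------------|------------------|---------------|--------------|------------|-----------|-----------|----------|-----------|----------|------------|
| 0 | 1 | video | 4/22/2018 6:00 | 529 | 512 | 262 | 432 | 92 | 3 | 1 | 1 | 0 |
| 1 | 2 | photo | 4/21/2018 22:45 | 150 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | video | 4/21/2018 6:17 | 227 | 236 | 57 | 204 | 21 | 1 | 1 | 0 | 0 |
| 3 | 4 | photo | 4/21/2018 2:29 | 111 | 0 | 0 | 111 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | photo | 4/18/2018 3:22 | 213 | 0 | 0 | 204 | 9 | 0 | 0 | 0 | 0 |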


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7050 entries, 0 to 7049
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   status_id         7050 non-null   int64 
 1   status_type       7050 non-null   object
 2   status_published  7050 non-null   object
 3   num_reactions     7050 non-null   int64 
 4   num_comments      7050 non-null   int64 
 5   num_shares        7050 non-null   int64 
 6   num_likes         7050 non-null   int64 
 7   num_loves         7050 non-null   int64 
 8   num_wows          7050 non-null   int64 
 9   num_hahas         7050 non-null   int64 
 10  num_sads          7050 non-null   int64 
 11  num_angrys        7050 non-null   int64 
dtypes: int64(10), object(2)
memory usage: 661.1+ KB


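The dataset has no missing values, but several column dtypes can be narrowed to save memory: `status_type` and `status_published` are stored as generic `object` columns, and every count column is `int64`. The `status_id` column is only an identifier and adds nothing to the model, so we can drop it.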
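To make the memory argument concrete, here is a minimal sketch of the idea on a synthetic stand-in for the data. The frame `demo`, its random values, and the choice of which columns to include are assumptions for illustration only; it is not the notebook's own cleaning function, just one way the dtype narrowing could look.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption: a few of the same
# column names and dtypes as in the data.info() output; values are made up).
rng = np.random.default_rng(0)
n_rows = 7050
demo = pd.DataFrame({
    "status_id": np.arange(1, n_rows + 1),
    "status_type": rng.choice(["video", "photo", "link", "status"], size=n_rows),
    "num_reactions": rng.integers(0, 5000, size=n_rows),
    "num_comments": rng.integers(0, 20000, size=n_rows),
})

before = demo.memory_usage(deep=True).sum()

# Drop the identifier, store the low-cardinality text column as a category,
# and downcast the integer counts to the smallest integer type that fits.
slim = demo.drop(columns=["status_id"])
slim["status_type"] = slim["status_type"].astype("category")
for col in slim.select_dtypes(include="number").columns:
    slim[col] = pd.to_numeric(slim[col], downcast="integer")

after = slim.memory_usage(deep=True).sum()
print(f"before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB")
```

Passing `deep=True` matters here: without it, pandas reports only the pointer size for `object` columns, which understates how much a categorical conversion saves.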

### 1.2. Fixing the Formats

In [None]:
def data_cleaner(df, drop_cols = None):
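    # Normalize column names: trim, collapse whitespace runs to "_", lowercase.
    # (str.replace treats its pattern literally, so r'/s+' never matched anything.)
    df.columns = ['_'.join(x.split()).lower() for x in df.columns]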
    type_dict = {}
    if drop_cols:
        df = df.drop(drop_cols,axis=1)
    for col in df.columns:
        if df[col].dtype == 'object' and df[col].nunique() < 5:
            