<a href="https://www.kaggle.com/code/alsayedhamdy/text-clustering-and-feature-engineering?scriptVersionId=107567848" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Importing Libraries**

In [None]:
#Importing libraries
import numpy as np
from numpy import unique
from numpy import where
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from wordcloud import WordCloud
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ParameterGrid
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV
from sklearn import model_selection, metrics, preprocessing, ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from hyperopt import tpe, hp, fmin, STATUS_OK,Trials
from hyperopt.pyll.base import scope
import nltk
import time
import re
import string
from kmeanstf import KMeansTF
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from string import punctuation
from gensim.utils import tokenize

In [None]:
nltk.download('stopwords')

# **Data editing and understanding**

**Upload and understand the data**

In [None]:
df = pd.read_csv('/content/Indiegogo.csv')
df

In [None]:
df.info()

1.   **I will ignore all the features except the feature ('title')**
2.   **I will also reduce the number of rows so the data set is not too large**

In [None]:
df = df.drop(range(15100, 32321))
df = df.drop(df.loc[:, 'bullet_point':'tags'].columns, axis=1)
df

**Now I will look for missing values and remove them**

In [None]:
df.describe()

In [None]:
df.isna().sum()

In [None]:
df = df.dropna()

In [None]:
df.isna().sum()

In [None]:
#Reset the index so there won't be any confusion (drop=True avoids keeping the old index as a column)
df.reset_index(drop=True, inplace=True)
df

# **Data preprocessing and feature engineering**

In [None]:
X = df['title']

**Now I will clean the text and do the data preprocessing**

In [None]:
stop_words = set(stopwords.words('english'))
ps = SnowballStemmer('english')
#Let's make a function to handle the whole cleaning process
def clean_data(text):
    #Removing URLs
    text = re.sub(r'http\S+\s*', ' ', text)
    #Removing the standalone tokens RT and cc (word boundaries keep words like "soccer" intact)
    text = re.sub(r'\b(?:RT|cc)\b', ' ', text)
    #Removing digits
    text = re.sub(r'\d+', '', text)
    #Removing hashtags
    text = re.sub(r'#\S+', '', text)
    #Removing mentions and e-mail domains
    text = re.sub(r'@\S+', ' ', text)
    #Replacing punctuation with a space
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    #Replacing non-ASCII characters with a single space
    text = re.sub(r'[^\x00-\x7f]', ' ', text)
    #Collapsing extra whitespace
    text = re.sub(r'\s+', ' ', text)
    #Converting the text to lowercase
    text = text.lower()
    #Removing stopwords
    text = " ".join([word for word in text.split() if word not in stop_words])
    #Stemming
    text = " ".join([ps.stem(word) for word in text.split()])
    return text

#And now let's apply this function to our text
X = X.apply(clean_data)
X

In [None]:
#Now the text is clean!
wordcloud = WordCloud(background_color='white', 
                      max_words=200,
                      width=1500, 
                      height=800, colormap='twilight').generate(' '.join(X))

plt.figure(figsize=(32,8))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

**Now to the feature engineering part**

*   **What is TF-IDF and why do we use it?**
---
📌 Term frequency-inverse document frequency. 
Machine learning algorithms work on numerical data, so in any natural language processing (NLP) task the text first has to be converted into numerical vectors, a process known as vectorization. TF-IDF vectorization computes a TF-IDF score for every word in the vocabulary relative to each document and stores those scores in a vector. Each document in the corpus thus gets its own vector, with one entry for every word in the entire collection of documents; a word scores high when it is frequent in that document but rare across the corpus. Once you have these vectors, you can use them for tasks such as checking whether two documents are similar by comparing their TF-IDF vectors with cosine similarity.
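
To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF and cosine similarity. The three toy documents, the helper names (`tfidf_vectors`, `cosine_similarity`) and the plain `tf * log(N / df)` weighting are assumptions chosen for illustration; scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word in vocab
        ])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus: two similar titles and one unrelated title
docs = [
    "smart watch for runners",
    "smart watch for swimmers",
    "board game about space",
]
vocab, vectors = tfidf_vectors(docs)
sim_watch_pair = cosine_similarity(vectors[0], vectors[1])
sim_watch_game = cosine_similarity(vectors[0], vectors[2])
print(f"watch vs watch: {sim_watch_pair:.3f}")
print(f"watch vs game:  {sim_watch_game:.3f}")
```

The two watch titles share most of their words and therefore get a positive similarity, while the board game title shares none and scores zero.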

*   **What is PCA and why do we use it?**

---
📌 Principal component analysis. Principal component analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster without extraneous variables to process.

So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.

How does it work?

1.   Standardize the range of continuous initial variables
2.   Compute the covariance matrix to identify correlations
3.   Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
4.   Create a feature vector to decide which principal components to keep
5.   Recast the data along the principal component axes
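
The steps above can be sketched with NumPy as follows. The random toy data, the function name `pca_by_eigendecomposition` and the choice of two components are assumptions for illustration; scikit-learn's `PCA`, used in this notebook, centers the data but does not scale it, and it relies on an SVD rather than an explicit covariance eigendecomposition.

```python
import numpy as np

def pca_by_eigendecomposition(data, n_components):
    """Project `data` (samples x features) onto its top principal components."""
    # 1. Standardize every feature to zero mean and unit variance
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # 2. Covariance matrix of the standardized features
    covariance = np.cov(standardized, rowvar=False)
    # 3. Eigendecomposition (eigh is meant for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    # 4. Keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    feature_vector = eigenvectors[:, order]
    explained_ratio = eigenvalues[order] / eigenvalues.sum()
    # 5. Recast the data along the principal component axes
    return standardized @ feature_vector, explained_ratio

# Hypothetical toy data: 4 correlated features built from 2 underlying signals
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
toy = np.hstack([base, base @ rng.normal(size=(2, 2))])
toy = toy + 0.05 * rng.normal(size=toy.shape)

projected, explained_ratio = pca_by_eigendecomposition(toy, n_components=2)
print(projected.shape)
print(explained_ratio)
```

The printed ratios show how much of the total variance each kept component carries, which is the quantity the notebook thresholds at 95% when it calls `PCA(0.95)`.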










In [None]:
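#I used the TF-IDF vectorizer to vectorize my text
tfidf_vec = TfidfVectorizer(min_df=0.005, max_features=5000)
data_tfidf = tfidf_vec.fit_transform(X).todense()

#And now I will use PCA, keeping enough components to explain 95% of the variance
#(np.asarray converts the np.matrix from todense(), which newer scikit-learn versions no longer accept)
pca = PCA(n_components=0.95, random_state=42)
data_pca = pca.fit_transform(np.asarray(data_tfidf))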

In [None]:
#We need to check the shape of our array
print( f"TF-IDF dimension - {data_tfidf.shape[1]}" )
print( f"TF-IDF + PCA dimension - {data_pca.shape[1]}" )

In [None]:
#Now I will pick the N_WORDS words with the highest mean TF-IDF score across all documents
N_WORDS = 30
mean_data_tfidf = np.array(data_tfidf.mean(axis=0)).flatten()
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocabulary = tfidf_vec.get_feature_names_out()
words_id = np.flip(mean_data_tfidf.argsort()[-N_WORDS:])

#Now let's build a dataframe of those words and their scores
word_val_data = [(vocabulary[idx], mean_data_tfidf[idx]) for idx in words_id]
word_val_data = pd.DataFrame(word_val_data, columns=['words','values'])

# **Clustering algorithm**

**I will use the KMeans algorithm**

**K-Means Clustering**

---


K-means clustering is one of the most commonly used unsupervised machine learning algorithms for partitioning a data set into k groups (i.e. k clusters), where k is the number of groups pre-specified by the analyst. It assigns objects to clusters such that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e. centroid), which corresponds to the mean of the points assigned to the cluster.


---


**The Basic Idea**

---


The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized. There are several k-means algorithms available. A standard one is the Hartigan-Wong algorithm (1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
![k-means-objective-function.png](https://www.researchgate.net/profile/Mahesh-Sarathchandra/publication/344888655/figure/fig1/AS:950914114408448@1603727000307/k-means-objective-function.ppm)
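
In symbols (the notation here is an assumption, not taken from the figure), this quantity can be written as $W = \sum_{k}\sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $C_k$ is the k-th cluster and $\mu_k$ its centroid. A minimal pure-Python sketch of that sum, using a hypothetical helper name and four toy points, looks like this:

```python
from collections import defaultdict

def within_cluster_sum_of_squares(points, labels):
    """Total squared Euclidean distance from each point to its cluster mean."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        # The centroid is the coordinate-wise mean of the cluster members
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total

# Hypothetical toy points: two tight pairs that are far from each other
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_labels = [0, 0, 1, 1]  # groups points that are close together
bad_labels = [0, 1, 0, 1]   # mixes far-apart points

wcss_good = within_cluster_sum_of_squares(points, good_labels)
wcss_bad = within_cluster_sum_of_squares(points, bad_labels)
print(wcss_good, wcss_bad)
```

The labeling that groups nearby points gives a much smaller total, which is exactly what k-means tries to achieve.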

# **Modeling**

**Now I will explain my hyper-parameters and why I chose them**



*   n_init

---
In K-means, the initial placement of the centroids plays a very important role in its convergence. Sometimes the initial centroids are placed in such a way that the clusters keep changing drastically over consecutive iterations, and max_iter is reached before the convergence condition is met, leaving us with poor clusters. The n_init parameter was introduced to overcome this problem. Its value determines how many times the algorithm is run with different sets of initial centroids. In scikit-learn, the runs are compared by inertia (the within-cluster sum of squared distances), and the run with the lowest inertia is returned together with its cluster labels.

**Because the data set is large and needs many clusters, I chose n_init=30, i.e. 30 different initial centroid placements**
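
A minimal pure-Python sketch of what repeated initialization does, assuming runs are compared by inertia, which is how scikit-learn's `KMeans` selects its best run. The function names, the toy points and the fixed seed are hypothetical; scikit-learn also uses the k-means++ seeding scheme by default rather than the plain random sampling shown here.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def run_kmeans_once(points, k, rng, max_iter=100):
    """One Lloyd-style k-means run from a random start; returns (inertia, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return inertia, labels

def kmeans_with_restarts(points, k, n_init, seed=0):
    """Repeat k-means `n_init` times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [run_kmeans_once(points, k, rng) for _ in range(n_init)]
    best_inertia, best_labels = min(runs, key=lambda run: run[0])
    return best_inertia, best_labels, [inertia for inertia, _ in runs]

# Hypothetical toy data: three small, well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_inertia, best_labels, all_inertias = kmeans_with_restarts(points, k=3, n_init=10)
print("inertia of every run:", all_inertias)
print("best inertia:", best_inertia)
```

Printing the inertia of every run makes it visible that individual starts can end in different local optima, which is why more restarts make a good final result more likely at the cost of extra run time.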



*   n_clusters

---

**I will tune the number of clusters with the silhouette score to find the most suitable value.**
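
The tuning cell that follows scores each candidate with `silhouette_score`. As a rough illustration of what that number measures, here is a pure-Python sketch of the mean silhouette coefficient with Euclidean distance; the function name and toy points are hypothetical, and it assumes at least two clusters and no duplicate points.

```python
import math

def silhouette_score_simple(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy points: two tight pairs that are far from each other
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
score_good = silhouette_score_simple(points, [0, 0, 1, 1])
score_bad = silhouette_score_simple(points, [0, 1, 0, 1])
print(f"good labeling: {score_good:.3f}")
print(f"bad labeling:  {score_bad:.3f}")
```

Values close to 1 mean points sit much nearer to their own cluster than to any other, while negative values suggest points were assigned to the wrong cluster.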



In [None]:
#So let's tune over 90 to 104 clusters (range excludes the upper bound 105)
cluster_sizes = range(90, 105)
kmeans_models = [KMeansTF(i, n_init=30) for i in cluster_sizes]
cluster_score = []

for kmeans in kmeans_models:
  y = kmeans.fit_predict(data_pca)
  score = silhouette_score(data_pca, y)
  cluster_score.append((kmeans.n_clusters, score))

cluster_score=np.array(cluster_score)

In [None]:

scores = cluster_score[:, 1]
clusters = cluster_score[:, 0]
max_score_clusters = []

fig, ax = plt.subplots(figsize=(8,5))
ax = sns.lineplot(x=clusters, y=scores, ax=ax)
ax.set_title("Silhouette score vs No. clusters", fontsize=16)

for i in np.argsort(scores)[-5:]:
  ax.vlines(clusters[i], 0, 1, linestyles='--', colors='orange')
  max_score_clusters.append(clusters[i])

ax.text(1.01, 1, f"Dashed lines indicate\n the {len(max_score_clusters)} highest scores",
        transform=ax.transAxes, ha='left', va='top')

xticks = ax.get_xticks().astype(int)
xticks = np.append(xticks, max_score_clusters)
ax.set_xticks( xticks )
ax.tick_params(axis='x', rotation=45)

ax.set_ylim([0.95*min(scores), 1.05*max(scores)])
ax.set_xlim()

plt.show()

**As the plot shows, the best number of clusters in the tested range is 103**

**Now let's go and apply it to our data**

In [None]:
n_clusters = 103
kmeans_model = KMeans(n_clusters, n_init=30)
y = kmeans_model.fit_predict(data_pca)

# **Clustering score and visualization**

In [None]:
#Now let's look at the silhouette score of every sample, sorted from best to worst
sample_scores = silhouette_samples( data_pca, y )
sample_scores_df =  pd.DataFrame( data = {'Cluster':y, 'Silhouette':sample_scores} )
sample_scores_df = sample_scores_df.reset_index()
sample_scores_df=sample_scores_df.sort_values('Silhouette', ascending=False)
sample_scores_df

**Let's visualize the clustering results**

In [None]:
def plot_silhouette_samples(X, pred_labels):
  n_clusters = len(np.unique(pred_labels))

  fig, (ax) = plt.subplots(1, 1, figsize=(8,15))
  

  silhouette_avg = silhouette_score(X, pred_labels)
  sample_silhouette_values = silhouette_samples(X, pred_labels)

  y_lower = 10
  for i in range(n_clusters):
    ith_cluster_silhouette_values = sample_silhouette_values[pred_labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
  
    ax.fill_betweenx(np.arange(y_lower, y_upper), 
                     0, ith_cluster_silhouette_values)
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i),
            ha='center', va='center', fontsize=12,
            bbox={'boxstyle':'square',
                  'facecolor':'white'})
    y_lower = y_upper + 10
  
  ax.set_title(f"The silhouette score plot for the {n_clusters} clusters.\n",fontsize=20)
  ax.set_xlabel("Silhouette coefficient values",fontsize=18)
  ax.set_ylabel("Cluster",fontsize=18)
  ax.axvline(x=silhouette_avg, color="red", linestyle="--")
  ax.set_yticks([])
  ax.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

  return ax

plot_silhouette_samples(data_pca, y)

**Now let's see some visuals to see the most important words in every cluster**

In [None]:
#Collect the top-n words (by mean TF-IDF score) for every cluster
def most_important_words(data_tfidf, y, topn=10):
  #np.asarray makes this work for both np.matrix and ndarray inputs
  data = np.asarray(data_tfidf)
  n_clusters = len(np.unique(y))
  result = []
  for i in range(n_clusters):
    ith_cluster_word_mean = data[y == i].mean(axis=0)
    word_ids = np.argsort(ith_cluster_word_mean)[-topn:]
    result += [(i, word_id, ith_cluster_word_mean[word_id]) for word_id in word_ids]

  return result

In [None]:
word_cluster_df = pd.DataFrame(most_important_words(data_tfidf, y, topn=5), columns=["Cluster", "WordId", "Score"])
feature_names = tfidf_vec.get_feature_names_out()  #get_feature_names() was removed in scikit-learn 1.2
word_cluster_df["Word"] = word_cluster_df["WordId"].apply(lambda idx: feature_names[idx])
word_cluster_df = word_cluster_df.sort_values("Score", ascending=False)
word_cluster_df.head()

In [None]:
g = sns.catplot(x="Score", y="Word", col="Cluster", data=word_cluster_df, 
                sharey=False, col_wrap=4, kind="bar",
                color = 'red', aspect=.6)
for ax in g.axes.flatten():
    ax.tick_params(axis='x', rotation=45, size=13)
g.fig.suptitle("Words with highest TF-IDF scores in each cluster", y = 1.01, fontsize=15)
plt.show()

# **Thank you for reading!**