
### Unsupervised Clustering

*  Clustering is a machine learning technique that groups data points. Given a set of data points, a clustering algorithm assigns each point to a specific group. Ideally, data points in the same group have similar properties and/or features, while data points in different groups have highly dissimilar ones. Clustering is an unsupervised learning method and a common technique for statistical data analysis in many fields.



---
Useful links: [K-Means](https://towardsdatascience.com/k-means-clustering-8e1e64c1561c), [Clustering Algorithms](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68)



In [0]:
import matplotlib.pyplot                     as plt
import numpy                                 as np
import pandas                                as pd
                                 
from sklearn.cluster                         import KMeans
from sklearn.naive_bayes                     import MultinomialNB
from sklearn.feature_extraction.text         import CountVectorizer
from collections                             import Counter
from sklearn.model_selection                 import train_test_split
from sklearn.feature_extraction.text         import TfidfTransformer
from IPython.display                         import Image, display
from IPython.core.display                    import HTML
from sklearn.feature_extraction.text         import TfidfVectorizer
from sklearn.decomposition                   import PCA
from sklearn.preprocessing                   import normalize
from sklearn.metrics                         import pairwise_distances
from sklearn.metrics                         import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline                        import Pipeline



%matplotlib inline

In [0]:
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
dataset = pd.read_csv("spam.csv", on_bad_lines="skip", engine="python")
dataset = dataset.rename({'v1': 'Class', 'v2': 'Narrative'}, axis=1)
dataset = dataset[['Class','Narrative']]
dataset

Unnamed: 0,Class,Narrative
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


* Can we just give our algorithm a bunch of text data and expect anything to happen? NO, THE COMPUTER IS NOT A GENIE. Algorithms can't work with raw text, so we need to transform the data into numbers that the model can process.

* If we represent the text of each message as a vector of numbers, our algorithm can work with it and proceed accordingly.

* We will transform the text in the body of each message into a vector of numbers using Term Frequency-Inverse Document Frequency, or TF-IDF (for more information about TF-IDF click [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)).
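
To make the idea concrete, here is a minimal sketch on a made-up three-message corpus. The messages and the `toy_` names are hypothetical and are not taken from `spam.csv`; only `TfidfVectorizer` itself is the same class used in this notebook. Each message becomes one row, each remaining term one column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from spam.csv
toy_messages = [
    "free entry win prize",
    "are you free tonight",
    "win a free prize now",
]

# Same vectorizer class as in the notebook, with English stop words removed
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_tf_idf = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix, one row per message

# Term -> column index, sorted by column
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# (number of messages, number of distinct terms kept)
print(toy_tf_idf.shape)

# Dense view of the weights; terms shared by many messages get lower weights
print(toy_tf_idf.toarray().round(2))
```

A term that appears in every message (here `free`) receives a lower weight than a term that appears in only one, which is the point of the inverse document frequency part.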

In [0]:
narrative = dataset['Narrative']


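tf_idf_vectorizor = TfidfVectorizer(stop_words = 'english')
tf_idf = tf_idf_vectorizor.fit_transform(narrative)
# TfidfVectorizer already L2-normalizes each row by default, so this call is effectively a no-op
tf_idf_norm = normalize(tf_idf)
tf_idf_array = tf_idf_norm.toarray()
#print(tf_idf_array)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
analise = pd.DataFrame(tf_idf_array, columns=tf_idf_vectorizor.get_feature_names_out())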
analise

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,0207,02072069400,02073162414,02085076972,021,03,04,0430,05,050703,0578,06,07,07008009200,07046744435,07090201529,07090298926,07099833605,07123456789,0721072,07732584351,07734396839,07742676969,07753741225,0776xxxxxxx,07781482378,07786200117,077xxx,078,07801543489,...,yor,yorge,youdoing,youi,young,younger,youphone,youre,yourinclusive,yourjob,youuuuu,youwanna,yoville,yowifes,yoyyooo,yr,yrs,ystrday,ything,yummmm,yummy,yun,yunny,yuo,yuou,yup,yupz,zac,zaher,zealand,zebra,zed,zeros,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5568,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5570,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## KMeans


* HOW IT WORKS
 
  1. First select the number of classes/groups to use and randomly initialize their center points.

  2. Assign each data point to a group by computing the distance between that point and each group center, then placing the point in the group whose center is closest to it.

  ![Euclidean](https://miro.medium.com/max/890/1*UVJKdowZ9CHxvrII1IYolw.png)

  3. Based on these assignments, recompute each group center by taking the mean of all the vectors in the group.
  4. Repeat steps 2 and 3 for a set number of iterations or until the group centers barely change between iterations (convergence).

* ADVANTAGES

    > It’s fast: each iteration only computes the distances between the points and the group centers, so the cost grows linearly with the number of data points.

* DISADVANTAGES

   >  You have to choose the number of groups/classes in advance. 
    
   >  K-means starts from a random choice of cluster centers, so different runs of the algorithm may yield different clustering results (not consistent). 
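
The four steps under HOW IT WORKS can be sketched in plain NumPy. This is an illustrative sketch only, not scikit-learn's implementation: the function `kmeans_sketch`, its parameters, and the toy data are all hypothetical names introduced here, and the initial centers are assumed to be drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-4, seed=0):
    """Plain k-means following the four steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: Euclidean distance from every point to every center,
        # then assign each point to its nearest center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each center to the mean of its points
        # (an empty group keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers barely move
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers

# Hypothetical toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

toy_labels, toy_centers = kmeans_sketch(X_toy, k=2)
print(toy_centers)             # one row per group center
print(np.bincount(toy_labels))  # number of points per group
```

Changing `seed` changes the starting centers, which is exactly the source of the inconsistency listed under DISADVANTAGES.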
    

![KMeans](https://miro.medium.com/max/960/1*KrcZK0xYgTa4qFrVr0fO2w.gif)
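
The sensitivity to the starting centers can be observed on toy data. The sketch below is an assumption-laden illustration (the four-blob data and all variable names are made up): it runs scikit-learn's `KMeans` with a single random initialization per seed and compares the resulting inertia, then lets one call try several initializations. The `n_init=30` argument in the cell below requests 30 such restarts, of which scikit-learn keeps the one with the lowest inertia.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: four blobs, which leaves room for a poor random start
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [0, 6], [6, 0], [6, 6]])
X_blobs = np.vstack([rng.normal(loc=c, scale=0.7, size=(40, 2)) for c in blob_centers])

# One random initialization per run: the final inertia may differ between seeds
single_run_inertias = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X_blobs).inertia_
    for seed in range(10)
]

# Several initializations in one call: the run with the lowest inertia is kept
multi_init = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X_blobs)

print(np.round(single_run_inertias, 1))
print(round(multi_init.inertia_, 1))
```

Whether the single runs actually disagree depends on the data and the seeds, so the printed values should be read as an experiment, not a guaranteed outcome.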

In [0]:
#km = KMeans(n_clusters = 2)
km = KMeans(algorithm='elkan', max_iter=300, n_clusters = 2, n_init=30)
km

KMeans(algorithm='elkan', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=2, n_init=30, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [0]:
km.fit(analise)
predict = km.predict(analise)
predict

array([0, 1, 0, ..., 0, 0, 0], dtype=int32)

In [0]:
dataset['Cluster'] = predict
dataset['Cluster'].value_counts()

0    5372
1     200
Name: Cluster, dtype: int64

In [0]:
dataset['Class'].value_counts()

ham     4825
spam     747
Name: Class, dtype: int64

* KMeans returns arbitrary cluster ids, not class names. As a first step, let's assume that the predicted ids refer to the following classes:


> HAM  = 0

> SPAM = 1
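
Because the cluster ids are arbitrary, the assumed mapping can be checked against the known labels. The following is a minimal sketch on a made-up frame (`toy` stands in for `dataset[['Class', 'Cluster']]`; the values are hypothetical). It labels each cluster with its most frequent class, which is one common convention and not necessarily the approach intended by this notebook.

```python
import pandas as pd

# Hypothetical stand-in for dataset[['Class', 'Cluster']]
toy = pd.DataFrame({
    "Class":   ["ham", "ham", "ham", "spam", "spam", "ham"],
    "Cluster": [1, 1, 1, 0, 0, 0],
})

# Count how often each cluster id co-occurs with each class
counts = pd.crosstab(toy["Cluster"], toy["Class"])

# Label every cluster with its most frequent class (majority vote)
cluster_to_class = counts.idxmax(axis=1).to_dict()
print(cluster_to_class)

# Apply the mapping and measure the agreement with the true classes
toy["Predicted"] = toy["Cluster"].map(cluster_to_class)
agreement = (toy["Predicted"] == toy["Class"]).mean()
print(agreement)
```

Note that a majority vote uses the true labels, so it is an evaluation aid, not part of the unsupervised step, and it can assign the same class to more than one cluster when one class dominates the data.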



In [0]:
# Translate cluster ids into class names using the assumed mapping (0 -> ham, anything else -> spam)
List = np.where(dataset['Cluster'] == 0, "ham", "spam").tolist()

result_pred = pd.DataFrame(List, columns=['Predicted'])
actual_data = dataset['Class']

act = actual_data.to_numpy()

# Cross-tabulate the actual classes against the assumed predictions
y_actu = pd.Series(act, name='Actual')
y_pred = pd.Series(List, name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
df_confusion


Actual \ Predicted,ham,spam
ham,4626,199
spam,746,1
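
The table can be turned into a few summary numbers. The sketch below simply copies the four counts from the table (rows are actual classes, columns are the assumed predictions) and does the arithmetic; the variable names are introduced here for illustration.

```python
# Counts copied from the confusion table (rows: actual, columns: predicted)
ham_as_ham, ham_as_spam = 4626, 199
spam_as_ham, spam_as_spam = 746, 1

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Share of messages whose assumed prediction matches the actual class
accuracy = (ham_as_ham + spam_as_spam) / total
# Share of actual spam that ended up in the cluster assumed to be spam
spam_recall = spam_as_spam / (spam_as_ham + spam_as_spam)
# Share of the assumed-spam cluster that is actually spam
spam_precision = spam_as_spam / (ham_as_spam + spam_as_spam)
# Reference point: always answering "ham"
all_ham_accuracy = (ham_as_ham + ham_as_spam) / total

print(f"accuracy         = {accuracy:.3f}")
print(f"spam recall      = {spam_recall:.4f}")
print(f"spam precision   = {spam_precision:.4f}")
print(f"all-ham accuracy = {all_ham_accuracy:.3f}")
```

With these counts the accuracy is roughly 0.83, below the roughly 0.87 obtained by always answering ham, and only 1 of the 747 spam messages falls into the cluster assumed to be spam. This suggests that the two clusters do not line up with the ham/spam split under either mapping.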


## Mean-Shift Clustering

![Mean-Shift](https://miro.medium.com/max/648/1*bkFlVrrm4HACGfUzeBnErw.gif)