## Linear Algebra Assignment I 
Name: Shrinivas Khiste

Rno: 19CS30043

### Q10. Text Clustering

## Imports

In [1]:
!pip install wikipedia --quiet

  Building wheel for wikipedia (setup.py) ... done


In [2]:
import pandas as pd
import wikipedia

from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np

## Loading Data

In [3]:
articles = ['Linear algebra','Data Science','Artificial intelligence','European Central Bank',
            'Financial technology','International Monetary Fund','Basketball','Swimming',
            'Cricket']

wiki_list=[]
title=[]
for article in articles:
  print("Loading Content: ",article)
  wiki_list.append(wikipedia.page(article).content)
  title.append(article)

Loading Content:  Linear algebra
Loading Content:  Data Science
Loading Content:  Artificial intelligence
Loading Content:  European Central Bank
Loading Content:  Financial technology
Loading Content:  International Monetary Fund
Loading Content:  Basketball
Loading Content:  Swimming
Loading Content:  Cricket


## TFIDF Vectoriser

In [4]:
vectoriser = TfidfVectorizer(stop_words='english')
X = vectoriser.fit_transform(wiki_list)
X=X.toarray()

In [5]:
print(X.shape) # number of documents x number of unique words

(9, 8031)


## K-Means Code

In [6]:
def clusterAssignment(X,Z):
  C = np.zeros(shape=X.shape[0],dtype='int')
  for i,x in enumerate(X):
    # index of the representative closest to document x
    C[i] = np.argmin([np.linalg.norm(x-z) for z in Z])
  return C
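
The assignment step above loops over every document and every representative. As a minimal sketch, the same nearest-representative rule can be written with NumPy broadcasting. The name `cluster_assignment_vectorised` is hypothetical and not part of the original notebook, and the intermediate array has shape (N, k, d), which is fine for 9 documents but may be too large for big corpora.

```python
import numpy as np

def cluster_assignment_vectorised(X, Z):
    """Return, for each row of X, the index of the nearest row of Z."""
    # Pairwise differences via broadcasting: shape (N, k, d)
    diffs = X[:, None, :] - Z[None, :, :]
    # Euclidean distance from every document to every representative: shape (N, k)
    dists = np.linalg.norm(diffs, axis=2)
    # Nearest representative per document (ties go to the lowest index)
    return np.argmin(dists, axis=1)

# Toy data (made up for illustration): two points near each representative
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Z_demo = np.array([[0.0, 0.5], [10.0, 10.5]])
C_demo = cluster_assignment_vectorised(X_demo, Z_demo)
```

Like the loop version, ties are broken in favour of the lowest cluster index, because `np.argmin` returns the first minimum.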

In [7]:
def updateRepresentatives(X,C,k):
  Z = np.zeros(shape=(k,X.shape[1]))
  for i in range(0,k):
    cluster = [x for j,x in enumerate(X) if C[j]==i]
    # an empty cluster keeps the zero vector that Z was created with above
    if len(cluster)>0:
      Z[i]=sum(cluster)/len(cluster)
  return Z
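
In the update step above, a cluster with no documents ends up with the zero vector as its representative. One possible alternative, shown here only as a sketch, is to keep the previous representative for an empty cluster. The function name `update_representatives_keep_empty` and the extra `Z_prev` argument are assumptions for illustration, not part of the original code.

```python
import numpy as np

def update_representatives_keep_empty(X, C, Z_prev):
    """Mean of each cluster; empty clusters keep their previous representative."""
    # Copy as float so the caller's array is not modified and means are not truncated
    Z = np.array(Z_prev, dtype=float)
    for j in range(Z.shape[0]):
        members = X[C == j]          # rows of X assigned to cluster j
        if members.shape[0] > 0:
            Z[j] = members.mean(axis=0)
    return Z

# Toy data (made up): both points belong to cluster 0, cluster 1 is empty
X_demo = np.array([[0.0, 0.0], [2.0, 2.0]])
C_demo = np.array([0, 0])
Z_prev_demo = np.array([[5.0, 5.0], [9.0, 9.0]])
Z_new_demo = update_representatives_keep_empty(X_demo, C_demo, Z_prev_demo)
```

Here cluster 0 moves to the mean of its two points, while the empty cluster 1 stays where it was. Note that `C` has to be a NumPy array for the boolean mask `C == j` to work.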

In [8]:
def calculateJClust(X,Z,C):
  J=0
  for i,x in enumerate(X):
    z=Z[C[i]]
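    # note: this sums unsquared distances; the k-means objective is
    # usually defined with the squared norm
    J+=np.linalg.norm(x-z)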
  J/=X.shape[0]
  return J
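
The function above averages the unsquared distance between each document and its representative. The k-means objective is more commonly defined as the mean squared distance, and a sketch of that variant is given below. The name `j_clust_squared` is hypothetical, and its values would differ from the JClust numbers printed in this notebook.

```python
import numpy as np

def j_clust_squared(X, Z, C):
    """Mean squared distance of each row of X to its assigned representative."""
    # Z[C] picks, for every document, the representative of its cluster
    residuals = X - Z[C]
    return np.sum(residuals ** 2) / X.shape[0]

# Toy data (made up): two points, each at distance 1 from the single representative
X_demo = np.array([[0.0, 0.0], [0.0, 2.0]])
Z_demo = np.array([[0.0, 1.0]])
C_demo = np.array([0, 0])
J_demo = j_clust_squared(X_demo, Z_demo, C_demo)
```

With this definition, the mean-based update step minimises the objective for a fixed assignment, which is why the squared form is the usual choice.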

In [9]:
def runKMeans(Z,X,k,verbose=0):
  JClustPrev=0
  numIter=0
  while(True):
    C = clusterAssignment(X,Z)
    Z = updateRepresentatives(X,C,k)
    JClust= calculateJClust(X,Z,C)
    numIter+=1
    if(verbose==1):
      print("Iteration: "+str(numIter)+" JClust: ",JClust)
    if(abs(JClustPrev-JClust)<1e-9 or numIter>20):
      if(verbose==1):
        print("Convergence Point Reached. Number of iterations: ",numIter)
      break
    JClustPrev=JClust
  return JClust,Z

### Initialisation

In [10]:
def initialiseRandom(X,k):
  Z = np.random.rand(k,X.shape[1])
  return Z

In [11]:
def initialiseFromData(X,k):
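  # sampling with replacement allows k > number of documents, but the same
  # document can be picked twice, which leaves one of the duplicates empty
  Z = X[np.random.choice(X.shape[0],k,replace=True),:]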
  return Z
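
`initialiseFromData` samples with replacement, so two representatives can start on the same document. As a sketch of an alternative, the function below samples distinct rows and rejects k larger than the number of documents. The name `initialise_distinct` and the optional `rng` argument are assumptions for illustration; this variant could not be used for the k=12 run in this notebook, since there are only 9 documents.

```python
import numpy as np

def initialise_distinct(X, k, rng=None):
    """Pick k distinct rows of X as initial representatives."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    if k > n:
        raise ValueError("k cannot exceed the number of documents when sampling without replacement")
    # replace=False guarantees k different row indices
    idx = rng.choice(n, size=k, replace=False)
    return X[idx, :]

# Toy data (made up): 5 distinct rows, choose 3 of them
X_demo = np.arange(20, dtype=float).reshape(5, 4)
Z_demo = initialise_distinct(X_demo, 3, np.random.default_rng(0))
```

The row indices are distinct, but if the data itself contains duplicate rows the representatives can still coincide.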

### Main Function

In [12]:
def run(X,k,isRandom,verbose=0):
  if isRandom:
    Z = initialiseRandom(X,k)
  else:
    Z=initialiseFromData(X,k)
  JClust,Z = runKMeans(Z,X,k,verbose)
  if(verbose==1):
    print("Final JClust: ",JClust)
  return JClust,Z

### (a) Running K Means for k=4,8,12

In [13]:
np.random.seed(0)

In [24]:
JClust_4,Z_4 = run(X,4,False,1)

Iteration: 1 JClust:  0.7753259139123334
Iteration: 2 JClust:  0.7753259139123334
Convergence Point Reached. Number of iterations:  2
Final JClust:  0.7753259139123334


In [26]:
JClust_8,Z_8 = run(X,8,False,1)

Iteration: 1 JClust:  0.5644806961618293
Iteration: 2 JClust:  0.5644806961618293
Convergence Point Reached. Number of iterations:  2
Final JClust:  0.5644806961618293


In [28]:
JClust_12,Z_12 = run(X,12,False,1)

Iteration: 1 JClust:  0.14724195714222163
Iteration: 2 JClust:  0.14724195714222163
Convergence Point Reached. Number of iterations:  2
Final JClust:  0.14724195714222163


### (b) Find Document Cluster Association

In [17]:
def findDocClusterAssoc(Z,X,articles):
  C =clusterAssignment(X,Z)
  k=Z.shape[0]
  for i in range(0,k):
    cluster = [j for j,x in enumerate(X) if C[j]==i]
    print("Cluster Number "+str(i+1)+": ",end=" ")
    for index in cluster:
      print(articles[index],end=",")
    print()

In [25]:
print("For k=4")
findDocClusterAssoc(Z_4,X,articles)

For k=4
Cluster Number 1:  European Central Bank,Financial technology,International Monetary Fund,
Cluster Number 2:  Basketball,Swimming,Cricket,
Cluster Number 3:  Linear algebra,Data Science,Artificial intelligence,
Cluster Number 4:  


In [27]:
print("For k=8")
findDocClusterAssoc(Z_8,X,articles)

For k=8
Cluster Number 1:  European Central Bank,International Monetary Fund,Cricket,
Cluster Number 2:  
Cluster Number 3:  Basketball,Swimming,
Cluster Number 4:  Linear algebra,
Cluster Number 5:  Data Science,Artificial intelligence,
Cluster Number 6:  
Cluster Number 7:  Financial technology,
Cluster Number 8:  


In [29]:
print("For k=12")
findDocClusterAssoc(Z_12,X,articles)

For k=12
Cluster Number 1:  European Central Bank,
Cluster Number 2:  Data Science,Artificial intelligence,
Cluster Number 3:  Swimming,
Cluster Number 4:  
Cluster Number 5:  Linear algebra,
Cluster Number 6:  
Cluster Number 7:  Financial technology,
Cluster Number 8:  International Monetary Fund,
Cluster Number 9:  
Cluster Number 10:  Basketball,
Cluster Number 11:  Cricket,
Cluster Number 12:  


### (c) Best k for given data

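I think k=4 gives the best results. The nine articles come from three domains (Machine Learning, Finance, and Sports), and the k=4 run separates them cleanly: one cluster per domain, with the fourth cluster left empty. Getting this result took several re-runs, because the outcome depends on the random initialisation.

For k=8 and k=12, the documents are split into smaller groups, and for k=8 Cricket even ends up in the same cluster as European Central Bank and International Monetary Fund. JClust is lower for these values of k, but JClust tends to decrease as k grows (k=12 already exceeds the number of documents), so a lower JClust alone does not make the clustering better.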
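
Since the result for a fixed k depends on the initialisation, the manual re-runs could be automated by repeating k-means several times and keeping the run with the lowest JClust. The sketch below is self-contained and does not reuse the notebook's functions. The names `kmeans_once` and `best_of_restarts` are hypothetical, the initialisation samples distinct rows (so it assumes k is at most the number of documents), and JClust is the unsquared mean distance as in `calculateJClust`.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=20, tol=1e-9):
    """One k-means run from a random data-point initialisation."""
    # Distinct rows as starting representatives (copied as float)
    Z = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    prev = np.inf
    for _ in range(max_iter):
        # Assignment step: nearest representative for every row
        dists = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
        C = np.argmin(dists, axis=1)
        # Update step: mean of each non-empty cluster
        for j in range(k):
            members = X[C == j]
            if members.shape[0] > 0:
                Z[j] = members.mean(axis=0)
        # Mean unsquared distance to the assigned representative
        J = np.mean(np.linalg.norm(X - Z[C], axis=1))
        if abs(prev - J) < tol:
            break
        prev = J
    return J, Z, C

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest J."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        result = kmeans_once(X, k, rng)
        if best is None or result[0] < best[0]:
            best = result
    return best

# Toy data (made up): two well-separated pairs of points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
J_demo, Z_demo, C_demo = best_of_restarts(X_demo, k=2)
```

Choosing the lowest JClust only makes sense when comparing runs with the same k; it does not help to choose between different values of k, for the reason given above.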