# Nafisur Rahman
nafisur21@gmail.com<br>
https://www.linkedin.com/in/nafisur-rahman

# Clustering Problem in NLP
* Identifying the genre of a book
* Recognizing the themes or topics in an article
* Topic modeling
* Clustering text Data
* Modeling Text Topics

## Topic Modeling in NLP

## Clustering text data (articles) using the K-means algorithm

## Recognizing the themes or topics in an article

### Part 2:- Clustering
In Part 1 we saved all the articles from the blog (http://doxydonkey.blogspot.in/) into a tab-separated file called "allposts.csv".<br>
In this part, we load "allposts.csv" into a pandas DataFrame and run a cluster analysis.

Loading the libraries

In [1]:
import pandas as pd
import nltk

Loading the dataset

In [2]:
dataset=pd.read_csv('allposts.csv',sep='\t',quoting=3)
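The `quoting=3` argument above is the integer value of `csv.QUOTE_NONE`, which tells pandas to treat quote characters as ordinary text. A minimal sketch of the effect, assuming a tiny in-memory sample (the `sample` string and `demo` name are made up for illustration and are not taken from "allposts.csv"):

```python
import csv
import io

import pandas as pd

# Hypothetical three-row sample; the second row holds only a quote character.
sample = 'post\nSoftBank\'s fund\n"\n"Quoted start of a post\n'

# With QUOTE_NONE the parser does not interpret quotes,
# so a lone quote character survives as a cell value.
demo = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)
print(demo['post'].tolist())
```

This is one plausible reason why rows holding only a quote character can show up in the loaded data.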

In [3]:
df=dataset[['post']]
df.head()

Unnamed: 0,post
0,SoftBank's $100 Billion Tech Fund Rankles VCs ...
1,Quora tests video answers to steal Q&A from Yo...
2,Pittsburgh Welcomed Uber s Driverless Car Expe...
3,"""LeEco employees are being called to a Tuesday..."
4,Why Did a Chinese Peroxide Company Pay $1 Bill...


In [4]:
len(df)

3117

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3117 entries, 0 to 3116
Data columns (total 1 columns):
post    2792 non-null object
dtypes: object(1)
memory usage: 24.4+ KB


Removing the missing values

In [6]:
df=df.dropna()
len(df)

2792

Removing rows that contain only a quote character and no text

In [7]:
df[df['post']=='"'].head()

Unnamed: 0,post
575,""""
766,""""
1476,""""
1479,""""
1482,""""


In [8]:
l=df[df['post']=='"'].index

In [9]:
df=df.drop(labels=l)
len(df)

2636

Resetting the index

In [10]:
df = df.reset_index(drop=True)
len(df)

2636
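The three cleaning steps above (dropping missing values, dropping quote-only rows, resetting the index) can also be chained with a boolean mask. A minimal sketch on a toy frame; the `toy` and `cleaned` names and the sample rows are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for the real data: one missing value and one quote-only row.
toy = pd.DataFrame({'post': ['first article', None, '"', 'second article']})

# Drop missing rows, keep rows that are not a lone quote, then renumber from 0.
cleaned = toy.dropna()
cleaned = cleaned[cleaned['post'] != '"'].reset_index(drop=True)
print(cleaned)
```

Resetting the index matters here because the later loop accesses rows by position with `df['post'][i]`.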

In [11]:
df1=df
df.head()

Unnamed: 0,post
0,SoftBank's $100 Billion Tech Fund Rankles VCs ...
1,Quora tests video answers to steal Q&A from Yo...
2,Pittsburgh Welcomed Uber s Driverless Car Expe...
3,"""LeEco employees are being called to a Tuesday..."
4,Why Did a Chinese Peroxide Company Pay $1 Bill...


#### NLP Preprocessing task

In [12]:
from nltk.corpus import stopwords
from string import punctuation
import re
from nltk.stem import LancasterStemmer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
ls=WordNetLemmatizer()
cstopwords=set(stopwords.words('english')+list(punctuation))

In [13]:
text_corpus=[]
for i in range(0,len(df)):
    # keep letters only; digits and punctuation become spaces
    review=re.sub('[^a-zA-Z]',' ',df['post'][i])
    # lowercase, tokenize, drop stopwords, and lemmatize each remaining token
    review=[ls.lemmatize(w) for w in word_tokenize(str(review).lower()) if w not in cstopwords]
    review=' '.join(review)
    text_corpus.append(review)

len(text_corpus)

2636
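To see what the cleaning loop does to a single post, here is a simplified sketch that needs no NLTK data downloads. It is not the pipeline above: the `TOY_STOPWORDS` set and the `clean_text` helper are hypothetical, whitespace splitting replaces `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list).
TOY_STOPWORDS = {'the', 'a', 'to', 'of', 'and', 'is', 'are', 'in', 'for', 's'}

def clean_text(raw):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', raw)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in TOY_STOPWORDS)

cleaned_example = clean_text("SoftBank's $100 Billion Tech Fund Rankles VCs")
print(cleaned_example)  # softbank billion tech fund rankles vcs
```

The apostrophe and the dollar amount turn into spaces, which is why a stray `s` token appears and has to be filtered out as a stopword.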

### NLP feature extraction

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
cv=CountVectorizer()

In [16]:
X1=cv.fit_transform(text_corpus).toarray()

In [17]:
X1.shape

(2636, 19426)

In [18]:
tfidfvec=TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')
X2=tfidfvec.fit_transform(text_corpus).toarray()
X2.shape

(2636, 10919)
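The drop from 19426 to 10919 columns comes from the vocabulary filters passed to `TfidfVectorizer`. A minimal sketch of how `min_df` and `max_df` prune terms, using four made-up documents (the `docs` list is an assumption, and `stop_words` is left out to keep the toy vocabulary predictable):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Four hypothetical pre-cleaned documents.
docs = [
    'facebook ad revenue',
    'facebook video ad',
    'stock profit quarter',
    'stock revenue growth',
]

# Raw counts keep every term: 8 distinct words.
counts = CountVectorizer().fit_transform(docs)

# min_df=2 drops terms seen in fewer than 2 documents;
# max_df=0.5 drops terms seen in more than half of the documents.
tfidf_vec = TfidfVectorizer(max_df=0.5, min_df=2)
tfidf = tfidf_vec.fit_transform(docs)

print(counts.shape, tfidf.shape)
print(sorted(tfidf_vec.vocabulary_))
```

Both vectorizers return sparse matrices; the notebook converts them to dense arrays with `.toarray()`, which is workable at this size but can use a lot of memory on larger corpora.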

### K-means Clustering Algorithm

In [19]:
from sklearn.cluster import KMeans
from nltk.probability import FreqDist

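#### Finding the number of clusters

# Elbow method: record the within-cluster sum of squares (WCSS) for k = 1..9
import matplotlib.pyplot as plt
%matplotlib inline
wcss=[]
for i in range(1,10):
    kmeans=KMeans(n_clusters=i)
    kmeans.fit(X1)
    wcss.append(kmeans.inertia_)

plt.plot(range(1,10),wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')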
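The `inertia_` value collected above is the within-cluster sum of squares: the squared distance from each sample to its assigned cluster centre, summed over all samples. A minimal sketch that recomputes it by hand on four made-up points (the `points` array, `n_init=10`, and `random_state=0` are assumptions chosen to make the toy run reproducible):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight, well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Sum of squared distances from each point to its own cluster centre.
manual_wcss = sum(
    np.sum((points[model.labels_ == j] - center) ** 2)
    for j, center in enumerate(model.cluster_centers_)
)

print(manual_wcss, model.inertia_)
```

In an elbow plot, the number of clusters is usually read off where this value stops dropping sharply as k increases.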

For simplicity, we use only 3 clusters.

In [20]:
km=KMeans(n_clusters=3)
km.fit(X1)
km.labels_

array([2, 1, 2, ..., 0, 2, 2])

In [21]:
df1['labels']=km.labels_
df1['processed']=text_corpus
df1.head()

Unnamed: 0,post,labels,processed
0,SoftBank's $100 Billion Tech Fund Rankles VCs ...,2,softbank billion tech fund rankles vcs valuati...
1,Quora tests video answers to steal Q&A from Yo...,1,quora test video answer steal q youtube newly ...
2,Pittsburgh Welcomed Uber s Driverless Car Expe...,2,pittsburgh welcomed uber driverless car experi...
3,"""LeEco employees are being called to a Tuesday...",2,leeco employee called tuesday meeting massive ...
4,Why Did a Chinese Peroxide Company Pay $1 Bill...,0,chinese peroxide company pay billion talking c...


In [22]:
km.n_clusters

3

Most frequent words in each cluster

In [23]:
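for i in range(km.n_clusters):
    # select the rows assigned to cluster i
    df2=df1[df1['labels']==i]
    df2=df2[['processed']]
    # note: tokenizing the str() of a list also yields the quote and comma tokens seen in the output below
    words=word_tokenize(str(list(set([a for b in df2.values.tolist() for a in b]))))
    dist=FreqDist(words)
    print('Cluster :',i)
    print('most common words :',dist.most_common(30))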

Cluster : 0
most common words : [('company', 2757), ('billion', 1841), ('percent', 1773), ('year', 1640), ('million', 1363), ('said', 1244), ('revenue', 1202), ('share', 1162), ('quarter', 944), ('business', 835), ('market', 732), ('investor', 705), ('sale', 686), ("'", 662), (',', 661), ('new', 642), ('last', 632), ('service', 566), ('china', 518), ('growth', 508), ('also', 479), ('analyst', 453), ('stock', 450), ('apple', 445), ('according', 434), ('profit', 422), ('first', 417), ('would', 410), ('inc', 403), ('uber', 402)]
Cluster : 1
most common words : [('facebook', 1203), ('ad', 887), ('user', 763), ('video', 491), ('company', 488), ('app', 437), ('google', 433), ('mobile', 378), ('new', 377), ('twitter', 375), ('said', 350), ('people', 338), ('like', 307), ("'", 293), (',', 292), ('service', 287), ('year', 272), ('also', 256), ('search', 253), ('million', 245), ('product', 239), ('percent', 237), ('business', 226), ('apps', 221), ('time', 219), ('one', 217), ('social', 209), ('d
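The quote and comma entries in the output above are a side effect of tokenizing the string form of a Python list. A minimal alternative sketch that counts words per cluster with `groupby` and `collections.Counter` instead; the `toy` frame and its contents are made up, and this is a swapped-in technique rather than the notebook's own code:

```python
from collections import Counter

import pandas as pd

# Hypothetical processed posts with cluster labels.
toy = pd.DataFrame({
    'processed': ['stock profit stock', 'facebook ad facebook', 'profit stock quarter'],
    'labels': [0, 1, 0],
})

top_words = {}
for label, group in toy.groupby('labels'):
    # Join the cluster's documents with spaces and split back into words.
    tokens = ' '.join(group['processed']).split()
    top_words[label] = Counter(tokens).most_common(2)

print(top_words)
```

Because the documents were already cleaned into space-separated words, a plain `split()` is enough here and no punctuation tokens are introduced.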

Most frequent unique words in each cluster

In [24]:
text={}
for i,cluster in enumerate(km.labels_):
    oneDocument = df1['processed'][i]
    if cluster not in text.keys():
        text[cluster] = oneDocument
    else:
        # join with a space so words at document boundaries are not fused together
        text[cluster] += ' ' + oneDocument

In [25]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from collections import defaultdict
from string import punctuation
from heapq import nlargest
import nltk

In [26]:
_stopwords = set(stopwords.words('english') + list(punctuation)+["million","billion","year","millions","billions","y/y","'s","''","``"])

In [27]:
keywords = {}
counts={}
for cluster in range(3):
    word_sent = word_tokenize(text[cluster].lower())
    word_sent=[word for word in word_sent if word not in _stopwords]
    freq = FreqDist(word_sent)
    keywords[cluster] = nlargest(100, freq, key=freq.get)
    counts[cluster]=freq

In [28]:
unique_keys={}
for cluster in range(3):   
    other_clusters=list(set(range(3))-set([cluster]))
    keys_other_clusters=set(keywords[other_clusters[0]]).union(set(keywords[other_clusters[1]]))
    unique=set(keywords[cluster])-keys_other_clusters
    unique_keys[cluster]=nlargest(10, unique, key=counts[cluster].get)

In [29]:
unique_keys

{0: ['quarter',
  'growth',
  'analyst',
  'stock',
  'profit',
  'valuation',
  'cloud',
  'cent',
  'public',
  'rose'],
 1: ['facebook',
  'ad',
  'search',
  'apps',
  'social',
  'advertiser',
  'brand',
  'advertising',
  'feature',
  'instagram'],
 2: ['car',
  'store',
  'consumer',
  'pay',
  'system',
  'startup',
  'plan',
  'payment',
  'country',
  'industry']}
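The unique-keyword step is a set difference: a word counts as unique to a cluster when it is in that cluster's top keywords and in no other cluster's. A minimal sketch that works for any number of clusters instead of the hard-coded three; the toy `keywords` dictionary and the `unique` name are assumptions for illustration:

```python
# Hypothetical top keywords per cluster, most frequent first.
keywords = {
    0: ['company', 'stock', 'profit'],
    1: ['company', 'facebook', 'ad'],
    2: ['company', 'car', 'payment'],
}

unique = {}
for cluster, words in keywords.items():
    # Union of the keywords of every other cluster.
    others = set().union(*(set(w) for c, w in keywords.items() if c != cluster))
    # Keep the original frequency order while filtering shared words.
    unique[cluster] = [w for w in words if w not in others]

print(unique)
```

Shared words such as `company` drop out, leaving only the terms that distinguish one cluster from the rest.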

### Topic Modeling

Based on the unique keywords above, the clusters can be read as:

1. Cluster 0 = Stock and financial performance
2. Cluster 1 = Social media and advertising
3. Cluster 2 = Startups and fundraising plans