In natural language processing, the Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.

  Over here in task 1 our main aim is to use topic modelling and then corrosponding get the containers as explained in the given text file (guidelined)

  The Basic structure of the system would be Loading the dataset, Data Analysis, Data Preprocessing, Bag of words/TF-IDF approach for the dataset, LDA using the Bag of words/TF-IDF, Classification and Testing of the model and Probability of the category.

Subtask -1

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('sentisum-assessment-dataset.csv',header=None)
df

Unnamed: 0,0,1
0,Tires where delivered to the garage of my choi...,
1,"Easy Tyre Selection Process, Competitive Prici...",
2,Very easy to use and good value for money.,
3,Really easy and convenient to arrange,
4,It was so easy to select tyre sizes and arrang...,
...,...,...
10127,"I ordered the wrong tyres, however [REDACTED] ...",
10128,"Good experience, first time I have used [REDAC...",
10129,"I ordered the tyre I needed on line, booked a ...",
10130,Excellent service from point of order to fitti...,


In [3]:
df=df.dropna(axis=1)
df

Unnamed: 0,0
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...
...,...
10127,"I ordered the wrong tyres, however [REDACTED] ..."
10128,"Good experience, first time I have used [REDAC..."
10129,"I ordered the tyre I needed on line, booked a ..."
10130,Excellent service from point of order to fitti...


In [4]:
print(df.shape)

(10132, 1)


The dataset (corpus) includes 10132 entries of data. These are sentences/phrases which belong to one of 12 categories provided in the evaluation label file.

In [5]:
### The top most entry of the data
df.loc[1]

0    Easy Tyre Selection Process, Competitive Prici...
Name: 1, dtype: object

In [6]:
### the top 5 entries of the dataset 
df.head()

Unnamed: 0,0
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...


In [7]:
df.describe(include=[object])

Unnamed: 0,0
count,10132
unique,10132
top,"Best price, very easy ordering process, short ..."
freq,1


The standardized procedure of solving the NLP problem is by following the conventional way of Data Pre-processing which includes Tokenization, Stopwrods removal, lemmatization and Stemming. We'll discuss all of them in detail with an example and then on our dataset.
For Preporcssing we'll be using NLTK.

NLTK:
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.

In [8]:
import nltk
from nltk.corpus import stopwords  #stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

Stemming:
The second pre-procsessing technique we'll be discussing is Stemming. We'll be using few words and let's see how it recognizes and feals with them.

Lemmatization:
Let's start with Lemmatization of the text (Example) This is one of the first pre-processing task we'll be doing for our dataset. For convenience an example is demonstrate below:

In [9]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
stop_words=set(nltk.corpus.stopwords.words('english'))

In [11]:
stemmer = SnowballStemmer("english") #snowball stemmer
original_words = ['alumnus','universal', 'waited', 'Flying', 'caring', 'flies', 'dies', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'state', 'siezing', 'itemization','sensational', 
           'traditionally', 'referencing', 'colonizer','plotted','providing'] 
plural = [stemmer.stem(plural) for plural in original_words] #Stemmed into plural form

pd.DataFrame(data={'original word':original_words, 'stemmed':plural})

Unnamed: 0,original word,stemmed
0,alumnus,alumnus
1,universal,univers
2,waited,wait
3,Flying,fli
4,caring,care
5,flies,fli
6,dies,die
7,agreed,agre
8,owned,own
9,humbled,humbl


In [12]:
import nltk
nltk.download('wordnet')
print(WordNetLemmatizer().lemmatize('working', pos = 'v')) 
# past tense to present tense

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
work


In [13]:
def clean_text(text):
      text = re.sub("@[A-Za-z0-9]+", '',text)
      le=WordNetLemmatizer()
      word_tokens=word_tokenize(text)
      tokens=[le.lemmatize(w) for w in word_tokens if w not in stop_words and len(w)>3]
      cleaned_text=" ".join(tokens)
      return cleaned_text

In [14]:
import nltk
nltk.download('punkt')
df['cleaned_text']=df[0].apply(clean_text)
df

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,0,cleaned_text
0,Tires where delivered to the garage of my choi...,Tires delivered garage choice garage notified ...
1,"Easy Tyre Selection Process, Competitive Prici...",Easy Tyre Selection Process Competitive Pricin...
2,Very easy to use and good value for money.,Very easy good value money
3,Really easy and convenient to arrange,Really easy convenient arrange
4,It was so easy to select tyre sizes and arrang...,easy select tyre size arrange local fitting pr...
...,...,...
10127,"I ordered the wrong tyres, however [REDACTED] ...",ordered wrong tyre however REDACTED arranged c...
10128,"Good experience, first time I have used [REDAC...",Good experience first time used REDACTED Harbo...
10129,"I ordered the tyre I needed on line, booked a ...",ordered tyre needed line booked specified time...
10130,Excellent service from point of order to fitti...,Excellent service point order fitting complain...


Carrying out a TFIDF vectorization on the text column gives us a document term matrix on which we can carry out the topic modelling.

In [15]:
vect =TfidfVectorizer(stop_words=stop_words,max_features=1000)
vect_text=vect.fit_transform(df['cleaned_text'])

The parameters that we have given to the LDA model, as shown below, include the number of topics, the learning method (which is the way the algorithm updates the assignments of the topics to the documents), the maximum number of iterations to be carried out and the random state. 

In [16]:
from sklearn.decomposition import LatentDirichletAllocation
lda_model=LatentDirichletAllocation(n_components=12,
learning_method='online',random_state=42,max_iter=1) 
lda_top=lda_model.fit_transform(vect_text)

In [17]:
vocab = vect.get_feature_names()
for i, comp in enumerate(lda_model.components_):
     vocab_comp = zip(vocab, comp)
     sorted_words = sorted(vocab_comp, key= lambda x:x[1], reverse=True)[:12]
     print("Topic " +str(i)+": ")
     for t in sorted_words:
            print(t[0],end=" ") 

Topic 0: 
value good great experience service money start finish straightforward easy pleased cheap Topic 1: 
priced perfect choose easy exactly book tyre communication fitted wanted want decent Topic 2: 
quick easy efficient simple helpful friendly service website staff process good price Topic 3: 
tyre redacted time garage fitting fitted service ordered good used appointment issue Topic 4: 
recommend would highly recommended service professional definitely excellent prompt well friend good Topic 5: 
tyre used service time redacted hassle price good best easy problem great Topic 6: 
service good price great excellent fast easy reliable tyre fantastic always choice Topic 7: 
service thank definitely great purchase tyre excellent fault easy price using cheaper Topic 8: 
easy convenient order really arrange straight forward pricing competitive spot price good Topic 9: 
class brilliant first deal else service anywhere pleasant fitting plenty easily tire Topic 10: 
tyre time thanks said ar

1. 0 -> value for money
2. 1-> change of date
3. 2 -> wait time
4. 3 -> booking confusion
5. 4 -> mobile fitter
6. 5 -> length of fitting
7. 6 -> tyre type
8. 7 -> discounts
9. 8 -> ease of booking
10. 9 -> garage service
11. 10 -> delivery puntuality 
12. 11 -> location

In [18]:
for i,topic in enumerate(lda_top[0:]):
  print("Document ",i,": ",topic*100,"%")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Document  7632 :  [ 2.83207384  2.83213729  2.83241232  2.83193587 68.84716722  2.83201157
  2.83210353  2.83201907  2.8323269   2.83198294  2.83191601  2.83191342] %
Document  7633 :  [ 3.46148077 61.92371148  3.46148077  3.46148077  3.46148077  3.46148077
  3.46148077  3.46148077  3.46148077  3.46148077  3.46148077  3.46148077] %
Document  7634 :  [ 1.93843936  1.93841024  1.93863099 78.67676199  1.93837931  1.93845956
  1.93848266  1.93844795  1.9387889   1.9384095   1.93841499  1.93837455] %
Document  7635 :  [ 2.86646003 68.46908745  2.86646482  2.86643402  2.8664328   2.86644558
  2.86645093  2.866439    2.86648911  2.86643674  2.86643019  2.86642935] %
Document  7636 :  [ 3.07797181  3.07791243  3.07796244  3.0779106   3.07797201  3.07794129
 66.14272582  3.07795931  3.07791753  3.0779206   3.07790392  3.07790223] %
Document  7637 :  [ 2.67611156  2.67608195  2.67611075  2.67609674  2.6761019  31.7102863
 41.528666

Checking the results:
We can check the proportion of topics that have been assigned to the first document using the lines of code given below.

As you can see that we have generated the calusters for all the data points that are provides to us. On carefully seeing many inference can be drawn.

These scores are the probabilities of falling in a particular cluster so those with greater or equal than 50%, it's prediction is quite sure.
Now there are serveral cases where the clsuters are of equal probabilities. Assigning them to all doesn't makes any sense, So we'll try to either avoid or use any of them. This gives a rise to bias but I think with some approximation this can be cut down as well.
Those clusters that will have relative max scores (first max and second max) greater than 10% will be labelled.
These were few approximations and assumptions that can be used to alter down multiple cases.

In [19]:
lda_top=lda_top*100
lda_top

array([[ 2.10846955,  2.10829083,  2.10827338, ...,  2.10832125,
        43.53824671,  2.10820365],
       [ 2.18112667,  2.18106676, 15.10818391, ...,  2.18106694,
         2.18104213,  2.18102397],
       [68.5834061 ,  2.85606968,  2.85609799, ...,  2.85602886,
         2.85601775,  2.85601291],
       ...,
       [ 1.71255391,  1.71253778,  1.7125943 , ...,  1.71259441,
        30.04211687,  1.7124719 ],
       [ 2.40601383,  2.40604717,  2.40601909, ...,  2.40599407,
         2.40596728,  2.40594808],
       [21.12068793,  2.64076995,  2.64074923, ...,  2.64074923,
         2.64076094,  2.64074923]])

In [20]:
lda_top = lda_top.astype(int)
lda_top

array([[ 2,  2,  2, ...,  2, 43,  2],
       [ 2,  2, 15, ...,  2,  2,  2],
       [68,  2,  2, ...,  2,  2,  2],
       ...,
       [ 1,  1,  1, ...,  1, 30,  1],
       [ 2,  2,  2, ...,  2,  2,  2],
       [21,  2,  2, ...,  2,  2,  2]])

In [21]:
maxvalues = np.amax(lda_top, axis=1)
maxvalues

array([43, 38, 68, ..., 52, 51, 32])

In [22]:
len(maxvalues)

10132

In [23]:
indexs = np.argmax(lda_top, axis=1)
indexs

array([10,  6,  0, ...,  5,  3,  4])

As we can see here these indexs are clusters which they belong to as these values are the probabilities where they actually belong to.

Our next goal would be to quantify the clusters (if many clusters have nearly same score then it would be difficult to handle many categorical values at the same time). Though this approach may/may not be the correct one but to seems appropriate as we can merely not choose sole on the basis of high probability.

So we'll be using some assumptions and use of bias here to annoatate the dataset and then train subtask b.

Now let's first check our dataset for assumption 1 that is values > 0.5 or 50

In [24]:
greater_than_50 = (maxvalues > 50)
greater_than_50

array([False, False,  True, ...,  True,  True, False])

In [25]:
res_50 = [i for i, val in enumerate(greater_than_50) if not val]

In [26]:
len(res_50)

4357

So as per our dataset we can clearly say that close to 5,775 samples have values > 50 (percentage) which means our model was smart enough to classify it on the basis of the features.

In [27]:
#less than 50 and greater than 40
l_50_g_40 = (maxvalues > 40) & (maxvalues < 50)
l_50_g_40

array([ True, False, False, ..., False, False, False])

In [28]:
resl_50_g_40 = [i for i, val in enumerate(l_50_g_40) if not val]

In [29]:
len(resl_50_g_40)

8167

So as per our dataset we can clearly say that close to 1,965 samples have values greater than 40 and less than 50 (percentage) which means our model was smart enough to classify it on the basis of the features.

In [30]:
#less than 40 and greater than 30
l_40_g_30 = (maxvalues > 30) & (maxvalues < 40)
l_40_g_30

array([False,  True, False, ..., False, False,  True])

In [31]:
resl_40_g_30 = [i for i, val in enumerate(l_40_g_30) if not val]

In [32]:
len(resl_40_g_30)

8614

So as per our dataset we can clearly say that close to 1,518 samples have values greater than 30 and less than 40 (percentage) which means our model was smart enough to classify it on the basis of the features.

So we'll add the indexes that we calculated as theose were the clusters only, so adding it to the dataset

In [33]:
df['annot'] = indexs
df['high_val'] = maxvalues
df

Unnamed: 0,0,cleaned_text,annot,high_val
0,Tires where delivered to the garage of my choi...,Tires delivered garage choice garage notified ...,10,43
1,"Easy Tyre Selection Process, Competitive Prici...",Easy Tyre Selection Process Competitive Pricin...,6,38
2,Very easy to use and good value for money.,Very easy good value money,0,68
3,Really easy and convenient to arrange,Really easy convenient arrange,8,68
4,It was so easy to select tyre sizes and arrang...,easy select tyre size arrange local fitting pr...,5,30
...,...,...,...,...
10127,"I ordered the wrong tyres, however [REDACTED] ...",ordered wrong tyre however REDACTED arranged c...,3,81
10128,"Good experience, first time I have used [REDAC...",Good experience first time used REDACTED Harbo...,3,55
10129,"I ordered the tyre I needed on line, booked a ...",ordered tyre needed line booked specified time...,5,52
10130,Excellent service from point of order to fitti...,Excellent service point order fitting complain...,3,51


Now that we have the data annotation part done we'll start with training the supervised machine learning algorithm.

In [34]:
from sklearn.model_selection import train_test_split

In [35]:
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df.index, test_size=0.10)

In [36]:
## Buidlding Count Vectorizer to convert the Messsage to Vectors
vect_df = TfidfVectorizer()

In [37]:
X_train_text = vect_df.fit_transform(X_train)
X_train_text

<9118x5697 sparse matrix of type '<class 'numpy.float64'>'
	with 97126 stored elements in Compressed Sparse Row format>

In [38]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
#Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

In [39]:
# model_params = {
#     'svm': {
#         'model': svm.SVC(gamma='auto'),
#         'params' : {
#             'C': [1,10,20],
#             'kernel': ['rbf','linear']
#         }  
#     },
#     'random_forest': {
#         'model': RandomForestClassifier(),
#         'params' : {
#             'n_estimators': [1,5,10]
#         }
#     },
#     'logistic_regression' : {
#         'model': LogisticRegression(solver='liblinear',multi_class='auto'),
#         'params': {
#             'C': [1,5,10,15]
#         }
#     },
#     'decision_tree': {
#         'model': DecisionTreeClassifier(),
#         'params' : {
#             'criterion': ['gini', 'entropy'],
#             'splitter': ['best','random']
#         }  
#     },
#     'knn': {
#         'model': KNeighborsClassifier(),
#         'params' : {
#             'n_neighbors': [5,7,9,11],
#             'algorithm' : ['ball_tree', 'kd_tree', 'brute']
#         }
#     },
#     'naive_bayes' : {
#         'model': GaussianNB(),
#         'params': {
#         }
#     }
# }

In [40]:
# scores = []

# for model_name, mp in model_params.items():
#     clf =  GridSearchCV(mp['model'], mp['params'], cv=2, return_train_score=False)
#     clf.fit(X_train_text, df.index)
#     scores.append({
#         'model': model_name,
#         'best_score': clf.best_score_,
#         'best_params': clf.best_params_
#     })
    
# vals = pd.DataFrame(scores,columns=['model','best_score','best_params'])
# vals

In [41]:
mnb = MultinomialNB()
lr = LogisticRegression()
dt = DecisionTreeClassifier()
rfc = RandomForestClassifier()
svm = SVC()
knn = KNeighborsClassifier()

In [42]:
# from sklearn.pipeline import Pipeline
# clf = Pipeline([
#     ('vectorizer', TfidfVectorizer()),
#     ('nb', MultinomialNB()),
#     ('lr', LogisticRegression()),
#     ('dt', DecisionTreeClassifier()),
#     ('rfc',RandomForestClassifier()),
#     ('svm',SVC()),
#     ('knn',KNeighborsClassifier())
# ])

In [43]:
mnb.fit(X_train_text,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [44]:
mnb.score(X_train_text,y_train)

0.9466988374643562

Summary:
The NLP task that was sent was based upron clustering topics and then using the annotations to predict the topic clusters. 

I followed a basic approach that is first I cleaned the pre-processed the data (lemmatized, tokenized and then created a tfidf vector form). Used these to predict where the text actually belongs to.

There were various times where bias came in such as where the maximum val was less than 35%. Here the second cluster had score somewhat ranging from 28% to 32%. To avoid the situtation of false positives I took the highest score only (which is not a good approach as it can be classified into another cluster as well).

Used classical ML algo of ensemble and other classification and tried grid search cv as well. But due to some issue wasn;t able to do it efficiently. 
I got highest accuracy on Naive bayes aglo of 94.6%. 

