### Subtask 1 (Latent Dirichlet Allocation)

In natural language processing, the Latent Dirichlet Allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.
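
To make the generative story concrete, here is a minimal plain-Python sketch of how LDA assumes a document is produced: draw a topic mixture for the document, then for every word position pick a topic from that mixture and a word from that topic. The two toy topics, their word probabilities, and the `alpha` value are made up for illustration and are not taken from this dataset.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (illustrative only)
TOPICS = {
    "price":   {"cheap": 0.5, "value": 0.3, "money": 0.2},
    "fitting": {"garage": 0.4, "fitter": 0.4, "quick": 0.2},
}

def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional topic mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Pick a topic mixture, then a topic and a word for each position."""
    rng = random.Random(seed)
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]  # topic for this word
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]
        words.append(word)
    return mixture, words

mixture, words = generate_document(8)
print(mixture, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and the per-document mixtures.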

In Task 1, our main aim is to use topic modelling to assign each review to the corresponding containers (categories) described in the provided guideline text file.

The basic structure of the system is: loading the dataset, data analysis, data preprocessing, building a bag-of-words/TF-IDF representation of the dataset, running LDA on the bag of words/TF-IDF, classifying and testing the model, and computing the probability of each category.

## Step 1: Loading the Dataset 

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [31]:
df = pd.read_csv('C:\\Users\\mesho\\OneDrive\\Desktop\\SentiSum\\sentisum-assessment-dataset.csv',header=None)
df

Unnamed: 0,0,1
0,Tires where delivered to the garage of my choi...,
1,"Easy Tyre Selection Process, Competitive Prici...",
2,Very easy to use and good value for money.,
3,Really easy and convenient to arrange,
4,It was so easy to select tyre sizes and arrang...,
...,...,...
10127,"I ordered the wrong tyres, however [REDACTED] ...",
10128,"Good experience, first time I have used [REDAC...",
10129,"I ordered the tyre I needed on line, booked a ...",
10130,Excellent service from point of order to fitti...,


In [32]:
df=df.dropna(axis=1)
df

Unnamed: 0,0
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...
...,...
10127,"I ordered the wrong tyres, however [REDACTED] ..."
10128,"Good experience, first time I have used [REDAC..."
10129,"I ordered the tyre I needed on line, booked a ..."
10130,Excellent service from point of order to fitti...


### Let's gain some deeper insights into the dataset

In [33]:
print(df.shape)

(10132, 1)


The dataset (corpus) includes 10132 entries of data. These are sentences/phrases which belong to one of 12 categories provided in the evaluation label file. 

In [34]:
### The second entry of the data (row label 1)
df.loc[1]

0    Easy Tyre Selection Process, Competitive Prici...
Name: 1, dtype: object

In [35]:
### the top 5 entries of the dataset 
df.head()

Unnamed: 0,0
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...


In [36]:
df.describe(include=[object])

Unnamed: 0,0
count,10132
unique,10132
top,"Easy process, saves time."
freq,1


The summary statistics don't tell us much. What we can interpret from them is that there are 10132 entries, all of them unique, so each has a frequency of 1.

## Step 2: Data Preprocessing

### The standard procedure for an NLP problem is conventional data preprocessing, which includes tokenization, stopword removal, lemmatization, and stemming. We'll discuss these in detail with an example and then apply them to our dataset. 

#### For preprocessing we'll be using NLTK and Gensim
##### Gensim: 
Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning. 

##### NLTK: 
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.

In [37]:
#pip install gensim 

In [38]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

### Stemming 
The first preprocessing technique we'll discuss is stemming. We'll run a few words through the stemmer and see how it recognizes and deals with them. 

In [39]:
stemmer = SnowballStemmer("english") #snowball stemmer
original_words = ['alumnus','universal', 'waited', 'Flying', 'caring', 'flies', 'dies', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'state', 'siezing', 'itemization','sensational', 
           'traditionally', 'referencing', 'colonizer','plotted','providing'] 
stemmed = [stemmer.stem(word) for word in original_words]  # reduce each word to its stem

pd.DataFrame(data={'original word': original_words, 'stemmed': stemmed})

Unnamed: 0,original word,stemmed
0,alumnus,alumnus
1,universal,univers
2,waited,wait
3,Flying,fli
4,caring,care
5,flies,fli
6,dies,die
7,agreed,agre
8,owned,own
9,humbled,humbl


### Lemmatization
Next comes **lemmatization** of the text (example).
This is another of the preprocessing tasks we'll apply to our dataset. For convenience, an example is demonstrated below:

In [40]:
print(WordNetLemmatizer().lemmatize('working', pos = 'v')) 
# inflected verb form reduced to its base form

work


### Before using the text, we'll clean the data to make sure it doesn't include unwanted extras such as symbols, numbers, etc.

In [41]:
import re

# Function to clean the text: removes @-handles such as "@user123"
def cleantext(text):
    text = re.sub(r"@[A-Za-z0-9]+", '', text)
    return text

In [42]:
# Refined dataframe
df = df.copy()  # work on an explicit copy to avoid pandas' SettingWithCopyWarning
df[0] = df[0].apply(cleantext)
df


Unnamed: 0,0
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...
...,...
10127,"I ordered the wrong tyres, however [REDACTED] ..."
10128,"Good experience, first time I have used [REDAC..."
10129,"I ordered the tyre I needed on line, booked a ..."
10130,Excellent service from point of order to fitti...


In [43]:
train = df.loc[:7999]
train.shape
train

Unnamed: 0,0
0,Tires where delivered to the garage of my choi...
1,"Easy Tyre Selection Process, Competitive Prici..."
2,Very easy to use and good value for money.
3,Really easy and convenient to arrange
4,It was so easy to select tyre sizes and arrang...
...,...
7995,It was good deal
7996,Needed 4 new tyres and shopped around a wee bi...
7997,I had 4 new Yokohama Tyres which I had install...
7998,It was very easy to find the tyres that I want...


In [44]:
test = df.loc[8000:]
test.shape

(2132, 1)

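The `cleantext` function defined above takes the text as input and removes @-handles, i.e. an `@` followed by letters or digits (A-Z, a-z, 0-9). The cleaned data is then split into the first 8000 rows for training and the remaining 2132 rows for testing.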
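
As a quick check of what the regular expression in `cleantext` actually removes, here is a small self-contained sketch. The sample sentences are made up for illustration. Note that only @-handles are stripped; other digits and symbols are left untouched.

```python
import re

def cleantext(text):
    """Remove @-handles: an '@' followed by one or more letters or digits."""
    return re.sub(r"@[A-Za-z0-9]+", "", text)

sample = "Thanks @TyreShop24 for the quick fit"  # made-up example sentence
print(repr(cleantext(sample)))
print(repr(cleantext("Needed 4 new tyres!")))  # no handle, so nothing changes
```

The handle is removed but the spaces on either side of it remain, which is harmless here because tokenization later splits on whitespace.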

### Tokenization and Lemmatization function for a sample Document (Example)

In [45]:
# Lemmatized stemming
def lemmatize_stemming(text):
    """Lemmatize a token as a verb, then stem the result."""
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    """Tokenize a document (sample), drop stopwords, then lemmatize and stem each token."""
    final = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            final.append(lemmatize_stemming(token))
    return final

In [46]:
### doc (25th sample from the dataset) 
doc_sample = 'Best prices and fitted by local company.'

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['Best', 'prices', 'and', 'fitted', 'by', 'local', 'company.']


Tokenized and lemmatized document: 
['best', 'price', 'fit', 'local', 'compani']


### Now let's apply the same to the rest of the documents

In [47]:
# empty list df_document
df_document = []

for doc in train[0]:
    df_document.append(preprocess(doc))

In [48]:
df_document

[['tire',
  'deliv',
  'garag',
  'choic',
  'garag',
  'notifi',
  'deliv',
  'day',
  'time',
  'arrang',
  'garag',
  'go',
  'fit',
  'hassel',
  'free',
  'experi'],
 ['easi',
  'tyre',
  'select',
  'process',
  'competit',
  'price',
  'excel',
  'fit',
  'servic'],
 ['easi', 'use', 'good', 'valu', 'money'],
 ['easi', 'conveni', 'arrang'],
 ['easi',
  'select',
  'tyre',
  'size',
  'arrang',
  'local',
  'fit',
  'price',
  'competit'],
 ['servic',
  'excel',
  'slight',
  'downsid',
  'know',
  'exact',
  'time',
  'garag',
  'garag',
  'quick',
  'wasn',
  'delay'],
 ['user',
  'friend',
  'websit',
  'competit',
  'price',
  'good',
  'communic',
  'effici',
  'servic',
  'at',
  'euromast'],
 ['excel', 'price', 'servic'],
 ['straightforward', 'garag', 'great', 'hadn', 'know'],
 ['use', 'local', 'garag'],
 ['easi', 'use', 'good', 'price'],
 ['outstand', 'valu', 'money', 'friend', 'profession', 'servic'],
 ['great', 'price', 'easi', 'use'],
 ['good', 'price', 'easi', 'use', '

In the above step we simply passed our dataframe column through the preprocess function to get the corresponding lemmatized and stemmed tokens.

### We're basically done with the initialization of the dataset, basic text cleaning, and text preprocessing. Our next step is creating a bag of words for the dataset 

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. 
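
A minimal sketch of this idea, using Python's `collections.Counter` as the multiset. The token list is illustrative, in the style of the stemmed tokens produced earlier.

```python
from collections import Counter

# Illustrative stemmed tokens from one document
tokens = ["garag", "deliv", "garag", "fit", "garag"]

bag = Counter(tokens)  # word order is discarded, counts are kept
print(bag)
```

Two documents with the same words in a different order produce the same bag, which is exactly the information the model gives up.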

The methodology we're adopting to create the bag of words is gensim's corpora Dictionary [Link: https://radimrehurek.com/gensim/corpora/dictionary.html].

In [49]:
# Build a gensim Dictionary from df_document (note: the name dict shadows Python's built-in dict)
dict = gensim.corpora.Dictionary(df_document)

In [50]:
len(dict)

3724

Here we can see that the dictionary contains 3724 unique tokens. 

### Gensim Filter Extremes

We use gensim's filter_extremes to filter out tokens according to their document frequency. 

```Dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)``` 

What does it do?
It filters out tokens that appear in

1. fewer than no_below documents (absolute number), or
2. more than no_above documents (fraction of total corpus size, not absolute number).
3. After (1) and (2), it keeps only the first keep_n most frequent tokens (or keeps all if None).

After the pruning, it shrinks the resulting gaps in word ids.
More info: [https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.filter_extremes.html].
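
The three rules can be sketched in plain Python as follows. This is an illustrative approximation written for this walkthrough, not gensim's own code; the function name `filter_extremes_sketch` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: the number of documents each token appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # turn the fraction into a document count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (all of them if keep_n is None)
    kept.sort(key=lambda t: -doc_freq[t])
    return set(kept if keep_n is None else kept[:keep_n])

docs = [["easi", "use"], ["easi", "price"], ["easi", "fit"], ["price", "rare"]]
print(filter_extremes_sketch(docs, no_below=2, no_above=0.5))
```

In this toy corpus "easi" is dropped for being too common (3 of 4 documents) and "use", "fit", and "rare" for being too rare, leaving only "price".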


In [53]:
dict.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)

Conversion of documents to corresponding vectors

### Gensim doc2bow
```doc2bow(document, allow_update=False, return_missing=False)```

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) tuples. Words that are not in the dictionary are ignored.

**Parameters**
document (list of str) – Input document

allow_update (bool, optional) – Update the dictionary by adding new tokens from the document.

return_missing (bool, optional) – Also return the tokens that are not in the dictionary, with their frequencies.

Returns
BoW representation of the document, as (token_id, token_count) pairs.

Return type
list of (int, int)
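
The conversion can be sketched in plain Python as below. This is an illustration rather than gensim's implementation; `doc2bow_sketch` and the small `token2id` mapping are hypothetical.

```python
from collections import Counter

def doc2bow_sketch(document, token2id):
    """Count known tokens and return sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[t] for t in document if t in token2id)
    return sorted(counts.items())

token2id = {"garag": 0, "deliv": 1, "fit": 2}  # hypothetical dictionary after filtering
print(doc2bow_sketch(["garag", "deliv", "garag", "easi"], token2id))
print(doc2bow_sketch(["easi", "use", "good"], token2id))
```

The second call returns an empty list because none of its tokens are in the dictionary. This is how a short review can end up as `[]` in a bag-of-words corpus once very common or very rare tokens have been filtered out.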

In [54]:
bow = [dict.doc2bow(doc) for doc in df_document]
bow

[[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1)],
 [(0, 1), (13, 1)],
 [(0, 1), (8, 1), (10, 1), (14, 1), (15, 1)],
 [(16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)],
 [(8, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1)],
 [],
 [(18, 1), (29, 1), (30, 1)],
 [(14, 1)],
 [],
 [(11, 1), (12, 1), (26, 1), (31, 1)],
 [],
 [],
 [(0, 1), (32, 1), (33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1)],
 [(39, 1)],
 [(8, 1), (31, 1), (39, 1), (40, 1)],
 [(41, 1), (42, 1), (43, 1), (44, 1), (45, 1)],
 [(8, 1), (10, 1), (28, 1), (46, 1), (47, 1)],
 [(4, 1), (33, 1), (48, 1)],
 [(49, 1), (50, 1)],
 [(22, 1)],
 [(1, 1),
  (9, 1),
  (12, 1),
  (19, 1),
  (28, 1),
  (46, 1),
  (51, 1),
  (52, 1),
  (53, 1)],
 [(1, 1), (19, 1)],
 [(54, 1), (55, 1), (56, 1), (57, 1)],
 [(14, 1), (58, 1), (59, 1)],
 [(33, 1), (38, 1), (60, 1), (61, 1), (62, 2)],
 [],
 [(53, 1), (63, 1)],
 [(19, 1)],
 [(33, 1), (52, 1), (64, 1), (65

In [55]:
from pprint import pprint
from gensim import corpora, models

tfidf = models.TfidfModel(bow)
corpus_tfidf = tfidf[bow]
# show the TF-IDF weights of the first document only
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.3045821450827776),
 (1, 0.24250627422352353),
 (2, 0.2524785986902908),
 (3, 0.5874606507665467),
 (4, 0.2745856122046924),
 (5, 0.3290271045782807),
 (6, 0.2814574101501956),
 (7, 0.4203458066376818)]


The output above shows the first document after TF-IDF weighting. Each pair holds a token id and that token's TF-IDF weight, rather than the raw count stored in the bag of words.
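
To show where such weights come from, here is a plain-Python sketch of TF-IDF. It assumes the weighting is the term count times log2 of (number of documents / document frequency), followed by scaling each document to unit length; that matches my understanding of gensim's defaults but is stated here as an assumption. The function name and toy corpus are hypothetical.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each (token_id, count) by log2(N / df), then scale each document to unit length."""
    n_docs = len(bow_corpus)
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in weights))
        # Tokens that occur in every document get weight 0 and are dropped
        weighted_corpus.append([(tid, w / norm) for tid, w in weights if norm > 0 and w > 0])
    return weighted_corpus

toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(2, 1)], [(3, 1)]]  # made-up corpus
for doc in tfidf_sketch(toy_bow):
    print(doc)
```

Under this scheme the squared weights of each document sum to 1, which is also true of the eight weights printed above for the first review.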

### Now it's time to use LDA in gensim with our bag of words that has been created

We'll be using the LDA model that ships with Gensim. [Link:https://radimrehurek.com/gensim/models/ldamodel.html].

It's an optimized LDA in python (parallelized for multicore machines).

This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.

The algorithm:
1. Is streamed: training documents may come in sequentially, no random access required.

2. Runs in constant memory w.r.t. the number of documents: size of the training corpus does not affect memory footprint, can process corpora larger than RAM.

3. Is distributed: makes use of a cluster of machines, if available, to speed up model estimation.

### The parameters we'll be using are:

1. corpus -> The bag of words that we have created is passed to the function. 

2. num_topics -> The given documents have to be classified into the following categories: value for money, garage service, ease of booking, tyre quality, mobile fitter, location, length of fitting, delivery punctuality, booking confusion, wait time, discounts, and change of date. That is 12 in total, so we'll use 12 topics.

3. id2word -> The dictionary (dict) that has been created from the dataframe. 

4. passes -> The number of passes through the corpus during training.

5. workers -> The number of extra processes to use for parallelization.



In [56]:
num_topics = 12
id2word = dict

model =  gensim.models.LdaMulticore(bow, 
                                   num_topics = num_topics, 
                                   id2word = id2word, 
                                   passes = 10,
                                   workers = 2)

In [57]:
# # Printing the keywords occurring in each topic
# from pprint import pprint
# pprint(model.print_topics())
# doc_lda = model[bow]

for i,topic in model.show_topics(formatted=True, num_topics=num_topics, num_words=20):
    print(str(i)+": "+ topic)
    print()

0: 0.048*"day" + 0.038*"book" + 0.035*"appoint" + 0.024*"email" + 0.020*"tell" + 0.019*"say" + 0.017*"arriv" + 0.016*"chang" + 0.015*"date" + 0.014*"work" + 0.013*"get" + 0.013*"custom" + 0.012*"phone" + 0.012*"later" + 0.012*"receiv" + 0.012*"wait" + 0.010*"hour" + 0.010*"cancel" + 0.010*"deliv" + 0.010*"confirm"

1: 0.048*"go" + 0.047*"cheaper" + 0.040*"simpl" + 0.032*"want" + 0.030*"get" + 0.030*"process" + 0.026*"fault" + 0.026*"local" + 0.025*"cheapest" + 0.024*"smooth" + 0.024*"turn" + 0.024*"round" + 0.023*"couldn" + 0.022*"buy" + 0.022*"reliabl" + 0.020*"way" + 0.019*"job" + 0.014*"organis" + 0.014*"shop" + 0.014*"best"

2: 0.118*"experi" + 0.045*"start" + 0.040*"process" + 0.039*"straight" + 0.035*"finish" + 0.029*"forward" + 0.022*"overal" + 0.022*"wait" + 0.022*"pleas" + 0.019*"go" + 0.013*"centr" + 0.013*"expect" + 0.013*"come" + 0.012*"car" + 0.012*"away" + 0.011*"take" + 0.011*"recommend" + 0.011*"give" + 0.010*"partner" + 0.010*"job"

3: 0.160*"effici" + 0.091*"hassl" + 

In [78]:
print(df[0].loc[0:][0:])

0        Tires where delivered to the garage of my choi...
1        Easy Tyre Selection Process, Competitive Prici...
2               Very easy to use and good value for money.
3                    Really easy and convenient to arrange
4        It was so easy to select tyre sizes and arrang...
                               ...                        
10127    I ordered the wrong tyres, however [REDACTED] ...
10128    Good experience, first time I have used [REDAC...
10129    I ordered the tyre I needed on line, booked a ...
10130    Excellent service from point of order to fitti...
10131    Seamless, well managed at both ends. I would r...
Name: 0, Length: 10132, dtype: object


In [97]:
print(model[bow[3]])

[(0, 0.027778842), (1, 0.027779467), (2, 0.027778154), (3, 0.02777826), (4, 0.027779495), (5, 0.027778577), (6, 0.027779602), (7, 0.02777807), (8, 0.6944333), (9, 0.027779628), (10, 0.027778376), (11, 0.027778242)]


As we can see, the above document has been assigned a probability for each of the 12 topics. Now we have to assign each topic the category it belongs to, out of the 12:

0. Value for Money
1. Delivery Punctuality
2. Location
3. 
4. Tyre Quality 
5. Ease of Booking
6.  
7. Change of Date
8. Mobile Fitter
9. Delivery Punctuality
10. Ease of Booking 
11. Location
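
Given such a mapping, picking a category for a document comes down to taking its most probable topic. A minimal sketch, using the distribution printed above for document 3 (rounded) and the labels from the list; topics 3 and 6 have no label in that list and are kept as `None`. The helper name `dominant_topic` is hypothetical.

```python
# Topic distribution for document 3, copied from the output above (rounded)
doc_topics = [(0, 0.0278), (1, 0.0278), (2, 0.0278), (3, 0.0278),
              (4, 0.0278), (5, 0.0278), (6, 0.0278), (7, 0.0278),
              (8, 0.6944), (9, 0.0278), (10, 0.0278), (11, 0.0278)]

# Labels from the list above; topics 3 and 6 are unlabelled there
topic_labels = {0: "Value for Money", 1: "Delivery Punctuality", 2: "Location",
                3: None, 4: "Tyre Quality", 5: "Ease of Booking", 6: None,
                7: "Change of Date", 8: "Mobile Fitter",
                9: "Delivery Punctuality", 10: "Ease of Booking", 11: "Location"}

def dominant_topic(topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topics, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(doc_topics)
print(topic_id, prob, topic_labels[topic_id])
```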



As the above example shows, we are able to categorize documents, but doubts remain because the topic keywords don't give us concrete results. So we'll try to build a model that takes the benefit of LDA together with some other technique to come up with a better combination.

In [97]:
num = 1721
unseen_document = df[0][num]
print(unseen_document)

Could not be happier with the service and price. A small independent tyre fitter, really friendly and cheaper than the big name chains, and unlike them, I didn't feel I was getting ripped off and didn't get upsold on other services like tracking.This will be my go-to place from now on.


In [98]:
bow_v = dict.doc2bow(preprocess(unseen_document))

for index, score in sorted(model[bow_v], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, model.print_topic(index, 7)))

Score: 0.2832721471786499	 Topic: 0.059*"wheel" + 0.043*"car" + 0.023*"new" + 0.020*"cheap" + 0.018*"check" + 0.017*"damag" + 0.017*"fitter"
Score: 0.267120361328125	 Topic: 0.062*"centr" + 0.049*"problem" + 0.048*"definit" + 0.041*"deal" + 0.035*"purchas" + 0.028*"reason" + 0.022*"custom"
Score: 0.20771954953670502	 Topic: 0.084*"local" + 0.071*"websit" + 0.058*"simpl" + 0.044*"competit" + 0.041*"select" + 0.037*"save" + 0.035*"site"
Score: 0.18127605319023132	 Topic: 0.088*"buy" + 0.051*"cheaper" + 0.049*"qualiti" + 0.042*"way" + 0.036*"start" + 0.033*"want" + 0.032*"finish"


In [99]:
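# NOTE: model_tfidf is not defined in the cells shown here; it is assumed to be
# an LDA model trained on corpus_tfidf in the same way as model
bow_v_tf = dict.doc2bow(preprocess(unseen_document))

for index, score in sorted(model_tfidf[bow_v_tf], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, model_tfidf.print_topic(index, 7)))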

Score: 0.4169260263442993	 Topic: 0.044*"simpl" + 0.042*"websit" + 0.031*"want" + 0.031*"cheaper" + 0.028*"perfect" + 0.017*"deliv" + 0.016*"second"
Score: 0.39358434081077576	 Topic: 0.026*"problem" + 0.021*"day" + 0.017*"car" + 0.014*"appoint" + 0.014*"book" + 0.013*"wheel" + 0.011*"wait"
Score: 0.12129729986190796	 Topic: 0.103*"friend" + 0.062*"staff" + 0.061*"help" + 0.052*"reliabl" + 0.039*"round" + 0.035*"job" + 0.021*"recommend"


Now we'll try to get the corresponding vectors of the training sample

In [77]:
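train_vecs = []
for i in range(len(train)):
    # minimum_probability=0.0 asks for every topic, not just the likely ones
    top_topics = model.get_document_topics(bow[i], minimum_probability=0.0)
    # use a separate index name so the outer loop variable is not reused
    topic_vec = [top_topics[k][1] for k in range(num_topics)]
    train_vecs.append(topic_vec)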

In [78]:
train_vecs[0]

[0.008334361,
 0.520154,
 0.008334122,
 0.008333732,
 0.008333665,
 0.26110324,
 0.14373845,
 0.008333683,
 0.008333739,
 0.008333713,
 0.008333587,
 0.008333662]
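
The loop above relies on every topic being returned, in order, so that position `k` in the list is topic `k`. A slightly more defensive sketch places each probability by its topic id instead, which also works when low-probability topics are omitted from the result. The helper `to_dense` and the sample values are hypothetical.

```python
def to_dense(topic_pairs, num_topics):
    """Place each (topic_id, probability) pair at its own index; missing topics stay 0.0."""
    vec = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        vec[topic_id] = prob
    return vec

# Made-up result in which low-probability topics were left out
sparse = [(1, 0.52), (5, 0.26), (6, 0.14)]
print(to_dense(sparse, 12))
```

Each resulting vector has one entry per topic, so the vectors can be stacked into a fixed-width feature matrix for a downstream classifier.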