# Objective:

### Analysing comments on Canada's COVID-19 vaccine plan.

# Description:

### We will analyse the text by first turning the comments into a workable format, then cleaning the data:
- Convert capital letters to lower case.
- Remove punctuation.
- Remove stop words and numbers.
### As a final step, we will apply two topic modelling techniques (LDA and CorEx), then use the predicted topics as labels so that the comments can be explored with spaCy's `amod` dependency and Python's `Counter`.
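
The cleaning steps listed above can be sketched with the standard library alone. This is only an illustration, not the pipeline used in this notebook (which relies on spaCy and gensim further down). The `clean_comment` helper and the tiny `STOP_WORDS` set are hypothetical names introduced for the example.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real run would use spaCy's or NLTK's full list.
STOP_WORDS = {"the", "is", "in", "to", "a", "and", "of", "be", "this", "as", "could"}


def clean_comment(text):
    """Lower-case, strip punctuation and digits, then drop stop words."""
    text = text.lower()
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace runs of digits with a space so neighbouring words stay separated.
    text = re.sub(r"\d+", " ", text)
    return [word for word in text.split() if word not in STOP_WORDS]


print(clean_comment("As quickly as possible this could be 2022"))
```

Applied to the last comment in the data set, the sketch keeps only the words `quickly` and `possible`.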

# Tools:
- NumPy
- pandas
- scikit-learn
- NLTK
- re
- spaCy
- Gensim

In [4]:
import re
from pprint import pprint

import gensim
import nltk
import numpy as np
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import spacy
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

nlp = spacy.load('en_core_web_sm')  # small English pipeline used throughout
pyLDAvis.enable_notebook()

In [5]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [5]:
from spacy.lang.en import English
# The 'en' shortcut is deprecated since spaCy v3.0; use the full package name.
!pip install spacy && python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [17]:

data= pd.read_csv(r'C:\Users\sshah\OneDrive\المستندات\comment_on_plan.csv',encoding='latin-1')

In [18]:
data

Unnamed: 0,Comment
0,The vaccine plan according to liberals is in t...
1,Starting a pandemic with masks can only be wor...
2,Dont forget the Libs tossing out our PPE and ...
3,they threw out emergency stock of N95s last ye...
4,Conservatives want to take a vaccine that hasn...
...,...
1357,All is Trudeau has to do is pick up the teleph...
1358,You know for all the bad press and clear ill w...
1359,"Roll Trudeau out,the virus in Canada."
1360,As quickly as possible this could be 2022


In [19]:
data1= list(data.Comment)
data1

['The vaccine plan according to liberals is in that little red book that back in 1993 J Chrétien keep campaigning about, to this date so many liberals have added many many plans and promises in it, but no one seems to be able to find it, and now Trudeau leads the pack in there',
 'Starting a pandemic with masks can only be worn by professionals, to being among the last in the world to getting a vaccine, this is one big govt fail. We should be manufacturing our own vaccines within Canada, and N95 masks as well, enough so we can send them to every senior. incompetent.',
 'Don\x92t forget the Libs tossing out our PPE and then giving what was left to China. Fail',
 'they threw out emergency stock of N95s last year, without replacing them, ... genius move.',
 "Conservatives want to take a vaccine that hasn't been approved? No wonder they were so eager to inject bleach",
 'making stuff up',
 'drivel but thanks anyway\xa0',
 'Trump is more like JT the any Canadian conservative',
 "Its all goo

## Preprocessing:

In [20]:
documents=[]
for i in data1:
    documents.append(nlp(i))
documents

[The vaccine plan according to liberals is in that little red book that back in 1993 J Chrétien keep campaigning about, to this date so many liberals have added many many plans and promises in it, but no one seems to be able to find it, and now Trudeau leads the pack in there,
 Starting a pandemic with masks can only be worn by professionals, to being among the last in the world to getting a vaccine, this is one big govt fail. We should be manufacturing our own vaccines within Canada, and N95 masks as well, enough so we can send them to every senior. incompetent.,
 Dont forget the Libs tossing out our PPE and then giving what was left to China. Fail,
 they threw out emergency stock of N95s last year, without replacing them, ... genius move.,
 Conservatives want to take a vaccine that hasn't been approved? No wonder they were so eager to inject bleach,
 making stuff up,
 drivel but thanks anyway ,
 Trump is more like JT the any Canadian conservative,
 Its all good. The vaccines are com

In [21]:
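def preprocessing(docs):
    """Return one list of lemmatised, lower-cased, stop-word-free tokens per document."""
    processed_data = []
    for doc in docs:
        tokens = []
        # `docs` already holds parsed spaCy Doc objects, so iterate over them directly
        # instead of passing each one through `nlp` a second time.
        for token in doc:
            if not token.is_stop:
                # simple_preprocess lower-cases, strips accents (deacc=True)
                # and discards punctuation, digits and very short tokens.
                tokens.extend(gensim.utils.simple_preprocess(token.lemma_, deacc=True))
        processed_data.append(tokens)
    return processed_data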


In [22]:
preprocessed_data=preprocessing(documents)
preprocessed_data[1]

['start',
 'pandemic',
 'mask',
 'wear',
 'professional',
 'world',
 'get',
 'vaccine',
 'big',
 'govt',
 'fail',
 'manufacture',
 'vaccine',
 'canada',
 'mask',
 'send',
 'senior',
 'incompetent']

## LDA Model:

In [23]:
dictionary= corpora.Dictionary(preprocessed_data)
dt_matrix= [dictionary.doc2bow(rev) for rev in preprocessed_data]
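
For readers unfamiliar with the bag-of-words format, the sketch below mimics in plain Python the kind of `(token_id, count)` pairs stored in `dt_matrix`. It is an illustration under assumptions, not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_vocabulary(tokenised_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenised_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Two toy documents (made up for this example).
docs = [["mask", "vaccine", "mask"], ["vaccine", "canada"]]
vocab = build_vocabulary(docs)
bows = [to_bow(d, vocab) for d in docs]
print(vocab)
print(bows)
```

Each document becomes a sparse list of counts, which is the form the LDA model consumes.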

In [24]:
lda = gensim.models.ldamodel.LdaModel(corpus=dt_matrix, num_topics=8, id2word=dictionary, passes=5)

In [25]:
lda.print_topics()

[(0,
  '0.021*"trudeau" + 0.011*"vaccine" + 0.011*"time" + 0.009*"canada" + 0.007*"go" + 0.007*"covid" + 0.007*"liberal" + 0.007*"conservative" + 0.007*"people" + 0.006*"vote"'),
 (1,
  '0.017*"vaccine" + 0.015*"work" + 0.012*"line" + 0.012*"canada" + 0.009*"come" + 0.007*"people" + 0.007*"million" + 0.007*"mean" + 0.007*"moderna" + 0.006*"order"'),
 (2,
  '0.021*"vaccine" + 0.007*"china" + 0.006*"find" + 0.006*"possible" + 0.006*"know" + 0.006*"government" + 0.006*"go" + 0.006*"spend" + 0.005*"quickly" + 0.005*"like"'),
 (3,
  '0.031*"vaccine" + 0.016*"quickly" + 0.016*"possible" + 0.012*"trudeau" + 0.009*"approve" + 0.009*"covid" + 0.009*"know" + 0.009*"country" + 0.009*"canadians" + 0.007*"start"'),
 (4,
  '0.035*"vaccine" + 0.013*"time" + 0.011*"government" + 0.007*"get" + 0.007*"wait" + 0.006*"thing" + 0.006*"country" + 0.005*"need" + 0.005*"believe" + 0.005*"minister"'),
 (5,
  '0.037*"vaccine" + 0.033*"possible" + 0.033*"trudeau" + 0.032*"quickly" + 0.022*"say" + 0.020*"canada" 

## Evaluating the LDA model:

In [26]:
coherence_lda_model = CoherenceModel(model=lda, texts=preprocessed_data, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_lda_model.get_coherence()
print(f'\n Coherence score: {coherence_lda}')


 Coherence score: 0.2806336825826262


# Determining the best number of topics for the LDA model:

In [37]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    """Train one LDA model per topic count in range(start, limit, step) and score each with c_v coherence."""
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherence_model = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherence_model.get_coherence())
    return model_list, coherence_values

In [38]:
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=dt_matrix, texts=preprocessed_data, limit=9, start=2, step=1)

In [39]:
start=2
limit=9
step=1

x= range(start,limit,step)

In [40]:
for topic, cv in zip(x, coherence_values):
    print('Number of topics:', topic, 'has coherence score:', round(cv,4))

Number of topics: 2 has coherence score: 0.319
Number of topics: 3 has coherence score: 0.285
Number of topics: 4 has coherence score: 0.2886
Number of topics: 5 has coherence score: 0.3081
Number of topics: 6 has coherence score: 0.303
Number of topics: 7 has coherence score: 0.3202
Number of topics: 8 has coherence score: 0.3077


## From the scores above, 7 topics gives the highest coherence (0.3202), only marginally ahead of 2 topics (0.319), so we take the 7-topic model, `model_list[5]`
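
Rather than reading the index off by hand, the best topic count can be looked up programmatically. The sketch below copies the coherence scores from the printed output above. The names `topic_counts`, `best_index` and `best_num_topics` are introduced here for illustration only.

```python
start, limit, step = 2, 9, 1

# Coherence scores copied from the printed output above (rounded to 4 places).
coherence_values = [0.319, 0.285, 0.2886, 0.3081, 0.303, 0.3202, 0.3077]

# Topic counts tried, in the same order as the scores.
topic_counts = list(range(start, limit, step))

# Position of the highest score; the same position indexes model_list.
best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
best_num_topics = topic_counts[best_index]

print(best_index, best_num_topics)
```

Because the LDA models here are trained without a fixed `random_state`, a re-run may produce slightly different scores, and with gaps this small the winning topic count could change.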

In [42]:
optimal_model= model_list[5]
model_topics= optimal_model.show_topics(formatted=False)
optimal_model.print_topics(num_words=10)

[(0,
  '0.024*"trudeau" + 0.024*"vaccine" + 0.015*"say" + 0.010*"canada" + 0.010*"time" + 0.008*"people" + 0.008*"line" + 0.007*"liberal" + 0.006*"government" + 0.006*"like"'),
 (1,
  '0.016*"canada" + 0.014*"vaccine" + 0.010*"trudeau" + 0.009*"know" + 0.008*"go" + 0.008*"like" + 0.008*"approval" + 0.007*"quickly" + 0.007*"possible" + 0.007*"wait"'),
 (2,
  '0.019*"vaccine" + 0.014*"quickly" + 0.013*"possible" + 0.009*"government" + 0.008*"trudeau" + 0.008*"canada" + 0.006*"year" + 0.006*"good" + 0.006*"china" + 0.006*"mean"'),
 (3,
  '0.017*"vaccine" + 0.013*"canadians" + 0.009*"possible" + 0.009*"quickly" + 0.009*"health" + 0.008*"canada" + 0.008*"month" + 0.007*"trudeau" + 0.006*"date" + 0.006*"virus"'),
 (4,
  '0.051*"vaccine" + 0.015*"people" + 0.013*"canada" + 0.008*"health" + 0.008*"approve" + 0.007*"country" + 0.006*"need" + 0.006*"want" + 0.006*"know" + 0.006*"like"'),
 (5,
  '0.039*"possible" + 0.036*"quickly" + 0.026*"trudeau" + 0.015*"say" + 0.015*"vaccine" + 0.012*"canada"

## Visualization:

In [43]:
visualization = gensimvis.prepare(optimal_model, dt_matrix, dictionary)
visualization



# CorEx Model:

In [45]:
!pip install corextopic
!pip install networkx
from corextopic import corextopic as ct
from corextopic import vis_topic as vt



In [46]:
vectorizer2 = CountVectorizer(max_features=20000,
                             stop_words='english', token_pattern="\\b[a-z][a-z]+\\b",
                             binary=True)

doc_word = vectorizer2.fit_transform(data1)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
words = list(vectorizer2.get_feature_names_out())



In [47]:
topic_model = ct.Corex(n_hidden=4, words=words, seed=1)
topic_model.fit(doc_word, words=words, docs=data1)

<corextopic.corextopic.Corex at 0x26df6476ca0>

In [48]:
# Human-readable labels chosen for the four CorEx topics.
categories = ['Vaccine.plan', 'Healthcare.in.Canada',
              'Canadian.government', 'Trudeau.and.the.liberal.party.of.Canada']

topics = topic_model.get_topics()
for n, topic in enumerate(topics):
    topic_words, _, _ = zip(*topic)
    print('{}: '.format(n) + ','.join(topic_words))

0: evidence,canada,control,multiple,countries,billion,pandemic,delivering,meant,later
1: vaccine,going,people,weeks,distribution,signed,receive,cansino,likely,health
2: quickly,possible,says,trudeau,incompetence,plan,doses,liberals,million,government
3: federal,gone,used,getting,information,spending,angry,tired,clinical,covid


In [49]:
predictions = pd.DataFrame(topic_model.predict(doc_word), columns=['topic'+str(i) for i in range(4)])
predictions

Unnamed: 0,topic0,topic1,topic2,topic3
0,True,False,True,False
1,False,False,False,True
2,False,False,False,True
3,False,False,True,False
4,False,False,False,False
...,...,...,...,...
1357,False,True,False,False
1358,False,False,True,True
1359,False,False,False,False
1360,False,False,True,False


In [50]:
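# Anchor each topic on seed words. The vectorizer lower-cases tokens, so the upper-case anchor 'US' has no vocabulary match.
topic_model.fit(doc_word, words=words, docs=data1, 
                anchors=[['plan', 'decide'], ['healthcare','health','care','children'],['government','country','citizen','decision'],['canada','US','liberal','pay']], anchor_strength=10)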

topics = topic_model.get_topics()
for n,topic in enumerate(topics):
    topic_words,_,_ = zip(*topic)
    print('{}: '.format(n) + ','.join(topic_words))


0: plan,decide,low,distribution,version,citizens,falling,impact,sidelines,statistics
1: health,care,healthcare,children,workers,provinces,oil,jobs,approval,vaccines
2: government,country,people,example,private,playing,domestically,current,ottawa,wrong
3: canada,liberal,pay,vaccine,approve,foreign,production,chinese,blocked,domestic


In [52]:
topics

[[('plan', 1.414018340081795, 1.0),
  ('decide', 0.04465564096072224, 1.0),
  ('low', 0.011284747795852227, 1.0),
  ('distribution', 0.010662170696289306, 1.0),
  ('version', 0.009325814168233377, 1.0),
  ('citizens', 0.008482299996407407, 1.0),
  ('falling', 0.006892214175120324, 1.0),
  ('impact', 0.006892214175120324, 1.0),
  ('sidelines', 0.006892214175120324, 1.0),
  ('statistics', 0.006892214175120324, 1.0)],
 [('health', 2.2681551570942555, 1.0),
  ('care', 0.8408324662832578, 1.0),
  ('healthcare', 0.24318029647287515, 1.0),
  ('children', 0.04512429827107381, 1.0),
  ('workers', 0.01782684512977439, 1.0),
  ('provinces', 0.012475109063445785, 1.0),
  ('oil', 0.011884888049206914, 1.0),
  ('jobs', 0.011884888049206914, 1.0),
  ('approval', 0.011784753448307723, 1.0),
  ('vaccines', 0.011072715596201713, 1.0)],
 [('government', 2.4623831879410076, 1.0),
  ('country', 0.909519504182787, 1.0),
  ('people', 0.0160248320540512, 1.0),
  ('example', 0.013534736935286626, 1.0),
  ('pri

In [54]:
predictions = pd.DataFrame(topic_model.predict(doc_word), columns=['topic'+str(i) for i in range(4)])
predictions

Unnamed: 0,topic0,topic1,topic2,topic3
0,True,False,False,False
1,False,False,False,True
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
...,...,...,...,...
1357,False,False,False,False
1358,False,False,True,False
1359,False,False,False,True
1360,False,False,False,False
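
A quick check on these labels is to count how many comments each topic captures and how many receive no topic at all. The sketch below does this in plain Python for the nine rows displayed above only, so its numbers say nothing about the full data set. The variable names are introduced for illustration.

```python
# The nine rows shown in the truncated display above: (topic0, topic1, topic2, topic3).
rows = [
    (True, False, False, False),
    (False, False, False, True),
    (False, False, False, False),
    (False, False, False, False),
    (False, False, False, False),
    (False, False, False, False),
    (False, False, True, False),
    (False, False, False, True),
    (False, False, False, False),
]

# Column-wise sums: True counts as 1.
per_topic = [sum(column) for column in zip(*rows)]

# Rows where every topic flag is False.
unassigned = sum(1 for row in rows if not any(row))

print(per_topic, unassigned)
```

On the full `predictions` frame the same idea should be expressible as `predictions.sum()` and `(~predictions.any(axis=1)).sum()`, provided the frame still holds only the four boolean topic columns.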


## Converting data into supervised data:

In [59]:
data['spacy_doc'] = list(nlp.pipe(data.Comment))

In [73]:
data['index'] = range(len(data))  # avoid hard-coding the row count

In [75]:
predictions['index'] = range(len(predictions))  # avoid hard-coding the row count

In [78]:
spacy_data = pd.merge(data, predictions, on='index')

In [82]:
spacy_data.rename(columns = {'topic0':'covid_plan', 'topic1':'healthcare', 'topic2':'canadian_government','topic3':'liberal_party'}, inplace = True)

In [84]:
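# No 'Topic0' column exists after the rename above; errors='ignore' keeps this cell from raising a KeyError.
spacy_data.drop(columns='Topic0', errors='ignore', inplace=True)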

In [85]:
spacy_data

Unnamed: 0,Comment,spacy_doc,index,covid_plan,healthcare,canadian_government,liberal_party
0,The vaccine plan according to liberals is in t...,"(The, vaccine, plan, according, to, liberals, ...",0,True,False,False,False
1,Starting a pandemic with masks can only be wor...,"(Starting, a, pandemic, with, masks, can, only...",1,False,False,False,True
2,Dont forget the Libs tossing out our PPE and ...,"(Dont, forget, the, Libs, tossing, out, our, ...",2,False,False,False,False
3,they threw out emergency stock of N95s last ye...,"(they, threw, out, emergency, stock, of, N95s,...",3,False,False,False,False
4,Conservatives want to take a vaccine that hasn...,"(Conservatives, want, to, take, a, vaccine, th...",4,False,False,False,False
...,...,...,...,...,...,...,...
1357,All is Trudeau has to do is pick up the teleph...,"(All, is, Trudeau, has, to, do, is, pick, up, ...",1357,False,False,False,False
1358,You know for all the bad press and clear ill w...,"(You, know, for, all, the, bad, press, and, cl...",1358,False,False,True,False
1359,"Roll Trudeau out,the virus in Canada.","(Roll, Trudeau, out, ,, the, virus, in, Canada...",1359,False,False,False,True
1360,As quickly as possible this could be 2022,"(As, quickly, as, possible, this, could, be, 2...",1360,False,False,False,False


In [140]:
type(spacy_data.spacy_doc[0])

spacy.tokens.doc.Doc

In [109]:
covid_plan_reviews = spacy_data[spacy_data.covid_plan==True]
healthcare_reviews = spacy_data[spacy_data.healthcare==True]
canadian_government_reviews = spacy_data[spacy_data.canadian_government==True]
liberal_party_reviews = spacy_data[spacy_data.liberal_party==True]

## spaCy (amods & Counter):

In [110]:
from spacy.symbols import amod

In [156]:
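def get_amods(noun, ser):
    """Collect adjectival modifiers (amod) of every token whose text equals `noun` (case-sensitive)."""
    amod_list = []
    for doc in ser:
        for token in doc:
            if token.text == noun:
                # amod children are the adjectives attached directly to this noun
                for child in token.children:
                    if child.dep == amod:
                        amod_list.append(child.text.lower())
    return sorted(amod_list)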

def amods_by_sentiment(noun):
    print(f"Adjectives describing {str.upper(noun)}:\n")
    
    print("\nCovid plan topic:")
    pprint(get_amods(noun, covid_plan_reviews.spacy_doc))
    
    print("\nHealthcare topic:")
    pprint(get_amods(noun, healthcare_reviews.spacy_doc))
    print("\n Canadian government topic:")
    pprint(get_amods(noun, canadian_government_reviews.spacy_doc))
    print("\nLiberal party topic:")
    pprint(get_amods(noun, liberal_party_reviews.spacy_doc))
   

In [163]:
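# Note: a word token's text never carries a leading space, so ' masks' matches nothing and every list is empty; pass 'masks' instead.
amods_by_sentiment(' masks')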


Adjectives describing  MASKS:


Covid plan topic:
[]

Healthcare topic:
[]

 Canadian government topic:
[]

Liberal party topic:
[]


In [141]:
covidplan_adj = [token.text.lower() for doc in covid_plan_reviews.spacy_doc for token in doc if token.pos_=='ADJ']
healthcare_adj = [token.text.lower() for doc in healthcare_reviews.spacy_doc for token in doc if token.pos_=='ADJ']
canadiangovernment_adj = [token.text.lower() for doc in canadian_government_reviews.spacy_doc for token in doc if token.pos_=='ADJ']
liberal_party_adj = [token.text.lower() for doc in liberal_party_reviews.spacy_doc for token in doc if token.pos_=='ADJ']


covidplan_noun = [token.text.lower() for doc in covid_plan_reviews.spacy_doc for token in doc if token.pos_=='NOUN']
healthcare_noun = [token.text.lower() for doc in healthcare_reviews.spacy_doc for token in doc if token.pos_=='NOUN']
canadiangovernment_noun = [token.text.lower() for doc in canadian_government_reviews.spacy_doc for token in doc if token.pos_=='NOUN']
liberal_party_noun = [token.text.lower() for doc in liberal_party_reviews.spacy_doc for token in doc if token.pos_=='NOUN']



In [142]:
from collections import Counter

In [143]:
Counter(canadiangovernment_adj).most_common(10)

[('possible', 38),
 ('more', 13),
 ('first', 11),
 ('other', 11),
 ('many', 9),
 ('only', 8),
 ('conservative', 8),
 ('covid', 8),
 ('much', 7),
 ('own', 7)]

In [144]:
Counter(liberal_party_noun).most_common(10)

[('vaccine', 120),
 ('vaccines', 40),
 ('government', 35),
 ('countries', 22),
 ('time', 20),
 ('plan', 17),
 ('approval', 17),
 ('%', 17),
 ('production', 17),
 ('people', 16)]

In [146]:
Counter(healthcare_adj).most_common(10)

[('possible', 28),
 ('other', 16),
 ('safe', 11),
 ('good', 10),
 ('more', 9),
 ('first', 8),
 ('much', 8),
 ('long', 7),
 ('pandemic', 7),
 ('only', 7)]

In [164]:
Counter(canadiangovernment_noun).most_common(10)

[('government', 99),
 ('vaccine', 88),
 ('people', 37),
 ('country', 37),
 ('vaccines', 23),
 ('time', 19),
 ('plan', 14),
 ('production', 13),
 ('world', 12),
 ('distribution', 12)]
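
The top-10 lists overlap heavily on generic words such as `possible`, `more` and `other`. One simple way to see what distinguishes two topics is to keep only the words that appear in one list but not the other. The sketch below does this for the two adjective lists printed above. The variable names are introduced here, and comparing top-10 lists is a rough heuristic rather than a statistical test.

```python
# Top-10 adjective counts copied from the outputs above.
government_adj_top = [('possible', 38), ('more', 13), ('first', 11), ('other', 11),
                      ('many', 9), ('only', 8), ('conservative', 8), ('covid', 8),
                      ('much', 7), ('own', 7)]
healthcare_adj_top = [('possible', 28), ('other', 16), ('safe', 11), ('good', 10),
                      ('more', 9), ('first', 8), ('much', 8), ('long', 7),
                      ('pandemic', 7), ('only', 7)]

government_words = [word for word, _ in government_adj_top]
healthcare_words = [word for word, _ in healthcare_adj_top]

# Words in one top-10 list but absent from the other, in original rank order.
healthcare_only = [w for w in healthcare_words if w not in government_words]
government_only = [w for w in government_words if w not in healthcare_words]

print("Healthcare only:", healthcare_only)
print("Government only:", government_only)
```

The healthcare topic keeps `safe`, `good`, `long` and `pandemic`, while the government topic keeps `many`, `conservative`, `covid` and `own`.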