# Module: Natural Language Processing & Generation

# Section: Topic Modelling

## <font color='#4073FF'>Project Solution: Comcast Consumer Complaints - Topic Modelling </font>

###  <font color='#14AAF5'> Use Latent Dirichlet Allocation (LDA) to classify text in a document to a particular topic. </font>

### Project Brief:

For a service provider, customer complaints may carry a negative connotation; however, they are a source of insight into which areas should be improved.

1. They help the provider better understand its consumers.
2. They give the service provider an opportunity to resolve customers' problems on time and thus reduce dissatisfaction.
3. Customers whose problems are resolved efficiently by a service provider often become more loyal to the company.



### 1. Dataset:

Comcast was once notorious for terrible customer service and, despite repeated promises to improve, it continued to fall short. In October 2016, the FCC fined the company $2.3 million after receiving over 1000 consumer complaints.

This dataset includes public customer complaints filed against Comcast. You are required to pin down what is wrong with Comcast's customer service.

In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Data collection and exploration

In [None]:
# Collecting data
url = "https://raw.githubusercontent.com/bluedataconsulting/AIMasteryProgram/main/Projects/Module-10/comcast_consumeraffairs_complaints.csv"
df = pd.read_csv(url,usecols=["author","posted_on","rating","text"])

In [None]:
# Viewing Data
df.head(10)

Unnamed: 0,author,posted_on,rating,text
0,"Alantae of Chesterfeild, MI","Nov. 22, 2016",1,I used to love Comcast. Until all these consta...
1,"Vera of Philadelphia, PA","Nov. 19, 2016",1,I'm so over Comcast! The worst internet provid...
2,"Sarah of Rancho Cordova, CA","Nov. 17, 2016",1,If I could give them a negative star or no sta...
3,"Dennis of Manchester, NH","Nov. 16, 2016",1,I've had the worst experiences so far since in...
4,"Ryan of Bellevue, WA","Nov. 14, 2016",1,Check your contract when you sign up for Comca...
5,"Terri of Mobile, AL","Nov. 9, 2016",1,Thank God. I am changing to Dish. They gave me...
6,"Kellie of Salt Lake City, UT","Nov. 9, 2016",1,I Have been a long time customer and only have...
7,"Kathleen of New Haven, CT","Nov. 6, 2016",2,There is a malfunction on the DVR manager whic...
8,"Shira of Bloomfield, NJ","Nov. 5, 2016",1,Charges overwhelming. Comcast service rep was ...
9,"Kristy of Alpharetta, GA","Nov. 2, 2016",1,"I have had cable, DISH, and U-verse, etc. in t..."


In [None]:
# Shape of data
df.shape

(5659, 4)

In [None]:
# Viewing a complaint
df.loc[0,'text']

"I used to love Comcast. Until all these constant updates. My internet and cable crash a lot at night, and sometimes during the day, some channels don't even work and on demand sometimes don't play either. I wish they will do something about it. Because just a few mins ago, the internet have crashed for about 20 mins for no reason. I'm tired of it and thinking about switching to Wow or something. Please do not get Xfinity."

In [None]:
complaints = df['text']
print(len(complaints))

5659


### 3. Data cleaning and pre-processing

In [None]:
import nltk
import gensim
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Drop any complaint that is 20 characters or shorter

complaints = [doc for doc in complaints if len(str(doc))>20]
len(complaints)

5629

In [None]:
from nltk.corpus import stopwords

# Printing the English stop words
stop_words = stopwords.words("english")
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# removing stop words
complaints = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in complaints]
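
As a rough illustration of what the cell above does to each complaint, here is a pure-Python sketch of lowercasing, tokenizing, length filtering and stop-word removal. The `tokenize` helper and the tiny stop-word set are hypothetical; gensim's `simple_preprocess` is the real implementation and differs in details such as Unicode handling.

```python
import re

def tokenize(text, stop_words, min_len=2, max_len=15):
    """Approximate simple_preprocess followed by stop-word filtering (illustrative only)."""
    # Lowercase, then keep runs of ASCII letters; digits and punctuation act as separators
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop very short or very long tokens, then drop stop words
    return [t for t in tokens if min_len <= len(t) <= max_len and t not in stop_words]

# A tiny illustrative stop-word set, not the full NLTK list
toy_stop_words = {"my", "for"}
toy_tokens = tokenize("My internet crashed for 20 mins!", toy_stop_words)
print(toy_tokens)
```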

In [None]:
# making bigrams

# Automatically detect common phrases (pairs of words that frequently occur together)
bigrams = gensim.models.Phrases(complaints,min_count=5,threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigrams)
bigram_data = [bigram_mod[doc] for doc in complaints]
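
To give a feel for how `min_count` and `threshold` interact, the sketch below scores adjacent word pairs with the formula used by gensim's default scorer, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The helper `bigram_scores` and the toy documents are hypothetical, and gensim's own bookkeeping (for example, how the vocabulary size is counted) differs in details, so treat this as an approximation.

```python
from collections import Counter

def bigram_scores(docs, min_count=5):
    """Score adjacent word pairs; higher scores suggest a stronger phrase (illustrative only)."""
    unigram = Counter(w for doc in docs for w in doc)
    bigram = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigram)  # assumption: count distinct single words only
    scores = {}
    for (a, b), n_ab in bigram.items():
        if n_ab >= min_count:
            scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigram[a] * unigram[b])
    return scores

toy_docs = [["customer", "service", "call"],
            ["customer", "service", "bad"],
            ["bad", "call"]]
toy_scores = bigram_scores(toy_docs, min_count=1)
for pair, score in sorted(toy_scores.items()):
    print(pair, score)
```

Pairs whose score exceeds `threshold` are merged into a single token, so a high value such as 100 keeps only strongly associated pairs.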

In [None]:
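# Lemmatization using spaCy
# Removing inflections
import spacy

# 'en_core_web_sm' replaces the old 'en' shortcut, which spaCy 3 no longer supports;
# install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm',disable = ['parser','ner'])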
clean_data = []
allowed_postags = ['NOUN','ADJ','VERB','ADV']
for sent in bigram_data:
    doc = nlp(" ".join(sent))
    clean_data.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])

In [None]:
clean_data[0]

['use',
 'love',
 'constant',
 'update',
 'cable',
 'crash',
 'lot',
 'night',
 'sometimes',
 'day',
 'channel',
 'even',
 'work',
 'demand',
 'sometimes',
 'play',
 'wish',
 'min',
 'ago',
 'internet',
 'crash',
 'min',
 'reason',
 'tired',
 'thinking',
 'switch',
 'xfinity']

### 4. Modelling

### Methodology- Topic Modelling

Topic modelling is a type of statistical modelling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model: it represents each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions. The topic that dominates a document's mixture can then be used to assign the document to a topic.
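
The generative story behind LDA can be sketched in a few lines of standard-library Python. This is only an illustration of the assumptions the model makes, not how gensim trains it; the vocabulary, the prior values and the helper names (`sample_dirichlet`, `generate_document`) are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate a toy document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # topic mixture for this document
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # pick a word from that topic
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["bill", "charge", "refund", "internet", "outage", "speed"]
beta = [0.5] * len(vocab)  # hypothetical prior over words for each topic
topic_word = [sample_dirichlet(beta, rng) for _ in range(2)]  # two toy topics
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta)
print(words)
```

Training LDA runs this story in reverse: given only the observed words, it infers the topic mixtures and the word distributions that most plausibly produced them.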

In [None]:
# create a dictionary
dictionary = corpora.Dictionary(clean_data)

# corpus
corpus = [dictionary.doc2bow(doc) for doc in clean_data]

# LDA Model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,id2word=dictionary,num_topics=10,
                                           random_state=5,update_every=1,chunksize=50,
                                           passes=10,alpha='auto',per_word_topics=True)

 
# chunksize (int, optional) – Number of documents to be used in each training chunk.
# passes (int, optional) – Number of passes through the corpus during training.
# update_every (int, optional) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.
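
To make the dictionary and bag-of-words steps concrete, here is a pure-Python sketch that mimics what `corpora.Dictionary` and `doc2bow` produce. The helpers `build_dictionary` and `toy_doc2bow` are hypothetical stand-ins, and the integer ids that gensim assigns may differ.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer id (illustrative stand-in for corpora.Dictionary)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def toy_doc2bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["internet", "crash", "internet"], ["bill", "crash"]]
token2id = build_dictionary(toy_docs)
toy_corpus = [toy_doc2bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(toy_corpus)
```

Each document becomes a sparse list of word counts; word order is discarded, which is why LDA is described as a bag-of-words model.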

In [None]:
# print the top words in each topic and their relative weights

for topic in lda_model.print_topics():
    print(topic)
    print("\n")

(0, '0.070*"receive" + 0.048*"service" + 0.031*"state" + 0.031*"comcast" + 0.030*"issue" + 0.026*"contact" + 0.024*"representative" + 0.022*"disconnect" + 0.021*"customer" + 0.021*"speak"')


(1, '0.128*"call" + 0.052*"tell" + 0.050*"would" + 0.047*"say" + 0.044*"phone" + 0.040*"time" + 0.032*"day" + 0.022*"back" + 0.022*"supervisor" + 0.020*"hour"')


(2, '0.088*"cable" + 0.074*"box" + 0.071*"come" + 0.067*"work" + 0.048*"tech" + 0.043*"line" + 0.035*"home" + 0.028*"technician" + 0.023*"signal" + 0.019*"house"')


(3, '0.053*"comcast" + 0.020*"get" + 0.020*"new" + 0.018*"channel" + 0.018*"know" + 0.016*"try" + 0.015*"go" + 0.015*"need" + 0.014*"order" + 0.014*"also"')


(4, '0.122*"service" + 0.046*"internet" + 0.042*"customer" + 0.032*"cable" + 0.023*"company" + 0.022*"year" + 0.022*"get" + 0.018*"time" + 0.018*"comcast" + 0.018*"tv"')


(5, '0.163*"problem" + 0.061*"fix" + 0.028*"connection" + 0.015*"dvr" + 0.012*"trouble" + 0.012*"break" + 0.010*"course" + 0.010*"damage" + 0.010*"r
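
Each topic is printed as a string of `weight*"word"` terms. If the weights are needed as numbers, for example to compare topics or plot them, the string can be parsed as in the sketch below. The `parse_topic` helper is hypothetical; gensim's `lda_model.show_topic(topic_id)` should return the same information as `(word, weight)` pairs directly.

```python
import re

def parse_topic(topic_string):
    """Turn '0.128*"call" + 0.052*"tell"' into [('call', 0.128), ('tell', 0.052)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 1 from the output above
topic_1 = '0.128*"call" + 0.052*"tell" + 0.050*"would"'
parsed = parse_topic(topic_1)
for word, weight in parsed:
    print(f"{word:<8}{weight:.3f}")
```

Reading the top words this way, topic 1 (call, tell, phone, supervisor, hour) appears to concern time spent on the phone with support, while topic 2 (cable, box, tech, technician, signal) appears to concern equipment and technician visits.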

In [None]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)

### 5. Visualizing Results

In [None]:
# Display a viz
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
lda_viz = gensimvis.prepare(lda_model, corpus, dictionary)



In [None]:
lda_viz

### 6. Exporting the model

In [None]:
lda_model.save('lda_model.model')



In [None]:
# later on, load trained model from file
# model =  gensim.models.ldamodel.LdaModel.load('lda_model.model')

# print all topics
# model.show_topics()