<a href="https://colab.research.google.com/github/Ashish0898/Automatic-Ticket-Classification/blob/Dev/Automatic_Ticket_Classification_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement 

You need to build a model that is able to classify customer complaints based on the products/services. By doing so, you can segregate these tickets into their relevant categories and, therefore, help in the quick resolution of the issue.

You will be doing topic modelling on the <b>.json</b> data provided by the company. Since this data is not labelled, you need to apply NMF to analyse patterns and classify tickets into the following five clusters based on their products/services:

* Credit card / Prepaid card

* Bank account services

* Theft/Dispute reporting

* Mortgages/loans

* Others 


With the help of topic modelling, you will be able to map each ticket onto its respective department/category. You can then use this data to train any supervised model such as logistic regression, decision tree or random forest. Using this trained model, you can classify any new customer complaint support ticket into its relevant department.

## Pipelines that needs to be performed:

You need to perform the following eight major tasks to complete the assignment:

1.  Data loading

2. Text preprocessing

3. Exploratory data analysis (EDA)

4. Feature extraction

5. Topic modelling 

6. Model building using supervised learning

7. Model training and evaluation

8. Model inference

## Cloning the DataSet from Github repo

In [1]:
# Cloning Github Repo to fetch JSON file
import os
import shutil

if os.path.exists('/content/Automatic-Ticket-Classification'):
  shutil.rmtree('/content/Automatic-Ticket-Classification')
  
! git clone 'https://github.com/Ashish0898/Automatic-Ticket-Classification'

Cloning into 'Automatic-Ticket-Classification'...
remote: Enumerating objects: 29, done.[K
remote: Counting objects: 100% (29/29), done.[K
remote: Compressing objects: 100% (28/28), done.[K
remote: Total 29 (delta 13), reused 0 (delta 0), pack-reused 0[K
Unpacking objects: 100% (29/29), 14.21 MiB | 7.38 MiB/s, done.


In [2]:

# Unzipping the file which contains JSON
!unzip /content/Automatic-Ticket-Classification/complaints-2021-05-14_08_16.zip

Archive:  /content/Automatic-Ticket-Classification/complaints-2021-05-14_08_16.zip
  inflating: complaints-2021-05-14_08_16.json  


## Importing the necessary libraries

In [3]:
import json 
import numpy as np
import pandas as pd
import re, nltk, spacy, string
import en_core_web_sm
nlp = en_core_web_sm.load()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

from nltk.stem import WordNetLemmatizer
from textblob import TextBlob, Word
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Loading the data

The data is in JSON format and we need to convert it to a dataframe.

In [4]:
# Opening JSON file 
with open('/content/complaints-2021-05-14_08_16.json', 'r') as file:
    data = json.load(file)

# returns JSON object as  
# a dictionary 

df = pd.json_normalize(data)

## Debugging

In [5]:
text = "Thisisatextwithoutspaces"
#pattern = r"\b\w+\b"  # regex pattern to match one or more word characters

pattern = r'^\S+$'

xyz=df.copy()
xyz['_source.complaint_what_happened']=xyz['_source.complaint_what_happened'].replace('',np.nan,regex=True)
#xyz['_source.complaint_what_happened']=xyz['_source.complaint_what_happened'].replace('',np.nan,regex=True)
xyz.dropna(subset=['_source.complaint_what_happened'],inplace=True)


# use regex to filter rows with sample text with no spaces
xyz = xyz[~xyz['_source.complaint_what_happened'].str.contains(pattern)]
xyz.head()
# print the resulting dataframe
#xyz.head()


# for i in xyz['_source.complaint_what_happened']:
#   matches = re.search(pattern, i)
#   print(matches)
  # if(matches== True):
  #   xyz = xyz.drop(xyz[xyz['_source.complaint_what_happened'] == i].index, inplace=True)

#df.head()

Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,...,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
1,complaint-public-v2,complaint,3229299,0.0,Servicemember,319XX,3229299,Written notification about debt,2019-05-01T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-01T12:00:00-05:00,,Credit card debt,Yes,Good morning my name is XXXX XXXX and I apprec...,Didn't receive enough information to verify debt,Consent provided
2,complaint-public-v2,complaint,3199379,0.0,,77069,3199379,"Other features, terms, or problems",2019-04-02T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-02T12:00:00-05:00,,General-purpose credit card or charge card,Yes,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Problem with rewards from credit card,Consent provided
10,complaint-public-v2,complaint,3233499,0.0,,104XX,3233499,Incorrect information on your report,2019-05-06T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-06T12:00:00-05:00,,Other personal consumer report,Yes,Chase Card was reported on XX/XX/2019. However...,Information belongs to someone else,Consent provided
11,complaint-public-v2,complaint,3180294,0.0,,750XX,3180294,Incorrect information on your report,2019-03-14T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-03-15T12:00:00-05:00,,Credit reporting,Yes,"On XX/XX/2018, while trying to book a XXXX XX...",Information belongs to someone else,Consent provided
14,complaint-public-v2,complaint,3224980,0.0,,920XX,3224980,Managing an account,2019-04-27T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-27T12:00:00-05:00,,Checking account,Yes,my grand son give me check for {$1600.00} i de...,Funds not handled or disbursed as instructed,Consent provided


In [6]:
# pattern=r'\b\w+\b'
# for i in df['_source.complaint_what_happened']:
#   matches = re.findall(pattern, i)
#   if(matches):
#     print(matches)
#     #df = df.drop(df[df['_source.complaint_what_happened'] == i].index, inplace=True)

In [7]:
#pd.set_option('display.max_colwidth', None)
#pd.set_option('display.max_rows', None)


xyz.head()
#print(xyz[xyz['_source.complaint_id']=='3417346']['_source.complaint_what_happened'])
print(xyz[xyz['_source.complaint_id']=='3417346']['_source.complaint_what_happened'])

print(xyz.loc[7677])

#pd.reset_option('display.max_colwidth')
#pd.reset_option('display.max_rows')


7677    IreceivedanemailXX/XX/XXXXthatmycreditcardswer...
Name: _source.complaint_what_happened, dtype: object
_index                                                             complaint-public-v2
_type                                                                        complaint
_id                                                                            3417346
_score                                                                             0.0
_source.tags                                                                      None
_source.zip_code                                                                 928XX
_source.complaint_id                                                           3417346
_source.issue                                                     Closing your account
_source.date_received                                        2019-10-24T12:00:00-05:00
_source.state                                                                       CA
_source.consumer_di

In [8]:
print(len(xyz),len(df))
xyz.head()


21069 78313


Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,...,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
1,complaint-public-v2,complaint,3229299,0.0,Servicemember,319XX,3229299,Written notification about debt,2019-05-01T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-01T12:00:00-05:00,,Credit card debt,Yes,Good morning my name is XXXX XXXX and I apprec...,Didn't receive enough information to verify debt,Consent provided
2,complaint-public-v2,complaint,3199379,0.0,,77069,3199379,"Other features, terms, or problems",2019-04-02T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-02T12:00:00-05:00,,General-purpose credit card or charge card,Yes,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Problem with rewards from credit card,Consent provided
10,complaint-public-v2,complaint,3233499,0.0,,104XX,3233499,Incorrect information on your report,2019-05-06T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-06T12:00:00-05:00,,Other personal consumer report,Yes,Chase Card was reported on XX/XX/2019. However...,Information belongs to someone else,Consent provided
11,complaint-public-v2,complaint,3180294,0.0,,750XX,3180294,Incorrect information on your report,2019-03-14T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-03-15T12:00:00-05:00,,Credit reporting,Yes,"On XX/XX/2018, while trying to book a XXXX XX...",Information belongs to someone else,Consent provided
14,complaint-public-v2,complaint,3224980,0.0,,920XX,3224980,Managing an account,2019-04-27T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-27T12:00:00-05:00,,Checking account,Yes,my grand son give me check for {$1600.00} i de...,Funds not handled or disbursed as instructed,Consent provided


## Data preparation

In [9]:
# Inspect the dataframe to understand the given data.
df.head()

Unnamed: 0,_index,_type,_id,_score,_source.tags,_source.zip_code,_source.complaint_id,_source.issue,_source.date_received,_source.state,...,_source.company_response,_source.company,_source.submitted_via,_source.date_sent_to_company,_source.company_public_response,_source.sub_product,_source.timely,_source.complaint_what_happened,_source.sub_issue,_source.consumer_consent_provided
0,complaint-public-v2,complaint,3211475,0.0,,90301,3211475,Attempts to collect debt not owed,2019-04-13T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-13T12:00:00-05:00,,Credit card debt,Yes,,Debt is not yours,Consent not provided
1,complaint-public-v2,complaint,3229299,0.0,Servicemember,319XX,3229299,Written notification about debt,2019-05-01T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-01T12:00:00-05:00,,Credit card debt,Yes,Good morning my name is XXXX XXXX and I apprec...,Didn't receive enough information to verify debt,Consent provided
2,complaint-public-v2,complaint,3199379,0.0,,77069,3199379,"Other features, terms, or problems",2019-04-02T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-02T12:00:00-05:00,,General-purpose credit card or charge card,Yes,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Problem with rewards from credit card,Consent provided
3,complaint-public-v2,complaint,2673060,0.0,,48066,2673060,Trouble during payment process,2017-09-13T12:00:00-05:00,MI,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2017-09-14T12:00:00-05:00,,Conventional home mortgage,Yes,,,Consent not provided
4,complaint-public-v2,complaint,3203545,0.0,,10473,3203545,Fees or interest,2019-04-05T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2019-04-05T12:00:00-05:00,,General-purpose credit card or charge card,Yes,,Charged too much interest,


In [10]:
#print the column names
df.columns

Index(['_index', '_type', '_id', '_score', '_source.tags', '_source.zip_code',
       '_source.complaint_id', '_source.issue', '_source.date_received',
       '_source.state', '_source.consumer_disputed', '_source.product',
       '_source.company_response', '_source.company', '_source.submitted_via',
       '_source.date_sent_to_company', '_source.company_public_response',
       '_source.sub_product', '_source.timely',
       '_source.complaint_what_happened', '_source.sub_issue',
       '_source.consumer_consent_provided'],
      dtype='object')

In [11]:
#Assign new column names
new_names=['index', 'type', 'id', 'score', 'source.tags', 'source.zip_code',
       'source.complaint_id', 'source.issue', 'source.date_received',
       'source.state', 'source.consumer_disputed', 'source.product',
       'source.company_response', 'source.company', 'source.submitted_via',
       'source.date_sent_to_company', 'source.company_public_response',
       'source.sub_product', 'source.timely',
       'source.complaint_what_happened', 'source.sub_issue',
       'source.consumer_consent_provided']
df.columns=new_names


In [12]:
df.head()

Unnamed: 0,index,type,id,score,source.tags,source.zip_code,source.complaint_id,source.issue,source.date_received,source.state,...,source.company_response,source.company,source.submitted_via,source.date_sent_to_company,source.company_public_response,source.sub_product,source.timely,source.complaint_what_happened,source.sub_issue,source.consumer_consent_provided
0,complaint-public-v2,complaint,3211475,0.0,,90301,3211475,Attempts to collect debt not owed,2019-04-13T12:00:00-05:00,CA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-13T12:00:00-05:00,,Credit card debt,Yes,,Debt is not yours,Consent not provided
1,complaint-public-v2,complaint,3229299,0.0,Servicemember,319XX,3229299,Written notification about debt,2019-05-01T12:00:00-05:00,GA,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-05-01T12:00:00-05:00,,Credit card debt,Yes,Good morning my name is XXXX XXXX and I apprec...,Didn't receive enough information to verify debt,Consent provided
2,complaint-public-v2,complaint,3199379,0.0,,77069,3199379,"Other features, terms, or problems",2019-04-02T12:00:00-05:00,TX,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2019-04-02T12:00:00-05:00,,General-purpose credit card or charge card,Yes,I upgraded my XXXX XXXX card in XX/XX/2018 and...,Problem with rewards from credit card,Consent provided
3,complaint-public-v2,complaint,2673060,0.0,,48066,2673060,Trouble during payment process,2017-09-13T12:00:00-05:00,MI,...,Closed with explanation,JPMORGAN CHASE & CO.,Web,2017-09-14T12:00:00-05:00,,Conventional home mortgage,Yes,,,Consent not provided
4,complaint-public-v2,complaint,3203545,0.0,,10473,3203545,Fees or interest,2019-04-05T12:00:00-05:00,NY,...,Closed with explanation,JPMORGAN CHASE & CO.,Referral,2019-04-05T12:00:00-05:00,,General-purpose credit card or charge card,Yes,,Charged too much interest,


In [13]:
#Assign nan in place of blanks in the complaints column
df['source.complaint_what_happened']=df['source.complaint_what_happened'].replace('',np.nan,regex = True)

In [14]:
#Remove all rows where complaints column is nan

df.dropna(subset=['source.complaint_what_happened'],inplace=True)

pattern = r'^\S+$'
df = df[~df['source.complaint_what_happened'].str.contains(pattern)]

# resetting index after removing nan values
df=df.reset_index()

## Prepare the text for topic modeling

Once you have removed all the blank complaints, you need to:

* Make the text lowercase
* Remove text in square brackets
* Remove punctuation
* Remove words containing numbers


Once you have done these cleaning operations you need to perform the following:
* Lemmatize the texts
* Extract the POS tags of the lemmatized text and remove all the words which have tags other than NN[tag == "NN"].


In [15]:
# Write your function here to clean the text and remove all the unnecessary elements.
def clean_text(s):
  s=s.lower()
  s= re.sub(r'[^\w\s]', '', s)
  s=re.sub(r'\w*\d\w*', '', s).strip()
  
  return s

In [16]:
for i in range(len(df['source.complaint_what_happened'])):
  df['source.complaint_what_happened'][i]=clean_text(df['source.complaint_what_happened'][i])



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [17]:
df['source.complaint_what_happened']=df['source.complaint_what_happened'].replace('',np.nan,regex = True)
df.dropna(subset=['source.complaint_what_happened'],inplace=True)
df=df.reset_index(drop=True)

In [18]:
df['source.complaint_what_happened']

0        good morning my name is xxxx xxxx and i apprec...
1        i upgraded my xxxx xxxx card in  and was told ...
2        chase card was reported on  however fraudulent...
3        on  while trying to book a xxxx  xxxx  ticket ...
4        my grand son give me check for  i deposit it i...
                               ...                        
21063    after being a chase card customer for well ove...
21064    on wednesday xxxxxxxx i called chas my xxxx xx...
21065    i am not familiar with xxxx pay and did not un...
21066    i have had flawless credit for  yrs ive had ch...
21067    roughly  years ago i closed out my accounts wi...
Name: source.complaint_what_happened, Length: 21068, dtype: object

In [19]:
s1="Hello1, World!@!. 123, ABV  AS12"
clean_text(s1)
clean_text(df['source.complaint_what_happened'][110])

'i received an offer from chase xxxx visa promising a companion airline pass xxxx bonus points for opening a chase xxxx visa card i applied for the card online in xxxxxxxx providing my current email address phone number and address i was approved in xxxx i requested paperless billing statementscorrespondence and assumed that my current contact information provided at time of approval would be utilized by chase for the xxxx xxxx \n\nhaving previous chase credit card products i was able to log into my online account and setup automatic bill payment for my account i used my card as needed to earn the companion pass and bonus points and assumed the auto payments were being made from my checking account as i viewed line items that had chase in the description and as i was not receiving paper statementsemails or phone calls from chase regarding my xxxx account \n\nin reviewing my checking account statement in xxxx and realizing the that the chase line item was actually an auto payment for on

In [20]:
#Write your function to Lemmatize the texts
def LemmatizeTexts(s):
  lemmatizer = WordNetLemmatizer()
  doc = nlp(s)
 
  # Create list of tokens from given string
  tokens = []
  for token in doc:
      tokens.append(token)
  
  lemmatized_sentence = " ".join([token.lemma_ for token in doc])
  return lemmatized_sentence

In [21]:
#[Debugging]: For Checking if function is working or not

#LemmatizeTexts(df['source.complaint_what_happened'][2812])
#df['source.complaint_what_happened'][2812]

'i opened an amazon rewards credit card through chase bank in xxxx of  \n\ni opened this card because i make a lot of purchases through amazon and i would get cash back and i was under the impression i would have an interest free period for  months \n\napparently i havent been checking my statement very well because i just noticed that i was being charged interest dating back since xxxx of  \n\nthe most recent interest charges were on xxxxxxxx for  and xxxxxxxx for   there were  more months i was also charged interest  \n\ni called chase and inquired about this issue  i spoke with an associated who identified herself as xxxx  i was informed that i did not have to pay interest on amazon purchases but i would have to pay interest on other purchases made through this card i informed them i found this misleading for a few reasons   i have paperwork that seems to say i have   apr on my amazon card purchases \n i was not charged interest for the first few months of having this credit card so

In [22]:
lemmatized_texts_list=[]
for i in range(len(df['source.complaint_what_happened'])):
  lemmatized_texts_list.append(LemmatizeTexts(df['source.complaint_what_happened'][i]))

lemmatized_texts=pd.DataFrame(lemmatized_texts_list)
lemmatized_texts.columns=['lemmatized_text']

lemmatized_texts.head()  

Unnamed: 0,0
0,good morning my name be xxxx xxxx and I apprec...
1,I upgrade my xxxx xxxx card in and be tell b...
2,chase card be report on however fraudulent a...
3,on while try to book a xxxx xxxx ticket ...
4,my grand son give I check for I deposit it i...


In [25]:
#Create a dataframe('df_clean') that will have only the complaints and the lemmatized complaints 

d11 = {'complaints': df['source.complaint_what_happened'], 'lemmatized_texts':lemmatized_texts['lemmatized_text']}
df_clean = pd.DataFrame(data=d11)
df_clean.head()

Unnamed: 0,complaints,lemmatized_texts
0,good morning my name is xxxx xxxx and i apprec...,good morning my name be xxxx xxxx and I apprec...
1,i upgraded my xxxx xxxx card in and was told ...,I upgrade my xxxx xxxx card in and be tell b...
2,chase card was reported on however fraudulent...,chase card be report on however fraudulent a...
3,on while trying to book a xxxx xxxx ticket ...,on while try to book a xxxx xxxx ticket ...
4,my grand son give me check for i deposit it i...,my grand son give I check for I deposit it i...


In [26]:
#Write your function to extract the POS tags 
stop_words = set(stopwords.words('english'))

def pos_tag(sentence):
  tokenized = sent_tokenize(sentence)
  for i in tokenized:
     
    # Word tokenizers is used to find the words
    # and punctuation in a string
    wordsList = nltk.word_tokenize(i)
 
    # removing stop words from wordList
    wordsList = [w for w in wordsList if not w in stop_words]
 
    #  Using a Tagger. Which is part-of-speech
    # tagger or POS-tagger.
    tagged = nltk.pos_tag(wordsList)
    sent_clean = [x for (x,y) in tagged if y not in ('NN')]
    
  return sent_clean

#df_clean["complaint_POS_removed"] =  #this column should contain lemmatized text with all the words removed which have tags other than NN[tag == "NN"].


In [None]:
# [Debugging]: To check if function is working or not
pos_tag(df_clean['lemmatized_texts'][212])

In [None]:
complaint_POS_removed_list=[]
for i in range(len(df_clean['lemmatized_texts'])):
  complaint_POS_removed_list.append(pos_tag(df_clean['lemmatized_texts'][i]))


In [47]:
len(complaint_POS_removed_list[100])
#abcd = pd.DataFrame(complaint_POS_removed_list, columns=['complaint_POS_removed'])
#abcd.head()
#pos_tag(df_clean['lemmatized_texts'][15])

474

In [None]:
len(complaint_POS_removed_list)
complaint_POS_removed.head()
complaint_POS_removed_list[2]

In [None]:
#The clean dataframe should now contain the raw complaint, lemmatized complaint and the complaint after removing POS tags.
#df_clean['complaint_POS_removed']=complaint_POS_removed
complaint_POS_removed

## Exploratory data analysis to get familiar with the data.

Write the code in this task to perform the following:

*   Visualise the data according to the 'Complaint' character length
*   Using a word cloud find the top 40 words by frequency among all the articles after processing the text
*   Find the top unigrams,bigrams and trigrams by frequency among all the complaints after processing the text. ‘




In [31]:
# Write your code here to visualise the data according to the 'Complaint' character length

#### Find the top 40 words by frequency among all the articles after processing the text.

In [32]:
#Using a word cloud find the top 40 words by frequency among all the articles after processing the text


In [None]:
#Removing -PRON- from the text corpus
df_clean['Complaint_clean'] = df_clean['complaint_POS_removed'].str.replace('-PRON-', '')

#### Find the top unigrams,bigrams and trigrams by frequency among all the complaints after processing the text.

In [None]:
#Write your code here to find the top 30 unigram frequency among the complaints in the cleaned datafram(df_clean). 


In [None]:
#Print the top 10 words in the unigram frequency


In [None]:
#Write your code here to find the top 30 bigram frequency among the complaints in the cleaned datafram(df_clean). 


In [None]:
#Print the top 10 words in the bigram frequency

In [None]:
#Write your code here to find the top 30 trigram frequency among the complaints in the cleaned datafram(df_clean). 


In [None]:
#Print the top 10 words in the trigram frequency

## The personal details of customer has been masked in the dataset with xxxx. Let's remove the masked text as this will be of no use for our analysis

In [None]:
df_clean['Complaint_clean'] = df_clean['Complaint_clean'].str.replace('xxxx','')

In [None]:
#All masked texts has been removed
df_clean

## Feature Extraction
Convert the raw texts to a matrix of TF-IDF features

**max_df** is used for removing terms that appear too frequently, also known as "corpus-specific stop words"
max_df = 0.95 means "ignore terms that appear in more than 95% of the complaints"

**min_df** is used for removing terms that appear too infrequently
min_df = 2 means "ignore terms that appear in less than 2 complaints"

In [None]:
#Write your code here to initialise the TfidfVectorizer 



#### Create a document term matrix using fit_transform

The contents of a document term matrix are tuples of (complaint_id,token_id) tf-idf score:
The tuples that are not there have a tf-idf score of 0

In [None]:
#Write your code here to create the Document Term Matrix by transforming the complaints column present in df_clean.


## Topic Modelling using NMF

Non-Negative Matrix Factorization (NMF) is an unsupervised technique so there are no labeling of topics that the model will be trained on. The way it works is that, NMF decomposes (or factorizes) high-dimensional vectors into a lower-dimensional representation. These lower-dimensional vectors are non-negative which also means their coefficients are non-negative.

In this task you have to perform the following:

* Find the best number of clusters 
* Apply the best number to create word clusters
* Inspect & validate the correction of each cluster wrt the complaints 
* Correct the labels if needed 
* Map the clusters to topics/cluster names

In [None]:
from sklearn.decomposition import NMF

## Manual Topic Modeling
You need to do take the trial & error approach to find the best num of topics for your NMF model.

The only parameter that is required is the number of components i.e. the number of topics we want. This is the most crucial step in the whole topic modeling process and will greatly affect how good your final topics are.

In [None]:
#Load your nmf_model with the n_components i.e 5
num_topics = #write the value you want to test out

#keep the random_state =40
nmf_model = #write your code here

In [None]:
nmf_model.fit(dtm)
len(tfidf.get_feature_names())

In [None]:
#Print the Top15 words for each of the topics


In [None]:
#Create the best topic for each complaint in terms of integer value 0,1,2,3 & 4



In [None]:
#Assign the best topic to each of the cmplaints in Topic Column

df_clean['Topic'] = #write your code to assign topics to each rows.

In [None]:
df_clean.head()

In [None]:
#Print the first 5 Complaint for each of the Topics
df_clean=df_clean.groupby('Topic').head(5)
df_clean.sort_values('Topic')

#### After evaluating the mapping, if the topics assigned are correct then assign these names to the relevant topic:
* Bank Account services
* Credit card or prepaid card
* Theft/Dispute Reporting
* Mortgage/Loan
* Others

In [None]:
#Create the dictionary of Topic names and Topics

Topic_names = {   }
#Replace Topics with Topic Names
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean

## Supervised model to predict any new complaints to the relevant Topics.

You have now build the model to create the topics for each complaints.Now in the below section you will use them to classify any new complaints.

Since you will be using supervised learning technique we have to convert the topic names to numbers(numpy arrays only understand numbers)

In [None]:
#Create the dictionary again of Topic names and Topics

Topic_names = {   }
#Replace Topics with Topic Names
df_clean['Topic'] = df_clean['Topic'].map(Topic_names)

In [None]:
df_clean

In [None]:
#Keep the columns"complaint_what_happened" & "Topic" only in the new dataframe --> training_data
training_data=

In [None]:
training_data

####Apply the supervised models on the training data created. In this process, you have to do the following:
* Create the vector counts using Count Vectoriser
* Transform the word vecotr to tf-idf
* Create the train & test data using the train_test_split on the tf-idf & topics


In [None]:

#Write your code to get the Vector count


#Write your code here to transform the word vector to tf-idf

You have to try atleast 3 models on the train & test data from these options:
* Logistic regression
* Decision Tree
* Random Forest
* Naive Bayes (optional)

**Using the required evaluation metrics judge the tried models and select the ones performing the best**

In [None]:
# Write your code here to build any 3 models and evaluate them using the required metrics



