# ***NLP Machine Learning Model Project*** 


This project was carried out under the mentorship of Dr. Xu Wang, currently a software engineer at Google's Dublin office. It uses the Consumer Complaint Database (n=2,418,392, k=18) on credit products from the Consumer Financial Protection Bureau (CFPB).

The project workflow draws on the following examples/tutorials:

https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/

https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

https://github.com/kapadias/mediumposts/blob/master/natural_language_processing/topic_modeling/notebooks/Introduction%20to%20Topic%20Modeling.ipynb

In [1]:
from __future__ import division
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy


In [2]:
import sklearn
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.cluster import hierarchy
from math import ceil, floor

%matplotlib inline

plt.style.use('seaborn-white')

In [3]:
import nltk
nltk.download('stopwords')
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer, PorterStemmer
from wordcloud import WordCloud, STOPWORDS

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yanlinzhang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from textblob import TextBlob

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [8]:
import re
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
df = pd.read_csv("/Users/yanlinzhang/ProjectExample/complaints 2.csv")

In [10]:
df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2019-06-13,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,CAPITAL ONE FINANCIAL CORPORATION,PA,18640,,Consent not provided,Web,2019-06-13,Closed with explanation,Yes,,3274605
1,2019-11-01,Vehicle loan or lease,Loan,Struggling to pay your loan,Denied request to lower payments,I contacted Ally on Friday XX/XX/XXXX after fa...,Company has responded to the consumer and the ...,ALLY FINANCIAL INC.,NJ,8854,,Consent provided,Web,2019-11-01,Closed with explanation,Yes,,3425257
2,2019-04-01,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Account status incorrect,,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",PA,19067,,Consent not provided,Web,2019-04-01,Closed with explanation,Yes,,3198225
3,2019-07-09,Student loan,Federal student loan servicing,Dealing with your lender or servicer,Don't agree with the fees charged,I was contacted about student loan consolidati...,Company believes it acted appropriately as aut...,Equitable Acceptance Corp,TX,75039,,Consent provided,Web,2019-07-09,Closed with explanation,Yes,,3300773
4,2019-08-08,Mortgage,Conventional home mortgage,Trouble during payment process,,,Company has responded to the consumer and the ...,"FLAGSTAR BANK, FSB",ID,83706,,,Referral,2019-08-15,Closed with explanation,Yes,,3342290


## 1. Text Preprocessing ##

### 1.1 Data Undersampling ###

Three raw variables will be transformed and used in the modeling later: "Issue", "Sub-issue", and "Consumer complaint narrative". The supervised learning model is explained in later sections, but since the dependent variable will be derived from the sentiment score of "Consumer complaint narrative" and the independent variable X will be a sparse matrix stacking the TF-IDF vectorized matrices of all three variables, I chose, for the purpose of this project, to simply undersample the data by dropping incomplete rows: the 1,574,930 rows with no "Consumer complaint narrative", along with any remaining rows that lack a "Sub-issue". An oversampling alternative could be employed in the future if necessary.

In [11]:
# Check missing values
df.isnull().sum()

Date received                         0
Product                               0
Sub-product                      235164
Issue                                 0
Sub-issue                        642720
Consumer complaint narrative    1574930
Company public response         1429558
Company                               0
State                             38596
ZIP code                          38809
Tags                            2126380
Consumer consent provided?       736491
Submitted via                         0
Date sent to company                  0
Company response to consumer          3
Timely response?                      0
Consumer disputed?              1649933
Complaint ID                          0
dtype: int64

In [11]:
# Keep only the variables that will be used, and drop rows with a missing value in any of them
# (.copy() avoids pandas' SettingWithCopyWarning when new columns are added later)
IV = df[['Issue','Sub-issue','Consumer complaint narrative']].dropna().copy()

### 1.2 Punctuation Removal ###

The first step in preprocessing the main text column is to remove punctuation using Python's built-in string.punctuation constant, a string containing the following characters: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'. Any character found in this string is removed from the texts.

In [None]:
# Define a function to remove every character that appears in string.punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# Store the punctuation-free documents in a new column
IV['complaint_nopunc'] = IV['Consumer complaint narrative'].apply(remove_punctuation)
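
The character-by-character filter above can also be written with `str.translate`, which is usually faster on long documents. This is a minimal sketch rather than the notebook's own method; the name `remove_punctuation_fast` is hypothetical.

```python
import string

# Build the translation table once; it maps every punctuation character to None
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Return text with every character in string.punctuation deleted."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_fast("I was charged $35.00, twice!"))
```

Note that deleting punctuation also merges the pieces around it: "35.00" becomes "3500", and a redacted date such as "XX/XX/XXXX" becomes a single run of eight X's.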

### 1.3 Lowercasing ###

Next, the complaint narratives are converted to lowercase. This step is not strictly required, but the project focuses more on the actual content of the complaints than on their emotional tone (though that matters as well).

In [None]:
# Store the lowercased documents into a new column
IV['complaint_lower']=IV['complaint_nopunc'].apply(lambda x: x.lower())

### 1.4 Stopwords Removal ###

This step removes stopwords: commonly used words that add little value to the analysis. As in most text analyses, the popular NLTK stopword list is used here; it includes, for example, personal pronouns like "i", "you", and "we" and their inflections.

Note that I extended the stopword list with runs of x's, which the database uses to redact private or sensitive client information such as addresses. For the purpose of our analysis, those x's carry little or no meaning.

The table below compares the first three complaint texts before and after stopword removal. The difference is easy to see.

In [None]:
# Load the NLTK stopword list and add the runs of x's that redact information in our complaint documents
stop_words = set(stopwords.words('english'))
stop_words.update(['x','xx','xxx','xxxx','xxxxx','xxxxxx','xxxxxxx','xxxxxxxx'])

# Store the stopword-free documents in a new column (set membership keeps each lookup fast)
IV['complaint_nosw'] = IV['complaint_lower'].apply(lambda doc: " ".join(w for w in doc.split() if w not in stop_words))

In [None]:
# Pull up a few documents before and after stopword removal
compare = IV[['complaint_lower','complaint_nosw']].head(3)
compare.style.set_properties(**{'width':'500px'})
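
The stopword list above enumerates runs of one to eight x's. A pattern-based check is one way to catch redaction tokens of any length. The sketch below uses a small hypothetical stopword set (`BASE_STOPWORDS`) in place of the NLTK list so that it runs on its own; the helper name `remove_stopwords` is also an assumption.

```python
import re

# Hypothetical mini stopword set for illustration; the notebook uses NLTK's English list
BASE_STOPWORDS = {"i", "on", "the", "my", "was", "to"}

# Matches tokens made only of the letter x, e.g. "xx", "xxxx", "xxxxxxxx"
REDACTION = re.compile(r"x+")

def remove_stopwords(text):
    """Drop stopwords and all-x redaction tokens from a lowercased, punctuation-free text."""
    kept = [tok for tok in text.split()
            if tok not in BASE_STOPWORDS and not REDACTION.fullmatch(tok)]
    return " ".join(kept)

print(remove_stopwords("i contacted ally on friday xxxxxxxx about my loan"))
```

Because `fullmatch` requires the whole token to consist of x's, ordinary words that merely contain an x (such as "tax") are kept.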

In [None]:
# Note: 'complaint_lemmatized' is only created in section 1.6, so run this cell after that step
IV['complaint_lemmatized'].head()

### 1.5 Tokenization ###

Tokenization splits the text into smaller units (words, in this case), which is particularly helpful for, for example, computing the relative weight of individual words or fitting an LDA model. The word tokenizer from NLTK is used here.

In [None]:
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Store the tokenized documents, or tokens, into a new column
IV['complaint_tokenized'] = IV['complaint_nosw'].apply(lambda x: word_tokenize(x))

### 1.6 Lemmatization ###

Lemmatization is similar to stemming in that both reduce words to their root or base forms, but one disadvantage of stemming is that the resulting root is not always a valid English word. Lemmatization, by contrast, maps each word to a dictionary form (its lemma), so the reduced words keep their meaning. Here the words are looked up in WordNet, a predefined lexical database.

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()

# Define a lemmatizer function using WordNetLemmatizer
def lemmatizer(text):
    lemm_text = [wordnet.lemmatize(word) for word in text]
    return lemm_text

# Store the lemmatized tokens into a new column
IV['complaint_lemmatized']=IV['complaint_tokenized'].apply(lambda x: lemmatizer(x))

### 1.7 WordCloud Visualization ###

A word cloud is a text visualization technique in which each word is drawn at a size reflecting its importance or relative weight. After lowercasing the documents and removing stopwords and punctuation, we can visualize the text data as a word cloud, where the importance of a word is given by its frequency.

In [None]:
from wordcloud import WordCloud

long_string = ','.join(list(IV['complaint_nosw'].astype(str).values))

wordcloud = WordCloud(background_color = 'black', max_words = 1000,
                      contour_width=5, contour_color='steelblue')

wordcloud.generate(long_string)
wordcloud.to_image()
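
Since word size in the cloud is driven by frequency, the same information can be read off numerically from a plain word count. The sketch below assumes whitespace-separated, already cleaned documents; the three sample strings are made up for illustration.

```python
from collections import Counter

# Hypothetical cleaned documents standing in for IV['complaint_nosw']
docs = ["credit report account error",
        "account closed bank account",
        "credit card bank"]

# Count every word across all documents
word_counts = Counter(word for doc in docs for word in doc.split())

# The most frequent words are the ones a word cloud would draw largest
print(word_counts.most_common(3))
```

A table like this is a useful cross-check when two words look similar in size in the image.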

## 2. Topic Modeling ##

Topic modeling on text data is much like cluster analysis on numerical data. In particular, it resembles fuzzy clustering, where each data point can belong to more than one cluster: in topic modeling, a single document can be part of multiple topics. As a type of unsupervised learning for text analysis, topic models are used to discover hidden topics or groups of topics in the text data, which helps us understand the documents better and often prepares the ground for subsequent supervised modeling.

Latent Dirichlet Allocation (LDA), one of the most frequently used algorithms in topic modeling, is a generative probabilistic model for collections of discrete data. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics. The number of topics must be specified before fitting; LDA then returns, for each topic, its words sorted by probability.

### 2.1 Prepare for LDA Modeling ###

There are a few preparatory steps before deploying LDA. "id2word" is the dictionary that maps normalized words to their integer ids. The "doc2bow" method converts a document into the bag-of-words format, i.e., a list of 2-tuples (token_id, token_count); in our case it is applied to the lemmatized complaint narrative documents. Lastly, I set the number of topics k to seven because it has one of the higher coherence scores. Tuning k is illustrated in section 2.2.

In [None]:
import gensim
import gensim.corpora as corpora

data_words = IV['complaint_lemmatized']

# Map the normalized words and their integer ids
id2word = corpora.Dictionary(data_words)
texts = data_words
# Create a corpus storing bow-converted lemmatized words 
corpus = [id2word.doc2bow(text) for text in texts]
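
To make the bag-of-words format concrete, here is a minimal pure-Python sketch of what the dictionary and `doc2bow` steps produce. It is not gensim's implementation: the helper names are hypothetical, and the ids gensim assigns to tokens may differ from the first-appearance order used here.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document to a sorted list of (token_id, token_count) tuples."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [["account", "bank", "account"], ["bank", "card"]]
token2id = build_token2id(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each document is reduced to counts per token id, so word order is discarded; this is the representation LDA works with.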

### 2.2 Finding Optimal Number of Topics ###

The way to determine the optimal number of topics here is to run LDA models with different numbers of topics, k, and see which one returns the highest coherence score. Topic coherence scores a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. The "c_v" coherence measure is used here, which, by definition, is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and cosine similarity. The higher the coherence score an LDA model returns, the more coherent its topics and, usually, the better the model is at separating groups of topics.

Similar to the elbow curve used to choose k in k-means clustering, the elbow curve for LDA shows how the coherence score changes as the number of topics increases. It makes sense to pick a k that has a relatively high coherence score (not necessarily the highest) and that also marks an "elbow" where the coherence curve starts to flatten out. If some keywords appear repeatedly in multiple topics, the model may be using too many topics, even if its coherence score is rather high. An optimal LDA model should ideally be both coherent and have distinct words in each topic.
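
The normalized pointwise mutual information mentioned above can be illustrated on a toy corpus. This sketch is a simplification and not the full "c_v" measure: it estimates probabilities from whole-document co-occurrence rather than a sliding window, and it omits the cosine-similarity step. The function name `npmi` and the sample documents are assumptions.

```python
import math

def npmi(word_a, word_b, docs):
    """NPMI of two words from document-level co-occurrence; docs is a list of token sets."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # conventional limit when the words never co-occur
    if p_ab == 1:
        return 1.0   # both words are in every document; avoids dividing by log(1) = 0
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

docs = [{"credit", "report"}, {"credit", "report"},
        {"loan", "payment"}, {"loan", "fee"}]

print(npmi("credit", "report", docs))  # words that always appear together
print(npmi("credit", "loan", docs))    # words that never appear together
```

NPMI ranges from -1 (never together) through 0 (independent) to 1 (always together), so a topic whose top words score high against each other is considered more coherent.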

In [None]:
from gensim.models import CoherenceModel

coherence_score = []

# Fit an LDA model and compute its coherence for each k from 1 to 19
for k in range(1, 20):
    lda_tune = gensim.models.LdaMulticore(corpus=corpus,
                                 id2word=id2word,
                                 num_topics=k)
    coherence_tune = CoherenceModel(model=lda_tune, texts=data_words,
                                    dictionary=id2word, coherence='c_v')
    coherence_score.append(coherence_tune.get_coherence())

# Visualize an Elbow Curve of coherence score against number of topics
fig = plt.figure(figsize=(15, 5))
plt.plot(range(1, 20), coherence_score, marker='o',color='black')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.title('Elbow Curve') 
plt.grid(True)
    

From the elbow curve above, a reasonable k would be 7: seven topics give one of the higher coherence scores while keeping the words in each topic reasonably distinct. Note that the elbow curve for LDA is only very roughly increasing and concave, which makes the choice of k harder and more subjective. K=7, if not the best, is at least an acceptable choice here.

### 2.3 Building LDA Model ###

With the number of topics specified, we can proceed with the model construction. Again, id2word is the dictionary mapping tokens to their token ids, and corpus is the list of tuples obtained by converting the documents into the bag-of-words (BoW) format.

In [None]:
from pprint import pprint

# Pre-specify the parameter k
topics = 7

# Deploy LDA model
lda = gensim.models.LdaMulticore(corpus=corpus,
                                 id2word=id2word,
                                 num_topics=topics)

pprint(lda.print_topics())

From the LDA output, we can try to interpret some of the topics. For example, Topic 1 is represented by '0.021*"account" + 0.016*"bank" + 0.012*"card" + 0.009*"told" + 0.009*"credit" + 0.008*"would" + 0.008*"time" + 0.008*"called" + 0.007*"back" + 0.007*"check"'. This means the top 10 keywords in this topic are "account", "bank", "card", and so on, so Topic 1 probably focuses on personal bank transactions. The other topics can be interpreted in the same way from their top 10 keywords.

Although some common words such as "credit" and "account" appear in multiple topics, they are naturally frequent in any coherent narrative about credit products. The seven topics our LDA model produces are therefore relatively distinct and should represent different focuses.

In [None]:
from gensim.models import CoherenceModel

print('\nLog perplexity (per-word bound): ', lda.log_perplexity(corpus))

coherence_lda = CoherenceModel(model=lda, texts=data_words,
                               dictionary=id2word, coherence='c_v')
coherence = coherence_lda.get_coherence()
print('\nCoherence Score: ', coherence)

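Perplexity is the normalized inverse probability of a held-out test set. Both perplexity and coherence are intrinsic evaluation metrics: they evaluate the language model (here, LDA) itself rather than its performance on an actual downstream task. If the model assigns a high probability to the test set, it is not "perplexed" by the words it sees, so lower perplexity is better. Note that gensim's log_perplexity returns a per-word likelihood bound rather than the perplexity itself, which gensim defines as 2^(-bound); the value of -7.08 obtained here therefore corresponds to a perplexity of about 135. It was also computed on the training corpus rather than on a separate test set, so it is most useful for comparing models fitted to the same corpus and should not be read on its own as evidence of a good topic model.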
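
The conversion from gensim's per-word bound to a perplexity value can be sketched as follows. The bound of -7.08 is the figure quoted above; the variable names are only illustrative.

```python
# Value quoted in the text for lda.log_perplexity(corpus)
per_word_bound = -7.08

# gensim documents perplexity as 2 ** (-bound)
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

A bound closer to zero gives a smaller perplexity, so when comparing two models on the same corpus, the one with the higher (less negative) bound is preferred.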

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
visual = pyLDAvis.gensim_models.prepare(lda, corpus, id2word)
visual

pyLDAvis is a powerful tool for visualizing and understanding the topics, especially when the combinations of words in the LDA output are still hard to interpret. The interactive dashboard has bubbles, blue bars, and red bars. Each bubble represents a topic, and the size of the bubble reflects how prevalent that topic is in the corpus. The farther apart two bubbles are, the more different the topics are. A blue bar represents the overall frequency of a word in the corpus. A red bar, shown when a bubble is selected, represents the estimated number of times that word was generated by the selected topic.

On the intertopic distance map, our seven topics are fairly spread out overall, but certain topics are more similar than others. Topics 3, 5, and 6 overlap, somewhat like a three-way Venn diagram, while Topics 1 and 2 are very similar both in their words and in their size. Topics 7 and 4, by contrast, are farther away from those two "bubble clusters". For example, Topic 7 consists of words such as "consumer", "information", "agency", and "call" that appear less frequently in other topics. Topic 4 has words like "debt", "act", "claim", "law", and "violation", so it may focus on more serious accusations relating to credit products.


## 3. Sentiment Analysis ##

### 3.1 TextBlob Sentiment Polarity ###

TextBlob is a Python library for processing textual data. Here I use its sentiment.polarity property, which returns a document's polarity score within the range [-1.0, 1.0], where -1.0 is very negative and 1.0 is very positive. Note that the original complaint narrative column is used rather than the lemmatized complaints, because it makes more sense to measure the sentiment polarity of the original documents, with their tone, linguistic subtlety, and coherent content intact.

The second cell labels the complaints as "Positive", "Negative", or "Neutral" based on the polarity score. This is also a preliminary step for the comparison with another library, VADER, and possibly for the supervised learning models, where the "sentiment1" column could serve as the classification target.

In [None]:
# Define a function to compute the TextBlob sentiment polarity score of a document
def senti(x):
    return TextBlob(x).sentiment.polarity

# Store the TextBlob polarity scores into a new column    
IV['score1']=IV['Consumer complaint narrative'].apply(lambda x:senti(x))

In [None]:
# Define a function to classify documents into categories based on polarity scores
def analysis(x):
    if x < 0:
        return 'Negative'
    elif x == 0:
        return 'Neutral'
    else:
        return 'Positive'

# Store the labels into a new column
IV['sentiment1']=IV['score1'].apply(lambda x: analysis(x))

### 3.2 VADER Compound Score ###

Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon- and rule-based sentiment analysis tool, frequently used to analyze social media posts such as tweets. Its SentimentIntensityAnalyzer returns the negative, neutral, and positive proportions of a document, which add up to one, as well as a compound score summarizing the overall sentiment within the range [-1.0, 1.0], which is comparable to the TextBlob polarity score.

A similar procedure is repeated here: compute the VADER compound score for each complaint narrative, and label the complaints as "Positive", "Negative", or "Neutral" based on that score. Note that compound scores greater than or equal to 0.5 are considered positive, those less than or equal to -0.5 are considered negative, and any scores in between are considered neutral.

In [12]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Define a function to compute the VADER compound score of a document
analyzer = SentimentIntensityAnalyzer()
def vader(x):
    vs = analyzer.polarity_scores(x)
    return vs['compound']

# Store the VADER compound scores into a new column
IV['score2'] = IV['Consumer complaint narrative'].apply(vader)

In [13]:
# Define a function to classify documents into categories based on compound scores
def vader_analysis(compound):
    if compound >= 0.5:
        return "Positive"
    elif compound <= -0.5:
        return "Negative"
    else:
        return "Neutral"

# Store the labels into a new column
IV['sentiment2']=IV['score2'].apply(lambda compound: vader_analysis(compound))
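
The two labeling rules use different cut-offs, so the same numeric score can receive different labels, which is worth keeping in mind when the label distributions of the two methods are compared. The sketch below applies both rules to a few made-up scores; the function names and the `cutoff` parameter are hypothetical, and in practice the two scores come from different tools.

```python
def label_zero_cutoff(score):
    """Rule from section 3.1: any non-zero score is Positive or Negative."""
    if score < 0:
        return "Negative"
    elif score == 0:
        return "Neutral"
    return "Positive"

def label_band_cutoff(score, cutoff=0.5):
    """Rule from section 3.2: scores strictly between -cutoff and cutoff are Neutral."""
    if score >= cutoff:
        return "Positive"
    elif score <= -cutoff:
        return "Negative"
    return "Neutral"

# Made-up scores for illustration only
for score in (-0.8, -0.2, 0.0, 0.3, 0.9):
    print(score, label_zero_cutoff(score), label_band_cutoff(score))
```

Under the zero cut-off almost nothing is labeled neutral, whereas the banded rule sends every mildly positive or mildly negative score to "Neutral".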

### 3.3 Comparison of Two Methods ###

In [None]:
# Group the polarity and compound scores by their labels (positive, neutral, or negative)
tb_counts = IV['sentiment1'].value_counts().sort_index(ascending=False)
vader_counts = IV['sentiment2'].value_counts().sort_index(ascending=False)

In [None]:
# Define a function that adds a value label, enclosed in a rectangular box, at half the height
# of each bar of a bar chart
def addlabels(x,y):
    for i in range(len(x)):
        plt.text(i, y[i]//2, y[i], ha = 'center',
                 bbox = dict(facecolor = 'white', alpha = 0.5))

if __name__ == '__main__':

    x = tb_counts.index
    y = tb_counts.values

    plt.figure(figsize=(17,6))
    
    # Create two subplots each visualizing the counts distribution of sentiment labels
    # Set ylim to let both subplots be to the same scale

    plt.subplot(1,3,1)
    plt.title("TextBlob - Distribution of Sentiment Categories")
    plt.bar(x, y, width=0.7, color='sienna')
    addlabels(x, y)
    plt.ylim(0, 320000)

    x = vader_counts.index
    y = vader_counts.values

    plt.subplot(1,3,2)
    plt.title("VADER - Distribution of Sentiment Categories")
    plt.bar(x, y, width=0.7, color='sandybrown')
    addlabels(x, y)
    plt.ylim(0, 320000)


    plt.show()

## 4. Supervised Learning - Text Classification ##

### 4.1 Vectorizing Textual Data ###

TF-IDF (Term Frequency - Inverse Document Frequency) is a weighting scheme that evaluates how relevant a word is to a document within a collection of documents. It is a popular statistical measure for converting text into a numerical format (here, a sparse matrix) for machine learning purposes.

In [14]:
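# Use one vectorizer per column so that each fitted vocabulary is kept
vect_com = TfidfVectorizer(analyzer='word')
vect_issue = TfidfVectorizer(analyzer='word')
vect_sub = TfidfVectorizer(analyzer='word')

# Convert three collections of raw documents to three sparse matrices of TF-IDF features
tfidf_com = vect_com.fit_transform(IV['Consumer complaint narrative'])
tfidf_issue = vect_issue.fit_transform(IV['Issue'])
tfidf_sub = vect_sub.fit_transform(IV['Sub-issue'])

# Stack three sparse matrices horizontally
tfidfBIG = scipy.sparse.hstack((tfidf_com, tfidf_issue, tfidf_sub))
print("TF-IDF Sparse Matrix Shape: {}".format(tfidfBIG.shape))
# This counts only the 'Sub-issue' vocabulary (the last vectorizer fitted);
# the total number of features is tfidfBIG.shape[1]
print("Number of Features: {}".format(len(vect_sub.get_feature_names_out())))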

TF-IDF Sparse Matrix Shape: (670772, 124192)
Number of Features: 408
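
To show what the TF-IDF weights look like, here is a pure-Python sketch on a toy corpus. It follows what I understand to be scikit-learn's default weighting (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row); the function name `tfidf` and the sample documents are assumptions, and this is not scikit-learn's code.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return (rows, idf): per-document term weights and the idf of each term."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {term: tf[term] * idf[term] for term in tf}
        # Scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows, idf

docs = [["late", "fee"], ["late", "payment"], ["late", "fee", "refund"]]
rows, idf = tfidf(docs)
print(idf)
print(rows[2])
```

"late" occurs in every document and so gets the lowest idf, while "refund" occurs in only one and gets the highest; within the third document, "refund" therefore outweighs "late".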


### 4.2 Dimensionality Reduction ###

In [15]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=500, n_iter=10, random_state=42)
sam = svd.fit_transform(tfidfBIG)

print("original shape: ", tfidfBIG.shape)
print("transformed shape: ", sam.shape)
print("explained variance ratio sum:\n", svd.explained_variance_ratio_.sum())

original shape:  (670772, 124192)
transformed shape:  (670772, 500)
explained variance ratio sum:
 0.8365072811813636


In [27]:
y = IV['sentiment2']
X = tfidfBIG

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### 4.3 Logistic Regression ###

In [None]:
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

param_l2 = {'C': np.arange(1, 200, 1),
            'penalty':['l2'],
            'solver':['saga','newton-cg','liblinear']}

kf = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

grid_l2 = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_l2, cv=kf, verbose=2)
grid_l2.fit(X_train, y_train)

print("best mean cross-validation score: {:.3f}".format(grid_l2.best_score_))
print("best parameters: {}".format(grid_l2.best_params_))

In [19]:
logit_best = LogisticRegression(C=18, penalty='l2',solver='saga').fit(X_train, y_train)

print("Training set score: {:.5f}".format(logit_best.score(X_train, y_train)))
print("Test set score: {:.5f}".format(logit_best.score(X_test, y_test)))
print("Mean Cross Validation score (stratified k-fold): {:.5f}".format(np.mean(cross_val_score(logit_best, X_train, y_train, cv=kf))))

Training set score: 0.69372
Test set score: 0.69108
Mean Cross Validation score (stratified k-fold): 0.69216



In [None]:
logit_best.score(X_train, y_train)

In [None]:
# The target has three classes, so use the one-vs-rest variant of ROC AUC
cross_val_score(logit_best, X_train, y_train, cv=kf, scoring='roc_auc_ovr')

In [None]:
from sklearn.metrics import roc_auc_score, f1_score

In [None]:
roc_auc_score(y_train, logit_best.predict_proba(X_train), multi_class='ovr')

In [None]:
f1_score(y_train, logit_best.predict(X_train), average='weighted')

In [18]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

kf = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)


In [None]:
svm_best = SVC().fit(X_train, y_train)

print("Training set score: {:.5f}".format(svm_best.score(X_train, y_train)))
print("Test set score: {:.5f}".format(svm_best.score(X_test, y_test)))

In [None]:
param_svc = {'C': [50, 10, 1.0, 0.1, 0.01],
            'kernel':['poly','rbf','sigmoid'],
            'gamma':['scale']}

grid_svc = RandomizedSearchCV(SVC(), param_svc, cv=kf, verbose=2, n_jobs=-1)
grid_svc.fit(X_train, y_train)

print("best mean cross-validation score: {:.3f}".format(grid_svc.best_score_))
print("best parameters: {}".format(grid_svc.best_params_))

In [21]:
from sklearn import naive_bayes

In [None]:
y_train

In [20]:
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB, CategoricalNB

In [21]:
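# MaxAbsScaler only rescales each feature by its maximum absolute value, so negative inputs stay negative
p = Pipeline([('Normalizing',MaxAbsScaler()),('MultinomialNB',MultinomialNB())])
naive1 = p.fit(X_train, y_train)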

ValueError: Negative values in data passed to MultinomialNB (input X)
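
The error above is consistent with the input containing negative values at the time the cell was run (a TruncatedSVD projection, for instance, typically does). Max-abs scaling cannot remove them, because it only divides by the largest absolute value. The sketch below contrasts it with min-max scaling on one made-up feature column; both helper functions are simplified stand-ins, not scikit-learn's classes.

```python
def max_abs_scale(values):
    """Divide by the largest absolute value; signs are preserved."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]

def min_max_scale(values):
    """Shift and rescale to the interval [0, 1]; the result is non-negative."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up feature column containing a negative value
feature = [-2.0, 0.5, 4.0]

print(max_abs_scale(feature))
print(min_max_scale(feature))
```

Min-max scaling would satisfy MultinomialNB's non-negativity requirement, but the shift turns zeros into non-zeros, so it destroys the sparsity of a TF-IDF matrix; fitting on the original non-negative TF-IDF features avoids the problem altogether.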

In [28]:
X_scaled = MaxAbsScaler().fit_transform(X_train)

In [2]:
X_scaled

NameError: name 'X_scaled' is not defined

In [31]:
naive = MultinomialNB().fit(X_train, y_train)

print("Training set score: {:.5f}".format(naive.score(X_train, y_train)))
print("Test set score: {:.5f}".format(naive.score(X_test, y_test)))

Training set score: 0.53787
Test set score: 0.53383


In [3]:
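import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

# MultinomialNB has no var_smoothing parameter (that belongs to GaussianNB);
# its smoothing parameter is alpha
params_NB = {"alpha": np.logspace(0, -9, num=100)}

# RandomizedSearchCV takes param_distributions rather than param_grid
gs_NB = RandomizedSearchCV(estimator=MultinomialNB(),
                     param_distributions=params_NB,
                     cv=kf,
                     verbose=1,
                     n_jobs=-1,
                     scoring='accuracy')

# Fit on the non-negative TF-IDF matrix directly: PowerTransformer would need a dense
# array and would produce negative values, which MultinomialNB rejects
gs_NB.fit(X_train, y_train)
print("best mean cross-validation score: {:.3f}".format(gs_NB.best_score_))
print("best parameters: {}".format(gs_NB.best_params_))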