Build a model that can rate the sentiment of a Tweet based on its content.

You'll build an NLP model to analyze Twitter sentiment about Apple and Google products. The dataset comes from CrowdFlower via data.world. Human raters rated the sentiment in over 9,000 Tweets as positive, negative, or neither.

Aim for a Proof of Concept

There are many approaches to NLP problems, so start with something simple and iterate from there. For example, you could start by limiting your analysis to positive and negative Tweets only, which lets you build a binary classifier. Then you could add in the neutral Tweets to build out a multiclass classifier. You may also consider using some of the more advanced NLP methods in the Mod 4 Appendix.

Evaluation

Evaluating multiclass classifiers can be trickier than evaluating binary classifiers because there are multiple ways to misclassify an observation, and some errors are more problematic than others. Use the business problem that your NLP project sets out to solve to inform your choice of evaluation metrics.
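
To make the point about unequal errors concrete, here is a minimal sketch in plain Python that counts (true, predicted) pairs and computes per-class recall for three sentiment labels. The labels and predictions are made-up toy values, not results from this dataset, and the helper names are hypothetical.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]

# Hypothetical toy labels, for illustration only.
y_true = ["negative", "negative", "neutral", "neutral", "neutral",
          "positive", "positive", "positive"]
y_pred = ["neutral", "negative", "neutral", "neutral", "positive",
          "positive", "positive", "negative"]


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def recall_per_class(y_true, y_pred, labels):
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {label: counts[(label, label)] / support[label] for label in labels}


counts = confusion_counts(y_true, y_pred)
recalls = recall_per_class(y_true, y_pred, LABELS)
accuracy = sum(counts[(label, label)] for label in LABELS) / len(y_true)
macro_recall = sum(recalls.values()) / len(LABELS)

print(recalls)
print(accuracy, macro_recall)
# A positive Tweet predicted as negative may cost more than one predicted as neutral.
print(counts[("positive", "negative")])
```

Accuracy treats every mistake the same, while the pair counts show which kind of mistake was made. Which of those matters most depends on the business problem.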

Data: https://data.world/crowdflower/brands-and-product-emotions

# Business Understanding

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# nltk related imports
import nltk
from nltk.tokenize import RegexpTokenizer, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [3]:
!ls ../../data

judge-1377884607_tweet_product_company.csv


In [4]:
df = pd.read_csv('../../data/judge-1377884607_tweet_product_company.csv', encoding = 'unicode_escape')
df

,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [6]:
first_tweet = df['tweet_text'][0]
first_tweet

'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

In [7]:
#lowercase
first_tweet_lower = first_tweet.lower()
first_tweet_lower

'.@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead!  i need to upgrade. plugin stations at #sxsw.'

In [8]:
#tweet tokenizer
tweet_tknzr = TweetTokenizer(strip_handles=True)
first_tweet_lower_tt = tweet_tknzr.tokenize(first_tweet_lower)
first_tweet_lower_tt

['.',
 'i',
 'have',
 'a',
 '3g',
 'iphone',
 '.',
 'after',
 '3',
 'hrs',
 'tweeting',
 'at',
 '#rise_austin',
 ',',
 'it',
 'was',
 'dead',
 '!',
 'i',
 'need',
 'to',
 'upgrade',
 '.',
 'plugin',
 'stations',
 'at',
 '#sxsw',
 '.']

In [9]:
#turn tokenized words back into tweet
first_tweet_lower_tweet = " ".join(first_tweet_lower_tt)
first_tweet_lower_tweet

'. i have a 3g iphone . after 3 hrs tweeting at #rise_austin , it was dead ! i need to upgrade . plugin stations at #sxsw .'

In [10]:
#use regexptokenizer
pattern = r"(?u)\w{2,}" # select all words with 2 or more characters
         #r"(?u)\b\w\w+\b"
         #r'\w+'     <-REMOVES PUNCTUATION
regexp_tknzr = RegexpTokenizer(pattern)
first_tweet_regexp = regexp_tknzr.tokenize(first_tweet_lower_tweet)
first_tweet_regexp

['have',
 '3g',
 'iphone',
 'after',
 'hrs',
 'tweeting',
 'at',
 'rise_austin',
 'it',
 'was',
 'dead',
 'need',
 'to',
 'upgrade',
 'plugin',
 'stations',
 'at',
 'sxsw']

In [11]:
# create list of stopwords in English
stopwords_list = stopwords.words('english')

#remove stopwords
first_tweet_sw_removed = [word for word in first_tweet_regexp if word not in stopwords_list]
first_tweet_sw_removed

['3g',
 'iphone',
 'hrs',
 'tweeting',
 'rise_austin',
 'dead',
 'need',
 'upgrade',
 'plugin',
 'stations',
 'sxsw']

In [12]:
# create lemma object
lemma = WordNetLemmatizer()
first_tweet_lemma = [lemma.lemmatize(token) for token in first_tweet_sw_removed]
first_tweet_lemma

['3g',
 'iphone',
 'hr',
 'tweeting',
 'rise_austin',
 'dead',
 'need',
 'upgrade',
 'plugin',
 'station',
 'sxsw']

# *Monday Night notes to add to Ely's code on Tuesday*

In [13]:
# #Use Lemma instead of Stemmer

In [14]:
# # Create a target map
# target_map = {'positive': 1,
#               'negative': 0}
# # Map it
# df['sentiment'] = df['sentiment'].map(target_map)

In [15]:
# # Turn preprocessed_text back into a sentence
# df_nona_c1['preprocessed_text']=df_nona_c1['preprocessed_text'].apply(lambda x: ' '.join(x))

In [16]:
# X = df_nona_c1['preprocessed_text']

# *Doing 2 Train_Test_Splits first to avoid Data Leakage.<br> Then Preprocessing each split*

In [17]:
df = df.rename(columns = {'tweet_text': 'Tweet', 
                         'emotion_in_tweet_is_directed_at': 'Product', 
                         'is_there_an_emotion_directed_at_a_brand_or_product': 'Sentiment'})
df.head() #Sanity Check

,Tweet,Product,Sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [18]:
df.shape

(9093, 3)

In [19]:
df['Tweet'].iloc[9092]

'\x8cÏ¡\x8eÏà\x8aü_\x8b\x81Ê\x8b\x81Î\x8b\x81Ò\x8b\x81£\x8b\x81Á\x8bââ\x8b\x81_\x8b\x81£\x8b\x81\x8f\x8bâ_\x8bÛâRT @mention Google Tests \x89ÛÏCheck-in Offers\x89Û\x9d At #SXSW {link}'

In [20]:
df['Tweet'].iloc[6]

nan

In [21]:
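df.drop([6, 9092], inplace=True)  # row 6 has no tweet text; row 9092 is mostly mis-encoded bytes
df.drop_duplicates(inplace=True)
# Series.dropna(inplace=True) on df['Tweet'] would not remove rows from df, so drop on the frame
df.dropna(subset=['Tweet'], inplace=True)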

In [22]:
df.isna().sum()

Tweet           0
Product      5787
Sentiment       0
dtype: int64

In [23]:
df['Sentiment'].value_counts()

No emotion toward brand or product    5374
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: Sentiment, dtype: int64

In [24]:
X = df[['Tweet']]
y = df['Sentiment']
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.20, random_state=42)

In [25]:
X.isna().sum()

Tweet    0
dtype: int64

In [26]:
y.value_counts()

No emotion toward brand or product    5374
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: Sentiment, dtype: int64

In [27]:
# #If we did a multi class
# dict_sent = {'No emotion toward brand or product':1, 
#              'Positive emotion':2,
#              'Negative emotion':0,
#              "I can't tell": 1}
# df['Sentiment'] = df['Sentiment'].map(dict_sent)

In [28]:
#Preprocess targets
y_train = y_train.apply(lambda x: 1 if x == "Positive emotion" else 0)
y_val = y_val.apply(lambda x: 1 if x == "Positive emotion" else 0)
y_test = y_test.apply(lambda x: 1 if x == "Positive emotion" else 0)

In [29]:
y_train.value_counts(normalize=True)

0    0.676061
1    0.323939
Name: Sentiment, dtype: float64

In [30]:
X_train.head()

,Tweet
2638,FourSquare CEO @mention sounded open-minded to...
6441,"RT @mention Per this rumor, Google may preview..."
7077,Apple to Open Pop-Up Shop at SXSW [REPORT]: {l...
8412,@mention That's exactly what I've been sayin! ...
4957,40% of google maps users are on mobile. #SXSW


In [31]:
X_train.shape

(6529, 1)

In [32]:
X_val.shape

(1633, 1)

In [33]:
X_test.shape

(907, 1)
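
The three shapes above follow from the two `test_size` values. Below is a small arithmetic sketch, assuming scikit-learn rounds a fractional test size up and gives the remainder to the training side; `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (train rows, held-out rows) for a fractional test_size."""
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out


n_total = 6529 + 1633 + 907                 # rows left after cleaning, from the shapes above
n_tr, n_test = split_sizes(n_total, 0.10)   # first split: hold out the test set
n_train, n_val = split_sizes(n_tr, 0.20)    # second split: hold out the validation set

print(n_train, n_val, n_test)
print(round(n_train / n_total, 2), round(n_val / n_total, 2), round(n_test / n_total, 2))
```

So the data ends up split roughly 72% / 18% / 10% into train, validation, and test.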

In [34]:
#Instantiate necessary tools
tokenizer = RegexpTokenizer(r"(?u)\w{3,}")
stopwords_list = stopwords.words("english")
lemma = WordNetLemmatizer()
tweet_tknzr = TweetTokenizer(strip_handles=True)

In [35]:
def clean_tweets(text):
    no_handle = tweet_tknzr.tokenize(text)
    tweet = " ".join(no_handle) 
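    # Intended to remove http(s) URLs, hashtags, words in braces, HTML entities (&...;),
    # www/.com addresses, {link}, [video], and non-ASCII characters.
    # NOTE: each trailing backslash continues the string literal, so the next line's
    # indentation becomes part of the pattern. Every alternative except the last one
    # (non-ASCII) then requires a long run of literal spaces and effectively never
    # matches, which is why hashtag words and "link" survive in the cleaned output.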
    #clean = re.sub("((^|\W)@\b([-a-zA-Z0-9._]{3,25})\b) \
        #|(&[a-z]+;)|([^\w\s]) \
    clean = re.sub("(https?:\/\/\S+) \
                   |(#[A-Za-z0-9_]+) \
                   |(\{([a-zA-Z].+)\}) \
                   |(&[a-z]+;) \
                   |(www\.[a-z]?\.?(com)+|[a-z]+\.(com))\
                   |({link})\
                   |(\[video\])\
                   |([^\x00-\x7F]+\ *(?:[^\x00-\x7F]| )*)"," ", tweet)
    lower = clean.lower()
    token_list = tokenizer.tokenize(lower)
    stopwords_removed=[token for token in token_list if token not in stopwords_list]
    lemma_list = [lemma.lemmatize(token) for token in stopwords_removed]
    cleaned_string = " ".join(lemma_list) #Turn the lemma list into a string for the Vectorizer
    return cleaned_string
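
For comparison, here is a minimal sketch of how a multi-line pattern like this could be written with `re.VERBOSE`, where whitespace and `#` comments inside the pattern are ignored. This is not the function used in this notebook: `NOISE_PATTERN` and `strip_noise` are hypothetical names, handles are removed by the regex rather than by `TweetTokenizer`, only the pound sign of a hashtag is dropped, and stopword removal and lemmatization are left out.

```python
import re

# Hypothetical pattern; its behavior differs from clean_tweets above.
NOISE_PATTERN = re.compile(
    r"""
      https?://\S+        # http(s) URLs
    | www\.\S+            # bare www. addresses
    | \{link\}            # the dataset's {link} placeholder
    | \[video\]           # the dataset's [video] placeholder
    | &[a-z]+;            # HTML entities such as &amp;
    | @\w+                # @handles and the @mention placeholder
    | [#]                 # pound sign only, the hashtag word is kept
    | [^\x00-\x7F]+       # runs of non-ASCII characters
    """,
    re.VERBOSE | re.IGNORECASE,
)


def strip_noise(text):
    """Blank out noise, lowercase, and keep word tokens of 3+ characters."""
    cleaned = NOISE_PATTERN.sub(" ", text)
    return re.findall(r"\w{3,}", cleaned.lower())


sample = "RT @mention Google Tests Check-in Offers at #SXSW {link} &amp; more"
tokens = strip_noise(sample)
print(tokens)
```

Swapping a pattern like this into `clean_tweets` would change the cleaned text, so every vectorizer and model score further down would need to be regenerated.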

In [36]:
#Sanity Check
clean_tweets(X_train['Tweet'].iloc[0])

'foursquare ceo sounded open minded big partnership google acquisition sxsw wallstreet'

In [37]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X_train = X_train.assign(Tweet=X_train['Tweet'].apply(clean_tweets))
X_val = X_val.assign(Tweet=X_val['Tweet'].apply(clean_tweets))
X_test = X_test.assign(Tweet=X_test['Tweet'].apply(clean_tweets))


In [38]:
#Sanity Check
X_train

,Tweet
2638,foursquare ceo sounded open minded big partner...
6441,per rumor google may preview big social strate...
7077,apple open pop shop sxsw report link sxsw
8412,exactly sayin able attend sxsw buy ipad today ...
4957,google map user mobile sxsw
...,...
5702,even security guard austin enjoy ipad time sxs...
8604,attending sxsw want explore austin check austi...
7836,apple popup store sxsw link gonnagetanipad2
7504,putting pop apple store sxsw smart talk unders...


In [39]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6529 entries, 2638 to 3536
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Tweet   6529 non-null   object
dtypes: object(1)
memory usage: 102.0+ KB


In [40]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X_train = X_train.assign(Tweet=X_train['Tweet'].astype(str))
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6529 entries, 2638 to 3536
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Tweet   6529 non-null   object
dtypes: object(1)
memory usage: 102.0+ KB


In [41]:
X_val

,Tweet
891,hootsuite mobile sxsw update iphone blackberry...
4198,morning hearing google circle today link sxsw
2164,great location choice nice timing ipad launch ...
1885,win ipad sxsw via sxsw link
4700,launching product sxsw plenty else join h4cker...
...,...
7371,awesome iphone case sxsw
6751,iphone version flipboard totally redesigned pl...
2716,anyone seen pop store austin like cnn restaura...
2306,link google sxsw marissa mayer google map usag...


In [42]:
#Sanity Check
y_train

2638    0
6441    0
7077    0
8412    0
4957    0
       ..
5702    1
8604    0
7836    0
7504    1
3536    0
Name: Sentiment, Length: 6529, dtype: int64

In [43]:
y_val

891     0
4198    0
2164    1
1885    1
4700    0
       ..
7371    1
6751    0
2716    0
2306    0
8429    0
Name: Sentiment, Length: 1633, dtype: int64

In [44]:
#Check how balanced the target is for the train
y_train.value_counts(normalize = True)

0    0.676061
1    0.323939
Name: Sentiment, dtype: float64

In [45]:
#Check how balanced the target is for the val
y_val.value_counts(normalize = True)

0    0.659522
1    0.340478
Name: Sentiment, dtype: float64
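
These proportions also set the bar a model has to clear. A minimal sketch, using the class shares printed above, of the accuracy a classifier would get by always predicting the majority class (`majority_baseline` is a hypothetical helper):

```python
# Class shares copied from the value_counts(normalize=True) outputs above.
train_share = {0: 0.676061, 1: 0.323939}
val_share = {0: 0.659522, 1: 0.340478}


def majority_baseline(shares):
    """Accuracy of a classifier that always predicts the most common class."""
    return max(shares.values())


baseline_train = majority_baseline(train_share)
baseline_val = majority_baseline(val_share)
print(f"train baseline: {baseline_train:.4f}, val baseline: {baseline_val:.4f}")
```

Any validation accuracy is best read relative to that validation baseline of about 0.66 rather than relative to 0.5.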

In [46]:
#DON'T NEED BECAUSE I ADDED A LINE TO THE CLEAN_TWEETS FUNCTION

# X_train["Tweet"] = X_train["Tweet"].str.join(" ")
# X_val["Tweet"] = X_val["Tweet"].str.join(" ")
# X_test["Tweet"] = X_test["Tweet"].str.join(" ")

In [47]:
#X_train.head()

# Vectorize

In [48]:
# Import the relevant vectorizers
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [49]:
c_vectorizer = CountVectorizer()
c_vectorizer.fit(X_train['Tweet'])
X_train_c_vec = c_vectorizer.transform(X_train['Tweet'])
X_train_c_vec

<6529x7362 sparse matrix of type '<class 'numpy.int64'>'
	with 67228 stored elements in Compressed Sparse Row format>

In [50]:
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
X_train_c_vec_df = pd.DataFrame(X_train_c_vec.toarray(), columns=c_vectorizer.get_feature_names_out(), 
                              index=X_train.index)
X_train_c_vec_df

,000,0310apple,100,1000,101,106,10am,10pm,10x,10x2,...,zlf,zms,zomb,zombie,zomg,zone,zoom,zuckerberg,zynga,zzzs
2638,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6441,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7077,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8412,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4957,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5702,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8604,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7836,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7504,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
#Sanity Check
X_train_c_vec_df['sxsw']

2638    1
6441    1
7077    2
8412    1
4957    1
       ..
5702    1
8604    1
7836    1
7504    1
3536    1
Name: sxsw, Length: 6529, dtype: int64

In [52]:
X_val_c_vec = c_vectorizer.transform(X_val['Tweet'])
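# Reuse the training column names and keep the original row index so train and val frames line up
X_val_c_vec_df = pd.DataFrame(X_val_c_vec.toarray(), columns=X_train_c_vec_df.columns,
                              index=X_val.index)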

In [53]:
tfidf_vectorizer = TfidfVectorizer()
    #max_df=.95,  # ignores terms that appear in more than 95% of docs
    #min_df=2     # ignores terms that appear in fewer than 2 docs
    #max_features=10
tfidf_vectorizer.fit(X_train['Tweet'])
X_train_tfidf_vec = tfidf_vectorizer.transform(X_train['Tweet'])
X_val_tfidf_vec = tfidf_vectorizer.transform(X_val['Tweet'])
X_train_tfidf_vec_df = pd.DataFrame(X_train_tfidf_vec.toarray())
X_val_tfidf_vec_df = pd.DataFrame(X_val_tfidf_vec.toarray())
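
To show what the TF-IDF weights represent, here is a minimal pure-Python sketch of the weighting that, as far as I understand it, `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency, and L2 normalization per document. The three documents are hypothetical, and splitting on whitespace stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

docs = ["apple ipad sxsw", "google sxsw", "ipad ipad sxsw"]  # hypothetical cleaned tweets


def tfidf(docs):
    """Return (idf per term, one {term: weight} dict per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows


idf, rows = tfidf(docs)
print(idf)
print(rows[1])
```

A term such as `sxsw` that appears in every document gets the lowest idf, so rarer terms carry more weight within each Tweet.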

# Simple Logistic Regression Model w/ Count Vectorizer

In [54]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train_c_vec_df, y_train)
print(lr.score(X_train_c_vec_df, y_train))
print(lr.score(X_val_c_vec_df, y_val))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8855873793842854
0.7458665033680343


# Simple Logistic Regression Model w/ Tfidf Vectorizer

In [55]:
lr_2 = LogisticRegression()
lr_2.fit(X_train_tfidf_vec_df, y_train)
print(lr_2.score(X_train_tfidf_vec_df, y_train))
print(lr_2.score(X_val_tfidf_vec_df, y_val))

0.8027262980548323
0.7323943661971831


# Naive Bayes w/ Count Vectorizer

In [56]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_c_vec_df, y_train)
print(naive_bayes.score(X_train_c_vec_df, y_train))
print(naive_bayes.score(X_val_c_vec_df, y_val))

0.8544953285342319
0.7244335578689528


# Naive Bayes w/ Tfidf Vectorizer

In [57]:
naive_bayes_2 = MultinomialNB()
naive_bayes_2.fit(X_train_tfidf_vec_df, y_train)
print(naive_bayes_2.score(X_train_tfidf_vec_df, y_train))
print(naive_bayes_2.score(X_val_tfidf_vec_df, y_val))

0.7900137846530862
0.7121861604409063


# Naive Bayes w/ Tuned Vectorizers

In [58]:
c_vectorizer_2 = CountVectorizer(max_df=.99,min_df=2, max_features=1000)
    #max_df=.99        # ignores terms that appear in more than 99% of docs
    #min_df=2          # ignores terms that appear in fewer than 2 docs
    #max_features=1000 # keeps only the 1,000 most frequent terms
c_vectorizer_2.fit(X_train['Tweet'])
X_train_c_vec_2 = c_vectorizer_2.transform(X_train['Tweet'])
X_val_c_vec_2 = c_vectorizer_2.transform(X_val['Tweet'])
X_train_c_vec_df_2 = pd.DataFrame(X_train_c_vec_2.toarray())
X_val_c_vec_df_2 = pd.DataFrame(X_val_c_vec_2.toarray())

In [59]:
naive_bayes_3 = MultinomialNB()
naive_bayes_3.fit(X_train_c_vec_df_2, y_train)
print("naive bayes with tuned count vectorizer")
print(naive_bayes_3.score(X_train_c_vec_df_2, y_train))
print(naive_bayes_3.score(X_val_c_vec_df_2, y_val))

naive bayes with tuned count vectorizer
0.7583090825547557
0.706062461726883


In [60]:
tfidf_vectorizer_2 = TfidfVectorizer(max_df=.99,min_df=2, max_features=1000)
tfidf_vectorizer_2.fit(X_train['Tweet'])
X_train_tfidf_vec_2 = tfidf_vectorizer_2.transform(X_train['Tweet'])
X_val_tfidf_vec_2 = tfidf_vectorizer_2.transform(X_val['Tweet'])
X_train_tfidf_vec_df_2 = pd.DataFrame(X_train_tfidf_vec_2.toarray())
X_val_tfidf_vec_df_2 = pd.DataFrame(X_val_tfidf_vec_2.toarray())

In [61]:
naive_bayes_4 = MultinomialNB()
naive_bayes_4.fit(X_train_tfidf_vec_df_2, y_train)
print("naive bayes with tuned tfidf")
print(naive_bayes_4.score(X_train_tfidf_vec_df_2, y_train))
print(naive_bayes_4.score(X_val_tfidf_vec_df_2, y_val))

naive bayes with tuned tfidf
0.7563179659978557
0.7109614206981016


# Logistic Regression with max iter = 1000

In [62]:
lr_3 = LogisticRegression(max_iter=1000)
lr_3.fit(X_train_c_vec_df, y_train)
print("default count vectorizer")
print(lr_3.score(X_train_c_vec_df, y_train))
print(lr_3.score(X_val_c_vec_df, y_val))

default count vectorizer
0.8855873793842854
0.7452541334966319


In [63]:
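#lr_3 = LogisticRegression(max_iter=1000)
# NOTE: this refits lr_3 on the count-vectorized features but scores it on the TF-IDF
# features, so the scores printed below do not describe a TF-IDF model.
# Fitting on X_train_tfidf_vec_df would give the intended comparison.
lr_3.fit(X_train_c_vec_df, y_train)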
print("default tfidf vectorizer")
print(lr_3.score(X_train_tfidf_vec_df, y_train))
print(lr_3.score(X_val_tfidf_vec_df, y_val))

default tfidf vectorizer
0.683106141828764
0.6711573790569504


In [64]:
#lr_3 = LogisticRegression(max_iter=1000)
lr_3.fit(X_train_tfidf_vec_df_2, y_train)
print("tfidf vectorizer with tuned parameters")
print(lr_3.score(X_train_tfidf_vec_df_2, y_train))
print(lr_3.score(X_val_tfidf_vec_df_2, y_val))

tfidf vectorizer with tuned parameters
0.7653545719099403
0.7201469687691365


# PCA w/ Logistic Regression

In [65]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_c_vec_df) #Use default count vectorizer
X_val_scaled = scaler.transform(X_val_c_vec_df)

In [66]:
# Code to import, instantiate and fit a PCA object
from sklearn.decomposition import PCA

pca = PCA(n_components = .95, random_state=42)
pca.fit(X_train_scaled)
pca.n_components_

3041
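
Passing a float to `n_components` asks PCA to keep the fewest leading components whose cumulative explained variance passes that fraction, which here is 3041 of the 7362 count features. A minimal sketch of that selection rule, with hypothetical variance ratios and a hypothetical helper name:

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio exceeds threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative > threshold:
            return k
    return len(explained_variance_ratio)


ratios = [0.50, 0.25, 0.15, 0.06, 0.04]  # hypothetical, sorted in descending order
k = components_for_variance(ratios, 0.95)
print(k)  # the first three components only reach 0.90, so a fourth is needed
```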

In [67]:
from sklearn.pipeline import Pipeline
# Construct a pipeline
pipe_lr = Pipeline([('pca', pca), 
                    ('lr', LogisticRegression(random_state=42, max_iter=1000))])
pipe_lr.fit(X_train_scaled, y_train)
print("PCA with n_components=0.95, default count vectorizer, and logistic regression")
print(pipe_lr.score(X_train_scaled, y_train))
print(pipe_lr.score(X_val_scaled, y_val))

PCA with n_components=0.95, default count vectorizer, and logistic regression
0.9460866901516312
0.7054500918554807


In [68]:
# pipe_mnb = Pipeline([('pca', pca), 
#                     ('mnb', MultinomialNB())])
# pipe_mnb.fit(X_train_scaled, y_train)
# print("PCA with n_components=0.95, default count vectorizer, and naive bayes")
# print(pipe_mnb.score(X_train_scaled, y_train))
# print(pipe_mnb.score(X_val_scaled, y_val))

*I got an error here: MultinomialNB does not accept negative feature values, and the standardized, PCA-transformed features contain negative values.*
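
A minimal sketch of where the negative values come from, using a hypothetical column of term counts. Standardizing subtracts the column mean, so every tweet where the term is absent ends up below zero. PCA projections are signed as well.

```python
from statistics import mean, pstdev

counts = [0, 0, 1, 3]  # hypothetical counts of one term across four tweets


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled = standardize(counts)
has_negative = any(v < 0 for v in scaled)
print([round(v, 3) for v in scaled], has_negative)
```

One option, if Naive Bayes is still wanted after PCA, might be a variant without the non-negativity requirement such as `GaussianNB`. That has not been tried in this notebook.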

# PCA w/ Tuned Count Vectorizer

In [69]:
scaler_2 = StandardScaler()
X_train_scaled_2 = scaler_2.fit_transform(X_train_c_vec_df_2) #Use tuned count vectorizer
X_val_scaled_2 = scaler_2.transform(X_val_c_vec_df_2)

In [70]:
pca_2 = PCA(n_components = .95, random_state=42)
pca_2.fit(X_train_scaled_2)
pca_2.n_components_

801

In [71]:
pipe_lr_2 = Pipeline([('pca2', pca_2), 
                    ('lr2', LogisticRegression(random_state=42, max_iter=1000))])
pipe_lr_2.fit(X_train_scaled_2, y_train)
print("PCA with n_components=0.95, tuned count vectorizer, and logistic regression")
print(pipe_lr_2.score(X_train_scaled_2, y_train))
print(pipe_lr_2.score(X_val_scaled_2, y_val))

PCA with n_components=0.95, tuned count vectorizer, and logistic regression
0.7930770408944708
0.711573790569504
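
To compare the runs side by side, here is a small sketch that ranks them by validation accuracy and shows the train/validation gap. The numbers are copied from the cell outputs in this notebook and rounded to four decimals. The In [63] run is left out because that model was fit on one feature set and scored on another. The dictionary keys are informal labels, not identifiers from the code.

```python
# (train accuracy, validation accuracy), copied from the outputs above.
scores = {
    "logreg + count": (0.8856, 0.7459),
    "logreg + tfidf": (0.8027, 0.7324),
    "naive bayes + count": (0.8545, 0.7244),
    "naive bayes + tfidf": (0.7900, 0.7122),
    "naive bayes + tuned count": (0.7583, 0.7061),
    "naive bayes + tuned tfidf": (0.7563, 0.7110),
    "logreg (max_iter=1000) + count": (0.8856, 0.7453),
    "logreg (max_iter=1000) + tuned tfidf": (0.7654, 0.7201),
    "PCA + logreg, count": (0.9461, 0.7055),
    "PCA + logreg, tuned count": (0.7931, 0.7116),
}

# Rank by validation accuracy, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for name, (train_acc, val_acc) in ranked:
    print(f"{name:<38} val={val_acc:.4f}  gap={train_acc - val_acc:.4f}")

best_name = ranked[0][0]
largest_gap = max(scores, key=lambda name: scores[name][0] - scores[name][1])
print(best_name, "|", largest_gap)
```

All validation scores fall between about 0.71 and 0.75, a few points above the 0.66 majority-class share of the validation set. The largest train/validation gap belongs to PCA on the default count features, which suggests overfitting there.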
