Build a model that can rate the sentiment of a Tweet based on its content.

You'll build an NLP model to analyze Twitter sentiment about Apple and Google products. The dataset comes from CrowdFlower via data.world. Human raters labeled the sentiment of over 9,000 Tweets as positive, negative, or neither.

## Aim for a Proof of Concept

There are many approaches to NLP problems, so start with something simple and iterate from there. For example, you could start by limiting your analysis to positive and negative Tweets only, which lets you build a binary classifier. Then you could add the neutral Tweets to build a multiclass classifier. You may also consider some of the more advanced NLP methods in the Mod 4 Appendix.

## Evaluation

Evaluating multiclass classifiers can be trickier than evaluating binary classifiers because there are multiple ways to misclassify an observation, and some errors are more problematic than others. Use the business problem that your NLP project sets out to solve to inform your choice of evaluation metrics.
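To make the point about unequal errors concrete, the sketch below tallies a three-class confusion matrix and derives per-class precision and recall in plain Python. The label names, the helper functions, and the toy predictions are illustrative assumptions, not results from this dataset.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]  # illustrative label names


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def precision_recall(counts, label):
    """Precision and recall for one class, read off the pair counts."""
    tp = counts[(label, label)]
    predicted = sum(n for (_, p), n in counts.items() if p == label)
    actual = sum(n for (t, _), n in counts.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy data: one negative and one neutral Tweet are both mislabeled positive
y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["positive", "negative", "neutral", "positive", "positive", "positive"]

counts = confusion_counts(y_true, y_pred)
for label in LABELS:
    p, r = precision_recall(counts, label)
    print(f"{label:>8}: precision={p:.2f} recall={r:.2f}")
```

Both mistakes cost the same in accuracy, but a negative Tweet reported as positive is probably the more damaging error for a brand-monitoring use case, which is why per-class metrics are worth inspecting.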

Data: https://data.world/crowdflower/brands-and-product-emotions

# Business Understanding

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# nltk related imports
import nltk
from nltk.tokenize import RegexpTokenizer, TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [3]:
!ls ../../data

judge-1377884607_tweet_product_company.csv


In [4]:
df = pd.read_csv('../../data/judge-1377884607_tweet_product_company.csv', encoding = 'unicode_escape')
df

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
...,...,...,...
9088,Ipad everywhere. #SXSW {link},iPad,Positive emotion
9089,"Wave, buzz... RT @mention We interrupt your re...",,No emotion toward brand or product
9090,"Google's Zeiger, a physician never reported po...",,No emotion toward brand or product
9091,Some Verizon iPhone customers complained their...,,No emotion toward brand or product


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9093 entries, 0 to 9092
Data columns (total 3 columns):
 #   Column                                              Non-Null Count  Dtype 
---  ------                                              --------------  ----- 
 0   tweet_text                                          9092 non-null   object
 1   emotion_in_tweet_is_directed_at                     3291 non-null   object
 2   is_there_an_emotion_directed_at_a_brand_or_product  9093 non-null   object
dtypes: object(3)
memory usage: 213.2+ KB


In [6]:
first_tweet = df['tweet_text'][0]
first_tweet

'.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.'

In [7]:
#lowercase
first_tweet_lower = first_tweet.lower()
first_tweet_lower

'.@wesley83 i have a 3g iphone. after 3 hrs tweeting at #rise_austin, it was dead!  i need to upgrade. plugin stations at #sxsw.'

In [8]:
#tweet tokenizer
tweet_tknzr = TweetTokenizer(strip_handles=True)
first_tweet_lower_tt = tweet_tknzr.tokenize(first_tweet_lower)
first_tweet_lower_tt

['.',
 'i',
 'have',
 'a',
 '3g',
 'iphone',
 '.',
 'after',
 '3',
 'hrs',
 'tweeting',
 'at',
 '#rise_austin',
 ',',
 'it',
 'was',
 'dead',
 '!',
 'i',
 'need',
 'to',
 'upgrade',
 '.',
 'plugin',
 'stations',
 'at',
 '#sxsw',
 '.']

In [9]:
#turn tokenized words back into tweet
first_tweet_lower_tweet = " ".join(first_tweet_lower_tt)
first_tweet_lower_tweet

'. i have a 3g iphone . after 3 hrs tweeting at #rise_austin , it was dead ! i need to upgrade . plugin stations at #sxsw .'

In [10]:
#use regexptokenizer
pattern = r"(?u)\w{2,}" # select all words with 2 or more characters
         #r"(?u)\b\w\w+\b"
         #r'\w+'     <-REMOVES PUNCTUATION
regexp_tknzr = RegexpTokenizer(pattern)
first_tweet_regexp = regexp_tknzr.tokenize(first_tweet_lower_tweet)
first_tweet_regexp

['have',
 '3g',
 'iphone',
 'after',
 'hrs',
 'tweeting',
 'at',
 'rise_austin',
 'it',
 'was',
 'dead',
 'need',
 'to',
 'upgrade',
 'plugin',
 'stations',
 'at',
 'sxsw']

In [11]:
# create list of stopwords in English
stopwords_list = stopwords.words('english')

#remove stopwords
first_tweet_sw_removed = [word for word in first_tweet_regexp if word not in stopwords_list]
first_tweet_sw_removed

['3g',
 'iphone',
 'hrs',
 'tweeting',
 'rise_austin',
 'dead',
 'need',
 'upgrade',
 'plugin',
 'stations',
 'sxsw']

In [12]:
# create lemma object
lemma = WordNetLemmatizer()
first_tweet_lemma = [lemma.lemmatize(token) for token in first_tweet_sw_removed]
first_tweet_lemma

['3g',
 'iphone',
 'hr',
 'tweeting',
 'rise_austin',
 'dead',
 'need',
 'upgrade',
 'plugin',
 'station',
 'sxsw']

# *Do two train/test splits first to avoid data leakage, then preprocess each split*

In [13]:
df = df.rename(columns = {'tweet_text': 'Tweet', 
                         'emotion_in_tweet_is_directed_at': 'Product', 
                         'is_there_an_emotion_directed_at_a_brand_or_product': 'Sentiment'})
df.head() #Sanity Check

Unnamed: 0,Tweet,Product,Sentiment
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [14]:
df.shape

(9093, 3)

In [15]:
df['Tweet'].iloc[9092]

'\x8cÏ¡\x8eÏà\x8aü_\x8b\x81Ê\x8b\x81Î\x8b\x81Ò\x8b\x81£\x8b\x81Á\x8bââ\x8b\x81_\x8b\x81£\x8b\x81\x8f\x8bâ_\x8bÛâRT @mention Google Tests \x89ÛÏCheck-in Offers\x89Û\x9d At #SXSW {link}'

In [16]:
df['Tweet'].iloc[6]

nan

In [17]:
df.drop([6, 9092], inplace=True)
df.drop_duplicates(inplace=True)
# Drop rows with a missing Tweet; calling dropna on the column alone would not modify df
df.dropna(subset=['Tweet'], inplace=True)

In [18]:
df.isna().sum()

Tweet           0
Product      5787
Sentiment       0
dtype: int64

In [19]:
df['Sentiment'].value_counts()

No emotion toward brand or product    5374
Positive emotion                      2970
Negative emotion                       569
I can't tell                           156
Name: Sentiment, dtype: int64

In [20]:
df['Sentiment'] = df['Sentiment'].apply(lambda x: 1 if x == "Positive emotion" else 0)

In [21]:
X = df[['Tweet']]
y = df['Sentiment']
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=42)
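As a quick check on the two-stage split, the arithmetic below reproduces the expected sizes of the three partitions. It is a sketch that assumes the test share is rounded up to a whole row, which is how I understand `train_test_split` to behave; `split_sizes` is a hypothetical helper, and 9069 is the sum of the `Sentiment` value counts shown above.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


n_rows = 9069  # 5374 + 2970 + 569 + 156 rows left after cleaning

# First split: hold out 10% as the final test set
n_tr, n_test = split_sizes(n_rows, 0.10)

# Second split: 25% of the remainder becomes the validation set
n_train, n_val = split_sizes(n_tr, 0.25)

print(n_train, n_val, n_test)  # 6121 2041 907
```

That works out to roughly 67.5% training, 22.5% validation, and 10% test.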

In [22]:
#BASELINE UNDERSTANDING
y_train.value_counts(normalize=True)

0    0.67636
1    0.32364
Name: Sentiment, dtype: float64
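The proportions above imply that always predicting class 0 would already score about 67.6% accuracy on the training split, so that is the bar any model has to clear. A minimal sketch of that baseline, using a hypothetical helper and toy labels with roughly the same balance:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy labels only: 68 zeros and 32 ones, close to the balance shown above
toy_labels = [0] * 68 + [1] * 32
baseline = majority_baseline_accuracy(toy_labels)
print(f"baseline accuracy: {baseline:.2f}")
```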

In [23]:
# #If we did a multi class
# dict_sent = {'No emotion toward brand or product':1, 
#              'Positive emotion':2,
#              'Negative emotion':0,
#              "I can't tell": 1}
# df['Sentiment'] = df['Sentiment'].map(dict_sent)

In [24]:
# #Preprocess targets
# y_train = y_train.apply(lambda x: 1 if x == "Positive emotion" else 0)
# y_val = y_val.apply(lambda x: 1 if x == "Positive emotion" else 0)
# y_test = y_test.apply(lambda x: 1 if x == "Positive emotion" else 0)

In [25]:
X_train.head()

Unnamed: 0,Tweet
2324,@mention Can we make you an iPhone case with T...
5632,RT @mention Come party down with @mention &amp...
1751,#winning #winning - just gave away 5 red mophi...
5799,RT @mention google &amp; facebook have an offi...
3339,Rumor of Google launching their new social net...


In [26]:
X_train.shape

(6121, 1)

In [27]:
X_val.shape

(2041, 1)

In [28]:
#Instantiate necessary tools
tokenizer = RegexpTokenizer(r"(?u)\w{3,}")
stopwords_list = stopwords.words("english")
lemma = WordNetLemmatizer()
tweet_tknzr = TweetTokenizer(strip_handles=True)

In [29]:
def clean_tweets(text):
    no_handle = tweet_tknzr.tokenize(text)
    tweet = " ".join(no_handle) 
    #remove http websites, pound sign, any words in brackets, any words with ampersand right in front
        # ?, www dot com websites, links, videos, and non english characters
    #clean = re.sub("((^|\W)@\b([-a-zA-Z0-9._]{3,25})\b) \
        #|(&[a-z]+;)|([^\w\s]) \
    clean = re.sub("(https?:\/\/\S+) \
                   |(#[A-Za-z0-9_]+) \
                   |(\{([a-zA-Z].+)\}) \
                   |(&[a-z]+;) \
                   |(www\.[a-z]?\.?(com)+|[a-z]+\.(com))\
                   |({link})\
                   |(\[video\])\
                   |([^\x00-\x7F]+\ *(?:[^\x00-\x7F]| )*)"," ", tweet)
    lower = clean.lower()
    token_list = tokenizer.tokenize(lower)
    stopwords_removed=[token for token in token_list if token not in stopwords_list]
    lemma_list = [lemma.lemmatize(token) for token in stopwords_removed]
    cleaned_string = " ".join(lemma_list) #Turn the lemma list into a string for the Vectorizer
    return cleaned_string

In [30]:
#Sanity Check
clean_tweets(X_train['Tweet'].iloc[0])

'make iphone case ttye time sxsw want show support'
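One detail of `clean_tweets` worth double-checking: the pattern is an ordinary string split with backslash continuations, so the space before each backslash and the indentation of the following line become part of the regular expression. Every alternative except the last then only matches when it is followed by a run of spaces, which would be consistent with hashtag text such as `sxsw` surviving in the output above. The sketch below illustrates the effect and shows one possible alternative using `re.VERBOSE`. `NOISE` and `strip_noise` are hypothetical names, and the simplified pattern is an assumption about the intended behavior, not the notebook's actual cleaning step.

```python
import re

# The continuation keeps the space before the backslash and the next line's
# indentation, so the first alternative becomes "(a)" followed by spaces.
continued = "(a) \
    |(b)"
print(re.search(continued, "a"))  # None: a bare "a" no longer matches

# Hypothetical verbose-mode pattern: whitespace and comments are ignored,
# so each alternative can sit on its own line safely.
NOISE = re.compile(
    r"""
      https?://\S+       # http(s) URLs
    | \#\w+              # whole hashtags (the # must be escaped here)
    | \{[^}]*\}          # placeholders such as {link}
    | &[a-z]+;           # HTML entities such as &amp;
    | \[video\]          # literal [video] marker
    | [^\x00-\x7F]+      # runs of non-ASCII characters
    """,
    re.VERBOSE,
)


def strip_noise(text):
    """Replace every noise match with a space, then collapse whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


print(strip_noise("RT google &amp; facebook #sxsw {link} http://t.co/x ok"))
# RT google facebook ok
```

Swapping this in would change the vocabulary, so every score reported later in the notebook would need to be regenerated.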

In [31]:
# Work on explicit copies so pandas does not raise SettingWithCopyWarning
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()
X_train['Tweet'] = X_train['Tweet'].apply(clean_tweets)
X_val['Tweet'] = X_val['Tweet'].apply(clean_tweets)
X_test['Tweet'] = X_test['Tweet'].apply(clean_tweets)


In [32]:
#Sanity Check
X_train

Unnamed: 0,Tweet
2324,make iphone case ttye time sxsw want show support
5632,come party google tonight sxsw link band food ...
1751,winning winning gave away red mophie juice pac...
5799,google facebook official death policy vast maj...
3339,rumor google launching new social network call...
...,...
5702,even security guard austin enjoy ipad time sxs...
8604,attending sxsw want explore austin check austi...
7836,apple popup store sxsw link gonnagetanipad2
7504,putting pop apple store sxsw smart talk unders...


In [33]:
X_val

Unnamed: 0,Tweet
891,hootsuite mobile sxsw update iphone blackberry...
4198,morning hearing google circle today link sxsw
2164,great location choice nice timing ipad launch ...
1885,win ipad sxsw via sxsw link
4700,launching product sxsw plenty else join h4cker...
...,...
1033,racing around sxsw best fueling great local fa...
4186,omg still line new ipad dieing hunger sxsw els...
7735,hour sxsw popup apple store lone security guar...
8211,great app interface example moma target flipbo...


In [34]:
#Sanity Check
y_train

2324    0
5632    1
1751    0
5799    0
3339    0
       ..
5702    1
8604    0
7836    0
7504    1
3536    0
Name: Sentiment, Length: 6121, dtype: int64

In [35]:
y_val

891     0
4198    0
2164    1
1885    1
4700    0
       ..
1033    0
4186    1
7735    0
8211    1
4517    0
Name: Sentiment, Length: 2041, dtype: int64

In [36]:
#DON'T NEED BECAUSE I ADDED A LINE TO THE CLEAN_TWEETS FUNCTION

# X_train["Tweet"] = X_train["Tweet"].str.join(" ")
# X_val["Tweet"] = X_val["Tweet"].str.join(" ")
# X_test["Tweet"] = X_test["Tweet"].str.join(" ")

In [37]:
#X_train.head()

# Vectorize

In [38]:
# Import the relevant vectorizers
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [39]:
c_vectorizer = CountVectorizer()
c_vectorizer.fit(X_train['Tweet'])
X_train_c_vec = c_vectorizer.transform(X_train['Tweet'])
X_train_c_vec

<6121x7145 sparse matrix of type '<class 'numpy.int64'>'
	with 63041 stored elements in Compressed Sparse Row format>

In [40]:
X_train_c_vec_df = pd.DataFrame(X_train_c_vec.toarray(), columns=c_vectorizer.get_feature_names(), 
                              index=X_train.index)
X_train_c_vec_df

Unnamed: 0,000,0310apple,100,1000,101,106,10am,10pm,10x,10x2,...,zlf,zms,zomb,zombie,zomg,zone,zoom,zuckerberg,zynga,zzzs
2324,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5632,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1751,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5799,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3339,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5702,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8604,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7836,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7504,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [41]:
#Sanity Check
X_train_c_vec_df['sxsw']

2324    1
5632    1
1751    1
5799    1
3339    1
       ..
5702    1
8604    1
7836    1
7504    1
3536    1
Name: sxsw, Length: 6121, dtype: int64

In [42]:
X_val_c_vec = c_vectorizer.transform(X_val['Tweet'])
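# Give the validation frame the same column names as the training frame
X_val_c_vec_df = pd.DataFrame(X_val_c_vec.toarray(), columns=c_vectorizer.get_feature_names(),
                              index=X_val.index)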

In [62]:
# Options left at their defaults here:
#   max_df=.95      ignores terms that appear in more than 95% of documents
#   min_df=2        ignores terms that appear in fewer than 2 documents
#   max_features=10 keeps only the 10 most frequent terms
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(X_train['Tweet'])
X_train_tfidf_vec = tfidf_vectorizer.transform(X_train['Tweet'])
X_val_tfidf_vec = tfidf_vectorizer.transform(X_val['Tweet'])
X_train_tfidf_vec_df = pd.DataFrame(X_train_tfidf_vec.toarray())
X_val_tfidf_vec_df = pd.DataFrame(X_val_tfidf_vec.toarray())
X_train_tfidf_vec_df.shape

(6121, 7145)
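The `min_df` and `max_df` options filter the vocabulary by document frequency, meaning the number of Tweets a term appears in rather than its total count. The plain-Python sketch below mimics that filtering as I understand the scikit-learn documentation: integers are absolute document counts and floats are proportions of documents. `filter_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(docs)
    # Count each term once per document, however often it repeats inside it
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


docs = [
    "sxsw ipad launch",
    "sxsw google party",
    "sxsw ipad line",
    "sxsw apple store",
]

# "sxsw" is in every document (dropped by max_df=0.99), and most other terms
# are in only one document (dropped by min_df=2), leaving just "ipad".
print(filter_vocabulary(docs, min_df=2, max_df=0.99))  # ['ipad']
```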

# Simple Logistic Regression Model w/ Count Vectorizer

In [184]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42)
lr.fit(X_train_c_vec_df, y_train)
print(lr.score(X_train_c_vec_df, y_train))
print(lr.score(X_val_c_vec_df, y_val))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8921744812939062
0.7398334149926507


# Simple Logistic Regression Model w/ Tfidf Vectorizer

In [45]:
lr_2 = LogisticRegression()
lr_2.fit(X_train_tfidf_vec_df, y_train)
print(lr_2.score(X_train_tfidf_vec_df, y_train))
print(lr_2.score(X_val_tfidf_vec_df, y_val))

0.800359418395687
0.7315041646251838


# Naive Bayes w/ Count Vectorizer

In [46]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_c_vec_df, y_train)
print(naive_bayes.score(X_train_c_vec_df, y_train))
print(naive_bayes.score(X_val_c_vec_df, y_val))

0.8577029897075641
0.7158255756981872


# Naive Bayes w/ Tfidf Vectorizer

In [47]:
naive_bayes_2 = MultinomialNB()
naive_bayes_2.fit(X_train_tfidf_vec_df, y_train)
print(naive_bayes_2.score(X_train_tfidf_vec_df, y_train))
print(naive_bayes_2.score(X_val_tfidf_vec_df, y_val))

0.7892501225289985
0.7123958843704067


# Naive Bayes w/ Tuned Vectorizers

In [48]:
# max_df=.99        ignores terms that appear in more than 99% of documents
# min_df=2          ignores terms that appear in fewer than 2 documents
# max_features=1000 keeps only the 1000 most frequent terms
c_vectorizer_2 = CountVectorizer(max_df=.99,min_df=2, max_features=1000)
c_vectorizer_2.fit(X_train['Tweet'])
X_train_c_vec_2 = c_vectorizer_2.transform(X_train['Tweet'])
X_val_c_vec_2 = c_vectorizer_2.transform(X_val['Tweet'])
X_train_c_vec_df_2 = pd.DataFrame(X_train_c_vec_2.toarray())
X_val_c_vec_df_2 = pd.DataFrame(X_val_c_vec_2.toarray())

In [49]:
naive_bayes_3 = MultinomialNB()
naive_bayes_3.fit(X_train_c_vec_df_2, y_train)
print("naive bayes with tuned count vectorizer")
print(naive_bayes_3.score(X_train_c_vec_df_2, y_train))
print(naive_bayes_3.score(X_val_c_vec_df_2, y_val))

naive bayes with tuned count vectorizer
0.7582094429014867
0.705046545810877


In [182]:
tfidf_vectorizer_2 = TfidfVectorizer(max_df=.99,min_df=0.005, max_features=1000)
tfidf_vectorizer_2.fit(X_train['Tweet'])
X_train_tfidf_vec_2 = tfidf_vectorizer_2.transform(X_train['Tweet'])
X_val_tfidf_vec_2 = tfidf_vectorizer_2.transform(X_val['Tweet'])
X_train_tfidf_vec_df_2 = pd.DataFrame(X_train_tfidf_vec_2.toarray())
X_val_tfidf_vec_df_2 = pd.DataFrame(X_val_tfidf_vec_2.toarray())

In [183]:
naive_bayes_4 = MultinomialNB()
naive_bayes_4.fit(X_train_tfidf_vec_df_2, y_train)
print("naive bayes with tuned tfidf")
print(naive_bayes_4.score(X_train_tfidf_vec_df_2, y_train))
print(naive_bayes_4.score(X_val_tfidf_vec_df_2, y_val))

naive bayes with tuned tfidf
0.7162228394053259
0.7011268985791279


# Logistic Regression with max iter = 1000

In [52]:
lr_3 = LogisticRegression(max_iter=1000)
lr_3.fit(X_train_c_vec_df, y_train)
print("default count vectorizer")
print(lr_3.score(X_train_c_vec_df, y_train))
print(lr_3.score(X_val_c_vec_df, y_val))

default count vectorizer
0.8921744812939062
0.739343459088682


In [53]:
#lr_3 = LogisticRegression(max_iter=1000)
lr_3.fit(X_train_tfidf_vec_df, y_train)
print("default tfidf vectorizer")
print(lr_3.score(X_train_tfidf_vec_df, y_train))
print(lr_3.score(X_val_tfidf_vec_df, y_val))

default tfidf vectorizer
0.800359418395687
0.7315041646251838


In [54]:
#lr_3 = LogisticRegression(max_iter=1000)
lr_3.fit(X_train_tfidf_vec_df_2, y_train)
print("tfidf vectorizer with tuned parameters")
print(lr_3.score(X_train_tfidf_vec_df_2, y_train))
print(lr_3.score(X_val_tfidf_vec_df_2, y_val))

tfidf vectorizer with tuned parameters
0.7645809508250286
0.7251347378735914


# PCA w/ Logistic Regression

In [55]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_c_vec_df) #Use default count vectorizer
X_val_scaled = scaler.transform(X_val_c_vec_df)

In [56]:
# Code to import, instantiate and fit a PCA object
from sklearn.decomposition import PCA

pca = PCA(n_components = .95, random_state=42)
pca.fit(X_train_scaled)
pca.n_components_

2916

In [57]:
from sklearn.pipeline import Pipeline
# Construct a pipeline
pipe_lr = Pipeline([('pca', pca), 
                    ('lr', LogisticRegression(random_state=42, max_iter=1000))])
pipe_lr.fit(X_train_scaled, y_train)
print("PCA with n_components=0.95, default count vectorizer, and logistic regression")
print(pipe_lr.score(X_train_scaled, y_train))
print(pipe_lr.score(X_val_scaled, y_val))

PCA with n_components=0.95, default count vectorizer, and logistic regression
0.9516418885802973
0.6888780009799118


In [58]:
# pipe_mnb = Pipeline([('pca', pca), 
#                     ('mnb', MultinomialNB())])
# pipe_mnb.fit(X_train_scaled, y_train)
# print("PCA with n_components=0.95, default count vectorizer, and naive bayes")
# print(pipe_mnb.score(X_train_scaled, y_train))
# print(pipe_mnb.score(X_val_scaled, y_val))

*I got an error here because MultinomialNB does not accept negative feature values, and the scaled, PCA-transformed data contains them.*
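A likely explanation for that error is that standardizing shifts every column to mean zero, so any count below the column mean becomes negative, and PCA then mixes those signed values further. The sketch below shows the first step on a toy column of word counts. It uses a hypothetical `standardize` helper with the population standard deviation, which I assume matches what `StandardScaler` does.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a column: subtract the mean, divide by the population std."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


counts = [0, 0, 1, 3]  # toy counts of one term across four Tweets
scaled = standardize(counts)
print([round(z, 2) for z in scaled])  # [-0.82, -0.82, 0.0, 1.63]
```

Models that accept signed inputs, such as the logistic regression used above, are unaffected by this.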

# PCA w/ Tuned TFIDF Vectorizer

In [151]:
scaler_2 = StandardScaler()
X_train_scaled_2 = scaler_2.fit_transform(X_train_tfidf_vec_df_2) #Use tuned tfidf vectorizer
X_val_scaled_2 = scaler_2.transform(X_val_tfidf_vec_df_2)

In [154]:
pca_2 = PCA(n_components = .90, random_state=42)
pca_2.fit(X_train_scaled_2)
pca_2.n_components_

1238

In [155]:
pipe_lr_2 = Pipeline([('pca2', pca_2), 
                    ('lr2', LogisticRegression(random_state=42, max_iter=1000))])
pipe_lr_2.fit(X_train_scaled_2, y_train)
print("PCA with n_components=0.90, tuned tfidf vectorizer, and logistic regression")
print(pipe_lr_2.score(X_train_scaled_2, y_train))
print(pipe_lr_2.score(X_val_scaled_2, y_val))

PCA with n_components=0.90, tuned tfidf vectorizer, and logistic regression
0.8307466100310407
0.7074963253307203


# Pipeline and Cross Validate

In [185]:
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, plot_confusion_matrix, plot_roc_curve
from sklearn.ensemble import RandomForestClassifier

In [187]:
pipe_logreg = Pipeline(steps=[
    ('count_vectorizer', CountVectorizer()),
    ('logreg', LogisticRegression(random_state=42))
])
cv = cross_validate(pipe_logreg, X_train['Tweet'], y_train, return_train_score=True, \
                    scoring=['accuracy', 'precision','roc_auc'])
cv

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

{'fit_time': array([0.16672111, 0.13967609, 0.13760495, 0.13788915, 0.14039111]),
 'score_time': array([0.02055192, 0.01950383, 0.01929021, 0.01866698, 0.01906395]),
 'test_accuracy': array([0.74040816, 0.71486928, 0.75408497, 0.71078431, 0.74264706]),
 'train_accuracy': array([0.90093954, 0.90034715, 0.89810088, 0.90320604, 0.90055136]),
 'test_precision': array([0.64259928, 0.58545455, 0.68060837, 0.57      , 0.6539924 ]),
 'train_precision': array([0.93301812, 0.93019608, 0.92093023, 0.93843725, 0.93502377]),
 'test_roc_auc': array([0.75917814, 0.72217647, 0.7512428 , 0.72191419, 0.7299032 ]),
 'train_roc_auc': array([0.96578595, 0.96697679, 0.96357515, 0.96520491, 0.96583773])}
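The per-fold arrays are easier to compare once they are reduced to a mean and spread. The sketch below does that for the accuracy scores copied from the output above; `summarize` is a hypothetical helper.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_validate output above
test_accuracy = [0.74040816, 0.71486928, 0.75408497, 0.71078431, 0.74264706]
train_accuracy = [0.90093954, 0.90034715, 0.89810088, 0.90320604, 0.90055136]


def summarize(name, scores):
    """Print mean and sample standard deviation; return the mean."""
    avg = mean(scores)
    print(f"{name}: {avg:.3f} +/- {stdev(scores):.3f}")
    return avg


gap = summarize("train", train_accuracy) - summarize("test", test_accuracy)
print(f"train-test gap: {gap:.3f}")
```

A gap of roughly 17 percentage points between training and held-out accuracy suggests the pipeline is overfitting the training folds.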

In [188]:
pipe_rfc = Pipeline(steps=[
    ('tfidf_vectorizer', TfidfVectorizer(max_df=.99,min_df=0.005, max_features=1000)),
    ('rfc', RandomForestClassifier(random_state=42))
])
cv = cross_validate(pipe_rfc, X_train['Tweet'], y_train, return_train_score=True, \
                    scoring=['accuracy', 'precision','roc_auc'])
cv

{'fit_time': array([1.35563898, 1.32724214, 1.3579073 , 1.32797313, 1.31642294]),
 'score_time': array([0.09072995, 0.08811212, 0.09062457, 0.0879178 , 0.08778   ]),
 'test_accuracy': array([0.71510204, 0.71568627, 0.71650327, 0.72058824, 0.70915033]),
 'train_accuracy': array([0.95118464, 0.94853992, 0.94894834, 0.94813151, 0.94935675]),
 'test_precision': array([0.59677419, 0.59022556, 0.59760956, 0.608     , 0.57936508]),
 'train_precision': array([0.96030116, 0.95997239, 0.96450939, 0.95676047, 0.96650384]),
 'test_roc_auc': array([0.71998169, 0.71112087, 0.70720185, 0.71120474, 0.70237856]),
 'train_roc_auc': array([0.99131925, 0.99003833, 0.99008938, 0.98925492, 0.98986803])}

# Grid Search for Random Forest Classifier

In [195]:
pg_rfc = {
    "rfc__max_depth" :[25, 50, 100],
    "rfc__min_samples_leaf" : [1, 3, 5],
    "rfc__n_estimators": [500, 1000, 1500],
    "rfc__class_weight" :['balanced'],
    "rfc__random_state":[42]
}
grid_rfc = GridSearchCV(estimator = pipe_rfc, param_grid=pg_rfc, scoring='accuracy',
                        return_train_score = True)
grid_rfc.fit(X_train['Tweet'], y_train)

GridSearchCV(estimator=Pipeline(steps=[('tfidf_vectorizer',
                                        TfidfVectorizer(max_df=0.99,
                                                        max_features=1000,
                                                        min_df=0.005)),
                                       ('rfc',
                                        RandomForestClassifier(random_state=42))]),
             param_grid={'rfc__class_weight': ['balanced'],
                         'rfc__max_depth': [25, 50, 100],
                         'rfc__min_samples_leaf': [1, 3, 5],
                         'rfc__n_estimators': [500, 1000, 1500],
                         'rfc__random_state': [42]},
             scoring='accuracy')

In [196]:
pd.DataFrame(grid_rfc.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfc__class_weight,param_rfc__max_depth,param_rfc__min_samples_leaf,param_rfc__n_estimators,param_rfc__random_state,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,2.318658,0.058411,0.104045,0.00857,balanced,25,1,500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.703673,0.69281,0.695261,0.693627,0.676471,0.692369,0.00884,9
1,4.573491,0.09627,0.190555,0.002927,balanced,25,1,1000,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.702041,0.691176,0.698529,0.695261,0.680556,0.693513,0.007407,8
2,6.840107,0.197343,0.282089,0.002942,balanced,25,1,1500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.701224,0.687092,0.702614,0.695261,0.681373,0.693513,0.008173,7
3,1.624615,0.020397,0.093181,0.001943,balanced,25,3,500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.686531,0.678105,0.681373,0.68219,0.658497,0.677339,0.009797,18
4,3.238793,0.054916,0.177897,0.00205,balanced,25,3,1000,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.692245,0.673203,0.687092,0.683824,0.662582,0.679789,0.010623,16
5,4.739683,0.064339,0.270101,0.023248,balanced,25,3,1500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.693061,0.670752,0.687092,0.686275,0.661765,0.679789,0.011646,17
6,1.456939,0.015529,0.08932,0.000478,balanced,25,5,500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.679184,0.667484,0.678105,0.678105,0.661765,0.672928,0.007028,27
7,2.861272,0.03416,0.169892,0.002214,balanced,25,5,1000,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.683265,0.665033,0.687092,0.673203,0.660131,0.673745,0.010299,24
8,4.256708,0.030177,0.247809,0.001791,balanced,25,5,1500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.684898,0.660948,0.688725,0.674837,0.659314,0.673744,0.012017,25
9,4.277533,0.067883,0.135058,0.000474,balanced,50,1,500,42,"{'rfc__class_weight': 'balanced', 'rfc__max_de...",0.706939,0.705065,0.710784,0.709967,0.698529,0.706257,0.004379,2


In [197]:
grid_rfc.best_params_

{'rfc__class_weight': 'balanced',
 'rfc__max_depth': 50,
 'rfc__min_samples_leaf': 1,
 'rfc__n_estimators': 1000,
 'rfc__random_state': 42}
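For a sense of what this search costs, the sketch below enumerates the same grid with `itertools.product`. The fold count of 5 is read off the `split0` to `split4` columns in the results table; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "rfc__max_depth": [25, 50, 100],
    "rfc__min_samples_leaf": [1, 3, 5],
    "rfc__n_estimators": [500, 1000, 1500],
    "rfc__class_weight": ["balanced"],
    "rfc__random_state": [42],
}
n_folds = 5  # matches split0..split4 in the results table

# One candidate per combination of values, in the grid's key order
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
# 27 candidates, 135 fits
```

With the default `refit=True`, the best candidate is then fitted once more on the full training data.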