## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [2]:
df = pd.read_csv("https://github.com/murpi/wilddata/raw/master/quests/tweets.zip")
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


In [3]:
df.tail()

Unnamed: 0,textID,text,selected_text,sentiment
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive
27480,6f7127d9d7,All this flirting going on - The ATG smiles...,All this flirting going on - The ATG smiles. Y...,neutral


## Preprocessing

In [4]:
df = df[df['sentiment'] != 'neutral']
df['sentiment'].value_counts(normalize=True)

positive    0.524476
negative    0.475524
Name: sentiment, dtype: float64

In [5]:
df.reset_index(drop=True, inplace=True)

After removing the neutral tweets, about 52% of the remaining ones are positive, so the two classes are roughly balanced.
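A quick sanity check on that balance: a model that always predicts the majority class would score about 52%, which is the floor any real classifier has to beat. Here is a minimal sketch, assuming a hypothetical 100-label toy sample with the same split (`X_toy`, `y_toy` and `baseline` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels reproducing the ~52/48 split reported above.
y_toy = np.array(["positive"] * 52 + ["negative"] * 48)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any accuracy reported later is best read against this baseline rather than against 0%.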

In [6]:
X = df['text']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=32, train_size = 0.75)

vectorizer = CountVectorizer()
vectorizer.fit(X_train)

CountVectorizer()

In [7]:
X_train_CV = vectorizer.transform(X_train)
X_train_CV

<12272x15806 sparse matrix of type '<class 'numpy.int64'>'
	with 144578 stored elements in Compressed Sparse Row format>

In [8]:
X_test_CV = vectorizer.transform(X_test)
X_test_CV

<4091x15806 sparse matrix of type '<class 'numpy.int64'>'
	with 44633 stored elements in Compressed Sparse Row format>
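Both matrices have 15806 columns because the vocabulary is learned from the training tweets only, and `transform` silently drops any word it has not seen. A small sketch on a made-up corpus (the texts and the `toy_vectorizer` name are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "night" appears only in the test text.
train_texts = ["good day", "bad day"]
test_texts = ["good night"]

# The vocabulary is fixed at fit time, from the training texts only.
toy_vectorizer = CountVectorizer().fit(train_texts)
vocab = sorted(toy_vectorizer.vocabulary_)

# Unseen words ("night") get no column, so they are simply ignored.
test_counts = toy_vectorizer.transform(test_texts).toarray()
print(vocab)
print(test_counts)
```

Fitting on `X_train` alone, as done above, keeps information from the test set out of the features.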

## Classification

In [9]:
model = LogisticRegression().fit(X_train_CV, y_train)

print(f"Accuracy score on the train dataset: {model.score(X_train_CV, y_train)}")
print(f"Accuracy score on the test dataset: {model.score(X_test_CV, y_test)}")

Accuracy score on the train dataset: 0.9663461538461539
Accuracy score on the test dataset: 0.8772916157418724


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


The accuracy is as expected, but the solver hit its iteration limit before converging. I'm not a fan of warning messages, so let's raise `max_iter` to get rid of them.

In [10]:
model = LogisticRegression(max_iter=150).fit(X_train_CV, y_train)

print(f"Accuracy score on the train dataset: {model.score(X_train_CV, y_train)}")
print(f"Accuracy score on the test dataset: {model.score(X_test_CV, y_test)}")

Accuracy score on the train dataset: 0.9663461538461539
Accuracy score on the test dataset: 0.8772916157418724
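To confirm that the solver really converged rather than just trusting a quiet output, one option is to compare `n_iter_` with `max_iter` and check that no `ConvergenceWarning` was raised. A hedged sketch on synthetic data (`X_toy`, `y_toy` and `toy_model` are hypothetical stand-ins for the tweet matrices and the model above):

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy two-class data standing in for the vectorized tweets.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + X_toy[:, 1] + noise > 0).astype(int)

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_model = LogisticRegression(max_iter=150).fit(X_toy, y_toy)

converged = not any(issubclass(w.category, ConvergenceWarning) for w in caught)
iterations_used = int(toy_model.n_iter_[0])
print(f"Iterations used: {iterations_used}, converged: {converged}")
```

If `iterations_used` equals `max_iter`, the limit is probably still too low.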


In [11]:
pd.DataFrame(data = confusion_matrix(y_true = y_test, y_pred = model.predict(X_test_CV)),
             index = model.classes_ + " ACTUAL",
             columns = model.classes_ + " PREDICTED")

Unnamed: 0,negative PREDICTED,positive PREDICTED
negative ACTUAL,1700,235
positive ACTUAL,267,1889
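The confusion matrix also gives precision and recall with a bit of arithmetic. The sketch below copies the four counts from the table above and treats "positive" as the positive class (that choice is an assumption; the notebook does not pick one):

```python
import numpy as np

# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
cm = np.array([[1700, 235],
               [267, 1889]])

# With classes ordered (negative, positive), ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all tweets classified correctly
precision_pos = tp / (tp + fp)    # predicted positives that are truly positive
recall_pos = tp / (tp + fn)       # true positives that the model recovered

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} recall={recall_pos:.4f}")
```

The accuracy computed this way matches the test score reported by `model.score`, which is a useful check that rows and columns are read in the right order.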


In [12]:
model.predict_proba(X_test_CV)

array([[0.98175039, 0.01824961],
       [0.9931976 , 0.0068024 ],
       [0.42704948, 0.57295052],
       ...,
       [0.67934786, 0.32065214],
       [0.85247421, 0.14752579],
       [0.00350557, 0.99649443]])

In [13]:
model.predict(X_test_CV)

array(['negative', 'negative', 'positive', ..., 'negative', 'negative',
       'positive'], dtype=object)

In [14]:
model.classes_

array(['negative', 'positive'], dtype=object)
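The order of `model.classes_` matters: the columns of `predict_proba` follow it, so column 0 is the probability of `'negative'`, and `predict` returns the class with the higher probability. A minimal sketch on a one-feature toy problem (the data and the `toy_model` name are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy problem: small values are "negative", large values are "positive".
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(["negative", "negative", "positive", "positive"])
toy_model = LogisticRegression().fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)

# Look up the column by class name instead of hard-coding its position.
neg_col = list(toy_model.classes_).index("negative")

# Picking the most probable column reproduces predict().
from_proba = toy_model.classes_[proba.argmax(axis=1)]
print(neg_col, from_proba)
```

Looking the column up by name avoids silently reading the wrong probability if the labels ever change.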

In [15]:
test_df = X_test.to_frame()  # one-column DataFrame that keeps the original index
test_df

Unnamed: 0,text
3386,"- no, is buttfuck stupid. I`m just silly and..."
4559,get better omg i still dont believe that i di...
1616,HollowbabesHere comes the utter shite #bgt <I ...
2985,Thank You Clayton. Going to my favorite Greek...
16069,I`m watching it at the moment -sighs- and st...
...,...
2442,I can`t take it
2757,so where r u spinning now that the Hookah is ...
10898,WHAT?! i was wanting to see that show!!
9863,Har vondt i ryggen My back hurts


In [16]:
proba = model.predict_proba(X_test_CV)
test_df['prediction_negative_percent'] = proba[:,0] * 100
test_df

Unnamed: 0,text,prediction_negative_percent
3386,"- no, is buttfuck stupid. I`m just silly and...",98.175039
4559,get better omg i still dont believe that i di...,99.319760
1616,HollowbabesHere comes the utter shite #bgt <I ...,42.704948
2985,Thank You Clayton. Going to my favorite Greek...,0.024477
16069,I`m watching it at the moment -sighs- and st...,56.026406
...,...,...
2442,I can`t take it,63.331550
2757,so where r u spinning now that the Hookah is ...,85.460161
10898,WHAT?! i was wanting to see that show!!,67.934786
9863,Har vondt i ryggen My back hurts,85.247421


In [17]:
test_df['sentiment_prediction'] = model.predict(X_test_CV)
test_df

Unnamed: 0,text,prediction_negative_percent,sentiment_prediction
3386,"- no, is buttfuck stupid. I`m just silly and...",98.175039,negative
4559,get better omg i still dont believe that i di...,99.319760,negative
1616,HollowbabesHere comes the utter shite #bgt <I ...,42.704948,positive
2985,Thank You Clayton. Going to my favorite Greek...,0.024477,positive
16069,I`m watching it at the moment -sighs- and st...,56.026406,negative
...,...,...,...
2442,I can`t take it,63.331550,negative
2757,so where r u spinning now that the Hookah is ...,85.460161,negative
10898,WHAT?! i was wanting to see that show!!,67.934786,negative
9863,Har vondt i ryggen My back hurts,85.247421,negative


In [18]:
test_df['sentiment'] = df['sentiment'].loc[test_df.index]  # test_df.index holds labels, so use .loc rather than positional .iloc
test_df

Unnamed: 0,text,prediction_negative_percent,sentiment_prediction,sentiment
3386,"- no, is buttfuck stupid. I`m just silly and...",98.175039,negative,negative
4559,get better omg i still dont believe that i di...,99.319760,negative,negative
1616,HollowbabesHere comes the utter shite #bgt <I ...,42.704948,positive,negative
2985,Thank You Clayton. Going to my favorite Greek...,0.024477,positive,positive
16069,I`m watching it at the moment -sighs- and st...,56.026406,negative,negative
...,...,...,...,...
2442,I can`t take it,63.331550,negative,negative
2757,so where r u spinning now that the Hookah is ...,85.460161,negative,negative
10898,WHAT?! i was wanting to see that show!!,67.934786,negative,negative
9863,Har vondt i ryggen My back hurts,85.247421,negative,negative


In [19]:
df_errors = test_df[(test_df['sentiment'] == 'negative') & (test_df['sentiment_prediction'] == 'positive')]
df_errors_10 = df_errors.head(10)
df_errors_10

Unnamed: 0,text,prediction_negative_percent,sentiment_prediction,sentiment
1616,HollowbabesHere comes the utter shite #bgt <I ...,42.704948,positive,negative
11177,SUFFICATION NO BREATHING. It`s okay. There`ll...,4.009449,positive,negative
13034,I love music so much that i`ve gone through pa...,32.666816,positive,negative
9047,have an amazing time with your mommas tomorro...,0.009714,positive,negative
15628,Watching 1971 edition if Old Grey Whistle Test...,46.867301,positive,negative
11595,"awh, thats not good, get better soon!",7.2531,positive,negative
11602,"Alas, the best I can offer is a small pony an...",32.27665,positive,negative
9630,is very worried about sam and wants to know h...,37.422888,positive,negative
5001,There`s nothing good on tonight anyway!! #S...,12.59263,positive,negative
15257,Loll whats boyfriend #2 supposed to mean then?...,48.571977,positive,negative


In [20]:
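# Iterate over the rows directly so the loop also works with fewer than 10 errors.
rows = zip(df_errors_10['text'], df_errors_10['prediction_negative_percent'])
for i, (text, negative_percent) in enumerate(rows, start=1):
    print(f"{i}. {text} \n {round(negative_percent, 2)} percent predicted negativity")
    print('\n')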

1. HollowbabesHere comes the utter shite #bgt <I completely agree 
 42.7 percent predicted negativity


2.  SUFFICATION NO BREATHING. It`s okay. There`ll be more. You`re invited to mine, but I can`t promise fun times.  *Jinx 
 4.01 percent predicted negativity


3. I love music so much that i`ve gone through pain to play :S my sides of my fingers now are peeling and have blisters from playing so much 
 32.67 percent predicted negativity


4.  have an amazing time with your mommas tomorrow! Show them how much they mean to you  Whatever you do they will love it 
 0.01 percent predicted negativity


5. Watching 1971 edition if Old Grey Whistle Test. Fanny, Mamas and the Papas & Isaac Hayes. Don`t make shows like this anymore 
 46.87 percent predicted negativity


6.  awh, thats not good,   get better soon! 
 7.25 percent predicted negativity


7.  Alas, the best I can offer is a small pony and a rowing boat 
 32.28 percent predicted negativity


8. is very worried about sam  and wants to 

Some of these examples are edge cases and say little about the model.

The 4th tweet is labeled negative in the dataset but reads as clearly positive. Without context, we cannot tell whether it is a sincere message (for Mother's Day, for example) or sarcasm.

The same goes for the 6th: if it really is negative, it must be sarcasm aimed at someone the poster disliked, who had probably mentioned being sick. Again, we lack context here; on its own, I would have labeled it positive by hand.

By contrast, the 3rd tweet looks positive, but its author mentions the painful side effects of playing too much, so it could go either way and the classifier isn't really wrong.

As a general rule in these 10 examples, the negative label is either debatable or rests on sarcasm that takes more context to detect. The model does seem to work quite well, and doing better by hand would require seeing the tweets in their original setting on Twitter, not just as extracts in a dataset.
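One way to make this kind of review more systematic is to sort the errors by predicted negativity: the lowest values are the most confident mistakes and the likeliest to be mislabeled tweets, while values near 50% are genuinely borderline. The sketch below reuses the ten values printed in the error table above; the 40% cutoff for "borderline" is an arbitrary assumption.

```python
import pandas as pd

# Index -> predicted negativity (%), copied from the ten false positives above.
errors = pd.Series(
    {1616: 42.704948, 11177: 4.009449, 13034: 32.666816, 9047: 0.009714,
     15628: 46.867301, 11595: 7.2531, 11602: 32.27665, 9630: 37.422888,
     5001: 12.59263, 15257: 48.571977},
    name="prediction_negative_percent",
)

# Most confident mistakes first: candidates for label noise or missing context.
most_confident = errors.sort_values().head(3)

# Close calls: the model was nearly undecided (the 40% cutoff is arbitrary).
borderline = errors[errors > 40]

print(most_confident)
print(borderline)
```

The three most confident errors are tweets 4, 2 and 6 in the list above, which include the two flagged as probably mislabeled or sarcastic.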