In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/BerengerQueune/wild_notebooks/main/Dataset/train.csv")
df.shape

(27481, 4)

Keep only positive and negative tweets (so you exclude neutral). What is the percentage of positive/negative tweets?

In [None]:
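# .copy() makes this an independent DataFrame, so the in-place reset_index below does not act on a slice of df
df_positive_negative = df.loc[df["sentiment"].isin(["negative", "positive"])].copy()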

df_positive_negative["sentiment"].value_counts(normalize=True)*100

positive    52.447595
negative    47.552405
Name: sentiment, dtype: float64

<b><font color='orange'>After excluding neutral tweets, 52.45% of the tweets are positive and 47.55% are negative.</font></b>

In [None]:
df_positive_negative.reset_index(drop=True,inplace=True)

Copy the text column into a Series X, and the sentiment column into a Series y. Apply a train test split with the random_state = 32.

In [None]:
X = df_positive_negative["text"]
y = df_positive_negative['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 32, train_size = 0.75)
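
A minimal sketch of what `random_state` and `train_size` do, on made-up data (the `toy_X` and `toy_y` Series are illustrative only and not part of the exercise). Calling `train_test_split` twice with the same `random_state` should return the same rows, and `train_size = 0.75` keeps three quarters of them for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the tweets and their labels
toy_X = pd.Series([f"tweet {i}" for i in range(8)])
toy_y = pd.Series(["positive", "negative"] * 4)

# Two calls with the same random_state shuffle the rows identically
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, random_state=32, train_size=0.75)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, random_state=32, train_size=0.75)

print(len(a_train), len(a_test))
print(list(a_test.index) == list(b_test.index))
```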

Create a vectorizer model with scikit-learn using the CountVectorizer class. Fit it on X_train, then create the feature matrix X_train_CV. Create the X_test_CV matrix without re-fitting the model. X_test_CV should have shape 4091x15806 with 44633 stored elements.

In [None]:
vectorizer = CountVectorizer()
vectorizer.fit(X_train)

CountVectorizer()

In [None]:
X_train_CV = vectorizer.transform(X_train)

In [None]:
X_test_CV = vectorizer.transform(X_test)

In [None]:
X_test_CV

<4091x15806 sparse matrix of type '<class 'numpy.int64'>'
	with 44633 stored elements in Compressed Sparse Row format>
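
Why the test matrix has the same number of columns as the training vocabulary can be seen on a toy corpus. This is only a sketch with made-up sentences (`train_docs`, `test_docs` and `toy_vectorizer` are illustrative names): the vocabulary is learned from the training texts alone, so a word that appears only in the test texts gets no column and is ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "night" never appears in the training texts
train_docs = ["good morning", "bad day", "good day"]
test_docs = ["good night", "bad bad day"]

toy_vectorizer = CountVectorizer()
toy_train = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary
toy_test = toy_vectorizer.transform(test_docs)        # reuses it, no re-fitting

vocab = sorted(toy_vectorizer.vocabulary_)
toy_test_dense = toy_test.toarray()

print(vocab)
print(toy_test_dense)
```

Each column is a word count, so "bad bad day" gets a 2 in the "bad" column, while "night" contributes nothing.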

Now train a logistic regression with default parameters. You should get these scores: 0.966 for the train set, and 0.877 for the test set.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X_train_CV,y_train)

print("accuracy score on train set:",model.score(X_train_CV, y_train))
print("accuracy score on test set:",model.score(X_test_CV, y_test))

accuracy score on train set: 0.9663461538461539
accuracy score on test set: 0.8772916157418724



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
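
The warning above suggests raising `max_iter`. A minimal sketch of that option on made-up data follows (`toy_texts`, `toy_labels` and the value 1000 are assumptions chosen for illustration). Note that the exercise asks for default parameters, and changing `max_iter` on the real data may shift the scores away from the ones quoted above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not the tweet dataset
toy_texts = ["i love this", "what a great day", "i hate this", "what an awful day"]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_features = CountVectorizer().fit_transform(toy_texts)

# max_iter raises the iteration budget of the lbfgs solver (default is 100)
toy_model = LogisticRegression(max_iter=1000).fit(toy_features, toy_labels)

# n_iter_ reports how many iterations the solver actually used
iterations_used = toy_model.n_iter_
print(iterations_used)
```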



Bonus step: try to display 10 tweets that were misclassified (false positives or false negatives). Would you have done better than the algorithm?

In [None]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(data = confusion_matrix(y_true = y_test, y_pred = model.predict(X_test_CV)),
             index = model.classes_ + " actual",
             columns = model.classes_ + " predicted")

Unnamed: 0,negative predicted,positive predicted
negative actual,1700,235
positive actual,267,1889
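
As a sanity check, the test accuracy reported earlier can be recovered from this confusion matrix: correct predictions sit on the diagonal. The sketch below simply retypes the four counts from the table above into a NumPy array (the variable names are illustrative).

```python
import numpy as np

# Counts copied from the confusion matrix above
# rows: actual (negative, positive); columns: predicted (negative, positive)
cm = np.array([[1700, 235],
               [267, 1889]])

accuracy = np.trace(cm) / cm.sum()  # (1700 + 1889) / 4091
false_positives = cm[0, 1]          # actual negative, predicted positive
false_negatives = cm[1, 0]          # actual positive, predicted negative

print(accuracy, false_positives, false_negatives)
```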


In [None]:
X_test2 = pd.DataFrame(X_test).copy()

In [None]:
X_test2.head()

Unnamed: 0,text
3386,"- no, is buttfuck stupid. I`m just silly and..."
4559,get better omg i still dont believe that i di...
1616,HollowbabesHere comes the utter shite #bgt <I ...
2985,Thank You Clayton. Going to my favorite Greek...
16069,I`m watching it at the moment -sighs- and st...


In [None]:
prediction = model.predict(X_test_CV)

X_test2["predict"] = prediction

In [None]:
X_test2.head()

Unnamed: 0,text,predict
3386,"- no, is buttfuck stupid. I`m just silly and...",negative
4559,get better omg i still dont believe that i di...,negative
1616,HollowbabesHere comes the utter shite #bgt <I ...,positive
2985,Thank You Clayton. Going to my favorite Greek...,positive
16069,I`m watching it at the moment -sighs- and st...,negative


In [None]:
# y_test shares X_test's index, so pandas aligns the true labels row by row
X_test2["sentiment"] = y_test

In [None]:
X_test2.head()

Unnamed: 0,text,predict,sentiment
3386,"- no, is buttfuck stupid. I`m just silly and...",negative,negative
4559,get better omg i still dont believe that i di...,negative,negative
1616,HollowbabesHere comes the utter shite #bgt <I ...,positive,negative
2985,Thank You Clayton. Going to my favorite Greek...,positive,positive
16069,I`m watching it at the moment -sighs- and st...,negative,negative


In [None]:
# Flag each row by comparing the predicted label with the true label
X_test2["prediction_result"] = np.where(X_test2["predict"] == X_test2["sentiment"], "accurate", "inaccurate")
bonus_question = X_test2.loc[X_test2["prediction_result"] == "inaccurate"]

In [None]:
X_test2

Unnamed: 0,text,predict,sentiment,prediction_result
3386,"- no, is buttfuck stupid. I`m just silly and...",negative,negative,accurate
4559,get better omg i still dont believe that i di...,negative,negative,accurate
1616,HollowbabesHere comes the utter shite #bgt <I ...,positive,negative,inaccurate
2985,Thank You Clayton. Going to my favorite Greek...,positive,positive,accurate
16069,I`m watching it at the moment -sighs- and st...,negative,negative,accurate
...,...,...,...,...
2442,I can`t take it,negative,negative,accurate
2757,so where r u spinning now that the Hookah is ...,negative,negative,accurate
10898,WHAT?! i was wanting to see that show!!,negative,negative,accurate
9863,Har vondt i ryggen My back hurts,negative,negative,accurate


In [None]:
pd.set_option('display.max_colwidth', None)
bonus_question.head(10)

Unnamed: 0,text,predict,sentiment,prediction_result
1616,HollowbabesHere comes the utter shite #bgt <I completely agree,positive,negative,inaccurate
11177,"SUFFICATION NO BREATHING. It`s okay. There`ll be more. You`re invited to mine, but I can`t promise fun times. *Jinx",positive,negative,inaccurate
7203,i wanna vote for Miley Cyrus for the mtv movie awards..but i don`t know where i could somebody could send me a link? thaank you <3,negative,positive,inaccurate
13034,I love music so much that i`ve gone through pain to play :S my sides of my fingers now are peeling and have blisters from playing so much,positive,negative,inaccurate
11012,"I can only message those who message me, if we`re fwends...so those that want replies..follow me. hmm..that sounds funny..",negative,positive,inaccurate
1803,"wish I could feel no pain (8) but it`s ok, at least they like Brazil!",negative,positive,inaccurate
2355,so glad i`m not at uni anymore,negative,positive,inaccurate
3100,You`re not here. I hope you`re still resting. I don`t want you to be stressed.,negative,positive,inaccurate
277,"you`re missing out, bb! i`m such a cereal nut, i think i like every kind available.",negative,positive,inaccurate
9047,have an amazing time with your mommas tomorrow! Show them how much they mean to you Whatever you do they will love it,positive,negative,inaccurate
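
The misclassified rows can also be split into false positives and false negatives with boolean masks. This is a minimal sketch on a made-up frame (`toy` and its contents are hypothetical), treating "positive" as the positive class.

```python
import pandas as pd

# Hypothetical predictions, using the same column names as X_test2
toy = pd.DataFrame(
    {
        "text": ["so happy", "so sad", "not bad at all", "great, just great"],
        "sentiment": ["positive", "negative", "positive", "negative"],
        "predict": ["positive", "negative", "negative", "positive"],
    },
    index=[10, 42, 7, 99],
)

# Keep only the rows where the prediction disagrees with the true label
toy_errors = toy[toy["predict"] != toy["sentiment"]]

# Among the errors, the predicted label tells the two kinds apart
false_positives = toy_errors[toy_errors["predict"] == "positive"]
false_negatives = toy_errors[toy_errors["predict"] == "negative"]

print(false_positives)
print(false_negatives)
```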


<b><font color='orange'>I am not sure I could have done better overall, since several of these tweets are ambiguous. The tweet at index 9047 reads as very positive, so I would have predicted it as positive. Why is its sentiment labeled negative? It looks like a labeling mistake in the dataset to me.</font></b>