# Project: Sentiment detection

- Date: July 25 2025 - break - August 5 2025

- Data: We download the data with this command: wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip and then we decompressed it. This data contains 1,600,000 lines (text or tweet).

- Description: In this project, we will build a model able to predict the feeling or sentiment expressed in a tweet( positif or negatif)
 

In [1]:
import pandas as pd

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer



# Downloading and cleaning of the data

In [3]:
# Training set
df_train = pd.read_csv("./data/training.1600000.processed.noemoticon.csv",
                        encoding='ISO-8859-1',
                        header=None,
                        names=['target','id','date','flag','user','text'])

df_train= df_train[["target","text"]] # we keep only the targets and their texts
df_train['target'] = df_train['target'].map({0:0, 4:1}) # Change the 0 and 4 into 0 and 1


In [4]:
print(df_train['text'])

0          @switchfoot http://twitpic.com/2y1zl - Awww, t...
1          is upset that he can't update his Facebook by ...
2          @Kenichan I dived many times for the ball. Man...
3            my whole body feels itchy and like its on fire 
4          @nationwideclass no, it's not behaving at all....
                                 ...                        
1599995    Just woke up. Having no school is the best fee...
1599996    TheWDB.com - Very cool to hear old Walt interv...
1599997    Are you ready for your MoJo Makeover? Ask me f...
1599998    Happy 38th Birthday to my boo of alll time!!! ...
1599999    happy #charitytuesday @theNSPCC @SparksCharity...
Name: text, Length: 1600000, dtype: object


After seeing some texts, I'm not very sure of the feeling given to the text in the target part but... let's do it like this.

In [5]:
def text_preprocessing_nltk(text):
    tokens=[]
    stop_words=set(stopwords.words("english")) # all stopword in english("the", "of", "is",...)
    stemming=PorterStemmer()

    text=text.lower() #conversion in lowercase
    text=re.sub(pattern=r'[^\w\s]', repl= '', string=text) # remove all special characters and keep only letters and numbers
    text=re.sub(pattern=r'\S+@\S+',repl=' ', string=text) # Change of all emails by a space
    text=re.sub(pattern=r'http:\S+|www\.\S+', repl=' ',string=text) #Change all links or URL by space

    #Tokenization
    tokens=word_tokenize(text)

    #Suppresion of stopwords
    tokens=[word for word in tokens if word not in stop_words]

    #Stemming
    tokens=[stemming.stem(word) for word in tokens]

    return ' '.join(tokens)

#Definition of the vetorizer, the features and target
vectorizer=TfidfVectorizer(preprocessor=text_preprocessing_nltk,
                        tokenizer= lambda txt: txt.split())
X=vectorizer.fit_transform(df_train['text'])
y=df_train['target'].values

#Note to myself: Search after what is really a pipepline and how to do it, cause it seems easier. 




In [None]:
print(df_train['target'].values)

[0 0 0 ... 1 1 1]


Now we have cleaned the data and split it into features and target let build our models.  

# Models

Our goal here is to predict if a message or a text has a good feeling or not ( 1 or 0). So we're going to use classification models.

In [7]:
from sklearn.linear_model import LogisticRegression #Despite his name it is for classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [43]:
#splitting the data
X_train, X_test, y_train, y_test=train_test_split(X, y,test_size=0.2,
                                                  random_state=42,
                                                  stratify=y)
# train = 1 600 000*80% =  1 280 000
# test = 1 600 000*20%= 320 000

#Building models and training
log_model=LogisticRegression().fit(X_train, y_train)
nb_model=MultinomialNB().fit(X_train, y_train)

#Predictions
log_pred = log_model.predict(X_test)
nb_pred  = nb_model.predict(X_test)

#Evaluation
def evaluate(model,y_true, y_pred):
    print(f"=={model}==")
    print("\t Accuracy score: ", accuracy_score(y_true, y_pred))
    print("\t Precision score: ", precision_score(y_true, y_pred))
    print("\t Recall score: ", recall_score(y_true, y_pred))
    print("\t F1 score : ", f1_score(y_true, y_pred))

evaluate("Logisitc Regression", y_test, log_pred)
evaluate("Naive Bayes", y_test, nb_pred)


==Logisitc Regression==
	 Accuracy score:  0.783453125
	 Precision score:  0.7694629456885335
	 Recall score:  0.8094125
	 F1 score :  0.7889323103071211
==Naive Bayes==
	 Accuracy score:  0.75935
	 Precision score:  0.7829439921449903
	 Recall score:  0.71765625
	 F1 score :  0.7488798596482075


We can notice that the accuracy, the precision and the F1 score are approximatly the same for this 2 models. And we can notice a bigger difference on thre recall score, it's mean the Logistic regression manage better to predict the good(1) tweets. 

Now we will take the best model amoung the model or algorithm made here and try to build a better one , more trained using the cross validation to increase is evalation scores

# Cross Validation

So here we will train the models using cross validation algorithm. We don't really find the fact to use a stratify k fold useful because for me, whatever the case our model must be able to decide (regardless the other text) if a text has a good felling or not. But ...

Also we will choose the metric f1_score because we find the precison and the recall very important, si taking a metric which is based on both seem suitable for us.

In [11]:
from sklearn.model_selection import cross_validate, StratifiedKFold

cv=StratifiedKFold(n_splits=5,random_state=0,shuffle=True)

nb_scores=cross_validate(nb_model,X,y,cv=cv,scoring='f1')
log_scores=cross_validate(log_model,X,y,cv=cv,scoring="f1")
print(" ")

print("===Logistic Regression Model===")
print("\tFit time by fold:", log_scores["fit_time"])
print("\tScore time by fold :",log_scores["score_time"])
print("\tScore by fold:", log_scores["score_time"])
print("\tMean of scores: ",log_scores["score_time"].mean())

print("===Naive Bayes Model===")
print("\tFit time by fold:", nb_scores["fit_time"])
print("\tScore time by fold :",nb_scores["score_time"])
print("\tScore by fold:", nb_scores["score_time"])
print("\tMean of scores: ",nb_scores["score_time"].mean())


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


 
===Logistic Regression Model===
	Fit time by fold: [27.30670953 59.36786914 37.95665884 60.02354431 39.00714755]
	Score time by fold : [0.09246063 0.08748221 0.08915782 0.0947628  0.08222151]
	Score by fold: [0.09246063 0.08748221 0.08915782 0.0947628  0.08222151]
	Mean of scores:  0.08921699523925782
===Naive Bayes Model===
	Fit time by fold: [0.52549553 0.51467371 0.59553051 0.51238537 0.54161119]
	Score time by fold : [0.09839749 0.11704421 0.10127902 0.12599659 0.14350796]
	Score by fold: [0.09839749 0.11704421 0.10127902 0.12599659 0.14350796]
	Mean of scores:  0.11724505424499512


We can notice that the mean of score is very very low in both case. But this is not really a big problem cause it is a little bit normal. In fact, we divided the dataset so we have less data in each group, what make it difficult for the model to be trained very well and have good result on the over part of the data. But globaly our models could be well trained.

Let's see it.

In [15]:
#Just to see what is truly in the log_score and his type.
print(log_scores)

{'fit_time': array([27.30670953, 59.36786914, 37.95665884, 60.02354431, 39.00714755]), 'score_time': array([0.09246063, 0.08748221, 0.08915782, 0.0947628 , 0.08222151]), 'test_score': array([0.78765576, 0.7878557 , 0.78892763, 0.78665638, 0.78735007])}


# Test

Now we will test the models we trained on a new data to see their 'true' performances.

In [14]:
# Test set
df_test=pd.read_csv("/home/christian/ProjetsPerso/IA/MachineLearning/Sentiment_Detection/data/testdata.manual.2009.06.14.csv",
                        encoding='ISO-8859-1',
                        header=None,
                        names=['target','id','date','flag','user','text'])
df_test= df_test[["target","text"]] # we keep only the targets and their texts
df_test = df_test[df_test['target'].isin([0,4])].copy() # we lay the neutral tweet
df_test['target'] = df_test['target'].map({0:0, 4:1}) # Change the 0 and 4 into 0 and 1


#definitions of the features and targets
#very important to have "transform" and not "fit_transform" cause the vacabulary already learned
X_test=vectorizer.transform(df_test["text"])  
y_test=df_test['target'].values

#predictions
log_pred_test=log_model.predict(X_test)
nb_pred_test=nb_model.predict(X_test)

#evaluation
evaluate("Logistic Regression", y_test, log_pred_test)
evaluate("Naive Bayes", y_test, nb_pred_test)


==Logistic Regression==
	 Accuracy score:  0.807799442896936
	 Precision score:  0.8021390374331551
	 Recall score:  0.8241758241758241
	 F1 score :  0.8130081300813008
==Naive Bayes==
	 Accuracy score:  0.7855153203342619
	 Precision score:  0.8387096774193549
	 Recall score:  0.7142857142857143
	 F1 score :  0.771513353115727


# Conclusion

We can notice that the Logistic Regression model have a better score in every metric than the Naive Bayes one. Also despite some very bad scores in the cross validation, the models manage to predict correctly the target in the test.

This project helps to revise principally the task of cleaning the data for learning tasks. We also learn that having bad scores in cross validation doesn't mean everytimes having a bad model.