# **Emotion Detection Challenge on Twitter**

Imagine sifting through the endless stream of tweets and figuring out the prevailing emotion. In this challenge, you won't just focus on classic positive or negative sentiment. Instead, you'll tackle the more intricate task of identifying four core emotions:


1.   😠 **Anger**     (class 0)
2.   😂 **Joy**       (class 1)
3.   😀 **Optimism** (class 2)
4.   😞 **Sadness**   (class 3)

Your goal? Assign the most dominant emotion to each tweet. Sounds fun, right? Let's see how you handle the nuances of human feelings, all packed into 280 characters!

## **Step 1: Loading the data**

### Library

In [1]:
!pip install nlpaug
!pip install contractions
!pip install wordsegment
!pip install emoji

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.5/410.5 kB 5.3 MB/s eta 0:00:00
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.wh

In [2]:
import pandas as pd
import numpy as np
import os
import plotly.express as px
import matplotlib.pyplot as plt
import random

import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.tag import pos_tag
from string import punctuation
import re

import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_hub as hub

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from collections import Counter
from torch.nn.utils.rnn import pad_sequence
from sklearn.preprocessing import LabelEncoder
import torch.optim as optim

import nlpaug.augmenter.word as naw

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.utils.class_weight import compute_class_weight

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)

set_seed(42)

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Dataset

In [133]:
file_path_train = '/content/drive/MyDrive/Emotional_Detection/train_text.txt'
file_path_valid = '/content/drive/MyDrive/Emotional_Detection/val_text.txt'
file_path_test = '/content/drive/MyDrive/Emotional_Detection/test_text.txt'

## Train
with open(file_path_train, 'r') as file:
    righe = file.readlines()

righe = [riga.strip() for riga in righe]

df_train = pd.DataFrame(righe, columns=['text'])

## Valid
with open(file_path_valid, 'r') as file:
    righe = file.readlines()

righe = [riga.strip() for riga in righe]

df_valid = pd.DataFrame(righe, columns=['text'])

## Test

with open(file_path_test, 'r') as file:
    righe = file.readlines()

righe = [riga.strip() for riga in righe]

df_test = pd.DataFrame(righe, columns=['text'])
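
The three loading blocks above repeat the same read-strip-wrap pattern. As a minimal sketch, they could be folded into one helper. The name `load_text_file` and the explicit `utf-8` encoding are assumptions, not part of the original notebook, and the demo writes a temporary file so that it runs without Google Drive.

```python
import os
import tempfile

import pandas as pd


def load_text_file(path, column="text"):
    """Read one tweet per line into a single-column DataFrame (hypothetical helper)."""
    # Assumption: the files are UTF-8 encoded; the original cell relies on the platform default.
    with open(path, "r", encoding="utf-8") as fh:
        lines = [line.strip() for line in fh]
    return pd.DataFrame(lines, columns=[column])


# Demo on a temporary file standing in for train_text.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_text.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("first tweet \nsecond tweet\n")
    demo_df = load_text_file(demo_path)

print(demo_df)
```

With such a helper, each split would load in a single call, for example `load_text_file(file_path_train)`.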



In [134]:
Y_train = pd.read_csv("/content/drive/MyDrive/Emotional_Detection/train_labels.txt", header=None)
Y_valid = pd.read_csv("/content/drive/MyDrive/Emotional_Detection/val_labels.txt", header=None)

with open(file_path_train, 'r') as file:
    righe = file.readlines()

righe = [riga.strip() for riga in righe]

X_for_graph = pd.DataFrame(righe, columns=['text'])


### Visualization

In [None]:
df_train.tail()

Unnamed: 0,text
3252,I get discouraged because I try for 5 fucking ...
3253,The @user are in contention and hosting @user ...
3254,@user @user @user @user @user as a fellow UP g...
3255,You have a #problem? Yes! Can you do #somethin...
3256,@user @user i will fight this guy! Don't insul...


In [None]:
df_test.tail()

Unnamed: 0,text
1416,I need a sparkling bodysuit . No occasion. Jus...
1417,@user I've finished reading it; simply mind-bl...
1418,shaft abrasions from panties merely shifted to...
1419,All this fake outrage. Y'all need to stop 🤣
1420,Would be ever so grateful if you could record ...


In [None]:
df_train.head()

Unnamed: 0,text
0,“Worry is a down payment on a problem you may ...
1,My roommate: it's okay that we can't spell bec...
2,No but that's so cute. Atsu was probably shy a...
3,Rooneys fucking untouchable isn't he? Been fuc...
4,it's pretty depressing when u hit pan on ur fa...


In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3257 entries, 0 to 3256
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    3257 non-null   object
dtypes: object(1)
memory usage: 25.6+ KB


In [None]:
print("\nChecking for missing values")
df_train.isnull().sum()


Checking for missing values


Unnamed: 0,0
text,0


In [None]:
print("Count of sentiment wise values: \n", Y_train.value_counts())

Count of sentiment wise values: 
 0
0    1400
3     855
1     708
2     294
Name: count, dtype: int64


In [None]:
print(Y_train)

      0
0     2
1     0
2     1
3     0
4     3
...  ..
3252  3
3253  3
3254  0
3255  0
3256  0

[3257 rows x 1 columns]


## **Step 2: Data Analysis & Processing**

### **Data Analysis**

In [None]:
fig=px.histogram(Y_train,
                title="Sentiment Count ",
                color_discrete_sequence=["red"])
fig.update_layout(bargap=0.1)
fig.show()

The histogram shows that the classes are imbalanced: class 2 (optimism) is underrepresented, while class 0 (anger) is the most frequent.
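
To put a number on the imbalance, the sketch below computes the weights that scikit-learn's `"balanced"` heuristic would assign to each class. The counts are copied from the `Y_train.value_counts()` output above; rebuilding a label array from them is only a convenience so the example runs without the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the value_counts() output of the training labels
counts = {0: 1400, 1: 708, 2: 294, 3: 855}
classes = np.array(sorted(counts))

# Rebuild a label array with the same class frequencies
labels = np.repeat(classes, [counts[c] for c in classes])

# "balanced" weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes.tolist(), weights))

for cls, weight in class_weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

The rarest class gets the largest weight, which is what `class_weight="balanced"` applies internally in the scikit-learn models that accept it.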

In [None]:
def text_length(tweet):
    str_len=len(tweet.split(" "))
    return(str_len)

X_for_graph['Length'] = X_for_graph['text'].apply(lambda x:text_length(x))


fig = px.histogram(X_for_graph,
                  x='Length',
                  marginal='box',
                  title="Length of tweets text")
fig.update_layout(bargap=0.1)
fig.show()

The histogram of word counts per tweet shows a single outlier with 58 words, while the remaining tweets range from about 3 to 33 words.

In [None]:
def text_length(tweet):
    str_len = len(tweet.split(" "))
    return str_len

data_train = X_for_graph
data_train['Length'] = data_train['text'].apply(lambda x: text_length(x))

# Combine the dataset with the labels
dataframe_train = pd.concat([data_train, Y_train], axis=1)
dataframe_train.columns = ['Text', 'Length', 'Label']

unique_labels = dataframe_train['Label'].unique()

colors = px.colors.qualitative.Set1

# Create a separate histogram for each class, each with a different color
for i, label in enumerate(unique_labels):
    df_class = dataframe_train[dataframe_train['Label'] == label]

    # Create the histogram for the class
    fig = px.histogram(df_class,
                       x='Length',
                       marginal='box',
                       title=f"Length of tweets text for class {label}",
                       color_discrete_sequence=[colors[i % len(colors)]]
                      )
    fig.update_layout(bargap=0.1)
    fig.show()


We plot a word-count histogram for each class to check whether the emotion expressed affects the length of the text. Angry users seem to write longer texts, while joyful users write slightly shorter ones.
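
The visual impression can be cross-checked with per-class summary statistics. The sketch below uses a small made-up DataFrame (the `toy` data is hypothetical, not taken from the dataset) with the same `Text`/`Label`/`Length` layout as `dataframe_train`. It counts words with `str.split()`, which splits on any whitespace, whereas the notebook uses `split(" ")`.

```python
import pandas as pd

# Hypothetical toy data standing in for dataframe_train
toy = pd.DataFrame({
    "Text": [
        "so angry about this awful service today",
        "what a joy",
        "feeling hopeful about tomorrow",
        "i am sad",
        "this makes me furious every single time",
    ],
    "Label": [0, 1, 2, 3, 0],
})

# Number of whitespace-separated words per tweet
toy["Length"] = toy["Text"].str.split().str.len()

# Summary of tweet length for each emotion class
length_stats = toy.groupby("Label")["Length"].agg(["count", "mean", "median", "max"])
print(length_stats)
```

Running the same `groupby` on `dataframe_train` would show whether the difference between the anger and joy classes is large enough to matter.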

### **Data Processing**

Remove stop words, digits, and punctuation, and lowercase every text in the collection.

In [135]:
from collections import defaultdict

def count_repeated_hashtags(text_lines):
    """
    Extract and count the hashtags that appear at least 5 times in a list of text lines.

    Args:
    text_lines (list of str): List of text lines.

    Returns:
    dict: Hashtags that appear at least 5 times, mapped to their counts.
    """
    hashtag_counter = defaultdict(int)  # Counts the occurrences of each hashtag

    # Iterate over the text lines
    for line in text_lines:
        # Find all hashtags in the line
        hashtags = re.findall(r'#\w+', line)
        # Update the count of each hashtag found
        for hashtag in hashtags:
            hashtag_counter[hashtag] += 1

    # Keep only the hashtags that appear at least 5 times
    frequent_hashtags = {hashtag: count for hashtag, count in hashtag_counter.items() if count >= 5}

    return frequent_hashtags


# Call the function
repeated_hashtags = count_repeated_hashtags(df_train["text"])

# Print the hashtags that appear at least 5 times
print(repeated_hashtags)


{'#leadership': 5, '#worry': 9, '#terrible': 26, '#angry': 32, '#horrible': 25, '#joke': 5, '#terror': 21, '#bully': 21, '#rage': 24, '#glee': 6, '#sad': 49, '#GBBO': 25, '#hilarious': 24, '#lost': 24, '#depression': 41, '#blues': 29, '#rock': 5, '#music': 8, '#horror': 17, '#MHChat': 6, '#awful': 31, '#anxiety': 27, '#nightmare': 27, '#terrorism': 33, '#sadness': 38, '#unhappy': 6, '#smile': 12, '#snap': 9, '#dark': 10, '#afraid': 11, '#bitter': 18, '#serious': 15, '#fear': 33, '#optimism': 10, '#quote': 12, '#fuming': 30, '#anger': 24, '#musically': 27, '#war': 5, '#shocking': 28, '#revenge': 14, '#happy': 11, '#nervous': 12, '#cry': 6, '#life': 9, '#funny': 8, '#Trump': 12, '#Hillary': 5, '#Charlotte': 5, '#UNGA': 6, '#India': 7, '#Pakistan': 13, '#fuck': 5, '#pun': 5, '#punny': 5, '#lol': 11, '#mufc': 5, '#relentless': 5, '#panic': 12, '#love': 9, '#outrage': 13, '#bb18': 9, '#restless': 8, '#offended': 11, '#faith': 7, '#BB18': 6, '#racism': 5, '#grim': 5, '#CharlotteProtest': 7, 

In [136]:
import string

def handle_negations(text):
    tokens = text.split()
    negations = {"not", "no", "never", "n't"}
    result = []
    negate = False
    for token in tokens:
        lower_token = token.lower()
        if negate and token not in string.punctuation:
            result.append(token + "_NEG")
        else:
            result.append(token)
        negate = lower_token in negations
    return ' '.join(result)
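
A short usage sketch shows what the negation marker does in practice. The function is repeated from the cell above, lightly condensed, so that the example runs on its own; the two input sentences are made up for illustration.

```python
import string


def handle_negations(text):
    # Same logic as the notebook cell, repeated so the sketch is self-contained
    negations = {"not", "no", "never", "n't"}
    result, negate = [], False
    for token in text.split():
        if negate and token not in string.punctuation:
            result.append(token + "_NEG")
        else:
            result.append(token)
        # Only the token right after a negation word gets marked
        negate = token.lower() in negations
    return " ".join(result)


marked = handle_negations("I do not like this movie")
attached = handle_negations("I didn't like it")

print(marked)
print(attached)
```

Only the single token after `not` is tagged, and a contraction such as `didn't` is not recognised as a negation because it never appears as a standalone `n't` token. This is presumably why `normalize_text` expands contractions before calling `handle_negations`.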

In [137]:
import re
import spacy
import html
from nltk.corpus import stopwords
from contractions import fix
import wordsegment
wordsegment.load()
import emoji
import unicodedata

# Initialize spaCy model
nlp = spacy.load('en_core_web_sm')

# Define stopwords
mystopwords = set(stopwords.words("english"))
negation_words = {"not", "no", "nor", "never", "n't"}
additional_words_to_keep = {"but", "against", "without", "won", "don't", "can't", "couldn't"}
words_to_keep = negation_words.union(additional_words_to_keep)

# Remove these words from the stopword list
mystopwords = mystopwords - words_to_keep


def normalize_text(text):
    """
    Removes URLs, processes mentions and hashtags, handles HTML entities,
    expands contractions, replaces slang, handles emojis, and converts everything to lowercase.
    """
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Replace HTML entities with their corresponding characters
    text = html.unescape(text)

    # Remove isolated '?' and '!' but keep repeated ones (e.g. '??', '!!')
    text = re.sub(r'(?<!\?)\?(?!\?)|(?<!\!)\!(?!\!)', '', text)


    # Normalize unicode characters
    text = unicodedata.normalize('NFKD', text)

    # Expand contractions
    text = fix(text)

    # Handle negations by appending _NEG to the following word
    text = handle_negations(text)

    important_hashtags = repeated_hashtags
    text = re.sub(r'#(\w+)', lambda m: m.group(0) if m.group(0) in important_hashtags else ' '.join(wordsegment.segment(m.group(0)[1:])), text)

    # Replace mentions with placeholder
    text = re.sub(r'@\w+', '<USER>', text)

    # Replace slang terms
    slang_dict = {
        "u": "you",
        "bcuz": "because",
        "gonna": "going to",
        "prolly": "probably",
        "tho": "though",
        "tbh": "to be honest",
        "idk": "I do not know",
        "im": "I am",
        "cant": "cannot",
        "wanna": "want to",
        "gimme": "give me",
        "gotta": "got to",
        "kinda": "kind of",
        "luv": "love",
        "yall": "you all",
        "ya": "you",
        "dunno": "do not know",
        "btw": "by the way",
        "thx": "thanks",
        "omg": "oh my god",

    }
    def replace_slang(text):
        tokens = text.split()
        tokens = [slang_dict.get(token.lower(), token) for token in tokens]
        return ' '.join(tokens)

    text = replace_slang(text)

    # Handle emojis: convert emojis to text
    text = emoji.demojize(text, delimiters=(' ', ' '))

    # Remove punctuation (except for emoji descriptions)
    text = re.sub(r'[^\w\s_]', '', text)

    # Remove extra whitespace and newlines
    text = text.strip()
    text = re.sub(r'\s+', ' ', text)

    return text.lower()

def preprocess_text(text):
    """
    Performs text preprocessing with normalization, stopword removal,
    and lemmatization.
    """
    text = normalize_text(text)

    # Process text with spaCy
    doc = nlp(text)

    # Tokenization and lemmatization
    tokens = [
        token.lemma_ for token in doc
        if token.text.lower() not in mystopwords and not token.is_punct and not token.like_num
    ]

    return ' '.join(tokens)

#### Data augmentation

In [138]:
synonym_aug = naw.SynonymAug(aug_src='wordnet')

def augment_dataset(df, label_column, text_column, augmenter, n=1):
    augmented_texts = []
    augmented_labels = []

    for index, row in df.iterrows():
        text = row[text_column]
        label = row[label_column]

        augmented_versions = augment_text(text, augmenter, n)
        augmented_texts.extend(augmented_versions)
        augmented_labels.extend([label] * len(augmented_versions))

    augmented_df = pd.DataFrame({
        text_column: augmented_texts,
        label_column: augmented_labels
    })

    return augmented_df

# Generate augmented versions of a single text
def augment_text(text, augmenter, n=1):
    # Generate n augmented versions of the text
    augmented_texts = augmenter.augment(text, n=n)

    # Make sure augmented_texts is a list of strings
    if isinstance(augmented_texts, str):
        augmented_texts = [augmented_texts]
    else:
        # Lists and any other iterable: convert every element to a string
        augmented_texts = [str(t) for t in augmented_texts]

    return augmented_texts


df_train = pd.concat([df_train, Y_train], axis=1)
df_train.columns = ['processed_text', "label"]

# Apply augmentation only to the minority classes
class_counts = df_train['label'].value_counts()
max_count = class_counts.max()
minority_classes = class_counts[class_counts < max_count].index.tolist()  # Excludes the majority class
df_minority = df_train[df_train['label'].isin(minority_classes)]

# Apply the augmentation
augmented_df = augment_dataset(df_minority, 'label', 'processed_text', synonym_aug, n=1)

# Merge the original dataset with the augmented one
df_train_augmented = pd.concat([df_train, augmented_df]).reset_index(drop=True)
print(df_train_augmented['label'].value_counts())

label
3    1710
1    1416
0    1400
2     588
Name: count, dtype: int64


#### Applying the preprocessing

In [139]:
def preprocess_corpus(texts):
    return [preprocess_text(text) for text in texts]


df_train_augmented['processed_text'] = df_train_augmented['processed_text'].apply(preprocess_text)
#df_train.drop('text', axis=1, inplace=True)

df_valid['processed_text'] = df_valid['text'].apply(preprocess_text)
df_valid.drop('text', axis=1, inplace=True)

df_test['processed_text'] = df_test['text'].apply(preprocess_text)
df_test.drop('text', axis=1, inplace=True)

#df_train = pd.concat([df_train, Y_train], axis=1)
#df_train.columns = ['processed_text', 'label']

In [140]:
df_train.tail()

Unnamed: 0,processed_text,label
3252,I get discouraged because I try for 5 fucking ...,3
3253,The @user are in contention and hosting @user ...,3
3254,@user @user @user @user @user as a fellow UP g...,0
3255,You have a #problem? Yes! Can you do #somethin...,0
3256,@user @user i will fight this guy! Don't insul...,0


## **Step 3: Model**

In [141]:
df_valid['label'] = Y_valid
# Train on the augmented, preprocessed tweets: df_train still holds the raw text
df_combined = pd.concat([df_train_augmented, df_valid], axis=0).reset_index(drop=True)
X = df_combined['processed_text']
y = df_combined['label'].astype(int)

In [108]:
print(df_combined['label'].value_counts())

label
3    2565
1    2124
0    1400
2     882
Name: count, dtype: int64


Vectorization

### TFIDF

In [142]:
X_test = df_test['processed_text']

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X_train_vect = vectorizer.fit_transform(X)
X_test_vect = vectorizer.transform(X_test)


### Word2Vec

In [39]:
"""
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

X_test = df_test['processed_text']

# Assuming X and X_test are pandas Series containing your processed text
X_tokenized = X.apply(lambda x: simple_preprocess(x))
X_test_tokenized = X_test.apply(lambda x: simple_preprocess(x))

# Combine all tokenized documents for training
sentences = X_tokenized.tolist()

# Train the model
w2v_model = Word2Vec(
    sentences=sentences,
    vector_size=100,    # Dimensionality of the word vectors
    window=5,           # Maximum distance between the current and predicted word
    min_count=1,        # Ignores all words with total frequency lower than this
    workers=4,          # Number of worker threads to train the model
    sg=1                # Use skip-gram; set to 0 for CBOW
)

def document_vector(doc):
    # Filter out words that are not in the vocabulary
    doc = [word for word in doc if word in w2v_model.wv.key_to_index]
    # If the document is empty after filtering, return a zero vector
    if not doc:
        return np.zeros(w2v_model.vector_size)
    # Compute the mean of word vectors
    return np.mean(w2v_model.wv[doc], axis=0)

X_train_vect = X_tokenized.apply(document_vector)
X_test_vect = X_test_tokenized.apply(document_vector)

X_train_vect = np.stack(X_train_vect.values)
X_test_vect = np.stack(X_test_vect.values)
"""

### Pre trained embedding

In [47]:
"""
from transformers import AutoTokenizer, AutoModel
X_test = df_test['processed_text']
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
def get_document_embedding(text):
    # Tokenization
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    # Get the last hidden states from the model
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the embedding of the [CLS] token (the first token)
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()
    return embeddings.squeeze()
# Apply the function to the training data
X_train_vect = X.apply(get_document_embedding)

# Apply the function to the test data
X_test_vect = X_test.apply(get_document_embedding)
X_train_vect = np.vstack(X_train_vect.values)
X_test_vect = np.vstack(X_test_vect.values)
"""

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Logistic regression

In [143]:
# Train the logistic regression model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000, class_weight="balanced", )
model.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_LR.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Predictions saved to 'test_predictions.csv'
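
Because the notebook merges the training and validation sets before fitting, no held-out score is reported for any model. A minimal sketch of how one could be obtained is shown below: fit on the training split only and score on the validation split. The toy sentences are hypothetical, and macro-averaged F1 is an assumption about the metric (the notebook imports `f1_score` but never calls it).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two training tweets per class, one validation tweet per class
train_texts = [
    "i am so angry and furious", "this makes me furious",
    "what a joyful happy day", "so happy and full of joy",
    "hopeful that things get better", "better days ahead stay hopeful",
    "i feel sad and lonely", "such a sad lonely night",
]
train_labels = [0, 0, 1, 1, 2, 2, 3, 3]
valid_texts = ["furious about this", "happy day", "stay hopeful", "sad night"]
valid_labels = [0, 1, 2, 3]

# The pipeline fits the vectorizer on the training split only,
# so no validation vocabulary leaks into the features.
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(train_texts, train_labels)

valid_pred = clf.predict(valid_texts)
macro_f1 = f1_score(valid_labels, valid_pred, average="macro")
print(f"validation macro-F1: {macro_f1:.3f}")
```

Once a configuration has been chosen this way, refitting on the combined training and validation data, as the notebook does, is a reasonable final step before predicting on the test set.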


### Multinomial Naive Bayes

In [117]:
from sklearn.naive_bayes import MultinomialNB

# Train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = nb_model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_NB.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Predictions saved to 'test_predictions.csv'


### XGB

In [None]:
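import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Train the XGBoost model.
# XGBClassifier ignores class_weight and use_label_encoder, so the classes
# are balanced through per-sample weights instead.
xgb_model = XGBClassifier(eval_metric='mlogloss')
xgb_model.fit(X_train_vect, y, sample_weight=compute_sample_weight("balanced", y))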

# Make predictions on the test set
y_pred = xgb_model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_XGB.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")


Parameters: { "class_weight", "use_label_encoder" } are not used.



Predictions saved to 'test_predictions.csv'


### Random Forest

In [118]:
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight="balanced")
rf_model.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = rf_model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_RF.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")


Predictions saved to 'test_predictions.csv'


### Support Vector Machine

In [None]:
from sklearn.svm import SVC

# Train the SVM model
svm_model = SVC(kernel='linear', probability=True, class_weight="balanced")
svm_model.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = svm_model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_SVM.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Predictions saved to 'test_predictions.csv'


### Multi layer perceptron

In [None]:
from sklearn.neural_network import MLPClassifier

# Train the MLP model
mlp_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42)
mlp_model.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = mlp_model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_MLP.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Predictions saved to 'test_predictions.csv'


### LightGBM

In [None]:
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(class_weight="balanced")
lgb_model.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = lgb_model.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_LGB.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.027976 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8530
[LightGBM] [Info] Number of data points in the train set: 5488, number of used features: 448
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
[LightGBM] [Info] Start training from score -1.386294
Predictions saved to 'test_predictions.csv'


### AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
model_ada = AdaBoostClassifier(n_estimators=100, random_state=42)
model_ada.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = model_ada.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_ADA.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")



Predictions saved to 'test_predictions.csv'


### Passive Aggressive

In [52]:
from sklearn.linear_model import PassiveAggressiveClassifier
model_PA = PassiveAggressiveClassifier(max_iter=1000, random_state=42,class_weight="balanced")
model_PA.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = model_PA.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_PA.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Predictions saved to 'test_predictions.csv'


### Extra Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model_ETC = ExtraTreesClassifier(n_estimators=100, random_state=42,class_weight="balanced")
model_ETC.fit(X_train_vect, y)

# Make predictions on the test set
y_pred = model_ETC.predict(X_test_vect)

# Save the predictions to a CSV file
df_predictions = pd.DataFrame(y_pred, columns=['label'])
df_predictions.to_csv('previsioni_migliori_ETC.csv', index=False)

print("Predictions saved to 'test_predictions.csv'")

Predictions saved to 'test_predictions.csv'


## Voting

### Merge predictions into a single dataframe

In [144]:
lr = pd.read_csv("/content/previsioni_migliori_LR.csv")
#nb = pd.read_csv("/content/previsioni_migliori_NB.csv")
#xgb = pd.read_csv("/content/previsioni_migliori_XGB.csv")
#rf = pd.read_csv("/content/previsioni_migliori_RF.csv")
#svm = pd.read_csv("/content/previsioni_migliori_SVM.csv")
#mlp = pd.read_csv("/content/previsioni_migliori_MLP.csv")
#lgb = pd.read_csv("/content/previsioni_migliori_LGB.csv")
#ada = pd.read_csv("/content/previsioni_migliori_ADA.csv")
#pa = pd.read_csv("/content/previsioni_migliori_PA.csv")
#etc = pd.read_csv("/content/previsioni_migliori_ETC.csv")


lr.rename(columns={'label': 'LR'}, inplace=True)
#nb.rename(columns={'label': 'NB'}, inplace=True)
#xgb.rename(columns={'label': 'XGB'}, inplace=True)
#rf.rename(columns={'label': 'RF'}, inplace=True)
#svm.rename(columns={'label': 'SVM'}, inplace=True)
#mlp.rename(columns={'label': 'MLP'}, inplace=True)
#lgb.rename(columns={'label': 'LGB'}, inplace=True)
#ada.rename(columns={'label': 'ADA'}, inplace=True)
#pa.rename(columns={'label': 'PA'}, inplace=True)
#etc.rename(columns={'label': 'ETC'}, inplace=True)

final_df = pd.concat([lr], axis=1)
final_df.to_csv('merged_labels.csv', index=False)

### make the final prediction

In [145]:
labels = final_df.mode(axis=1).iloc[:, 0].astype(int)
output_df = pd.DataFrame({'label': labels})

output_df.to_csv('previsioni_finali_lr_terminato.csv', index=False)

print("Predictions saved to 'test_predictions.csv, daje'")

Predictions saved to 'test_predictions.csv, daje'
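
With only the logistic regression column active, the row-wise mode simply returns that model's predictions. The sketch below shows how the same `mode(axis=1)` call behaves as a majority vote once several model columns are merged. The `votes` table is made up for illustration and does not come from the notebook's models.

```python
import pandas as pd

# Hypothetical predictions from three models for five tweets
votes = pd.DataFrame({
    "LR":  [0, 1, 2, 3, 0],
    "SVM": [0, 1, 3, 3, 1],
    "RF":  [0, 2, 3, 1, 2],
})

# Row-wise mode: the most frequent label per tweet.
# When several labels tie, mode() lists them in ascending order,
# so taking the first column picks the smallest tied label.
majority = votes.mode(axis=1).iloc[:, 0].astype(int)
print(majority.tolist())
```

In the last row all three models disagree, so the vote falls back to the smallest label, which in this challenge is class 0 (anger). An odd number of voters, or a tie-break that follows the strongest single model, may be a better choice than relying on this default.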
