<a href="https://colab.research.google.com/github/DiegoFleury/MI201-groupe-4/blob/main/notebooks/preProcessingAndEDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing and Exploratory Data Analysis
## MI201

## **Group 4** :
- Diego FLEURY CORRÊA DE MORAES
- Hazael SOLEDADE DE ARAUJO JUMONJI
- Lucas DE OLIVEIRA MARTIM

### Project 3 : **Sentiment Analysis Using LLMs**

### Introduction

This notebook handles the basic preprocessing and the Exploratory Data Analysis (EDA) of the project's dataset. The primary goal is to curate, investigate and visualize the data in order to uncover patterns, handle missing values, identify potential anomalies, and understand the relationships between features. These insights should give us a solid foundation for the subsequent processing, feature engineering, and modeling stages of the project.

## Preprocessing


Let's begin by importing the Kaggle dataset:

In [2]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
path = kagglehub.dataset_download('abhi8923shriv/sentiment-analysis-dataset')

print('Data source import complete.')


Data source import complete.


Importing libraries (visualization, NLP, data handling, model selection) :

In [3]:
# --------------- MAIN LIBRARIES ------------------

import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split


# --------------- HELPING LIBRARIES ----------------
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

In [4]:
# Build the file paths portably instead of concatenating strings
train_dataset = os.path.join(path, 'train.csv')
test_dataset = os.path.join(path, 'test.csv')

# Check if the paths exist
print(os.path.exists(train_dataset))
print(os.path.exists(test_dataset))

True
True


In [5]:
# Load the CSV file into a DataFrame
train_df = pd.read_csv(train_dataset, encoding='ISO-8859-1')
test_df = pd.read_csv(test_dataset, encoding='ISO-8859-1')

Let's analyse a dataset sample:

In [6]:
train_df.head()

| | textID | text | selected_text | sentiment | Time of Tweet | Age of User | Country | Population -2020 | Land Area (Km²) | Density (P/Km²) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cb774db0d1 | I`d have responded, if I were going | I`d have responded, if I were going | neutral | morning | 0-20 | Afghanistan | 38928346 | 652860.0 | 60 |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | Sooo SAD | negative | noon | 21-30 | Albania | 2877797 | 27400.0 | 105 |
| 2 | 088c60f138 | my boss is bullying me... | bullying me | negative | night | 31-45 | Algeria | 43851044 | 2381740.0 | 18 |
| 3 | 9642c003ef | what interview! leave me alone | leave me alone | negative | morning | 46-60 | Andorra | 77265 | 470.0 | 164 |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t... | Sons of ****, | negative | noon | 60-70 | Angola | 32866272 | 1246700.0 | 26 |


Quoting the Kaggle dataset description (available [here](https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset/data)) :

"*There's a story behind every dataset and here's your opportunity to share yours.
training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative.*"

Note that the labels were assigned automatically (a form of distant supervision), with the emoticon present in the tweet as the labeling heuristic. This already biases our data, since some groups (by age, sex, country, etc.) are likely to use these markers more than the population as a whole. A good starting point is therefore to check the label/class balance.

... **but** first, let's look at the data.

In [12]:
print("Train DataFrame Info:")
train_df.info()
print("\n" + "-" * 50 + "\n")

print("Test DataFrame Info:")
test_df.info()

Train DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   textID            27481 non-null  object 
 1   text              27480 non-null  object 
 2   selected_text     27480 non-null  object 
 3   sentiment         27481 non-null  object 
 4   Time of Tweet     27481 non-null  object 
 5   Age of User       27481 non-null  object 
 6   Country           27481 non-null  object 
 7   Population -2020  27481 non-null  int64  
 8   Land Area (Km²)   27481 non-null  float64
 9   Density (P/Km²)   27481 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 2.1+ MB

--------------------------------------------------

Test DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4815 entries, 0 to 4814
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------     

**A few key points**:

- The field `selected_text` isn't present in the test dataset (10 columns in train, 9 in test), so it's probably best to discard it: we would not be able to use it when estimating the models' generalization error.

- We are then left with 9 columns, 6 of which are of `Dtype=object` (treated here as categorical) and 3 of which are numerical. One of them (`sentiment`) is the label, leaving us with 5 categorical features and 3 numerical ones.

- There seem to be very few missing values in the training dataset (one each in `text` and `selected_text`), but for some reason many of them in the test data. This needs investigating.
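
These points can also be checked programmatically. Below is a minimal sketch, run on two tiny made-up frames (`toy_train` and `toy_test` are illustrative stand-ins, not the real data); in this notebook one would pass `train_df` and `test_df` instead. The helper name `compare_schemas` is hypothetical.

```python
import pandas as pd

def compare_schemas(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Summarize column differences and missing values for two frames."""
    return {
        # Columns present in one frame but absent from the other
        "only_in_train": sorted(set(train.columns) - set(test.columns)),
        "only_in_test": sorted(set(test.columns) - set(train.columns)),
        # Missing cells per column, keeping only columns that have gaps
        "train_missing": {c: int(n) for c, n in train.isna().sum().items() if n > 0},
        "test_missing": {c: int(n) for c, n in test.isna().sum().items() if n > 0},
        # Rows in which every single cell is missing
        "test_empty_rows": int(test.isna().all(axis=1).sum()),
    }

# Made-up data, only meant to exercise the helper
toy_train = pd.DataFrame({
    "text": ["good day", None, "so sad"],
    "selected_text": ["good", None, "sad"],
    "sentiment": ["positive", "neutral", "negative"],
})
toy_test = pd.DataFrame({
    "text": ["fine", None],
    "sentiment": ["neutral", None],
})

report = compare_schemas(toy_train, toy_test)
print(report)
```

Counting fully empty rows separately may help tell apart a few scattered gaps from whole blank lines in the CSV file, which is one possible (unverified) explanation for many nulls in the test data.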

### Non-Text preprocessing

Handling nulls & imputation (label distribution)
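
One possible way to approach this step is sketched below, assuming that rows with a missing `text` are simply dropped rather than imputed (an assumption, not a decision taken in this notebook). The helpers `drop_missing_text` and `label_balance` are hypothetical names, and the frame `toy` is made up for illustration.

```python
import pandas as pd

def drop_missing_text(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Drop rows whose text is missing and rebuild a clean 0..n-1 index."""
    return df.dropna(subset=[text_col]).reset_index(drop=True)

def label_balance(df: pd.DataFrame, label_col: str = "sentiment") -> pd.DataFrame:
    """Return the absolute count and relative share of each label."""
    counts = df[label_col].value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "share": (counts / counts.sum()).round(3)})

# Made-up data: one row has no text and should be removed
toy = pd.DataFrame({
    "text": ["ok", "bad day", None, "awful", "great"],
    "sentiment": ["neutral", "negative", "neutral", "negative", "positive"],
})

clean = drop_missing_text(toy)
balance = label_balance(clean)
print(balance)
```

With the real data, `label_balance(train_df)` would give the class balance mentioned earlier as a starting point.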

Categorical data preprocessing
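
Since `Time of Tweet` and `Age of User` have a natural order, an ordinal encoding is one option. The sketch below assumes the category lists seen in the five sample rows shown earlier; the full dataset may contain more levels, so `TIME_ORDER`, `AGE_ORDER` and the helper `encode_ordinal` are illustrative assumptions.

```python
import pandas as pd

# Assumed orderings, taken from the sample rows only (possibly incomplete)
TIME_ORDER = ["morning", "noon", "night"]
AGE_ORDER = ["0-20", "21-30", "31-45", "46-60", "60-70"]

def encode_ordinal(series: pd.Series, order: list) -> pd.Series:
    """Map each category to its rank in `order`; unseen values become -1."""
    cat = pd.Categorical(series, categories=order, ordered=True)
    return pd.Series(cat.codes, index=series.index, name=series.name)

# Made-up data, including one value that is not in AGE_ORDER
toy = pd.DataFrame({
    "Time of Tweet": ["morning", "noon", "night"],
    "Age of User": ["0-20", "60-70", "unknown"],
})
toy["time_code"] = encode_ordinal(toy["Time of Tweet"], TIME_ORDER)
toy["age_code"] = encode_ordinal(toy["Age of User"], AGE_ORDER)
print(toy)
```

Mapping unseen values to -1 makes incomplete category lists easy to spot before modeling.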

Numerical data preprocessing
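
The sample rows suggest that the numerical columns span several orders of magnitude (population goes from 77265 to 43851044). One common option, sketched here as an assumption rather than the chosen method, is a log transform followed by standardization with statistics computed on the training split only. The helper `log_standardize` is a hypothetical name; the values are the populations from the sample rows.

```python
import numpy as np
import pandas as pd

def log_standardize(train_col: pd.Series, other_col: pd.Series):
    """Apply log1p, then z-score both columns with the training mean and std."""
    train_log = np.log1p(train_col.astype(float))
    other_log = np.log1p(other_col.astype(float))
    # Statistics come from the training column only, to avoid leaking test data
    mean, std = train_log.mean(), train_log.std(ddof=0)
    return (train_log - mean) / std, (other_log - mean) / std

toy_train_pop = pd.Series([38928346, 2877797, 43851044, 77265, 32866272])
toy_test_pop = pd.Series([2877797, 77265])

train_z, test_z = log_standardize(toy_train_pop, toy_test_pop)
print(train_z.round(2).tolist())
print(test_z.round(2).tolist())
```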

### Text preprocessing

Links, tags
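
A minimal sketch of this step, using only the standard library. It assumes that "tags" covers HTML tags, `@mentions` and the `#` of hashtags; the patterns and the helper `strip_links_and_tags` are illustrative, not a fixed specification.

```python
import re

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Require a letter after '<' so that emoticons such as "<3" are left alone
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
MENTION_RE = re.compile(r"@\w+")

def strip_links_and_tags(text: str) -> str:
    """Remove URLs, HTML tags and @mentions; keep hashtag words without '#'."""
    text = URL_RE.sub(" ", text)
    text = HTML_TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return text.replace("#", "")

example = "@bob check <b>this</b> out http://example.com/x #fun"
cleaned = strip_links_and_tags(example)
print(cleaned.split())
```

Each match is replaced by a space rather than an empty string so that neighbouring words are not glued together; the extra spaces are left for the whitespace step.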

Whitespace, formatting
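
The sample rows show backticks used in place of apostrophes (`I`d`, `couldn`t`). A minimal sketch of a normalization for this, together with whitespace collapsing, could look as follows (`normalize_formatting` is a hypothetical helper name):

```python
import re

def normalize_formatting(text: str) -> str:
    """Replace backtick apostrophes and collapse runs of whitespace."""
    text = text.replace("`", "'")
    # \s+ matches spaces, tabs and newlines alike
    return re.sub(r"\s+", " ", text).strip()

normalized = normalize_formatting("  I`d have   responded,\tif I were going \n")
print(normalized)
```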

Contractions, stop words, numbers, case
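
A minimal sketch of this step is given below. The tables `CONTRACTIONS` and `STOP_WORDS` are tiny illustrative samples (a real run would rely on fuller lists, for instance the NLTK stop words imported above), and the helper `normalize_tokens` is a hypothetical name. It assumes apostrophes have already been normalized.

```python
import re

# Tiny illustrative tables, not exhaustive; "i'd" can also mean "i had"
CONTRACTIONS = {
    "couldn't": "could not",
    "don't": "do not",
    "it's": "it is",
    "i'd": "i would",
}
STOP_WORDS = {"i", "a", "the", "it", "is", "to", "in", "you", "me", "my"}

def normalize_tokens(text: str) -> list:
    """Lowercase, expand contractions, drop digits and stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        # Word boundaries avoid rewriting the inside of longer words
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\d+", " ", text)
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]

tokens = normalize_tokens("I couldn't sleep in 2 days, it's SO sad")
print(tokens)
```

Negations such as "not" are deliberately kept here because they carry sentiment. NLTK's English stop word list includes "not", so it may be worth filtering that list before applying it.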

BERT embeddings and dataloader

## EDA

Look at the following distributions:

- Age
- Time of tweet
- Country, population, pop. density

Look for correlations among the features, and between each feature and the label.
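
For a categorical feature and a categorical label, one simple starting point is a row-normalized contingency table. The sketch below uses made-up data and a hypothetical helper `label_share_by`; with the real data one would pass `train_df` and a column such as `Time of Tweet` or `Age of User`.

```python
import pandas as pd

def label_share_by(df: pd.DataFrame, feature: str, label_col: str = "sentiment") -> pd.DataFrame:
    """Share of each label within every value of `feature` (rows sum to 1)."""
    return pd.crosstab(df[feature], df[label_col], normalize="index")

# Made-up data, for illustration only
toy = pd.DataFrame({
    "Time of Tweet": ["morning", "morning", "noon", "noon", "night", "night"],
    "sentiment": ["neutral", "negative", "negative", "negative", "positive", "neutral"],
})

shares = label_share_by(toy, "Time of Tweet")
print(shares)
```

If the rows of this table look nearly identical, the feature probably tells us little about the label; large differences between rows are worth a closer look.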

Compare TF-IDF, Word2Vec and BERT embeddings (cluster and visualize words)

## Final pipeline