# [Projet InPoDA - IN304](https://github.com/Egeyae/InPoDA-project-in304) - UVSQ UFR DES SCIENCES
#### *Done by KONSTANTINOV Julien and COSSEC Elouan*

**Goal:** Build a tweet analysis application *(extracting from French tweets: author, hashtags, mentioned users, sentiment, topics)* and perform various data analysis operations

**Table of Contents**:

- Part I: How we extract the tweets from the provided file
- Part II: Different analysis operations performed on the tweets

#### Installation
For the installation process, please follow the guide in the `README.md` found in the project directory.
We recommend using a virtual environment with a `Python 3.12.x` interpreter (or the latest version supported by PyTorch).

In [1]:
### Setup
from InPoDA_Pipeline import *

# The InPoDAPipeline class is the interface used to drive the project
# A logger is set up automatically; to disable logging or to log to a file, update the config.json file
pipeline = InPoDAPipeline()
pipeline.logger.info("Pipeline setup was a success")

GPU available: True
[2024-12-08 00:44:01,547] ::InPoDAPipeline:: (INFO) - Pipeline setup was a success


### I - Tweets data extraction

This part explains in detail how we extract the tweets and parse them into a `pandas.DataFrame`.

##### ***1.** Load the tweets in memory*

In [2]:
tweets = pipeline.load_tweets()

[2024-12-08 00:44:01,550] ::InPoDAPipeline:: (INFO) - Loading tweets...


In [3]:
tweets

##### ***2.** Process the tweets into a pandas.DataFrame*

In [4]:
dataframe = pipeline.process_tweets_to_dataframe(tweets)

[2024-12-08 00:44:01,566] ::InPoDAPipeline:: (INFO) - Processing tweets into a DataFrame...


In [5]:
dataframe.head()

AttributeError: 'NoneType' object has no attribute 'head'
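
The `AttributeError` above means that `dataframe` was `None` when `.head()` was called, so no frame was returned in this run. Independently of that, the extraction step described in this part can be illustrated with a minimal sketch. It is not the project's implementation: the input layout (one JSON object per line), the keys `author_id` and `text`, and the helper names `parse_tweet` and `parse_tweets` are all assumptions made for the example.

```python
import json
import re

# Hypothetical patterns: the project's real extraction rules are not shown in this notebook.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def parse_tweet(raw: dict) -> dict:
    """Turn one raw tweet (assumed keys: 'author_id', 'text') into a flat record."""
    text = raw.get("text", "")
    return {
        "author": raw.get("author_id"),
        "text": text,
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }


def parse_tweets(lines):
    """Parse an iterable of JSON strings, one tweet per line, skipping blank lines."""
    return [parse_tweet(json.loads(line)) for line in lines if line.strip()]


sample = [
    '{"author_id": "42", "text": "Bonjour @UVSQ, vive le #Python et la #data !"}',
    '{"author_id": "7", "text": "Rien à signaler"}',
]
records = parse_tweets(sample)
print(records[0]["hashtags"], records[0]["mentions"])
```

A list of flat dictionaries like `records` can be passed directly to `pandas.DataFrame(records)` to obtain one row per tweet and one column per extracted field.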

##### *(**3.** Annex: Sentiment Analysis)*

For learning purposes, we tried to build our own neural network model, trained to predict the sentiment of a tweet. We used a genetic algorithm to explore solutions, as we were not comfortable with backpropagation. The training dataset is Sentiment140, around 1.6 million tweets annotated for sentiment analysis. We embedded the training tweets with a multilingual model, because the project tweets are in French while the Sentiment140 tweets are in English.

In the following cells, we present the overall pipeline for using and training the model. Training is resource-intensive, so the training code is disabled by default.

PS: The results were too poor for any practical use, so InPoDA currently uses textblob until we find a valid solution. The predictions are far from the expected results: almost all inputs end up in more or less the same category. This is strange, because everything seems fine during training. We suspect an error somewhere between training and model usage, perhaps a CPU/GPU mismatch.
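
To make the genetic algorithm idea concrete, here is a minimal, self-contained sketch that evolves the weights of a tiny linear classifier without backpropagation. It is not the project's training code: the toy 2D data stands in for tweet embeddings, and the population size, mutation rate, and function names (`fitness`, `crossover`, `mutate`, `evolve`) are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy data standing in for tweet embeddings: label 1 if x0 + x1 > 0, else 0.
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
DATA = [((x0, x1), 1 if x0 + x1 > 0 else 0) for x0, x1 in points]


def predict(weights, features):
    """Linear classifier: weights = [w0, w1, bias]."""
    score = weights[0] * features[0] + weights[1] * features[1] + weights[2]
    return 1 if score > 0 else 0


def fitness(weights):
    """Accuracy on the toy data (higher is better)."""
    hits = sum(predict(weights, x) == y for x, y in DATA)
    return hits / len(DATA)


def crossover(a, b):
    """Uniform crossover: each gene comes from one parent at random."""
    return [rng.choice(pair) for pair in zip(a, b)]


def mutate(weights, rate=0.3, scale=0.5):
    """Add Gaussian noise to each gene with probability `rate`."""
    return [w + rng.gauss(0, scale) if rng.random() < rate else w for w in weights]


def evolve(pop_size=30, generations=20, elite=5):
    """Run the selection / crossover / mutation loop and track the best accuracy."""
    population = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:elite]  # elitism: the best individuals survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    best = max(population, key=fitness)
    return best, history + [fitness(best)]


best, history = evolve()
print(f"best accuracy: first generation={history[0]:.2f}, final={history[-1]:.2f}")
```

Because the elite individuals are copied unchanged into the next generation, the best accuracy can never decrease from one generation to the next. Evaluating the saved best individual on the same data, on the same device as training, is one way to check for the kind of training/usage mismatch suspected above.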


###### **a.** Model loading

In [6]:
# Loads the best pre-computed model
pipeline.load_creature()

[2024-12-08 00:44:04,173] ::InPoDAPipeline:: (INFO) - Loading a pre-trained creature...


###### **b.** Model usage

In [7]:
# Example usage of the pre-computed model
test_tweet = "I'm so happy"

pipeline.process_input(test_tweet)

[2024-12-08 00:44:05,757] ::InPoDAPipeline:: (INFO) - Processing input data...


4.0


'positive'
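
The two outputs above are a numeric score (`4.0`) followed by a label (`'positive'`). Sentiment140 encodes polarity as 0 for negative, 2 for neutral, and 4 for positive, so one plausible way to turn a score into a label is a simple threshold around the midpoint. The function below is a hypothetical sketch, not the pipeline's own code; in particular the `neutral_band` parameter is an assumption.

```python
def score_to_label(score: float, neutral_band: float = 1.0) -> str:
    """Map a score on the Sentiment140 scale (0 = negative, 4 = positive) to a label.

    `neutral_band` is a hypothetical half-width around the midpoint 2.0.
    """
    if not 0.0 <= score <= 4.0:
        raise ValueError(f"score out of range: {score}")
    if score < 2.0 - neutral_band:
        return "negative"
    if score > 2.0 + neutral_band:
        return "positive"
    return "neutral"


print(score_to_label(4.0))  # positive
print(score_to_label(0.0))  # negative
print(score_to_label(2.0))  # neutral
```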

###### (**c.** Model training)

In [8]:
# Load the training data as a pandas.DataFrame
# Configuration can be updated in the `config.json` file

pipeline.load_training_data()

[2024-12-08 00:44:06,917] ::InPoDAPipeline:: (INFO) - Loading training data...
[2024-12-08 00:44:57,428] ::InPoDAPipeline:: (INFO) - Loaded training data with 40000 tweets.


In [9]:
# The data is split in two: the first half is negative (sentiment 0)
pipeline.data.head()

| Unnamed: 0 | sentiment | text | embeddings |
|---|---|---|---|
| 0 | 0.0 | - Awww, that's a bummer. You shoulda got David... | [0.0919, -0.11096, 0.0916, 0.09717, 0.1726, 0.... |
| 1 | 0.0 | is upset that he can't update his Facebook by ... | [0.1577, -0.10016, -0.32, 0.2515, 0.2886, 0.27... |
| 2 | 0.0 | I dived many times for the ball. Managed to sa... | [-0.0796, -0.1976, -0.2961, 0.2598, 0.3318, -0... |
| 3 | 0.0 | my whole body feels itchy and like its on fire | [0.04578, 0.006332, 0.24, 0.1765, 0.269, 0.060... |
| 4 | 0.0 | no, it's not behaving at all. i'm mad. why am ... | [-0.132, -0.2452, -0.2404, 0.1582, 0.08405, 0.... |


In [10]:
# Second half is positive (sentiment 4)
pipeline.data.tail()

| Unnamed: 0 | sentiment | text | embeddings |
|---|---|---|---|
| 39995 | 4.0 | Just woke up. Having no school is the best fee... | [0.006783, -0.1874, 0.1539, 0.12006, 0.0657, 0... |
| 39996 | 4.0 | TheWDB.com - Very cool to hear old Walt interv... | [-0.2632, -0.06793, -0.1497, 0.0775, 0.2793, -... |
| 39997 | 4.0 | Are you ready for your MoJo Makeover? Ask me f... | [-0.03998, -0.1318, 0.2385, 0.1422, 0.4578, -0... |
| 39998 | 4.0 | Happy 38th Birthday to my boo of alll time!!! ... | [-0.04913, 0.2195, 0.3545, 0.087, 0.2272, 0.12... |
| 39999 | 4.0 | happy | [0.1354, -0.3357, 0.5337, 0.2334, 0.2068, 0.22... |
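
The `head()` and `tail()` outputs above show that the loaded data is ordered by label: negatives first, positives last. If examples were consumed in that order, each batch would contain a single class. The following sketch shows one way to build a shuffled, class-balanced train/validation split from such data. Whether the pipeline already does something similar is not shown here; the function name `stratified_split` and its parameters are assumptions.

```python
import random


def stratified_split(labels, validation_ratio=0.2, seed=0):
    """Return (train_indices, validation_indices), shuffled and balanced per label."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)

    train, validation = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(len(indices) * validation_ratio)
        validation.extend(indices[:cut])
        train.extend(indices[cut:])

    rng.shuffle(train)       # mix the classes so no batch is single-class
    rng.shuffle(validation)
    return train, validation


# Same layout as the loaded data: negatives (0.0) first, positives (4.0) last.
labels = [0.0] * 10 + [4.0] * 10
train, validation = stratified_split(labels)
print(len(train), len(validation))  # 16 4
```

The returned index lists could then be used with `DataFrame.iloc` to select the corresponding rows.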


In [None]:
# Here we train a model based on the loaded data
# Since it is very expensive, it doesn't run by default
# ! Be aware that you need to update the save file in `config.json` if you don't want to override the pre-trained model !
run = False

if run:
    pipeline.train_genetic_algorithm()

### II - Tweets data analysis

This part presents the different analyses we can run on the provided data.
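
As an illustration of the kind of operation this part covers, here is a minimal sketch that ranks the most frequent hashtags across tweets. The input layout (one list of hashtags per tweet) and the helper name `top_k` are assumptions made for the example, not the project's API.

```python
from collections import Counter


def top_k(values_per_tweet, k=3):
    """Return the k most frequent items across per-tweet lists (case-insensitive)."""
    counts = Counter(item.lower() for items in values_per_tweet for item in items)
    return counts.most_common(k)


hashtags = [["Python", "data"], ["python", "Data"], [], ["IA", "Python"]]
print(top_k(hashtags, k=2))  # [('python', 3), ('data', 2)]
```

The same helper works for any list-valued field, such as mentioned users.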

### III - References

Dataset:
`Sentiment140 dataset with 1.6 million tweets. (2017, September 13). https://www.kaggle.com/datasets/kazanova/sentiment140`

Genetic Algorithm:
`9. Evolutionary computing. (n.d.). https://natureofcode.com/genetic-algorithms/`