# [Projet InPoDA - IN304](https://github.com/Egeyae/InPoDA-project-in304) - UVSQ UFR DES SCIENCES
#### *Done by KONSTANTINOV Julien (22301776) and COSSEC Elouan (22300813)*
---
**Goal:** Build a tweet analysis application *(extracting from French tweets: author, hashtags, mentioned users, sentiment, topics)* and perform various data analysis operations on the result.

**Table of Contents**:

    - Part I: How we extract the tweets from the provided file
    - Part II: Different analysis operations performed on the tweets
    - Part III: Some references used for the project

**Diagram**:

![title](diagram/InPoDA_Diagram.drawio.png)

### Installation
---
For the installation process, please follow the guide in the `README.md` found in the project directory.
It is recommended to use a virtual environment with a `Python 3.12.x` interpreter (or the latest version supported by PyTorch).

---
After installing the environment, you can run the following Jupyter Notebook cells.

In [None]:
### Setup
from InPoDA_Pipeline import *

# The InPoDAPipeline class is the interface to the project
# A logger is set up automatically; to disable logging or to log to a file, update the config.json file
pipeline = InPoDAPipeline()
pipeline.logger.info("Pipeline setup was a success")

## I - Tweets data extraction
---
This part explains in detail how we extract the tweets and parse them into a `pandas.DataFrame`.

##### ***1.** Load the tweets in memory*

In [None]:
tweets = pipeline.load_tweets()

In [None]:
pretty_dict_display(tweets)

##### ***2.** Process the tweets in a pandas.DataFrame*

By default, the smaller model is used for topic classification. It is faster to run but can give worse topic identification. To run the bigger model, update `config.json` and change _`"topic_model":"small"`_ to _`"topic_model":"big"`_.
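The same change can be made programmatically. This is a minimal sketch, assuming `config.json` is a flat JSON object that contains the `"topic_model"` key described above; the helper name `set_topic_model` is hypothetical and not part of InPoDA.

```python
import json
from pathlib import Path


def set_topic_model(config_path, size):
    """Set the "topic_model" entry of a JSON config file to "small" or "big"."""
    if size not in ("small", "big"):
        raise ValueError("size must be 'small' or 'big'")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["topic_model"] = size
    # Write the file back, leaving every other key untouched
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

A call such as `set_topic_model("config.json", "big")` would presumably need to happen before `InPoDAPipeline()` is created, if the configuration is read at construction time.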

In [None]:
# Perform data extraction on the loaded tweets
dataframe = pipeline.process_tweets_to_dataframe()

In [None]:
dataframe
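To give an idea of what the extraction step involves, here is a minimal sketch of pulling hashtags and mentions out of a tweet with regular expressions. It is not the pipeline's actual implementation: the function `extract_entities` and the field names `"author_id"` and `"text"` are assumptions made for illustration.

```python
import re

# \w is Unicode-aware for str patterns in Python 3, so accented French letters are matched
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_entities(tweet):
    """Return the author, hashtags and mentions of one tweet.

    `tweet` is assumed to be a dict with the hypothetical keys "author_id" and "text".
    """
    text = tweet.get("text", "")
    return {
        "author": tweet.get("author_id"),
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }


sample = {"author_id": "42", "text": "Merci @alice pour ce weekend à #Paris #JO2024 !"}
row = extract_entities(sample)
```

A list of such rows can be passed to `pandas.DataFrame(rows)` to obtain one row per tweet.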

##### *(**3.** Annex: Sentiment Analysis)*

For learning purposes, we tried to build our own neural network model, trained to predict the sentiment of a tweet. We used a genetic algorithm to explore solutions, as we were not comfortable with backpropagation. The training dataset is Sentiment140, around 1.6 million tweets annotated for sentiment analysis. Because the project tweets are in French and Sentiment140's are in English, we embedded the training tweets with a multilingual model.
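The general idea of a genetic algorithm can be sketched on a toy problem. The code below is an illustration only, not the project's model: it evolves the weights of a small linear classifier instead of a neural network, and every name and parameter value (`fitness`, `evolve`, `pop_size`, `mutation_std`, ...) is an assumption chosen for the example.

```python
import random


def fitness(weights, samples):
    """Fraction of samples classified correctly by a linear model (last weight is the bias)."""
    correct = 0
    for features, label in samples:
        score = sum(w * x for w, x in zip(weights, features)) + weights[-1]
        correct += int((score > 0) == bool(label))
    return correct / len(samples)


def evolve(samples, n_features, pop_size=30, generations=40, mutation_std=0.3, seed=0):
    """Evolve weight vectors with selection, crossover and mutation; return (best, history)."""
    rng = random.Random(seed)
    population = [[rng.gauss(0, 1) for _ in range(n_features + 1)] for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda w: fitness(w, samples), reverse=True)
        history.append(fitness(ranked[0], samples))
        parents = ranked[: pop_size // 2]  # selection: keep the best half
        children = [ranked[0][:]]          # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]         # uniform crossover
            child = [w + rng.gauss(0, mutation_std) for w in child]  # Gaussian mutation
            children.append(child)
        population = children
    best = max(population, key=lambda w: fitness(w, samples))
    return best, history
```

Because the best individual is copied unchanged into the next generation, the best fitness recorded in `history` can never decrease from one generation to the next.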

In the following cells, we present the overall pipeline for model usage and training. Training the model is resource-intensive, so the training code is disabled by default.

PS: The results were too poor for practical use, so InPoDA currently uses textblob until we find a valid solution. The predictions are far from the expected results: all inputs end up in more or less the same category. This is strange because everything looks fine during training. There is likely an error somewhere between training and model usage: during training we get almost perfect results, but when testing the best model we only reach 50% accuracy.
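One way to make such a gap visible early is to score the model on data it never saw during training. The following is a minimal sketch under assumptions: `predict` is any callable mapping a feature vector to a label, and the helper names are hypothetical, not part of InPoDA.

```python
import random


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle (features, label) pairs and split them into train and test lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, samples):
    """Fraction of samples for which predict(features) equals the label."""
    if not samples:
        return 0.0
    return sum(predict(x) == y for x, y in samples) / len(samples)
```

On a balanced two-class test set, a predictor that always returns the same class scores exactly 50%, so comparing train and test accuracy with identical preprocessing on both sides helps narrow down where the mismatch comes from.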


###### **a.** Model loading

In [None]:
# Loads the best pre-computed model
pipeline.load_creature()

###### **b.** Model usage

In [None]:
# Example usage of the pre-computed model
test_tweet = "I'm so happy"

pipeline.process_input(test_tweet)

###### (**c.** Model training)

In [None]:
# The dataset is very big (1.6 million tweets) and storing that many embeddings is heavy on memory (768 * 2 bytes * 1.6 million ~= 2.46 GB, about 2.3 GiB)
# To prevent this, the data is processed chunk by chunk and each chunk is saved to disk, so that chunks are loaded only when needed during training.
# By default, chunk computation is deactivated as it can be expensive
run_chunks = False

if run_chunks: 
    pipeline.compute_chunks()
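The chunking idea described in the cell above can be sketched as follows. This is an illustration under assumptions, not the code behind `compute_chunks()`: the helper names, the pickle format and the `compute` parameter (standing in for the embedding step) are all hypothetical.

```python
import pickle
from pathlib import Path


def save_in_chunks(items, chunk_size, directory, compute=lambda chunk: chunk):
    """Apply `compute` to slices of `items` and write each result to its own pickle file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(items), chunk_size):
        path = directory / f"chunk_{start // chunk_size:05d}.pkl"
        # Only the result for the current slice is held in memory before being written
        with path.open("wb") as handle:
            pickle.dump(compute(items[start:start + chunk_size]), handle)
        paths.append(path)
    return paths


def iter_chunks(paths):
    """Yield one chunk at a time so that a single chunk is loaded at any moment."""
    for path in paths:
        with Path(path).open("rb") as handle:
            yield pickle.load(handle)


# Size of the full embedding matrix, using the figures from the cell above
embedding_bytes = 768 * 2 * 1_600_000  # dimensions * bytes per value * tweets
```

Training code can then loop over `iter_chunks(paths)` instead of loading all embeddings at once.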


In [None]:
# Loads the training data (all computed chunks) as a pandas.DataFrame
# Configuration can be updated in the `config.json` file

pipeline.load_training_data()

In [None]:
# The data is split in 2: first half is negative (sentiment 0)
pipeline.data.head()

In [None]:
# Second half is positive (sentiment 4)
pipeline.data.tail()
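Since the rows are ordered by class and labelled 0 and 4, a training loop usually benefits from binary labels and shuffled rows, so that a chunk does not contain a single class. Here is a minimal sketch, assuming rows are `(text, polarity)` pairs; `prepare_labels` is a hypothetical helper and not part of the pipeline.

```python
import random

LABEL_MAP = {0: 0, 4: 1}  # Sentiment140 polarity -> binary class


def prepare_labels(rows, seed=0):
    """Map Sentiment140 polarities (0, 4) to (0, 1) and return the rows in shuffled order.

    `rows` is assumed to be a list of (text, polarity) pairs; any other polarity raises KeyError.
    """
    prepared = [(text, LABEL_MAP[polarity]) for text, polarity in rows]
    random.Random(seed).shuffle(prepared)
    return prepared
```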

In [None]:
# Here we train a model on the loaded data
# Since it is very expensive, it doesn't run by default
# Moreover, it is preferable to run the script `run_training.py` found in the ./sentiment_analysis/ folder directly
# ! Be aware that you need to update the save file in `config.json` if you don't want to overwrite the pre-trained model !
run = False

if run:
    pipeline.train_genetic_algorithm()

## II - Tweets data analysis
---
This part presents the different analyses we can run on the loaded data.

#### **0. Data Presentation**
---
Presentation of all unique Authors, Mentions, Hashtags

In [None]:
# All authors
pipeline.get_all_authors()

In [None]:
# All mentions
pipeline.get_all_mentions()

In [None]:
# All hashtags
pipeline.get_all_hashtags()

#### **1. Top K analysis**
---
We extract:

    - Top K hashtags  (Most used hashtags)
    - Top K authors   (Users who posted the most tweets)
    - Top K mentioned (Users who were mentioned the most)
    - Top K topics    (Topics that come up the most)


In [None]:
# Please set the desired K value here
K = 5 

In [None]:
# TOP K HASHTAGS
pipeline.top_k_hashtags(k = K)

In [None]:
# TOP K AUTHORS
pipeline.top_k_authors(k = K)

In [None]:
# TOP K MENTIONED
pipeline.top_k_mentioned(k = K)

In [None]:
# TOP K TOPICS
pipeline.top_k_topics(k = K)
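A top-K ranking of this kind can be sketched with the standard library alone. The data and the helper `top_k` below are hypothetical and only illustrate the counting step, not the pipeline's own implementation.

```python
from collections import Counter


def top_k(values, k):
    """Return the k most frequent values with their counts, most frequent first."""
    return Counter(values).most_common(k)


# Hypothetical data: one list of hashtags per tweet
hashtags_per_tweet = [["#ia", "#python"], ["#ia"], ["#data", "#ia", "#python"]]
all_hashtags = [tag for tags in hashtags_per_tweet for tag in tags]
top_hashtags = top_k(all_hashtags, 2)
```

The same helper works for authors, mentions or topics once they are flattened into a single list.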

#### **2. Number of tweets per X**
---
We extract:

    - Number of tweets per user
    - Number of tweets per hashtag
    - Number of tweets per topic


In [None]:
# Number of tweets per user
pipeline.number_of_tweets_per_user()

In [None]:
# Number of tweets per hashtag
pipeline.number_of_tweets_per_hashtag()

In [None]:
# Number of tweets per topic
pipeline.number_of_tweets_per_topic()
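Counting tweets per key differs slightly from counting occurrences: a tweet that repeats a hashtag should still count once for that hashtag. Here is a minimal sketch, assuming each tweet is a dict with hypothetical `"author"` and `"hashtags"` fields; `tweets_per_key` is not part of InPoDA.

```python
from collections import Counter


def tweets_per_key(tweets, key_fn):
    """Count tweets per key; `key_fn` returns the keys (user, hashtags, topics) of one tweet."""
    counts = Counter()
    for tweet in tweets:
        # set() makes a tweet count once per key even if the key is repeated in it
        counts.update(set(key_fn(tweet)))
    return dict(counts)
```

For example, `tweets_per_key(tweets, lambda t: t["hashtags"])` counts tweets per hashtag, while `tweets_per_key(tweets, lambda t: [t["author"]])` counts tweets per user.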

#### **3. User analysis**
---
We extract all tweets from a provided user

In [None]:
# Set user
user = pipeline.get_all_authors().iloc[0, 0]

# All tweets from user
pipeline.all_tweets_from_user(user)

#### **4. Usage analysis**
---

We extract:

    - All tweets mentioning a specific user
    - All users using a specific hashtag
    - All users mentioned by a specific user

In [None]:
# All tweets mentioning a specific user
user = pipeline.get_all_mentions().loc[0]
print(user.tolist())
# Take the first value of the row by position and strip the leading "@"
pipeline.all_tweets_where_user(user.iloc[0][1:])

In [None]:
# All users using a hashtag
hashtag = pipeline.get_all_hashtags().loc[0]
hashtag = hashtag.tolist()
pipeline.all_users_using_hashtag(hashtag[0][1:])

In [None]:
# All users mentioned by a specific user
user = "372993152" #pipeline.get_all_authors().iloc[0, 0]

pipeline.all_users_mentioned_by_user(user)
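Lookups of this kind can be served by small inverted indexes built in one pass over the tweets. The sketch below assumes each tweet is a dict with hypothetical `"author"`, `"hashtags"` and `"mentions"` fields; `build_indexes` is an illustration and not the pipeline's implementation.

```python
from collections import defaultdict


def build_indexes(tweets):
    """Build lookup tables: hashtag -> authors, author -> mentioned users, mention -> tweet positions."""
    users_by_hashtag = defaultdict(set)
    mentions_by_user = defaultdict(set)
    tweets_by_mention = defaultdict(list)
    for position, tweet in enumerate(tweets):
        author = tweet["author"]
        for tag in tweet["hashtags"]:
            users_by_hashtag[tag].add(author)
        mentions_by_user[author].update(tweet["mentions"])
        for mention in set(tweet["mentions"]):
            tweets_by_mention[mention].append(position)
    return users_by_hashtag, mentions_by_user, tweets_by_mention
```

Each query then becomes a dictionary lookup, for instance `users_by_hashtag["#a"]` for all users using the hashtag `#a`.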

## III - References
---
Dataset:
`Sentiment140 dataset with 1.6 million tweets. (2017, September 13). https://www.kaggle.com/datasets/kazanova/sentiment140`

Genetic Algorithm:
`9. Evolutionary computing. (n.d.). https://natureofcode.com/genetic-algorithms/`