# Dataset presentation

[The Twitter US Airline Sentiment dataset available on Kaggle](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment) is a collection of tweets from customers of major US airlines. The dataset was originally created by CrowdFlower and contains 14,640 tweets that were posted on Twitter in February 2015. Each tweet in the dataset is labeled with a sentiment (positive, negative, or neutral) towards the airline that the customer was addressing in the tweet.

The dataset includes a range of information about each tweet, including the tweet text, the airline being addressed, the time the tweet was posted, the user's location and the user's Twitter handle.

The main goal of this dataset is to enable researchers and data scientists to build models that can accurately predict the sentiment of customer tweets towards different airlines. This type of analysis can help airlines to better understand customer feedback and improve their overall customer service and experience.

The dataset has been widely used in natural language processing and sentiment analysis research, and many studies have been conducted using this dataset to explore various aspects of sentiment analysis, such as feature selection, model building, and performance evaluation.

# Load and Visualize data

In [1]:
# Install the autotime package to track the execution time of each cell
!pip install ipython-autotime
%load_ext autotime

time: 2.21 ms (started: 2023-05-09 12:44:15 +02:00)


In [2]:
import pandas as pd

time: 641 ms (started: 2023-05-09 12:44:15 +02:00)


Read the data from the Google Drive file.

In [3]:
Tweets = pd.read_csv('https://drive.google.com/uc?id=16BQRafnVEFMTAARipirqXoIdP-FBEGvJ')
Tweets.shape

(14640, 15)

time: 2.84 s (started: 2023-05-09 12:44:19 +02:00)


### Question 1: Show the first 2 rows of the dataset.

### Question 2: For our study, we only need a few columns. Subset the dataframe to keep only the following columns:


*  **airline**: name of the company tagged in the tweet.
*  **retweet_count**: number of retweets.
*  **text**: content of the tweet.
*  **tweet_created**: publication time of the tweet.
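
A minimal sketch of Questions 1 and 2, assuming the column names listed above. The three-row `sample` frame and its `extra_column` field are invented stand-ins so the snippet runs without downloading the data; in the notebook, the same calls would be applied to `Tweets`.

```python
import pandas as pd

# Invented stand-in for the real Tweets dataframe (illustration only).
sample = pd.DataFrame({
    "airline": ["United", "Delta", "United"],
    "retweet_count": [0, 2, 1],
    "text": ["@united late again", "@JetBlue thanks!", "@united lost my bag"],
    "tweet_created": [
        "2015-02-24 11:35:52 -0800",
        "2015-02-24 11:15:59 -0800",
        "2015-02-23 09:00:00 -0800",
    ],
    "extra_column": [1, 2, 3],  # hypothetical column that gets dropped
})

# Question 1: first two rows.
first_rows = sample.head(2)

# Question 2: keep only the columns needed for the study.
columns_to_keep = ["airline", "retweet_count", "text", "tweet_created"]
subset = sample[columns_to_keep].copy()
```

Taking a `.copy()` avoids pandas chained-assignment warnings when new columns are added to the subset later.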



### Question 3: Add a column to your dataframe that stores the length of the "text" string. Then filter your dataframe to keep only tweets longer than 75 characters.
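
One possible approach to Question 3, sketched on invented data. The column name `text_size` is an assumption (the exercise does not name it), and the threshold is read as strictly more than 75 characters.

```python
import pandas as pd

# Invented sample: lengths 11, 80 and 75 characters.
sample = pd.DataFrame({"text": ["short tweet", "x" * 80, "y" * 75]})

# Length of each tweet in characters.
sample["text_size"] = sample["text"].str.len()

# Keep only tweets strictly longer than 75 characters.
long_tweets = sample[sample["text_size"] > 75].reset_index(drop=True)
```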

### Question 4: Using plotly express, plot a histogram of the text size and a bar plot of the number of tweets per company.
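
A hedged sketch for Question 4. It assumes a `text_size` column (a name chosen here for illustration) and uses invented sample rows. The counts are computed with pandas first; `make_figures` is a hypothetical helper that imports plotly only when it is called.

```python
import pandas as pd

# Invented sample data (illustration only).
sample = pd.DataFrame({
    "airline": ["United", "Delta", "United"],
    "text_size": [120, 95, 80],
})

# One row per airline with its number of tweets.
tweets_per_airline = (
    sample.groupby("airline").size().reset_index(name="n_tweets")
)


def make_figures(df, counts):
    """Return (histogram, bar plot) as plotly figures."""
    import plotly.express as px  # imported lazily, only needed for plotting

    hist = px.histogram(df, x="text_size", nbins=30, title="Tweet length")
    bars = px.bar(counts, x="airline", y="n_tweets", title="Tweets per airline")
    return hist, bars
```

In the notebook, `hist, bars = make_figures(Tweets, tweets_per_airline)` followed by `hist.show()` and `bars.show()` would display both figures.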

### Question 5: Plot the number of tweets over time.
 - First, convert your 'tweet_created' column to datetime.
 - Create a new column 'date' that contains only the day, without the time.
 - Plot the number of tweets per company per day.
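
The three steps above can be sketched as follows on invented rows. Parsing with `utc=True` is a choice made here so that mixed timezone offsets do not break the conversion; it means the `date` column holds UTC days. `plot_daily_counts` is a hypothetical helper name.

```python
import pandas as pd

# Invented sample rows in the timestamp style "YYYY-MM-DD HH:MM:SS -0800".
sample = pd.DataFrame({
    "airline": ["United", "United", "Delta"],
    "tweet_created": [
        "2015-02-24 11:35:52 -0800",
        "2015-02-24 12:10:00 -0800",
        "2015-02-23 09:00:00 -0800",
    ],
})

# Step 1: convert to timezone-aware datetimes (normalised to UTC).
sample["tweet_created"] = pd.to_datetime(sample["tweet_created"], utc=True)

# Step 2: keep only the calendar day.
sample["date"] = sample["tweet_created"].dt.date

# Step 3: count tweets per day and per airline.
daily_counts = (
    sample.groupby(["date", "airline"]).size().reset_index(name="n_tweets")
)


def plot_daily_counts(counts):
    """Line plot with one line per airline."""
    import plotly.express as px  # imported lazily, only needed for plotting

    return px.line(counts, x="date", y="n_tweets", color="airline")
```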

# NLP with transformers

Install the transformers library from Hugging Face. The official website is https://huggingface.co/

In [16]:
!pip install transformers

time: 4.44 s (started: 2023-05-07 19:15:07 +02:00)


When using neural networks, it is usually better to work on a GPU device (especially for training).

In [17]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if device.type == "cuda":
    print("GPU: ", torch.cuda.get_device_name(0))
else:
    print("CPU")

CPU
time: 1.89 s (started: 2023-05-07 19:15:11 +02:00)


## Classification

### Question 6: Load the sentiment analysis pipeline
The default model is **distilbert-base-uncased-finetuned-sst-2-english**.

### Question 7: Apply the model
 - First check that the model works properly on 10 lines of your dataset.
 - Run the model on your whole dataset.
 - Add a new column to your dataset that contains the predicted label.
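
A minimal sketch of Questions 6 and 7, not a definitive solution. `add_sentiment` and `load_classifier` are hypothetical helper names, and the output column names `sentiment` and `sentiment_score` are assumptions. The helper accepts any callable that behaves like a Hugging Face text-classification pipeline, which returns one dict with `label` and `score` keys per input text.

```python
import pandas as pd


def add_sentiment(df, classifier, text_column="text", batch_size=32):
    """Return a copy of df with predicted label and score columns added."""
    texts = df[text_column].tolist()
    # truncation=True cuts inputs that exceed the model's maximum length.
    predictions = classifier(texts, truncation=True, batch_size=batch_size)
    out = df.copy()
    out["sentiment"] = [p["label"] for p in predictions]
    out["sentiment_score"] = [p["score"] for p in predictions]
    return out


def load_classifier():
    """Load the sentiment pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
```

In the notebook, one would first try `add_sentiment(Tweets.head(10), load_classifier())` to check the output, then run it on the whole dataframe.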

### Question 8: Bar plot showing the number of positive and negative tweets per company
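
A sketch for Question 8 on invented rows, assuming the predicted label is stored in a column named `sentiment` (an assumed name). The aggregation is plain pandas; `plot_sentiment_counts` is a hypothetical helper that draws grouped bars with plotly express.

```python
import pandas as pd

# Invented sample with predicted labels (illustration only).
sample = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "Delta"],
    "sentiment": ["NEGATIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"],
})

# Number of tweets for each (airline, sentiment) pair.
sentiment_counts = (
    sample.groupby(["airline", "sentiment"]).size().reset_index(name="n_tweets")
)


def plot_sentiment_counts(counts):
    """Grouped bar plot: one bar per sentiment, grouped by airline."""
    import plotly.express as px  # imported lazily, only needed for plotting

    return px.bar(
        counts, x="airline", y="n_tweets", color="sentiment", barmode="group"
    )
```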

### Question 9: Translate the first 10 tweets to French

### Question 10: Summarize the first 10 tweets into 5 to 15 words
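
A hedged sketch covering Questions 9 and 10. All four function names are hypothetical, and the model choice is left to the library defaults for the `translation_en_to_fr` and `summarization` tasks. Note that `min_length` and `max_length` count tokens rather than words, so the 5 to 15 word target is only approximated.

```python
def translate_to_french(texts, translator):
    """Translate a list of English texts with a translation pipeline."""
    results = translator(list(texts))
    return [r["translation_text"] for r in results]


def summarize(texts, summarizer, min_length=5, max_length=15):
    """Summarize texts; the length bounds are expressed in tokens."""
    results = summarizer(list(texts), min_length=min_length, max_length=max_length)
    return [r["summary_text"] for r in results]


def load_translator():
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("translation_en_to_fr")


def load_summarizer():
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("summarization")
```

In the notebook, `translate_to_french(Tweets["text"].head(10), load_translator())` would return the ten translated tweets as a list of strings.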

## Topic modeling

Install the BERTopic package. You can find the documentation in this repo: https://github.com/MaartenGr/BERTopic

In [None]:
!pip install bertopic

In [None]:
from bertopic import BERTopic


# Instantiate the BERTopic model
model = BERTopic(language='english', calculate_probabilities=True)

Topic modeling can be sensitive to the quality of the data, so it is usually worthwhile to apply some cleaning before modeling. In our case, we will start by removing the most frequent tags.

### Question 11: Create a new column 'clean_text' and remove all tags for airline companies.

In [2]:
ref_companies = '@VirginAmerica|@AmericanAir|@JetBlue|@SouthwestAir|@united|@USAirways'
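
One way to use this pattern, sketched on two invented tweets. The new column is named `clean_text`, matching the name used in Question 13. Matching is case-insensitive here, which is a choice made for this sketch, and the leftover double spaces are collapsed.

```python
import pandas as pd

ref_companies = '@VirginAmerica|@AmericanAir|@JetBlue|@SouthwestAir|@united|@USAirways'

# Invented sample tweets (illustration only).
sample = pd.DataFrame({
    "text": ["@united my bag is lost", "Thanks @JetBlue for the upgrade"],
})

sample["clean_text"] = (
    sample["text"]
    .str.replace(ref_companies, "", case=False, regex=True)  # drop airline tags
    .str.replace(r"\s+", " ", regex=True)                    # collapse whitespace
    .str.strip()
)
```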

### Question 12: Can you improve the cleaning of your text data?
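
A possible direction for Question 12, offered as a sketch rather than a reference answer. The hypothetical `clean_tweet` helper decodes HTML entities, removes URLs and every remaining @mention, drops the `#` sign while keeping the hashtag word, and normalises whitespace. Which of these steps actually help topic quality is something to check on the data.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Apply a few common tweet-cleaning steps to one string."""
    text = html.unescape(text)       # "&amp;" becomes "&"
    text = URL_RE.sub("", text)      # remove links
    text = MENTION_RE.sub("", text)  # remove any @mention
    text = text.replace("#", "")     # keep the hashtag word, drop the sign
    return SPACE_RE.sub(" ", text).strip()
```

On a dataframe this could be applied with `df["clean_text"] = df["text"].map(clean_tweet)`.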

### Question 13: Train the model on clean_text entries longer than 110 characters (use the fit function)

### Question 14: Use the get_topic_info and visualize_barchart functions to explore the topics found.

### Question 15: Predict the topics for the whole dataset (tweets longer than 75 characters) and add a main_topic column to your dataframe.
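
A hedged sketch of the fit-then-predict flow from Questions 13 and 15. `fit_and_assign_topics` is a hypothetical helper, and it reads "with 75 characters" as "tweets longer than 75 characters", which is an assumption. It works with any object exposing `fit(docs)` and `transform(docs)` in the way BERTopic does, where `transform` returns the topics together with the probabilities.

```python
import pandas as pd


def fit_and_assign_topics(df, topic_model, text_column="clean_text",
                          train_min_chars=110, predict_min_chars=75):
    """Fit on long texts, then predict a topic for every text above 75 chars."""
    lengths = df[text_column].str.len()

    # Question 13: train only on the longer, more informative tweets.
    train_docs = df.loc[lengths > train_min_chars, text_column].tolist()
    topic_model.fit(train_docs)

    # Question 15: predict topics for all tweets above the lower threshold.
    predicted = df[lengths > predict_min_chars].copy()
    topics, _probabilities = topic_model.transform(predicted[text_column].tolist())
    predicted["main_topic"] = list(topics)
    return predicted
```

With the real `model` instantiated earlier, the fitted model can then be explored with `model.get_topic_info()` and `model.visualize_barchart()`. BERTopic needs a reasonably large corpus to fit, so this helper is meant for the full dataset rather than a handful of rows.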

### Question 16: Propose a few plots to show the main topics per company and per sentiment.
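
Before plotting, it can help to build the underlying tables. The sketch below uses invented rows and assumes columns named `sentiment` and `main_topic`. It computes the share of each topic within each airline and the raw topic counts per sentiment; either table can then be drawn as stacked bars or a heatmap.

```python
import pandas as pd

# Invented sample (illustration only).
sample = pd.DataFrame({
    "airline": ["United", "United", "United", "Delta"],
    "sentiment": ["NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"],
    "main_topic": [0, 0, 1, 1],
})

# Share of each topic within each airline (every row sums to 1).
topics_per_airline = pd.crosstab(
    sample["airline"], sample["main_topic"], normalize="index"
)

# Raw number of tweets per (sentiment, topic) pair.
topics_per_sentiment = pd.crosstab(sample["sentiment"], sample["main_topic"])
```

Normalising per airline makes companies with very different tweet volumes comparable.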

### Question 17: Propose a way to include retweets in your study/plots
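
One possible way to account for retweets, offered as an assumption rather than the expected answer: weight each tweet by `1 + retweet_count`, so a tweet retweeted three times counts as four appearances. The `weight` and `weighted_tweets` column names are invented for this sketch.

```python
import pandas as pd

# Invented sample (illustration only).
sample = pd.DataFrame({
    "airline": ["United", "United", "Delta"],
    "sentiment": ["NEGATIVE", "POSITIVE", "NEGATIVE"],
    "retweet_count": [3, 0, 1],
})

# Each tweet counts once for itself plus once per retweet.
sample["weight"] = 1 + sample["retweet_count"]

# Weighted number of tweets per airline and sentiment.
weighted_counts = (
    sample.groupby(["airline", "sentiment"])["weight"]
    .sum()
    .reset_index(name="weighted_tweets")
)
```

The resulting table can replace the plain counts in the earlier bar plots to show reach rather than volume.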

### Question 18: Try using "zero-shot-classification" to classify your tweets with these labels: Very positive, Positive, Neutral, Negative, Very negative
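
A minimal sketch for this question, with hypothetical helper names and the model left to the library default for the `zero-shot-classification` task. The pipeline returns, for each text, a dict whose `labels` list is sorted from most to least likely, so the first entry is taken as the prediction.

```python
CANDIDATE_LABELS = ["Very positive", "Positive", "Neutral", "Negative", "Very negative"]


def zero_shot_labels(texts, classifier, labels=CANDIDATE_LABELS):
    """Return the top-scoring candidate label for each text."""
    results = classifier(list(texts), candidate_labels=labels)
    # Some versions return a single dict when only one text is passed.
    if isinstance(results, dict):
        results = [results]
    return [r["labels"][0] for r in results]


def load_zero_shot_classifier():
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline("zero-shot-classification")
```

Zero-shot models are much slower than the fine-tuned sentiment model, so it is reasonable to try this on a small sample of tweets first.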