
# **Introduction**
Welcome, Jaco, to the demo presentation of one of my projects: sentiment analysis of airline tweets.
Today, we'll explore how I used machine learning techniques to analyze the sentiments airline passengers express in their tweets.

Sentiment analysis plays a crucial role in understanding customer feedback, and our goal is to showcase how we can extract valuable insights from large volumes of social media data.

# **Problem Statement**
1. Airlines receive a vast amount of feedback from passengers on social media platforms like Twitter.
2. Manually analyzing this data is time-consuming and inefficient.
3. Our challenge was to develop a system that can automatically classify the sentiment of airline tweets as positive, negative, or neutral.




# **Data Preprocessing**
I collected a dataset of airline tweets containing information such as tweet text and sentiment labels.
To ensure data quality, I removed duplicate tweets and balanced the dataset for equal representation of sentiment classes.
Let's take a look at the code snippets of my data preprocessing steps.

In [None]:
import pandas as pd


**Read the CSV**

In [None]:
df = pd.read_csv("/content/Tweets.csv")

In [None]:
df.columns

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

In [None]:
df.shape

(14640, 15)

**I have to drop duplicate tweets. Tweets that differ only by extra whitespace at the beginning or end are also treated as duplicates by Google Cloud Platform (GCP), so I strip that whitespace first.**

In [None]:
df.head(2)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)


In [None]:
df["text"] = df["text"].str.strip()


In [None]:
df = df.drop_duplicates(subset=["text"])

In [None]:
df.shape

(14427, 15)
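To see why the strip step matters before deduplication, here is a minimal sketch on made-up tweets. The `toy` frame and its contents are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy tweets (made up for illustration): rows 0 and 1 differ only by
# surrounding whitespace, and row 3 is an exact repeat of row 2.
toy = pd.DataFrame({
    "text": [
        "@airline great flight",
        "  @airline great flight ",
        "@airline lost my bag",
        "@airline lost my bag",
    ]
})

# Without stripping, only the exact repeat is dropped.
raw_unique = toy.drop_duplicates(subset=["text"])

# After stripping, the whitespace variant is dropped as well.
stripped = toy.assign(text=toy["text"].str.strip())
clean_unique = stripped.drop_duplicates(subset=["text"])

print(len(toy), len(raw_unique), len(clean_unique))
```

In the notebook, the same two steps take the data from 14,640 rows to 14,427, so 213 rows were removed as duplicates.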

**I keep only the two relevant columns: the text and the sentiment of the text.**

In [None]:
df = df[["text", "airline_sentiment"]]

**The sentiment label has to be a numeric value, so I replaced negative with 0, neutral with 1, and positive with 2.**

In [None]:
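# 0 - NEGATIVE
# Replace only within the sentiment column, never inside the tweet text.
df = df.replace({"airline_sentiment": {"negative": 0}})

In [None]:
# 1 - NEUTRAL
df = df.replace({"airline_sentiment": {"neutral": 1}})

In [None]:
# 2 - POSITIVE
df = df.replace({"airline_sentiment": {"positive": 2}})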

In [None]:
df.head(2)

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,1
1,@VirginAmerica plus you've added commercials t...,2
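The same encoding can also be written as a single dictionary lookup. The sketch below is one possible alternative, not the notebook's method: `LABEL_TO_ID` and `encode_sentiment` are hypothetical names, and the toy rows are made up. Its one advantage is that an unexpected label raises an error instead of slipping through unencoded.

```python
import pandas as pd

# Hypothetical lookup table mirroring the encoding used in this notebook.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}


def encode_sentiment(labels: pd.Series) -> pd.Series:
    """Map sentiment strings to integer codes, failing loudly on unknown labels."""
    encoded = labels.map(LABEL_TO_ID)  # labels missing from the dict become NaN
    unknown = labels[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"Unmapped sentiment labels: {list(unknown)}")
    return encoded.astype("int64")


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c"],
    "airline_sentiment": ["neutral", "positive", "negative"],
})
toy["airline_sentiment"] = encode_sentiment(toy["airline_sentiment"])
print(toy)
```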


**The dataset doesn't have an equal number of samples in each class. I will train one model with the unbalanced dataset and one model with a balanced dataset.**

In [None]:
df.groupby(["airline_sentiment"])["airline_sentiment"].count()

airline_sentiment
0    9080
1    3057
2    2290
Name: airline_sentiment, dtype: int64
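To put these counts in perspective, the short sketch below turns them into class shares and a majority-to-minority ratio. The three counts are copied from the output above; the variable names are my own.

```python
import pandas as pd

# Class counts copied from the groupby output above (0=negative, 1=neutral, 2=positive).
counts = pd.Series({0: 9080, 1: 3057, 2: 2290}, name="airline_sentiment")

# Share of each class in the deduplicated dataset.
shares = (counts / counts.sum()).round(3)

# How many negative tweets there are for every positive one.
imbalance_ratio = counts.max() / counts.min()

print(shares.to_dict())
print(round(imbalance_ratio, 2))
```

Negative tweets make up roughly 63% of the data, so a model that always predicted "negative" would already look about 63% accurate. That is one reason to compare against a model trained on balanced data.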

**I wrote the unbalanced dataset to a CSV file, overwriting the original `/content/Tweets.csv`.**

In [None]:
df.to_csv("/content/Tweets.csv", index=False)

**I created a balanced dataset with 2,290 samples per class, which is the total number of positive-sentiment tweets (the smallest class).**

In [None]:
df_balanced = df.groupby(["airline_sentiment"]).apply(lambda x: x.sample(2290)).reset_index(drop=True)
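The sampling above draws a different subset on every run. If reproducibility matters, a variant along the following lines may help. It is a sketch under assumptions: the `undersample` helper, the `seed` value, and the toy rows are all hypothetical. It also reads the per-class sample size from the data rather than hard-coding it, and adds a shuffle that the notebook does not do.

```python
import pandas as pd


def undersample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n, random_state=seed)  # fixed seed makes the draw repeatable
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so rows are not ordered by class, then renumber the index.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up imbalanced data: six rows of class 0, three of class 1, two of class 2.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(11)],
    "airline_sentiment": [0] * 6 + [1] * 3 + [2] * 2,
})
balanced = undersample(toy, "airline_sentiment")
print(balanced["airline_sentiment"].value_counts().to_dict())
```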


In [None]:
df_balanced.groupby(["airline_sentiment"])["airline_sentiment"].count()

airline_sentiment
0    2290
1    2290
2    2290
Name: airline_sentiment, dtype: int64

**I wrote the balanced dataset to a CSV file.**

In [None]:
df_balanced.to_csv("/content/balancedTweets.csv")

In [None]:
df_balanced.head()

Unnamed: 0,text,airline_sentiment
0,"@AmericanAir you have my money, you change my ...",0
1,@JetBlue You respond to complaints about cultu...,0
2,@AmericanAir @robinreda being stuck two days i...,0
3,@VirginAmerica I'm trying to check into my 10:...,0
4,@USAirways Can't stress enough how awful the a...,0


# **Model Training**
I uploaded the preprocessed data to a Google Cloud Storage (GCS) bucket and created a dataset in Vertex AI.
Using AutoML, I trained a sentiment analysis model on the balanced dataset to classify tweets into sentiment categories automatically.
AutoML's automated feature engineering and hyperparameter tuning took care of model optimization without manual tuning.
Now, let's dive into the model training process and see how Vertex AI and AutoML facilitated the work.


In [None]:
from google.colab import auth
auth.authenticate_user()


In [None]:
!gsutil cp /content/balancedTweets.csv gs://sentimentalanalysis/


Copying file:///content/balancedTweets.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/706.0 KiB.                                    


In [None]:
!gsutil cp /content/Tweets.csv gs://sentimentalanalysis/

Copying file:///content/Tweets.csv [Content-Type=text/csv]...
/ [1 files][  1.5 MiB/  1.5 MiB]                                                
Operation completed over 1 objects/1.5 MiB.                                      
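For the Vertex AI dataset import itself, the CSV layout matters. As far as I understand the Vertex AI documentation for text sentiment data, each row is expected to look like `"inline_text",sentiment,sentimentMax`, with no header row and an optional leading `ml_use` column. The sketch below builds rows in that shape. Treat the layout as an assumption to verify against the current documentation; `to_import_rows` and `SENTIMENT_MAX` are hypothetical names, and the second record is made up.

```python
import csv
import io

# Highest sentiment value used in this notebook (labels are 0, 1, 2).
SENTIMENT_MAX = 2


def to_import_rows(records):
    """Yield [inline_text, sentiment, sentiment_max] rows, skipping empty text."""
    for text, sentiment in records:
        text = text.strip()
        if not text:
            continue
        if not 0 <= sentiment <= SENTIMENT_MAX:
            raise ValueError(f"Sentiment {sentiment} is outside 0..{SENTIMENT_MAX}")
        yield [text, sentiment, SENTIMENT_MAX]


records = [
    ("@VirginAmerica What @dhepburn said.", 1),  # first row of the dataset (neutral)
    ("@airline lost my bag", 0),                 # made-up example
]

# Write to an in-memory buffer; strings are quoted, numbers are not, no header row.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
writer.writerows(to_import_rows(records))
import_csv = buffer.getvalue()
print(import_csv)
```

Writing to a real file instead of the buffer would produce something that can be copied to the bucket with the same `gsutil cp` command used above.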
