**Learner Name: Damian Najera**

# Introduction to Natural Language Processing: Twitter US Airline Sentiment

## Problem Statement

### Context

Twitter's user base of 330 million monthly active users gives businesses a direct way to reach a broad audience. However, the sheer volume of content on the platform makes it hard for brands to quickly detect negative social mentions that could harm their reputation.

That is why sentiment analysis (classification), which monitors the emotions expressed in conversations on social media platforms, has become a key strategy in social media marketing: it lets businesses understand customer sentiment and gain insights to stay ahead in their industry.

### Objective

The aim of this project is to build a sentiment analysis model that classifies the sentiment of tweets as positive, neutral, or negative.

### Data Dictionary

* tweet_id - A unique identifier for each tweet                                                          
* airline_sentiment - The sentiment label of the tweet, such as positive, negative, or neutral                                               
* airline_sentiment_confidence - The confidence level associated with the sentiment label                               
* negativereason - A category indicating the reason for negative sentiment                                                   
* negativereason_confidence - The confidence level associated with the negative reason                                    
* airline - The airline associated with the tweet
* airline_sentiment_gold - Gold standard sentiment label
* name - The username of the tweet author
* negativereason_gold - Gold standard label for the negative reason
* retweet_count - The number of times the tweet has been retweeted
* text - The actual text content of the tweet.
* tweet_coord - Coordinates of the tweet
* tweet_created - The timestamp when the tweet was created
* tweet_location - The location mentioned in the tweet
* user_timezone - The timezone of the tweet author

## Importing necessary libraries

In [15]:
import re, string, unicodedata                          # Import Regex, string and unicodedata.
import contractions                                     # Import contractions library.
from bs4 import BeautifulSoup                           # Import BeautifulSoup.

import numpy as np                                      # Import numpy.
import pandas as pd                                     # Import pandas.
import nltk                                             # Import Natural Language Tool-Kit.
import seaborn as sns                                   # Import seaborn
import matplotlib.pyplot as plt                         # Import Matplotlib

nltk.download('stopwords')                              # Download Stopwords.
nltk.download('punkt')
nltk.download('wordnet')

from nltk.corpus import stopwords                                              # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize                         # Import tokenizers.
from nltk.stem.wordnet import WordNetLemmatizer                                # Import lemmatizer.
from wordcloud import WordCloud, STOPWORDS                                     # Import WordCloud and its stopword list.
from sklearn.feature_extraction.text import CountVectorizer                    # Import CountVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer                    # Import TfidfVectorizer.
from sklearn.model_selection import train_test_split                           # Import train/test split.
from sklearn.model_selection import cross_val_score                            # Import cross-validation score.
from sklearn.ensemble import RandomForestClassifier                            # Import Random Forest classifier.
from sklearn.metrics import confusion_matrix                                   # Import confusion matrix.
from sklearn.preprocessing import LabelBinarizer                               # Import LabelBinarizer.
nltk.download('omw-1.4')                                                       # Download the Open Multilingual WordNet data.

import random                                                                  # Import random.
import tensorflow as tf                                                        # Import TensorFlow.
from tensorflow.keras import backend                                           # Import the Keras backend.
from tensorflow.keras.models import Sequential                                 # Import Sequential (used in the LSTM section).
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout            # Import the layers used in the LSTM section.
from tensorflow.keras.preprocessing.text import Tokenizer                      # Import Tokenizer (used in the LSTM section).
from tensorflow.keras.preprocessing.sequence import pad_sequences              # Import pad_sequences (used in the LSTM section).

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\BD\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\BD\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\BD\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\BD\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Loading the dataset

In [16]:
data = pd.read_csv("Tweets.csv")

## Data Overview

The initial steps to get an overview of any dataset are to:
- Observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
- Get information about the number of rows and columns in the dataset
- Find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.

### Check the head and tail of the data

In [17]:
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [18]:
data.tail()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
14635,569587686496825344,positive,0.3487,,0.0,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0,Customer Service Issue,1.0,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)
14639,569587140490866689,neutral,0.6771,,0.0,American,,daviddtwu,,0,@AmericanAir we have 8 ppl so we need 2 know h...,,2015-02-22 11:58:51 -0800,"dallas, TX",


### Understand the shape of the dataset

In [19]:
data.shape

(14640, 15)

### Check for Duplicate Entries

In [20]:
# Check for duplicate rows
duplicate_rows = data.duplicated().sum()

# Display the number of duplicate rows
print(f"Number of duplicate rows: {duplicate_rows}")

Number of duplicate rows: 36


Given the context of our problem and the understanding of our dataset, let us remove the duplicate entries.

In [21]:
# Removing duplicate rows
data.drop_duplicates(inplace=True)

# Verifying that duplicates have been removed
remaining_duplicates = data.duplicated().sum()

remaining_duplicates

0
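To make the semantics of the two calls above concrete: `duplicated()` flags every repeat of a row after its first occurrence, and `drop_duplicates()` keeps those first occurrences. Below is a minimal pure-Python sketch of the same idea. The rows are made up for illustration and are not taken from `Tweets.csv`.

```python
# Hypothetical rows standing in for DataFrame records (not real dataset values).
rows = [
    ("a1", "negative", "late flight"),
    ("a2", "positive", "great crew"),
    ("a1", "negative", "late flight"),   # exact repeat of the first row
    ("a3", "neutral", "what gate?"),
    ("a1", "negative", "late flight"),   # second repeat
]

seen = set()
deduplicated = []
duplicate_count = 0
for row in rows:
    if row in seen:
        duplicate_count += 1          # counted the way data.duplicated().sum() counts
    else:
        seen.add(row)
        deduplicated.append(row)      # the first occurrence is kept

print(duplicate_count)     # number of repeated rows
print(len(deduplicated))   # number of rows that remain
```

A row only counts as a duplicate when every column matches, which is why two tweets with the same text but different `tweet_id` values would both be kept.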

### Checking for Missing Values

In [22]:
# Check for total missing values in each column
missing_values = data.isnull().sum()

# Display the columns with their respective count of missing values
missing_values

tweet_id                            0
airline_sentiment                   0
airline_sentiment_confidence        0
negativereason                   5445
negativereason_confidence        4101
airline                             0
airline_sentiment_gold          14564
name                                0
negativereason_gold             14572
retweet_count                       0
text                                0
tweet_coord                     13589
tweet_created                       0
tweet_location                   4723
user_timezone                    4814
dtype: int64

#### Observations:
- `negativereason`: 5,445 missing values. This is expected since not every tweet will be negative, and thus won't have a reason associated with it.
- `negativereason_confidence`: 4,101 missing values. Similar to the previous point, not every tweet will have a negative reason confidence if it's not negative.
- `airline_sentiment_gold`: 14,564 missing values. This column seems to represent some gold standard for sentiment, but it's mostly missing. Given its high count of missing values, it might not be very useful for analysis or modeling.
- `negativereason_gold`: 14,572 missing values. Like airline_sentiment_gold, this column has a high number of missing values, which makes it less useful.
- `tweet_coord`: 13,589 missing values. This indicates that a large portion of tweets do not have geolocation data associated with them.
- `tweet_location`: 4,723 missing values. This suggests that many users have not specified a location in their tweets or profiles.
- `user_timezone`: 4,814 missing values. A significant number of users haven't set or provided their timezones.


## Exploratory Data Analysis

### Univariate Analysis

In [23]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
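The percentage label in `labeled_barplot` is simply each bar's height divided by the total number of rows. Here is a small sketch of that arithmetic without any plotting, using a hypothetical list of labels in place of a real column:

```python
from collections import Counter

# Hypothetical labels; in the notebook these would come from data["airline_sentiment"].
labels = ["negative", "negative", "neutral", "positive", "negative", "neutral"]

counts = Counter(labels)
total = len(labels)

# Same formula as the bar label: 100 * bar height / total number of rows.
percentages = {
    level: "{:.1f}%".format(100 * n / total)
    for level, n in sorted(counts.items())
}
print(percentages)
```

Because `total` is the length of the whole column, missing values are included in the denominator. For a column such as `negativereason`, which has many missing entries, the displayed percentages will therefore not add up to 100%.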

#### Percentage of tweets for each airline



In [None]:
labeled_barplot(data, "airline", perc=True)         # Plot the labeled barplot for airline

#### Distribution of sentiments across all the tweets

In [None]:
labeled_barplot(data, "airline_sentiment", perc=True) # Plot the labeled barplot for airline_sentiment

#### Plot of all the negative reasons

In [None]:
labeled_barplot(data, "negativereason", perc=True)             # Plot the labeled barplot for negative reason

### Bivariate Analysis

#### Distribution of Sentiment of tweets for each airline

In [None]:
airline_sentiment = data.groupby(['airline', 'airline_sentiment']).airline_sentiment.count().unstack()    # Count tweets per airline and sentiment for the grouped bar plot
airline_sentiment.plot(kind='bar')

#### Wordcloud for negative tweets

In [None]:
airline_tweets = data[data['airline_sentiment'] == 'negative']
words = ' '.join(airline_tweets['text'])   # join only the negative tweets, not the full dataset
cleaned_word = " ".join([word for word in words.split()
                            if 'http' not in word
                                and not word.startswith('@')
                                and word != 'RT'])
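The filter in the cell above drops three kinds of tokens before the word cloud is built: links, @mentions, and the retweet marker. Here is a quick sketch of its effect on a single made-up tweet (the text is hypothetical):

```python
# Hypothetical tweet text used only to illustrate the filter.
words = "RT @VirginAmerica flight delayed again http://t.co/abc so frustrating"

cleaned_word = " ".join(
    word for word in words.split()
    if "http" not in word            # drop links
    and not word.startswith("@")     # drop @mentions
    and word != "RT"                 # drop the retweet marker
)
print(cleaned_word)
```

Without this step the airline handles would dominate the cloud, since nearly every tweet in the dataset starts with one.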

In [None]:
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='black',
                      width=3000,
                      height=2500
                     ).generate(cleaned_word)

In [None]:
plt.figure(1,figsize=(12, 12))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

#### Wordcloud for positive tweets

In [None]:
# write the code to make the word cloud for positive tweets

## Data Preparation for Modeling



- Drop all unnecessary columns
- Remove HTML tags
- Replace contractions in strings (e.g., replace "I'm" with "I am")
- Remove numbers
- Tokenize the text
- Remove stopwords
- Lemmatize the data
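The first few steps in the list can be sketched end to end with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: the regex tag stripper stands in for BeautifulSoup, the three-entry `CONTRACTIONS` map is a hypothetical stand-in for `contractions.fix`, and a whitespace split stands in for `nltk.word_tokenize`.

```python
import re

# Tiny, hypothetical contraction map; the notebook uses contractions.fix instead.
CONTRACTIONS = {"I'm": "I am", "can't": "cannot", "didn't": "did not"}

def strip_html_simple(text):
    """Crude regex stand-in for BeautifulSoup(text, 'html.parser').get_text()."""
    return re.sub(r"<[^>]+>", "", text)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def remove_numbers(text):
    """Delete every run of digits."""
    return re.sub(r"\d+", "", text)

def preprocess(text):
    text = strip_html_simple(text)
    text = expand_contractions(text)
    text = remove_numbers(text)
    return text.split()   # whitespace split stands in for nltk.word_tokenize

tokens = preprocess("<b>I'm</b> on flight 123 and can't board")
print(tokens)
```

The order matters: contractions are expanded before tokenization so that negations such as "cannot" survive as whole words.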

### Drop all unnecessary columns

In [None]:
# Take text and airline sentiment columns from the data
data = data[['text', 'airline_sentiment']].copy()                       # Keep only these two columns; .copy() avoids SettingWithCopyWarning on later assignments

In [None]:
data.head()                                                             # Display the first 5 rows of the dataset

In [None]:
data.shape                                                              # Get the shape of the data

In [None]:
data['airline_sentiment'].unique()                                      # Display the unique values in the airline sentiment column

In [None]:
data['airline_sentiment'].value_counts()                                # Display how often each value occurs in the airline sentiment column

### Remove HTML Tags

In [None]:
# Code to remove the HTML tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

data['text'] = data['text'].apply(strip_html)                         # Apply the strip_html function on the text column
data.head()                                                           # Display the head of the data

### Replace contractions in string

In [None]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

data['text'] = data['text'].apply(lambda x: replace_contractions(x))                        # Apply the replace_contractions function on the text column
data.head()                                                                                 # Display the head of the data

### Remove numbers

In [None]:
def remove_numbers(text):
  text = re.sub(r'\d+', '', text)                                     # Remove every run of digits from the text
  return text

data['text'] = data['text'].apply(remove_numbers)                     # Apply the remove_numbers function on the text column
data.head()                                                           # Display the head of the data

### Apply Tokenization

In [None]:
data.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)

In [None]:
# Apply tokenization on the text column (the later normalize step expects a list of tokens in 'text')
data['text'] = data.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
# Display the head of the data
data.head()

### Applying lowercase and removing stopwords and punctuation

**Adding Stopwords**

In [None]:
stopwords = stopwords.words('english')

customlist = ['not', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn',
        "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',
        "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
        "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# Negation words such as "not" and "couldn't" carry sentiment, so they are removed from the stop-word list and therefore kept in the text.

stopwords = list(set(stopwords) - set(customlist))
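Keeping negations out of the stop-word list matters because removing them can flip the apparent sentiment of a tweet. Here is a minimal sketch with a small hypothetical stop-word set (the notebook uses NLTK's English list):

```python
# Small hypothetical stop-word list; the notebook uses NLTK's English list.
base_stopwords = {"the", "was", "is", "not", "a"}
negations = {"not"}

def remove_stopwords(words, stop_list):
    """Return the tokens that are not in stop_list."""
    return [w for w in words if w not in stop_list]

tokens = ["the", "flight", "was", "not", "good"]

with_default = remove_stopwords(tokens, base_stopwords)
with_custom = remove_stopwords(tokens, base_stopwords - negations)

print(with_default)   # negation removed: reads as positive
print(with_custom)    # negation kept: sentiment preserved
```

With the default list the tweet collapses to "flight good", while the customized list keeps "flight not good".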

**All the preprocessing steps in one function**

In [None]:
lemmatizer = WordNetLemmatizer()

def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words

def lemmatize_list(words):
    """Lemmatize each word as a verb (pos='v') in list of tokenized words"""
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return ' '.join(words)

data['text'] = data.apply(lambda row: normalize(row['text']), axis=1)
data.head()
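The first three stages of `normalize` can be tried on a handful of tokens without NLTK. The sketch below re-implements them compactly on hypothetical tokens. Stop-word removal and lemmatization are left out because they need the NLTK data.

```python
import re
import unicodedata

def remove_non_ascii(words):
    """Decompose accented characters, then drop anything outside ASCII."""
    return [
        unicodedata.normalize("NFKD", w).encode("ascii", "ignore").decode("utf-8", "ignore")
        for w in words
    ]

def to_lowercase(words):
    return [w.lower() for w in words]

def remove_punctuation(words):
    """Strip punctuation and discard tokens that become empty."""
    stripped = (re.sub(r"[^\w\s]", "", w) for w in words)
    return [w for w in stripped if w != ""]

# Hypothetical tokens, roughly what a tokenizer might emit for a short tweet.
tokens = ["Café", "was", "GREAT", "!", "did", "n't", "wait"]
cleaned = remove_punctuation(to_lowercase(remove_non_ascii(tokens)))
print(cleaned)
```

Note how a leftover fragment like `n't` turns into `nt` once punctuation is stripped. Expanding contractions before tokenization avoids most of these fragments.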

## Model Building

### Using countvectorizer

In [None]:
# Vectorization (Convert text data to numbers).

Count_vec = CountVectorizer(max_features=5000)                # Initialize CountVectorizer with max_features = 5000.
data_features = Count_vec.fit_transform(data['text'])         # Fit and transform Count_vec on the text column

data_features = data_features.toarray()                       # Convert the sparse matrix into a dense array

In [None]:
data_features.shape                                           # Check the shape of the data features
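Conceptually, a count vectorizer builds a vocabulary of the most frequent terms and then counts how often each one appears in every document. Below is a rough pure-Python sketch of that idea on three hypothetical cleaned tweets. Tie-breaking between equally frequent terms is alphabetical here, which is an assumption of this sketch rather than a statement about scikit-learn's behavior.

```python
from collections import Counter

docs = ["flight late late", "great flight", "late bag lost"]   # hypothetical cleaned tweets
max_features = 3

# Rank terms by total corpus frequency and keep the top max_features.
term_totals = Counter(word for doc in docs for word in doc.split())
ranked = sorted(term_totals.items(), key=lambda kv: (-kv[1], kv[0]))
top_terms = [term for term, _ in ranked[:max_features]]
vocabulary = sorted(top_terms)   # columns in alphabetical order

# One row per document, one column per vocabulary term.
matrix = [[doc.split().count(term) for term in vocabulary] for doc in docs]

print(vocabulary)
print(matrix)
```

Each row is the feature vector for one tweet. Words that fall outside the top `max_features` terms (here "great" and "lost") are simply ignored.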

#### Create train and test sets

In [None]:
X = data_features                                             # Independent variable (data_features) stored as X

y = data.airline_sentiment                                    # Dependent variable (airline_sentiment) stored as y

In [None]:
# Split data into training and testing set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)   # test_size and random_state are not specified in the template; these values are assumed to match the LSTM split later in the notebook

#### Random Forest Model

In [None]:
# Using Random Forest to build a model for the classification of tweets.

forest = RandomForestClassifier(n_estimators=10, n_jobs=4)    # Initialize the Random Forest classifier (n_estimators=10 is an assumed starting value; it is tuned below)

forest = forest.fit(X_train, y_train)                         # Fit the forest variable on X_train and y_train

print(forest)

print(np.mean(cross_val_score(forest, X, y, cv=10)))          # Calculate the mean cross-validation score

#### Optimize the parameter: The number of trees in the random forest model(n_estimators)

In [None]:
# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]

In [None]:
# K-fold cross-validation.
cv_scores = []                                                                             # Initialize an empty list to store the scores
for b in base_ln:
    clf = RandomForestClassifier(n_estimators = b)                                         # Random Forest classifier with b trees
    scores = cross_val_score(clf, X_train, y_train, cv = 5, scoring = 'accuracy')          # Cross-validation accuracy of the classifier (clf) on the training data
    cv_scores.append(scores.mean())                                                        # Append the mean score to the cv_scores list
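For readers new to k-fold cross-validation, the loop above relies on splitting the training data into `cv` folds, holding each one out in turn, and averaging the resulting scores. Here is a simplified sketch of the index bookkeeping. It uses plain contiguous folds, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer, and the fold accuracies are hypothetical numbers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)

# Hypothetical per-fold accuracies, averaged the way scores.mean() is in the loop above.
fold_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
mean_score = sum(fold_scores) / len(fold_scores)
print(round(mean_score, 2))
```

Every sample is used for validation exactly once, so the averaged score is less sensitive to one lucky or unlucky split than a single hold-out estimate.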

In [None]:
# plot the error as k increases
error = [1 - x for x in cv_scores]                                 # Error corresponds to each number of estimator
optimal_learners = base_ln[error.index(min(error))]                # Selection of optimal number of n_estimator corresponds to minimum error.
plt.plot(base_ln, error)                                           # Plot between each number of estimator and misclassification error
xy = (optimal_learners, min(error))
plt.annotate('(%s, %s)' % xy, xy = xy, textcoords='data')
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()

In [None]:
# Train the best model and calculate accuracy on the test data.
clf = RandomForestClassifier(n_estimators = optimal_learners)     # Initialize the Random Forest classifier with the optimal number of learners
clf.fit(X_train, y_train)                                         # Fit the classifier on X_train and y_train
clf.score(X_test, y_test)                                         # Accuracy of the classifier on the held-out test data

#### Best Random Forest model

In [None]:
# Predict the result for test data using the model built above.
result = clf.predict(X_test)                                      # Predict on X_test with the tuned Random Forest (clf)

In [None]:
# Print and plot the confusion matrix

conf_mat = confusion_matrix(y_test, result)                       # Confusion matrix between the test labels and the predictions

print(conf_mat)                                                   # Print confusion matrix

In [None]:
# Plot the confusion matrix
# confusion_matrix orders rows and columns by sorted label, so the names must be listed alphabetically
df_cm = pd.DataFrame(conf_mat, index = ['negative', 'neutral', 'positive'],
                  columns = ['negative', 'neutral', 'positive'])
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')
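When no `labels` argument is passed, scikit-learn's `confusion_matrix` orders rows and columns by the sorted unique labels. For this dataset that means negative, neutral, positive, and the heatmap's index and columns need to follow the same order to be read correctly. Here is a small pure-Python sketch of how such a matrix is assembled, using hypothetical labels and predictions:

```python
def confusion_matrix_sorted(y_true, y_pred):
    """Rows are true labels, columns are predicted labels, both in sorted order."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

# Hypothetical true labels and predictions.
y_true = ["positive", "negative", "neutral", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "neutral"]

labels, matrix = confusion_matrix_sorted(y_true, y_pred)
print(labels)
for row in matrix:
    print(row)
```

The diagonal holds the correctly classified tweets. Each off-diagonal cell in row `i`, column `j` counts tweets of true class `i` that were predicted as class `j`.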

#### Wordcloud of top 40 important features from CountVectorizer + Random Forest based model

In [None]:
all_features = Count_vec.get_feature_names_out()                 # Feature names from the vectorizer (get_feature_names was removed in scikit-learn 1.2)
top_features=''                                                  # Collects the top 40 features of the trained model
feat=clf.feature_importances_
features=np.argsort(feat)[::-1]                                  # Feature indices sorted by descending importance
for i in features[0:40]:
    top_features+=all_features[i]
    top_features+=','

print(top_features)

print(" ")
print(" ")

# Apply WordCloud on the top features
wordcloud = WordCloud(background_color="white",colormap='viridis',width=2000,height=1000).generate(top_features)

In [None]:
# Display the generated image:
plt.figure(figsize=(14, 11))                       # Create the figure before drawing so the size takes effect
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()

### Using TF-IDF (Term Frequency- Inverse Document Frequency)

In [None]:
# Using TfidfVectorizer to convert text data to numbers.

tfidf_vect = TfidfVectorizer(max_features=5000)                          # Initialize the TF-IDF vectorizer with max_features = 5000.
data_features = tfidf_vect.fit_transform(data['text'])                   # Fit the TF-IDF vectorizer on the text column and transform it

data_features = data_features.toarray()                                  # Convert the sparse matrix into a dense array

In [None]:
data_features.shape                                                      # Check the shape of the data features
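TF-IDF down-weights terms that appear in many documents. The sketch below works through the arithmetic on three hypothetical tokenized tweets, using the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It is a hand calculation for intuition, not a replacement for `TfidfVectorizer`.

```python
import math

# Hypothetical tokenized tweets.
docs = [["flight", "late"], ["flight", "great"], ["flight", "late", "late"]]
n_docs = len(docs)
vocabulary = sorted({term for doc in docs for term in doc})

# Document frequency: number of documents containing each term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    """Raw term count times idf, then scaled to unit L2 length."""
    raw = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for term in vocabulary:
    print(term, round(idf[term], 3))
```

"flight" appears in every document, so it receives the minimum idf of 1, while the rarer "great" gets the largest weight.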

#### Create train and test sets

In [None]:
X = data_features                                                        # Independent variable (data_features) stored as X

y = data.airline_sentiment                                               # Dependent variable (airline_sentiment) stored as y

In [None]:
# Split data into training and testing set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)   # test_size and random_state are not specified in the template; these values are assumed to match the LSTM split later in the notebook

#### Random Forest Model

In [None]:
# Using Random Forest to build a model for the classification of tweets.

forest = RandomForestClassifier(n_estimators=10, n_jobs=4)    # Initialize the Random Forest classifier (n_estimators=10 is an assumed starting value; it is tuned below)

forest = forest.fit(X_train, y_train)                         # Fit the forest variable on X_train and y_train

print(forest)

print(np.mean(cross_val_score(forest, X, y, cv=10)))          # Calculate the mean cross-validation score

#### Optimize the parameter: The number of trees in the random forest model(n_estimators)

In [None]:
# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]

In [None]:
# K-fold cross-validation.
cv_scores = []                                                                             # Initialize an empty list to store the scores
for b in base_ln:
    clf = RandomForestClassifier(n_estimators = b)                                         # Random Forest classifier with b trees
    scores = cross_val_score(clf, X_train, y_train, cv = 5, scoring = 'accuracy')          # Cross-validation accuracy of the classifier (clf) on the training data
    cv_scores.append(scores.mean())                                                        # Append the mean score to the cv_scores list

In [None]:
# Plot the misclassification error for each of estimators (Hint: Use the above code which is used while plotting the miscalssification error for CountVector function )

In [None]:
# Train the best model and calculate accuracy on the test data.
clf = RandomForestClassifier(n_estimators = optimal_learners)     # Initialize the Random Forest classifier with the optimal number of learners
clf.fit(X_train, y_train)                                         # Fit the classifier on X_train and y_train
clf.score(X_test, y_test)                                         # Accuracy of the classifier on the held-out test data

In [None]:
# Predict the result for test data using the model built above.
result = clf.predict(X_test)                                      # Predict on X_test with the tuned Random Forest (clf)

In [None]:
# Plot the confusion matrix
conf_mat = confusion_matrix(y_test, result)                      # Confusion matrix between the test labels and the predictions

# confusion_matrix orders rows and columns by sorted label, so the names must be listed alphabetically
df_cm = pd.DataFrame(conf_mat, index = ['negative', 'neutral', 'positive'], columns = ['negative', 'neutral', 'positive'])
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')                           # Plot the heatmap of the confusion matrix

#### Wordcloud of top 40 important features from TF-IDF + Random Forest based model

In [None]:
all_features = tfidf_vect.get_feature_names_out()          # Feature names from the vectorizer
top_features=''                                            # Collects the top 40 features of the trained model
feat=clf.feature_importances_
features=np.argsort(feat)[::-1]                            # Feature indices sorted by descending importance
for i in features[0:40]:
    top_features+=all_features[i]
    top_features+=', '

print(top_features)

print(" ")
print(" ")

# Apply WordCloud on the top features
wordcloud = WordCloud(background_color="white",colormap='viridis',width=2000,height=1000).generate(top_features)

In [None]:
# Display the generated image:
plt.figure(figsize=(14, 11))                       # Create the figure before drawing so the size takes effect
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Top 40 features WordCloud', fontsize=20)
plt.axis("off")
plt.show()

### Using LSTM

In [None]:
# Clearing backend
backend.clear_session()
# Fixing the seed for random number generators
np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)

#### Tokenizing the text column

In [None]:
# Create the tokenizer with a vocabulary size of 800
tokenizer = Tokenizer(num_words = 800, split = ' ')

# Fit the tokenizer on the text data
tokenizer.fit_on_texts(data['text'].values)

# Converting text to sequences
X = tokenizer.texts_to_sequences(data['text'].values)

# Padding the sequences
X = pad_sequences(X)
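What the tokenizer and padding steps produce can be sketched in plain Python. Words are indexed by descending frequency starting at 1, only indices below `num_words` are kept, and shorter sequences are left-padded with zeros, which matches the default "pre" padding of `pad_sequences`. The texts are hypothetical, and ties between equally frequent words are broken alphabetically here as an assumption of this sketch.

```python
from collections import Counter

texts = ["flight late", "late late bag", "great"]   # hypothetical cleaned tweets
num_words = 3   # only word indices 1 and 2 survive, mirroring the num_words cap

# Index words by descending frequency, starting at 1 (0 is reserved for padding).
counts = Counter(word for text in texts for word in text.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: i + 1 for i, word in enumerate(ranked)}

# Words whose index is >= num_words are dropped from the sequences.
sequences = [
    [word_index[w] for w in text.split() if word_index[w] < num_words]
    for text in texts
]

# Left-pad with zeros up to the longest sequence.
max_len = max(len(seq) for seq in sequences)
padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]

print(word_index)
print(padded)
```

A tweet made up only of rare words becomes a row of zeros, which is one reason the vocabulary size is worth tuning.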

#### Encoding the target variable

In [None]:
# Storing the Label Binarizer
enc = LabelBinarizer()
# Fitting the Label Binarizer on airline_sentiment
y_encoded = enc.fit_transform(data['airline_sentiment'])

#### Split the data into train and test

In [None]:
# Splitting the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size = 0.30, random_state = 42)

#### Training LSTM Model

In [None]:
# Initializing the model
model = Sequential()

# Adding the embedding layer with a vocabulary size of 800 and 120 embedding dimensions
model.add(Embedding(800, 120, input_length = X.shape[1]))

# Adding the first LSTM layer with 256 units
model.add(LSTM(256,return_sequences=True))

# Adding the second LSTM layer with 150 units and a dropout rate of 0.2
model.add(LSTM(150, dropout = 0.2, recurrent_dropout = 0.2))

# Adding a dense layer with 124 neurons and the relu activation function
model.add(Dense(124,activation = 'relu'))

# Adding dropout with a dropout rate of 0.2
model.add(Dropout(0.2))

# Adding a dense layer with 64 neurons and the relu activation function
model.add(Dense(64,activation = 'relu'))

# Adding the output layer with 3 neurons and the softmax activation function
model.add(Dense(3, activation = 'softmax'))

# Compiling the model with categorical_crossentropy as the loss function, accuracy as the metric and adam as the optimizer
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
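The output layer and the loss work as a pair: softmax turns the three raw scores into class probabilities, and categorical cross-entropy penalizes a low probability on the true class. Here is a small numeric sketch of both, with hypothetical scores for one tweet:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true_one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, probs))

logits = [2.0, 0.5, -1.0]      # hypothetical raw scores for the 3 classes
probs = softmax(logits)
loss = categorical_crossentropy([1, 0, 0], probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

Because the target is one-hot, the loss reduces to minus the log of the probability given to the correct class. It approaches 0 as that probability approaches 1.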

In [None]:
# Summary of the model
print(model.summary())

In [None]:
%%time
# Fit the model on X_train and y_train with 30 epochs and a batch size of 32
his = model.fit(X_train, y_train, epochs = 30, batch_size = 32, verbose = 'auto')

In [None]:
# Predicting on X_test using the above model
result = model.predict(X_test)

In [None]:
# Applying argmax function on the predicted values (result) to get the predicted labels
y_pred_arg=np.argmax(result,axis=1)
# Applying argmax function on y_test to get back the true labels
y_test_arg=np.argmax(y_test,axis=1)
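The round trip from string labels to one-hot vectors and back to class indices can be illustrated without Keras or scikit-learn. `LabelBinarizer` stores its classes in sorted order, so column 0 corresponds to negative, column 1 to neutral, and column 2 to positive. The labels and probabilities below are hypothetical.

```python
labels = ["neutral", "positive", "negative", "negative"]   # hypothetical targets

classes = sorted(set(labels))   # same ordering LabelBinarizer exposes as classes_
one_hot = [[1 if label == c else 0 for c in classes] for label in labels]

def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical softmax outputs, one row per tweet.
predicted_probs = [
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
]

y_test_arg = [argmax(row) for row in one_hot]
y_pred_arg = [argmax(row) for row in predicted_probs]
decoded = [classes[i] for i in y_pred_arg]

print(y_test_arg, y_pred_arg)
print(decoded)
```

Comparing `y_test_arg` with `y_pred_arg` position by position gives exactly the pairs that the confusion matrix counts.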

**Plotting the confusion matrix**

In [None]:
conf_mat = confusion_matrix(y_test_arg, y_pred_arg)

# LabelBinarizer sorts the classes, so argmax index 0/1/2 maps to negative/neutral/positive
df_cm = pd.DataFrame(conf_mat, index = ['negative', 'neutral', 'positive'],
                  columns = ['negative', 'neutral', 'positive'])
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True, fmt='g')

## Summary



-


## Happy Learning!
---