# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: End-to-end analytics application using PySpark

## Problem Statement

Perform sentiment classification on airline tweet data using PySpark.

## Learning Objectives

At the end of the mini-project, you will be able to:

* analyze text data using PySpark
* derive insights and visualize the data
* implement feature extraction and classify the data
* train a classification model and deploy it

### Dataset

The dataset chosen for this mini-project is **[Twitter US Airline Sentiment](https://data.world/socialmediadata/twitter-us-airline-sentiment)**, a collection of tweets about US airlines scraped from Twitter in February 2015. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason for each negative tweet (such as "late flight" or "rude service"). Along with other information, it contains the tweet ID, the sentiment of the tweet (neutral, negative, or positive), the reason for a negative tweet, the airline name, and the tweet text.

## Information

The airline industry is a very competitive market that has grown rapidly in the past two decades. Airline companies have traditionally relied on customer feedback forms, which are tedious and time-consuming. Twitter offers an alternative source of customer feedback on which sentiment analysis can be performed. This dataset comprises tweets about 6 major US airlines, and a multi-class classifier can be trained to categorize the sentiment (neutral, negative, positive). In this mini-project we will start with pre-processing techniques to clean the tweets and then represent the tweets as vectors. A classification algorithm will then be used to predict the sentiment of unseen tweets. The end-to-end analytics will be performed using PySpark.

## Grading = 10 Points

#### Install PySpark

In [None]:
#@title Install packages and download the dataset
!pip -qq install pyspark
!pip -qq install handyspark
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/US_Airline_Tweets.csv
print("Packages installed successfully and dataset downloaded!!")

#### Import required packages

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from handyspark import *
import seaborn as sns
from matplotlib import pyplot as plt
import re
import string
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import NaiveBayes
from pyspark.sql.types import ArrayType, StringType

In [None]:
# NLTK imports
import nltk
nltk.download('punkt')
# Download stopwords
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

### Data Loading

#### Start a Spark Session

A Spark session is the unified entry point to a Spark application, introduced in Spark 2.0. It gives access to the various Spark functionalities through fewer constructs than the separate contexts used in earlier versions.

In [None]:
# YOUR CODE HERE


#### Load the data and infer the schema

To load the dataset, use `read.csv` with the `inferSchema` and `header` parameters.

In [None]:
path = "/content/US_Airline_Tweets.csv"
# YOUR CODE HERE

### EDA & Visualization (2 points)

#### Visualize the horizontal barplot of airline_sentiment (positive, negative, neutral)

Convert the data to a HandySpark DataFrame, keep only the records whose airline_sentiment is one of the 3 values mentioned above, and plot the graph.

In [None]:
# YOUR CODE HERE

#### Plot the number of tweets received for each airline

In [None]:
# YOUR CODE HERE

#### Visualize a stacked bar chart of the 6 US airlines with the 3 sentiments on each bar

* Display the count corresponding to each sentiment in each bar. [hint](https://priteshbgohil.medium.com/stacked-bar-chart-in-python-ddc0781f7d5f)

In [None]:
# YOUR CODE HERE

#### Visualize the horizontal barplot of negative reasons

In [None]:
# YOUR CODE HERE

### Pre-processing (3 points)

#### Check the null values and drop the records where the text value is null

In [None]:
# YOUR CODE HERE

#### Fill the null values with 0 in all the columns except the target

The target should not be empty. Ensure that all features are of integer type; convert them if needed.

In [None]:
# YOUR CODE HERE

#### Preprocessing and cleaning the tweets

* Convert the text to lower case
* Remove usernames, hashtags and links from the text (tweets)
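
The two cleaning steps above can be sketched in plain Python as follows. This is a minimal sketch rather than the expected solution: the function name `clean_tweet` and the regular expressions are assumptions, and the patterns drop each hashtag entirely instead of keeping its word.

```python
import re

# Hypothetical patterns; adjust them to the tweets actually observed.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text):
    """Lower-case a tweet and strip links, @usernames and #hashtags."""
    if text is None:
        return ""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # remove links first
    text = MENTION_RE.sub(" ", text)  # remove usernames
    text = HASHTAG_RE.sub(" ", text)  # remove hashtags
    return " ".join(text.split())     # collapse repeated whitespace


example = "@VirginAmerica Thanks for the GREAT flight! #happy http://t.co/abc123"
print(clean_tweet(example))
```

In PySpark, a function like this could be wrapped with `udf(clean_tweet, StringType())` and applied to the text column with `withColumn`.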

In [None]:
# YOUR CODE HERE

#### Tokenize each sentence into words using nltk word tokenizer

In [None]:
# YOUR CODE HERE

#### Remove the stopwords from tokenized words
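
A minimal sketch of the filtering step, assuming the tweets have already been tokenized into lists of words. The helper name `remove_stopwords` and the tiny `demo_stop_words` set are hypothetical; in the notebook the NLTK English stop-word set would be passed instead. Dropping single-character punctuation tokens is an extra assumption, not a stated requirement.

```python
import string


def remove_stopwords(tokens, stop_words):
    """Keep tokens that are neither stop words nor single punctuation marks."""
    punctuation = set(string.punctuation)
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and tok not in punctuation
    ]


# Hypothetical stand-in for set(stopwords.words('english'))
demo_stop_words = {"the", "was", "and", "a"}
tokens = ["the", "flight", "was", "late", "!"]
print(remove_stopwords(tokens, demo_stop_words))
```

Because the function returns a list of strings, a PySpark `udf` wrapping it would use `ArrayType(StringType())` as its return type.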

In [None]:
stop_words = set(stopwords.words('english'))
print(stop_words)

In [None]:
# YOUR CODE HERE

#### Apply Lemmatization to the words

In [None]:
# YOUR CODE HERE

### Feature Extraction (3 points)

Create useful features from the text column to train the model.

For example:
* Length of the tweet 
* No. of hashtags in the tweet starting with '#'
* No. of mentions in the tweet starting with '@'

Hint: create a new column for each of the above features
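
The three example features can be sketched as one plain-Python helper. This is an illustration under assumptions: the name `tweet_features` is hypothetical, tokens are split on whitespace, and the function is meant to run on the raw tweet text, since the cleaning step removes hashtags and mentions.

```python
def tweet_features(text):
    """Return (length, hashtag count, mention count) for a raw tweet."""
    if text is None:
        return (0, 0, 0)
    tokens = text.split()
    # len() on a filtered list avoids the built-in sum(), which is shadowed
    # in the notebook by `from pyspark.sql.functions import *`.
    n_hashtags = len([t for t in tokens if t.startswith("#")])
    n_mentions = len([t for t in tokens if t.startswith("@")])
    return (len(text), n_hashtags, n_mentions)


sample = "@united late again #fail #delay"
print(tweet_features(sample))
```

Each element of the returned tuple corresponds to one new integer column, as suggested by the hint.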

In [None]:
# YOUR CODE HERE

#### Get the features by applying CountVectorizer
CountVectorizer converts the list of tokens to vectors of token counts. See the [documentation](https://spark.apache.org/docs/latest/ml-features.html#countvectorizer) for details.
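
To make the idea concrete, here is a plain-Python illustration of count vectorization. It is not Spark's implementation: the helper names are hypothetical, the vocabulary is ordered alphabetically (Spark orders it by term frequency), and the vectors are dense lists (Spark returns sparse vectors).

```python
def build_vocabulary(token_lists):
    """Map each distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def count_vector(tokens, vocabulary):
    """Return a dense list of token counts aligned with the vocabulary."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vector[idx] += 1
    return vector


docs = [["flight", "late", "late"], ["great", "flight"]]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each position in a vector counts how often the corresponding vocabulary word occurs in that tweet, which is the representation the classifier consumes.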

In [None]:
# YOUR CODE HERE

#### Encode the labels

Using the `udf` function, encode the string values of *airline_sentiment* as integers.

In [None]:
def LabelEncoder(x):
    if x == 'positive':
        return 0
    elif x == 'negative':
        return 1
    return 2

# YOUR CODE HERE

### Train the classifier and evaluate (1 point)

#### Create vector assembler with the selected features to train the model

In [None]:
# YOUR CODE HERE

#### Arrange features and label and split them into train and test.

In [None]:
# YOUR CODE HERE

#### Train the model with train data and make predictions on the test data

To classify the text data, implement a Naive Bayes classifier, which is a probabilistic machine learning model.

For more information about **NaiveBayes Classifier**, click [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes)

In [None]:
nb = NaiveBayes(featuresCol='features', labelCol='labels')
# Fit the model with train data
model = nb.fit(train_data)

In [None]:
# get the predictions
# YOUR CODE HERE

#### Evaluate the model and find the accuracy

Compare the labels and predictions and find how many are correct.

To find the accuracy, count the correct predictions on the test data and divide by the total number of records in the test dataset.

**Hint:** convert the predictions dataframe to pandas and compare with labels
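
The accuracy calculation described above can be sketched in plain Python. The function name `accuracy` is hypothetical, and the inputs are assumed to be two equally long lists, for example taken from the pandas version of the predictions DataFrame.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    # Spark predictions are floats (e.g. 1.0); Python treats 1 == 1.0 as equal.
    correct = len([1 for y, p in zip(labels, predictions) if y == p])
    return correct / len(labels)


true_labels = [0, 1, 2, 1]
predicted = [0.0, 1.0, 1.0, 1.0]
print(accuracy(true_labels, predicted))
```

With a pandas DataFrame `pdf` obtained from the predictions, the two lists could be taken as `pdf['labels'].tolist()` and `pdf['prediction'].tolist()`, assuming those column names.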

In [None]:
# YOUR CODE HERE

### Deployment (1 point)

Let's integrate all the above code snippets in app.py and run it with **Streamlit**.

Starting from the data loading step, place all the code in app.py, including data preprocessing, feature extraction and model training.

* Implement the `predict_users_Input()` function, which takes one tweet from the user and returns the prediction of the trained model.

* Apply to the user input the same preprocessing and feature extraction steps that were used for the training data.

* The user input is captured from the text box in the **Streamlit** app. When the predict button is clicked, the input is classified using the `predict_users_Input()` function.


For more information about Streamlit, click [here](https://docs.streamlit.io/en/stable/)

In [None]:
# Install streamlit and colab-everything
!pip install -qq streamlit
# Python library to run streamlit, flask, fastapi, etc on Google Colab
!pip install -qq colab-everything

Create the `app.py` file and run with Streamlit

**Note:** We have provided the required code to execute Streamlit.

In [None]:
%%writefile app.py
import streamlit as st
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import re
import string
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import NaiveBayes
from pyspark.sql.types import ArrayType, StringType
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

st.write("Creating a spark session")
spark = SparkSession.builder.appName('TwitterSentiment').getOrCreate()
dataset = spark.read.csv("/content/US_Airline_Tweets.csv",inferSchema=True,header=True)

st.write("Preprocessing the train data")
# 1. Data preprocessing (PASTE YOUR ENTIRE DATA PREPROCESSING CODE FROM ABOVE)

st.write("Ongoing feature extraction!!")
# 2. Feature Extraction (PASTE YOUR ENTIRE FEATURE EXTRACTION CODE FROM ABOVE)

st.write("Training the model")
# 3. Training the model (PASTE YOUR MODEL TRAINING CODE FROM ABOVE)

def predict_users_Input(user_input):
  df1 = spark.createDataFrame([ (1, user_input)],['Id', 'UserTweet'])

  # YOUR CODE HERE for data preprocessing and feature extraction for user input data

  # YOUR CODE HERE for predicting the user input vector using trained model

  return predicted_result # return a pandas DataFrame (e.g. from .toPandas()) with a 'prediction' column

def decode(label):
  if label == 0:
    return "Positive Tweet!"
  elif label == 1:
    return "Negative Tweet!"
  return "Neutral Tweet"

user_input = st.text_input("Take Input","@mention #Hashtag good something!")
if st.button('predict'):
    result = predict_users_Input(user_input)
    st.write(decode(result.prediction.values[0]))

After you execute the code below, you will get a web app link where you can perform the sentiment prediction task.
* Note: The cell below keeps executing until the server is stopped by interrupting the execution. An error message may appear upon interruption; you can ignore it.

In [None]:
from colab_everything import ColabStreamlit
ColabStreamlit('/content/app.py')

Refer to the screenshot below.
![img](https://cdn.iisc.talentsprint.com/CDS/MiniProjects/sentiment_analysis_streamlit_button.JPG)