> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# IS 4487 Lab 13: Sentiment Analysis

## Outline

- Conduct a simple sentiment analysis
- Learn about multi-class logistic regression models
- Apply TF and IDF to turn text into numerical data
- Predict the sentiment of some test data

In this lab, you will explore **sentiment analysis** techniques to determine how positive or negative a given phrase is.

<a href="https://colab.research.google.com/github/Stan-Pugsley/is_4487_base/blob/main/Labs/lab_13_text_analytics.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


# Data Description




### Dataset 1 Description: train.tsv

Each phrase in this dataset comes pre-labelled with a sentiment score.

| Column       | Data Type   | Description                                                                      |
|--------------|-------------|----------------------------------------------------------------------------------|
| `PhraseID`   | Integer     | ID of an entry                                                                   |
| `SentenceID` | Integer     | Shows which phrases belong to which sentence                                     |
| `Phrase`     | String      | A sentence or phrase                                                             |
| `Sentiment`  | Categorical | 0 = Very Negative, 1 = Negative, 2 = Neutral, 3 = Positive, 4 = Highly Positive  |

### Dataset 2 Description: test.tsv

This dataset is used to test a predictive model. Notice that the target variable, `Sentiment`, is missing.

| Column       | Data Type | Description                                  |
|--------------|-----------|----------------------------------------------|
| `PhraseID`   | Integer   | ID of an entry                               |
| `SentenceID` | Integer   | Shows which phrases belong to which sentence |
| `Phrase`     | String    | A sentence or phrase                         |

## Part 1: Load and Prepare the Data

What you are going to do:
- Load the dataset
- Preview the data 

Why this matters:
<br>
Throughout the semester, you have mainly worked with datasets that contain many variables of different types. But what if a dataset has only a few variables, and one of them holds most of the information?

Things to notice:
- Which variables are actually important? 
- Why are there so few variables? Which variable(s) has the most data?


### 🔧 Try It Yourself — Part 0
To download the data:
- Go to https://www.kaggle.com/datasets/satwikdondapati/moviereviewsentimentalanalysis
- Click 'download', then 'download as a zip'
- Unzip the folder
- Upload the two files (`train.tsv` and `test.tsv`) to Colab

In [None]:
import pandas as pd

# load the training data 
file_path = 'train.tsv'
df = pd.read_csv(file_path, sep='\t')
display(df.head())
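
To see what `sep='\t'` does and how to check which columns carry the most information, here is a minimal sketch. It assumes a hypothetical three-row sample written in the same tab-separated layout as the table above; the rows and the names `sample_tsv` and `sample_df` are made up for illustration and are not part of the lab data.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout as train.tsv
sample_tsv = (
    "PhraseID\tSentenceID\tPhrase\tSentiment\n"
    "1\t1\tA series of escapades\t1\n"
    "2\t1\tA series\t2\n"
    "3\t1\tseries\t2\n"
)

# sep="\t" tells pandas that columns are separated by tabs, not commas
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)                        # (number of rows, number of columns)
print(sample_df.dtypes)                       # data type of each column
print(sample_df["Sentiment"].value_counts())  # number of phrases per sentiment label
```

Running the same three `print` calls on `df` is one way to answer the "Things to notice" questions for the real data.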


# Part 2: Split the data
What you are going to do:
- Load the necessary libraries
- Store the two most important columns in variables

Why this matters:
<br>
A logistic regression model needs a feature variable (the input) and a target variable (the value to predict).

Things to notice:
- Why are there only 2 variables total?

In [None]:
# load all the libraries 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# extract and store the Phrase and Sentiment columns in their own variables
X = df['Phrase']     # feature: the text of each phrase
y = df['Sentiment']  # target: the sentiment label (0-4)

### 🔧 Try It Yourself — Part 2

1. Just like in past assignments, split the data so that 80% of it is training data and 20% is testing data. Use the variable names `X_train`, `X_test`, `y_train`, and `y_test`; otherwise the later code won't work!

In [None]:
# 🔧 Enter your code here

# Part 3: TF and IDF
What you are going to do:
- Vectorize the data
- Calculate TF and IDF

Why this matters:
<br>
You'll notice that the Phrase variable has lots of data, but none of it is numerical. For logistic regression to work, the inputs need to be converted into numbers.

Things to notice: 
- Take a look at the parameters for the vectorizer. What are they for?
- Why are `X_train` and `X_test` the only variables vectorized?

In [None]:
# need to vectorize the string data into numbers
vectorizer = TfidfVectorizer(
    stop_words='english',       
    max_features=10000,         
    ngram_range=(1,2)           
)

# vectorize the training and test datasets 
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

### 🔧 Try It Yourself — Part 3
You may have noticed already that all of our training data is just strings, but logistic regression only takes numbers as inputs. How do data scientists turn strings into numbers? One common approach uses two measures: TF (term frequency) and IDF (inverse document frequency). In this step, you will calculate both TF and IDF for the word "cat" in the following documents:

Document 1
<br>
"There is an orange cat and a grey cat."

Document 2
<br>
"A cat is there"

Before you do that, let's go through an example of how to calculate TF and IDF using some actual data from train.tsv:

Document 1
<br>
"A series" 

Document 2
<br>
"A"

Document 3
<br>
"series" 

Let's calculate TF and IDF for the word "series." 

TF = number of times the word "series" appears in a document / total number of words in that document

- Document 1: TF = 1/2 = 0.5
- Document 2: TF = 0/1 = 0
- Document 3: TF = 1/1 = 1

Now let's calculate the IDF.

IDF = log(total number of documents / number of documents with the word "series")

There are 3 documents. Documents 1 and 3 contain the word "series". Therefore, using a base-10 logarithm, our IDF is:
<br>
IDF = log(3/2) = 0.176
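
The worked example for "series" can be checked with a few lines of Python. This is a minimal sketch that assumes simple whitespace tokenization and a base-10 logarithm (the base that reproduces 0.176); the helper names `term_frequency` and `inverse_document_frequency` are made up for illustration.

```python
import math

docs = ["A series", "A", "series"]
word = "series"

def term_frequency(word, doc):
    """Times the word appears in the document / total words in the document."""
    tokens = doc.lower().split()
    return tokens.count(word) / len(tokens)

def inverse_document_frequency(word, docs):
    """Base-10 log of (total documents / documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc.lower().split())
    return math.log10(len(docs) / containing)

tf_values = [term_frequency(word, doc) for doc in docs]
idf_value = inverse_document_frequency(word, docs)

print(tf_values)            # [0.5, 0.0, 1.0]
print(round(idf_value, 3))  # 0.176
```

Note that `TfidfVectorizer` uses a smoothed, natural-log variant of IDF by default, so the numbers it produces will not match these hand calculations exactly.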

1. Calculate the TF of the word "cat" for Document 1 and for Document 2.
2. Calculate the IDF of the word "cat" across both documents.

In [None]:
# 🔧 Enter your calculations here

# Part 4: Evaluate the model
What you are going to do:
- Fit the logistic regression model

Why this matters:
<br>
Sentiment analysis can use a variety of different models. In this case, we are using a logistic regression model.

Things to notice: 
- Why does this logistic regression model use `multi_class`?

In [None]:
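# Fit a multinomial (multi-class) logistic regression on the TF-IDF features.
# Note: recent scikit-learn releases deprecate multi_class, so you may see a FutureWarning.
model = LogisticRegression(multi_class='multinomial')
model.fit(X_train_tfidf, y_train)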
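
A multinomial logistic regression produces one raw score per class and converts those scores into probabilities with the softmax function. The sketch below illustrates that step with NumPy; the raw scores are hypothetical values chosen for illustration, not output from the fitted `model`.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw scores for sentiment classes 0-4 for a single phrase
raw_scores = np.array([-1.0, 0.5, 2.0, 0.5, -1.0])
probabilities = softmax(raw_scores)
predicted_class = int(np.argmax(probabilities))

print(probabilities.round(3))  # one probability per sentiment class
print(predicted_class)         # index of the most likely class
```

With a fitted scikit-learn model, `model.predict_proba(...)` returns these per-class probabilities, and `model.predict(...)` returns the most likely class.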

### 🔧 Try It Yourself — Part 4
1. Use the test data to find the accuracy of this model.
2. Generate a classification report for the model.
3. In a markdown cell, explain why we used a multi-class logistic regression. (Hint: sentiment analysis assigns each phrase a sentiment score, and this dataset has 5 possible scores, but regular logistic regression only has 2 possible outputs.)

In [None]:
# 🔧 Enter your code here

🔧 Add comment here

# Part 5: Apply the model
What you are going to do:
- Create some sample text
- Run the sample text through the model to determine its sentiment

Why this matters:
<br>
Being able to determine the sentiment of a sentence helps us know how "positive" or "negative" it sounds. 

Things to notice: 
- what scores did you end up with?
- do these scores look correct?

In [None]:
sample_text = ["I love this lab!", 
               "This lab is a waste of time."]
sample_features = vectorizer.transform(sample_text)
predictions = model.predict(sample_features)

for text, pred in zip(sample_text, predictions):
    print(f"Text: {text}\nPredicted Sentiment: {pred}\n")
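
The predictions are integers from 0 to 4, which can be hard to read at a glance. Here is a minimal sketch that maps each score to its label from the `Sentiment` row of the data description table. The list `example_predictions` holds hypothetical values for illustration, not actual model output.

```python
# Labels taken from the Sentiment column description in the data table
SENTIMENT_LABELS = {
    0: "Very Negative",
    1: "Negative",
    2: "Neutral",
    3: "Positive",
    4: "Highly Positive",
}

# Hypothetical predictions; in the lab these would come from model.predict(...)
example_predictions = [3, 1]

# Look up the readable label for each numeric score
readable = [SENTIMENT_LABELS[int(pred)] for pred in example_predictions]

for pred, label in zip(example_predictions, readable):
    print(f"Score {pred}: {label}")
```

Replacing `example_predictions` with the `predictions` array from the cell above prints a label for each sample sentence.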

### 🔧 Try It Yourself — Part 5
1. Create 2-3 more sentences and run them through the model. Print the predicted sentiment score of each sentence.

In [None]:
# 🔧 Enter your code here

## 🔧 Part 6: Reflection (100 words or fewer)

In this lab, you built a multi-class logistic regression model and learned how data analysts turn sentences into numbers. Using these techniques, you built a sentiment analysis model that can look at a sentence and determine how positive or negative it is.

Use the cell below to answer the following questions:

1. Why is it important to know how positive or negative a sentence is? 
2. How could a business use sentiment analysis and customer reviews to improve customer service? 

🔧 Add comment here:

# Export Your Notebook to Submit in Canvas
Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_13_LastnameFirstname.ipynb"