# Logistic Regression model

## Due date

March 31, 2023

## Assignment Description

Please read the assignment description before starting the assignment. The assignment description is available here: [Assignment Description](../assignment_descriptions/07_Logistic_Regression.md)

## Section 0: Load the data

The data is available on Kaggle. You can download it from [here](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), or load it directly with the code below.

In [7]:
## load the dataset
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

# URL to the IMDB dataset
IMDB_REVIEWS = "https://raw.githubusercontent.com/JamesMTucker/DATA_340_NLP/master/Notebooks/data/IMDB_Dataset.csv"

# create dataframe
df = pd.read_csv(IMDB_REVIEWS) # if you have RAM issues, you can use the nrows argument to read in fewer rows

### Data Set Description

In [8]:
## df shape and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [31]:
# Analyze the distribution of the target variable

targets = df['sentiment'].groupby(df['sentiment']).count()

# plot the distribution of the target variable
fig = px.bar(targets, x=targets.index, y=targets.values,
             labels={'x':'Sentiment', 'y':'Count'},
             color=targets.index,
             color_discrete_map={'positive':'green', 'negative':'red'})
fig.show()

### Data set preview

In [14]:
df.head(10)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
| 5 | Probably my all-time favorite movie, a story o... | positive |
| 6 | I sure would like to see a resurrection of a u... | positive |
| 7 | This show was an amazing, fresh & innovative i... | negative |
| 8 | Encouraged by the positive comments about this... | negative |
| 9 | If you like original gut wrenching laughter yo... | positive |


### Randomize some examples

In [28]:
## randomly generate three examples from the dataset
from IPython.display import display, HTML

random_sample = df.sample(3)

## If negative, colorize the text red, otherwise colorize the text green
def colorize_text(text, sentiment):
    """Wrap pre-formatted review text in a red (negative) or green (otherwise) span."""
    color = 'red' if sentiment == 'negative' else 'green'
    return f'<span style="color:{color}">{text}</span>'

for i in range(len(random_sample)):
    display(HTML(f"Review:<br />{colorize_text(random_sample.iloc[i]['review'], random_sample.iloc[i]['sentiment'])}"))

In [21]:
random_sample

| | review | sentiment |
|---|---|---|
| 39689 | Arthur is middle aged rich 'kid' who drinks li... | positive |
| 26610 | In the tradition of G-Men, The House On 92nd S... | positive |
| 17177 | I wanted to like this film, yes its a SAW, bla... | negative |


## Section 1: Train a Logistic Regression model on noun and adjective phrases

Use the spaCy library to extract the noun and adjective phrases from the reviews and train a Logistic Regression model on the noun and adjective phrases.

`train` and `validation` are the data you should use to train and validate your model. `test` is the data you should use to test your model. `test` is the data that mimics real world data.
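One way to produce the three splits is sketched below. The 80/10/10 proportions, the `split_dataset` helper name, and the toy data are assumptions for illustration, not requirements of the assignment; the split is stratified so each subset keeps the same positive/negative balance.

```python
from sklearn.model_selection import train_test_split


def split_dataset(texts, labels, seed=42):
    """Split into 80% train, 10% validation, 10% test (assumed proportions)."""
    # Carve off 20% as a holdout set, stratified by label.
    x_train, x_hold, y_train, y_hold = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Halve the holdout set into validation and test.
    x_val, x_test, y_val, y_test = train_test_split(
        x_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


# Toy stand-in for df['review'] and df['sentiment'].
texts = [f"review {i}" for i in range(100)]
labels = ["positive", "negative"] * 50

train, validation, test = split_dataset(texts, labels)
```

With the real data, `df['review']` and `df['sentiment']` would take the place of the toy lists. Keep `test` untouched until the very end, so it still behaves like unseen data.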

N.B.: Use spaCy's part-of-speech tags (`token.pos_`) to select the words. For example, in the sentence "the movie was good", "movie" is tagged `NOUN` and is the head of the noun phrase "the movie", while "good" is tagged `ADJ` and is the head of the adjective phrase "good". The function below keeps the individual tokens tagged `NOUN` or `ADJ`, not full phrases.

```python

import spacy

NLP = spacy.load("en_core_web_sm")

def extract_noun_adj_phrases(text):
    """Extract noun and adjective phrases from text.
    
    Args:
        text (str): Text to extract noun and adjective phrases from.
        
    Returns:
        noun_adj_phrases (list): List of noun and adjective phrases.
    """
    doc = NLP(text)
    noun_adj_phrases = []
    for token in doc:
        if token.pos_ in ["NOUN", "ADJ"]:
            noun_adj_phrases.append(token.text)
    return noun_adj_phrases

```
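Once each review has been reduced to a list of tokens, one possible way to train on them is a bag-of-words count followed by Logistic Regression. The sketch below assumes scikit-learn; `identity_analyzer`, `build_model`, and the four toy token lists are hypothetical stand-ins for the output of `extract_noun_adj_phrases`, not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def identity_analyzer(tokens):
    """Return the token list unchanged; the reviews are already tokenized."""
    return tokens


def build_model():
    """Count pre-extracted tokens, then fit a Logistic Regression classifier."""
    return make_pipeline(
        CountVectorizer(analyzer=identity_analyzer),
        LogisticRegression(max_iter=1000),
    )


# Toy stand-ins for extracted noun/adjective tokens (assumed, not real data).
train_tokens = [
    ["great", "movie"],
    ["wonderful", "film"],
    ["terrible", "movie"],
    ["awful", "film"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = build_model()
model.fit(train_tokens, train_labels)
prediction = model.predict([["great", "film"]])[0]
```

Passing a callable as `analyzer` tells `CountVectorizer` to skip its own tokenization, so the features are exactly the tokens spaCy selected.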

In [36]:
### YOUR CODE HERE ###

## Section 2: Train a Logistic Regression model on verbs and adverbs

Use the spaCy library to extract the verbs and adverbs from the reviews and train a Logistic Regression model on them.

`train` and `validation` are the data you should use to train and validate your model. `test` is the data you should use to test your model. `test` is the data that mimics real world data.
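The extraction step for this section can mirror the one in Section 1 with different tags. The sketch below is one possible shape: `extract_tokens_by_pos` is a hypothetical helper that accepts any iterable of objects with `text` and `pos_` attributes, which includes a spaCy `Doc`. To keep the example runnable without downloading a model, it is exercised on hand-tagged stand-in tokens rather than real spaCy output.

```python
from types import SimpleNamespace


def extract_tokens_by_pos(doc, pos_tags=("VERB", "ADV")):
    """Return the text of every token whose coarse POS tag is in pos_tags."""
    return [token.text for token in doc if token.pos_ in pos_tags]


# Hand-tagged stand-in for a spaCy Doc (tags assigned manually for illustration).
fake_doc = [
    SimpleNamespace(text="I", pos_="PRON"),
    SimpleNamespace(text="really", pos_="ADV"),
    SimpleNamespace(text="enjoyed", pos_="VERB"),
    SimpleNamespace(text="this", pos_="DET"),
    SimpleNamespace(text="film", pos_="NOUN"),
]

verbs_adverbs = extract_tokens_by_pos(fake_doc)
nouns_adjectives = extract_tokens_by_pos(fake_doc, pos_tags=("NOUN", "ADJ"))
```

With a loaded pipeline, the call would be `extract_tokens_by_pos(NLP(text))`.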

In [33]:
## YOUR CODE HERE

## Section 3: Compare the performance of the two models

Please answer the following questions:

1. Which model performed better? Why do you think that is the case?
2. What are the limitations of the Logistic Regression model?
3. What are some ways to improve the performance of the Logistic Regression model?
4. What are some ways to improve the performance of the two models?
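To ground the answers in numbers, it may help to score both models on the same held-out labels. The sketch below assumes scikit-learn; `compare_models`, the model names, and the toy predictions are hypothetical and only illustrate the bookkeeping.

```python
from sklearn.metrics import accuracy_score, f1_score


def compare_models(y_true, predictions_by_model, positive_label="positive"):
    """Return accuracy and F1 (for the positive class) for each model's predictions."""
    results = {}
    for name, y_pred in predictions_by_model.items():
        results[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, pos_label=positive_label),
        }
    return results


# Toy labels and predictions (made up for illustration, not real results).
y_true = ["positive", "negative", "positive", "negative"]
predictions = {
    "noun_adj": ["positive", "negative", "positive", "positive"],
    "verb_adv": ["positive", "positive", "negative", "negative"],
}

scores = compare_models(y_true, predictions)
```

Reporting F1 alongside accuracy is a design choice: the two can disagree when a model favors one class, even on a balanced dataset like this one.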



## Section 4: Extra Credit - Compare a Logistic Regression model to a Naive Bayes model
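A fair comparison feeds both classifiers the same features. The sketch below is one possible setup, assuming scikit-learn and its `MultinomialNB` as the Naive Bayes variant; `fit_both`, `identity_analyzer`, and the toy token lists are hypothetical names and data chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def identity_analyzer(tokens):
    """Return the token list unchanged; the reviews are already tokenized."""
    return tokens


def fit_both(token_lists, labels):
    """Fit Logistic Regression and Naive Bayes on identical count features."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "naive_bayes": MultinomialNB(),
    }
    fitted = {}
    for name, classifier in models.items():
        pipeline = make_pipeline(
            CountVectorizer(analyzer=identity_analyzer), classifier
        )
        fitted[name] = pipeline.fit(token_lists, labels)
    return fitted


# Toy stand-ins for extracted tokens (assumed, not real data).
tokens = [
    ["great", "movie"],
    ["wonderful", "film"],
    ["terrible", "movie"],
    ["awful", "film"],
]
labels = ["positive", "positive", "negative", "negative"]

fitted_models = fit_both(tokens, labels)
predictions = {
    name: model.predict([["great", "film"]])[0]
    for name, model in fitted_models.items()
}
```

On real data, both fitted pipelines would then be scored on the same `validation` and `test` splits.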