<a href="https://colab.research.google.com/github/JVCarmich0959/CSC228/blob/main/Jacquelyn's_Copy_of_CSC228_Lesson04_TextClassification_TfIdf_Sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Text Classification using TFIDF Vectorization and Sklearn
# Text is classified as positive or negative sentiment (binary classification)
# using supervised learning on labeled What's Up Goldsboro Facebook comments.
#
# Uses Libraries:  sklearn
# Runtime:  Google CoLab (cpu)
#
# Owner:  Lorrie Tomek
# 
# Data: 
# WUG:Facebook
# General URL: https://www.facebook.com/groups/whatsupgoldsboro/permalink/968963637824334
# Github repo (RAW): https://raw.githubusercontent.com/JVCarmich0959/CSC228/main/WUG_Mall_improvement.csv
# Reference:  Real Python 
# URL:  https://realpython.com/python-keras-text-classification/
# The tutorial is freely available on the internet.  (Verified December 2022.)  
# It is modified in this notebook for teaching purposes, especially n-grams.

# Text Classification using TFIDF

In contrast to the CountVectorizer approach, we employed the TfidfVectorizer in this model.

The TfidfVectorizer operates in a similar manner to the CountVectorizer, but instead of raw word counts it produces weights. Each weight is the product of the term frequency and the inverse document frequency. The term frequency is the number of occurrences of a word in a document. The inverse document frequency is based on the logarithm of the ratio of the total number of documents to the number of documents containing the word, so words that appear in many documents are weighted down. By default, scikit-learn also smooths this ratio and scales each document's vector to unit length. In this context, a document is equivalent to a sample utterance.
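
To make the weighting concrete, here is a minimal sketch that recomputes the weights by hand and compares them with TfidfVectorizer. The three toy documents and the variable names are made up for illustration, and the formula assumes scikit-learn's default settings (smoothed IDF and L2 normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents, invented for this illustration only
docs = ["food court", "food court pizza", "shoe store"]

# Term frequency: raw counts of each vocabulary word per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF weight, then scale every row to unit (L2) length
weights = counts * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# The same documents run through TfidfVectorizer with default settings
from_sklearn = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(manual, from_sklearn))
```

Words shared by several documents ("food", "court") end up with a lower IDF than words that appear in only one ("pizza").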

In [50]:
# import python libraries
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
from pprint import pprint # pretty print
from sklearn.feature_extraction.text import TfidfVectorizer # Term Frequency - Inverse Document Frequency
import nltk # I used this to clean my dirty csv
import string # for pre-processing
from nltk.corpus import stopwords # for refining utterances




In [51]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [47]:
# Define some constants
MIN_DF = 0.05 # minimum document frequency (a float is a fraction of the documents); I also changed this trying to improve the model
MAX_DF = 0.5 # maximum document frequency (fraction of the documents); I changed this trying to improve the model, the original is 0.8
MAX_FEATURES = 50000 # maximum number of features
LOWERCASE = False # whether to convert all characters to lowercase before tokenizing (False keeps the original case)

## Choose and Download a Dataset

I used a pre-labeled CSV file of utterances extracted from the What's Up Goldsboro Facebook page.

The comments can be found here: https://www.facebook.com/groups/whatsupgoldsboro/permalink/968963637824334

The raw file is read directly from the GitHub repository in the next cell.

In [3]:
# Read my data from the GitHub repository into a data frame

url = "https://raw.githubusercontent.com/JVCarmich0959/CSC228/main/WUG_Mall_improvement.csv"

df = pd.read_csv(url, usecols=[0,1,2], encoding='latin-1')

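# Note: this concatenates the file with itself, so every row appears twice
# (98 rows become 196). Duplicates can land in both the training and test sets.
df = pd.concat([df, pd.read_csv(url, usecols=[0,1,2], encoding='latin-1')])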
print(df.iloc[1])


Source                                               Facebook
Label                                                       1
Sentence    The mall is a joke. When I worked there 30+ ye...
Name: 1, dtype: object


## Explore the Data

Next, we want to explore the data, so we gain an intuitive understanding.  In our case, we have downloaded a dataset that is quite "clean", so we won't do a lot of **data cleansing**.  Our dataset is also already labeled as positive or negative sentiment, so we will be able to use **supervised machine learning**. 

In [4]:
# Look at the first 5 rows of data 
# and the label is 1 or 0
# 1 meaning positive and 0 meaning negative
df[:5]

Unnamed: 0,Source,Label,Sentence
0,Facebook,0,Anything but another damn shoe store.
1,Facebook,1,The mall is a joke. When I worked there 30+ ye...
2,Facebook,1,A Spencerâs
3,Facebook,1,Arcade for kids and teens that serves food whe...
4,Facebook,1,"Dillards, Talbots, Chicoâs."


In [54]:
# Filthy dirty data makes for an ill-fitted machine, so I went back and pre-processed a bit
import nltk
import string
from nltk.corpus import stopwords

# Build the stop-word set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words("english"))


def clean_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenize words
    words = text.split()
    
    # Remove stop words
    words = [word for word in words if word not in STOP_WORDS]
    
    # Join words back into a single string
    cleaned_text = " ".join(words)
    
    return cleaned_text

df["Sentence"] = df["Sentence"].apply(clean_text)

print(df['Sentence']) # Now that's what I call ITERABLE

0                      anything another damn shoe store
1     mall joke worked 30 years ago nice understand ...
2                                           spencerâs
3                      arcade kids teens serves food kw
4                            dillards talbots chicoâs
                            ...                        
93             guess trying keep times trying save mall
94                                         big gun shop
95                                         put cfa back
96                                       bring back k w
97    home goods food court least kids meet trampoli...
Name: Sentence, Length: 196, dtype: object


In [55]:
# In pandas, columns have data types that are inferred;
# Here we can see that the sentence column is a general 'object' rather than 'string' 
# Sometimes this means that there is missing (or NaN) data
df.dtypes

Source      object
Label        int64
Sentence    object
predict      int64
dtype: object

In [56]:
# What are all the possible values of the source column?
set(df['Source'].tolist())

{'Facebook', 'Facebook '}
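
The output above shows two spellings of the same source, one with a trailing space. A minimal sketch of one way to normalize such a column, using a small made-up frame (`toy` is a hypothetical name, not part of this notebook):

```python
import pandas as pd

# Hypothetical miniature of the Source column, including the stray space
toy = pd.DataFrame({
    "Source": ["Facebook", "Facebook ", "Facebook"],
    "Label": [1, 0, 1],
})
print(toy["Source"].unique())

# Strip leading and trailing whitespace so both spellings collapse into one
toy["Source"] = toy["Source"].str.strip()
print(toy["Source"].unique())
```

Without a step like this, filters such as `df['Source'] == 'Facebook'` silently skip the rows whose value ends in a space.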

In [57]:
# What are all the unique values of the label?
set(df['Label'].tolist())

{0, 1}

In [59]:
# What are the first 5 rows of data from my source?
df[df['Source']=='Facebook'][:5]

Unnamed: 0,Source,Label,Sentence,predict
0,Facebook,0,anything another damn shoe store,1
1,Facebook,1,mall joke worked 30 years ago nice understand ...,1
2,Facebook,1,spencerâs,1
3,Facebook,1,arcade kids teens serves food kw,1
4,Facebook,1,dillards talbots chicoâs,1


### Vocabulary 
The process of selecting specific subsets of rows and columns from data using the Pandas library is relatively straightforward. However, in cases where more complex processing of data columns is required, alternative methods may be necessary.

In this model, we will expand our features to include n-grams, which are contiguous sequences of n words (for example, the bigram "food court"). They let the model capture some local word order that single words miss.

For this purpose, we will make use of the TfidfVectorizer from the scikit-learn library. The TfidfVectorizer can be found within the sklearn.feature_extraction.text module. For more information on this tool, visit the scikit-learn website at https://scikit-learn.org/stable/ and search for "TfidfVectorizer". In comparison to the previous lesson where we created our own custom features, this time we will utilize the features provided by scikit-learn.

In [9]:
# Use sklearn TfidfVectorizer to find the vocabulary in our small list of sentences
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']

small_vectorizer = TfidfVectorizer(min_df=MIN_DF,
    lowercase=LOWERCASE, max_features=MAX_FEATURES, use_idf=True)
small_vectorizer.fit(sentences)
small_vectorizer.vocabulary_

{'John': 0, 'likes': 5, 'ice': 4, 'cream': 2, 'hates': 3, 'chocolate': 1}

In [10]:
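# How many vocabulary words are there?  This small vectorizer uses the default
# ngram_range=(1, 1), so the vocabulary holds only the six single words.
# Bi-grams (like "ice cream") and tri-grams (like "likes ice cream") appear
# only when ngram_range is widened, which makes the vocabulary larger.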
len(small_vectorizer.vocabulary_)

6
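
For comparison, here is a minimal sketch of the same two sentences with a wider n-gram range. The names `toy_sentences` and `ngram_vectorizer` are introduced only for this illustration, and the other settings are left at their defaults except `lowercase=False`, which matches the constant used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_sentences = ['John likes ice cream', 'John hates chocolate.']

# ngram_range=(1, 3) keeps unigrams, bigrams and trigrams
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=False)
ngram_vectorizer.fit(toy_sentences)
ngram_vocab = ngram_vectorizer.vocabulary_

# Vocabulary size, then the n-grams listed in column order
print(len(ngram_vocab))
print(sorted(ngram_vocab, key=ngram_vocab.get))
```

The vocabulary grows from 6 single words to 14 n-grams: 6 unigrams, 5 bigrams and 3 trigrams.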

In [68]:
# Use sklearn TfidfVectorizer to find the vocabulary of our full dataset
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = df['Sentence'].tolist()

vectorizer = TfidfVectorizer(min_df=MIN_DF, ngram_range=(1,2),  # dinkying with n-gram size a bit to see if I can get a better model.
  lowercase=LOWERCASE, max_features=MAX_FEATURES,use_idf=True)
vectorizer.fit(sentences) # train the vectorizer
# The next line prints the vocabulary of our entire dataset.
# Warning: with a large vocabulary the output is very long and cuts off at 5000,
# so you may want to comment the line out and run the cell again.
pprint(vectorizer.vocabulary_, width=79, compact=True)

# look at the clean list now!

{'back': 0,
 'bring': 1,
 'court': 2,
 'five': 3,
 'food': 4,
 'food court': 5,
 'good': 6,
 'goods': 7,
 'home': 8,
 'home goods': 9,
 'kids': 10,
 'kw': 11,
 'like': 12,
 'mall': 13,
 'malls': 14,
 'movie': 15,
 'needs': 16,
 'people': 17,
 'pizza': 18,
 'place': 19,
 'secret': 20,
 'shop': 21,
 'spencers': 22,
 'store': 23,
 'stores': 24,
 'trampoline': 25,
 'ulta': 26,
 'victoria': 27,
 'would': 28}


In [69]:
len(vectorizer.vocabulary_) # looking at the size of the vocabulary from my data set.

29

### Create a Feature Vector using TfidfVectorizer

The TfidfVectorizer is a tool that performs tokenization and generates a feature vector. By using the transform() method on a trained TfidfVectorizer object, we can build a representation of our sample sentences. In the next cell we have two sentences, resulting in a two-row array with one column per vocabulary n-gram. Because max_df is 0.5, every n-gram that appears in both sentences (such as 'The', 'mall', and 'mall is') is dropped, which leaves four columns: 'bad', 'good', 'is bad', and 'is good'. Each value is the TF-IDF weight of that n-gram in the sentence rather than a raw count, and each row is scaled to unit length. For example, the first column (index 0) holds the weight of 'bad', which is 0 for the first sentence and about 0.707 for the second.





In [70]:
# Let's build a feature vector that describes our sample sentences.
# This provides an array where each entry is the TF-IDF weight of the
# vocabulary n-gram associated with the column (0 if it is not in the sentence)

sample_sentences = ['The mall is good.', 'The mall is bad.']
sample_vectorizer = TfidfVectorizer(min_df=MIN_DF, max_df=MAX_DF, ngram_range= (1,2),max_features=MAX_FEATURES,
lowercase=LOWERCASE, use_idf=True)
sample_vectorizer.fit(sample_sentences)

sample_vectorizer.transform(sample_sentences).toarray()


array([[0.        , 0.70710678, 0.        , 0.70710678],
       [0.70710678, 0.        , 0.70710678, 0.        ]])

This code creates an instance of the TfidfVectorizer class from the scikit-learn library, which is used for text data preprocessing. The code defines the parameters for the vectorizer object and applies the fit method to the sample sentences, which trains the vectorizer on the sample sentences.

The TfidfVectorizer takes the following parameters:

- min_df: Minimum document frequency an n-gram needs to be included. An integer is a document count; a float (as used here) is a fraction of the documents.
- max_df: Maximum document frequency an n-gram can have and still be included, with the same integer/float convention.
- ngram_range: Range of n-grams to consider, defined as a tuple of minimum and maximum n-gram size. For example, (1, 2) means unigrams and bigrams, and (1, 3) adds trigrams.
- max_features: The maximum number of features to keep, chosen by highest term frequency across the corpus.
- lowercase: Whether to convert all text to lowercase before tokenizing.
- use_idf: Enable inverse-document-frequency reweighting.

The transform method takes the sample sentences and converts them into a numerical representation, which is stored in a sparse matrix. The toarray method is called to convert the sparse matrix to a dense matrix, which returns the array of numerical values that represent the sample sentences.

In [71]:
# Let's try adding a few more words to our example sentences so we have 
# duplicate words in some sentences.

sample_sentences_2 = ['The mall is a good place to shop, and I eat there often.', 'The mall is a bad place to shop, I never eat there.']

sample_vectorizer_2 = TfidfVectorizer(min_df=MIN_DF, max_df=MAX_DF, ngram_range= (1,3),max_features=MAX_FEATURES,
lowercase=LOWERCASE, use_idf=True)

sample_vectorizer_2.fit(sample_sentences_2)

sample_vectorizer_2.transform(sample_sentences_2).toarray()

array([[0.25819889, 0.25819889, 0.25819889, 0.        , 0.        ,
        0.        , 0.25819889, 0.25819889, 0.25819889, 0.25819889,
        0.        , 0.        , 0.25819889, 0.25819889, 0.        ,
        0.25819889, 0.        , 0.        , 0.        , 0.25819889,
        0.25819889, 0.25819889, 0.        , 0.        , 0.25819889,
        0.25819889, 0.        ],
       [0.        , 0.        , 0.        , 0.28867513, 0.28867513,
        0.28867513, 0.        , 0.        , 0.        , 0.        ,
        0.28867513, 0.28867513, 0.        , 0.        , 0.28867513,
        0.        , 0.28867513, 0.28867513, 0.28867513, 0.        ,
        0.        , 0.        , 0.28867513, 0.28867513, 0.        ,
        0.        , 0.28867513]])

In [73]:
# We can just as easily create feature vectors for all sentences in our dataframe.
# Here we see that sklearn, knowing that there are mostly zeros in the array,
# used a sparse object to represent the array.
vectorizer.transform(df['Sentence'].tolist())

# This is a nice even number :)

<196x29 sparse matrix of type '<class 'numpy.float64'>'
	with 380 stored elements in Compressed Sparse Row format>

## Build a Baseline Model

When you work with machine learning, one important step is to define a **baseline model**. The baseline model is a simple model that can be quickly created.  It is used to compare more advanced models that you want to experiment with.  In our case, we will compare the baseline model with more advanced methods using (deep) neural network models.  

We will start by taking our dataframe of labeled data and looking only at rows whose source is 'Facebook' to train and test our model.

### Train/Test Split

The data will be partitioned into a training set and a test set in accordance with conventional practice, with 80% of the data utilized for training the model and 20% of the data reserved for testing the model's performance. After training on the training data, the model's effectiveness will be assessed using the unseen test data set.

To mitigate the impact of any inherent ordering in the data collection process, a random partitioning strategy will be employed. The train_test_split() method from the scikit-learn library will be utilized for randomly dividing the data. The random_state parameter will be set to 1000, which serves as a seed for the random number generator, ensuring consistent and reproducible results. The specific value chosen for the random_state parameter is arbitrary.

In [74]:
from sklearn.model_selection import train_test_split

df_facebook = df[df['Source'] == 'Facebook']

sentences = df_facebook['Sentence'].values

y = df_facebook['Label'].values

# split the data into training and test sets
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, y, test_size=0.2, random_state=1000)
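
Because the loading cell concatenates the CSV with itself, a random split can place one copy of a sentence in the training set and the other in the test set. The following is a minimal sketch of how to check for that overlap and avoid it, using a handful of cleaned rows shown in this notebook; the names `toy`, `doubled` and `deduped` are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A few cleaned rows from the outputs above, used as a miniature dataset
toy = pd.DataFrame({
    "Sentence": ["knock", "pipe dreaming", "spencers hot topic 5",
                 "kw bring backkkkk", "arcade kids teens serves food kw"],
    "Label": [0, 0, 1, 1, 1],
})
doubled = pd.concat([toy, toy])  # every row now appears twice

# Split the doubled data and count test sentences also present in training
train, test = train_test_split(doubled, test_size=0.2, random_state=1000)
leaked = set(train["Sentence"]) & set(test["Sentence"])
print(f"Test sentences also seen in training: {len(leaked)}")

# Dropping exact duplicates before splitting removes the overlap
deduped = doubled.drop_duplicates(subset=["Sentence", "Label"])
train_d, test_d = train_test_split(deduped, test_size=0.2, random_state=1000)
leaked_d = set(train_d["Sentence"]) & set(test_d["Sentence"])
print(f"After de-duplication: {len(leaked_d)}")
```

When test sentences were also seen during training, the reported accuracy tends to look better than it would on truly new comments.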

### Train TfidfVectorizer and Build Feature Vectors

To train the TfidfVectorizer on the sentences in our training data, we can use the fit() method. This method will perform the training necessary to generate the feature vectors.

Once the TfidfVectorizer has been trained, we can use the transform() method to create numerical arrays representing the features for both the training and testing sets. These arrays will serve as the feature vectors that we will use to train our model. It is recommended to use the variable names X_train and X_test for these numeric feature vectors, respectively.

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer

# I dinkied with the n-gram size here too.

vectorizer_facebook = TfidfVectorizer(min_df=MIN_DF, max_df=MAX_DF, ngram_range= (1,2),max_features=MAX_FEATURES,
lowercase=LOWERCASE, use_idf=True)
vectorizer_facebook.fit(sentences_train)

# build our feature vectors
X_all = vectorizer_facebook.transform(sentences)
X_train_facebook = vectorizer_facebook.transform(sentences_train)
X_test_facebook = vectorizer_facebook.transform(sentences_test)

X_train_facebook

# There are 461 stored elements within the data set with unigrams
# There are 503 stored elements in the sparse matrix with bi-grams
# With the cleaned data and bi-grams I have 238 elements!


<134x26 sparse matrix of type '<class 'numpy.float64'>'
	with 238 stored elements in Compressed Sparse Row format>

### Consider our Feature Vectors

After the train/test split, the "What's Up Goldsboro" feature vectors in X_train_facebook have 134 samples, representing the 134 sentences in the training data. The number of columns (26 here) is determined by the vocabulary n-grams present in the training data. Each sentence is thus represented as a vector of decimal numbers.

It is important to note that the size of the vocabulary n-grams and the number of samples can vary depending on the proportions selected for the train/test split or the random number seed chosen for randomizing the data.

Additionally, it can be observed that the feature vectors obtained are in the form of a sparse matrix. This type of matrix is optimized for matrices with a relatively small number of non-zero elements, as it only retains information about these non-zero elements, reducing the memory requirements.
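
As a rough illustration of how sparse these matrices are, the sketch below measures the share of stored values in a tiny example. The toy sentences and names are invented for this purpose. Applied to the training matrix above, the same arithmetic gives 238 / (134 × 26), or about 6.8% of the cells.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for this illustration
toy_sentences = ["bring back kw", "food court kids", "home goods food court"]
X_toy = TfidfVectorizer().fit_transform(toy_sentences)

n_rows, n_cols = X_toy.shape
# nnz is the number of values the sparse matrix actually stores
density = X_toy.nnz / (n_rows * n_cols)
print(f"{X_toy.nnz} stored of {n_rows * n_cols} cells ({density:.1%} dense)")
```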

### Build LogisticRegression Model as Baseline

The logistic regression model is a simple and effective linear model used for classification. It computes a weighted sum of the input feature vector and passes it through the logistic (sigmoid) function to produce a probability between 0 and 1. By default, a cutoff value of 0.5 turns that probability into a predicted label. The cutoff can be adjusted based on specific requirements, for example when one kind of error is more costly than the other.
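
The relationship between the probability and the cutoff can be sketched on toy data. The sentences, labels and the stricter 0.7 cutoff below are all made up for illustration; only `predict_proba()` and `predict()` are standard LogisticRegression methods.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for this illustration (1 = positive, 0 = negative)
toy_sentences = ["good mall", "nice stores", "good food",
                 "bad mall", "boring stores", "bad food"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_sentences)
toy_clf = LogisticRegression().fit(X_toy, toy_labels)

# Probability of the positive class for each sentence
proba_positive = toy_clf.predict_proba(X_toy)[:, 1]

# predict() behaves like a 0.5 cutoff on that probability
default_pred = toy_clf.predict(X_toy)
manual_pred = (proba_positive > 0.5).astype(int)

# A stricter, hypothetical cutoff labels fewer sentences as positive
strict_pred = (proba_positive > 0.7).astype(int)
print(proba_positive.round(3), default_pred, strict_pred)
```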

In [76]:
from sklearn.linear_model import LogisticRegression

classifier_facebook = LogisticRegression() # instantiate the model
classifier_facebook.fit(X_train_facebook, y_train) # train the model
score = classifier_facebook.score(X_test_facebook, y_test) # test the model

print("Accuracy of our What's Up Goldsboro Facebook Classifier: {:.4f}".format(score))

# Using a Logistic Regression model is pretty straightforward but doesn't provide the most state of the art results

Accuracy of our What's Up Goldsboro Facebook Classifier: 0.7353


In [77]:
df[df['Source']=='Facebook'][:10]

Unnamed: 0,Source,Label,Sentence,predict
0,Facebook,0,anything another damn shoe store,1
1,Facebook,1,mall joke worked 30 years ago nice understand ...,1
2,Facebook,1,spencerâs,1
3,Facebook,1,arcade kids teens serves food kw,1
4,Facebook,1,dillards talbots chicoâs,1
5,Facebook,1,spencers hot topic 5,1
6,Facebook,1,victoriaâs secret pottery barn kohlâs amer...,1
7,Facebook,1,kw bring backkkkk,1
8,Facebook,1,shein fashion nova store wishful thinking,1
9,Facebook,0,knock,0


In [78]:
def explore_facebook_classifier(user_input):
    """explore the facebook classifier results"""
    X_user = vectorizer_facebook.transform([user_input])
    #We needed an array that is 1 x our vocabulary size to predict()
    # so we added [] around our sentence.
    result = classifier_facebook.predict(X_user)
    #predict() returns a list of predicted results, since we only have one input
    # we index on [0]
    # print(result[0])
    # We can display a nicer message since we know what these labels mean.
    if result[0] == 1:
        predicted_label = 'Positive'
    else:
        predicted_label = 'Negative'
    print(f"Utterances: {user_input}")
    print(f"Predicted Label: {predicted_label}")

In [79]:
# Let's try some examples
explore_facebook_classifier('The mall is good.')

Utterances: The mall is good.
Predicted Label: Positive


In [80]:
# Here is a sentence with negation.
# It uses "not" and "hate", which are both negative words, but "not hate" is
# intuitively positive.  Bag-of-Words feature vectors cannot capture that
# kind of negation reliably.
explore_facebook_classifier("I do not hate the mall.")

# The prediction comes out Positive, but probably not because the model
# understood "not hate": words missing from the small vocabulary are ignored.

Utterances: I do not hate the mall.
Predicted Label: Positive


It appears that the issue with the current model may be attributed to a lack of sufficient training data. With only 98 utterances in the local domain, the model may have limited exposure to diverse inputs and may not be capable of accurately making predictions for new sentences. This limited training data can result in a narrow perspective for the model and negatively impact its classification effectiveness.


In [81]:
# Two more examples
explore_facebook_classifier("bring it back.") # This is a sentiment that was frequently labeled 0 in my csv
explore_facebook_classifier("The mall is boring.") # This is a new sentence that was not in the data set and should be predicted negative.

# The model is functioning just as we told it to, but it requires more examples of positive and negative utterances.
# I think it's only memorizing the 0 and 1 labels...

Utterances: bring it back.
Predicted Label: Negative
Utterances: The mall is boring.
Predicted Label: Positive
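
One way to investigate predictions like these is to check which n-grams of an input the vectorizer actually recognizes, since anything outside the vocabulary is ignored by transform(). This is a minimal sketch on a toy vocabulary; `known_ngrams` is a hypothetical helper and the training sentences are invented, while `build_analyzer()` is a standard method of the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training sentences, already cleaned the way this notebook cleans its data
toy_train = ["bring back kw", "food court kids", "mall boring", "bring back food court"]
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=False)
toy_vectorizer.fit(toy_train)


def known_ngrams(vectorizer, text):
    """Return the n-grams of text that exist in the fitted vocabulary."""
    analyzer = vectorizer.build_analyzer()  # same tokenization transform() uses
    return [gram for gram in analyzer(text) if gram in vectorizer.vocabulary_]


print(known_ngrams(toy_vectorizer, "I do not hate the mall."))
print(known_ngrams(toy_vectorizer, "The mall is boring."))
```

In this toy setup the first sentence is reduced to 'mall' alone. The second keeps 'mall' and 'boring' but not the bigram 'mall boring', because the uncleaned input still has "is" between the two words.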


In [82]:
# Add the model predictions to the dataframe
df['predict'] = classifier_facebook.predict(vectorizer_facebook.transform(df['Sentence'].tolist()))

In [83]:
# Dataframe where the model predicted correctly
df_correct = df[df['predict'] == df['Label']]

In [84]:
# Look at some correct predictions
df_correct[:10]

# This is what I expected from exploring my classifier using TF-IDF vectorization
# It appears to memorize the positive and negative utterances like flashcards

Unnamed: 0,Source,Label,Sentence,predict
1,Facebook,1,mall joke worked 30 years ago nice understand ...,1
2,Facebook,1,spencerâs,1
3,Facebook,1,arcade kids teens serves food kw,1
4,Facebook,1,dillards talbots chicoâs,1
5,Facebook,1,spencers hot topic 5,1
6,Facebook,1,victoriaâs secret pottery barn kohlâs amer...,1
7,Facebook,1,kw bring backkkkk,1
8,Facebook,1,shein fashion nova store wishful thinking,1
11,Facebook,1,kw victoria secret holliester food court perha...,1
12,Facebook,1,amc theater alamo drafthouse trampoline park s...,1


In [85]:
# Dataframe where the model predicted incorrectly
df_incorrect = df[df['predict'] != df['Label']]

In [86]:
# Look at some incorrect predictions
df_incorrect[:10]

# I'm not sure why it didn't predict these utterances as negative. There are a lot of instances
# of sarcasm or figurative language that the model may not be picking up on. This might be fixable
# if we change the size of the n-grams. I'm thinking the bi-gram "Anything but" should be a negative
# sentiment, but the cleaning step removed the stop word "but", so that bi-gram never reaches the model.

Unnamed: 0,Source,Label,Sentence,predict
0,Facebook,0,anything another damn shoe store,1
9,Facebook,0,knock,1
10,Facebook,0,cant believe one said blockbusteror wafâ¦neve...,1
16,Facebook,0,pipe dreaming,1
20,Facebook,0,agree mall concept outdated need combine stars...,1
25,Facebook,0,bad cant buy punctuation,1
29,Facebook,0,turn whole mall big drive car wash im convince...,1
33,Facebook,0,wonât happen,1
36,Facebook,0,food options food always brings people pretzel...,1
43,Facebook,0,want see mall brings go greenville least three...,1


# Building a Logistic Regression Model

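Here we train a separate logistic regression model for each value in the Source column and report its accuracy on the held-out test data. Because Source contains both 'Facebook' and 'Facebook ' (with a trailing space), two classifiers are reported.
(Even though I don't think it's...really that accurate :/) 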


In [87]:
for source in df['Source'].unique():  # iterate over the sources
    df_source = df[df['Source'] == source] # get the data for the source
    sentences = df_source['Sentence'].values # get the sentences
    y = df_source['Label'].values # get the labels

    # split the data into training and test sets
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.2, random_state=1000)

    # train the TfidfVectorizer with ngrams=(1,3), and setting a max_features to
    # avoid an explosion of features
    vectorizer = TfidfVectorizer(min_df=MIN_DF, max_df=MAX_DF, ngram_range= (1,3),max_features=MAX_FEATURES,
    lowercase=LOWERCASE, use_idf=True)
    vectorizer.fit(sentences_train) # fit the vectorizer to the training data
    X_train = vectorizer.transform(sentences_train) # transform the training data
    X_test = vectorizer.transform(sentences_test) # transform the test data

    classifier = LogisticRegression() # instantiate the model
    classifier.fit(X_train, y_train) # train the model

    score = classifier.score(X_test, y_test) # test the model
    print(f"Accuracy of our {source} Classifier: {score:.4f}") # print the results
    

Accuracy of our Facebook Classifier: 0.7353
Accuracy of our Facebook  Classifier: 1.0000
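
An accuracy figure is easier to judge next to a trivial reference point. The sketch below scores a classifier that always predicts the most frequent training label. The label arrays are invented for illustration and are not this notebook's data; `DummyClassifier` is part of scikit-learn.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with a positive majority, for illustration only
y_train_toy = np.array([1, 1, 1, 1, 0, 1, 0, 1])
y_test_toy = np.array([1, 0, 1, 1])

# DummyClassifier ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_train_toy, y_train_toy)
baseline_accuracy = majority.score(X_test_toy, y_test_toy)
print(f"Majority-class accuracy: {baseline_accuracy:.4f}")
```

If a trained model scores close to this reference, it may be doing little more than predicting the majority label.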
