In [1]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
!pip install pandas scikit-learn nltk



# Assignment 3 - Text Mining

Project management and tools for health informatics

## 1. Download and prepare data:

**Do not alter the code in this Section!**

The code in this section downloads the [IMDB IMDB Large Movie Review Dataset]('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz') which is the dataset you will be working on in this assignment.

In [4]:
import os
import tarfile
from urllib.request import urlretrieve

In [5]:
if not os.path.exists('aclImdb'):
    # download data:
    urlretrieve('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', 'aclImdb.tar.gz')

    # unzip data:
    with tarfile.open('aclImdb.tar.gz') as file:
        file.extractall('./')

## 2. Some helper Functions:

**Do not alter the code in this Section!**

This section contains the code for some helper functions that will be useful for solving the assignment. Example code on how to use the functions is provided in section 3.

In [6]:
import pandas as pd

from typing import Literal, Tuple, Iterable

Function for loading data into a pandas dataframe:

In [7]:
def load_data(split:Literal['train', 'test'], texts_per_class:int=500) -> pd.DataFrame:
    ''' Loads the data into a pandas dataframe.'''
    paths  = []
    labels = []

    for label in ('pos', 'neg'):
        # get all files in the folder:
        files = os.listdir(os.path.join('aclImdb', split, label))[:texts_per_class]

        # append them to the lists:
        paths.extend([os.path.join('aclImdb', split, label, f) for f in files])
        labels.extend([label] * len(files))

    return pd.DataFrame({'path':paths, 'label':labels})

Function for loading a specific text:

In [8]:
def load_text(path:str) -> str:
    ''' Reads a single text given the path. '''
    # read file from disk:
    with open(path, 'r', encoding='utf8') as file:
        s = file.read()

    return s

Function for iterating through multiple texts:

In [9]:
def iterate_texts(data:pd.DataFrame) -> Iterable[Tuple[str, str]]:
    ''' Iterates through a pandas dataframe. '''

    for path in data['path'].values:
        # read file from disk:
        with open(path, 'r', encoding='utf8') as file:
            text = file.read()

        yield text

## 3. Your Code:

**Alter the code below to complete the assignment!**

Load the training data:

In [10]:
data_train = load_data('train')
data_test  = load_data('test')
data_train

Unnamed: 0,path,label
0,aclImdb\train\pos\0_9.txt,pos
1,aclImdb\train\pos\10000_8.txt,pos
2,aclImdb\train\pos\10001_10.txt,pos
3,aclImdb\train\pos\10002_7.txt,pos
4,aclImdb\train\pos\10003_8.txt,pos
...,...,...
995,aclImdb\train\neg\10446_2.txt,neg
996,aclImdb\train\neg\10447_1.txt,neg
997,aclImdb\train\neg\10448_1.txt,neg
998,aclImdb\train\neg\10449_4.txt,neg


### Accessing the texts:

In [11]:
# Sample code: load a single text
load_text(data_train.loc[0, 'path'])

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'

In [12]:
# Sample code: iterate through all texts
for text in iterate_texts(data_train[:20]):
    print(text)

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to

### A simple pipeline:

**White-Space tokenization:**

In [13]:
def tokenize(text:str):
    ''' An example tokenization function. '''

    # simple white-space tokenization:
    return text.lower().split()


**Bag-of-words Embedding:**

See documentation of [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# create a simple bag of words embedding:
bow = CountVectorizer(

    # the next line converts the filepaths to the actual texts:
    preprocessor = load_text,

    # tokenization function from above:
    tokenizer = tokenize

)

# train the embedding:
embeddings_train = bow.fit_transform(data_train['path'].values)

# vectorize test data:
embeddings_test = bow.transform(data_test['path'].values)



**Classification with a linear SVM**

See documentation of [sklearn.svm.LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)

In [15]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

svm = LinearSVC()

# train classifier:
svm.fit(embeddings_train, data_train['label'].values)

# test classifier:
predictions = svm.predict(embeddings_test)

# Calculate Accuracy:
print('Accuracy:', accuracy_score(data_test['label'].values, predictions))

Accuracy: 0.759




# 1.Preparation

a, b. The datset is downloaded, and above cells of this notebook is run to know the codes, how they work and the process of text-mining.  

c. Some codes below are executed to print some texts from the dataset, and to understand the above functions and the dataframe.

In [16]:
#to know the above 3 rows of training set
data_train[0:3]

Unnamed: 0,path,label
0,aclImdb\train\pos\0_9.txt,pos
1,aclImdb\train\pos\10000_8.txt,pos
2,aclImdb\train\pos\10001_10.txt,pos


In [17]:
# to know the path of the text file in 3rd row of training data. The path is returned as string.
data_test.loc[3, 'path']

'aclImdb\\test\\pos\\10002_8.txt'

In [18]:
#to load the text in the txt file at 3rd row of training data
#using the function load_text to know how it returns
load_text(data_test.loc[3, 'path'])

"I saw this film in a sneak preview, and it is delightful. The cinematography is unusually creative, the acting is good, and the story is fabulous. If this movie does not do well, it won't be because it doesn't deserve to. Before this film, I didn't realize how charming Shia Lebouf could be. He does a marvelous, self-contained, job as the lead. There's something incredibly sweet about him, and it makes the movie even better. The other actors do a good job as well, and the film contains moments of really high suspense, more than one might expect from a movie about golf. Sports movies are a dime a dozen, but this one stands out. <br /><br />This is one I'd recommend to anyone."

In [19]:
#run the function, iterate_texts, to know its return which is an object
# the object relates to the first 3 rows of training data
iterate_texts(data_train[0:2])

<generator object iterate_texts at 0x0000016CF1EAE240>

In [20]:
#extract the texts in the text files of paths at first 5 rows of training data
# use for loop to extract the texts from the object returned by iterate_texts
for text in iterate_texts(data_train[:5]):
    print(text)

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to

In [21]:
#for clarity, insert a space line between texts
for text in iterate_texts(data_train[:5]):
    print(text)
    print(' ')

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
 
Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going 

In [22]:
#to know the result of bag-of-winds(BOW) embeddings, print the embeddings_train which is the return of BOW embeddings of training data 
#it returns compressed sparse row sparse matix
embeddings_train

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 150666 stored elements and shape (1000, 30586)>

In [23]:
#to know the result of tokenization of a text form path of 2nd row of training data,
#first, text is extracted using load_text function, then, this text is tokenized with tokenize function
#after printing the return 'text1' of tokenize function, it generate the list of tokens form the text
text1 = tokenize(load_text(data_train.loc[1, 'path']))
text1

['homelessness',
 '(or',
 'houselessness',
 'as',
 'george',
 'carlin',
 'stated)',
 'has',
 'been',
 'an',
 'issue',
 'for',
 'years',
 'but',
 'never',
 'a',
 'plan',
 'to',
 'help',
 'those',
 'on',
 'the',
 'street',
 'that',
 'were',
 'once',
 'considered',
 'human',
 'who',
 'did',
 'everything',
 'from',
 'going',
 'to',
 'school,',
 'work,',
 'or',
 'vote',
 'for',
 'the',
 'matter.',
 'most',
 'people',
 'think',
 'of',
 'the',
 'homeless',
 'as',
 'just',
 'a',
 'lost',
 'cause',
 'while',
 'worrying',
 'about',
 'things',
 'such',
 'as',
 'racism,',
 'the',
 'war',
 'on',
 'iraq,',
 'pressuring',
 'kids',
 'to',
 'succeed,',
 'technology,',
 'the',
 'elections,',
 'inflation,',
 'or',
 'worrying',
 'if',
 "they'll",
 'be',
 'next',
 'to',
 'end',
 'up',
 'on',
 'the',
 'streets.<br',
 '/><br',
 '/>but',
 'what',
 'if',
 'you',
 'were',
 'given',
 'a',
 'bet',
 'to',
 'live',
 'on',
 'the',
 'streets',
 'for',
 'a',
 'month',
 'without',
 'the',
 'luxuries',
 'you',
 'once',


In [24]:
# to know the type of return of tokenize function and the number of tokens
print('type of tokenized result:',type(text1), 'it includes',len(text1),'tokens.')

type of tokenized result: <class 'list'> it includes 428 tokens.


### Limitations
 1. It is hard to visualize the embeddings matrics (embeddings_train and embedding_test) although the shape can be seen. It is required to write additional cell to see tokens for comprehension <br />
2. In data loading, it relies on a fixed folder structure. If repositories of train and test data change, it would lead to error. <br />
3. It uses white-space tokenization. It is simple and efficient but has some limitations such that it cannot separate punctuation and contraction, so punctuation will attached to words, ' can't' cannot be separated into 'can' and 'not' respectively. It is not so efficient in language which do not usually use white spaces.<br />
4. There is no text normalization process in the original pipeline. So, it can have difficulties in handling inconsistent words, case sensitivity issues, miss spellings, special character and stop words. This could lead to adding more noisy features in model training, affecting the generaliziblity and performance of the models. 
5. Only Bag of Words (BoW) Embedding is used. It is simple but it limits the sementic meaning of the words.<br />
6. Regarding model training and testing, SVC is only used and so it would miss other Machine learning models, more suitable for complex datasets, giving better accuracy and performance.<br />
7. This text mining pipeline is simple but lack advanced techniques like TF-IDF that can empahsize on term importance and semantic relationship. 

## 2. Tokenization

We choose to stem our film review text to reduce words to their base forms

In [30]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [35]:
# Initialize the stemmer
stemmer = PorterStemmer()

# Function to tokenize and stem the text
def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    
    return stemmed_tokens


## 3. Embedding

b. TD-IDF embeddings is implemented using TfidfVectorizer of sklearn. 

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a new instance of TfidfVectorizer with basic configuration
tfidf_vectorizer = TfidfVectorizer(
    tokenizer=tokenize,  # Use white-space tokenizer above
    stop_words='english',  # Remove English stop words
    ngram_range=(1, 2)  # Unigrams and bigrams
)

Then, transform the text for training and testing into vectorized form using the TD-IDF embeddings, making them ready for developing prediction model.

In [26]:
X_train_tfidf = tfidf_vectorizer.fit_transform(data_train['path'].values)
X_test_tfidf = tfidf_vectorizer.transform(data_test['path'].values)



In [27]:
#print the training data to know its type and shape
X_train_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1000 stored elements and shape (1000, 1000)>

In [28]:
#print the testing data to know its type and shape
X_test_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 0 stored elements and shape (1000, 1000)>

Then, a classifier model, linear SVM is developed by training the training data, and its performance, accuracy is tested on testing data.

In [29]:
from sklearn.svm import LinearSVC

# Train the classifier on the TF-IDF vectors
svm_model_tfidf = LinearSVC()
svm_model_tfidf.fit(X_train_tfidf, data_train['label'].values)

# Evaluate on the test set
y_pred_tfidf = svm_model_tfidf.predict(X_test_tfidf)

# Calculate accuracy
from sklearn.metrics import accuracy_score
accuracy_tfidf = accuracy_score(data_test['label'].values, y_pred_tfidf)

print(f"TF-IDF Model Accuracy: {accuracy_tfidf:.4f}")

TF-IDF Model Accuracy: 0.5000


The accuracy of linear SVM model trained on data from TD_IDF embeddings is 0.5