# Text Cleaning/Pre-processing

This notebook covers the following:

## Text conversion

    - .docx
    - .pdf
    - Web scraping
    - Text cleaning

In [1]:
# Load the Drive helper
from google.colab import drive

# Below will prompt for authorization but it will make your google drive available (i.e., mount your drive).
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Find out where you are and move to the correct location
import os  # standard-library module for interacting with the operating system

os.getcwd()  # what is the current working directory?

#os.listdir()  # what is in the current working directory?

os.chdir("/content/drive/MyDrive/Colab Notebooks/DS_5780_spring_25/text_cleaning") #change directory

os.listdir() #data is there

['childes.txt',
 'Copy of syllabus.txt',
 'annotations_cleaned.txt',
 'syllabus.txt',
 'boiler_plate_cleaned.txt',
 'Crossley_DS_5780_spring_25_final.docx',
 'annotations.txt',
 'tweet.txt',
 'mikolov.txt',
 'boiler_plate.txt',
 'misspellings_punct.txt',
 'mikolov.pdf']

### Text Conversion

**.docx conversion**

Using docx2txt

  - extract text from Microsoft Word (.docx) files



In [4]:
!pip install docx2txt
# Needs to be installed in every new session because it is not one of Colab's preinstalled libraries
# docx2txt is a Python library that allows you to extract text and images
# from Microsoft Word (.docx) files.




In [5]:
import docx2txt

# Extract text from the .docx file
text = docx2txt.process("Crossley_DS_5780_spring_25_final.docx")

print(text[:1000])  # print the first 1,000 characters
# pretty good

# Save the text to a .txt file
# Create a file object (f) in write mode ("w") and write the text to it
with open("syllabus.txt", "w", encoding="utf-8") as f:
    f.write(text)

DS 5780

Natural Language Processing 

Spring 2025

M/W 8:30-9:45 am, 17th & Horton A2000

The course syllabus provides a general plan for the course; deviations may be necessary.



1. Course information



Instructor

Scott Crossley

Office Hours

Teaching Assistants

Langdon Holmes 

Office Hours

Wesley Morris

Office Hours

17th & Horton B2002G

Email: scott.crossley@vanderbilt.edu

Monday, 10-11 am, virtual office hours on Zoom by request

17th & Horton B2002G

Email: langdon.holmes@Vanderbilt.Edu

Tuesday, 12-1 pm, virtual office hours on Zoom by request

Email: wesley.g.morris@Vanderbilt.Edu

Wednesday, 10-11 am, virtual office hours on Zoom by request



2. Course description



This course is colloquially titled “from tokens to transformers” and will focus on using computers to automatically analyze language data for linguistic features. The goal of the course is to provide students with the background and computing skills necessary to independently analyze and assess languag


**.pdf conversion**

Using PyMuPDF

In [6]:
# need to install each time as well
!pip install PyMuPDF
# allows Python to work with PDF files
# reading and extracting information



Collecting PyMuPDF
  Downloading pymupdf-1.25.2-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.2-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.2


In [7]:

import fitz  # From PyMuPDF
# fitz is the main module you interact with to work with PDF files

# Open the PDF of the Mikolov et al. paper
# ("Distributed Representations of Words and Phrases and their Compositionality")
with fitz.open("mikolov.pdf") as doc:  # "with" closes the document as soon as the block ends
    # Extract text from all pages
    text = ""  # start with an empty string
    for page in doc:  # go through each page in doc
        text += page.get_text()  # append the text of that page


# Print the text
print(text[:2000]) #print first 2,000 characters

# Save the text to a .txt
with open("mikolov.txt", "w", encoding="utf-8") as f:
    f.write(text)

Distributed Representations of Words and Phrases
and their Compositionality
Tomas Mikolov
Google Inc.
Mountain View
mikolov@google.com
Ilya Sutskever
Google Inc.
Mountain View
ilyasu@google.com
Kai Chen
Google Inc.
Mountain View
kai@google.com
Greg Corrado
Google Inc.
Mountain View
gcorrado@google.com
Jeffrey Dean
Google Inc.
Mountain View
jeff@google.com
Abstract
The recently introduced continuous Skip-gram model is an efﬁcient method for
learning high-quality distributed vector representations that capture a large num-
ber of precise syntactic and semantic word relationships. In this paper we present
several extensions that improve both the quality of the vectors and the training
speed. By subsampling of the frequent words we obtain signiﬁcant speedup and
also learn more regular word representations. We also describe a simple alterna-
tive to the hierarchical softmax called negative sampling.
An inherent limitation of word representations is their indifference to word order
and their
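The extracted text above still carries two typical PDF artifacts: typographic ligatures (the single character `ﬁ` in "efﬁcient") and words hyphenated across line breaks ("num-" / "ber"). Below is a minimal sketch of one way to repair both. It assumes that Unicode NFKC normalization is acceptable for your data (it also folds other compatibility characters) and that a hyphen directly before a line break always marks a split word, which is not true for genuinely hyphenated compounds. `clean_pdf_text` is a hypothetical helper name, not part of PyMuPDF.

```python
import re
import unicodedata

def clean_pdf_text(raw):
    # NFKC replaces compatibility characters such as the "fi" ligature (U+FB01)
    # with their plain-letter equivalents
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were split across lines: "num-\nber" -> "number"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text

# "\ufb01" is the single-character "fi" ligature seen in the output above
sample = "an ef\ufb01cient method that captures a large num-\nber of relationships"
print(clean_pdf_text(sample))
```

In the notebook this could be applied to `text` before it is written to `mikolov.txt`.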

**Web scraping**

Using BeautifulSoup

What the code below does

    - Makes a GET request (with the requests library) and stores the response
    - Checks the status code (typically 200 for success)
      - i.e., can you access the data?
    - Parses the HTML with BeautifulSoup and stores the result
    - Prints the prettified HTML
      - It's not pretty



In [None]:
#!pip install beautifulsoup4
# should be installed in colab as a base package

In [8]:
import requests
from bs4 import BeautifulSoup


# Make a GET request and store the response as r
r = requests.get('https://www.bbc.com/news/articles/cm2enepy8g8o')

# Is the data available?
# The status code should be 200 on success
print("Response: ", r)

# Parse the html using html.parser and store in soup
soup = BeautifulSoup(r.content, 'html.parser')


Response:  <Response [200]>


**Important**

The researcher needs to identify all the tags to remove
  - This can be a tedious process
  - Below is an example of how it can be done (imperfectly)

In [9]:
print(soup.prettify()[:1000]) #print the first 1,000 characters

<!DOCTYPE html>
<html lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <title>
   Dick Van Dyke and Cher forced to evacuate Franklin fire in Malibu
  </title>
  <meta content="Dick Van Dyke and Cher forced to evacuate Franklin fire in Malibu" property="og:title"/>
  <meta content="Dick Van Dyke and Cher forced to evacuate Franklin fire in Malibu" name="twitter:title"/>
  <meta content="The blaze has burned more than 4,000 acres in the celebrity enclave near Los Angeles." name="description"/>
  <meta content="The blaze has burned more than 4,000 acres in the celebrity enclave near Los Angeles." property="og:description"/>
  <meta content="The blaze has burned more than 4,000 acres in the celebrity enclave near Los Angeles." name="twitter:description"/>
  <meta content="https://ichef.bbci.co.uk/news/1024/branded_news/e482/live/9202b190-b7ad-11ef-aff0-072ce821b6ab.jpg" property="og:image"/>
  <meta content="https://ichef.bbci.co.uk/n

In [10]:
# Remove html tags
# This is not perfect

# Clean up the data by removing elements whose contents are not visible text
for data in soup(['noscript','style', 'script']): # select the noscript, style, and script tags
  data.decompose() # and remove those tags along with their contents

# Get the text content
text = ' '.join(soup.stripped_strings) # stripped_strings is a generator of all the strings
# in the parsed data with surrounding whitespace stripped

print(text)
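To make the logic of the cell above concrete without any third-party dependency, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup. It is not how BeautifulSoup works internally; it only mirrors the two steps (skip `noscript`, `style`, and `script` content, then join the remaining stripped strings). The class name `VisibleTextParser`, the function `visible_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text while ignoring the contents of selected tags."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []     # stripped, non-empty text strings

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Invented sample page for illustration
html = ("<html><head><style>p {color: red}</style></head>"
        "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>")
print(visible_text(html))  # prints: Hello world
```

As with the BeautifulSoup version, navigation menus, captions, and footers still come through as text, so the result usually needs further site-specific cleaning.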





## Text Cleaning

Work on the removal or repair of
  - XML tags
  - Other metadata
  - Non-alphabetic/numeric characters
  - Punctuation and white space problems
  - Misspellings


### XML tags

A simple linguistic annotation example is provided below
  - Grammar and mechanical error tags

If you are cleaning an XML text that is hierarchical in nature (like the BNC), it is probably better to use an XML parsing library
  - like lxml
  - or use much more complicated code
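As a rough illustration of the parser-based approach for hierarchical XML, the sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml (lxml's `etree` module offers a largely compatible interface, but it is not used here). The nested markup is invented for illustration and only loosely resembles BNC-style word annotation.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested markup: a sentence containing words, one multiword unit, and punctuation
xml_text = (
    "<s n='1'>"
    "<w pos='DET'>The </w><w pos='NOUN'>cat </w>"
    "<mw><w pos='VERB'>sat </w><w pos='ADV'>down</w></mw>"
    "<c>.</c>"
    "</s>"
)

root = ET.fromstring(xml_text)

# itertext() walks the whole tree in document order, so nesting depth does not matter
plain = "".join(root.itertext())
print(plain)  # prints: The cat sat down.

# The parser also keeps the annotations available instead of throwing them away
words = [(w.text.strip(), w.get("pos")) for w in root.iter("w")]
print(words)
```

A regex such as `<.*?>` would also strip these tags, but a parser handles nesting, attributes, and malformed input more predictably.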



In [11]:
#Removing XML tags

import re

with open('annotations.txt', 'r') as file:
    text = file.read()

print(text) #this is a text with simple XML for grammar and spelling errors
pattern = r'<.*?>'  #regex pattern (non-greedy)
# matches any character (.) zero or more times (*) as few times as possible (?)

cleaned_text = re.sub(pattern, '', text)
cleaned_text = re.sub(" +", " ", cleaned_text) #get rid of extra space

print(cleaned_text) #this is a text without XML

with open('annotations_cleaned.txt', 'w') as file:
    file.write(cleaned_text)




Many people asks <GVN> themselves Weather <FSH> <FSC> its <PM> better to work <WM> cooperating with one another or weather <FSH> <FSC> its <PG> better to be always competing. I think <WM> both ways you can achieve success because when you compete with one another <PM> you try more why <WR> because you really want to achieve that goal. And <PM> when cooperate <GVM> with one another <PM> you can achieve great things aswell <FS> because more hands are better than two <PM> dont <PM> they say. but <FSC> <PM> here are some of the reaons <FS> why i <FSC> thing <FS> cooperation helps you achieve more success than competition.
One of the reasons why i <FSC> believe cooperation achieves you <LP> more success than competing is because <PM> when you cooperate with someone <PM> <WM> means you help each other out, so <PM> like they say <PM> two heads are better than one. And <PM> when you compete <PM> you might be on your own and competition doesn't get you where you would like to be at <WM> points 

**See Appendix 1 at end of notebook**

See the appendix if you are interested in constructing a map of character indexes between the two string versions
  - i.e., a map that links each word to its annotation

### Metadata

This data comes from the Switchboard Corpus

- Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States

Data has some unique attributes
- Introduction section
- Two different speakers
- Meta-linguistic data [laughter]
- False starts
  - right now I-, I'm not...
- Uncertain transcriptions
  - ((Well, let's see))
- Hashtags (no idea what they represent)

In [12]:
#remove metadata from Switchboard corpus

with open('boiler_plate.txt', 'r') as file:
    text = file.read()

print(text[:2000]) #this is the text we need to clean


FILENAME:	2001_1020_1044
TOPIC#:		303
DATE:		910304
TRANSCRIBER:	khb
DIFFICULTY:	1
TOPICALITY:	1
NATURALNESS:	2
ECHO_FROM_B:	1
ECHO_FROM_A:	3
STATIC_ON_A:	1
STATIC_ON_B:	1
BACKGROUND_A:	1
BACKGROUND_B:	1
REMARKS:


B.1:  Okay.

A.2:  Hi.

B.3:  Hi.

A.4:  Um, yeah, I would like to talk about how you dress for work, and, and, um,
what do you normally, what type of outfit do you normally have to wear?

B.5:  Well, I work in, uh, Corporate Control, so we have to dress kind of nice,
so I usually wear skirts and sweaters in the winter time, slacks, I guess.

A.6:  Uh-huh.

B.7:  And in the summer, just dresses.  We can't even, well, we're not even
really supposed to wear jeans very often, so,

A.8:  And is,

B.9:  It really doesn't vary that much from season to season since the office is
kind of, you know, always the same temperature.

A.10:  Right, right.  Is there, is there, um, a-, is there a, like a code of dress
where you work?  Do they ask,

B.11:  Not formally, 

A.12:  Right.

B.13:

In [16]:
import re # we will need RegEx here

# get rid of intro
pattern_intro = r'(FILENAME:.*?REMARKS:\s*\n)?={2,}'
# ( ... )? is an optional group that matches "FILENAME:"
# followed by any characters (.) zero or more times (*), non-greedily (?), so it matches
# as little as possible until it reaches "REMARKS:"
# \s*\n matches optional whitespace (\s*) and then a newline character (\n)
# ={2,} then matches a run of two or more "=" characters, which removes the ==== divider

# get rid of data within parentheses and square brackets, plus hash marks and hyphens
pattern_meta = r'\([^)]*\)|\[[^]]*\]|#|-'
# \( and \) match literal parentheses. The backslash is an escape character: it makes
# the following character literal, because ( and ) are otherwise special in a regex.
# [] define a character class, which matches any one of the characters inside the brackets.
# [^)] is a character class that matches any character except the closing parenthesis ).
# * is a quantifier that means "zero or more occurrences"
# | is "or", so we also match anything between [ and ], a hash mark (#), or a hyphen (-)
# Note: the bare - removes every hyphen, including word-internal ones (Uh-huh becomes Uhhuh)

cleaned_text = re.sub(pattern_intro, '', text, flags=re.DOTALL)
# re.DOTALL makes the dot (.) match any character, including newline characters
# so we can search for patterns across multiple lines and line breaks
# which is what happens with the introduction

cleaned_text_2 = re.sub(pattern_meta, '', cleaned_text)
# The re.sub call that builds cleaned_text uses the sub function of the
# regular-expression module (re). It replaces every part of text that matches
# pattern_intro with the empty string ('') and assigns the result to cleaned_text.
# flags=re.DOTALL lets the dot (.) match any character, including newlines.
print(cleaned_text_2[:2000]) #this is the text without the header and metadata
with open('boiler_plate_cleaned.txt', 'w') as file:
    file.write(cleaned_text_2)




B.1:  Okay.

A.2:  Hi.

B.3:  Hi.

A.4:  Um, yeah, I would like to talk about how you dress for work, and, and, um,
what do you normally, what type of outfit do you normally have to wear?

B.5:  Well, I work in, uh, Corporate Control, so we have to dress kind of nice,
so I usually wear skirts and sweaters in the winter time, slacks, I guess.

A.6:  Uhhuh.

B.7:  And in the summer, just dresses.  We can't even, well, we're not even
really supposed to wear jeans very often, so,

A.8:  And is,

B.9:  It really doesn't vary that much from season to season since the office is
kind of, you know, always the same temperature.

A.10:  Right, right.  Is there, is there, um, a, is there a, like a code of dress
where you work?  Do they ask,

B.11:  Not formally, 

A.12:  Right.

B.13:  but it's kind of understood that we're supposed to dress a little bit
nice.  A lot of times we have to go over to, uh, like Jerry Junkins' office
and Bill Ellsworth's office to deliver stuff,

A.14:  Right.

B.15:
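The cleaned output above shows `Uh-huh` reduced to `Uhhuh`, because the bare `-` alternative in `pattern_meta` removes every hyphen. The single-parenthesis alternative also leaves a stray `)` behind when it meets a double-parenthesised span such as `((Well, let's see))`. The sketch below is one possible refinement, not the notebook's original pattern. It assumes that a false start is a hyphen at the end of a word (as in `I-,`). The name `pattern_meta_strict` and the sample line are hypothetical.

```python
import re

# Alternatives, tried left to right at each position:
#   \(\([^)]*\)\)    double-parenthesised uncertain spans, e.g. ((Well, let's see))
#   \([^)]*\)        single-parenthesised material
#   \[[^\]]*\]       square-bracketed material, e.g. [laughter]
#   #                hash marks
#   (?<=\w)-(?!\w)   a hyphen that ends a word (false start), not one inside a word
pattern_meta_strict = r"\(\([^)]*\)\)|\([^)]*\)|\[[^\]]*\]|#|(?<=\w)-(?!\w)"

# Invented sample line that combines the attributes listed earlier
sample = "Uh-huh. [laughter] right now I-, I'm not ((Well, let's see)) sure #"
print(re.sub(pattern_meta_strict, "", sample))
```

Removing spans can leave doubled spaces behind, so a follow-up `re.sub(r" {2,}", " ", ...)` is usually worthwhile.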

In [17]:
# What if you only want data by individual?

person_A = re.findall(r"^A\.\d+:\s.*", cleaned_text_2, re.MULTILINE)
# ^ beginning of line, followed by A, a literal period (\.), one or more
# digits (\d+), a literal colon, and a whitespace character (\s);
# .* then captures the rest of that line after the speaker label
# re.MULTILINE lets ^ match at the start of every line, not just the start of the text

for line in person_A:
    print(line) #did we get them?

person_A_text = ' '.join(person_A) #join all the lines together

print(person_A_text[:1000]) #print it out

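pattern_A = r'A\.\d+:\s*' #regex to remove the speaker label (e.g., "A.2:")
# a looser pattern such as A.*?: would also delete ordinary text running from any capital A to the next colon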

person_A_cleaned = re.sub(pattern_A, '', person_A_text)

print(person_A_cleaned[:1000]) #the cleaned text


A.2:  Hi.
A.4:  Um, yeah, I would like to talk about how you dress for work, and, and, um,
A.6:  Uhhuh.
A.8:  And is,
A.10:  Right, right.  Is there, is there, um, a, is there a, like a code of dress
A.12:  Right.
A.14:  Right.
A.16:  Right, right.  And does it, does it change?  I guess, um, you can, can you
A.18:  Right.
A.20:  Yes .
A.22:  Yeah, yeah, well i, it's,
A.24:  that's right.  And it,
A.26:  Yeah, and it was usually, uh, also, uh, where I was, it was, um, i, in
A.28:  and it was, it was a casual office, there was no formal dress code, but,
A.30:  just in case things would come up during the day, 
A.32:  sometimes unexpected meetings or a client would come in and would want to
A.34:  uh, other times it could be very casual.  If you knew you would be at a
A.36:  Um.
A.38:  Uh, well, I was, I was in a, uh, private consulting firm, 
A.40:  so, um, and, uh,
A.42:  and, uh, anyway, um, right now I, I'm not, I'm not there but ,
A.44:  but, anyway, um, and seas, really same, season
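The list above contains only the first line of each turn: `.*` stops at the line break, so the continuation of a wrapped turn (for example the second line of A.4) is lost. A minimal sketch of one way to capture whole turns follows. It assumes every turn starts with a label of the form `A.<number>:` or `B.<number>:` at the beginning of a line. The names `turn_pattern` and `turns_by_speaker` and the shortened sample are hypothetical.

```python
import re

sample = """B.1:  Okay.

A.2:  Hi.

A.4:  Um, yeah, I would like to talk about how you dress for work, and, and, um,
what do you normally, what type of outfit do you normally have to wear?

B.5:  Well, I work in, uh, Corporate Control,
so I usually wear skirts.
"""

# Group 1: speaker letter. Group 2: everything up to the next turn label or the end of the text.
turn_pattern = re.compile(
    r"^([AB])\.\d+:[ \t]*(.*?)(?=^[AB]\.\d+:|\Z)",
    re.MULTILINE | re.DOTALL,
)

def turns_by_speaker(text, speaker):
    turns = []
    for who, words in turn_pattern.findall(text):
        if who == speaker:
            # split() + join collapses line breaks and repeated spaces
            turns.append(" ".join(words.split()))
    return turns

for turn in turns_by_speaker(sample, "A"):
    print(turn)
```

Because the label is consumed by the pattern, no second pass is needed to strip `A.2:` and similar markers from the result.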

### Misspellings and Punctuation

This is data from a second language learner of English.

Taken from The EF-Cambridge Open Language Database ([EFCAMDAT](https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html))
  - Large open-access corpus of English learner essays
  - Comprises submissions from students worldwide who attend an online EF school.
  - Learners are assigned to proficiency levels based on their initial placement test results or through successful course progression.
  - There are 16 proficiency levels, aligned with the Common European Framework of Reference for Languages (CEFR)
  - This student is level 4

In [18]:
#let's install a spell check library
!pip install pyspellchecker

# uses a Levenshtein Distance algorithm to find permutations within an edit distance
# of 2 from the original word. It then compares all permutations (insertions, deletions,
# replacements, and transpositions) to known words in a word frequency list.
# Those words that are found more often in the frequency list are more likely the correct results.

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.2-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.2-py3-none-any.whl (7.1 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.2
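The comment in the install cell describes the general idea: generate candidate strings by edit operations, then keep the candidates that appear in a word-frequency list. The sketch below illustrates that idea for an edit distance of 1 only. It is not pyspellchecker's actual code, and `edits1`, `correct`, and the toy `word_freq` dictionary are hypothetical.

```python
import string

def edits1(word):
    """All strings one edit away from `word` (lowercase ASCII letters only)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# Toy frequency list invented for illustration; a real checker loads a large one
word_freq = {"umbrella": 50, "umbrellas": 10, "cheap": 80, "cheep": 2}

def correct(word):
    if word in word_freq:
        return word  # known words are never flagged, whatever the context
    candidates = [w for w in edits1(word) if w in word_freq]
    # Prefer the most frequent known candidate; otherwise leave the word alone
    return max(candidates, key=word_freq.get) if candidates else word

print(correct("umbralla"))  # prints: umbrella
print(correct("cheep"))     # prints: cheep
```

The second call shows why a frequency-list checker cannot catch real-word errors: "cheep" is in the list, so it is accepted without looking at context.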


In [19]:
# Misspellings

from spellchecker import SpellChecker

spell = SpellChecker() #assign variable to spellchecker

import spacy #import spacy. We will need it for tokenization
nlp = spacy.load("en_core_web_sm")

with open('misspellings_punct.txt', 'r') as file:
    text = file.read()

print(text[:1000]) #there are misspellings and problems with punctuation.

spacy_doc = nlp(text)

misspelled = []
correct_spelled = []

for token in spacy_doc:
  if not token.is_space: # skip runs of whitespace that spaCy treats as tokens
    #print(token.text)
    if spell.unknown([token.text]):  # is the word misspelled?
      misspelled.append(token.text) # append it
      # what is the closest correct spelling? append it too, falling back to the
      # original word if the checker has no suggestion (correction() may return None)
      correct_spelled.append(spell.correction(token.text) or token.text)

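print(misspelled) # note that these are not terribly accurate. Spelling correction is a hard task!
# it misses "cheep" (a real word, so only context shows it should be "cheap")
# and it treats the name "liji" as a misspelling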
print(correct_spelled)

# replace the words using spacy
cleaned_words = [] #list for all the cleaned words
for token in spacy_doc:
    if token.text in misspelled: #if token is "misspelled"
        cleaned_words.append(correct_spelled[misspelled.index(token.text)] + token.whitespace_)
        # replace it with "correctly spelled word" plus trailing whitespace if token is followed by whitespace
    else:
        cleaned_words.append(token.text + token.whitespace_) #if not, just include normal word

# join words in list into a string
cleaned_text = "".join(cleaned_words)
print(cleaned_text) # token.whitespace_ preserves the original spacing, so punctuation and spacing problems (e.g., "festival.I") are left as they were




Dear mom     I am going to go to a music festival this weekend. The ticket is cheep, only 20 yuan. I will see so many singers in the festival.I am going to listen to pop music.I will go camping.I am taking a tent ,T-shirt,sleeping bag and flashlight.I am also taking a umbralla. Maybe it will rain in sunday.         best regards        liji yuan
['umbralla', 'liji']
['umbrella', 'fiji']
Dear mom     I am going to go to a music festival this weekend. The ticket is cheep, only 20 yuan. I will see so many singers in the festival.I am going to listen to pop music.I will go camping.I am taking a tent ,T-shirt,sleeping bag and flashlight.I am also taking a umbrella. Maybe it will rain in sunday.         best regards        fiji yuan
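The corrected text above still has punctuation spacing problems ("festival.I", "tent ,T-shirt"). A minimal regex sketch for repairing them follows. It assumes ordinary prose: abbreviations such as "e.g." and URLs would be altered, so it should not be applied blindly. `fix_punctuation_spacing` is a hypothetical helper name.

```python
import re

def fix_punctuation_spacing(text):
    # 1. Remove whitespace that comes before punctuation: "tent ,T-shirt" -> "tent,T-shirt"
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # 2. Add one space after punctuation that is directly followed by a letter
    text = re.sub(r"([.,!?;:])(?=[A-Za-z])", r"\1 ", text)
    # 3. Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()

sample = ("I will see so many singers in the festival.I am taking a tent ,"
          "T-shirt,sleeping bag and flashlight.")
print(fix_punctuation_spacing(sample))
```

Step 2 only fires before letters, so numbers such as "3.14" are left untouched.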


### Lowercasing

We want all of our data to be in lowercase so it is standardized.

This is simple to do in spaCy

In [20]:
text = "It should be pretty easy to put CAPITAL LETTER WORDS into lowercase using spaCy"

# Using only spaCy

spacy_doc = nlp(text)

for token in spacy_doc:
    #lowercased_token = token.text.lower() #base Python
    lowercased_token = token.lower_ #spaCy lowercase
    print(lowercased_token)




it
should
be
pretty
easy
to
put
capital
letter
words
into
lowercase
using
spacy


In [21]:
# You can also lowercase the text first and then run spaCy on it,
# but this is probably bad practice: spaCy's tagger may work better on the original text

text = text.lower()

spacy_doc = nlp(text)

for token in spacy_doc:
    print(token.text)

it
should
be
pretty
easy
to
put
capital
letter
words
into
lowercase
using
spacy


**APPENDIX 1**

Remember when we removed simple linguistic XML items?
  - But maybe we want to use these annotations as part of our analysis?
  - We can construct a map of character indexes between two string versions

We can add them as token annotations in spaCy.
  - The code is in the cell that defines `add_annotations`, after the Your Turn section

For simplicity, we will just attach each annotation to the preceding token, even though some of these describe phrasal issues.
  - For example, `<SF>` indicates a sentence fragment at the end of the document.

## Your Turn

We have two pieces of data that need to be cleaned.

1. childes.txt

This comes from the Child Language Data Exchange System (CHILDES)
  - established in 1984
  - a central repository for data of first language acquisition
  - the earliest transcripts date from the 1960s; as of 2015 it held content (transcripts, audio, and video) in 26 languages from 230 different corpora
  - transcriptions are coded in the CHAT (Codes for the Human Analysis of Transcripts) transcription format
      - provides a standardized format for producing conversational transcripts.
      - system also has options for phonological and morphological analysis

CHAT
  - Introductory texts
  - Speaker information
  - *MOT:	Mommy sit down at the table with you ?
    - Utterance begins with *
  - %mor:	n:prop|Mommy v|sit n|down prep|at det|the n|table prep|with pro|you ?
    - Morphology/POS tiers begin with %mor
  - %gpx:	looks at Wanda
    - Non-verbal tiers begin with %gpx
  - %act:	sits facing Wanda
    - Action tiers begin with %act
  - () unpronounced sounds
  - xxx unknown words
  - \# pauses
  - \[ \] unknown words
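As a starting point for the CHILDES exercise, the conventions listed above can be turned into a small cleaner. This is only a sketch under stated assumptions: header lines start with `@`, each utterance fits on one line (continuation lines are ignored), and letters inside `( )` are kept so that a form like `(be)cause` becomes `because`. The function name `clean_chat` and the `*CHI` line in the sample are invented for illustration.

```python
import re

def clean_chat(text):
    utterances = []
    for line in text.splitlines():
        if not line.startswith("*"):
            continue  # skip @ header lines and % dependent tiers
        # Drop the speaker code, e.g. "*MOT:" plus the tab that follows it
        _, _, words = line.partition(":")
        words = re.sub(r"\[[^\]]*\]", " ", words)  # square-bracketed codes
        words = re.sub(r"\bxxx\b", " ", words)     # unknown words
        words = words.replace("#", " ")            # pauses
        words = re.sub(r"[()]", "", words)         # keep the letters, drop the parentheses
        utterances.append(" ".join(words.split())) # normalise whitespace
    return utterances

sample = (
    "@Begin\n"
    "*MOT:\tMommy sit down at the table with you ?\n"
    "%mor:\tn:prop|Mommy v|sit n|down prep|at det|the n|table prep|with pro|you ?\n"
    "%gpx:\tlooks at Wanda\n"
    "*CHI:\txxx (be)cause # I want [?] it .\n"
    "@End\n"
)
for utterance in clean_chat(sample):
    print(utterance)
```

Whether to keep or drop the material in parentheses depends on the analysis: keeping it restores the intended word, while dropping it stays closer to what was actually pronounced.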

2. tweet.txt
  - Tweets are filled with lots of noise

      - Emoticons
      - @'s
      - URLs (http links)
      - hashtags


In groups, choose one and clean up the text!

In [22]:
import re
import spacy
from spacy.tokens import Token, Doc

# Appendix 1: attach the removed annotations to spaCy tokens
Token.set_extension('annotations', default=None, force=True)

def add_annotations(text):
    nlp = spacy.load('en_core_web_sm')

    to_remove = set()  # character indexes (in the original text) that belong to annotations
    orig_starts = {}   # annotations keyed by their starting position in the original text

    # This pattern collects all characters that are not ">" and then finds the closing
    # right-angle bracket; the space before each tag is removed together with the tag.
    for match in re.finditer(r' <[^>]+>', text):
        to_remove.update(range(match.start(), match.end()))
        orig_starts[match.start()] = match.group()

    # Remove tags while creating a mapping from original to cleaned text positions
    clean_chars = []
    orig_to_clean = {}
    for i, char in enumerate(text):
        if i not in to_remove:
            clean_chars.append(char)
        else:
            orig_to_clean[i] = len(clean_chars)
    clean_text = ''.join(clean_chars)

    # Create the spaCy Doc on the clean text
    doc = nlp(clean_text)

    # Attach each tag to the last token that starts before the tag's position
    for start, tag in orig_starts.items():
        clean_pos = orig_to_clean[start]
        token_idx = len([t for t in doc if t.idx < clean_pos]) - 1
        if token_idx >= 0:
            token = doc[token_idx]
            # Assign a new list rather than appending in place, so that
            # tokens can never end up sharing one default list
            token._.annotations = (token._.annotations or []) + [tag.strip()]

    return doc

# Test on the annotated text (reload it, because `text` was overwritten by the lowercasing cells)
with open('annotations.txt', 'r') as file:
    text = file.read()

doc = add_annotations(text)
print(doc)
for token in doc:
    if token._.annotations:
        print(f"{token.text}: {token._.annotations}")


In [31]:
import re

def clean_tweets(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs (http/https links)
    text = re.sub(r'\(@.*?\)', '', text)  # Remove parenthesized @ material; must run before punctuation is stripped
    text = re.sub(r'@\w+', '', text)  # Remove mentions (@'s)
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'[^\w\s]', '', text)  # Remove emoticons, emoji, and other punctuation
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    return text.strip()

# Load the existing file
file_path = '/content/drive/MyDrive/Colab Notebooks/DS_5780_spring_25/text_cleaning/tweet.txt'


with open(file_path, 'r') as file:
  tweets = file.readlines()

# Clean each tweet
cleaned_tweets = [clean_tweets(tweet) for tweet in tweets]

# Save the cleaned tweets to a new file
cleaned_file_path = '/content/drive/MyDrive/Colab Notebooks/DS_5780_spring_25/text_cleaning/cleaned-tweet.txt'
with open(cleaned_file_path, 'w') as cleaned_file:
  cleaned_file.write("\n".join(cleaned_tweets))

print(f"Cleaned tweets have been saved to {cleaned_file_path}")


Cleaned tweets have been saved to /content/drive/MyDrive/Colab Notebooks/DS_5780_spring_25/text_cleaning/cleaned-tweet.txt
