# Product Corpus Preparation for Topic Modeling
## Product Text Preprocessing on Raw Product Data from Oracle

This R script contains the steps needed to preprocess the raw product text exported from Oracle so that it can be used for topic modeling. The main preprocessing steps are:

    1) Remove all non-word tokens, such as HTML formatting
    2) Remove the Overstock product information disclaimer
    3) Remove all numbers
    4) Collapse multiple whitespace characters into a single space
    5) Lemmatize all words (e.g. "drives" becomes "drive", while "driver" stays "driver")

All other preprocessing steps are done in the Python topic modeling pipeline. They are not included here because they are parameterized functions that can be altered to produce different topic models, whereas the preprocessing steps in this script should stay constant for every model.
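Lemmatization, the last step in the list above, is not exercised in the cells shown here. A minimal sketch, assuming the `textstem` package loaded in the Libraries section is the intended tool, might look like the following. `lemmatize_corpus` is a hypothetical helper name, not part of the original script.

```r
# Hypothetical helper (an assumption, not part of the original notebook).
lemmatize_corpus <- function(texts) {
  # textstem::lemmatize_strings() replaces each token with its dictionary
  # lemma and returns one string per input document.
  textstem::lemmatize_strings(texts)
}

# Example call (requires the textstem package to be installed):
# lemmatize_corpus(c("she drives the car", "the driver waits"))
```

Lemmatizing after the regex cleanup keeps the dictionary lookup from having to deal with HTML fragments and digits.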

## Libraries

In [18]:
library(tidyverse)
library(stringr)
library(pbapply)
library(parallel)
library(textstem)

## Functions

In [None]:
split_text_file <- function(file_path) {
    
}

In [662]:
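# Read the first 100,000 rows (empty strings are kept, not turned into NA)
# and merge the columns PRO_NAME through PRO_SHORT_NAME into one `text` field
data <- read_csv("../data/table_export_DATA.csv", n_max = 100000, na = character()) %>%
    unite(text, PRO_NAME:PRO_SHORT_NAME, sep = " ", remove = TRUE)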

Parsed with column specification:
cols(
  PRO_ID = col_integer(),
  PRO_NAME = col_character(),
  SHORT_DESCRIPTION = col_character(),
  PRO_DESC = col_character(),
  PRO_BRAND_NAME = col_character(),
  PRO_MATERIALS = col_character(),
  PRO_SHORT_NAME = col_character()
)


In [664]:
nrow(data)

In [665]:
system.time(
    cleaned <- data %>%
        mutate(text = map_chr(text, function(y) {
            gsub("<[^>]+>", " ", y) %>%                      # Replace HTML tags (anything between < and >) with a space
                gsub("[^[:alnum:] ]", " ", .) %>%            # Replace non-alphanumeric characters with a space
                gsub("[0-9]", " ", .) %>%                    # Replace all digits with a space
                gsub("(^| ).( |$)", " ", .) %>%              # Remove single characters (adjacent ones share a delimiting space, so some survive)
                gsub("^ *|(?<= ) | *$", "", ., perl = TRUE)  # Trim both ends and collapse runs of spaces into a single space
        }))
)

   user  system elapsed 
 52.644   0.182  52.866 
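
Because `gsub()` is vectorized over its input, the same cleanup can also be written without the per-row `map_chr()` call. The sketch below is an alternative under that assumption, not the code that produced the timing above, and it has not been benchmarked here. `clean_text` is a hypothetical name, and the lookaround pattern for single characters is a swapped-in variant that also catches adjacent single characters.

```r
# Hypothetical vectorized variant of the cleaning cell (base R only).
clean_text <- function(x) {
  x <- gsub("<[^>]+>", " ", x)                             # HTML tags to spaces
  x <- gsub("[^[:alnum:] ]", " ", x)                       # non-alphanumerics to spaces
  x <- gsub("[0-9]", " ", x)                               # digits to spaces
  x <- gsub("(?<![^ ])[^ ](?![^ ])", " ", x, perl = TRUE)  # lone single characters to spaces
  x <- gsub(" +", " ", x)                                  # collapse runs of spaces
  trimws(x)                                                # drop leading/trailing space
}

clean_text("<li>Model: NL- 2 x LC41 (BK C M Y)</li>")
```

The lookarounds do not consume the surrounding spaces, so a run such as `C M Y` is removed entirely instead of leaving every other character behind.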

In [666]:
head(cleaned, n = 13)
nrow(cleaned)

PRO_ID,text
8120438,Michael Tompsett Ireland Text Map Canvas Art Multi Artist Michael Tompsett Title Ireland Text Map Product type Giclee gallery wrapped This ready to hang gallery wrapped art piece features map of Ireland with cities in text Art and design were always Michael favorite subjects at school He was fortunate to land job as graphic designer at one of London most prestigious publishing companies After years Michael Topsett made the decision to pursue full time career creating his own work The freedom and time to be able to focus solely on his own projects has been wonderful experience Athrough Michael likes to experiment with different styles and subjects hi main focus is on map art He enjoys looking for unique and interesting ways to depict something which is very familiar Maps are visual representations of the world we live in world which is incredibly diverse Apart from maps Tompsett creates urban landmark and cityscape designs Giclee jee clay is an advanced printmaking process for creating high quality fine art reproductions The attainable excellence that Giclee printmaking affords makes the reproduction virtually indistinguishable from the original piece The result is wide acceptance of Giclee by galleries museums and private collectors Gallery wrap is method of stretching an artist canvas so that the canvas wraps around the sides and is secured on hidden wooden frame This method of stretching and preparing canvas allows for frameless presentation of the finished painting Artist Michael Tompsett Title Ireland Text Map Product type Giclee gallery wrapped Style Contemporary Format Vertical Size options Small medium large extra large Subject Map Medium Giclee Dimensions Small inches high inches wide inches deep Medium inches high inches wide inches deep Large inches high inches wide inches deep Extra large inches high inches wide inches deep Giclee is an advanced print making process for creating high quality fine art reproductions The attainable excellence that 
giclee printmaking affords makes the reproduction virtually indistinguishable from the original artwork The result is wide acceptance of giclee by galleries museums and private collectors Gallery wrapped is method of stretching an artist canvas so that the canvas wraps around the sides and is secured to the back of the wooden frame This method of stretching and preparing canvas allows for frameless presentation of the finished painting This custom made item will ship within business days Trademark Fine Art Canvas wood Michael Tompsett Ireland Text Map Canvas Art
22176308,Heart Flower Wreath Tee Women Image by Shutterstock Valentines Day Hand Drawn Romantic Doodle Women White shirt Valentines Day Hand Drawn Romantic Doodle Women White shirt While we aim to supply accurate product information it is sourced by manufacturers suppliers and marketplace sellers and has not been provided by Overstock
20133524,Censorship and Heresy in Revolutionary England and Counter reformation Rome Giorgio Caravale This book explores the secrets of the extraordinary editorial success of Jacobus Acontius áSatan Stratagems an important book that intrigued readers and outraged religious authorities across Europe This book explores the secrets of the extraordinary editorial success of Jacobus Acontius Satan Stratagems an important book that intrigued readers and outraged religious authorities across Europe Despite condemnation by the Catholic Church the work first published in Basel in was resounding success For the next century it was republished dozens of times in different historical context from France to Holland to England The work sowed the idea that religious persecution and coercion are stratagems made up by the devil to destroy the kingdom of God Acontius work prepared the ground for religious toleration amid seemingly unending religious conflicts In Revolutionary England it was propagated by latitudinarians and independents but also harshly censored by Presbyterians as dangerous Socinian book Giorgio Caravale casts new light on the reasons why both Catholics and Protestants welcomed this work as one of the most threatening attacks to their religious power This book is an invaluable resource for anyone interested in the history of toleration in the Reformation and Counter Reformation across Europe While we aim to supply accurate product information it is sourced by manufacturers suppliers and marketplace sellers and has not been provided by Overstock
20133977,Blue Lives Matter in the Line of Duty Steve Cooley Robert Schirn Ground breaking and numbingly current Blue Lives Matter is first of it kind series exploring line of duty deaths suffered by the law enforcement blue family Ground breaking and numbingly current Blue Lives Matter is first of it kind series exploring line of duty deaths suffered by the law enforcement blue family The first book in the series examines the deaths of police officers and one brave who all made the ultimate sacrifice in service to their communities An historical true crime account with cases from the to more recent cases methodological look at each case extracting and presenting tactical lessons learned and the continual price in the survivors left behind Los Angeles County has over law enforcement officers with of those officers assigned to either LAPD or Los Angeles Sheriff Department Three term Los Angeles District Attorney Steve Cooley and former LAPD Reserve Officer presents each case to readers putting his lense and that of Los Angeles archivist Robert Schirn on the incidents that took each blue life lost Readers will not only learn the never before released details of each case in this updated format but also the lessons learned and impact on survivors Cooley and Schirn have worked tirelessly for years to put together the information in an understandable form with the stakes never being higher for both the public to understand the Blue Lives Matter focus While we aim to supply accurate product information it is sourced by manufacturers suppliers and marketplace sellers and has not been provided by Overstock
20133987,Augmented Human Helen Papagiannis The boundaries of the digital and physical are blurring Augmented Reality AR is quickly advancing into new phase of contextually rich experiences that combine sensors wearable computing the Internet of Things and artificial intelligence The boundaries of the digital and physical are blurring Augmented Reality AR is quickly advancing into new phase of contextually rich experiences that combine sensors wearable computing the Internet of Things and artificial intelligence In this book Dr Helen Papagiannis shares stories from inside and outside research labs spanning decade of work as designer researcher and public speaker In nontechnical terms she highlights and expands upon the inventions and ideas that will forever change the way we live work and play Learn about AR and related technologies and understand the significance of this new communication medium Understand the impact and opportunities this second wave of AR presents for business design and culture Gain deep insight into this emerging field from trailblazers and experts in the field Learn how you can contribute to and help define this new technological area either as designer entrepreneur business or cultural leader or engaged consumer Our digital future is no longer distant promise but rapidly growing industry Consider Facebook billion acquisition of Virtual Reality headset maker Oculus Google part in leading million investment in Augmented Reality company Magic Leap and Microsoft introduction of holographic experiences with HoloLens By inspiring design for the best of humanity and the best of technology Augmented Human is essential reading for designers technologists entrepreneurs business leaders and anyone who desires peek at our virtual future While we aim to supply accurate product information it is sourced by manufacturers suppliers and marketplace sellers and has not been provided by Overstock
19531684,Premium Series Ultra Soft Microfiber Bed in Bag MAXIMUM COMFORT Sleep better and wake up feeling refreshed ready to take on the day Our superior down alternative premium comforters are luxuriously soft and fulffy The best down alternative for allergy free comfort all year round MAXIMUM COMFORT Sleep better and wake up feeling refreshed ready to take on the day Our superior down alternative premium comforters are luxuriously soft and fulffy The best down alternative for allergy free comfort all year round PREMIUM QUALITY Generous all season down alternative plush polyester filling Made from ultra soft premium mircofiber fabric Gentle on skin shell and filling made from hypoallergenic materials Durable sewn through box stitch design prevents fill shift HYPOALLERGENIC Ultrasoft wrinkle resistant microfiber is breathable stain resistant hypoallergenic and resistant to dust mites Fully elasticized deep pocket fitted sheet fits your mattress snug from inches up to inches deep EASY CARE Machine wash in cold water with similar colors Tumble dry low Do not bleach Features Set includes comforter flat sheet fitted sheet pillowcases pillowcase in Twin and Twin XL set and shams sham in Twin and Twin XL set Made of microfiber Machine washable Solid color pattern Fully elasticized fitted shet Twin Twin XL Full Full XL Queen King or California King size White Comforter does not include shams Twin Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcase inches wide inches long Sham inches wide inches long Twin XL Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcase inches wide inches long Sham inches wide inches long Full Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcases inches wide inches long Shams inches wide 
inches long Full XL Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcases inches wide inches long Shams inches wide inches long Queen Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcases inches wide inches long Shams inches wide inches long King Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcases inches wide inches long Shams inches wide inches long California King Dimensions Comforter inches wide inches long Flat Sheet inches wide inches long Fitted Sheet inches wide inches long inches deep Pillowcases inches wide inches long Shams inches wide inches long Premium Series Ultra Soft Microfiber Bed In
10594463,Design Art Golden Eagle Canvas Art Print Wx Inches Panels This design is printed on high quality cotton canvas and is gallery wrapped around solid wood subframes carefully packaged and ready to hang This printing technology allows for crisp deep canvas print which is fade resistant and never pixelated This beautiful art is printed using the highest quality fade resistant ink on canvas Every one of our fine art giclee canvas prints is printed on premium quality cotton canvas using the finest quality inks which will not fade over time Each giclee print is stretched tightly over inch wood sub frame ensuring the canvas is taught and does not buckle This canvas giclee print is gallery wrapped the design continues on the sides giving it real art gallery feel All of our canvas prints are carefully packaged in inflated cushion wrap fragile labeling and sturdy boxes to ensure safe delivery Every canvas print arrives ready to hang on the wall with the hanging kits included Orientation Horizontal Landscape Rectangle Subject Animals Type Canvas Art Fine Art Giclee Framed Art Print Style Contemporary Framed Art Horizontal Framed Framed Size Medium Oversized Factory sealed boxes cannot be returned if opened DESIGN ART Canvas Ink Wood Design Art Golden Eagle Canvas Art Panels
20528593,Clay Alder Home Ebey Leather Office Chair with Lumbar Support This office chair features luxurious leather for upscale look and feel This office chair features luxurious leather for upscale look and feel The high back and seat are designed with segmented padding providing total body support Back also features lumbar cradling recess for exceptional comfort Offers simple and intuitive controls such as pneumatic seat height adjustment degree swivel tilt tension and tilt lock Cantilevered padded arms add to the unique appeal Dimensions inches deep inches wide inches high Product Features Swivel Adjustable Height Lumbar Support Material Leather Plastic Metal Assembly Assembly Required Color Black Brown Assembly Required Assembly Required Clay Alder Home Leather plastic metal Ebey Leather Office Chair with Lumbar Support
10195749,Adapter Set Converting mount to mm or mm Photo Port Adapter Set Converting mount to mm and or mm Photo Port This is metal adapter set that converts C mount to mm and or mm photo port Material Metal Adapter set Two adapters included Mounting size on camera side Mm or mm in diameter Mounting size on microscope side Inch Mm mount female Compatibity All microscopes with inch Mm mount Type Accessories AmScope Metal Alloy Glass
10594489,Design Art Isidre Nonell Still Life with Onions and Herring Canvas Art Print Wx Inches Panels Multi color This design is printed on high quality cotton canvas and is gallery wrapped around solid wood subframes carefully packaged and ready to hang This printing technology allows for crisp deep canvas print which is fade resistant and never pixelated This beautiful art is printed using the highest quality fade resistant ink on canvas Every one of our fine art giclee canvas prints is printed on premium quality cotton canvas using the finest quality inks which will not fade over time Each giclee print is stretched tightly over inch wood sub frame ensuring the canvas is taught and does not buckle This canvas giclee print is gallery wrapped the design continues on the sides giving it real art gallery feel All of our canvas prints are carefully packaged in inflated cushion wrap fragile labeling and sturdy boxes to ensure safe delivery Every canvas print arrives ready to hang on the wall with the hanging kits included Orientation Horizontal Landscape Rectangle Subject Floral Museum Masters Type Canvas Art Fine Art Giclee Framed Art Print Material Canvas Wood Style Framed Art Museum Masters Floral Framed Framed Size Medium Oversized Factory sealed boxes cannot be returned if opened DESIGN ART Canvas Ink Wood Design Art Art
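
Several of the rows above still end with the Overstock disclaimer, which the step list says should be removed. A minimal sketch of that step follows, assuming it runs after punctuation has been stripped so that the disclaimer matches the wording printed in the rows above. `disclaimer` and `remove_disclaimer` are hypothetical names.

```r
# Disclaimer text as it appears after punctuation stripping
# (copied from the cleaned rows above).
disclaimer <- paste(
  "While we aim to supply accurate product information it is sourced by",
  "manufacturers suppliers and marketplace sellers and has not been",
  "provided by Overstock"
)

# Hypothetical helper: replace the disclaimer with a space, then tidy spacing.
remove_disclaimer <- function(texts) {
  out <- gsub(disclaimer, " ", texts, fixed = TRUE)  # literal match, no regex
  trimws(gsub(" +", " ", out))
}

example <- paste("Heart Flower Wreath Tee Women", disclaimer)
remove_disclaimer(example)
```

Matching with `fixed = TRUE` only removes exact copies of the sentence, so any variant wording in the raw data would need its own pattern.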


In [667]:
data[12,]
cleaned[12,]

PRO_ID,text
10597359,"8 PK Compatible LC41 2 Set Inkjet Cartridge For Brother FAX1840C 1940CN 2440C (Pack of 8) NL- 8 x LC41 NL- 8 x LC41<br><br> <ul>  <li>Compatible models: FAX1840C/1940CN/2440C, MFC210C/420CN/620CN/3240C/3340CN/5440CN/5840CN <li>Color: Black Yellow Cyan Magenta </li>  <li>Print yield: BK (500) CMY (400)pages at 5-percent coverage</li>  <li>Non-refillable</li>  <li>Model: NL- 2 x LC41 (BK C M Y) </ul> <br><br><ul> <li>Printer Manufacturer: Brother</li> <li>Print Quality: Inkjet</li> <li>Ink Color: Black, Cyan, Magenta, Yellow</li> <li>Printer Type: Inkjet</li> <li>Paper, Ink or Toner Type: Ink Cartridge</li> </ul><br><br> Plastic"


PRO_ID,text
10597359,PK Compatible LC Set Inkjet Cartridge For Brother FAX CN Pack of NL LC NL LC Compatible models FAX CN MFC C CN CN CN CN CN Color Black Yellow Cyan Magenta Print yield BK CMY pages at percent coverage Non refillable Model NL LC BK M Printer Manufacturer Brother Print Quality Inkjet Ink Color Black Cyan Magenta Yellow Printer Type Inkjet Paper Ink or Toner Type Ink Cartridge Plastic
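
Comparing one row before and after cleaning is a useful spot check. A small programmatic check over the whole column can complement it. The sketch below assumes the properties to verify are the ones the cleaning step targets (no digits, no tag brackets, no repeated or edge spaces). `check_clean` is a hypothetical helper.

```r
# Hypothetical helper: TRUE for each property that holds across all texts.
check_clean <- function(texts) {
  c(
    no_digits       = !any(grepl("[0-9]", texts)),
    no_tags         = !any(grepl("[<>]", texts)),
    no_double_space = !any(grepl("  ", texts, fixed = TRUE)),
    no_edge_space   = !any(grepl("^ | $", texts))
  )
}

check_clean(c("PK Compatible LC Set Inkjet Cartridge", "Ink Cartridge Plastic"))
```

In the notebook this could be called as `check_clean(cleaned$text)` before the corpus is written to disk.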


In [668]:
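# Write one cleaned document per line; the output path is passed positionally
# because newer readr versions renamed the `path` argument to `file`
write_lines(cleaned$text, paste0("../data/corpus_clean", nrow(cleaned), ".txt"))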