In [170]:
import pandas as pd
import numpy as np
import nltk
import importlib
import utils.preprocessing as preprocessing

# Data Preprocessing

In this first part, we preprocess the text data to prepare it for clustering and classification. This includes the following steps:
* Noise Removal
* Normalization
* Tokenization & Segmentation

## Data Loading

In [171]:
df = pd.read_pickle("data/dataset_business_technology_cybersecurity.pickle")
df = pd.DataFrame(df)
df.sample(5)

Unnamed: 0,title,content,topic
278,Computer worm,<p>A <b>computer worm</b> is a standalone malw...,cybersecurity
204,Sports car,"<p class=""mw-empty-elt"">\n</p>\n\n\n<p>A <b>sp...",technology
93,Social responsibility,<p><b>Social responsibility </b> is an ethical...,business
159,Electromechanics,"<p>In engineering, <b>electromechanics</b> com...",technology
192,Bus,"<p class=""mw-empty-elt"">\n\n</p>\n\n\n\n\n<p>A...",technology


In [172]:
# Dump the raw data to a text file to inspect its format
df.to_csv("data/backup_preprocess/content.txt")

## Noise Removal
Noise removal can be defined as text-specific normalization. Because we are dealing with raw HTML data, our preprocessing pipeline strips away all HTML markup with the help of the BeautifulSoup library. We also replace contractions with their expansions (for example, "don't" becomes "do not").
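
The body of `preprocessing.remove_noise_from_df` is not shown in this notebook, so the following is only a minimal sketch of the two steps described above, not the actual implementation. To stay free of extra dependencies it swaps BeautifulSoup for the standard library's `html.parser`; the names `CONTRACTIONS`, `strip_html`, `expand_contractions`, and `remove_noise` are hypothetical, and the contraction map is deliberately tiny.

```python
import re
from html.parser import HTMLParser

# Hypothetical, illustrative contraction map; a real pipeline would use a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the text content of raw_html with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def expand_contractions(text):
    """Replace known contractions; the expansion is always lowercase."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_noise(raw_html):
    """Strip markup first, then expand contractions in the remaining text."""
    return expand_contractions(strip_html(raw_html))


print(remove_noise("<p>It's a <b>worm</b> that can't stop.</p>"))
```

This sketch only handles straight apostrophes (`'`); typographic apostrophes would need to be normalized first.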

In [173]:
importlib.reload(preprocessing)
df["content"] = preprocessing.remove_noise_from_df(df["content"])
# Save a backup of the cleaned text
df.to_csv("data/backup_preprocess/content_without_noise.txt")

0it [00:00, ?it/s]
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
333it [00:03, 109.50it/s]


## Normalization
Normalization refers to a series of tasks that put all text on a level playing field: converting all text to the same case (upper or lower), removing special characters (punctuation) and numbers, stemming, lemmatization, and so on. Normalization puts all words on equal footing and allows processing to proceed uniformly.
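
As a rough illustration of the first few tasks listed above, here is a minimal sketch that lowercases the text, removes punctuation and numbers, and drops stop words. It is not the code behind `preprocessing.normalize_df`, which is not shown here: `STOP_WORDS` and `normalize` are hypothetical names, the stop-word list is a tiny stand-in for a full list such as NLTK's stopwords corpus, and stemming and lemmatization are left out.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def normalize(text):
    """Lowercase, keep only ASCII letters, and drop stop words."""
    text = text.lower()
    # Replace digits, punctuation, and any other non-letter symbol with a space.
    # Note that this also discards accented and other non-ASCII letters.
    text = re.sub(r"[^a-z\s]", " ", text)
    words = text.split()
    return " ".join(word for word in words if word not in STOP_WORDS)


print(normalize("The 3 Worms, in 2004, spread to 10% of hosts!"))
```

The order matters: lowercasing comes first so that the character filter and the stop-word lookup only have to deal with one case.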

In [174]:
importlib.reload(preprocessing)
df["content"] = preprocessing.normalize_df(df["content"])
# Save a backup of the normalized text
df.to_csv("data/backup_preprocess/content_normalized.txt")

0it [00:00, ?it/s]
333it [03:57,  1.40it/s]


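## Tokenization
Tokenization splits each document into a list of word tokens. Here we apply NLTK's `word_tokenize` to the normalized text.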
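
To show the shape of the result (one string in, one list of tokens out), here is a minimal sketch that uses a regular expression as a stand-in for `nltk.word_tokenize`. It is an assumption-laden simplification, not NLTK's algorithm: on text that has already been reduced to lowercase words it gives roughly the same result, but it does not reproduce NLTK's handling of punctuation or contractions. The names `tokenize` and `docs` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into runs of word characters (a simple regex tokenizer)."""
    return re.findall(r"\w+", text)


# Hypothetical, already-normalized documents.
docs = ["account measur process commun", "commerc exchang good servic"]
tokenized = [tokenize(doc) for doc in docs]
print(tokenized)
```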
 

In [177]:
importlib.reload(preprocessing)
df["content"] = df["content"].progress_apply(nltk.word_tokenize)
df.to_csv("data/backup_preprocess/content_tokenized.txt")
df.head(5)

0it [00:00, ?it/s]


Unnamed: 0,title,content,topic
0,Accounting,"[account, account, measur, process, commun, fi...",business
1,Commerce,"[commerc, exchang, good, servic, especi, larg,...",business
2,Finance,"[financ, term, matter, regard, manag, creation...",business
3,Industrial relations,"[industri, relat, employ, relat, multidiscipli...",business
4,Management,"[manag, manag, administr, organ, whether, busi...",business
