# Dataset: BBC News. Setting Format. 

In this Python notebook, a BBC News dataset is analyzed to add a few useful columns and to determine which file format is best for saving it.

In [33]:
# Import
from DimCuantifier import DimCuantifier
from PreProcessingDimCuantifier import PreProcessingDimCuantifier

import gensim
import pandas as pd

## Set Model

First, set the word embedding model, in this case normalized GloVe trained on 42 billion tokens, with 300 dimensions.

This will be useful for filtering out words from the news content that do not appear in the model.

In [34]:
# Create PreProcessingDimCuantifier object
PreProcess = PreProcessingDimCuantifier()

In [35]:
# Normalize word embeddings or use already normalized word embeddings
norm_glove_42B = 'normalized_glove.42B.300d.mod'

In [36]:
# Set current model of word embeddings
current_model = gensim.models.KeyedVectors.load_word2vec_format(
    norm_glove_42B, binary=True)

## Load Dataset

Load the dataset with pandas and inspect the column types and memory usage.

In [38]:
filename = 'bbc-news-data.csv'
bbc_news_dataset = pd.read_csv(filename, sep='\t')
bbc_news_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   category  2225 non-null   object
 1   filename  2225 non-null   object
 2   title     2225 non-null   object
 3   content   2225 non-null   object
dtypes: object(4)
memory usage: 69.7+ KB


In [39]:
bbc_news_dataset.head()

|   | category | filename | title | content |
|---|----------|----------|-------|---------|
| 0 | business | 001.txt | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarne... |
| 1 | business | 002.txt | Dollar gains on Greenspan speech | The dollar has hit its highest level against ... |
| 2 | business | 003.txt | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yuk... |
| 3 | business | 004.txt | High fuel prices hit BA's profits | British Airways has blamed high fuel prices f... |
| 4 | business | 005.txt | Pernod takeover talk lifts Domecq | Shares in UK drinks and food firm Allied Dome... |


## Add new columns and save memory usage

The *category* column can be cast to the pandas category dtype, and the *filename* column can be dropped.

In [40]:
bbc_news_dataset['category'] = bbc_news_dataset['category'].astype('category')
bbc_news_dataset = bbc_news_dataset.drop(['filename'], axis=1)
bbc_news_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   category  2225 non-null   category
 1   title     2225 non-null   object  
 2   content   2225 non-null   object  
dtypes: category(1), object(2)
memory usage: 37.3+ KB


This reduced the reported memory usage of the dataset from 69.7+ KB to 37.3+ KB.
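
The `+` in `memory usage: 37.3+ KB` means that pandas did not measure the strings stored in the object columns, so the figure is a lower bound. The following is a minimal sketch, on a small made-up frame rather than the BBC data, of how `memory_usage(deep=True)` can give a fuller picture of what the category cast saves:

```python
import pandas as pd

# Toy stand-in for the real dataset: few distinct labels, many rows.
labels = ['business', 'sport', 'tech', 'politics', 'entertainment']
toy = pd.DataFrame({'category': labels * 200})

# deep=True also counts the string data behind the column, not only the pointers.
as_object = toy['category'].memory_usage(deep=True)
as_category = toy['category'].astype('category').memory_usage(deep=True)

print(f'as loaded:   {as_object} bytes')
print(f'as category: {as_category} bytes')
```

The same call on `bbc_news_dataset` would also include the article text, which the figures printed by `info()` leave out.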

Now new columns will be added that can be useful for further analysis of the dataset:

- *content_tokenized*: content of the news, tokenized, with stopwords removed and filtered to keep only words present in the word embedding model
- *original_len*: length of the original content of the news, counted in whitespace-separated words
- *tokenized_len*: length of the new tokenized content of the news, counted in tokens

In [41]:
# Use PreProcess.preprocess_document to tokenize and remove stopwords from the documents
bbc_news_dataset['content_tokenized'] = bbc_news_dataset['content'].apply(lambda x: PreProcess.preprocess_document(x, current_model))
bbc_news_dataset['original_len'] = bbc_news_dataset['content'].apply(lambda x: len(x.split()))
bbc_news_dataset['tokenized_len'] = bbc_news_dataset['content_tokenized'].apply(lambda x: len(x))
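
The implementation of `PreProcess.preprocess_document` is not shown in this notebook. The sketch below is only a hypothetical stand-in that illustrates the idea described above (drop stopwords, keep only words known to the model); the function name `filter_tokens`, the stopword list and the toy vocabulary are all assumptions made for the example.

```python
# Hypothetical stand-in for PreProcess.preprocess_document; the real method
# is not shown in this notebook and may tokenize differently.
STOPWORDS = {'the', 'at', 'has', 'its', 'a', 'of'}  # tiny illustrative list


def filter_tokens(text, vocabulary, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, drop stopwords and out-of-vocabulary words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in stopwords and t in vocabulary]


# A plain set plays the role of the embedding model here; gensim KeyedVectors
# supports the same `word in model` membership test.
toy_vocabulary = {'dollar', 'hit', 'highest', 'level', 'euro'}
tokens = filter_tokens('The dollar has hit its highest level xyzzy', toy_vocabulary)
print(tokens)
```

Because the filter only needs a membership test, passing `current_model` instead of the toy set should work the same way.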

Take another look at the memory usage and column types.

In [42]:
bbc_news_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   category           2225 non-null   category
 1   title              2225 non-null   object  
 2   content            2225 non-null   object  
 3   content_tokenized  2225 non-null   object  
 4   original_len       2225 non-null   int64   
 5   tokenized_len      2225 non-null   int64   
dtypes: category(1), int64(2), object(3)
memory usage: 89.4+ KB


*original_len* and *tokenized_len* are *int64*, which may be wider than necessary. The pandas `describe` method displays the minimum and maximum value of every numeric column.

In [43]:
bbc_news_dataset.describe()

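|       | original_len | tokenized_len |
|-------|--------------|---------------|
| count | 2225.0 | 2225.0 |
| mean | 378.835955 | 229.438652 |
| std | 238.220755 | 132.533164 |
| min | 84.0 | 47.0 |
| 25% | 240.0 | 147.0 |
| 50% | 326.0 | 202.0 |
| 75% | 466.0 | 284.0 |
| max | 4428.0 | 2270.0 |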


Both maxima (4428 and 2270) fit comfortably in *int16*, so both columns can be downcast to save memory.

In [44]:
bbc_news_dataset[['original_len', 'tokenized_len']] = bbc_news_dataset[['original_len', 'tokenized_len']].apply(pd.to_numeric, downcast='integer')
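
As a small check of how the downcast behaves, the sketch below applies the same call to a two-row frame that holds only the minima and maxima reported by `describe()`. The frame itself is made up for the example.

```python
import pandas as pd

# Values taken from the describe() output: the column minima and maxima.
lengths = pd.DataFrame({'original_len': [84, 4428], 'tokenized_len': [47, 2270]})
print(lengths.dtypes)  # a wide integer type by default

# downcast='integer' picks the smallest signed integer dtype that holds every value.
downcast = lengths.apply(pd.to_numeric, downcast='integer')
print(downcast.dtypes)
```

Since 4428 exceeds the int8 range but is well below the int16 limit of 32767, both columns end up as int16.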

In [45]:
bbc_news_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   category           2225 non-null   category
 1   title              2225 non-null   object  
 2   content            2225 non-null   object  
 3   content_tokenized  2225 non-null   object  
 4   original_len       2225 non-null   int16   
 5   tokenized_len      2225 non-null   int16   
dtypes: category(1), int16(2), object(3)
memory usage: 63.3+ KB


In [46]:
bbc_news_dataset.head()

|   | category | title | content | content_tokenized | original_len | tokenized_len |
|---|----------|-------|---------|-------------------|--------------|---------------|
| 0 | business | Ad sales boost Time Warner profit | Quarterly profits at US media giant TimeWarne... | [quarterly, profits, us, media, giant, timewar... | 415 | 250 |
| 1 | business | Dollar gains on Greenspan speech | The dollar has hit its highest level against ... | [dollar, hit, highest, level, euro, almost, th... | 379 | 238 |
| 2 | business | Yukos unit buyer faces loan claim | The owners of embattled Russian oil giant Yuk... | [owners, embattled, russian, oil, giant, yukos... | 258 | 150 |
| 3 | business | High fuel prices hit BA's profits | British Airways has blamed high fuel prices f... | [british, airways, blamed, high, fuel, prices,... | 400 | 265 |
| 4 | business | Pernod takeover talk lifts Domecq | Shares in UK drinks and food firm Allied Dome... | [shares, uk, drinks, food, firm, allied, domec... | 260 | 163 |


## Save file

Now four different file formats will be tried for saving the dataset, to compare their memory usage and speed: CSV, pickle, Parquet and feather.

Install fastparquet so that pandas has a Parquet engine available, and show the installed versions.

In [47]:
!pip install fastparquet
pd.io.parquet.get_engine('auto')
pd.show_versions()


INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.8.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : AMD64 Family 23 Model 17 Stepping 0, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : Spanish_Belize.1252

pandas           : 1.5.3
numpy            : 1.20.3
pytz             : 2020.1
dateutil         : 2.8.1
setuptools       : 49.6.0.post20200814
pip              : 20.2.2
Cython           : 0.29.23
pytest           : None
hypothesis       : None
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : None
bottleneck       : 1.3.2

Save the dataset in the four file formats and time each write.

In [48]:
%timeit bbc_news_dataset.to_csv('bbc_news_dataset_tokenized.csv', sep='\t', index=False)
%timeit bbc_news_dataset.to_pickle('bbc_news_dataset_tokenized.pkl')
%timeit bbc_news_dataset.to_parquet('bbc_news_dataset_tokenized.parquet')
%timeit bbc_news_dataset.to_feather('bbc_news_dataset_tokenized.feather')

531 ms ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
254 ms ± 26.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
236 ms ± 38.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
131 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


CSV is clearly the slowest format to write (531 ms), and feather the fastest (131 ms).

Now time reading those files back.
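
`%timeit` is only available inside IPython. A rough sketch of the same kind of measurement with the standard `timeit` module is given below, on a made-up frame and for CSV and pickle only, since Parquet and feather need an extra engine installed. It also records the size of each file on disk, which this notebook does not measure; the names `toy`, `paths` and `results` are assumptions for the example.

```python
import os
import tempfile
import timeit

import pandas as pd

# Small made-up frame; the notebook times bbc_news_dataset instead.
toy = pd.DataFrame({'title': ['a', 'b', 'c'] * 100, 'n': range(300)})

with tempfile.TemporaryDirectory() as tmp:
    paths = {'csv': os.path.join(tmp, 'toy.csv'),
             'pickle': os.path.join(tmp, 'toy.pkl')}
    writers = {
        'csv': lambda: toy.to_csv(paths['csv'], sep='\t', index=False),
        'pickle': lambda: toy.to_pickle(paths['pickle']),
    }
    results = {}
    for name, write in writers.items():
        # Best of 3 repeats of 5 writes each, expressed per write in seconds.
        seconds = min(timeit.repeat(write, number=5, repeat=3)) / 5
        results[name] = (seconds, os.path.getsize(paths[name]))

for name, (seconds, size) in results.items():
    print(f'{name}: {seconds * 1000:.2f} ms per write, {size} bytes on disk')
```

Timings on a frame this small say little about the real dataset; the point is only the measuring pattern.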

In [50]:
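%timeit pd.read_csv('bbc_news_dataset_tokenized.csv', sep='\t')
%timeit pd.read_pickle('bbc_news_dataset_tokenized.pkl')
%timeit pd.read_parquet('bbc_news_dataset_tokenized.parquet')
%timeit pd.read_feather('bbc_news_dataset_tokenized.feather')
bbc_news_dataset_csv = pd.read_csv('bbc_news_dataset_tokenized.csv', sep='\t')  # %timeit discards assignments, so load again
bbc_news_dataset_pickle = pd.read_pickle('bbc_news_dataset_tokenized.pkl')
bbc_news_dataset_parquet = pd.read_parquet('bbc_news_dataset_tokenized.parquet')
bbc_news_dataset_feather = pd.read_feather('bbc_news_dataset_tokenized.feather')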

139 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
70.1 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
126 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
76.7 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


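Again, CSV is the slowest (139 ms), closely followed by Parquet (126 ms). Pickle (70.1 ms) and feather (76.7 ms) are roughly twice as fast.

Now compare the memory usage of the DataFrames loaded from each file.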

In [54]:
bbc_news_dataset_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   category           2225 non-null   object
 1   title              2225 non-null   object
 2   content            2225 non-null   object
 3   content_tokenized  2225 non-null   object
 4   original_len       2225 non-null   int64 
 5   tokenized_len      2225 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 104.4+ KB


In [55]:
bbc_news_dataset_pickle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   category           2225 non-null   category
 1   title              2225 non-null   object  
 2   content            2225 non-null   object  
 3   content_tokenized  2225 non-null   object  
 4   original_len       2225 non-null   int16   
 5   tokenized_len      2225 non-null   int16   
dtypes: category(1), int16(2), object(3)
memory usage: 63.2+ KB


In [56]:
bbc_news_dataset_parquet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   category           2225 non-null   object
 1   title              2225 non-null   object
 2   content            2225 non-null   object
 3   content_tokenized  2225 non-null   object
 4   original_len       2225 non-null   int16 
 5   tokenized_len      2225 non-null   int16 
dtypes: int16(2), object(4)
memory usage: 78.3+ KB


In [57]:
bbc_news_dataset_feather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   category           2225 non-null   category
 1   title              2225 non-null   object  
 2   content            2225 non-null   object  
 3   content_tokenized  2225 non-null   object  
 4   original_len       2225 non-null   int16   
 5   tokenized_len      2225 non-null   int16   
dtypes: category(1), int16(2), object(3)
memory usage: 63.3+ KB


The DataFrame loaded from CSV is the heaviest (104.4+ KB).

CSV and Parquet do not keep the column types as they were saved: both return *category* as object, and CSV also widens the length columns back to int64.

Pickle and feather are the lightest (63.2+ KB and 63.3+ KB) and both keep the column types.

Finally, take a look at how these formats keep the list structure of the column *content_tokenized*.
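
If CSV had to be used anyway, the lost types could be restored at read time. The sketch below shows one way to do this with the `dtype` argument of `read_csv`, using a made-up frame and an in-memory buffer in place of the saved file.

```python
import io

import pandas as pd

# Made-up rows standing in for the saved file; io.StringIO replaces the file on disk.
toy = pd.DataFrame({
    'category': pd.Categorical(['business', 'sport', 'business']),
    'original_len': pd.Series([415, 379, 258], dtype='int16'),
})
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)

# Without hints, read_csv infers generic dtypes from the text.
buffer.seek(0)
plain = pd.read_csv(buffer, sep='\t')

# With the dtype argument, the intended types are restored while parsing.
buffer.seek(0)
typed = pd.read_csv(buffer, sep='\t',
                    dtype={'category': 'category', 'original_len': 'int16'})
print(plain.dtypes)
print(typed.dtypes)
```

This keeps the file human-readable, at the cost of having to repeat the type information in every script that reads it.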

In [17]:
print(bbc_news_dataset_csv['content_tokenized'][0][0])
print(eval(bbc_news_dataset_csv['content_tokenized'][0])[0])

[
quarterly
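
The CSV file stores each list as its text representation, which is why indexing returns the character `[`. `eval` recovers the list but will execute any expression found in the file. A possible alternative, sketched here on made-up rows, is `ast.literal_eval`, which only parses Python literals:

```python
import ast
import io

import pandas as pd

# Made-up rows; io.StringIO replaces the file on disk.
toy = pd.DataFrame({'content_tokenized': [['quarterly', 'profits'], ['dollar', 'hit']]})
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer, sep='\t')
# Each cell is now a string such as "['quarterly', 'profits']"; literal_eval
# turns it back into a list without evaluating arbitrary code.
loaded['content_tokenized'] = loaded['content_tokenized'].apply(ast.literal_eval)
print(loaded['content_tokenized'][0][0])
```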


In [19]:
print(bbc_news_dataset_pickle['content_tokenized'][0][0])

quarterly


In [58]:
print(bbc_news_dataset_parquet['content_tokenized'][0][0])
print(bbc_news_dataset_parquet['content_tokenized'][0])
print(eval(bbc_news_dataset_parquet['content_tokenized'][0])[0])

91
b'["quarterly","profits","us","media","giant","timewarner","jumped","76","three","months","december","year-earlier","firm","one","biggest","investors","google","benefited","sales","high-speed","internet","connections","higher","advert","sales","timewarner","said","fourth","quarter","sales","rose","2","profits","buoyed","one-off","gains","offset","profit","dip","warner","bros","less","users","aol","time","warner","said","friday","owns","8","search-engine","google","internet","business","aol","mixed","fortunes","lost","464,000","subscribers","fourth","quarter","profits","lower","preceding","three","quarters","however","company","said","aol","\'s","underlying","profit","exceptional","items","rose","8","back","stronger","internet","advertising","revenues","hopes","increase","subscribers","offering","online","service","free","timewarner","internet","customers","try","sign","aol","\'s","existing","customers","high-speed","broadband","timewarner","also","restate","2000","2003","results","f
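
The `91` printed first is the byte value of `[`: the Parquet file written here returns each list as a bytes object, and indexing bytes yields integers. The content looks like a JSON array, so `json.loads` may be a safer way than `eval` to recover the list. The sketch below assumes every cell is valid JSON and uses a shortened, hand-written cell:

```python
import json

# A shortened cell in the same shape as the Parquet output:
# a bytes object holding a JSON array of strings.
cell = b'["quarterly","profits","us","media"]'

first_byte = cell[0]       # indexing bytes yields an integer
tokens = json.loads(cell)  # json.loads accepts bytes as well as str
print(first_byte)
print(tokens[0])
```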

In [21]:
print(bbc_news_dataset_feather['content_tokenized'][0][0])

quarterly


## Conclusion

Only pickle and feather keep the list structure.

In summary, pickle and feather are the lightest and keep the structure of the columns, including the lists.
Pickle is slightly faster to read (70.1 ms against 76.7 ms) and feather is considerably faster to write (131 ms against 254 ms).
Since this dataset will be read more often than it is saved, pickle is chosen.