# Download, prepare and save the Bag of Words Data Set

This notebook shows how to download, prepare, and save the Enron email collection of the Bag of Words Data Set from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml).


## Download the data

Follow these guidelines to download the data:

- Visit [the UCI website](https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/)
- Click on **docword.enron.txt.gz** to download the word counts.
- Decompress the file (it is gzip-compressed) and save the resulting **docword.enron.txt** in the same folder as this notebook.
- Then click on **vocab.enron.txt** to download the vocabulary, that is, the word names.
- Save **vocab.enron.txt** in the same folder as this notebook.

More information about this dataset is available on [its UCI page](https://archive.ics.uci.edu/ml/datasets/Bag+of+Words).
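The manual steps above can also be scripted. The following is a minimal sketch, assuming the two files are still served under the directory linked above; `fetch_enron_files` and `decompress_gzip` are hypothetical helper names, not part of the original notebook.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# Directory URL taken from the link above; it may have moved since this was written.
BASE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/"


def decompress_gzip(src, dst):
    """Write the decompressed contents of the gzip file `src` to `dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def fetch_enron_files(folder="."):
    """Hypothetical helper: download both files into `folder` and decompress the counts."""
    folder = Path(folder)
    archive = folder / "docword.enron.txt.gz"
    urllib.request.urlretrieve(BASE_URL + "docword.enron.txt.gz", archive)
    urllib.request.urlretrieve(BASE_URL + "vocab.enron.txt", folder / "vocab.enron.txt")
    decompress_gzip(archive, folder / "docword.enron.txt")
```

Calling `fetch_enron_files()` from the notebook's folder should leave `docword.enron.txt` and `vocab.enron.txt` next to the notebook, which is where the cells below expect them.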

In [1]:
import pandas as pd

In [3]:
# load the word counts, skipping the 3 header lines at the top of the file

data = pd.read_csv("docword.enron.txt", sep=" ", skiprows=3, header=None)
data.columns = ["docID", "wordID", "count"]

data.head()

|  | docID | wordID | count |
|---|---|---|---|
| 0 | 1 | 118 | 1 |
| 1 | 1 | 285 | 1 |
| 2 | 1 | 1229 | 1 |
| 3 | 1 | 1688 | 1 |
| 4 | 1 | 2068 | 1 |
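The `skiprows=3` argument in the cell above skips the three header lines of `docword.enron.txt`. Assuming these lines hold the number of documents, the vocabulary size, and the number of non-zero counts (the layout given in the dataset description), they can be read separately and used as a sanity check. `read_docword_header` is a hypothetical helper introduced for illustration.

```python
def read_docword_header(path):
    """Return (n_docs, n_words, n_nonzero) from the first three lines of a docword file.

    Assumes each of the first three lines contains a single integer.
    """
    with open(path) as f:
        n_docs, n_words, n_nonzero = (int(next(f)) for _ in range(3))
    return n_docs, n_words, n_nonzero
```

For example, after `n_docs, n_words, n_nonzero = read_docword_header("docword.enron.txt")`, the value of `len(data)` should equal `n_nonzero` if the file was loaded completely.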


In [4]:
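# load the words (one per line); keep_default_na=False stops pandas from
# reading entries such as "null" or "nan" as missing values

words = pd.read_csv("vocab.enron.txt", header=None, keep_default_na=False)
words.columns = ["words"]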

words.head()

|  | words |
|---|---|
| 0 | aaa |
| 1 | aaas |
| 2 | aactive |
| 3 | aadvantage |
| 4 | aaker |


In [5]:
# randomly select 10 words (the fixed random_state makes the selection reproducible)

words = words.sample(10, random_state=290917)

words

|  | words |
|---|---|
| 8704 | eurobond |
| 13618 | keen |
| 11114 | halligan |
| 19968 | pvr |
| 23327 | soda |
| 20714 | refundable |
| 390 | advice |
| 6257 | decker |
| 8680 | etis |
| 3370 | cab |


In [6]:
# keep only the count rows whose wordID matches the index of a sampled word

data = words.merge(data, left_index=True, right_on="wordID")

data.head()

|  | words | docID | wordID | count |
|---|---|---|---|---|
| 137715 | eurobond | 2021 | 8704 | 2 |
| 140167 | eurobond | 2050 | 8704 | 11 |
| 151530 | eurobond | 2269 | 8704 | 2 |
| 155066 | eurobond | 2352 | 8704 | 2 |
| 156247 | eurobond | 2375 | 8704 | 2 |
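The merge above matches the vocabulary's row index, which starts at 0, against `wordID`. If `wordID` is 1-based, meaning that line n of `vocab.enron.txt` is the word with `wordID` n (this is how the dataset description reads, but it is worth verifying against the files), each count would be paired with the word one line further down. A minimal sketch on toy data shows the difference and one possible way to align the two; all names and values here are made up for illustration.

```python
import pandas as pd

# Toy vocabulary: after read_csv, "apple" sits at index 0, "banana" at 1, "cherry" at 2.
vocab = pd.DataFrame({"words": ["apple", "banana", "cherry"]})

# Toy counts with 1-based word IDs (assumption: wordID n refers to line n of the vocabulary).
counts = pd.DataFrame({"docID": [1, 1, 2], "wordID": [1, 3, 2], "count": [4, 1, 2]})

# Merging on the 0-based index pairs wordID 1 with "banana" instead of "apple".
naive = vocab.merge(counts, left_index=True, right_on="wordID")

# Shifting the vocabulary index by one lines it up with the 1-based wordID.
vocab_1_based = vocab.copy()
vocab_1_based.index = vocab_1_based.index + 1
aligned = vocab_1_based.merge(counts, left_index=True, right_on="wordID")

print(naive)
print(aligned)
```

Whether the shift is needed depends on the indexing convention of the real files, so checking a few known words against their counts is a reasonable first step.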


In [7]:
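# reconstitute the bag of words dataset: one row per document, one column per word

bow = data.pivot(index="docID", columns="words", values="count")
bow = bow.fillna(0)  # a word absent from a document has a count of 0
bow = bow.reset_index(drop=True)  # drop docID and use a plain 0..n-1 index
bow.shape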

(1388, 10)

In [8]:
bow.head()

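|  | advice | cab | decker | etis | eurobond | halligan | keen | pvr | refundable | soda |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |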
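The reshaping step can be easier to follow on a small example. The sketch below uses made-up counts to show how `pivot` turns the long table (one row per document and word pair) into a wide one, and why documents that contain none of the selected words do not appear in the result.

```python
import pandas as pd

# Long format: one row per (document, word) pair with a non-zero count.
# Document 3 contains none of the selected words, so it has no rows at all.
long_counts = pd.DataFrame({
    "docID": [1, 1, 2, 4],
    "words": ["keen", "soda", "keen", "cab"],
    "count": [2, 1, 5, 3],
})

# Wide format: one row per document that appears above, one column per word.
wide = long_counts.pivot(index="docID", columns="words", values="count")

# Pairs missing from the long table become NaN; replace them with a count of 0.
wide = wide.fillna(0).reset_index(drop=True)
print(wide)
```

Note that `pivot` raises an error if the same document and word pair occurs more than once, so it relies on each pair being listed only once in the counts.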


In [9]:
bow.to_csv("../bag_of_words.csv", index=False)
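
A quick way to confirm that the file was written as expected is to read it back and compare it with the data frame in memory. The sketch below does this with a small made-up frame and a temporary folder, so it does not touch the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the bag of words table (values are made up).
toy = pd.DataFrame({"keen": [0.0, 2.0], "soda": [1.0, 0.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bag_of_words.csv"
    toy.to_csv(path, index=False)  # same call pattern as in the cell above
    restored = pd.read_csv(path)

# True when the columns, dtypes, and values all survive the round trip.
print(restored.equals(toy))
```

The same comparison could be applied to `bow` and `pd.read_csv("../bag_of_words.csv")`.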