tfm-nlp

Data

Download Enron data

wget https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz

Project structure configuration

Run projec-structure.sh to create the project directory structure, or to verify that it is already in place.

Extraction

The extract.sh script downloads the Enron emails and extracts them into their respective environment. If the files already exist, it does nothing. This avoids downloading the archive again, since the files are large.
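The "skip if already present" behaviour described above can be sketched in Python as follows. This is a minimal illustration and not the contents of extract.sh: the function names and the idea of checking for a non-empty target folder are assumptions made for the example.

```python
import tarfile
import urllib.request
from pathlib import Path

ENRON_URL = "https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz"


def download_if_missing(url: str, archive: Path) -> bool:
    """Download url to archive unless the file exists. Returns True if downloaded."""
    if archive.exists():
        return False
    archive.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, archive)
    return True


def extract_if_missing(archive: Path, target: Path) -> bool:
    """Extract archive into target unless target already has content."""
    if target.is_dir() and any(target.iterdir()):
        return False
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return True
```

A call such as download_if_missing(ENRON_URL, Path("data/raw/enron.tar.gz")) followed by extract_if_missing(...) on the same archive would then be safe to repeat; the data/raw path is a hypothetical location, not one taken from the project.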

Transformation

Ignoring directories

According to Klimt & Yang (2004), there are two folders that are not necessary:

discussion_threads: Does not appear to be used directly by the users; it seems to be computer generated.
all_documents: Contains a large number of duplicate emails that are already present in the rest of the folders (TODO: Check if there is time).
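One way to skip these two folders while walking the extracted mail directory is sketched below. The function name and the assumption that the folders appear as subdirectories somewhere under a maildir root are illustrative; the project's own scripts may handle this differently.

```python
import os
from pathlib import Path
from typing import Iterator

# Folder names taken from the list above.
IGNORED_FOLDERS = {"discussion_threads", "all_documents"}


def iter_email_files(maildir: Path) -> Iterator[Path]:
    """Yield every file under maildir, skipping the ignored folders."""
    for root, dirs, files in os.walk(maildir):
        # Prune in place so os.walk never descends into ignored folders.
        dirs[:] = sorted(d for d in dirs if d not in IGNORED_FOLDERS)
        for name in sorted(files):
            yield Path(root) / name
```

Pruning the directory list in place avoids even listing the contents of the ignored folders, which matters when they hold many duplicate messages.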

Preprocessing

See the Preprocessing.ipynb notebook in the notebooks folder.

Doc2vec and TF-IDF transformations

See the Doc2Vec.ipynb and TF-IDF.ipynb notebooks in the notebooks folder.

Distance Matrices Calculation

See the distance_matrices*.py scripts, which compute Euclidean, cosine and WMD distance matrices for both doc2vec and TF-IDF vectors.
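As a rough illustration of what a pairwise distance matrix looks like, the sketch below computes the Euclidean and cosine variants with NumPy for a small dense matrix of document vectors. The function names are hypothetical and the code is not taken from the scripts; WMD is left out because it needs word embeddings rather than document vectors.

```python
import numpy as np


def euclidean_distance_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating point rounding.
    return np.sqrt(np.maximum(sq_dists, 0.0))


def cosine_distance_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine distances (1 - cosine similarity); assumes no all-zero rows."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    return 1.0 - unit @ unit.T
```

Both functions return a dense n by n matrix, so memory grows quadratically with the number of documents.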

Clustering

See the test_clustering*.py scripts, which run KMeans, DBSCAN and HDBSCAN and compute the following scores for each:

Silhouette Coefficients
Calinski-Harabasz
Davies-Bouldin
Entropy
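The first three scores are available in scikit-learn, so a minimal sketch of the evaluation could look like the following. It assumes scikit-learn and synthetic two-dimensional data, uses KMeans only, and is not the logic of the test_clustering*.py scripts. Entropy is omitted because this README does not say how it is defined.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data: two well-separated blobs standing in for document vectors.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

scores = {
    "silhouette": silhouette_score(X, labels),                # higher is better
    "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For DBSCAN and HDBSCAN, points labelled as noise (label -1) would need to be handled before scoring; how the scripts treat them is not stated here.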

The cuMLHDBSCAN.ipynb notebook contains a draft of the logic behind these scripts.

Results

A couple of notebooks are still in progress:

Results.ipynb
LabeledEmails.ipynb

Notes

The rest of the notebooks and scripts are kept for legacy purposes. They will be removed for the final delivery.

MiguelHeCa/tfm-nlp

Folders and files

Latest commit

History

Repository files navigation

tfm-nlp

Data

Project structure configuration

Extraction

Transformation

Ignoring directories

Preprocessing

Doc2vec and TF-IDF transformations

Distance Matrices Calculation

Clustering

Results

Notes

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages