# Stack Overflow dataset analysis

This notebook uses the MSR2021Replication scripts to analyze other Stack Overflow datasets, using the Mallet tool to cluster a dataset into a predetermined number of topics. It aims to make those scripts easier to use, easier to understand, and applicable to other datasets.


## Install Python libraries

The first step is to install the libraries used by the scripts.

In [1]:
!pip3 install -r notebook/requirements.txt



In [2]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/nailliW/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nailliW/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Add the `notebook/` directory to the module search path so that the scripts can be imported from this notebook.

In [3]:
import sys
sys.path.insert(0, 'notebook/')

## Export variables

To point the scripts at your own data, set the environment variables below: the path to the raw dataset, the output folder, and the number of topics.


In [4]:
# Export path to the raw dataset
%env DATASET_PATH=./tcc/so_questions.csv

# Export the output path
%env OUTPUT_PATH=./output

# Export the number of topics
%env TOPICS_NUM=15


env: DATASET_PATH=./tcc/so_questions.csv
env: OUTPUT_PATH=./output
env: TOPICS_NUM=15
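
The source of the scripts is not shown here, so how they read these settings is an assumption. A minimal sketch of how Python code could pick up the three variables is given below; `load_config` is a hypothetical helper, not part of the MSR2021Replication scripts, and the example values simply mirror the cell above.

```python
import os


def load_config(env=os.environ):
    """Collect the three notebook settings from environment variables.

    Hypothetical helper for illustration; the real scripts may read
    these values differently.
    """
    return {
        "dataset_path": env["DATASET_PATH"],
        "output_path": env["OUTPUT_PATH"],
        # Environment values are always strings, so convert explicitly.
        "topics_num": int(env["TOPICS_NUM"]),
    }


# Example values copied from the %env cell above.
example_env = {
    "DATASET_PATH": "./tcc/so_questions.csv",
    "OUTPUT_PATH": "./output",
    "TOPICS_NUM": "15",
}
config = load_config(example_env)
print(config)
```

Passing the mapping as a parameter keeps the helper easy to try with sample values; calling `load_config()` with no argument reads the live environment and raises `KeyError` if a variable is missing.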


## Prepare dataset for Mallet

The following scripts clean the Stack Overflow dataset and prepare the documents on which the Mallet tool will run its topic-modeling algorithm.


In [5]:
from clean_stackoverflow_data import clean_so_data
from export_so_to_mallet import export_to_mallet

print('Cleaning dataset...')
clean_so_data()

print('Exporting documents to Mallet...')
export_to_mallet()
print('Done!')


Cleaning dataset...
./tcc/so_questions.csv


FileNotFoundError: [Errno 2] No such file or directory: './tcc/so_questions.csv'
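
The run above stops because `DATASET_PATH` does not point to an existing file relative to the notebook's working directory. One way to catch this earlier is a small pre-flight check before the cleaning step. The sketch below is an illustration only; `check_dataset` is a hypothetical helper and not part of the replication scripts.

```python
import os
from pathlib import Path


def check_dataset(path):
    """Return the resolved dataset path, or raise with a clearer message."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {resolved} "
            f"(working directory: {Path.cwd()})"
        )
    return resolved


# Check the path configured through the environment, if any.
dataset = os.environ.get("DATASET_PATH")
if dataset is None:
    print("DATASET_PATH is not set")
else:
    try:
        print("Using dataset:", check_dataset(dataset))
    except FileNotFoundError as err:
        print(err)
```

Printing the working directory alongside the resolved path makes it easier to see whether a relative path such as `./tcc/so_questions.csv` is being resolved from the expected folder.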

## Run the Mallet Tool

The next step is to run the Mallet tool. The Mallet commands use the environment variables set at the beginning of this notebook.

In [6]:
!mallet/mallet-2.0.8/bin/mallet import-dir --input $OUTPUT_PATH/so_data/ --output $OUTPUT_PATH/so.mallet --keep-sequence --remove-stopwords --extra-stopwords extra_stop_words/so.txt

Labels = 
   ./output/so_data/


In [7]:
!mallet/mallet-2.0.8/bin/mallet train-topics --random-seed 100 --input $OUTPUT_PATH/so.mallet --num-topics $TOPICS_NUM --optimize-interval 20 --output-state $OUTPUT_PATH/so-topic-state.gz --output-topic-keys $OUTPUT_PATH/so_keys.txt --output-doc-topics $OUTPUT_PATH/so_composition.txt --diagnostics-file $OUTPUT_PATH/so_diagnostics.xml


Mallet LDA: 15 topics, 4 topic bits, 1111 topic mask
Data loaded.
max tokens: 0
total tokens: 0
Infinite value after topic 0 0
<10> LL/token: NaN
Infinite value after topic 0 0
<20> LL/token: NaN
Infinite value after topic 0 0
<30> LL/token: NaN
Infinite value after topic 0 0
<40> LL/token: NaN

0	0.33333	
1	0.33333	
2	0.33333	
3	0.33333	
4	0.33333	
5	0.33333	
6	0.33333	
7	0.33333	
8	0.33333	
9	0.33333	
10	0.33333	
11	0.33333	
12	0.33333	
13	0.33333	
14	0.33333	

Infinite value after topic 0 0
<50> LL/token: NaN
Infinite value after topic 0 0
<60> LL/token: NaN
Infinite value after topic 0 0
<70> LL/token: NaN
Infinite value after topic 0 0
<80> LL/token: NaN
Infinite value after topic 0 0
<90> LL/token: NaN

0	0.33333	
1	0.33333	
2	0.33333	
3	0.33333	
4	0.33333	
5	0.33333	
6	0.33333	
7	0.33333	
8	0.33333	
9	0.33333	
10	0.33333	
11	0.33333	
12	0.33333	
13	0.33333	
14	0.33333	

Infinite value after topic 0 0
<100> LL/token: NaN
Infinite value after topic 0 0
<110> LL/token: NaN
Infinite
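
The log above reports `total tokens: 0` and `LL/token: NaN`, which is consistent with Mallet having been given no documents; here that is plausibly a consequence of the cleaning step failing earlier. A minimal sketch of a sanity check to run before `import-dir` follows. It assumes the exported documents are plain files under `$OUTPUT_PATH/so_data/`, and `count_nonempty_documents` is a hypothetical helper.

```python
import os
from pathlib import Path


def count_nonempty_documents(directory):
    """Count regular files under `directory` that contain at least one byte."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(
        1 for p in root.rglob("*") if p.is_file() and p.stat().st_size > 0
    )


# Fall back to the notebook's default output folder if the variable is unset.
output_path = os.environ.get("OUTPUT_PATH", "./output")
n_docs = count_nonempty_documents(Path(output_path) / "so_data")
if n_docs == 0:
    print("No documents found; Mallet would train on an empty corpus.")
else:
    print(f"Found {n_docs} non-empty documents.")
```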

## Parse results

The following scripts parse the Mallet output and place all questions from the same topic into one file, resulting in one file per topic.

In [8]:
from parse_topics_composition import parse_topics
from unite_topics_in_one_file import unite_questions_documents_by_topic

print('Parsing topics...')
parse_topics()
print('Uniting questions by topic...')
unite_questions_documents_by_topic()
print('Done!')

Parsing topics...


KeyError: 'filename'
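
To illustrate the grouping idea, the sketch below assigns each document to its highest-weighted topic. It is not the implementation of `parse_topics`. It assumes the doc-topics file is tab-separated with a document index, a document name, and then one proportion per topic in topic order; other Mallet versions or options may produce a different layout. The sample rows are made up for illustration.

```python
from collections import defaultdict


def dominant_topics(lines):
    """Map each topic index to the documents whose largest proportion is that topic."""
    by_topic = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and any header line starting with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # Not enough columns for a name plus proportions.
        name = fields[1]
        proportions = [float(value) for value in fields[2:]]
        best = max(range(len(proportions)), key=proportions.__getitem__)
        by_topic[best].append(name)
    return dict(by_topic)


# Made-up rows in the assumed layout: index, name, one proportion per topic.
sample = [
    "0\tso_data/1.txt\t0.7\t0.2\t0.1",
    "1\tso_data/2.txt\t0.1\t0.1\t0.8",
    "2\tso_data/3.txt\t0.6\t0.3\t0.1",
]
groups = dominant_topics(sample)
print(groups)
```

With real output, the lines would come from iterating over the opened `so_composition.txt` file, and each group could then be written to its own file.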