# Stack Overflow dataset analysis

This notebook uses the MSR2021Replication scripts to analyze other Stack Overflow datasets. It relies on the Mallet tool to cluster a dataset into a predetermined number of topics. The goal is to make those scripts simpler to use, easier to understand, and applicable to other datasets.


## Install Python libraries

The first step is to install the libraries used by the scripts.

In [1]:
!pip3 install -r notebook/requirements.txt

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Defaulting to user installation because normal site-packages is not writeable


In [2]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/nailliW/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nailliW/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Add the `notebook/` directory to the import path so that its scripts can be imported from this notebook.

In [3]:
import sys
sys.path.insert(0, 'notebook/')

## Export variables

To point the scripts at your own data, set the environment variables below to the path of your dataset, the output folder, and the number of topics.


In [4]:
# Export path to the raw dataset
%env DATASET_PATH=./tcc/so_questions.csv

# Export the output path
%env OUTPUT_PATH=./output

# Export the number of topics
%env TOPICS_NUM=15


env: DATASET_PATH=./tcc/so_questions.csv
env: OUTPUT_PATH=./output
env: TOPICS_NUM=15
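
The scripts presumably pick these values up from the process environment. The sketch below shows one way such a lookup could work. The variable names come from the cell above, but the helper `load_settings` and its fallback defaults are hypothetical and are not taken from the MSR2021Replication scripts.

```python
import os


def load_settings(env=None):
    """Read the notebook's settings from environment variables.

    Hypothetical helper: the function and its default values are
    illustrative assumptions, not part of the original scripts.
    """
    env = os.environ if env is None else env
    return {
        "dataset_path": env.get("DATASET_PATH", "./tcc/so_questions.csv"),
        "output_path": env.get("OUTPUT_PATH", "./output"),
        # Environment values are always strings, so convert explicitly.
        "topics_num": int(env.get("TOPICS_NUM", "15")),
    }


# Pass a plain dict to try the helper without touching the real environment.
settings = load_settings(
    {
        "DATASET_PATH": "./tcc/so_questions.csv",
        "OUTPUT_PATH": "./output",
        "TOPICS_NUM": "15",
    }
)
print(settings["topics_num"])
```

Converting `TOPICS_NUM` to an integer up front means a malformed value fails here rather than later in the pipeline.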


## Prepare dataset for Mallet

The following scripts clean the Stack Overflow dataset and prepare the documents on which the Mallet tool will run the algorithm that separates them into topics.


In [5]:
from clean_stackoverflow_data import clean_so_data
from export_so_to_mallet import export_to_mallet

print('Cleaning dataset...')
clean_so_data()

print('Exporting documents to Mallet...')
export_to_mallet()
print('Done!')


Cleaning dataset...
./tcc/so_questions.csv
Loaded CSV!
Removed HTML tags!
Removed stopwords!
Saved new csv!
Exporting documents to Mallet...
Done!
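
The log above names two cleaning steps: removing HTML tags and removing stopwords. The sketch below applies both to a single string using only the standard library. It is not the implementation of `clean_so_data`: the function `clean_text`, the tokenizing regular expression, and the tiny stopword set (a stand-in for NLTK's English list) are all assumptions made for illustration.

```python
import html
import re

# Tiny stand-in for NLTK's English stopword list (assumption for brevity).
STOPWORDS = {"a", "an", "the", "is", "to", "in", "of", "and", "how", "i", "do"}

# Matches anything that looks like an HTML tag, e.g. <p> or </code>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Strip HTML tags, unescape entities, lowercase, and drop stopwords."""
    no_tags = TAG_RE.sub(" ", raw)
    text = html.unescape(no_tags).lower()
    # Keep word-like tokens, including dotted or hyphenated identifiers.
    tokens = re.findall(r"[a-z0-9_.\-]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)


print(clean_text("<p>How do I read an <code>env</code> variable in Python?</p>"))
```

A regular expression is enough for a sketch, but a real HTML parser would be more robust on malformed markup.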


## Run the Mallet Tool

The next step is to run the Mallet tool. The Mallet commands use the environment variables set at the beginning of this notebook.

In [6]:
!mallet/mallet-2.0.8/bin/mallet import-dir --input $OUTPUT_PATH/so_data/ --output $OUTPUT_PATH/so.mallet --keep-sequence --remove-stopwords --extra-stopwords extra_stop_words/so.txt

Labels = 
   ./output/so_data/


In [7]:
!mallet/mallet-2.0.8/bin/mallet train-topics --random-seed 100 --input $OUTPUT_PATH/so.mallet --num-topics $TOPICS_NUM --optimize-interval 20 --output-state $OUTPUT_PATH/so-topic-state.gz --output-topic-keys $OUTPUT_PATH/so_keys.txt --output-doc-topics $OUTPUT_PATH/so_composition.txt --diagnostics-file $OUTPUT_PATH/so_diagnostics.xml


Mallet LDA: 15 topics, 4 topic bits, 1111 topic mask
Data loaded.
max tokens: 239
total tokens: 1891
<10> LL/token: -7.81783
<20> LL/token: -7.73297
<30> LL/token: -7.60891
<40> LL/token: -7.57648

0	0.33333	dependency await adminsiteotprequired difference system ensure leak otpadminsite import reading understanding development surrounding tool implicit tf_urls urlpatterns class understand execution 
1	0.33333	environment file variable profile question env hard write load serve order process twelve-factor supply app.web.port app.db.conn prod database follow create 
2	0.33333	developing based pas list command current driver linux native testing theory aware repository design accept swift team quot branch advance 
3	0.33333	user authentication login idea json private python workflow enable jbutton username box admin gdrive browser access small failed ci/cd stuff 
4	0.33333	release build run quot stage apply web simply give key sound define method change impossible runtime recommended htt
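
Each topic line printed above has three tab-separated fields: the topic index, a weight for the topic, and the topic's top words separated by spaces. Assuming the file written by `--output-topic-keys` uses the same layout, it could be read back with a sketch like the one below. The function `parse_topic_keys` is hypothetical, and the sample lines are shortened copies of the output above.

```python
def parse_topic_keys(lines):
    """Parse Mallet topic-key lines into {topic_id: (weight, [words])}."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split into at most three fields: id, weight, space-separated words.
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = (float(weight), words.split())
    return topics


# Shortened copies of two lines from the training output.
sample = [
    "1\t0.33333\tenvironment file variable profile question env\n",
    "3\t0.33333\tuser authentication login idea json private\n",
]
topics = parse_topic_keys(sample)
print(topics[3][1][:3])
```

To use it on the real output, pass an open file handle for `so_keys.txt` in the output folder instead of `sample`.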

## Parse results

The following scripts parse the Mallet output and place all questions from the same topic into one file, resulting in one file per topic.

In [8]:
from parse_topics_composition import parse_topics
from unite_topics_in_one_file import unite_questions_documents_by_topic

print('Parsing topics...')
parse_topics()
print('Uniting questions by topic...')
unite_questions_documents_by_topic()
print('Done!')

Parsing topics...
Uniting questions by topic...
Folder ./output/topics created!
Done!
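
The grouping step can be pictured as assigning each document to its highest-weighted topic. The sketch below is not the code of `parse_topics` or `unite_questions_documents_by_topic`. It assumes that `so_composition.txt` holds one tab-separated line per document (document index, document name, then one proportion per topic), which should be checked against the actual file. The function `group_by_dominant_topic` and the sample rows are invented for illustration.

```python
from collections import defaultdict


def group_by_dominant_topic(lines):
    """Map each topic index to the documents where it has the largest share."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any header comment
        fields = line.rstrip("\n").split("\t")
        doc_name = fields[1]
        proportions = [float(p) for p in fields[2:]]
        # Index of the largest proportion is the dominant topic.
        dominant = max(range(len(proportions)), key=proportions.__getitem__)
        groups[dominant].append(doc_name)
    return dict(groups)


# Invented rows in the assumed layout, with three topics for brevity.
sample = [
    "0\tq1.txt\t0.70\t0.20\t0.10\n",
    "1\tq2.txt\t0.05\t0.15\t0.80\n",
    "2\tq3.txt\t0.60\t0.30\t0.10\n",
]
groups = group_by_dominant_topic(sample)
print(groups)
```

Writing one file per key of `groups` would then give the per-topic files described in this section.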
