In [1]:
import requests
import pandas as pd
from datetime import datetime
import time

# __1. Data Collection__

I've written a function called `fetch_questions` that's designed to gather questions from Stack Overflow related to Natural Language Processing (NLP) and related areas.

Here's a step-by-step explanation:

1.  **Setting up the API:**
    * I start by defining the base URL for the Stack Exchange API, which is where I'll be requesting the data from.
    * I initialize an empty list called `all_questions` to store all the questions I retrieve.
    * I have a list of tags related to NLP, like 'nlp', 'language-model', 'text-classification', and many more. This list is important because I'll use these tags to filter the questions I want.
    * I define a dictionary called 'params' that contains all the parameters that will be passed to the stackoverflow api. These parameters include:
        * 'page_size': how many questions should be returned per page.
        * 'accepted': only return questions that have accepted answers.
        * 'order': order the returned questions by their creation date.
        * 'sort': sort the returned questions by their creation date.
        * 'site': the site to get the questions from, which is stackoverflow.
        * 'tagged': the tag to filter the questions by.
        * 'key': my api key.
        * 'filter': include the body of the question in the returned data.

2.  **Looping through Pages and Tags:**
    * I use a `while` loop to keep fetching questions until I have at least 30,000 questions.
    * Inside the loop, I set the `page` and `tagged` parameters in my request.
    * I make a request to the Stack Exchange API using the `requests.get()` function.
    * I check if the API request was successful (status code 200). If not, I print an error message and stop.
    * I print out the page number to monitor the progress of the data fetching.
    * I then extract the 'items' from the json response, which are the questions, and append them to the 'all_questions' list.
    * I increse the size variable, that keeps track of how many questions have been added.
    * I add a small delay using `time.sleep(1)` to avoid overwhelming the API.
    * I increment the page number.
    * I check if there are more pages for the current tag. If not, I move to the next tag in the list.
    * If all tags have been processed, the loop stops.

3.  **Returning the Results:**
    * Finally, I return the `all_questions` list, which now contains all the fetched questions.

4.  **Creating a DataFrame:**
    * Outside the function, I call `fetch_questions()` to get the questions.
    * I then convert the list of questions into a Pandas DataFrame for easier data manipulation.

In essence, I'm using the Stack Exchange API to gather a large dataset of NLP-related questions from Stack Overflow, ensuring I only get questions with accepted answers. I'm iterating through multiple pages and tags to gather enough data. Then, I convert this data into a Pandas DataFrame.


In [None]:

def fetch_questions(pages=200, page_size=30):
    """
    Fetches Stack Overflow questions related to Natural Language Processing (NLP) and related topics using the Stack Exchange API.

    Args:
        pages (int, optional): The maximum number of pages to fetch. Defaults to 200.
        page_size (int, optional): The number of questions to fetch per page. Defaults to 100.
    Returns:
        list: A list of dictionaries, where each dictionary represents a question.
    """

    BASE_URL = "https://api.stackexchange.com/2.3/search/advanced"
    all_questions = []

    tags = ['nlp', 'language-model', 'text-classification', 'word-embedding', 'spacy', 'nltk', 'seq2seq',
            'sentence-similarity', 'named-entity-recognition', 'text-processing', 'text-mining',
            'sentiment-analysis', 'stemming', 'lemmatization', "huggingface-transformers",
            'tokenization', 'lstm', 'chatbot', 'language-detection', 'speech-to-text',
            'text-to-speech', 'gensim', 'deep-learning', 'machine-learning']  # Supplement 'nlp' if needed.

    params = {
        'page_size': 30,  # Number of questions per page from the API.
        'accepted': 'True',  # Fetch only questions with accepted answers.
        'order': 'desc',  # Order results by creation date (newest first).
        'sort': 'creation',  # Sort by creation date.
        'site': 'stackoverflow',  # Stack Overflow site.
        'tagged': 'nlp',  # Initial tag to filter by.
        'key': 'rl_GXP3yYDC5NCfRWViwHuhPwMAZ',  # My API key.
        'filter': 'withbody',  # Include question body in the response.
    }

    page = 1
    cur_tag = 0
    size = 0
    while size < 30000:  # Ensure we collect at least 30,000 questions, even if it requires using multiple related tags.
        params["page"] = page
        params["tagged"] = tags[cur_tag]
        response = requests.get(BASE_URL, params=params)
        if response.status_code != 200:  # Handle API request errors.
            print(f"Error on page {page}: {response.status_code}. Skipping this page.")
            break

        if page % 20 == 1:  # Print progress every 20 pages.
            print(f"Fetching data from {page}-th page of tag {tags[cur_tag]}...")

        data = response.json()
        all_questions.extend(data.get("items", []))
        size += params['page_size']  # Track total questions fetched.
        time.sleep(1)  # Prevent API rate limiting.
        page += 1
        if data["has_more"] == False:  # Move to the next tag if no more pages for the current tag.
            cur_tag += 1
            page = 1
            if cur_tag + 1 > len(tags):
                break
    return all_questions

questions = fetch_questions()
df = pd.DataFrame(questions)

Fetching data from 1-th page of tag nlp...
Fetching data from 21-th page of tag nlp...
Fetching data from 41-th page of tag nlp...
Fetching data from 61-th page of tag nlp...
Fetching data from 81-th page of tag nlp...
Fetching data from 101-th page of tag nlp...
Fetching data from 121-th page of tag nlp...
Fetching data from 141-th page of tag nlp...
Fetching data from 161-th page of tag nlp...
Fetching data from 181-th page of tag nlp...
Fetching data from 201-th page of tag nlp...
Fetching data from 221-th page of tag nlp...
Fetching data from 241-th page of tag nlp...
Fetching data from 261-th page of tag nlp...
Fetching data from 281-th page of tag nlp...
Fetching data from 1-th page of tag language-model...
Fetching data from 1-th page of tag text-classification...
Fetching data from 21-th page of tag text-classification...
Fetching data from 1-th page of tag word-embedding...
Fetching data from 1-th page of tag spacy...
Fetching data from 21-th page of tag spacy...
Fetching data

In [None]:
df.to_csv("nlp_questions.csv", index = False)

In [None]:
df.head()

Unnamed: 0,tags,owner,is_answered,view_count,closed_date,accepted_answer_id,answer_count,score,last_activity_date,creation_date,...,closed_reason,title,body,content_license,last_edit_date,posted_by_collectives,migrated_from,protected_date,community_owned_date,locked_date
0,"[numpy, nlp, dependencies, google-colaboratory...","{'account_id': 8652474, 'reputation': 687, 'us...",True,89,1743076000.0,79523777,1,0,1742494070,1742481362,...,Duplicate,Trouble getting importing gensim to work in colab,<p>I am trying to import gensim into colab.</p...,,,,,,,
1,"[python, nlp, large-language-model]","{'account_id': 1230089, 'reputation': 5390, 'u...",True,26,,79501337,1,0,1741708699,1741704631,...,,Store images instead of showing in a server,<p>I am running the code found on this [site][...,CC BY-SA 4.0,,,,,,
2,"[python, nlp, spacy, langchain, presidio]","{'account_id': 22369526, 'reputation': 69, 'us...",True,210,,79495969,2,4,1742055531,1741040827,...,,Presidio with Langchain Experimental does not ...,<p>I am using presidio/langchain_experimental ...,CC BY-SA 4.0,1741330000.0,,,,,
3,"[nlp, opennlp]","{'account_id': 21332, 'reputation': 5495, 'use...",True,32,,79475445,1,1,1740743750,1740240371,...,,OpenNLP POSTaggerME and ChunkerME synergy,<p>I'm trying to use the OpenNLP chunking API ...,CC BY-SA 4.0,1740586000.0,,,,,
4,"[python, python-3.x, nlp]","{'account_id': 3657839, 'reputation': 1081, 'u...",True,48,,79461281,1,1,1740316677,1739980065,...,,word/ sentence similarities,<p>I am trying to find if a given word/ set of...,CC BY-SA 4.0,1740151000.0,,,,,


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29638 entries, 0 to 29637
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tags                   29638 non-null  object 
 1   owner                  29638 non-null  object 
 2   is_answered            29638 non-null  bool   
 3   view_count             29638 non-null  int64  
 4   closed_date            1222 non-null   float64
 5   accepted_answer_id     29638 non-null  int64  
 6   answer_count           29638 non-null  int64  
 7   score                  29638 non-null  int64  
 8   last_activity_date     29638 non-null  int64  
 9   creation_date          29638 non-null  int64  
 10  question_id            29638 non-null  int64  
 11  link                   29638 non-null  object 
 12  closed_reason          1222 non-null   object 
 13  title                  29638 non-null  object 
 14  body                   29638 non-null  object 
 15  co

### There are some posts might have several tags, so that if we retrieve posts data by querying with different tags, there probably are duplicated posts, here I remove duplicated posts by checking if posts' '__link__' are unique

In [None]:
# df = pd.read_excel("nlp_questions.xlsx", engine="openpyxl")

In [11]:
df_unique = df.drop_duplicates(subset=['link'], keep='first') #remove duplicated posts
df_unique = df_unique.reset_index(drop=True)
df_unique.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24776 entries, 0 to 24775
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   tags                   24776 non-null  object 
 1   owner                  24775 non-null  object 
 2   is_answered            24775 non-null  object 
 3   view_count             24775 non-null  object 
 4   closed_date            1033 non-null   float64
 5   accepted_answer_id     24775 non-null  float64
 6   answer_count           24774 non-null  float64
 7   score                  24775 non-null  float64
 8   last_activity_date     24775 non-null  float64
 9   creation_date          24775 non-null  float64
 10  question_id            24775 non-null  float64
 11  link                   24775 non-null  object 
 12  closed_reason          1033 non-null   object 
 13  title                  24774 non-null  object 
 14  body                   24774 non-null  object 
 15  co

In [None]:
# df_unique.to_csv('nlp_questions_unique.csv', index = False)

In [30]:
# Drop question with Nan/null question_id and body (because we cannot retrieve its answer if question_id is null, and drop rows with null body because there's barely content for us to implement sentiment analysis)
df_unique = df_unique.dropna(subset=['question_id', 'body'])
df_unique

Unnamed: 0,tags,owner,is_answered,view_count,closed_date,accepted_answer_id,answer_count,score,last_activity_date,creation_date,...,closed_reason,title,body,content_license,last_edit_date,posted_by_collectives,migrated_from,protected_date,community_owned_date,locked_date
0,"['numpy', 'nlp', 'dependencies', 'google-colab...","{'account_id': 8652474, 'reputation': 687, 'us...",True,89,1.743076e+09,79523777.0,1.0,0.0,1.742494e+09,1.742481e+09,...,Duplicate,Trouble getting importing gensim to work in colab,<p>I am trying to import gensim into colab.</p...,,,,,,,
1,"['python', 'nlp', 'large-language-model']","{'account_id': 1230089, 'reputation': 5390, 'u...",True,26,,79501337.0,1.0,0.0,1.741709e+09,1.741705e+09,...,,Store images instead of showing in a server,<p>I am running the code found on this [site][...,CC BY-SA 4.0,,,,,,
2,"['python', 'nlp', 'spacy', 'langchain', 'presi...","{'account_id': 22369526, 'reputation': 69, 'us...",True,210,,79495969.0,2.0,4.0,1.742056e+09,1.741041e+09,...,,Presidio with Langchain Experimental does not ...,<p>I am using presidio/langchain_experimental ...,CC BY-SA 4.0,1741330413,,,,,
3,"['nlp', 'opennlp']","{'account_id': 21332, 'reputation': 5495, 'use...",True,32,,79475445.0,1.0,1.0,1.740744e+09,1.740240e+09,...,,OpenNLP POSTaggerME and ChunkerME synergy,<p>I'm trying to use the OpenNLP chunking API ...,CC BY-SA 4.0,1740586328,,,,,
4,"['python', 'python-3.x', 'nlp']","{'account_id': 3657839, 'reputation': 1081, 'u...",True,48,,79461281.0,1.0,1.0,1.740317e+09,1.739980e+09,...,,word/ sentence similarities,<p>I am trying to find if a given word/ set of...,CC BY-SA 4.0,1740151158,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24771,"['python', 'numpy', 'tensorflow', 'deep-learni...","{'account_id': 15718018, 'reputation': 49, 'us...",True,562,,65634693.0,1.0,0.0,1.610131e+09,1.610130e+09,...,,train test split is not splitting correctly,<p>I am still a beginner in AI and deep learni...,CC BY-SA 4.0,1610131142,,,,,
24772,"['machine-learning', 'scikit-learn', 'deep-lea...","{'account_id': 15851108, 'reputation': 42, 'us...",True,5106,,65629414.0,2.0,2.0,1.640657e+09,1.610103e+09,...,,MultinomialNB or GaussianNB or CategoricalNB w...,"<p>Let I have a input feature <code>X = {X1, X...",CC BY-SA 4.0,1610206542,,,,,
24773,"['python', 'deep-learning', 'pytorch', 'conv-n...","{'account_id': 17239237, 'reputation': 17, 'us...",True,607,,65635714.0,1.0,0.0,1.614703e+09,1.610103e+09,...,,Custom small CNN has better accuracy than the ...,<p>I have a dataset of laser welding images of...,CC BY-SA 4.0,1610105061,,,,,
24774,"['deep-learning', 'computer-vision', 'object-d...","{'account_id': 9126690, 'reputation': 406, 'us...",True,462,,65611595.0,1.0,1.0,1.610042e+09,1.610019e+09,...,,Creating a dataset of images for object detect...,<p>Even though I am quite familiar with the co...,CC BY-SA 4.0,1610041831,,,,,


### Querying __accepted__ answers from fetched posts (because original fetched data only contains of only 1 accepted answer per post)

In [13]:
len_ques = len(df_unique['link'].unique()) #check if all questions in this dataframe is unique
print(len_ques)

24776


In [20]:
cols = ['title', 'description', 'tags', 'accepted answer 1', 'accepted answer 2', 'creation date', 'view count', 'score']

In [28]:
#Method to get accepted answers from given question ID(s)
def get_answers_for_question(question_ids):
    url = f"https://api.stackexchange.com/2.3/questions/{question_ids}/answers"
    res = []
    has_more = True
    params = {
        'page_size': 30,
        'order': 'desc',
        'sort': 'votes',
        'site': 'stackoverflow',
        'filter': '!6WPIomp1bTBj5', #is_accepted filter
        'key':'rl_GXP3yYDC5NCfRWViwHuhPwMAZ', # my key :3
    }
    page=0
    # Loop over to get all answers until has_more == False
    while has_more:
        page+=1
        params['page'] = page
        response = requests.get(url, params=params)
        # Update has_more variable
        has_more = response.json().get('has_more', False)
        res.extend(response.json().get('items', []))
    # print(response.json())
    return res

    # except Exception as e:
    #     print(f"Error fetching answers for question {question_ids}: {e}")
    # return _



In [38]:

data = []
answers = None
iter = 1  # first iteration

while (iter - 1) * 30 < len_ques:
    try:
        print(f"Iteration {iter}: fetching answer for questions from index {(iter - 1) * 30} to {min(30 * iter, len_ques)}...")

        # Get ID(s) for the current batch, join them with ';'
        questions = df_unique.iloc[(iter - 1) * 30:min(30 * iter, len_ques)]

        # Join all ids together to get answers of at max 30 questions at a time
        ids = ';'.join(list(questions['question_id'].astype(int).astype(str)))

        # Fetch answers for the batch

        answers = get_answers_for_question(ids)

        # Check if API response is valid
        if not answers or not isinstance(answers, list):
            print("Error: Received an invalid response from API. Stopping execution.")
            break

        answers_df = pd.DataFrame(answers)
        answers_df["question_id"] = answers_df["question_id"].astype(str)

        for i in range(len(questions)):
            ans = answers_df[answers_df['question_id'] == questions.iloc[i]['question_id']]

            accepted_ans_1, accepted_ans_2 = None, None

            for idx, a in ans.iterrows():
                if accepted_ans_1 and accepted_ans_2:
                    break
                if not accepted_ans_1:
                    accepted_ans_1 = a['body']
                elif not accepted_ans_2:
                    accepted_ans_2 = a['body']

            data.append([
                questions.iloc[i]['title'], questions.iloc[i]['body'], questions.iloc[i]['tags'],
                accepted_ans_1, accepted_ans_2, questions.iloc[i]['creation_date'],
                questions.iloc[i]['view_count'], questions.iloc[i]['score']
            ])

        iter += 1

        # Sleep to avoid API rate limiting
        time.sleep(1)

    except Exception as e:
        print(f"An error occurred: {e}. Retrying after a short delay...")
        time.sleep(5)  # Wait before retrying

print("Finished retrieving answers for all questions in our dataset!")



Iteration 1: fetching answer for questions from index 0 to 30...
Iteration 2: fetching answer for questions from index 30 to 60...
Iteration 3: fetching answer for questions from index 60 to 90...
Iteration 4: fetching answer for questions from index 90 to 120...
Iteration 5: fetching answer for questions from index 120 to 150...
Iteration 6: fetching answer for questions from index 150 to 180...
Iteration 7: fetching answer for questions from index 180 to 210...
Iteration 8: fetching answer for questions from index 210 to 240...
Iteration 9: fetching answer for questions from index 240 to 270...
Iteration 10: fetching answer for questions from index 270 to 300...
Iteration 11: fetching answer for questions from index 300 to 330...
Iteration 12: fetching answer for questions from index 330 to 360...
Iteration 13: fetching answer for questions from index 360 to 390...
Iteration 14: fetching answer for questions from index 390 to 420...
Iteration 15: fetching answer for questions from in

My dataset retrieved from StackExchange APIs has 24774 rows with 8 columns   
8 columns in the dataset include:

1.  **`title`**: The title of the Stack Overflow question.
2.  **`description`**: The body/content of the question, likely containing details, code snippets, and formatting (as indicated by the `<p>` tags).
3.  **`tags`**: A list of tags associated with the question on Stack Overflow, useful for categorization (e.g., 'python', 'nlp', 'spacy', 'deep-learning', 'tensorflow').
4.  **`accepted_answer_1`**: Contains the body text of the first answer retrieved for the question (likely the highest-voted answer based on typical API usage). It may be `None` if no answer was found or processed.
5.  **`accepted_answer_2`**: Contains the body text of the second answer retrieved (likely the second highest-voted). Also may be `None`.
6.  **`creation_date`**: The timestamp indicating when the question was originally posted (appears to be in Unix epoch format).
7.  **`view_count`**: The number of times the question has been viewed on Stack Overflow.
8.  **`score`**: The score (net upvotes/downvotes) of the question on Stack Overflow.


In [39]:
df_ans = pd.DataFrame(data, columns = cols)
df_ans

Unnamed: 0,title,description,tags,accepted answer 1,accepted answer 2,creation date,view count,score
0,Trouble getting importing gensim to work in colab,<p>I am trying to import gensim into colab.</p...,"['numpy', 'nlp', 'dependencies', 'google-colab...",,,1.742481e+09,89,0.0
1,Store images instead of showing in a server,<p>I am running the code found on this [site][...,"['python', 'nlp', 'large-language-model']",,,1.741705e+09,26,0.0
2,Presidio with Langchain Experimental does not ...,<p>I am using presidio/langchain_experimental ...,"['python', 'nlp', 'spacy', 'langchain', 'presi...",,,1.741041e+09,210,4.0
3,OpenNLP POSTaggerME and ChunkerME synergy,<p>I'm trying to use the OpenNLP chunking API ...,"['nlp', 'opennlp']",,,1.740240e+09,32,1.0
4,word/ sentence similarities,<p>I am trying to find if a given word/ set of...,"['python', 'python-3.x', 'nlp']",,,1.739980e+09,48,1.0
...,...,...,...,...,...,...,...,...
24769,train test split is not splitting correctly,<p>I am still a beginner in AI and deep learni...,"['python', 'numpy', 'tensorflow', 'deep-learni...",,,1.610130e+09,562,0.0
24770,MultinomialNB or GaussianNB or CategoricalNB w...,"<p>Let I have a input feature <code>X = {X1, X...","['machine-learning', 'scikit-learn', 'deep-lea...",,,1.610103e+09,5106,2.0
24771,Custom small CNN has better accuracy than the ...,<p>I have a dataset of laser welding images of...,"['python', 'deep-learning', 'pytorch', 'conv-n...",,,1.610103e+09,607,0.0
24772,Creating a dataset of images for object detect...,<p>Even though I am quite familiar with the co...,"['deep-learning', 'computer-vision', 'object-d...",,,1.610019e+09,462,1.0
