# Stackexchange (Stackoverflow parent)
These are probably more closely related to reasoning questions.

Datasource: https://archive.org/details/stackexchange

https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede/2678#2678


In [1]:
import os
DATASET_LOC = '/mnt/p/datasets/stackexchange/stackoverflow.com-Posts'
DATASET_POSTS_LOC = os.path.join(DATASET_LOC, 'Posts.xml')
if not os.path.exists(DATASET_POSTS_LOC):
    raise Exception(f"Dataset not found at {DATASET_POSTS_LOC}")

Nice! The `Posts.xml` file contains all the posts in the dump.

Let's filter for the Kubernetes posts and import them into a database.

In [3]:
from utils.stackexchange import read_stackexchange
df = read_stackexchange(DATASET_POSTS_LOC, nrows=1000)

100%|██████████| 2/2 [00:00<00:00, 4357.72it/s]


100%|██████████| 1000/1000 [00:00<00:00, 67577.04it/s]
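
The `read_stackexchange` helper lives in the author's own `utils.stackexchange` module, and its implementation is not shown here. As a rough sketch of how such a reader could stream a large `Posts.xml` without loading it into memory, the following uses `xml.etree.ElementTree.iterparse`. The function name `iter_posts` and its `keyword` and `limit` parameters are assumptions for illustration, not the helper's actual API, and the sample XML is toy data.

```python
import io
import xml.etree.ElementTree as ET

# Toy stand-in for Posts.xml (the real dump has one <row .../> element per post)
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<posts>\n'
    b'  <row Id="1" PostTypeId="1" Title="How to convert Decimal to Double?" Body="..." />\n'
    b'  <row Id="2" PostTypeId="2" Body="Use Kubernetes for that" />\n'
    b'</posts>'
)

def iter_posts(source, keyword=None, limit=None):
    """Yield the attributes of each <row> as a dict, optionally keeping only
    rows whose Title, Body or Tags contain `keyword` (case-insensitive)."""
    needle = keyword.lower() if keyword else None
    yielded = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "row":
            continue
        row = dict(elem.attrib)  # copy before clearing the element
        elem.clear()             # release the element's content as we go
        if needle is not None:
            haystack = " ".join(row.get(k, "") for k in ("Title", "Body", "Tags"))
            if needle not in haystack.lower():
                continue
        yield row
        yielded += 1
        if limit is not None and yielded >= limit:
            break

rows = list(iter_posts(io.BytesIO(SAMPLE), keyword="kubernetes"))
print([row["Id"] for row in rows])  # ['2']
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.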


This gives us a DataFrame of posts. Let's take a look at a single row:

In [4]:
df.iloc[0]

Id                                                                       4
PostTypeId                                                               1
AcceptedAnswerId                                                         7
CreationDate                                       2008-07-31T21:42:52.667
Score                                                                  794
ViewCount                                                            70633
Body                     <p>I want to assign the decimal variable &quot...
OwnerUserId                                                              8
LastEditorUserId                                                  16124033
LastEditorDisplayName                                               Rich B
LastEditDate                                       2022-09-08T05:07:26.033
LastActivityDate                                   2022-09-08T05:07:26.033
Title                              How to convert Decimal to Double in C#?
Tags                     

In [5]:
df = read_stackexchange(file=DATASET_POSTS_LOC, condition=('kubernetes' or 'Kubernetes'), skip_lines=2, nrows=1000000)
df

100%|██████████| 2/2 [00:00<00:00, 10010.27it/s]


 15%|█▍        | 147077/1000000 [00:03<00:19, 43442.38it/s]
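
One caveat about the `condition` argument: Python's `or` returns its first truthy operand, so `('kubernetes' or 'Kubernetes')` evaluates to the single string `'kubernetes'` before the function ever sees it. How `read_stackexchange` then applies that string is not shown here. A minimal sketch of matching several keywords case-insensitively follows; `matches_any` is a hypothetical helper, not part of `utils.stackexchange`.

```python
# `or` between two non-empty strings simply returns the first one
condition = ('kubernetes' or 'Kubernetes')
print(condition)  # kubernetes

def matches_any(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)

print(matches_any("Running Docker on CoreOS", ["kubernetes", "docker"]))  # True
```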

Let's inspect the first Kubernetes row. Its PostTypeId is 2 (an answer).

In [None]:
from IPython.display import display, HTML
chart = HTML(df.iloc[0]['Body'])
chart
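
The `PostTypeId` values are described in the schema documentation linked at the top. As a small sketch, the mapping below covers only the types that show up in this notebook's outputs (1, 2, 4 and 5); the helper name `describe_post_types` is made up for illustration.

```python
from collections import Counter

# Subset of the PostTypeId values from the data dump schema documentation
POST_TYPES = {1: "Question", 2: "Answer", 4: "Tag wiki excerpt", 5: "Tag wiki"}

def describe_post_types(post_type_ids):
    """Count posts per type, using readable names where the type is known."""
    counts = Counter(post_type_ids)
    return {POST_TYPES.get(type_id, f"Other ({type_id})"): n
            for type_id, n in counts.items()}

summary = describe_post_types([2, 2, 5, 5, 4, 1])
print(summary)
```

With the DataFrame from this notebook, the call would be `describe_post_types(df['PostTypeId'].astype(int))`.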

## Data import to SQL

Let's try to import the entire dataset (all Kubernetes posts), starting with the first 100,000,000 rows:

In [None]:
df = read_stackexchange(file=DATASET_POSTS_LOC, condition=('kubernetes' or 'Kubernetes'), skip_lines=2, nrows=1e8)
df

100%|██████████| 100000000/100000000.0 [1:27:35<00:00, 19026.62it/s]


Unnamed: 0,Id,PostTypeId,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,...,ContentLicense,ViewCount,Title,Tags,AnswerCount,FavoriteCount,AcceptedAnswerId,ClosedDate,LastEditorDisplayName,OwnerDisplayName
0,999124,2,998832,2009-06-16T00:31:02.797,386,<p><strong>Update:</strong> In an effort to an...,42223,6093952,2022-02-15T00:15:35.203,2022-02-15T00:15:35.203,...,CC BY-SA 4.0,,,,,,,,,
1,2973781,2,153721,2010-06-04T11:49:04.040,31,<p><strong>AppScale</strong></p>\n\n<p>AppScal...,72787,72787,2014-12-05T15:11:57.957,2014-12-05T15:11:57.957,...,CC BY-SA 3.0,,,,,,,,,
2,5827081,5,,2011-04-29T01:56:49.773,0,<p><strong>About</strong></p>\n<p>Google API i...,2000557,11407695,2020-07-09T10:57:07.800,2020-07-09T10:57:07.800,...,CC BY-SA 4.0,,,,,,,,,
3,7555490,5,,2011-09-26T13:03:20.387,0,<blockquote>\n <p>Please <strong>do not use t...,528720,1352530,2017-11-23T10:21:26.173,2017-11-23T10:21:26.173,...,CC BY-SA 3.0,,,,,,,,,
4,7555491,4,,2011-09-26T13:03:20.387,0,"DO NOT USE THIS TAG! Instead, prefer a specifi...",1033581,1033581,2018-03-18T17:09:42.420,2018-03-18T17:09:42.420,...,CC BY-SA 3.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99582,75638798,1,,2023-03-04T21:32:30.200,0,<p>We have an AKS POD (.NET 6 application) whi...,20142949,13302,2023-03-04T21:54:54.063,2023-03-04T21:54:54.063,...,CC BY-SA 4.0,11,Kubernetes - connect to Azure SQL using Active...,<kubernetes><.net-core><azure-sql-database><az...,0,,,,,
99583,75638826,2,58601318,2023-03-04T21:38:18.120,0,<p>It is actually straightforward but not well...,70826,,,2023-03-04T21:38:18.120,...,CC BY-SA 4.0,,,,,,,,,
99584,75638928,1,,2023-03-04T21:58:44.903,0,<p>How to enable Pytorch to use GPUs on <a hre...,20828520,,,2023-03-04T21:58:44.903,...,CC BY-SA 4.0,4,Pytorch cannot detect CUDA GPUs on GKE Autopil...,<installation><pytorch><google-kubernetes-engi...,0,,,,,
99585,75639427,2,75632287,2023-03-04T23:57:08.170,0,<p>It's turnout that the installation of kubec...,1071908,,,2023-03-04T23:57:08.170,...,CC BY-SA 4.0,,,,,,,,,


The result is 99,587 posts. Let's read 100,000,000 more rows, skipping the first 100,000,000.

In [5]:
df_2 = read_stackexchange(file=DATASET_POSTS_LOC, condition=('kubernetes' or 'Kubernetes'), skip_lines=int(1e8), nrows=1e8)
df_2

100%|██████████| 100000000/100000000 [1:28:53<00:00, 18748.13it/s]


100%|██████████| 100000000/100000000.0 [1:37:36<00:00, 17074.67it/s]


Skipping is not faster than reading normally...
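
That is what one would expect if the reader skips by iterating over lines: lines in a text file have variable length, so the only way to find line N is to read everything before it. The sketch below illustrates this with `itertools.islice`. It assumes line-by-line skipping, which may or may not be how `read_stackexchange` works, and `read_window` is a hypothetical name.

```python
import io
from itertools import islice

def read_window(handle, skip_lines, nrows):
    """Return `nrows` lines after skipping `skip_lines` lines.
    The skipped lines are still read from the file, just not kept."""
    return list(islice(handle, skip_lines, skip_lines + nrows))

handle = io.StringIO("".join(f"line {i}\n" for i in range(10)))
window = read_window(handle, skip_lines=3, nrows=2)
print(window)  # ['line 3\n', 'line 4\n']
```

Skipping only becomes cheap when the byte offset of the starting line is known in advance, in which case `handle.seek(offset)` can jump there directly.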

Let's read 1,000,000,000 rows; hopefully that covers everything.

In [6]:
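# NOTE: the `or` chain evaluates to its first operand, 'kubernetes', so the docker keywords are never passed on
df_3 = read_stackexchange(file=DATASET_POSTS_LOC, condition=('kubernetes' or 'Kubernetes' or 'docker' or 'Docker'), nrows=1e9)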
df_3

100%|██████████| 2/2 [00:00<00:00, 3766.78it/s]


100%|██████████| 1000000000/1000000000.0 [15:59:07<00:00, 17377.08it/s] 


Unnamed: 0,Id,PostTypeId,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,LastActivityDate,...,ContentLicense,ViewCount,Title,Tags,AnswerCount,FavoriteCount,AcceptedAnswerId,ClosedDate,LastEditorDisplayName,OwnerDisplayName
0,999124,2,998832,2009-06-16T00:31:02.797,386,<p><strong>Update:</strong> In an effort to an...,42223,6093952,2022-02-15T00:15:35.203,2022-02-15T00:15:35.203,...,CC BY-SA 4.0,,,,,,,,,
1,2973781,2,153721,2010-06-04T11:49:04.040,31,<p><strong>AppScale</strong></p>\n\n<p>AppScal...,72787,72787,2014-12-05T15:11:57.957,2014-12-05T15:11:57.957,...,CC BY-SA 3.0,,,,,,,,,
2,5827081,5,,2011-04-29T01:56:49.773,0,<p><strong>About</strong></p>\n<p>Google API i...,2000557,11407695,2020-07-09T10:57:07.800,2020-07-09T10:57:07.800,...,CC BY-SA 4.0,,,,,,,,,
3,7555490,5,,2011-09-26T13:03:20.387,0,<blockquote>\n <p>Please <strong>do not use t...,528720,1352530,2017-11-23T10:21:26.173,2017-11-23T10:21:26.173,...,CC BY-SA 3.0,,,,,,,,,
4,7555491,4,,2011-09-26T13:03:20.387,0,"DO NOT USE THIS TAG! Instead, prefer a specifi...",1033581,1033581,2018-03-18T17:09:42.420,2018-03-18T17:09:42.420,...,CC BY-SA 3.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99582,75638798,1,,2023-03-04T21:32:30.200,0,<p>We have an AKS POD (.NET 6 application) whi...,20142949,13302,2023-03-04T21:54:54.063,2023-03-04T21:54:54.063,...,CC BY-SA 4.0,11,Kubernetes - connect to Azure SQL using Active...,<kubernetes><.net-core><azure-sql-database><az...,0,,,,,
99583,75638826,2,58601318,2023-03-04T21:38:18.120,0,<p>It is actually straightforward but not well...,70826,,,2023-03-04T21:38:18.120,...,CC BY-SA 4.0,,,,,,,,,
99584,75638928,1,,2023-03-04T21:58:44.903,0,<p>How to enable Pytorch to use GPUs on <a hre...,20828520,,,2023-03-04T21:58:44.903,...,CC BY-SA 4.0,4,Pytorch cannot detect CUDA GPUs on GKE Autopil...,<installation><pytorch><google-kubernetes-engi...,0,,,,,
99585,75639427,2,75632287,2023-03-04T23:57:08.170,0,<p>It's turnout that the installation of kubec...,1071908,,,2023-03-04T23:57:08.170,...,CC BY-SA 4.0,,,,,,,,,


In [13]:
df = df_3.copy()

In [14]:
# Correct data types
df['Id'] = df['Id'].astype('int64')
# Id-like columns that can be missing stay float64 so they can hold NaN
df['OwnerUserId'] = df['OwnerUserId'].astype('float64')
df['ParentId'] = df['ParentId'].astype('float64')
df['PostTypeId'] = df['PostTypeId'].astype('int64')
df['CreationDate'] = df['CreationDate'].astype('datetime64[ns]')
df['LastActivityDate'] = df['LastActivityDate'].astype('datetime64[ns]')
df['CommunityOwnedDate'] = df['CommunityOwnedDate'].astype('datetime64[ns]')
# Missing counts become 0 (fillna replaces np.NaN, an alias that NumPy 2.0 removed)
df['ViewCount'] = df['ViewCount'].fillna(0).astype('int64')
df['AnswerCount'] = df['AnswerCount'].fillna(0).astype('int64')
df['CommentCount'] = df['CommentCount'].fillna(0).astype('int64')
df['FavoriteCount'] = df['FavoriteCount'].fillna(0).astype('int64')
df['LastEditorUserId'] = df['LastEditorUserId'].astype('float64')
df['AcceptedAnswerId'] = df['AcceptedAnswerId'].astype('float64')
df['Score'] = df['Score'].astype('int64')
# Extract text from html
#df['Text'] = df['Body'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
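
The commented-out line above relies on BeautifulSoup. As an alternative sketch that needs only the standard library, the following strips tags with `html.parser`; the names `_TextExtractor` and `html_to_text` are made up for this example, and the approach is deliberately simple (it keeps all text nodes, including the contents of code blocks).

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""
    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities such as &quot;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(html_to_text("<p><strong>Update:</strong> done &amp; dusted</p>"))
# Update: done & dusted
```

It could be applied to the posts with `df['Text'] = df['Body'].apply(html_to_text)`.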

## Setup SQLite DB

In [1]:
# DATABASE_LOC = '/mnt/p/datasets/stackoverflow/stackoverflow_kubernetes.db'  # alternative location, not used
DATABASE_LOC = './data/stackexchange_kubernetes.db'

In [2]:
import sqlite3
# Create a new SQLite database
conn = sqlite3.connect(DATABASE_LOC)
cursor = conn.cursor()

Let's see what tables we have:

In [3]:
# Drop the posts table if it already exists, then list the tables in the database
cursor.execute('DROP TABLE IF EXISTS posts')
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()

[('posts',)]

Let's try writing the DataFrame directly to the database:

In [14]:
# Write the DataFrame to a table in the database
df.drop_duplicates().to_sql('posts', conn, if_exists='replace', index=False)

99587
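
`to_sql` returns the number of rows written and lets pandas infer the table schema. If duplicate posts are a concern, one option is to create the table explicitly with `Id` as the primary key and insert with `INSERT OR IGNORE`. The sketch below uses an in-memory database and a reduced, hypothetical three-column schema; the rows are taken from the output shown earlier.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE posts (Id INTEGER PRIMARY KEY, PostTypeId INTEGER, Score INTEGER)"
)

# The third tuple repeats the first Id and is therefore ignored on insert
sample_rows = [(999124, 2, 386), (2973781, 2, 31), (999124, 2, 386)]
demo.executemany(
    "INSERT OR IGNORE INTO posts (Id, PostTypeId, Score) VALUES (?, ?, ?)",
    sample_rows,
)
demo.commit()

row_count = demo.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(row_count)  # 2
```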

In [42]:
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()

[('posts',)]

Nice! We have successfully created the table. Let's read a few rows back.

In [5]:
import pandas as pd
query = cursor.execute("SELECT * FROM posts LIMIT 5;").fetchall()
column_names = [description[0] for description in cursor.description]
df = pd.DataFrame(query, columns=column_names)
df

Unnamed: 0,index,Id,PostTypeId,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,...,ContentLicense,ViewCount,Title,Tags,AnswerCount,FavoriteCount,AcceptedAnswerId,ClosedDate,LastEditorDisplayName,OwnerDisplayName
0,0,999124,2,998832.0,2009-06-16 00:31:02.797000,386,<p><strong>Update:</strong> In an effort to an...,42223.0,6093952.0,2022-02-15T00:15:35.203,...,CC BY-SA 4.0,0,,,0,0,,,,
1,1,2973781,2,153721.0,2010-06-04 11:49:04.040000,31,<p><strong>AppScale</strong></p>\n\n<p>AppScal...,72787.0,72787.0,2014-12-05T15:11:57.957,...,CC BY-SA 3.0,0,,,0,0,,,,
2,2,5827081,5,,2011-04-29 01:56:49.773000,0,<p><strong>About</strong></p>\n<p>Google API i...,2000557.0,11407695.0,2020-07-09T10:57:07.800,...,CC BY-SA 4.0,0,,,0,0,,,,
3,3,7555490,5,,2011-09-26 13:03:20.387000,0,<blockquote>\n <p>Please <strong>do not use t...,528720.0,1352530.0,2017-11-23T10:21:26.173,...,CC BY-SA 3.0,0,,,0,0,,,,
4,4,7555491,4,,2011-09-26 13:03:20.387000,0,"DO NOT USE THIS TAG! Instead, prefer a specifi...",1033581.0,1033581.0,2018-03-18T17:09:42.420,...,CC BY-SA 3.0,0,,,0,0,,,,


Nice! The rows were imported successfully.

Number of rows:

In [15]:
cursor.execute("select count(*) from posts;").fetchall()[0][0]

99587

Max ID:

In [16]:
cursor.execute("select max(Id) from posts;").fetchall()[0][0]

75639798

Let's select the tags (the categories) of each post. This lets us compare the posts that are actually tagged `kubernetes` with the full set we just imported.

In [8]:
search_query = 'kubernetes'
tags = cursor.execute("SELECT Tags FROM posts;").fetchall()
# Posts without tags (e.g. answers) come back as None
tags = [t[0] if t[0] is not None else '' for t in tags]
selected_tags = [search_query in t for t in tags]
indices = [i for i, x in enumerate(selected_tags) if x]
# Number of posts with a tag containing the search term
print(len(indices))



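We can also use SQL's built-in `LIKE` operator (case-insensitive for ASCII text in SQLite), with an f-string to build the pattern and a query parameter to pass it in safely.

In [47]:
query = cursor.execute("SELECT Title, Tags FROM posts WHERE Tags LIKE ?;", (f'%{search_query}%',)).fetchall()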
query

[('Pong game, detect if ball is out-of-bounds',
  '<haskell><pong><frp><kubernetes-helm>'),
 ('Can I use Single AWS ELB to host 2 SSL Certs for 2 Different Domains?',
  '<kubernetes><ssl><dns><certificate><amazon-elb>'),
 ('Can you connect to Amazon ElastiСache Redis outside of Amazon?',
  '<amazon-web-services><kubernetes><amazon-ec2><redis><amazon-elasticache>'),
 ('Ansible Galaxy roles install in to a specific directory?',
  '<kubernetes><ansible><ansible-galaxy>'),
 ('Kubernetes on CoreOS: Proxy service max out CPU',
  '<docker><coreos><kubernetes>'),
 ('Running Kubernetes Example on CoreOS, Part 1 not work',
  '<docker><coreos><kubernetes>'),
 ('Docker try to download unnecessary busybox image on creation of redis pod with kubernetes tools',
  '<redis><docker><coreos><kubernetes>'),
 ('kubernetes failing to connect on fresh installation of CoreOS',
  '<json><vagrant><docker><coreos><kubernetes>'),
 ('How to write a kubernetes pod configuration to start two containers',
  '<docker>
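
Note that a plain substring pattern also matches related tags such as `<kubernetes-helm>`, as the first result shows. A sketch of matching one exact tag by including the angle brackets in the pattern follows. It runs against a small in-memory table built from two rows of the output above, and `titles_with_tag` is a hypothetical helper name.

```python
import sqlite3

tag_demo = sqlite3.connect(":memory:")
tag_demo.execute("CREATE TABLE posts (Title TEXT, Tags TEXT)")
tag_demo.executemany("INSERT INTO posts VALUES (?, ?)", [
    ("Pong game, detect if ball is out-of-bounds",
     "<haskell><pong><frp><kubernetes-helm>"),
    ("Kubernetes on CoreOS: Proxy service max out CPU",
     "<docker><coreos><kubernetes>"),
])

def titles_with_tag(connection, tag):
    """Return titles of posts carrying exactly `tag` (not just a tag containing it)."""
    pattern = f"%<{tag}>%"  # the brackets anchor the match to a whole tag
    rows = connection.execute(
        "SELECT Title FROM posts WHERE Tags LIKE ?", (pattern,)
    ).fetchall()
    return [row[0] for row in rows]

exact = titles_with_tag(tag_demo, "kubernetes")
print(exact)  # ['Kubernetes on CoreOS: Proxy service max out CPU']
```

Tags containing `_` or `%` would need escaping, because both are wildcards in `LIKE`.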

Let's load the entire SQLite table as a `pd.DataFrame`:

In [17]:
df = pd.read_sql('SELECT * FROM posts;', conn)
print(df.shape, df.columns, df.dtypes, sep='\n\n')

(99587, 23)

Index(['index', 'Id', 'PostTypeId', 'ParentId', 'CreationDate', 'Score',
       'Body', 'OwnerUserId', 'LastEditorUserId', 'LastEditDate',
       'LastActivityDate', 'CommentCount', 'CommunityOwnedDate',
       'ContentLicense', 'ViewCount', 'Title', 'Tags', 'AnswerCount',
       'FavoriteCount', 'AcceptedAnswerId', 'ClosedDate',
       'LastEditorDisplayName', 'OwnerDisplayName'],
      dtype='object')

index                      int64
Id                         int64
PostTypeId                 int64
ParentId                 float64
CreationDate              object
Score                      int64
Body                      object
OwnerUserId              float64
LastEditorUserId         float64
LastEditDate              object
LastActivityDate          object
CommentCount               int64
CommunityOwnedDate        object
ContentLicense            object
ViewCount                  int64
Title                     object
Tags                      object
AnswerCount       

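The numeric datatypes look correct, although the date columns come back as `object` because SQLite stores them as text. Let's check whether the title or body contains `kubernetes` (case-sensitive!)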

In [49]:
df['Title'].str.contains('kubernetes').sum(), df['Body'].str.contains('kubernetes').sum()

(18946, 134994)

In [50]:
df['Title'].str.contains('Kubernetes').sum(), df['Body'].str.contains('Kubernetes').sum()

(34286, 63580)

When we are done, we can close the db connection:

In [51]:
conn.close()

## Open Book Question Answering

What data do we have?

In [52]:
df.columns

Index(['index', 'Id', 'PostTypeId', 'ParentId', 'CreationDate', 'Score',
       'Body', 'OwnerUserId', 'LastEditorUserId', 'LastEditDate',
       'LastActivityDate', 'CommentCount', 'CommunityOwnedDate',
       'ContentLicense', 'ViewCount', 'Title', 'Tags', 'AnswerCount',
       'FavoriteCount', 'AcceptedAnswerId', 'ClosedDate',
       'LastEditorDisplayName', 'OwnerDisplayName'],
      dtype='object')

How recent is the data?

In [53]:
df_context = df.loc[df['PostTypeId']==2, :].sort_values(by='CreationDate', ascending=False)
df_context['CreationDate']

199172    2023-03-04 23:57:08.170000
99585     2023-03-04 23:57:08.170000
99583     2023-03-04 21:38:18.120000
199170    2023-03-04 21:38:18.120000
99579     2023-03-04 18:42:24.203000
                     ...            
99597     2013-08-17 09:19:34.887000
99588     2010-06-04 11:49:04.040000
1         2010-06-04 11:49:04.040000
0         2009-06-16 00:31:02.797000
99587     2009-06-16 00:31:02.797000
Name: CreationDate, Length: 68452, dtype: object

Not quite up to date (yet)
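
Because `CreationDate` is read back as text, the sort above is a string sort. That happens to give the right order for ISO-formatted timestamps, but date arithmetic needs real datetime values. A minimal sketch with the standard library, using three timestamps from the output above (`newest` is a hypothetical helper):

```python
from datetime import datetime

def newest(timestamps):
    """Parse ISO-formatted timestamp strings and return the latest one."""
    return max(datetime.fromisoformat(ts) for ts in timestamps)

sample = [
    "2009-06-16 00:31:02.797000",
    "2023-03-04 23:57:08.170000",
    "2013-08-17 09:19:34.887000",
]
latest = newest(sample)
print(latest.date())  # 2023-03-04
```

In pandas, `pd.to_datetime(df['CreationDate'])` performs the same conversion for a whole column.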

### Random keyword search

In [54]:
from IPython.display import display, HTML
search = 'Azure'
HTML(df.loc[df['Body'].str.contains(search), 'Body'].iloc[1])

In [13]:
df.drop_duplicates()

Unnamed: 0,index,Id,PostTypeId,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,...,ContentLicense,ViewCount,Title,Tags,AnswerCount,FavoriteCount,AcceptedAnswerId,ClosedDate,LastEditorDisplayName,OwnerDisplayName
0,0,999124,2,998832.0,2009-06-16 00:31:02.797000,386,<p><strong>Update:</strong> In an effort to an...,42223.0,6093952.0,2022-02-15T00:15:35.203,...,CC BY-SA 4.0,0,,,0,0,,,,
1,1,2973781,2,153721.0,2010-06-04 11:49:04.040000,31,<p><strong>AppScale</strong></p>\n\n<p>AppScal...,72787.0,72787.0,2014-12-05T15:11:57.957,...,CC BY-SA 3.0,0,,,0,0,,,,
2,2,5827081,5,,2011-04-29 01:56:49.773000,0,<p><strong>About</strong></p>\n<p>Google API i...,2000557.0,11407695.0,2020-07-09T10:57:07.800,...,CC BY-SA 4.0,0,,,0,0,,,,
3,3,7555490,5,,2011-09-26 13:03:20.387000,0,<blockquote>\n <p>Please <strong>do not use t...,528720.0,1352530.0,2017-11-23T10:21:26.173,...,CC BY-SA 3.0,0,,,0,0,,,,
4,4,7555491,4,,2011-09-26 13:03:20.387000,0,"DO NOT USE THIS TAG! Instead, prefer a specifi...",1033581.0,1033581.0,2018-03-18T17:09:42.420,...,CC BY-SA 3.0,0,,,0,0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199169,99582,75638798,1,,2023-03-04 21:32:30.200000,0,<p>We have an AKS POD (.NET 6 application) whi...,20142949.0,13302.0,2023-03-04T21:54:54.063,...,CC BY-SA 4.0,11,Kubernetes - connect to Azure SQL using Active...,<kubernetes><.net-core><azure-sql-database><az...,0,0,,,,
199170,99583,75638826,2,58601318.0,2023-03-04 21:38:18.120000,0,<p>It is actually straightforward but not well...,70826.0,,,...,CC BY-SA 4.0,0,,,0,0,,,,
199171,99584,75638928,1,,2023-03-04 21:58:44.903000,0,<p>How to enable Pytorch to use GPUs on <a hre...,20828520.0,,,...,CC BY-SA 4.0,4,Pytorch cannot detect CUDA GPUs on GKE Autopil...,<installation><pytorch><google-kubernetes-engi...,0,0,,,,
199172,99585,75639427,2,75632287.0,2023-03-04 23:57:08.170000,0,<p>It's turnout that the installation of kubec...,1071908.0,,,...,CC BY-SA 4.0,0,,,0,0,,,,


### QA

Most asked stackoverflow question for kubernetes: https://stackoverflow.com/questions/47536536/whats-the-difference-between-docker-compose-and-kubernetes

In [56]:
# Input question
question = "What is the difference between Master Node and Control Plane on Kubernetes?"

In [57]:
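df['Title'] = df['Title'].astype(str)
# NOTE: this assignment overwrites 'Title' with the post body, so the
# outputs below show body text in the 'Title' column
df['Title'] = df['Body'].astype(str)
# Combine 'Title' and 'Body' columns into a single text column
df['Text'] = df['Title'] + ' ' + df['Body']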
df['Title']

0         <p><strong>Update:</strong> In an effort to an...
1         <p><strong>AppScale</strong></p>\n\n<p>AppScal...
2         <p><strong>About</strong></p>\n<p>Google API i...
3         <blockquote>\n  <p>Please <strong>do not use t...
4         DO NOT USE THIS TAG! Instead, prefer a specifi...
                                ...                        
199169    <p>We have an AKS POD (.NET 6 application) whi...
199170    <p>It is actually straightforward but not well...
199171    <p>How to enable Pytorch to use GPUs on <a hre...
199172    <p>It's turnout that the installation of kubec...
199173    <p>I try to create an image from the docker fi...
Name: Title, Length: 199174, dtype: object

Let's compute the TF-IDF similarity between the posts and our question.

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the 'Title' column (which holds the body text at this point)
tfidf_matrix = vectorizer.fit_transform(df['Title'])
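
To make the idea concrete, here is a simplified pure-Python sketch of TF-IDF with cosine similarity. It is not scikit-learn's implementation: tokenization is a plain whitespace split, and the toy documents are invented for illustration. The smoothed IDF formula `log((1 + N) / (1 + df)) + 1` and the L2 normalization follow the defaults documented for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen during fitting are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(a, b):
    """Dot product of two normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    "difference between master node and control plane",
    "pytorch cannot detect cuda gpus",
    "connect to azure sql from a pod",
]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(d, idf) for d in docs]
query = tfidf_vector(
    "What is the difference between Master Node and Control Plane on Kubernetes?", idf
)
scores = [cosine(query, v) for v in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(best)  # 0
```

Query words that never occurred in the fitted corpus contribute nothing to the score, which is also how `vectorizer.transform` treats out-of-vocabulary terms.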

Ask the question:

In [59]:
# Convert the question into a TF-IDF vector
question_vector = vectorizer.transform([question])

# Calculate the cosine similarity between the question vector and all documents
cosine_similarities = cosine_similarity(question_vector, tfidf_matrix)

# Find the positional index of the most similar document
most_similar_index = cosine_similarities.argmax()

# Retrieve the most similar record (iloc, because argmax returns a position, not a label)
most_similar_record = df.iloc[most_similar_index]

# Print the most similar record
print(most_similar_record)

index                                                                72765
Id                                                                68860301
PostTypeId                                                               1
ParentId                                                               NaN
CreationDate                                    2021-08-20 09:53:12.293000
Score                                                                   11
Body                     <p>What is the difference between Master Node ...
OwnerUserId                                                     14270234.0
LastEditorUserId                                                12892553.0
LastEditDate                                       2021-11-18T06:28:05.413
LastActivityDate                                2021-11-18 06:28:05.413000
CommentCount                                                             0
CommunityOwnedDate                                                    None
ContentLicense           

In [60]:
df['cosine_similarities'] = cosine_similarities[0]
df = df.sort_values(by='cosine_similarities', ascending=False)
df.head(3)

Unnamed: 0,index,Id,PostTypeId,ParentId,CreationDate,Score,Body,OwnerUserId,LastEditorUserId,LastEditDate,...,Title,Tags,AnswerCount,FavoriteCount,AcceptedAnswerId,ClosedDate,LastEditorDisplayName,OwnerDisplayName,Text,cosine_similarities
72765,72765,68860301,1,,2021-08-20 09:53:12.293000,11,<p>What is the difference between Master Node ...,14270234.0,12892553.0,2021-11-18T06:28:05.413,...,<p>What is the difference between Master Node ...,<kubernetes>,3,0,68861391.0,2021-08-20T17:23:37.630,,,<p>What is the difference between Master Node ...,0.872544
172352,72765,68860301,1,,2021-08-20 09:53:12.293000,11,<p>What is the difference between Master Node ...,14270234.0,12892553.0,2021-11-18T06:28:05.413,...,<p>What is the difference between Master Node ...,<kubernetes>,3,0,68861391.0,2021-08-20T17:23:37.630,,,<p>What is the difference between Master Node ...,0.872544
63332,63332,66186014,1,,2021-02-13 14:07:50.253000,9,"<p>I am newbie to kubernetes, I see one of my ...",2830167.0,7146596.0,2021-02-14T00:17:39.197,...,"<p>I am newbie to kubernetes, I see one of my ...",<kubernetes>,2,0,66186779.0,,,,"<p>I am newbie to kubernetes, I see one of my ...",0.638399


Let's find the answer to the question!

In [61]:
# Ids of all accepted answers in the dataset
answer_ids = df[df['AcceptedAnswerId'].notna()]['AcceptedAnswerId']
answers = df[df['Id'].isin(answer_ids)]
len(answers)

21358

In [62]:
print(question)
HTML(answers['Body'].iloc[0])

What is the difference between Master Node and Control Plane on Kubernetes?
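
Note that `answers` holds every accepted answer in the dataset, so `answers['Body'].iloc[0]` is not necessarily the accepted answer of the best-matching question. A sketch of looking up the accepted answer for one specific question with a self-join follows. It uses an in-memory table with the two Ids from the output above; the answer body is placeholder text, not the real answer, and `accepted_answer_body` is a hypothetical helper.

```python
import sqlite3

qa = sqlite3.connect(":memory:")
qa.execute(
    "CREATE TABLE posts (Id INTEGER PRIMARY KEY, PostTypeId INTEGER, "
    "AcceptedAnswerId INTEGER, Body TEXT)"
)
qa.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    (68860301, 1, 68861391, "<p>What is the difference between Master Node ...</p>"),
    (68861391, 2, None, "<p>placeholder answer body</p>"),  # toy text
])

def accepted_answer_body(connection, question_id):
    """Return the body of the accepted answer for a question, or None."""
    row = connection.execute(
        "SELECT a.Body FROM posts AS q "
        "JOIN posts AS a ON a.Id = q.AcceptedAnswerId "
        "WHERE q.Id = ?",
        (question_id,),
    ).fetchone()
    return row[0] if row else None

print(accepted_answer_body(qa, 68860301))  # <p>placeholder answer body</p>
```

With the DataFrame, the equivalent would be to take `AcceptedAnswerId` from the top-ranked question row and select the post whose `Id` equals it.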
