### General information about the datasets

Questions contains the title, body, creation date, score, and owner ID for each Python question.

Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.

Tags contains the tags on each question besides the Python tag.
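The link between the tables can be sketched with a small join. This is only an illustration under stated assumptions: the two toy frames stand in for the real CSV files, the titles are invented, and `joined` is a hypothetical name that does not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-ins for the Questions and Answers tables (titles are made up).
questions = pd.DataFrame({
    "Id": [469, 502],
    "Title": ["Question A", "Question B"],
})
answers = pd.DataFrame({
    "Id": [497, 518, 536],
    "ParentId": [469, 469, 502],
    "Score": [4, 2, 9],
})

# Join each answer to its question: Answers.ParentId -> Questions.Id.
# Both frames have an 'Id' column, so suffixes keep them apart.
joined = answers.merge(
    questions,
    left_on="ParentId",
    right_on="Id",
    suffixes=("_answer", "_question"),
)
print(joined[["Id_answer", "ParentId", "Title", "Score"]])
```

The `Id` column of Tags refers to the question ID in the same way, so a similar join on `Id` would attach tags to questions.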

# Reading and Cleaning data

In [1]:
import pandas as pd

In [2]:
# Load tags dataframe
df_tags = pd.read_csv('datasets/Tags.csv')
print(df_tags.info())
print(df_tags.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885078 entries, 0 to 1885077
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 28.8+ MB
None
    Id        Tag
0  469     python
1  469        osx
2  469      fonts
3  469  photoshop
4  502     python


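## Filter out rows that don't contain python tags
Edit: This step appears to be redundant, since the dataset description states that Tags holds the tags on each question other than `python`. Note, however, that the sample output above does include `python` rows, so the description may not match the file exactly.

Should the need arise again, the following code keeps only the rows tagged `python`: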

```python
df_tags_filtered = df_tags[df_tags['Tag']=='python']
print(df_tags_filtered.head())
```
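For later analysis the opposite filter is often more useful: drop the `python` tag and collect whatever remains for each question. The following is a minimal sketch on toy rows shaped like the `Tags.csv` sample above; `other_tags` and `tags_per_question` are hypothetical names introduced here.

```python
import pandas as pd

# Toy rows in the same shape as Tags.csv (Id is the question ID).
tags = pd.DataFrame({
    "Id": [469, 469, 469, 469, 502],
    "Tag": ["python", "osx", "fonts", "photoshop", "python"],
})

# Drop the 'python' tag, then collect the remaining tags per question.
other_tags = tags[tags["Tag"] != "python"]
tags_per_question = other_tags.groupby("Id")["Tag"].agg(list)
print(tags_per_question)
```

Questions whose only tag is `python` (such as 502 in this toy data) do not appear in the result at all, which is worth keeping in mind before joining it back to the questions.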

## Read questions

In [3]:
# The CSV file contains almost 700,000 entries, which is too many to process in reasonable time for this notebook, so only the first 1,000 rows are read
df_questions = pd.read_csv('datasets/Questions.csv', encoding='ISO-8859-1', nrows=1000)
print(df_questions.info())
print(df_questions.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            1000 non-null   int64  
 1   OwnerUserId   953 non-null    float64
 2   CreationDate  1000 non-null   object 
 3   Score         1000 non-null   int64  
 4   Title         1000 non-null   object 
 5   Body          1000 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 47.0+ KB
None
    Id  OwnerUserId          CreationDate  Score  \
0  469        147.0  2008-08-02T15:11:16Z     21   
1  502        147.0  2008-08-02T17:01:58Z     27   
2  535        154.0  2008-08-02T18:43:54Z     40   
3  594        116.0  2008-08-03T01:15:08Z     25   
4  683        199.0  2008-08-03T13:19:16Z     28   

                                               Title  \
0  How can I find the full path to a font from it...   
1            Get a preview JPEG of a PDF on Windows?   
2 
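The `info()` output shows `CreationDate` stored as `object` and `OwnerUserId` as `float64` (a side effect of the missing owners). One possible clean-up, sketched on toy rows and not part of the original notebook, is to parse the dates and use pandas' nullable integer dtype. The third row is invented to show a missing owner.

```python
import pandas as pd

# Toy rows shaped like the Questions output above; the third row is made up.
questions = pd.DataFrame({
    "Id": [469, 502, 999],
    "OwnerUserId": [147.0, 147.0, float("nan")],
    "CreationDate": [
        "2008-08-02T15:11:16Z",
        "2008-08-02T17:01:58Z",
        "2008-08-03T01:15:08Z",
    ],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
questions["CreationDate"] = pd.to_datetime(questions["CreationDate"], utc=True)

# The nullable integer dtype keeps IDs as integers and missing owners as <NA>.
questions["OwnerUserId"] = questions["OwnerUserId"].astype("Int64")

print(questions.dtypes)
```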

In [4]:
question_ids = df_questions['Id'].tolist()

## Read Answers

In [5]:
df_answers_unfiltered = pd.read_csv('datasets/Answers.csv', encoding='ISO-8859-1', nrows=10000)
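# Keep only answers to the sampled questions; .copy() avoids a SettingWithCopyWarning when Body is modified later
df_answers = df_answers_unfiltered[df_answers_unfiltered['ParentId'].isin(question_ids)].copy()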
del df_answers_unfiltered
print(df_answers.info())
print(df_answers.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4032 entries, 0 to 9998
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Id            4032 non-null   int64  
 1   OwnerUserId   3903 non-null   float64
 2   CreationDate  4032 non-null   object 
 3   ParentId      4032 non-null   int64  
 4   Score         4032 non-null   int64  
 5   Body          4032 non-null   object 
dtypes: float64(1), int64(3), object(2)
memory usage: 220.5+ KB
None
    Id  OwnerUserId          CreationDate  ParentId  Score  \
0  497         50.0  2008-08-02T16:56:53Z       469      4   
1  518        153.0  2008-08-02T17:42:28Z       469      2   
2  536        161.0  2008-08-02T18:49:07Z       502      9   
3  538        156.0  2008-08-02T18:56:56Z       535     23   
4  541        157.0  2008-08-02T19:06:40Z       535     20   

                                                Body  
0  <p>open up a terminal (Applications-&gt;Utilit... 
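Because `nrows=10000` only looks at the start of the file, answers to the sampled questions that sit further down may be missed. An alternative, shown here as a sketch and not as the notebook's method, is to stream the file with `chunksize` and filter each chunk. The in-memory CSV and the two question IDs are stand-ins so the example runs on its own.

```python
import io
import pandas as pd

# In-memory stand-in for Answers.csv so the sketch is self-contained.
csv_text = """Id,ParentId,Score
497,469,4
518,469,2
536,502,9
600,9999,1
"""
question_ids = {469, 502}  # hypothetical sample of question IDs

# Read the file in small chunks and keep only answers to known questions.
kept = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    kept.append(chunk[chunk["ParentId"].isin(question_ids)])

# concat builds a new, independent frame from the filtered pieces.
answers = pd.concat(kept, ignore_index=True)
print(answers)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the CSV path and the chunk size raised to something like tens of thousands of rows; memory use then depends on the chunk size, not on the full file.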

## Clean out html elements from the text data

In [6]:
from bs4 import BeautifulSoup

In [7]:
# Define a function to extract text from HTML tags using BeautifulSoup
def remove_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
    return soup.get_text()

# Apply the function to the relevant column of the dataframe
df_questions['Body'] = df_questions['Body'].apply(remove_html_tags)

# Print the first rows of the cleaned column
print(df_questions['Body'].head())

0    I am using the Photoshop's javascript API to f...
1    I have a cross-platform (Python) application w...
2    I'm starting work on a hobby project with a py...
3    There are several ways to iterate over a resul...
4    I don't remember whether I was dreaming or not...
Name: Body, dtype: object


In [8]:
# Apply the same logic to the answers
df_answers['Body'] = df_answers['Body'].apply(remove_html_tags)
print(df_answers['Body'].head())

0    open up a terminal (Applications->Utilities->T...
1    I haven't been able to find anything that does...
2    You can use ImageMagick's convert utility for ...
3    One possibility is Hudson.  It's written in Ja...
4    We run Buildbot - Trac at work, I haven't used...
Name: Body, dtype: object
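One detail of `get_text()` worth knowing: with no arguments it concatenates the text of adjacent tags directly, so `<p>a</p><p>b</p>` becomes `ab`. The sketch below is a hypothetical variant of `remove_html_tags` (named `html_to_text` here) that inserts a space between pieces and also tolerates missing values. The sample HTML is invented for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

def html_to_text(html):
    """Hypothetical variant of remove_html_tags that tolerates missing values."""
    if pd.isna(html):
        return ""
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " keeps text from adjacent tags apart; strip=True trims each piece.
    return soup.get_text(separator=" ", strip=True)

bodies = pd.Series(["<p>open a terminal</p><pre><code>ls -l</code></pre>", None])
print(bodies.apply(html_to_text).tolist())
```

Whether the extra space is wanted depends on the downstream task; for tokenised text analysis it usually avoids words being glued together across paragraph and code boundaries.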
