## Notebook to convert data from XML into CSV

### Description
The main objective of this notebook is to convert the Stack Overflow data dump `*.xml` documents into `*.csv` format.
The CSV files are then used for all transformations and calculations, because they are easier to work with in pandas.

In particular, it converts each `row` element of an XML document into a CSV row, and selected attributes into columns. Attributes that are not relevant to the later aggregations are filtered out.


For instance, `Posts.xml` has the following structure:
```xml
<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="804" ViewCount="76276" Body="&lt;p&gt;I want to assign the decimal variable &amp;quot;trans&amp;quot; to the double variable &amp;quot;this.Opacity&amp;quot;.&lt;/p&gt;&#xA;&lt;pre class=&quot;lang-cs prettyprint-override&quot;&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;When I build the app it gives the following error:&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;Cannot implicitly convert type decimal to double&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;" OwnerUserId="8" LastEditorUserId="16124033" LastEditorDisplayName="Rich B" LastEditDate="2022-09-08T05:07:26.033" LastActivityDate="2022-09-08T05:07:26.033" Title="How to convert Decimal to Double in C#?" Tags="&lt;c#&gt;&lt;floating-point&gt;&lt;type-conversion&gt;&lt;double&gt;&lt;decimal&gt;" AnswerCount="13" CommentCount="4" FavoriteCount="0" CommunityOwnedDate="2012-10-31T16:42:47.213" ContentLicense="CC BY-SA 4.0" />
  <!--other rows-->
</posts>
```
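That row is converted into the following (the column order may differ between runs, because the attributes to keep are passed as a set):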
```csv
Tags,ParentId,CreationDate,Id,DeletionDate,PostTypeId,ClosedDate
<c#><floating-point><type-conversion><double><decimal>,,2008-07-31T21:42:52.667,4,,1,
```
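
The mapping above can be reproduced on a tiny in-memory document. The following is a minimal standalone sketch, not the handler used in this notebook: the names `RowCollector` and `rows_to_csv` and the two sample rows are made up for illustration, and the rows are collected in memory, which is only reasonable for small inputs.

```python
import csv
import io
import xml.sax


class RowCollector(xml.sax.ContentHandler):
    """Hypothetical helper: collects chosen attributes of every <row> element."""

    def __init__(self, columns):
        super().__init__()
        self.columns = list(columns)
        self.rows = []

    def startElement(self, name, attrs):
        if name == "row":
            # Attributes missing from a row become empty strings.
            self.rows.append([attrs.get(c, "") for c in self.columns])


def rows_to_csv(xml_text, columns):
    """Parse an XML string and return the selected attributes as CSV text."""
    collector = RowCollector(columns)
    xml.sax.parseString(xml_text.encode("utf-8"), collector)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(collector.columns)
    writer.writerows(collector.rows)
    return buffer.getvalue()


# Illustrative sample data, not taken from the real dump.
sample = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<posts>"
    '<row Id="4" PostTypeId="1" Tags="&lt;c#&gt;&lt;double&gt;" />'
    '<row Id="7" PostTypeId="2" ParentId="4" />'
    "</posts>"
)

print(rows_to_csv(sample, ["Id", "PostTypeId", "ParentId", "Tags"]))
```

Passing the columns as a list rather than a set keeps the column order fixed, and the SAX parser unescapes entities such as `&lt;` before the values reach the CSV writer.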

### Input
This notebook takes two files from the raw Stack Overflow data dump as input:
- `Posts.xml` - raw posts data;
- `Votes.xml` - raw votes data.

### Output
- `posts.csv` - converted posts data;
- `votes.csv` - converted votes data.

In [1]:
import xml.sax
import csv

class DataDocumentHandler(xml.sax.ContentHandler):
    """SAX handler that writes selected attributes of each <row> element to a CSV file."""

    def __init__(self, output_file_name, attributes_to_include):
        super().__init__()
        self.csvfile = open(output_file_name, "w", newline='', encoding='utf-8')
        self.csvwriter = csv.writer(self.csvfile)
        self.headers_written = False
        self.rows_processed = 0
        # Freeze the column order once so the header and every row line up,
        # even though the caller passes an unordered set.
        self.attributes_to_include = list(attributes_to_include)

        print(f"Initialized handler and opened {output_file_name} for writing.")

    def startElement(self, name, attrs):
        if name == 'row':
            self.rows_processed += 1
            if not self.headers_written:
                self.csvwriter.writerow(self.attributes_to_include)
                self.headers_written = True
                print("CSV headers written.")
            # Attributes missing from a row are written as empty CSV fields.
            self.csvwriter.writerow([attrs.get(a, "") for a in self.attributes_to_include])

    def endDocument(self):
        self.csvfile.close()
        print(f"Finished processing and closed the file. Total rows processed: {self.rows_processed}")

    def startDocument(self):
        print("Started processing XML document.")


def prepare_data(xml_file: str, csv_file: str, attributes_to_include: set[str]):
    """Stream xml_file through the SAX parser and write the chosen attributes to csv_file."""
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    handler = DataDocumentHandler(csv_file, attributes_to_include)
    parser.setContentHandler(handler)

    print(f"Starting XML parsing: {xml_file}")
    try:
        parser.parse(xml_file)
    finally:
        # endDocument is not called if parsing fails or is interrupted,
        # so close the CSV file here as well (closing twice is harmless).
        handler.csvfile.close()
    print("XML parsing completed successfully.")

#### Convert posts data dump
Convert the `Posts.xml` file, which contains Stack Overflow posts (questions and answers), into CSV.

In [2]:
posts_xml_file = "Posts.xml"
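# NOTE: get_file_path is not defined in this section; it is assumed to be a helper
# defined elsewhere that resolves the output location for a given file name.
posts_file_path = get_file_path("posts.csv")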
posts_attributes_to_include = {
    'Id', # this field is used for later analysis
    'PostTypeId', # this field is used for later analysis
    'ParentId', # this field is used for later analysis
    'CreationDate', # this field is used for later analysis
    'DeletionDate', # this field is used to filter out deleted posts
    'Tags', # this field is used to explore tags
    'ClosedDate' # this field is used to filter out closed posts
}

prepare_data(posts_xml_file, posts_file_path, posts_attributes_to_include)

Initialized handler and opened posts.csv for writing.
Starting XML parsing: Posts.xml
Started processing XML document.
CSV headers written.
Finished processing and closed the file. Total rows processed: 59749049
XML parsing completed successfully.
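
The `Tags` column keeps the dump's `<tag1><tag2>` encoding, so exploring tags later needs a splitting step. The sketch below shows one possible way to do it, assuming every tag is wrapped in angle brackets as in the sample row above; `split_tags` is a hypothetical helper and not part of this notebook.

```python
import re


def split_tags(raw_tags):
    """Split a '<tag1><tag2>' string into a list of tag names.

    Empty or missing values (e.g. answers, which carry no tags) give an empty list.
    """
    return re.findall(r"<([^<>]+)>", raw_tags or "")


print(split_tags("<c#><floating-point><type-conversion><double><decimal>"))
# ['c#', 'floating-point', 'type-conversion', 'double', 'decimal']
```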


#### Convert votes data dump
Convert the `Votes.xml` file, which contains Stack Overflow votes on posts, into CSV.

In [3]:
votes_xml_file = "Votes.xml"
votes_csv_path = get_file_path("votes.csv")
votes_attributes_to_include = {
    'PostId', # needed for analysis
    'VoteTypeId', # needed for analysis
    'CreationDate' # needed for analysis
}

prepare_data(votes_xml_file, votes_csv_path, votes_attributes_to_include)

Initialized handler and opened votes.csv for writing.
Starting XML parsing: Votes.xml
Started processing XML document.
CSV headers written.


KeyboardInterrupt: 