Change section about splitting large XML files in top README.md
Phrancis committed Sep 6, 2016
1 parent d085a73 commit 8a78bde
Showing 2 changed files with 10 additions and 82 deletions.
7 changes: 7 additions & 0 deletions PythonScripts/Python3_SplitAllLargeXmlFiles.py
@@ -10,6 +10,10 @@
# Edit this value to be your actual SE data dump root directory:
ROOT_DIRECTORY = r'D:\Downloads\stackexchange'  # raw string so backslashes are not treated as escapes

# Set this to False to keep the source XML files after the split (takes up more space; the XML files will need to be deleted later)
# Set this to True to delete the source XML files automatically after the split (riskier, but takes less space)
DELETE_SOURCE_XML_AFTER_SPLIT = False

SIZE_LIMIT = 20000000 # 20 MB

# Clock to measure how long the script takes to execute
@@ -78,6 +82,9 @@
# move on to the next output file
current_file_num += 1

if DELETE_SOURCE_XML_AFTER_SPLIT:
os.remove(input_file)

# Print results
print('Path:', full_file_path)
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per split file.')
85 changes: 3 additions & 82 deletions README.md
@@ -93,87 +93,8 @@ There is an additional, manual step that needs to be performed in order for Stac

__IMPORTANT: These steps are only needed if data from very large sites (i.e., Stack Overflow) is to be loaded into your database. If you do not plan on loading this data, you may skip this step entirely.__

SQL Server has a maximum XML field size of 2147483647 bytes (2^31 - 1), or about 2 gigabytes (GiB). This works just fine for the majority of XML files; however, some of the files for Stack Overflow exceed this size, and if you wish to load that data, the XML files will need to be "split" into smaller files of less than 2 GiB each. Following are instructions on how to split such files, if you wish to load them.
SQL Server Express _technically_ has a maximum XML field size of 2 GB. In practice, SQL Server Express, as tested by the author, has a lot of difficulty with files larger than about 20 MB.

### 2.3.1 Find the files that are too large
Thus, in the [PythonScripts](https://github.com/Phrancis/StackExchangeDataToMicrosoftSQLServer/tree/master/PythonScripts) section you will find some code you can run on your own root directory to split any files larger than 20 MB into smaller chunks. Note that this is provided for your convenience, and you are responsible for inspecting, editing, and using the script (or not).
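The splitting approach can be sketched roughly like this (a simplified illustration, not the repository's actual script; the function name and the `header`/`footer` arguments are hypothetical):

```python
import os

def split_xml_file(path, lines_per_file, header, footer):
    """Split a one-<row>-per-line XML file into numbered part files.

    header is the XML declaration plus the opening root element line(s);
    footer is the closing root element line.
    """
    base, ext = os.path.splitext(path)
    part_num, count, out = 0, 0, None
    with open(path, encoding='utf-8') as src:
        for line in src:
            # Copy only complete <row .../> lines; skip the source header/footer.
            if not line.lstrip().startswith('<row'):
                continue
            if out is None:
                part_num += 1
                out = open('%s%d%s' % (base, part_num, ext), 'w', encoding='utf-8')
                out.write(header)
            out.write(line)
            count += 1
            if count == lines_per_file:
                out.write(footer)
                out.close()
                out, count = None, 0
    if out is not None:
        # Close out a final, partially filled part file.
        out.write(footer)
        out.close()
    return part_num
```

Each part file this produces has the same header/footer wrapping as the source, so it remains valid XML on its own.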


As of the June 13, 2016 data dump, only 2 files exceed the 2 GiB size limit: `stackoverflow.com\Posts.xml` and `stackoverflow.com\PostHistory.xml`.

#### 2.3.1.a Using the Windows operating system

You can find the files that exceed this size using Windows PowerShell. Open the PowerShell console (Start Menu -> Search -> PowerShell), then paste in (right-click in the console) the following command (adapted from [Stack Overflow](http://stackoverflow.com/a/3423144/3626537)), editing the `-path` value to the path where your folders are:

```powershell
Get-ChildItem -path D:\StackExchangeData -recurse | where { $_.Length -gt 2GB }
```

The console will display something like this:

```text
Directory: D:\StackExchangeData\stackoverflow.com
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2016-03-07 16:52 71927132845 PostHistory.xml
-a---- 2016-03-07 17:14 44071695285 Posts.xml
```

#### 2.3.1.b Using a *NIX operating system

_To be added later._
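Until those instructions are added, the same check can be done from any operating system with a few lines of Python (a hedged sketch; the root path below is hypothetical, so adjust it to your own data dump directory):

```python
import os

def files_over_limit(root, limit=2**31 - 1):
    """Return (path, size) pairs for files larger than limit bytes under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                hits.append((path, size))
    return hits

# Example (hypothetical path):
# for path, size in files_over_limit('/data/stackexchange'):
#     print(path, size)
```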

### 2.3.2 Find how many files to split into

To find how many files each file will need to be split into, simply divide its file size (a.k.a. Length in PowerShell) by 2147483647 bytes.

- PostHistory.xml : 71927132845 / 2147483647 = 33.5 (34 to be safe)
- Posts.xml : 44071695285 / 2147483647 = 20.5 (21 to be safe)
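The same arithmetic, with the rounding-up made explicit (file sizes taken from the listing above; the helper name is illustrative):

```python
import math

XML_FIELD_LIMIT = 2**31 - 1  # 2147483647 bytes

def parts_needed(file_size, limit=XML_FIELD_LIMIT):
    """Number of split files needed so each part stays under the limit."""
    return math.ceil(file_size / limit)

print(parts_needed(71927132845))  # PostHistory.xml -> 34
print(parts_needed(44071695285))  # Posts.xml -> 21
```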

### 2.3.3 Split the files

To split the files manually, you will need an advanced text editor such as [Notepad++](https://notepad-plus-plus.org/) or [Sublime Text](https://www.sublimetext.com/) (the author uses Sublime Text).

Open the XML file you would like to split in the text editor. Note that this can take a long time to load for very large files. Once the file is loaded, the first two lines should look like this:

```xml
<?xml version="1.0" encoding="utf-8"?>
<posts>
```

And the very last line will be closing the root element, e.g.:

```xml
</posts>
```

Copy or make a note of these, as they will need to be added at the beginning and end (respectively) of each of the split files.

Next, find the total number of lines in the file, subtract 3 (for the two opening lines and the closing line), then divide the result by the number of files needed from above. This gives you the number of lines to copy into each of the split files. We will use 500,000 lines as an example.
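That calculation, as a small illustrative helper (the total line count below is hypothetical):

```python
import math

def lines_per_split_file(total_lines, num_files):
    """Lines of <row> data per part, ignoring the 3 structural lines."""
    return math.ceil((total_lines - 3) / num_files)

# E.g., a hypothetical 10,500,003-line file split into 21 parts:
print(lines_per_split_file(10500003, 21))  # -> 500000
```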

From the large source file, copy the first 500,000 lines, then paste them into a new text file. Add the first two lines (as above) at the very top, and the last line at the very end. It is __extremely important__ that you do not copy partial lines, as this would likely result in invalid XML; each line must begin with `<row Id=` and end with `/>`. An example line from Posts.xml should look like this:

```xml
<row Id="1" PostTypeId="1" AcceptedAnswerId="3" CreationDate="2011-04-26T19:37:32.613" Score="5" ViewCount="76" Body="some content here" OwnerUserId="51" LastEditorUserId="297" LastEditDate="2011-05-08T19:53:20.583" LastActivityDate="2011-05-08T19:53:20.583" Title="some title here" Tags="&lt;support&gt;" AnswerCount="2" CommentCount="1" />
```
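That "no partial lines" check can be automated; a minimal sketch (the function name is hypothetical):

```python
def is_complete_row(line):
    """True if the line is a whole <row ... /> element, per the format above."""
    s = line.strip()
    return s.startswith('<row ') and s.endswith('/>')

print(is_complete_row('<row Id="1" PostTypeId="1" />'))  # True
print(is_complete_row('<row Id="1" PostTypeId='))        # False: truncated line
```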

To summarize, each of the files will need to be structured as in this example:

```xml
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" ......... />
<row Id="2" ......... />
<row Id="3" ......... />

..........

<row Id="500000" ......... />
</posts>
```


Save the file into the same folder as the source, and name it `Posts1.xml`, or `PostHistory1.xml`, depending on the name of the source file. Continue doing this until all the lines from the source file are copied into separate numbered files, e.g., `Posts1.xml, Posts2.xml, ... Posts21.xml`. Once this is complete, you may delete the source file and keep only the numbered split files.

Steps for processing these will be covered later.
Note that running the Python script requires [Python 3.5](https://www.python.org/downloads/) or greater, although it could probably be modified fairly easily to work on Python 2.7. Running the script on all of the data dump subdirectories took around 4 hours in total on the author's machine.
