## Extracting Data

In [3]:
cd drive/My\ Drive/sih_project

/content/drive/My Drive/sih_project


In [4]:
ls 

[0m[01;34mdata[0m/  [01;34mextracted_files[0m/  [01;34mpkg_data[0m/


In [None]:
!mkdir -p extracted_files
!unzip -q pkg_data/pan12-sexual-predator-identification-test-corpus-2012-05-21.zip -d extracted_files/
!unzip -q pkg_data/pan12-sexual-predator-identification-training-corpus-2012-05-01.zip -d extracted_files/


In [None]:
!mv extracted_files/pan12-sexual-predator-identification-test-corpus-2012-05-21/ extracted_files/test_corpus
!mv extracted_files/pan12-sexual-predator-identification-training-corpus-2012-05-01/ extracted_files/train_corpus
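
The unzip-and-rename steps above can also be done from Python so that re-running the cell is harmless. The following is only a sketch: `extract_and_rename` is a hypothetical helper (not part of the original notebook), and it assumes each archive unpacks into exactly one top-level folder, as the `mv` commands above suggest.

```python
import zipfile
from pathlib import Path


def extract_and_rename(zip_path, dest_dir, new_name):
    """Extract zip_path into dest_dir and rename its single top-level folder.

    Hypothetical helper: assumes the archive contains exactly one
    top-level folder. Returns the path of the renamed folder.
    """
    dest_dir = Path(dest_dir)
    target = dest_dir / new_name
    if target.exists():
        # Already extracted on an earlier run; nothing to do.
        return target

    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Zip member names always use '/' as the separator.
        top_levels = {name.split("/")[0] for name in zf.namelist()}
        if len(top_levels) != 1:
            raise ValueError("expected one top-level folder, found: %s" % sorted(top_levels))
        zf.extractall(dest_dir)

    (dest_dir / top_levels.pop()).rename(target)
    return target
```

With the paths used in this notebook, a call might look like `extract_and_rename('pkg_data/pan12-sexual-predator-identification-training-corpus-2012-05-01.zip', 'extracted_files', 'train_corpus')`.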

## Extracting Information

In [None]:
!cat extracted_files/train_corpus/readme.txt

Overview

This archive contains the training corpus for the "Sexual Predator Identification" task of the PAN 2012 Lab, held in conjunction with the CLEF 2012 conference.

Find out about all the details at http://pan.webis.de.



Training Corpus Description

Update 01 May 2012:

pan12-sexual-predator-identification-training-corpus-2012-05-01.xml A new xml file containing conversations without bad username substitution.
pan12-sexual-predator-identification-diff.txt A text file containing conversation id and line number of modified text 
pan12-sexual-predator-identification-training-corpus-predators-2012-05-01.txt The list of predators without the ones not present in the traininig set



The corpus comprises:

pan12-sexual-predator-identification-training-corpus.xml An xml file containing around 60000 documents (each document is a conversation)
pan12-sexual-predator-identification-training-corpus-predators.txt A file containing a list of predators id

The xml file is organized as follow:


In [None]:
!ls extracted_files/train_corpus

pan12-sexual-predator-identification-diff.txt
pan12-sexual-predator-identification-training-corpus-2012-05-01.xml
pan12-sexual-predator-identification-training-corpus-predators-2012-05-01.txt
readme.txt


In [None]:
with open('extracted_files/train_corpus/pan12-sexual-predator-identification-training-corpus-predators-2012-05-01.txt', 'r') as f:
    # Strip trailing newlines and skip blank lines so each entry is a clean ID
    train_predators = [line.strip() for line in f if line.strip()]
print("Total number of predators in Training corpus : {}".format(len(train_predators)))

Total number of predators in Training corpus : 142
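
If the predator IDs are later used to check whether a given author is a predator, a set makes each lookup constant-time. A minimal sketch, assuming the file holds one ID per line; `load_id_set` is a hypothetical name introduced here for illustration.

```python
def load_id_set(path):
    """Return the non-empty lines of a text file as a set of stripped IDs.

    Hypothetical helper: assumes one ID per line.
    """
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}
```

Membership is then a plain `author_id in predator_ids` test. Note that a set drops duplicates, so its size matches the count printed above only if the file lists each ID once.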


In [42]:
import os
!pip install -q xmltodict
import xmltodict
import json
import pandas as pd

class ExtractText():
    def __init__(self, filename, out_folder='./'):
        '''
        Extract details from XML files
        Args : filename -> Path to the XML file
               out_folder -> Path to output folder
        '''
        self.filename = filename
        if (out_folder[-1]=='/'):
            self.out_folder = out_folder
        else:
            self.out_folder = out_folder + '/'
        
        # Create the output folder if needed; an existing folder is not an error
        os.makedirs(self.out_folder, exist_ok=True)

        print("Parsing XML to Dictionary...")
        dictionary = self.xml_to_dictionary()

        # xmltodict yields a single dict (not a list) for a conversation with only
        # one message, so wrap it to give every conversation a list of messages
        for conv in dictionary['conversations']['conversation']:
            if not isinstance(conv['message'], list):
                conv['message'] = [conv['message']]

        print('Converting XML to JSON format...')
        self.xml_to_json(dictionary)

        print('Converting XML to CSV format...')
        self.xml_to_csv(dictionary)
        print("Files created in {} directory".format(self.out_folder))

    def xml_to_dictionary(self):
        '''
        Converts XML file to data dictionary
        '''
        with open(self.filename) as xml_file:
            data_dict = xmltodict.parse(xml_file.read())
        return data_dict

    def xml_to_json(self, dictionary):
        '''
        Converts parsed dictionary to json and saves
        '''
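        # splitext removes only the extension; rstrip('xml') strips a set of
        # characters rather than a suffix and is only safe for names ending in '.xml'
        base_name = os.path.splitext(os.path.basename(self.filename))[0]
        with open(self.out_folder + base_name + '.json', 'w') as f:
            json.dump(dictionary, f)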

    def xml_to_csv(self, dictionary):
        '''
        Converts parsed dictionary to dataframe and saves in CSV format
        '''
        rows = []
        for conv in dictionary['conversations']['conversation']:
            conv_id = conv['@id']  # avoid shadowing the built-in id()
            for message in conv['message']:
                # One row per message, tagged with its conversation id
                row = dict(message)
                row['@id'] = conv_id
                rows.append(row)
        df = pd.DataFrame(rows)
        base_name = os.path.splitext(os.path.basename(self.filename))[0]
        df.to_csv(self.out_folder + base_name + '.csv')
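
The class above loads the whole XML file into memory before writing anything. As an alternative (not the approach used in this notebook), the standard library's `xml.etree.ElementTree.iterparse` can process one conversation at a time. The sketch below assumes only the element names the class already relies on (`conversation` with an `id` attribute, containing `message` elements). The function names and the `conversation_id` column are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET


def stream_messages(xml_path):
    """Yield one flat dict per <message>, tagged with its conversation id."""
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != "conversation":
            continue
        conv_id = elem.get("id")
        for msg in elem.iter("message"):
            row = {"conversation_id": conv_id}
            row.update(msg.attrib)  # attributes of the message element
            row.update({child.tag: child.text or "" for child in msg})
            yield row
        elem.clear()  # release the finished conversation's children


def write_csv(rows, csv_path):
    """Write row dicts to csv_path; the first row fixes the columns.

    Returns the number of rows written.
    """
    rows = iter(rows)
    first = next(rows, None)
    if first is None:
        return 0
    count = 1
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # Keys missing from later rows are left blank; unseen keys are dropped.
        writer = csv.DictWriter(f, fieldnames=list(first), extrasaction="ignore")
        writer.writeheader()
        writer.writerow(first)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

A call such as `write_csv(stream_messages(xml_path), csv_path)` would then produce a CSV without building the intermediate dictionary, at the cost of not producing the JSON file.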

In [43]:
ExtractText('extracted_files/train_corpus/pan12-sexual-predator-identification-training-corpus-2012-05-01.xml', 'data')
ExtractText('extracted_files/test_corpus/pan12-sexual-predator-identification-test-corpus-2012-05-17.xml', 'data')

Parsing XML to Dictionary...
Converting XML to JSON format...
Converting XML to CSV format...
Files created in data/ directory
Parsing XML to Dictionary...
Converting XML to JSON format...
Converting XML to CSV format...
Files created in data/ directory


<__main__.ExtractText at 0x7f9221d76630>
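
As a quick sanity check on the generated files, the JSON output can be read back and summarized. This is a sketch only: `summarize_corpus` is a hypothetical helper, and it assumes the JSON has the `conversations` → `conversation` → `message` layout that `ExtractText` writes, with every `message` already wrapped in a list.

```python
import json


def summarize_corpus(json_path):
    """Count conversations and messages in a JSON file written by ExtractText."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    conversations = data["conversations"]["conversation"]
    sizes = [len(conv["message"]) for conv in conversations]
    return {
        "conversations": len(conversations),
        "messages": sum(sizes),
        "longest": max(sizes, default=0),  # messages in the longest conversation
    }
```

The `conversations` count can be compared against the figure quoted in the corpus readme, and the `messages` total should equal the number of data rows in the matching CSV.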