## Today's goal: get txt2xml working

*Approach:*
- *Read the txt data*
- *Generate the xml file*

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### Research

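Python offers three ways to parse XML: **SAX, DOM, ElementTree**

- SAX (*Simple API for XML*)
    - The Python standard library includes a SAX parser. SAX uses an event-driven model: while parsing, it fires events and calls user-defined callbacks to process the XML. It reads the file as a stream, so it is **fast and memory-efficient, but requires the user to implement the callbacks (the *handler*)**
- DOM (*Document Object Model*)
    - Parses the XML data into a tree in memory and manipulates the XML through operations on that tree. **Slow and memory-hungry**
- ElementTree
    - Similar to a lightweight DOM, with a convenient, friendly API. The code is easy to work with, and it is **fast and light on memory; perhaps a compromise between the first two?**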
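To make the comparison concrete, here is a minimal sketch that extracts the same values with each of the three standard-library parsers. The XML string and the `NameHandler` class are made up for illustration; only the `xml.sax`, `xml.dom.minidom`, and `xml.etree.ElementTree` calls come from the standard library.

```python
import xml.sax
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Sample document, invented for this comparison.
XML = ("<annotation>"
       "<object><name>gun</name></object>"
       "<object><name>knife</name></object>"
       "</annotation>")


class NameHandler(xml.sax.ContentHandler):
    """SAX handler (hypothetical) that collects the text of every <name>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buf = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self._in_name:
            self._buf.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buf))
            self._in_name = False


# SAX: streaming, driven by the callbacks above.
handler = NameHandler()
xml.sax.parseString(XML.encode("utf-8"), handler)
sax_names = handler.names

# DOM: the whole tree is built first, then queried.
dom = xml.dom.minidom.parseString(XML)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("name")]

# ElementTree: also a tree, with a shorter API.
root = ET.fromstring(XML)
et_names = [elem.text for elem in root.iter("name")]

print(sax_names, dom_names, et_names)
```

All three approaches return the same names; the difference lies in how much code is needed and whether the whole document has to be held in memory.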

**[lxml XML toolkit](https://lxml.de/)**
- lxml.etree is largely compatible with ElementTree but **better**; for example, it supports `pretty_print` for indented output
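Because the two APIs overlap, one common pattern is to import lxml when it is installed and fall back to the standard library otherwise. The sketch below assumes only `Element`, `SubElement`, and `tostring` are needed; the `HAVE_LXML` flag is a name introduced here for illustration.

```python
try:
    from lxml import etree  # third-party, used when installed
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback
    HAVE_LXML = False

# These calls have the same shape in both libraries.
root = etree.Element('annotation')
folder = etree.SubElement(root, 'folder')
folder.text = 'simi-data-201710'

compact = etree.tostring(root)

# pretty_print is an lxml-only keyword, so guard it.
pretty = etree.tostring(root, pretty_print=True) if HAVE_LXML else compact

print(compact.decode())
```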

The code below generates the xml file with ElementTree (*the generated xml file has no indentation*):

In [11]:
from xml.etree.ElementTree import Element, SubElement, ElementTree

root=Element('annotation')
folder=SubElement(root,'folder')
filename=SubElement(root,'filename')
source=SubElement(root,'source')
owner=SubElement(root,'owner')
size=SubElement(root,'size')
seg=SubElement(root,'segmented')
obj=SubElement(root,'object')

folder.text='simi-data-201710'

# for filename

db=SubElement(source,'database')
anno_s=SubElement(source,'annotation')
img=SubElement(source,'image')
flid_s=SubElement(source,'flickrid')
db.text='simi-data-201710'
anno_s.text='simi-data-201710'
img.text='flickr'
flid_s.text='201701018'

flid_o=SubElement(owner,'flickrid')
name_o=SubElement(owner,'name')
flid_o.text='Random'
name_o.text='SimImage'

width=SubElement(size,'width')
height=SubElement(size,'height')
depth=SubElement(size,'depth')
depth.text='1'
#for width and height

seg.text='0'

name_ob=SubElement(obj,'name')
pose=SubElement(obj,'pose')
trun=SubElement(obj,'truncated')
hard=SubElement(obj,'difficult')
name_ob.text='object'
pose.text='Unspecified'
trun.text='0'
hard.text='0'
# for multi bbox

file=ElementTree(root)
file.write('lucky.xml',encoding='utf-8',xml_declaration=True,method='xml',short_empty_elements=False)
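If staying with the standard library is preferred, indented output is still possible by passing the serialized tree through `xml.dom.minidom`. This is a minimal sketch on a small tree, not the full annotation above; on Python 3.9 and later, `xml.etree.ElementTree.indent` is another option.

```python
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring

# A small tree, just enough to show the nesting.
root = Element('annotation')
size = SubElement(root, 'size')
SubElement(size, 'depth').text = '1'

# ElementTree serializes everything on one line.
raw = tostring(root, encoding='unicode')

# minidom re-parses the string and adds line breaks and indentation.
pretty = minidom.parseString(raw).toprettyxml(indent='  ')

print(raw)
print(pretty)
```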

The code below does the same with lxml:

In [11]:
from lxml import etree

root=etree.Element('annotation')
folder=etree.SubElement(root,'folder')
# root.append(etree.Element('filename'))  # an alternative way to write it
filename=etree.SubElement(root,'filename')
source=etree.SubElement(root,'source')
owner=etree.SubElement(root,'owner')
size=etree.SubElement(root,'size')
seg=etree.SubElement(root,'segmented')
obj=etree.SubElement(root,'object')

folder.text='simi-data-201710'

# for filename

db=etree.SubElement(source,'database')
anno_s=etree.SubElement(source,'annotation')
img=etree.SubElement(source,'image')
flid_s=etree.SubElement(source,'flickrid')
db.text='simi-data-201710'
anno_s.text='simi-data-201710'
img.text='flickr'
flid_s.text='201701018'

flid_o=etree.SubElement(owner,'flickrid')
name_o=etree.SubElement(owner,'name')
flid_o.text='Random'
name_o.text='SimImage'

width=etree.SubElement(size,'width')
height=etree.SubElement(size,'height')
depth=etree.SubElement(size,'depth')
depth.text='1'
width.text='190'
height.text='380'
#width and height may change

seg.text='0'

name_ob=etree.SubElement(obj,'name')
pose=etree.SubElement(obj,'pose')
trun=etree.SubElement(obj,'truncated')
hard=etree.SubElement(obj,'difficult')
name_ob.text='object'
pose.text='Unspecified'
trun.text='0'
hard.text='0'
# multi bbox

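# The Annotations directory must exist before writing, so create it if needed
import os
os.makedirs('Annotations',exist_ok=True)

file=etree.ElementTree(root)
file.write('Annotations/pretty.xml',encoding='utf-8',xml_declaration=True,pretty_print=True)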
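The `# multi bbox` placeholder could be filled in along these lines: one `object` element per group of four values. This is a sketch under assumptions: the helper name `add_objects`, the `bndbox` tag, and the `xmin, ymin, xmax, ymax` order follow the Pascal VOC convention and are not confirmed by the notes above. It uses the standard-library ElementTree so it runs without extra installs; `Element` and `SubElement` are called the same way in `lxml.etree`.

```python
from xml.etree.ElementTree import Element, SubElement


def add_objects(root, bboxes, label='object'):
    """Append one <object> to root for every four values in bboxes (hypothetical helper)."""
    if len(bboxes) % 4:
        raise ValueError('bbox values must come in groups of four')
    for i in range(0, len(bboxes), 4):
        obj = SubElement(root, 'object')
        SubElement(obj, 'name').text = label
        SubElement(obj, 'pose').text = 'Unspecified'
        SubElement(obj, 'truncated').text = '0'
        SubElement(obj, 'difficult').text = '0'
        # Assumed tag names and value order (Pascal VOC style).
        box = SubElement(obj, 'bndbox')
        for tag, value in zip(('xmin', 'ymin', 'xmax', 'ymax'), bboxes[i:i + 4]):
            SubElement(box, tag).text = str(value)
    return root


root = Element('annotation')
add_objects(root, ['105', '156', '117', '174', '111', '237', '133', '263'])
print(len(root.findall('object')))
```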

Generating the xml file is now sorted out, but how do we read the txt files in batch, and how do we get the file names?
- import os

In [23]:
import os

anno='Annotations'
if not os.path.exists(anno):
    os.mkdir(anno)
txt_dir=os.getcwd()+'/Desktop/papers/all/name'
files=os.listdir(txt_dir)
for file in files[1:4]:  # process only entries 1..3 of the listing
    with open(txt_dir+'/'+file,'r') as f:
        lines=f.readlines()
        try:
            # join all lines with commas, then split into single values
            s=''
            for line in lines[:-1]:
                s+=line
                s+=','
            s+=lines[-1]  # raises IndexError when the file is empty
            bboxes=s.split(',')
        except IndexError:
            # NOTE: bboxes is not reassigned here, so an older value is reused below
            print('error: file '+ file + ' has no object! ')
        # each bbox needs exactly 4 values
        if len(bboxes)%4:
            print('error: file '+ file + ' has wrong data number! ')
    info={}
    info['bboxes']=bboxes
    # NOTE: [:-3] removes only 'txt', so the trailing '.' stays in the name
    info['filename']=file[:-3]
    print(info)
    print(len(info['bboxes']))

error: file 002_gun_20190726130937941-1.txt has no object! 
{'bboxes': ['63', '6', '115', '368'], 'filename': '002_gun_20190726130937941-1.'}
4
{'bboxes': ['105', '156', '117', '174', '111', '237', '133', '263'], 'filename': '002_gun_20190726131004381-1.'}
8
{'bboxes': ['63', '6', '115', '368'], 'filename': '002_gun_20190731135645593-1.'}
4


Based on the above, create_xml_new.py can be completed. It is stored in [codes](codes/txt2xml.py)