# ***difflib***  --  Helpers for computing deltas

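>(From the official documentation) This module provides classes and functions for comparing sequences. It can be used, for example, to compare files, and can produce information about file differences in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.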

In [3]:
# CODE LIST 1-63
"""Data for use with difflib examples.
"""

# see 'difflib_data.py'

'Data for use with difflib examples.\n'

## 1.4.1 Comparing Bodies of Text

- ***Differ*** class
- ***unified_diff( )***

*difflib.Differ().compare( text1, text2 )* 

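Produces human-readable deltas. The line prefixes ***"- + ?"*** mark, respectively, lines found only in *text1*, lines found only in *text2*, and guide lines that point out intraline differences. Lines common to both inputs are prefixed with two spaces.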
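As a self-contained illustration of these prefixes, here is a minimal sketch that does not depend on `difflib_data`. The short `old` and `new` lists are made-up sample data, and grouping the output in a `by_prefix` dictionary is only a convenience for inspection, not part of the difflib API.

```python
import difflib

# Made-up sample data (not taken from difflib_data).
old = ["one", "two words here", "three"]
new = ["one", "two, words here", "three"]

lines = list(difflib.Differ().compare(old, new))
for line in lines:
    # Guide lines ("? ") carry their own trailing newline, so strip it for printing.
    print(line.rstrip("\n"))

# Group the output lines by their two-character prefix.
by_prefix = {}
for line in lines:
    by_prefix.setdefault(line[:2], []).append(line[2:].rstrip("\n"))
```

The two versions of the second line are similar enough for `Differ` to pair them up, so a `? ` guide line is emitted under the `+ ` line to point at the inserted comma.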

In [9]:
# CODE LIST 1-64
import difflib
from difflib_data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
# or: diff = difflib.Differ().compare(..., ...)
print('\n'.join(diff))
# .compare() returns a generator; the loop below would print the same result
# while True:
#     try:
#         print(next(diff))
#     except StopIteration:
#         break


  Lorem ipsum dolor sit amet, consectetuer adipiscing
  elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
- pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
+ pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
?         +

- pharetra tortor.  In nec mauris eget magna consequat
?                 -

+ pharetra tortor. In nec mauris eget magna consequat
- convalis. Nam sed sem vitae odio pellentesque interdum. Sed
?                 - --

+ convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
?               +++ +++++   +

  consequat viverra nisl. Suspendisse arcu metus, blandit quis,
  rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
  molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
  tristique vel, mauris. Curabitur vel lorem id nisl porta
- adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
- tristique enim. Donec quis lectus a justo imperdiet tempus.
+ adipiscing. Duis vulputate tristique enim. Donec 

The *unified diff* format includes only the lines that were modified, plus a few lines of surrounding context.
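How much context is shown can be controlled with the `n` parameter of `unified_diff()`, and the header lines can be labelled with `fromfile` and `tofile`. The sketch below uses made-up single-letter lines and hypothetical file names (`old.txt`, `new.txt`) purely for illustration.

```python
import difflib

# Made-up sample data: seven lines, only the middle one differs.
old = ["a", "b", "c", "d", "e", "f", "g"]
new = ["a", "b", "c", "X", "e", "f", "g"]

diff = list(difflib.unified_diff(
    old,
    new,
    fromfile="old.txt",  # hypothetical label for the first input
    tofile="new.txt",    # hypothetical label for the second input
    n=1,                 # one line of context around each change
    lineterm="",
))
print("\n".join(diff))
```

With `n=1` the hunk contains just the changed line and one unchanged neighbour on each side, and the hunk header reads `@@ -3,3 +3,3 @@`.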

In [12]:
# CODE LIST 1-65
import difflib
from difflib_data import *

diff = difflib.unified_diff(
    text1_lines,
    text2_lines,
    lineterm='',
)
print('\n'.join(diff))

--- 
+++ 
@@ -1,11 +1,11 @@
 Lorem ipsum dolor sit amet, consectetuer adipiscing
 elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
-pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
-pharetra tortor.  In nec mauris eget magna consequat
-convalis. Nam sed sem vitae odio pellentesque interdum. Sed
+pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
+pharetra tortor. In nec mauris eget magna consequat
+convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
 consequat viverra nisl. Suspendisse arcu metus, blandit quis,
 rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
 molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
 tristique vel, mauris. Curabitur vel lorem id nisl porta
-adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
-tristique enim. Donec quis lectus a justo imperdiet tempus.
+adipiscing. Duis vulputate tristique enim. Donec quis lectus a
+justo imperdiet tempus.  Suspendisse eu lectus. In nunc.


## 1.4.2 Junk Data

- ***SequenceMatcher*** class

This part is somewhat hard to follow; the following blog post is a useful reference:

> [文本相似度-python之difflib库SequenceMatcher类](https://blog.csdn.net/minosisterry/article/details/117028761)

When computing differences, it is sometimes necessary to ignore certain characters (such as tabs or spaces).

In [14]:
# CODE LIST 1-66
from difflib import SequenceMatcher


def show_results(match):
    print('  a    = {}'.format(match.a))
    print('  b    = {}'.format(match.b))
    print('  size = {}'.format(match.size))
    i, j, k = match
    print('  A[a:a+size] = {!r}'.format(A[i:i + k]))
    print('  B[b:b+size] = {!r}'.format(B[j:j + k]))


A = " abcd"
B = "abcd abcd"

print('A = {!r}'.format(A))
print('B = {!r}'.format(B))

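# Constructor: SequenceMatcher(isjunk=None, a='', b='', autojunk=True)
#   builds a "matcher" object for the two sequences
# .find_longest_match(alo=0, ahi=None, blo=0, bhi=None)
#   returns a Match for the longest matching block in a[alo:ahi] and b[blo:bhi]: of all maximal
#   matching blocks, the one that starts earliest in a; of those, the one that starts earliest in b
#   alo, ahi, blo, bhi bound the search range; the returned match.a, match.b, match.size are the
#   block's start in a, its start in b, and its length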

print('\nWithout junk detection:')
s1 = SequenceMatcher(None, A, B) # None means nothing is treated as junk, equivalent to lambda x: False
match1 = s1.find_longest_match(0, len(A), 0, len(B))
show_results(match1)

print('\nTreat spaces as junk:')
s2 = SequenceMatcher(lambda x: x == " ", A, B) # treat spaces as junk
match2 = s2.find_longest_match(0, len(A), 0, len(B))
show_results(match2)

A = ' abcd'
B = 'abcd abcd'

Without junk detection:
  a    = 0
  b    = 4
  size = 5
  A[a:a+size] = ' abcd'
  B[b:b+size] = ' abcd'

Treat spaces as junk:
  a    = 1
  b    = 0
  size = 4
  A[a:a+size] = 'abcd'
  B[b:b+size] = 'abcd'


## 1.4.3 Comparing Arbitrary Types

- ***SequenceMatcher***

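The SequenceMatcher class can compare two sequences of any type, provided their elements are hashable. Its algorithm identifies the longest contiguous matching blocks in the sequences, leaving out junk values.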
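To make the hashability requirement concrete, here is a small sketch with made-up data: a list of tuples can be compared, while a list of lists is rejected because lists are not hashable.

```python
from difflib import SequenceMatcher

# Made-up sample data: tuples are hashable, so they are valid sequence elements.
a = [("x", 1), ("y", 2), ("z", 3)]
b = [("x", 1), ("z", 3)]

blocks = SequenceMatcher(None, a, b).get_matching_blocks()
for block in blocks:
    print(block)

# Lists are not hashable, so using them as elements raises TypeError.
try:
    SequenceMatcher(None, [[1]], [[1]])
    unhashable_rejected = False
except TypeError as exc:
    unhashable_rejected = True
    print("TypeError:", exc)
```

The last entry returned by `get_matching_blocks()` is always a zero-length sentinel at the ends of both sequences.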

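The ***get_opcodes*** method returns a list of instructions for modifying the first sequence so that it matches the second.<br>
Each instruction is a tuple **(tag, i1, i2, j1, j2)**: the **opcode**, followed by the start and end indices into each of the two sequences.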

|Opcode|Meaning|
|:---:|:---:|
|'replace'|Replace a[i1:i2] with b[j1:j2]|
|'delete'|Delete a[i1:i2]|
|'insert'|Insert b[j1:j2] at a[i1:i1]|
|'equal'|a[i1:i2] == b[j1:j2] (the subsequences are already equal)|
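The opcodes in the table can be seen on a pair of short strings. The sketch below uses the strings from the example in the official difflib documentation and additionally rebuilds `b` from the opcodes; the `rebuilt` variable is just an illustration of how the index pairs are meant to be read.

```python
from difflib import SequenceMatcher

a = "qabxcd"
b = "abycdf"

opcodes = SequenceMatcher(None, a, b).get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print("{:7} a[{}:{}] ({!r}) b[{}:{}] ({!r})".format(
        tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))

# Rebuild b: copy a[i1:i2] for 'equal', take b[j1:j2] for 'replace'
# and 'insert', and emit nothing for 'delete'.
rebuilt = ""
for tag, i1, i2, j1, j2 in opcodes:
    if tag == "equal":
        rebuilt += a[i1:i2]
    elif tag in ("replace", "insert"):
        rebuilt += b[j1:j2]
```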

***reversed( )***

Note that the return value is an **iterator**
```python
a = [1,2,3]
b = reversed(a)
print(type(b),list(b))
```
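The reason `reversed()` matters when opcodes are applied to a list in place is that every opcode's indices refer to the original sequence. Editing from the end keeps the earlier indices valid, whereas editing from the front shifts them. The sketch below compares both orders; `apply_opcodes` is a hypothetical helper written for this note, not part of difflib.

```python
import difflib


def apply_opcodes(src, dst, in_reverse):
    """Edit a copy of src in place using the opcodes that turn src into dst."""
    result = list(src)
    opcodes = difflib.SequenceMatcher(None, src, dst).get_opcodes()
    if in_reverse:
        opcodes = reversed(opcodes)
    for tag, i1, i2, j1, j2 in opcodes:
        if tag != 'equal':
            # Slice assignment covers replace, delete (empty right side)
            # and insert (empty left slice, i1 == i2).
            result[i1:i2] = dst[j1:j2]
    return result


s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]

backward = apply_opcodes(s1, s2, in_reverse=True)
forward = apply_opcodes(s1, s2, in_reverse=False)
print('reversed order:', backward, backward == s2)
print('forward order :', forward, forward == s2)
```

Applying the opcodes front to back would only work if each index were adjusted by the length change of the edits already made, which is why iterating in reverse is the simpler choice.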

In [23]:
# CODE LIST 1-67
import difflib

s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]

print('Initial data:')
print('s1 =', s1)
print('s2 =', s2)
print('s1 == s2:', s1 == s2)
print()

matcher = difflib.SequenceMatcher(None, s1, s2)
for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):

    if tag == 'delete':
        print('Remove {} from positions [{}:{}]'.format(
            s1[i1:i2], i1, i2))
        print('  before =', s1)
        del s1[i1:i2]

    elif tag == 'equal':
        print('s1[{}:{}] and s2[{}:{}] are the same'.format(
            i1, i2, j1, j2))

    elif tag == 'insert':
        print('Insert {} from s2[{}:{}] into s1 at {}'.format(
            s2[j1:j2], j1, j2, i1))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]

    elif tag == 'replace':
        print(('Replace {} from s1[{}:{}] '
               'with {} from s2[{}:{}]').format(
                   s1[i1:i2], i1, i2, s2[j1:j2], j1, j2))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]

    print('   after =', s1, '\n')

print('s1 == s2:', s1 == s2)

Initial data:
s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]
s1 == s2: False

Replace [4] from s1[5:6] with [1] from s2[5:6]
  before = [1, 2, 3, 5, 6, 4]
   after = [1, 2, 3, 5, 6, 1] 

s1[4:5] and s2[4:5] are the same
   after = [1, 2, 3, 5, 6, 1] 

Insert [4] from s2[3:4] into s1 at 4
  before = [1, 2, 3, 5, 6, 1]
   after = [1, 2, 3, 5, 4, 6, 1] 

s1[1:4] and s2[0:3] are the same
   after = [1, 2, 3, 5, 4, 6, 1] 

Remove [1] from positions [0:1]
  before = [1, 2, 3, 5, 4, 6, 1]
   after = [2, 3, 5, 4, 6, 1] 

s1 == s2: True


## Other Features

- context
- html
- ndiff
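As a compact, self-contained overview of these three helpers, the following sketch runs each of them on the same made-up pair of short lists and also uses `difflib.restore()` to recover both inputs from the `ndiff` delta. The file labels `old.txt` and `new.txt` are hypothetical.

```python
import difflib

# Made-up sample data (not taken from difflib_data).
old = ["one", "two", "three"]
new = ["one", "2", "three", "four"]

# ndiff: Differ-style delta from which either input can be restored.
delta = list(difflib.ndiff(old, new))
restored_old = list(difflib.restore(delta, 1))
restored_new = list(difflib.restore(delta, 2))

# context_diff: changed lines are marked with "! ", added lines with "+ ".
ctx = list(difflib.context_diff(
    old, new, fromfile="old.txt", tofile="new.txt", lineterm=""))
print("\n".join(ctx))

# HtmlDiff.make_table: returns the side-by-side comparison as an HTML table string.
html = difflib.HtmlDiff().make_table(old, new)
print(html[:60])
```

`make_table()` returns only the table markup; `HtmlDiff.make_file()` wraps the same table in a complete HTML document.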

In [16]:
import difflib
from difflib_data import *

diff = difflib.ndiff(text1_lines, text2_lines)
print('\n'.join(diff))

  Lorem ipsum dolor sit amet, consectetuer adipiscing
  elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
- pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
+ pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
?         +

- pharetra tortor.  In nec mauris eget magna consequat
?                  -

+ pharetra tortor. In nec mauris eget magna consequat
- convalis. Nam sed sem vitae odio pellentesque interdum. Sed
?                ------

+ convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
?               +++        +++++++++

  consequat viverra nisl. Suspendisse arcu metus, blandit quis,
  rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
  molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
  tristique vel, mauris. Curabitur vel lorem id nisl porta
- adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
- tristique enim. Donec quis lectus a justo imperdiet tempus.
+ adipiscing. Duis vulputate tristique eni

In [18]:
import difflib
from difflib_data import *

diff = difflib.context_diff(
    text1_lines,
    text2_lines,
    lineterm='',
)
print('\n'.join(diff))

*** 
--- 
***************
*** 1,11 ****
  Lorem ipsum dolor sit amet, consectetuer adipiscing
  elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
! pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
! pharetra tortor.  In nec mauris eget magna consequat
! convalis. Nam sed sem vitae odio pellentesque interdum. Sed
  consequat viverra nisl. Suspendisse arcu metus, blandit quis,
  rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
  molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
  tristique vel, mauris. Curabitur vel lorem id nisl porta
! adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
! tristique enim. Donec quis lectus a justo imperdiet tempus.
--- 1,11 ----
  Lorem ipsum dolor sit amet, consectetuer adipiscing
  elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
! pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
! pharetra tortor. In nec mauris eget magna consequat
! convalis. Nam cras vitae mi vitae 

In [17]:
import difflib
from difflib_data import *

d = difflib.HtmlDiff()
print(d.make_table(text1_lines, text2_lines))


    <table class="diff" id="difflib_chg_to0__top"
           cellspacing="0" cellpadding="0" rules="groups" >
        <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
        <colgroup></colgroup> <colgroup></colgroup> <colgroup></colgroup>
        
        <tbody>
            <tr><td class="diff_next" id="difflib_chg_to0__0"><a href="#difflib_chg_to0__0">f</a></td><td class="diff_header" id="from0_1">1</td><td nowrap="nowrap">Lorem&nbsp;ipsum&nbsp;dolor&nbsp;sit&nbsp;amet,&nbsp;consectetuer&nbsp;adipiscing</td><td class="diff_next"><a href="#difflib_chg_to0__0">f</a></td><td class="diff_header" id="to0_1">1</td><td nowrap="nowrap">Lorem&nbsp;ipsum&nbsp;dolor&nbsp;sit&nbsp;amet,&nbsp;consectetuer&nbsp;adipiscing</td></tr>
            <tr><td class="diff_next"></td><td class="diff_header" id="from0_2">2</td><td nowrap="nowrap">elit.&nbsp;Integer&nbsp;eu&nbsp;lacus&nbsp;accumsan&nbsp;arcu&nbsp;fermentum&nbsp;euismod.&nbsp;Donec</td><td class="diff_next"></td><td class="