# Case 3 Cleaning Text with Python (String Methods & Regex)

Nowadays, we can find many digital texts online, as well as digitize our own. Prior to analysis, however, we need to make sure that texts are in an appropriate format.  This might require some extensive cleaning to remove any embedded tags, formatting and irrelevant information.

In this exercise, we will be cleaning a .txt version of [Volume IV of the Lives of Vasari](https://www.gutenberg.org/files/28420/28420-h/28420-h.htm) as a source of texts for analysis. We will remove unnecessary formatting and export a corpus of biographies for analysis in other exercises. Then, we will continue working on the file to split it into tokens (single words) and conduct some simple word frequency checks.

Output during this lesson is quite large, so we recommend that you collapse it (double click the blue bar on the left of the cell) after looking at it, to avoid having to scroll down for long periods. 

## Pre-processing Text with Python

When working with digital texts, it is important to clean them to ensure they suit your purposes for analysis. Oftentimes, texts contain formatting and information that are intrusive and unnecessary for a specific analysis. 

Most commonly, one will have to remove any front matter and back matter (title pages, tables of contents, forewords, indexes, glossaries and the like). This is easily done by opening the text file and deleting it manually. Other content, such as footnotes, endnotes, page numbers, or, with digital documents taken from the internet, different types of markup, are more complicated to remove. Python can help in these cases.

Let's have a look at the text file's structure to begin to think about how to employ Python. I have gone ahead and removed the front and back matter from the file. I am going to read the file and print enough of it that we can start to get an idea of the cleaning we need to do.

Take your time to look at the document's structure and any changes we might have to make so that we can analyze the contents of the biographies without extraneous information.

In [1]:
with open("VasariLives.txt", "r") as f:  # the with-block closes the file once it has been read
    Vasaritxt = f.read()

In [2]:
print(Vasaritxt[0:150000]) #Print a long excerpt of the text

LIFE OF FILIPPO LIPPI, CALLED FILIPPINO

PAINTER OF FLORENCE


There was at this same time in Florence a painter of most beautiful
intelligence and most lovely invention, namely, Filippo, son of Fra
Filippo of the Carmine, who, following in the steps of his dead
father in the art of painting, was brought up and instructed, being
still very young, by Sandro Botticelli, notwithstanding that his
father had commended him on his death-bed to Fra Diamante, who was
much his friend--nay, almost his brother. Such was the intelligence
of Filippo, and so abundant his invention in painting, and so
bizarre and new were his ornaments, that he was the first who showed
to the moderns the new method of giving variety to vestments, and
embellished and adorned his figures with the girt-up garments of
antiquity. He was also the first to bring to light grotesques, in
imitation of the antique, and he executed them on friezes in
terretta or in colours, with more design and grace than the men
before him had s

A close look at the text above allows us to identify some characteristics. It is a collection of 'lives', each of which has a title in all caps incorporating the artist's name, a new line, and another line stating "LIFE OF ..." followed by that artist's name. Additional lines give further information, such as alternative names or the artist's field. The text then follows in several paragraphs that observe normal capitalization rules. We might use this information to split the book into a collection of texts.

Interspersed within the text are bracketed references to images that we will have to remove.

And similarly, we will have to remove footnotes, which include the heading "FOOTNOTE" followed by lines with the footnotes, preceded by numbers in square brackets. We will also have to remove their references within the text.

Finally, line breaks have been used within sentences to format the text. We will remove these.

## Using Regular Expressions to Remove Recurring Structures

In this section, we will use the re module to search the text for patterns using regular expressions. Our approach is to first use the **re.findall()** function to look at what a regular expression retrieves from the text. When satisfied, we will use the **re.sub()** function to make replacements with that same regular expression. Checking with re.findall() first helps us avoid making unwanted changes to the text before we understand exactly what a pattern matches.
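The look-first, replace-second workflow can be sketched on a tiny string before applying it to the whole book. The sample sentence and the pattern here are made up for illustration and do not come from the Vasari file.

```python
import re

# Hypothetical sample text, not taken from the Vasari file.
sample = "A fresco (see plate 3) and a panel (see plate 7)."
pattern = r" \(see plate [0-9]+\)"

# Step 1: look at what the pattern retrieves before changing anything.
matches = re.findall(pattern, sample)
print(matches)  # [' (see plate 3)', ' (see plate 7)']

# Step 2: once the matches look right, replace them with an empty string.
cleaned = re.sub(pattern, "", sample)
print(cleaned)  # A fresco and a panel.
```

Because re.sub() returns a new string, nothing is lost until the result is assigned back to the original variable.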

First, let's have a look at all square brackets and see what they do within the text. Below, we are finding all instances of text within square brackets using regular expressions.

The meaning of the regular expression below is:

- \[ Escaped opening square bracket, because it is usually a special character and we want to use it literally
- . All characters
- *? 0 or more times, non-greedy
- \] Escaped closing square bracket
- re.DOTALL: mode that lets the dot include line breaks
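To see what each piece of this expression contributes, here is a minimal sketch on a made-up string (not taken from the file) that contains one short bracket and one bracket spanning a line break.

```python
import re

# Hypothetical sample: one footnote-style bracket and one multi-line bracket.
sample = "text [1] more [Illustration: A\nB] end"

# Greedy .* runs from the first "[" to the last "]".
greedy = re.findall(r"\[.*\]", sample, re.DOTALL)
print(greedy)  # ['[1] more [Illustration: A\nB]']

# Non-greedy .*? stops at the first "]" it can reach.
non_greedy = re.findall(r"\[.*?\]", sample, re.DOTALL)
print(non_greedy)  # ['[1]', '[Illustration: A\nB]']

# Without re.DOTALL the dot cannot cross the line break,
# so the multi-line bracket is missed entirely.
no_dotall = re.findall(r"\[.*?\]", sample)
print(no_dotall)  # ['[1]']
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslashes to the re module unchanged.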

In [2]:
import re
re.findall(r"\[.*?\]", Vasaritxt, re.DOTALL)  # raw string (r"...") keeps the backslashes intact for the regex

['[Illustration: THE LIBERATION OF S. PETER\n\n(_After the fresco by =Filippo Lippi (Filippino)=. Florence: S.\nMaria del Carmine_)\n\n_Anderson_]',
 '[Illustration: S. JOHN THE EVANGELIST RAISING DRUSIANA FROM THE DEAD\n\n(_After the fresco by =Filippo Lippi (Filippino)=. Florence: S.\nMaria Novella, Strozzi Chapel_)\n\n_Anderson_]',
 '[Illustration: THE ADORATION OF THE MAGI\n\n(_After the panel by =Filippo Lippi (Filippino)=. Florence: Uffizi,\n1257_)\n\n_Alinari_]',
 '[1]',
 '[Illustration: BERNARDINO PINTURICCHIO: THE MADONNA IN GLORY\n\n(_San Gimignano. Panel_)]',
 '[2]',
 '[Illustration: FREDERICK III CROWNING THE POET ÆNEAS SYLVIUS\n\n(_After the fresco by =Bernardino Pinturicchio=. Siena: Sala\nPiccolominea_)\n\n_Brogi_]',
 '[Illustration: POPE ALEXANDER VI ADORING THE RISEN CHRIST\n\n(_After the fresco by =Bernardino Pinturicchio=. Rome: The Vatican,\nBorgia Apartments_)\n\n_Anderson_]',
 '[Illustration: BENEDETTO BUONFIGLIO: MADONNA, CHILD AND THREE ANGELS\n\n(_Perugia: Pina

From the output above we can see that square brackets carry several kinds of information:

- Those containing numbers belong to footnotes, either as a reference within the text or as the footnote itself, written out at the end of a biography.
- Others give an artist's alternative name.
- Others give information on an illustration.

We should deal with them separately.

First, let's identify the footnotes at the end of each biography.

In [25]:
re.findall(r"^\[[0-9]*?\].*", Vasaritxt, re.MULTILINE)  # with re.MULTILINE, ^ matches at the start of every line

['[1] Pietro Perugino.',
 '[2] This seems to be an error for Calistus III.',
 '[3] The text says "Messer Bart...."',
 '[4] Exchange or Bank.',
 '[5] It is now generally accepted that these two men are',
 '[6] This master has been identified with Il Bassiti, under',
 '[7] See note on p. 57, Vol. I.',
 '[8] See note on p. 57, Vol. I.',
 '[9] A judicial court, the members of which sat in rotation.',
 '[10] Two accurate literal translations of the same original',
 '[11] This name is missing in the text.',
 '[12] Signet-office, for the sealing of Papal Bulls and',
 '[13] See note on p. 57, Vol. I.',
 '[14] The word "calavano" has been substituted here for the',
 '[15] These numbers are missing from the text.',
 '[16] The word "utilmente" is substituted here for the',
 '[17] The words of the text, "un quadro d\' una spera," are a',
 '[18] Florentine puff-pastry.',
 '[19] Don Vincenzio Borghini.',
 '[20] Filippo Brunelleschi.',
 '[21] The name given in the text is Domenico.',
 '[22] A friable

As we can see when we search these, the method above does not retrieve the full footnote for all cases. Some footnotes occupy multiple lines, and are cut.

Instead, let's try to find the blocks of footnotes as a group.

In [5]:
re.findall(r"(?<=FOOTNOTE:).+?(?=\n[A-Z ]+)", Vasaritxt, re.MULTILINE|re.DOTALL)

['\n\n[1] Pietro Perugino.\n\n[2] This seems to be an error for Calistus III.\n\n\n\n',
 '\n\n[3] The text says "Messer Bart...."\n\n\n\n',
 '\n\n[4] Exchange or Bank.\n\n\n\n',
 '\n\n[5] It is now generally accepted that these two men are\none, under the name of Lazzaro Bastiani.\n\n[6] This master has been identified with Il Bassiti, under\nthe name of Basaiti.\n\n[7] See note on p. 57, Vol. I.\n\n[8] See note on p. 57, Vol. I.\n\n\n\n',
 '\n\n[9] A judicial court, the members of which sat in rotation.\n\n\n\n',
 "\n\n[10] Two accurate literal translations of the same original\nmust often coincide; and in dealing with this beautiful Life, the\ntranslator has had to take the risk either of seeming to copy the\nalmost perfect rendering of Mr. H. P. Horne, or of introducing\nunsatisfactory variants for mere variety's sake. Having rejected the\nlatter course, he feels doubly bound to record once more his deep\nobligation to Mr. Horne's example.\n\n[11] This name is missing in the text.\n

In [3]:
Vasaritxt = re.sub(r"(?<=FOOTNOTE:).+?(?=\n[A-Z ]+)", "", Vasaritxt, flags=re.MULTILINE|re.DOTALL)
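The pattern used in the two cells above combines a lookbehind, `(?<=FOOTNOTE:)`, with a lookahead, `(?=\n[A-Z ]+)`. Both only check their surroundings and are not part of the match, so only the text between them is returned or replaced. The sketch below runs the same pattern on a made-up miniature of the file's layout; the sample text is an assumption for illustration only.

```python
import re

# Hypothetical miniature of the file's layout, not real Vasari text.
sample = (
    "end of first life.\n\n"
    "FOOTNOTE:\n\n"
    "[1] A note that runs\nover two lines.\n\n\n\n"
    "SECOND ARTIST\n\nLIFE OF SECOND ARTIST\n"
)

pattern = r"(?<=FOOTNOTE:).+?(?=\n[A-Z ]+)"
flags = re.MULTILINE | re.DOTALL

# The match starts right after "FOOTNOTE:" and stops before the first
# line break that is followed by a capital letter or a space.
blocks = re.findall(pattern, sample, flags)
print(blocks)

# Removing the block leaves the heading and the next title in place.
cleaned = re.sub(pattern, "", sample, flags=flags)
print(cleaned)
```

One caveat of this sketch: a footnote whose continuation line happens to begin with a capital letter or a space would end the match early, so it is worth checking the re.findall() output before replacing.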

If you inspect the updated text below, you will see that this has removed all footnotes at the end of each biography, but has left the heading "FOOTNOTE:" in the text. Let's remove these too:

In [4]:
Vasaritxt = re.sub("FOOTNOTE:", "", Vasaritxt, flags= re.MULTILINE|re.DOTALL)

We should also remove any references to footnotes in the text:

In [9]:
re.findall(r"\[[0-9]+\]", Vasaritxt, flags=re.MULTILINE|re.DOTALL)

['[1]',
 '[2]',
 '[3]',
 '[4]',
 '[5]',
 '[5]',
 '[6]',
 '[7]',
 '[8]',
 '[9]',
 '[10]',
 '[11]',
 '[12]',
 '[13]',
 '[14]',
 '[15]',
 '[16]',
 '[17]',
 '[18]',
 '[18]',
 '[19]',
 '[20]',
 '[21]',
 '[22]',
 '[23]',
 '[24]',
 '[25]',
 '[26]',
 '[27]',
 '[24]',
 '[25]',
 '[26]',
 '[27]',
 '[28]',
 '[29]',
 '[30]']

In [5]:
Vasaritxt = re.sub(r"\[[0-9]+\]", "", Vasaritxt, flags=re.MULTILINE|re.DOTALL)

In [11]:
re.findall(r"\[.*?\]", Vasaritxt, re.DOTALL)

['[Illustration: THE LIBERATION OF S. PETER\n\n(_After the fresco by =Filippo Lippi (Filippino)=. Florence: S.\nMaria del Carmine_)\n\n_Anderson_]',
 '[Illustration: S. JOHN THE EVANGELIST RAISING DRUSIANA FROM THE DEAD\n\n(_After the fresco by =Filippo Lippi (Filippino)=. Florence: S.\nMaria Novella, Strozzi Chapel_)\n\n_Anderson_]',
 '[Illustration: THE ADORATION OF THE MAGI\n\n(_After the panel by =Filippo Lippi (Filippino)=. Florence: Uffizi,\n1257_)\n\n_Alinari_]',
 '[Illustration: BERNARDINO PINTURICCHIO: THE MADONNA IN GLORY\n\n(_San Gimignano. Panel_)]',
 '[Illustration: FREDERICK III CROWNING THE POET ÆNEAS SYLVIUS\n\n(_After the fresco by =Bernardino Pinturicchio=. Siena: Sala\nPiccolominea_)\n\n_Brogi_]',
 '[Illustration: POPE ALEXANDER VI ADORING THE RISEN CHRIST\n\n(_After the fresco by =Bernardino Pinturicchio=. Rome: The Vatican,\nBorgia Apartments_)\n\n_Anderson_]',
 '[Illustration: BENEDETTO BUONFIGLIO: MADONNA, CHILD AND THREE ANGELS\n\n(_Perugia: Pinacoteca. Panel_)]

In [6]:
Vasaritxt = re.sub(r"\[Illustration.+?\]", "", Vasaritxt, flags=re.MULTILINE|re.DOTALL)  # remove the illustration captions

In [13]:
re.findall(r"\[.*?\]", Vasaritxt, re.DOTALL)

['[_PIETRO VANNUCCI, OR PIETRO DA CASTEL DELLA PIEVE_]',
 '[_LUCA DA CORTONA_]',
 '[_BACCIO DELLA PORTA_]',
 '[_RAFFAELLO SANZIO_]',
 '[_GUILLAUME DE MARCILLAC, OR THE FRENCH PRIOR_]',
 '[_SIMONE DEL POLLAIUOLO_]']

As we see, all expressions left in brackets are alternative names for artists, which we will leave in the text.

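To make sure we did not miss any other brackets, we can count the individual bracket characters left in the text. The six names above account for twelve of them (one opening and one closing bracket each), so the count should be 12. We can use the len() function to determine the length of the list returned by findall().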

In [7]:
len(re.findall(r"\[|\]", Vasaritxt, re.MULTILINE))

12
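The same sanity check can be done with a plain string method instead of a regular expression. This is a minimal sketch on a made-up sample; counting the two bracket characters separately also shows whether every opening bracket has a closing partner.

```python
# Hypothetical sample with two bracketed names left in it.
sample = "LIFE OF A\n[_ALT NAME ONE_]\ntext\nLIFE OF B\n[_ALT NAME TWO_]\n"

opening = sample.count("[")  # number of opening brackets
closing = sample.count("]")  # number of closing brackets
balanced = opening == closing

print(opening, closing, balanced)  # 2 2 True
```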

In [54]:
# You can use this to check some of the changes we make to the text as we proceed.
print(Vasaritxt)

['FILIPPO LIPPI, CALLED FILIPPINO\nPAINTER OF FLORENCE\nThere was at this same time in Florence a painter of most beautiful\nintelligence and most lovely invention, namely, Filippo, son of Fra\nFilippo of the Carmine, who, following in the steps of his dead\nfather in the art of painting, was brought up and instructed, being\nstill very young, by Sandro Botticelli, notwithstanding that his\nfather had commended him on his death-bed to Fra Diamante, who was\nmuch his friend--nay, almost his brother. Such was the intelligence\nof Filippo, and so abundant his invention in painting, and so\nbizarre and new were his ornaments, that he was the first who showed\nto the moderns the new method of giving variety to vestments, and\nembellished and adorned his figures with the girt-up garments of\nantiquity. He was also the first to bring to light grotesques, in\nimitation of the antique, and he executed them on friezes in\nterretta or in colours, with more design and grace than the men\nbefore hi

Now that we have removed these unnecessary bits of formatting, let's split the text into a collection of biographies that we can use. Each biography starts with a set of lines in capital letters, which makes them an easy choice for splitting.

One approach would be to find these capitalized lines and insert a marker character before them on which to split. Simpler still, every title contains the word 'LIFE' or 'LIVES', so we can split on those words directly. Note that re.split() discards the matched word, and that Vasaritxt is a list of strings rather than a single string from here on:

In [8]:
Vasaritxt = re.split("LIFE|LIVES", Vasaritxt)

In [9]:
len(Vasaritxt)

23

In [97]:
Vasaritxt

['',
 ' OF FILIPPO LIPPI, CALLED FILIPPINO\n\nPAINTER OF FLORENCE\n\n\nThere was at this same time in Florence a painter of most beautiful\nintelligence and most lovely invention, namely, Filippo, son of Fra\nFilippo of the Carmine, who, following in the steps of his dead\nfather in the art of painting, was brought up and instructed, being\nstill very young, by Sandro Botticelli, notwithstanding that his\nfather had commended him on his death-bed to Fra Diamante, who was\nmuch his friend--nay, almost his brother. Such was the intelligence\nof Filippo, and so abundant his invention in painting, and so\nbizarre and new were his ornaments, that he was the first who showed\nto the moderns the new method of giving variety to vestments, and\nembellished and adorned his figures with the girt-up garments of\nantiquity. He was also the first to bring to light grotesques, in\nimitation of the antique, and he executed them on friezes in\nterretta or in colours, with more design and grace than the

In [10]:
Vasaritxt.pop(0) #Removes the first, blank, element of the list.

''
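The behaviour of re.split() seen above can be reproduced on a small made-up string: the delimiter is dropped from the pieces, and a match at the very start of the text produces an empty first element. The second call is an alternative not used in this notebook, which splits on a zero-width lookahead so that the delimiter stays in each piece; it assumes Python 3.7 or later, where re.split() accepts patterns that match an empty string.

```python
import re

# Hypothetical sample, not taken from the Vasari file.
sample = "LIFE OF A\ntext about A\nLIVES OF B AND C\ntext about B and C\n"

# Splitting on the words themselves removes them from the result.
parts = re.split("LIFE|LIVES", sample)
print(parts)

# Splitting just before each word keeps it at the start of every piece.
kept = [piece for piece in re.split(r"(?=LIFE|LIVES)", sample) if piece]
print(kept)
```

Either way, the split also fires on any other capitalized 'LIFE' or 'LIVES' in the body text, so checking the length of the resulting list against the expected number of biographies is a useful safeguard.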

In [17]:
for index, item in enumerate(Vasaritxt):
    Vasaritxt[index] = re.sub('\n(?!\n)', ' ', item)  # replace a line break with a space unless another line break follows it

In [21]:
for index, item in enumerate(Vasaritxt):
    item = re.sub('\n\n', '\n', item)  # collapse each pair of line breaks into one
    item = item.strip()  # strip first, so that $ below lands right after the final capitalized line
    item = re.sub('[A-Z ]+$', '', item)  # remove the next artist's name from the final line
    Vasaritxt[index] = item.strip()  # remove leading and trailing whitespace again
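The effect of these substitutions is easier to follow on a short made-up chunk shaped like one element of the list: a title, a subtitle, two paragraphs with a hard line break, and the next artist's name at the end. This sketch strips the chunk before removing the trailing capitals, an ordering assumed here so that `$` sits directly after the name.

```python
import re

# Hypothetical chunk in the shape produced by the split, not real Vasari text.
sample = (
    " OF A\n\nPAINTER\n\n\n"
    "First line of prose\ncontinues here.\n\n"
    "Second paragraph.\n\n\n\n"
    "NEXT ARTIST\n\n"
)

# 1. A line break not followed by another line break becomes a space,
#    which joins lines that were wrapped inside a sentence.
step1 = re.sub(r"\n(?!\n)", " ", sample)

# 2. Each pair of line breaks collapses into a single one.
step2 = re.sub(r"\n\n", "\n", step1)

# 3. Strip, then drop the run of capitals and spaces at the very end.
step3 = re.sub(r"[A-Z ]+$", "", step2.strip())

result = step3.strip()
print(repr(result))
# 'OF A\n PAINTER\n First line of prose continues here.\n Second paragraph.'
```

The leading space on each line after the first is a by-product of step 1, which turns the last line break of every blank-line run into a space.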

In [22]:
Vasaritxt[5]

"OF JACOPO, CALLED L'INDACO\n PAINTER\n Jacopo, called L'Indaco, who was a disciple of Domenico del Ghirlandajo, and who worked in Rome with Pinturicchio, was a passing good master in his day; and although he did not make many works, yet those that he did make are worthy of commendation. Nor is there any need to marvel that only very few works issued from his hands, for the reason that, being a gay and humorous fellow and a lover of good cheer, he harboured but few thoughts and would never work save when he could not help it; and so he used to say that doing nothing else but labour, without taking a little pleasure in the world, was no life for a Christian. He lived in close intimacy with Michelagnolo, for when that craftsman, supremely excellent beyond all who have ever lived, wished to have some recreation after his studies and his continuous labours of body and mind, no one was more pleasing to him for the purpose or more suited to his humour than this man.\n Jacopo worked for many 

Now the individual biographies look more or less as they should. The first lines will serve as useful metadata for the content of the biography, the text we will be analyzing. Let's extract this information.

In [27]:
for index, item in enumerate(Vasaritxt):
    print(re.findall(r"^[A-Z][A-Z :,\n()\[\]_']+(?![a-z])", item))  # capitalized title lines at the start of each biography

['OF FILIPPO LIPPI, CALLED FILIPPINO\n PAINTER OF FLORENCE\n ']
['OF BERNARDINO PINTURICCHIO\n PAINTER OF PERUGIA\n ']
['OF FRANCESCO FRANCIA\n GOLDSMITH AND PAINTER OF BOLOGNA\n ']
['OF PIETRO PERUGINO\n [_PIETRO VANNUCCI, OR PIETRO DA CASTEL DELLA PIEVE_]\n PAINTER\n ']
['OF VITTORE SCARPACCIA (CARPACCIO), AND OF OTHER VENETIAN AND LOMBARD PAINTERS\n ']
["OF JACOPO, CALLED L'INDACO\n PAINTER\n "]
['OF LUCA SIGNORELLI OF CORTONA\n [_LUCA DA CORTONA_]\n PAINTER\n ']
['OF THE SCULPTORS, PAINTERS, AND      ARCHITECTS, WHO HAVE LIVED FROM CIMABUE TO OUR OWN DAY']
['OF LEONARDO DA VINCI\n PAINTER AND SCULPTOR OF FLORENCE\n ']
['OF GIORGIONE DA CASTELFRANCO\n PAINTER OF VENICE\n ']
['OF ANTONIO DA CORREGGIO\n PAINTER\n I']
['OF PIERO DI COSIMO\n PAINTER OF FLORENCE\n ']
['OF BRAMANTE DA URBINO\n ARCHITECT\n ']
['OF FRA BARTOLOMMEO DI SAN MARCO\n [_BACCIO DELLA PORTA_]\n PAINTER OF FLORENCE\n ']
['OF MARIOTTO ALBERTINELLI\n PAINTER OF FLORENCE\n ']
['OF RAFFAELLINO DEL GARBO\n PAINTER OF FLO

The regular expression above (which is the result of an iterative process of trial and error) seems to be capturing the titles and subtitles of each biography. The titles include names, crafts and locations, which can all be different variables. 
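The idea of separating a craft from a location can be sketched with a string method. The helper below is hypothetical and uses str.partition() rather than the lookaround expressions in this notebook; it assumes a subtitle of the form "CRAFT OF PLACE" and splits at the first " OF ".

```python
def parse_craft_line(line):
    """Split a subtitle such as 'PAINTER OF FLORENCE' into (craft, location)."""
    # partition() returns the text before the separator, the separator itself,
    # and the text after it; the last two are empty if " OF " is absent.
    craft, _, location = line.strip().partition(" OF ")
    return craft, location

print(parse_craft_line(" PAINTER OF FLORENCE"))
# ('PAINTER', 'FLORENCE')
print(parse_craft_line(" GOLDSMITH AND PAINTER OF BOLOGNA"))
# ('GOLDSMITH AND PAINTER', 'BOLOGNA')
print(parse_craft_line(" ARCHITECT"))
# ('ARCHITECT', '')
print(parse_craft_line(" PAINTER AND MASTER OF GLASS WINDOWS"))
# ('PAINTER AND MASTER', 'GLASS WINDOWS')
```

The last call shows the limit of any rule based on the word 'OF': a subtitle in which 'OF' belongs to the craft itself is split in the wrong place, so a few rows will need correcting by hand.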

Let's create a dataframe to store this information together with the text of each biography:

In [24]:
import pandas as pd

VasariData = pd.DataFrame(columns=['Artist', 'Craft', 'Location', 'Bio'])

In [32]:
rows = []  # collect one dict per biography; DataFrame.append() was removed in pandas 2.0
title_pattern = r"^[A-Z][A-Z :,\n()\[\]_']+(?![a-z])"

for item in Vasaritxt:
    subtitles = re.findall(title_pattern, item)[0].split('\n')
    a, b, c = subtitles[0], '', ''  # artist, craft, location (blank unless found)
    if len(subtitles) > 1:
        if '[' in subtitles[1]:  # a bracketed alternative name follows the title
            a = subtitles[0] + subtitles[1]
            craft_line = subtitles[2]
        else:
            craft_line = subtitles[1]
        if 'OF' in craft_line:  # e.g. "PAINTER OF FLORENCE"
            b = re.findall('.+(?=OF)', craft_line)[0].strip()
            c = re.findall('(?<=OF).+', craft_line)[0].strip()
        else:
            b = craft_line.strip()

    d = re.sub(title_pattern, '', item)  # the biography without its title lines
    rows.append({'Artist': a, 'Craft': b, 'Location': c, 'Bio': d})

# Build the dataframe in one go, reusing the column order defined above
VasariData = pd.DataFrame(rows, columns=VasariData.columns)

In [30]:
subtitles

['OF THE SCULPTORS, PAINTERS, AND      ARCHITECTS, WHO HAVE LIVED FROM CIMABUE TO OUR OWN DAY']

In [33]:
VasariData

Unnamed: 0,Artist,Craft,Location,Bio
0,"OF FILIPPO LIPPI, CALLED FILIPPINO",PAINTER,FLORENCE,There was at this same time in Florence a pain...
1,OF BERNARDINO PINTURICCHIO,PAINTER,PERUGIA,Even as many are assisted by fortune without b...
2,OF FRANCESCO FRANCIA,GOLDSMITH AND PAINTER,BOLOGNA,"Francesco Francia, who was born in Bologna in ..."
3,"OF PIETRO PERUGINO [_PIETRO VANNUCCI, OR PIETR...",PAINTER,,How great a benefit poverty may be to men of g...
4,"OF VITTORE SCARPACCIA (CARPACCIO), AND OF OTHE...",,,It is very well known that when some of our cr...
5,"OF JACOPO, CALLED L'INDACO",PAINTER,,"Jacopo, called L'Indaco, who was a disciple of..."
6,OF LUCA SIGNORELLI OF CORTONA [_LUCA DA CORTONA_],PAINTER,,"Luca Signorelli, an excellent painter, of whom..."
7,"OF FILIPPO LIPPI, CALLED FILIPPINO",PAINTER,FLORENCE,There was at this same time in Florence a pain...
8,OF BERNARDINO PINTURICCHIO,PAINTER,PERUGIA,Even as many are assisted by fortune without b...
9,OF FRANCESCO FRANCIA,GOLDSMITH AND PAINTER,BOLOGNA,"Francesco Francia, who was born in Bologna in ..."


Some more cleaning is needed:

- Row 7 is not a biography but an introduction, so it can be removed.
- The leading 'OF' can be removed from the Artist column.
- Some rows were parsed incorrectly and need to be corrected by hand.

In [34]:
VasariData['Artist'] = VasariData['Artist'].str.replace('OF ','')  # removes every 'OF ', including any inside a name, not only the leading one
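The difference between removing 'OF ' everywhere and removing it only at the start of a title can be seen on two of the titles listed earlier. This is a minimal sketch in plain Python; anchoring the pattern with `^` is an alternative not used in this notebook.

```python
import re

titles = ["OF LUCA SIGNORELLI OF CORTONA", "OF BRAMANTE DA URBINO"]

# str.replace() removes every occurrence of the substring.
everywhere = [t.replace("OF ", "") for t in titles]
print(everywhere)  # ['LUCA SIGNORELLI CORTONA', 'BRAMANTE DA URBINO']

# Anchoring with ^ removes only the 'OF ' at the start of the string.
leading_only = [re.sub(r"^OF ", "", t) for t in titles]
print(leading_only)  # ['LUCA SIGNORELLI OF CORTONA', 'BRAMANTE DA URBINO']
```

In pandas, the anchored version would presumably be written as `.str.replace(r'^OF ', '', regex=True)` on the Artist column.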

In [35]:
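VasariData.drop(7)  # returns a new dataframe without row 7; VasariData itself is unchanged unless the result is assigned back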

Unnamed: 0,Artist,Craft,Location,Bio
0,"FILIPPO LIPPI, CALLED FILIPPINO",PAINTER,FLORENCE,There was at this same time in Florence a pain...
1,BERNARDINO PINTURICCHIO,PAINTER,PERUGIA,Even as many are assisted by fortune without b...
2,FRANCESCO FRANCIA,GOLDSMITH AND PAINTER,BOLOGNA,"Francesco Francia, who was born in Bologna in ..."
3,"PIETRO PERUGINO [_PIETRO VANNUCCI, OR PIETRO D...",PAINTER,,How great a benefit poverty may be to men of g...
4,"VITTORE SCARPACCIA (CARPACCIO), AND OTHER VENE...",,,It is very well known that when some of our cr...
5,"JACOPO, CALLED L'INDACO",PAINTER,,"Jacopo, called L'Indaco, who was a disciple of..."
6,LUCA SIGNORELLI CORTONA [_LUCA DA CORTONA_],PAINTER,,"Luca Signorelli, an excellent painter, of whom..."
8,BERNARDINO PINTURICCHIO,PAINTER,PERUGIA,Even as many are assisted by fortune without b...
9,FRANCESCO FRANCIA,GOLDSMITH AND PAINTER,BOLOGNA,"Francesco Francia, who was born in Bologna in ..."
10,"PIETRO PERUGINO [_PIETRO VANNUCCI, OR PIETRO D...",PAINTER,,How great a benefit poverty may be to men of g...


In [36]:
VasariData.iloc[4,1]= 'PAINTERS'

In [37]:
VasariData.iloc[4,2]= 'LOMBARDY'

In [38]:
VasariData.iloc[19,1]= 'PAINTER AND MASTER OF GLASS WINDOWS'

In [39]:
VasariData.iloc[19,2]= 'FRANCE'

In [40]:
VasariData

Unnamed: 0,Artist,Craft,Location,Bio
0,"FILIPPO LIPPI, CALLED FILIPPINO",PAINTER,FLORENCE,There was at this same time in Florence a pain...
1,BERNARDINO PINTURICCHIO,PAINTER,PERUGIA,Even as many are assisted by fortune without b...
2,FRANCESCO FRANCIA,GOLDSMITH AND PAINTER,BOLOGNA,"Francesco Francia, who was born in Bologna in ..."
3,"PIETRO PERUGINO [_PIETRO VANNUCCI, OR PIETRO D...",PAINTER,,How great a benefit poverty may be to men of g...
4,"VITTORE SCARPACCIA (CARPACCIO), AND OTHER VENE...",PAINTERS,LOMBARDY,It is very well known that when some of our cr...
5,"JACOPO, CALLED L'INDACO",PAINTER,,"Jacopo, called L'Indaco, who was a disciple of..."
6,LUCA SIGNORELLI CORTONA [_LUCA DA CORTONA_],PAINTER,,"Luca Signorelli, an excellent painter, of whom..."
7,"FILIPPO LIPPI, CALLED FILIPPINO",PAINTER,FLORENCE,There was at this same time in Florence a pain...
8,BERNARDINO PINTURICCHIO,PAINTER,PERUGIA,Even as many are assisted by fortune without b...
9,FRANCESCO FRANCIA,GOLDSMITH AND PAINTER,BOLOGNA,"Francesco Francia, who was born in Bologna in ..."


The above seems in good shape.

Let's export the dataframe as a CSV. We will use it for Named-Entity Recognition in Case 4.


In [41]:
VasariData.to_csv('VasariData.csv',index=False)