# Sentiment Analysis of The Times Music Reviews
## Part I: Data Preparation
*How have artforms been reported?  Is there a status hierarchy between them?  How has this changed over time?*

* **Project:** What counts as culture?  Reporting and criticism in The Times 1785-2000
* **Project Lead:** Dave O'Brien
* **Developer:** Lucy Havens
* **Funding:** from the Centre for Data, Culture & Society, University of Edinburgh

Begun February 2021

### 1. Load Data
Before we can write much code, we need to import the programming libraries used for data loading and transformation:

In [1]:
import re
import string
import numpy as np
import pandas as pd

In [3]:
# From https://github.com/defoe-code/defoe/blob/master/queries/music_genres.txt
genres = ["Music", "African", "Big Band", "Bluegrass", "Country", "Blues", "Musical", "Classical", "Electronic",
          "Folk", "Gospel", "Hip Hop", "Jazz", "Latin", "Metal", "Easy Listening", "Opera", "Pop", "Rap", "Rave",
          "Reggae", "Rock"
         ]
musical_terms = []

In [4]:
file = open("../TheTimes_DaveO/TimesMusicReviewsData/results_music_types_excluding_music_details", "r")
data = file.read()
print(data[:1000])
file.close()
# YAML file format:
#     Year
#     - 'article_id:':
#       'authors:' ''
#       filename: ...
#       issue_id: ...
#       original text: " .....\
#                        ......"
#       page_ids:
#       - '000'
#       - '000'
#       - '000'
#       term: ...
#       title: ...

1819:
- 'article_id:': 0FFO-1819-DEC01-001-004
  'authors:': ''
  filename: /lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/1785_1909/0FFO-1819-DEC01.xml
  issue_id: 0FFO-1819-DEC01
  original text: "PARLIAMIENTARY INTELLIGEATCE. HOUSE OF LORDS. TITTRAnAr NrlV. ?d)\
    \ CASH PAVNMT,N''T.S. * TIiie H~ari of LAAUD~ERDALE, seeing a noble lord in his\
    \ place, f4rom whom~ he -niighit expect to obtain information on a subject of\
    \ imvportaiice respecting the currency of the country, wished to knowv whietheor\
    \ there waLs any truith in a report whichl prevailed, that minis- ters intended\
    \ to make a proposition to Pa-rliam-ent for preventing the resuimption of cashi-payments\
    \ at the time fixed by the act 1' Thle Earl of LIVERPOOL did ,lot know uipon whlat\
    \ grouind suchl an idea couild hiave been entertained by aniy part of tile public.\
    \ ex- cep tht,in hi gratmetoplis te most absurdI and unfoundedc reprtsmigt otai\
    \ crditfro soe; buthe-could assutre the

Now we can read the data file from the defoe query, which looked for articles in the Reviews section of *The Times*, from 1910 onwards, that contain any of the following terms:
* song(s)
* lyric(s)
* album(s)
* band(s)
* artist(s)
* singer(s)
* songwriter(s)
* rapper(s)
* composer(s)
* composition(s)
* orchestra(s)
* DJ(s)
* Music
* African
* Big Band
* Bluegrass
* Country
* Blues
* Musical
* Classical
* Electronic
* Folk
* Gospel
* Hip Hop
* Jazz
* Latin
* Metal
* Easy Listening
* Opera
* Pop
* Rap
* Rave
* Reggae
* Rock
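
As a rough illustration of what a search over these terms involves, the sketch below builds one case-insensitive regular expression that matches each term as a whole word with an optional plural `s`. This is not defoe's implementation: the helper names `build_term_pattern` and `find_terms`, the shortened term list, and the sample sentence are all assumptions made for this example.

```python
import re

# A shortened, illustrative subset of the search terms listed above
SEARCH_TERMS = ["song", "lyric", "album", "band", "composer", "orchestra", "opera", "hip hop"]

def build_term_pattern(terms):
    """Compile one regex matching any term as a whole word, with an optional plural 's'."""
    # Longest terms first so multi-word terms are preferred over shorter ones
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")s?\b", re.IGNORECASE)

def find_terms(text, pattern):
    """Return the sorted, lower-cased set of matched terms found in text."""
    return sorted({match.group(0).lower() for match in pattern.finditer(text)})

pattern = build_term_pattern(SEARCH_TERMS)
sample = "The composer conducted the orchestra; two Songs from the opera followed."
print(find_terms(sample, pattern))
```

The word boundaries (`\b`) keep a term such as "band" from matching inside a longer word such as "husband".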

In [2]:
file = open("../TheTimes_DaveO/results_keysearch_details_tda-1910_2010", "r")
data = file.read()
print(data[:1000])
file.close()
# YAML file format:
#     Year
#     - 'article_id:':
#       'authors:' ''
#       filename: ...
#       issue_id: ...
#       original text: " .....\
#                        ......"
#       page_ids:
#       - '000'
#       - '000'
#       - '000'
#       section:
#       term: 
#       - ...
#       title: ...

1910:
- 'article_id:': 0FFO-1910-JAN06-011-005
  'authors:': ''
  filename: /lustre/home/dc125/shared/TDA_GDA_1785-2009/1910_2010/0FFO-1910-JAN06.xml
  issue_id: 0FFO-1910-JAN06
  original text: Nrsic. NEW PIAOFORTE MUSIC. Among pianoforto duets sont by Meessrs.
    Augenor there aro a couplo of very elementary little. pieces. written by Francis
    Edwaard Bache for his little brothers, ono of whom afterwards became eminent as
    a solo planist; two books of F. Kirchners consist of dances, Spanish and. Bohemian
    respee- tively, and arrangements are cent of Wagner's " Huldigungsmarsch " and
    Thomnas's anaytond over- ture. X. Scharwenka's well-knouwn PoIsh dance is set
    for solo, duct, and quartet (two pianas, eight -hands). Among educational publications,
    there aro Duvernoy's graceful "Melodious Studies," an elementary book of some
    value by G. florvith; numerous books of rather slight quality by Paul Zilcher,
    a well-arranged book by G. Schifer, who also has compil

### 2. Transform Data
#### 2.1 Create a CSV file containing article metadata
**Step 1:** Create one list for each metadata field across all articles, meaning all article_ids are in one list, all authors are in another list, etc.  The lists should all be the same length, with one item for every article in the corpus.
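
The idea of parallel lists can be sketched on a toy example. The records below are invented placeholders, not rows from the corpus, and `to_parallel_lists` is a hypothetical helper; the point is only that position `i` in every list refers to the same article.

```python
# Invented placeholder records standing in for parsed articles
records = [
    {"article_id": "A-001", "author": "", "year": 1910},
    {"article_id": "A-002", "author": "A. Critic", "year": 1911},
    {"article_id": "A-003", "author": "", "year": 1911},
]

def to_parallel_lists(rows, fields):
    """Return one list per field, each with one item per row, in row order."""
    return {field: [row[field] for row in rows] for field in fields}

lists = to_parallel_lists(records, ["article_id", "author", "year"])

# Every list must have exactly one item per article
lengths = {field: len(values) for field, values in lists.items()}
assert len(set(lengths.values())) == 1, lengths

# Index i refers to the same article in every list
i = 1
print(lists["article_id"][i], lists["author"][i], lists["year"][i])
```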

In [3]:
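def readFileByLines(filepath):
    # Read the whole file and return its lines as a list of strings;
    # the with-statement closes the file even if reading fails
    with open(filepath, 'r') as file:
        lines_list = file.readlines()
    return lines_list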

In [29]:
def createDataLists(lines_list):
    article_ids, authors, filenames, issue_ids, terms, sections, titles, years, original_texts, page_ids = [], [], [], [], [], [], [], [], [], []
    y = ''
    txt = False
    pages = []
    term = []
    for line in lines_list:
        # Check if the line is a year
        date = re.findall(r"\d{4}:\n", line)  # raw string so \d is passed to the regex engine as written
        if len(date) != 0:
            date = int(date[0][0:4])
            # Valid dates should be no earlier than 1910
            # and no later than 2010 in this corpus
            if date >= 1910 and date <= 2010:
                y = date
        else:
            if "'article_id:':" in line:
                a_id = line.replace("'article_id:':","")
                a_id = a_id.strip('-')
                a_id = a_id.strip()
            elif "'authors:':" in line:
                auth = line.replace("'authors:':", "")
                auth = auth.strip()
            elif "filename:" in line:
                f = line.replace("filename:",'')
                f = f.strip()
            elif 'issue_id:' in line:
                i_id = line.replace('issue_id:','')
                i_id = i_id.strip()
            elif 'original text:' in line:
                txt = line.replace('original text:','')
                txt = txt.replace("\\","")
                txt = txt.strip()
            elif 'term:' in line:
                t = line.replace('term:','')
                t = t.strip()
                if len(t) > 0:
                    term += [t]
            elif re.search("  - [a-zA-Z]+", line) is not None:
                t = re.search("  - [a-zA-Z]+", line)[0]
                t = t.replace('  -','')
                term += [t]
            elif 'section:' in line:
                sec = line.replace('section:','')
                sec = sec.strip()
            elif re.search(r"- '(\d)+'", line) is not None:
                p = re.search(r"- '(\d)+'", line)[0]
                p = p.strip('-')
                p = p.strip()
                p = p.strip("'")
                pages += [p]
            # only the first line of the original text is preceded with a line name,
            # so all lines without one of the line names above, without the line page_ids,
            # and with backslashes are original text
            elif ('page_ids:' not in line) and ("  - " not in line) and ('title:' not in line) and (txt):
                txt_cont = line.replace("\\","")
                txt_cont = txt_cont.strip()
                txt += " " + txt_cont
            elif txt and ('title:' in line):
                t = line.replace('title:','')
                t = t.strip()
                titles += [t]
                # title is the last line per term details instance, so add the completed
                # original text, pages and year the article was published after this line, along
                # with all the remaining data, and reset the original text, year, and pages variables
#                 if txt:
                txt = txt.strip('\"') # Remove leading and ending quotation marks, if present
                original_texts += [txt]
                txt = False
                years += [y]
                page_ids += [pages]
                pages = []
                article_ids += [a_id]
                authors += [auth]
                filenames += [f]
                issue_ids += [i_id]
                terms += [term]
                term = []
                sections += [sec]
                
    return article_ids, authors, filenames, issue_ids, terms, sections, titles, years, original_texts, page_ids

In [30]:
# times_lines = readFileByLines("../TheTimes_DaveO/TimesMusicReviewsData/results_music_types_excluding_music_details")
times_lines = readFileByLines("../TheTimes_DaveO/results_keysearch_details_tda-1910_2010")
article_ids, authors, filenames, issue_ids, terms, sections, titles, years, original_texts, page_ids = createDataLists(times_lines)
assert len(article_ids) == len(years)
assert len(authors) == len(filenames)
assert len(original_texts) == len(terms)
assert len(issue_ids) == len(titles)
assert len(page_ids) == len(original_texts)
assert len(authors) == len(sections)

In [31]:
print("Total articles:", len(article_ids))
# print(len(authors))
# print(len(filenames))
# print(len(issue_ids))
# print(len(terms))
# print(len(titles))
# print(len(years))
# print(len(original_texts))
# print(len(page_ids))

Total articles: 106810


Let's print the beginning and end of an article's text to see what it looks like:

In [21]:
print(original_texts[1][:400])
print(original_texts[1][-100:])

SAVOY THE.ATRE.-MiSs Amy Evans now appeats as Sclene, Queen of. the Flairics, the part in Fallen FairCs hitherto taken by 3iss 'ancy McIntosh. Miss Evans ih new to the London musical stage, thongh she has, we believe, been heard on the concert platform, and sho ought to prove a valuable addition to our sopranos. Sihe has a delicate but a beautiful voice. Heir high notes, both forlisaimo and piantW
iola and Its famous makers. Yesterday the delegates visited Canterbury, and went over tho cathedral.


The `\` at the beginning of each line of `original text` and the `"..."` surrounding the entire `original text` entries have successfully been removed!
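
A minimal sketch of that cleanup on a made-up two-line sample, assuming the same approach as `createDataLists` (drop backslashes, trim whitespace, join continuation lines with a space, then strip the surrounding double quotes). The function name `clean_original_text` and the sample lines are illustrative only.

```python
def clean_original_text(lines):
    """Join the raw lines of one 'original text' entry into a single clean string."""
    parts = []
    for line in lines:
        # Remove the field name (present on the first line only), backslashes, and padding
        line = line.replace("original text:", "").replace("\\", "").strip()
        parts.append(line)
    # Strip the double quotes that wrap the whole entry, if present
    return " ".join(parts).strip('"')

raw_lines = [
    '  original text: "MUSIC. A new concert season\\\n',
    '    \\ opens this week."\n',
]
print(clean_original_text(raw_lines))
```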

**Step 2:** Create a DataFrame of all data lists except the articles' text (stored in the `original_texts` list) that can serve as our inventory of articles.  *DataFrames* are essentially tables, organizing data into rows and columns for easy export as a CSV that can be viewed in Microsoft Excel!

In [32]:
metadata = {'title':titles, 'year':years, 'author':authors, 
             'term':terms, 'section':sections, 'pages':page_ids, 'filename':filenames, 'article_id':article_ids, 'issue_id':issue_ids}
inventory = pd.DataFrame.from_dict(metadata)
inventory.head(10)

Unnamed: 0,title,year,author,term,section,pages,filename,article_id,issue_id
0,Nrsic. NEW PIAOFORTE MUSIC.,1910,'',"[ composer, composers, compositions, opera,...",Reviews,[011],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN06-011-005,0FFO-1910-JAN06
1,'',1910,'',"[ band, composer, musical, opera]",Reviews,[011],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN06-011-021,0FFO-1910-JAN06
2,MlJSIc. ENGUSH MUSIC IN ROME.,1910,'',[ orchestra],Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN07-009-009,0FFO-1910-JAN07
3,IuSIC. MUSIC IN TIE THEATRE.,1910,'',"[ composers, lyrics, musical, opera, orche...",Reviews,[011],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN08-011-015,0FFO-1910-JAN08
4,THE THEATRES.,1910,'',"[ classical, musical, singer]",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN10-008-003,0FFO-1910-JAN10
5,WusIC. PROGRADIES OF TEE WVEEK.,1910,'',"[ artists, composer, composers, opera, orc...",Reviews,[011],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN10-011-007,0FFO-1910-JAN10
6,MUSIC. ST. JOHN'S COLLEOG (CVIBRIDGE) 9ISSION ...,1910,(FROM OUR OWN CORRESPONDENT.),"[ composer, composers, composition, musical...",Reviews,[013],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN12-013-013,0FFO-1910-JAN12
7,"BOOKS Or PtpEENtxECE. ."" ""BltR's .PrxA4r' 1910.",1910,'',[ country],Reviews,[013],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN12-013-014,0FFO-1910-JAN12
8,"WYNDHIAM'S THEATRE. "" CAPTAIN KIDD."" .A Musica...",1910,'',"[ composer, lyrics, musical]",Reviews,[010],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN13-010-018,0FFO-1910-JAN13
9,"I . . I MuSIC. LONDON OEWBER CONCERT ASSOcIATION,",1910,'',"[ artist, composer, composers]",Reviews,[011],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN13-011-005,0FFO-1910-JAN13


In [33]:
print(inventory.author.unique())
print(len(inventory.author.unique()))

["''" '(FROM OUR OWN CORRESPONDENT.)' '(FROM A CORRESPONDENT.)' ...
 'Mike Wade' 'Loudon Wainwright' 'A. L. Kennedy']
5839


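There are 5,839 unique values in the author field of our corpus's metadata!  Note that this count includes the empty placeholder `''` and generic bylines such as `(FROM OUR OWN CORRESPONDENT.)`, so the number of named authors is somewhat lower.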
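
A small sketch of how named authors could be tallied separately from placeholder values. The author values are invented, and which values belong in the hypothetical `PLACEHOLDERS` set for the real corpus is an assumption that would need checking against the data.

```python
from collections import Counter

# Invented author values in the style of the metadata column
authors = ["''", "(FROM OUR OWN CORRESPONDENT.)", "Jane Doe", "''", "Jane Doe", "John Roe"]

# Assumed placeholder values that do not name an individual author
PLACEHOLDERS = {"''", "(FROM OUR OWN CORRESPONDENT.)", "(FROM A CORRESPONDENT.)"}

named = [a for a in authors if a not in PLACEHOLDERS]
author_counts = Counter(named)

print("Unique values:", len(set(authors)))
print("Unique named authors:", len(author_counts))
print(author_counts.most_common(1))
```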

In [34]:
print(min(inventory.year))
print(max(inventory.year))

1910
2009


In [35]:
inventory.sort_values("year")

Unnamed: 0,title,year,author,term,section,pages,filename,article_id,issue_id
0,Nrsic. NEW PIAOFORTE MUSIC.,1910,'',"[ composer, composers, compositions, opera,...",Reviews,[011],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-JAN06-011-005,0FFO-1910-JAN06
328,"'"" Is4IJSIC. SAVOY THE.ATRE. "" ORPHEUS.""'",1910,'',"[ composer, composers, country, lyrics, mu...",Reviews,[010],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-APR13-010-010,0FFO-1910-APR13
327,"HIS MAJESTY'S TEEATRE. "" KING L1AR.""",1910,'',[[]],Reviews,[010],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-APR13-010-008,0FFO-1910-APR13
326,"MUSIC'. CORONET THEATRE. I' s U;N"" BALLO IN MA...",1910,'',"[ composer, musical, opera, orchestra, songs]",Reviews,[012],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-APR12-012-003,0FFO-1910-APR12
325,MUSIC. HERR STRAUSS AT QUEEN'S HALL,1910,'',"[ classical, composers, lyric, lyrics, mus...",Reviews,[013],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1910-APR11-013-014,0FFO-1910-APR11
...,...,...,...,...,...,...,...,...,...
105268,"""Vt_t_\xA3 lsf7#JI.Bjk. Moseley Folk Festival ...",2009,Stephen Dalton,"[ album, band, folk, musical, rock, songs]",Reviews,[0104],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-2009-0908-0104-001,0FFO-2009-0908
105269,Dance Deloitte Ignite 09 Covent Garden,2009,Donald Hutera,"[ artist, opera]",Reviews,[0104],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-2009-0908-0104-002,0FFO-2009-0908
105270,"Pop Moby Rough Trade East, El",2009,David Sinclair,"[ album, band, blues, gospel, pop, singer...",Reviews,[0104],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-2009-0908-0104-005,0FFO-2009-0908
105218,The hits just kept on comin; sadly,2009,Allen Robertson,"[ folk, latin, singers, song]",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-2009-0903-0092-002,0FFO-2009-0903


**Step 3:** Export the inventory as a CSV file.

In [36]:
inventory.to_csv("../TheTimes_DaveO/TheTimesArticles_1910-2009_Inventory.csv")

#### 2.2 Create one TXT file per article 
Each file contains the text from the `original_text` field and is named after the article's index in the inventory, which can be used to look up its `article_id`.

**Step 1:** Create a subset of the data that only contains articles written in 1950 or later, including the articles' `original_text`.

In [37]:
# Add the articles' text to the inventory DataFrame
inventory['original_text'] = original_texts
# Create a DataFrame containing only articles published from 1950 onwards
subset = inventory[inventory.year >= 1950]
subset.head()

Unnamed: 0,title,year,author,term,section,pages,filename,article_id,issue_id,original_text
20787,SOME NEW SCORES MOTET AND OPERA,1950,BY OUR MUSIC CRITIC,"[ bands, composer, musical, opera, orchest...",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-023,0FFO-1950-JUN30,'SOME NEW SCORES MOTET AND OPERA BY OUR MUSIC ...
20788,"THE ROYAL OPERA "" TRISTAN AND ISOLDE """,1950,'',"[ opera, orchestra]",Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-027,0FFO-1950-JUN30,"THE ROYAL OPERA "" TRISTAN AND ISOLDE "" Nietzsc..."
20789,GROWING TASTE FOR MUSIC PLEA FOR ENLARGED QUEE...,1950,'',[ country],Reviews,[],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-JUN30-008-032,0FFO-1950-JUN30,'GROWING TASTE FOR MUSIC PLEA FOR ENLARGED QUE...
20790,ROYAL PHILHARMONIC CONCERT BEECHAM AND MOZART,1950,'',"[ orchestra, orchestras]",Reviews,[010],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-MAR02-010-006,0FFO-1950-MAR02,ROYAL PHILHARMONIC CONCERT BEECHAM AND MOZART ...
20791,MUSICAL JOURNALS SOME NEWCOMERS,1950,BY OUR MUSIC CRITIC,"[ musical, orchestra, orchestras]",Reviews,[007],/lustre/home/dc125/shared/TDA_GDA_1785-2009/19...,0FFO-1950-MAR03-007-010,0FFO-1950-MAR03,'MUSICAL JOURNALS SOME NEWCOMERS BY OuR Music ...


In [38]:
print("Rows (articles):", subset.shape[0])
print("Columns (data fields per article):", subset.shape[1])

Rows (articles): 83625
Columns (data fields per article): 10


In [39]:
indeces = list(subset.index)
print(indeces[0])
print(indeces[-1])

20787
106809


In [40]:
# Make sure each article has text in the "original text" field
print(subset.original_text[subset.original_text == False])

Series([], Name: original_text, dtype: object)


**Step 2:** Let's review the titles of the articles in our subset of data, with a view to removing articles with titles that are obviously not about music and keeping all those that have any possibility of being about music.  The filtering cells below are commented out, so no articles are removed at this stage.

In [20]:
# print(subset.title.unique())
print(len(subset.title.unique()))

69355


In [14]:
# not_music = ["'CORONATION HONOURS THE THREE NEW PEERS: ONE ORDER OF MERIT'",
#        'STOCK EXCHANGE DEALINGS',
#        'Idea for :Realization of Peace in Vietnam Without Sacrificing a Single Human',
#        'Gen:eral Election results 1974', 'Election results October, 1974',
#        'CONCISE CROSSWORD (No 569)',
#        'CONCISE CROSSWORD NO 1084', 'CONCISE CROSSWORD NO 136? j',
#        'xigg ^TjMEs DEGREE COURSE VACANCY SERVICE', '"Branson nets \\xA382minsale to Japanese"',
#        'Chest, Heart and Stroke Association',
#        'Equity reversal',
#        'WINTER OLYMPICS 42',
#        'clirectorv', "'i| .- : WORD-WATCHING -|'", 
#        'Full details of courses start here', 'BUSINESS 26',
#        'AS-LEVEL RESULTS', 'CLEARING',
#        '"- . UNIVERSITY CLEARING LiSTIWeS START HERE .., _____\\xA3"',
#        '"TRAVEL \\u201E"', 
#        'ELECTION 2005',
#        '"*FTSE 350 firms have a combined value of \\xA31,739 billion"',
#        'SECONDARY SCHOOLS REPORT',
#        'NatWest Three brush up on their small talk',
#        "'''Student Prince'' actor dies'",
#        ]

In [36]:
# subset = subset[~subset["title"].isin(not_music)]
# subset

In [41]:
print(subset.section.unique())

['Reviews']


In [42]:
subset_years = subset.year.unique()
subset_years.sort()  # sort in place; without the parentheses the method is never called
print(subset_years)

[1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
 2006 2007 2008 2009]


All right, so now we have 83,625 reviews, with at least one published in every year from 1950 through 2009!
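
To confirm that no year in the range is missing, the years present can be compared against the full range. A minimal sketch with stand-in data; in the notebook, `subset.year` would take the place of `years_present`, and `missing_years` is a hypothetical helper.

```python
def missing_years(years_present, first, last):
    """Return the sorted years in [first, last] that never occur in years_present."""
    present = set(years_present)
    return [year for year in range(first, last + 1) if year not in present]

# Stand-in data: 1952 is deliberately absent
years_present = [1950, 1951, 1953, 1953, 1954]
print(missing_years(years_present, 1950, 1954))
```

An empty list would mean every year in the range has at least one article.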

**Step 2:** Associate each article's text and index number (a unique identifier) to one another by creating a dictionary from the `indeces` list and the `original_text` column of our subset of data.

In [43]:
id_text = dict(zip(list(indeces),list(subset.original_text)))
print(id_text[20787][:100]) # first 100 characters of text from an article with the input index number

'SOME NEW SCORES MOTET AND OPERA BY OUR MUSIC CRIrIC Music publishing has got into its stride once m


**Step 3:** Write each article's text to a file named as the article's index in the inventory, splitting large articles into multiple files.
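
The splitting step can be sketched independently of file writing. Assuming each piece should hold at most `max_chars` characters, slicing in steps of `max_chars` yields every piece, including the shorter final one. `split_text` is a hypothetical helper, not the notebook's `writeTxtFiles`.

```python
def split_text(text, max_chars):
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[start:start + max_chars] for start in range(0, len(text), max_chars)]

pieces = split_text("abcdefghij", 4)
print(pieces)

# The pieces always reassemble into the original text
assert "".join(pieces) == "abcdefghij"
```

Because the slices are driven by `range`, a text whose length is an exact multiple of `max_chars` produces no empty trailing piece.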

In [44]:
# Find length of articles (length = number of characters)
original_texts = list(subset.original_text)
texts_len = []
for t in original_texts:
    texts_len += [len(t)]
print("Shortest article:", min(texts_len))
print("Longest article:", max(texts_len))
print("Average article length:", sum(texts_len)/len(texts_len))
# texts_len.sort()
# print(texts_len)

Shortest article: 64
Longest article: 47716
Average article length: 3431.853620328849


In [45]:
def writeTxtFiles(article_dict, max_chars):
    for key,value in article_dict.items():
        text_len = len(value) # number of characters in the article
        
        # If the article has more than the input maximum number of characters, 
        # split the article's text across multiple files
        if text_len > max_chars:
            split_files = int(text_len/max_chars)
            char_index = 0
            while char_index < split_files:
                # Create multiple files no longer than the input maximum number of characters (max_chars)
                filepath = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part1/"+str(key)+"_"+str(char_index)
                split_value = value[(char_index*max_chars):((char_index+1)*max_chars)]
                file = open(filepath, "a")  # a for append
                try:
                    file.write(split_value)  # write only this slice, not the whole article
                except TypeError:
                    print(key, value)
                file.close()
                char_index += 1
            # Write the last file with the remaining characters in the article 
            filepath = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part1/"+str(key)+"_"+str(char_index)
            split_value = value[(char_index*max_chars):]
            file = open(filepath, "a")  # a for append
            try:
                file.write(split_value)  # write only the remaining characters
            except TypeError:
                print(key, value)
            file.close()
        
        # If the article has no more than the maximum number of characters, 
        # write the article's text to a single file
        else:
            filepath = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part1/"+str(key)
            file = open(filepath, "a")  # a for append
            try:
                file.write(value)
            except TypeError:
                print(key, value)
            file.close()
            
    print("Text files for each article can now be found in the folder named TheTimesMusicReviews_1950-2009_part1!")

In [46]:
writeTxtFiles(id_text, 10000)

OSError: [Errno 27] File too large: '../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part1/86443'

In [47]:
len(id_text[86443])
def writeTxtFiles(article_dict, max_chars):
    for key,value in article_dict.items():
        if key >= 86433:
            text_len = len(value) # number of characters in the article

            # If the article has more than the input maximum number of characters, 
            # split the article's text across multiple files
            if text_len > max_chars:
                split_files = int(text_len/max_chars)
                char_index = 0
                while char_index < split_files:
                    # Create multiple files no longer than the input maximum number of characters (max_chars)
                    filepath = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part2/"+str(key)+"_"+str(char_index)
                    split_value = value[(char_index*max_chars):((char_index+1)*max_chars)]
                    file = open(filepath, "a")  # a for append
                    try:
                        file.write(split_value)  # write only this slice, not the whole article
                    except TypeError:
                        print(key, value)
                    file.close()
                    char_index += 1
                # Write the last file with the remaining characters in the article 
                filepath = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part2/"+str(key)+"_"+str(char_index)
                split_value = value[(char_index*max_chars):]
                file = open(filepath, "a")  # a for append
                try:
                    file.write(split_value)  # write only the remaining characters
                except TypeError:
                    print(key, value)
                file.close()

            # If the article has no more than the maximum number of characters, 
            # write the article's text to a single file
            else:
                filepath = "../TheTimes_DaveO/TheTimesMusicReviews_1950-2009_part2/"+str(key)
                file = open(filepath, "a")  # a for append
                try:
                    file.write(value)
                except TypeError:
                    print(key, value)
                file.close()

    print("Text files for each article can now be found in the folder named TheTimesMusicReviews_1950-2009_part2!")

In [48]:
writeTxtFiles(id_text, 10000)

Text files for each article can now be found in the folder named TheTimesMusicReviews_1950-2009_part2!


**Step 4:** Create an inventory for this subset of data (excluding the articles' text).

In [49]:
subset_inventory = subset.drop(columns="original_text")
subset_inventory.to_csv("../TheTimes_DaveO/TheTimesArticles_1950-2009_Inventory.csv")

### 3. Analyze Data

In [50]:
texts_words = []
for t in original_texts:
    t = t.split(" ")
    texts_words += [len(t)]

In [51]:
print("Rough Estimate of Word Counts")
print(" - Shortest article:",min(texts_words))
print(" - Longest article:",max(texts_words))
print(" - Average total words:",(sum(texts_words))/(len(texts_words)))

Rough Estimate of Word Counts
 - Shortest article: 8
 - Longest article: 7719
 - Average total words: 573.3512107623318
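
One reason the estimate is rough: `split(" ")` produces an empty string wherever two spaces occur in a row, whereas `split()` with no argument treats any run of whitespace as a single separator. A small comparison on an invented string:

```python
sample = "MUSIC.  A new  concert season"

by_space = sample.split(" ")    # consecutive spaces yield empty strings
by_whitespace = sample.split()  # runs of whitespace act as one separator

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

OCR errors in the source text, such as words split or merged by stray punctuation, would affect either count.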
