# Sentiment Analysis of The Times Music Reviews
## Part I: Data Preparation
*How have artforms been reported?  Is there a status hierarchy between them?  How has this changed over time?*

* **Project:** What counts as culture?  Reporting and criticism in The Times 1785-2000
* **Project Lead:** Dave O'Brien
* **Developer:** Lucy Havens
* **Funding:** from the Centre for Data, Culture & Society, University of Edinburgh

Begun February 2021

### 1. Load Data
Import the programming libraries needed for data loading and transformation.

In [2]:
# For data loading
import re
import string
import numpy as np
import pandas as pd

In [3]:
# From https://github.com/defoe-code/defoe/blob/master/queries/music_genres.txt
genres = ["Music", "African", "Big Band", "Bluegrass", "Country", "Blues", "Musical", "Classical", "Electronic",
          "Folk", "Gospel", "Hip Hop", "Jazz", "Latin", "Metal", "Easy Listening", "Opera", "Pop", "Rap", "Rave",
          "Reggae", "Rock"
         ]

In [6]:
with open("../TheTimes_DaveO/TimesMusicReviewsData/results_music_types_excluding_music_details", "r") as file:
    data = file.read()
print(data[:1000])
# YAML file format:
#     Year
#     - 'article_id:':
#       'authors:' ''
#       filename: ...
#       issue_id: ...
#       original text: " .....\
#                        ......"
#       page_ids:
#       - '000'
#       - '000'
#       - '000'
#       term: ...
#       title: ...

1819:
- 'article_id:': 0FFO-1819-DEC01-001-004
  'authors:': ''
  filename: /lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/1785_1909/0FFO-1819-DEC01.xml
  issue_id: 0FFO-1819-DEC01
  original text: "PARLIAMIENTARY INTELLIGEATCE. HOUSE OF LORDS. TITTRAnAr NrlV. ?d)\
    \ CASH PAVNMT,N''T.S. * TIiie H~ari of LAAUD~ERDALE, seeing a noble lord in his\
    \ place, f4rom whom~ he -niighit expect to obtain information on a subject of\
    \ imvportaiice respecting the currency of the country, wished to knowv whietheor\
    \ there waLs any truith in a report whichl prevailed, that minis- ters intended\
    \ to make a proposition to Pa-rliam-ent for preventing the resuimption of cashi-payments\
    \ at the time fixed by the act 1' Thle Earl of LIVERPOOL did ,lot know uipon whlat\
    \ grouind suchl an idea couild hiave been entertained by aniy part of tile public.\
    \ ex- cep tht,in hi gratmetoplis te most absurdI and unfoundedc reprtsmigt otai\
    \ crditfro soe; buthe-could assutre the

### 2. Transform Data
#### 2.1 Create a CSV file containing article metadata
**Step 1:** Create one list for each metadata field across all articles, meaning all article_ids are in one list, all authors are in another list, etc.  The lists should all be the same length, with one item for every article in the corpus.

In [7]:
def readFileByLines(filepath):
    # Read the whole file into a list with one string per line
    with open(filepath, 'r') as file:
        lines_list = file.readlines()
    return lines_list

In [37]:
def createDataLists(lines_list):
    article_ids, authors, filenames, issue_ids, terms, titles, years, original_texts, page_ids = [], [], [], [], [], [], [], [], []
    y = ''
    txt = False
    pages = []
    for line in lines_list:
        # Check if the line is a year
        date = re.findall(r"\d{4}:", line)
        if len(date) != 0:
            date = int(date[0][0:4])
            # Valid dates should be no earlier than 1725
            # and no later than 2000 in this corpus
            if date >= 1725 and date <= 2000:
                y = date
        else:
            if "'article_id:':" in line:
                a_id = line.replace("'article_id:':","")
                a_id = a_id.strip('-')
                a_id = a_id.strip()
            elif "'authors:':" in line:
                auth = line.replace("'authors:':", "")
                auth = auth.strip()
            elif "filename:" in line:
                f = line.replace("filename:",'')
                f = f.strip()
            elif 'issue_id:' in line:
                i_id = line.replace('issue_id:','')
                i_id = i_id.strip()
            elif 'original text:' in line:
                txt = line.replace('original text: "','')
                txt = txt.replace("\\","")
                txt = txt.strip()
            elif 'term:' in line:
                term = line.replace('term:','')
                term = term.strip()
            elif re.search(r"- '(\d)+'", line) is not None:
                p = re.search(r"- '(\d)+'", line)[0]
                p = p.strip('-')
                p = p.strip()
                p = p.strip("'")
                pages += [p]
            # only the first line of the original text is preceded with a line name,
            # so all lines without one of the line names above, without the line page_ids,
            # and with backslashes are original text
            elif ('page_ids:' not in line) and ("  - " not in line) and ('title:' not in line) and (txt):
                txt_cont = line.replace("\\","")
                txt_cont = txt_cont.strip()
                txt += " " + txt_cont
            elif 'title:' in line:
                t = line.replace('title:','')
                t = t.strip()
                titles += [t]
                # title is the last line per term details instance, so add the completed
                # original text, pages and year the article was published after this line, along
                # with all the remaining data, and reset the original text, year, and pages variables
                txt = txt[:-1] # Remove ending quotation mark
                original_texts += [txt]
                txt = False
                years += [y]
                page_ids += [pages]
                pages = []
                article_ids += [a_id]
                authors += [auth]
                filenames += [f]
                issue_ids += [i_id]
                terms += [term]
                
    return article_ids, authors, filenames, issue_ids, terms, titles, years, original_texts, page_ids
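
One detail of the year check is worth noting: the pattern `\d{4}:` matches anywhere in a line, so a line of article text that happens to contain four digits followed by a colon would also be read as a year header. A minimal sketch of a stricter check follows, assuming year headers always sit alone on a line at column 0, as in the sample printed above. The helper name `parse_year_header` and the constant `YEAR_HEADER` are hypothetical and not part of this notebook.

```python
import re

# Hypothetical helper, not part of the notebook: a line counts as a year
# header only if it is exactly four digits plus a colon, starting at column 0.
YEAR_HEADER = re.compile(r"^(\d{4}):\s*$")

def parse_year_header(line, earliest=1725, latest=2000):
    """Return the year as an int if `line` is a year header, otherwise None."""
    match = YEAR_HEADER.match(line)
    if match is None:
        return None
    year = int(match.group(1))
    # Same validity range as the check in createDataLists
    return year if earliest <= year <= latest else None

# A header line, and a made-up text line that merely contains "1819:"
print(parse_year_header("1819:\n"))
print(parse_year_header("    \\ fixed by the act of 1819: the Earl said\n"))
```

With a helper like this, the loop could update `y` only when the result is not `None`.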

In [39]:
times_lines = readFileByLines("../TheTimes_DaveO/TimesMusicReviewsData/results_music_types_excluding_music_details")
article_ids, authors, filenames, issue_ids, terms, titles, years, original_texts, page_ids = createDataLists(times_lines)
# Every metadata list should have exactly one entry per article, so all lengths must match
data_lists = [article_ids, authors, filenames, issue_ids, terms, titles, years, original_texts, page_ids]
assert len({len(data_list) for data_list in data_lists}) == 1

In [11]:
print("Total articles:", len(article_ids))
# print(len(authors))
# print(len(filenames))
# print(len(issue_ids))
# print(len(terms))
# print(len(titles))
# print(len(years))
# print(len(original_texts))
# print(len(page_ids))

Total articles: 2276


In [42]:
print(original_texts[1][:400])
print(original_texts[1][-100:])

PARLIAMIENTARY INTELLIGEATCE. HOUSE OF LORDS. TITTRAnAr NrlV. ?d) CASH PAVNMT,N''T.S. * TIiie H~ari of LAAUD~ERDALE, seeing a noble lord in his place, f4rom whom~ he -niighit expect to obtain information on a subject of imvportaiice respecting the currency of the country, wished to knowv whietheor there waLs any truith in a report whichl prevailed, that minis- ters intended to make a proposition t
TENTS.-Present . 110 Proxies-o x i6i8-178 Majority against the motion -131 Adjourned at OXE O'CLOCY.


The `\` at the beginning of each line of `original text` and the `"..."` surrounding the entry have been removed from this article's text!
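
Note that `createDataLists` strips the field name only when it is followed directly by a double quote (`original text: "`), and it always drops the final character of the text. Entries that do not open with a double quote therefore keep their `original text:` prefix, which is visible in some of the later output. The sketch below shows one possible way to handle both cases. It assumes the affected entries are unquoted or single-quoted, which has not been checked against the full file, and the helper names `start_text` and `finish_text` are hypothetical.

```python
import re

# Matches the field name plus an optional opening quote (single or double).
FIELD = re.compile(r"""^\s*original text:\s*(["']?)""")

def start_text(line):
    """Return (text, quote) for the first line of an `original text` entry.

    `quote` is the opening quote character, or '' if the entry is unquoted.
    """
    match = FIELD.match(line)
    if match is None:
        raise ValueError("line does not start an 'original text' entry")
    quote = match.group(1)
    text = line[match.end():].replace("\\", "").strip()
    return text, quote

def finish_text(text, quote):
    """Drop the closing quote only if the entry was opened with one."""
    if quote and text.endswith(quote):
        return text[:-1]
    return text

# Made-up lines in the same shape as the file
print(start_text('  original text: "PARLIAMIENTARY INTELLIGEATCE.\\\n'))
print(start_text("  original text: live web chat Join\n"))
print(finish_text('Adjourned."', '"'))
```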

**Step 2:** Create a DataFrame of all data except the article texts (`original_text: ...`).  The DataFrame organizes the data into a table that is easy to export as a CSV file and view in Microsoft Excel.

In [44]:
metadata = {'title':titles, 'year':years, 'author':authors, 
             'term':terms, 'pages':page_ids, 'filename':filenames, 'article_id':article_ids, 'issue_id':issue_ids}
inventory = pd.DataFrame.from_dict(metadata)
inventory.head(10)

Unnamed: 0,title,year,author,term,pages,filename,article_id,issue_id
0,PARLIAMIENTARY INTELLIGEATCE. HOUSE OF LORDS. ...,1819,'',country,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-DEC01-001-004,0FFO-1819-DEC01
1,PARLIAMIENTARY INTELLIGEATCE. HOUSE OF LORDS. ...,1819,'',latin,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-DEC01-001-004,0FFO-1819-DEC01
2,"HOUSE OF COMMONS, Thursday, Dec. 2. A",1819,'',country,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-DEC03-001-002,0FFO-1819-DEC03
3,"""TIOYJSS> OP LORDA. Ptinw. Dr.a. \xB67.""",1819,'',country,"[002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-DEC18-002-001,0FFO-1819-DEC18
4,"""TIOYJSS> OP LORDA. Ptinw. Dr.a. \xB67.""",1819,'',rock,"[002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-DEC18-002-001,0FFO-1819-DEC18
5,"HOUSE OF COMMONS, TtUEsDAY, Nov. B.",1819,'',country,"[002, 003, 004]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-NOV24-002-001,0FFO-1819-NOV24
6,"HOUSE OF COMAMONS, WEDNESDAY. NOv. 24.",1819,'',country,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-NOV25-001-004,0FFO-1819-NOV25
7,"HOUSE OF COMAMONS, WEDNESDAY. NOv. 24.",1819,'',gospel,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-NOV25-001-004,0FFO-1819-NOV25
8,"HOUSE OF COMAMONS, WEDNESDAY. NOv. 24.",1819,'',latin,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1819-NOV25-001-004,0FFO-1819-NOV25
9,"PARLIAMENTARY INTELLIGENCE. H OUSE OF L,ORDS, ...",1820,'',country,"[001, 002, 003]",/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/17...,0FFO-1820-OCT06-001-002,0FFO-1820-OCT06


In [77]:
inventory.author.unique()

array(["''", '(From a Correspondent.)', 'FROM OUR OWN REPORTER.',
       "St. Paul's Cathedral", '(FROM OUR SPECIAL CORRESPONDENT.)',
       'MORIHIRO MATSUDA,',
       'Henri Schoup Vlaamse Elsevier Bernard D. Nossiter Washington Post'],
      dtype=object)

There aren't many author names included in the metadata for the articles in our corpus: the author field has only seven distinct values, and one of them is the empty placeholder `''`.
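
To see how many entries actually carry a name, rather than only how many distinct values exist, the values can be tallied. This is a small sketch using `collections.Counter` on a made-up sample list. In the notebook, the `authors` list built earlier would take the place of `sample_authors`.

```python
from collections import Counter

# Made-up sample standing in for the notebook's `authors` list
sample_authors = ["''", "''", "(From a Correspondent.)", "''", "FROM OUR OWN REPORTER."]

counts = Counter(sample_authors)
# Entries whose author field is anything other than the empty placeholder
named = sum(n for value, n in counts.items() if value != "''")

print(counts.most_common())
print(f"{named} of {len(sample_authors)} sample entries name an author")
```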

In [78]:
print(min(inventory.year))
print(max(inventory.year))

1726
2000


**Step 3:** Export the inventory as a CSV file.

In [79]:
inventory.to_csv("TheTimesArticles_1726-2000_Inventory")
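
One thing to keep in mind when reading the inventory back in: the `pages` column holds Python lists, and a CSV cell can only hold text, so each list is saved as its string form. The sketch below illustrates this with the standard-library `csv` module standing in for pandas, and shows that `ast.literal_eval` can turn such a string back into a list. The single row is written by hand for illustration.

```python
import ast
import csv
import io

# Write one row to an in-memory CSV, storing the list as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["article_id", "pages"])
writer.writerow(["0FFO-1819-DEC01-001-004", str(["001", "002", "003"])])

# Read the row back: the pages cell is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_pages = rows[0]["pages"]

# Convert the string form back into a list of page IDs
pages = ast.literal_eval(raw_pages)
print(type(raw_pages).__name__, type(pages).__name__)
```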

#### 2.2 Create one TXT file per article 
Each file contains the text from the field `original_text` and uses its corresponding `article_id` as the filename.

**Step 1:** Create a subset of the data that only contains articles written in 1950 or later, including the articles' `original_text`.

In [45]:
inventory['original_text'] = original_texts
subset = inventory[inventory.year >= 1950]
subset.tail()

Unnamed: 0,title,year,author,term,pages,filename,article_id,issue_id,original_text
2271,'',1997,'',musical,[0127],/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/19...,0FFO-2009-1202-0127-007,0FFO-2009-1202,original text: /ZJF/? m I $ THE MUSICA
2272,'',1997,'',musical,[0136],/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/19...,0FFO-2009-1211-0136-005,0FFO-2009-1211,u25A0. .. Benedict Nightingale enjoys a music...
2273,'',1997,'',pop,[],/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/19...,0FFO-2009-1218-0109-003,0FFO-2009-1218,original text: live web chat Join TZrnes pop c...
2274,'',1997,'',musical,[],/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/19...,0FFO-2009-1219-0298-006,0FFO-2009-1219,THE MUSICAL HIT iaFTHEYEAB: THE STIIY IF FUiKI...
2275,Heirs bells,1997,'',rock,[0071],/lustre/home/sc048/rosaf4/TDA_GDA_1785-2009/19...,0FFO-2009-1222-0071-001,0FFO-2009-1222,original text: Heirs bells 20dB Rustling leave...


In [85]:
subset_years = subset.year.unique()
subset_years.sort()
print(subset_years)

[1953 1958 1961 1964 1965 1966 1968 1969 1970 1971 1973 1974 1976 1977
 1978 1980 1981 1982 1983 1985 1986 1987 1988 1989 1990 1991 1992 1993
 1994 1995 1996 1997 1998 1999 2000]


Great!  The articles in our subset were published between 1953 and 2000.

**Step 2:** Associate each article's ID with its text by creating a dictionary from the `article_id` and `original_text` columns of our subset of data.

In [46]:
id_text = dict(zip(list(subset.article_id),list(subset.original_text)))
print(id_text["0FFO-2009-1218-0109-003"][:100]) # First one hundred characters of the article with ID 0FFO-2009-1218-0109-003

original text: live web chat Join TZrnes pop critic Pete Paphides at midday on Monday to dissect the
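
Because the inventory has one row per matched term, the same `article_id` can appear in several rows, as in the first rows of the inventory above. A dictionary keeps only one value per key, so `id_text` may contain fewer entries than the subset has rows. The sketch below shows this behaviour on made-up IDs and texts, and one way to list the repeated IDs.

```python
from collections import Counter

# Made-up rows: article "A-001" matched two terms, so its ID appears twice
article_ids = ["A-001", "A-001", "A-002"]
texts = ["text of A-001", "text of A-001", "text of A-002"]

# Later rows overwrite earlier ones that share the same key
id_text = dict(zip(article_ids, texts))

# IDs that occur more than once, with their row counts
duplicates = {key: n for key, n in Counter(article_ids).items() if n > 1}

print(len(article_ids), "rows ->", len(id_text), "dictionary entries")
print(duplicates)
```

Since repeated rows for one article carry the same text, nothing is lost for the TXT export, but the number of files written will match the number of unique IDs rather than the number of rows.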


**Step 3:** Write each article's text to a file named after the article's ID.

In [47]:
def writeTxtFiles(article_dict):
    for key,value in article_dict.items():
        filepath = "../TheTimes_DaveO/TheTimesTextFiles_1953-2000/"+key
        # "a" is for append: if the file already exists, the text is added to its end
        with open(filepath, "a") as file:
            file.write(value)
    print("Text files for each article can now be found in the folder named TheTimesTextFiles_1953-2000!")

In [48]:
writeTxtFiles(id_text)

Text files for each article can now be found in the folder named TheTimesTextFiles_1953-2000!
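
Because `writeTxtFiles` opens each file in append mode, running the cell a second time adds a second copy of the text to every file. A possible variant is sketched below with `pathlib`. It creates the output folder if it is missing, adds a `.txt` extension, and overwrites existing files. The function name `write_txt_files`, the extension, and the sample IDs are assumptions made for illustration, and a temporary folder stands in for the project folder.

```python
import tempfile
from pathlib import Path

def write_txt_files(article_dict, out_dir):
    """Write one UTF-8 text file per article, overwriting any existing file."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for article_id, text in article_dict.items():
        (out_path / f"{article_id}.txt").write_text(text, encoding="utf-8")
    return len(article_dict)

# Demonstration in a temporary folder with made-up IDs and texts
with tempfile.TemporaryDirectory() as tmp:
    sample = {"SAMPLE-0001": "first sample text", "SAMPLE-0002": "second sample text"}
    write_txt_files(sample, tmp)
    write_txt_files(sample, tmp)  # a second run replaces the files instead of appending
    written = sorted(p.name for p in Path(tmp).iterdir())
    first = (Path(tmp) / "SAMPLE-0001.txt").read_text(encoding="utf-8")

print(written)
print(first)
```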
