## Loading Davies Corpora

This notebook will explore the corpora in this folder.

In [1]:
import re
import zipfile
import os
import sys

### Loading raw data

The following function iterates through the zip archives in the corpus folder and reads their contents into a dictionary: each file inside an archive maps to a list of its lines, kept as raw bytes.

```corpus_name``` is a string holding the path to the directory of the corpus you want to load.

In [2]:
corpus_name = "/Users/bhargavvader/Downloads/Movies"
# corpus_name = "Movies"

In [3]:
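def loadcorpus(corpus_name, corpus_style="text"):
    # Maps each file inside every matching zip archive to its list of raw (bytes) lines.
    texts_raw = {}
    for zip_name in os.listdir(corpus_name):
        # Only open archives whose name contains corpus_style, e.g. "text_13_idi.zip".
        if corpus_style in zip_name:
            print(zip_name)
            # The with-block closes the archive once all of its members are read.
            with zipfile.ZipFile(os.path.join(corpus_name, zip_name)) as zfile:
                for member in zfile.namelist():
                    texts_raw[member] = []
                    with zfile.open(member) as f:
                        for line in f:
                            texts_raw[member].append(line)
    return texts_raw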
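
To make the shape of the returned dictionary concrete, here is a minimal, self-contained sketch that writes a tiny archive to a temporary directory and reads it back in the same way as `loadcorpus`. The archive name `text_demo.zip`, the member name `1.txt`, its contents, and the helper `read_zip_lines` are all made up for illustration; none of them are part of the corpus.

```python
import os
import tempfile
import zipfile


def read_zip_lines(zip_path):
    """Map each member of one zip archive to its list of raw (bytes) lines."""
    lines_by_member = {}
    with zipfile.ZipFile(zip_path) as zfile:
        for member in zfile.namelist():
            with zfile.open(member) as f:
                lines_by_member[member] = list(f)
    return lines_by_member


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a throwaway archive holding one text file with two lines.
    demo_zip = os.path.join(tmp_dir, "text_demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zfile:
        zfile.writestr("1.txt", "@@1 first film\n@@2 second film\n")
    demo = read_zip_lines(demo_zip)

print(demo)
```

Every value is a list of `bytes` objects, which is why each line has to be decoded before any text cleaning can happen.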

We will use the movies corpus here, but you can uncomment the cells below to try the other corpora too.
You might have to adjust the cleaning for the other corpora; I have tried it on most of them and it works fine.

In [4]:
movie_raw = loadcorpus(corpus_name)

text_13_idi.zip
text_16_qll.zip
text_32_ldf.zip
text_19_gvc.zip
text_05_nko.zip
text_17_arp.zip
text_01_ote.zip
text_28_rfy.zip
text_31_akv.zip
text_22_etp.zip
text_11_uoy.zip
text_09_oii.zip
text_06_jfy.zip
text_14_lnc.zip
text_08_loh.zip
text_33_kje.zip
text_30_wkp.zip
text_07_oma.zip
text_03_mnq.zip
text_21_fqa.zip
text_29_oye.zip
text_27_fle.zip
text_23_fmh.zip
text_12_rcq.zip
text_00_myn.zip
text_10_aoy.zip
text_04_mlq.zip
text_20_cde.zip
text_02_mqu.zip
text_26_ngj.zip
text_24_ywo.zip
text_18_jfj.zip
text_25_byg.zip
text_15_guo.zip


In [5]:
# tv_raw = loadcorpus("TV")

In [6]:
# wiki_raw = loadcorpus("Wiki")

In [7]:
# soap_raw = loadcorpus("SOAP")

In [8]:
# span_raw = loadcorpus("SPAN")

In [9]:
# now_raw = loadcorpus("NOW")

In [10]:
# web_raw = loadcorpus("iWeb")

In [11]:
# glowbe_raw = loadcorpus("GloWbe")

In [12]:
# coha_raw = loadcorpus("COHA")

In [13]:
# coca_raw = loadcorpus("COCA")

Let us look at the raw movie data and decide which parts of it we want to use.

In [14]:
movie_raw['11.txt'][2]

b'@@3512517 I \'m \' most frightened to death . Sure , after you \'ve done it eight or nine times , you wo n\'t even give it a thought . - Gee , Dot , you look swell . - Am I all right ? Lovely . dddd Well , there goes the maiden \'s prayer . I wonder how I \'ll act . It \'s like diving overboard-you never know how the water \'s going to be till you hit it . - I \'m so nervous . - Say ... if I could look like you in a wedding gown , I \'d be a bigamist . Come on . dddd I say , is n\'t that girl in the bride \'s outfit a new model ? Why , yes . She \'s a salesgirl downstairs . We \'re trying her out . She \'s got my okay . These guys usually make wisecracks . Do n\'t let it bother you . I know all the answers- men have been insulting me for years . Say , beautiful . Doing anything tonight ? I \'m taking my two pet fish out for a drive . There \'ll be @ @ @ @ @ @ @ @ @ @ Do n\'t talk back to them . You \'ll get fired . When they deliver baloney at my door , I always give them a receipt .

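It seems messy, but it is nothing we can't clean. This basic function undoes some of the formatting quirks, such as the space inserted before contractions, and skips any line it cannot process; uncomment the print statements to see those lines when debugging. Let us clean one of the raw text files.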

Note: lines that aren't valid UTF-8 are skipped here. I do this to keep things clean; if you want more data, or expect special characters, you may want to relax that restriction.
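
If skipping undecodable lines loses too much data, one alternative is to decode leniently. The sketch below relies on the standard `errors="replace"` option of `bytes.decode`, which substitutes the replacement character U+FFFD for each invalid byte. The helper name `decode_lenient` and the sample bytes are assumptions made for this example, not part of the notebook.

```python
def decode_lenient(raw_line):
    """Decode bytes as UTF-8, replacing invalid bytes instead of raising."""
    return raw_line.decode("utf-8", errors="replace")


# b"\xe9" is the Latin-1 encoding of "é" and is not valid UTF-8 on its own.
sample = b"caf\xe9 society"
decoded = decode_lenient(sample)
print(decoded)
```

The trade-off is that the text keeps a visible marker where the original character was, so later steps may need to filter out U+FFFD.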

In [15]:
def clean_raw_text(raw_texts):
    clean_texts = []
    for text in raw_texts:
        try:
            text = text.decode("utf-8")
            clean_text = text.replace(" \'m", "'m").replace(" \'ll", "'ll").replace(" \'re", "'re").replace(" \'s", "'s").replace(" n\'t", "n't").replace(" \'ve", "'ve").replace(" \'d", "'d")
            clean_texts.append(clean_text)
        except AttributeError:
            # print("ERROR CLEANING")
            # print(text)
            continue
        except UnicodeDecodeError:
            # print("Unicode Error, Skip")
            continue
    return clean_texts
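
The chain of replacements can also be written as a single regular expression. This is only a sketch of an alternative, not the approach used in this notebook; `CLITIC_PATTERN` and `rejoin_contractions` are names invented for the example, and the sample sentence is stitched together from the raw output shown earlier.

```python
import re

# A space, then a contraction suffix ('m, 'll, 're, 's, 've, 'd or n't)
# that ends at a word boundary.
CLITIC_PATTERN = re.compile(r" (n't|'(?:m|ll|re|s|ve|d))\b")


def rejoin_contractions(text):
    """Remove the space the corpus inserts before contraction suffixes."""
    return CLITIC_PATTERN.sub(r"\1", text)


example = "She 's got my okay . Do n't let it bother you . I 'd be a bigamist ."
print(rejoin_contractions(example))
```

Unlike plain `str.replace`, the word boundary keeps the pattern from firing on a quoted word that merely starts with one of these letters.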

In [16]:
clean_11 = clean_raw_text(movie_raw['11.txt'])

Nice. This is looking a lot cleaner. We can now run some of the lucem_illud text cleaning methods that we discuss and model in week 4.

In [17]:
import lucem_illud_2020

In [18]:
lucem_illud_2020.word_tokenize(clean_11[1])

['@@216680',
 'Hey',
 'I',
 "'m",
 'talking',
 'to',
 'you',
 'Give',
 'me',
 '600',
 'dollars',
 'You',
 'wish',
 'That',
 "'s",
 'all',
 'we',
 "'ve",
 'left',
 'And',
 'you',
 'still',
 'go',
 'to',
 'gamble',
 'Shut',
 'up',
 'I',
 'earn',
 'the',
 'money',
 'Even',
 'that',
 'you',
 'ca',
 "n't",
 'take',
 'it',
 'for',
 'gamble',
 'Shut',
 'up',
 'What',
 "'re",
 'you',
 'doing',
 'Bastard',
 'I',
 "'m",
 'gon',
 'na',
 'beat',
 'you',
 'You',
 'gambling',
 'pig',
 'I',
 "'ll",
 'beat',
 'the',
 'shit',
 'out',
 'of',
 'you',
 'You',
 'bitch',
 'I',
 "'ll",
 'beat',
 'you',
 'You',
 'dare',
 'to',
 'hit',
 'me',
 'with',
 'something',
 'I',
 "'ll",
 'kill',
 'you',
 'All',
 'you',
 'know',
 'is',
 'gambling',
 'I',
 "'ll",
 'beat',
 'you',
 'What',
 "'re",
 'you',
 'doing',
 'Let',
 'go',
 'of',
 'me',
 'Stop',
 'You',
 "'ll",
 'kill',
 'Mom',
 'Mom',
 'are',
 'you',
 'all',
 'right',
 'Do',
 "n't",
 'touch',
 'my',
 'money',
 'Dad',
 'where',
 "'re",
 'you',
 'going',
 'Go',
 'a

In [19]:
lucem_illud_2020.normalizeTokens(clean_11[1])

['@@216680',
 'hey',
 'talk',
 'dollar',
 'wish',
 'leave',
 'gamble',
 'shut',
 'earn',
 'money',
 'gamble',
 'shut',
 'bastard',
 'gon',
 'na',
 'beat',
 'gamble',
 'pig',
 'beat',
 'shit',
 'bitch',
 'beat',
 'dare',
 'hit',
 'kill',
 'know',
 'gamble',
 'beat',
 'let',
 'stop',
 'kill',
 'mom',
 'mom',
 'right',
 'touch',
 'money',
 'dad',
 'go',
 'away',
 'dad',
 'lose',
 'dad',
 'come',
 'marble',
 'way',
 'want',
 'trouble',
 'trouble',
 'woman',
 'sailor',
 'care',
 'marble',
 'gamble',
 'bit',
 'raise',
 'kid',
 'way',
 'raise',
 'like',
 'raise',
 'care',
 'marry',
 'grow',
 'tell',
 'smart',
 'marry',
 'gambler',
 'hey',
 'want',
 'quarrel',
 'time',
 'right',
 'rush',
 'share',
 'crowded',
 'worship',
 'ancestor',
 'thing',
 'hey',
 'granny',
 'come',
 'come',
 'bite',
 'dish',
 'good',
 'year',
 'right',
 'marble',
 'share',
 'pork',
 'leg',
 'washing',
 'basin',
 'hey',
 'let',
 'share',
 'pork',
 'share',
 'wish',
 'boy',
 'yeah',
 'wee',
 'wee',
 'aunty',
 'chiang',
 'l

Great! Now let us create a Pandas dataframe with movie names, raw words, tokenized words, and so on.
The file "sources_movies.zip" has this information. Similar information files are found for the other datasets too, in their respective folders.

In [20]:
zfile = zipfile.ZipFile(corpus_name + "/sources_movies.zip")
source = []

In [21]:
for file in zfile.namelist():
    with zfile.open(file) as f:
        for line in f:
            source.append(line)

In [22]:
source[0:20]

[b'textID\tfileID\t#words\tgenre\tyear\tlanguage(s)\tcountry\timdb\ttitle\r\n',
 b'-----\t-----\t-----\t-----\t-----\t-----\t-----\t-----\t-----\r\n',
 b'\r\n',
 b'290635\t3547424\t4722\tShort, Musical\t1930\tUK\tEnglish\t0290635\tGoodbye to All That\r\n',
 b'21165\t6332374\t10220\tCrime, Mystery, Thriller\t1930\tUK\tEnglish\t0021165\tMurder!\r\n',
 b'21191\t6013789\t5281\tDrama, Romance\t1930\tUSA\tEnglish\t0021191\tA Notorious Affair\r\n',
 b'20620\t3660608\t6724\tBiography, Drama, History\t1930\tUSA\tEnglish\t0020620\tAbraham Lincoln\r\n',
 b'20629\t60053\t9552\tDrama, War\t1930\tUSA\tEnglish, French, German, Latin\t0020629\tAll Quiet on the Western Front\r\n',
 b'20640\t6850720\t13862\tComedy, Musical\t1930\tUSA\tEnglish\t0020640\tAnimal Crackers\r\n',
 b'20641\t176501\t11140\tDrama, Romance\t1930\tUSA\tEnglish\t0020641\tAnna Christie\r\n',
 b'20643\t3603861\t1748\tComedy, Short\t1930\tUSA\tEnglish\t0020643\tAnother Fine Mess\r\n',
 b'20670\t4159455\t6966\tComedy, Musical\t1930\tUS

It looks messy because each line is still a bytes object, but the information is clearly there. The file ID also appears in the raw text data, as the first "word" of each text after the `@@` prefix; look back at the tokenized and normalized output to confirm that. We are going to use it to create a dataframe with the file ID, movie name, genre, year, and country.
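
As a minimal sketch of that link, the snippet below pulls the ID out of a raw line before any tokenization. The helper `extract_text_id` and the pattern name are hypothetical; the sample line is the start of the raw text printed earlier.

```python
import re

# "@@" followed by digits at the very start of a raw (bytes) line.
TEXT_ID_PATTERN = re.compile(rb"^@@(\d+)")


def extract_text_id(raw_line):
    """Return the digits after the leading '@@' as a str, or None if absent."""
    match = TEXT_ID_PATTERN.match(raw_line)
    return match.group(1).decode("ascii") if match else None


print(extract_text_id(b"@@3512517 I 'm ' most frightened to death ."))
print(extract_text_id(b"\r\n"))
```

The returned string can be compared directly with the `fileID` column of the source file.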

It is a good idea to run a similar check of the source file before doing any other extraction.
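
One way to make such a check systematic is to confirm that every data row has as many tab-separated fields as the header. The function below is a sketch under the assumption that the file starts with a header row, a dashed separator row, and a blank row, as in the output above; `check_source_rows` is a made-up name and the last sample row is deliberately malformed for the demonstration.

```python
def check_source_rows(lines, n_preamble=3):
    """Return the header fields and a list of (line index, problem) pairs."""
    header = lines[0].decode("utf-8").rstrip("\r\n").split("\t")
    problems = []
    # Skip the header, the dashed separator and the blank line.
    for index, line in enumerate(lines[n_preamble:], start=n_preamble):
        try:
            fields = line.decode("utf-8").rstrip("\r\n").split("\t")
        except UnicodeDecodeError:
            problems.append((index, "not valid UTF-8"))
            continue
        if len(fields) != len(header):
            problems.append((index, f"{len(fields)} fields, expected {len(header)}"))
    return header, problems


sample_source = [
    b"textID\tfileID\t#words\tgenre\tyear\tlanguage(s)\tcountry\timdb\ttitle\r\n",
    b"-----\t-----\t-----\t-----\t-----\t-----\t-----\t-----\t-----\r\n",
    b"\r\n",
    b"21165\t6332374\t10220\tCrime, Mystery, Thriller\t1930\tUK\tEnglish\t0021165\tMurder!\r\n",
    b"not\ta\tfull\trow\r\n",  # malformed row, invented for this demo
]
header, problems = check_source_rows(sample_source)
print(header)
print(problems)
```

Rows reported here are the ones that would make the nine-way unpacking in the dataframe loop fail.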

First, let us create a dictionary mapping file-id to all the text. Each movie will be mapped to a list of the tokenized words.

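In this example, I stop after about 1000 movies. The limit is checked once per text file rather than once per movie, so the final count can be somewhat higher. You can comment the check out or change the number as you see fit.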

In [23]:
movie_texts = {}

In [24]:
for files in movie_raw:
    if len(movie_texts) > 1000:
        break
    movies = clean_raw_text(movie_raw[files][1:])
    for movie in movies:
        txts = lucem_illud_2020.word_tokenize(movie)
        try:
            movie_texts[txts[0][2:]] = txts[1:]
        except IndexError:
            continue
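
If an exact cap matters, the length check can move inside the per-movie loop. The sketch below shows the idea on plain strings; `build_text_index` is a hypothetical helper, `str.split` stands in for `lucem_illud_2020.word_tokenize`, and the sample lines are abbreviated from the output above with invented IDs.

```python
def build_text_index(lines, limit, tokenize=str.split):
    """Map each text ID to its tokens, stopping at exactly `limit` entries."""
    index = {}
    for line in lines:
        # Checking here, once per text, enforces the cap exactly.
        if len(index) >= limit:
            break
        tokens = tokenize(line)
        # Skip empty lines and lines that do not start with an "@@" ID.
        if not tokens or not tokens[0].startswith("@@"):
            continue
        index[tokens[0][2:]] = tokens[1:]
    return index


sample_lines = ["@@101 Hey you", "", "@@102 Give me 600 dollars", "@@103 Shut up"]
limited = build_text_index(sample_lines, limit=2)
print(limited)
```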

In [25]:
import pandas as pd

In [26]:
movie_df = pd.DataFrame(columns=["Movie Name", "Genre", "Year", "Country", "Tokenized Texts"])

In [27]:
for movie in source[3:]:
    try:
        tid, fileid, total_words, genre, year, lang, country, imdb, title = movie.decode("utf-8").split("\t")
    except (UnicodeDecodeError, ValueError):
        # Skip rows that are not valid UTF-8 or do not have exactly nine fields.
        continue
    try:
        movie_df.loc[fileid.strip()] = [title.strip(), genre.strip(), year.strip(), country.strip(), movie_texts[fileid.strip()]]
    except KeyError:
        continue

In [28]:
movie_df.head()

| | Movie Name | Genre | Year | Country | Tokenized Texts |
|---|---|---|---|---|---|
| 6861982 | Blonde Crazy | Comedy, Crime, Drama | 1931 | English | [Who, cares, for, starlit, skies, when, you, '... |
| 6606107 | Five and Ten | Drama, Romance | 1931 | English | [Subtitles, Lu, s, Filipe, Bernardes, Mr, Rari... |
| 6406611 | Five Star Final | Crime, Drama | 1931 | English | [Extra, Extra, Extra, Five, star, final, Indis... |
| 3251135 | The Smiling Lieutenant | Comedy, Romance, Musical | 1931 | English, French | [Bell_Rings, Bell_Rings, Sighing, Yawns, Yes, ... |
| 6909562 | Faithless | Drama | 1932 | English | [But, Carol, this, bank, is, your, guardian, W... |


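This dataframe holds the name, genre, year, country, and tokenized text of each movie, indexed by file ID. Note that the `Country` column shows language values such as "English": in the source file the data rows list the country before the language, the reverse of the order given in the header, so the `lang` and `country` variables in the loop above end up swapped. With that caveat, all sorts of analysis can now be run on this information.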
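
As one small example of such analysis, the sketch below counts how often each genre occurs. To stay self-contained it uses a plain dictionary holding only the five rows shown above rather than the real dataframe; `genres_by_title` is a stand-in name for the `Genre` column.

```python
from collections import Counter

# The five rows printed above; a stand-in for the dataframe's Genre column.
genres_by_title = {
    "Blonde Crazy": "Comedy, Crime, Drama",
    "Five and Ten": "Drama, Romance",
    "Five Star Final": "Crime, Drama",
    "The Smiling Lieutenant": "Comedy, Romance, Musical",
    "Faithless": "Drama",
}

# Each Genre value is one comma-separated string, so split it before counting.
genre_counts = Counter(
    genre.strip()
    for genres in genres_by_title.values()
    for genre in genres.split(",")
)
print(genre_counts.most_common())
```

The same splitting step applies to any column that packs several values into one string.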



You are encouraged to follow a similar process to load the other datasets.