

String processing with Python

Using a text corpus found on the cds-language GitHub repo or a corpus of your own found on a site such as Kaggle, write a Python script which calculates collocates for a specific keyword.

The script should take a directory of text files, a keyword, and a window size (number of words) as input parameters, and an output file called out/{filename}.csv These parameters can be defined in the script itself Find out how often each word collocates with the target across the corpus Use this to calculate mutual information between the target word and all collocates across the corpus Save result as a single file consisting of four columns: collocate, raw_frequency, MI

BONUS CHALLENGE: Use argparse to take inputs from the command line as parameters

General instructions

For this assignment, you should upload a standalone .py script which can be executed from the command line. Save your script as collocation.py Make sure to include a requirements.txt file and your data You can either upload the scripts here or push to GitHub and include a link - or both! Your code should be clearly documented in a way that allows others to easily follow the structure of your script and to use them from the command line

Purpose

This assignment is designed to test that you have a understanding of:

how to structure, document, and share a Python scripts; how to effectively make use of native Python packages for string processing; how to extract basic linguistic information from large quantities of text, specifically in relation to a specific target keyword



In [107]:
help(set())

Help on set object:

class set(object)
 |  set() -> new empty set object
 |  set(iterable) -> new set object
 |  
 |  Build an unordered collection of unique elements.
 |  
 |  Methods defined here:
 |  
 |  __and__(self, value, /)
 |      Return self&value.
 |  
 |  __contains__(...)
 |      x.__contains__(y) <==> y in x.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iand__(self, value, /)
 |      Return self&=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __ior__(self, value, /)
 |      Return self|=value.
 |  
 |  __isub__(self, value, /)
 |      Return self-=value.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __ixor__(self, value, /)
 |      Return self^=value.



# Define function which includes the arguments text directory, keyword and window size (the latter n-words before and n-words after keyword)

In [157]:
#Importing all the necessary modules
import os
import re
from collections import Counter
import pandas as pd
from pathlib import Path

filepath = os.path.join("..", "data", "100_english_novels", "corpus")

#for file in Path(filepath).glob("*.txt"):
#    print(file)

def collocation(text_dir, keyword, window_size = 3):
    vocabs = set()
    index_list = []
    window_count_list = []
    tokenized_text_list=[]
    for file in Path(text_dir).glob("*5.txt"):
        with open(file, "r", encoding="utf-8") as file:
            text = file.read()
            text_tokens = re.compile(r"\W+").split(text)
            tokenized_text_list.extend(text_tokens)
            
            indices = [index for index, match in enumerate(text_tokens) if match == keyword]
            index_list.append(indices)
    
            for index in indices:
                vocab = set(text_tokens[index-window_size:index+window_size+1])
                for word in vocab:
                    vocabs.add(word)
                    
                
        for word in vocabs:
            count =  tokenized_text_list.count(word)
            window_count_list.append(count)
            
    print(f"vocab len = {len(vocabs)}")        
    return vocabs

vocabs,window_count_list = collocation(filepath, "Peter")




vocab len = 90


ValueError: too many values to unpack (expected 2)

In [158]:
print(window_count_list)


[634, 987, 533, 480, 423, 1, 1443, 9285, 10231, 1747, 1, 12510, 2, 1058, 5, 6, 4, 572, 3, 13, 2333, 43, 930, 830, 3460, 4, 21, 19447, 507, 761, 9, 2954, 1816, 40, 1282, 2150, 2, 168, 26, 38, 5757, 1516, 95, 224, 2884, 4071, 674, 46, 14285, 164, 15480, 2701, 1, 17284, 2, 1658, 5, 7, 4, 1, 762, 4, 20, 3227, 73, 1019, 1397, 17, 1076, 4743, 4, 8931, 26, 27030, 769, 1378, 11, 4615, 3062, 54, 1753, 27, 238, 2376, 2, 194, 1, 48, 44, 8589, 2132, 103, 245, 4125, 6321, 804, 58, 22322, 257, 23634, 3883, 1, 28319, 9, 2, 2528, 5, 10, 5, 1, 1343, 5, 33, 5335, 98, 1665, 1980, 23, 1438, 6512, 4, 15625, 1, 30, 39078, 921, 59, 11, 1550, 512, 14, 6582, 4077, 77, 2779, 171, 317, 5205, 2, 399, 1, 58, 63, 11898, 312, 3054, 158, 3665, 339, 2893, 6908, 9666, 1061, 2632, 75, 26902, 285, 26959, 4533, 1, 30496, 9, 2, 2662, 5, 10, 5, 1, 1452, 5, 39, 5938, 110, 1932, 2204, 25, 1500, 7892, 4, 20435, 1, 31, 44043, 1154, 63, 11, 1930, 594, 14, 7851, 4871, 77, 3019, 171, 383, 6911, 3, 530, 1, 62, 65, 13810, 317, 3282,