### Task 1: 

In Moodle you will find three files, each containing two movie reviews: reviews1.txt, reviews2.txt
and reviews3.txt. One of the files is UTF-16 encoded, while the other two are UTF-8
encoded. Check the encoding of each file and load the texts into your console.


In [31]:
file_paths = ['reviews1.txt', 'reviews2.txt', 'reviews3.txt']

def load_text(file_path):
    try:
        # Attempt to open and read the file in UTF-8 encoding
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read()
    except UnicodeDecodeError:
        # If UTF-8 fails, try reading in UTF-16 encoding
        with open(file_path, 'r', encoding='utf-16') as file:
            return file.read()

texts = [load_text(path) for path in file_paths]
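
The fallback above loads the files but does not report which encoding each one uses, which the task also asks for. A minimal sketch of one way to check, assuming the UTF-16 file starts with a byte order mark (BOM), as is usual for files saved as UTF-16; `detect_encoding` is a hypothetical helper name, not part of the original solution:

```python
import codecs

def detect_encoding(file_path):
    """Guess a text file's encoding from its first bytes (hypothetical helper)."""
    # Read the raw bytes so that no decoding is attempted yet
    with open(file_path, 'rb') as file:
        head = file.read(4)
    # Assumption: a UTF-16 file starts with a little- or big-endian BOM
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    # A UTF-8 BOM is optional; 'utf-8-sig' strips it on reading when present
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    # No BOM found: assume UTF-8, as stated in the task description
    return 'utf-8'
```

Calling `detect_encoding(path)` for each entry in `file_paths` would then show which file is the UTF-16 one, and the returned name can be passed directly as the `encoding` argument of `open`.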


### Task 2: 

Split the lines in all texts (separator "\n") to separate the two reviews in each file and then
combine the reviews from all three files into one list. The result should thus be a list of six
strings.

In [32]:

all_reviews = []
for text in texts:
    # Split each text into reviews using '\n' as the delimiter
    reviews = text.strip().split("\n")
    # Ensure we only add non-empty strings (in case there are extra newlines)
    reviews_filtered = [review for review in reviews if review.strip()]
    all_reviews.extend(reviews_filtered)

# Check the length of all_reviews to ensure it contains six strings as expected
len(all_reviews), all_reviews


(6,
 ["Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.",
  'Bromwell High is nothing short of brilliant. Expertly scripted and perfectly delivered, this searing parody of a students and teachers at a South London Public School leaves you literally rolling with laughter. It\'s vulgar, provocative, witty and sharp. The characters are a superbly caricatured cross section of British society (o
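
The same split can be wrapped in a small function that also tolerates other line endings. This is only a sketch: `split_reviews` is a hypothetical helper, and the sample strings are made-up placeholders rather than the actual review texts.

```python
def split_reviews(texts):
    """Split each text into lines and collect the non-empty ones (hypothetical helper)."""
    reviews = []
    for text in texts:
        # splitlines() treats '\n', '\r\n' and '\r' (among other boundaries) as line breaks
        reviews.extend(line.strip() for line in text.splitlines() if line.strip())
    return reviews

# Placeholder data standing in for the loaded file contents
sample_texts = ["first review\nsecond review\n", "third review\r\n\r\nfourth review"]
print(split_reviews(sample_texts))
# → ['first review', 'second review', 'third review', 'fourth review']
```

In the notebook, `split_reviews(texts)` should give the same six strings as the loop above, provided each review sits on a single line.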

### Task 3:

#### Apply elementary text handling ("preprocessing") steps. That is, within each review:

* Remove punctuation, numbers and special characters 
* Turn all letters into lower case 
* Split the text into individual words

#### The result should be a list of lists of strings. Each inner list (or character vector) represents one review as separate words.

In [33]:
import re

# Function to preprocess text according to the specified requirements
def preprocess_text(text):
    # Remove punctuation, numbers, and special characters
    text_cleaned = re.sub(r'[^a-zA-Z\s]', '', text)
    # Turn all letters into lower case
    text_lower = text_cleaned.lower()
    # Split the text into individual words
    words = text_lower.split()
    return words

# Apply the preprocessing step to each review
preprocessed_reviews = [preprocess_text(review) for review in all_reviews]

preprocessed_reviews


[['story',
  'of',
  'a',
  'man',
  'who',
  'has',
  'unnatural',
  'feelings',
  'for',
  'a',
  'pig',
  'starts',
  'out',
  'with',
  'a',
  'opening',
  'scene',
  'that',
  'is',
  'a',
  'terrific',
  'example',
  'of',
  'absurd',
  'comedy',
  'a',
  'formal',
  'orchestra',
  'audience',
  'is',
  'turned',
  'into',
  'an',
  'insane',
  'violent',
  'mob',
  'by',
  'the',
  'crazy',
  'chantings',
  'of',
  'its',
  'singers',
  'unfortunately',
  'it',
  'stays',
  'absurd',
  'the',
  'whole',
  'time',
  'with',
  'no',
  'general',
  'narrative',
  'eventually',
  'making',
  'it',
  'just',
  'too',
  'off',
  'putting',
  'even',
  'those',
  'from',
  'the',
  'era',
  'should',
  'be',
  'turned',
  'off',
  'the',
  'cryptic',
  'dialogue',
  'would',
  'make',
  'shakespeare',
  'seem',
  'easy',
  'to',
  'a',
  'third',
  'grader',
  'on',
  'a',
  'technical',
  'level',
  'its',
  'better',
  'than',
  'you',
  'might',
  'think',
  'with',
  'some',
  'goo
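
One side effect of deleting characters outright is visible in the output: "it's" becomes "its", and a hyphenated word would be fused into a single token. A possible variant, shown only as a sketch, replaces the removed characters with a space so that word boundaries survive; `preprocess_keep_boundaries` is a hypothetical name and the sample sentence is made up for illustration.

```python
import re

def preprocess_keep_boundaries(text):
    """Variant of preprocess_text that replaces removed characters with a space."""
    # Anything that is not a letter or whitespace becomes a space instead of vanishing
    text_cleaned = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Lower-case and split on runs of whitespace
    return text_cleaned.lower().split()

sample = "It's a well-made film, rated 8/10!"
print(preprocess_keep_boundaries(sample))
# → ['it', 's', 'a', 'well', 'made', 'film', 'rated']
```

Neither variant is strictly better: this one keeps "well" and "made" apart but leaves a stray "s" from the contraction, whereas the original keeps "its" whole and merges hyphenated words. Which behavior is preferable depends on the later analysis.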

### Task 4: 

Count how often each word occurs in this text corpus and display the 5 most common words.
Can we use these top words to infer what the texts are about?

In [34]:
from collections import Counter

# Combine all the words from all reviews into a single list
all_words = [word for review in preprocessed_reviews for word in review]

# Count how often each word occurs in the text corpus
word_counts = Counter(all_words)

# Get the 5 most common words
most_common_words = word_counts.most_common(5)

most_common_words


[('the', 45), ('of', 36), ('a', 30), ('is', 28), ('and', 27)]

The 5 most common words in the text corpus are:

* 'the' (45 occurrences)
* 'of' (36 occurrences)
* 'a' (30 occurrences)
* 'is' (28 occurrences)
* 'and' (27 occurrences)

All five are English stop words: function words that appear in almost any English text and carry little meaning on their own. They reflect the structure of the language rather than the content of the reviews, so these top words alone do not tell us what the texts are about. To get at the topics, we would need to filter out the stop words and look at the frequencies of the remaining, more descriptive words.
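
A minimal sketch of that filtering step follows. The stop word set is a small hand-picked list assumed purely for illustration (a real analysis would use a fuller list), `top_content_words` is a hypothetical helper, and the sample tokens are placeholders rather than the actual corpus.

```python
from collections import Counter

# Assumption: a tiny hand-picked stop word list, for illustration only
STOP_WORDS = {
    'the', 'of', 'a', 'an', 'is', 'and', 'to', 'in', 'it', 'its',
    'that', 'this', 'with', 'for', 'on', 'by', 'be', 'as', 'are', 'was',
}

def top_content_words(tokenized_reviews, n=5):
    """Return the n most common words that are not in STOP_WORDS."""
    # Flatten the list of lists and skip stop words while doing so
    words = (word for review in tokenized_reviews
             for word in review if word not in STOP_WORDS)
    return Counter(words).most_common(n)

# Placeholder tokens standing in for preprocessed_reviews
sample = [['the', 'film', 'is', 'a', 'comedy'],
          ['a', 'film', 'about', 'the', 'school']]
print(top_content_words(sample, n=1))
# → [('film', 2)]
```

In the notebook, `top_content_words(preprocessed_reviews)` would be the corresponding call; the words it returns depend on the corpus and on how complete the stop word list is.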

### Thank You