## Definition of the analyzed text

In [169]:
import textwrap
import re

last_name = "Крылов"

if not last_name:
    raise ValueError('Last name is required!')

alp = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
w = [1, 4, 21, 25, 34,  6, 44, 26, 13, 44, 38, 26, 4, 43,  4, 49, 46,
        17, 42, 29,  4,  9, 36, 34, 31, 22,  15, 30,  4, 19, 28, 28, 33]

d = dict(zip(alp, w))
variant = sum(d[el] for el in last_name.lower()) % 4 + 1

print("Variant: ", variant)

# Construct the file name based on the variant number
file_name = f"data/{variant}.txt"

# Read the contents of the file
with open(file_name, "r", encoding="utf-8") as file:
    TEXT = file.read()

text_for_task_3 = TEXT

wrapped = textwrap.fill(TEXT, width=80)
print(f"Analyzed text: \n{wrapped}")

Variant:  2
Analyzed text: 
Selling **_Concord_**, a task that might have been easy five years ago, feels
like rolling a boulder up a hill in 2024. Developed by new studio Firewalk and
published by PlayStation Studios, _Concord_ is a 5v5 hero shooter that riffs on
_Guardians of the Galaxy_ with a cassette-era take on a pulpy space setting.
It's a premise that could certainly work, but the prime cultural moment for
something fitting that bill has already passed, and an aggressive slate of
upcoming hero shooters like _Marvel Rivals_ and Valve's _Deadlock_ promises a
lot of competition.  **When it comes to the hero shooter staples, _Concord_
covers all the bases**. 16 characters divided into six different classes compete
across several primary game modes — the team-based deathmatch Brawl, a control
points scenario called Overrun, and a competitive tactical option labeled
Rivalry. Each one will randomly play out as one of two versions that make minor
tweaks to the formula, but there's noth

# Task 1

In [170]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s.!?0-9\']', '', text)
    text = re.sub(r'([.!?])\1+', r'\1', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

preprocessed_text = preprocess_text(TEXT)

print(f"Preprocessed text: \n{preprocessed_text}")

Preprocessed text: 
selling concord a task that might have been easy five years ago feels like rolling a boulder up a hill in 2024. developed by new studio firewalk and published by playstation studios concord is a 5v5 hero shooter that riffs on guardians of the galaxy with a cassetteera take on a pulpy space setting. it's a premise that could certainly work but the prime cultural moment for something fitting that bill has already passed and an aggressive slate of upcoming hero shooters like marvel rivals and valve's deadlock promises a lot of competition. when it comes to the hero shooter staples concord covers all the bases. 16 characters divided into six different classes compete across several primary game modes the teambased deathmatch brawl a control points scenario called overrun and a competitive tactical option labeled rivalry. each one will randomly play out as one of two versions that make minor tweaks to the formula but there's nothing as radically different as a payload mo
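
In the output above, hyphenated words such as "cassette-era" and "team-based" are fused into "cassetteera" and "teambased", because the hyphen is deleted rather than replaced. Below is a minimal sketch of one possible variant, assuming it is acceptable to split such compounds into separate words. The name `preprocess_keep_word_breaks` is hypothetical and not part of the original solution.

```python
import re


def preprocess_keep_word_breaks(text: str) -> str:
    """Variant of preprocess_text that turns hyphens and dashes into spaces."""
    text = text.lower()
    # Replace hyphens and dashes with a space BEFORE stripping other symbols,
    # so "team-based" becomes "team based" instead of "teambased".
    text = re.sub(r'[-\u2013\u2014]', ' ', text)
    # Same cleaning steps as in the original function.
    text = re.sub(r"[^a-z\s.!?0-9']", '', text)
    text = re.sub(r'([.!?])\1+', r'\1', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


sample = "A **team-based** mode -- with a cassette-era look!!!"
print(preprocess_keep_word_breaks(sample))
# a team based mode with a cassette era look!
```

Whether to split or fuse compounds is a design choice: splitting keeps "team" and "based" countable as ordinary words, at the cost of losing the compound itself.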

# Task 2

In [171]:
stop_words = ["and", "the", "is", "in", "it", "you", "that", "to", "of", "a", "with", "for", "on", "this", "at", "by", "an"]

In [172]:
def word_frequency(text, stop_words=None):
    # Fall back to the default stop-word list when none is supplied
    if stop_words is None:
        stop_words = ["and", "the", "is", "in", "it", "you", "that", "to", "of", "a", "with", "for", "on", "this", "at",
                      "by", "an", "i"]
    # Keep only letters, whitespace and apostrophes, then split into words
    arr = re.sub(r'[^a-z\s\']', '', text).split()
    d = {}
    for word in arr:
        if word not in stop_words:
            d[word] = d.get(word, 0) + 1
    return d

word_frequencies = word_frequency(preprocessed_text)

print(f"Word frequencies: \n{word_frequencies}")

Word frequencies: 
{'selling': 1, 'concord': 18, 'task': 1, 'might': 5, 'have': 2, 'been': 1, 'easy': 1, 'five': 1, 'years': 1, 'ago': 1, 'feels': 5, 'like': 12, 'rolling': 1, 'boulder': 1, 'up': 7, 'hill': 1, 'developed': 1, 'new': 1, 'studio': 1, 'firewalk': 2, 'published': 1, 'playstation': 1, 'studios': 1, 'v': 1, 'hero': 9, 'shooter': 7, 'riffs': 1, 'guardians': 1, 'galaxy': 1, 'cassetteera': 1, 'take': 1, 'pulpy': 1, 'space': 1, 'setting': 2, "it's": 10, 'premise': 2, 'could': 4, 'certainly': 3, 'work': 1, 'but': 10, 'prime': 1, 'cultural': 1, 'moment': 3, 'something': 1, 'fitting': 1, 'bill': 1, 'has': 3, 'already': 2, 'passed': 1, 'aggressive': 1, 'slate': 1, 'upcoming': 1, 'shooters': 2, 'marvel': 1, 'rivals': 1, "valve's": 1, 'deadlock': 1, 'promises': 2, 'lot': 3, 'competition': 1, 'when': 2, 'comes': 1, 'staples': 2, 'covers': 1, 'all': 3, 'bases': 1, 'characters': 5, 'divided': 1, 'into': 3, 'six': 1, 'different': 3, 'classes': 2, 'compete': 1, 'across': 3, 'several': 1, '
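
The same counting can be expressed with `collections.Counter` from the standard library. This is only an alternative sketch, not the solution used above. The name `word_frequency_counter` is hypothetical, and the tokenizing regex is copied unchanged, so digits are still dropped (which is why "5v5" shows up as `'v'` in the output).

```python
import re
from collections import Counter

# Same default stop words as in word_frequency, stored as a set for fast lookup.
STOP_WORDS = {"and", "the", "is", "in", "it", "you", "that", "to", "of", "a",
              "with", "for", "on", "this", "at", "by", "an", "i"}


def word_frequency_counter(text, stop_words=STOP_WORDS):
    """Count words in already lowercased text, skipping stop words."""
    words = re.sub(r"[^a-z\s']", '', text).split()
    return Counter(word for word in words if word not in stop_words)


freq = word_frequency_counter("the hero shooter is a hero shooter with a hero")
print(freq.most_common(2))
# [('hero', 3), ('shooter', 2)]
```

A `Counter` is a `dict` subclass, so it can be passed to code that expects the plain dictionary, and `most_common` gives the sorted view without a manual sort.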

# Task 3

In [173]:
def extract_information(text, *, email_pattern=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
                        phone_pattern=r'[\+(]\d[\d \-()]{8,12}\d', date_pattern=r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b',
                        time_pattern=r'\b\d{1,2}:\d{2}\s*[APap][mM]?\b', price_pattern=r'\$\d+(?:\.\d{2})?', **custom_patterns):
    d = {}
    d['emails'] = re.findall(email_pattern, text)
    d['phone_numbers'] = re.findall(phone_pattern, text)
    d['dates'] = re.findall(date_pattern, text)
    d['times'] = re.findall(time_pattern, text)
    d['prices'] = re.findall(price_pattern, text)
    for key, value in custom_patterns.items():
        d[key] = re.findall(value, text)
    return d

extracted_information = extract_information(text_for_task_3)

print(f"Extracted information: \n{extracted_information}")

Extracted information: 
{'emails': [], 'phone_numbers': [], 'dates': [], 'times': [], 'prices': []}
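
The review text contains none of the searched patterns, so every list above is empty. To illustrate what the patterns match, here is a small self-contained sketch on an invented sample string. The helper `find_patterns`, the sample text and the `hashtags` pattern are assumptions made for this example only. The email, date and price regexes are copied from `extract_information`.

```python
import re

# Patterns copied from extract_information (phone and time omitted for brevity).
DEFAULT_PATTERNS = {
    "emails": r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
    "dates": r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b',
    "prices": r'\$\d+(?:\.\d{2})?',
}


def find_patterns(text, **custom_patterns):
    """Return every match of each default and custom pattern, keyed by name."""
    patterns = {**DEFAULT_PATTERNS, **custom_patterns}
    return {name: re.findall(pattern, text) for name, pattern in patterns.items()}


# Invented sample text, not taken from the analyzed review.
sample = "Mail press@example.com before 23/08/2024; the game costs $39.99. #Concord #PS5"
found = find_patterns(sample, hashtags=r'#\w+')
print(found)
# {'emails': ['press@example.com'], 'dates': ['23/08/2024'],
#  'prices': ['$39.99'], 'hashtags': ['#Concord', '#PS5']}
```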


# Task 4

In [174]:
positive_words = ["good", "great", "happy", "joy", "excellent", "fantastic", "love", "best"]

In [175]:
negative_words = ["bad", "sad", "hate", "terrible", "awful", "poor", "worst"]

In [176]:
def analyze_sentiment(text, positive_words=None, negative_words=None):
    if negative_words is None:
        negative_words = ["bad", "sad", "hate", "terrible", "awful", "poor", "worst"]
    if positive_words is None:
        positive_words = ["good", "great", "happy", "joy", "excellent", "fantastic", "love", "best", "amazing", "fun"]
    arr = re.sub(r'[^a-z\s\']', '', text).split()
    number_of_positive_words = 0
    for word in positive_words:
        number_of_positive_words += arr.count(word)
    number_of_negative_words = 0
    for word in negative_words:
        number_of_negative_words += arr.count(word)
    return number_of_positive_words - number_of_negative_words

sentiment_score = analyze_sentiment(preprocessed_text)

print(f"Sentiment score: \n{sentiment_score}")

Sentiment score: 
9
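
The score is simply the number of positive-word occurrences minus the number of negative-word occurrences, matched as whole tokens. To see which words contribute to a score, a breakdown like the following can help. This is a hedged sketch: `sentiment_breakdown` and the sample sentence are hypothetical, while the word lists are the defaults of `analyze_sentiment`.

```python
import re
from collections import Counter

POSITIVE = ["good", "great", "happy", "joy", "excellent", "fantastic",
            "love", "best", "amazing", "fun"]
NEGATIVE = ["bad", "sad", "hate", "terrible", "awful", "poor", "worst"]


def sentiment_breakdown(text):
    """Return (score, positive hits, negative hits) for a piece of text."""
    tokens = Counter(re.sub(r"[^a-z\s']", '', text.lower()).split())
    # Counter returns 0 for missing words, so absent words are filtered out.
    positive_hits = {word: tokens[word] for word in POSITIVE if tokens[word]}
    negative_hits = {word: tokens[word] for word in NEGATIVE if tokens[word]}
    score = sum(positive_hits.values()) - sum(negative_hits.values())
    return score, positive_hits, negative_hits


score, pos, neg = sentiment_breakdown(
    "Fun modes, good gunplay, but a bad and dull design. Still fun."
)
print(score, pos, neg)
# 2 {'good': 1, 'fun': 2} {'bad': 1}
```

Note that "dull" does not lower the score because it is not in the negative list. The result depends entirely on the chosen vocabularies.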


# Task 5

In [177]:
def summarize_text(text: str, compression_ratio: float, min_threshold=2):
    # Split into sentences after a word character followed by '.', '!' or '?'
    sentences = re.split(r'(?<=\w[.!?])\s+', text)
    needed_sentences = max(int(len(sentences) * compression_ratio), min_threshold)

    # Weights of the sentence score: characters, words, absolute sentiment
    len_sentence_rang = 3
    count_words_rang = 10
    sentiment_rang = 100

    def sentence_level(sentence):
        return (len(sentence) * len_sentence_rang
                + len(sentence.split(' ')) * count_words_rang
                + abs(analyze_sentiment(sentence)) * sentiment_rang)

    # Drop the lowest-scoring sentence until only the needed number remains
    # (a loop instead of recursion, so long texts cannot hit the recursion limit)
    while len(sentences) > needed_sentences:
        sentences.remove(min(sentences, key=sentence_level))
    return ' '.join(sentences)

summarized_text = summarize_text(preprocessed_text, 0.05)

print(f"Summarized text: \n{summarized_text}")

Summarized text: 
pros has all the hero shooter staples you'd expect focus on skillsgunplay having some nuance creates fun learning curve gameplay modes are fun cons uninspiring design restrictions on gameplay modes dull enjoyment weird graphics decisions on pc familiar elements with minor twists that aren't earthshattering don't expect anything gamechanging from concord it's difficult to discuss concord without talking about overwatch so it's best to get it out of the way early. final thoughts review score 35 solid or good by screen rant's review metric the upfront cost of concord might make it a hard sell in 2024 but it's certainly nice to play a liveservice shooter that doesn't feel designed to ask for money at every turn.
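
The summarizer ranks each sentence by `3 * characters + 10 * words + 100 * |sentiment|` and repeatedly removes the lowest-ranked sentence. The following self-contained sketch reproduces that rule on a tiny invented text so the arithmetic can be checked by hand. The names `sentence_score` and `summarize_iterative` and the sample text are assumptions for illustration. The weights and word lists come from the code above.

```python
import re

POSITIVE = ["good", "great", "happy", "joy", "excellent", "fantastic",
            "love", "best", "amazing", "fun"]
NEGATIVE = ["bad", "sad", "hate", "terrible", "awful", "poor", "worst"]


def sentiment(text):
    """Positive minus negative word occurrences (whole-token matches)."""
    tokens = re.sub(r"[^a-z\s']", '', text).split()
    return (sum(tokens.count(w) for w in POSITIVE)
            - sum(tokens.count(w) for w in NEGATIVE))


def sentence_score(sentence):
    """3 points per character, 10 per word, 100 per unit of |sentiment|."""
    return (3 * len(sentence)
            + 10 * len(sentence.split(' '))
            + 100 * abs(sentiment(sentence)))


def summarize_iterative(text, compression_ratio, min_threshold=2):
    sentences = re.split(r'(?<=\w[.!?])\s+', text)
    needed = max(int(len(sentences) * compression_ratio), min_threshold)
    # Remove the lowest-scoring sentence until the target count is reached.
    while len(sentences) > needed:
        sentences.remove(min(sentences, key=sentence_score))
    return ' '.join(sentences)


text = ("it works. the modes are fun and the gunplay feels good. fine. "
        "the design is dull but the price is fair.")
print(sentence_score("fine."))          # 3*5 + 10*1 + 100*0 = 25
print(summarize_iterative(text, 0.5))
# the modes are fun and the gunplay feels good. the design is dull but the price is fair.
```

Because length dominates the score unless a sentence carries sentiment words, short neutral sentences such as "fine." are removed first.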


# Task 6

In [178]:
def visualize_word_frequency(dictionary, max_threshold=100):
    sorted_d = {key: value for key, value in sorted(dictionary.items(), key=lambda item: item[1], reverse=True)}
    temp = 0
    if (max_threshold):
        for key, value in sorted_d.items():
            if (temp < max_threshold):
                temp += 1
                print(key + ': ' + value * '*')
    else:
        for key, value in sorted_d.items():
            print(key + ': ' + value * '*')

# The function prints the chart itself and returns None, so it is called directly
visualize_word_frequency(word_frequencies)

concord: ******************
like: ************
it's: **********
but: **********
one: **********
hero: *********
more: *********
game: ********
modes: ********
be: ********
up: *******
shooter: *******
abilities: *******
as: ******
can: ******
which: ******
might: *****
feels: *****
characters: *****
rivalry: *****
or: *****
focus: *****
gameplay: *****
are: *****
graphics: *****
than: *****
who: *****
casual: *****
maps: *****
settings: *****
could: ****
some: ****
fun: ****
pc: ****
from: ****
without: ****
although: ****
feel: ****
certainly: ***
moment: ***
has: ***
lot: ***
all: ***
into: ***
different: ***
across: ***
control: ***
tactical: ***
will: ***
out: ***
two: ***
there's: ***
design: ***
don't: ***
overwatch: ***
get: ***
even: ***
its: ***
heroswapping: ***
interesting: ***
end: ***
being: ***
players: ***
character: ***
things: ***
those: ***
healing: ***
enemy: ***
them: ***
however: ***
good: ***
team: ***
lack: ***
while: ***
other: ***
every: ***
better: ***
element
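
Since `visualize_word_frequency` prints directly and returns `None`, its result cannot be stored or embedded in an f-string. One possible alternative is a function that builds the chart as a string and leaves printing to the caller. This is a sketch only, and the name `format_word_frequency` and the parameter `max_items` are hypothetical.

```python
def format_word_frequency(frequencies, max_items=None):
    """Return a text bar chart, most frequent words first."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    if max_items is not None:
        ranked = ranked[:max_items]
    return '\n'.join(f"{word}: {'*' * count}" for word, count in ranked)


chart = format_word_frequency({'hero': 9, 'concord': 18, 'like': 12}, max_items=2)
print(f"Visualized word frequencies:\n{chart}")
# Visualized word frequencies:
# concord: ******************
# like: ************
```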

# Task 7

In [179]:
def apply_analysis(text, operations):
    text = preprocess_text(text)
    result = {}
    for i in operations:
        result[i.__name__] = i(text)
    return result

analysis = apply_analysis(preprocessed_text, [word_frequency, analyze_sentiment])

print(f"Applied analysis: \n{analysis}")

Applied analysis: 
{'word_frequency': {'selling': 1, 'concord': 18, 'task': 1, 'might': 5, 'have': 2, 'been': 1, 'easy': 1, 'five': 1, 'years': 1, 'ago': 1, 'feels': 5, 'like': 12, 'rolling': 1, 'boulder': 1, 'up': 7, 'hill': 1, 'developed': 1, 'new': 1, 'studio': 1, 'firewalk': 2, 'published': 1, 'playstation': 1, 'studios': 1, 'v': 1, 'hero': 9, 'shooter': 7, 'riffs': 1, 'guardians': 1, 'galaxy': 1, 'cassetteera': 1, 'take': 1, 'pulpy': 1, 'space': 1, 'setting': 2, "it's": 10, 'premise': 2, 'could': 4, 'certainly': 3, 'work': 1, 'but': 10, 'prime': 1, 'cultural': 1, 'moment': 3, 'something': 1, 'fitting': 1, 'bill': 1, 'has': 3, 'already': 2, 'passed': 1, 'aggressive': 1, 'slate': 1, 'upcoming': 1, 'shooters': 2, 'marvel': 1, 'rivals': 1, "valve's": 1, 'deadlock': 1, 'promises': 2, 'lot': 3, 'competition': 1, 'when': 2, 'comes': 1, 'staples': 2, 'covers': 1, 'all': 3, 'bases': 1, 'characters': 5, 'divided': 1, 'into': 3, 'six': 1, 'different': 3, 'classes': 2, 'compete': 1, 'across':
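
`apply_analysis` calls each operation with the text as its only argument and uses `__name__` as the result key. If an operation needs extra arguments, `functools.partial` can bind them, but partial objects have no `__name__`, so the key lookup needs a fallback. Below is a minimal sketch under these assumptions. The functions `operation_name`, `apply_operations`, `count_words` and `count_matches` are hypothetical stand-ins, not the functions defined above.

```python
from functools import partial


def operation_name(operation):
    # functools.partial objects have no __name__; use the wrapped function's name.
    return getattr(operation, '__name__', None) or operation.func.__name__


def apply_operations(text, operations):
    """Run every operation on the text and collect results by operation name."""
    return {operation_name(op): op(text) for op in operations}


def count_words(text):
    return len(text.split())


def count_matches(text, word):
    return text.split().count(word)


result = apply_operations(
    "fun modes and fun maps",
    [count_words, partial(count_matches, word="fun")],
)
print(result)
# {'count_words': 5, 'count_matches': 2}
```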

# Task 8

In [180]:
class TextAnalyzer:
    def __init__(self, text: str):
        self.original_text = text
        self.cleaned_text = self.preprocess_text(text)

    def preprocess_text(self, text: str):
        text = text.lower()
        text = re.sub(r'[^a-z\s.!?0-9\']', '', text)
        text = re.sub(r'([.!?])\1+', r'\1', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def word_frequency(self) -> dict:
        if (self.cleaned_text):
            stop_words = ["and", "the", "is", "in", "it", "you", "that", "to", "of", "a", "with", "for", "on", "this", "at", "by", "an", "i"]
            arr = re.sub(r'[^a-z\s\']', '', self.cleaned_text).split()
            d = {}
            for word in arr:
                if (word not in dict.keys(d) and word not in stop_words):
                    d[word] = 1
                elif (word not in stop_words):
                    d[word] += 1
            return d
        else:
            return {}

    def extract_information(self, *, email_pattern=r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
                        phone_pattern=r'[\+(]\d[\d \-()]{8,12}\d', date_pattern=r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b',
                        time_pattern=r'\b\d{1,2}:\d{2}\s*[APap][mM]?\b', price_pattern=r'\$\d+(?:\.\d{2})?', **patterns) -> dict:
        d = {}
        d['emails'] = re.findall(email_pattern, self.original_text)
        d['phone_numbers'] = re.findall(phone_pattern, self.original_text)
        d['dates'] = re.findall(date_pattern, self.original_text)
        d['times'] = re.findall(time_pattern, self.original_text)
        d['prices'] = re.findall(price_pattern, self.original_text)
        # Store each custom pattern's matches under its own keyword name
        for key, pattern in patterns.items():
            d[key] = re.findall(pattern, self.original_text)
        return d

    def analyze_sentiment(self, positive_words=None, negative_words=None) -> int:
        # None defaults avoid sharing mutable default lists between calls
        if positive_words is None:
            positive_words = ["good", "great", "happy", "joy", "excellent", "fantastic", "love", "best", "amazing", "fun"]
        if negative_words is None:
            negative_words = ["bad", "sad", "hate", "terrible", "awful", "poor", "worst"]
        # The regex already removes punctuation, so no trailing periods remain
        arr = re.sub(r'[^a-z\s\']', '', self.cleaned_text).split()
        number_of_positive_words = 0
        for word in positive_words:
            number_of_positive_words += arr.count(word)
        number_of_negative_words = 0
        for word in negative_words:
            number_of_negative_words += arr.count(word)
        return number_of_positive_words - number_of_negative_words

    def summarize_text(self, compression_ratio: float, min_threshold=2) -> str:
        # Split into sentences after a word character followed by '.', '!' or '?'
        sentences = re.split(r'(?<=\w[.!?])\s+', self.cleaned_text)
        needed_sentences = max(int(len(sentences) * compression_ratio), min_threshold)

        # Weights of the sentence score: characters, words, absolute sentiment
        len_sentence_rang = 3
        count_words_rang = 10
        sentiment_rang = 100

        def sentence_level(sentence):
            # Score sentiment through the class itself instead of the
            # module-level analyze_sentiment function
            return (len(sentence) * len_sentence_rang
                    + len(sentence.split(' ')) * count_words_rang
                    + abs(TextAnalyzer(sentence).analyze_sentiment()) * sentiment_rang)

        # Drop the lowest-scoring sentence until only the needed number remains
        while len(sentences) > needed_sentences:
            sentences.remove(min(sentences, key=sentence_level))
        return ' '.join(sentences)

    def visualize_word_frequency(self, max_threshold=None) -> None:
        sorted_d = {key: value for key, value in sorted(self.word_frequency().items(), key=lambda item: item[1], reverse=True)}
        temp = 0
        if (max_threshold):
            for key, value in sorted_d.items():
                if (temp < max_threshold):
                    temp += 1
                    print(key + ': ' + value * '*')
        else:
            for key, value in sorted_d.items():
                print(key + ': ' + value * '*')

    def apply_analysis(self, analysis_functions: list) -> dict:
        result = {}
        for i in analysis_functions:
            result[i.__name__] = i(self.cleaned_text)
        return result

# Task 9

In [181]:
Sic_Mundus_Creatus_Est = TextAnalyzer(wrapped)

print("Preprocessed text:")
print(textwrap.fill(Sic_Mundus_Creatus_Est.cleaned_text[:300], width=120) + '...')
print()
word_frequency_dic = Sic_Mundus_Creatus_Est.word_frequency()
# Sort by descending frequency and keep the 13 most frequent words
sorted_words = sorted(word_frequency_dic.items(), key=lambda item: item[1], reverse=True)
print("Most frequent words:")
for word, count in sorted_words[:13]:
    print(word + ": " + str(count) + " times.")
print()
print("Sentiment score:")
print(Sic_Mundus_Creatus_Est.analyze_sentiment())
print()
print("Summarized text:")
print(textwrap.fill(Sic_Mundus_Creatus_Est.summarize_text(0.05), width=150))

Preprocessed text:
selling concord a task that might have been easy five years ago feels like rolling a boulder up a hill in 2024.
developed by new studio firewalk and published by playstation studios concord is a 5v5 hero shooter that riffs on
guardians of the galaxy with a cassetteera take on a pulpy space setting....

Most frequent words:
concord: 18 times.
like: 12 times.
it's: 10 times.
but: 10 times.
one: 10 times.
hero: 9 times.
more: 9 times.
game: 8 times.
modes: 8 times.
be: 8 times.
up: 7 times.
shooter: 7 times.
abilities: 7 times.

Sentiment score:
10

Summarized text:
pros has all the hero shooter staples you'd expect focus on skillsgunplay having some nuance creates fun learning curve gameplay modes are fun cons
uninspiring design restrictions on gameplay modes dull enjoyment weird graphics decisions on pc familiar elements with minor twists that aren't
earthshattering don't expect anything gamechanging from concord it's difficult to discuss concord without talking about

The most frequent words and the condensed text indicate that the original text is a review of a hero shooter called 'Concord'. The sentiment score of 10 suggests that the review is more positive than negative. Based on the summary, the shooter has all the core elements of games in this genre but falls slightly short in graphics. The author also notes that the game does not require much in-game spending for an enjoyable experience.

## Conclusion:

In the course of this work, I developed a text analysis tool. This tool can process various types of text entries, generating insights and statistics. It includes the following functions:
  - Text cleaning
  - Word frequency counting
  - Pattern-based information extraction
  - Sentiment analysis
  - Text compression based on specific metrics
  - Word frequency visualization
  - Applying multiple analysis functions in a single call
  - A dedicated class containing all the analysis functions