In [1]:
import os
# Move to Thesis directory (two levels up)
os.chdir(os.path.abspath(os.path.join("..", "..")))

# Move to model/src if it exists
model_dir = os.path.join(os.getcwd(), "model", "src")
if os.path.exists(model_dir):
    os.chdir(model_dir)

print("Current Directory:", os.getcwd())

Current Directory: c:\Users\1176153\Downloads\github\Thesis\model\src


In [2]:
import pandas as pd
import re
import libs.data_understanding as du
import libs.data_preparation as dp

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\1176153\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
dict_maininfo_raw = pd.read_pickle(r"../../data/Webscrapping/bachelor_dict_textfiles_raw/dict_maininfo_raw.pkl")


# Cleaning main course info files

In [4]:
dict_maininfo_raw.keys()

dict_keys(['data-science_main_course_extracted_text.txt', 'information-management_main_course_extracted_text.txt', 'information-systems_main_course_extracted_text.txt'])

## Deleting words

In [5]:
# Print the full text of the first document
for filename, content in dict_maininfo_raw.items():
    print(f"\n--- {filename} ---")
    print(content)
    break


--- data-science_main_course_extracted_text.txt ---
Text from https://www.novaims.unl.pt/en/education/programs/bachelor-s-degrees/data-science/:
Data Science
Degree in
Data Science
en
Education
Programs
Bachelor's Degrees
Data Science
In the Bachelor´s Degree in Data Science, students learn the most modern techniques of artificial intelligence and machine learning to analyze large volumes of data (Big Data).
They will become true data scientists - considered the sexiest profession of the 21
st
century by the Harvard Business Review.
The main objective of this course is to train future professionals capable of understanding, developing and using models, algorithms and the most advanced techniques in data science, to analyze and extract knowledge from Big Data.
The 3
rd
phase of applications under the International Student Statute for the 2025/26 academic year are open from February 26
th
to March 27
th
, 2025.
Duration
3 years (6 semesters)
Timetable
Daytime
Start
September 2025
Career

In [6]:
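cleaned_maininfo = dp.clean_text_documents(
    dict_maininfo_raw,
    # Navigation and widget residue left over from the scraped pages.
    # "resumo do conteudo da tabela" is Portuguese for "summary of the table contents".
    words_to_remove=[
        "Programs", "resumo do conteudo da tabela", "Loading...", "Education", "caption text",
        "card item", "modal item", "Apply here", "Know more",
    ],
    # Phrases that recur in the scraped text
    words_to_deduplicate=["Data Science", "Information Management", "Information Systems", "Who is it for?"],
)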
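
The internals of `dp.clean_text_documents` are not shown in this notebook. As a rough illustration of what a cleaning step with these two parameters could do, here is a minimal sketch. The function `clean_text_sketch` is hypothetical, and both the case-sensitive matching and the "collapse consecutive repeats" reading of deduplication are assumptions, not the library's confirmed behaviour.

```python
import re


def clean_text_sketch(text, words_to_remove, words_to_deduplicate):
    """Hypothetical stand-in; NOT the real dp.clean_text_documents."""
    # Drop each unwanted phrase (assumption: exact, case-sensitive match)
    for phrase in words_to_remove:
        text = text.replace(phrase, " ")

    # Collapse newlines and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text).strip()

    # Assumption: "deduplicate" means collapsing consecutive repeats of a phrase
    for phrase in words_to_deduplicate:
        escaped = re.escape(phrase)
        text = re.sub(rf"({escaped})(?:\s+{escaped})+", r"\1", text)

    return text


sample = "Data Science\nData Science\nEducation\nPrograms\nDegree in\nLoading...\nData Science"
cleaned = clean_text_sketch(
    sample,
    words_to_remove=["Programs", "Education", "Loading..."],
    words_to_deduplicate=["Data Science"],
)
print(cleaned)
```

Only back-to-back repeats are merged in this sketch, so a phrase that reappears later in the text is kept.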

In [7]:
# Print the full text of the first document
for filename, content in cleaned_maininfo.items():
    print(f"\n--- {filename} ---")
    print(content)
    break


--- data-science_main_course_extracted_text.txt ---
Text from https://www.novaims.unl.pt/en///bachelor-s-degrees/data-science/: Data Science Degree in Data Science en Bachelor's Degrees Data Science In the Bachelor´s Degree in Data Science, students learn the most modern techniques of artificial intelligence and machine learning to analyze large volumes of data (Big Data). They will become true data scientists - considered the sexiest profession of the 21 st century by the Harvard Business Review. The main objective of this course is to train future professionals capable of understanding, developing and using models, algorithms and the most advanced techniques in data science, to analyze and extract knowledge from Big Data. The 3 rd phase of applications under the International Student Statute for the 2025/26 academic year are open from February 26 th to March 27 th , 2025. Duration 3 years (6 semesters) Timetable Daytime Start September 2025 Career Opportunities The Bachelor´s Degree
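
In the output above the source URL reads `.../en///bachelor-s-degrees/...`: the `education` and `programs` path segments are gone, which is consistent with the word removal also matching inside the URL. If the URL should be preserved, one option is to apply the removal only to the non-URL parts of the text. The sketch below is a hypothetical helper (`remove_words_outside_urls` is not part of `libs.data_preparation`), and the case-insensitive matching is an assumption inferred from the output.

```python
import re

# Capturing group so that re.split keeps the URLs in the result list
URL_PATTERN = re.compile(r"(https?://\S+)")


def remove_words_outside_urls(text, words_to_remove):
    """Hypothetical helper: remove words everywhere except inside URLs."""
    parts = URL_PATTERN.split(text)
    # Even indices hold plain text, odd indices hold the captured URLs
    for i in range(0, len(parts), 2):
        for word in words_to_remove:
            # \b assumes the phrase starts and ends with a letter or digit
            parts[i] = re.sub(rf"\b{re.escape(word)}\b", "", parts[i], flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", "".join(parts)).strip()


sample = (
    "Text from https://www.novaims.unl.pt/en/education/programs/"
    "bachelor-s-degrees/data-science/: Data Science en Education Programs Bachelor's Degrees"
)
result = remove_words_outside_urls(sample, ["Programs", "Education"])
print(result)
```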

## Main course info stats after cleaning

In [8]:
du.histogram_word_count_multiple_docs(cleaned_maininfo)
du.histogram_token_count_multiple_docs(cleaned_maininfo)
du.generate_document_statistics_by_word_count(cleaned_maininfo)
du.generate_document_statistics_by_tokens(cleaned_maininfo)
du.bar_plot_word_frequency(cleaned_maininfo, top_n=20)
du.bar_plot_ngram_frequency(cleaned_maininfo, n=2, top_n=20) 
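
The `du` helpers come from the project's own library and their outputs are not reproduced here. For a quick sanity check without them, the sketch below computes per-document word counts and the most frequent n-grams with the standard library only. The functions `word_counts` and `top_ngrams` are hypothetical, and splitting on whitespace is a simplification that may differ from the tokenization `du` uses.

```python
from collections import Counter


def word_counts(docs):
    """Hypothetical helper: number of whitespace-separated words per document."""
    return {name: len(text.split()) for name, text in docs.items()}


def top_ngrams(docs, n=2, top_n=20):
    """Hypothetical helper: most frequent n-grams across all documents."""
    counter = Counter()
    for text in docs.values():
        tokens = text.lower().split()
        # zip over n shifted copies of the token list yields the n-grams
        counter.update(zip(*(tokens[i:] for i in range(n))))
    return counter.most_common(top_n)


example_docs = {
    "a.txt": "data science is data science",
    "b.txt": "information systems",
}
counts = word_counts(example_docs)
bigrams = top_ngrams(example_docs, n=2, top_n=1)
print(counts)
print(bigrams)
```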

# Analysing the results

In [10]:
target_filename = 'information-management_main_course_extracted_text.txt'

if target_filename in cleaned_maininfo:
    print(f"\n--- {target_filename} ---\n")
    formatted_content = cleaned_maininfo[target_filename].replace('. ', '.\n').replace('? ', '?\n').replace('! ', '!\n')
    print(formatted_content)
else:
    print(f"Document with key '{target_filename}' not found.")


--- information-management_main_course_extracted_text.txt ---

Text from https://www.novaims.unl.pt/en///bachelor-s-degrees/information-management/: Information Management Degree in Information Management en Bachelor's Degrees Information Management The Bachelor’s degree in Information Management combines management with data science .
It prepares students to be managers of the new generation, capable of understanding business and the current challenges of modern management, transforming data into information.
In today's society, business is increasingly complex and companies deal daily with a huge volume of data, generated by numerous sources.
This reality causes a high demand for professionals with skills in the area of information management, who are able to use the most modern techniques and analytical tools to support decision making.
The 3 rd phase of applications under the International Student Statute for the 2025/26 academic year are open from February 26 th to March 27 th , 

In [11]:
target_filename = 'information-systems_main_course_extracted_text.txt'

if target_filename in cleaned_maininfo:
    print(f"\n--- {target_filename} ---\n")
    formatted_content = cleaned_maininfo[target_filename].replace('. ', '.\n').replace('? ', '?\n').replace('! ', '!\n')
    print(formatted_content)
else:
    print(f"Document with key '{target_filename}' not found.")


--- information-systems_main_course_extracted_text.txt ---

Text from https://www.novaims.unl.pt/en///bachelor-s-degrees/information-systems/: Information Systems Degree in Information Systems en Bachelor's Degrees Information Systems Nowadays, Information Technologies are present in the most diverse areas of knowledge and people's everyday lives, even when they do not realize it.
We permanently use intelligent systems, connectivity services, network equipment and data integration, which correspond to the computing platforms we know as: computers, tablets, smartphones, among many other equipment that share information and perform tasks.
In the Bachelor´s Degree in Information Systems, students learn to analyze, design and implement information systems, fundamental in modern organizations, which include artificial intelligence, new programming languages, development of apps and web systems, mobile computing, among others.
They also acquire a set of tools that support the companies' bus
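
The two cells above repeat the same lookup and formatting logic. A small helper could remove the duplication. The sketch below is one possible version; `format_sentences` and `format_document` are hypothetical names, and the regular expression replaces the three chained `.replace(...)` calls with a single substitution.

```python
import re


def format_sentences(text):
    """Put each sentence on its own line by breaking after '.', '?' or '!'."""
    return re.sub(r"(?<=[.?!]) ", "\n", text)


def format_document(docs, filename):
    """Hypothetical helper: return a printable view of one document."""
    if filename not in docs:
        return f"Document with key '{filename}' not found."
    return f"\n--- {filename} ---\n\n{format_sentences(docs[filename])}"


example_docs = {"example.txt": "First sentence. Is this second? Yes! Done"}
print(format_document(example_docs, "example.txt"))
print(format_document(example_docs, "missing.txt"))
```

With this in place, each inspection cell reduces to `print(format_document(cleaned_maininfo, target_filename))`.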

# Saving the cleaned dictionary

In [9]:
cleaned_maininfo

{'data-science_main_course_extracted_text.txt': "Text from https://www.novaims.unl.pt/en///bachelor-s-degrees/data-science/: Data Science Degree in Data Science en Bachelor's Degrees Data Science In the Bachelor´s Degree in Data Science, students learn the most modern techniques of artificial intelligence and machine learning to analyze large volumes of data (Big Data). They will become true data scientists - considered the sexiest profession of the 21 st century by the Harvard Business Review. The main objective of this course is to train future professionals capable of understanding, developing and using models, algorithms and the most advanced techniques in data science, to analyze and extract knowledge from Big Data. The 3 rd phase of applications under the International Student Statute for the 2025/26 academic year are open from February 26 th to March 27 th , 2025. Duration 3 years (6 semesters) Timetable Daytime Start September 2025 Career Opportunities The Bachelor´s Degree in 

In [12]:
import pickle

# Define output folder
output_folder = "../../data/Preprocessing_text/bachelors_data"

os.makedirs(output_folder, exist_ok=True)  # Ensure output folder exists

# Save the cleaned dictionary as a single pickle file
with open(os.path.join(output_folder, "dict_maininfo_cleaned.pkl"), "wb") as f:
    pickle.dump(cleaned_maininfo, f)

print("Pickle files saved successfully!")

Pickle files saved successfully!
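
To confirm that the saved file can be read back unchanged, a round-trip check like the one below may be useful. It is a self-contained sketch: it writes to a temporary directory and uses a small stand-in dictionary (`cleaned_example`, hypothetical) rather than the real `cleaned_maininfo` and output folder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for cleaned_maininfo
cleaned_example = {"example_doc.txt": "Example cleaned text."}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dict_maininfo_cleaned.pkl")

    # Write the dictionary to disk
    with open(path, "wb") as f:
        pickle.dump(cleaned_example, f)

    # Read it back and compare with the original
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

round_trip_ok = reloaded == cleaned_example
print("Round trip OK:", round_trip_ok)
```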
