# Connecting Google Drive to Colab

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Installing Apache Spark

In [21]:

!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
!tar xf spark-3.3.2-bin-hadoop3.tgz
!pip install -q findspark
# Pin PySpark to the version of the downloaded Spark distribution. pip then
# resolves the matching py4j release, so py4j is not pinned by hand
# (pinning py4j==0.10.9 next to an unpinned pyspark caused a dependency conflict).
!pip install -q pyspark==3.3.2


In [22]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.2-bin-hadoop3"
import pandas as pd
import findspark
findspark.init()
findspark.find()

'/content/spark-3.3.2-bin-hadoop3'
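Both environment variables in the cell above are hard-coded paths, so a quick check that they point at real directories can catch a failed download or a differently named JDK folder before Spark starts. The following is a minimal sketch; `missing_dirs` is a hypothetical helper and not part of the notebook.

```python
import os


def missing_dirs(var_names, environ=os.environ):
    """Return the names of variables that are unset or do not point at a directory."""
    return [
        name
        for name in var_names
        # An unset variable falls back to "", which os.path.isdir treats as missing.
        if not os.path.isdir(environ.get(name, ""))
    ]


# Hypothetical usage: an empty list means both paths exist on this machine.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If the printed list is not empty, the named variables are worth fixing before calling `findspark.init()`.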

# Creating an Apache Spark session

In [23]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer
from pyspark.sql.functions import regexp_replace, split, col
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Abstract Cleaner") \
    .getOrCreate()

# Cleaning the abstracts


In [24]:

data_path = '/content/drive/MyDrive/IAOS/'
abstracts_file_path = data_path+"abstracts.txt"
titles_file_path = data_path+"orden.txt"

# Read the abstracts file; abstracts are separated by '---'
with open(abstracts_file_path, 'r') as abstracts_file:
    abstracts = abstracts_file.read().split('---')

# Read the titles file
with open(titles_file_path, 'r') as titles_file:
    titles = titles_file.readlines()

# Remove leading/trailing whitespaces from titles
titles = [title.strip() for title in titles]

# Match titles with abstracts (even if the abstract is blank)
title_abstract=[]

for i, title in enumerate(titles):
    if i < len(abstracts):
        abstract = abstracts[i].strip()
    else:
        abstract = ""  # If there is no abstract available for a title
    # Skip entries whose abstract is the literal placeholder "null"
    if abstract != "null":
        title_abstract.append((title, abstract))

# Print the title-abstract pairs
for title, abstract in title_abstract:
    print("Title:", title)
    print("Abstract:", abstract)
    print()



Title: A Comparison of Several AI Techniques for Authorship Attribution on Romanian Texts
Abstract: Determining the author of a text is a difficult task. Here, we compare multiple Artificial Intelligence techniques for classifying literary texts written by multiple authors by taking into account a limited number of speech parts (prepositions, adverbs, and conjunctions). We also introduce a new dataset composed of texts written in the Romanian language on which we have run the algorithms. The compared methods are artificial neural networks, multi-expression programming, k-nearest neighbour, support vector machines, and decision trees with C5.0. Numerical experiments show, first of all, that the problem is difficult, but some algorithms are able to generate acceptable error rates on the test set.

Title: Deep Learning-Based Channel Estimation for Doubly Selective Fading Channels
Abstract: In this paper, online deep learning (DL)-based channel estimation algorithm for doubly selective fad
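The pairing rule used in the cell above (match by position, use an empty abstract when the abstracts run out, drop entries marked `null`) can be tried without the Drive files. The sketch below wraps the same logic in a hypothetical function, `pair_titles_with_abstracts`, and runs it on made-up in-memory data; the titles and abstracts shown are illustrative only.

```python
def pair_titles_with_abstracts(titles, abstracts, skip_marker="null"):
    """Pair each title with the abstract at the same position.

    Titles without a corresponding abstract get an empty string; entries
    whose abstract equals skip_marker are dropped.
    """
    pairs = []
    for i, title in enumerate(titles):
        abstract = abstracts[i].strip() if i < len(abstracts) else ""
        if abstract != skip_marker:
            pairs.append((title.strip(), abstract))
    return pairs


# Made-up sample data in the same '---'-separated layout as abstracts.txt
sample_titles = ["Paper A", "Paper B", "Paper C", "Paper D"]
sample_abstracts = "First abstract.\n---\nnull\n---\nThird abstract.\n".split("---")

for title, abstract in pair_titles_with_abstracts(sample_titles, sample_abstracts):
    print(repr(title), "->", repr(abstract))
```

Here "Paper B" is dropped because its abstract is `null`, and "Paper D" is kept with an empty abstract because the abstracts list is shorter than the titles list.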

In [25]:

with open(data_path+"cleaned.txt", 'w') as file:
  with open(data_path+"cleanedTitles.txt", 'w') as file2:
    for title, abstract in title_abstract:
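      # Lowercase the abstract and wrap it in a single-row DataFrame with one column, 'abstract'
      abstract = abstract.lower()
      df = spark.createDataFrame([(abstract,)], ["abstract"])
      # Strip punctuation: drop every character that is neither a word character nor whitespace
      df = df.withColumn("abstract", regexp_replace("abstract", r"[^\w\s]", ""))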
      # Tokenize the abstract
      tokenizer = Tokenizer(inputCol="abstract", outputCol="words")
      words_df = tokenizer.transform(df)

      # Remove stop words from the abstract
      remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
      filtered_df = remover.transform(words_df)

      # Retrieve the cleaned words
      cleaned_words = filtered_df.select("filtered_words").collect()[0][0]

      # Convert the list of words to a string
      cleaned_abstract = ' '.join(cleaned_words)
      file.write(title + "\n---\n")
      file.write(cleaned_abstract + "\n---\n")
      print(cleaned_abstract)
      

determining author text difficult task compare multiple artificial intelligence techniques classifying literary texts written multiple authors taking account limited number speech parts prepositions adverbs conjunctions also introduce new dataset composed texts written romanian language run algorithms compared methods artificial neural networks multiexpression programming knearest neighbour support vector machines decision trees c50 numerical experiments show first problem difficult algorithms able generate acceptable error rates test set
paper online deep learning dlbased channel estimation algorithm doubly selective fading channels proposed employing deep neural network dnn properly selected inputs dnn exploit features channel variation previous channel estimates also extract additional features pilots received signals moreover dnn take advantages least squares estimation improve performance channel estimation dnn first trained simulated data offline manner track dynamic channel onli
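The cleaning steps in the cell above (lowercase, strip punctuation, split on whitespace, drop stop words) can be mirrored in plain Python to sanity-check individual strings without starting Spark. This is only a rough sketch: `clean_text` is a hypothetical helper, and `SAMPLE_STOP_WORDS` is a tiny illustrative subset, whereas Spark's `StopWordsRemover` uses a much longer default English list, so results on real abstracts will differ.

```python
import re

# Illustrative subset only; NOT the default list used by StopWordsRemover.
SAMPLE_STOP_WORDS = {"a", "an", "the", "of", "is", "in", "we", "by", "to", "and"}


def clean_text(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, remove non-word/non-space characters, and drop stop words."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_text("Determining the author of a text is a difficult task."))
print(clean_text("k-nearest neighbour, C5.0"))
```

The second call shows why the output above contains tokens such as `knearest` and `c50`: removing punctuation also removes hyphens and decimal points, which fuses the surrounding characters into one token.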