<h1 style="color:pink; text-align:center">
    <strong>Authorship Identification: Racine & Corneille</strong>
</h1>

### Create a SparkSession object

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.extraListeners", "") \
    .getOrCreate()

### Import the required libraries

In [2]:
#data analysis and manipulation
import pandas as pd

#interacting with the operating system
import os

#extracting text and metadata from PDF files
import pdfplumber

#regular expression operations
import re

#natural language processing
import nltk
from nltk.tokenize import word_tokenize

#download the necessary NLTK resources
nltk.download('punkt')

#machine learning
from pyspark.ml.feature import StringIndexer, Tokenizer, CountVectorizer, IDF
from pyspark.ml.classification import LogisticRegression,RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator




[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Text Chunking Function

The `split_text` function takes a text and a chunk size (in characters) as input and splits the text into chunks no longer than that size, breaking only between words. It returns a list of text chunks.

In [3]:
def split_text(text, chunk_size):
    """Splits the text into chunks of at most chunk_size characters.

    Words are never split, so a single word longer than chunk_size
    becomes a chunk of its own.

    Args:
        text: The text to be split.
        chunk_size: The maximum size of each text chunk, in characters.

    Returns:
        A list of text chunks.
    """
    chunks = []
    current_chunk = ''
    words = text.split()
    for word in words:
        if len(current_chunk) + len(word) + 1 <= chunk_size:
            current_chunk += word + ' '
        else:
            # Only flush a non-empty chunk; otherwise a first word longer
            # than chunk_size would produce an empty chunk
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = word + ' '
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
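
For comparison, Python's standard `textwrap.wrap` performs a similar greedy word packing. The sketch below is not the notebook's method; it only illustrates the same idea on a short hypothetical string, with `break_long_words` and `break_on_hyphens` disabled so that words stay whole. One difference to keep in mind: `split_text` reserves one character for a trailing space, so its chunks are at most `chunk_size - 1` characters long, whereas `textwrap.wrap` allows lines of exactly `width` characters.

```python
import textwrap

# Hypothetical sample string, already lowercased and stripped of punctuation
sample = "le coeur que rome eût"

# Greedy packing of whole words into lines of at most 10 characters
chunks = textwrap.wrap(
    sample,
    width=10,
    break_long_words=False,   # never cut a word in the middle
    break_on_hyphens=False,   # do not break after hyphens either
)

print(chunks)  # ['le coeur', 'que rome', 'eût']
```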

### Text Preprocessing Function
The `preprocess_text` function cleans a given text in three steps: it removes digits, removes punctuation, and converts the text to lowercase. It returns the preprocessed text.

In [4]:
def preprocess_text(text):
    """Preprocesses the text by removing digits and punctuation and converting to lowercase.

    Args:
        text: The text to be preprocessed.

    Returns:
        The preprocessed text.
    """
    # Remove digits (raw string avoids an invalid-escape warning for \d)
    text = re.sub(r'\d', '', text)
    
    # Remove punctuation, including apostrophes and hyphens
    text = re.sub(r'[^\w\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
       
    return text
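
To make the effect of these steps concrete, the sketch below applies the same three operations to one line of verse taken from the sample text classified at the end of this notebook. It repeats the regular expressions inline so that it runs on its own. Note that removing punctuation also removes apostrophes, which is why forms such as `quils` and `dune` appear in the chunks shown later.

```python
import re

# One line of verse, including the verse number left over from the PDF
line = "235 J'avouerai que brûlant d'une noble chaleur,"

no_digits = re.sub(r'\d', '', line)           # drop the verse number
no_punct = re.sub(r'[^\w\s]', '', no_digits)  # drop apostrophes and commas
result = no_punct.lower()

print(repr(result))  # ' javouerai que brûlant dune noble chaleur'
```

The leading space is what remains of the verse number; `split_text` discards it because it splits on whitespace.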

### PDF Files Parsing Function
The `parse_pdf_files` function takes a folder path and an optional chunk size (in characters) as input. It parses all PDF files in the specified folder, preprocesses the text of each page, and splits it into smaller chunks. The function returns a pandas DataFrame with three columns: `filename` (the name of the PDF file), `text_chunk` (a single text chunk extracted from the PDF file), and `author` (the name of the author, same as the folder name).


In [5]:
def parse_pdf_files(folder, chunk_size=100):
    """Parses all PDF files in a given folder and splits the extracted text into smaller chunks.

    Args:
        folder: The path to the folder that contains the PDF files. The folder
            name is also used as the author name.
        chunk_size: The maximum size of each text chunk (default: 100 characters).

    Returns:
        A pandas DataFrame containing the following columns:
            * `filename`: The name of the PDF file.
            * `text_chunk`: A single text chunk extracted from the PDF file.
            * `author`: The name of the author (same as the folder name).
    """

    rows = []
    for file in os.listdir(folder):
        if file.endswith('.pdf'):
            with pdfplumber.open(os.path.join(folder, file)) as pdf:
                for page in pdf.pages:
                    text = page.extract_text()
                    # Skip pages that yield no extractable text
                    if not text:
                        continue
                    text = preprocess_text(text)

                    chunks = split_text(text, chunk_size)

                    for chunk in chunks:
                        rows.append({
                            'filename': file,
                            'text_chunk': chunk,
                            'author': folder
                        })

    return pd.DataFrame(rows)


### Parsing PDF Files and Saving to CSV
The main function is responsible for parsing PDF files of two authors, Racine and Corneille, and saving the extracted data into separate CSV files.

In [7]:
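def main():
    # Make sure the output directory exists before writing the CSV files
    os.makedirs("data", exist_ok=True)
    parse_pdf_files('Racine').to_csv("data/Racine.csv", index=False)
    parse_pdf_files('Corneille').to_csv("data/Corneille.csv", index=False)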

if __name__ == '__main__':
    main()


### Combining and Shuffling the Data
This cell combines the data from the Corneille and Racine CSV files, shuffles the rows randomly, and saves the shuffled data into a new CSV file. Shuffling removes any bias that could arise from the original order of the records, in which all chunks by the same author are grouped together.

In [6]:
# Read the CSV files for Corneille and Racine
df_corneille = pd.read_csv('data/Corneille.csv')
df_racine = pd.read_csv('data/Racine.csv')

# Concatenate the DataFrames
combined_df = pd.concat([df_corneille, df_racine], ignore_index=True)

# Shuffle the combined DataFrame
shuffled_df = combined_df.sample(frac=1).reset_index(drop=True)

# Write the shuffled DataFrame to a new CSV file
shuffled_df.to_csv('data/shuffled.csv', index=False)

### Loading Shuffled Data into Spark DataFrame

The shuffled data from the `shuffled.csv` file is loaded into a Spark DataFrame, enabling further processing and analysis using Spark's distributed computing capabilities.

In [7]:
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("data/shuffled.csv")

In [8]:
df.show()

+---------------+--------------------+---------+
|       filename|          text_chunk|   author|
+---------------+--------------------+---------+
| MITHRIDATE.pdf|de pompée elle tr...|   Racine|
| MORTPOMPEE.pdf|coeur que rome eû...|Corneille|
| MORTPOMPEE.pdf|diligence à ses f...|Corneille|
|     PHEDRE.pdf|poète quils détes...|   Racine|
|PLACEROYALE.pdf|garantir dune dél...|Corneille|
| MITHRIDATE.pdf|juge plus sévère ...|   Racine|
| ANDROMAQUE.pdf|vainqueur digne d...|   Racine|
| MORTPOMPEE.pdf|cette nouvelle vo...|Corneille|
|   BERENICE.pdf|seigneur antiochu...|   Racine|
| ANDROMAQUE.pdf|forçant de rompre...|   Racine|
|     PHEDRE.pdf|dune si belle vie...|   Racine|
|   RODOGUNE.pdf|ils ont même sang...|Corneille|
| ANDROMAQUE.pdf|madame et puisquo...|   Racine|
| MORTPOMPEE.pdf|bien due ils vous...|Corneille|
|     PHEDRE.pdf|apprentissage en ...|   Racine|
| MITHRIDATE.pdf|cest par de tels ...|   Racine|
|     HORACE.pdf|règles infaillibl...|Corneille|
|    BAJAZET.pdf|de 

In [9]:
df.select("text_chunk", "author").show()

+--------------------+---------+
|          text_chunk|   author|
+--------------------+---------+
|de pompée elle tr...|   Racine|
|coeur que rome eû...|Corneille|
|diligence à ses f...|Corneille|
|poète quils détes...|   Racine|
|garantir dune dél...|Corneille|
|juge plus sévère ...|   Racine|
|vainqueur digne d...|   Racine|
|cette nouvelle vo...|Corneille|
|seigneur antiochu...|   Racine|
|forçant de rompre...|   Racine|
|dune si belle vie...|   Racine|
|ils ont même sang...|Corneille|
|madame et puisquo...|   Racine|
|bien due ils vous...|Corneille|
|apprentissage en ...|   Racine|
|cest par de tels ...|   Racine|
|règles infaillibl...|Corneille|
|de saintcyr jouée...|   Racine|
|achille et tout c...|   Racine|
|et den faire lélo...|   Racine|
+--------------------+---------+
only showing top 20 rows



In [10]:
df.groupBy("author").count().show()

+---------+-----+
|   author|count|
+---------+-----+
|   Racine| 9299|
|Corneille| 9386|
+---------+-----+



### Feature Engineering and Labeling

In [11]:

# Assigns a unique numerical label to each distinct author
indexer = StringIndexer(inputCol="author", outputCol="label")
df = indexer.fit(df).transform(df)

In [12]:
df.select("author", "label").distinct().show()

+---------+-----+
|   author|label|
+---------+-----+
|Corneille|  0.0|
|   Racine|  1.0|
+---------+-----+



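The `author` column is encoded as numerical labels with `StringIndexer`, which facilitates classification tasks. By default, `StringIndexer` orders labels by descending frequency, so the more frequent author (Corneille, 9386 chunks) receives 0.0 and Racine (9299 chunks) receives 1.0.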
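
The label assignment can be reproduced in plain Python as a quick check. This is only a sketch of the default frequency-descending ordering, not Spark's implementation; the counts are the ones reported by the `groupBy` cell above, and the names `counts`, `ordered`, and `label_of` are introduced here for illustration.

```python
from collections import Counter

# Chunk counts per author, taken from the groupBy output above
counts = Counter({"Racine": 9299, "Corneille": 9386})

# Most frequent value first; its position in this ordering is its label
ordered = [author for author, _ in counts.most_common()]
label_of = {author: float(index) for index, author in enumerate(ordered)}

print(label_of)  # {'Corneille': 0.0, 'Racine': 1.0}
```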

In [13]:
# Tokenize the text column
tokenizer = Tokenizer(inputCol="text_chunk", outputCol="words")

# Apply CountVectorizer to get term frequency vectors
vectorizer = CountVectorizer(inputCol="words", outputCol="rawFeatures")


# Apply IDF to get tf-idf vectors
idf = IDF(inputCol="rawFeatures", outputCol="features")


In the feature-extraction step, the text column is tokenized to split the text into individual words. Then, the tokenized words are transformed into term frequency vectors. Next, these term frequency vectors are converted into TF-IDF vectors, which weight each word by its importance in the corpus. The three stages are only defined here; they are fitted and applied later as part of the pipeline.
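
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand on two tiny tokenized chunks. The data is hypothetical and the code is plain Python, not Spark. It uses the smoothed formula given in Spark's documentation for `IDF`, log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term.

```python
import math
from collections import Counter

# Two toy "chunks", already tokenized (hypothetical data for illustration)
docs = [
    ["rome", "rome", "coeur"],
    ["coeur", "sang"],
]

m = len(docs)  # number of documents in the corpus

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

# TF-IDF: raw term count in the document times the term's IDF
tfidf = [
    {term: count * idf[term] for term, count in Counter(doc).items()}
    for doc in docs
]

for weights in tfidf:
    print({term: round(w, 3) for term, w in weights.items()})
# {'rome': 0.811, 'coeur': 0.0}
# {'coeur': 0.0, 'sang': 0.405}
```

A word that occurs in every chunk, such as `coeur` here, gets a weight of zero, while a word concentrated in few chunks is weighted up.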

### Training and Test Data Split

In [14]:
(train_data, test_data) = df.randomSplit((0.6, 0.4), seed=42)

### Model Evaluation and Selection with Cross-Validation

In [15]:
# Define the models to be evaluated
lr = LogisticRegression(featuresCol="features", labelCol="label")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Define a separate parameter grid for each model, so that each pipeline
# is tuned only over its own hyperparameters (a shared grid would train
# every pipeline on redundant combinations)
models = [
    (lr, ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()),
    (rf, ParamGridBuilder().addGrid(rf.numTrees, [10, 20, 30]).build()),
]

# Define the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

# Perform K-fold cross-validation and select the best model
cv_results = []
best_accuracy = 0.0
best_model = None

for model, paramGrid in models:
    pipeline = Pipeline(stages=[tokenizer, vectorizer, idf, model])
    
    # Create a CrossValidator instance
    crossval = CrossValidator(estimator=pipeline,
                              estimatorParamMaps=paramGrid,
                              evaluator=evaluator,
                              numFolds=5)
    
    # Run cross-validation
    cv_model = crossval.fit(train_data)
    
    # Evaluate the model on the test data
    predictions = cv_model.transform(test_data)
    accuracy = evaluator.evaluate(predictions)
    cv_results.append((model.__class__.__name__, accuracy))
    
    # Check if the current model is the best one
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = cv_model.bestModel

# Print the cross-validation results
for result in cv_results:
    print(result[0], "Accuracy:", result[1])

# Use the best model for predictions
predictions_best = best_model.transform(test_data)
best_model_name = best_model.stages[-1].__class__.__name__
best_accuracy = evaluator.evaluate(predictions_best)

print("Best Model:", best_model_name)
print("Best Accuracy:", best_accuracy)



LogisticRegression Accuracy: 0.815464057616524
RandomForestClassifier Accuracy: 0.6649001222992255
Best Model: LogisticRegressionModel
Best Accuracy: 0.815464057616524


Based on these results, Logistic Regression outperformed the Random Forest classifier at predicting the author of a text, with a test accuracy of about 0.815 against 0.665. Logistic Regression was therefore selected as the best model.

### Predicting the Author of a New Text

In [18]:
# Sample text to classify
text = """Hé bien, je l'avouerai, que ma juste colère
Aime la guerre autant que la paix vous est chère.
235 J'avouerai que brûlant d'une noble chaleur,
Je vais contre Alexandre éprouver ma valeur.
Du bruit de ses exploits mon âme importunée
Attend depuis longtemps cette heureuse journée.
Avant qu'il me cherchât, un orgueil inquiet
240 M'avait déjà rendu son ennemi secret.
Dans le noble transport de cette jalousie,
Je le trouvais trop lent à traverser l'Asie.
Je l'attirais ici par des voeux si puissants,
Que je portais envie au bonheur des Persans.
245 Et maintenant encor s'il trompait mon courage,
Pour sortir de ces lieux, s'il cherchait un passage,
Vous me verriez moi-même armé pour l'arrêter,
Lui refuser la paix qu'il nous veut présenter."""

# Preprocess the text
preprocessed_text = preprocess_text(text)


# Create a DataFrame with a single row containing the preprocessed text
data = spark.createDataFrame([(preprocessed_text,)], ["text_chunk"])

# Make predictions
predictions = best_model.transform(data)

# Extract predicted author
predicted_author = predictions.select("prediction").first()[0]

# Map the predicted label to the author name
# (0.0 = Corneille, 1.0 = Racine, as assigned by the StringIndexer above)
author_name = "Corneille" if predicted_author == 0 else "Racine"

# Print the predicted author
print("The Author is:", author_name)


The Author is: Racine
