# Workshop 2

## Install Libraries
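   - `nltk`: For natural language processing tasks, such as tokenization and stopword removal.
   - `re`: For cleaning text using regular expressions (part of the Python standard library, so no installation is needed).
   - `pandas` (imported as `pd`): For data manipulation and analysis.
   - `textstat`: For readability metrics; provides `flesch_kincaid_grade`, used in the readability check.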

In [7]:
import re

import nltk
import pandas as pd
from textstat import flesch_kincaid_grade

# Download the NLTK resources (tokenizer models and WordNet)
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/bampatra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bampatra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Read file

In [8]:
train_file = 'resource/train.csv'
test_file = 'resource/test.csv'

train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)

## Text cleaning, tokenization, lowercasing, and stop word removal

In [9]:
# Define a function to clean a single text value
def clean_text(text):
    # Non-string values (e.g. NaN from missing cells) become empty strings
    if not isinstance(text, str):
        return ""
    # Remove everything except ASCII letters and whitespace
    # (digits, punctuation, and accented characters are dropped)
    text = re.sub(r'[^A-Za-z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

## Save clean data to new file

In [10]:
# Apply cleaning function to all columns with text data
train_processed_data = train_data.copy()
for column in train_processed_data.select_dtypes(include=['object']).columns:
    train_processed_data[column] = train_processed_data[column].apply(clean_text)

test_processed_data = test_data.copy()
for column in test_processed_data.select_dtypes(include=['object']).columns:
    test_processed_data[column] = test_processed_data[column].apply(clean_text)

# Save the processed data to a new CSV file
train_output_file = 'cleaned_train.csv'
test_output_file = 'cleaned_test.csv'

train_processed_data.to_csv(train_output_file, index=False)
test_processed_data.to_csv(test_output_file, index=False)

train_output_file, test_output_file

('cleaned_train.csv', 'cleaned_test.csv')
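
Before relying on the cleaned files, it can help to see what `clean_text` does to a single string. The cleaning function above strips characters, lowercases, and collapses whitespace, but it does not tokenize or remove stop words. The sketch below repeats the same cleaning steps so it runs on its own, then adds a follow-up step under two assumptions: tokens are split on whitespace, and `STOP_WORDS` is a tiny hypothetical set rather than NLTK's stop word list. The helper name `tokenize_and_filter` is also illustrative.

```python
import re

def clean_text(text):
    # Same steps as the cleaning function above, repeated so this sketch runs on its own
    if not isinstance(text, str):
        return ""
    text = re.sub(r'[^A-Za-z\s]', '', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical, deliberately tiny stop word set (not NLTK's list)
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "of", "and", "to"}

def tokenize_and_filter(text):
    # Whitespace tokenization is an assumption made for this sketch
    tokens = clean_text(text).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

sample = "The cat sat on the mat in 2024!"
print(clean_text(sample))
print(tokenize_and_filter(sample))
```

Note that the digits and the exclamation mark disappear entirely, which matters for any later step that depends on punctuation.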

## Check readability score Flesch-Kincaid

This section computes the Flesch-Kincaid Grade Level for both the raw and the cleaned datasets and prints the scores. For each dataset, the first object-dtype column is treated as the main text column, all of its rows are joined into a single string, and the readability metric is applied to that string.
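
For reference, the Flesch-Kincaid Grade Level is commonly defined as follows. How `textstat` counts sentences and syllables internally is not shown here and may differ in detail.

$$
\text{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59
$$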

Check readability of `train_data` and `cleaned_train_data`

In [14]:
# Load datasets
train_data = pd.read_csv('resource/train.csv')
cleaned_train_data = pd.read_csv('cleaned_train.csv')

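# Function to compute readability score
def compute_readability_score(data, text_column):
    """Return the Flesch-Kincaid grade of all rows in text_column, joined into one string."""
    if text_column not in data.columns:
        return None
    text = ' '.join(data[text_column].dropna().astype(str))
    return flesch_kincaid_grade(text)

# Use the first object-dtype column of each dataset as its main text column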
train_text_column = train_data.select_dtypes(include=['object']).columns[0]
cleaned_text_column = cleaned_train_data.select_dtypes(include=['object']).columns[0]

# Calculate scores
train_readability_score = compute_readability_score(train_data, train_text_column)
cleaned_train_readability_score = compute_readability_score(cleaned_train_data, cleaned_text_column)

print("Train Readability Score:", train_readability_score)
print("Cleaned Train Readability Score:", cleaned_train_readability_score)


Train Readability Score: 8.1
Cleaned Train Readability Score: 2259533.4
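
The cleaned score is several orders of magnitude larger than the raw one. The most likely cause is that the cleaning step removes all punctuation, so the joined text has no sentence boundaries and is scored as one extremely long sentence. The sketch below illustrates the effect with a hand-written approximation of the Flesch-Kincaid formula. The helpers `count_syllables` and `fk_grade` are hypothetical, and the vowel-group syllable count is a crude assumption, not `textstat`'s implementation.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels (an assumption, not textstat's method)
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def fk_grade(text):
    # Sentences are split on '.', '!' and '?'; text without them counts as one sentence
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r'[A-Za-z]+', text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

# 100 short sentences, then the same text with punctuation stripped as in the cleaning step
original = "The cat sat on the mat. The dog ran to the park. " * 50
cleaned = re.sub(r'[^A-Za-z\s]', '', original).lower()

print("original:", round(fk_grade(original), 1))
print("cleaned: ", round(fk_grade(cleaned), 1))
```

In this sketch the words and syllables are identical in both strings; only the sentence count changes, so the words-per-sentence term grows with the length of the whole text.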


Check readability of `test_data` and `cleaned_test_data`

In [13]:
# Load datasets
test_data = pd.read_csv('resource/test.csv')
cleaned_test_data = pd.read_csv('cleaned_test.csv')

# compute_readability_score is reused from the train readability cell

# Use the first object-dtype column of each dataset as its main text column
test_text_column = test_data.select_dtypes(include=['object']).columns[0]
cleaned_text_column = cleaned_test_data.select_dtypes(include=['object']).columns[0]

# Calculate scores
test_readability_score = compute_readability_score(test_data, test_text_column)
cleaned_test_readability_score = compute_readability_score(cleaned_test_data, cleaned_text_column)

print("Test Readability Score:", test_readability_score)
print("Cleaned Test Readability Score:", cleaned_test_readability_score)


Test Readability Score: 8.1
Cleaned Test Readability Score: 2208901.3
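
If the cleaned text should stay comparable on a sentence-based metric, one option is to keep sentence-ending punctuation during cleaning. This is a minimal sketch under that assumption; `clean_text_keep_sentences` and `count_sentences` are hypothetical helpers, not part of the workshop code.

```python
import re

def clean_text_keep_sentences(text):
    # Hypothetical variant of clean_text: '.', '!' and '?' survive so sentences stay countable
    if not isinstance(text, str):
        return ""
    text = re.sub(r'[^A-Za-z\s.!?]', '', text)
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

def count_sentences(text):
    # Rough sentence count: non-empty pieces between runs of '.', '!' or '?'
    return len([s for s in re.split(r'[.!?]+', text) if s.strip()])

sample = "Hello, World! It is 3 o'clock now. Is that OK?"
kept = clean_text_keep_sentences(sample)
print(kept)
print(count_sentences(sample), count_sentences(kept))
```

A caveat of this variant: removing digits while keeping periods leaves stray dots behind decimals such as `3.5`, which can inflate the sentence count.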
