# Full pipeline for Text Data Exploration

For a data scientist specializing in Natural Language Processing (NLP), a thorough data exploration phase is crucial for understanding the text data, identifying patterns, and informing subsequent preprocessing and modeling steps. This guide walks through a pipeline of common tasks, tips, code, libraries, and useful charts, step by step in Python. The data used in this guide can be downloaded from https://zenodo.org/records/10157504.

# 1. Data Loading and Initial Inspection

**Common Task**: Load your text data and get a first glance at its structure and content.

**Tips**:
- Start with a sample if your dataset is massive.
- Understand the format: Is it a CSV, JSON, database, etc.?
- Check for missing values immediately.
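
The first and third tips can be combined in a few lines. The following is a minimal sketch, not the loading code used in this guide: the in-memory CSV is a hypothetical stand-in for a large file on disk, and `nrows=2` is an arbitrary sample size chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file; in practice pass a file path.
csv_text = io.StringIO(
    "ReviewTitle,ReviewBody,ReviewStar\n"
    "Good,Nice sound,4\n"
    "Bad,,1\n"
    "Okay,Average battery,3\n"
)

# nrows limits how many data rows are parsed, which keeps the first look cheap.
sample = pd.read_csv(csv_text, nrows=2)

# Count missing values per column right away.
missing_per_column = sample.isna().sum()

print(sample.shape)
print(missing_per_column)
```

For files too large to sample from the top, `pd.read_csv` also accepts a `chunksize` argument that yields the data in pieces.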

In [24]:
! pip install nltk


In [25]:
import pandas as pd
import numpy as np
import glob
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [43]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Alesa
[nltk_data]     TA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Alesa
[nltk_data]     TA\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package stopwords to C:\Users\Alesa
[nltk_data]     TA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
file_input = '*.*'

In [4]:
# List the files in the working directory to identify the data file and its extension
file_list = glob.glob(file_input)
len(file_list)
file_list

['AllProductReviews.csv', 'data-exploration.ipynb', 'README.md']

In [5]:
reviews = pd.read_csv(file_list[0], encoding='utf-8')

In [6]:
reviews.tail()

Unnamed: 0,ReviewTitle,ReviewBody,ReviewStar,Product
14332,Good\n,Good\n,4,JBL T110BT
14333,Amazing Product\n,An amazing product but a bit costly.\n,5,JBL T110BT
14334,Not bad\n,Sound\n,1,JBL T110BT
14335,a good product\n,the sound is good battery life is good but the...,5,JBL T110BT
14336,"Average headphones , n overrated name\n",M writing this review after using for almost 7...,1,JBL T110BT


In [7]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14337 entries, 0 to 14336
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ReviewTitle  14337 non-null  object
 1   ReviewBody   14337 non-null  object
 2   ReviewStar   14337 non-null  int64 
 3   Product      14337 non-null  object
dtypes: int64(1), object(3)
memory usage: 448.2+ KB


There are no null values: every column has 14337 non-null entries.

# 2. Basic Text Statistics

**Common Tasks**: Calculate fundamental statistics about your text data to understand its overall characteristics.

**Tips**:
- Character count can indicate brevity or verbosity.
- Word count and sentence count provide insights into text length and complexity.
- Average word length can hint at the formality or simplicity of the language.

In [8]:
# Clean the data: remove newline characters from the review titles
reviews['ReviewTitle'] = reviews['ReviewTitle'].str.replace('\n', '', regex=False)


In [12]:
# All statistics are computed on the review body; the title could be handled the same way:
# reviews['char_count_Title'] = reviews['ReviewTitle'].str.len()
reviews['char_count'] = reviews['ReviewBody'].str.len()
reviews['word_count'] = reviews['ReviewBody'].str.split().str.len()
# Rough sentence count: number of pieces obtained by splitting on periods
reviews['sentence_count'] = reviews['ReviewBody'].str.split('.').str.len()
# Length of every word in the review, then the mean word length per review
reviews['word_len'] = reviews['ReviewBody'].str.split().apply(lambda word_list: [len(word) for word in word_list])
reviews['average_word_len'] = reviews['word_len'].apply(lambda counts: sum(counts)/len(counts) if counts else 0)

In [13]:
reviews.describe()

Unnamed: 0,ReviewStar,char_count,word_count,sentence_count,average_word_len
count,14337.0,14337.0,14337.0,14337.0,14337.0
mean,3.675874,126.584362,22.320709,3.666039,4.836041
std,1.503409,154.807798,27.702611,3.910061,1.010389
min,1.0,1.0,0.0,1.0,0.0
25%,3.0,36.0,6.0,1.0,4.24
50%,4.0,88.0,15.0,2.0,4.666667
75%,5.0,160.0,28.0,5.0,5.222222
max,5.0,5046.0,864.0,65.0,31.0
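
The summary shows a minimum `word_count` of 0 but a minimum `sentence_count` of 1. This is because `str.split('.')` always returns at least one piece, and a trailing period or an ellipsis adds extra empty pieces. A slightly more careful heuristic is sketched below. `count_sentences` is a hypothetical helper, not part of the notebook, and it still miscounts abbreviations and decimal numbers.

```python
import re


def count_sentences(text):
    """Count non-empty segments separated by runs of '.', '!' or '?'."""
    segments = re.split(r"[.!?]+", text)
    return sum(1 for segment in segments if segment.strip())


examples = ["", "Good", "Good.", "Awesome... Just buy it!"]
for text in examples:
    naive = len(text.split("."))  # the period-split count used above
    print(repr(text), naive, count_sentences(text))
```

It could be applied to the data with `reviews['ReviewBody'].apply(count_sentences)`. For proper sentence segmentation, `nltk.tokenize.sent_tokenize` is an alternative that relies on the Punkt resources.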


In [16]:
reviews.head()
reviews['ReviewBody'][0]

'No doubt it has a great bass and to a great extent noise cancellation and decent sound clarity and mindblowing battery but the following dissapointed me though i tried a lot to adjust.1.Bluetooth range not more than 10m2. Pain in ear due the conical buds(can be removed)3. Wires are a bit long which makes it odd in front.4. No pouch provided.5. Worst part is very low quality and distoring mic. Other person keeps complaining about my voice.\n'

# 3. Text Preprocessing (for Exploration)

**Common Tasks**: Clean and normalize text to prepare it for frequency analysis and other exploratory tasks. This is a lighter preprocessing step compared to what you might do for modeling.

**Tips**:
- Lowercasing prevents treating "The" and "the" as different words.
- Punctuation removal reduces noise.
- Stopword removal focuses on meaningful content words.
- Stemming/Lemmatization reduces words to their root forms, consolidating variations.
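
The last tip is not implemented in the cells below, so here is a small sketch of stemming with NLTK's `PorterStemmer`, which needs no extra downloads. The word list is made up for illustration. Stems are not always dictionary words; lemmatization with `WordNetLemmatizer` returns real words but requires the `wordnet` corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words, not taken from the dataset.
words = ["charging", "charged", "charges", "battery", "batteries"]

# Map each word to its stem so that inflected variants collapse together.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```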

In [22]:
reviews['ReviewBody'] = reviews['ReviewBody'].str.lower()
reviews['ReviewBody'] = reviews['ReviewBody'].str.replace(rf"[{string.punctuation}]", "", regex=True)

In [None]:

# Function to remove stopwords
# def remove_stopwords(text):
#     if not isinstance(text, str):  # avoid errors on NaN or other non-string values
#         return ""
#     words = word_tokenize(text.lower())
#     filtered = [word for word in words if word.isalpha() and word not in stopwords.words('english')]
#     return " ".join(filtered)
# # Apply to each row
# reviews['CleanedReview'] = reviews['ReviewBody'].apply(remove_stopwords)


KeyboardInterrupt: 

In [45]:
# Download NLTK resources only if they are not already available
def ensure_nltk_resource(resource_name, resource_path):
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_name)

ensure_nltk_resource('punkt', 'tokenizers/punkt')
ensure_nltk_resource('punkt_tab', 'tokenizers/punkt_tab')
ensure_nltk_resource('stopwords', 'corpora/stopwords')

# Load the stopword list once; a set makes membership tests fast
STOP_WORDS = set(stopwords.words('english'))

# Robust cleaning function
def clean_text_remove_stopwords(text):
    try:
        # Make sure the input is a string
        if not isinstance(text, str):
            return ""

        # Tokenize
        words = word_tokenize(text.lower())

        # Keep alphabetic tokens that are not stopwords
        clean_words = [
            word for word in words
            if word.isalpha() and word not in STOP_WORDS
        ]

        return " ".join(clean_words)
    except Exception as e:
        # If anything fails, return an empty string and report the error
        print(f"Error while processing: {text} → {e}")
        return ""


In [46]:
reviews['CleanedReview'] = reviews['ReviewBody'].apply(clean_text_remove_stopwords)

In [None]:
# Check that every review body is a string
print(reviews['ReviewBody'].apply(type).value_counts())

ReviewBody
<class 'str'>    14337
Name: count, dtype: int64


In [47]:
reviews.head()

Unnamed: 0,ReviewTitle,ReviewBody,ReviewStar,Product,char_count,word_count,sentence_count,word_len,average_word_len,CleanedReview
0,Honest review of an edm music lover,no doubt it has a great bass and to a great ex...,3,boAt Rockerz 255,443,77,11,"[2, 5, 2, 3, 1, 5, 4, 3, 2, 1, 5, 6, 5, 12, 3,...",4.753247,doubt great bass great extent noise cancellati...
1,Unreliable earphones with high cost,this earphones are unreliable i bought it bef...,1,boAt Rockerz 255,371,64,4,"[4, 9, 3, 11, 1, 6, 2, 6, 2, 4, 9, 5, 4, 3, 4,...",4.78125,earphones unreliable bought days meanwhile rig...
2,Really good and durable.,i bought itfor 999i purchased it second time g...,4,boAt Rockerz 255,484,86,10,"[1, 6, 5, 5, 9, 2, 6, 5, 6, 5, 3, 2, 8, 4, 2, ...",4.627907,bought itfor purchased second time gifted firs...
3,stopped working in just 14 days,its sound quality is adorable overall it was g...,1,boAt Rockerz 255,199,37,4,"[3, 5, 7, 2, 9, 7, 2, 3, 4, 3, 4, 3, 1, 5, 5, ...",4.378378,sound quality adorable overall good weeks stop...
4,Just Awesome Wireless Headphone under 1000...😉,its awesome good sound quality 89 hrs battery...,5,boAt Rockerz 255,235,36,22,"[3, 10, 4, 5, 7, 1, 3, 3, 7, 7, 4, 4, 7, 1, 1,...",5.527778,awesome good sound quality hrs battery life wa...


# 4. Vocabulary Analysis

**Common Tasks**: Understand the unique words, their frequencies, and patterns.

**Tips**:

- Word clouds provide a quick visual summary of frequent terms.
- Bar charts of top N words show exact frequencies.
- Analyzing n-grams (bigrams, trigrams) reveals common phrases.
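
As a starting point for these tips, the sketch below counts the most frequent words and bigrams with the standard library only. `top_ngrams` is a hypothetical helper and the three sample strings are made up; in this notebook the input would presumably be `reviews['CleanedReview']`.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=10):
    """Return the k most frequent n-grams across whitespace-tokenized texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # Zipping n staggered views of the token list yields consecutive n-grams.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(gram) for gram in grams)
    return counts.most_common(k)


# Made-up sample standing in for the cleaned reviews.
cleaned = [
    "good sound quality",
    "sound quality good battery life",
    "battery life poor",
]

print(top_ngrams(cleaned, n=1, k=3))
print(top_ngrams(cleaned, n=2, k=2))
```

The returned `(ngram, count)` pairs can be passed directly to a bar chart, or converted to a dictionary of frequencies for a word cloud.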

# 5. Part-of-Speech (POS) Tagging

**Common Task**: Analyze the distribution of grammatical categories (nouns, verbs, adjectives, etc.) in your text.

**Tips**:

- Provides insights into the linguistic structure of your corpus.
- Can highlight if your text is descriptive (many adjectives), action-oriented (many verbs), or topic-focused (many nouns).
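
One possible way to obtain such a distribution is sketched below. `tag_text` wraps NLTK's `pos_tag`, which in NLTK 3.9 needs the `averaged_perceptron_tagger_eng` resource (the name differs in older releases). `pos_distribution` and its grouping of Penn Treebank tags into coarse classes are assumptions made for illustration, and the tagged sample is written by hand rather than produced by the tagger.

```python
from collections import Counter


def tag_text(text):
    """Tokenize and tag one document with NLTK (requires the tagger resource)."""
    import nltk
    from nltk.tokenize import word_tokenize

    return nltk.pos_tag(word_tokenize(text))


def pos_distribution(tagged_tokens):
    """Group Penn Treebank tags into coarse classes and return their shares."""
    coarse = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
    counts = Counter(coarse.get(tag[:2], "other") for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hand-tagged sample; with the tagger installed, use tag_text(review) instead.
tagged = [
    ("great", "JJ"), ("bass", "NN"), ("and", "CC"),
    ("decent", "JJ"), ("sound", "NN"), ("clarity", "NN"),
]
print(pos_distribution(tagged))
```

Tagging works best on the original text, since removing stopwords and punctuation strips away context the tagger relies on.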

# 6. Named Entity Recognition (NER)

**Common Task**: Identify and categorize named entities (people, organizations, locations, dates, etc.) in your text.

**Tips**:

- Reveals key subjects and concepts in your data.
- Useful for extracting structured information from unstructured text.
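
The sketch below shows one way to do this with spaCy, a library this notebook does not otherwise use. It assumes spaCy and its `en_core_web_sm` model are installed. `extract_entities` and `summarize_entities` are hypothetical helpers, and the entity list is a made-up example of what an extractor might return, not real output.

```python
from collections import Counter


def extract_entities(texts, model_name="en_core_web_sm"):
    """Return (entity text, label) pairs; assumes spaCy and the model are installed."""
    import spacy

    nlp = spacy.load(model_name)
    return [(ent.text, ent.label_) for doc in nlp.pipe(texts) for ent in doc.ents]


def summarize_entities(entities, k=5):
    """Count entity labels and the k most frequently mentioned entities."""
    label_counts = Counter(label for _, label in entities)
    top_mentions = Counter(text for text, _ in entities).most_common(k)
    return label_counts, top_mentions


# Made-up extractor output used to demonstrate the summary step.
entities = [("JBL", "ORG"), ("Amazon", "ORG"), ("JBL", "ORG"), ("14 days", "DATE")]
label_counts, top_mentions = summarize_entities(entities)
print(label_counts)
print(top_mentions)
```

Capitalization is a strong cue for entity recognition, so the lowercased `ReviewBody` column will likely yield fewer entities than the original text.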


# 7. Sentiment Analysis (if applicable)

**Common Task**: Determine the emotional tone (positive, negative, neutral) of your text data.

**Tips**:

- Provides a high-level understanding of the sentiment distribution.
- Can be done with simple lexicon-based models or more complex pre-trained models.
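
To make the lexicon-based idea concrete, here is a deliberately tiny sketch. The two word lists are hypothetical and far too small for real use; a maintained lexicon such as VADER (available through `nltk.sentiment` after downloading `vader_lexicon`) would be a more realistic choice.

```python
import string

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "awesome", "amazing"}
NEGATIVE = {"bad", "worst", "poor", "unreliable"}


def lexicon_sentiment(text):
    """Label text by counting positive and negative lexicon hits."""
    tokens = [token.strip(string.punctuation) for token in text.lower().split()]
    score = sum(token in POSITIVE for token in tokens) - sum(
        token in NEGATIVE for token in tokens
    )
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in ["An amazing product but a bit costly.", "Worst part is the mic", "Sound"]:
    print(review, "->", lexicon_sentiment(review))
```

Since this dataset has a `ReviewStar` column, comparing the predicted labels with the star ratings is a quick sanity check on any sentiment method.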

# 8. Topic Modeling (High-level exploration)

**Common Task**: Discover abstract "topics" that occur in a collection of documents.

**Tips**:

- LDA (Latent Dirichlet Allocation) is a common algorithm.
- Requires a document-term matrix.
- Provides a sense of the main themes present in your corpus.