# Full pipeline for Text Data Exploration

For a data scientist working in Natural Language Processing (NLP), a thorough data exploration phase is crucial for understanding the text data, identifying patterns, and informing subsequent preprocessing and modeling steps. This guide walks through a pipeline of common tasks, tips, code, libraries, and useful charts, step by step in Python. The data used in this guide can be downloaded from https://zenodo.org/records/10157504.

# 1. Data Loading and Initial Inspection

**Common Task**: Load your text data and get a first glance at its structure and content.

**Tips**:
- Start with a sample if your dataset is massive.
- Understand the format: Is it a CSV, JSON, database, etc.?
- Check for missing values immediately.
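
The first and third tips can be sketched as follows. This is a minimal illustration, assuming pandas is installed; the in-memory CSV and its contents are hypothetical stand-ins for a real file, and `nrows` is one of several ways to take a sample.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a real file (hypothetical data).
csv_text = """ReviewTitle,ReviewBody,ReviewStar
Great,Good sound quality,5
Bad,,1
Okay,Average battery life,3
Fine,Works as expected,4
"""

# nrows limits how many rows are parsed, which is handy for a first look at a huge file.
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Count missing values per column.
missing_per_column = sample.isna().sum()

print(sample.shape)
print(missing_per_column)
```

With a real file, the same call would take the file path instead of the `io.StringIO` object.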

In [2]:
import pandas as pd
import numpy as np
import glob

In [3]:
file_input = '*.*'

In [4]:
# List the files in the working directory to identify the data file and its extension
file_list = glob.glob(file_input)
file_list

['AllProductReviews.csv', 'data-exploration.ipynb', 'README.md']

In [5]:
# glob does not guarantee ordering; file_list[0] is the CSV in the listing above
reviews = pd.read_csv(file_list[0], encoding='utf-8')

In [6]:
reviews.head()

| | ReviewTitle | ReviewBody | ReviewStar | Product |
|---|---|---|---|---|
| 0 | Honest review of an edm music lover\n | No doubt it has a great bass and to a great ex... | 3 | boAt Rockerz 255 |
| 1 | Unreliable earphones with high cost\n | This earphones are unreliable, i bought it be... | 1 | boAt Rockerz 255 |
| 2 | Really good and durable.\n | i bought itfor 999,I purchased it second time,... | 4 | boAt Rockerz 255 |
| 3 | stopped working in just 14 days\n | Its sound quality is adorable. overall it was ... | 1 | boAt Rockerz 255 |
| 4 | Just Awesome Wireless Headphone under 1000...😉\n | Its Awesome... Good sound quality & 8-9 hrs ba... | 5 | boAt Rockerz 255 |


In [7]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14337 entries, 0 to 14336
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ReviewTitle  14337 non-null  object
 1   ReviewBody   14337 non-null  object
 2   ReviewStar   14337 non-null  int64 
 3   Product      14337 non-null  object
dtypes: int64(1), object(3)
memory usage: 448.2+ KB


None of the four columns contains null values: each has 14337 non-null entries.

# 2. Basic Text Statistics

**Common Tasks**: Calculate fundamental statistics about your text data to understand its overall characteristics.

**Tips**:
- Character count can indicate brevity or verbosity.
- Word count and sentence count provide insights into text length and complexity.
- Average word length can hint at the formality or simplicity of the language.

In [8]:
# Clean the data: remove newline characters from the review titles
reviews['ReviewTitle'] = reviews['ReviewTitle'].str.replace('\n', '', regex=False)


In [9]:
reviews['char_count_Title'] = reviews['ReviewTitle'].str.len()
reviews['char_count_Body'] = reviews['ReviewBody'].str.len()

In [21]:
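# Word count: number of whitespace-separated tokens
reviews['word_count_Body'] = reviews['ReviewBody'].str.split().str.len()
# Rough sentence count: splitting on '.' over-counts ellipses and numbered lists
reviews['sentence_count_Body'] = reviews['ReviewBody'].str.split('.').str.len()
reviews['word_len_Body'] = reviews['ReviewBody'].str.split().apply(lambda word_list: [len(word) for word in word_list])
# Average word length; the guard returns 0 for bodies with no tokens
reviews['average_word_len'] = reviews['word_len_Body'].apply(lambda counts: sum(counts)/len(counts) if counts else 0)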

In [23]:
reviews.describe()

| | ReviewStar | char_count_Title | char_count_Body | word_count_Body | sentence_count_Body | average_word_len |
|---|---|---|---|---|---|---|
| count | 14337.0 | 14337.0 | 14337.0 | 14337.0 | 14337.0 | 14337.0 |
| mean | 3.675874 | 21.545791 | 126.584362 | 22.320709 | 3.666039 | 4.836041 |
| std | 1.503409 | 16.443729 | 154.807798 | 27.702611 | 3.910061 | 1.010389 |
| min | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 25% | 3.0 | 10.0 | 36.0 | 6.0 | 1.0 | 4.24 |
| 50% | 4.0 | 16.0 | 88.0 | 15.0 | 2.0 | 4.666667 |
| 75% | 5.0 | 28.0 | 160.0 | 28.0 | 5.0 | 5.222222 |
| max | 5.0 | 128.0 | 5046.0 | 864.0 | 65.0 | 31.0 |


In [16]:
# Inspect the full text of the first review body
reviews['ReviewBody'][0]

'No doubt it has a great bass and to a great extent noise cancellation and decent sound clarity and mindblowing battery but the following dissapointed me though i tried a lot to adjust.1.Bluetooth range not more than 10m2. Pain in ear due the conical buds(can be removed)3. Wires are a bit long which makes it odd in front.4. No pouch provided.5. Worst part is very low quality and distoring mic. Other person keeps complaining about my voice.\n'

# 3. Text Preprocessing (for Exploration)

**Common Tasks**: Clean and normalize text to prepare it for frequency analysis and other exploratory tasks. This is a lighter preprocessing step compared to what you might do for modeling.

**Tips**:
- Lowercasing prevents treating "The" and "the" as different words.
- Punctuation removal reduces noise.
- Stopword removal focuses on meaningful content words.
- Stemming/Lemmatization reduces words to their root forms, consolidating variations.
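
The first three tips can be sketched with the standard library alone. This is a minimal illustration, not a complete cleaning routine: the `STOPWORDS` set and the `preprocess` helper are hypothetical, and the stopword list is deliberately tiny. Stemming and lemmatization are left out because they need an external library such as NLTK or spaCy.

```python
import re
import string

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "it", "is", "has", "to", "but", "of", "in"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    text = text.lower()
    # Replace punctuation with spaces so tokens glued by punctuation are separated.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = preprocess("No doubt it has a great bass, and the battery is GREAT!")
print(tokens)
```

On a DataFrame such as `reviews`, the same helper could be applied with `reviews['ReviewBody'].apply(preprocess)`.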

# 4. Vocabulary Analysis

**Common Tasks**: Understand the unique words, their frequencies, and patterns.

**Tips**:

- Word clouds provide a quick visual summary of frequent terms.
- Bar charts of top N words show exact frequencies.
- Analyzing n-grams (bigrams, trigrams) reveals common phrases.
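
Word and n-gram frequencies can be computed with `collections.Counter`. The sketch below assumes the text has already been tokenized; the `docs` list is hypothetical data and `ngrams` is an illustrative helper, not a library function.

```python
from collections import Counter

# Hypothetical pre-tokenized reviews.
docs = [
    ["great", "sound", "quality"],
    ["sound", "quality", "poor"],
    ["great", "battery", "life"],
]


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Frequency of each word across all documents.
word_counts = Counter(tok for doc in docs for tok in doc)
# Frequency of each bigram, computed per document so pairs never span two reviews.
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))

print(len(word_counts))  # vocabulary size
print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
```

The pairs returned by `most_common` are the natural input for a bar chart of the top N words or for a word cloud.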

# 5. Part-of-Speech (POS) Tagging

**Common Task**: Analyze the distribution of grammatical categories (nouns, verbs, adjectives, etc.) in your text.

**Tips**:

- Provides insights into the linguistic structure of your corpus.
- Can highlight if your text is descriptive (many adjectives), action-oriented (many verbs), or topic-focused (many nouns).
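
One possible way to get a tag distribution is sketched below. It assumes NLTK is installed and its tokenizer and tagger resources have been downloaded with `nltk.download`; `tag_with_nltk` and `pos_distribution` are hypothetical helper names. The demonstration uses hand-tagged tokens so that the counting step runs without NLTK.

```python
from collections import Counter


def tag_with_nltk(text):
    """Tokenize and POS-tag text with NLTK (needs nltk plus its tokenizer and tagger data)."""
    import nltk  # imported lazily so the counting helper works without NLTK

    return nltk.pos_tag(nltk.word_tokenize(text))


def pos_distribution(tagged_tokens):
    """Return the relative frequency of each tag in a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Hand-tagged example using Penn Treebank tags (JJ = adjective, NN = noun).
tagged = [("great", "JJ"), ("bass", "NN"), ("poor", "JJ"), ("mic", "NN")]
dist = pos_distribution(tagged)
print(dist)
```

With NLTK available, `pos_distribution(tag_with_nltk(text))` would give the same kind of summary for real text.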

# 6. Named Entity Recognition (NER)

**Common Task**: Identify and categorize named entities (people, organizations, locations, dates, etc.) in your text.

**Tips**:

- Reveals key subjects and concepts in your data.
- Useful for extracting structured information from unstructured text.
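
A sketch of entity extraction and counting follows. It assumes spaCy as the NER library, with a pipeline loaded elsewhere (for example `nlp = spacy.load("en_core_web_sm")`, which requires that model to be installed). `extract_entities` and `entity_label_counts` are hypothetical helpers, and the entity list in the demonstration is hand-written rather than model output.

```python
from collections import Counter


def extract_entities(text, nlp):
    """Return (entity text, label) pairs; nlp is an already loaded spaCy pipeline."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def entity_label_counts(entities):
    """Count how often each entity label occurs in a list of (text, label) pairs."""
    return Counter(label for _, label in entities)


# Hand-written entities (an assumption, not real model output).
entities = [
    ("boAt", "ORG"),
    ("Bluetooth", "PRODUCT"),
    ("14 days", "DATE"),
    ("Amazon", "ORG"),
]
label_counts = entity_label_counts(entities)
print(label_counts.most_common())
```

Loading the pipeline once and passing it in avoids reloading the model for every review.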


# 7. Sentiment Analysis (if applicable)

**Common Task**: Determine the emotional tone (positive, negative, neutral) of your text data.

**Tips**:

- Provides a high-level understanding of the sentiment distribution.
- Can be done with simple lexicon-based models or more complex pre-trained models.
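
The lexicon-based approach can be illustrated with a toy scorer. The word lists below are hypothetical and far too small for real use; established lexicons or pre-trained models would replace them in practice.

```python
import re

# Toy sentiment lexicons (assumptions for illustration only).
POSITIVE = {"great", "good", "awesome", "decent"}
NEGATIVE = {"worst", "bad", "unreliable", "poor"}


def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(tok in POSITIVE for tok in tokens) - sum(tok in NEGATIVE for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


samples = [
    "Awesome sound, good battery",
    "Worst mic, unreliable",
    "Stopped working in 14 days",
]
labels = [lexicon_sentiment(s) for s in samples]
print(labels)
```

The third sample shows the main weakness of this approach: a clearly negative review is labeled neutral because none of its words appear in the lexicon. For review data with a star rating, comparing such labels against the rating is a quick sanity check.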

# 8. Topic Modeling (High-level exploration)

**Common Task**: Discover abstract "topics" that occur in a collection of documents.

**Tips**:

- LDA (Latent Dirichlet Allocation) is a common algorithm.
- Requires a document-term matrix.
- Provides a sense of the main themes present in your corpus.