# Introduction

Hey! This is my very first notebook in my journey of exploring data science.

In this notebook, I dive into Exploratory Data Analysis (EDA), a crucial first step in any data science or analytics project. Through hands-on experimentation and by applying techniques and ideas gathered from various sources (including Kaggle!), I’ve tried to understand my dataset, uncover patterns, and learn by doing. I hope this notebook is both helpful and fun to read!

**What is EDA?**

Exploratory Data Analysis (EDA) is all about exploring and examining a dataset to understand its features, structure, and relationships. It helps summarize the main characteristics of the data, identify distributions, spot patterns, and detect anomalies or outliers. EDA often uses data visualization (like histograms and box plots) to make insights clearer and easier to explain.

**Why is EDA important?**

EDA gives data scientists a clear picture of the data, including its structure, missing values, and overall quality. Most importantly, it helps discover hidden patterns and relationships, which are essential for identifying trends, generating insights, and guiding further analysis or modeling decisions.

Let’s get started and see what we can learn from the data!


**About the Dataset**

The dataset used in this notebook consists of research documents related to cancer, specifically focused on three types: Thyroid, Lung, and Colon Cancer. As someone passionate about healthcare and medical data, I chose this dataset to begin my data science journey. It offers a rich opportunity to explore real-world clinical literature and apply data analysis techniques.

**Dataset Overview:**

*Rows:* 900 
*Features:* 3 main columns: Title, Abstract, and Label 
*Labels:* Thyroid, Lung, or Colon Cancer. The documents are sourced from various medical and research repositories.

This dataset provides a mix of scientific titles and abstracts, making it ideal for practicing exploratory data analysis (EDA) in the context of natural language processing and healthcare research. Throughout this notebook, I focus on EDA: exploring the structure, key characteristics, and patterns within the data. The insights gained here will lay the foundation for any future modeling or deeper analysis.

**Note:** EDA is a crucial step in understanding and preparing data for further analysis, especially in complex fields like cancer research, where uncovering patterns and relationships can lead to valuable clinical insights.

# Data Loading & Initial Exploration

In [None]:
import re  # standard library
import string
from collections import Counter

import matplotlib.pyplot as plt  # third-party libraries used for EDA
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from wordcloud import WordCloud


In [None]:
df = pd.read_excel('/kaggle/input/cancer-papers-dataset/data2.xlsx')  # Read the Dataset
df.head()

In [None]:
df.tail()

In [None]:
df.columns  # Check for columns (features)

In [None]:
df.info()  # Basic info

In [None]:
print(df.isnull().sum())  # Check for Null Values
print("Number of duplicate rows:", df.duplicated().sum()) # Check for duplicates

**Initial Exploration & Data Quality Overview**

From our initial exploration, we found that this dataset is entirely text-based:

* Two of the three features (Title and Abstract) are free text, so most of the analysis focuses on text length and content.
* The third feature (Label) is categorical, indicating the cancer type for each document.

A quick check revealed that the dataset is clean and well-structured:
* No missing (null) values in any of the features.
* No duplicate entries detected.

With these checks complete, we can proceed to feature exploration and the rest of the exploratory data analysis.
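
A null check does not catch empty or whitespace-only strings, which matter for text columns. Below is a minimal sketch of such a check in plain Python. The `records` list and the `find_blank_fields` helper are hypothetical stand-ins for illustration; they are not part of the dataset or of pandas.

```python
# Hypothetical toy records standing in for rows of the dataset (illustrative only).
records = [
    {"Title": "Thyroid nodules", "Abstract": "A study of nodules.", "Label": "Thyroid_Cancer"},
    {"Title": "   ", "Abstract": "Lung imaging review.", "Label": "Lung_Cancer"},
    {"Title": "Colon screening", "Abstract": "", "Label": "Colon_Cancer"},
]

def find_blank_fields(rows, text_columns):
    """Return (row_index, column) pairs whose text is None, empty, or whitespace-only."""
    blanks = []
    for i, row in enumerate(rows):
        for col in text_columns:
            value = row.get(col)
            if value is None or not str(value).strip():
                blanks.append((i, col))
    return blanks

blank_fields = find_blank_fields(records, ["Title", "Abstract"])
print("Blank text fields:", blank_fields)
```

On the real DataFrame, a comparable check could be written as `df['Title'].str.strip().eq('').sum()`.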

**Feature Exploration**

In feature exploration, we begin by examining the class distribution using a bar graph to visualize the number of documents in each class of the Label feature (i.e., Thyroid, Colon, and Lung Cancers). Next, we explore the Title and Abstract features to better understand the content and prepare for further analysis.

In [None]:
num_classes = df['Label'].nunique()
print("Number of unique classes:", num_classes)  # Number of unique classes in label

unique_classes = df['Label'].unique()
print("\nUnique classes:", unique_classes)  # Unique class names in label

In [None]:
df['Label'].value_counts().plot(kind='bar', title='Label Distribution')  # Class Distribution using bar graph (count of each classes)

In [None]:
for index, text in enumerate(df['Title'][35:38]):
    print('Title %d:\n'%(index+1), text)  # review a few sample titles from the dataset

In [None]:
for index, text in enumerate(df['Abstract'][100:101]):
    print('Abstract %d:\n'%(index+1), text)  # review of a sample abstract from dataset

# Observations from Feature Exploration

The label feature, which is categorical, contains three classes: Thyroid_Cancer, Lung_Cancer, and Colon_Cancer. The dataset is evenly distributed across these classes, with each class containing 300 documents, as shown in the label distribution bar graph. Additionally, the printed samples show that abstracts are significantly longer than titles. Because abstracts make up most of each document's text, variation in abstract length is likely to drive any length outliers found later.

# Text Processing

Text processing is a crucial step when working with text-driven data. This phase helps analyze large volumes of information by removing irrelevant content and noise, ultimately cleaning the dataset and improving its quality. Effective text processing is essential for further analysis and model training.

In this section, we will focus on cleaning the text. Before cleaning, we will combine the Title and Abstract features into a new feature called Combined_document (a common feature engineering approach for handling large textual datasets). We will then clean this combined data by converting all text to lowercase, removing punctuation, and eliminating extra spaces.

Cleaning the text in this way reduces noise: punctuation can introduce unnecessary complexity, and lowercasing helps normalize the data, ensuring that words are treated consistently regardless of their original capitalization.

Let's get started with text processing!

In [None]:
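df['Combined_document'] = df['Title'] + " [SEP] " + df['Abstract']  # Merge Title and Abstract into Combined_document (treated as a single document)
# Note: the punctuation removal in the next cell reduces "[SEP]" to the plain token "sep", which is then counted like any other word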
df.head()

In [None]:
df['Combined_document'] = df['Combined_document'].apply(lambda x: x.lower())  # Lowercasing
df['Combined_document'] = df['Combined_document'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))  # removes all punctuation characters
df['Combined_document'] = df['Combined_document'].apply(lambda x: re.sub(' +', ' ',x))  # removes extra spacing

for index, text in enumerate(df['Combined_document'][0:1]):
    print("Document %d:\n"%(index+1), text)  # review sample document (title+abstract)
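
To make the effect of the three cleaning steps concrete, here is a small sketch that applies them to a single made-up string. The `clean_text` helper and the `sample` text are hypothetical and only mirror the steps used in the cell above.

```python
import re
import string

# Pattern that matches any single ASCII punctuation character.
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))

def clean_text(text):
    """Lowercase, strip ASCII punctuation, and collapse repeated spaces."""
    text = text.lower()
    text = PUNCT_PATTERN.sub("", text)
    text = re.sub(" +", " ", text)
    return text

# Made-up example text (not taken from the dataset).
sample = "Thyroid Cancer: A Review [SEP] Papillary  carcinoma (PTC) is common."
cleaned = clean_text(sample)
print(cleaned)
```

Two side effects are worth keeping in mind: the separator survives as the ordinary token `sep`, and hyphenated terms are merged into one word because the hyphen is deleted rather than replaced by a space.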


# Document Length Analysis

Document length analysis involves measuring the length of documents, typically in words or characters, to gain insights into your dataset.

**Why Analyze Document Length?**

* *Understanding Complexity:* Longer documents may indicate greater complexity and require more effort to process.

* *Identifying Patterns:* Analyzing document lengths can reveal patterns, such as certain topics or classes having longer or shorter texts.

* *Feature Engineering:* Document length can serve as a useful feature in machine learning models.

**Methods and Techniques**

* *Statistical Analysis:* Use descriptive statistics (mean, median, standard deviation) to summarize document lengths.

* *Visualization:* Histograms and box plots help visualize the distribution and spot outliers.
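
As a quick illustration of the statistical side, the sketch below summarizes a handful of made-up word counts using only the standard library. The `doc_lengths` values are hypothetical and chosen so that one long document pulls the mean above the median.

```python
import statistics

# Hypothetical word counts for a handful of documents (illustrative only).
doc_lengths = [180, 210, 240, 250, 255, 260, 300, 520]

mean_len = statistics.mean(doc_lengths)
median_len = statistics.median(doc_lengths)
std_len = statistics.stdev(doc_lengths)  # sample standard deviation

print(f"mean={mean_len:.1f}, median={median_len:.1f}, std={std_len:.1f}")

# A mean noticeably above the median hints at a right-skewed distribution,
# i.e. a few long documents pulling the average up.
right_skew_hint = mean_len > median_len
print("Right-skew hint:", right_skew_hint)
```

This is only a rough hint; a histogram or box plot remains the more reliable way to judge the shape of the distribution.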

In [None]:
df['doc_length'] = df['Combined_document'].apply(lambda x: len(str(x).split()))
print(df['doc_length'].describe())

plt.figure(figsize=(8, 5))
plt.hist(df['doc_length'], bins=30, color='mediumpurple', edgecolor='black', alpha=0.8)
plt.title('Document Length Distribution')
plt.xlabel('Number of Words')
plt.ylabel('Number of Documents')
plt.grid(axis='y', alpha=0.75)
plt.show()

# The distribution (histogram) is roughly bell-shaped (close to normal) but slightly right-skewed, with a long tail of longer documents

In [None]:
plt.figure(figsize=(8, 5))
sns.boxplot(x='Label', y='doc_length', data=df, palette='Set2')
plt.title('Class-wise Document Length Distribution')
plt.xlabel('Cancer Type')
plt.ylabel('Document Length (words)')
plt.show()

In [None]:
avg_lengths = df.groupby('Label')['doc_length'].mean()
print("Average Document Length by Class:")
print(avg_lengths)

# Observation on Document Length Analysis

* The document length distribution is roughly bell-shaped with a slight right skew; most documents contain between 150 and 350 words.
* All three cancer types have similar document length distributions, with median lengths around 250 words and comparable variability.
* Average document lengths are close across classes (Colon: 245, Thyroid: 253, Lung: 262 words).
* Outliers are present in all classes, especially among Colon_Cancer documents, which include *a few very long* entries.
* These findings suggest the dataset is well-balanced in terms of document length, but attention should be paid to outliers in further analysis.

# Word Frequency Analysis

Word frequency analysis is a foundational technique in qualitative text analysis. It involves counting how often each word or phrase appears within a text or a collection of documents. This method helps identify the most prominent topics, recurring themes, or key terms in the dataset.

While word frequency analysis highlights which words are most common, shedding light on the internal focus and dominant subjects of the text, document length analysis measures how long each document is (in terms of word count). Document length can influence word frequency counts, as longer documents may naturally contain more occurrences of certain words.

In summary, word frequency analysis reveals the internal distribution and prominence of terms within your data, whereas document length analysis provides context about the size and potential variability of your documents. Together, these analyses offer valuable insights into both the content and structure of your textual dataset.
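
The idea can be sketched on a tiny made-up corpus. The `docs` list below is hypothetical; the relative frequency at the end is one simple way to account for corpus size when comparing counts.

```python
from collections import Counter

# Hypothetical cleaned documents (illustrative only).
docs = [
    "thyroid cancer patients thyroid nodules",
    "lung cancer patients",
    "colon cancer screening colon polyps colon",
]

# Raw counts over the whole corpus.
all_words = " ".join(docs).split()
counts = Counter(all_words)
print(counts.most_common(3))

# Relative frequency: share of all tokens, which makes corpora
# of different sizes comparable.
total = len(all_words)
rel_freq = {word: n / total for word, n in counts.items()}
print(f"cancer: {rel_freq['cancer']:.2f}")
```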

In [None]:
from collections import Counter
words = ' '.join(df['Combined_document']).split()  # Most Frequent Words Overall
freq = Counter(words).most_common(30)
pd.DataFrame(freq, columns=['word', 'Frequency']).plot(kind='bar', x='word', y='Frequency', title='Top 30 Words')

Most words in the graph are stopwords, which add little domain-specific meaning. Removing them helps highlight the true, meaningful terms relevant to the cancer dataset.

In [None]:
for label in df['Label'].unique():
    words = ' '.join(df[df['Label']==label]['Combined_document']).split()
    freq = Counter(words).most_common(10)
    print(f"\nLabel: {label}")
    print(freq)  # Most frequent words by label

This output also contains many stopwords, so we will focus on removing them in the next phase to better capture meaningful domain-specific terms.

# Stop Word Impact & Removal

From the previous two analyses, we see that stopwords dominate the top ranks in our visuals, overshadowing important domain-specific terms. Therefore, we will now focus on removing stopwords to better highlight the meaningful words in our dataset.

*Stop word removal* is crucial in Exploratory Data Analysis (EDA) of text data because it reduces noise and focuses on meaningful words, improving analysis accuracy and model performance. By removing common, uninformative words like "a", "the", and "is," EDA can identify key themes, topics, and relationships within the text more effectively.

In [None]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stop_words = ENGLISH_STOP_WORDS

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

df['Cleaned_document'] = df['Combined_document'].apply(remove_stopwords)


Now, re-run the previous visuals and check for new observations. We should see a much clearer representation of the key topics and terms in the cancer dataset!

In [None]:
from collections import Counter
words = ' '.join(df['Cleaned_document']).split()  # Most Frequent Words Overall
freq = Counter(words).most_common(30)
pd.DataFrame(freq, columns=['word', 'Frequency']).plot(kind='bar', x='word', y='Frequency', title='Top 30 Words')

In [None]:
for label in df['Label'].unique():
    words = ' '.join(df[df['Label']==label]['Cleaned_document']).split()
    freq = Counter(words).most_common(10)
    print(f"\nLabel: {label}")
    print(freq)  # Most frequent words by label

Yes, as we can see, domain-specific words now dominate the visuals, thanks to the stopword removal function!

# Word Cloud Visualizations

Word clouds are a valuable technique in Exploratory Data Analysis (EDA), especially when working with text data. They provide a visual representation of word frequency, helping to identify key themes and insights within the text.

Here, we use word clouds to visualize both the overall words throughout the dataset and the words specific to each class.

In [None]:
text = ' '.join(df['Cleaned_document'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Overall Word Cloud")  # Word Cloud (Visual)
plt.show() 

In [None]:
# Class-wise WordCloud with extra space between plots
for label in df['Label'].unique():
    plt.figure(figsize=(10,5))
    text = ' '.join(df[df['Label']==label]['Cleaned_document'])
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Word Cloud for {label}")
    plt.tight_layout(pad=3)  # Adds extra padding around the plot
    plt.show()
    plt.close()  # Ensures plots don't overlap in some environments


As we can see, the word clouds look great, with domain-specific words prominently featured in both the overall and class-wise visualizations.

# Feature Importance 

Feature importance measures how much each feature (such as a word or variable) influences a machine learning model’s predictions. No model is trained in this notebook, so TF-IDF weights serve as a proxy here: they highlight words that are not just frequent but also *distinctive* across documents, which often makes them *informative* for classification or prediction tasks.

**Difference from frequency word analysis:**

* *Word frequency analysis* simply counts how often each word appears, showing the most common terms but not their usefulness for distinguishing between classes.

* *Feature importance (approximated here via TF-IDF)* favors words that are frequent within a document but rare across the corpus, which makes them better candidates for model building and interpretation.

**In short:**

* *Word frequency:* Most common words.
* *Feature importance:* Most influential words for the model’s decisions (approximated here by TF-IDF weight).
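
To show where TF-IDF weights come from, here is a plain-Python sketch on a made-up three-document corpus. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and scaling to unit length, which is my understanding of scikit-learn's default `TfidfVectorizer` settings. The `docs` list and the helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical cleaned documents (illustrative only).
docs = [
    "thyroid cancer thyroid nodules",
    "lung cancer smoking",
    "colon cancer screening",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    tf = Counter(tokens)
    weights = {term: count * smoothed_idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf_vector(tokenized[0])
for term, weight in sorted(first_doc.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus, "cancer" appears in every document and therefore gets the lowest weight in the first document, even though it is the most frequent word overall.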

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['Cleaned_document'])  # TF-IDF Feature Importance (Top Words)
tfidf_scores = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
top_words = tfidf_scores.sum().sort_values(ascending=False).head(30)
top_words.plot(kind='bar', title='Top TF-IDF Words') 

# Observation from Feature Importance

Frequent word analysis highlights the most common terms, while feature importance emphasizes words that carry more meaning or predictive power. Comparing both reveals that some words may be frequent but not necessarily important, and vice versa.

# Correlation Analysis

Correlation analysis measures how strongly two variables are related to each other. In the context of text data, when we select the top TF-IDF features (the most important words or terms across documents), correlation analysis examines how the presence or importance of one term relates to another across the entire dataset.

By calculating the correlation between top TF-IDF features, we can identify which terms tend to occur together (positive correlation) or rarely appear together (negative correlation) across documents.

A heatmap visually displays these correlations, making it easier to spot patterns, clusters, or redundancies among features.

Highly correlated features may indicate redundancy, which can be useful for feature selection or dimensionality reduction. It also helps in understanding relationships between key terms, which can enhance tasks like topic modeling or document classification.
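
The sketch below computes the Pearson correlation by hand for three made-up term columns, to show what a single cell of the heatmap represents. The weights and the `pearson` helper are hypothetical and written in plain Python for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical TF-IDF weights of three terms across five documents.
thyroid = [0.8, 0.7, 0.0, 0.0, 0.1]
ptc = [0.5, 0.6, 0.0, 0.1, 0.0]
colon = [0.0, 0.0, 0.9, 0.7, 0.0]

print(f"thyroid vs ptc:   {pearson(thyroid, ptc):+.2f}")
print(f"thyroid vs colon: {pearson(thyroid, colon):+.2f}")
```

Terms that tend to appear in the same documents come out positive, and terms that rarely share a document come out negative. A column with no variation would cause a division by zero in this simple helper.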

In [None]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)

tfidf = TfidfVectorizer(max_features=50, stop_words='english')
X_tfidf = tfidf.fit_transform(df['Cleaned_document'])
tfidf_df = pd.DataFrame(X_tfidf.toarray(),
                        columns=tfidf.get_feature_names_out())

# Correlation Matrix among features
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
corr_matrix = tfidf_df.corr()

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Top TF-IDF Features')
plt.show()


# Observations from Correlation Heatmap of Top TF-IDF Features

* Most TF-IDF features show low correlation with each other, suggesting they carry largely independent information.
* The diagonal displays perfect correlation (value = 1) as expected.
* A few small clusters of moderate correlation exist, reflecting related domain-specific terms that often co-occur.
* There is no large block of high correlation, suggesting minimal redundancy among features.
* The features capture diverse and meaningful information from the text data.
* Low multicollinearity means linear models are less likely to suffer from unstable coefficients caused by redundant inputs.
* The heatmap suggests that preprocessing produced reasonably distinct textual features, although only the top 50 terms are examined here.
* Small correlated groups may highlight interesting semantic or topical relationships worth exploring further.

**Encode Labels Numerically**

Label encoding is useful in EDA because it converts categorical labels into numeric form, enabling easier analysis, visualization, and statistical summarization of the target variable. This numeric representation also prepares the data for machine learning models, which require numerical inputs to process and learn effectively. Note that the integer codes are assigned in sorted order of the class names and carry no inherent ranking.

In [None]:
le = LabelEncoder()  # Encode Labels Numerically
df['label_num'] = le.fit_transform(df['Label'])

**Feature-Label(target) Correlation Analysis**

Feature-label correlation in EDA helps pinpoint the words most associated with each label, guiding both data understanding and the selection of meaningful features for downstream modeling.

In [None]:
# Feature-Label Correlation
# Add label to tfidf_df for correlation
tfidf_df['label_num'] = df['label_num']

# Correlation of each feature with the label
feature_label_corr = tfidf_df.corr()['label_num'].drop('label_num').sort_values(ascending=False)
print("Top positively correlated words with label:\n", feature_label_corr.head(10))
print("\nTop negatively correlated words with label:\n", feature_label_corr.tail(10))
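
One caveat: `LabelEncoder` assigns integer codes in sorted order of the class names, so a single correlation against `label_num` treats three unordered classes as an ordered scale. The sketch below uses made-up weights to show how a word that clearly signals the middle class can score near zero against the integer codes, while a one-vs-rest indicator per class reveals it. All values and names here are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical labels and TF-IDF weights of a lung-specific word.
labels = ["Colon_Cancer", "Colon_Cancer", "Lung_Cancer",
          "Lung_Cancer", "Thyroid_Cancer", "Thyroid_Cancer"]
lung_weight = [0.0, 0.1, 0.9, 0.8, 0.0, 0.1]

# Integer codes in sorted class order (Colon=0, Lung=1, Thyroid=2).
classes = sorted(set(labels))
codes = [classes.index(label) for label in labels]
ordinal_corr = pearson(lung_weight, codes)
print(f"Correlation with integer codes: {ordinal_corr:+.2f}")

# One-vs-rest: correlate the word with a 0/1 indicator for each class.
one_vs_rest = {}
for cls in classes:
    indicator = [1.0 if label == cls else 0.0 for label in labels]
    one_vs_rest[cls] = pearson(lung_weight, indicator)
    print(f"{cls:15s} {one_vs_rest[cls]:+.2f}")
```

With one-vs-rest indicators, each class gets its own ranking of associated words, which avoids reading meaning into the arbitrary order of the codes.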

# Observation from Correlation Analysis (Feature-Label Correlation)

**Strongest positive correlation:**

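* thyroid (0.72) is strongly and ptc (0.34) moderately positively correlated with the label, so these words point toward the class with the highest label code (Thyroid_Cancer).

**Strongest negative correlation:**

* colon (-0.72) is strongly and cancer (-0.19) weakly negatively correlated, so these words point toward the class with the lowest label code (Colon_Cancer).
* Other words (like patients, risk, therapy, outcomes, cells, etc.) have moderate positive or negative correlations, indicating their varying importance in distinguishing between classes.
* Because the label codes are arbitrary integers, this single correlation mostly contrasts the lowest and highest codes; words specific to the middle class (Lung_Cancer) can score near zero even when they are highly informative.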

I'm skipping dimensionality reduction (PCA) in this notebook.

# Outlier Detection 

Here, outlier detection helps identify documents that are unusually short, long, or have abnormal TF-IDF sums. These documents may be errors, irrelevant, or unrepresentative, and can negatively impact model performance and data quality.

We calculate the total TF-IDF score for each document and use the IQR (Interquartile Range) method to find documents with abnormally high or low TF-IDF sums.

We compute the IQR for document lengths and identify documents that are much shorter or longer.
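
The IQR rule itself can be sketched in a few lines of plain Python. The `quantile` and `iqr_bounds` helpers and the `doc_lengths` values below are hypothetical; the quantile uses linear interpolation between neighbouring ranks.

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between the two closest ranks."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical word counts with one very short and one very long document.
doc_lengths = [40, 200, 220, 240, 250, 260, 280, 300, 700]
low, high = iqr_bounds(doc_lengths)

# Flag outliers instead of dropping them, so they can be reviewed later.
flags = [length < low or length > high for length in doc_lengths]
print(f"bounds: ({low}, {high})")
print("flagged lengths:", [n for n, flagged in zip(doc_lengths, flags) if flagged])
```

Keeping a boolean flag rather than filtering rows preserves the original data while still marking the documents that deserve a closer look.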

In [None]:
tfidf_sums = np.asarray(X_tfidf.sum(axis=1)).ravel()  # Sum TF-IDF values for each document (flattened to a 1-D array)

Q1 = np.percentile(tfidf_sums, 25)  # Calculate IQR for document TF-IDF sums
Q3 = np.percentile(tfidf_sums, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outlier_docs = np.where((tfidf_sums < lower_bound) | (tfidf_sums > upper_bound))[0]  # Identify documents with outlier TF-IDF sums
print(f"Number of documents with outlier TF-IDF sums: {len(outlier_docs)}")

In [None]:
Q1 = df['doc_length'].quantile(0.25)
Q3 = df['doc_length'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Lower bound for outliers: {lower_bound}")
print(f"Upper bound for outliers: {upper_bound}")

In [None]:
short_outliers = df[df['doc_length'] < lower_bound]  # use the IQR bounds computed in the previous cell
long_outliers = df[df['doc_length'] > upper_bound]

print(f"Number of too short documents: {len(short_outliers)}")
print(f"Number of too long documents: {len(long_outliers)}")

# Observation from Outlier Analysis

Out of all documents, only 4 have outlier TF-IDF sums, indicating that most documents have typical content richness. However, 52 documents are unusually short and 20 are unusually long based on document length, suggesting some variability in document size within the dataset.

Most documents fall within normal ranges for both TF-IDF content and length, but a small number of documents are flagged as outliers. 

We can flag these outlier documents and keep them for future review.

# Final EDA Conclusion Report

**Key Findings and Observations**

*Data Quality:* The dataset is generally clean, with most documents falling within normal ranges for both TF-IDF content and document length. Only a small fraction of documents were identified as outliers, which have been flagged for further review.

*Feature Independence:* Correlation analysis among the top 50 TF-IDF features shows low multicollinearity, indicating that features are mostly independent and suitable for modeling.

*Label-Feature Relationships:* Several words, such as thyroid and colon, exhibit strong positive or negative correlations with the target label, providing valuable insights into the most influential terms for classification.

*Class Distribution:* The label value counts showed a balanced class distribution (300 documents per class), with no class imbalance detected.

*Document Length Variability:* While most documents are of typical length, a subset is much shorter or longer, which may warrant further inspection for data consistency.

**Valuable Insights**

* The most predictive words for each class have been identified, supporting both interpretability and targeted feature engineering.
* Flagging outliers helps protect data quality, reducing the risk of noise or bias in downstream modeling.
* The feature set is well-prepared for machine learning, with minimal redundancy and strong interpretability.

**Potential Next Steps**

*Feature Engineering:*
* Explore n-grams or domain-specific keywords to enhance predictive power.
* Consider dimensionality reduction if further simplification is needed.

*Modeling:*
* Proceed with classification models (e.g., logistic regression, random forest, SVM) using the current feature set.
* Evaluate model performance and refine features based on feature importance and validation results.

*Outlier Review:*
* Manually review flagged outlier documents to decide on removal or retention.
* Document any changes made for transparency and reproducibility.
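
As a starting point for the n-gram idea above, here is a small plain-Python sketch that counts bigrams in a made-up corpus. The `ngrams` helper and the `docs` list are hypothetical and only illustrate the technique.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned documents (illustrative only).
docs = [
    "papillary thyroid carcinoma patients",
    "papillary thyroid carcinoma recurrence",
    "non small cell lung cancer",
]

bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc.split(), 2))
print(bigram_counts.most_common(3))
```

Multi-word terms such as "papillary thyroid" carry more specific meaning than their individual words, which is why n-grams can sharpen the features.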

# My Learnings

The EDA process has provided a comprehensive understanding of the dataset’s structure, quality, and key features. The data is well-prepared for modeling, with clear insights into the most influential terms and minimal redundancy. With outliers flagged and features validated, the project is ready to move confidently into the next phase.

It was a great learning experience. I'll come up with a few more notebooks in the future to share my learnings and findings.

Thank you for reading this far. I hope you liked it :)