# Statistical NLP Project - 1
## Author: Sridharan Chandran 

## PART - A

* **DOMAIN:** Digital content management.

* **CONTEXT:** Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing them. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

* **DATA DESCRIPTION:** Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age, but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
  * 8,240 "10s" blogs (ages 13-17),
  * 8,086 "20s" blogs (ages 23-27), and
  * 2,994 "30s" blogs (ages 33-47).

  For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions: individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label url link.

* **PROJECT OBJECTIVE:** To build an NLP classifier that uses the input text to determine the label(s) of a blog. Specific to this case study, the text of the blog (the ‘text’ feature) is the independent variable and ‘topic’ is the dependent variable.

## Steps and tasks

In [1]:
#Import necessary libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, hamming_loss

## 1. Read and Analyse Dataset.

### A. Clearly write outcome of data analysis(Minimum 2 points)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Extract the ZIP file
file_path = "/content/drive/My Drive/Colab Notebooks/My Python Projects/Statistical NLP Project - 1/blogs.zip"

from zipfile import ZipFile
with ZipFile(file_path, 'r') as zf:  # named zf to avoid shadowing the built-in zip()
  zf.extractall(path='/content/drive/My Drive/Colab Notebooks/My Python Projects/Statistical NLP Project - 1/')

In [4]:
# Read the dataset
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/My Python Projects/Statistical NLP Project - 1/blogtext.csv')
df.head()

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


In [5]:
df.shape # Shape of the dataframe

(681284, 7)

In [6]:
df["id"].nunique() # No. of bloggers

19320

In [7]:
df.columns # Features Label

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [8]:
for col in df.columns:
  print(f"Value counts for {col}:")
  print(df[col].value_counts())
  print("================================================") # Check for value counts of each Features


Value counts for id:
449628     4221
734562     2301
589736     2294
1975546    2261
958176     2244
           ... 
3424020       1
4325089       1
3516939       1
3331848       1
3717818       1
Name: id, Length: 19320, dtype: int64
Value counts for gender:
male      345193
female    336091
Name: gender, dtype: int64
Value counts for age:
17    80859
24    80071
23    72889
16    72708
25    67051
26    55312
27    46124
15    41767
14    27400
34    21347
33    17584
35    17462
36    14229
13    13133
37     9317
38     7545
39     5556
40     5016
45     4482
43     4230
41     3738
48     3572
42     2908
46     2733
47     2207
44     2044
Name: age, dtype: int64
Value counts for topic:
indUnk                     251015
Student                    153903
Technology                  42055
Arts                        32449
Education                   29633
Communications-Media        20140
Internet                    16006
Non-Profit                  14700
Engineering              

## Observations:

* **Dataset:** The dataset consists of 681,284 posts written by 19,320 bloggers. Each blog is presented as a separate file and contains information about the blogger's gender, age, industry, and astrological sign (although some of these attributes may be marked as unknown).

* **Multi-label Classification:** The task is framed as a multi-label classification problem, where the goal is to predict features of the author from the given text. Gender, age, industry, and astrological sign are all available as potential labels. For this case study, the project objective names the `topic` column as the dependent variable.

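* **Age Groups:** The corpus description places bloggers in three age groups: "10s" (ages 13-17), "20s" (ages 23-27), and "30s" (ages 33-47), with an equal number of male and female bloggers in each group. In the loaded data the `age` column runs from 13 to 48, and the posts split into 345,193 written by male bloggers and 336,091 written by female bloggers.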

* **Textual Features:** The main independent variable for the classification task is the text of the blog itself. The dataset mentions that each blog includes at least 200 occurrences of common English words. The formatting has been stripped from the text, except for the separation of individual posts by the date of the following post and the presence of URL links.

* **Label Imbalance:** The `topic` labels are heavily imbalanced. In the value counts above, `indUnk` appears on 251,015 posts and `Student` on 153,903, which together cover about 59% of the 681,284 posts, while every other topic shown has roughly 42,000 posts or fewer. The data description also states that industry and astrological sign are marked as unknown for many bloggers, so a large share of the label information is effectively missing.
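To put a number on the imbalance, the sketch below turns three of the topic counts printed by `value_counts()` above into shares of the 681,284 posts. The dictionary is copied by hand from that output and covers only the three largest topics shown; the variable names are illustrative. The share of the largest class is the accuracy a model would reach by always predicting that class, which gives a floor for reading the scores reported later.

```python
# Counts copied from the value_counts() output above (full dataset, before deduplication).
total_posts = 681284
topic_counts = {"indUnk": 251015, "Student": 153903, "Technology": 42055}

# Share of all posts carried by each of these topics.
shares = {topic: count / total_posts for topic, count in topic_counts.items()}
majority_topic = max(shares, key=shares.get)

for topic, share in shares.items():
    print(f"{topic:<12}{share:.1%}")
print("Majority class:", majority_topic)
print("Majority-class baseline accuracy:", round(shares[majority_topic], 4))
```

On the full dataset this baseline is about 0.37; the value for the 30,000-post sample may differ slightly.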

## B. Clean the Structured Data:

### (i). Missing value analysis and imputation

In [9]:
# Check for Missing values
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

**There are no null values in the dataframe.**

In [10]:
df.duplicated().sum() # Check for duplicates

4686

In [11]:
df.drop_duplicates(keep='first',inplace=True) # Drop the duplicate values
df.shape

(676598, 7)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 676598 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      676598 non-null  int64 
 1   gender  676598 non-null  object
 2   age     676598 non-null  int64 
 3   topic   676598 non-null  object
 4   sign    676598 non-null  object
 5   date    676598 non-null  object
 6   text    676598 non-null  object
dtypes: int64(2), object(5)
memory usage: 41.3+ MB


In [13]:
df.describe().T # Five-number summary (plus count, mean and std) of the numeric columns

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 676598.0 | 2395488.0 | 1248401.0 | 5114.0 | 1236636.0 | 2596124.0 | 3526127.0 | 4337650.0 |
| age | 676598.0 | 23.93915 | 7.774925 | 13.0 | 17.0 | 24.0 | 26.0 | 48.0 |


### (ii). Eliminate Non-English textual data.

In [14]:
# Take a random sample of 30,000 posts from the whole dataset
df1 = df.sample(n=30000, random_state=42)

In [15]:
pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993224 sha256=6b1dacb5f4ea0fb648784046610870e3308abfa775af22db46ddbf1f1bf345d6
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [16]:
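from langdetect import detect # Language detection

def is_english(text):
    """Return True if langdetect identifies the text as English."""
    try:
        return detect(text) == 'en'
    except Exception:  # detect() raises when the text has no usable features (e.g. empty)
        return False

df1['is_english'] = df1['text'].apply(is_english)

# Keep English rows only; .copy() avoids chained-assignment warnings on the filtered frame
df1 = df1[df1['is_english']].drop(columns=['is_english']).copy()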

In [17]:
df1.shape

(28747, 7)

## 2. Preprocess unstructured data to make it consumable for model training.

### A. Eliminate All special Characters and Numbers 

In [18]:
import re

def remove_special_chars(text):
    """
    Removes all special characters and numbers from the given text
    """
    # Replace all non-alphabetic characters with space
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text


In [19]:
# Apply remove_special_chars() function to 'text' column of our dataset
df1['text'] = df1['text'].apply(remove_special_chars)

In [20]:
df1['text'].head()

255678     ahhh hayo i dunt believe we are in yr carazy ...
606383     For the first time since the end of school I ...
40902      Wow today has been HECTIC to say the least We...
626648     Celebrating posts of Jonah Hosted by Laurence...
71623      Man I ve been having some majorly trippy drea...
Name: text, dtype: object

### B. Lowercase all textual data

In [21]:
df1['text'] = df1['text'].apply(lambda x: x.lower()) # Convert the words to lowercase

In [22]:
df1['text'].head()

255678     ahhh hayo i dunt believe we are in yr carazy ...
606383     for the first time since the end of school i ...
40902      wow today has been hectic to say the least we...
626648     celebrating posts of jonah hosted by laurence...
71623      man i ve been having some majorly trippy drea...
Name: text, dtype: object

### C. Remove all Stopwords

In [23]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [24]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

def remove_stopwords(text):
    # The text was already lowercased in step B, so words can be compared directly
    return " ".join(word for word in text.split() if word not in stop_words)

df1['text'] = df1['text'].apply(remove_stopwords)

### D. Remove all extra white spaces

In [25]:
# Remove extra white spaces (raw string avoids an invalid-escape warning for \s)
df1['text'] = df1['text'].apply(lambda x: re.sub(r'\s+', ' ', x).strip())

df1['text'].head()

255678    ahhh hayo dunt believe yr carazy yes well weir...
606383    first time since end school met brandon sean p...
40902     wow today hectic say least busy year hope sign...
626648    celebrating posts jonah hosted laurence fishbu...
71623     man majorly trippy dreams past week two last n...
Name: text, dtype: object
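Steps A to D can also be folded into a single function, which makes it easier to apply the identical cleaning to new text at prediction time. The sketch below uses only the standard library; `clean_text` and `DEMO_STOP_WORDS` are illustrative names, and the tiny stop-word set is a stand-in for the full nltk English list used in this notebook.

```python
import re

# Tiny illustrative stop-word set; the notebook uses nltk's full English list instead.
DEMO_STOP_WORDS = frozenset({"the", "a", "an", "is", "are", "in", "we", "i", "to", "of"})

_NON_ALPHA = re.compile(r"[^A-Za-z]+")

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Strip non-letters, lowercase, drop stop words and squeeze whitespace."""
    letters_only = _NON_ALPHA.sub(" ", text)
    tokens = letters_only.lower().split()  # split() also collapses repeated whitespace
    return " ".join(tok for tok in tokens if tok not in stop_words)

print(clean_text("Wow!! Today is the 3rd HECTIC day in 2004..."))
# prints: wow today rd hectic day
```

The leftover `rd` from `3rd` shows one side effect of removing digits: fragments of mixed tokens survive as short words.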

## 3. Build a base Classification model

### A. Create dependent and independent variables

In [26]:
X = df1['text'] # Independent variables 
y = df1['topic'] # Dependent variables 

### B. Split data into train and test.

In [27]:
# Split data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42) # Taken 30% of the data for testing

print("X_train shape is", X_train.shape)
print("X_test shape is", X_test.shape)

X_train shape is (20122,)
X_test shape is (8625,)
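The split above is purely random, so with labels as skewed as `topic` the rare classes may be spread unevenly between train and test. One option, not used in this notebook, is a stratified split through the `stratify` argument of `train_test_split`. The sketch below uses a made-up 70/20/10 label mix in place of `df1`. Stratification needs at least two samples per class, so very rare topics would have to be merged or dropped first.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in for df1['text'] / df1['topic'] with a 70% / 20% / 10% class mix.
texts = [f"post {i}" for i in range(100)]
topics = ["indUnk"] * 70 + ["Student"] * 20 + ["Technology"] * 10

# stratify=topics keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, topics, test_size=0.30, random_state=42, stratify=topics
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```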


### C. Vectorize data using any one vectorizer.

In [28]:
!pip install --upgrade scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize data using TfidfVectorizer
vectorizer = TfidfVectorizer()

In [30]:
X_train_vec = vectorizer.fit_transform(X_train) # Train vector

In [31]:
X_test_vec = vectorizer.transform(X_test) # Test vector

### D. Build a base model for Supervised Learning - Classification. 

In [32]:
#Import Naive bayes algorithm
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes
nb_model = MultinomialNB()

In [33]:
# Train the Naive bayes model

nb_model.fit(X_train_vec, y_train)
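The vectorizer and the classifier can also be chained in a scikit-learn pipeline, so that raw text goes in and a prediction comes out, and the vocabulary is always learned from the training data only. This is an alternative arrangement, not what the cells above do; the four training sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus standing in for X_train / y_train.
train_texts = [
    "homework exam school teacher class",
    "school exam class homework friends",
    "server code software bug release",
    "software code server deploy bug",
]
train_topics = ["Student", "Student", "Technology", "Technology"]

# fit() learns the TF-IDF vocabulary and the Naive Bayes parameters in one call.
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_topics)

predicted = text_clf.predict(["exam homework tomorrow", "deploy the server code"])
print(predicted)
```

Swapping `TfidfVectorizer()` for another vectorizer, or `MultinomialNB()` for another classifier, then changes one argument rather than several cells.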

### E. Clearly print Performance Metrics.

In [34]:
# Make predictions on the test set
y_pred = nb_model.predict(X_test_vec)

# Print the performance metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='micro'))
print("Recall:", recall_score(y_test, y_pred, average='micro'))
print("F1-score:", f1_score(y_test, y_pred, average='micro'))
print("Hamming Loss:", hamming_loss(y_test, y_pred))

Accuracy: 0.3705507246376812
Precision: 0.3705507246376812
Recall: 0.3705507246376812
F1-score: 0.3705507246376812
Hamming Loss: 0.6294492753623189
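The four identical scores above are expected: for a single-label, multi-class problem, micro-averaged precision, recall and F1 all reduce to accuracy, and the Hamming loss is one minus accuracy. Macro averaging, which weights every class equally, is more informative on imbalanced labels. The sketch below illustrates this on invented labels where the classifier always predicts the majority class; `zero_division=0` silences the warning for classes that are never predicted.

```python
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Toy labels: a classifier that always predicts the majority class.
y_true = ["indUnk"] * 4 + ["Student", "Technology"]
y_hat = ["indUnk"] * 6

acc = accuracy_score(y_true, y_hat)
micro_f1 = f1_score(y_true, y_hat, average="micro")
macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)
ham = hamming_loss(y_true, y_hat)

print(f"accuracy={acc:.3f} micro-F1={micro_f1:.3f} "
      f"macro-F1={macro_f1:.3f} hamming={ham:.3f}")
```

Here accuracy and micro-F1 are both 4/6, while macro-F1 drops to about 0.27 because two of the three classes are never predicted correctly.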


## 4. Improve Performance of model.

### A. Experiment with other vectorisers.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

In [36]:
X_train_count.shape

(20122, 86364)

In [37]:
X_test_count.shape

(8625, 86364)


### B. Build classifier Models using other algorithms than base model. 

In [38]:
#Import RandomForest algorithm
from sklearn.ensemble import RandomForestClassifier

# Random Forest
rf_model = RandomForestClassifier()

In [39]:
# Train the Random Forest model on the count vectors
rf_model.fit(X_train_count, y_train)

In [40]:
from sklearn.svm import SVC

# Create and train the SVM classifier
svm_model = SVC()

svm_model.fit(X_train_count, y_train)

### C. Tune Parameters/Hyperparameters of the model/s. 

### D. Clearly print Performance Metrics.

In [41]:
# Random search for parameter tuning
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15]
}

In [42]:
# Perform randomized search
random_search = RandomizedSearchCV(
    rf_model,
    param_distributions=param_grid,
    random_state=42)

In [43]:
# Fit the model on the training set
random_search.fit(X_train_count, y_train)

In [44]:
# Get the best model after parameter tuning
best_rf_model = random_search.best_estimator_
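Before evaluating the tuned model it helps to look at what the search selected and how it scored in cross-validation. The sketch below shows the relevant attributes on a small synthetic dataset from `make_classification`; the data, the parameter values and the explicit `n_iter`, `cv` and `scoring` settings are illustrative choices, not the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the vectorized training data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=42
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [5, 10, None]},
    n_iter=4,            # number of sampled combinations (default is 10)
    cv=3,                # 3-fold cross-validation (default is 5)
    scoring="accuracy",
    random_state=42,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
```

Including `None` among the `max_depth` candidates lets the search compare depth-limited trees against unrestricted ones.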

In [45]:
y_pred_rf = rf_model.predict(X_test_count)

# RandomForest Metrics
print("Random Forest Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Precision:", precision_score(y_test, y_pred_rf, average='micro'))
print("Recall:", recall_score(y_test, y_pred_rf, average='micro'))
print("F1-score:", f1_score(y_test, y_pred_rf, average='micro'))


Random Forest Metrics:
Accuracy: 0.3770434782608696
Precision: 0.3770434782608696
Recall: 0.3770434782608696
F1-score: 0.3770434782608696


In [46]:
y_pred_svm = svm_model.predict(X_test_count)

# SVM Classifier Metrics
print("SVM Classifier Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Precision:", precision_score(y_test, y_pred_svm, average='micro'))
print("Recall:", recall_score(y_test, y_pred_svm, average='micro'))
print("F1-score:", f1_score(y_test, y_pred_svm, average='micro'))

SVM Classifier Metrics:
Accuracy: 0.38168115942028985
Precision: 0.38168115942028985
Recall: 0.38168115942028985
F1-score: 0.38168115942028985


In [47]:
y_pred_rf_cv = best_rf_model.predict(X_test_count)

# Tuned RandomForest Metrics
print("Tuned RandomForest Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf_cv))
print("Precision:", precision_score(y_test, y_pred_rf_cv, average='micro'))
print("Recall:", recall_score(y_test, y_pred_rf_cv, average='micro'))
print("F1-score:", f1_score(y_test, y_pred_rf_cv, average='micro'))

Tuned RandomForest Metrics:
Accuracy: 0.368463768115942
Precision: 0.368463768115942
Recall: 0.368463768115942
F1-score: 0.36846376811594206


In [50]:
# Clear backend session
from keras import backend as K
import gc
K.clear_session()
gc.collect()

6542

**Let's try a deep learning algorithm**

In [51]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, BatchNormalization, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras import optimizers

# Preprocess the labels
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(y)

# Tokenize the texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)

# Pad sequences to the same length
max_sequence_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_sequence_length)

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(sequences, labels, test_size=0.2, random_state=42)

# Define the LSTM model
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=100, input_length=max_sequence_length))
model.add(LSTM(32, return_sequences=True))
model.add(BatchNormalization())
model.add(Dropout(0.2))  # Add dropout regularization
model.add(LSTM(32))
model.add(BatchNormalization())
model.add(Dropout(0.2))  # Add dropout regularization
model.add(Dense(len(np.unique(labels)), activation='softmax'))


# Adam optimizer with a custom learning rate
opt = optimizers.Adam(learning_rate=0.01)

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f9f9349a620>

In [52]:

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print("Test Loss:", loss)
print("Test Accuracy:", accuracy)

Test Loss: 2.3356997966766357
Test Accuracy: 0.3662608563899994
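The model above pads every post to the length of the longest one, so a single very long post inflates the input size for the whole dataset. A common alternative, assumed here rather than taken from this notebook, is to cap the length at a percentile of the length distribution and pass that value as `maxlen` to `pad_sequences`, which truncates longer sequences. The helper name `choose_max_len` and the toy sequences are illustrative.

```python
import numpy as np

def choose_max_len(sequences, percentile=95):
    """Return the sequence length at the given percentile of the length distribution."""
    lengths = np.array([len(seq) for seq in sequences])
    return int(np.percentile(lengths, percentile))

# Toy token-id sequences: 99 short posts and one very long outlier.
toy_sequences = [[1] * 10 for _ in range(99)] + [[1] * 5000]

longest = max(len(seq) for seq in toy_sequences)
capped = choose_max_len(toy_sequences)
print("longest:", longest, "95th percentile:", capped)
```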


## 5. Share insights on relative performance comparison


### A. Which vectorizer performed better? Probable reason?

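*The models trained on CountVectorizer features scored slightly higher than the model trained on TF-IDF features.* Random Forest and SVC on raw counts reached accuracies of 0.377 and 0.382, against 0.371 for Naive Bayes on TF-IDF. Because each vectorizer was paired with different classifiers, the gap cannot be attributed to the vectorizer alone.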

Probable reasons are below,

**1. Task Requirements:** CountVectorizer may have been more suitable for the task's requirements. Depending on the nature of the classification problem, word frequency and occurrence might play a crucial role in determining the labels or topics. CountVectorizer directly captures this information, making it a better choice if the task primarily relies on word counts rather than the rarity of terms.

**2. Word Importance:** In certain scenarios, the importance of words may be better reflected by their frequency rather than their rarity. For instance, if your classification task involves identifying specific topics or distinguishing certain patterns based on common words, CountVectorizer's emphasis on frequency can capture those patterns effectively.

**3. Domain-Specific Considerations:** The nature of your specific domain or dataset may favor CountVectorizer. For example, if your dataset consists of informal texts such as social media posts or blogs where specific words or phrases are strongly associated with the target labels, CountVectorizer's focus on word frequency might align better with the underlying patterns in the data.

**4. Document Length:** CountVectorizer keeps raw counts, so longer documents produce larger vectors, whereas scikit-learn's TfidfVectorizer L2-normalizes each document vector by default and so removes most of that length signal. If longer documents in your dataset contain more informative content or exhibit stronger patterns, CountVectorizer's raw word counts might capture that signal better than the normalized TF-IDF representation.
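A like-for-like check of the two vectorizers keeps the classifier fixed and changes only the text representation. The sketch below does this with Naive Bayes on an invented toy corpus; `compare_vectorizers` is a hypothetical helper, and the scores it prints say nothing about the blog data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Fit the same classifier on each representation; return test accuracy per vectorizer."""
    scores = {}
    for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        clf = make_pipeline(vectorizer, MultinomialNB())
        clf.fit(train_texts, train_labels)
        scores[name] = clf.score(test_texts, test_labels)
    return scores

# Invented toy corpus, for illustration only.
train_texts = [
    "homework exam school teacher class",
    "school exam class homework friends",
    "server code software bug release",
    "software code server deploy bug",
]
train_labels = ["Student", "Student", "Technology", "Technology"]
test_texts = ["exam homework tomorrow", "deploy the server code"]
test_labels = ["Student", "Technology"]

scores = compare_vectorizers(train_texts, train_labels, test_texts, test_labels)
print(scores)
```

Running the same comparison with `X_train`, `y_train`, `X_test` and `y_test` from this notebook would show whether the vectorizer itself makes a difference.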


### B. Which model outperformed? Probable reason? 


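*The Support Vector Classifier (SVC) performed best, with a test accuracy of 0.382, ahead of the default Random Forest (0.377), Naive Bayes (0.371), the tuned Random Forest (0.368) and the LSTM (0.366).*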

Probable reasons are below,

**1. Handling complex decision boundaries:** SVC can model complex decision boundaries through the kernel trick, which implicitly maps the input data into a higher-dimensional feature space. This is an advantage when the classes are not linearly separable. Naive Bayes, by contrast, assumes that features are conditionally independent given the class, and Random Forest is an ensemble of decision trees that each split on one feature at a time, so both may struggle to capture complex relationships between words.

**2. Robustness to noise:** SVC can be robust to noisy data because it aims to find a decision boundary with the maximum margin of separation between classes. This property helps in reducing the influence of noisy or outlier data points. In contrast, RF models can be sensitive to noise since they construct decision trees based on impurity measures.

**3. Limited training data:** SVC can perform well with limited training data, especially when combined with appropriate kernel functions. This is particularly useful in scenarios where the dataset is small, and the other models may suffer from overfitting. RF and LSTM models, in particular, might require more data to generalize effectively.

**4. Class imbalance:** When some classes have far fewer samples than others, SVC can compensate through its `class_weight` parameter, which scales the misclassification penalty per class. The `SVC()` used here keeps the default setting (no class weighting), so this is a possible improvement rather than an explanation of the current result. RF and NB models may likewise struggle with imbalanced data if it is not explicitly addressed.
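As a rough illustration of the class weighting mentioned in point 4, the sketch below computes the weights that scikit-learn's `"balanced"` mode assigns, using an invented 70/20/10 label mix. Passing `class_weight="balanced"` to `SVC` applies the same formula, `n_samples / (n_classes * class_count)`, internally; this was not done in the notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented label mix standing in for y_train.
labels = np.array(["indUnk"] * 70 + ["Student"] * 20 + ["Technology"] * 10)
classes = np.unique(labels)  # sorted: Student, Technology, indUnk

# "balanced" weights are n_samples / (n_classes * count of that class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_by_class = dict(zip(classes, weights))

for name, weight in weight_by_class.items():
    print(f"{name:<12}{weight:.3f}")
```

Rare classes receive weights above 1 and the majority class a weight below 1, so errors on rare topics cost more during training.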

### C. Which parameter/hyperparameter significantly helped to improve performance? Probable reason?


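The hyperparameters tuned were the Random Forest's number of trees (`n_estimators`) and maximum depth (`max_depth`). In this run the tuned model reached 0.368 accuracy against 0.377 for the default forest, so the searched ranges did not improve performance. A likely cause is the depth cap of 5 to 15, since the default forest grows unrestricted trees. These two remain the hyperparameters most likely to affect the result.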

Probable reasons are below,

**1. Number of Trees (Random Forest):** In Random Forest, the number of trees in the ensemble is a critical hyperparameter. Increasing the number of trees can improve performance up to a certain point, as it allows for more robust and accurate predictions by reducing overfitting. However, setting the number of trees too high can lead to increased computational complexity without significant gains in performance.

**2. Maximum Depth (Decision Trees):** For models that use decision trees, such as Random Forest or Gradient Boosting, the maximum depth of each tree is an important parameter. A deeper tree can capture more complex relationships in the data but may also increase the risk of overfitting. Finding the right balance between capturing sufficient complexity and preventing overfitting is crucial for optimal performance.

### D. According to you, which performance metric should be given most importance, why?

*According to me, **F1-Score** should be given most importance.*

And the reasons are below,

**1. Balanced measure:** The F1 score provides a balanced measure of a model's performance by considering both precision and recall. It takes into account both the ability to correctly predict positive instances (precision) and the ability to identify all positive instances (recall). By combining these two metrics into a single score, the F1 score provides a comprehensive evaluation of the model's performance.

**2. Suitable for imbalanced datasets:** The F1 score is particularly useful when dealing with imbalanced datasets, where some classes have far more instances than others. In such cases, accuracy may be misleading, as a model can achieve high accuracy by simply predicting the majority class most of the time. F1 is built from true positives, false positives and false negatives only, so a model cannot score well just by getting the abundant negatives right. With more than two classes the averaging mode matters: macro-averaged F1 weights every class equally, while micro-averaged F1 on single-label data equals accuracy.

**3. Emphasis on errors:** The F1 score places equal importance on false positives and false negatives. This suits problems where both kinds of error cost about the same. When the costs differ, the weighting should change: in spam email detection, for example, a false positive (classifying a legitimate message as spam) can hide an important message, so it may be worth tolerating more false negatives (spam reaching the inbox). In such cases precision, recall, or a weighted F-beta score can be reported alongside F1.

**4. Single scalar value:** The F1 score condenses precision and recall into a single scalar value, making it easy to compare and rank different models. It simplifies the evaluation process by providing a unified measure of performance, allowing for straightforward comparisons and decision-making.
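For reference, the standard per-class definitions behind the points above are given below, with $TP$, $FP$ and $FN$ denoting true positives, false positives and false negatives. The last form shows that true negatives do not enter the score.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```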