### CSC 820 NLT
### HW 10
- Khalid Mehtab Khan
- SFSU ID: 923673423

## Reading the training data
- The train csv file is present in the same directory as the notebook
- we use pandas to read the training data


In [1]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# The training csv file is present in the same folder as the notebook
df = pd.read_csv('train.csv')

# Drop rows with missing values
df.dropna(axis=0)

# Setting the index to the id column
df.set_index('id', inplace = True)

df.head()

Unnamed: 0_level_0,text,author
id,Unnamed: 1_level_1,Unnamed: 2_level_1
id26305,"This process, however, afforded me no means of...",EAP
id17569,It never once occurred to me that the fumbling...,HPL
id11008,"In his left hand was a gold snuff box, from wh...",EAP
id27763,How lovely is spring As we looked from Windsor...,MWS
id12958,"Finding nothing else, not even gold, the Super...",HPL


- We can see that the data contains rows with some some text and the author corresponding to each text
- Each text can be mapped using a unique id assigned to the text

## Basic Text Pre-Processing 
- that does not depend upon the type of data
- Also
- this also includes, some new feature creations like

### Original features
- count of words
- length of text
- avg length of words
- count of commas etc

### Additonal Features
- Nouns
- Verbs
- Adjective
counts of the text

- These nnewly created columns will be helpful and can be used as features for model 2

- Importing nltk tools like tokenizer and pos tagging to preprocess the text data

In [3]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import Counter

import nltk
# ddownload punkt and averaged_perceptron_tagger
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

stopWords = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/khalidkhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/khalidkhan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## function processing()
- The precessing function will be applied to all text values in the data set
- It preprocess the text to
    - convert to lower case
    - tokenize
    - length
    - calculate number of words
    - calculate words that are not stop words
    - average length
    - number of commas

- The function also calls get_pos_counts to process nouns verbs and adjectives


## function get_pos_counts()
- The function uses punkt and nltk pos tagging to count the number of
    - Nouns in the text
    - Verbs 
    - and adjectives


In [4]:
# POS tagging function to calculate the number of nouns, verbs and adjectives in a sentence
# This function will be used to create new features for STEP 2 of the model
def get_pos_counts(text):
    tags = pos_tag(word_tokenize(text))
    counts = Counter(tag for word, tag in tags)
    # Defining parts of speech we are interested in
    nouns = ['NN', 'NNS', 'NNP', 'NNPS']
    verbs = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    adjectives = ['JJ', 'JJR', 'JJS']
    # Counting occurrences of the desired parts of speech
    noun_count = sum(counts[tag] for tag in nouns)
    verb_count = sum(counts[tag] for tag in verbs)
    adjective_count = sum(counts[tag] for tag in adjectives)
    return noun_count, verb_count, adjective_count

def processing(df):
    # Lowering and removing punctuation
    df['processed'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x.lower()))
    
    # Numerical feature engineering
    # Total length of sentence
    df['length'] = df['processed'].apply(lambda x: len(x))
    # Get number of words
    df['words'] = df['processed'].apply(lambda x: len(x.split(' ')))
    df['words_not_stopword'] = df['processed'].apply(lambda x: len([t for t in x.split(' ') if t not in stopWords]))
    # Get the average word length
    df['avg_word_length'] = df['processed'].apply(lambda x: np.mean([len(t) for t in x.split(' ') if t not in stopWords]) if len([len(t) for t in x.split(' ') if t not in stopWords]) > 0 else 0)
    # Count commas
    df['commas'] = df['text'].apply(lambda x: x.count(','))

    # Adding POS counts
    pos_counts = df['processed'].apply(get_pos_counts)
    df['nouns'], df['verbs'], df['adjectives'] = zip(*pos_counts)
    
    return df


df = processing(df)
df.head()


Unnamed: 0_level_0,text,author,processed,length,words,words_not_stopword,avg_word_length,commas,nouns,verbs,adjectives
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
id26305,"This process, however, afforded me no means of...",EAP,this process however afforded me no means of a...,224,41,21,6.380952,4,12,6,2
id17569,It never once occurred to me that the fumbling...,HPL,it never once occurred to me that the fumbling...,70,14,6,6.166667,0,2,2,1
id11008,"In his left hand was a gold snuff box, from wh...",EAP,in his left hand was a gold snuff box from whi...,195,36,19,5.947368,4,10,4,5
id27763,How lovely is spring As we looked from Windsor...,MWS,how lovely is spring as we looked from windsor...,202,34,21,6.47619,3,10,5,6
id12958,"Finding nothing else, not even gold, the Super...",HPL,finding nothing else not even gold the superin...,170,27,16,7.1875,2,6,6,1


- we can see the dataframe now has processed text and new feature columns

## Modeling
- Dividing the columns into predictors and target variables

- Further divinding the data into Training and Test Sets
- we use scikit lean's train_test_split feature

### Splitting the training data
- We divide the data in 2 parts
- 2/3 parts of the data is used as training data
- 1/3 of the data is used as text data


In [5]:
from sklearn.model_selection import train_test_split

features= [c for c in df.columns.values if c  not in ['id','text','author']]
numeric_features= [c for c in df.columns.values if c  not in ['id','text','author','processed']]
target = 'author'

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)
X_train.head()

Unnamed: 0_level_0,processed,length,words,words_not_stopword,avg_word_length,commas,nouns,verbs,adjectives
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
id04058,he must have spoken of some peculiarity in thi...,52,10,4,6.25,0,2,2,0
id11688,the arts of life and the discoveries of scienc...,215,37,20,6.75,3,12,6,0
id06232,once idris named me casually a frown a convuls...,534,103,45,6.111111,8,24,18,9
id27887,the time will soon come grief and famine have ...,327,57,31,6.354839,4,15,11,4
id27361,the great stone city rlyeh with its monoliths ...,215,38,21,5.761905,4,10,6,4


In [6]:
y_test

id
id15695    EAP
id07954    MWS
id16303    MWS
id07932    EAP
id20875    HPL
          ... 
id03632    EAP
id16541    MWS
id14308    HPL
id21303    HPL
id01019    EAP
Name: author, Length: 3916, dtype: object

Now for the tricky parts.

First thing I want to do is define how to process my variables. The standard preprocessing apply the same preprocessing to the whole dataset, but in cases where you have heterogeneous data, this doesn't quite work. So first thing I'm going to do is create a selector transformer that simply returns the one column in the dataset by the key value I pass. 

I was having difficulty getting the selector to play nicely, so I made two different selectors for either text or numeric columns. The return type is different, but other than that they work the same.

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    


## Pipeline 1
### the text pipeline
- as it was text we apply tfidf vectorizer and find values for our Xi

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

text = Pipeline([
                ('selector', TextSelector(key='processed')),
                ('tfidf', TfidfVectorizer( stop_words='english'))
            ])

text.fit_transform(X_train)

<15663x23011 sparse matrix of type '<class 'numpy.float64'>'
	with 176590 stored elements in Compressed Sparse Row format>

## Pipeline 2
### the length pipeline
- we convert the length of word in a standard scalar

In [9]:
from sklearn.preprocessing import StandardScaler

length =  Pipeline([
                ('selector', NumberSelector(key='length')),
                ('standard', StandardScaler())
            ])

length.fit_transform(X_train)

array([[-0.8650563 ],
       [ 0.64575958],
       [ 3.60250967],
       ...,
       [-0.4572287 ],
       [-0.14208919],
       [-0.39234704]])

# Pipelines for every feature
- Rest of the pipelines

- These piplines will be used for the first part of the model
- In this case we use columns - 
- processed, length, words, words_not_stopword, avg_length and commas
- We dont use the nouns, verbs, and adjactives in this case

In [10]:
words =  Pipeline([
                ('selector', NumberSelector(key='words')),
                ('standard', StandardScaler())
            ])
words_not_stopword =  Pipeline([
                ('selector', NumberSelector(key='words_not_stopword')),
                ('standard', StandardScaler())
            ])
avg_word_length =  Pipeline([
                ('selector', NumberSelector(key='avg_word_length')),
                ('standard', StandardScaler())
            ])
commas =  Pipeline([
                ('selector', NumberSelector(key='commas')),
                ('standard', StandardScaler()),
            ])

- But, pipielines for the nouns, verbs, and adjectives can be created 
- The best part about using pipelines is that these pipelines can be intergrated seperately with a model 

Additional features 
- Nouns, verbs and adjectives

In [11]:
nouns =  Pipeline([
                ('selector', NumberSelector(key='nouns')),
                ('standard', StandardScaler())
            ])
verbs =  Pipeline([
                ('selector', NumberSelector(key='verbs')),
                ('standard', StandardScaler())
            ])

adjectives =  Pipeline([
                ('selector', NumberSelector(key='adjectives')),
                ('standard', StandardScaler())
            ])



# Model 
- To create our second model lets integrate all the pipelines using a feature union
### This feature union contains features - feats version 2
    - text
    - length
    - words
    - words_not_stop_words
    - avf_word_length
    - commas

### Also,
    - nouns
    - verbs
    - adjectives

In [12]:
feats_v2 = FeatureUnion([
                    ('text', text), 
                    ('length', length),
                    ('words', words),
                    ('words_not_stopword', words_not_stopword),
                    ('avg_word_length', avg_word_length),
                    ('commas', commas),
                    ('nouns', nouns),
                    ('verbs', verbs),
                    ('adjectives', adjectives)  
                    ])

feature_processing = Pipeline([('feats', feats_v2)])
feature_processing.fit_transform(X_train)

<15663x23019 sparse matrix of type '<class 'numpy.float64'>'
	with 301894 stored elements in Compressed Sparse Row format>

- We add a logistic Regression model at the end of this feature union aswell

In [13]:
model = Pipeline([
    ('features', feats_v2),
    ('classifier', LogisticRegression(solver='saga', max_iter=1000, tol=1e-3)),
])

model.fit(X_train, y_train)

In [14]:
preds = model.predict(X_test)
np.mean(preds == y_test)

0.6927987742594485

### Observations
- We can see that the accuracy of model has decresed after adding three extra features nouns, verbs and adjectives
- The accuracy dropped from 78.16 % to 77.87%

### Classification Report

In [15]:
report = classification_report(y_test,preds)
print(report)

              precision    recall  f1-score   support

         EAP       0.65      0.81      0.72      1570
         HPL       0.69      0.65      0.67      1071
         MWS       0.77      0.59      0.67      1275

    accuracy                           0.69      3916
   macro avg       0.71      0.68      0.69      3916
weighted avg       0.70      0.69      0.69      3916



## K Folds Cross Validation

In [16]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [17]:
X_train

Unnamed: 0_level_0,processed,length,words,words_not_stopword,avg_word_length,commas,nouns,verbs,adjectives
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
id04058,he must have spoken of some peculiarity in thi...,52,10,4,6.250000,0,2,2,0
id11688,the arts of life and the discoveries of scienc...,215,37,20,6.750000,3,12,6,0
id06232,once idris named me casually a frown a convuls...,534,103,45,6.111111,8,24,18,9
id27887,the time will soon come grief and famine have ...,327,57,31,6.354839,4,15,11,4
id27361,the great stone city rlyeh with its monoliths ...,215,38,21,5.761905,4,10,6,4
...,...,...,...,...,...,...,...,...,...
id10932,letting go then his hold upon the rod placing ...,249,49,23,5.173913,8,9,9,1
id22499,his name was john raymond legrasse and he was ...,82,15,7,6.857143,1,6,2,1
id03099,the manner in which wyatt received this harmle...,96,17,7,7.000000,2,2,4,2
id11288,she first assured him of her boundless confide...,130,25,10,6.400000,1,1,4,2


In [18]:
y_train

id
id04058    EAP
id11688    MWS
id06232    MWS
id27887    MWS
id27361    HPL
          ... 
id10932    EAP
id22499    HPL
id03099    EAP
id11288    MWS
id14541    EAP
Name: author, Length: 15663, dtype: object

# K = 2 folds

In [19]:
feats_v2 = FeatureUnion([
                    ('text', text), 
                    ('length', length),
                    ('words', words),
                    ('words_not_stopword', words_not_stopword),
                    ('avg_word_length', avg_word_length),
                    ('commas', commas),
                    ('nouns', nouns),
                    ('verbs', verbs),
                    ('adjectives', adjectives)  
                    ])

In [20]:
model = Pipeline([
    ('features', feats_v2),
    ('classifier', LogisticRegression(solver='saga', max_iter=1000, tol=1e-3)),
])

In [21]:
model.fit(X_train, y_train)

In [22]:
k = 2
kf = KFold(n_splits=k, shuffle=True, random_state=1)


In [23]:
scores_2 = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')

print("Accuracy scores for each fold: ", scores_2)
print("Mean cross-validation accuracy: ", scores_2.mean())

Accuracy scores for each fold:  [0.65091931 0.72213   ]
Mean cross-validation accuracy:  0.6865246507913796


# K = 10 folds

In [24]:
k = 10 
kf = KFold(n_splits=k, shuffle=True, random_state=1)

In [25]:
scores_10 = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')

print("Accuracy scores for each fold: ", scores_10)
print("Mean cross-validation accuracy: ", scores_10.mean())

Accuracy scores for each fold:  [0.68474793 0.69304403 0.69623484 0.69667944 0.67688378 0.70817369
 0.73180077 0.72413793 0.67816092 0.71136654]
Mean cross-validation accuracy:  0.7001229867942012


# K = 20 folds

In [26]:
k = 20
kf = KFold(n_splits=k, shuffle=True, random_state=1)

In [27]:
scores_20 = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy')

print("Accuracy scores for each fold: ", scores_20)
print("Mean cross-validation accuracy: ", scores_20.mean())

Accuracy scores for each fold:  [0.67729592 0.69132653 0.72321429 0.67049808 0.70498084 0.69093231
 0.70881226 0.69348659 0.69093231 0.67177522 0.71392082 0.70881226
 0.71136654 0.74329502 0.73307791 0.71902937 0.67177522 0.69476373
 0.71902937 0.70881226]
Mean cross-validation accuracy:  0.7023568431203898


## Accuracy at different K values
- K = 2: Average accuracy was about 68.6%. 
- K = 10: Average accuracy improved to about 70.0%. This setting provided a more stable estimate of model performance as it averaged the scores over more folds.
- K = 20: Average accuracy slightly increased to about 70.2%. With even more folds, this setting is likely the most reliable for estimating how well the model performs across different subsets of the data.


## Feature importance analysis / Model interpretation:

In [28]:
tfidf_vectorizer = model.named_steps['features'].transformer_list[0][1].named_steps['tfidf']
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()


- getting the learnt coefficient to understand how each feature is weighed in the model

In [37]:
coefficients = model.named_steps['classifier'].coef_

- sorting feature for each individual class

In [30]:
classes = model.named_steps['classifier'].classes_
for class_index, class_name in enumerate(classes):
    # Combine the coefficients with the tfidf feature names
    class_coefs = sorted(zip(coefficients[class_index], tfidf_feature_names), key=lambda x: x[0], reverse=True)
    print(f"Class: {class_name}")
    print("10 Most Important Features:", class_coefs[:10])
    print("10 Least Important Features:", class_coefs[-10:])


Class: EAP
10 Most Important Features: [(1.3738116591011313, 'say'), (1.2805905767092105, 'mr'), (1.0344297281592365, 'said'), (0.9384188647058324, 'matter'), (0.8899202728674972, 'fact'), (0.8832886700970661, 'point'), (0.8568427549078933, 'madame'), (0.8133237742261663, 'having'), (0.80476426589306, 'feet'), (0.8025350649847046, 'minutes')]
10 Least Important Features: [(-0.7243588360339219, 'idris'), (-0.7363891794203887, 'men'), (-0.7795040391155474, 'fear'), (-0.7848478875398582, 'love'), (-0.8328874130555068, 'come'), (-0.9239651223436246, 'life'), (-0.9518500693481041, 'adrian'), (-1.0062108311831985, 'father'), (-1.1113381940919127, 'perdita'), (-1.4657245126220815, 'raymond')]
Class: HPL
10 Most Important Features: [(1.2039898699483353, 'old'), (1.097504420714733, 'street'), (1.0558641774149145, 'west'), (1.0214067772423228, 'men'), (1.0033403695208793, 'told'), (0.9797615907245338, 'things'), (0.948978425723086, 'thing'), (0.8537234965604917, 'gilman'), (0.837160343237405, 'l

## Predcitor Weights
- nouns
- verbs
- adjectives

In [41]:
feature_processing = model.named_steps['features']

custom_features = ['nouns', 'verbs', 'adjectives']
custom_feature_indices = [i for i, feature_name in enumerate(custom_features)]

# Retrieve coefficients for each class
for class_index, class_name in enumerate(classes):
    print(f"Class {class_name}:")
    for idx, feature_name in zip(custom_feature_indices, custom_features):
        print(f"Weight for {feature_name}: {coefficients[class_index][idx]}")


Class EAP:
Weight for nouns: 0.008125650319934589
Weight for verbs: 0.058729281793447954
Weight for adjectives: -0.01550114815739469
Class HPL:
Weight for nouns: -0.0016461478589355932
Weight for verbs: -0.03131031270366662
Weight for adjectives: -0.005268708795268769
Class MWS:
Weight for nouns: -0.00647950246099899
Weight for verbs: -0.02741896908978137
Weight for adjectives: 0.020769856952663496


- other predcitor weights

In [40]:
feature_processing = model.named_steps['features']

custom_features = ['length', 'words', 'words_not_stopwords', 'avg_word_length']
custom_feature_indices = [i for i, feature_name in enumerate(custom_features)]

# Retrieve coefficients for each class
for class_index, class_name in enumerate(classes):
    print(f"Class {class_name}:")
    for idx, feature_name in zip(custom_feature_indices, custom_features):
        print(f"Weight for {feature_name}: {coefficients[class_index][idx]}")
    print("-" * 40)

Class EAP:
Weight for length: 0.008125650319934589
Weight for words: 0.058729281793447954
Weight for words_not_stopwords: -0.01550114815739469
Weight for avg_word_length: 0.10727223487032371
----------------------------------------
Class HPL:
Weight for length: -0.0016461478589355932
Weight for words: -0.03131031270366662
Weight for words_not_stopwords: -0.005268708795268769
Weight for avg_word_length: -0.0024687052403609775
----------------------------------------
Class MWS:
Weight for length: -0.00647950246099899
Weight for words: -0.02741896908978137
Weight for words_not_stopwords: 0.020769856952663496
Weight for avg_word_length: -0.10480352962996335
----------------------------------------


In [33]:
# Predictions on the test set
predictions = model.predict(X_test)

## Finding Mismatches

In [34]:
incorrect_indices = np.where(predictions != y_test)[0]

In [35]:
y_test = y_test.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

## Analysing misclassifications
- the vectors of the mis classfied inputs can be easily seen if arrange in a datafframe
- the top row of coulumn names represent the predcitor or vector xi position
- each row represents the vector

In [42]:
misclassified_df_expanded = X_test.iloc[incorrect_indices].copy()
misclassified_df_expanded['Predicted Label'] = predictions[incorrect_indices]
misclassified_df_expanded['Actual Label'] = y_test.iloc[incorrect_indices]


misclassified_df_expanded.head(10)


Unnamed: 0,processed,length,words,words_not_stopword,avg_word_length,commas,nouns,verbs,adjectives,Predicted Label,Actual Label
2,he had seen so many customs and witnessed so g...,399,69,33,6.909091,1,17,11,8,HPL,MWS
5,she listened to me as she had done to the narr...,279,56,18,6.722222,5,10,11,2,EAP,MWS
6,his chief amusements were gunning and fishing ...,214,35,17,7.470588,2,11,6,1,MWS,EAP
8,i will content myself with saying in addition ...,166,30,12,7.083333,5,9,5,3,MWS,EAP
11,johns i bade the knocker enter but was answere...,70,14,7,5.714286,2,6,3,0,EAP,HPL
12,at fifteen or even at twenty one for i had now...,133,27,13,5.461538,1,5,4,3,MWS,EAP
13,though not as yet licenced physicians we now h...,145,25,12,7.0,2,4,5,1,EAP,HPL
17,the tide had turned and was coming in now and ...,89,19,7,5.428571,1,3,6,0,MWS,HPL
25,the dutchman maintains it to have been that of...,159,27,10,8.1,0,4,8,1,MWS,EAP
30,burkes reflections on the french revolution,43,6,4,8.25,0,3,0,1,EAP,MWS
