### CSC 820 NLT
### HW 9
- Khalid Mehtab Khan
- SFSU ID: 923673423

## Reading the training data
- The training CSV file (train.csv) is in the same directory as the notebook
- We use pandas to read the training data


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# The training csv file is present in the same folder as the notebook
df = pd.read_csv('train.csv')

# Drop rows with missing values (dropna returns a new DataFrame, so assign the result)
df = df.dropna(axis=0)

# Setting the index to the id column
df.set_index('id', inplace = True)

df.head()

| id | text | author |
| --- | --- | --- |
| id26305 | This process, however, afforded me no means of... | EAP |
| id17569 | It never once occurred to me that the fumbling... | HPL |
| id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| id27763 | How lovely is spring As we looked from Windsor... | MWS |
| id12958 | Finding nothing else, not even gold, the Super... | HPL |


- We can see that each row contains a text excerpt and the author of that excerpt
- Each text is identified by a unique id
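
One detail of the loading cell is worth spelling out: `dropna` returns a new DataFrame and leaves the original unchanged unless the result is assigned. The sketch below illustrates this on a small made-up frame; the three rows are hypothetical stand-ins, not rows from train.csv.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for train.csv; the second row has no text.
csv_text = "id,text,author\nid1,Some text,EAP\nid2,,HPL\nid3,More text,MWS\n"
toy = pd.read_csv(io.StringIO(csv_text))

toy.dropna(axis=0)                 # result is discarded, toy is unchanged
rows_without_assignment = len(toy)

cleaned = toy.dropna(axis=0)       # keep the returned frame instead
cleaned = cleaned.set_index("id")  # same indexing step as in the notebook

print(rows_without_assignment, len(cleaned))
```

Only the assigned version actually loses the incomplete row.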

## Basic Text Pre-Processing
- These steps do not depend on the type of data
- They also include the creation of some new features, listed below

### Original features
- count of words
- length of text
- avg length of words
- count of commas etc

### Additional Features
Counts of the following parts of speech in the text:
- Nouns
- Verbs
- Adjectives

- These newly created columns will be used as features for Model 2

- We import nltk tools such as the tokenizer and the POS tagger to preprocess the text data

In [2]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import Counter

import nltk
# Download punkt and averaged_perceptron_tagger
# (the stopwords corpus used below must also be available, e.g. via nltk.download('stopwords'))
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

stopWords = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/khalidkhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/khalidkhan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## function processing()
- The processing function is applied to every text value in the data set
- It preprocesses the text to
    - convert it to lower case and remove punctuation
    - split it into words on spaces
    - compute the length of the processed text in characters
    - count the number of words
    - count the words that are not stop words
    - compute the average length of the non-stop words
    - count the commas in the original text

- The function also calls get_pos_counts to count nouns, verbs and adjectives


## function get_pos_counts()
- The function uses nltk's word tokenizer (punkt) and POS tagger to count, for each text, the number of
    - nouns
    - verbs
    - adjectives
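
The counting step can be sketched without running the tagger. The example below is a minimal illustration that assumes Penn Treebank tags, where noun, verb and adjective tags start with `NN`, `VB` and `JJ` respectively. The tagged sentence and the helper name `count_pos` are hypothetical and only stand in for the output of `pos_tag`.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag;
# the sentence and its tags are illustrative, not taken from the data set.
tagged = [
    ("the", "DT"), ("old", "JJ"), ("house", "NN"), ("stood", "VBD"),
    ("on", "IN"), ("dark", "JJ"), ("hills", "NNS"),
]

def count_pos(tagged_tokens):
    """Return (nouns, verbs, adjectives) counted from Penn Treebank tags."""
    counts = Counter(tag for _, tag in tagged_tokens)
    nouns = sum(n for tag, n in counts.items() if tag.startswith("NN"))
    verbs = sum(n for tag, n in counts.items() if tag.startswith("VB"))
    adjectives = sum(n for tag, n in counts.items() if tag.startswith("JJ"))
    return nouns, verbs, adjectives

pos_counts = count_pos(tagged)
print(pos_counts)
```

Matching on tag prefixes should give the same totals as listing every tag explicitly, as long as the tagger emits the standard Penn Treebank tag set.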


In [3]:
# POS tagging function to calculate the number of nouns, verbs and adjectives in a sentence
# This function will be used to create new features for STEP 2 of the model
def get_pos_counts(text):
    tags = pos_tag(word_tokenize(text))
    counts = Counter(tag for word, tag in tags)
    # Defining parts of speech we are interested in
    nouns = ['NN', 'NNS', 'NNP', 'NNPS']
    verbs = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    adjectives = ['JJ', 'JJR', 'JJS']
    # Counting occurrences of the desired parts of speech
    noun_count = sum(counts[tag] for tag in nouns)
    verb_count = sum(counts[tag] for tag in verbs)
    adjective_count = sum(counts[tag] for tag in adjectives)
    return noun_count, verb_count, adjective_count

def processing(df):
    # Lowering and removing punctuation
    df['processed'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x.lower()))
    
    # Numerical feature engineering
    # Total length of sentence
    df['length'] = df['processed'].apply(lambda x: len(x))
    # Get number of words
    df['words'] = df['processed'].apply(lambda x: len(x.split(' ')))
    df['words_not_stopword'] = df['processed'].apply(lambda x: len([t for t in x.split(' ') if t not in stopWords]))
    # Get the average word length
    df['avg_word_length'] = df['processed'].apply(lambda x: np.mean([len(t) for t in x.split(' ') if t not in stopWords]) if len([len(t) for t in x.split(' ') if t not in stopWords]) > 0 else 0)
    # Count commas
    df['commas'] = df['text'].apply(lambda x: x.count(','))

    # Adding POS counts
    pos_counts = df['processed'].apply(get_pos_counts)
    df['nouns'], df['verbs'], df['adjectives'] = zip(*pos_counts)
    
    return df


df = processing(df)
df.head()


| id | text | author | processed | length | words | words_not_stopword | avg_word_length | commas | nouns | verbs | adjectives |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id26305 | This process, however, afforded me no means of... | EAP | this process however afforded me no means of a... | 224 | 41 | 21 | 6.380952 | 4 | 12 | 6 | 2 |
| id17569 | It never once occurred to me that the fumbling... | HPL | it never once occurred to me that the fumbling... | 70 | 14 | 6 | 6.166667 | 0 | 2 | 2 | 1 |
| id11008 | In his left hand was a gold snuff box, from wh... | EAP | in his left hand was a gold snuff box from whi... | 195 | 36 | 19 | 5.947368 | 4 | 10 | 4 | 5 |
| id27763 | How lovely is spring As we looked from Windsor... | MWS | how lovely is spring as we looked from windsor... | 202 | 34 | 21 | 6.47619 | 3 | 10 | 5 | 6 |
| id12958 | Finding nothing else, not even gold, the Super... | HPL | finding nothing else not even gold the superin... | 170 | 27 | 16 | 7.1875 | 2 | 6 | 6 | 1 |


- We can see that the dataframe now has the processed text and the new feature columns
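
To make the numeric columns concrete, here is a minimal single-string sketch of the same feature calculations. It is not the notebook's function: the stop-word set is a tiny hypothetical stand-in for NLTK's English list, the helper name `basic_features` is made up, and it splits with `split()` rather than `split(' ')`. The last two lines show why that choice can matter: `split(' ')` keeps an empty string wherever two spaces are adjacent, which may happen after punctuation is removed, so word counts can differ slightly.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"in", "his", "was", "a", "the", "of"}

def basic_features(text, stop_words=STOP_WORDS):
    """Compute the per-text numeric features for one string."""
    processed = re.sub(r"[^\w\s]", "", text.lower())
    tokens = processed.split()  # split() without an argument drops empty tokens
    content = [t for t in tokens if t not in stop_words]
    avg_len = sum(len(t) for t in content) / len(content) if content else 0.0
    return {
        "processed": processed,
        "length": len(processed),
        "words": len(tokens),
        "words_not_stopword": len(content),
        "avg_word_length": avg_len,
        "commas": text.count(","),
    }

example = "In his left hand was a gold snuff box, from which he took snuff."
features = basic_features(example)

# split(' ') keeps empty strings when two spaces are adjacent; split() does not.
tokens_single_space = "gold  snuff".split(" ")
tokens_whitespace = "gold  snuff".split()
```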

## Modeling
- Dividing the columns into predictors and target variables

- Further dividing the data into training and test sets
- We use scikit-learn's train_test_split function

### Splitting the training data
- We divide the data into 2 parts
- About 2/3 of the data (67%) is used as training data
- About 1/3 of the data (33%, test_size=0.33) is used as test data
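
The resulting split sizes can be checked with a little arithmetic. The sketch below assumes that train_test_split rounds the test share up to a whole row and gives the remainder to the training set, which matches the row counts that appear in this notebook's outputs (13117 training rows and 6462 test rows).

```python
import math

n_train_rows = 13117   # rows in the fitted training matrices shown in the notebook
n_test_rows = 6462     # total support in the classification reports
n_samples = n_train_rows + n_test_rows

test_size = 0.33
expected_test = math.ceil(test_size * n_samples)   # test share, rounded up
expected_train = n_samples - expected_test         # everything else is training data
test_fraction = expected_test / n_samples

print(expected_train, expected_test, round(test_fraction, 4))
```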


In [4]:
from sklearn.model_selection import train_test_split

features= [c for c in df.columns.values if c  not in ['id','text','author']]
numeric_features= [c for c in df.columns.values if c  not in ['id','text','author','processed']]
target = 'author'

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.33, random_state=42)
X_train.head()

| id | processed | length | words | words_not_stopword | avg_word_length | commas | nouns | verbs | adjectives |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id19417 | this panorama is indeed glorious and i should ... | 91 | 18 | 6 | 6.666667 | 1 | 4 | 2 | 1 |
| id09522 | there was a simple natural earnestness about h... | 240 | 44 | 18 | 6.277778 | 4 | 8 | 7 | 7 |
| id22732 | who are you pray that i duc de lomelette princ... | 387 | 74 | 38 | 5.552632 | 9 | 18 | 10 | 3 |
| id10351 | he had gone in the carriage to the nearest tow... | 118 | 24 | 11 | 5.363636 | 0 | 8 | 3 | 1 |
| id24580 | there is no method in their proceedings beyond... | 71 | 13 | 5 | 7.0 | 1 | 4 | 1 | 0 |


Now for the tricky parts.

The first thing I want to do is define how to process my variables. Standard preprocessing applies the same transformation to the whole dataset, but that doesn't quite work when the data is heterogeneous (here, one text column and several numeric columns). So I first create a selector transformer that simply returns the one column of the dataset matching the key I pass.

I was having difficulty getting a single selector to play nicely, so I made two different selectors, one for text columns and one for numeric columns. The only difference is the return type: the text selector returns a one-dimensional Series, while the number selector returns a single-column DataFrame.
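
The difference in return type comes down to single versus double brackets in pandas. The sketch below uses a made-up two-row frame to show the shapes involved. As far as the downstream steps go, a text vectorizer iterates over a one-dimensional collection of strings, while a scaler expects a two-dimensional array with one column per feature.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (values are made up).
toy = pd.DataFrame({
    "processed": ["a short text", "another text"],
    "length": [12, 12],
})

as_series = toy["processed"]   # single brackets: what the text selector returns
as_frame = toy[["length"]]     # double brackets: what the number selector returns

print(as_series.ndim, as_series.shape)
print(as_frame.ndim, as_frame.shape)
```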

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    


## Pipeline 1
### the text pipeline
- Since this column is text, we apply a TF-IDF vectorizer to turn each text into a numeric feature vector

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

text = Pipeline([
                ('selector', TextSelector(key='processed')),
                ('tfidf', TfidfVectorizer( stop_words='english'))
            ])

text.fit_transform(X_train)

<13117x21516 sparse matrix of type '<class 'numpy.float64'>'
	with 148061 stored elements in Compressed Sparse Row format>

## Pipeline 2
### the length pipeline
- We standardize the text length with a StandardScaler (zero mean, unit variance)
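
Standardization subtracts the column mean and divides by the column standard deviation. The sketch below applies that formula by hand to the five `length` values visible in the X_train preview, using the population standard deviation (dividing by n), which is what StandardScaler uses. The real pipeline computes the mean and deviation over the whole training column, so its output values differ from these.

```python
from statistics import mean, pstdev

lengths = [91, 240, 387, 118, 71]   # the five 'length' values in the X_train preview

mu = mean(lengths)
sigma = pstdev(lengths)              # population standard deviation (divides by n)
scaled = [(x - mu) / sigma for x in lengths]

print(mu, sigma)
print(scaled)
```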

In [7]:
from sklearn.preprocessing import StandardScaler

length =  Pipeline([
                ('selector', NumberSelector(key='length')),
                ('standard', StandardScaler())
            ])

length.fit_transform(X_train)

array([[-0.50769254],
       [ 0.88000324],
       [ 2.24907223],
       ...,
       [-0.46112557],
       [-0.14447015],
       [-0.39593181]])

# Pipelines for every feature
- Rest of the pipelines

- These pipelines will be used for the first part of the model
- In this case we use the columns
- processed, length, words, words_not_stopword, avg_word_length and commas
- We don't use the nouns, verbs and adjectives columns in this case
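
Since every numeric pipeline has the same two steps, they could also be built with a small helper. The sketch below is one possible way to do that, not the approach used in this notebook: `numeric_pipeline` is a hypothetical helper name, the selector is repeated here only to keep the example self-contained, and the toy values are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class NumberSelector(BaseEstimator, TransformerMixin):
    """Same idea as the notebook's selector: return one column as a 2-D frame."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

def numeric_pipeline(key):
    """Hypothetical helper: select one numeric column, then standardize it."""
    return Pipeline([
        ("selector", NumberSelector(key=key)),
        ("standard", StandardScaler()),
    ])

numeric_keys = ["length", "words", "words_not_stopword", "avg_word_length", "commas"]
numeric_pipelines = {key: numeric_pipeline(key) for key in numeric_keys}

# Quick check on made-up values
toy = pd.DataFrame({key: [1.0, 2.0, 3.0] for key in numeric_keys})
scaled_words = numeric_pipelines["words"].fit_transform(toy)
print(scaled_words.shape)
```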

In [8]:
words =  Pipeline([
                ('selector', NumberSelector(key='words')),
                ('standard', StandardScaler())
            ])
words_not_stopword =  Pipeline([
                ('selector', NumberSelector(key='words_not_stopword')),
                ('standard', StandardScaler())
            ])
avg_word_length =  Pipeline([
                ('selector', NumberSelector(key='avg_word_length')),
                ('standard', StandardScaler())
            ])
commas =  Pipeline([
                ('selector', NumberSelector(key='commas')),
                ('standard', StandardScaler()),
            ])

- However, pipelines for the nouns, verbs and adjectives columns can be created in the same way
- The best part about using pipelines is that each of them can be integrated separately with a model

Additional features 
- Nouns, verbs and adjectives

In [9]:
nouns =  Pipeline([
                ('selector', NumberSelector(key='nouns')),
                ('standard', StandardScaler())
            ])
verbs =  Pipeline([
                ('selector', NumberSelector(key='verbs')),
                ('standard', StandardScaler())
            ])

adjectives =  Pipeline([
                ('selector', NumberSelector(key='adjectives')),
                ('standard', StandardScaler())
            ])



# Model 1
- To create our first model, let's integrate selected pipelines using a feature union
### This feature union contains features - feats version 1
- text
- length
- words
- words_not_stopword
- avg_word_length
- commas
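
A feature union fits each branch on the same input and stacks the outputs side by side, so the number of output columns is the sum over the branches. The sketch below shows this on a made-up three-row frame with one text branch and one numeric branch. It uses `FunctionTransformer` with a lambda as a stand-in for the selector classes, which is an assumption made for brevity and not what the notebook does.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up data: six distinct words across the three texts
toy = pd.DataFrame({
    "processed": ["gold snuff box", "lovely spring morning", "gold morning"],
    "length": [14, 21, 12],
})

text_branch = Pipeline([
    ("selector", FunctionTransformer(lambda X: X["processed"])),
    ("tfidf", TfidfVectorizer()),
])
length_branch = Pipeline([
    ("selector", FunctionTransformer(lambda X: X[["length"]])),
    ("standard", StandardScaler()),
])

union = FeatureUnion([("text", text_branch), ("length", length_branch)])
combined = union.fit_transform(toy)

# Vocabulary size of the fitted text branch
n_text_columns = len(union.transformer_list[0][1].named_steps["tfidf"].vocabulary_)
print(n_text_columns, combined.shape)
```

The same arithmetic explains the shapes in this notebook: 21516 TF-IDF columns plus 5 numeric columns give the 21521 columns of the first feature union.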

In [10]:
from sklearn.pipeline import FeatureUnion

feats_v1 = FeatureUnion([('text', text), 
                      ('length', length),
                      ('words', words),
                      ('words_not_stopword', words_not_stopword),
                      ('avg_word_length', avg_word_length),
                      ('commas', commas)])

feature_processing = Pipeline([('feats', feats_v1)])
feature_processing.fit_transform(X_train)

<13117x21521 sparse matrix of type '<class 'numpy.float64'>'
	with 213646 stored elements in Compressed Sparse Row format>

## Adding a model
- At the end of the pipeline we can add a model
- We add Logistic Regression in this case

In [11]:
from sklearn.linear_model import LogisticRegression

model1 = Pipeline([
    ('features', feats_v1),
    ('classifier', LogisticRegression(random_state=42)),
])

model1.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
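
The warning above means the solver stopped at its default iteration limit before converging. One common remedy, which the warning itself suggests, is to raise `max_iter`. The sketch below shows the idea on synthetic data: the dataset, the value 1000 and the variable names are illustrative assumptions, and it has not been run on the notebook's feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class data; a stand-in for the real feature matrix.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)

# max_iter=1000 is an arbitrary, larger limit than the default of 100.
clf = LogisticRegression(max_iter=1000, random_state=42)
clf.fit(X, y)

# n_iter_ holds the iterations actually used; a value below max_iter
# indicates that the solver stopped because it converged.
iterations_used = int(clf.n_iter_.max())
print(iterations_used)
```

In the pipeline, the same argument would go into the `LogisticRegression(...)` call of the `classifier` step.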


### Accuracy: Model 1

In [12]:
preds = model1.predict(X_test)
np.mean(preds == y_test)

0.7816465490560198

### Classification Report Model 1

In [13]:
from sklearn.metrics import classification_report

report = classification_report(y_test,preds)
print(report)

              precision    recall  f1-score   support

         EAP       0.74      0.85      0.79      2587
         HPL       0.81      0.74      0.77      1852
         MWS       0.83      0.74      0.78      2023

    accuracy                           0.78      6462
   macro avg       0.79      0.77      0.78      6462
weighted avg       0.79      0.78      0.78      6462



# Model 2
- To create our second model, let's integrate all the pipelines using a feature union
### This feature union contains features - feats version 2
- text
- length
- words
- words_not_stopword
- avg_word_length
- commas

### Also,
- nouns
- verbs
- adjectives

In [14]:
feats_v2 = FeatureUnion([
                    ('text', text), 
                    ('length', length),
                    ('words', words),
                    ('words_not_stopword', words_not_stopword),
                    ('avg_word_length', avg_word_length),
                    ('commas', commas),
                    ('nouns', nouns),
                    ('verbs', verbs),
                    ('adjectives', adjectives)  
                    ])

feature_processing = Pipeline([('feats', feats_v2)])
feature_processing.fit_transform(X_train)

<13117x21524 sparse matrix of type '<class 'numpy.float64'>'
	with 252997 stored elements in Compressed Sparse Row format>

- We add a Logistic Regression model at the end of this feature union as well

In [15]:
model2 = Pipeline([
    ('features', feats_v2),
    ('classifier', LogisticRegression(random_state=42)),
])

model2.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Accuracy: Model 2

In [16]:
preds = model2.predict(X_test)
np.mean(preds == y_test)

0.7787062828845559

### Observations
- We can see that the accuracy of the model has decreased slightly after adding the three extra features (nouns, verbs and adjectives)
- The accuracy dropped from 78.16% to 77.87%
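
To put the size of this drop in perspective, the two accuracies can be converted back into counts of correctly classified test texts (the test set has 6462 rows). The sketch below also computes a rough binomial standard error for a single accuracy estimate. This is only a back-of-the-envelope yardstick under the assumption of independent test texts; a paired comparison of the two models would be the more appropriate test.

```python
import math

n_test = 6462
acc_model1 = 0.7816465490560198
acc_model2 = 0.7787062828845559

correct_model1 = round(acc_model1 * n_test)
correct_model2 = round(acc_model2 * n_test)
difference_in_texts = correct_model1 - correct_model2

# Rough binomial standard error of a single accuracy estimate
standard_error = math.sqrt(acc_model1 * (1 - acc_model1) / n_test)
accuracy_gap = acc_model1 - acc_model2

print(correct_model1, correct_model2, difference_in_texts)
print(round(accuracy_gap, 4), round(standard_error, 4))
```

The gap corresponds to a handful of test texts and is smaller than this rough standard error, so it should be read as a small difference rather than clear evidence that the extra features hurt.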

### Classification Report Model 2

In [17]:
report = classification_report(y_test,preds)
print(report)

              precision    recall  f1-score   support

         EAP       0.74      0.84      0.78      2587
         HPL       0.80      0.74      0.77      1852
         MWS       0.82      0.74      0.78      2023

    accuracy                           0.78      6462
   macro avg       0.79      0.77      0.78      6462
weighted avg       0.78      0.78      0.78      6462



- Model 1: Logistic Regression with original features.
- Model 2: Logistic Regression with the addition of nouns, verbs, and adjectives features plus all original features.

### Precision:
- For EAP, the precision remained the same (0.74).
- For HPL, the precision decreased from 0.81 to 0.80.
- For MWS, the precision decreased from 0.83 to 0.82.

### Recall:
- For EAP, recall decreased slightly from 0.85 to 0.84.
- For HPL and MWS, recall stayed the same (0.74).

### F1-Score:
- For EAP, the F1-score decreased from 0.79 to 0.78.
- For HPL, the F1-score remained the same (0.77).
- For MWS, the F1-score remained the same (0.78).

- Overall, none of the precision, recall or F1 scores improved: each either stayed the same or dropped by 0.01.

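### Observations
- Adding more features increases the dimensionality of the feature space, although here only by three columns on top of the 21,521 already present.
- This could make the model slightly more prone to overfitting the training data, with reduced generalization to new, unseen data and potentially lower precision.

- The additional features might not be as relevant to the classification task as expected.
- If these features do not have a strong relationship with the target variable, they might introduce noise, leading to poorer performance.
- Both models also stopped at the solver's iteration limit (see the warnings above), so part of this small difference may come from incomplete optimization rather than from the features themselves.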

# Cross Validation To Find The Best Pipeline

### Best fit for Model 1

In [18]:
from sklearn.model_selection import GridSearchCV

hyperparameters = {
    'features__text__tfidf__max_df': [0.9, 0.95],
    'features__text__tfidf__ngram_range': [(1, 1), (1, 2)],
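    'classifier__C': [0.01, 0.1, 1, 10, 100],  # Inverse regularization strength
    # Note: the default lbfgs solver does not support 'l1'; those candidates fail
    # to fit and are scored as NaN unless a solver such as liblinear or saga is set
    'classifier__penalty': ['l1', 'l2']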
}

clf_model1 = GridSearchCV(model1, hyperparameters, cv=5)
 
# Fit and tune model 1
clf_model1.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [19]:
clf_model1.best_params_

{'classifier__C': 10,
 'classifier__penalty': 'l2',
 'features__text__tfidf__max_df': 0.9,
 'features__text__tfidf__ngram_range': (1, 2)}
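
The parameter names in the grid are paths through the nested pipeline: step names joined by double underscores, ending in the parameter itself. The size of the search follows directly from the grid. The sketch below enumerates the combinations with the standard library only; it counts candidates and fits and does not run any model.

```python
from itertools import product

hyperparameters = {
    "features__text__tfidf__max_df": [0.9, 0.95],
    "features__text__tfidf__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.01, 0.1, 1, 10, 100],
    "classifier__penalty": ["l1", "l2"],
}

# Every combination of values is one candidate setting.
candidates = [
    dict(zip(hyperparameters, values))
    for values in product(*hyperparameters.values())
]
n_candidates = len(candidates)
n_fits = n_candidates * 5   # cv=5 folds per candidate, before the final refit

# A name such as features__text__tfidf__max_df is a path through nested steps.
path = "features__text__tfidf__max_df".split("__")

print(n_candidates, n_fits, path)
```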

In [20]:
# GridSearchCV refits the best pipeline on the entire training data by default (refit=True),
# so no separate refit call is needed

preds = clf_model1.predict(X_test)
probs = clf_model1.predict_proba(X_test)

np.mean(preds == y_test)

0.7994428969359332

### Classification Report 
- Model 1 best fit

In [21]:
report = classification_report(y_test,preds)
print(report)

              precision    recall  f1-score   support

         EAP       0.79      0.81      0.80      2587
         HPL       0.83      0.78      0.80      1852
         MWS       0.79      0.80      0.79      2023

    accuracy                           0.80      6462
   macro avg       0.80      0.80      0.80      6462
weighted avg       0.80      0.80      0.80      6462



### Best fit for Model 2

In [22]:
clf_model2 = GridSearchCV(model2, hyperparameters, cv=5)
 
# Fit and tune model 2
clf_model2.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [23]:
clf_model2.best_params_

{'classifier__C': 10,
 'classifier__penalty': 'l2',
 'features__text__tfidf__max_df': 0.9,
 'features__text__tfidf__ngram_range': (1, 1)}

In [24]:
# GridSearchCV refits the best pipeline on the entire training data by default (refit=True),
# so no separate refit call is needed

preds = clf_model2.predict(X_test)
probs = clf_model2.predict_proba(X_test)

np.mean(preds == y_test)

0.7921696069328381

### Classification Report 
- Model 2 best fit

In [25]:
report = classification_report(y_test,preds)
print(report)

              precision    recall  f1-score   support

         EAP       0.77      0.83      0.80      2587
         HPL       0.81      0.77      0.79      1852
         MWS       0.81      0.77      0.79      2023

    accuracy                           0.79      6462
   macro avg       0.80      0.79      0.79      6462
weighted avg       0.79      0.79      0.79      6462



# Final Predictions on the test.csv

In [26]:
submission = pd.read_csv('test.csv')

#preprocessing
submission = processing(submission)


In [27]:
predictions_model1 = clf_model1.predict_proba(submission)

preds = pd.DataFrame(data=predictions_model1, columns = clf_model1.best_estimator_.named_steps['classifier'].classes_)

#generating a submission file
result = pd.concat([submission[['id']], preds], axis=1)
result.set_index('id', inplace = True)
result.head()

| id | EAP | HPL | MWS |
| --- | --- | --- | --- |
| id02310 | 0.11121 | 0.013204 | 0.875586 |
| id24541 | 0.962016 | 0.015428 | 0.022556 |
| id00134 | 0.116354 | 0.87292 | 0.010726 |
| id27757 | 0.794183 | 0.181254 | 0.024562 |
| id04081 | 0.873101 | 0.117378 | 0.009521 |
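
Each row of the output holds one probability per author, in the order given by the classifier's `classes_`, so a row should sum to 1 and its largest entry names the predicted author. The sketch below checks this by hand for the first three Model 1 rows shown above; the variable names are illustrative.

```python
classes = ["EAP", "HPL", "MWS"]

# First three rows of the Model 1 output above
rows = {
    "id02310": [0.11121, 0.013204, 0.875586],
    "id24541": [0.962016, 0.015428, 0.022556],
    "id00134": [0.116354, 0.87292, 0.010726],
}

# Probabilities in each row should add up to 1 (up to display rounding).
row_sums = {text_id: sum(probs) for text_id, probs in rows.items()}

# The most probable class is the predicted author for that text.
top_author = {
    text_id: classes[probs.index(max(probs))]
    for text_id, probs in rows.items()
}

print(row_sums)
print(top_author)
```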


- Prediction from model 2

In [28]:
predictions_model2 = clf_model2.predict_proba(submission)

preds = pd.DataFrame(data=predictions_model2, columns = clf_model2.best_estimator_.named_steps['classifier'].classes_)

#generating a submission file
result = pd.concat([submission[['id']], preds], axis=1)
result.set_index('id', inplace = True)
result.head()

| id | EAP | HPL | MWS |
| --- | --- | --- | --- |
| id02310 | 0.081152 | 0.008135 | 0.910713 |
| id24541 | 0.972766 | 0.007753 | 0.019481 |
| id00134 | 0.192206 | 0.799049 | 0.008746 |
| id27757 | 0.827455 | 0.155221 | 0.017324 |
| id04081 | 0.927944 | 0.065365 | 0.006691 |
