<h1 align='center'><u><font color ='pickle'>Manual Features</font></u></h1>


# <font color ='pickle'>**Installing/Importing libraries**

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
!pip install -U spacy

In [3]:
# Import Libraries
import pandas as pd
import numpy as np
from pathlib import Path
import textwrap as tw
import matplotlib.pyplot as plt

# save and load models
import joblib

import re

#from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
base_folder = Path('/content/drive/MyDrive/data')
#base_folder = Path('/home/harpreet/Insync/google_drive_shaannoor/data')

In [6]:
data_folder = base_folder/'datasets/aclImdb'
model_folder = base_folder/'models/nlp_fall_2022/imdb'
custom_functions = base_folder/'custom-functions'

In [None]:
!python -m spacy download 'en_core_web_sm'

In [9]:
import sys
sys.path.append(str(custom_functions))

In [10]:
from featurizer import ManualFeatures

# <font color ='pickle'>**Load dataset**

For this notebook, we will use the IMDB movie review dataset. <br>
Link to the complete dataset: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

We downloaded the dataset in the previous lecture (notebook: 10_Faster_tokenization_spacy_final.ipynb).

We will now use the saved train.csv and test.csv files.

In [11]:
# location of train and test files
train_file = data_folder /'train.csv'
test_file = data_folder / 'test.csv'

In [12]:
# creating Pandas Dataframe
train_data = pd.read_csv(train_file, index_col=0)
test_data = pd.read_csv(test_file, index_col=0)

In [13]:
# print shape of the datasets
print(f'Shape of Training data set is : {train_data.shape}')
print(f'Shape of Test data set is : {test_data.shape}')

Shape of Training data set is : (25000, 2)
Shape of Test data set is : (25000, 2)


In [14]:
#Printing top 5 train records
train_data.head()

Unnamed: 0,Reviews,Labels
0,Ever wanted to know just how much Hollywood co...,1
1,The movie itself was ok for the kids. But I go...,1
2,You could stage a version of Charles Dickens' ...,1
3,this was a fantastic episode. i saw a clip fro...,1
4,and laugh out loud funny in many scenes.<br />...,1


In [15]:
#Printing top 5 test records
test_data.head()

Unnamed: 0,Reviews,Labels
0,THE SEA INSIDE a film by Alejandro Amenabar.<b...,1
1,After World War II the ungoing crime in Phenix...,1
2,"""Pitch Black"" was a complete shock to me when ...",1
3,This film is an excellent teaching tool as a p...,1
4,"Sweet, rich valley girl develops crush on a pu...",1


# <font color ='pickle'>**Create Subset of Data**

In [16]:
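# Draw a sample 10% the size of the training data.
# Note: replace=True samples with replacement, so a review can appear more than once.
train_smaller = train_data.sample(frac=0.1, replace=True, random_state=1)

In [17]:
# Same sampling scheme for the test data
test_smaller = test_data.sample(frac=0.1, replace=True, random_state=1)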

In [21]:
# Split the sampled frames into review texts (X) and labels (y)
X_train = train_smaller['Reviews'].values
y_train = train_smaller['Labels'].values
X_test = test_smaller['Reviews'].values
y_test = test_smaller['Labels'].values

print(f'X_train: {X_train.shape} y_train: {y_train.shape}')
print(f'X_test: {X_test.shape} y_test: {y_test.shape}')

X_train: (2500,) y_train: (2500,)
X_test: (2500,) y_test: (2500,)


# <font color ='pickle'>**Manual Features**

We will extract the following features from each review and use them as the input to our logistic regression:
  1. number of words
  2. number of characters
  3. number of characters without spaces
  4. average word length
  5. number of digits
  6. number of numbers
  7. number of nouns or proper nouns
  8. number of auxiliaries (aux)
  9. number of verbs
  10. number of adjectives
  11. number of named entities (NER)
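
To make the first six features in the list above concrete, here is a minimal sketch that computes them with plain Python and `re`. This is not the code inside `ManualFeatures`: `surface_features` is a hypothetical helper, and whitespace tokenization and the `\b\d+\b` pattern for "numbers" are assumptions. The dictionary keys are chosen to mirror the feature names the featurizer returns later in this notebook. The notebook's own output is at least consistent with the average word length being non-space characters divided by words (for the first review, 2684 / 646 ≈ 4.1548). The remaining features (part-of-speech and entity counts) need a spaCy pipeline and are not covered here.

```python
import re


def surface_features(text):
    """Hypothetical helper: count-based features that need no NLP model."""
    words = text.split()  # assumption: words are whitespace-separated tokens
    n_words = len(words)
    n_chars = len(text)
    n_chars_no_space = len(re.sub(r"\s", "", text))
    # Guard against empty input to avoid division by zero
    avg_word_length = n_chars_no_space / n_words if n_words else 0.0
    # Individual digit characters, e.g. "2022" contributes 4
    n_digits = sum(ch.isdigit() for ch in text)
    # Standalone runs of digits, e.g. "2022" contributes 1 (assumed definition)
    n_numbers = len(re.findall(r"\b\d+\b", text))
    return {
        "count_words": n_words,
        "count_characters": n_chars,
        "count_characters_no_space": n_chars_no_space,
        "avg_word_length": avg_word_length,
        "count_digits": n_digits,
        "count_numbers": n_numbers,
    }


example = surface_features("I watched it 2 times in 2022.")
for name, value in example.items():
    print(f"{name}: {value}")
```

Returning a dictionary keeps each value tied to its name, which makes it easier to check individual features before stacking them into an array.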

### <font color ='pickle'>**Generate Manual Features**

In [22]:
ManualFeatures??

In [23]:
featurizer =  ManualFeatures(spacy_model='en_core_web_sm')

In [24]:
X_train_features, feature_names  = featurizer.fit_transform(X_train)

In [25]:
X_train_features[0:3]

array([[6.46000000e+02, 3.32900000e+03, 2.68400000e+03, 4.15479876e+00,
        0.00000000e+00, 0.00000000e+00, 2.30000000e+01, 1.48000000e+02,
        5.10000000e+01, 8.10000000e+01, 4.90000000e+01],
       [1.72000000e+02, 9.50000000e+02, 7.79000000e+02, 4.52906977e+00,
        4.00000000e+00, 3.00000000e+00, 1.20000000e+01, 4.50000000e+01,
        7.00000000e+00, 2.00000000e+01, 1.50000000e+01],
       [3.51000000e+02, 2.08400000e+03, 1.73400000e+03, 4.94017094e+00,
        0.00000000e+00, 0.00000000e+00, 1.60000000e+01, 8.70000000e+01,
        1.80000000e+01, 4.20000000e+01, 4.30000000e+01]])

In [26]:
feature_names

['count_words',
 'count_characters',
 'count_characters_no_space',
 'avg_word_length',
 'count_digits',
 'count_numbers',
 'noun_count',
 'aux_count',
 'verb_count',
 'adj_count',
 'ner']
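
The feature array above is printed in scientific notation, which makes it hard to tell which number belongs to which feature. As a small sketch, the values of the first row can be paired with `feature_names` using `zip`. The numbers below are copied by hand from the output above; `first_row` and `named` are illustrative names, and in the notebook itself one would use `X_train_features[0]` directly.

```python
feature_names = [
    "count_words", "count_characters", "count_characters_no_space",
    "avg_word_length", "count_digits", "count_numbers",
    "noun_count", "aux_count", "verb_count", "adj_count", "ner",
]

# First row of X_train_features, copied from the printed output
first_row = [646.0, 3329.0, 2684.0, 4.15479876, 0.0, 0.0,
             23.0, 148.0, 51.0, 81.0, 49.0]

# Pair each column name with its value for easier reading
named = dict(zip(feature_names, first_row))
for name, value in named.items():
    print(f"{name:<26} {value:g}")
```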