# <font color = 'dodgerblue'>**Feature Engineering**

# <font color = 'dodgerblue'>**Installing/Importing libraries**

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install -U spacy -qq

In [None]:
# import pandas library for data manipulation
import pandas as pd

# import spacy library for natural language processing
import spacy

# import Path library from pathlib for file system management
from pathlib import Path


In [None]:
# check the version of spacy
print(spacy.__version__)

3.6.0


In [None]:
!python -m spacy download en_core_web_sm


# <font color = 'dodgerblue'>**Specify Data Folders**

In [None]:
# Check if the code is running in a Colab environment
if 'google.colab' in str(get_ipython()):  # the code is running in Colab

    # mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    # set the base path to a Google Drive folder
    base_path = '/content/drive/MyDrive/data'
else:
    # If the code is not running in Colab, set the base path to a local folder
    base_path = '/home/harpreet/Insync/google_drive_shaannoor/data'

# Convert the base path to a Path object
base_folder = Path(base_path)

# Define the archive folder path
archive_folder = base_folder/'archive'

# Define the data folder path
data_folder = base_folder/'datasets'

# <font color = 'dodgerblue'>**Loading Data Set**

Note: We have already downloaded the IMDB dataset and created the train and test CSV files (see the notebook Faster_tokenization_spacy). You can download train.csv and test.csv from eLearning.

The data was downloaded from the following link: https://ai.stanford.edu/~amaas/data/sentiment/.

Description of the data from the above link:
"*This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.*".

In [None]:
df = pd.read_csv(data_folder/'aclImdb'/ 'train.csv', index_col=0)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  25000 non-null  object
 1   Labels   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 585.9+ KB


In [None]:
df.head()

Unnamed: 0,Reviews,Labels
0,An excellent example of the spectacular Busby ...,1
1,"In Manhattan, the American middle class Jim Bl...",1
2,"""Foxes"" is a great film. The four young actres...",1
3,Another comment about this film made it sound ...,1
4,The energetic young producer of theatrical pro...,1


# <font color = 'dodgerblue'>**Feature Engineering on IMDB dataset**

* The dataset contains reviews and their labels, where a label is 1 for a positive review and 0 for a negative one.
* Now let's see what kind of feature engineering can be done for this dataset.
* Since a review is text data, we can count the total number of words, because users sometimes tend to write a long review when they dislike a movie. This is only an assumption, and in feature engineering we should always run experiments to check such assumptions.
* Similarly, we can create features for character count, digit count, and counts of POS tags such as nouns, adjectives, and verbs. We can also count the named entities in a review.
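
The review-length assumption in the list above can be checked before it is turned into a feature. The following is a minimal sketch, assuming the reviews and labels are available as plain Python sequences. The helper name `mean_word_count_by_label` and the toy reviews are hypothetical and only illustrate the comparison; they are not taken from the IMDB data.

```python
from collections import defaultdict
from statistics import mean


def mean_word_count_by_label(reviews, labels):
    """Return {label: mean number of whitespace-separated words}."""
    counts = defaultdict(list)
    for text, label in zip(reviews, labels):
        counts[label].append(len(text.split()))
    return {label: mean(values) for label, values in counts.items()}


# Toy, made-up reviews (not taken from the IMDB data)
toy_reviews = [
    "Great film",
    "Loved it",
    "Terrible plot and the acting was even worse",
    "Boring and far too long",
]
toy_labels = [1, 1, 0, 0]

word_means = mean_word_count_by_label(toy_reviews, toy_labels)
print(word_means)
```

With the DataFrame loaded earlier, the same function could be called as `mean_word_count_by_label(df['Reviews'], df['Labels'])`. A clear gap between the two means would support keeping word count as a feature.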

We can also extract POS tags and named entities as features. **Do we need to preprocess the data before these steps? Which preprocessing steps should we apply?**

  1. number of nouns or proper nouns
  2. number of auxiliaries (AUX)
  3. number of verbs
  4. number of adjectives
  5. number of named entities (NER)

**Do we need to preprocess the data before these steps? Which preprocessing steps should we apply?**

  6. number of words
  7. number of characters
  8. number of characters without space
  9. average word length
  10. number of digits
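
Features 6 to 10 need no language model, only string operations. Below is a minimal sketch under a few assumptions: words are split on whitespace, "characters without space" means all non-whitespace characters, and average word length is non-whitespace characters divided by word count. The function name `basic_text_features` and the dictionary keys are hypothetical.

```python
def basic_text_features(text):
    """Compute simple count-based features for one review."""
    words = text.split()  # whitespace-separated tokens
    n_chars_no_space = sum(not ch.isspace() for ch in text)
    return {
        "word_count": len(words),
        "char_count": len(text),
        "char_count_no_space": n_chars_no_space,
        # guard against empty reviews to avoid division by zero
        "avg_word_length": n_chars_no_space / len(words) if words else 0.0,
        "digit_count": sum(ch.isdigit() for ch in text),
    }


example = basic_text_features("Rated 10 out of 10")
print(example)
```

Applied to the notebook's data, something like `pd.DataFrame(df['Reviews'].map(basic_text_features).tolist(), index=df.index)` would give one column per feature, which can then be joined to `df`.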

## <font color = 'dodgerblue'>**Extracting POS**



In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
import os

In [None]:
os.cpu_count()

64
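
The CPU count can guide the `n_process` value passed to `nlp.pipe`. The sketch below is one possible heuristic, not a spaCy recommendation: the helper `pick_n_process` and its defaults (`max_processes=4`, `reserve=1`) are assumptions chosen for illustration. Using every available core is not necessarily faster, since each worker process adds start-up and data-transfer overhead.

```python
import os


def pick_n_process(max_processes=4, reserve=1):
    """Pick a worker count for nlp.pipe, leaving `reserve` cores free."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(max_processes, cpus - reserve))


n_process = pick_n_process()
print(n_process)
```

The result could then be passed as `nlp.pipe(..., n_process=n_process)`. Timing a small sample of reviews with a few different values is the most reliable way to choose.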

In [None]:
# This cell took 5 mins on colab pro

noun_count = []  # create a list to store the noun count for each document
aux_count = []  # create a list to store the auxiliary verb count for each document
verb_count = []  # create a list to store the verb count for each document
adj_count = []  # create a list to store the adjective count for each document

# disable lemmatizer and named entity recognizer
disabled = nlp.select_pipes(disable=['lemmatizer', 'ner'])

# iterate over the documents in the dataframe using the spacy pipe method
for doc in nlp.pipe(df.Reviews.values, batch_size=1000, n_process=3):

    # find all nouns and proper nouns in the document and store in a list
    nouns = [token.text for token in doc if (token.pos_ in ["NOUN","PROPN"])]

    # find all auxiliary verbs in the document and store in a list
    auxs =  [token.text for token in doc if (token.pos_ in ["AUX"])]

    # find all verbs in the document and store in a list
    verbs =  [token.text for token in doc if (token.pos_ in ["VERB"])]

    # find all adjectives in the document and store in a list
    adjectives =  [token.text for token in doc if (token.pos_ in ["ADJ"])]

    # store the count of nouns in the noun_count list
    noun_count.append(len(nouns))

    # store the count of auxiliary verbs (length of auxs, not of the running aux_count list)
    aux_count.append(len(auxs))

    # store the count of verbs (length of verbs, not of the running verb_count list)
    verb_count.append(len(verbs))

    # store the count of adjectives (length of adjectives, not of the running adj_count list)
    adj_count.append(len(adjectives))

# restore the lemmatizer and named entity recognizer
disabled.restore()
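
The four list comprehensions in the loop above each scan the document once. As an alternative sketch, the tags can be tallied in a single pass with `collections.Counter`. The helper `count_pos` is hypothetical and takes plain POS strings, so it can be tried without loading a spaCy model. The example tags are written by hand and are not spaCy output.

```python
from collections import Counter


def count_pos(pos_tags):
    """Count the four POS groups used above from a sequence of coarse POS tags."""
    tally = Counter(pos_tags)
    return {
        "noun_count": tally["NOUN"] + tally["PROPN"],
        "aux_count": tally["AUX"],
        "verb_count": tally["VERB"],
        "adj_count": tally["ADJ"],
    }


# Hand-written tags for "The old dog has barked" (illustrative, not spaCy output)
tags = ["DET", "ADJ", "NOUN", "AUX", "VERB"]
pos_counts = count_pos(tags)
print(pos_counts)
```

Inside the `nlp.pipe` loop this could be called as `count_pos(token.pos_ for token in doc)`, appending each returned dictionary to a list and building the new columns from it afterwards.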

### <font color = 'dodgerblue'>**1. number of nouns or proper nouns**
Store the number of nouns and proper nouns in each review in a new column 'noun_count'.

In [None]:
df['noun_count'] = noun_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count
0,An excellent example of the spectacular Busby ...,1,72
1,"In Manhattan, the American middle class Jim Bl...",1,87
2,"""Foxes"" is a great film. The four young actres...",1,16
3,Another comment about this film made it sound ...,1,27
4,The energetic young producer of theatrical pro...,1,163


### <font color = 'dodgerblue'>**2. number of aux**
Store the number of auxiliaries (auxiliary verbs) in each review in a new column 'aux_count'. Hint: the POS tag in spaCy is AUX.

In [None]:
df['aux_count'] = aux_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count
0,An excellent example of the spectacular Busby ...,1,72,0
1,"In Manhattan, the American middle class Jim Bl...",1,87,1
2,"""Foxes"" is a great film. The four young actres...",1,16,2
3,Another comment about this film made it sound ...,1,27,3
4,The energetic young producer of theatrical pro...,1,163,4


### <font color = 'dodgerblue'>**3. number of verbs**
Store the number of verbs in each review in a new column 'verb_count'.

In [None]:
df['verb_count'] = verb_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count
0,An excellent example of the spectacular Busby ...,1,72,0,0
1,"In Manhattan, the American middle class Jim Bl...",1,87,1,1
2,"""Foxes"" is a great film. The four young actres...",1,16,2,2
3,Another comment about this film made it sound ...,1,27,3,3
4,The energetic young producer of theatrical pro...,1,163,4,4


### <font color = 'dodgerblue'>**4. number of adjectives**
Store the number of adjectives in each review in a new column 'adj_count'.

In [None]:
df['adj_count'] = adj_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count,adj_count
0,An excellent example of the spectacular Busby ...,1,72,0,0,0
1,"In Manhattan, the American middle class Jim Bl...",1,87,1,1,1
2,"""Foxes"" is a great film. The four young actres...",1,16,2,2,2
3,Another comment about this film made it sound ...,1,27,3,3,3
4,The energetic young producer of theatrical pro...,1,163,4,4,4


## <font color = 'dodgerblue'>**Extracting Named Entities**
Store the number of named entities (NER) in each review in a new column 'ner_count'.

In [None]:
# This cell took 3 mins on colab pro

# Initialize empty list to store the number of named entities in each document
count_ner  = []

# Disable specified components in the spaCy pipeline
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])

# Loop over the reviews in the dataframe, processing them in batches of 1000 with spaCy's nlp.pipe method, using 2 worker processes
for doc in nlp.pipe(df.Reviews.values, batch_size=1000, n_process=2):
    # Get the named entities from the processed document
    ners = [ent.label_ for ent in doc.ents]
    # Add the count of named entities to the list
    count_ner.append(len(ners))

# Restore the components in the spaCy pipeline
disabled.restore()
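
The loop above already collects each entity's label but keeps only the total. As an optional extension, the same labels can yield one count per entity type. This is a sketch under assumptions: the helper `entity_label_counts`, the feature names, and the default `keep` tuple are hypothetical, and the example labels are written by hand rather than produced by the model.

```python
from collections import Counter


def entity_label_counts(labels, keep=("PERSON", "ORG", "GPE")):
    """Return the total entity count plus one count per label in `keep`."""
    tally = Counter(labels)
    features = {"ner_count": len(labels)}
    for label in keep:
        features[f"ner_{label.lower()}_count"] = tally[label]
    return features


# Hand-written labels standing in for [ent.label_ for ent in doc.ents]
labels = ["PERSON", "PERSON", "GPE", "DATE"]
ner_features = entity_label_counts(labels)
print(ner_features)
```

In the loop, the list `ners` could be passed directly as `entity_label_counts(ners)`. Whether per-label counts help the classifier is an open question that would need to be tested on the data.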

In [None]:
df['ner_count'] = count_ner
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count,adj_count,ner_count
0,An excellent example of the spectacular Busby ...,1,72,0,0,0,24
1,"In Manhattan, the American middle class Jim Bl...",1,87,1,1,1,24
2,"""Foxes"" is a great film. The four young actres...",1,16,2,2,2,6
3,Another comment about this film made it sound ...,1,27,3,3,3,5
4,The energetic young producer of theatrical pro...,1,163,4,4,4,38
