<a href="https://colab.research.google.com/github/AbdulRauf96/NLP/blob/main/Initial_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color = 'dodgerblue'>**Feature Engineering**

# <font color = 'dodgerblue'>**Installing/Importing libraries**

In [12]:
# import pandas library for data manipulation
import pandas as pd

# import spacy library for natural language processing
import spacy

# import Path library from pathlib for file system management
from pathlib import Path

# import Counter library from collections for counting words
from collections import Counter

# import custom functions
from basic import basic_functions as bf

# import os library for operating system management
import os

In [2]:
# check the version of spacy
print(spacy.__version__)

3.5.1


In [3]:
!python -m spacy download en_core_web_sm -qq

[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


# <font color = 'dodgerblue'>**Specify Data Folders**

In [5]:
base_folder,data,archive,output = bf.set_folders()

Not Running on Colab
Base Folder is C:\Users\abdul\OneDrive\Documents\MSBA
Data Folder is C:\Users\abdul\OneDrive\Documents\MSBA\data_sets
Archive Folder is C:\Users\abdul\OneDrive\Documents\MSBA\archive
Output Folder is C:\Users\abdul\OneDrive\Documents\MSBA\output
The path to the custom functions is C:/Users/abdul/OneDrive/Documents/MSBA/custom_functions
The working directory is c:\Users\abdul\OneDrive\Documents\MSBA\notebooks\NLP


# <font color = 'dodgerblue'>**Loading Data Set** </font>

Note: We have allready downloaded the imdb data set and created train and test csv files
    

The data was downloaded from following link : https://ai.stanford.edu/~amaas/data/sentiment/.

Description of the data from the above link :
"*This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.*".

In [8]:
df = pd.read_csv(data/ 'train.csv', index_col=0)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  25000 non-null  object
 1   Labels   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 585.9+ KB


In [10]:
df.head()

Unnamed: 0,Reviews,Labels
0,I think this movie would be more enjoyable if ...,1
1,"Timeless musical gem, with Gene Kelly in top f...",1
2,I just found the IMDb and searched this film a...,1
3,It's a short movie from David Lynch with just ...,1
4,I dug out from my garage some old musicals and...,1


# <font color = 'dodgerblue'>**Feature Engineering on IMDB dataset**

* If we look at the dataset it cotains reviews and thier labels, where labels can be 1 for positive and 0 for negative.
* Now let see what kind of feature engineering can be done for this dataset.
* Since, review is a text data so we can count the total number of words because some times user tends to right a big review if they dislike it, although this is just an assumption, and we always tends to do experiment and check assumptions in feature engineering.
* Similarly, we can create features for character count, digit count, pos tags count such as noun, adjective, verb counts, and we can also count entities in a review.

We can also extract POS tags and Named Entities as features. **Do we need to preprocess data before this steps. What pre-prpcessing steps should we do?**

  1. number of nouns or propernouns
  2. number of aux
  3. number of verbs
  4. number of adjectives
  5. number of ner (entiites)

**Do we need to preprocess data before this steps. What pre-prpcessing steps should we do?**

  6. number of words
  7. number of characters
  8. number of characters without space
  9. average word length
  10. number of digits

## <font color = 'dodgerblue'>**Extracting POS**



In [13]:
nlp = spacy.load('en_core_web_sm')

In [14]:
os.cpu_count()

8

In [15]:
# This cell can take more than 1 hour to execute on colab

noun_count = [] # create a list to store the noun count for each document
aux_count = [] # create a list to store the auxiliary verb count for each document
verb_count = [] # create a list to store the verb count for each document
adj_count =[] # create a list to store the adjective count for each document

# disable lemmatizer and named entity recognizer
disabled = nlp.select_pipes(disable= ['lemmatizer', 'ner'])

# iterate over the documents in the dataframe using the spacy pipe method
for doc in nlp.pipe(df.Reviews.values, batch_size=1000, n_process=3):
    
    # find all nouns and proper nouns in the document and store in a list
    nouns = [token.text for token in doc if (token.pos_ in ["NOUN","PROPN"])] 
    
    # find all auxiliary verbs in the document and store in a list
    auxs =  [token.text for token in doc if (token.pos_ in ["AUX"])] 
    
    # find all verbs in the document and store in a list
    verbs =  [token.text for token in doc if (token.pos_ in ["VERB"])] 
    
    # find all adjectives in the document and store in a list
    adjectives =  [token.text for token in doc if (token.pos_ in ["ADJ"])]         
    
    # store the count of nouns in the noun_count list
    noun_count.append(len(nouns))
    
    # store the count of auxiliary verbs in the aux_count list
    aux_count.append(len(aux_count))
    
    # store the count of verbs in the verb_count list
    verb_count.append(len(verb_count))
    
    # store the count of adjectives in the adj_count list
    adj_count.append(len(adj_count))

# restore the lemmatizer and named entity recognizer
disabled.restore()

### <font color = 'dodgerblue'>**1. number of nouns or proper nouns**
Extract number of nouns or proper nouns in a review in a new column 'noun_count'.

In [16]:
df['noun_count'] = noun_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count
0,I think this movie would be more enjoyable if ...,1,29
1,"Timeless musical gem, with Gene Kelly in top f...",1,24
2,I just found the IMDb and searched this film a...,1,31
3,It's a short movie from David Lynch with just ...,1,33
4,I dug out from my garage some old musicals and...,1,45


### <font color = 'dodgerblue'>**2. number of aux**
Extract number of auxilaries (auxilary verbs) in a review in a new column 'aux_count'. Hint: pos tag in spacy is AUX.

In [18]:
df['aux_count'] = aux_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count
0,I think this movie would be more enjoyable if ...,1,29,0
1,"Timeless musical gem, with Gene Kelly in top f...",1,24,1
2,I just found the IMDb and searched this film a...,1,31,2
3,It's a short movie from David Lynch with just ...,1,33,3
4,I dug out from my garage some old musicals and...,1,45,4


### <font color = 'dodgerblue'>**3. number of verbs**
Extract number of verbs in a review in a new column 'verb_count'. 

In [19]:
df['verb_count'] = verb_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count
0,I think this movie would be more enjoyable if ...,1,29,0,0
1,"Timeless musical gem, with Gene Kelly in top f...",1,24,1,1
2,I just found the IMDb and searched this film a...,1,31,2,2
3,It's a short movie from David Lynch with just ...,1,33,3,3
4,I dug out from my garage some old musicals and...,1,45,4,4


### <font color = 'dodgerblue'>**4. number of adjectives**
Extract number of adjectives in a review in a new column 'adj_count'.

In [20]:
df['adj_count'] = adj_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count,adj_count
0,I think this movie would be more enjoyable if ...,1,29,0,0,0
1,"Timeless musical gem, with Gene Kelly in top f...",1,24,1,1,1
2,I just found the IMDb and searched this film a...,1,31,2,2,2
3,It's a short movie from David Lynch with just ...,1,33,3,3,3
4,I dug out from my garage some old musicals and...,1,45,4,4,4


## <font color = 'dodgerblue'>**Extracting Named Entities**
Extract number of named entities (ner) in a review in a new column 'ner_count'.

In [21]:
# Initialize empty list to store the number of named entities in each document
count_ner  = []

# Disable specified components in the spaCy pipeline
disabled = nlp.select_pipes(disable= ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])

# Loop over the reviews in the dataframe, processing each in batches of 1000 using spaCy's nlp.pipe method with parallel processing enabled (-1)
for doc in nlp.pipe(df.Reviews.values, batch_size=1000, n_process=2):
    # Get the named entities from the processed document
    ners = [ent.label_ for ent in doc.ents] 
    # Add the count of named entities to the list
    count_ner.append(len(ners))

# Restore the components in the spaCy pipeline
disabled.restore()

In [22]:
df['ner_count'] = count_ner
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count,adj_count,ner_count
0,I think this movie would be more enjoyable if ...,1,29,0,0,0,7
1,"Timeless musical gem, with Gene Kelly in top f...",1,24,1,1,1,3
2,I just found the IMDb and searched this film a...,1,31,2,2,2,10
3,It's a short movie from David Lynch with just ...,1,33,3,3,3,3
4,I dug out from my garage some old musicals and...,1,45,4,4,4,15
