# <font color = 'dodgerblue'>**Feature Engineering**

# <font color = 'dodgerblue'>**Installing/Importing libraries**

In [None]:
!pip install swifter -qq

In [None]:
!pip install -U spacy -qq

In [None]:
# Import libraries
import pandas as pd
import spacy
from pathlib import Path
import tarfile
import swifter
import random
import textwrap as tw



In [None]:
# check the installed spaCy version
# (the spaCy model downloaded below must be compatible with this version)
print(spacy.__version__)

3.4.1


In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
     |████████████████████████████████| 13.9 MB 6.9 MB/s
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# <font color = 'dodgerblue'>**Specify Data Folders**

In [None]:
# specify paths to save and download data
base_path = '/home/harpreet/Insync/google_drive_shaannoor/data'
#base_path = '/content/drive/MyDrive/datasets'
base_folder = Path(base_path)
archive_folder = base_folder/'archive'
data_folder = base_folder/'datasets'

# <font color = 'dodgerblue'>**Loading Data Set**

*Note: We have already downloaded the IMDB dataset and created the train and test CSV files (see the notebook Faster_tokenization_spacy). You can download train.csv and test.csv from eLearning.*

The data was downloaded from the following link: https://ai.stanford.edu/~amaas/data/sentiment/.

Description of the data from the above link:
"*This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.*".

In [None]:
df = pd.read_csv(data_folder/'aclImdb'/ 'train.csv', index_col=0)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  25000 non-null  object
 1   Labels   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 585.9+ KB


In [None]:
df.head()

Unnamed: 0,Reviews,Labels
0,Ever wanted to know just how much Hollywood co...,1
1,The movie itself was ok for the kids. But I go...,1
2,You could stage a version of Charles Dickens' ...,1
3,this was a fantastic episode. i saw a clip fro...,1
4,and laugh out loud funny in many scenes.<br />...,1


# <font color = 'dodgerblue'>**Feature Engineering on IMDB dataset**

* If we look at the dataset, it contains reviews and their labels, where a label is 1 for positive and 0 for negative.
* Now let's see what kind of feature engineering can be done for this dataset.
* Since a review is text data, we can count its total number of words, because users sometimes tend to write a long review when they dislike a movie. This is only an assumption, and in feature engineering we should always experiment to check such assumptions.
* Similarly, we can create features for character count, digit count, and POS tag counts (such as noun, adjective, and verb counts), and we can also count the entities in a review.
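
The assumption in the third bullet (that negative reviews tend to be longer) can be checked by comparing the average word count per label. The following is a minimal sketch on made-up toy data, not on the IMDB reviews; `toy_reviews` and `mean_word_count_by_label` are hypothetical names introduced only for illustration, and words are split on whitespace rather than with spaCy.

```python
from statistics import mean

# Toy (review, label) pairs; made up for illustration, not taken from the IMDB data.
toy_reviews = [
    ("Loved it", 1),
    ("A wonderful and moving film", 1),
    ("Terrible pacing and a plot that goes nowhere at all", 0),
    ("Boring", 0),
]


def mean_word_count_by_label(rows):
    """Group whitespace-based word counts by label and average them."""
    grouped = {}
    for text, label in rows:
        grouped.setdefault(label, []).append(len(text.split()))
    return {label: mean(counts) for label, counts in grouped.items()}


result = mean_word_count_by_label(toy_reviews)
print(result)
```

On the real DataFrame, the same comparison could be made with `df.groupby('Labels')['word_count'].mean()`, assuming a `word_count` column has been created first (that column name is an assumption; it does not exist in this notebook yet).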

We can also extract POS tags and Named Entities as features. **Do we need to preprocess the data before these steps? What preprocessing steps should we do?**

  1. number of nouns or proper nouns
  2. number of aux
  3. number of verbs
  4. number of adjectives
  5. number of ner (entities)
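
One preprocessing step that plausibly matters for these features: the raw reviews contain HTML line breaks (`<br />`), which a tagger would otherwise see as tokens. Below is a minimal sketch of removing them before tagging, assuming `<br>` variants are the only markup to handle. `strip_line_breaks` is a hypothetical helper and is not part of the notebook's `custom_preprocessor` module; the second sentence of `sample` is made up.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_line_breaks(text: str) -> str:
    """Replace HTML <br> tags with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = "and laugh out loud funny in many scenes.<br /><br />The plot is basic."
cleaned_sample = strip_line_breaks(sample)
print(cleaned_sample)
```

Casing, punctuation, and stop words are left untouched here on purpose, since POS taggers and NER models generally rely on them.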

**Do we need to preprocess the data before these steps? What preprocessing steps should we do?**

  6. number of words
  7. number of characters
  8. number of characters without space
  9. average word length
  10. number of digits
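
Features 6 to 10 can be computed without spaCy. The following is a minimal sketch under stated assumptions: words are split on whitespace (not spaCy tokens), "number of digits" is read as the number of digit characters, and `basic_text_features` is a hypothetical helper name.

```python
def basic_text_features(text: str) -> dict:
    """Compute simple length-based features for one review."""
    words = text.split()  # whitespace tokenization (an assumption)
    char_count = len(text)
    char_count_no_space = sum(1 for ch in text if not ch.isspace())
    digit_count = sum(1 for ch in text if ch.isdigit())
    # Guard against empty reviews to avoid division by zero.
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "word_count": len(words),
        "char_count": char_count,
        "char_count_no_space": char_count_no_space,
        "avg_word_length": avg_word_length,
        "digit_count": digit_count,
    }


features = basic_text_features("I saw 2 films in 2021")
print(features)
```

To turn this into columns, one option is `df['Reviews'].apply(basic_text_features).apply(pd.Series)`, which expands each returned dictionary into one column per feature.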

## <font color = 'dodgerblue'>**Extracting POS**



In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
# this cell will take around 10 minutes on colab
noun_count = []
aux_count = []
verb_count = []
adj_count = []
# POS tagging does not need the lemmatizer or NER, so disable them for speed
disabled = nlp.select_pipes(disable=['lemmatizer', 'ner'])
for doc in nlp.pipe(df.Reviews.values, batch_size=1000, n_process=-1):
    nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
    auxs = [token.text for token in doc if token.pos_ == "AUX"]
    verbs = [token.text for token in doc if token.pos_ == "VERB"]
    adjectives = [token.text for token in doc if token.pos_ == "ADJ"]

    # append the number of matching tokens in this review
    # (taking len() of the running *_count lists would only yield 0, 1, 2, ...)
    noun_count.append(len(nouns))
    aux_count.append(len(auxs))
    verb_count.append(len(verbs))
    adj_count.append(len(adjectives))

disabled.restore()
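
The per-review counting in the loop above can also be done in a single pass with `collections.Counter`. The following is a minimal sketch that uses hand-written (token, tag) pairs as a stand-in for one tagged review, so the tags are illustrative assumptions rather than spaCy output; `pos_counts` is a hypothetical helper name.

```python
from collections import Counter

# Hand-written (token, coarse POS tag) pairs standing in for one tagged review.
tagged_review = [
    ("The", "DET"), ("movie", "NOUN"), ("was", "AUX"), ("really", "ADV"),
    ("good", "ADJ"), ("and", "CCONJ"), ("Spielberg", "PROPN"),
    ("directed", "VERB"), ("it", "PRON"),
]


def pos_counts(tagged_tokens):
    """Return noun/aux/verb/adjective counts for one tagged document."""
    tag_freq = Counter(tag for _, tag in tagged_tokens)
    return {
        "noun_count": tag_freq["NOUN"] + tag_freq["PROPN"],
        "aux_count": tag_freq["AUX"],
        "verb_count": tag_freq["VERB"],
        "adj_count": tag_freq["ADJ"],
    }


counts = pos_counts(tagged_review)
print(counts)
```

With a real spaCy `doc`, `Counter(token.pos_ for token in doc)` produces the same kind of tally, and each review's counts are appended to the corresponding list once per iteration.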

### <font color = 'dodgerblue'>**1. number of nouns or proper nouns**
Extract the number of nouns or proper nouns in each review into a new column, 'noun_count'.

In [None]:
df['noun_count'] = noun_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count
0,Ever wanted to know just how much Hollywood co...,1,59
1,The movie itself was ok for the kids. But I go...,1,11
2,You could stage a version of Charles Dickens' ...,1,214
3,this was a fantastic episode. i saw a clip fro...,1,28
4,and laugh out loud funny in many scenes.<br />...,1,31


### <font color = 'dodgerblue'>**2. number of aux**
Extract the number of auxiliaries (auxiliary verbs) in each review into a new column, 'aux_count'. Hint: the POS tag in spaCy is AUX.

In [None]:
df['aux_count'] = aux_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count
0,Ever wanted to know just how much Hollywood co...,1,59,0
1,The movie itself was ok for the kids. But I go...,1,11,1
2,You could stage a version of Charles Dickens' ...,1,214,2
3,this was a fantastic episode. i saw a clip fro...,1,28,3
4,and laugh out loud funny in many scenes.<br />...,1,31,4


### <font color = 'dodgerblue'>**3. number of verbs**
Extract the number of verbs in each review into a new column, 'verb_count'.

In [None]:
df['verb_count'] = verb_count
df.head()


Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count
0,Ever wanted to know just how much Hollywood co...,1,59,0,0
1,The movie itself was ok for the kids. But I go...,1,11,1,1
2,You could stage a version of Charles Dickens' ...,1,214,2,2
3,this was a fantastic episode. i saw a clip fro...,1,28,3,3
4,and laugh out loud funny in many scenes.<br />...,1,31,4,4


### <font color = 'dodgerblue'>**4. number of adjectives**
Extract the number of adjectives in each review into a new column, 'adj_count'.

In [None]:
df['adj_count'] = adj_count
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count,adj_count
0,Ever wanted to know just how much Hollywood co...,1,59,0,0,0
1,The movie itself was ok for the kids. But I go...,1,11,1,1,1
2,You could stage a version of Charles Dickens' ...,1,214,2,2,2
3,this was a fantastic episode. i saw a clip fro...,1,28,3,3,3
4,and laugh out loud funny in many scenes.<br />...,1,31,4,4,4


## <font color = 'dodgerblue'>**Extracting Named Entities**
Extract the number of named entities (NER) in each review into a new column, 'ner_count'.

In [None]:
# this cell will take around 10 minutes on colab
count_ner = []
disabled = nlp.select_pipes(disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
for doc in nlp.pipe(df.Reviews.values, batch_size=1000, n_process=-1):
    ners = [ent.label_ for ent in doc.ents] 
    count_ner.append(len(ners))
disabled.restore()

In [None]:
df['ner_count'] = count_ner
df.head()

Unnamed: 0,Reviews,Labels,noun_count,aux_count,verb_count,adj_count,ner_count
0,Ever wanted to know just how much Hollywood co...,1,59,0,0,0,13
1,The movie itself was ok for the kids. But I go...,1,11,1,1,1,0
2,You could stage a version of Charles Dickens' ...,1,214,2,2,2,47
3,this was a fantastic episode. i saw a clip fro...,1,28,3,3,3,8
4,and laugh out loud funny in many scenes.<br />...,1,31,4,4,4,6


# <font color = 'dodgerblue'>**Preprocessing**

In [None]:
import sys
#sys.path.append('/content/drive/MyDrive/data/custom_functions')
sys.path.append('/home/harpreet/Insync/google_drive_shaannoor/data/custom-functions')

In [None]:
import custom_preprocessor as cp

In [None]:
text = ["""New version of operation system is iOS 11. It is better than iOS 9.
The new version of iPhone X seems cool."""]

In [None]:
preprocessor = cp.SpacyPreprocessor(model = 'en_core_web_sm', batch_size=64, lammetize=False, lower=False, remove_stop=False, 
                remove_punct=False, remove_email=True, remove_url=True, remove_num=False, stemming = True,
                add_user_mention_prefix=False, remove_hashtag_prefix=False)
cleaned_text = preprocessor.fit_transform(text)

cleaned_text

['new version of operation system be ios it be well than ios the new version of iphone x seem cool']

In [None]:
preprocessor = cp.SpacyPreprocessor(model = 'en_core_web_sm', 
                                    batch_size = 1000, 
                                    lammetize=True, 
                                    lower=True, 
                                    remove_stop=True, 
                                    remove_punct=True, 
                                    remove_email=True, 
                                    remove_url=True, 
                                    remove_num=False, 
                                    stemming = False,
                                    add_user_mention_prefix=False, 
                                    remove_hashtag_prefix=False)

In [None]:
cleaned_text = preprocessor.fit_transform(df.Reviews.values)



In [None]:
df['cleaned_text'] = cleaned_text

In [None]:
df.head()

Unnamed: 0,Reviews,Labels,cleaned_text
0,Ever wanted to know just how much Hollywood co...,1,want know hollywood away hayes code officially...
1,The movie itself was ok for the kids. But I go...,1,movie ok kid got to tell ya scratch little squ...
2,You could stage a version of Charles Dickens' ...,1,stage version charles dickens christmas carol ...
3,this was a fantastic episode. i saw a clip fro...,1,fantastic episode see clip youtube vow tv glue...
4,and laugh out loud funny in many scenes.<br />...,1,laugh loud funny scene movie basic plot chroni...


In [None]:
df.to_csv(data_folder/'aclImdb' /'train_mod.csv')

In [None]:
# show the raw text of one review (column 0 is 'Reviews')
df.iloc[5, 0]

"I just can't understand the negative comments about this film. Yes it is a typical boy-meets-girl romance but it is done with such flair and polish that the time just flies by. Henstridge (talk about winning the gene-pool lottery!) is as magnetic and alluring as ever (who says the golden age of cinema is dead?) and Vartan holds his own.<br /><br />There is simmering chemistry between the two leads; the film is most alive when they share a scene - lots! It is done so well that you find yourself willing them to get together...<br /><br />Ignore the negative comments - if you are feeling a bit blue, watch this flick, you will feel so much better. If you are already happy, then you will be euphoric.<br /><br />(PS: I am 33, Male, from the UK and a hopeless romantic still searching for his Princess...)"