# Feature Engineering 

* In this HW you will extract new features(variables) using regular expression and spacy.

* You will do feature engineering on IMDB movie review dataset.


## 1. Installing/Importing libraries

In [90]:
!pip install -U spacy



In [91]:
# Import libraries
import pandas as pd
import spacy
from pathlib import Path
import tarfile

In [92]:
# check the version of spacy
# your spacy version should be 3.1.2
print(spacy.__version__)

3.1.2


In [93]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Loading Data Set

The imdb moview review dataset. The details of the data can be found from this link : https://ai.stanford.edu/~amaas/data/sentiment/.

Description :
"*This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.*".

We will extract the data from text files and save  the train and test data as csv files. 

In [94]:
# provide the path to where you are storing datasets
folder =Path('/content/drive/MyDrive/Colab_Notebooks/nlpAssignment')

In [95]:
# create a subfolder for this data
movie_review_data = folder /'movie_review_data'
!mkdir {str(movie_review_data)}


In [96]:
basepath = str(movie_review_data)

In [None]:
# Now we will use wget to get the data
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
!wget {url} -P {basepath}

--2021-09-17 04:19:23--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb_v1.tar.gz.1’


2021-09-17 04:19:25 (32.9 MB/s) - ‘/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



In [None]:
for entities in movie_review_data.iterdir():
  print(entities.name)

aclImdb
aclImdb_v1.tar.gz
train.csv
test.csv


In [None]:
# extract all the files/folder

file = movie_review_data/'aclImdb_v1.tar.gz'
with tarfile.open(file, 'r') as f:
  f.extractall(path = movie_review_data)

In [None]:
# we are only printing the directory names
for x in movie_review_data.glob('**'):
  print(x)


/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/test
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/test/pos
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/test/neg
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/train
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/train/unsup
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/train/pos
/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/movie_review_data/aclImdb/train/neg


In [None]:
def get_reviews(path):
  reviews = []
  for file in path.iterdir():
    
    # check if the file is a text file
    if file.suffix == '.txt':
      # We can open files and read or write their contents using open() function
      # The files are opened in read-only mode for reading content
      with open(path/file,'r') as f:
        # We store our text from the files into the positive_reviews list as an element in our list
        text = f.read()
        # append the review to the list
        reviews.append(text)
  return reviews

In [None]:
# Function to create dataframe from extracted list of files
def make_dataframe(folder):
  
  positive_reviews = get_reviews(folder / 'pos')
  negative_reviews = get_reviews(folder / 'neg')
  
  data = pd.DataFrame({'Reviews':positive_reviews + negative_reviews,
                                'Labels':list('1' * len(positive_reviews) + '0' * len(negative_reviews))})
  # We want our labels to be int32 type, we will change that here
  data.astype({'Labels':'int32'}).dtypes
  return data

In [None]:
# create a train data set
train_data = make_dataframe(movie_review_data/'aclImdb/train')

In [None]:
# create a test data set
test_data = make_dataframe(movie_review_data/'aclImdb/test')

In [None]:
train_data.to_csv(movie_review_data/'train.csv')

In [None]:
test_data.to_csv(movie_review_data/'test.csv')

In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  25000 non-null  object
 1   Labels   25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB


In [None]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  25000 non-null  object
 1   Labels   25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB


## 2. Import Data



In [97]:
# replace tge path to where you are storing datasets for this course
folder =Path('/content/drive/MyDrive/Colab_Notebooks/nlpAssignment')
movie_review_data = folder /'movie_review_data'
df = pd.read_csv(movie_review_data / 'train.csv')

In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  25000 non-null  int64 
 1   Reviews     25000 non-null  object
 2   Labels      25000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 586.1+ KB


In [99]:
df.head()

Unnamed: 0.1,Unnamed: 0,Reviews,Labels
0,0,Zentropa has much in common with The Third Man...,1
1,1,Zentropa is the most original movie I've seen ...,1
2,2,Lars Von Trier is never backward in trying out...,1
3,3,*Contains spoilers due to me having to describ...,1
4,4,That was the first thing that sprang to mind a...,1


In [100]:
df = df.drop(['Unnamed: 0'], axis=1)
df

Unnamed: 0,Reviews,Labels
0,Zentropa has much in common with The Third Man...,1
1,Zentropa is the most original movie I've seen ...,1
2,Lars Von Trier is never backward in trying out...,1
3,*Contains spoilers due to me having to describ...,1
4,That was the first thing that sprang to mind a...,1
...,...,...
24995,There just isn't enough here. There a few funn...,0
24996,Tainted look at kibbutz life<br /><br />This f...,0
24997,"I saw this movie, just now, not when it was re...",0
24998,Any film which begins with a cowhand shagging ...,0


In [101]:
#Lets ignore Labels as we don't need this class variable as of now
data = df.drop(['Labels'], axis=1)
data

Unnamed: 0,Reviews
0,Zentropa has much in common with The Third Man...
1,Zentropa is the most original movie I've seen ...
2,Lars Von Trier is never backward in trying out...
3,*Contains spoilers due to me having to describ...
4,That was the first thing that sprang to mind a...
...,...
24995,There just isn't enough here. There a few funn...
24996,Tainted look at kibbutz life<br /><br />This f...
24997,"I saw this movie, just now, not when it was re..."
24998,Any film which begins with a cowhand shagging ...


## 3. Feature Engineering on IMDB dataset

* If we look at our dataset it cotains reviews and thier labels, where labels can be 1 for positive and 0 for negative.
* Now let see what kind of feature engineering can be done for this dataset.
* We will igore labels and will use our domain knowledge to do feature engineering.
* This is a movie review dataset that review can be either positive or negative which eacch review might have some sentiments based on that only label will be either or positive.

* All the possible new feature that we will extract from the existing feature.
  1. number of words
  2. number of characters
  3. number of characters without space
  4. average word length
  5. number of digits
  6. number of nouns or propernouns
  7. number of aux
  8. number of verbs
  9. number of adjectives
  10. number of ner (entiites)


### 1. number of words
Extract number of words in a review in a new column 'word_count'.

In [102]:
#remove any html tags from Reviews
from bs4 import BeautifulSoup
import re

def basic_clean(text):
        '''
        This fuction removes HTML tags from text
        '''
        if (bool(BeautifulSoup(text, "html.parser").find())==True):         
            soup = BeautifulSoup(text, "html.parser")
            text = soup.get_text()
        else:
            pass
        return re.sub(r'[\n\r]',' ', text) 


In [103]:
data['Reviews'] = [basic_clean(review)  for review in data['Reviews']]
data.head()

Unnamed: 0,Reviews
0,Zentropa has much in common with The Third Man...
1,Zentropa is the most original movie I've seen ...
2,Lars Von Trier is never backward in trying out...
3,*Contains spoilers due to me having to describ...
4,That was the first thing that sprang to mind a...


In [104]:
# HTML Tags are removed 
data['Reviews'][0]

'Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn\'t really understand, and whose naivety is all the more striking in contrast with the natives.But I\'d have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.'

In [105]:
data['word_count'] = [len(review.split()) for review in data['Reviews']]

In [106]:
data.head()

Unnamed: 0,Reviews,word_count
0,Zentropa has much in common with The Third Man...,117
1,Zentropa is the most original movie I've seen ...,115
2,Lars Von Trier is never backward in trying out...,375
3,*Contains spoilers due to me having to describ...,459
4,That was the first thing that sprang to mind a...,581


### 2. number of characters
Extract number of characters in a review in a new column 'char_count'.

In [107]:
data['char_count'] = [len(review) for review in data['Reviews']] 
data.head()

Unnamed: 0,Reviews,word_count,char_count
0,Zentropa has much in common with The Third Man...,117,704
1,Zentropa is the most original movie I've seen ...,115,673
2,Lars Von Trier is never backward in trying out...,375,2115
3,*Contains spoilers due to me having to describ...,459,2472
4,That was the first thing that sprang to mind a...,581,3201


### 3. number of characters without space
Extract number of characters (ignoring spaces) in a review in a new column 'char_count_wo_space'.

In [108]:
data['char_count_wo_space'] = [(len(review)- review.count(' ')) for review 
                               in data['Reviews']]
data.head()

Unnamed: 0,Reviews,word_count,char_count,char_count_wo_space
0,Zentropa has much in common with The Third Man...,117,704,588
1,Zentropa is the most original movie I've seen ...,115,673,559
2,Lars Von Trier is never backward in trying out...,375,2115,1741
3,*Contains spoilers due to me having to describ...,459,2472,2014
4,That was the first thing that sprang to mind a...,581,3201,2621


### 4. average word length
Extract  average length of words in a  review in a new column 'avg_word_length'.

In [109]:
data['avg_word_length'] = data['char_count_wo_space']/ data['word_count']

### 5. number of digits
Extract number of digits in a review in a new column 'digit_count'.

In [110]:
data['digit_count'] = [len(re.findall(r'\d', review)) for review 
                       in data['Reviews']]

In [111]:
data.head()

Unnamed: 0,Reviews,word_count,char_count,char_count_wo_space,avg_word_length,digit_count
0,Zentropa has much in common with The Third Man...,117,704,588,5.025641,0
1,Zentropa is the most original movie I've seen ...,115,673,559,4.86087,0
2,Lars Von Trier is never backward in trying out...,375,2115,1741,4.642667,0
3,*Contains spoilers due to me having to describ...,459,2472,2014,4.3878,1
4,That was the first thing that sprang to mind a...,581,3201,2621,4.511188,4


### 6. number of nouns or proper nouns
Extract number of nouns or proper nouns in a review in a new column 'noun_count'.

In [112]:
spacy_folder = Path('/content/drive/MyDrive/Colab_Notebooks/nlpAssignment/SPACY')

url = 'https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0.tar.gz'
!wget {url} -P {spacy_folder} 

--2021-09-18 01:00:11--  https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0.tar.gz
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/84940268/91614580-d516-11eb-8aec-29de6db356b7?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210918%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210918T010012Z&X-Amz-Expires=300&X-Amz-Signature=0c656856b4b1cc87f4a7e0ef4fffae7245434861d72774e8fc81e8db37694302&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=84940268&response-content-disposition=attachment%3B%20filename%3Den_core_web_sm-3.1.0.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-09-18 01:00:12--  https://github-releases.githubusercontent.com/84940268/91614580-d516-11eb-8aec-29de6db356b7?X-Amz-Algorithm=AWS4-HMAC-SHA

In [None]:
file = spacy_folder / 'en_core_web_sm-3.1.0.tar.gz'
with  tarfile.open(file, 'r') as tar:
  tar.extractall(path = spacy_folder)

Load Spacy Model

In [114]:
model = spacy_folder /'en_core_web_sm-3.1.0'/'en_core_web_sm'/'en_core_web_sm-3.1.0'
nlp = spacy.load(model, disable=['parser'])

In [115]:
def getTokensWithPos(nlpdoc,pos):
  collection = [token.pos_ for token in nlpdoc if(token.pos_ in pos)]  
  return collection

In [119]:
# get all reviews processed by nlp and store them in docs. 
docs = [nlp(text) for text in data["Reviews"]]

In [120]:
# Get all necessary Parts of Speech for each Token identified from each doc 
posListsForReviews =  [getTokensWithPos(doc,["NOUN","PROPN","AUX","VERB",
                                               "ADJ"])  for doc in docs]

In [122]:
# get noun count in each collection fetched from each doc 
data['noun_count'] = [len([a for a in collection if
                           a in ["NOUN","PROPN"]])  for collection in
                       posListsForReviews]

In [123]:
data.head()

Unnamed: 0,Reviews,word_count,char_count,char_count_wo_space,avg_word_length,digit_count,noun_count
0,Zentropa has much in common with The Third Man...,117,704,588,5.025641,0,30
1,Zentropa is the most original movie I've seen ...,115,673,559,4.86087,0,29
2,Lars Von Trier is never backward in trying out...,375,2115,1741,4.642667,0,104
3,*Contains spoilers due to me having to describ...,459,2472,2014,4.3878,1,121
4,That was the first thing that sprang to mind a...,581,3201,2621,4.511188,4,174


### 7. number of aux
Extract number of auxilaries (auxilary verbs) in a review in a new column 'aux_count'. Hint: pos tag in spacy is AUX.

In [124]:
# get aux-verb count in each collection fetched from each doc 

data['aux_count'] = [len([a for a in collection if a == 'AUX'])  for 
                     collection in posListsForReviews]

### 8. number of verbs
Extract number of verbs in a review in a new column 'verb_count'. 

In [125]:
# get verb count in each collection fetched from each doc 

data['verb_count'] = [len([a for a in collection if a == 'VERB'])  for collection in posListsForReviews]

### 9. number of adjectives
Extract number of adjectives in a review in a new column 'adj_count'.

In [126]:
# get adjective count in each collection fetched from each doc 

data['adj_count'] = [len([a for a in collection if a == 'ADJ'])  for collection in posListsForReviews]

In [127]:
data.head()

Unnamed: 0,Reviews,word_count,char_count,char_count_wo_space,avg_word_length,digit_count,noun_count,aux_count,verb_count,adj_count
0,Zentropa has much in common with The Third Man...,117,704,588,5.025641,0,30,2,23,12
1,Zentropa is the most original movie I've seen ...,115,673,559,4.86087,0,29,0,25,12
2,Lars Von Trier is never backward in trying out...,375,2115,1741,4.642667,0,104,1,68,31
3,*Contains spoilers due to me having to describ...,459,2472,2014,4.3878,1,121,1,77,37
4,That was the first thing that sprang to mind a...,581,3201,2621,4.511188,4,174,9,91,41


### 10. number of ner (entites)
Extract number of named entities (ner) in a review in a new column 'ner_count'.

In [128]:
ner_count = [len(doc.ents) for doc in docs]
data['ner_count'] = ner_count

In [129]:
data.head()

Unnamed: 0,Reviews,word_count,char_count,char_count_wo_space,avg_word_length,digit_count,noun_count,aux_count,verb_count,adj_count,ner_count
0,Zentropa has much in common with The Third Man...,117,704,588,5.025641,0,30,2,23,12,6
1,Zentropa is the most original movie I've seen ...,115,673,559,4.86087,0,29,0,25,12,5
2,Lars Von Trier is never backward in trying out...,375,2115,1741,4.642667,0,104,1,68,31,15
3,*Contains spoilers due to me having to describ...,459,2472,2014,4.3878,1,121,1,77,37,21
4,That was the first thing that sprang to mind a...,581,3201,2621,4.511188,4,174,9,91,41,42
