# NLP

### Goals:
- Gain a better understanding of text preprocessing
- Create a term-document matrix based on a text / genre corpus
- Create a new dataframe with all text from a specific genre within a single cell

The data I'm using is a modified dataset from Kaggle. I scraped each book's text from its Goodreads page.

In [91]:
import re
import numpy as np
import pandas as pd
from langdetect import detect
from tqdm.notebook import tqdm

# NLP
import nltk
from nltk.tag import pos_tag
from nltk.corpus import stopwords
from unicodedata import normalize
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.sentiment import SentimentIntensityAnalyzer

In [92]:
# df = pd.read_csv('original_datasets/books_CLEANED.csv')
# df = df[df['genre'] != 'Nonfiction'] # Remove nonfiction books

df = pd.read_csv("original_datasets/my_books_CLEANED.csv")
df = df[['id', 'title', 'author', 'genre', 'text']]  # Reorder columns

print(df.shape, '\n')
df.head()

(5701, 5) 



Unnamed: 0,id,title,author,genre,text
0,1,The Hunger Games,Suzanne Collins,Fantasy,"Could you survive on your own in the wild, wit..."
1,2,Harry Potter and the Sorcerers Stone,J.K. Rowling,Fantasy,Harry Potter has no idea how famous he is. Tha...
2,3,Twilight,Stephenie Meyer,Romance,About three things I was absolutely positive.F...
3,5,The Great Gatsby,F. Scott Fitzgerald,Romance,"The Great Gatsby, F. Scott Fitzgerald's third ..."
4,7,The Hobbit,J.R.R. Tolkien,Fantasy,In a hole in the ground there lived a hobbit. ...


### Preprocess `text`

In [93]:
%run 00_preprocessing_fuctions.ipynb

In [94]:
lem = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()

In [95]:
polarity_results = {}
all_filtered_texts = {}

for index, row in tqdm(df.iterrows(), total=len(df)):
    curr_id = row['id']
    curr_text = row['text']
    
    # Tokenize, lemmatize and filter text by removing stopwords
    # Also removing as many character names from text as I can...
    cleaned_text = clean_text(curr_text)  # Clean text
    preprocessed_text = preprocess_text(cleaned_text) 
    
    # Store filtered text and filtered text polarity scores 
    all_filtered_texts[curr_id] = [preprocessed_text]
    polarity_results[curr_id] = sia.polarity_scores(preprocessed_text)

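`clean_text` and `preprocess_text` are defined in the separate preprocessing notebook, so their exact behaviour is not shown here. As a rough, hypothetical stand-in (the names `clean_text_sketch`, `preprocess_text_sketch` and the tiny `STOPWORDS` set are assumptions, and lemmatization and name removal are left out), the general idea could be sketched as:

```python
import re

# Tiny illustrative stopword set; an assumption, not the list used in the notebook.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to",
             "you", "your", "own", "with", "no", "he", "how", "has"}


def clean_text_sketch(text: str) -> str:
    """Lowercase the text and replace anything that is not a letter or whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()


def preprocess_text_sketch(text: str) -> str:
    """Drop stopwords and one-letter tokens (no lemmatization in this sketch)."""
    tokens = [t for t in text.split(" ") if t not in STOPWORDS and len(t) > 1]
    return " ".join(tokens)


sample = "Could you survive on your own in the wild?"
print(preprocess_text_sketch(clean_text_sketch(sample)))
```

Splitting the work into a cleaning step and a filtering step mirrors the two calls in the loop above, which makes each step easy to test on a single sentence before running it over all rows.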

Create a new dataframe that joins the filtered text and polarity scores with `genre` and `text` from the original dataframe on `id`:

In [96]:
polarity_df = pd.DataFrame(polarity_results).T.reset_index().rename(columns={'index': 'id'})
filtered_df = pd.DataFrame(all_filtered_texts).T.reset_index().rename(columns={0: 'filtered_text', 'index': 'id'})

# Merge both dataframes with 'genre', 'text' and 'id' from original dataframe
merged_df = filtered_df.merge(polarity_df, on='id', how='left')
preprocessed_df = df[['id', 'genre', 'text']].merge(merged_df, on='id', how='left').rename(columns={'text': 'original_text'})
preprocessed_df.head()

Unnamed: 0,id,genre,original_text,filtered_text,neg,neu,pos,compound
0,1,Fantasy,"Could you survive on your own in the wild, wit...",could survive wild every make sure live see mo...,0.262,0.618,0.12,-0.9504
1,2,Fantasy,Harry Potter has no idea how famous he is. Tha...,harry potter idea famous raised miserable aunt...,0.114,0.71,0.177,0.4588
2,3,Romance,About three things I was absolutely positive.F...,three thing absolutely positive first edward v...,0.0,0.642,0.358,0.923
3,5,Romance,"The Great Gatsby, F. Scott Fitzgerald's third ...",great gatsby scott fitzgerald third stand supr...,0.036,0.523,0.441,0.9819
4,7,Fantasy,In a hole in the ground there lived a hobbit. ...,hole ground lived hobbit nasty dirty wet hole ...,0.193,0.598,0.209,0.3612
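The dictionary-to-dataframe step can also be written with `pd.DataFrame.from_dict(..., orient="index")`, which avoids the transpose. The snippet below is only a sketch on made-up toy values (the two dictionaries are stand-ins for the ones built in the loop), and `validate="one_to_one"` is an optional extra check that each `id` appears once on both sides:

```python
import pandas as pd

# Toy stand-ins for the dictionaries built in the loop; the values are made up.
polarity_results = {
    1: {"neg": 0.2, "neu": 0.7, "pos": 0.1, "compound": -0.5},
    2: {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.8},
}
all_filtered_texts = {1: ["could survive wild"], 2: ["harry potter idea"]}

# One row per id, one column per polarity key
polarity_df = (
    pd.DataFrame.from_dict(polarity_results, orient="index")
    .rename_axis("id")
    .reset_index()
)

# One row per id with a single 'filtered_text' column
filtered_df = (
    pd.DataFrame.from_dict(all_filtered_texts, orient="index", columns=["filtered_text"])
    .rename_axis("id")
    .reset_index()
)

# validate raises a MergeError if an id is duplicated on either side
merged = filtered_df.merge(polarity_df, on="id", how="left", validate="one_to_one")
print(merged)
```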


In [97]:
preprocessed_df['filtered_text'][0]

'could survive wild every make sure live see morning ruin place known north america lie nation panem shining capitol surrounded twelve outlying district capitol harsh cruel keep district line forcing send boy girl age twelve eighteen participate annual hunger game fight death live sixteen year old katniss everdeen life alone mother younger sister regard death sentence step forward take sister place game katniss close dead beforeand survival second nature without really meaning becomes contender win start making choice weight survival humanity life love'

### Remove rows whose `filtered_text` has fewer than 5 words

In [98]:
# .copy() avoids a SettingWithCopyWarning when new columns are added later
preprocessed_df = preprocessed_df[preprocessed_df['filtered_text'].apply(lambda x: len(x.split(' ')) >= 5)].copy()

### Add sentiment column

A text is labelled 'POSITIVE' if its compound score is >= 0.5, 'NEUTRAL' if its compound score is in the range [0, 0.5), and 'NEGATIVE' if its compound score is < 0.

In [99]:
def calculate_sentiment(score: float) -> str:
    if score >= 0.5:
        return 'POSITIVE'
    elif score >= 0:  # covers [0, 0.5), including scores between 0.49 and 0.5
        return 'NEUTRAL'
    else:
        return 'NEGATIVE'

In [100]:
preprocessed_df['text_sentiment'] = preprocessed_df['compound'].apply(calculate_sentiment)
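Threshold rules like this are easy to get subtly wrong at the boundaries, so a quick check on a few edge values can help. The helper `classify_compound` below is a hypothetical stand-alone version that uses half-open bins, so every compound score in [-1, 1] gets exactly one label:

```python
def classify_compound(score: float) -> str:
    """Map a compound score in [-1, 1] to a label using half-open bins."""
    if score >= 0.5:
        return "POSITIVE"
    if score >= 0:
        return "NEUTRAL"
    return "NEGATIVE"


# Boundary values plus a few scores from the table above
for s in (-0.9504, 0.0, 0.4588, 0.495, 0.5, 0.923):
    print(s, classify_compound(s))
```

A value such as 0.495 is the interesting case: it sits between 0.49 and 0.5, so any rule with a gap between those two numbers would send it to the fallback branch.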

In [101]:
preprocessed_df.head(3)

Unnamed: 0,id,genre,original_text,filtered_text,neg,neu,pos,compound,text_sentiment
0,1,Fantasy,"Could you survive on your own in the wild, wit...",could survive wild every make sure live see mo...,0.262,0.618,0.12,-0.9504,NEGATIVE
1,2,Fantasy,Harry Potter has no idea how famous he is. Tha...,harry potter idea famous raised miserable aunt...,0.114,0.71,0.177,0.4588,NEUTRAL
2,3,Romance,About three things I was absolutely positive.F...,three thing absolutely positive first edward v...,0.0,0.642,0.358,0.923,POSITIVE


### Add verb / adjective / noun counts

To be honest, I have no idea how useful these columns will be, but... ah well.

In [102]:
def count_tags_in_sentence(text: str, valid_tags: set) -> int:
    """
    Count words whose POS tag is in `valid_tags` and return count.
    """
    word_tags = pos_tag(text.split(' '))
    return sum(1 for _, tag in word_tags if tag in valid_tags)


def count_verbs_in_sentence(text: str) -> int:
    """
    Count words tagged as a verb and return count.
    """
    return count_tags_in_sentence(text, {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'})


def count_adjs_in_sentence(text: str) -> int:
    """
    Count words tagged as an adjective and return count.
    """
    return count_tags_in_sentence(text, {'JJ', 'JJR', 'JJS'})


def count_nouns_in_sentence(text: str) -> int:
    """
    Count words tagged as a noun and return count.
    """
    return count_tags_in_sentence(text, {'NN', 'NNP', 'NNS'})

In [103]:
preprocessed_df['verb_count'] = preprocessed_df['filtered_text'].apply(count_verbs_in_sentence)
preprocessed_df['adj_count'] = preprocessed_df['filtered_text'].apply(count_adjs_in_sentence)
preprocessed_df['noun_count'] = preprocessed_df['filtered_text'].apply(count_nouns_in_sentence)
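Each of the three `apply` calls above tags the same text again, so the tagger runs three times per row. One possible alternative, sketched here under the assumption that the tagging is done once and the `(word, tag)` pairs are passed in, is to count all three groups in a single pass. `TAG_GROUPS` and `count_pos_groups` are hypothetical names, and the pairs below are hand-written for illustration rather than real tagger output:

```python
# Tag groups mirroring the lists used in the counting functions above
TAG_GROUPS = {
    "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adj": {"JJ", "JJR", "JJS"},
    "noun": {"NN", "NNP", "NNS"},
}


def count_pos_groups(tagged) -> dict:
    """Count verbs, adjectives and nouns from (word, tag) pairs in one pass."""
    counts = {name: 0 for name in TAG_GROUPS}
    for _, tag in tagged:
        for name, tags in TAG_GROUPS.items():
            if tag in tags:
                counts[name] += 1
    return counts


# Hand-tagged pairs for illustration; in the notebook they would come from
# pos_tag(text.split(' ')).
tagged = [("hole", "NN"), ("ground", "NN"), ("lived", "VBD"),
          ("hobbit", "NN"), ("nasty", "JJ")]
print(count_pos_groups(tagged))
```

With roughly 5,700 rows the saving is modest, but tagging is the slow part of this step, so doing it once per text is the cheaper design.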

In [104]:
preprocessed_df.head(3)

Unnamed: 0,id,genre,original_text,filtered_text,neg,neu,pos,compound,text_sentiment,verb_count,adj_count,noun_count
0,1,Fantasy,"Could you survive on your own in the wild, wit...",could survive wild every make sure live see mo...,0.262,0.618,0.12,-0.9504,NEGATIVE,14,22,38
1,2,Fantasy,Harry Potter has no idea how famous he is. Tha...,harry potter idea famous raised miserable aunt...,0.114,0.71,0.177,0.4588,NEUTRAL,9,14,21
2,3,Romance,About three things I was absolutely positive.F...,three thing absolutely positive first edward v...,0.0,0.642,0.358,0.923,POSITIVE,2,8,8


### Finalize dataframes

Combine all texts within each genre to create a genre corpus:

In [105]:
unique_genres = preprocessed_df['genre'].unique()
template = {}

for genre in unique_genres:
    genre_text = " ".join(preprocessed_df[preprocessed_df['genre'] == genre]['filtered_text'].to_list())
    template[genre] = [genre_text]
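The same per-genre concatenation can be expressed as a `groupby`. This is a sketch on a made-up three-row frame (`toy` is a stand-in for `preprocessed_df`); `sort=False` keeps genres in order of first appearance, which matches what `unique()` gives in the loop:

```python
import pandas as pd

# Toy stand-in for preprocessed_df; the rows are made up.
toy = pd.DataFrame({
    "genre": ["Fantasy", "Romance", "Fantasy"],
    "filtered_text": ["hole ground hobbit", "three thing positive", "could survive wild"],
})

# Join all filtered texts of a genre into one space-separated string
genre_corpus = (
    toy.groupby("genre", sort=False)["filtered_text"]
    .agg(" ".join)
    .reset_index()
    .rename(columns={"filtered_text": "text"})
)
print(genre_corpus)
```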

In [106]:
# Genre corpus dataframe
genre_df = pd.DataFrame(template).T.reset_index().rename(columns={'index': 'genre', 0: 'text'})

genre_df

Unnamed: 0,genre,text
0,Fantasy,could survive wild every make sure live see mo...
1,Romance,three thing absolutely positive first edward v...
2,Thriller,world renowned harvard symbologist robert lang...
3,Science Fiction,year come gone george orwell prophetic nightma...
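With one text per genre, a term-document matrix is a table with one row per term and one column per genre, holding raw counts. The following is a minimal standard-library sketch on a made-up two-genre corpus; `term_document_matrix` is a hypothetical helper and the whitespace tokenization is an assumption that fits the already-filtered text:

```python
from collections import Counter


def term_document_matrix(corpus: dict) -> tuple:
    """Return (terms, documents, matrix); matrix[i][j] counts terms[i] in documents[j]."""
    documents = list(corpus)
    # Whitespace tokenization, since the filtered text is already cleaned
    counts = {doc: Counter(text.split()) for doc, text in corpus.items()}
    terms = sorted(set().union(*counts.values()))
    matrix = [[counts[doc][term] for doc in documents] for term in terms]
    return terms, documents, matrix


# Made-up miniature corpus standing in for genre_df
corpus = {
    "Fantasy": "hole ground hobbit hole",
    "Romance": "three thing positive thing",
}
terms, documents, matrix = term_document_matrix(corpus)
for term, row in zip(terms, matrix):
    print(f"{term:10s}", row)
```

For the full corpus a sparse representation is usually preferable; scikit-learn's `CountVectorizer` is one common choice, and it produces the transposed layout (documents as rows).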


### Save data

In [107]:
# Save processed datasets as csv files
preprocessed_df.to_csv('processed_datasets/my_books_PROCESSED.csv', index=False)
genre_df.to_csv('processed_datasets/my_genre_corpus.csv', index=False)