## Character count, word count and punctuation

In [None]:
import pandas as pd
from textblob import TextBlob

In [None]:
data = pd.read_csv("amazon_alexa.tsv", sep="\t")
data.shape

(3150, 5)

In [None]:
data.head()

,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [None]:
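# keep only rows that have review text; copy() so that the new columns
# added below are written to an independent frame, not a view of the original
data = data.loc[data['verified_reviews'].notna()].copy()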

In [None]:
data['char_count'] = data['verified_reviews'].apply(len)

In [None]:
data['word_count'] = data['verified_reviews'].apply(lambda x: len(x.split()))

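The ratio of character count to word count approximates the average word length of a review. The character count also includes spaces and punctuation, so the value is only a rough estimate. The code below adds 1 to the word count so that a review containing no words does not divide by zero.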

In [None]:
data['word_density'] = data['char_count'] / (data['word_count'] + 1)
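To make the three features concrete, here is a minimal plain-Python sketch that computes them for a single string, using the same formulas as the cells above. The helper name text_stats is hypothetical and is not part of the notebook.

```python
def text_stats(text):
    """Return (char_count, word_count, word_density) for one review string."""
    char_count = len(text)
    # split() with no argument splits on any run of whitespace
    word_count = len(text.split())
    # the +1 mirrors the notebook's formula and avoids dividing by zero
    word_density = char_count / (word_count + 1)
    return char_count, word_count, word_density


print(text_stats("Love my Echo!"))  # (13, 3, 3.25)
print(text_stats(" "))              # (1, 0, 1.0)
```

Because of the +1 in the denominator, the density is lower than the plain ratio of characters to words, most noticeably for very short reviews.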

### Punctuation

string.punctuation is a predefined string of ASCII punctuation characters in Python's standard string module. The cell below imports the module, assigns that constant to the variable punctuation, and counts how many characters of each review appear in it.


In [None]:
import string
punctuation = string.punctuation
# count the characters of each review that are ASCII punctuation
data['punctuation_count'] = data['verified_reviews'].apply(lambda x: sum(ch in punctuation for ch in x))
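As a quick check of what is being counted, the sketch below applies the same membership test to single strings. The helper name count_punctuation is hypothetical. Note that string.punctuation covers ASCII characters only, so typographic characters such as a curly apostrophe are not counted.

```python
import string


def count_punctuation(text):
    """Count the characters of text that appear in string.punctuation."""
    return sum(ch in string.punctuation for ch in text)


print(len(string.punctuation))                         # 32
print(count_punctuation("Wow! what a beutyful day!"))  # 2
print(count_punctuation("it\u2019s"))                  # 0, curly apostrophe is not ASCII
```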

In [None]:
# summarise all these variables
data[['char_count','word_count','word_density','punctuation_count']].describe()

,char_count,word_count,word_density,punctuation_count
count,3149.0,3149.0,3149.0,3149.0
mean,132.090187,25.30073,4.606172,3.889171
std,182.114569,34.587753,1.133967,5.762846
min,1.0,0.0,0.5,0.0
25%,30.0,6.0,4.269231,1.0
50%,74.0,14.0,4.806452,2.0
75%,165.0,32.0,5.208333,5.0
max,2851.0,526.0,32.5,121.0


### Parts Of Speech (POS)

NLTK (Natural Language Toolkit) is a widely used Python library for natural language processing (NLP). It provides tools and resources for tasks such as tokenization, part-of-speech tagging, and parsing.

In this case we need the 'punkt' and 'averaged_perceptron_tagger' resources:

- 'punkt' : a pre-trained tokenizer model included in NLTK. It is used for tokenization, a key step in many NLP tasks that splits text into sentences and words.

- 'averaged_perceptron_tagger' : a pre-trained model used for part-of-speech (POS) tagging.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Nouns, pronouns, verbs, and so on are parts of speech.

'NN', 'NNS', etc. are the tags that represent them.

Given wiki = TextBlob(x), the wiki.tags attribute is a list of tuples, each containing a word from the text x and its part-of-speech tag.

In [None]:
# just an example
good = TextBlob("Wow! what a beutyful day!")
print(good.tags)

[('Wow', 'NN'), ('what', 'WP'), ('a', 'DT'), ('beutyful', 'JJ'), ('day', 'NN')]


In [None]:
# let's create a part-of-speech (POS) dictionary
pos_dict = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

In [None]:
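def pos_check(x, flag):
    """Count the words in x whose POS tag belongs to pos_dict[flag]."""
    count = 0
    try:
        # TextBlob(x).tags yields (word, tag) pairs
        for _, tag in TextBlob(x).tags:
            if tag in pos_dict[flag]:
                count += 1
    except Exception:
        # if tagging fails for this review, keep whatever was counted so far
        pass
    return count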
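The counting logic can be checked without running the tagger. The sketch below applies it to the (word, tag) pairs copied from the TextBlob example output shown earlier; the helper name count_tags is hypothetical, and pos_dict is repeated only so the snippet runs on its own.

```python
pos_dict = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}


def count_tags(tagged, flag):
    """Count the (word, tag) pairs whose tag belongs to pos_dict[flag]."""
    return sum(1 for _, tag in tagged if tag in pos_dict[flag])


# pairs copied from the earlier example output, not produced here
tagged = [('Wow', 'NN'), ('what', 'WP'), ('a', 'DT'), ('beutyful', 'JJ'), ('day', 'NN')]

for flag in pos_dict:
    print(flag, count_tags(tagged, flag))
# noun 2
# pron 1
# verb 0
# adj 1
# adv 0
```

The determiner 'a' (tag 'DT') falls into none of the five groups, so it is not counted anywhere.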

In [None]:
data['noun_count'] = data['verified_reviews'].apply(lambda x: pos_check(x,'noun'))

In [None]:
data['verb_count'] = data['verified_reviews'].apply(lambda x: pos_check(x,'verb'))

In [None]:
data[['noun_count','verb_count']].describe()

,noun_count,verb_count
count,3149.0,3149.0
mean,5.946967,5.15751
std,8.223609,7.224127
min,0.0,0.0
25%,1.0,1.0
50%,3.0,3.0
75%,7.0,7.0
max,137.0,102.0


The minimum of 0 in both columns shows that at least one review contains no word tagged as a noun, and at least one contains no word tagged as a verb.

In [None]:
data['adj_count'] = data['verified_reviews'].apply(lambda x: pos_check(x,'adj'))

data['adv_count'] = data['verified_reviews'].apply(lambda x: pos_check(x,'adv'))

data['pron_count'] = data['verified_reviews'].apply(lambda x: pos_check(x,'pron'))

data[['adj_count','adv_count','pron_count']].describe()

,adj_count,adv_count,pron_count
count,3149.0,3149.0,3149.0
mean,2.173071,2.003176,3.243252
std,2.9356,3.27741,4.627609
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,1.0,1.0,2.0
75%,3.0,3.0,4.0
max,39.0,54.0,70.0
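Each pos_check call tags the review again, so building five count columns tags every review five times. One possible alternative, sketched here under the assumption that the tags are available as a list of (word, tag) pairs (in the notebook they would come from TextBlob(x).tags), is to invert pos_dict once and count all groups in a single pass. The names tag_to_group and pos_counts are hypothetical.

```python
from collections import Counter

pos_dict = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}

# reverse lookup: tag -> group name, e.g. 'NNS' -> 'noun'
tag_to_group = {tag: group for group, tags in pos_dict.items() for tag in tags}


def pos_counts(tagged):
    """Return a dict of counts per group from one pass over (word, tag) pairs."""
    counts = Counter(tag_to_group[tag] for _, tag in tagged if tag in tag_to_group)
    # report every group, including those with zero matches
    return {group: counts.get(group, 0) for group in pos_dict}


sample = [('I', 'PRP'), ('really', 'RB'), ('love', 'VBP'), ('this', 'DT'), ('speaker', 'NN')]
print(pos_counts(sample))
# {'noun': 1, 'pron': 1, 'verb': 1, 'adj': 0, 'adv': 1}
```

The sample tags above are written by hand for illustration and are not tagger output.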

