Text Analytics on online content.


# Objective

The objective of this assignment is to extract the article text from each given URL and perform text analysis to compute the variables explained below.

# Data Extraction

Input.xlsx

For each article listed in input.xlsx, extract the article text and save it in a text file named after its URL_ID.

While extracting text, please make sure your program extracts only the article title and the article text. It should not extract the website header, footer, or anything other than the article text. 
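
A minimal sketch of the saving step described above, assuming the title and body have already been extracted. The function name `save_article` and the default folder `articles` are illustrative choices, not part of the assignment.

```python
from pathlib import Path


def save_article(url_id, title, text, out_dir="articles"):
    """Write the title and body to <out_dir>/<url_id>.txt and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = out_path / f"{url_id}.txt"
    # Only the title and the article text are written, nothing else from the page
    target.write_text(f"{title}\n\n{text}\n", encoding="utf-8")
    return target
```

With a parsed newspaper `Article`, this could be called as `save_article(url_id, article.title, article.text)`.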


In [7]:
import pandas as pd

In [10]:
df = pd.read_excel('input.xlsx')

In [11]:
df.head(10)

Unnamed: 0,URL_ID,URL
0,1,https://insights.blackcoffer.com/how-is-login-...
1,2,https://insights.blackcoffer.com/how-does-ai-h...
2,3,https://insights.blackcoffer.com/ai-and-its-im...
3,4,https://insights.blackcoffer.com/how-do-deep-l...
4,5,https://insights.blackcoffer.com/how-artificia...
5,6,https://insights.blackcoffer.com/how-are-genet...
6,7,https://insights.blackcoffer.com/how-is-ai-use...
7,8,https://insights.blackcoffer.com/benefits-of-b...
8,9,https://insights.blackcoffer.com/how-big-data-...
9,10,https://insights.blackcoffer.com/how-will-ai-m...


Next, we extract the text content from each URL.

In [16]:
from newspaper import Article
import nltk

In [17]:
url = 'https://insights.blackcoffer.com/ai-and-its-impact-on-the-fashion-industry/'
article = Article(url, language="en")

In [19]:
nltk.download('punkt')  # tokenizer models used by article.nlp()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sumant\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [20]:
article.download() 
article.parse() 
article.nlp() 

In [21]:
print("Article Title:") 
print(article.title) #prints the title of the article
print("\n") 
print("Article Text:") 
print(article.text) #prints the entire text of the article
print("\n") 
print("Article Summary:") 
print(article.summary) #prints the summary of the article
print("\n") 
print("Article Keywords:")
print(article.keywords) #prints the keywords of the article

Article Title:
AI and its impact on the Fashion Industry


Article Text:
If you were a fan of the 90’s film Clueless back in the day, then you’ll remember the protagonist, Cher Horowitz’s amazing virtual wardrobe. She used it to browse her clothing and choose a perfectly coordinated ensemble. This virtual application, which was just the brainchild of a writer wanting to make the protagonist look rich, fashionable, and ahead of her time, ignited a buzz and prospected having an automated style device to make everyday dress-up fun and engagingly time-saving.

Times have changed and technological advancement today is transforming everything from probabilities to possibilities. We are in the era where, machines not just facilitate our tasks and demands but rather suggest, forecasts, and analyze thus making lives simpler and smarter.

With the advent of technology, style suggestions are just a fraction of the big picture that AI has painted in the fashion industry today.

What is AI?

Artifi

In [22]:
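def get_text(url):
    """Download the article at `url` and return its body text."""
    article = Article(url, language="en")
    article.download()
    article.parse()  # parsing is enough for .text; .nlp() is only needed for summary and keywords
    return article.text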

In [23]:
# Store each article's text in the URL_ID column (renamed to 'Text' below).
# .loc assigns in a single step and avoids pandas' chained-assignment warning.
df['URL_ID'] = df['URL_ID'].astype(object)  # the column will now hold strings
for i in range(len(df)):
    df.loc[i, 'URL_ID'] = get_text(df['URL'][i])

ArticleException: Article `download()` failed with 404 Client Error: Not Found for url: https://insights.blackcoffer.com/how-do-deep-learning-models-predict-old-and-new-drugs-that-are-successfully-treated-in-healthcare/ on URL https://insights.blackcoffer.com/how-do-deep-learning-models-predict-old-and-new-drugs-that-are-successfully-treated-in-healthcare/
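
The 404 above stops the loop at the fourth URL, which is why the remaining rows keep their numeric IDs. One possible way to keep going is sketched below. It assumes a `fetch` callable such as the `get_text` function defined in In [22]; the name `extract_all` and the returned `failures` mapping are hypothetical.

```python
def extract_all(urls, fetch):
    """Return (texts, failures); texts[i] is None when fetch(urls[i]) raised."""
    texts, failures = [], {}
    for url in urls:
        try:
            texts.append(fetch(url))
        except Exception as exc:  # e.g. a failed download for a missing page
            texts.append(None)    # keep row alignment with the input list
            failures[url] = str(exc)
    return texts, failures
```

For example, `texts, failures = extract_all(df['URL'], get_text)` would leave `None` for the missing page and record the error message for later inspection.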

In [25]:
df.rename({'URL_ID':'Text'},axis=1,inplace=True)

In [26]:
df.head(10)

Unnamed: 0,Text,URL
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...
3,4,https://insights.blackcoffer.com/how-do-deep-l...
4,5,https://insights.blackcoffer.com/how-artificia...
5,6,https://insights.blackcoffer.com/how-are-genet...
6,7,https://insights.blackcoffer.com/how-is-ai-use...
7,8,https://insights.blackcoffer.com/benefits-of-b...
8,9,https://insights.blackcoffer.com/how-big-data-...
9,10,https://insights.blackcoffer.com/how-will-ai-m...


Let's do some Text preprocessing and cleaning

In [34]:
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords

In [35]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sumant\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True


In [39]:
def transform(text):
    if isinstance(text, str):  # Check if the input is already a string
        review = text
    else:
        review = str(text)  # Convert non-string values to string
    
    review = re.sub('[^a-zA-Z0-9]', ' ', review)  # Keep only alphanumeric characters
    review = review.lower()  # Convert to lowercase
    review = review.split()  # Split text into words

    # Remove stopwords (build the set once instead of reloading the list for every word)
    stop_words = set(stopwords.words('english'))
    review = [word for word in review if word not in stop_words]
    
    # Join the words back into a string
    review = ' '.join(review)
    return review

# Create a sample DataFrame (replace this with your actual DataFrame)
data = {'Text': [123, 'This is a sample text', 'Another text', 456]}
df = pd.DataFrame(data)

# Apply the transformation function to the 'Text' column
df['Transform_Text'] = df['Text'].apply(transform)

print(df)

                    Text Transform_Text
0                    123            123
1  This is a sample text    sample text
2           Another text   another text
3                    456            456



# Data Analysis

For each of the extracted texts from the article, perform textual analysis and compute variables.

The analysis should report the following variables:

POSITIVE SCORE

NEGATIVE SCORE

POLARITY SCORE

SUBJECTIVITY SCORE

AVG SENTENCE LENGTH

PERCENTAGE OF COMPLEX WORDS

FOG INDEX

AVG NUMBER OF WORDS PER SENTENCE

COMPLEX WORD COUNT

WORD COUNT

SYLLABLE PER WORD

PERSONAL PRONOUNS

AVG WORD LENGTH

In [41]:
# word count in each text row.

df['word_counts'] = df['Transform_Text'].apply(lambda x: len(str(x).split()))    


In [47]:
# Create a sample DataFrame (replace this with your actual DataFrame)
data = {'Text': [123, 'This is a sample text', 'Another text', 456]}
df = pd.DataFrame(data)

# Convert non-string values in the 'Text' column to strings
df['Text'] = df['Text'].astype(str)

# Check the length of the tokenized sentences
for text in df['Text']:
    print(len(nltk.sent_tokenize(text)))  # Tokenize each text and get the length of the sentences

1
1
1
1


In [48]:
len(nltk.sent_tokenize(df['Text'][0]))  # checking length function 

1

In [49]:
import numpy as np


In [57]:
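# Requires the 'word_counts' column. The sample DataFrame built in In [47] replaced df
# and has no such column, so the cleaning and word-count cells must be re-run on the
# real data first; otherwise this cell raises the KeyError shown below.
df['average number of words per sentence'] = np.nan

for i in range(len(df)):
    n_sentences = len(nltk.sent_tokenize(df['Text'][i]))
    df.loc[i, 'average number of words per sentence'] = df['word_counts'][i] / n_sentences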

KeyError: 'word_counts'

In [56]:
df.head(10)

Unnamed: 0,Text,sentences,sentence_count,word_count,avg_words_per_sentence,average number of words per sentence
0,This is a sample sentence.,[This is a sample sentence.],1,6,6.0,
1,Another sentence.,[Another sentence.],1,3,3.0,
2,Yet another sentence.,[Yet another sentence.],1,4,4.0,


# Average Word Length


Average Word Length is calculated by the formula:
    
( Sum of the total number of characters in each word ) / ( Total number of words )
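
As a small worked example of the formula above, the sketch below computes the average word length of a short string. The function name `average_word_length` is illustrative and is not one of the notebook's cells.

```python
def average_word_length(text):
    """Total characters in all words divided by the number of words."""
    words = text.split()
    if not words:
        return 0.0  # avoid dividing by zero on empty text
    return sum(len(word) for word in words) / len(words)


# 6 + 4 + 2 + 5 + 5 = 22 characters over 5 words
print(average_word_length("people hear ai often think"))  # 4.4
```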


In [None]:
def char_count(x):
    # Total number of characters in the text, excluding whitespace
    return len(''.join(x.split()))

In [None]:
df['chara_count'] = df['Transform_Text'].apply(lambda x: char_count(x))

To list the stop words that occur in each text:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
df['stopwords'] = df['Text'].apply(lambda x: [t for t in x.split() if t in stop_words])

In [None]:
# Column-wise division; avoids the chained assignment df[col][i] = ... that pandas warns about
df['average word length'] = df['chara_count'] / df['word_counts']


In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602


# SYLLABLE COUNT

We approximate the number of syllables by counting the vowels (a, e, i, o, u). The count is taken over the whole text of each article rather than per word.
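
Counting every vowel treats a word such as "people" as three syllables. A slightly closer estimate, offered here only as an alternative heuristic and not as the notebook's method, counts runs of consecutive vowels per word. The names `estimate_syllables` and `syllables_per_word` are hypothetical.

```python
import re


def estimate_syllables(word):
    """Estimate syllables as the number of runs of consecutive vowels."""
    return len(re.findall(r"[aeiou]+", word.lower()))


def syllables_per_word(text):
    """Average of the per-word estimates; 0.0 for empty text."""
    words = text.split()
    if not words:
        return 0.0
    return sum(estimate_syllables(word) for word in words) / len(words)


print(estimate_syllables("people"))      # 2 ("eo", "e")
print(estimate_syllables("increasing"))  # 3 ("i", "ea", "i")
```

This is still a heuristic: it ignores "y" and silent endings, so it will be off for some words.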

In [None]:
h = df.head()

In [None]:
def syllable_count(x):
    v = []
    d = {}
    for i in x:
        if i in "aeiou":
            v.append(i)
            d[i] = d.get(i,0)+1     # checking purpose
            
    k = []
    for i in d:
        k.append(d[i])
    print(d)
    print(v)  
    print(k)
    print(np.sum(k))
        
    
g = 'bore i am gone to london in england britian uk'

syllable_count(g)

{'o': 5, 'e': 3, 'i': 4, 'a': 3, 'u': 1}
['o', 'e', 'i', 'a', 'o', 'e', 'o', 'o', 'o', 'i', 'e', 'a', 'i', 'i', 'a', 'u']
[5, 3, 4, 3, 1]
16


In [None]:
def syllable_count(x):
    # Total number of vowel characters (a, e, i, o, u) in the text
    return sum(1 for ch in x if ch in "aeiou")

g = h['Transform_Text'][1]

syllable_count(g)

996

In [None]:
df['syllable count'] = df['Transform_Text'].apply(lambda x: syllable_count(x))

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840


# COMPLEX Word Count

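Complex words are words in the text that contain more than two syllables. The function in this section uses a rough proxy: it counts every word that contains at least two distinct vowels, so its counts can differ substantially from a true syllable-based count.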
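
A sketch that follows the stated definition more literally is shown below. It assumes syllables are estimated as runs of consecutive vowels, which is a heuristic chosen for this illustration; `count_complex_words` and its `min_syllables` parameter are hypothetical names.

```python
import re


def count_complex_words(text, min_syllables=3):
    """Count words whose estimated syllable count is at least min_syllables."""
    count = 0
    for word in text.split():
        # Each run of consecutive vowels is treated as one syllable
        syllables = len(re.findall(r"[aeiou]+", word.lower()))
        if syllables >= min_syllables:
            count += 1
    return count


sample = "increasing computing power data potential"
print(count_complex_words(sample))  # 3: increasing, computing, potential
```

On the same sample, a rule based on two distinct vowels would also count "power", which has only two syllables.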


In [None]:
def complex_word_count(x):
    # Counts the words that contain at least two distinct vowels.
    # Note: this is only a rough proxy for "more than two syllables".
    vowels = set('aeiou')
    return sum(1 for word in x.split() if len(set(word) & vowels) >= 2)

g = h['Transform_Text'][1]

complex_word_count(g)

293

In [None]:
df['complex_count'] = np.nan

df['complex_count'] = df['Transform_Text'].apply(lambda x: complex_word_count(x))
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553


# Analysis of Readability

Readability is measured with the Gunning Fog index, using the formulas below.

Average Sentence Length      =  the number of words / the number of sentences

Percentage of Complex words  =  the number of complex words / the number of words 

Fog Index                    =  0.4 * (Average Sentence Length + Percentage of Complex words)
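
The three formulas above can be checked with a short function. The sketch below follows the formulas exactly as written in this document; the name `fog_index` is illustrative, and the inputs are the values shown for the first article (434 words, 24 sentences, 321 complex words).

```python
def fog_index(word_count, sentence_count, complex_count):
    """0.4 * (average sentence length + fraction of complex words)."""
    average_sentence_length = word_count / sentence_count
    complex_fraction = complex_count / word_count
    return 0.4 * (average_sentence_length + complex_fraction)


print(round(fog_index(434, 24, 321), 6))  # 7.529186
```

Note that the commonly quoted Gunning Fog formula expresses the complex-word share as a percentage (the ratio multiplied by 100); the function above keeps the ratio, as this document's formula does.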


In [None]:
df['sentence length'] = np.nan
df['Average Sentence Length'] = np.nan
df['Percentage of Complex words'] = np.nan
df['Fog Index'] = np.nan

# .loc sets each value in a single step, which avoids the chained-assignment warning
for i in range(len(df)):
    df.loc[i, 'sentence length'] = len(nltk.sent_tokenize(df['Text'][i]))
    df.loc[i, 'Average Sentence Length'] = df['word_counts'][i] / df['sentence length'][i]
    df.loc[i, 'Percentage of Complex words'] = df['complex_count'][i] / df['word_counts'][i]
    df.loc[i, 'Fog Index'] = 0.4 * (df['Average Sentence Length'][i] + df['Percentage of Complex words'][i])

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887


# SENTIMENT ANALYSIS

Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral.

The Master Dictionary is used to build the dictionaries of positive and negative words. A word is added to a dictionary only if it does not appear in the stop-word lists. The Master Dictionary is available from https://sraf.nd.edu/textual-analysis/resources/
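
The dictionary-building rule described above could be sketched as follows. This assumes the dictionary rows are available as a list of dicts with `Word`, `Positive` and `Negative` keys (for example from `csv.DictReader`), where a non-zero flag marks membership. The function name and the sample rows are made up for illustration.

```python
def build_sentiment_sets(rows, stop_words):
    """Return (positive_words, negative_words), skipping stop words."""
    positive, negative = set(), set()
    for row in rows:
        word = row["Word"].lower()
        if word in stop_words:
            continue  # stop words are never added to either dictionary
        if int(row["Positive"]) != 0:
            positive.add(word)
        if int(row["Negative"]) != 0:
            negative.add(word)
    return positive, negative


# Hypothetical sample rows, not taken from the real dictionary file
rows = [
    {"Word": "EXCELLENT", "Positive": "1", "Negative": "0"},
    {"Word": "BAD", "Positive": "0", "Negative": "1"},
    {"Word": "AARDVARK", "Positive": "0", "Negative": "0"},
    {"Word": "BEST", "Positive": "1", "Negative": "0"},
]
positive_words, negative_words = build_sentiment_sets(rows, stop_words={"best"})
print(sorted(positive_words), sorted(negative_words))  # ['excellent'] ['bad']
```

Using sets makes the later membership checks fast, which matters when every word of every article is looked up.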


In [None]:
sentiment = pd.read_csv('sentiment dict.csv')

In [None]:
dfs = sentiment[['Word','Negative','Positive']]
dfs

Unnamed: 0,Word,Negative,Positive
0,AARDVARK,0,0
1,AARDVARKS,0,0
2,ABACI,0,0
3,ABACK,0,0
4,ABACUS,0,0
...,...,...,...
86526,ZYGOTE,0,0
86527,ZYGOTES,0,0
86528,ZYGOTIC,0,0
86529,ZYMURGIES,0,0


In [None]:
f = ['ZYGOTIC','BAD','DONE','EXCELLENT','WORSE']

negative = 0
positive = 0

for i in dfs['Word']:
    if i in f:
        if dfs[dfs['Word']==i].Negative.any() == True:
            negative += 1
        if dfs[dfs['Word']==i].Positive.any() == True:                # CHECKING
            positive += 1
            
print(negative)
print(positive)

2
1


We need to lowercase the Word column in dfs so that it can be matched against the cleaned, lowercased article text when computing sentiment scores.

In [None]:
dfs = dfs.dropna()
dfs.isnull().sum()

Word        0
Negative    0
Positive    0
dtype: int64

In [None]:
w = 'the good man'
w.split()

['the', 'good', 'man']

In [None]:
# dropna() removed a row, leaving a gap in the index (label 50741), so a loop over
# range(len(dfs)) that indexes by label raises KeyError. The vectorized string
# method below does not depend on index labels.
dfs = dfs.copy()
dfs['word_lower'] = dfs['Word'].str.lower()

In [None]:
dfs['word_lower'].dtype

dtype('O')

In [None]:
dfs.head()

Unnamed: 0,Word,Negative,Positive,word_lower
0,AARDVARK,0,0,aardvark
1,AARDVARKS,0,0,aardvarks
2,ABACI,0,0,abaci
3,ABACK,0,0,aback
4,ABACUS,0,0,abacus


# Positive Score

Positive Score: This score is calculated by assigning the value of +1 for each word if found in the Positive Dictionary and then adding up all the values.
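
Read literally, the definition above adds +1 for every occurrence of a dictionary word. A minimal sketch of that reading is given below, assuming the positive words are held in a Python set. The name `dictionary_score` is hypothetical, and the example words are the ones the notebook reports as positive for the first article.

```python
def dictionary_score(text, dictionary_words):
    """Add 1 for every word occurrence that appears in dictionary_words."""
    return sum(1 for word in text.split() if word in dictionary_words)


positive_words = {"able", "better", "easy", "success"}
# "better" occurs twice, so it contributes 2 to the score
print(dictionary_score("able better easy better", positive_words))  # 4
```

The same function works for the negative score by passing the set of negative words instead.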


In [None]:
# Calculate the positive score for text.

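def positive(x):
    # Distinct words in the cleaned text
    s = set(x.split())
    # Dictionary words flagged as positive (non-zero in the 'Positive' column)
    positive_words = set(dfs.loc[dfs['Positive'] != 0, 'word_lower'])
    # Each matching dictionary word is counted once, however often it occurs in the text
    return len(s & positive_words)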

In [None]:
df['positive_score'] = np.nan

for i in range(len(df)):
    df.loc[i, 'positive_score'] = positive(df['Transform_Text'][i])

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index,positive_score
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186,4.0
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434,9.0
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045,27.0
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042,5.0
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887,19.0


In [None]:
def positive_word(x):
    
    s = x.split()
    
    positive_word = []
    
    for i in dfs['word_lower']:
        if i in s:
            if dfs[dfs['word_lower']==i].Positive.any() == True:   # checking which words are positive
                positive_word.append(i)
            
    print(positive_word)

In [None]:
df['positive_word'] = np.nan

for i in range(1):
    df['positive_word'][i] = positive_word(df['Transform_Text'][i])

['able', 'better', 'easy', 'success']


In [None]:
df.drop('positive_word',axis=1,inplace=True)

# NEGATIVE Score

Negative Score: This score is calculated by assigning the value of -1 for each word if found in the Negative Dictionary and then adding up all the values. We multiply the score with -1 so that the score is a positive number.


In [None]:
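def negative_score(x):
    # Distinct words in the cleaned text
    s = set(x.split())
    # Dictionary words flagged as negative (non-zero in the 'Negative' column)
    negative_words = set(dfs.loc[dfs['Negative'] != 0, 'word_lower'])
    # Each matching dictionary word is counted once, however often it occurs in the text
    return len(s & negative_words)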

In [None]:
df['negative_score'] = np.nan

for i in range(len(df)):
    df.loc[i, 'negative_score'] = negative_score(df['Transform_Text'][i])

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index,positive_score,negative_score
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186,4.0,5.0
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434,9.0,6.0
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045,27.0,21.0
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042,5.0,1.0
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887,19.0,17.0


# Polarity Score


Polarity Score: This is the score that determines if a given text is positive or negative in nature.

It is calculated by using the formula: 

Polarity Score = (Positive Score – Negative Score)/ ((Positive Score + Negative Score) + 0.000001)

Range is from -1 to +1
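
As a worked example of the formula above, the sketch below reproduces the value for the first article (positive score 4, negative score 5). The function name `polarity_score` is illustrative.

```python
def polarity_score(positive_score, negative_score):
    """(P - N) / ((P + N) + 0.000001); the small constant avoids division by zero."""
    return (positive_score - negative_score) / ((positive_score + negative_score) + 0.000001)


print(round(polarity_score(4, 5), 6))  # -0.111111
print(polarity_score(0, 0))            # 0.0 when no sentiment words are found
```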


In [None]:
# Column-wise version of the formula; no row loop or chained assignment needed
df['Polarity Score'] = (df['positive_score'] - df['negative_score']) / ((df['positive_score'] + df['negative_score']) + 0.000001)

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index,positive_score,negative_score,Polarity Score
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186,4.0,5.0,-0.111111
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434,9.0,6.0,0.2
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045,27.0,21.0,0.125
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042,5.0,1.0,0.666667
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887,19.0,17.0,0.055556


# SUBJECTIVITY SCORE

Subjectivity Score: This is the score that determines if a given text is objective or subjective. 

It is calculated by using the formula: 

Subjectivity Score = (Positive Score + Negative Score)/ ((Total Words after cleaning) + 0.000001)

Range is from 0 to +1
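
The notebook cells in this section take the subjectivity value from TextBlob, which is a different measure from the formula above. For comparison, a minimal sketch of the formula itself is shown here, using the first article's values (positive score 4, negative score 5, 434 words after cleaning). The function name `subjectivity_score` is illustrative.

```python
def subjectivity_score(positive_score, negative_score, word_count):
    """(P + N) / (total words after cleaning + 0.000001)."""
    return (positive_score + negative_score) / (word_count + 0.000001)


print(round(subjectivity_score(4, 5, 434), 6))  # 0.020737
```

The two measures are not interchangeable: TextBlob reports 0.459438 for the same article, while the dictionary-based formula gives about 0.02.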


In [None]:
from textblob import TextBlob

In [None]:
blob = TextBlob(df['Transform_Text'][1])
blob.sentiment

Sentiment(polarity=0.13926077545642765, subjectivity=0.48480990024468285)

In [None]:
TextBlob(df['Transform_Text'][1]).sentiment[1]

0.48480990024468285

In [None]:
# TextBlob's built-in subjectivity (index 1 of the sentiment tuple) for every row
df['subjectivity'] = df['Transform_Text'].apply(lambda t: TextBlob(t).sentiment[1])

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index,positive_score,negative_score,Polarity Score,subjectivity
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186,4.0,5.0,-0.111111,0.459438
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434,9.0,6.0,0.2,0.48481
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045,27.0,21.0,0.125,0.542281
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042,5.0,1.0,0.666667,0.451277
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887,19.0,17.0,0.055556,0.483817


# PERSONAL PRONOUNS 

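The personal pronouns to count are “I,” “we,” “my,” “ours,” and “us”, which can be found with a regex. The notebook cells in this section use spaCy part-of-speech tags instead, which collect every token tagged as a pronoun (including words such as “it”, “their” and “what”).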
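
A regex-based count of the five listed words might look like the sketch below. It should be run on the original text rather than the cleaned text, since stop-word removal drops most pronouns. Skipping an all-caps "US" is an assumption made here to avoid counting the country abbreviation; the names `PRONOUN_PATTERN` and `count_personal_pronouns` are hypothetical.

```python
import re

# Whole-word, case-insensitive match for the five listed pronouns
PRONOUN_PATTERN = re.compile(r"\b(I|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count occurrences of I, we, my, ours and us in the raw text."""
    matches = PRONOUN_PATTERN.findall(text)
    # Assumption: an all-caps "US" is the country abbreviation, not the pronoun
    return sum(1 for match in matches if match != "US")


print(count_personal_pronouns("We moved to the US because my team asked us."))  # 3
```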

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
x = 'he is the my father'
y = nlp(x)

for noun in y.noun_chunks:
    print(noun)

he
the my father


In [None]:
doc = nlp('he is the my father')

In [None]:
for token in doc:
    if token.pos_ == 'PRON':
        print(token)

he
my


In [None]:
df['PERSONAL PRONOUNS'] = np.nan

In [None]:
doc = nlp(df['Text'][1])
tok = []
for token in doc:
    if token.pos_ == 'PRON':
        tok.append(token)
        
tok

[their,
 them,
 their,
 it,
 them,
 its,
 it,
 Our,
 their,
 What,
 them,
 they,
 it,
 Our,
 your,
 It,
 Our,
 your,
 they,
 their,
 it,
 we,
 we,
 them,
 it,
 our,
 it,
 it,
 our]

In [None]:
df['PERSONAL PRONOUNS'][1] = tok

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index,positive_score,negative_score,Polarity Score,subjectivity,PERSONAL PRONOUNS
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186,4.0,5.0,-0.111111,0.459438,
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434,9.0,6.0,0.2,0.48481,"[their, them, their, it, them, its, it, Our, t..."
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045,27.0,21.0,0.125,0.542281,
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042,5.0,1.0,0.666667,0.451277,
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887,19.0,17.0,0.055556,0.483817,


In [None]:
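# Collect the PRON-tagged tokens of every article, one list per row
pronouns = []
for text in df['Text']:
    doc = nlp(text)
    pronouns.append([token for token in doc if token.pos_ == 'PRON'])

# Assign the whole column at once instead of using chained indexing
df['PERSONAL PRONOUNS'] = pd.Series(pronouns, index=df.index)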

In [None]:
df.head()

Unnamed: 0,Text,URL,Transform_Text,word_counts,average number of words per sentence,chara_count,average word length,syllable count,complex_count,sentence length,Average Sentence Length,Percentage of Complex words,Fog Index,positive_score,negative_score,Polarity Score,subjectivity,PERSONAL PRONOUNS
0,When people hear AI they often think about sen...,https://insights.blackcoffer.com/how-is-login-...,people hear ai often think sentient robots mag...,434,18.083333,2780,6.40553,1074,321,24.0,18.083333,0.739631,7.529186,4.0,5.0,-0.111111,0.459438,"[they, it, it, We, there, what, their, there, ..."
1,"With increasing computing power and more data,...",https://insights.blackcoffer.com/how-does-ai-h...,increasing computing power data potential valu...,383,13.678571,2618,6.835509,996,293,28.0,13.678571,0.765013,5.777434,9.0,6.0,0.2,0.48481,"[their, them, their, it, them, its, it, Our, t..."
2,If you were a fan of the 90’s film Clueless ba...,https://insights.blackcoffer.com/ai-and-its-im...,fan 90 film clueless back day remember protago...,1103,14.324675,7515,6.813237,2904,884,77.0,14.324675,0.801451,6.05045,27.0,21.0,0.125,0.542281,"[you, you, She, it, her, her, everything, We, ..."
3,"Understanding exactly how data is ingested, an...",https://insights.blackcoffer.com/how-do-deep-l...,understanding exactly data ingested analyzed r...,254,16.933333,1778,7.0,694,195,15.0,16.933333,0.767717,7.08042,5.0,1.0,0.666667,0.451277,"[they, they, its, they, their, it, its, we, them]"
4,"From the stone age to the modern world, from h...",https://insights.blackcoffer.com/how-artificia...,stone age modern world hunting gathering culti...,711,10.61194,4640,6.52602,1840,553,67.0,10.61194,0.777778,4.555887,19.0,17.0,0.055556,0.483817,"[we, what, I, we, our, we, us, our, what, We, ..."


In [None]:
df['PERSONAL PRONOUNS'][2]

[you,
 you,
 She,
 it,
 her,
 her,
 everything,
 We,
 our,
 What,
 we,
 everyone,
 its,
 their,
 their,
 itself,
 them,
 its,
 them,
 their,
 their,
 they,
 they,
 their,
 them,
 their,
 their,
 they,
 their,
 us,
 our,
 its,
 It,
 them,
 them,
 them,
 their,
 them,
 their,
 what,
 you,
 your,
 It,
 your,
 you,
 its,
 them,
 You,
 your,
 it,
 it,
 they,
 its,
 It,
 us,
 We,
 We,
 we,
 We,
 our,
 She,
 we,
 We,
 it,
 our,
 She,
 we,
 we,
 our,
 our,
 it,
 it,
 everything,
 there,
 We,
 their,
 It,
 there]
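
The list above holds every token tagged PRON, not only the five target words. One way to reduce such a list to a count is sketched below. It assumes each entry can be converted to text with `str()`, and the names `TARGET_PRONOUNS` and `summarise_pronouns` are hypothetical helpers that do not appear in the notebook.

```python
from collections import Counter

# The five words named in the section's definition, lower-cased.
TARGET_PRONOUNS = {"i", "we", "my", "ours", "us"}

def summarise_pronouns(tokens):
    """Return a Counter of target pronouns found in a list of tokens."""
    words = [str(t) for t in tokens]
    # Skip the all-caps "US", assumed to be the country name.
    kept = [w.lower() for w in words
            if w != "US" and w.lower() in TARGET_PRONOUNS]
    return Counter(kept)

sample = ["you", "She", "it", "We", "our", "we", "us", "their"]
counts = summarise_pronouns(sample)
print(sum(counts.values()), dict(counts))
```

The total, `sum(counts.values())`, is a single number per article, which is easier to store in a spreadsheet column than a list.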

In [None]:
submit = df[['URL','positive_score','negative_score','Polarity Score','subjectivity','Average Sentence Length','Percentage of Complex words',
            'Fog Index','average number of words per sentence','complex_count','word_counts','syllable count','PERSONAL PRONOUNS','average word length']]

In [None]:
submit.head()

Unnamed: 0,URL,positive_score,negative_score,Polarity Score,subjectivity,Average Sentence Length,Percentage of Complex words,Fog Index,average number of words per sentence,complex_count,word_counts,syllable count,PERSONAL PRONOUNS,average word length
0,https://insights.blackcoffer.com/how-is-login-...,4.0,5.0,-0.111111,0.459438,18.083333,0.739631,7.529186,18.083333,321,434,1074,"[they, it, it, We, there, what, their, there, ...",6.40553
1,https://insights.blackcoffer.com/how-does-ai-h...,9.0,6.0,0.2,0.48481,13.678571,0.765013,5.777434,13.678571,293,383,996,"[their, them, their, it, them, its, it, Our, t...",6.835509
2,https://insights.blackcoffer.com/ai-and-its-im...,27.0,21.0,0.125,0.542281,14.324675,0.801451,6.05045,14.324675,884,1103,2904,"[you, you, She, it, her, her, everything, We, ...",6.813237
3,https://insights.blackcoffer.com/how-do-deep-l...,5.0,1.0,0.666667,0.451277,16.933333,0.767717,7.08042,16.933333,195,254,694,"[they, they, its, they, their, it, its, we, them]",7.0
4,https://insights.blackcoffer.com/how-artificia...,19.0,17.0,0.055556,0.483817,10.61194,0.777778,4.555887,10.61194,553,711,1840,"[we, what, I, we, our, we, us, our, what, We, ...",6.52602


In [None]:
file_name = "Output Data Structure.xlsx"

submit.to_excel(file_name)