# George McIntire Self NLP Project

## Notebook 1: Preliminary Analysis

This notebook conducts an introductory analysis of the data. It consists of calculating summary stats, visualizing distributions, and looking at the top N articles by variable.

In [113]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.decomposition import PCA, NMF
from sklearn.manifold import TSNE, MDS
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from glob import glob

In [61]:
#Load in data
data = pd.read_pickle('../complete_data_directory/the_data.pkl')
data.head()

Unnamed: 0,Authors,Keywords,Text,Title,publication,reading_score,sentence_count,textblob_sentiment,textblob_subjectivity,time_read,...,word_length,Google_Sentiment_Score,Google_Sentiment_Magnitude,wat_sent_score,wat_sent_label,anger,disgust,fear,joy,sadness
2e08ee38-925b-11ea-b6e7-88e9fe7866f0,"[Elizabeth Day, Gary Younge, Amy Goodman]",black/garza/activists/media/white/civil/youre/...,"Alicia Garza was in a bar in Oakland, Californ...",#BlackLivesMatter: the birth of a new civil ri...,The Guardian,70.23,249,0.055132,0.413599,2015-07-22 00:27:37,...,4457,-0.2,104.099998,-0.422902,negative,0.160291,0.157887,0.106689,0.573019,0.565451
2e08b0b2-925b-11ea-b6e7-88e9fe7866f0,"[David Agren, Tom Phillips]",leftwinger/mexico/indigenous/amlo/political/ma...,A promise by Andrés Manuel López Obrador to ta...,'Amlo': the veteran leftwinger who could be Me...,The Guardian,56.59,62,0.093262,0.403221,2018-05-07 23:50:42,...,1424,-0.1,23.9,0.296427,positive,0.125887,0.129476,0.544648,0.528289,0.556313
ce1b2992-8f7a-11ea-9be2-88e9fe7866f0,[Lauren Gambino],bernie/partys/believe/threat/moderate/party/sa...,Moderates are increasingly vocal in their disd...,'An existential threat': Bernie Sanders faces ...,The Guardian,46.61,60,0.128316,0.406487,2019-06-21 20:45:21,...,1468,-0.3,31.700001,-0.385376,negative,0.156979,0.525166,0.122238,0.53507,0.555992
2e08e848-925b-11ea-b6e7-88e9fe7866f0,[Nellie Bowles],party/future/festival/man/eric/elite/tomas/sch...,"Further Future is the tech-centric, unapologet...",'Burning Man for the 1%': the desert party for...,The Guardian,63.49,86,0.086764,0.409743,2016-05-02 14:28:33,...,1390,0.0,30.1,0.558893,positive,0.109795,0.112361,0.109605,0.616656,0.482087
ce1a5c10-8f7a-11ea-9be2-88e9fe7866f0,[Senior Reporter],history/revolt/slackers/youre/political/right/...,"“We’re not experts at anything,” Matt Christma...",'Chapo Trap House' And The Slackers' Revolt,Huffington Post,49.35,70,0.072886,0.48299,2019-07-25 12:50:24,...,1527,-0.3,36.799999,-0.513165,negative,0.506526,0.49406,0.094165,0.559362,0.536008


Variables

In [65]:
#Clean and impute Keywords variable: join each keyword list into one string, then fill missing values
data["Keywords"] = data["Keywords"].str.join("/").fillna("")
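
To make the cleaning step concrete, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset. It assumes `Keywords` initially holds Python lists, with missing entries where no keywords were extracted.

```python
import pandas as pd

# Toy stand-in for the Keywords column (hypothetical values, not real data).
toy_keywords = pd.Series([["party", "future", "festival"], None, ["history"]])

# .str.join concatenates each list with "/"; missing entries stay NaN,
# so fillna("") turns them into empty strings afterwards.
cleaned = toy_keywords.str.join("/").fillna("")
print(cleaned.tolist())
```

Note that if the column already held plain strings, `.str.join("/")` would put a slash between every character, so this step should only be run once on the raw list data.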

In [66]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3403 entries, 2e08ee38-925b-11ea-b6e7-88e9fe7866f0 to ce1b066a-8f7a-11ea-9be2-88e9fe7866f0
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Authors                     3403 non-null   object 
 1   Keywords                    3403 non-null   object 
 2   Text                        3403 non-null   object 
 3   Title                       3403 non-null   object 
 4   publication                 3403 non-null   object 
 5   reading_score               3403 non-null   float64
 6   sentence_count              3403 non-null   int64  
 7   textblob_sentiment          3403 non-null   float64
 8   textblob_subjectivity       3403 non-null   float64
 9   time_read                   3403 non-null   object 
 10  vader_sentiment             3403 non-null   float64
 11  word_length                 3403 non-null   int64  
 12  Google_Sentiment_Score      

### Data Dictionary:

Basic explanation of the variables, some of which are self-explanatory.

    - Authors
    - Keywords: List of important keywords determined by the Newspaper package
    - Text
    - Title
    - publication
    - reading_score: Flesch reading ease score that measures the difficulty of reading a text. Derived by the textstat library. Higher is easier to read and lower is harder.
    - sentence_count: number of sentences
    - textblob_sentiment: Sentiment score derived by TextBlob library. Scale: -1 (Extremely negative) - 1 (Extremely Positive).
    - textblob_subjectivity: Subjectivity/objectivity score derived by TextBlob library. Scale: 0 (very objective) - 1 (very subjective).
    - time_read: The date the article was read. This is used as the time variable instead of the publication date.
    - vader_sentiment: Sentiment score derived by the VaderSentiment package. Scale: -1 (Extremely negative) - 1 (Extremely Positive).
    - word_length: number of words
    - Google_Sentiment_Score: Sentiment score derived by Google. Scale: -1 (Extremely negative) - 1 (Extremely Positive).
    - Google_Sentiment_Magnitude: Google's measure of the overall strength of sentiment in a document. For example, if a document has a sentiment score near zero but a high magnitude score, it contains plenty of both negative and positive text.
    - wat_sent_score: Sentiment score derived by Watson AI. Scale: -1 (Extremely negative) - 1 (Extremely Positive).
    - wat_sent_label: Sentiment category value: negative, positive, neutral
    - anger, disgust, fear, joy, sadness: Emotion scores derived by Watson AI. Scale: 0 - 1.
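
For reference, the Flesch reading ease score behind `reading_score` can be computed from three counts. The sketch below uses the standard published formula; the function name and the example counts are hypothetical, and textstat's own syllable counting may produce slightly different values on real text.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch reading ease formula computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))
```

Longer sentences and longer words both pull the score down, which is why a lower score means harder text.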

In [67]:
#Create two different dataframes, one for numeric columns and one for object columns
num_df = data.select_dtypes(include="number")
obj_df = data.select_dtypes(include="object")

Summary Stats of numerical columns

In [68]:
num_df.describe()

Unnamed: 0,reading_score,sentence_count,textblob_sentiment,textblob_subjectivity,vader_sentiment,word_length,Google_Sentiment_Score,Google_Sentiment_Magnitude,wat_sent_score,anger,disgust,fear,joy,sadness
count,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0,3403.0
mean,59.533964,125.919189,0.096104,0.438147,0.511346,2439.599471,-0.120864,44.571114,-0.045158,0.285815,0.314834,0.162828,0.403398,0.377016
std,10.21605,142.331639,0.052383,0.051751,0.829103,2447.450767,0.156515,48.074018,0.457978,0.194616,0.211937,0.137638,0.207747,0.193955
min,2.29,10.0,-0.145186,0.17982,-1.0,300.0,-0.7,0.4,-0.847182,0.006959,0.028204,0.000232,0.015192,0.028673
25%,52.8,47.0,0.062374,0.406534,0.4404,1027.5,-0.2,17.700001,-0.448858,0.124159,0.119587,0.096909,0.153216,0.159335
50%,59.64,79.0,0.095628,0.438368,0.9965,1644.0,-0.1,29.5,-0.290788,0.167498,0.186766,0.117542,0.511734,0.492298
75%,66.88,146.5,0.12876,0.471612,0.9993,2873.0,0.0,51.35,0.397594,0.509413,0.530672,0.144369,0.564982,0.542246
max,96.69,1885.0,0.564439,0.701966,1.0,33720.0,0.6,489.200012,0.91582,0.702171,0.804702,0.698295,0.751957,0.722891


In [69]:
#Correlation matrix of numerical columns
corr = num_df.corr()
corr

Unnamed: 0,reading_score,sentence_count,textblob_sentiment,textblob_subjectivity,vader_sentiment,word_length,Google_Sentiment_Score,Google_Sentiment_Magnitude,wat_sent_score,anger,disgust,fear,joy,sadness
reading_score,1.0,0.273802,0.12083,0.177771,0.121077,0.154811,0.280167,0.243692,0.077338,0.055481,0.003087,-0.057619,0.023418,0.001785
sentence_count,0.273802,1.0,-0.060851,-0.038255,-0.007675,0.956565,0.009487,0.909359,-0.063739,0.275782,0.20334,0.03047,-0.208422,-0.221942
textblob_sentiment,0.12083,-0.060851,1.0,0.242021,0.485506,-0.06417,0.539605,-0.088247,0.560159,-0.213254,-0.002868,-0.250963,0.087303,-0.159003
textblob_subjectivity,0.177771,-0.038255,0.242021,1.0,0.106912,-0.049703,0.165498,0.005928,0.112758,-0.026417,-0.022398,-0.023204,0.068018,0.030628
vader_sentiment,0.121077,-0.007675,0.485506,0.106912,1.0,-0.00821,0.498297,-0.058975,0.5466,-0.179181,0.014948,-0.303672,0.049207,-0.163496
word_length,0.154811,0.956565,-0.06417,-0.049703,-0.00821,1.0,-0.008159,0.917359,-0.071332,0.284822,0.217303,0.028684,-0.213533,-0.221376
Google_Sentiment_Score,0.280167,0.009487,0.539605,0.165498,0.498297,-0.008159,1.0,-0.050705,0.737192,-0.264912,-0.057484,-0.282801,0.053382,-0.203468
Google_Sentiment_Magnitude,0.243692,0.909359,-0.088247,0.005928,-0.058975,0.917359,-0.050705,1.0,-0.150005,0.222614,0.122828,0.063975,-0.100224,-0.084849
wat_sent_score,0.077338,-0.063739,0.560159,0.112758,0.5466,-0.071332,0.737192,-0.150005,1.0,-0.228101,0.003798,-0.302471,-0.001832,-0.29174
anger,0.055481,0.275782,-0.213254,-0.026417,-0.179181,0.284822,-0.264912,0.222614,-0.228101,1.0,0.546961,0.099184,-0.539606,-0.498848


In [81]:
#heatmap version
hm = px.imshow(corr.values,
               x = num_df.columns,
               y = num_df.columns, width = 800, height = 700,
               color_continuous_scale="viridis", range_color=[-1, 1])
hm.show()

I'm a bit surprised the sentiment scores don't correlate more strongly with one another. Let's look at just the sentiment scores to get a better view of this.

Also, we see that the Google magnitude score is essentially a proxy for article length, given its strong correlations with word count (0.92) and sentence count (0.91).

In [135]:

sent_cols = ["textblob_sentiment", "vader_sentiment", "Google_Sentiment_Score", "wat_sent_score"]

hm = px.imshow(corr.loc[sent_cols, sent_cols].values,
               x = sent_cols,
               y = sent_cols, width = 800, height = 500,
               color_continuous_scale="viridis", range_color=[-1, 1])
hm.show()
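
The heatmap colors can be hard to compare by eye, so it may help to list the pairwise values directly. The helper below is a small sketch (the name `ranked_pairs` is hypothetical, not part of the original notebook) that pulls each unordered pair of columns out of a correlation matrix and sorts them from strongest to weakest.

```python
import pandas as pd

def ranked_pairs(corr_matrix, columns):
    """Return each unordered pair of `columns` with its correlation, strongest first."""
    rows = []
    for i, first in enumerate(columns):
        # Only look at columns after `first` so each pair appears once.
        for second in columns[i + 1:]:
            rows.append((first, second, corr_matrix.loc[first, second]))
    pairs = pd.DataFrame(rows, columns=["first", "second", "correlation"])
    return pairs.sort_values("correlation", ascending=False).reset_index(drop=True)
```

Calling `ranked_pairs(corr, sent_cols)` on the matrix above should put Google and Watson at the top (about 0.74) and TextBlob and VADER at the bottom (about 0.49), matching the values in the correlation table.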

Let's visualize distributions of the numerical variables. We'll split the data into three different groups in order to better display the distributions.

In [83]:
num_df.head()

Unnamed: 0,reading_score,sentence_count,textblob_sentiment,textblob_subjectivity,vader_sentiment,word_length,Google_Sentiment_Score,Google_Sentiment_Magnitude,wat_sent_score,anger,disgust,fear,joy,sadness
2e08ee38-925b-11ea-b6e7-88e9fe7866f0,70.23,249,0.055132,0.413599,-0.9998,4457,-0.2,104.099998,-0.422902,0.160291,0.157887,0.106689,0.573019,0.565451
2e08b0b2-925b-11ea-b6e7-88e9fe7866f0,56.59,62,0.093262,0.403221,-0.9718,1424,-0.1,23.9,0.296427,0.125887,0.129476,0.544648,0.528289,0.556313
ce1b2992-8f7a-11ea-9be2-88e9fe7866f0,46.61,60,0.128316,0.406487,0.9977,1468,-0.3,31.700001,-0.385376,0.156979,0.525166,0.122238,0.53507,0.555992
2e08e848-925b-11ea-b6e7-88e9fe7866f0,63.49,86,0.086764,0.409743,0.9986,1390,0.0,30.1,0.558893,0.109795,0.112361,0.109605,0.616656,0.482087
ce1a5c10-8f7a-11ea-9be2-88e9fe7866f0,49.35,70,0.072886,0.48299,0.7513,1527,-0.3,36.799999,-0.513165,0.506526,0.49406,0.094165,0.559362,0.536008


In [84]:
text_metrics_cols = ["reading_score", "word_length", 'sentence_count']

emotion_cols = ["anger", "disgust", 'fear', "joy", "sadness"]

#Already have list of sentiment columns in sent_cols



Distributions of reading score, number of words, and number of sentences.

In [190]:
def hist_subplots_maker(columns, rows, cols, height = 700, width = 800, bins = None):
    """Plot one histogram per column of num_df on a rows x cols subplot grid."""
    titles = [col.replace("_", " ").title() for col in columns]
    fig = make_subplots(rows = rows, cols = cols, subplot_titles = titles)

    #Fill the grid row by row; extra cells stay empty if there are fewer columns than cells
    for idx, col in enumerate(columns[: rows * cols]):
        fig.add_trace(
            go.Histogram(x = num_df[col], showlegend = False, nbinsx = bins),
            row = idx // cols + 1,
            col = idx % cols + 1,
        )

    fig.update_layout(height = height, width = width)
    fig.show()

In [191]:
hist_subplots_maker(text_metrics_cols, rows=3, cols=1, width = 700)

- A solid majority of articles have a reading score between 50 (roughly 12th grade reading level) and 70 (roughly 8th grade reading level).

- Both word length and sentence count show right-skewed distributions. The typical article I read is between 800 and 2000 words.
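
Claims like these are read off the histograms, but they can be checked numerically. The helper below is a minimal sketch (the name `share_between` is hypothetical) that returns the fraction of values falling inside a closed range.

```python
import pandas as pd

def share_between(values, low, high):
    """Fraction of entries in `values` that fall in the closed interval [low, high]."""
    # between() yields booleans; their mean is the share of True values.
    return values.between(low, high).mean()
```

For example, `share_between(num_df["reading_score"], 50, 70)` would give the share of articles in the 50 to 70 reading score band, and `share_between(num_df["word_length"], 800, 2000)` the share in the "typical" word count range.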

Distributions of the four sentiment scores.

In [182]:
hist_subplots_maker(sent_cols, rows=2, cols=2, height = 700, width = 800)

This 2x2 subplot yields some interesting results.
- The TextBlob and Google sentiment scores show normal or close-to-normal distributions.
- VADER is pretty lopsided. The vast majority of articles are scored as almost completely negative or almost completely positive.
- Watson sentiment is bimodal, with two roughly equal-sized clusters.
- According to TextBlob, the vast majority of articles are slightly or barely positive.
- Google says most articles are slightly negative or neutral.

Although all four sentiment metrics are positively correlated with one another (pairwise correlations range from roughly 0.49 to 0.74), their distributions differ substantially. This validates my decision to acquire multiple sentiment scores, because I knew that different sentiment scoring algorithms could produce varying results.
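
One rough way to quantify how often the scorers disagree is to check whether they at least agree on the sign of the sentiment. The sketch below is an illustration only (the function name `sign_agreement` is hypothetical): it returns the fraction of rows where every listed score is on the same side of zero. A score of exactly zero is treated as its own "neutral" sign.

```python
import numpy as np
import pandas as pd

def sign_agreement(df, columns):
    """Fraction of rows where every score in `columns` has the same sign."""
    signs = np.sign(df[columns])
    # A row agrees when its smallest and largest sign coincide.
    return (signs.min(axis=1) == signs.max(axis=1)).mean()
```

Running `sign_agreement(num_df, sent_cols)` would show how often all four scorers agree on positive versus negative, and passing two columns at a time compares individual pairs.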

Now let's look at the emotion scores.

In [193]:
hist_subplots_maker(emotion_cols, rows=5, cols=1, height=900, width = 600)

All five emotion distributions look quite similar, and each appears to contain two separate distributions. This may give us some insight into Watson's algorithms and how they may be biased.
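
The "two distributions in one" impression can be checked with a simple heuristic. The sketch below computes Sarle's bimodality coefficient, which was not used in the original analysis and is offered here only as one possible check. Values above 5/9 (about 0.56) hint at more than one mode, although a heavily skewed unimodal distribution can also exceed that threshold.

```python
import pandas as pd

def bimodality_coefficient(values):
    """Sarle's bimodality coefficient; values above 5/9 hint at more than one mode."""
    n = values.count()
    skew = values.skew()         # bias-corrected sample skewness
    excess_kurt = values.kurt()  # bias-corrected excess kurtosis
    # Finite-sample correction term; approaches 3 as n grows.
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skew ** 2 + 1) / (excess_kurt + correction)
```

Applying it per column, for example `num_df[emotion_cols].apply(bimodality_coefficient)`, gives one number per emotion to compare against the threshold.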

### Top Articles per numerical 