## TODO:

Structure:
1. Read data
    - <input type="checkbox"></input> display count of rows
    - <input type="checkbox"></input> display columns; give description for each column
    - <input type="checkbox"></input> show how many rows have 1 None value, 2 None, ..., all None values
2. Display each column
    - <input type="checkbox"></input> **title**: min/max/average length, stretch goal: show most popular topics
    - <input type="checkbox"></input> **url**: which site is the most popular
    - <input type="checkbox"></input> **published_at**: min/max range, show distribution per year, maybe per season/year
    - <input type="checkbox"></input> **author**: average number of authors per article, who is the most popular author
    - <input type="checkbox"></input> **publisher**: check that there is only one publisher
    - <input type="checkbox"></input> **short_description**: check empty values; verify the assumption that short_description is basically the first sentence of description. If it holds, it gives a strategy for filling empty short_description values when description exists
    - <input type="checkbox"></input> **keywords**: how many keywords on average, which are the most popular, which are the least popular
    - <input type="checkbox"></input> **header_image**: skip
    - <input type="checkbox"></input> **raw_description**: the same as description, only with HTML tags. If raw_description exists but description does not, raw_description can be used as description, but only after stripping all HTML tags
    - <input type="checkbox"></input> **description**: min/average/max length. Show the number of empty values. Check that when description is empty, raw_description is not. Stretch goal: most popular topics
    - <input type="checkbox"></input> **scraped_at**: skip? maybe show the range


<h1><center>Exploratory Data Analysis</center></h1>

In [1]:
cd ..

/Users/andreiaksionov/Study/Machine_Learning/semantic_search/Weaviate-demo


In [145]:
import os
import sys

import pandas as pd
import spacy
from bs4 import BeautifulSoup
from omegaconf import OmegaConf

pd.set_option("display.max_colwidth", 100)

# ROOT = os.path.realpath("../")
# if ROOT not in sys.path:
#     sys.path.append(ROOT)

from src import config

# 1. Read data

Read the dataset from a CSV file and display the first article.

In [146]:
data = pd.read_csv(config.data.raw)
data.head(1)

Unnamed: 0,title,url,published_at,author,publisher,short_description,keywords,header_image,raw_description,description,scraped_at
0,Santoli’s Wednesday market notes: Could September’s stock shakeout tee up strength for the fourt...,https://www.cnbc.com/2021/09/29/santolis-wednesday-market-notes-could-septembers-stock-shakeout-...,2021-09-29T17:09:39+0000,Michael Santoli,CNBC,"This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about ...","cnbc, Premium, Articles, Investment strategy, Markets, Investing, PRO Home, CNBC Pro, Pro: Santo...",https://image.cnbcfm.com/api/v1/image/106949602-1632934577499-FINTECH_ETF_9-29.jpg?v=1632934691,"<div class=""group""><p><em>This is the daily notebook of Mike Santoli, CNBC's senior markets comm...","This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about ...",2021-10-30 14:11:23.709372


In [None]:
# data rows
print(data.shape)

(625, 11)


The dataset contains 625 articles, and each article has 11 values.

In [138]:
# columns
# sorted(data.columns)
data.columns.tolist()

['title',
 'url',
 'published_at',
 'author',
 'publisher',
 'short_description',
 'keywords',
 'header_image',
 'raw_description',
 'description',
 'scraped_at']

Article values:
1. **title**: title of the article as it appears on the CNBC website
2. **url**: URL of the article; should contain the cnbc.com domain, since the dataset consists of CNBC articles
3. **published_at**: when the article was published on the CNBC website
4. **author**: name of the article's author or authors
5. **publisher**: name of the publisher; should be CNBC
6. **short_description**: shortened version of description (the article body); handy for skimming articles quickly
7. **keywords**: list of keywords; helps find articles on the same topic
8. **header_image**: link to the image the article was posted with
9. **raw_description**: the article's body without any post-processing, i.e. with HTML tags intact
10. **description**: the article's body without HTML tags
11. **scraped_at**: when the article was scraped from cnbc.com
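
Two of the expectations in the list above (every url points at cnbc.com, and there is a single publisher) are easy to verify. The following is only a minimal sketch on a hand-made frame: the `toy` rows are invented for illustration and are not taken from the dataset, and the same expressions would be applied to `data` in the notebook.

```python
from urllib.parse import urlparse

import pandas as pd

# Toy rows (assumed for illustration; not taken from the dataset)
toy = pd.DataFrame(
    {
        "url": [
            "https://www.cnbc.com/2021/09/29/example-one.html",
            "https://example.org/not-cnbc.html",
        ],
        "publisher": ["CNBC", "CNBC"],
    }
)

# Host part of every URL, e.g. "www.cnbc.com"
hosts = toy["url"].map(lambda u: urlparse(u).netloc.lower())

# True where the host is cnbc.com itself or one of its subdomains
is_cnbc = hosts.map(lambda h: h == "cnbc.com" or h.endswith(".cnbc.com"))

# Distinct publisher names; a single entry would confirm the expectation
publishers = toy["publisher"].dropna().unique().tolist()

print(is_cnbc.tolist())
print(publishers)
```

Comparing the parsed host rather than searching the whole string for "cnbc.com" avoids false positives from URLs that merely mention the domain in their path.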

Next, let's check how many missing values each column has:

In [149]:
nan_values_df = data.isna().sum().reset_index().rename(columns={"index": "Columns", 0: "Num of NaN"})
nan_values_df["%"] = nan_values_df["Num of NaN"].apply(lambda x: int(x / len(data) * 100))
nan_values_df

Unnamed: 0,Columns,Num of NaN,%
0,title,0,0
1,url,0,0
2,published_at,0,0
3,author,228,36
4,publisher,0,0
5,short_description,16,2
6,keywords,0,0
7,header_image,0,0
8,raw_description,31,4
9,description,32,5


Roughly one third of all articles (36%) have no author, and a small fraction (2-5%) are missing one of the description variants. Interestingly, short_description has fewer missing values (16) than description (32) or raw_description (31).
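
To see how the gaps in the three description columns overlap, one option is to count the distinct missingness patterns. The sketch below uses an invented `toy` frame (an assumption for illustration, not real rows); on the actual data the same expressions would be applied to `data`.

```python
import pandas as pd

# Toy frame (assumed for illustration) with gaps in the three description columns
toy = pd.DataFrame(
    {
        "short_description": ["a", "b", None, "d"],
        "raw_description": ["<p>a</p>", None, None, "<p>d</p>"],
        "description": ["a", None, None, None],
    }
)

cols = ["short_description", "raw_description", "description"]

# One row per distinct missingness pattern (True = value is missing) with its frequency
patterns = toy[cols].isna().value_counts().reset_index(name="rows")

# Rows that are missing description but still have short_description
short_only = toy["description"].isna() & toy["short_description"].notna()

print(patterns)
print(int(short_only.sum()))
```

A pattern table like this shows at a glance whether the columns tend to be missing together or independently, which the per-column totals alone cannot tell.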

Now let's look at how many rows have exactly one missing value, two missing values, and so on, up to rows where all values are missing:

In [137]:
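# Number of missing values in each row
na_per_row = data.isna().sum(axis=1).tolist()
nan_values_df = pd.DataFrame(columns=["NaN per row", "Number of rows", "%"])

# Count the rows that have exactly idx missing values
for idx in range(data.shape[-1]):
    row_count = na_per_row.count(idx)
    nan_values_df.loc[idx] = idx, row_count, int(row_count / len(data) * 100)

# Styler.hide_index() is deprecated since pandas 1.4; hide(axis="index") replaces it
nan_values_df.style.hide(axis="index")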



NaN per row,Number of rows,%
0,376,60
1,207,33
2,27,4
3,14,2
4,1,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,0


As we can see, more than half of the dataset (60%) has no missing values, one third of the articles (33%) have exactly one missing value, and no article is missing all of its values.
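
The same per-row summary can also be built without an explicit loop. This is a sketch under assumptions: the `toy` frame is invented for illustration, and the percentages are rounded to one decimal rather than truncated to an integer as in the loop above. Reindexing over `range(n_columns + 1)` also includes the "all values missing" case.

```python
import pandas as pd

# Toy frame (assumed for illustration): 4 rows, 3 columns
toy = pd.DataFrame(
    {
        "title": ["t1", "t2", "t3", "t4"],
        "author": ["a1", None, None, None],
        "description": ["d1", "d2", None, None],
    }
)

# Missing values per row
na_per_row = toy.isna().sum(axis=1)

# Count rows for every possible number of gaps, 0 .. number of columns (inclusive)
summary = (
    na_per_row.value_counts()
    .reindex(range(toy.shape[1] + 1), fill_value=0)
    .rename_axis("NaN per row")
    .reset_index(name="Number of rows")
)
summary["%"] = (summary["Number of rows"] / len(toy) * 100).round(1)

print(summary)
```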