In [1]:
# imports
import os
import sys
import dvc.api
import pandas as pd
import dataframe_image as dfi
import warnings
warnings.filterwarnings('ignore')

In [2]:
# adding and setting up scripts
sys.path.append('.')
sys.path.append('..')
sys.path.insert(1, '../scripts/')
import defaults as defs
import dataCleaner as dc
import dataVisualizer as dv

cleaner = dc.dataCleaner('data preparation notebook')
visualizer = dv.dataVisualizer('data preparation notebook')

logger <Logger dataCleaner (DEBUG)> created at path: ../logs/cleaner_root.log
Data cleaner in action
logger <Logger dataVisualizer (DEBUG)> created at path: ../logs/visualizer_root.log
Data visualizer in action


In [3]:
# pandas settings
pd.set_option('display.max_columns', 30)

# version of the data
# v1 : local-store
version = 'v1'

# set up the data url
news_url = dvc.api.get_url(path = defs.news_local_path, 
                       repo = defs.repo, 
                       rev = version)

# print news path
print(f'news data path: {news_url}')

news data path: /home/f0x-tr0t/Documents/dvc-store//50/1fbc56d932bcb51d74876281ec8f71


In [4]:
# reading csv files
DateCols = ['timestamp']
missing_values = ["n/a", "na", "undefined", '?', 'NA']

news_data = pd.read_csv(news_url, na_values=missing_values, parse_dates=DateCols, low_memory=False)
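To see what `na_values` and `parse_dates` do in the call above, here is a minimal sketch on an in-memory CSV. The three rows and the `Score` column are made up for illustration; only the placeholder list mirrors the notebook.

```python
import io
import pandas as pd

# Made-up sample rows; not taken from the news data set
csv_text = (
    "Title,Score,timestamp\n"
    "first,1.5,2021-09-09 18:17:46\n"
    "second,?,2021-09-08 13:02:45\n"
    "undefined,0.0,2021-09-13 07:32:46\n"
)

# Same placeholder strings the notebook treats as missing
placeholders = ["n/a", "na", "undefined", "?", "NA"]

demo = pd.read_csv(
    io.StringIO(csv_text),
    na_values=placeholders,      # these strings become NaN
    parse_dates=["timestamp"],   # this column is parsed as datetime
)

n_missing = int(demo.isna().sum().sum())
print(demo.dtypes)
print(f"missing cells: {n_missing}")
```

Because `?` is converted to NaN before type inference, the `Score` column can still be read as a float column rather than falling back to strings.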

# EDA

## News score data set

In [5]:
news_data

Unnamed: 0,Domain,Title,Description,Body,Link,timestamp,Analyst_Average_Score,Analyst_Rank,Reference_Final_Score
0,rassegnastampa.news,Boris Johnson using a taxpayer-funded jet for ...,…often trigger a protest vote that can upset…t...,Boris Johnson using a taxpayer-funded jet for ...,https://rassegnastampa.news/boris-johnson-usin...,2021-09-09 18:17:46.258006,0.0,4,1.96
1,twitter.com,"Stumbled across an interesting case, a woman f...","Stumbled across an interesting case, a woman f...","Stumbled across an interesting case, a woman f...",http://twitter.com/CoruscaKhaya/status/1435585...,2021-09-08 13:02:45.802298,0.0,4,12.0
2,atpe-tchad.info,Marché Résines dans les peintures et revêtemen...,…COVID-19…COVID…COVID…COVID-19 et Post COVID…C...,Le rapport d’étude de marché Résines dans les ...,http://atpe-tchad.info/2021/09/13/marche-resin...,2021-09-13 07:32:46.244403,0.0,4,0.05
3,badbluetech.bitnamiapp.com,"AI drives data analytics surge, study finds",…hate raiders' linked to automated harassment ...,How to drive the funnel through content market...,http://badbluetech.bitnamiapp.com/p.php?sid=21...,2021-09-11 00:17:45.962605,0.0,4,6.1
4,kryptogazette.com,Triacetin Vertrieb Markt 2021: Globale Unterne...,…Abschnitten und Endanwendungen / Organisation...,Global Triacetin Vertrieb-Markt 2021 von Herst...,https://kryptogazette.com/2021/09/08/triacetin...,2021-09-08 12:47:46.078369,0.0,4,0.13
5,mype.co.za,Male arrested for the murder of an elderly fem...,…Crime Stamp Out…N1 and R101 roads appear in c...,South African Police Service Office of the Pro...,https://mype.co.za/new/male-arrested-for-the-m...,2021-09-10 00:17:46.055622,1.33,2,11.0
6,eminetra.co.za,7th Anniversary of SCOAN Collapse in Nigeria-S...,"…in Lagos, Nigeria, 84 South Africans were kil...",Today is the 7th anniversary [Tragic collapse ...,https://eminetra.co.za/7th-anniversary-of-scoa...,2021-09-12 05:17:50.279081,0.0,4,10.1
7,eminetra.co.za,The construction sector is expected to be boos...,"…additional spending on buildings, repairs and...",Construction activity grew steadily by 4% in t...,https://eminetra.co.za/the-construction-sector...,2021-09-09 09:02:46.320793,1.66,1,1.36
8,news24.com,News24.com | Court dismisses attempt by former...,…Lawsuit Against Public Participation) designe...,- Former Eskom CEO Matshela Moses Koko sought ...,https://www.news24.com/news24/southafrica/news...,2021-09-09 19:32:46.239682,0.33,3,2.4
9,manometcurrent.com,Global and Regional Beta-Carotene Market Resea...,…key players! – DSM – BASF – Allied Biotech – ...,Global and Regional Beta-Carotene Market Resea...,https://manometcurrent.com/global-and-regional...,2021-09-13 03:02:45.609228,0.0,4,0.22


In [6]:
news_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Domain                 10 non-null     object        
 1   Title                  10 non-null     object        
 2   Description            10 non-null     object        
 3   Body                   10 non-null     object        
 4   Link                   10 non-null     object        
 5   timestamp              10 non-null     datetime64[ns]
 6   Analyst_Average_Score  10 non-null     float64       
 7   Analyst_Rank           10 non-null     int64         
 8   Reference_Final_Score  10 non-null     float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 848.0+ bytes


### Data types

In [7]:
news_data.dtypes

Domain                           object
Title                            object
Description                      object
Body                             object
Link                             object
timestamp                datetime64[ns]
Analyst_Average_Score           float64
Analyst_Rank                      int64
Reference_Final_Score           float64
dtype: object

* 5 object (string) columns
* 1 datetime column
* 2 floating-point columns and
* 1 integer column

### Missing values and empty strings

In [8]:
cleaner.percent_missing(news_data)

The dataset contains 0.0 % missing values


In [9]:
news_data.isna().sum()

Domain                   0
Title                    0
Description              0
Body                     0
Link                     0
timestamp                0
Analyst_Average_Score    0
Analyst_Rank             0
Reference_Final_Score    0
dtype: int64

* No null values
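`cleaner.percent_missing` is a project helper whose implementation is not shown here. The sketch below is one plausible way to compute the same figure with plain pandas, plus a check for blank strings, which `isna()` does not count. The function names `percent_missing` and `count_blank_strings` and the sample frame are assumptions for illustration.

```python
import pandas as pd

def percent_missing(df):
    """Share of NaN cells in the whole frame, as a percentage."""
    total_cells = df.shape[0] * df.shape[1]
    if total_cells == 0:
        return 0.0
    missing_cells = int(df.isna().sum().sum())
    return round(100 * missing_cells / total_cells, 2)

def count_blank_strings(df, columns):
    """Count cells that are empty or whitespace-only in the given text columns."""
    return df[columns].apply(
        lambda col: int(col.astype(str).str.strip().eq("").sum())
    )

# Made-up sample with one blank title and two missing cells
sample = pd.DataFrame({
    "Title": ["some headline", "  ", None, "another headline"],
    "Analyst_Average_Score": [0.33, 0.0, 0.0, None],
})

pct = percent_missing(sample)
blanks = count_blank_strings(sample, ["Title"])
print(f"The dataset contains {pct} % missing values")
print(blanks)
```

On the real data the second helper could be called with the text columns `Domain`, `Title`, `Description`, `Body` and `Link`.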

### Duplicates

In [10]:
# search for duplicate rows and drop them
cleaner.drop_duplicates(news_data)

No duplicate rows were found.
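The body of `cleaner.drop_duplicates` is not shown, and the call above does not assign a result, so whether it works in place is unknown. A minimal sketch of an equivalent check with plain pandas, using a hypothetical function name and made-up rows, could look like this:

```python
import pandas as pd

def drop_duplicate_rows(df):
    """Return df without fully duplicated rows and report how many were dropped."""
    n_dupes = int(df.duplicated().sum())
    if n_dupes == 0:
        print("No duplicate rows were found.")
        return df
    print(f"{n_dupes} duplicate row(s) dropped.")
    return df.drop_duplicates().reset_index(drop=True)

# Made-up sample: the third row repeats the first
demo = pd.DataFrame({
    "Domain": ["a.com", "b.com", "a.com"],
    "Analyst_Rank": [4, 2, 4],
})

deduped = drop_duplicate_rows(demo)
print(deduped)
```

This version returns a new frame instead of modifying its argument, so the result has to be assigned by the caller.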


### Feature descriptions

* Domain - the base URL or a reference to the source this item comes from
* Title - title of the item
* Description - the content of the item
* Body - the content of the item
* Link - URL to the item source (it may no longer be functional)
* timestamp - the time at which this item was collected
* Analyst_Average_Score - target variable - the score to be estimated
* Analyst_Rank - score as rank
* Reference_Final_Score - not relevant for now - it is a transformed quantity

### Numerical summary

In [11]:
news_data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Analyst_Average_Score,10.0,0.332,0.626379,0.0,0.0,0.0,0.2475,1.66
Analyst_Rank,10.0,3.4,1.074968,1.0,3.25,4.0,4.0,4.0
Reference_Final_Score,10.0,4.532,4.834468,0.05,0.505,2.18,9.1,12.0


* Analyst_Average_Score:
    * min: 0: the lowest news protest score
    * max: 1.66: the highest news protest score
    * mean: 0.332: most news items show little sign of social protest (the median is 0)
    * standard deviation: 0.63

* Analyst_Rank:
    * min: 1: minimum analyst rank
    * max: 4: maximum analyst rank
    * mean: 3.4: the average rank; most ranks are numerically high (7 of the 10 rows have rank 4, which pairs with a score of 0)
    * standard deviation: 1.07
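The relationship between the two analyst columns can be checked directly. The sketch below rebuilds the three numeric columns from the ten rows in this notebook and tests whether `Analyst_Rank` matches a dense descending rank of `Analyst_Average_Score`. That this holds beyond these ten rows is an assumption, not something the data set documentation confirms.

```python
import pandas as pd

# The ten rows of numeric columns, copied from this notebook's data
scores = pd.DataFrame({
    "Analyst_Rank": [4, 4, 4, 4, 4, 2, 4, 1, 3, 4],
    "Analyst_Average_Score": [0.0, 0.0, 0.0, 0.0, 0.0, 1.33, 0.0, 1.66, 0.33, 0.0],
    "Reference_Final_Score": [1.96, 12.0, 0.05, 6.1, 0.13, 11.0, 10.1, 1.36, 2.4, 0.22],
})

# Dense rank: the highest score gets rank 1, tied scores share a rank
dense_rank = (
    scores["Analyst_Average_Score"]
    .rank(method="dense", ascending=False)
    .astype(int)
)

matches = bool((dense_rank == scores["Analyst_Rank"]).all())
rank_counts = scores["Analyst_Rank"].value_counts().sort_index()

print(f"rank equals dense descending rank of score: {matches}")
print(rank_counts)
```

If the check holds, a low rank number means a high protest score, which is worth keeping in mind when reading the mean rank of 3.4.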

In [17]:
news_data.shape

(10, 9)

In [12]:
news_data[['Analyst_Rank', 'Analyst_Average_Score', 'Reference_Final_Score']]

Unnamed: 0,Analyst_Rank,Analyst_Average_Score,Reference_Final_Score
0,4,0.0,1.96
1,4,0.0,12.0
2,4,0.0,0.05
3,4,0.0,6.1
4,4,0.0,0.13
5,2,1.33,11.0
6,4,0.0,10.1
7,1,1.66,1.36
8,3,0.33,2.4
9,4,0.0,0.22


* News items at index 7 and 5 have the highest scores (1.66 and 1.33)
* This may indicate a high probability of social protest
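Rather than reading the indices off the table, the top-scoring rows can be selected with `nlargest`. This is a small sketch on the score column copied from the table above; the variable names are illustrative.

```python
import pandas as pd

# Analyst_Average_Score for rows 0..9, copied from the table above
scores = pd.DataFrame({
    "Analyst_Average_Score": [0.0, 0.0, 0.0, 0.0, 0.0, 1.33, 0.0, 1.66, 0.33, 0.0],
})

# Two rows with the highest analyst score, highest first
top_two = scores.nlargest(2, "Analyst_Average_Score")
top_indices = top_two.index.tolist()

print(top_indices)  # [7, 5]
```

The resulting index list can then be passed to `news_data.loc[...]` to pull the matching `Body` text.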

### Taking a look at high score news

In [13]:
news_data.columns

Index(['Domain', 'Title', 'Description', 'Body', 'Link', 'timestamp',
       'Analyst_Average_Score', 'Analyst_Rank', 'Reference_Final_Score'],
      dtype='object')

In [14]:
news_data.iloc[7, 3]

'Construction activity grew steadily by 4% in the second quarter of the first three months of 2021 and recovered quickly and significantly in the same quarter of 2020 due to the low base effect of Covid. Famous economist Dr. Roerofubota said yesterday. In the announcement of the Afrimat Construction Index (ACI), and on behalf of Afrimat, he said ACI reached 110.3 points in the second half of 2021 and reached a 55% rebound compared to the same quarter last year. .. The construction industry continues to improve, he said in a telephone interview. Quarterly improvements were driven by a significant increase in employment in the construction sector and sales of building materials. Since the second quarter of 2020, the construction industry has created approximately 156,000 jobs. Bota said other promising improvements in the building blocks are the value of the building plans passed and the value of the buildings completed by the larger municipalities of the country. “Unfortunately, the pos

In [15]:
news_data.iloc[5, 3]

'South African Police Service Office of the Provincial Commissioner Eastern Cape EASTERN CAPE – A 42-year-old male suspect was arrested yesterday, Wednesday 8 September 2021 at about 10:00 for the murder of an 80-year-old female in the Epesikeni location Ngqwaru A/A in Cofimvaba. It is alleged that during the morning of Wednesday, neighbors found the door at the deceased’s homestead open. On further investigation it’s where the lifeless body was discovered, covered up, laying on the bed. They also found a known suspect hiding in the room, and after questioning him he reported that he found the lady on her bed sleeping and the suspect then ran away. After the suspect set his house on fire he was rescued by Cofimvaba Visible Police members who arrested him on a charge of murder. According to the suspect the motive for the murder was revenge. The suspect will appear before the Cofimvaba Magistrates Court tomorrow, 10 September 2021 on charges related to Murder. Join Your Neighbourhood Wat

* These are the two highest-scoring news items

### Save data set to image for cml

In [16]:
dfi.export(news_data, '../plots/news_data.png', max_rows=20)

