# 2. Data manipulation
## 2.1. Checking the accuracy of data scraping

The primary data source for our project is data scraped from Channel One's website. The goal of this notebook is to verify that the scraping algorithm works correctly. It provides tests that can be used to detect problems in the algorithm.

In [1]:
import pandas as pd
import numpy as np

In [6]:
%run "2.0. DataManipulation_Functions.ipynb"

In [2]:
file = 'data/0-15_output_scraping_cleaned.csv'

## Data loading

In [3]:
df_scraped = pd.read_csv(file, sep='|')
df_scraped.rename(columns={'Unnamed: 0': 'id'}, inplace=True)

In [4]:
df_scraped.head(3)

Unnamed: 0,id,date,tags_top,title,url,body,tags_bottom,video_duration_seconds
0,25,2023-01-28 12:27:00,Общество,Выпуск новостей в 12:00 от 28.01.2023,https://www.1tv.ru/news/2023-01-28/446131-vypu...,Смотрите в этом выпуске: танковый бой — задача...,Общество,
1,7,2023-01-28 12:18:00,"Общество,Культура",Смотрите на Первом канале программы и фильмы в...,https://www.1tv.ru/news/2023-01-28/446141-smot...,Сегодня на Первом канале программы и фильмы с ...,"Общество,Культура,Музыка,Кино,Телевидение,Доку...",
2,8,2023-01-28 12:17:00,"Общество,Погода",В столичном регионе на предстоящей неделе темп...,https://www.1tv.ru/news/2023-01-28/446139-v_st...,В России перепады температуры больше 40 градус...,"Общество,Погода",


In [5]:
df_scraped.shape

(15011, 8)

## Adding columns necessary for scraping algorithm testing

In [7]:
df_scraped = add_columns(df_scraped)

In [10]:
# Adding time of a newscast extracted from its title
df_scraped['newscast title time'] = df_scraped['title'].str.extract('^[а-яА-Я«» ]*([0-9]*):[0-9][0-9].*')
df_scraped['newscast title time1'] = df_scraped['title'].str.extract('^[а-яА-Я«» ]*([0-9]*) час от .*')
df_scraped['newscast title time'] = np.where(df_scraped['newscast title time1'].isnull(),
                                             df_scraped['newscast title time'],
                                             df_scraped['newscast title time1'])
df_scraped['newscast title time'] = pd.to_numeric(df_scraped['newscast title time'], 
                                                  errors='coerce').astype('Int64')
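
As a minimal sketch of what the two patterns above capture, the example below applies them to a few sample titles. The first two titles follow formats that appear in this dataset; the third ("... 15 час от ...") and the variable names are hypothetical and only illustrate the second pattern. `fillna` is used here as an equivalent of the `np.where` step.

```python
import pandas as pd

# Sample titles: two in the "HH:MM" format, one hypothetical title in the
# "N час от" format, and one ordinary news story without a newscast time.
titles = pd.Series([
    "Выпуск новостей в 12:00 от 28.01.2023",
    "Выпуск программы «Время» в 21:00 от 19.07.2022",
    "Выпуск новостей в 15 час от 01.02.2021",   # hypothetical title
    "В столичном регионе на предстоящей неделе",
])

# expand=False returns a Series instead of a one-column DataFrame
time_colon = titles.str.extract(r'^[а-яА-Я«» ]*([0-9]*):[0-9][0-9].*', expand=False)
time_hour = titles.str.extract(r'^[а-яА-Я«» ]*([0-9]*) час от .*', expand=False)

# Prefer the "N час от" match when present, otherwise use the "HH:MM" match
title_time = pd.to_numeric(time_hour.fillna(time_colon), errors='coerce').astype('Int64')
print(title_time)
```

Titles that match neither pattern end up as `<NA>`, which is why "newscast title time" is missing for ordinary news stories.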

In [11]:
df_scraped.head(3)

Unnamed: 0,id,date,tags_top,title,url,body,tags_bottom,video_duration_seconds,datetime,dat,year,year_month,hour,weekday,whole newscast,newscast title time,newscast title time1
0,25,2023-01-28 12:27:00,Общество,Выпуск новостей в 12:00 от 28.01.2023,https://www.1tv.ru/news/2023-01-28/446131-vypu...,Смотрите в этом выпуске: танковый бой — задача...,Общество,,2023-01-28 12:27:00,2023-01-28,2023,2023-01,12,5,True,12.0,
1,7,2023-01-28 12:18:00,"Общество,Культура",Смотрите на Первом канале программы и фильмы в...,https://www.1tv.ru/news/2023-01-28/446141-smot...,Сегодня на Первом канале программы и фильмы с ...,"Общество,Культура,Музыка,Кино,Телевидение,Доку...",,2023-01-28 12:18:00,2023-01-28,2023,2023-01,12,5,False,,
2,8,2023-01-28 12:17:00,"Общество,Погода",В столичном регионе на предстоящей неделе темп...,https://www.1tv.ru/news/2023-01-28/446139-v_st...,В России перепады температуры больше 40 градус...,"Общество,Погода",,2023-01-28 12:17:00,2023-01-28,2023,2023-01,12,5,False,,


## Removing the first and last days from the dataframe

Because the data for the first and last days were only partially loaded, we remove these days before testing the accuracy of the scraping.

In [12]:
max_date = df_scraped['dat'].max()
min_date = df_scraped['dat'].min()
min_date

datetime.date(2022, 6, 3)

In [13]:
df_scraped.drop(df_scraped[df_scraped['dat'].isin([max_date, min_date])].index, inplace=True)

In [14]:
df_scraped.shape

(14988, 17)

## Data checks

### Duplicates

#### Row duplicates

In [15]:
cols_subset = ['date', 'tags_top', 'title', 'url', 'body', 'tags_bottom',
          'video_duration_seconds']
row_duplicates = df_scraped.duplicated(subset=cols_subset, keep=False)
row_duplicates.value_counts()

False    14988
dtype: int64

#### Duplicates in column 'url'

In [16]:
cols_subset = ['url']
url_duplicates = df_scraped.duplicated(subset=cols_subset, keep=False)
url_duplicates.value_counts()

False    14988
dtype: int64

There are no duplicates in the examined file.

### Missing values

In [17]:
df_scraped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14988 entries, 15 to 15002
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   id                      14988 non-null  int64         
 1   date                    14988 non-null  object        
 2   tags_top                14753 non-null  object        
 3   title                   14988 non-null  object        
 4   url                     14988 non-null  object        
 5   body                    14953 non-null  object        
 6   tags_bottom             14988 non-null  object        
 7   video_duration_seconds  14977 non-null  float64       
 8   datetime                14988 non-null  datetime64[ns]
 9   dat                     14988 non-null  object        
 10  year                    14988 non-null  int64         
 11  year_month              14988 non-null  period[M]     
 12  hour                    14988 non-null  int64

In [18]:
# Number of missing cells in columns
df_scraped.isnull().sum(axis = 0)

id                            0
date                          0
tags_top                    235
title                         0
url                           0
body                         35
tags_bottom                   0
video_duration_seconds       11
datetime                      0
dat                           0
year                          0
year_month                    0
hour                          0
weekday                       0
whole newscast                0
newscast title time       13545
newscast title time1      14988
dtype: int64

The columns "body" and "video_duration_seconds" are crucial for the analysis, so we need to ensure that the algorithm scrapes them correctly. We therefore verify that each missing value is genuinely absent on the Channel One website and is not the result of an algorithm error. For this check, we open the corresponding pages on the Channel One website and compare them with the scraped data.

In [19]:
# Check rows where body is missing
pd.set_option('display.max_colwidth', None)
df_scraped.loc[df_scraped['body'].isnull(), 'url']

3738     https://www.1tv.ru/news/2022-11-25/442233-80_let_nazad_podpisano_soglashenie_o_formirovanii_frantsuzskoy_aviatsionnoy_eskadrili_normandiya_neman
4691       https://www.1tv.ru/news/2022-11-10/441245-prezident_poruchil_privesti_normativy_obespechennosti_vs_rf_v_sootvetstvie_s_realnymi_potrebnostyami
10241                             https://www.1tv.ru/news/2022-08-11/435453-s_territorii_ukrainy_v_osvobozhdennye_rayony_vozvraschayutsya_tselymi_semyami
11087                                                                     https://www.1tv.ru/news/2022-07-30/434570-vypusk_novostey_v_10_00_ot_30_07_2022
11411                                                                     https://www.1tv.ru/news/2022-07-25/434233-vypusk_novostey_v_18_00_ot_25_07_2022
11817                                                             https://www.1tv.ru/news/2022-07-19/433804-vypusk_programmy_vremya_v_21_00_ot_19_07_2022
11835                                                                     ht

In [20]:
# Check rows where video duration is missing
df_scraped.loc[df_scraped['video_duration_seconds'].isnull(), 'url']

798                                                 https://www.1tv.ru/news/2023-01-15/445302-skonchalsya_sovetskiy_i_gruzinskiy_pevets_akter_i_rezhisser_vahtang_kikabidze
835                                                                        https://www.1tv.ru/news/2023-01-14/445265-ushla_iz_zhizni_narodnaya_artistka_sssr_inna_churikova
1000                                                   https://www.1tv.ru/news/2023-01-11/445088-valeriy_gerasimov_naznachen_komanduyuschim_ob_edinennoy_gruppirovkoy_voysk
1583                                                                               https://www.1tv.ru/news/2022-12-29/444474-umer_trehkratnyy_chempion_mira_po_futbolu_pele
2064                     https://www.1tv.ru/news/2022-12-22/443973-dmitriy_peskov_vyrazil_sozhalenie_chto_vo_vremya_vizita_zelenskogo_v_ssha_ne_prozvuchalo_prizyvov_k_miru
4154    https://www.1tv.ru/news/2022-11-18/441799-perspektivy_dalneyshego_rasshireniya_torgovo_ekonomicheskih_svyazey_moskvy_i_baku_obsudili

In [21]:
pd.reset_option("max_colwidth")

All of the URLs listed above were checked manually. The values that are missing in the dataframe are also missing on the Channel One website, so the algorithm works properly. The number of missing values is negligible and will have no effect on the findings of the analysis.
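
To put the number of missing values in perspective, the short sketch below computes their share from the counts reported above (35 missing "body" values and 11 missing "video_duration_seconds" values out of 14988 rows). The variable names are illustrative only.

```python
# Counts taken from the outputs above
n_rows = 14988
missing = {'body': 35, 'video_duration_seconds': 11}

# Share of rows with a missing value, per column
missing_share = {col: n / n_rows for col, n in missing.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.2%}")
```

Both shares are well below one percent (roughly 0.23% and 0.07%).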

### Anomalies in data

To detect problems in the algorithm, we compare the duration of the video of the entire newscast with the sum of the durations of the individual stories from that newscast. Typically, the newscast is no more than 10 minutes longer than the sum of its individual stories (the extra time covers the introduction and a brief overview of the news). If the newscast exceeds the sum of its stories by more than 10 minutes, this may indicate that the algorithm is not downloading all of the news stories.
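
The rule described above can be sketched on a small toy table. The durations below are hypothetical, and the names `THRESHOLD_SECONDS` and "possibly incomplete" are assumptions made for this illustration; the column names mirror those used later in this notebook.

```python
import pandas as pd

# Hypothetical durations (in seconds) for three newscasts
toy = pd.DataFrame({
    'newscast duration': [935.0, 2069.0, 600.0],
    'sum duration of stories': [834.0, 981.0, 900.0],
})

THRESHOLD_SECONDS = 10 * 60  # 10 minutes

# Positive diff: the newscast is longer than the sum of its stories
toy['diff'] = toy['newscast duration'] - toy['sum duration of stories']

# Flag newscasts that exceed the sum of their stories by more than 10 minutes
toy['possibly incomplete'] = toy['diff'] > THRESHOLD_SECONDS
print(toy)
```

Only the second row is flagged. The third row has a negative difference, which is a separate case and is not caught by this rule.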

In [22]:
# Create dataframe with duration of newscast and sum of duration of individual stories
video_dur = df_scraped.groupby(['dat', 'hour', 'whole newscast'])[['video_duration_seconds']].sum()
video_dur.reset_index(inplace=True)
video_dur

Unnamed: 0,dat,hour,whole newscast,video_duration_seconds
0,2022-06-04,10,False,834.0
1,2022-06-04,10,True,935.0
2,2022-06-04,12,False,809.0
3,2022-06-04,12,True,890.0
4,2022-06-04,15,False,786.0
...,...,...,...,...
2941,2023-01-27,15,True,1006.0
2942,2023-01-27,18,False,2003.0
2943,2023-01-27,18,True,2391.0
2944,2023-01-27,21,False,2307.0


In [23]:
df = pd.pivot_table(video_dur, values='video_duration_seconds', index=['dat', 'hour'],
                    columns=['whole newscast'], aggfunc='min')
df.reset_index(inplace=True)
df.rename(columns={False: 'sum duration of stories',
                   True: 'newscast duration'},
          inplace=True)
df

whole newscast,dat,hour,sum duration of stories,newscast duration
0,2022-06-04,10,834.0,935.0
1,2022-06-04,12,809.0,890.0
2,2022-06-04,15,786.0,873.0
3,2022-06-04,18,1173.0,1253.0
4,2022-06-04,21,1954.0,2073.0
...,...,...,...,...
1547,2023-01-27,13,673.0,864.0
1548,2023-01-27,14,583.0,1012.0
1549,2023-01-27,15,737.0,1006.0
1550,2023-01-27,18,2003.0,2391.0


In [24]:
df['diff'] = df['newscast duration'] - df['sum duration of stories']

In [25]:
# Dataframe with only newscasts (no individual news stories)
newscasts = df_scraped[df_scraped['whole newscast']==True].copy()

In [26]:
df = df.merge(newscasts[['dat', 'hour', 'newscast title time']],
              how='left',
              left_on = ['dat', 'hour'],
              right_on = ['dat', 'hour'])

In [27]:
df

Unnamed: 0,dat,hour,sum duration of stories,newscast duration,diff,newscast title time
0,2022-06-04,10,834.0,935.0,101.0,10
1,2022-06-04,12,809.0,890.0,81.0,12
2,2022-06-04,15,786.0,873.0,87.0,15
3,2022-06-04,18,1173.0,1253.0,80.0,18
4,2022-06-04,21,1954.0,2073.0,119.0,21
...,...,...,...,...,...,...
1548,2023-01-27,13,673.0,864.0,191.0,13
1549,2023-01-27,14,583.0,1012.0,429.0,14
1550,2023-01-27,15,737.0,1006.0,269.0,15
1551,2023-01-27,18,2003.0,2391.0,388.0,18


In [28]:
df.describe()

Unnamed: 0,hour,sum duration of stories,newscast duration,diff,newscast title time
count,1553.0,1506.0,1442.0,1395.0,1442.0
mean,15.112041,1249.328685,1492.648405,288.993548,14.630374
std,4.153196,852.050045,1116.134635,746.95392,3.858891
min,7.0,0.0,358.0,-1169.0,1.0
25%,12.0,798.25,918.0,61.0,12.0
50%,14.0,926.0,1032.5,112.0,14.0
75%,18.0,1433.75,1510.5,360.0,18.0
max,23.0,5903.0,9346.0,7472.0,21.0


We looked at several rows with a negative difference between the newscast duration and the sum of the individual story durations. In these cases, the newscast video contained only part of the newscast. This has no bearing on our study because we use only the scripts of individual stories.

In [29]:
# Let's check cases where diff > 10 min
df[df['diff']>600].head(10)

Unnamed: 0,dat,hour,sum duration of stories,newscast duration,diff,newscast title time
11,2022-06-05,23,584.0,7327.0,6743.0,21
24,2022-06-07,14,981.0,2069.0,1088.0,15
25,2022-06-07,14,981.0,2069.0,1088.0,14
71,2022-06-13,23,136.0,7521.0,7385.0,21
93,2022-06-16,22,409.0,4088.0,3679.0,21
101,2022-06-17,22,1307.0,5033.0,3726.0,21
113,2022-06-19,23,295.0,7498.0,7203.0,21
123,2022-06-21,13,219.0,859.0,640.0,13
160,2022-06-26,22,1459.0,7629.0,6170.0,21
260,2022-07-10,23,466.0,7894.0,7428.0,21


In most cases, the discrepancy was caused by an inaccurate time in the newscast's "date" field (compare "hour", extracted from the "date" field, with "newscast title time", extracted from the newscast title). This is another example of inaccurate information on the Channel One website.

This will have no effect on the outcome of our project.
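
As a rough illustration of this comparison, the sketch below takes the "hour" and "newscast title time" values of the ten rows displayed above and counts how many of them disagree. It covers only those ten rows, not the full dataframe, and the variable names are illustrative.

```python
import pandas as pd

# "hour" and "newscast title time" copied from the ten rows shown above
shown = pd.DataFrame({
    'hour': [23, 14, 14, 23, 22, 22, 23, 13, 22, 23],
    'newscast title time': [21, 15, 14, 21, 21, 21, 21, 13, 21, 21],
})

# True where the hour from the "date" field disagrees with the title
mismatch = shown['hour'] != shown['newscast title time']
n_mismatch = int(mismatch.sum())
print(f"{n_mismatch} of {len(shown)} rows have mismatched times")
```

In this small sample, eight of the ten rows disagree; the two remaining rows are examined in the next step.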

In [30]:
# Check discrepancy for the cases with the correct newscast time
df_for_check = df[(df['diff']>600)&(df['hour']==df['newscast title time'])]
df_for_check

Unnamed: 0,dat,hour,sum duration of stories,newscast duration,diff,newscast title time
25,2022-06-07,14,981.0,2069.0,1088.0,14
123,2022-06-21,13,219.0,859.0,640.0,13
279,2022-07-13,14,217.0,972.0,755.0,14
285,2022-07-14,13,249.0,935.0,686.0,13
313,2022-07-18,14,249.0,907.0,658.0,14
424,2022-08-03,13,346.0,1082.0,736.0,13
455,2022-08-08,9,831.0,1467.0,636.0,9
458,2022-08-08,14,343.0,994.0,651.0,14
546,2022-08-22,13,290.0,980.0,690.0,13
600,2022-08-30,14,206.0,856.0,650.0,14


Three newscasts from this list (2022-11-04, 2022-10-07, and 2023-01-23) were checked manually. Once again, the discrepancy originates on the Channel One website, which does not provide all of the individual news stories for these newscasts. To evaluate the potential impact on our project, we assess the share of such newscasts.

In [31]:
# Share of newscasts for which not all individual news stories were provided
len(df_for_check) / len(newscasts)

0.03398058252427184

### Manual check for one row
Information from one randomly selected row was compared with the information on the Channel One website. No discrepancies were found.

In [32]:
pd.set_option('display.max_colwidth', None)
df_scraped.sample(1).T

Unnamed: 0,11224
id,10549
date,2022-07-28 12:04:00
tags_top,В мире
title,Бойцы 2-го армейского корпуса народной милиции ЛНР прикрывают колонны во время эвакуации людей
url,https://www.1tv.ru/news/2022-07-28/434433-boytsy_2_go_armeyskogo_korpusa_narodnoy_militsii_lnr_prikryvayut_kolonny_vo_vremya_evakuatsii_lyudey
body,"Особая роль в защите мирного населения Донбасса у бойцов второго армейского корпуса народной милиции ЛНР. Они не только прикрывают людей от ударов националистов, ищут и уничтожают диверсантов, отбивают атаки и проводят успешные наступательные операции. На них еще возложена и огромная гуманитарная миссия, доставка топлива и продовольствия, эвакуация из опасных районов, разминирование территории и восстановление нормальных условий жизни. Евгений Лямин с подробностями."
tags_bottom,"В мире,Специальная военная операция на Украине"
video_duration_seconds,179.0
datetime,2022-07-28 12:04:00
dat,2022-07-28


In [33]:
pd.reset_option('display.max_colwidth')