# Analyzing the Covid-19 situation in Russia between 29.01.2020 and 31.07.2020 and making some predictions for the future



## 1. Introduction: Covid-19 in Russia


The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The disease was first identified in December 2019 in Wuhan, China, became a Public Health Emergency of International Concern in January 2020, and was subsequently recognised as a pandemic. As of 28 September 2020, more than 33.1 million cases have been reported worldwide, although the true number of cases is likely to be much higher. A better indicator of the spread is the more than 998,000 deaths attributed to COVID-19.

The disease spreads between people most often when they are physically close. It spreads very easily and sustainably through the air, primarily via small droplets or particles such as aerosols, produced after an infected person breathes, coughs, sneezes, talks or sings. It may also be transmitted via contaminated surfaces, although this has not been conclusively demonstrated. It can spread for up to two days prior to symptom onset, and from people who are asymptomatic. People remain infectious for 7–12 days in moderate cases, and up to two weeks in severe cases.
Common symptoms include fever, cough, fatigue, shortness of breath or breathing difficulties, and loss of smell. Complications may include pneumonia and acute respiratory distress syndrome. The incubation period is typically around five days but may range from one to 14 days. There are several vaccine candidates in development, although none have completed clinical trials to prove their safety and efficacy. There is no known specific antiviral medication, so primary treatment is currently symptomatic.

The COVID-19 pandemic in Russia is part of the ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus was confirmed to have spread to Russia on 31 January 2020, when two Chinese citizens in Tyumen (Siberia) and Chita (Russian Far East) tested positive for the virus, with both cases being contained. Early prevention measures included restricting the border with China and extensive testing. The infection spread from Italy on 2 March, leading to additional measures such as cancelling events, closing schools, theatres, and museums, as well as shutting the border and declaring a non-working period which, after two extensions, lasted until 11 May 2020. By the end of March 2020, the vast majority of federal subjects, including Moscow, had imposed lockdowns. By 17 April 2020, cases had been confirmed in all federal subjects.

Russia currently has the highest number of confirmed cases in Europe, and the fourth-highest in the world after the United States, India, and Brazil. According to figures from the national coronavirus crisis centre, as of 28 September 2020, Russia has 1,159,573 confirmed cases, 945,920 recoveries, 20,385 deaths, and over 45.4 million tests performed. According to mortality data published by the Federal State Statistics Service, 37,490 people with COVID-19 died from April to July 2020, with the virus determined or assumed to have been the main cause of death for 22,063 of them.

Sources:

+ https://en.wikipedia.org/wiki/COVID-19_pandemic
+ https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Russia

## 2. Data description and objectives

In this analysis I examine the Covid-19 situation in Russia. It is fascinating to observe the dynamics of Covid-19 in Russia over the months.

All data is taken from Kaggle: https://www.kaggle.com/kapral42/covid19-russia-regions-cases?select=isolation_daily.csv

My analysis will be based on data from the 29th January to the 31st July of 2020. Below is data that I will use for my analysis.

+ Region_ID - ID of the regions for table matching.
+ Region - region name in Russian
+ Region Eng - region names in English
+ Population - region population
+ Day-Confirmed - confirmed cases for the day
+ Day-Deaths - death cases for the day
+ Day-Recovered - recovered cases for the day
+ Confirmed - sum of confirmed cases in the region to date (cumulative)
+ Deaths - sum of death cases in the region to date (cumulative)
+ Recovered - sum of recovered cases in the region to date (cumulative)
+ Tests - number of performed tests (cumulative)
+ Control - people under medical control (cumulative)
+ Control_active - people under medical control on a given day
+ Recovered - people who recovered from Imported_SARS (cumulative)
+ Isol_idx - Self-Isolation index
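
Since Region_ID exists for table matching, a minimal sketch of such a match may help. The two small frames below are illustrative stand-ins for the real tables: the populations are the ones shown later for Moscow and the Moscow region, while the `Confirmed` values and the `Confirmed_per_100k` column are made up for the example.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the list above.
cases = pd.DataFrame({
    "Region_ID": [77, 50],
    "Confirmed": [1000, 300],  # made-up cumulative counts
})
info = pd.DataFrame({
    "Region_ID": [77, 50],
    "Region_eng": ["Moscow", "Moscow region"],
    "Population": [12692466, 7687647],
})

# Match the two tables on the shared Region_ID key.
# validate= raises an error if a Region_ID is duplicated in the info table.
merged = cases.merge(info, on="Region_ID", how="left", validate="many_to_one")

# Cumulative confirmed cases per 100,000 inhabitants (hypothetical derived column).
merged["Confirmed_per_100k"] = merged["Confirmed"] / merged["Population"] * 100_000
print(merged[["Region_eng", "Confirmed_per_100k"]])
```

Normalising by population in this way makes regions of very different sizes comparable.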

In this project, I will try to answer five main questions:

1. How did Covid-19 spread across Russian regions between 29.01.2020 and 31.07.2020?
2. Which regions were the most affected by Covid-19?
3. What was the impact of Covid-19 on Russian society and the economy?
4. Will Russia be ready for the second wave of Covid-19?
5. What is the prediction for the future spread of Covid-19 in Russian regions?


## 3. Data manipulation: cleaning and shaping 

First of all, we need to go through all Russian regions data list and see what we are dealing with.



In [1]:
# importing main libraries
import numpy as np
import pandas as pd

In [2]:
# creating data frames for all the data I need
# information about the Russian regions: ID, name, population, etc.
reg_info_df = pd.read_csv('regions-info.csv')

# number of tests, people under medical control, etc.
reg_control_df = pd.read_csv('covid19-tests-and-other.csv')

# confirmed cases, deaths, recovered people, etc.
rus_cases_df = pd.read_csv('covid19-russia-cases-scrf.csv')

# self-isolation indexes of all Russian regions
isol_df = pd.read_csv('isolation_daily.csv')



#### Now let's look at the data in each data frame, drop the columns we don't need, and fill in missing values where possible


In [3]:
reg_info_df.head()

Unnamed: 0,Region_ID,Region,Region_eng,Population,Rus_perc,Urban_pop,Urban_pop_perc,Rural_pop,Rural_pop_perc,Area,Density_pop_sqkm,Federal_district,Latitude,Longitude
0,0,Россия,Russia,146745098,100,109326899,745,37553533,2559,17125191,857,РФ,64.686314,97.745306
1,77,Москва,Moscow,12692466,865,12342615,9724,163853,129,2561,495606,ЦФО,55.479205,37.32733
2,50,Московская область,Moscow region,7687647,524,6123573,7965,1379812,1795,44329,17342,ЦФО,55.504316,38.035393
3,23,Краснодарский край,Krasnodar region,5677786,387,3075168,5416,2528252,4453,75485,7522,ЮФО,45.768401,39.026104
4,78,Санкт-Петербург,St. Petersburg,5392992,368,5351935,9924,0,0,1403,38439,СЗФО,59.960674,30.158655


In [4]:
# As you can see, we have unnecessary columns: "Federal_district" is not related to my project, and I don't need "Area"
# because I already have the population density per square km. So we'll drop them.
# I kept the region names in Russian for easier orientation later.
reg_info_df.drop(['Federal_district','Area'],axis='columns', inplace = True)
reg_info_df.head()

Unnamed: 0,Region_ID,Region,Region_eng,Population,Rus_perc,Urban_pop,Urban_pop_perc,Rural_pop,Rural_pop_perc,Density_pop_sqkm,Latitude,Longitude
0,0,Россия,Russia,146745098,100,109326899,745,37553533,2559,857,64.686314,97.745306
1,77,Москва,Moscow,12692466,865,12342615,9724,163853,129,495606,55.479205,37.32733
2,50,Московская область,Moscow region,7687647,524,6123573,7965,1379812,1795,17342,55.504316,38.035393
3,23,Краснодарский край,Krasnodar region,5677786,387,3075168,5416,2528252,4453,7522,45.768401,39.026104
4,78,Санкт-Петербург,St. Petersburg,5392992,368,5351935,9924,0,0,38439,59.960674,30.158655


In [5]:
# There are no missing values here
reg_info_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Region_ID         86 non-null     int64  
 1   Region            86 non-null     object 
 2   Region_eng        86 non-null     object 
 3   Population        86 non-null     int64  
 4   Rus_perc          86 non-null     object 
 5   Urban_pop         86 non-null     int64  
 6   Urban_pop_perc    86 non-null     object 
 7   Rural_pop         86 non-null     int64  
 8   Rural_pop_perc    86 non-null     object 
 9   Density_pop_sqkm  86 non-null     object 
 10  Latitude          86 non-null     float64
 11  Longitude         86 non-null     float64
dtypes: float64(2), int64(4), object(6)
memory usage: 8.2+ KB
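
The summary above lists the percentage and density columns as `object` rather than numeric. One possible explanation, and it is only an assumption here, is that the raw file stores these values as strings with a decimal comma; a displayed value such as 865 for Moscow would be consistent with 8.65 % of the national population. Under that assumption, a conversion could be sketched as follows, where the sample frame and the `to_float` helper are hypothetical:

```python
import pandas as pd

# Hypothetical sample: assumes percentages are stored as strings with a decimal comma.
sample = pd.DataFrame({"Rus_perc": ["100", "8,65", "5,24"]})


def to_float(col: pd.Series) -> pd.Series:
    """Swap a decimal comma for a dot and convert to float; unparsable values become NaN."""
    cleaned = col.astype(str).str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


sample["Rus_perc"] = to_float(sample["Rus_perc"])
print(sample.dtypes)
```

If the assumption holds, `pd.read_csv(..., decimal=',')` may achieve the same at load time; it is worth inspecting the raw file before choosing either route.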


In [6]:
# Just to check how many rows and columns I have
reg_control_df.shape

(185, 10)

In [7]:
# All NaN values are in columns with cumulative data, so a NaN means the count stays the same as in the previous row
reg_control_df.loc[30:50]

Unnamed: 0,Date,Tests,Control,Control_active,Isolated,Imported_SARS,Recovered,Isolators,Isolators_active,Mos_self_isolat_idx
30,28.02.2020,32885.0,,8414.0,443.0,1163.0,471.0,,,
31,29.02.2020,,,8681.0,375.0,1177.0,474.0,,,
32,01.03.2020,,,,321.0,1222.0,826.0,,,
33,02.03.2020,,,7418.0,270.0,1232.0,859.0,,,0.5
34,03.03.2020,42776.0,,8053.0,247.0,1236.0,859.0,,,0.4
35,04.03.2020,46414.0,,7341.0,226.0,1280.0,871.0,,,0.4
36,05.03.2020,51366.0,,7594.0,164.0,1314.0,938.0,,,0.3
37,06.03.2020,55688.0,,,120.0,1401.0,1165.0,,,0.4
38,07.03.2020,,,8373.0,120.0,1492.0,1227.0,,,1.8
39,08.03.2020,59960.0,,9276.0,98.0,1518.0,1235.0,,,2.4


In [8]:
# The first row is empty, so fill its numeric columns with 0 and then forward-fill the remaining gaps,
# so that every missing cumulative value repeats the previous one. Assigning through .loc changes the frame itself.
num_cols = reg_control_df.columns.drop('Date')
reg_control_df.loc[0, num_cols] = reg_control_df.loc[0, num_cols].fillna(0)
reg_control_df = reg_control_df.ffill()
reg_control_df.loc[30:50]


Unnamed: 0,Date,Tests,Control,Control_active,Isolated,Imported_SARS,Recovered,Isolators,Isolators_active,Mos_self_isolat_idx
30,28.02.2020,32885.0,0.0,8414.0,443.0,1163.0,471.0,0.0,0.0,0.0
31,29.02.2020,32885.0,0.0,8681.0,375.0,1177.0,474.0,0.0,0.0,0.0
32,01.03.2020,32885.0,0.0,8681.0,321.0,1222.0,826.0,0.0,0.0,0.0
33,02.03.2020,32885.0,0.0,7418.0,270.0,1232.0,859.0,0.0,0.0,0.5
34,03.03.2020,42776.0,0.0,8053.0,247.0,1236.0,859.0,0.0,0.0,0.4
35,04.03.2020,46414.0,0.0,7341.0,226.0,1280.0,871.0,0.0,0.0,0.4
36,05.03.2020,51366.0,0.0,7594.0,164.0,1314.0,938.0,0.0,0.0,0.3
37,06.03.2020,55688.0,0.0,7594.0,120.0,1401.0,1165.0,0.0,0.0,0.4
38,07.03.2020,55688.0,0.0,8373.0,120.0,1492.0,1227.0,0.0,0.0,1.8
39,08.03.2020,59960.0,0.0,9276.0,98.0,1518.0,1235.0,0.0,0.0,2.4
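
The cell above carries the previous value forward in every column. As a hedged alternative, the sketch below restricts forward-filling to running totals and leaves daily figures untouched. Which columns count as cumulative is an assumption taken from the data description (Tests and Control are marked cumulative, Control_active is a per-day figure), and the toy frame only mimics a few rows of the real one.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring three columns of reg_control_df; the row layout is illustrative.
df = pd.DataFrame({
    "Tests": [np.nan, 32885.0, np.nan, 42776.0],
    "Control": [np.nan, np.nan, np.nan, np.nan],
    "Control_active": [np.nan, 8414.0, np.nan, 8053.0],
})

cumulative_cols = ["Tests", "Control"]  # assumption: these are the running totals

# Gaps in a running total repeat the last known value;
# a total that has not started yet is treated as 0.
df[cumulative_cols] = df[cumulative_cols].ffill().fillna(0)

# Control_active is a daily figure, so its gaps are left as NaN here.
print(df)
```

Leaving daily figures as NaN keeps a missing observation distinguishable from a repeated one, which may matter for later plots or averages.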


In [9]:
#Removing column I don't need
reg_control_df.drop(['Mos_self_isolat_idx'],axis='columns', inplace = True)
reg_control_df.head()

Unnamed: 0,Date,Tests,Control,Control_active,Isolated,Imported_SARS,Recovered,Isolators,Isolators_active
0,29.01.2020,0.0,0.0,0.0,0.0,184.0,0.0,0.0,0.0
1,30.01.2020,0.0,0.0,0.0,0.0,184.0,0.0,0.0,0.0
2,31.01.2020,0.0,0.0,0.0,0.0,236.0,0.0,0.0,0.0
3,01.02.2020,0.0,0.0,0.0,0.0,305.0,0.0,0.0,0.0
4,02.02.2020,0.0,0.0,0.0,0.0,360.0,0.0,0.0,0.0


In [10]:
#No missing values
reg_control_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              185 non-null    object 
 1   Tests             185 non-null    float64
 2   Control           185 non-null    float64
 3   Control_active    185 non-null    float64
 4   Isolated          185 non-null    float64
 5   Imported_SARS     185 non-null    float64
 6   Recovered         185 non-null    float64
 7   Isolators         185 non-null    float64
 8   Isolators_active  185 non-null    float64
dtypes: float64(8), object(1)
memory usage: 13.1+ KB
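
The Date column is still of type `object`, meaning the dates are plain strings in day.month.year form. A minimal sketch of parsing them, using a few of the date strings shown above in a standalone Series rather than the real frame:

```python
import pandas as pd

# Sample strings in the same dd.mm.yyyy layout as the Date column.
dates = pd.Series(["29.01.2020", "01.02.2020", "08.03.2020"], name="Date")

# An explicit format avoids day/month ambiguity (01.02.2020 is 1 February, not 2 January).
parsed = pd.to_datetime(dates, format="%d.%m.%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

With real datetimes, this table can be aligned by date with the other tables, whose dates are written as year-month-day.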


In [11]:
rus_cases_df.head(10)
# We'll use all of the columns, so there is no unnecessary data

Unnamed: 0,Date,Region/City,Region/City-Eng,Region_ID,Day-Confirmed,Day-Deaths,Day-Recovered,Confirmed,Deaths,Recovered
0,2020-03-02,Московская область,Moscow region,50.0,1.0,0.0,0.0,1.0,0.0,0.0
1,2020-03-03,Московская область,Moscow region,50.0,0.0,0.0,0.0,1.0,0.0,0.0
2,2020-03-04,Московская область,Moscow region,50.0,0.0,0.0,0.0,1.0,0.0,0.0
3,2020-03-05,Московская область,Moscow region,50.0,0.0,0.0,0.0,1.0,0.0,0.0
4,2020-03-06,Москва,Moscow,77.0,5.0,0.0,0.0,5.0,0.0,0.0
5,2020-03-06,Московская область,Moscow region,50.0,0.0,0.0,0.0,1.0,0.0,0.0
6,2020-03-06,Нижегородская область,Nizhny Novgorod Region,52.0,1.0,0.0,0.0,1.0,0.0,0.0
7,2020-03-07,Липецкая область,Lipetsk region,48.0,3.0,0.0,0.0,3.0,0.0,0.0
8,2020-03-07,Москва,Moscow,77.0,0.0,0.0,0.0,5.0,0.0,0.0
9,2020-03-07,Московская область,Moscow region,50.0,0.0,0.0,0.0,1.0,0.0,0.0


In [12]:
#There are no missing values
rus_cases_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12054 entries, 0 to 12053
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Date             12054 non-null  object 
 1   Region/City      12054 non-null  object 
 2   Region/City-Eng  12054 non-null  object 
 3   Region_ID        12054 non-null  float64
 4   Day-Confirmed    12054 non-null  float64
 5   Day-Deaths       12054 non-null  float64
 6   Day-Recovered    12054 non-null  float64
 7   Confirmed        12054 non-null  float64
 8   Deaths           12054 non-null  float64
 9   Recovered        12054 non-null  float64
dtypes: float64(7), object(3)
memory usage: 941.8+ KB
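
Two optional tidy-ups suggest themselves from this summary: Region_ID is stored as a float although it is an identifier, and the cumulative columns can be cross-checked against the daily ones. The sketch below does both on a handful of rows copied from the preview above; the variable names `running` and `mismatches` are introduced only for this example.

```python
import pandas as pd

# A few rows taken from the preview of rus_cases_df above.
rows = pd.DataFrame({
    "Date": ["2020-03-02", "2020-03-03", "2020-03-06", "2020-03-07"],
    "Region_ID": [50.0, 50.0, 77.0, 77.0],
    "Day-Confirmed": [1.0, 0.0, 5.0, 0.0],
    "Confirmed": [1.0, 1.0, 5.0, 5.0],
})

rows["Date"] = pd.to_datetime(rows["Date"])
rows["Region_ID"] = rows["Region_ID"].astype(int)  # identifiers read better as integers

# Within each region, the running sum of daily cases should equal the cumulative column.
rows = rows.sort_values(["Region_ID", "Date"])
running = rows.groupby("Region_ID")["Day-Confirmed"].cumsum()
mismatches = rows[running != rows["Confirmed"]]
print(len(mismatches))
```

Any rows that end up in `mismatches` on the full data would point to corrections or gaps in the source worth examining before modelling.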


In [13]:
isol_df.head()
# I want to split City_geo,Region_geo columns into separate latitude, longitude columns

Unnamed: 0,City,City_geo,City_pop,Date,Date_last,Region,Region_geo_id,Region_geo,Region_pop,Isol_idx,Isol_idx_minus_day,Isol_idx_minus_week
0,Сергиев Посад,"[56.315291, 38.135999]",103444,2020-06-15,Да,Москва и Московская область,1,"[55.815792, 37.380031]",7503385,1.3,2.1,1.6
1,Хасавюрт,"[43.246265, 46.590044]",141259,2020-06-15,Да,Республика Дагестан,11010,"[42.259793, 47.095742]",3063885,1.9,2.3,2.1
2,Петрозаводск,"[61.787374, 34.354325]",263639,2020-06-15,Да,Республика Карелия,10933,"[63.621324, 33.232608]",622484,1.1,2.7,1.3
3,Самара,"[53.195538, 50.101783]",1163399,2020-06-15,Да,Самарская область,11131,"[53.27635, 50.463301]",3193514,1.0,2.4,1.3
4,Котлас,"[61.25297, 46.633217]",61805,2020-06-15,Да,Архангельская область,10842,"[63.637517, 43.336661]",1111031,1.0,2.2,1.1


In [14]:
#First of all, I removed some characters from the columns
isol_df['Region_geo']= isol_df['Region_geo'].map(lambda x: x.lstrip('[').rstrip(']'))
isol_df['City_geo']= isol_df['City_geo'].map(lambda x: x.lstrip('[').rstrip(']'))
isol_df.head()


Unnamed: 0,City,City_geo,City_pop,Date,Date_last,Region,Region_geo_id,Region_geo,Region_pop,Isol_idx,Isol_idx_minus_day,Isol_idx_minus_week
0,Сергиев Посад,"56.315291, 38.135999",103444,2020-06-15,Да,Москва и Московская область,1,"55.815792, 37.380031",7503385,1.3,2.1,1.6
1,Хасавюрт,"43.246265, 46.590044",141259,2020-06-15,Да,Республика Дагестан,11010,"42.259793, 47.095742",3063885,1.9,2.3,2.1
2,Петрозаводск,"61.787374, 34.354325",263639,2020-06-15,Да,Республика Карелия,10933,"63.621324, 33.232608",622484,1.1,2.7,1.3
3,Самара,"53.195538, 50.101783",1163399,2020-06-15,Да,Самарская область,11131,"53.27635, 50.463301",3193514,1.0,2.4,1.3
4,Котлас,"61.25297, 46.633217",61805,2020-06-15,Да,Архангельская область,10842,"63.637517, 43.336661",1111031,1.0,2.2,1.1


In [15]:
#Then I split them into latitude and longitude columns
isol_df[['City_latitude','City_longitude']] = isol_df.City_geo.str.split(",",expand=True,)
isol_df[['Region_latitude','Region_longitude']] = isol_df.Region_geo.str.split(",",expand=True,)
isol_df.head()

Unnamed: 0,City,City_geo,City_pop,Date,Date_last,Region,Region_geo_id,Region_geo,Region_pop,Isol_idx,Isol_idx_minus_day,Isol_idx_minus_week,City_latitude,City_longitude,Region_latitude,Region_longitude
0,Сергиев Посад,"56.315291, 38.135999",103444,2020-06-15,Да,Москва и Московская область,1,"55.815792, 37.380031",7503385,1.3,2.1,1.6,56.315291,38.135999,55.815792,37.380031
1,Хасавюрт,"43.246265, 46.590044",141259,2020-06-15,Да,Республика Дагестан,11010,"42.259793, 47.095742",3063885,1.9,2.3,2.1,43.246265,46.590044,42.259793,47.095742
2,Петрозаводск,"61.787374, 34.354325",263639,2020-06-15,Да,Республика Карелия,10933,"63.621324, 33.232608",622484,1.1,2.7,1.3,61.787374,34.354325,63.621324,33.232608
3,Самара,"53.195538, 50.101783",1163399,2020-06-15,Да,Самарская область,11131,"53.27635, 50.463301",3193514,1.0,2.4,1.3,53.195538,50.101783,53.27635,50.463301
4,Котлас,"61.25297, 46.633217",61805,2020-06-15,Да,Архангельская область,10842,"63.637517, 43.336661",1111031,1.0,2.2,1.1,61.25297,46.633217,63.637517,43.336661


In [16]:
#And finally removed them and the unnecessary column Date_last
isol_df.drop(['City_geo','Region_geo','Date_last'],axis='columns', inplace = True)
isol_df.head()

Unnamed: 0,City,City_pop,Date,Region,Region_geo_id,Region_pop,Isol_idx,Isol_idx_minus_day,Isol_idx_minus_week,City_latitude,City_longitude,Region_latitude,Region_longitude
0,Сергиев Посад,103444,2020-06-15,Москва и Московская область,1,7503385,1.3,2.1,1.6,56.315291,38.135999,55.815792,37.380031
1,Хасавюрт,141259,2020-06-15,Республика Дагестан,11010,3063885,1.9,2.3,2.1,43.246265,46.590044,42.259793,47.095742
2,Петрозаводск,263639,2020-06-15,Республика Карелия,10933,622484,1.1,2.7,1.3,61.787374,34.354325,63.621324,33.232608
3,Самара,1163399,2020-06-15,Самарская область,11131,3193514,1.0,2.4,1.3,53.195538,50.101783,53.27635,50.463301
4,Котлас,61805,2020-06-15,Архангельская область,10842,1111031,1.0,2.2,1.1,61.25297,46.633217,63.637517,43.336661


In [17]:
#As you can see, we don't have missing values
isol_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38190 entries, 0 to 38189
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   City                 38190 non-null  object 
 1   City_pop             38190 non-null  int64  
 2   Date                 38190 non-null  object 
 3   Region               38190 non-null  object 
 4   Region_geo_id        38190 non-null  int64  
 5   Region_pop           38190 non-null  int64  
 6   Isol_idx             38190 non-null  float64
 7   Isol_idx_minus_day   38190 non-null  float64
 8   Isol_idx_minus_week  38190 non-null  float64
 9   City_latitude        38190 non-null  object 
 10  City_longitude       38190 non-null  object 
 11  Region_latitude      38190 non-null  object 
 12  Region_longitude     38190 non-null  object 
dtypes: float64(3), int64(3), object(7)
memory usage: 3.8+ MB
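
The four coordinate columns are listed as `object`, so they still hold strings, and the longitude strings probably keep the space that followed the comma. If numeric coordinates are needed later (for distances or maps, say), a conversion could look like the sketch below, shown on two of the raw `City_geo` values from the preview in a standalone Series:

```python
import pandas as pd

# Two raw values in the same "[lat, lon]" layout as City_geo.
geo = pd.Series(["[56.315291, 38.135999]", "[43.246265, 46.590044]"], name="City_geo")

# Strip the brackets, split on the comma, trim whitespace, and convert to numbers.
parts = geo.str.strip("[]").str.split(",", expand=True)
coords = parts.apply(lambda col: pd.to_numeric(col.str.strip()))
coords.columns = ["City_latitude", "City_longitude"]
print(coords.dtypes)
```

The same steps would apply to `Region_geo`.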


### Now our data is ready for further manipulation