In [3]:
pip install pandas




In [4]:

import urllib.request
import pandas as pd
import os

url = "https://raw.githubusercontent.com/nlihin/data-analytics/main/datasets/italy-covid-daywise.csv"
# exist_ok=True avoids a FileExistsError when ./data1 already exists (for example, when this cell is re-run)
os.makedirs("./data1", exist_ok=True)
urllib.request.urlretrieve(url, "./data1/italy_covid.csv")

In [5]:
italy_covid_data=pd.read_csv("./data1/italy_covid.csv")

In [6]:
italy_covid_data

,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,


We can view basic information about the data frame using the .info method. Keep in mind that these are officially reported figures; the actual numbers of cases and deaths may be higher, since not all cases are diagnosed or reported.


In [7]:
italy_covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        248 non-null    object 
 1   new_cases   248 non-null    float64
 2   new_deaths  248 non-null    float64
 3   new_tests   135 non-null    float64
dtypes: float64(3), object(1)
memory usage: 7.9+ KB
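
The Non-Null Count column above shows that new_tests has only 135 non-null values out of 248 rows. As a minimal sketch, missing values can also be counted per column directly. The small data frame below (toy_df) is made up for illustration and is not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Italy dataset
toy_df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "new_cases": [0.0, 5.0, 7.0],
    "new_tests": [np.nan, np.nan, 120.0],
})

# isna() marks missing entries as True; sum() then counts them per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

The same call on italy_covid_data should report the gap between the total row count and the non-null count for each column.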


In [8]:
# The .describe method returns basic summary statistics for the numeric columns: count of non-null values, mean, standard deviation, min/max and quartiles
italy_covid_data.describe()

,new_cases,new_deaths,new_tests
count,248.0,248.0,135.0
mean,1094.818548,143.133065,31699.674074
std,1554.508002,227.105538,11622.209757
min,-148.0,-31.0,7841.0
25%,123.0,3.0,25259.0
50%,342.0,17.0,29545.0
75%,1371.75,175.25,37711.0
max,6557.0,971.0,95273.0


In [9]:
# The .columns attribute holds the column labels
italy_covid_data.columns

Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')

In [10]:
# You can also retrieve the number of rows and columns in the data frame using the .shape attribute
italy_covid_data.shape

(248, 4)

## Retrieving data from a data frame
When retrieving data from a data frame, it is important to have a good grasp of how the data is represented internally. Conceptually, a data frame is a kind of dictionary of lists: the keys are the column names, and the values are lists/arrays containing the data for the respective columns.

Benefits of this conceptual view include:

* All values in a column typically have the same type, so it is more efficient to store them in a single array.
* Retrieving the values for a particular row simply requires extracting the element at a given index from each column array.
* The representation is more compact (column names are recorded only once).
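
The dictionary-of-lists view can be sketched in plain Python. This is only a conceptual illustration, not how pandas is actually implemented; the name get_row is a hypothetical helper, and the values are borrowed from rows 243 to 245 shown earlier:

```python
# Conceptual "dictionary of lists": column names are keys, column data are lists
columns = {
    "date": ["2020-08-30", "2020-08-31", "2020-09-01"],
    "new_cases": [1444.0, 1365.0, 996.0],
}

# Retrieving a column is a single dictionary lookup
new_cases = columns["new_cases"]


def get_row(data, index):
    """Build a row by taking the element at `index` from every column (hypothetical helper)."""
    return {name: values[index] for name, values in data.items()}


row = get_row(columns, 1)
print(new_cases)
print(row)
```

Column access touches one list, while row access has to visit every column, which is why column-wise operations are the natural fit for this layout.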

In [11]:
italy_covid_data["new_cases"]

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
        ...  
243    1444.0
244    1365.0
245     996.0
246     975.0
247    1326.0
Name: new_cases, Length: 248, dtype: float64

In [12]:
# Each column is represented by a data structure called a Series, which is essentially a NumPy array with some extra methods and properties
type(italy_covid_data["new_cases"])

pandas.core.series.Series

In [13]:
# As with arrays, you can retrieve a specific value from a Series using the indexing notation []
italy_covid_data["new_cases"][246]

975.0

In [14]:
# Pandas also provides the .at accessor to directly retrieve the value at a specific row and column
italy_covid_data.at[246, "new_cases"]

975.0

In [15]:
# Additionally, pandas allows the use of dot notation to access columns as properties of the data frame.
# This only works for columns whose names do not contain spaces or special characters
italy_covid_data.new_cases

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
        ...  
243    1444.0
244    1365.0
245     996.0
246     975.0
247    1326.0
Name: new_cases, Length: 248, dtype: float64
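
As a hedged illustration of where dot notation stops working, the sketch below uses a made-up data frame (toy_df) with one column name containing a space and one column named count, which clashes with an existing DataFrame method. Both column names are assumptions chosen for the example and do not appear in the dataset:

```python
import pandas as pd

# Hypothetical toy frame; "new tests" and "count" are illustrative column names
toy_df = pd.DataFrame({
    "new_cases": [1444.0, 1365.0],
    "new tests": [53541.0, 42583.0],
    "count": [1, 2],
})

# For a simple name, dot notation and bracket notation return the same Series
same = toy_df.new_cases.equals(toy_df["new_cases"])

# A name with a space cannot be written as an attribute, so use brackets
tests = toy_df["new tests"]

# toy_df.count is the DataFrame.count method, not the "count" column
is_method = callable(toy_df.count)

# Bracket notation still reaches the column
count_col = toy_df["count"]
```

Bracket notation works for every column name, so it is the safer default in reusable code.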

In [20]:
# You can also pass a list of column names within the indexing notation [] to get a subset of the data frame containing only those columns.
# Example:
case_df=italy_covid_data[["date","new_cases"]]
case_df

,date,new_cases
0,2019-12-31,0.0
1,2020-01-01,0.0
2,2020-01-02,0.0
3,2020-01-03,0.0
4,2020-01-04,0.0
...,...,...
243,2020-08-30,1444.0
244,2020-08-31,1365.0
245,2020-09-01,996.0
246,2020-09-02,975.0


In [21]:
# When you need a full, independent copy of the data frame (here italy_covid_data_copy),
# use the .copy() method
italy_covid_data_copy=italy_covid_data.copy()
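
A minimal sketch of what "full copy" means in practice: changes made to the copy do not affect the original. The data frame below is a made-up two-row example, and the names original and duplicate are illustrative:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data
original = pd.DataFrame({"new_cases": [1444.0, 1365.0]})

# .copy() makes a deep copy by default, so the data is duplicated
duplicate = original.copy()

# Modify only the copy
duplicate.loc[0, "new_cases"] = 0.0

print(original.loc[0, "new_cases"])   # unchanged value from the original
print(duplicate.loc[0, "new_cases"])  # modified value in the copy
```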

In [22]:
# To access a specific row of data in pandas, use the .loc accessor
# Each retrieved row is also a Series object
italy_covid_data.loc[243]

date          2020-08-30
new_cases         1444.0
new_deaths           1.0
new_tests        53541.0
Name: 243, dtype: object
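
The following sketch, on a made-up three-row data frame (toy_df), shows a few common forms of .loc: a single row label, a label slice, and a row/column pair. Note that label slices with .loc include both endpoints, unlike ordinary Python slices:

```python
import pandas as pd

# Hypothetical toy frame using values shown earlier in the notebook output
toy_df = pd.DataFrame({
    "date": ["2020-08-30", "2020-08-31", "2020-09-01"],
    "new_cases": [1444.0, 1365.0, 996.0],
})

# A single label returns the row as a Series indexed by column name
row = toy_df.loc[1]

# A label slice returns a DataFrame and includes both endpoints (rows 0 and 1)
subset = toy_df.loc[0:1]

# A row label plus a column label returns a single value
value = toy_df.loc[1, "new_cases"]

print(type(row))
print(subset.shape)
print(value)
```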