# Exploratory Data Analysis with Python Cookbook Practice

## Chapter One: Generating Summary Statistics

This chapter covers the following topics:
- Analyzing the mean of a dataset
- Checking the median of a dataset
- Identifying the mode of a dataset
- Checking the variance of a dataset
- Identifying the standard deviation of a dataset
- Generating the range of a dataset
- Identifying the percentiles of a dataset
- Checking the quartiles of a dataset
- Analyzing the interquartile range (IQR) of a dataset

### 1. Analyzing the mean of a dataset

In [275]:
import numpy as np
import pandas as pd

In [276]:
import os

# Get the current working directory (this will work in most environments)
base_dir = os.getcwd()  # Current working directory

# Construct the full path to the CSV file (modify the structure if needed)
data_path = os.path.join(base_dir, 'Exploratory-Data-Analysis-with-Python-Cookbook-main', 'Ch1', 'Data', 'covid-data.csv')

# Check if the file exists
if os.path.exists(data_path):
    # Read the CSV file
    covid_data = pd.read_csv(data_path)
    print("The file is available.")  # Confirm that the data was loaded
else:
    print(f"File not found at: {data_path}")


The file is available.


#### Subset the covid_data to include relevant columns only

In [277]:
Sub_covid_data = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]
Sub_covid_data

| | iso_code | continent | location | date | total_cases | new_cases |
|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 |
| 1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 |
| 2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 |
| 3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 |
| 4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 5813 | NGA | Africa | Nigeria | 06/10/2022 | 265741 | 236 |
| 5814 | NGA | Africa | Nigeria | 07/10/2022 | 265741 | 0 |
| 5815 | NGA | Africa | Nigeria | 08/10/2022 | 265816 | 75 |
| 5816 | NGA | Africa | Nigeria | 09/10/2022 | 265816 | 0 |


In [278]:
Sub_covid_data.dtypes

iso_code       object
continent      object
location       object
date           object
total_cases     int64
new_cases       int64
dtype: object

In [279]:
Sub_covid_data.shape

(5818, 6)

#### 2. Get the mean of the new_cases column

In [280]:
data_mean = np.mean(Sub_covid_data["new_cases"])

#### Inspect result

In [281]:
data_mean

8814.365761430045

### Insight

Averaged over all 5,818 country-day records in the dataset, about 8,814 new COVID-19 cases were reported per record, that is, per country per day.
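As a quick cross-check, the mean is simply the sum of the values divided by the number of rows. The sketch below uses a small made-up series (`toy_cases` is hypothetical, not the real COVID data) to show that the manual formula, `np.mean`, and the pandas `.mean()` method agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the new_cases column
toy_cases = pd.Series([5, 0, 0, 236, 75], name="new_cases")

manual_mean = toy_cases.sum() / len(toy_cases)  # sum of values / number of rows
numpy_mean = np.mean(toy_cases)                 # NumPy function, as used in this notebook
pandas_mean = toy_cases.mean()                  # equivalent pandas method

print(manual_mean, numpy_mean, pandas_mean)
```

All three calls should report the same value, so the choice between them is mostly a matter of style.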

### 3. Analyzing the median of a dataset

In [282]:
data_median = np.median(Sub_covid_data["new_cases"])

#### Inspect result

In [283]:
data_median

261.0

### Insight

The median is **261**, far below the mean of about 8,814 new cases per day. A gap this large indicates a heavily right-skewed distribution: a small number of countries or days with exceptionally high case counts pull the mean upward, while the median is unaffected by those extremes.
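To see how a few extreme values separate the mean from the median, here is a minimal sketch on made-up numbers (`toy_cases` is hypothetical and only loosely echoes the magnitudes seen in this notebook). It also calls the pandas `.skew()` method, which returns a positive value for a right-skewed sample.

```python
import pandas as pd

# Hypothetical toy values: mostly small counts plus one very large spike
toy_cases = pd.Series([0, 3, 10, 24, 261, 300, 287149])

toy_mean = toy_cases.mean()      # pulled upward by the single spike
toy_median = toy_cases.median()  # middle value, unaffected by the spike
toy_skew = toy_cases.skew()      # sample skewness; positive means right-skewed

print(f"mean={toy_mean:.1f}, median={toy_median}, skew={toy_skew:.2f}")
```

Removing the spike from `toy_cases` and rerunning is a simple way to watch the mean fall back toward the median.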

### 4. Analyzing the mode of a dataset

#### Identify the mode of the new_cases column using the mode function from scipy.stats

In [284]:
from scipy import stats
data_mode = stats.mode(Sub_covid_data["new_cases"])

Inspect the result, then subset the output to extract the mode:

In [285]:
data_mode


ModeResult(mode=0, count=805)

In [286]:
data_mode[0]

0

#### Identify the mode of the continent column using the mode method

In [287]:
data_mode = Sub_covid_data["continent"].mode()[0]
data_mode

'Europe'

The most frequent value of new_cases is 0, which occurs in 805 records. Many days therefore had no reported cases, likely due to low transmission, underreporting, or data delays. This aligns with the skewed distribution observed earlier. The most common continent in the dataset is Europe, meaning it contributes the most rows, possibly due to more comprehensive or consistent reporting. Consequently, overall trends may be largely shaped by European data.
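The mode can also be read off a frequency table. The sketch below, on made-up data (`toy_continent` and `tied_values` are hypothetical), uses `value_counts()` to show the counts behind the mode and shows why the pandas `.mode()` method returns a Series: more than one value can tie for most frequent.

```python
import pandas as pd

# Hypothetical toy column standing in for the continent column
toy_continent = pd.Series(["Europe", "Asia", "Europe", "Africa", "Asia", "Europe"])

counts = toy_continent.value_counts()  # frequencies, most common first
modes = toy_continent.mode()           # a Series, because ties are possible

# Hypothetical numeric example where two values tie for most frequent
tied_values = pd.Series([0, 0, 1, 1, 2])
tied_modes = tied_values.mode().tolist()

print(counts)
print(modes.tolist())
print(tied_modes)
```

Indexing with `[0]`, as done above for the continent column, keeps only the first mode, so it is worth checking the length of the result when ties are plausible.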

### 5. Checking the variance of a dataset

In [288]:
data_variance = np.var(Sub_covid_data["new_cases"])


Inspect the result:

In [289]:
data_variance

451321915.92810047

The variance is **451321915.92810047**, or roughly **451 million** in squared units (cases squared). A variance this large indicates a very wide spread in daily case counts, with some days or countries reporting extremely high numbers while others reported few or none. Together with the gap between the mean and the median, it suggests that the mean is a poor summary of a typical day.
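One detail worth knowing when comparing results across libraries: `np.var` computes the population variance by default (`ddof=0`, dividing by n), whereas the pandas `.var()` method computes the sample variance by default (`ddof=1`, dividing by n - 1). The sketch below illustrates the difference on made-up values (`toy_cases` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the new_cases column
toy_cases = pd.Series([5, 0, 0, 236, 75])
values = toy_cases.to_numpy()

population_var = np.var(values)         # NumPy default: ddof=0, divide by n
sample_var = toy_cases.var()            # pandas default: ddof=1, divide by n - 1
sample_var_np = np.var(values, ddof=1)  # NumPy matches pandas once ddof=1 is passed

print(population_var, sample_var, sample_var_np)
```

With thousands of rows the two versions differ only slightly, but on small samples the gap can be noticeable.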

### 6. Identifying the standard deviation of a dataset

In [290]:
data_sd = np.std(Sub_covid_data["new_cases"])

In [291]:
data_sd

21244.33844411495

The standard deviation of **21,244** is more than twice the mean of about 8,814, which indicates very high variability around the average. Because case counts cannot fall below 0, this spread comes mainly from extreme highs rather than extreme lows. It supports the earlier observation that the data is highly dispersed and skewed, making the mean less representative of typical values.
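The standard deviation is the square root of the variance, which is why it is expressed in the same units as the data (cases) rather than cases squared. As a small sanity check, the sketch below reuses the figures printed earlier in this notebook and also computes the ratio of standard deviation to mean (often called the coefficient of variation, a measure this notebook does not otherwise use).

```python
import math

# Figures copied from the outputs shown in this notebook
reported_variance = 451321915.92810047
reported_sd = 21244.33844411495
reported_mean = 8814.365761430045

sd_from_variance = math.sqrt(reported_variance)      # should match reported_sd
coefficient_of_variation = reported_sd / reported_mean

print(sd_from_variance)
print(round(coefficient_of_variation, 2))
```

A ratio well above 1 is one way to express that the spread is large relative to the average.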

### 7. Generating the range of a dataset

In [292]:
data_max = np.max(Sub_covid_data["new_cases"])
data_min = np.min(Sub_covid_data["new_cases"])

In [293]:
print(data_max,data_min)

287149 0


In [294]:
data_range = data_max - data_min
data_range

287149

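The data range of 287,149 shows a vast difference between the lowest and highest daily COVID-19 case counts. Because the minimum is 0, the range equals the maximum, so it is determined entirely by the single largest daily count. This points to extreme variability and the presence of outliers, and supports earlier findings of a highly skewed and dispersed dataset.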
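NumPy also offers `np.ptp` ("peak to peak"), which returns the maximum minus the minimum in one call. The sketch below compares it with the manual subtraction on made-up values (`toy_cases` is hypothetical); the Series is converted with `.to_numpy()` first so the call works on a plain array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for the new_cases column
toy_cases = pd.Series([5, 0, 0, 236, 75])
values = toy_cases.to_numpy()

manual_range = values.max() - values.min()  # same approach as the cells above
ptp_range = np.ptp(values)                  # peak-to-peak: max minus min

print(manual_range, ptp_range)
```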

### 8. Identifying the percentiles of a dataset

In [295]:
# Percentiles to compute
percentile_labels = [25, 50, 60, 75]
data_percentiles = np.percentile(Sub_covid_data["new_cases"], percentile_labels)

# Print each percentile next to its value
for label, value in zip(percentile_labels, data_percentiles):
    print(f"{label}th percentile: {value}")


25th percentile: 24.0
50th percentile: 261.0
60th percentile: 591.3999999999996
75th percentile: 3666.0


The percentiles show that most records had relatively low new COVID-19 case counts: half were at or below 261, and 75% were at or below 3,666. The sharp rise from the 60th percentile (about 591) to the 75th (3,666) shows how steeply counts climb in the upper part of the distribution. This is consistent with a right-skewed distribution, where a relatively small share of very high values stretches the upper end of the data.
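The 25th, 50th, and 75th percentiles are also the three quartiles (Q1, Q2, Q3). The sketch below, on made-up values (`toy_cases` is hypothetical), shows that `np.percentile` takes percentages on a 0-100 scale while the pandas `.quantile()` method takes fractions on a 0-1 scale. It also shows that the default method interpolates linearly between neighboring sorted values, which is why a percentile such as 591.4 need not be an actual data point.

```python
import numpy as np
import pandas as pd

# Hypothetical, evenly spaced toy values so the results are easy to verify by hand
toy_cases = pd.Series([0, 10, 20, 30, 40])

np_quartiles = np.percentile(toy_cases, [25, 50, 75])  # percentages, 0-100
pd_quartiles = toy_cases.quantile([0.25, 0.50, 0.75])  # fractions, 0-1

# The 60th percentile falls between 20 and 30 and is linearly interpolated
interpolated = np.percentile(toy_cases, 60)

print(np_quartiles)
print(pd_quartiles.to_numpy())
print(interpolated)
```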

### 9. Analyzing the interquartile range (IQR) of a dataset
The interquartile range (IQR) measures the spread or variability of a dataset. It is simply the distance between the first and third quartiles.

In [296]:
data_iqr = np.percentile(Sub_covid_data["new_cases"], [25, 75])
IQR = data_iqr[1] - data_iqr[0]
IQR

3642.0

An IQR of 3,642 (3,666 minus 24) means the middle 50% of daily counts spans from 24 to 3,666 new cases. Even typical case counts therefore varied widely, although this spread is far smaller than the full range of 287,149, which is driven by the most extreme values.
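A common use of the IQR is to flag unusually high values with the "1.5 × IQR" rule of thumb. This notebook does not apply that rule, so the following is only an illustrative sketch on made-up values (`toy_cases` and the 1.5 multiplier are assumptions chosen for the example).

```python
import numpy as np

# Hypothetical toy values: five modest counts and one extreme value
toy_cases = np.array([0, 10, 20, 30, 40, 1000])

q1, q3 = np.percentile(toy_cases, [25, 75])  # first and third quartiles
iqr = q3 - q1                                # spread of the middle 50%

upper_fence = q3 + 1.5 * iqr                 # conventional cutoff for high outliers
flagged = toy_cases[toy_cases > upper_fence]

print(q1, q3, iqr, upper_fence)
print(flagged)
```

Because the quartiles ignore the tails, the single extreme value has no effect on the IQR itself, which is what makes it a robust measure of spread for skewed data.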

## Chapter Two: Preparing Data for EDA

This chapter covers the following topics:
- Grouping data
- Appending data
- Concatenating data
- Merging data
- Sorting data
- Categorizing data
- Removing duplicate data
- Dropping data rows and columns
- Replacing data
- Changing a data format
- Dealing with missing values