# Exploratory Data Analysis with Python Cookbook Practice

## Chapter One: Generating Summary Statistics

This chapter covers the following topics:
- Analyzing the mean of a dataset
- Checking the median of a dataset
- Identifying the mode of a dataset
- Checking the variance of a dataset
- Identifying the standard deviation of a dataset
- Generating the range of a dataset
- Identifying the percentiles of a dataset
- Checking the quartiles of a dataset
- Analyzing the interquartile range (IQR) of a dataset

### 1. Analysing the mean of a dataset

In [172]:
import numpy as np
import pandas as pd

In [173]:
import os

# Get the current working directory (this will work in most environments)
base_dir = os.getcwd()  # Current working directory

# Construct the full path to the CSV file (modify the structure if needed)
data_path = os.path.join(base_dir, 'Exploratory-Data-Analysis-with-Python-Cookbook-main', 'Ch1', 'Data', 'covid-data.csv')

# Check if the file exists
if os.path.exists(data_path):
    # Read the CSV file
    covid_data = pd.read_csv(data_path)
    print("The file is available.")
else:
    print(f"File not found at: {data_path}")


The file is available.


In [174]:
covid_data.shape

(5818, 67)

In [175]:
covid_data.dtypes

iso_code                                    object
continent                                   object
location                                    object
date                                        object
total_cases                                  int64
                                            ...   
human_development_index                    float64
excess_mortality_cumulative_absolute       float64
excess_mortality_cumulative                float64
excess_mortality                           float64
excess_mortality_cumulative_per_million    float64
Length: 67, dtype: object

#### Subset the covid_data to include relevant columns only

In [176]:
Sub_covid_data = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]
Sub_covid_data

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases
0,AFG,Asia,Afghanistan,24/02/2020,5,5
1,AFG,Asia,Afghanistan,25/02/2020,5,0
2,AFG,Asia,Afghanistan,26/02/2020,5,0
3,AFG,Asia,Afghanistan,27/02/2020,5,0
4,AFG,Asia,Afghanistan,28/02/2020,5,0
...,...,...,...,...,...,...
5813,NGA,Africa,Nigeria,06/10/2022,265741,236
5814,NGA,Africa,Nigeria,07/10/2022,265741,0
5815,NGA,Africa,Nigeria,08/10/2022,265816,75
5816,NGA,Africa,Nigeria,09/10/2022,265816,0


In [177]:
Sub_covid_data.dtypes

iso_code       object
continent      object
location       object
date           object
total_cases     int64
new_cases       int64
dtype: object

In [178]:
Sub_covid_data.shape

(5818, 6)

#### 2. Get the mean of the new_cases column

In [179]:
data_mean = np.mean(Sub_covid_data["new_cases"])

#### Inspect result

In [180]:
data_mean

8814.365761430045

### Insight

On average, each record in the dataset (one country on one day) reports approximately 8,814 new COVID-19 cases.

### 3. Analysing the median of a dataset

In [181]:
data_median = np.median(Sub_covid_data["new_cases"])

#### Inspect result

In [182]:
data_median

261.0

### Insight

The median is **261**, while the mean number of new daily COVID-19 cases is 8,814. This large gap indicates a heavily right-skewed distribution, likely caused by a small number of countries or days with exceptionally high case counts that pull the mean upward.
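
To illustrate why the mean and median diverge under skew, here is a minimal sketch on made-up numbers. The `toy_cases` values are hypothetical and are not taken from covid-data.csv. A single extreme value drags the mean far above the median, which is the same pattern seen in `new_cases`.

```python
import numpy as np

# Hypothetical daily counts: mostly small values plus one extreme day.
toy_cases = np.array([0, 5, 12, 20, 30, 45, 60, 50_000])

toy_mean = np.mean(toy_cases)      # pulled upward by the single extreme value
toy_median = np.median(toy_cases)  # midpoint of the sorted values, barely affected

print(f"mean:   {toy_mean:.1f}")
print(f"median: {toy_median:.1f}")
```

Removing the last value from `toy_cases` would change the mean drastically but move the median only slightly, which is why the median is often preferred for skewed data.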

### 4. Analysing the mode of a dataset

#### Identify the mode of the new_cases column using the mode method

In [183]:
from scipy import stats
data_mode = stats.mode(Sub_covid_data["new_cases"])

Inspect the result, then index into it to extract the mode:

In [184]:
data_mode


ModeResult(mode=0, count=805)

In [185]:
data_mode[0]

0

#### Identify the mode of the continent column using the mode method

In [186]:
data_mode = Sub_covid_data["continent"].mode()[0]
data_mode

'Europe'

The most frequent value of new_cases is 0, which occurs 805 times. Many country-day records therefore report no new cases, likely due to low transmission, underreporting, or data delays. This is consistent with the skewed distribution observed earlier. The most common continent in the dataset is Europe, meaning it contributes the most rows, possibly because of more comprehensive or consistent reporting. Overall trends may therefore be shaped largely by European data.
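
As an alternative view of the mode, `value_counts` in pandas lists every distinct value with its frequency, so the mode is simply the first entry. The sketch below uses a hypothetical `toy` series rather than the real `new_cases` column.

```python
import pandas as pd

# Hypothetical values; 0 occurs most often.
toy = pd.Series([0, 0, 3, 7, 0, 3, 12])

# value_counts sorts by frequency, highest first.
counts = toy.value_counts()
most_common_value = counts.index[0]
most_common_count = counts.iloc[0]

# Series.mode returns a Series because several values can tie for the mode.
modes = toy.mode()

print(most_common_value, most_common_count)
print(modes.tolist())
```

Looking at the full `counts` table also shows how close the runner-up values are, which a single mode value hides.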

### 5. Checking the variance of a dataset

In [187]:
data_variance = np.var(Sub_covid_data["new_cases"])


Inspect the result:

In [188]:
data_variance

451321915.92810047

The variance is **451321915.92810047**, roughly **451 million** (in squared case counts). Such a large value indicates a very wide spread in daily case counts: some days or countries report extremely high numbers while others report few or none. Because variance squares each deviation from the mean, it is dominated by those extreme values, which is one more reason to treat the mean with caution as a summary of this data.
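
One detail worth keeping in mind when comparing tools: NumPy's `np.var` defaults to the population variance (`ddof=0`), while the pandas `Series.var` method defaults to the sample variance (`ddof=1`). The sketch below shows the difference on a small hypothetical series; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
toy = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

population_var = np.var(toy.to_numpy())  # divides by n      (ddof=0)
sample_var = toy.var()                   # divides by n - 1  (ddof=1)

print(population_var)
print(sample_var)

# Passing ddof explicitly makes the two libraries agree.
print(np.isclose(toy.var(ddof=0), population_var))
```

With thousands of rows the two results differ only slightly, but stating `ddof` explicitly avoids ambiguity.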

### 6. Identifying the standard deviation of a dataset

In [189]:
data_sd = np.std(Sub_covid_data["new_cases"])

In [190]:
data_sd

21244.33844411495

The standard deviation of **21,244** for new COVID-19 cases indicates a high level of variability around the average. This means daily case counts fluctuate significantly, with frequent extreme highs and lows. It supports the earlier observation that the data is highly dispersed and skewed, making the mean less representative of typical values.
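
As a quick arithmetic check, the standard deviation should equal the square root of the variance. The sketch below reuses the values printed in the cells above; the ratio to the mean (the coefficient of variation) is an extra comparison added here for illustration and is not part of the original recipe.

```python
import math

# Values copied from the notebook outputs above.
reported_mean = 8814.365761430045
reported_variance = 451321915.92810047
reported_sd = 21244.33844411495

# The standard deviation is the square root of the variance.
sd_from_variance = math.sqrt(reported_variance)
print(math.isclose(sd_from_variance, reported_sd, rel_tol=1e-9))

# Ratio of standard deviation to mean (coefficient of variation).
cv = reported_sd / reported_mean
print(f"standard deviation is {cv:.2f} times the mean")
```

Unlike the variance, the standard deviation is in the same units as the data (cases), which makes this comparison with the mean meaningful.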

### 7. Generating the range of a dataset

In [191]:
data_max = np.max(Sub_covid_data["new_cases"])
data_min = np.min(Sub_covid_data["new_cases"])

In [192]:
print("Max:", data_max, "\nMin:", data_min)

Max: 287149 
Min: 0


In [193]:
data_range = data_max - data_min
data_range

287149

The data range of 287,149 shows a vast difference between the lowest and highest daily COVID-19 case counts, indicating extreme variability and the presence of outliers. This supports earlier findings of a highly skewed and dispersed dataset.
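
The range can also be computed in one call with NumPy's `np.ptp` ("peak to peak"). This is a minimal sketch on a hypothetical array, not the real `new_cases` column.

```python
import numpy as np

# Hypothetical daily counts.
toy_cases = np.array([0, 5, 12, 20, 287])

# np.ptp returns max - min in a single call.
toy_range = np.ptp(toy_cases)

print(toy_range)
print(toy_range == toy_cases.max() - toy_cases.min())
```

Because the range depends only on the two most extreme values, a single outlier determines it entirely.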

### 8. Identifying the percentiles of a dataset

In [194]:
import numpy as np

# Calculate percentiles
data_percentiles = np.percentile(Sub_covid_data["new_cases"], [25, 50, 60, 75])

# Print results
percentile_labels = [25, 50, 60, 75]
for label, value in zip(percentile_labels, data_percentiles):
    print(f"{label}th percentile: {value}")


25th percentile: 24.0
50th percentile: 261.0
60th percentile: 591.3999999999996
75th percentile: 3666.0


The percentiles show that most records had relatively low new COVID-19 case counts, with 75% of them at or below 3,666 cases. The jump from the 60th percentile (about 591) to the 75th percentile (3,666) is steep, suggesting that a minority of records have much higher counts than the rest. This is consistent with a right-skewed distribution, where a few extreme values stretch the upper end of the data.
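
The quartiles listed in the chapter topics are the 25th, 50th, and 75th percentiles. The sketch below computes them in two ways on a hypothetical series. Note the different scales: `np.percentile` expects percentages (0 to 100), while the pandas `quantile` method expects fractions (0 to 1).

```python
import numpy as np
import pandas as pd

# Hypothetical values, evenly spaced so the quartiles are easy to verify.
toy_cases = pd.Series([10, 20, 30, 40, 50])

# NumPy takes percentages on a 0-100 scale.
q1_np, q2_np, q3_np = np.percentile(toy_cases.to_numpy(), [25, 50, 75])

# pandas takes fractions on a 0-1 scale.
quartiles_pd = toy_cases.quantile([0.25, 0.50, 0.75])

print(q1_np, q2_np, q3_np)
print(quartiles_pd.tolist())
```

Both calls use linear interpolation by default, so they should agree on the same data.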

### 9. Analyzing the interquartile range (IQR) of a dataset

The interquartile range (IQR) measures the spread or variability of a dataset. It is simply the distance between the first and third quartiles.

In [195]:
data_iqr = np.percentile(Sub_covid_data["new_cases"], [25, 75])
IQR = data_iqr[1] - data_iqr[0]
IQR

3642.0

An IQR of 3,642 shows significant variation in daily new COVID-19 cases within the middle 50% of the data. This indicates that even typical case counts fluctuated widely, reflecting high variability and reinforcing the presence of inconsistent daily trends in the dataset.
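
One common use of the IQR is flagging potential outliers with the 1.5 × IQR rule. That rule is a widely used convention rather than something this recipe prescribes, so treat the sketch below as an assumption-laden illustration. It reuses the quartiles and maximum printed in the cells above.

```python
# Values copied from the notebook outputs above.
q1, q3 = 24.0, 3666.0
reported_max = 287149

iqr = q3 - q1

# Conventional 1.5 * IQR fences (an assumed rule of thumb, not from the recipe).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: {iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"maximum exceeds upper fence: {reported_max > upper_fence}")
```

Since case counts cannot be negative, only the upper fence is informative here.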

## Chapter Two: Preparing Data for EDA

This chapter covers the following topics:
- Grouping data
- Appending data
- Concatenating data
- Merging data
- Sorting data
- Categorizing data
- Removing duplicate data
- Dropping data rows and columns
- Replacing data
- Changing a data format
- Dealing with missing values

### 1. Load all the datasets at once and print their shapes

In [196]:
import os
import pandas as pd

# Define base directory and data folder
base_dir = os.getcwd()
data_folder = os.path.join(base_dir, 'Data_ChPt2')

# List of CSV filenames to read.
# Note: 'marketing_campaign_merge1.csv' appears twice, so it is read twice and the
# second read overwrites the first entry in the dictionary (see the output below).
file_names = ['marketing_campaign.csv', 'marketing_campaign_append1.csv', 'marketing_campaign_append2.csv',
              'marketing_campaign_concat1.csv', 'marketing_campaign_concat2.csv',
              'marketing_campaign_merge1.csv', 'marketing_campaign_merge1.csv']
# Dictionary holding each DataFrame, keyed by filename
dataframes = {}

# Load each file individually
for file_name in file_names:
    file_path = os.path.join(data_folder, file_name)
    
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        dataframes[file_name] = df
        print(f"Loaded {file_name} with shape: {df.shape}")
    else:
        print(f"File not found: {file_path}")


Loaded marketing_campaign.csv with shape: (2240, 30)
Loaded marketing_campaign_append1.csv with shape: (500, 29)
Loaded marketing_campaign_append2.csv with shape: (500, 29)
Loaded marketing_campaign_concat1.csv with shape: (2240, 5)
Loaded marketing_campaign_concat2.csv with shape: (2240, 5)
Loaded marketing_campaign_merge1.csv with shape: (2240, 3)
Loaded marketing_campaign_merge1.csv with shape: (2240, 3)


### 1.1. Inspect the marketing_campaign dataset. Check the first few rows and use transpose (T) to show more information. 

In [197]:
# Access the specific DataFrame
marketing_data = dataframes['marketing_campaign.csv']

marketing_data.head(5).T

Unnamed: 0,0,1,2,3,4
Unnamed: 0,0,1,2,3,4
ID,5524,2174,4141,6182,5324
Year_Birth,1957,1954,1965,1984,1981
Education,Graduation,Graduation,Graduation,Graduation,PhD
Marital_Status,Single,Single,Together,Together,Married
Income,58138.0,46344.0,71613.0,26646.0,58293.0
Kidhome,0,1,0,1,1
Teenhome,0,1,0,0,0
Dt_Customer,04-09-2012,08-03-2014,21-08-2013,10-02-2014,19-01-2014
Recency,58,38,26,26,94


### 1.2. Subset the dataframe to include only relevant columns. Also, check the data types as well as the number of columns and rows:

In [198]:
marketing_data = marketing_data[['ID','Year_Birth','Education','Marital_Status','Income','Kidhome', 'Teenhome', 'Dt_Customer','Recency','NumStorePurchases','NumWebVisitsMonth']]

In [199]:
marketing_data.head(3).T

Unnamed: 0,0,1,2
ID,5524,2174,4141
Year_Birth,1957,1954,1965
Education,Graduation,Graduation,Graduation
Marital_Status,Single,Single,Together
Income,58138.0,46344.0,71613.0
Kidhome,0,1,0
Teenhome,0,1,0
Dt_Customer,04-09-2012,08-03-2014,21-08-2013
Recency,58,38,26
NumStorePurchases,4,2,10


In [200]:
marketing_data.dtypes

ID                     int64
Year_Birth             int64
Education             object
Marital_Status        object
Income               float64
Kidhome                int64
Teenhome               int64
Dt_Customer           object
Recency                int64
NumStorePurchases      int64
NumWebVisitsMonth      int64
dtype: object

In [201]:
marketing_data.shape

(2240, 11)

## 2. Groupby Method

### 2.1. Use the **groupby method** in pandas to get the average number of store purchases of customers based on the number of kids at home:

In [202]:
marketing_data.groupby('Kidhome')['NumStorePurchases'].mean()

Kidhome
0    7.217324
1    3.863181
2    3.437500
Name: NumStorePurchases, dtype: float64

### 2.2. Use the **groupby method** in pandas to get the average number of store purchases of customers based on marital status:

In [203]:
marketing_data.groupby('Marital_Status')['NumStorePurchases'].mean()

Marital_Status
Absurd      6.500000
Alone       4.000000
Divorced    5.818966
Married     5.850694
Single      5.639583
Together    5.736207
Widow       6.415584
YOLO        6.000000
Name: NumStorePurchases, dtype: float64
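
A group mean says nothing about how many rows it is based on, and small groups can produce unreliable averages. A minimal sketch, using a hypothetical `toy` frame rather than the real marketing data, shows how `agg` can report the mean and the group size together.

```python
import pandas as pd

# Hypothetical customers; the "YOLO" group has a single row.
toy = pd.DataFrame({
    "Marital_Status": ["Single", "Single", "Married", "Married", "Married", "YOLO"],
    "NumStorePurchases": [4, 6, 5, 7, 9, 6],
})

# agg accepts a list of aggregation names and returns one column per statistic.
summary = toy.groupby("Marital_Status")["NumStorePurchases"].agg(["mean", "count"])

print(summary)
```

Running the same call on `marketing_data` would show how many customers sit behind each of the averages above.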

## 3. Appending data

### 3.1 Load the .csv files and subset the dataframes to include relevant columns

In [204]:
marketing_sample1 = dataframes['marketing_campaign_append1.csv']
marketing_sample1 = marketing_sample1[['ID','Year_Birth','Education','Marital_Status','Income','Kidhome', 'Teenhome', 'Dt_Customer','Recency','NumStorePurchases','NumWebVisitsMonth']]

In [205]:
marketing_sample2 = dataframes['marketing_campaign_append2.csv']
marketing_sample2 = marketing_sample2[['ID','Year_Birth','Education','Marital_Status','Income','Kidhome', 'Teenhome', 'Dt_Customer','Recency','NumStorePurchases','NumWebVisitsMonth']]

### 3.2. Take a look at the two datasets. Check the first few rows and use transpose (T) to show more information

In [206]:
marketing_sample1.head(5).T

Unnamed: 0,0,1,2,3,4
ID,5524,2174,4141,6182,5324
Year_Birth,1957,1954,1965,1984,1981
Education,Graduation,Graduation,Graduation,Graduation,PhD
Marital_Status,Single,Single,Together,Together,Married
Income,58138.0,46344.0,71613.0,26646.0,58293.0
Kidhome,0,1,0,1,1
Teenhome,0,1,0,0,0
Dt_Customer,04/09/2012,08/03/2014,21/08/2013,10/02/2014,19/01/2014
Recency,58,38,26,26,94
NumStorePurchases,4,2,10,4,6


In [207]:
marketing_sample2.head(5).T

Unnamed: 0,0,1,2,3,4
ID,9135,466,9135,10623,8151
Year_Birth,1950,1944,1950,1961,1990
Education,Graduation,Graduation,Graduation,Master,Basic
Marital_Status,Together,Married,Together,Together,Married
Income,27203,65275,27203,48330,24279
Kidhome,1,0,1,0,0
Teenhome,1,0,1,1,0
Dt_Customer,06/08/2012,03/04/2013,06/08/2012,15/11/2013,29/12/2012
Recency,92,9,92,2,6
NumStorePurchases,2,13,2,3,3


### 3.3. Check the data types as well as the number of columns and rows

In [208]:
marketing_sample1.dtypes


ID                     int64
Year_Birth             int64
Education             object
Marital_Status        object
Income               float64
Kidhome                int64
Teenhome               int64
Dt_Customer           object
Recency                int64
NumStorePurchases      int64
NumWebVisitsMonth      int64
dtype: object

In [209]:
marketing_sample1.shape

(500, 11)

In [210]:
marketing_sample2.dtypes

ID                    int64
Year_Birth            int64
Education            object
Marital_Status       object
Income                int64
Kidhome               int64
Teenhome              int64
Dt_Customer          object
Recency               int64
NumStorePurchases     int64
NumWebVisitsMonth     int64
dtype: object

In [211]:
marketing_sample2.shape

(500, 11)

### 3.4. Append the datasets. Use the concat method from the pandas library to append the data

In [212]:
appended_datasets = pd.concat([marketing_sample1, marketing_sample2])

In [213]:
appended_datasets.head(3).T

Unnamed: 0,0,1,2
ID,5524,2174,4141
Year_Birth,1957,1954,1965
Education,Graduation,Graduation,Graduation
Marital_Status,Single,Single,Together
Income,58138.0,46344.0,71613.0
Kidhome,0,1,0
Teenhome,0,1,0
Dt_Customer,04/09/2012,08/03/2014,21/08/2013
Recency,58,38,26
NumStorePurchases,4,2,10


In [214]:
appended_datasets.shape

(1000, 11)
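
By default, `pd.concat` keeps each input's original row labels, so the appended result presumably contains the labels 0 to 499 twice. Passing `ignore_index=True` renumbers the rows. The sketch below demonstrates this on two small hypothetical frames (`first` and `second`), and also shows that an integer column appended to a float column is upcast to float.

```python
import pandas as pd

# Hypothetical frames; Income is float in one and int in the other.
first = pd.DataFrame({"ID": [1, 2], "Income": [100.0, 200.0]})
second = pd.DataFrame({"ID": [3, 4], "Income": [300, 400]})

stacked = pd.concat([first, second])                        # keeps original labels
renumbered = pd.concat([first, second], ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(renumbered["Income"].dtype)
```

Duplicate labels are harmless for many operations, but they can cause surprises with label-based lookups such as `.loc[0]`, which would return two rows.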

### 4. Concatenating data

### 4.1 Load the .csv files into a dataframe 

In [216]:
marketing_sample1 = dataframes['marketing_campaign_concat1.csv']
marketing_sample2 = dataframes['marketing_campaign_concat2.csv']

### 4.2. Take a look at the two datasets. Check the first few rows and/or use transpose (T) to show more information

In [220]:
marketing_sample1.head(3)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income
0,5524,1957,Graduation,Single,58138.0
1,2174,1954,Graduation,Single,46344.0
2,4141,1965,Graduation,Together,71613.0


In [219]:
marketing_sample2.head(3)

Unnamed: 0,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth
0,3,8,10,4,7
1,2,1,1,2,5
2,1,8,2,10,4


### 4.3. Check the data types as well as the number of columns and rows

In [222]:
marketing_sample1.dtypes

ID                  int64
Year_Birth          int64
Education          object
Marital_Status     object
Income            float64
dtype: object

In [224]:
marketing_sample1.shape

(2240, 5)

In [223]:
marketing_sample2.dtypes

NumDealsPurchases      int64
NumWebPurchases        int64
NumCatalogPurchases    int64
NumStorePurchases      int64
NumWebVisitsMonth      int64
dtype: object

In [225]:
marketing_sample2.shape

(2240, 5)

### 4.4. Concatenate the datasets. Use the concat method from the pandas library to concatenate the data
Note the additional argument for the axis parameter. The value 1 indicates that the axis refers to columns. The default value is 0, which refers to rows and is what we relied on when appending datasets.

In [226]:
concatenated_data = pd.concat([marketing_sample1, marketing_sample2], axis=1)

### 4.5. Inspect the shape of the result and the first few rows

In [227]:
concatenated_data.shape

(2240, 10)

In [228]:
concatenated_data.head(3)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth
0,5524,1957,Graduation,Single,58138.0,3,8,10,4,7
1,2174,1954,Graduation,Single,46344.0,2,1,1,2,5
2,4141,1965,Graduation,Together,71613.0,1,8,2,10,4


In [229]:
concatenated_data.head(3).T

Unnamed: 0,0,1,2
ID,5524,2174,4141
Year_Birth,1957,1954,1965
Education,Graduation,Graduation,Graduation
Marital_Status,Single,Single,Together
Income,58138.0,46344.0,71613.0
NumDealsPurchases,3,2,1
NumWebPurchases,8,1,8
NumCatalogPurchases,10,1,2
NumStorePurchases,4,2,10
NumWebVisitsMonth,7,5,4
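
Column-wise concatenation works cleanly here because both frames share the same row labels. `pd.concat` with `axis=1` aligns rows by index label, not by position, so mismatched labels produce missing values. The sketch below uses two hypothetical frames, `left` and `right`, to show the effect and one possible remedy.

```python
import pandas as pd

# Hypothetical frames whose row labels only partly overlap.
left = pd.DataFrame({"ID": [1, 2, 3]})                               # labels 0, 1, 2
right = pd.DataFrame({"NumWebPurchases": [8, 1, 8]}, index=[0, 1, 5])

# Rows are matched on label; unmatched labels are filled with NaN.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# Resetting both indexes first pairs the rows by position instead.
by_position = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)
print(by_position)
```

Checking that the shape of the result matches the inputs, as done above with `concatenated_data.shape`, is a quick way to catch this kind of misalignment.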


## 5. Merging data

Merging sounds a bit like concatenating datasets; however, it is quite different. To merge datasets, we need a common field in both datasets on which to perform the merge.
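
As a rough illustration of merging on a common field, the sketch below joins two small hypothetical frames on an `ID` column. The frames `profiles` and `purchases` are made up for this example and are not the notebook's merge files.

```python
import pandas as pd

# Hypothetical frames sharing an "ID" column; rows are in different orders
# and one ID in each frame has no partner in the other.
profiles = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Education": ["Graduation", "Graduation", "Graduation"],
})
purchases = pd.DataFrame({
    "ID": [4141, 5524, 9999],
    "NumStorePurchases": [10, 4, 1],
})

# An inner merge keeps only IDs present in both frames and matches rows by
# the ID value, regardless of row position.
merged = pd.merge(profiles, purchases, on="ID", how="inner")

print(merged)
```

Unlike concatenation, the row order and row labels of the inputs do not need to line up; only the values in the common field matter.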