# Overview
This notebook performs Exploratory Data Analysis (EDA) on the dataset [List of Highest Grossing Concert Tours by Women](https://en.wikipedia.org/wiki/List_of_highest-grossing_concert_tours_by_women), which was scraped from Wikipedia and made available on Kaggle as [Dirty Dataset to Practice Data Cleaning](https://www.kaggle.com/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning).
The dataset has 20 entries, one per tour. Some artists appear more than once.

In [69]:
!pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable


In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [71]:
df = pd.read_csv("grossing_movies.csv")

Preview the first 10 rows of the DataFrame.

In [72]:
df.head(10)

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1]
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3]
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6]
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7]
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8]
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9]
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11]
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927",[12]
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13]
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14]


In [73]:
df.columns

Index(['Rank', 'Peak', 'All Time Peak', 'Actual gross',
       'Adjusted gross (in 2022 dollars)', 'Artist', 'Tour title', 'Year(s)',
       'Shows', 'Average gross', 'Ref.'],
      dtype='object')

## Data Cleaning and Preprocessing

### Selecting relevant columns
The only columns we want to work with are:

 - `Rank`: Ranking of the tour by gross revenue
 - `Actual gross`: The amount (in USD) generated by the tour
 - `Adjusted gross (in 2022 dollars)`: The actual gross, adjusted for inflation to 2022 dollars
 - `Artist`: The female artist/band that performed the tour
 - `Tour title`: The name of the tour, as used in campaigns and promotional material
 - `Year(s)`: The year(s) in which the tour was conducted
 - `Shows`: The number of shows within the tour
 - `Average gross`: The average gross per show
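
As a quick sanity check on the `Average gross` definition, the sketch below divides the actual gross by the number of shows for the first three rows of the table and compares the result with the reported average. The figures are copied by hand from the preview above; the variable names are illustrative only.

```python
# (actual gross, shows, reported average gross), copied from the first three rows
rows = [
    (780_000_000, 56, 13_928_571),  # The Eras Tour
    (579_800_000, 56, 10_353_571),  # Renaissance World Tour
    (411_000_000, 85, 4_835_294),   # Sticky & Sweet Tour
]

for gross, shows, reported in rows:
    computed = gross / shows
    # The reported value appears to be the quotient with cents dropped
    print(f"computed {computed:,.2f} vs reported {reported:,}")
```

For these rows the computed and reported values agree to within a dollar, which suggests that `Average gross` is simply `Actual gross` divided by `Shows`.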

We rename these columns so the DataFrame is easier to work with, and drop the columns we do not need.

In [74]:
# Rename
df.rename(
    columns={
        "Rank": "rank",
        "Adjusted gross (in 2022 dollars)": "adj_gross",
        "Actual gross": "gross",
        "Artist": "artist",
        "Tour title": "tour_title",
        "Year(s)": "years",
        "Shows": "shows",
        "Average gross": "avg_gross",
    },
    inplace=True,
)

# Drop unnecessary columns
columns = [
    "rank",
    "gross",
    "adj_gross",
    "artist",
    "tour_title",
    "years",
    "shows",
    "avg_gross",
]
df = df.loc[:, columns]
df.head(10)

Unnamed: 0,rank,gross,adj_gross,artist,tour_title,years,shows,avg_gross
0,1,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571"
1,2,"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571"
2,3,"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294"
3,4,"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795"
4,5,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173"
5,6,"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709"
6,7,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405"
7,7,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927"
8,9,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215"
9,10,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882"


The gross columns are stored as strings (dtype `object`), so we need to convert them to numeric values.
We use a regular expression (RegEx) to match and extract dollar values from a string. It matches a dollar sign followed by a number that may include thousands separators (`,`) and an optional decimal part (`.xx`). The number starts with one to three digits, followed by any number of comma-separated groups of three digits, and may end with a decimal point and one or more digits.

The RegEx query is `r"\$(\d{1,3}(?:,\d{3})*(?:\.\d+)?)"`:

 - `\$`: Matches a literal dollar sign in the text.
 - `(`: Begins a capturing group.
 - `\d{1,3}`: Matches one to three digits.
 - `(?:,\d{3})*`: A non-capturing group `(?: ... )` followed by `*`, meaning zero or more occurrences of a comma followed by exactly three digits (`\d{3}`). This matches the thousands separators in numbers like 1,000 or 1,000,000.
 - `(?:\.\d+)?`: A non-capturing group followed by `?`, meaning zero or one occurrence of a decimal point followed by one or more digits (`\d+`). This matches the decimal part of the number, if present.
 - `)`: Ends the capturing group.
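
To see the pattern in isolation, here is a minimal sketch using Python's built-in `re` module rather than pandas. The helper name `parse_dollars` and the sample strings are assumptions made for illustration; they are not part of the notebook.

```python
import re

DOLLAR_VALUE_REGEX = r"\$(\d{1,3}(?:,\d{3})*(?:\.\d+)?)"


def parse_dollars(text):
    """Return the first dollar amount found in `text` as a float, or None."""
    match = re.search(DOLLAR_VALUE_REGEX, text)
    if match is None:
        return None
    # Group 1 holds the number without the dollar sign; drop the commas
    return float(match.group(1).replace(",", ""))


print(parse_dollars("$780,000,000"))       # plain value
print(parse_dollars("$13,928,571[1]"))     # trailing footnote marker is ignored
print(parse_dollars("$1,234.50 approx."))  # decimal part is kept
print(parse_dollars("no amount here"))     # no match
```

One caveat worth knowing: because the pattern expects comma-grouped digits, an ungrouped value such as `"$1234"` matches only its first three digits (`123`). This does not affect the values shown in this dataset, which all use thousands separators.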


In [87]:
df['gross']  # executed after the conversion in cell 76 (see execution count), so values are already floats

0     780000000.0
1     579800000.0
2     411000000.0
3     397300000.0
4     345675146.0
5     305158363.0
6     280000000.0
7     257600000.0
8     256084556.0
9     250400000.0
10    229100000.0
11    227400000.0
12    204000000.0
13    200000000.0
14    194000000.0
15    184000000.0
16    170000000.0
17    169800000.0
18    167700000.0
19    150000000.0
Name: gross, dtype: float64

In [75]:
df.dtypes

rank           int64
gross         object
adj_gross     object
artist        object
tour_title    object
years         object
shows          int64
avg_gross     object
dtype: object

In [76]:
DOLLAR_VALUE_REGEX = r"\$(\d{1,3}(?:,\d{3})*(?:\.\d+)?)"

# Converting from string to numbers
for col in ["gross", "adj_gross", "avg_gross"]:
    df[col] = (
        df[col]
        .str.extract(DOLLAR_VALUE_REGEX, expand=False)
        .str.replace(",", "")
        .astype(float)
    )

We can also perform a bit of feature engineering to create a new feature, `no_years`: the difference between a tour's last and first year, with single-year tours counted as 1.

In [77]:
# Getting a visual of all the unique year values
df["years"].unique()

array(['2023–2024', '2023', '2008–2009', '2018–2019', '2018', '2012',
       '2016', '2015', '2013–2014', '2009–2011', '2014–2015', '2002–2005',
       '2006', '2012–2013', '2015–2016', '2016–2017'], dtype=object)

In [78]:
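def calc_no_years(x):
    """
    Return the span of a `years` value such as "2008–2009" or "2023".
    Ranges give the last year minus the first year; single years count as 1.
    Note that "2023–2024" and "2023" therefore both yield 1.
    """
    # The separator in the data is an en dash (–), not a hyphen (-)
    years = x.split("–")

    if len(years) == 2:
        oldest, latest = int(years[0]), int(years[1])
        return latest - oldest

    return 1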


df["no_years"] = df["years"].apply(calc_no_years)

In [79]:
df["no_years"].unique()

array([1, 2, 3])
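
The sketch below shows how `calc_no_years` treats a few of the values listed above, next to a hypothetical alternative, `calc_calendar_years`, that counts calendar years inclusively. The alternative is not used anywhere in this notebook; it is included only to make the difference between the two definitions concrete.

```python
def calc_no_years(x):
    """Notebook definition: last year minus first year, single years count as 1."""
    years = x.split("–")  # en dash, as in the data
    if len(years) == 2:
        return int(years[1]) - int(years[0])
    return 1


def calc_calendar_years(x):
    """Hypothetical alternative: number of calendar years touched, inclusive."""
    years = [int(y) for y in x.split("–")]
    return years[-1] - years[0] + 1


for value in ["2023", "2023–2024", "2009–2011", "2002–2005"]:
    print(value, calc_no_years(value), calc_calendar_years(value))
```

With the notebook's definition, `"2023"` and `"2023–2024"` both map to 1, which is why `no_years` only takes the values 1, 2 and 3. Under the inclusive count the same inputs would give 1 and 2.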

## Exploratory Data Analysis

In [80]:
df.describe()

Unnamed: 0,rank,gross,adj_gross,shows,avg_gross,no_years
count,20.0,20.0,20.0,20.0,20.0,20.0
mean,10.45,287950900.0,343878100.0,110.0,3726571.0,1.15
std,5.942488,156328400.0,151462700.0,66.507617,3393340.0,0.48936
min,1.0,150000000.0,185423100.0,41.0,615385.0,1.0
25%,5.75,191500000.0,245755700.0,59.0,1647508.0,1.0
50%,10.5,239750000.0,297488900.0,87.0,2342100.0,1.0
75%,15.25,315287600.0,392445100.0,134.5,4933024.0,1.0
max,20.0,780000000.0,780000000.0,325.0,13928570.0,3.0


Using `describe`, we see that:
 - There are 20 rows in the dataset.
 - The average (mean) adjusted gross of a tour is approx. *\$ 340 million*, with a standard deviation of approx. *\$ 150 million*.
 - The average tour had *110* shows, with each show grossing an average of approx. *\$ 3.7 million*.
 - Most tours have a `no_years` value of 1: the median and the 75th percentile are both 1, and the mean is 1.15.
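
As a cross-check on the `describe` output, the `gross` statistics can be reproduced with the standard library alone. This is a minimal sketch: the 20 values are copied by hand from the `df['gross']` output shown earlier, and `statistics.stdev` is used because it computes the sample standard deviation (dividing by n - 1), which is also the pandas default.

```python
import statistics

# The 20 converted `gross` values, copied from the df['gross'] output
gross = [
    780000000.0, 579800000.0, 411000000.0, 397300000.0, 345675146.0,
    305158363.0, 280000000.0, 257600000.0, 256084556.0, 250400000.0,
    229100000.0, 227400000.0, 204000000.0, 200000000.0, 194000000.0,
    184000000.0, 170000000.0, 169800000.0, 167700000.0, 150000000.0,
]

mean_gross = statistics.mean(gross)
median_gross = statistics.median(gross)
std_gross = statistics.stdev(gross)  # sample standard deviation (n - 1)

print(f"count:  {len(gross)}")
print(f"mean:   {mean_gross:,.2f}")
print(f"median: {median_gross:,.2f}")
print(f"std:    {std_gross:,.2f}")
```

The results should line up with the `gross` column of the `describe` table, up to the rounding pandas applies when displaying it.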

In [85]:
df.isna().any()

rank          False
gross         False
adj_gross     False
artist        False
tour_title    False
years         False
shows         False
avg_gross     False
no_years      False
dtype: bool

Checking for missing (NaN) values, we see that our selected columns have no missing values.

We can proceed with our analysis.

In [90]:
categorical_cols = ["artist", "tour_title",]
numeric_cols = ["rank", "gross", "adj_gross", "shows", "no_years"]

In [91]:
df[numeric_cols].cov()

Unnamed: 0,rank,gross,adj_gross,shows,no_years
rank,35.31316,-768476400.0,-776859200.0,109.4211,0.4552632
gross,-768476400.0,2.443858e+16,2.301495e+16,-3180538000.0,-12444880.0
adj_gross,-776859200.0,2.301495e+16,2.294094e+16,-2288910000.0,-7209488.0
shows,109.4211,-3180538000.0,-2288910000.0,4423.263,27.52632
no_years,0.4552632,-12444880.0,-7209488.0,27.52632,0.2394737


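Considering the covariance of the distribution, we find that:
 - `gross` and `adj_gross` have a large positive covariance (approx. 2.3e16), so they rise and fall together.
 - `rank` has a negative covariance with both gross columns, as expected, since rank 1 is the highest-grossing tour.
 - `shows` has a negative covariance with both gross columns and a positive covariance with `no_years`.
 - Covariance depends on the units of each column, so these magnitudes are not directly comparable with one another.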

In [93]:
df[numeric_cols].corr()

Unnamed: 0,rank,gross,adj_gross,shows,no_years
rank,1.0,-0.827226,-0.863114,0.276861,0.156554
gross,-0.827226,1.0,0.972,-0.305908,-0.162676
adj_gross,-0.863114,0.972,1.0,-0.227223,-0.097268
shows,0.276861,-0.305908,-0.227223,1.0,0.845761
no_years,0.156554,-0.162676,-0.097268,0.845761,1.0


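Considering the correlation of the distribution:
 - `gross` and `adj_gross` are very strongly correlated (0.97).
 - `rank` is strongly negatively correlated with `gross` (-0.83) and `adj_gross` (-0.86).
 - `shows` and `no_years` are strongly positively correlated (0.85): tours spanning more years tend to have more shows.
 - `shows` is only weakly negatively correlated with `gross` (-0.31) and `adj_gross` (-0.23).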
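
The two tables are directly related: a Pearson correlation is a covariance divided by the product of the two standard deviations. The sketch below recomputes two entries of the correlation table from values copied out of the covariance table; the helper name `pearson_from_cov` is an assumption made for illustration.

```python
import math

# Variances (diagonal) and covariances, copied from the covariance table
var_rank = 35.31316
var_gross = 2.443858e16
var_adj_gross = 2.294094e16
cov_gross_adj = 2.301495e16
cov_rank_gross = -768476400.0


def pearson_from_cov(cov_xy, var_x, var_y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    return cov_xy / math.sqrt(var_x * var_y)


r_gross_adj = pearson_from_cov(cov_gross_adj, var_gross, var_adj_gross)
r_rank_gross = pearson_from_cov(cov_rank_gross, var_rank, var_gross)

print(f"gross vs adj_gross: {r_gross_adj:.3f}")
print(f"rank vs gross:      {r_rank_gross:.3f}")
```

Because the correlation is unit-free, it is easier to compare across column pairs than the raw covariances.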

In [81]:
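def plot_bar(x, y):
    """
    A function to plot a bar chart of categories `x`, having frequency `y`
    :param x: The categories of the data
    :param y: The frequency of the categories
    :return: The matplotlib Axes holding the chart
    """
    fig, ax = plt.subplots()
    ax.bar(x, y)
    # Rotate the category labels so long names stay readable
    ax.tick_params(axis="x", labelrotation=90)
    return ax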

In [None]:
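def plot_hist(x, y, bins=None):
    """
    A function to plot a histogram of categories `x` and frequency `y`
    :param x: The categories of the data
    :param y: The frequency of the categories
    :param bins: Optional parameter, the number of bins to use in the histogram
    :return: The matplotlib Axes holding the chart
    """
    # Assumption: `y` holds one weight (frequency) per value in `x`
    fig, ax = plt.subplots()
    ax.hist(x, bins=bins, weights=y)
    return ax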

In [None]:
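def plot_pie(x, y):
    """
    A function to plot a pie chart of categories `x` and frequency `y`
    :param x: The categories of the data
    :param y: The frequency of the categories
    :return: The matplotlib Axes holding the chart
    """
    fig, ax = plt.subplots()
    # Label each wedge with its category and show its share as a percentage
    ax.pie(y, labels=x, autopct="%1.1f%%")
    return ax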

In [None]:
def plot_line(x, y):
    """
    A function to plot a line chart of `y` against `x`
    :param x: An iterable of abscissae
    :param y: An iterable of ordinates
    :return: The matplotlib Axes holding the chart
    """
    fig, ax = plt.subplots()
    ax.plot(x, y)
    return ax

In [None]:
def plot_heatmap(x):
    ...

In [83]:
def plot_confusion_matrix(x):
    ...