#### 1. **Write an Introduction**
   - **Start your analysis by writing an introduction** in an empty Jupyter notebook. The introduction provides context for the superhero data you are analyzing. It should include:
     - **A brief description of the dataset:** State that you are analyzing a superhero dataset and cite the source if known (e.g., "The cleaned_superhero_dataset.csv file was provided as part of a Data Analysis exercise, originally comes from Kaggle, and contains information about various superheroes, including their origins and publishers.").
     - **3-5 key questions or hypotheses** you aim to explore in the analysis. For example, you might work towards answering the following:
       1. Which publishers have introduced the most superheroes?
          This question examines which publishers (e.g., Marvel, DC) have the highest count of superheroes in the dataset.
       2. How are superheroes distributed across different origins?
          This question looks at the distribution of superheroes by origin (e.g., alien, human, mutant) across the entire dataset.
       3. Do major publishers focus on certain origins more than independent publishers?
          This question compares superhero origins between major and independent publishers to determine whether the mix of origins they introduce differs significantly.
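The three example questions map onto a handful of pandas calls. The sketch below uses a small made-up DataFrame: the column names `name`, `publisher`, and `origin`, the rows themselves, and the choice of Marvel and DC as the "major" publishers are all assumptions for illustration and may not match cleaned_superhero_dataset.csv.

```python
import pandas as pd

# Toy data for illustration only; the real file's columns and values may differ.
heroes = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'publisher': ['Marvel', 'Marvel', 'DC', 'DC', 'Indie Press', 'Marvel'],
    'origin': ['mutant', 'human', 'alien', 'human', 'human', 'mutant'],
})

# Question 1: number of superheroes per publisher, largest first.
heroes_per_publisher = heroes['publisher'].value_counts()

# Question 2: share of superheroes by origin across the whole dataset.
origin_share = heroes['origin'].value_counts(normalize=True)

# Question 3: origin mix for major vs. independent publishers (each row sums to 1).
major_publishers = ['Marvel', 'DC']  # assumed definition of "major"
heroes['publisher_type'] = heroes['publisher'].isin(major_publishers).map(
    {True: 'major', False: 'independent'}
)
origin_by_type = pd.crosstab(
    heroes['publisher_type'], heroes['origin'], normalize='index'
)

print(heroes_per_publisher)
print(origin_share)
print(origin_by_type)
```

Normalizing the cross-tabulation by row makes the two publisher groups comparable even when one group contains far more superheroes than the other.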

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

In [12]:
df = pd.read_csv('../data/highest-female-artist-gross-concert.csv')
# Replace non-breaking spaces (\xa0) in the column names with regular spaces.
df = df.rename(columns=lambda col: col.replace('\xa0', ' '))
df.head()

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1]
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3]
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6]
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7]
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8]


In [20]:
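df.info()            # dtypes and non-null counts per column
df.isnull().sum()    # not displayed: only a cell's last expression is rendered
df.describe()        # summary statistics for the numeric columns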

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   Rank                              20 non-null     int64 
 1   Peak                              9 non-null      object
 2   All Time Peak                     6 non-null      object
 3   Actual gross                      20 non-null     object
 4   Adjusted gross (in 2022 dollars)  20 non-null     object
 5   Artist                            20 non-null     object
 6   Tour title                        20 non-null     object
 7   Year(s)                           20 non-null     object
 8   Shows                             20 non-null     int64 
 9   Average gross                     20 non-null     object
 10  Ref.                              20 non-null     object
dtypes: int64(2), object(9)
memory usage: 1.8+ KB


Unnamed: 0,Rank,Shows
count,20.0,20.0
mean,10.45,110.0
std,5.942488,66.507617
min,1.0,41.0
25%,5.75,59.0
50%,10.5,87.0
75%,15.25,134.5
max,20.0,325.0


In [21]:
df.columns

Index(['Rank', 'Peak', 'All Time Peak', 'Actual gross',
       'Adjusted gross (in 2022 dollars)', 'Artist', 'Tour title', 'Year(s)',
       'Shows', 'Average gross', 'Ref.'],
      dtype='object')

In [None]:
money_cols = ['Actual gross', 'Adjusted gross (in 2022 dollars)', 'Average gross']
df[money_cols].value_counts()  # inspect the raw strings (not rendered unless it is the cell's last expression)

# Strip "$", thousands separators, and bracketed footnote markers such as "[4]", then convert to float.
for col in money_cols:
    df[col] = df[col].replace({r'\$': '', r',': '', r'\[.*?\]': ''}, regex=True).astype(float)


In [37]:
df[['Actual gross', 'Adjusted gross (in 2022 dollars)', 'Average gross']].head()

Unnamed: 0,Actual gross,Adjusted gross (in 2022 dollars),Average gross
0,780000000.0,780000000.0,13928571.0
1,579800000.0,579800000.0,10353571.0
2,411000000.0,560622615.0,4835294.0
3,397300000.0,454751555.0,2546795.0
4,345675146.0,402844849.0,6522173.0
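Once the gross columns are numeric, one quick sanity check is to confirm that `Average gross` matches `Actual gross` divided by `Shows`. This is a minimal sketch using the first five rows shown above. The one-dollar tolerance is an assumption, chosen because the reported averages appear to be rounded to whole dollars.

```python
import pandas as pd

# First five rows, copied from the notebook output.
sample = pd.DataFrame({
    'Actual gross': [780000000.0, 579800000.0, 411000000.0, 397300000.0, 345675146.0],
    'Shows': [56, 56, 85, 156, 53],
    'Average gross': [13928571.0, 10353571.0, 4835294.0, 2546795.0, 6522173.0],
})

# Recompute the per-show average and compare it with the reported value.
recomputed = sample['Actual gross'] / sample['Shows']
difference = (recomputed - sample['Average gross']).abs()

# Assumed tolerance of one dollar to allow for rounding in the source table.
consistent = difference <= 1.0
print(consistent.all())
```

In the notebook, the same comparison could be run on `df` directly. Any row flagged `False` would be worth checking for a parsing error.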


In [51]:
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

Peak             11
All Time Peak    14
dtype: int64


In [55]:
# Mark missing peak positions explicitly; both columns remain object (string) dtype.
df['Peak'] = df['Peak'].fillna('Unknown')
df['All Time Peak'] = df['All Time Peak'].fillna('Unknown')

In [57]:
df.head()

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.
0,1,1,2,780000000.0,780000000.0,Taylor Swift,The Eras Tour †,2023–2024,56,13928571.0,[1]
1,2,1,7[2],579800000.0,579800000.0,Beyoncé,Renaissance World Tour,2023,56,10353571.0,[3]
2,3,1[4],2[5],411000000.0,560622615.0,Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,4835294.0,[6]
3,4,2[7],10[7],397300000.0,454751555.0,Pink,Beautiful Trauma World Tour,2018–2019,156,2546795.0,[7]
4,5,2[4],Unknown,345675146.0,402844849.0,Taylor Swift,Reputation Stadium Tour,2018,53,6522173.0,[8]
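The `Peak` and `All Time Peak` values in the output above still carry bracketed footnote markers such as `7[2]`, so they remain strings. One possible follow-up, sketched here on a small stand-in Series rather than the real column, strips the markers and converts the result to a nullable integer. Treating "Unknown" as missing is an assumption about how the column will be used later.

```python
import pandas as pd

# Stand-in for df['All Time Peak'] after fillna('Unknown'); values taken from the output above.
peak = pd.Series(['2', '7[2]', '2[5]', '10[7]', 'Unknown'])

# Remove bracketed footnote markers like "[2]" or "[a]".
stripped = peak.str.replace(r'\[.*?\]', '', regex=True)

# Non-numeric entries such as "Unknown" become <NA> in a nullable integer column.
peak_numeric = pd.to_numeric(stripped, errors='coerce').astype('Int64')
print(peak_numeric.tolist())
```

The same pattern should also remove the `[4][a]` suffix from `Tour title`, although symbols such as `†` and `‡` would need a separate rule.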


In [60]:
df['Artist'] = df['Artist'].str.lower()
df['Artist'].unique()  # call the method; without parentheses the cell shows the bound method instead

array(['taylor swift', 'beyoncé', 'madonna', 'pink', 'celine dion',
       'lady gaga', 'katy perry', 'cher', 'adele'], dtype=object)
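To see how many tours each artist contributes to the list, `value_counts()` gives a per-artist tally. This is a brief sketch using the 20 lowercased names from the `Artist` column. In the notebook, the equivalent call would be `df['Artist'].value_counts()`.

```python
import pandas as pd

# The 20 lowercased artist names from the Artist column, in row order.
artists = pd.Series([
    'taylor swift', 'beyoncé', 'madonna', 'pink', 'taylor swift',
    'madonna', 'celine dion', 'pink', 'beyoncé', 'taylor swift',
    'beyoncé', 'lady gaga', 'katy perry', 'cher', 'madonna',
    'pink', 'lady gaga', 'madonna', 'adele', 'taylor swift',
], name='Artist')

# Number of distinct artists, then number of tours per artist (most frequent first).
n_artists = artists.nunique()
tours_per_artist = artists.value_counts()
print(n_artists)
print(tours_per_artist)
```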