---------------
## Data Cleaning and Preparation - Day4 HW

## Homework 4

Run the cell below to download the Kaggle dataset.

1. Load the data into Pandas
2. Quickly describe what you see from a data science perspective (vars, observations, concerns with formatting, etc)
3. How many NaNs are in the data? What do the [] mean in the data? Should you remove all rows with NaN? Are there any duplicate rows?
4. Using the dictionary given here, add a genre column using the .map() command.

```{python}
artist_to_genre = {
    "Taylor Swift": "Pop / Country",
    "Beyoncé": "R&B / Pop",
    "Madonna": "Pop",
    "Pink": "Pop Rock",
    "Celine Dion": "Adult Contemporary",
    "Lady Gaga": "Pop / Dance",
    "Katy Perry": "Pop",
    "Cher": "Pop / Disco",
    "Adele": "Soul / Pop"
}
```
5. Bin the number of shows into 'high','medium','low'. Your choice on what the cutoffs should be. Add columns with the cutoffs and the codes.
6. Create dummy variables for the genres. Separate the text by the / symbol
7. Remove the $ from the money columns and turn these into integers.
8. Remove all other special characters from the data.
9. Separate the Year(s) column into two columns, "Tour Start" and "Tour End".
10. Save the final data as a pickle.

    
------------------------------------

Your final notebooks should:

- [ ] Be a completely new notebook with just the Day4 stuff in it: Read in the data, clean it up and save it. 
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] It should run without errors from start to finish.


In [9]:
# Some basic package imports
import os
import numpy as np
import pandas as pd
import pickle

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

In [16]:
import kagglehub

# Data was created by scraping: https://en.wikipedia.org/wiki/List_of_highest-grossing_concert_tours_by_women
path = kagglehub.dataset_download("amruthayenikonda/dirty-dataset-to-practice-data-cleaning")

print("Path to dataset files:", path)

Path to dataset files: /Users/sethchairez/.cache/kagglehub/datasets/amruthayenikonda/dirty-dataset-to-practice-data-cleaning/versions/1


In [18]:
# If this gives an error you might have to copy and paste the path from above,
# then replace each \ with / to make it work
os.listdir(path)

['my_file (1).csv']

In [24]:
file_path = os.path.join(path, os.listdir(path)[0])
df = pd.read_csv(file_path)

# 2. 
This data set has 20 observations and 11 variables. Only `Rank` and `Shows` are read in as integers. This is surprising because `Peak` and `All Time Peak` look like integers in the table but are actually stored as strings (`object`). The same is true of `Year(s)`, which is classified as a string rather than an integer.

In [26]:
df.keys()

Index(['Rank', 'Peak', 'All Time Peak', 'Actual gross',
       'Adjusted gross (in 2022 dollars)', 'Artist', 'Tour title', 'Year(s)',
       'Shows', 'Average gross', 'Ref.'],
      dtype='object')

In [28]:
df.shape

(20, 11)

In [34]:
df.dtypes

Rank                                 int64
Peak                                object
All Time Peak                       object
Actual gross                        object
Adjusted gross (in 2022 dollars)    object
Artist                              object
Tour title                          object
Year(s)                             object
Shows                                int64
Average gross                       object
Ref.                                object
dtype: object
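
One way to see why a column such as `Peak` ends up as `object` is to attempt a numeric conversion and look at which entries fail. The sketch below is only an illustration: `peak_sample` is a hand-made Series with values copied from the table, not the real download.

```python
import pandas as pd

# Hypothetical sample mimicking the 'Peak' column: a clean number, numbers with
# Wikipedia reference markers, and a missing value.
peak_sample = pd.Series(["1", "1[4]", "2[7]", None])

# errors="coerce" turns anything that is not a clean number into NaN
as_numbers = pd.to_numeric(peak_sample, errors="coerce")

# Entries that were present but could not be parsed as numbers
not_numeric = peak_sample[as_numbers.isna() & peak_sample.notna()]
print(not_numeric.tolist())
```

A single unparseable entry is enough for pandas to keep the whole column as strings.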

# 3. How many NaNs are in the data? What do the [] mean in the data? Should you remove all rows with NaN? Are there any duplicate rows?
The `[]` in the data set are reference markers carried over from Wikipedia. For example, Madonna's `Peak` value is `1[4]`, where the 4 is the number of the source in the article's reference section. pandas does not interpret these as lists; the whole value is read in as a string, which is why these columns have dtype `object`. I would not remove all rows with NaN: every NaN is in the `Peak` and `All Time Peak` columns (11 and 14, so 25 in total), and dropping those rows would discard most of the data. I think it would be better to substitute zeros for the NaNs, where a 0 just represents that there was no ranking. There are no duplicate rows in the data set.

In [41]:
df.isna().sum()

Rank                                 0
Peak                                11
All Time Peak                       14
Actual gross                         0
Adjusted gross (in 2022 dollars)     0
Artist                               0
Tour title                           0
Year(s)                              0
Shows                                0
Average gross                        0
Ref.                                 0
dtype: int64
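
The idea of replacing the missing peaks with zeros could look roughly like the following. This is a sketch on a made-up Series (`peak`), assuming the bracketed markers should be dropped before the numbers are converted; it is not run on the real data here.

```python
import pandas as pd

# Hypothetical sample of the 'Peak' column
peak = pd.Series(["1", "1[4]", "2[10]", None])

# Drop bracketed reference markers such as [4] or [10]
peak_clean = peak.str.replace(r"\[.*?\]", "", regex=True)

# Convert to numbers, then use 0 to mean "no recorded peak"
peak_clean = pd.to_numeric(peak_clean).fillna(0).astype(int)
print(peak_clean.tolist())  # [1, 1, 2, 0]
```

Whether 0 is a good stand-in depends on later analysis, since a 0 would sort ahead of a real rank of 1.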

In [44]:
df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
dtype: bool
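
Scanning a column of `False` values works for 20 rows, but a single count is easier to read. A minimal sketch on a toy frame (the `toy` data is invented so that one row repeats):

```python
import pandas as pd

# Hypothetical toy frame with one repeated row
toy = pd.DataFrame({"Artist": ["Cher", "Adele", "Cher"],
                    "Shows": [325, 121, 325]})

n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row
n_missing = toy.isna().sum().sum()      # total NaNs across all columns
print(n_duplicates, n_missing)          # 1 0
```

Applied to `df`, `df.duplicated().sum()` would return 0 when there are no duplicate rows.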

# 4  Using the dictionary given here, add a genre column using the .map() command.

In [56]:
artist_to_genre = {
    "Taylor Swift": "Pop / Country",
    "Beyoncé": "R&B / Pop",
    "Madonna": "Pop",
    "Pink": "Pop Rock",
    "Celine Dion": "Adult Contemporary",
    "Lady Gaga": "Pop / Dance",
    "Katy Perry": "Pop",
    "Cher": "Pop / Disco",
    "Adele": "Soul / Pop"
}

In [58]:
df['genre'] = df['Artist'].map(artist_to_genre)
df

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,genre
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9],Pop
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11],Adult Contemporary
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927",[12],Pop Rock
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13],R&B / Pop
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14],Pop / Country
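
`.map()` returns NaN for any key that is missing from the dictionary, so it can be worth checking that every artist received a genre. The sketch below uses a shortened dictionary and a made-up artist list; "Shakira" is included only as an example of a name that is not in the dictionary.

```python
import pandas as pd

artist_to_genre = {"Taylor Swift": "Pop / Country", "Madonna": "Pop"}

# Hypothetical sample; "Shakira" is deliberately absent from the dictionary
artists = pd.Series(["Taylor Swift", "Madonna", "Shakira"])
genre = artists.map(artist_to_genre)

# Artists that did not get a genre
unmapped = artists[genre.isna()].unique().tolist()
print(unmapped)  # ['Shakira']
```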


# 5. Bin the number of shows into 'high','medium','low'. Your choice on what the cutoffs should be. Add columns with the cutoffs and the codes.

In [67]:
df['Shows'].sort_values(ascending = True)

7      41
8      49
4      53
0      56
1      56
14     60
17     82
2      85
9      85
19     86
5      88
16     98
18    121
6     131
10    132
15    142
12    151
3     156
11    203
13    325
Name: Shows, dtype: int64

In [70]:
bins = [40,60,100,330]
show_categories = pd.cut(df['Shows'],bins)
show_categories

0       (40, 60]
1       (40, 60]
2      (60, 100]
3     (100, 330]
4       (40, 60]
5      (60, 100]
6     (100, 330]
7       (40, 60]
8       (40, 60]
9      (60, 100]
10    (100, 330]
11    (100, 330]
12    (100, 330]
13    (100, 330]
14      (40, 60]
15    (100, 330]
16     (60, 100]
17     (60, 100]
18    (100, 330]
19     (60, 100]
Name: Shows, dtype: category
Categories (3, interval[int64, right]): [(40, 60] < (60, 100] < (100, 330]]

In [73]:
show_categories.cat.codes

0     0
1     0
2     1
3     2
4     0
5     1
6     2
7     0
8     0
9     1
10    2
11    2
12    2
13    2
14    0
15    2
16    1
17    1
18    2
19    1
dtype: int8

In [76]:
df['new_codes'] = show_categories.cat.codes
df

Unnamed: 0,Rank,Peak,All Time Peak,Actual gross,Adjusted gross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,genre,new_codes
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country,0
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop,0
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop,1
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock,2
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country,0
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9],Pop,1
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11],Adult Contemporary,2
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927",[12],Pop Rock,0
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13],R&B / Pop,0
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14],Pop / Country,1
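
The question asks for 'low', 'medium', and 'high' bins, which `pd.cut` can produce directly through its `labels` argument. A small sketch using the same cutoffs and a few values taken from the `Shows` column; the variable names are made up for illustration.

```python
import pandas as pd

shows = pd.Series([41, 56, 85, 131, 325])   # a few values from the Shows column
bins = [40, 60, 100, 330]                   # same cutoffs as above
labels = ["low", "medium", "high"]

show_level = pd.cut(shows, bins=bins, labels=labels)
print(show_level.tolist())            # ['low', 'low', 'medium', 'high', 'high']
print(show_level.cat.codes.tolist())  # [0, 0, 1, 2, 2]
```

The labeled result could be stored in its own column next to the codes, so that both the readable category and the integer code are available.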


# 6. Create dummy variables for the genres. Separate the text by the / symbol
`str.get_dummies(' / ')` splits each genre string on ` / ` and creates one 0/1 column per individual genre.

In [84]:
dummies = df['genre'].str.get_dummies(' / ')
dummies

Unnamed: 0,Adult Contemporary,Country,Dance,Disco,Pop,Pop Rock,R&B,Soul
0,0,1,0,0,1,0,0,0
1,0,0,0,0,1,0,1,0
2,0,0,0,0,1,0,0,0
3,0,0,0,0,0,1,0,0
4,0,1,0,0,1,0,0,0
5,0,0,0,0,1,0,0,0
6,1,0,0,0,0,0,0,0
7,0,0,0,0,0,1,0,0
8,0,0,0,0,1,0,1,0
9,0,1,0,0,1,0,0,0
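
The dummies above live in a separate frame. One possible way to attach them to the main table is `DataFrame.join`, sketched here on a toy frame; the `genre_` prefix is an assumption added to keep the new columns easy to spot.

```python
import pandas as pd

# Hypothetical toy frame with three of the artists
toy = pd.DataFrame({"Artist": ["Taylor Swift", "Madonna", "Pink"],
                    "genre": ["Pop / Country", "Pop", "Pop Rock"]})

dummies = toy["genre"].str.get_dummies(" / ")

# Prefix the indicator columns and join them on the shared row index
toy_with_dummies = toy.join(dummies.add_prefix("genre_"))
print(toy_with_dummies.columns.tolist())
```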


# 7. Remove the $ from the money columns and turn these into integers.


In [135]:
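# The money column names contain non-breaking spaces, so typing 'Actual gross'
# with a normal space raises a KeyError. This cell was run after the rename
# in In [97] below, which is why 'Actualgross' works here.
df.keys()
df['Actualgross']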

0        $780,000,000
1        $579,800,000
2        $411,000,000
3        $397,300,000
4        $345,675,146
5        $305,158,363
6        $280,000,000
7        $257,600,000
8        $256,084,556
9        $250,400,000
10    $229,100,000[b]
11       $227,400,000
12       $204,000,000
13       $200,000,000
14       $194,000,000
15       $184,000,000
16       $170,000,000
17       $169,800,000
18    $167,700,000[e]
19       $150,000,000
Name: Actualgross, dtype: object

In [94]:
df.columns

Index(['Rank', 'Peak', 'All Time Peak', 'Actual gross',
       'Adjusted gross (in 2022 dollars)', 'Artist', 'Tour title', 'Year(s)',
       'Shows', 'Average gross', 'Ref.', 'genre', 'new_codes'],
      dtype='object')

In [97]:
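# Remove the non-breaking spaces (\xa0) hidden in some column names,
# then print each name with repr() so any remaining odd characters are visible
df.columns = [c.replace('\xa0', '') for c in df.columns]
for col in df.columns:
    print(repr(col))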


'Rank'
'Peak'
'All Time Peak'
'Actualgross'
'Adjustedgross (in 2022 dollars)'
'Artist'
'Tour title'
'Year(s)'
'Shows'
'Average gross'
'Ref.'
'genre'
'new_codes'
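
Deleting the non-breaking space fuses the words together (`'Actualgross'`). An alternative, shown here only as a sketch on a hand-written list of names, is to swap it for a regular space so the names read the same as they display.

```python
# Hypothetical column names; '\xa0' is a non-breaking space
columns = ["Rank", "Actual\xa0gross", "Average gross"]

# Which names contain a non-breaking space?
suspicious = [c for c in columns if "\xa0" in c]

# Alternative cleanup: use a regular space instead of deleting the character
normalized = [c.replace("\xa0", " ") for c in columns]
print(normalized)  # ['Rank', 'Actual gross', 'Average gross']
```

The rest of this notebook keeps the names without the space, so switching would mean updating every later column reference.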



In [104]:
df

Unnamed: 0,Rank,Peak,All Time Peak,Actualgross,Adjustedgross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,genre,new_codes
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country,0
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop,0
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop,1
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock,2
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country,0
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9],Pop,1
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11],Adult Contemporary,2
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927",[12],Pop Rock,0
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13],R&B / Pop,0
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14],Pop / Country,1


In [107]:
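# Attempt to strip the $ and convert every column to int.
# Note: int() raises a ValueError on strings that still contain commas
# (e.g. '780,000,000'), and the except clause then skips the column, so the
# money columns are left unchanged, as the output below shows.
for col in df.columns:
    try:
        df[col] = df[col].apply(lambda x: int(x.replace('$', '')))
    except (AttributeError, ValueError):
        continue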

df

Unnamed: 0,Rank,Peak,All Time Peak,Actualgross,Adjustedgross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,genre,new_codes
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour †,2023–2024,56,"$13,928,571",[1],Pop / Country,0
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop,0
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour ‡[4][a],2008–2009,85,"$4,835,294",[6],Pop,1
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock,2
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country,0
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9],Pop,1
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11],Adult Contemporary,2
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival †,2023–2024,41,"$6,282,927",[12],Pop Rock,0
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13],R&B / Pop,0
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14],Pop / Country,1
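
The output above still shows `$` and commas, because `int()` cannot parse a string like `'780,000,000'`. A possible fix, sketched on a few values copied from the `Actualgross` column, is to strip the footnote markers, the `$`, and the commas before converting. The variable names are made up, and this has not been applied to `df` here.

```python
import pandas as pd

# Hypothetical sample with values copied from the 'Actualgross' column
gross = pd.Series(["$780,000,000", "$229,100,000[b]", "$150,000,000"])

gross_int = (
    gross.str.replace(r"\[.*?\]", "", regex=True)   # drop footnotes like [b]
         .str.replace(r"[$,]", "", regex=True)      # drop $ and thousands separators
         .astype("int64")
)
print(gross_int.tolist())  # [780000000, 229100000, 150000000]
```

The same chain could be applied in a loop over the three money columns rather than over every column in the frame.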


# 8 Remove all other special characters from the data.

In [5]:
# Strip the footnote symbols and bracketed references from the tour titles
for marker in ['†', '‡', '*', '[21][a]', '[4][a]']:
    df['Tour title'] = df['Tour title'].str.replace(marker, '', regex=False)

In [117]:
titles = df['Tour title']
titles

0                       The Eras Tour 
1               Renaissance World Tour
2                 Sticky & Sweet Tour 
3          Beautiful Trauma World Tour
4              Reputation Stadium Tour
5                        The MDNA Tour
6            Taking Chances World Tour
7                     Summer Carnival 
8             The Formation World Tour
9                  The 1989 World Tour
10     The Mrs. Carter Show World Tour
11              The Monster Ball Tour 
12                Prismatic World Tour
13    Living Proof: The Farewell Tour 
14                    Confessions Tour
15           The Truth About Love Tour
16                  Born This Way Ball
17                    Rebel Heart Tour
18                     Adele Live 2016
19                        The Red Tour
Name: Tour title, dtype: object
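
Some titles above still end in a space where a symbol used to be, and listing each reference string by hand only covers the ones already spotted. A more general sketch, on three titles copied from the table, uses regular expressions and `str.strip()`; treat the patterns as an assumption about which characters count as "special".

```python
import pandas as pd

# Hypothetical sample of raw tour titles
titles = pd.Series(["The Eras Tour †", "Sticky & Sweet Tour ‡[4][a]", "The MDNA Tour"])

cleaned = (
    titles.str.replace(r"\[.*?\]", "", regex=True)   # bracketed references
          .str.replace(r"[†‡*]", "", regex=True)     # footnote symbols
          .str.strip()                               # leftover whitespace
)
print(cleaned.tolist())  # ['The Eras Tour', 'Sticky & Sweet Tour', 'The MDNA Tour']
```

The `&` in "Sticky & Sweet Tour" is kept on purpose, since it is part of the title rather than a scraping artifact.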

# 9 Separate the Year(s) column into two "Tour Start" and "Tour End"


In [122]:
# Plan: split Year(s) into new tour start and tour end columns,
# using .split with either .apply and a lambda or a for loop.
print(type(df['Year(s)']))

<class 'pandas.core.series.Series'>


In [125]:
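# Split each Year(s) entry on the en dash, e.g. '2023–2024' -> ['2023', '2024']
year_parts = []
for item in df['Year(s)']:
    year_parts.append(item.split('–'))

start_dates = []
end_dates = []
for parts in year_parts:
    start_dates.append(parts[0])
    # Single-year tours have no second element, so the end is left missing
    if len(parts) == 2:
        end_dates.append(parts[1])
    else:
        end_dates.append(np.nan)

df['Year Start'] = start_dates
df['End Dates'] = end_dates
df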

Unnamed: 0,Rank,Peak,All Time Peak,Actualgross,Adjustedgross (in 2022 dollars),Artist,Tour title,Year(s),Shows,Average gross,Ref.,genre,new_codes,Year Start,End Dates
0,1,1,2,"$780,000,000","$780,000,000",Taylor Swift,The Eras Tour,2023–2024,56,"$13,928,571",[1],Pop / Country,0,2023,2024.0
1,2,1,7[2],"$579,800,000","$579,800,000",Beyoncé,Renaissance World Tour,2023,56,"$10,353,571",[3],R&B / Pop,0,2023,
2,3,1[4],2[5],"$411,000,000","$560,622,615",Madonna,Sticky & Sweet Tour,2008–2009,85,"$4,835,294",[6],Pop,1,2008,2009.0
3,4,2[7],10[7],"$397,300,000","$454,751,555",Pink,Beautiful Trauma World Tour,2018–2019,156,"$2,546,795",[7],Pop Rock,2,2018,2019.0
4,5,2[4],,"$345,675,146","$402,844,849",Taylor Swift,Reputation Stadium Tour,2018,53,"$6,522,173",[8],Pop / Country,0,2018,
5,6,2[4],10[9],"$305,158,363","$388,978,496",Madonna,The MDNA Tour,2012,88,"$3,467,709",[9],Pop,1,2012,
6,7,2[10],,"$280,000,000","$381,932,682",Celine Dion,Taking Chances World Tour,2008–2009,131,"$2,137,405",[11],Adult Contemporary,2,2008,2009.0
7,7,,,"$257,600,000","$257,600,000",Pink,Summer Carnival,2023–2024,41,"$6,282,927",[12],Pop Rock,0,2023,2024.0
8,9,,,"$256,084,556","$312,258,401",Beyoncé,The Formation World Tour,2016,49,"$5,226,215",[13],R&B / Pop,0,2016,
9,10,,,"$250,400,000","$309,141,878",Taylor Swift,The 1989 World Tour,2015,85,"$2,945,882",[14],Pop / Country,1,2015,
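
The same split can also be written without explicit loops. The sketch below works on three sample values and assumes that a single-year tour ends in the year it starts, which is a choice rather than something the data states.

```python
import pandas as pd

# Hypothetical sample of the Year(s) column (the separator is an en dash)
years = pd.Series(["2023–2024", "2023", "2008–2009"])

# n=1 with expand=True gives two columns here; single years get a missing end
parts = years.str.split("–", n=1, expand=True)
tour_start = parts[0].astype(int)
tour_end = parts[1].fillna(parts[0]).astype(int)  # assumption: one-year tours end the year they start
print(tour_start.tolist(), tour_end.tolist())
```

Storing both columns as integers avoids the mixed string and float values that appear when missing ends are left as NaN.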


In [131]:
df.to_pickle('df.pkl')
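
A quick way to confirm the pickle was written correctly is to read it back and compare. This sketch uses a toy frame and a temporary directory so it does not touch `df.pkl`; the file name `toy.pkl` is made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for the cleaned data
toy = pd.DataFrame({"Artist": ["Adele"], "Shows": [121]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

print(restored.equals(toy))  # True
```

Unlike a CSV, a pickle keeps the column dtypes, so integer and categorical columns come back as they were saved.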