<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Lab: Cleaning Rock Song Data

---


In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline

### 1. Load `rock.csv` and do an initial examination of its data columns.

In [2]:
rockfile = "../../../../../resource-datasets/rock_songs/rock.csv"

In [3]:
# Load the data.
df = pd.read_csv(rockfile)
df.head()

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


In [4]:
# Look at the information regarding its columns.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 8 columns):
Song Clean      2230 non-null object
ARTIST CLEAN    2230 non-null object
Release Year    1653 non-null object
COMBINED        2230 non-null object
First?          2230 non-null int64
Year?           2230 non-null int64
PlayCount       2230 non-null int64
F*G             2230 non-null int64
dtypes: int64(4), object(4)
memory usage: 139.5+ KB


### 2.  Clean up the column names.

Let's clean up the column names. There are two ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

When you pass `names=[...]` with a list of strings whose length matches the number of columns, `pandas` uses those strings as the column names.

NOTE: The first row of the `.csv` is already a header, so when you supply custom column names you must tell `pandas` to skip that row. Otherwise the old header is read in as a row of data. The `skiprows=1` keyword argument to `read_csv()` skips the first row.
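
As a minimal sketch of how `names` and `skiprows` interact, the cell below reads a tiny in-memory CSV. The `csv_text` contents and the two short column names are made up for illustration and are not taken from `rock.csv`. It also shows `header=0`, which should give the same result as `skiprows=1` when `names` is supplied.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for rock.csv.
csv_text = "Song Clean,PlayCount\nKryptonite,13\nLoser,1\n"
new_names = ['song', 'playcount']

# skiprows=1 drops the original header line so it is not read as data.
skipped = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

# header=0 marks the first line as a header that `names` should replace.
replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

# With neither argument, the old header becomes the first data row.
unskipped = pd.read_csv(io.StringIO(csv_text), names=new_names)

print(skipped)
print(unskipped.head(1))
```

If the old header shows up as row 0, as it does in `unskipped`, the numeric columns are also read as strings, which is an easy mistake to miss.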

In [5]:
# Change the column names when loading the '.csv':
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names, skiprows=1)
df.head()

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.

In [6]:
# Change the column names using the `.rename()` function.

df = pd.read_csv(rockfile)

rename_map = {
    # Original column: [renamed column]
    'Song Clean':    'song', 
    'ARTIST CLEAN':  'artist', 
    'Release Year':  'release', 
    'COMBINED':      'song_artist', 
    'First?':        'first', 
    'Year?':         'year', 
    'PlayCount':     'playcount', 
    'F*G':           'fg'
}

df.rename(columns=rename_map, inplace=True)
df.head(4)

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18


#### 2.C Reassigning the `.columns` attribute of a DataFrame.

You can also just reassign the `.columns` attribute to a list of strings containing the new column names. 

The only caveat with reassigning `.columns` is that you have to reassign all of the column names at once. You can't partially replace a value by working on `.columns` directly. You have to reassign `.columns` with a list of equal length. 
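
The caveat can be seen in a small sketch. The one-row `df_demo` frame below is hypothetical; it only exists to show that a list of the wrong length is rejected, and that `.rename()` is the easier tool when only some names should change.

```python
import pandas as pd

# Hypothetical two-column frame used only for this demonstration.
df_demo = pd.DataFrame({'Song Clean': ['Loser'], 'PlayCount': [1]})

# Assigning a list of the wrong length raises a ValueError.
try:
    df_demo.columns = ['song']
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

# To change just one name, pass a partial mapping to `.rename()`.
partially_renamed = df_demo.rename(columns={'PlayCount': 'playcount'})

print(type(mismatch_error).__name__)
print(list(partially_renamed.columns))
```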

In [7]:
# Replace the column names by reassigning the `.columns` attribute.
df = pd.read_csv(rockfile)
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df.columns = column_names
df.head()

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1


### 3. Subsetting data where null values exist.

We have mixed `str` and `NaN` values in the `release` column. `NaN` stands for "not a number" and is the way `pandas` handles "nulls" or nonexistent data. We can use the `.isnull()` method of a Series to find null values.

Print the first few rows of the subset of the data in which the `release` column is null.
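
Before trying this on the full data set, here is a minimal sketch of how a null mask behaves. The three-row `songs` frame is a hand-typed stand-in built from rows shown earlier, not the real file.

```python
import numpy as np
import pandas as pd

# Small stand-in for the rock data; only two columns are kept.
songs = pd.DataFrame({
    'song': ['Caught Up in You', 'Fantasy Girl', 'Hold On Loosely'],
    'release': ['1982', np.nan, '1981'],
})

null_mask = songs['release'].isnull()   # boolean Series, True where null
n_missing = null_mask.sum()             # each True counts as 1
missing_rows = songs[null_mask]         # rows where release is null
present_rows = songs[~null_mask]        # `~` inverts the mask

print(n_missing)
print(missing_rows)
```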

In [8]:
# This will show us records where `df['release']` is null.
null_release_mask = df['release'].isnull()
df[null_release_mask].head()

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we wanted to set all of the null values in `release` to 0.

With newer versions of `pandas`, in order to manipulate data in the original DataFrame, we have to use `.loc` while performing reassignment using a mask and an index.

For example, the following chained indexing won't always work, because the first selection may return a copy rather than the original data:
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
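
The `.loc` patterns above can be sketched on a small, made-up frame. The name `demo` and the replacement value `-1` are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing release year.
demo = pd.DataFrame({
    'release': [1982.0, np.nan, 1981.0],
    'year': [1, 0, 1],
    'fg': [82, 0, 85],
})

# Compute the mask once, before any values are changed.
row_mask = demo['release'].isnull()

# Single column: only the masked rows are modified.
demo.loc[row_mask, 'release'] = 0

# Several columns: every selected cell receives the same scalar.
demo.loc[row_mask, ['year', 'fg']] = -1

print(demo)
```

Because `.loc` does the row and column selection in a single step, the assignment is applied to `demo` itself rather than to an intermediate object.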

#### 4.A Let's try it out. Make all of the null values in `release` 0.

In [9]:
# We're going to reload our data to a fresh state.
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names, skiprows=1)

# create/identify the target subset and use it as an indexing tool to set values.
null_release_mask = df['release'].isnull()
df.loc[null_release_mask, 'release'] = 0

# We'll then print out our DataFrame's first 15 rows:
df.head(15)

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,0,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
5,Kryptonite,3 Doors Down,2000,Kryptonite by 3 Doors Down,1,1,13,13
6,Loser,3 Doors Down,2000,Loser by 3 Doors Down,1,1,1,1
7,When I'm Gone,3 Doors Down,2002,When I'm Gone by 3 Doors Down,1,1,6,6
8,What's Up?,4 Non Blondes,1992,What's Up? by 4 Non Blondes,1,1,3,3
9,Take On Me,a-ha,1985,Take On Me by a-ha,1,1,1,1


#### 4.B Verify that `release` contains no null values.

In [10]:
df.isnull().sum()

song           0
artist         0
release        0
song_artist    0
first          0
year           0
playcount      0
fg             0
dtype: int64

### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of data munging. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [11]:
# We're going to reload our data to a fresh state.
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names, skiprows=1)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 8 columns):
song           2230 non-null object
artist         2230 non-null object
release        1653 non-null object
song_artist    2230 non-null object
first          2230 non-null int64
year           2230 non-null int64
playcount      2230 non-null int64
fg             2230 non-null int64
dtypes: int64(4), object(4)
memory usage: 139.5+ KB


In [12]:
df.head(3)

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85


_Only the `release` column appears to be wrong. It is represented as a string but should be an integer for year._

### 6. Investigate and clean up the `release` column.

The `release` column is a string data type when it should be an integer.

#### 6.A Figure out what value(s) are causing the `release` column to be encoded as a string instead of an integer.

In [13]:
# Looking at the unique values in the column can be a good way to find offending values:
df.release.unique()

array(['1982', nan, '1981', '1980', '1975', '2000', '2002', '1992',
       '1985', '1993', '1976', '1995', '1979', '1984', '1977', '1990',
       '1986', '1974', '2014', '1987', '1973', '2001', '1989', '1997',
       '1971', '1972', '1994', '1970', '1966', '1965', '1983', '1955',
       '1978', '1969', '1999', '1968', '1988', '1962', '2007', '1967',
       '1958', '1071', '1996', '1991', '2005', '2011', '2004', '2012',
       '2003', '1998', '2008', '1964', '2013', '2006', 'SONGFACTS.COM',
       '1963', '1961'], dtype=object)

One row has `SONGFACTS.COM` as its value, which is clearly not a year.

#### 6.B Look at the rows in which there is incorrect data in the `release` column.

In [14]:
# Build a mask for the bad value and view the matching rows.
release_mask = df['release'] == "SONGFACTS.COM"
df[release_mask]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
1504,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


#### 6.C Clean up the data.

Normally we would replace the offending value with a null (`np.nan`). Earlier we set the nulls in `release` to 0, but the data has been reloaded since then, so the column still contains `NaN`. Replacing the bad value with either 0 or `NaN` allows us to convert the column to numeric. The first cell below previews the zero-filled option without modifying `df`; the second sets the bad value to `np.nan` and converts the column.

In [15]:
df.release[df.release!='SONGFACTS.COM'].fillna(0).head()

0    1982
1       0
2    1981
3    1980
4    1975
Name: release, dtype: object

In [16]:
df.loc[release_mask, 'release'] = np.nan
df['release'] = df['release'].astype(float)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 8 columns):
song           2230 non-null object
artist         2230 non-null object
release        1652 non-null float64
song_artist    2230 non-null object
first          2230 non-null int64
year           2230 non-null int64
playcount      2230 non-null int64
fg             2230 non-null int64
dtypes: float64(1), int64(4), object(3)
memory usage: 139.5+ KB


**Note:** A year can also be treated as a descriptive label rather than a quantity, so it can make sense for a year column to be an object.  
However, as this example shows, attempting a numeric conversion is a good way to identify improper values in a year column.
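
One way to use numeric conversion as a diagnostic is `pd.to_numeric` with `errors='coerce'`. This is offered as an alternative sketch, not as the method used in the cells above; the four-value `release` Series is a hand-typed sample.

```python
import numpy as np
import pandas as pd

# Hand-typed sample: two valid years, one null, one bad string.
release = pd.Series(['1982', np.nan, 'SONGFACTS.COM', '1071'], dtype=object)

# errors='coerce' turns anything unparseable into NaN instead of raising.
as_number = pd.to_numeric(release, errors='coerce')

# Values that were present before conversion but are NaN afterwards
# are the ones that could not be parsed as numbers.
offending = release[release.notnull() & as_number.isnull()]

print(offending)
```

Note that `'1071'` parses without complaint, so this check only catches non-numeric values; implausible numbers still need a separate look.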

### 7. Get summary statistics for the `release` column using the `.describe()` function.

Now that the `release` column is finally a numeric data type, we can apply the `.describe()` function.  

#### 7.A Print out the summary stats for the `release` column. What is the earliest and latest release date?

In [18]:
df['release'].describe()

count    1652.000000
mean     1978.019976
std        24.191247
min      1071.000000
25%      1971.000000
50%      1977.000000
75%      1984.000000
max      2014.000000
Name: release, dtype: float64

The earliest release date is 1071 and the latest is 2014.

#### 7.B Based on the summary statistics, is there anything else wrong with the `release` column? 

A release year of 1071 is clearly wrong. 
We might want to impose a cutoff for the earliest plausible release year.

In [19]:
df[df.release == 1071]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
547,Levon,Elton John,1071.0,Levon by Elton John,1,1,8,8


_Looking at the DataFrame that contains the year 1071, we can see that the year was probably corrupted and should be replaced with something else if possible._
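
A cutoff like the one suggested above could be applied with a mask and `.loc`. This is a minimal sketch: the `EARLIEST_PLAUSIBLE_YEAR` value of 1900 is an assumption picked for illustration, and `songs` is a three-row stand-in rather than the full data set.

```python
import numpy as np
import pandas as pd

# Stand-in frame: one implausible year, one valid year, one null.
songs = pd.DataFrame({
    'song': ['Levon', 'Bohemian Rhapsody', 'Fantasy Girl'],
    'release': [1071.0, 1975.0, np.nan],
})

EARLIEST_PLAUSIBLE_YEAR = 1900  # assumed cutoff, not from the lab

# Comparisons against NaN evaluate to False, so nulls are left alone.
too_early = songs['release'] < EARLIEST_PLAUSIBLE_YEAR
songs.loc[too_early, 'release'] = np.nan

print(songs)
```

Setting the value to `NaN` marks it as unknown; if the correct year can be looked up, assigning it directly would be the better fix.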

### 8. Make changes and investigate using custom functions with `.apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.

In [20]:
def release_inspector(row):
    print('--------------------------------')
    try:
        print(row['song'], row['artist'], row['release'], '< 1970?:', int(row['release']) < 1970)
        return int(row['release']) < 1970
    except (ValueError, TypeError):
        print(row['song'], row['artist'], row['release'], '< 1970?:', 'no release year')
        return 'no release year'

In [21]:
# We're going to reload our data to a fresh state.
column_names = ['song', 'artist', 'release', 'song_artist', 'first', 'year', 'playcount', 'fg']
df = pd.read_csv(rockfile, names=column_names, skiprows=1)

#### 8.B Using the `.apply()` function, apply the function you wrote to the first four rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument as `axis=1` indicates that the function should be applied to each row individually.

In [22]:
df.head(4).apply(release_inspector, axis=1)

--------------------------------
Caught Up in You .38 Special 1982 < 1970?: False
--------------------------------
Fantasy Girl .38 Special nan < 1970?: no release year
--------------------------------
Hold On Loosely .38 Special 1981 < 1970?: False
--------------------------------
Rockin' Into the Night .38 Special 1980 < 1970?: False


0              False
1    no release year
2              False
3              False
dtype: object

The final output is the Series of values returned by `release_inspector`, one per row. If the function had no `return` statement, `.apply()` would instead return a Series of `None` values (just as a Python function returns `None` by default when a return statement is not specified).
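
To see the difference between a function that returns a value and one that only prints, here is a small sketch. The helper names `describe_only` and `is_before_1970` and the two-row `demo` frame are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'song': ['Loser', 'Kryptonite'], 'release': [2000, 2000]})

def describe_only(row):
    # Prints but has no return statement, so Python returns None.
    print(row['song'], row['release'])

def is_before_1970(row):
    # Returns a value, which .apply() collects into the output Series.
    return row['release'] < 1970

printed_only = demo.apply(describe_only, axis=1)
flags = demo.apply(is_before_1970, axis=1)

print(printed_only)
print(flags)
```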

### 9. Write a function that converts cells in a DataFrame to float and otherwise replaces them with `np.nan`.

If applied to our data, it would keep only the numeric information and otherwise input null values.

Recall that the try-except syntax in Python is a great way to try something and take another action if the initial step fails:

```python
try:
    ...  # Perform some action.
except Exception:
    ...  # Perform some other action if the first failed with an error.
```
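
As a concrete instance of the pattern, the sketch below parses a year from text. The function name `parse_year` is hypothetical. It catches only the two exception types that `int()` is expected to raise for bad input, so unrelated errors are not silently hidden.

```python
def parse_year(text):
    """Return `text` as an int, or None if it cannot be converted."""
    try:
        return int(text)
    except (ValueError, TypeError):
        # ValueError: a string such as 'SONGFACTS.COM'.
        # TypeError: an unsupported type such as None.
        return None

parsed = [parse_year(value) for value in ['1982', 'SONGFACTS.COM', None]]
print(parsed)
```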

#### 9.A Write the function that takes a column and converts all of its values to float if possible and `np.nan` otherwise. The return value should be the converted Series.

In [23]:
def converter_helper(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan

def convert_to_float(column):
    column = column.map(converter_helper)
    return column

#### 9.B Try your function out on the rock song data and ensure the output is what you expected.


In [24]:
df.apply(convert_to_float).head(10)

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
0,,,1982.0,,1.0,1.0,82.0,82.0
1,,,,,1.0,0.0,3.0,0.0
2,,,1981.0,,1.0,1.0,85.0,85.0
3,,,1980.0,,1.0,1.0,18.0,18.0
4,,,1975.0,,1.0,1.0,1.0,1.0
5,,,2000.0,,1.0,1.0,13.0,13.0
6,,,2000.0,,1.0,1.0,1.0,1.0
7,,,2002.0,,1.0,1.0,6.0,6.0
8,,,1992.0,,1.0,1.0,3.0,3.0
9,,,1985.0,,1.0,1.0,1.0,1.0


In [25]:
# create a new dataframe via applying our function

df2 = df.apply(convert_to_float)

#### 9.C Describe the new float-only DataFrame.

In [26]:
df2.song[~df2.song.isnull()]

1566      45.0
1577    1979.0
Name: song, dtype: float64

In [27]:
df2.describe()

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
count,2.0,0.0,1652.0,0.0,2230.0,2230.0,2230.0,2230.0
mean,1012.0,,1978.019976,,1.0,0.741256,16.872646,15.04843
std,1367.544515,,24.191247,,0.0,0.438043,25.302972,25.288366
min,45.0,,1071.0,,1.0,0.0,0.0,0.0
25%,528.5,,1971.0,,1.0,0.0,1.0,0.0
50%,1012.0,,1977.0,,1.0,1.0,4.0,3.0
75%,1495.5,,1984.0,,1.0,1.0,21.0,18.0
max,1979.0,,2014.0,,1.0,1.0,142.0,142.0


### 10. What are the top 20 most popular songs by plays?

In [28]:
df.sort_values(by='playcount')[-20:]

Unnamed: 0,song,artist,release,song_artist,first,year,playcount,fg
130,Feel Like Makin' Love,Bad Company,,Feel Like Makin' Love by Bad Company,1,0,113,0
1866,Just What I Needed,The Cars,1978.0,Just What I Needed by The Cars,1,1,113,113
1523,Tom Sawyer,Rush,1981.0,Tom Sawyer by Rush,1,1,114,114
956,Wheel in the Sky,Journey,1978.0,Wheel in the Sky by Journey,1,1,114,114
1106,Blinded by the Light,Manfred Mann,1976.0,Blinded by the Light by Manfred Mann,1,1,114,114
1621,The Joker,Steve Miller Band,1973.0,The Joker by Steve Miller Band,1,1,114,114
784,Magic Man,Heart,1976.0,Magic Man by Heart,1,1,115,115
1675,Renegade,Styx,1978.0,Renegade by Styx,1,1,116,116
989,Rock And Roll All Nite,Kiss,1975.0,Rock And Roll All Nite by Kiss,1,1,119,119
1354,Bohemian Rhapsody,Queen,1975.0,Bohemian Rhapsody by Queen,1,1,119,119
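
An alternative that lists the most-played songs first is `DataFrame.nlargest`. The sketch below uses a four-row `plays` frame with values copied from the output above plus one low-count row, and asks for the top two rather than the top twenty.

```python
import pandas as pd

plays = pd.DataFrame({
    'song': ['Renegade', 'Magic Man', 'Bohemian Rhapsody', 'Loser'],
    'playcount': [116, 115, 119, 1],
})

# nlargest returns the n rows with the highest values, largest first.
top_two = plays.nlargest(2, 'playcount')

# Equivalent result by sorting in descending order and taking the head.
same_by_sorting = plays.sort_values(by='playcount', ascending=False).head(2)

print(top_two)
```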


### 11. Which years have the most plays?

In [29]:
# Note: this counts the number of songs per release year, not total plays.
df.release.value_counts()

1973             104
1975              83
1977              83
1970              81
1971              75
1969              72
1980              70
1978              64
1979              63
1981              61
1967              61
1983              60
1976              56
1982              54
1984              51
1972              50
1974              48
1968              46
1985              39
1987              39
1986              37
1991              34
1989              32
1966              30
1988              29
1965              28
1994              25
1990              22
1993              19
1964              14
1992              14
1999              13
1995              10
1997               9
1996               9
1963               9
2002               6
1998               6
2005               5
2012               5
2004               5
2001               4
2003               3
2008               3
1962               3
2011               3
2007               3
2000         
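
Note that `value_counts()` tallies how many songs were released in each year. If the question is read literally as total plays per year, one possible approach is to sum `playcount` within each release year. The sketch below uses made-up numbers purely to show that the two rankings can differ.

```python
import pandas as pd

# Made-up values: 1978 has more songs, 1975 has more plays.
plays = pd.DataFrame({
    'release': ['1975', '1975', '1978', '1978', '1978'],
    'playcount': [119, 119, 10, 5, 3],
})

# Number of songs per release year.
songs_per_year = plays['release'].value_counts()

# Total plays per release year, largest first.
plays_per_year = (
    plays.groupby('release')['playcount']
         .sum()
         .sort_values(ascending=False)
)

print(songs_per_year)
print(plays_per_year)
```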

### 12. Which records have a "PlayCount" that doesn't match "F*G"?

In [30]:
df[df.playcount != df.fg].shape

(576, 8)
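
To dig into why these rows differ, one hypothesis worth testing is that `fg` equals `year` multiplied by `playcount`. This is only a guess based on the original header `F*G` and on the rows displayed earlier; it is not confirmed by the data source. The `sample` frame below copies four rows' values from outputs shown above.

```python
import pandas as pd

# (year, playcount, fg) values copied from rows displayed earlier.
sample = pd.DataFrame({
    'year': [1, 0, 1, 0],
    'playcount': [82, 3, 85, 113],
    'fg': [82, 0, 85, 0],
})

# Rows where playcount and fg disagree.
mismatch = sample[sample['playcount'] != sample['fg']]

# Does the product hypothesis hold for every sampled row?
follows_product = bool((sample['year'] * sample['playcount'] == sample['fg']).all())

print(mismatch)
print(follows_product)
```

Running the same comparison on the full `df` would show whether the hypothesis holds beyond these four rows.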

### Bonus: Which artists have the most missing values across all of the variables? 

In [31]:
pd.DataFrame([(artist,df[df.artist==artist].isnull().sum().sum()) for artist in df.artist.unique()],
            columns=['artist','missing_values']).sort_values(by='missing_values',ascending=False)[:20]

Unnamed: 0,artist,missing_values
174,Heart,17
454,Van Halen,17
162,Grateful Dead,11
285,Paul McCartney & Wings,11
52,Bob Seger,11
441,Tom Petty & The Heartbreakers,11
364,Stevie Ray Vaughan,10
399,The Cars,9
5,AC/DC,9
28,Bad Company,9
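
The same kind of tally can be sketched with `groupby` instead of a list comprehension, which avoids filtering the DataFrame once per artist. The `songs` frame below uses placeholder artist and song names, not real rows.

```python
import numpy as np
import pandas as pd

# Placeholder data: artist 'A' has two missing years, 'B' one, 'C' none.
songs = pd.DataFrame({
    'artist': ['A', 'A', 'B', 'C'],
    'release': [np.nan, np.nan, np.nan, '1985'],
    'song': ['s1', 's2', 's3', 's4'],
})

# Count nulls in each row, then total those counts per artist.
missing_per_row = songs.isnull().sum(axis=1)
missing_per_artist = (
    missing_per_row.groupby(songs['artist'])
                   .sum()
                   .sort_values(ascending=False)
)

print(missing_per_artist.head(20))
```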
