<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Lab: Cleaning Rock Song Data

_Authors: Dave Yerrington (SF)_

---


In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline

### 1. Load `rock.csv` and do an initial examination of its data columns.

In [2]:
rockfile = "./datasets/rock.csv"

In [50]:
rockfile_df = pd.read_csv(rockfile)

In [51]:
rockfile_df.head(10)

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
5,Kryptonite,3 Doors Down,2000.0,Kryptonite by 3 Doors Down,1,1,13,13
6,Loser,3 Doors Down,2000.0,Loser by 3 Doors Down,1,1,1,1
7,When I'm Gone,3 Doors Down,2002.0,When I'm Gone by 3 Doors Down,1,1,6,6
8,What's Up?,4 Non Blondes,1992.0,What's Up? by 4 Non Blondes,1,1,3,3
9,Take On Me,a-ha,1985.0,Take On Me by a-ha,1,1,1,1


In [52]:
# How many rows (2230)
rockfile_df.index

RangeIndex(start=0, stop=2230, step=1)

In [53]:
# How many rows and columns (8)
rockfile_df.shape

(2230, 8)

In [54]:
# List of column names
rockfile_df.columns

Index(['Song Clean', 'ARTIST CLEAN', 'Release Year', 'COMBINED', 'First?',
       'Year?', 'PlayCount', 'F*G'],
      dtype='object')

In [55]:
# Column datatypes (object = string)
rockfile_df.dtypes

Song Clean      object
ARTIST CLEAN    object
Release Year    object
COMBINED        object
First?           int64
Year?            int64
PlayCount        int64
F*G              int64
dtype: object

In [56]:
# Number of NaN values in each column
rockfile_df.isnull().sum()

Song Clean        0
ARTIST CLEAN      0
Release Year    577
COMBINED          0
First?            0
Year?             0
PlayCount         0
F*G               0
dtype: int64

In [57]:
# Summary statistics for numerical columns
rockfile_df.describe()

Unnamed: 0,First?,Year?,PlayCount,F*G
count,2230.0,2230.0,2230.0,2230.0
mean,1.0,0.741256,16.872646,15.04843
std,0.0,0.438043,25.302972,25.288366
min,1.0,0.0,0.0,0.0
25%,1.0,0.0,1.0,0.0
50%,1.0,1.0,4.0,3.0
75%,1.0,1.0,21.0,18.0
max,1.0,1.0,142.0,142.0


In [58]:
# Variable to store cleaned column names
col_names = ['song_name', 'artist', 'release_year', 'combined', 'first_song', 'year_count', 'play_count', 'f_g']

### 2.  Clean up the column names.

Let's clean up the column names. There are three ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

If you pass `names=[...]` (a list of strings) whose length matches the number of columns in the file, `pandas` uses those strings as the column names.

NOTE: When you create custom column names, the first row of the `.csv` already represents a header. It is important to tell `pandas` to skip that row. The `skiprows=1` keyword argument to `read_csv()` will tell `pandas` to skip the first row.

In [84]:
# Change the column names when loading the '.csv':

# col_names = ['song_name', 'artist', 'release_year', 'combined', 'first_song', 'year_count', 'play_count', 'f_g']
# pd.read_csv(rockfile, names=col_names, skiprows=1)
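
To see why `skiprows=1` matters, here is a minimal sketch on a tiny in-memory CSV. The two-row sample and the names `csv_text`, `new_names`, `with_header_row`, and `clean` are assumptions made for illustration; they are not part of the lab files.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the header style of rock.csv
csv_text = "Song Clean,ARTIST CLEAN\nKryptonite,3 Doors Down\nTake On Me,a-ha\n"
new_names = ["song_name", "artist"]

# Without skiprows, the original header line is read as a row of data
with_header_row = pd.read_csv(io.StringIO(csv_text), names=new_names)

# With skiprows=1, the original header line is dropped before parsing
clean = pd.read_csv(io.StringIO(csv_text), names=new_names, skiprows=1)

print(with_header_row.shape)  # (3, 2)
print(clean.shape)            # (2, 2)
```

The extra row in `with_header_row` is the old header, which is also the kind of stray string that can force a numeric column to the `object` dtype.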

#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.

In [59]:
# Change column names using dictionary format
rockfile_df.rename(columns={"Song Clean" : "song_name",
                            "ARTIST CLEAN" : "artist", 
                            "Release Year" : "release_year", 
                            "COMBINED" : "combined", 
                            "First?" : "first_song", 
                            "Year?" : "year_count", 
                            "PlayCount" : "play_count", 
                            "F*G" : "f_g"}, 
                   inplace=True)

In [60]:
rockfile_df.columns

Index(['song_name', 'artist', 'release_year', 'combined', 'first_song',
       'year_count', 'play_count', 'f_g'],
      dtype='object')

In [17]:
# You may need to replace whitespace with underscores and lowercase all characters in column names
rockfile_df.columns = [col_string.replace(' ', '_').lower() for col_string in rockfile_df.columns]

In [18]:
rockfile_df.columns

Index(['song_name', 'artist', 'release_year', 'combined', 'first_song',
       'year_count', 'play_count', 'f_g'],
      dtype='object')
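
When there are many columns, a generic cleanup can be written with the `.str` accessor of the column Index. The following is a rough sketch rather than the lab's method: the regular expression and the names `raw_columns` and `cleaned` are assumptions chosen for illustration.

```python
import pandas as pd

# A few of the original rock.csv column names
raw_columns = pd.Index(["Song Clean", "ARTIST CLEAN", "Release Year", "First?", "F*G"])

cleaned = (
    raw_columns.str.strip()                               # drop surrounding whitespace
    .str.lower()                                          # lowercase everything
    .str.replace(r"[^a-z0-9]+", "_", regex=True)          # non-alphanumerics become '_'
    .str.strip("_")                                       # remove leading/trailing '_'
)

print(list(cleaned))
# ['song_clean', 'artist_clean', 'release_year', 'first', 'f_g']
```

A generic cleanup like this cannot invent descriptive names such as `first_song`, so an explicit rename is still needed when the original names are uninformative.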

#### 2.C Reassigning the `.columns` attribute of a DataFrame.

You can also just reassign the `.columns` attribute to a list of strings containing the new column names. 

The only caveat with reassigning `.columns` is that you have to reassign all of the column names at once. You can't partially replace a value by working on `.columns` directly. You have to reassign `.columns` with a list of equal length. 

In [89]:
# Replace the column names by reassigning the `columns` attribute:

# col_names = ['song_name', 'artist', 'release_year', 'combined', 'first_song', 'year_count', 'play_count', 'f_g']
# rockfile_df.columns = col_names
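
The length caveat can be checked on a toy DataFrame. This is a small sketch, with `toy_df` and `reassignment_failed` as made-up names: assigning a list of the wrong length raises a `ValueError`, while `.rename()` can change a single column and leave the rest alone.

```python
import pandas as pd

# Toy DataFrame with two of the original column names
toy_df = pd.DataFrame({"Song Clean": ["Loser"], "ARTIST CLEAN": ["3 Doors Down"]})

# Reassigning .columns with a list of the wrong length raises a ValueError
reassignment_failed = False
try:
    toy_df.columns = ["song_name"]
except ValueError:
    reassignment_failed = True

print(reassignment_failed)  # True

# .rename() can replace just one name; columns not in the dictionary are unchanged
toy_df = toy_df.rename(columns={"Song Clean": "song_name"})
print(list(toy_df.columns))  # ['song_name', 'ARTIST CLEAN']
```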

### 3. Subsetting data where null values exist.

We have mixed `str` and `NaN` values in the `release_year` column. `NaN` stands for "not a number" and is the way `pandas` handles "nulls" or nonexistent data. We can use the `.isnull()` method of a Series to find null values.

Print the first rows of the subset of the data in which the `release_year` column is null.

In [61]:
# Show records where df['release_year'] is null (577 rows)
rockfile_df[rockfile_df['release_year'].isnull()]

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
...,...,...,...,...,...,...,...,...
2216,"I'm Bad, I'm Nationwide",ZZ Top,,"I'm Bad, I'm Nationwide by ZZ Top",1,0,10,0
2218,Just Got Paid,ZZ Top,,Just Got Paid by ZZ Top,1,0,2,0
2221,My Head's In Mississippi,ZZ Top,,My Head's In Mississippi by ZZ Top,1,0,1,0
2222,Party On The Patio,ZZ Top,,Party On The Patio by ZZ Top,1,0,14,0


In [62]:
rockfile_df[rockfile_df['release_year'].isnull()].head()

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
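
As a small illustration of how the mask works, the sketch below builds a three-row sample from the rows shown earlier. The name `sample_df` is made up for this example. The mask is a boolean Series, so it can be summed to count nulls or inverted with `~` to select the non-null rows.

```python
import numpy as np
import pandas as pd

# Three rows taken from the head of the rock song data
sample_df = pd.DataFrame({
    "song_name": ["Caught Up in You", "Fantasy Girl", "Hold On Loosely"],
    "release_year": ["1982", np.nan, "1981"],
})

null_mask = sample_df["release_year"].isnull()

print(null_mask.tolist())  # [False, True, False]
print(null_mask.sum())     # 1 (True counts as 1)

# Rows where release_year is null, and rows where it is not
print(sample_df.loc[null_mask, "song_name"].tolist())   # ['Fantasy Girl']
print(sample_df.loc[~null_mask, "song_name"].tolist())  # ['Caught Up in You', 'Hold On Loosely']
```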


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we wanted to set all of the null values in `release_year` to 0.

With newer versions of `pandas`, in order to manipulate data in the original DataFrame, we have to use `.loc` while performing reassignment using a mask and an index.

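For example, the following chained indexing won't always work, because `df[row_mask]` may return a copy and the assignment then never reaches the original DataFrame: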
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
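
Here is a minimal, self-contained sketch of the `.loc` pattern on a toy DataFrame. The column names `year`, `plays`, and `flag` and the replacement values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: one missing year
toy_df = pd.DataFrame({
    "year": [1982.0, np.nan, 1975.0],
    "plays": [82, 3, 1],
    "flag": [1, 1, 1],
})

# Boolean mask that is True where 'year' is null
row_mask = toy_df["year"].isnull()

# Single-column assignment on the original DataFrame
toy_df.loc[row_mask, "year"] = 0

# Multiple-column assignment with the same mask
toy_df.loc[row_mask, ["plays", "flag"]] = -1

print(toy_df["year"].tolist())   # [1982.0, 0.0, 1975.0]
print(toy_df["plays"].tolist())  # [82, -1, 1]
print(toy_df["flag"].tolist())   # [1, -1, 1]
```

Computing the mask once and reusing it keeps both assignments aligned to the same rows, even after the first assignment has removed the nulls.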

#### 4.A Let's try it out. Make all of the null values in `release_year` 0.

In [63]:
# Store the null mask in a variable and assign through `.loc`; chained indexing can trigger a SettingWithCopyWarning
null_release_mask = rockfile_df['release_year'].isnull()

rockfile_df.loc[null_release_mask, 'release_year'] = 0

#### 4.B Verify that `release_year` contains no null values.

In [64]:
rockfile_df[rockfile_df['release_year'].isnull()]

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g


In [65]:
# You can also use the `fillna()` method to replace the NaN values

# rockfile_df['release_year'] = rockfile_df['release_year'].fillna(value=0)

In [66]:
# Check again for null values in the entire DataFrame
rockfile_df.isnull().sum()

song_name       0
artist          0
release_year    0
combined        0
first_song      0
year_count      0
play_count      0
f_g             0
dtype: int64

**Always reload your DataFrame to a fresh state when data munging.** 

*Data munging* is the process of transforming original data to more readable, usable and valid data. 

*Data wrangling* is the process of mapping data from the 'raw' format to another format with the intent of making it more appropriate for downstream processes, e.g. analytics.  

### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of *data munging*. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [67]:
rockfile_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2230 entries, 0 to 2229
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   song_name     2230 non-null   object
 1   artist        2230 non-null   object
 2   release_year  2230 non-null   object
 3   combined      2230 non-null   object
 4   first_song    2230 non-null   int64 
 5   year_count    2230 non-null   int64 
 6   play_count    2230 non-null   int64 
 7   f_g           2230 non-null   int64 
dtypes: int64(4), object(4)
memory usage: 139.5+ KB


The `release_year` column is a string data type when it should be an integer.

### 6. Investigate and clean up the `release_year` column.

#### 6.A Figure out what value(s) are causing the `release_year` column to be encoded as a string instead of an integer.

In [68]:
# You can view unique values
rockfile_df.release_year.unique()

array(['1982', 0, '1981', '1980', '1975', '2000', '2002', '1992', '1985',
       '1993', '1976', '1995', '1979', '1984', '1977', '1990', '1986',
       '1974', '2014', '1987', '1973', '2001', '1989', '1997', '1971',
       '1972', '1994', '1970', '1966', '1965', '1983', '1955', '1978',
       '1969', '1999', '1968', '1988', '1962', '2007', '1967', '1958',
       '1071', '1996', '1991', '2005', '2011', '2004', '2012', '2003',
       '1998', '2008', '1964', '2013', '2006', 'SONGFACTS.COM', '1963',
       '1961'], dtype=object)
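
Scanning the unique values by eye works for a short list. As an alternative sketch (not the approach used in this lab), `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into `NaN`, which makes the offending values easy to isolate. The Series `years` below is a hand-picked sample of the values shown above.

```python
import pandas as pd

# A handful of the values from the release_year column, including the 0 we inserted
years = pd.Series(["1982", 0, "1071", "SONGFACTS.COM", "1995"], dtype=object)

# Values that cannot be parsed as numbers become NaN
as_numbers = pd.to_numeric(years, errors="coerce")

# Select the original values at the positions that failed to parse
not_numeric = years[as_numbers.isnull()]

print(not_numeric.tolist())  # ['SONGFACTS.COM']
```

Note that this only finds values that are not numbers at all. A suspicious but numeric value such as `'1071'` parses without complaint and has to be caught another way.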

#### 6.B Look at the rows in which there is incorrect data in the `release_year` column.

In [69]:
# 'SONGFACTS.COM' is string not integer

# Assign slice and view row
release_slice = rockfile_df['release_year'] == "SONGFACTS.COM"
rockfile_df[release_slice]

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
1504,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


#### 6.C. Clean up the data. 

Normally we would replace the offending data with null `np.nan` values. However, we previously converted all of the `NaN` values in the `release_year` column to zeros, so we will continue with the same practice. Replacing the value with 0 also allows us to convert the column to a numeric type.

In [70]:
# Use the mask and the `.loc` indexer to replace the value
rockfile_df.loc[release_slice, 'release_year'] = 0

rockfile_df[release_slice]

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
1504,Bullfrog Blues,Rory Gallagher,0,Bullfrog Blues by Rory Gallagher,1,1,1,1


In [71]:
# Convert column to 'int' datatype
rockfile_df['release_year'] = rockfile_df['release_year'].map(lambda x: int(x))

In [72]:
rockfile_df.release_year.unique()

array([1982,    0, 1981, 1980, 1975, 2000, 2002, 1992, 1985, 1993, 1976,
       1995, 1979, 1984, 1977, 1990, 1986, 1974, 2014, 1987, 1973, 2001,
       1989, 1997, 1971, 1972, 1994, 1970, 1966, 1965, 1983, 1955, 1978,
       1969, 1999, 1968, 1988, 1962, 2007, 1967, 1958, 1071, 1996, 1991,
       2005, 2011, 2004, 2012, 2003, 1998, 2008, 1964, 2013, 2006, 1963,
       1961], dtype=int64)

In [73]:
rockfile_df.dtypes

song_name       object
artist          object
release_year     int64
combined        object
first_song       int64
year_count       int64
play_count       int64
f_g              int64
dtype: object

### 7. Get summary statistics for the `release_year` column using the `describe()` function.

Now that the `release_year` column finally has a numeric data type, we can apply the `describe()` function.  

#### 7.A Print out the summary stats for the `release_year` column. What is the earliest and latest release date?

In [74]:
# Earliest release year is 0 (our placeholder for missing values), latest release year is 2014
rockfile_df['release_year'].describe()

count    2230.000000
mean     1465.331390
std       867.196161
min         0.000000
25%         0.000000
50%      1973.000000
75%      1981.000000
max      2014.000000
Name: release_year, dtype: float64

#### 7.B Based on the summary statistics, is there anything else wrong with the `release_year` column? 

In [75]:
# The mean is quite low because about 26% of the values (577 of 2230) are 0
# See the unique values: 1071 is not a plausible release year
corrupt_year_slice = rockfile_df['release_year'] == 1071

rockfile_df[corrupt_year_slice]

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
547,Levon,Elton John,1071,Levon by Elton John,1,1,8,8


_Looking at the DataFrame row that contains the year 1071, we can see that the year was probably corrupted and should be replaced with the correct value if possible or 0._

In [76]:
rockfile_df.loc[corrupt_year_slice, 'release_year'] = 1971

In [77]:
rockfile_df.release_year.unique()

array([1982,    0, 1981, 1980, 1975, 2000, 2002, 1992, 1985, 1993, 1976,
       1995, 1979, 1984, 1977, 1990, 1986, 1974, 2014, 1987, 1973, 2001,
       1989, 1997, 1971, 1972, 1994, 1970, 1966, 1965, 1983, 1955, 1978,
       1969, 1999, 1968, 1988, 1962, 2007, 1967, 1958, 1996, 1991, 2005,
       2011, 2004, 2012, 2003, 1998, 2008, 1964, 2013, 2006, 1963, 1961],
      dtype=int64)

In [78]:
# The mean has not changed much! It is definitely low due to the 0 values replacing the NaNs
rockfile_df['release_year'].describe()

count    2230.000000
mean     1465.734978
std       867.221986
min         0.000000
25%         0.000000
50%      1973.000000
75%      1981.000000
max      2014.000000
Name: release_year, dtype: float64
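
The effect of the placeholder zeros on the mean is easy to reproduce on a small sample. The sketch below uses five made-up entries (three real years from the data plus two zeros) and the assumed names `years_with_zeros` and `years_with_nan`. Summary statistics in `pandas` skip `NaN` values by default, but they treat 0 as a genuine observation.

```python
import numpy as np
import pandas as pd

# Three real years plus two placeholder zeros
years_with_zeros = pd.Series([1982, 0, 1981, 1980, 0])

# The same data with the placeholders turned back into nulls
years_with_nan = years_with_zeros.replace(0, np.nan)

print(years_with_zeros.mean())  # 1188.6
print(years_with_nan.mean())    # 1981.0
```

This is the trade-off of filling nulls with 0: the column converts cleanly to an integer type, but every statistic computed on it has to account for the placeholder.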

### 8. Make changes and investigate using custom functions with `apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.


In [40]:
def release_date(df_row):
    print(df_row['song_name'], "by", df_row['artist'], ", release date earlier than 1970?", df_row['release_year'] < 1970)
    print('-------------------')


#### 8.B Using the `apply()` function, apply the function above to the first five rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument as `axis=1` indicates that the function should be applied to each row individually.

In [41]:
rockfile_df.head().apply(release_date, axis=1)

Caught Up in You by .38 Special , release date earlier than 1970? False
-------------------
Fantasy Girl by .38 Special , release date earlier than 1970? True
-------------------
Hold On Loosely by .38 Special , release date earlier than 1970? False
-------------------
Rockin' Into the Night by .38 Special , release date earlier than 1970? False
-------------------
Art For Arts Sake by 10cc , release date earlier than 1970? False
-------------------


0    None
1    None
2    None
3    None
4    None
dtype: object

You'll notice that there will be a final output Series of `None` values. The `apply()` function, when a return value is not specified, will return a Series of `None` values (similar to how the default return for Python functions is `None` when a return statement is not specified).
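
If the function returns a value instead of printing, `apply()` collects the results into a Series that can be stored as a new column. The sketch below assumes a hypothetical function `released_before_1970` and a three-row sample named `songs`, built from rows shown earlier.

```python
import pandas as pd

# Three rows from the cleaned data (0 is the placeholder for a missing year)
songs = pd.DataFrame({
    "song_name": ["Caught Up in You", "Fantasy Girl", "Art For Arts Sake"],
    "artist": [".38 Special", ".38 Special", "10cc"],
    "release_year": [1982, 0, 1975],
})


def released_before_1970(df_row):
    # Return the comparison instead of printing it
    return df_row["release_year"] < 1970


# axis=1 passes one row at a time; the return values form a boolean Series
flags = songs.apply(released_before_1970, axis=1)
print(flags.tolist())  # [False, True, False]

# For a simple comparison like this, the vectorized form gives the same result
vectorized = songs["release_year"] < 1970
print(vectorized.tolist())  # [False, True, False]
```

The `True` for "Fantasy Girl" comes from the placeholder 0, not from a real release year.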

In [42]:
# Confirm function output with original DataFrame
rockfile_df.head()

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,0,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1


### 9. Write a function that converts cells in a DataFrame to float and otherwise replaces them with `np.nan`.

If applied to our data, it would keep only the numeric information and otherwise input null values.

Recall that the try-except syntax in Python is a great way to try something and take another action if the initial step fails:

```python
try:
    value = float("SONGFACTS.COM")  # Perform some action.
except ValueError:
    value = None  # Perform some other action if the first failed with an error.
```

#### 9.A Write a function that takes a column and converts all of its values to float if possible and `np.nan` otherwise. The return value should be the converted Series.

In [None]:
# OPTIONAL! Use this function in case of unwanted punctuation
def convert_punct(df_col):
    # regex=False treats '$' as a literal character, not the end-of-string anchor
    return pd.to_numeric(df_col.str.replace('$', '', regex=False), errors='coerce')


In [87]:
# This function converts a value to float, or to NaN if the conversion fails
def converter_helper(value):
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan


# Call this function on a column (Series); it maps 'converter_helper' over every value
def col_to_float(df_col):
    return df_col.map(converter_helper)
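
Before running the converter on the whole DataFrame, it can help to try the same idea on a small Series. The sketch below uses a hypothetical helper `to_float_or_nan` and a made-up sample `mixed`. It also compares the result with `pd.to_numeric(..., errors="coerce")`, which gives the same answer on this sample.

```python
import numpy as np
import pandas as pd


def to_float_or_nan(value):
    # float() raises ValueError for text like 'Levon' and TypeError for None
    try:
        return float(value)
    except (ValueError, TypeError):
        return np.nan


# Mixed sample: digit-only strings, ordinary text, a missing value, and an int
mixed = pd.Series(["1979", "45", "Levon", None, 3])

mapped = mixed.map(to_float_or_nan)
coerced = pd.to_numeric(mixed, errors="coerce")

print(mapped.tolist())   # [1979.0, 45.0, nan, nan, 3.0]
print(coerced.tolist())  # [1979.0, 45.0, nan, nan, 3.0]
```

Strings made up only of digits, such as `"1979"` and `"45"`, convert successfully, which is worth keeping in mind when the function is applied to text columns.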


#### 9.B Try your function out on the rock song data and ensure the output is what you expected.


In [89]:
rockfile_df = rockfile_df.apply(col_to_float)

In [90]:
# All values that cannot be parsed as numbers are converted to NaN
rockfile_df.head(10)

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
0,,,1982.0,,1.0,1.0,82.0,82.0
1,,,0.0,,1.0,0.0,3.0,0.0
2,,,1981.0,,1.0,1.0,85.0,85.0
3,,,1980.0,,1.0,1.0,18.0,18.0
4,,,1975.0,,1.0,1.0,1.0,1.0
5,,,2000.0,,1.0,1.0,13.0,13.0
6,,,2000.0,,1.0,1.0,1.0,1.0
7,,,2002.0,,1.0,1.0,6.0,6.0
8,,,1992.0,,1.0,1.0,3.0,3.0
9,,,1985.0,,1.0,1.0,1.0,1.0


#### 9.C Describe the new float-only DataFrame.

In [83]:
rockfile_df.describe()

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
count,2.0,0.0,2230.0,0.0,2230.0,2230.0,2230.0,2230.0
mean,1012.0,,1465.734978,,1.0,0.741256,16.872646,15.04843
std,1367.544515,,867.221986,,0.0,0.438043,25.302972,25.288366
min,45.0,,0.0,,1.0,0.0,0.0,0.0
25%,528.5,,0.0,,1.0,0.0,1.0,0.0
50%,1012.0,,1973.0,,1.0,1.0,4.0,3.0
75%,1495.5,,1981.0,,1.0,1.0,21.0,18.0
max,1979.0,,2014.0,,1.0,1.0,142.0,142.0


In [84]:
# How does 'song_name' column have any statistics at all???
rockfile_df['song_name'].value_counts()

45.0      1
1979.0    1
Name: song_name, dtype: int64

In [86]:
rockfile_df[rockfile_df['song_name'] == 1979]

Unnamed: 0,song_name,artist,release_year,combined,first_song,year_count,play_count,f_g
1577,1979.0,,1995.0,,1.0,1.0,3.0,3.0


In [None]:
# Some original song names consist only of digits (e.g. '1979'), so float() converts them. Check the original DataFrame!