<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Data Manipulation, EDA, and Reporting Results

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

---

> **This lab is intentionally open-ended, and you're encouraged to answer your own questions about the dataset!**


### What makes a song a hit?

On next week's episode of the 'Are You Entertained?' podcast, we're going to be analyzing the latest generation's guilty pleasure: the music of the '00s.

Our data scientists have pored over Billboard chart data to analyze what made a hit soar to the top of the charts and how long it stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry: why do we like what we like?

**Provide (at least) a markdown cell explaining your key learnings about top hits: what are they, what common themes are there, is there a trend among artists (type of music)?**

---

### Minimum Requirements

**At a minimum, you must:**

- Use Pandas to read in your data
- Rename column names where appropriate
- Describe your data: check the value counts and descriptive statistics
- Make use of groupby statements
- Utilize Boolean filtering (masks)
- Assess the validity of your data (missing data, distributions?)

**You should strive to:**

- Produce a blog-post ready description of your lab
- State your assumptions about the data
- Describe limitations
- Consider how you can action this from a stakeholder perspective (radio, record label, fan)
- Include visualizations

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Billboard data CSV:
billboard_csv = '../datasets/billboard.csv'

----

##### Use Pandas to read in your data

In [None]:
import numpy as np
import pandas as pd

music = pd.read_csv(billboard_csv, encoding='latin-1')

In [None]:
print(len(music))
# The dataset has 317 rows (one per track)

In [None]:
# You can run this cell if you don't believe me
music.isnull().sum()

Given that there are 317 observations in this dataset and weeks 66-76 each contain 317 nulls, it is safe to drop those weeks: they contain no information.
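The reasoning above (a column whose null count equals the number of rows carries no information) can be sketched on a small toy frame. The frame `toy` and its values are made up for illustration; only the column-name pattern mirrors the lab.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Billboard frame (hypothetical values)
toy = pd.DataFrame({
    'track': ['A', 'B', 'C'],
    'week 1': [78.0, 15.0, 71.0],
    'week 2': [63.0, np.nan, 48.0],
    'week 3': [np.nan, np.nan, np.nan],  # no song charted this long
})

# Columns whose null count equals the row count are entirely empty
all_null = toy.columns[toy.isnull().sum() == len(toy)].tolist()

# dropna(axis=1, how='all') removes exactly those columns
trimmed = toy.dropna(axis=1, how='all')

print(all_null)
print(trimmed.columns.tolist())
```

Using `how='all'` avoids typing the column names by hand, at the cost of being less explicit about which weeks are removed.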

##### Rename column names where appropriate

In [None]:
# Build readable week labels: 'week 1' through 'week 76'
week_list = ['week ' + str(week) for week in range(1, 77)]

`week_list` holds labels of the form 'week x', which replace the awkward original names of the weekly columns.

Other than the weeks, the features are:
['year', 'artist.inverted', 'track', 'time', 'genre', 'date.entered', 'date.peaked']

I'm going to change them to:
['year','artist','track', 'length','genre','first_apperence','peak_date']
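As an alternative to overwriting every column name by position, the same renaming can be sketched with an explicit mapping passed to `DataFrame.rename`. The raw weekly column names below (`raw_week_a`, `raw_week_b`) are hypothetical placeholders, since the original weekly names are not listed here; the metadata names are the ones given above.

```python
import pandas as pd

# Empty toy frame with the original metadata names plus two
# hypothetical raw week columns (placeholder names, not the real ones)
toy = pd.DataFrame(columns=[
    'year', 'artist.inverted', 'track', 'time', 'genre',
    'date.entered', 'date.peaked', 'raw_week_a', 'raw_week_b',
])

# Explicit old -> new mapping for the metadata columns that change
rename_map = {
    'artist.inverted': 'artist',
    'time': 'length',
    'date.entered': 'first_apperence',
    'date.peaked': 'peak_date',
}

# Everything after the 7 metadata columns is a weekly column
n_meta = 7
week_cols = toy.columns[n_meta:]
rename_map.update(
    {old: 'week ' + str(i) for i, old in enumerate(week_cols, start=1)}
)

renamed = toy.rename(columns=rename_map)
print(renamed.columns.tolist())
```

A mapping makes each rename visible and fails less silently than positional assignment if the column order ever changes.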

In [None]:
names = ['year','artist','track', 'length','genre','first_apperence','peak_date']

# Append the 76 weekly labels after the 7 metadata names
names.extend(week_list)

In [None]:
# Renaming the columns
music.columns = names

In [None]:
# Dropping weeks 66-76, which contain only null values.
music.drop(['week 66','week 67','week 68','week 69','week 70','week 71',
           'week 72','week 73','week 74','week 75','week 76'],
           axis=1, inplace=True)

In [None]:
# Columns at positions 7-71 are the 65 weekly rating columns ('week 1' to 'week 65')
# Creates a dataframe column called 'weeks_active': the number of weeks the song was on the top 100
music['weeks_active'] = music[music.columns[7:72]].count(axis=1)


# There are 65 possible weeks a song can be active; count(axis=1) returns the number of non-null values in each row
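To make the behaviour of `count(axis=1)` concrete, here is a minimal sketch on a toy frame (the values are made up). It counts the non-null entries in each row, which is the complement of `isnull().sum(axis=1)`.

```python
import numpy as np
import pandas as pd

# Two hypothetical songs: one charted for 3 weeks, one for a single week
toy = pd.DataFrame({
    'week 1': [78.0, 15.0],
    'week 2': [63.0, np.nan],
    'week 3': [49.0, np.nan],
})

weeks_active = toy.count(axis=1)           # non-null entries per row
weeks_missing = toy.isnull().sum(axis=1)   # null entries per row

print(weeks_active.tolist())
print(weeks_missing.tolist())
```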

##### Describe your data: check the value counts and descriptive statistics

In [None]:
music.describe()
#not exactly the most useful


In [None]:
# Which artists appeared on the chart most often in 2000?
music['artist'].value_counts().head(10)

In [None]:
# Wow, sure looks like Country and HipHop dominate the chart.
# Let's see how the songs that make it to the top 100 are distributed across genres

music['genre'].value_counts()

In [None]:
# Getting a dataframe that will only be used for a plot
plot_frame = music[['genre','weeks_active']]

Next, plot the distribution of weeks spent on the top 100 for each genre.


In [None]:
# Overlaying the distributions of song lifetimes (weeks active), one color per genre
# Note: on matplotlib >= 3.6 the 'seaborn' style is named 'seaborn-v0_8'
plt.style.use('seaborn')
genres = ['Rock','Country','Rap','R&B']

for m_type in genres:
    sub_df = plot_frame[plot_frame['genre'] == m_type]
    # distplot is deprecated since seaborn 0.11 (histplot/kdeplot replace it)
    sns.distplot(sub_df['weeks_active'], label=m_type)

plt.legend()
plt.show()

- Green = Country
- Blue = Rock
- Red = Rap
- Purple = R&B
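A numeric summary can accompany the overlaid distributions. The sketch below uses `groupby` with `agg` on a toy frame; the values are invented, and only the column names `genre` and `weeks_active` come from the lab.

```python
import pandas as pd

# Hypothetical weeks-on-chart values for a handful of songs
toy = pd.DataFrame({
    'genre': ['Rock', 'Rock', 'Country', 'Rap', 'Rap'],
    'weeks_active': [20, 30, 10, 5, 15],
})

# Song count, mean and median chart life per genre, longest-lived first
summary = (toy.groupby('genre')['weeks_active']
              .agg(['count', 'mean', 'median'])
              .sort_values('mean', ascending=False))

print(summary)
```

Reporting the count next to the mean helps flag genres where the average rests on only a few songs.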



##### Assessing Validity of Data

In [None]:
# Making a DF copy where all the Nulls are filled with 0's
music2 = music.fillna(value = 0)

In [None]:
# Engineered column displaying a song's average weekly rating.
# Sum only the 65 week columns (positions 7-71); position 72 is 'weeks_active'.
music['avg_weekly_rating'] = music2[music2.columns[7:72]].sum(axis=1) / music2['weeks_active']

In [None]:
# Before I continue, let's check out the average weekly rating grouped by genre
music.groupby(['genre'])['avg_weekly_rating'].mean()

In [None]:
# Median weekly rating over the 65 week columns only (nulls are skipped)
music['median_weekly_rating'] = music[music.columns[7:72]].median(axis=1)

In [None]:
# Plotting the weekly mean rating by weeks active.  Color is genre
sns.lmplot(x = 'weeks_active',y ='avg_weekly_rating', hue = 'genre', data = music, fit_reg=False);

In [None]:
# Plotting the median rating as well, to see whether the medians and means differ substantially.
sns.lmplot(x='weeks_active', y='median_weekly_rating', hue = 'genre', data = music, fit_reg=False);

##### Stakeholder Insights
As the visualizations show, many songs hover around 20 weeks active. A record label would benefit from aiming to keep a song in the top 100 for more than 20 weeks: songs that stay active longer than 20 weeks receive better ratings on average (a lower number means a higher chart position).
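The 20-week comparison can be checked with a Boolean mask. This is a minimal sketch on invented numbers, so it illustrates the technique rather than the actual result; the column names match the ones engineered earlier.

```python
import pandas as pd

# Hypothetical songs: weeks on the chart and average weekly rating
toy = pd.DataFrame({
    'weeks_active': [5, 12, 20, 26, 40],
    'avg_weekly_rating': [88.0, 75.0, 60.0, 35.0, 22.0],
})

# Boolean mask: True for songs active longer than 20 weeks
long_run = toy['weeks_active'] > 20

# Lower rating = better chart position
mean_long = toy.loc[long_run, 'avg_weekly_rating'].mean()
mean_short = toy.loc[~long_run, 'avg_weekly_rating'].mean()

print(mean_long, mean_short)
```

On the real data, it would also be worth reporting how many songs fall on each side of the threshold before comparing the means.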

##### Yearly Insights
Rock dominates the songs with long lives on the top 100, and those same songs also have the best weekly ratings.

##### Artist Insights
While Rock seems to have the most songs in the top 100, the longest chart life, and the best average weekly rating, no rock group appears among the 10 most frequent artists.

##### Data Limitations
It would be nice to have additional years of information.

---

**Song Life Plotter**  
Calling the function below plots the chart trajectory of any song. The lookup is case sensitive, so make sure the song title is typed exactly as it appears in the dataframe.

There is an example below.

In [None]:
def song_life(title):
    # Note: on matplotlib >= 3.6 the 'seaborn' style is named 'seaborn-v0_8'
    plt.style.use('seaborn')

    # Grab the 65 weekly rating columns for the matching track (first match only)
    samp = music.loc[music['track'] == title, music.columns[7:72]].iloc[0]

    # Drop the weeks with no rating
    samp = samp.dropna()

    # Number the remaining weeks 1..n for the x-axis
    weeks = np.arange(1, len(samp) + 1)

    # Plot it!
    plt.plot(weeks, samp.values)
    plt.xlabel('Weeks in Top 100')
    plt.ylabel('Weekly Rating')
    # Invert the y-axis so that rating 1 (the top of the chart) sits at the top
    plt.gca().invert_yaxis()



In [None]:
song_life('Kryptonite')

**Additional Ideas/Actions**
- Time it takes a song to get to the top.
- Converting song time to an actual time (from a string)
- Weekly Average Rating
- How many weeks a song was number 1 
- Get the correct genre for each artist
- Word that appears most often in song names
- Entry level rating of Genre
- Time to peak grouped by Genre
- Visualize the lifecycle of a song (average)
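
Two of the ideas above, converting the song length from a string and measuring the time to peak, can be sketched as follows. The string formats are assumptions: lengths written as `m:ss` and dates as `YYYY-MM-DD`. The toy values are invented, and the helper `to_seconds` is hypothetical.

```python
import pandas as pd

# Hypothetical rows using the renamed columns from this lab
toy = pd.DataFrame({
    'length': ['3:38', '4:05'],
    'first_apperence': ['2000-02-19', '2000-09-23'],
    'peak_date': ['2000-04-08', '2000-11-18'],
})

def to_seconds(text):
    """Convert an 'm:ss' string to a number of seconds (assumed format)."""
    minutes, seconds = text.split(':')[:2]
    return int(minutes) * 60 + int(seconds)

toy['length_sec'] = toy['length'].apply(to_seconds)

# Parse the date strings, then take the difference in whole days
entered = pd.to_datetime(toy['first_apperence'])
peaked = pd.to_datetime(toy['peak_date'])
toy['days_to_peak'] = (peaked - entered).dt.days

print(toy[['length_sec', 'days_to_peak']])
```

With `days_to_peak` in place, the "time to peak grouped by genre" idea reduces to a `groupby('genre')` on that column.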