# Project 2
## Step 1: Exploring your data.

##### Load your data using Pandas and start to explore. Save all of your early exploration code here and include it in your final submission.


In [28]:
import pandas as pd
import numpy as np

df = pd.read_csv('../../assets/billboard.csv')

# Look for signs that the student has investigated the dataframe from multiple dimensions. 
#Here are a few examples of things they should be looking at.
df.columns

df['year'].unique()

df.dtypes

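# 'time' holds 'm:ss' strings, so it has no numeric mean until it is converted;
# describe() summarises the raw strings instead
df['time'].describe()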

# pd.lib is no longer available in current pandas; infer_dtype lives in pd.api.types
df.apply(lambda x: pd.api.types.infer_dtype(x.values))

year                integer
artist.inverted      string
track                string
time                 string
genre                string
date.entered         string
date.peaked          string
x1st.week           integer
x2nd.week          floating
x3rd.week          floating
x4th.week          floating
x5th.week          floating
x6th.week          floating
x7th.week          floating
x8th.week          floating
x9th.week          floating
x10th.week         floating
x11th.week         floating
x12th.week         floating
x13th.week         floating
x14th.week         floating
x15th.week         floating
x16th.week         floating
x17th.week         floating
x18th.week         floating
x19th.week         floating
x20th.week         floating
x21st.week         floating
x22nd.week         floating
x23rd.week         floating
                     ...   
x47th.week         floating
x48th.week         floating
x49th.week         floating
x50th.week         floating
x51st.week         f

##### Create a data dictionary for the data set.

Variable | Description | Type of Variable
---| ---| ---
`year` | Year the track charted | ordinal
`artist.inverted` | Artist name | categorical
`track` | Track name | categorical
`time` | Track length in m:ss format | continuous
`genre` | Musical classification | categorical
`date.entered` | Date the track entered the charts | ordinal
`date.peaked` | Date the track peaked on the charts | ordinal
`x1st.week` ... `x76th.week` | Chart ranking of the track in each week | ordinal
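
The `time` and date columns are read in as strings (see the inferred dtypes above), so they need converting before any arithmetic on them. The following is a minimal sketch of one way to do that. It uses a hypothetical two-row sample rather than the real file, and the derived column names `time_seconds` and `days_to_peak` are illustrative assumptions, not part of the data set.

```python
import pandas as pd

# Hypothetical two-row sample that reuses the column names from the data dictionary.
sample = pd.DataFrame({
    "time": ["4:22", "3:15"],
    "date.entered": ["2000-02-26", "2000-09-02"],
    "date.peaked": ["2000-03-11", "2000-09-09"],
})

# Split 'm:ss' into minutes and seconds, then combine them into total seconds.
parts = sample["time"].str.split(":", expand=True).astype(int)
sample["time_seconds"] = parts[0] * 60 + parts[1]

# Parse the date strings into datetime64 values.
for col in ["date.entered", "date.peaked"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

# With real dtypes in place, numeric summaries and date arithmetic become possible.
sample["days_to_peak"] = (sample["date.peaked"] - sample["date.entered"]).dt.days
mean_seconds = sample["time_seconds"].mean()
```

Once `time` is numeric, calling `.mean()` on it is meaningful, which is not the case for the raw `m:ss` strings.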

##### Write a brief description of your data, and any interesting observations you've made thus far. 

In [29]:
# Expect the student to describe the data, relationships between columns, notes on the shape, etc.
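
A written description is easier to back up with a few computed facts. The sketch below shows the kind of summary a student might produce. It runs on a small hypothetical frame with made-up artists and ranks, not on the real Billboard data, and only borrows the column naming pattern from the file.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format sample; the artists and ranks are made up for illustration.
wide = pd.DataFrame({
    "artist.inverted": ["Artist A", "Artist B", "Artist C"],
    "genre": ["Rap", "R&B", "Rap"],
    "x1st.week": [87, 91, 60],
    "x2nd.week": [82.0, 87.0, np.nan],
    "x3rd.week": [72.0, np.nan, np.nan],
})

# Shape: number of tracks and number of columns.
n_rows, n_cols = wide.shape

# How the tracks are distributed across genres.
genre_counts = wide["genre"].value_counts()

# Share of missing values per week column; NaN here means "not on the chart that week".
week_cols = [c for c in wide.columns if c.endswith(".week")]
missing_share = wide[week_cols].isnull().mean()

# Number of weeks each track spent on the chart.
weeks_on_chart = wide[week_cols].notnull().sum(axis=1)
```

In this layout a rising share of missing values in later week columns is expected, since few tracks stay on the chart for long.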

## Step 2: Clean your data.

##### Do some rudimentary cleaning. Rename any columns that are poorly named, shorten any strings that may be too long, fill in any missing values. 

In [42]:
# Here is some of the more obvious, accessible cleaning you can expect from students.
# This is not exhaustive, but it is enough to get through the project.

# Rename artist.inverted to artist
df.rename(columns={'artist.inverted': 'artist'}, inplace=True)

# Remove all non-ascii characters
import re
df['artist'] = df['artist'].apply(lambda x: re.sub(r'[^\x00-\x7f]',r'',x))
df['track'] = df['track'].apply(lambda x: re.sub(r'[^\x00-\x7f]',r'',x))

# Rename the 76 week columns to week1 ... week76.
# The first 7 columns hold the track metadata and keep their names.
names = ['week' + str(i) for i in range(1, 77)]
df.columns = df.columns[:7].tolist() + names

# Shorten track names that are too long
df.track = df.track.apply(lambda x: x[:20])
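
The positional rename above relies on the column order and count being exactly as expected. As an alternative sketch, the week number can be parsed out of each original name instead. The helper `week_name` and the toy frame below are hypothetical, introduced only for illustration; the string cleaning uses pandas' vectorised `.str` methods in place of `apply`.

```python
import re
import pandas as pd

# Hypothetical sample with the original naming pattern (x1st.week, x2nd.week, ...).
toy = pd.DataFrame({
    "artist.inverted": ["Beyonc\u00e9", "2 Pac"],
    "track": ["A Hypothetical Track Title That Runs Long", "Short"],
    "x1st.week": [87, 91],
    "x2nd.week": [82.0, None],
    "x22nd.week": [None, None],
})

def week_name(col):
    """Map 'x22nd.week' to 'week22'; leave any other column name unchanged."""
    m = re.fullmatch(r"x(\d+)(?:st|nd|rd|th)\.week", col)
    return "week" + m.group(1) if m else col

# Rename by pattern rather than by position.
toy = toy.rename(columns={"artist.inverted": "artist"}).rename(columns=week_name)

# Strip non-ASCII characters and truncate to 20 characters.
for col in ["artist", "track"]:
    toy[col] = toy[col].str.replace(r"[^\x00-\x7f]", "", regex=True).str.slice(0, 20)
```

Renaming by pattern keeps working if columns are reordered or a metadata column is added, at the cost of a slightly longer helper.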


##### Using Pandas' built-in `melt` function, pivot the weekly ranking data to be long rather than wide. As a result, you will have removed the 76 'week' columns and replaced them with two: Week and Ranking. There will now be multiple entries for each song, one for each week on the Billboard rankings.

In [43]:
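# The week columns to be melted (everything after the first 7 metadata columns)
df.columns[7:]
# Melt them into 'week'/'rank' pairs and drop the weeks in which a track did not chart
df = pd.melt(df, id_vars=df.columns[:7].values, var_name='week', value_name='rank').dropna(subset=['rank'])
# Display the result grouped by track
df.sort_values(by=['year','artist','track'])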


Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,week,rank
246,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week1,87
563,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week2,82
880,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week3,72
1197,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week4,77
1514,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week5,87
1831,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week6,94
2148,2000,2 Pac,Baby Don't Cry (Keep,4:22,Rap,2000-02-26,2000-03-11,week7,99
287,2000,2Ge+her,The Hardest Part Of,3:15,R&B,2000-09-02,2000-09-09,week1,91
604,2000,2Ge+her,The Hardest Part Of,3:15,R&B,2000-09-02,2000-09-09,week2,87
921,2000,2Ge+her,The Hardest Part Of,3:15,R&B,2000-09-02,2000-09-09,week3,92
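
To make the wide-to-long reshaping concrete, here is a small self-contained sketch of the same `melt` call on a toy frame. The ranks for weeks 1 to 3 are borrowed from the output above; the missing `week4` value for the second track is an assumption added only to show what `dropna` removes. Converting `week` to an integer is an optional extra step, not something the solution above does.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame; the NaN in week4 is an illustrative assumption.
wide = pd.DataFrame({
    "artist": ["2 Pac", "2Ge+her"],
    "track": ["Baby Don't Cry (Keep", "The Hardest Part Of"],
    "week1": [87, 91],
    "week2": [82.0, 87.0],
    "week3": [72.0, 92.0],
    "week4": [77.0, np.nan],
})

# One row per (track, week); weeks with no ranking are dropped.
tidy = pd.melt(
    wide, id_vars=["artist", "track"], var_name="week", value_name="rank"
).dropna(subset=["rank"])

# Turn 'week7' into the integer 7 so weeks sort numerically rather than as text.
tidy["week"] = tidy["week"].str.replace("week", "", regex=False).astype(int)

tidy = tidy.sort_values(["artist", "track", "week"]).reset_index(drop=True)
```

Sorting on a numeric `week` avoids the text ordering in which `week10` would come before `week2`.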


## Step 3: Visualize your data.

##### Using a plotting utility of your choice, create visualizations that will provide context to your data. There is no minimum or maximum number of graphs you should generate, but there should be a clear and consistent story being told. Give insights to the distribution, statistics, and relationships of the data. 

In [None]:
# Students should be building histograms to investigate distribution, box plots to determine ranges and identify outliers, and scatter plots to analyse relationships. 
# Expect thorough analyses of the graphs, written eloquently. 
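
As one possible starting point, the sketch below draws a histogram, a box plot, and a line plot, assuming matplotlib is the plotting utility of choice. It uses a small hypothetical long-format frame with placeholder track names (`Track A`, `Track B`), not the real data, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format sample: one row per track per week.
tidy = pd.DataFrame({
    "track": ["Track A"] * 4 + ["Track B"] * 3,
    "genre": ["Rap"] * 4 + ["R&B"] * 3,
    "week": [1, 2, 3, 4, 1, 2, 3],
    "rank": [87, 82, 72, 77, 91, 87, 92],
})

fig, (ax_hist, ax_box, ax_line) = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of weekly rankings.
ax_hist.hist(tidy["rank"], bins=10)
ax_hist.set_xlabel("rank")
ax_hist.set_ylabel("count")

# Box plot: range and outliers of rankings per genre.
labels, groups = zip(*[(name, g["rank"].values) for name, g in tidy.groupby("genre")])
ax_box.boxplot(groups)
ax_box.set_xticks(range(1, len(labels) + 1))
ax_box.set_xticklabels(labels)
ax_box.set_ylabel("rank")

# Line plot: how each track's rank changes from week to week.
for name, g in tidy.groupby("track"):
    ax_line.plot(g["week"], g["rank"], marker="o", label=name)
ax_line.invert_yaxis()  # rank 1 is the best position, so show it at the top
ax_line.set_xlabel("week")
ax_line.set_ylabel("rank")
ax_line.legend()

fig.tight_layout()
```

Inverting the rank axis is a design choice: it makes a climb up the chart read as a rising line.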

## Step 4: Create a Problem Statement.

##### Having explored the data, come up with a problem statement for this data set. Feel free to introduce data from any other source to support your problem statement; just be sure to provide a link to the origin of the data. Once again, be creative!

## Step 5: Brainstorm your Approach.
##### In bullet-list form, provide a proposed approach for evaluating your problem statement. This can be somewhat high-level, but start to think about ways you can massage the data for maximum efficacy. 

In [None]:
# You should expect students to lay out their plans for organizing and mutating the data, all the way through presenting results. 
# A good answer should be at least 5 bullet points, written in complete sentences.

## Step 6: Create a blog post with your code snippets and visualizations.
##### Data science is a growing field, and the tech industry thrives on collaboration and knowledge sharing. Blogging is a powerful means of moving our field forward. Using your blogging platform of choice, create a post describing each of the 5 steps above. Rather than writing a procedural text, imagine you're describing the data, visualizations, and conclusions you've arrived at to your peers. Aim for roughly 500 words.

## BONUS
##### The Content Managers working for the Podcast Publishing Company have recognized you as a thought leader in your field. They've asked you to pen a white paper (minimum 500 words) on the subject of 'What It Means To Have Clean Data'. This will be an opinion piece read by a wide audience, so be sure to back up your statements with real world examples or scenarios.

##### Hint: To get started, look around on the internet for articles, blog posts, papers, YouTube videos, podcasts, Reddit discussions, or anything else that will help you understand the challenges and implications of dealing with big data. This should be a personal reflection on everything you've learned this week, and the learning goals that have been set out for you going forward.