# Lambda School Data Science - Logistic Regression

Logistic regression is the baseline for classification models, as well as a handy way to predict probabilities (since those too live in the unit interval). While relatively simple, it is also the foundation for more sophisticated classification techniques such as neural networks (many of which can effectively be thought of as networks of logistic models).

## Lecture - Where Linear goes Wrong
### Return of the Titanic 🚢

You've likely already explored the rich dataset that is the Titanic - let's use regression and try to predict survival with it. The data is [available from Kaggle](https://www.kaggle.com/c/titanic/data), so we'll also play a bit with [the Kaggle API](https://github.com/Kaggle/kaggle-api).

In [1]:
!pip install kaggle

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/9e/94/5370052b9cbc63a927bda08c4f7473a35d3bb27cc071baa1a83b7f783352/kaggle-1.5.1.1.tar.gz (53kB)
    100% |████████████████████████████████| 61kB 2.1MB/s ta 0:00:01
Collecting urllib3<1.23.0,>=1.15 (from kaggle)
  Downloading https://files.pythonhosted.org/packages/63/cb/6965947c13a94236f6d4b8223e21beb4d576dc72e8130bd7880f600839b8/urllib3-1.22-py2.py3-none-any.whl (132kB)
    100% |████████████████████████████████| 133kB 3.9MB/s ta 0:00:01
Collecting python-slugify (from kaggle)
  Downloading https://files.pythonhosted.org/packages/1f/9c/8b07d625e9c9df567986d887f0375075abb1923e49d074a7803cd1527dae/python-slugify-2.0.1.tar.gz
Collecting Unidecode>=0.04.16 (from python-slugify->kaggle)
  Downloading https://files.pythonhosted.org/packages/31/39/53096f9217b057cb049fe872b7fc7ce799a1a89b76cf917d9639e7a558b5/Unidecode-1.0.23-py2.py3-none-any.whl (237kB)
    100% |██████████████████████████

In [3]:
# Note - you'll also have to sign up for Kaggle and authorize the API
# https://github.com/Kaggle/kaggle-api#api-credentials

# This essentially means uploading a kaggle.json file
# For Colab we can have it in Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
%env KAGGLE_CONFIG_DIR=/home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit_2/Advanced_Regression_Week3/

# You also have to join the Titanic competition to have access to the data
!kaggle competitions download -c titanic

env: KAGGLE_CONFIG_DIR=/home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit_2/Advanced_Regression_Week3/
Downloading train.csv to /home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit_2/Advanced_Regression_Week3
  0%|                                               | 0.00/59.8k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 59.8k/59.8k [00:00<00:00, 6.37MB/s]
Downloading test.csv to /home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit_2/Advanced_Regression_Week3
  0%|                                               | 0.00/28.0k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 28.0k/28.0k [00:00<00:00, 11.9MB/s]
Downloading gender_submission.csv to /home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit_2/Advanced_Regression_Week3
  0%|                                               | 0.00/3.18k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 3.18k/3.18k [00:00<00:00, 1.92MB/s]


In [4]:
# How would we try to do this with linear regression?
import pandas as pd

train_df = pd.read_csv('train.csv').dropna()
test_df = pd.read_csv('test.csv').dropna()  # Unlabeled, for Kaggle submission

train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [5]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,183.0,183.0,183.0,183.0,183.0,183.0,183.0
mean,455.36612,0.672131,1.191257,35.674426,0.464481,0.47541,78.682469
std,247.052476,0.470725,0.515187,15.643866,0.644159,0.754617,76.347843
min,2.0,0.0,1.0,0.92,0.0,0.0,0.0
25%,263.5,0.0,1.0,24.0,0.0,0.0,29.7
50%,457.0,1.0,1.0,36.0,0.0,0.0,57.0
75%,676.0,1.0,1.0,47.5,1.0,1.0,90.0
max,890.0,1.0,3.0,80.0,3.0,4.0,512.3292
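
Note the `count` of 183 even though `PassengerId` runs up to 890: calling `dropna()` with no arguments discards every row that has a missing value in any column, including columns the model never uses. The following is a minimal sketch on made-up rows (the toy frame and its values are assumptions, not Titanic data) showing how restricting the check with `subset` keeps more rows.

```python
import pandas as pd

# Made-up rows: Cabin is missing far more often than Age.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0],
    'Age': [38.0, None, 26.0, 35.0],
    'Cabin': ['C85', None, None, None],
})

# dropna() with no arguments drops a row if ANY column is missing.
strict = toy.dropna()

# Restricting the check to the columns the model needs keeps more rows.
lenient = toy.dropna(subset=['Age'])

print(len(toy), len(strict), len(lenient))
```

Whether the stricter or the looser filter is appropriate depends on which columns end up as features.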


In [6]:
from sklearn.linear_model import LinearRegression

X = train_df[['Pclass', 'Age', 'Fare']]
y = train_df.Survived

linear_reg = LinearRegression().fit(X, y)
linear_reg.score(X, y)

0.08389810726550917
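
A word on what this number means: for a regressor, scikit-learn's `score` returns R² (the coefficient of determination), while for a classifier it returns accuracy, so the two are not directly comparable. Here is a small sketch of both metrics on made-up labels and predictions; the helper names `r_squared` and `accuracy` are illustrative, not library functions.

```python
# Toy labels and real-valued predictions (made up for illustration only).
y_true = [0, 0, 1, 1]
y_pred = [-0.1, 0.3, 0.7, 1.1]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of labels recovered after thresholding the predictions."""
    hard = [1 if p >= threshold else 0 for p in y_pred]
    return sum(h == t for h, t in zip(hard, y_true)) / len(y_true)

r2 = r_squared(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(r2, acc)
```

The same predictions can look mediocre under R² and perfect under accuracy, which is why the metric has to be named alongside the score.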

In [7]:
linear_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([0.79543117, 0.58610823, 0.67595121, 0.793829  , 0.62090522,
       0.72542107, 0.59848968, 0.58734245, 0.48567063, 0.77627736,
       0.84211887, 0.57052439, 0.7754689 , 0.96621114, 0.70287941,
       0.57673837, 0.72321391, 0.75894755, 0.77968041, 0.50246003,
       0.49858077, 0.7474959 , 0.3542282 , 0.61648435, 0.71300224,
       0.66294608, 0.53175333, 0.77397395, 0.68419387, 0.68395536,
       0.52041202, 0.56814038, 0.79586606, 0.81372012, 0.61068545,
       0.57260627, 0.52525981, 0.58055388, 0.45584728, 0.67976208,
       0.8226707 , 0.84286197, 0.96189157, 0.66724612, 0.68589478,
       0.61846513, 0.63455044, 0.68275686, 0.65738372, 0.45198998,
       0.59988596, 0.63845908, 0.63132487, 0.7888473 , 0.60126246,
       0.79714045, 0.78713803, 0.54643775, 0.42823635, 0.7711724 ,
       0.53552976, 0.55608044, 0.54480459, 0.57031915, 0.65080369,
       0.77958926, 0.6371013 , 0.70993488, 0.71493598, 0.60375943,
       0.54407206, 0.48186138, 0.76576089, 0.75456305, 0.53968

In [8]:
linear_reg.coef_

array([-0.08596295, -0.00829314,  0.00048775])

In [10]:
import numpy as np

test_case = np.array([[1, 5, 500]])  # Rich 5-year-old in first class
linear_reg.predict(test_case)

array([1.14845883])
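
A "probability" above 1 is the core problem: a straight line is unbounded, so it will eventually leave the unit interval. As a sketch, here is the same effect on a made-up one-feature dataset, fitted with the closed-form least-squares formulas in plain Python (the data and the helper `linear_predict` are assumptions for illustration).

```python
# Made-up one-feature data: the label flips from 0 to 1 as x grows.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]

# Closed-form simple least squares: slope = cov(x, y) / var(x).
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def linear_predict(x):
    return intercept + slope * x

# Inside the training range the line looks reasonable; outside it does not.
far_right = linear_predict(5.0)
far_left = linear_predict(-1.0)
print(far_right, far_left)
```

Both extrapolated values fall outside [0, 1], so they cannot be read as probabilities.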

In [11]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)

0.7103825136612022
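
An accuracy of about 0.71 is easier to judge against a baseline. The `describe()` output above reports 183 rows with a mean `Survived` of 0.672131, so a "model" that always predicts survival is already right about two thirds of the time. The arithmetic below uses only those printed numbers.

```python
n_rows = 183                          # count from train_df.describe()
survival_rate = 0.672131              # mean of Survived from train_df.describe()
model_accuracy = 0.7103825136612022   # log_reg.score(X, y)

# Always predicting the majority class (survived) is right this often.
baseline_accuracy = max(survival_rate, 1 - survival_rate)

# Convert both rates back to whole passengers.
baseline_correct = round(baseline_accuracy * n_rows)
model_correct = round(model_accuracy * n_rows)

print(baseline_correct, model_correct, model_correct - baseline_correct)
```

On these numbers the model classifies 130 of 183 training passengers correctly, versus 123 for the constant guess.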

In [0]:
log_reg.predict(test_df[['Pclass', 'Age', 'Fare']])

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [12]:
log_reg.predict(test_case)[0]

1

In [13]:
help(log_reg.predict)

Help on method predict in module sklearn.linear_model.base:

predict(X) method of sklearn.linear_model.logistic.LogisticRegression instance
    Predict class labels for samples in X.
    
    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Samples.
    
    Returns
    -------
    C : array, shape = [n_samples]
        Predicted class label per sample.



In [14]:
log_reg.predict_proba(test_case)[0]

array([0.02485552, 0.97514448])
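
`predict` and `predict_proba` are two views of the same model: the columns of `predict_proba` follow `log_reg.classes_`, and the hard label is the class with the largest probability. A short sketch using the probabilities printed above, assuming the class order is `[0, 1]`:

```python
classes = [0, 1]                          # assumed order of log_reg.classes_
probabilities = [0.02485552, 0.97514448]  # predict_proba(test_case)[0]

# The hard label is the class whose probability is largest.
best_index = max(range(len(classes)), key=lambda i: probabilities[i])
predicted_label = classes[best_index]

# The two probabilities describe one event, so they sum to 1.
total = sum(probabilities)
print(predicted_label, total)
```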

In [15]:
# What's the math?
log_reg.coef_

array([[-0.0455017 , -0.02912513,  0.0048037 ]])

In [16]:
log_reg.intercept_

array([1.45878264])
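
These coefficients live on the log-odds scale, so exponentiating one gives the factor by which the odds of survival change per unit of that feature, with the others held fixed. A small sketch using the coefficients printed above, in the column order `Pclass`, `Age`, `Fare` used to build `X`:

```python
import math

features = ['Pclass', 'Age', 'Fare']                  # column order used for X
coefficients = [-0.0455017, -0.02912513, 0.0048037]   # log_reg.coef_[0]

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(c) for name, c in zip(features, coefficients)}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.4f} per unit")
```

A ratio below 1 (here `Pclass` and `Age`) lowers the odds as the feature grows; a ratio above 1 (`Fare`) raises them.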

In [17]:
# The logistic sigmoid "squishing" function, implemented to accept numpy arrays
def sigmoid(x):
  return 1 / (1 + np.e**(-x))

In [18]:
sigmoid(log_reg.intercept_ + np.dot(log_reg.coef_, np.transpose(test_case)))

array([[0.97514448]])

So, clearly a more appropriate model in this situation! For more on the math, [see this Wikipedia example](https://en.wikipedia.org/wiki/Logistic_regression#Probability_of_passing_an_exam_versus_hours_of_study).
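
To make the two steps explicit, the same calculation can be written out by hand: first the linear part (the log-odds), then the sigmoid. This sketch hard-codes the intercept and coefficients printed above rather than reading them from the fitted model.

```python
import math

intercept = 1.45878264                               # log_reg.intercept_[0]
coefficients = [-0.0455017, -0.02912513, 0.0048037]  # log_reg.coef_[0]
passenger = [1, 5, 500]                              # Pclass, Age, Fare

# Linear part: the log-odds of survival.
log_odds = intercept + sum(w * x for w, x in zip(coefficients, passenger))

# Logistic part: squash the log-odds into (0, 1).
probability = 1 / (1 + math.exp(-log_odds))
print(log_odds, probability)
```

The result agrees with the roughly 0.975 that `predict_proba` reported for the same passenger.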

For the live session, let's tackle [another classification dataset on absenteeism](http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work). It has 21 columns, and the label we'll predict, `Reason for absence`, takes many more than two values, but remember that scikit-learn's `LogisticRegression` handles more than two classes automatically. How? In the one-vs-rest setup used here, by fitting one binary model per label that treats that label as 1 and every other label as 0.

In [21]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip
!unzip Absenteeism_at_work_AAA.zip

--2019-01-20 16:21:47--  http://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66136 (65K) [application/zip]
Saving to: ‘Absenteeism_at_work_AAA.zip.1’


2019-01-20 16:21:47 (303 KB/s) - ‘Absenteeism_at_work_AAA.zip.1’ saved [66136/66136]

Archive:  Absenteeism_at_work_AAA.zip
  inflating: Absenteeism_at_work.arff  
  inflating: Absenteeism_at_work.csv  
  inflating: Absenteeism_at_work.xls  
  inflating: Attribute Information.docx  
  inflating: UCI_ABS_TEXT.docx       


In [22]:

!head Absenteeism_at_work.csv

ID;Reason for absence;Month of absence;Day of the week;Seasons;Transportation expense;Distance from Residence to Work;Service time;Age;Work load Average/day ;Hit target;Disciplinary failure;Education;Son;Social drinker;Social smoker;Pet;Weight;Height;Body mass index;Absenteeism time in hours
11;26;7;3;1;289;36;13;33;239.554;97;0;1;2;1;0;1;90;172;30;4
36;0;7;3;1;118;13;18;50;239.554;97;1;1;1;1;0;0;98;178;31;0
3;23;7;4;1;179;51;18;38;239.554;97;0;1;0;1;0;0;89;170;31;2
7;7;7;5;1;279;5;14;39;239.554;97;0;1;2;1;1;0;68;168;24;4
11;23;7;5;1;289;36;13;33;239.554;97;0;1;2;1;0;1;90;172;30;2
3;23;7;6;1;179;51;18;38;239.554;97;0;1;0;1;0;0;89;170;31;2
10;22;7;6;1;361;52;3;28;239.554;97;0;1;1;1;0;4;80;172;27;8
20;23;7;6;1;260;50;11;36;239.554;97;0;1;4;1;0;0;65;168;23;4
14;19;7;2;1;155;12;14;34;239.554;97;0;1;2;1;0;0;95;196;25;40


In [23]:
absent_df = pd.read_table('Absenteeism_at_work.csv', sep=';')
absent_df.head()

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Transportation expense,Distance from Residence to Work,Service time,Age,Work load Average/day,...,Disciplinary failure,Education,Son,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,26,7,3,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,4
1,36,0,7,3,1,118,13,18,50,239.554,...,1,1,1,1,0,0,98,178,31,0
2,3,23,7,4,1,179,51,18,38,239.554,...,0,1,0,1,0,0,89,170,31,2
3,7,7,7,5,1,279,5,14,39,239.554,...,0,1,2,1,1,0,68,168,24,4
4,11,23,7,5,1,289,36,13,33,239.554,...,0,1,2,1,0,1,90,172,30,2


In [24]:
absent_df.shape

(740, 21)

In [33]:
from sklearn.linear_model import LogisticRegression
X = absent_df.drop('Reason for absence', axis='columns')
y = absent_df['Reason for absence']

absent_log1 = LogisticRegression(solver='liblinear', multi_class='ovr').fit(X, y)
absent_log1.score(X, y)

0.5013513513513513
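
With `multi_class='ovr'`, one binary model is fitted per label and their outputs are combined at prediction time. The sketch below shows one way to picture that combination (sigmoid per model, rescale so the values sum to 1, take the largest). The decision scores are hypothetical, and the normalization step reflects my understanding of the one-vs-rest path rather than a guaranteed description of the library internals.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical decision scores for one sample, one binary model per label.
# Each model answers "is it this label (1) or any other label (0)?".
decision_scores = {0: -2.0, 23: 0.5, 26: 1.5}

# Turn each score into a per-model probability, then rescale to sum to 1.
raw = {label: sigmoid(z) for label, z in decision_scores.items()}
total = sum(raw.values())
normalized = {label: p / total for label, p in raw.items()}

# The predicted label is the one whose model is most confident.
predicted = max(normalized, key=normalized.get)
print(predicted, normalized)
```

Keep in mind that the 0.50 above is training accuracy over many classes, so it says little about how the model would do on unseen rows.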

## Assignment - real-world classification

We're going to check out a larger dataset - the [FMA Free Music Archive data](https://github.com/mdeff/fma). It has a selection of CSVs with metadata and calculated audio features that you can load and try to use to classify genre of tracks. To get you started:

In [35]:
!wget https://os.unil.cloud.switch.ch/fma/fma_metadata.zip
!unzip fma_metadata.zip


In [125]:
tracks = pd.read_csv("/home/mishraka/Documents/Manjula/Lambda_School/Assignments/Unit_2/Advanced_Regression_Week3/fma_metadata/tracks.csv",
                    header = None)



In [126]:
tracks.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52
count,106575,106576,103047,70296,15297,106576,106576,83151,106576,18062,106576,105551,106576,100068,22713,5377,14273,71158,106576,105720,106576,106576,44546.0,70212,44546.0,46851,106576,13154,106576,79258,5583,106576,106576,106576,106576,3672,106576,6161,106576,106576,49600,106576,106576,2351,106576,15026,106489,106576,313,106576,1265,106576,106575
unique,106575,30,14342,3671,624,66,14930,11077,11352,762,2390,14300,171,7,78,59,756,6086,53,15972,215,17056,1614.0,2331,1602.0,4017,16296,754,15662,6644,391,5,5,12173,29,507,86171,775,3367,330,18,4770,4153,1588,18978,46,115,15341,68,332,137,2453,94988
top,679,0,2015-01-26 13:04:57,2008-01-01 00:00:00,Ernie Indradat,0,-1,"<p class=""p1"" style=""margin: 0px; padding: 8px...",-1,Joe Belock,[],microSong Entries,10,Album,2007-01-01 00:00:00,2016-01-01 00:00:00,HUSH Records,"<p><span style=""color: #333333; font-family: G...",0,2013-03-31 02:17:41,0,15891,40.692455,"Brooklyn, NY",-73.990364,Konstantin Trokay,Kosta T,"Ratatat, Lullatone, Nightmares On Wax, Air, Mo...",[],https://soundcloud.com/konstantin-trokay,http://en.wikipedia.org/wiki/Josh_Woodward,training,large,320000,0,konstantin trokai,2009-06-25 11:36:04,2008-11-26 00:00:00,180,0,Rock,[21],[21],"<p><a href=""http://www.myspace.com/theshambler...",320,en,Attribution-Noncommercial-Share Alike 3.0 Unit...,97,Apache Tomcat,1,Victrola Dog (ASCAP),[],Untitled
freq,1,71188,310,667,876,45753,805,310,3130,855,83549,310,6045,87549,1789,479,604,745,59915,745,15414,745,890.0,2327,890.0,745,745,604,3061,745,284,84353,81574,49095,88387,541,4,700,507,35944,14182,2735,2735,22,67,14255,19250,110,44,10459,465,83078,298


In [127]:
pd.set_option('display.max_columns', None)  # Unlimited columns
tracks.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52
0,,album,album,album,album,album,album,album,album,album,album,album,album,album,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,artist,set,set,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track,track
1,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
4,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave


In [128]:
tracks.shape

(106577, 53)

In [129]:
tracks.columns

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
            51, 52],
           dtype='int64')
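
The integer column labels are a consequence of `header=None`: the first three lines of the file (group names, field names, and a line holding only `track_id`) were read as data. One alternative is to let pandas parse them as a two-level header. The sketch below uses a tiny made-up stand-in for `tracks.csv` with the same layout; the variable names are illustrative.

```python
import io
import pandas as pd

# A tiny stand-in for tracks.csv with the same three-line preamble:
# group names, field names, then a line holding only the index name.
csv_text = """,album,album,track,track
,title,type,genre_top,title
track_id,,,,
2,AWOL - A Way Of Life,Album,Hip-Hop,Food
3,AWOL - A Way Of Life,Album,Hip-Hop,Electric Ave
10,Constant Hitmaker,Album,Pop,Freeway
"""

# header=[0, 1] builds two-level column labels; index_col=0 uses the
# first column (the track ids) as the row index.
toy_tracks = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])

genres = toy_tracks[('track', 'genre_top')]
print(toy_tracks.columns.tolist())
print(genres.tolist())
```

A two-level header also keeps repeated field names such as `title` apart, since `('album', 'title')` and `('track', 'title')` are distinct labels.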

In [130]:
# Let's try to remove the first row of the df
#track_df = tracks.iloc[1:]
header = tracks.iloc[1]

In [131]:
df_update = tracks[1:]

In [132]:
# rename returns a new DataFrame; df_update itself keeps its integer labels
df_update.rename(columns = header)

Unnamed: 0,nan,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
1,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
2,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
4,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
6,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
7,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level
8,26,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:05,2008-01-01 00:00:00,181,0,,"[76, 103]","[17, 10, 76, 103]",,1060,en,Attribution-NonCommercial-NoDerivatives (aka M...,193,,4,,[],Where is your Love?
9,30,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:11,2008-01-01 00:00:00,174,0,,"[76, 103]","[17, 10, 76, 103]",,718,en,Attribution-NonCommercial-NoDerivatives (aka M...,612,,5,,[],Too Happy
10,46,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:53,2008-01-01 00:00:00,104,0,,"[76, 103]","[17, 10, 76, 103]",,252,en,Attribution-NonCommercial-NoDerivatives (aka M...,171,,8,,[],Yosemite


In [133]:
df_update.columns

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
            51, 52],
           dtype='int64')
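
The columns are still integers because `rename` returns a new DataFrame and leaves the original alone unless the result is assigned. A minimal sketch on a made-up frame (the names `toy` and `new_names` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame([[1, 2], [3, 4]])   # integer column labels 0 and 1
new_names = {0: 'left', 1: 'right'}    # hypothetical labels

# rename returns a new DataFrame; without assignment `toy` is untouched.
toy.rename(columns=new_names)
unchanged = list(toy.columns)

# Assigning the result keeps the new labels.
renamed = toy.rename(columns=new_names)
print(unchanged, list(renamed.columns))
```

The same holds for `drop`: the displayed result is a copy, and the original frame is unchanged until the result is stored.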

In [134]:
# drop also returns a copy; the result is displayed here but not stored
df_update.drop(df_update.index[[1]])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52
1,,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments,date_created,favorites,id,latitude,location,longitude,members,name,related_projects,tags,website,wikipedia_page,split,subset,bit_rate,comments,composer,date_created,date_recorded,duration,favorites,genre_top,genres,genres_all,information,interest,language_code,license,listens,lyricist,number,publisher,tags,title
3,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
4,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
6,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
7,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level
8,26,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:05,2008-01-01 00:00:00,181,0,,"[76, 103]","[17, 10, 76, 103]",,1060,en,Attribution-NonCommercial-NoDerivatives (aka M...,193,,4,,[],Where is your Love?
9,30,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:11,2008-01-01 00:00:00,174,0,,"[76, 103]","[17, 10, 76, 103]",,718,en,Attribution-NonCommercial-NoDerivatives (aka M...,612,,5,,[],Too Happy
10,46,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:53,2008-01-01 00:00:00,104,0,,"[76, 103]","[17, 10, 76, 103]",,252,en,Attribution-NonCommercial-NoDerivatives (aka M...,171,,8,,[],Yosemite
11,48,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:49:56,2008-01-01 00:00:00,205,0,,"[76, 103]","[17, 10, 76, 103]",,247,en,Attribution-NonCommercial-NoDerivatives (aka M...,173,,9,,[],Light of Light


In [135]:
header = tracks.iloc[1]

In [136]:
df = df_update[1:]

In [137]:
df = df.rename(columns = header)

In [138]:
df.head()

Unnamed: 0,nan,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
2,track_id,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168.0,2.0,Hip-Hop,[21],[21],,4656.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293.0,,3.0,,[],Food
4,3,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000.0,0.0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237.0,1.0,Hip-Hop,[21],[21],,1470.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514.0,,4.0,,[],Electric Ave
5,5,0.0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4.0,1.0,<p></p>,6073.0,,[],AWOL - A Way Of Life,7.0,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0.0,2008-11-26 01:42:32,9.0,1.0,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000.0,0.0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206.0,6.0,Hip-Hop,[21],[21],,1933.0,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151.0,,6.0,,[],This World
6,10,0.0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4.0,6.0,,47632.0,,[],Constant Hitmaker,2.0,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3.0,2008-11-26 01:42:55,74.0,6.0,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000.0,0.0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161.0,178.0,Pop,[10],[10],,54881.0,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135.0,,1.0,,[],Freeway


In [139]:
df = df.drop(df.index[[0]])

In [140]:
df.head()

Unnamed: 0,nan,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
3,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
4,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
6,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
7,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


In [121]:
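# Earlier attempts, kept for reference: the first column label is a float NaN, not the string 'nan',
# and rename(..., inplace=True) returns None, so assigning its result would overwrite df with None.
#df.rename(columns = {'nan':'year'})
#df = df.rename(columns={'nan': 'track_id'}, inplace=True)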

In [141]:
df.columns

Index([                nan,          'comments',      'date_created',
           'date_released',          'engineer',         'favorites',
                      'id',       'information',           'listens',
                'producer',              'tags',             'title',
                  'tracks',              'type', 'active_year_begin',
         'active_year_end', 'associated_labels',               'bio',
                'comments',      'date_created',         'favorites',
                      'id',          'latitude',          'location',
               'longitude',           'members',              'name',
        'related_projects',              'tags',           'website',
          'wikipedia_page',             'split',            'subset',
                'bit_rate',          'comments',          'composer',
            'date_created',     'date_recorded',          'duration',
               'favorites',         'genre_top',            'genres',
              'genre
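
The `df.columns` output above lists several labels more than once (`comments`, `date_created`, `favorites`, `id`, `tags`). One option, sketched here on a toy frame, is to give repeated labels a numeric suffix so every column stays individually addressable. The helper name `make_unique` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def make_unique(names):
    """Append .1, .2, ... to repeated labels (hypothetical helper)."""
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return unique


# Toy frame with a repeated 'id' label, standing in for the real data
toy = pd.DataFrame([[1, "Food", 7]], columns=["id", "title", "id"])
toy.columns = make_unique(toy.columns)
print(list(toy.columns))
```

With unique labels, `toy["id"]` and `toy["id.1"]` each return a single Series instead of a two-column DataFrame.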

In [145]:
df = df.rename(columns={ df.columns[0]: "track_id" })

In [146]:
df.head()
# Columns of interest: track_id, favorites, id, listens, title, tracks, bio, bit_rate, duration, genre_top, number, title

Unnamed: 0,track_id,comments,date_created,date_released,engineer,favorites,id,information,listens,producer,tags,title,tracks,type,active_year_begin,active_year_end,associated_labels,bio,comments.1,date_created.1,favorites.1,id.1,latitude,location,longitude,members,name,related_projects,tags.1,website,wikipedia_page,split,subset,bit_rate,comments.2,composer,date_created.2,date_recorded,duration,favorites.2,genre_top,genres,genres_all,information.1,interest,language_code,license,listens.1,lyricist,number,publisher,tags.2,title.1
3,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:12,2008-11-26 00:00:00,168,2,Hip-Hop,[21],[21],,4656,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,,3,,[],Food
4,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,medium,256000,0,,2008-11-26 01:48:14,2008-11-26 00:00:00,237,1,Hip-Hop,[21],[21],,1470,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,,4,,[],Electric Ave
5,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,,4,1,<p></p>,6073,,[],AWOL - A Way Of Life,7,Album,2006-01-01 00:00:00,,,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,40.0583238,New Jersey,-74.4056612,"Sajje Morocco,Brownbum,ZawidaGod,Custodian of ...",AWOL,The list of past projects is 2 long but every1...,['awol'],http://www.AzillionRecords.blogspot.com,,training,small,256000,0,,2008-11-26 01:48:20,2008-11-26 00:00:00,206,6,Hip-Hop,[21],[21],,1933,en,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,,6,,[],This World
6,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,,4,6,,47632,,[],Constant Hitmaker,2,Album,,,"Mexican Summer, Richie Records, Woodsist, Skul...","<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,,,,"Kurt Vile, the Violators",Kurt Vile,,"['philly', 'kurt vile']",http://kurtvile.com,,training,small,192000,0,Kurt Vile,2008-11-25 17:49:06,2008-11-26 00:00:00,161,178,Pop,[10],[10],,54881,en,Attribution-NonCommercial-NoDerivatives (aka M...,50135,,1,,[],Freeway
7,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,,2,4,"<p> ""spiritual songs"" from Nicky Cook</p>",2710,,[],Niris,13,Album,1990-01-01 00:00:00,2011-01-01 00:00:00,,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,51.895927,Colchester England,0.891874,Nicky Cook\n,Nicky Cook,,"['instrumentals', 'experimental pop', 'post pu...",,,training,large,256000,0,,2008-11-26 01:48:56,2008-01-01 00:00:00,311,0,,"[76, 103]","[17, 10, 76, 103]",,978,en,Attribution-NonCommercial-NoDerivatives (aka M...,361,,3,,[],Spiritual Level


In [147]:
df.isna().sum()

track_id                  0
comments                  0
date_created           3529
date_released         36280
engineer              91279
favorites                 0
id                        0
information           23425
listens                   0
producer              88514
tags                      0
title                  1025
tracks                    0
type                   6508
active_year_begin     83863
active_year_end      101199
associated_labels     92303
bio                   35418
comments                  0
date_created            856
favorites                 0
id                        0
latitude              62030
location              36364
longitude             62030
members               59725
name                      0
related_projects      93422
tags                      0
website               27318
wikipedia_page       100993
split                     0
subset                    0
bit_rate                  0
comments                  0
composer            

In [151]:
df_new = df.drop(columns=[
    'engineer', 'active_year_begin', 'active_year_end', 'associated_labels',
    'latitude', 'location', 'longitude', 'members', 'website',
    'wikipedia_page', 'composer', 'date_recorded', 'information',
    'language_code', 'lyricist', 'publisher',
    'tags',  # listing a label once drops every column that carries it
])

In [152]:
df_new.head()

Unnamed: 0,track_id,comments,date_created,date_released,favorites,id,listens,producer,title,tracks,type,bio,comments.1,date_created.1,favorites.1,id.1,name,related_projects,split,subset,bit_rate,comments.2,date_created.2,duration,favorites.2,genre_top,genres,genres_all,interest,license,listens.1,number,title.1
3,2,0,2008-11-26 01:44:45,2009-01-05 00:00:00,4,1,6073,,AWOL - A Way Of Life,7,Album,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,AWOL,The list of past projects is 2 long but every1...,training,small,256000,0,2008-11-26 01:48:12,168,2,Hip-Hop,[21],[21],4656,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1293,3,Food
4,3,0,2008-11-26 01:44:45,2009-01-05 00:00:00,4,1,6073,,AWOL - A Way Of Life,7,Album,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,AWOL,The list of past projects is 2 long but every1...,training,medium,256000,0,2008-11-26 01:48:14,237,1,Hip-Hop,[21],[21],1470,Attribution-NonCommercial-ShareAlike 3.0 Inter...,514,4,Electric Ave
5,5,0,2008-11-26 01:44:45,2009-01-05 00:00:00,4,1,6073,,AWOL - A Way Of Life,7,Album,"<p>A Way Of Life, A Collective of Hip-Hop from...",0,2008-11-26 01:42:32,9,1,AWOL,The list of past projects is 2 long but every1...,training,small,256000,0,2008-11-26 01:48:20,206,6,Hip-Hop,[21],[21],1933,Attribution-NonCommercial-ShareAlike 3.0 Inter...,1151,6,This World
6,10,0,2008-11-26 01:45:08,2008-02-06 00:00:00,4,6,47632,,Constant Hitmaker,2,Album,"<p><span style=""font-family:Verdana, Geneva, A...",3,2008-11-26 01:42:55,74,6,Kurt Vile,,training,small,192000,0,2008-11-25 17:49:06,161,178,Pop,[10],[10],54881,Attribution-NonCommercial-NoDerivatives (aka M...,50135,1,Freeway
7,20,0,2008-11-26 01:45:05,2009-01-06 00:00:00,2,4,2710,,Niris,13,Album,<p>Songs written by: Nicky Cook</p>\n<p>VOCALS...,2,2008-11-26 01:42:52,10,4,Nicky Cook,,training,large,256000,0,2008-11-26 01:48:56,311,0,,"[76, 103]","[17, 10, 76, 103]",978,Attribution-NonCommercial-NoDerivatives (aka M...,361,3,Spiritual Level


In [154]:
df_new.isna().sum()

track_id                0
comments                0
date_created         3529
date_released       36280
favorites               0
id                      0
listens                 0
producer            88514
title                1025
tracks                  0
type                 6508
bio                 35418
comments                0
date_created          856
favorites               0
id                      0
name                    0
related_projects    93422
split                   0
subset                  0
bit_rate                0
comments                0
date_created            0
duration                0
favorites               0
genre_top           56976
genres                  0
genres_all              0
interest                0
license                87
listens                 0
number                  0
title                   1
dtype: int64
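
Rather than listing sparse columns by hand, the same `isna()` counts can drive the selection. This is a minimal sketch on made-up data; the 50% cutoff is an arbitrary assumption, not a value taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: 'producer' is mostly missing, 'title' only occasionally
toy = pd.DataFrame({
    "listens": [10, 20, 30, 40],
    "producer": [np.nan, np.nan, np.nan, "x"],
    "title": ["a", None, "c", "d"],
})

# Fraction of missing values per column (mean of the boolean mask)
missing_fraction = toy.isna().mean()

# Keep only columns where at most half of the values are missing
keep = missing_fraction[missing_fraction <= 0.5].index
toy_trimmed = toy[keep]
print(list(toy_trimmed.columns))
```

The threshold makes the decision explicit and repeatable, which helps when the column list changes between runs.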

In [158]:
df_new.shape

(106574, 33)

In [160]:
df_new.dtypes

track_id            object
comments            object
date_created        object
date_released       object
favorites           object
id                  object
listens             object
producer            object
title               object
tracks              object
type                object
bio                 object
comments            object
date_created        object
favorites           object
id                  object
name                object
related_projects    object
split               object
subset              object
bit_rate            object
comments            object
date_created        object
duration            object
favorites           object
genre_top           object
genres              object
genres_all          object
interest            object
license             object
listens             object
number              object
title               object
dtype: object

In [180]:
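# Labels are duplicated in df, so each name selects every matching column; 'title' is also listed twice, hence the extra title columns below
df_selected = df[['track_id', 'favorites', 'id', 'listens', 'title', 'tracks', 'bit_rate', 'duration', 'genre_top', 'number', 'title']]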

In [181]:
df_selected.head()

Unnamed: 0,track_id,favorites,favorites.1,favorites.2,id,id.1,listens,listens.1,title,title.1,tracks,bit_rate,duration,genre_top,number,title.2,title.3
3,2,4,9,2,1,1,6073,1293,AWOL - A Way Of Life,Food,7,256000,168,Hip-Hop,3,AWOL - A Way Of Life,Food
4,3,4,9,1,1,1,6073,514,AWOL - A Way Of Life,Electric Ave,7,256000,237,Hip-Hop,4,AWOL - A Way Of Life,Electric Ave
5,5,4,9,6,1,1,6073,1151,AWOL - A Way Of Life,This World,7,256000,206,Hip-Hop,6,AWOL - A Way Of Life,This World
6,10,4,74,178,6,6,47632,50135,Constant Hitmaker,Freeway,2,192000,161,Pop,1,Constant Hitmaker,Freeway
7,20,2,10,0,4,4,2710,361,Niris,Spiritual Level,13,256000,311,,3,Niris,Spiritual Level


In [171]:
df_selected.shape

(106574, 18)

In [182]:
df_selected.isna().sum()

track_id         0
favorites        0
favorites        0
favorites        0
id               0
id               0
listens          0
listens          0
title         1025
title            1
tracks           0
bit_rate         0
duration         0
genre_top    56976
number           0
title         1025
title            1
dtype: int64

In [183]:
df_selected.genre_top.value_counts()

Rock                   14182
Experimental           10608
Electronic              9372
Hip-Hop                 3552
Folk                    2803
Pop                     2332
Instrumental            2079
International           1389
Classical               1230
Jazz                     571
Old-Time / Historic      554
Spoken                   423
Country                  194
Soul-RnB                 175
Blues                    110
Easy Listening            24
Name: genre_top, dtype: int64

In [175]:
df_selected['genre_top'] = df_selected['genre_top'].fillna('Rock')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [184]:
df_selected = df_selected.loc[:, ~df_selected.columns.duplicated()]

In [185]:
df_selected.head()

Unnamed: 0,track_id,favorites,id,listens,title,tracks,bit_rate,duration,genre_top,number
3,2,4,1,6073,AWOL - A Way Of Life,7,256000,168,Hip-Hop,3
4,3,4,1,6073,AWOL - A Way Of Life,7,256000,237,Hip-Hop,4
5,5,4,1,6073,AWOL - A Way Of Life,7,256000,206,Hip-Hop,6
6,10,4,6,47632,Constant Hitmaker,2,192000,161,Pop,1
7,20,2,4,2710,Niris,13,256000,311,,3


In [187]:
df_selected['genre_top'] = df_selected['genre_top'].fillna('Rock')
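
The earlier `fillna` cell triggered pandas' `SettingWithCopyWarning` because `df_selected` was created by selecting columns from `df` and then assigned into. A common way to avoid the ambiguity, sketched here on toy data (the frame and its values are made up), is to take an explicit `.copy()` of the selection before modifying it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "genre_top": ["Pop", np.nan, "Rock"],
    "listens": [1, 2, 3],
    "extra": [0, 0, 0],
})

# .copy() makes the subset an independent frame, so assignment is unambiguous
subset = toy[["genre_top", "listens"]].copy()
subset["genre_top"] = subset["genre_top"].fillna("Rock")

print(subset["genre_top"].tolist())
```

The original `toy` frame keeps its missing value; only the copy is filled.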

In [188]:
df_selected.isna().sum()

track_id        0
favorites       0
id              0
listens         0
title        1025
tracks          0
bit_rate        0
duration        0
genre_top       0
number          0
dtype: int64

In [189]:
df_selected.title.describe()

count                105549
unique                14298
top       microSong Entries
freq                    310
Name: title, dtype: object

In [190]:
df_selected.title.value_counts()

microSong Entries                                                310
Sectioned v4.0                                                   200
Live at the 2014 Golden Festival                                 168
INTO INFINITY: an exploration of on and on and on and on...      151
Necktar Volume 6                                                 150
Entries                                                          139
The Conet Project                                                138
Classwar Karaoke - 0025 Survey                                   135
Necktar Volume 7                                                 128
Necktar Volume 8                                                 117
Necktar Volume 5                                                 115
Sincerity Is The Key                                             112
Necktar Volume 3                                                 111
Necktar Volume 9.1                                               110
Classwar Karaoke - 0021 Survey    

In [191]:
df_selected.shape

(106574, 10)

In [193]:
final_df = df_selected.dropna()

In [194]:
final_df.shape

(105549, 10)

In [195]:
final_df.dtypes

track_id     object
favorites    object
id           object
listens      object
title        object
tracks       object
bit_rate     object
duration     object
genre_top    object
number       object
dtype: object

In [196]:
final_df.head()

Unnamed: 0,track_id,favorites,id,listens,title,tracks,bit_rate,duration,genre_top,number
3,2,4,1,6073,AWOL - A Way Of Life,7,256000,168,Hip-Hop,3
4,3,4,1,6073,AWOL - A Way Of Life,7,256000,237,Hip-Hop,4
5,5,4,1,6073,AWOL - A Way Of Life,7,256000,206,Hip-Hop,6
6,10,4,6,47632,Constant Hitmaker,2,192000,161,Pop,1
7,20,2,4,2710,Niris,13,256000,311,Rock,3


In [199]:
# The numeric columns are stored with object dtype, so cast each one to float
a = final_df['track_id'].astype(float)
b = final_df['favorites'].astype(float)
c = final_df['id'].astype(float)
d = final_df['listens'].astype(float)
e = final_df['tracks'].astype(float)
f = final_df['bit_rate'].astype(float)
g = final_df['duration'].astype(float)
h = final_df['number'].astype(float)

In [200]:
df_float_features = pd.concat([a,b,c,d,e,f,g,h], axis = 1)

In [201]:
df_float_features.dtypes

track_id     float64
favorites    float64
id           float64
listens      float64
tracks       float64
bit_rate     float64
duration     float64
number       float64
dtype: object
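
The per-column casts above can also be done in one step. The sketch below assumes a toy frame with string-valued numeric columns (the values, including the deliberately bad `"oops"` entry, are invented) and uses `pd.to_numeric` with `errors="coerce"` so unparseable entries become `NaN` instead of raising.

```python
import pandas as pd

toy = pd.DataFrame({
    "listens": ["6073", "47632", "oops"],
    "duration": ["168", "161", "311"],
    "genre_top": ["Hip-Hop", "Pop", "Rock"],
})

numeric_cols = ["listens", "duration"]

# Apply pd.to_numeric column by column; bad values turn into NaN
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
```

Counting the resulting `NaN` values afterwards is a quick check on how many entries failed to parse.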

In [236]:
final_df.columns

Index(['track_id', 'favorites', 'id', 'listens', 'title', 'tracks', 'bit_rate',
       'duration', 'genre_top', 'number'],
      dtype='object')

In [237]:
df_columns_dropped = final_df.drop(columns = ['track_id', 'favorites', 'id', 'listens', 'tracks', 'bit_rate',
       'duration', 'number'])

In [238]:
df_columns_dropped.head()

Unnamed: 0,title,genre_top
3,AWOL - A Way Of Life,Hip-Hop
4,AWOL - A Way Of Life,Hip-Hop
5,AWOL - A Way Of Life,Hip-Hop
6,Constant Hitmaker,Pop
7,Niris,Rock


In [239]:
working_df = pd.concat([df_float_features, df_columns_dropped], axis=1)


In [240]:
working_df.head()

Unnamed: 0,track_id,favorites,id,listens,tracks,bit_rate,duration,number,title,genre_top
3,2.0,4.0,1.0,6073.0,7.0,256000.0,168.0,3.0,AWOL - A Way Of Life,Hip-Hop
4,3.0,4.0,1.0,6073.0,7.0,256000.0,237.0,4.0,AWOL - A Way Of Life,Hip-Hop
5,5.0,4.0,1.0,6073.0,7.0,256000.0,206.0,6.0,AWOL - A Way Of Life,Hip-Hop
6,10.0,4.0,6.0,47632.0,2.0,192000.0,161.0,1.0,Constant Hitmaker,Pop
7,20.0,2.0,4.0,2710.0,13.0,256000.0,311.0,3.0,Niris,Rock


In [241]:
working_df.dtypes

track_id     float64
favorites    float64
id           float64
listens      float64
tracks       float64
bit_rate     float64
duration     float64
number       float64
title         object
genre_top     object
dtype: object

In [242]:
working_df = working_df.drop(columns=['title'])


In [246]:
working_df.head()

Unnamed: 0,track_id,favorites,id,listens,tracks,bit_rate,duration,number,genre_top
3,2.0,4.0,1.0,6073.0,7.0,256000.0,168.0,3.0,Hip-Hop
4,3.0,4.0,1.0,6073.0,7.0,256000.0,237.0,4.0,Hip-Hop
5,5.0,4.0,1.0,6073.0,7.0,256000.0,206.0,6.0,Hip-Hop
6,10.0,4.0,6.0,47632.0,2.0,192000.0,161.0,1.0,Pop
7,20.0,2.0,4.0,2710.0,13.0,256000.0,311.0,3.0,Rock


In [253]:
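# workable_df is the label-encoded frame returned by dummyEncode (In [211]); unlike working_df it keeps an encoded title column
X = workable_df.drop('genre_top', axis='columns')
y = workable_df['genre_top']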

In [258]:
X.head()

Unnamed: 0,track_id,favorites,id,listens,tracks,bit_rate,duration,number,title
3,2.0,4.0,1.0,6073.0,7.0,256000.0,168.0,3.0,553
4,3.0,4.0,1.0,6073.0,7.0,256000.0,237.0,4.0,553
5,5.0,4.0,1.0,6073.0,7.0,256000.0,206.0,6.0,553
6,10.0,4.0,6.0,47632.0,2.0,192000.0,161.0,1.0,2271
7,20.0,2.0,4.0,2710.0,13.0,256000.0,311.0,3.0,8593


In [254]:
y

3          7
4          7
5          7
6         12
7         13
8         13
9         13
10        13
11        13
12         7
13        13
14        13
15         5
16         5
17         6
18         6
19         6
20         6
21        10
22        10
23        10
24        10
25         5
26         5
27         5
28        13
29        13
30        13
31        13
32        13
          ..
106547     8
106548     8
106549     8
106550     8
106551     8
106552     8
106553     8
106554     8
106555     6
106556     6
106557     6
106558     6
106559     6
106560     6
106561     6
106562     6
106563     6
106564     5
106565     5
106566     5
106567     5
106568     5
106569     5
106570    13
106571    13
106572    13
106573    13
106574    13
106575    13
106576    13
Name: genre_top, Length: 105549, dtype: int64

In [255]:
X

Unnamed: 0,track_id,favorites,id,listens,tracks,bit_rate,duration,number,title
3,2.0,4.0,1.0,6073.0,7.0,256000.0,168.0,3.0,553
4,3.0,4.0,1.0,6073.0,7.0,256000.0,237.0,4.0,553
5,5.0,4.0,1.0,6073.0,7.0,256000.0,206.0,6.0,553
6,10.0,4.0,6.0,47632.0,2.0,192000.0,161.0,1.0,2271
7,20.0,2.0,4.0,2710.0,13.0,256000.0,311.0,3.0,8593
8,26.0,2.0,4.0,2710.0,13.0,256000.0,181.0,4.0,8593
9,30.0,2.0,4.0,2710.0,13.0,256000.0,174.0,5.0,8593
10,46.0,2.0,4.0,2710.0,13.0,256000.0,104.0,8.0,8593
11,48.0,2.0,4.0,2710.0,13.0,256000.0,205.0,9.0,8593
12,134.0,4.0,1.0,6073.0,7.0,256000.0,207.0,5.0,553


In [227]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

In [229]:
#https://stats.stackexchange.com/questions/221622/interpreting-multinomial-logistic-regression-in-scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

X_train, X_test, Y_train, Y_test = train_test_split(X_std, y, test_size=0.20, random_state=42)

model = LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg')
model.fit(X_train, Y_train)
predict_y = model.predict(X_test)
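
In the cells above the scaler is fit on all of `X` before the train/test split, so statistics from the test rows influence the scaling. A minimal sketch of the alternative wraps the scaler and the model in a pipeline, so the scaler is fit on the training split only. The synthetic data from `make_classification` is an assumption standing in for the FMA features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem (stand-in for the real features and genres)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Split first, then let the pipeline fit the scaler on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# Accuracy on rows the pipeline never saw during fitting
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

Calling `pipe.predict` on new rows applies the training-set scaling automatically, which keeps the preprocessing and the model in sync.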

In [231]:
#https://chrisalbon.com/machine_learning/naive_bayes/multinomial_logistic_regression/
model.predict_proba(X_test)


array([[1.05814714e-03, 1.23345402e-02, 2.97246875e-03, ...,
        7.27420298e-01, 2.51823187e-03, 6.81581213e-04],
       [7.01878840e-04, 6.33816031e-03, 2.54804492e-03, ...,
        6.18283724e-01, 1.67071961e-03, 2.63140078e-03],
       [1.62547614e-03, 9.85325615e-03, 2.87695565e-03, ...,
        6.11679140e-01, 2.41646622e-03, 5.59357896e-03],
       ...,
       [1.47704739e-03, 6.03364116e-03, 3.57475643e-03, ...,
        6.27191808e-01, 2.87773361e-03, 4.42794989e-03],
       [7.63681394e-04, 1.26589330e-02, 1.14271991e-03, ...,
        6.78778951e-01, 1.68999994e-03, 5.14850335e-03],
       [2.33867603e-03, 4.78987494e-04, 2.11127721e-03, ...,
        6.52535757e-01, 1.28813354e-03, 3.10350263e-03]])
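
Each row of the `predict_proba` output is a probability distribution over the classes in `model.classes_`, and `predict` returns the class with the largest probability. The sketch below checks both properties on synthetic data (an assumption, not the FMA data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_classes=3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)

# Every row sums to 1 (up to floating point error)
row_sums = proba.sum(axis=1)

# Column order follows clf.classes_, so argmax maps back to a label
labels_from_proba = clf.classes_[proba.argmax(axis=1)]

print(np.allclose(row_sums, 1.0))
print((labels_from_proba == clf.predict(X_toy)).all())
```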

In [234]:
print("coefficients: ", model.coef_)
print("intercepts: ", model.intercept_)
print("classes: ", model.classes_)

coefficients:  [[ 0.82854084  0.23786628 -1.37017619  0.91203616 -0.86811951 -0.11529722
   0.12787828  0.09195371 -0.10598877]
 [-2.95279556  0.22769649  2.82647071  0.9356726   0.79052749 -0.28936604
   0.34245146  0.90835061 -0.32163688]
 [ 0.85614053 -0.14387641 -0.9042811   1.10152641 -2.05329116  0.09439973
  -0.36562936  0.88149393  0.17249406]
 [ 0.46323291 -0.04790519 -0.0392185   1.00425008 -0.4545952  -0.08042344
   0.26769218 -0.38903756 -0.11973703]
 [-0.72875958 -0.01380208  0.77044243  0.45494259  0.67967752  0.34354268
   0.14861344  0.73301203  0.30180874]
 [ 0.28954945  0.12422889 -0.21289096 -5.65738368  1.39561338  0.14934905
   0.38018609  0.86990798  0.18165374]
 [ 0.83120808  0.45423691 -0.5813224  -3.53139301  0.9768515  -0.15245843
  -0.17043442  0.1129528   0.15286892]
 [ 0.29301882  0.05125404 -0.14547141 -0.04952812  0.35425122  0.4436137
  -0.72070396  0.92164925  0.29799603]
 [ 0.39891849  0.12668987  0.23923791  0.99295518  0.38834983  0.0797064
   0.0441
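
Each row of `model.coef_` belongs to one class in `model.classes_` and each column to one feature, which is hard to read as a raw array. A small sketch, assuming synthetic data and hypothetical feature names `f0`..`f3`, labels the coefficients and ranks features by mean absolute coefficient. That ranking is only a rough heuristic, and it is only meaningful when the features were standardized first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_toy_std = StandardScaler().fit_transform(X_toy)
clf = LogisticRegression(max_iter=1000).fit(X_toy_std, y_toy)

# One row per class, one column per feature
coef_table = pd.DataFrame(clf.coef_, index=clf.classes_, columns=feature_names)

# Average coefficient magnitude across classes, largest first
importance = coef_table.abs().mean(axis=0).sort_values(ascending=False)

print(coef_table.round(2))
print(importance.round(2))
```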

In [210]:
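from sklearn.preprocessing import LabelEncoder

def dummyEncode(df):
    """Label-encode every object/category column of df in place and return df.

    Despite the name, this is integer (label) encoding, not one-hot/dummy encoding.
    """
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    for feature in columnsToEncode:
        try:
            # Fit a fresh encoder per column so each column gets its own integer codes
            df[feature] = LabelEncoder().fit_transform(df[feature])
        except Exception as exc:
            print('Error encoding ' + feature + ': ' + str(exc))
    return df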

In [211]:
dummyEncode(workable_df)

Unnamed: 0,track_id,favorites,id,listens,tracks,bit_rate,duration,number,title,genre_top
3,2.0,4.0,1.0,6073.0,7.0,256000.0,168.0,3.0,553,7
4,3.0,4.0,1.0,6073.0,7.0,256000.0,237.0,4.0,553,7
5,5.0,4.0,1.0,6073.0,7.0,256000.0,206.0,6.0,553,7
6,10.0,4.0,6.0,47632.0,2.0,192000.0,161.0,1.0,2271,12
7,20.0,2.0,4.0,2710.0,13.0,256000.0,311.0,3.0,8593,13
8,26.0,2.0,4.0,2710.0,13.0,256000.0,181.0,4.0,8593,13
9,30.0,2.0,4.0,2710.0,13.0,256000.0,174.0,5.0,8593,13
10,46.0,2.0,4.0,2710.0,13.0,256000.0,104.0,8.0,8593,13
11,48.0,2.0,4.0,2710.0,13.0,256000.0,205.0,9.0,8593,13
12,134.0,4.0,1.0,6073.0,7.0,256000.0,207.0,5.0,553,7


In [212]:
workable_df.dtypes

track_id     float64
favorites    float64
id           float64
listens      float64
tracks       float64
bit_rate     float64
duration     float64
number       float64
title          int64
genre_top      int64
dtype: object
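
`LabelEncoder` assigns integer codes in sorted order of the labels. The sketch below feeds it the 16 genre names from the `value_counts()` output and reproduces the codes seen in `y` (Hip-Hop as 7, Pop as 12, Rock as 13). It is an illustration of how the codes arise, not the notebook's own cell.

```python
from sklearn.preprocessing import LabelEncoder

# Genre names as listed in the value_counts() output
genres = [
    "Rock", "Experimental", "Electronic", "Hip-Hop", "Folk", "Pop",
    "Instrumental", "International", "Classical", "Jazz",
    "Old-Time / Historic", "Spoken", "Country", "Soul-RnB", "Blues",
    "Easy Listening",
]

encoder = LabelEncoder().fit(genres)

# Codes follow alphabetical order of the labels
codes = encoder.transform(["Hip-Hop", "Pop", "Rock"])
print(codes.tolist())

# inverse_transform maps codes back to genre names
print(encoder.inverse_transform([7, 12, 13]).tolist())
```

Keeping the fitted encoder around makes it possible to translate model predictions back into genre names. For a feature such as `title`, these codes impose an arbitrary ordering that a linear model will treat as numeric.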

In [257]:
from sklearn.linear_model import LogisticRegression
X = workable_df.drop('genre_top', axis='columns')
y = workable_df['genre_top']

# Fit on the full dataset; score() below reports accuracy on the same rows used for training
genre_log1 = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X, y)
genre_log1.score(X, y)



0.6671214317520773
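
The score above is accuracy on the training data, and the classes are heavily imbalanced because the 56976 missing genres were filled with 'Rock'. A useful reference point is a model that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels with a 70/20/10 split; the proportions are an assumption for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))  # features are ignored by the baseline
y_toy = np.array(["Rock"] * 70 + ["Pop"] * 20 + ["Jazz"] * 10)

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)

print(baseline_accuracy)
```

A logistic regression is only adding value if it beats this baseline, ideally on held-out data rather than the training set.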

This is the biggest data you've played with so far, and while it does generally fit in Colab, it can take a while to run. That's part of the challenge!

Your tasks:
- Clean up the variable names in the dataframe
- Use logistic regression to fit a model predicting (primary/top) genre
- Inspect, iterate, and improve your model
- Answer the following questions (written, ~paragraph each):
  - What are the best predictors of genre?
  - What information isn't very useful for predicting genre?
  - What surprised you the most about your results?
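
For the "inspect, iterate, and improve" step, a confusion matrix shows which genres the model mixes up, which a single accuracy number hides. This is a sketch on invented labels; with a real model, `y_true` and `y_pred` would be the held-out labels and the model's predictions.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted genres for six tracks
y_true = ["Rock", "Rock", "Pop", "Jazz", "Pop", "Rock"]
y_pred = ["Rock", "Pop", "Pop", "Rock", "Pop", "Rock"]
labels = ["Jazz", "Pop", "Rock"]

# Rows are true genres, columns are predicted genres, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Here the single Jazz track is predicted as Rock, the kind of pattern to look for when a dominant class absorbs the rare ones.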

*Important caveats*:
- This is going to be difficult data to work with - don't let the perfect be the enemy of the good!
- Be creative in cleaning it up - if the best way you know how to do it is download it locally and edit as a spreadsheet, that's OK!
- If the data size becomes problematic, consider sampling/subsetting
- You do not need perfect or complete results - just something plausible that runs, and that supports the reasoning in your written answers
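
If sampling is needed, drawing the same fraction from every genre keeps the class proportions intact. A minimal sketch on a toy frame (the data and the 10% fraction are assumptions) uses `groupby(...).sample`, which requires pandas 1.1 or later.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre_top": ["Rock"] * 80 + ["Pop"] * 20,
    "listens": range(100),
})

# Sample 10% of the rows within each genre (stratified sampling)
sample = toy.groupby("genre_top").sample(frac=0.1, random_state=0)

print(sample["genre_top"].value_counts().to_dict())
```

A plain `toy.sample(frac=0.1, random_state=0)` is simpler, but rare genres can shrink or vanish from the sample by chance.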

If you find that a model fit to classify *all* genres isn't very good, it's totally OK to limit it to the most frequent genres, or perhaps to try combining or clustering genres as a preprocessing step. Even then, there will be limits to how good a model can be with just this metadata - if you really want to train an effective genre classifier, you'll have to involve the other data (see stretch goals).
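
Limiting the problem to the most frequent genres can be sketched as follows. The toy frame and the choice of the top two genres are assumptions for illustration; on the real data the cutoff is a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre_top": ["Rock"] * 5 + ["Pop"] * 3 + ["Jazz"] + ["Blues"],
    "listens": range(10),
})

# Names of the two most frequent genres
top_genres = toy["genre_top"].value_counts().nlargest(2).index

# Keep only rows whose genre is among them
toy_top = toy[toy["genre_top"].isin(top_genres)]

print(sorted(toy_top["genre_top"].unique()))
```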

This is real data - there is no "one correct answer", so you can take this in a variety of directions. Just make sure to support your findings, and feel free to share them as well! This is meant to be practice for dealing with other "messy" data, a common task in data science.

## Resources and stretch goals

- Check out the other .csv files from the FMA dataset, and see if you can join them or otherwise fit interesting models with them
- [Logistic regression from scratch in numpy](https://blog.goodaudience.com/logistic-regression-from-scratch-in-numpy-5841c09e425f) - if you want to dig in a bit more to both the code and math (also takes a gradient descent approach, introducing the logistic loss function)
- Create a visualization to show predictions of your model - ideally show a confidence interval based on error!
- Check out and compare classification models from scikit-learn, such as [SVM](https://scikit-learn.org/stable/modules/svm.html#classification), [decision trees](https://scikit-learn.org/stable/modules/tree.html#classification), and [naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). The underlying math will vary significantly, but the API (how you write the code) and interpretation will actually be fairly similar.
- Sign up for [Kaggle](https://kaggle.com), and find a competition to try logistic regression with
- (Not logistic regression related) If you enjoyed the assignment, you may want to read up on [music informatics](https://en.wikipedia.org/wiki/Music_informatics), which is how those audio features were actually calculated. The FMA includes the actual raw audio, so (while this is more of a long-term project than a stretch goal, and won't fit in Colab) you can check it out and see what sort of deeper analysis you can do.