# Real ML Problem
Suggesting music to users on the basis of their profile

STEP 1: IMPORTING THE DATASET

In [3]:
import pandas as pd
music_data = pd.read_csv('music.csv')
music_data

Unnamed: 0,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz
5,30,1,Jazz
6,31,1,Classical
7,33,1,Classical
8,37,1,Classical
9,20,0,Dance


STEP 2: CLEANING/PREPARING THE DATA

Cleaning the dataset means removing duplicates, null values, etc.

Since every row and column has a value, we do not need to clean the data: there are no null or duplicate entries.
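One way to confirm that a dataset needs no cleaning is to count the missing values and duplicated rows directly. The sketch below uses a small hypothetical frame (not the real `music.csv`) with one missing value and one repeated row added on purpose, so the checks have something to find.

```python
import pandas as pd

# Hypothetical frame with the same columns as music.csv, plus one missing
# value and one duplicated row added on purpose for illustration.
raw = pd.DataFrame({
    "age": [20, 23, 23, 26, None],
    "gender": [1, 1, 1, 1, 0],
    "genre": ["HipHop", "HipHop", "HipHop", "Jazz", "Dance"],
})

null_counts = raw.isnull().sum()               # missing values per column
duplicate_count = int(raw.duplicated().sum())  # rows identical to an earlier row

# Drop rows with missing values, then drop repeated rows
cleaned = raw.dropna().drop_duplicates().reset_index(drop=True)

print(null_counts)
print(duplicate_count)
print(cleaned)
```

If both counts come back as zero on the real data, the cleaning step can be skipped.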

We still need to divide our data into two sets: the input set (with age and gender) and the output set (with genre).

In [4]:
# The input set (with age and gender) 
X = music_data.drop(columns = ['genre'])
X

Unnamed: 0,age,gender
0,20,1
1,23,1
2,25,1
3,26,1
4,29,1
5,30,1
6,31,1
7,33,1
8,37,1
9,20,0


In [5]:
# The output set (with genre) 
y = music_data['genre']
y

0        HipHop
1        HipHop
2        HipHop
3          Jazz
4          Jazz
5          Jazz
6     Classical
7     Classical
8     Classical
9         Dance
10        Dance
11        Dance
12     Acoustic
13     Acoustic
14     Acoustic
15    Classical
16    Classical
17    Classical
Name: genre, dtype: object

STEP 3: LEARNING AND PREDICTING

Here we use the decision tree algorithm.

In [6]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

# Now we train the model (an instance of DecisionTreeClassifier):
model.fit(X.values, y)    # train the model on the values of the input set

# Here we ask our model to make predictions
predictions = model.predict([ [21,1], [22,0] ])   # passing new input sets as [age, gender]
predictions

array(['HipHop', 'Dance'], dtype=object)

STEP 4: MEASURING ACCURACY

In [12]:
# First we need to split our dataset into training and testing sets to measure accuracy
# The general rule is to allocate 70-80% for training and the other 20-30% for testing

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score   # used below to compute the accuracy

# We give this function 3 arguments: X, y and a keyword argument that specifies the size of our test dataset (0.2 means 20% of the data is used for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # the first two are the input sets for training and testing; the next two are the matching output sets

# Now instead of passing the entire dataset to train the model as we did before, we pass only the training dataset
model.fit(X_train, y_train)

# Here, instead of passing new input sets for prediction, we use the test dataset
predictions = model.predict(X_test)

# To calculate the accuracy we compare the predictions with the actual values in the testing set (y_test)
score = accuracy_score(y_test, predictions)
score

0.75

Running this code multiple times gives a different accuracy each time, because the rows assigned to the training and testing sets change with every random split.
In these runs the accuracy still stayed between 75% and 100% (0.75-1.0), which is good.

But if, for example, we change test_size from 0.2 to 0.8 (80% of the data for testing and only 20% for training the model), we immediately notice a drop in accuracy, because the model has too few examples to learn from.
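To see the effect of the split more systematically, the accuracy can be averaged over many random splits for each test size. The sketch below is one possible way to do that; the `accuracy_summary` helper, the `random_state` seeds, and the inline data (an illustrative stand-in for `music.csv`, not the real file) are all assumptions added for the demonstration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in for music.csv: same columns, values chosen for the demo.
data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre":  ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
              + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})
X = data.drop(columns=["genre"])
y = data["genre"]


def accuracy_summary(test_size, n_runs=50):
    """Return (mean, min, max) accuracy over n_runs random splits."""
    scores = []
    for seed in range(n_runs):
        # A fixed seed per run makes each split reproducible
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return sum(scores) / len(scores), min(scores), max(scores)


for size in (0.2, 0.8):
    mean, low, high = accuracy_summary(size)
    print(f"test_size={size}: mean={mean:.2f} min={low:.2f} max={high:.2f}")
```

With only 18 rows, a 20% test set holds four samples, so a single run can only score 0, 0.25, 0.5, 0.75 or 1.0. Averaging over several splits gives a steadier picture than any one run.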

_Model Persistence_

The goal of model persistence is to avoid retraining: we build and train the model only once in a while and save it to a file. The next time we want to make predictions, we simply load the already-trained model from the file and ask it to predict.

In [15]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib   # joblib is a Python package with functions for saving and loading models

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns = ['genre'])
y = music_data['genre']

model = DecisionTreeClassifier()
model.fit(X.values, y)

# We give it two arguments: the model and the name of the file in which we want to store it
joblib.dump(model, 'music-recommender.joblib')

# predictions = model.predict([ [21,1], [22,0] ])


['music-recommender.joblib']

In [16]:
# In a real application we don't train the model every time, hence model persistence. So we comment out the training part

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib   # joblib is a Python package with functions for saving and loading models

# music_data = pd.read_csv('music.csv')
# X = music_data.drop(columns = ['genre'])
# y = music_data['genre']
#
# model = DecisionTreeClassifier()
# model.fit(X.values, y)

# This time, instead of dumping the model, we load it
model = joblib.load('music-recommender.joblib')
predictions = model.predict([[21,1]])
predictions

array(['HipHop'], dtype=object)
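A quick sanity check after saving is to confirm that the loaded model predicts exactly what the original one does. The sketch below does the full round trip in a temporary directory; the training rows are hypothetical stand-ins for `music.csv`, and `random_state=0` is an assumption added so the run is repeatable.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical [age, gender] rows and genres, standing in for music.csv.
X = [[20, 1], [23, 1], [26, 1], [29, 1], [20, 0], [21, 0]]
y = ["HipHop", "HipHop", "Jazz", "Jazz", "Dance", "Dance"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "music-recommender.joblib")
    joblib.dump(model, path)       # save the trained model
    restored = joblib.load(path)   # load it back without retraining

# Compare predictions from the original and the restored model
samples = [[21, 1], [22, 0]]
same = list(model.predict(samples)) == list(restored.predict(samples))
print(same)
```

Files written by joblib are pickle-based, so it is safest to load only model files that come from a source you trust.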

_Visualizing Decision Trees_

In [18]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

music_data = pd.read_csv('music.csv')
X = music_data.drop(columns = ['genre'])
y = music_data['genre']

model = DecisionTreeClassifier()
model.fit(X.values, y)

# After training the model we export the tree as a graph
# We pass the model, the name of our output file, and a few display options
tree.export_graphviz(model, out_file='music-recommender.dot',     # .dot is the Graphviz graph description language
                     feature_names=['age', 'gender'],      # these are the features/columns of our dataset
                     class_names=sorted(y.unique()),    # y contains every genre many times; sorted(y.unique()) removes the repeats and sorts the genres alphabetically
                     label='all',
                     rounded=True,
                     filled=True)   
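The `.dot` file needs a Graphviz-aware viewer to be rendered. As a lighter alternative, scikit-learn's `export_text` prints the learned rules as plain text. The sketch below is not part of the original notebook; it trains on a few hypothetical rows standing in for `music.csv`, with `random_state=0` assumed for repeatability.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical rows with the same columns as music.csv.
data = pd.DataFrame({
    "age":    [20, 23, 26, 29, 20, 21],
    "gender": [1, 1, 1, 1, 0, 0],
    "genre":  ["HipHop", "HipHop", "Jazz", "Jazz", "Dance", "Dance"],
})
X = data[["age", "gender"]]
y = data["genre"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X.values, y)

# Each indented line is a split on a feature; leaf lines show the predicted class
rules = export_text(model, feature_names=["age", "gender"])
print(rules)
```

Reading the printed rules from top to bottom follows the same path the model takes when it classifies a new `[age, gender]` pair.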