# Python Machine Learning

* **Import** the Data.
* **Clean** the Data.
* **Split** the Data into Training & Testing Sets.
* **Create** the Model.
* **Train** the Model.
* **Make** Predictions.
* **Evaluate** and Improve.

### Importing the Data

In [None]:
import pandas as pd
print("pandas has been imported!")

In [8]:
music_data = pd.read_csv("https://nq.aibigdatasolutions.com/wp-content/uploads/2021/11/music.csv")
print("Retrieving data done!")

Retrieving data done!


In [33]:
#music_data

### Cleaning & Preparing the Data

In [49]:
X = music_data.drop(columns=['genre'])  # input set: every column except the label
y = music_data['genre']                 # output set: the label we want to predict

In [50]:
#X

In [51]:
#y
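Before splitting into `X` and `y`, it can help to check the data for missing values and duplicate rows. The sketch below is only an illustration: it rebuilds a small DataFrame from the ten rows displayed later in this notebook instead of downloading `music.csv`, and the names `missing_per_column`, `duplicate_rows` and `clean_data` are introduced here for the example. Whether dropping duplicates is appropriate depends on the dataset.

```python
import pandas as pd

# Illustrative sample: the ten rows displayed later in this notebook.
# The real music.csv may contain more rows.
music_data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})

# Basic cleaning checks: missing values per column and exact duplicate rows.
missing_per_column = music_data.isna().sum()
duplicate_rows = int(music_data.duplicated().sum())

# Drop incomplete or duplicated rows (a no-op for this sample).
clean_data = music_data.dropna().drop_duplicates()

X = clean_data.drop(columns=["genre"])  # input features
y = clean_data["genre"]                 # output labels

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("X shape:", X.shape, "y shape:", y.shape)
```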

### Learning & Predicting

In [52]:
from sklearn.tree import DecisionTreeClassifier
print("sklearn.tree imported!")

sklearn.tree imported!


In [53]:
# Create an instance of the classifier
model = DecisionTreeClassifier()
# We now have an (untrained) model

In [54]:
# Training the model is easy: fit takes two arguments, the input set and the output set
model.fit(X, y)

DecisionTreeClassifier()

In [61]:
music_data

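| | age | gender | genre |
|---|---|---|---|
| 0 | 20 | 1 | HipHop |
| 1 | 23 | 1 | HipHop |
| 2 | 25 | 1 | HipHop |
| 3 | 26 | 1 | Jazz |
| 4 | 29 | 1 | Jazz |
| 5 | 30 | 1 | Jazz |
| 6 | 31 | 1 | Classical |
| 7 | 33 | 1 | Classical |
| 8 | 37 | 1 | Classical |
| 9 | 20 | 0 | Dance |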


In [62]:
# Now let's ask the model to predict the genre for two people who are not in the dataset:
predictions = model.predict([[21, 1], [22, 0]])
predictions

array(['HipHop', 'Dance'], dtype=object)
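When a model is fitted on a DataFrame and then given a plain list, recent scikit-learn versions may warn that the input has no feature names. One way to avoid this is to pass the new samples as a DataFrame with the same column names. This is a minimal sketch: it uses the ten rows displayed above rather than the full `music.csv`, and `new_listeners` is a name made up for the example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative sample: the ten rows displayed above, not the full dataset.
music_data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]

# random_state only makes the example repeatable; it is not in the original code.
model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# New samples with the same column names and order as the training inputs.
new_listeners = pd.DataFrame({"age": [21, 22], "gender": [1, 0]})
predictions = model.predict(new_listeners)
print(predictions)
```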

### How to measure the accuracy of our model?

* **We need** to split our dataset into two sets, one for **Training** and the other for **Testing**.
* A general rule of thumb is to allocate 70-80% of the data for training and the other 20-30% for testing.
    * Instead of passing only the 2 samples from "In[62]" to make predictions, we pass the test set and get predictions for it.
    * Then we compare these predictions with the actual values in the test set.
* Based on that comparison we can measure the accuracy.

* To implement this we rewrite the code from the start, because the earlier cells were not written with accuracy measurement in mind. The split must happen right after `X` and `y` are created and before the model is trained:
    * `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)`
    * The first two variables (`X_train`, `X_test`) are the **input sets for training & testing**.
    * The other two (`y_train`, `y_test`) are the **output sets for training & testing**.
    * When training the model, we pass only the training set instead of the entire dataset:
        * `model.fit(X_train, y_train)`
    * When making predictions, we pass `X_test` instead of the two samples from "In[62]":
        * `predictions = model.predict(X_test)`

    * To **calculate the accuracy**, we compare `predictions` with `y_test` (the actual values in the output test set).
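To make the comparison concrete, the sketch below computes the accuracy by hand as the fraction of predictions that match the true labels, and checks it against `accuracy_score`. The labels and predictions here are made up purely for illustration; they are not results from the music dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, invented for this example.
y_test = np.array(["HipHop", "Jazz", "Dance", "Classical"])
predictions = np.array(["HipHop", "Jazz", "Classical", "Classical"])

# Element-wise comparison gives True where the prediction is correct.
matches = predictions == y_test

# Accuracy is the share of correct predictions.
manual_accuracy = matches.mean()
library_accuracy = accuracy_score(y_test, predictions)

print(manual_accuracy, library_accuracy)
```

Three of the four made-up predictions match, so both values come out as 0.75.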

In [123]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split  #new to split the dataset
from sklearn.metrics import accuracy_score            #new to measure the accuracy

In [124]:
music_data = pd.read_csv("https://nq.aibigdatasolutions.com/wp-content/uploads/2021/11/music.csv")
print("Retrieving data done!")

Retrieving data done!


In [125]:
X = music_data.drop(columns=['genre'])
y = music_data['genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [126]:
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
predictions = model.predict(X_test)

In [127]:
score = accuracy_score(y_test, predictions)
score

1.0

* The accuracy score is 1.0, or 100 percent.
* If we run cells "In[125]" to "In[127]" again, we may see a **different result**, because every time we split the dataset we get different training and test sets: **`train_test_split` picks the rows for training and testing at random**. With a test set this small, a single wrong prediction changes the score a lot.
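The effect of the random split can be sketched as follows. Passing `random_state` to `train_test_split` fixes the split, so the same seed gives the same score, while different seeds can give different scores. This is an illustration under assumptions: it uses only the ten rows displayed earlier, and `split_and_score` is a helper written for this example, not part of scikit-learn.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative sample: the ten rows displayed earlier, not the full dataset.
music_data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]


def split_and_score(seed):
    """Split with a fixed seed, train a tree, and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# The same seed reproduces the same split and therefore the same score.
first_run = split_and_score(seed=42)
second_run = split_and_score(seed=42)
print("repeatable:", first_run == second_run)

# Different seeds give different splits, so the scores may vary.
scores = [split_and_score(seed) for seed in range(5)]
print(scores)
```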

### Persisting Models

* Training a model **can be really time consuming**. Our example uses a very small dataset, but a real application might have thousands or millions of samples, and training on that can take seconds, minutes, or even hours. **That is why model persistence is important**.
* We build and train the model once and save it to a file. The next time we want predictions, we load the model from the file and ask it to predict; it is already trained, so we don't need to retrain it.
* To keep things simple:
    * I have removed the accuracy code from the last section, because this section focuses on a different topic.
    * We import the dataset, create a model, train it, and then ask it to make predictions.
* We can implement this as follows:
    * We import `joblib`, which has functions for saving and loading models. Older tutorials import it from `sklearn.externals`, but that path was removed from recent scikit-learn versions, so we import the standalone `joblib` package.
    * After training the model, we call `joblib.dump` with two arguments: **the model and the name of the file** in which we want to store it.

In [138]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# pip install joblib
# joblib is a set of lightweight pipelining tools for Python; it also
# provides dump/load functions for saving and loading Python objects.
# "from sklearn.externals import joblib" fails on recent scikit-learn versions, so we import joblib directly:
import joblib as jl

In [139]:
music_data = pd.read_csv("https://nq.aibigdatasolutions.com/wp-content/uploads/2021/11/music.csv")
X = music_data.drop(columns=['genre'])
y = music_data['genre']

In [140]:
model = DecisionTreeClassifier()
model.fit(X,y)

DecisionTreeClassifier()

In [145]:
# Save the trained model to a file.
# joblib.dump takes two arguments: the model and the name of the file.
jl.dump(model, 'music-recommender.joblib')
print("music-recommender.joblib created in the current working directory")

music-recommender.joblib created in the current working directory


#### Let's check how to make predictions without training the model again:

* Temporarily, I'm going to **comment out the code in In[139] & In[140]**, which loads the data and trains the model.
* Looking in the working directory, we can see the joblib file.
* As mentioned before, **in a real application we don't want to train a model every time**, so:
    * instead of dumping the model,
    * we load it by calling the **load** function.

In [148]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib as jl

In [150]:
model = jl.load('music-recommender.joblib')
predictions = model.predict([ [21,1],[22,0] ])
predictions

array(['HipHop', 'Dance'], dtype=object)
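The save-and-load round trip can be checked in one place. The sketch below writes the model to a temporary directory (an assumption made so the example leaves no file behind), loads it back, and confirms that the restored model gives the same predictions as the original. It uses only the ten rows displayed earlier, and `restored` is a name chosen for this example.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Illustrative sample: the ten rows displayed earlier, not the full dataset.
music_data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save to a temporary directory, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "music-recommender.joblib")
    joblib.dump(model, model_path)
    file_exists = os.path.exists(model_path)
    restored = joblib.load(model_path)

# The restored model should predict exactly what the original predicts.
same_predictions = bool((model.predict(X) == restored.predict(X)).all())
print("file written:", file_exists)
print("same predictions:", same_predictions)
```

Since joblib files are based on pickle, it is safest to load only files you created yourself or otherwise trust, and to load them with the same scikit-learn version that saved them.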

### Visualising Decision Trees

In this section we're going to **export our decision tree model in a visual format**, so we can see how it makes predictions:
* **import tree**: this module has a function for exporting a decision tree in a graphical format.

In [157]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [158]:
music_data = pd.read_csv("https://nq.aibigdatasolutions.com/wp-content/uploads/2021/11/music.csv")
X = music_data.drop(columns=['genre'])
y = music_data['genre']

In [159]:
model = DecisionTreeClassifier()
model.fit(X,y)

DecisionTreeClassifier()

**The parameters we're going to set are:**
* **out_file**: the name of the output file, here `music-recommender.dot`. DOT is a text format for describing graphs.
* **feature_names**: a list of two strings, `age` and `gender`. These are the columns of our input set, that is, the features of our data.
* **class_names**: the list of classes (labels) in our output set, such as HipHop, Jazz, Classical and so on. `y` contains every genre, but each one is repeated several times, so we call `y.unique()` to get the unique classes. They must be in ascending order, so we pass the result of `y.unique()` to the **sorted function**.
* **label**: set to the string `'all'`, so that every node carries informative labels.
* **rounded**: set to `True`, so node boxes are drawn with rounded corners.
* **filled**: set to `True`, so each node is colored according to its majority class.

In [162]:
tree.export_graphviz(model, out_file='music-recommender.dot',
                     feature_names=['age', 'gender'],
                     class_names=sorted(y.unique()),
                     label='all',
                     rounded=True,
                     filled=True)
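To inspect the tree without opening the `.dot` file, two alternatives can be sketched. Passing `out_file=None` makes `export_graphviz` return the DOT source as a string, and `tree.export_text` prints the decision rules as plain text. The sketch also checks that `sorted(y.unique())` matches the model's own `classes_` order. It uses only the ten rows displayed earlier, so the resulting tree may differ from the one trained on the full `music.csv`.

```python
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Illustrative sample: the ten rows displayed earlier, not the full dataset.
music_data = pd.DataFrame({
    "age":    [20, 23, 25, 26, 29, 30, 31, 33, 37, 20],
    "gender": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "genre":  ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz", "Jazz",
               "Classical", "Classical", "Classical", "Dance"],
})
X = music_data.drop(columns=["genre"])
y = music_data["genre"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# The sorted unique labels should match the order the model stores internally.
class_names = sorted(y.unique())
same_order = class_names == list(model.classes_)

# With out_file=None the DOT source is returned instead of written to disk.
dot_source = tree.export_graphviz(
    model,
    out_file=None,
    feature_names=["age", "gender"],
    class_names=class_names,
    label="all",
    rounded=True,
    filled=True,
)

# Plain-text view of the same decision rules.
text_rules = tree.export_text(model, feature_names=["age", "gender"])

print("class order matches:", same_order)
print(text_rules)
```

If Graphviz is installed, the saved file can be rendered to an image with its `dot` command, for example `dot -Tpng music-recommender.dot -o music-recommender.png`.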