# Machine learning example

## 1. Import the data

In [2]:
import pandas as pd

In [7]:
# pd.read_csv('./data/music.csv.zip') only works when the archive holds a single file;
# our zip contains multiple files, so we open music.csv explicitly

import zipfile

zf = zipfile.ZipFile('./data/music.csv.zip') # the archive that contains music.csv
df = pd.read_csv(zf.open('music.csv'))
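To see which files an archive holds before picking one, `ZipFile.namelist()` can help. The sketch below builds a small in-memory archive so it runs without the real data file; the member names and CSV contents are made up for illustration.

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive with two members (hypothetical contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("music.csv", "age,gender,genre\n20,1,HipHop\n26,1,Jazz\n")
    archive.writestr("README.txt", "extra file")

with zipfile.ZipFile(buffer) as archive:
    # List every member, then read only the CSV we want
    members = archive.namelist()
    with archive.open("music.csv") as handle:
        sample = pd.read_csv(handle)

print(members)
print(sample.shape)
```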

In [8]:
df.head()

Unnamed: 0,age,gender,genre
0,20,1,HipHop
1,23,1,HipHop
2,25,1,HipHop
3,26,1,Jazz
4,29,1,Jazz


## 2. Cleaning the data

Alternative guide:  
https://www.educative.io/blog/what-is-data-cleaning

### Framework for Data Cleaning
**Step 1:** Remove duplicates at the id level, that is, the level at which the rows should be unique.\
**Step 2:** Transform qualitative data into quantitative data by mapping strings to integers.\
*E.g. a hotel offers packages for 2 days, 5 days and 10 days. We can encode the data as: 1 = 2 days, 2 = 5 days and 3 = 10 days.*\
**Step 3:** Handle outliers.\
Check for outliers on all key variables, especially the computed ones.\
**Step 4:** Handle blank columns and repeated values.\
Check for blank columns, a large % of blank data, and a high % of identical data.\
*Look for columns which are entirely blank. This can happen when a join fails or when there is an error in data extraction.*\
Check the % of blank cases in each column, and check frequency distributions to find out whether the same data is repeated in more cases than expected.\
**Step 5:** Handle missing values.\
https://medium.com/@vinitasilaparasetty/guide-to-handling-missing-values-in-data-science-37d62edbfdc1

Follow the link above for my guide on handling missing values.
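A minimal pandas sketch of Steps 1, 2 and 4, using the hotel package example. The `bookings` table, its column names and the `package_codes` mapping are hypothetical and only serve to illustrate the steps.

```python
import pandas as pd

# Hypothetical toy data: booking_id is the level at which rows should be unique
bookings = pd.DataFrame({
    "booking_id": [1, 2, 2, 3, 4],
    "package": ["2 days", "5 days", "5 days", "10 days", None],
    "notes": [None, None, None, None, None],
})

# Step 1: remove duplicates at the id level (keeps the first row per id)
bookings = bookings.drop_duplicates(subset="booking_id")

# Step 2: map strings to integers (1 = 2 days, 2 = 5 days, 3 = 10 days)
package_codes = {"2 days": 1, "5 days": 2, "10 days": 3}
bookings["package_code"] = bookings["package"].map(package_codes)

# Step 4: percentage of blank values in each column
blank_pct = bookings.isna().mean() * 100

# Columns that are entirely blank
fully_blank = blank_pct[blank_pct == 100].index.tolist()

print(blank_pct)
print(fully_blank)
```

Values that are missing from the mapping (here the blank `package`) come out as `NaN`, so they still show up in the blank-percentage check.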



#### Quality Check:
Check the quality of the cleaning that has been done by conducting one or both of the following tests:
1. Synchronisation Test:
Check whether all columns are in sync with each other, that is, whether they are in chronological order.
2. Log Test:
If your data is perfectly clean, a simple query, such as displaying logs of the variables, should return the right result. If not, you may have to go back and check what you missed.

### Data Cleaning Checklist
- Remove HTML characters.
- Decode encoded data.
- Remove or substitute NULL values
- Handle zero values
- Handle negative values
- Handle date values
- Remove unnecessary values
- Remove stop-words
- Remove punctuation
- Remove expressions
- Use an underscore to split words that are attached.
- Check min and max for each column to ensure that they make sense
- Remove URLs, unless required for analysis.
- Check grammar
- Check spellings
- Check for incorrect entries
- Geographic coordinates must be within range: -90 to 90 degrees for latitude and -180 to 180 degrees for longitude.
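Two of the checklist items, the min/max check and the coordinate range check, can be sketched in pandas as follows. The `places` table is made up, and the ranges assume the usual convention of latitude in [-90, 90] and longitude in [-180, 180].

```python
import pandas as pd

# Hypothetical toy data with one out-of-range latitude (95.0)
places = pd.DataFrame({
    "latitude": [51.5, -33.9, 95.0],
    "longitude": [-0.1, 151.2, 10.0],
})

# Min and max per column, to see at a glance whether the ranges make sense
ranges = places.agg(["min", "max"])

# Flag rows whose coordinates fall outside the valid ranges
valid = places["latitude"].between(-90, 90) & places["longitude"].between(-180, 180)
bad_rows = places[~valid]

print(ranges)
print(bad_rows)
```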

The data set we have is pretty simple, so we don't have to do much cleaning.

## 3. Split into training & test set

`70% - 80%` of our data should be used for training and the rest for testing.

By convention, `X` represents our input set and `y` represents our output set.

In [11]:

X = df.drop(columns=['genre'])
y = df['genre']

In [10]:
X.head()

Unnamed: 0,age,gender
0,20,1
1,23,1
2,25,1
3,26,1
4,29,1


In [13]:
y.head()

0    HipHop
1    HipHop
2    HipHop
3      Jazz
4      Jazz
Name: genre, dtype: object

In [25]:
from sklearn.model_selection import train_test_split

# 70% is for training, 30% is for testing (test_size=0.3)
# the four returned splits can be unpacked directly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
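`train_test_split` shuffles the rows randomly, so each run produces a different split. A small sketch of making the split repeatable with the `random_state` parameter follows; the `toy` table is a hypothetical stand-in for the music data, and the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the music data (10 rows)
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 35, 37],
    "gender": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz",
              "Dance", "Dance", "Acoustic", "Acoustic", "Classical"],
})
X_toy = toy.drop(columns=["genre"])
y_toy = toy["genre"]

# A fixed random_state means the same rows land in the test set on every run
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))
print(X_te.index.tolist() == X_te2.index.tolist())
```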

## 4. Create a model

Simply use the tools from sklearn to create the appropriate model.

*Note that **choosing a model** is the harder part.* \
Useful guides: \
https://towardsdatascience.com/do-you-know-how-to-choose-the-right-machine-learning-algorithm-among-7-different-types-295d0b0c7f60 \
https://www.kdnuggets.com/2020/05/guide-choose-right-machine-learning-algorithm.html \
https://towardsdatascience.com/considerations-when-choosing-a-machine-learning-model-aa31f52c27f3 \
https://www.datasciencecentral.com/how-to-choose-a-machine-learning-model-some-guidelines/ \
(saved the best for last)

In [21]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

## 5. Train the model

#### This is what most models in scikit-learn will do

Call `model.fit(X, y)` to train the model. The call itself is short; the bulk of the work lies in choosing the right model.

In [26]:
model.fit(X_train, y_train)

DecisionTreeClassifier()

## 6. Make Predictions

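Here we use `model.predict`, which takes a two-dimensional array with one row per sample. \
For example, `[21, 1]` is a 21-year-old male and `[22, 0]` is a 22-year-old female, each passed as one row of that array. Note that `X` as built above still contains the `Unnamed: 0` column, so a model fitted on it expects three values per row; the two-value rows here assume a model trained on `age` and `gender` only.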

In [18]:
#predictions = model.predict([ [21,1], [22,0] ])
#predictions

array(['HipHop', 'Dance'], dtype=object)
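A minimal sketch of predicting for hand-written samples, assuming the model is trained on `age` and `gender` only. The `toy` rows are made up, so the predicted genres will not necessarily match the output above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows with only the two input features
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 20, 21, 25, 26, 27],
    "gender": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "HipHop", "Jazz", "Jazz",
              "Dance", "Dance", "Dance", "Acoustic", "Acoustic"],
})
X_toy = toy[["age", "gender"]]
y_toy = toy["genre"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, y_toy)

# New samples use the same columns, in the same order, as the training data
new_people = pd.DataFrame({"age": [21, 22], "gender": [1, 0]})
new_predictions = clf.predict(new_people)

print(new_predictions)
```

Passing a DataFrame with the training column names keeps the feature order explicit; a plain nested list such as `[[21, 1], [22, 0]]` carries no column names.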

In [27]:
predictions = model.predict(X_test)
predictions

array(['Classical', 'Jazz', 'Dance', 'HipHop', 'Classical', 'Acoustic'],
      dtype=object)

## 7. Evaluate & Improve

In [28]:
from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, predictions)
score

0.8333333333333334
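With a data set this small, the accuracy depends heavily on which rows end up in the test set. One way to get a steadier picture is to repeat the split several times and average the scores. The sketch below assumes a hypothetical `toy` table in place of the music data; the number of repeats (5) is arbitrary.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the music data (18 rows)
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
             + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3,
})
X_toy = toy[["age", "gender"]]
y_toy = toy["genre"]

scores = []
for seed in range(5):
    # A different seed gives a different train/test split each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.3, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```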