# Overview

<img src="machine-learning-life-cycle.png" width=350></img>
<br> [source](https://www.javatpoint.com/machine-learning-life-cycle)
<br><br> The following should be noted:
* There is a step 0: Understanding the problem <br><br>
* Since data wrangling is done during data analysis and model building ([source](https://www.infoq.com/articles/ml-data-processing/#:~:text=data%20wrangling%3A%20preparation%20of%20data%20during%20the%20interactive%20data%20analysis%20and%20model%20building.)), it will not have its own header in this notebook; instead, it is intertwined with the other steps <br><br>
* Model deployment will not be included in this notebook

# 0: Understanding The Problem

* Note: An "Abalone" is a marine snail. 
<br><br>
* Objective: Predict the age of an abalone from physical measurements. The age is conventionally determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope ([source](https://datahub.io/machine-learning/abalone#:~:text=predicting%20the%20age%20of%20abalone%20from%20physical%20measurements.%20the%20age%20of%20abalone%20is%20determined%20by%20cutting%20the%20shell))
<br><br>
* Where ML can help: Given the details (features) of an abalone, a model can predict the number of rings, which indicates the age of that abalone
<br>
(adding 1.5 to the ring count gives the age in years ([source](https://www.openml.org/search?type=data&sort=runs&id=183&status=active#:~:text=%2B1.5%20gives%20the%20age%20in%20years)))
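
As a minimal sketch of the rings-to-age relationship described above, the conversion is a single addition. The function name `rings_to_age` is a hypothetical helper, not part of the dataset or any library.

```python
def rings_to_age(rings: float) -> float:
    """Approximate age in years: ring count plus 1.5 (per the OpenML description)."""
    return rings + 1.5


# An abalone with 15 rings is estimated to be 16.5 years old
print(rings_to_age(15))  # 16.5
```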

# 1: Gathering The Data

The data can be obtained from [datahub.io](https://datahub.io/machine-learning/abalone#data) or [openml.org](https://www.openml.org/search?type=data&sort=runs&id=183&status=active)

## Importing Libraries

In [20]:
import pandas as pd
from scipy.io.arff import loadarff 

## Importing The Dataset

In [21]:
raw_data = loadarff('dataset_187_abalone.arff')

In [22]:
print(raw_data[0])

[(b'M', 0.455, 0.365, 0.095, 0.514 , 0.2245, 0.101 , 0.15 , b'15')
 (b'M', 0.35 , 0.265, 0.09 , 0.2255, 0.0995, 0.0485, 0.07 , b'7')
 (b'F', 0.53 , 0.42 , 0.135, 0.677 , 0.2565, 0.1415, 0.21 , b'9') ...
 (b'M', 0.6  , 0.475, 0.205, 1.176 , 0.5255, 0.2875, 0.308, b'9')
 (b'F', 0.625, 0.485, 0.15 , 1.0945, 0.531 , 0.261 , 0.296, b'10')
 (b'M', 0.71 , 0.555, 0.195, 1.9485, 0.9455, 0.3765, 0.495, b'12')]


In [23]:
print(raw_data[1])

Dataset: abalone_train
	Sex's type is nominal, range is ('M', 'F', 'I')
	Length's type is numeric
	Diameter's type is numeric
	Height's type is numeric
	Whole_weight's type is numeric
	Shucked_weight's type is numeric
	Viscera_weight's type is numeric
	Shell_weight's type is numeric
	Class_number_of_rings's type is nominal, range is ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29')
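
The metadata printed above can also be queried programmatically, which is handy for finding the nominal columns (the ones that come back as byte strings). The sketch below uses a tiny in-memory ARFF snippet as a stand-in for the real file; the snippet's contents and the variable names are assumptions made for illustration only.

```python
from io import StringIO

from scipy.io.arff import loadarff

# A tiny hypothetical ARFF snippet standing in for dataset_187_abalone.arff
sample_arff = StringIO("""@relation abalone_sample
@attribute Sex {M,F,I}
@attribute Length numeric
@attribute Class_number_of_rings {7,9,15}
@data
M,0.455,15
M,0.35,7
F,0.53,9
""")

data, meta = loadarff(sample_arff)

# meta.names() and meta.types() return the attribute names and their types
column_names = meta.names()
column_types = meta.types()

# Collect the columns declared as nominal in the ARFF header
nominal_columns = [
    name for name, kind in zip(column_names, column_types) if kind == "nominal"
]
print(nominal_columns)
```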



In [24]:
df = pd.DataFrame(raw_data[0])
df.head(5)

| | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Class_number_of_rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b'M' | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | b'15' |
| 1 | b'M' | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | b'7' |
| 2 | b'F' | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | b'9' |
| 3 | b'M' | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | b'10' |
| 4 | b'I' | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | b'7' |


# 2: Data Preparation

## Converting Byte Strings To Unicode Strings
Read more about string types [here](https://towardsdatascience.com/byte-string-unicode-string-raw-string-a-guide-to-all-strings-in-python-684c4c4960ba#:~:text=To%20store%20the,bytes%20(16-bits).), [here](https://www.geeksforgeeks.org/byte-objects-vs-string-python/), and [here](https://stackoverflow.com/questions/6269765/what-does-the-b-character-do-in-front-of-a-string-literal)

In [25]:
df.Sex[0], type(df.Sex[0]), df.Class_number_of_rings[0], type(df.Class_number_of_rings[0])

(b'M', bytes, b'15', bytes)

In [26]:
df.Sex = df.Sex.str.decode('utf-8')
df.Class_number_of_rings = df.Class_number_of_rings.str.decode('utf-8')
df.Sex[0], type(df.Sex[0]), df.Class_number_of_rings[0], type(df.Class_number_of_rings[0])

('M', str, '15', str)
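
The cell above decodes the two nominal columns by name. A more general sketch, assuming one would rather not list the columns by hand, decodes every column whose values are all byte strings and then converts the ring counts to integers so they can be used as a numeric target. The `sample` frame below is a hypothetical stand-in built from the first rows of the dataset, not the full `df`.

```python
import pandas as pd

# Small hypothetical sample mirroring the first rows of the dataset
sample = pd.DataFrame({
    "Sex": [b"M", b"M", b"F"],
    "Length": [0.455, 0.35, 0.53],
    "Class_number_of_rings": [b"15", b"7", b"9"],
})

# Decode every column whose values are all byte strings
for col in sample.columns:
    if sample[col].map(lambda value: isinstance(value, bytes)).all():
        sample[col] = sample[col].str.decode("utf-8")

# Ring counts are still strings after decoding; convert them to integers
sample["Class_number_of_rings"] = pd.to_numeric(sample["Class_number_of_rings"])

print(sample.dtypes)
```

Whether to keep the ring count as a string label or convert it to a number depends on whether the task is framed as classification or regression; the conversion here is only one option.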

Since there are no missing values and all continuous values have already been scaled ([source](https://datahub.io/machine-learning/abalone#:~:text=from%20the%20original%20data%20examples%20with%20missing%20values%20were%20removed%20(the%20majority%20having%20the%20predicted%20value%20missing)%2C%20and%20the%20ranges%20of%20the%20continuous%20values%20have%20been%20scaled%20for%20use%20with%20an%20ann%20(by%20dividing%20by%20200).)), we can proceed directly to the next step
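
Both claims can be checked rather than taken on trust. The sketch below counts missing values per column and multiplies the scaled values by 200 to recover the original measurement units, using a hypothetical `sample` frame built from the first rows shown earlier; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical sample of three continuous columns from the first rows
sample = pd.DataFrame({
    "Length": [0.455, 0.35, 0.53],
    "Diameter": [0.365, 0.265, 0.42],
    "Height": [0.095, 0.09, 0.135],
})

# Count missing values in each column; all zeros means nothing to impute
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Undo the scaling: the source states the values were divided by 200
original_units = sample * 200
print(original_units.round(1))
```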

# 3: Data Analysis