# Week 4: Mini Project

This notebook will guide you through smaller portions of your final project. For this notebook, we will be using the Abalone dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Abalone) (originating from the Marine Research Laboratories – Taroona). This dataset should already be in your folder (as `abalone.csv`); if it is not, you can download it from the link above.

![Abalone](abalone.jpg)

### A Brief History of Abalones

An abalone is a sea snail belonging to one of between 30 and 130 species (depending on which scientist you ask). Many cultures prize it for its mother-of-pearl shell, pearls, and delicious flesh, and it has long been a valuable source of food in its native environments. Sadly, wild populations of abalone have been overfished and poached to the point where commercial farming now supplies most abalone meat. Abalone are now counted among the animals threatened with extinction.

Source: https://en.wikipedia.org/wiki/Abalone

---

## Part 1: Familiarize Yourself With the Dataset

The purpose of this dataset is to predict the age of an abalone from its physical measurements. Age is normally determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Good thing it's already been done for us!

Below is the dataset description from the UCI Machine Learning Repository. 

| Name | Data Type | Units | Description |
| ---- | --------- | ----- | ----------- |
| Sex | nominal | | M, F, and I (infant) |
| Length | continuous | mm | longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | | +1.5 gives the age in years |
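
The last row of the table is the link between the ring count and the prediction target. As a minimal sketch of that conversion, using a handful of illustrative ring counts rather than the full file (the `rings` and `age_years` names are hypothetical):

```python
import pandas as pd

# Illustrative ring counts (hypothetical sample, not the full dataset)
rings = pd.Series([15, 7, 9, 10, 7], name="Rings")

# Per the dataset description, age in years is the ring count plus 1.5
age_years = rings + 1.5
print(age_years.tolist())  # [16.5, 8.5, 10.5, 11.5, 8.5]
```

Once the dataset is loaded as a DataFrame, the same arithmetic could be applied to its `Rings` column.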

Run the cells below to examine the dataset. 

In [66]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [67]:
# Load the Abalone dataset (the CSV file has no header row)
data = pd.read_csv('abalone.csv', header=None)
# Rename the columns to match the dataset description
data = data.rename(columns = {0:'Sex', 1:'Length', 2:'Diameter', 3:'Height',4:'Whole weight',
                              5:'Shucked weight',6:'Viscera weight',7:'Shell weight',8:'Rings'})
# View the first 5 rows
data.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
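
As an alternative to loading and then renaming, `pd.read_csv` can assign the column labels while reading through its `names` parameter. The sketch below assumes the file has no header line, as above, and uses an in-memory stand-in (`sample_csv`, a hypothetical name) holding the first two rows so that it runs without the file:

```python
import io
import pandas as pd

COLUMNS = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight',
           'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']

# Stand-in for abalone.csv: the first two rows shown above, with no header line
sample_csv = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
)

# names= labels the columns during the read, so no rename step is needed
sample = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(sample.columns.tolist())
```

With the real file, passing `'abalone.csv'` in place of `sample_csv` should give the same column layout as the rename approach.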


In [68]:
# Report the dataset shape
print('Number of Rows', data.shape[0], 'and', 'Number of Columns', data.shape[1])
print()
print('Number of missing values per column:')
print(data.isnull().sum())
print()
# info() prints its report directly and returns None, so call it on its own
print('Data types:')
data.info()

Number of Rows 4177 and Number of Columns 9

Number of missing values per column:
Sex               0
Length            0
Diameter          0
Height            0
Whole weight      0
Shucked weight    0
Viscera weight    0
Shell weight      0
Rings             0
dtype: int64

Data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
Sex               4177 non-null object
Length            4177 non-null float64
Diameter          4177 non-null float64
Height            4177 non-null float64
Whole weight      4177 non-null float64
Shucked weight    4177 non-null float64
Viscera weight    4177 non-null float64
Shell weight      4177 non-null float64
Rings             4177 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [69]:
# Summary statistics for the numeric columns
data.describe()

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0,4177.0
mean,0.523992,0.407881,0.139516,0.828742,0.359367,0.180594,0.238831,9.933684
std,0.120093,0.09924,0.041827,0.490389,0.221963,0.109614,0.139203,3.224169
min,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,0.45,0.35,0.115,0.4415,0.186,0.0935,0.13,8.0
50%,0.545,0.425,0.14,0.7995,0.336,0.171,0.234,9.0
75%,0.615,0.48,0.165,1.153,0.502,0.253,0.329,11.0
max,0.815,0.65,1.13,2.8255,1.488,0.76,1.005,29.0
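
The `min` row reports a Height of 0.0, which is unlikely to be a real measurement even though the missing-value count is zero: `isnull()` only flags `NaN`, not zeros. A minimal sketch of a direct check follows, using a hypothetical toy frame (`toy`) in place of `data`:

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the 0.0 height mimics the
# minimum reported by describe()
toy = pd.DataFrame({
    'Length': [0.455, 0.43, 0.33],
    'Height': [0.095, 0.0, 0.08],
})

# isnull() does not flag zeros, so test for non-positive measurements directly
suspect = toy[toy['Height'] <= 0]
print(len(suspect), 'row(s) with non-positive Height')
```

How to treat such rows (drop, impute, or keep) is a judgment call that depends on the analysis.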


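#### Through a descriptive analysis we can identify that abalones have a mean length of about 0.524 and a maximum diameter of 0.65. These values are in the dataset's scaled units, not raw millimetres: the UCI documentation notes that the continuous measurements were divided by 200, so they correspond to roughly 105 mm and 130 mm.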
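
The figures in the summary table are far too small to be raw millimetres. The sketch below assumes the scaling noted in the UCI documentation for this dataset (continuous values divided by 200) and converts two of the summary values back; `SCALE`, `summary`, and `summary_mm` are hypothetical names used only for illustration:

```python
import pandas as pd

SCALE = 200  # assumption: scaling factor stated in the UCI dataset notes

# Summary values copied from the describe() output above
summary = pd.Series({'mean Length': 0.523992, 'max Diameter': 0.65})

# Multiply by the scale factor to recover approximate millimetres
summary_mm = summary * SCALE
print(summary_mm.round(1))
```

Under that assumption the mean length is about 104.8 mm and the maximum diameter is 130 mm.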

In [70]:
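# Label an abalone as small when its length is at or below the mean length, and large otherwise.
# Note: the column is called 'Age', but the labels are based on length (size) only.
mean_length = data['Length'].mean()  # compute the mean once instead of once per row
data['Age'] = np.where(data['Length'] <= mean_length, 'ageSmall', 'ageLarge')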

data.head()

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings,Age
0,M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,ageSmall
1,M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,ageSmall
2,F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,ageLarge
3,M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,ageSmall
4,I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,ageSmall


In [74]:
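# Mean of each numeric column per group (numeric_only skips the text columns 'Sex' and 'Age')
print(data[data['Age'] == 'ageSmall'].mean(numeric_only=True))
print(data[data['Age'] == 'ageLarge'].mean(numeric_only=True))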

Length            0.413537
Diameter          0.317317
Height            0.107598
Whole weight      0.393275
Shucked weight    0.168262
Viscera weight    0.085676
Shell weight      0.118489
Rings             8.315646
dtype: float64
Length             0.609949
Diameter           0.478359
Height             0.164355
Whole weight       1.167624
Shucked weight     0.508087
Viscera weight     0.254459
Shell weight       0.332481
Rings             11.192848
dtype: float64
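
The two filtered means above can also be produced in a single call with `groupby`, which returns one table with a row per label. This is a minimal sketch on a toy frame (`toy`, a hypothetical name) built from the five rows shown earlier, not on the full dataset:

```python
import pandas as pd

# Toy frame standing in for `data`; rows copied from the head() output above
toy = pd.DataFrame({
    'Length': [0.455, 0.35, 0.53, 0.44, 0.33],
    'Rings': [15, 7, 9, 10, 7],
    'Age': ['ageSmall', 'ageSmall', 'ageLarge', 'ageSmall', 'ageSmall'],
})

# One groupby call returns the per-group means as a single table
group_means = toy.groupby('Age').mean(numeric_only=True)
print(group_means)
```

Having both groups side by side in one table makes comparisons, such as the difference in mean `Rings`, easier to read.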
