<p style ="color: #118ab2;
                    font-size:50px; 
                    font-style:bold;
                    text-decoration: underline;
                    text-align: center ">
    Weight Predictor 
</p>


<img src='http://wiki.stat.ucla.edu/socr/uploads/a/ae/SOCR_Data_Dinov_HeightWeight_062408_Fig1.jpg'>

<h2 style ="color: #c1121f;font-size:35px; font-style:bold; text-decoration: underline;"> Step 1: Acquire </h2>

### Step 1.a: Explore Problem
    
##### The problem is to predict a person's weight from their height.<br>

   
### Step 1.b: Identify Data

##### A human height and weight dataset is available on the <a href = 'http://wiki.stat.ucla.edu/socr/index.php/Main_Page'>Statistics Online Computational Resource (SOCR)</a> website.
 
### Step 1.c: About Data

##### Human height and weight are mostly heritable, but lifestyle, diet, health and environmental factors also play a role in determining an individual's physical characteristics. The dataset below contains 25,000 synthetic records of the heights and weights of 18-year-olds. These data were simulated based on a 1993 Growth Survey of 25,000 children from birth to 18 years of age, recruited from Maternal and Child Health Centres (MCHC) and schools. The survey data were used to develop Hong Kong's current growth charts for weight, height, weight-for-age, weight-for-height and body mass index (BMI) (<a href = 'http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights'>Dataset</a>). <br>

### Step 1.d: Import Data


In [1]:
import pandas as pd

In [2]:
# 1. Assign the page URL to the variable wiki_link
wiki_link = r"http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html"

In [3]:
# 2. Read every HTML table on the page (returns a list of DataFrames)
html_data = pd.read_html(wiki_link)
html_data

[           0               1               2
 0      Index  Height(Inches)  Weight(Pounds)
 1          1        65.78331        112.9925
 2          2        71.51521        136.4873
 3          3        69.39874        153.0269
 4          4         68.2166        142.3354
 ...      ...             ...             ...
 24996  24996        69.50215        118.0312
 24997  24997        64.54826        120.1932
 24998  24998        64.69855        118.2655
 24999  24999        67.52918        132.2682
 25000  25000        68.87761        124.8742
 
 [25001 rows x 3 columns]]

In [5]:
# Check the type of html_data
type(html_data)


list

In [6]:
# 3. Convert the list to a NumPy array

import numpy as np

data_array = np.array(html_data)
data_array

array([[['Index', 'Height(Inches)', 'Weight(Pounds)'],
        ['1', '65.78331', '112.9925'],
        ['2', '71.51521', '136.4873'],
        ...,
        ['24998', '64.69855', '118.2655'],
        ['24999', '67.52918', '132.2682'],
        ['25000', '68.87761', '124.8742']]], dtype=object)

In [7]:
# 2-d array
data_array[0]

array([['Index', 'Height(Inches)', 'Weight(Pounds)'],
       ['1', '65.78331', '112.9925'],
       ['2', '71.51521', '136.4873'],
       ...,
       ['24998', '64.69855', '118.2655'],
       ['24999', '67.52918', '132.2682'],
       ['25000', '68.87761', '124.8742']], dtype=object)

In [8]:
# Column names (first row of the table)
data_array[0][0]

array(['Index', 'Height(Inches)', 'Weight(Pounds)'], dtype=object)

In [10]:
# 4. Convert Array to DataFrame
data = pd.DataFrame(data_array[0][1:],columns=data_array[0][0])
data

       Index Height(Inches) Weight(Pounds)
0          1       65.78331       112.9925
1          2       71.51521       136.4873
2          3       69.39874       153.0269
3          4        68.2166       142.3354
4          5       67.78781       144.2971
...      ...            ...            ...
24995  24996       69.50215       118.0312
24996  24997       64.54826       120.1932
24997  24998       64.69855       118.2655
24998  24999       67.52918       132.2682
24999  25000       68.87761       124.8742

[25000 rows x 3 columns]
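
The NumPy round trip in steps 3 and 4 is one way to turn the first row into column names. As a possible alternative, the first row can be promoted to the header directly on the DataFrame. The sketch below is only an illustration on a small hand-built frame: `raw`, `sample`, and `promote_first_row_to_header` are hypothetical names that do not appear in the notebook, and the rows are copied from the output above.

```python
import pandas as pd


def promote_first_row_to_header(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first row of `raw` as column names and drop it from the data."""
    promoted = raw.iloc[1:].reset_index(drop=True)
    promoted.columns = raw.iloc[0].tolist()
    return promoted


# Hypothetical stand-in for html_data[0]: the header is stored as the first data row.
raw = pd.DataFrame(
    [
        ["Index", "Height(Inches)", "Weight(Pounds)"],
        ["1", "65.78331", "112.9925"],
        ["2", "71.51521", "136.4873"],
    ]
)

sample = promote_first_row_to_header(raw)
print(sample)
```

`pd.read_html` also accepts a `header` argument (for example `header=0`), which may give the same result at read time; that option was not checked against this page.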


<h2 style ="color: #c1121f;font-size:35px; font-style:bold; text-decoration: underline;"> Step 2: Prepare </h2>

### Step 2.a: Explore data

In [12]:
len(data)

25000

In [16]:
data.shape

(25000, 3)

<p style ="color: #ffc300;font-size:20px; font-style:bold;"> Analysis 1: 25,000 rows and 3 columns</p>

In [21]:
data.columns

Index(['Index', 'Height(Inches)', 'Weight(Pounds)'], dtype='object')

<p style ="color: #ffc300;font-size:20px; font-style:bold;"> Analysis 2: Only 2 features ('Height(Inches)', 'Weight(Pounds)') are important</p>

In [17]:
data.isna().sum()

Index             0
Height(Inches)    0
Weight(Pounds)    0
dtype: int64

<p style ="color: #ffc300;font-size:20px; font-style:bold;"> Analysis 3: No null values in any feature</p>

In [20]:
data.describe()

        Index Height(Inches) Weight(Pounds)
count   25000        25000.0        25000.0
unique  25000        24503.0        24248.0
top         1       65.65796       124.7975
freq        1            3.0            4.0


<p style ="color: #ffc300;font-size:20px; 
                    font-style:bold;" > Analysis 4: Features are not of numeric type</p>

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Index           25000 non-null  object
 1   Height(Inches)  25000 non-null  object
 2   Weight(Pounds)  25000 non-null  object
dtypes: object(3)
memory usage: 586.1+ KB
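
All three columns have dtype `object` because the values were read as strings. One common way to fix this is `pd.to_numeric`. The sketch below is a minimal illustration on a hypothetical three-row sample (`sample` and `numeric` are illustrative names, and the values are copied from the output above). It is not necessarily the conversion used later in the notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the notebook's columns, with values stored as strings.
sample = pd.DataFrame(
    {
        "Index": ["1", "2", "3"],
        "Height(Inches)": ["65.78331", "71.51521", "69.39874"],
        "Weight(Pounds)": ["112.9925", "136.4873", "153.0269"],
    }
)

# Apply pd.to_numeric column by column; by default it raises on unparseable values.
numeric = sample.apply(pd.to_numeric)

print(numeric.dtypes)
```

Passing `errors="coerce"` to `pd.to_numeric` would turn unparseable values into `NaN` instead of raising, which would then show up in `isna().sum()`.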


<p style ="color: #ffc300;font-size:20px; 
                    font-style:bold;" > Analysis 5: 'Height(Inches)' is in inches and needs to be converted to feet; 'Weight(Pounds)' is in pounds and needs to be converted to kilograms (kg)</p>
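
The unit changes named in Analysis 5 are simple scalings: 1 foot is 12 inches, and 1 pound is 0.45359237 kg. The sketch below shows one way to apply them, assuming the columns have already been converted to numeric values. The output column names `Height(Feet)` and `Weight(KG)` and the variables `sample` and `converted` are hypothetical, chosen only for illustration.

```python
import pandas as pd

INCHES_PER_FOOT = 12
KG_PER_POUND = 0.45359237  # exact, by the international definition of the pound

# Hypothetical numeric sample; the values are copied from the first rows of the dataset.
sample = pd.DataFrame(
    {
        "Height(Inches)": [65.78331, 71.51521],
        "Weight(Pounds)": [112.9925, 136.4873],
    }
)

# Build a new frame with the converted units, leaving the original columns untouched.
converted = pd.DataFrame(
    {
        "Height(Feet)": sample["Height(Inches)"] / INCHES_PER_FOOT,
        "Weight(KG)": sample["Weight(Pounds)"] * KG_PER_POUND,
    }
)

print(converted.round(3))
```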

### Step 2.b: Visualize data

In [None]:
import seaborn as sn
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')