
**CDC BRFSS 2000 Dataset**

Let's start by setting up our environment and importing the dataset we'll be working with.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import io
import requests

df_url = 'https://raw.githubusercontent.com/akmand/datasets/master/openintro/brfss_2000.csv'
# Download the CSV and fail loudly on HTTP errors; TLS verification stays on (the default)
response = requests.get(df_url)
response.raise_for_status()
cdc = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

Let's take an initial peek at our data before diving deeper.

In [4]:
cdc.shape

(20000, 9)

In [5]:
cdc.sample(10, random_state=999)

       exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender    genhlth
6743         1         1         0      71     170       170   23       m  very good
19360        1         0         1      64     120       117   45       f  very good
8104         1         1         1      70     192       170   64       m       good
8535         1         1         1      64     165       140   67       f  excellent
8275         1         1         0      69     130       140   69       m  very good
3511         0         1         0      63     128       128   37       f  very good
1521         1         1         0      68     176       135   37       f       good
976          0         1         1      64     150       125   43       f       fair
14484        1         1         1      68     185       185   78       m       good
3591         1         1         0      71     165       175   34       m       fair


Our dataset has 20,000 cases and 9 features.
The random sample of 10 cases shows what those 9 features look like. Note that some features hold numeric values while others hold text.

We can also look at the first or last five cases to get a better feel for what the data looks like.

In [8]:
cdc.head()

   exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender    genhlth
0        0         1         0      70     175       175   77       m       good
1        0         1         1      64     125       115   33       f       good
2        1         1         1      60     105       105   49       f       good
3        1         1         0      66     132       124   42       f       good
4        0         1         0      61     150       130   55       f  very good


In [9]:
cdc.tail()

       exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender    genhlth
19995        1         1         0      66     215       140   23       f       good
19996        0         1         0      73     200       185   35       m  excellent
19997        0         1         0      65     216       150   57       f       poor
19998        1         1         0      67     165       165   81       f       good
19999        1         1         1      69     170       165   83       m       good


We also note that the first three features (exerany, hlthplan, and smoke100) appear to take only the values 0 and 1, suggesting that these could be categorical features. We'll explore this later; for now, let's take a closer look at our features and the data types they store.

In [6]:
cdc.columns.values

array(['exerany', 'hlthplan', 'smoke100', 'height', 'weight', 'wtdesire',
       'age', 'gender', 'genhlth'], dtype=object)

In [10]:
cdc.dtypes

exerany      int64
hlthplan     int64
smoke100     int64
height       int64
weight       int64
wtdesire     int64
age          int64
gender      object
genhlth     object
dtype: object

We see that only two of our features (gender and genhlth) store string values, while the other seven hold integers.
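The same split between numeric and text features can be obtained programmatically with `select_dtypes`. The sketch below is illustrative only: `toy` is a small stand-in frame built from the five rows shown by `cdc.head()` above, not the full dataset.

```python
import pandas as pd

# Stand-in frame: the five rows printed by cdc.head() above
toy = pd.DataFrame({
    'exerany':  [0, 0, 1, 1, 0],
    'hlthplan': [1, 1, 1, 1, 1],
    'smoke100': [0, 1, 1, 0, 0],
    'height':   [70, 64, 60, 66, 61],
    'weight':   [175, 125, 105, 132, 150],
    'wtdesire': [175, 115, 105, 124, 130],
    'age':      [77, 33, 49, 42, 55],
    'gender':   ['m', 'f', 'f', 'f', 'f'],
    'genhlth':  ['good', 'good', 'good', 'good', 'very good'],
})

# Columns with a numeric dtype versus everything else (here, the text columns)
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```

Applied to `cdc` itself, the same two calls should list the seven integer features and the two string features.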

We can also take a closer look at how our 20,000 cases are distributed among our variables (features) by taking summary statistics of the whole dataset.

In [11]:
cdc.describe()

         exerany   hlthplan   smoke100     height     weight   wtdesire        age
count    20000.0    20000.0    20000.0    20000.0    20000.0    20000.0    20000.0
mean      0.7457     0.8738    0.47205    67.1829  169.68295  155.09385   45.06825
std     0.435478   0.332083   0.499231   4.125954   40.08097  32.013306  17.192689
min          0.0        0.0        0.0       48.0       68.0       68.0       18.0
25%          0.0        1.0        0.0       64.0      140.0      130.0       31.0
50%          1.0        1.0        0.0       67.0      165.0      150.0       43.0
75%          1.0        1.0        1.0       70.0      190.0      175.0       57.0
max          1.0        1.0        1.0       93.0      500.0      680.0       99.0


Here, we can see summary statistics for our seven numerical features. Let's take an even closer look at one of our features, height.

In [13]:
cdc['height'].describe()

count    20000.000000
mean        67.182900
std          4.125954
min         48.000000
25%         64.000000
50%         67.000000
75%         70.000000
max         93.000000
Name: height, dtype: float64

We can see here, for example, that the mean (67.18) and the median (67, the 50% row) are very close to one another, which suggests that the distribution of 'height' is roughly symmetric and not strongly skewed by outliers. When needed, we can calculate the interquartile range (IQR) by subtracting Q1 (25%) from Q3 (75%).

In [15]:
cdc['height'].quantile(0.75) - cdc['height'].quantile(0.25)

6.0
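One common use of the IQR, which is not part of the original analysis, is the 1.5 × IQR rule of thumb for flagging potential outliers. The sketch below applies it to the quartiles reported above (Q1 = 64, Q3 = 70); the helper name `iqr_fences` and the multiplier `k=1.5` are illustrative assumptions.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule of thumb."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of 'height' taken from the describe() output above
lower, upper = iqr_fences(64.0, 70.0)
print(lower, upper)  # 55.0 79.0
```

Under this rule, the minimum (48) and maximum (93) heights in the summary would both be flagged for a closer look.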

It's now time to look at our suspected categorical features one by one and check whether our guess is correct. We can check how many unique values a given feature has and also see whether it contains any values with decimal places.

In [16]:
cdc['smoke100'].value_counts()

0    10559
1     9441
Name: smoke100, dtype: int64

We see that our 'smoke100' feature takes only two values across all cases: 0 and 1. This means 'smoke100' is categorical, even though it uses the integers 0 and 1 to represent its two categories. Now, let's take a look at the 'exerany' and 'hlthplan' features.

In [17]:
cdc['exerany'].value_counts()

1    14914
0     5086
Name: exerany, dtype: int64

In [18]:
cdc['hlthplan'].value_counts()

1    17476
0     2524
Name: hlthplan, dtype: int64

In [None]:
cdc['smoke100'].value_counts(normalize=True)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
%config InlineBackend.figure_format = 'retina'

In [None]:
plt.style.use('ggplot')

In [None]:
plt.rcParams['figure.figsize']=(10,5)

In [None]:
plt.rcParams['font.size']=12

In [None]:
cdc['smoke100'].value_counts().plot(kind = 'bar', color = 'turquoise',
                                    title = 'Bar plot of smoke100')
#plt.show()

In [None]:
smoke = cdc['smoke100'].value_counts()
smoke.plot(kind='bar', color='turquoise',
           title='Bar plot of smoke100')

In [None]:
cdc['height'].describe()

In [None]:
70-64

In [None]:
cdc['age'].describe()

In [None]:
57-31

In [None]:
cdc['gender'].value_counts(normalize = True)

In [None]:
cdc['exerany'].value_counts(normalize=True)

1    0.7457
0    0.2543
Name: exerany, dtype: float64

In [None]:
cdc.groupby('gender')['smoke100'].value_counts(normalize=True).unstack()

In [None]:
cdc.groupby('gender')['genhlth'].value_counts(normalize=True).unstack()
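Row-normalized two-way tables like the ones above can also be produced with `pd.crosstab`. The sketch below compares both approaches on a ten-row stand-in frame (the rows shown by `cdc.head()` and `cdc.tail()`), so the proportions are illustrative and do not describe the full dataset.

```python
import pandas as pd

# Stand-in frame: gender and smoke100 from the head() and tail() rows above
toy = pd.DataFrame({
    'gender':   ['m', 'f', 'f', 'f', 'f', 'f', 'm', 'f', 'f', 'm'],
    'smoke100': [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

# Approach used in the notebook: proportions within each gender, then pivot
by_group = toy.groupby('gender')['smoke100'].value_counts(normalize=True).unstack()

# Equivalent table: normalize='index' makes each row sum to 1
by_crosstab = pd.crosstab(toy['gender'], toy['smoke100'], normalize='index')

print(by_group)
print(by_crosstab)
```

One difference worth knowing: when a combination never occurs, the `groupby` version leaves a missing value in that cell, while `crosstab` reports 0.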

In [None]:
from statsmodels.graphics.mosaicplot import mosaic

gender_colors = lambda key:{'color':'lightcoral' if 'f' in key else 'lightblue'}

mosaic(cdc, ['gender', 'smoke100'], title='Mosaic plot of smoke100 and gender',
       properties = gender_colors, gap=0.02)

In [None]:
cdc['height'].plot(kind='box', title = 'Boxplot of height', vert=False)

In [None]:
cdc['height'].describe()

In [None]:
cdc.boxplot(column='height', by='gender', vert=False)
plt.title('Boxplot of height by gender')
plt.suptitle('')

In [None]:
# BMI from imperial units: 703 * weight (lb) / height (in)^2
bmi = 703 * cdc['weight'] / cdc['height']**2
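As a quick sanity check of the BMI formula, here is a worked example for the first respondent shown by `cdc.head()` (weight 175, height 70). It assumes weight is recorded in pounds and height in inches, which is what the factor 703 implies; `bmi_imperial` is a hypothetical helper name.

```python
def bmi_imperial(weight_lb, height_in):
    """Body mass index from pounds and inches: 703 * weight / height^2."""
    return 703 * weight_lb / height_in ** 2

# First row of cdc.head(): weight 175, height 70
first_bmi = bmi_imperial(175, 70)
print(round(first_bmi, 2))  # 25.11
```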

In [None]:
import seaborn as sns

sns.boxplot(x=cdc['genhlth'], y=bmi).set(xlabel='genhlth', ylabel='bmi',
                                         title = 'Boxplot of BMI by genhlth')

In [None]:
cdc.columns

In [None]:
sns.boxplot(x=cdc['exerany'], y=bmi).set(xlabel='exerany', ylabel='bmi',
                                         title = 'Boxplot of BMI by exerany')

In [None]:
cdc['age'].plot(kind='hist', color='springgreen', edgecolor='black',
                linewidth=1.2, title = 'Histogram of age')

In [None]:
bmi.plot(kind='hist', color='slateblue', edgecolor='black',
         linewidth=1.2, title='Histogram of BMI')
plt.show()

bmi.plot(kind='hist', color='gold', edgecolor='black',linewidth=1.2,
         title='Histogram of BMI (50 bins)', bins=50)
plt.show()
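An integer `bins` argument sets the number of equal-width bins, not the width of each bin. The sketch below shows this with `np.histogram`, which follows the same convention as the plotting call; the evenly spaced `values` are made up purely for illustration.

```python
import numpy as np

# Hypothetical BMI-like values, used only to illustrate the bins argument
values = np.linspace(15.0, 45.0, 200)

counts, edges = np.histogram(values, bins=50)
num_bins = len(counts)
bin_width = float(edges[1] - edges[0])

print(num_bins)             # 50
print(round(bin_width, 2))  # 0.6
```

With 50 bins over a range of 30 units, each bin is 0.6 wide; the bin width follows from the data range and the bin count.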

In [None]:
sns.scatterplot(x=cdc['weight'], y=cdc['wtdesire'])

In [None]:
wdiff=cdc['wtdesire']-cdc['weight']

In [None]:
wdiff

In [None]:
wdiff.describe()

In [None]:
cdc[wdiff==500]

In [None]:
wdiff.plot(kind='hist', bins=100, color='cyan')

In [None]:
wdiff.plot(kind='box', vert=False)

In [None]:
wdiff[cdc['gender']=='m'].median()

In [None]:
wdiff[cdc['gender']=='f'].median()

In [None]:
sns.boxplot(x=cdc['gender'], y=wdiff).set(xlabel='gender', ylabel='wdiff',
                                          title='Boxplot of wdiff by gender')

In [None]:
mean_weight=cdc['weight'].mean()

In [None]:
cdc['weight'].std()

In [None]:
(abs(cdc['weight']-mean_weight)<=1).value_counts(normalize=True)

False    0.9525
True     0.0475
Name: weight, dtype: float64
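The cell above reports the share of respondents whose weight lies within 1 pound of the mean. If the quantity of interest is instead the share within one standard deviation of the mean (an assumption about intent, not something the notebook states), the threshold would be the standard deviation. A minimal sketch on the five weights shown by `cdc.head()`, with `share_within_k_sd` as a hypothetical helper:

```python
import pandas as pd

def share_within_k_sd(s, k=1.0):
    """Fraction of values within k sample standard deviations of the mean."""
    within = (s - s.mean()).abs() <= k * s.std()
    return float(within.mean())

# Weights from the five rows printed by cdc.head() above
weights = pd.Series([175, 125, 105, 132, 150])

share_1sd = share_within_k_sd(weights, 1)
share_2sd = share_within_k_sd(weights, 2)
print(share_1sd)  # 0.6
print(share_2sd)  # 1.0
```

These toy proportions say nothing about the full dataset; calling `share_within_k_sd(cdc['weight'])` would give the corresponding figure for all 20,000 cases.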