## 2.1 Exploratory Data Analysis

[Data Link](https://www.icpsr.umich.edu/icpsradmin/nsfg/variableGroupParent/10262?studyNumber=9999)

<img src = 'nsfg.jpg' height = '500' width = '500'>

In [None]:
import pandas as pd # Importing pandas 
import numpy as np  # Importing numpy
import seaborn as sns

In [None]:
nsfg = pd.read_hdf('nsfg.hdf5', 'nsfg') 
type(nsfg)

In [None]:
nsfg.head()  # head() returns the first 5 records

In [None]:
# Display the number of rows and columns
nsfg.shape

In [None]:
# Display the names of the columns
nsfg.columns


Birth weight in pounds, with the count of each value

<img src ='birthwgt_lb1.jpg' height ='500' width = '500'>

In [None]:
pounds = nsfg['birthwgt_lb1'] #accessing the column in the dataframe
type(pounds) #get the type of the column data

In [None]:
pounds.head() # shows the first 5 values 

In [None]:
# Select column birthwgt_oz1: ounces
ounces = nsfg['birthwgt_oz1']

In [None]:
# Print the first 5 elements of ounces
print(ounces.head())

## 2.2 Clean and Validate

In [None]:
pounds.value_counts().sort_index()  # Count each weight value, sorted by value (not by frequency)

In [None]:
pounds.describe()  # Summary statistics, a quick check for implausible values

In [None]:
pounds = pounds.replace([98, 99], np.nan)  # Replace the special codes 98 and 99 with NaN so they do not distort the statistics
pounds.mean()

In [None]:
pounds.describe()

In [None]:
pounds.value_counts().sort_index() #Distribution after the replacement

In [None]:
ounces = ounces.replace([98, 99], np.nan)  # Replace the special codes; reassigning avoids an in-place edit of a column selected from nsfg

In [None]:
birth_weight = pounds + ounces / 16.0
birth_weight.describe()
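
The effect of the cleaning step is easier to see on a tiny example. The sketch below uses made-up values (`toy_pounds` and `toy_ounces` are hypothetical names, not NSFG data) and assumes, as above, that 98 and 99 are special codes rather than real measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 98 and 99 stand in for the special codes
toy_pounds = pd.Series([7, 6, 98, 8, 99])
toy_ounces = pd.Series([8, 0, 4, 99, 2])

# Without cleaning, the codes are treated as real weights
print(toy_pounds.mean())    # 43.6

# Replace the codes with NaN so summary statistics skip them
clean_pounds = toy_pounds.replace([98, 99], np.nan)
clean_ounces = toy_ounces.replace([98, 99], np.nan)
print(clean_pounds.mean())  # 7.0

# Combine into one weight in pounds; the result is NaN
# whenever either component is missing
toy_weight = clean_pounds + clean_ounces / 16.0
print(toy_weight)
```

Only the first two rows keep a valid combined weight, because NaN propagates through the arithmetic.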

<img src = 'outcome.jpg' height = '500' width = '500'>

## How many pregnancies in this dataset ended with a live birth?

In [None]:
nsfg['outcome'].value_counts().sort_index()

## 2.3 Filter and Visualize 

### 2.3.1 Histogram

In [None]:
import matplotlib.pyplot as plt
plt.hist(birth_weight.dropna(), bins=30)
plt.xlabel('Birth weight (lb)')
plt.ylabel('Number of births')  # plt.hist plots counts by default, not fractions
plt.show()

### 2.3.2 Boolean Series

In [None]:
preterm = nsfg['prglngth'] < 37
preterm.head()

In [None]:
preterm.sum()  # Number of preterm pregnancies (True counts as 1)

In [None]:
preterm.mean()  # Fraction of pregnancies that are preterm
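
Why `sum()` gives a count and `mean()` gives a fraction can be checked on a small made-up Series. The values and the names `toy_length` and `toy_preterm` are illustrative assumptions, not NSFG data.

```python
import pandas as pd

# Hypothetical pregnancy lengths in weeks
toy_length = pd.Series([39, 36, 40, 30, 38])

# Comparing a Series to a number gives a Boolean Series
toy_preterm = toy_length < 37
print(toy_preterm.tolist())  # [False, True, False, True, False]

# True is treated as 1 and False as 0
print(toy_preterm.sum())     # 2, the number of True values
print(toy_preterm.mean())    # 0.4, the fraction of True values
```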

### 2.3.3 Filtering

In [None]:
preterm_weight = birth_weight[preterm]
preterm_weight.mean()

In [None]:
full_term_weight = birth_weight[~preterm]
full_term_weight.mean()
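
One caveat with `~preterm`: a comparison with NaN evaluates to False, so rows with a missing length are not preterm and therefore land in the "not preterm" group. The sketch below illustrates this with made-up values (all names are hypothetical). Whether it matters for the NSFG data depends on how many lengths are missing, which is not checked here.

```python
import numpy as np
import pandas as pd

# Hypothetical lengths (weeks) and weights (lb); one length is missing
toy_length = pd.Series([39, 36, np.nan, 30, 38])
toy_weight = pd.Series([7.5, 5.5, 7.0, 4.0, 8.0])

toy_preterm = toy_length < 37      # NaN < 37 is False
toy_known_full = toy_length >= 37  # NaN >= 37 is also False

# Mean weight of the preterm rows
print(toy_weight[toy_preterm].mean())     # 4.75

# ~toy_preterm also selects the row with the unknown length
print(toy_weight[~toy_preterm].mean())    # 7.5

# An explicit comparison keeps only rows known to be full term
print(toy_weight[toy_known_full].mean())  # 7.75
```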

## 2.4 GSS Dataset

<img src = 'gss.jpg' height = '500' width ='500'>

### 2.4.1 Distributions

In [None]:
gss = pd.read_hdf('gss.hdf5', 'gss')

In [None]:
gss.head()

In [None]:
educ = gss['educ']

In [None]:
plt.hist(educ.dropna(), label = 'educ')
plt.show()

### Fraction over Frequency


Histograms are not always the best way to visualize a distribution. Because they group values into bins, details of the distribution can be overlooked, and the picture changes with the choice of bin width. In addition, an analysis often needs the fraction of observations at each value rather than the raw count.
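
To make the difference between counts and fractions concrete, here is a minimal sketch on made-up data. It uses the pandas method `value_counts` as a stand-in for a PMF; `toy_educ` is a hypothetical Series, not the GSS column.

```python
import pandas as pd

# Hypothetical years of education for eight respondents
toy_educ = pd.Series([12, 12, 16, 12, 14, 16, 12, 18])

# Frequency: how many times each value appears
counts = toy_educ.value_counts().sort_index()
print(counts)

# Fraction: the same counts divided by the number of observations
fractions = toy_educ.value_counts(normalize=True).sort_index()
print(fractions)

# The fractions add up to 1, so each one can be read as a probability
print(fractions.sum())
```

Every distinct value gets its own entry, so nothing is hidden inside a bin.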

### 2.4.1.1 PMF (Probability Mass Function)

In [None]:
from empiricaldist import Pmf, Cdf

In [None]:
pmf_educ = Pmf.from_seq(educ, normalize = False)
pmf_educ.head()

In [None]:
pmf_educ_normalize = Pmf.from_seq(educ, normalize = True)
pmf_educ_normalize.head()
pmf_educ_normalize.bar(label = 'educ')
plt.xlabel('Years of education')
plt.ylabel('PMF')
plt.show()

In [None]:
# Number of respondents with twelve years of education (pmf_educ is not normalized, so this is a count)
pmf_educ[12]

### 2.4.1.2 CDF (Cumulative Distribution Function)


<img src="pmf_cdf.jpg" width="500" height="500">

<img src="pmf_cdf_example.jpg" width="500" height="500">

In [None]:
cdf = Cdf.from_seq(gss['age'])
cdf.plot()
plt.xlabel('Age')
plt.ylabel('CDF')
plt.show()

## Evaluating the CDF

In [None]:
q = 51
p = cdf(q)
print(p)

## Evaluating the Inverse CDF

In [None]:
p = 0.25
q = cdf.inverse(p)
print(q)
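
What the two evaluations mean can be sketched with plain NumPy on made-up ages. This is only an illustration of the definitions, not how `empiricaldist` implements `Cdf`. The names `toy_ages`, `toy_cdf` and `toy_inverse_cdf` are hypothetical.

```python
import numpy as np

# Hypothetical ages of eight respondents
toy_ages = np.array([23, 35, 35, 41, 51, 51, 60, 72])

def toy_cdf(q):
    """Fraction of values less than or equal to q."""
    return np.mean(toy_ages <= q)

def toy_inverse_cdf(p):
    """Smallest observed value whose CDF is at least p (for 0 < p <= 1)."""
    sorted_ages = np.sort(toy_ages)
    # Cumulative fraction reached at each position of the sorted data
    cum = np.arange(1, len(sorted_ages) + 1) / len(sorted_ages)
    # First position where the cumulative fraction reaches p
    return sorted_ages[np.searchsorted(cum, p)]

print(toy_cdf(51))            # 0.75: six of eight values are <= 51
print(toy_inverse_cdf(0.25))  # 35: the 25th percentile
print(toy_inverse_cdf(0.75))  # 51: the 75th percentile
```

The CDF maps a quantity to a probability, and the inverse CDF maps a probability back to a quantity.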

## 2.5 Exploring Relationships 

<img src="brfss.jpg" width="500" height="500">

In [None]:
brfss = pd.read_hdf('brfss.hdf5', 'brfss')
height = brfss['HTM4']
weight = brfss['WTKG3']
plt.plot(height, weight, 'o')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.show()

### 2.5.1 Transparency

In [None]:
plt.plot(height, weight, 'o', alpha=0.02)
plt.show()

### 2.5.2 Marker size

In [None]:
plt.plot(height, weight, 'o', markersize=1, alpha=0.02)
plt.show()

### 2.5.3 Jittering

In [None]:
height_jitter = height + np.random.normal(0, 2, size=len(brfss))
plt.plot(height_jitter, weight, 'o', markersize=1, alpha=0.02)
plt.show()
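
Jittering helps when values have been rounded, because many points then share exactly the same coordinate and are drawn on top of one another. A minimal sketch with made-up heights (`toy_height` and the fixed seed are assumptions for illustration):

```python
import numpy as np

# Fixed seed so the illustration is reproducible
rng = np.random.default_rng(1)

# Hypothetical heights, rounded so that several values coincide
toy_height = np.array([160.0, 160.0, 170.0, 170.0, 170.0])
print(len(np.unique(toy_height)))  # 2 distinct x positions

# Add small random noise (mean 0, standard deviation 2)
toy_jitter = toy_height + rng.normal(0, 2, size=len(toy_height))
print(len(np.unique(toy_jitter)))  # 5 distinct x positions
```

The noise should be large enough to separate the columns of points but small enough not to change the overall shape of the relationship.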

In [None]:
height_jitter = height + np.random.normal(0, 2, size=len(brfss))
weight_jitter = weight + np.random.normal(0, 2, size=len(brfss))
plt.plot(height_jitter, weight_jitter, 'o', markersize=1, alpha=0.01)
plt.show()

### 2.5.4 Zoom

In [None]:
plt.plot(height_jitter, weight_jitter, 'o', markersize=1, alpha=0.02)
plt.axis([140, 200, 0, 160])
plt.show()

### 2.5.5 Visualizing relationships

In [None]:
age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss))
weight = brfss['WTKG3']
plt.plot(age, weight, 'o', markersize=5, alpha=0.2)
plt.show()

In [None]:
age = brfss['AGE'] + np.random.normal(0, 0.5, size=len(brfss))
weight = brfss['WTKG3'] + np.random.normal(0, 2, size=len(brfss))
plt.plot(age, weight, 'o', markersize=1, alpha=0.2)
plt.show()

In [None]:
sns.boxplot(x='AGE', y='WTKG3', data=brfss, whis=10)
plt.show()

In [None]:
sns.boxplot(x='AGE', y='WTKG3', data=brfss, whis=10)
plt.yscale('log')
plt.show()

### 2.5.6 Correlation

In [None]:
columns = ['HTM4', 'WTKG3', 'AGE']
subset = brfss[columns]

In [None]:
subset.corr()

In [None]:
xs = np.linspace(-1, 1)
ys = xs**2
ys += np.random.normal(0, 0.05, len(xs))
np.corrcoef(xs, ys)

In [None]:
plt.scatter(xs, ys)
plt.show()

### 2.5.7 Strength of effect

In [None]:
from scipy.stats import linregress
subset = brfss.dropna(subset=['WTKG3', 'HTM4'])
xs = subset['HTM4']
ys = subset['WTKG3']
res = linregress(xs, ys)
print(res)
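
To see how the result of `linregress` is read, here is a sketch on synthetic data whose true slope is known. The data, the seed and the coefficients (-80 and 0.9) are assumptions chosen only for illustration and say nothing about the BRFSS values.

```python
import numpy as np
from scipy.stats import linregress

# Fixed seed so the illustration is reproducible
rng = np.random.default_rng(0)

# Synthetic "heights" and "weights" with a known linear relationship
toy_x = np.linspace(150, 190, 200)
toy_y = -80 + 0.9 * toy_x + rng.normal(0, 5, size=len(toy_x))

toy_res = linregress(toy_x, toy_y)

# slope: estimated change in y for a one-unit increase in x
print(toy_res.slope)
# intercept: estimated y where x equals 0
print(toy_res.intercept)
# rvalue: the correlation coefficient between x and y
print(toy_res.rvalue)
```

The correlation says how tightly the points follow a line, while the slope says how much `y` changes per unit of `x`, which is what the strength of an effect refers to.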

### Linear Regression 

In [None]:
plt.plot(height_jitter, weight_jitter, 'o', markersize=1, alpha=0.02)
plt.axis([140, 200, 0, 160])
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(fx, fy, '-')
plt.show()

In [None]:
age = brfss['AGE'] + np.random.normal(0, 0.5, size=len(brfss))
weight = brfss['WTKG3'] + np.random.normal(0, 2, size=len(brfss))
plt.plot(age, weight, 'o', markersize=1, alpha=0.2)

subset = brfss.dropna(subset=['WTKG3', 'AGE'])
xs = subset['AGE']
ys = subset['WTKG3']
res = linregress(xs, ys)
fx = np.array([xs.min(), xs.max()])
fy = res.intercept + res.slope * fx
plt.plot(fx, fy, '-')
plt.show()