Data Cleaning and Transformation

In [1]:
import sys
import numpy as np
import thinkstats2
import thinkplot

# Read Data

In [2]:
def ReadFemPreg(dct_file='2002FemPreg.dct',
                dat_file='2002FemPreg.dat.gz'):
    """Reads the NSFG pregnancy data.

    dct_file: string file name
    dat_file: string file name

    returns: DataFrame
    """
    dct = thinkstats2.ReadStataDct(dct_file)
    df = dct.ReadFixedWidth(dat_file, compression='gzip')
    CleanFemPreg(df)
    return df


This code defines a function called ReadFemPreg that reads and processes the NSFG female pregnancy data. The function takes two optional arguments, dct_file and dat_file, which give the file names of the data dictionary and the data file, respectively. It works in four steps:

It reads the data dictionary (dct_file) using the ReadStataDct function from the thinkstats2 module.

It reads the fixed-width data file (dat_file), which is gzip-compressed, using the column information from the data dictionary.

It cleans the pregnancy data by calling CleanFemPreg.

It returns the cleaned data as a DataFrame.
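
As a rough illustration of the fixed-width step, the sketch below parses a few made-up records with pandas.read_fwf. It assumes ReadFixedWidth does something broadly similar. The sample records, the column positions, and the name toy are hypothetical and are not taken from the NSFG files.

```python
import io

import pandas as pd

# Hypothetical fixed-width records: characters 0-4 hold caseid,
# characters 5-6 hold birthwgt_lb.
raw = io.StringIO(
    "00001 7\n"
    "00002 8\n"
    "0000399\n"
)

# Column positions as half-open (start, end) intervals. In the real
# workflow this information would come from the .dct data dictionary.
colspecs = [(0, 5), (5, 7)]
names = ["caseid", "birthwgt_lb"]

toy = pd.read_fwf(raw, colspecs=colspecs, names=names, header=None)
print(toy)
```

Note that the third record carries the code 99 in the weight column, which is the kind of value the cleaning step has to deal with.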

# Data Cleaning 

In [3]:
def CleanFemPreg(df):
    """Recodes variables in the pregnancy DataFrame.

    df: DataFrame
    """
    # agepreg is stored in centiyears; convert it to years
    df.agepreg /= 100.0
    # a birth weight above 20 lb is not plausible, so treat it as missing
    df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan
    # 97, 98 and 99 are special codes, not real weights; replace them with NaN
    na_vals = [97, 98, 99]
    df['birthwgt_lb'] = df.birthwgt_lb.replace(na_vals, np.nan)  # pounds
    df['birthwgt_oz'] = df.birthwgt_oz.replace(na_vals, np.nan)  # ounces
    df['hpagelb'] = df.hpagelb.replace(na_vals, np.nan)
    df['babysex'] = df.babysex.replace([7, 9], np.nan)  # codes 7 and 9 become NaN
    df['nbrnaliv'] = df.nbrnaliv.replace([9], np.nan)
    # convert ounces to pounds (divide by 16), add them to the pounds,
    # and store the result in a new column called totalwgt_lb
    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
    df.cmintvw = np.nan  # set every value in the cmintvw column to NaN

This code defines a function called CleanFemPreg that cleans up the data. Data cleaning is a crucial step in data science: it covers checking for mistakes, handling special values, changing data formats, and doing calculations.

This is what the function does:

1. Converting age: The function takes the mother's age at the end of pregnancy (agepreg). In the data, it is stored in centiyears, which means each number is 100 times the actual age. The function fixes this by dividing each value by 100 to get the age in years.

2. Handling baby weight: The function deals with the baby's weight (birthwgt_lb and birthwgt_oz). For live births, it is given in pounds and ounces. However, the special codes 97, 98, and 99 mark cases where the weight was not ascertained, was refused, or is unknown. The function replaces these codes with np.nan ("not a number"). This prevents miscalculations, like saying a baby weighs 99 pounds.

3. Combining weight: The function creates a new column (totalwgt_lb) that combines the baby's weight in pounds and ounces into a single quantity, expressed in pounds.
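
The three steps can be tried on a small example. This is a minimal sketch on made-up rows, not the NSFG data. The DataFrame toy and its values are hypothetical, and only the column names follow the text.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the NSFG file.
toy = pd.DataFrame({
    "agepreg": [2575, 3108, 1933],   # centiyears
    "birthwgt_lb": [7, 99, 6],       # 99 is a special code
    "birthwgt_oz": [8, 98, 0],       # 98 is a special code
})

# Step 1: convert centiyears to years.
toy["agepreg"] = toy["agepreg"] / 100.0

# Step 2: replace the special codes with NaN.
na_vals = [97, 98, 99]
toy["birthwgt_lb"] = toy["birthwgt_lb"].replace(na_vals, np.nan)
toy["birthwgt_oz"] = toy["birthwgt_oz"].replace(na_vals, np.nan)

# Step 3: combine pounds and ounces into one column in pounds.
toy["totalwgt_lb"] = toy["birthwgt_lb"] + toy["birthwgt_oz"] / 16.0

print(toy)
```

The first row ends up with a total weight of 7.5 lb (7 lb plus 8 oz), while the second row is NaN because both of its weight fields held special codes.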

# Data Validation

When you get a new dataset and move it from one software environment to another, it is easy to misinterpret the data or introduce mistakes.
To validate the data, you can compare it with published information or basic statistics.
The book gives an example: the NSFG codebook includes tables that summarize each variable.
To validate our data, we compare the values in each column of our dataset (using the value_counts method) with those found in the NSFG codebook:
https://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm


For instance, take BIRTHWGT_LB1, which represents the birth weight of the baby in pounds.

This is how it looks in the codebook:

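| Value | Label            | Total |
|-------|------------------|-------|
| .     | INAPPLICABLE     | 4449  |
| 0-5   | UNDER 6 POUNDS   | 1125  |
| 6     | 6 POUNDS         | 2223  |
| 7     | 7 POUNDS         | 3049  |
| 8     | 8 POUNDS         | 1889  |
| 9-95  | 9 POUNDS OR MORE | 799   |
| 97    | NOT ASCERTAINED  | 1     |
| 98    | REFUSED          | 1     |
| 99    | DON'T KNOW       | 57    |
| Total |                  | 13593 |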


In [8]:
import nsfg
df = nsfg.ReadFemPreg()
df.birthwgt_lb.value_counts(sort=False)

8.0     1889
7.0     3049
6.0     2223
4.0      229
5.0      697
10.0     132
12.0      10
14.0       3
3.0       98
1.0       40
2.0       53
0.0        8
9.0      623
11.0      26
13.0       3
15.0       1
Name: birthwgt_lb, dtype: int64

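If you add up the counts for 0 to 5 pounds (8 + 40 + 53 + 98 + 229 + 697), you get 1125, which matches the codebook. The counts for 6, 7, and 8 pounds also match. The counts for 9 pounds and above add up to 798, one fewer than the codebook's 799; this is consistent with CleanFemPreg replacing any weight above 20 lb with NaN. Comparing counts like this is one way to validate your data.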
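
The comparison with the codebook can also be scripted. The sketch below copies the counts from the value_counts output into a Series by hand and sums the same bins the codebook uses. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts output (index = weight in pounds).
counts = pd.Series({
    0: 8, 1: 40, 2: 53, 3: 98, 4: 229, 5: 697,
    6: 2223, 7: 3049, 8: 1889,
    9: 623, 10: 132, 11: 26, 12: 10, 13: 3, 14: 3, 15: 1,
})

# .loc slices by label, so 0:5 includes the 5-pound row.
under_6 = counts.loc[0:5].sum()
nine_plus = counts.loc[9:].sum()

print(under_6)    # 1125, matches "UNDER 6 POUNDS" in the codebook
print(nine_plus)  # 798, one fewer than the codebook's 799
```

The single missing record in the top bin is what you would expect if one weight above 20 lb was set to NaN during cleaning.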