# A Statistical Approach

To address the limitations of anecdotes, we will use the tools of statistics, which include:

**Data Collection**: 
> We will use data from a large database that was designed explicitly with the goal of generating statistically valid inferences.

**Descriptive Statistics**:
> We will generate statistics that summarize the data concisely and evaluate different ways to visualize it.

**Exploratory Data Analysis**:
> We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.

**Estimation**: 
> We will use data from a sample to estimate characteristics of the general population.

**Hypothesis testing**:
> Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect might have happened by chance.

### Types of Studies

**cross-sectional**: captures a snapshot of a group at a single point in time.

**longitudinal**: observes a group repeatedly over a period of time.

### Types of Data
**representative**: every member of the target population has an equal chance of participating.

**oversampled**: certain classes, categories, or ranges of values are overrepresented relative to their share of the target population.
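
To make the difference concrete, here is a minimal sketch with made-up numbers (not NSFG data). It assumes a hypothetical population of 1000 people in two groups and a sample in which the smaller group is oversampled. Each respondent gets a weight equal to the number of people they represent; the `finalwgt` column shown later in these notes appears to play a similar role, but that is an assumption here.

```python
import numpy as np

# Hypothetical population of 1000: 900 in group A (value 10), 100 in group B (value 20).
# True population mean = (900 * 10 + 100 * 20) / 1000 = 11.
# Hypothetical sample: group B is oversampled, so each group has 50 respondents.
values = np.array([10.0] * 50 + [20.0] * 50)

# Weight = number of people in the population each respondent represents.
# Group A: 900 / 50 = 18, group B: 100 / 50 = 2.
weights = np.array([18.0] * 50 + [2.0] * 50)

unweighted_mean = values.mean()                      # ignores the oversampling
weighted_mean = np.average(values, weights=weights)  # corrects for it

print(unweighted_mean)  # 15.0
print(weighted_mean)    # 11.0
```

The unweighted mean is pulled toward the oversampled group, while the weighted mean recovers the population value.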

# Importing the Data

Note: I first have to download the files needed for this tutorial (the helper modules and the data files) from the author's GitHub repository.

In [28]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz")

In [29]:
import nsfg

In [30]:
df = nsfg.ReadFemPreg()
df

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,8.8125
1,1,2,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,7.8750
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,9.1250
3,2,2,,,,,6.0,,1.0,,...,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,7.0000
4,2,3,,,,,6.0,,1.0,,...,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,6.1875
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13588,12571,1,,,,,6.0,,1.0,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,6.1875
13589,12571,2,,,,,3.0,,,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,
13590,12571,3,,,,,3.0,,,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,
13591,12571,4,,,,,6.0,,1.0,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,7.5000


In [16]:
df.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

In [18]:
df.columns[1]

'pregordr'

In [31]:
pregordr = df['pregordr']
type(pregordr)

pandas.core.series.Series

In [32]:
# printing a Series shows each index label next to its value
pregordr

0        1
1        2
2        1
3        2
4        3
        ..
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

In [33]:
# accessing specific rows: a slice selects positions 1 and 2 (the end, 3, is excluded)
pregordr[1:3]

1    2
2    1
Name: pregordr, dtype: int64
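
The slice above can be compared with the explicit indexers. This is a small sketch on a toy Series (a stand-in for the first five values of `pregordr`, not the real data): a plain integer slice and `.iloc` both select by position and exclude the end, while `.loc` selects by label and includes it.

```python
import pandas as pd

# Toy stand-in for the first five values of df['pregordr']
s = pd.Series([1, 2, 1, 2, 3], name="pregordr")

by_position = s[1:3]    # positions 1 and 2; the end position is excluded
by_iloc = s.iloc[1:3]   # explicit positional slice, same result
by_label = s.loc[1:3]   # label-based slice; the end label is included

print(list(by_position.index))  # [1, 2]
print(list(by_label.index))     # [1, 2, 3]
```

With the default integer index the two styles look alike, which makes the off-by-one difference easy to miss.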

In [34]:
# accessing columns using the dot notation "."
pregordr = df.pregordr
pregordr

0        1
1        2
2        1
3        2
4        3
        ..
13588    1
13589    2
13590    3
13591    4
13592    5
Name: pregordr, Length: 13593, dtype: int64

## Transformation

When you import data, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations. These operations are called **data cleaning**.

```python
import numpy as np


def CleanFemPreg(df):
    # mother's age is encoded in centiyears; convert to years
    df.agepreg /= 100.0  # divides every value in the column by 100

    # birthwgt_lb contains at least one bogus value (51 lbs)
    # replace with NaN
    df.loc[df.birthwgt_lb > 20, 'birthwgt_lb'] = np.nan

    # replace 'not ascertained', 'refused', 'don't know' with NaN
    na_vals = [97, 98, 99]
    df.birthwgt_lb.replace(na_vals, np.nan, inplace=True)
    df.birthwgt_oz.replace(na_vals, np.nan, inplace=True)
    df.hpagelb.replace(na_vals, np.nan, inplace=True)

    # birthweight is stored in two columns, lbs and oz.
    # convert to a single column in lb
    # NOTE: creating a new column requires dictionary syntax,
    # not attribute assignment (like df.totalwgt_lb)
    df['totalwgt_lb'] = df.birthwgt_lb + df.birthwgt_oz / 16.0
```

Note: the `inplace=True` argument makes `replace` modify the existing Series instead of returning a modified copy.
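
As an alternative, the replacement can be written without `inplace=True` by assigning the result back to the column. The sketch below uses a toy DataFrame (the values are made up; only the special codes 97, 98, 99 come from the function above). Recent pandas versions may warn when an in-place method is called on a column selected from a DataFrame, so this form is often the safer one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NSFG column; 97 and 99 are special codes, not weights
df = pd.DataFrame({"birthwgt_lb": [7, 8, 97, 6, 99]})
na_vals = [97, 98, 99]

# replace() returns a new Series; assigning it back updates the DataFrame
df["birthwgt_lb"] = df["birthwgt_lb"].replace(na_vals, np.nan)

print(df["birthwgt_lb"].isna().sum())  # 2
```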

### Warning

When adding a new column to a DataFrame, you must use dictionary syntax. Attribute assignment sets an attribute on the DataFrame object instead of creating a column:

```python
#CORRECT
df['newColumn'] = df.birthwgt_lb + df.birthwgt_oz / 16.0

#WRONG
df.newColumn = df.birthwgt_lb + df.birthwgt_oz / 16.0
```
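
The difference can be checked on a toy DataFrame. In this sketch the column names `total_a` and `total_b` are hypothetical and the weights are made up; the warning that pandas raises for the attribute form is silenced so the output stays readable.

```python
import warnings

import pandas as pd

# Toy data: two births with pounds and ounces recorded separately
df = pd.DataFrame({"birthwgt_lb": [7.0, 8.0], "birthwgt_oz": [8.0, 0.0]})

# Dictionary syntax creates a real column
df["total_a"] = df.birthwgt_lb + df.birthwgt_oz / 16.0

# Attribute assignment only sets a Python attribute on the object
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df.total_b = df.birthwgt_lb + df.birthwgt_oz / 16.0

print("total_a" in df.columns)  # True
print("total_b" in df.columns)  # False
```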


## Validation 

When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings.

If you take time to validate the data, you can save time later and avoid errors. 



In [35]:
df.outcome.value_counts().sort_index()

outcome
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: count, dtype: int64
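
A check like this can be automated by comparing the observed counts with the counts published in the codebook. The sketch below is only an illustration: the `expected` dictionary is copied from the output above as a stand-in for codebook values, `validate_counts` is a hypothetical helper, and the Series is rebuilt from those counts instead of being loaded from the NSFG files.

```python
import pandas as pd

# Stand-in for codebook counts (assumption: copied from the output above)
expected = {1: 9148, 2: 1862, 3: 120, 4: 1921, 5: 190, 6: 352}

# Toy stand-in for df.outcome, rebuilt so that it has exactly these counts
outcome = pd.Series(
    [code for code, n in expected.items() for _ in range(n)], name="outcome"
)


def validate_counts(series, expected):
    """Return {code: (observed, expected)} for every code whose counts differ."""
    observed = {int(k): int(v) for k, v in series.value_counts().items()}
    codes = sorted(set(observed) | set(expected))
    return {
        c: (observed.get(c, 0), expected.get(c, 0))
        for c in codes
        if observed.get(c, 0) != expected.get(c, 0)
    }


print(validate_counts(outcome, expected))  # {}

# Simulate an import error: one record ends up with an unknown code
corrupted = outcome.copy()
corrupted.iloc[0] = 7
print(validate_counts(corrupted, expected))  # {1: (9147, 9148), 7: (1, 0)}
```

The expected counts sum to 13593, which matches the length of the `pregordr` Series printed earlier, so no record is missing an outcome code.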

In [36]:
df.birthwgt_lb.value_counts().sort_index()

birthwgt_lb
0.0        8
1.0       40
2.0       53
3.0       98
4.0      229
5.0      697
6.0     2223
7.0     3049
8.0     1889
9.0      623
10.0     132
11.0      26
12.0      10
13.0       3
14.0       3
15.0       1
Name: count, dtype: int64