# Chapter 1: Exploratory data analysis

The thesis of this book is that data combined with practical methods can answer questions and guide decisions under uncertainty.

---

Reports like these, which come from personal experience or are not generally known, are called **anecdotal evidence** because they are based on data that is unpublished and usually personal. In casual conversation, there is nothing wrong with anecdotes, so I don't mean to pick on the people I quoted.

But we might want evidence that is more persuasive and an answer that is
more reliable. By those standards, anecdotal evidence usually fails, because:

- Small number of observations: We might have to compare a large number of pregnancies to be sure that a difference exists.
- Selection bias: People who join a discussion of this question might be interested because their own experience matched the condition being discussed, which would bias the results.
- Confirmation bias: People's beliefs may push them to give examples that support the condition being discussed.
- Inaccuracy: Anecdotes are often personal stories, and they may be misremembered, misrepresented, or repeated inaccurately.

---

So how can we do better?

## 1.1 A Statistical Approach

To address the limitations of anecdotes, we will use the tools of statistics, which include:

- Data collection: Using data from a reliable source, such as a national survey designed for the purpose of gathering that data.
- Descriptive statistics: Summarizing the data with statistical aggregations and visualization tools.
- Exploratory data analysis: Looking for patterns in the data that address the questions we are interested in, while checking for inconsistencies and identifying limitations.
- Estimation: Using data from a sample to estimate characteristics of the general population.
- Hypothesis testing: Checking whether an apparent effect reflects a real relationship or might have happened by chance.

---

## 1.2 Importing Data

Pregnancy data from Cycle 6 of the NSFG is in a file called 2002FemPreg.dat.gz; it is a gzip-compressed data file in plain text (ASCII), with fixed width columns. Each line in the file is a record that contains data about one pregnancy.

The format of the file is documented in 2002FemPreg.dct, which is a Stata dictionary file. Stata is a statistical software system; a "dictionary" in this context is a list of variable names, types, and indices that identify where in each line to find each variable.
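To make the idea of fixed-width columns concrete, here is a minimal sketch using `pandas.read_fwf` on a few made-up lines. The column positions and values are assumptions chosen for illustration; they are not the real NSFG layout, which comes from the dictionary file.

```python
import io
import pandas as pd

# Three made-up fixed-width records (NOT real NSFG data):
# caseid occupies characters 0-4, pregordr 5-6, outcome 7.
raw = io.StringIO(
    "    1 11\n"
    "    1 21\n"
    "    2 14\n"
)

# Each (start, end) pair is a half-open character interval within a line,
# playing the role of the indices listed in a dictionary file.
colspecs = [(0, 5), (5, 7), (7, 8)]
names = ["caseid", "pregordr", "outcome"]

toy = pd.read_fwf(raw, colspecs=colspecs, names=names, header=None)
print(toy)
```

Each variable is recovered purely from its character position in the line, which is why the positions have to be documented separately from the data.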

In [1]:
# Import the author's helper module, which provides utilities used in this chapter
import thinkstats2

---

## 1.3 DataFrames 

A DataFrame is the fundamental data structure provided by pandas, which is a Python data and statistics package we'll use throughout this book. A DataFrame contains a row for each record, in this case one row per pregnancy, and a column for each variable.

In [2]:
# Reading data
import nsfg

df = nsfg.ReadFemPreg()
df

Unnamed: 0,caseid,pregordr,howpreg_n,howpreg_p,moscurrp,nowprgdk,pregend1,pregend2,nbrnaliv,multbrth,...,laborfor_i,religion_i,metro_i,basewgt,adj_mod_basewgt,finalwgt,secu_p,sest,cmintvw,totalwgt_lb
0,1,1,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,8.8125
1,1,2,,,,,6.0,,1.0,,...,0,0,0,3410.389399,3869.349602,6448.271112,2,9,,7.8750
2,2,1,,,,,5.0,,3.0,5.0,...,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,9.1250
3,2,2,,,,,6.0,,1.0,,...,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,7.0000
4,2,3,,,,,6.0,,1.0,,...,0,0,0,7226.301740,8567.549110,12999.542264,2,12,,6.1875
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13588,12571,1,,,,,6.0,,1.0,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,6.1875
13589,12571,2,,,,,3.0,,,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,
13590,12571,3,,,,,3.0,,,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,
13591,12571,4,,,,,6.0,,1.0,,...,0,0,0,4670.540953,5795.692880,6269.200989,1,78,,7.5000


In [3]:
# List the column (variable) names
df.columns

Index(['caseid', 'pregordr', 'howpreg_n', 'howpreg_p', 'moscurrp', 'nowprgdk',
       'pregend1', 'pregend2', 'nbrnaliv', 'multbrth',
       ...
       'laborfor_i', 'religion_i', 'metro_i', 'basewgt', 'adj_mod_basewgt',
       'finalwgt', 'secu_p', 'sest', 'cmintvw', 'totalwgt_lb'],
      dtype='object', length=244)

Each column of a DataFrame is a pandas Series, which is similar to a Python list but provides additional features. A column can be accessed using square brackets or dot notation.

In [4]:
type(df['caseid'])

pandas.core.series.Series
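As a small illustration of the two access styles, the sketch below builds a toy DataFrame (the values are made up) and shows that bracket and dot notation return the same Series, which can then be indexed and sliced much like a list.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"caseid": [1, 1, 2], "pregordr": [1, 2, 1]})

by_bracket = toy["pregordr"]  # works for any column name
by_dot = toy.pregordr         # works only when the name is a valid Python
                              # identifier and does not clash with a
                              # DataFrame attribute or method

print(type(by_bracket))
print(by_bracket.equals(by_dot))

first = by_bracket[0]                 # select a single element by index label
first_two = by_bracket[:2].tolist()   # slice, then convert to a plain list
print(first, first_two)
```

Bracket notation is the safer default; dot notation is convenient for interactive work with simple column names.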

---

## 1.4 Variables

Many of the variables in the dataset are **recodes**: they are not part of the raw data collected by the survey but are derived from it. For example, prglngth for live births is equal to the raw variable wksgest (weeks of gestation) if it is available; otherwise it is estimated using mosgest * 4.33 (months of gestation times the average number of weeks in a month).

Recodes are often based on logic that checks the consistency and accuracy of the data. In general it is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself.
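The prglngth rule described above can be sketched on toy data as follows. This is only an illustration of the fallback logic; the actual NSFG recode may involve additional checks, and the column name `prglngth_sketch` and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy raw variables (made-up values): weeks and months of gestation
toy = pd.DataFrame({
    "wksgest": [39.0, np.nan, 41.0],
    "mosgest": [9.0, 9.0, np.nan],
})

# Use wksgest where it is available; otherwise fall back to mosgest * 4.33
toy["prglngth_sketch"] = toy["wksgest"].fillna(toy["mosgest"] * 4.33)
print(toy)
```

The second row has no wksgest, so its value comes from the months-based estimate; the other rows keep their reported weeks.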

---

## 1.5 Transformations

Imported data often contains errors and special values, and it may need to be converted into different formats or combined through calculations. These operations are called **data cleaning**.

Special values encoded as numbers are dangerous because if they are not handled properly, they can generate bogus results, like a 99-pound baby. The replace method replaces these values with np.nan, a special floating-point value that represents "not a number."

As part of the IEEE floating-point standard, all mathematical operations return nan if either argument is nan:

In [7]:
import numpy as np

np.nan / 10     # any arithmetic operation involving nan yields nan

nan
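The effect of special values can be sketched on toy data. The sentinel codes in `na_vals` are an assumption for illustration (check the codebook for the real ones), and the weights are made up. The sketch assigns the result of `replace` back to the column rather than calling it with `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

# Toy birth weights in pounds; 97 and 99 stand in for special codes
toy = pd.DataFrame({"birthwgt_lb": [7.0, 8.0, 99.0, 6.0, 97.0]})
na_vals = [97, 98, 99]  # assumed sentinel codes, for illustration only

# Without cleaning, the special codes are treated as real weights
naive_mean = toy["birthwgt_lb"].mean()

# Replace the special codes with nan and assign the result back
toy["birthwgt_lb"] = toy["birthwgt_lb"].replace(na_vals, np.nan)

# Series.mean() skips nan by default, so only real weights are averaged
clean_mean = toy["birthwgt_lb"].mean()

print(naive_mean, clean_mean)
```

Plain arithmetic propagates nan, but many pandas aggregations such as `mean` skip it by default, which is what makes nan a convenient marker for missing data.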

---

## 1.6 Validation

If you take the time to validate the data, that is, to check that you know what you are dealing with, you can save time later and avoid errors.

Codes used in the outcome column:

LIVE BIRTH --------> 1

INDUCED ABORTION --> 2

STILLBIRTH --------> 3

MISCARRIAGE -------> 4

ECTOPIC PREGNANCY -> 5

CURRENT PREGNANCY -> 6

In [10]:
# value_counts() counts how many times each value appears in the column; sort_index() orders the result by code
df.outcome.value_counts().sort_index()

outcome
1    9148
2    1862
3     120
4    1921
5     190
6     352
Name: count, dtype: int64
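To make counts like these easier to compare against a codebook, the codes can be mapped to their labels. The sketch below uses the encoding listed above on a made-up outcome column, and also checks for codes that fall outside the documented set.

```python
import pandas as pd

# Encoding of the outcome column, as listed above
outcome_labels = {
    1: "LIVE BIRTH",
    2: "INDUCED ABORTION",
    3: "STILLBIRTH",
    4: "MISCARRIAGE",
    5: "ECTOPIC PREGNANCY",
    6: "CURRENT PREGNANCY",
}

# Toy outcome column (made-up values, not NSFG data)
toy = pd.DataFrame({"outcome": [1, 4, 1, 2, 6, 1]})

counts = toy["outcome"].value_counts().sort_index()
labeled = counts.rename(index=outcome_labels)  # swap codes for labels
print(labeled)

# Any code outside the documented set signals a problem worth investigating
unexpected = set(toy["outcome"].dropna().unique()) - set(outcome_labels)
print(unexpected)
```

An empty `unexpected` set means every value in the column is a documented code.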

In [12]:
# Another example: birth weight in pounds (sort=False leaves the counts unsorted)
df.birthwgt_lb.value_counts(sort=False)

birthwgt_lb
8.0     1889
7.0     3049
9.0      623
6.0     2223
4.0      229
5.0      697
10.0     132
12.0      10
14.0       3
3.0       98
1.0       40
11.0      26
2.0       53
13.0       3
0.0        8
15.0       1
Name: count, dtype: int64

---

## 1.7 Interpretation

To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.

That is, you have to look beyond the numbers and think about the causes and circumstances behind the values.

In [21]:
from collections import defaultdict

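def MakePregMap(df):
    """Map each caseid to the list of row indices of that respondent's pregnancies."""
    d = defaultdict(list)
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for index, caseid in df.caseid.items():
        d[caseid].append(index)
    return d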

In [27]:
caseid = 10229
preg_map = nsfg.MakePregMap(df)
indices = preg_map[caseid]
df.outcome[indices].values  # equivalently: df.loc[df.caseid == caseid, 'outcome'].values

array([4, 4, 4, 4, 4, 4, 1])

The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.

Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more. But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage. Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.
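The per-respondent lookup that MakePregMap builds can also be sketched with pandas' own `groupby`. This is an alternative formulation on made-up toy data, not the author's implementation.

```python
import pandas as pd

# Toy data (made-up values): two respondents with several pregnancies each
toy = pd.DataFrame({
    "caseid": [1, 1, 2, 2, 2],
    "outcome": [1, 4, 4, 4, 1],
})

# .groups maps each caseid to the index labels of that respondent's rows
preg_map = toy.groupby("caseid").groups

indices = preg_map[2]
outcomes = toy.loc[indices, "outcome"].values
print(list(indices), outcomes)
```

As with the dictionary returned by MakePregMap, the result lets you look up all pregnancies for one respondent without scanning the whole DataFrame each time.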

---

## Conclusion

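Chapter 1 of Think Stats 2 contrasts anecdotal evidence with a statistical approach and walks through the first steps of exploratory data analysis: importing the NSFG pregnancy data into a DataFrame, understanding its variables and recodes, cleaning and validating the data, and interpreting it in context. It provides the groundwork for more advanced techniques in the following chapters.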