- title: Think Stats
- date: 2020-01-02 21:45
- category: study
- tags: study
- slug: think-stats
- authors: Robin Rheem
- summary: Study notes for the book `Think Stats`

# Exploratory data analysis
Anecdotal evidence is evidence that isn't recorded or official. It's like, "My friend told me that something was blah," and you just have to believe it. It's usually unpublished and often personal.

Reasons anecdotal evidence fails:
* Small number of observations
* Selection bias
* Confirmation bias
* Inaccuracy

## A statistical approach
Using statistics, we can address the problems with anecdotes. The tools are:
* Data collection
* Descriptive statistics
* Exploratory data analysis
* Estimation
* Hypothesis testing

## The National Survey of Family Growth
This book introduces a lot of statistical vocabulary by explaining what each term means.
* cross-sectional study: captures a snapshot of a group at a point in time.
* longitudinal study: observes a group repeatedly over a period of time.
* population: the whole group
* sample: a subset of the population (we'll learn later how samples are actually selected).
* respondents: the people who participated in the study or survey.
* representative: every member of the target population has an equal chance of participating. Cross-sectional studies are generally meant to be representative.
* oversampled: a group is deliberately sampled at a higher rate than its share of the population, so there are enough respondents from that group to draw valid conclusions about it. The drawback is that it's harder to draw conclusions about the general population.
* codebook: a document that describes the design of the study.
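
To make the difference between a representative and an oversampled sample concrete, here is a minimal sketch using only the standard library. The population, the group labels `A` and `B`, and the sample sizes are all made up for illustration; they are not NSFG figures. The weights show one common way to undo oversampling when estimating something about the whole population.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 900 members of group "A" and 100 of group "B".
population = ["A"] * 900 + ["B"] * 100

# Representative sample: every member has the same chance of being picked.
representative = random.sample(population, 100)

# Oversampled design: deliberately draw 50 from each group, so group "B"
# is far more common in the sample than in the population.
group_a = [p for p in population if p == "A"]
group_b = [p for p in population if p == "B"]
oversampled = random.sample(group_a, 50) + random.sample(group_b, 50)

# Sampling weight: how many population members each respondent stands for.
weights = {"A": len(group_a) / 50, "B": len(group_b) / 50}

# Unweighted, group "B" looks like half of the population.
share_b_raw = oversampled.count("B") / len(oversampled)

# Weighted, the share of group "B" matches the population again.
weighted_b = oversampled.count("B") * weights["B"]
weighted_total = sum(weights[g] for g in oversampled)
share_b_weighted = weighted_b / weighted_total

print(share_b_raw, share_b_weighted)
```

The unweighted share overstates group `B`, which is why conclusions about the general population need extra care with oversampled data.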

## Importing the data
From here on I'll follow the author's [GitHub](https://github.com/AllenDowney/ThinkStats2) repository. I won't copy the code into these notes since the author has already written all of it.

* record: one row of the data, holding the information about a single subject.
* dct: Stata dictionary file.
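
The data file is fixed-width text, and the dictionary file says which character positions belong to which variable. As a rough sketch of that idea (not the author's loader), pandas can read fixed-width text with `read_fwf`. The three columns, their positions, and the sample rows below are assumptions made up for this example.

```python
import io

import pandas as pd

# Made-up fixed-width rows: caseid in characters 0-3,
# pregordr in 4-5, prglngth in 6-7.
raw = (
    "   1 139\n"
    "   1 239\n"
    "   2 140\n"
)

# Hypothetical column layout, playing the role of the dictionary file.
colspecs = [(0, 4), (4, 6), (6, 8)]
names = ["caseid", "pregordr", "prglngth"]

# Each line becomes one record (row); each column spec becomes one variable.
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=names, header=None)

print(df)
```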

## DataFrames
The `DataFrame` object is just a pandas object that represents a row-column data structure. It has a lot of rows and columns!

The underlying data for the `nsfg` file has 13593 rows and 244 columns.

`df.columns`: returns the column names.
The result is an `Index`, which is another pandas data structure. It behaves much like a list of labels, but we'll learn more about it later on.

To access a column from a `DataFrame`, you can use the column name as the key, for example `df['caseid']`. The returned object is a pandas `Series`. A `Series` is like a Python list with some additional features. I wonder how it differs from an `Index`.

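It's also cool to know that you can use dot notation to access the columns of a `DataFrame`, as long as the column name is a valid Python identifier and doesn't clash with an existing `DataFrame` attribute.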
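
Here is a small sketch of those access patterns. The tiny `DataFrame` is a made-up stand-in for the real pregnancy data, which is far larger.

```python
import pandas as pd

# Hypothetical miniature stand-in for the pregnancy data; values are made up.
df = pd.DataFrame({
    "caseid": [1, 1, 2],
    "pregordr": [1, 2, 1],
    "prglngth": [39, 39, 40],
})

columns = df.columns      # an Index holding the column labels
caseid = df["caseid"]     # bracket access returns a Series
same_caseid = df.caseid   # dot notation returns the same column

print(type(columns).__name__)
print(type(caseid).__name__)
print(df.shape)           # (number of rows, number of columns)
```

Bracket access works for any column name, so it is the safer habit when names contain spaces or match a method name.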

## Variables
* recodes: they're not part of the raw data. They're values 'calculated' from the raw data.
* raw data: the values as they were collected, before any checking or calculation.

It's a good idea to use recodes because they're nicely calculated values. You should have a good reason to use raw data.

## Transformation
* data cleaning: checking for errors, dealing with special values, converting data into different formats, and performing calculations. Steps like these are called data cleaning.
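
A minimal sketch of what those cleaning steps can look like in pandas. The column names follow the NSFG style, but the values, the assumption that age is stored in hundredths of a year, and the choice of 97, 98, and 99 as special codes are illustrative assumptions here; check the codebook before relying on them.

```python
import numpy as np
import pandas as pd

# Made-up raw values in the style of the pregnancy file.
df = pd.DataFrame({
    "agepreg": [3316, 2500, 1875],  # assumed: hundredths of a year
    "birthwgt_lb": [8, 99, 7],      # assumed: 97, 98, 99 are special codes
    "birthwgt_oz": [13, 0, 14],
})

# Convert format: hundredths of a year -> years.
df["agepreg"] = df["agepreg"] / 100.0

# Deal with special values: mark the assumed codes as missing.
special_codes = [97, 98, 99]
df["birthwgt_lb"] = df["birthwgt_lb"].replace(special_codes, np.nan)

# Perform a calculation: combine pounds and ounces into one column.
df["totalwgt_lb"] = df["birthwgt_lb"] + df["birthwgt_oz"] / 16.0

print(df)
```

Marking special codes as missing keeps them out of later calculations; a 99 left in place would be read as a 99-pound baby.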

## Validation
Understanding the data that you have is very important. If you misunderstand the meanings of the columns, you'll have wrong results. 

We can count how often each value appears in a `Series` with the `value_counts` method. To sort the result by index, chain `sort_index`.
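
As a quick sketch, counting values looks like this. The `outcome` codes below are made up; with the real data you would compare the counts against the totals published in the codebook.

```python
import pandas as pd

# Hypothetical outcome codes for seven pregnancies.
df = pd.DataFrame({"outcome": [1, 1, 2, 4, 1, 6, 4]})

# value_counts orders by frequency; sort_index reorders by the code itself.
counts = df["outcome"].value_counts().sort_index()

print(counts)
```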

## Interpretation
Working with data means thinking on two levels at the same time: the level of statistics and the level of context. It also calls for respect and gratitude toward the people who took part in the survey.

## Exercises
Just doing exercises that the book asked!

## Glossary
* anecdotal evidence: Evidence, often personal, that is collected casually rather than by a well-designed study.
* population: A group we are interested in studying. "Population" often refers to a group of people, but the term is used for other subjects, too.
* cross-sectional study: A study that collects data about a population at a particular point in time.
* cycle: In a repeated cross-sectional study, each repetition of the study is called a cycle.
* longitudinal study: A study that follows a population over time, collecting data from the same group repeatedly.
* record: In a dataset, a collection of information about a single person or other subject.
* respondent: A person who responds to a survey.
* sample: The subset of a population used to collect data.
* representative: A sample is representative if every member of the population has the same chance of being in the sample.
* oversampling: The technique of increasing the representation of a sub-population in order to avoid errors due to small sample sizes.
* raw data: Values collected and recorded with little or no checking, calculation or interpretation.
* recode: A value that is generated by calculation and other logic applied to raw data.
* data cleaning: Processes that include validating data, identifying errors, translating between data types and representations, etc.