# The source of this notebook:

*Understanding and Visualizing Data with Python - University of Michigan - Coursera*

# What is statistics?

1. Learning from data
2. Applied in research and market
3. Statistics = analysis of numbers and graphs
4. Statistics = a field that focuses on research methodology
    
## Landscape

1. Continuously improved
2. New types of data
3. Advances in computing

## Fields
1. summarizing data
2. uncertainty
3. forecasting
4. decisions
5. so forth
6. measurement

## 1. Summarizing
- data
    - Overwhelming
    - Misleading
    - framework
    - not 100% accurate
    - need to use margin of error
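
As a rough illustration of summarizing data together with a margin of error, the sketch below computes a sample mean and an approximate 95% margin of error. The values are the first ten `Height` entries shown later in this notebook; the 1.96 multiplier is an assumption (a normal approximation) and is only a crude guide for a sample this small.

```python
import math
import statistics

# First ten Height values (inches) from the Cartwheel data shown later in this notebook
heights = [62.0, 62.0, 66.0, 64.0, 73.0, 75.0, 75.0, 65.0, 74.0, 63.0]

n = len(heights)
mean = statistics.mean(heights)
sd = statistics.stdev(heights)            # sample standard deviation
standard_error = sd / math.sqrt(n)
margin_of_error = 1.96 * standard_error   # assumption: normal approximation, ~95% level

print(f"mean = {mean:.2f}, margin of error = +/- {margin_of_error:.2f}")
```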

## 4. Decisions
- decisions made in the face of uncertainty
    - Risk of cancer: surgery or not?
    - What are the risks?
    
## 3. Predictions
- weather
- earthquake
- estimate demand
- outcome of elections

## 6. Measurement
- Accuracy of data - How to collect?
    - high accuracy : age
    - difficult: blood pressure / water level / stream direction
    - harder: political ideology

- trade-off
    - data costs money
    - statistics helps to reduce the costs of data collection

# Statistics major areas

1. Computer science - languages for manipulating data, data structures 
2. Mathematics - expressing concepts precisely and concisely
3. Probability theory - foundation. Express ideas of randomness and uncertainty
4. Data science - database management, machine learning, computational infrastructure, data analysis



# Types of data

1. Organic and processed (Big Data)
    - Financial transactions
    - stock markets exchange
    - netflix viewing history
    - web browser activity
    - sport events
    - pollution
        - weather stations
        - iot devices
    - the process is massive
        - mine this data to uncover patterns
        - computer science and statistical methodology
  2. Designed data collection 
      - Population interviews 
      - Twitter encoding (chosen)
      - generally much smaller
  
  ## i.i.d. = independent and identically distributed
  1. Data are completely independent
      - observations are independent
  2. No correlation between populations and individuals
  3. The values arise from common statistical distribution
  4. Example
      - Final exam scores are independent
      - These data come from a bell shape curve (common statistical distribution)
  
  ## Not i.i.d
  1. Students might cheat
     - **independence assumption broken**
  2. Males and females may have different means
      - subgroups follow their own distributions
  3. Students coming from the same discussion section
  4. **Need different analytic procedures**
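
The contrast between the two cases can be sketched with simulated exam scores. Everything here is hypothetical: the means, standard deviations, and group sizes are made-up numbers chosen only to show one i.i.d. sample next to a sample built from two subgroups with different means.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# i.i.d. case: every score is drawn independently from the same distribution
iid_scores = [random.gauss(75, 10) for _ in range(1000)]

# Not identically distributed: two hypothetical subgroups with different means
group_a = [random.gauss(65, 10) for _ in range(500)]
group_b = [random.gauss(85, 10) for _ in range(500)]
mixed_scores = group_a + group_b

print("i.i.d. sample mean:", round(statistics.mean(iid_scores), 1))
print("subgroup means:", round(statistics.mean(group_a), 1), round(statistics.mean(group_b), 1))
print("pooled mean:", round(statistics.mean(mixed_scores), 1))
```

The pooled mean of the mixed sample can look similar to the i.i.d. mean even though the data come from two different distributions, which is why the origin of the data matters before applying i.i.d. procedures.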
  
  ## Summary
  
  ### Can we apply procedures that assume i.i.d.?
  ### Where does the data come from?
  
  
  
  

# Variables

1. Quantitative
    - Continuous
    - Discrete - even though it is not continuous, it can often be modelled as if it were
2. Categorical (qualitative)
    - ordinal : class ranking
    - nominal : races, status
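
One possible way to represent these variable types in pandas is sketched below. The column names and values are made up for illustration; `pd.Categorical` with `ordered=True` is used for the ordinal variable and a plain `category` dtype for the nominal one.

```python
import pandas as pd

# Made-up values for illustration only
df_example = pd.DataFrame({
    "height_in": [62.0, 66.5, 73.0],          # quantitative, continuous
    "num_siblings": [0, 2, 1],                # quantitative, discrete
    "class_rank": ["low", "high", "medium"],  # categorical, ordinal
    "eye_color": ["brown", "blue", "brown"],  # categorical, nominal
})

# Ordinal: the categories have a meaningful order
df_example["class_rank"] = pd.Categorical(
    df_example["class_rank"], categories=["low", "medium", "high"], ordered=True
)
# Nominal: the categories have no order
df_example["eye_color"] = df_example["eye_color"].astype("category")

print(df_example.dtypes)
print(df_example["class_rank"].max())  # max is defined because the categories are ordered
```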
    

# Study design

## Confirmatory

- A falsifiable hypothesis is stated first -> data is collected to test it

## Exploratory

- Data is collected and explored without previous hypothesis

**Key takeaway**

- The more questions you ask of your data, the more likely you are to reach misleading conclusions
 
## Types of studies

- Comparative - values of different individuals or characteristics of distinct populations are compared
    - Observational studies: data collection is passive, subjects are **self-selected** into groups, and observations arise naturally
    - Experimental studies: the researcher manipulates conditions or assigns individual units to groups in order to draw conclusions
    - Observational studies have to be used when an experiment is impractical or unethical
    - Examples: 
        - comparison of lifespans or incidence of lung cancer between smokers and non-smokers
        - comparison of standardized test scores of different classes
- Non-comparative - Focus on estimating and predicting - **random assignment**
    - predict stock prices
    - impacts of drugs on blood pressure
    - examples:
        - designing a monitoring net to compare the yield of lettuce in fields with or without fertilizer. Fields are randomly **assigned** to receive fertilizer or to be left untreated.
        - If we want to explore how people react to an ad, we can randomly expose people to one version or the other and then compare the click rates.
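
Random assignment, as in the fertilizer example, can be sketched in a few lines. The field labels and the even 5/5 split are assumptions made for illustration; they are not part of the course material.

```python
import random

random.seed(42)  # fixed seed so the assignment is repeatable

fields = [f"field_{i}" for i in range(1, 11)]    # hypothetical field labels
shuffled = random.sample(fields, k=len(fields))  # random permutation of the fields

treated = sorted(shuffled[:5])     # these fields receive fertilizer
untreated = sorted(shuffled[5:])   # these fields are left untreated

print("treated:  ", treated)
print("untreated:", untreated)
```

Because chance alone decides which fields are treated, systematic differences between the two groups (soil quality, sunlight) are unlikely to line up with the treatment.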
        
        
# Data Management

Data management refers to all steps of data processing that occur after the data are collected but before the actual data analysis.

Most statistical software operates predominantly on rectangular arrays of data in which the rows represent cases and the columns represent variables. In most statistical analysis, the cases represent individual units in the population of interest, although in some settings the distinction between cases and variables is not very clear.


## Best Practices for Data Management

 - Never modify the source data files (you want to preserve a record of the data as you received it).
 - Write a script (e.g. a Python program) to generate your analysis files from the source data files.
 - Name variables with brief interpretable names.
 - Variable names consisting only of letters (a-z, case sensitive), numbers (not as the first character), and the underscore character (_) are handled easily by most statistical software.
 - Do not use whitespace in variable names.
 - Most software will treat 'NA', blank, or '.' as a missing value.
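
A minimal sketch of the naming and missing-value advice, using pandas. The column names and rows are made up. Note that pandas treats `NA` and empty fields as missing by default, while `.` has to be listed explicitly through `na_values`.

```python
import io
import pandas as pd

# A small in-memory CSV with awkward column names and three styles of missing value
raw = io.StringIO(
    "Subject ID,Blood Pressure,Age\n"
    "1,120,34\n"
    "2,NA,.\n"
    "3,,51\n"
)

# 'NA' and blank fields are missing by default; '.' is added explicitly
df_clean = pd.read_csv(raw, na_values=["."])

# Brief, whitespace-free names: lower-case, spaces replaced by underscores
df_clean.columns = df_clean.columns.str.strip().str.lower().str.replace(" ", "_")

print(df_clean.isna().sum())
```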
 
 *Databases and Other Tools*
 
 - Database software and tools (e.g. SQL) can be very useful for large-scale data management. Some statistical software can read data directly from a database. Another approach is to construct a text data file from a database e.g. using SQL.
 - HDF5, Apache Parquet, and Apache Arrow are open-source standards for large binary datasets. Using these formats saves processing time relative to text/csv because fewer conversions are performed when reading and writing the data.
 - Hadoop and Spark are two popular tools for manipulating very large datasets.
 
  *Data Files for Storage and Exchange*
 
 - Text/CSV is currently the most universal format for data exchange.
 - The data in a CSV file is “delimited”, usually by a comma or a tab. Large data sets can be saved in compressed form (e.g. using “gzip”) and read into statistical software directly from the compressed file. This allows the data to be read much faster, and reduces storage space.
 -  Formats like XML and JSON are useful for non-rectangular data but tend to produce larger files that are slower to read and process.
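
Reading directly from a compressed file can be sketched with pandas as follows. The file name `scores.csv.gz` and the data are hypothetical; the sketch relies on pandas inferring gzip compression from the `.gz` suffix.

```python
import os
import tempfile
import pandas as pd

df_small = pd.DataFrame({"id": [1, 2, 3], "score": [7, 8, 10]})  # made-up data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.csv.gz")  # hypothetical file name
    df_small.to_csv(path, index=False)  # compression inferred from the .gz suffix
    df_back = pd.read_csv(path)         # read straight from the compressed file

print(df_back.equals(df_small))
```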
 
 *Repeated Measures Data: Wide and Long* 
 
 
 - Wide format: one row per subject (many variables per row - easier to visualize)
 - Long format: one row per measurement (One variable per row - quicker to filter)
 
 **Python/Pandas has tools to convert between wide and long form**
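
One way to do the conversion is with `melt` (wide to long) and `pivot` (long to wide). The subjects, visit names, and values below are made up for illustration.

```python
import pandas as pd

# Wide format: one row per subject, one column per repeated measurement
wide = pd.DataFrame({
    "subject": [1, 2],
    "visit_1": [120, 135],
    "visit_2": [118, 130],
})

# Wide -> long: one row per (subject, visit) measurement
long = wide.melt(id_vars="subject", var_name="visit", value_name="bp")

# Long -> wide again
wide_again = long.pivot(index="subject", columns="visit", values="bp").reset_index()
wide_again.columns.name = None  # drop the leftover 'visit' label on the column index

print(long)
print(wide_again)
```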
         

# Data Management

Data management is a crucial component to statistical analysis and data science work.  The following code will show how to import data via the pandas library, view your data, and transform your data.

The main data structure that Pandas works with is called a **Data Frame**.  This is a two-dimensional table of data in which the rows typically represent cases (e.g. Cartwheel Contest Participants), and the columns represent variables.  Pandas also has a one-dimensional data structure called a **Series** that we will encounter when accessing a single column of a Data Frame.

Pandas has a variety of functions named '`read_xxx`' for reading data in different formats.  Right now we will focus on reading '`csv`' files; CSV stands for comma-separated values. Other supported formats include Excel, JSON, and SQL, to name a few.

This is a link to the .csv that we will be exploring in this tutorial: [Cartwheel Data](https://www.coursera.org/learn/understanding-visualization-data/resources/0rVxx) (Link goes to the dataset section of the Resources for this course)

There are many other options to '`read_csv`' that are very useful.  For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas.  See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'.
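
A small sketch of the `sep` option, reading tab-delimited text from memory instead of a file. The two rows are copied from the first rows of the Cartwheel data; the variable names are arbitrary.

```python
import io
import pandas as pd

# Tab-delimited text held in memory; a file path would work the same way
tsv_text = "ID\tAge\tGender\n1\t56\tF\n2\t26\tF\n"

df_tab = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tab.shape)
```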

# Pandas basic exploration

In [10]:
import pandas as pd

file = 'Cartwheeldata.csv'
df = pd.read_csv(file)
type(df)   # pandas DataFrame
df.head()  # first 5 rows by default

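|   | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |
| 4 | 5 | 27 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0 | 4 |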


In [30]:
%%HTML
<b> print the data types in the data frame </b>

In [33]:
df.dtypes #types of data per column

ID                 int64
Age                int64
Gender            object
GenderGroup        int64
Glasses           object
GlassesGroup       int64
Height           float64
Wingspan         float64
CWDistance         int64
Complete          object
CompleteGroup      int64
Score              int64
dtype: object

## Selecting rows and columns

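### .loc[]

`.loc[]` takes two selectors separated by a comma, each of which can be a single label, a list of labels, or a slice. The first selects rows and the second selects columns. Unlike ordinary Python slices, a label slice includes its end label.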

In [25]:
df.loc[:10, ["CWDistance", "Height", "Wingspan"]]  # row labels up to 10, three selected columns
# column labels are case sensitive

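|   | CWDistance | Height | Wingspan |
|---|---|---|---|
| 0 | 79 | 62.0 | 61.0 |
| 1 | 70 | 62.0 | 60.0 |
| 2 | 85 | 66.0 | 64.0 |
| 3 | 87 | 64.0 | 63.0 |
| 4 | 72 | 73.0 | 75.0 |
| 5 | 81 | 75.0 | 71.0 |
| 6 | 107 | 75.0 | 76.0 |
| 7 | 98 | 65.0 | 62.0 |
| 8 | 106 | 74.0 | 73.0 |
| 9 | 65 | 63.0 | 60.0 |
| 10 | 96 | 69.5 | 66.0 |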


In [26]:
df.loc[5:10] #select by rows

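|   | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | 24 | M | 2 | N | 0 | 75.0 | 71.0 | 81 | N | 0 | 3 |
| 6 | 7 | 28 | M | 2 | N | 0 | 75.0 | 76.0 | 107 | Y | 1 | 10 |
| 7 | 8 | 22 | F | 1 | N | 0 | 65.0 | 62.0 | 98 | Y | 1 | 9 |
| 8 | 9 | 29 | M | 2 | Y | 1 | 74.0 | 73.0 | 106 | N | 0 | 5 |
| 9 | 10 | 33 | F | 1 | Y | 1 | 63.0 | 60.0 | 65 | Y | 1 | 8 |
| 10 | 11 | 30 | M | 2 | Y | 1 | 69.5 | 66.0 | 96 | Y | 1 | 6 |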


### .iloc[]

`.iloc[]` selects by integer position, whereas `.loc[]` selects by label. Position slices exclude the end position, as in ordinary Python slicing.


In [27]:
df.iloc[:4]

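|   | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |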


In [29]:
df.iloc[:4,6:9]

|   | Height | Wingspan | CWDistance |
|---|---|---|---|
| 0 | 62.0 | 61.0 | 79 |
| 1 | 62.0 | 60.0 | 70 |
| 2 | 66.0 | 64.0 | 85 |
| 3 | 64.0 | 63.0 | 87 |
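
The difference in slice endpoints between label-based and position-based selection can be seen on a small frame. The frame below is a made-up stand-in that reuses the first five `Height` and `Wingspan` values, so it runs without the CSV file.

```python
import pandas as pd

# Stand-in frame with the default integer index (labels 0..4)
df_demo = pd.DataFrame({
    "Height": [62.0, 62.0, 66.0, 64.0, 73.0],
    "Wingspan": [61.0, 60.0, 64.0, 63.0, 75.0],
})

by_label = df_demo.loc[:3, ["Height"]]  # labels 0..3, end label included
by_position = df_demo.iloc[:3, 0:1]     # positions 0..2, end position excluded

print(len(by_label), len(by_position))
```

With the default index the labels and positions coincide, so the only visible difference is whether the end of the slice is included.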


## Overall understanding of rows

In [40]:
df.Gender.unique()


array(['F', 'M'], dtype=object)

In [55]:
df.groupby(["Gender", "GenderGroup"]).size()  # each Gender value pairs with exactly one GenderGroup value

Gender  GenderGroup
F       1              12
M       2              13
dtype: int64

This validates the initial assumption that these two series carry the same information.
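
A more explicit version of this check, sketched on a small made-up frame with the same two column names, counts how many distinct values of one column occur for each value of the other. The helper variable names are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the two columns being compared
df_check = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F"],
    "GenderGroup": [1, 1, 2, 2, 1],
})

# In a one-to-one mapping each Gender has exactly one GenderGroup, and vice versa
groups_per_gender = df_check.groupby("Gender")["GenderGroup"].nunique()
genders_per_group = df_check.groupby("GenderGroup")["Gender"].nunique()
is_one_to_one = bool((groups_per_gender == 1).all() and (genders_per_group == 1).all())

print(pd.crosstab(df_check["Gender"], df_check["GenderGroup"]))
print("one-to-one:", is_one_to_one)
```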