# Week 05

# Data terminology

### Variables

- Dependent variable (DV):
    - DVs are target or outcome variables whose values we are interested in predicting or analyzing
    - In scientific experiments, DVs are the variables that are measured and analyzed, whereas IVs are experimentally controlled
- Independent variable (IV):
    - IVs are predictor variables whose values are used to predict target variable values
    - In scientific experiments, IV values are usually controlled and varied, and the resulting DV values (and changes in those values) are measured
- Other ways to think about DVs and IVs:
    - DV = Y, IV = X:
       - X: horizontal axis on a graph (controlled variable)
       - Y: vertical axis on a graph (outcome variable)
    - Linear model: `Y = aX + b`
       - Linear models of this form are very important in statistics and data modeling
       - In this model `Y` contains the DVs and `X` contains the IVs
       - `a` is a set of coefficients that map `X` to `Y`
       - `b` is the intercept: the value of `Y` when `X` is zero
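
As a rough sketch of the linear model above, the coefficients `a` and `b` can be estimated from data by least squares. The example assumes NumPy is installed and uses made-up values for a single IV (`x`) and a single DV (`y`); the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: one IV (x) and one DV (y), generated exactly from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial, i.e. the linear model y = a*x + b
# (np.polyfit returns the highest-degree coefficient first)
a, b = np.polyfit(x, y, 1)

print(round(a, 6), round(b, 6))
```

Because the data here lie exactly on a line, the fitted `a` and `b` recover the slope and intercept used to generate them (up to floating-point error). Real datasets will not fit this cleanly.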
    
    

### Observations

- Case:
   - a collection of DV and IV values representing a single observational unit
   - represented by one row in a CSV file or other data file
   - e.g. single person, single country, etc.
- Observation:
   - a case that has been measured (usually experimentally)
 


### Load, read, parse

- "load" : transfer data from a file to temporary memory (i.e., RAM)
- "read" : transfer a portion of a data file to temporary memory
- "parse" : interpret / extract / reduce / organize data that have been loaded or read
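
The three terms above can be illustrated with a small sketch. It assumes pandas is installed and uses `io.StringIO` with made-up CSV text as a stand-in for a real file on disk; the column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV contents; io.StringIO stands in for a file on disk
csv_text = "case,y,x0,x1\n0,5.0,1.2,0.3\n1,6.0,1.5,0.1\n2,4.0,0.9,0.7\n3,7.0,1.8,0.2\n"

# "load": transfer the whole file into memory
df_all = pd.read_csv(io.StringIO(csv_text))

# "read": transfer only a portion of the file (here, the first two rows)
df_part = pd.read_csv(io.StringIO(csv_text), nrows=2)

# "parse": extract / reduce / organize the loaded data (here, keep two columns)
df_parsed = df_all[["y", "x0"]]

print(df_all.shape, df_part.shape, df_parsed.shape)
```

In practice the same `read_csv` call is used for both loading and reading; the distinction is only whether the whole file or part of it ends up in memory.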

  

<br>
<br>
<br>

___

# Final Project Dataset Requirements & Recommendations

### Data type (recommendation)

- Use only numerical values
   - Other values (dates, strings, etc.) can also be used but are usually more difficult to parse 
- Use one or more CSV files
   - Other data file formats are fine but may be more difficult to work with

### CSV structure (original or parsed data) (recommendation)

- Columns: `Case, DV0, DV1, DV2, ...  IV0, IV1, IV2, ...`

e.g.

```
0, y00,y01,y02, x00,x01,x02
1, y10,y11,y12, x10,x11,x12
2, y20,y21,y22, x20,x21,x22
...
```
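
One possible way to work with a file laid out like this is sketched below. It assumes pandas is installed, that the file has a header row, and that DV and IV columns are named with `DV` and `IV` prefixes; the values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents with 2 DVs and 3 IVs; values are made up
csv_text = """Case,DV0,DV1,IV0,IV1,IV2
0,1.0,2.0,0.1,0.2,0.3
1,1.5,2.5,0.4,0.5,0.6
2,2.0,3.0,0.7,0.8,0.9
"""

# Use the Case column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col="Case")

# Split columns by name prefix into DVs (Y) and IVs (X)
dv_cols = [c for c in df.columns if c.startswith("DV")]
iv_cols = [c for c in df.columns if c.startswith("IV")]
Y = df[dv_cols]
X = df[iv_cols]

print(Y.shape, X.shape)
```

Keeping DVs and IVs in separate dataframes is a convenience for later analyses, not a requirement.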

### Number of cases (requirement)

- 100 or more cases
- 200-1000 is best;  more is OK too

### Numbers of variables (requirement)

- At least 1 DV
   - More than 1 DV is OK but will complicate analyses
   - Recommend against using more than 3 DVs;  some analyses become very difficult or impossible when the number of DVs is large relative to the number of cases
- At least 3 IVs
   - More than 3 is OK
   - Fewer than 10 is best;  otherwise some analyses become difficult
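
The requirements above can be checked with a few comparisons. The helper below is a hypothetical sketch (the function name and messages are not part of the course materials); it checks only the hard requirements, not the recommendations.

```python
def check_dataset(n_cases, n_dvs, n_ivs):
    """Return a list of unmet requirements (empty if all are met)."""
    problems = []
    if n_cases < 100:
        problems.append("fewer than 100 cases")
    if n_dvs < 1:
        problems.append("no DV")
    if n_ivs < 3:
        problems.append("fewer than 3 IVs")
    return problems

# Hypothetical counts for two candidate datasets
print(check_dataset(n_cases=1599, n_dvs=1, n_ivs=11))
print(check_dataset(n_cases=50, n_dvs=1, n_ivs=2))
```

For a dataframe `df`, the number of cases is `len(df)` and the number of columns is `len(df.columns)`.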

<br>
<br>
<br>

___

# Python lesson

## Dictionaries

A dictionary is a collection of values that are indexed by "keys" rather than by position.

Keys are most often strings, as in the examples below, although other immutable types such as integers can also be used as keys.

Keys are used to access dictionary values.

In [1]:
d = {'aaa':123, 'bb':5, 'cc':100}

print( d )
print( d['aaa'] )
print( d['bb'] )
print( d['cc'] )

{'aaa': 123, 'bb': 5, 'cc': 100}
123
5
100


The keys and values can also be accessed using the `keys` and `values` methods:

In [2]:
print( d.keys() )
print()
print( d.values() )

dict_keys(['aaa', 'bb', 'cc'])

dict_values([123, 5, 100])
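
Keys and values can also be visited in pairs, and entries can be added or changed after the dictionary is created. A brief sketch using the same example dictionary and the standard `items` method:

```python
d = {'aaa': 123, 'bb': 5, 'cc': 100}

# items() yields (key, value) pairs in insertion order
for key, value in d.items():
    print(key, value)

# Adding a new key and overwriting an existing one use the same indexing syntax
d['dd'] = 7
d['bb'] = 50

print(d)
```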


An alternative way to create dictionaries is to use the `dict` function:

In [3]:
d = dict(aaa=123, bb=5, cc=100)

print( d )
print( d['aaa'] )

{'aaa': 123, 'bb': 5, 'cc': 100}
123


<br>
<br>

# Reading and parsing CSV data

The main data structure in pandas, a Python library for data analysis, is the dataframe (`DataFrame`) object.  Dataframes are created automatically when loading or reading data, for example with the `read_csv` function:

In [4]:
import pandas as pd

df = pd.read_csv( 'winequality-red.csv' )

print( type(df) )
print()
print( df )

<class 'pandas.core.frame.DataFrame'>

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0               7.4             0.700         0.00             1.9      0.076   
1               7.8             0.880         0.00             2.6      0.098   
2               7.8             0.760         0.04             2.3      0.092   
3              11.2             0.280         0.56             1.9      0.075   
4               7.4             0.700         0.00             1.9      0.076   
...             ...               ...          ...             ...        ...   
1594            6.2             0.600         0.08             2.0      0.090   
1595            5.9             0.550         0.10             2.2      0.062   
1596            6.3             0.510         0.13             2.3      0.076   
1597            5.9             0.645         0.12             2.0      0.075   
1598            6.0             0.310         0.47             3.6    

<br>
<br>

The `describe` method gives a more compact summary of a dataframe, reporting the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical column:

<br>
<br>

In [5]:
df.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


<br>
<br>

A dataframe is similar to a dictionary in that keys are used to access specific columns:

In [6]:
print( df['quality'] )

0       5
1       5
2       5
3       6
4       5
       ..
1594    5
1595    6
1596    6
1597    5
1598    6
Name: quality, Length: 1599, dtype: int64
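
The dictionary analogy extends a little further, as the sketch below suggests. It assumes pandas is installed and uses a small made-up dataframe (`df_small`, a hypothetical name) so that it runs without the wine data file.

```python
import pandas as pd

# Hypothetical small dataframe built from a dictionary of columns
df_small = pd.DataFrame({'quality': [5, 5, 6], 'pH': [3.51, 3.20, 3.16]})

print(list(df_small.keys()))       # column names, like dictionary keys
print('quality' in df_small)       # membership test checks column names
print(df_small['pH'].tolist())     # key access returns one column
```

Unlike a plain dictionary, each dataframe column also carries a row index, which is why the printed columns above show row numbers alongside the values.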


<br>
<br>

A smaller dataframe or dataframe subset can be created, for example, by selecting multiple columns:

In [7]:
# extract columns:

df1 = df[ ['quality', 'volatile acidity'] ]

print( df1 )

      quality  volatile acidity
0           5             0.700
1           5             0.880
2           5             0.760
3           6             0.280
4           5             0.700
...       ...               ...
1594        5             0.600
1595        6             0.550
1596        6             0.510
1597        5             0.645
1598        6             0.310

[1599 rows x 2 columns]


<br>
<br>

Dataframe columns can be renamed in several ways. One useful approach is to pass the `rename` method a dictionary that maps current column names to new column names, as in the example below. Any columns excluded from the dictionary will retain their current names.

In [8]:
# rename columns:

df1 = df1.rename( columns={'volatile acidity':'acid'} )

print( df1 )

      quality   acid
0           5  0.700
1           5  0.880
2           5  0.760
3           6  0.280
4           5  0.700
...       ...    ...
1594        5  0.600
1595        6  0.550
1596        6  0.510
1597        5  0.645
1598        6  0.310

[1599 rows x 2 columns]


<br>
<br>

The columns can now be accessed using the new keys:

In [9]:
print( df1['acid'] )

0       0.700
1       0.880
2       0.760
3       0.280
4       0.700
        ...  
1594    0.600
1595    0.550
1596    0.510
1597    0.645
1598    0.310
Name: acid, Length: 1599, dtype: float64


<br>
<br>

**Notes**:

- The dataframe reduction and column renaming examples above represent a very simple type of parsing.
- Your own Final Project datasets will likely require additional parsing.
- Comprehensive parsing will NOT be covered directly in the lecture.
- Please search the internet for solutions to your particular parsing challenges, and contact the instructor if you are unable to solve them.
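
As one illustration of the kind of additional parsing a dataset might need, the sketch below converts a column containing a non-numeric entry to numbers and drops cases with missing values. It assumes pandas is installed; the data, column names, and the `unknown` placeholder are made up, and real datasets may call for different steps.

```python
import io
import pandas as pd

# Hypothetical messy data: a text placeholder ("unknown") and an empty cell
csv_text = """name,score,age
a,10,30
b,unknown,25
c,7,
d,9,41
"""

df = pd.read_csv(io.StringIO(csv_text))

# Convert to numbers; entries that cannot be converted become NaN
df["score"] = pd.to_numeric(df["score"], errors="coerce")

# Keep the numeric columns only, then drop cases with any missing value
df_clean = df[["score", "age"]].dropna()

print(df_clean)
```

Dropping incomplete cases is the simplest option but reduces the number of cases, so it is worth checking `len(df_clean)` against the dataset requirements afterwards.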