# LAB 2
### prepared by Markov Artur

## Task 2. Working with files


Data

The zip file `specdata.zip` [2.4MB] containing the data can be downloaded from the data folder in the course repository.

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:

- Date: the date of the observation in YYYY-MM-DD format (year-month-day)
- sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
- nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

In each file there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
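As a small illustration of the NA coding, the sketch below parses a made-up three-row excerpt in the same Date/sulfate/nitrate layout. The values are invented for the example and are not taken from `specdata`; the sketch relies on `pandas.read_csv` treating the string `NA` as a missing value by default.

```python
import io

import pandas as pd

# Invented sample in the same layout as the monitor files (not real data).
sample = io.StringIO(
    "Date,sulfate,nitrate\n"
    "2003-01-01,NA,NA\n"
    "2003-01-02,3.5,NA\n"
    "2003-01-03,4.5,1.5\n"
)

# read_csv recognises the string "NA" as a missing value by default.
df = pd.read_csv(sample, parse_dates=["Date"])

# Number of missing values per column.
missing = df.isna().sum()
print(missing)
```

Here only the last row is fully observed: sulfate is missing once and nitrate twice.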


In [1]:
import pandas as pd
import numpy as np
import os


## Part 1

Write a function named pollutantmean that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function pollutantmean takes three arguments: directory, pollutant, and id. Given a vector of monitor ID numbers, pollutantmean reads those monitors' particulate matter data from the directory specified in the directory argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

You can see some example output from this function below.
``` r
pollutantmean("specdata", "sulfate", 1:10)
## [1] 4.064128

pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706047

pollutantmean("specdata", "nitrate", 23)
## [1] 1.280833
```
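The mean "across all of the monitors" is read here as a pooled mean over every non-missing observation. When monitors have different numbers of valid observations, this generally differs from the average of the per-monitor means. A minimal sketch on invented values (the frames `monitor_a` and `monitor_b` are hypothetical):

```python
import numpy as np
import pandas as pd

# Two invented monitors with different numbers of valid observations.
monitor_a = pd.DataFrame({"sulfate": [1.0, 2.0, 3.0, np.nan]})
monitor_b = pd.DataFrame({"sulfate": [10.0, np.nan]})

# Pooled mean: stack all rows, then average; NaN is skipped by default.
pooled = pd.concat([monitor_a, monitor_b])["sulfate"].mean()

# Mean of per-monitor means: weights each monitor equally instead.
mean_of_means = np.mean(
    [monitor_a["sulfate"].mean(), monitor_b["sulfate"].mean()]
)

# pooled is 4.0 (16 / 4), mean_of_means is 6.0 ((2 + 10) / 2).
print(pooled, mean_of_means)
```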


In [2]:
lab_folder_path = '/Users/arturmarkov/univer/master_degree/software_for_data_processing/L2'
os.path.join(lab_folder_path, 'specdata')

'/Users/arturmarkov/univer/master_degree/software_for_data_processing/L2/specdata'

In [3]:
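def get_specific_dfs(directory, id) -> pd.DataFrame:
    """
    Read the CSV files of the requested monitors into one DataFrame.

    `id` may be a single int, a list of ints, or a tuple (start, stop[, step])
    describing an inclusive range whose bounds can be given in either order.
    """
    if isinstance(id, tuple):
        # Inclusive range; an optional third element is used as the step.
        array = list(range(min(id[0], id[1]), max(id[0], id[1]) + 1, *id[2:]))
    elif isinstance(id, list):
        array = id
    elif isinstance(id, int):
        array = [id]
    else:
        raise ValueError('id should be an int, a list of ints, or a (start, stop[, step]) tuple')

    # Map monitor IDs 1..N to file paths, relying on the sorted file names.
    path, _, filenames = next(os.walk(os.path.join(lab_folder_path, directory)))
    filenames = sorted(filenames)
    files = {i + 1: os.path.join(path, name) for i, name in enumerate(filenames)}

    t = []
    for i in array:
        # Stop at the first ID without a file; corr() relies on this
        # when it asks for the range (1, 10000).
        if i not in files:
            break
        df = pd.read_csv(files[i])
        df['id'] = i
        t.append(df)
    return pd.concat(t)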

def pollutantmean(directory: str, pollutant: str, id) -> float:
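    """
    Return the mean of `pollutant` ('sulfate' or 'nitrate') pooled over the
    given monitors, ignoring missing values and rounded to 6 decimals.
    """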
    df = get_specific_dfs(directory, id)
    return df[pollutant].mean().round(6)


In [4]:
# pollutantmean("specdata", "sulfate", 1:10)
## [1] 4.064128

# pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706047

# pollutantmean("specdata", "nitrate", 23)
## [1] 1.280833

assert pollutantmean("specdata", "sulfate", (1, 10)) == 4.064128
assert pollutantmean("specdata", "nitrate", (70, 72)) == 1.706047
assert pollutantmean("specdata", "nitrate", 23) == 1.280833


## Part 2

Write a function named complete that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.
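A "completely observed case" is a row in which no variable is missing. The sketch below counts such rows per monitor on invented data; the column values and the two monitor IDs are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented observations for two hypothetical monitors (ids 1 and 2).
df = pd.DataFrame(
    {
        "sulfate": [1.0, 2.0, 3.0, 4.0, np.nan],
        "nitrate": [0.5, 0.7, np.nan, 0.9, np.nan],
        "id": [1, 1, 1, 2, 2],
    }
)

# A row is a complete case when none of its columns is missing.
is_complete = df.notna().all(axis=1)

# Count complete rows per monitor.
nobs = is_complete.groupby(df["id"]).sum().rename("nobs").reset_index()
print(nobs)
```

Monitor 1 has two complete rows and monitor 2 has one, so `nobs` holds 2 and 1.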


In [5]:
def _complete(df: pd.DataFrame, id_col: str = 'id') -> pd.DataFrame:
    """
    Count complete rows (no missing values) per value of `id_col`.
    """
    # assign() returns a new frame, so the caller's DataFrame is not modified.
    flagged = df.assign(complete=~df.isna().any(axis=1))
    return (
        flagged.groupby([id_col], as_index=False, sort=False)
        .complete.sum()
        .rename(columns={'complete': 'nobs'})
    )

def complete(directory: str, id) -> pd.DataFrame:

    df = get_specific_dfs(directory, id)

    return _complete(df)

``` r
complete("specdata", 1)
##   id nobs
## 1  1  117
```

In [6]:
complete("specdata", 1)

|   | id | nobs |
|---|----|------|
| 0 | 1  | 117  |


``` r
complete("specdata", c(2, 4, 8, 10, 12))
##   id nobs
## 1  2 1041
## 2  4  474
## 3  8  192
## 4 10  148
## 5 12   96
```

In [7]:
complete("specdata", [2, 4, 8, 10, 12])

|   | id | nobs |
|---|----|------|
| 0 | 2  | 1041 |
| 1 | 4  | 474  |
| 2 | 8  | 192  |
| 3 | 10 | 148  |
| 4 | 12 | 96   |


``` r
complete("specdata", 30:25)
##   id nobs
## 1 30  932
## 2 29  711
## 3 28  475
## 4 27  338
## 5 26  586
## 6 25  463
```

In [8]:
complete("specdata", (30, 25))[::-1]

|   | id | nobs |
|---|----|------|
| 5 | 30 | 932  |
| 4 | 29 | 711  |
| 3 | 28 | 475  |
| 2 | 27 | 338  |
| 1 | 26 | 586  |
| 0 | 25 | 463  |



## Part 3

Write a function named corr that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. For this function you will need to use the 'cor' function in R which calculates the correlation between two vectors.
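In pandas, the counterpart of R's `cor` for two columns is `Series.corr`, which computes the Pearson correlation by default. The sketch below applies the threshold rule to invented data; the `threshold` value and the two monitors are hypothetical, and this is one possible way to structure the computation rather than the required one.

```python
import numpy as np
import pandas as pd

# Invented data: monitor 1 has 3 complete cases, monitor 2 has 1.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2],
        "sulfate": [1.0, 2.0, 3.0, np.nan, 5.0, np.nan],
        "nitrate": [2.0, 4.0, 6.0, 1.0, 1.0, 2.0],
    }
)
threshold = 2  # hypothetical threshold for this toy example

# Keep complete rows, then monitors with more than `threshold` of them.
complete_rows = df.dropna()
counts = complete_rows.groupby("id").size()
keep = counts[counts > threshold].index

# Pearson correlation of sulfate and nitrate, one value per kept monitor.
selected = complete_rows[complete_rows["id"].isin(keep)]
cr = np.array(
    [group["sulfate"].corr(group["nitrate"]) for _, group in selected.groupby("id")]
)
print(cr)
```

Only monitor 1 passes the threshold, and its nitrate values are exactly twice its sulfate values, so the single correlation is 1 up to floating-point error. If no monitor passes, the list comprehension is empty and `cr` has length 0.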

In [9]:
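def corr(directory: str, threshold: int):
    """Correlations of sulfate and nitrate for monitors with more than `threshold` complete cases."""
    # (1, 10000) over-covers the IDs; loading stops at the first missing one.
    df = get_specific_dfs(directory, (1, 10000))
    complete_info = _complete(df)
    complete_info = complete_info[complete_info.nobs > threshold]

    # Per-monitor 2x2 correlation matrices are stacked row-wise; every second
    # row ('sulfate') of the last column ('nitrate') holds the wanted value.
    selected = df[df.id.isin(complete_info.id.unique())]
    corr_info = selected.groupby('id')[['sulfate', 'nitrate']].corr().iloc[0::2, -1]
    return corr_info.values


def summary(corr_info):
    """Mimic R's summary(): min, quartiles, median, mean and max, rounded to 5 decimals."""
    stats = pd.Series(corr_info).describe(percentiles=[.25, .75]).round(5)
    return pd.DataFrame(stats).transpose().reset_index()[['min', '25%', '50%', 'mean', '75%', 'max']]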

``` r
cr <- corr("specdata", 150)
head(cr)
## [1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814
summary(cr)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.21057 -0.04999  0.09463  0.12525  0.26844  0.76313
```

In [10]:
cr = corr("specdata", 150)
summary(cr)

|   | min      | 25%      | 50%     | mean    | 75%     | max     |
|---|----------|----------|---------|---------|---------|---------|
| 0 | -0.21057 | -0.04999 | 0.09463 | 0.12525 | 0.26844 | 0.76313 |


``` r
cr <- corr("specdata", 400)
head(cr)
## [1] -0.01895754 -0.04389737 -0.06815956 -0.07588814  0.76312884 -0.15782860
summary(cr)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.17623 -0.03109  0.10021  0.13969  0.26849  0.76313

```

In [11]:
cr = corr("specdata", 400)
summary(cr)

|   | min      | 25%      | 50%     | mean    | 75%     | max     |
|---|----------|----------|---------|---------|---------|---------|
| 0 | -0.17623 | -0.03109 | 0.10021 | 0.13969 | 0.26849 | 0.76313 |


``` r
cr <- corr("specdata", 5000)
summary(cr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##
length(cr)
## [1] 0
```

In [12]:
cr = corr("specdata", 5000)
summary(cr)

|   | min | 25% | 50% | mean | 75% | max |
|---|-----|-----|-----|------|-----|-----|
| 0 | NaN | NaN | NaN | NaN  | NaN | NaN |
