## Problem 2 - Data manipulation and selection (*9 points*)

In this problem you will clean the data from our data file by removing no-data values, convert temperature values in Fahrenheit to Celsius, and split the data into separate datasets using the weather station identification code. We will start this problem by cleaning and converting our temperature data. Please perform the tasks below by writing your code into the codeblocks in each section.

**Notice**: Closely follow the instructions! For example, you should be sure to use **exactly** the same variable names mentioned in the instructions because your answers will be automatically graded, and the tests that grade your answers rely on following the same formatting or variable naming as in the instructions.

**Your score on this problem will be based on the following criteria:**

- Creating a new dataframe called `selected` that contains select columns from the data file
- Cleaning the new dataframe by removing no-data values
- Creating a new column for temperatures converted from Fahrenheit to Celsius
- Dividing the data into separate dataframes for the Helsinki Kumpula and Rovaniemi stations
- Saving the new dataframes to CSV files
- Including comments that explain what most lines in the code do
- Answering a couple of questions at the end of the problem
- Uploading your notebook and data files to your GitHub repository for this week's exercise  

### Part 1 (*0 points*)

The first step for this problem is to read the data file again.

- Use Pandas to read the [data/6153237444115dat.csv](data/6153237444115dat.csv) file into the variable `data` (you can copy your code from Problem 1).

In [2]:
# Import Pandas and read in the data from csv
import pandas as pd

data = pd.read_csv('data/6153237444115dat.csv', na_values=['*', '**', '***', '****', '*****', '******'])

data.head()

Unnamed: 0,USAF,WBAN,YR--MODAHRMN,DIR,SPD,GUS,CLG,SKC,L,M,...,SLP,ALT,STP,MAX,MIN,PCP01,PCP06,PCP24,PCPXX,SD
0,28450,99999,201705010000,174.0,10.0,14.0,,,,,...,1009.2,,984.1,,,,,,,35.0
1,28450,99999,201705010020,180.0,10.0,,4.0,,,,...,,29.74,,,,,,,,
2,28450,99999,201705010050,190.0,10.0,,4.0,,,,...,,29.74,,,,,,,,
3,28450,99999,201705010100,188.0,12.0,16.0,,,,,...,1009.1,,984.0,,,,,,,35.0
4,28450,99999,201705010120,200.0,13.0,,2.0,OBS,,,...,,29.74,,,,,,,,
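
The `na_values` argument used above tells Pandas which strings to treat as missing data. The following is a minimal sketch of its effect, assuming a made-up three-row sample (`raw`) rather than the real data file, which has many more columns:

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has many more columns
raw = "USAF,TEMP,MAX\n28450,31,***\n28450,**,***\n29980,37,40\n"

# Without na_values the asterisk strings are kept, so TEMP and MAX are read as text
plain = pd.read_csv(io.StringIO(raw))

# With na_values the same strings become NaN and the columns are numeric
parsed = pd.read_csv(io.StringIO(raw), na_values=['*', '**', '***'])

print(plain.dtypes)
print(parsed.dtypes)
print(parsed['TEMP'].isna().sum())
```

Reading the asterisk strings as NaN is what makes it possible to remove them later with `dropna()`.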


### Part 2 (*2 points*)

Next, you can subset the data and remove the no-data values.

 - Create a new variable `selected`
 - Select the columns `USAF`, `YR--MODAHRMN`, `TEMP`, `MAX`, and `MIN` from the `data` dataframe and assign them to the new variable `selected`
 - Remove all rows from `selected` that have NoData in the column `TEMP` using the `dropna()` function

In [73]:
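# Select the columns of interest from the full dataset
selected = data[['USAF', 'YR--MODAHRMN', 'TEMP', 'MAX', 'MIN']]

# Drop rows with no TEMP value; dropna() returns a new dataframe, so reassign it
selected = selected.dropna(subset=['TEMP'])

# Show the first rows of the cleaned dataframe
selected.head()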

Unnamed: 0,USAF,YR--MODAHRMN,TEMP,MAX,MIN
0,28450,201705010000,31.0,,
1,28450,201705010020,30.0,,
2,28450,201705010050,30.0,,
3,28450,201705010100,31.0,,
4,28450,201705010120,30.0,,
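
Note that `dropna()` returns a new dataframe by default and does not modify the one it is called on. Here is a small sketch of the difference, assuming a made-up dataframe `df` with one missing `TEMP` value:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing TEMP value
df = pd.DataFrame({'USAF': [28450, 28450, 29980],
                   'TEMP': [31.0, np.nan, 37.0]})

# Calling dropna() without assigning the result leaves df unchanged
df.dropna(subset=['TEMP'])
print(len(df))       # 3

# Assigning the result keeps only the rows that have a TEMP value
cleaned = df.dropna(subset=['TEMP'])
print(len(cleaned))  # 2
```

The remaining rows keep their original index labels (here `0` and `2`).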


In [None]:
# Check your dataframe
selected.head()

# Check that the selected dataframe exists
assert 'selected' in locals()


In [46]:
# Check your dataframe 
len(selected)

11694

### Part 3 (*2 points*)

Next, you can convert the temperature values in Fahrenheit to Celsius.

- Create a new column in `selected` called `Celsius`
- Convert the Fahrenheit temperatures from `TEMP` using the conversion formula (below) and store the results in the new `Celsius` column.

$$
\begin{equation}
  T_{\textrm{C}} = (T_{\textrm{F}} - 32)~/~1.8
\end{equation}
$$

- Round the values in the `Celsius` column to have 0 decimals (**do not** create a new column, update the current one)
- Convert the `Celsius` values into integers (**do not** create a new column, update the current one)
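
The three steps above (convert, round, cast to integer) can be sketched on a small made-up dataframe. The name `temps` and its values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical Fahrenheit readings
temps = pd.DataFrame({'TEMP': [31.0, 30.0, 37.0, 39.0]})

# Apply T_C = (T_F - 32) / 1.8 to the whole column at once
temps['Celsius'] = (temps['TEMP'] - 32) / 1.8

# Round to 0 decimals first, then convert to integers
temps['Celsius'] = temps['Celsius'].round(0).astype(int)

print(temps['Celsius'].tolist())  # [-1, -1, 3, 4]
```

The order matters: `astype(int)` on its own truncates toward zero, so skipping the rounding step would give `[0, -1, 2, 3]` for the same readings.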

In [81]:
# Copy to avoid SettingWithCopyWarning, then convert Fahrenheit to Celsius
selected = selected.copy()
selected['Celsius'] = (selected['TEMP'] - 32) / 1.8

# Round to 0 decimals, updating the existing column
selected['Celsius'] = selected['Celsius'].round(0)

In [111]:
# Drop any rows without a Celsius value, since NaN cannot be converted to an integer
selected = selected.dropna(subset=['Celsius'])

# Convert the rounded values to integers, updating the existing column
selected['Celsius'] = selected['Celsius'].astype(int)

In [113]:
# Check your dataframe
selected.head()

Unnamed: 0,USAF,YR--MODAHRMN,TEMP,MAX,MIN,Celsius
0,28450,201705010000,31.0,,,-1
1,28450,201705010020,30.0,,,-1
2,28450,201705010050,30.0,,,-1
3,28450,201705010100,31.0,,,-1
4,28450,201705010120,30.0,,,-1


In [115]:
# Check that the temperatures are converted into integer type
assert selected['Celsius'].dtype == 'int32' or selected['Celsius'].dtype == 'int64'

### Part 4 (*2 points*)

Your next task is to divide `selected` into two separate dataframes. Please use the given variable names and write your answer to the codeblock below.

- Select all rows from the `selected` DataFrame with the `USAF` code `29980` into a variable called `kumpula`
- Select all rows from the `selected` DataFrame with the `USAF` code `28450` into a variable called `rovaniemi`
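
Selecting rows by station code is a boolean-mask selection. Below is a minimal sketch, assuming a made-up dataframe `obs` that contains only the two station codes named above:

```python
import pandas as pd

# Hypothetical sample with the two station codes from the instructions
obs = pd.DataFrame({'USAF': [28450, 29980, 28450, 29980],
                    'TEMP': [31.0, 37.0, 30.0, 39.0]})

# Comparing a column with a value gives a boolean Series (the mask)
mask = obs['USAF'] == 29980

# .loc keeps the rows where the mask is True; the original index labels are kept
station_a = obs.loc[mask]
station_b = obs.loc[obs['USAF'] == 28450]

print(station_a)
print(station_b)
```

Because the index labels are preserved, the second station's rows may start from a label other than `0`.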

In [119]:
# Select the Helsinki Kumpula rows (USAF code 29980)
kumpula = selected.loc[selected['USAF'] == 29980]
# Select the Rovaniemi rows (USAF code 28450)
rovaniemi = selected.loc[selected['USAF'] == 28450]



In [121]:
# Check the dataframe
print("Kumpula: \n", kumpula.head(), "\n")


Kumpula: 
        USAF  YR--MODAHRMN  TEMP  MAX  MIN  Celsius
8770  29980  201705010000  37.0  NaN  NaN        3
8771  29980  201705010100  37.0  NaN  NaN        3
8772  29980  201705010200  37.0  NaN  NaN        3
8773  29980  201705010300  37.0  NaN  NaN        3
8774  29980  201705010400  39.0  NaN  NaN        4 



In [123]:
# Check the dataframe
print("Rovaniemi: \n", rovaniemi.head(), "\n")


Rovaniemi: 
     USAF  YR--MODAHRMN  TEMP  MAX  MIN  Celsius
0  28450  201705010000  31.0  NaN  NaN       -1
1  28450  201705010020  30.0  NaN  NaN       -1
2  28450  201705010050  30.0  NaN  NaN       -1
3  28450  201705010100  31.0  NaN  NaN       -1
4  28450  201705010120  30.0  NaN  NaN       -1 



### Part 5 (*3 points*)

Now you can save your selections to CSV files.

- Save the `kumpula` DataFrame into the file `Kumpula_temps_May_Aug_2017.csv` (CSV format) 
    - Separate the columns with commas (`,`)
    - Use only 2 decimals for the floating point numbers
- Save the `rovaniemi` DataFrame into the file `Rovaniemi_temps_May_Aug_2017.csv` (CSV format) 
    - Separate the columns with commas (`,`)
    - Use only 2 decimals for the floating point numbers
- Upload both of your data files to your Exercise 5 repository
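
In `to_csv()`, the number of decimals is controlled by passing a format string to `float_format`. The following is a small sketch that writes to an in-memory buffer instead of a file so that the result can be inspected; the dataframe `df` and its values are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical sample; TEMP is a float column, Celsius is an integer column
df = pd.DataFrame({'TEMP': [31.0, 37.123456], 'Celsius': [-1, 3]})

# Write to an in-memory buffer instead of a file, to inspect the result
buffer = io.StringIO()
df.to_csv(buffer, sep=',', float_format='%.2f')

print(buffer.getvalue())
```

Only floating point columns are affected by `float_format`; integer columns such as `Celsius` are written unchanged.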

In [137]:
help(data.to_csv)

Help on method to_csv in module pandas.core.generic:

to_csv(path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None, *, sep: 'str' = ',', na_rep: 'str' = '', float_format: 'str | Callable | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'bool_t | list[str]' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, mode: 'str' = 'w', encoding: 'str | None' = None, compression: 'CompressionOptions' = 'infer', quoting: 'int | None' = None, quotechar: 'str' = '"', lineterminator: 'str | None' = None, chunksize: 'int | None' = None, date_format: 'str | None' = None, doublequote: 'bool_t' = True, escapechar: 'str | None' = None, decimal: 'str' = '.', errors: 'OpenFileErrors' = 'strict', storage_options: 'StorageOptions | None' = None) -> 'str | None' method of pandas.core.frame.DataFrame instance
    Write object to a comma-separated values (csv) file.

    Parameters
    ----------
    path_or_buf : str, path object, file-like object

In [141]:
# Define the output file names
output_kumpula = 'Kumpula_temps_May_Aug_2017.csv'
output_rovaniemi = 'Rovaniemi_temps_May_Aug_2017.csv'

# Write comma-separated files with floats formatted to 2 decimals
kumpula.to_csv(output_kumpula, sep=',', float_format='%.2f')
rovaniemi.to_csv(output_rovaniemi, sep=',', float_format='%.2f')

In [None]:
#Read-only cell for hidden tests :)

### Problem 2 summary

In the [Exercise 5 summary notebook](Exercise-5-summary.ipynb) you can find a few additional points to consider and two final questions for Problem 2. Please answer those questions in [that notebook](Exercise-5-summary.ipynb).

### On to Problem 3

Now you can continue to [Problem 3: Data analysis](Exercise-5-problem-3.ipynb)