# Project - Measure Interpolation Impact

![Data Science Workflow](files/img/DSworkflow.png)

## Goal of Project
- The goal of the project is to see how large an impact interpolation can have on results.
- The focus is mainly on step 2.
- To see the impact, we will use a simple model.
- The project will not go into the details of steps 3 to 5.

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
%matplotlib inline

### Step 1.b: Read the data
- Use ```pd.read_parquet()``` to read the file `files/weather-predict.parquet`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` to the data to check that everything looks as expected

In [4]:
data=pd.read_parquet('files/weather-predict.parquet')

In [5]:
data

Datetime,Pressure,Pressure+24h
2006-04-01 00:00:00,1015.13,1015.68
2006-04-01 01:00:00,1015.63,1015.41
2006-04-01 02:00:00,1015.94,1014.98
2006-04-01 03:00:00,1016.41,1015.18
2006-04-01 04:00:00,1016.51,1014.70
...,...,...
2016-09-09 19:00:00,1014.36,1014.93
2016-09-09 20:00:00,1015.16,1015.52
2016-09-09 21:00:00,1015.66,1015.86
2016-09-09 22:00:00,1015.95,1016.04


In [6]:
data.head()

Datetime,Pressure,Pressure+24h
2006-04-01 00:00:00,1015.13,1015.68
2006-04-01 01:00:00,1015.63,1015.41
2006-04-01 02:00:00,1015.94,1014.98
2006-04-01 03:00:00,1016.41,1015.18
2006-04-01 04:00:00,1016.51,1014.7


## Step 2: Prepare
- Explore data
- Visualize ideas
- Clean data

### Step 2.a: Check the data types
- This step tells you whether a numeric column is stored as a non-numeric type.
- Get the data types by ```.dtypes```

In [8]:
data.dtypes

Pressure        float64
Pressure+24h    float64
dtype: object

In [10]:
data.shape

(96418, 2)
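
Both columns are already `float64` here, so nothing needs converting. If a numeric column had been read as text instead, one common fix is `pd.to_numeric`. The sketch below is only an illustration: the `raw` frame and its `"n/a"` entry are hypothetical and not part of the weather data.

```python
import pandas as pd

# Hypothetical frame: readings parsed as text, with one unparseable entry.
raw = pd.DataFrame({"Pressure": ["1015.13", "1015.63", "n/a"]})
print(raw.dtypes)  # Pressure is not float64 yet

# errors="coerce" turns unparseable entries into NaN instead of raising.
raw["Pressure"] = pd.to_numeric(raw["Pressure"], errors="coerce")
print(raw.dtypes)                    # Pressure is now float64
print(raw["Pressure"].isna().sum())  # number of entries that became NaN
```

Coerced entries show up as NaN, so they are picked up by the null-value check in the next step.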

### Step 2.b: Check the length, null-values, and zero values
- Check the length
    - HINT: Use `len()`
- Check the number of null-values
    - HINT: Use `.isna().sum()`
- Check the number of zero-values
    - HINT: Use `(data == 0).sum()`

In [11]:
len(data)

96418

In [12]:
data.isna().any()

Pressure        False
Pressure+24h     True
dtype: bool

In [13]:
data.isna().sum()

Pressure         0
Pressure+24h    38
dtype: int64

In [14]:
(data==0).sum()

Pressure        1288
Pressure+24h    1288
dtype: int64

### Step 2.c: Baseline
- Check the correlation to get a baseline for what happens if we do nothing
    - HINT: Use `corr()`

In [15]:
data.corr()

,Pressure,Pressure+24h
Pressure,1.0,0.419074
Pressure+24h,0.419074,1.0


In [17]:
data.describe()

,Pressure,Pressure+24h
count,96418.0,96380.0
mean,1003.231025,1003.227178
std,116.990796,117.013672
min,0.0,0.0
25%,1011.9,1011.9
50%,1016.44,1016.45
75%,1021.09,1021.09
max,1046.38,1046.38


### Step 2.d: Prepare data
- We know `Pressure+24h` has NaN and 0 values.
- These are not correct values and we cannot use them in our model.
- Create a `dataset` without these rows.
    - HINT: Use filters like `data[data['Pressure+24h'] != 0]` and `dropna()`

In [18]:
dataset=data[data['Pressure+24h'] != 0] 

In [19]:
dataset=dataset.dropna()

In [20]:
dataset.isna().any()

Pressure        False
Pressure+24h    False
dtype: bool

### Step 2.e: Check the size and zero values
- Check the size of datasets `data` and `dataset`
- Check how many zero-values each dataset has

In [21]:
len(data),len(dataset)

(96418, 95092)
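
The zero-value check for both frames can be done in one small table. The sketch below uses a hypothetical five-row miniature of the weather data (the values are made up for illustration) and applies the same filtering as Step 2.d.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather data: 0 and NaN mark missing readings.
data = pd.DataFrame({
    "Pressure":     [1015.1, 0.0, 1016.0, 1016.4, 1016.5],
    "Pressure+24h": [1015.7, 1015.4, 0.0, np.nan, 1014.7],
})

# Same filtering as in Step 2.d: only the target column is checked for zeros.
dataset = data[data["Pressure+24h"] != 0].dropna()

# Count zeros per column in each frame, side by side.
zero_counts = pd.DataFrame({
    "data": (data == 0).sum(),
    "dataset": (dataset == 0).sum(),
})
print(zero_counts)
```

In this toy example the filtered frame still contains a zero in `Pressure`, because the filter only looks at `Pressure+24h`. Running the same two counts on the real `data` and `dataset` shows whether that also happens there.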

### Step 2.f: Check the correlation
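- For fun, check the correlation of `dataset`
- Then do the same after interpolating the 0 values
    - HINT: Apply `replace` and `interpolate`
- Does the result surprise you?
    - NOTE: `dataset` only drops rows where `Pressure+24h` is 0 or NaN, so rows where `Pressure` is 0 are still in it
- Notice how much interpolation improves the result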

In [22]:
dataset.corr()

,Pressure,Pressure+24h
Pressure,1.0,0.083047
Pressure+24h,0.083047,1.0


In [23]:
dataset.replace(0,np.nan).interpolate().corr()

,Pressure,Pressure+24h
Pressure,1.0,0.79447
Pressure+24h,0.79447,1.0
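
To see what `replace` followed by `interpolate` does to individual values, here is a minimal sketch on a hypothetical six-point series (the numbers are invented for illustration). It assumes the default linear method of `interpolate`, which is what the cell above uses.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings; 0.0 marks a missing measurement.
s = pd.Series([1015.0, 0.0, 1017.0, 0.0, 0.0, 1020.0])

# Turn the placeholder zeros into NaN so pandas treats them as missing.
with_nan = s.replace(0, np.nan)

# Default linear interpolation fills each gap evenly between its neighbours.
filled = with_nan.interpolate()
print(filled.tolist())
# [1015.0, 1016.0, 1017.0, 1018.0, 1019.0, 1020.0]
```

A zero in a pressure column sits about a thousand units away from every real reading, so even a few of them dominate the correlation. Replacing them with values in line with their neighbours removes that distortion.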


### Step 2.g: Linear Regression Function
- Create a function `regression_score` to calculate the R-squared score
- It should take the independent features X and the dependent feature y
- Then split them into training and testing sets.
- Fit the model on the training set.
- Predict on the test set.
- Return the R-squared score

In [24]:
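def regression_score(X, y):
    # Hold out 20% of the rows for testing; fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Fit on the training split only
    lin = LinearRegression()
    lin.fit(X_train, y_train)
    # Score the predictions on the held-out test split
    y_pred = lin.predict(X_test)
    return r2_score(y_test, y_pred)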

### Step 2.h: Test baseline
- Test the `regression_score` function on `dataset`

In [25]:
regression_score(dataset[['Pressure']],dataset['Pressure+24h'])

0.008080860028906622

### Step 2.i: Test on interpolated dataset
- Make an interpolated dataset
- Get the result (from `regression_score`) for the interpolated dataset

In [26]:
dataset_interpolated=dataset.replace(0,np.nan).interpolate()
regression_score(dataset_interpolated[['Pressure']],dataset_interpolated['Pressure+24h'])

0.6269601274081953
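
The same comparison can be reproduced in isolation on synthetic data. The sketch below is an illustration under stated assumptions, not a rerun of the weather data: the sine-shaped signal, the noise level, the 2% share of zeros, and the column names `x` and `y` are all hypothetical. `regression_score` is repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def regression_score(X, y):
    # Same logic as the function in Step 2.g.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


rng = np.random.default_rng(0)
n = 2000
t = np.arange(n)

# Hypothetical smooth signal around 1015, and the same signal 24 steps later.
clean = pd.DataFrame({
    "x": 1015 + 10 * np.sin(2 * np.pi * t / 500) + rng.normal(0, 0.5, n),
    "y": 1015 + 10 * np.sin(2 * np.pi * (t + 24) / 500) + rng.normal(0, 0.5, n),
})

# Overwrite about 2% of each column with 0 to mimic missing readings.
# Index 0 is excluded because interpolate() does not fill a leading NaN.
raw = clean.copy()
raw.loc[rng.choice(np.arange(1, n), size=40, replace=False), "x"] = 0.0
raw.loc[rng.choice(np.arange(1, n), size=40, replace=False), "y"] = 0.0

# Treat the zeros as missing and fill them linearly from their neighbours.
interpolated = raw.replace(0, np.nan).interpolate()

score_raw = regression_score(raw[["x"]], raw["y"])
score_interp = regression_score(interpolated[["x"]], interpolated["y"])
print(f"R2 with zeros left in: {score_raw:.3f}")
print(f"R2 after interpolation: {score_interp:.3f}")
```

The exact scores depend on the seed and the assumed noise level, but the gap between the two goes in the same direction as the results above.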