<a href="https://colab.research.google.com/github/DAbbottPersonal/PM_china/blob/main/PM_china.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multiple Linear Regression

## Importing the libraries

In [177]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

There are several PM readings in the dataset, one per monitoring location. Store all of them as a dictionary of NumPy arrays for now.

In [178]:
dataset = pd.read_csv('/content/BeijingPM20100101_20151231.csv')
#X = dataset.iloc[:, :-1].values
features = ['No', 
            'year', 
            'month', 
            'day', 
            'hour', 
            'season', 
            'DEWP', 
            'HUMI',
            'PRES', 
            'TEMP', 
            'cbwd', 
            'Iws', 
            'precipitation', 
            'Iprec']
X = dataset.loc[:, features].values
y = {}
for i in dataset.columns:
  if 'PM_' in i:
    y[i] = dataset.loc[:, i].values

In [179]:
print(type(X))
print(X[0:4,:])
print(y)

<class 'numpy.ndarray'>
[[1 2010 1 1 0 4 -21.0 43.0 1021.0 -11.0 'NW' 1.79 0.0 0.0]
 [2 2010 1 1 1 4 -21.0 47.0 1020.0 -12.0 'NW' 4.92 0.0 0.0]
 [3 2010 1 1 2 4 -21.0 43.0 1019.0 -11.0 'NW' 6.71 0.0 0.0]
 [4 2010 1 1 3 4 -21.0 55.0 1019.0 -14.0 'NW' 9.84 0.0 0.0]]
{'PM_Dongsi': array([ nan,  nan,  nan, ..., 171., 204.,  nan]), 'PM_Dongsihuan': array([ nan,  nan,  nan, ..., 231., 242.,  nan]), 'PM_Nongzhanguan': array([ nan,  nan,  nan, ..., 196., 221.,  nan]), 'PM_US Post': array([ nan,  nan,  nan, ..., 203., 212., 235.])}


## Data Preprocessing


### Imputing

I need to remove the NA values and handle the long periods without any PM data from the other Chinese locations. The general strategy is:


1.   Impute the NA values of the US Post. These readings are present throughout most of the dataset, with the exception of a few days.
2.   Do not impute the other locations. Where they are available, average them together with the US Post reading.
3.   Use the average as the final PM value.

I am also considering MTR (multi-target regression), but for now I only use step 1.






1: Impute the US Post values

In [180]:
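from sklearn.impute import SimpleImputer
print(y['PM_US Post'])
US_post_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# SimpleImputer expects shape (n_samples, n_features), so pass the series as a single column.
# reshape(1,-1) would make every reading its own feature and drop the NaN entries instead of filling them.
US_post_imputer.fit(y['PM_US Post'].reshape(-1,1))
y['PM_US Post'] = US_post_imputer.transform(y['PM_US Post'].reshape(-1,1)).ravel()
print(y['PM_US Post'][-1])


[ nan  nan  nan ... 203. 212. 235.]
235.0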
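
`SimpleImputer` treats rows as samples and columns as features, so a single series has to be passed as one column of shape `(n, 1)`. The following is a minimal sketch on a made-up five-value array (the values are illustrative and not taken from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy series standing in for one location's readings (hypothetical values).
toy = np.array([np.nan, 10.0, np.nan, 30.0, 20.0])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Shape (n_samples, 1): one feature column, so each NaN gets the column mean.
filled = imputer.fit_transform(toy.reshape(-1, 1)).ravel()
print(filled)  # the two NaNs become 20.0, the mean of 10, 30 and 20
```

The `ravel()` call returns the result to a 1-D array, so it has the same length and layout as the input series.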


2: Average locations (TO DO)

3: Store Average (TO DO)
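
Steps 2 and 3 are not implemented yet. One possible way to do them is sketched below, assuming the readings are stacked column-wise and averaged over whichever locations reported in a given hour. The location names and values here are hypothetical placeholders, not data from the file.

```python
import numpy as np

# Hypothetical readings from three locations over four hours (not real data).
stations = {
    'PM_A': np.array([np.nan, 100.0, 120.0, np.nan]),
    'PM_B': np.array([np.nan, 110.0, np.nan, np.nan]),
    'PM_C': np.array([90.0, 120.0, 130.0, np.nan]),
}

# Stack to shape (n_hours, n_locations) and count valid readings per hour.
stacked = np.column_stack(list(stations.values()))
valid = ~np.isnan(stacked)
counts = valid.sum(axis=1)

# Average over the locations that reported; hours with no reading stay NaN.
totals = np.where(valid, stacked, 0.0).sum(axis=1)
pm_avg = np.full(len(counts), np.nan)
np.divide(totals, counts, out=pm_avg, where=counts > 0)
print(pm_avg)
```

If the US Post series is imputed first, every hour has at least one valid reading, so the all-NaN case should not occur in practice.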

There are some values that need imputing in the feature set.
*    Precipitation (precipitation and Iprec): Use the median, as the mean could produce odd or tiny values of precipitation.
*    Wind direction (cbwd): Use the local neighbors (±2 rows), since we don't expect the wind direction to change suddenly (or at least that is my intuition!). This is categorical and will have to be encoded before manipulation.
*    Other values with NA (DEWP, HUMI, PRES, TEMP, Iws): Use a local average of neighbors for these as well; unlike wind direction, they are numerical.
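
As a rough illustration of the "local average of neighbors" idea for a numeric column, the sketch below fills a gap with the mean of the rows up to two positions on either side. The toy values and the choice of a centered rolling window are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy numeric column with a single gap (hypothetical values).
s = pd.Series([10.0, 12.0, np.nan, 16.0, 18.0])

# Centered 5-row window = the row itself plus 2 neighbors on each side.
# NaNs are skipped, and min_periods=1 allows partial windows at the edges.
local_mean = s.rolling(window=5, center=True, min_periods=1).mean()

# Only fill the gaps; observed values are left untouched.
filled = s.fillna(local_mean)
print(filled.tolist())
```

Here the gap is filled with 14.0, the mean of its four valid neighbors. For an evenly spaced gap like this one, linear interpolation gives the same value.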

Precipitation:

In [181]:
#print(X[42898:42903,-2:])
print("Before and after imputing: ")
# np.float was removed in NumPy 1.24; the built-in float is equivalent here.
print(np.count_nonzero(np.isnan(X[:,-2:].astype(float))))
prec_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
prec_imputer.fit(X[:,-2:])
X[:,-2:] = prec_imputer.transform(X[:,-2:])
print(np.count_nonzero(np.isnan(X[:,-2:].astype(float))))

Before and after imputing: 
968
0
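
To see why the median is the safer fill value for precipitation, consider a toy series that is dry almost everywhere (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical hourly precipitation: mostly dry with one heavy shower.
precip = np.array([0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 12.5, 0.0, 0.0])

mean_fill = np.nanmean(precip)      # pulled up by the single wet hour
median_fill = np.nanmedian(precip)  # stays at the typical (dry) value
print(mean_fill, median_fill)
```

The mean would fill the gap with a small non-zero amount of rain (1.5625 here), while the median fills it with 0.0, which matches the typical hour.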


Other features (not wind direction):

In [182]:
from pandas.core.frame import DataFrame
print(X[45920:45925,:])
# Columns -8..-5 and -3 are DEWP, HUMI, PRES, TEMP and Iws.
print(np.count_nonzero(np.isnan(X[:,[-8,-7,-6,-5,-3]].astype(float))))


for col in [-8,-7,-6,-5,-3]:
  # Linearly interpolate each numeric column across its gaps.
  df = DataFrame(data = X[:,col])
  df = df.iloc[:, -1].astype(float).interpolate(method='linear')
  X[:,col] = df


print(X[45920:45925,:])
print(np.count_nonzero(np.isnan(X[:,[-8,-7,-6,-5,-3]].astype(float))))

[[45921 2015 3 29 8 1 3.0 50.0 1018.0 13.0 'cv' 0.89 0.0 0.0]
 [45922 2015 3 29 9 1 4.0 47.0 1018.0 15.0 'SE' 3.13 0.0 0.0]
 [45923 2015 3 29 10 1 nan nan nan nan nan nan 0.0 0.0]
 [45924 2015 3 29 11 1 4.0 41.0 1018.0 17.0 'SE' 3.13 0.0 0.0]
 [45925 2015 3 29 12 1 4.0 34.0 1015.0 20.0 'SE' 8.05 0.0 0.0]]
693
[[45921 2015 3 29 8 1 3.0 50.0 1018.0 13.0 'cv' 0.89 0.0 0.0]
 [45922 2015 3 29 9 1 4.0 47.0 1018.0 15.0 'SE' 3.13 0.0 0.0]
 [45923 2015 3 29 10 1 4.0 44.0 1018.0 16.0 nan 3.13 0.0 0.0]
 [45924 2015 3 29 11 1 4.0 41.0 1018.0 17.0 'SE' 3.13 0.0 0.0]
 [45925 2015 3 29 12 1 4.0 34.0 1015.0 20.0 'SE' 8.05 0.0 0.0]]
0


Wind direction:

In [183]:
winds = X[:,-4]
new_cols = np.zeros((len(winds),5))
# After stacking, missing directions show up as the string 'nan'.
winds = np.vstack(winds)
new_cols = np.concatenate((winds, new_cols),1)


# One-hot columns, in order: NE, NW, SW, SE, cv.
new_cols[new_cols[:,0] == 'NE'] = ['NE', 1., 0., 0., 0., 0.]
new_cols[new_cols[:,0] == 'NW'] = ['NW', 0., 1., 0., 0., 0.]
new_cols[new_cols[:,0] == 'SW'] = ['SW', 0., 0., 1., 0., 0.]
new_cols[new_cols[:,0] == 'SE'] = ['SE', 0., 0., 0., 1., 0.]
new_cols[new_cols[:,0] == 'cv'] = ['cv', 0., 0., 0., 0., 1.]
new_cols[new_cols[:,0] == 'nan'] = ['nan', np.nan, np.nan, np.nan, np.nan, np.nan]


print(np.count_nonzero(np.isnan(new_cols[:,1].astype(float))))


df = DataFrame(data = new_cols)
# Back-fill each missing direction from the next valid row.
df = df.iloc[:, 1:].astype(float).bfill()
final = df.values
print(np.count_nonzero(np.isnan(final[:,0].astype(float))))

print(X[45920:45925,:])
X = np.concatenate((X,final),1)
print(X[21670:21680,:])

5
0
[[45921 2015 3 29 8 1 3.0 50.0 1018.0 13.0 'cv' 0.89 0.0 0.0]
 [45922 2015 3 29 9 1 4.0 47.0 1018.0 15.0 'SE' 3.13 0.0 0.0]
 [45923 2015 3 29 10 1 4.0 44.0 1018.0 16.0 nan 3.13 0.0 0.0]
 [45924 2015 3 29 11 1 4.0 41.0 1018.0 17.0 'SE' 3.13 0.0 0.0]
 [45925 2015 3 29 12 1 4.0 34.0 1015.0 20.0 'SE' 8.05 0.0 0.0]]
[[21671 2012 6 21 22 2 18.0 73.0 1005.0 23.0 'SE' 21.01 2.6 2.7 0.0 0.0
  0.0 1.0 0.0]
 [21672 2012 6 21 23 2 19.0 88.0 1004.0 21.0 'SE' 25.03 1.1 3.8 0.0 0.0
  0.0 1.0 0.0]
 [21673 2012 6 22 0 2 19.0 93.0 1004.0 20.0 'SE' 28.16 0.1 3.9 0.0 0.0
  0.0 1.0 0.0]
 [21674 2012 6 22 1 2 19.0 93.0 1004.0 20.0 'SE' 33.97 0.5 4.4 0.0 0.0
  0.0 1.0 0.0]
 [21675 2012 6 22 2 2 19.0 93.0 1004.0 20.0 'cv' 0.89 0.1 4.5 0.0 0.0 0.0
  0.0 1.0]
 [21676 2012 6 22 3 2 20.0 100.0 1004.0 20.0 'NW' 1.79 0.0 0.0 0.0 1.0
  0.0 0.0 0.0]
 [21677 2012 6 22 4 2 19.0 93.0 1004.0 20.0 'NW' 4.92 0.2 0.2 0.0 1.0 0.0
  0.0 0.0]
 [21678 2012 6 22 5 2 19.0 93.0 1004.0 20.0 'SE' 0.89 0.0 0.0 0.0 0.0 0.0
  1.0 0
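
The same encode-then-back-fill idea can also be sketched with pandas alone. This is an alternative illustration on toy values, not the approach used in the cell above. Note that `pd.get_dummies` orders its columns by sorted category label, which differs from the hand-built NE, NW, SW, SE, cv order.

```python
import numpy as np
import pandas as pd

# Toy wind-direction column with one missing entry (hypothetical values).
wind = pd.Series(['cv', 'SE', np.nan, 'SE', 'NW'])

# One-hot encode; a missing direction becomes an all-zero row.
onehot = pd.get_dummies(wind, dtype=float)

# Mark those rows as missing, then copy from the next valid row.
onehot.loc[wind.isna(), :] = np.nan
onehot = onehot.bfill()
print(onehot)
```

Back-filling only works when a valid row follows the gap; a missing value in the very last row would stay NaN and would need a forward fill as well.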

## Splitting the dataset into the Training set and Test set

In [None]:
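from sklearn.model_selection import train_test_split
# Drop the raw 'cbwd' strings (already one-hot encoded) and use the US Post series as the target.
X_model = np.delete(X, features.index('cbwd'), axis=1).astype(float)
X_train, X_test, y_train, y_test = train_test_split(X_model, np.ravel(y['PM_US Post']), test_size = 0.2, random_state = 0)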

## Training the Multiple Linear Regression model on the Training set

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predicting the Test set results

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

## Evaluating the Model Performance

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
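
For reference, the coefficient of determination that `r2_score` reports is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal hand computation on made-up values (not results from this model) looks like this:

```python
import numpy as np

# Hypothetical true and predicted PM values (not from the dataset).
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))
```

A value of 1 means the predictions match exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse.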