In this practice notebook, we are going to use the Ozone (airquality) dataset. The data dictionary for this dataset can be found at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/airquality.html

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.formula.api as smf

In [2]:
url = "https://raw.githubusercontent.com/ga-students/DS-SF-24/master/Data/ozone.csv"
OzoneData = pd.read_csv(url)

#### Explore the dataset and decide which variables suffer from missing data

In [3]:
OzoneData.describe()



Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
count,116.0,146.0,153.0,153.0,153.0,153.0
mean,42.12931,185.931507,9.957516,77.882353,6.993464,15.803922
std,32.987885,90.058422,3.523001,9.46527,1.416522,8.86452
min,1.0,7.0,1.7,56.0,5.0,1.0
25%,,,7.4,72.0,6.0,8.0
50%,,,9.7,79.0,7.0,16.0
75%,,,11.5,85.0,8.0,23.0
max,168.0,334.0,20.7,97.0,9.0,31.0


In [4]:
len(OzoneData)

153

Answer: Ozone and Solar.R have missing values. Their counts (116 and 146) are below the 153 rows in the dataset, while every other column has a count of 153.
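
A more direct check than comparing counts in describe() is to count the nulls per column. The sketch below uses a small made-up dataframe (the name `toy` and its values are illustrative, not taken from the Ozone data):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and the same kind of gaps as the Ozone data
toy = pd.DataFrame({
    "Ozone":   [41.0, np.nan, 12.0, np.nan],
    "Solar.R": [190.0, 118.0, np.nan, 313.0],
    "Wind":    [7.4, 8.0, 12.6, 11.5],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

The same call on the real data would be `OzoneData.isnull().sum()`.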

#### Let's drop rows that have missing values in all the columns you identified above

Hint: in dropna(), if you set how = 'all', it will only drop rows that are missing values in all of the variables you list in subset. If you want to get rid of any row that is missing a value in at least one of the variables you specify, set how = 'any' instead.

df.dropna(how = 'all',subset = ['Var1','Var2','Var3'],inplace = True)

The above code checks, for each row of df, whether all 3 variables listed in subset are missing; a row is dropped only if all of them are missing.
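
To see the difference between the two settings, here is a minimal sketch on a made-up dataframe (the values and the names `dropped_all` and `dropped_any` are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Var1": [1.0, np.nan, np.nan],
    "Var2": [2.0, 5.0, np.nan],
    "Var3": [3.0, 6.0, 9.0],
})

# how='all': drop a row only if Var1 AND Var2 are both missing (row 2)
dropped_all = toy.dropna(how="all", subset=["Var1", "Var2"])

# how='any': drop a row if Var1 OR Var2 is missing (rows 1 and 2)
dropped_any = toy.dropna(how="any", subset=["Var1", "Var2"])

print(list(dropped_all.index))
print(list(dropped_any.index))
```

Without inplace = True, dropna() returns a new dataframe and leaves the original untouched.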

In [5]:
OzoneData.dropna(how = 'all',subset = ['Ozone', 'Solar.R'], inplace = True)

In [6]:
OzoneData.describe()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
count,116.0,146.0,151.0,151.0,151.0,151.0
mean,42.12931,185.931507,9.941722,78.165563,7.019868,15.801325
std,32.987885,90.058422,3.524984,9.198138,1.406984,8.832531
min,1.0,7.0,1.7,57.0,5.0,1.0
25%,,,7.4,73.0,6.0,8.0
50%,,,9.7,79.0,7.0,16.0
75%,,,11.5,85.0,8.0,23.0
max,168.0,334.0,20.7,97.0,9.0,31.0


In [7]:
len(OzoneData)

151

#### Spoiler! If everything is going according to plan, you should be left with 151 observations. Also, it seemed like the first two variables had missing values. Now please make a copy of your dataframe into a dataframe named OzoneImputeMean. Then use the mean of each variable to fill in the missing values in OzoneImputeMean

In [8]:
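# Note: plain assignment does not copy. OzoneImputeMean and OzoneData are two
# names for the same dataframe, so the fills below also change OzoneData
# (the later OzoneData outputs show the imputed Solar.R value).
# Use OzoneData.copy() if you want an independent dataframe.
OzoneImputeMean = OzoneData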

OzoneImputeMean['Ozone'].fillna(value = np.mean(OzoneImputeMean['Ozone']), inplace = True)
OzoneImputeMean['Solar.R'].fillna(value = np.mean(OzoneImputeMean['Solar.R']), inplace = True)

OzoneImputeMean.head(6)

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
5,28.0,185.931507,14.9,66,5,6
6,23.0,299.0,8.6,65,5,7
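
The difference between assigning a dataframe to a new name and copying it matters for this step. The following is a small sketch on made-up data (the names `original`, `alias` and `independent` are illustrative), which also fills by assignment rather than with inplace = True on a selected column:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"Ozone": [10.0, np.nan, 30.0]})

alias = original               # second name for the same object
independent = original.copy()  # separate dataframe with its own data

# Mean imputation on the copy; Series.mean() skips NaN by default
independent["Ozone"] = independent["Ozone"].fillna(independent["Ozone"].mean())

print(alias is original)                  # same object
print(original["Ozone"].isnull().sum())   # original still has its gap
print(independent["Ozone"].tolist())      # gap replaced by the mean of 10 and 30
```

With a real copy, the original keeps its missing values, so later regression-based imputation would still have something to fill.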


In [9]:
#Check if missing values are filled
OzoneImputeMean.describe()



Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
count,151.0,151.0,151.0,151.0,151.0,151.0
mean,42.12931,185.931507,9.941722,78.165563,7.019868,15.801325
std,28.884028,88.544727,3.524984,9.198138,1.406984,8.832531
min,1.0,7.0,1.7,57.0,5.0,1.0
25%,21.0,119.0,7.4,73.0,6.0,8.0
50%,42.12931,197.0,9.7,79.0,7.0,16.0
75%,46.5,257.0,11.5,85.0,8.0,23.0
max,168.0,334.0,20.7,97.0,9.0,31.0


#### Now it's time for imputing using linear regression lines

In [10]:
#### Before we start let's define dummy variables for variable Month - don't worry about day!

MonthDummy = pd.get_dummies(OzoneData.Month, prefix = 'Month')
# Drop one level (September) so it acts as the baseline category
del MonthDummy['Month_9']

OzoneData = pd.concat([OzoneData, MonthDummy], axis=1)

OzoneData.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day,Month_5,Month_6,Month_7,Month_8
0,41.0,190.0,7.4,67,5,1,1.0,0.0,0.0,0.0
1,36.0,118.0,8.0,72,5,2,1.0,0.0,0.0,0.0
2,12.0,149.0,12.6,74,5,3,1.0,0.0,0.0,0.0
3,18.0,313.0,11.5,62,5,4,1.0,0.0,0.0,0.0
5,28.0,185.931507,14.9,66,5,6,1.0,0.0,0.0,0.0
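
As a small illustration of what the dummy step produces, here is a sketch on a made-up Month column (the values and the name `baseline_dropped` are illustrative). After the baseline level is dropped, a row from the baseline month has zeros in all remaining dummy columns:

```python
import pandas as pd

months = pd.Series([5, 6, 9, 5], name="Month")

# One indicator column per distinct month: Month_5, Month_6, Month_9
dummies = pd.get_dummies(months, prefix="Month")

# Drop the September column so that September becomes the baseline
baseline_dropped = dummies.drop(columns=["Month_9"])

print(list(baseline_dropped.columns))
print(baseline_dropped)
```

Depending on the pandas version, the dummy columns may be shown as 0/1 or as False/True; both work as regression inputs once converted to numbers.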


In [11]:
# Now let's explore the correlation matrix
OzoneData.corr()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day,Month_5,Month_6,Month_7,Month_8
Ozone,1.0,0.30297,-0.534163,0.630583,0.151089,-0.011472,-0.281111,-0.06582,0.251524,0.264054
Solar.R,0.30297,1.0,-0.055581,0.27199,-0.073886,-0.14712,-0.023842,0.023896,0.17596,-0.073214
Wind,-0.534163,-0.055581,1.0,-0.466032,-0.175317,0.042365,0.237781,0.046054,-0.144638,-0.166105
Temp,0.630583,0.27199,-0.466032,1.0,0.397427,-0.136876,-0.637816,0.050753,0.318103,0.32168
Month,0.151089,-0.073886,-0.175317,0.397427,1.0,-0.007727,-0.702257,-0.362131,-0.007201,0.355246
Day,-0.011472,-0.14712,0.042365,-0.136876,-0.007727,1.0,0.011003,-0.017044,0.011471,0.011471
Month_5,-0.281111,-0.023842,0.237781,-0.637816,-0.702257,0.011003,1.0,-0.242766,-0.247805,-0.247805
Month_6,-0.06582,0.023896,0.046054,0.050753,-0.362131,-0.017044,-0.242766,1.0,-0.25308,-0.25308
Month_7,0.251524,0.17596,-0.144638,0.318103,-0.007201,0.011471,-0.247805,-0.25308,1.0,-0.258333
Month_8,0.264054,-0.073214,-0.166105,0.32168,0.355246,0.011471,-0.247805,-0.25308,-0.258333,1.0


#### Which variables seem best suited to explain Ozone? How about Solar.R?

Answer: Ozone has its strongest positive correlation with Temp (0.63) and its strongest negative correlation with Wind (-0.53).
Solar.R is not strongly correlated with any of the variables in this dataset; its largest correlations are with Ozone (0.30) and Temp (0.27).

#### Now let's use a regression model to predict Ozone. First drop NaN values in Ozone and save the result in OzoneDroppedValues_Ozone. Then fit a regression on the variables of interest and check the significance of your model. In the multi-class dummy variable case, if only a few of the dummy variables are not significant but the majority are, you should either drop all of them or keep all of them; otherwise, the choice of the base dummy becomes important. Use these variables ['Solar.R','Wind','Temp','Month_5','Month_6','Month_7','Month_8'] to predict.


In [12]:
OzoneDroppedValues_Ozone = OzoneData.dropna(subset = ['Ozone'])

X1 = OzoneDroppedValues_Ozone[['Solar.R', 'Wind', 'Temp', 'Month_5',
                              'Month_6', 'Month_7', 'Month_8']]
y1 = OzoneDroppedValues_Ozone['Ozone']

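# y1 and X1 are not columns of the dataframe, so the formula picks them up from
# the surrounding namespace; terms are labelled X1[0]..X1[6] in the column order of X1
lm1 = smf.ols(formula = 'y1 ~ X1', data = OzoneDroppedValues_Ozone).fit()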
print(lm1.pvalues)

Intercept    0.013671
X1[0]        0.015084
X1[1]        0.000006
X1[2]        0.000003
X1[3]        0.047828
X1[4]        0.648357
X1[5]        0.110642
X1[6]        0.043664
dtype: float64


In [13]:
lm = LinearRegression()

lm.fit(X1,y1)

X = OzoneData[['Solar.R', 'Wind', 'Temp', 'Month_5', 'Month_6', 'Month_7', 'Month_8']]


In [14]:
# now fill in null values of OzoneData['Ozone'] by predicted values

OzoneData['ozone_predict'] = lm.predict(X)
OzoneData['Ozone'].fillna(value = OzoneData['ozone_predict'], inplace = True)

OzoneData.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day,Month_5,Month_6,Month_7,Month_8,ozone_predict
0,41.0,190.0,7.4,67,5,1,1.0,0.0,0.0,0.0,37.902359
1,36.0,118.0,8.0,72,5,2,1.0,0.0,0.0,0.0,39.985348
2,12.0,149.0,12.6,74,5,3,1.0,0.0,0.0,0.0,32.795274
3,18.0,313.0,11.5,62,5,4,1.0,0.0,0.0,0.0,26.495151
5,28.0,185.931507,14.9,66,5,6,1.0,0.0,0.0,0.0,17.295052
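
The fit, predict and fill pattern used above can be sketched on made-up data. To keep the example dependency-light, this sketch swaps in NumPy's least-squares solver for LinearRegression; the column names `x`, `y` and `y_predict` are illustrative, and the data is constructed so that y = 2*x + 1 exactly:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [3.0, 5.0, np.nan, 9.0],
})

# Fit only on rows where the target is observed
known = toy.dropna(subset=["y"])
A_known = np.column_stack([np.ones(len(known)), known["x"].to_numpy()])
coef, *_ = np.linalg.lstsq(A_known, known["y"].to_numpy(), rcond=None)

# Predict for every row, then use predictions only where y is missing
A_all = np.column_stack([np.ones(len(toy)), toy["x"].to_numpy()])
toy["y_predict"] = A_all @ coef
toy["y"] = toy["y"].fillna(toy["y_predict"])

print(toy)
```

Observed values are left as they are; only the missing entry takes the predicted value.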


In [15]:
# Now repeat previous steps for Solar.R variable using ['Ozone','Wind','Temp']





In [16]:
# now fill in null values of OzoneData['Solar.R'] by predicted values


#### Now check your filled data - if your predicted values are more than maximum or less than minimum, replace them by max and min
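
One possible way to do this check is Series.clip, bounded by the minimum and maximum of the observed values. A minimal sketch with made-up numbers (the names `observed`, `predicted` and `clipped` are illustrative):

```python
import pandas as pd

observed = pd.Series([7.0, 150.0, 334.0])    # values that were actually measured
predicted = pd.Series([-12.5, 200.0, 410.0]) # model output, may leave the range

# Pull any prediction outside the observed range back to the nearest bound
clipped = predicted.clip(lower=observed.min(), upper=observed.max())

print(clipped.tolist())  # [7.0, 200.0, 334.0]
```

The bounds should come from the observed values only, computed before the predictions are filled in.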

#### Bonus: Repeat the above procedure, this time fill in missing values using regression with errors. 
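
One common reading of "regression with errors" is stochastic regression imputation: each prediction gets a random draw added, with spread matching the residuals of the fit. The sketch below assumes that reading and uses made-up data; the names (`toy`, `residual_std`, `noise`), the normal noise model and the fixed seed are all illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [3.1, 4.9, np.nan, 9.2, 10.8, np.nan],
})

# Fit a straight line on the rows where y is observed
known = toy.dropna(subset=["y"])
slope, intercept = np.polyfit(known["x"].to_numpy(), known["y"].to_numpy(), deg=1)

# Residual spread of the fit (ddof=2 for the two estimated parameters)
fitted = intercept + slope * known["x"]
residual_std = (known["y"] - fitted).std(ddof=2)

# Prediction plus a random error term for each missing row
missing = toy["y"].isnull()
noise = rng.normal(loc=0.0, scale=residual_std, size=missing.sum())
toy.loc[missing, "y"] = intercept + slope * toy.loc[missing, "x"] + noise

print(toy)
```

Compared with plain regression imputation, the added noise keeps the filled values from lying exactly on the regression line, which would otherwise understate the variance of the variable.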

In [17]:
url = "https://raw.githubusercontent.com/ga-students/DS-SF-24/master/Data/ozone.csv"
OzoneData = pd.read_csv(url)