In [None]:
import pandas as pd
import numpy as np

# If you are on a personal computer you may need to pip install pydataset
from pydataset import data

from sklearn.model_selection import train_test_split

# Each of these is a regression tool: AdaBoost, GradientBoost, SVR, and LinearRegression
# These aren't the only ones that exist, but they are popular and easy to use!
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

Some information about the data we are gathering:

Daily readings of the following air quality values for May 1, 1973 (a Tuesday)
to September 30, 1973.

  * `Ozone`: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island 

  * `Solar.R`: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park 

  * `Wind`: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport 

  * `Temp`: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

In [None]:
air_df = data("airquality")

In [None]:
air_df.describe()

Hmm, it looks like quite a bit of data is missing for `Ozone`, and some for `Solar.R`. Let's replace the NaNs with the column mean for now. This may not be the best approach: later we could replace each NaN with values from days close to it in Month/Day instead of taking the overall average, although with only 153 rows the Ozone NaNs might show up in streaks. Maybe we ought to drop the columns outright, since trying to infer the missing values might be too bold.

This is a common question you will encounter in any Data Science or Machine Learning project you work with: **Our data is not perfect, what should we do to fix it?** The answer is usually: well, it depends...

Good data cleaning can often make or break your project.
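
To make the alternatives to mean-filling mentioned above more concrete, here is a minimal sketch of two of them: estimating a NaN from neighboring rows and dropping incomplete rows. The `toy_df` table and its values are made up for illustration and are not the real `airquality` data; linear interpolation is only one possible way to use "nearby" values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the air-quality table (values are invented for illustration).
toy_df = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan, 28.0],
    "Solar.R": [190.0, 118.0, np.nan, 313.0, 299.0],
    "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
})

# Count the missing values in each column before deciding what to do.
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Alternative 1: estimate each NaN from the rows around it.
# Rows are assumed to be in date order, so neighbors are nearby days.
interpolated = toy_df.interpolate(method="linear", limit_direction="both")
print(interpolated)

# Alternative 2: drop every row that contains at least one NaN.
dropped = toy_df.dropna()
print(len(dropped), "of", len(toy_df), "rows remain after dropping")
```

Interpolation keeps every row but invents values, while dropping keeps only measured values but shrinks an already small dataset. Which trade-off is acceptable depends on the project.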

In [None]:
# Try reading the docs to figure out how to use df.fillna
air_df.fillna?

In [None]:
# Replace the NaNs with the mean.
#air_df = air_df.fillna(# Put appropriate parameters here!!)

In [None]:
# Verify that we fixed all of our NaNs
air_df.describe()

In [None]:
# We want to regress on the Ozone value, so we can't include it in our training features.
trainX, testX, trainY, testY = train_test_split(air_df.drop("Ozone", axis=1), air_df.Ozone, test_size=.2, train_size=.8)

# Warning! These won't work if we have any NaNs in our data!
lr = LinearRegression()
lr.fit(trainX, trainY)
lr_acc = lr.score(testX, testY)

abr = AdaBoostRegressor()
abr.fit(trainX, trainY)
abr_acc = abr.score(testX, testY)

print("Linear Regression got R^2 of", lr_acc)
print("AdaBoost Regression got R^2 of", abr_acc)

We used Linear Regression and AdaBoost in this tutorial. How do Gradient Boost and SVR (Support Vector Regression) perform? Better, or worse?
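
To judge "better" or "worse", it helps to know what the score means. For scikit-learn regressors, `.score` reports $R^2$, which compares the model's squared error to that of always predicting the mean. The sketch below computes it by hand with NumPy; the helper name `r_squared` and the sample numbers are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                  # predictions match exactly
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
reversed_guess = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean

print(perfect, mean_only, reversed_guess)
```

A score of 1 is a perfect fit, 0 is no better than predicting the mean, and a negative score means the model did worse than the mean.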

In [None]:
# Try using Gradient Boost and SVR here:

Which method performed best? Can you get a better $R^2$ using fewer columns? How can we find which columns matter and which ones do not? Consider looking up these Regression models to try and learn!

(Google something like: sklearn AdaBoostRegressor)
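
As one possible starting point for the "which columns matter" question, tree-based ensembles in scikit-learn expose a `feature_importances_` attribute after fitting. The sketch below uses synthetic data, and the column names `strong`, `weak`, and `noise` are invented for illustration. It is not the air-quality data, and importances are only a rough guide, not a definitive answer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends heavily on one column,
# slightly on another, and not at all on the third.
rng = np.random.default_rng(0)
n_rows = 200
X = pd.DataFrame({
    "strong": rng.normal(size=n_rows),
    "weak": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y = 5.0 * X["strong"] + 0.5 * X["weak"] + rng.normal(scale=0.1, size=n_rows)

# Fit the model, then read off how much each column contributed to its splits.
model = GradientBoostingRegressor(random_state=0)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Columns with importance near zero are candidates to drop, but it is worth re-checking $R^2$ on held-out data after removing them.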

In [None]:
# Try performing regression using all 4 methods on a subset of the data!

This dataset is small enough that our $R^2$ changes noticeably from run to run, so we are going to set a random seed to keep our $R^2$ values consistent. Try using one of these seeds and see how your $R^2$ changes:

Seeds = [16325, 81438, 43289, 42382, 50947,  25083,  32385,  22261,  65884,
        54264,  76296, 1822, 49744]
        
Where do you put this seed number? Any ideas? If not, ask somebody!
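
If the idea of a seed is new, the sketch below illustrates the general concept using NumPy only: the same seed always produces the same "random" shuffle. It deliberately does not show where the seed goes in scikit-learn, and the variable names are made up for illustration.

```python
import numpy as np

seed = 16325  # the first seed from the list above

# Two generators created with the same seed shuffle in exactly the same way.
first_shuffle = np.random.default_rng(seed).permutation(10)
second_shuffle = np.random.default_rng(seed).permutation(10)

# A generator with a different seed will almost certainly shuffle differently.
other_shuffle = np.random.default_rng(81438).permutation(10)

print(first_shuffle)
print(second_shuffle)
print(other_shuffle)
print("Same seed, same order:", np.array_equal(first_shuffle, second_shuffle))
```

Splitting data into train and test sets relies on this kind of shuffle, which is why fixing the seed makes the split, and therefore the $R^2$, repeatable.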