### Project 1

Project description:
- Read the data into a Jupyter notebook and use pandas to import it into a data frame.
- Preprocess the data: explore it, address missing data and categorical data (if there is any), and scale the features. Justify the type of scaling used in this project.
- Train on your dataset using all the linear regression models you've learned so far. If a model has scaling parameter(s), use Grid Search to find the best values. Use plots and graphs to get a better view of the results.
- Then use cross-validation to find the average training and testing scores.
- Your submission should include at least the following regression models: KNN regressor, linear regression, Ridge, Lasso, polynomial regression, and SVM, both simple and with kernels.
- Finally, find the best regressor for this dataset, train it on the entire dataset using the best parameters, and predict the market price for the test_set.
- Submit an IPython notebook. Use markdown to provide an inline report for this project.

##### <font color = 'red'> Important note: All group members should participate in completing this project. This includes coding, preparing the report, and testing the models. </font>

# Importing Packages

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
# matplotlib is needed for the plt.plot calls further down
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Loading Data

In [None]:
#Loading Dataset for Building the Model
data = pd.read_csv('bitcoin_dataset.csv')
#Loading the Testing Dataset for Prediction
test = pd.read_csv('test_set.csv')
data.head()

# Modifying Date Format and Drop Null Values

In [None]:
#Changing Date from string format to datetime format
thelist=[(datetime.strptime(data['Date'][i][0:-5], '%m/%d/%Y')) for i in data.index]
df = pd.Series( (v for v in thelist) )
data["Date"]=df
#Dropping the Null Values
data=data.dropna()
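
The list comprehension above parses one row at a time. A vectorized alternative is sketched below on a small made-up series. It assumes the raw strings look like `2/17/2010 0:00`, so that dropping the last five characters leaves a `%m/%d/%Y` date; the sample values and the names `raw_dates` and `parsed_dates` are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values; the real 'Date' column is assumed to follow this pattern.
raw_dates = pd.Series(["2/17/2010 0:00", "12/31/2017 0:00"])

# Drop the trailing " 0:00" (five characters) and parse the remainder in one call.
parsed_dates = pd.to_datetime(raw_dates.str[:-5], format="%m/%d/%Y")
print(parsed_dates)
```

If the assumption holds, the same expression could be applied to `data['Date']` directly, which also keeps the result aligned with the frame's index.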

# Variation of Bitcoin Price with Time

In [None]:
plt.plot(data['Date'],data['btc_market_price'])
plt.xlabel("Date")
plt.ylabel("Price")
plt.title("Variation of Bitcoin Price over Time")
plt.show()

# Correlation between Variables

In [None]:
data.corr()

# Removing the Independent Variables with High Correlation

In [None]:
#Important columns for Regression
column=['btc_total_bitcoins', 'btc_market_cap','btc_n_orphaned_blocks','btc_median_confirmation_time',
       'btc_cost_per_transaction','btc_output_volume', 'btc_estimated_transaction_volume']
# Filtered Data for Training the model
Xdata=data[column]
# Filtered Dataset for Testing the model
Tdata=test[column]
# Target Variable
Ydata=data['btc_market_price']
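
The column list above is picked by hand from the correlation table. One way to make that choice reproducible is to list the feature pairs whose absolute correlation exceeds a cutoff. The sketch below does this on synthetic data; the helper name `highly_correlated_pairs`, the 0.9 threshold, and the demo frame are assumptions for illustration, not part of the project.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic demo: 'b' is almost a multiple of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
pairs = highly_correlated_pairs(demo)
print(pairs)
```

From each flagged pair, one column can then be dropped before modelling.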

### Variation of Dependent Variable with the change in Independent Variables

In [None]:
for i in column:
    # Scatter avoids joining unsorted feature values with zig-zag lines
    plt.scatter(data[i],data['btc_market_price'])
    plt.xlabel(i)
    plt.ylabel("Price")
    plt.title("Variation of Bitcoin")
    plt.show()

# Splitting Data for Training and Testing 75/25

In [None]:
x_train, x_test, y_train, y_test = train_test_split(Xdata,Ydata,random_state=111)
print("Shape of Training data\n", "Independent Variables : ",x_train.shape,"Dependent Variables : ",y_train.shape)
print("Shape of Testing data\n", "Independent Variables : ",x_test.shape," Dependent Variables : ",y_test.shape)

# Scaling Data

In [None]:
scaler=MinMaxScaler()
# Fitting the scaler and scaling the Data for Training the model (a single fit is enough)
x_scaled=scaler.fit_transform(Xdata)
# Scaling the Testing Data to fit for prediction
t_scaled=scaler.transform(Tdata)
# Splitting the Scaled Data 
x_train_scaled, x_test_scaled, y_train_scaled, y_test_scaled = train_test_split(x_scaled,Ydata,random_state=111)
print("Shape of Training data\n", "Independent Variables : ",x_train_scaled.shape,"Dependent Variables : ",y_train_scaled.shape)
print("Shape of Testing data\n", "Independent Variables : ",x_test_scaled.shape," Dependent Variables : ",y_test_scaled.shape)
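
The cell above fits the scaler on all of `Xdata` before splitting, so the held-out rows influence the scaling. A common way to avoid this is to put the scaler and the model in a scikit-learn `Pipeline`, so that the scaler is refitted on the training part of every cross-validation fold. The sketch below uses synthetic data; the names `X_demo`, `y_demo`, `pipe`, and `search`, and the range of `k`, are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(111)
X_demo = rng.uniform(0, 1000, size=(200, 3))
y_demo = 0.5 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=111)

# The scaler is a pipeline step, so it only ever sees training rows.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsRegressor())])
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": range(1, 21)}, cv=5)
search.fit(X_tr, y_tr)

test_r2 = search.score(X_te, y_te)
print(search.best_params_, round(test_r2, 4))
```

The fitted `search` object can predict on unscaled rows directly, because the pipeline applies the scaling itself.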

# KNN Regressor

KNN Regression on Unscaled Data

In [None]:
# Defining the KNN Regressor
knnreg=KNeighborsRegressor()
# Defining the number of neighbors for the model
k_value={'n_neighbors':range(1,100)}
# Defining the GridSearchCV for best accuracy using the different values of n
grid=GridSearchCV(knnreg,param_grid=k_value)
# Fitting the model
grid.fit(x_train,y_train)
print("The best training score is : ",'%.4f' %grid.score(x_train,y_train))
print("The best test score is : ",'%.4f' %grid.score(x_test,y_test))
# Printing the Best parameters and Best estimator
print(grid.best_params_)
print(grid.best_estimator_)

In [None]:
print("Mean Cross Validation Score of KNN on Unscaled Data :", 
      '%.4f' %cross_val_score(KNeighborsRegressor(n_neighbors=2), x_train, y_train, cv = 10).mean())

KNN Regression on Scaled Data

In [None]:
knnreg=KNeighborsRegressor()
# Defining the range of K values
k_value={'n_neighbors':range(1,100)}
gridscaled=GridSearchCV(knnreg,param_grid=k_value)
gridscaled.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %gridscaled.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %gridscaled.score(x_test_scaled,y_test_scaled)," with ", gridscaled.best_params_)
# Printing the Best parameters and Best estimator
print(gridscaled.best_params_)
print(gridscaled.best_estimator_)

In [None]:
print("Mean Cross Validation Score of KNN on Scaled Data :", 
      '%.4f' %cross_val_score(KNeighborsRegressor(n_neighbors=3), x_train_scaled, y_train_scaled, cv = 10).mean())

# Linear Regression

Linear Regression on Unscaled Data

In [None]:
# Defining the Linear Regressor and Fitting the model on unscaled data
linreg=LinearRegression().fit(x_train,y_train)
print("The best training score is : ",'%.4f' %linreg.score(x_train,y_train))
print("The best test score is : ",'%.4f' %linreg.score(x_test,y_test))

In [None]:
print("Mean Cross Validation Score of Linear Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train, y_train, cv = 10).mean())

Linear Regression on Scaled Data

In [None]:
# Defining the Linear Regressor and Fitting the model on scaled data
linregscaled=LinearRegression().fit(x_train_scaled,y_train_scaled)
print("The best training score is : ",'%.4f' %linregscaled.score(x_train_scaled,y_train_scaled))
print("The best test score is : ",'%.4f' %linregscaled.score(x_test_scaled,y_test_scaled))

In [None]:
print("Mean Cross Validation Score of Linear Regression on Scaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train_scaled, y_train_scaled, cv = 10).mean())

# Ridge Regression

Ridge Regression on Unscaled Data

In [None]:
# Defining the Ridge Regressor
linridge=Ridge()
alpha={'alpha':[0.1,1,5,10,15,20,25]}
gridridge=GridSearchCV(linridge,param_grid=alpha)
# Fitting the model on Unscaled data
gridridge.fit(x_train,y_train)
print("The training score is : ",'%.4f' %gridridge.score(x_train,y_train))
print("The test score is : ",'%.4f' %gridridge.score(x_test,y_test)," with ", gridridge.best_params_)
# Printing the Best parameters and Best estimator
print(gridridge.best_params_)
print(gridridge.best_estimator_)

In [None]:
print("Mean Cross Validation Score of Ridge Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(Ridge(alpha=25), x_train, y_train, cv = 10).mean())

In [None]:
# Defining the RidgeCv
linridge=RidgeCV()
# Fitting the model 
linridge.fit(x_train,y_train)
print("The training score is : ",'%.4f' %linridge.score(x_train,y_train))
print("The test score is : ",'%.4f' %linridge.score(x_test,y_test))
# Intercepts and Coefficients in the regression
print(linridge.intercept_,linridge.coef_)
print("Number of non zero Coef : ",np.sum(linridge.coef_ != 0))
print("Number of Important Coef : ",np.sum(abs(linridge.coef_) > 1))
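
Called with no arguments, `RidgeCV()` only tries scikit-learn's default candidates, `alphas=(0.1, 1.0, 10.0)`, which differ from the grid used with `GridSearchCV` above. The sketch below passes the same grid explicitly and reads the selected value from `alpha_`. The data and the names `X_demo`, `y_demo`, and `ridge_cv` are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data with one irrelevant feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Same candidate grid as the GridSearchCV cell, passed explicitly.
alphas = [0.1, 1, 5, 10, 15, 20, 25]
ridge_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)

print("Selected alpha:", ridge_cv.alpha_)
print("Coefficients:", ridge_cv.coef_)
```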

Ridge Regression on Scaled Data

In [None]:
linridgescaled=Ridge()
alpha={'alpha':[0.1,1,5,10,15,20,25]}
gridridgescaled=GridSearchCV(linridgescaled,param_grid=alpha)
# Fitting the model
gridridgescaled.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %gridridgescaled.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %gridridgescaled.score(x_test_scaled,y_test_scaled)," with ", gridridgescaled.best_params_)
# Printing the Best parameters and Best estimator
print(gridridgescaled.best_params_)
print(gridridgescaled.best_estimator_)

In [None]:
print("Mean Cross Validation Score of Ridge Regression on Scaled Data :", 
      '%.4f' %cross_val_score(Ridge(alpha=0.1), x_train_scaled, y_train_scaled, cv = 10).mean())

In [None]:
linridgescaled=RidgeCV()
# Fitting the model on Scaled data
linridgescaled.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %linridgescaled.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %linridgescaled.score(x_test_scaled,y_test_scaled))
# Number of non zero Coefficients
print("Number of non zero Coef : ",np.sum(linridgescaled.coef_ != 0))
# Number of important Coefficients
print("Number of Important Coef : ",np.sum(abs(linridgescaled.coef_) > 1))

# Lasso Regression

Lasso Regression on Unscaled Data

In [None]:
# Defining the Lasso Model
linlasso=Lasso()
param={'alpha':[0.1,1,5,10,15,20,25], 'max_iter':[10,100,1000,10000]}
gridlasso=GridSearchCV(linlasso,param_grid=param)
# Fitting the Model on Unscaled Data
gridlasso.fit(x_train,y_train)
print("The training score is : ",'%.4f' %gridlasso.score(x_train,y_train))
print("The test score is : ",'%.4f' %gridlasso.score(x_test,y_test)," with ", gridlasso.best_params_)
# Printing the Best parameters and Best estimator
print(gridlasso.best_params_)
print(gridlasso.best_estimator_)

In [None]:
print("Mean Cross Validation Score of Lasso Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(Lasso(alpha=0.1, max_iter=100), x_train, y_train, cv = 10).mean())

In [None]:
# Defining the Lasso Model with built-in Cross Validation
linlasso=LassoCV()
# Training the model
linlasso.fit(x_train,y_train)
print("The training score is : ",'%.4f' %linlasso.score(x_train,y_train))
print("The test score is : ",'%.4f' %linlasso.score(x_test,y_test))
print("Number of Important Coef : ",np.sum(linlasso.coef_ != 0))

Lasso Regression on Scaled Data

In [None]:
linlassoscaled=Lasso()
param={'alpha':[0.1,1,5,10,15,20,25], 'max_iter':[10,100,1000,10000]}
gridlassoscaled=GridSearchCV(linlassoscaled,param_grid=param)
# Training the model on Scaled Data
gridlassoscaled.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %gridlassoscaled.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %gridlassoscaled.score(x_test_scaled,y_test_scaled)," with ", gridlassoscaled.best_params_)
# Printing the Best parameters and Best estimator
print(gridlassoscaled.best_params_)
print(gridlassoscaled.best_estimator_)

In [None]:
print("Mean Cross Validation Score of Lasso Regression on Scaled Data :", 
      '%.4f' %cross_val_score(Lasso(alpha=0.1, max_iter=100), x_train_scaled, y_train_scaled, cv = 10).mean())

In [None]:
linlassoscaled=LassoCV()
# Fitting the model
linlassoscaled.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %linlassoscaled.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %linlassoscaled.score(x_test_scaled,y_test_scaled))
print("Number of non zero Coef : ",np.sum(linlassoscaled.coef_ != 0))
print("Number of Important Coef : ",np.sum(abs(linlassoscaled.coef_) > 1))

# Polynomial Regression

Polynomial Regression on Unscaled Data

In [None]:
# Defining a Polynomial Model with degree=2
poly=PolynomialFeatures(degree=2)
# Fitting the training Data with polynomial Features
x_poly=poly.fit_transform(Xdata)
# Fixed random_state keeps this split reproducible, like the earlier splits
x_train_poly, x_test_poly, y_train_poly, y_test_poly = train_test_split(x_poly,Ydata,random_state=111)
linregpoly=LinearRegression().fit(x_train_poly,y_train_poly)
print("The training score is : ",'%.4f' %linregpoly.score(x_train_poly,y_train_poly))
print("The test score is : ",'%.4f' %linregpoly.score(x_test_poly,y_test_poly))

In [None]:
print("Mean Cross Validation Score of Polynomial Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train_poly, y_train_poly, cv = 10).mean())

Polynomial Regression on Scaled Data

In [None]:
poly=PolynomialFeatures(degree=2)
# Fitting the polynomial Features on Scaled Data
x_poly_scaled=poly.fit_transform(x_scaled)
t_poly_scaled=poly.transform(t_scaled)
# Fixed random_state keeps this split reproducible, like the earlier splits
x_train_poly_scaled, x_test_poly_scaled, y_train_poly_scaled, y_test_poly_scaled = train_test_split(x_poly_scaled,Ydata,random_state=111)
# Training the polynomial Regression using Linear Regression
linregpolyscaled=LinearRegression().fit(x_train_poly_scaled,y_train_poly_scaled)
print("The training score is : ",'%.4f' %linregpolyscaled.score(x_train_poly_scaled,y_train_poly_scaled))
print("The test score is : ",'%.4f' %linregpolyscaled.score(x_test_poly_scaled,y_test_poly_scaled))

In [None]:
print("Mean Cross Validation Score of Polynomial Regression on Scaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train_poly_scaled, y_train_poly_scaled, cv = 10).mean())
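
The polynomial cells fix `degree=2` by hand. Since the degree is itself a tunable parameter, it can be searched in the same way as `alpha` or `n_neighbors`. The sketch below grid-searches the degree inside a pipeline on synthetic quadratic data; the candidate degrees and the names `X_demo`, `y_demo`, `poly_pipe`, and `degree_search` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic target with a squared term and an interaction term.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(150, 2))
y_demo = (1.5 * X_demo[:, 0] ** 2
          - X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.1, size=150))

poly_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("linreg", LinearRegression()),
])

# 'poly__degree' addresses the degree parameter of the 'poly' step.
degree_search = GridSearchCV(poly_pipe, param_grid={"poly__degree": [1, 2, 3]}, cv=5)
degree_search.fit(X_demo, y_demo)
print(degree_search.best_params_, round(degree_search.best_score_, 4))
```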

# SVM

### SVM with Default Settings

In [None]:
# Defining the Support Vector Regressor with default settings (scikit-learn's SVR uses the RBF kernel by default) and training the model
SVreg=SVR().fit(x_train_scaled,y_train_scaled)
print("The best training score is : ",'%.4f' %SVreg.score(x_train_scaled,y_train_scaled))
print("The best test score is : ",'%.4f' %SVreg.score(x_test_scaled,y_test_scaled))

In [None]:
print("Mean Cross Validation Score of SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(), x_train_scaled, y_train_scaled, cv = 10).mean())

### SVM with Kernel

In [None]:
param={'epsilon':[0.1,0.5,1,2,5,10,15,20,25,100], 'C':[1,5,10,20,50,100,1000,10000]}
# Defining the Support vector Regressor with Linear Kernel
clf=SVR(kernel='linear')
gridsvr=GridSearchCV(clf,param)
# Training the model
gridsvr.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %gridsvr.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %gridsvr.score(x_test_scaled,y_test_scaled))
# Printing the Best parameters and Best estimator
print(gridsvr.best_params_)
print(gridsvr.best_estimator_)

In [None]:
print("Mean Cross Validation Score of Linear SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(kernel='linear', epsilon=20, C=10000), x_train_scaled, y_train_scaled, cv = 10).mean())

In [None]:
param={'epsilon':[0.1,0.5,1,2,5,10,50,100,500], 'C':[1,5,10,20,50,100,1000]}
# Defining the Support vector Regressor with RBF Kernel
clfrbf=SVR(kernel='rbf')
gridsvrrbf=GridSearchCV(clfrbf,param)
# Training the model
gridsvrrbf.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %gridsvrrbf.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %gridsvrrbf.score(x_test_scaled,y_test_scaled))
# Printing the Best parameters and Best estimator
print(gridsvrrbf.best_params_)
print(gridsvrrbf.best_estimator_)

In [None]:
print("Mean Cross Validation Score of RBF SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(kernel='rbf', epsilon=100, C=1000), x_train_scaled, y_train_scaled, cv = 10).mean())

In [None]:
param={'epsilon':[0.1,1,10,100,400,500], 'C':[1,5,10,20,50,100,1000,10000]}
# Defining the Support vector Regressor with Polynomial Kernel
clfpoly=SVR(kernel='poly')
gridsvrpoly=GridSearchCV(clfpoly,param)
# Training the model
gridsvrpoly.fit(x_train_scaled,y_train_scaled)
print("The training score is : ",'%.4f' %gridsvrpoly.score(x_train_scaled,y_train_scaled))
print("The test score is : ",'%.4f' %gridsvrpoly.score(x_test_scaled,y_test_scaled))
# Printing the Best parameters and Best estimator
print(gridsvrpoly.best_params_)
print(gridsvrpoly.best_estimator_)

In [None]:
print("Mean Cross Validation Score of Polynomial SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(kernel='poly', epsilon=400, C=10000), x_train_scaled, y_train_scaled, cv = 10).mean())

# Cross Validation Scores

In [None]:
print("Mean Cross Validation Score of KNN on Unscaled Data :", 
      '%.4f' %cross_val_score(KNeighborsRegressor(n_neighbors=2), x_train, y_train, cv = 10).mean())
print("Mean Cross Validation Score of KNN on Scaled Data :", 
      '%.4f' %cross_val_score(KNeighborsRegressor(n_neighbors=3), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of Linear Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train, y_train, cv = 10).mean())
print("Mean Cross Validation Score of Linear Regression on Scaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of Ridge Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(Ridge(alpha=25), x_train, y_train, cv = 10).mean())
print("Mean Cross Validation Score of Ridge Regression on Scaled Data :", 
      '%.4f' %cross_val_score(Ridge(alpha=0.1), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of Lasso Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(Lasso(alpha=0.1, max_iter=100), x_train, y_train, cv = 10).mean())
print("Mean Cross Validation Score of Lasso Regression on Scaled Data :", 
      '%.4f' %cross_val_score(Lasso(alpha=0.1, max_iter=100), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of Polynomial Regression on Unscaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train_poly, y_train_poly, cv = 10).mean())
print("Mean Cross Validation Score of Polynomial Regression on Scaled Data :", 
      '%.4f' %cross_val_score(LinearRegression(), x_train_poly_scaled, y_train_poly_scaled, cv = 10).mean())
print("Mean Cross Validation Score of SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of Linear SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(kernel='linear', epsilon=20, C=10000), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of RBF SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(kernel='rbf', epsilon=100, C=1000), x_train_scaled, y_train_scaled, cv = 10).mean())
print("Mean Cross Validation Score of Polynomial SVR on Scaled Data :", 
      '%.4f' %cross_val_score(SVR(kernel='poly', epsilon=400, C=10000), x_train_scaled, y_train_scaled, cv = 10).mean())
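
The project brief asks for both average training and average testing scores, whereas `cross_val_score` reports only the held-out score. A possible way to get both, and to collect them in one table, is `cross_validate` with `return_train_score=True`. The sketch below runs on synthetic data with three of the models; the data, the `models` dictionary, and the column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=120)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
}

rows = []
for name, model in models.items():
    # return_train_score=True adds the per-fold training R^2 to the result.
    res = cross_validate(model, X_demo, y_demo, cv=10, return_train_score=True)
    rows.append({
        "model": name,
        "mean_train_r2": res["train_score"].mean(),
        "mean_test_r2": res["test_score"].mean(),
    })

summary = (pd.DataFrame(rows)
           .set_index("model")
           .sort_values("mean_test_r2", ascending=False))
print(summary.round(4))
```

Sorting by the held-out score puts the strongest candidate in the first row, which makes the ranking used in the next section explicit.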

# Predicting Testing Data with Top 6 Models

### 1. Linear Polynomial Regression

In [None]:
# Predicting the Test Data with a Model with Maximum Accuracy
linregpolyscaled.predict(t_poly_scaled)

### 2. Lasso Regression

In [None]:
linlassoscaled.predict(t_scaled)

### 3. Ridge Regression

In [None]:
linridgescaled.predict(t_scaled)

### 4. Linear Regression

In [None]:
linregscaled.predict(t_scaled)

### 5. KNN

In [None]:
knn=KNeighborsRegressor(n_neighbors=2)
knn=knn.fit(x_train,y_train)
knn.predict(Tdata)

### 6. Linear SVR

In [None]:
linsvr=SVR(kernel='linear', epsilon=20, C=10000)
linsvr=linsvr.fit(x_train_scaled,y_train_scaled)
linsvr.predict(t_scaled)
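
The project brief asks for the chosen model to be retrained on the entire dataset before predicting, while the cells above predict with models fitted on the training split only. The sketch below shows one way to do the final refit, using a pipeline so that scaling and polynomial expansion are refitted on all labelled rows together. The arrays `X_all`, `y_all`, and `X_new` are synthetic placeholders for `Xdata`, `Ydata`, and `Tdata`, and the degree-2 polynomial model is only an assumed choice of best model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic placeholders for the labelled data and the unlabelled test set.
rng = np.random.default_rng(0)
X_all = rng.uniform(0, 10, size=(100, 3))
y_all = X_all[:, 0] ** 2 + 2 * X_all[:, 1] + rng.normal(scale=0.1, size=100)
X_new = rng.uniform(0, 10, size=(5, 3))

# Scaling, feature expansion, and regression are fitted together on all rows.
final_model = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2),
    LinearRegression(),
)
final_model.fit(X_all, y_all)

predictions = final_model.predict(X_new)
print(predictions)
```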