
# Gradient Descent: Boston Housing

In this exercise, we use gradient-descent-based methods to predict Boston house prices. The 14 variables are:

* CRIM per capita crime rate by town

* ZN proportion of residential land zoned for lots over 25,000 sq.ft.

* INDUS proportion of non-retail business acres per town

* CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

* NOX nitric oxides concentration (parts per 10 million)

* RM average number of rooms per dwelling

* AGE proportion of owner-occupied units built prior to 1940

* DIS weighted distances to five Boston employment centres

* RAD index of accessibility to radial highways

* TAX full-value property-tax rate per $10,000

* PTRATIO pupil-teacher ratio by town

* B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

* LSTAT % lower status of the population

* MEDV Median value of owner-occupied homes in $1000’s


The target variable is **Median value of owner-occupied homes in $1000’s**.

Split the dataset into train and test sets. Build a regression model using **SGDRegressor** from sklearn. Compute the mean squared error on the test set. Describe how the loss changes as the number of epochs increases (Hint: set a larger verbose parameter).


## (1) Split the dataset into train and test sets. Build a regression model using **SGDRegressor** from sklearn. Compute the mean squared error on the test set. Describe how the loss changes as the number of epochs increases (Hint: set a larger verbose parameter).


In [3]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

# Load the 13 features and the MEDV target
data = load_boston()
X = data.data
y = data.target
feature_names = data.feature_names
target_name = 'MEDV'

# Hold out 10% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=99)

# Hint: Standardize features by removing the mean and scaling to unit variance
# (the scaler is fitted on the training set only, then applied to both sets)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print('Number of train = {}'.format(len(y_train)))
print('Number of test = {}'.format(len(y_test)))

# verbose > 0 prints the average loss after every epoch
reg = SGDRegressor(max_iter=50, tol=1e-3, verbose=10)
reg.fit(X_train, y_train)
y_train_prediction = reg.predict(X_train)
y_test_prediction = reg.predict(X_test)
print('Train MSE = {}'.format(mean_squared_error(y_train, y_train_prediction)))
print('Test MSE = {}'.format(mean_squared_error(y_test, y_test_prediction)))
print(reg.coef_)

Number of train = 455
Number of test = 51
-- Epoch 1
Norm: 4.86, NNZs: 13, Bias: 16.637025, T: 455, Avg. loss: 93.280399
Total training time: 0.00 seconds.
-- Epoch 2
Norm: 5.31, NNZs: 13, Bias: 20.191613, T: 910, Avg. loss: 21.477900
Total training time: 0.00 seconds.
-- Epoch 3
Norm: 5.63, NNZs: 13, Bias: 21.470702, T: 1365, Avg. loss: 14.063938
Total training time: 0.00 seconds.
-- Epoch 4
Norm: 5.49, NNZs: 13, Bias: 21.996813, T: 1820, Avg. loss: 12.670777
Total training time: 0.00 seconds.
-- Epoch 5
Norm: 5.71, NNZs: 13, Bias: 22.267478, T: 2275, Avg. loss: 12.359664
Total training time: 0.00 seconds.
-- Epoch 6
Norm: 5.84, NNZs: 13, Bias: 22.429453, T: 2730, Avg. loss: 12.108272
Total training time: 0.00 seconds.
-- Epoch 7
Norm: 5.96, NNZs: 13, Bias: 22.461638, T: 3185, Avg. loss: 11.977729
Total training time: 0.00 seconds.
-- Epoch 8
Norm: 5.99, NNZs: 13, Bias: 22.514320, T: 3640, Avg. loss: 11.957772
Total training time: 0.00 seconds.
-- Epoch 9
Norm: 6.02, NNZs: 13, Bias: 2
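
In the verbose log above, the average loss falls from 93.28 in epoch 1 to 14.06 in epoch 3, then flattens out near 12 by epoch 8. The same pattern can be recorded explicitly rather than read off the log. The sketch below is one way to do that, not the exercise's reference solution: it assumes synthetic data from `make_regression` as a stand-in for the housing data, and it calls `partial_fit` once per pass so that the training MSE can be stored after each epoch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with the same shape as the exercise (506 x 13)
X, y = make_regression(n_samples=506, n_features=13, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=99)

# Standardize with statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

reg = SGDRegressor(random_state=0)
rng = np.random.default_rng(0)
train_mse_per_epoch = []
for epoch in range(20):
    # Each partial_fit call performs one pass (epoch) over the samples it is given
    order = rng.permutation(len(y_train))
    reg.partial_fit(X_train_s[order], y_train[order])
    train_mse_per_epoch.append(mean_squared_error(y_train, reg.predict(X_train_s)))

test_mse = mean_squared_error(y_test, reg.predict(X_test_s))
for epoch, mse in enumerate(train_mse_per_epoch, start=1):
    print('Epoch {:2d}: train MSE = {:.3f}'.format(epoch, mse))
print('Test MSE = {:.3f}'.format(test_mse))
```

The number of epochs (20) and the noise level are arbitrary choices for illustration. Plotting `train_mse_per_epoch` against the epoch index would show the same shape as the log: a steep initial drop followed by a plateau.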

# Logistic Regression and Tree-based Methods: Breast Cancer detection

In this exercise, we use gradient-descent-based methods to classify breast tumors as malignant or benign. The 30 variables are the mean, standard error, and worst value of the following 10 metrics:

* radius

* texture

* perimeter

* area

* smoothness

* compactness

* concavity

* concave points

* symmetry

* fractal dimension


The goal is to predict whether the tumor is malignant or benign.

(1) Play with the dataset. Show the number of positives and negatives. Then split the dataset into train and test sets, where the test set is 10% of the data.

(2) Use logistic regression to train on the train set. Then evaluate model accuracy, F1 score for negatives, F1 score for positives, F1 macro, and F1 micro.

(3) Build a random forest model to train on the train set. Then evaluate the same metrics as in step (2).
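
The metrics named in steps (2) and (3) can be illustrated on a small hand-made example before touching the real data. The labels below are made up for illustration only. The sketch uses `f1_score` with `pos_label` to get the per-class scores and with `average` to get the macro and micro variants.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration: 3 negatives (0) and 5 positives (1)
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)             # 6 of 8 correct
f1_negative = f1_score(y_true, y_pred, pos_label=0)   # F1 treating class 0 as the target
f1_positive = f1_score(y_true, y_pred, pos_label=1)   # F1 treating class 1 as the target
f1_macro = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1
f1_micro = f1_score(y_true, y_pred, average='micro')  # F1 from pooled TP/FP/FN counts

print('Accuracy    = {:.4f}'.format(accuracy))     # 0.7500
print('F1 negative = {:.4f}'.format(f1_negative))  # 0.6667
print('F1 positive = {:.4f}'.format(f1_positive))  # 0.8000
print('F1 macro    = {:.4f}'.format(f1_macro))     # 0.7333
print('F1 micro    = {:.4f}'.format(f1_micro))     # 0.7500
```

For single-label classification, micro-averaged F1 equals accuracy, which is why both come out as 0.75 here. Macro F1 weights the two classes equally, so it is pulled down by the weaker class.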


## (1) Play with the dataset. Show the number of positives and negatives. Then split the dataset into train and test sets, where the test set is 10% of the data.


In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import preprocessing

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

# In this dataset, label 0 is malignant and label 1 is benign
print('Number of positives = {}'.format(np.sum(y == 1)))
print('Number of negatives = {}'.format(np.sum(y == 0)))

# Hold out 10% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=99)

# Standardize with statistics from the training set only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Report the mean of each raw (unscaled) feature
for i in range(X.shape[1]):
    print('Variable: Mean of {} = {}'.format(feature_names[i], np.mean(X[:, i])))

print('Number of train = {}'.format(len(y_train)))
print('Number of test = {}'.format(len(y_test)))

# insert your code here


Number of positives = 357
Number of negatives = 212
Variable: Mean of mean radius = 14.127291739894552
Variable: Mean of mean texture = 19.289648506151142
Variable: Mean of mean perimeter = 91.96903339191564
Variable: Mean of mean area = 654.8891036906855
Variable: Mean of mean smoothness = 0.0963602811950791
Variable: Mean of mean compactness = 0.10434098418277679
Variable: Mean of mean concavity = 0.0887993158172232
Variable: Mean of mean concave points = 0.04891914586994728
Variable: Mean of mean symmetry = 0.18116186291739894
Variable: Mean of mean fractal dimension = 0.06279760984182776
Variable: Mean of radius error = 0.40517205623901575
Variable: Mean of texture error = 1.2168534270650264
Variable: Mean of perimeter error = 2.8660592267135327
Variable: Mean of area error = 40.337079086116
Variable: Mean of smoothness error = 0.007040978910369069
Variable: Mean of compactness error = 0.025478138840070295
Variable: Mean of concavity error = 0.03189371634446397
Variable: Mean of co

## (2) Use logistic regression to train on the train set. Then evaluate model accuracy, F1 score for negatives, and F1 score for positives.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, precision_recall_fscore_support

# insert your code here
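
One possible way to complete step (2) is sketched below. It is not a reference solution. The block reloads and splits the data so that it runs on its own, using the same `test_size` and `random_state` as the cell in step (1). The value `max_iter=1000` and the `lr_metrics` dictionary are choices made for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same split and scaling as in step (1), repeated so the block is self-contained
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=99)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Assumption: max_iter is raised above the default to give the solver room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Label 0 is the negative class and label 1 the positive class, as in step (1)
lr_metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'f1_negative': f1_score(y_test, y_pred, pos_label=0),
    'f1_positive': f1_score(y_test, y_pred, pos_label=1),
    'f1_macro': f1_score(y_test, y_pred, average='macro'),
    'f1_micro': f1_score(y_test, y_pred, average='micro'),
}
for name, value in lr_metrics.items():
    print('{} = {:.4f}'.format(name, value))
```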

## (3) Build a random forest model to train on the train set. Then evaluate the same metrics as in step (2).


In [None]:
from sklearn.ensemble import RandomForestClassifier

# insert your code here
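
Step (3) could be approached along the following lines. This is a sketch, not a reference solution. The hyperparameters `n_estimators=100` and `random_state=0` are assumptions chosen for reproducibility, and the data is reloaded so that the block runs on its own. Tree-based models do not depend on feature scaling, so the scaler from step (1) is left out here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Same split as in step (1), repeated so the block is self-contained
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=99)

# Assumption: 100 trees and a fixed seed, chosen only for this illustration
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
y_pred_rf = forest.predict(X_test)

# Same metrics as in step (2); label 0 is negative, label 1 is positive
rf_metrics = {
    'accuracy': accuracy_score(y_test, y_pred_rf),
    'f1_negative': f1_score(y_test, y_pred_rf, pos_label=0),
    'f1_positive': f1_score(y_test, y_pred_rf, pos_label=1),
    'f1_macro': f1_score(y_test, y_pred_rf, average='macro'),
    'f1_micro': f1_score(y_test, y_pred_rf, average='micro'),
}
for name, value in rf_metrics.items():
    print('{} = {:.4f}'.format(name, value))
```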