## Mini-Project Part 3: Supervised Learning (1)
### Logistic Regression, KNN, Decision Tree, and Random Forest

In this part, we continue to use the dataset on adaptivity to online education. The goal of each model is to predict a student's `Adaptivity Level`.

In [2]:
# Encode variables using the part one code

import pandas as pd

data = pd.read_csv(r"C:\Users\roryq\Downloads\online_adapt.csv")

# Encode `Age` to integers, 1, 2, 3, 4, 5, 6.

age_mapper = {'26-30':6, '21-25':5, '16-20':4, '11-15':3, '6-10':2, '1-5':1}
age_t = data['Age'].replace(age_mapper)

# Encode `Network Type` to integers, 2, 3, 4.

net_mapper = {'2G':2, '3G':3, '4G':4}
net_t = data['Network Type'].replace(net_mapper)

# Encode `Class Duration` to integers, 0, 1, 2.

class_mapper = {'0':0, '1-3':1, '3-6':2}
class_t = data['Class Duration'].replace(class_mapper)

# Replace `Age`, `Network Type`, `Class Duration` by their corresponding numeric versions.

data['Age'] = age_t
data['Network Type'] = net_t
data['Class Duration'] = class_t

# One-hot encode the rest of the variables except for the response variable, `Adaptivity Level`.
y = data['Adaptivity Level']
data1 = pd.get_dummies(data.drop('Adaptivity Level', axis=1),dtype=int)


# Check data
data1.head(3)

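| | Age | Network Type | Class Duration | Gender_Boy | Gender_Girl | Education Level_College | Education Level_School | Education Level_University | Institution Type_Government | Institution Type_Non Government | ... | Financial Condition_Mid | Financial Condition_Poor | Financial Condition_Rich | Internet Type_Mobile Data | Internet Type_Wifi | Self Lms_No | Self Lms_Yes | Device_Computer | Device_Mobile | Device_Tab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 4 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 5 | 4 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 4 | 4 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |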


In [45]:
data1.shape

(1205, 26)
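
The ordinal encoding above can be checked on a small example. The sketch below uses a hypothetical three-row frame (`toy`) and `Series.map`, which returns `NaN` for any category missing from the mapper, so an incomplete mapper is easy to spot. It is an illustration under those assumptions, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads online_adapt.csv instead.
toy = pd.DataFrame({
    "Age": ["21-25", "16-20", "1-5"],
    "Network Type": ["4G", "3G", "2G"],
})

age_mapper = {'26-30': 6, '21-25': 5, '16-20': 4, '11-15': 3, '6-10': 2, '1-5': 1}
net_mapper = {'2G': 2, '3G': 3, '4G': 4}

# map() leaves NaN where a category is not in the mapper,
# whereas replace() keeps the original string unchanged.
encoded = pd.DataFrame({
    "Age": toy["Age"].map(age_mapper),
    "Network Type": toy["Network Type"].map(net_mapper),
})

# Count unmapped categories per column (all zeros means the mappers are complete).
unmapped = encoded.isna().sum()
print(encoded)
print(unmapped)
```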

## Regularized Logistic Regression
Now, fit a regularized logistic regression model on `data1`. 

Requirements:
1. Standardize all features of `data1`. 
2. Split data (80% training; 20% test) (set `random_state=100`)
3. Use `L2` penalty.
4. Use 5-fold cross-validation to find the optimal `C`.
5. Consider 20 values of the hyperparameter `C`.
6. Calculate the confusion matrix. 
7. Calculate the accuracy of the test set using the final model.
8. Calculate the accuracy for each class of `y`.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
import numpy as np

# Create the features and target used by the rest of the models in this part
# Target is the adaptivity level
# Features are all other variables in the data set, as described in the readme file
features = data1
target = y

# Standardize all features of the data with standard scaler
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# define your test split on data with 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(
    features_standardized, target, random_state=100, test_size= .2)

# Specify the logistic regression model
logistic_regression = LogisticRegressionCV(cv=5, # Use 5-fold cross-validation
    solver='lbfgs', 
    multi_class='multinomial', 
    penalty='l2',   # Use L2 penalty
    Cs=20,          # Check 20 values of the hyperparameter C
    random_state=100, n_jobs=-1)

# Fit the model on the training data
model_cv = logistic_regression.fit(X_train, y_train)

# Predict the classes of the test data
y_pred_cv = model_cv.predict(X_test)

# Accuracy on the test set
np.mean(y_pred_cv==y_test)

0.7095435684647303
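
When `Cs` is an integer, `LogisticRegressionCV` builds a logarithmic grid of that many values between 1e-4 and 1e4. A minimal sketch on synthetic data (the `make_classification` data set and the `demo` names are assumptions for illustration, not the notebook's data) shows how to inspect the grid and the selected value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the encoded survey features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=100)
X_demo = StandardScaler().fit_transform(X_demo)

demo_cv = LogisticRegressionCV(cv=5, Cs=20, max_iter=1000, random_state=100)
demo_cv.fit(X_demo, y_demo)

# Cs_ holds the candidate values; C_ holds the selected value(s).
print(demo_cv.Cs_)
print(demo_cv.C_)
```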

In [11]:
from sklearn.metrics import confusion_matrix

# Get labels to use for the confusion matrix
labels = np.unique(y_test)

# Create the confusion matrix from the predicted and test labels.
# The predictions are passed first, so rows are predicted labels
# and columns are true labels.
matrix= confusion_matrix(y_pred_cv, y_test, labels= labels)

# Show the matrix with labels
pd.DataFrame(matrix, index=labels, columns=labels)



| Predicted (rows) / True (columns) | High | Low | Moderate |
|---|---|---|---|
| High | 9 | 1 | 4 |
| Low | 7 | 64 | 24 |
| Moderate | 7 | 27 | 98 |


In [12]:
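# Print accuracy for each level of adaptability.
# Rows of `matrix` are predicted labels, so diagonal / row sums is the
# share of predictions for each level that were correct.
print("Accuracy", matrix.diagonal()/matrix.sum(axis=1))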

Accuracy [0.64285714 0.67368421 0.74242424]


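#### Accuracy for each level of adaptability
These values are the fraction of predictions for each level that were correct (diagonal divided by the row sums of the matrix above).
+ Accuracy for High is 0.643
+ Accuracy for Low is 0.674
+ Accuracy for Moderate is 0.742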
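
Because `confusion_matrix(y_pred_cv, y_test, ...)` receives the predictions first, the rows of the matrix above are predicted labels and the columns are true labels. Dividing the diagonal by the row sums therefore gives precision, while dividing by the column sums gives recall (per-class accuracy in the usual sense). The sketch below recomputes both from the matrix shown above:

```python
import numpy as np

labels = ["High", "Low", "Moderate"]
# Confusion matrix copied from the output above:
# rows are predicted labels, columns are true labels.
matrix = np.array([[9, 1, 4],
                   [7, 64, 24],
                   [7, 27, 98]])

correct = matrix.diagonal()
# Share of each predicted class that was right (precision).
precision = correct / matrix.sum(axis=1)
# Share of each true class that was recovered (recall).
recall = correct / matrix.sum(axis=0)
overall_accuracy = correct.sum() / matrix.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
print(f"overall accuracy={overall_accuracy:.3f}")
```

For High the two measures differ noticeably: 9 of 14 High predictions are correct, but only 9 of the 23 true High cases are found.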


## K Nearest Neighbors and Radius Nearest Neighbors
 
Requirements:
1. Split data (80% training; 20% test) (set `random_state=100`)
2. Use the min-max scaler to preprocess features.
3. Use 5-fold cross-validation in the grid-search.
4. Consider 10 different values of `K`, i.e. 1, 2, 3, ..., 10.
5. After you find the optimal value of `K`, use the entire training set to fit the KNN model and calculate the accuracy for the test set


In [13]:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

# set x and y for model
X = features
y = target



# Split data to training and testing sets and select size
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2,
                                                    random_state=100, stratify=target)
# select standardizer
standardizer = MinMaxScaler()

# Set KNN with classifier
knn = KNeighborsClassifier(n_jobs=-1) 


pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# consider 10 different values of K
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Create search with parameters and set 5 fold cross validation
classifier = GridSearchCV(
    pipe, search_space, cv=5, verbose=0).fit(X_train, y_train)

# Retrieve the best value of K found by the grid search
classifier.best_estimator_.get_params()["knn__n_neighbors"]

1

In [5]:
# Create knn model
knn_1 = KNeighborsClassifier(n_neighbors=1, n_jobs=-1)

# Fit the scaler on the training data only, then apply the same scaling to the test data
X_train_sc = standardizer.fit_transform(X_train)
X_test_sc = standardizer.transform(X_test)

# Fit the model on the training data
knn_1.fit(X_train_sc,y_train)

# Print accuracy of model
print("Test set accuracy: {:.2f}".format(knn_1.score(X_test_sc,y_test)))

Test set accuracy: 0.89
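
With the default `refit=True`, `GridSearchCV` refits the best pipeline on the whole training set, so the fitted search object can score the test set directly and the scaler never sees test data during fitting. A minimal sketch on synthetic data (the data set and the `demo` names are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=3, random_state=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo)

pipe_demo = Pipeline([("scaler", MinMaxScaler()),
                      ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": list(range(1, 11))}

# refit=True (the default) refits the best pipeline on all of X_tr.
search = GridSearchCV(pipe_demo, grid, cv=5).fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
# score() scales X_te with the minima and maxima learned from X_tr only.
test_accuracy = search.score(X_te, y_te)
print(best_k, round(test_accuracy, 3))
```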


#### Radius Nearest Neighbors 
Requirements:
* first split the data (80% training; 20% test). Set `random_state=100`.
* normalize the features.
* use 10-fold cross-validation in the grid-search.
* consider about 200 different values of radius between 0.01 and 2.
* Set `outlier_label='most_frequent'` for `RadiusNeighborsClassifier`. 
* After you find the optimal value of the radius, use the entire training set to fit the RNN model and calculate the accuracy for the test set.




In [6]:
# Load libraries
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import Normalizer


# Set normalizer
# Set features and target
# Create normalized data
normalizer = Normalizer()
features = data1
target = y
X_norm = normalizer.fit_transform(features)



# Set train and test data, specify split, and random state
X_train, X_test, y_train, y_test = train_test_split(X_norm, target, test_size=0.2,
                                                    random_state=100,stratify=target)


# Create rnn using radius neighbor classifier
rnn = RadiusNeighborsClassifier(outlier_label='most_frequent') 

pipe = Pipeline([("normalizer", normalizer), ("rnn", rnn)])

# Specify search space: 200 radius values between 0.01 and 2
SearchSpace=[{'rnn__radius': np.linspace(0.01, 2, 200) }]

# Create the grid search over the search space with 10-fold cross-validation
# and fit it on the training data
classifier = GridSearchCV(
    pipe, SearchSpace, cv=10, verbose=0).fit(X_train, y_train)

# Get best estimator from our search
classifier.best_estimator_.get_params()["rnn__radius"]

0.01

In [19]:
# Run model

# Scale the training data, then apply the same scaling to the test data
X_train_sc1 = standardizer.fit_transform(X_train)
X_test_sc1 = standardizer.transform(X_test)

# Fit the model on the training data with the best radius parameter
rnn_final = RadiusNeighborsClassifier(radius=0.01,outlier_label='most_frequent').fit(X_train_sc1,y_train)

# Assess model accuracy on the test data
rnn_final.score(X_test_sc1,y_test)

0.950207468879668
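
The requirement of roughly 200 radii between 0.01 and 2 fits the geometry of normalized data: `Normalizer` rescales each row to unit length, and two unit vectors are never more than 2 apart, so a larger radius would take in every training point. A small sketch with made-up rows (`rows` is hypothetical) illustrates both points:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import Normalizer

# 200 evenly spaced candidate radii from 0.01 to 2 (both ends included).
radii = np.linspace(0.01, 2, 200)

# Hypothetical non-negative feature rows, similar in spirit to encoded survey data.
rng = np.random.default_rng(100)
rows = rng.integers(0, 7, size=(50, 26)).astype(float)
rows[rows.sum(axis=1) == 0, 0] = 1.0  # avoid all-zero rows

# Normalizer rescales each row to unit Euclidean length.
unit_rows = Normalizer().fit_transform(rows)
norms = np.linalg.norm(unit_rows, axis=1)

# The distance between two unit vectors cannot exceed 2.
max_distance = pairwise_distances(unit_rows).max()
print(len(radii), radii[0], radii[-1], max_distance)
```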

## Decision Trees


####  Split the data (80% training; 20% test). Set `random_state=100`.

In [15]:
from sklearn.tree import DecisionTreeClassifier
X = features

# Specify train and test split and x and y for model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, test_size=.2)



####  Train a decision tree (`random_state=100`) without constraints. Calculate the accuracies of the training and the test sets. 

In [16]:
# Create tree params
tree_clf5 = DecisionTreeClassifier(random_state=100)

# Train model
tree_clf5.fit(X_train,y_train)

# Show accuracy of trained data and test data
print("Accuracy with trained data is:",tree_clf5.score(X_train, y_train))
print("Accuracy with test data is:",tree_clf5.score(X_test, y_test))

Accuracy with trained data is: 0.9273858921161826
Accuracy with test data is: 0.9336099585062241
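
An unconstrained tree normally keeps splitting until every leaf is pure, so a training accuracy below 1 suggests that some rows share identical feature values but carry different labels. Whether that is the cause here is an assumption; the toy data below only demonstrates the mechanism:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the three rows with feature value 0 disagree on their label.
X_toy = np.array([[0], [0], [0], [1], [1]])
y_toy = np.array(["Low", "Low", "High", "High", "High"])

toy_tree = DecisionTreeClassifier(random_state=100).fit(X_toy, y_toy)

# Identical rows always land in the same leaf, so one of them is misclassified.
toy_train_accuracy = toy_tree.score(X_toy, y_toy)
print(toy_train_accuracy)
```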


####  Train a decision tree with `random_state=100` and  `max_features=3`. Calculate the accuracies of the training and the test set.
Restricting `max_features` may increase the bias and reduce the variance.

In [17]:
# Specify training, test data, split and x and y for model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100,test_size=.2)

# set params for tree with classifier (random state and max features)
tree_clf4 = DecisionTreeClassifier(random_state=100, max_features=3)

# Train tree
tree_clf4.fit(X_train,y_train)

# Print tree accuracy
print("Accuracy with trained data is:",tree_clf4.score(X_train, y_train))
print("Accuracy with test data is:",tree_clf4.score(X_test, y_test))

Accuracy with trained data is: 0.9273858921161826
Accuracy with test data is: 0.9377593360995851


#### Train a decision tree with `random_state=100` and `min_samples_leaf=5`. Calculate the accuracies of the training and the test set.
Requiring at least five samples per leaf may increase the bias and reduce the variance.

In [18]:
# Specify training, test data, split and x and y for model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100,test_size=.2)

# set params for tree with classifier (random state and min samples leaf)
tree_clf8 = DecisionTreeClassifier(random_state=100,min_samples_leaf=5 )

# Train tree
tree_clf8.fit(X_train,y_train)

# Print tree accuracy
print("Accuracy with trained data is:",tree_clf8.score(X_train, y_train))
print("Accuracy with test data is:",tree_clf8.score(X_test, y_test))

Accuracy with trained data is: 0.83298755186722
Accuracy with test data is: 0.8049792531120332
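
The effect of `min_samples_leaf` can be seen in the structure of the fitted tree. The sketch below uses synthetic data (an assumption, not the notebook's data) and compares the number of leaves and the smallest leaf size with and without the constraint:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=6,
    n_classes=3, random_state=100)

full_tree = DecisionTreeClassifier(random_state=100).fit(X_demo, y_demo)
constrained_tree = DecisionTreeClassifier(
    random_state=100, min_samples_leaf=5).fit(X_demo, y_demo)

# Leaves are the nodes without children (children_left == -1).
is_leaf = constrained_tree.tree_.children_left == -1
smallest_leaf = constrained_tree.tree_.n_node_samples[is_leaf].min()

print("leaves:", full_tree.get_n_leaves(), "vs", constrained_tree.get_n_leaves())
print("smallest leaf in constrained tree:", smallest_leaf)
print("training accuracy:", full_tree.score(X_demo, y_demo),
      "vs", constrained_tree.score(X_demo, y_demo))
```

A constrained tree cannot isolate individual training rows, which is one reason its training accuracy tends to fall below that of the unconstrained tree.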


## Random Forests 


Requirements for each of the following parts:
* first split the data (80% training; 20% test). Set `random_state=100`.
* In `RandomForestClassifier`, set `random_state=100` and `n_jobs=-1`.
* Calculate the accuracies for the training and the test sets.

#### Train a random forest with 10 trees.
Note that the default value of `n_estimators` changed from 10 to 100 in Version 0.22 of `sklearn`.

In [12]:
from sklearn.ensemble import RandomForestClassifier

# Specify training, test data, split and x and y for model
X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=100, test_size=.2)

# Create forest params with classifier and set random state and estimators
forest = RandomForestClassifier(n_estimators=10,
                                random_state=100, n_jobs=-1)

# Create and Train model using forest params specified 
model_forest = forest.fit(X_train, y_train)

# Print accuracy for model
print("Accuracy on training set: {:.3f}".format(model_forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(model_forest.score(X_test, y_test)))

Accuracy on training set: 0.925
Accuracy on test set: 0.929


####  Train a random forest with 100 trees. Any improvement compared with the 10-tree forest?

In [13]:

# Specify training, test data, split and x and y for model
X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=100, test_size=.2)

# Create forest params with classifier and set random state and estimators
forest1 = RandomForestClassifier(n_estimators=100,
                                random_state=100, n_jobs=-1)

# Create and Train model using forest params specified 
model_forest1 = forest1.fit(X_train, y_train)

# Print accuracy for model
print("Accuracy on training set: {:.3f}".format(model_forest1.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(model_forest1.score(X_test, y_test)))

Accuracy on training set: 0.927
Accuracy on test set: 0.938


+ Test accuracy increases by 0.009 (0.929 to 0.938), while training accuracy stays approximately the same (0.925 to 0.927).

####  Train a random forest with 1000 trees. Any improvement compared with the 100-tree forest above?

In [14]:
# Specify training, test data, split and x and y for model
X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=100,test_size=.2)

# Create forest params with classifier and set random state and estimators
forest2 = RandomForestClassifier(n_estimators=1000,
                                random_state=100, n_jobs=-1)

# Create and Train model using forest params specified 
model_forest2 = forest2.fit(X_train, y_train)

# Print accuracy for model
print("Accuracy on training set: {:.3f}".format(model_forest2.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(model_forest2.score(X_test, y_test)))

Accuracy on training set: 0.927
Accuracy on test set: 0.934


+ Training accuracy stays the same (0.927).
+ Test accuracy decreases by 0.004 (0.938 to 0.934).
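
With 241 test rows, a change of 0.004 in accuracy corresponds to a single sample, so differences this small are within the noise of one split. One possible way to compare forest sizes more robustly is cross-validation. The sketch below uses synthetic data and is an illustration, not a result for this data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6,
    n_classes=3, random_state=100)

cv_means = {}
for n_trees in (10, 100):
    forest_demo = RandomForestClassifier(n_estimators=n_trees, random_state=100)
    # Mean accuracy over 5 folds is less sensitive to a single sample
    # than one 80/20 split.
    scores = cross_val_score(forest_demo, X_demo, y_demo, cv=5)
    cv_means[n_trees] = scores.mean()

print(cv_means)
```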

#### Train a random forest with 100 trees with `max_features=10`. Any difference from Part 4.2?
The default value for `max_features` is `sqrt(n_features)`.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, random_state=100,test_size=.2)


forest3 = RandomForestClassifier(n_estimators=100,
                         random_state=100, n_jobs=-1, max_features=10)


model_forest3 = forest3.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(model_forest3.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(model_forest3.score(X_test, y_test)))

Accuracy on training set: 0.927
Accuracy on test set: 0.934


+ Test accuracy decreases slightly from Part 4.2 (0.938 to 0.934).
+ Training accuracy stays the same as in Part 4.2 (0.927).
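
With 26 features, the square-root default works out to 5 features considered per split, so `max_features=10` doubles the candidates each tree examines. A minimal sketch on synthetic data with the same number of columns (the data and names are assumptions for illustration) reads the value back from a fitted tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 26 features, matching the width of data1.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=26, n_informative=8,
    n_classes=3, random_state=100)

default_forest = RandomForestClassifier(
    n_estimators=5, random_state=100).fit(X_demo, y_demo)
wide_forest = RandomForestClassifier(
    n_estimators=5, max_features=10, random_state=100).fit(X_demo, y_demo)

# max_features_ is the number of features each tree samples per split.
default_k = default_forest.estimators_[0].max_features_
wide_k = wide_forest.estimators_[0].max_features_
print(default_k, wide_k)
```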

####  Show the features importance in the model. 


In [16]:
for name, score in zip(data1.columns, model_forest.feature_importances_):
    print(name, score)

Age 0.08583812991070759
Network Type 0.07565276574449656
Class Duration 0.16590760615206018
Gender_Boy 0.037789738996994046
Gender_Girl 0.037781803465573756
Education Level_College 0.030388563751933356
Education Level_School 0.02271238603454941
Education Level_University 0.03459409680515524
Institution Type_Government 0.024974667159102366
Institution Type_Non Government 0.048354303113187745
IT Student_No 0.023221504438925924
IT Student_Yes 0.02743510384161384
Location_No 0.029671200182698254
Location_Yes 0.0233220814202405
Load-shedding_High 0.033404993405830676
Load-shedding_Low 0.02204304041777359
Financial Condition_Mid 0.04054708948528579
Financial Condition_Poor 0.04066353310245535
Financial Condition_Rich 0.041994417281285785
Internet Type_Mobile Data 0.018826427987668388
Internet Type_Wifi 0.03970590648400056
Self Lms_No 0.018839622864837945
Self Lms_Yes 0.01970665294402183
Device_Computer 0.03009359668253998
Device_Mobile 0.02070333987627108
Device_Tab 0.005827428450790262
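
The raw list is easier to read when sorted, and the importance of a one-hot encoded variable is spread across its dummy columns. The sketch below shows one possible way to rank importances and to sum dummy columns back to their source variable; the data and column names are hypothetical stand-ins in the same `Variable_Level` style as `data1`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=100)
# Hypothetical column names in the same "Variable_Level" style as data1.
columns = ["Age", "Device_Computer", "Device_Mobile", "Device_Tab",
           "Gender_Boy", "Gender_Girl"]

demo_forest = RandomForestClassifier(
    n_estimators=50, random_state=100).fit(X_demo, y_demo)

importances = pd.Series(demo_forest.feature_importances_, index=columns)
ranked = importances.sort_values(ascending=False)

# Sum dummy columns back to their source variable (text before the underscore).
grouped = importances.groupby(
    lambda name: name.split("_")[0]).sum().sort_values(ascending=False)
print(ranked)
print(grouped)
```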
