# Homework 3: Machine Learning Tasks (XX/125 points)
## Due Monday 12/7/2022 11:59 pm
## About Dataset
### Data from a semi-conductor manufacturing process

Number of Instances: 1567 <br>
Area: Computer<br>
Attribute Characteristics: Real<br>
Number of Attributes: 591<br>
Date Donated: 2008-11-19<br>
Associated Tasks: Classification, Causal-Discovery<br>
Missing Values? Yes<br>

A complex modern semi-conductor manufacturing process is normally under consistent
surveillance via the monitoring of signals/variables collected from sensors and/or
process measurement points. However, not all of these signals are equally valuable
in a specific monitoring system. The measured signals contain a combination of
useful information, irrelevant information as well as noise. It is often the case
that useful information is buried in the latter two. Engineers typically have a
much larger number of signals than are actually required. If we consider each type
of signal as a feature, then feature selection may be applied to identify the most
relevant signals. The Process Engineers may then use these signals to determine key
factors contributing to yield excursions downstream in the process. This will
enable an increase in process throughput, a decrease in time to learning, and a reduction in per-unit production costs.

To enhance current business improvement techniques the application of feature
selection as an intelligent systems technique is being investigated.

The dataset presented in this case represents a selection of such features, where
each example represents a single production entity with associated measured
features. The labels represent a simple pass/fail yield for in-house line testing (figure 2), together with an associated date time stamp, where -1 corresponds to a pass,
1 corresponds to a fail, and the date time stamp is for that specific test
point.

This homework assignment will walk you through how to tackle this real problem. 

It is worth noting that this is an actual dataset, and thus the problem is not fully tractable. As with many real problems, you sometimes do not have the information necessary for a perfect solution. We want a useful and informative solution.

In [82]:
## Here are some packages and modules that you will use. Make sure they are installed.

# for basic operations
import numpy as np 
import pandas as pd 

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# for modeling 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

import plotly.express as px

from imblearn.over_sampling import SMOTE

# to avoid warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Loading the data (5 points)

1. The data is in a file called `uci-secom.csv`. scikit-learn works well with pandas, so it is recommended that you read the CSV into a pandas DataFrame.
2. It is useful to print the shape of the resulting DataFrame to know what data you are working with.

In [5]:
df = pd.read_csv('/content/drive/MyDrive/FinalHomework/uci-secom.csv')

3. Pandas has a built-in method called `head` that shows the first few rows, which is useful for seeing what the data looks like.

In [6]:
df.head()

| Unnamed: 0 | Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.0 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.363 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.006 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.9 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.499 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.52 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.48 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |


## Filtering Data (5 points)

Real data is usually a mess. There could be missing points, outliers, and features that could have vastly different values and ranges. 
* Machine learning models are influenced heavily by these problems

### Fixing Missing Values (5 Points)

1. It is not uncommon for some features to have only a few entries. Such features are not helpful for machine learning, so we should remove them from the data.

It is good to visualize how many missing values each feature has.

### Plotting the Missing Data Entries (5 points)

Hint: you can find the nan values with the `.isna()` method, and the sum using `.sum()`

You can plot the data using `px.histogram`
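As a small illustration of the hint above, here is a sketch on a made-up three-column frame. The column names and values are hypothetical and are not taken from `uci-secom.csv`. It shows what `.isna().sum()` returns before the result is passed to a histogram.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data has hundreds of sensor columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# isna() gives a boolean frame; sum() counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The result is a Series indexed by column name, which is why it can be handed directly to `px.histogram`.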

In [7]:
px.histogram(df.isna().sum())

### Removing Sparse Features (10 points)

We can remove the features that have more than 100 missing entries. 

You can select the locations where a condition is met in a pandas DataFrame using `data.loc[:,:]` with traditional numpy-like indexing.
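The sketch below shows one way a boolean condition can be used with `.loc` to pick out sparse columns. The toy frame and the `threshold` value are assumptions made for illustration; the assignment itself uses a cutoff of 100 missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the sensor data.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})
threshold = 2  # illustrative cutoff only

na_counts = toy.isna().sum()
sparse_mask = na_counts > threshold  # boolean Series indexed by column name

# .loc[:, mask] keeps every row and only the columns where the mask is True.
sparse_cols = toy.loc[:, sparse_mask].columns.tolist()
dense = toy.drop(columns=sparse_cols)
print(sparse_cols, dense.shape)
```

Computing the counts once and reusing `na_counts` avoids evaluating `isna().sum()` twice.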

In [20]:
drop_na_cols = list(df.isna().sum()[df.isna().sum()>100].index)

Remove these columns in the dataframe using the `.drop()` method, make sure inplace is set to True

In [23]:
df.drop(drop_na_cols,axis=1,inplace=True)

It is useful to check the shape to make sure that the operation worked


In [26]:
df.shape

(1567, 540)

It is useful to see how many data points have missing information

In [51]:
df.isna().sum(axis=1) #Get row-wise counts of NaN values in each row

0         4
1         0
2         0
3         0
4         0
       ... 
1562     36
1563     36
1564    136
1565     44
1566    136
Length: 1567, dtype: int64

In [50]:
df.isna().sum(axis=1)[df.isna().sum(axis=1)>0].sum()#Get total number of missing values

3519

In [53]:
len(df.isna().sum(axis=1)[df.isna().sum(axis=1)>0]) #Total number of rows with missing values (NaN values)

174

In [61]:
drop_row_indices = list(df.isna().sum(axis=1)[df.isna().sum(axis=1)>0].index) #Getting row indices with NaN values that will be dropped

In [62]:
df.drop(drop_row_indices,axis=0,inplace=True)

In [63]:
df.shape

(1393, 540)

Next, separate the features from the labels:
1. Use the pandas built-in `drop` method to remove columns; the `Time` column can be dropped as well.
2. Set the prediction target to whether the production entity passed or failed.
3. It is a good idea to replace the -1 label values with 0 using the pandas built-in method `.replace`.
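The steps above can be sketched on a tiny made-up frame. The feature column `"0"` and its values are hypothetical. The sketch restricts `.replace` to the label column, so that a feature value that happens to equal -1 is not recoded by accident.

```python
import pandas as pd

# Hypothetical toy frame with the same column layout as the real data.
toy = pd.DataFrame({
    "Time": ["2008-07-19 11:55:00", "2008-07-19 12:32:00", "2008-07-19 13:17:00"],
    "0": [-1.0, 0.5, 2.0],  # a feature that happens to contain -1
    "Pass/Fail": [-1, -1, 1],
})

toy = toy.drop(columns="Time")
# Replace only within the label column so feature values of -1 are untouched.
toy["Pass/Fail"] = toy["Pass/Fail"].replace(-1, 0)

X_toy = toy.drop(columns="Pass/Fail")
y_toy = toy["Pass/Fail"]
print(X_toy)
print(y_toy.tolist())
```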

In [64]:

df.drop(labels="Time",axis=1, inplace=True)
df["Pass/Fail"] = df["Pass/Fail"].replace(-1, 0)  # recode the labels only, leaving feature values untouched

## Test-train split (5 points)

Use the `train_test_split` method to split the data.

For consistency make the `test_size = .3`, and the `random_state = 42`.
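For reference, here is a minimal sketch of `train_test_split` on synthetic labels. It also passes the optional `stratify` argument, which the assignment does not ask for; it is shown only to illustrate how the class ratio can be preserved in both splits when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 90 passes (0) and 10 fails (1).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both splits keep the 10% failure rate of the full toy data.
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```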

In [67]:
X=df.drop('Pass/Fail', axis=1)
y=df['Pass/Fail']
X_train, X_test, y_train, y_test=train_test_split(X, y,  test_size=0.3, random_state=42) 

## Machine Learning (5 points)

It is always good to try a quick machine learning model. If your data is simple it might just work.

* Implement a `LogisticRegression` from scikit-learn
* Fit the model

In [77]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



LogisticRegression(max_iter=1000)

It is always a good idea to check how well the model fits the data.
* Use the `.predict()` method to predict on the training data.
* Use the sklearn function `classification_report` to evaluate the model.
* Use the sklearn function `confusion_matrix`; you can plot the result in plotly using `px.imshow()`.

You will reuse these lines of code to visualize your results
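Because the same evaluation is repeated for every model, one option is to wrap it in a small helper. The function name `evaluate` and the toy data below are hypothetical. This is a sketch of the idea rather than part of the assignment, and the returned matrix could then be passed to `px.imshow()`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix


def evaluate(model, X, y):
    """Return the text report and confusion matrix for a fitted classifier."""
    y_pred = model.predict(X)
    report = classification_report(y, y_pred, zero_division=0)
    # Fixing labels=[0, 1] keeps row/column order aligned with axis ticks [0, 1].
    cm = confusion_matrix(y, y_pred, labels=[0, 1])
    return report, cm


# Tiny toy problem, only to show the helper running end to end.
X_toy = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X_toy, y_toy)

report, cm = evaluate(model, X_toy, y_toy)
print(report)
print(cm)
```

Rows of the matrix are actual labels and columns are predicted labels, in the order given by `labels`.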

In [79]:
y_pred= lr.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred))
cm = confusion_matrix(y_train,y_pred)  # default label order [0, 1] matches the axis ticks used in the plot


classification report:
               precision    recall  f1-score   support

           0       0.95      0.99      0.97       904
           1       0.69      0.31      0.43        71

    accuracy                           0.94       975
   macro avg       0.82      0.65      0.70       975
weighted avg       0.93      0.94      0.93       975



In [94]:
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()


Now use the same approach to visualize the test results

In [95]:
y_pred_test= lr.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)

classification report:
               precision    recall  f1-score   support

           0       0.93      0.96      0.95       390
           1       0.11      0.07      0.09        28

    accuracy                           0.90       418
   macro avg       0.52      0.51      0.52       418
weighted avg       0.88      0.90      0.89       418



In [96]:
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

<span style="color:blue">Question: Describe what might be wrong with this model, does it provide any practical value? (5 points)</span>


The main issue is class imbalance: the support column of the classification reports shows far more negative samples (0) than positive samples (1). On the test set the model scores a precision of 0.11 and a recall of 0.07 for the positive class (1), so it misses almost every failure. Its test accuracy of 0.90 is lower than the roughly 0.93 (390/418) that would result from always predicting 0, so the model provides little practical value for detecting failures. The convergence warning also suggests that the features should be scaled before fitting.

### Try Another Model (5 points)

It could be that we just selected a bad model for the problem. Try a random forest classifier as implemented in scikit-learn:
* Instantiate the model
* Fit the data

In [97]:
rf=RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier()

Validate the model on the training and testing dataset



In [99]:
y_pred_train= rf.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()


classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       904
           1       1.00      1.00      1.00        71

    accuracy                           1.00       975
   macro avg       1.00      1.00      1.00       975
weighted avg       1.00      1.00      1.00       975



In [100]:
y_pred_test= rf.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()


classification report:
               precision    recall  f1-score   support

           0       0.93      1.00      0.97       390
           1       0.00      0.00      0.00        28

    accuracy                           0.93       418
   macro avg       0.47      0.50      0.48       418
weighted avg       0.87      0.93      0.90       418




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





That still does not do anything meaningful: on the test set the random forest never predicts a failure, so recall for class 1 is 0.

## Normalizing the Data (10 points)

Many machine learning models, especially those fitted by gradient-based optimization such as logistic regression, work best when features have a mean of 0 and a standard deviation of 1. This makes the optimization easier.
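To make the target of standardization concrete, here is a minimal sketch using `StandardScaler` on two hypothetical features with very different scales. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X_toy = np.array([[3000.0, 0.12],
                  [3100.0, 0.13],
                  [2900.0, 0.11],
                  [3050.0, 0.12]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```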

1. Make histograms of the feature means and standard deviations. You can compute these with the built-in methods `.mean()` and `.std()`.
2. You can plot them using `px.histogram`.

In [103]:
px.histogram(df.mean(),title="Mean Histogram")

In [105]:
px.histogram(df.std(),title="Standard Deviation Histogram")

### Scaled Logistic Regression (5 points)

1. Use the `Pipeline` utility to create a machine learning model that:
    - Standardizes the data with `StandardScaler`
    - Conducts logistic regression
2. Fit the model
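One reason to put the scaler inside a `Pipeline` is that its statistics are then learned from the training data only and reused unchanged at prediction time. The sketch below checks this on synthetic data. The step names `"scaler"` and `"clf"` and the generated arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(40, 3))
y_train_toy = (X_train_toy[:, 0] > 50.0).astype(int)
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(10, 3))

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train_toy, y_train_toy)

# The scaler's statistics come from the training split only.
fitted_mean = pipe.named_steps["scaler"].mean_
print(np.allclose(fitted_mean, X_train_toy.mean(axis=0)))

# predict() reuses those training statistics on new data.
print(pipe.predict(X_test_toy).shape)
```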

In [106]:
lr_pipeline = Pipeline(steps=[('StandardScaler', StandardScaler()), ('Logistic Regression', LogisticRegression(max_iter=1000))])
lr_pipeline.fit(X_train, y_train)

Pipeline(steps=[('StandardScaler', StandardScaler()),
                ('Logistic Regression', LogisticRegression(max_iter=1000))])

In [107]:
y_pred_train= lr_pipeline.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       904
           1       1.00      0.97      0.99        71

    accuracy                           1.00       975
   macro avg       1.00      0.99      0.99       975
weighted avg       1.00      1.00      1.00       975



In [108]:
y_pred_test= lr_pipeline.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93       390
           1       0.07      0.07      0.07        28

    accuracy                           0.87       418
   macro avg       0.50      0.50      0.50       418
weighted avg       0.87      0.87      0.87       418



### Standard Scaled Random Forest (5 points)

1. Use the `Pipeline` utility to create a machine learning model that:
    - Standardizes the data with `StandardScaler`
    - Conducts Random Forest
2. Fit the model

In [109]:
rf_pipeline = Pipeline(steps=[('StandardScaler', StandardScaler()), ('Random Forest Classifier', RandomForestClassifier())])
rf_pipeline.fit(X_train,y_train)

Pipeline(steps=[('StandardScaler', StandardScaler()),
                ('Random Forest Classifier', RandomForestClassifier())])

In [110]:
y_pred_train= rf_pipeline.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       904
           1       1.00      1.00      1.00        71

    accuracy                           1.00       975
   macro avg       1.00      1.00      1.00       975
weighted avg       1.00      1.00      1.00       975



In [111]:
y_pred_test= rf_pipeline.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.93      0.99      0.96       390
           1       0.00      0.00      0.00        28

    accuracy                           0.93       418
   macro avg       0.47      0.50      0.48       418
weighted avg       0.87      0.93      0.90       418



<span style="color:blue"> Question: Explain what is going on with the random forest model? Why are the results so bad? (5 points)</span>

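The random forest pipeline fits the training data perfectly but fails on the test data, which indicates overfitting. Because negative samples (0) vastly outnumber positive ones (1), the model is biased toward the majority class and predicts almost every test sample as 0, so the true failures end up as false negatives. Standard scaling does not help here because tree splits depend only on the ordering of each feature's values, which scaling leaves unchanged.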

## Feature Reduction

### Logistic Regression (5 Points)

We can use PCA to reduce the number of features such that highly covariant features are combined. This helps deal with the curse of dimensionality.

Add PCA to the pipeline for the logistic regression, and visualize the results as we have done before
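A point worth checking when adding PCA to a pipeline is how many components are kept. With no arguments, `PCA()` keeps every component, so it rotates the data without reducing it; passing `n_components` is what shrinks the feature count. The sketch below illustrates this on random data, which is an assumption made purely for demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 6))

pca_all = PCA().fit(X_toy)                # default keeps min(n_samples, n_features)
pca_two = PCA(n_components=2).fit(X_toy)  # keeps only the two leading components

print(pca_all.n_components_)              # 6
print(pca_two.transform(X_toy).shape)     # (50, 2)
print(pca_two.explained_variance_ratio_.sum())
```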

In [112]:
lr_pca_pipeline = Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),('Logistic Regression', LogisticRegression(max_iter=1000))])
lr_pca_pipeline.fit(X_train, y_train)

Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),
                ('Logistic Regression', LogisticRegression(max_iter=1000))])

In [113]:
y_pred_train= lr_pca_pipeline.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       904
           1       1.00      0.97      0.99        71

    accuracy                           1.00       975
   macro avg       1.00      0.99      0.99       975
weighted avg       1.00      1.00      1.00       975



In [114]:
y_pred_test= lr_pca_pipeline.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93       390
           1       0.07      0.07      0.07        28

    accuracy                           0.87       418
   macro avg       0.50      0.50      0.50       418
weighted avg       0.87      0.87      0.87       418



### Random Forest (5 points)

Add PCA to the pipeline for the random forest, and visualize the results as we have done before.

In [115]:

rf_pca_pipeline = Pipeline(steps=[('StandardScaler', StandardScaler()),('PCA', PCA()), ('Random Forest Classifier', RandomForestClassifier())])
rf_pca_pipeline.fit(X_train,y_train)

Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),
                ('Random Forest Classifier', RandomForestClassifier())])

In [116]:
y_pred_train= rf_pca_pipeline.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       904
           1       1.00      1.00      1.00        71

    accuracy                           1.00       975
   macro avg       1.00      1.00      1.00       975
weighted avg       1.00      1.00      1.00       975



In [117]:
y_pred_test= rf_pca_pipeline.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.94      0.92      0.93       390
           1       0.13      0.18      0.15        28

    accuracy                           0.87       418
   macro avg       0.54      0.55      0.54       418
weighted avg       0.89      0.87      0.88       418



<span style="color:blue"> Question: Explain if adding PCA helped, explain why you think PCA helped or did not help. (5 points)</span> 

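PCA has not improved the predictions so far. `PCA()` is used here with its default `n_components`, which keeps every component, so no features are actually removed: the logistic regression test scores are identical to those obtained with scaling alone. The imbalance between pass and fail samples also remains, so the models are still biased toward the majority class and overfit.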

## Hyperparameter Tuning 

To improve a machine learning model you might want to tune the hyperparameters.

Scikit-learn has automated tools for cross-validation and hyperparameter search. You define a dictionary of the values you want to search, and it tries every combination and returns the best result.

### Logistic Regression (7.5 points)

Build a pipeline and a parameter grid to search the following hyperparameters:
1. C = [ 0.001, .01, .1, 1, 10, 100]
2. penalty = ['l1', 'l2']
3. class_weight = ['balanced']
4. solver = ['saga']
5. PCA n_components = [2, 3, 4, 5, 8, 10]

To conduct the fitting, build the classifier with `GridSearchCV`, which runs a grid search with k-fold cross-validation. See the documentation for more information.

For the GridSearchCV set the scoring to 'f1', cv=5. If you want to monitor the status you can set verbose=10. 
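Before launching the search it can help to see how large the grid is. The sketch below counts the candidates with `ParameterGrid`. It assumes the pipeline steps are named `'PCA'` and `'Logistic Regression'`, since grid keys follow the `<step name>__<parameter>` convention; different step names would need different prefixes.

```python
from sklearn.model_selection import ParameterGrid

# Keys use the pipeline step name, a double underscore, then the parameter name.
param_grid = {
    "Logistic Regression__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "Logistic Regression__penalty": ["l1", "l2"],
    "Logistic Regression__class_weight": ["balanced"],
    "Logistic Regression__solver": ["saga"],
    "PCA__n_components": [2, 3, 4, 5, 8, 10],
}

n_candidates = len(ParameterGrid(param_grid))  # 6 * 2 * 1 * 1 * 6
n_folds = 5
print(n_candidates, n_candidates * n_folds)    # 72 candidates, 360 fits
```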

In [118]:
lr_pca_pipeline_hpo = Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),('Logistic Regression', LogisticRegression(max_iter=1000))])
lr_pca_pipeline_hpo.fit(X_train, y_train)

Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),
                ('Logistic Regression', LogisticRegression(max_iter=1000))])

In [123]:
param_grid = {
    'Logistic Regression__C' : [0.001, .01, .1, 1, 10, 100],
    'Logistic Regression__penalty' : ['l1', 'l2'],
    'Logistic Regression__class_weight' : ['balanced'],
    'Logistic Regression__solver' : ['saga'],
    'PCA__n_components' : [2, 3, 4, 5, 8, 10],
}

cv = GridSearchCV(lr_pca_pipeline_hpo, param_grid, n_jobs=-1, scoring='f1', cv=5)
cv.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('StandardScaler', StandardScaler()),
                                       ('PCA', PCA()),
                                       ('Logistic Regression',
                                        LogisticRegression(max_iter=1000))]),
             n_jobs=-1,
             param_grid={'Logistic Regression__C': [0.001, 0.01, 0.1, 1, 10,
                                                    100],
                         'Logistic Regression__class_weight': ['balanced'],
                         'Logistic Regression__penalty': ['l1', 'l2'],
                         'Logistic Regression__solver': ['saga'],
                         'PCA__n_components': [2, 3, 4, 5, 8, 10]},
             scoring='f1')

In [124]:
cv.best_estimator_

Pipeline(steps=[('StandardScaler', StandardScaler()),
                ('PCA', PCA(n_components=10)),
                ('Logistic Regression',
                 LogisticRegression(C=0.01, class_weight='balanced',
                                    max_iter=1000, solver='saga'))])

In [125]:
cv_lr = cv.best_estimator_.fit(X_train,y_train)

In [126]:
y_pred_train= cv_lr.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.96      0.70      0.81       904
           1       0.14      0.63      0.23        71

    accuracy                           0.69       975
   macro avg       0.55      0.67      0.52       975
weighted avg       0.90      0.69      0.77       975



In [127]:
y_pred_test= cv_lr.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.95      0.70      0.81       390
           1       0.11      0.50      0.18        28

    accuracy                           0.69       418
   macro avg       0.53      0.60      0.49       418
weighted avg       0.89      0.69      0.77       418



### Random Forest Classifier (7.5 points)

Build a pipeline and a parameter grid to search the following hyperparameters:
1. Random Forest Criterion = [ "gini", "entropy", "log_loss"]
2. max depth = [4, 8, 12]
3. max features = ['sqrt', 'log2']
4. PCA n_components = [4, 8, 10, 20]

To conduct the fitting, build the classifier with `GridSearchCV`, which runs a grid search with k-fold cross-validation. See the documentation for more information.

For the GridSearchCV set the scoring to 'f1', cv=5. If you want to monitor the status you can set verbose=10. 
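Here is a minimal sketch of the whole search on synthetic data, with a deliberately tiny grid. The step names, data sizes, and `n_estimators=20` are assumptions chosen to keep it fast. It also shows that with the default `refit=True`, `best_estimator_` is already fitted on the full data passed to `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; sizes and grid are deliberately tiny.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
grid = {
    "pca__n_components": [4, 8],
    "rf__max_depth": [4, 8],
}

search = GridSearchCV(pipe, grid, scoring="f1", cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
# With the default refit=True, best_estimator_ is already fitted on all of X_toy.
print(search.best_estimator_.predict(X_toy[:5]))
```

`best_score_` is the mean cross-validated F1 of the best candidate, not a score on held-out test data.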

In [128]:
rf_pca_pipeline_hpo = Pipeline(steps=[('StandardScaler', StandardScaler()),('PCA', PCA()), ('Random Forest Classifier', RandomForestClassifier())])
rf_pca_pipeline_hpo.fit(X_train,y_train)

Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),
                ('Random Forest Classifier', RandomForestClassifier())])

In [129]:
rf_pca_pipeline_hpo.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'StandardScaler', 'PCA', 'Random Forest Classifier', 'StandardScaler__copy', 'StandardScaler__with_mean', 'StandardScaler__with_std', 'PCA__copy', 'PCA__iterated_power', 'PCA__n_components', 'PCA__random_state', 'PCA__svd_solver', 'PCA__tol', 'PCA__whiten', 'Random Forest Classifier__bootstrap', 'Random Forest Classifier__ccp_alpha', 'Random Forest Classifier__class_weight', 'Random Forest Classifier__criterion', 'Random Forest Classifier__max_depth', 'Random Forest Classifier__max_features', 'Random Forest Classifier__max_leaf_nodes', 'Random Forest Classifier__max_samples', 'Random Forest Classifier__min_impurity_decrease', 'Random Forest Classifier__min_samples_leaf', 'Random Forest Classifier__min_samples_split', 'Random Forest Classifier__min_weight_fraction_leaf', 'Random Forest Classifier__n_estimators', 'Random Forest Classifier__n_jobs', 'Random Forest Classifier__oob_score', 'Random Forest Classifier__random_state', 'Random Forest Clas

In [132]:
param_grid = {'Random Forest Classifier__criterion' : ["gini", "entropy"],
    'Random Forest Classifier__max_depth' : [4, 8, 12],
    'Random Forest Classifier__max_features' : ['sqrt', 'log2'],
    'PCA__n_components' : [4, 8, 10, 20],
}

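# "log_loss" is left out of the grid because it raised an error in this environment,
# likely because the installed scikit-learn predates version 1.1, where this criterion was added.
cv = GridSearchCV(rf_pca_pipeline_hpo, param_grid, n_jobs=-1, scoring='f1', cv=5)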
cv.fit(X_train,y_train)

GridSearchCV(estimator=Pipeline(steps=[('StandardScaler', StandardScaler()),
                                       ('PCA', PCA()),
                                       ('Random Forest Classifier',
                                        RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'PCA__n_components': [4, 8, 10, 20],
                         'Random Forest Classifier__criterion': ['gini',
                                                                 'entropy'],
                         'Random Forest Classifier__max_depth': [4, 8, 12],
                         'Random Forest Classifier__max_features': ['sqrt',
                                                                    'log2']},
             scoring='f1')

In [133]:
cv_rf = cv.best_estimator_.fit(X_train,y_train)

In [134]:
y_pred_train= cv_rf.predict(X_train)
print("classification report:\n",classification_report(y_train, y_pred_train))
cm = confusion_matrix(y_train,y_pred_train)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       904
           1       1.00      0.69      0.82        71

    accuracy                           0.98       975
   macro avg       0.99      0.85      0.90       975
weighted avg       0.98      0.98      0.98       975



In [135]:
y_pred_test= cv_rf.predict(X_test)
print("classification report:\n",classification_report(y_test, y_pred_test))
cm_test = confusion_matrix(y_test,y_pred_test)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.93      0.99      0.96       390
           1       0.00      0.00      0.00        28

    accuracy                           0.93       418
   macro avg       0.47      0.50      0.48       418
weighted avg       0.87      0.93      0.90       418



## Balancing in the Data (10 points)

This is a classification problem, so it is useful to check whether the classes are balanced, as this affects model training.

With a highly unbalanced dataset, a model can score well on overall metrics such as accuracy by predicting only the most common class, because getting the uncommon class wrong has little effect on those metrics.
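The effect can be seen with a baseline that ignores the features entirely. The sketch below uses `DummyClassifier` on synthetic labels with about 7% failures, a ratio chosen to resemble the test-set supports reported in this notebook (28 of 418). The data itself is invented.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 93 passes (0) and 7 fails (1).
y_toy = np.array([0] * 93 + [1] * 7)
X_toy = np.zeros((100, 1))  # features are irrelevant to this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

print(accuracy_score(y_toy, y_pred))             # 0.93
print(recall_score(y_toy, y_pred, pos_label=1))  # 0.0
```

High accuracy with zero recall on the failure class is the pattern to watch for in the reports.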

View the ratio of the class outcomes. 

The class outcomes are stored in the `['Pass/Fail']` column. You can view the values and their counts using the built-in `.value_counts()` method.

In [136]:
df["Pass/Fail"].value_counts()

0    1294
1      99
Name: Pass/Fail, dtype: int64

Use `SMOTE()` (Synthetic Minority Over-sampling Technique) to balance the dataset.

In [137]:

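# Import shown for completeness; re-importing is harmless if it was done earlier
from imblearn.over_sampling import SMOTE

# Add synthetic minority-class rows until both classes have the same count
smote = SMOTE()
X_bal, y_bal = smote.fit_resample(X, y)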

In [141]:
y.value_counts()

0    1294
1      99
Name: Pass/Fail, dtype: int64

In [142]:
y_bal.value_counts()

0    1294
1    1294
Name: Pass/Fail, dtype: int64
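
The counts above show the minority class growing from 99 to 1294 rows, so 1195 of the class-1 rows in `X_bal` are synthetic. As a rough illustration of where such rows come from, here is a simplified plain-Python sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and create a point on the segment between them. The helper name `smote_like_sample`, the toy points, and `k=2` are assumptions for illustration; this is not imbalanced-learn's implementation.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour.

    Simplified illustration only (hypothetical helper, not imblearn's code).
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest minority neighbours of `base`, excluding `base` itself
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))

# Toy minority class with two features (made-up values)
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (5.0, 5.0)]
rng = random.Random(42)

n_needed = 1294 - 99  # synthetic rows required to match the majority count
print("synthetic rows needed:", n_needed)

synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Because every synthetic point lies between two existing minority samples, the new rows stay inside the region already covered by the minority class rather than adding genuinely new information.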

In [143]:
X_train_bal, X_test_bal, y_train_bal, y_test_bal = train_test_split(X_bal, y_bal,test_size=0.3, random_state=42)
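
One caveat worth checking: here the data is resampled before `train_test_split`, so a synthetic row interpolated from an original sample can land in the test set while that sample sits in the training set (or vice versa), which tends to make test scores optimistic. A common alternative is to split first and resample only the training portion. The sketch below shows that ordering in plain Python; it uses random duplication of minority rows as a stand-in for SMOTE, and the helper `oversample_minority` and the toy data are hypothetical.

```python
import random

def oversample_minority(rows, rng):
    """Duplicate random minority rows until both classes have equal counts.

    Stand-in for SMOTE (hypothetical helper); rows are (features, label) pairs.
    """
    majority = [r for r in rows if r[1] == 0]
    minority = [r for r in rows if r[1] == 1]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

rng = random.Random(42)

# Toy data: 90 passing (0) and 10 failing (1) units, one made-up feature each
data = [((float(i),), 0) for i in range(90)] + [((float(i),), 1) for i in range(90, 100)]

# 1) Split first, holding out 30% of each class (a simple stratified split)
train, test = [], []
for label in (0, 1):
    rows = [r for r in data if r[1] == label]
    rng.shuffle(rows)
    cut = int(len(rows) * 0.3)
    test += rows[:cut]
    train += rows[cut:]

# 2) Resample the training portion only; the test set keeps its real class ratio
train_bal = oversample_minority(train, rng)

print("train before:", sum(r[1] == 0 for r in train), "vs", sum(r[1] == 1 for r in train))
print("train after: ", sum(r[1] == 0 for r in train_bal), "vs", sum(r[1] == 1 for r in train_bal))
print("test:        ", sum(r[1] == 0 for r in test), "vs", sum(r[1] == 1 for r in test))
```

With imbalanced-learn, the same ordering can be obtained by making `SMOTE()` a step in `imblearn.pipeline.Pipeline`, which applies samplers during `fit` only. The scores reported in the rest of this section were produced with the original resample-then-split order.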

Using the balanced dataset, repeat the analysis done with the hyperparameter search.

### Logistic Regression (5 points)

In [144]:
lr_pca_pipeline_hpo = Pipeline(steps=[('StandardScaler', StandardScaler()), ('PCA', PCA()),('Logistic Regression', LogisticRegression(max_iter=1000))])
lr_pca_pipeline_hpo.fit(X_train_bal, y_train_bal)

param_grid = {
    'Logistic Regression__C' : [0.001, .01, .1, 1, 10, 100],
    'Logistic Regression__penalty' : ['l1', 'l2'],
    'Logistic Regression__class_weight' : ['balanced'],
    'Logistic Regression__solver' : ['saga'],
    'PCA__n_components' : [2, 3, 4, 5, 8, 10],
}

cv = GridSearchCV(lr_pca_pipeline_hpo, param_grid, n_jobs=-1,scoring='f1')
cv.fit(X_train_bal,y_train_bal)

GridSearchCV(estimator=Pipeline(steps=[('StandardScaler', StandardScaler()),
                                       ('PCA', PCA()),
                                       ('Logistic Regression',
                                        LogisticRegression(max_iter=1000))]),
             n_jobs=-1,
             param_grid={'Logistic Regression__C': [0.001, 0.01, 0.1, 1, 10,
                                                    100],
                         'Logistic Regression__class_weight': ['balanced'],
                         'Logistic Regression__penalty': ['l1', 'l2'],
                         'Logistic Regression__solver': ['saga'],
                         'PCA__n_components': [2, 3, 4, 5, 8, 10]},
             scoring='f1')
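
To get a feel for the cost of this search, the sketch below counts the candidate settings in the grid. It assumes `GridSearchCV`'s default of 5-fold cross-validation (no `cv` argument is passed above) and the default `refit=True`, which adds one final fit on the full training set.

```python
from itertools import product

# Same grid as the logistic regression search above
param_grid = {
    'Logistic Regression__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'Logistic Regression__penalty': ['l1', 'l2'],
    'Logistic Regression__class_weight': ['balanced'],
    'Logistic Regression__solver': ['saga'],
    'PCA__n_components': [2, 3, 4, 5, 8, 10],
}

# Every combination of one value per parameter is one candidate
candidates = list(product(*param_grid.values()))
n_folds = 5  # assumed: GridSearchCV default when cv is not given

n_cv_fits = len(candidates) * n_folds
print("candidates:", len(candidates))  # 6 * 2 * 1 * 1 * 6
print("cross-validation fits:", n_cv_fits)
print("total fits incl. final refit:", n_cv_fits + 1)
```

Under these assumptions `best_estimator_` is already trained on `X_train_bal` when the search finishes, so calling `.fit` on it again in the next cell repeats that last fit rather than changing the selected hyperparameters.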

In [145]:
cv_lr = cv.best_estimator_.fit(X_train_bal,y_train_bal)

In [146]:
y_pred_train_bal= cv_lr.predict(X_train_bal)
print("classification report:\n",classification_report(y_train_bal, y_pred_train_bal))
cm = confusion_matrix(y_train_bal,y_pred_train_bal)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.61      0.66      0.64       905
           1       0.63      0.58      0.61       906

    accuracy                           0.62      1811
   macro avg       0.62      0.62      0.62      1811
weighted avg       0.62      0.62      0.62      1811



In [148]:
y_pred_test_bal= cv_lr.predict(X_test_bal)
print("classification report:\n",classification_report(y_test_bal, y_pred_test_bal))
cm_test = confusion_matrix(y_test_bal,y_pred_test_bal)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.66      0.65      0.65       389
           1       0.65      0.66      0.66       388

    accuracy                           0.65       777
   macro avg       0.65      0.65      0.65       777
weighted avg       0.65      0.65      0.65       777



### Random Forest Classifier (5 points)

In [149]:
rf_pca_pipeline_hpo = Pipeline(steps=[('StandardScaler', StandardScaler()),('PCA', PCA()), ('Random Forest Classifier', RandomForestClassifier())])
rf_pca_pipeline_hpo.fit(X_train_bal,y_train_bal)

param_grid = {'Random Forest Classifier__criterion' : ["gini", "entropy"],
    'Random Forest Classifier__max_depth' : [4, 8, 12],
    'Random Forest Classifier__max_features' : ['sqrt', 'log2'],
    'PCA__n_components' : [4, 8, 10, 20],
}

# Log-loss raised an error here, so it is omitted
cv = GridSearchCV(rf_pca_pipeline_hpo, param_grid, n_jobs=-1,scoring='f1')
cv.fit(X_train_bal,y_train_bal)

GridSearchCV(estimator=Pipeline(steps=[('StandardScaler', StandardScaler()),
                                       ('PCA', PCA()),
                                       ('Random Forest Classifier',
                                        RandomForestClassifier())]),
             n_jobs=-1,
             param_grid={'PCA__n_components': [4, 8, 10, 20],
                         'Random Forest Classifier__criterion': ['gini',
                                                                 'entropy'],
                         'Random Forest Classifier__max_depth': [4, 8, 12],
                         'Random Forest Classifier__max_features': ['sqrt',
                                                                    'log2']},
             scoring='f1')

In [150]:
cv_rf = cv.best_estimator_.fit(X_train_bal,y_train_bal)

In [151]:
y_pred_train_bal= cv_rf.predict(X_train_bal)
print("classification report:\n",classification_report(y_train_bal, y_pred_train_bal))
cm = confusion_matrix(y_train_bal,y_pred_train_bal)
fig = px.imshow(cm,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       905
           1       1.00      1.00      1.00       906

    accuracy                           1.00      1811
   macro avg       1.00      1.00      1.00      1811
weighted avg       1.00      1.00      1.00      1811



In [152]:
y_pred_test_bal= cv_rf.predict(X_test_bal)
print("classification report:\n",classification_report(y_test_bal, y_pred_test_bal))
cm_test = confusion_matrix(y_test_bal,y_pred_test_bal)
fig = px.imshow(cm_test,
                labels=dict(x="Predicted Label", y="Actual Label", color="Value"),
                x=[0,1],
                y=[0,1]
               ,color_continuous_scale='thermal',text_auto=True)
fig.update_xaxes(side="top")
fig.show()

classification report:
               precision    recall  f1-score   support

           0       0.98      0.93      0.96       389
           1       0.94      0.98      0.96       388

    accuracy                           0.96       777
   macro avg       0.96      0.96      0.96       777
weighted avg       0.96      0.96      0.96       777

