# Phase 3 Code Challenge

This assessment is designed to test your understanding of Module 3 material. It covers:

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully_. You will be asked both to write code and to answer short answer questions.

## Code Tests

We have provided some code tests for you to run to check that your work meets the item specifications. Passing these tests does not necessarily mean that you have gotten the item correct - there are additional hidden tests. However, if any of the tests do not pass, this tells you that your code is incorrect and needs changes to meet the specification. To determine what the issue is, read the comments in the code test cells, the error message you receive, and the item instructions.

## Short Answer Questions

For the short answer questions...

* _Use your own words_. It is OK to refer to outside resources when crafting your response, but _do not copy text from another source_.

* _Communicate clearly_. We are not grading your writing skills, but you can only receive full credit if your teacher is able to fully understand your response.

* _Be concise_. You should be able to answer most short answer questions in a sentence or two. Writing unnecessarily long answers increases the risk of you being unclear or saying something incorrect.

In [1]:
# Run this cell without changes to import the necessary libraries

from numbers import Number

%matplotlib inline

---
## Part 1: Gradient Descent [Suggested Time: 20 min]
---
In this part, you will describe how gradient descent works to calculate a parameter estimate. Below is an image of a best fit line from a linear regression model using TV advertising spending to predict product sales.

![best fit line](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/best_fit_line.png)

This best fit line can be described by the equation $y = mx + b$. Below is the RSS cost curve associated with the slope parameter $m$:

![cost curve](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/cost_curve.png)

where RSS is the residual sum of squares: $RSS = \sum_{i=1}^n(y_i - (mx_i + b))^2$
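
As a rough illustration of how such a cost curve is built, the sketch below evaluates RSS over a grid of candidate slopes while holding the intercept fixed. The data, the fixed intercept `b_fixed`, and the helper `rss` are hypothetical stand-ins and are not the dataset behind the plots above.

```python
import numpy as np

# Hypothetical data standing in for the TV-spending/sales example
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=50)
y = 0.05 * x + 7 + rng.normal(0, 1, size=50)


def rss(m, b, x, y):
    """Residual sum of squares for the line y = m*x + b."""
    residuals = y - (m * x + b)
    return np.sum(residuals ** 2)


# Evaluate RSS over a grid of candidate slopes, holding the intercept fixed
b_fixed = 7.0
slopes = np.linspace(0.0, 0.1, 101)
costs = np.array([rss(m, b_fixed, x, y) for m in slopes])

# The slope at the bottom of the curve is the one with the lowest RSS
best_m = slopes[np.argmin(costs)]
print("Slope with the lowest RSS:", best_m)
```

Plotting `costs` against `slopes` would give a curve with the same bowl shape as the one shown above.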

### 1.1) Short Answer: Explain how the RSS curve above could be used to find an optimal value for the slope parameter $m$.

Your answer should provide a one sentence summary, not every step of the process.

YOUR ANSWER HERE

**The RSS curve can be used to find the optimal value of $m$ by locating the slope at which RSS reaches its minimum (the bottom of the curve); in this case, the optimal value would be around 0.05.**



Below is a visualization showing the iterations of a gradient descent algorithm applied to the RSS curve. Each yellow marker represents an estimate, and the lines between markers represent the steps taken between estimates in each iteration. Numeric labels identify the iteration numbers.

![gradient descent](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/gd.png)

### 1.2) Short Answer: Explain why the distances between markers get smaller over successive iterations.

YOUR ANSWER HERE

**The size of each step is proportional to the gradient (slope) of the cost curve at the current estimate. As the estimates approach the minimum, the curve flattens and the gradient shrinks toward zero, so each successive step is smaller.**
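
The shrinking steps can be reproduced numerically. The sketch below runs a few gradient descent updates on the slope $m$ using the RSS formula from above, with the intercept held fixed. The data points and the learning rate are assumed values chosen only for illustration.

```python
import numpy as np

# Hypothetical data; the intercept is held fixed so only m is updated
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b = 0.0


def rss_gradient(m):
    """Derivative of RSS with respect to m: -2 * sum(x * (y - (m*x + b)))."""
    return -2 * np.sum(x * (y - (m * x + b)))


m = 0.0
learning_rate = 0.005  # assumed value
step_sizes = []
for _ in range(6):
    # Each step is the learning rate times the current gradient
    step = learning_rate * rss_gradient(m)
    m -= step
    step_sizes.append(abs(step))

print("Step sizes:", step_sizes)
print("Final estimate of m:", m)
```

Because the gradient gets closer to zero as `m` nears the minimum, every entry in `step_sizes` is smaller than the one before it.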

### 1.3) Short Answer: What would be the effect of decreasing the learning rate for this application of gradient descent?




YOUR ANSWER HERE

**The learning rate scales the size of each step, so decreasing it would make every step smaller. The algorithm would need more iterations (and more training time) to reach the minimum, but it would be less likely to overshoot it.**
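
To make the trade-off concrete, the sketch below counts how many iterations gradient descent needs to converge under two learning rates. The data, the tolerance `tol`, and the helper `iterations_to_converge` are hypothetical choices, not part of the assessment.

```python
import numpy as np

# Hypothetical data; the intercept is fixed at 0 so only the slope m is learned
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])


def iterations_to_converge(learning_rate, tol=1e-6, max_iter=10_000):
    """Run gradient descent on m and return (iterations, final m).

    Convergence is declared when the step size drops below tol.
    """
    m = 0.0
    for i in range(1, max_iter + 1):
        gradient = -2 * np.sum(x * (y - m * x))
        step = learning_rate * gradient
        m -= step
        if abs(step) < tol:
            return i, m
    return max_iter, m


iters_large, m_large = iterations_to_converge(0.005)
iters_small, m_small = iterations_to_converge(0.0005)
print("learning rate 0.005 :", iters_large, "iterations, m =", round(m_large, 4))
print("learning rate 0.0005:", iters_small, "iterations, m =", round(m_small, 4))
```

Both runs end at essentially the same slope, but the smaller learning rate takes many more iterations to get there.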

---
## Part 2: Logistic Regression [Suggested Time: 15 min]
---
In this part, you will answer general questions about logistic regression.

### 2.1) Short Answer: Provide one reason why logistic regression is better than linear regression for modeling a binary target/outcome.

YOUR ANSWER HERE

**Linear regression predicts unbounded continuous values, which can fall below 0 or above 1 and cannot be read as probabilities. Logistic regression passes the linear combination of features through the sigmoid function, so its output is always between 0 and 1 and can be interpreted as the probability of the event occurring, which makes it suitable for binary classification.**
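
The difference in output range can be seen on a small example. The sketch below fits both model types to a hypothetical single-feature dataset with a binary target and predicts for two inputs far outside the training range. The data and the inputs in `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical single-feature data with a binary target
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Inputs well outside the range seen during training
X_new = np.array([[-5.0], [15.0]])

# Linear regression treats the 0/1 labels as ordinary numbers
linear_pred = LinearRegression().fit(X, y).predict(X_new)

# Logistic regression returns the probability of class 1
logistic_prob = LogisticRegression().fit(X, y).predict_proba(X_new)[:, 1]

print("Linear regression predictions:", linear_pred)
print("Logistic regression probabilities:", logistic_prob)
```

The linear predictions fall below 0 and above 1, while the logistic probabilities stay strictly between 0 and 1.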

### 2.2) Short Answer: Compare logistic regression to another classification model of your choice (e.g. KNN, Decision Tree, etc.). What is one advantage and one disadvantage logistic regression has when compared with the other model?

YOUR ANSWER HERE

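**Advantage: Compared with a model such as KNN, logistic regression is fast to train and easy to interpret, because each coefficient shows the direction and strength of a feature's relationship with the outcome.**

**Disadvantage: It assumes a linear relationship between the features and the log-odds of the outcome, so it can underperform on non-linear decision boundaries, and without regularization it is prone to overfitting when there are many predictor variables in the model.**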

---
## Part 3: Classification Metrics [Suggested Time: 20 min]
---
In this part, you will make sense of classification metrics produced by various classifiers.

The confusion matrix below represents the predictions generated by a classification model on a small testing dataset.

![cnf matrix](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/cnf_matrix.png)

### 3.1) Create a numeric variable `precision` containing the precision of the classifier.

**Starter Code**

    precision = tp / (tp + fp)


In [2]:
# YOUR CODE HERE
# raise NotImplementedError()
# Giving a variable to the true positive (TP) and false positive (FP) counts
TP = 54  # true positive count
FP = 12  # false positive count

# Calculate precision
precision = TP / (TP + FP)

print("Precision:", precision)


Precision: 0.8181818181818182


In [3]:
# This test confirms that you have created a numeric variable named precision

assert isinstance(precision, Number)


### 3.2) Create a numeric variable `f1score` containing the F-1 score of the classifier.

**Starter Code**

    f1score =  2 * (precision * recall) / (precision + recall)

In [4]:
# YOUR CODE HERE
#raise NotImplementedError()

TP = 54  # true positive count
FP = 12  # false positive count
FN = 4   # false negative count
TN = 30  # true negative count

# Calculate precision
precision = TP / (TP + FP)

# Calculate recall
recall = TP / (TP + FN)

# Calculate F1 score
f1score = 2 * (precision * recall) / (precision + recall)

print("F1 Score:", f1score)


F1 Score: 0.8709677419354839


In [5]:
# This test confirms that you have created a numeric variable named f1score

assert isinstance(f1score, Number)
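
One way to double-check the hand calculations is to rebuild label arrays from the four counts used above and pass them to the corresponding functions in `sklearn.metrics`. This is only a sanity-check sketch; the array names `y_true` and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Counts as used in the answer cells above
TP, FP, FN, TN = 54, 12, 4, 30

# Rebuild true and predicted labels that produce exactly these counts
y_true = np.array([1] * TP + [0] * FP + [1] * FN + [0] * TN)
y_hat = np.array([1] * TP + [1] * FP + [0] * FN + [0] * TN)

precision_sk = precision_score(y_true, y_hat)
recall_sk = recall_score(y_true, y_hat)
f1_sk = f1_score(y_true, y_hat)

print("Precision:", precision_sk)
print("Recall:", recall_sk)
print("F1 Score:", f1_sk)
```

The values should agree with the manual results, since F1 simplifies to `2*TP / (2*TP + FP + FN)`.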


The ROC curves below were calculated for three different models applied to one dataset.

1. Only Age was used as a feature in the model
2. Only Estimated Salary was used as a feature in the model
3. All features were used in the model

![roc](https://raw.githubusercontent.com/learn-co-curriculum/dsc-cc-images/main/phase_3/many_roc.png)

### 3.3) Short Answer: Identify the best ROC curve in the above graph and explain why it is the best.

YOUR ANSWER HERE

**The best ROC curve in the graph is the one closest to the top-left corner of the plot, which is the pink curve for the model using all features. This curve reaches a high true positive rate (TPR, close to 1) while keeping the false positive rate (FPR) low (close to 0), which is the desired outcome for most classification tasks.**
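
Closeness to the top-left corner is commonly summarized with the area under the ROC curve (AUC). The sketch below compares two sets of hypothetical scores, one that separates the classes well and one that barely does; the data are synthetic and unrelated to the three models in the plot.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)

# Hypothetical scores: "strong" separates the classes well, "weak" barely does
strong_scores = y_true * 2.0 + rng.normal(0, 1, size=500)
weak_scores = y_true * 0.3 + rng.normal(0, 1, size=500)

# roc_curve returns the FPR/TPR pairs that would be plotted
fpr_strong, tpr_strong, _ = roc_curve(y_true, strong_scores)
fpr_weak, tpr_weak, _ = roc_curve(y_true, weak_scores)

auc_strong = roc_auc_score(y_true, strong_scores)
auc_weak = roc_auc_score(y_true, weak_scores)
print("AUC (strong scores):", round(auc_strong, 3))
print("AUC (weak scores):", round(auc_weak, 3))
```

The curve that hugs the top-left corner has the AUC closer to 1, while a curve near the diagonal has an AUC close to 0.5.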

Run the following cells to load a sample dataset, run a classification model on it, and perform some EDA.

In [7]:
# Run this cell without changes

# Include relevant imports
import pickle, sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('sample_network_data.pkl', 'rb'))

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


In [8]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

### 3.4) Short Answer: Explain how the distribution of `y` shown above could explain the high accuracy score of the classification model.

YOUR ANSWER HERE

**The target variable `y` is highly imbalanced: 257 of the 270 observations (about 95%) belong to class 0. A model that always predicted the majority class would therefore reach about 95% accuracy without identifying any purchasers, so an accuracy of 0.956 does not by itself show that the model performs well. This highlights the importance of other evaluation metrics, such as precision, recall, and F1 score, for assessing performance on imbalanced datasets.**
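
The baseline implied by the class counts can be computed directly. The sketch below uses the counts from the `y.value_counts()` output above; note that they describe the full dataset, while the reported accuracy was measured on the test split only.

```python
# Class counts taken from the y.value_counts() output above
n_negative = 257
n_positive = 13

# Accuracy of a "model" that always predicts the majority class (0)
baseline_accuracy = n_negative / (n_negative + n_positive)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")  # prints 0.952
```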

### 3.5) Short Answer: What is one method you could use to improve your model to address the issue discovered in Question 3.4?




YOUR ANSWER HERE

**One method to address the class imbalance is resampling the training data: oversampling the minority class, undersampling the majority class, or a combination of both, so that the model is trained on a more balanced class distribution.**
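
As a minimal sketch of random oversampling, the code below duplicates minority-class rows (sampling with replacement) until both classes are the same size. The feature values are randomly generated placeholders with the same 257-to-13 imbalance as `y`; in practice the resampling would be applied to the training split only, never to the test data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data with the same 257-to-13 imbalance as y above
X_train = rng.normal(size=(270, 2))
y_train = np.array([0] * 257 + [1] * 13)

minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)

# Randomly oversample the minority class (with replacement) up to the majority size
resampled_minority_idx = rng.choice(minority_idx, size=len(majority_idx), replace=True)
balanced_idx = np.concatenate([majority_idx, resampled_minority_idx])
rng.shuffle(balanced_idx)

X_balanced = X_train[balanced_idx]
y_balanced = y_train[balanced_idx]
print("Class counts after oversampling:", np.bincount(y_balanced))
```

A model fit on `X_balanced` and `y_balanced` sees both classes equally often, which tends to improve recall on the minority class at some cost to precision.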


---
## Part 4: Decision Trees [Suggested Time: 20 min]
---
In this part, you will use decision trees to fit a classification model to a wine dataset. The data contain the results of a chemical analysis of wines grown in one region in Italy using three different cultivars (grape types). There are thirteen features from the measurements taken, and the wines are classified by cultivar in the `target` variable.

In [9]:
# Run this cell without changes

# Relevant imports
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Load the data
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'

### 4.1) Use `train_test_split()` to evenly split `X` and `y` data between training sets (`X_train` and `y_train`) and test sets (`X_test` and `y_test`), with `random_state=1`.

Do not alter `X` or `y` before performing the split.

**Starter Code**

    X_train, X_test, y_train, y_test =

In [10]:
# YOUR CODE HERE
#raise NotImplementedError()

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

# Check the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)


Shape of X_train: (89, 13)
Shape of y_train: (89,)
Shape of X_test: (89, 13)
Shape of y_test: (89,)


In [11]:
# These tests confirm that you have created DataFrames named X_train, X_test and Series named y_train, and y_test

assert type(X_train) == pd.DataFrame
assert type(X_test) == pd.DataFrame
assert type(y_train) == pd.Series
assert type(y_test) == pd.Series

# These tests confirm that you have split the data evenly between train and test sets

assert X_train.shape[0] == X_test.shape[0]
assert y_train.shape[0] == y_test.shape[0]


### 4.2) Create an untuned decision tree classifier `wine_dt` and fit it using `X_train` and `y_train`, with `random_state=1`.

Use parameter defaults for your classifier. You must use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)).

**Starter Code**

    wine_dt =

In [12]:
# YOUR CODE HERE
# raise NotImplementedError()

from sklearn.tree import DecisionTreeClassifier

# Create an untuned decision tree classifier
wine_dt = DecisionTreeClassifier(random_state=1)

# Fit the classifier using X_train and y_train
wine_dt.fit(X_train, y_train)


In [13]:
# This test confirms that you have created a DecisionTreeClassifier named wine_dt

assert type(wine_dt) == DecisionTreeClassifier

# This test confirms that you have set random_state to 1

assert wine_dt.get_params()['random_state'] == 1

# This test confirms that wine_dt has been fit

sklearn.utils.validation.check_is_fitted(wine_dt)


### 4.3) Create an array `y_pred` generated by using `wine_dt` to make predictions for the test data.

**Starter Code**

    y_pred =

In [14]:
# YOUR CODE HERE
#raise NotImplementedError()

# Generate predictions for the test data
y_pred = wine_dt.predict(X_test)


In [15]:
# This test confirms that you have created an array-like object named y_pred

assert type(np.asarray(y_pred)) == np.ndarray


### 4.4) Create a numeric variable `wine_dt_acc` containing the accuracy score for your predictions.

Hint: You can use the `sklearn.metrics` module.

**Starter Code**

    wine_dt_acc =

In [16]:
# YOUR CODE HERE
# raise NotImplementedError()

from sklearn.metrics import accuracy_score

# Calculate the accuracy score for the predictions
wine_dt_acc = accuracy_score(y_test, y_pred)

print("Accuracy Score:", wine_dt_acc)


Accuracy Score: 0.8764044943820225


In [17]:
# This test confirms that you have created a numeric variable named wine_dt_acc

assert isinstance(wine_dt_acc, Number)


### 4.5) Short Answer: Based on the accuracy score, does the model seem to be performing well or to have substantial performance issues? Explain your answer.

YOUR ANSWER HERE

**Based on the accuracy score of approximately 0.876, the model seems to be performing quite well: it correctly predicted the cultivar for about 87.6% of the wines in the test set, well above the roughly 33% expected from guessing uniformly at random among the three classes.**
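
Test accuracy is easier to judge next to training accuracy. The sketch below repeats the split and fit from this part in a self-contained form and reports both scores; the variable name `tree` is illustrative, and exact values may vary slightly across Scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data, split, and untuned classifier settings as in 4.1 and 4.2
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Compare accuracy on the data the tree was fit to with accuracy on unseen data
train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

An untuned tree typically fits its training data almost perfectly, so a noticeable gap between the two scores would suggest some overfitting that tuning (for example, limiting `max_depth`) might reduce.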