# Phase 3 Code Challenge

This assessment is designed to test your understanding of Module 3 material. It covers:

* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully_. You will be asked both to write code and to answer short answer questions.

## Short Answer Questions 

For the short answer questions...

* _Use your own words_. It is OK to refer to outside resources when crafting your response, but _do not copy text from another source_.

* _Communicate clearly_. We are not grading your writing skills, but you can only receive full credit if your teacher is able to fully understand your response. 

* _Be concise_. You should be able to answer most short answer questions in a sentence or two. Writing unnecessarily long answers increases the risk of you being unclear or saying something incorrect.

In [39]:
# Run this cell without changes to import the necessary libraries
import pickle, sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score
from numbers import Number

---
## Part 1: Logistic Regression
---
In this part, you will answer general questions about logistic regression.

### 1.1) Short Answer: Provide one reason why logistic regression is better than linear regression for modeling a binary target/outcome.

# Your answer here
Logistic regression is better suited to a binary target because it passes a linear combination of the features through the sigmoid function, so its output is always a probability between 0 and 1 that can be thresholded into a class. Linear regression produces unbounded continuous values, which can fall below 0 or above 1 and therefore cannot be interpreted as probabilities.
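
A minimal sketch of this difference, assuming a hypothetical one-feature model with made-up coefficients (`intercept` and `slope` are illustrative values, not fitted ones). The raw linear output is unbounded, while the sigmoid keeps every output strictly between 0 and 1.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration
intercept, slope = -0.5, 0.4
xs = [-10, -1, 0, 1, 10]

# Linear regression style output: unbounded
linear_outputs = [intercept + slope * x for x in xs]

# Logistic regression style output: the same score passed through the sigmoid
logistic_outputs = [sigmoid(score) for score in linear_outputs]

for x, lin, log in zip(xs, linear_outputs, logistic_outputs):
    print(f"x={x:>3}  linear={lin:6.2f}  logistic={log:.3f}")
```

The linear column leaves the [0, 1] range at the extreme values of `x`, while the logistic column never does.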



### 1.2) Short Answer: Compare logistic regression to another classification model of your choice (e.g. Decision Tree). What is one advantage and one disadvantage logistic regression has when compared with the other model?

# Your answer here
Compared with a decision tree, logistic regression has the advantage of being fast and computationally cheap to train on large datasets, and it produces interpretable coefficients and probability outputs. Its disadvantage is that it assumes a linear decision boundary, so it performs poorly on nonlinear problems unless the features are transformed by hand, whereas a decision tree is more flexible and captures nonlinear relationships directly.
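
The flexibility trade-off can be sketched on synthetic data. The XOR-style dataset below is an assumption made purely for illustration: no single straight line separates the classes, so a linear boundary should struggle where a tree does not.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data (an assumption for illustration):
# the label is 1 when exactly one of the two features is positive
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

log_reg = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Training accuracy is enough to show the difference in flexibility
log_reg_acc = log_reg.score(X, y)
tree_acc = tree.score(X, y)
print(f"Logistic regression: {log_reg_acc:.2f}, decision tree: {tree_acc:.2f}")
```

This compares training accuracy only. The same flexibility that lets the tree fit this pattern also makes it prone to overfitting, so a real comparison would use held-out data.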


---
## Part 2: Classification Metrics
---
In this part, you will make sense of classification metrics produced by various classifiers.

The confusion matrix below represents the predictions generated by a classification model on a small testing dataset.

![cnf matrix](https://curriculum-content.s3.amazonaws.com/data-science/images/cnf_matrix.png)

### 2.1) Create a numeric variable `precision` containing the precision of the classifier.

In [40]:
# Replace None with appropriate code
TP = 30
FP = 4

# Precision formula: TP / (TP + FP)
precision = TP / (TP + FP)

print(precision)


0.8823529411764706


In [41]:
# This test confirms that you have created a numeric variable named precision

assert isinstance(precision, Number)

### 2.2) Create a numeric variable `f1score` containing the F-1 score of the classifier.

In [42]:
# Replace None with appropriate code
TP = 30
FP = 4
FN = 12
precision = TP / (TP + FP)
recall = TP / (TP + FN)
# Calculate F1 as the harmonic mean of precision and recall
f1score = 2 * (precision * recall) / (precision + recall)
f1score


0.7894736842105262

In [43]:
# This test confirms that you have created a numeric variable named f1score

assert isinstance(f1score, Number)
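
The two calculations above can be collected into one helper. This is a sketch only: `precision_recall_f1` is a hypothetical name, and the counts are the ones read from the confusion matrix (TP = 30, FP = 4, FN = 12). It also checks that F1 matches the equivalent count-based form 2TP / (2TP + FP + FN).

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1score = precision_recall_f1(tp=30, fp=4, fn=12)

# F1 can equivalently be written directly in terms of the counts
f1_from_counts = 2 * 30 / (2 * 30 + 4 + 12)
assert math.isclose(f1score, f1_from_counts)

print(round(precision, 3), round(recall, 3), round(f1score, 3))
```

The helper does not guard against division by zero, which would occur if the model made no positive predictions or the data contained no positives.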

The ROC curves below were calculated for three different models applied to one dataset.

1. Only Age was used as a feature in the model
2. Only Estimated Salary was used as a feature in the model
3. All features were used in the model

![roc](https://curriculum-content.s3.amazonaws.com/data-science/images/many_roc.png)

### 2.3) Short Answer: Identify the best ROC curve in the above graph and explain why it is the best. 

# Your answer here
The best ROC curve is the one for the model using all features (the pink line). It lies closest to the top-left corner and has the highest area under the curve (AUC), meaning that for any given false positive rate it achieves a higher true positive rate than the other two models. This shows that the model is better at separating the two classes across all classification thresholds.
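
One way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way for two hypothetical models. The labels and scores are made up, and `auc_by_ranking` is an illustrative helper rather than a library function.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for s, label in zip(scores, y_true) if label == 1]
    negatives = [s for s, label in zip(scores, y_true) if label == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Made-up labels and scores for two hypothetical models
y_true = [0, 0, 1, 1]
weaker_scores = [0.1, 0.4, 0.35, 0.8]
stronger_scores = [0.1, 0.2, 0.7, 0.9]

weaker_auc = auc_by_ranking(y_true, weaker_scores)
stronger_auc = auc_by_ranking(y_true, stronger_scores)
print(weaker_auc, stronger_auc)
```

The model whose scores rank every positive above every negative gets the larger AUC, which corresponds to a ROC curve hugging the top-left corner.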


Run the following cells to load a sample dataset, run a classification model on it, and perform some EDA.

In [44]:
# Run this cell without changes
network_df = pickle.load(open('sample_network_data.pkl', 'rb'))

# partition features and target 
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


In [45]:
# Run this cell without changes

y.value_counts()

Purchased
0    257
1     13
Name: count, dtype: int64

### 2.4) Short Answer: Explain how the distribution of `y` shown above could explain the high accuracy score of the classification model.

# Your answer here

The distribution of `y` shows a large class imbalance: 257 of the 270 observations (about 95%) belong to class 0. A model that predicted 0 for every observation would therefore already reach roughly 95% accuracy, so the 0.956 score is barely better than that trivial baseline. High accuracy here does not mean the model identifies the minority class, 1, well.
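
The arithmetic behind this can be sketched using only the class counts printed by `y.value_counts()`. Note the assumption: the baseline here is computed on the full dataset, whereas the reported 0.956 comes from the test split, whose class shares may differ slightly.

```python
# Class counts taken from the y.value_counts() output above
counts = {0: 257, 1: 13}
total = sum(counts.values())

# Accuracy of a "model" that always predicts the majority class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Such a model never predicts class 1, so its recall on class 1 is zero
minority_recall = 0 / counts[1]

print(f"Baseline accuracy: {baseline_accuracy:.3f}, minority recall: {minority_recall:.1f}")
```

A metric such as recall, precision, or F1 on class 1 exposes this failure, while accuracy hides it.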

### 2.5) Short Answer: What is one method you could use to improve your model to address the issue discovered in Question 2.4?

# Your answer here

Hyperparameter tuning or other model optimization techniques can be used to achieve better performance, for example setting `class_weight='balanced'` so that errors on the minority class are penalized more heavily. Alternatively, we can use ensemble methods. A Random Forest builds multiple decision trees on different bootstrap samples of the data and classifies by aggregating their votes. This can help on imbalanced data, especially when combined with class weighting or resampling, because the trees see varied samples of the majority and minority classes.
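
A minimal sketch of one such adjustment, class weighting, on synthetic data. The dataset is an assumption made for illustration (one feature, 5% positives) and is not the network data used above. `class_weight='balanced'` is an existing `LogisticRegression` option that reweights classes inversely to their frequency.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (an assumption for illustration): 5% positives
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(1900, 1)),
                    rng.normal(1.5, 1.0, size=(100, 1))])
y = np.concatenate([np.zeros(1900, dtype=int), np.ones(100, dtype=int)])

# stratify=y keeps the 95/5 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_train, y_train)
balanced = LogisticRegression(class_weight='balanced').fit(X_train, y_train)

# Recall on class 1 measures how many minority examples are actually found
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))
print(f"Minority recall: plain={recall_plain:.2f}, balanced={recall_balanced:.2f}")
```

Higher minority recall usually comes at the cost of more false positives, so precision should be checked alongside it.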


---
## Part 3: Decision Trees
---
In this part, you will use decision trees to fit a classification model to a wine dataset. The data contain the results of a chemical analysis of wines grown in one region in Italy using three different cultivars (grape types). There are thirteen features from the measurements taken, and the wines are classified by cultivar in the `target` variable.

In [46]:
# Run this cell without changes

# Relevant imports 
import pandas as pd 
import numpy as np 
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'

### 3.1) Use `train_test_split()` to split `X` and `y` data between training sets (`X_train` and `y_train`) and test sets (`X_test` and `y_test`), with `random_state=1`. Evenly split the data between train and test (50/50).

Do not alter `X` or `y` before performing the split.

In [47]:
# Replace None with appropriate code

X_train, X_test, y_train, y_test =  train_test_split(X, y, test_size=0.5, random_state=1)

In [48]:
# These tests confirm that you have created DataFrames named X_train and X_test and Series named y_train and y_test

assert type(X_train) == pd.DataFrame
assert type(X_test) == pd.DataFrame
assert type(y_train) == pd.Series
assert type(y_test) == pd.Series

# These tests confirm that you have split the data evenly between train and test sets

assert X_train.shape[0] == X_test.shape[0]
assert y_train.shape[0] == y_test.shape[0]

In [50]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (89, 13)
X_test shape: (89, 13)
y_train shape: (89,)
y_test shape: (89,)


### 3.2) Create an untuned decision tree classifier `wine_dt` with `random_state=1` and fit it using `X_train` and `y_train`. 

Use parameter defaults for your classifier. You must use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)).

In [51]:
# Replace None with appropriate code

wine_dt = DecisionTreeClassifier(random_state=1)

# Fit
wine_dt.fit(X_train, y_train)

In [52]:
# This test confirms that you have created a DecisionTreeClassifier named wine_dt

assert type(wine_dt) == DecisionTreeClassifier

# This test confirms that you have set random_state to 1

assert wine_dt.get_params()['random_state'] == 1

# This test confirms that wine_dt has been fit

sklearn.utils.validation.check_is_fitted(wine_dt)

### 3.3) Create an array `y_pred` generated by using `wine_dt` to make predictions for the test data.

In [55]:
# Replace None with appropriate code

y_pred = wine_dt.predict(X_test)
y_pred

array([2, 1, 1, 1, 0, 2, 1, 0, 2, 1, 0, 1, 1, 0, 1, 1, 2, 1, 1, 0, 0, 2,
       2, 0, 1, 2, 0, 0, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 0, 0, 2, 2, 0,
       0, 1, 1, 0, 0, 0, 1, 2, 2, 0, 1, 0, 0, 1, 2, 1, 1, 0, 2, 1, 2, 0,
       1, 0, 1, 0, 2, 1, 2, 2, 1, 1, 1, 2, 0, 1, 2, 0, 1, 0, 2, 1, 1, 0,
       1])

In [54]:
# This test confirms that you have created an array-like object named y_pred

assert type(np.asarray(y_pred)) == np.ndarray

### 3.4) Create a numeric variable `wine_dt_acc` containing the accuracy score for your predictions. 

Hint: You can use the `sklearn.metrics` module or the model itself.

In [56]:
# Replace None with appropriate code

wine_dt_acc = accuracy_score(y_test, y_pred)
wine_dt_acc

0.8764044943820225

In [57]:
# This test confirms that you have created a numeric variable named wine_dt_acc

assert isinstance(wine_dt_acc, Number)
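
The hint mentions two routes to the same number. The sketch below rebuilds the split and model from the earlier cells (using NumPy arrays rather than DataFrames, for brevity) and shows that `accuracy_score`, the model's own `score` method, and a manual fraction of matching labels all agree.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the split and model from the earlier cells
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)
wine_dt = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
y_pred = wine_dt.predict(X_test)

acc_metric = accuracy_score(y_test, y_pred)   # sklearn.metrics
acc_model = wine_dt.score(X_test, y_test)     # the model itself
acc_manual = np.mean(y_pred == y_test)        # fraction of matching labels

print(acc_metric, acc_model, acc_manual)
```

The exact value may vary slightly between scikit-learn versions, but the three results should always match one another.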

### 3.5) Short Answer: Based on the accuracy score, does the model seem to be performing well or does it have substantial performance issues? Explain your answer.

# Your answer here
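The model performs reasonably well, but not perfectly. An accuracy of 87.6% means it classified 78 of the 89 test wines correctly and misclassified 11. Because the three cultivars are fairly evenly represented in the wine dataset, accuracy is a meaningful metric here, and 87.6% is far better than always guessing the most common cultivar. The remaining errors may reflect overfitting, since an untuned decision tree keeps splitting until it fits the training data almost perfectly.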
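
One way to put the test accuracy in context is to compare it with the training accuracy and with a majority-class baseline. This is a sketch that rebuilds the same split and model as above. The baseline rule (always predict the most common training class) is an assumption chosen for illustration.

```python
from collections import Counter

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the split and model from the earlier cells
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)
wine_dt = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_acc = wine_dt.score(X_train, y_train)
test_acc = wine_dt.score(X_test, y_test)

# Baseline: always predict the most common class seen in training
majority_class = Counter(y_train).most_common(1)[0][0]
baseline_acc = float((y_test == majority_class).mean())

print(f"train={train_acc:.3f}  test={test_acc:.3f}  baseline={baseline_acc:.3f}")
```

A large gap between training and test accuracy would point to overfitting, which could be reduced by limiting `max_depth` or `min_samples_leaf`. A test accuracy well above the baseline indicates the model has learned real structure.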
