In [10]:
# Import the libraries and dependencies:
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
import hvplot.pandas
import numpy as np

# Create the DataFrame:
startup_df = pd.read_csv(
    Path('start-up_success.csv')
)
display(startup_df.head())
display(startup_df.tail())

Unnamed: 0,Financial Performance,Industry Health,Firm Category
0,-2.76165,-2.414516,0
1,2.867162,1.989524,1
2,-0.70123,-1.074845,0
3,-3.516214,-1.928217,0
4,-0.981901,-2.798853,0


Unnamed: 0,Financial Performance,Industry Health,Firm Category
1319,-2.458081,-3.3198,0
1320,-2.2612,-3.074141,0
1321,-0.95902,-2.249914,0
1322,-1.10127,-1.755786,0
1323,-1.88401,-0.385503,0


In [11]:
# INTRODUCTION TO CLASSIFIERS
# Financial companies always seek ways to get faster and smarter data-driven decisions and outcomes.
# This impacts everything from credit risk determination to operational design to fraud detection.
# In all these cases, companies seek ways to leverage vast amounts of data to automatically make decisions or predict outcomes.
# Classification is the area of supervised learning that's designed to model and predict discrete outcomes.
# FinTech companies have used classification models to drastically improve their efforts to properly classify applicants, predict market declines, and classify fraudulent transactions or suspicious activity.
# Classification models have allowed the financial industry to become more proactive than reactive.
# Companies can predict outcomes with probable certainty, which allows for more effective and efficient mitigation.
# In this lesson, you'll learn how to create, train, use, and evaluate classification models, such as LOGISTIC REGRESSION.
# You'll also split the data into training and testing datasets so that you can evaluate the model in an unbiased way.

In [12]:
# LOGISTIC REGRESSION
# The first classification algorithm that we'll explore is logistic regression.
# People consider this one of the most universal and capable classification algorithms.
# People also often consider it the starting point for any classification project.

In [13]:
# APPLY LOGISTIC REGRESSION
# Let's visualize some financial data about firms to discover how logistic regression predicts categories in practice.
# Suppose we have data about two groups of startups: Healthy firms and unhealthy firms.
# The former eventually performed well, and the latter eventually went bankrupt.
# Our goal is to use classification to correctly predict which of these two categories a firm belongs in.
# By using the model-fit-predict pattern, we'll take things one step further by evaluating how well the model makes its predictions.
# To apply logistic regression, we'll first prepare the data, and we'll then split the data into training and testing sets.

In [15]:
# PREPARE THE DATA
# To prepare the data, let's first count the number of firms in each category.
# Suppose that a single column, named 'Firm Category', in the DataFrame contains the data about the firm categories.
# A value of 0 in this column means an unhealthy firm, and a value of 1 means a healthy firm.
# We can thus use the `value_counts` method of that column to count the number of firms in each category:

# Count how many firms are in each category:
display(startup_df['Firm Category'].value_counts())
display(startup_df.head(10))

0    978
1    346
Name: Firm Category, dtype: int64

Unnamed: 0,Financial Performance,Industry Health,Firm Category
0,-2.76165,-2.414516,0
1,2.867162,1.989524,1
2,-0.70123,-1.074845,0
3,-3.516214,-1.928217,0
4,-0.981901,-2.798853,0
5,-2.893176,-1.935258,0
6,-2.056318,-2.485971,0
7,-2.029454,-1.910105,0
8,-1.84136,-1.904793,0
9,-3.295717,-0.395888,0


In [16]:
# Note that 346 firms performed well, and 978 went bankrupt.
# It seems that many startups don't eventually succeed.
# This might occur because the industry that a startup belongs to doesn't do well.
# Or, the firm itself might suffer a lower financial performance.
# This makes it even more important to build a logistic regression model to identify the firms that have the highest probability of success vs. failure.
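
# The imbalance between the two categories is easier to judge as a proportion than as raw counts.
# The following is a minimal sketch, not part of the original lesson: `firm_category` is a hypothetical
# stand-in for `startup_df['Firm Category']`, rebuilt from the counts shown above (978 and 346).
# It uses the `normalize=True` option of `value_counts` to return shares instead of counts.

```python
import pandas as pd

# Hypothetical stand-in for startup_df['Firm Category'], built from the counts shown above.
firm_category = pd.Series([0] * 978 + [1] * 346, name='Firm Category')

# Raw counts per category:
category_counts = firm_category.value_counts()

# Share of each category (normalize=True returns proportions that sum to 1):
category_shares = firm_category.value_counts(normalize=True)

print(category_counts)
print(category_shares.round(3))
```

# With the real data, the same call would be `startup_df['Firm Category'].value_counts(normalize=True)`.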

In [18]:
# SPLIT THE DATA INTO TRAINING AND TESTING SETS
# We need to split our data into training and testing data.
# The reason is to train our CLASSIFIERS, the algorithms that learn models, with the training data, and then evaluate the models with the testing data.
# Doing so helps us make unbiased evaluations of the model, because we'll find out how the model performs when classifying data that it's never encountered before (the test data).
# We can use the `train_test_split` function from the scikit-learn library to automatically split our data into training and testing data.
# In its simplest form, we pass this function two arguments: the X data and the y data.
# The X data consists of the variables that the model will use to make predictions.
# These variables are sometimes called the FEATURES.
# The y data is the variable that we want to predict and is sometimes called the TARGET variable.
# We'll use each firm's 'Financial Performance' and 'Industry Health' scores to predict whether the firm will become healthy ('Firm Category' value = 1) or unhealthy ('Firm Category' value = 0).

# Split the data into training and testing sets.
# Create X, the features DataFrame:
features = startup_df[['Financial Performance', 'Industry Health']]

# Create y, the target Series:
target = startup_df['Firm Category']

# Use train_test_split to separate the data:
training_features, testing_features, training_targets, testing_targets = train_test_split(features, target)
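
# Because `train_test_split` shuffles the rows, each run can produce a different split.
# The following is a minimal sketch, assuming synthetic data: `demo_df` is a hypothetical DataFrame
# that only borrows the lesson's column names and category counts. It shows two optional
# `train_test_split` parameters: `random_state`, which makes the split repeatable, and `stratify`,
# which keeps the share of each category similar in the training and testing sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with the same column names and category counts as the lesson's dataset.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'Financial Performance': rng.normal(size=1324),
    'Industry Health': rng.normal(size=1324),
    'Firm Category': [0] * 978 + [1] * 346,
})
demo_features = demo_df[['Financial Performance', 'Industry Health']]
demo_target = demo_df['Firm Category']

# random_state fixes the shuffle; stratify keeps the category ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    demo_features, demo_target, random_state=1, stratify=demo_target
)

# The mean of a 0/1 column is the share of 1s, so these two values should be close.
print(y_train.mean(), y_test.mean())
```

# The lesson's own split uses neither parameter, so its row indices will differ from run to run.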

In [19]:
# We can preview one of the DataFrames, named `training_features`, that results from this function.
# By checking the index of this DataFrame, we can observe that it includes only some rows from the original `features` DataFrame.
# We'll later use this `training_features` DataFrame to fit our model.
# We'll then test how our model makes predictions by using the testing data that we kept separate.
training_features

Unnamed: 0,Financial Performance,Industry Health
235,-2.445014,-4.080513
1108,1.621700,0.874349
1069,2.930071,0.903317
476,1.903432,2.002327
190,-1.532283,-1.740581
...,...,...
290,-2.107835,-1.160625
498,-2.030786,-2.386181
964,3.766929,1.740535
512,1.151973,1.377842


In [20]:
# DEEP DIVE 
# In the preceding output, did you notice the number of rows in the training_features DataFrame is 993?
# Also notice that the number of rows in the original DataFrame is 1,324 (you can observe this from the output of the value_counts function).
# This means that the scikit-learn `train_test_split` function used 75% of the original dataset for training the model (993/1,324 = .75).
# The `train_test_split` function has this as the default setting.
# Note that people commonly use 70% to 75% of the data for training.
# But by using the `train_size` parameter of `train_test_split`, you can set that portion to any fraction between 0 and 1 (that is, between 0% and 100%).
# For more information about this parameter and other flexibility that the `train_test_split` function has, refer to the following web page: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
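
# The following is a minimal sketch of the `train_size` parameter, assuming placeholder data:
# `X_demo` and `y_demo` are hypothetical arrays that only match the lesson's row count (1,324).
# It compares the default split with an 80% training split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data with as many rows as the lesson's dataset (1,324).
X_demo = np.arange(1324 * 2).reshape(1324, 2)
y_demo = np.arange(1324) % 2

# Default split: 75% of the rows for training, 25% for testing.
X_train_default, X_test_default, _, _ = train_test_split(X_demo, y_demo)

# Custom split: 80% of the rows for training, the remainder for testing.
X_train_80, X_test_80, _, _ = train_test_split(X_demo, y_demo, train_size=0.8)

print(len(X_train_default), len(X_test_default))
print(len(X_train_80), len(X_test_80))
```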

In [21]:
# MODEL: CREATE A MODEL
# Let's start predicting - to do so, we will first import the Logistic Regression class from the scikit-learn library:
    # Web Link: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    
# Import LogisticRegression from sklearn:
from sklearn.linear_model import LogisticRegression

In [22]:
# Now we can create an instance of the `LogisticRegression` class and assign it to a variable named `logistic_regression_model`.
# By doing so, we save an untrained model to a new variable:

logistic_regression_model = LogisticRegression()

In [23]:
# FIT: TRAIN THE MODEL
# We'll now supply some training data to the model so that it can learn and mathematically adjust itself to best represent this data.
# Remember that our goal with machine learning is to model real-world data - so that we can later use the model to make decisions and predictions.
# To fit, or train, our model to the training data, we can use the `fit` method that scikit-learn makes available on the logistic regression classifier.

# Fit the model:
logistic_regression_model.fit(training_features, training_targets)

LogisticRegression()

In [24]:
# The output is the fitted model itself: the `fit` method returns the `LogisticRegression` instance that it just trained, and the notebook displays it.
# The `fit` method uses the training data to learn how the features relate to each category.
# In this example, the categories indicate whether a firm will perform well or go bankrupt.
# Here's what's happening mathematically: In logistic regression (as in many machine learning models), an optimization algorithm iteratively adjusts the model's parameters, in effect trying many versions of the model.
# The algorithm ultimately settles on the version that best distinguishes the different categories from each other.
# In this case, the algorithm decides which version of the model best distinguishes the firms that have the highest probability of success from the firms that don't.
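
# To make the math concrete, the following is a minimal sketch, assuming synthetic data:
# `X_demo`, `y_demo`, and `demo_model` are hypothetical and are not the lesson's dataset or model.
# For a two-category problem, a fitted `LogisticRegression` stores one coefficient per feature
# (`coef_`) plus an intercept (`intercept_`). The sketch rebuilds the predicted probability by
# passing the weighted sum of the features through the sigmoid function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature data: category 0 is centered at (-2, -2), category 1 at (2, 2).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-2.0, size=(100, 2)),
    rng.normal(loc=2.0, size=(100, 2)),
])
y_demo = np.array([0] * 100 + [1] * 100)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# The fitted parameters: one coefficient per feature plus an intercept.
coefficients = demo_model.coef_[0]
intercept = demo_model.intercept_[0]

# Probability of category 1 = sigmoid(intercept + coefficients . features).
linear_score = X_demo @ coefficients + intercept
manual_probability = 1.0 / (1.0 + np.exp(-linear_score))

# A row is assigned to category 1 when that probability exceeds 0.5.
manual_prediction = (manual_probability > 0.5).astype(int)

print(coefficients, intercept)
```

# The manually computed values should agree with `demo_model.predict_proba(X_demo)[:, 1]` and `demo_model.predict(X_demo)`.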

In [27]:
# PREDICT: CLASSIFY FEATURES WITH THE MODEL
# We can use the `predict` function to classify our features and discover if the model can assign them to the correct targets.
# We can then put those predictions and the actual target values into the same DataFrame to compare them and find out if they match.

# Generate predictions from the model we just fit:
predictions = logistic_regression_model.predict(training_features)

# Convert those predictions (and actual values) to a DataFrame:
training_results_df = pd.DataFrame({
    'Prediction': predictions,
    'Actual': training_targets})

# Review the DataFrame:
training_results_df

Unnamed: 0,Prediction,Actual
235,0,0
1108,1,1
1069,1,1
476,1,1
190,0,0
...,...,...
290,0,0
498,0,0
964,1,1
512,1,1


In [28]:
# Note that the model did a good job of predicting whether a startup ultimately succeeded based on its 'Financial Performance' and 'Industry Health' scores.
# In fact, based on the resulting DataFrame, the predictions appear to exactly match the targets for the training data.
# The ones and zeroes in the 'Prediction' column match those in the 'Actual' column.
# Does this mean that we found the ultimate formula for predicting startup success?
# In reality, we expect the model to excel at classifying this data.
# That's because the model is trained specifically to correctly classify this (and only this) data.
# But in the real world, we'd apply the model to classifying startups that it hasn't already been trained to recognize.
# The performance on this new, previously unknown data won't typically prove as stellar.
# For this reason, how a model generalizes to new data is a more important metric for evaluating its usefulness.
# Remember that the `train_test_split` function outputs a testing dataset for X and y in addition to the training dataset that we just examined.
# So let's find out how to evaluate models by using the testing dataset.

In [29]:
# PREDICT: TEST THE MODEL ON NEW DATA
# Similarly to the way that we made predictions by using the training data, we can make new predictions by using the testing data.
# This will give us a better sense of how well this model will perform when we apply it to new data (that is, in real life).
# We've already trained the model - so it should be able to take the features of new data and predict which category each startup belongs to.
# To make predictions on the testing data, we use `logistic_regression_model`, which we've already fit.
# This time, we run `predict` by using the `testing_features` DataFrame rather than the `training_features` DataFrame.
# Like before, we save those predictions and actual values to a DataFrame:

# Apply the fitted model to the test dataset:
testing_predictions = logistic_regression_model.predict(testing_features)

# Save both the test predictions and the actual test values to a DataFrame:
testing_results_df = pd.DataFrame({
    'Testing Data Predictions': testing_predictions,
    'Testing Data Actual Targets': testing_targets
})

# Review the testing results:
testing_results_df

Unnamed: 0,Testing Data Predictions,Testing Data Actual Targets
1246,0,0
219,1,1
864,0,0
1098,0,0
1317,0,0
...,...,...
684,0,0
420,0,0
1121,0,0
1207,0,0


In [30]:
# Note that the model still accurately predicted which target each startup belongs to - despite this being the testing data.
# The 0s and 1s in both columns match each other, at least in the rows that the truncated output displays.
# Perhaps we can use this model to predict which additional startups will succeed.
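
# A displayed DataFrame hides most of its rows, so a visual check can miss mismatches.
# The following is a minimal sketch, assuming toy values: `demo_results_df` is a hypothetical
# DataFrame that only borrows the column names of `testing_results_df`. It counts and lists
# the rows where the prediction differs from the actual target.

```python
import pandas as pd

# Hypothetical results DataFrame with the same column names as testing_results_df.
demo_results_df = pd.DataFrame({
    'Testing Data Predictions': [0, 1, 0, 0, 1, 0],
    'Testing Data Actual Targets': [0, 1, 0, 1, 1, 0],
})

# Flag the rows where the prediction differs from the actual target.
is_mismatch = (
    demo_results_df['Testing Data Predictions']
    != demo_results_df['Testing Data Actual Targets']
)

# Count the mismatches and keep only the mismatched rows for inspection.
mismatch_count = int(is_mismatch.sum())
mismatched_rows = demo_results_df[is_mismatch]

print(mismatch_count)
print(mismatched_rows)
```

# Applied to the real `testing_results_df`, a count of zero would confirm that every row matches.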

In [31]:
# EVALUATE THE CLASSIFIERS
# Now that we made predictions with our model, we need to evaluate how good those predictions are.
# This is the evaluation step in the machine learning process.
# We will learn other techniques for model evaluation in the course, but for now, let's introduce an evaluation metric called accuracy, which scikit-learn calculates with the `accuracy_score` function.
# One way to verify the model's performance is by analyzing the differences between the predictions and the actual targets.
# That is, we'd compare each predicted value to its actual value - row by row.
# However, more efficient ways to evaluate a model exist.
# In particular, scikit-learn has a suite of tools for calculating evaluation metrics.
# For now, we can use the `accuracy_score` function to calculate the accuracy of our model predictions for the testing data.

# Import the accuracy_score function:
from sklearn.metrics import accuracy_score

# Calculate the model's accuracy on the test dataset:
accuracy_score(testing_targets, testing_predictions)

1.0

In [32]:
# The model achieved an accuracy score of 1.0 - which means it predicted the correct target for every set of features in the testing set.
# That is, the model predicted the correct target (whether the startup succeeded or not) when given the two features.
# Although the model achieved perfect accuracy in this example, that's rare in actual practice.
# Moreover, an extremely high metric should make you suspicious of OVERFITTING.
# OVERFITTING means that the model is so good at predicting the correct target for the training data that it won't perform well on new data that it wasn't trained on.
# And while the model still performed well on the testing data, we might not get so lucky with other datasets.
# Later in the module, we'll discuss overfitting in greater detail.
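
# Accuracy is simply the fraction of predictions that match the actual targets.
# The following is a minimal sketch, assuming toy values: `demo_actual` and `demo_predicted`
# are hypothetical arrays, not the lesson's results. It computes accuracy by hand and
# compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual targets and predictions (6 of the 8 pairs match).
demo_actual = np.array([0, 1, 0, 1, 1, 0, 0, 1])
demo_predicted = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Manual accuracy: the share of positions where prediction equals target.
manual_accuracy = float(np.mean(demo_actual == demo_predicted))

# The same metric from scikit-learn (actual values first, predictions second).
library_accuracy = accuracy_score(demo_actual, demo_predicted)

print(manual_accuracy, library_accuracy)
```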

In [None]:
# RECAP OUR USE OF LOGISTIC REGRESSION
# Let's summarize the steps that we took to use a logistic regression model:
    # 1. Create a model with `LogisticRegression`.
    # 2. Train the model with `model.fit()`.
    # 3. Make predictions with `model.predict()`.
    # 4. Evaluate the model with `accuracy_score`.
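
# The four steps can be sketched end to end as follows. This is a minimal sketch, assuming
# synthetic data: `demo_df` is a hypothetical DataFrame that only borrows the lesson's column
# names, and `random_state=1` is an arbitrary choice made here for repeatability.
# Comparing training accuracy with testing accuracy gives a first, rough check for overfitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: unhealthy firms cluster around (-2, -2), healthy firms around (2, 2).
rng = np.random.default_rng(42)
unhealthy = rng.normal(loc=-2.0, scale=1.0, size=(300, 2))
healthy = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
demo_df = pd.DataFrame(
    np.vstack([unhealthy, healthy]),
    columns=['Financial Performance', 'Industry Health'],
)
demo_df['Firm Category'] = [0] * 300 + [1] * 100

# Separate the features (X) from the target (y), then split into training and testing sets.
X = demo_df[['Financial Performance', 'Industry Health']]
y = demo_df['Firm Category']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# 1. Create the model.
model = LogisticRegression()

# 2. Train the model on the training data only.
model.fit(X_train, y_train)

# 3. Make predictions for both subsets.
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# 4. Evaluate: a large gap between these two scores can hint at overfitting.
train_accuracy = accuracy_score(y_train, train_predictions)
test_accuracy = accuracy_score(y_test, test_predictions)

print(train_accuracy, test_accuracy)
```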