# **SpaceX Falcon 9 First Stage Landing Prediction**


# Machine Learning Predictive Analytics

## Objectives


Perform Exploratory Data Analysis and determine Training Labels:

*   Creation of a column for the class.
*   Standardization of the data.
*   Splitting into training data and test data.

Four (4) models are assessed for this purpose, namely: Logistic Regression, Support Vector Machine (SVM), Decision Tree and k Nearest Neighbors (KNN).

*   Find the best hyperparameters for Logistic Regression, SVM, Decision Tree and KNN.
*   Find the method that performs best using test data.

We will import the following libraries for the lab.


In [None]:
# Importing required libraries
import pandas as pd # software library written for the Python programming language for data manipulation and analysis
import numpy as np # library adding support to multi-dimensional arrays, matrices and functions to operate on these arrays
import matplotlib.pyplot as plt # plotting library for Python; pyplot provides a MATLAB-like plotting interface
import seaborn as sns # visualization library based on matplotlib, providing a high-level interface for drawing attractive and informative statistical graphics
from sklearn import preprocessing # preprocessing allows us to standardize our data
from sklearn.model_selection import train_test_split # allows the split of data into training and testing data
from sklearn.model_selection import GridSearchCV # allows testing parameters of classification algorithms and finding the best one
from sklearn.linear_model import LogisticRegression # Logistic Regression classification algorithm
from sklearn.svm import SVC # Support Vector Machine classification algorithm
from sklearn.tree import DecisionTreeClassifier # Decision Tree classification algorithm
from sklearn.neighbors import KNeighborsClassifier # K Nearest Neighbors classification algorithm

The function below plots the confusion matrix.

In [None]:
def plot_confusion_matrix(y, y_predict):
    """Plot the confusion matrix of the true labels y against the predictions y_predict."""
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax)  # annot=True writes the count in each cell
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    # Use the same class names on both axes (class 0 = did not land, class 1 = landed)
    ax.xaxis.set_ticklabels(['did not land', 'landed'])
    ax.yaxis.set_ticklabels(['did not land', 'landed'])
    plt.show()

We load the data.

In [None]:
url_2 = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_2.csv"

data = pd.read_csv(url_2)

In [None]:
data.head()

In [None]:
url_3 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_3.csv'

X = pd.read_csv(url_3)

In [None]:
X.head(100)

We select the column <code>Class</code> from <code>data</code> with single brackets (<code>data\['Class']</code>), which returns a Pandas Series, and convert it to a NumPy array with the method <code>to_numpy()</code>. The result is assigned to the variable <code>Y</code>.

In [None]:
Y = data['Class'].to_numpy()
Y

Standardize the data in <code>X</code> and then reassign it to the variable <code>X</code>, using the transform provided below.

In [None]:
transform = preprocessing.StandardScaler()

In [None]:
X

In [None]:
X = transform.fit_transform(X)
X
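As a minimal sketch of what the scaler computes, the example below applies <code>StandardScaler</code> to a small made-up array (the values are illustrative and not taken from the launch dataset) and checks that the result matches the column-wise z-score, (x - mean) / std, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 4 samples, 2 features (not from the launch dataset)
X_demo = np.array([[1.0, 200.0],
                   [2.0, 300.0],
                   [3.0, 400.0],
                   [4.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Manual z-score: subtract the column mean, divide by the column standard deviation (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Both computations agree, and each scaled column has mean ~0 and standard deviation ~1
print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

In this notebook the scaler is fitted on all of <code>X</code> before the split. Fitting it on the training data only is a common alternative that keeps test-set statistics out of the preprocessing step.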

We split the data into training and test sets using the function <code>train_test_split()</code>. During the hyperparameter search, <code>GridSearchCV</code> further divides the training data into folds: each candidate setting is trained on all folds but one and validated on the held-out fold, and the hyperparameters with the best mean validation accuracy are selected.

We call <code>train_test_split()</code> on X and Y with <code>test_size</code> set to 0.2 and <code>random_state</code> set to 2. The training data and test data are assigned to the following variables:

<code>X_train, X_test, Y_train, Y_test</code>


In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [None]:
Y_test.shape

We observe that we only have 18 test samples.
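The validation step that <code>GridSearchCV</code> performs on the training data can be sketched by hand. The example below is an illustration under stated assumptions: it uses synthetic data from <code>make_classification</code> rather than the launch dataset, a single fixed value of <code>C</code>, and <code>StratifiedKFold</code>, which is the splitter scikit-learn uses for classifiers when <code>cv</code> is an integer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the launch data: 90 samples, 5 features (illustrative only)
X_demo, y_demo = make_classification(n_samples=90, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

# 10 folds: each fold is held out once for validation while the other 9 are used for fitting
folds = StratifiedKFold(n_splits=10)
fold_scores = []
for fit_idx, val_idx in folds.split(X_tr, y_tr):
    model = LogisticRegression(C=1.0)
    model.fit(X_tr[fit_idx], y_tr[fit_idx])
    fold_scores.append(model.score(X_tr[val_idx], y_tr[val_idx]))

# GridSearchCV computes this mean for every candidate and reports the highest as best_score_
print(len(fold_scores), np.mean(fold_scores))
```

The test split (<code>X_te</code>, <code>y_te</code>) is never touched in this loop; it is reserved for the final accuracy check.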

### Logistic Regression model

We create a logistic regression object and then, create a GridSearchCV object <code>logreg_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.

In [None]:
parameters = {'C': [0.01, 0.1, 1],
              'penalty': ['l2'],
              'solver': ['lbfgs']}

In [None]:
# 'parameters' was defined in the previous cell; 'l2' is the ridge penalty ('l1' would be lasso)
lr = LogisticRegression()

logreg_cv = GridSearchCV(lr, parameters, cv=10)

# Fitting it to the data
logreg_cv.fit(X_train, Y_train)

We output the <code>GridSearchCV</code> object for logistic regression. We display the best parameters using the data attribute <code>best_params\_</code> and the accuracy on the validation data using the data attribute <code>best_score\_</code>.

In [None]:
print("tuned hyperparameters : (best parameters) ", logreg_cv.best_params_)
print("accuracy :", logreg_cv.best_score_)

We then calculate the accuracy on the test data using the method <code>score</code>:

In [None]:
logreg_cv.score(X_test, Y_test)

Now, we look at the confusion matrix:

In [None]:
yhat = logreg_cv.predict(X_test)
plot_confusion_matrix(Y_test, yhat)

Examining the confusion matrix, we see that logistic regression can distinguish between the two classes. The main source of error is false positives, i.e. launches that did not land but were predicted as landed.

Overview:

True Positive - 12 (True label is landed, Predicted label is also landed)

False Positive - 3 (True label is not landed, Predicted label is landed)
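To relate these counts to standard metrics, the sketch below builds hypothetical label arrays that reproduce the 12 true positives and 3 false positives above. The remaining 3 of the 18 test samples are assumed to be true negatives; that split is an assumption made for illustration, not a result reported here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for 18 test samples (1 = landed, 0 = did not land).
# They reproduce 12 true positives and 3 false positives; the remaining
# 3 samples are ASSUMED to be true negatives.
y_true = np.array([1] * 12 + [0] * 3 + [0] * 3)
y_pred = np.array([1] * 12 + [1] * 3 + [0] * 3)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(precision, recall)
```

Under these assumed labels, precision is 12 / (12 + 3) = 0.8, which is the quantity that false positives pull down.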

### Support Vector Machine (SVM) model

We create a support vector machine object and then, create a <code>GridSearchCV</code> object <code>svm_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.

In [None]:
parameters = {'kernel': ('linear', 'rbf', 'poly', 'sigmoid'),  # each kernel listed once
              'C': np.logspace(-3, 3, 5),
              'gamma': np.logspace(-3, 3, 5)}
svm = SVC()
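To get a feel for the size of such a search, the sketch below counts the candidates in a grid with four distinct kernels and the same logarithmic ranges for <code>C</code> and <code>gamma</code>. The name <code>demo_grid</code> is introduced here for illustration only; <code>ParameterGrid</code> is the scikit-learn helper that enumerates every combination.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Illustrative grid with four distinct kernels and logarithmically spaced C and gamma
demo_grid = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
             'C': np.logspace(-3, 3, 5),
             'gamma': np.logspace(-3, 3, 5)}

# np.logspace(-3, 3, 5) gives 5 values evenly spaced in the exponent, from 1e-3 to 1e3
print(np.logspace(-3, 3, 5))

n_candidates = len(ParameterGrid(demo_grid))  # 4 kernels * 5 C values * 5 gamma values
n_fits = n_candidates * 10                    # one cross-validation fit per candidate per fold (cv=10)
print(n_candidates, n_fits)
```

Listing a kernel twice would not change the best result, but it would add redundant candidates and therefore redundant fits.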

In [None]:
svm_cv = GridSearchCV(svm, parameters, cv=10)

# Fitting it to the data
svm_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters : (best parameters) ", svm_cv.best_params_)
print("accuracy :", svm_cv.best_score_)

We calculate the accuracy on the test data using the method <code>score</code>:

In [None]:
svm_cv.score(X_test, Y_test)

Now, we plot the confusion matrix.

In [None]:
yhat = svm_cv.predict(X_test)
plot_confusion_matrix(Y_test, yhat)

### Decision Tree classifier model

We create a decision tree classifier object and then, create a <code>GridSearchCV</code> object <code>tree_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.

In [None]:
parameters = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': [2*n for n in range(1,10)],
              'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' and is rejected by scikit-learn 1.3 and later
              'min_samples_leaf': [1, 2, 4],
              'min_samples_split': [2, 5, 10]}

tree = DecisionTreeClassifier()

In [None]:
tree_cv = GridSearchCV(tree, parameters, cv=10)

# Fitting it to the data
tree_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters : (best parameters) ", tree_cv.best_params_)
print("accuracy :", tree_cv.best_score_)

We calculate the accuracy of tree_cv on the test data using the method <code>score</code>:

In [None]:
tree_cv.score(X_test, Y_test)

Now, we plot the confusion matrix.

In [None]:
yhat = tree_cv.predict(X_test)
plot_confusion_matrix(Y_test, yhat)

### k Nearest Neighbors model

We create a k nearest neighbors object and then, create a <code>GridSearchCV</code> object <code>knn_cv</code> with cv = 10. Fit the object to find the best parameters from the dictionary <code>parameters</code>.

In [None]:
parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1, 2]}

KNN = KNeighborsClassifier()

In [None]:
knn_cv = GridSearchCV(KNN, parameters, cv=10)

# Fitting it to the data
knn_cv.fit(X_train, Y_train)

In [None]:
print("tuned hyperparameters : (best parameters) ", knn_cv.best_params_)
print("accuracy :", knn_cv.best_score_)

We calculate the accuracy of knn_cv on the test data using the method <code>score</code>:

In [None]:
knn_cv.score(X_test, Y_test)

Now, we plot the confusion matrix.

In [None]:
yhat = knn_cv.predict(X_test)
plot_confusion_matrix(Y_test, yhat)

### Best Performing Predictive model

We find the method performing best:

In [None]:
# All predictors have the same accuracy score on the test dataset, so the
# mean cross-validated accuracy on the training data (best_score_) is compared instead
predictors = {'Logistic Regression': logreg_cv,
              'Support Vector Machine': svm_cv,
              'Decision Tree Classifier': tree_cv,
              'k Nearest Neighbors': knn_cv}

# Pick the model whose grid search reached the highest cross-validated accuracy
best_predictor = max(predictors, key=lambda name: predictors[name].best_score_)
best_result = predictors[best_predictor].best_score_

print('The best predictor with a score={} is the {}.'.format(best_result, best_predictor))
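A side-by-side table of cross-validated and test accuracy can make such a comparison easier to read. The sketch below is an illustration under assumptions: it uses synthetic data from <code>make_classification</code>, only two of the four models, deliberately small grids and <code>cv=5</code> to keep it fast, and the names <code>searches</code> and <code>summary</code> are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (illustrative only, not the launch dataset)
X_demo, y_demo = make_classification(n_samples=90, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

# Two small searches; the grids are deliberately tiny to keep the sketch fast
searches = {
    'Logistic Regression': GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1]}, cv=5),
    'k Nearest Neighbors': GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}, cv=5),
}

rows = []
for name, search in searches.items():
    search.fit(X_tr, y_tr)
    rows.append({'model': name,
                 'cv_accuracy': search.best_score_,              # mean accuracy over the validation folds
                 'test_accuracy': search.score(X_te, y_te)})     # accuracy on the held-out test set

# One row per model, best cross-validated accuracy first
summary = pd.DataFrame(rows).sort_values('cv_accuracy', ascending=False)
print(summary)
```

With 18 test samples, a single misclassification changes the test accuracy by 1/18, or about 5.6 percentage points, so equal test scores across models are not surprising.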