# Table of Contents
   1. Business Statement   
      1.1 About Data Set
   2. Introduction 
   3. Installing Libraries 
   4. Reading Data Set
   5. Machine Learning Models 
   6. Summary

# Business Statement 

## Spaceship Titanic
Dataset Shape: 12790 rows x 15 columns

1. **PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
2. **HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.

3. **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

4. **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

5. **Destination** - The planet the passenger will be debarking to.

6. **Age** - The age of the passenger.

7. **VIP** - Whether the passenger has paid for special VIP service during the voyage.

8. **RoomService** - Amount the passenger has billed at each of the Spaceship Titanic's room service amenities.

9. **FoodCourt** - Amount the passenger has billed at each of the Spaceship Titanic's food court amenities.

10. **ShoppingMall** - Amount the passenger has billed at each of the Spaceship Titanic's shopping mall amenities.

11. **Spa** - Amount the passenger has billed at each of the Spaceship Titanic's spa amenities.

12. **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's VR deck amenities.

13. **Name** - The first and last names of the passenger.

14. **Transported** - Whether the passenger was transported to another dimension. 
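The `gggg_pp` and `deck/num/side` formats described above can be split into separate columns. The following is a minimal sketch on two made-up rows, not rows from the real data; the new column names (`Group`, `GroupNumber`, `Deck`, `CabinNum`, `Side`) are illustrative assumptions.

```python
import pandas as pd

# Two made-up passengers in the documented formats (not real data).
passengers = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_02"],
    "Cabin": ["B/0/P", "F/12/S"],
})

# PassengerId is gggg_pp: group id, then the passenger's number within the group.
passengers[["Group", "GroupNumber"]] = passengers["PassengerId"].str.split("_", expand=True)

# Cabin is deck/num/side, where side is P (Port) or S (Starboard).
passengers[["Deck", "CabinNum", "Side"]] = passengers["Cabin"].str.split("/", expand=True)

print(passengers)
```

The split columns are strings; cast `GroupNumber` or `CabinNum` with `astype(int)` if numeric values are needed.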

# Introduction

# Importing Important Dependencies 

In [1]:
!pip install yellowbrick catboost lightgbm xgboost



In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as rnd

# Reading Data Set

In [3]:
df = pd.read_csv('MY2022 Fuel Consumption Ratings.csv')

In [4]:
df.head(10)

Unnamed: 0,Model Year,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption (City (L/100 km),Fuel Consumption(Hwy (L/100 km)),Fuel Consumption(Comb (L/100 km)),Fuel Consumption(Comb (mpg)),CO2 Emissions(g/km),CO2 Rating,Smog Rating
0,2022,Acura,ILX,Compact,.,4,AM8,Z,9.9,7.0,8.6,33,200,6,3
1,2022,Acura,MDX SH-AWD,SUV: Small,3.5,6,AS10,Z,12.6,9.4,11.2,25,263,4,5
2,2022,Acura,RDX SH-AWD,SUV: Small,2,4,AS10,Z,11.0,8.6,9.9,29,232,5,6
3,2022,Acura,RDX SH-AWD A-SPEC,SUV: Small,2,4,AS10,Z,11.3,9.1,10.3,27,242,5,6
4,2022,Acura,TLX SH-AWD,Compact,2,4,AS10,Z,11.2,8.0,9.8,29,230,5,7
5,2022,Acura,TLX SH-AWD A-SPEC,Compact,2,4,AS10,Z,11.3,8.1,9.8,29,231,5,7
6,2022,Acura,TLX Type S,Compact,3,6,AS10,Z,12.3,9.4,11.0,26,256,5,5
7,2022,Acura,TLX Type S (Performance Tire),Compact,3,6,AS10,Z,12.3,9.8,11.2,25,261,4,5
8,2022,Alfa Romeo,Giulia,Mid-size,2,4,A8,Z,10.0,7.2,8.7,32,205,6,3
9,2022,Alfa Romeo,Giulia AWD,Mid-size,2,4,A8,Z,10.5,7.7,9.2,31,217,5,3


In [40]:
df['Vehicle Class'].unique()

array(['Compact', 'SUV: Small', 'Mid-size', 'Minicompact',
       'SUV: Standard', 'Two-seater', 'Subcompact',
       'Station wagon: Small', 'Station wagon: Mid-size', 'Full-size',
       'Pickup truck: Small', 'Pickup truck: Standard', 'Minivan',
       'Special purpose vehicle'], dtype=object)

# Machine Learning Models:

1. Logistic Regression
2. Decision Tree
3. Gaussian Naive Bayes
4. Random Forest
5. K-Nearest Neighbors
6. Support Vector Machine
7. Stochastic Gradient Descent
8. AdaBoost
9. Extreme Gradient Boosting
10. Light Gradient Boosting Machine
11. CatBoost
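The scikit-learn models in this list share the same `fit`/`score` interface, so they can be compared in a single loop. The sketch below uses synthetic data from `make_classification` and only three of the listed models; the variable names and hyperparameters are illustrative assumptions, not the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (stand-in for the real features).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=10)
X_demo_tr, X_demo_te, y_demo_tr, y_demo_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2022
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=10),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=10),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record its test accuracy.
demo_scores = {}
for name, model in demo_models.items():
    model.fit(X_demo_tr, y_demo_tr)
    demo_scores[name] = model.score(X_demo_te, y_demo_te)

for name, score in demo_scores.items():
    print(f"{name}: {score:.3f}")
```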

# Metrics used: Classification Report, Confusion Matrix, ROCAUC Curve, Precision-Recall Curve

First we split the data into training and test sets: the model is fitted on the training set, and the held-out test set is then used to evaluate it with the metrics listed above.

In [8]:
X = df
y = df
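Assigning the whole frame to both `X` and `y` leaves the target inside the features. A minimal sketch of separating the two before splitting is shown here on a toy frame. The notebook does not say which column is the label, so `TARGET = "Vehicle Class"` is purely a hypothetical choice, and the toy values are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a few of the fuel-consumption columns (illustrative values).
toy = pd.DataFrame({
    "Cylinders": [4, 6, 4, 4, 6, 4, 6, 4, 4, 6],
    "CO2 Emissions(g/km)": [200, 263, 232, 242, 230, 231, 256, 261, 205, 217],
    "Vehicle Class": ["Compact", "SUV: Small"] * 5,
})

TARGET = "Vehicle Class"  # hypothetical label column, an assumption

X_toy = toy.drop(columns=[TARGET])  # features: everything except the label
y_toy = toy[TARGET]                 # target: the label column only

X_toy_tr, X_toy_te, y_toy_tr, y_toy_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2022
)
print(X_toy_tr.shape, X_toy_te.shape)
```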

In [9]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split as tts
from sklearn.compose import make_column_transformer

In [10]:
X_train, X_test, y_train, y_test = tts(X, y, test_size = .3, random_state = 2022)

In [36]:
cat_cols = ['Vehicle Class', 'Fuel Type', 'Cylinders', 'Transmission']

In [37]:
cont_cols = ['Fuel Consumption (City (L/100 km)', 'Fuel Consumption(Hwy (L/100 km))', 'Fuel Consumption(Comb (L/100 km))', 'CO2 Emissions(g/km)', 'CO2 Rating', 'Smog Rating']

In [38]:
transformer = make_column_transformer((OneHotEncoder(), cat_cols),(StandardScaler(), cont_cols), remainder = 'passthrough')

In [39]:
#Fit the transformer on the training data, then apply it to the test data
X_train = transformer.fit_transform(X_train)
X_test = transformer.transform(X_test)

ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
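One likely cause of this error is `remainder = 'passthrough'`: text columns such as `Make` and `Model` are passed through unchanged and cannot be stacked with the sparse one-hot output. The sketch below shows one way around it on a toy frame, assuming the unlisted columns can simply be dropped; `toy_cat_cols`, `toy_cont_cols` and the data values are illustrative, and `handle_unknown="ignore"` is an added assumption so that categories seen only in the test split do not raise.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like the fuel-consumption frame (illustrative values).
toy_cars = pd.DataFrame({
    "Make": ["Acura", "Acura", "Alfa Romeo", "Alfa Romeo"],  # free-text column
    "Vehicle Class": ["Compact", "SUV: Small", "Mid-size", "Mid-size"],
    "Fuel Type": ["Z", "Z", "Z", "X"],
    "CO2 Emissions(g/km)": [200.0, 263.0, 205.0, 217.0],
})

toy_cat_cols = ["Vehicle Class", "Fuel Type"]
toy_cont_cols = ["CO2 Emissions(g/km)"]

toy_transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), toy_cat_cols),
    (StandardScaler(), toy_cont_cols),
    remainder="drop",  # unlisted text columns such as Make are not passed through
)

toy_train, toy_test = toy_cars.iloc[:3], toy_cars.iloc[3:]
toy_train_enc = toy_transformer.fit_transform(toy_train)  # learn categories and scaling
toy_test_enc = toy_transformer.transform(toy_test)        # reuse them on the test rows
print(toy_train_enc.shape, toy_test_enc.shape)
```

If the remaining columns carry useful signal, the alternative is to encode them explicitly by adding them to the categorical list rather than passing them through.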

In [None]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [13]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from yellowbrick.contrib.wrapper import wrap
from yellowbrick.classifier import ClassificationReport, ConfusionMatrix, ROCAUC, PrecisionRecallCurve
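import warnings
warnings.filterwarnings("ignore")

# `colors` is passed to every visualizer below but is never defined in this notebook;
# this list of standard matplotlib colormap names is an assumed stand-in.
colors = ['YlOrRd', 'Blues', 'Greens', 'Purples', 'YlGn']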

# Logistic Regression

In [None]:
#Created an obj for logistic regression 
log_reg = LogisticRegression(random_state = 10)
log_reg.fit(X_train, y_train)

In [None]:
#Training Accuracy
log_reg.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = log_reg.predict(X_test)
accuracy_score(y_test, y_pred) 
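Accuracy alone hides per-class behaviour. The classification report and confusion matrix named under "Metrics used" can also be computed directly with `sklearn.metrics`. This is a small sketch on hand-written labels; `demo_true` and `demo_pred` are made-up values, not outputs of the models in this notebook.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written labels for illustration only.
demo_true = [0, 0, 1, 1, 1, 0]
demo_pred = [0, 1, 1, 1, 0, 0]

demo_acc = accuracy_score(demo_true, demo_pred)

# Rows are true classes, columns are predicted classes.
demo_cm = confusion_matrix(demo_true, demo_pred)

# Per-class precision, recall, F1 and support as a text table.
demo_report = classification_report(demo_true, demo_pred)

print(demo_acc)
print(demo_cm)
print(demo_report)
```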

In [None]:
visualizer = ClassificationReport(log_reg, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(log_reg, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

In [None]:
visualizer = ROCAUC(log_reg, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(log_reg, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

# Decision Tree

In [None]:
#Created an obj for Decision Tree 
dec_tree = DecisionTreeClassifier(max_depth = 25, min_samples_split = 5, random_state = 10)
dec_tree.fit(X_train, y_train)

In [None]:
#Training Accuracy
dec_tree.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = dec_tree.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(dec_tree, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(dec_tree, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

In [None]:
visualizer = ROCAUC(dec_tree, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(dec_tree, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

# Gaussian Naive Bayes

In [None]:
#Created an obj for Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

In [None]:
#Training Accuracy
gnb.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = gnb.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(gnb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(gnb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(gnb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(gnb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

# Random Forest

In [None]:
#Created an obj for Random Forest
rand_for = RandomForestClassifier(n_jobs = None, max_depth = 25, min_samples_split = 5, random_state = 10, n_estimators= 200)
rand_for.fit(X_train, y_train)

In [None]:
#Training Accuracy
rand_for.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = rand_for.predict(X_test)
accuracy_score(y_test, y_pred) 

In [None]:
visualizer = ClassificationReport(rand_for, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(rand_for, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(rand_for, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()  

In [None]:
visualizer = PrecisionRecallCurve(rand_for, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

# K-Nearest Neighbors

In [None]:
#Created an obj for K - Nearest Neighbors
knc = KNeighborsClassifier(n_jobs = -1, n_neighbors = 6)
knc.fit(X_train, y_train)

In [None]:
#Training Accuracy
knc.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = knc.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(knc, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

In [None]:
visualizer = ConfusionMatrix(knc, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

In [None]:
visualizer = ROCAUC(knc, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(knc, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

# Support Vector Machine

In [None]:
#Created an obj for SVM
svm = SVC(C = 0.25, kernel = 'linear', random_state = 10)
svm.fit(X_train, y_train)

In [None]:
#Training Accuracy
svm.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = svm.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(svm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(svm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(svm, support = True, color_bar = True, cmap = rnd.choice(colors), binary = True)
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

In [None]:
visualizer = PrecisionRecallCurve(svm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

# Stochastic Gradient Descent

In [None]:
#Created an obj for SGD
sgd = SGDClassifier(loss = 'modified_huber', max_iter = 2000, shuffle = False, n_jobs = None, early_stopping = False, random_state = 10)
sgd.fit(X_train, y_train)

In [None]:
#Training Accuracy
sgd.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = sgd.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(sgd, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(sgd, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(sgd, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(sgd, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

# AdaBoost

In [None]:
#Created an obj for Ada Boost
adab = AdaBoostClassifier(n_estimators = 200, learning_rate = 0.1 , random_state = 10)
adab.fit(X_train, y_train)

In [None]:
#Training Accuracy
adab.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = adab.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(adab, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(adab, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(adab, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(adab, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

# Extreme Gradient Boosting

In [None]:
#Created an obj for Extreme Gradient Boosting
xgb = XGBClassifier(tree_method = 'gpu_hist', n_estimators = 200, learning_rate = 0.1 , random_state = 10)
xgb.fit(X_train, y_train)

In [None]:
#Training Accuracy
xgb.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = xgb.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(xgb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

In [None]:
visualizer = ConfusionMatrix(xgb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(xgb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(xgb, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

# Light Gradient Boosting Machine

In [None]:
#Created an obj for Light Gradient Boosting Machine
lgbm = LGBMClassifier(n_estimators = 200, num_leaves = 255, objective = 'binary', num_iterations = 1000, n_jobs = -1, max_depth = 5, random_state = 10)
lgbm.fit(X_train, y_train)

In [None]:
#Training Accuracy
lgbm.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = lgbm.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
visualizer = ClassificationReport(lgbm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(lgbm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(lgbm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(lgbm, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show() 

# CatBoost

In [None]:
#Created an obj for Cat Boost
cat = CatBoostClassifier(task_type = "GPU", iterations = 1000, loss_function = 'Logloss', eval_metric ='Accuracy', random_state = 10, early_stopping_rounds = 100, od_type = "Iter")
cat.fit(X_train, y_train, early_stopping_rounds = 100)

In [None]:
#Training Accuracy
cat.score(X_train, y_train)

In [None]:
#Evaluation of Model
y_pred = cat.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
cat = wrap(cat)

In [None]:
visualizer = ClassificationReport(cat, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ConfusionMatrix(cat, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = ROCAUC(cat, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()

In [None]:
visualizer = PrecisionRecallCurve(cat, support = True, color_bar = True, cmap = rnd.choice(colors))
visualizer.fit(X_train, y_train)        
visualizer.score(X_test, y_test)   
visualizer.show()