<a href="https://colab.research.google.com/github/The-Kaggle-Crew-18/Kaggle-Challenge-18/blob/main/KaggleChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Problem Statement
The goal of the Kaggle Spaceship Titanic challenge is to predict which passengers were transported to an alternate dimension when the Spaceship Titanic collided with a spacetime anomaly. Using the provided datasets, we will build a machine learning model that classifies each passenger as transported or not, based on features such as age, cabin, and destination.

# 1.2 Introduction to the Dataset
train.csv: The training data. Its columns include PassengerId, HomePlanet, CryoSleep, Cabin, Destination, and Age, plus the target column Transported, which records whether the passenger was transported.
test.csv: The same features as train.csv, but without the target variable (Transported). We will generate predictions for these passengers.
sample_submission.csv: A template for the submission format. It contains PassengerId and a Transported column for our predictions.
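
To make the submission layout described above concrete, here is a minimal sketch of assembling a two-column table and serializing it as CSV. The passenger IDs and predictions are invented for illustration, and the boolean True/False values are an assumption based on the dtype of Transported in train.csv rather than something this notebook states about the required format.

```python
import io

import pandas as pd

# Hypothetical passenger IDs and predictions, for illustration only.
passenger_ids = ["9001_01", "9002_01", "9003_01"]
predictions = [True, False, True]

submission = pd.DataFrame(
    {"PassengerId": passenger_ids, "Transported": predictions}
)

# index=False keeps the row index out of the file, so it contains
# exactly the two columns described for sample_submission.csv.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In a real run, the IDs would come from test_df["PassengerId"] and the predictions from the fitted model; writing to a file path instead of a buffer works the same way.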

# 1.3 Overview of the Kaggle Challenge
The competition is a binary classification task: predict the target variable (Transported) for each passenger in the test set. Submissions are evaluated on accuracy, the percentage of passengers classified correctly.
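
As a small illustration of the accuracy metric, the sketch below counts matching labels by hand. The helper name accuracy and the two label lists are made up for this example; accuracy_score from sklearn.metrics, which this notebook imports, computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for five passengers (not taken from the dataset).
y_true = [True, False, True, True, False]
y_pred = [True, False, False, True, True]

score = accuracy(y_true, y_pred)
print(f"Accuracy: {score:.2f}")
```

Three of the five made-up predictions match, so the score is 3/5.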

In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


In [None]:

# Load the datasets

train_data = 'https://raw.githubusercontent.com/The-Kaggle-Crew-18/Kaggle-Challenge-18/main/train.csv'
test_data = 'https://raw.githubusercontent.com/The-Kaggle-Crew-18/Kaggle-Challenge-18/main/test.csv'


train_df = pd.read_csv(train_data)
test_df = pd.read_csv(test_data)

In [10]:
# Display the first few rows of each dataset
print("Train Dataset:")
train_df.head()

Train Dataset:


| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


# 2.1 Initial Data Exploration
In this section we examine the structure of the data (columns, types, and non-null counts), generate summary statistics for both datasets, and check each dataset for missing values.
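
The three checks described above can also be combined into a single per-column table. The sketch below is one possible way to do that, not the notebook's own method: the helper name overview and the toy DataFrame are hypothetical, while the pandas calls (dtypes, notna, isna) are standard.

```python
import numpy as np
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, non-null count, and missing values per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "missing": df.isna().sum(),
            # isna().mean() is the fraction of missing rows per column.
            "missing_pct": (df.isna().mean() * 100).round(2),
        }
    )


# Toy data with a few gaps, standing in for train_df (illustration only).
toy_df = pd.DataFrame(
    {
        "HomePlanet": ["Europa", "Earth", None, "Earth"],
        "Age": [39.0, np.nan, 58.0, 33.0],
        "Transported": [False, True, False, True],
    }
)

summary = overview(toy_df)
print(summary)
```

Calling overview(train_df) and overview(test_df) would give the same kind of table for the real datasets.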

In [11]:
# Structure and overview of the data
print("Train Dataset Info:")
train_df.info()
print("\nTest Dataset Info:")
test_df.info()

Train Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB

Test Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4277 entries, 0 to 4276
Data columns (total 13 colum

In [12]:
# Summary statistics
print("Train Dataset Summary Statistics:")
train_df.describe()


Train Dataset Summary Statistics:


| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |


In [13]:
print("\nTest Dataset Summary Statistics:")
test_df.describe()


Test Dataset Summary Statistics:


| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 4186.0 | 4195.0 | 4171.0 | 4179.0 | 4176.0 | 4197.0 |
| mean | 28.658146 | 219.266269 | 439.484296 | 177.295525 | 303.052443 | 310.710031 |
| std | 14.179072 | 607.011289 | 1527.663045 | 560.821123 | 1117.186015 | 1246.994742 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 37.0 | 53.0 | 78.0 | 33.0 | 50.0 | 36.0 |
| max | 79.0 | 11567.0 | 25273.0 | 8292.0 | 19844.0 | 22272.0 |


In [14]:
# Checking for missing values
print("Missing Values in Train Dataset:")
train_df.isnull().sum()

Missing Values in Train Dataset:


PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [15]:
# Checking for missing values in the test data
print("\nMissing Values in Test Dataset:")
test_df.isnull().sum()


Missing Values in Test Dataset:


PassengerId       0
HomePlanet       87
CryoSleep        93
Cabin           100
Destination      92
Age              91
VIP              93
RoomService      82
FoodCourt       106
ShoppingMall     98
Spa             101
VRDeck           80
Name             94
dtype: int64
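
Because the two files differ in size (8693 training rows versus 4277 test rows), raw missing counts are hard to compare directly. The sketch below converts the counts printed above into percentages. The variable names are chosen for this example, and the counts are copied by hand from the outputs rather than recomputed from the data.

```python
import pandas as pd

columns = [
    "HomePlanet", "CryoSleep", "Cabin", "Destination", "Age", "VIP",
    "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Name",
]

# Missing-value counts copied from the isnull().sum() outputs above.
train_missing = pd.Series(
    [201, 217, 199, 182, 179, 203, 181, 183, 208, 183, 188, 200],
    index=columns,
)
test_missing = pd.Series(
    [87, 93, 100, 92, 91, 93, 82, 106, 98, 101, 80, 94],
    index=columns,
)

# Row counts taken from the RangeIndex lines of the info() outputs.
n_train, n_test = 8693, 4277

missing_rates = pd.DataFrame(
    {
        "train_pct": train_missing / n_train * 100,
        "test_pct": test_missing / n_test * 100,
    }
).round(2)
print(missing_rates)
```

With the real DataFrames loaded, train_df.isnull().mean() * 100 gives the same percentages directly. Under the counts above, every column comes out below 3% missing in both files.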