
# **Background of the Competition**
The Titanic dataset is a well-known dataset that provides information on the passengers who were on board the RMS Titanic on its fateful voyage. The data includes details such as each passenger's name, age, gender, ticket class, fare paid, and family members on board. The dataset also includes a column called "Survived", which indicates whether a passenger survived the disaster.

The training set (train.csv) has 891 rows and 12 columns. The columns are:

• PassengerId: a unique identifier for each passenger
• Survived: a binary variable that indicates whether the passenger survived (1) or did not survive (0) the disaster
• Pclass: the ticket class of the passenger (1 = first class, 2 = second class, 3 = third class)
• Name: the name of the passenger
• Sex: the gender of the passenger (male or female)
• Age: the age of the passenger (some values are missing)
• SibSp: the number of siblings or spouses the passenger had on board
• Parch: the number of parents or children the passenger had on board
• Ticket: the ticket number of the passenger
• Fare: the fare paid by the passenger
• Cabin: the cabin number of the passenger (some values are missing)
• Embarked: the port at which the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

Overall, the key challenges I encountered when working on the Titanic dataset were: how to handle missing values and imbalanced classes, encode categorical variables, reduce the dimensionality of the dataset, and identify and handle noise in the data.
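
The first two challenges, missing values and class imbalance, can be measured before any modeling. Here is a minimal sketch on a small made-up frame (`toy` is hypothetical, not the competition data); on the real data the same calls would be applied to the frame loaded from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (not real passenger data).
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 0, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0, 54.0],
    "Cabin": [None, "C85", None, None, "C123", None],
    "Embarked": ["S", "C", "S", None, "S", "Q"],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of each class in the target column; a strong skew would suggest
# using stratified splits or class weights later on.
class_share = toy["Survived"].value_counts(normalize=True)
print(class_share)
```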

Here are a few tips that I found helpful when getting started with the Titanic competition:

• Get familiar with the dataset
• Pre-process the data
• Split the data into training and test sets
• Try out a few different algorithms
• Tune the hyperparameters
• Evaluate the model

Here are a few resources that I found helpful as I started working on the competition:

• Kaggle's Titanic tutorial
• scikit-learn documentation
• Pandas documentation
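
The "try a few algorithms" and "tune the hyperparameters" tips can be sketched with scikit-learn's cross-validation helpers. This is only an illustration on synthetic data: `X_demo`, `y_demo`, the two candidate models and the grid of `C` values are assumptions chosen for the example, not the setup used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned feature table (assumption: 7 numeric features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0
)

# Try a few algorithms with 5-fold cross-validation and compare mean accuracy.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {cv_scores[name]:.3f}")

# Tune one hyperparameter (the inverse regularization strength C) with a grid search.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

Cross-validation averages the score over several splits, so the comparison depends less on one lucky or unlucky split than a single hold-out set does.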

In [None]:
import pandas as pd   #data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
#list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv('/kaggle/input/titanic/train.csv') #Load the training data downloaded from Kaggle
test = pd.read_csv('/kaggle/input/titanic/test.csv') #Load the test data downloaded from Kaggle
test_ids = test["PassengerId"] #Keep the IDs for the submission file before clean() drops them

def clean(data): #Clean function: drop uninformative columns and fill in missing values
    #Dropped the Ticket, PassengerId, Name and Cabin because I don't think they give me a lot of information
    data = data.drop(["Ticket", "PassengerId", "Name", "Cabin"], axis=1)
    
    cols = ["SibSp", "Parch", "Fare", "Age"] #Numeric columns that may have missing values
    for col in cols: #going through the columns
        #Fill the missing values in each column with that column's median
        data[col] = data[col].fillna(data[col].median())
        
    data["Embarked"] = data["Embarked"].fillna("U") #Fill missing Embarked values with an "unknown" token
    return data

data = clean(data)
test = clean(test)
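
Note that `clean` fills each frame with its own medians. A common alternative (not what this notebook does) is to compute the medians on the training data only and reuse them for the test data, so both frames are imputed with the same values. A minimal sketch on made-up frames; `train_demo`, `test_demo` and `medians` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the training and test data.
train_demo = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                           "Fare": [7.0, np.nan, 9.0, 8.0]})
test_demo = pd.DataFrame({"Age": [np.nan, 50.0],
                          "Fare": [np.nan, 100.0]})

# Medians are learned from the training frame only.
medians = train_demo[["Age", "Fare"]].median()

# Passing a Series to fillna fills each column with the value under its name.
train_filled = train_demo.fillna(medians)
test_filled = test_demo.fillna(medians)
print(test_filled)
```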

In [None]:
data.head(5) #Preview the first 5 rows of the cleaned data

In [None]:
from sklearn import preprocessing #using sklearn to convert strings to integer codes
le = preprocessing.LabelEncoder() #Using the label encoder
columns = ["Sex", "Embarked"]

for col in columns:
    data[col] = le.fit_transform(data[col]) #Learn the mapping from the training column and apply it
    test[col] = le.transform(test[col]) #Apply the same mapping to the test column
    print(le.classes_) #print the classes; each class's position is its integer code, e.g. female is 0 and male is 1
      
data.head(5)
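
To make the encoding explicit: `LabelEncoder` stores the sorted unique labels in `classes_`, and each label's position in that array is its integer code. A small sketch on a made-up column (`sex` and `mapping` are hypothetical names for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with the same values as "Sex".
sex = pd.Series(["male", "female", "female", "male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ is sorted, so the index of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'female': 0, 'male': 1}
```

Because the loop above reuses a single `le` for both columns, only the mapping for the last column is still available after the loop; keeping one encoder per column would preserve both.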

In [None]:
#Using logistic regression, with a held-out validation set to see how good the model is
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split

y = data["Survived"]
X = data.drop("Survived", axis=1) #Dropping the column for the survived

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train) #Fit the logistic regression classifier; random_state is fixed for reproducibility

In [None]:
from sklearn.metrics import accuracy_score #Getting the accuracy
predictions = clf.predict(X_val) #Check how good the model is on validation data that it hasn't seen
accuracy_score(y_val, predictions)
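
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix splits the result into correct and incorrect predictions per class. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical stand-ins for `y_val` and `predictions`):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, standing in for y_val and predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(acc)  # 0.75
print(cm)
```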

In [None]:
submission_preds = clf.predict(test) #Getting the submission Predictions

In [None]:
#Generating a CSV file that can be submitted to Kaggle
df = pd.DataFrame({"PassengerId": test_ids.values,
                   "Survived": submission_preds,
                  })

In [None]:
df.to_csv("final submission.csv", index=False)