## Kaggle Titanic Project

https://www.kaggle.com/competitions/titanic

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt

### Load the data

In [20]:
#train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data = pd.read_csv("./titanic_project_data/train.csv")
train_data.head()
#train_data.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [21]:
test_data = pd.read_csv("./titanic_project_data/test.csv")
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### Explore a pattern

In [22]:
# gen_test_data = pd.read_csv("./titanic_project_data/gender_submission.csv")
# gen_test_data.head()

# women = train_data.loc[train_data.Sex == 'female']["Survived"]
# rate_women = sum(women)/len(women)

# men = train_data.loc[train_data.Sex == 'male']["Survived"]
# rate_men = sum(men)/len(men)

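# rate_women and rate_men are fractions between 0 and 1, so format them as percentages
# print(f"Share of women who survived: {rate_women:.1%}")
# print(f"Share of men who survived: {rate_men:.1%}")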
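
The commented-out comparison above can also be written as a single `groupby`. A minimal sketch on a tiny hand-made frame (the six rows are invented for illustration and are not taken from `train.csv`):

```python
import pandas as pd

# Toy stand-in for train_data; these rows are made up for illustration
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```

On the real data, replacing `toy` with `train_data` should give the same two rates as the `women`/`men` calculation.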

### Your first machine learning model: Random Forest
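We'll build what's known as a random forest model. This model is constructed of several "trees" (we'll construct 100!) that will individually consider each passenger's data and vote on whether the individual survived. Then, the random forest model makes a democratic decision: the outcome with the most votes wins! (Strictly speaking, scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees and picks the most probable class, which usually agrees with a simple majority vote.)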

In [23]:
from sklearn.ensemble import RandomForestClassifier

y_train = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X_train = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId,
                       'Survived': predictions})

output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

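# Note: this is accuracy on the training data itself, so it overstates how well the model generalizes
accuracy = model.score(X_train,y_train)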
print(f"Accuracy: {accuracy:.2f}")

Your submission was successfully saved!
Accuracy: 0.82



## Let's break down the features used in the code above:

### 1. **`pd.get_dummies()` in Pandas**
   - `get_dummies()` converts categorical data into a numerical format. It creates one dummy (binary) column for each unique value in a categorical column.
   - For example, if the "Sex" column has "male" and "female," it will create two new columns: "Sex_male" and "Sex_female," with values 0 or 1 (shown as `False`/`True` in recent pandas versions) indicating the presence of each category.
   - By default only string (object) and categorical columns are encoded; numeric columns such as "Pclass" and "SibSp" pass through unchanged.
   - This is useful for converting non-numeric data into a format suitable for machine learning models.
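
To see what `get_dummies()` does to a frame shaped like the feature table, here is a small sketch. The three rows are invented, and `dtype=int` is passed only so the output shows 0/1 regardless of pandas version:

```python
import pandas as pd

# Toy frame with one string column and two integer columns (values invented)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 1, 0],
})

# Only the string column "Sex" is expanded; the integer columns are left alone
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The result has the columns `Pclass`, `SibSp`, `Sex_female`, and `Sex_male`, with exactly one of the two `Sex_*` columns set to 1 in each row.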

### 2. **`n_estimators` in `RandomForestClassifier`**
   - `n_estimators` specifies the number of trees in the random forest. In this case, `n_estimators=100` means the model will use 100 decision trees.
   - Adding trees generally makes the predictions more stable and tends to improve accuracy, but with diminishing returns, while training and prediction time keep growing with the number of trees.
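
A short sketch of what `n_estimators` controls, using synthetic data rather than the Titanic files (the data, the 25-tree setting, and the variable names are assumptions for illustration). After fitting, the individual trees are available in `estimators_`, and the forest's predicted probabilities are the average over those trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only: 200 rows, 4 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=1)
forest.fit(X, y)

# One fitted decision tree per estimator
n_trees = len(forest.estimators_)
print(n_trees)

# Average the per-tree probabilities by hand and compare with the forest's own output
per_tree_mean = np.mean([tree.predict_proba(X[:5]) for tree in forest.estimators_], axis=0)
matches_forest = np.allclose(per_tree_mean, forest.predict_proba(X[:5]))
print(matches_forest)
```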

### 3. **`random_state=1` in `RandomForestClassifier`**
   - `random_state` ensures the results are reproducible. When you set `random_state=1`, the random number generator's sequence is fixed, which makes the model produce the same results each time you run the code.
   - This helps in comparing results consistently during experimentation.
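
Reproducibility can be checked directly. The sketch below fits the same forest twice with the same seed on synthetic data (the data and the helper `fit_and_predict` are hypothetical, for illustration only) and compares the predicted probabilities:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, slightly noisy data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

def fit_and_predict(seed):
    """Fit a small forest with the given seed and return its probabilities on X."""
    model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=seed)
    return model.fit(X, y).predict_proba(X)

# Same seed, same data: the two runs should be identical
same_seed_identical = np.array_equal(fit_and_predict(1), fit_and_predict(1))
print(same_seed_identical)
```

With a different seed the trees are grown from different bootstrap samples and feature subsets, so the probabilities will typically differ slightly.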

### 4. **`output.to_csv('submission.csv', index=False)`**
   - `to_csv()` is a Pandas method that saves a DataFrame as a CSV file. Here, it saves the `output` DataFrame to a file named `submission.csv`.
   - The parameter `index=False` ensures that the row indices are not included in the CSV file, keeping it clean for submission.
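
The effect of `index=False` is easiest to see on the header line. This sketch writes a two-row frame (invented values) to in-memory buffers instead of to disk:

```python
import io
import pandas as pd

# Two invented rows in the same shape as the submission frame
toy_output = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

with_index = io.StringIO()
toy_output.to_csv(with_index)                  # default: row index written as a first, unnamed column

without_index = io.StringIO()
toy_output.to_csv(without_index, index=False)  # only the two named columns

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(repr(header_with_index))     # ',PassengerId,Survived'
print(repr(header_without_index))  # 'PassengerId,Survived'
```

The extra unnamed column in the first version is what `index=False` avoids in the submission file.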


---

## More about one-hot encoding

In the Titanic dataset, some features, like "Sex" and "Embarked," are categorical and stored as strings. ("Pclass" is also categorical in meaning, but it is stored as the integers 1, 2, and 3, so `pd.get_dummies()` leaves it unchanged by default.) scikit-learn models such as `RandomForestClassifier` expect numerical input, so string columns must be converted to a numerical format with `pd.get_dummies()` or another encoding method before the model can be fitted.

### Why Binarize (or Encode) Categorical Data?
- **Categorical Data Issue**: For example, the "Sex" column has values like "male" and "female." The model cannot interpret these strings directly.
- **Converting to Numeric**: `pd.get_dummies()` turns these categories into numerical format by creating binary (dummy) variables. For "Sex," it would create two columns: "Sex_male" and "Sex_female," with 0s and 1s indicating whether the passenger is male or female.

### What Happens If You Don't Use `pd.get_dummies()`?
- If you try to use the raw categorical data without converting it to a numerical form, the `RandomForestClassifier` will raise an error, as it cannot process string values directly.
- You need to use some encoding method to convert categorical data into a numerical format.
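
The failure is easy to reproduce on a four-row toy frame (the rows are invented). The sketch catches the exception, prints it, and then fits on the encoded frame instead; the exact error message depends on the scikit-learn version:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Invented rows: one numeric column and one raw string column
toy_X = pd.DataFrame({"Pclass": [3, 1, 3, 2], "Sex": ["male", "female", "female", "male"]})
toy_y = [0, 1, 1, 0]

model = RandomForestClassifier(n_estimators=10, random_state=1)

try:
    model.fit(toy_X, toy_y)            # "Sex" still holds strings
    raw_fit_failed = False
except (ValueError, TypeError) as err:
    raw_fit_failed = True
    print(f"{type(err).__name__}: {err}")

# After one-hot encoding every column is numeric, so fitting works
toy_encoded = pd.get_dummies(toy_X)
model.fit(toy_encoded, toy_y)
print(model.predict(toy_encoded))
```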

### Alternative Encoding Methods
- **Label Encoding**: Assigns a unique integer to each category. For example, "male" becomes 0 and "female" becomes 1. This is harmless for a two-value column, but with three or more unordered categories (such as "Embarked") the integers suggest an order and spacing that do not exist.
- **One-Hot Encoding** (`pd.get_dummies()` is one way to do this): Creates separate binary columns for each category. This approach is widely used because it doesn't imply any order between categories.
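
The two encodings can be compared side by side on a small "Embarked"-style column. This is a sketch using pandas only; the four values are invented, and `cat.codes` is used here as one simple way to get label codes:

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# Label encoding: one integer per category, assigned in sorted order (C=0, Q=1, S=2)
label_encoded = embarked.astype("category").cat.codes
print(label_encoded.tolist())  # [2, 0, 1, 2]

# One-hot encoding: one 0/1 column per category, no implied order
one_hot = pd.get_dummies(embarked, prefix="Embarked", dtype=int)
print(one_hot)
```

In the label-encoded version "S" (2) looks twice as far from "C" (0) as "Q" (1) does, which is an artifact of alphabetical order; the one-hot columns carry no such ordering.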

### Is It Necessary to Binarize the Data in the Titanic Project?
Yes, some form of encoding is necessary. The "Sex" column (and "Embarked," once it is added as a feature) holds strings and must be converted into a numerical format before the model can be fitted. Using `pd.get_dummies()` is a common and effective way to achieve this.


---

## Increase accuracy

In [25]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer

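# Impute missing values for 'Age' and 'Fare' with the training-set median
# (fit on the training data only, then apply the same value to the test data)
imputer = SimpleImputer(strategy='median')
train_data['Age'] = imputer.fit_transform(train_data[['Age']]).ravel()
test_data['Age'] = imputer.transform(test_data[['Age']]).ravel()
train_data['Fare'] = imputer.fit_transform(train_data[['Fare']]).ravel()
test_data['Fare'] = imputer.transform(test_data[['Fare']]).ravel()

# Fill missing 'Embarked' with the most frequent training value
# (assign back instead of calling fillna(inplace=True) on a column, which newer pandas versions warn about)
embarked_mode = train_data['Embarked'].mode()[0]
train_data['Embarked'] = train_data['Embarked'].fillna(embarked_mode)
test_data['Embarked'] = test_data['Embarked'].fillna(embarked_mode)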

# Use more of the existing columns as features
features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Embarked"]

# One-hot encoding for categorical features
X_train = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Align columns in the test set with the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Define the target variable
y_train = train_data["Survived"]

# Train the RandomForest model with more trees and a larger maximum depth
model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=1)
model.fit(X_train, y_train)

# Make predictions and save them
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

# Evaluate on the training data (an optimistic figure: deeper trees fit the training set more closely)
train_predictions = model.predict(X_train)
accuracy = accuracy_score(y_train, train_predictions)
print(f"Accuracy: {accuracy:.2f}")


Your submission was successfully saved!
Accuracy: 0.91
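
The 0.91 above is measured on the same rows the model was trained on, so part of the gain over 0.82 may simply reflect deeper trees memorizing the training set. A held-out split gives a more honest estimate. The sketch below shows the pattern on synthetic data (the data, split size, and variable names are assumptions for illustration); on the notebook's data the same split would be applied to `X_train` and `y_train`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (noisy on purpose)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

# Hold out 25% of the rows; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=1)
model.fit(X_fit, y_fit)

train_acc = accuracy_score(y_fit, model.predict(X_fit))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Training accuracy:   {train_acc:.2f}")
print(f"Validation accuracy: {val_acc:.2f}")
```

If validation accuracy is well below training accuracy, a smaller `max_depth` or cross-validation over a few settings is worth trying before submitting.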
