# Titanic Survival Analysis
The Titanic dataset is a well-known benchmark for binary classification, typically used to predict whether a passenger survived. It contains information on passengers aboard the Titanic when it sank in 1912.
### Dataset Columns
The dataset typically includes the following columns:
1. PassengerId – Unique identifier for each passenger
2. Survived – Survival status (0 = No, 1 = Yes)
3. Pclass – Ticket class (1 = First, 2 = Second, 3 = Third)
4. Name – Passenger’s full name
5. Sex – Gender (male/female)
6. Age – Age of the passenger
7. SibSp – Number of siblings/spouses aboard
8. Parch – Number of parents/children aboard
9. Ticket – Ticket number
10. Fare – Passenger fare
11. Cabin – Cabin number (often missing)
12. Embarked – Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
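
As a quick sanity check on the schema listed above, the sketch below reports which expected columns are absent from a DataFrame. The names `EXPECTED_COLUMNS` and `missing_columns` are illustrative and not part of the original notebook; the comparison is case-insensitive because the database table queried in this notebook returns lowercase column names.

```python
import pandas as pd

# Expected column names in lowercase (assumption: matches the database table).
EXPECTED_COLUMNS = [
    "passengerid", "survived", "pclass", "name", "sex", "age",
    "sibsp", "parch", "ticket", "fare", "cabin", "embarked",
]


def missing_columns(df):
    """Return the expected columns absent from df, compared case-insensitively."""
    present = {str(col).lower() for col in df.columns}
    return [col for col in EXPECTED_COLUMNS if col not in present]


# Example with a deliberately incomplete frame.
sample = pd.DataFrame({"PassengerId": [1], "Survived": [0], "Pclass": [3]})
print(missing_columns(sample))
```

An empty list means every expected column is present, so the check can run right after loading the data.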

## Data Preprocessing and Cleaning

In [30]:
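import os
import warnings

import pandas as pd  # imported explicitly; pd.read_sql is used below
import psycopg2  # PostgreSQL driver
import pyforest  # lazy-imports common data-science libraries
from dotenv import load_dotenv

# Suppress warnings (e.g. pandas warns when read_sql is given a raw DBAPI connection)
warnings.filterwarnings('ignore')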

In [27]:
# Load environment variables from .env file
load_dotenv()

# Fetch credentials
DB_HOST = os.getenv("DB_HOST", "172.178.131.221")
DB_USER = os.getenv("DB_USER", "luxds")
DB_PASSWORD = os.getenv("DB_PASSWORD", "1234")
DB_NAME = os.getenv("DB_NAME", "postgres")
DB_PORT = os.getenv("DB_PORT", "5432")

# Connect to the PostgreSQL database
try:
    conn = psycopg2.connect(
        host=DB_HOST,
        user=DB_USER,
        password=DB_PASSWORD,
        database=DB_NAME,
        port=DB_PORT
    )
    cursor = conn.cursor()

    # Execute SQL Query
    cursor.execute("SELECT * FROM ds.titanicdata LIMIT 10;")  # Adjust table name
    rows = cursor.fetchall()

    # Print fetched data
    for row in rows:
        print(row)

except Exception as e:
    print(f"Error: {e}")

(1, 0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0, 'A/5 21171', 7.25, '', 'S')
(2, 1, 1, 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'female', 38.0, 1, 0, 'PC 17599', 71.2833, 'C85', 'C')
(3, 1, 3, 'Heikkinen, Miss. Laina', 'female', 26.0, 0, 0, 'STON/O2. 3101282', 7.925, '', 'S')
(4, 1, 1, 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'female', 35.0, 1, 0, '113803', 53.1, 'C123', 'S')
(5, 0, 3, 'Allen, Mr. William Henry', 'male', 35.0, 0, 0, '373450', 8.05, '', 'S')
(6, 0, 3, 'Moran, Mr. James', 'male', None, 0, 0, '330877', 8.4583, '', 'Q')
(7, 0, 1, 'McCarthy, Mr. Timothy J', 'male', 54.0, 0, 0, '17463', 51.8625, 'E46', 'S')
(8, 0, 3, 'Palsson, Master. Gosta Leonard', 'male', 2.0, 3, 1, '349909', 21.075, '', 'S')
(9, 1, 3, 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'female', 27.0, 0, 2, '347742', 11.1333, '', 'S')
(10, 1, 2, 'Nasser, Mrs. Nicholas (Adele Achem)', 'female', 14.0, 1, 0, '237736', 30.0708, '', 'C')
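
Fallback credentials written into a notebook are easy to leak when it is shared. As a hedged alternative, the sketch below reads the same `DB_*` variables but raises an error when a required one is missing. The function name `get_db_config` is hypothetical, and the `env` parameter exists only so the helper can be exercised without touching the real environment.

```python
import os

# Maps environment variable names to psycopg2.connect keyword arguments.
REQUIRED_VARS = {
    "DB_HOST": "host",
    "DB_USER": "user",
    "DB_PASSWORD": "password",
    "DB_NAME": "dbname",
}


def get_db_config(env=None):
    """Build connection keyword arguments, failing if a required variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    config = {key: env[name] for name, key in REQUIRED_VARS.items()}
    # The port is optional; 5432 is the PostgreSQL default.
    config["port"] = int(env.get("DB_PORT", "5432"))
    return config


# Example with a stand-in environment (placeholder values, not real credentials).
example_env = {
    "DB_HOST": "localhost",
    "DB_USER": "user",
    "DB_PASSWORD": "secret",
    "DB_NAME": "postgres",
}
print(get_db_config(example_env))
```

The resulting dictionary could then be passed as `psycopg2.connect(**get_db_config())` once `load_dotenv()` has run.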


In [31]:
# Load the full table into a pandas DataFrame
query = "SELECT * FROM ds.titanicdata;"
df = pd.read_sql(query, conn)
df.head()



| | passengerid | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


### Data Exploration and Cleaning

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   passengerid  891 non-null    int64  
 1   survived     891 non-null    int64  
 2   pclass       891 non-null    int64  
 3   name         891 non-null    object 
 4   sex          891 non-null    object 
 5   age          714 non-null    float64
 6   sibsp        891 non-null    int64  
 7   parch        891 non-null    int64  
 8   ticket       891 non-null    object 
 9   fare         891 non-null    float64
 10  cabin        891 non-null    object 
 11  embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
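
`df.info()` reports 891 non-null values for `cabin`, yet the rows printed earlier show empty strings in that column, so missing cabins appear to be stored as `''` rather than as NULL. A minimal sketch of how such values could be counted, using a small stand-in frame built from the first six rows printed above (the `demo` frame and the `count_missing` helper are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Stand-in data copied from the first six rows of the query output.
demo = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
    "cabin": ["", "C85", "", "C123", "", ""],
    "embarked": ["S", "C", "S", "S", "S", "Q"],
})


def count_missing(df):
    """Count missing values per column, treating blank strings as missing."""
    # Replace empty or whitespace-only strings with NaN before counting.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    return cleaned.isna().sum()


print(count_missing(demo))
```

Applied to the full DataFrame, the same helper would show how many `cabin` and `embarked` entries are effectively missing even though `info()` counts them as non-null.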


In [33]:
# Descriptive statistics for the numeric columns
df.describe()

| | passengerid | survived | pclass | age | sibsp | parch | fare |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
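
The `age` count of 714 against 891 rows means 177 ages are missing. One common option, shown here only as a sketch and not as this notebook's chosen method, is to fill them with the median age. The same stand-in frame also illustrates how a survival rate per group can be read off a 0/1 column. The `demo` frame is an assumption built from the first six rows printed earlier, so its numbers are not representative of the full dataset.

```python
import pandas as pd

# Stand-in data copied from the first six rows of the query output.
demo = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0, None],
})

# Fill missing ages with the median of the observed ages.
median_age = demo["age"].median()
demo["age_filled"] = demo["age"].fillna(median_age)

# The mean of a 0/1 column is the share of survivors in each group.
survival_by_sex = demo.groupby("sex")["survived"].mean()

print(median_age)
print(survival_by_sex)
```

On the full DataFrame the same two steps would use `df["age"]` and `df.groupby("sex")["survived"]`. Median imputation is simple but ignores differences between passenger groups, so it is best treated as a baseline.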
