## Databases and Data Warehouses Assignment - 7

**Overview:**
This assignment focuses on performing an end-to-end ETL (Extract, Transform, Load) process using the Titanic dataset from Kaggle. The ETL process is integrated with a simple machine-learning pipeline to demonstrate practical applications in a data science context.

### Team Members (Group – 08)

##### 1. Rutika Rajesh Bankar - 25PGAI0103

##### 2. Rishabh Gaur - 25PGAI0023

##### 3. Mukesh Kumar Khemani - 25PGAI0115

##### 4. Guna Shekhar Dasyam - 25PGAI0063

##### 5. Nagendra Jupudy - 25PGAI0146



#### Part 1: Data Extraction
**Objective:**
Extract the Titanic dataset from Kaggle and load it into a pandas DataFrame.

**Implementation:**

##### - Used the Kaggle API to download the dataset.
##### - Loaded the dataset with pandas' `read_csv` function.
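
The download step itself is not shown in the cells below, which read `train.csv` from a local path. A minimal sketch of how it could look follows, assuming the `kaggle` package is installed, API credentials are configured, and the competition rules have been accepted on the Kaggle site. The helper name `fetch_competition_csv` and its parameters are hypothetical; `competition_download_files` is the call from the `kaggle` package that this sketch relies on.

```python
import zipfile
from pathlib import Path


def fetch_competition_csv(api, competition="titanic", dest="titanic", member="train.csv"):
    """Download a competition archive via `api` and return the path to `member`.

    `api` is assumed to behave like an authenticated
    kaggle.api.kaggle_api_extended.KaggleApi instance (hypothetical helper).
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Competition files arrive as a zip archive inside `path`.
    api.competition_download_files(competition, path=str(dest_dir))

    # Unpack every archive found instead of assuming its exact file name.
    for archive in dest_dir.glob("*.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest_dir)

    csv_path = dest_dir / member
    if not csv_path.exists():
        raise FileNotFoundError(f"{member} not found in {dest_dir}")
    return csv_path


# Usage (needs the `kaggle` package and valid credentials, so it is left commented out):
# from kaggle.api.kaggle_api_extended import KaggleApi
# api = KaggleApi()
# api.authenticate()
# data = pd.read_csv(fetch_competition_csv(api))
```

Passing the API object in as an argument keeps the download logic independent of where the credentials come from.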

In [None]:
import pandas as pd
data = pd.read_csv('C:\\Users\\india\\Desktop\\Programming_with_python\\Assignment_7_Database\\titanic\\train.csv')


In [None]:
print(data.head())

In [None]:
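# Assign the result back: fillna(inplace=True) on a selected column may not modify `data` under pandas copy-on-write
data['Age'] = data['Age'].fillna(data['Age'].median())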

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

In [None]:
data.drop(columns=['Cabin'], inplace=True)


#### Part 2: Data Transformation
**Objective:**
Handle missing values, perform feature engineering, and prepare data by scaling numerical features and encoding categorical features for machine learning.

**Steps and Tasks:**

**Handling Missing Values:**

##### - Filled missing values in 'Age' with the median.
##### - Replaced missing values in 'Embarked' with the most common embarkation point.
##### - Dropped the 'Cabin' column because most of its values are missing.

**Feature Engineering**:

##### - Extracted titles from passenger names.
##### - Created a new feature 'FamilySize' based on the number of siblings/spouses and parents/children aboard.

**Data Scaling and Encoding**:

##### - Scaled numerical features (Age, Fare, FamilySize) using StandardScaler.
##### - Encoded categorical features (Sex, Embarked, Title) using OneHotEncoder.

In [None]:
data['Title'] = data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())

In [None]:
data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
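
To make the two engineered features concrete, here is a small worked example in plain Python. It mirrors the split-based title extraction and the `SibSp + Parch + 1` rule from the cells above. The function names are hypothetical, the sample names follow the dataset's "Surname, Title. Given names" format, and the sibling/parent counts are illustrative.

```python
def extract_title(name):
    """Return the honorific from a 'Surname, Title. Given names' string."""
    try:
        after_comma = name.split(",", 1)[1]
    except IndexError:
        # No comma: the name does not follow the expected format.
        return "Unknown"
    # Everything up to the first period after the comma is treated as the title.
    return after_comma.split(".", 1)[0].strip()


def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children plus the passenger themselves."""
    return sibsp + parch + 1


samples = [
    ("Braund, Mr. Owen Harris", 1, 0),
    ("Heikkinen, Miss. Laina", 0, 0),
    ("Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)", 0, 0),
]
for name, sibsp, parch in samples:
    print(extract_title(name), family_size(sibsp, parch))
# Mr 2
# Miss 1
# the Countess 1
```

Unlike the one-line lambda, this version returns "Unknown" rather than raising an `IndexError` when a name has no comma.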

In [None]:
numerical_cols = ['Age', 'Fare', 'FamilySize']
categorical_cols = ['Sex', 'Embarked', 'Title']

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
data_transformed = preprocessor.fit_transform(data)

In [None]:
numerical_col_names = numerical_cols
categorical_col_names = list(preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols))
transformed_columns = numerical_col_names + categorical_col_names

In [None]:
print("Shape of data_transformed:", data_transformed.shape)


In [None]:
print("Transformed columns:", transformed_columns)
print("Length of transformed_columns:", len(transformed_columns))


In [None]:
if data_transformed.shape[1] != len(transformed_columns):
    raise ValueError("Mismatch between number of columns in data_transformed and number of feature names.")


In [None]:
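# Convert to a dense ndarray whether the transformer returned a sparse matrix or a dense array
data_transformed_df = pd.DataFrame(data_transformed.toarray() if hasattr(data_transformed, 'toarray') else data_transformed, columns=transformed_columns)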

In [None]:
data_transformed_df.shape

#### Part 3: Data Loading
**Objective**:
Store the cleaned and transformed data in an SQLite database and demonstrate data retrieval.

**Implementation**:

##### - Used SQLAlchemy to create an SQLite database and stored the transformed data in it.
##### - Implemented a function that reads the data back from the database to verify that it was stored successfully.

In [None]:
from sqlalchemy import create_engine

In [None]:
# !pip install sqlalchemy

In [None]:
engine = create_engine('sqlite:///titanic.db')

In [None]:
data_transformed_df.to_sql('titanic_transformed', engine, index=False, if_exists='replace')

In [None]:
# Function to load data from the database
def load_data_from_db(engine):
    query = "SELECT * FROM titanic_transformed"
    data_from_db = pd.read_sql(query, engine)
    return data_from_db

In [None]:
# Load the data
data_loaded = load_data_from_db(engine)
print(data_loaded.head())
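
Printing the first rows confirms that the table can be read, but not that every row and column arrived. As a complementary check, the sketch below uses the standard-library `sqlite3` module to report a table's row count and column names without loading the data. The helper `table_summary` is hypothetical, and the in-memory table is a stand-in with made-up values so the example runs on its own.

```python
import sqlite3


def table_summary(conn, table):
    """Return (row_count, column_names) for `table` without loading its rows."""
    # Table names cannot be bound as SQL parameters, so confirm the name
    # exists in SQLite's catalog before interpolating it into a statement.
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        raise ValueError(f"no such table: {table}")
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return row_count, columns


# Stand-in table with made-up values; for the notebook's file,
# connect with sqlite3.connect("titanic.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE titanic_transformed ("Age" REAL, "Fare" REAL, "Sex_male" REAL)')
conn.executemany(
    "INSERT INTO titanic_transformed VALUES (?, ?, ?)",
    [(-0.5, -0.4, 1.0), (0.6, 0.8, 0.0)],
)
row_count, columns = table_summary(conn, "titanic_transformed")
print(row_count, columns)
# 2 ['Age', 'Fare', 'Sex_male']
```

Against the real database, the returned values can be compared with `data_transformed_df.shape` and `transformed_columns`.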

#### Part 4: Integration with ML Pipeline
**Objective**:
Build and evaluate a logistic regression model using the transformed and loaded data.

**Implementation**:

##### - Split the data into training and testing sets.
##### - Trained a logistic regression model and evaluated its accuracy on the testing set.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
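# Define the features and target variable.
# 'Survived' was not stored in the database, so it comes from the original DataFrame;
# this relies on both having the same row order.
X = data_loaded
y = data['Survived']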

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create a logistic regression model (it is fitted in the next cell)
model = LogisticRegression(max_iter=1000)

In [None]:
model.fit(X_train, y_train)
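
In the cells above, the preprocessor is fitted on the full dataset before the train/test split, so the test rows contribute to the scaling statistics. A common alternative, sketched here on a small made-up DataFrame (not the Titanic data), chains the preprocessing and the model in a scikit-learn `Pipeline` so that both are fitted on the training rows only. The column subset, the toy values, and the name `clf` are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 58, 20, 39, 31],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07, 26.55, 8.05, 31.27, 18.00],
    "Sex": ["male", "female", "female", "female", "male", "male",
            "female", "female", "female", "male", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0],
})
X = toy.drop(columns="Survived")
y = toy["Survived"]

# Split first, so the test rows stay unseen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ]
)
clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling statistics, the categories, and the model from X_train only;
# predict() then applies the same fitted transformations to X_test.
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(predictions)
```

With this arrangement the raw, untransformed columns would be what gets stored and split, and the pipeline object carries the fitted transformations along with the model.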

In [None]:
# Make predictions
y_pred = model.predict(X_test)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
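
An accuracy figure is easier to interpret next to a trivial reference point. The sketch below computes accuracy by hand (correct predictions divided by total predictions) and compares it with a majority-class baseline that always predicts the most frequent training label. The function names and the label lists are made up for illustration and are not results from this notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return accuracy(y_test, [most_common_label] * len(y_test))


# Made-up labels for illustration only (1 = survived, 0 = did not survive).
y_train_toy = [0, 0, 0, 1, 1, 0, 1, 0]
y_test_toy = [0, 1, 0, 0, 1]
y_pred_toy = [0, 1, 0, 1, 1]

print(f"model:    {accuracy(y_test_toy, y_pred_toy):.2f}")
print(f"baseline: {majority_baseline(y_train_toy, y_test_toy):.2f}")
# model:    0.80
# baseline: 0.60
```

With the notebook's variables, the equivalent call would be `majority_baseline(list(y_train), list(y_test))`.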

**Documentation and Code Quality**
##### - Detailed comments and clear code structure are maintained throughout the scripts to ensure readability and maintainability.
##### - Included a README.md file with environment setup and script execution instructions.


**Conclusion:**
This assignment demonstrates an ETL job integrated with a machine-learning pipeline: extracting the Titanic dataset, cleaning and transforming it, storing it in SQLite, and training and evaluating a logistic regression model on the loaded data.